Analyzing a disaster-response message pipeline to classify the type of call
- machine learning pipeline to categorize disaster events
- send the messages to an appropriate disaster relief agency
- display results in a web app
- provide an API to improve the dataset (in progress)
In this project, I analyzed disaster data provided by Figure Eight to build a model for an API that classifies disaster messages.
The project dataset contains real messages that were sent during disaster events.
The aim is to create a machine learning pipeline that categorizes these events so that the messages can be sent to an appropriate disaster relief agency.
The project includes a web app where an emergency worker can input a new message and get classification results in several categories.
Below are a few screenshots of the web app.
| Metric | What it measures | When a low value is acceptable | When a low value is a problem | What should we do? |
|---|---|---|---|---|
| Precision | How many of our predicted positives (ypred=1) actually were positive (y=1). Practically, a low rate means we will send many wrong calls to agencies. | For irrelevant categories, a wrong call would simply be ignored. | For disaster categories, many false calls would be dispatched, causing unnecessary load; an agency could send resources by mistake, leaving insufficient resources for the calls that really need them. | For classes that aren't life- or resource-allocation-critical, a medium-to-high threshold should be chosen. |
| Recall | How many of the actual positives (y=1) we predicted as positive (ypred=1). Low recall means a high false-negative rate. | Some topics are not crucial and can go undetected, such as reports that aren't important for providing help. | In an emergency it's crucial not to miss distress calls that involve saving lives: a false negative means the agency never receives the call. | For classes that are life- or resource-allocation-critical, a low threshold should be chosen, aiming for the lowest possible false-negative rate. |
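As a small illustration (the labels below are invented, not taken from the project dataset), both metrics come down to counting true positives, false positives, and false negatives:

```python
# Toy precision/recall computation. precision = TP / (TP + FP),
# recall = TP / (TP + FN). Labels are invented for illustration.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A model that over-predicts the positive class: perfect recall, lower precision.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1]
p, r = precision_recall(y_true, y_pred)  # p = 3/5 = 0.6, r = 3/3 = 1.0
```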
There are three main components to this project.
- ETL Pipeline - process_data.py:
- Loads the messages and categories datasets
- Merges the two datasets
- Cleans the data
- Stores it in a SQLite database
- ML Pipeline - train_classifier.py, a machine learning pipeline that:
- Loads data from the SQLite database
- Splits the dataset into training and test sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs results on the test set
- Exports the final model as a pickle file
- Flask Web App - run.py:
- Displays data visualizations of the classes using Plotly
- Lets the user input a message and get its classification across categories
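As a rough sketch of what the ML pipeline in train_classifier.py might look like (the structure and the parameter grid here are assumptions for illustration; the actual script may differ), using scikit-learn:

```python
# Sketch of a text-processing + classification pipeline tuned with GridSearchCV.
# Assumed structure; parameter grid kept tiny for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def build_model():
    pipeline = Pipeline([
        ("vect", CountVectorizer()),      # tokenize messages into count vectors
        ("tfidf", TfidfTransformer()),    # re-weight counts by tf-idf
        ("clf", MultiOutputClassifier(    # one classifier per disaster category
            RandomForestClassifier(random_state=42))),
    ])
    # Small grid for illustration; the real search space is likely larger.
    params = {"clf__estimator__n_estimators": [10, 50]}
    return GridSearchCV(pipeline, param_grid=params, cv=2)
```

After fitting, the best estimator can be pickled with the standard `pickle` module, matching the pipeline's final export step.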
Notebook 1: ETL
Notebook 2: ML Pipeline
Complete_Project: Full project
Looking at the image above, it's clear we are dealing with a highly imbalanced dataset: only 3 classes have a minority-class ratio above 20%, and many classes are labeled almost entirely as False. This is not an easy situation, and among the techniques for dealing with it I would emphasize the following (there are many more):
- Data improvement:
- Undersampling the Majority Class
- Oversampling the Minority Class
- Combine Data Undersampling and Oversampling
- Cost-Sensitive Algorithms
- Feature engineering
- Threshold-Moving for Imbalanced Classification:
- Converting Probabilities to Class Labels
- Threshold-Moving for Imbalanced Classification
- Optimal Threshold for ROC Curve
- Optimal Threshold for Precision-Recall Curve
- Optimal Threshold Tuning
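A minimal sketch of threshold-moving (the scores and labels below are invented): instead of the default 0.5 cut-off, sweep candidate thresholds over held-out predicted probabilities and keep the one that maximizes F1:

```python
# Threshold-moving: pick the decision threshold that maximizes F1 on
# held-out scores, instead of the default 0.5 cut-off.
def best_f1_threshold(y_true, scores):
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):          # each distinct score is a candidate
        y_pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(y_pred, y_true) if p == 0 and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

y_true = [0, 0, 0, 0, 1, 1]                  # invented labels
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]      # invented model probabilities
t_best, f1_best = best_f1_threshold(y_true, scores)  # t_best = 0.4, f1_best = 0.8
```

The same idea underlies the ROC-curve and precision-recall-curve variants listed above; only the objective being maximized at each candidate threshold changes.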
At this point, none of these techniques has been used in this project, but I plan on improving the results in the near future.
Understanding that we are dealing with a highly imbalanced dataset, some points need to be taken into consideration when analyzing the current results:
- We can't use accuracy as a metric: because of the imbalance it would produce excellent numbers even for a useless model.
- Depending on each topic's class imbalance (positive or negative), we have to choose the right metric:
- Precision: appropriate when minimizing false positives is the focus - suitable for classes that demand high resource allocation but are not life-saving.
- Recall: appropriate when minimizing false negatives is the focus - suitable for classes that are life-saving.
- F-Measure: provides a way to combine both precision and recall into a single measure that captures both properties.
- No single threshold can be activated for all classes, and we will have to change threshold according to:
- Class imbalance (positive or negative).
- Metric selection (logical class determines the metric).
- Use of ROC curves as indicators.
- Below is a comparison of threshold tuning and its effect on the results. The impact is dramatic.
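The accuracy point above can be demonstrated with toy numbers (not the project's actual class ratios): a degenerate model that always predicts False scores high on accuracy while missing every real call.

```python
# Why accuracy misleads on imbalanced data: with a 5% positive class,
# an "always negative" model looks excellent. Numbers are invented.
y_true = [1] * 5 + [0] * 95          # 5% minority (positive) class
y_pred = [0] * 100                   # degenerate model: always predicts False

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / sum(y_true)
# accuracy = 0.95 even though recall_pos = 0.0: every real call is missed
```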
1. Install the dependencies: if you are running this in your local environment, run `conda install --file requirements.txt` or `pip install -r requirements.txt` to install the required Python module dependencies.
2. Instructions: run the following commands in the project's root directory to set up the database and model.
- To run the ETL pipeline that cleans the data and stores it in the database:
`python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db`
- To run the ML pipeline that trains the classifier and saves it as a pickle file:
`python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl`
3. Running the Web App from the Project Workspace IDE:
- Step 1: In the command line, type: `python run.py`
- Step 2: Open another terminal window and type: `env | grep WORK`
- Step 3: In a new browser window, go to `https://SPACEID-3001.SPACEDOMAIN`, where SPACEID and SPACEDOMAIN are the values shown in step 2.
- File Descriptions:
- app -->
  - template --> master.html – main page of the web app; go.html – classification result page of the web app
  - run.py – Flask file that runs the app
- data -->
  - disaster_categories.csv – file containing the categories
  - disaster_messages.csv – file containing the disaster messages to be categorized
  - process_data.py – file containing the data loading and cleaning code
  - DisasterResponse.db – database holding the cleaned and categorized data
- models -->
  - train_classifier.py – file containing the ML pipeline
  - classifier.pkl – pickle file in which the trained model is saved
LICENSE: This project is licensed under the terms of the Esri product license. There is no approval to copy or use this code without permission.
