Analyzing a disaster-response message pipeline to classify the type of call
- machine learning pipeline to categorize disaster events
- send the messages to an appropriate disaster relief agency
- display results in a web app
- provide an API to improve the dataset (in progress)
In this project, I analyzed disaster data provided by Figure Eight to build a model for an API that classifies disaster messages.
The project dataset contains real messages that were sent during disaster events.
The aim is to create a machine learning pipeline that categorizes these events so that the messages can be sent to an appropriate disaster relief agency.
The project includes a web app where an emergency worker can input a new message and get classification results in several categories.
Below are a few screenshots of the web app.
| Metric | What it measures | When a low value is acceptable | When a low value is a problem | What should we do? |
|---|---|---|---|---|
| Precision | How many of our predicted positives (ypred=1) actually were positive (y=1). Practically, a low rate means we will send many wrong calls to agencies. | For irrelevant categories, a wrong call would simply be ignored. | For disaster categories, many false calls would be dispatched, causing unnecessary load; an agency could send resources by mistake, leaving insufficient resources for the calls that really need them. | For classes that aren't life- or resource-allocation-critical, a medium-to-high threshold should be chosen. |
| Recall | How many of the actual positives (y=1) we predicted as positive (ypred=1). Low recall means a high false-negative rate. | Some topics are not crucial and can go undetected, such as reports that aren't important for providing help. | In an emergency it's crucial not to miss distress calls that involve saving lives: a false negative means the agency never receives the call. | For classes that are life- or resource-allocation-critical, a low threshold should be chosen, aiming for the lowest possible false-negative rate. |
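As a small illustration (the labels below are invented, not taken from the project dataset), both metrics come down to counting true positives, false positives, and false negatives:

```python
# Toy precision/recall computation. precision = TP / (TP + FP),
# recall = TP / (TP + FN). Labels are invented for illustration.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A model that over-predicts the positive class: perfect recall, lower precision.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1]
p, r = precision_recall(y_true, y_pred)  # p = 3/5 = 0.6, r = 3/3 = 1.0
```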
There are three main components to this project.
- ETL Pipeline - process_data.py:
- Loads the messages and categories datasets
- Merges the two datasets
- Cleans the data
- Stores it in a SQLite database
- ML Pipeline - train_classifier.py, a machine learning pipeline that:
- Loads data from the SQLite database
- Splits the dataset into training and test sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs results on the test set
- Exports the final model as a pickle file
- Flask Web App - run.py:
- Displays data visualizations of the classes using Plotly
- Lets the user input a message and get its classification across categories
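As a rough sketch of what the ML pipeline in train_classifier.py might look like (the structure and the parameter grid here are assumptions for illustration; the actual script may differ), using scikit-learn:

```python
# Sketch of a text-processing + classification pipeline tuned with GridSearchCV.
# Assumed structure; parameter grid kept tiny for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def build_model():
    pipeline = Pipeline([
        ("vect", CountVectorizer()),      # tokenize messages into count vectors
        ("tfidf", TfidfTransformer()),    # re-weight counts by tf-idf
        ("clf", MultiOutputClassifier(    # one classifier per disaster category
            RandomForestClassifier(random_state=42))),
    ])
    # Small grid for illustration; the real search space is likely larger.
    params = {"clf__estimator__n_estimators": [10, 50]}
    return GridSearchCV(pipeline, param_grid=params, cv=2)
```

After fitting, the best estimator can be pickled with the standard `pickle` module, matching the pipeline's final export step.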
Notebook 1: ETL
Notebook 2: ML Pipeline
Complete_Project: Full project
Looking at the image above, it's clear we are dealing with a highly imbalanced dataset: only 3 classes have a minority-class ratio above 20%, and many classes are labeled almost entirely as False. This is not an easy situation, and among the techniques for dealing with it I would emphasize the following (there are many more):
- Data improvement:
- Undersampling the Majority Class
- Oversampling the Minority Class
- Combine Data Undersampling and Oversampling
- Cost-Sensitive Algorithms
- Feature engineering
- Threshold-Moving for Imbalanced Classification:
- Converting Probabilities to Class Labels
- Threshold-Moving for Imbalanced Classification
- Optimal Threshold for ROC Curve
- Optimal Threshold for Precision-Recall Curve
- Optimal Threshold Tuning
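A minimal sketch of threshold-moving (the scores and labels below are invented): instead of the default 0.5 cut-off, sweep candidate thresholds over held-out predicted probabilities and keep the one that maximizes F1:

```python
# Threshold-moving: pick the decision threshold that maximizes F1 on
# held-out scores, instead of the default 0.5 cut-off.
def best_f1_threshold(y_true, scores):
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):          # each distinct score is a candidate
        y_pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(y_pred, y_true) if p == 0 and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

y_true = [0, 0, 0, 0, 1, 1]                  # invented labels
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]      # invented model probabilities
t_best, f1_best = best_f1_threshold(y_true, scores)  # t_best = 0.4, f1_best = 0.8
```

The same idea underlies the ROC-curve and precision-recall-curve variants listed above; only the objective being maximized at each candidate threshold changes.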
At this point, none of these techniques has been used in this project, but I plan on improving the results in the near future.
Understanding that we are dealing with a highly imbalanced dataset, some points need to be taken into consideration when analyzing the current results:
- We can't use accuracy as a metric: because of the imbalance it would produce excellent numbers even for a useless model.
- Depending on each topic's class imbalance (positive or negative), we have to choose the right metric:
- Precision: appropriate when minimizing false positives is the focus - suitable for classes that demand high resource allocation but are not life-saving.
- Recall: appropriate when minimizing false negatives is the focus - suitable for classes that are life-saving.
- F-Measure: provides a way to combine both precision and recall into a single measure that captures both properties.
- No single threshold can be activated for all classes, and we will have to change threshold according to:
- Class imbalance (positive or negative).
- Metric selection (logical class determines the metric).
- Use of ROC curves as indicators.
- Below is a comparison of threshold tuning and its effect on the results. The impact is dramatic.
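The accuracy point above can be demonstrated with toy numbers (not the project's actual class ratios): a degenerate model that always predicts False scores high on accuracy while missing every real call.

```python
# Why accuracy misleads on imbalanced data: with a 5% positive class,
# an "always negative" model looks excellent. Numbers are invented.
y_true = [1] * 5 + [0] * 95          # 5% minority (positive) class
y_pred = [0] * 100                   # degenerate model: always predicts False

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / sum(y_true)
# accuracy = 0.95 even though recall_pos = 0.0: every real call is missed
```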
1. Install the dependencies: if you are running this in your local environment, run `conda install --file requirements.txt` or `pip install -r requirements.txt` to install the required Python module dependencies.
2. Instructions: run the following commands in the project's root directory to set up the database and model.
- To run the ETL pipeline that cleans the data and stores it in the database:
`python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db`
- To run the ML pipeline that trains the classifier and saves it as a pickle file:
`python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl`
3. Running the Web App from the Project Workspace IDE:
- Step 1: In the command line, type: `python run.py`
- Step 2: Open another terminal window and type: `env | grep WORK`
- Step 3: In a new browser window, go to `https://SPACEID-3001.SPACEDOMAIN`, where SPACEID and SPACEDOMAIN are the values shown in step 2.
- File Descriptions:
- app -->
  - template --> master.html – main page of the web app; go.html – classification result page of the web app
  - run.py – Flask file that runs the app
- data -->
  - disaster_categories.csv – file containing the categories
  - disaster_messages.csv – file containing the disaster messages to be categorized
  - process_data.py – file containing the data loading and cleaning code
  - DisasterResponse.db – database holding the cleaned and categorized data
- models -->
  - train_classifier.py – file containing the ML pipeline
  - classifier.pkl – pickle file in which the trained model is saved
LICENSE: This project is licensed under the terms of the Esri product license. There is no approval to copy or use this code without permission.
