
Disaster_response

Analyzing a disaster-response message pipeline to classify the type of call

Contents

Objective

Overview

Metrics in the Disaster Response Domain

Components

Files

Results Discussion

Activation

Objective

  • Build a machine learning pipeline to categorize disaster events
  • Send the messages to an appropriate disaster relief agency
  • Display results in a web app
  • Provide an API to improve the dataset (in progress)

Overview

In this project, I analyzed disaster data provided by Figure Eight to build a model for an API that classifies disaster messages.

The project dataset contains real messages that were sent during disaster events.

The aim is to create a machine learning pipeline that categorizes these events so that the messages can be sent to an appropriate disaster relief agency.

The project includes a web app where an emergency worker can input a new message and get classification results in several categories.

Below are a few screenshots of the web app.

Metrics in the Disaster Response Domain

| Metric | What does it mean? | Why a low value can be OK | Why a low value is NOT OK | What should we do? |
|---|---|---|---|---|
| Precision | How many of our predicted positives (ypred=1) were actually positive (y=1). Practically, low precision means we send many wrong calls to agencies. | If a message is irrelevant, the call is simply ignored. | During a disaster, many false calls get dispatched, causing unnecessary load; an agency may send resources by mistake, leaving insufficient resources for the calls that really need them. | For classes that aren't life- or resource-allocation-critical, choose a medium to high threshold. |
| Recall | How many actual positives (y=1) we predicted as positive (ypred=1). Low recall means a high false-negative rate. | Some topics are not crucial and can go undetected, e.g. reports that aren't important for providing help. | Agencies will not receive the distress call because it was misclassified as negative; in an emergency it's crucial not to miss calls that involve saving lives. | For classes that ARE life- or resource-allocation-critical, choose a low threshold and aim for the lowest possible false-negative rate. |
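For concreteness, here is how the two metrics are computed with scikit-learn. The labels below are toy values for illustration, not the project's data:

```python
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predictions for a single binary class
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2.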

Components

There are three main components to this project.

  1. ETL Pipeline - process_data.py:
  • Loads the messages and categories datasets
  • Merges the two datasets
  • Cleans the data
  • Stores it in a SQLite database
  2. ML Pipeline - train_classifier.py, a machine learning pipeline that:
  • Loads data from the SQLite database
  • Splits the dataset into training and test sets
  • Builds a text processing and machine learning pipeline
  • Trains and tunes a model using GridSearchCV
  • Outputs results on the test set
  • Exports the final model as a pickle file
  3. Flask Web App - run.py:
  • Displays data visualizations built with Plotly
  • Lets the user input a message and get its classification
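The ETL steps (load, merge, clean, store in SQLite) can be sketched roughly as follows. The column names and the packed category format here are assumptions for illustration, not the project's actual schema:

```python
import sqlite3
import pandas as pd

# Toy stand-ins for disaster_messages.csv / disaster_categories.csv
messages = pd.DataFrame({"id": [1, 2],
                         "message": ["we need water", "all is fine"]})
categories = pd.DataFrame({"id": [1, 2],
                           "categories": ["water-1;food-0", "water-0;food-0"]})

# Merge the two datasets on the shared id column
df = messages.merge(categories, on="id")

# Clean: expand the packed category string into one binary column per category
split = df["categories"].str.split(";", expand=True)
for col in split.columns:
    name = split[col].str.rsplit("-", n=1).str[0].iloc[0]
    df[name] = split[col].str.rsplit("-", n=1).str[1].astype(int)
df = df.drop(columns="categories").drop_duplicates()

# Store the result in SQLite (in-memory here; the real script writes a file)
conn = sqlite3.connect(":memory:")
df.to_sql("messages", conn, index=False, if_exists="replace")
```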

Files

Notebook 1 : ETL

Notebook 2 : ML Pipeline

Complete_Project: Full project

Results Discussion

Imbalanced situation

Looking at the image above, it's clear we are dealing with a highly imbalanced dataset: only 3 classes have a minority-class ratio above 20%, and many classes are almost entirely labeled False. This is not an easy situation, and among the techniques to deal with it I would emphasize the following (there are many more):

  1. Data improvement:
  • Undersampling the majority class
  • Oversampling the minority class
  • Combining undersampling and oversampling
  • Cost-sensitive algorithms
  • Feature engineering
  2. Threshold-moving for imbalanced classification:
  • Converting probabilities to class labels
  • Optimal threshold for the ROC curve
  • Optimal threshold for the precision-recall curve
  • Optimal threshold tuning

At this point, none of these techniques has been applied in this project, but I plan to improve the results in the near future.
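As one example from the data-improvement options above, a minimal random-oversampling sketch with sklearn.utils.resample, on toy arrays rather than the project's data:

```python
import numpy as np
from sklearn.utils import resample

# Heavily imbalanced toy data: one positive among ten samples
X = np.arange(10).reshape(-1, 1)
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Sample the minority class with replacement until it matches the majority
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=42)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])
```

After resampling, both classes have the same number of samples. Note that oversampling should be applied only to the training split, never before the train/test split, or the test set leaks duplicated samples.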

Precision vs. Recall

Understanding that we are dealing with a highly imbalanced dataset, some points need to be taken into consideration when analyzing the current results:

  1. We can't use accuracy as a metric, because the imbalance makes it look excellent regardless of model quality.

  2. Depending on the class imbalance (positive or negative) of each topic, we have to choose the right metric:
  • Precision: appropriate when minimizing false positives is the focus - classes that demand high resource allocation but are not life-saving.
  • Recall: appropriate when minimizing false negatives is the focus - classes that are life-saving.
  • F-measure: combines precision and recall into a single measure that captures both properties.
  3. No single threshold works for all classes; the threshold has to change according to:
  • Class imbalance (positive or negative).
  • Metric selection (the meaning of the class determines the metric).
  • ROC curves as indicators.
  4. Below is a comparison of threshold tuning and its effect on the results. The impact is dramatic.
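The threshold-tuning idea above can be sketched as follows: sweep the candidate thresholds from the precision-recall curve on synthetic imbalanced data (a stand-in for one class column, not the project's dataset) and pick the one that maximizes F1 instead of the default 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score

# Synthetic imbalanced data (~95% negatives)
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Sweep every candidate threshold from the precision-recall curve
precision, recall, thresholds = precision_recall_curve(y, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last (p, r) point has no threshold

f1_default = f1_score(y, (proba >= 0.5).astype(int))
f1_tuned = f1_score(y, (proba >= best).astype(int))  # never worse than default
```

In practice the sweep should run on a validation split, and the chosen threshold would differ per class, as argued above.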

Activation

  1. Install the dependencies: if you are running this in your local environment, run conda install --file requirements.txt or pip install -r requirements.txt to install the required Python dependencies.

  2. Instructions: run the following commands in the project's root directory to set up the database and model.

  • To run ETL pipeline that cleans data and stores in database:
    python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
  • To run ML pipeline that trains classifier and saves it as pickle file:
    python train_classifier.py data/DisasterResponse.db models/classifier.pkl

  3. Running the Web App from the Project Workspace IDE:

  • Step 1: In the command line, type: python run.py
  • Step 2: Open another terminal window and type: env | grep WORK
  • Step 3: In a new browser window, open https://SPACEID-3001.SPACEDOMAIN, where SPACEID and SPACEDOMAIN are shown in step 2.
  4. File Descriptions:
  • app/
    • template/master.html - main page of the web app
    • template/go.html - classification result page of the web app
    • run.py - Flask file that runs the app
  • data/
    • disaster_categories.csv - file containing the categories
    • disaster_messages.csv - file containing the disaster messages to be categorized
    • process_data.py - file containing the data loading and cleaning code
    • DisasterResponse.db - database holding the cleaned and categorized data
  • models/
    • train_classifier.py - file containing the ML pipeline
    • classifier.pkl - pickle file in which the model is saved
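To illustrate what the web app does with classifier.pkl, here is a hedged sketch: a tiny stand-in pipeline is trained, pickled, reloaded, and used to classify a message. The texts, labels, and pipeline here are illustrative, not the project's exported model:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real training data
texts = ["we need water urgently", "thanks for the update",
         "send food and water", "the weather is nice today"]
labels = [1, 0, 1, 0]  # 1 = aid-related (illustrative labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

blob = pickle.dumps(model)   # stand-in for exporting classifier.pkl
loaded = pickle.loads(blob)  # what the web app would do at startup
prediction = loaded.predict(["please send water"])[0]
```

Pickling the whole pipeline (vectorizer plus classifier) means the web app only has to call predict on raw text.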

LICENSE: This project is licensed under the terms of the esri license product. Do not copy or use this code without permission.
