# ETL Pipeline

This notebook builds an ETL pipeline from disaster response message data provided by [Figure Eight](https://www.figure-eight.com/).

In [1]:
import pandas as pd
import numpy as np
import sqlite3

## Assess

### Messages

In [2]:
messages = pd.read_csv('disaster_messages.csv')
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [6]:
messages.genre.unique()

array(['direct', 'social', 'news'], dtype=object)

**Observations:**
- Each message appears in English (`message`) alongside its original text (`original`), which in the sample rows looks like French or Haitian Creole. Because only the English text will be cleaned, the `original` column can be dropped.
- If the `genre` column is used, it will need to be converted into dummy variables. This column could also be read as the message "source".
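
The two observations above could be acted on roughly as follows. This is a minimal sketch, not the notebook's actual cleaning step: `sample_messages` is a hypothetical three-row stand-in for `messages`, and the `genre_` prefix is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for the messages table (illustrative rows only).
sample_messages = pd.DataFrame({
    "id": [2, 7, 8],
    "message": ["Weather update", "Is the Hurricane over", "Looking for someone"],
    "original": ["Un front froid", "Cyclone nan fini", "Patnm, di Maryani"],
    "genre": ["direct", "social", "news"],
})

# Drop the untranslated text, then one-hot encode genre into 0/1 columns.
encoded = pd.get_dummies(
    sample_messages.drop(columns=["original"]),
    columns=["genre"],
    prefix="genre",
    dtype=int,
)

print(encoded.columns.tolist())
```

With three genres this produces three indicator columns. Passing `drop_first=True` to `pd.get_dummies` would keep only two, which some models prefer in order to avoid redundant features.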

### Categories

In [3]:
categories = pd.read_csv('disaster_categories.csv')
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


In [4]:
# Review contents of categories column
print(categories.categories.iloc[0])

related-1;request-0;offer-0;aid_related-0;medical_help-0;medical_products-0;search_and_rescue-0;security-0;military-0;child_alone-0;water-0;food-0;shelter-0;clothing-0;money-0;missing_people-0;refugees-0;death-0;other_aid-0;infrastructure_related-0;transport-0;buildings-0;electricity-0;tools-0;hospitals-0;shops-0;aid_centers-0;other_infrastructure-0;weather_related-0;floods-0;storm-0;fire-0;earthquake-0;cold-0;other_weather-0;direct_report-0


How many categories are there?

In [5]:
len(categories.categories.iloc[0].split(';'))

36

**Observations:**
- To prepare for machine learning, the `categories` column needs to be split into 36 columns, one per category.
- Once the categories have been joined to the messages, the `id` column can be dropped.
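
One possible way to carry out the split and join described above is sketched here. It assumes every row lists the same categories in the same order as the first row, and that each entry has the form `name-value` with an integer value. `sample_categories` and `sample_messages` are hypothetical, shortened stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical stand-ins with only three categories instead of 36.
sample_categories = pd.DataFrame({
    "id": [2, 7],
    "categories": ["related-1;request-0;offer-0", "related-1;request-1;offer-0"],
})
sample_messages = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the Hurricane over"],
})

# One column per "name-value" entry.
split = sample_categories["categories"].str.split(";", expand=True)

# Take the column names from the first row by cutting off the trailing "-value".
split.columns = split.iloc[0].str.rsplit("-", n=1).str[0].tolist()

# Keep only the part after the last hyphen and convert it to an integer.
split = split.apply(lambda col: col.str.rsplit("-", n=1).str[1].astype(int))

# Reattach id, join to the messages, then drop id once it is no longer needed.
expanded = pd.concat([sample_categories[["id"]], split], axis=1)
merged = sample_messages.merge(expanded, on="id", how="inner").drop(columns=["id"])

print(merged)
```

Splitting on the last hyphen with `rsplit` keeps the sketch independent of what characters appear in the category names themselves.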