# ETL Pipeline Preparation
Follow the instructions below to help you create your ETL pipeline.
### 1. Import libraries and load datasets.
- Import Python libraries
- Load `messages.csv` into a dataframe and inspect the first few lines.
- Load `categories.csv` into a dataframe and inspect the first few lines.

In [47]:
# import libraries
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


In [48]:
# load messages dataset
messages = pd.read_csv("messages.csv")
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [49]:
messages.shape

(26248, 4)

In [50]:
# load categories dataset
categories = pd.read_csv("categories.csv")
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


In [51]:
categories.shape

(26248, 2)

In [52]:
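# compare the two frames as a whole (they have different columns, so False is expected)
frames_match = messages.equals(categories)
print('Matches:', frames_match)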

Matches: False


### 2. Merge datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned in the following steps
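Before merging, it can help to check whether `id` is unique in each table, because an inner merge pairs every matching row and so multiplies rows for each repeated key. The sketch below uses small made-up frames (`messages_demo` and `categories_demo` are illustrative names, not the project data) to show the effect. `validate="one_to_one"` is an existing `pd.merge` option that raises `MergeError` when keys repeat.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are invented for illustration).
messages_demo = pd.DataFrame({"id": [1, 2, 2], "message": ["a", "b", "b"]})
categories_demo = pd.DataFrame({"id": [1, 2, 2], "categories": ["x-1", "y-0", "y-0"]})

# Count repeated ids in each table before merging.
dup_messages = int(messages_demo["id"].duplicated().sum())
dup_categories = int(categories_demo["id"].duplicated().sum())

# An inner merge pairs every matching row, so id 2 yields 2 x 2 = 4 rows.
merged_demo = pd.merge(messages_demo, categories_demo, how="inner", on="id")

# validate="one_to_one" makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(messages_demo, categories_demo, on="id", validate="one_to_one")
    merge_is_one_to_one = True
except pd.errors.MergeError:
    merge_is_one_to_one = False
```

Repeated ids of this kind would be one plausible reason for a merged frame to end up with more rows than either input.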

In [53]:
# merge datasets
df = pd.merge(categories, messages, how='inner', on='id', sort=True)
df.head()

Unnamed: 0,id,categories,message,original,genre
0,2,related-1;request-0;offer-0;aid_related-0;medi...,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,related-1;request-0;offer-0;aid_related-1;medi...,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,related-1;request-0;offer-0;aid_related-0;medi...,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,related-1;request-1;offer-0;aid_related-1;medi...,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,related-1;request-0;offer-0;aid_related-0;medi...,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [54]:
df.categories[8]

'related-0;request-0;offer-0;aid_related-0;medical_help-0;medical_products-0;search_and_rescue-0;security-0;military-0;child_alone-0;water-0;food-0;shelter-0;clothing-0;money-0;missing_people-0;refugees-0;death-0;other_aid-0;infrastructure_related-0;transport-0;buildings-0;electricity-0;tools-0;hospitals-0;shops-0;aid_centers-0;other_infrastructure-0;weather_related-0;floods-0;storm-0;fire-0;earthquake-0;cold-0;other_weather-0;direct_report-0'

In [55]:
df.shape

(26386, 5)

In [56]:
df.columns[df.isnull().sum()/len(df) > .45]

Index(['original'], dtype='object')

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26386 entries, 0 to 26385
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          26386 non-null  int64 
 1   categories  26386 non-null  object
 2   message     26386 non-null  object
 3   original    10246 non-null  object
 4   genre       26386 non-null  object
dtypes: int64(1), object(4)
memory usage: 2.5+ MB


In [58]:
df.describe()

Unnamed: 0,id
count,26386.0
mean,15217.885886
std,8823.741128
min,2.0
25%,7438.25
50%,15650.5
75%,22916.75
max,30265.0


### 3. Split `categories` into separate category columns.
- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of the `categories` dataframe to create column names for the category data.
- Rename the columns of `categories` with the new column names.
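The steps above can be sketched on a tiny made-up Series in the same `name-value;name-value` layout. The names `raw`, `split_demo` and `colnames_demo` are illustrative only. Here `str.rsplit("-", n=1)` is used to strip the trailing value, which is one alternative to slicing off the last two characters.

```python
import pandas as pd

# Toy column in the same "name-value;name-value" layout (values invented).
raw = pd.Series(["related-1;request-0;offer-0", "related-0;request-1;offer-0"])

# expand=True returns a DataFrame with one column per split piece.
split_demo = raw.str.split(";", expand=True)

# Take the first row and strip the trailing "-<digit>" to get the names.
first_row = split_demo.iloc[0]
colnames_demo = first_row.str.rsplit("-", n=1).str[0].tolist()

# Rename the columns with the extracted names.
split_demo.columns = colnames_demo
```

Converting the names to a plain list with `tolist()` keeps the resulting column index free of a leftover name from the source row.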

In [59]:
# create a dataframe of the 36 individual category columns
categories = df['categories'].str.split(";", expand=True)
categories.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-0,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-1,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-0,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-1,infrastructure_related-1,transport-0,buildings-1,electricity-0,tools-0,hospitals-1,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-0,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [60]:
categories.iloc[1,:]

0                    related-1
1                    request-0
2                      offer-0
3                aid_related-1
4               medical_help-0
5           medical_products-0
6          search_and_rescue-0
7                   security-0
8                   military-0
9                child_alone-0
10                     water-0
11                      food-0
12                   shelter-0
13                  clothing-0
14                     money-0
15            missing_people-0
16                  refugees-0
17                     death-0
18                 other_aid-1
19    infrastructure_related-0
20                 transport-0
21                 buildings-0
22               electricity-0
23                     tools-0
24                 hospitals-0
25                     shops-0
26               aid_centers-0
27      other_infrastructure-0
28           weather_related-1
29                    floods-0
30                     storm-1
31                      fire-0
32      

In [61]:
# select the first row of the categories dataframe
row = categories.iloc[0,:]

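# use this row to extract a list of new column names for categories.
# one way is to apply a lambda function that drops the last two
# characters (the "-0" or "-1" suffix) of each string with slicing.
# note: wrapping the result in pd.Series makes `category_colnames`
# a one-column DataFrame, so the names are read from column 0 below
category_colnames = row.apply(lambda x: pd.Series(x[:-2]))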
print(category_colnames)

                         0
0                  related
1                  request
2                    offer
3              aid_related
4             medical_help
5         medical_products
6        search_and_rescue
7                 security
8                 military
9              child_alone
10                   water
11                    food
12                 shelter
13                clothing
14                   money
15          missing_people
16                refugees
17                   death
18               other_aid
19  infrastructure_related
20               transport
21               buildings
22             electricity
23                   tools
24               hospitals
25                   shops
26             aid_centers
27    other_infrastructure
28         weather_related
29                  floods
30                   storm
31                    fire
32              earthquake
33                    cold
34           other_weather
35           direct_report


In [62]:
category_colnames.loc[:,0]

0                    related
1                    request
2                      offer
3                aid_related
4               medical_help
5           medical_products
6          search_and_rescue
7                   security
8                   military
9                child_alone
10                     water
11                      food
12                   shelter
13                  clothing
14                     money
15            missing_people
16                  refugees
17                     death
18                 other_aid
19    infrastructure_related
20                 transport
21                 buildings
22               electricity
23                     tools
24                 hospitals
25                     shops
26               aid_centers
27      other_infrastructure
28           weather_related
29                    floods
30                     storm
31                      fire
32                earthquake
33                      cold
34            

In [63]:
category_colnames.head()

Unnamed: 0,0
0,related
1,request
2,offer
3,aid_related
4,medical_help


In [64]:
type(row)

pandas.core.series.Series

In [65]:
type(category_colnames)

pandas.core.frame.DataFrame

In [66]:
# rename the columns of `categories`
categories.columns = category_colnames.loc[:,0]
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-0,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-1,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-0,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-1,infrastructure_related-1,transport-0,buildings-1,electricity-0,tools-0,hospitals-1,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,water-0,food-0,shelter-0,clothing-0,money-0,missing_people-0,refugees-0,death-0,other_aid-0,infrastructure_related-0,transport-0,buildings-0,electricity-0,tools-0,hospitals-0,shops-0,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [67]:
type(categories.columns)

pandas.core.indexes.base.Index

In [68]:
type(categories)

pandas.core.frame.DataFrame

In [69]:
categories.columns

Index(['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report'], dtype='object', name=0)

In [70]:
categories.iloc[1,:]

0
related                                  related-1
request                                  request-0
offer                                      offer-0
aid_related                          aid_related-1
medical_help                        medical_help-0
medical_products                medical_products-0
search_and_rescue              search_and_rescue-0
security                                security-0
military                                military-0
child_alone                          child_alone-0
water                                      water-0
food                                        food-0
shelter                                  shelter-0
clothing                                clothing-0
money                                      money-0
missing_people                    missing_people-0
refugees                                refugees-0
death                                      death-0
other_aid                              other_aid-1
infrastructure_related    inf

### 4. Convert category values to just numbers 0 or 1.
- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.
- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`.
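A minimal sketch of the `.str` approach described above, on a made-up frame (`cat_demo` is an illustrative name). The `related-2` value is a hypothetical input included only to show that taking the last character keeps whatever digit is there, so a follow-up check for values above 1 can be useful.

```python
import pandas as pd

# Toy category columns still holding "name-value" strings (values invented).
cat_demo = pd.DataFrame({
    "related": ["related-1", "related-0", "related-2"],
    "request": ["request-0", "request-1", "request-0"],
})

for column in cat_demo.columns:
    # keep only the last character, then convert it to an integer
    cat_demo[column] = cat_demo[column].astype(str).str[-1].astype(int)

# Count values above 1 per column so they can be handled explicitly.
non_binary = (cat_demo > 1).sum()
```

How a value such as 2 should be treated (mapped to 1, mapped to 0, or dropped) is a modelling decision that this sketch leaves open.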

In [71]:
categories.values[0][0][-1]

'1'

In [72]:
categories.iloc[:,0]

0        related-1
1        related-1
2        related-1
3        related-1
4        related-1
           ...    
26381    related-0
26382    related-0
26383    related-1
26384    related-1
26385    related-1
Name: related, Length: 26386, dtype: object

In [73]:
categories.shape

(26386, 36)

In [74]:
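cat = categories.copy()
for column in cat.columns:
    # 1 if the value contains a '1', otherwise 0 (the int 1, not the string '1', so each column has one type)
    cat[column] = cat[column].apply(lambda x: 1 if '1' in x else 0)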
cat.head()


Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [75]:
cat.iloc[1,:]

0
related                   1
request                   0
offer                     0
aid_related               1
medical_help              0
medical_products          0
search_and_rescue         0
security                  0
military                  0
child_alone               0
water                     0
food                      0
shelter                   0
clothing                  0
money                     0
missing_people            0
refugees                  0
death                     0
other_aid                 1
infrastructure_related    0
transport                 0
buildings                 0
electricity               0
tools                     0
hospitals                 0
shops                     0
aid_centers               0
other_infrastructure      0
weather_related           1
floods                    0
storm                     1
fire                      0
earthquake                0
cold                      0
other_weather             0
direct_report     

In [76]:
cat.tail()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
26381,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26382,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26383,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26384,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26385,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [77]:
cat.related.value_counts()

1    20042
0     6344
Name: related, dtype: int64

In [78]:
cat.hospitals.value_counts()

0    26103
1      283
Name: hospitals, dtype: int64

In [79]:
# make sure every category column is stored as an integer type
cat = cat.astype(int)
cat.dtypes

0
related                   int32
request                   int32
offer                     int32
aid_related               int32
medical_help              int32
medical_products          int32
search_and_rescue         int32
security                  int32
military                  int32
child_alone               int32
water                     int32
food                      int32
shelter                   int32
clothing                  int32
money                     int32
missing_people            int32
refugees                  int32
death                     int32
other_aid                 int32
infrastructure_related    int32
transport                 int32
buildings                 int32
electricity               int32
tools                     int32
hospitals                 int32
shops                     int32
aid_centers               int32
other_infrastructure      int32
weather_related           int32
floods                    int32
storm                     int32
fire  

In [80]:
cat.sample(3)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
11626,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
15463,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
24570,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### 5. Replace `categories` column in `df` with new category columns.
- Drop the categories column from the df dataframe since it is no longer needed.
- Concatenate df and categories data frames.
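The drop-then-concatenate step can be sketched as follows, using made-up frames (`df_demo` and `cat_cols_demo` are illustrative names). The point worth noting is that `pd.concat(..., axis=1)` aligns rows on the index, so both frames need the same index labels.

```python
import pandas as pd

# Toy frames sharing the same row index (values invented).
df_demo = pd.DataFrame(
    {"id": [2, 7], "categories": ["related-1", "related-0"]}, index=[0, 1]
)
cat_cols_demo = pd.DataFrame({"related": [1, 0]}, index=[0, 1])

# Drop the raw string column, then attach the numeric columns side by side.
# axis=1 aligns rows on the index, so both frames must share index labels.
combined_demo = pd.concat(
    [df_demo.drop(columns="categories"), cat_cols_demo], axis=1
)
```

If the two indexes differed, the result would contain extra rows padded with missing values, which is why the category frame should be derived from `df` without reordering or filtering in between.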

In [81]:
df.head()

Unnamed: 0,id,categories,message,original,genre
0,2,related-1;request-0;offer-0;aid_related-0;medi...,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,related-1;request-0;offer-0;aid_related-1;medi...,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,related-1;request-0;offer-0;aid_related-0;medi...,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,related-1;request-1;offer-0;aid_related-1;medi...,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,related-1;request-0;offer-0;aid_related-0;medi...,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [82]:
# drop the original categories column from `df`, then
# concatenate the result with the new category columns in `cat`
df = pd.concat([df.drop('categories', axis=1), cat], axis=1)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed.
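The check, drop, and confirm sequence can be sketched on a made-up frame (`dup_demo` is an illustrative name). It also shows the difference between fully identical rows and rows that merely share an `id`: `drop_duplicates()` without a `subset` removes only the former.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical; row 2 shares the id but differs in a label.
dup_demo = pd.DataFrame({
    "id": [202, 202, 202, 804],
    "message": ["m1", "m1", "m1", "m2"],
    "other_aid": [1, 1, 0, 0],
})

n_full_dupes = int(dup_demo.duplicated().sum())            # identical rows
n_id_dupes = int(dup_demo.duplicated(subset="id").sum())   # repeated ids

# Drop fully identical rows, then re-check both kinds of duplicate.
deduped = dup_demo.drop_duplicates()
remaining_full = int(deduped.duplicated().sum())
remaining_id = int(deduped.duplicated(subset="id").sum())
```

In this toy case one repeated `id` survives the drop because its labels differ. Whether such rows should be kept is a separate decision from removing exact duplicates.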

In [83]:
df.duplicated(['id']).sum() == 0

False

In [84]:
# check number of duplicates
print(df.duplicated().sum())

170


In [85]:
ids = df["id"]
df[ids.isin(ids[ids.duplicated()])].sort_values(by="id")

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
162,202,?? port au prince ?? and food. they need gover...,p bay pap la syen ak manje. Yo bezwen ed gouve...,direct,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
163,202,?? port au prince ?? and food. they need gover...,p bay pap la syen ak manje. Yo bezwen ed gouve...,direct,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
164,202,?? port au prince ?? and food. they need gover...,p bay pap la syen ak manje. Yo bezwen ed gouve...,direct,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
165,202,?? port au prince ?? and food. they need gover...,p bay pap la syen ak manje. Yo bezwen ed gouve...,direct,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
657,804,elle est vraiment malade et a besoin d'aide. u...,she is really sick she need your help. please ...,direct,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
658,804,elle est vraiment malade et a besoin d'aide. u...,she is really sick she need your help. please ...,direct,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
659,804,elle est vraiment malade et a besoin d'aide. u...,she is really sick she need your help. please ...,direct,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
660,804,elle est vraiment malade et a besoin d'aide. u...,she is really sick she need your help. please ...,direct,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
716,862,What is the address of the radio station? I as...,Ki adres radyo a? Paske m bezwen al depoze dos...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
715,862,What is the address of the radio station? I as...,Ki adres radyo a? Paske m bezwen al depoze dos...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [86]:
df.query('related==2')

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report


In [87]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0
mean,15217.885886,0.759569,0.171038,0.004586,0.415144,0.07955,0.049989,0.027477,0.01785,0.032707,0.0,0.063822,0.112029,0.088759,0.015539,0.022967,0.011408,0.033351,0.04563,0.131282,0.064769,0.045933,0.050974,0.02039,0.006026,0.010725,0.004548,0.011711,0.043773,0.278292,0.082506,0.093383,0.010687,0.093269,0.0202,0.052263,0.193777
std,8823.741128,0.427353,0.376549,0.067564,0.492756,0.2706,0.217926,0.163471,0.13241,0.177871,0.0,0.24444,0.315408,0.284401,0.123684,0.1498,0.106197,0.179555,0.208686,0.337715,0.246123,0.209345,0.219949,0.141332,0.077394,0.103009,0.067286,0.107583,0.204594,0.448166,0.275139,0.290974,0.102828,0.290815,0.140687,0.22256,0.395264
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7438.25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15650.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22916.75,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [88]:
df.sample(5)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
14205,16762,290 islands of fires appeared during the day.,,news,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0
20107,23268,"Of a population of 17,000, some 9,500 people r...",,news,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
15134,17786,1 crore through Tamil Nadu Cements Corporation...,,news,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
9657,10803,"When they'll give the new passport? please,it'...",Kil yap bay nouvo pasp s.v.p.enpotan,direct,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17599,20503,"Moreover, Rakhine State relief and rehabilita-...",,news,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [89]:
# drop fully identical rows
df = df.drop_duplicates()

In [90]:
# check number of duplicates
print(df.duplicated().sum())

0


In [91]:
df.related.value_counts()

1    19906
0     6310
Name: related, dtype: int64

In [92]:
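# map any 2s in `related` to 1; limit the replacement to that column,
# because df.replace(2, 1) would also rewrite an `id` equal to 2
df = df.assign(related=df['related'].replace(2, 1)).sort_values(by="id")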

In [93]:
df.shape

(26216, 40)

In [94]:
print(df.related.unique())

[1 0]


In [95]:
df.related.value_counts()

1    19906
0     6310
Name: related, dtype: int64

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26216 entries, 0 to 26385
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      26216 non-null  int64 
 1   message                 26216 non-null  object
 2   original                10170 non-null  object
 3   genre                   26216 non-null  object
 4   related                 26216 non-null  int32 
 5   request                 26216 non-null  int32 
 6   offer                   26216 non-null  int32 
 7   aid_related             26216 non-null  int32 
 8   medical_help            26216 non-null  int32 
 9   medical_products        26216 non-null  int32 
 10  search_and_rescue       26216 non-null  int32 
 11  security                26216 non-null  int32 
 12  military                26216 non-null  int32 
 13  child_alone             26216 non-null  int32 
 14  water                   26216 non-null  int32 
 15  fo

### 7. Save the clean dataset into a SQLite database.
You can do this with the pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below.
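A minimal round-trip sketch of the save step. To stay self-contained it uses a connection from Python's built-in `sqlite3` module instead of an SQLAlchemy engine, which `to_sql` and `read_sql` also accept for SQLite. The frame `clean_demo` and its values are invented for illustration.

```python
import sqlite3
import pandas as pd

# Toy cleaned frame (values invented).
clean_demo = pd.DataFrame({"id": [2, 7], "message": ["a", "b"], "related": [1, 0]})

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# if_exists="replace" lets the code be re-run without a "table already exists" error.
clean_demo.to_sql("master", conn, index=False, if_exists="replace")

# Read the table back to confirm the round trip.
round_trip = pd.read_sql("SELECT * FROM master", conn)
conn.close()
```

Passing `index=False` keeps the DataFrame index out of the table, so the columns read back match the columns written.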

In [97]:
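from sqlalchemy import create_engine
# same database file that is read back in step 8
engine = create_engine('sqlite:///DisasterResponse.db')
# if_exists='replace' lets this cell be re-run without a "table already exists" error
df.to_sql('master', engine, index=False, if_exists='replace')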

### 8. Use this notebook to complete `etl_pipeline.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database based on new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom on the `Project Workspace IDE` coming later.
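One possible shape for such a script is sketched below. The course template is not shown here, so the function names `clean_data`, `save_data` and `run_pipeline` are hypothetical, as is the choice to clip label values above 1 down to 1. The sketch also uses the built-in `sqlite3` module rather than SQLAlchemy to stay self-contained.

```python
import sqlite3
import pandas as pd


def clean_data(messages, categories):
    """Merge the two frames and expand `categories` into 0/1 integer columns."""
    df = messages.merge(categories, how="inner", on="id")
    cats = df["categories"].str.split(";", expand=True)
    # column names come from the first row, minus the trailing "-<digit>"
    cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0].tolist()
    for column in cats.columns:
        # last character is the label; clip so every column is strictly 0 or 1
        # (assumed policy: values above 1 are treated as 1)
        cats[column] = cats[column].str[-1].astype(int).clip(upper=1)
    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    return df.drop_duplicates()


def save_data(df, database_path, table_name="master"):
    """Write the cleaned frame to a SQLite file, replacing any existing table."""
    conn = sqlite3.connect(database_path)
    try:
        df.to_sql(table_name, conn, index=False, if_exists="replace")
    finally:
        conn.close()


def run_pipeline(messages_path, categories_path, database_path):
    """Load both CSV files, clean them, and store the result."""
    messages = pd.read_csv(messages_path)
    categories = pd.read_csv(categories_path)
    save_data(clean_data(messages, categories), database_path)
```

A call such as `run_pipeline("messages.csv", "categories.csv", "DisasterResponse.db")` would then reproduce the notebook steps for whichever files the user supplies.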

In [3]:
from sqlalchemy import create_engine 
import pandas as pd
engine = create_engine('sqlite:///DisasterResponse.db')
df_gather = pd.read_sql('SELECT * FROM master', engine)

In [4]:
df_gather.head(2)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


In [5]:
df_gather.shape

(26216, 40)

In [6]:
df_gather.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,...,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.759307,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,...,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.427512,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,...,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Reviewing dataset for App visualizations

In [110]:
genre_counts = df.groupby('genre').count()['message']
genre_counts

genre
direct    10766
news      13054
social     2396
Name: message, dtype: int64

In [112]:
genre_names = list(genre_counts.index)
genre_names

['direct', 'news', 'social']

In [144]:
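# Reshape to long format: one row per (message id, category) pair.
# pd.melt returns a new frame, so no explicit copy of df is needed.
category_cols = list(df.columns.drop(['id', 'genre', 'message', 'original']))
df2 = pd.melt(df, id_vars=['id'], value_vars=category_cols, var_name='category')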

In [159]:
grouped = df2.groupby('category', as_index=False).sum()
top15 = grouped.sort_values(by='value',ascending=False)[:15]
top15

Unnamed: 0,category,id,value
25,related,399133915,19906
1,aid_related,399133915,10860
35,weather_related,399133915,7297
7,direct_report,399133915,5075
26,request,399133915,4474
21,other_aid,399133915,3446
12,food,399133915,2923
8,earthquake,399133915,2455
31,storm,399133915,2443
29,shelter,399133915,2314


In [156]:
bottom15 = grouped.sort_values(by='value',ascending=True)[:15]
bottom15

category,id,value
child_alone,399133915,0
offer,399133915,118
shops,399133915,120
tools,399133915,159
fire,399133915,282
hospitals,399133915,283
missing_people,399133915,298
aid_centers,399133915,309
clothing,399133915,405
security,399133915,471
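
Because the category columns hold 0/1 flags (see the `describe()` output earlier), the same totals can also be obtained by summing the columns directly, which avoids carrying the meaningless summed `id` column. A minimal sketch on a made-up three-row frame; the toy data and the name `category_cols` are assumptions, not the project data:

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe; values are invented for illustration.
toy = pd.DataFrame({
    "id": [2, 7, 8],
    "message": ["a", "b", "c"],
    "original": [None, None, None],
    "genre": ["direct", "direct", "news"],
    "related": [1, 1, 1],
    "request": [0, 1, 0],
    "offer": [0, 0, 0],
})

# Everything except the text/metadata columns is a category flag.
category_cols = toy.columns.drop(["id", "message", "original", "genre"])

# Column sums give the number of messages tagged with each category.
totals = toy[category_cols].sum().sort_values(ascending=False)
print(totals.head(15))
```

Calling `.tail(15)` on the same sorted Series would give the least frequent categories.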


In [186]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer


def tokenize(text):
    """Split text into runs of three or more letters, then lemmatize and lowercase each token."""
    tokenizer = RegexpTokenizer(r'[a-zA-Z]{3,}')
    lemmatizer = WordNetLemmatizer()
    # Build the list locally so that repeated calls do not accumulate tokens.
    return [lemmatizer.lemmatize(tok).lower().strip() for tok in tokenizer.tokenize(text)]

In [187]:
# to_string() renders the Series the way it is displayed, which appears to cut
# long messages short (hence fragments such as 'hospi' in the output below).
clean_tokens = tokenize(df['message'].to_string())
clean_tokens

['weather',
 'update',
 'cold',
 'front',
 'from',
 'cuba',
 'that',
 'the',
 'hurricane',
 'over',
 'not',
 'over',
 'looking',
 'for',
 'someone',
 'but',
 'name',
 'report',
 'leogane',
 'destroyed',
 'only',
 'hospi',
 'say',
 'west',
 'side',
 'haiti',
 'rest',
 'the',
 'country',
 'information',
 'about',
 'the',
 'national',
 'palace',
 'storm',
 'sacred',
 'heart',
 'jesus',
 'please',
 'need',
 'tent',
 'and',
 'water',
 'are',
 'sil',
 'would',
 'like',
 'receive',
 'the',
 'message',
 'thank',
 'you',
 'croix',
 'de',
 'bouquets',
 'have',
 'health',
 'i',
 'there',
 'nothing',
 'eat',
 'and',
 'water',
 'starving',
 'petionville',
 'need',
 'more',
 'information',
 'thomassin',
 'number',
 'the',
 'area',
 'named',
 'let',
 'together',
 'need',
 'food',
 'delma',
 'more',
 'information',
 'the',
 'number',
 'order',
 'comitee',
 'delmas',
 'rue',
 'street',
 'janvier',
 'need',
 'food',
 'and',
 'water',
 'klecin',
 'are',
 'are',
 'you',
 'going',
 'call',
 'you',
 'want',

In [196]:
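# Note: list.count() rescans the whole list for every token, and the loop prints
# one line per occurrence rather than per distinct word, so this cell is slow
# and was interrupted before it finished.
for tok in clean_tokens:
    print(tok, clean_tokens.count(tok))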

weather 133
update 25
cold 67
front 34
from 743
cuba 17
that 1374
the 12067
hurricane 196
over 222
not 832
over 222
looking 44
for 2103
someone 107
but 359
name 48
report 137
leogane 80
destroyed 98
only 152
hospi 4
say 206
west 48
side 23
haiti 526
rest 9
the 12067
country 218
information 683
about 623
the 12067
national 123
palace 6
storm 162
sacred 1
heart 31
jesus 18
please 978
need 1357
tent 288
and 3030
water 797
are 2171
sil 3
would 946
like 1035
receive 71
the 12067
message 460
thank 250
you 1461
croix 53
de 48
bouquets 18
have 2137
health 186
i 3
there 797
nothing 94
eat 72
and 3030
water 797
starving 21
petionville 9
need 1357
more 409
information 683
thomassin 1
number 145
the 12067
area 350
named 9
let 72
together 29
need 1357
food 1069
delma 23
more 409
information 683
the 12067
number 145
order 35
comitee 1
delmas 124
rue 56
street 135
janvier 3
need 1357
food 1069
and 3030
water 797
klecin 1
are 2171
are 2171
you 1461
going 165
call 142
you 1461
want 390
don 465
...

KeyboardInterrupt: 
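
A faster way to get the same numbers is to count every token in a single pass and print each distinct word once. The sketch below uses `collections.Counter` on a short made-up token list; on the real data, `clean_tokens` would take the place of `sample_tokens`.

```python
from collections import Counter

# Invented sample; stands in for the full clean_tokens list.
sample_tokens = ["need", "food", "and", "water", "need", "water", "need"]

# One pass over the list; each distinct token appears once in the result,
# in order of first appearance.
counts = Counter(sample_tokens)
for tok, n in counts.items():
    print(tok, n)
```

Each count equals what `list.count()` would return for that token, without rescanning the list for every word.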

In [210]:
from collections import Counter
ignore = {'the','a','if','in','it','of','or','are','for','that','and','would','will','with','what','this','there'}
word_counts = Counter(x for x in clean_tokens if x not in ignore)
word_counts

Counter({'weather': 133,
         'update': 25,
         'cold': 67,
         'front': 34,
         'from': 743,
         'cuba': 17,
         'hurricane': 196,
         'over': 222,
         'not': 832,
         'looking': 44,
         'someone': 107,
         'but': 359,
         'name': 48,
         'report': 137,
         'leogane': 80,
         'destroyed': 98,
         'only': 152,
         'hospi': 4,
         'say': 206,
         'west': 48,
         'side': 23,
         'haiti': 526,
         'rest': 9,
         'country': 218,
         'information': 683,
         'about': 623,
         'national': 123,
         'palace': 6,
         'storm': 162,
         'sacred': 1,
         'heart': 31,
         'jesus': 18,
         'please': 978,
         'need': 1357,
         'tent': 288,
         'water': 797,
         'sil': 3,
         'like': 1035,
         'receive': 71,
         'message': 460,
         'thank': 250,
         'you': 1461,
         'croix': 53,
         'de': 48,

In [211]:
word_counts.most_common(15)

[('have', 2137),
 ('you', 1461),
 ('need', 1357),
 ('can', 1200),
 ('food', 1069),
 ('people', 1043),
 ('like', 1035),
 ('ha', 1035),
 ('help', 1034),
 ('please', 978),
 ('not', 832),
 ('earthquake', 806),
 ('water', 797),
 ('know', 759),
 ('from', 743)]

In [222]:
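from nltk.corpus import stopwords

# Requires the NLTK stop word corpus (nltk.download('stopwords')).
stop_words = set(stopwords.words('english'))
word_counts = Counter(x for x in clean_tokens if x not in stop_words)
word_counts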

Counter({'weather': 133,
         'update': 25,
         'cold': 67,
         'front': 34,
         'cuba': 17,
         'hurricane': 196,
         'looking': 44,
         'someone': 107,
         'name': 48,
         'report': 137,
         'leogane': 80,
         'destroyed': 98,
         'hospi': 4,
         'say': 206,
         'west': 48,
         'side': 23,
         'haiti': 526,
         'rest': 9,
         'country': 218,
         'information': 683,
         'national': 123,
         'palace': 6,
         'storm': 162,
         'sacred': 1,
         'heart': 31,
         'jesus': 18,
         'please': 978,
         'need': 1357,
         'tent': 288,
         'water': 797,
         'sil': 3,
         'would': 946,
         'like': 1035,
         'receive': 71,
         'message': 460,
         'thank': 250,
         'croix': 53,
         'de': 48,
         'bouquets': 18,
         'health': 186,
         'nothing': 94,
         'eat': 72,
         'starving': 21,
         'p

In [225]:
most_common_words = word_counts.most_common(15) 

In [221]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\piewitheye\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [226]:
most_common_words

[('need', 1357),
 ('food', 1069),
 ('people', 1043),
 ('like', 1035),
 ('ha', 1035),
 ('help', 1034),
 ('please', 978),
 ('would', 946),
 ('earthquake', 806),
 ('water', 797),
 ('know', 759),
 ('information', 683),
 ('wa', 636),
 ('also', 568),
 ('haiti', 526)]
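
The entries `ha` and `wa` in this list look like `has` and `was` after lemmatization: the lemmatizer strips the final `s`, and the shortened forms no longer match the stop word list. One way around this, sketched below, is to lowercase and filter stop words before lemmatizing. The function name `filter_then_lemmatize` and the stand-in lemmatizer are assumptions for illustration; in the notebook, `WordNetLemmatizer().lemmatize` and `stopwords.words('english')` would be passed in instead.

```python
def filter_then_lemmatize(tokens, stop_words, lemmatize):
    """Lowercase tokens, drop stop words, then lemmatize what is left."""
    stop_set = set(stop_words)
    lowered = (tok.lower() for tok in tokens)
    return [lemmatize(tok) for tok in lowered if tok not in stop_set]


def strip_plural_s(word):
    """Crude stand-in for a lemmatizer: drops one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


tokens = ["Has", "tents", "was", "water"]
print(filter_then_lemmatize(tokens, ["has", "was"], strip_plural_s))
```

Filtering first means the stop word check sees the original word forms, so `has` and `was` are removed before any suffix is stripped.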

In [227]:
type(most_common_words)

list

In [228]:
df_words = pd.DataFrame(most_common_words, columns=['words', 'counts'])
df_words

Unnamed: 0,words,counts
0,need,1357
1,food,1069
2,people,1043
3,like,1035
4,ha,1035
5,help,1034
6,please,978
7,would,946
8,earthquake,806
9,water,797
