## Data Cleaning:
Here we inspect the data closely and clean it where needed.

## Table of Contents

1. [Imports](#Imports)
2. [Read in Dataframes](#Read-in-Dataframes)
3. [Data Cleaning](#Data-Cleaning)
    1. [Creating Target Variable](#Creating-Target-Variable)
4. [Export Clean Data](#Export-Clean-Data)

#### Imports

In [1]:
import pandas as pd
import numpy as np

#### Read in Dataframes

In [2]:
# load earthquake data
earthquake = pd.read_csv("data/earthquake_50k_final.csv")
# load flood data
flood = pd.read_csv("data/flood_50k_final.csv")
# load tornado data
tornado = pd.read_csv("data/tornado_50k_final.csv")
# load hurricane data (pipe-delimited file)
hurricane = pd.read_csv("data/hurricane.csv", sep='|')
# load fire data (pipe-delimited file)
fire = pd.read_csv("data/fire.csv", sep='|')


#### Data Cleaning

In [3]:
# make a new column in each dataframe with the type of natural disaster it is
earthquake['type'] = 'earthquake'
flood['type'] = 'flood'
tornado['type'] = 'tornado'
hurricane['type'] = 'hurricane'
fire['type'] = 'fire'

In [4]:
# Combine dataframes
df = pd.concat([earthquake, flood, tornado, hurricane, fire], ignore_index=True)
df.head()

Unnamed: 0.1,Unnamed: 0,url,date,content,renderedContent,id,user,replyCount,retweetCount,likeCount,...,inReplyToUser,mentionedUsers,coordinates,place,hashtags,cashtags,card,type,user_location,north_america
0,0.0,https://twitter.com/eyeffemeral/status/1530923...,2022-05-29 14:47:25+00:00,they really removed earthquake and simon says ...,they really removed earthquake and simon says ...,1530923771980554241,"{'username': 'eyeffemeral', 'id': 138886522109...",0,0,0,...,,,,,,,,earthquake,,
1,1.0,https://twitter.com/BotDotCom1/status/15309237...,2022-05-29 14:47:14+00:00,Angelina Jolie Is Noticed By Fans In Wyoming S...,Angelina Jolie Is Noticed By Fans In Wyoming S...,1530923726107160576,"{'username': 'BotDotCom1', 'id': 1310354499987...",0,0,0,...,,,,,,,,earthquake,,
2,2.0,https://twitter.com/no_earthquake/status/15309...,2022-05-29 14:46:50+00:00,i started writing something about this &amp; s...,i started writing something about this &amp; s...,1530923627708702720,"{'username': 'no_earthquake', 'id': 1316897066...",0,0,0,...,"{'username': 'no_earthquake', 'id': 1316897066...",,,,,,,earthquake,,
3,3.0,https://twitter.com/no_earthquake/status/15309...,2022-05-29 14:46:09+00:00,tom sachs becoming basically uh a nike designe...,tom sachs becoming basically uh a nike designe...,1530923454622445569,"{'username': 'no_earthquake', 'id': 1316897066...",1,0,0,...,,,,,,,,earthquake,,
4,4.0,https://twitter.com/QuakeBotter/status/1530923...,2022-05-29 14:45:38+00:00,There was a 2.57 magnitude earthquake 6km NW o...,There was a 2.57 magnitude earthquake 6km NW o...,1530923322921345026,"{'username': 'QuakeBotter', 'id': 143967208484...",0,0,0,...,,,,,,,,earthquake,,


In [5]:
df.shape

(160000, 32)

In [6]:
df.columns

Index(['Unnamed: 0', 'url', 'date', 'content', 'renderedContent', 'id', 'user',
       'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
       'conversationId', 'lang', 'source', 'sourceUrl', 'sourceLabel',
       'outlinks', 'tcooutlinks', 'media', 'retweetedTweet', 'quotedTweet',
       'inReplyToTweetId', 'inReplyToUser', 'mentionedUsers', 'coordinates',
       'place', 'hashtags', 'cashtags', 'card', 'type', 'user_location',
       'north_america'],
      dtype='object')

In [7]:
# drop unnecessary columns: keep only the tweet text, location fields, language, and disaster type
df = df[["content", "coordinates", "user_location", "lang", "type"]]
df.head()

Unnamed: 0,content,coordinates,user_location,lang,type
0,they really removed earthquake and simon says ...,,,en,earthquake
1,Angelina Jolie Is Noticed By Fans In Wyoming S...,,,en,earthquake
2,i started writing something about this &amp; s...,,,en,earthquake
3,tom sachs becoming basically uh a nike designe...,,,en,earthquake
4,There was a 2.57 magnitude earthquake 6km NW o...,,,en,earthquake


In [8]:
# keep English-language tweets only
df = df[df["lang"] == "en"]
df.shape

(124817, 5)

#### Drop Duplicates

In [9]:
#dropping duplicates
df.drop_duplicates(subset='content', inplace=True)

In [10]:
#re-check shape of dataframe
df.shape

(121671, 5)

In [11]:
# the lang column is no longer needed now that only English tweets remain
df = df[["content", "coordinates", "user_location", "type"]]
df.head(20)

Unnamed: 0,content,coordinates,user_location,type
0,they really removed earthquake and simon says ...,,,earthquake
1,Angelina Jolie Is Noticed By Fans In Wyoming S...,,,earthquake
2,i started writing something about this &amp; s...,,,earthquake
3,tom sachs becoming basically uh a nike designe...,,,earthquake
4,There was a 2.57 magnitude earthquake 6km NW o...,,,earthquake
8,There was a 2.66 magnitude earthquake 6km ENE ...,,,earthquake
9,The military used to support film makers like ...,,,earthquake
10,@NickAdamsinUSA Red earthquake. \nAnd you will...,,,earthquake
11,love it shakes\nlike an earthquake \nwithin an...,,,earthquake
12,I have noticed a lot of earthquake clouds rece...,,,earthquake


In [12]:
# Reset index of new dataframe
df.reset_index(drop=True, inplace = True)

In [13]:
df.head(20)

Unnamed: 0,content,coordinates,user_location,type
0,they really removed earthquake and simon says ...,,,earthquake
1,Angelina Jolie Is Noticed By Fans In Wyoming S...,,,earthquake
2,i started writing something about this &amp; s...,,,earthquake
3,tom sachs becoming basically uh a nike designe...,,,earthquake
4,There was a 2.57 magnitude earthquake 6km NW o...,,,earthquake
5,There was a 2.66 magnitude earthquake 6km ENE ...,,,earthquake
6,The military used to support film makers like ...,,,earthquake
7,@NickAdamsinUSA Red earthquake. \nAnd you will...,,,earthquake
8,love it shakes\nlike an earthquake \nwithin an...,,,earthquake
9,I have noticed a lot of earthquake clouds rece...,,,earthquake


#### Check for nulls

In [14]:
df.isnull().sum()

content               0
coordinates      115672
user_location    116136
type                  0
dtype: int64
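To put these counts in proportion, the short sketch below divides each null count by the number of rows. The figures are copied from the outputs above; the variable names are illustrative only.

```python
# Counts copied from the isnull() output; 121671 is the row count after de-duplication
total_rows = 121671
missing_counts = {"coordinates": 115672, "user_location": 116136}

# Share of rows with no value in each location column
missing_share = {col: n / total_rows for col, n in missing_counts.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
```

Roughly 95% of tweets carry no location in either column, so any location-based filter will keep only a small fraction of the data.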

#### Removing Unnecessary Punctuation

We removed links to other websites and picture links because we wanted to train the model on the tweet text alone. We also removed all punctuation except `!` and `#`, and all digits except 9 and 1 (so that 911 survives). Exclamation marks and 911 convey urgency, and hashtags are widely used to draw attention to a tweet.

In [15]:
# Make sure all tweets in the content column are strings
df['content'] = df['content'].astype(str)

In [16]:
# check type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121671 entries, 0 to 121670
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   content        121671 non-null  object
 1   coordinates    5999 non-null    object
 2   user_location  5535 non-null    object
 3   type           121671 non-null  object
dtypes: object(4)
memory usage: 3.7+ MB


In [17]:
# Use regular expressions to remove links and unnecessary punctuation
import re

In [18]:
df['content'][5]

'There was a 2.66 magnitude earthquake 6km ENE of Progreso, B.C., MX'

In [19]:
df['content'][6]

"The military used to support film makers like Tony Scott.   These days they're giving money to Kevin Feige.  What a sad decline."

In [20]:
# The r prefix marks a raw string, so backslashes reach the regex engine unchanged
# \S matches any character that is NOT whitespace
# + matches one or more occurrences
# [^...] matches any character that is NOT in the set
# Note: [^ a-zA-Z!#911] keeps every individual 9 and 1, not only the sequence 911
df["content"] = df["content"].apply(lambda x: re.sub(r"http\S+", "", x).lower())
df["content"] = df["content"].apply(lambda x: re.sub(r"pic\.twitter\S+", "", x))
df["content"] = df["content"].apply(lambda x: re.sub(r"[^ a-zA-Z!#911]", "", x))

In [21]:
# Re-check
df['content'][5]

'there was a  magnitude earthquake km ene of progreso bc mx'

In [22]:
# Re-check
df['content'][6]

'the military used to support film makers like tony scott   these days theyre giving money to kevin feige  what a sad decline'
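The cleaning steps above can be wrapped in one function and tried on small strings. This is a minimal sketch using the standard `re` module; `clean_tweet` and the sample strings are hypothetical and not part of the notebook. It also shows a side effect of the character class: any single 9 or 1 is kept, not only the sequence 911.

```python
import re

def clean_tweet(text: str) -> str:
    """Lowercase, strip links, and keep only letters, spaces, '!', '#', '9' and '1'."""
    text = re.sub(r"http\S+", "", text).lower()   # remove web links
    text = re.sub(r"pic\.twitter\S+", "", text)   # remove picture links
    return re.sub(r"[^ a-z!#91]", "", text)       # drop everything else

urgent = clean_tweet("Call 911 NOW! #flood https://t.co/abc")
partial_digits = clean_tweet("M 5.1 in 2019")

print(repr(urgent))
print(repr(partial_digits))
```

In the second string only the digits 1 and 9 survive, which is why fragments such as `m1` or `m11` appear in cleaned earthquake tweets.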

#### Creating Target Variable

To create our target variable we must determine which tweets may constitute calls for emergency help. To categorize tweets, we compiled a list of words that an individual or group would likely use when in need of immediate help. A tweet that contains any of these words is labeled 1; all other tweets are labeled 0.

In [23]:
critical_words = ['fatality', 'destruction', 'rescue', 'stranded', 'stuck', 'injured', 'lost', 'dying', 'danger',
                  'medivac', 'sos', 'save me', 'save us', 'debris', 'injury', 'drowning', 'ambulance', 'doctor',
                  'help us', 'help me', 'fire', 'life-threatening', 'starving', 'broke', 'please help',
                  '911', 'casualty', 'death', 'need help'
]

In [24]:
# Source of the emergency word list below: https://link.springer.com/article/10.1007/s10796-018-9843-x/tables/3
Emergency_Words= "ambulance anarchic anarchy angioplasty arsenc beachfront blizzard calamity coastline curfew cyclone devastation dike disaster dyke emergency eruption evacuate evacuation evacuee famine firefighter flood flood-hit flooding floodwater frantically gust gusty hard-hit hurricane impassable inaccessible insurgent insurrection inundate issues landfall levee low-lying malaria malnutrition melting meningitis mortuary mudslide nirmala non-essential ntsb overflow pilots preparedness rescuer rioter sandbag shelter stone-thrower submerge swollen tornado torrential tributary urgent volcanic volcano wildfire worst-hit"

critical_words_2 = Emergency_Words.split()

critical_words.extend(critical_words_2)

len(set(critical_words))


95
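One thing to watch with this list: the cleaning step strips hyphens from tweets, so hyphenated keywords such as `life-threatening` or `flood-hit` cannot match cleaned text unless they are normalized the same way. The sketch below illustrates this under that assumption; `clean` is a hypothetical stand-in for the punctuation filter used earlier.

```python
import re

def clean(text: str) -> str:
    """Hypothetical stand-in for the notebook's punctuation filter."""
    return re.sub(r"[^ a-z!#91]", "", text.lower())

keywords = ["life-threatening", "flood-hit", "need help"]
normalized_keywords = [clean(k) for k in keywords]

tweet = clean("Life-threatening flooding downtown!")

raw_match = any(k in tweet for k in keywords)                    # hyphenated form
normalized_match = any(k in tweet for k in normalized_keywords)  # hyphen removed

print(tweet)
print(raw_match, normalized_match)
```

Passing the keyword list through the same cleaning function as the tweets keeps the two vocabularies comparable.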

In [25]:
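# Collect every tweet that contains at least one critical word (substring match).
# A set avoids duplicate entries and keeps the membership test in the next cell fast.
emergency_tweets = {
    tweet for tweet in df['content']
    if any(word in tweet for word in critical_words)
}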

In [26]:
df["target"] = df["content"].apply(lambda x: 1 if x in emergency_tweets else 0)

In [27]:
df['target'].value_counts(normalize=True)

1    0.672946
0    0.327054
Name: target, dtype: float64

The classes are imbalanced: about 67% of tweets are labeled as emergency tweets (1) and about 33% are not (0).
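Part of that share may come from substring matching: a short keyword such as `fire` also matches `fired`. The sketch below contrasts substring matching with whole-word matching on a few made-up tweets. The three-word list and the function names are illustrative assumptions, not the notebook's labeling code.

```python
import re

critical_subset = ["fire", "lost", "need help"]  # small illustrative subset

def label_substring(tweet: str) -> int:
    """1 if any keyword occurs anywhere in the tweet, even inside another word."""
    return int(any(word in tweet for word in critical_subset))

# \b anchors each keyword at word boundaries, so "fire" no longer matches "fired"
whole_word = re.compile(r"\b(?:" + "|".join(map(re.escape, critical_subset)) + r")\b")

def label_whole_word(tweet: str) -> int:
    """1 only if a keyword occurs as a complete word or phrase."""
    return int(whole_word.search(tweet) is not None)

tweets = ["i got fired today", "we need help on main street", "what a lovely day"]
substring_labels = [label_substring(t) for t in tweets]
whole_word_labels = [label_whole_word(t) for t in tweets]

print(substring_labels)   # [1, 1, 0]
print(whole_word_labels)  # [0, 1, 0]
```

Whether whole-word matching is the better choice depends on the labeling goal; it trades some recall (for example `fires`) for fewer accidental matches.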

#### Location of Tweets:

In [28]:
df['coordinates'].value_counts()

{'longitude': -75.027179, 'latitude': 39.8593695}      49
{'longitude': -83.67529, 'latitude': 36.540739}        48
{'longitude': -84.3219475, 'latitude': 33.752879}      41
{'longitude': -118.668404, 'latitude': 33.704538}      31
{'longitude': -0.3470252, 'latitude': 5.51713}         23
                                                       ..
{'longitude': -66.843, 'latitude': 17.99083333}         1
{'longitude': -113.7116667, 'latitude': 47.5273333}     1
{'longitude': -146.672, 'latitude': 61.334}             1
{'longitude': -121.6096667, 'latitude': 36.8738333}     1
{'longitude': 26.7784419, 'latitude': -13.9406299}      1
Name: coordinates, Length: 4048, dtype: int64

In [29]:
null_coordinate=df[df['coordinates'].isnull()].index.tolist()
null_coordinate

[0,
 1,
 2,
 3,
 4,
 ...]


In [30]:
# creating the new coordinate columns
df["latitude"] = np.nan
df["longitude"] = np.nan

In [31]:
df.head()

Unnamed: 0,content,coordinates,user_location,type,target,latitude,longitude
0,they really removed earthquake and simon says ...,,,earthquake,0,,
1,angelina jolie is noticed by fans in wyoming s...,,,earthquake,1,,
2,i started writing something about this amp ste...,,,earthquake,0,,
3,tom sachs becoming basically uh a nike designe...,,,earthquake,0,,
4,there was a magnitude earthquake km nw of the...,,,earthquake,0,,


In [32]:
import ast

In [33]:
df['coordinates'][36]

"{'longitude': 176.21, 'latitude': -37.64}"

In [34]:
ast.literal_eval(df['coordinates'][36])

{'longitude': 176.21, 'latitude': -37.64}

In [35]:
df['coordinates'][36] is not np.nan

True
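The identity test `is not np.nan` works here only when the missing values are the very same `nan` object; a NaN created elsewhere is a different object. A minimal sketch of a more defensive parser follows, using only the standard library. `parse_coordinates` is a hypothetical helper, and the sample string is the value shown above.

```python
import ast
import math

def parse_coordinates(value):
    """Return (latitude, longitude), or (nan, nan) when the value is missing."""
    if not isinstance(value, str):       # missing values arrive as float NaN
        return (math.nan, math.nan)
    coords = ast.literal_eval(value)     # safely parse the dict-like string
    return (coords["latitude"], coords["longitude"])

parsed = parse_coordinates("{'longitude': 176.21, 'latitude': -37.64}")
missing = parse_coordinates(float("nan"))

# Two NaN objects are not identical, which is why identity checks are fragile
other_nan = float("nan")
identity_check = other_nan is math.nan

print(parsed)
print(missing)
print(identity_check)
```

Checking the type, or using a dedicated missing-value test such as `math.isnan`, does not depend on which NaN object is stored.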

In [36]:
# Parse each coordinate string and fill the numeric columns in one pass.
# notna() is a more reliable missing-value test than `is not np.nan`,
# and .loc avoids the chained assignment that triggers SettingWithCopyWarning.
has_coords = df['coordinates'].notna()
parsed = df.loc[has_coords, 'coordinates'].apply(ast.literal_eval)
df.loc[has_coords, 'longitude'] = parsed.apply(lambda c: c['longitude'])
df.loc[has_coords, 'latitude'] = parsed.apply(lambda c: c['latitude'])


In [37]:
df.head()

Unnamed: 0,content,coordinates,user_location,type,target,latitude,longitude
0,they really removed earthquake and simon says ...,,,earthquake,0,,
1,angelina jolie is noticed by fans in wyoming s...,,,earthquake,1,,
2,i started writing something about this amp ste...,,,earthquake,0,,
3,tom sachs becoming basically uh a nike designe...,,,earthquake,0,,
4,there was a magnitude earthquake km nw of the...,,,earthquake,0,,


In [38]:
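# bounding box used as the "North America" filter, stored as [min, max]
# (these values approximate the bounding box of the contiguous United States)
north_america_boundaries = {"latitude": [24.521208, 49.382808],
                            "longitude": [-124.736342, -66.945392]}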

In [53]:
# Both conditions must live in ONE query string: ("a") and ("b") evaluates to "b" in Python,
# which would apply the longitude test only. .copy() makes the result an independent dataframe.
df_lat_filter = df.query(
    "latitude > 24.5 and latitude < 49.38 "
    "and longitude > -124.736342 and longitude < -66.945392"
).copy()

In [54]:
df_lat_filter

Unnamed: 0,content,coordinates,user_location,type,target,latitude,longitude
23,usgs reports a m earthquake km ese of beatty ...,"{'longitude': -116.2717, 'latitude': 36.7641}",,earthquake,0,36.764100,-116.271700
40,usgs reports a m1 earthquake km wnw of smiths...,"{'longitude': -116.1626667, 'latitude': 44.3195}",,earthquake,0,44.319500,-116.162667
45,usgs reports a m earthquake km nw of indian s...,"{'longitude': -116.0749, 'latitude': 36.7963}",,earthquake,0,36.796300,-116.074900
51,usgs reports a m11 earthquake km ssw of markl...,"{'longitude': -119.8126, 'latitude': 38.6329}",,earthquake,0,38.632900,-119.812600
...,...,...,...,...,...,...,...
121202,theerkj likefor real notifications on fire,"{'longitude': -118.3959042, 'latitude': 34.075...",los angeles 🌴,fire,1,34.075963,-118.395904
121269,movieendorser a few good menghostone crazy sum...,"{'longitude': -118.668404, 'latitude': 33.704538}",,fire,1,33.704538,-118.668404
121335,tractor trailer fire clean up in #frontroyal o...,"{'longitude': -78.19296, 'latitude': 38.96084}",Washington DC,fire,1,38.960840,-78.192960
121500,pfur1 i attempted to demonstrate the same thin...,"{'longitude': -88.795109, 'latitude': 44.597246}","Clintonville, WI",fire,1,44.597246,-88.795109


In [57]:
# user_location is dropped from the exported data
df_lat_filter.drop(columns=["user_location"], inplace=True)
df_lat_filter

Unnamed: 0,content,coordinates,type,target,latitude,longitude
23,usgs reports a m earthquake km ese of beatty ...,"{'longitude': -116.2717, 'latitude': 36.7641}",earthquake,0,36.764100,-116.271700
40,usgs reports a m1 earthquake km wnw of smiths...,"{'longitude': -116.1626667, 'latitude': 44.3195}",earthquake,0,44.319500,-116.162667
45,usgs reports a m earthquake km nw of indian s...,"{'longitude': -116.0749, 'latitude': 36.7963}",earthquake,0,36.796300,-116.074900
51,usgs reports a m11 earthquake km ssw of markl...,"{'longitude': -119.8126, 'latitude': 38.6329}",earthquake,0,38.632900,-119.812600
...,...,...,...,...,...,...
121202,theerkj likefor real notifications on fire,"{'longitude': -118.3959042, 'latitude': 34.075...",fire,1,34.075963,-118.395904
121269,movieendorser a few good menghostone crazy sum...,"{'longitude': -118.668404, 'latitude': 33.704538}",fire,1,33.704538,-118.668404
121335,tractor trailer fire clean up in #frontroyal o...,"{'longitude': -78.19296, 'latitude': 38.96084}",fire,1,38.960840,-78.192960
121500,pfur1 i attempted to demonstrate the same thin...,"{'longitude': -88.795109, 'latitude': 44.597246}",fire,1,44.597246,-88.795109
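The bounding-box test used in this section can be checked on individual points in plain Python. This is a minimal sketch: `in_bounding_box` is a hypothetical helper, the limits are the boundary values defined above, and the two sample points are coordinates that appear in the data. The last lines show why two condition strings joined with Python's `and` reduce to the second string alone.

```python
LAT_MIN, LAT_MAX = 24.521208, 49.382808
LON_MIN, LON_MAX = -124.736342, -66.945392

def in_bounding_box(lat: float, lon: float) -> bool:
    """True only when BOTH latitude and longitude fall inside the box."""
    return LAT_MIN < lat < LAT_MAX and LON_MIN < lon < LON_MAX

nevada_point = in_bounding_box(36.7641, -116.2717)  # inside the box
chile_point = in_bounding_box(-32.51, -71.84)       # longitude fits, latitude does not

# `and` between two non-empty strings returns the second string unchanged
combined = ("latitude > 24.5") and ("longitude < -66.9")

print(nevada_point, chile_point)
print(combined)
```

A point in the southern hemisphere with a matching longitude passes a longitude-only filter, so both conditions need to be part of the same expression.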


#### Export Clean Data

In [58]:
# Export the cleaned data to the data folder
df_lat_filter.to_csv('data/clean_df.csv', index=False)

In [41]:
#df['nort_america'] = ''
#for idx, val in enumerate(df.itertuples()):
    #if df.loc[idx,'latitude'] > 24.521208 and df.loc[idx,'latitude'] < 49.382808:
        #df.loc[idx, 'north_america'] = 'Yes'
    #else:
        #df.loc[idx, 'north_america'] = 'No'

In [42]:
#df['north_america'].value_counts(normalize=True)

In [43]:
#df['nort_america_long'] = ''
#for idx, val in enumerate(df.itertuples()):
    #if df.loc[idx,'longitude'] > -124.736342 and df.loc[idx,'longitude'] < -66.945392:
        #df.loc[idx, 'north_america_long'] = 'Yes'
    #else:
        #df.loc[idx, 'north_america_long'] = 'No'

In [44]:
#df['north_america'].value_counts(normalize=True)
#len(null_coordinate)

In [45]:
#a=[i for i in range(df.shape[0]) if df['user_location'][i]=='Earth']
#len(a)

In [46]:
#len(set(null_coordinate).intersection(set(a)))

In [47]:
#state_names = ["Alaska", "Alabama", "Arkansas", "American Samoa", "Arizona", "California", "Colorado", "Connecticut", "District ", "of Columbia", "Delaware", "Florida", "Georgia", "Guam", "Hawaii", "Iowa", "Idaho", "Illinois", "Indiana", "Kansas", "Kentucky", "Louisiana", "Massachusetts", "Maryland", "Maine", "Michigan", "Minnesota", "Missouri", "Mississippi", "Montana", "North Carolina", "North Dakota", "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "Nevada", "New York", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Virginia", "Virgin Islands", "Vermont", "Washington", "Wisconsin", "West Virginia", "Wyoming"]

In [48]:
#def find_states(data_frame):
    #index_list=[]
    #for i in range(df.shape[0]):
        #for j in state_names:
            #if j in df['user_location'][i]:
                #return index_list.append(i)
            #else:
                #return np.nan

In [49]:
# Import pandas package  
#import pandas as pd 
#import numpy as np
    
# Define a dictionary containing  data 
#data = {'City':state_names} 
    
# Convert the dictionary into DataFrame 
#df = pd.DataFrame(data) 
    
# Observe the result 
#df 

In [50]:
# Define a dictionary containing  data
#dg=pd.DataFrame(data)

In [51]:
#https://www.geeksforgeeks.org/how-to-find-longitude-and-latitude-for-a-list-of-regions-or-country-using-python/?ref=lbp

#from geopy.exc import GeocoderTimedOut
#from geopy.geocoders import Nominatim
# declare an empty list to store
# latitude and longitude of values
# of city column
#longitude = []
#latitude = []
# function to find the coordinate
# of a given city
#def findGeocode(city):
    # try and catch is used to overcome
    # the exception thrown by geolocator
    # using geocodertimedout
    #try:
        # Specify the user_agent as your
        # app name it should not be none
        #geolocator = Nominatim(user_agent="your_app_name")
        #return geolocator.geocode(city)
    #except GeocoderTimedOut:
        #return findGeocode(city)
#for i in (dg["City"]):
    #if findGeocode(i) != None:
        #loc = findGeocode(i)
        # coordinates returned from
        # function is stored into
        # two separate list
        #latitude.append(loc.latitude)
        #longitude.append(loc.longitude)
    #else:
        #latitude.append(np.nan)
        #longitude.append(np.nan)

#dg["Longitude"] = longitude
#dg["Latitude"] = latitude
#dg

In [52]:
#dg["City"] = dg["City"].str.lower()