### Exploratory Data Analysis

In [1]:
import pandas as pd

df_twitter = pd.read_csv('train.csv')
df_twitter.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [2]:
df_twitter.describe

<bound method NDFrame.describe of          id keyword location  \
0         1     NaN      NaN   
1         4     NaN      NaN   
2         5     NaN      NaN   
3         6     NaN      NaN   
4         7     NaN      NaN   
...     ...     ...      ...   
7608  10869     NaN      NaN   
7609  10870     NaN      NaN   
7610  10871     NaN      NaN   
7611  10872     NaN      NaN   
7612  10873     NaN      NaN   

                                                   text  target  
0     Our Deeds are the Reason of this #earthquake M...       1  
1                Forest fire near La Ronge Sask. Canada       1  
2     All residents asked to 'shelter in place' are ...       1  
3     13,000 people receive #wildfires evacuation or...       1  
4     Just got sent this photo from Ruby #Alaska as ...       1  
...                                                 ...     ...  
7608  Two giant cranes holding a bridge collapse int...       1  
7609  @aria_ahrary @TheTawniest The out of control w.

Note that ``describe`` is referenced here without parentheses, so the cell shows the bound method's repr (which embeds a printout of the frame, cut off above) rather than summary statistics; ``df_twitter.describe()`` is the call that computes them. The printout does show that the index runs from 0 to 7612, so the data set has 7613 rows.

In [3]:
df_twitter['text'].str.len().mean()

101.03743596479706

The average tweet length is approximately 101 characters.

In [4]:
df_twitter['keyword'].describe()

count           7552
unique           221
top       fatalities
freq              45
Name: keyword, dtype: object

In [5]:
df_twitter['keyword'].value_counts(ascending=True)

radiation%20emergency     9
inundation               10
threat                   11
epicentre                12
forest%20fire            19
                         ..
damage                   41
harm                     41
deluge                   42
armageddon               42
fatalities               45
Name: keyword, Length: 221, dtype: int64
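
Some keywords, such as ``radiation%20emergency`` and ``forest%20fire``, contain ``%20``, which is the URL encoding of a space. A minimal sketch of decoding them with the standard library's ``urllib.parse.unquote`` follows; the small ``keywords`` series is a stand-in built from values in the output above, not the real column.

```python
from urllib.parse import unquote

import pandas as pd

# Toy stand-in for df_twitter['keyword']; the values are copied from the output above.
keywords = pd.Series(['radiation%20emergency', 'forest%20fire', 'damage', None])

# unquote() turns percent-escapes such as %20 back into the characters they encode.
# Missing keywords (None/NaN) are not strings, so they are passed through unchanged.
decoded = keywords.map(lambda k: unquote(k) if isinstance(k, str) else k)

print(decoded.tolist())
```

On the real frame, the same ``map`` call could be applied to ``df_twitter['keyword']``.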

The ``keyword`` column has 221 distinct values across 7552 non-null entries, and even the most common one, ``fatalities``, appears only 45 times. Let's check the same for the ``location`` column.

In [6]:
df_twitter['location'].describe()

count     5080
unique    3341
top        USA
freq       104
Name: location, dtype: object

In [7]:
df_twitter['location'].value_counts(ascending=False).idxmax()

'USA'

In [8]:
df_twitter['location'].value_counts(ascending=False).head()

USA              104
New York          71
United States     50
London            45
Canada            29
Name: location, dtype: int64

This is interesting: the most frequent location value is USA, but the top five values overlap. United States is the same place as USA, and New York is a city within it. If we can find a way to combine these values, we would get a more accurate picture of where the tweets come from, which would be very useful.

As with the ``keyword`` column, this column has many distinct values: 3341 unique values among the 5080 non-null entries, which makes it difficult to work with.
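
One possible way to combine overlapping values is a hand-written lookup table that maps raw strings to a single country name. The sketch below is only an illustration: ``COUNTRY_ALIASES`` and ``normalize_location`` are hypothetical names, the mapping covers just the five values shown above, and the choice to map cities to their country is an assumption.

```python
import pandas as pd

# Toy stand-in for df_twitter['location'], using the top values shown above.
locations = pd.Series(['USA', 'New York', 'United States', 'London', 'Canada', None])

# Hypothetical, hand-written lookup: lower-cased raw value -> country name.
COUNTRY_ALIASES = {
    'usa': 'United States',
    'united states': 'United States',
    'new york': 'United States',
    'london': 'United Kingdom',
    'canada': 'Canada',
}


def normalize_location(value):
    """Map a raw location string to a country; leave unknown or missing values alone."""
    if not isinstance(value, str):
        return value  # keep None/NaN as missing
    cleaned = value.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


country = locations.map(normalize_location)
print(country.value_counts())
```

With 3341 distinct values, a manual table like this only scales to the most frequent entries; the long tail would need a different approach.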

In [13]:
df_twitter['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
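
The counts above are easier to read as shares of the whole data set. As a quick check, the sketch below recomputes them from the two numbers in the output; ``target_counts`` is a hand-built copy of that output, not the live column.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
target_counts = pd.Series({0: 4342, 1: 3271}, name='target')

# Divide each class count by the total number of tweets.
shares = target_counts / target_counts.sum()

print(shares.round(3))
```

Roughly 57% of the tweets are labelled 0 and 43% are labelled 1, so the classes are somewhat, but not severely, imbalanced. On the real frame, ``df_twitter['target'].value_counts(normalize=True)`` gives the same shares directly.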

### Simple Modelling

Here, we start by creating a new column that holds each tweet's ``text`` split into a list of words. With this, we should be able to find the most common words in the data set.

In [9]:
df_twitter['text_array'] = df_twitter['text'].str.split(' ')
df_twitter['text_array'].head()

0    [Our, Deeds, are, the, Reason, of, this, #eart...
1       [Forest, fire, near, La, Ronge, Sask., Canada]
2    [All, residents, asked, to, 'shelter, in, plac...
3    [13,000, people, receive, #wildfires, evacuati...
4    [Just, got, sent, this, photo, from, Ruby, #Al...
Name: text_array, dtype: object
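
With the words in list form, counting the most common ones could look like the sketch below. It uses two toy tweets (the second is invented for illustration), and it lower-cases the text and splits on any whitespace, which differs slightly from the ``split(' ')`` call above.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for df_twitter['text']; the second tweet is made up for the example.
tweets = pd.Series([
    'Forest fire near La Ronge Sask. Canada',
    'Forest fire spreading near the town',
])

# Lower-case first so that 'Forest' and 'forest' count as the same word.
text_array = tweets.str.lower().str.split()

# Flatten the per-tweet word lists and count every word.
word_counts = Counter(word for words in text_array for word in words)

print(word_counts.most_common(3))
```

On the real data, punctuation and very common words such as "the" would probably need to be handled before the counts become informative.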

Next, we're going to create a new column, containing only the hashtags from the text.

In [15]:
# Collect the hashtags of each tweet without the leading '#'.
# Tweets with no hashtags get None instead of an empty list.
hashtags = []
for row in df_twitter['text_array']:
    hashes = [word[1:] for word in row if word.startswith('#')]
    hashtags.append(hashes if hashes else None)
df_twitter['hashtags'] = pd.Series(hashtags, index=df_twitter.index)
df_twitter['hashtags'].head()

0           [earthquake]
1                   None
2                   None
3            [wildfires]
4    [Alaska, wildfires]
Name: hashtags, dtype: object

In [11]:
df_twitter['hashtags'].describe()

count                      1707
unique                     1347
top       [hot, prebreak, best]
freq                         30
Name: hashtags, dtype: object

In [21]:
df_twitter['hashtag_len'] = df_twitter['hashtags'].str.len()
df_twitter['hashtag_len'].head()

0    1.0
1    NaN
2    NaN
3    1.0
4    2.0
Name: hashtag_len, dtype: float64

In [20]:
df_twitter['hashtag_len'].value_counts()

1.0     951
2.0     388
3.0     197
4.0      86
5.0      34
6.0      28
7.0       8
8.0       4
9.0       3
12.0      3
11.0      2
10.0      2
13.0      1
Name: hashtag_len, dtype: int64

In [31]:
df_twitter.head()

Unnamed: 0,id,keyword,location,text,target,text_array,hashtags,hashtag_len
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart...",[earthquake],1.0
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask., Canada]",,
2,5,,,All residents asked to 'shelter in place' are ...,1,"[All, residents, asked, to, 'shelter, in, plac...",,
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati...",[wildfires],1.0
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al...","[Alaska, wildfires]",2.0


In [37]:
# One column per hashtag position, up to the largest number of hashtags in any tweet.
max_hashtags = int(df_twitter['hashtag_len'].max())
for i in range(max_hashtags):
    # .str[i] yields NaN where a tweet has fewer than i + 1 hashtags
    df_twitter[f'hashtag_{i}'] = df_twitter['hashtags'].str[i]
df_twitter.head()

Unnamed: 0,id,keyword,location,text,target,text_array,hashtags,hashtag_len,hashtag_0,hashtag_1,...,hashtag_3,hashtag_4,hashtag_5,hashtag_6,hashtag_7,hashtag_8,hashtag_9,hashtag_10,hashtag_11,hashtag_12
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart...",[earthquake],1.0,earthquake,,...,,,,,,,,,,
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask., Canada]",,,,,...,,,,,,,,,,
2,5,,,All residents asked to 'shelter in place' are ...,1,"[All, residents, asked, to, 'shelter, in, plac...",,,,,...,,,,,,,,,,
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati...",[wildfires],1.0,wildfires,,...,,,,,,,,,,
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al...","[Alaska, wildfires]",2.0,Alaska,wildfires,...,,,,,,,,,,


In [42]:
df_twitter.hashtag_0.value_counts()

hot                    30
news                   19
Hiroshima              16
GBBO                   13
Directioners            9
                       ..
deltachildren           1
LegionnairesDisease     1
Hellfire                1
DnB                     1
jishin_e                1
Name: hashtag_0, Length: 1202, dtype: int64

In [52]:
df_twitter.groupby('hashtag_0')['target'].mean()

hashtag_0
              0.4
#book         0.0
#fukushima    0.0
#youtube      1.0
034           0.0
             ... 
yes           0.0
yugvani       1.0
yyc           1.0
yyc!          1.0
Û_           0.5
Name: target, Length: 1202, dtype: float64

What I want to do with this column is find a way to use hashtags to predict whether or not a tweet is about a disaster. However, only 1707 of the 7613 tweets (about 22%) contain a hashtag, so hashtags alone could not cover most of the data set.
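
One way to gauge how far hashtags could go is to look at the disaster rate and the number of tweets per hashtag together, and then check how many tweets carry a hashtag that occurs often enough to trust. The sketch below uses a toy frame with invented labels; ``MIN_COUNT`` is a hypothetical threshold, not a value taken from the data.

```python
import pandas as pd

# Toy stand-in for the hashtag_0 and target columns; the labels are invented.
toy = pd.DataFrame({
    'hashtag_0': ['wildfires', 'wildfires', 'hot', 'hot', 'hot', None],
    'target': [1, 1, 0, 0, 1, 0],
})

# Disaster rate (mean of target) and number of tweets for each first hashtag.
# groupby drops rows whose hashtag is missing.
stats = toy.groupby('hashtag_0')['target'].agg(['mean', 'count'])

# Hypothetical cut-off: ignore hashtags seen in fewer than MIN_COUNT tweets.
MIN_COUNT = 3
reliable = stats[stats['count'] >= MIN_COUNT]

# Share of all tweets whose first hashtag passes the cut-off.
coverage = toy['hashtag_0'].isin(reliable.index).mean()

print(stats)
print(f'coverage: {coverage:.2f}')
```

A rate based on only one or two tweets says little, which is why the count is kept next to the mean.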