# Natural Language Processing with Disaster Tweets - Data Preprocessing

There are three main tasks in this notebook:

- clean the `location` column,
- tokenize the tweets so that they can be further used for word embedding,
- map the words from tokenized tweets to their word embeddings using [glove.twitter.27B](https://nlp.stanford.edu/projects/glove/).

We will ignore the `keyword` column because the keywords usually also appear in the tweet itself, so we expect them to add no new context.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import itertools as it
import random

# text processing libraries
import re                                  
import string  
import nltk                              
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer 
import contractions

Let's load the training and test data.

In [2]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

In [3]:
df_train = df_train.drop(columns=['keyword'])
df_test = df_test.drop(columns=['keyword'])

df_train.head()

|   | id | location | text | target |
|---|----|----------|------|--------|
| 0 | 1 | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [4]:
print(
    pd.isna( pd.concat([df_train['location'], df_test['location']]) ).value_counts()
)

location
False    7238
True     3638
Name: count, dtype: int64


## Location column

As you can see, there are lots of missing entries in the `location` column. Our approach to this problem will be to:
- simplify the data so that the `location` column contains only names of countries,
- try to infer the country from the `text` column if `location` is empty, or from the `location` entry itself if it is non-empty,
- if the above is not successful, fill in the entry with 'Worldwide'.

To this end, we load some geodata from the portal [Geonames](https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/table/?disjunctive.cou_name_en&sort=name).

In [5]:
df_geo_regions = pd.read_csv('geodata/subcountries.csv', delimiter=';')
df_geo_cities = pd.read_csv('geodata/cities.csv', delimiter=';')
df_geo_regions = df_geo_regions.dropna()

In [6]:
df_geo_regions.head()

|   | country | subcountry |
|---|---------|------------|
| 0 | Andorra | Escaldes-Engordany |
| 1 | Andorra | Andorra la Vella |
| 2 | United Arab Emirates | Umm al Qaywayn |
| 3 | United Arab Emirates | Raʼs al Khaymah |
| 4 | United Arab Emirates | Ash Shāriqah |


In [7]:
df_geo_cities.head()

|   | city | country | population |
|---|------|---------|------------|
| 0 | Bridgewater | United States | 7841 |
| 1 | Brookline | United States | 58732 |
| 2 | Hinsdale | United States | 1905 |
| 3 | Marshfield | United States | 4335 |
| 4 | Milton | United States | 27003 |


We remove the cities with a population below 100,000.

In [8]:
df_geo_cities = df_geo_cities.drop(df_geo_cities[df_geo_cities['population']<100000].index)
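
The same filter can also be written as a boolean mask. The sketch below uses a small hypothetical table (the city names and populations are made up for illustration) and checks that both forms keep the same rows. One caveat: a comparison with NaN is always False, so `drop` keeps rows with a missing population while the `>=` mask removes them.

```python
import pandas as pd

# Hypothetical toy data, not taken from the GeoNames file.
cities = pd.DataFrame({
    'city': ['Smallville', 'Midtown', 'Bigcity'],
    'country': ['United States', 'Canada', 'Japan'],
    'population': [7841, 100000, 2500000],
})

# Form used in the notebook: drop the rows below the threshold.
dropped = cities.drop(cities[cities['population'] < 100000].index)

# Boolean-mask form; identical here because no population is missing.
masked = cities[cities['population'] >= 100000]

print(masked['city'].tolist())
```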

The matching of cities and political regions to countries will be encoded in two dictionaries.

In [9]:
# region to country dictionary
subcountry_dict = df_geo_regions.set_index('subcountry')['country'].to_dict()

# city to country dictionary
city_dict = df_geo_cities.set_index('city')['country'].to_dict()

# set of all countries
countries = set(df_geo_regions['country']) | set(df_geo_cities['country'])
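
One property of this construction is worth keeping in mind: when the same name occurs more than once in the index, `to_dict()` keeps only the last occurrence. The sketch below illustrates this on hypothetical toy rows (not taken from the geodata files) and shows one possible way to list the ambiguous names.

```python
import pandas as pd

# Hypothetical toy rows: the same city name in two countries.
cities = pd.DataFrame({
    'city': ['London', 'Paris', 'London'],
    'country': ['United Kingdom', 'France', 'Canada'],
})

# Later rows overwrite earlier ones with the same key.
city_dict = cities.set_index('city')['country'].to_dict()
print(city_dict['London'])

# Names that map to more than one country.
ambiguous = cities.groupby('city')['country'].nunique()
ambiguous_names = ambiguous[ambiguous > 1].index.tolist()
print(ambiguous_names)
```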

In [10]:
def find_country(loc_tweet, subcountry_dict, city_dict, countries):
    """
    Attempts to match a country to the string loc_tweet.

    Collects the countries whose region, city or own name occurs as a
    substring of loc_tweet. Returns 'Worldwide' if nothing matches.
    """
    reg_search = [val for key, val in subcountry_dict.items() if key in loc_tweet]
    city_search = [val for key, val in city_dict.items() if key in loc_tweet]
    country_search = [cntry for cntry in countries if str(cntry) in loc_tweet]
    results = reg_search + city_search + country_search

    if not results:
        country = 'Worldwide'
    elif country_search:  # a country name matched directly, so prefer it
        country = random.choice(country_search)
    elif 'United States' in results:  # if there are a few candidates, pick US
        country = 'United States'
    elif 'United Kingdom' in results:  # if there are a few candidates, pick UK after US
        country = 'United Kingdom'
    else:
        country = random.choice(results)  # otherwise, pick randomly
    return country
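
Because the matching relies on plain substring tests, a name that is contained in a longer name also counts as a hit (for example, 'Niger' inside 'Nigeria'). The sketch below contrasts this with a whole-word match; the helper `contains_name` is a hypothetical illustration and not part of the notebook.

```python
import re

def contains_name(name, text):
    """Return True if `name` occurs in `text` as a whole word or phrase."""
    pattern = r'(?<!\w)' + re.escape(name) + r'(?!\w)'
    return re.search(pattern, text) is not None

countries = {'Niger', 'Nigeria'}
text = 'Lagos, Nigeria'

# Plain substring test, as in find_country.
substring_hits = sorted(c for c in countries if c in text)

# Whole-word test: 'Niger' no longer matches inside 'Nigeria'.
whole_word_hits = sorted(c for c in countries if contains_name(c, text))

print(substring_hits)
print(whole_word_hits)
```

With the substring test both names are candidates and `random.choice` decides between them; calling `random.seed` beforehand would at least make that choice reproducible.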

First, we try to infer the country from the tweet if location is empty.

In [11]:
text_locna_train = df_train.loc[df_train['location'].isna(), 'text']
text_locna_test = df_test.loc[df_test['location'].isna(), 'text']

text_locna_train = text_locna_train.apply(
    lambda text: find_country(text, subcountry_dict, city_dict, countries)
)

text_locna_test = text_locna_test.apply(
    lambda text: find_country(text, subcountry_dict, city_dict, countries)
)

In [12]:
# Fill the empty location entries by the found ones.
df_train['location'] = df_train['location'].fillna(text_locna_train)
df_test['location'] = df_test['location'].fillna(text_locna_test)
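
Passing a Series to `fillna` works here because pandas aligns the two Series on their index, so each inferred country lands in the row it was computed from. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical location column with two missing entries (rows 1 and 2).
location = pd.Series(['Canada', np.nan, np.nan, 'Kenya'])

# Countries inferred for the missing rows only, keyed by the same row labels.
inferred = pd.Series({1: 'United States', 2: 'Worldwide'})

# Missing values are filled by matching index labels; other rows are untouched.
filled = location.fillna(inferred)
print(filled.tolist())
```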

Next, we prepare to infer the country from the `location` entries. To this end, we normalise the locations by replacing a few common abbreviations.

In [13]:
normalised_names = {
    'USA' : 'United States',
    'US' : 'United States',
    'U.S.A.' : 'United States',
    'U.S.A' : 'United States',
    'U.S.' : 'United States',
    'CA' : 'United States',
    'UK' : 'United Kingdom',
    'NY' : 'United States',
    'nyc' : 'United States',
    'U.K.' : 'United Kingdom'
}

def normalise_locations(locations, normalised_names):
    """
    Replaces the abbreviations from normalised_names (dict) with their corresponding countries.

    Splits each location on hyphens, commas and slashes and joins the parts with spaces.
    """
    def normalise(loc):
        # The hyphen is placed last in the character class so that it is read
        # literally; parts are stripped before the dictionary lookup.
        parts = [part.strip() for part in re.split(r'[,/-]', str(loc))]
        return ' '.join(normalised_names.get(part, part) for part in parts)

    return locations.apply(normalise)
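
A detail of regular-expression character classes matters for the split pattern: a hyphen between two characters denotes a range, so `[,-/]` matches every character from `,` to `/`, which includes the dot. Placing the hyphen last, as in `[,/-]`, makes it literal. The short sketch below shows the difference on a made-up location string.

```python
import re

loc = 'New York, U.S.'

# ',-/' is a range covering ',', '-', '.' and '/', so dots are split on too.
range_split = re.split(r'[,-/]', loc)

# With the hyphen last, only ',', '/' and '-' act as separators.
literal_split = re.split(r'[,/-]', loc)

print(range_split)
print(literal_split)
```

With the range form, an abbreviation such as 'U.S.' is broken into single letters and can no longer be looked up as a dictionary key.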

In [14]:
# Normalise the 'location' columns.
df_loc_train = normalise_locations(df_train['location'], normalised_names)
df_loc_test = normalise_locations(df_test['location'], normalised_names)

Next, we infer countries from the normalised `location` data.

In [None]:
df_loc_train = df_loc_train.apply(
    lambda loc_tweet: find_country(loc_tweet, subcountry_dict, city_dict, countries)
)

df_loc_test = df_loc_test.apply(
    lambda loc_tweet: find_country(loc_tweet, subcountry_dict, city_dict, countries)
)

In [None]:
print( pd.concat([df_loc_test,df_loc_train]).value_counts() )
print( pd.concat([df_loc_test,df_loc_train]).nunique() )

The results are acceptable. We have created an extra 1.8K unknown (i.e. 'Worldwide') entries, but inspecting the original column shows that it is sometimes impossible to tell the country from the `location` data.

We save the location data in separate files.

In [None]:
df_loc_train.to_csv('data/location_train.csv')
df_loc_test.to_csv('data/location_test.csv')

## Tweet processing and tokenization

First, we extract the tweets from the train and test data.

In [None]:
tweets_train = df_train['text'].copy()
tweets_test = df_test['text'].copy()

Next, we tokenize the tweets.

In [None]:
def remove_url(tweet):
    """
    Removes URLs from tweet (string).
    """
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', tweet)
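
As a quick check of the pattern, the sketch below applies the same regular expression to a made-up tweet. Note that the whitespace around the removed link stays in place, which is harmless here because the tokenizer splits on whitespace anyway.

```python
import re

# Same pattern as in remove_url: http(s) links or anything starting with 'www.'.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Hypothetical example tweet.
tweet = 'Flood warning issued http://t.co/abc123 stay safe'
cleaned = URL_PATTERN.sub('', tweet)

print(repr(cleaned))
```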

In [None]:
tokenizer = TweetTokenizer(
    preserve_case=False, strip_handles=True, reduce_len=True
)

def clean_tweet(tweet, tokenizer):
    """
    Takes a string (presumably a tweet) and returns a list of tokenized words.
    """
    # Remove hyperlinks
    tweet2 = remove_url(tweet)
    # Remove the '#' and '@' symbols but keep the words that follow them
    tweet2 = re.sub(r'[#@]', '', tweet2)

    # Tokenize the tweet and expand contractions
    tweet_tokens = []
    for w in tokenizer.tokenize(tweet2):
        tweet_tokens.extend(tokenizer.tokenize(contractions.fix(w)))

    # Remove punctuation
    return [w for w in tweet_tokens if w not in string.punctuation]
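
The punctuation filter tests whether a token is a substring of `string.punctuation`, so it removes single punctuation marks but keeps multi-character tokens such as `...` unless they happen to appear contiguously in that string. The sketch below compares it with a stricter, hypothetical variant that drops any token made up entirely of punctuation; the token list is made up for illustration.

```python
import string

# Hypothetical token list, as a tokenizer might produce it.
tokens = ['fire', '!', '...', '<=>', 'help']

# Filter used in clean_tweet: substring test against string.punctuation.
kept_substring = [w for w in tokens if w not in string.punctuation]

# Stricter variant: drop tokens that consist only of punctuation characters.
kept_strict = [
    w for w in tokens
    if not all(ch in string.punctuation for ch in w)
]

print(kept_substring)
print(kept_strict)
```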

In [None]:
tweets_train = tweets_train.apply(
    lambda tweet: clean_tweet(tweet, tokenizer)
)

tweets_test = tweets_test.apply(
    lambda tweet: clean_tweet(tweet, tokenizer)
)

Next, we move to the task of identifying the word embeddings. We will map each word in the given tokenized tweet to its index in the (**sorted**) GloVe dictionary.

We read the GloVe file (**note that this file is not uploaded to my GitHub folder because of its large size**).

In [None]:
def read_glove_indices(glove_file):
    """
    Reads a GloVe text file and maps each word to its 1-based position
    in the sorted vocabulary.
    """
    words = set()
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            # The first whitespace-separated field of each line is the word itself.
            words.add(line.strip().split()[0])

    return {w: i for i, w in enumerate(sorted(words), start=1)}
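
To make the file format concrete: each line of a GloVe text file holds a word followed by the components of its vector. The sketch below builds the same kind of index from three made-up lines held in memory; the words and numbers are illustrative only. Indices start at 1, which leaves 0 unused (presumably so that it can serve as a padding value later; that is an assumption, not something stated in the notebook).

```python
import io

# Hypothetical stand-in for a GloVe file: word, then vector components.
sample = io.StringIO(
    "fire 0.1 0.2\n"
    "unk 0.0 0.0\n"
    "flood 0.3 0.4\n"
)

# Collect the first field of every line.
words = {line.split()[0] for line in sample if line.strip()}

# Number the words by their position in the sorted vocabulary, starting at 1.
words_to_index = {w: i for i, w in enumerate(sorted(words), start=1)}
print(words_to_index)
```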

In [None]:
words_to_index = read_glove_indices('data/glove.twitter.27B/glove.twitter.27B.25d.txt')
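
With the mapping loaded, every token is looked up in it, and tokens missing from the vocabulary fall back to the index of the word 'unk'. A minimal sketch of that lookup on a toy dictionary (the entries are made up), using `dict.get` for the fallback:

```python
# Hypothetical toy vocabulary; the real one comes from the GloVe file.
words_to_index = {'fire': 1, 'flood': 2, 'unk': 3}

tweet = ['forest', 'fire', 'flood']

# Words that are not in the vocabulary map to the index of 'unk'.
unk_index = words_to_index['unk']
indices = [words_to_index.get(w, unk_index) for w in tweet]
print(indices)
```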

In [None]:
tweets_indices_train = tweets_train.apply(
    lambda tweet: [words_to_index[w] if w in words_to_index else words_to_index['unk'] for w in tweet]
)

tweets_indices_test = tweets_test.apply(
    lambda tweet: [words_to_index[w] if w in words_to_index else words_to_index['unk'] for w in tweet]
)

In [None]:
print(tweets_indices_train.head())
print(tweets_indices_train.shape)

In [None]:
# Save the indices to files, one tweet per line.
# Mode 'w' overwrites an existing file, so re-running the cell does not duplicate rows.

with open('data/tweet_indices_train.txt', 'w') as fp:
    for ind_list in tweets_indices_train.values:
        ind_arr = np.array(ind_list, dtype=int)
        np.savetxt(fp, ind_arr, fmt='%d', newline=' ')
        fp.write('\n')

with open('data/tweet_indices_test.txt', 'w') as fp:
    for ind_list in tweets_indices_test.values:
        ind_arr = np.array(ind_list, dtype=int)
        np.savetxt(fp, ind_arr, fmt='%d', newline=' ')
        fp.write('\n')
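
Each line of the saved files holds the space-separated indices of one tweet, and a tweet without tokens gives an empty line. The sketch below shows one possible way to read such a file back into a list of lists; it uses an in-memory buffer with made-up indices instead of the real files.

```python
import io

# Hypothetical index lists, including a tweet with no tokens.
index_lists = [[3, 1, 2], [], [5]]

# Write one tweet per line, indices separated by spaces.
buffer = io.StringIO()
for ind_list in index_lists:
    buffer.write(' '.join(str(i) for i in ind_list) + '\n')

# Read the lines back; split() also copes with trailing spaces.
buffer.seek(0)
loaded = [[int(tok) for tok in line.split()] for line in buffer]
print(loaded)
```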

We are now finished with data preprocessing and can move on to actual language processing!