## NLP - Processing Disaster Tweets Data Analysis
Author: Matthew Williams

This notebook provides an analysis of the dataset provided by the Natural Language Processing with Disaster Tweets competition. This notebook started as a copy of a [Getting Started Tutorial](https://www.kaggle.com/philculliton/nlp-getting-started-tutorial) provided by [phil culliton](https://www.kaggle.com/philculliton). In this analysis I will be following this [Guide to exploratory data analysis for NLP](https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [None]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

### A quick look at the data

The data section of the competition page describes the data as a CSV with the following columns:

    id - a unique identifier for each tweet
    text - the text of the tweet
    location - the location the tweet was sent from (may be blank)
    keyword - a particular keyword from the tweet (may be blank)
    target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

To confirm this, we can take a quick glance at the data we have imported.

In [None]:
train_df.head(5)

Here we can see that the dataset does indeed match what was described on the competition page. One thing to note is that the keyword and location columns do not always contain a value.
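
To put a number on how often keyword and location are empty, one option is to count missing values per column. The sketch below uses a small made-up frame (`sample_df` is a stand-in, not the competition data); on the real data the same two calls would be applied to `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df; the values here are invented for illustration.
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "flood", None, "fire"],
    "location": [None, None, "Texas", None],
    "text": ["a", "b", "c", "d"],
    "target": [1, 0, 1, 0],
})

# Number of missing values in each column.
missing_counts = sample_df.isna().sum()
# Same information as a share of all rows (mean of a boolean column).
missing_share = sample_df.isna().mean()

print(missing_counts)
print(missing_share.round(2))
```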

Now let's take a look at a tweet that is not about a natural disaster:

In [None]:
train_df[train_df["target"] == 0]["text"].values[1]

And one that is:

In [None]:
train_df[train_df["target"] == 1]["text"].values[1]

Finally, before we take a deeper look at the overall dataset, let us check the total number of datapoints the dataset contains:

In [None]:
len(train_df)

Overall the training dataset has 5 columns and 7613 unique entries corresponding to tweets.
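
The row count alone does not tell us whether every entry is really unique. A quick way to check, sketched here on a made-up frame (`sample_df` is hypothetical), is to test the `id` column for uniqueness and count repeated tweet texts.

```python
import pandas as pd

# Toy stand-in for train_df with one repeated tweet text.
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Forest fire near town",
        "I love fruits",
        "Forest fire near town",
        "Heavy rain",
    ],
})

# True when no id value appears more than once.
ids_unique = sample_df["id"].is_unique
# duplicated() flags every repeat after the first occurrence.
n_duplicate_texts = int(sample_df["text"].duplicated().sum())

print(sample_df.shape, ids_unique, n_duplicate_texts)
```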

### Text Statistics Visualizations and Analysis

Now let's take a closer look at the overall dataset. First, let's check the lengths of the tweets and see how many characters they contain. For the following histograms, the y-axis shows the number of tweets in each bin, while the x-axis is the metric we are examining.

In [None]:
train_df["text"].str.len().hist()

Here we can see that the majority of tweets contain at most 140 characters, with only a few having more characters. Additionally, a significant portion of the data resides in the range of 110-140 characters, about 3,400 tweets, which is close to half of the total number of tweets (7613).
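
Reading counts off a histogram is approximate, so the share of tweets in a length range can also be computed directly. This is a minimal sketch on invented strings; with the real data, `texts` would be `train_df["text"]`.

```python
import pandas as pd

# Four made-up "tweets" with lengths 120, 50, 140 and 20 characters.
texts = pd.Series(["a" * 120, "b" * 50, "c" * 140, "d" * 20])

lengths = texts.str.len()
# between() is inclusive on both ends by default.
in_range = lengths.between(110, 140)

n_in_range = int(in_range.sum())
share_in_range = float(in_range.mean())
print(n_in_range, share_in_range)
```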

Next, we can check the average number of words per sentence in each tweet:

In [None]:
train_df["text"].str.split(".").apply(lambda x : [len(s.split()) for s in x]).map(lambda x: np.mean(x)).hist()

We can see that most sentences average around 5 words. However, there appear to be quite a few tweets with longer sentences or that avoided periods altogether.
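
One caveat with splitting on "." is that a tweet ending in a period produces an empty trailing fragment, which counts as a zero-word sentence and pulls the average down. The helper below (`mean_words_per_sentence` is a hypothetical name, not part of the notebook) sketches the difference between keeping and dropping those empty fragments.

```python
def mean_words_per_sentence(text, drop_empty=True):
    """Average number of words per '.'-separated fragment of a tweet."""
    fragments = text.split(".")
    counts = [len(fragment.split()) for fragment in fragments]
    if drop_empty:
        # Ignore fragments with no words, e.g. the one after a final period.
        counts = [c for c in counts if c > 0]
    return sum(counts) / len(counts) if counts else 0.0

tweet = "Fire near the school. Everyone is safe."
with_empty = mean_words_per_sentence(tweet, drop_empty=False)   # fragments: 4, 3, 0 words
without_empty = mean_words_per_sentence(tweet, drop_empty=True)  # fragments: 4, 3 words
print(with_empty, without_empty)
```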

Now let's check the number of words in each tweet:

In [None]:
train_df["text"].str.split().\
    map(lambda x: len(x)).\
    hist()

Here we see something that looks much closer to a normal distribution, with most tweets containing between 10 and 22 words. Additionally, there are few tweets with over 25 words.
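
A range like "most tweets" can be backed up with quantiles of the word counts rather than read from the plot. The sketch below uses four invented strings; on the real data `texts` would be `train_df["text"]`, and the quantile levels chosen here are only an example.

```python
import pandas as pd

texts = pd.Series([
    "one two three",
    "one two",
    "one two three four five",
    "one",
])

# Whitespace-split each tweet and count the resulting tokens.
word_counts = texts.str.split().str.len()
# 10th, 50th and 90th percentiles of the word counts.
summary = word_counts.quantile([0.1, 0.5, 0.9])

print(word_counts.tolist())
print(summary)
```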

Next, let's check the average word length in each tweet:

In [None]:
train_df["text"].str.split().\
   apply(lambda x : [len(i) for i in x]). \
   map(lambda x: np.mean(x)).hist()

Here we can see the average word length of the tweets falls between 4 and 7 characters. However, this could include a lot of stopwords or other short words that provide little to no information, so let us try to filter some of that out. Here we will use the stopword list from [nltk](https://www.nltk.org/) to accomplish the task.

In [None]:
import nltk
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
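
As an illustration of how stopwords affect the average word length, the sketch below recomputes the metric with and without them. It uses a tiny hand-written set (`demo_stop` is an assumption so the example runs without downloading NLTK data) and a hypothetical helper `mean_word_length`; with the real list one would pass `stop` instead.

```python
# Small hypothetical stopword set, standing in for the NLTK list.
demo_stop = {"the", "a", "to", "in", "of", "is"}

def mean_word_length(text, stopwords=frozenset()):
    """Average length of whitespace-separated words, skipping stopwords."""
    # Lowercase before the lookup so "The" is treated like "the".
    words = [w for w in text.split() if w.lower() not in stopwords]
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

tweet = "The fire is close to the village"
all_words = mean_word_length(tweet)               # 26 characters over 7 words
content_words = mean_word_length(tweet, demo_stop)  # fire, close, village
print(round(all_words, 2), round(content_words, 2))
```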

First, let's check to see which stop words occur most frequently in our dataset. This will also help give us an idea of the amount of data that needs to be trimmed.

In [None]:
corpus = []
new = train_df["text"].str.split()
new = new.values.tolist()
corpus = [word for i in new for word in i]

from collections import defaultdict
dic=defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word]+=1

top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:12] 
x,y=zip(*top)
plt.bar(x,y)

As we can see from this image, there are a significant number of stopwords contained in the tweets, with the words "the", "a", and "to" being the most frequent.

Now let's take a look at the most common words in our tweets that are not stop words.

In [None]:
import seaborn as sns
from collections import  Counter

counter = Counter(corpus)
most=counter.most_common()

x, y= [], []
for word,count in most[:60]:
    if (word not in stop):
        x.append(word)
        y.append(count)
        
sns.barplot(x=y,y=x)

Almost unsurprisingly, "I" is the most commonly found word in tweets outside of stopwords. Following "I", we have "-", "The", and "like" appearing most frequently. There also appear to be some tokens that might be worth removing from the dataset, like '-', '??' and '&amp'.
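
Words like "I" and "The" show up here because the NLTK stopword list is lowercase while the lookup is case-sensitive. A minimal sketch of a case-insensitive count follows; `demo_stop` is a tiny stand-in set, and the `isalnum()` check is a deliberately crude way to drop tokens such as "-" (it would also drop hashtags and words with attached punctuation).

```python
from collections import Counter

# Tiny stand-in for the NLTK stopword set, which is all lowercase.
demo_stop = {"the", "i", "a", "to"}
tokens = "The fire I saw was near the river - I ran".split()

# Case-sensitive lookup: "The" and "I" slip past the lowercase stopword set.
case_sensitive = Counter(t for t in tokens if t not in demo_stop)

# Lowercase first, and keep only purely alphanumeric tokens.
case_insensitive = Counter(
    t.lower() for t in tokens
    if t.lower() not in demo_stop and t.isalnum()
)

print(case_sensitive.most_common(3))
print(case_insensitive.most_common(3))
```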

### Ngram exploration

Here we will examine the most frequent n-grams in our dataset. The counting itself is done with scikit-learn's `CountVectorizer`.

In [None]:
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n=1):
    # Count every n-gram of length n across the corpus and return the 10 most frequent.
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each column to get the total count of each n-gram across all tweets.
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

top_n_bigrams=get_top_ngram(train_df["text"], 2)[:10]
x,y=map(list, zip(*top_n_bigrams))
sns.barplot(x=y,y=x) 

Here, most of the bigrams are quite short and provide very little information, or seem to be some part of a link that might have been partially removed.
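
The link fragments come from how the text is tokenized: by default `CountVectorizer` lowercases the text and keeps only runs of two or more word characters, so a shortened URL breaks apart into pieces such as "http" and "co". The sketch below imitates that behavior with the standard library only; `top_ngrams` and the two example tweets are made up for illustration, and the regex is an approximation of the vectorizer's default pattern.

```python
import re
from collections import Counter

# Approximates CountVectorizer's default token pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def top_ngrams(texts, n=2, k=3):
    """Return the k most common n-grams over lowercased regex tokens."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_PATTERN.findall(text.lower())
        # Slide a window of size n over the token list.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)

tweets = [
    "Wildfire update http://t.co/abc123",
    "Flood warning http://t.co/xyz789",
]
top_bigram = top_ngrams(tweets, n=2, k=1)
print(top_bigram)
```

Because the single-character "t" is dropped, every shortened link contributes the same "http co" bigram, which is why it dominates the counts.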

Now let's look at trigrams:

In [None]:
top_tri_grams=get_top_ngram(train_df["text"],n=3)
x,y=map(list,zip(*top_tri_grams))
sns.barplot(x=y,y=x)

Again, there are more link fragments, potentially related to sharing videos. However, the remaining results actually look quite promising: at least 3 of the top ten trigrams appear to be natural disaster related.

Another item to check before moving on is the target column. How many of our 7,613 tweets are actually about natural disasters?

In [None]:
train_df["target"].hist()

In this histogram we can see there are around 4,400 tweets that are not talking about natural disasters (target = 0), and 3,200 that are related to natural disasters (target = 1).
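
Exact class counts and proportions can be read without a plot using `value_counts`. The series below is a made-up example; on the real data it would be `train_df["target"]`.

```python
import pandas as pd

# Invented labels: three non-disaster tweets (0) and two disaster tweets (1).
targets = pd.Series([0, 0, 0, 1, 1], name="target")

# Absolute count per class, ordered by label.
counts = targets.value_counts().sort_index()
# Same breakdown as a fraction of all rows.
shares = targets.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```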

### Data Pre-Processing

As our earlier exploration showed, there is a significant amount of extraneous data that we should try to remove before tokenizing the words in the tweets so they can be processed by our model.

In [None]:
import nltk
import gensim
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.tokenize import word_tokenize
import pyLDAvis.gensim

Here I will be adding a few additional words to the list of stop words in order to better clean up the data.

In [None]:
additional_stop_words = {"http", "https", "-", "amp", "&amp;", "??", "???", "????", "\x89Û_", "\x89ÛÓ","co", "|", "..."}

stop = stop.union(additional_stop_words)
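
An alternative to listing link fragments as stopwords is to strip URLs and decode HTML entities before tokenizing. This is only a sketch of that idea, not what the notebook does; `clean_tweet` and `URL_PATTERN` are hypothetical names, and the regex only targets plain `http`/`https` links.

```python
import html
import re

# Matches an http or https link up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet(text):
    """Decode HTML entities, remove links, and collapse extra whitespace."""
    text = html.unescape(text)         # turns "&amp;" into "&"
    text = URL_PATTERN.sub(" ", text)  # drop the whole link, not just "http"
    return " ".join(text.split())

cleaned = clean_tweet("Fire &amp; smoke downtown http://t.co/abc123 stay safe")
print(cleaned)
```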

Build a new corpus without any stopwords:

In [None]:
def preprocess(df):
    corpus = []
    stem = PorterStemmer()
    lem = WordNetLemmatizer()
    for text in df['text']:
        words = [w for w in word_tokenize(text) if (w not in stop)]
        
        words=[lem.lemmatize(w) for w in words if len(w)>2]
        
        corpus.append(words)
    return corpus

corpus=preprocess(train_df)

In [None]:
dic = gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]
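
To make the bag-of-words step concrete, the sketch below builds the same kind of structure by hand: each document becomes a sorted list of `(token_id, count)` pairs, which is the shape `doc2bow` returns. The helpers `build_vocab` and `to_bow` are hypothetical, and the ids they assign will not necessarily match the ones gensim's `Dictionary` would choose.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["fire", "forest", "fire"], ["flood", "forest"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```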

After creating a bag-of-words model from our corpus, we will use it to fit an LDA model in order to do some topic analysis.

In [None]:
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics = 5, 
                                       id2word = dic,                                    
                                       passes = 10,
                                       workers = 2)
lda_model.show_topics()

Unfortunately, the model seems to have some trouble with topic analysis. I would guess this is due to the source of the datasets being tweets, which can discuss a wide range of topics.

Next we will create a wordcloud which can help give us a general idea of our vocabulary:

In [None]:
from wordcloud import WordCloud

def show_wordcloud(data):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stop,
        max_words=128,
        max_font_size=30,
        scale=4,
        random_state=1)
   
    wordcloud=wordcloud.generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')

    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(corpus)

Here we can see a wide range of words, with none particularly sticking out. Overall, it seems like "The" is the most frequent word, with not too great a difference between word frequencies. Additionally, you can see a string 'x89U_' which most likely represents an emoji or another mis-encoded character, which may present a challenge to the model.

From our analysis we have clearly defined this challenge as a binary classification problem: a tweet is either about a natural disaster or it is not.