## Motivation:
I wanted to learn Exploratory Data Analysis (EDA) so that I can apply it easily to the datasets I work with in the future. So I picked a dataset, learned from several good notebooks, and produced my own findings. I hope you will learn from this notebook too.
### Please upvote if you like the work...

## References:
#### This kernel includes ideas from the kernels below.
- https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert/notebook
- https://www.kaggle.com/ratan123/start-from-here-disaster-tweets-eda-basic-model/notebook
- https://www.kaggle.com/datafan07/disaster-tweets-nlp-eda-bert-with-transformers/notebook

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
from nltk.corpus import stopwords, wordnet
stop = set(stopwords.words('english'))

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from wordcloud import WordCloud, STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from collections import Counter, defaultdict

import spacy
#!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
import en_core_web_sm

# For customizing our plots.
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches

#plotly
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot, plot

# Core packages for general use throughout the notebook.
import random
import warnings
import time
import datetime
import re
import string

# Setting some options for general use.
plt.style.use('fivethirtyeight')
sns.set(font_scale=1.5)
pd.options.display.max_columns = 250
pd.options.display.max_rows = 250
warnings.filterwarnings('ignore')

## Load the data

In [None]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

In [None]:
display(train.sample(5))

In [None]:
display(test.sample(5))

In [None]:
print(train.shape)
print(test.shape)

## Exploratory Data Analysis

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

### Target Distribution

This shows how the data is distributed between the two classes: disaster and non-disaster.
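
Alongside the bar chart, it can help to look at the class balance as plain numbers. The snippet below is a minimal sketch: `class_shares` is a hypothetical helper (not part of the original notebook) and the labels are made up for illustration.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Toy labels, made up for illustration (0 = not disaster, 1 = disaster).
toy_targets = [0, 0, 0, 1, 1]
shares = class_shares(toy_targets)
for label, (n, share) in shares.items():
    print(f"target={label}: {n} tweets ({share:.0%})")
```

With the notebook's DataFrame, `train['target'].value_counts(normalize=True)` should give the same shares directly.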

In [None]:
colors = ['lightsalmon', 'lightskyblue']
fig = go.Figure([go.Bar(x=train['target'].value_counts().index, y=train['target'].value_counts().values,
                        marker_color=colors)])
fig.update_layout(title_text='Distribution of Tweets', title_x=0.5)
fig.show()

### Meta Features

Meta features help us spot differences between disaster and non-disaster tweets.<br>
Some of the meta features are:
- word_count: number of words in text
- unique_word_count: number of unique words in text
- stopword_count: number of stop words in text
- url_count: number of URLs in text
- mean_word_length: average character count in words
- char_count: number of characters in text
- punctuation_count: number of punctuations in text
- hashtag_count: number of hashtags (#) in text
- mention_count: number of mentions (@) in text
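
To make the definitions above concrete, here is a small sketch that computes all nine features for a single string. The `meta_features` helper and the tiny `TOY_STOPWORDS` set are assumptions for illustration; the notebook itself uses `STOPWORDS` from `wordcloud` and works column by column.

```python
import string

# Tiny illustrative stop-word set (an assumption; the notebook uses wordcloud's STOPWORDS).
TOY_STOPWORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "and"}

def meta_features(text, stopwords=TOY_STOPWORDS):
    """Compute the nine meta features for one piece of text."""
    text = str(text)
    words = text.split()
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "stopword_count": sum(w.lower() in stopwords for w in words),
        "url_count": sum("http" in w.lower() for w in words),
        # Returns 0.0 for empty text, where np.mean would return NaN.
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "char_count": len(text),
        "punctuation_count": sum(c in string.punctuation for c in text),
        "hashtag_count": text.count("#"),
        "mention_count": text.count("@"),
    }

# Made-up example tweet.
features = meta_features("Fire in the hills! #wildfire @news http://t.co/x")
for name, value in features.items():
    print(f"{name}: {value}")
```

Note that the words are split on whitespace only, so punctuation attached to a word (like `hills!`) counts towards that word's length.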

In [None]:
# total word count
train['word_count'] = train['text'].apply(lambda x: len(str(x).split()))
test['word_count'] = test['text'].apply(lambda x: len(str(x).split()))

# total unique word count
train['unique_word_count'] = train['text'].apply(lambda x: len(set(str(x).split())))
test['unique_word_count'] = test['text'].apply(lambda x: len(set(str(x).split())))

# stopwords count
train['stopword_count'] = train['text'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
test['stopword_count'] = test['text'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))

# url_count
train['url_count'] = train['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w or 'https' in w]))
test['url_count'] = test['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w or 'https' in w]))

# mean_word_length
train['mean_word_length'] = train['text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test['mean_word_length'] = test['text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

# char_count
train['char_count'] = train['text'].apply(lambda x: len(str(x)))
test['char_count'] = test['text'].apply(lambda x: len(str(x)))

# punctuation_count
train['punctuation_count'] = train['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
test['punctuation_count'] = test['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

# hashtag_count
train['hashtag_count'] = train['text'].apply(lambda x: len([c for c in str(x) if c == '#']))
test['hashtag_count'] = test['text'].apply(lambda x: len([c for c in str(x) if c == '#']))

# mention_count
train['mention_count'] = train['text'].apply(lambda x: len([c for c in str(x) if c == '@']))
test['mention_count'] = test['text'].apply(lambda x: len([c for c in str(x) if c == '@']))

In [None]:
train.head(2)

In [None]:
meta_features = ['word_count','unique_word_count','stopword_count','url_count','mean_word_length',
                 'char_count','punctuation_count','hashtag_count','mention_count']

DISASTER_TWEETS = train['target'] == 1

for i, feature in enumerate(meta_features):
    
    a = f'{feature} Target Distribution in Training Set'
    b = f'{feature} Training & Test Set Distribution'
    fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5], subplot_titles=[a, b])
    
    hist_data = [train.loc[~DISASTER_TWEETS][feature], train.loc[DISASTER_TWEETS][feature]]

    group_labels = ['Not Disaster', 'Disaster']
    
    fig2 = ff.create_distplot(hist_data, group_labels)
    
    fig.add_trace(go.Histogram(fig2['data'][0],
                           marker_color='mediumslateblue'
                          ), row=1, col=1)

    fig.add_trace(go.Histogram(fig2['data'][1],
                               marker_color='mediumvioletred'
                              ), row=1, col=1)

    fig.add_trace(go.Scatter(fig2['data'][2],
                             line=dict(color='mediumslateblue', width=0.5)
                            ), row=1, col=1)

    fig.add_trace(go.Scatter(fig2['data'][3],
                             line=dict(color='mediumvioletred', width=0.5)
                            ), row=1, col=1)
    

    
    hist_data = [train[feature], test[feature]]

    group_labels = ['Training', 'Testing']
    
    fig2 = ff.create_distplot(hist_data, group_labels)
    
    fig.add_trace(go.Histogram(fig2['data'][0],
                           marker_color='mediumturquoise'
                          ), row=1, col=2)

    fig.add_trace(go.Histogram(fig2['data'][1],
                               marker_color='indianred'
                              ), row=1, col=2)

    fig.add_trace(go.Scatter(fig2['data'][2],
                             line=dict(color='mediumturquoise', width=0.5)
                            ), row=1, col=2)

    fig.add_trace(go.Scatter(fig2['data'][3],
                             line=dict(color='indianred', width=0.5)
                            ), row=1, col=2)
    
    fig.show()



### Most Common Words

Common words help us get to know the data. For example, if you did not know what the disaster and non-disaster classes contained, the most common words in each class would show what that class is about.
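
The idea can be sketched on a few made-up tweets with `collections.Counter`. The `top_words` helper and the toy texts are assumptions for illustration only.

```python
from collections import Counter

def top_words(texts, k=3, stopwords=frozenset()):
    """Return the k most frequent lowercased words, skipping stop words."""
    counter = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in stopwords
    )
    return counter.most_common(k)

# Made-up tweets per class, for illustration only.
toy_by_class = {
    "Not Disaster": ["i love this song", "love the new song"],
    "Disaster": ["forest fire near the town", "fire crews battle the fire"],
}
toy_stop = {"the", "i", "this"}
results = {label: top_words(texts, k=2, stopwords=toy_stop)
           for label, texts in toy_by_class.items()}
for label, words in results.items():
    print(label, words)
```

This sketch filters stop words before taking the top `k`, so it always returns up to `k` words. Filtering after slicing the top 50, as in the cell below, can leave fewer bars than expected.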

In [None]:
lis = [
    train[train['target'] == 0]['text'],
    train[train['target'] == 1]['text']
]

Before cleaning the data, we can see that most of the common words are uninformative, so the text needs to be cleaned.

In [None]:
cnt = 1
a = 'Non Disaster Tweets'
b = 'Disaster Tweets'
fig = make_subplots(rows=1, cols=2, subplot_titles=[a, b])
colors = ['darkturquoise', 'deepskyblue']
for i, j in zip(lis, colors):
    
    new = i.str.split()
    new = new.values.tolist()
    corpus = [word for i in new for word in i]
    
    counter = Counter(corpus)
    most = counter.most_common()
    x, y = [], []
    for word, count in most[:50]:
        if (word not in stop):
            x.append(word)
            y.append(count)

    fig.add_trace(go.Bar(x=x, y=y, marker_color=colors[cnt-1]),
                  row=1, col=cnt)
    cnt+=1
    
fig.update_layout(
    title="Most Common Unigrams", title_x=0.5,
    xaxis_title="Word",
    yaxis_title="Count",
)
fig.show()

## Data Cleaning

**Steps followed to clean the data:**
- Removed URLs, emojis, HTML tags and punctuation,
- Tokenized the tweet texts,
- Lowercased the clean text,
- Removed stopwords,
- Applied part-of-speech tags,
- Converted the part-of-speech tags to WordNet format,
- Applied the WordNet lemmatizer,
- Converted the tokenized text back to a string.
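
As a rough sketch of the first few steps above, the snippet below cleans a single made-up tweet using only the standard library. It is a simplification: emoji removal, part-of-speech tagging and lemmatization are left out, whitespace splitting stands in for `word_tokenize`, and `basic_clean` and `TOY_STOPWORDS` are hypothetical names.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>|&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's list).
TOY_STOPWORDS = {"the", "a", "is", "in", "on", "at", "of", "and"}

def basic_clean(text, stopwords=TOY_STOPWORDS):
    """Remove URLs, HTML and punctuation, then lowercase, split and drop stop words."""
    text = URL_RE.sub("", text)         # URLs first, while '://' is still intact
    text = HTML_RE.sub("", text)        # tags and entities such as &amp;
    text = text.translate(PUNCT_TABLE)  # strip punctuation characters
    tokens = text.lower().split()       # whitespace split instead of word_tokenize
    return [t for t in tokens if t not in stopwords]

tokens = basic_clean("Flood warning in the city &amp; suburbs! http://t.co/abc <b>NOW</b>")
print(tokens)
```

The order matters: URLs and HTML entities have to go before punctuation is stripped, otherwise their patterns no longer match.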

In [None]:
def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_emoji(text):
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(html, '', text)


def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

# Applying helper functions

train['text_clean'] = train['text'].apply(lambda x: remove_url(x))
train['text_clean'] = train['text_clean'].apply(lambda x: remove_emoji(x))
train['text_clean'] = train['text_clean'].apply(lambda x: remove_html(x))
train['text_clean'] = train['text_clean'].apply(lambda x: remove_punct(x))

In [None]:
train['tokenized'] = train['text_clean'].apply(word_tokenize)

train['lower'] = train['tokenized'].apply(lambda x: [word.lower() for word in x])

train['stopwords_removed'] = train['lower'].apply(lambda x: [word for word in x if word not in stop])

train['pos_tags'] = train['stopwords_removed'].apply(nltk.tag.pos_tag)

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
train['wordnet_pos'] = train['pos_tags'].apply(
    lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])

wnl = WordNetLemmatizer()

train['lemmatized'] = train['wordnet_pos'].apply(
    lambda x: [wnl.lemmatize(word, tag) for word, tag in x])

train['lemmatized'] = train['lemmatized'].apply(
    lambda x: [word for word in x if word not in stop])

train['lemma_str'] = [' '.join(map(str, l)) for l in train['lemmatized']]

Now that data cleaning is complete, let's look at the most common words again.

In [None]:
lis = [
    train[train['target'] == 0]['lemma_str'],
    train[train['target'] == 1]['lemma_str']
]

In [None]:
cnt = 1
fig = make_subplots(rows=1, cols=2)
colors = ['darkturquoise', 'deepskyblue']
for i, j in zip(lis, colors):
    
    new = i.str.split()
    new = new.values.tolist()
    corpus = [word for i in new for word in i]
    
    counter = Counter(corpus)
    most = counter.most_common()
    x, y = [], []
    for word, count in most[:30]:
        if (word not in stop):
            x.append(word)
            y.append(count)

    fig.add_trace(go.Bar(x=x, y=y, marker_color=colors[cnt-1]),
                  row=1, col=cnt)
    cnt+=1
    
fig.update_layout(
    title="Most Common Unigrams", title_x=0.5,
    xaxis_title="Word",
    yaxis_title="Count",
)
fig.show()

In [None]:
def ngrams(n, title):
    """A Function to plot most common ngrams"""
    cnt = 1
    fig = make_subplots(rows=1, cols=2)
    colors = ['darkturquoise', 'deepskyblue']
    for i, j in zip(lis, colors):

        new = i.str.split()
        new = new.values.tolist()
        corpus = [word for i in new for word in i]

        def _get_top_ngram(corpus, n=None):
            #getting top ngrams
            vec = CountVectorizer(ngram_range=(n, n),
                                  max_df=0.9,
                                  stop_words='english').fit(corpus)
            bag_of_words = vec.transform(corpus)
            sum_words = bag_of_words.sum(axis=0)
            words_freq = [(word, sum_words[0, idx])
                          for word, idx in vec.vocabulary_.items()]
            words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
            return words_freq[:15]

        top_n_bigrams = _get_top_ngram(i, n)[:15]
        x, y = map(list, zip(*top_n_bigrams))
        fig.add_trace(go.Bar(x=x, y=y, marker_color=colors[cnt-1]),
                  row=1, col=cnt)
        cnt+=1
        
    fig.update_layout(
        title=title, title_x=0.5,  # use the title passed in instead of a hard-coded one
        xaxis_title="Word",
        yaxis_title="Count",
    )
    fig.show()

In [None]:
ngrams(2, 'Most Common Bigrams')

In [None]:
ngrams(3, 'Most Common Trigrams')

## WordCloud

In [None]:
def plotly_wordcloud(text, title):
    wc = WordCloud(stopwords = set(STOPWORDS),
                   max_words = 200,
                   max_font_size = 100)
    wc.generate(text)
    
    word_list=[]
    freq_list=[]
    fontsize_list=[]
    position_list=[]
    orientation_list=[]
    color_list=[]

    for (word, freq), fontsize, position, orientation, color in wc.layout_:
        word_list.append(word)
        freq_list.append(freq)
        fontsize_list.append(fontsize)
        position_list.append(position)
        orientation_list.append(orientation)
        color_list.append(color)
        
    # get the positions
    x=[]
    y=[]
    for i in position_list:
        x.append(i[0])
        y.append(i[1])
            
    # get the relative occurrence frequencies, scaled up for use as font sizes
    new_freq_list = [freq * 100 for freq in freq_list]
    
    trace = go.Scatter(x=x, 
                       y=y, 
                       textfont = dict(size=new_freq_list,
                                       color=color_list),
                       hoverinfo='text',
                       hovertext=['{0}{1}'.format(w, f) for w, f in zip(word_list, freq_list)],
                       mode='text',  
                       text=word_list
                      )
    
    layout = go.Layout({'xaxis': {'showgrid': False, 'showticklabels': False, 'zeroline': False},
                        'yaxis': {'showgrid': False, 'showticklabels': False, 'zeroline': False}})
    
    fig = go.Figure(data=[trace], layout=layout)
    
    fig.update_layout(
        title=title, title_x=0.5,
    )
    
    return fig

In [None]:
text = ' '.join(train[train["target"]==1]['lemma_str'])
plotly_wordcloud(text, 'Disaster Tweets')

In [None]:
text = ' '.join(train[train["target"]==0]['lemma_str'])
plotly_wordcloud(text, 'Non Disaster Tweets')

## Named Entity Recognition Analysis

The NER results give us several useful insights. In disaster tweets, countries, cities and states (spaCy's `GPE` label) are much more common than in non-disaster tweets. Nationalities and religious or political groups (`NORP`) are also more likely to be mentioned in disaster tweets. These are useful indicators for us.
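
Because the two classes contain different numbers of tweets, raw entity counts are not directly comparable. One option is to divide by the number of tweets in each class. The sketch below assumes the entity labels have already been extracted per tweet; `entity_rate` is a hypothetical helper and the label lists are made up for illustration.

```python
from collections import Counter

def entity_rate(labels_per_tweet):
    """Average number of entities of each label per tweet."""
    n_tweets = len(labels_per_tweet)
    if n_tweets == 0:
        return {}
    counts = Counter(label for labels in labels_per_tweet for label in labels)
    return {label: n / n_tweets for label, n in counts.items()}

# Hypothetical entity labels per tweet, made up for illustration.
toy_not_disaster = [["PERSON"], [], ["ORG"], []]
toy_disaster = [["GPE", "NORP"], ["GPE"]]

rates_not_disaster = entity_rate(toy_not_disaster)
rates_disaster = entity_rate(toy_disaster)
print("Not Disaster:", rates_not_disaster)
print("Disaster:", rates_disaster)
```

Plotting these rates instead of the counts would make the two subplots comparable on the same scale.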

In [None]:
# Loading NER.
nlp = en_core_web_sm.load() 

def plot_named_entity_barchart(text, fig, col, color, top_n=15):
    """Extract named-entity labels from a Series of texts and add them to `fig` as a bar trace."""

    def _get_ner(text):
        doc = nlp(text)
        return [X.label_ for X in doc.ents]

    ent = text.apply(_get_ner)
    ent = [x for sub in ent for x in sub]
    count = Counter(ent).most_common(top_n)

    x, y = map(list, zip(*count))

    fig.add_trace(go.Bar(x=x, y=y, marker_color=color), row=1, col=col)

a = 'Non Disaster Tweets'
b = 'Disaster Tweets'
fig = make_subplots(rows=1, cols=2, subplot_titles=[a, b])
colors = ['darkorange', 'cadetblue']

# One subplot per class: column 1 is non disaster, column 2 is disaster.
for col, (texts, color) in enumerate(zip(lis, colors), start=1):
    plot_named_entity_barchart(texts, fig, col, color)

fig.update_layout(
    title="Common Named-Entity Counts", title_x=0.5,
    xaxis_title="Entity Label",
    yaxis_title="Count",
)

fig.show()

If you learned something or found this interesting, please upvote 👍