In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud, STOPWORDS
from nltk.tokenize import word_tokenize
from IPython.display import Image, display

In [None]:
plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'
# sns.set(font_scale=1.1)

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')

In [None]:
train.shape

In [None]:
train.head()

Most data science and machine learning work should start with gathering, cleaning and analyzing the data rather than jumping straight to model building, especially when the goal is learning. Here the most difficult part, collecting the data, was already handled by Kaggle. So we start with a proper EDA, and every step of it should be driven by a question about the data and answered with the actual objective in mind. Our objective is to tell whether a tweet is really about a disaster or only uses disaster words metaphorically. So let's start with our questions.

## Do people talk about disasters metaphorically more often?

In [None]:
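def without_hue(plot, feature):
    # Annotate each bar of a count plot with its share of all rows
    total = len(feature)
    # Use the axes passed in as `plot` instead of relying on a global `ax`
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x, y+30), size = 12)
    plt.show()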

In [None]:
plt.figure(figsize=(8, 5))
plt.title('Distribution of class labels')
ax = sns.countplot(x='target', data=train);
without_hue(ax, train.target)

#### Conclusions
1. It seems the share of metaphorical tweets is almost 14 percentage points higher than the share of real disaster tweets.
2. The gap is still not large, which indicates that many people also use Twitter to report what is actually happening around them.
3. If class imbalance becomes an issue, we can use a library like [nlpAug](https://github.com/makcedward/nlpaug) to augment the minority class.
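
The class split behind these conclusions can also be read off numerically instead of from the bar annotations. The following is a minimal sketch, assuming a 0/1 `target` column as in `train`; the small `demo` frame is made up for illustration and is not the competition data.

```python
import pandas as pd

# Hypothetical toy labels; in the notebook this would be train['target'].
demo = pd.DataFrame({'target': [0, 0, 0, 0, 1, 1, 1]})

# Share of each class as a percentage of all rows.
class_share = demo['target'].value_counts(normalize=True).mul(100).round(1)

# Gap between the two classes in percentage points.
gap = class_share.loc[0] - class_share.loc[1]
print(class_share)
print(f'Gap: {gap:.1f} percentage points')
```

Reporting the gap in percentage points avoids confusing it with a relative increase of one class over the other.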

## Does location play any role for disasters?

This can matter because some locations are more prone to natural disasters than others; Japan, for example, suffers a lot of earthquakes. So let's see whether location is informative.

In [None]:
def plot_most_common(data, n, column, filter_by_class=True, **kwargs):
    # Remove any NaN values in the column
    d = data[~data[column].isna()]
    if kwargs.get('make_copy', True):
        data_f = d.copy()
    else:
        data_f = d
    target_col = kwargs['target']
    if filter_by_class:
        # Plot for a particular class
        target_val = kwargs['target_val']
        data_f = data_f[data_f[target_col] == target_val]
        c = Counter(data_f[column])
        plot_data = pd.DataFrame(c.most_common(n))
        plot_data.columns = [column, 'count']
        ax = sns.barplot(x='count', y=column, data=plot_data, 
                         palette=kwargs.get('palette', 'Blues_r'))
        plt.title(kwargs.get('title'))
        return ax
    else:
        # Plot for both the classes combined
        both_target = kwargs.get('both_target', [0, 1])
        data_f_0 = data_f[data_f[target_col] == both_target[0]]
        data_f_1 = data_f[data_f[target_col] == both_target[1]]
        c1 = Counter(data_f_0[column])
        c2 = Counter(data_f_1[column])
        plot_data = pd.DataFrame([c1, c2]).T
        if kwargs['handle_singular'] == 'drop':
            plot_data = plot_data.dropna().reset_index()
        elif kwargs['handle_singular'] == 'fill':
            plot_data = plot_data.fillna(value=0).reset_index()
        
        plot_data.columns = [column, both_target[0], both_target[1]]
        # Sort by the column names just assigned, so custom both_target values also work
        plot_data = plot_data.sort_values(by=[both_target[1], both_target[0]],
                                          ascending=[False, True]).head(n)
        out = pd.melt(plot_data, id_vars=column, var_name='class', value_name='count')
        out['count'] = out['count'].astype(int)
        # Plot the requested column rather than a hard-coded 'keyword'
        ax = sns.barplot(y=column, x='count', hue='class', 
                         data=out, palette=['tomato', 'cornflowerblue'])
        plt.title(kwargs.get('title'))
        return ax

sns.set(font_scale=1.8)
plt.figure(figsize=(28, 16))
plt.subplot(121)
plot_most_common(train, 10, 'location', target='target', target_val=0, 
                 title='Top locations for metaphors')
plt.subplot(122)
plot_most_common(train, 10, 'location', target='target', target_val=1, palette='Reds_r', 
                 title='Top locations for real disasters')
plt.show()

#### Conclusions

1. Location could have played a very important role, but the values are free text, random and unstructured. Hence I think location would not be a very efficient predictor.
2. Some locations appear in several variations, for example **USA**, **US** and **United States**. Merging these and reproducing the plot might reveal more information.
3. However, if the other features do not perform well, we can use this one to try to boost the validation score during modeling.
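
Merging spelling variants such as **USA**, **US** and **United States** could be sketched as below. This is only an illustration: the `aliases` table and the `normalize_location` helper are hypothetical, and the sample strings are made up rather than taken from `train['location']`.

```python
import pandas as pd

# Hypothetical location strings; in the notebook this would be train['location'].
locations = pd.Series(['USA', 'US', 'United States', 'usa ', 'Japan', None])

# Assumed alias table: extend it after inspecting the real value counts.
aliases = {'us': 'USA', 'usa': 'USA', 'united states': 'USA'}

def normalize_location(value):
    """Map known spelling variants to one canonical label."""
    if pd.isna(value):
        return value
    key = value.strip().lower()
    return aliases.get(key, value.strip())

normalized = locations.map(normalize_location)
location_counts = normalized.value_counts()
print(location_counts)
```

A hand-written alias table only covers the variants someone has noticed, so it is best treated as a first pass before re-running the location plot.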

## What role does the keyword play?

Keywords are less straightforward than locations. A keyword is not tied to either class with certainty. For example, the keyword **blast** can be positive or negative, and it may or may not relate to a disaster.

In [None]:
plt.figure(figsize=(28, 16), dpi=100)
plt.subplot(121)
sns.set(font_scale=1.5)
plot_most_common(train, 20, 'keyword', target='target', target_val=0, 
                 title='Top keywords for metaphors')
plt.subplot(122)
plot_most_common(train, 20, 'keyword', target='target', target_val=1, palette='Reds_r', 
                 title='Top keywords for real disasters')
plt.show()

We can see a lot of keywords like 

 - Crush
 - Wrecked
 - Explode

are used as metaphors even though they could easily be associated with actual disasters. On reflection this makes sense, because in everyday language these words are used in many non-literal ways.

It also seems that many keywords overlap between the classes. For a keyword that appears with both labels, we may want to see where it is more prevalent. For example, **damage** can describe a person or situation metaphorically, or it can be meant literally. Let's explore that. If many such keywords exist, then **keyword** alone won't be a good feature for predicting the labels.

In [None]:
sns.set(font_scale=1)
plt.figure(figsize=(8, 40), dpi=100)
g = plot_most_common(train, 221, 'keyword', False, target='target', handle_singular='fill',
                    title='Top 80 overlapping keywords');
g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()

### Conclusions
1. We can see a lot of overlap even among just the top 80 keywords.
2. This indicates that building a rule-based system is not easy.
3. It also indicates that feature engineering on this column may not help much.
4. The feature can still be useful to some extent.
5. Apart from only 2 keywords, almost all keywords seem to be associated with both class labels, so there is clearly no clean separation. This suggests that a simple model, such as a linear model using only the **keyword** feature, cannot predict the class labels well.
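
The overlap can also be quantified per keyword instead of being read from the chart. The sketch below computes, for each keyword, how many tweets use it and what share of them are labelled as real disasters. The rows in `demo` are invented for illustration, and `keyword_stats` and `one_sided` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be train[['keyword', 'target']].
demo = pd.DataFrame({
    'keyword': ['damage', 'damage', 'damage', 'damage',
                'wreckage', 'wreckage', 'crush', 'crush'],
    'target':  [1, 0, 1, 0, 1, 1, 0, 0],
})

# Per keyword: number of tweets and share labelled as real disasters.
keyword_stats = (
    demo.groupby('keyword')['target']
        .agg(count='size', disaster_rate='mean')
        .sort_values('disaster_rate', ascending=False)
)

# Keywords seen in only one class have a rate of exactly 0 or 1.
one_sided = keyword_stats[keyword_stats['disaster_rate'].isin([0, 1])]
print(keyword_stats)
print(list(one_sided.index))
```

A rate close to 0.5 marks a keyword that is ambiguous on its own, while rates near 0 or 1 mark the few keywords that separate the classes.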

## Do people elaborate more on real disasters?

It may be human nature to write more in metaphorical statements than in tweets about real disasters. Usually, when people tweet about a disaster, they are not in the mood to be poetic; they focus on the real incident. Sometimes they may share news clips and other material instead of writing more about it. Let's verify these claims. One nice way to do that is to compare the distribution of tweet length for the two class labels and see whether it reveals anything.

In [None]:
train['tweet_len'] = train['text'].str.len()

In [None]:
train.head()

In [None]:
plt.figure(figsize=(10, 6))
sns.distplot(train[train['target'] == 1]['tweet_len'] , 
             label = 'Disaster', color = 'red')
sns.distplot(train[train['target'] == 0]['tweet_len'], 
             label = 'Metaphor' , color = 'blue')
plt.suptitle('How tweet length impacts the class label', fontsize=20)
plt.legend()
plt.show()

### Conclusion

1. **Alas!** The tweet length distributions of the two classes overlap almost completely.
2. We might have to explore some other features to see whether we can find a trend.

One idea is to check the most frequently occurring words in the tweet text itself and see whether we can find any particular pattern.
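
Before drawing a word cloud, the most frequent words per class can be counted directly. This is a minimal sketch using a simple regular-expression tokenizer instead of `word_tokenize`; the tweets, the tiny stopword set and the `top_words` helper are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical tweets with labels; in the notebook these come from train.
tweets = [
    ('Forest fire near the town, residents evacuated', 1),
    ('That concert was fire, my mind is blown', 0),
    ('Fire crews battle the blaze http://example.com', 1),
]

# Tiny illustrative stopword list; the notebook imports STOPWORDS from wordcloud.
stopwords = {'the', 'was', 'my', 'is', 'that', 'near'}

def top_words(texts, n=3):
    """Count lowercase alphabetic tokens, ignoring stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r'[a-z]+', text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

disaster_words = top_words(t for t, label in tweets if label == 1)
other_words = top_words(t for t, label in tweets if label == 0)
print(disaster_words)
print(other_words)
```

Note that this tokenizer splits URLs into pieces such as `http`, which is one way a token like that can end up dominating a word cloud.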

In [None]:
# " ".join(word_tokenize(train['text'][0].lower()))

In [None]:
def plot_wordcloud(data, col, background_color, 
                   colormap, remove_stopwords=True,
                  split_class=False, **kwargs):
    
    text = data[col].apply(lambda x: " ".join(word_tokenize(x.lower())))
    text = " ".join(text)
    if remove_stopwords:
        stopwords = STOPWORDS
    else:
        stopwords = []
    wordcloud = WordCloud(width=3000, height=2000, random_state=42, 
                          background_color=background_color, 
                          colormap=colormap, collocations=False, 
                          stopwords=stopwords).generate(text)
    
    plt.figure(figsize=(20, 15))
    plt.imshow(wordcloud)
    plt.axis("off");
    plt.show()
    return text

In [None]:
# text = plot_wordcloud(train, 'text', 'black', 'Set2')
display(Image(filename='../input/wordcloud2/wordcloud.png'))

> I could have used my own code to display the word cloud, but [this](https://monkeylearn.com/word-cloud/) looked much more polished and it performs some preprocessing too. Hence I am using it.

### Conclusions

1. We can see very clearly that two words pop out heavily: **http** and **building**.
2. This suggests two things: first, building-related disasters may be more common than others; second, a lot of people share links in their tweets.
3. It would be useful to see whether links are shared more in class 0 or in class 1 tweets.
4. People also mention **YouTube** videos in their tweets.
5. Of all the countries in the world, **Japan** is the most prominent, which is plausible given the disasters the country has suffered in the past.

## Who loves to use web links in their tweets?

In [None]:
!pip install urlextract

In [None]:
from urlextract import URLExtract
extractor = URLExtract()

In [None]:
train['has_url'] = train.text.apply(lambda x: extractor.has_urls(x))
train.head()

In [None]:
sns.countplot(x='target', hue='has_url', data=train);

### Conclusion

1. It is interesting that class 1 tweets contain a URL more often than class 0 tweets. This suggests that a tweet with a URL is more likely to be about a real disaster than a metaphor.
2. This raises the question of how the links shared in disaster tweets differ from the links shared in metaphorical tweets.
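
The count plot can be backed by a per-class share of tweets that contain a URL. The sketch below is illustrative only: it assumes boolean `has_url` and 0/1 `target` columns like the ones created above, and the values in `demo` are invented.

```python
import pandas as pd

# Hypothetical flags; in the notebook these would be train['target'] and train['has_url'].
demo = pd.DataFrame({
    'target':  [0, 0, 0, 0, 1, 1, 1, 1],
    'has_url': [True, False, False, False, True, True, True, False],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of tweets with a URL within each class.
url_share = demo.groupby('target')['has_url'].mean()
print(url_share)
```

Comparing shares rather than raw counts keeps the comparison fair when the two classes differ in size.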

### Difference in links in these two groups

In [None]:
train_with_links = train[train.has_url]

In [None]:
train_with_links_class_0 = train_with_links[train_with_links.target == 0].copy()
train_with_links_class_1 = train_with_links[train_with_links.target == 1].copy()

In [None]:
def remove_urls(text):
    urls = extractor.find_urls(text)
    for url in urls:
        text = text.replace(url, "")
    return text

In [None]:
train_with_links_class_0['text_no_url'] = train_with_links_class_0.text.apply(remove_urls)
train_with_links_class_1['text_no_url'] = train_with_links_class_1.text.apply(remove_urls)

### Metaphorical Tweets

In [None]:
display(Image('../input/wordcloud2/wordcloud0.png', retina=True))

### Actual Disaster

In [None]:
display(Image('../input/wordcloud2/wordcloud1.png', retina=True))

#### Conclusion

1. We can see some startling differences between the word clouds. This indicates that the words in the raw tweet text can be useful.
2. When talking about non-disasters, a lot of people refer to Hollywood movies and YouTube videos.
3. When talking about actual disasters, people tend to mention words such as Hiroshima and disaster.
4. I wonder what the primary domain names of the links are in the two categories.

## Which domains do these two groups link to?

In [None]:
!pip install Pandarallel

In [None]:
import requests
from pandarallel import pandarallel

In [None]:
pandarallel.initialize(progress_bar=True)

In [None]:
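def extract_urls(text):
    # Resolve each URL found in the text to its final address after redirects
    urls = extractor.find_urls(text)
    u = []
    
    for url in urls:
        try:
            # requests follows redirects by default; r.url is the final URL
            r = requests.get(url, timeout=10)
            u.append(r.url)
        except Exception:
            # Keep the original URL if the request fails for any reason
            u.append(url)
    return u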

In [None]:
train_with_links_class_0['urls'] = train_with_links_class_0.text.parallel_apply(extract_urls)

In [None]:
train_with_links_class_1['urls'] = train_with_links_class_1.text.parallel_apply(extract_urls)

In [None]:
newTrain = pd.concat([train_with_links_class_0, train_with_links_class_1])
newTrain.to_csv('newTrain.csv', index=False)

Now that we have this data, let's compare the domains shared by the two groups in a word cloud.
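
One possible way to get from resolved URLs to domains is the standard library's `urllib.parse`. The sketch below is an assumption about how this step could look, not the notebook's own code: the URLs are invented, and `domain_of` and `domain_counts` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical resolved URLs; in the notebook these would come from the 'urls' column.
urls_by_class = {
    0: ['https://www.youtube.com/watch?v=abc',
        'https://youtu.be/xyz',
        'https://www.youtube.com/watch?v=def'],
    1: ['https://www.bbc.co.uk/news/world',
        'http://www.bbc.co.uk/news/uk',
        'https://twitter.com/user/status/1'],
}

def domain_of(url):
    """Return the host name without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith('www.') else host

# One Counter of domains per class label.
domain_counts = {
    label: Counter(domain_of(u) for u in urls)
    for label, urls in urls_by_class.items()
}
for label, counts in domain_counts.items():
    print(label, counts.most_common(2))
```

The resulting counts could then be passed to `WordCloud.generate_from_frequencies` to draw one cloud per class.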

# Stay tuned for more

### Do people only talk about disasters when it comes to Japan?

### When do people talk about buildings?