# Introduction

In this notebook I explore the Disaster Tweets competition dataset and use scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert a collection of texts (here, tweets) into a matrix of token counts.
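
To make "a matrix of token counts" concrete, here is a minimal pure-Python sketch of the idea. The two example texts are hypothetical, and the whitespace tokenisation is a simplifying assumption; scikit-learn's own tokenisation and output format (a sparse matrix) differ.

```python
from collections import Counter

# Hypothetical example texts, not taken from the competition data.
docs = [
    "forest fire near the town",
    "the fire alarm went off",
]

# Naive tokenisation (an assumption): lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens; each token becomes a column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each row holds the count of every vocabulary token in one document.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one text and each column to one vocabulary token, which is the same layout that the vectorizer produces later in the notebook.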

Submissions are evaluated using [F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) between the predicted and expected answers.

F1 is calculated as follows:
$$\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where:
$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

$$\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

and:
- True Positive [TP] = prediction is 1, and the ground truth is also 1
- False Positive [FP] = prediction is 1, and the ground truth is 0
- False Negative [FN] = prediction is 0, and the ground truth is 1
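
The definitions above can be checked by hand on a small example. This is a sketch in plain Python; the labels are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than something the competition specifies.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for binary labels from TP, FP and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Assumed convention: treat an undefined ratio as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions: TP = 3, FP = 1, FN = 1.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

score = f1_from_labels(y_true, y_pred)
print(score)
```

Here precision and recall are both 3/4, so F1 is also 0.75. True negatives do not appear anywhere in the formula.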

In [None]:
import numpy as np
import pandas as pd
import re
from textblob import TextBlob
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model, model_selection
import warnings
warnings.filterwarnings('ignore')

In [None]:
df_train = pd.read_csv('../input/nlp-getting-started/train.csv')
df_train.head(20)

### Class distribution

In [None]:
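# Pass the column by keyword; recent seaborn versions no longer accept it positionally as x.
sns.countplot(x='target', data=df_train)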
df_train['target'].value_counts() / len(df_train)

About 43% of the tweets are about disasters; the other 57% are not.
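
This class balance also gives a simple reference point for the F1 metric. As a rough sketch, assuming the rounded 43% share holds, a trivial model that labels every tweet as a disaster would score as follows:

```python
# Assumption: the positive-class share is the rounded 43% reported in this notebook.
positive_rate = 0.43

# Predicting 1 for every tweet finds every true disaster tweet, so recall is 1,
# and precision equals the share of positives in the data.
precision = positive_rate
recall = 1.0

baseline_f1 = 2 * precision * recall / (precision + recall)
print(round(baseline_f1, 3))
```

This comes out at roughly 0.60, so cross-validated F1 scores are only meaningful to the extent that they exceed that value.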

There are several ways to extract numerical data from the 'text' field. These include:
- length of the tweet
- the number of words and [stop words](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words)
- [polarity and subjectivity](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis)

and a lot more, but I will get started with just these for now.

In [None]:
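# Collect the raw tweet texts in a plain Python list.
tweets = df_train['text'].tolist()

num_words = []
num_stop_words = []
polarity = []
subjectivity = []

for tweet in tweets:
    # Whitespace-separated tokens, including hashtags, mentions and URLs.
    num_words.append(len(tweet.split()))
    # Count lowercased tokens that appear in wordcloud's STOPWORDS set.
    num_stop_words.append(sum(word in STOPWORDS for word in tweet.lower().split()))
    # TextBlob reports polarity in [-1, 1] and subjectivity in [0, 1].
    sentiment = TextBlob(tweet).sentiment
    polarity.append(sentiment.polarity)
    subjectivity.append(sentiment.subjectivity)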

df_train['length'] = df_train['text'].str.len()
df_train['num_words'] = num_words
df_train['num_stop_words'] = num_stop_words
df_train['polarity'] = polarity
df_train['subjectivity'] = subjectivity

In [None]:
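# Select several columns with a list (double brackets); a bare tuple of names is not supported by current pandas.
df_train.groupby('target')[['length', 'num_words', 'num_stop_words', 'polarity', 'subjectivity']].mean()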

On average, disaster tweets are longer and contain fewer stop words. The differences in polarity and subjectivity are less pronounced, but our model may still be able to use them.

If we join the tweets of each class into a single string, we can build a word cloud of the most frequently used words. Let's do that to see if anything stands out.

In [None]:
regular_tweets = df_train[df_train['target'] == 0]['text'].to_list()
disaster_tweets = df_train[df_train['target'] == 1]['text'].to_list()

joined_regular_tweets = ' '.join(regular_tweets)
joined_disaster_tweets = ' '.join(disaster_tweets)

regular_cloud = WordCloud().generate(joined_regular_tweets)
disaster_cloud = WordCloud().generate(joined_disaster_tweets)

In [None]:
fig = plt.figure(figsize=(16, 12))

fig.add_subplot(221)
plt.title('Regular tweets')
plt.imshow(regular_cloud)

fig.add_subplot(222)
plt.title('Disaster tweets')
plt.imshow(disaster_cloud)

A lot of text cleaning will be needed.

### Work in progress

In [None]:
vect = CountVectorizer(max_features=100, ngram_range=(1, 3))
vect.fit(tweets)

bow = vect.transform(tweets)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
X = pd.DataFrame(bow.toarray(), columns=vect.get_feature_names_out())
print(X.head())
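
The `ngram_range=(1, 3)` setting means the columns are not only single words but also two-word and three-word sequences, and `max_features=100` keeps only the 100 most frequent of them across the corpus. The sketch below illustrates what word n-grams of length 1 to 3 look like; `word_ngrams` is a hypothetical helper written for this illustration, not a scikit-learn function, and the example text is made up.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Hypothetical example text, split on whitespace for simplicity.
tokens = "forest fire near town".split()
grams = word_ngrams(tokens)
print(grams)
```

A four-word text yields four unigrams, three bigrams and two trigrams, which is why a cap such as `max_features` is useful once n-grams are enabled.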

In [None]:
train_vectors = vect.fit_transform(df_train['text'])

clf = linear_model.RidgeClassifier()

scores = model_selection.cross_val_score(clf, train_vectors, df_train['target'], cv=5, scoring='f1')
scores