## NLP Tutorial

NLP - or *Natural Language Processing* - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

In [2]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")  # this is the submission set

In [None]:
train_df

### A quick look at our data

Let's look at our data... first, an example of what is NOT a disaster tweet.

In [None]:
train_df[train_df["target"] == 0]["text"].values[1]

And one that is:

In [None]:
train_df[train_df["target"] == 1]["text"].values[1]

### Building vectors

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether it's about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process.

Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()
## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [None]:
train_df["text"][0]

In [None]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors.todense())
# example_train_vectors.toarray()
## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vectorizer.get_feature_names_out()

# Model using frequency vectors - testing different CV strategies

The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.

Now let's create vectors for all of our tweets.

### Our model

As we mentioned above, we think the words contained in each tweet are a good indicator of whether it's about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a _linear_ connection. So let's build a linear model and see!

In [None]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])
## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])
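The difference between `.fit_transform()` and `.transform()` can be seen on a tiny example. This is a sketch with invented sentences (assumptions, not competition data): a token that only appears in the "test" text gets no column, so both matrices share the same width.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts standing in for train_df["text"] and test_df["text"].
train_texts = ["flood warning issued", "warning lifted"]
test_texts = ["flood sirens warning"]  # "sirens" never appears in train_texts

vectorizer = CountVectorizer()
train_vecs = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_vecs = vectorizer.transform(test_texts)        # reuses it unchanged

# Same number of columns, and the unseen token is silently ignored.
print(train_vecs.shape, test_vecs.shape)
print("sirens" in vectorizer.vocabulary_)
```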

In [257]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
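How strongly the weights are pushed toward 0 is controlled by `RidgeClassifier`'s `alpha` parameter (1.0 by default). The sketch below uses synthetic data from `make_classification` rather than the tweet vectors, and the two `alpha` values are arbitrary choices for illustration. It compares the overall size of the learned weights under a weak and a strong penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic stand-in for the tweet vectors (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

weak = RidgeClassifier(alpha=0.1).fit(X, y)      # light penalty
strong = RidgeClassifier(alpha=100.0).fit(X, y)  # heavy penalty

# A larger alpha shrinks the weight vector as a whole.
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print("weight norm, alpha=0.1:", weak_norm)
print("weight norm, alpha=100:", strong_norm)
```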

In [265]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold

# In this exercise I tried 1) the usual train/test split (which shuffles by default), 2) ShuffleSplit, and 3) StratifiedKFold with and without shuffling.
# Without shuffling, the model scores worse than with shuffling.
# cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
# One strategy has to be active, because `cv` is passed to cross_val_score below.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

Let's test our model and see how well it does on the training data. For this we'll use cross-validation, where we train on a portion of the known data, then validate on the rest. If we do this several times (with different portions) we can get a good idea of how a particular model or method performs.
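To see what "different portions" means in practice, here is a small sketch that walks through the folds by hand. The nine toy labels are an assumption for illustration. The splitter settings mirror the `StratifiedKFold` configuration that appears in this notebook's comments.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: six negatives and three positives (hypothetical data).
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

fold_summaries = []
for train_idx, val_idx in cv.split(X, y):
    # Each round trains on train_idx and validates on the held-out val_idx.
    fold_summaries.append((len(train_idx), len(val_idx), int(y[val_idx].sum())))

# (train size, validation size, positives in validation) for each fold
print(fold_summaries)
```

Stratification keeps the share of positives the same in every validation fold, which matters when the classes are not balanced.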

The metric for this competition is F1, so let's use that here.
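F1 is the harmonic mean of precision and recall. As a quick check on made-up labels (the two lists below are assumptions for illustration), the value from `f1_score` matches the formula `2 * precision * recall / (precision + recall)`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 real disasters, 4 non-disasters.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 2 of 3 predicted positives are correct
recall = recall_score(y_true, y_pred)        # 2 of 4 actual positives are found
manual_f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, manual_f1, f1_score(y_true, y_pred))
```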

In [266]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=cv, scoring="f1")
# scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.72340426, 0.72647489, 0.7322298 ])

The above scores aren't terrible! It looks like our assumption will score roughly 0.65 on the leaderboard. There are lots of ways to potentially improve on this (TF-IDF, LSA, LSTMs / RNNs, the list is long!) - give any of them a shot!
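As one example of those options, TF-IDF can be tried by swapping `CountVectorizer` for scikit-learn's `TfidfVectorizer`. This is only a sketch on invented sentences (assumptions, not competition data), and it is not a method used elsewhere in this notebook. With the default settings each row is scaled to unit length, so long and short texts become comparable.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts standing in for tweets.
toy_texts = [
    "forest fire spreading fast",
    "traffic is on fire today",
    "calm day in the forest",
]

tfidf = TfidfVectorizer()
toy_tfidf = tfidf.fit_transform(toy_texts)

# Default norm="l2": every non-empty row has Euclidean length 1.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(toy_tfidf.shape, row_norms)
```

In the notebook's pipeline, the same `fit` on training text and `transform` on held-out text pattern would apply.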

In the meantime, let's fit the model on the full training set, make predictions on the test set, and build a submission for the competition.

In [70]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier()

In [None]:
test_target = clf.predict(test_vectors)


In [None]:
# sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
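The predictions in `test_target` still need to be written to a file before anything can be submitted. A minimal sketch follows, assuming the competition expects a CSV with an `id` column and a `target` column. The layout of `sample_submission.csv` is not shown in this notebook, so treat the column names and the `submission.csv` file name mentioned below as assumptions. The toy lists stand in for `test_df["id"]` and `test_target`.

```python
import io
import pandas as pd

# Hypothetical stand-ins for test_df["id"] and clf.predict(test_vectors).
toy_ids = [0, 2, 3]
toy_predictions = [1, 0, 1]

# One row per test tweet: its id and the predicted label.
submission = pd.DataFrame({"id": toy_ids, "target": toy_predictions})

# Written to an in-memory buffer here so the sketch has no side effects.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

In the notebook itself, the equivalent would be along the lines of `submission.to_csv("submission.csv", index=False)`.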

# model for using frequency vector; first split

In [3]:
df_train, df_cv = train_test_split(train_df, test_size=0.3, random_state=10)
df_train.reset_index(inplace=True, drop=True)
df_cv.reset_index(inplace=True, drop=True)

# train_df.shape
# ## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# # that the tokens in the train vectors are the only ones mapped to the test vectors - 
# # i.e. that the train and test vectors use the same set of tokens.
# test_vectors = count_vectorizer.transform(test_df["text"])

In [4]:
# df_train.shape
# df_cv.head()
df_cv.tail()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2279 | 6126 | hellfire | Silvermoon or Ironforge | Fel Lord Zakuun is about to DIE ! #Hellfire #W... | 1 |
| 2280 | 2739 | crushed | Pennsylvania | Nick Williams just hit another bomb. Just crus... | 0 |
| 2281 | 8665 | sinkhole | Êwagger!ÌominicanÌ÷ | #LoMasVisto THOUSANDS OF HIPSTERS FEARED LOST:... | 1 |
| 2282 | 3184 | deluge | 617-BTOWN-BEATDOWN | Photo: boyhaus: Heaven sent by JakeåÊ ÛÏAfter... | 0 |
| 2283 | 7481 | obliteration | Illinois | Which is true to an extent. The obliteration o... | 0 |


In [5]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [6]:
# count_vectorizer = feature_extraction.text.CountVectorizer()
# count_vectorizer.fit(train_df["text"])
count_vectorizer.fit(df_train['text'])
train_vectors = count_vectorizer.transform(df_train['text'])
cv_vectors = count_vectorizer.transform(df_cv['text'])
# test_vectors = 

In [7]:
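clf = linear_model.RidgeClassifier()

# fit on the training split only, then predict on both splits
clf.fit(train_vectors, df_train["target"])
cv_target = clf.predict(cv_vectors)        # held-out predictions
train_target = clf.predict(train_vectors)  # in-sample predictions, for comparison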

In [8]:
df_cv_predicted = pd.DataFrame({'predicted': cv_target})
df_cv = pd.concat((df_cv, df_cv_predicted), axis='columns')

In [9]:
df_cv[df_cv['predicted'] == 1][['target', 'predicted']].sum()  # precision is 668 / 855, about 78%

# this looks good. But why is ridge 
# df_cv[(df_cv['predicted'] == 1) & (df_cv['target'] == 1)].sum()

target       668
predicted    855
dtype: int64

In [10]:
# scores on the training split are much better than on the held-out split, a sign of overfitting
print('precision: ' +  str(precision_score(y_pred=train_target, y_true=df_train['target'])))
print('recall: ' + str(recall_score(y_pred=train_target, y_true=df_train['target'])))
print('f1: ' +  str(f1_score(y_pred=train_target, y_true=df_train['target'])))

precision: 0.998254037538193
recall: 0.9934839270199827
f1: 0.9958632701937732


In [12]:
print('precision: ' +  str(precision_score(y_pred=cv_target, y_true=df_cv['target'])))
print('recall: ' + str(recall_score(y_pred=cv_target, y_true=df_cv['target'])))
print('f1: ' +  str(f1_score(y_pred=cv_target, y_true=df_cv['target'])))
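# note: cv_target holds hard 0/1 predictions, so this AUC is computed from labels, not from continuous scores
print('auc: ' +  str(roc_auc_score(y_score=cv_target, y_true=df_cv['target'])))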


# print('recall': str(recall_score(y_pred=cv_target, y_true=df_cv['target'])))

# df_result.loc[i, 'train_recall'] = recall_score(y_pred=df_train['predicted_class'], y_true=df_train[y_var])
# df_result.loc[i, 'train_accuracy'] = accuracy_score(y_pred=df_train['predicted_class'], y_true=df_train[y_var])
# df_result.loc[i, 'train_auc'] = roc_auc_score(y_true=df_train[y_var], y_score=df_train['predicted_proba'])
# df_result.loc[i, 'train_f1_score'] = f1_score(y_pred=df_train['predicted_class'], y_true=df_train[y_var])

precision: 0.7812865497076024
recall: 0.6893704850361198
f1: 0.7324561403508772
auc: 0.7735825809211018
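The AUC above is computed from hard 0/1 predictions. ROC AUC is usually computed from continuous scores, and `RidgeClassifier` exposes these through `decision_function`. The sketch below uses synthetic data from `make_classification` (an assumption, not the tweet vectors) to show both variants side by side. It does not claim that one will always be higher than the other.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized tweets (hypothetical data).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=10
)

clf = RidgeClassifier().fit(X_train, y_train)

# AUC from hard labels: only one operating point of the ROC curve is used.
auc_from_labels = roc_auc_score(y_val, clf.predict(X_val))

# AUC from signed distances to the decision boundary: uses the full ranking.
val_scores = clf.decision_function(X_val)
auc_from_scores = roc_auc_score(y_val, val_scores)

print("auc (labels):", auc_from_labels)
print("auc (scores):", auc_from_scores)
```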


# try RNN 


Now, once your predictions are saved to a submission file, you can submit it to the competition from the viewer! Good luck!