1. [Load and process the data](#section-one)
2. [Word2Vec model](#section-two)
  - [Transform Words vectors into doc vectors](#section-three)
  - [3D visualization](#section-four)
  - [SVC model for Word2Vec](#section-five)
  - [Decision regions for SVC](#section-0)
  - [Google pre-trained Word2Vec model](#section-six)
  - [SVC on google model](#section-6)
3. [Doc2Vec model](#section-seven)
  - [Train Doc2Vec model](#section-eight)
  - [Find the most similar targets for test dataset](#section-nine)
  - [SVC model using Doc2Vec vectors](#section-ten) 
4. [Conclusion](#section-eleven)


In [None]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from nltk.stem.snowball import SnowballStemmer 
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
import gensim
from gensim.models.word2vec import Word2Vec
from mpl_toolkits.mplot3d import Axes3D
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from mlxtend.plotting import plot_decision_regions
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from statistics import mode
from tqdm import tqdm
from sklearn import metrics

<a id="section-one"></a>
## Load and process the data using our own tokenizer and SnowballStemmer

In [None]:
test = pd.read_csv('../input/nlp-getting-started/test.csv')
train = pd.read_csv('../input/nlp-getting-started/train.csv')
X = train['text']
y = train['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
X_train, X_test, y_train, y_test = list(X_train), list(X_test), list(y_train), list(y_test)

In [None]:
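stop = set(stopwords.words('english'))  # a set gives fast membership checks
def tokenizer(text):
    """Strip URLs and non-letters, split on whitespace and drop stop words."""
    tokenized = []
    for string in text:
        string = re.sub(r'http\S+', '', string)      # remove links while they are still intact
        string = re.sub(r'[^a-zA-Z\s]', '', string)  # keep only letters and whitespace
        tokenized.append([w for w in string.split() if w not in stop])
    return tokenized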

snow_stemmer = SnowballStemmer(language='english') 
def stemmer(text):
    stem_string = []
    for string in text: 
        stem_string.append([snow_stemmer.stem(word) for word in string])
    return stem_string 

X_train = tokenizer(X_train)
X_train = stemmer(X_train)
X_test = tokenizer(X_test)
X_test = stemmer(X_test)
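
To make the cleaning step concrete, here is a minimal standalone sketch of the same idea applied to one made-up tweet. The sample text, the tiny `toy_stop` set and the `clean_tokens` helper are illustrative assumptions, not part of the notebook; the real pipeline uses the full NLTK stop-word list and then stems every token.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list
toy_stop = {"in", "the", "a", "is"}

def clean_tokens(tweet, stop_words):
    """Remove URLs, keep only letters and whitespace, then drop stop words."""
    tweet = re.sub(r"http\S+", "", tweet)      # strip links
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet)  # strip digits and punctuation
    return [w for w in tweet.split() if w not in stop_words]

sample = "Forest fire in the valley http://t.co/abc #wildfire!"
tokens = clean_tokens(sample, toy_stop)
print(tokens)
```

Note that the stop-word check in this sketch is case-sensitive, so a capitalised stop word such as `The` would survive the filter.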

<a id="section-two"></a>
## Word2Vec

In [None]:
corpus = X_train
# gensim 3.x API (gensim 4 renamed size -> vector_size and iter -> epochs)
nlp = gensim.models.word2vec.Word2Vec(corpus, size=200,   
            window=6, min_count=1, sg=1, iter=40)
len(nlp.wv.vocab) # number of words in the vocabulary

We have just trained our model on the training set. Using the word embeddings, we can find the words most similar to a given word.

In [None]:
nlp.wv.most_similar('fire', topn = 3)  # similarity queries live on the KeyedVectors object

<a id="section-three"></a>
Now we have a vector for each word. For classification we need to turn these word vectors into document vectors. There are many ways to do this, but we will use one of the simplest and quickest. For each document we sum its word vectors and reduce that sum to 3 values: the mean (the sum of its components divided by the number of words), the min value and the max value. To some extent these values are supposed to reflect the meaning of the document.

In [None]:
def get_features(model, text):
    """Return a DataFrame with the mean, min and max of each document's summed word vectors."""
    model_words = set(model.wv.vocab.keys()) # words known to model
    rows = []
    for tokens in text:
        vec = np.zeros(model.vector_size, dtype="float32")
        nwords = 0  # number of in-vocabulary words in the tweet
        # Add the vector of every word that is in the model's vocabulary
        for word in tokens:
            if word in model_words:
                vec = np.add(vec, model.wv[word])
                nwords += 1

        # Default to zeros so that a tweet without known words does not
        # reuse the values of the previous tweet (or raise a NameError)
        vec_mean, vec_min, vec_max = 0.0, 0.0, 0.0
        if nwords > 0:
            vec_mean = np.sum(vec) / nwords
            vec_min = np.min(vec)
            vec_max = np.max(vec)

        rows.append((float(vec_mean), float(vec_min), float(vec_max)))
    return pd.DataFrame(rows, columns=['mean', 'min', 'max'])
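
As a sanity check on what these three numbers mean, here is a small worked sketch with NumPy only. The 3-dimensional `toy_vectors` and the `doc_features` helper are made up for illustration; they mimic the reduction used in `get_features` but do not touch the trained model.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for two words
toy_vectors = {
    "forest": np.array([1.0, -2.0, 0.5]),
    "fire": np.array([0.5, 1.0, -1.5]),
}

def doc_features(tokens, vectors):
    """Sum the known word vectors, then reduce the sum to (mean, min, max)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return 0.0, 0.0, 0.0  # no known words: fall back to zeros
    total = np.sum(known, axis=0)  # element-wise sum -> [1.5, -1.0, -1.0]
    mean = float(total.sum() / len(known))
    return mean, float(total.min()), float(total.max())

features = doc_features(["forest", "fire", "unknown"], toy_vectors)
print(features)
```

The whole document collapses to three scalars, so two tweets with very different word vectors can easily end up with nearly identical features.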

In [None]:
train_data = get_features(nlp, corpus)
test_data = get_features(nlp, X_test)
train_data['target'] = y_train
train_data.head()

<a id="section-four"></a>
This way we get 3 features for each document. With these 3 dimensions we can build a 3D scatter plot.

In [None]:
# plot 3d
fig = plt.figure(figsize = (10,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(train_data[train_data["target"]==0]['mean'], 
           train_data[train_data["target"]==0]['min'], 
           train_data[train_data["target"]==0]['max'], c="black")
ax.scatter(train_data[train_data["target"]==1]['mean'], 
           train_data[train_data["target"]==1]['min'], 
           train_data[train_data["target"]==1]['max'], c="red")
ax.set(xlabel='mean', ylabel='min', zlabel='max')
ax.set_zlim(0,10)
ax.set_ylim(-8,0)
ax.set_xlim(-15,5)


<a id="section-five"></a>
From the scatter plot we see that the 'mean-min-max' method is not able to clearly separate the tweets. Nevertheless, I will try to train an SVC model on this data.

In [None]:
svc = SVC(random_state=1, C = 1, gamma = 10, kernel = 'rbf', probability = True)
svc.fit(train_data.drop(['target'],axis=1, inplace = False), train_data['target'])

y_pred = svc.predict(test_data)
print('F1 = ', f1_score(y_true = y_test, y_pred = y_pred))
print('Accuracy = ', metrics.accuracy_score(y_true = y_test, y_pred = y_pred))
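
Accuracy, precision and F1 are easy to mix up, so here is a small plain-Python sketch that computes all three from the confusion counts. The `binary_scores` helper and the toy labels are assumptions made for illustration; in the notebook the scikit-learn functions do this work.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, f1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)                 # share of all correct predictions
    precision = tp / (tp + fp) if tp + fp else 0.0    # share of predicted positives that are right
    recall = tp / (tp + fn) if tp + fn else 0.0       # share of real positives that are found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, f1

toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]
accuracy, precision, f1 = binary_scores(toy_true, toy_pred)
print(accuracy, precision, f1)
```

On this toy data the accuracy is noticeably higher than the precision, which is why the two should not be reported under the same name.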

<a id="section-0"></a>

In [None]:
svc = SVC(random_state=1, C = 1, gamma = 5, kernel = 'rbf', probability = True)
svc.fit(train_data.drop(['target', 'max'],axis=1, inplace = False), train_data['target'])
# plot decision regions
fig= plt.figure(figsize=(14,4))
plot_decision_regions(np.array(train_data.drop(['target', 'max'],axis=1, inplace = False)), 
                      np.array(train_data['target']), svc)
plt.show()

<a id="section-six"></a>
Neither the accuracy nor the decision regions show that the model is able to understand and distinguish tweets. Maybe the problem lies in the step that transforms word vectors into document vectors. Or maybe we will get better predictions with a well pre-trained word2vec model. In the next step we will download the Google word2vec model trained on news.

In [None]:
model = api.load("word2vec-google-news-300")  # download the model and return as object ready for use
len(model.wv.vocab) # the model has 3000000 words in a vocabulary

In [None]:
model.most_similar('fire', topn = 3)

But the words in this model are not stemmed. Instead of stemming them, we will go back to our original, unstemmed training data.

In [None]:
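X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.1, stratify=y)
y_tr, y_ts = list(y_tr), list(y_ts)
# get_features expects lists of tokens; raw strings would be iterated character by character
X_tr, X_ts = tokenizer(list(X_tr)), tokenizer(list(X_ts))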
train_data = get_features(model, X_tr)
test_data = get_features(model, X_ts)
train_data['target'] = y_tr
train_data.head()

<a id="section-6"></a>
This way we get the same 3 features as when we trained the model on tweets: the mean, min and max values of the sum of the word vectors in each tweet. But this time we use word embeddings from the Google model.

In [None]:
svc = SVC(random_state=1, C = 1, gamma = 5, kernel = 'rbf', probability = True)
svc.fit(train_data.drop(['target'],axis=1, inplace = False), train_data['target'])

y_pred = svc.predict(test_data)
print('F1 = ', f1_score(y_true =  y_ts, y_pred = y_pred))
print('Accuracy = ', metrics.accuracy_score(y_true = y_ts, y_pred = y_pred))

But the result looks even worse. So I conclude that Word2Vec with the 'mean-min-max' approach is not a good way to classify documents. For this purpose you should try another way of transforming word vectors into document vectors.
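
One commonly used alternative, which is not evaluated in this notebook, is to average the word vectors element-wise so that the document vector keeps every embedding dimension instead of collapsing to 3 numbers. The sketch below shows the idea on made-up data; `toy_vectors` and `mean_pool` are hypothetical names, and whether this actually improves the scores here is untested.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for two words
toy_vectors = {
    "forest": np.array([1.0, -2.0, 0.5]),
    "fire": np.array([0.5, 1.0, -1.5]),
}

def mean_pool(tokens, vectors, dim):
    """Average the known word vectors element-wise into one document vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # no known words: return a zero vector
    return np.mean(known, axis=0)

doc_vec = mean_pool(["forest", "fire", "unknown"], toy_vectors, dim=3)
print(doc_vec)
```

With a 200-dimensional model this would give the classifier 200 features per tweet rather than 3.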

<a id="section-seven"></a>
## Doc2Vec

<a id="section-eight"></a>
At this stage we create a 'TaggedDocument' for each document in X_train. These objects are passed to Doc2Vec, which builds a vector embedding for each document in the data it receives.

In [None]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(X_train)]
model_dbow = Doc2Vec(documents, vector_size=200, dm = 0, window=7, min_count=1, epochs = 40)
len(model_dbow.wv.vocab.keys()) # number of words in a vocabulary

With the next lines we can get a vector representation of a given string according to our model weights. This vector is then used to find the most similar documents among those used to train the model. The call returns the ids of the 5 most similar documents in the training data.

In [None]:
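# infer_vector expects a list of tokens, preprocessed the same way as the training data
v = model_dbow.infer_vector(stemmer(tokenizer(['fire in the forest']))[0])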
model_dbow.docvecs.most_similar([v], topn=5)

<a id="section-nine"></a>
Knowing this, we can write an algorithm that handles each document in our test data the same way. First we infer a vector representation for each test document. Then we use these vector embeddings to find the 5 most similar documents in our training dataset. After that we find out which class the majority of these 5 training documents belong to. This will be the predicted class for the test document.

In [None]:
# infer a vector for every training and test document
train_vec = [model_dbow.infer_vector(doc, steps = 20) for doc in X_train]
test_vec = [model_dbow.infer_vector(doc, steps = 20) for doc in X_test]

y_pred = []
for v in tqdm(test_vec):
    # query the 5 nearest training documents once per test vector
    neighbours = model_dbow.docvecs.most_similar([v], topn=5)
    votes = [y_train[doc_id] for doc_id, _ in neighbours]
    # with 5 binary votes there is always a strict majority
    y_pred.append(mode(votes))
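
The voting scheme itself does not depend on gensim. Below is a minimal NumPy sketch of the same idea: rank the training vectors by cosine similarity to a query vector, take the 5 nearest and return the majority label. The `knn_predict` helper and the 2-dimensional toy vectors are assumptions for illustration, and the sketch assumes that no vector has zero length.

```python
import numpy as np
from statistics import mode

def knn_predict(query, train_vectors, train_labels, k=5):
    """Predict the majority label among the k most cosine-similar training vectors."""
    train_vectors = np.asarray(train_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # cosine similarity between the query and every training vector
    sims = train_vectors @ query / (
        np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(query)
    )
    nearest = np.argsort(-sims)[:k]  # indices of the k most similar vectors
    return mode([train_labels[i] for i in nearest])

# Hypothetical training vectors: class 1 points along x, class 0 along y
toy_train = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
             [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
toy_labels = [1, 1, 1, 1, 0, 0, 0]

pred_x = knn_predict([1.0, 0.05], toy_train, toy_labels)
pred_y = knn_predict([0.0, 1.0], toy_train, toy_labels)
print(pred_x, pred_y)
```

Using an odd `k` with two classes guarantees a strict majority, so no tie-breaking rule is needed.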
        

In [None]:
# plot ROC curve, print accuracy and F1 score
fpr, tpr, threshold = metrics.roc_curve(y_test, y_pred)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'Doc2Vec nearest neighbours')  # the legend needs a label to display
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
print('F1 = ', f1_score(y_true = y_test, y_pred = y_pred))
print('Accuracy = ', metrics.accuracy_score(y_true = y_test, y_pred = y_pred))


<a id="section-ten"></a>
Now let's train an SVC model using the vector representations of the documents.

In [None]:
svc = SVC(random_state=1, C = 5, gamma = 6, kernel = 'rbf', probability = True)
svc.fit(train_vec, y_train)
y_pred = svc.predict(test_vec)
print('F1 = ', f1_score(y_true = y_test, y_pred = y_pred))
print('Accuracy = ', metrics.accuracy_score(y_true = y_test, y_pred = y_pred))
fpr, tpr, threshold = metrics.roc_curve(y_test, y_pred)
plt.title('ROC curve')
plt.plot(fpr, tpr, 'b', label = 'SVC on Doc2Vec vectors')  # the legend needs a label to display
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

<a id="section-eleven"></a>
To sum up, Doc2Vec works much better than the Word2Vec model. But it is worth saying that for document classification we need to somehow transform the word vectors made by Word2Vec into document vectors. The way I did it in this notebook is not the best. That is likely why we got such a bad result for the Word2Vec model. Meanwhile, the 'mean-min-max' approach is an easy way to transform word vectors and might work better on small and well-distinguished datasets.