# Assignment 3: Word2Vec

In this assignment, we will see how we can use Word2Vec (or any similar word embedding) to use information from unlabelled data to help us classify better!

You will be using the sentiment data from last week, either the yelps or movies, whichever you wish. 

Your goal will be to simulate the following situation: you have a **small** set of labelled data and a large set of unlabelled data. Show how the following two techniques compare as the amount of labelled data increases. Train each of them on the small labelled subset and test its performance on the rest of the data.

In other words, train on 1k, test on 99k. Then train on 2k, test on 98k. Then train on 4k, test on 96k. Etc.

1. Logistic regression trained on labelled data, documents represented as term-frequency matrix of your choice. You can learn the vocabulary from the entire dataset or only the labelled data.

2. Logistic regression trained on the labelled data, documents represented as word2vec vectors where you train word2vec using the entire dataset. Play around with different settings of word2vec (context window size, number of negative samples, skip-gram vs CBOW, etc.). Note: we didn't go over the options in detail in class, so you will need to read about them a bit!

You can read about the gensim word2vec implementation [here](https://radimrehurek.com/gensim/models/word2vec.html).
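
The doubling schedule described above (1k, 2k, 4k, ...) can be sketched as follows. This is a minimal illustration rather than a required implementation: the helper name `doubling_splits`, the fixed seed, and the choice to make every training set a superset of the previous one are all assumptions. Nesting the subsets is one way to make the comparison reflect added data rather than a fresh random draw.

```python
import random

def doubling_splits(n_docs, start=1000, seed=0):
    """Yield (train_idx, test_idx) pairs with training sizes start, 2*start, 4*start, ...

    Hypothetical helper: every training set is a prefix of one fixed shuffle,
    so each larger training set contains all the smaller ones.
    """
    order = list(range(n_docs))
    random.Random(seed).shuffle(order)  # one fixed shuffle, reproducible via the seed
    size = start
    while size < n_docs:
        yield order[:size], order[size:]  # train on the first `size` rows, test on the rest
        size *= 2

# Example with 100k documents: 1k/99k, 2k/98k, 4k/96k, ...
splits = list(doubling_splits(100_000))
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes[:3])
```

The index lists can be passed to `DataFrame.iloc` to select the corresponding rows.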

## Import & Split & Preprocessing Data

In [None]:
# Run this cell only when running in Google Colab
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set the working directory to the notebook's folder

import os
os.chdir('/content/drive/My Drive/Colab Notebooks')  # replace with the path to your notebook

Mounted at /content/drive


In [None]:
# import auxiliary functions
import os,sys,inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(1, parentdir)



In [None]:
import re
import spacy
import seaborn as sns
import pandas as pd
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

yelps = pd.read_csv('yelps.csv').fillna(' ')
imdb = pd.read_csv('movies.csv').fillna(' ')

In [None]:
#from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV



import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import re

# Lemmatizing with spaCy takes too long, so we use NLTK's WordNetLemmatizer instead:
def preprocess(X):
  documents = []
  # Contraction -> expansion lookup (duplicate keys removed, "we'll" fixed)
  APPO = {
  "aren't" : "are not",
  "can't" : "cannot",
  "couldn't" : "could not",
  "didn't" : "did not",
  "doesn't" : "does not",
  "don't" : "do not",
  "hadn't" : "had not",
  "hasn't" : "has not",
  "haven't" : "have not",
  "he'd" : "he would",
  "he'll" : "he will",
  "he's" : "he is",
  "i'd" : "I would",
  "i'll" : "I will",
  "i'm" : "I am",
  "isn't" : "is not",
  "it's" : "it is",
  "it'll":"it will",
  "i've" : "I have",
  "let's" : "let us",
  "mightn't" : "might not",
  "mustn't" : "must not",
  "shan't" : "shall not",
  "she'd" : "she would",
  "she'll" : "she will",
  "she's" : "she is",
  "shouldn't" : "should not",
  "that's" : "that is",
  "there's" : "there is",
  "they'd" : "they would",
  "they'll" : "they will",
  "they're" : "they are",
  "they've" : "they have",
  "we'd" : "we would",
  "we're" : "we are",
  "weren't" : "were not",
  "we've" : "we have",
  "what'll" : "what will",
  "what're" : "what are",
  "what's" : "what is",
  "what've" : "what have",
  "where's" : "where is",
  "who'd" : "who would",
  "who'll" : "who will",
  "who're" : "who are",
  "who's" : "who is",
  "who've" : "who have",
  "won't" : "will not",
  "wouldn't" : "would not",
  "you'd" : "you would",
  "you'll" : "you will",
  "you're" : "you are",
  "you've" : "you have",
  "'re": " are",
  "wasn't": "was not",
  "we'll": "we will",
  "tryin'":"trying"
  }
  lemmatizer = WordNetLemmatizer()
  for sen in range(0, len(X)):
    document = str(X.iloc[sen]).lower() # Lowercase first so the lowercase keys in APPO can match
    # Expand contractions before stripping punctuation; afterwards the apostrophes are gone
    document = [APPO[word] if word in APPO else word for word in document.split()]
    document = ' '.join(document).lower() # Expansions such as "I am" reintroduce capitals
    document = re.sub(r'\W', ' ', document) # Remove all the special characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document) # Remove all single characters
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)  # Remove a single character at the start
    document = re.sub(r'\s+', ' ', document) # Collapse multiple spaces into a single space
    document = re.sub(r'^b\s+', '', document) # Remove a prefixed 'b'
    document = document.split() # Tokenize for lemmatization
    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)
    document = re.sub(r'\b\d+(?:\.\d+)?\s+', '', document) # Remove standalone numbers followed by whitespace

    documents.append(document)
  return documents

df = pd.concat([yelps[['positive', 'text']], imdb[['positive', 'text']]])
df = df.reset_index(drop=True)
msk1 = np.random.rand(len(df)) < 0.01
msk2 = np.random.rand(len(df)) < 0.02
msk4 = np.random.rand(len(df)) < 0.04

train1 = df[msk1]
test99 = df[~msk1]
train2 = df[msk2]
test98 = df[~msk2]
train4 = df[msk4]
test96 = df[~msk4]


corpus1 = preprocess(train1.text)
y1 = train1.positive
corpus_test99 = preprocess(test99.text)
y_test99 = test99.positive

corpus2 = preprocess(train2.text)
y2 = train2.positive
corpus_test98 = preprocess(test98.text)
y_test98 = test98.positive

corpus4 = preprocess(train4.text)
y4 = train4.positive
corpus_test96 = preprocess(test96.text)
y_test96 = test96.positive

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
print(len(y1))
print(len(corpus1))


1611
1611


In [None]:
print(len(yelps))
print(len(imdb))

100003
50003


## Logistic Regression without word2vec

We could try different feature extraction methods and different classifiers to improve the fit, and tune the hyperparameters of each. Since running pipelines with several classifiers takes a lot of time, we only run a grid search over a `TfidfVectorizer` + `LogisticRegression` pipeline here.

In [None]:
from sklearn.pipeline import Pipeline
# Pipeline & Gridsearch setup
# TFIDF pipeline setup
tvc_pipe = Pipeline([
 ('tvec', TfidfVectorizer()),
 ('lr', LogisticRegression())
])

# Setting params for TFIDF Vectorizer gridsearch
tf_params = {
 'tvec__min_df':[2,5,10,12,15,18,20],
 'tvec__max_df': [0.8,0.82, 0.85, 0.78, 0.75],
 'tvec__norm': ['l2'],
 'tvec__use_idf': [True],
 'lr__penalty': ['l2'],
}

print('1% train and 99% test')
# Setting up GridSearch for TFIDFVectorizer
tvc_gs1 = GridSearchCV(tvc_pipe, param_grid=tf_params, cv = 5, verbose =1, n_jobs = 100)
tvc_gs1.fit(corpus1, y1)
# Scoring Training data on TFIDFVectorizer
print(f'score for train data: {tvc_gs1.score(corpus1, y1)}')
# Scoring Test data on TFIDFVectorizer
print(f'score for test data: {tvc_gs1.score(corpus_test99, y_test99)}')

print('2% train and 98% test')
# Setting up GridSearch for TFIDFVectorizer
tvc_gs2 = GridSearchCV(tvc_pipe, param_grid=tf_params, cv = 5, verbose =1, n_jobs = 100)
tvc_gs2.fit(corpus2, y2)
# Scoring Training data on TFIDFVectorizer
print(f'score for train data: {tvc_gs2.score(corpus2, y2)}')
# Scoring Test data on TFIDFVectorizer
print(f'score for test data: {tvc_gs2.score(corpus_test98, y_test98)}')

print('4% train and 96% test')
# Setting up GridSearch for TFIDFVectorizer
tvc_gs4 = GridSearchCV(tvc_pipe, param_grid=tf_params, cv = 5, verbose =1, n_jobs = 100)
tvc_gs4.fit(corpus4, y4)
# Scoring Training data on TFIDFVectorizer
print(f'score for train data: {tvc_gs4.score(corpus4, y4)}')
# Scoring Test data on TFIDFVectorizer
print(f'score for test data: {tvc_gs4.score(corpus_test96, y_test96)}')

1% train and 99% test
Fitting 5 folds for each of 35 candidates, totalling 175 fits


[Parallel(n_jobs=100)]: Using backend LokyBackend with 100 concurrent workers.
[Parallel(n_jobs=100)]: Done 152 out of 175 | elapsed:  1.8min remaining:   16.3s
[Parallel(n_jobs=100)]: Done 175 out of 175 | elapsed:  1.8min finished


score for train data: 0.9484792054624457
score for test data: 0.8720711614272718
2% train and 98% test
Fitting 5 folds for each of 35 candidates, totalling 175 fits


[Parallel(n_jobs=100)]: Using backend LokyBackend with 100 concurrent workers.
[Parallel(n_jobs=100)]: Done 152 out of 175 | elapsed:  1.2min remaining:   11.2s
[Parallel(n_jobs=100)]: Done 175 out of 175 | elapsed:  1.3min finished


score for train data: 0.9528519617540389
score for test data: 0.8873398515373572
4% train and 96% test
Fitting 5 folds for each of 35 candidates, totalling 175 fits


[Parallel(n_jobs=100)]: Using backend LokyBackend with 100 concurrent workers.
[Parallel(n_jobs=100)]: Done 152 out of 175 | elapsed:  2.5min remaining:   22.5s
[Parallel(n_jobs=100)]: Done 175 out of 175 | elapsed:  2.7min finished


score for train data: 0.9436177248677249
score for test data: 0.9016240848025119
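
The grid above varies `min_df` and `max_df`. As a rough sketch of what these two thresholds do, the pure-Python function below mimics the vocabulary pruning: an integer `min_df` is treated as an absolute document count and a float `max_df` as a fraction of the documents. The function name `prune_vocabulary` and the toy corpus are made up for illustration, and whitespace splitting is a simplification of scikit-learn's tokenizer.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain the term at least once
    df = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)

toy_docs = [
    "the food was great",
    "the service was slow",
    "the food was cold",
    "great service and great food",
    "the movie was great",
]
print(prune_vocabulary(toy_docs, min_df=2, max_df=0.8))
print(prune_vocabulary(toy_docs, min_df=2, max_df=0.75))
```

On this toy corpus, lowering `max_df` from 0.8 to 0.75 drops "the" and "was", which each appear in four of the five documents, while `min_df=2` removes the words that occur in a single document.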


## Logistic Regression with word2vec

In [None]:
lst_corpus1 = []
for i in corpus1:
  lst_corpus1.append(i.split(' '))

In [None]:
from gensim.models import Word2Vec, word2vec
import logging
import matplotlib.pyplot as plt
import nltk
import numpy as np
from nltk.corpus import stopwords
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
import sqlite3
import re
from tqdm import tqdm

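num_features = 100    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 3       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
# NOTE: word2vec is trained here on the 1% labelled subset only (lst_corpus1);
# the assignment asks for it to be trained on the entire dataset.
# `size` is the gensim 3.x argument name; gensim >= 4.0 calls it `vector_size`.
model = word2vec.Word2Vec(lst_corpus1, workers=num_workers,
                size=num_features, min_count=min_word_count,
                window=context, sample=downsampling)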

In [None]:
def make_feature_vec(words, model, num_features):
    """
    Average the word vectors for a set of words
    """
    feature_vec = np.zeros((num_features,),dtype="float32")  # running sum of word vectors
    nwords = 0.

    for word in words:
        if word in model.wv:  # skip words outside the word2vec vocabulary
            nwords = nwords + 1.
            feature_vec = np.add(feature_vec, model.wv[word])  # model[word] is deprecated
    
    # If no word is in the vocabulary, this divides by zero and yields a NaN vector
    feature_vec = np.divide(feature_vec, nwords)
    return feature_vec


def get_avg_feature_vecs(reviews, model, num_features):
    """
    Calculate average feature vectors for all reviews
    """
    review_feature_vecs = np.zeros((len(reviews),num_features), dtype='float32')  # pre-initialize (for speed)
    
    for i, review in enumerate(reviews):
        review_feature_vecs[i] = make_feature_vec(review, model, num_features)
    return review_feature_vecs
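
Averaging divides by the number of in-vocabulary words, so a review in which no word is known to word2vec ends up as a row of NaNs. The sketch below shows one possible alternative, under the assumption that an all-zero vector is an acceptable representation for such reviews. It uses a plain dict of toy vectors as a stand-in for the trained model, and the name `average_vectors` is hypothetical.

```python
import numpy as np

def average_vectors(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")  # assumption: zeros instead of NaN
    return np.mean(known, axis=0).astype("float32")

# Toy 2-dimensional "embeddings" standing in for the word2vec vectors
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "food": np.array([0.0, 2.0]),
}

mixed = average_vectors(["good", "food", "unknownword"], toy_vectors, dim=2)
empty = average_vectors(["unknownword"], toy_vectors, dim=2)
print(mixed, empty)
```

With this variant the unknown word is ignored in the first call and the second call returns zeros, so no imputation step would be needed, at the cost of mapping every such review to the same point.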

In [None]:
trainDataVecs1 = get_avg_feature_vecs(lst_corpus1, model, num_features=100)



In [None]:
lst_corpus_test99 = []
for i in corpus_test99:
  lst_corpus_test99.append(i.split(' '))

In [None]:
testDataVecs99 = get_avg_feature_vecs(lst_corpus_test99, model, num_features=100)



In [None]:
from sklearn.impute import KNNImputer
# Impute the NaN rows produced by reviews that contain no in-vocabulary words
knn = KNNImputer(n_neighbors=50)
testDataVecs99 = knn.fit_transform(testDataVecs99)
    

In [None]:
testDataVecs99.shape

(148395, 100)

In [None]:
np.isnan(testDataVecs99).any()

False

In [None]:
lr1 = LogisticRegression()
lr1.fit(trainDataVecs1, y1)
result = lr1.predict(testDataVecs99)
print(classification_report(y_test99, result))

              precision    recall  f1-score   support

           0       0.66      0.68      0.67     74189
           1       0.67      0.65      0.66     74206

    accuracy                           0.67    148395
   macro avg       0.67      0.67      0.67    148395
weighted avg       0.67      0.67      0.67    148395



In [None]:
print(classification_report(y1, lr1.predict(trainDataVecs1)))

              precision    recall  f1-score   support

           0       0.68      0.70      0.69       817
           1       0.68      0.67      0.67       794

    accuracy                           0.68      1611
   macro avg       0.68      0.68      0.68      1611
weighted avg       0.68      0.68      0.68      1611

