## Easy

Load word vectors from Gensim and write a cosine distance function that finds the top 100 nearest words for:

```
apple, queen, ignorance, possibility
```
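
For reference, the cosine distance this task asks for can be written as below. The symbols $u$ and $v$ are chosen here for illustration and stand for any two word vectors.

```latex
d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
               = 1 - \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
```

A distance of 0 means the two vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.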

In [23]:
import gensim.downloader as api
import numpy as np
import collections

In [16]:
w2v = api.load("glove-wiki-gigaword-50")

In [4]:
len(w2v.vocab)  # in gensim >= 4.0 use len(w2v.key_to_index)

400000

In [5]:
words = ['apple', 'queen', 'ignorance', 'possibility']

In [12]:
for i in words:
  if i not in w2v:
    print(f"Word {i} not in w2v")

In [47]:
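def cosSim(word, num=100):
  # Despite the name, this returns cosine *distance* (1 - cosine similarity),
  # so smaller values mean closer words.
  target = w2v[word]
  target_norm = np.linalg.norm(target)
  res = {}
  for i in w2v.vocab:  # in gensim >= 4.0 iterate over w2v.key_to_index instead
    res[i] = 1 - np.dot(target, w2v[i]) / (target_norm * np.linalg.norm(w2v[i]))
  # Sort ascending by distance; the first entry is expected to be the query word itself, so skip it.
  return collections.OrderedDict(sorted(res.items(), key=lambda item: item[1])[1:1+num])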

In [49]:
for i in words:
  print(f"Nearest words to {i} (cosine distance):\n{cosSim(i, 2)}")

Nearest words to apple (cosine distance):
OrderedDict([('blackberry', 0.24569332599639893), ('chips', 0.25613564252853394)])
Nearest words to queen (cosine distance):
OrderedDict([('princess', 0.14848339557647705), ('lady', 0.1949390172958374)])
Nearest words to ignorance (cosine distance):
OrderedDict([('injustice', 0.17221593856811523), ('selfishness', 0.1891566514968872)])
Nearest words to possibility (cosine distance):
OrderedDict([('possible', 0.06444436311721802), ('whether', 0.10292595624923706)])
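
The loop above computes one dot product per vocabulary word. A vectorised variant is sketched below on a toy matrix; the function name `nearest_by_cosine`, the toy words, and the toy vectors are all invented for illustration and are not part of the notebook.

```python
import numpy as np

def nearest_by_cosine(vectors, words, query, num=2):
    """Return the `num` words closest to `query` by cosine distance.

    `vectors` is an (n_words, dim) array and `words` the matching list of labels;
    both are hypothetical stand-ins for the embedding matrix and its vocabulary.
    """
    idx = words.index(query)
    # Normalise every row once so a single matrix-vector product gives all cosines.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[idx]
    # argsort is ascending, so the smallest distances come first; drop the query itself.
    order = [i for i in np.argsort(distances) if i != idx][:num]
    return [(words[i], float(distances[i])) for i in order]

# Toy 2-D "embeddings", invented purely for illustration.
toy_words = ["apple", "pear", "queen", "king"]
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

neighbours = nearest_by_cosine(toy_vectors, toy_words, "apple")
print(neighbours)
```

With the real model, the matrix should be available as `w2v.vectors`. Gensim also provides `w2v.most_similar(word, topn=100)`, which reports cosine similarity rather than distance.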


## Medium

Load the IMDB dataset, tokenize the texts, and display the vector representation of each word.

In [54]:
import pandas as pd
import re

In [85]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [76]:
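# Note: error_bad_lines is deprecated since pandas 1.3 and removed in 2.0; use on_bad_lines="warn" there.
df = pd.read_csv('./sample_data/IMDB Dataset.csv', error_bad_lines=False, engine="python")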

Skipping line 2419: unexpected end of data


In [77]:
df['tokens'] = df['review'].apply(lambda doc: re.findall(r'\w+', doc))

In [78]:
df['w2v'] = df['tokens'].apply(lambda tokens: [w2v[i.lower()] for i in tokens if i.lower() in w2v])
df['w2v'].head()

0    [[0.31474, 0.41662, 0.1348, 0.15854, 0.88812, ...
1    [[0.21705, 0.46515, -0.46757, 0.10082, 1.0135,...
2    [[0.11891, 0.15255, -0.082073, -0.74144, 0.759...
3    [[-0.11901, -0.72028, 0.067149, -0.44532, 0.62...
4    [[-0.11497, -0.81098, 0.32244, -0.0046624, 0.4...
Name: w2v, dtype: object
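
To make the tokenize-then-look-up step easier to inspect, here is a small sketch that prints each kept token next to its vector. The embedding table `toy_embeddings`, the helpers `tokenize` and `embed`, and the sample review are hypothetical stand-ins, not the notebook's data.

```python
import re

# Hypothetical two-dimensional embedding table standing in for `w2v`.
toy_embeddings = {
    "great": [0.9, 0.1],
    "movie": [0.4, 0.5],
    "boring": [-0.8, 0.2],
}

def tokenize(text):
    # Collect runs of word characters and lowercase them.
    return [token.lower() for token in re.findall(r"\w+", text)]

def embed(tokens, table):
    # Keep (token, vector) pairs only for tokens present in the table.
    return [(token, table[token]) for token in tokens if token in table]

review = "Great movie!<br />Not boring at all."
tokens = tokenize(review)
pairs = embed(tokens, toy_embeddings)
for token, vector in pairs:
    print(f"{token:>8} -> {vector}")
```

Note that the HTML line break in the sample leaks a `br` token. Whether to strip markup before tokenizing is a design choice the notebook leaves open.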

## Hard

Finish the Medium task.
For each document, compute its vector form by taking the mean of the vectors of its words.

Use logistic regression to classify the reviews, where the input is the mean of each document's word vectors.

In [80]:
df['mean'] = df['w2v'].apply(lambda x: np.mean(x, axis=0))
df['mean'].head()

0    [0.23563969, 0.12975053, -0.07267406, -0.17217...
1    [0.22607522, 0.20523989, -0.23802738, -0.11300...
2    [0.27521876, 0.1606651, -0.13258399, -0.169513...
3    [0.25059623, 0.13714078, -0.08698385, -0.15979...
4    [0.37243506, 0.24857606, -0.13379185, -0.04750...
Name: mean, dtype: object
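
One edge case is worth guarding against: a review with no in-vocabulary tokens yields an empty list, and its mean is `nan`. The sketch below shows one possible guard; the helper `mean_vector` and its zero-vector fallback are assumptions for illustration, not something the notebook does.

```python
import numpy as np

def mean_vector(word_vectors, dim):
    """Average a list of word vectors; fall back to zeros when the list is empty."""
    if len(word_vectors) == 0:
        # np.mean of an empty list would give nan, which the classifier cannot use.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(word_vectors, axis=0)

doc = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
pooled = mean_vector(doc, dim=2)
empty = mean_vector([], dim=2)
print(pooled, empty)
```

For the 50-dimensional model loaded earlier, `dim` would be 50.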

In [81]:
def posneg(x):
    if x=="negative":
        return 0
    elif x=="positive":
        return 1
    return x

filtered_score = df["sentiment"].map(posneg)
df["score"] = filtered_score


In [82]:
X_train, X_test, y_train, y_test = train_test_split(df["mean"], df["score"], test_size=0.2)

In [89]:
logReg = LogisticRegression(fit_intercept=True, n_jobs=-1)
logReg.fit(list(X_train), y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [90]:
y_pred_test = logReg.predict(list(X_test))

In [91]:
print(f"Accuracy: {accuracy_score(y_test, y_pred_test)}")
print(f"Precision: {precision_score(y_test, y_pred_test)}")
print(f"Recall: {recall_score(y_test, y_pred_test)}")
print(f"F1: {f1_score(y_test, y_pred_test)}")

Accuracy: 0.7169421487603306
Precision: 0.7489361702127659
Recall: 0.6929133858267716
F1: 0.719836400817996
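
As a sanity check on how the four numbers relate, the sketch below recomputes them from confusion-matrix counts. The helper `binary_metrics` is hypothetical, and the counts are inferred from the printed ratios rather than taken from the notebook, so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the ratios printed above (an assumption, not notebook output).
accuracy, precision, recall, f1 = binary_metrics(tp=176, fp=59, fn=78, tn=171)
print(accuracy, precision, recall, f1)
```

These counts sum to 484 test reviews. With `test_size=0.2`, that would suggest only about 2,400 reviews were loaded, which fits the truncated CSV reported when the file was read.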
