### Word Embeddings

- We'll be using the [spacy](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/) library for embeddings. 

In [1]:
import spacy

Run the following cell once; it downloads the spaCy model used for the embeddings.

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
# create sentence.
sentence = nlp('The grass is green .')

# now check out the embedded tokens.
for token in sentence:
    print(token.text, token.vector.shape, token.vector)

The (96,) [ 0.46308956 -0.476834   -0.43478188  0.7646948  -0.63464177 -0.72864527
 -0.1083423  -0.03606443 -0.37319994  0.23259653  0.1794821   0.9033147
  0.37599707 -0.16151157 -0.6921803  -0.34063888 -0.58255136  1.8662513
 -0.162445   -0.22811128 -0.82284606 -0.16138142  0.5386862  -0.84876907
  0.94887006 -0.3058412   0.40681487 -0.55957776 -0.2906301   1.6037045
  1.1047919  -1.1239388  -0.0670287  -1.4549143  -0.40158284 -0.4605913
 -0.89699286  0.6834641  -0.39152545  1.7604558   0.27963328  0.93046767
 -0.6345934   0.46363148 -0.241721    0.05683845  0.1307706   1.0328741
  0.37555254  0.11903231  0.07902983  1.0012553   0.6178818   1.5738294
  0.66949344  0.32361752 -1.1712271  -0.11899582 -1.1904695  -0.0384821
 -0.5886693   0.80128086  0.02617992 -0.8680332   0.5289372  -0.8510635
  0.30884215 -0.98843426 -0.30716825 -0.95105475 -0.45381197  0.8629532
 -0.5900868  -0.17367877 -0.3976024  -0.77372044 -0.24516934 -0.4000978
  1.4433401  -0.5883524  -0.06499428 -1.1099193  -0

### Topic Modeling

- Given a document, determine the topic of the document
- For this task, we'll use the Brown corpus of texts accessible via NLTK

In [5]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\reggi\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [None]:
from nltk.corpus import brown
import numpy as np
from tqdm.notebook import tqdm  # tqdm displays a progress bar

# one (category, vector) pair per document
category_vectors = []

cats = brown.categories()

# for each category
for cat in cats:
    print(cat)
    # grab all of the documents in that category
    for fileid in tqdm(brown.fileids(categories=[cat])):
        sents = brown.sents(fileids=[fileid])
        sent_vecs = []
        for sent in sents:
            # convert from a list of tokens to a string
            sent = ' '.join(sent)
            sent = nlp(sent)
            # grab all of the words, find their embeddings, and sum them
            word_sum = np.sum([tok.vector for tok in sent], axis=0) # why axis=0?
            # add the summed sentence embedding to the list for this document
            sent_vecs.append(word_sum)
        # sum the sentence embeddings into a single vector for the document
        category_vectors.append((cat, np.sum(sent_vecs, axis=0)))

adventure


  0%|          | 0/29 [00:00<?, ?it/s]

belles_lettres


  0%|          | 0/75 [00:00<?, ?it/s]

editorial


  0%|          | 0/27 [00:00<?, ?it/s]

fiction


  0%|          | 0/29 [00:00<?, ?it/s]

government


  0%|          | 0/30 [00:00<?, ?it/s]

hobbies


  0%|          | 0/36 [00:00<?, ?it/s]

humor


  0%|          | 0/9 [00:00<?, ?it/s]

learned


  0%|          | 0/80 [00:00<?, ?it/s]

lore


  0%|          | 0/48 [00:00<?, ?it/s]

mystery


  0%|          | 0/24 [00:00<?, ?it/s]

news


  0%|          | 0/44 [00:00<?, ?it/s]
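The loop above asks "why axis=0?". The following is a minimal sketch using made-up 3-dimensional vectors (the token vectors in this notebook are 96-dimensional) that shows what the different `axis` choices do. The variable names and values are illustrative assumptions, not output from spaCy.

```python
import numpy as np

# Three hypothetical 3-dimensional "token vectors" (toy values, not real spaCy output).
token_vectors = [
    np.array([1.0, 2.0, 3.0]),
    np.array([10.0, 20.0, 30.0]),
    np.array([100.0, 200.0, 300.0]),
]

# axis=0 sums across tokens, position by position, so the result keeps
# the embedding dimension: one vector of shape (3,).
summed = np.sum(token_vectors, axis=0)

# Without an axis, NumPy adds every number into a single scalar,
# which would throw away the embedding structure.
collapsed = np.sum(token_vectors)

# axis=1 sums within each token vector instead: one number per token.
per_token = np.sum(token_vectors, axis=1)

print(summed.shape, summed)
print(collapsed)
print(per_token)
```

Summing along axis 0 is what lets a sentence or document of any length be represented by a vector with the same dimension as a single token.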

In [31]:
import pandas as pd

# move the (category, vector) tuples into a dataframe
keys, values = zip(*category_vectors)  # the * unpacks the list so zip can "unzip" it
data = pd.DataFrame({'cat':keys,'vectors':values})

In [32]:
data[:3]

| | cat | vectors |
|---|---|---|
| 0 | adventure | [-501.40533, -619.5586, 22.687773, -96.53734, ... |
| 1 | adventure | [-245.56874, -583.061, -97.339584, -95.76089, ... |
| 2 | adventure | [-222.43172, -467.51596, -98.293884, -108.7934... |


In [33]:
total = len(data)
total

500

#### compute the baselines

In [56]:
print('random baseline {}'.format(1.0/len(cats)))

print('most common baseline?')
for cat in cats:
    print(cat, len(data[data.cat==cat])/total)

random baseline 0.06666666666666667
most common baseline?
adventure 0.058
belles_lettres 0.15
editorial 0.054
fiction 0.058
government 0.06
hobbies 0.072
humor 0.018
learned 0.16
lore 0.096
mystery 0.048
news 0.088
religion 0.034
reviews 0.034
romance 0.058
science_fiction 0.012
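
The printout above lists each category's share of the documents; the most-common baseline is the largest of those shares (here `learned` at 0.16), because a classifier that always predicts the biggest category is right that often. A minimal sketch of both baselines, using a hypothetical toy label list rather than the Brown corpus:

```python
from collections import Counter

# Toy labels standing in for data.cat (hypothetical counts, not the Brown corpus).
labels = ["learned"] * 4 + ["news"] * 3 + ["humor"] * 1

counts = Counter(labels)

# Most-common baseline: always predict the largest category.
majority_class, majority_count = counts.most_common(1)[0]
majority_baseline = majority_count / len(labels)

# Random baseline: guess uniformly among the categories.
random_baseline = 1.0 / len(counts)

print(majority_class, majority_baseline, random_baseline)
```

A trained classifier should beat both numbers before its accuracy is worth interpreting.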


#### split the data into train/test

In [57]:
test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

test.shape, train.shape 

((50, 2), (450, 2))

#### train a classifier

In [58]:
from sklearn import preprocessing

# initializes label encoder
le = preprocessing.LabelEncoder()
# create a list of the training vectors X
X = [x for x in train.vectors]
# learns mapping from original training set, then transforms training labels
y = le.fit_transform(train.cat)

In [59]:
from sklearn.linear_model import LogisticRegression

In [60]:
# multinomial: used when there are more than 2 classes/categories
# lbfgs (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is an optimization algorithm that minimizes the cost function
# lbfgs: fast, works well with small-to-medium datasets, supports multinomial logistic regression and L2 regularization
# the multi_class argument is deprecated in recent scikit-learn versions; multinomial is used by default for multiclass problems
# L2 regularization: a technique to reduce overfitting (fitting the training data too closely)
clfr = LogisticRegression(solver='lbfgs')

In [61]:
# trains/fits the logistic regression model on the training data by iteratively minimizing the log loss (lbfgs is a quasi-Newton method, a relative of gradient descent)
# then stores the learned weights so the model can classify new data
clfr.fit(X,y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
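
The warning above suggests two remedies: raise `max_iter` or scale the data. The summed embeddings in this notebook have large magnitudes (values in the hundreds), which tends to slow lbfgs down. The sketch below combines both remedies on synthetic data; `X_toy`, `y_toy`, and `scaled_clf` are hypothetical names, the data is randomly generated rather than taken from the Brown corpus, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the summed embeddings: large-magnitude features,
# three classes (toy data, not the Brown corpus vectors).
rng = np.random.default_rng(0)
X_toy = np.vstack(
    [rng.normal(loc=c * 300.0, scale=100.0, size=(40, 5)) for c in range(3)]
)
y_toy = np.repeat([0, 1, 2], 40)

# Standardize each feature to zero mean and unit variance,
# then fit logistic regression with a higher iteration cap.
scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
scaled_clf.fit(X_toy, y_toy)

print(scaled_clf.score(X_toy, y_toy))
```

Putting the scaler inside a pipeline means the same scaling learned from the training data is applied automatically at prediction time.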


#### evaluate 

In [66]:
from sklearn.metrics import accuracy_score

In [67]:
test_y = le.transform(test.cat)
test_X = [x for x in test.vectors]

# predict uses the trained model to predict the class labels for test_X
# accuracy_score compares the predicted labels to the correct test labels and returns the fraction that match
score = accuracy_score(test_y, clfr.predict(test_X))
score

0.32
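
For context, the 0.32 above is twice the 0.16 share of the largest category (`learned`) reported in the baselines. Accuracy itself is just the fraction of predictions that match the true labels, which can be checked by hand. A minimal sketch with hypothetical label arrays (toy values, not the notebook's predictions):

```python
import numpy as np

# Hypothetical encoded labels for six test documents (toy values).
true_labels = np.array([0, 1, 2, 2, 1, 0])
predicted_labels = np.array([0, 2, 2, 2, 1, 1])

# Accuracy is the share of positions where prediction and truth agree.
manual_accuracy = np.mean(predicted_labels == true_labels)

print(manual_accuracy)
```

Four of the six toy predictions match, so the value printed is 4/6.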

### Results

- GoogleNews-vectors-negative300.magnitude 0.4 (word2vec)
- wiki-news-300d-1M.magnitude 0.56 (fastText)
- glove.6B.300d.magnitude 0.52 (GloVe)

In [68]:
test.shape, train.shape 

((50, 2), (450, 2))

In [69]:
from sklearn import preprocessing

#test = data.sample(frac=0.1,random_state=200)
#train = data.drop(test.index)

le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder()
le.fit(data.cat)
# OneHotEncoder expects a 2D array, so reshape the labels into a single column
data_y = le.transform(data.cat).reshape(-1, 1)
ohe.fit(data_y)

test_y = le.transform(test.cat).reshape(-1, 1)  # ensure it's 2D
# encode the training labels so that y lines up row-for-row with X
train_y = le.transform(train.cat).reshape(-1, 1)
y = ohe.transform(train_y).toarray()

X = np.array([x for x in train.vectors])

X.shape, y.shape

((450, 96), (450, 15))
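
The cell above relies on two steps that are easy to miss: `reshape(-1, 1)` and one-hot encoding. The sketch below reproduces both by hand in NumPy on hypothetical labels, as an illustration of the idea rather than of how `OneHotEncoder` is implemented; `encoded`, `column`, and `one_hot` are made-up names.

```python
import numpy as np

# Hypothetical integer labels, as a LabelEncoder would produce (toy values).
encoded = np.array([2, 0, 1, 2])

# reshape(-1, 1) turns the flat array of shape (4,) into a column of shape (4, 1);
# -1 tells NumPy to infer that dimension from the array's length.
column = encoded.reshape(-1, 1)

# One-hot encoding by hand: row i of the identity matrix is the one-hot
# vector for class i, so indexing with the labels yields one row per sample.
n_classes = 3
one_hot = np.eye(n_classes)[encoded]

print(column.shape, one_hot.shape)
print(one_hot)
```

Each row of `one_hot` has a single 1 in the column for its class, so the number of rows equals the number of samples and the number of columns equals the number of classes.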