### Word Embeddings

- We'll be using the [spacy](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/) library for embeddings. 

In [147]:
import spacy

Run the following cell once; it downloads the spaCy model that provides the embeddings.

In [148]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 13.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [149]:
nlp = spacy.load("en_core_web_sm")

In [150]:
# create sentence.
sentence = nlp('The grass is green .')

# now check out the embedded tokens.
for token in sentence:
    print(token.text, token.vector.shape, token.vector)

The (96,) [ 0.46308956 -0.476834   -0.43478197  0.7646948  -0.63464177 -0.72864527
 -0.10834235 -0.03606439 -0.37319988  0.2325965   0.17948198  0.9033147
  0.37599713 -0.16151144 -0.6921803  -0.3406388  -0.5825513   1.8662513
 -0.16244504 -0.22811107 -0.822846   -0.16138133  0.53868645 -0.848769
  0.9488702  -0.3058413   0.40681458 -0.5595779  -0.29063013  1.6037045
  1.1047919  -1.1239386  -0.06702872 -1.4549145  -0.40158293 -0.46059126
 -0.89699274  0.68346405 -0.39152563  1.7604561   0.27963334  0.9304676
 -0.63459337  0.4636314  -0.2417211   0.0568383   0.13077062  1.0328739
  0.37555254  0.1190322   0.07902986  1.0012553   0.6178818   1.5738294
  0.66949344  0.32361758 -1.1712269  -0.11899585 -1.1904697  -0.03848198
 -0.58866924  0.80128056  0.02618014 -0.8680333   0.52893746 -0.85106355
  0.30884224 -0.988434   -0.307168   -0.9510547  -0.45381203  0.8629531
 -0.5900868  -0.1736787  -0.39760253 -0.77372056 -0.24516937 -0.40009767
  1.44334    -0.5883523  -0.06499413 -1.1099194  -

### Topic Modeling

- Given a document, determine the topic of the document
- For this task, we'll use the Brown corpus of texts accessible via NLTK

In [151]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/reggiewade/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [152]:
from nltk.corpus import brown
import numpy as np
from collections import defaultdict
# tqdm displays a progress bar; tqdm_notebook is the (now deprecated) notebook variant
from tqdm import tqdm_notebook as tqdm

category_vectors = []

cats = brown.categories()
    
# for each category
for cat in cats:
    print(cat)
    # grab all of the documents
    for fileid in tqdm(brown.fileids(categories=[cat])):
        sents = brown.sents(fileids=[fileid])
        sent_vecs = []
        for sent in sents:
            # convert from a list of tokens to a string
            sent = ' '.join(sent)
            sent = nlp(sent)
            # grab all of the words, find their embedding, sum all embeddings
            word_sum = np.sum([tok.vector for tok in sent], axis=0) # why axis=0?
            # add the now summed embedding to the list for this category
            sent_vecs.append(word_sum)
        category_vectors.append((cat,np.sum(sent_vecs, axis=0)))
    

adventure


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for fileid in tqdm(brown.fileids(categories=[cat])):


  0%|          | 0/29 [00:00<?, ?it/s]

belles_lettres


  0%|          | 0/75 [00:00<?, ?it/s]

editorial


  0%|          | 0/27 [00:00<?, ?it/s]

fiction


  0%|          | 0/29 [00:00<?, ?it/s]

government


  0%|          | 0/30 [00:00<?, ?it/s]

hobbies


  0%|          | 0/36 [00:00<?, ?it/s]

humor


  0%|          | 0/9 [00:00<?, ?it/s]

learned


  0%|          | 0/80 [00:00<?, ?it/s]

lore


  0%|          | 0/48 [00:00<?, ?it/s]

mystery


  0%|          | 0/24 [00:00<?, ?it/s]

news


  0%|          | 0/44 [00:00<?, ?it/s]

religion


  0%|          | 0/17 [00:00<?, ?it/s]

reviews


  0%|          | 0/17 [00:00<?, ?it/s]

romance


  0%|          | 0/29 [00:00<?, ?it/s]

science_fiction


  0%|          | 0/6 [00:00<?, ?it/s]
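
The `# why axis=0?` comment in the cell above can be answered with a small sketch. Stacking the token vectors gives a 2-D array of shape `(n_tokens, dim)`. Summing over `axis=0` collapses the token axis and leaves one `dim`-sized vector, whereas `axis=1` would give one number per token. The toy vectors below are made up for illustration (3 dimensions instead of the 96 used by the spaCy model).

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for a 4-token sentence
token_vectors = [
    np.array([1.0, 0.0, 2.0]),
    np.array([0.5, 1.0, 0.0]),
    np.array([0.0, 2.0, 1.0]),
    np.array([1.5, 1.0, 1.0]),
]

stacked = np.array(token_vectors)        # shape (4, 3): one row per token
sentence_vec = np.sum(stacked, axis=0)   # sum down the rows -> shape (3,)
per_token = np.sum(stacked, axis=1)      # sum across each row -> shape (4,)

print(stacked.shape, sentence_vec.shape, per_token.shape)
print(sentence_vec)
```

The same reasoning applies one level up, where the sentence vectors of a file are summed with `axis=0` to get a single vector per document.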

In [153]:
import pandas as pd

# move the (category, vector) tuples into a DataFrame
keys,values=zip(*category_vectors) # unzip the list of tuples using *
data = pd.DataFrame({'cat':keys,'vectors':values})

In [154]:
data[:3]

| | cat | vectors |
|---|---|---|
| 0 | adventure | [-501.4054, -619.5587, 22.687777, -96.53732, 4... |
| 1 | adventure | [-245.56876, -583.06104, -97.33961, -95.76086,... |
| 2 | adventure | [-222.43176, -467.51605, -98.2939, -108.793495... |


In [155]:
total = len(data)
total

500

#### compute the baselines

In [156]:
print('random baseline {}'.format(1.0/len(cats)))

print('most common baseline?')
for cat in cats:
    print(cat, len(data[data.cat==cat])/total)

random baseline 0.06666666666666667
most common baseline?
adventure 0.058
belles_lettres 0.15
editorial 0.054
fiction 0.058
government 0.06
hobbies 0.072
humor 0.018
learned 0.16
lore 0.096
mystery 0.048
news 0.088
religion 0.034
reviews 0.034
romance 0.058
science_fiction 0.012
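
The "most common baseline" is the accuracy of always predicting the most frequent category. A minimal sketch of that calculation, using the per-category document counts shown in the progress-bar totals above (`counts` and the other names are illustrative, not variables from the notebook):

```python
# Document counts per category, copied from the progress-bar totals above
counts = {
    "adventure": 29, "belles_lettres": 75, "editorial": 27, "fiction": 29,
    "government": 30, "hobbies": 36, "humor": 9, "learned": 80, "lore": 48,
    "mystery": 24, "news": 44, "religion": 17, "reviews": 17, "romance": 29,
    "science_fiction": 6,
}

total_docs = sum(counts.values())
majority_cat = max(counts, key=counts.get)            # most frequent category
majority_baseline = counts[majority_cat] / total_docs # accuracy of always guessing it

print(majority_cat, majority_baseline)
```

A classifier is only useful here if it beats this majority baseline as well as the random one.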


#### split the data into train/test

In [157]:
test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

test.shape, train.shape 

((50, 2), (450, 2))

#### train a classifier

In [158]:
from sklearn import preprocessing

# initializes label encoder
le = preprocessing.LabelEncoder()
# create a list of the training vectors X
X = [x for x in train.vectors]
# learns mapping from original training set, then transforms training labels
y = le.fit_transform(train.cat)

In [159]:
from sklearn.linear_model import LogisticRegression

In [160]:
# multinomial: used when we have more than 2 classes/categories
# lbfgs (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is an optimization algorithm that minimizes the cost function
# lbfgs: fast, works well with small-to-medium datasets, supports multinomial logistic regression and L2 regularization
# the multi_class argument was deprecated; multinomial is used by default when there are more than 2 classes
# L2 regularization: technique to reduce overfitting (the model learning the training data too closely)
clfr = LogisticRegression(solver='lbfgs')

In [161]:
# trains/fits the logistic regression model on the training data (lbfgs is a quasi-Newton method, a relative of gradient descent), minimizing log loss
# then stores the best weights to classify new data
clfr.fit(X,y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
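
The warning above suggests two remedies: raise `max_iter` or scale the features. The summed embeddings have values in the hundreds, which tends to make the optimization harder. Below is a minimal sketch of both remedies on synthetic data. The random arrays only stand in for `X` and `y`, and `max_iter=1000` is an arbitrary choice rather than a tuned setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's data: 450 samples, 96 features,
# 15 classes, with large unscaled values similar to summed embeddings
X_demo = rng.normal(loc=0.0, scale=300.0, size=(450, 96))
y_demo = rng.integers(0, 15, size=450)

# StandardScaler rescales each feature to zero mean and unit variance
# before the data reaches the classifier
scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
scaled_clf.fit(X_demo, y_demo)

train_acc = scaled_clf.score(X_demo, y_demo)
print(f"training accuracy on random labels: {train_acc:.2f}")
```

In the notebook the pipeline would be fitted on the real `X` and `y`. Because the scaler is part of the pipeline, the same scaling is applied automatically when predicting on the test set. This sketch does not show whether scaling improves the test accuracy.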


#### evaluate 

In [162]:
from sklearn.metrics import accuracy_score

In [572]:
test_y = le.transform(test.cat)
test_X = [x for x in test.vectors]

# predict uses the regression model to predict the class labels for test_X
# accuracy_score compares the generated labels to the correct test labels and calculates the % correct
score = accuracy_score(test_y, clfr.predict(test_X))  # argument order is (y_true, y_pred)
score

0.34

### Results

- GoogleNews-vectors-negative300.magnitude 0.4 (w2v)
- wiki-news-300d-1M.magnitude 0.56 (fastText)
- glove.6B.300d.magnitude 0.52 (glove)

In [573]:
test.shape, train.shape 

((50, 2), (450, 2))

In [574]:
from sklearn import preprocessing

test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

# Prep train/test data
le = preprocessing.LabelEncoder() # convert string categories to integer ids
ohe = preprocessing.OneHotEncoder() # convert integer ids to one-hot vectors
le.fit(data.cat)
y = le.transform(train.cat).reshape(-1, 1) # reshape from (n,) to (n, 1); OneHotEncoder expects 2-D input
ohe.fit(y)
y = ohe.transform(y).toarray() # sparse matrix -> dense ndarray
X = np.array([x for x in train.vectors])

X.shape, y.shape

((450, 96), (450, 15))
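
To make the two encoding steps concrete, here is a small NumPy-only sketch of what `LabelEncoder` followed by `OneHotEncoder` produces: each string label becomes an integer id, and each id becomes a row containing a single 1. The four labels are a made-up sample; the scikit-learn encoders remain the ones used in the notebook.

```python
import numpy as np

labels = np.array(["news", "fiction", "humor", "news"])  # toy labels

# Step 1 (like LabelEncoder): map each distinct label to an integer id.
# np.unique returns the sorted classes and, with return_inverse, the ids.
classes, ids = np.unique(labels, return_inverse=True)

# Step 2 (like OneHotEncoder): row i of the identity matrix is the
# one-hot vector for class i
one_hot = np.eye(len(classes))[ids]

print(classes)
print(ids)
print(one_hot)
```

Each row of `one_hot` sums to 1, so it can be read as a probability distribution that puts all of its mass on the true class. That is the target format `categorical_crossentropy` expects.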

In [653]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam

### Define the Model

In [797]:
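model = Sequential()
model.add(Flatten())  # no-op for 2-D input of shape (n, 96)
model.add(Dense(96, activation='swish'))  # hidden layer 1
model.add(Dropout(0.2))  # randomly zero 20% of activations during training
model.add(Dense(96, activation='relu'))  # hidden layer 2
model.add(Dropout(0.2))
model.add(Dense(15, activation='softmax'))  # one output probability per category

# categorical cross-entropy matches the one-hot encoded labels
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics=['accuracy'])

model.fit(X, y, epochs=150, batch_size=10, verbose=1)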

Epoch 1/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - accuracy: 0.0455 - loss: 711.7973      
Epoch 2/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.0926 - loss: 429.3835
Epoch 3/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.0864 - loss: 342.7373    
Epoch 4/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.0875 - loss: 286.1534
Epoch 5/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.1375 - loss: 258.2058
Epoch 6/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.1297 - loss: 248.6000    
Epoch 7/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.1977 - loss: 234.0627
Epoch 8/150
[1m45/45[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.2323 - loss: 212.0981  
Epoch 9/

<keras.src.callbacks.history.History at 0x465257cd0>
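
To illustrate what the `Dropout(0.2)` layers in the model above do during training, here is a NumPy sketch of inverted dropout, the behavior the Keras documentation describes. Each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected sum is unchanged. At inference time the values pass through untouched. The function name `dropout` and the toy input are illustrative and not part of Keras.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units while training."""
    if not training or rate == 0.0:
        return activations                              # inference: identity
    keep_mask = rng.random(activations.shape) >= rate   # True = keep the unit
    return activations * keep_mask / (1.0 - rate)       # rescale survivors

rng = np.random.default_rng(0)
acts = np.ones((2, 10))                                 # toy layer output

train_out = dropout(acts, rate=0.2, training=True, rng=rng)
test_out = dropout(acts, rate=0.2, training=False, rng=rng)

print(train_out)   # entries are either 0.0 or 1.25
print(test_out)    # unchanged
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single neuron, which is why dropout acts as a regularizer.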

In [798]:
_, train_accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (train_accuracy*100))

15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.7185 - loss: 6.9124
Accuracy: 67.33


In [799]:
y_test = le.transform(test.cat).reshape(-1, 1) # reshape from (n,) to (n, 1)
y_test = ohe.transform(y_test).toarray() # sparse matrix -> dense ndarray
X_test = np.array([x for x in test.vectors])

X_test.shape, y_test.shape

((50, 96), (50, 15))

In [800]:
# check test accuracy
_, test_accuracy = model.evaluate(X_test, y_test)
print('Accuracy: %.2f' % (test_accuracy*100))

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step - accuracy: 0.4600 - loss: 22.6459
Accuracy: 44.00


### Q&A

1. What would you say is the neural network "learning"?<br>
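I would say that the neural network is learning its weights (and biases), which are the strengths of the connections between the artificial neurons in the model. Training adjusts them so that the predicted category matches the training labels as closely as possible.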

2. How does the depth or width of the network affect the training and the results?<br>
From what I have read online, the depth of the neural network should match how complex the training data is. If we were to train on complex hierarchical features like images and audio, a deeper network would help; however, adding more layers increases training time. The width of the network (how many neurons are in a layer) defines how many features the network can capture on a given layer. Increasing the width of a shallow network could increase performance, but we have to be careful about overfitting (the model memorizing the training data itself).

3. As you made changes to the network, what did you notice about the hyperparameters (network depth, number of nodes, learning rate, etc.) and how they interact with each other? We said that neural networks are learning non-convex problems, but what about finding the best parameters? Is that a convex problem?<br>
I noticed that adding network depth usually caused overfitting: the training score was high while the test score was lower. To combat this I added a few dropout layers at 20%, which increased performance on the test set. A learning rate that was too high kept the model from ever converging, while a learning rate that was too low took too long to converge. I don't really have a great grasp on how all of the hyperparameters interact with one another, but a few are clearly related. Depth and dropout/regularization are related, as more layers generally need more regularization. I think the learning rate and the number of epochs are related because a higher learning rate lets us use fewer epochs, while a lower learning rate can be more accurate but needs more time. I think finding the best parameters is not a convex problem, because changing a parameter and getting a lower loss doesn't necessarily mean we are on the right track. There are many local minima and maxima in the actual training, which makes it non-convex.

4. What is regularization? Why is it important?<br>
Regularization is a technique used to prevent overfitting. It's important because a deep neural network can otherwise fit the specifics of the training data too closely and generalize poorly to new data. One way to prevent this is dropout, which randomly deactivates neurons in the model while training.

5. Which activation functions did you choose (besides logistic/sigmoid)? For one of the activation functions you tried, spend some time learning about it. Whereas logistic/sigmoid maps from inputs to a probability between 0-1, what does the activation function you chose do?<br>
I tried linear, leaky_relu, relu, and swish. Swish was my favorite because it was described as a better relu. Swish's calculation is $x \cdot \mathrm{sigmoid}(x)$: the sigmoid squashes $x$ to a value between 0 and 1, which is then multiplied by the input. The key here is that it's differentiable everywhere, which gets rid of the sharp corner that relu has at $x=0$.
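
As a quick numerical check of the swish description above, the sketch below implements swish and ReLU in NumPy. These helper functions are written here for illustration only; the model itself uses the Keras built-ins via `activation='swish'` and `activation='relu'`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    """swish(x) = x * sigmoid(x): the input scaled by a gate in (0, 1)."""
    return x * sigmoid(x)

def relu(x):
    return np.maximum(0.0, x)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(xs))
print(relu(xs))
```

Unlike ReLU, swish lets small negative values through (swish(-1) is about -0.27), and it gets close to ReLU as the input grows large in either direction.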