### Topic Modeling with Word Embeddings, TensorFlow, and Keras

- We'll be using the [pymagnitude](https://github.com/plasticityai/magnitude) library

In [3]:
from pymagnitude import Magnitude

### You'll need to download embedding 'model' files manually

Start by downloading one of the following:

- [GloVe](http://magnitude.plasticity.ai/glove/medium/glove.6B.300d.magnitude)
- [word2vec](http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude)
- [fastText](http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M.magnitude)


In [4]:
# change path to the name/path of the embedding file you downloaded
path = 'glove.6B.300d.magnitude'

vectors = Magnitude(path)

In [5]:
len(vectors)

400000

In [6]:
vectors.dim # this is how big the vectors are for each word

300

In [7]:
"cat" in vectors

True

In [8]:
for key, vector in vectors[500:510]:
    print(key, vector[:3])

working [-0.0332886  0.0680554 -0.0059854]
community [-0.0431237 -0.0621079  0.0222967]
eight [-0.0645148  0.0757905 -0.0573475]
groups [-0.0194742  0.0416667 -0.0084207]
despite [ 0.0479242  0.02233   -0.0047541]
level [-0.0848005  0.1553303  0.0087196]
largest [-0.0233524  0.04329   -0.0260668]
whose [-0.0069965  0.0276343  0.0280235]
attacks [0.03818   0.0117073 0.0706972]
germany [ 0.0062351  0.0252957 -0.0618449]


In [9]:
vectors.query("cat")[:3]

array([-0.0463976,  0.0525527, -0.007488 ], dtype=float32)

In [10]:
vectors.query(["cat","dog"])[0][:3]

array([-0.0463976,  0.0525527, -0.007488 ], dtype=float32)

In [11]:
vectors.distance("cat", "dog")

0.7979039

In [12]:
vectors.distance("cat", "car")

1.3062327
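The `distance` values above appear consistent with Euclidean distance between unit-length vectors, which is tied to cosine similarity by d = sqrt(2 - 2·cos). The sketch below checks that relationship with NumPy only. The helper names are made up for illustration and nothing here calls pymagnitude. Plugging in the cat/dog similarity of 0.6816746 that `most_similar` reports further down gives roughly the 0.7979 distance shown above.

```python
import math
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def euclidean_from_cosine(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine similarity."""
    return math.sqrt(2.0 - 2.0 * cos_sim)

# Two arbitrary vectors, normalized to unit length (illustrative values only)
a = unit([1.0, 2.0, 3.0])
b = unit([2.0, 0.5, 1.0])

cos_sim = float(a @ b)
direct = float(np.linalg.norm(a - b))

# Both routes give the same distance
print(direct, euclidean_from_cosine(cos_sim))

# cat/dog: a similarity of 0.6816746 corresponds to a distance of about 0.7979
cat_dog = euclidean_from_cosine(0.6816746)
print(round(cat_dog, 4))
```

A smaller distance therefore means a higher cosine similarity, which is why "dog" (0.80) is closer to "cat" than "car" (1.31) is.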

In [13]:
vectors.most_similar_to_given("cat", ["dog", "television", "laptop"]) 

'dog'

In [14]:
vectors.doesnt_match(["breakfast", "cereal", "dinner", "lunch"])

'cereal'

In [15]:
vectors.most_similar("cat", topn = 5)

[('dog', 0.6816746),
 ('cats', 0.68158376),
 ('pet', 0.5870366),
 ('dogs', 0.5407667),
 ('feline', 0.489797)]

In [16]:
vectors.most_similar(positive = ["woman", "king"], negative = ["man"])

[('queen', 0.6713276),
 ('princess', 0.5432625),
 ('throne', 0.53861046),
 ('monarch', 0.53475744),
 ('daughter', 0.49802512),
 ('mother', 0.49564433),
 ('elizabeth', 0.48326522),
 ('kingdom', 0.47747076),
 ('prince', 0.46682397),
 ('wife', 0.4647327)]
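The analogy query above can be thought of as vector arithmetic followed by a nearest-neighbor search. The sketch below illustrates the idea on hand-built 3-d vectors. The toy embeddings and the `analogy` helper are hypothetical, and the library's actual implementation may normalize or weight vectors differently.

```python
import numpy as np

# Hypothetical 3-d embeddings: [royalty, male, female].
# Real embeddings are learned from text, not hand-built.
toy = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.8, 1.0, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, embeddings):
    """Word closest (by cosine) to sum(positive) - sum(negative), excluding the query words."""
    target = sum(embeddings[w] for w in positive) - sum(embeddings[w] for w in negative)
    candidates = [w for w in embeddings if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

best = analogy(["woman", "king"], ["man"], toy)
print(best)
```

Excluding the query words matters: without that filter, the nearest neighbor of `king - man + woman` is often one of the input words themselves.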

### Topic Modeling

- Given a document, determine the topic of the document
- For this task, we'll use the Brown corpus of texts accessible via NLTK

In [17]:
import numpy as np
from nltk.corpus import brown
from tqdm.notebook import tqdm  # tqdm displays a progress bar

category_vectors = []

cats = brown.categories()

# for each category
for cat in cats:
    print(cat)
    # grab all of the documents
    for fileid in tqdm(brown.fileids(categories=[cat])):
        words = list(map(str.lower, brown.words(fileids=[fileid])))
        # look up the embedding of every in-vocabulary word and sum them;
        # axis=0 sums across words, leaving one (1, 300) vector per document
        word_sum = np.sum([vectors.query([w]) for w in words if w in vectors], axis=0)
        # add the summed embedding to the list, tagged with its category
        category_vectors.append((cat, word_sum))


adventure


  0%|          | 0/29 [00:00<?, ?it/s]

belles_lettres


  0%|          | 0/75 [00:00<?, ?it/s]

editorial


  0%|          | 0/27 [00:00<?, ?it/s]

fiction


  0%|          | 0/29 [00:00<?, ?it/s]

government


  0%|          | 0/30 [00:00<?, ?it/s]

hobbies


  0%|          | 0/36 [00:00<?, ?it/s]

humor


  0%|          | 0/9 [00:00<?, ?it/s]

learned


  0%|          | 0/80 [00:00<?, ?it/s]

lore


  0%|          | 0/48 [00:00<?, ?it/s]

mystery


  0%|          | 0/24 [00:00<?, ?it/s]

news


  0%|          | 0/44 [00:00<?, ?it/s]

religion


  0%|          | 0/17 [00:00<?, ?it/s]

reviews


  0%|          | 0/17 [00:00<?, ?it/s]

romance


  0%|          | 0/29 [00:00<?, ?it/s]

science_fiction


  0%|          | 0/6 [00:00<?, ?it/s]
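The loop above turns each document into a single vector by summing its word embeddings. A minimal sketch of that step on toy data follows. The 4-d `toy_vectors` dictionary and the `document_vector` helper are hypothetical stand-ins for the 300-d GloVe lookup.

```python
import numpy as np

# Hypothetical 4-d embeddings standing in for the 300-d GloVe vectors
toy_vectors = {
    "the": np.array([0.1, 0.0, 0.2, 0.0]),
    "cat": np.array([0.5, 0.3, 0.0, 0.1]),
    "sat": np.array([0.0, 0.4, 0.1, 0.2]),
}

def document_vector(words, embeddings):
    """Sum the embeddings of all in-vocabulary words into one vector."""
    found = [embeddings[w] for w in words if w in embeddings]
    stacked = np.stack(found)      # shape: (n_words, dim)
    return stacked.sum(axis=0)     # axis=0 sums over words -> shape: (dim,)

doc = ["the", "cat", "sat", "zzz"]  # "zzz" is out of vocabulary and is skipped
vec = document_vector(doc, toy_vectors)
print(vec.shape, vec)
```

In the notebook, `vectors.query([w])` returns a `(1, 300)` array for each word, so the summed result is `(1, 300)` rather than `(300,)`. That is why later cells take `x[0]` from each stored vector.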

In [18]:
import pandas as pd

keys,values=zip(*category_vectors) # unzip using a *

data = pd.DataFrame({'cat':keys,'vectors':values})

In [19]:
data[:3]

| | cat | vectors |
|---|---|---|
| 0 | adventure | [[-62.71034, 23.958982, -9.739936, -54.190174,... |
| 1 | adventure | [[-54.73519, 19.858883, -8.82027, -59.00295, -... |
| 2 | adventure | [[-46.095287, 24.262121, -7.475177, -59.681107... |


In [20]:
total = len(data)
total

500

#### compute the baselines

In [21]:
print('random baseline {}'.format(1.0/len(cats)))

print('most common baseline?')
for cat in cats:
    print(cat, len(data[data.cat==cat])/total)

random baseline 0.06666666666666667
most common baseline?
adventure 0.058
belles_lettres 0.15
editorial 0.054
fiction 0.058
government 0.06
hobbies 0.072
humor 0.018
learned 0.16
lore 0.096
mystery 0.048
news 0.088
religion 0.034
reviews 0.034
romance 0.058
science_fiction 0.012
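The two baselines can also be computed directly from the per-category file counts. The sketch below copies those counts from the progress-bar totals above; the variable names are illustrative only.

```python
# File counts per category, copied from the progress-bar totals above
file_counts = {
    "adventure": 29, "belles_lettres": 75, "editorial": 27, "fiction": 29,
    "government": 30, "hobbies": 36, "humor": 9, "learned": 80, "lore": 48,
    "mystery": 24, "news": 44, "religion": 17, "reviews": 17, "romance": 29,
    "science_fiction": 6,
}

total_docs = sum(file_counts.values())

# Random baseline: guess uniformly among the categories
random_baseline = 1.0 / len(file_counts)

# Most-common baseline: always predict the largest category
majority_cat = max(file_counts, key=file_counts.get)
majority_baseline = file_counts[majority_cat] / total_docs

print(total_docs, random_baseline)
print(majority_cat, majority_baseline)
```

Any classifier worth keeping should beat the most-common baseline (always predicting `learned`), not just the random one.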


In [22]:
len(data[data.cat==cat])

6

#### split the data into train/test

In [23]:
test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

test.shape, train.shape 

((50, 2), (450, 2))

#### train a classifier

In [24]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(data.cat) 
X = [x[0] for x in train.vectors]
y = le.transform(train.cat)

In [25]:
from sklearn.linear_model import LogisticRegression

In [26]:
clfr = LogisticRegression(multi_class='multinomial', solver='lbfgs')

In [27]:
clfr.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(multi_class='multinomial')

In [28]:
from sklearn import preprocessing

# same split as before (same random_state), repeated so this cell is self-contained
test = data.sample(frac=0.1,random_state=200)
train = data.drop(test.index)

le = preprocessing.LabelEncoder() # convert category names to integer labels
ohe = preprocessing.OneHotEncoder() # convert integer labels to distributions (i.e., 1-hot vectors)
le.fit(data.cat) 
y = le.transform(train.cat).reshape(-1, 1) # OneHotEncoder expects a 2-D column, so reshape (n,) to (n, 1)
ohe.fit(y)


OneHotEncoder()

In [29]:
y = ohe.transform(y).toarray() # dense ndarray (todense() would return np.matrix)

X = np.array([x[0] for x in train.vectors])

X.shape, y.shape

((450, 300), (450, 15))

In [30]:
print(y[0:5])

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
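One-hot encoding itself is simple enough to write by hand, which may help demystify what the encoder produces. The sketch below uses only NumPy. The `one_hot` helper is illustrative and is not how scikit-learn implements `OneHotEncoder`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

labels = np.array([0, 2, 1])
encoded = one_hot(labels, num_classes=3)
print(encoded)

# argmax along each row recovers the original integer labels
decoded = encoded.argmax(axis=1)
print(decoded)
```

The `argmax` step is the same trick the Testing section uses to turn predictions and one-hot labels back into class indices.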


In [31]:
# Hold out 10% of the training data as a validation set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.1,random_state = 0)

In [32]:
len(X[0])

300

In [33]:
import keras
from keras import layers
from keras.models import Model, Sequential
from keras.layers import Dense

# Functional-API version of the network:
# 300-d document vector in, probabilities over the 15 categories out
inputs = keras.Input(shape=(300,), name='doc_vector')
x = layers.Dense(1024, activation='relu')(inputs)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dense(512, activation='relu')(x)
preds = layers.Dense(15, activation='softmax')(x)
model = Model(inputs=inputs, outputs=preds)

#### Neural Network

In [51]:
model = Sequential()
model.add(Dense(1024, input_dim=300, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(15, activation='softmax'))

In [39]:
# Summarize the network's layers and parameter counts
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 1024)              308224    
_________________________________________________________________
dense_4 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
dense_5 (Dense)              (None, 15)                15375     
Total params: 1,373,199
Trainable params: 1,373,199
Non-trainable params: 0
_________________________________________________________________
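The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch follows; the `dense_params` helper is illustrative only.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input dimension followed by the width of each Dense layer
layer_sizes = [300, 1024, 1024, 15]

per_layer = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Most of the roughly 1.37 million parameters sit in the 1024-by-1024 hidden layer, which is a lot of capacity for a training set of a few hundred documents.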


In [40]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [47]:
history = model.fit(X_train, y_train, validation_data = (X_test,y_test), epochs=20, batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


#### Testing

In [49]:
y_pred = model.predict(X_test)
# Convert predicted probabilities to class labels
pred = np.argmax(y_pred, axis=1)

# Convert the one-hot encoded test labels back to class labels
# (named `true` so the `test` DataFrame from the earlier split is not overwritten)
true = np.asarray(np.argmax(y_test, axis=1)).ravel()

from sklearn.metrics import accuracy_score
a = accuracy_score(true, pred)
print('Accuracy is:', a*100)

Accuracy is: 48.888888888888886


In [40]:
import tensorflow as tf

print("TensorFlow version: {}".format(tf.__version__))
print("Eager execution: {}".format(tf.executing_eagerly()))

TensorFlow version: 2.4.1
Eager execution: True


In [None]:
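#model = keras.Model(inputs=X, outputs=y)

# Alternative model built with tf.keras, sized for the 300-d document vectors and 15 categories
model = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(300,)),  # input shape required
  tf.keras.layers.Dense(10, activation=tf.nn.relu),
  tf.keras.layers.Dense(15)  # raw logits, one per category
])


model.compile(
    optimizer=tf.keras.optimizers.RMSprop(),  # Optimizer
    # Loss function to minimize; expects integer labels (not one-hot) and,
    # because the last layer has no softmax, from_logits=True
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    # List of metrics to monitor
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)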


In [56]:
#model.compile(optimizer='sgd', loss=tf.keras.losses.KLDivergence)
#model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss=tf.keras.losses.KLDivergence(), )

In [57]:
len(model.weights)

6

In [62]:
type(X)

numpy.ndarray

In [48]:
#model.fit(X,y,batch_size=64,epochs=10)

In [47]:
#keras.Model.fit(X,y, epochs=10)

#### evaluate 

In [43]:
from sklearn.metrics import accuracy_score

In [44]:
test_y = le.transform(test.cat)
test_X = [x[0] for x in test.vectors]

score = accuracy_score(clfr.predict(test_X), test_y)

In [45]:
print(path, score)

glove.6B.300d.magnitude 0.52


### What would you say is the neural network "learning"?
- It learns a mapping from a document's summed word embeddings to a topic, and it seems to lean toward the most common topics.


### How does the depth or width of the network affect the training and the results?
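- In my experiments, increasing the width and depth of the network increased accuracy. At first I used three layers with widths of only 64 and 32, which gave about 60% accuracy. When I changed the width to 1024, accuracy rose above 90%. These figures are likely training accuracy, since the accuracy on held-out data reported in the Testing section is about 49%.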

### As you made changes to the network, what did you notice about how the parameters (network depth, number of nodes, learning rate, etc.) interact with each other? We said that neural networks are learning non-convex problems, but what about finding the best parameters? Is that a convex problem?
- Finding the best parameters for a single layer could be a convex problem (logistic regression, for example, has a convex loss), but with more than one layer it may not be convex.


### What is regularization? Why is it important?
- Regularization discourages the model from becoming overly complex or flexible, typically by adding a penalty on the coefficients to the loss function that training minimizes. It is important because it helps avoid overfitting to the training data.
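
As a concrete illustration of the penalty idea, here is a minimal NumPy sketch of an L2-regularized loss on a tiny linear model. The data, the `lam` penalty strength, and the helper names are all hypothetical. This is one common form of regularization, not the only one, and it was not used in the network above.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))

def l2_regularized_loss(w, X, y, lam):
    """Data loss plus lam * ||w||^2; lam is a hypothetical penalty strength."""
    return mse(w, X, y) + lam * float(np.sum(w ** 2))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])

small_w = np.array([1.0, 2.0])     # fits the data exactly
large_w = np.array([10.0, 20.0])   # much larger coefficients

print(l2_regularized_loss(small_w, X, y, lam=0.0))   # data loss only
print(l2_regularized_loss(small_w, X, y, lam=0.1))   # data loss plus penalty
print(l2_regularized_loss(large_w, X, y, lam=0.1))   # penalty grows with the weights
```

Because the penalty grows with the size of the weights, the optimizer is nudged toward smaller coefficients unless larger ones reduce the data loss enough to pay for themselves.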


### Which activation functions did you choose (besides logistic/sigmoid)? For one of the activation functions you tried, spend some time learning about it. Whereas logistic/sigmoid maps from inputs to a probability between 0-1, what does the activation function you chose do?
- In my Sequential model the hidden layers use ReLU. This function returns the larger of 0 and x, i.e. max(0, x): negative inputs become 0 and positive inputs pass through unchanged, so the output is unbounded above rather than squeezed between 0 and 1.
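
The difference between the two activations is easy to see on a few sample inputs. The sketch below defines both with NumPy; the helper names and input values are illustrative only.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))      # negatives are clipped to 0, positives are unchanged
print(sigmoid(x))   # every value lies strictly between 0 and 1
```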

### Notes
- At first I couldn't set up my environment or install pymagnitude. Eventually, using Jake Carns's Trello comment, I set up another environment and installed pymagnitude.
- Thus, I am turning this in now that the environment is set up.


In [None]:
#%pip install okpy

In [56]:
from client.api.notebook import Notebook
ok = Notebook('a5.ok')
ok.auth(inline=True)

Assignment: A5 Topic Modeling with MLPs
OK, version v1.18.1

Successfully logged in as SajiaZafreen@u.boisestate.edu


In [57]:
ok.submit()

Saving notebook... Saved 'A5-topic-modeling.ipynb'.
Submit... 0.0% complete
Could not submit: Late Submission of bsu/nlp/sp21/a5
Backup... 100% complete
Backup past deadline by 5 days, 14 hours, 24 minutes, and 6 seconds

