# Clustering User Profiles Using Word Embeddings

## Introduction
Our objective is to find, for every user, the pen pal that might be the most suitable. This is a challenging problem for two reasons. First, we cannot train a neural network
from scratch because we do not have any labeled data; in other words, we have no ground truth telling us how well two users like each other. Second, all
we will have for each user is unstructured text, which is a difficult input to process (any word can have a number of forms and synonyms, making word-count-based methods
hard to apply and largely meaningless). Luckily, [unsupervised learning](https://medium.com/machine-learning-for-humans/unsupervised-learning-f45587588294) can address both issues.

In [3]:
import bcolz
import pickle
import numpy as np
import torch
from torch import nn, optim, utils

## Part 1: Pretrained word embeddings

We want to map unstructured text to vectors in space and then find the most similar vector
(the nearest neighbor), which will (hopefully) belong to the most similar profile.
Luckily, there are many ready-made word embeddings: https://www.youtube.com/watch?v=6xPnEh_tJEc (watch until at least 14:50, but it is all useful to know)
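
The nearest-neighbor idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the profile names and three-dimensional vectors are made up, the helper `nearest_profile` is hypothetical, and cosine similarity is one possible choice of similarity measure (the KNN classifier used later in this notebook relies on Euclidean distance by default).

```python
import numpy as np

def nearest_profile(query_vec, profile_vecs):
    """Return the name and cosine similarity of the profile closest to query_vec.

    profile_vecs is a dict mapping a profile name to a 1-D numpy array.
    (Hypothetical helper, not part of the original notebook.)
    """
    names = list(profile_vecs)
    matrix = np.stack([profile_vecs[name] for name in names])
    # Cosine similarity = dot product divided by the product of the norms.
    sims = matrix @ query_vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))
    best = int(np.argmax(sims))
    return names[best], float(sims[best])

# Toy profile vectors, invented for illustration only.
toy_profiles = {
    'alice': np.array([1.0, 0.0, 0.0]),
    'bob': np.array([0.0, 1.0, 0.0]),
    'carol': np.array([0.9, 0.1, 0.0]),
}
best_name, best_sim = nearest_profile(np.array([1.0, 0.2, 0.0]), toy_profiles)
print(best_name, round(best_sim, 3))
```

In the real setting, each profile vector would be built from the word embeddings of the profile's text rather than written by hand.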

Therefore, our first step is to load an existing word embedding. We will begin with a
[pretrained GloVe embedding](https://nlp.stanford.edu/projects/glove/), specifically the glove.6B package.

In [4]:
glove_path = './glove.6B/'

Note that glove.6B provides embeddings in several different sizes (50, 100, 200, and 300 dimensions). We will begin with the embedding of size 50.

In [5]:
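words = []
idx = 0
word2idx = {}
# Start from a one-element placeholder array; it is dropped again below via vectors[1:].
vectors = bcolz.carray(np.zeros(1), rootdir=glove_path+'6B.50.dat', mode='w')

# Each line of the GloVe text file is a word followed by its 50 vector components.
with open(f'{glove_path}glove.6B.50d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        # Convert the string components to 64-bit floats.
        vect = np.array(line[1:]).astype(np.float64)
        vectors.append(vect)

# Drop the placeholder and reshape the flat array into one row per word (400000 words x 50 dimensions).
vectors = bcolz.carray(vectors[1:].reshape((400000, 50)), rootdir=glove_path+'6B.50.dat', mode='w')
vectors.flush()
# Cache the word list and the word-to-index mapping next to the vectors.
with open(glove_path+'6B.50_words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open(glove_path+'6B.50_idx.pkl', 'wb') as f:
    pickle.dump(word2idx, f)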

In [6]:
vectors = bcolz.open(glove_path+'/6B.50.dat')[:]
words = pickle.load(open(glove_path+'/6B.50_words.pkl', 'rb'))
word2idx = pickle.load(open(glove_path+'/6B.50_idx.pkl', 'rb'))
glove = {w: vectors[word2idx[w]] for w in words}
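
The cells above cache the vectors on disk with bcolz, but the underlying text format is simple: each line holds a token followed by its space-separated vector components. As a minimal sketch of that parsing step alone, the snippet below reads the format straight into a plain dictionary. The helper name `load_glove_text` and the two sample lines are assumptions made for illustration; they are not real GloVe values.

```python
import io
import numpy as np

def load_glove_text(handle):
    """Parse GloVe-style text lines ("word v1 v2 ...") into a dict of numpy arrays.

    (Hypothetical helper; it skips the bcolz cache used in the notebook.)
    """
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings[parts[0]] = np.array([float(x) for x in parts[1:]])
    return embeddings

# Two made-up lines in the same layout as glove.6B.50d.txt, but with only 3 dimensions.
sample_file = io.StringIO("the 0.1 0.2 0.3\nmovie -0.5 0.25 1.0\n")
toy_glove = load_glove_text(sample_file)
print(sorted(toy_glove), toy_glove['movie'])
```

For the full 400000-word file, a dictionary like this is held entirely in memory, which is one reason a disk-backed cache can be convenient.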

Our first test case is to use word embeddings to classify
movie reviews as positive or negative. The data is from [Dr. Lillian Lee's IMDb database](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf).

We will use the ready-made [K-Nearest Neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) (make sure you understand this one) from the sklearn package.
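
To make the behavior of K-Nearest Neighbors concrete before applying it to review vectors, here is a small sketch on made-up two-dimensional points. The data, labels, and the choice of `n_neighbors=3` are assumptions for illustration only; each query point simply receives the majority label of its three closest training points.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: three points near the origin (label 0) and three near (5, 5) (label 1).
toy_x_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y_train = [0, 0, 0, 1, 1, 1]

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(toy_x_train, toy_y_train)

# Each query is labeled by a majority vote among its 3 nearest training points.
toy_pred = toy_knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(toy_pred)
```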

In [10]:
# KNN
import nltk
import os
import collections
from sklearn.neighbors import KNeighborsClassifier 
POS_PATH = '../../TA/bayesian/posTrain/'
NEG_PATH = '../../TA/bayesian/negTrain/'
POS_TEST_PATH = '../../TA/bayesian/posTest/'
NEG_TEST_PATH = '../../TA/bayesian/negTest/'
tokenizer = nltk.tokenize.RegexpTokenizer(r'[a-zA-Z]+[\']*[a-zA-Z]+')
# These are common words that do not make our vectors more sensitive to the semantics of the text,
# so we remove them
common_words = ['the', 'an', 'in', 'of', 'to', 'and', 'a', 'for',
                'that', 'on', 'is', 'he', 'as', 'it', 'by', 'at', 'my', 'i']
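
To see what the tokenizer pattern and the common-word list do to a sentence, the sketch below applies the same regular expression with the standard-library `re` module, which should yield the same tokens as `RegexpTokenizer` for this pattern. Two things here are assumptions rather than part of the cell above: the helper name `tokenize_and_filter` and the lowercasing step. Lowercasing is added because the glove.6B vocabulary is uncased, so a capitalized token such as "The" would otherwise match neither the common-word list nor the vocabulary.

```python
import re

# Same pattern as the RegexpTokenizer above: letters, optional apostrophes, letters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+[\']*[a-zA-Z]+")
COMMON_WORDS = {'the', 'an', 'in', 'of', 'to', 'and', 'a', 'for',
                'that', 'on', 'is', 'he', 'as', 'it', 'by', 'at', 'my', 'i'}

def tokenize_and_filter(text, common_words=COMMON_WORDS):
    """Lowercase the text, extract word tokens, and drop common words (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in common_words]

example_tokens = tokenize_and_filter("The movie wasn't bad, I liked it a lot!")
print(example_tokens)
```

Note that the pattern needs at least two letters to match, so single-letter words such as "a" and "i" are never produced as tokens in the first place.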

In [8]:
# Number of words in the vocabulary and the length of each vector (will be 50 in the training example)
NUM_WORDS = len(word2idx)
VEC_DIM = len(glove['the'])

In [9]:
# Given a directory of text files, we want one vector per file. This function returns a dictionary
# mapping each file name to its vector.
def get_file_vectors(dir_path,vocab,tokenizer,VEC_DIM,common_words=[]):
    file_list = [dir_path+f for f in os.listdir(dir_path)]
    vec_dict = dict()
    for file_name in file_list:
        file_vec = np.zeros(VEC_DIM)
        n = 0
        with  open(file_name,'r',encoding='ISO-8859-1', errors='ignore') as f:
            rawText = f.read()
            counter = collections.Counter(tokenizer.tokenize(rawText))
            for word in counter:
                if word in common_words:
                    continue
                if word in vocab:
                    # If we just add all vectors we will have unbalanced vector values
                    # because some profiles are simply longer, so we normalize by the number of words.
                    n += #......
                    file_vec += vocab[word]*#......
        vec_dict[file_name] = file_vec / #......
    return vec_dict

SyntaxError: invalid syntax (<ipython-input-9-0a4583657db3>, line 18)
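
The `#......` placeholders in the cell above are left open, which is what triggers the syntax error. One possible way to realize the idea described in its comments is a count-weighted average of the word vectors, sketched below on a list of tokens instead of files. The helper name `average_text_vector`, the toy two-dimensional vocabulary, and the decision to return a zero vector when no known word is found are assumptions for illustration, not the notebook's original solution.

```python
import collections
import numpy as np

def average_text_vector(tokens, vocab, vec_dim, common_words=()):
    """Average the embeddings of all known, non-common tokens, weighted by their counts."""
    counter = collections.Counter(tokens)
    text_vec = np.zeros(vec_dim)
    n = 0  # total number of token occurrences that contributed a vector
    for word, count in counter.items():
        if word in common_words or word not in vocab:
            continue
        n += count
        text_vec += vocab[word] * count
    # Dividing by n keeps long and short texts on the same scale.
    # Assumption: fall back to the zero vector if nothing was found.
    return text_vec / n if n > 0 else text_vec

# Made-up 2-D "embeddings" for illustration only.
toy_vocab = {'good': np.array([1.0, 0.0]), 'movie': np.array([0.0, 1.0])}
toy_vec = average_text_vector(['good', 'good', 'movie', 'the', 'zzz'], toy_vocab, 2, common_words=['the'])
print(toy_vec)
```

Here "good" appears twice and "movie" once, so the result is two thirds of the first vector plus one third of the second; "the" is filtered out and "zzz" is ignored because it is not in the vocabulary.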

In [177]:
pos_train_vec = get_file_vectors(POS_PATH,glove,tokenizer,VEC_DIM,common_words)
neg_train_vec = get_file_vectors(NEG_PATH,glove,tokenizer,VEC_DIM,common_words)
pos_test_vec = get_file_vectors(POS_TEST_PATH,glove,tokenizer,VEC_DIM,common_words)
neg_test_vec = get_file_vectors(NEG_TEST_PATH,glove,tokenizer,VEC_DIM,common_words)

In [178]:
x_train = list(pos_train_vec.values()) + list(neg_train_vec.values())
y_train = list(np.ones(len(pos_train_vec)))+list(np.zeros(len(neg_train_vec)))

In [184]:
classifier = KNeighborsClassifier(n_neighbors=5)  
classifier.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [185]:
x_test = list(pos_test_vec.values()) + list(neg_test_vec.values())

In [186]:
y_pred = classifier.predict(x_test)

In [187]:
y_test = list(np.ones(len(pos_test_vec)))+list(np.zeros(len(neg_test_vec)))

In [188]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[58 32]
 [37 52]]
              precision    recall  f1-score   support

         0.0       0.61      0.64      0.63        90
         1.0       0.62      0.58      0.60        89

   micro avg       0.61      0.61      0.61       179
   macro avg       0.61      0.61      0.61       179
weighted avg       0.61      0.61      0.61       179
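
The numbers in the report can be re-derived from the confusion matrix alone, which helps in reading both outputs. The sketch below assumes sklearn's convention that rows are true labels and columns are predicted labels, and copies the matrix printed above; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[58, 32],
               [37, 52]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)

print('accuracy ', round(float(accuracy), 2))
print('precision', np.round(precision, 2))
print('recall   ', np.round(recall, 2))
print('f1       ', np.round(f1, 2))
```

With 110 of 179 reviews classified correctly on two classes of nearly equal size, the averaged 50-dimensional vectors with 5 neighbors do better than random guessing, but only modestly.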

