# Initialize Models
This notebook will walk you through building and saving the most basic 
models we used for analyzing our text data.

We first import the libraries and utility files we are going to be using.

In [None]:
# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim

# Import utility files
from utils import read_df, remove_links, clean_sentence, save_object, load_object

#### Setup directories

If this is the first time you are running this analysis,
we first set up all the directories we need
to save and load the models we will be using.

In [None]:
import os
directories = ['objects', 'models', 'clusters', 'matricies']
for dirname in directories:
    # exist_ok avoids an error when the directory is already present
    os.makedirs(dirname, exist_ok = True)

#### Name Model

Before beginning the rest of our project, we select a name for our model.
This name will be used to save and load the files for this model.

In [None]:
model_name = "example_model"

#### Parse and Clean Data

We first parse and clean our data. Our data is assumed to be in csv format, 
in a directory labeled 'data'.

In [None]:
# Get the data from the csv
df = read_df('data',extension = "/*.csv")

In [None]:
# Do an inspection of our data to ensure nothing went wrong
df.info()

In [None]:
df.head()

In [None]:
# Clean the text in the dataframe
df = df.replace(np.nan, '', regex = True)
df = df.replace(r"\[deleted\]", '', regex = True)
df["rawtext"] = df["title"] + " " + df["selftext"]
df["cleantext"] = df["rawtext"].apply(remove_links).apply(clean_sentence)
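The helpers `remove_links` and `clean_sentence` come from `utils` and are not shown in this notebook. As a rough illustration only, here is a minimal sketch of what such helpers might do. The names `remove_links_sketch` and `clean_sentence_sketch`, the URL pattern, and the set of characters kept are all assumptions, and the real helpers may behave differently.

```python
import re

# Hypothetical stand-ins for utils.remove_links and utils.clean_sentence.
# The real helpers are not shown in this notebook and may differ.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_links_sketch(text):
    """Replace anything that looks like a URL with a space."""
    return URL_PATTERN.sub(" ", text)


def clean_sentence_sketch(text):
    """Lowercase, keep letters, digits and apostrophes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9']+", " ", text)
    return " ".join(text.split())


example = "Check THIS out: https://example.com/page?id=1 -- it's great!"
cleaned = clean_sentence_sketch(remove_links_sketch(example))
print(cleaned)
```

Removing links before the character filter matters here: otherwise fragments of the URL would survive as ordinary tokens.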

In [None]:
# Check that the cleaning was successful
df.info()

In [None]:
df.head()

### Phrase Analysis

After parsing and cleaning the data, we run gensim's phrase detection
(`Phrases` and `Phraser`) on our text data to join phrases like "new york city"
together to form the single token "new_york_city".

In [None]:
# Split each cleaned post into a list of tokens
posts = df["cleantext"].apply(lambda text: text.split()).tolist()

In [None]:
# Train a phrase detector to join two-word phrases together
two_word_phrases = gensim.models.Phrases(posts)
two_word_phraser = gensim.models.phrases.Phraser(two_word_phrases)

In [None]:
# Train a second phrase detector on the joined output to capture three-word phrases
three_word_phrases = gensim.models.Phrases(two_word_phraser[posts])
three_word_phraser = gensim.models.phrases.Phraser(three_word_phrases)
posts = list(three_word_phraser[two_word_phraser[posts]])
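To give some intuition for how two adjacent tokens get joined, here is a simplified, pure-Python sketch of collocation scoring in the style gensim uses by default: a pair scores highly when it occurs together more often than the individual word counts would suggest. The function `score_bigrams` and the toy sentences are our own illustration, and gensim's internal bookkeeping (for example, how it measures vocabulary size) differs in detail.

```python
from collections import Counter


def score_bigrams(sentences, min_count=5, threshold=10.0):
    """Score adjacent token pairs and keep those above the threshold.

    score(a, b) = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    This is a simplified illustration, not gensim's exact implementation.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores


toy_sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "parks"],
    ["i", "love", "big", "parks"],
    ["the", "park", "is", "big"],
]
# Tiny thresholds so that the toy corpus produces some phrases
toy_scores = score_bigrams(toy_sentences, min_count=1, threshold=2.0)
print(sorted(toy_scores))
```

Running the detector a second time on already-joined text, as in the cells above, lets a joined token such as "new_york" pair with a following word.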

In [None]:
# Update the dataframe with the phrase-joined text
df["phrasetext"] = df["cleantext"].apply(lambda text: " ".join(three_word_phraser[two_word_phraser[text.split()]]))

In [None]:
# Ensure posts contain same number of elements
len(posts) == len(df)

In [None]:
# Check that the dataframe was updated correctly
phrasetext = list(df["phrasetext"])  # build the list once instead of on every iteration
for i in range(len(posts)):
    if not " ".join(posts[i]) == phrasetext[i]:
        print("index :" + str(i) + " is incorrect")

### Data Saving

After cleaning and parsing all of our data, we can now
save it, so that we can analyze it later without having
to repeat these lengthy computations.

In [None]:
save_object(posts, 'objects/', model_name + "-posts")
save_object(df, 'objects/', model_name + "-df")
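`save_object` and `load_object` are defined in `utils` and are not shown here. Assuming they are thin wrappers around `pickle` that take an object, a directory, and a name, a sketch might look like the following. The names `save_object_sketch` and `load_object_sketch` and the `.pkl` extension are assumptions, not the actual utilities.

```python
import os
import pickle
import tempfile


def save_object_sketch(obj, directory, name):
    """Hypothetical stand-in for utils.save_object: pickle obj to directory/name.pkl."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_object_sketch(directory, name):
    """Hypothetical stand-in for utils.load_object: read directory/name.pkl back."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip a small token list through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_posts = [["new_york_city", "is", "big"], ["hello", "world"]]
    save_object_sketch(toy_posts, tmp, "example_model-posts")
    restored = load_object_sketch(tmp, "example_model-posts")

print(restored == toy_posts)
```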

### Initialize Word2Vec Model

After all of our data has been parsed and saved,
we generate our Word2Vec model.

In [None]:
# Set the minimum word count to 10. This removes all words that appear fewer than 10 times in the data
minimum_word_count = 10
# Set skip gram to 1. This sets gensim to use the skip-gram model instead of the Continuous Bag of Words model
skip_gram = 1
# Set the hidden layer size to 300. This is the dimensionality of each word vector
hidden_layer_size = 300
# Set the window size to 5. This is the maximum distance between a target word and a context word
window_size = 5
# Set hierarchical softmax to 1. This sets gensim to use hierarchical softmax
hierarchical_softmax = 1
# Set negative sampling to 20. This draws 20 noise words per training example; values this high
# are usually suggested for relatively small data sets, with fewer noise words for larger ones.
# Because hierarchical softmax is also enabled, gensim trains with both objectives
negative_sampling = 20

In [None]:
# Build the model
# Note: `size` is the argument name in gensim versions before 4.0; gensim 4.0+ calls it `vector_size`
model = gensim.models.Word2Vec(posts, min_count = minimum_word_count, sg = skip_gram, size = hidden_layer_size,
                               window = window_size, hs = hierarchical_softmax, negative = negative_sampling)
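To make the `window` parameter concrete, here is a small pure-Python sketch of how the skip-gram model pairs each target word with the words around it. The function `skipgram_pairs` is our own illustration. gensim additionally shrinks the window at random for each target word and may drop frequent words, which this sketch ignores.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


toy_post = ["the", "kitten", "sat", "on", "the", "mat"]
toy_pairs = skipgram_pairs(toy_post, window=2)
print(len(toy_pairs))
print(toy_pairs[:4])
```

A larger window yields more pairs per post and tends to capture broader, more topical associations, while a smaller window emphasizes immediate neighbors.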

### Basic Model test

After generating our model, we run some basic tests
to ensure that it has captured some semantic information.

In [None]:
# Query through model.wv; calling most_similar directly on the model is deprecated
model.wv.most_similar(positive = ["kitten"])

In [None]:
model.wv.most_similar(positive = ["father", "woman"], negative = ["man"])

In [None]:
model.wv.most_similar(positive = ["family", "obligation"], negative = ["love"])
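The analogy queries above combine word vectors and then rank the remaining words by cosine similarity. The following pure-Python sketch shows the idea on hand-made three-dimensional vectors. The function `most_similar_sketch` and the toy vectors are assumptions for illustration, not output from our model.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def most_similar_sketch(vectors, positive, negative=(), topn=3):
    """Add unit vectors of positive words, subtract negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)
    ranked = sorted(
        ((cosine(query, vec), word) for word, vec in vectors.items() if word not in exclude),
        reverse=True,
    )
    return [(word, score) for score, word in ranked[:topn]]


# Hand-made toy vectors, not learned from data
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [0.0, 1.0, 0.2],
    "father": [1.0, 0.0, 1.0],
    "mother": [0.0, 1.0, 1.0],
    "kitten": [0.3, 0.3, -1.0],
}
result = most_similar_sketch(toy_vectors, positive=["father", "woman"], negative=["man"])
print(result[0][0])
```

The query words themselves are excluded from the ranking, which is why an analogy query never simply returns one of its own inputs.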

### Save Model

After generating our model and running some basic tests,
we now save it so that we can analyze it later without having
to repeat these lengthy computations. We also delete and then reload
the model, as an example of how to do so.

In [None]:
model.save('models/' + model_name + '.model')
del model

In [None]:
model = gensim.models.Word2Vec.load('models/' + model_name + '.model')

### Generate Matrices

After generating our Word2Vec model, we generate
a collection of matrices that will be useful for
analysis. This includes a Words By Features matrix
and a Posts By Words matrix. Note that we use capitalized CamelCase
for matrix names, and only for matrix names.

In [None]:
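# Initialize the list of words used
# Note: `wv.vocab` exists in gensim versions before 4.0; gensim 4.0+ exposes `wv.key_to_index` instead
vocab_list = sorted(list(model.wv.vocab))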

In [None]:
# Extract the word vectors
vecs = []
for word in vocab_list:
    vecs.append(model.wv[word].tolist())

In [None]:
# Convert the list of word vectors into a NumPy array
WordsByFeatures = np.array(vecs)

In [None]:
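from sklearn.feature_extraction.text import CountVectorizer
# The posts are already tokenized, so the analyzer returns each token list unchanged.
# min_df is omitted because it has no effect when a fixed vocabulary is supplied
countvec = CountVectorizer(vocabulary = vocab_list, analyzer = lambda tokens: tokens)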

In [None]:
# Make Posts By Words Matrix
PostsByWords = countvec.fit_transform(posts)
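For intuition, the Posts By Words matrix has one row per post and one column per vocabulary word, with each entry counting how often that word occurs in that post. Here is a small pure-Python sketch of the same construction. The function `posts_by_words` and the toy data are our own illustration; `CountVectorizer` returns a sparse matrix rather than nested lists.

```python
from collections import Counter


def posts_by_words(posts, vocab_list):
    """Count occurrences of each vocabulary word in each tokenized post."""
    column = {word: j for j, word in enumerate(vocab_list)}
    matrix = []
    for tokens in posts:
        # Tokens outside the vocabulary (e.g. rare words) are ignored
        counts = Counter(token for token in tokens if token in column)
        row = [0] * len(vocab_list)
        for word, n in counts.items():
            row[column[word]] = n
        matrix.append(row)
    return matrix


toy_posts = [["new_york_city", "is", "big", "big"], ["hello", "kitten"]]
toy_vocab = sorted({"big", "kitten", "new_york_city"})
toy_matrix = posts_by_words(toy_posts, toy_vocab)
print(toy_matrix)
```

Because the vocabulary is sorted, column `j` of this matrix lines up with row `j` of the Words By Features matrix.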

### Basic Matrix tests

After generating our matrices, we run some basic tests
to ensure that they seem reasonable.

In [None]:
# Check that PostsByWords has one row per post
PostsByWords.shape[0] == len(posts)

In [None]:
# Check that the number of words is consistent across all matrices
PostsByWords.shape[1] == len(WordsByFeatures)

### Save Matrices

After generating our matrices, we save them so we can
analyze them later without having to repeat these lengthy
computations.

In [None]:
save_object(PostsByWords,'matricies/', model_name + "-PostsByWords")
save_object(WordsByFeatures,'matricies/', model_name + "-WordsByFeatures")

### Generate Word Clusters

Now that we have generated and saved our matrices,
we will proceed to generate word clusters using
k-means clustering, and save them for later analysis.

In [None]:
from sklearn.cluster import KMeans
# Record the fit (inertia) for different values of K
test_points = [12] + list(range(25, 401, 25))
fit = []
for point in test_points:
    kmeans = KMeans(n_clusters = point, random_state = 42).fit(WordsByFeatures)
    save_object(kmeans, 'clusters/', model_name + "-words-cluster_model-" + str(point))
    fit.append(kmeans.inertia_)
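The value stored in `fit` for each K is the model's inertia: the sum of squared distances from every point to its nearest cluster center. The sketch below computes that quantity by hand on a toy data set. The function `inertia` and the points are our own illustration, and the centers are chosen by hand rather than fitted.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        )
    return total


# Two well-separated groups of two points each
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
one_cluster = inertia(toy_points, [[5.0, 1.0]])
two_clusters = inertia(toy_points, [[0.0, 1.0], [10.0, 1.0]])
print(one_cluster, two_clusters)
```

Inertia generally keeps falling as K grows, so the saved values are most useful for spotting the point where adding more clusters stops giving a large improvement.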

In [None]:
save_object(fit, 'objects/', model_name + "-words" + "-fit")
save_object(test_points, 'objects/', model_name + "-words" + "-test_points")
del fit
del test_points