# Initialize Models
This notebook will walk you through building and saving the most basic 
models we used for analysing our text data.

We first import the libraries and utility files we are going to be using.

In [17]:
# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim

# Import utility files
from utils import read_df, remove_links, clean_sentence, save_object, load_object

#### Setup directories

If this is the first time you are running this analysis, 
we first set up all the directories we need
to save and load the models we will be using.

In [2]:
import os
directories = ['objects','models','clusters','matricies']
for dirname in directories:
    if not os.path.exists(dirname):
        os.makedirs(dirname)
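
On Python 3.2 and later, the existence check can be folded into the call itself with `exist_ok=True`. The sketch below shows that variant; the `ensure_directories` helper and its `root` parameter are our own illustration and not part of the notebook's utility files.

```python
import os

def ensure_directories(names, root="."):
    """Create each directory under root if it does not already exist."""
    paths = []
    for name in names:
        path = os.path.join(root, name)
        # exist_ok=True makes the call a no-op when the directory is already there
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths

# Example call (commented out so that running the sketch creates nothing):
# ensure_directories(['objects', 'models', 'clusters', 'matricies'])
```

Calling the helper a second time is harmless, which makes the cell safe to re-run.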

#### Name Model

Before beginning the rest of our project, we select a name for our model.
This name will be used to save and load the files for this model.

In [3]:
model_name= "example_model"

#### Parse and Clean Data

We first parse and clean our data. Our data is assumed to be in csv format, 
in a directory labeled 'data'.

In [4]:
# Get the data from the csv
df = read_df('data')

In [5]:
# Do an inspection of our data to ensure nothing went wrong
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33069 entries, 0 to 3399
Data columns (total 15 columns):
title           33069 non-null object
created_utc     33069 non-null int64
ups             33069 non-null int64
downs           33069 non-null int64
num_comments    33069 non-null int64
name            33069 non-null object
id              33069 non-null object
from            0 non-null float64
from_id         0 non-null float64
selftext        32652 non-null object
subreddit       33069 non-null object
score           33069 non-null int64
author          33069 non-null object
url             33069 non-null object
permalink       33069 non-null object
dtypes: float64(2), int64(5), object(8)
memory usage: 4.0+ MB


In [6]:
df.head()

Unnamed: 0,title,created_utc,ups,downs,num_comments,name,id,from,from_id,selftext,subreddit,score,author,url,permalink
0,"New year's eve is always a bad time, but we wi...",1451606721,1,0,1,t3_3yyvad,3yyvad,,,[deleted],SuicideWatch,1,[deleted],https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyvad/new_years_eve_...
1,I'd rather end it...,1451608072,2,0,4,t3_3yyxqz,3yyxqz,,,I have been very depressed lately. I was a cut...,SuicideWatch,2,deathproof69,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyxqz/id_rather_end_it/
2,Let's do this!,1451608080,3,0,2,t3_3yyxrg,3yyxrg,,,Alright. It's about time. \n\nLong time lurker...,SuicideWatch,3,gangerousdoblin,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyxrg/lets_do_this/
3,Please someone talk to me,1451608127,2,0,1,t3_3yyxux,3yyxux,,,[deleted],SuicideWatch,2,[deleted],https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyxux/please_someone...
4,Mostly Lifeless At The Moment,1451608748,2,0,3,t3_3yyyyb,3yyyyb,,,[deleted],SuicideWatch,2,[deleted],https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyyyb/mostly_lifeles...


In [7]:
# Clean the text in the dataframe
df = df.replace(np.nan, '', regex=True)
# Use a raw string so that \[ and \] reach the regex engine unchanged
df = df.replace(r"\[deleted\]", '', regex=True)
df["rawtext"] = df["title"] + " " + df["selftext"]
df["cleantext"] = df["rawtext"].apply(remove_links).apply(clean_sentence)
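
The bodies of `remove_links` and `clean_sentence` live in `utils` and are not shown in this notebook. Judging only from the outputs (lowercased text, apostrophes dropped, other punctuation removed), they might look roughly like the sketch below. The `_sketch` names, the URL pattern, and the whitespace collapsing are all assumptions, not the actual implementation.

```python
import re

# Assumed URL pattern; the real helper may match links differently
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_links_sketch(text):
    """Replace anything that looks like a URL with a single space."""
    return URL_PATTERN.sub(" ", text)

def clean_sentence_sketch(text):
    """Lowercase, drop apostrophes, and turn other punctuation into spaces."""
    text = text.lower()
    text = text.replace("'", "")              # "i'd" -> "id"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remaining punctuation -> space
    return " ".join(text.split())             # collapse runs of whitespace

raw = "I'd rather end it... see https://example.com/post for details"
print(clean_sentence_sketch(remove_links_sketch(raw)))
# prints: id rather end it see for details
```

Note that this sketch also strips non-ASCII letters, which may or may not match what the real helper does.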

In [8]:
# Check that the cleaning was successful
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33069 entries, 0 to 3399
Data columns (total 17 columns):
title           33069 non-null object
created_utc     33069 non-null int64
ups             33069 non-null int64
downs           33069 non-null int64
num_comments    33069 non-null int64
name            33069 non-null object
id              33069 non-null object
from            33069 non-null object
from_id         33069 non-null object
selftext        33069 non-null object
subreddit       33069 non-null object
score           33069 non-null int64
author          33069 non-null object
url             33069 non-null object
permalink       33069 non-null object
rawtext         33069 non-null object
cleantext       33069 non-null object
dtypes: int64(5), object(12)
memory usage: 4.5+ MB


In [9]:
df.head()

Unnamed: 0,title,created_utc,ups,downs,num_comments,name,id,from,from_id,selftext,subreddit,score,author,url,permalink,rawtext,cleantext
0,"New year's eve is always a bad time, but we wi...",1451606721,1,0,1,t3_3yyvad,3yyvad,,,,SuicideWatch,1,,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyvad/new_years_eve_...,"New year's eve is always a bad time, but we wi...",new years eve is always a bad time but we wil...
1,I'd rather end it...,1451608072,2,0,4,t3_3yyxqz,3yyxqz,,,I have been very depressed lately. I was a cut...,SuicideWatch,2,deathproof69,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyxqz/id_rather_end_it/,I'd rather end it... I have been very depresse...,id rather end it i have been very depressed...
2,Let's do this!,1451608080,3,0,2,t3_3yyxrg,3yyxrg,,,Alright. It's about time. \n\nLong time lurker...,SuicideWatch,3,gangerousdoblin,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyxrg/lets_do_this/,Let's do this! Alright. It's about time. \n\nL...,lets do this alright its about time long ...
3,Please someone talk to me,1451608127,2,0,1,t3_3yyxux,3yyxux,,,,SuicideWatch,2,,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyxux/please_someone...,Please someone talk to me,please someone talk to me
4,Mostly Lifeless At The Moment,1451608748,2,0,3,t3_3yyyyb,3yyyyb,,,,SuicideWatch,2,,https://www.reddit.com/r/SuicideWatch/comments...,/r/SuicideWatch/comments/3yyyyb/mostly_lifeles...,Mostly Lifeless At The Moment,mostly lifeless at the moment


### Phrase Analysis

After parsing and cleaning the data, we run the gensim Phrases
tool on our text data to join phrases like "new york city" 
into the single token "new_york_city".

In [10]:
# Get a list of token lists, one per post
posts = df["cleantext"].apply(lambda text: text.split()).tolist()

In [13]:
# Train a phraseDetector to join two word phrases together
two_word_phrases = gensim.models.Phrases(posts)
two_word_phraser = gensim.models.phrases.Phraser(two_word_phrases)

In [14]:
# Train a phraseDetector to join three word phrases together
three_word_phrases = gensim.models.Phrases(two_word_phraser[posts])
three_word_phraser = gensim.models.phrases.Phraser(three_word_phrases)
posts              = list(three_word_phraser[two_word_phraser[posts]])
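
To give a feel for what the phrase detector is doing, here is a pure-Python sketch of gensim's default scoring rule, `(count(a, b) - min_count) * N / (count(a) * count(b))`, where a pair is joined when its score exceeds a threshold. The toy posts, the helper names, and the choice of `N` as the number of distinct unigrams are simplifying assumptions for illustration; the real `Phrases` class handles vocabulary pruning and other details that are omitted here.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent pairs with (count(a,b) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

def join_phrases(sentence, scores, threshold):
    """Greedily merge adjacent pairs whose score exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

toy_posts = [
    ["i", "moved", "to", "new", "york"],
    ["new", "york", "is", "loud"],
    ["i", "love", "new", "york"],
    ["i", "moved", "to", "a", "loud", "city"],
]
scores = score_bigrams(toy_posts, min_count=1)
print(join_phrases(toy_posts[1], scores, threshold=1.0))
# prints: ['new_york', 'is', 'loud']
```

Running a second detector over the output of the first, as the cells above do, is what allows an already joined token such as "new_york" to be merged with a following word into a three-word phrase.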

In [15]:
# Update the dataframe
df["phrasetext"] = df["cleantext"].apply(lambda text: " ".join(three_word_phraser[two_word_phraser[text.split()]]))

In [19]:
# Ensure posts contain same number of elements
len(posts)==len(df)

33069

In [25]:
# Check that the dataframe was updated correctly
phrasetexts = df["phrasetext"].tolist()  # build the list once, not on every iteration
for i, tokens in enumerate(posts):
    if " ".join(tokens) != phrasetexts[i]:
        print("index: " + str(i) + " is incorrect")

### Data Saving

After cleaning and parsing all of our data, we can now
save it, so that we can analyze it later without having
to go through lengthy computations.

In [18]:
save_object(posts,'objects/',model_name+"-posts")
save_object(df,'objects/',model_name+"-df")
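
The `save_object` and `load_object` helpers come from `utils` and their bodies are not shown here. A pickle-based pair along the following lines would match the way `save_object` is called above. The `_sketch` names, the `.pkl` extension, and the `load_object_sketch(directory, name)` signature are assumptions.

```python
import os
import pickle

def save_object_sketch(obj, directory, name):
    """Pickle obj to <directory>/<name>.pkl (the extension is an assumption)."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def load_object_sketch(directory, name):
    """Load an object previously written by save_object_sketch."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

As with any pickle file, such objects should only be loaded from sources we trust.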

### Initialize Word2Vec Model

After all of our data has been parsed and saved, 
we generate our Word2Vec model.

In [28]:
# Set the minimum word count to 10. This removes all words that appear fewer than 10 times in the data
minimum_word_count = 10
# Set skip gram to 1. This sets gensim to use the skip-gram model instead of the Continuous Bag of Words model
skip_gram = 1
# Set the hidden layer size to 300. This is the dimensionality of the word vectors
hidden_layer_size = 300
# Set the window size to 5. This is the maximum distance between the current and the predicted word
window_size = 5
# Set hierarchical softmax to 1. This sets gensim to use hierarchical softmax
hierarchical_softmax = 1
# Set negative sampling to 20. This is the number of noise words drawn per training example;
# values of 5-20 are usual for relatively small datasets, while larger datasets typically use fewer
negative_sampling = 20

In [43]:
# Build the model
model = gensim.models.Word2Vec(posts, min_count=minimum_word_count, sg=skip_gram, size=hidden_layer_size,
                               window=window_size, hs=hierarchical_softmax, negative=negative_sampling)
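
The call above uses the `size` keyword, which matches the gensim release this notebook was written against. gensim 4.0 renamed that keyword to `vector_size`, so on a newer installation the call needs a small adjustment. The helper below is a hypothetical convenience (not part of gensim or of our `utils`) that picks the right keyword from the major version number.

```python
def word2vec_kwargs(gensim_major_version, hidden_layer_size=300):
    """Return Word2Vec keyword arguments matching the notebook's settings."""
    kwargs = {"min_count": 10, "sg": 1, "window": 5, "hs": 1, "negative": 20}
    # gensim 4.0 renamed `size` to `vector_size`; older releases only accept `size`
    size_key = "vector_size" if gensim_major_version >= 4 else "size"
    kwargs[size_key] = hidden_layer_size
    return kwargs

print(sorted(word2vec_kwargs(4)))
# prints: ['hs', 'min_count', 'negative', 'sg', 'vector_size', 'window']
```

With gensim installed, it could be used as `gensim.models.Word2Vec(posts, **word2vec_kwargs(int(gensim.__version__.split(".")[0])))`.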

### Basic Model test

After generating our model, we run some basic tests
to ensure that it has captured some semantic information.

In [44]:
model.most_similar(positive=["kitten"])

[('father_figure', 0.5078849196434021),
 ('puppy', 0.5020877122879028),
 ('sis', 0.4523952305316925),
 ('girlfriend_whom', 0.4390547573566437),
 ('baby', 0.4342191815376282),
 ('kidnapped', 0.4331890940666199),
 ('soul_mate', 0.43072372674942017),
 ('our_daughter', 0.4306313097476959),
 ('bunny', 0.427238404750824),
 ('ceremony', 0.426519513130188)]

In [45]:
model.most_similar(positive=["father","woman"],negative=["man"])

[('mother', 0.5760773420333862),
 ('druggie', 0.48784691095352173),
 ('mom', 0.46765851974487305),
 ('husband', 0.46576985716819763),
 ('dad', 0.46441134810447693),
 ('step_father', 0.44937026500701904),
 ('grandmother', 0.44449758529663086),
 ('brother', 0.43873152136802673),
 ('stepfather', 0.43668779730796814),
 ('babysitter', 0.43309086561203003)]

In [46]:
model.most_similar(positive=["family","obligation"],negative =["love"])

[('relatives', 0.36684978008270264),
 ('parents', 0.3621269464492798),
 ('job_prospects', 0.3588736653327942),
 ('extended_family', 0.35447224974632263),
 ('incentive', 0.3542650640010834),
 ('family_members', 0.35329771041870117),
 ('moms_side', 0.338132381439209),
 ('dirt_poor', 0.3373897075653076),
 ('utilities', 0.3347025513648987),
 ('family_hates', 0.32735511660575867)]
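
The analogy queries above rank words by cosine similarity to a combination of the input vectors: positive words are added, negative words are subtracted, and the query words themselves are excluded from the results. The sketch below reproduces that idea in pure Python on hand-made three-dimensional vectors. The toy vectors and the `most_similar_sketch` name are illustrative assumptions and have nothing to do with the trained model.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar_sketch(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(unit(positive)) - sum(unit(negative))."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    sims = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy axes: [male(+)/female(-), parenthood, pet]
toy_vectors = {
    "man":    [1.0, 0.0, 0.0],
    "woman":  [-1.0, 0.0, 0.0],
    "father": [1.0, 1.0, 0.0],
    "mother": [-1.0, 1.0, 0.0],
    "kitten": [0.0, 0.0, 1.0],
}
result = most_similar_sketch(toy_vectors, positive=["father", "woman"], negative=["man"])
print(result[0][0])
# prints: mother
```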

### Save Model

After generating our model and running some basic tests,
we now save it so that we can analyze it later without having
to go through lengthy computations. We also delete and then reload
the model, as an example of how to do so.

In [47]:
model.save('models/'+model_name+'.model')
del model

In [48]:
model = gensim.models.Word2Vec.load('models/'+model_name+'.model')

### Generate Matrices

After generating our Word2Vec model, we generate 
a collection of matrices that will be useful for
analysis. This includes a Words By Feature matrix
and a Posts By Words matrix. Note that we use CamelCase 
for matrix names, and only for matrix names.

In [49]:
# Initialize the list of words used
vocab_list = sorted(list(model.wv.vocab))

In [50]:
# Extract the word vectors
vecs = []
for word in vocab_list:
    vecs.append(model.wv[word].tolist())

In [68]:
# Convert the list of word vectors into a numpy array
WordByFeature = np.array(vecs)

In [55]:
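from sklearn.feature_extraction.text import CountVectorizer
# Posts are already tokenized, so the analyzer passes each token list through unchanged.
# min_df is omitted because it is ignored when a fixed vocabulary is supplied.
countvec = CountVectorizer(vocabulary=vocab_list, analyzer=lambda tokens: tokens)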

In [56]:
# Make Posts By Words Matrix
PostsByWords = countvec.fit_transform(posts)
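
Conceptually, each row of the Posts By Words matrix is one post and each column is one vocabulary word, with the cell holding how often that word occurs in that post. Tokens outside the fixed vocabulary are dropped. The pure-Python sketch below builds a small dense version of such a matrix; the toy posts and the `posts_by_words` helper are illustrative only, and the real `CountVectorizer` returns a sparse matrix instead of nested lists.

```python
from collections import Counter

def posts_by_words(posts, vocab_list):
    """Build a dense posts x words count matrix for a fixed, ordered vocabulary."""
    column = {word: j for j, word in enumerate(vocab_list)}
    matrix = []
    for tokens in posts:
        row = [0] * len(vocab_list)
        for word, n in Counter(tokens).items():
            if word in column:  # tokens outside the vocabulary are dropped
                row[column[word]] = n
        matrix.append(row)
    return matrix

toy_vocab = sorted(["loud", "new_york", "quiet"])
toy_posts = [["new_york", "is", "loud", "loud"], ["quiet"]]
print(posts_by_words(toy_posts, toy_vocab))
# prints: [[2, 1, 0], [0, 0, 1]]
```

Because the columns follow the sorted vocabulary, column `j` here corresponds to the same word as row `j` of the word vector matrix built from the same list.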

### Basic Matrix tests

After generating our matrices, we run some basic tests
to ensure that they seem reasonable.

In [65]:
# Check that PostsByWords has one row per post
PostsByWords.shape[0]==len(posts)

True

In [66]:
# Check that the number of words is consistent across all matrices
PostsByWords.shape[1] == len(WordByFeature)

True

### Save Matrices

After generating our matrices, we save them so we can 
analyze them later without having to go through lengthy
computations.

In [67]:
save_object(PostsByWords,'objects/',model_name+"-PostsByWords")
save_object(WordByFeature,'objects/',model_name+"-WordByFeatureMat")

### Generate Word Clusters

Now that we have generated and saved our matrices,
we will proceed to generate word clusters using 
k-means clustering, and save them for later analysis.

In [None]:
from sklearn.cluster import KMeans
# Fit k-means for different values of k and record the inertia of each fit
test_points = [12] + list(range(25, 401, 25))
fit = []
for point in test_points:
    kmeans = KMeans(n_clusters=point, random_state=42).fit(WordByFeature)
    save_object(kmeans, 'clusters/', model_name + "-cluster_model-" + str(point))
    fit.append(kmeans.inertia_)
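
One possible way to inspect the collected `fit` values is to look at how much the inertia falls per added cluster between consecutive test points; where that drop levels off is a common, if informal, hint for a reasonable number of clusters. The sketch below shows the calculation on made-up numbers. The helper name and the toy inertia values are assumptions for illustration and are not results from this notebook.

```python
def inertia_drop_per_cluster(test_points, fit):
    """Return (k, drop) pairs: the inertia decrease per added cluster when moving to k."""
    drops = []
    previous = zip(test_points, fit)
    current = zip(test_points[1:], fit[1:])
    for (k_prev, inertia_prev), (k, inertia) in zip(previous, current):
        drops.append((k, (inertia_prev - inertia) / (k - k_prev)))
    return drops

# Made-up inertia values, for illustration only
toy_points = [12, 25, 50, 75]
toy_fit = [1000.0, 740.0, 640.0, 615.0]
print(inertia_drop_per_cluster(toy_points, toy_fit))
# prints: [(25, 20.0), (50, 4.0), (75, 1.0)]
```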