# Initialize Models
This notebook walks through building and saving the basic
models we used to analyze our text data.

We first import the libraries and utility files we are going to be using.

In [1]:
# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim

# Import utility files
from utils import read_df, remove_links, clean_sentence, save_object, load_object
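
The `utils` module itself is not shown in this notebook. As a rough sketch only, the save and load helpers could be thin wrappers around `pickle`. The argument order of the save helper follows the `save_object(obj, directory, name)` calls used later in this notebook; the `_sketch` function names, the `.pkl` extension, and the signature of the load helper are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_object_sketch(obj, directory, name):
    """Pickle `obj` to <directory>/<name>.pkl (the extension is an assumption)."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_object_sketch(directory, name):
    """Load an object written by save_object_sketch (hypothetical signature)."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip a small object through a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    save_object_sketch({"answer": 42}, tmp_dir, "demo")
    restored = load_object_sketch(tmp_dir, "demo")
```

The real helpers may differ, for example in file naming or serialization format.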

#### Setup directories

If this is the first run of the analysis,
we set up all the directories we need
to save and load the models we will be using.

In [2]:
import os
directories = ['objects', 'models', 'clusters', 'matricies']
for dirname in directories:
    if not os.path.exists(dirname):
        os.makedirs(dirname)
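
On Python 3.2 and later, the existence check can be folded into the call with `exist_ok=True`. The sketch below shows this variant; the temporary base directory is an assumption made here so that the example does not touch the working directory.

```python
import os
import tempfile

directories = ['objects', 'models', 'clusters', 'matricies']

with tempfile.TemporaryDirectory() as base_dir:
    for dirname in directories:
        # exist_ok=True makes the call a no-op when the directory already exists
        os.makedirs(os.path.join(base_dir, dirname), exist_ok=True)
    # Running the loop a second time raises no error
    for dirname in directories:
        os.makedirs(os.path.join(base_dir, dirname), exist_ok=True)
    created = sorted(os.listdir(base_dir))
```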

#### Name Model

Before beginning the rest of our project, we select a name for our model.
This name will be used to save and load the files for this model.

In [4]:
model_name = "PTSD_model4"

#### Parse and Clean Data

We first parse and clean our data. Our data is assumed to be in CSV format,
in a directory named 'data'.

In [5]:
# Get the data from the csv
df = read_df('data')

In [6]:
# Do an inspection of our data to ensure nothing went wrong
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7057 entries, 0 to 208
Data columns (total 15 columns):
title           7057 non-null object
created_utc     7057 non-null float64
ups             7057 non-null float64
downs           7057 non-null float64
num_comments    7057 non-null float64
name            6076 non-null object
id              7057 non-null object
from            0 non-null object
from_id         0 non-null object
selftext        6215 non-null object
subreddit       7057 non-null object
score           7057 non-null float64
author          7057 non-null object
url             7057 non-null object
permalink       7057 non-null object
dtypes: float64(5), object(10)
memory usage: 882.1+ KB


In [7]:
df.head()

Unnamed: 0,title,created_utc,ups,downs,num_comments,name,id,from,from_id,selftext,subreddit,score,author,url,permalink
0,Posttraumatic stress disorder (Wikipedia Entry),1220412859,4,1,0,t3_6zcej,6zcej,,,,ptsd,3,Crito,http://en.wikipedia.org/wiki/PTSD,/r/ptsd/comments/6zcej/posttraumatic_stress_di...
1,Psychiatric Service Dog Society,1220412757,4,1,0,t3_6zce5,6zce5,,,,ptsd,3,Crito,http://www.psychdog.org/about_mission.html,/r/ptsd/comments/6zce5/psychiatric_service_dog...
2,PTSD leaves physical footprints on the brain,1220637551,4,1,0,t3_6zvkd,6zvkd,,,,ptsd,3,Crito,http://www.sfgate.com/cgi-bin/article.cgi?f=/c...,/r/ptsd/comments/6zvkd/ptsd_leaves_physical_fo...
3,Computer therapy soothes symptoms in combat ve...,1220976763,5,0,0,t3_70hw3,70hw3,,,,ptsd,5,Crito,http://www.signonsandiego.com/news/military/20...,/r/ptsd/comments/70hw3/computer_therapy_soothe...
4,"Trauma, PTSD Followed By Reduction In Region O...",1221046789,5,0,0,t3_70nr2,70nr2,,,,ptsd,5,Crito,http://www.sciencedaily.com/releases/2008/08/0...,/r/ptsd/comments/70nr2/trauma_ptsd_followed_by...


In [8]:
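# Clean the text in the dataframe
# Replace missing values with empty strings so the text columns can be concatenated
df = df.replace(np.nan, '', regex = True)
# Remove the "[deleted]" placeholder (a raw string avoids invalid-escape warnings)
df = df.replace(r"\[deleted\]", '', regex = True)
# Combine title and body, then strip links and clean the text
df["rawtext"] = df["title"] + " " + df["selftext"]
df["cleantext"] = df["rawtext"].apply(remove_links).apply(clean_sentence)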
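
The bodies of `remove_links` and `clean_sentence` live in `utils` and are not shown here. Judging only from the `cleantext` column displayed further down (lowercase, no punctuation), a minimal stand-in might look like the sketch below. The function names ending in `_sketch`, the URL pattern, and the exact character rules are all assumptions, not the project's actual implementation.

```python
import re

# Hypothetical URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_links_sketch(text):
    """Replace anything that looks like a URL with a space."""
    return URL_PATTERN.sub(" ", text)


def clean_sentence_sketch(text):
    """Lowercase, delete apostrophes, and turn other punctuation into spaces."""
    text = text.lower().replace("'", "")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())


raw = "Posttraumatic stress disorder (Wikipedia Entry) http://en.wikipedia.org/wiki/PTSD"
cleaned = clean_sentence_sketch(remove_links_sketch(raw))
```

With these rules, `cleaned` matches the first `cleantext` entry shown below.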

In [9]:
# Check that the cleaning was successful
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7057 entries, 0 to 208
Data columns (total 17 columns):
title           7057 non-null object
created_utc     7057 non-null float64
ups             7057 non-null float64
downs           7057 non-null float64
num_comments    7057 non-null float64
name            7057 non-null object
id              7057 non-null object
from            7057 non-null object
from_id         7057 non-null object
selftext        7057 non-null object
subreddit       7057 non-null object
score           7057 non-null float64
author          7057 non-null object
url             7057 non-null object
permalink       7057 non-null object
rawtext         7057 non-null object
cleantext       7057 non-null object
dtypes: float64(5), object(12)
memory usage: 992.4+ KB


In [10]:
df.head()

Unnamed: 0,title,created_utc,ups,downs,num_comments,name,id,from,from_id,selftext,subreddit,score,author,url,permalink,rawtext,cleantext
0,Posttraumatic stress disorder (Wikipedia Entry),1220412859,4,1,0,t3_6zcej,6zcej,,,,ptsd,3,Crito,http://en.wikipedia.org/wiki/PTSD,/r/ptsd/comments/6zcej/posttraumatic_stress_di...,Posttraumatic stress disorder (Wikipedia Entry),posttraumatic stress disorder wikipedia entry
1,Psychiatric Service Dog Society,1220412757,4,1,0,t3_6zce5,6zce5,,,,ptsd,3,Crito,http://www.psychdog.org/about_mission.html,/r/ptsd/comments/6zce5/psychiatric_service_dog...,Psychiatric Service Dog Society,psychiatric service dog society
2,PTSD leaves physical footprints on the brain,1220637551,4,1,0,t3_6zvkd,6zvkd,,,,ptsd,3,Crito,http://www.sfgate.com/cgi-bin/article.cgi?f=/c...,/r/ptsd/comments/6zvkd/ptsd_leaves_physical_fo...,PTSD leaves physical footprints on the brain,ptsd leaves physical footprints on the brain
3,Computer therapy soothes symptoms in combat ve...,1220976763,5,0,0,t3_70hw3,70hw3,,,,ptsd,5,Crito,http://www.signonsandiego.com/news/military/20...,/r/ptsd/comments/70hw3/computer_therapy_soothe...,Computer therapy soothes symptoms in combat ve...,computer therapy soothes symptoms in combat ve...
4,"Trauma, PTSD Followed By Reduction In Region O...",1221046789,5,0,0,t3_70nr2,70nr2,,,,ptsd,5,Crito,http://www.sciencedaily.com/releases/2008/08/0...,/r/ptsd/comments/70nr2/trauma_ptsd_followed_by...,"Trauma, PTSD Followed By Reduction In Region O...",trauma ptsd followed by reduction in region o...


### Phrase Analysis

After parsing and cleaning the data, we run gensim's phrase
detection on our text data to join phrases like "new york city"
into the single token "new_york_city".

In [11]:
# Get a stream of tokens
posts = df["cleantext"].apply(lambda text: text.split()).tolist()

In [12]:
# Train a phrase detector to join two-word phrases together
two_word_phrases = gensim.models.Phrases(posts)
two_word_phraser = gensim.models.phrases.Phraser(two_word_phrases)

In [13]:
# Train a second phrase detector on the joined output to form three-word phrases
three_word_phrases = gensim.models.Phrases(two_word_phraser[posts])
three_word_phraser = gensim.models.phrases.Phraser(three_word_phrases)
posts = list(three_word_phraser[two_word_phraser[posts]])
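
As we understand it, gensim's default scorer decides whether to join two adjacent tokens using the collocation score from Mikolov et al., `(count_ab - min_count) * vocab_size / (count_a * count_b)`, and joins the pair when the score exceeds a threshold. Running the detector a second time lets an already joined token such as `new_york` pair with `city`. The sketch below reimplements only the scoring formula in plain Python; the counts are made up for illustration, and the defaults (`min_count=5`, `threshold=10.0`) are the gensim defaults as far as we know.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Collocation score: high when the pair co-occurs far more often than chance."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts, chosen only for illustration
threshold = 10.0

# Two fairly rare words that almost always appear together
score_common = bigram_score(count_a=50, count_b=40, count_ab=35, vocab_size=1000)

# Two very frequent words that happen to sit next to each other 35 times
score_rare = bigram_score(count_a=5000, count_b=4000, count_ab=35, vocab_size=1000)

is_phrase_common = score_common > threshold
is_phrase_rare = score_rare > threshold
```

The first pair would be joined and the second would not, even though both pairs co-occur the same number of times.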

In [14]:
# Update the dataframe with the phrase-joined text
df["phrasetext"] = df["cleantext"].apply(lambda text: " ".join(three_word_phraser[two_word_phraser[text.split()]]))

In [15]:
# Ensure posts and the dataframe contain the same number of elements
len(posts) == len(df)

True

In [16]:
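# Check that the dataframe was updated correctly
# Build the list once; rebuilding it inside the loop would make the check quadratic
phrasetexts = df["phrasetext"].tolist()
for i, post in enumerate(posts):
    if " ".join(post) != phrasetexts[i]:
        print("index: " + str(i) + " is incorrect")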

### Data Saving

After cleaning and parsing all of our data, we can now
save it, so that we can analyze it later without having
to repeat these lengthy computations.

In [17]:
save_object(posts, 'objects/', model_name + "-posts")
save_object(df, 'objects/', model_name + "-df")

### Initialize Word2Vec Model

After all of our data has been parsed and saved,
we generate our Word2Vec model.

In [18]:
# Set the minimum word count to 10. This removes all words that appear fewer than 10 times in the data
minimum_word_count = 10
# Set skip gram to 1. This sets gensim to use the skip-gram model instead of the Continuous Bag of Words model
skip_gram = 1
# Set the hidden layer size to 300. This is the dimensionality of each word vector
hidden_layer_size = 300
# Set the window size to 5. This is the maximum distance between a word and the context words used to predict it
window_size = 5
# Set hierarchical softmax to 1. This sets gensim to use hierarchical softmax
hierarchical_softmax = 1
# Set negative sampling to 20. This draws 20 "noise" words per training pair; larger values tend to suit
# relatively small datasets, at the cost of slower training on larger ones
negative_sampling = 20
# Set the number of training iterations to 80 (the gensim default is 5)
iterations = 80

In [19]:
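# Build the model
# Note: `size` and `iter` are the gensim 3.x argument names; gensim 4.0 renamed them to `vector_size` and `epochs`
model = gensim.models.Word2Vec(posts, min_count = minimum_word_count, sg = skip_gram, size = hidden_layer_size,
                               window = window_size, hs = hierarchical_softmax, negative = negative_sampling,
                               iter = iterations)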
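
To make the `sg` and `window` settings more concrete, the sketch below lists the (center, context) pairs that a skip-gram model would be trained on for one short, made-up post. It is a simplification: as far as we know, the real implementation also shrinks the window at random for each word and subsamples very frequent words, and the function name is hypothetical.

```python
def skip_gram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up example post, already tokenized
tokens = ["i", "was", "diagnosed", "with", "ptsd"]
pairs = skip_gram_pairs(tokens, window=1)
```

With `window=1` each inner word contributes two pairs and each end word contributes one; the notebook's `window_size = 5` would reach up to five tokens on either side.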

### Basic Model test

After generating our model, we run some basic tests
to ensure that it has captured some semantic information.

In [20]:
model.wv.most_similar(positive = ["kitten"])

[('puppy', 0.29517465829849243),
 ('baby', 0.28231385350227356),
 ('sleeps', 0.2780076265335083),
 ('poured', 0.26375722885131836),
 ('niece', 0.2555815577507019),
 ('cpr', 0.2549140453338623),
 ('vodka', 0.2545509338378906),
 ('pair', 0.25444141030311584),
 ('a_half_ago', 0.25060075521469116),
 ('guards', 0.24627558887004852)]

In [21]:
model.wv.most_similar(positive = ["her"])

[('she', 0.6510519981384277),
 ('my', 0.5369052290916443),
 ('me', 0.5062918663024902),
 ('him', 0.4567736089229584),
 ('shes', 0.44906753301620483),
 ('his', 0.4077233076095581),
 ('my_mom', 0.3999635577201843),
 ('herself', 0.39447176456451416),
 ('and', 0.37899160385131836),
 ('mother', 0.37736374139785767)]

In [22]:
model.wv.most_similar(positive = ["my"])

[('his', 0.5938185453414917),
 ('and', 0.5448551774024963),
 ('her', 0.5369052290916443),
 ('the', 0.49303215742111206),
 ('your', 0.46680885553359985),
 ('their', 0.458759605884552),
 ('i', 0.4112635850906372),
 ('a', 0.40751779079437256),
 ('my_own', 0.4025762677192688),
 ('our', 0.38989442586898804)]

In [23]:
model.wv.most_similar(positive = ["father", "woman"], negative = ["man"])

[('mother', 0.4488663077354431),
 ('brother', 0.31456589698791504),
 ('my_mom', 0.2941199541091919),
 ('parents', 0.2693387269973755),
 ('child', 0.26594051718711853),
 ('my_dad', 0.2658591568470001),
 ('abuser', 0.26399141550064087),
 ('molested_by', 0.263275146484375),
 ('molester', 0.2622520327568054),
 ('growing_up', 0.26060760021209717)]

In [24]:
model.wv.most_similar(positive = ["family", "obligation"], negative = ["love"])

[('friends', 0.2847524881362915),
 ('parents', 0.2632507085800171),
 ('identity', 0.25016000866889954),
 ('immediate_family', 0.24966305494308472),
 ('impending', 0.24379894137382507),
 ('moral', 0.24031078815460205),
 ('relatives', 0.23715239763259888),
 ('phone', 0.23234713077545166),
 ('mother', 0.23103675246238708),
 ('mess', 0.2234201282262802)]
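
The scores in these outputs are cosine similarities. For queries with both `positive` and `negative` words, our understanding is that the unit-length positive vectors are added, the unit-length negative vectors are subtracted, and the remaining words are ranked by cosine similarity to the result. The sketch below reproduces that idea with NumPy on hand-made three-dimensional vectors; the vectors and the function names are invented for illustration and are not taken from the trained model.

```python
import numpy as np


def unit(vector):
    """Scale a vector to unit length."""
    return vector / np.linalg.norm(vector)


def most_similar_sketch(vectors, positive, negative=()):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    terms = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(terms, axis=0))
    scores = {
        word: float(np.dot(unit(vec), query))
        for word, vec in vectors.items()
        if word not in positive and word not in negative  # skip the query words
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors with axes (male, female, parent); invented for illustration
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "father": np.array([1.0, 0.0, 1.0]),
    "mother": np.array([0.0, 1.0, 1.0]),
    "brother": np.array([1.0, 0.0, 0.2]),
}

ranked = most_similar_sketch(toy_vectors, positive=["father", "woman"], negative=["man"])
```

On these toy vectors the top-ranked word is `mother`, mirroring the "father" + "woman" - "man" query above.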

### Save Model

After generating our model and running some basic tests,
we now save it so that we can analyze it later without having
to repeat these lengthy computations. We also delete and then reload
the model, as an example of how to do so.

In [25]:
model.save('models/' + model_name + '.model')
del model

In [26]:
model = gensim.models.Word2Vec.load('models/' + model_name + '.model')

### Generate Matrices

After generating our Word2Vec model, we generate
a collection of matrices that will be useful for
analysis. These include a Words-by-Features matrix
and a Posts-by-Words matrix. Note that we use CamelCase
for matrix names, and only for matrix names.

In [27]:
# Initialize the sorted list of words in the model vocabulary
vocab_list = sorted(model.wv.vocab)

In [28]:
# Extract the word vectors
vecs = []
for word in vocab_list:
    vecs.append(model.wv[word].tolist())

In [29]:
# Convert the list of vectors into a NumPy array (one row per word)
WordsByFeatures = np.array(vecs)

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
# Posts are already tokenized, so the analyzer returns each token list unchanged
countvec = CountVectorizer(vocabulary = vocab_list, analyzer = (lambda tokens: list(tokens)), min_df = 0)

In [31]:
# Make Posts By Words Matrix
PostsByWords = countvec.fit_transform(posts)
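
To show what the Posts-by-Words matrix contains, the sketch below builds a small dense version by hand with NumPy: one row per post, one column per vocabulary word, and each entry the number of times that word occurs in that post. Tokens outside the vocabulary (for example, words dropped by the minimum word count) are ignored. The toy posts and the function name are made up for illustration; the real `PostsByWords` is a sparse matrix produced by `CountVectorizer`.

```python
import numpy as np


def count_matrix(posts, vocab_list):
    """Return a dense posts-by-words count matrix; out-of-vocabulary tokens are skipped."""
    column = {word: j for j, word in enumerate(vocab_list)}
    matrix = np.zeros((len(posts), len(vocab_list)), dtype=int)
    for i, tokens in enumerate(posts):
        for token in tokens:
            if token in column:
                matrix[i, column[token]] += 1
    return matrix


# Made-up vocabulary and posts
toy_vocab = sorted(["ptsd", "therapy", "new_york_city"])
toy_posts = [["ptsd", "therapy", "ptsd"], ["new_york_city", "rare_word"]]
toy_counts = count_matrix(toy_posts, toy_vocab)
```

Because the columns follow the sorted vocabulary, they line up with the rows of `WordsByFeatures`.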

### Basic Matrix tests

After generating our matrices, we run some basic tests
to ensure that their dimensions are reasonable and
consistent with each other.

In [32]:
# Check that PostsByWords is the number of Posts by the number of words
PostsByWords.shape[0] == len(posts)

True

In [33]:
# Check that the number of words is consistent across all matrices
PostsByWords.shape[1] == len(WordsByFeatures)

True

### Save Matrices

After generating our matrices, we save them so we can
analyze them later without having to repeat these lengthy
computations.

In [34]:
save_object(PostsByWords,'matricies/', model_name + "-PostsByWords")
save_object(WordsByFeatures,'matricies/', model_name + "-WordsByFeatures")

### Generate Word Clusters

Now that we have generated and saved our matrices,
we will proceed to generate word clusters using
k-means clustering, and save them for later analysis.

In [35]:
from sklearn.cluster import KMeans
# Fit k-means for several values of K and record the inertia of each fit
test_points = [12] + list(range(25, 401, 25))
fit = []
for point in test_points:
    kmeans = KMeans(n_clusters = point, random_state = 42).fit(WordsByFeatures)
    save_object(kmeans, 'clusters/', model_name + "-words-cluster_model-" + str(point))
    fit.append(kmeans.inertia_)
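
The `inertia_` value recorded for each K is the sum of squared distances from every sample to its closest cluster center. It shrinks as K grows, so a common use of such a list is to plot it against the tested values of K and look for the point where the improvement levels off. The sketch below computes the same quantity by hand on four made-up points; the function name and the data are illustrative assumptions.

```python
import numpy as np


def inertia(points, centers, labels):
    """Sum of squared distances from each point to its assigned cluster center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))


# Four made-up points forming two well-separated pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# One cluster: every point is assigned to the overall mean
one_center = points.mean(axis=0, keepdims=True)
inertia_k1 = inertia(points, one_center, np.zeros(4, dtype=int))

# Two clusters: the left pair and the right pair each get their own mean
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])
inertia_k2 = inertia(points, two_centers, np.array([0, 0, 1, 1]))
```

Going from one cluster to two drops the inertia sharply here because the data really does contain two groups.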

In [36]:
save_object(fit, 'objects/', model_name + "-words" + "-fit")
save_object(test_points, 'objects/', model_name + "-words" + "-test_points")
del fit
del test_points