# Suicide Watch analysis
This notebook walks you through the models we built
from the data we collected from the Suicide Watch subreddit.

We first import the libraries and utility files we are going to be using,
and parse and clean our data.

In [6]:
# Import machine learning libraries
import gensim
import textmining
from scipy.sparse import dok_matrix
from sklearn.cluster import KMeans
import numpy as np

# Import utility files
import dataUtils
import clusterUtils

In [8]:
# Get the title and body text of each post from the csv
postsStream = dataUtils.read('data', ["title", "selftext"])
# Clean each post and split it into a list of tokens
posts = [dataUtils.cleanSentence(p).split() for p in postsStream]

#### Data summary statistics

Before building models, we first look at the data we are using.

In [20]:
# Get the number of posts
num_posts = len(posts)
num_posts

131728

In [22]:
# Count the posts per user. Posts by "[deleted]" are not tallied:
# they collapse into a single entry, so the total below counts
# "[deleted]" at most once.
userStream = dataUtils.read('data', ["author"])
userDict = {}
for user in userStream:
    if user in userDict and user != "[deleted]":
        userDict[user] += 1
    else:
        userDict[user] = 1
len(userDict)

63267
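
The same tally can be written more compactly with `collections.Counter`. The sketch below is an alternative, not the notebook's code: `authors` is a made-up stand-in for the stream returned by `dataUtils.read`, and unlike the loop above it drops `[deleted]` entirely instead of keeping it as one entry.

```python
from collections import Counter

# Hypothetical stand-in for the author stream read from the csv.
authors = ["alice", "bob", "[deleted]", "alice", "[deleted]", "carol"]

# Count posts per author, skipping the "[deleted]" placeholder.
posts_per_author = Counter(a for a in authors if a != "[deleted]")

# Number of distinct authors, excluding "[deleted]".
num_users = len(posts_per_author)
print(num_users)
```

With this version the user count never includes the placeholder, so it may differ by one from the figure reported above.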

#### Build word2vec model
At this step we build the word2vec model that we use in the rest of the analysis.
Because this is a computationally expensive process, we save the trained model
as "model1.model" in the models directory. We can then load it later, and do not need
to rebuild it every time we want to analyze it.
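
The build-once, load-later pattern can be sketched in isolation. This is an illustration under assumptions, not the notebook's code: `load_or_build` and `expensive_build` are hypothetical names, and `pickle` stands in for the model's own `save` and `load` methods.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object cached at `path`, building and saving it first if needed."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Hypothetical expensive step; in the notebook this would be the word2vec training.
calls = []

def expensive_build():
    calls.append(1)
    return {"trained": True}


cache_path = os.path.join(tempfile.mkdtemp(), "model1.pkl")
first = load_or_build(cache_path, expensive_build)   # builds and saves
second = load_or_build(cache_path, expensive_build)  # loads from disk
print(len(calls))
```

The second call never reaches `expensive_build`, which is the saving the notebook gets by loading "model1.model" instead of retraining.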

In [None]:
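# Build the model: skip-gram (sg=1) with hierarchical softmax enabled (hs=1),
# 300-dimensional vectors, and words seen fewer than 10 times dropped
model = gensim.models.Word2Vec(posts, min_count=10,
                               sg=1, size=300, window=5, hs=1)
model.save('models/model1.model')
del model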

In [9]:
# load the model
model = gensim.models.Word2Vec.load('models/model1.model')
# Test the model: "cat" should appear near the top of this list
model.wv.most_similar(positive=["kitten"])

[('cat', 0.44724366068840027),
 ('baby', 0.42420050501823425),
 ('kitties', 0.40364179015159607),
 ('grandson', 0.3845823407173157),
 ('dog', 0.3836134672164917),
 ('pup', 0.37856483459472656),
 ('puppy', 0.37702861428260803),
 ('kittens', 0.3761734366416931),
 ('cottage', 0.3717459440231323),
 ('pet', 0.3630608320236206)]
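
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that ranking on made-up three-dimensional vectors; `toy_vectors` and both helper functions are illustrative names, not part of gensim or of this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors; real word2vec vectors here have 300 dimensions.
toy_vectors = {
    "kitten": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.1],
    "cottage": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("kitten", toy_vectors)
print(ranking)
```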

#### Word usage summary

Word2vec drops every word that appears fewer than `min_count` times,
so at this step we look at the words the model actually kept.

In [69]:
# Initialize the list of words used
vocab_list = list(model.wv.vocab)

In [33]:
# Number of distinct words in the model's vocabulary
unique_words = len(vocab_list)
unique_words

24116

In [35]:
# Total number of times the vocabulary words appear in the corpus
total_freq = 0
for word in vocab_list:
    total_freq += model.wv.vocab[word].count
total_freq

28219923
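
To see how the frequency cutoff shapes these two numbers, here is a minimal sketch on a toy corpus. It assumes the same rule word2vec applies (keep a word only if it occurs at least `min_count` times); `toy_posts` is made up, and the cutoff is lowered to 2 so the toy data survives it.

```python
from collections import Counter

# Made-up tokenized posts standing in for the cleaned corpus.
toy_posts = [
    ["i", "feel", "alone"],
    ["i", "feel", "tired"],
    ["i", "am", "alone"],
]
min_count = 2  # the notebook uses 10

# Count every token, then keep only words that meet the cutoff.
counts = Counter(word for post in toy_posts for word in post)
vocab = {word: n for word, n in counts.items() if n >= min_count}

unique_words = len(vocab)        # distinct words kept
total_freq = sum(vocab.values())  # occurrences of the kept words
print(unique_words, total_freq)
```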

#### Run Clustering
At this step we run the KMeans clustering algorithm
implemented by sklearn on the word vectors we got from word2vec.

The first step in this process is to extract the word vectors,
and the words they correspond to, from the model. After this we
fit the KMeans model to that data to get the clusters. Finally,
we use the KMeans model to generate a list of dictionaries, where each
dictionary corresponds to a cluster and contains the following fields:
    'unique_words': The number of unique words in the cluster
    'total_freq'  : The total number of times the words in the cluster appeared in the corpus
    'word_list'   : A list of the words in the cluster, each paired with how often it appeared in the corpus

Finally, we save this result to analyze it later.
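
The list of cluster dictionaries could be assembled roughly as follows. This is a hypothetical sketch, not the code in `clusterUtils.makeClusteringObjects`: the function name, its arguments, and the assumption that `word_list` holds corpus counts are all mine. `labels` plays the role of the fitted model's `kmeans.labels_`.

```python
def make_cluster_summaries(words, labels, counts, num_clusters):
    """Group words by cluster label and summarize each cluster as a dictionary."""
    clusters = [
        {"unique_words": 0, "total_freq": 0, "word_list": []}
        for _ in range(num_clusters)
    ]
    for word, label in zip(words, labels):
        cluster = clusters[label]
        cluster["unique_words"] += 1
        cluster["total_freq"] += counts[word]
        cluster["word_list"].append((word, counts[word]))
    return clusters


# Made-up inputs: four words, their cluster labels, and their corpus counts.
words = ["cat", "dog", "sad", "alone"]
labels = [0, 0, 1, 1]
counts = {"cat": 5, "dog": 3, "sad": 7, "alone": 4}

clusters = make_cluster_summaries(words, labels, counts, num_clusters=2)
print(clusters[0])
```

Because every word lands in exactly one cluster, summing `unique_words` and `total_freq` over the clusters should reproduce the vocabulary totals.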

In [9]:
# Extract the word vectors
vecs = []
for word in vocab_list:
    vecs.append(model.wv[word].tolist())

In [6]:
# change array format into numpy array
WordByFeatureMat = np.array(vecs)

In [None]:
# Initialize the KMeans model and fit it to the word vectors
num_clusters = 50
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(WordByFeatureMat)

In [None]:
# Check the fit of the model
kmeans.inertia_

In [11]:
clusters = clusterUtils.makeClusteringObjects(model,kmeans,vocab_list,WordByFeatureMat)

In [12]:
# Save the kmeans model, the clusters, and WordByFeatureMat in the objects directory
dataUtils.save_object(kmeans,'objects/','kmeans')
dataUtils.save_object(clusters,'objects/','clusters')
dataUtils.save_object(WordByFeatureMat,'objects/','WordByFeatureMat')

In [36]:
# determine the total words in the clusters, and the total number of unique words in the clusters
clusters_total_words  = 0
clusters_unique_words = 0
for cluster in clusters:
    clusters_total_words  += cluster['total_freq']
    clusters_unique_words += cluster['unique_words']

In [37]:
# Check that the total number of words in clusters matches the total
total_freq == clusters_total_words

True

In [38]:
# Check that the number of unique words in clusters matches the total number of unique words
unique_words == clusters_unique_words

True

#### Prepare for regression :TODO

At this step, we will initialize the matrices we need to run a linear regression algorithm.
We will first create a document-term matrix, and then a word-by-cluster matrix.
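
Because this section is still marked TODO, the following is only a guess at what the two matrices might look like, written with plain lists on made-up data. `docs`, `vocab`, and `word_to_cluster` are hypothetical, and multiplying the two matrices to get per-document cluster counts is my assumption about how they would be combined.

```python
# Made-up tokenized documents, vocabulary, and cluster assignments.
docs = [["cat", "sad", "cat"], ["dog", "alone"]]
vocab = ["cat", "dog", "sad", "alone"]
word_to_cluster = {"cat": 0, "dog": 0, "sad": 1, "alone": 1}
num_clusters = 2

# Column index of each vocabulary word.
col = {word: j for j, word in enumerate(vocab)}

# Document-term matrix: entry [i][j] counts word j in document i.
doc_term = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc:
        if word in col:
            doc_term[i][col[word]] += 1

# Word-by-cluster matrix: entry [j][c] is 1 if word j belongs to cluster c.
word_cluster = [[0] * num_clusters for _ in vocab]
for word, cluster in word_to_cluster.items():
    word_cluster[col[word]][cluster] = 1

# Their product counts, per document, how many tokens fall in each cluster.
doc_cluster = [
    [
        sum(doc_term[i][k] * word_cluster[k][c] for k in range(len(vocab)))
        for c in range(num_clusters)
    ]
    for i in range(len(docs))
]
print(doc_cluster)
```

On the full corpus a sparse format such as the `dok_matrix` imported at the top of the notebook would likely be a better fit than nested lists.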