# Suicide Watch analysis
This notebook walks through building the models we
built after collecting our data from the Suicide Watch subreddit.

We first import the libraries and utility files we will be using,
then parse and clean our data.

In [None]:
%matplotlib inline

# Import machine learning libraries
import gensim
import textmining
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import dok_matrix
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Import utility files
import dataUtils
import clusterUtils

In [None]:
# Get the data from the csv
df = dataUtils.read_df('data')

In [None]:
# Get the text for building the model
df = df.replace(np.nan, '', regex=True)
df["rawtext"] = df["title"] + " " + df["selftext"]
# Clean each post, then split it into a list of tokens
posts = df["rawtext"].apply(dataUtils.cleanSentence).apply(lambda text: text.split()).tolist()

#### Data summary statistics

Before building models, we first look at the data we are using.

In [None]:
# Get the number of posts
num_posts = len(posts)
num_posts

In [None]:
# Get the number of users (excluding "[deleted]")
userList = df["author"].tolist()
userDict = {}
for user in userList:
    # Skip deleted accounts so they are not counted as a user
    if user == "[deleted]":
        continue
    userDict[user] = userDict.get(user, 0) + 1
len(userDict)

#### Build word2vec model
At this step we build the word2vec model that we will use in the rest of the analysis.
Because this is a computationally expensive process, we save the trained model
as model_name + ".model" in the models directory. We can then load this model later and do not need
to rebuild it every time we want to analyze it.
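
The save-then-load pattern described above can be sketched as a small "load if cached, otherwise build" helper. This is only an illustration: `load_or_build` and `expensive_build` are hypothetical names, and `pickle` stands in for the model's own `save`/`load` methods that the cells below use.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the object cached at `path`, building and caching it on a miss."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    # Create the target directory (e.g. "models/") if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# Stand-in for an expensive training step (hypothetical)
calls = []
def expensive_build():
    calls.append(1)
    return {"kitten": [0.1, 0.2]}

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "models", "model1.pkl")
first = load_or_build(cache_path, expensive_build)   # builds and saves
second = load_or_build(cache_path, expensive_build)  # loads from disk
```

With this shape, rerunning the notebook only pays the training cost when the cached file is missing.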

In [None]:
model_name = "model1"

In [None]:
# Build the model
model = gensim.models.Word2Vec(posts,min_count =10,
                               sg=1, size =300,window=5,hs=1,negative=20)
model.save('models/'+model_name+'.model')
del model

In [None]:
# load the model
model = gensim.models.Word2Vec.load('models/'+model_name+'.model')
# Test the model: you should see cat somewhere in this list, near the top
# (similarity queries live on the model's KeyedVectors, model.wv)
model.wv.most_similar(positive=["kitten"])

#### Word usage summary

At this step, after the model has processed every word in the corpus
and filtered out those that appear fewer than min_count (10) times, we look at the words it kept.
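
As a rough illustration of what this filtering does to the two summary numbers computed below, here is a minimal sketch on toy data. `vocab_summary` and `toy_posts` are hypothetical names, and the rule "keep words whose total count is at least `min_count`" is assumed to mirror the word2vec setting used above.

```python
from collections import Counter

def vocab_summary(tokenized_posts, min_count):
    """Count tokens, drop those rarer than min_count, and summarize the rest."""
    counts = Counter(token for post in tokenized_posts for token in post)
    kept = {word: c for word, c in counts.items() if c >= min_count}
    # (number of distinct kept words, total occurrences of kept words, the counts)
    return len(kept), sum(kept.values()), kept

toy_posts = [["i", "feel", "alone"], ["i", "feel", "tired"], ["i", "am", "here"]]
unique_words_toy, total_freq_toy, kept = vocab_summary(toy_posts, min_count=2)
# Only "i" (3 occurrences) and "feel" (2 occurrences) survive the filter
```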

In [None]:
# Initialize the list of words used
vocab_list = list(model.wv.vocab)

In [None]:
unique_words = len(vocab_list)
unique_words

In [None]:
total_freq = 0
for word in vocab_list:
    total_freq += model.wv.vocab[word].count
total_freq

#### Run Clustering
At this step we run and analyze the KMeans clustering algorithm
implemented by sklearn on the word vectors we got from word2vec.

The first step in this process is to extract the word vectors,
and the words they correspond to, from the model. We then test
different values of K to observe the effect of the number of centers on the fit of the model.
After this we select a value of K to use for the final clustering.
We then save this result in the directory "clusters" under the name model_name + num_clusters + ".pkl", to save future computation time.
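
The "fit" recorded for each K below is KMeans' `inertia_`: the sum of squared distances from each point to its nearest cluster center. A minimal pure-Python sketch of that quantity, on made-up points and hand-picked centers (the `inertia` function here is illustrative, not sklearn's implementation):

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

# Two well-separated pairs of points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

inertia_k1 = inertia(points, one_center)
inertia_k2 = inertia(points, two_centers)
```

Inertia generally keeps shrinking as K grows, so the plot is usually read for the point where the improvement levels off rather than for a minimum.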

We then use the kmeans model to generate a list of dictionaries, where each dictionary corresponds to a cluster and contains the following fields:
    'unique_words': The number of different unique words in the cluster
    'total_freq'  : The total number of times one of the words in the cluster appeared in the corpus
    'word_list'   : A list of words in the cluster, paired with how often they appeared in the cluster

Finally, we write a representation of this list to a csv so that the clusters can be manually inspected.
This representation includes the number of unique words in the cluster, the total frequency of words in the cluster, and the size_words_list most frequent words in the cluster.
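
The helper that builds these dictionaries, `clusterUtils.makeClusteringObjects`, is not shown in this notebook. The sketch below is a guess at the shape of its output, assuming each word comes with a corpus count and a cluster label; `summarize_clusters` and the toy data are hypothetical.

```python
def summarize_clusters(words, counts, labels, num_clusters):
    """Group words by cluster label and accumulate per-cluster statistics."""
    clusters = [{"unique_words": 0, "total_freq": 0, "word_list": []}
                for _ in range(num_clusters)]
    for word, count, label in zip(words, counts, labels):
        cluster = clusters[label]
        cluster["unique_words"] += 1
        cluster["total_freq"] += count
        cluster["word_list"].append((word, count))
    return clusters

words = ["cat", "kitten", "sad", "alone", "dog"]
counts = [12, 10, 40, 25, 15]   # made-up corpus frequencies
labels = [0, 0, 1, 1, 0]        # made-up cluster assignments
toy_clusters = summarize_clusters(words, counts, labels, num_clusters=2)

# Most frequent words first, as in the csv output
for cluster in toy_clusters:
    cluster["word_list"].sort(key=lambda pair: pair[1], reverse=True)
```

Summing `total_freq` and `unique_words` over all clusters should reproduce the corpus-wide totals, which is the sanity check performed later in this section.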

In [None]:
# Extract the word vectors
vecs = []
for word in vocab_list:
    vecs.append(model.wv[word].tolist())

In [None]:
# change array format into numpy array
WordByFeatureMat = np.array(vecs)

In [None]:
# get the fit for different values of K
test_points = [12]+ list(range(25,401,25))
fit = []
for point in test_points:
    tempMeans = KMeans(n_clusters=point, random_state=42).fit(WordByFeatureMat)
    fit.append(tempMeans.inertia_)

In [None]:
# Save the fit values for this model
dataUtils.save_object(fit,'objects/',model_name+"-fit")
dataUtils.save_object(test_points,'objects/',model_name+"-testpoints")
del fit
del test_points

In [None]:
# Load the fit and test point values
fit         = dataUtils.load_object('objects/',model_name+"-fit")
test_points = dataUtils.load_object('objects/',model_name+"-testpoints")

In [None]:
# graph the fit for different values of K
plt.plot(test_points,fit,'ro')
plt.show()

In [None]:
# set the number of clusters
num_clusters = 100

In [None]:
#initialize kmeans model
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(WordByFeatureMat)
# Save the fitted model in the clusters directory
dataUtils.save_object(kmeans,'clusters/',model_name+str(num_clusters))
del kmeans

In [None]:
# load kmeans
kmeans = dataUtils.load_object('clusters/',model_name+str(num_clusters))

In [None]:
clusters = clusterUtils.makeClusteringObjects(model,kmeans,vocab_list,WordByFeatureMat)

In [None]:
# determine the total words in the clusters, and the total number of unique words in the clusters
clusters_total_words  = 0
clusters_unique_words = 0
for cluster in clusters:
    clusters_total_words  += cluster['total_freq']
    clusters_unique_words += cluster['unique_words']

In [None]:
# Check that the total number of words in clusters matches the total
clusters_total_words   

In [None]:
# Check that the number of unique words in clusters matches the total number of unique words
clusters_unique_words

##### Print clusters

Print the clusters so we can analyze them.

In [None]:
# Sort all the words in the words list
for cluster in clusters:
    cluster["word_list"].sort(key=lambda x:x[1],reverse = True)

In [None]:
size_words_list = 10
table = []
for i, cluster in enumerate(clusters):
    # One row per cluster: label, total frequency, unique word count, top words
    row = ["cluster " + str(i + 1), cluster["total_freq"], cluster["unique_words"]]
    # Slicing avoids an IndexError when a cluster has fewer than size_words_list words
    row.extend(cluster["word_list"][:size_words_list])
    table.append(row)

In [None]:
import csv
with open('clusters.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    # Write every row of the table in one call
    writer.writerows(table)

#### Prepare for regression :TODO

At this step, we initialize the matrices we need to run a linear regression algorithm.
We need to create a document-term matrix and a words-by-cluster matrix.
We first use sklearn's CountVectorizer class to create the document-term matrix.
We then create the words-by-cluster matrix by giving each word a one-hot vector, with a
one in the position of its cluster number and a 0 everywhere else.
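
Because each word's row is one-hot, multiplying the document-term matrix by the words-by-cluster matrix amounts to counting, for each post, how many of its tokens fall in each cluster. A minimal sketch of that equivalence on toy data follows; `posts_by_cluster` and `word_to_cluster` are hypothetical names, and words without a cluster are assumed to be ignored.

```python
def posts_by_cluster(tokenized_posts, word_to_cluster, num_clusters):
    """Return one row per post, with the token count for each cluster."""
    rows = []
    for post in tokenized_posts:
        row = [0] * num_clusters
        for token in post:
            cluster = word_to_cluster.get(token)
            if cluster is not None:   # skip words that were never clustered
                row[cluster] += 1
        rows.append(row)
    return rows

word_to_cluster = {"cat": 0, "kitten": 0, "sad": 1, "alone": 1}
toy_posts = [["sad", "cat", "sad"], ["alone", "unknownword"], []]
matrix = posts_by_cluster(toy_posts, word_to_cluster, num_clusters=2)
```

On real data the same result would come from a sparse matrix product, which requires the document-term columns and the words-by-cluster rows to list the same words in the same order.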

In [None]:
df = dataUtils.read_df('data')

In [None]:
df.head()

In [None]:
df =df.replace(np.nan, '', regex=True)
df["rawtext"]= df["title"]+" "+df["selftext"]

In [None]:
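# Count how often each whitespace-separated token appears in the raw text.
# As written, "[deleted]" is reset to 1 on every occurrence, so the
# frequency filter in the next cell removes it.
wordDict = {}
for sentence in df["rawtext"]:
    for word in sentence.split():
        if word in wordDict and word != "[deleted]":
            wordDict[word] += 1
        else:
            wordDict[word] = 1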

In [None]:
# Keep only tokens that appear at least 10 times in the corpus
df["cleantext"] = df["rawtext"].apply(lambda text: ' '.join(w for w in text.split() if wordDict[w] >= 10))

In [None]:
countvec = CountVectorizer()

In [None]:
PostsByWords =countvec.fit_transform(df.cleantext)

In [None]:
PostsByWords

In [None]:
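# PostsByWords is a scipy sparse matrix, which np.dot does not handle; use its own dot method.
# This assumes the columns of PostsByWords line up with the rows of WordByFeatureMat
# (same words, same order), which a default CountVectorizer does not guarantee.
PostsByFeatures = PostsByWords.dot(WordByFeatureMat)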