# Text Analysis

This notebook analyzes the text used to train our model. It counts the total words, the unique words, the posts, and the users, and it determines how many words are removed by Gensim.

We first import the libraries we will need throughout the project.

In [1]:
#Import graphing utilities
%matplotlib inline
import matplotlib.pyplot as plt

# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim

# Import utility files
from utils import save_object, load_object

### Set model name

Before beginning the rest of this project, we select a name for our model. This name is used to save and load the files for this model.

In [2]:
model_name = "PTSD_model"

### Load our Data

After selecting the model name, we load the data.

In [3]:
posts = load_object('objects/',model_name+"-posts")
df    = load_object('objects/',model_name+"-df")

### Posts Analysis

We first analyze the posts and their authors.

In [4]:
num_posts = len(df['cleantext'])

In [5]:
# Count posts per author.
# Note: "[deleted]" is not dropped here. It stays in the dict as a single key
# whose count is pinned at 1, so num_users includes it if any author was deleted.
user_list = df["author"].tolist()
user_dict = {}
for user in user_list:
    if user in user_dict and user != "[deleted]":
        user_dict[user] = 1 + user_dict[user]
    else:
        user_dict[user] = 1
num_users = len(user_dict)

In [6]:
num_posts

7057

In [7]:
num_users

3330
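
If the intent is to leave `[deleted]` out of the user count entirely, one option is to filter it before counting. The following is a minimal sketch on a hypothetical list of authors (standing in for `df["author"].tolist()`), not the method used for the figure above:

```python
from collections import Counter

# Hypothetical stand-in for df["author"].tolist()
authors = ["alice", "bob", "[deleted]", "alice", "[deleted]"]

# Count posts per author, skipping the "[deleted]" placeholder entirely
posts_per_user = Counter(a for a in authors if a != "[deleted]")

toy_num_posts = len(authors)         # every post, including deleted authors
toy_num_users = len(posts_per_user)  # distinct authors, excluding "[deleted]"
```

With this approach the user count and the per-user post counts both ignore deleted accounts, while the post count still covers every row.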

### Word/Phrase Analysis

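We now analyze the words and phrases used by our model. Here `plain_words` holds each post split on whitespace, while `posts` holds the same posts as lists of tokens in which some word sequences have been merged into phrases.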

In [8]:
plain_words = list(df['cleantext'].apply(lambda x: x.split()))

In [9]:
# Total number of tokens (words and phrases) across all posts
total_phrases = sum(len(post) for post in posts)

In [10]:
# Total number of plain words across all posts
total_words = sum(len(post) for post in plain_words)

In [11]:
# Frequency of each token (word or phrase)
phrase_dict = {}
for post in posts:
    for phrase in post:
        phrase_dict[phrase] = phrase_dict.get(phrase, 0) + 1

In [12]:
# Frequency of each plain word
word_dict = {}
for post in plain_words:
    for word in post:
        word_dict[word] = word_dict.get(word, 0) + 1

In [13]:
# Total words in the corpus
total_words

1592918

In [14]:
# Total phrases in the corpus
total_phrases

1470300

In [15]:
# Vocabulary size (unique words)
len(word_dict)

24942

In [19]:
# Vocabulary size (unique tokens, including phrases)
len(phrase_dict)

28137
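
The total and vocabulary counts above can also be obtained in one pass with `collections.Counter`. This is a minimal sketch on a hypothetical `toy_posts` list shaped like `posts` (a list of token lists), offered as an alternative to the explicit loops:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for `posts`: one list of tokens per post
toy_posts = [["i", "have", "night_terrors"], ["i", "sleep", "badly"]]

# Flatten the posts and count every token once
token_counts = Counter(chain.from_iterable(toy_posts))

toy_total_tokens = sum(token_counts.values())  # total tokens in the corpus
toy_vocab_size = len(token_counts)             # number of unique tokens
```

The resulting `Counter` behaves like the dictionaries built above, so it can be used in the filtering cells without further changes.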

In [20]:
phrases = list(phrase_dict.keys())
phrase_freq_count            = 0
filtered_phrase_freq_count   = 0
phrase_unique_count          = 0
filtered_phrase_unique_count = 0
for phrase in phrases:
    count = phrase_dict[phrase]
    phrase_freq_count            += count
    filtered_phrase_freq_count   += count if count >= 10 else 0
    phrase_unique_count          += 1
    filtered_phrase_unique_count += 1 if count >= 10 else 0

In [21]:
words = list(word_dict.keys())
word_freq_count            = 0
filtered_word_freq_count   = 0
word_unique_count          = 0
filtered_word_unique_count = 0
for word in words:
    count = word_dict[word]
    word_freq_count            += count
    filtered_word_freq_count   += count if count >= 10 else 0
    word_unique_count          += 1
    filtered_word_unique_count += 1 if count >= 10 else 0
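
The two cells above apply the same rule: a token is kept only if it occurs at least 10 times. As a sketch, that rule can be wrapped in a small helper. The function name `filter_summary` and the toy counts are hypothetical and used only for illustration:

```python
from collections import Counter

def filter_summary(counts, min_count=10):
    """Summarize the effect of dropping tokens seen fewer than min_count times."""
    kept = {tok: c for tok, c in counts.items() if c >= min_count}
    total = sum(counts.values())
    kept_total = sum(kept.values())
    return {
        "vocab_before": len(counts),
        "vocab_after": len(kept),
        "tokens_before": total,
        "tokens_after": kept_total,
        "percent_removed": 100 * (total - kept_total) / total,
    }

# Hypothetical frequency table standing in for phrase_dict or word_dict
toy_counts = Counter({"sleep": 25, "night_terrors": 12, "xyzzy": 3})
summary = filter_summary(toy_counts, min_count=10)
```

Calling the helper once on `phrase_dict` and once on `word_dict` would yield the same quantities as the loops above.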

In [22]:
# Total number of tokens, including phrases and words
phrase_freq_count

1470300

In [23]:
# Total number of words, not including phrases
word_freq_count

1592918

In [24]:
# Reduction in token count caused by merging words into phrases
word_freq_count - phrase_freq_count

122618

In [25]:
# Total number of tokens after filtering, including phrases and words
filtered_phrase_freq_count

1415807

In [26]:
# Total number of tokens after filtering, counting plain words only
filtered_word_freq_count

1546630

In [27]:
# Check that the unique counts were calculated correctly
phrase_unique_count == len(phrase_dict) and word_unique_count == len(word_dict)

True

In [28]:
# The size of the vocabulary after filtering phrases
filtered_phrase_unique_count

7787

In [29]:
# The number of unique tokens removed by filtering
phrase_unique_count - filtered_phrase_unique_count

20350

In [30]:
# The percent of total tokens removed
str((phrase_freq_count - filtered_phrase_freq_count) / phrase_freq_count * 100) + "%"

'3.7062504250833164%'

In [31]:
# The percent of total tokens preserved
str(100 - 100 * (phrase_freq_count - filtered_phrase_freq_count) / phrase_freq_count) + "%"

'96.29374957491669%'
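
As a quick sanity check, the derived figures can be recomputed directly from the totals printed earlier in this section. The sketch below only restates those reported numbers; the variable names are chosen here for illustration:

```python
# Totals as reported by the cells above
reported_word_tokens = 1592918      # word_freq_count
reported_phrase_tokens = 1470300    # phrase_freq_count
reported_filtered_tokens = 1415807  # filtered_phrase_freq_count
reported_vocab = 28137              # phrase_unique_count
reported_filtered_vocab = 7787      # filtered_phrase_unique_count

# Token reduction from merging words into phrases
words_merged = reported_word_tokens - reported_phrase_tokens

# Tokens and vocabulary entries dropped by the frequency filter
tokens_removed = reported_phrase_tokens - reported_filtered_tokens
vocab_removed = reported_vocab - reported_filtered_vocab

# Share of all tokens dropped by the filter, in percent
percent_removed = 100 * tokens_removed / reported_phrase_tokens
```

So although filtering drops most of the vocabulary (20350 of 28137 unique tokens), it removes under 4% of the token occurrences, because the dropped tokens are rare by construction.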

### Check model

We now analyze the model to ensure that its vocabulary size and total token count correspond to those of the filtered posts.

In [32]:
model = gensim.models.Word2Vec.load('models/'+model_name+'.model')

In [33]:
# Note: `wv.vocab` is the Gensim 3.x API; Gensim 4.0 replaced it with `wv.key_to_index`
vocab_list = sorted(list(model.wv.vocab))

In [34]:
# Ensure model has correct number of unique words
len(vocab_list)==filtered_phrase_unique_count

True

In [35]:
model_freq_count = 0
for word in vocab_list:
    model_freq_count += model.wv.vocab[word].count

In [36]:
# Ensure that the total count of the model's words is the total count of the filtered words
model_freq_count==filtered_phrase_freq_count

True
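
The checks above rely on `model.wv.vocab`, which exists in Gensim 3.x but was removed in Gensim 4.0 in favor of `key_to_index` and `get_vecattr`. The following is a sketch of a version-tolerant helper; the name `vocab_counts` and the `StubVectors` class are hypothetical, and the stub only mimics the Gensim 4.x attribute names so the example runs without a trained model:

```python
def vocab_counts(wv):
    """Return {token: corpus count} for a gensim KeyedVectors-like object."""
    if hasattr(wv, "key_to_index"):
        # Gensim 4.x layout: counts are stored as a per-key vector attribute
        return {tok: wv.get_vecattr(tok, "count") for tok in wv.key_to_index}
    # Gensim 3.x layout, as used in this notebook
    return {tok: entry.count for tok, entry in wv.vocab.items()}


class StubVectors:
    """Hypothetical stand-in exposing the Gensim 4.x attribute names."""

    def __init__(self, counts):
        self.key_to_index = {tok: i for i, tok in enumerate(counts)}
        self._counts = counts

    def get_vecattr(self, key, attr):
        if attr != "count":
            raise KeyError(attr)
        return self._counts[key]


stub_counts = vocab_counts(StubVectors({"ptsd": 120, "night_terrors": 15}))
stub_total = sum(stub_counts.values())
```

With a real model, `vocab_counts(model.wv)` would give the per-token counts, and its length and sum could be compared against `filtered_phrase_unique_count` and `filtered_phrase_freq_count` as in the cells above.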