# Text Analysis

This notebook analyses the text we are using for our model. This includes counting the total words, the unique words, the total number of posts, and the total number of users, and determining the number of words removed by Gensim.

We first import the libraries we will need throughout the project.

In [1]:
# Import graphing utilities
%matplotlib inline
import matplotlib.pyplot as plt

# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim

# Import utility files
from utils import save_object,load_object

### Set model name

Before beginning the rest of this project, we select a name for our model. This name will be used to save and load the files for this model.

In [2]:
model_name = "model6"

### Load our Data

After selecting our model name, we load the data.

In [3]:
posts = load_object('objects/',model_name+"-posts")
df    = load_object('objects/',model_name+"-df")

### Posts Analysis

We first analyze the raw contents of the posts and their authors.

In [4]:
num_posts = len(df['cleantext'])

In [5]:
# Count every word occurrence across all posts
total_words = sum(1 for post in posts for word in post)

In [6]:
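# Get the number of distinct users.
# Posts by "[deleted]" are not tallied individually, but "[deleted]" itself
# still ends up as one key in userDict, so num_users includes it once
# whenever at least one such post exists.
userList = df["author"].tolist()
userDict = {}
for user in userList:
    if user in userDict and user != "[deleted]":
        userDict[user] += 1
    else:
        userDict[user] = 1
num_users = len(userDict)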
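
The loop above keeps `"[deleted]"` as a key, so it contributes one entry to the user count. If the placeholder should be left out entirely, a set difference is one way to do it. This is a minimal sketch on toy data; `count_users` and `sample_authors` are hypothetical names that do not appear elsewhere in this notebook.

```python
def count_users(authors, placeholder="[deleted]"):
    """Return the number of distinct authors, ignoring the placeholder name."""
    return len(set(authors) - {placeholder})


# Toy data standing in for df["author"] (assumption, not the real data)
sample_authors = ["alice", "bob", "[deleted]", "alice", "[deleted]"]
sample_num_users = count_users(sample_authors)
print(sample_num_users)
```

On the real data this would presumably be called as `count_users(df["author"])`, and the result may differ by one from the `num_users` value reported in this notebook.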

In [7]:
# Map each word to the number of times it occurs across all posts
wordDict = {}
for post in posts:
    for word in post:
        if word in wordDict:
            wordDict[word] += 1
        else:
            wordDict[word] = 1
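
The same word-frequency table can be built with the standard library's `collections.Counter`, which also gives the total and unique word counts directly. This is a sketch on toy data, assuming each post is a list of word strings as in the loops above; the `sample_*` names are hypothetical.

```python
from collections import Counter
from itertools import chain

# Toy posts standing in for the real `posts` object (assumption)
sample_posts = [["the", "cat", "sat"], ["the", "dog"]]

# Flatten all posts into one stream of words and count each word
sample_word_counts = Counter(chain.from_iterable(sample_posts))

sample_total_words = sum(sample_word_counts.values())  # all word occurrences
sample_unique_words = len(sample_word_counts)          # distinct words
print(sample_total_words, sample_unique_words)
```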

In [8]:
num_posts

131652

In [9]:
total_words

27978246

In [10]:
num_users

63252

In [11]:
len(wordDict)

97368

In [12]:
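# Tally word occurrences and unique words, both overall and for the
# "filtered" vocabulary of words that occur at least 10 times
words = list(wordDict.keys())
total_freq_count      = 0
filtered_freq_count   = 0
total_unique_count    = 0
filtered_unique_count = 0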
for word in words:
    count = wordDict[word]
    total_freq_count      += count
    filtered_freq_count   += count if count >= 10 else 0
    total_unique_count    += 1
    filtered_unique_count += 1 if count >= 10 else 0

In [13]:
# Total number of unique words
total_unique_count

97368

In [14]:
# Total number of unique words with a count of at least 10
filtered_unique_count

28663

In [16]:
# Percentage of the text that was made up of words with count below 10
str(((total_freq_count-filtered_freq_count)/total_freq_count)*100) + "%"

'0.5910449139663723%'
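
The filtering statistics above can be collected in one helper, which makes the threshold explicit. This is a sketch only: `vocab_filter_stats` and its `min_count` parameter are hypothetical names, and the default of 10 simply mirrors the cut-off used in the cells above.

```python
def vocab_filter_stats(word_counts, min_count=10):
    """Summarise how much of the text survives a minimum-count filter.

    word_counts maps each word to its number of occurrences.
    """
    total_freq = sum(word_counts.values())
    kept = {w: c for w, c in word_counts.items() if c >= min_count}
    kept_freq = sum(kept.values())
    removed_pct = 100 * (total_freq - kept_freq) / total_freq if total_freq else 0.0
    return {
        "total_unique": len(word_counts),
        "filtered_unique": len(kept),
        "total_freq": total_freq,
        "filtered_freq": kept_freq,
        "removed_pct": removed_pct,
    }


# Toy counts (assumption): two frequent words, two rare ones
stats = vocab_filter_stats({"the": 12, "cat": 10, "zyx": 2, "qq": 1})
print(stats)
```

In this toy example 3 of the 25 word occurrences belong to rare words, so `removed_pct` is 12.0, even though half of the unique words are dropped. The real data shows the same pattern: most unique words are filtered out, but they account for well under 1% of the text.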

### Check model

We now analyze the model to ensure that its word counts correspond to the word counts of the posts.

In [17]:
model = gensim.models.Word2Vec.load('models/'+model_name+'.model')

In [21]:
vocab_list = sorted(list(model.wv.vocab))

In [23]:
# Ensure model has correct number of unique words
len(vocab_list)==filtered_unique_count

True

In [36]:
model_freq_count = 0
for word in vocab_list:
    model_freq_count += model.wv.vocab[word].count

In [37]:
# Ensure that the total count of the model's words is the total count of the filtered words
model_freq_count==filtered_freq_count

True
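
The cells above read counts through `model.wv.vocab`, which is the Gensim 3.x interface. In Gensim 4.x that attribute was replaced by `key_to_index` and `get_vecattr`. The helper below is a hedged sketch of how both versions could be handled; `vocab_counts` is a hypothetical name, and the function has not been run against the model in this notebook.

```python
def vocab_counts(wv):
    """Return {word: corpus count} for a Gensim KeyedVectors-like object."""
    if hasattr(wv, "key_to_index"):
        # Gensim 4.x style: counts are stored as a per-key vector attribute
        return {word: wv.get_vecattr(word, "count") for word in wv.key_to_index}
    # Gensim 3.x style, as used in this notebook: vocab maps word -> entry
    return {word: entry.count for word, entry in wv.vocab.items()}


# Intended usage (assumes `model` is the Word2Vec model loaded above):
#   counts = vocab_counts(model.wv)
#   len(counts) == filtered_unique_count
#   sum(counts.values()) == filtered_freq_count
```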