# Topic modelling and word2vec analysis of /r/gaming user comments

This notebook demonstrates some text analysis of 1 year of comments on /r/gaming, using LDA topic modelling and word2vec. 

This notebook doesn't analyse the entire dataset. It analyses only the comments that include (an arbitrarily selected set of) terms relating to females, e.g., "woman", "women", "she", "girl", and so on. Thus the total comments for this data subset is 69502. 

In [1]:
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import nltk

import os
from time import time
from collections import defaultdict

import numpy as np
import pandas as pd

import gensim
from gensim import corpora, utils
from gensim.corpora.dictionary import Dictionary
from gensim.models.wrappers.dtmmodel import DtmModel

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

import matplotlib
import matplotlib.pyplot as plt

from dateutil import parser
import datetime

import random

%matplotlib inline

In [2]:
import csv

#with open('REDDIT_DATASET_ONE_YEAR_COMMENTS.csv', 'r') as f:
#  reader = csv.reader(f)
#  your_list = list(reader)

with open('REDDIT_DATASET_ONE_YEAR_COMMENTS_FEMALE_RELATED_COMMENTS.csv', 'r') as f:
  reader = csv.reader(f)
  your_list = list(reader)
    
your_list.pop(0) # remove the first element which is accidentally put there by R after exporting

['x']

In [3]:
# print(your_list[:5]) # first 5 elements

In [4]:
# len(your_list)

In [5]:
import itertools
merged_list = list(itertools.chain(*your_list))

In [6]:
# import random
# merged_list = merged_list[1:10000] # we create a small toy dataset to work with 
# merged_list = random.sample(merged_list, 100000)
# print(merged_list[:5]) # view first 5 elements

In [7]:
# len(merged_list)

In [8]:
comments_text = [comment for comment in merged_list]
comments_text_original = [comment for comment in merged_list]

In [9]:
# do word stemming?
from stemming.porter2 import stem

comments_text = [[stem(word) for word in sentence.split(" ")] for sentence in comments_text]
comments_text = [" ".join(sentence) for sentence in comments_text]

In [10]:
# print(comments_text[:5])

## Generate topic model

In this section we generate the topic model. There are some particular parameters that have an important role in shaping the outcome of the model. Changing these will affect the output in various ways. 

I have tested quite a few different combinations and the current output seems to give reasonable results (although we can certainly do better).

num_topics = 30 ... we specify that we want to generate 30 topics from our data
word stemming ... we use a word stemmer to reduce some words to their roots (e.g., walked and walking becomes "walk")
n-grams ... we specify that we want single terms (e.g. "walked"), and also find bi-grams or 2-grams (e.g., "she walked")
stop words ... we remove common English words that are not useful for analysis (e.g. "the", "and")
max_df ... we ignore terms that have a document frequency strictly higher than the given threshold
min_df ... ignore terms that have a document frequency strictly lower than the given threshold


In [11]:
# for good overview of these values see: http://stackoverflow.com/a/35615151/2589495
max_df = 0.06 # ignore terms that have a document frequency strictly higher than x% of docs (or occurs in > n docs if integer value)
min_df = 0.001 # ignore terms that have a document frequency strictly less than x% of documents (or occurs in < n documents if integer value)
# Some good values found so far...
# max_df = 0.05 (smaller in same order of magnititude is also good e.g. 0.03)
# min_df = 2
# k topics = 50
# passes = 3

# bi-grams:
# tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df,ngram_range=(1,2),stop_words='english',analyzer='word')
# unigrams 
tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df,stop_words='english')
tf = tf_vectorizer.fit_transform(comments_text)

In [12]:
# dictionary.filter_tokens(bad_ids=low_value_words)
# corpus = [dictionary.doc2bow(doc) for doc in texts]
corpus = gensim.matutils.Sparse2Corpus(tf.T)
n_terms, n_docs = corpus.sparse.shape
id2word = {i:word for i, word in enumerate(tf_vectorizer.get_feature_names())}
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)

In [47]:
# generate LDA model
number_topics = 30
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=number_topics, id2word = dictionary, passes=100,update_every=1, chunksize=10000)

In [48]:
# ldamodel.print_topics(-1)

## Visualisation of topics (by term/word)

This is an interactive visualisation of the topic model. The right-hand side shows which terms belong to each topic and how important they are for that topic (in order of importance). 

The left-hand Intertopic Distance Map gives an indication of how closely related the topics are (projected into a two-dimensional space). It may or may not be that helpful analytically, but is included anyway. 

In [81]:
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary,sort_topics=False)
pyLDAvis.save_html(vis, 'lda_model_B.html')

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]


In [50]:
pyLDAvis.display(vis)

In [51]:
docTopicProbMat = ldamodel[corpus]

In [52]:
docTopicProbMat_list = list(docTopicProbMat)

In [53]:
# print(docTopicProbMat_list[7]) # get the topic probabilities for a document
# print(comments_text[7]) # and the document text

In [54]:
comment_probabilities = []
for doc in docTopicProbMat_list:
    docTopic = 999 # reset the topic
    bestMatch = 0 # reset the best match
    possibleMatch = 0 # reset the possible match
    for topic in doc:
        possibleMatch = topic[1]
        if possibleMatch > bestMatch:
                bestMatch = topic[1]
                if bestMatch > 0.5: # minimum threshold for comment topic relatedness
                    docTopic = topic[0]
                # print("bestMatch is now: ",bestMatch)
                # print("docTopic is now: ",docTopic)
    # print("best topic was: ",docTopic)
    comment_probabilities.append(docTopic)

In [55]:
# comment_probabilities[0:10]

In [56]:
comment_probabilities = [probs + 1 for probs in comment_probabilities] # now when we ask for topic n we actually get topic n (not topic n-1)

In [57]:
# print(comments_text[8])
# comment_probabilities[8]

In [58]:
comments_and_topics = pd.DataFrame(
    {'original_comment_text': comments_text_original,
     'comment_text': comments_text,
     'topic': comment_probabilities
    })

In [59]:
# comments_and_topics

In [60]:
# a function to get n random example comments assigned to a given topic k
def getTopicExamples(topic_id,num_examples):
    topic_comments = list(comments_and_topics['original_comment_text'][comments_and_topics.topic == topic_id])
    topic_comments = random.sample(topic_comments, num_examples)
    return(topic_comments)

## Extract 10 comment examples for each topic

In this section we extract 10 (randomly selected) text comments that belong to each topic. 

This provides some examples of comments that belong to each topic. The topics are ordered from 1 to k (where k is the number of topics we specified for the model). 

You can use these comment examples to help make sense of the topic terms in the above visualisation, in other words "what is going on" for each topic.

In [61]:
for k in range(1,number_topics+1):
    topic_example_comments = getTopicExamples(k,10)
    print("\n____________________ TOPIC",k,"______________________________\n")
    print(*topic_example_comments, sep='\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - \n')


____________________ TOPIC 1 ______________________________

&gt; In Tracer's case, it really didn't fit her character 

I don't understand how the pose didn't fit her character. Her thing is being fast, isn't it? So looking over her shoulder behind her is something that she does a lot of on a daily basis anyway. 

The pin up girl pose it was replaced with doesn't fit her "character", it fits her design. She has an bomber jacket, pin up girls were popular in the military, so it fits her "personality" to pose like one? That's stupid.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
&gt;girlfriend just started streaming

&gt;girl

&gt;no viewers

I am skeptical.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Let's ignore the "sexy" argument for a bit (it wasn't even the original guy's main complaint).

If we all agree that her ass isn't even remotely sexy (hypothetically), what's left in the pose?

Why is tracer just standing there? Why isn't she doing anything? How is

In [62]:
# theta, _ = ldamodel.inference(corpus) # we can extract the thetas directly from the corpus

In [63]:
# theta[8] # columns are indexed from 0, so topic n is topic n+1 (e.g. topic22 is topic23)

## Word2vec analysis of text

In this section we do some exploratory text analysis using the novel word2vec algorithm.

This is very experimental... just some computational doodling.

In [64]:
sentences = [comment for comment in merged_list]
sentences = [[stem(word) for word in sentence.split(" ")] for sentence in sentences]

In [65]:
# import nltk
# nltk.download('stopwords')

In [66]:
from nltk.corpus import stopwords
filtered_words = [word for word in sentences if word not in stopwords.words('english')]

In [67]:
# print(filtered_words[2]) # sanity check the data first

In [68]:
# convert to lowercase
filtered_lower_words = []
for doc in filtered_words:
    filtered_lower_words.append([word.lower() for word in doc])

In [69]:
# remove punctuation
import string
filtered_lower_words_nopunct = []
for doc in filtered_lower_words:
    filtered_lower_words_nopunct.append([''.join(c for c in s if c not in string.punctuation) for s in doc])

In [70]:
# remove empty words due to previous steps 
filtered_lower_words_nopunct_nospaces = []
for doc in filtered_lower_words_nopunct:
    filtered_lower_words_nopunct_nospaces.append([s for s in doc if s])

In [71]:
# print(filtered_lower_words_nopunct_nospaces[2]) # sanity check the data quickly

In [72]:
from gensim.models import Word2Vec
min_count = 10
size = 50
window = 4
 
model = Word2Vec(filtered_lower_words_nopunct_nospaces, min_count=min_count, size=size, window=window)


Let's try the classic word2vec example:

King - man + woman = ?

In [73]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

[('booker', 0.7179170250892639),
 ('comstock', 0.7103918790817261),
 ('queen', 0.6987165808677673),
 ('command', 0.6826412677764893),
 ('wild', 0.6599280834197998),
 ('jedi', 0.6471225619316101),
 ('shadow', 0.6427894234657288),
 ('faction', 0.6389576196670532),
 ('trench', 0.6333158612251282),
 ('emperor', 0.6313338279724121)]

Using this kind of analysis we can ask other questions more specific to the paper:
    
Gamer - man + woman = ?

(What is 'gamer' without taking into account 'men', but taking into account 'women'?)

In [74]:
model.wv.most_similar(positive=['women', 'gamer'], negative=['men'])

[('nerd', 0.6442986130714417),
 ('fan', 0.6413302421569824),
 ('hobby', 0.6356649994850159),
 ('hardcor', 0.6190951466560364),
 ('streamer', 0.6040115356445312),
 ('video', 0.5970759987831116),
 ('gamers', 0.5957229733467102),
 ('communiti', 0.5908427834510803),
 ('onlin', 0.5813985466957092),
 ('videogam', 0.5635467767715454)]

And we can also do the converse for men:

Gamer - woman + man = ?

(What is 'gamer' without taking into account women, but taking into account men?)

In [75]:
model.wv.most_similar(positive=['men', 'gamer'], negative=['women'])

[('gamers', 0.756857693195343),
 ('streamer', 0.5786899924278259),
 ('harass', 0.5523459911346436),
 ('gaming', 0.5389747023582458),
 ('protagonist', 0.5371449589729309),
 ('attract', 0.5215532779693604),
 ('soldiers', 0.519288182258606),
 ('circumcis', 0.514221727848053),
 ('males', 0.5065438151359558),
 ('young', 0.49974873661994934)]

In [76]:
model.most_similar("angry")

[('scary', 0.7530699968338013),
 ('uncomfortable', 0.7322939038276672),
 ('salti', 0.7311169505119324),
 ('hungry', 0.719622790813446),
 ('pathetic', 0.6964035630226135),
 ('retarded', 0.6960612535476685),
 ('rude', 0.6938974261283875),
 ('whores', 0.6929017901420593),
 ('complicated', 0.6869460344314575),
 ('crazy', 0.6857407093048096)]

In [77]:
model.most_similar(["sjw"])

[('idiot', 0.7991223335266113),
 ('feminist', 0.7898494601249695),
 ('sjws', 0.7504045963287354),
 ('asshol', 0.7311912775039673),
 ('bigot', 0.7156610488891602),
 ('racist', 0.7130438685417175),
 ('bullshit', 0.699094295501709),
 ('sexist', 0.6889074444770813),
 ('tumblr', 0.6874884963035583),
 ('retard', 0.6586898565292358)]

We can also see which word within a set of words doesn't belong:

In [78]:
model.wv.doesnt_match("ps4 xbox pc chess".split())


'chess'

In [79]:
model.wv.doesnt_match("mouse keyboard gamepad".split())


'mouse'

In [80]:
model.wv.doesnt_match("chick hot cosplay guy".split())

'hot'