# Topic modelling and word2vec analysis of /r/gaming user comments

This notebook demonstrates text analysis of one year of comments on /r/gaming, using LDA topic modelling and word2vec.

This notebook doesn't analyse the entire dataset. It analyses only the comments that include at least one of an arbitrarily selected set of terms relating to females, e.g., "woman", "women", "she", "girl", and so on. The total number of comments in this subset is 69502.
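
As a rough illustration of this kind of keyword-based selection, the sketch below keeps only the comments that contain at least one term from a small list. The term list and the helper name `mentions_female_term` are assumptions made for the example; the full set of terms used to build the subset is not given in this notebook.

```python
import re

# Hypothetical term list: only these four terms are named in the text above.
# Plural or inflected forms (e.g. "girls") would need to be listed separately.
FEMALE_TERMS = ["woman", "women", "she", "girl"]

# \b word boundaries stop "she" from matching inside "shell" or "ashes".
TERM_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, FEMALE_TERMS)) + r")\b",
    re.IGNORECASE,
)

def mentions_female_term(comment):
    """Return True if the comment contains at least one of the terms as a whole word."""
    return TERM_PATTERN.search(comment) is not None

example_comments = [
    "She destroyed everyone at Bridge.",
    "The shell casing landed on the floor.",
    "Two women were streaming the game.",
]
subset = [c for c in example_comments if mentions_female_term(c)]
print(len(subset))
```

Matching whole words keeps false positives down, at the cost of missing forms that are not in the list.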

In [1]:
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
import nltk

import os
from time import time
from collections import defaultdict

import numpy as np
import pandas as pd

import gensim
from gensim import corpora, models, utils
from gensim.corpora.dictionary import Dictionary
from gensim.models.wrappers.dtmmodel import DtmModel

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

import matplotlib
import matplotlib.pyplot as plt

from dateutil import parser
import datetime

import random

%matplotlib inline

In [2]:
import csv

#with open('REDDIT_DATASET_ONE_YEAR_COMMENTS.csv', 'r') as f:
#  reader = csv.reader(f)
#  your_list = list(reader)

# with open('REDDIT_DATASET_ONE_YEAR_COMMENTS_FEMALE_RELATED_COMMENTS.csv', 'r') as f:
#   reader = csv.reader(f)
#   your_list = list(reader)

with open('REDDIT_DATASET_ONE_YEAR_COMMENTS_FEMALE_RELATED_COMMENTS_commentsonly.csv', 'r') as f:
  reader = csv.reader(f)
  your_list = list(reader)
    
your_list.pop(0) # remove the first row (['x']), a header that R adds when exporting

['x']

In [3]:
print(your_list[:5]) # first 5 elements

[["I've never read the book, but in the movie The Time Machine, he builds a time machine to save his wife who died tragically years ago. But every time he tries he fails, and he realizes it's futile. If she never died he'd never have built it in the first place, so his being there to save her would be impossible."], ['Ahhh bro my grandma destroyed in Bridge.  Her and all her old lady partners use to drink and curse all night playing that.  '], ["Goddamnit, I was a walking junkyard/arsenal trying to amass enough crap to sell to get the 30k caps I needed for the clinics... but whatever I got it and it's over. I had to send piper away, she refused to work anywhere and it was sticking it at 98."], ['Not very good.\n\nBut she must have boobs, so to the frontpage with you!\n\n'], ['You left out making all the characters hyper sexualized underaged girls wearing bikini armor.']]


In [4]:
# len(your_list)

In [5]:
import itertools
merged_list = list(itertools.chain(*your_list))

In [6]:
# import random
# merged_list = merged_list[1:10000] # we create a small toy dataset to work with 
# merged_list = random.sample(merged_list, 100000)
# print(merged_list[:5]) # view first 5 elements

In [7]:
len(merged_list)

89924

In [8]:
comments_text = [comment for comment in merged_list]
comments_text_original = [comment for comment in merged_list]

In [9]:
# do word stemming?
from stemming.porter2 import stem

comments_text = [[stem(word) for word in sentence.split(" ")] for sentence in comments_text]
comments_text = [" ".join(sentence) for sentence in comments_text]

In [447]:
# print(comments_text[:5])

## Generate topic model

In this section we generate the topic model. There are some particular parameters that have an important role in shaping the outcome of the model. Changing these will affect the output in various ways. 

I have tested quite a few different combinations and the current output seems to give reasonable results (although we can certainly do better).

- `num_topics = 30`: we specify that we want to generate 30 topics from our data
- word stemming: we use a word stemmer to reduce some words to their roots (e.g., "walked" and "walking" both become "walk")
- n-grams: the current model uses single terms only (e.g., "walked"); a variant that also finds bi-grams or 2-grams (e.g., "she walked") is left commented out in the code
- stop words: we remove common English words that are not useful for analysis (e.g., "the", "and")
- `max_df`: we ignore terms that have a document frequency strictly higher than the given threshold
- `min_df`: we ignore terms that have a document frequency strictly lower than the given threshold
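
To make the two frequency thresholds concrete, here is a pure-Python sketch of document-frequency filtering on a toy corpus. It follows the behaviour described above for proportional thresholds, but it is an illustration rather than scikit-learn's own implementation, and the function name `filter_by_document_frequency` is made up for the example.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document frequency (as a proportion of docs) lies in [min_df, max_df]."""
    n_docs = len(docs)
    # Count each term once per document, regardless of how often it repeats there.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )

toy_docs = [
    "she plays the game",
    "she streams the game",
    "she likes chess",
    "she plays overwatch",
]
# "she" appears in 4/4 documents (too common for max_df=0.75);
# "streams", "likes", "chess" and "overwatch" appear in 1/4 (too rare for min_df=0.5).
vocabulary = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.75)
print(vocabulary)
```

No stop word removal is applied in this sketch, so a word like "the" survives if its frequency falls inside the band.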


In [10]:
# for good overview of these values see: http://stackoverflow.com/a/35615151/2589495
max_df = 0.03 # ignore terms that have a document frequency strictly higher than x% of docs (or occurs in > n docs if integer value)
min_df = 0.0007 # ignore terms that have a document frequency strictly less than x% of documents (or occurs in < n documents if integer value)
# Some good values found so far...
# max_df = 0.05 (smaller values of the same order of magnitude are also good, e.g. 0.03)
# min_df = 2
# k topics = 50
# passes = 3

# bi-grams:
# tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df,ngram_range=(1,2),stop_words='english',analyzer='word')
# unigrams 
tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df,stop_words='english')
tf = tf_vectorizer.fit_transform(comments_text)

In [11]:
# dictionary.filter_tokens(bad_ids=low_value_words)
# corpus = [dictionary.doc2bow(doc) for doc in texts]
corpus = gensim.matutils.Sparse2Corpus(tf.T)
n_terms, n_docs = corpus.sparse.shape
id2word = {i:word for i, word in enumerate(tf_vectorizer.get_feature_names())}
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)

In [12]:
# generate LDA model
number_topics = 30
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=number_topics, id2word = dictionary, passes=100,update_every=1, chunksize=10000)

In [13]:
# ldamodel.print_topics(-1)

## Visualisation of topics (by term/word)

This is an interactive visualisation of the topic model. The right-hand side shows which terms belong to each topic and how important they are for that topic (in order of importance). 

The left-hand Intertopic Distance Map gives an indication of how closely related the topics are (projected into a two-dimensional space). It may or may not be that helpful analytically, but is included anyway. 
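
For intuition about what "distance" between topics can mean, the sketch below computes the Jensen-Shannon divergence between topic-term distributions. As far as I know this is the measure pyLDAvis uses by default before projecting the topics into two dimensions. The three toy distributions are invented for the example.

```python
from math import log2

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and between 0 and 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-term distributions over a four-word vocabulary (each sums to 1).
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.60, 0.30, 0.05, 0.05]
topic_c = [0.05, 0.05, 0.20, 0.70]

print(js_divergence(topic_a, topic_b))  # small: the topics share their main terms
print(js_divergence(topic_a, topic_c))  # larger: the topics emphasise different terms
```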

In [14]:
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary,sort_topics=False)
pyLDAvis.save_html(vis, 'lda_FINAL_MODEL_FOR_PAPER.html')


In [15]:
pyLDAvis.display(vis)


In [16]:
docTopicProbMat = ldamodel[corpus]


In [17]:
docTopicProbMat_list = list(docTopicProbMat)


In [18]:
# print(docTopicProbMat_list[7]) # get the topic probabilities for a document
# print(comments_text[7]) # and the document text


In [19]:
comment_probabilities = []
for doc in docTopicProbMat_list:
    docTopic = 999 # reset the topic
    bestMatch = 0 # reset the best match
    possibleMatch = 0 # reset the possible match
    for topic in doc:
        possibleMatch = topic[1]
        if possibleMatch > bestMatch:
                bestMatch = topic[1]
                if bestMatch > 0.6: # minimum threshold for comment topic relatedness
                    docTopic = topic[0]
                # print("bestMatch is now: ",bestMatch)
                # print("docTopic is now: ",docTopic)
    # print("best topic was: ",docTopic)
    comment_probabilities.append(docTopic)


In [20]:
# comment_probabilities[0:10]


In [21]:
comment_probabilities = [probs + 1 for probs in comment_probabilities] # now when we ask for topic n we actually get topic n (not topic n-1)
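
The assignment logic in the cells above can also be written as one small function. This is a sketch under the same assumptions: each document is a list of `(topic_id, probability)` pairs with 0-based topic ids, a topic is only assigned if its probability exceeds 0.6, and ids are shifted to be 1-based. The function name `dominant_topic` is hypothetical, and the default `unassigned=1000` matches what the cells above produce for unassigned comments (the 999 placeholder plus one).

```python
def dominant_topic(doc_topics, threshold=0.6, unassigned=1000):
    """Return the 1-based id of the most probable topic, or `unassigned`.

    doc_topics is a list of (topic_id, probability) pairs with 0-based ids.
    Because probabilities sum to at most 1, no more than one topic can
    exceed a threshold of 0.5 or higher, so checking the maximum is enough.
    """
    if not doc_topics:
        return unassigned
    topic_id, probability = max(doc_topics, key=lambda pair: pair[1])
    return topic_id + 1 if probability > threshold else unassigned

print(dominant_topic([(0, 0.10), (4, 0.75), (7, 0.15)]))  # 5
print(dominant_topic([(2, 0.50), (3, 0.50)]))             # 1000
```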


In [22]:
# print(comments_text[8])
# comment_probabilities[8]


In [23]:
comments_and_topics = pd.DataFrame(
    {'original_comment_text': comments_text_original,
     'comment_text': comments_text,
     'topic': comment_probabilities
    })


In [24]:
# comments_and_topics


In [25]:
# a function to get n random example comments assigned to a given topic k
def getTopicExamples(topic_id, num_examples):
    topic_comments = list(comments_and_topics['original_comment_text'][comments_and_topics.topic == topic_id])
    # random.sample raises ValueError if asked for more items than exist, so cap the sample size
    num_examples = min(num_examples, len(topic_comments))
    return random.sample(topic_comments, num_examples)


## Extract 50 comment examples for each topic

In this section we extract 50 (randomly selected) text comments that belong to each topic. 

This provides some examples of comments that belong to each topic. The topics are ordered from 1 to k (where k is the number of topics we specified for the model). 

You can use these comment examples to help make sense of the topic terms in the above visualisation, in other words "what is going on" for each topic.

In [26]:
for k in range(1,number_topics+1):
    topic_example_comments = getTopicExamples(k,50)
    print("\n____________________ TOPIC",k,"______________________________\n")
    print(*topic_example_comments, sep='\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - \n')



____________________ TOPIC 1 ______________________________

What shows on your screen when she has the headset on? Can you at least watch what she doing?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Keyboard and mouse not controller!!!!!! What are you teaching her!!!!?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
I believe so. That's probably why she has what looks like irradiated hands. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Ohhhhh crap. Sorry, I thought this was r/armoredwomen. My bad.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
now just get her to put down the controller and play with a keyboard.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
"My salsa makes all the pretty girls want to dance and take off their underpants" 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
That's good. Been trying to introduce her to some things but she has no mind for it, no imagination she says. She'd rather unwind by sitti

In [27]:
for k in range(1,number_topics+1):
    topic_example_comments = getTopicExamples(k,50)
    print("\n____________________ TOPIC",k,"______________________________\n")
    print(*topic_example_comments, sep='\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - \n')



____________________ TOPIC 1 ______________________________

"My salsa makes all the pretty girls want to dance and take off their underpants" 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
War. U lay on the floor and she blows u all to hell
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Can we play with your girlfriend instead?

... crap, I'm "that guy" now...
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Just wait til she wants to do a wax injection of her own
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
I never realized that the hand on his left side of his crotch is a womans
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
&gt;  She could always leave the show if she wanted it bad enough.

Another woman leaving Alex would be devastating. 😪
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
This will backfire when she realizes you were holding out the entire time. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - 
I l

In [28]:
with open('./example_comments_numComments50_numTopics30_LDA_MODEL_A.txt', 'w') as f1:
    for k in range(1,number_topics+1):
        topic_example_comments = getTopicExamples(k,50)
        # file.write() accepts a single string, so use print(..., file=f1) to keep the same layout as the printed output
        print("\n____________________ TOPIC",k,"______________________________\n", file=f1)
        print(*topic_example_comments, sep='\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - \n', file=f1)

In [386]:
# theta, _ = ldamodel.inference(corpus) # we can extract the thetas directly from the corpus

In [387]:
# theta[8] # columns are indexed from 0, so topic n is topic n+1 (e.g. topic22 is topic23)

## Word2vec analysis of text

In this section we do some exploratory text analysis using word2vec, which learns a vector representation for each word from the contexts it appears in.

This is very experimental... just some computational doodling.

In [22]:
sentences = [comment for comment in merged_list]
sentences = [[stem(word) for word in sentence.split(" ")] for sentence in sentences]

In [23]:
# import nltk
# nltk.download('stopwords')

In [24]:
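from nltk.corpus import stopwords
# sentences is a list of token lists, so filter the words inside each sentence (and build the stop word set only once)
english_stop_words = set(stopwords.words('english'))
filtered_words = [[word for word in sentence if word.lower() not in english_stop_words] for sentence in sentences]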

In [25]:
# print(filtered_words[2]) # sanity check the data first

In [26]:
# convert to lowercase
filtered_lower_words = []
for doc in filtered_words:
    filtered_lower_words.append([word.lower() for word in doc])

In [27]:
# remove punctuation
import string
filtered_lower_words_nopunct = []
for doc in filtered_lower_words:
    filtered_lower_words_nopunct.append([''.join(c for c in s if c not in string.punctuation) for s in doc])

In [28]:
# remove empty words due to previous steps 
filtered_lower_words_nopunct_nospaces = []
for doc in filtered_lower_words_nopunct:
    filtered_lower_words_nopunct_nospaces.append([s for s in doc if s])

In [29]:
# print(filtered_lower_words_nopunct_nospaces[2]) # sanity check the data quickly

In [31]:
from gensim.models import Word2Vec
min_count = 50
size = 120
window = 7
workers = 6
 
# note: gensim >= 4.0 renames the `size` argument to `vector_size`
model = Word2Vec(filtered_lower_words_nopunct_nospaces, min_count=min_count, size=size, window=window, workers=workers)


Let's try the classic word2vec example:

King - man + woman = ?
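
To show what this arithmetic means, here is a toy sketch with invented 3-dimensional vectors: add and subtract the word vectors, then rank the remaining words by cosine similarity to the result. The vectors and the helper name `analogy` are assumptions for illustration only. gensim's `most_similar` also unit-normalises the input vectors before combining them, so this shows the idea rather than the exact computation.

```python
from math import sqrt

# Invented 3-dimensional "embeddings"; real word2vec vectors are learned from data.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "game":  [0.2, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative), skipping the query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    query_words = set(positive) | set(negative)
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w not in query_words]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(ranking[0][0])  # queen
```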

In [32]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.6162382960319519),
 ('jedi', 0.5811025500297546),
 ('witch', 0.5636150240898132),
 ('empire', 0.5580927133560181),
 ('princ', 0.5526511669158936),
 ('spirit', 0.5511578321456909),
 ('sea', 0.55086749792099),
 ('goddess', 0.5493709444999695),
 ('investig', 0.5470453500747681),
 ('emperor', 0.542177677154541)]

Using this kind of analysis we can ask other questions more specific to the paper:
    
Gamer - men + women = ?

(What is 'gamer' without taking into account 'men', but taking into account 'women'?)

In [33]:
model.wv.most_similar(positive=['women', 'gamer'], negative=['men'])

[('gamers', 0.5871630907058716),
 ('hardcor', 0.5561313033103943),
 ('nerd', 0.5529419183731079),
 ('gaming', 0.5297905206680298),
 ('onlin', 0.5270678997039795),
 ('hobby', 0.5195361375808716),
 ('streamer', 0.5038512945175171),
 ('casual', 0.49965327978134155),
 ('communiti', 0.48689004778862),
 ('subreddit', 0.46447694301605225)]

And we can also do the converse for men:

Gamer - women + men = ?

(What is 'gamer' without taking into account women, but taking into account men?)

In [34]:
model.wv.most_similar(positive=['men', 'gamer'], negative=['women'])

[('gamers', 0.6726009845733643),
 ('streamer', 0.5174188613891602),
 ('gaming', 0.4663570523262024),
 ('harass', 0.4317512512207031),
 ('males', 0.4124179482460022),
 ('date', 0.3997316360473633),
 ('cosplayers', 0.3903079330921173),
 ('nerd', 0.37952059507369995),
 ('mmos', 0.37874627113342285),
 ('hobby', 0.37795865535736084)]

In [35]:
model.wv.most_similar("angry")

[('salti', 0.5569661259651184),
 ('annoying', 0.5362046957015991),
 ('shitty', 0.5361368060112),
 ('crazy', 0.5262633562088013),
 ('pissed', 0.5186154842376709),
 ('uncomfortable', 0.5078614354133606),
 ('funny', 0.5061267614364624),
 ('creepy', 0.5058543086051941),
 ('sick', 0.5024175643920898),
 ('excited', 0.4922211766242981)]

In [74]:
model.wv.most_similar("game")

[('games', 0.8080633878707886),
 ('seri', 0.5676897168159485),
 ('gameplay', 0.5609169006347656),
 ('gaming', 0.5339395403862),
 ('minecraft', 0.5088506937026978),
 ('franchis', 0.506004810333252),
 ('titl', 0.5019721984863281),
 ('rpgs', 0.500690221786499),
 ('videogam', 0.4946770668029785),
 ('series', 0.49227070808410645)]

In [36]:
model.wv.most_similar(["sjw"])

[('feminist', 0.7510287761688232),
 ('idiot', 0.7500626444816589),
 ('sjws', 0.711581826210022),
 ('troll', 0.658967137336731),
 ('retard', 0.6348732709884644),
 ('agenda', 0.6293011903762817),
 ('asshol', 0.6220519542694092),
 ('outrag', 0.6180007457733154),
 ('tumblr', 0.6046572327613831),
 ('sexist', 0.5956771969795227)]

We can also see which word within a set of words doesn't belong:
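
A minimal sketch of the idea behind this kind of query, assuming it works roughly as gensim's `doesnt_match` does: scale each word vector to unit length, average them, and report the word that is least similar to that average. The 2-dimensional vectors and the helper name `odd_one_out` are invented for the example.

```python
from math import sqrt

def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean of all the unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    # For unit vectors the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Invented vectors: the three platforms point in a similar direction, chess does not.
toy_vectors = {
    "ps4":   [0.9, 0.1],
    "xbox":  [0.8, 0.2],
    "pc":    [0.7, 0.3],
    "chess": [0.1, 0.9],
}
print(odd_one_out(["ps4", "xbox", "pc", "chess"], toy_vectors))  # chess
```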

In [37]:
model.wv.doesnt_match("ps4 xbox pc chess".split())


'chess'

Can the model differentiate female vs male characters within a game (Overwatch)?

In [53]:
model.wv.doesnt_match("tracer genji widowmaker".split())


'genji'

In [39]:
model.wv.doesnt_match("chick hot cosplay guy".split())

'guy'