# Reddit analysis

Keyword analysis of comment text that we previously collected from Reddit (see 03-reddit_text.ipynb)

Ling 583  
15 Feb 2018

In [1]:
import pandas as pd
import numpy as np
from cytoolz import concat
pd.set_option('display.max_rows', 10)

Load comments from r/machinelearning

In [2]:
# na_filter=False keeps empty or "NA"-like comment bodies as strings instead of NaN
df = pd.read_csv('machinelearning.csv', na_filter=False)
df

Unnamed: 0,author,body,score
0,fb-brian,Hi everyone! We'd like to share a simple metho...,2
1,snendroid-ai,ads saying hot machine learning events near yo...,6
2,shadowalf,The only other thing I can think of is if ther...,1
3,nicolasap,It's open source and you can find it [here](ht...,36
4,ps2fats,Who has books anyway,67
...,...,...,...
59042,dpineo,Start with CUDA. It has better a much better ...,2
59043,Optrode,What I was getting at is the problem of genera...,1
59044,Infidius,Most of our universe is limited in terms of no...,2
59045,hardmaru,It seems half of the dataset are images of art...,5


Find top 10 most prolific commenters

In [3]:
top10 = df.groupby('author')['author'].count().sort_values(ascending=False).head(10)
top10

author
visarga                542
ajmooch                435
NicolasGuacamole       428
alexmlamb              423
epicwisdom             413
darkconfidantislife    399
bbsome                 375
gwern                  346
olBaa                  270
phobrain               265
Name: author, dtype: int64
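
As a side note, the same ranking can be written more compactly with `value_counts`. The sketch below checks the two forms against each other on a made-up table; the `toy` frame and its author names are assumptions for illustration only, and on real data the order of authors with tied counts may differ between the two forms.

```python
import pandas as pd

# Hypothetical miniature of the comments table (authors are made up)
toy = pd.DataFrame({'author': ['a', 'b', 'a', 'c', 'a', 'b']})

# The groupby form used above
by_group = toy.groupby('author')['author'].count().sort_values(ascending=False).head(2)

# Shorter equivalent: value_counts already sorts by descending count
by_value_counts = toy['author'].value_counts().head(2)

print(by_group.to_dict())
print(by_value_counts.to_dict())
```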

Normalize and tokenize text

In [4]:
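# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
df['bow'] = df['body'].str.lower().str.replace(r'(\W|\d)+', ' ', regex=True).str.split()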
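
To see what this normalization does, here is a minimal sketch that applies the same pattern to a single string with the standard `re` module instead of pandas. The sample comment is made up for illustration. Note that `\W` does not match the underscore, so tokens such as `h_min` survive intact.

```python
import re

# Hypothetical sample comment (not taken from the dataset)
sample = "Batch-norm helped: h_min dropped 3.5% after 10 epochs!"

# Same pattern as above: every run of non-word characters or digits becomes one space
tokens = re.sub(r'(\W|\d)+', ' ', sample.lower()).split()
print(tokens)
```

Punctuation and numbers disappear entirely, so hyphenated terms are split into separate tokens.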

Find keywords (by PMI) for each of the top 10 commenters

In [5]:
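def keywords(words):
    """Print the 10 highest-PMI words for each top-10 commenter, using column `words` of df."""
    # Corpus-wide count of every word
    f = pd.DataFrame({'all': pd.Series(list(concat(df[words]))).value_counts()})
    for name in top10.index:
        # Counts within this user's comments; words the user never wrote become NaN
        f['user'] = pd.Series(list(concat(df[df['author'] == name][words]))).value_counts()
        # PMI = log2(P(word | user) / P(word))
        f['user_pmi'] = np.log2((f['user'] * np.sum(f['all'])) /
                                (f['all'] * np.sum(f['user'])))
        # Only rank words the user wrote more than 5 times
        print(name, ':', ', '.join(f['user_pmi'][f['user'] > 5]
                                       .sort_values(ascending=False)
                                       .head(10)
                                       .index))
        print()

keywords('bow')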

visarga : starspace, simulators, relational, relations, eating, automata, fei, simulator, simulation, voices

ajmooch : ajbrock, freezeout, batchrenorm, anneal, smashv, convs, bw, hypernet, crop, hyperparams

NicolasGuacamole : generalisation, receptive, lasagne, learnmachinelearning, dilated, boosting, boundary, surely, semantic, insight

alexmlamb : authorship, autoregressive, ali, alex, alignment, anonymous, epsilon, professors, forcing, icml

epicwisdom : liberties, ontology, gmail, unethical, accused, sexism, likewise, emails, consciousness, conscious

darkconfidantislife : ding, gabor, processors, googlenet, nm, processor, analog, chip, movement, convnets

bbsome : gn, martens, h_min, hessian, ir, hf, llvm, nevertheless, kfac, additionally

gwern : scott_e_reed, fredkin, webvision, kanal, danbooru, gwern, gaydar, tank, multilevel, mdps

olBaa : qualia, adult, baby, intelligent, consciousness, defining, room, chinese, vec, turing

phobrain : raf, skot, phobrain, gallery, ions, his
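
The score used in `keywords` is pointwise mutual information, PMI(w, u) = log2(P(w | u) / P(w)), with both probabilities estimated from raw counts. The following is a minimal sketch on made-up counts using only the standard library; the toy token lists and the helper name `pmi` are assumptions for illustration and are not part of the notebook.

```python
from collections import Counter
from math import log2

# Hypothetical token lists: the whole corpus, and the subset written by one user
all_tokens = ['model', 'model', 'model', 'data', 'data', 'gan', 'gan', 'the']
user_tokens = ['gan', 'gan', 'model', 'the']

all_counts = Counter(all_tokens)
user_counts = Counter(user_tokens)
n_all = sum(all_counts.values())
n_user = sum(user_counts.values())

def pmi(word):
    """log2 of P(word | user) / P(word), both estimated from raw counts."""
    return log2((user_counts[word] * n_all) / (all_counts[word] * n_user))

for word in user_counts:
    print(word, round(pmi(word), 3))
```

In this toy corpus, "the" appears only once yet scores as high as "gan", which suggests why a minimum-count filter such as `f['user'] > 5` is useful: PMI estimates for rare words are noisy.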

------

## spaCy

Now let's repeat that, but using the [spaCy](https://spacy.io/) tagger to find just nouns and then just verbs

In [6]:
import spacy
# 'en' is a spaCy 2 shortcut link; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en', disable=['parser', 'ner'])

Create a [Doc](https://spacy.io/api/doc) for each comment

In [7]:
df['doc'] = list(nlp.pipe(df['body']))

Now we do the same thing as we did above, but this time our bag of words consists of the lemma forms of the nouns only

In [8]:
def get_nouns(doc):
    return [tok.lemma_ for tok in doc if tok.pos_ == 'NOUN']

df['noun_bow'] = df['doc'].apply(get_nouns)

keywords('noun_bow')

visarga : simulator, consciousness, simulation, perception, relation, reasoning, robot, graph, processor, phrase

ajmooch : convs, celeba, hypernet, guard, crop, sweep, speedup, densenet, hyperparam, batchnorm

NicolasGuacamole : generalisation, heatmap, ensemble, being, estimation, room, luck, segmentation, insight, kernel

alexmlamb : authorship, gaussian, professor, angle, chain, sampling, arxiv, connection, pass, credit

epicwisdom : ontology, liberty, smartphone, sexism, consciousness, email, somebody, privacy, baby, distinction

darkconfidantislife : processor, analog, chip, movement, being, imagenet, convnet, multiplication, array, precision

bbsome : k[1, k[0, k[3, ps, autodiff, curvature, expansion, theta, delta, controller

gwern : gaydar, tank, mdp, photograph, metadata, anime, stochasticity, glass, tag, causality

olBaa : word2vec, baby, consciousness, room, projection, intelligence, definition, graph, dog, reviewer

phobrain : ion, dna, pic, histogram, cookie, photo, taste

And the same for verbs

In [9]:
def get_verbs(doc):
    return [tok.lemma_ for tok in doc if tok.pos_ == 'VERB']

df['verb_bow'] = df['doc'].apply(get_verbs)

keywords('verb_bow')

visarga : graph, compose, reuse, eat, operate, collect, generalize, search, extract, select

ajmooch : anneal, scoop, dilate, freeze, roll, steal, concatenate, investigate, validate, pursue

NicolasGuacamole : learnmachinelearn, dilate, ’re, boost, ’, drop, connect, agree, increase, say

alexmlamb : inject, suspect, list, force, supervise, cite, connect, pick, push, regard

epicwisdom : accuse, recall, count, recognize, die, disagree, justify, doubt, report, imagine

darkconfidantislife : float, pass, hold, wait, suppose, compute, reduce, recommend, help, move

bbsome : note, reuse, express, imply, care, suggest, claim, intend, agree, bind

gwern : tag, infer, split, claim, demonstrate, supervise, list, discover, classify, notice

olBaa : embed, define, gues, achieve, love, solve, believe, apply, happen, seem

phobrain : eat, map, edit, match, explore, forget, figure, propose, could, develop

