# Conspiracy Theories

A sample of texts from `r/conspiracy`

In [1]:
import pandas as pd
import numpy as np
from cytoolz import *

pd.set_option('display.max_colwidth', 200)

In [2]:
from sklearn.feature_extraction.text import *
from sklearn.feature_extraction import *
from sklearn.decomposition import *
from sklearn.cluster import *
from sklearn.metrics import *

In [3]:
# Note: pd.read_msgpack was removed in pandas 1.0, so this cell needs an older pandas
df = pd.read_msgpack('http://bulba.sdsu.edu/conspiracy.dat')
df['tokens'] = df['tokens'].apply(list)

## Make document-term matrix

In [6]:
X = TfidfVectorizer(analyzer=identity, min_df=3, max_df=0.25, norm='l2', use_idf=True) \
        .fit_transform(df['tokens'])
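
To make the weighting concrete, here is a minimal pure-Python sketch of what a TF-IDF matrix over pre-tokenized texts looks like. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; `tfidf_rows` and the toy `docs` are hypothetical names used only for illustration, not part of the notebook's pipeline.

```python
import math
from collections import Counter

def tfidf_rows(docs, min_df=1, max_df=1.0):
    """Return one {term: weight} dict per pre-tokenized document (illustrative only)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Keep terms found in at least min_df documents and at most max_df * n documents.
    vocab = {t for t, d in doc_freq.items() if min_df <= d <= max_df * n}
    # Smoothed inverse document frequency (assumed to match sklearn's default).
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in vocab)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each row so every non-empty document has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy corpus (made up for this sketch).
docs = [
    ["moon", "landing", "fake"],
    ["moon", "rocks", "fake"],
    ["vaccine", "fake"],
    ["moon", "vaccine", "fake"],
]
rows = tfidf_rows(docs, min_df=2, max_df=0.75)
for row in rows:
    print(row)
```

In this toy corpus `fake` appears in every text, so `max_df` drops it, while `landing` and `rocks` appear only once, so `min_df` drops them. Of the remaining terms, the rarer `vaccine` gets more weight than `moon` in the last text.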

## K Means

Use the **k-means** algorithm to group the texts into 25 clusters and compute **silhouette** coefficients:

In [7]:
kmeans = KMeans(25).fit(X)
df['cluster'] = kmeans.labels_
df['silhouette'] = silhouette_samples(X, df['cluster'])

Silhouette scores compare the distances among texts within a cluster to the distances to texts in other clusters. Scores range from -1 to 1, and a 'good' cluster should have a large score:

In [15]:
df.groupby('cluster')['silhouette'].mean().sort_values()

cluster
12   -0.010933
8    -0.004242
16   -0.003312
9    -0.001044
2     0.001081
22    0.003083
13    0.007372
0     0.008634
23    0.009156
4     0.010539
21    0.011745
7     0.012702
5     0.013449
6     0.014152
17    0.016739
18    0.024096
14    0.027015
19    0.028172
15    0.030637
20    0.036436
24    0.048049
3     0.055859
10    0.064804
1     0.064842
11    0.069698
Name: silhouette, dtype: float64
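
For reference, the silhouette score of a single text can be worked out by hand. If `a` is its mean distance to the other texts in its own cluster and `b` is its mean distance to the texts in the nearest other cluster, the score is `(b - a) / max(a, b)`. The sketch below is a plain-Python illustration on made-up 1-D points, assuming Euclidean distance and at least two clusters; `silhouette_samples_sketch` is a hypothetical helper, not the scikit-learn function.

```python
import math

def silhouette_samples_sketch(points, labels):
    """Per-point silhouette scores with Euclidean distance (illustrative only)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        mean = {lab: sum(d) / len(d) for lab, d in by_cluster.items()}
        if own not in mean:
            # A cluster with a single member gets a score of 0.
            scores.append(0.0)
            continue
        a = mean[own]                                        # cohesion
        b = min(m for lab, m in mean.items() if lab != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

# Toy data: the point at 9.0 is deliberately put in the "wrong" cluster.
points = [(0.0,), (1.0,), (9.0,), (10.0,), (11.0,)]
labels = [0, 0, 0, 1, 1]
scores = silhouette_samples_sketch(points, labels)
for p, lab, s in zip(points, labels, scores):
    print(p, lab, round(s, 3))
```

The misassigned point gets a negative score because it is closer, on average, to the other cluster than to its own.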

The number of texts in a cluster is also instructive. Interesting clusters are usually medium-sized. Clusters with only a few texts are often picking up noise, and clusters with a very large number of texts are probably incoherent.

In [17]:
df.groupby('cluster')['text'].count()

cluster
0      179
1      163
2      318
3      261
4      269
5      240
6      135
7      193
8      365
9     1058
10     135
11      54
12    3991
13     229
14     118
15     114
16      19
17      26
18     459
19     348
20     302
21     297
22     281
23     384
24      62
Name: text, dtype: int64

## Keywords

To get some insight into what a text cluster represents, we can find its keywords using PMI:

In [20]:
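def keywords(cluster, n=10):
    # Token counts over the whole corpus and within the given cluster
    f = pd.DataFrame({'all': pd.value_counts(list(concat(df['tokens'])))})
    f['cl'] = pd.value_counts(list(concat(df[df['cluster']==cluster]['tokens'])))
    # PMI = log2( P(word | cluster) / P(word) )
    f['pmi'] = np.log2( (f['cl'] * np.sum(f['all'])) /
                        (f['all'] * np.sum(f['cl'])) )
    # Skip rare words (25 or fewer occurrences overall), then take the n highest-PMI words
    return list(f['pmi'][f['all']>25].sort_values(ascending=False)[:n].index)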
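
The PMI of a word for a cluster is `log2(P(word | cluster) / P(word))`: how much more often the word shows up inside the cluster than in the corpus as a whole. Here is a small standard-library sketch of the same calculation on a made-up corpus; `pmi_keywords`, `docs`, and `labels` are hypothetical names for illustration only.

```python
import math
from collections import Counter

def pmi_keywords(docs, labels, cluster, n=3, min_count=0):
    """Return the n (word, PMI) pairs with the highest PMI for one cluster."""
    # Token counts over all documents and over the chosen cluster only.
    all_counts = Counter(t for doc in docs for t in doc)
    cl_counts = Counter(
        t for doc, lab in zip(docs, labels) if lab == cluster for t in doc
    )
    total_all = sum(all_counts.values())
    total_cl = sum(cl_counts.values())
    # PMI = log2( P(word | cluster) / P(word) ), skipping words that are too rare overall.
    pmi = {
        t: math.log2((c / total_cl) / (all_counts[t] / total_all))
        for t, c in cl_counts.items()
        if all_counts[t] > min_count
    }
    return sorted(pmi.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy corpus (made up for this sketch) with two clusters.
docs = [
    ["moon", "landing", "fake"],
    ["moon", "fake"],
    ["vaccine", "fake"],
    ["vaccine", "autism"],
]
labels = [0, 0, 1, 1]
top = pmi_keywords(docs, labels, cluster=0)
for word, score in top:
    print(word, round(score, 3))
```

Words that occur only inside the cluster get the highest possible PMI no matter how rare they are, which is why a minimum overall frequency (25 in `keywords`) helps keep one-off tokens out of the list.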


In [21]:
keywords(11)

['lander',
 'astronauts',
 'lunar',
 'moon',
 'module',
 'landings',
 'moonlight',
 'Moon',
 'manned',
 'missions']

Looks like cluster 11 has something to do with moon landings, but it's hard to tell what they're saying from keywords alone. So we can also find some representative texts that are close to the center of the cluster:

In [24]:
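# Distance from every text to every cluster center; column 11 is cluster 11
dist = kmeans.transform(X)
# Positional lookup of the 10 texts nearest that center
df['text'].iloc[dist[:,11].argsort()[:10]]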

8021    We're supposed to believe U.S. suddenly lost interest in moon missions in 1972? Did you ever stop to consider this? From 1968 to 1972, eight of nine Project Apollo missions took Americans out of L...
3001    Have humans ever actually been to the moon? I think its obvious to most people that we clearly didn't go in 1969 but i wanted to know what you guys thought, have we ever actually been? Edit: if yo...
3671    Archive of nearly 20,000 faked moon landing pictures. (huge torrent file of proof!) 12GB Archive of faked moon landing pictures with 1 gbps+ seeds.     This archive contains almost 20,000 fake...
2943    Did you know the moon landing was fake !? Whaaaaa wow mind blown... wait where have you been? So your saying we really didn't go to the moon?.... But, where did all of those moon rocks I bought fr...
1778    Another clear proof of the flat earth, clouds moving BEHIND the moon. haha fucking really? I guess planes have to be careful not to run into the moon right? wel

The moon landings were faked?!  
-----
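
The lookup used above boils down to sorting the texts by their distance to one cluster center and keeping the first few. A minimal sketch of that idea, assuming Euclidean distance and made-up 2-D vectors (`closest_to_center` is a hypothetical helper, not a scikit-learn function):

```python
import math

def closest_to_center(vectors, center, n=2):
    """Return the positions of the n vectors nearest to center, closest first."""
    dists = [math.dist(v, center) for v in vectors]
    # Sort row positions by distance, like argsort()[:n] on one column of distances.
    return sorted(range(len(vectors)), key=dists.__getitem__)[:n]

# Toy vectors and a toy cluster center (made up for this sketch).
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
center = (1.0, 1.0)
nearest = closest_to_center(vectors, center)
print(nearest)
```

Note that this ranks every row, whatever its cluster label, so a text assigned to a different cluster could in principle show up if it also sits near this center.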
Try the same thing for cluster #1:


In [25]:
keywords(1)

['thermitic',
 'Bazant',
 'CIT',
 'girders',
 'Acknowledge',
 '7500',
 'columns',
 '2.25',
 'angled',
 'shear']

In [26]:
df['text'].iloc[dist[:,1].argsort()[:10]]

5702    9/11 WTC Towers Had Power Turned Off For 36 Hours the Weekend Before the Attack - Security Systems disabled, unknown "workers" everywhere. I'm just upvoting this because it's not directly politics...
2780    Top Down Collapse - no explosives required (Jref thread)  from 9/11 blogger. Its long been claimed (by truthers) that the top portion of the WTC couldn't have brought down the building, or that t...
1449    Only six corporations own all mainstream media in the United States It's getting smaller too. And almost nobody reports on the deals that these corporations make. remember way back when, in '96, w...
874     Intelligence Officer: Every Single Terrorist Attack In U.S Was a False Flag Attack ... Or Egged On By the Government Washington's Blog : /r/news Is Censoring Reddit Again! Damn those sub rules tha...
7929    Russia Is Bombing Ambulances in Syria Do US media companies receive any capitol for agreeing to work with the propaganda folks or are they just beaten into subm

And cluster 10:

In [27]:
keywords(10)

['HPV',
 'thimerosal',
 'pertussis',
 'Measles',
 'polio',
 'unvaccinated',
 'cervical',
 'Polio',
 'immunization',
 'SIDS']

In [28]:
df['text'].iloc[dist[:,10].argsort()[:10]]

5942    What is r/conspiracy take on vaccinations? There's a post on the front page stating that Italy is on its way to making vaccinations a legal requirement? Is there any reliable sources proving that ...
4299    Vaccine Maker Admits on FDA Website That DTaP Vaccine Causes Autism Internal communications admit that bad batches of the same vaccine are the cause of Crib Death, their solution was to split the ...
4197    Why I am against mandatory vaccinations, and any government mandated medical procedures. Here are 50 unethical medical experiments conducted by the United States since the start of the 20th centur...
3898    Robert F. Kennedy Jr: “All the things that I do are bent on forcing this vaccine debate out into the open—because once it is, the CDC’s position is so fragile, it’s an edifice of fraud, fraud stac...
4636    Being pro-vaccine is essentially being a member of a religious cult, the type who self-harm or take part in mass suicides. The worst part is that they also beli

In [5]:
keywords(11), keywords(1), keywords(10)