# Practical assignment

In this assignment you will analyse user comments from the website [reddit.com](http://www.reddit.com). Reddit users can post content (e.g., a website, a question, news), which can be up- or downvoted. Posts with many upvotes tend to appear at the top of their category or on the front page of Reddit. The website is quite popular and has over half a billion monthly visitors. At times, appearing on the front page of Reddit generates so much traffic to the posted website that it crashes.

The community is organised in various subreddits, such as news, movies, music, et cetera. You will analyse user comments from the [politics subreddit](https://www.reddit.com/r/politics/). These user comments are either replies to the starting post, or replies to other users’ comments. The latter will be the basis for the communication network that you will construct here.

First, let us get started with the data.


**Data**

If you have not done so already, download all data from https://storage.googleapis.com/css-files/reddit_discussion_network_2016_10.csv. This file is 377MB, so it may take some time to download. If you have trouble working with this dataset on your computer, please try the alternative: https://storage.googleapis.com/css-files/reddit_discussion_network_2015_02.csv, which is only 46MB.

**Importing libraries**

In [1]:
import random

import igraph as ig
import nltk
import gensim
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly; scipy.stats is used below

**Reading in data**

First, read in the Reddit data:

In [2]:
file_name = 'reddit_discussion_network_2016_10.csv'
df = pd.read_csv('../../../../data/' + file_name)

Which columns does this dataset have?

In [3]:
print(df.columns)

Index([u'comment', u'a_score', u'a_created_utc', u'a_retrieved_on',
       u'comment_id', u'comment_reply_to_id', u'author_from',
       u'author_reply_to'],
      dtype='object')


The first post:

In [4]:
print(df.head(1))

                                             comment  a_score  a_created_utc  \
0  You think these women were sitting out there m...        0     1476397749   

   a_retrieved_on comment_id comment_reply_to_id     author_from  \
0      1478583791    d8qwv4k             d8qwg9k  Schmingleberry   

        author_reply_to  
0  pm_me_your_cuck_pics  


You can convert the dataframe into a graph using

In [5]:
G = ig.Graph.DictList(
        vertices=None,
        edges=df.to_dict('records'),
        directed=True,
        edge_foreign_keys=('author_from', 'author_reply_to'))

You can simplify the graph by using

In [6]:
G.simplify(combine_edges={'comment': len})

This collapses all parallel edges between the same pair of users into a single edge. The `comment` attribute of the remaining edge then holds the number of comments that were merged, which you can use as an edge weight.
Any value that you calculate for a comment (e.g. a topic or a sentiment) can be added to the dataframe and can be used in the construction of the graph.
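To see what the simplification does without loading the full dataset, the following minimal sketch counts parallel edges by hand. The reply records are made up for illustration, and the self-loop filter reflects the assumption that `simplify` is called with its default settings.

```python
from collections import Counter

# Hypothetical toy reply records: (author_from, author_reply_to).
replies = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("alice", "bob"),
    ("carol", "carol"),  # a self-reply, dropped below
]

# Each distinct directed pair becomes one edge; its weight is the number
# of parallel edges that were collapsed (the role `len` plays above).
edge_weights = Counter(
    (source, target) for source, target in replies if source != target
)

for (source, target), weight in sorted(edge_weights.items()):
    print(source, "->", target, weight)
```

The same counting idea applies to any per-comment value: instead of `len` you could combine sentiment scores with a mean, for example.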

There are four smaller subassignments, and you can choose any single one to work on. Hints for doing some of the analysis are provided after the description of each subassignment. Most of the techniques involved have already been explained during the lectures, but these hints provide some more explicit help.



# Topics and centrality

Users that are central tend to interact with lots of different (central) users. There are several possible explanations. Users may become more central if they secure a position of authority in a single topic. In that case, everybody interacts with the user because they are authoritative on this subject. Alternatively, somebody can be more central because they are active in many different topics. Finally, somebody may simply be more central because they are very active themselves, and every comment is likely to get a reply.

Techniques necessary
- Topic detection
- Centrality

**Read in trained LDA model and dictionary (identifier to word)**

In [None]:
lda_model_reddit = gensim.models.ldamodel.LdaModel.load('filename')
id2word_reddit = gensim.corpora.dictionary.Dictionary.load("filename")

To help speed up the analysis, we already computed topic values for each post. You can read the data as follows:

In [10]:
topic_sentiment_df = pd.read_csv('../../../../data/' + 'topic_sentiment_reddit.csv')

For each post, the topic distribution is stored in the columns t_0 to t_14 (15 topics).

In [12]:
print(topic_sentiment_df.head(5))

  comment_id     author_from     pos  neg       t_0       t_1       t_2  \
0    d8qwv4k  Schmingleberry  0.0125  0.0  0.001667  0.150760  0.001667   
1    d8p5x7o    socoamaretto  0.0000  0.0  0.016667  0.016667  0.016667   
2    d9dpj9r    allisslothed  0.0000  0.0  0.033333  0.033333  0.033333   
3    d8lh0lw    shaking_head  0.0250  0.0  0.002899  0.002899  0.046468   
4    d8cu7q1        InFearn0  0.0000  0.0  0.338153  0.001515  0.001515   

        t_3       t_4       t_5       t_6       t_7       t_8       t_9  \
0  0.001667  0.170236  0.001667  0.058648  0.001667  0.001667  0.146286   
1  0.016667  0.016667  0.016667  0.016667  0.766667  0.016667  0.016667   
2  0.033333  0.033333  0.033333  0.033333  0.033333  0.033333  0.033333   
3  0.448935  0.131211  0.002899  0.002899  0.156079  0.002899  0.072449   
4  0.001515  0.060315  0.038145  0.001515  0.001515  0.032549  0.001515   

       t_10      t_11      t_12      t_13      t_14  
0  0.164797  0.001667  0.109014  0.061628  0

We now calculate the average values for each user as follows:

In [13]:
# numeric_only=True skips text columns such as comment_id when averaging
topic_sentiment_user_df = topic_sentiment_df.groupby(['author_from'], as_index=False).mean(numeric_only=True)

In [15]:
topic_sentiment_user_df.head(5)

| | author_from | pos | neg | t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | --------Link-------- | 0.0 | 0.0625 | 0.262674 | 0.003509 | 0.003509 | 0.003509 | 0.003509 | 0.056574 | 0.069515 | 0.520009 | 0.003509 | 0.05614 | 0.003509 | 0.003509 | 0.003509 | 0.003509 | 0.003509 |
| 1 | ------________ | 0.017857 | 0.043675 | 0.017526 | 0.017526 | 0.039061 | 0.096752 | 0.074879 | 0.029013 | 0.190172 | 0.129516 | 0.017526 | 0.035383 | 0.05042 | 0.098166 | 0.15115 | 0.035383 | 0.017526 |
| 2 | -----iMartijn----- | 0.002 | 0.017867 | 0.018383 | 0.074263 | 0.005108 | 0.23024 | 0.11636 | 0.039299 | 0.017694 | 0.054546 | 0.005108 | 0.005108 | 0.247091 | 0.05379 | 0.102172 | 0.020444 | 0.010394 |
| 3 | ---CAISSON--- | 0.0 | 0.025 | 0.009697 | 0.009697 | 0.109697 | 0.009697 | 0.009697 | 0.055152 | 0.009697 | 0.061478 | 0.009697 | 0.109697 | 0.367007 | 0.209697 | 0.009697 | 0.009697 | 0.009697 |
| 4 | ---DONTDIEWEMULTIPLY | 0.005842 | 0.007025 | 0.022242 | 0.154403 | 0.056291 | 0.104365 | 0.082621 | 0.054479 | 0.043127 | 0.086735 | 0.070458 | 0.050695 | 0.081356 | 0.032482 | 0.089906 | 0.05765 | 0.013191 |


To select the values for a particular user, do this: 

In [22]:
topic_sentiment_user_df.loc[topic_sentiment_user_df['author_from'] ==  '---CAISSON---']

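| | author_from | pos | neg | t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | ---CAISSON--- | 0.0 | 0.025 | 0.009697 | 0.009697 | 0.109697 | 0.009697 | 0.009697 | 0.055152 | 0.009697 | 0.061478 | 0.009697 | 0.109697 | 0.367007 | 0.209697 | 0.009697 | 0.009697 | 0.009697 |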


You can easily convert this to, for example, a list:

In [23]:
topic_sentiment_user_df.loc[topic_sentiment_user_df['author_from'] ==  '---CAISSON---'].values.tolist()

[['---CAISSON---',
  0.0,
  0.025,
  0.009696985756390001,
  0.009696972080605,
  0.109696969609305,
  0.0096969706267,
  0.009696969877439999,
  0.055151522521100005,
  0.009696969696954999,
  0.06147797673265,
  0.009696969894069998,
  0.10969696975030499,
  0.3670068653565,
  0.20969691976345498,
  0.009696994830435,
  0.009696973806775,
  0.009696969696954999]]

**Compute statistics**

One way to measure whether a user posts mostly about one topic or is active in multiple topics is to use **entropy** (https://en.wikipedia.org/wiki/Entropy_(information_theory)).

This is an example where we have two topics. Because the probability of both topics is equal (0.5), the entropy is high.

In [10]:
print(scipy.stats.entropy([0.5, 0.5]))

0.69314718056


Because in the following example all the probability is concentrated on one topic, the entropy is low (0).

In [11]:
print(scipy.stats.entropy([1, 0]))

0.0
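For reference, the quantity computed above is the Shannon entropy $H(p) = -\sum_i p_i \ln p_i$. `scipy.stats.entropy` uses the natural logarithm by default and normalises its input so that it sums to 1. The sketch below reproduces the two values with the standard library only; the function name `shannon_entropy` is just an illustrative choice.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in nats; the input is normalised to sum to 1 first."""
    total = float(sum(probabilities))
    normalised = [p / total for p in probabilities]
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in normalised if p > 0)


uniform_two = shannon_entropy([0.5, 0.5])          # equals ln(2)
single_topic = shannon_entropy([1, 0])             # equals 0
max_15_topics = shannon_entropy([1.0 / 15] * 15)   # equals ln(15)

print(uniform_two, single_topic, max_15_topics)
```

With 15 topics the entropy can therefore range from 0 (a single topic) to ln(15), roughly 2.71 (all topics equally likely), which helps when judging whether a user's value is high or low.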


**User selection**

For this particular assignment, it might be useful to filter users. If you include *all* users, then users who have written only a few posts might have a topic distribution skewed towards a few topics, just because they haven't been active much. Let's count the number of posts for each user:

In [7]:
# size() counts rows per author; reset_index(name=...) turns the result into a dataframe
user_post_count = df.groupby('author_from').size().reset_index(name='counts')

Show the first 5 entries

In [8]:
user_post_count.head(5)

| | author_from | counts |
|---|---|---|
| 0 | --------Link-------- | 1 |
| 1 | ------________ | 14 |
| 2 | -----iMartijn----- | 4 |
| 3 | ---CAISSON--- | 2 |
| 4 | ---DONTDIEWEMULTIPLY | 27 |


Select users with at least 50 posts

In [9]:
selected_users = list(user_post_count[user_post_count.counts >= 50]['author_from'])

Sample 1000 users

In [17]:
random.shuffle(selected_users)
selected_users_sampled = selected_users[:1000]
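Shuffling in place gives a different sample on every run, which makes results hard to compare between runs. A minimal alternative sketch, assuming you want a reproducible sample, uses a seeded generator and `random.sample`. The seed value 42 and the placeholder user names are arbitrary choices for illustration.

```python
import random

# Hypothetical stand-in for the filtered user list built above.
selected_users = ["user_%d" % i for i in range(5000)]

rng = random.Random(42)  # fixed seed (arbitrary) so the sample is reproducible
sample_size = min(1000, len(selected_users))  # guard against short lists

# A set makes the later `name in selected_users_sampled` checks fast.
selected_users_sampled = set(rng.sample(selected_users, sample_size))

print(len(selected_users_sampled))
```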

**Centrality**

There are various possible centralities. Betweenness is too slow to calculate for this network, so we will only focus on eigenvector centrality, PageRank and (in- or out-)degree. You can try any one of them; just keep your choice in mind when interpreting further results. You can get the centralities by running any one of the following:


In [14]:
G.eigenvector_centrality()
G.pagerank()
G.degree()

[206, 459, 64, 327, 422, 30, 148, 757, 446, 85, ...]
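The degree values above count all neighbours of a user. In a reply network the direction matters: in-degree is the number of distinct users who replied to you, out-degree the number of distinct users you replied to. igraph exposes this through the `mode` argument of `degree`; the sketch below computes both by hand on a made-up edge list so you can check your interpretation.

```python
from collections import defaultdict

# Hypothetical directed edges: (author_from, author_reply_to).
edges = [("alice", "bob"), ("carol", "bob"), ("dave", "bob"), ("bob", "alice")]

repliers = defaultdict(set)    # user -> users who replied to them
replied_to = defaultdict(set)  # user -> users they replied to
for source, target in edges:
    replied_to[source].add(target)
    repliers[target].add(source)

users = sorted({user for edge in edges for user in edge})
in_degree = {user: len(repliers[user]) for user in users}
out_degree = {user: len(replied_to[user]) for user in users}

for user in users:
    print(user, "in:", in_degree[user], "out:", out_degree[user])
```

Here `bob` has a high in-degree because many users reply to him, even though he replies to only one user himself.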

**Todo:**
- Decide which users you will analyze
- Compute the centrality for each user
- Compute the topic distribution for each user. 
- Analyze whether there is a relation between the two measures.

In [16]:
centrality_values = G.eigenvector_centrality()

In [18]:
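a = []  # eigenvector centrality per selected user
b = []  # topic entropy per selected user
selected = set(selected_users_sampled)  # set membership checks are fast

for i, v in enumerate(G.vs):
    # Vertex names are user names; skip users outside the sample
    if v['name'] not in selected:
        continue

    a.append(centrality_values[i])
    row = topic_sentiment_user_df.loc[topic_sentiment_user_df['author_from'] == v['name']].values.tolist()[0]
    topic_distribution = row[-15:]  # the last 15 columns are t_0 ... t_14

    b.append(scipy.stats.entropy(topic_distribution))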

In [19]:
print(scipy.stats.spearmanr(a, b))

SpearmanrResult(correlation=0.22448061248061255, pvalue=6.8864898434479773e-13)
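To interpret this result it helps to recall that the Spearman correlation is simply the Pearson correlation computed on ranks, so it measures whether one quantity tends to increase with the other without assuming a linear relation. The sketch below illustrates this with the standard library; the function names and the toy values are assumptions made for this example, not part of the assignment data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values: centrality and topic entropy for five users.
user_centrality = [0.01, 0.20, 0.05, 0.90, 0.40]
user_entropy = [0.5, 1.9, 1.2, 2.6, 2.0]
rho = spearman(user_centrality, user_entropy)
print(rho)
```

Keep in mind that with around 1000 users even a weak correlation gives a very small p-value, so look at the size of the coefficient and not only at its significance.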


# Sentiment and centrality 

In order to become central in the commenter network, sufficiently many people have to respond to your comments. Enticing others to respond is thus essential. This is more likely when comments are controversial, i.e. when many people disagree with the comment. What is controversial depends on the environment in which a statement is made. At any rate, we could expect a controversial statement to be met with criticism. We should then expect that central people are more likely to be criticised, and that they attract relatively many negative comments.

Techniques necessary
- Sentiment analysis
- Centrality

**Sentiment analysis**

In [20]:
from empath import Empath
lexicon = Empath()

Take a look at post number 340

In [40]:
print(df.iloc[[340]]['comment'].values[0])

What nonsense. The country is called "The United States of America". States' rights are an integral part of the US and have been so ever since it's existence.

What exactly is supposed bad about states' rights? Are you one of those globalist "world government" loonies?


Analyze the comment using Empath

In [41]:
def tokenize(text):
    return list(gensim.utils.simple_preprocess(text))

In [42]:
lexicon.analyze(tokenize(df.iloc[[340]]['comment'].values[0]), normalize=True)

{'achievement': 0.0,
 'affection': 0.0,
 'aggression': 0.0,
 'air_travel': 0.0,
 'alcohol': 0.0,
 'ancient': 0.0,
 'anger': 0.0,
 'animal': 0.0,
 'anonymity': 0.0,
 'anticipation': 0.0,
 'appearance': 0.0,
 'art': 0.0,
 'attractive': 0.0,
 'banking': 0.0,
 'beach': 0.0,
 'beauty': 0.0,
 'blue_collar_job': 0.0,
 'body': 0.0,
 'breaking': 0.0,
 'business': 0.0,
 'car': 0.0,
 'celebration': 0.0,
 'cheerfulness': 0.0,
 'childish': 0.0,
 'children': 0.0,
 'cleaning': 0.0,
 'clothing': 0.0,
 'cold': 0.0,
 'college': 0.0,
 'communication': 0.0,
 'competing': 0.0,
 'computer': 0.0,
 'confusion': 0.0,
 'contentment': 0.0,
 'cooking': 0.0,
 'crime': 0.0,
 'dance': 0.0,
 'death': 0.0,
 'deception': 0.0,
 'disappointment': 0.0,
 'disgust': 0.0,
 'dispute': 0.0,
 'divine': 0.0,
 'domestic_work': 0.0,
 'dominant_heirarchical': 0.022222222222222223,
 'dominant_personality': 0.0,
 'driving': 0.0,
 'eating': 0.0,
 'economics': 0.0,
 'emotional': 0.022222222222222223,
 'envy': 0.0,
 'exasperation': 0.0,

Again, we have precomputed the sentiment values (but if you have time: extend it and consider other features as well, like emotion).

Because we are going to look at responses, we first join the dataset to have access to the author_reply_to field

In [28]:
df_combined = df.join(topic_sentiment_df, lsuffix='_orig', rsuffix='_nlp')

In [29]:
print df_combined.head(5)

                                             comment  a_score  a_created_utc  \
0  You think these women were sitting out there m...        0     1476397749   
1                               Oh wow, so profound!        0     1476298220   
2                   Hillary will be your president.         0     1477809663   
3  Notice that after this one, Pence stated and I...        0     1476072979   
4  People are already required to file their tax ...        0     1475538428   

   a_retrieved_on comment_id_orig comment_reply_to_id author_from_orig  \
0      1478583791         d8qwv4k             d8qwg9k   Schmingleberry   
1      1478553301         d8p5x7o             d8p5ch9     socoamaretto   
2      1478983377         d9dpj9r             d9dpf06     allisslothed   
3      1478488885         d8lh0lw             d8lfk96     shaking_head   
4      1478338625         d8cu7q1             d8ctw14         InFearn0   

        author_reply_to comment_id_nlp author_from_nlp    ...          t_5
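Note that `join` without an `on` argument aligns the two tables by row position (index), so it relies on both files listing the comments in the same order. If you are unsure about that, a key-based merge on `comment_id` is a more defensive alternative. The sketch below uses two tiny made-up tables; only the column names are taken from the data above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
df = pd.DataFrame({
    "comment_id": ["c1", "c2", "c3"],
    "author_from": ["alice", "bob", "carol"],
    "author_reply_to": ["bob", "alice", "alice"],
})
topic_sentiment_df = pd.DataFrame({
    "comment_id": ["c3", "c1", "c2"],  # deliberately in a different order
    "author_from": ["carol", "alice", "bob"],
    "pos": [0.30, 0.10, 0.00],
    "neg": [0.00, 0.05, 0.20],
})

# Match rows on comment_id instead of on row position; drop the duplicate
# author column and let pandas check that every id occurs only once.
df_combined = df.merge(
    topic_sentiment_df.drop(columns=["author_from"]),
    on="comment_id",
    how="inner",
    validate="one_to_one",
)

print(df_combined)
```

With this variant the shared columns keep their original names, so there are no `_orig` and `_nlp` suffixes to deal with afterwards.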

Very similar to what we did before. Compute the mean for each author (but now we are looking at responses, so we look at 'author_reply_to')

In [31]:
# numeric_only=True skips text columns such as the comment text when averaging
df_combined_author_reply_to = df_combined.groupby(['author_reply_to'], as_index=False).mean(numeric_only=True)

In [55]:
df_combined_author_reply_to[df_combined_author_reply_to['author_reply_to'] == 'sprcow']

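| | author_reply_to | a_score | a_created_utc | a_retrieved_on | pos | neg | t_0 | t_1 | t_2 | t_3 | ... | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 81933 | sprcow | 6.511111 | 1477001000.0 | 1478755000.0 | 0.015939 | 0.009377 | 0.041849 | 0.044994 | 0.04556 | 0.16002 | ... | 0.020855 | 0.062705 | 0.100086 | 0.082979 | 0.022196 | 0.034937 | 0.09735 | 0.099477 | 0.048544 | 0.045009 |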


In [None]:
a = []  # centrality
b = []  # mean positive sentiment of the replies a user received
c = []  # mean negative sentiment of the replies a user received
d = []  # combined sentiment (positive + negative)
for i, v in enumerate(G.vs):
    if v['name'] not in selected_users_sampled:
        continue

    row = df_combined_author_reply_to.loc[df_combined_author_reply_to['author_reply_to'] == v['name']]
    if row.empty:  # user never received a reply, so there is nothing to average
        continue
    pos = row['pos'].values[0]
    neg = row['neg'].values[0]

    a.append(centrality_values[i])
    b.append(pos)
    c.append(neg)
    d.append(pos + neg)

In [54]:
print(scipy.stats.spearmanr(a, b))
print(scipy.stats.spearmanr(a, c))
print(scipy.stats.spearmanr(a, d))

SpearmanrResult(correlation=0.11766470166470168, pvalue=0.0001920947192013784)
SpearmanrResult(correlation=0.072682116682116688, pvalue=0.021529011579097989)
SpearmanrResult(correlation=0.091321303321303324, pvalue=0.0038492035148635631)


# Communities of interest

Earlier today you learned that interaction is often homophilous: people with the same interests are more likely to be connected to each other. We will look into this question here on the basis of topics. Two questions are central in this assignment: (1) are users that share topics more likely to be connected; and (2) does this create communities of interest?

Techniques necessary
- Topic modelling
- Assortativity
- Community detection
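One possible way to approach question (1) is to assign each user their dominant topic and compare the share of edges that connect users with the same topic to the share you would expect by chance. This is the idea behind the nominal assortativity coefficient, which igraph offers as `assortativity_nominal`. The sketch below works it out by hand; the users, topics and edges are invented, and using the dominant topic is only one of several reasonable choices.

```python
from collections import Counter

# Hypothetical dominant topic per user (e.g. the argmax over t_0 ... t_14).
dominant_topic = {"alice": 3, "bob": 3, "carol": 7, "dave": 7, "erin": 3}

# Hypothetical directed reply edges: (author_from, author_reply_to).
edges = [
    ("alice", "bob"),
    ("bob", "erin"),
    ("carol", "dave"),
    ("alice", "carol"),
    ("erin", "alice"),
]

n_edges = float(len(edges))

# Observed share of edges whose endpoints have the same dominant topic.
observed = sum(
    1 for source, target in edges
    if dominant_topic[source] == dominant_topic[target]
) / n_edges

# Expected share if sources and targets were paired independently.
source_share = Counter(dominant_topic[source] for source, _ in edges)
target_share = Counter(dominant_topic[target] for _, target in edges)
expected = sum(
    (source_share[topic] / n_edges) * (target_share[topic] / n_edges)
    for topic in source_share
)

# Assortativity: 0 means no preference, 1 means perfect homophily.
assortativity = (observed - expected) / (1.0 - expected)
print(observed, expected, assortativity)
```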

The most difficult part of community detection is deciding what method is appropriate and sometimes what resolution is appropriate. First we need to import the library:

In [15]:
import louvain

Modularity is the most commonly used method, and you can run it using

In [17]:
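# The simplified graph has no 'weight' edge attribute (asking for one raises
# KeyError: 'Attribute does not exist'). simplify() stored the number of merged
# comments in the 'comment' attribute, so we use that as the edge weight.
partition = louvain.find_partition(G, 'Modularity', weight='comment')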

Alternatively, you can try out CPM, using various resolution values. Good resolution values are usually quite small, but this may depend on the weight. Try to search for a resolution parameter that gives you a solution in the range of 5-50 communities.


In [18]:
# As above, use the comment count stored in the 'comment' edge attribute as weight
partition = louvain.find_partition(G, 'CPM', weight='comment', resolution_parameter=0.08)
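To get a feeling for what these methods optimise, it can help to compute the modularity of a given partition by hand: for each community, take the fraction of edge weight that falls inside it and subtract the fraction you would expect if edges were placed at random given the node strengths. The following sketch does this for a small undirected toy graph. Treating the graph as undirected is a simplifying assumption made here, and the graph and function name are made up.

```python
from collections import defaultdict


def modularity(weighted_edges, membership):
    """Modularity of a partition of an undirected, weighted graph."""
    total_weight = float(sum(weight for _, _, weight in weighted_edges))
    internal = defaultdict(float)  # edge weight inside each community
    strength = defaultdict(float)  # summed node strength per community
    for u, v, weight in weighted_edges:
        strength[membership[u]] += weight
        strength[membership[v]] += weight
        if membership[u] == membership[v]:
            internal[membership[u]] += weight
    return sum(
        internal[c] / total_weight - (strength[c] / (2 * total_weight)) ** 2
        for c in strength
    )


# Two triangles joined by a single edge (all weights 1).
toy_edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(toy_edges, two_groups))  # the natural split scores higher
print(modularity(toy_edges, one_group))   # a single community scores 0
```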

# Sentiment and language across communities

Following social balance theory, it is possible that the commenter network is highly polarized (not implausible given the divisive US politics). Simply looking at communication while disregarding the valence of the link (i.e. whether it was negative or positive) may distort our view of the integration of the network. We will use sentiment analysis of the comments to determine whether the links are in fact negative or positive. In this assignment two questions are central: (1) does the valence of links change the community structure; and (2) do sentiment and language within groups differ from sentiment and language between groups?
TODO: Social balance

Techniques necessary
- Sentiment analysis
- Community detection
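Once you have a community membership for each user and a sentiment score for each link, a first step could be to compare the average valence of links within communities to that of links between communities. The sketch below uses made-up values; defining valence as `pos - neg` is an assumption for this illustration and you may prefer another definition.

```python
# Hypothetical links: (author_from, author_reply_to, pos, neg).
sentiment_edges = [
    ("alice", "bob", 0.10, 0.00),
    ("bob", "alice", 0.05, 0.01),
    ("alice", "carol", 0.00, 0.08),
    ("dave", "carol", 0.04, 0.00),
    ("carol", "bob", 0.01, 0.06),
]

# Hypothetical community membership, e.g. taken from a detected partition.
membership = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}

within, between = [], []
for source, target, pos, neg in sentiment_edges:
    valence = pos - neg  # positive means a friendlier reply
    if membership[source] == membership[target]:
        within.append(valence)
    else:
        between.append(valence)

mean_within = sum(within) / float(len(within))
mean_between = sum(between) / float(len(between))
print(mean_within, mean_between)
```

In a polarized network you would expect the mean valence within communities to be clearly higher than between communities.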

# Bonus assignment: social influence

Earlier you studied homophily: are users that share interests more likely to connect? In this assignment you will study whether social influence takes place. The goal is to study whether a user is more likely to start using a new word if another user mentioned this word to them earlier. Note that you explicitly need to take the time dimension into account in this assignment. You should calculate two probabilities: (1) the probability of using a new word given that it was not observed before, versus (2) the probability of using a new word given that it was observed before. Given your results, do you think there is social influence, or do you have another explanation?

Techniques
- Topic detection
- Sentiment analysis
- Community detection
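The two probabilities can be operationalised in several ways. The sketch below shows one possibility, under the assumption that a user is "exposed" to a word when somebody replies to them using that word before they have used it themselves, and that every (user, word) pair in a fixed vocabulary counts as one opportunity to adopt. The events and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (timestamp, author_from, author_reply_to, words).
events = [
    (1, "alice", "bob", {"rigged", "poll"}),
    (2, "bob", "alice", {"rigged", "email"}),
    (3, "carol", "alice", {"server"}),
    (4, "alice", "carol", {"email", "poll"}),
    (5, "bob", "carol", {"poll"}),
]

vocabulary = set().union(*(words for _, _, _, words in events))
users = {user for _, sender, receiver, _ in events for user in (sender, receiver)}

used = defaultdict(set)     # words each user has written so far
exposed = defaultdict(set)  # words received before the user's own first use
adopted_after_exposure = 0
adopted_without_exposure = 0

# Process the comments in chronological order.
for _, sender, receiver, words in sorted(events, key=lambda event: event[0]):
    for word in words - used[sender]:  # first use of this word by the sender
        if word in exposed[sender]:
            adopted_after_exposure += 1
        else:
            adopted_without_exposure += 1
    used[sender] |= words
    exposed[receiver] |= words - used[receiver]

# Opportunities: (user, word) pairs with and without prior exposure.
exposed_pairs = sum(len(exposed[user]) for user in users)
unexposed_pairs = len(users) * len(vocabulary) - exposed_pairs

p_exposed = adopted_after_exposure / float(exposed_pairs)
p_unexposed = adopted_without_exposure / float(unexposed_pairs)
print(p_exposed, p_unexposed)
```

Note that the baseline is sensitive to how you define the vocabulary and the opportunities: in this toy example the very first words of each user all count as adoptions without exposure, which inflates the second probability. Think about such effects before concluding anything about social influence.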





