# Matching Analysis: The Effect of Social Feedback in a Reddit Community
In this notebook we demonstrate the use of "matching" when using observational data to estimate the effect size of receiving a "treatment" vs. not receiving a treatment, i.e. being in the "control" group. The vast majority of this notebook has been put together by Tiago Cunha. https://sites.google.com/site/tiagocunha87/

In this concrete example we use real data from the Reddit community called "loseit", an online forum dedicated to weight loss.
https://www.reddit.com/r/loseit/

We want to see if the treatment of receiving a "warm welcome" on one's first post in this subreddit has an effect on a user's probability of returning for a second post in the future. Concretely, we look at the number of upvotes a post receives and define users whose first post on loseit did not receive a single upvote as the control group (see e.g. https://www.reddit.com/r/loseit/comments/o3h7y). Correspondingly, we define users whose first post received at least one upvote as the treatment group (see e.g. https://www.reddit.com/r/loseit/comments/238r4o).

The data we are using for this tutorial covers 37,279 users whose first post occurred between August 2010 and October 2014. The data was obtained by crawling Reddit using PRAW (http://praw.readthedocs.org), a Python package that allows simple access to Reddit's official API. For more details see the corresponding DigitalHealth'16 publication: Tiago Cunha, Ingmar Weber, Hamed Haddadi, Gisele Pappa: The Effect of Social Feedback in a Weight Loss Subreddit, preprint at http://arxiv.org/abs/1602.07936.


The number of upvotes a user receives is clearly affected by their writing style and the subject matter of their first post. A user with a can-do attitude might write more positively and hence be more likely to receive upvotes. If this user then returns for a second post in the future, this might have been due to their attitude and not the feedback received. Similarly, a user who's depressed might write a sad post and not receive any upvotes. If this user then fails to return for a second post in the future it is again unclear if that is because of the lack of support or because of their inherent attitude.

To correct for this, we match each treated user with a similar control user. For this analysis we define "similar" in terms of LDA (Latent Dirichlet Allocation - a probabilistic topic model). Two users, or rather their first posts in the loseit subreddit, are said to be similar if their LDA topic distributions have a high cosine similarity. To save time, we have pre-computed the LDA topics of each post in our data set using the GibbsLDA++ implementation (http://gibbslda.sourceforge.net/). 
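To make the similarity notion concrete, here is a minimal sketch of cosine similarity between topic distributions. The three 4-topic vectors and the helper name `cosine_similarity` are made up for illustration; the real data uses the pre-computed GibbsLDA++ topics.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two topic-distribution vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-topic distributions for three first posts (each sums to 1).
post_a = [0.70, 0.10, 0.10, 0.10]
post_b = [0.65, 0.15, 0.10, 0.10]   # topic mix close to post_a
post_c = [0.05, 0.05, 0.10, 0.80]   # dominated by a different topic

sim_ab = cosine_similarity(post_a, post_b)
sim_ac = cosine_similarity(post_a, post_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Posts with a similar topic mix score close to 1, while posts dominated by different topics score much lower, which is what the similarity threshold used later relies on.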

When doing matching, there are a couple of variants one can try such as:
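- match either with or without replacement, i.e. members of the pool that matches are drawn from (classically the control group; in this notebook the much larger treatment group) can or cannot be reused for several pairings
- define a cut-off on how close the closest match needs to be in order to be considered "similar" at all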
- use different distance metrics, in particular Mahalanobis Distance (https://en.wikipedia.org/wiki/Mahalanobis_distance) for co-variates that have not been normalized

This IPython Notebook lets you experiment with the first two (replacement or not, and different cut-offs). The distance is fixed to cosine similarity so that the script runs fast enough to be interactive.
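Although this notebook does not implement the third variant, a rough sketch of the Mahalanobis distance may help to see why it suits co-variates on different scales. The covariates below (post length in words, hour of day) and their values are hypothetical and not part of the loseit data set.

```python
import numpy as np

def mahalanobis(u, v, cov_inv):
    """Mahalanobis distance between u and v for a given inverse covariance."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(diff @ cov_inv @ diff))

# Hypothetical un-normalized covariates: [post length in words, hour of day].
X = np.array([
    [120.0,  9.0],
    [450.0, 22.0],
    [ 80.0, 10.0],
    [300.0, 14.0],
    [150.0, 20.0],
])

# The inverse sample covariance rescales each covariate by its spread and
# accounts for the correlation between covariates.
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

d_01 = mahalanobis(X[0], X[1], cov_inv)
d_02 = mahalanobis(X[0], X[2], cov_inv)
print(f"distance post 0 to post 1: {d_01:.3f}")
print(f"distance post 0 to post 2: {d_02:.3f}")
```

Because the distance is scaled by the covariance, the large raw differences in post length do not automatically dominate the small raw differences in hour of day.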

As a general introduction to using matching for causal inference from observational data, we highly recommend Gary King's presentation: https://www.youtube.com/watch?v=rBv39pK1iEs

In [2]:
# just importing some packages
from operator import itemgetter
import numpy as np
import sys
from gensim import similarities
from random import sample
from IPython.display import display, HTML

# We start by defining a bunch of functions. Nothing happening here (yet). Skip further down if you're impatient.
## Reads the covariates file, here pre-computed LDA topics for each post

In [3]:
#read the LDA file: one post per line, topic weights followed by the post ID
def get_covariates(featuresFile):

    try:
        with open(featuresFile) as f:
            data = f.read()
    except OSError:
        print ("Could not read covariates file: "+featuresFile)
        sys.exit(1)
    #splitlines() avoids the empty trailing entry that split("\n") produces
    data = data.splitlines()
    docs = []
    docToPost = {}
    postToDoc = {}

    for count, line in enumerate(data):
        x = line.split(",")
        #the last column holds the post ID, all others are topic weights
        post = x.pop()
        docToPost[count] = post
        postToDoc[post] = count
        #gensim expects sparse vectors as lists of (feature index, value) pairs
        docs.append([(i, float(v)) for i, v in enumerate(x)])

    num_features = len(docs[0])
    return docs, docToPost, num_features, postToDoc
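As an illustration of the data layout this function appears to expect, the following sketch parses two made-up lines in the same way. The topic weights and post IDs are hypothetical, and the `t3_` prefix is an assumption based on the three characters that the matching function strips before building a URL.

```python
# Two hypothetical lines: comma-separated topic weights, then the post ID.
sample_lines = [
    "0.70,0.10,0.20,t3_aaaaa",
    "0.05,0.15,0.80,t3_bbbbb",
]

docs, doc_to_post, post_to_doc = [], {}, {}
for row, line in enumerate(sample_lines):
    # Everything but the last column is a topic weight.
    *weights, post_id = line.split(",")
    doc_to_post[row] = post_id
    post_to_doc[post_id] = row
    # Sparse representation: a list of (feature index, value) pairs.
    docs.append([(i, float(w)) for i, w in enumerate(weights)])

print(docs[0])
print(doc_to_post)
```

The two dictionaries simply translate between row numbers, which the similarity index works with, and post IDs, which the upvote and return data are keyed on.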

## Reads the number of upvotes received on first posts and if the authors later returned or not for a second post

In [4]:
def get_posts_information(postsFile):
    try:
        with open(postsFile) as f:
            data = f.read()
    except OSError:
        print ("Could not read posts file: "+postsFile)
        sys.exit(1)
    data = data.splitlines()
    #drop the header row
    data.pop(0)

    upvotes = {}
    postsReturn = {}
    for line in data:
        #columns: post ID, number of upvotes, "true" if the author returned
        x = line.split(",")
        upvotes[x[0]] = float(x[1])
        postsReturn[x[0]] = x[2]
    return upvotes, postsReturn

## This function performs the matching and splits the data into a treatment and (matched) control group
It supports different minimal similarity thresholds and supports matching both with and without replacement. A distance of "None" means that no matching is performed, which is the baseline case. The only non-trivial distance supported at the moment is "cosine". 

In [5]:
def distance_matching(docs, docToPost, upvotes, threshold, replacement, distance = None):

    #treatment group
    treatment = []
    treatmentToPost= {}
    #control group
    control = []
    controlToPost = {}
    size = len(docs)
    countTreatment = 0
    countControl = 0
    #SPLIT THE DATA INTO TREATMENT (feedback > 0) AND CONTROL (feedback == 0) GROUPS
    for x in range(0,size):
        if upvotes[docToPost[x]] == 0:
            #GET THE CONTROL GROUP
            control.append(docs[x])
            controlToPost[countControl] = docToPost[x]
            countControl+=1
        else:
            #GET THE TREATMENT GROUP
            treatment.append(docs[x])
            treatmentToPost[countTreatment] = docToPost[x]
            countTreatment+=1

    print ("Treatment group = "+str(countTreatment)+" users")
    print ("Control group = "+str(countControl)+" users")

    #No matching requested: return the full groups (baseline)
    if distance is None:
        return list(treatmentToPost.values()), list(controlToPost.values())
    #matching
    else: 
        similarity = {}
        similarity_return = {}

        ranking = {}

        size = len(control)

        matched = {}

        countUnmatched = 0

        #shuffle the control group to reduce selection bias
        lookUpControl = sample(range(0, size), size)
        if distance == "cosine":
            #create an index to speed up the cosine similarity queries
            index = similarities.SparseMatrixSimilarity(treatment, num_features=len(treatment[0]))

        matchedControl = []
        matchedTreament = []
        count_pairs = 0
        #the matching itself
        for x in range(0,size):
            #cosine similarity: rank all treatment users from most to least similar
            if distance == "cosine":
                sims =  index[control[lookUpControl[x]]]
                closer = sorted(enumerate(sims), key = itemgetter(1), reverse = True)
            y = 0
            #Without replacement, skip treatment users that are already matched
            if not replacement:
                while treatmentToPost[closer[y][0]] in matched:
                    y+=1

            #If the similarity is at least the threshold, then match the users
            if closer[y][1]>=float(threshold):
                ranking[controlToPost[lookUpControl[x]]] = closer[y][1]
                similarity[controlToPost[lookUpControl[x]]] = treatmentToPost[closer[y][0]],closer[y][1]
                matched[treatmentToPost[closer[y][0]]] = 1
                similarity_return[count_pairs] = controlToPost[lookUpControl[x]], treatmentToPost[closer[y][0]]
                matchedControl.append(controlToPost[lookUpControl[x]])
                matchedTreament.append(treatmentToPost[closer[y][0]])
                count_pairs+=1
            else:
                countUnmatched+=1

        print ("Control users left unmatched (due to similarity threshold) = "+str(countUnmatched))    
    
        #sort the pairs by similarity
        sorted_posts = sorted(ranking.items(), key = itemgetter(1), reverse = True)

        #print the top 5 pairs by similarity
        print ("\n\nTop 5 pairs by similarity\n")
        for x in sorted_posts[:5]:
            #[3:] strips the 3-character prefix of the post ID to build the URL
            print ("Control:")
            print ("https://www.reddit.com/r/loseit/comments/"+x[0][3:]+"\n")
            print ("Treatment:")
            print ("https://www.reddit.com/r/loseit/comments/"+similarity[x[0]][0][3:]+"\n")
            print ("Cosine similarity = "+str(similarity[x[0]][1])+"\n\n")
        return matchedTreament, matchedControl
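The core of the matching logic can be sketched on toy data with plain NumPy. This is a simplified illustration, not the function above: `greedy_match` is a hypothetical helper, the vectors are made up, and the random shuffling of the control group is left out to keep the result deterministic.

```python
import numpy as np

def greedy_match(control, treatment, threshold, replacement):
    """Pair each control row with its most similar available treatment row.

    Returns a list of (control index, treatment index, similarity) tuples.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    c = control / np.linalg.norm(control, axis=1, keepdims=True)
    t = treatment / np.linalg.norm(treatment, axis=1, keepdims=True)
    sims = c @ t.T                      # shape: (n_control, n_treatment)

    used = set()
    pairs = []
    for ci in range(len(control)):
        # Walk through treatment candidates from most to least similar.
        for ti in map(int, np.argsort(-sims[ci])):
            if replacement or ti not in used:
                break
        else:
            continue                    # every treatment row is already taken
        # Only keep the pair if it clears the similarity threshold.
        if sims[ci, ti] >= threshold:
            pairs.append((ci, ti, float(sims[ci, ti])))
            used.add(ti)
    return pairs

# Hypothetical 3-topic distributions.
control = np.array([[0.80, 0.10, 0.10],
                    [0.75, 0.15, 0.10]])
treatment = np.array([[0.80, 0.10, 0.10],
                      [0.10, 0.10, 0.80],
                      [0.60, 0.30, 0.10]])

with_repl = greedy_match(control, treatment, 0.9, True)
without_repl = greedy_match(control, treatment, 0.9, False)
strict = greedy_match(control, treatment, 0.98, False)
print(with_repl)
print(without_repl)
print(strict)
```

With replacement both control rows pick the same treatment row. Without replacement the second control row has to settle for its next-best candidate, and with a stricter threshold that candidate is rejected and the control row stays unmatched.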

## Given a treatment and control group, this function computes the two "user returns for second post" rates

In [6]:
def return_rate(treatment, control, returns):

    treatmentReturn = 0
    controlReturn = 0

    #number of control users that came back
    for user in control:
        if returns[user] == "true":
            controlReturn+=1
    #number of treatment users that came back
    for user in treatment:
        if returns[user] == "true":
            treatmentReturn+=1

    probTreatmentReturn = treatmentReturn/(float(len(treatment)))
    probControlReturn = controlReturn/(float(len(control)))
    
    print ("Probability of treatment user return = "+ format(probTreatmentReturn, '.2%'))    
    print ("Probability of control user return = "+format(probControlReturn, '.2%'))
    print ("Relative increase = "+format((probTreatmentReturn-probControlReturn)/probControlReturn,'.2%'))

# Finally, the fun begins!
## From here onward, things actually get executed

In [7]:
#READ THE COVARIATE FILES (= pre-computed LDA topics for the users' first posts)
docs, docToPost, numFeatures, postToDoc = get_covariates("featuresFirstPosts.csv")


In [8]:
#READ THE FIRST POSTS INFORMATION: NUMBER OF UPVOTES RECEIVED AND IF THE AUTHORS RETURNED OR NOT
upvotes, returns = get_posts_information("firstPostReturns.csv")

# Estimate effect size without matching (baseline)
## This just compares the "returns for second post" rate for users who (i) don't receive any upvote (= control group), and who (ii) receive at least one upvote (= treatment group)

In [9]:
#Get control and treatment groups
#arguments: covariates, postIDs, feedback, threshold, replacement
# The missing "distance" argument at the end means that no matching is performed
treatment, control = distance_matching(docs, docToPost, upvotes, 0.9, False)

Treatment group = 34939 users
Control group = 2340 users


In [10]:
#Compute the return rate of control and treatment group
#arguments: treatment group, control group, if the authors returned
return_rate(treatment, control, returns)

Probability of treatment user return = 30.22%
Probability of control user return = 21.58%
Relative increase = 40.02%
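The relative increase in the baseline output above can be checked by hand as (treatment rate - control rate) / control rate. The sketch below uses the two rounded percentages as printed, so it lands close to, but not exactly on, the figure that the notebook computes from the unrounded counts.

```python
p_treatment = 0.3022   # return rate printed for the treatment group
p_control = 0.2158     # return rate printed for the control group

absolute_increase = p_treatment - p_control
relative_increase = absolute_increase / p_control

print(f"Absolute increase = {absolute_increase:.2%}")
print(f"Relative increase = {relative_increase:.2%}")
```

Note the difference between the two measures: the absolute gap is under nine percentage points, while the relative increase is roughly 40% because the control rate is the denominator.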


# Estimate effect size with matching
## Case 1: Matching without replacement
Here each user in the treatment group can be matched to at most one user in the control group.
The argument of "0.9" indicates the minimal threshold for cosine similarity to be considered a match.
The argument of "False" indicates "without replacement".

In [11]:
#Perform the matching and return control and treatment group
#arguments: covariates, postIDs, feedback, threshold, replacement, distance
treatment, control = distance_matching(docs, docToPost, upvotes, 0.9, False, "cosine")

Treatment group = 34939 users
Control group = 2340 users
Control users left unmatched (due to similarity threshold) = 1157


Top 5 pairs by similarity

Control:
https://www.reddit.com/r/loseit/comments/224d1q

Treatment:
https://www.reddit.com/r/loseit/comments/f5sza

Cosine similarity = 0.99488276


Control:
https://www.reddit.com/r/loseit/comments/2aeljt

Treatment:
https://www.reddit.com/r/loseit/comments/167pw5

Cosine similarity = 0.9870385


Control:
https://www.reddit.com/r/loseit/comments/1vgr2v

Treatment:
https://www.reddit.com/r/loseit/comments/2cw67l

Cosine similarity = 0.9866215


Control:
https://www.reddit.com/r/loseit/comments/18k5gx

Treatment:
https://www.reddit.com/r/loseit/comments/2f6riz

Cosine similarity = 0.9860636


Control:
https://www.reddit.com/r/loseit/comments/lcb4w

Treatment:
https://www.reddit.com/r/loseit/comments/rda9c

Cosine similarity = 0.98387253




In [12]:
#Compute the return rate of control and treatment group
#arguments: treatment group, control group, if the authors returned
return_rate(treatment, control, returns)

Probability of treatment user return = 28.40%
Probability of control user return = 21.30%
Relative increase = 33.33%


# Estimate effect size with matching
## Case 2: Matching with replacement
Here users in the treatment group can be matched to several users in the control group. 
The argument of "0.9" indicates the minimal threshold for cosine similarity to be considered a match.
The argument of "True" indicates "with replacement".

In [13]:
#Perform the matching and return control and treatment group
#arguments: covariates, postIDs, feedback, threshold, replacement, distance
treatment, control = distance_matching(docs, docToPost, upvotes, 0.9, True, "cosine")

Treatment group = 34939 users
Control group = 2340 users
Control users left unmatched (due to similarity threshold) = 1135


Top 5 pairs by similarity

Control:
https://www.reddit.com/r/loseit/comments/224d1q

Treatment:
https://www.reddit.com/r/loseit/comments/f5sza

Cosine similarity = 0.99488276


Control:
https://www.reddit.com/r/loseit/comments/2aeljt

Treatment:
https://www.reddit.com/r/loseit/comments/167pw5

Cosine similarity = 0.9870385


Control:
https://www.reddit.com/r/loseit/comments/1vgr2v

Treatment:
https://www.reddit.com/r/loseit/comments/2cw67l

Cosine similarity = 0.9866215


Control:
https://www.reddit.com/r/loseit/comments/18k5gx

Treatment:
https://www.reddit.com/r/loseit/comments/2f6riz

Cosine similarity = 0.9860636


Control:
https://www.reddit.com/r/loseit/comments/lcb4w

Treatment:
https://www.reddit.com/r/loseit/comments/rda9c

Cosine similarity = 0.98387253




In [14]:
#Compute the return rate of control and treatment group
#arguments: treatment group, control group, if the authors returned
return_rate(treatment, control, returns)

Probability of treatment user return = 27.05%
Probability of control user return = 21.24%
Relative increase = 27.34%


# Lessons learned
You should have seen the estimated effect size shrink when going from the unmatched baseline to the matched sample estimate: in the runs shown above, the relative increase drops from 40.02% (no matching) to 33.33% (matching without replacement) and 27.34% (matching with replacement). This is the standard setting where not controlling for covariates can lead to an overly optimistic effect size.
While the matched sample estimate reduces the effect size, it also requires a number of choices to be made, such as the similarity threshold used or whether to match with or without replacement.
Finally, matching is only as good as the covariates available. Though LDA topics are a sensible choice, one could imagine other variables such as time of day, gender of the person posting, post length and more. Such covariates were excluded from this tutorial for simplicity.
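One way to get a feeling for how much the threshold choice matters is to count how many control users still find a match as the cut-off is tightened. The sketch below does this for matching with replacement on randomly generated, purely hypothetical topic distributions; the group sizes and the number of topics are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5-topic distributions for a small control group and a
# larger treatment group, drawn at random purely for illustration.
control = rng.dirichlet(np.ones(5), size=50)
treatment = rng.dirichlet(np.ones(5), size=500)

# Cosine similarity of every control row to every treatment row.
c = control / np.linalg.norm(control, axis=1, keepdims=True)
t = treatment / np.linalg.norm(treatment, axis=1, keepdims=True)
best_similarity = (c @ t.T).max(axis=1)

# With replacement, a control user is matched whenever their best
# candidate clears the threshold.
matched_counts = {}
for threshold in (0.80, 0.90, 0.95, 0.99):
    matched_counts[threshold] = int((best_similarity >= threshold).sum())
    print(f"threshold {threshold:.2f}: "
          f"{matched_counts[threshold]} of {len(control)} matched")
```

A stricter threshold yields closer pairs but a smaller matched sample, so the estimate trades bias against variance.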
