# Finding Your Doppelganger: Who is Your Most Similar Twitter Friend?
### COMP 440: Collective Intelligence
### Instructor: Shilad Sen
### Eva Yifan Gong, Tony Bach, Gozong Lor

##### Introduction

This notebook is the programming part of our project. **Our project's objective is to design a program that finds your most similar Twitter friend, your doppelganger.** The program calculates a similarity score between you and every friend you follow by looking at three different aspects: **profile similarity, network similarity and content similarity**. It gives you the freedom to assign different weights x, y, z to those three similarity metrics.

Our program is divided into two sections. The first section is the main program, which calculates and returns a given user's doppelganger. The second section is a quantitative analysis of our program, which also includes the code we wrote to generate the graphs presented in our poster.

The first section can be divided into four parts. The first part imports the necessary packages and modules and specifies our credentials, as required by the Twitter API. The second part collects data on a given user's friends list, profile information, mentions, retweets, tweets and hashtags, and stores them in dictionaries. The third part is the calculation part, in which we compute tf-idf vectors and cosine similarity scores for the different similarity metrics. The fourth and last part combines all metrics and returns the final result: a given user's doppelganger.
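To make the calculation part more concrete, here is a minimal sketch of tf-idf weighting followed by cosine similarity. It uses only the standard library, so it is not the notebook's own code (the notebook relies on `TfidfVectorizer` from scikit-learn). The function names `tfidf_vectors` and `cosine_similarity`, the smoothed idf formula and the three sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using a smoothed tf-idf weight."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1.0 + n_docs) / (1.0 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical corpus: the user's own text comes first, then one text per friend.
corpus = [
    "liberal arts college in saint paul",
    "a liberal arts college in minnesota",
    "coffee shop and bakery",
]
vectors = tfidf_vectors(corpus)
scores = [cosine_similarity(vectors[0], v) for v in vectors[1:]]
```

In this sketch the first friend shares several terms with the user and gets a positive score, while the second friend shares none and scores zero. Putting the user's text first mirrors the corpus layout used in the calculation cell.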

##### Overview
**_1. Main Program_**

*1.1 Set-up*

*1.2 Data Collection*

*1.3 Calculation*

*1.4 Final Return*

**_2. Quantitative Analysis_**

##### Dependencies
Make sure you have the `stop-words` Python module installed. For directions on how to install it: https://pypi.python.org/pypi/stop-words. The notebook also imports `twitter`, `scipy`, `sklearn`, `matplotlib` and `numpy`.

##### How to Run the Code
This code was written with the intention of turning it into an application with an interactive GUI. In short, a user would type in a Twitter username and the app would return a list of the top doppelgangers. We don't have a GUI, but we wanted to mimic this process as closely as possible.

To run the code, first make sure every code cell has been run. Then create a new code cell at the bottom of this IPython notebook, copy the code in the blockquote below into it, change the username 'Macalester' to your desired username, adjust the weight of each metric, and click run.

(The first weight corresponds to content similarity, the second to profile similarity, and the third to network similarity.)

>findDoppelganger('Macalester', 2.0, 4.0, 2.0)
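As a rough illustration of how the three weights in the call above affect the ranking, the following sketch combines per-friend scores with a weighted sum and sorts the result. The function name `rank_friends`, the friend names and all score values are made up for this example. In the notebook itself the profile score is the average of the description and location similarities, and the network score currently comes from mentions only.

```python
def rank_friends(content, profile, network, x, y, z):
    """Weighted sum of three similarity scores per friend, highest score first."""
    scores = {}
    for friend in content:
        scores[friend] = (x * content[friend]
                          + y * profile.get(friend, 0.0)   # missing score counts as 0
                          + z * network.get(friend, 0.0))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical similarity scores in [0, 1] for two friends.
content = {"alice": 0.30, "bob": 0.10}
profile = {"alice": 0.20, "bob": 0.50}
network = {"alice": 0.50}  # bob has no mention data

ranking = rank_friends(content, profile, network, 2.0, 4.0, 2.0)
profile_heavy = rank_friends(content, profile, network, 1.0, 10.0, 1.0)
```

With the weights 2.0, 4.0, 2.0 the friend "alice" comes out on top; raising the profile weight far enough moves "bob" to first place, which is the kind of trade-off the weights are meant to expose.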

##### Future Work
At the moment, the network analysis is only partial, as it only takes into consideration mentions and not retweets.
However, the retweet data gathering code has also been included with comments and description on the intended use.
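
For readers who want to see what the mention-based network score amounts to, here is a small standalone sketch. It builds two aligned count vectors over the accounts the user mentioned and compares them with cosine similarity. The helper names `mention_vectors` and `cosine` and the sample counts are assumptions for illustration. Retweet counts could presumably be compared the same way, but that part is not finished in the notebook.

```python
import math

def mention_vectors(your_mentions, their_mentions):
    """Align two mention-count dictionaries on the accounts you mentioned."""
    accounts = sorted(your_mentions)  # fixed order so positions line up
    yours = [your_mentions[account] for account in accounts]
    theirs = [their_mentions.get(account, 0) for account in accounts]
    return yours, theirs

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical mention counts for you and for one friend.
your_mentions = {"@ann": 4, "@bob": 1, "@cary": 2}
friend_mentions = {"@ann": 2, "@dave": 7}

yours, theirs = mention_vectors(your_mentions, friend_mentions)
score = cosine(yours, theirs)
```

Returning 0.0 for an all-zero vector is a design choice of this sketch: it avoids the undefined (nan) result that the TO DO note in the vector-building code refers to.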

In [1]:
#######################################################################
# COMP 440: Collective Intelligence
# Instructor: Shilad Sen
# Finding Your Doppelganger: Who is Your Most Similar Twitter Friend? 
# Tony Bach, Eva Yifan Gong, Gozong Lor
# Last update: 12/15/2015
#######################################################################

#######################################################################
# 1.1 SET UP
# Getting credentials and libraries initialized.
#######################################################################

import pprint
import twitter
import json
from collections import defaultdict
from collections import OrderedDict
import time

import os

# Twitter API credentials are read from environment variables so that secrets
# are not stored in the notebook. The variable names below are placeholders;
# set them to match your own Twitter application.
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
OAUTH_TOKEN = os.environ['TWITTER_OAUTH_TOKEN']
OAUTH_TOKEN_SECRET = os.environ['TWITTER_OAUTH_TOKEN_SECRET']

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
 CONSUMER_KEY, CONSUMER_SECRET)

t = twitter.Twitter(auth=auth)
pp = pprint.PrettyPrinter(indent=2)


In [2]:
##########################################################################
# 1.2 DATA COLLECTION
# All the methods we used to collect our data from Twitter. These methods
# do not have any calculations occurring within them.
##########################################################################

############################################
# GET TWEET, DESCRIPTION, LOCATION DATA
############################################

def get_info_of_self_and_friends(my_user_name):
    tweet_dict = defaultdict(list)
    description_dict = defaultdict(str)
    location_dict = defaultdict(str)
    
    # getting user's tweets, description and location
    tweet_dict[my_user_name] = t.statuses.user_timeline(screen_name=my_user_name, count = 200)
    user_info = t.users.show(screen_name=my_user_name)
    description_dict[my_user_name] = user_info['description']
    location_dict[my_user_name] = user_info['location']
    
    # number of calls we have left to get a user's tweets. Need to keep track of
    # this so that we don't run into the rate limit error
    numCallsLeft = t.application.rate_limit_status(resources="statuses")['resources']['statuses']['/statuses/user_timeline']['remaining']
    
    # friend list is in reverse chronological order, so we want to get to the bottom of the list
    friends = t.friends.list(screen_name=my_user_name, count = 200)
    while (friends['next_cursor'] != 0):
        friends = t.friends.list(screen_name=my_user_name, cursor = friends['next_cursor'], count = 200)
    
    # go up the list from the bottom, to get the "oldest" friends first
    # need to leave two calls remaining for another method
    while (numCallsLeft >= 2):
        i = len(friends['users']) - 1
        while (i >= 0):
            friend = friends['users'][i]
            friend_name = friend['screen_name']
            if (friend_name not in tweet_dict):
                try:
                    friend_tweets = t.statuses.user_timeline(screen_name=friend_name, count = 200)
                    friend_info = t.users.show(screen_name=friend_name)
                    friend_description = friend_info['description']
                    friend_location = friend_info['location']
                    # only keep friends who have more than 20 tweets
                    if (len(friend_tweets) > 20):
                        tweet_dict[friend_name] = friend_tweets
                        description_dict[friend_name] = friend_description
                        location_dict[friend_name] = friend_location
                except twitter.api.TwitterHTTPError:
                    pass
                numCallsLeft -= 1
            i -= 1           
        # this is reached when we still have calls left, but we have already iterated through
        # all the friends of this user
        if (friends['previous_cursor'] == 0):
            break
        friends = t.friends.list(screen_name=my_user_name, cursor = friends['previous_cursor'], count = 200)
        
    return tweet_dict, description_dict, location_dict
        
# bad practice, but we don't want to call method more than once because of the rate limit
# for testing purpose only, should not do this when submitting final code
tweet_dict, description_dict, location_dict = get_info_of_self_and_friends('Macalester')

############################################
# GET MENTION DATA
############################################

# ======================================================
# Function find_who_you_mentioned_in_your_tweets(userA)
# ======================================================
# find_who_you_mentioned_in_your_tweets(userA)
# Returns a dictionary of users that userA mentioned and number of times they were mentioned
# Reads in a username for userA (i.e. "gozonglor", string)
# i.e. {@bob: 1, @ann: 4, @cary: 2, etc} --> userA mentioned bob 1 time, ann 4 times, and cary 2 times.
def find_who_you_mentioned_in_your_tweets(userA):
    # SET UP
    # Notice below: 'count' is hard-coded to be 200. 200 is the Twitter API limit. More tweets, more mentions.
    yourTweets = t.statuses.user_timeline(screen_name=userA, count=200, include_rts="false") 
    yourMentions = {} #Initialize a dictionary to store all your mentioned users
    
    # BUILD DICTIONARY OF MENTIONS
    for status in yourTweets: # For each of your tweets
        if "@" in status['text']: # If there's an @ symbol (a mention)
            cleanTweet = status['text'].strip() # Clean the entire tweet of any unnecessary white space
            tokens = cleanTweet.split(" ") # Split the tweet into words
            for word in tokens: # For each word
                if len(word) > 1: # If it's longer than 1 character
                    if word[0] == "@": # If the first character is an @
                        if word == "@@": # Error handling for typo in a mention?
                            break
                        if word in yourMentions:
                            yourMentions[word] += 1
                        else:
                            yourMentions[word] = 1
                            
    # IF yourMentions REMAINS EMPTY...
    if len(yourMentions) == 0: 
        print("Sorry, "+userA+" has not mentioned anyone yet.")
    return yourMentions

# ============================================
# Function find_who_mentioned_you(numTweets)
# ============================================
# Returns a dictionary of users and number of times they mentioned the authenticating user.
# numTweets are the number of tweets to pull that mention the AUTHENTICATING user
# Twitter API restricts numTweets to be <= 200. 
def find_who_mentioned_you(numTweets):
    # SET UP
    tweets = t.statuses.mentions_timeline(count=numTweets)
    mentionedBy = {}
    
    # FIND USERS WHO MENTIONED AUTHENTICATING USER
    if len(tweets) == 0:
        print("Sorry, no mentions of the authenticated user.")
    else:
        for tweetObject in tweets: 
            tweeter = tweetObject['user']['screen_name'] # Tweeter is the user who mentioned your authenticating user
            if tweeter in mentionedBy:
                mentionedBy[tweeter] += 1
            else:
                mentionedBy[tweeter] = 1
    return mentionedBy

# zero_list_maker(n)
# Helper function, used to initialize similarity vectors that are filled later
def zero_list_maker(n):
    listofzeros = [0] * n
    return listofzeros

# is_user_active(username)
# Helper function, checks if a given username is active (return True) or not active (return False)
def is_user_active(username):
    try:
        results = t.users.lookup(screen_name=username) #API query#
        return True
    except:
        return False

# is_user_protected(username)
# Helper function, checks whether we can read a given user's timeline.
# Note the return values: True means the timeline is accessible (the user is not protected),
# False means the request failed (the user is protected).
def is_user_protected(username):
    try:
        whoTheyMentioned = find_who_you_mentioned_in_your_tweets(username) # Find who they mentioned in their timeline
        return True
    except:
        return False

# ==================================================================
# Function build_vectors_for_who_mentioned_you(whoYouMentionedDict)
# ==================================================================
# Returns a dictionary: {john: [[1,2,3],[0,1,4]], anna: [[1,2,3],[9,3,2]], bill: [[1,2,3],[0,0,6]]}
# Key: Username of mentioned friend
# Value: A list of vectors. Index 0 -> Authenticating user's similarity vector. Index 1 -> Key user's similarity vector.
def build_vectors_for_who_mentioned_you(whoYouMentionedDict):
    # TO DO: Find a way to handle all the nan (i.e. eva's vector [0,0], and my vector [1,2])
    
    # SET UP
    vectorDict = {} # The dictionary returned at the end of the function (described above)
    yourVector = [] # Initializing the authenticating user's similarity vector
    yourVector = zero_list_maker(len(whoYouMentionedDict)) # Initialize vector to 0
    whoYouMentionedDict = OrderedDict(sorted(whoYouMentionedDict.items(), key=lambda t: t[1])) # We want the dictionary to keep its order
    
    # BUILD YOUR SIMILARITY VECTOR
    for f in whoYouMentionedDict:
        friendIndex = whoYouMentionedDict.keys().index(f) # Grab your friend's index in the ordered dictionary
        yourVector[friendIndex] = whoYouMentionedDict[f] # Use the index to insert them into your vector
        
    # BUILD EACH OF YOUR FRIEND'S SIMILARITY VECTOR    
    for friend in whoYouMentionedDict: # For each friend you mentioned
        # SET UP
        friendIndex = whoYouMentionedDict.keys().index(friend) # Grab their index in the ordered dictionary
        friendVector = [] # Initialize a similarity vector for your friend that will be stored in the vector dictionary
        friendVector = zero_list_maker(len(whoYouMentionedDict))
        friend = friend.split("@") # Clean up your friend's username
        friend = friend[1]

        # 1. MAKE SURE USER IS ACTIVE AND NOT PROTECTED
        # 2. GET EACH FRIEND'S MENTIONS THAT INTERSECT WITH THE AUTHENTICATING USER'S MENTIONS
        # 3. BUILD THE SIMILARITY VECTOR AND INSERT INTO VECTOR DICTIONARY
        if is_user_active(friend) is False: # If this friend is no longer on twitter, keep their vector to all 0s
            vectorDict[friend] = [yourVector, friendVector] # Entry in vectorDict (returned dictionary) becomes friend:[[authenticating user's vector],[[vector of 0s]]]
        else: # If the user is still active
            if is_user_protected(friend) is False: # Check if they are also protected
                print("This user "+friend+" is protected and we can't access their user timeline for tweets.\n") # Print error msg.
            else: # If the user is not protected
                whoTheyMentioned = find_who_you_mentioned_in_your_tweets(friend) # Find who your friend mentioned in their timeline
                for friend2 in whoTheyMentioned: # And for each friend of theirs (friend2) that they mentioned
                    if friend2 == "@" + friend: # If they mention themselves (mention keys keep the leading "@")
                        score = whoTheyMentioned[friend2]
                        friendVector[friendIndex] = score  # Store their score in their vector using the friendIndex pulled from the ordered dictionary whoYouMentionedDict 
                    if friend2 != "@" + friend: # If they dont mention themselves/if friend2 is someone else
                        if friend2 in whoYouMentionedDict: # And authenticating user mentioned friend2 as well
                            friend2Index = whoYouMentionedDict.keys().index(friend2) # Grab the index of the friend2 in your dictionary/vector
                            friendVector[friend2Index] = whoTheyMentioned[friend2] # Insert into the friend's similarity vector
                vectorDict[friend] = [yourVector, friendVector] # If the user is not protected, and friend's similarity vector is built, add it into the dictionary
    return vectorDict 


# ============================================================
# Function calculate_mention_similarity_dict(vectorDictionary) 
# ============================================================
# Returns a dictionary of the cosine similarity score for each user the authenticating user mentioned
# (i.e. {john: 0.53, anna: 0.4, bill: 0.17})
# Argument: vectorDictionary is the vectorDict returned by build_vectors_for_who_mentioned_you method.
def calculate_mention_similarity_dict(vectorDictionary):
    similarityDict = {}
    for friend in vectorDictionary:
        bothVectors = vectorDictionary[friend]
        yourVector = bothVectors[0]
        theirVector = bothVectors[1]
        result = 1-spatial.distance.cosine(yourVector, theirVector)
        similarityDict[friend] = result
    return similarityDict

############################################
# GET RETWEET DATA
############################################
# Note: This is still a prototype.
# 
# The goal was to find every person who retweeted a tweet of the authenticating user.
# i.e. [user1, user2, user3] = [1, 5, 6] (you can read this as: user1 retweeted 1, user2 retweeted 5 times, user3 retweeted 6 times)
# Then, grab the retweeted tweets for every person who retweeted you, find who retweeted those retweets, and compare lists of retweeters
# i.e. for user1, within all their retweets, here are the people who retweeted (and the number of times they retweeted): 
# [user4, user2, user7] = [1, 1, 1]
# The most similar person/doppelganger will be the user who has the most retweeters in common with the authenticating user.
# This will be calculated using cosine similarity.
#
#
# Each function below is called in the final function: find_retweet_similarity
#

# myPopTweets(numTweets)
# numTweets must be <= 100
# Returns a list of the ids of your popular tweets
def myPopTweets(numTweets):
    tweets = t.statuses.retweets_of_me(count=numTweets) #grab numTweets (some # of) tweets that others retweeted
    popularTweets=[]
    print(len(tweets))
    for tweet in tweets:
        #print(tweet['user']['screen_name']+"\n")
        #print(tweet['text']+"\n")
        #print("------ ----- ----- ----- -----")
        popularTweets.append(tweet['id'])
    return popularTweets

# Checks to see if the tweetID still exists.
def isAccessible(tweetID):
    try:
        results = t.statuses.retweeters.ids(_id=tweetID, stringify_ids=False)
        return True
    except:
        return False
    
# whoRetweeted(tweetID)
# Given a tweet id, returns a list of user ids that retweeted the tweet
# Notice: Making a request to statuses/retweets/ids is a little tricky. 
# Work around: https://github.com/sixohsix/twitter/issues/300
def whoRetweeted(tweetID):
    users=[]
    print("tweet ID: "+str(tweetID)+", type: "+str(type(tweetID))+"\n")
    if isAccessible(tweetID) != False:
        userList = t.statuses.retweeters.ids(_id=tweetID, stringify_ids=False)
        print("size of user list: "+str(len(userList['ids']))+"\n")
        for userID in userList['ids']:
            print("--> userID from userList['ids']: "+str(userID)+"\n")
            users.append(userID)
    else:
        print("Sorry! Retweeters of "+str(tweetID)+" is not accessible?\n")
    return users

# buildRetweetDictionary(tweetsList)
# tweetsList will be a list of tweet IDs the authenticating user authored
# Returns a dictionary of key:value pair tweetID and userIDs.
def buildRetweetDictionary(tweetsList):
    tweetUID = {} #{TID: [UID, UID, ... , UID], TID: [UID, UID ... , UID]} etc
    userIDList = []
    for tweetID in tweetsList:
        userIDList = whoRetweeted(tweetID) # Get a list of the userIDs who retweeted the particular tweet ID
        tweetUID[tweetID] = userIDList # Throw it into a dictionary
    return tweetUID

# Returns frequency list, rather than a dictionary, of user ids that retweeted your tweets.
def build_simple_list(tweetUID): #merge all lists (we are merging a dictionary of lists)
    mergedList = []
    for tweetID in tweetUID:
        usersList = tweetUID[tweetID]
        mergedList=mergedList+usersList
    return mergedList

# Checks to see if a given user is active based on the user id.
def isActive(uid):
    try:
        lookup = t.users.lookup(user_id=uid)
        return True
    except:
        return False
        
# convert_UID_to_names
# Given a merged list, for example ...
# mergedList = [UID1, UID1, UID2, UID3, UID4, UID4, UID4]
# Create a dictionary of UID and # of times it appears in the merged list
def convert_UID_to_names(mergedList):
    convertedDict = {}
    for uid in mergedList: #this is a very slow way of getting the job done (hopefully it works?)
        check = isActive(uid) #check if the user is active
        if check != True:
            print("This user with the id "+str(uid)+" is not active. We can't look them up.\n")
        else:
            lookupUser = t.users.lookup(user_id=uid)
            for userDict in lookupUser:
                username = userDict['screen_name']
                if username in convertedDict:
                    convertedDict[username] += 1
                else:
                    convertedDict[username] = 1
    return convertedDict

# Finds the retweeted tweets of a given user.
# Returns a list of the tweet IDs. 
def get_retweets_of(username):
    theirRTs = []
    theirTweets = t.statuses.user_timeline(screen_name=username, count=200, include_rts=True)
    for status in theirTweets:
        RT =status['retweet_count'] #Notice: 'RT = status["retweeted"]' does not appear to grab all retweeted tweets
        if RT > 0:
            TID = status['id'] # Grab the ID of the tweet
            theirRTs.append(TID)
    return theirRTs

# buildVectorsRT(yourDictionary, theirDictionaries)
# yourDictionary = {username: #, username: #, username: # ... etc}, where each value is the number of times this particular user
# retweeted your tweets.
# theirDictionaries is a dictionary of dictionaries. Each user inside the inner dictionary value has retweeted tweets authored by the key user.
# {username1: {username1: #, username2: #...}, username2: {etc}, username3: {etc}}
def buildVectorsRT(yourDictionary, theirDictionaries, username):
    
    finalDict = {} #A dictionary to return that holds all the final vectors for each user and the authenticating user
    
    # Sort your dictionary to keep its order
    yourDictionary = OrderedDict(sorted(yourDictionary.items(), key=lambda t: t[1]))
    
    # sort all the dictionaries inside the dictionary
    for usr in theirDictionaries:
        dic = theirDictionaries[usr]
        dic = OrderedDict(sorted(dic.items(), key=lambda t: t[1]))
        theirDictionaries[usr] = dic
    
    # order all the dictionaries in the dictionary by username
    theirDictionaries = OrderedDict(sorted(theirDictionaries.items(), key=lambda t: t[0]))
    
    yourVector = []
    yourVector = zero_list_maker(len(yourDictionary))
    # Build your vector
    for f in yourDictionary:
        friendIndex = yourDictionary.keys().index(f) #Grab their index
        yourVector[friendIndex] = yourDictionary[f]
    finalDict[username]=yourVector # Add it to the final dictionary
    
    # Build vectors for each user who retweeted your tweets
    for u in theirDictionaries:
        theirVector = []
        theirVector = zero_list_maker(len(yourDictionary))
        d = theirDictionaries[u] #grab the dictionary
        for u2 in d:
            if u2 in yourDictionary:
                uIndex = yourDictionary.keys().index(u2) #grab their index in your dictionary
                uScore = d[u2] #grab the score
                theirVector[uIndex] = uScore
        finalDict[u] = theirVector
    return finalDict

# find_retweet_similarity
# returns a dictionary of username and similarity vectors for cosine similarity calculation to run on
def find_retweet_similarity(username):
    popularTweets = myPopTweets(100) #grab your popular tweets (returns a list of tweet ids)    
    tweetUID=buildRetweetDictionary(popularTweets) #builds a dictionary of tweet ids, and list of users who retweeted it
    mergedListUID = build_simple_list(tweetUID)
    convertedDict = convert_UID_to_names(mergedListUID) #inactive or nonexistent users are skipped inside convert_UID_to_names
    retweetDict = {}
    for name in convertedDict:
        theirRTs = get_retweets_of(name) #get the popular tweets of the name user
        theirTweetUID = buildRetweetDictionary(theirRTs)
        theirMergedList = build_simple_list(theirTweetUID)
        theirConvertedDict = convert_UID_to_names(theirMergedList)
        retweetDict[name] = theirConvertedDict #{ann: 4, john:6}
    finalDict = buildVectorsRT(convertedDict, retweetDict, username)
    return finalDict
        

In [None]:
########################################################
# 1.3 CALCULATION
# The synthesis of all three measurements of similarity.
########################################################

############################################
# BUILD COSINE SIMILARITIES
############################################

from stop_words import get_stop_words
from scipy import spatial
from sklearn.feature_extraction.text import TfidfVectorizer

stop_word_list = get_stop_words('en')

def get_profile_corpus(my_user_name, info_type):
    corpus = []
    if (info_type == "description"):
        info_dict = description_dict
    else:
        info_dict = location_dict
    corpus.append(info_dict[my_user_name])
    for user in info_dict:
        if user != my_user_name:
            corpus.append(info_dict[user])
    return corpus

def get_tweet_corpus(my_user_name, just_hash_tags):
    corpus = []
    my_content = ""
    
    # need to have current user content as the first in the document array/corpus
    # this would make it easier to calculate cosine similarity between
    # tf-idf vectors later
    for tweet in tweet_dict[my_user_name]:
        if (just_hash_tags):
            hashTags = tweet['entities']['hashtags']
            if (len(hashTags) != 0):
                for hashTag in hashTags:
                    my_content += (hashTag['text'] + " ")
        else:
            my_content += (tweet['text'] + " ")

    corpus.append(my_content)
    
    # now add friends' content to the corpus
    for user in tweet_dict:
        if user != my_user_name:
            current_user_content = ""
            for tweet in tweet_dict[user]:
                if (just_hash_tags):
                    hashTags = tweet['entities']['hashtags']
                    if (len(hashTags) != 0):
                        for hashTag in hashTags:
                            current_user_content += (hashTag['text'] + " ")
                else:
                    current_user_content += (tweet['text'] + " ")
            corpus.append(current_user_content)

    return corpus

def get_cosine_similarities(corpus, my_user_name):
    # transform the documents into tf-idf vectors, then compute the cosine similarity between them
    # method taken from here: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents
    tfidf = TfidfVectorizer(stop_words = stop_word_list).fit_transform(corpus)
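    # tf-idf rows are L2-normalized by default, so this product holds the pairwise cosine similarities
    pairwise_similarity = tfidf * tfidf.T
    # row 0 is the current user, because their content is the first document in the corpus
    cosine_similarities_list = pairwise_similarity.toarray()[0]
    cosine_similarities_dict = defaultdict(float)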
    index = 1
    for user in tweet_dict:
        if user != my_user_name:
            cosine_similarities_dict[user] = cosine_similarities_list[index]
            index += 1
    return cosine_similarities_dict


In [None]:
#############################
# 1.4 FINAL RETURN
# Calculates final similarity
#############################

def get_all_cosine_similarities(user_name, contentWeight, profileWeight, networkWeight):
    all_cosine_similarities_dict = defaultdict(float)
    
    tweet_corpus = get_tweet_corpus(user_name, False)
    content_cosine_similarities_dict = get_cosine_similarities(tweet_corpus, user_name)

    description_corpus = get_profile_corpus(user_name, "description")
    description_cosine_similarities_dict = get_cosine_similarities(description_corpus, user_name)

    location_corpus = get_profile_corpus(user_name, "location")
    location_cosine_similarities_dict = get_cosine_similarities(location_corpus, user_name)
    
    yourMentions = find_who_you_mentioned_in_your_tweets(user_name)
    whoMentionedYou = find_who_mentioned_you(200) #TO DO
    fDict = build_vectors_for_who_mentioned_you(yourMentions)
    md = calculate_mention_similarity_dict(fDict)

    for user in content_cosine_similarities_dict:
        content_score = float(content_cosine_similarities_dict[user])
        description_score = float(description_cosine_similarities_dict[user])
        location_score = float(location_cosine_similarities_dict[user])
        # have to check because number users in mention list is much smaller
        if user in md:
            mention_score = float(md[user])
            # cosine similarity can be nan when a friend's vector is all zeros; count that as no similarity
            if mention_score != mention_score:
                mention_score = 0
        else:
            mention_score = 0
        profile_score = description_score * 0.5 + location_score * 0.5
        # add retweet score here if necessary
        network_score = mention_score
        
        all_score = content_score * contentWeight + profile_score * profileWeight + network_score * networkWeight
        all_cosine_similarities_dict[user] = all_score
    return all_cosine_similarities_dict

def findDoppelganger(username, x, y, z):
    all_cosine = get_all_cosine_similarities(username, x, y, z)
    all_cosine_sorted = sorted(all_cosine.items(), key=lambda kv: kv[1], reverse=True)
    print(all_cosine_sorted)



In [None]:
findDoppelganger('Macalester', 2.0, 4.0, 2.0)

In [None]:
########################################################
# 2. QUANTITATIVE ANALYSIS
########################################################

# ###################################################
# # BUILD MENTIONS - COSINE SIMILARITIES SCATTER PLOT
# ###################################################

import matplotlib
matplotlib.rcParams['backend'] = "Qt4Agg"
import matplotlib.pyplot as plt
import numpy as np

# ========================================================================
# Function linear_regression_counts(whoYouMentioned, usersWhoMentionYou)
# ========================================================================
#returns a dictionary of total mentions between you ('gozonglor') and a user
#whoYouMentioned should be a dictionary returned by the function find_who_you_mentioned_in_your_tweets('gozonglor') *defaults/hard coded to request 200 tweets
#usersWhoMentionYou should be a dictionary returned by the function find_who_mentioned_you(count) *count = 200, the max number of returns from mentions_timeline
def linear_regression_counts(whoYouMentioned, usersWhoMentionYou):
    totalDict = {} 
    for user in whoYouMentioned:
        numMentions = whoYouMentioned[user] #grab the user you mentioned
        user = user.split("@")
        user = user[1]
        if user in usersWhoMentionYou: #check if they also mentioned you
            theirMention = usersWhoMentionYou[user]
            totalDict[user] = numMentions+theirMention
    return totalDict

yourMentions = find_who_you_mentioned_in_your_tweets("")
whoMentionedYou = find_who_mentioned_you(200)
linearDict = linear_regression_counts(yourMentions, whoMentionedYou)
    
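# content_cosine_similarities_dict is local to get_all_cosine_similarities, so rebuild it here
# for the user whose data is stored in tweet_dict ('Macalester' in the data collection cell).
content_cosine_similarities_dict = get_cosine_similarities(get_tweet_corpus('Macalester', False), 'Macalester')

mentionsVector = []
cosineScoreVector = []
for user in linearDict:
    if user in content_cosine_similarities_dict: # skip users we have no tweet data for
        mentionsVector.append(linearDict[user])
        cosineScoreVector.append(content_cosine_similarities_dict[user])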

plt.scatter(mentionsVector, cosineScoreVector,  color='blue')

plt.plot(mentionsVector, np.poly1d(np.polyfit(mentionsVector, cosineScoreVector, 1))(mentionsVector))

plt.xticks(np.arange(min(mentionsVector), max(mentionsVector), 2))
plt.yticks(np.arange(0,1, 0.1))
plt.xlabel("Mentions")
plt.ylabel("Cosine similarity")

plt.show()