# Finding Similar Users on Twitter

In this project we try to find Twitter users who are similar to a given set of base users. The objective is to create a database of similar users around a topic.

## Methodology

We rely mainly on Twitter lists to find similar users. The general process is:

1. Determine the base users. The similar-user database is built from these users, so it is important to choose users that are related to a common topic.
2. Get the lists that the base users are members of.
3. Extract the important attributes of the lists.
4. Find the common lists, i.e. the lists that all base users are members of.
5. Eliminate some lists according to their subscriber_count and member_count.
6. Get the members of the common lists.
7. Delete duplicate users and extract the important attributes of each user.
8. From the resulting similar-user list, identify accounts that belong to big companies rather than individuals.
9. Eliminate the common lists that include any of the identified big-company accounts.
10. Finally, print out the members of the remaining common lists.


### 0. Importing required libraries and Twitter API initializations

We use a Python library called birdy to access the Twitter API: https://github.com/inueni/birdy

To use birdy, each entry of the key list must contain consumer_key, consumer_secret, access_token_key and access_token_secret, in that order. Configuring more than one key set is recommended, to work around API rate limits.
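
The `key` list in the cell below is left empty and has to be filled in before anything else runs. As a minimal sketch of one way to build it, the snippet below parses credentials from JSON text. The helper name `parse_keys`, the JSON layout and the placeholder values are assumptions for illustration, not part of the original project.

```python
import json

# Order in which the notebook passes the values to UserClient.
FIELDS = ("consumer_key", "consumer_secret",
          "access_token_key", "access_token_secret")

def parse_keys(text):
    """Turn a JSON array of credential objects into a list of 4-tuples."""
    keys = []
    for entry in json.loads(text):
        missing = [f for f in FIELDS if f not in entry]
        if missing:
            raise ValueError("missing fields: " + ", ".join(missing))
        keys.append(tuple(entry[f] for f in FIELDS))
    return keys

# Hypothetical content with placeholder values (not real credentials).
example = '''[
  {"consumer_key": "CK1", "consumer_secret": "CS1",
   "access_token_key": "AT1", "access_token_secret": "AS1"},
  {"consumer_key": "CK2", "consumer_secret": "CS2",
   "access_token_key": "AT2", "access_token_secret": "AS2"}
]'''

key = parse_keys(example)
print(len(key), key[0][0])
```

Each resulting entry can then be indexed as `key[i][0]` to `key[i][3]`, which is how the cells below use it.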

In [2]:
import json
import sys
import datetime
import re
import os
import time
from birdy.twitter import UserClient, BirdyException 
import numpy as np
from time import sleep
import pandas as pd

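# One entry per key set, in the order:
# [consumer_key, consumer_secret, access_token_key, access_token_secret]
# Must be filled in before running the cells below.
key = []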

client = UserClient(key[0][0], key[0][1], key[0][2], key[0][3])

### 1. Determining Base Users

The similar-user database is built from these users, so it is important to choose users that are related to a common topic. Choosing users who are members of too many lists (generally users with more than 500k followers) can cause the Twitter API to stop responding, so keep this in mind when choosing base users. Users with many followers are highly likely to appear in the base users' lists anyway.

In [46]:
users = ['karpathy','AndrewYNg','drfeifei','AlecRad','KirkDBorne', 'hmason', 'hadleywickham', ]

# List preferences
minSubscriber = 0
maxMember = 500

# User preferences
minFollower = 1000
minTweets = 500

### 2. Getting Base Users' Lists

We get the lists that the base users are members of, using the **"GET lists/memberships"** call. We cycle through the API keys to get past Rate Limit errors (HTTP 429) and call sleep(15) (wait 15 seconds) after Over Capacity errors (HTTP 503).

We store all the lists in **userSubs** list.

In [10]:
userSubs = []

keyInd = 0  # index of the key set currently in use; advanced on rate-limit errors
client = UserClient(key[keyInd][0], key[keyInd][1], key[keyInd][2], key[keyInd][3])

for user in users:
    print(user)
    protec = False
    sub = []
    
    while(True):
        try:
            response = client.api.lists.memberships.get(screen_name=user, count=500, cursor=-1)
            break
        except BirdyException as err:  # other exceptions have no status_code
            print(err.status_code)
            print(err)
            if err.status_code == 429:
                sleep(60)
                keyInd = (keyInd + 1)%len(key)
            elif err.status_code == 404:
                protec = True
                break
            else:
                sleep(15)
            
            client = UserClient(key[keyInd][0], key[keyInd][1], key[keyInd][2], key[keyInd][3])
            
    if protec:
        userSubs.append([])
        print('protected!')
        continue
    ncur = response.data['next_cursor']
    for s in response.data['lists']:
        sub.append(s)
    
    while(ncur != 0):
        while(True):
            try:
                response = client.api.lists.memberships.get(screen_name=user, count=500, cursor=ncur)
                break
            except BirdyException as err:  # other exceptions have no status_code
                print(err.status_code)
                print(err)
                if err.status_code == 429:
                    sleep(60)
                    keyInd = (keyInd + 1)%len(key)
                else:
                    sleep(15)
                client = UserClient(key[keyInd][0], key[keyInd][1], key[keyInd][2], key[keyInd][3])
        
        ncur = response.data['next_cursor']
        for s in response.data['lists']:
            sub.append(s)
            
    userSubs.append(sub)

karpathy
AndrewYNg
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1546956732105377633&screen_name=AndrewYNg)
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1510900425542963537&screen_name=AndrewYNg)
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1510900425542963537&screen_name=AndrewYNg)
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1510900425542963537&screen_name=AndrewYNg)
drfeifei
AlecRad
KirkDBorne
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1552010180382441804&screen_name=KirkDBorne)
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1541498369619755184&screen_name=KirkDBorne)
503
Over capacity (GET https://api.twitter.com/1.1/lists/memberships.json?count=500&cursor=1541498369619755184&screen_name=KirkDBorne)
503
Over cap
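
The cell above repeats the same cursor-following logic for the first page and for every later page. The pattern can be sketched on its own as below. `collect_cursored` and the `fetch_page` callable are hypothetical names, and the fake pages stand in for real API responses so that the snippet runs without network access.

```python
def collect_cursored(fetch_page, items_key, first_cursor=-1):
    """Follow next_cursor values until the API reports cursor 0."""
    items = []
    cursor = first_cursor
    while cursor != 0:
        page = fetch_page(cursor)        # dict with items_key and 'next_cursor'
        items.extend(page[items_key])
        cursor = page['next_cursor']
    return items

# Stand-in for the API: three fake pages keyed by the cursor that requests them.
FAKE_PAGES = {
    -1: {'lists': [{'id': 1}, {'id': 2}], 'next_cursor': 10},
    10: {'lists': [{'id': 3}], 'next_cursor': 20},
    20: {'lists': [], 'next_cursor': 0},
}

all_lists = collect_cursored(lambda c: FAKE_PAGES[c], 'lists')
print([li['id'] for li in all_lists])
```

With birdy, `fetch_page` could wrap the same call used in the notebook, for example `lambda c: client.api.lists.memberships.get(screen_name=user, count=500, cursor=c).data`, with the retry handling placed inside that wrapper.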

### 3. Extract important specifications of the lists

The extracted specifications with examples:
* **name:** "Digital Marketing"
* **slug:** "digital-marketing"
* **id:** 49260625
* **full_name:** "@pointcg/digital-marketing"
* **subscriber_count:** 1
* **member_count:** 46

**userLists** list holds the lists with specs.

In [11]:
# 0. "name": "Digital Marketing"
# 1. "slug": "digital-marketing"
# 2. "id": 49260625
# 3. "full_name": "@pointcg/digital-marketing"
# 4. "subscriber_count": 1
# 5. "member_count": 46

userLists = []

for userSub in userSubs:
    ul = []
    for li in userSub:
        ul.append((li['name'], li['slug'], str(li['id']), li['full_name'], li['subscriber_count'], li['member_count']))
        
    userLists.append(ul)
    
print(userLists[0][5])

('My AI', 'my-ai', '862370344073560064', '@intellification/my-ai', 0, 8)
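
Later steps read these tuples by position (for example `li[4]` for subscriber_count). As a sketch of an alternative that keeps positional access but also gives each field a name, a `namedtuple` could be used. `ListSpec` and `to_spec` are hypothetical names, and the extra `mode` field in the sample is only there to show that unused fields are ignored.

```python
from collections import namedtuple

# Same six fields, in the same order as the plain tuples in the notebook.
ListSpec = namedtuple('ListSpec', ['name', 'slug', 'id', 'full_name',
                                   'subscriber_count', 'member_count'])

def to_spec(raw):
    """Pick the six fields out of one raw list object from the API."""
    return ListSpec(raw['name'], raw['slug'], str(raw['id']), raw['full_name'],
                    raw['subscriber_count'], raw['member_count'])

# Sample object built from the example values shown above.
raw = {'name': 'Digital Marketing', 'slug': 'digital-marketing',
       'id': 49260625, 'full_name': '@pointcg/digital-marketing',
       'subscriber_count': 1, 'member_count': 46, 'mode': 'public'}

spec = to_spec(raw)
print(spec.member_count, spec[5])
```

Because a namedtuple is still a tuple, code that indexes by position keeps working unchanged.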


### 4. Find Common Lists

We find the lists that all base users are members of and store them in the **commonLists** list.

In [12]:
#commonLists = []

#for li in userLists[0]:
#    if li in userLists[1]:
#        commonLists.append(li)

commonLists = list(userLists[0])

for cL in commonLists[:]:
    for uL in userLists[1:]:
        if cL not in uL:
            commonLists.remove(cL)
            break

print("Number of common lists: " + str(len(commonLists)))

Number of common lists: 11
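
The cell above compares whole tuples, so a list only counts as common if every field matches for every base user. Since the counts in a tuple could differ if they change between two API calls, one hedged alternative is to intersect on the list id alone. `common_by_id` is a hypothetical helper and the sample tuples are made up for illustration.

```python
def common_by_id(user_lists):
    """user_lists holds one list of spec tuples per base user; index 2 is the list id."""
    if not user_lists:
        return []
    shared_ids = {spec[2] for spec in user_lists[0]}
    for specs in user_lists[1:]:
        shared_ids &= {spec[2] for spec in specs}
    # Keep the first user's tuples and ordering for the shared ids.
    return [spec for spec in user_lists[0] if spec[2] in shared_ids]

# Made-up data: list '2' is shared, but its member_count differs by one.
a = [('ml', 'ml', '1', '@x/ml', 4, 408), ('data', 'data', '2', '@y/data', 0, 218)]
b = [('data', 'data', '2', '@y/data', 0, 219), ('ai', 'ai', '3', '@z/ai', 2, 10)]

print(common_by_id([a, b]))
```

Set intersection also avoids scanning every user's list once per candidate, which matters when base users belong to thousands of lists.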


### 5. Eliminate some lists according to the subscriber_count and member_count of the lists from Common Lists

First, we sort the common lists by subscriber_count to make them easier to inspect. Then we keep the lists that have at least minSubscriber subscribers (0 here) and fewer than maxMember members (500 here). These thresholds are experimental. The result is stored in the **mostCommons** list.

In [19]:
commonLists = sorted(commonLists,key=lambda x: x[4], reverse=True)

mostCommons = []

totalMember = 0
for li in commonLists:
    # Keep lists with subscriber_count >= minSubscriber and member_count < maxMember
    if li[4] >= minSubscriber and li[5] < maxMember:
        totalMember = totalMember + li[5]
        mostCommons.append(li)
        
df = pd.DataFrame(columns=('Name', 'Slug', 'ID', 'Fullname', 'Subscribers', 'Members'))
pd.options.display.float_format = '{:,.0f}'.format
for i in range(len(mostCommons)):
    df.loc[i] = mostCommons[i]

print(df)

print()
print("Number of common lists after elimination: " + str(len(mostCommons)))
print("Number of members in lists: " + str(totalMember)) 

                    Name                 Slug                  ID  \
0       machine learning     machine-learning           205125841   
1  AI & machine learning  ai-machine-learning           231045220   
2              analytics            analytics  820723315467833344   
3                   data                 data  742832583449382912   

                         Fullname  Subscribers  Members  
0    @inancgumus/machine-learning            4      408  
1  @voxmenthe/ai-machine-learning            2      352  
2            @_mokhtar_/analytics            1      314  
3                 @AlanJumpi/data            0      218  

Number of common lists after elimination: 4
Number of members in lists: 1292
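
Because the thresholds are experimental, it can help to see how the result changes with them. The sketch below wraps the same filter in a function; `filter_lists` is a hypothetical name, the first four rows are taken from the output above, and the fifth row is made up to show a list being rejected.

```python
def filter_lists(specs, min_subscriber=0, max_member=500):
    """Keep lists within the thresholds, sorted by subscriber_count (descending)."""
    kept = [s for s in specs if s[4] >= min_subscriber and s[5] < max_member]
    kept.sort(key=lambda s: s[4], reverse=True)
    return kept

specs = [
    ('machine learning', 'machine-learning', '205125841', '@inancgumus/machine-learning', 4, 408),
    ('AI & machine learning', 'ai-machine-learning', '231045220', '@voxmenthe/ai-machine-learning', 2, 352),
    ('analytics', 'analytics', '820723315467833344', '@_mokhtar_/analytics', 1, 314),
    ('data', 'data', '742832583449382912', '@AlanJumpi/data', 0, 218),
    ('big list', 'big-list', '0', '@someone/big-list', 9, 600),  # made-up row
]

default_kept = filter_lists(specs)
strict_kept = filter_lists(specs, max_member=300)
print(len(default_kept), sum(s[5] for s in default_kept))
print([s[0] for s in strict_kept])
```

With the notebook's thresholds the four real lists survive; lowering `max_member` to 300 would leave only the `data` list.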


### 6. Get members of the common lists

Here we use the **"GET lists/members"** call to obtain the users of each list. We again cycle through the API keys to get past Rate Limit errors (HTTP 429) and call sleep(15) (wait 15 seconds) after Over Capacity errors (HTTP 503).

We store all the users in **similarUsers** list.

In [20]:
client = UserClient(key[keyInd][0], key[keyInd][1], key[keyInd][2], key[keyInd][3])

similarUsers = []

for li in mostCommons:
    print(li)
    sims = []
    
    while(True):
        try:
            response = client.api.lists.members.get(list_id=li[2], count=1000, cursor=-1)
            break
        except BirdyException as err:  # other exceptions have no status_code
            print(err.status_code)
            print(err)
            if err.status_code == 429:
                sleep(60)
                keyInd = (keyInd + 1)%len(key)
            else:
                sleep(15)
            client = UserClient(key[keyInd][0], key[keyInd][1], key[keyInd][2], key[keyInd][3])
            #response = client.api.lists.members.get(list_id=li[2], count=1000, cursor=-1)
    
    
    ncur = response.data['next_cursor']
    for s in response.data['users']:
        sims.append(s)
    
    while(ncur != 0):
        while(True):                
            try:
                response = client.api.lists.members.get(list_id=li[2], count=1000, cursor=ncur)
                break
            except BirdyException as err:  # other exceptions have no status_code
                print(err.status_code)
                print(err)
                if err.status_code == 429:
                    sleep(60)
                    keyInd = (keyInd + 1)%len(key)
                else:
                    sleep(15)
                client = UserClient(key[keyInd][0], key[keyInd][1], key[keyInd][2], key[keyInd][3])
                #response = client.api.lists.members.get(list_id=li[2], count=1000, cursor=ncur)
        
        
        ncur = response.data['next_cursor']
        for s in response.data['users']:
            sims.append(s)
            
    similarUsers.append(sims)

('machine learning', 'machine-learning', '205125841', '@inancgumus/machine-learning', 4, 408)
('AI & machine learning', 'ai-machine-learning', '231045220', '@voxmenthe/ai-machine-learning', 2, 352)
('analytics', 'analytics', '820723315467833344', '@_mokhtar_/analytics', 1, 314)
('data', 'data', '742832583449382912', '@AlanJumpi/data', 0, 218)
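
The retry-and-rotate logic appears four times in this notebook. Below is a sketch of how it could be factored into one helper. It deliberately differs from the cells above: attempts are bounded and unknown errors are re-raised instead of retried forever. `call_with_rotation`, `make_call` and `FakeError` are hypothetical names, and the fake call replaces the real API so the snippet runs offline.

```python
import time

def call_with_rotation(make_call, n_keys, start=0, max_attempts=10, sleep=time.sleep):
    """Run make_call(key_index); rotate keys on 429, wait on 503.

    Returns (result, key_index_used).
    """
    key_index = start
    for _ in range(max_attempts):
        try:
            return make_call(key_index), key_index
        except Exception as err:
            status = getattr(err, 'status_code', None)
            if status == 429:            # rate limit: wait, then try the next key
                sleep(60)
                key_index = (key_index + 1) % n_keys
            elif status == 503:          # over capacity: wait and retry the same key
                sleep(15)
            else:
                raise
    raise RuntimeError("giving up after %d attempts" % max_attempts)

class FakeError(Exception):
    """Stand-in for an API error that carries a status_code."""
    def __init__(self, status_code):
        super().__init__("status %d" % status_code)
        self.status_code = status_code

outcomes = iter([FakeError(429), FakeError(503), 'ok'])

def fake_call(key_index):
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item

# sleep is replaced by a no-op so the demo does not actually wait.
result, used = call_with_rotation(fake_call, n_keys=3, sleep=lambda s: None)
print(result, used)
```

In the notebook, `make_call` would create a `UserClient` from `key[key_index]` and issue the request, so the client is rebuilt whenever the key changes.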


### 7. Delete duplicate users and extract important information of users.

Important specifications of users:
* **id_str:** ID of the user
* **screen_name:** Screen name of the user (@screen_name)
* **followers_count:** # Followers
* **friends_count:** # Following
* **favourites_count:** # Likes
* **listed_count:** Total number of list subscription and membership (?)
* **statuses_count:** # Tweets
* **verified:** True or False
* **protected:** True or False / if true, the account can't be crawled
* **created_at:** Creation time of the account / (2009-10-30 12:11:39)

**similars** list holds the users.

In [21]:
# 0. id_str				: ID of the user
# 1. screen_name		: Screen name of the user (@screen_name)
# 2. followers_count	: # Followers
# 3. friends_count		: # Following
# 4. favourites_count	: # Likes
# 5. listed_count		: Total number of list subscription and membership (?)
# 6. statuses_count		: # Tweets
# 7. verified			: True or False 
# 8. protected			: True or False / if true can't crawl the account
# 9. created_at			: Creation time of the account / (2009-10-30 12:11:39)

similars = []
uNames = []
for sus in similarUsers:
    for su in sus:
        if su['screen_name'] not in uNames:
            uNames.append(su['screen_name'])
            similars.append((su['id_str'], su['screen_name'], su['followers_count'], su['friends_count'],
                          su['favourites_count'], su['listed_count'], su['statuses_count'], su['verified'], 
                          su['protected'], su['created_at']))
            
print("Number of unique users: " + str(len(similars)))

Number of unique users: 1072
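
The cell above checks each screen name against a growing Python list, which scans the whole list for every user. A dictionary gives the same first-seen-wins behaviour with constant-time lookups. This sketch keys on `id_str` instead of `screen_name`, which is a change from the notebook; `unique_users` is a hypothetical helper and the sample users are made up.

```python
def unique_users(groups):
    """Flatten groups of user dicts, keeping the first occurrence of each id_str."""
    seen = {}
    for group in groups:
        for user in group:
            seen.setdefault(user['id_str'], user)
    return list(seen.values())   # dicts preserve insertion order (Python 3.7+)

groups = [
    [{'id_str': '1', 'screen_name': 'a'}, {'id_str': '2', 'screen_name': 'b'}],
    [{'id_str': '2', 'screen_name': 'b'}, {'id_str': '3', 'screen_name': 'c'}],
]

print([u['screen_name'] for u in unique_users(groups)])
```

Keying on the id is a design choice: screen names can be changed by their owners, while the numeric id stays the same.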


### 8. From the obtained similar user list, determine accounts that are big companies rather than individuals.

First we sort the similar users by followers_count, then inspect them by hand to identify accounts that belong to big companies rather than individuals. The identified screen names go into the **badUsers** list in the second cell below.

Only the top 20 accounts are printed here.

In [47]:
sortedSimilars = sorted(similars,key=lambda x: x[2], reverse=True)

chosens = []

for s in sortedSimilars:
    if s[2] < minFollower:
        break
    if s[6] > minTweets and s[2] > s[3] and s[8] == False:
        chosens.append(s)
        
df = pd.DataFrame(columns=('ID', 'Name', 'Followers', 'Friends', 'Favourites', 'Listed', 'Statuses', 'Verified', 'Protected', 'Created_at'))
pd.options.display.float_format = '{:,.0f}'.format
for i in range(min(20, len(chosens))):  # avoid IndexError when fewer than 20 users remain
    df.loc[i] = chosens[i]

print(len(chosens))
print(df)


567
            ID             Name  Followers  Friends  Favourites  Listed  \
0     16017475    NateSilver538  2,249,956      985         132  27,834   
1     33838201   googleresearch  1,107,765       19           0  12,474   
2     51263711  googleanalytics  1,006,549      404       2,755  18,292   
3     34181507  dez_blanchfield    754,646      472      14,339   1,519   
4   1526228120      TwitterData    735,709       10          17   4,195   
5     20280065      HansRosling    372,773      172         177   6,158   
6     18080585          MongoDB    271,918    5,480       2,354   4,864   
7    259725229       ValaAfshar    189,276      440           2   9,165   
8     15662446          avinash    182,407       88          14  11,432   
9     14174897   analyticbridge    160,818    4,377       5,895   5,946   
10   267283568       IBMBigData    153,333    2,148       4,458   4,370   
11   198483889        dr_morton    150,534   95,326      61,733   6,523   
12    54645160       

In [48]:
goodLists = []
badUsers = []
#badUsers = ['cnnbrk', 'nytimes', 'CNN', 'BBCBreaking', 'TheEconomist', 'BBCWorld', 'Reuters', 'FoxNews', 'TIME', 'WSJ',
#            'Forbes', 'ABC', 'HuffPost', 'washingtonpost']

for i in range(len(similarUsers)):
    bad = False
    for su in similarUsers[i]:
        if su['screen_name'] in badUsers:
            bad = True
            break
    if not bad:
        goodLists.append(i)

print("Number of remaining lists after elimination: " + str(len(goodLists)))
#print(goodLists)

Number of remaining lists after elimination: 4
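
The membership test above can also be written with sets. The sketch below additionally lowercases names, since Twitter screen names are not case-sensitive; the notebook's comparison is case-sensitive, so this is a behavioural change and not a restatement. `lists_without` is a hypothetical helper and the sample data is made up.

```python
def lists_without(similar_users, bad_users):
    """Return indices of member lists that contain none of the bad screen names."""
    bad = {name.lower() for name in bad_users}
    good = []
    for index, members in enumerate(similar_users):
        names = {m['screen_name'].lower() for m in members}
        if names.isdisjoint(bad):
            good.append(index)
    return good

similar_users_demo = [
    [{'screen_name': 'karpathy'}, {'screen_name': 'CNN'}],
    [{'screen_name': 'hmason'}],
]

print(lists_without(similar_users_demo, ['cnn']))
print(lists_without(similar_users_demo, []))
```

With an empty `bad_users` list, as in the cell above, every list is kept.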


### 9. Eliminate lists that include determined big company accounts from the common lists.

We keep only the lists indexed by **goodLists**, i.e. the common lists that contain none of the accounts in **badUsers**, and store their members in **similarUsers2**.

In [49]:
similarUsers2 = []

totalMember = 0

for i in goodLists:
    if mostCommons[i][4] >= minSubscriber and mostCommons[i][5] < maxMember:
        totalMember = totalMember + mostCommons[i][5]
        similarUsers2.append(similarUsers[i])

df = pd.DataFrame(columns=('Name', 'Slug', 'ID', 'Fullname', 'Subscribers', 'Members'))
pd.options.display.float_format = '{:,.0f}'.format
for row, i in enumerate(goodLists):
    # i indexes mostCommons; row is the position in the printed table
    if mostCommons[i][4] >= minSubscriber and mostCommons[i][5] < maxMember:
        df.loc[row] = mostCommons[i]

print(df)
        
print()
print("Number of common lists after elimination: " + str(len(similarUsers2)))
print("Number of members in lists: " + str(totalMember))

                    Name                 Slug                  ID  \
0       machine learning     machine-learning           205125841   
1  AI & machine learning  ai-machine-learning           231045220   
2              analytics            analytics  820723315467833344   
3                   data                 data  742832583449382912   

                         Fullname  Subscribers  Members  
0    @inancgumus/machine-learning            4      408  
1  @voxmenthe/ai-machine-learning            2      352  
2            @_mokhtar_/analytics            1      314  
3                 @AlanJumpi/data            0      218  

Number of common lists after elimination: 4
Number of members in lists: 1292


We collect the members of the remaining common lists and remove duplicates again. The final similar users are stored in **similars2**.

In [50]:
# 0. id_str				: ID of the user
# 1. screen_name		: Screen name of the user (@screen_name)
# 2. followers_count	: # Followers
# 3. friends_count		: # Following
# 4. favourites_count	: # Likes
# 5. listed_count		: Total number of list subscription and membership (?)
# 6. statuses_count		: # Tweets
# 7. verified			: True or False 
# 8. protected			: True or False / if true can't crawl the account
# 9. created_at			: Creation time of the account / (2009-10-30 12:11:39)

similars2 = []
uNames2 = []
for sus in similarUsers2:
    for su in sus:
        if su['screen_name'] not in uNames2:
            uNames2.append(su['screen_name'])
            similars2.append((su['id_str'], su['screen_name'], su['followers_count'], su['friends_count'],
                          su['favourites_count'], su['listed_count'], su['statuses_count'], su['verified'], 
                          su['protected'], su['created_at']))
            

print("Number of unique users: " + str(len(similars2)))

Number of unique users: 1072


### 10. Print out members of lastly obtained common lists.

Finally, we print the similar users that we obtained. We use a simple filter that keeps users with a followers_count of at least minFollower (1000), a statuses_count above minTweets (500), more followers than friends, and an unprotected account. Again, only the top 20 accounts are printed here. All users are written to **"SimilarUsers.txt"**. The first line of **"SimilarUsers.txt"** contains the base users' screen names, followed by the number of remaining lists, one line per list, and then one line per user.

In [60]:
lastSimilars = []

sortedSimilars2 = sorted(similars2,key=lambda x: x[2], reverse=True)

f = open("SimilarUsers.txt", 'w', encoding='utf-8')

f.write(users[0])
for u in users[1:]:
    f.write("," + u)
f.write("\n")

f.write(str(len(goodLists)))
f.write("\n")
for i in range(len(goodLists)):
    f.write(str(mostCommons[goodLists[i]][0])+','+str(mostCommons[goodLists[i]][1])+','+str(mostCommons[goodLists[i]][2])+','
          +str(mostCommons[goodLists[i]][3])+','+str(mostCommons[goodLists[i]][4])+','+str(mostCommons[goodLists[i]][5]))
    f.write("\n")

for s in sortedSimilars2:
    if s[2] < minFollower:
        break
    if s[6] > minTweets and s[2] > s[3] and s[8] == False:
        lastSimilars.append(s)
        f.write(s[0] + ',' + s[1] + ',' + str(s[2]) + ',' + str(s[3]) + ',' + str(s[4]) + ',' + str(s[5]) + ',' + str(s[6])
                + ',' + str(s[7]) + ',' + str(s[8]) + ',' + str(s[9]))
        f.write("\n")

f.close()

print("Number of similar users: " + str(len(lastSimilars)))
print()

df = pd.DataFrame(columns=('ID', 'Name', 'Followers', 'Friends', 'Favourites', 'Listed', 'Statuses', 'Verified', 'Protected', 'Created_at'))
pd.options.display.float_format = '{:,.0f}'.format
for i in range(min(20, len(lastSimilars))):  # avoid IndexError when fewer than 20 users remain
    df.loc[i] = lastSimilars[i]

print(df)

Number of similar users: 567

            ID             Name  Followers  Friends  Favourites  Listed  \
0     16017475    NateSilver538  2,249,956      985         132  27,834   
1     33838201   googleresearch  1,107,765       19           0  12,474   
2     51263711  googleanalytics  1,006,549      404       2,755  18,292   
3     34181507  dez_blanchfield    754,646      472      14,339   1,519   
4   1526228120      TwitterData    735,709       10          17   4,195   
5     20280065      HansRosling    372,773      172         177   6,158   
6     18080585          MongoDB    271,918    5,480       2,354   4,864   
7    259725229       ValaAfshar    189,276      440           2   9,165   
8     15662446          avinash    182,407       88          14  11,432   
9     14174897   analyticbridge    160,818    4,377       5,895   5,946   
10   267283568       IBMBigData    153,333    2,148       4,458   4,370   
11   198483889        dr_morton    150,534   95,326      61,733   6,52
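
The file-writing code above joins fields with commas by hand, which would break the column layout if a list name itself contained a comma. As a sketch, the same file layout could be produced with the standard `csv` module, which quotes such values. `write_similar_users` is a hypothetical helper, and the list and user rows below are made-up sample data.

```python
import csv
import io

def write_similar_users(out, base_users, list_specs, users):
    """Write base users, list count, one row per list, then one row per user."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(base_users)
    writer.writerow([len(list_specs)])
    writer.writerows(list_specs)
    writer.writerows(users)

# Made-up sample rows; the list name contains a comma on purpose.
base_users = ['karpathy', 'AndrewYNg']
list_specs = [('AI, ML', 'ai-ml', '1', '@x/ai-ml', 2, 10)]
sample_users = [('1', 'example_user', 2000, 100, 5, 10, 600, False, False,
                 '2009-10-30 12:11:39')]

buffer = io.StringIO()
write_similar_users(buffer, base_users, list_specs, sample_users)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the list name as a single field.
rows = list(csv.reader(io.StringIO(text)))
print(rows[2][0])
```

In the notebook, `out` would be the file object, ideally opened with `newline=''` inside a `with` block so that it is closed even if writing fails.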