Link Analysis -- HITS + SEO


*Goals* Explore real-world challenges of building a graph (in this case, from tweets), implement and test HITS algortihm over this graph, and investigate factors that impact a page's rank on Google and Bing.

# HITS 

## A re-Tweet Graph

In this assignment, we're going to adapt the classic HITS approach to allow us to find not the most authoritative web pages, but rather to find significant Twitter users. So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Twitter users and their retweets of other Twitter users (so user = node, retweet of another user = edge). Over this Twitter-user graph, we can apply the HITS approach to order the users by their hub-ness and their authority-ness.

Here is a toy example. Suppose you are given the following four retweets:

* **userID**: diane, **text**: "RT ", **sourceID**: bob
* **userID**: charlie, **text**: "RT Welcome", **sourceID**: alice
* **userID**: bob, **text**: "RT Hi ", **sourceID**: diane
* **userID**: alice, **text**: "RT Howdy!", **sourceID**: parisa

There are four short tweets retweeted by four users. The retweet between users form a directed graph with five nodes and four edges. E.g., the "diane" node has a directed edge to the "bob" node.

You should build a graph by parsing the tweets in the file we provide called *HITS.json*.

**Notes:**

* You may see some weird characters in the content of tweets, just ignore them. 
* The edges are weighted and directed. If Bob retweets Alice's tweets 10 times, there is an edge from Bob to Alice with weight 10, but there is not an edge from Alice to Bob.
* If a user retweets herself, ignore it.
* Correctly parsing screen_name in a tweet is error-prone. Use the id of the user (this is the user who is re-tweeting) and the id of the user in the retweeted_status field (this is the user who is being re-tweeted; that is, this user created the original tweet).
* Later you will need to implement the HITS algorithm on the graph you build here.


In [2]:
import numpy as np
from numpy import dot, array
import scipy as sp
import json
import collections
from scipy import sparse
import networkx as nx
file = 'F:/SEM-2/IR/HW_2/HITS.json'
d = collections.defaultdict(dict)
names = collections.defaultdict(dict)
user_dict = dict()
result = {}
words = []
lines = []
labels = []

count = 0
print'started'
with open(file) as f:
    for line in f:
        json_data = json.loads(line.lower())
        user = json_data['user']['id']
        source = json_data['retweeted_status']['user']['id']
        if not user in names.keys():
            names[user] = count
            count+=1
        if not source in names.keys():
            names[source] = count
            count+=1
        #result.setdefault(user, []).append(source)
        if user in d.keys():
            if source in d[user].keys():
                d[user][source]+=1;
            else:
                d[user][source] = 1;
        else:
            d[user]={}
            d[user][source] = 1;
#print len(names)

#for key in d.keys():
    #for key1 in d[key].keys():
        #print key,"--",key1, "--", d[key][key1]
        
#for name in names.keys():
    #print name,"--",names[name]    
new_graph = nx.DiGraph()

for user in names.keys():
    for frm in d[user].keys():
        new_graph.add_edge(names[user], names[frm],weight=d[user][frm])
adjacency_matrix = nx.adjacency_matrix(new_graph)

matrix = sparse.csr_matrix(adjacency_matrix)
#print matrix
print "number of unique users = ",(len(names))


        
       

started
number of unique users =  1003


## HITS Implementation

This program will return the top 10 users with highest hub and authority scores. 

Hub Scores

* user1 - score1
* user2 - score2
* ...
* user10 - score10

Authority Scores

* user1 - score1
* user2 - score2
* ...
* user10 - score10


In [10]:
# your code here
from operator import itemgetter, attrgetter
hubs_in = np.transpose(np.ones(matrix.shape[0]))
auths_in = np.transpose(np.ones(matrix.shape[1]))

iterations = 100

auths_f = np.array
hubs_f = np.array
for i in range(iterations):
    auths_f = sparse.csr_matrix.dot(sparse.csr_matrix.transpose(matrix).tocsr(),hubs_in)
    hubs_f = sparse.csr_matrix.dot(matrix,auths_f)
    auths_in = auths_f/max(auths_f)
    hubs_in = hubs_f/max(hubs_f)
    
auths_f = auths_f/max(auths_f)
hubs_f = hubs_f/max(hubs_f)

hub_score = dict()
auth_score = dict()

for key, value in names.iteritems():
        hub_score[key] = hubs_f[value]
        auth_score[key] = auths_f[value]

print "######### after",iterations," iterations ###############" 

print "----------------hubs-------------------------"

for key, value in sorted(hub_score.iteritems(), key=lambda (k,v): (v,k),reverse = True)[0:20]:
    print "%s: %s" % (key, value)

print "----------------auths-------------------------"

for key, value in sorted(auth_score.iteritems(), key=lambda (k,v): (v,k),reverse = True)[0:20]:
    print "%s: %s" % (key, value)


######### after 100  iterations ###############
----------------hubs-------------------------
3068706044: 1.0
3093940760: 0.475287782598
2194518394: 0.417051522555
2862783698: 0.325115722927
3092183276: 0.27365327709
3029724797: 0.267990465997
2990704188: 0.237086961228
3001500121: 0.232422074305
3086921438: 0.207265056826
3042686360: 0.201026535312
3092935664: 0.19709589533
3021183212: 0.194054207202
3118683560: 0.182011802877
3084868798: 0.163743008103
2935948649: 0.161077825823
3089225044: 0.161014033197
3064218544: 0.142877265727
3091417449: 0.134306033739
3059435226: 0.126920093865
3092863895: 0.126497493583
----------------auths-------------------------
3042570996: 1.0
3065514742: 0.905608035937
1638625987: 0.81511290402
3077733683: 0.526217876188
3039321886: 0.411904770114
3077695572: 0.223794722843
3019659587: 0.207900590021
1358345766: 0.179998267529
3061155846: 0.172568907917
3092580049: 0.172006968736
571198546: 0.149751447319
3068694151: 0.137856674825
3058933933: 0.1315643