#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2018


# Homework 2:  Link Analysis -- HITS + SEO

### 100 points [5% of your final grade]

### Due: Sunday, February 25, 2018 by 11:59pm

*Goals of this homework:* Explore real-world challenges of building a graph (in this case, from tweets), implement and test the HITS algorithm over this graph, and investigate factors that impact a page's rank on Google and Bing.

*Submission Instructions:* To submit your homework, rename this notebook as YOUR_UIN_hw2.ipynb. Submit this notebook via ecampus. Your notebook should be completely self-contained, with the results visible in the notebook. 

*Late submission policy:* For this homework, you may use up to three of your late days, meaning that no submissions will be accepted after Wednesday, February 28, 2018 at 11:59pm.

# Part 1: HITS (70 points)

## A re-Tweet Graph

In this assignment, we're going to adapt the classic HITS approach to allow us to find not the most authoritative web pages, but rather to find significant Twitter users. So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Twitter users and their retweets of other Twitter users (so user = node, retweet of another user = edge). Over this Twitter-user graph, we can apply the HITS approach to order the users by their hub-ness and their authority-ness.

Here is a toy example. Suppose you are given the following four retweets:

* **userID**: diane, **text**: "RT ", **sourceID**: bob
* **userID**: charlie, **text**: "RT Welcome", **sourceID**: alice
* **userID**: bob, **text**: "RT Hi ", **sourceID**: diane
* **userID**: alice, **text**: "RT Howdy!", **sourceID**: parisa

There are four short tweets retweeted by four users. The retweets between users form a directed graph with five nodes and four edges. E.g., the "diane" node has a directed edge to the "bob" node.
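As a minimal sketch of the toy example (not the required solution), the four retweets can be turned into a weighted, directed edge map. The tuple layout and the helper name `build_toy_graph` are assumptions made for illustration only.

```python
from collections import defaultdict

# (retweeting user, original author) pairs from the toy example
toy_retweets = [
    ("diane", "bob"),
    ("charlie", "alice"),
    ("bob", "diane"),
    ("alice", "parisa"),
]

def build_toy_graph(retweets):
    """Return (nodes, edges) where edges[u][v] counts retweets of v by u."""
    edges = defaultdict(lambda: defaultdict(int))
    nodes = set()
    for retweeter, author in retweets:
        if retweeter == author:  # self-retweets are ignored
            continue
        nodes.update((retweeter, author))
        edges[retweeter][author] += 1
    return nodes, edges

nodes, edges = build_toy_graph(toy_retweets)
print(sorted(nodes))  # five users
print(sum(len(targets) for targets in edges.values()))  # four edges
```

Each retweet contributes one unit of weight to the edge from the retweeting user to the original author, so repeated retweets of the same author simply raise that edge's weight.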

You should build a graph by parsing the tweets in the file we provide called *HITS.json*.

**Notes:**

* You may see some weird characters in the content of tweets; just ignore them.
* The edges are weighted and directed. If Bob retweets Alice's tweets 10 times, there is an edge from Bob to Alice with weight 10, but there is not an edge from Alice to Bob.
* If a user retweets herself, ignore it.
* Correctly parsing screen_name in a tweet is error-prone. Use the id of the user (this is the user who is re-tweeting) and the id of the user in the retweeted_status field (this is the user who is being re-tweeted; that is, this user created the original tweet).
* Later you will need to implement the HITS algorithm on the graph you build here.
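The notes above can be sketched as a small parser over JSON lines. This is only an illustration under stated assumptions: the function name `parse_retweet_edges` and the sample records are made up, and skipping records that lack a `retweeted_status` field is an assumption about how non-retweets should be handled.

```python
import json
from collections import defaultdict

def parse_retweet_edges(lines):
    """Build edges[retweeter_id][author_id] = retweet count from JSON lines."""
    edges = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        original = tweet.get("retweeted_status")
        if original is None:        # assumption: not a retweet, so no edge
            continue
        retweeter = tweet["user"]["id"]
        author = original["user"]["id"]
        if retweeter == author:     # ignore self-retweets
            continue
        edges[retweeter][author] += 1
    return edges

# Hypothetical sample records, made up for illustration only
sample = [
    '{"user": {"id": 1}, "retweeted_status": {"user": {"id": 2}}}',
    '{"user": {"id": 1}, "retweeted_status": {"user": {"id": 2}}}',
    '{"user": {"id": 3}, "retweeted_status": {"user": {"id": 3}}}',
    '{"user": {"id": 2}, "text": "not a retweet"}',
]
edges = parse_retweet_edges(sample)
print({u: dict(vs) for u, vs in edges.items()})  # {1: {2: 2}}
```

Using numeric user ids rather than screen names follows the note about screen-name parsing being error-prone.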


In [4]:
import json
import copy
import operator
from collections import defaultdict
nodes=0
edges=0
graph=defaultdict(dict)

def addEdge(u,v):
    #count one more retweet of v by u (edge u -> v)
    graph[u][v]=graph[u].get(v,0)+1
    #make sure v is a node even if it never retweets anyone
    graph.setdefault(v,{})
        
#parse json file
with open('HITS.json') as data_file:
    for line in data_file:
        data=json.loads(line)
        startID=data['user']['id']  #id of the user who is retweeting
        endID=data['retweeted_status']['user']['id']    #id of the user who wrote the original tweet
        if startID==endID:
            continue
        addEdge(startID, endID)

def mapping(L):
    mapp={}
    j=0
    for i in L:
        mapp[i]=j
        j+=1
    return mapp

def get_matrix(L):
    #dense adjacency matrix: M[i][j] = number of times user i retweeted user j
    keys=sorted(L.keys())
    size=len(keys)
    M = [ [0]*size for i in range(size) ]
    mapped=mapping(keys)
    for k1 in keys:
        #only existing edges are filled in; every other cell stays 0
        for k2,weight in L[k1].items():
            M[mapped[k1]][mapped[k2]]=weight
    return (M,mapped)

graph_matrix,mapped=get_matrix(graph)
print ('One row in Adjacency Matrix looks as below')
print (graph_matrix[1000])

One row in Adjacency Matrix looks as below
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

We will not check the correctness of your graph. However, this will affect the HITS results later.

## HITS Implementation

Your program will return the top 10 users with the highest hub and authority scores. The **output** should be like:

Hub Scores

* user1 - score1
* user2 - score2
* ...
* user10 - score10

Authority Scores

* user1 - score1
* user2 - score2
* ...
* user10 - score10

You should follow these **rules**:

* Assume all nodes start out with equal scores.
* It is up to you to decide when to terminate the HITS calculation.
* There are HITS implementations out there on the web. Remember, your code should be **your own**.
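One possible way to satisfy these rules, shown as a hedged sketch rather than a reference solution: start every node at the same score and stop once the scores change by less than a tolerance. The function name `hits`, the parameters `tol` and `max_iter`, and the L1 change test are all illustrative assumptions; the assignment leaves the stopping rule open.

```python
import math

def hits(edges, tol=1e-10, max_iter=1000):
    """Iterate HITS on edges[u][v] = weight until scores stop changing."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}    # equal starting scores
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # authority of v: weighted sum of hub scores of users retweeting v
        new_auth = {n: 0.0 for n in nodes}
        for u, targets in edges.items():
            for v, w in targets.items():
                new_auth[v] += w * hub[u]
        # hub of u: weighted sum of authority scores of users u retweets
        new_hub = {n: 0.0 for n in nodes}
        for u, targets in edges.items():
            for v, w in targets.items():
                new_hub[u] += w * new_auth[v]
        # L2-normalise so the scores stay bounded
        a_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        new_auth = {n: x / a_norm for n, x in new_auth.items()}
        new_hub = {n: x / h_norm for n, x in new_hub.items()}
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                     for n in nodes)
        hub, auth = new_hub, new_auth
        if change < tol:             # scores have stabilised
            break
    return hub, auth

toy = {"diane": {"bob": 1}, "charlie": {"alice": 1},
       "bob": {"diane": 1}, "alice": {"parisa": 1}}
hub, auth = hits(toy)
top_auth = sorted(auth.items(), key=lambda kv: (-kv[1], kv[0]))[:10]
print(top_auth)
```

A fixed iteration count is an equally valid choice; the tolerance-based rule just avoids guessing how many rounds a given graph needs.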


**Hints**:
* If you're using the matrix style approach, you should use [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
* Scipy is built on top of Numpy and has support for sparse matrices. You most likely will not need to use Scipy unless you'd like to try out their sparse matrices.
* If you choose to use Numpy (and Scipy), please make sure your Anaconda environment includes their latest versions.
* Test your parsing and HITS calculations using a handful of tweets, before moving on to the entire file we provide.
* We will evaluate the user ranks you provide as well as the quality of your code. So make sure that your code is clear and readable.

In [5]:
import numpy as np
A=np.asmatrix(graph_matrix)
AT=A.transpose()

class Styling:
    bold='\033[1m'
    underline='\033[4m'
    end='\033[0m'
    
#Hub weight vector = u
u=np.asmatrix(np.ones(A.shape[0]))  #Initialization: every node starts with score 1
u=u.transpose()

def normalize(x):
    return x/np.linalg.norm(x)



for i in range(50):    
    #authority weight vector = v
    v=AT*u
    #updated hub weight vector u
    u=A*v
    #new authorities and hubs after normalization
    v= normalize(v)
    u= normalize(u)

v_new=sorted(np.squeeze(np.asarray(v)), reverse=True)[:10]
u_new=sorted(np.squeeze(np.asarray(u)), reverse=True)[:10]
v_old=np.squeeze(np.asarray(v)).tolist()
u_old=np.squeeze(np.asarray(u)).tolist()

def print_s(x_new,x_old,string):
    print (Styling.bold+Styling.underline+('Top 10 '+string+' Scores').center(40)+Styling.end)
    print (Styling.bold+'User ID'.center(23)+ 'Score'.center(15)+Styling.end)
    #invert the user ID -> matrix index mapping once instead of scanning it per row
    users=dict((index,user) for user,index in mapped.items())
    k=1
    for i in x_new:
        index=x_old.index(i)   #note: tied scores resolve to the first matching index
        print ("#"+(str(k)+"   "+str(users[index])+"  -  "+str(i)).center(40))
        k+=1
    print (' ')
        
print_s(v_new,v_old,'Authority')
print_s(u_new,u_old,'Hub')

[1m[4m        Top 10 Authority Scores         [0m
[1m        User ID             Score     [0m
# 1   3042570996  -  0.5446247950416776  
# 2   3065514742  -  0.49321658111827393 
# 3   1638625987  -  0.44393069173916605 
# 4   3077733683  -  0.2865913014593139  
# 5   3039321886  -  0.22433354905516348 
# 6   3077695572  -  0.12188415371411263 
# 7   3019659587  -  0.11322781692143306 
# 8   1358345766  -  0.09803159885348592 
# 9   3061155846  -  0.09398530644913716 
#10   3092580049  -  0.09367926027613178 
 
[1m[4m           Top 10 Hub Scores            [0m
[1m        User ID             Score     [0m
# 1   3068706044  -  0.6231185894024612  
# 2   3093940760  -  0.29616065505007777 
# 3   2194518394  -  0.2598725578781697  
# 4   2862783698  -  0.20258565013065213 
# 5   3092183276  -  0.1705184445462124  
# 6   3029724797  -  0.1669898419144692  
# 7   2990704188  -  0.14773330437849239 
# 8   3001500121  -  0.1448265399708642  
# 9   3086921438  -  0.12915071105216777 


# Part 2: Search Engine Optimization (30 + 5 points)

For this part, your goal is to put on your "[search engine optimization](https://en.wikipedia.org/wiki/Search_engine_optimization)" hat. Your job is to create a webpage that scores highest for the query: **kbeznak parmatonic** --- two terms, lower case, no quotes. As of today (Feb 16, 2018), there are no hits for this query on either Google or Bing. Based on our discussions of search engine ranking algorithms, you know that several factors may impact a page's rank. Your goal is to use this knowledge to promote your own page to the top of the list.

What we're doing here is a form of [SEO contest](https://en.wikipedia.org/wiki/SEO_contest). While you have great latitude in how you approach this problem, you are not allowed to engage in any unethical or illegal behavior. Please read the discussion of "white hat" versus "black hat" SEO over at [Wikipedia](https://en.wikipedia.org/wiki/Search_engine_optimization#White_hat_versus_black_hat_techniques).


**Rules of the game:**

* Somewhere in the page (possibly in the non-viewable source html) you must include your name or some other way for us to identify you (e.g., your NetID, but not the UIN!).
* Your target page may only be a TAMU student page, a page on your own webserver, a page on a standard blog platform (e.g., WordPress), or some other primarily user-controlled page.
* Your target page CANNOT be a Twitter account, a Facebook page, or a Yahoo Answers or similar page.
* No Wikipedia vandalism.
* No Yahoo/Wiki Answers questions.
* No comment spamming of blogs.
* If you have concerns/questions/clarifications, please post on Piazza and we will discuss.

For your homework turnin for this part, you should provide us the URL of your target page and a brief discussion (2-4 paragraphs) of the strategies you are using. We will issue the query and check the rankings at some undetermined time in the next couple of weeks. You might guess that major search engines take some time to discover and integrate new pages: if I were you, I'd get a target page up immediately.

**Grading:**

* 5 points for providing a valid URL
* 20 points for a well-reasoned discussion of your strategy
* 5 points for your page appearing in the search results of Google or Bing (regardless of its rank)

**Bonus:**
* 1 point for your page appearing in the top-20 on Google or Bing
* 1 more point for your page appearing in the top-10 on Google or Bing
* 1 more point for your page appearing in the top-5 on Google or Bing
* 2 more points for your page being ranked first by Google or Bing. And, a vigorous announcement in class, and a high-five for having the top result!

What's the URL of your page?


What's your strategy? (2-4 paragraphs)


In [None]:
URL = https://shivapk.github.io/Kbeznak-parmatonic/

The following strategies helped my page reach the first page of search results (rank 5) as of 24 Feb 2018:
1. Created a personal page on the GitHub domain. This domain has a high reputation, which the personal page inherits. This adds to the authority factors and helps with indexing.
2. Embedded the string "awcv9kjlh scwrlkjf4e" at various places in the content of the page. This adds to the content relevance for indexing purposes.
3. Added meta tags in the HTML <head> with the keywords "Kbeznak" and "Parmatonic" to improve matches.
4. Created additional pages on a TAMU student page and on WordPress, each providing inlinks to the main page. This ensures that the main page has enough inlinks to raise its page rank value.
5. Used "awcv9kjlh scwrlkjf4e" as the anchor text of the hyperlink to my page. Anchor text is a good way to improve indexing.
6. Added links to Google Trends, famous videos, YouTube trends, and new trending topics to make the page somewhat relevant to an unknown user and also to boost indexing.
7. Included important and commonly used keywords in order to appear in the same group of search results as other students' pages.
8. Registered my website with Google Webmaster for indexing.
9. Used the Google Webmaster Data Highlighter to help Google better understand the different sections of my website.
10. Added the keyword "kbeznak parmatonic" as a heading in the page's HTML, in case Google prioritizes matches between search keywords and headings.
11. Used a custom domain so that the page's address contains a name matching the search keyword.