#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2017


# Homework 2:  Link Analysis -- PageRank + SEO

### 100 points [5% of your final grade]

### Due: Tuesday, February 21, 2017 by 11:59pm

*Goals of this homework:* Explore real-world challenges of building a graph (in this case, from tweets), implement and test PageRank over this graph, and investigate factors that impact a page's rank on Google and Bing.

*Submission Instructions:* To submit your homework, rename this notebook as YOUR_UIN_hw2.ipynb. Submit this notebook via ecampus. Your notebook should be completely self-contained, with the results visible in the notebook. 

*Late submission policy:* For this homework, you may use up to three of your late days, meaning that no submissions will be accepted after Friday, February 24, 2017 at 11:59pm.

# Part 1: PageRank (70 points)

## A Twitter-Mentioning Graph

In this assignment, we're going to adapt the classic PageRank approach to find not the most authoritative web pages, but the most significant Twitter users. So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Twitter users and their mentions of other Twitter users (so user = node, mention of another user = edge). Over this Twitter-user graph, we can apply the PageRank approach to order the users. The main idea is that a user who is mentioned by other users is more "impactful".

Here is a toy example. Suppose you are given the following four tweets:

* **userID**: diane, **text**: "@bob Howdy!"
* **userID**: charlie, **text**: "Welcome @bob and @alice!"
* **userID**: bob, **text**: "Hi @charlie and @alice!"
* **userID**: alice, **text**: "Howdy!"

There are four short tweets generated by four users. The @mentions between users form a directed graph with four nodes and five edges. E.g., the "diane" node has a directed edge to the "bob" node. Note that a retweet also contains the "@", so it should be counted as a mention as well.
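As a quick sanity check of the toy example, here is a minimal sketch that builds the graph by hand. The (author, mentions) pairs are copied from the four tweets above; the variable names are illustrative and not part of the assignment.

```python
# Toy tweets as (author, list of mentioned users), copied from the example.
toy_tweets = [
    ("diane", ["bob"]),
    ("charlie", ["bob", "alice"]),
    ("bob", ["charlie", "alice"]),
    ("alice", []),
]

nodes = set()
edges = set()  # a set keeps edges binary: repeated mentions collapse into one
for author, mentions in toy_tweets:
    nodes.add(author)
    for mentioned in mentions:
        nodes.add(mentioned)
        if mentioned != author:  # ignore self-mentions
            edges.add((author, mentioned))

print(len(nodes), len(edges))  # 4 5
```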

You should build a graph by parsing the tweets in the file we provide called *pagerank.json*.

**Notes:**

* The edges are binary and directed. If Bob mentions Alice once, in 10 tweets, or 10 times in one tweet, there is an edge from Bob to Alice, but there is not an edge from Alice to Bob.
* If a user mentions herself, ignore it.
* Correctly parsing @mentions in a tweet is error-prone. Use the entities field.
* Later you will need to implement the PageRank algorithm on the graph you build here.
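The notes above can be sketched as a small helper. This is only one possible shape, assuming each tweet is a dict in the Twitter API layout with the author under `user.id` and the mentions under `entities.user_mentions`; the function name `mention_edges` and the two sample tweets are made up for illustration.

```python
import json

def mention_edges(tweets):
    """Return the set of directed (author_id, mentioned_id) edges."""
    edges = set()
    for tweet in tweets:
        author = tweet["user"]["id"]
        # Read mentions from the entities field instead of parsing the text.
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            if mention["id"] != author:  # ignore self-mentions
                edges.add((author, mention["id"]))
    return edges

# Hypothetical input: one JSON-encoded tweet per line.
lines = [
    '{"user": {"id": 1}, "entities": {"user_mentions": [{"id": 2}, {"id": 2}, {"id": 1}]}}',
    '{"user": {"id": 1}, "entities": {"user_mentions": [{"id": 2}]}}',
]
edges = mention_edges(json.loads(line) for line in lines)
print(edges)  # {(1, 2)}
```

User 1 mentions user 2 three times across two tweets and mentions herself once, yet the result holds a single edge from 1 to 2.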


In [None]:
# Here define your function for building the graph by parsing the input file of tweets
# Insert as many cells as you want



In [None]:
# Call your function to print out the size of the graph, i.e., the number of nodes and edges
# How you maintain the graph is totally up to you
# However, if you encounter any memory issues, we recommend you write the graph into a file, and load it later.



We will not check the correctness of your graph. However, this will affect the PageRank results later.

## PageRank Implementation

Your program will return the top 10 users with the highest PageRank scores. The **output** should look like:

* user1 - score1
* user2 - score2
* ...
* user10 - score10

You should follow these **rules**:

* Assume all nodes start out with equal probability.
* The probability of the random surfer teleporting is 0.1 (that is, the damping factor is 0.9).
* If a user is never mentioned and does not mention anyone, their PageRank score should be zero. Do not include the user in the calculation.
* It is up to you to decide when to terminate the PageRank calculation.
* There are PageRank implementations out there on the web. Remember, your code should be **your own**.
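To make the rules concrete, here is a minimal power-iteration sketch run on the toy graph from earlier. It is an illustration under stated assumptions, not a reference solution: the name `pagerank_sketch`, the tolerance, and the iteration cap are arbitrary, and the rank held by users who mention no one is spread evenly over all nodes, which is one common convention that the rules above do not prescribe.

```python
def pagerank_sketch(out_links, damping=0.9, tol=1e-10, max_iter=1000):
    """Power iteration over a dict {node: set of nodes it links to}."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # equal starting probability

    for _ in range(max_iter):
        # Assumption: rank of nodes without out-links is shared by everyone.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        # Stop when the total change between iterations is tiny.
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

toy = {
    "diane": {"bob"},
    "charlie": {"bob", "alice"},
    "bob": {"charlie", "alice"},
    "alice": set(),
}
scores = pagerank_sketch(toy)
for user, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(user, "-", round(score, 4))
```

With this convention the scores always sum to 1, so no final normalization is needed; a different treatment of users without out-links would give different numbers.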


**Hints**:
* If you're using the matrix style approach, you should use [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
* Scipy is built on top of Numpy and has support for sparse matrices. You most likely will not need to use Scipy unless you'd like to try out their sparse matrices.
* If you choose to use Numpy (and Scipy), please make sure your Anaconda environment includes their latest versions (Numpy 1.12.0; Scipy 0.18.1).
* Test your parsing and PageRank calculations using a handful of tweets, before moving on to the entire file we provide.
* We will evaluate the user ranks you provide as well as the quality of your code. So make sure that your code is clear and readable.

What is the termination condition in your PageRank implementation? Describe it below:

*ADD YOUR INPUT HERE*

In [110]:
# Here add your code to implement a function called PageRanker
# Insert as many cells as you want

# def PageRanker(...):
#    ...
import json
import collections
from urllib2 import urlopen  # Python 2 module

url = "https://piazza.com/class_profile/get_resource/ixkk1fy863r1vs/iyvvbjeguc579f"

# Download the tweet file; each line is parsed as one JSON-encoded tweet.
response = urlopen(url)
data = response.read()
final_data = [s.strip() for s in data.splitlines()]

# nodes[u]: users that u mentions (out-links).
# inlink_graph[u]: users that mention u (in-links).
nodes = collections.defaultdict(set)
inlink_graph = collections.defaultdict(set)

for x in final_data:
    user = json.loads(x)
    user_id = user['user']['id']
    for user_mention in user['entities']['user_mentions']:
        mentioned_id = user_mention['id']
        # Make sure both endpoints exist as keys, even if no edge is added.
        if mentioned_id not in inlink_graph:
            inlink_graph[mentioned_id] = set()
        if user_id not in inlink_graph:
            inlink_graph[user_id] = set()
        if mentioned_id not in nodes:
            nodes[mentioned_id] = set()
        # Skip self-mentions; the sets keep the edges binary.
        if user_id != mentioned_id:
            inlink_graph[mentioned_id].add(user_id)
            nodes[user_id].add(mentioned_id)





16430
16430
24256


In [None]:
What is the termination condition in your PageRank implementation? Describe it below:
    
In each iteration I compute, for every user, the absolute difference between the PageRank value from the current iteration and the value from the previous iteration. If none of these differences exceeds the threshold (e^-10, about 4.5e-5), I terminate the loop. On this graph it terminates after 27 iterations.
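
The stopping rule described above can be isolated as a small predicate. This is a sketch only: the function name `has_converged` and the sample scores are invented for illustration, and the default threshold mirrors the e^-10 mentioned in the answer.

```python
from math import exp

def has_converged(prev, curr, limit=exp(-10)):
    """True when no user's score moved by more than `limit`."""
    return max(abs(curr[k] - prev[k]) for k in curr) <= limit

prev = {"a": 0.50, "b": 0.50}
print(has_converged(prev, {"a": 0.50001, "b": 0.49999}))  # True: changes of about 1e-5
print(has_converged(prev, {"a": 0.60, "b": 0.40}))        # False: changes of about 0.1
```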

In [113]:
# Now let's call your function on the graph you've built. Output the results.
from math import exp

d = 0.9           # damping factor (teleport probability is 0.1)
limit = exp(-10)  # convergence threshold e^-10, about 4.5e-5

# Every node starts with the same probability.
prev_iter = collections.defaultdict(float)
curr_iter = collections.defaultdict(float)
epsilon = collections.defaultdict(float)
for key in nodes:
    prev_iter[key] = 1.0/len(nodes)
    epsilon[key] = 0.0

iterate = True
count = 0
while iterate:
    count = count + 1
    for key in inlink_graph:
        # Sum the rank that each in-neighbor x passes along its out-links.
        curr_iter[key] = 0.0
        for x in inlink_graph[key]:
            curr_iter[key] += prev_iter[x]/len(nodes[x])
        curr_iter[key] = d*curr_iter[key] + (1.0-d)/len(nodes)

    # Keep iterating while any score changed by more than the threshold.
    for key in curr_iter:
        epsilon[key] = abs(curr_iter[key] - prev_iter[key])
    iterate = max(epsilon.values()) > limit

    for key in curr_iter:
        prev_iter[key] = curr_iter[key]

# Users who mention no one pass no rank along, so the scores may no longer
# sum to 1; rescale them.
sum_norm = 0.0
for key in curr_iter:
    sum_norm += curr_iter[key]
for key in curr_iter:
    curr_iter[key] = curr_iter[key]/sum_norm

# Use a separate name for the ranking so the damping factor d is not overwritten.
ranked = sorted(curr_iter.items(), key=lambda x: x[1], reverse=True)
print count
for i in ranked[:10]:
    print "user: ", i[0], " ranking: ", i[1]




27
user:  158314798  ranking:  0.00768846307411
user:  181561712  ranking:  0.00649836567142
user:  209708391  ranking:  0.00605142684996
user:  72064417  ranking:  0.00551595167197
user:  105119490  ranking:  0.00506482436337
user:  14268057  ranking:  0.0046613218593
user:  379961664  ranking:  0.00440795384186
user:  391037985  ranking:  0.00424082839959
user:  153074065  ranking:  0.00423748902304
user:  313525912  ranking:  0.00407948900579


# Part 2: Search Engine Optimization (30 + 5 points)

For this part, your goal is to put on your "[search engine optimization](https://en.wikipedia.org/wiki/Search_engine_optimization)" hat. Your job is to create a webpage that scores highest for the query: **awcv9kjlh scwrlkjf4e** --- two terms, lower case, no quotes. As of today (Feb 7, 2017), there are no hits for this query on either Google or Bing. Based on our discussions of search engine ranking algorithms, you know that several factors may impact a page's rank. Your goal is to use this knowledge to promote your own page to the top of the list.

What we're doing here is a form of [SEO contest](https://en.wikipedia.org/wiki/SEO_contest). While you have great latitude in how you approach this problem, you are not allowed to engage in any unethical or illegal behavior. Please read the discussion of "white hat" versus "black hat" SEO over at [Wikipedia](https://en.wikipedia.org/wiki/Search_engine_optimization).


**Rules of the game:**

* Somewhere in the page (possibly in the non-viewable source html) you must include your name or some other way for us to identify you (e.g., your NetID, but not the UIN!).
* Your target page may only be a TAMU student page, a page on your own webserver, a page on a standard blog platform (e.g., wordpress), or some other primarily user-controlled page.
* Your target page CAN NOT be a twitter account, a facebook page, a Yahoo Answers or similar page
* No wikipedia vandalism
* No yahoo/wiki answers questions
* No comment spamming of blogs
* If you have concerns/questions/clarifications, please post on Piazza and we will discuss

For your homework turnin for this part, you should provide us the URL of your target page and a brief discussion (2-4 paragraphs) of the strategies you are using. We will issue the query and check the rankings at some undetermined time in the next couple of weeks. You might guess that major search engines take some time to discover and integrate new pages: if I were you, I'd get a target page up immediately.

**Grading:**

* 5 points for providing a valid URL
* 20 points for a well-reasoned discussion of your strategy
* 5 points for your page appearing in the search results on Google or Bing (regardless of its rank)

**Bonus:**
* 1 point for your page appearing in the top-20 on Google or Bing
* 1 more point for your page appearing in the top-10 on Google or Bing
* 1 more point for your page appearing in the top-5 on Google or Bing
* 2 more points for your page being ranked first by Google or Bing. And, a vigorous announcement in class, and a high-five for having the top result!

What's the URL of your page?

http://people.tamu.edu/~savinay

What's your strategy? (2-4 paragraphs)

As we know from the PageRank algorithm, the more inlinks a page has, the better it tends to rank in Google search. So this is what I did: I created my page at http://people.tamu.edu/~savinay and also created other webpages hosted on GitHub. I then linked those other pages to my main webpage (http://people.tamu.edu/~savinay). I also included the keyword "awcv9kjlh scwrlkjf4e" on my webpage a number of times to raise its term frequency, while keeping the sentences meaningful. In addition, I put the keyword in the page title, which I think gives it good visibility on Google.

I published a post on Craigslist and tried to link my main webpage to it. This helped increase my visibility on Google. I have also created a Google Maps page (submitting my website as a business) that links to my website, but it has yet to be approved by Google. I also maintain a sketch recognition blog from a previous course I took at TAMU, and I created a post about "awcv9kjlh scwrlkjf4e" on that blog. The blog links to my webpage, which helped increase its inlinks and hence its PageRank. Finally, I have asked my classmates to link to my webpage and asked many of my friends to visit it.

I also looked up the latest search trends on Google and included those terms on my website. This should make the page visible to more users and thus help increase my PageRank.
