# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using a sufficiently recent version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# PageRank [39 points]

In this notebook, you'll implement the [PageRank algorithm](http://ilpubs.stanford.edu:8090/422/) summarized in class. You'll test it on a real dataset (circa 2005) that consists of [political blogs](http://networkdata.ics.uci.edu/data/polblogs/) and their links among one another.

For today's notebook, you'll need to download the following additional materials:
* A SQLite version of the political blogs dataset: http://cse6040.gatech.edu/datasets/poliblogs.db (~ 611 KiB)

In [None]:
# Some modules you'll need
from IPython.display import display
import numpy as np
import scipy.sparse as sp
import scipy.io as spio
import cse6040utils

## Part 1: Explore the Dataset

Let's start by looking at the dataset, to get a feel for what it contains.

In this part, try to rely primarily on SQL queries to accomplish each task. This scenario is appropriate if the database is so large that you cannot expect to load it all into memory.

Incidentally, one of you asked recently how to get the schema for a SQLite database when using Python. Here is some code adapted from a few ideas floating around on the web. Let's use it to inspect the tables available in the political blogs dataset.

In [None]:
import sqlite3 as db
import pandas as pd

def get_table_names (conn):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    query = "select name from sqlite_master where type='table'"
    return pd.read_sql_query (query, conn)

def print_schemas (conn, table_names=None, limit=0):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    if table_names is None:
        # get_table_names returns a DataFrame; iterate over its 'name' column
        table_names = get_table_names (conn)['name']
    c = conn.cursor ()
    query = "pragma table_info ({table})"
    for name in table_names:
        c.execute (query.format (table=name))
        columns = c.fetchall ()
        print ("=== {table} ===".format (table=name))
        col_string = "[{id}] {name} : {type}"
        for col in columns:
            print (col_string.format (id=col[0],
                                      name=col[1],
                                      type=col[2]))
        print ("\n")

In [None]:
conn = db.connect ('poliblogs.db')

for name in get_table_names (conn)['name']:
    print_schemas (conn, [name])
    query = '''select * from %s limit 5''' % name
    display (pd.read_sql_query (query, conn))
    print ("\n")

**Exercise 1** (3 points). Write a snippet of code to verify that the vertex IDs are _dense_ in some interval $[1, n]$. That is, there is a minimum value of $1$, some maximum value $n$, and _no_ missing values between $1$ and $n$.

Also store the number of vertices $n$ in a variable named `num_vertices`.
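
One way to think about the density check, sketched on a toy in-memory table rather than the real dataset: for integer IDs, the values fill $[1, n]$ exactly when the minimum is $1$ and the number of distinct values equals the maximum. The table name `Nodes` and its contents are hypothetical stand-ins.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE Nodes (Id INTEGER)")
toy.executemany("INSERT INTO Nodes VALUES (?)", [(1,), (2,), (3,), (4,)])

lo, hi, distinct = toy.execute(
    "SELECT MIN(Id), MAX(Id), COUNT(DISTINCT Id) FROM Nodes"
).fetchone()

# Integer IDs are dense in [1, n] when min == 1 and the count of
# distinct values equals the maximum value.
is_dense = (lo == 1) and (distinct == hi)
print(lo, hi, distinct, is_dense)
```

Letting SQL compute the three aggregates means only one row comes back to Python, which fits the "database too large for memory" scenario described above.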

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert num_vertices == 1490

**Exercise 2** (3 points). Verify that every edge has both of its endpoints in the vertex table.
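
As a rough illustration of this kind of referential check, the sketch below counts edges whose source or target is missing from the vertex table. It runs on hypothetical toy tables (`Nodes`, `Links`) that deliberately include one dangling edge; it is one possible approach, not the expected solution.

```python
import sqlite3

# Hypothetical toy tables; the edge (3, 9) points at a vertex that does not exist.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE Nodes (Id INTEGER);
    CREATE TABLE Links (Source INTEGER, Target INTEGER);
    INSERT INTO Nodes VALUES (1), (2), (3);
    INSERT INTO Links VALUES (1, 2), (2, 3), (3, 9);
""")

# Count edges with at least one endpoint absent from Nodes.
bad_edges = toy.execute("""
    SELECT COUNT(*) FROM Links
     WHERE Source NOT IN (SELECT Id FROM Nodes)
        OR Target NOT IN (SELECT Id FROM Nodes)
""").fetchone()[0]
print(bad_edges)
```

A count of zero would indicate that every endpoint is accounted for.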

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
query = '''
  select min (Source), max (Source), min (Target), max (Target) from Edges
'''
pd.read_sql_query (query, conn)

**Exercise 3** (2 points). Determine which vertices have no outgoing edges. Store the result in a Pandas `DataFrame` named `df_deadends` with a single column named `Id`.
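
A set difference is one natural way to express "vertices that never appear as a source." The sketch below uses SQL's `EXCEPT` on hypothetical toy tables (`Nodes`, `Links`); the same query string could be handed to `pd.read_sql_query`, as elsewhere in this notebook, to obtain a data frame.

```python
import sqlite3

# Hypothetical toy graph: vertices 3 and 4 have no outgoing edges.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE Nodes (Id INTEGER);
    CREATE TABLE Links (Source INTEGER, Target INTEGER);
    INSERT INTO Nodes VALUES (1), (2), (3), (4);
    INSERT INTO Links VALUES (1, 2), (1, 3), (2, 3);
""")

# All vertex IDs, minus those that occur as the source of some edge.
query = """
    SELECT Id FROM Nodes
    EXCEPT
    SELECT Source FROM Links
    ORDER BY Id
"""
no_out = [row[0] for row in toy.execute(query)]
print(no_out)
```

Replacing `Source` with `Target` in the second `SELECT` would yield the vertices with no incoming edges instead.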

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

display (df_deadends.head ())
display (df_deadends.tail ())

In [None]:
print ("\n==> %d vertices have no outgoing edges." % len (df_deadends))

df_deadends_soln = pd.read_csv ('df_deadends_soln.csv')
assert cse6040utils.tibbles_are_equivalent (df_deadends, df_deadends_soln)

print ("\n(Passed.)")

**Exercise 4** (2 points). Determine which vertices have no incoming edges. Store the result in a Pandas `DataFrame` called `df_nolove` having just a single column named `Id` to hold the corresponding vertex IDs.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

display (df_nolove.head ())
display (df_nolove.tail ())

In [None]:
print ("\n==> %d vertices have no incoming edges." % len (df_nolove))

df_nolove_soln = pd.read_csv ('df_nolove_soln.csv')
assert cse6040utils.tibbles_are_equivalent (df_nolove, df_nolove_soln)

print ("\n(Passed.)")

**Exercise 5** (3 points). Compute an [SQL view](https://www.sqlite.org/lang_createview.html) called `Outdegrees`, which contains the following columns:

1. `Id`: vertex ID
2. `Degree`: the out-degree of the corresponding vertex.

To help you check your view, the test code selects from your view but adds the `Url` and `Leaning` fields, ordering the results in descending order of degree. It also prints the first few and last few rows of this query, so you can inspect the URLs as a sanity check. (Perhaps it also provides a small bit of entertainment!)
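
For readers new to views, here is a minimal sketch of creating one from a grouped count. The table `Links` and the view name `ToyOutdegrees` are hypothetical, and whether zero-degree vertices or duplicate edges need special handling in the real dataset is left as an open question.

```python
import sqlite3

# Hypothetical toy edge list: vertex 1 has two out-links, vertex 2 has one.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE Links (Source INTEGER, Target INTEGER);
    INSERT INTO Links VALUES (1, 2), (1, 3), (2, 3);
    DROP VIEW IF EXISTS ToyOutdegrees;
    CREATE VIEW ToyOutdegrees AS
        SELECT Source AS Id, COUNT(*) AS Degree
          FROM Links
         GROUP BY Source;
""")

# A view behaves like a table in later queries.
rows = toy.execute(
    "SELECT Id, Degree FROM ToyOutdegrees ORDER BY Id"
).fetchall()
print(rows)
```

Note that vertex 3, which has no outgoing edges, does not appear in this toy view, because it never occurs as a `Source`.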

In [None]:
# Remove an existing view, if it exists:
c = conn.cursor ()
c.execute ('drop view if exists Outdegrees')

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
query = '''
  select Outdegrees.Id, Degree, Url, Leaning
    from Outdegrees, Vertices
    where Outdegrees.Id=Vertices.Id
    order by -Degree
'''
df_outdegrees = pd.read_sql_query (query, conn)
print ("==> A few entries with large out-degrees:")
display (df_outdegrees.head (10))
print ("\n==> A few entries with small out-degrees:")
display (df_outdegrees.tail ())

df_outdegrees_soln = pd.read_csv ('outdegrees_soln.csv')
assert cse6040utils.tibbles_are_equivalent (df_outdegrees, df_outdegrees_soln)

print ("\n(Passed.)")

**Exercise 6** (3 points). Compute an [SQL view](https://www.sqlite.org/lang_createview.html) called `Indegrees`, which contains the following columns:

1. `Id`: vertex ID
2. `Degree`: the in-degree of this vertex.

Your view should only include vertices with positive in-degree. (That is, if the in-degree is zero, you may leave it out of the resulting view.)

In [None]:
# Remove an existing view, if it exists:
c = conn.cursor ()
c.execute ('drop view if exists Indegrees')

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
query = '''
  select Indegrees.Id, Degree, Url, Leaning
    from Indegrees, Vertices
    where Indegrees.Id=Vertices.Id
    order by -Degree
'''
df_indegrees = pd.read_sql_query (query, conn)
print ("==> A few entries with large in-degrees:")
display (df_indegrees.head (10))
print ("\n==> A few entries with small in-degrees:")
display (df_indegrees.tail ())

df_indegrees_soln = pd.read_csv ('indegrees_soln.csv')
assert cse6040utils.tibbles_are_equivalent (df_indegrees, df_indegrees_soln)

print ("\n(Passed.)")

**Exercise 7** (5 points). Query the database to extract a report of which URLs point to which URLs, storing the result in a Pandas data frame called `df_G`. This data frame should have these columns:

- `SourceURL`: URL of a source vertex
- `SourceLeaning`: "Leaning" value of that vertex
- `TargetURL`: URL of the corresponding target vertex
- `TargetLeaning`: "Leaning" value of that vertex
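
A report like this needs the vertex table twice, once for each end of an edge. The sketch below shows that pattern with table aliases on hypothetical toy tables (`Nodes`, `Links`) and made-up URLs; it is an illustration of the join, not the reference solution.

```python
import sqlite3

# Hypothetical toy data with made-up URLs.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE Nodes (Id INTEGER, Url TEXT, Leaning TEXT);
    CREATE TABLE Links (Source INTEGER, Target INTEGER);
    INSERT INTO Nodes VALUES (1, 'a.example', 'Left'), (2, 'b.example', 'Right');
    INSERT INTO Links VALUES (1, 2), (2, 1);
""")

# Join the vertex table twice: alias S for the source, T for the target.
query = """
    SELECT S.Url AS SourceURL, S.Leaning AS SourceLeaning,
           T.Url AS TargetURL, T.Leaning AS TargetLeaning
      FROM Links AS E
      JOIN Nodes AS S ON E.Source = S.Id
      JOIN Nodes AS T ON E.Target = T.Id
     ORDER BY E.Source
"""
report = toy.execute(query).fetchall()
print(report)
```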

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from IPython.display import display
display (df_G.head ())
print ("...")
display (df_G.tail ())

df_G_soln = pd.read_csv ('df_G_soln.csv')
assert cse6040utils.tibbles_are_equivalent (df_G, df_G_soln)

## Part 2: Implement PageRank

The following exercises will walk you through a possible implementation of PageRank for this dataset.

**Exercise 8** (5 points). Build a sparse matrix, `PT`, as a Scipy CSR matrix that stores $P^T \equiv G^TD^{-1}$, where $G^T$ is the transpose of the connectivity matrix $G$, and $D^{-1}$ is the diagonal matrix of inverse out-degrees. Be sure also to do the following:

1. Recall that the database indices are 1-based; when converting to the Scipy representation, you should convert these to be 0-based.
2. To ensure that there is no "information loss," place a 1.0 at any diagonal entry where there are no outgoing edges.
3. Ensure that `PT` is square with dimension `num_vertices`.
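
To make the three requirements concrete, here is a small sketch on a hypothetical three-vertex graph (not the blog data). Entry $(j, i)$ of $P^T$ holds $1/d_i$ for each edge $i \rightarrow j$, so targets become row indices and sources become column indices. The names `edges` and `PT_toy` are illustrative.

```python
import numpy as np
import scipy.sparse as sp

n = 3
# Hypothetical 1-based (source, target) pairs; vertex 3 has no out-links.
edges = np.array([(1, 2), (1, 3), (2, 3)])
src = edges[:, 0] - 1   # convert to 0-based indices
dst = edges[:, 1] - 1

# Out-degree of each vertex, then 1/d_i for every edge leaving vertex i.
outdeg = np.bincount(src, minlength=n)
vals = 1.0 / outdeg[src]

# Vertices with no outgoing edges get a 1.0 on the diagonal.
dead = np.flatnonzero(outdeg == 0)

rows = np.concatenate([dst, dead])
cols = np.concatenate([src, dead])
data = np.concatenate([vals, np.ones(len(dead))])

PT_toy = sp.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()
col_sums = np.asarray(PT_toy.sum(axis=0)).ravel()
print(PT_toy.toarray())
print(col_sums)
```

Passing `shape=(n, n)` explicitly keeps the matrix square even if the highest-numbered vertices have no edges at all.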

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert PT.shape == (num_vertices, num_vertices)
assert len (PT.indices) == 19450

# Check that columns sum to 1.0
u = np.ones (num_vertices)
y_u = np.transpose (PT).dot (u)
assert (np.max (np.abs (y_u - u))) <= 3e-15

# Check whether `PT` matches what we expect
#assert (spio.loadmat ('PT.mat')['PT'] != PT).getnnz () == 0

print ("\n(Passed.)")

**Exercise 9** (10 points). Complete the PageRank implementation for this dataset. To keep it simple, you may take $\alpha=0.85$, $x(0)$ equal to the vector of all $1/n$ values, and 25 iterations.

> **Note.** This implementation asks you to maintain a list, `X`, that stores every `x(t)` that you compute in sequence.
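
The sketch below shows the shape of such an iteration on a hypothetical three-vertex matrix, assuming the usual damped update $x(t+1) = \alpha P^T x(t) + \frac{1-\alpha}{n} u$, where $u$ is the all-ones vector; check this against the form given in class. A small dense array stands in for the sparse matrix here, and the names `PT_toy` and `history` are illustrative.

```python
import numpy as np

alpha = 0.85
# Hypothetical column-stochastic matrix for a 3-vertex toy graph.
PT_toy = np.array([[0.0, 0.0, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.5, 1.0, 1.0]])
n = PT_toy.shape[0]

# history[t] holds the distribution x(t); start from the uniform vector.
history = [np.ones(n) / n]
for t in range(1, 25):
    x_prev = history[-1]
    # Follow a link with probability alpha, otherwise jump uniformly at random.
    x_next = alpha * PT_toy.dot(x_prev) + (1.0 - alpha) / n
    history.append(x_next)

print(history[-1], history[-1].sum())
```

Because each column of the matrix sums to one, every iterate remains a probability distribution, which is a handy sanity check while debugging.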

In [None]:
# YOUR CODE GOES BELOW. We've provided some scaffolding code,
# so you just need to complete it.

ALPHA = 0.85 # Probability of following some link
MAX_ITERS = 25
n = num_vertices

# Let X[t] store the dense vector x(t) at time t
X = []

x_0 = np.ones (n) / n # Initial distribution: 1/n at each page
X.append (x_0)

for t in range (1, MAX_ITERS):
    # Complete this implementation
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Store the final PageRank vector in a database table
# called PageRank

command = '''DROP TABLE IF EXISTS PageRank'''
c = conn.cursor ()
c.execute (command)

command = '''CREATE TABLE PageRank (Id INTEGER, Rank REAL)'''
c.execute (command)

command = '''INSERT INTO PageRank VALUES (?, ?)'''
c.executemany (command, zip (range (1, n+1), X[-1]))

# Report the top-ranked pages along with their degrees and metadata:
query = '''
  SELECT Rank, V.Id, I.Degree AS InDegree, O.Degree AS OutDegree, V.Url, V.Leaning
    FROM PageRank AS P, Vertices AS V, Indegrees AS I, Outdegrees AS O
    WHERE (P.Id = V.Id) AND (P.Id = I.Id) AND (P.Id = O.Id)
    ORDER BY -Rank
    LIMIT 10
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks)

assert df_ranks['Url'][0] == 'dailykos.com'
assert df_ranks['Url'][1] == 'atrios.blogspot.com'
assert df_ranks['Url'][2] == 'instapundit.com'
assert df_ranks['Url'][3] == 'blogsforbush.com'
assert df_ranks['Url'][4] == 'talkingpointsmemo.com'

print ("\n(Passed.)")

**Exercise 10** (3 points). The `Vertices` table includes a column called `Leaning`, which expresses a political leaning -- either "Left" or "Right". How might you use this column to come up with an alternative ranking scheme?

YOUR ANSWER HERE