# Part III - Uncertainty
## Project 3a - PageRank

[Course Link](https://cs50.harvard.edu/ai/)

[Project Instructions](https://cs50.harvard.edu/ai/projects/2/pagerank/)

## PageRank
PageRank is an algorithm created by Google’s co-founders that determines the importance of a website for search result purposes. In PageRank, a website is more important if it is linked to by other important websites, and links from less important websites are weighted less.

In this project, you’ll calculate each page’s PageRank in two ways: by sampling pages from a Markov Chain random surfer, and by iteratively applying the PageRank formula.


**Note, the code to run the project is at the bottom of the file. If you execute all cells in the notebook it should work.** 

In [26]:
# Imports, global vars, and functions used for all solutions
import os
import random
import re
import sys
import pandas as pd
import numpy as np

DAMPING = 0.85
SAMPLES = 10_000

def crawl(directory):
    """
    Parse a directory of HTML pages and check for links to other pages.
    Return a dictionary where each key is a page, and values are
    a list of all other pages in the corpus that are linked to by 
    the page.
    """
    pages = dict()

    # Extract all links from HTML files
    for filename in os.listdir(directory):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(directory, filename)) as f:
            contents = f.read()
            links = re.findall(r"<a\s+(?:[^>]*?)href=\"([^\"]*)\"", contents)
            pages[filename] = set(links) - {filename}

    # Only include links to other pages in the corpus
    for filename in pages:
        pages[filename] = set(
            link for link in pages[filename]
            if link in pages
        )

    return pages

---
### Markov Transition Model

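The project specification describes a function that accepts three arguments: corpus, page, and damping_factor.

* **corpus** is a Python dictionary mapping a page name to a set of all pages linked to by that page.
* **page** is a string representing which page the random surfer is currently on.
* **damping_factor** is a floating point number representing the damping factor to be used when generating the probabilities.

Note that the implementation below takes a `directory` name in place of `corpus` (it calls `crawl` itself) and orders its arguments as `directory, damping_factor, page`.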

The return value of the function should be a Python dictionary with one key for each page in the corpus. Each key should be mapped to a value representing the probability that a random surfer would choose that page next. The values in this returned probability distribution should sum to 1.
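As a rough illustration of the return value described above, the sketch below builds the distribution from an in-memory corpus dictionary, using the (corpus, page, damping_factor) argument order from the description. The function name `transition_from_corpus` and the three-page `toy_corpus` are assumptions made up for this example; they are not part of the project code.

```python
def transition_from_corpus(corpus, page, damping_factor):
    """Hypothetical helper: next-page distribution for a surfer on `page`."""
    n_pages = len(corpus)
    links = corpus[page]

    # A page with no outgoing links is treated as linking to every page.
    if not links:
        return {p: 1 / n_pages for p in corpus}

    # Every page shares the (1 - d) "jump anywhere" probability ...
    base = (1 - damping_factor) / n_pages
    # ... and each linked page gets an equal share of d on top of that.
    bonus = damping_factor / len(links)
    return {p: base + (bonus if p in links else 0) for p in corpus}


# Toy corpus invented for this example.
toy_corpus = {
    "a.html": {"b.html", "c.html"},
    "b.html": {"a.html"},
    "c.html": set(),
}

dist = transition_from_corpus(toy_corpus, "a.html", 0.85)
print(dist)
print(sum(dist.values()))  # should be 1, up to floating point rounding
```

Passing the corpus in directly, rather than a directory name, avoids re-reading the HTML files on every call, which matters once the function is called thousands of times during sampling.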

In [236]:
def transition_model(directory, damping_factor, page):
    corpus = crawl(directory)
    page_links = corpus[page]

    # A page with no outgoing links is treated as linking to every page
    if len(page_links) == 0:
        return {p: 1 / len(corpus) for p in corpus}

    # Probability of jumping to any page in the corpus at random
    rand_page_prob = (1 - damping_factor) / len(corpus)
    # Probability of reaching a page that the current page links to
    rand_link_prob = damping_factor / len(page_links) + rand_page_prob

    # Use a separate loop variable so the `page` argument is not overwritten
    page_probs = {}
    for p in corpus:
        if p in page_links:
            page_probs[p] = rand_link_prob
        else:
            page_probs[p] = rand_page_prob
    return page_probs

In [237]:
transition_model('corpus0', DAMPING, '2.html')

{'4.html': 0.037500000000000006,
 '1.html': 0.4625,
 '2.html': 0.037500000000000006,
 '3.html': 0.4625}

In [238]:
transition_model('corpus1', DAMPING, 'tictactoe.html')

{'search.html': 0.021428571428571432,
 'bfs.html': 0.021428571428571432,
 'dfs.html': 0.021428571428571432,
 'minimax.html': 0.4464285714285714,
 'tictactoe.html': 0.021428571428571432,
 'minesweeper.html': 0.021428571428571432,
 'games.html': 0.4464285714285714}

In [239]:
transition_model('corpus2', DAMPING, 'recursion.html')

{'ai.html': 0.125,
 'c.html': 0.125,
 'logic.html': 0.125,
 'algorithms.html': 0.125,
 'inference.html': 0.125,
 'python.html': 0.125,
 'recursion.html': 0.125,
 'programming.html': 0.125}

---
## Random Surfer Instructions
One way to think about PageRank is with the random surfer model, which considers the behavior of a hypothetical surfer on the internet who clicks on links at random.

**One way to interpret this model is as a Markov Chain, where each page represents a state, and each page has a transition model that chooses among its links at random.** At each time step, the state switches to one of the pages linked to by the current state.

**By sampling states randomly from the Markov Chain, we can get an estimate for each page’s PageRank. We can start by choosing a page at random, then keep following links at random, keeping track of how many times we’ve visited each page. After we’ve gathered all of our samples (based on a number we choose in advance), the proportion of the time we were on each page might be an estimate for that page’s rank.**

To ensure we can always get to somewhere else in the corpus of web pages, we’ll introduce to our model a damping factor d. With probability d (where d is usually set around 0.85), **the random surfer will choose from one of the links on the current page at random. But otherwise (with probability 1 - d), the random surfer chooses one out of all of the pages in the corpus at random (including the one they are currently on).**

Our random surfer now starts by choosing a page at random, and then, for each additional sample we’d like to generate, chooses a link from the current page at random with probability d, and chooses any page at random with probability 1 - d. If we keep track of how many times each page has shown up as a sample, we can treat the proportion of states that were on a given page as its PageRank.
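The sampling procedure described above can be sketched with only the standard library. This is a minimal illustration rather than the project implementation: the function name `surf`, its `seed` parameter, and the `toy_corpus` dictionary are all assumptions introduced here, and a page with no links is assumed to jump to any page at random.

```python
import random


def surf(corpus, damping_factor, n, seed=None):
    """Hypothetical sketch: estimate PageRank from n random-surfer samples."""
    rng = random.Random(seed)
    pages = list(corpus)
    counts = {p: 0 for p in pages}

    # First sample: any page, chosen uniformly at random.
    current = rng.choice(pages)
    counts[current] += 1

    for _ in range(n - 1):
        links = sorted(corpus[current])
        if links and rng.random() < damping_factor:
            # With probability d, follow one of the current page's links.
            current = rng.choice(links)
        else:
            # Otherwise (or if the page has no links), jump to any page.
            current = rng.choice(pages)
        counts[current] += 1

    # The share of samples spent on each page is its estimated rank.
    return {p: counts[p] / n for p in pages}


# Toy corpus invented for this example.
toy_corpus = {
    "a.html": {"b.html", "c.html"},
    "b.html": {"a.html"},
    "c.html": {"a.html"},
}

estimates = surf(toy_corpus, 0.85, 10_000, seed=0)
for page in sorted(estimates):
    print(f"{page}: {estimates[page]:.4f}")
```

Because every page starts with a count of zero, pages that are never visited still appear in the result with rank 0, and the estimates always sum to 1.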

### Random Surfer Implementation

In [52]:
def get_rand_sample(directory, corpus, damping_factor, prior_sample):
    if prior_sample is None:
        # Start with a page chosen uniformly at random
        return random.choice(list(corpus.keys()))

    # Use transition model probabilities for the next sample
    tm = transition_model(directory, damping_factor, prior_sample)
    # Take pages and probabilities from the same dict so they stay aligned
    pages = list(tm.keys())
    probs = list(tm.values())
    return np.random.choice(pages, 1, p=probs).item()

def sample_pagerank(directory, damping_factor, n):
    corpus = crawl(directory)
    m_chain = []

    # Create Markov Chain of pages from n random samples
    # (use the n argument rather than the global SAMPLES)
    for i in range(n):
        prior = m_chain[-1] if m_chain else None
        sample = get_rand_sample(directory, corpus, damping_factor, prior)
        m_chain.append(sample)

    page_counts = pd.Series(m_chain).value_counts().to_dict()
    # Pages that were never sampled get a rank of 0
    page_ranks = {page: page_counts.get(page, 0) / n for page in corpus}

    return page_ranks

---
## Iterative Algorithm Instructions

We can also define a page’s PageRank using a recursive mathematical expression. 

In this model we let:
* PR(p) = the PageRank of a given page p (the probability that a random surfer ends up on that page)

There are two ways a random surfer could end up on the page:

1. With probability 1 - d, the surfer chose a page at random and ended up on page p
2. With probability d, the surfer followed a link from page i to page p

The first condition is fairly straightforward to express mathematically: it’s 1 - d divided by N, where N is the total number of pages across the entire corpus. This is because the 1 - d probability of choosing a page at random is split evenly among all N possible pages.

For the second condition, we need to consider each possible page i that links to page p. For each of those incoming pages, let NumLinks(i) be the number of links on page i. Each page i that links to p has its own PageRank, PR(i), representing the probability that we are on page i at any given time. And since from page i we travel to any of that page’s links with equal probability, we divide PR(i) by the number of links NumLinks(i) to get the probability that we were on page i and chose the link to page p.

Thus the definition for PR(p) is:

$$ PR(p) = \frac{1-d}{N} + d \sum_{i} \frac{PR(i)}{\text{NumLinks}(i)} $$

In this formula, d is the damping factor, N is the total number of pages in the corpus, i ranges over all pages that link to page p, and NumLinks(i) is the number of links present on page i.
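To make the formula concrete, here is one update step worked by hand for a made-up three-page corpus (an assumption for illustration, not one of the project corpora): A links to B, B links to A and C, and C links to A. With d = 0.85, N = 3, and every rank starting at 1/3:

$$
\begin{aligned}
PR(A) &= \frac{0.15}{3} + 0.85\left(\frac{PR(B)}{2} + \frac{PR(C)}{1}\right) = 0.05 + 0.85\left(\frac{1}{6} + \frac{1}{3}\right) = 0.475 \\
PR(B) &= \frac{0.15}{3} + 0.85\left(\frac{PR(A)}{1}\right) = 0.05 + 0.85 \cdot \frac{1}{3} \approx 0.3333 \\
PR(C) &= \frac{0.15}{3} + 0.85\left(\frac{PR(B)}{2}\right) = 0.05 + 0.85 \cdot \frac{1}{6} \approx 0.1917
\end{aligned}
$$

The three updated values still sum to 1, as a probability distribution should. Page A gains rank because two pages link to it, and C passes along its whole rank since A is its only link.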

How would we go about calculating PageRank values for each page, then? We can do so via iteration: start by assuming the PageRank of every page is 1 / N (i.e., equally likely to be on any page). Then, use the above formula to calculate new PageRank values for each page, based on the previous PageRank values. If we keep repeating this process, calculating a new set of PageRank values for each page based on the previous set of PageRank values, eventually the PageRank values will converge (i.e., not change by more than a small threshold with each iteration).
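The iteration described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the function name `iterate_ranks`, the default `threshold` of 0.0001, and the three-page `toy_corpus` are introduced here for the example, and a page with no outgoing links is treated as linking to every page.

```python
def iterate_ranks(corpus, damping_factor, threshold=0.0001):
    """Hypothetical sketch: repeat the PageRank formula until ranks settle."""
    n_pages = len(corpus)
    # A page with no links is treated as linking to every page.
    out_links = {p: (links or set(corpus)) for p, links in corpus.items()}
    # Start with every page equally likely.
    ranks = {p: 1 / n_pages for p in corpus}

    while True:
        new_ranks = {}
        for p in corpus:
            # Sum PR(i) / NumLinks(i) over every page i that links to p.
            incoming = sum(
                ranks[i] / len(out_links[i])
                for i in corpus
                if p in out_links[i]
            )
            new_ranks[p] = (1 - damping_factor) / n_pages + damping_factor * incoming

        # Converged when no page moved by more than the threshold.
        largest_change = max(abs(new_ranks[p] - ranks[p]) for p in corpus)
        ranks = new_ranks
        if largest_change <= threshold:
            return ranks


# Toy corpus invented for this example.
toy_corpus = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"A"},
}

ranks = iterate_ranks(toy_corpus, 0.85)
for page in sorted(ranks):
    print(f"{page}: {ranks[page]:.4f}")
```

Checking the largest change within a single pass means the loop stops only when all pages have settled at the same time, which is one reasonable reading of "converge" in the paragraph above.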

### Iterative Algorithm Implementation

In [228]:
def getranks(corpus, old_ranks, final_ranks, d=DAMPING):
    new_ranks = {}

    for page, links in corpus.items():
        # Calculate new rank using the PageRank formula.
        # links maps each incoming page to its number of outgoing links.
        new_rank = sum(old_ranks[pg] / num_links for pg, num_links in links.items())
        new_rank = d * new_rank + (1 - d) / len(corpus)

        # Record the page once its rank changes by no more than 0.0001
        if abs(new_rank - old_ranks[page]) <= .0001:
            final_ranks[page] = new_rank

        new_ranks[page] = new_rank

    # Stop once every page has been recorded; otherwise iterate again.
    # Note: a page stays in final_ranks once recorded, so the stopping rule
    # is "every page has met the threshold at least once", not
    # "every page met the threshold in the same pass".
    if len(old_ranks) == len(final_ranks):
        return final_ranks
    return getranks(corpus, new_ranks, final_ranks, d)


def pagerank2(directory, d):
    corpus = crawl(directory)
    new_corpus = {}

    # New corpus maps each page to its incoming links:
    # {page: {linking_page: number of outgoing links on linking_page}}
    # Note: old corpus has pages with outgoing links from the page
    for page in corpus:
        links = {}
        for pg, lnk in corpus.items():
            if page in lnk:
                links[pg] = len(lnk)
        new_corpus[page] = links

    # If a page in original corpus has no outgoing links, then
    # per project instructions, this means it actually has an outgoing
    # link to every page in the corpus.
    # So, when a page with no links is found, I add it as an incoming
    # link to every page in new_corpus, with a link count equal to the
    # number of pages in the corpus.
    for page, links in corpus.items():
        if len(links) == 0:
            for incoming in new_corpus.values():
                incoming[page] = len(new_corpus)

    old_ranks = {page: 1 / len(new_corpus) for page in new_corpus}
    # Pass d through so the damping_factor argument is actually used
    return getranks(new_corpus, old_ranks, final_ranks={}, d=d)
    

In [229]:
pagerank2('corpus0', DAMPING)

{'4.html': 0.13099022817761485,
 '1.html': 0.21988476390560002,
 '2.html': 0.4292402440111851,
 '3.html': 0.21988476390560002}

## Run Project

In [232]:
def get(directory):
    ranks = sample_pagerank(directory, DAMPING, SAMPLES)
    print(f"Random Sampling PageRank Results (n = {SAMPLES})")
    for page in sorted(ranks):
        print(f"  {page}: {ranks[page]:.4f}")
   
    print()
        
    ranks = pagerank2(directory, DAMPING)
    print(f"PageRank Results from Iteration")
    for page in sorted(ranks):
        print(f"  {page}: {ranks[page]:.4f}")       

In [233]:
get('corpus0')

Random Sampling PageRank Results (n = 10000)
  1.html: 0.2275
  2.html: 0.4300
  3.html: 0.2117
  4.html: 0.1308

PageRank Results from Iteration
  1.html: 0.2199
  2.html: 0.4292
  3.html: 0.2199
  4.html: 0.1310


In [234]:
get('corpus1')

Random Sampling PageRank Results (n = 10000)
  bfs.html: 0.1188
  dfs.html: 0.0857
  games.html: 0.2240
  minesweeper.html: 0.1137
  minimax.html: 0.1291
  search.html: 0.2124
  tictactoe.html: 0.1163

PageRank Results from Iteration
  bfs.html: 0.1150
  dfs.html: 0.0807
  games.html: 0.2278
  minesweeper.html: 0.1182
  minimax.html: 0.1309
  search.html: 0.2091
  tictactoe.html: 0.1182


In [235]:
get('corpus2')

Random Sampling PageRank Results (n = 10000)
  ai.html: 0.1912
  algorithms.html: 0.1045
  c.html: 0.1241
  inference.html: 0.1301
  logic.html: 0.0234
  programming.html: 0.2312
  python.html: 0.1250
  recursion.html: 0.0705

PageRank Results from Iteration
  ai.html: 0.1887
  algorithms.html: 0.1066
  c.html: 0.1240
  inference.html: 0.1290
  logic.html: 0.0264
  programming.html: 0.2297
  python.html: 0.1240
  recursion.html: 0.0716


### My Original Incorrect Iteration

In the version below, I didn't treat a page with no outgoing links (recursion.html in corpus2) as linking to every page, so the results for corpus2 were lower than they should have been, though the first two corpora worked.

**I also iterated using a while loop here. I actually prefer the recursive function approach, but the while loop is clean and easy to follow.**

In [None]:
def iterate_pagerank(directory, d):
    corpus = crawl(directory)
    start_prob = (1 - d) / len(corpus)
    
    old_ranks = {page: start_prob for page in corpus}
    
    # new corpus created with pages and links to p rather than
    # pages and links from p
    new_corpus = {}
    pages = [pg for pg, val in corpus.items()]
    for page in pages:
        links = {}
        for pg, lnk in corpus.items():
            if page in lnk:
                links.update({pg:len(lnk)})
            else:
                pass
        new_corpus.update({page: links})

    while True:
        new_ranks = {}       
        for page, links in new_corpus.items():
            #print(page, links)
            if len(links) == 0:
                new_ranks.update({page: start_prob})
            elif len(links) == 1:
                link_page =  list(links.keys())[0]
                num_links =  list(links.values())[0]
                new_ranks.update({page: start_prob + d * old_ranks[link_page]/num_links})
            else:
                rank_link_sum = 0
                for pg, lnk in links.items():
                    rank_link_sum += (old_ranks[pg]/lnk)
                    new_ranks.update({page: start_prob + d * rank_link_sum})
                                 
        # Compare new ranks to old and check for +/- .0001 difference
        # if check fails, old_ranks = new_ranks
        old_check = sum(old_ranks.values())
        new_check = sum(new_ranks.values())
        
        #old_ranks_vals = [rank for p, rank in old_ranks.items]
        diffs = new_check - old_check
              
        if  diffs <= 0.001:
            return(new_ranks)
            break
        else:
            old_ranks = new_ranks