
# STA 141B Assignment 3

Due __Nov 24, 2023__ by __11:59pm__. Submit your work by uploading it to Gradescope through Canvas.

Please rename this file as __"LastName_FirstName_hw3"__ and export it as as pdf-file. 

The objective of this assignment is acquire data via web APIs.  

Instructions:

1. Provide your solutions in new cells following each exercise description. Create as many new cells as necessary. Use code cells for your Python scripts and Markdown cells for explanatory text or answers to non-coding questions. Answer all textual questions in complete sentences.

2. Prioritize code readability. Just as in writing a book, the clarity of each line matters. Adopt the __one-statement-per-line__ rule. If you have a lengthy code statement, consider breaking it into multiple lines for clarity. (Please note: violating the one-statement-per-line rule will result in a one-point deduction for each offending line.)

3. To help understand and maintain code, you should always add comments to explain your code. Use the hash symbol (#) to start writing a comment (homework without any comments will automatically receive 0 points).

4. Submit your final work as a __.pdf__ file on __Gradescope__. To convert your .ipynb file into one of these formats, navigate to "File", select "Download as", and then choose either "PDF via LaTeX" or "HTML". If "PDF via LaTeX" does not work for you, export to "HTML", and then use Chrome to print the .html file into PDF. Gradescope only accepts PDF files.

5. On gradescope, mark the locations of your answers on your submission in order to facilitate grading.

6. This assignment will be graded on your proficiency in programming. Be sure to demonstrate your abilities and submit your own, correct and readable solutions. 

### Problem 1 : Getting to Philosophy [10 Points]

Lets play a variation of the [wiki game](https://en.wikipedia.org/wiki/Wikipedia:Wiki_Game) to learn about [this](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) phenomenon. The rules are as follows: 
 - Start using the random article link (wiki menu on the left hand side)
 - Click on the first non-italicized link outside of parentheses 
 - Ignore external links (e.g., `/wiki/File:...` or `/wiki/Category:...`), links to the current page
 - Stop when reaching "Philosophy", a dead end (page with no links) or when a loop occurs

#### Exercise

Write a function `play` that plays the game and stops if "Philosophy" is not reached after `maxiter = 1000` steps. This function should return information to compute the quantities below. 

Play the game $200$ times. Then Report:
 - the mean number of sites visited per game, 
 - the maximum number of sites visited per game,
 - and number of convergences to "Philosophy" and 
 - the 20 most visited sites over all 200 games. 

 __(The sample output does not include the results you need to report)__
 
You may want to use the module `lxml.html` and the function `tostring` `lxml.etree` or similar packages to to parse the html. Besides these, you are allowed to use `requests`, `re`, and `time`. To display the results, you may use `pandas` and its method `pandas.Series.value_counts()` or similar packages. You might find [regexr.com](https://regexr.com/) helpful. 

__Hint for writing this function:__

1. **Function Definition**: Define the `play` function that accepts a starting URL and a maximum number of iterations (`maxiter`).

2. **HTTP Request**: Use the `requests` module to perform a GET request on the given URL.

3. **HTML Parsing**: Parse the returned HTML content using `lxml.html` to find links.

4. **Link Selection**: Define a regular expression using the `re` module to match the first non-italicized link outside of parentheses that is not an external link, a file link, a category link, or a link to the current page. This part is the most difficult part of this problem.

5. **Game Loop**: Implement a loop that follows the selected link and repeats the process, while keeping track of visited sites and checking for "Philosophy", dead ends, or loops. Use a counter to ensure the loop stops after `maxiter` iterations.

__Sample output for play() function:__

In [15]:
import requests
from lxml import html
import re
import time
import pandas as pd

def is_link_in_parentheses(element):
    """
    Check if the link is within parentheses by examining the text content 
    around the link.
    """
    text_content = ''.join(element.xpath('.//text()'))
    link_text = element.text_content()
    start_index = text_content.find(link_text)
    if start_index != -1:
        # Count the number of opening and closing parentheses before and after the link text
        before = text_content[:start_index].count('(') - text_content[:start_index].count(')')
        after = text_content[start_index:].count(')') - text_content[start_index:].count('(')
        return before > 0 or after > 0
    return False

def play(start_url, maxiter=1000):
    visited_pages = []
    current_url = start_url
    base_url = "https://en.wikipedia.org"
    
    for _ in range(maxiter):
        response = requests.get(base_url + current_url)
        tree = html.fromstring(response.content)
        
        for element in tree.xpath('//p//a[not(contains(@href, ":")) and not(contains(@class, "new")) and starts-with(@href, "/wiki/")]'):
            href = element.get('href')
            if not is_link_in_parentheses(element) and href != current_url:
                current_url = href
                break
        else:
            break

        if current_url in visited_pages or current_url.endswith('/Philosophy'):
            break
        visited_pages.append(current_url)

        time.sleep(0.1)

    return visited_pages

paths = [play('/wiki/Special:Random') for _ in range(200)]
df = pd.DataFrame({'paths': game_result})
mean_sites_visited = df['paths'].apply(len).mean()
max_sites_visited = df['paths'].apply(len).max()
games_won = df['paths'].apply(lambda x: x[-1].endswith('/Philosophy') if x else False).sum()
all_sites = pd.Series([site for path in paths for site in path])
top_20_sites = all_sites.value_counts().head(20)

print("Mean number of sites visited per game:", mean_sites_visited)
print("Maximum number of sites visited per game:", max_sites_visited)
print("Number of convergences to 'Philosophy':", games_won)
print("Top 20 most visited sites:")
print(top_20_sites)


Mean number of sites visited per game: 14.2
Maximum number of sites visited per game: 22
Number of convergences to 'Philosophy': 71
Top 20 most visited sites:
/wiki/Abstraction               124
/wiki/Rule_of_inference         124
/wiki/Philosophy_of_logic       124
/wiki/Communication             123
/wiki/Information               123
/wiki/Language                  119
/wiki/Classical_language         75
/wiki/Latin                      69
/wiki/Greek_language             68
/wiki/Modern_Greek               68
/wiki/Dialect                    68
/wiki/Ancient_Greek_language     39
/wiki/Empirical_evidence         33
/wiki/Scientific_method          33
/wiki/Analytic_philosophy        33
/wiki/Proposition                33
/wiki/Philosophy_of_language     33
/wiki/Ancient_Greek              26
/wiki/Geography                  24
/wiki/Organism                   23
Name: count, dtype: int64


In [16]:
#these are all the lines that mightve been cut off on top

#1 - 
#for element in tree.xpath('//p//a[not(contains(@href, ":")) and not(contains(@class, "new")) 
#     and starts-with(@href, "/wiki/")]'): href = element.get('href')


#2
#before = text_content[:start_index].count('(') - text_content[:start_index].count(')')
#after = text_content[start_index:].count(')') - text_content[start_index:].count('(')

#3
#games_won = df['paths'].apply(lambda x: x[-1].endswith('/Philosophy') if x else False).sum()


In [13]:
play('/wiki/Robert_Alfred_Tarlton')['pages']

TypeError: list indices must be integers or slices, not str

In [5]:
play('/wiki/Riku_Morgan')['pages']

['/wiki/Riku_Morgan',
 '/wiki/Nigerian_Airforce',
 '/wiki/Nigerian_Armed_Forces',
 '/wiki/Military',
 '/wiki/Warfare',
 '/wiki/State_(polity)',
 '/wiki/Politics',
 '/wiki/Decision-making',
 '/wiki/Psychology',
 '/wiki/Mind',
 '/wiki/Thought',
 '/wiki/Consciousness',
 '/wiki/Awareness',
 '/wiki/Philosophy']

In [6]:
play('/wiki/Brigade_Commander_(video_game)')['pages']

['/wiki/Brigade_Commander_(video_game)',
 '/wiki/Amiga_Action',
 '/wiki/Amiga',
 '/wiki/Personal_computer',
 '/wiki/Microcomputer',
 '/wiki/Computer',
 '/wiki/Machine',
 '/wiki/Power_(physics)',
 '/wiki/Physics',
 '/wiki/Natural_science',
 '/wiki/Branches_of_science',
 '/wiki/Sciences',
 '/wiki/Scientific_method',
 '/wiki/Empirical_evidence',
 '/wiki/Proposition',
 '/wiki/Philosophy_of_language',
 '/wiki/Analytic_philosophy',
 '/wiki/Philosophical_tradition',
 '/wiki/Philosophy']

In [7]:
play('/wiki/Exclusive_(TV_series)')['pages']

['/wiki/Exclusive_(TV_series)', '/wiki/Double_Vision_(company)']