# STA 141B WQ 25 Homework Assignment 2

## Instructions

- Complete the exercises below. Create more code chunks if necessary. Answer all questions. Show results for both the *test* and *run* cases.
- Export the Jupyter Notebook as a PDF file.
- Submit the PDF by **Sunday, February 23, at 11:59 PM PT** to Gradescope.
- For each exercise, indicate the region of your answer in the PDF to facilitate grading. 

## Additional information

- Complete this worksheet yourself. 
- You may use the internet or discuss possible approaches to solve the problems with other students. You are not allowed to share your code or your answers with other students.
- Only the libraries explicitly allowed for each exercise may be used.
- Use code cells for your Python scripts and Markdown cells for explanatory text or answers to non-coding questions. Answer all textual questions in complete sentences.
- Late homework submissions will not be accepted. No submissions will be accepted by email.
- The total number of points for this assignment is 20. You can earn 5 bonus points. 

__Exercise 1__

Because the University of California is a public organization, the compensation of employees at all of its institutions is freely accessible. These reports cover UC's career faculty and staff employees, as well as part-time, temporary and student employees. The data is accessible [here](https://ucannualwage.ucop.edu). Internally, the data requested by the search mask is queried using an undocumented API. For this exercise, you may use: 
```
import requests
import pandas

from json import loads
```

_Hint: If you encounter an error when parsing the data, try using string methods (e.g., `str.replace`) to deal with it._
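
To illustrate the hint, here is a minimal sketch of repairing a response body before parsing it. The payload, its field names (`rows`, `title`, `grosspay`), and its values are made up for illustration; the real API response may look different.

```python
from json import loads, JSONDecodeError

# Hypothetical payload: single-quoted keys and strings are not valid JSON.
raw = "{'rows': [{'title': 'PROF-AY', 'grosspay': '123,456.00'}]}"

try:
    data = loads(raw)
except JSONDecodeError:
    # Swap single quotes for double quotes, then parse again.
    data = loads(raw.replace("'", '"'))

# Thousands separators also need to be removed before converting to a number.
gross = float(data["rows"][0]["grosspay"].replace(",", ""))
print(data["rows"][0]["title"], gross)
```

A blanket quote replacement like this would break on values that contain an apostrophe (for example a last name such as O'Brien), so it is worth inspecting the raw text before choosing a replacement rule.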

__(a)__ Get the compensation information of all UC Irvine professors in 2023. How many entries are being returned?

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

# Would use APIs to request this data, but CORS :/
# So instead we will inject a script on browser side to bypass CORS and scrape the data

# Start WebDriver
driver = webdriver.Chrome()
driver.get("https://ucannualwage.ucop.edu/wage/")
wait = WebDriverWait(driver, 10)

# Wait until the year element is present
select_element = wait.until(EC.presence_of_element_located((By.ID, "year")))

# Select 2023
dropdown = Select(select_element)
dropdown.select_by_visible_text("2023")

# Wait until the Location element is present
select_element = wait.until(EC.presence_of_element_located((By.ID, "location")))

# Select Irvine
dropdown = Select(select_element)
dropdown.select_by_visible_text("Irvine")

In [2]:
# Do search
driver.execute_script("""
jQuery(document).ready(function() {
    jQuery("#list2").jqGrid('setGridParam', {
        rowNum: 32212, // Common way to communicate no limit is -1, but the given server does not support it.
        // Also tried setting loadonce: true, but server expects Pagination, so this property also doesn't work
    });

    doSearch();
});
""")

In [3]:
# Extract table html
table = driver.find_element(By.ID, 'list2')
table_html = table.get_attribute('outerHTML')

In [4]:
from io import StringIO
import pandas as pd

# Convert to Pandas DF
table_html_io = StringIO(table_html)
df = pd.read_html(table_html_io)[0]

In [5]:
# Get column names
header = driver.find_element(By.XPATH, "//table[@class='ui-jqgrid-htable']")
header_html = header.get_attribute('outerHTML')
header_html_io = StringIO(header_html)
header_df = pd.read_html(header_html_io)[0]

df.columns = header_df.columns
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Location,First Name,Last Name,Title,Gross Pay,Regular Pay,Overtime Pay,OtherPay
0,1,2023,Irvine,*****,*****,STDT 4,1907.0,1907.0,0.0,0.0
1,2,2023,Irvine,*****,*****,STDT 4,1414.0,1414.0,0.0,0.0
2,3,2023,Irvine,*****,*****,GSR-NO REM,22701.0,22701.0,0.0,0.0
3,4,2023,Irvine,*****,*****,GSR-FULL FEE REM,28302.0,28302.0,0.0,0.0
4,5,2023,Irvine,*****,*****,POSTDOC-EMPLOYEE,60799.0,60799.0,0.0,0.0


In [6]:
# save the DF
df.to_csv('HW3E1.csv')

In [7]:
# Close browser
driver.quit()

In [8]:
# get the number of entries
num_entries = df.shape[0]

f"Number of entries: {num_entries}"

'Number of entries: 32212'
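
The count above covers every Irvine entry for 2023, including students and postdocs. One way to narrow it down to professors is to filter on the `Title` column. The sketch below assumes that professor titles contain the substring `PROF`; the sample rows and pay figures are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Made-up rows that mimic two of the columns in the scraped table.
sample = pd.DataFrame({
    "Title": ["STDT 4", "PROF-AY", "ASSOC PROF-AY", "GSR-NO REM", "POSTDOC-EMPLOYEE"],
    "Gross Pay": [1907.0, 150000.0, 120000.0, 22701.0, 60799.0],
})

# Assumption: every professor title contains "PROF" (case-insensitive).
is_prof = sample["Title"].str.contains("PROF", case=False, na=False)
professors = sample[is_prof]

print(f"Number of professor entries: {len(professors)}")
```

Applying the same mask to `df` would give the number of professor entries, provided the substring assumption holds for the titles actually used in the data.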

__BONUS__

__(b, i)__ Use the [UCI directory](https://directory.uci.edu/) to learn each professor's department, if available. How many professors with department information do you find? __(ii)__ Find the four departments that have the largest average gross pay, and the four departments that have the largest average base pay.

In [None]:
# TODO: ??

__Exercise 2__

Let's play a variation of the [wiki game](https://en.wikipedia.org/wiki/Wikipedia:Wiki_Game) to learn about [this](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) phenomenon. The rules are as follows: 
 - Start using either a provided article or the random article link (wiki menu on the left-hand side)
 - Click on the first non-italicized link outside of parentheses and info-boxes
 - Ignore external links (e.g., `/wiki/File:...` or `/wiki/Category:...`)
 - Stop when reaching "Philosophy", a dead end (page with no links, this should return `None`) or when a loop occurs
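
The "outside of parentheses" rule is the easiest one to miss. A minimal sketch of one possible approach is to drop parenthesized text from a paragraph's HTML before looking for links, while leaving parentheses inside tags (such as `href="/wiki/Power_(physics)"`) untouched. The function name `strip_parenthesized` and the sample HTML are hypothetical.

```python
def strip_parenthesized(html: str) -> str:
    """Remove text inside parentheses, keeping parentheses that occur inside tags."""
    out = []
    depth = 0       # nesting depth of parentheses in the visible text
    in_tag = False  # True while between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
            if depth == 0:
                out.append(ch)  # keep tag characters only outside parentheses
            continue
        if ch == "<":
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


# Made-up paragraph: the first link sits inside parentheses and should be skipped.
sample = ('Power (from <a href="/wiki/Latin">Latin</a>) is studied in '
          '<a href="/wiki/Power_(physics)">physics</a>.')
cleaned = strip_parenthesized(sample)
print(cleaned)
```

The cleaned string can then be handed to an HTML parser to pick the first remaining link. This sketch assumes that `>` never appears inside an attribute value, which holds for typical article markup but is not guaranteed.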
 
Use the test cases below to check your code: 

```python
>play('/wiki/Brigade_Commander_(video_game)')
['/wiki/Brigade_Commander_(video_game)',
 '/wiki/Amiga_Action',
 '/wiki/Amiga',
 '/wiki/Personal_computer',
 '/wiki/Computer',
 '/wiki/Machine',
 '/wiki/Power_(physics)',
 '/wiki/Energy',
 '/wiki/Physical_quantity',
 '/wiki/Quantification_(science)',
 '/wiki/Mathematics',
 '/wiki/Mathematical_theory',
 '/wiki/Reason',
 '/wiki/Consciousness',
 '/wiki/Awareness',
 '/wiki/Philosophy']

>play('/wiki/Keretapi_Tanah_Melayu')
['/wiki/Keretapi_Tanah_Melayu',
 '/wiki/Airline',
 '/wiki/Civil_aviation',
 '/wiki/Aviation',
 '/wiki/Flight',
 '/wiki/Motion_(physics)',
 '/wiki/Physics',
 '/wiki/Scientific',
 '/wiki/Scientific_method',
 '/wiki/Empirical_evidence',
 '/wiki/Evidence',
 '/wiki/Proposition',
 '/wiki/Philosophy_of_language',
 '/wiki/Language',
 '/wiki/Communication',
 '/wiki/Information',
 '/wiki/Abstraction',
 '/wiki/Rule_of_inference',
 '/wiki/Philosophy_of_logic',
 '/wiki/Philosophy']

>play('/wiki/Robert_Alfred_Tarlton')
['/wiki/Robert_Alfred_Tarlton',
 '/wiki/Birmingham',
 '/wiki/City_status_in_the_United_Kingdom',
 '/wiki/The_Crown',
 '/wiki/State_(polity)',
 '/wiki/Politics',
 '/wiki/Decision-making',
 '/wiki/Psychology',
 '/wiki/Mind',
 '/wiki/Thought',
 '/wiki/Cognition',
 '/wiki/Action_(philosophy)',
 '/wiki/Philosophy']

>play('/wiki/Ricky_Vallen')
['/wiki/Ricky_Vallen', None]
```

__(a)__ Run: 
    
```python
play('/wiki/Yadav') # (i)
play('/wiki/1953_Arab_Games') # (ii)
play('/wiki/Save_Me_(Silver_Convention_song)') # (iii)
```

In [60]:
import requests
from bs4 import BeautifulSoup
import time

BASE_URL = 'https://en.wikipedia.org'
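def get_links(curr_page):
  """Fetches candidate Wikipedia article links from a page, skipping italicized and namespaced links and links in infoboxes, sidebars, thumbnails, captions, hatnotes, and tables. Links inside parentheses are NOT filtered out."""
  # curr_page already starts with '/', so no extra separator is needed
  response = requests.get(f'{BASE_URL}{curr_page}')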
  soup = BeautifulSoup(response.text, "html.parser")
  links = []

  # Only from body
  body_content = soup.find(id="bodyContent")
  if not body_content:
    return links

  # For all possible links
  for link in body_content.select("a[href]"):
    # No infobox
    if link.find_parent(class_="infobox"):
      continue

    # No sidebar
    if link.find_parent(class_="sidebar"):
      continue

    # No thumb
    if link.find_parent(class_="thumb"):
      continue

    # No figcaption
    if link.find_parent("figcaption"):
      continue

    # No hatnote
    if link.find_parent(class_="hatnote"):
      continue

    if link.find_parent("table"):
      continue

    # Make sure it's valid link
    href = link.get("href")
    if not href.startswith("/wiki/"):
      continue
    if ":" in href:
      continue

    # No Italics
    if link.find_parent('i'):
      continue

    links.append(href)

  return links

def play(start, target = "/wiki/Philosophy"):
  """Navigates from start Wikipedia page to target page using article links."""
  current = start
  visited = set()
  path = [start]

  while current != target:
    if current in visited:
      # A loop occurred
      return path

    visited.add(current)
    links = get_links(current)

    if not links:
      # Stuck
      path.append(None)
      return path

    path.append(links[0])
    current = links[0]
    time.sleep(0.1)  # Adding a small delay to prevent excessive requests

  return path

play('/wiki/Brigade_Commander_(video_game)')

['/wiki/Brigade_Commander_(video_game)',
 '/wiki/Amiga_Action',
 '/wiki/Amiga',
 '/wiki/Personal_computer',
 '/wiki/Computer',
 '/wiki/Machine',
 '/wiki/Power_(physics)',
 '/wiki/Energy',
 '/wiki/Ancient_Greek_language',
 '/wiki/Greek_language',
 '/wiki/Modern_Greek',
 '/wiki/Endonym_and_exonym',
 '/wiki/Name',
 '/wiki/Referent',
 '/wiki/Person',
 '/wiki/People',
 '/wiki/Person']

In [47]:
# Test cases
play('/wiki/Brigade_Commander_(video_game)')


['/wiki/Brigade_Commander_(video_game)',
 '/wiki/Amiga_Action',
 '/wiki/Amiga',
 '/wiki/Personal_computer',
 '/wiki/Computer',
 '/wiki/ENIAC',
 '/wiki/Geographic_coordinate_system',
 '/wiki/Geodesy',
 '/wiki/Geodynamics',
 '/wiki/Geodesy']

In [None]:
play('/wiki/Keretapi_Tanah_Melayu')

In [2]:
play('/wiki/Ricky_Vallen')

['/wiki/Ricky_Vallen', None]

__(b)__ Run the game 200 times and report: __(i)__ How often did you end with _Philosophy_? __(ii)__ What is the average and __(iii)__ maximum length of your games? __(iv)__ Print the ten most frequently visited articles and __(v)__ the number of all visited articles. 
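
Once the games have been collected as a list of paths, the five statistics can be computed in one pass. The sketch below is one possible way to do this; the helper name `summarize` and the sample paths are hypothetical, and it assumes that the "length" of a game is the number of entries in its path (including a trailing `None` for a dead end).

```python
from collections import Counter


def summarize(paths, target="/wiki/Philosophy"):
    """Summarize a list of game paths as returned by a play-style function."""
    lengths = [len(p) for p in paths]
    # Count article visits across all games, ignoring the None dead-end marker.
    visits = Counter(a for p in paths for a in p if a is not None)
    return {
        "reached_target": sum(1 for p in paths if p and p[-1] == target),
        "average_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
        "top_ten": visits.most_common(10),
        "distinct_articles": len(visits),
    }


# Made-up paths: one success, one dead end, one loop.
demo_paths = [
    ["/wiki/A", "/wiki/B", "/wiki/Philosophy"],
    ["/wiki/C", None],
    ["/wiki/D", "/wiki/B", "/wiki/D"],
]
stats = summarize(demo_paths)
print(stats)
```

For the actual exercise, `demo_paths` would be replaced by the 200 paths gathered from random start articles.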

__(c)__ Print the articles that you obtain when starting from _Philosophy_.