# STA 141B FQ 25 Homework Assignment 4

## Instructions

- Complete the exercises below. Create more code chunks if necessary. Answer all questions. Show results for both the *test* and *run* cases.
- Export the Jupyter Notebook as a PDF file.
- Submit the PDF by **Sunday, March 9th, at 11:59 PM PT** to [Gradescope](https://www.gradescope.com/courses/947485). 
- For each exercise, indicate the region of your answer in the PDF to facilitate grading. 

## Additional information

- Complete this worksheet yourself. 
- You may use the internet or discuss possible approaches to solve the problems with other students. You are not allowed to share your code or your answers with other students.
- Only the explicitly allowed libraries may be used.
- Use code cells for your Python scripts and Markdown cells for explanatory text or answers to non-coding questions. Answer all textual questions in complete sentences.
- Late homework submissions will not be accepted. No submissions will be accepted by email.
- The total number of points for this assignment is 20.

__Exercise 1__

Let's obtain movie information for the movies available on the Internet Movie Script Database [IMSDb](https://imsdb.com/).

__(a)__ Use the _Alphabetical_ section to obtain the URLs of all movies. How many different movies do you obtain?

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of IMSDb
base_url = "https://imsdb.com"

In [2]:
alphabet = ['0', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
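
The hand-typed list can also be generated, which rules out a missed or duplicated letter. A small sketch, assuming (as the list above does) that titles starting with a digit are listed under the `0` page; `listing_urls` is a name introduced here for illustration:

```python
import string

# '0' collects titles that start with a digit; the rest are the letters A-Z.
alphabet = ['0'] + list(string.ascii_uppercase)

# Build the listing URL for every page of the Alphabetical section.
base_url = "https://imsdb.com"
listing_urls = [f"{base_url}/alphabetical/{letter}" for letter in alphabet]
```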

In [7]:
URLs = []

for letter in alphabet:
  url = f"{base_url}/alphabetical/{letter}"
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  # The movie listing lives in the second table on the page.
  tables = soup.find_all("table")
  assert len(tables) > 1
  second_table = tables[1]
  data_cells = second_table.find_all("td")
  assert len(data_cells) > 2
  listing_cell = data_cells[0]  # the cell holding one <p> per movie
  for p in listing_cell.find_all("p"):
    a = p.find("a")
    if a is not None:  # skip paragraphs without a link
      URLs.append(a["href"])

In [12]:
df = pd.DataFrame(URLs, columns=["URL"])
df = df.drop_duplicates()
df.head()

| | URL |
|---|---|
| 0 | /Movie Scripts/10 Things I Hate About You Scri... |
| 1 | /Movie Scripts/12 Script.html |
| 2 | /Movie Scripts/12 and Holding Script.html |
| 3 | /Movie Scripts/12 Monkeys Script.html |
| 4 | /Movie Scripts/12 Years a Slave Script.html |


In [13]:
f'There are {df.shape[0]} movies'

'There are 1295 movies'

In [14]:
df.to_csv("movies.csv", index=False)

__(b)__ For every movie, obtain the title, writers, genres, script date and movie release date. 

__Test:__

```python
> get_movie_details('/Movie Scripts/Feast Script.html')
('Feast',
 {'writers': ['Patrick Melton', 'Marcus Dunston'],
  'genres': ['Action', 'Comedy', 'Horror', 'Thriller'],
  'script_date': 2004,
  'release_date': 2006})
```

__(i)__ Which movie has the greatest observed distance between script and movie release date? __(ii)__ Which writer has written the most movies?

In [5]:
df = pd.read_csv("movies.csv")

In [35]:
import re
from time import sleep

# One shared session reuses the connection instead of opening a new one per movie.
session = requests.Session()

def extract_year(label_tag):
  """Return the four-digit year from the text after a <b> label, or None."""
  if label_tag is None:
    return None
  text = label_tag.find_next(string=True).find_next(string=True)
  match = re.search(r'\d{4}', text or '')
  return int(match.group()) if match else None

def get_movie_details(endpoint):
  # '/Movie Scripts/Feast Script.html' -> 'Feast'
  title = endpoint.split("/")[-1].removesuffix(" Script.html")

  response = session.get(f'{base_url}{endpoint}')
  soup = BeautifulSoup(response.text, 'html.parser')

  # Assumption: the movie's own details sit in <table class="script-details">.
  # Searching the whole page also picks up the site-wide genre menu.
  details = soup.find('table', class_='script-details') or soup

  writers = [a.text for a in details.find_all('a', href=lambda x: x and x.startswith('/writer.php?'))]
  genres = [a.text for a in details.find_all('a', href=lambda x: x and x.startswith('/genre/'))]

  script_date = extract_year(soup.find('b', string='Script Date'))
  release_date = extract_year(soup.find('b', string='Movie Release Date'))

  sleep(0.25)  # pause between requests to be polite to the server

  return title, {
    'writers': writers,
    'genres': genres,
    'script_date': script_date,
    'release_date': release_date,
  }

get_movie_details("/Movie Scripts/Feast Script.html")

('Feast',
 {'writers': ['Patrick Melton', 'Marcus Dunston'],
  'genres': ['Action', 'Comedy', 'Horror', 'Thriller'],
  'script_date': 2004,
  'release_date': 2006})

In [36]:
df[['title', 'details']] = df['URL'].map(get_movie_details).apply(pd.Series)

ConnectionError: HTTPSConnectionPool(host='imsdb.com', port=443): Max retries exceeded with url: /Movie%20Scripts/A%20Scanner%20Darkly%20Script.html (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x115c52f00>: Failed to establish a new connection: [Errno 12] Cannot allocate memory'))
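
A long crawl like this one can fail partway through, as the traceback shows. One common mitigation, sketched here and not part of the original notebook, is to wrap each request in a retry loop with exponential backoff. The helper `with_retries` and its parameters are hypothetical names chosen for this illustration, and `flaky` is a stand-in for a network call.

```python
import time

def with_retries(func, attempts=4, base_delay=0.5, exceptions=(Exception,)):
  """Call func(); on failure wait base_delay * 2**k seconds and try again."""
  for k in range(attempts):
    try:
      return func()
    except exceptions:
      if k == attempts - 1:
        raise  # out of attempts: surface the last error
      time.sleep(base_delay * 2 ** k)

# Demonstration with a stand-in for a flaky network call.
calls = {'n': 0}

def flaky():
  calls['n'] += 1
  if calls['n'] < 3:
    raise ConnectionError("simulated failure")
  return "ok"

result = with_retries(flaky, base_delay=0.01, exceptions=(ConnectionError,))
```

In the notebook this could wrap the request itself, for example `with_retries(lambda: session.get(url, timeout=10), exceptions=(requests.ConnectionError,))`, with a single `requests.Session` shared across all calls.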

In [None]:
df.to_csv("movies.csv", index=False)
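
Once the details are collected, questions (i) and (ii) reduce to a maximum and a frequency count. A minimal sketch on a small dictionary shaped like the return value of `get_movie_details`: the `Feast` entry comes from the test case above, while the two `Placeholder` entries are invented for illustration and are not scraped data.

```python
from collections import Counter

details = {
  'Feast': {'writers': ['Patrick Melton', 'Marcus Dunston'],
            'script_date': 2004, 'release_date': 2006},
  # Invented placeholder entries for illustration only.
  'Placeholder A': {'writers': ['Writer X'],
                    'script_date': 1990, 'release_date': 1995},
  'Placeholder B': {'writers': ['Writer X'],
                    'script_date': None, 'release_date': 2001},
}

# (i) Absolute gap in years, skipping movies with a missing date.
gaps = {
  title: abs(d['release_date'] - d['script_date'])
  for title, d in details.items()
  if d['script_date'] is not None and d['release_date'] is not None
}
largest_gap_title = max(gaps, key=gaps.get)

# (ii) Count how many movies each writer is credited on.
writer_counts = Counter(w for d in details.values() for w in d['writers'])
top_writer, top_count = writer_counts.most_common(1)[0]
```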

__Exercise 2__

__(a, i)__ Let's retrieve data from the [CIA World Factbook](https://www.cia.gov/the-world-factbook/). Using devtools, find a way to retrieve the names of all listed world entities. *How many distinct world entities did you find? (Hint: I found more than 228 and fewer than 261.)*

__(ii)__ To navigate to each entity's page, I assembled the path by processing the country names. Retrieve all country-specific data in JSON format.

In [141]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

import pandas as pd
import requests
import json

from bs4 import BeautifulSoup
from time import sleep

In [142]:
# (i)
driver = webdriver.Chrome()
driver.get("https://www.cia.gov/the-world-factbook/countries/")
select_num_pages = Select(driver.find_element(By.CSS_SELECTOR, "select.per-page"))

In [143]:
select_num_pages.select_by_visible_text("All")
sleep(1)

In [144]:
links = driver.find_elements(By.XPATH, "//a[starts-with(@href, '/the-world-factbook/countries/')]")

In [145]:
link_texts = [link.text for link in links]

In [146]:
link_urls = [link.get_attribute("href") for link in links]

In [147]:
df = pd.DataFrame({"countries": link_texts, "urls": link_urls})

In [148]:
driver.quit()

In [149]:
df = df[df["countries"] != "Countries"]

In [150]:
def get_country_json(url):
  country = url.split("/")[-2]
  response = requests.get(f'https://www.cia.gov/the-world-factbook/page-data/countries/{country}/page-data.json')
  return response.text if response.status_code == 200 else None
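
The slug extraction in `get_country_json` relies on every URL ending in a slash. A slightly more defensive sketch using `urllib.parse` is shown below. The helper name `page_data_url` is hypothetical, the `page-data` path pattern is the one used in the cell above, and the Afghanistan URL is only an example input.

```python
from urllib.parse import urlparse

def page_data_url(country_url):
  """Map a Factbook country page URL to its page-data.json URL."""
  # Strip slashes first so the slug is found with or without a trailing '/'.
  slug = urlparse(country_url).path.strip('/').split('/')[-1]
  return f'https://www.cia.gov/the-world-factbook/page-data/countries/{slug}/page-data.json'

with_slash = page_data_url('https://www.cia.gov/the-world-factbook/countries/afghanistan/')
without_slash = page_data_url('https://www.cia.gov/the-world-factbook/countries/afghanistan')
```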

In [151]:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    df['json_data'] = [*executor.map(get_country_json, df['urls'])]

df.head()

| | countries | urls | json_data |
|---|---|---|---|
| 1 | Afghanistan | https://www.cia.gov/the-world-factbook/countri... | {"componentChunkName":"component---src-templat... |
| 2 | Akrotiri and Dhekelia | https://www.cia.gov/the-world-factbook/countri... | {"componentChunkName":"component---src-templat... |
| 3 | Albania | https://www.cia.gov/the-world-factbook/countri... | {"componentChunkName":"component---src-templat... |
| 4 | Algeria | https://www.cia.gov/the-world-factbook/countri... | {"componentChunkName":"component---src-templat... |
| 5 | American Samoa | https://www.cia.gov/the-world-factbook/countri... | {"componentChunkName":"component---src-templat... |


In [152]:
df.to_csv("countries.csv", index=False)

In [137]:
df = pd.read_csv("countries.csv")

In [153]:
f"There are {df['urls'].nunique()} unique entities in the world"

'There are 254 unique entities in the world'


__(b, i)__ We are interested in the key ports of each country. Write a function `get_port_data` that takes the country and returns all key ports in a list. *How many ports in total did you find?*

__Test:__

```
>get_port_data('Afghanistan')

>get_port_data('Algeria')
['Alger',
 'Annaba',
 'Arzew',
 'Arzew El Djedid',
 'Bejaia',
 'Mers El Kebir',
 'Oran',
 'Port Methanier',
 'Skikda']
 
>get_port_data('United States')
['Baltimore',
 'Boston',
 'Brooklyn',
 'Buffalo',
 'Chester',
 'Cleveland',
 'Detroit',
 'Galveston',
 'Houston',
 'Los Angeles',
 'Louisiana Offshore Oil Port (LOOP)',
 'Mobile',
 'New Orleans',
 'New York City',
 'Norfolk',
 'Oakland',
 'Philadelphia',
 'Portland',
 'San Francisco',
 'Seattle',
 'Tri-City Port']
```

In [154]:
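def get_port_data(country):
  # Parse the stored page-data JSON for this country.
  data = json.loads(df.loc[df['countries'] == country, 'json_data'].iloc[0])
  nodes = data['result']['data']['fields']['nodes']
  ports_object = next((item for item in nodes if item.get('name') == "Ports"), None)
  if ports_object is None:
    return None  # no "Ports" field at all
  soup = BeautifulSoup(ports_object['data'], 'html.parser')
  strong_tag = soup.find('strong', string="key ports:")
  if strong_tag is None or strong_tag.next_sibling is None:
    return None  # "Ports" field present, but no key ports listed
  return strong_tag.next_sibling.strip().split(', ')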

In [155]:
get_port_data('Afghanistan')

In [156]:
get_port_data('Algeria')

['Alger',
 'Annaba',
 'Arzew',
 'Arzew El Djedid',
 'Bejaia',
 'Mers El Kebir',
 'Oran',
 'Port Methanier',
 'Skikda']

In [158]:
get_port_data('United States')

['Baltimore',
 'Boston',
 'Brooklyn',
 'Buffalo',
 'Chester',
 'Cleveland',
 'Detroit',
 'Galveston',
 'Houston',
 'Los Angeles',
 'Louisiana Offshore Oil Port (LOOP)',
 'Mobile',
 'New Orleans',
 'New York City',
 'Norfolk',
 'Oakland',
 'Philadelphia',
 'Portland',
 'San Francisco',
 'Seattle',
 'Tri-City Port']

In [159]:
df['ports'] = df['countries'].map(get_port_data)

In [160]:
df.drop(columns=['json_data', 'urls'], inplace=True)
df.head()

| | countries | ports |
|---|---|---|
| 1 | Afghanistan | |
| 2 | Akrotiri and Dhekelia | |
| 3 | Albania | [Durres, Shengjin, Vlores] |
| 4 | Algeria | [Alger, Annaba, Arzew, Arzew El Djedid, Bejaia... |
| 5 | American Samoa | [Pago Pago Harbor] |
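
The total asked for in (b, i) is the sum of the list lengths, with countries that list no key ports contributing zero. A sketch on the five rows of the preview only (the full Algeria list is taken from the test case), so the number it produces is not the answer for the full data set:

```python
# Ports for the five countries in the preview; None means no key ports listed.
ports_by_country = {
  'Afghanistan': None,
  'Akrotiri and Dhekelia': None,
  'Albania': ['Durres', 'Shengjin', 'Vlores'],
  'Algeria': ['Alger', 'Annaba', 'Arzew', 'Arzew El Djedid', 'Bejaia',
              'Mers El Kebir', 'Oran', 'Port Methanier', 'Skikda'],
  'American Samoa': ['Pago Pago Harbor'],
}

# Treat None as an empty list so every country can be summed uniformly.
total_ports = sum(len(ports or []) for ports in ports_by_country.values())
```

With the DataFrame from the notebook, an equivalent expression would be `df['ports'].dropna().map(len).sum()`.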


__(ii)__ Where are these ports? Use the [Nominatim API](https://nominatim.org/) to obtain latitude-longitude pairs for each port. *How many pairs did you find?*

In [166]:
# TODO: this and everything below
ports = df.explode('ports').dropna(subset=['ports'])

In [167]:
ports.head()

| | countries | ports |
|---|---|---|
| 3 | Albania | Durres |
| 3 | Albania | Shengjin |
| 3 | Albania | Vlores |
| 4 | Algeria | Alger |
| 4 | Algeria | Annaba |


In [168]:
ports.to_csv("ports.csv", index=False)

In [169]:
ports = pd.read_csv("ports.csv")

In [203]:
def get_port_info(port, country):
  # Let requests build and percent-encode the query string.
  params = {
    'q': f'{port}, {country}',
    'polygon_geojson': 1,
    'format': 'jsonv2',
  }
  headers = {
    'User-Agent': 'MyWebScraper (ajowe@ucdavis.edu)'
  }
  response = requests.get('https://nominatim.openstreetmap.org/search', params=params, headers=headers)
  sleep(1)  # Nominatim's usage policy allows at most one request per second
  return response.json()

In [None]:
response = get_port_info("Big Creek", "Belize")
response

In [176]:
json_data = response  # get_port_info already returns the parsed JSON

In [181]:
json_data[0].keys()

dict_keys(['place_id', 'licence', 'osm_type', 'osm_id', 'lat', 'lon', 'category', 'type', 'place_rank', 'importance', 'addresstype', 'name', 'display_name', 'boundingbox', 'geojson'])

In [199]:
json_data[1]['category']

'place'

In [204]:
ports['info'] = ports.apply(lambda row: get_port_info(row['ports'], row['countries']), axis=1)

In [205]:
ports.to_csv("ports.csv", index=False)
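
Nominatim returns a list of candidate places, each carrying `lat` and `lon` as strings (see the keys listed above), and an empty list when nothing matches. A hedged sketch of reducing each response to one coordinate pair and counting the hits follows. `first_lat_lon` is a hypothetical helper, and the sample responses are invented stand-ins, not real API output.

```python
def first_lat_lon(results):
  """Return (lat, lon) of the first candidate as floats, or None if there is none."""
  if not results:
    return None
  top = results[0]
  return float(top['lat']), float(top['lon'])

# Invented sample responses keyed by port name.
sample_info = {
  'Port A': [{'lat': '10.5', 'lon': '-20.25', 'category': 'place'}],
  'Port B': [],  # no match found
}

pairs = {port: first_lat_lon(info) for port, info in sample_info.items()}

# Number of ports for which a coordinate pair was found.
found = sum(pair is not None for pair in pairs.values())
```

Taking the first candidate is a simplification; filtering on `category` or `type` before picking one may give better matches for ambiguous port names.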

__(iii)__ Add markers to a world map identifying each found port. The result should look something like this: 

<img src="source/world.png" width="500" height="300">

__(c)__ Let's learn about each nation's merchant marine! Write a function `merchant_marine` that takes the country and returns the merchant marine as follows:

```
>merchant_marine('Angola')
{'bulk carrier': 0,
 'container ship': 0,
 'general cargo': 13,
 'oil tanker': 8,
 'other': 43}
```
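
The expected output lists every ship type, including those with a count of zero, so a parser has to start from a fixed set of keys. A sketch under the assumption that the Factbook text reads like `general cargo 13, oil tanker 8, other 43` after its `by type:` label. That sample string and the helper name `parse_by_type` are assumptions and have not been checked against the live data.

```python
import re

SHIP_TYPES = ['bulk carrier', 'container ship', 'general cargo', 'oil tanker', 'other']

def parse_by_type(text):
  """Turn 'general cargo 13, oil tanker 8' into a dict covering all ship types."""
  counts = {ship_type: 0 for ship_type in SHIP_TYPES}
  # Each entry is a name followed by an integer, optionally with thousands commas.
  pattern = r'([a-z][a-z ]*?)\s+(\d+(?:,\d{3})*)'
  for name, number in re.findall(pattern, text.lower()):
    if name in counts:
      counts[name] = int(number.replace(',', ''))
  return counts

# Assumed input format, chosen to reproduce the Angola example.
angola = parse_by_type('general cargo 13, oil tanker 8, other 43')
```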

List the five largest merchant marines measured __(i)__ by the total number of ships, __(ii)__ by the total number of *non-other* ships, and __(iii)__ by the total number of oil tankers.
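
Each of the three rankings is a sort on a different total derived from the same dictionaries. A sketch with mostly invented fleets: only the Angola figures come from the example above, and `Country A` and `Country B` are placeholders. The helper `top` is a hypothetical name, and `heapq.nlargest` is one of several ways to take the leading entries.

```python
import heapq

fleets = {
  'Angola': {'bulk carrier': 0, 'container ship': 0, 'general cargo': 13,
             'oil tanker': 8, 'other': 43},
  # Invented placeholder fleets for illustration only.
  'Country A': {'bulk carrier': 5, 'container ship': 2, 'general cargo': 1,
                'oil tanker': 20, 'other': 3},
  'Country B': {'bulk carrier': 0, 'container ship': 0, 'general cargo': 2,
                'oil tanker': 1, 'other': 90},
}

def top(fleets, score, n=5):
  """Return up to n country names with the highest score(fleet), best first."""
  return heapq.nlargest(n, fleets, key=lambda country: score(fleets[country]))

by_total = top(fleets, lambda f: sum(f.values()))
by_non_other = top(fleets, lambda f: sum(f.values()) - f['other'])
by_oil_tankers = top(fleets, lambda f: f['oil tanker'])
```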