# Web 1 - How to get data from the Internet

In [1]:
import requests
import json
import pandas as pd
from pandas import Series, DataFrame

### P10 check-in

In [2]:
# It is very important to check the auto-grader test results for P10 in a timely manner.
# Take a few minutes to verify that you did not hardcode slashes in P10 instead of using os.path.join.
       # Your code won't pass the auto-grader if you hardcode either "/" or "\"
       # in *ANY* relative path in the entire project.
        
# Check your code and the auto-grader results as soon as possible.

### Warmup 1: Read the data from "IMDB-Movie-Data.csv" into a pandas DataFrame called "movies"

In [5]:
movies = pd.read_csv("IMDB-Movie-Data.csv")
movies

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...,...
1063,1063,Guardians of the Galaxy Vol. 2,"Action, Adventure, Comedy",James Gunn,"Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...",2017,136,7.6,389.81
1064,1064,Baby Driver,"Action, Crime, Drama",Edgar Wright,"Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...",2017,113,7.6,107.83
1065,1065,Only the Brave,"Action, Biography, Drama",Joseph Kosinski,"Josh Brolin, Miles Teller, Jeff Bridges, Jenni...",2017,134,7.6,18.34
1066,1066,Incredibles 2,"Animation, Action, Adventure",Brad Bird,"Craig T. Nelson, Holly Hunter, Sarah Vowell, H...",2018,118,7.6,608.58


### Warmup 2: fixing duplicate index columns

Notice that there are two index columns.
- When pandas writes a DataFrame to a CSV file, it also writes the DataFrame's index as an extra column by default
- So if the data already contains an index column, the result has two index columns
- Let's fix that problem
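A minimal sketch of how the extra column arises and two ways to avoid it. The two-row DataFrame and its column names are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a real CSV file.

```python
import io
import pandas as pd

# Hypothetical two-row DataFrame standing in for the movies data.
df = pd.DataFrame({"Title": ["A", "B"], "Rating": [7.5, 8.0]})

# With the default index=True, to_csv stores the row labels as an unnamed first column.
with_index = df.to_csv()

# Reading it back treats that column as data and names it "Unnamed: 0".
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))

# Option 1: tell read_csv that the first column is the index.
fixed_on_read = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(fixed_on_read.columns))

# Option 2: do not write the index in the first place.
without_index = df.to_csv(index=False)
fixed_on_write = pd.read_csv(io.StringIO(without_index))
print(list(fixed_on_write.columns))
```

Slicing with `iloc`, as in the next cell, repairs a file that was already written with the extra column.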

In [4]:
# Use slicing to keep all rows and every column except the one at integer position 0
movies = movies.iloc[:, 1:] 
movies

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...
1063,Guardians of the Galaxy Vol. 2,"Action, Adventure, Comedy",James Gunn,"Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...",2017,136,7.6,389.81
1064,Baby Driver,"Action, Crime, Drama",Edgar Wright,"Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...",2017,113,7.6,107.83
1065,Only the Brave,"Action, Biography, Drama",Joseph Kosinski,"Josh Brolin, Miles Teller, Jeff Bridges, Jenni...",2017,134,7.6,18.34
1066,Incredibles 2,"Animation, Action, Adventure",Brad Bird,"Craig T. Nelson, Holly Hunter, Sarah Vowell, H...",2018,118,7.6,608.58


In [5]:
movies.to_csv("better_movies.csv", index = False)

### Warmup 3: Which movie has the highest rating?

In [6]:
max_rating = movies["Rating"].max()
movies[movies["Rating"] == max_rating]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
54,The Dark Knight,"Action,Crime,Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,533.32


### Warmup 4: Which movies were released in 2020?

In [7]:
movies[movies["Year"] == 2020]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
998,Hamilton,"Biography, Drama, History",Thomas Kail,"Lin-Manuel Miranda, Phillipa Soo, Leslie Odom ...",2020,160,8.6,612.82
1000,Soorarai Pottru,Drama,Sudha Kongara,"Suriya, Madhavan, Paresh Rawal, Aparna Balamurali",2020,153,8.6,5.93
1022,Soul,"Animation, Adventure, Comedy",Pete Docter,"Kemp Powers, Jamie Foxx, Tina Fey, Graham Norton",2020,100,8.1,121.0
1031,Dil Bechara,"Comedy, Drama, Romance",Mukesh Chhabra,"Sushant Singh Rajput, Sanjana Sanghi, Sahil Va...",2020,101,7.9,263.61
1047,The Trial of the Chicago 7,"Drama, History, Thriller",Aaron Sorkin,"Eddie Redmayne, Alex Sharp, Sacha Baron Cohen,...",2020,129,7.8,0.12
1048,Druk,"Comedy, Drama",Thomas Vinterberg,"Mads Mikkelsen, Thomas Bo Larsen, Magnus Milla...",2020,117,7.8,21.71


### Warmup 5a: What does this function do?

In [8]:
def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else:                    # otherwise, assume millions.
        return float(revenue) * 1e6
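One way to answer the question is to call the function on one value per branch. The sketch below repeats the function so it runs on its own; the three inputs are hypothetical and were picked so that the floating-point arithmetic is exact.

```python
def format_revenue(revenue):
    if type(revenue) == float:   # already a number: returned unchanged
        return revenue
    elif revenue[-1] == 'M':     # string ending in "M": strip it, then scale
        return float(revenue[:-1]) * 1e6
    else:                        # plain numeric string: assumed to be in millions
        return float(revenue) * 1e6

# Hypothetical inputs, one per branch.
from_m_suffix = format_revenue("2.5M")
from_plain_string = format_revenue("4.0")
from_float = format_revenue(1.5)

print(from_m_suffix, from_plain_string, from_float)
```

Note that the first branch does no scaling, so a float input is treated as if it were already in dollars.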

### Warmup 5b: Using the above function, create a new column called "Revenue in dollars" by applying the appropriate conversion to the Revenue column.

In [9]:
movies["Revenue in dollars"] = movies["Revenue"].apply(format_revenue)
movies

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue,Revenue in dollars
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13,333130000.0
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M,126460000.0
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M,138120000.0
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32,270320000.0
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02,325020000.0
...,...,...,...,...,...,...,...,...,...
1063,Guardians of the Galaxy Vol. 2,"Action, Adventure, Comedy",James Gunn,"Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...",2017,136,7.6,389.81,389810000.0
1064,Baby Driver,"Action, Crime, Drama",Edgar Wright,"Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...",2017,113,7.6,107.83,107830000.0
1065,Only the Brave,"Action, Biography, Drama",Joseph Kosinski,"Josh Brolin, Miles Teller, Jeff Bridges, Jenni...",2017,134,7.6,18.34,18340000.0
1066,Incredibles 2,"Animation, Action, Adventure",Brad Bird,"Craig T. Nelson, Holly Hunter, Sarah Vowell, H...",2018,118,7.6,608.58,608580000.0


### Warmup 6: What are the top 10 highest-revenue movies?

In [10]:
movies.sort_values(by = "Revenue in dollars", ascending = False).head(10)

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue,Revenue in dollars
50,Star Wars: Episode VII - The Force Awakens,"Action,Adventure,Fantasy",J.J. Abrams,"Daisy Ridley, John Boyega, Oscar Isaac, Domhna...",2015,136,8.1,936.63,936630000.0
1006,Avengers: Endgame,"Action, Adventure, Drama",Anthony Russo,"Joe Russo, Robert Downey Jr., Chris Evans, Mar...",2019,181,8.4,858.37,858370000.0
87,Avatar,"Action,Adventure,Fantasy",James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...",2009,162,7.8,760.51,760510000.0
1007,Avengers: Infinity War,"Action, Adventure, Sci-Fi",Anthony Russo,"Joe Russo, Robert Downey Jr., Chris Hemsworth,...",2018,149,8.4,678.82,678820000.0
85,Jurassic World,"Action,Adventure,Sci-Fi",Colin Trevorrow,"Chris Pratt, Bryce Dallas Howard, Ty Simpkins,...",2015,124,7.0,652.18,652180000.0
76,The Avengers,"Action,Sci-Fi",Joss Whedon,"Robert Downey Jr., Chris Evans, Scarlett Johan...",2012,143,8.1,623.28,623280000.0
998,Hamilton,"Biography, Drama, History",Thomas Kail,"Lin-Manuel Miranda, Phillipa Soo, Leslie Odom ...",2020,160,8.6,612.82,612820000.0
1066,Incredibles 2,"Animation, Action, Adventure",Brad Bird,"Craig T. Nelson, Holly Hunter, Sarah Vowell, H...",2018,118,7.6,608.58,608580000.0
54,The Dark Knight,"Action,Crime,Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,533.32,533320000.0
12,Rogue One,"Action,Adventure,Sci-Fi",Gareth Edwards,"Felicity Jones, Diego Luna, Alan Tudyk, Donnie...",2016,133,7.9,532.17,532170000.0


### Warmup 7: Which short movies (below-average runtime) have the highest rating?

In [11]:
short_movies = movies[movies["Runtime"] < movies["Runtime"].mean()]
short_movies[short_movies["Rating"] == short_movies["Rating"].max()]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue,Revenue in dollars
96,Kimi no na wa,"Animation,Drama,Fantasy",Makoto Shinkai,"Ryûnosuke Kamiki, Mone Kamishiraishi, Ryô Nari...",2016,106,8.6,4.68,4680000.0
249,The Intouchables,"Biography,Comedy,Drama",Olivier Nakache,"François Cluzet, Omar Sy, Anne Le Ny, Audrey F...",2011,112,8.6,13.18,13180000.0


### Learning Objectives

- Make a request for data using requests.get(URL)
- Check the status of a request/response
- Extract the text of a response
- Create a json file from a response
- State and practice good etiquette when getting data

### Core Ideas:
 - Network structure
     - Client / server
     - Request / response
  
    ![Client_server.png](attachment:Client_server.png)
    
 - HTTP protocol
     - URL
     - Headers
     - Status Codes
 - The requests module
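To make the URL item concrete, here is a small sketch that splits the `hello.txt` address used later in this lecture into its parts with the standard library's `urllib.parse`. It only inspects the string and sends no request.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://cs220.cs.wisc.edu/hello.txt")

print(parts.scheme)   # protocol the client should speak
print(parts.netloc)   # server (host) to contact
print(parts.path)     # resource being requested on that server
```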

## HTTP Status Codes you need to know
- 200: success
- 404: not found

Here is a list of all status codes (you do NOT need to memorize it): https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
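If you forget what a code means, Python's standard library can tell you. The sketch below uses `http.HTTPStatus` for the official phrase; `status_family` is a hypothetical helper, not part of any library, that groups codes by their first digit.

```python
from http import HTTPStatus

def status_family(code):
    """Classify a status code by its hundreds digit (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"

for code in (200, 403, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_family(code))
```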

## requests.get : Simple string example
- URL: https://cs220.cs.wisc.edu/hello.txt

In [17]:
url = "https://cs220.cs.wisc.edu/hello.txt"
r = requests.get(url) # r is the response
print(type(r.status_code))
print(r.text)

<class 'int'>
Hello CS220 / CS319 students! Welcome to my website. Hope you are staying safe and healthy!



In [13]:
# Q: What if the page we ask for does not exist?
typo_url = "https://cs220.cs.wisc.edu/hello.txttttt"
r = requests.get(typo_url)
print(r.status_code)
print(r.text)

# A: This server responds with 403 (Forbidden)
# The most common error that you will encounter is 404 (Not Found)

403
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>403 Forbidden</h1>
<ul>
<li>Code: AccessDenied</li>
<li>Message: Access Denied</li>
<li>RequestId: J2691NM08FYT831Q</li>
<li>HostId: OdPnBwKeV2m3q3Ks5aZmoSup1a24627wGkRzrHzt8KjWCNlR1F99bZ5PRhiv+LKfVI+zCGfHQcQ=</li>
</ul>
<h3>An Error Occurred While Attempting to Retrieve a Custom Error Document</h3>
<ul>
<li>Code: AccessDenied</li>
<li>Message: Access Denied</li>
</ul>
<hr/>
</body>
</html>



In [14]:
# We can check for a status_code error by using an assert
typo_url = "https://cs220.cs.wisc.edu/hello.txttttt"
r = requests.get(typo_url)
assert r.status_code == 200
print(r.status_code)
print(r.text)

AssertionError: 

In [None]:
# Instead of using an assert, we often use raise_for_status()
r = requests.get(typo_url)
r.raise_for_status() #similar to asserting r.status_code == 200
r.text

# Note the error you get.... We will use this in the next cell

In [15]:
# Let's try to catch that error

try:
    r = requests.get(typo_url)
    r.raise_for_status() # similar to asserting r.status_code == 200
    print(r.text)
except requests.exceptions.HTTPError as e: # the exception's full name includes its module
    print("oops!!", e)

oops!! 403 Client Error: Forbidden for url: https://cs220.cs.wisc.edu/hello.txttttt


In [17]:
# We need to prefix the name of an exception with the name of its module;
# a bare HTTPError would raise a NameError because that name was never imported.
# requests.HTTPError is a shorter alias for requests.exceptions.HTTPError.

try:
    r = requests.get(typo_url)
    r.raise_for_status() # similar to asserting r.status_code == 200
    print(r.text)
except requests.HTTPError as e: # shorter, equivalent way to catch the error
    print("oops!!", e)

oops!! 403 Client Error: Forbidden for url: https://cs220.cs.wisc.edu/hello.txttttt
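The same pattern can be tried without touching the network. In the sketch below the response is built by hand, which is for illustration only (normally `requests.get` creates it for you); the status code, reason, and URL are copied from the output above.

```python
import requests

# Hand-built response so the example runs offline (illustration only).
fake = requests.Response()
fake.status_code = 403
fake.reason = "Forbidden"
fake.url = "https://cs220.cs.wisc.edu/hello.txttttt"

try:
    fake.raise_for_status()      # raises because 403 is in the 4xx range
    outcome = "ok"
except requests.HTTPError as e:
    outcome = "HTTP error: {}".format(e)
print(outcome)

# Both names refer to the same exception class.
same_class = requests.HTTPError is requests.exceptions.HTTPError
print(same_class)
```

`raise_for_status()` only raises for 4xx and 5xx codes, so it is a little more lenient than asserting that the code equals 200.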


## requests.get : JSON file example
- URL: https://cs220.cs.wisc.edu/scores.json
- `json.load` (FILE_OBJECT)
- `json.loads` (STRING)
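A quick sketch of the difference between the two functions. The JSON text mirrors the scores file, and `io.StringIO` stands in for an open file so that nothing has to exist on disk.

```python
import io
import json

text = '{"alice": 100, "bob": 200, "cindy": 300}'

# json.loads parses a string ("s" for string).
from_string = json.loads(text)

# json.load parses a file object; io.StringIO plays the role of an open file here.
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
print(from_string["bob"])
```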

In [18]:
# GETting a JSON file, the long way
url = "https://cs220.cs.wisc.edu/scores.json"
r = requests.get(url)
r.raise_for_status()
urltext = r.text
print(urltext)
d = json.loads(urltext)
print(type(d), d)

{
  "alice": 100,
  "bob": 200,
  "cindy": 300
}

<class 'dict'> {'alice': 100, 'bob': 200, 'cindy': 300}


In [19]:
# GETting a JSON file, the shortcut way
url = "https://cs220.cs.wisc.edu/scores.json"
#Shortcut to bypass using json.loads()
r = requests.get(url)
r.raise_for_status()
d2 = r.json()
print(type(d2), d2)

<class 'dict'> {'alice': 100, 'bob': 200, 'cindy': 300}
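Once the response has been parsed, it can be saved as a JSON file so later runs do not need to contact the server again. This is a minimal sketch: the `scores` dict stands in for the value returned by `r.json()`, and the file name and temporary folder are arbitrary choices made for the example.

```python
import json
import os
import tempfile

# Stand-in for the dict returned by r.json().
scores = {"alice": 100, "bob": 200, "cindy": 300}

# A temporary directory keeps the example from cluttering your working folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scores.json")

    # json.dump writes a Python object to an open file as JSON text.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(scores, f, indent=2)

    # json.load reads it back.
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == scores)
```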


## Good GET Etiquette

Don't make a lot of requests to the same server all at once.
 - Requests use up the server's time
 - Major websites will often ban users who make too many requests
 - Too many requests can bring a server down, much like a DDoS attack (DON'T DO THIS)
 
In CS220 we will usually give you a link to a copied file to avoid overloading the site.
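If you ever do need several files from one server, a simple courtesy is to pause between requests. The helper below is a hypothetical sketch, not a course-provided function: `fetch_politely` and its `delay_seconds` parameter are names invented here, and a stand-in fetch function is used so that the example sends nothing over the network.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)   # give the server a break
        results.append(fetch(url))
    return results

# Stand-in for a real download, so the demonstration stays offline.
def fake_fetch(url):
    return "contents of " + url

pages = fetch_politely(["a.txt", "b.txt"], fake_fetch, delay_seconds=0.1)
print(pages)
```

With real requests you could pass a small function that calls `requests.get(url)`, checks `raise_for_status()`, and returns `r.text`.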


### Explore real-world JSON

How to explore an unknown JSON?
- If you run into a `dict`, use the `.keys()` method to list its keys, then look up individual keys to explore further
- If you run into a `list`, iterate over the list and print each item
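The two rules above can be sketched as one small recursive function. `outline` is a hypothetical helper written for this example, and the `sample` dict is a made-up miniature of the weather data. For lists it only looks at the first item, which is usually enough to see the pattern.

```python
def outline(value, name="root", depth=0, max_depth=3):
    """Return one line per nested value: its name, type, and keys or length."""
    pad = "  " * depth
    if isinstance(value, dict):
        lines = [f"{pad}{name}: dict, keys={list(value.keys())}"]
        children = list(value.items())
    elif isinstance(value, list):
        lines = [f"{pad}{name}: list, length={len(value)}"]
        children = [("[0]", value[0])] if value else []
    else:
        return [f"{pad}{name}: {type(value).__name__}"]
    if depth < max_depth:               # stop before the output gets too long
        for child_name, child in children:
            lines.extend(outline(child, child_name, depth + 1, max_depth))
    return lines

# Made-up miniature of a nested JSON response.
sample = {"properties": {"periods": [{"name": "Today", "temperature": 34}]}}
lines = outline(sample)
for line in lines:
    print(line)
```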

### Weather for UW-Madison campus
- URL: https://api.weather.gov/gridpoints/MKX/37,63/forecast

In [20]:
# TODO: GET the forecast
url = "https://api.weather.gov/gridpoints/MKX/37,63/forecast"
r = requests.get(url)
r.raise_for_status()
weather_data = r.json()

# TODO: explore the type of the data structure 
print(type(weather_data))

# display the data
# weather_data # uncomment to see the whole JSON

<class 'dict'>


In [21]:
# TODO: display the keys of the weather_data dict
print(list(weather_data.keys()))

# TODO: lookup the value corresponding to the 'properties'
weather_data["properties"]

# TODO: you know what to do next ... explore type again
print(type(weather_data["properties"]))

['@context', 'type', 'geometry', 'properties']
<class 'dict'>


In [22]:
# TODO: display the keys of the properties dict
print(list(weather_data["properties"].keys()))

# TODO: lookup the value corresponding to the 'periods'
# weather_data["properties"]["periods"] # uncomment to see the output

# TODO: you know what to do next ... explore type again
print(type(weather_data["properties"]["periods"]))

['updated', 'units', 'forecastGenerator', 'generatedAt', 'updateTime', 'validTimes', 'elevation', 'periods']
<class 'list'>


In [23]:
# TODO: extract periods list into a variable
periods_list = weather_data["properties"]["periods"]

# TODO: create a DataFrame using periods_list
# TODO: What does each inner data structure represent in your DataFrame?
#       Keep in mind that the outer data structure is a list.
#       A. Each inner dict becomes one row, and its keys become the column names
periods_df = DataFrame(periods_list)
periods_df

Unnamed: 0,number,name,startTime,endTime,isDaytime,temperature,temperatureUnit,temperatureTrend,windSpeed,windDirection,icon,shortForecast,detailedForecast
0,1,Today,2022-11-16T08:00:00-06:00,2022-11-16T18:00:00-06:00,True,34,F,,5 to 10 mph,NW,"https://api.weather.gov/icons/land/day/snow,60...",Snow Showers Likely,"Snow showers likely. Cloudy, with a high near ..."
1,2,Tonight,2022-11-16T18:00:00-06:00,2022-11-17T06:00:00-06:00,False,23,F,,5 to 10 mph,W,https://api.weather.gov/icons/land/night/snow?...,Chance Snow Showers,A chance of snow showers after 9pm. Mostly clo...
2,3,Thursday,2022-11-17T06:00:00-06:00,2022-11-17T18:00:00-06:00,True,28,F,,10 to 15 mph,W,"https://api.weather.gov/icons/land/day/snow,30...",Chance Snow Showers,"A chance of snow showers. Mostly cloudy, with ..."
3,4,Thursday Night,2022-11-17T18:00:00-06:00,2022-11-18T06:00:00-06:00,False,17,F,,15 mph,W,https://api.weather.gov/icons/land/night/bkn?s...,Mostly Cloudy,"Mostly cloudy, with a low around 17. West wind..."
4,5,Friday,2022-11-18T06:00:00-06:00,2022-11-18T18:00:00-06:00,True,23,F,,15 mph,W,https://api.weather.gov/icons/land/day/bkn?siz...,Mostly Cloudy,"Mostly cloudy, with a high near 23. West wind ..."
5,6,Friday Night,2022-11-18T18:00:00-06:00,2022-11-19T06:00:00-06:00,False,12,F,,15 mph,SW,https://api.weather.gov/icons/land/night/bkn?s...,Mostly Cloudy,"Mostly cloudy, with a low around 12. Southwest..."
6,7,Saturday,2022-11-19T06:00:00-06:00,2022-11-19T18:00:00-06:00,True,23,F,,15 mph,W,https://api.weather.gov/icons/land/day/bkn/sno...,Mostly Cloudy then Slight Chance Snow Showers,A slight chance of snow showers after noon. Mo...
7,8,Saturday Night,2022-11-19T18:00:00-06:00,2022-11-20T06:00:00-06:00,False,7,F,,10 to 15 mph,W,https://api.weather.gov/icons/land/night/cold?...,Mostly Cloudy,"Mostly cloudy, with a low around 7. West wind ..."
8,9,Sunday,2022-11-20T06:00:00-06:00,2022-11-20T18:00:00-06:00,True,23,F,,10 mph,W,https://api.weather.gov/icons/land/day/few?siz...,Sunny,"Sunny, with a high near 23."
9,10,Sunday Night,2022-11-20T18:00:00-06:00,2022-11-21T06:00:00-06:00,False,15,F,rising,10 mph,SW,https://api.weather.gov/icons/land/night/sct?s...,Partly Cloudy,"Partly cloudy. Low around 15, with temperature..."
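The list-of-dicts to DataFrame conversion is easier to see on a tiny example. The two dicts below are a made-up, trimmed-down version of the periods list.

```python
import pandas as pd

# Each dict is one row; the keys become the column names.
tiny_periods = [
    {"name": "Today", "temperature": 34},
    {"name": "Tonight", "temperature": 23},
]
tiny_df = pd.DataFrame(tiny_periods)

print(tiny_df.shape)            # (rows, columns)
print(list(tiny_df.columns))
```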


#### What are the maximum and minimum observed temperatures? Include the temperatureUnit in your display

In [24]:
min_temp = periods_df["temperature"].min()
idx_min = periods_df["temperature"].idxmin()
min_unit = periods_df.loc[idx_min, "temperatureUnit"]

max_temp = periods_df["temperature"].max()
idx_max = periods_df["temperature"].idxmax()
max_unit = periods_df.loc[idx_max, "temperatureUnit"]

print("Minimum observed temperature is: {} degree {}".format(min_temp, min_unit))
print("Maximum observed temperature is: {} degree {}".format(max_temp, max_unit))

Minimum observed temperature is: 7 degree F
Maximum observed temperature is: 36 degree F


#### Which days' `detailedForecast` contains `snow`?

In [25]:
# Which periods mention "snow" in the detailed forecast? (str.contains is case-sensitive)
snow_days_df = periods_df[periods_df["detailedForecast"].str.contains("snow")]
snow_days_df

Unnamed: 0,number,name,startTime,endTime,isDaytime,temperature,temperatureUnit,temperatureTrend,windSpeed,windDirection,icon,shortForecast,detailedForecast
0,1,Today,2022-11-16T08:00:00-06:00,2022-11-16T18:00:00-06:00,True,34,F,,5 to 10 mph,NW,"https://api.weather.gov/icons/land/day/snow,60...",Snow Showers Likely,"Snow showers likely. Cloudy, with a high near ..."
1,2,Tonight,2022-11-16T18:00:00-06:00,2022-11-17T06:00:00-06:00,False,23,F,,5 to 10 mph,W,https://api.weather.gov/icons/land/night/snow?...,Chance Snow Showers,A chance of snow showers after 9pm. Mostly clo...
2,3,Thursday,2022-11-17T06:00:00-06:00,2022-11-17T18:00:00-06:00,True,28,F,,10 to 15 mph,W,"https://api.weather.gov/icons/land/day/snow,30...",Chance Snow Showers,"A chance of snow showers. Mostly cloudy, with ..."
6,7,Saturday,2022-11-19T06:00:00-06:00,2022-11-19T18:00:00-06:00,True,23,F,,15 mph,W,https://api.weather.gov/icons/land/day/bkn/sno...,Mostly Cloudy then Slight Chance Snow Showers,A slight chance of snow showers after noon. Mo...


In [26]:
snow_days_df["name"]

0       Today
1     Tonight
2    Thursday
6    Saturday
Name: name, dtype: object
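Because `str.contains` is case-sensitive by default, a forecast that only says "Snow" with a capital S would be missed. A minimal sketch with three made-up forecast strings shows the `case=False` option:

```python
import pandas as pd

# Made-up forecast texts for illustration.
forecasts = pd.Series(["Snow showers likely.", "A chance of snow.", "Sunny."])

case_sensitive = forecasts.str.contains("snow")
case_insensitive = forecasts.str.contains("snow", case=False)

print(list(case_sensitive))
print(list(case_insensitive))
```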

#### Which day's `detailedForecast` has the longest description?

In [27]:
idx_max_desc = periods_df["detailedForecast"].str.len().idxmax()
# idxmax returns an index label, so look it up with .loc (label-based) rather than .iloc
periods_df.loc[idx_max_desc, 'name']

'Thursday'

In [28]:
# What was that forecast?
periods_df.loc[idx_max_desc, 'detailedForecast']

'A chance of snow showers. Mostly cloudy, with a high near 28. West wind 10 to 15 mph, with gusts as high as 25 mph. Chance of precipitation is 30%. New snow accumulation of less than half an inch possible.'
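A note on looking up the result of `idxmax`: it returns an index label, not a position. With the default 0, 1, 2, ... index the two coincide, but they differ as soon as the index is anything else. The sketch below uses a made-up DataFrame with labels 10, 20, 30 to show the label-based lookup.

```python
import pandas as pd

# Made-up data with a non-default index so labels and positions differ.
demo_df = pd.DataFrame(
    {
        "name": ["Today", "Tonight", "Thursday"],
        "detailedForecast": ["Sunny.", "Mostly cloudy, with a low around 23.", "Snow."],
    },
    index=[10, 20, 30],
)

label = demo_df["detailedForecast"].str.len().idxmax()
longest_name = demo_df.loc[label, "name"]   # label-based lookup

print(label)
print(longest_name)
# demo_df.iloc[label] would raise an IndexError here: there is no row at position 20.
```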

### Write it out to a CSV file on your drive
You now have your own copy!

In [29]:
# Write it all out to a single CSV file
periods_df.to_csv("campus_weather.csv", index=False)

### Other Cool APIs

- City of Madison Transit: http://transitdata.cityofmadison.com/
- Reddit: https://reddit.com/r/UWMadison.json
- Lord of the Rings: https://the-one-api.dev/
- Pokemon: https://pokeapi.co/

Remember: Be judicious when making requests; don't overwhelm the server! :)

## Next Time
What other documents can we get via the Web? HTML is very popular! We'll explore this.