In [1]:
# Warmup 1: Go back to the code link for Monday's Lecture (lec 30)
# Take a look at the examples at the end of the lesson 
#     how to bucketize a DataFrame
#     how to apply a function to every element of a column
#     how to make a new column

# Wednesday, November 17
## Web 1 - How to get data from the 'Net

Core ideas: (thanks to instructor Alexi Brooks)
 - Network structure
     - IP addresses
     - host/domain names
     - client/server
     - request/response
 - HTTP protocol
     - URL
     - GET/POST
     - headers
     - status codes
 - The requests module
     - Etiquette
     - requests.get
     - requests.post

![Screen%20Shot%202021-11-15%20at%202.00.47%20PM.png](attachment:Screen%20Shot%202021-11-15%20at%202.00.47%20PM.png)

## HTTP Status Codes overview
- 1XX : Informational
- 2XX : Successful
- 3XX : Redirection
- 4XX : Client Error
- 5XX : Server Error

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
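
As a small illustration of the ranges above, the sketch below maps a status code to its category by looking at the first digit. The function name `status_category` is made up for this example; it is not part of `requests`.

```python
def status_category(status_code):
    """Return the HTTP status class for a numeric status code."""
    categories = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    # Integer division by 100 keeps only the first digit (404 -> 4)
    return categories.get(status_code // 100, "Unknown")


for code in [200, 301, 404, 503]:
    print(code, status_category(code))
```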

In [3]:
import requests
import json
from pandas import DataFrame

## requests.get : Simple string example
- URL: https://www.msyamkumar.com/hello.txt

In [4]:
url = "https://www.msyamkumar.com/hello.txt"
r = requests.get(url)
print(r.status_code)
print(r.text)

200
Hello CS220 / CS319 students! Welcome to my website. Hope you are staying safe and healthy!



In [5]:
# Q: What if the file does not exist on the web site?
typo_url = "https://www.msyamkumar.com/hello.txtt"
r = requests.get(typo_url)
print(r.status_code)
print(r.text)

# A: We get a 404 status code, and the text is an error message from the server, not the file we asked for

404
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>meena/hello.txtt</Key><RequestId>PFB7WGNRWABBG0KE</RequestId><HostId>dyjKSgG4dYo2QcOcR/xN3uhowTgo/Ejg13MGIP3fVCP7iX4MRj8d2v1DQ6J4RKPXcABlaguwq5U=</HostId></Error>


In [6]:
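# We can check for a status_code error by using an assert
typo_url = "https://www.msyamkumar.com/hello.txtt"
#r = requests.get(typo_url)
#assert r.status_code == 200  # fails with an AssertionError, because the status code is 404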


In [8]:
# instead of using an assert, we often use raise_for_status()
#r = requests.get(typo_url)
#r.raise_for_status() # raises an error if the status code is 4XX or 5XX
#r.text

# Note the error you get....we will use this in the next cell

In [None]:
#Let's try to catch that error

#try:

#except:

In [9]:
# we often need to prefix the name of an exception with the name of its module
# fix the error from above
#try:

#except:
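
A minimal sketch of the pattern these cells are building toward, assuming the error raised by `raise_for_status()` is `requests.HTTPError`. To avoid a live request, the example builds a `requests.Response` by hand and sets its status code to 404; in the lecture code the response would come from `requests.get(typo_url)` instead. The helper name `check_response` is hypothetical.

```python
import requests


def check_response(resp):
    """Return "ok" if the response succeeded, else a short error message."""
    try:
        resp.raise_for_status()
        return "ok"
    except requests.HTTPError as e:  # note the module name in front of HTTPError
        return "request failed: " + str(e)


# Hand-built response for illustration only (no network access needed)
fake_response = requests.Response()
fake_response.status_code = 404

print(check_response(fake_response))
```

Writing `except HTTPError:` without the `requests.` prefix would fail with a `NameError`, because the name `HTTPError` was never imported on its own.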


## requests.get : JSON file example
- URL: https://www.msyamkumar.com/scores.json
- `json.load` (FILE_OBJECT)
- `json.loads` (STRING)
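
The difference between the two functions can be seen without any network access. In this sketch the scores are made up, and `io.StringIO` stands in for an open file.

```python
import io
import json

json_text = '{"alice": 100, "bob": 78}'  # made-up scores for illustration

# json.loads parses a STRING
from_string = json.loads(json_text)

# json.load parses a FILE OBJECT; io.StringIO stands in for an open file
fake_file = io.StringIO(json_text)
from_file = json.load(fake_file)

print(type(from_string), from_string)
print(from_string == from_file)
```

Since `r.text` is a string, `json.loads` is the one that applies to a response from `requests.get`.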

In [11]:
# GETting a JSON file, the long way
url = "https://www.msyamkumar.com/scores.json"
r = requests.get(url)
r.raise_for_status()
#urltext = r.text
#print(urltext)
#d = json.loads(urltext)
#print(type(d), d)

In [12]:
# GETting a JSON file, the shortcut way
url = "https://www.msyamkumar.com/scores.json"
#Shortcut to bypass using json.loads()
#r = requests.get(url)
#r.raise_for_status()
#d2 = r.json()
#print(type(d2), d2)

## Good GET Etiquette

Don't make a lot of requests to the same server all at once.
 - Requests use up the server's time
 - Major websites will often ban users who make too many requests
 - You can break a server, similar to a DDoS attack (DON'T DO THIS)
 
In CS220 we will usually give you a link to a copied file to avoid overloading the site.
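
One common way to follow this etiquette is to pause between requests. The sketch below is only an illustration: `fetch_politely`, its `fetch` parameter, and `fake_fetch` are hypothetical names, and the right delay depends on the site you are visiting.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # give the server a break between requests
        results.append(fetch(url))
    return results


# Stand-in fetch function so the sketch runs without touching a real server
def fake_fetch(url):
    return "contents of " + url


pages = fetch_politely(["a.json", "b.json"], fake_fetch, delay_seconds=0.1)
print(pages)
```

With real URLs, `requests.get` could be passed as the `fetch` argument.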


## DEMO 1: reddit json processing
- URL: https://www.reddit.com/r/UWMadison.json or https://www.msyamkumar.com/cs220/f21/materials/lectureDemo_code/lec-31/examples/UWMadison.json

THE FIRST LINK IS TO A LIVE WEBPAGE - Review requests etiquette before running!

![Screen%20Shot%202021-11-15%20at%202.47.26%20PM.png](attachment:Screen%20Shot%202021-11-15%20at%202.47.26%20PM.png)

In [39]:
#url = "https://www.reddit.com/r/UWMadison.json"
url = "https://www.msyamkumar.com/cs220/f21/materials/lectureDemo_code/lec-31/examples/UWMadison.json"
r = requests.get(url)
r.raise_for_status()
d = r.json()
print(type(d))


<class 'dict'>


### How to explore an unknown JSON?
- If you run into a dict, call its .keys() method to look at the keys of the dictionary
- If you run into a list, iterate over the list and print each item
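
The two tips above can be wrapped in a small helper. This is a sketch, not a standard tool: `summarize` is a made-up name, and `sample` is invented data shaped like the output shown in this section.

```python
def summarize(obj):
    """Return a one-line description of a JSON-like value."""
    if isinstance(obj, dict):
        return "dict with keys " + str(list(obj.keys()))
    if isinstance(obj, list):
        return "list of length " + str(len(obj))
    return type(obj).__name__


# Made-up data for illustration
sample = {"kind": "Listing", "data": {"children": [{}, {}], "after": None}}

print(summarize(sample))
print(summarize(sample["data"]))
print(summarize(sample["data"]["children"]))
print(summarize(sample["data"]["after"]))
```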

In [44]:
print(d.keys())
print(type(d['data']))
print(d['data'].keys())
print(type(d['data']['children']))
print(len(d['data']['children']))

dict_keys(['kind', 'data'])
<class 'dict'>
dict_keys(['modhash', 'dist', 'children', 'after', 'before'])
<class 'list'>
25


In [41]:
for item in d['data']['children']:
    print(type(item))
len(d)

<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>


2

In [13]:
# write a for loop to iterate over the children
   
    # inside the loop, print out the score and the title


## DEMO 2: State populations
- URL: https://www.msyamkumar.com/cs220/f21/materials/lectureDemo_code/lec-31/examples/data/state_files.txt

Challenge problem: Each line in state_files.txt contains the name of a .json file (in the same directory on the server). Using `get` requests, load the contents of all of the json files and make one combined DataFrame with all of them. You will probably need to explore the data!

![Screen%20Shot%202021-11-15%20at%202.48.42%20PM.png](attachment:Screen%20Shot%202021-11-15%20at%202.48.42%20PM.png)

In [59]:
prefix_url = "https://www.msyamkumar.com/cs220/f21/materials/lectureDemo_code/lec-31/examples/data/"
r = requests.get(prefix_url + "state_files.txt")
r.raise_for_status()
#r.text
# get a list of all files

# put into a list
file_names = r.text.split("\n")


# make a list of lists, each row is one list
all_state_info = []

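# iterate through all file_names
for state in file_names:
    state_url = prefix_url + state
    #print(state_url)
    # get the request, store in a dict
    r = requests.get(state_url)
    r.raise_for_status()
    state_dict = r.json()
    state_dict["Name"] = state[:-5]  # the file name without ".json"
    #print(state, state_dict)
    all_state_info.append(state_dict)
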
all_state_info[:5]

[{'2000': 4447100, '2010': 4779736, '2015': 4846411, 'Name': 'Alabama'},
 {'2000': 626932, '2010': 710231, '2015': 737046, 'Name': 'Alaska'},
 {'2000': 5130632, '2010': 6392017, '2015': 6728783, 'Name': 'Arizona'},
 {'2000': 2673400, '2010': 2915918, '2015': 2966835, 'Name': 'Arkansas'},
 {'2000': 33871648, '2010': 37253956, '2015': 38792291, 'Name': 'California'}]
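
One thing to watch for when splitting a downloaded text file on `"\n"`: if the text ends with a newline, the last item in the list is an empty string. Whether the real state_files.txt ends that way is not shown here, so the text below is made up for illustration.

```python
# Made-up text standing in for r.text; note the trailing newline
text = "Alabama.json\nAlaska.json\nArizona.json\n"

# split("\n") leaves an empty string after the final newline
print(text.split("\n"))

# Keep only non-empty names, with surrounding whitespace removed
file_names = [line.strip() for line in text.splitlines() if line.strip()]
print(file_names)

# Dropping the last 5 characters removes the ".json" extension
state_names = [name[:-5] for name in file_names]
print(state_names)
```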

In [62]:
# you can make a DataFrame from a list of dicts
state_df = DataFrame(all_state_info)
#state_df

# the JSON files do not include the state name, so the loop above must add it:
# state_dict["Name"] = state[:-5]  # the file name without ".json"

# set the index to be Name
state_df = state_df.set_index("Name")
state_df

Name,2000,2010,2015
Alabama,4447100,4779736,4846411
Alaska,626932,710231,737046
Arizona,5130632,6392017,6728783
Arkansas,2673400,2915918,2966835
California,33871648,37253956,38792291
Colorado,4301261,5029196,5355588
Connecticut,3405565,3574097,3594762
Delaware,783600,897934,935968
Florida,15982378,18801310,19905569
Georgia,8186453,9687653,10097132


In [64]:
# bonus....  Write it all out to a single CSV file
# use the code from the previous lecture to write a DF to a csv
# but in this case we WANT the index (the state names) in our CSV
#state_df.to_csv("state_populations.csv", index = False) # not this way: it drops the index
state_df.to_csv("state_populations.csv") # this way...include the index in the CSV
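
To see what the `index` argument changes, here is a sketch that writes to in-memory buffers instead of a file on disk. It uses two rows copied from the output above, and `import pandas as pd` rather than the `from pandas import DataFrame` used earlier in this notebook.

```python
import io

import pandas as pd

# Two rows copied from the output above
rows = [
    {"2000": 4447100, "2010": 4779736, "2015": 4846411, "Name": "Alabama"},
    {"2000": 626932, "2010": 710231, "2015": 737046, "Name": "Alaska"},
]
state_df = pd.DataFrame(rows).set_index("Name")

# Write to in-memory buffers instead of files on disk
with_index = io.StringIO()
state_df.to_csv(with_index)  # the index is included by default
without_index = io.StringIO()
state_df.to_csv(without_index, index=False)  # the state names are lost

print(with_index.getvalue())
print(without_index.getvalue())

# Reading back: tell pandas which column holds the index
with_index.seek(0)
restored = pd.read_csv(with_index, index_col="Name")
print(restored)
```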

## DEMO 3: Madison Bus Alerts

