# Introduction to Data Analysis with Python III


<img src="https://www.python.org/static/img/python-logo.png" alt="yogen" style="width: 200px; float: right;"/>
<br>
<br>
<br>
<img src="../assets/yogen-logo.png" alt="yogen" style="width: 200px; float: right;"/>

# Web APIs

An API, or application programming interface, is the way programs communicate with one another.

Web APIs are the way programs communicate with one another _over the internet_.

[RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) APIs respect a series of design principles that make them simple to use.

The basic tools we are going to use are `GET` and `POST` requests to URLs we'll specify, and JSON objects that we'll either receive as a response or send as a payload (in a `POST` request, for example).

In [2]:
import requests

response = requests.get('https://www.elpais.com')
response.status_code

200

In [3]:
response.content[:500]

b'<!DOCTYPE html><html lang="es-ES"><head><link rel="preconnect" href="//static.elpais.com"/><link rel="preconnect" href="//ep00.epimg.net"/><link rel="preconnect" href="//imagenes.elpais.com"/><link rel="preconnect" href="//static-sandbox.elpais.com"/><link rel="preconnect" href="//sdk.privacy-center.org"/><link rel="preconnect" href="//sdk-gcp.privacy-center.org"/><link rel="preload" as="script" href="//ep00.epimg.net/js/prisa/user.min.js?i=1"/><link rel="preconnect" href="//www.googletagservice'

In [5]:
type(response.content)

bytes

In [6]:
response.text[:500]

'<!DOCTYPE html><html lang="es-ES"><head><link rel="preconnect" href="//static.elpais.com"/><link rel="preconnect" href="//ep00.epimg.net"/><link rel="preconnect" href="//imagenes.elpais.com"/><link rel="preconnect" href="//static-sandbox.elpais.com"/><link rel="preconnect" href="//sdk.privacy-center.org"/><link rel="preconnect" href="//sdk-gcp.privacy-center.org"/><link rel="preload" as="script" href="//ep00.epimg.net/js/prisa/user.min.js?i=1"/><link rel="preconnect" href="//www.googletagservice'
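The two attributes hold the same page in different forms: `content` is the raw bytes the server sent, while `text` is those bytes decoded into a `str`. A minimal sketch of that relationship using only built-in types (the sample word is an illustrative choice, and UTF-8 is assumed as the encoding):

```python
# Illustrative sample; any text with a non-ASCII character works
word = 'España'

raw = word.encode('utf-8')      # str -> bytes, comparable to response.content
decoded = raw.decode('utf-8')   # bytes -> str, comparable to response.text

print(type(raw), len(raw))          # 'ñ' takes two bytes in UTF-8
print(type(decoded), len(decoded))
```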

This is an API that returns the current position of the ISS:

http://api.open-notify.org/

In [9]:
response = requests.get('http://api.open-notify.org/iss-now.json')
response

<Response [200]>

In [10]:
response.text

'{"timestamp": 1643991895, "iss_position": {"latitude": "48.9847", "longitude": "-42.7074"}, "message": "success"}'

In [11]:
response.text[:10]

'{"timestam'

We can convert a json-formatted string such as the one we get in the response into a Python object with the json library:

In [19]:
import json


iss = json.loads(response.text)

In [20]:
type(iss)

dict

In [21]:
iss['iss_position']['longitude']

'-42.7074'
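Note that the coordinates come back as strings, not numbers. A short sketch of converting them, using the response text shown above as a fixed sample so that it runs without a network call (the variable names are illustrative):

```python
import json

# Sample payload copied from the response shown above
sample = ('{"timestamp": 1643991895, '
          '"iss_position": {"latitude": "48.9847", "longitude": "-42.7074"}, '
          '"message": "success"}')

iss_sample = json.loads(sample)
position = iss_sample['iss_position']

# Cast the string values to floats before doing any arithmetic with them
latitude = float(position['latitude'])
longitude = float(position['longitude'])
print(latitude, longitude)
```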

We can also go in the other direction and generate json-formatted strings from Python objects:

In [23]:
teacher = {'name' : 'Daniel',
           'surname': 'Mateos',
          'employed': True}
json.dumps(teacher)

'{"name": "Daniel", "surname": "Mateos", "employed": true}'

### Exercise

Write a function that returns the name and the number of crewed craft that are in space using the 'number of people in space' API.

In [43]:
def space_people():
    response = requests.get('http://api.open-notify.org/astros.json')
    data = json.loads(response.text)
    # 'number' is an integer, so convert it before concatenating
    print('The total number of people in space is: ' + str(data['number']))
    # data['people'] is a list of dicts, each with 'name' and 'craft' keys
    for person in data['people']:
        print(person['name'] + ' is aboard the craft: ' + person['craft'])

space_people()

In [50]:
response = requests.get('http://api.open-notify.org/astros.json')
people = json.loads(response.text)['people']
people

[{'craft': 'ISS', 'name': 'Mark Vande Hei'},
 {'craft': 'ISS', 'name': 'Pyotr Dubrov'},
 {'craft': 'ISS', 'name': 'Anton Shkaplerov'},
 {'craft': 'Shenzhou 13', 'name': 'Zhai Zhigang'},
 {'craft': 'Shenzhou 13', 'name': 'Wang Yaping'},
 {'craft': 'Shenzhou 13', 'name': 'Ye Guangfu'},
 {'craft': 'ISS', 'name': 'Raja Chari'},
 {'craft': 'ISS', 'name': 'Tom Marshburn'},
 {'craft': 'ISS', 'name': 'Kayla Barron'},
 {'craft': 'ISS', 'name': 'Matthias Maurer'}]

In [53]:
import pandas as pd

craft = pd.DataFrame(people).groupby('craft').count().index
craft

Index(['ISS', 'Shenzhou 13'], dtype='object', name='craft')

In [58]:
set([persons['craft'] for persons in people])

{'ISS', 'Shenzhou 13'}

In [61]:
def craft_names():
    response = requests.get('http://api.open-notify.org/astros.json')
    people = json.loads(response.text)['people']
    
    craft = set([persons['craft'] for persons in people])
    
    return len(craft), craft

craft_names()

(2, {'ISS', 'Shenzhou 13'})
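If we also want to know how many people are aboard each craft, `collections.Counter` is one option. A small sketch, using the `people` list shown earlier as a fixed sample so that it runs offline (the variable names are illustrative):

```python
from collections import Counter

# Sample copied from the `people` list shown earlier
people_sample = [
    {'craft': 'ISS', 'name': 'Mark Vande Hei'},
    {'craft': 'ISS', 'name': 'Pyotr Dubrov'},
    {'craft': 'ISS', 'name': 'Anton Shkaplerov'},
    {'craft': 'Shenzhou 13', 'name': 'Zhai Zhigang'},
    {'craft': 'Shenzhou 13', 'name': 'Wang Yaping'},
    {'craft': 'Shenzhou 13', 'name': 'Ye Guangfu'},
    {'craft': 'ISS', 'name': 'Raja Chari'},
    {'craft': 'ISS', 'name': 'Tom Marshburn'},
    {'craft': 'ISS', 'name': 'Kayla Barron'},
    {'craft': 'ISS', 'name': 'Matthias Maurer'},
]

# Count one occurrence per person, keyed by the craft they are aboard
crew_per_craft = Counter(person['craft'] for person in people_sample)
print(len(crew_per_craft), dict(crew_per_craft))
```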

#### Exercise:

https://agify.io/ hosts an API that estimates the age of a person based on their name.  

Write a function that wraps the API.

In [None]:
# Example request URL: https://api.agify.io?name=michae

In [62]:
def age_of(name):
    response = requests.get('https://api.agify.io?name='+name)
    age = json.loads(response.text)
    
    return age

age_of('Roberto')

{'name': 'Roberto', 'age': 66, 'count': 142197}

In [78]:
age_of('Kaleshi')

{'name': 'Kaleshi', 'age': 44, 'count': 7}

In [67]:
def age_and_country(name,country):
    response = requests.get('https://api.agify.io?name='+name+'&country_id='+country)
    age = json.loads(response.text)
    
    return age

age_and_country('Roberto','US')

{'name': 'Roberto', 'age': 70, 'count': 6, 'country_id': 'US'}

Although we managed to get the response, building the query string by hand gets complicated and error-prone as the set of parameters grows. Thankfully, the `requests` library can do that work for us.

In [79]:
requests.get('https://api.agify.io', params = {'name': 'Federico Jose'}).url

'https://api.agify.io/?name=Federico+Jose'
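The same mechanism handles several parameters at once. As a sketch, the request below is only prepared, not sent, so we can inspect the URL that `requests` would build for the name and country query used earlier:

```python
import requests

# Build the request without sending it, to inspect the encoded URL
request = requests.Request(
    'GET',
    'https://api.agify.io',
    params={'name': 'Roberto', 'country_id': 'US'},
)
prepared = request.prepare()
print(prepared.url)
```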

Even more complicated sets of parameters are sometimes required. When that is the case, API designers often decide to require them in json format, received via a `POST` request.

For example, take a look at the [Google Maps API](https://developers.google.com/maps/documentation). In the documentation, they define the body of the request, which we will have to provide, and of the response, which they'll provide back.
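As a rough sketch of what sending a JSON payload looks like with `requests`: the endpoint URL and the payload fields below are hypothetical placeholders, not taken from any real API, and the request is prepared rather than sent so that we can look at what would go over the wire.

```python
import json
import requests

# Hypothetical endpoint and payload, for illustration only
url = 'https://api.example.com/route'
payload = {'origin': 'Madrid', 'destination': 'Barcelona', 'avoid_tolls': True}

# json= serializes the dict and sets the Content-Type header for us
prepared = requests.Request('POST', url, json=payload).prepare()

print(prepared.method)
print(prepared.headers['Content-Type'])
print(json.loads(prepared.body))
```

To actually send it, `requests.post(url, json=payload)` does the preparing and the sending in one call.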

## Things that you can do with web APIs

Basically anything, but some examples are:

- Query addresses to get coordinates, or ask what is in some coordinates ([Google Maps](https://developers.google.com/maps/documentation/geocoding/overview))
- Access your files in cloud services, e.g. Dropbox, Google Drive, etc.
- Query user details or song information, modify your playlists... ([Spotify](https://developer.spotify.com/documentation/web-api/)).
- Make or receive payments ([PayPal](https://developer.paypal.com/api/rest/), [Square](https://squareup.com/us/en)...)
- Order pizza (https://apilist.fun/api/order-pizza-api).
- Make bookings (https://connect.booking.com/user_guide/site/en-US/res/).

# Web scraping

![HTML to DOM](http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png)

![DOM TREE](http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png)



## Basic web scraping

In [85]:
from bs4 import BeautifulSoup

response = requests.get('https://www.elpais.com')
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.text, 'html.parser')

In [86]:
soup.find('img')

<img alt="Teodoro García Egea y Cuca Gamarra protestan a Batet por la aprobación de la reforma laboral, el 3 de febrero en el Congreso. Eugenia Morago" class="c_m_e _re lazyload a_m-h" decoding="auto" height="234" loading="lazy" src="https://imagenes.elpais.com/resizer/iBxXhsUqByLNgS_-bGeBkLVCiUU=/414x233/cloudfront-eu-central-1.images.arcpublishing.com/prisa/HP73Y6KNQBE5TKCELRA3FIVGAE.jpg" srcset="https://imagenes.elpais.com/resizer/iBxXhsUqByLNgS_-bGeBkLVCiUU=/414x233/cloudfront-eu-central-1.images.arcpublishing.com/prisa/HP73Y6KNQBE5TKCELRA3FIVGAE.jpg 414w,https://imagenes.elpais.com/resizer/j-YQQDQC0TdteapfTrV7pCR2ASo=/828x466/cloudfront-eu-central-1.images.arcpublishing.com/prisa/HP73Y6KNQBE5TKCELRA3FIVGAE.jpg 828w" width="414"/>

In [87]:
soup.find('a')

<a href="https://elpais.com/s/setEspana.html?ed=el-pais_ham"><abbr>esp</abbr> <span>España</span></a>

In [88]:
soup.find('p')

<p class="c_d">El líder popular considera que Batet  podría estar prevaricando. “Ha habido un caso de transfuguismo”, denuncia la ‘número dos’ de los socialistas</p>

In [89]:
soup.find('h2')

<h2 class="c_t"><a href="/espana/2022-02-04/casado-cuestiona-la-legitimidad-de-la-votacion-del-congreso-sobre-la-reforma-laboral.html">Casado habla de “pucherazo” en la votación y Lastra acusa al PP de comprar a los diputados de UPN</a></h2>

In [90]:
soup.find('div')

<div class="fusion-app" id="fusion-app"><div class="ad" id="elpais_gpt-SKIN"></div><div class="ad" id="elpais_gpt-INTER"></div><header class="z-he"><div class="ad ad-giga ad-giga-1"><div class="ad ad-ldb ad-ldb1" id="elpais_gpt-LDB1"></div><div class="mldb1-wrapper" id="mldb1-wrapper"><div class="ad ad-mldb ad-mldb1" id="elpais_gpt-MLDB1"></div></div></div><header class="cg _dg" data-dtm-region="header"><script type="application/ld+json">{"@context":"http://schema.org","@type":"SiteNavigationElement","hasPart":[{"name":"Internacional","url":"https://elpais.com/internacional/"},{"name":"Opinión","url":"https://elpais.com/opinion/"},{"name":"España","url":"https://elpais.com/espana/"},{"name":"Economía","url":"https://elpais.com/economia/"},{"name":"Sociedad","url":"https://elpais.com/sociedad/"},{"name":"Educación","url":"https://elpais.com/educacion/"},{"name":"Medio Ambiente","url":"https://elpais.com/clima-y-medio-ambiente/"},{"name":"Ciencia","url":"https://elpais.com/ciencia/"},{"n

In [91]:
soup.find_all('img')

[<img alt="Teodoro García Egea y Cuca Gamarra protestan a Batet por la aprobación de la reforma laboral, el 3 de febrero en el Congreso. Eugenia Morago" class="c_m_e _re lazyload a_m-h" decoding="auto" height="234" loading="lazy" src="https://imagenes.elpais.com/resizer/iBxXhsUqByLNgS_-bGeBkLVCiUU=/414x233/cloudfront-eu-central-1.images.arcpublishing.com/prisa/HP73Y6KNQBE5TKCELRA3FIVGAE.jpg" srcset="https://imagenes.elpais.com/resizer/iBxXhsUqByLNgS_-bGeBkLVCiUU=/414x233/cloudfront-eu-central-1.images.arcpublishing.com/prisa/HP73Y6KNQBE5TKCELRA3FIVGAE.jpg 414w,https://imagenes.elpais.com/resizer/j-YQQDQC0TdteapfTrV7pCR2ASo=/828x466/cloudfront-eu-central-1.images.arcpublishing.com/prisa/HP73Y6KNQBE5TKCELRA3FIVGAE.jpg 828w" width="414"/>,
 <img alt="Russian President Vladimir Putin attends a meeting with Chinese President Xi Jinping in Beijing, China February 4, 2022. Sputnik/Aleksey Druzhinin/Kremlin via REUTERS ATTENTION EDITORS - THIS IMAGE WAS PROVIDED BY A THIRD PARTY." class="c_m_e

In [92]:
# Count the <img> tags on the page
len(soup.find_all('img'))

94
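To see what `find` and `find_all` return without depending on a live page, here is a small sketch on a hand-written HTML snippet (the markup, URLs and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a real front page
html = """
<html><body>
  <article class="card">
    <h2><a href="/news/first.html">First headline</a></h2>
    <p>Summary of the first article.</p>
  </article>
  <article class="card sponsored">
    <h2><a href="/news/second.html">Second headline</a></h2>
  </article>
</body></html>
"""

doc = BeautifulSoup(html, 'html.parser')

first_link = doc.find('a')       # only the first matching tag
all_links = doc.find_all('a')    # a list with every matching tag

# .text gives the text inside a tag; ['href'] reads one of its attributes
headlines = [(a.text, a['href']) for a in all_links]

# The class attribute comes back as a list, one item per class name
classes = doc.find_all('article')[1]['class']

print(headlines)
print(classes)
```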

#### Exercise

Get the titles and URLs of all articles on the front page of `elpais.com` into a CSV file.

In [153]:
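results = []

for article in soup.find_all('article'):
    # The headline is the text of the article's <h2>; the link is its first <a>
    headline = article.find('h2').text
    url = article.find('a')['href']
    
    # Keep only the articles that do not carry the 'c-bra' class
    if 'c-bra' not in article['class']:
        results.append((headline, url))
    
pd.DataFrame(data = results, columns=['headline', 'url']).to_csv('headlines.csv')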
    

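## Blocking

Some sites refuse requests that do not come from a real browser. When that happens, we can drive an actual browser from Python with Selenium.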

In [129]:
response = requests.get('https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts')
response.status_code

403
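Status codes tell us how the server handled the request: the 200 we got earlier means success, while 403 means the server refused to serve us. A quick way to look up the standard name of a code, using Python's built-in `http` module:

```python
from http import HTTPStatus

# Look up the standard reason phrase for each status code
for code in (200, 403):
    status = HTTPStatus(code)
    print(code, status.phrase)
```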

In [137]:
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Selenium 4 takes the geckodriver path through a Service object
driver = Firefox(service=Service('/home/dsc/Documents/geckodriver'))

In [138]:
driver.get('https://www.elpais.com')

In [140]:
# Selenium 4 locates elements with find_element(By.<strategy>, value)
driver.find_element(By.PARTIAL_LINK_TEXT, 'Casado').text

'Casado habla de “pucherazo” en la votación y Lastra acusa al PP de comprar a los diputados de UPN'

In [143]:
driver.get('https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts')

button = driver.find_element(By.CLASS_NAME, 'btn-load-more')
button.click()

In [144]:
import time

for _ in range(10):
    button = driver.find_element(By.CLASS_NAME, 'btn-load-more')
    button.click()
    time.sleep(2.5)  # give the page time to load the next batch

In [146]:
len(driver.find_elements(By.CLASS_NAME, 'content-details'))

216

In [None]:
driver.find_element(By.TAG_NAME, 'article')


In [None]:
from selenium.webdriver.common.by import By

driver.get('https://elpais')

# Fill in the class name of the button to click
button = driver.find_element(By.CLASS_NAME, '')
button.click()

# Annex: ultra easy scraping with pandas!

When the data we want is already formatted as a table, we can do it even more easily! Just use `pandas.read_html`:
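A minimal sketch, assuming an HTML parser backend that pandas supports (such as `lxml`) is installed. The table below is a made-up sample wrapped in `StringIO` so that the example runs offline; `read_html` also accepts a URL.

```python
from io import StringIO

import pandas as pd

# Made-up sample table, for illustration only
html = """
<table>
  <tr><th>craft</th><th>crew</th></tr>
  <tr><td>ISS</td><td>7</td></tr>
  <tr><td>Shenzhou 13</td><td>3</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
crew = tables[0]
print(crew)
```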

# Annex II: exercises

### Exercise:

Extract the date of the worst aviation disaster from: https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll

Prerequisites: pandas, pd.read_html

### Exercise: 

Assuming the list is exhaustive, calculate how many people died in accidental explosions per decade in the 20th century. Plot it.

Data: 
https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll

Prerequisites: pandas, pd.read_html, pd.to_datetime, matplotlib or seaborn

### Exercise: 

Create a function that, given the two tables extracted from http://en.wikipedia.org/wiki/List_of_S%26P_500_companies and a date, returns the list of companies in the S&P 500 at that date.

# References / Further reading

https://realpython.com/api-integration-in-python/

https://j2logo.com/flask/tutorial-como-crear-api-rest-python-con-flask/

https://www.scrapingbee.com/blog/selenium-python/

https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/