## API Scraping

### A simple API query
You will start with the basics: how to make a simple request to an [API endpoint](../../2.python/2.python_advanced/05.Scraping/5.apis.ipynb).

You will use the [requests](https://requests.readthedocs.io/en/latest/) external library through the `import` keyword.

Note that external libraries must be installed before you can import them (for example with `pip install requests`). Check each library's documentation for details.

Check the [quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/) section of the `requests` library's documentation to:
1. use the `get()` method to connect to this endpoint: https://country-leaders.onrender.com/status
2. check if the `status_code` is equal to 200, which means OK
    * if OK, `print()` the `text` of the response
    * if not, `print()` the `status_code`

Here is an overview of the essential [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
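
As a small aid for reading those codes, here is a minimal sketch that turns a numeric status code into a readable label using Python's standard `http.HTTPStatus` enum. The helper name `describe_status` is an assumption made for this illustration; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short label such as '403 Forbidden' for a numeric status code."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        # The standard library does not know this code
        return f"{status_code} (unknown status)"
    return f"{status.value} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(403))  # 403 Forbidden
```

You could pass `req.status_code` to this helper in the `else` branch to print a more descriptive message.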

In [5]:
# import the requests library (1 line)
import requests
# assign the root url (without /status) to the root_url variable for ease of reference (1 line)
root_url = 'https://country-leaders.onrender.com'
# assign the /status endpoint to another variable called status_url (1 line)
status_url = '/status/'
# query the /status endpoint using the get() method and store it in the req variable (1 line)
req = requests.get(f"{root_url}{status_url}")
# check the status_code using a condition and print appropriate messages (4 lines)
if req.status_code == 200:
    print(req.json())
else:
    print(req.status_code)

# assign the output to the leaders variable (1 line)
leaders_endpoint = '/leaders/'
leaders = requests.get(f"{root_url}{leaders_endpoint}")
# display the leaders variable (1 line)
print(leaders)
# does it work?
'nope'


Alive
<Response [403]>


'nope'

### Cookies anyone?

It looks like access to this API is restricted: the `/leaders` request above returned a 403 (Forbidden) response.
Query the `/cookie` endpoint and extract the appropriate field to access your cookie.

You will need to use this cookie in each of the following API requests: `/countries`, `/leaders`, `/leader`.

Try to query the countries endpoint using the cookie, save the output and print it.
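
An alternative to passing `cookies=` by hand on every call is a `requests.Session`, which keeps a cookie jar and attaches it to each request it sends. The sketch below is only an illustration and runs without contacting the API: the helper `build_session` and the value `example-value` are assumptions, while the cookie name `user_cookie` is the one this API returns from `/cookie`.

```python
import requests

root_url = "https://country-leaders.onrender.com"


def build_session(cookie_value):
    """Return a Session that sends `user_cookie` with every request."""
    session = requests.Session()
    session.cookies.set("user_cookie", cookie_value)
    return session


# Placeholder value for illustration only; a real value comes from the /cookie endpoint.
session = build_session("example-value")

# Preparing (not sending) a request shows the header the session would attach.
prepared = session.prepare_request(requests.Request("GET", f"{root_url}/countries"))
print(prepared.headers["Cookie"])
```

In the notebook itself, calling `session.get(f"{root_url}/cookie")` first should store the cookie in `session.cookies` automatically, so later `session.get(...)` calls would not need a `cookies` argument.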

In [9]:
# get the cookie from the API
cookie_endpoint = '/cookie/'
cookie_req = requests.get(f"{root_url}{cookie_endpoint}")
cookie = cookie_req.cookies
if cookie_req.status_code == 200:
    print(cookie)
else:
    print(cookie_req.status_code)

# get the countries from the API, passing the cookie through the cookies argument
country_endpoint = '/countries/'
country_req = requests.get(f"{root_url}{country_endpoint}", cookies=cookie)
country_list = country_req.json()
print(country_list)
print(type(country_list))

# same with leaders, for a single country
params = {'country': 'ru'}
leaders_endpoint = '/leaders/'
leaders_req = requests.get(f"{root_url}{leaders_endpoint}", cookies=cookie, params=params)
if leaders_req.status_code == 200:
    print(leaders_req.text)
else:
    print(leaders_req.status_code)




<RequestsCookieJar[<Cookie user_cookie=5b1c5d25-d6fd-4577-9640-e85491a65225 for country-leaders.onrender.com/>]>
['ma', 'us', 'fr', 'be', 'ru']
<class 'list'>
[{"id":"Q7747","first_name":"Vladimir","last_name":"Putin","birth_date":"1952-10-07","death_date":null,"place_of_birth":"Saint Petersburg","wikipedia_url":"https://ru.wikipedia.org/wiki/%D0%9F%D1%83%D1%82%D0%B8%D0%BD,_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80_%D0%92%D0%BB%D0%B0%D0%B4%D0%B8%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B8%D1%87","start_mandate":"2000-05-07","end_mandate":"2008-05-07"},{"id":"Q23530","first_name":"Dmitry","last_name":"Medvedev","birth_date":"1965-09-14","death_date":null,"place_of_birth":"Saint Petersburg","wikipedia_url":"https://ru.wikipedia.org/wiki/%D0%9C%D0%B5%D0%B4%D0%B2%D0%B5%D0%B4%D0%B5%D0%B2,_%D0%94%D0%BC%D0%B8%D1%82%D1%80%D0%B8%D0%B9_%D0%90%D0%BD%D0%B0%D1%82%D0%BE%D0%BB%D1%8C%D0%B5%D0%B2%D0%B8%D1%87","start_mandate":"2008-05-07","end_mandate":"2012-05-07"},{"id":"Q34453","first_name":"Boris","l

In [10]:

# same with leader
parameters = {"leader_id": "Q2042"}
leader_endpoint = '/leader/'
leader_req = requests.get(f"{root_url}{leader_endpoint}", cookies=cookie, params=parameters)
if leader_req.status_code == 200:
    print(leader_req.text)
else:
    print(leader_req.status_code)



{"id":"Q2042","first_name":"Charles","last_name":"de Gaulle","birth_date":"1890-11-22","death_date":"1970-11-09","place_of_birth":"Lille","wikipedia_url":"https://fr.wikipedia.org/wiki/Charles_de_Gaulle","start_mandate":"1959-01-08","end_mandate":"1969-04-28"}


In [None]:

# same again, to collect the leaders of every country
leaders_endpoint = '/leaders/'
leaders_main_data_list = []  # (full name, wikipedia_url) tuples
url_list = []                # wikipedia_url of every leader
for country in country_list:
    params = {'country': country}
    leaders_req = requests.get(f"{root_url}{leaders_endpoint}", cookies=cookie, params=params)
    leaders_data = leaders_req.json()
    leaders_main_data_list.extend([(item['first_name'] + ' ' + item['last_name'], item['wikipedia_url']) for item in leaders_data])
    url_list.extend([item['wikipedia_url'] for item in leaders_data])

# leaders_data only holds the response for the last country in the loop
print(leaders_data)
print(type(leaders_data))
print(leaders_main_data_list)
print(url_list)
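
If you would rather keep the leaders grouped by country than flattened into lists, a dictionary keyed by country code is one option. The sketch below is an illustration under assumptions: `collect_leaders` and `fake_fetch_leaders` are hypothetical names, and the fake fetcher returns trimmed copies of two records shown in the outputs above so that the example runs without network access.

```python
def collect_leaders(countries, fetch_leaders):
    """Map each country code to the list of leader records returned for it."""
    leaders_per_country = {}
    for country in countries:
        leaders_per_country[country] = fetch_leaders(country)
    return leaders_per_country


def fake_fetch_leaders(country):
    # Stand-in for the real API call, so the sketch can run offline.
    sample = {
        "fr": [{"id": "Q2042", "first_name": "Charles", "last_name": "de Gaulle"}],
        "ru": [{"id": "Q7747", "first_name": "Vladimir", "last_name": "Putin"}],
    }
    return sample.get(country, [])


leaders_per_country = collect_leaders(["fr", "ru"], fake_fetch_leaders)
print(leaders_per_country["fr"][0]["last_name"])  # de Gaulle
```

In the notebook, the second argument could be a small function that wraps the `requests.get(...)` call to `/leaders` and returns `.json()`.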

## Extracting data from Wikipedia

Query one of the leaders' Wikipedia URLs (from the `/leader` API endpoint) and display its `text` (not JSON).

In [7]:
# 3 lines
leader = leader_req.json()  # avoid naming this `dict`, which would shadow the built-in type
print(leader)
url = leader.get('wikipedia_url')
print(url)
r = requests.get(url)
content = r.text
print(content)
print(type(content))



NameError: name 'leader_req' is not defined

Ouch! You get the raw HTML code of the webpage. If you try to deal with it without tools, you will be there all night. Instead, use the [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) *external* library.

As shown in the **quickstart** section, start by importing the library and loading the HTML you just retrieved (the `text` of the response) into a `BeautifulSoup` object.

Use the `prettify()` method and print its result to take a look. You will start the actual parsing in the next step.

In [10]:
# 3 lines
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly
url_text_pretty = soup.prettify()
print(url_text_pretty)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="fr">
 <head>
  <meta charset="utf-8"/>
  <title>
   Charles de Gaulle — Wikipédia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector

That looks better but you need to extract the right part of the webpage: the text of the first paragraph.

This is a bit tricky because Wikipedia pages differ slightly in structure from one language to the next, so you cannot simply take the text of the first HTML paragraph.

You will start by getting all the HTML paragraphs from the HTML source and saving them in the `paragraphs` variable.

Use the documentation or google the appropriate keywords.

In [11]:
# 2 lines
paragraphs = soup.find_all('p')
print(len(paragraphs))
print(paragraphs)
print(paragraphs[3])



231
[<p class="mw-empty-elt">
</p>, <p>Pour les articles homonymes, voir <a class="mw-disambig" href="/wiki/Charles_de_Gaulle_(homonymie)" title="Charles de Gaulle (homonymie)">Charles de Gaulle (homonymie)</a> et <a class="mw-disambig" href="/wiki/Gaulle" title="Gaulle">Gaulle</a>.
</p>, <p>« De Gaulle » redirige ici. Pour les autres membres de la famille, voir <a href="/wiki/Famille_de_Gaulle" title="Famille de Gaulle">Famille de Gaulle</a>.
</p>, <p class="mw-empty-elt">
</p>, <p><b>Charles de Gaulle</b> (<span class="API nowrap" style="font-family:'Segoe UI','DejaVu Sans','Lucida Grande','Lucida Sans Unicode','Arial Unicode MS','Hiragino Kaku Gothic Pro',sans-serif;" title="Alphabet phonétique international">/<a class="mw-redirect" href="/wiki/API_%CA%83" title="API ʃ"><span title="[ʃ] « ch » dans « chou ».">ʃ</span></a><a class="mw-redirect" href="/wiki/API_a" title="API a"><span title="[a] « a » dans « patte ».">a</span></a><a class="mw-redirect" href="/wiki/API_%CA%81" title="AP

If you try different urls, you might find that the paragraph you want may be at a different index each time.

That is where you need to be clever and ask yourself what would be a reliable way to identify the right index, i.e., which string matches only the first paragraph whatever the language...

Spend a good 30 minutes on the problem and brainstorm with your fellow learners. If you come out empty-handed, ask your coach.

1. Loop over the HTML paragraphs
2. When you have identified the correct one
    * store the [text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output) inside the `first_paragraph` variable
    * exit the loop
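
The steps above can be tried on a small inline HTML string before running them against a live page. This is a sketch under one assumption: that the paragraph you want is the first `<p>` that opens directly with a bold (`<b>`) name. The function name `find_first_paragraph` and the simplified `sample_html` are made up for this illustration.

```python
from bs4 import BeautifulSoup


def find_first_paragraph(html):
    """Return the text of the first <p> that opens with a <b> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("p"):
        # Assumption: the lead paragraph starts with the subject's name in bold
        if str(tag).startswith("<p><b>"):
            return tag.get_text()
    return None


# Hand-written, simplified HTML that mimics the structure seen above
sample_html = (
    '<p class="mw-empty-elt"></p>'
    "<p>Pour les articles homonymes, voir Gaulle.</p>"
    "<p><b>Charles de Gaulle</b> est un militaire français.</p>"
    "<p><b>Autre</b> paragraphe.</p>"
)

print(find_first_paragraph(sample_html))
```

Testing on inline HTML makes it easier to check the selection rule for several page layouts without querying Wikipedia each time.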

In [65]:
# < 10 lines
from bs4 import BeautifulSoup
url = 'https://fr.wikipedia.org/wiki/Charles_de_Gaulle'
r = requests.get(url)
content = r.text
soup = BeautifulSoup(content, 'html.parser')

first_paragraph = ''
for tag in soup.find_all('p'):
    # the lead paragraph is the first <p> that opens directly with a bold (<b>) name
    if '<p><b>' in str(tag):
        first_paragraph = tag.get_text()
        print(first_paragraph)
        print(type(tag))
        break


Charles de Gaulle (/ʃaʁl də ɡol/[n 2] Écouter), communément appelé le général de Gaulle ou parfois simplement le Général, né le 22 novembre 1890 à Lille (Nord) et mort le 9 novembre 1970 à Colombey-les-Deux-Églises (Haute-Marne), est un militaire, résistant, homme d'État et écrivain français.

<class 'bs4.element.Tag'>


At this stage, you can create a function to maintain consistency in your code. We will give you its *skeleton*; you will copy the code you wrote and make it work inside the function.

Don't forget to test your function.

In [13]:
# 10 lines
import requests
from bs4 import BeautifulSoup

def get_first_paragraph(wikipedia_url):
    print(wikipedia_url) # keep this for the rest of the notebook

    # scrape the url
    r = requests.get(wikipedia_url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # return the text of the first paragraph that opens with a bold name
    for tag in soup.find_all('p'):
        if '<p><b>' in str(tag):
            return tag.get_text()
    return None  # no matching paragraph found

get_first_paragraph('https://fr.wikipedia.org/wiki/Charles_de_Gaulle')

https://fr.wikipedia.org/wiki/Charles_de_Gaulle


"Charles de Gaulle (/ʃaʁl də ɡol/[n 2] Écouter), communément appelé le général de Gaulle ou parfois simplement le Général, né le 22 novembre 1890 à Lille (Nord) et mort le 9 novembre 1970 à Colombey-les-Deux-Églises (Haute-Marne), est un militaire, résistant, homme d'État et écrivain français.\n"

### Regular expressions to the rescue

Now that you have extracted the content of the first paragraph, the only thing that remains to finish your Wikipedia scraper is to sanitize the output.

Wikipedia references, leftover HTML code, phonetic pronunciations and the like still linger in your paragraphs. You might find *regular expressions* handy to get rid of them and obtain pristine text.

Once you have one of your regexes working [online](https://regexr.com/), try it in the cell below. 

Hints: 
* Check the `sub()` method documentation.
* Make sure to test urls in different languages. Some may look good but others won't.
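
To make the `sub()` hint concrete, here is a minimal sketch of what a clean-up function could look like. The name `sanitize_paragraph` and the three patterns are assumptions chosen for the French example above; other languages will probably need additional or different patterns. The sample string is a shortened version of the paragraph extracted earlier.

```python
import re


def sanitize_paragraph(text):
    """Strip reference markers and phonetic transcriptions from a paragraph."""
    # Drop bracketed reference markers such as "[n 2]" or "[1]"
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Drop parentheses that open with "/", i.e. phonetic transcriptions
    text = re.sub(r"\s*\(/[^)]*\)", "", text)
    # Collapse runs of whitespace and trim the trailing newline
    text = re.sub(r"\s+", " ", text).strip()
    return text


sample = (
    "Charles de Gaulle (/ʃaʁl də ɡol/[n 2] Écouter), communément appelé "
    "le général de Gaulle, est un militaire français.\n"
)
print(sanitize_paragraph(sample))
# Charles de Gaulle, communément appelé le général de Gaulle, est un militaire français.
```

Applying the patterns one at a time keeps each regex short and makes it easier to see which one misbehaves on a new language.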

In [None]:
# 3 lines



## Tidy things up in a stand-alone Python script

Congratulations! You now have a working scraper! However, your code is scattered throughout this notebook alongside the tutorial text. Hardly ready for production or for your GitHub portfolio...

Gather your code into a module, add some functionality for saving, and call it all from a single script!
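
For the saving part, one possible approach is to write the collected data to a JSON file. The sketch below is an assumption-laden illustration, not a prescribed layout: the function names `save` and `load`, the file name `leaders.json` and the shape of `leaders_per_country` are all hypothetical. It writes to a temporary folder so that running it leaves no file behind.

```python
import json
import os
import tempfile


def save(leaders_per_country, path="leaders.json"):
    """Write the leaders dictionary to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(leaders_per_country, f, ensure_ascii=False, indent=2)


def load(path="leaders.json"):
    """Read the leaders dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Small hand-made example in the assumed shape: country code -> list of leaders
leaders_per_country = {
    "fr": [{"id": "Q2042", "first_name": "Charles", "last_name": "de Gaulle"}],
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "leaders.json")
    save(leaders_per_country, path)
    restored = load(path)

print(restored == leaders_per_country)  # True
```

Reading the file back and comparing it with the original dictionary is a quick way to check that nothing was lost when saving.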