![ ](https://1.bp.blogspot.com/-1K2QZ8EF9ic/VM-jFqXTFsI/AAAAAAAAFyQ/UtozSbBm614/w1200-h630-p-k-no-nu/DelMe.jpg)

# <center> Web scraping and parsing <br>  </center>


* [Basics](#basics)
* [What if the server is mad at you](#mad)
* [API and how to handle it](#api)
* [Some tricks](#tips)
* [Recommended materials](#recs)

# 1. Basics <a name='basics'></a>

### In short, it's all about automatic data collection and transformation of downloaded string/binary data into a pretty data structure to work with. 

Key notions:

* **Web scraping** is automated data collection from websites: large amounts of data are retrieved from web pages and then stored in a file on your computer or in a database table.

* **Parsing** (in computer science) is, in practice, almost always the conversion of a string or binary data into a data structure inside your program. The term is also used to describe a split or separation. Formally, it is the syntactic analysis of a sequence of input symbols into its component parts; the term comes from the design of compilers and interpreters.

## What is HTML? 

**HTML (HyperText Markup Language)** is the standard markup language (like LaTeX or Markdown) for documents designed to be displayed in a web browser (Google Chrome, Safari or any other). In other words, it is the standard language in which web pages are written.

HTML **tags** are the foundation of the HTML language. Tags are used to delimit the beginning and end of elements in markup.


An HTML page is a collection of nested tags. Example of tags:

- `<title>` - page title
- `<h1> ... <h6>` - headings of different levels
- `<p>` - paragraph
- `<div>` - a division or a section in a document, it is used as a container for HTML elements
- `<table>` - a table
- `<tr>` - a table row
- `<td>` - a table cell (one column entry within a row)
- `<b>` - sets the font to bold
- `<i>` - italic text

`<...>` opens a tag and `</...>` closes it. The tag's properties apply to everything between the two. For instance, everything between `<p>` and `</p>` is a separate paragraph.

Tags form a tree rooted in the `<html>` tag and break the page into logical pieces. Each tag has a parent (the tag it is nested in) and may have descendants, or children (the tags nested inside it).

Example of HTML-tree:


````html
<html>
    <head> <title> Some regular title </title> </head>
    <body>
        <div>
            The first piece of text. 
        </div>
        <div>
            Some text here.
                <b>
                    This text will be bold. 
                </b>
        </div>
        One more piece of text...
    </body>
</html>
````

You can work with this HTML as plain text or as a tree. Parsing a web page means traversing this tree: we find the nodes we need among all this variety and take the information from them!
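As a rough illustration of walking such a tree, the sketch below parses a slightly shortened version of the example page with `BeautifulSoup` (the library introduced later in this section) and lists each tag with its nesting depth. The helper name `walk` and the HTML string are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head><title>Some regular title</title></head>
    <body>
        <div>The first piece of text.</div>
        <div>Some text here. <b>This text will be bold.</b></div>
    </body>
</html>
"""

tree = BeautifulSoup(html, "html.parser")

def walk(tag, depth=0):
    """Collect (depth, tag name) pairs by visiting child tags recursively."""
    found = [(depth, tag.name)]
    # recursive=False restricts find_all to direct children
    for child in tag.find_all(True, recursive=False):
        found.extend(walk(child, depth + 1))
    return found

structure = walk(tree.html)
for depth, name in structure:
    print("    " * depth + name)

# Once a node is found, its text can be read directly
bold_text = tree.find("b").text
```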

<center>
<img src="https://raw.githubusercontent.com/hse-econ-data-science/eds_spring_2020/master/sem05_parsing/image/tree.png" width="450"> 

In [1]:
import requests  

url = 'https://www.whitehouse.gov/briefing-room/press-briefings/2022/03/31/background-press-call-by-senior-administration-officials-on-president-bidens-plan-to-respond-to-putins-price-hike-at-the-pump/'
response = requests.get(url)
response

<Response [200]>

Response 200 means that the connection was established and the data was received: everything is wonderful! If you try to open a non-existent page you can get, for example, the famous 404 error.

In [2]:
requests.get('https://www.whitehouse.gov/common_sense')

<Response [404]>
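The number inside `<Response [...]>` is the HTTP status code, which `requests` also exposes as `response.status_code`. The sketch below uses only the standard library's `http.HTTPStatus` to turn a code into its official name; `describe_status` is a hypothetical helper written for this note.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return the code, its standard phrase and a rough category."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif code >= 400:
        kind = "error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

for code in (200, 404):
    print(describe_status(code))
```

In a real script you would typically check `response.status_code` (or call `response.raise_for_status()`) before parsing the content.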

The content of the response is the HTML markup of the page that we are parsing.

In [3]:
response.content[:1000]

b'<!doctype html>\n<html class="no-js alert__has-cookie" lang="en-US">\n<head>\n\t<meta charset="utf-8">\n\t<meta name="google" content="notranslate">\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<link rel="profile" href="https://gmpg.org/xfn/11">\n\t\n\t<!-- If you\'re reading this, we need your help building back better. https://usds.gov/ -->\n<meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n\n\t<title>Background Press Call by Senior Administration Officials on President Biden&#039;s Plan to Respond to Putin&#039;s Price Hike at the Pump | The White House</title>\n\t<link rel="canonical" href="https://www.whitehouse.gov/briefing-room/press-briefings/2022/03/31/background-press-call-by-senior-administration-officials-on-president-bidens-plan-to-respond-to-putins-price-hike-at-the-pump/" />\n\t<meta property="og:locale" content="en_US"

It looks like something hard to work with.

The **[`bs4`](https://www.crummy.com/software/BeautifulSoup/)** package, a.k.a. **BeautifulSoup**, is named after the poem about beautiful soup from Alice in Wonderland. This completely magical library turns the raw HTML (or XML) code of a page into a structured tree in which it is very convenient to search for the tags, classes, attributes, texts and other elements of web pages that you need.

<img align="center" src="https://raw.githubusercontent.com/hse-econ-data-science/eds_spring_2020/master/sem05_parsing/image/alisa.jpg" height="200" width="200"> 


In [4]:
from bs4 import BeautifulSoup

# parse page into a tree
tree = BeautifulSoup(response.content, 'html.parser')

The variable `tree` contains the tree of tags that can be examined. 

In [7]:
#tree

You can see the structure of the web-page with the method `prettify()`

In [5]:
from bs4 import BeautifulSoup
tree = BeautifulSoup(response.content, 'html.parser')

print(tree.prettify())

<!DOCTYPE html>
<html class="no-js alert__has-cookie" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="notranslate" name="google"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <!-- If you're reading this, we need your help building back better. https://usds.gov/ -->
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
   <title>
    Background Press Call by Senior Administration Officials on President Biden's Plan to Respond to Putin's Price Hike at the Pump | The White House
   </title>
   <link href="https://www.whitehouse.gov/briefing-room/press-briefings/2022/03/31/background-press-call-by-senior-administration-officials-on-president-bidens-plan-to-respond-to-putins-price-hike-at-the-pump/" rel="canonical">
    <meta content="en_US" property="og:locale">
     <meta conte

In [6]:
tree.html.head.title

<title>Background Press Call by Senior Administration Officials on President Biden's Plan to Respond to Putin's Price Hike at the Pump | The White House</title>

You can get the text out of the tag we have reached using the `text` attribute.

In [7]:
tree.html.head.title.text

"Background Press Call by Senior Administration Officials on President Biden's Plan to Respond to Putin's Price Hike at the Pump | The White House"

In [16]:
type(tree.html.head.title.text)

str

You can work with text as with a string using classical Python methods.

In [17]:
header = tree.html.head.title.text
header.split(' ')

['Background',
 'Press',
 'Call',
 'by',
 'Senior',
 'Administration',
 'Officials',
 'on',
 'President',
 "Biden's",
 'Plan',
 'to',
 'Respond',
 'to',
 "Putin's",
 'Price',
 'Hike',
 'at',
 'the',
 'Pump',
 '|',
 'The',
 'White',
 'House']

Let's scrape and parse the Leaders section of The Economist website and study _other useful commands_.

In [8]:
import requests  
from bs4 import BeautifulSoup
url = 'https://www.economist.com/leaders/?page=1'
response = requests.get(url)

html_tree = BeautifulSoup(response.content, 'html.parser')


In [9]:
html_tree.html.head.title

<title>Leaders | The Economist</title>

In [3]:
html_tree.title

<title>Leaders | The Economist</title>

In [4]:
type(html_tree.title)

bs4.element.Tag

In [5]:
html_tree.title.text

'Leaders | The Economist'

In [7]:
html_tree.title.string

'Leaders | The Economist'
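For the title, `.text` and `.string` give the same result, but they are not interchangeable. A small sketch on a made-up snippet (the HTML string is an assumption for illustration) shows where they differ:

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# .text concatenates the text of the tag and of all its descendants
p_text = snippet.p.text
# .string is None as soon as the tag has more than one child
p_string = snippet.p.string
# <b> has exactly one child (a piece of text), so .string returns it
b_string = snippet.b.string

print(repr(p_text), repr(p_string), repr(b_string))
```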

In [27]:
html_tree.title.name

'title'

In [8]:
html_tree.title.parent.name

'head'

In [9]:
len(html_tree.find_all('a'))

103

In [12]:
type(html_tree.find_all('a')), type(html_tree.find_all('a')[0])

(bs4.element.ResultSet, bs4.element.Tag)

In [13]:
html_tree.find_all('a')[1:5]

[<a class="ds-skip-to-content" href="#content">Skip to content</a>,
 <a aria-expanded="false" class="ds-menu-disclosure" data-menu-is-open="false" data-test-id="Menu link" href="#" id="menu-button" type="menu-nav" url="#sections"><svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><g fill="none" fill-rule="evenodd" id="icon-menu-disclosure"><path d="M0 0h24v24H0z"></path><path class="path-foreground" d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z" fill="#0D0D0D" fill-rule="nonzero"></path></g></svg>Menu</a>,
 <a class="weekly-edition-link ds-navigation-link" href="/weeklyedition"><span>Weekly edition</span></a>,
 <a class="the-world-in-brief-link ds-navigation-link" href="https://www.economist.com/the-world-in-brief"><span>The world in brief</span></a>]

One common task is extracting all the URLs found within a page’s `<a>` tags:
```python
for a in html_tree.find_all('a'):
    print(a.get('href'))
```

### Extracting the link from the a-tag

In [21]:
for a in html_tree.find_all('a')[1:200]:
    print(a.get('href'), a, sep = '; ')


#content; <a class="ds-skip-to-content" href="#content">Skip to content</a>
#; <a aria-expanded="false" class="ds-menu-disclosure" data-menu-is-open="false" data-test-id="Menu link" href="#" id="menu-button" type="menu-nav" url="#sections"><svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><g fill="none" fill-rule="evenodd" id="icon-menu-disclosure"><path d="M0 0h24v24H0z"></path><path class="path-foreground" d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z" fill="#0D0D0D" fill-rule="nonzero"></path></g></svg>Menu</a>
/weeklyedition; <a class="weekly-edition-link ds-navigation-link" href="/weeklyedition"><span>Weekly edition</span></a>
https://www.economist.com/the-world-in-brief; <a class="the-world-in-brief-link ds-navigation-link" href="https://www.economist.com/the-world-in-brief"><span>The world in brief</span></a>
#; <a class="ds-navigation-disclosure--icon ds-navigation-disclosure--icon-search ds-navigation-disclosure" href="#" type="search-form">Search<svg viewbox="0 0

### Detailed specification of the a-tag

* `find_all()`: the second argument is a dictionary of attribute values that the tag must match

__Visual check__:
The links that we are looking for (and only they) have an address that starts with _/leaders/20_. We introduce the corresponding search.

In [10]:
articles = html_tree.find_all('a', {'class' : 'ds-navigation-link'})
len(articles)

94

> The `class` attribute is treated as a set: a search matches against the individual class names listed in the attribute. So, for example, `<a class="ds-navigation-link ds-navigation-link--inverse">` also matches the search above. This follows the HTML standard.

You can access the dictionary of attributes directly as `.attrs`:

In [11]:
articles[10].attrs

{'class': ['ds-navigation-link', 'ds-navigation-link--inverse'],
 'href': '/climate-change'}

In [16]:
articles[10]

<a class="ds-navigation-link ds-navigation-link--inverse" href="/1843"><span>1843 magazine</span></a>

In [17]:
articles[10].get('href'), articles[10].text

('/1843', '1843 magazine')

The object returned by the search is again a bs4 object, so we can keep searching inside it for the elements we need.

In [42]:
type(articles[0])

bs4.element.Tag

Note that there are at least two search methods: `find` and `find_all`. If several elements on the page match the given criteria, `find` returns only the very first one. To get all matching elements, use `find_all`, which returns a list-like `ResultSet`.
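The difference between the two methods, including what each returns when nothing matches, can be sketched on a tiny made-up page (the HTML string is an assumption for illustration):

```python
from bs4 import BeautifulSoup

page = BeautifulSoup(
    '<a href="/one">One</a><a href="/two">Two</a>', "html.parser"
)

first_link = page.find("a")          # a single Tag: the first match
all_links = page.find_all("a")       # a list-like ResultSet of Tags
missing = page.find("table")         # None when nothing matches
no_tables = page.find_all("table")   # an empty ResultSet when nothing matches

print(first_link.get("href"), len(all_links), missing, len(no_tables))
```

Because `find` may return `None`, it is worth checking the result before asking it for `.text` or an attribute.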

In addition to content, tags often have attributes. For example, the `<a>` tag of an article title has an `href` attribute. The `<span>` tag is an inline container used to mark up part of a text; we do not need it here.

In [12]:
articles[0]

<a class="weekly-edition-link ds-navigation-link" href="/weeklyedition"><span>Weekly edition</span></a>

In [13]:
articles_extra = html_tree.find_all('a', {'class' : 'ds-navigation-link ds-navigation-link--inverse'})
len(articles_extra)

59
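The two searches above return different counts because a single class name is matched against each class in the list, while a string with several class names is compared with the whole attribute value. The sketch below reproduces this on made-up markup that reuses the class names from the page; the `select` call with a CSS selector is an alternative that does not depend on the order of the classes.

```python
from bs4 import BeautifulSoup

menu = BeautifulSoup(
    '<a class="ds-navigation-link" href="/a">A</a>'
    '<a class="ds-navigation-link ds-navigation-link--inverse" href="/b">B</a>'
    '<a class="ds-navigation-link--inverse ds-navigation-link" href="/c">C</a>',
    "html.parser",
)

# One class name: matches every tag whose class list contains it
by_one_class = menu.find_all("a", {"class": "ds-navigation-link"})

# Several names in one string: matches only tags whose class attribute
# is exactly this string, in this order
by_exact_string = menu.find_all(
    "a", {"class": "ds-navigation-link ds-navigation-link--inverse"}
)

# CSS selector: requires both classes, in any order
by_selector = menu.select("a.ds-navigation-link.ds-navigation-link--inverse")

print(len(by_one_class), len(by_exact_string), len(by_selector))
```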

You can also read the value of an attribute with `.get()`, as was done with `href` above, and you can use attributes to search for the parts of the page that interest you.

In [14]:
articles[0].find_all('span')

[<span>Weekly edition</span>]

In [15]:
articles_new = html_tree.find_all('a', {'class' : 'ds-menu-disclosure'})
#articles_new[0].text
paths = articles_new[0].find_all('path')
print(articles_new[0], '\n')
for path in paths:
    print(path)

<a aria-expanded="false" class="ds-menu-disclosure" data-menu-is-open="false" data-test-id="Menu link" href="#" id="menu-button" type="menu-nav" url="#sections"><svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><g fill="none" fill-rule="evenodd" id="icon-menu-disclosure"><path d="M0 0h24v24H0z"></path><path class="path-foreground" d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z" fill="#0D0D0D" fill-rule="nonzero"></path></g></svg>Menu</a> 

<path d="M0 0h24v24H0z"></path>
<path class="path-foreground" d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z" fill="#0D0D0D" fill-rule="nonzero"></path>


That's it.

## Task: Extract links to articles from this page

* Can we use the detailed search to solve the problem?
* At the moment of writing these notes, the answer is negative. The corresponding lines do not contain a specific class. Instead, they contain the specific pattern of the link structure

In [18]:
links = html_tree.find_all('a')
for link in links:
    link_href = link.get('href')
    if link_href is not None and link_href.find('leaders/') >= 0:
        print(link.text, link.get('href'), sep = '; ')


Why XL Bully dogs should be banned everywhere; /leaders/2024/03/25/why-xl-bully-dogs-should-be-banned-everywhere
The hidden costs of Biden’s steel protectionism; /leaders/2024/03/21/the-hidden-costs-of-bidens-steel-protectionism
At a moment of military might, Israel looks deeply vulnerable; /leaders/2024/03/21/at-a-moment-of-military-might-israel-looks-deeply-vulnerable
Britain is the best place in Europe to be an immigrant; /leaders/2024/03/21/britain-is-the-best-place-in-europe-to-be-an-immigrant
America’s Supreme Court should reject the challenge to abortion drugs; /leaders/2024/03/20/americas-supreme-court-should-reject-the-challenge-to-abortion-drugs
Oil’s endgame could be highly disruptive; /leaders/2024/03/14/oils-endgame-could-be-highly-disruptive
The Gulf’s scramble for Africa is reshaping the continent; /leaders/2024/03/14/the-gulfs-scramble-for-africa-is-reshaping-the-continent
Making sense of the gulf between young men and women; /leaders/2024/03/14/making-sense-of-the-gulf

__Minor comment__:
`link.get('href')` returns `None` for `<a>` tags that have no `href` attribute, which happened in my experiment. That is why the `if` statement in the code above first checks for `None`.
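An alternative to the manual check is to hand the pattern to `find_all` itself: when a compiled regular expression is passed as `href`, tags without an `href` are skipped automatically. This is a sketch on made-up markup; the pattern `^/leaders/\d{4}/` is an assumption based on the relative links printed above.

```python
import re
from bs4 import BeautifulSoup

sample = BeautifulSoup(
    '<a href="/leaders/2024/03/25/some-article">Some article</a>'
    '<a href="/weeklyedition">Weekly edition</a>'
    '<a name="no-href">Anchor without href</a>',
    "html.parser",
)

# Keep only <a> tags whose href exists and matches the pattern
article_pattern = re.compile(r"^/leaders/\d{4}/")
article_links = sample.find_all("a", href=article_pattern)

for link in article_links:
    print(link.text, link["href"], sep="; ")
```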

## Example: code collecting articles

Please note that the articles of the section are spread over several pages. If you click through them, you will notice that the `page` parameter in the URL changes. So, to collect all the articles in the Leaders section, we need to build a series of links with different `page` values inside a loop. When you download data from more complex sites, the link often has a large number of parameters that govern the search results.

Let's write the code that collects articles as a function. As input, it receives the number of the page to be downloaded.

In [23]:
def get_page(p: int):
    # link 
    url = 'https://www.economist.com/leaders/?page={}'.format(p)
    
    # make a request to the server
    response = requests.get(url)
    
    # build a tree
    tree = BeautifulSoup(response.content, 'html.parser')
    
    # find everything we wanted
    articles = tree.find_all('a')
    #articles = tree.find_all('a', {'class' : 'ds-navigation-link'})
    
    all_headers = []
    
    for article in articles:
        #all_headers.append({'title': article.text, 'href': article.get('href')})
        article_href = article.get('href')
        if article_href is not None and article_href.find('leaders/') >= 0:
            #print(article.text, article.get('href'), sep = '; ')
            all_headers.append({'title': article.text, 'href': article_href})
                     
    return all_headers

Try the function on the first page before looping over all pages


In [25]:
all_headers = get_page(1)
print(all_headers[:5])

[{'title': 'How the EU should respond to American subsidies', 'href': '/leaders/2023/03/23/how-the-eu-should-respond-to-american-subsidies'}, {'title': 'The trouble with Emmanuel Macron’s pension victory', 'href': '/leaders/2023/03/23/the-trouble-with-emmanuel-macrons-pension-victory'}, {'title': 'The machinery, structure and output of the British state need reform', 'href': '/leaders/2023/03/23/the-machinery-structure-and-output-of-the-british-state-need-reform'}, {'title': 'As video games grow, they are eating the media', 'href': '/leaders/2023/03/23/as-video-games-grow-they-are-eating-the-media'}, {'title': 'The world according to Xi', 'href': '/leaders/2023/03/23/the-world-according-to-xi'}]


## Progress bar

* Hours of hard work would be in vain if our program became unresponsive during execution. We often come across large datasets or long loops that take a while to complete, as in data scraping.

* While these commands are executing and massive loops are being processed behind the scenes, waiting for the whole process to finish can feel like an eternity. A progress bar solves this problem.

* Progress bars let us watch the progress of the execution and manage our anxiety levels.

`tqdm` is a Python library for creating progress meters, or progress bars. `tqdm` got its name from the Arabic word taqaddum, which means ‘progress’.

`tqdm` can be added effortlessly to loops, functions or even pandas. Progress bars are pretty useful in Python because:

1. You can see whether the kernel is still working

2. Progress bars are visually appealing

3. They show the elapsed time and the estimated time to completion, which helps when working on huge datasets


### Installation (if required)

`!pip install tqdm`

The `time` module is part of the Python standard library and does not need to be installed.

```python
# with conda, within a Jupyter cell
!conda install -c conda-forge tqdm
```

### Import the Libraries 

````python
from tqdm import tqdm
import time
````

### Using tqdm() 

Now we will use the function tqdm() on a simple program with a _for loop_

````python
for i in tqdm(range(20)):
    time.sleep(0.5)
````

Here `i` takes the values 0 to 19, one per iteration. During each iteration the program sleeps for 0.5 seconds before moving on to the next one.

The complete code would look like this:
````python
from tqdm import tqdm
import time
for i in tqdm(range(20)):
    time.sleep(0.5)
````

### Using tqdm_notebook( )

Unlike `tqdm()`, `tqdm_notebook()` gives a coloured version of the progress bar. It uses three colours by default.

A moving __Blue Bar__ shows a process in progress, a stable __Green Bar__ shows that the process has completed, and a __Red Bar__ shows that the process was stopped. Like `tqdm()`, `tqdm_notebook()` is straightforward to use.

````python
from tqdm.notebook import tqdm_notebook
import time
for i in tqdm_notebook(range(10)):
    time.sleep(0.5)
````


In [19]:
from tqdm import tqdm
import time
for i in tqdm(range(20)):
    time.sleep(0.5)


100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.96it/s]


In [20]:
from tqdm.notebook import tqdm_notebook
import time
for i in tqdm_notebook(range(20)):
    time.sleep(0.5)


  0%|          | 0/20 [00:00<?, ?it/s]

__TASK__

Find the links gathered within the HTML code that is split into several pages.

Problem: how can we select the links?

The solution is prepared above: the `get_page()` function selects the links based on their structure.

My code uses the progress bar from the __tqdm.notebook__ library. You may use __tqdm__ instead.

In [21]:
from tqdm import tqdm

In [24]:
from tqdm.notebook import tqdm_notebook
info = []

for page in tqdm_notebook(range(1, 10)):
    info.extend(get_page(page))

  0%|          | 0/9 [00:00<?, ?it/s]

In [25]:
for line in info[:10]:
    print(line['title'], line['href'])

Why XL Bully dogs should be banned everywhere /leaders/2024/03/25/why-xl-bully-dogs-should-be-banned-everywhere
The hidden costs of Biden’s steel protectionism /leaders/2024/03/21/the-hidden-costs-of-bidens-steel-protectionism
At a moment of military might, Israel looks deeply vulnerable /leaders/2024/03/21/at-a-moment-of-military-might-israel-looks-deeply-vulnerable
Britain is the best place in Europe to be an immigrant /leaders/2024/03/21/britain-is-the-best-place-in-europe-to-be-an-immigrant
America’s Supreme Court should reject the challenge to abortion drugs /leaders/2024/03/20/americas-supreme-court-should-reject-the-challenge-to-abortion-drugs
Oil’s endgame could be highly disruptive /leaders/2024/03/14/oils-endgame-could-be-highly-disruptive
The Gulf’s scramble for Africa is reshaping the continent /leaders/2024/03/14/the-gulfs-scramble-for-africa-is-reshaping-the-continent
Making sense of the gulf between young men and women /leaders/2024/03/14/making-sense-of-the-gulf-between
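The `href` values collected above are relative, and the same link may appear more than once if a page references an article several times. A minimal sketch of a clean-up step, using only the standard library's `urljoin`; the helper name `absolute_unique`, the base URL constant and the two sample records are assumptions for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.economist.com/leaders/"

def absolute_unique(records, base_url=BASE_URL):
    """Drop repeated hrefs (keeping the first) and add an absolute 'url' field."""
    seen = set()
    result = []
    for record in records:
        href = record["href"]
        if href in seen:
            continue
        seen.add(href)
        result.append({**record, "url": urljoin(base_url, href)})
    return result

sample_info = [
    {"title": "Oil’s endgame could be highly disruptive",
     "href": "/leaders/2024/03/14/oils-endgame-could-be-highly-disruptive"},
    {"title": "Oil’s endgame could be highly disruptive",
     "href": "/leaders/2024/03/14/oils-endgame-could-be-highly-disruptive"},
]

cleaned = absolute_unique(sample_info)
print(cleaned[0]["url"])
```

With the real data you would call `absolute_unique(info)` before building the table.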

## Transformation to a table with _pandas_

In [26]:
import pandas as pd

df = pd.DataFrame(info)
df.head()

| | title | href |
|---|---|---|
| 0 | Why XL Bully dogs should be banned everywhere | /leaders/2024/03/25/why-xl-bully-dogs-should-b... |
| 1 | The hidden costs of Biden’s steel protectionism | /leaders/2024/03/21/the-hidden-costs-of-bidens... |
| 2 | At a moment of military might, Israel looks de... | /leaders/2024/03/21/at-a-moment-of-military-mi... |
| 3 | Britain is the best place in Europe to be an i... | /leaders/2024/03/21/britain-is-the-best-place-... |
| 4 | America’s Supreme Court should reject the chal... | /leaders/2024/03/20/americas-supreme-court-sho... |


By the way, if you follow the link to the article itself, there will be a lot of additional information about it. For instance, you can go through all the links and download all texts from them.

# 2. What if the server is mad at you <a name='mad'></a>

* You decided to collect some data for yourself
* The server is not happy with the automatic request attack
* Error 403, 404, 504, $ \ldots $
* Captcha, registration requirements
* Messages that suspicious traffic has been detected from your device

<center>
<img src="https://cdn5.vectorstock.com/i/1000x1000/39/89/kawaii-cute-angry-computer-technology-vector-16553989.jpg" width="250"> 

## a) be patient

Too frequent requests annoy the server. Put time delays between them.

In [None]:
import time
time.sleep(3) # the pause between requests 
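A fixed pause works; a pause of random length looks a little less mechanical. The sketch below is one possible way to do it with the standard library only; `polite_pause` and its default bounds are assumptions, not a recommendation for any particular site.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

# Tiny bounds so that the demo finishes instantly; use whole seconds when scraping
waited = polite_pause(0.01, 0.02)
print(f"waited {waited:.3f} s")
```

Inside a download loop, call `polite_pause()` once after every request.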

## b) be like a human


A normal person's request through a browser looks like this:

<center>
<img src="https://raw.githubusercontent.com/hse-econ-data-science/eds_spring_2020/master/sem05_parsing/image/browser_get.png" width="600"> 
    
The request from python looks like this: 

<center>
<img src="https://raw.githubusercontent.com/hse-econ-data-science/eds_spring_2020/master/sem05_parsing/image/python_get.jpg" width="250"> 
 
Fortunately, no one forbids us to pretend to be human and to fool the server with a fake user agent. Many libraries cope with this task, for example <a href = "https://pypi.org/project/fake-useragent/">fake-useragent</a>. Each call assembles a random combination of operating system, specifications and browser version into a User-Agent string that can be passed along with the request:

In [133]:
#!pip install fake-useragent
#!pip3 install --upgrade fake-useragent
#!conda install fake-useragent



Restart your Jupyter kernel after the installation.

In [80]:
from fake_useragent import UserAgent
ua = UserAgent()
#ua.ie
ua.chrome

'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.551.0 Safari/534.10'

For example, https://knowyourmeme.com/ may refuse to let Python in and return a 403 error. A server issues this code when it is available and able to process requests but refuses to do so for its own reasons. Server behaviour changes over time: in the run recorded below, the plain request was answered with 200.

In [138]:
import requests

In [90]:
url = 'https://knowyourmeme.com/'

response = requests.get(url)
response

<Response [200]>

If you generate a User-Agent, the problem is gone.

In [81]:
response = requests.get(url, headers={'User-Agent': UserAgent().chrome})
response

<Response [200]>
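If many requests go to the same site, the header can be set once on a `requests.Session` instead of being repeated in every call. A minimal sketch, assuming a fixed User-Agent string (copied here from the `fake-useragent` output above); nothing is sent over the network, the request is only prepared so that its headers can be inspected.

```python
import requests

# A browser-like string; any realistic User-Agent could be used instead
USER_AGENT = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.10 "
    "(KHTML, like Gecko) Chrome/8.0.551.0 Safari/534.10"
)

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({"User-Agent": USER_AGENT})

# prepare_request only builds the outgoing request; no network call happens
prepared = session.prepare_request(
    requests.Request("GET", "https://knowyourmeme.com/")
)
print(prepared.headers["User-Agent"])
```

After this, `session.get(url)` behaves like `requests.get(url, headers=...)` with the header already filled in.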

**Another example**: if you try to parse CIAN, it will start giving you a captcha. One workaround is to change your IP address through Tor. However, CIAN raises a captcha for almost every request from Tor. If you add a `User-Agent` header to the request, the captcha appears much less often.

## c) communicate through intermediaries

<center>
<img src="https://raw.githubusercontent.com/hse-econ-data-science/eds_spring_2020/master/sem05_parsing/image/proxy.jpeg" width="300"> 

Look at your IP address without a proxy.

In [82]:
r = requests.get('https://httpbin.org/ip')
print(r.json())

{'origin': '212.191.80.164'}


With a proxy the picture changes:

In [151]:
proxies = {
    'http': 'http://46.30.188.54',
    'https': 'https://46.30.188.54'
}

r = requests.get('https://httpbin.org/ip', proxies=proxies)
#r = requests.get('https://httpbin.org/ip', proxies=proxies)

print(r.json())

The request took a little longer, and the IP address changed. Most of the proxies you'll find work badly. Sometimes a request takes a very long time, and it is better to drop it and try another proxy. This can be configured with the `timeout` option. For example, with `timeout=1`, if the server does not respond within a second, `requests` raises a `Timeout` exception.

In [None]:
import requests
requests.get('http://www.google.com', timeout=1)
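Dropping a slow proxy and trying the next one can be sketched as a small loop around the request. The function name `get_via_first_working_proxy` and its `get` parameter (which makes it possible to substitute a fake downloader when experimenting) are assumptions introduced for this example.

```python
import requests

def get_via_first_working_proxy(url, proxy_urls, timeout=1.0, get=requests.get):
    """Try each proxy in turn; return the first response, or None if all fail."""
    for proxy_url in proxy_urls:
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            return get(url, proxies=proxies, timeout=timeout)
        except (requests.exceptions.Timeout,
                requests.exceptions.ConnectionError) as error:
            # Slow or dead proxy: report it and move on to the next one
            print(f"{proxy_url} failed: {type(error).__name__}")
    return None
```

A call could look like `get_via_first_working_proxy('https://httpbin.org/ip', ['http://46.30.188.54'])`; the result is `None` when no proxy in the list answers in time.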

There are quite a few interesting features for requests. You can find them in [a guide.](https://requests.readthedocs.io/en/master/user/advanced/)


# 3. API  <a name='api'></a>

__API (Application Programming Interface)__ is a ready-made interface through which your code can talk to someone else's service. Many services, including Google and Vkontakte, provide such ready-made solutions for your development.

Examples: 

* [API Twitter](https://developer.twitter.com/en/docs.html) 
* [API YouTube](https://developers.google.com/youtube/v3/)
* [API Google Maps](https://developers.google.com/maps/documentation/) 
* [Aviasales](https://www.aviasales.ru/API)
* [Yandex Translate](https://yandex.ru/dev/translate/)

## 3.2 API Google maps

A map API may be needed for various semi-geographic surveys. For example, we want to test the hypothesis that good coffee raises the price of an apartment. We want to take the number of coffee shops in the vicinity as one of the regressors. This number of coffee shops must be taken from somewhere. You can use Google Maps. 

It all starts with [getting the key](https://developers.google.com/maps/documentation/directions/start). Follow the link, click Get started, and agree with everything except payment. You get an access key; save it in a file next to the notebook.

In [153]:
# load your token (strip the trailing newline, if any)
with open('google_token.txt') as f:
    google_token = f.read().strip()

We build the request URL as the [documentation](https://developers.google.com/maps/documentation) prescribes and receive a response in JSON form.

In [None]:
mainpage = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?'

location = '51.781840, 19.485776'
radius = '3000'
keyword = 'coffee' # coffee shop

parameters = 'location='+location+'&radius='+radius+'&keyword='+keyword+'&language=ru-Ru'+'&key='+ google_token

itog_url = mainpage + parameters 
itog_url

In [None]:
response = requests.get(itog_url)

response.json()['results'][0]

From the JSON, using the corresponding keys, we pull out the most interesting parts. For example, the names of the coffee shops:

In [None]:
[item['name'] for item in response.json()['results']]
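Gluing the query string together by hand works until a value contains a space, a comma or an ampersand. A hedged alternative is to let the standard library's `urlencode` do the escaping; the helper name `build_nearby_search_url` and the placeholder key `YOUR_KEY` are assumptions for this sketch.

```python
from urllib.parse import urlencode

def build_nearby_search_url(location, radius, keyword, key, language="ru-Ru"):
    """Assemble the request URL with properly escaped query parameters."""
    base = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
    query = urlencode({
        "location": location,
        "radius": radius,
        "keyword": keyword,
        "language": language,
        "key": key,
    })
    return f"{base}?{query}"

url = build_nearby_search_url("51.781840,19.485776", 3000, "coffee shop", "YOUR_KEY")
print(url)
```

`requests` can do the same job itself: `requests.get(base, params={...})` takes the dictionary and escapes the values for you.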

# Tips: <a name='tips'></a>

### Tip 1:  Use `try-except`

This construction allows python to do something else in case of an error or ignore it. For example, we want to find the logarithm of all numbers in a list:

In [83]:
from math import log 

a = [1,2,3,-1,-5,10,3]

for item in a:
    print(log(item))

0.0
0.6931471805599453
1.0986122886681098


ValueError: math domain error

It does not work since the logarithm of negative numbers cannot be calculated. To prevent the code from crashing when an error occurs, we can change it a little:

In [84]:
from math import log 

a = [1,2,3,-1,-5,10,3]

for item in a:
    try:
        print(log(item))  # try the logarithm
    except ValueError:
        print("I can't")  # if it fails, print this

0.0
0.6931471805599453
1.0986122886681098
I can't
I can't
2.302585092994046
1.0986122886681098


How can this be used for parsing? Suppose we set the parser to download prices overnight, it ran for an hour and crashed. It would be nice if the code ignored the error and carried on working.
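One way to apply this to a download loop is to wrap each step in `try-except`, keep what succeeded and remember what failed. In the sketch below, `collect` and `flaky_fetch` are hypothetical names; `flaky_fetch` stands in for a real download function such as `get_page` and fails on purpose on page 2.

```python
def collect(pages, fetch):
    """Call fetch(page) for every page; keep results and remember failures."""
    results, failed = [], []
    for page in pages:
        try:
            results.extend(fetch(page))
        except Exception as error:  # broad on purpose: no failure should stop the run
            print(f"page {page} failed: {error!r}")
            failed.append(page)
    return results, failed

def flaky_fetch(page):
    """Stand-in for a real download function; fails on page 2."""
    if page == 2:
        raise ConnectionError("server did not answer")
    return [f"item from page {page}"]

results, failed = collect(range(1, 4), flaky_fetch)
print(results, failed)
```

The list of failed pages can be fed back into `collect` later for a second attempt.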

### Tip 2:  pd.read_html

If a table is hidden among the `<tr>` and `<td>` tags on the page you have parsed, you can usually pick it up without writing a loop over all rows and columns. `pd.read_html` will help with this. For example, this is how you can pick up [a table from the Central Bank website](https://cbr.ru/currency_base/daily/):

In [None]:
import pandas as pd

df = pd.read_html('https://cbr.ru/currency_base/daily/', header=None)[0]
df.head()

The function tries to collect all the tables from the web page into a list. If you want, you can first find the desired table through bs4, and then parse it:

In [None]:
from io import StringIO

resp = requests.get('https://cbr.ru/currency_base/daily/')
tree = BeautifulSoup(resp.content, 'html.parser')

# find a table
table = tree.find_all('table', {'class' : 'data'})[0]

# parse it; recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(table)), header=None)[0]
df.head()

### Tip 3:  Use tqdm

> The code has been running for an hour. I have no idea when it will finish. It would be cool to know how long I have to wait ... 

If you have such a thought in your head, install it: `conda install tqdm`

In [None]:
import time
from tqdm.notebook import tqdm_notebook

a = list(range(30))

for i in tqdm_notebook(a):
    time.sleep(1)

### Tip 4:  parallel computing

If the server is not too eager to ban you, you can parallelize your requests to it. The easiest way to do this is with the `joblib` library.

In [3]:
from joblib import Parallel, delayed
from tqdm.notebook import tqdm_notebook

def simple_function(x):
    return x**2

nj = -1 # use all cores
result = Parallel(n_jobs=nj)(
                delayed(simple_function)(item)          # function to apply
                for item in tqdm_notebook(range(10)))   # objects to be applied to


  0%|          | 0/10 [00:00<?, ?it/s]

This is actually not the most efficient way to parallelize in Python: it consumes a lot of memory and is slower than [multiprocessing](https://docs.python.org/3/library/multiprocessing.html). But you only need to write two lines of code.
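For downloads, where the program mostly waits for the server, a thread pool from the standard library is another option. This is a different technique from `joblib` (threads rather than worker processes), shown here only as a sketch on the same toy function; `max_workers=4` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

def simple_function(x):
    return x**2

# Threads suit tasks that spend most of their time waiting, e.g. for a reply
with ThreadPoolExecutor(max_workers=4) as executor:
    # map returns the results in the same order as the inputs
    result = list(executor.map(simple_function, range(10)))

print(result)
```

In a scraper, `simple_function` would be replaced by the function that downloads one page.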

###  More tips: 

* **Save what you parse as you download!** Put the code that saves the file right inside the loop!
* When the code fails in the middle of downloading, there is no need to run it from the start. Just save the piece that has already been downloaded and restart the code from the step where it crashed.
* Putting the loop over links inside a function is not a good idea. Let's say you want to download $100$ links, and the function is supposed to return the downloaded objects. It fails at the $50$th object. The function then returns nothing, and because its list lives in the function's own namespace, everything downloaded so far is lost and the whole run has to be repeated. If the loop ran outside the function, the first $50$ objects would already be in your list, and you could save them and continue the download.
* You can navigate an HTML page using `xpath`. It is designed for quickly finding elements inside an HTML page. [You can read more here.](https://devhints.io/xpath)
* Don't be too lazy to read the documentation. You can learn many useful things from it.
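The first two tips can be sketched with a JSON Lines file: every record is appended to disk as soon as it is obtained, and after a crash the file shows what is already done. The helper names, the file name `articles.jsonl` and the dummy records are assumptions for illustration; a temporary directory is used so the demo leaves nothing behind in your working folder.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one record as a JSON line, so earlier records survive a crash."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_records(path):
    """Read back everything saved so far (empty list if there is no file yet)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "articles.jsonl")

for page in range(1, 4):
    # In real code the record would come from the download step
    append_record(path, {"page": page, "title": f"title {page}"})

saved = load_records(path)
done_pages = {record["page"] for record in saved}
print(done_pages)
```

On restart, pages that are already in `done_pages` can simply be skipped.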

### Tasks

Work with [webpage](https://www.math.uni.lodz.pl/en/)

* Print the title of the webpage. Print an error message if the title does not exist
* Print all links that exist within `<a href> </a>` blocks


### Further reading

[A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)