# Web Scraping using *BeautifulSoup* and *requests*
## **Introduction**
In this project, we use the `BeautifulSoup` and `requests` modules to introduce web scraping, the automated process of collecting data from websites.

The `requests` module is used to make a *request* to a web server, which downloads the HTML content of a given web page.

HTML (HyperText Markup Language) is a markup language that tells a browser how to display content. 

The `BeautifulSoup` module is then used to *parse* the downloaded HTML data from the web page and extract the desired content. 

Finally, the `pandas` module will be used to load the extracted data into a dataframe for future analysis.

Generally, we will follow these steps to extract data from a website:

1. Request the source code of a specific URL
2. Download the returned content
3. Identify the elements of the page we want
4. Extract those elements into a dataset

Let us begin by importing the appropriate libraries.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

## Web Scraping
In this first scrape, we will be using the online encyclopedia, *Wikipedia*, to extract a list of champions from the racing sport *Formula 1*.

A static website is a good place to start scraping, since it keeps the code simple.

The `requests` module is used to submit a `GET` request to the Wikipedia page, which returns the page's HTML content.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_Formula_One_World_Drivers%27_Champions"
page = requests.get(url)
print(page)

<Response [200]>


The response object returned contains a status code indicating whether our download was successful or not.

Generally, response codes starting with `2` indicate success, and codes starting with `4` or `5` indicate an error has occurred.
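
As a small illustration of how status codes group by their first digit, the sketch below classifies a few codes using only the standard library. The helper name `describe_status` and the `FAMILIES` mapping are hypothetical and are not part of `requests`.

```python
from http import HTTPStatus

# Hypothetical helper for illustration; not part of the requests API.
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (client error)'."""
    family = FAMILIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # the code is not a registered HTTP status
        phrase = "Unregistered"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With a real response, `page.status_code` holds the integer code, `page.ok` is `True` for codes below 400, and `page.raise_for_status()` raises an exception for `4xx` and `5xx` codes.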

Next, an instance of the *BeautifulSoup* class will be created and the document can be parsed and prepared for data extraction.

In [3]:
soup = bs(page.content, "html.parser")  # .content holds the raw bytes of the downloaded HTML

The developer tools built into web browsers make it easy to explore the HTML of a web page while viewing it. This can be done by using the *inspect* tool on a given portion of a web page.

BeautifulSoup also provides a `prettify()` method that returns the HTML as an indented, structured string. We will not be using it here, but it is shown for reference.

In [4]:
# print(soup.prettify())
# We will opt to use the developer tools on the website itself (inspect) to navigate the HTML instead
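
For reference, here is a minimal sketch of `prettify()` on a tiny inline document rather than the full Wikipedia page. The HTML string and the name `soup_demo` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Small inline document used only for illustration.
html = "<div><p>Hello <b>world</b></p></div>"
soup_demo = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag per line; it does not print by itself.
pretty = soup_demo.prettify()
print(pretty)
```

On a full page the output runs to thousands of lines, which is why the browser's developer tools are usually the more practical way to explore the structure.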

### Finding all instances of a tag at once
We are looking to extract the *World Drivers' Champions by season* table from the Wikipedia page.

We can use the `.find_all()` method to find all instances of tables using the appropriate *tag*.

**Tags** are the building blocks of an HTML document and allow us to navigate through it and extract other tags and text.

Let us find all instances of tables on this web page using the `table` tag. Note that the `find_all()` method returns a list.

In [5]:
print(len(soup.find_all('table')))

14


There are 14 returned instances of the `table` tag within this document, but we are looking for just the one. 
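
To make the list behavior concrete, the sketch below runs `find_all()` and `find()` on a small inline document. The HTML and the `id` values are invented for illustration and do not come from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Invented markup with two tables, used only for illustration.
html = """
<table id="first"><tr><td>a</td></tr></table>
<p>Some text between the tables.</p>
<table id="second"><tr><td>b</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

tables = soup_demo.find_all("table")   # list-like result holding every match
first = soup_demo.find("table")        # only the first match, or None if absent
missing = soup_demo.find_all("video")  # no matches gives an empty list

print(len(tables), first["id"], tables[1]["id"], len(missing))
```

Because the result behaves like a list, `len()`, indexing, and slicing all work on it directly.
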
### Searching for tags by class
Classes (and ids) are used to determine which HTML elements to apply certain styles to. With the help of the developer tools, we can leverage this and also use them to specify the elements we want to scrape.

We see that the table's `class` attribute is "wikitable sortable".

In [6]:
print(len(soup.find_all("table", class_ = "wikitable sortable")))

7


This cuts the number of tables in half, but it still returns more than the one we want.

Recall that `find_all()` returns a list, so we can use list indexing to select the table we want.

In this case, we want the first entry, as that table contains all our desired info.

In [7]:
c_list = soup.find_all("table", class_ = "wikitable sortable")[0]
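
The class-based search above depends on how BeautifulSoup matches a multi-valued `class` attribute. The sketch below, on invented markup, compares three ways of searching: a single class name, the exact attribute string, and a CSS selector that requires both classes in any order. The `id` values and the extra class `plainrowheaders` are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Invented tables that differ only in their class attribute.
html = """
<table class="wikitable sortable" id="a"></table>
<table class="sortable wikitable" id="b"></table>
<table class="wikitable" id="c"></table>
<table class="wikitable sortable plainrowheaders" id="d"></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Collect the id attribute of each matched tag."""
    return [tag["id"] for tag in tags]


# A single class name matches any tag that carries that class.
single = ids(soup_demo.find_all("table", class_="wikitable"))

# A string with a space must equal the whole attribute value exactly.
exact = ids(soup_demo.find_all("table", class_="wikitable sortable"))

# A CSS selector requires both classes, regardless of order or extra classes.
both = ids(soup_demo.select("table.wikitable.sortable"))

print(single, exact, both)
```

If a page lists its classes in a different order or adds extra ones, the selector form is the more forgiving of the three.
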

### Searching using CSS selectors
Items can also be found using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

- `div p`: finds all `p` tags inside of a `div` tag.
- `body p a`: finds all `a` tags inside of a `p` tag inside of a `body` tag.
- `html body`: finds all `body` tags inside of an `html` tag.
- `p.outer-text`: finds all `p` tags with a class of `outer-text`.
- `p#first`: finds all `p` tags with an id of `first`.
- `body p.outer-text`: finds any `p` tags with a class of `outer-text` inside of a `body` tag.
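
The selectors listed above can be tried with BeautifulSoup's `select()` method. The following sketch runs several of them against a small invented document; the markup and the name `soup_demo` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup that exercises the example selectors.
html = """
<html><body>
  <div><p class="outer-text" id="first">One <a href="#x">link</a></p></div>
  <p class="outer-text">Two</p>
  <p id="second">Three</p>
</body></html>
"""
soup_demo = BeautifulSoup(html, "html.parser")

selectors = ["div p", "body p a", "p.outer-text", "p#first", "body p.outer-text"]

# select() returns a list of every tag that matches the selector.
counts = {sel: len(soup_demo.select(sel)) for sel in selectors}
for sel, n in counts.items():
    print(f"{sel!r} matched {n} element(s)")

# select_one() returns only the first match.
first_text = soup_demo.select_one("p#first").get_text()
```
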
### Extracting the column names
At first glance, we see that the column names are stored in `th` tags within `tr` tags. 

We'll use CSS selectors to extract the column names from the `tr` tag.

(This section needs refining)

In [8]:
cols = [t.get_text().strip() for t in c_list.select("tr th")]
print(cols)

['Season', 'Driver', 'Age', 'Constructor', 'Tyres', 'Poles', 'Wins', 'Podiums', 'Fastest laps', 'Points', '% Points', 'Clinched[17]', '# of racesremaining', 'Margin', '% Margin', 'Chassis', 'Engine', 'Season', 'Driver', 'Age', 'Chassis', 'Engine', 'Tyres', 'Poles', 'Wins', 'Podiums', 'Fastest laps', 'Points', '% Points', 'Clinched', '# of racesremaining', 'Margin', '% Margin', 'Constructor']


We extracted the desired columns, but we also extracted duplicated values.

Upon closer inspection, we see that our desired column names are in two locations, the `thead` and `tfoot` tags.

Looking at the table, we notice that the *Constructor* column is really two columns in one.

The `tfoot` tag does the honors of splitting them into individual columns, so we will extract from there.

In [9]:
# Iterating over the second-to-last row yields its child nodes; the whitespace
# between cells turns into empty strings after strip()
cols = [t.get_text().strip() for t in c_list.select("tr")[-2]]

# Drop the empty strings, which evaluate to False in a boolean context
cols[:] = [x for x in cols if x]
print(cols)

['Season', 'Driver', 'Age', 'Chassis', 'Engine', 'Tyres', 'Poles', 'Wins', 'Podiums', 'Fastest laps', 'Points', '% Points', 'Clinched', '# of racesremaining', 'Margin', '% Margin']
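
The fused name `# of racesremaining` in the output suggests a line break inside the header cell, since `get_text()` joins the text pieces with nothing between them. The sketch below shows one possible alternative on invented markup: passing a separator to `get_text()`. The header row here is a simplified stand-in, not the actual Wikipedia HTML.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a header row whose last cell contains a <br/>.
html = """
<table>
  <tr><th>Season</th><th>Driver</th><th># of races<br/>remaining</th></tr>
</table>
"""
header_row = BeautifulSoup(html, "html.parser").find("tr")

# Default get_text() concatenates the pieces on either side of the <br/>.
fused = [th.get_text().strip() for th in header_row.find_all("th")]

# A separator plus strip=True joins the stripped pieces with a single space.
spaced = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]

print(fused)
print(spaced)
```

Selecting the `th` tags directly also avoids the empty strings that come from iterating over every child of the row.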


### Extracting information from the table
Now that we have the appropriate columns, we will begin to scrape the page to populate them.

The data we want is located within the `tr` tags of `c_list`.

Each `tr` tag contains the information that is required to populate one row of our dataframe in individual `td` tags.

Let's find all instances of the `tr` tag.

In [10]:
f1_info = c_list.select("tr")

We now have a list of the individual `tr` tags that contain the information needed. 

We must be careful, however, as the first two and last two indices contain information from the `thead` and `tfoot` tags, which also have `tr` tags. We will remove these for now and move forward.

In [11]:
f1_info = f1_info[2:-2]

The desired information has been organized, but we now need to isolate the individual values for each column, which are stored within the `td` tags.

The `td` tags are direct children of each `tr` tag, so we can iterate over them using the `children` property.

In [12]:
f1_list = []  # Initialize an empty list to store each champion entry

for element in f1_info:
    f1_champ = []

    for tag in element.children:
        # Append the stripped text of each child node (cells and the whitespace between them)
        f1_champ.append(tag.get_text().strip())

    f1_champ[:] = [x for x in f1_champ if x]  # Remove any blank strings within the list

    f1_list.append(f1_champ)
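
One caveat with filtering out blank strings is that a genuinely empty cell is dropped as well, which would shift the remaining values one column to the left. The sketch below illustrates this on an invented row and shows a possible alternative that selects only the cell tags. The empty cell is hypothetical and was added purely to demonstrate the effect.

```python
from bs4 import BeautifulSoup

# Invented row; the empty <td> is hypothetical and only there to show the effect.
html = """
<tr>
  <th>1950</th>
  <td>Giuseppe Farina</td>
  <td></td>
  <td>Alfa Romeo</td>
</tr>
"""
row = BeautifulSoup(html, "html.parser").find("tr")

# .children yields the cell tags and also the whitespace strings between them.
child_types = [type(child).__name__ for child in row.children]

# Selecting only direct th/td children keeps one entry per cell, even if empty.
cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"], recursive=False)]

# Filtering out blanks afterwards loses the empty cell and shortens the row.
without_blanks = [text for text in cells if text]

print(child_types)
print(cells)
print(without_blanks)
```

Keeping one entry per cell means every row has the same length as the header, which matters when the rows are loaded into a dataframe.
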

### Creating Dataframe
We previously obtained all the names of the columns we are interested in, stored in `cols`. 

We now have extracted the information for each champion, stored in `f1_list`.

Let us finish by creating a dataframe containing our desired information. 

This step could also include exporting our data into a CSV file for others to use.

In [13]:
f1 = pd.DataFrame(f1_list, columns = cols)
# f1.to_csv()
f1.head()

| | Season | Driver | Age | Chassis | Engine | Tyres | Poles | Wins | Podiums | Fastest laps | Points | % Points | Clinched | # of racesremaining | Margin | % Margin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950 | Giuseppe Farina[20] | 44 | Alfa Romeo | Alfa Romeo | P | 2 | 3 | 3 | 3 | 30.0 | 83.333 (47.619) | Race 7 of 7 | 0 | 3.0 | 10.0 |
| 1 | 1951 | Juan Manuel Fangio[21] | 40 | Alfa Romeo | Alfa Romeo | P | 4 | 3 | 5 | 5 | 31.0 | 86.111 (51.389) | Race 8 of 8 | 0 | 6.0 | 19.355 |
| 2 | 1952[a] | Alberto Ascari[23] | 34 | Ferrari | Ferrari | F P | 5 | 6 | 6 | 6 | 36.0 | 100.000 (74.306) | Race 6 of 8 | 2 | 12.0 | 33.333 |
| 3 | 1953[a] | Alberto Ascari[23] | 35 | Ferrari | Ferrari | P | 6 | 5 | 5 | 4 | 34.5 | 95.833 (57.764) | Race 8 of 9 | 1 | 6.5 | 18.841 |
| 4 | 1954 | Juan Manuel Fangio[21] | 43 | Maserati[b] | Maserati | P | 5 | 6 | 7 | 3 | 42.0 | 93.333 (70.547) | Race 7 of 9 | 2 | 16.857 | 40.136 |
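
As a compact illustration of the dataframe step, the sketch below builds a dataframe from a list of rows and writes it out as CSV text. It uses three columns and two rows taken from the output above; the length check and the variable names `rows`, `bad_rows`, and `csv_text` are assumptions added for illustration.

```python
import pandas as pd

# Three columns and two rows taken from the scraped output, for illustration.
cols = ["Season", "Driver", "Age"]
rows = [
    ["1950", "Giuseppe Farina[20]", "44"],
    ["1951", "Juan Manuel Fangio[21]", "40"],
]

# Check that every row has exactly one value per column before building the frame.
bad_rows = [row for row in rows if len(row) != len(cols)]
if bad_rows:
    raise ValueError(f"{len(bad_rows)} row(s) do not match the {len(cols)} column names")

df = pd.DataFrame(rows, columns=cols)

# With no path given, to_csv() returns the CSV as a string; index=False omits the row index.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument of `to_csv()` writes the same content to disk instead of returning it.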


## Conclusion

In this project, we have successfully scraped the *World Drivers' Champions by season* table from Wikipedia using the `requests`, `BeautifulSoup`, and `pandas` modules.

The techniques used in this project serve as an introduction to scraping a static website.

Improvements can be made to the extraction and handling of tags to reduce the complexity of the code.

Future endeavors will be to create functions and scrape dynamic websites. 

We will return at a later time to conduct exploration, cleaning and analysis of the data we just obtained.