<a href="https://colab.research.google.com/github/rmpbastos/data_science/blob/master/Web_Scraping_with_Python_using_BeautifulSoup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Web Scraping](https://github.com/rmpbastos/data_science/blob/master/img/clement-h-95YRwf6CNw8-unsplash_small.jpg?raw=true)

# Web Scraping with Python using BeautifulSoup

### How to parse and extract data from HTML documents in simple steps
---

One of the first questions in any new project is how to obtain the data we'll be working with. Companies like Airbnb and Twitter, for instance, simplify this task by providing APIs so we can collect information in an organized way. In other cases, we can download a structured dataset that is already cleaned and ready to use, as in some Kaggle competitions. Quite often, however, we'll need to explore the web to find and extract the data ourselves.

That's when **web scraping** comes in handy. The idea is to extract information from a website and convert it for practical analysis. While there are several tools available for this purpose, in this article we'll be using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a **Python** library designed for easily pulling data out of HTML and XML files.

Here, we'll visit this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_best-selling_books), which contains several lists of best-selling books, and extract the second table, which covers books with between 50 million and 100 million copies sold.


### Libraries needed

We only need two packages to handle the HTML file. We'll also be using **Pandas** to create a data frame from the extracted data:

*   `requests` - Allows us to send HTTP requests and download the HTML code from the webpage
*   `beautifulsoup4` (imported as `bs4`) - Used to pull data out of the raw HTML file
*   `pandas` - Python library for data manipulation. We'll use it to create our data frame

In [16]:
# Import the requests library and the BeautifulSoup class
import requests
from bs4 import BeautifulSoup

### Extracting the HTML file

To extract the raw HTML, we simply pass the website URL to the `requests.get()` function and read the `text` attribute of the response.

In [17]:
# Download the page and keep the response body as a string
url_path = 'https://en.wikipedia.org/wiki/List_of_best-selling_books'
html_text = requests.get(url_path).text
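
The one-line download above assumes the request succeeds. As a minimal sketch, assuming we want to fail early on HTTP errors and avoid waiting indefinitely on a slow server, the call can be wrapped in a small helper. The name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # Give up if the server does not respond within `timeout` seconds
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # quietly returning the body of an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url_path)` would then take the place of `requests.get(url_path).text`.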

We now have the page's HTML as one long, unformatted string, downloaded from the URL we passed.

Let's take a look:

In [18]:
html_text

'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of best-selling books - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"9793959f-5d4e-4c5f-bd81-ede838d596b0","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_best-selling_books","wgTitle":"List of best-selling books","wgCurRevisionId":967504130,"wgRevisionId":967504130,"wgArticleId":2512935,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with reference errors","Webarchive template wayback links","CS1 Italian-language sources (it)","Webarchive template 

### Creating a BeautifulSoup object

Now, we can start to work with BeautifulSoup. Let's generate a BeautifulSoup object called `soup`, passing the `html_text` string created above along with the name of the parser to use. Here we use `html.parser`, which ships with Python.

In the next step, we can use a method called `prettify()` to print the object in an indented, structured format.

In [21]:
# Getting a Beautiful Soup object; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html_text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of best-selling books - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"9793959f-5d4e-4c5f-bd81-ede838d596b0","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_best-selling_books","wgTitle":"List of best-selling books","wgCurRevisionId":967504130,"wgRevisionId":967504130,"wgArticleId":2512935,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with reference errors","Webarchive template wayback links","CS1 Italian-language sources (it)","Webarchive

Notice how the formatted output is easier to read and work with than the raw `html_text` string we first downloaded.
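
To see what a `soup` object offers beyond pretty-printing, here is a small self-contained sketch on a made-up HTML snippet (the snippet and the variable names are illustrative, not taken from the downloaded page). It shows that tags can be reached as attributes of the object and that `get_text()` returns their text content.

```python
from bs4 import BeautifulSoup

# Tiny made-up document, standing in for the downloaded page
sample_html = """
<html>
  <head><title>List of best-selling books - Wikipedia</title></head>
  <body><h1 id="firstHeading">List of best-selling books</h1></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# The first <title> tag is available as an attribute of the soup
page_title = sample_soup.title.get_text()

# find() returns the first tag matching the given name and attributes
heading = sample_soup.find('h1', id='firstHeading').get_text()

print(page_title)
print(heading)
```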

### Inspecting the Wikipedia page

On the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_best-selling_books), let's inspect the elements of the web page. (On Windows, press **Ctrl + Shift + I**. On Mac, press **Cmd + Opt + I**.)


![Wikipedia inspect elements](https://github.com/rmpbastos/data_science/blob/master/img/wiki-inspect-img2.jpg?raw=true)

Notice that all tables have a class of `wikitable sortable`. We can take advantage of that to select all tables in the HTML file.

### Extracting the table

We are saving the tables in a variable called `wiki_tables`, using the method `find_all()` to search for all HTML `table` tags, with a class of `wikitable sortable`.

In [23]:
wiki_tables = soup.find_all('table', {'class': 'wikitable sortable'})
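
How `find_all()` treats a class string that contains a space is easy to overlook: it is compared against the whole `class` attribute, so tables with extra classes or a different class order are skipped. The sketch below uses made-up markup to compare that behavior with a CSS selector passed to `select()`, which matches any table carrying both classes. Whether the difference matters for the Wikipedia page depends on its current markup, so treat this as an illustration rather than a required change.

```python
from bs4 import BeautifulSoup

# Made-up markup: four tables with different class attributes
sample_html = """
<table class="wikitable sortable"><tr><td>A</td></tr></table>
<table class="sortable wikitable"><tr><td>B</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>C</td></tr></table>
<table class="infobox"><tr><td>D</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Same style as the call above: matches the exact attribute string only
exact = sample_soup.find_all('table', {'class': 'wikitable sortable'})

# CSS selector: matches any table carrying both classes, in any order
both = sample_soup.select('table.wikitable.sortable')

print(len(exact), len(both))
```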

As we want the second table on the page (Between 50 million and 100 million copies), let's narrow down our search to the second `wiki_tables` element, which sits at index 1 because Python lists are zero-indexed. Let's also extract each row (`tr`) in that table.

In [24]:
second_table = wiki_tables[1].find_all("tr")

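Now, we'll create an empty list called `table_list`, and for each row append a list with the text of its table cells (`td`). Rows that contain only header cells (`th`) have no `td` elements, so they produce empty lists.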

In [25]:
# Extracting the text from the table cells
table_list = []

for tr in second_table:
    td = tr.find_all('td')
    row = [ele.text.strip() for ele in td]
    table_list.append(row)

We have successfully extracted that second table from the website into a list and we're all set up to start analyzing the data.
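
To make the behavior of that loop concrete, here is a minimal sketch on a made-up two-column table (the real Wikipedia table has more columns, and the sample markup is an assumption). It shows that a header row built from `th` cells yields an empty list, and how the header text could be collected separately if needed.

```python
from bs4 import BeautifulSoup

# Made-up two-column table; the real Wikipedia table has more columns
sample_html = """
<table>
  <tr><th>Book</th><th>Author(s)</th></tr>
  <tr><td>The Da Vinci Code</td><td>Dan Brown</td></tr>
  <tr><td>She: A History of Adventure</td><td>H. Rider Haggard</td></tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')

# Same logic as the loop above: only <td> cells are collected
cells = [[td.text.strip() for td in tr.find_all('td')] for tr in sample_rows]

# The header row has only <th> cells, so they are read separately
headers = [th.text.strip() for th in sample_rows[0].find_all('th')]

# Keep only the rows that actually contain data
data_rows = [row for row in cells if row]

print(cells[0])
print(headers)
print(data_rows)
```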

### Creating a pandas DataFrame

Finally, we can simply convert the list into a **Pandas DataFrame** to visualize the data we extracted from Wikipedia.

In [28]:
# Import pandas
import pandas as pd

In [29]:
# Build the data frame from the extracted rows, then drop the all-empty header row
df = pd.DataFrame(table_list, columns=['book', 'author', 'original_language', 'first_published', 'approximate_sales', 'genre'])
df = df.dropna(how='all').reset_index(drop=True)
df.head()

|   | book | author | original_language | first_published | approximate_sales | genre |
|---|------|--------|-------------------|-----------------|-------------------|-------|
| 0 | The Lion, the Witch and the Wardrobe | C. S. Lewis | English | 1950 | 85 million[24] | fantasy |
| 1 | She: A History of Adventure | H. Rider Haggard | English | 1887 | 83 million[25] | adventure |
| 2 | The Adventures of Pinocchio (Le avventure di P... | Carlo Collodi | Italian | 1881 | >80 million[26][27] | fantasy |
| 3 | The Da Vinci Code | Dan Brown | English | 2003 | 80 million[28] | mystery thriller |
| 4 | Harry Potter and the Chamber of Secrets | J. K. Rowling | English | 1998 | 77 million[29] | fantasy |
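
The `dropna(how='all')` call is there because a row with no `td` cells arrives as an empty list, which pandas pads with missing values. A minimal sketch with made-up two-column rows (the sample data and variable names are assumptions) shows the effect:

```python
import pandas as pd

# Made-up rows: the first list is empty, like a header row with no <td> cells
sample_rows = [
    [],
    ['The Da Vinci Code', 'Dan Brown'],
    ['She: A History of Adventure', 'H. Rider Haggard'],
]

# The empty list becomes a row of missing values
sample_df = pd.DataFrame(sample_rows, columns=['book', 'author'])
n_before = len(sample_df)

# Drop rows where every column is missing, then renumber from 0
sample_df = sample_df.dropna(how='all').reset_index(drop=True)
n_after = len(sample_df)

print(n_before, n_after)
```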


That's it! With a few steps and some lines of code we now have a data frame extracted from an HTML table, ready for analysis. There are still some adjustments that could be made, such as removing the bracketed reference markers in the `approximate_sales` column, but the web scraping is done!
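
As a minimal sketch of that clean-up, assuming the reference markers always look like `[24]` (digits inside square brackets), a regular expression can strip them. The sample values are copied from the output above, and the variable names are illustrative.

```python
import pandas as pd

# Sample values copied from the approximate_sales column shown above
sales = pd.Series(['85 million[24]', '83 million[25]', '>80 million[26][27]'])

# Remove every bracketed run of digits, such as [24] or [26][27]
cleaned = sales.str.replace(r'\[\d+\]', '', regex=True).str.strip()

print(cleaned.tolist())
```

On the full data frame, the same `str.replace` call could be applied to `df['approximate_sales']`.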