**Web scraping**

In [1]:
%run setup.ipynb

# Web scraping libraries

Web scraping is a way to retrieve useful information from web sources programmatically.

It is particularly useful when you cannot find a ready-made corpus (speeches, movie reviews, and so on).

Some websites allow web scraping for individual non-commercial use. You must check a website's policy before scraping it.
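One part of a site's policy that can be checked programmatically is its `robots.txt` file. The following is a minimal sketch using the standard-library `urllib.robotparser`; the rules and the `example.com` URLs are hypothetical and are parsed from a string rather than downloaded. It does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed_public = parser.can_fetch("*", "https://example.com/products/laptops")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` would load the live `robots.txt` instead of the sample rules.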

## Example with e-commerce company

Consider the following [test website](https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops) as the target of the web scraping example. It represents an imaginary e-commerce company that sells computers and phones.

The page lists the products that it sells.
Each product has price and user rating information.

The goal is to get the price and user rating of every laptop listed on the website.

- Use the `requests` and `beautifulsoup4` libraries (two of the most commonly used Python libraries for web scraping).

In [2]:
#!pip install requests
#!pip install beautifulsoup4
#!pip install pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd

- Show the full HTML source of the web page. Use the `.text` attribute of the response object returned by `requests.get()` (stored in the variable `request`).

In [3]:
titles = []
prices = []
ratings = []
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
request = requests.get(url)
print(request.text) 

<!DOCTYPE html>
<html lang="en">
<head>

			<!-- Anti-flicker snippet (recommended)  -->
<style>.async-hide {
		opacity: 0 !important
	} </style>
<script>(function (a, s, y, n, c, h, i, d, e) {
		s.className += ' ' + y;
		h.start = 1 * new Date;
		h.end = i = function () {
			s.className = s.className.replace(RegExp(' ?' + y), '')
		};
		(a[n] = a[n] || []).hide = h;
		setTimeout(function () {
			i();
			h.end = null
		}, c);
		h.timeout = c;
	})(window, document.documentElement, 'async-hide', 'dataLayer', 4000,
		{'GTM-NVFPDWB': true});</script>
	
	<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NV

You can see the same content in your browser: right-click on the page and select `Inspect`.
This opens a panel containing the HTML code of the page.
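The `requests.get(url)` call shown earlier does not check whether the download succeeded. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, raising an error if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Example call (requires network access):
# html = fetch_html('https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops')
```

Failing early on an error response avoids parsing an error page as if it were the product listing.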

- Navigate through the HTML code and extract the relevant information. Use `BeautifulSoup` to parse the text.

In [4]:
soup = BeautifulSoup(request.text, "html.parser")

`BeautifulSoup(...)` creates a parsed, searchable representation of the page. The `"html.parser"` argument selects Python's built-in HTML parser.
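As a small illustration of the parsing step, the sketch below parses a hypothetical HTML fragment (not taken from the test website) and looks up two elements. The name `demo_soup` is used so that it does not overwrite the `soup` object of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
demo_html = "<html><body><h1>Laptops</h1><p class='note'>3 items</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python.
demo_soup = BeautifulSoup(demo_html, "html.parser")

# find() returns the first matching tag; .text gives the text inside it.
heading = demo_soup.find("h1").text
note = demo_soup.find("p", {"class": "note"}).text
print(heading, "|", note)
```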

The details of each product are coded within a `<div>` tag (a division of the HTML page) whose class is `col-sm-4 col-lg-4 col-md-4`.

The name of the product can be extracted from the `title` attribute of the `<a>` tag, which is within the `caption` subdivision of the code.

The price of the product can be extracted from the same `caption` subdivision, under the `pull-right price` class.

The rating can be extracted from the subdivision with the `ratings` class, by counting the star icons it contains.

The `for` loops search for information in the various subdivisions. The `.text` attribute returns the text inside a tag, while the `.get()` method returns the value of one of the tag's attributes.
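The difference between `.text` and `.get()` can be seen on a single tag. The snippet below is a hypothetical, simplified link element written for illustration; the real page contains more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified <a> tag similar to a product link.
snippet = '<a href="#" class="title" title="Asus VivoBook X441NA-GA190">Asus VivoBook...</a>'
link = BeautifulSoup(snippet, "html.parser").find("a")

visible_text = link.text        # text between <a> and </a>
full_title = link.get("title")  # value of the title attribute
missing = link.get("rel")       # attribute not present, so None

print(visible_text)
print(full_title)
print(missing)
```

Here the visible text is shortened while the `title` attribute holds the full product name, which is why the attribute is the one worth extracting.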

In [6]:
for product in soup.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'}):
    for pr in product.find_all('div', {'class': 'caption'}):
        # price text, e.g. "$295.99"
        for p in pr.find_all('h4', {'class': 'pull-right price'}):
            prices.append(p.text)
        # the full product name is stored in the title attribute of the link
        for title in pr.find_all('a', {'class': 'title'}):
            titles.append(title.get('title'))
    # rating = number of star icons in the ratings subdivision
    for rt in product.find_all('div', {'class': 'ratings'}):
        ratings.append(len(rt.find_all('span', {'class': 'glyphicon glyphicon-star'})))

# build the dataframe
product_df = pd.DataFrame(zip(titles, prices, ratings), columns=['Titles', 'Prices', 'Ratings'])
product_df.head()

| | Titles | Prices | Ratings |
|---|---|---|---|
| 0 | Asus VivoBook X441NA-GA190 | $295.99 | 3 |
| 1 | Prestigio SmartBook 133S Dark Grey | $299.00 | 2 |
| 2 | Prestigio SmartBook 133S Gold | $299.00 | 4 |
| 3 | Aspire E1-510 | $306.99 | 3 |
| 4 | Lenovo V110-15IAP | $321.94 | 3 |
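The loop collects titles, prices, and ratings in three separate lists, which can drift out of alignment if a product lacks one of the fields. A minimal sketch of an alternative builds one record per product. The helper name `parse_products` and the two-card sample page are hypothetical; the class names are the ones used in the loop.

```python
from bs4 import BeautifulSoup


def parse_products(html):
    """Return one dict per product card, using None for missing fields."""
    page = BeautifulSoup(html, "html.parser")
    records = []
    for card in page.find_all("div", {"class": "col-sm-4 col-lg-4 col-md-4"}):
        link = card.find("a", {"class": "title"})
        price = card.find("h4", {"class": "pull-right price"})
        stars = card.find_all("span", {"class": "glyphicon glyphicon-star"})
        records.append({
            "Titles": link.get("title") if link else None,
            "Prices": price.text if price else None,
            "Ratings": len(stars),
        })
    return records


# Hypothetical two-card page; the second card has no price element.
sample_page = """
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="caption">
    <h4 class="pull-right price">$295.99</h4>
    <h4><a class="title" title="Laptop A">Laptop A</a></h4>
  </div>
  <div class="ratings"><span class="glyphicon glyphicon-star"></span></div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="caption">
    <h4><a class="title" title="Laptop B">Laptop B</a></h4>
  </div>
  <div class="ratings"></div>
</div>
"""

records = parse_products(sample_page)
print(records)
```

A list of dicts like this can be passed directly to `pd.DataFrame(records)`, and a missing price stays attached to the right product instead of shifting the rows below it.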


In [7]:
product_df.to_csv(f"{RESULTS_PATH}/ecommerce.csv",index=False)

Web scraping makes it possible to extract information from web sources programmatically.

The more complex the structure of a web page is, the more difficult it is to scrape that page.