# Web Scraping with Beautiful Soup
I'm building this guide to develop my skills with the popular `BeautifulSoup` Python library ([Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) for web scraping, using realistic examples that I have come across in my own web scraping projects.

It is also a good opportunity to clean 'dirty' data and create data sets that are easy to manipulate for further analysis, machine learning or visualisation.

---

Basic flow:
1. Fetch the raw HTML for a given URL using `requests`
2. Extract the information you want from that HTML via elements, classes and ids using `BeautifulSoup`
3. Shape that information into an easily usable format using `pandas`
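
The three steps can be sketched end to end without touching the network. This is a minimal illustration, not the scraper used later in this notebook: `SAMPLE_HTML`, its class names and the helper `html_to_frame` are all made up for the example, and the string stands in for what `requests.get(url).text` would return.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Step 1 stand-in: hypothetical markup in place of requests.get(url).text
SAMPLE_HTML = """
<html><body>
  <div class="product"><h3>Gin A</h3><span class="price">£19.95</span></div>
  <div class="product"><h3>Gin B</h3><span class="price">£33.75</span></div>
</body></html>
"""


def html_to_frame(html):
    """Steps 2 and 3: pick out elements by class, then tabulate them."""
    soup = BeautifulSoup(html, features="html.parser")
    rows = []
    for box in soup.find_all(class_="product"):
        rows.append({
            "name": box.find("h3").get_text(),
            "price": box.find(class_="price").get_text(),
        })
    return pd.DataFrame(rows)


frame = html_to_frame(SAMPLE_HTML)
print(frame)
```

Keeping the parsing in a function that takes a plain HTML string makes it easy to try out on saved pages before pointing it at a live site.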


In [None]:
# Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

## Searching the HTML using a class
Every website follows some conventions for laying out its pages and the information we ultimately want to extract. These usually show up as CSS classes, especially when the site is dynamic and generates many pages with different content from the same template file. Being able to isolate those elements by a distinct class is therefore incredibly useful.

This can be done by passing the `class_` keyword argument to the `find_all()` method ([Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class)).
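
As a small, self-contained illustration of searching by class, the sketch below runs `find_all()` and `find()` against a made-up HTML snippet (the markup and class names are assumptions for the example, not taken from any real site).

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<ul>
  <li class="item featured">Alpha</li>
  <li class="item">Beta</li>
  <li class="other">Gamma</li>
</ul>
"""
soup = BeautifulSoup(html, features="html.parser")

# `class_` has a trailing underscore because `class` is a reserved word in Python.
# A tag matches if the given class is any one of its classes.
items = soup.find_all(class_="item")
item_texts = [tag.get_text() for tag in items]

# find() returns only the first match, or None when nothing matches.
first_featured = soup.find(class_="featured")
missing = soup.find(class_="does-not-exist")

print(item_texts)
```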

In [77]:
# General pattern: soup.find_all(class_="sectionHeader")

url = "https://www.masterofmalt.com/gin/"

# Cookie copied from a browser session; it appears to pin the store to GB (CountryCodeShort=GB)
cookies = dict(MaOMa='VisitorID=556630649&IsVATableCountry=1&CountryID=464&CurrencyID=-1&CountryCodeShort=GB&DeliveryCountrySavedToDB=1')

# Without the cookie: html = requests.get(url, headers={"Accept-Language": "en-GB"}).text
html = requests.get(url, headers={"Accept-Language": "en-GB"}, cookies=cookies).text

soup = BeautifulSoup(html, features="html.parser")

# Sanity check: the page title confirms we fetched the page we expected
print(soup.title.get_text())


	Gin - Master of Malt



In [None]:
pagination = soup.find_all(class_='list-paging')
# find_all returns a ResultSet (a list of tags), which has no find_all of its own,
# so index into it first to get a single tag that can be searched again:
# https://stackoverflow.com/questions/24108507/beautiful-soup-resultset-object-has-no-attribute-find-all
pagination_links = pagination[0].find_all('a')
# The first pagination link points back to the page we are already on
pagination_links[0].get('href') == url

True

In [None]:
[link.get('href') for link in pagination_links if link.get('href') != url]

['https://www.masterofmalt.com/gin/2',
 'https://www.masterofmalt.com/gin/3',
 'https://www.masterofmalt.com/gin/4',
 'https://www.masterofmalt.com/gin/5',
 'https://www.masterofmalt.com/gin/6']
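
The link collection above could be wrapped in a small helper so it can be reused on every listing page. This is only a sketch: `pagination_urls` is a hypothetical name, the sample markup uses placeholder `example.com` URLs, and the `list-paging` class is assumed to be present as it was on the page scraped here.

```python
from bs4 import BeautifulSoup


def pagination_urls(soup, current_url, paging_class="list-paging"):
    """Return the other page URLs from the pagination block, in order, without duplicates."""
    paging = soup.find(class_=paging_class)
    if paging is None:
        # No pagination block on this page
        return []
    urls = []
    for link in paging.find_all("a"):
        href = link.get("href")
        if href and href != current_url and href not in urls:
            urls.append(href)
    return urls


# Hypothetical markup standing in for a real listing page
sample = """
<div class="list-paging">
  <a href="https://example.com/gin/">1</a>
  <a href="https://example.com/gin/2">2</a>
  <a href="https://example.com/gin/3">3</a>
  <a href="https://example.com/gin/2">Next</a>
</div>
"""
soup_sample = BeautifulSoup(sample, features="html.parser")
next_pages = pagination_urls(soup_sample, "https://example.com/gin/")
print(next_pages)
```

Using `find()` rather than `find_all()[0]` avoids an `IndexError` on pages that have no pagination block.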

In [78]:
# Main page looping
main_product_boxes = soup.find_all(class_="boxBgr product-box-wide h-gutter js-product-box-wide")
[product.find(class_="product-box-wide-price gold") for product in main_product_boxes]

[<div class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl00_pricesWrapper">£99.95</div>,
 <div class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl01_pricesWrapper">£124.95</div>,
 <div class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl02_pricesWrapper">£124.95</div>,
 <span class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl03_price">£19.95</span>,
 <span class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl04_price">£19.50</span>,
 <span class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl05_price">£33.75</span>,
 <div class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl06_pricesWrapper">£124.95</div>,
 <span class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl07_price">£29.99</span>,
 <span class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl08_price">£27.90</span>,
 <span class="product-box-wide-price gold" id="ContentPlaceHolder1_ctl09_price">£33.75</span>,
 <span class="product-b
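
One thing worth knowing about the multi-class strings passed to `class_` above: a string containing several classes is compared against the exact value of the `class` attribute, so it stops matching if the site reorders or adds classes. A CSS selector through `select()` is one possible alternative that ignores order. The markup below is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same classes in different orders
html = """
<div class="box wide promo">A</div>
<div class="wide box">B</div>
<div class="box">C</div>
"""
soup = BeautifulSoup(html, features="html.parser")

# Exact attribute string: matches only the tag whose class attribute is written this way
exact = [t.get_text() for t in soup.find_all(class_="box wide promo")]

# Same classes in a different order: no match
reordered = [t.get_text() for t in soup.find_all(class_="wide box promo")]

# CSS selector: matches any tag carrying both classes, whatever the order
selector = [t.get_text() for t in soup.select(".box.wide")]

print(exact, reordered, selector)
```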

In [None]:
product_details = []

for product in main_product_boxes:
    name = product.find('h3').get_text()
    volume_strength = product.find(class_="product-box-wide-volume gold").get_text()
    optional_rating = product.select('span[id$=ratingStars]')
    print(len(optional_rating))
    if len(optional_rating) > 0:
        rating = optional_rating[0].get('title')
    else:
        rating = 'Unknown'
    optional_review_count = product.select('span[id$=reviewCount]')
    review_count = optional_review_count[0].get_text() if len(optional_review_count) > 0 else 'Unknown'
    price = product.find(class_="product-box-wide-price gold").get_text()
    product_details.append([name, volume_strength, rating, review_count, price])

1
0
0
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
1
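
Since some products have no rating or review count, the "element or fallback" pattern in the loop above could be pulled into one helper. The sketch below is an assumption about how that might look, not the code used in this notebook: `first_or_default` is a hypothetical name and the markup is invented, with ids that merely end in the same suffixes as on the real page.

```python
from bs4 import BeautifulSoup


def first_or_default(tag, css_selector, attribute=None, default="Unknown"):
    """Return the text (or an attribute) of the first match, or `default` if there is none."""
    match = tag.select_one(css_selector)
    if match is None:
        return default
    if attribute is not None:
        return match.get(attribute, default)
    return match.get_text(strip=True)


# Hypothetical markup: one product with rating details, one without
html = """
<div class="product">
  <span id="ctl00_ratingStars" title="Rating (4.5/5)"></span>
  <span id="ctl00_reviewCount">3 Reviews</span>
</div>
<div class="product"></div>
"""
soup = BeautifulSoup(html, features="html.parser")
rated, unrated = soup.find_all(class_="product")

print(first_or_default(rated, 'span[id$="ratingStars"]', attribute="title"))
print(first_or_default(unrated, 'span[id$="ratingStars"]', attribute="title"))
```

`select_one()` returns `None` when nothing matches, which makes the missing case explicit without checking the length of a list.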


In [None]:
gin_df = pd.DataFrame(product_details)

In [None]:
gin_df.columns = ['Gin', 'Vol_Strength', 'Rating', 'Review_count', 'Price']
gin_df.head()

| | Gin | Vol_Strength | Rating | Review_count | Price |
|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | 72cl, 45.4% | Rating (4.5/5) | 3 Reviews | $106.52 |
| 1 | Ginvent Calendar (2018 Edition) | 72cl, 45% | Unknown | Unknown | $133.16 |
| 2 | Gin Advent Calendar - Themed (2018 Edition) | 72cl, 43.3% | Unknown | Unknown | $133.16 |
| 3 | Pickering's Gin Christmas Baubles | 30cl, 42% | Rating (5/5) | 16 Reviews | $21.26 |
| 4 | Peaky Blinder Spiced Dry Gin | 70cl, 40% | Rating (4.5/5) | 15 Reviews | $20.78 |


In [None]:
split = gin_df['Vol_Strength'].str.split(',', expand = True)

In [None]:
split

| | 0 | 1 |
|---|---|---|
| 0 | 72cl | 45.4% |
| 1 | 72cl | 45% |
| 2 | 72cl | 43.3% |
| 3 | 30cl | 42% |
| 4 | 70cl | 40% |
| 5 | 50cl | 47% |
| 6 | 72cl | 43.3% |
| 7 | 50cl | 40% |
| 8 | 70cl | 40% |
| 9 | 70cl | 44% |


In [None]:
gin_df['Volume'] = split[0]
# Splitting on ',' leaves a leading space on the second part, so strip it
gin_df['Strength'] = split[1].str.strip()
gin_df.drop(columns = ['Vol_Strength'], inplace=True)

In [None]:
gin_df

| | Gin | Rating | Review_count | Price | Volume | Strength |
|---|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | Rating (4.5/5) | 3 Reviews | $106.52 | 72cl | 45.4% |
| 1 | Ginvent Calendar (2018 Edition) | Unknown | Unknown | $133.16 | 72cl | 45% |
| 2 | Gin Advent Calendar - Themed (2018 Edition) | Unknown | Unknown | $133.16 | 72cl | 43.3% |
| 3 | Pickering's Gin Christmas Baubles | Rating (5/5) | 16 Reviews | $21.26 | 30cl | 42% |
| 4 | Peaky Blinder Spiced Dry Gin | Rating (4.5/5) | 15 Reviews | $20.78 | 70cl | 40% |
| 5 | Monkey 47 Dry Gin | Rating (4.5/5) | 116 Reviews | $35.96 | 50cl | 47% |
| 6 | Gin Advent Calendar (2018 Edition) | Unknown | Unknown | $133.16 | 72cl | 43.3% |
| 7 | Sharish Blue Magic Gin | Rating (4.5/5) | 27 Reviews | $31.96 | 50cl | 40% |
| 8 | Brockmans Intensely Smooth Gin | Rating (3.5/5) | 64 Reviews | $29.74 | 70cl | 40% |
| 9 | Salcombe Gin - Start Point | Rating (5/5) | 9 Reviews | $35.96 | 70cl | 44% |
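
The `Volume` and `Strength` columns are still strings, which limits what can be done with them. One possible next step, sketched here on a small stand-in frame rather than on `gin_df` itself, is to strip the units and convert to numbers. The column names `Volume_cl` and `Strength_pct` are made up for the example.

```python
import pandas as pd

# Stand-in data in the same 'volume, strength' format as the scraped column
sample = pd.DataFrame({"Vol_Strength": ["72cl, 45.4%", "30cl, 42%", "70cl, 40%"]})

parts = sample["Vol_Strength"].str.split(",", expand=True)

# Strip whitespace, drop the unit, then convert to float
sample["Volume_cl"] = (
    parts[0].str.strip().str.replace("cl", "", regex=False).astype(float)
)
sample["Strength_pct"] = (
    parts[1].str.strip().str.replace("%", "", regex=False).astype(float)
)

print(sample)
```

Putting the unit in the column name keeps the information that was removed from the values.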


In [None]:
import re

In [None]:
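def extract_rating(string):
    # Return the text inside the first pair of parentheses,
    # e.g. 'Rating (4.5/5)' -> '4.5/5'; NaN when there are no parentheses
    match = re.search(r'\((.*?)\)', string)
    if match:
        return match.group(1)
    else:
        return np.nan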

bool(re.search(r'\((.*?)\)',gin_df['Rating'][0]))

True

In [None]:
gin_df['Rating'] = gin_df['Rating'].apply(extract_rating)

In [None]:
gin_df.head()

| | Gin | Rating | Review_count | Price | Volume | Strength |
|---|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | 4.5/5 | 3 Reviews | $106.52 | 72cl | 45.4% |
| 1 | Ginvent Calendar (2018 Edition) | | Unknown | $133.16 | 72cl | 45% |
| 2 | Gin Advent Calendar - Themed (2018 Edition) | | Unknown | $133.16 | 72cl | 43.3% |
| 3 | Pickering's Gin Christmas Baubles | 5/5 | 16 Reviews | $21.26 | 30cl | 42% |
| 4 | Peaky Blinder Spiced Dry Gin | 4.5/5 | 15 Reviews | $20.78 | 70cl | 40% |
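
A rating such as `4.5/5` is still text. If a numeric score is wanted, a small parser along these lines could follow. It is a sketch that assumes every rating looks like `x/y`; `rating_to_float` is a hypothetical helper, not part of the notebook above.

```python
import math
import re


def rating_to_float(text):
    """'4.5/5' -> 4.5; anything that does not look like 'x/y' -> NaN."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*/\s*\d+\s*", str(text))
    return float(match.group(1)) if match else float("nan")


print(rating_to_float("4.5/5"))
print(math.isnan(rating_to_float("Unknown")))
```

With pandas this could be applied as `gin_df['Rating'].apply(rating_to_float)`. Missing values pass through as NaN because `str(np.nan)` does not match the pattern.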


In [None]:
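# Swap the currency symbol only (the amounts are not converted).
# regex=False makes sure '$' is treated literally, not as the end-of-string anchor.
gin_df['Price'] = gin_df['Price'].str.replace('$', '£', regex=False)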

In [None]:
def extract_review_count(string):
    # '16 Reviews' -> '16'; values without a review count (e.g. 'Unknown') -> NaN
    if 'Reviews' in string:
        return string.replace('Reviews', '').strip()
    else:
        return np.nan

In [None]:
gin_df['Review_count'] = gin_df['Review_count'].apply(extract_review_count)

Looking at the first few rows of the table, we now have a much cleaner data structure that will be far easier to work with in further analysis.

In [None]:
gin_df.head()

| | Gin | Rating | Review_count | Price | Volume | Strength |
|---|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | 4.5/5 | 3.0 | £106.52 | 72cl | 45.4% |
| 1 | Ginvent Calendar (2018 Edition) | | | £133.16 | 72cl | 45% |
| 2 | Gin Advent Calendar - Themed (2018 Edition) | | | £133.16 | 72cl | 43.3% |
| 3 | Pickering's Gin Christmas Baubles | 5/5 | 16.0 | £21.26 | 30cl | 42% |
| 4 | Peaky Blinder Spiced Dry Gin | 4.5/5 | 15.0 | £20.78 | 70cl | 40% |
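
For analysis or plotting, the price and review count would usually need to be proper numeric columns. The sketch below shows one way to do that with `pd.to_numeric` on a made-up frame. The values, including the price with a thousands separator, and the `Price_gbp` column name are assumptions for the example.

```python
import pandas as pd

# Stand-in data in the same shape as the scraped columns
sample = pd.DataFrame({
    "Price": ["£106.52", "£21.26", "£1,250.00"],
    "Review_count": ["3 Reviews", "Unknown", "16 Reviews"],
})

# Drop the currency symbol and any thousands separators, then convert
sample["Price_gbp"] = pd.to_numeric(
    sample["Price"].str.replace("[£,]", "", regex=True)
)

# Keep only the digits; rows without digits become missing values.
# The nullable 'Int64' dtype keeps whole numbers alongside missing entries.
sample["Review_count"] = pd.to_numeric(
    sample["Review_count"].str.extract(r"(\d+)", expand=False)
).astype("Int64")

print(sample.dtypes)
```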
