# Web Scraping with Beautiful Soup
I'm building this notebook to develop my web scraping skills in Python using the popular `BeautifulSoup` library ([Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)). It is based on projects that I have wanted to do, so the code uses real-world examples of scraping websites. That gives me a good opportunity to clean 'dirty' data and create data sets that are easy to manipulate for further analysis, machine learning or visualisations.

---

Basic flow:
1. Fetch the raw HTML for a given URL using `requests`
2. Extract the information you want from that HTML via elements, classes and ids using `BeautifulSoup`
    1. Identifying the classes and ids of the elements that you want to extract information from is a vital part of the process
3. Manipulate that information into an easily usable format using `pandas`
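
The flow above can be sketched end to end on a small inline HTML string, so it runs without a network request. This is only an illustration: the markup, the `product` and `price` class names and the `sample_` variables are made up for the example and are not taken from any real site.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Step 1 (stand-in): hypothetical HTML in place of requests.get(url).text
sample_html = """
<html><body>
  <div class="product" id="p1"><h3>Gin A</h3><span class="price">£20.00</span></div>
  <div class="product" id="p2"><h3>Gin B</h3><span class="price">£35.50</span></div>
</body></html>
"""

# Step 2: narrow the HTML down to the elements we care about
sample_soup = BeautifulSoup(sample_html, features="html.parser")
rows = [
    {
        "id": box.get("id"),
        "name": box.find("h3").get_text(),
        "price": box.find(class_="price").get_text(),
    }
    for box in sample_soup.find_all(class_="product")
]

# Step 3: load the extracted values into a DataFrame
sample_df = pd.DataFrame(rows)
print(sample_df)
```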


In [1]:
# Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

## Searching the HTML using a class
Every website has some conventions for formatting its pages and the information that we ultimately want to acquire. These conventions usually show up as CSS classes, especially if the site is dynamic and generates many pages with different information from the same template file. Being able to isolate those elements using a distinct class is therefore incredibly useful for us.

This can be achieved through a special use case of the `find_all()` function ([Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class)).  
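
As a small illustration of how `class_` matching behaves, the sketch below runs `find_all()` against a made-up snippet (the `box` and `wide` class names are hypothetical). A single class name matches any tag carrying that class, whereas a string containing several classes is compared against the exact attribute value, so the order of the classes matters. A CSS selector via `select()` is one way around that.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same two classes in different orders
snippet = """
<div class="box wide">first</div>
<div class="wide box">second</div>
<div class="box">third</div>
"""
demo_soup = BeautifulSoup(snippet, features="html.parser")

# A single class name matches any tag that carries that class
single = [tag.get_text() for tag in demo_soup.find_all(class_="box")]

# A multi-class string is compared with the exact attribute value, so order matters
exact = [tag.get_text() for tag in demo_soup.find_all(class_="box wide")]

# A CSS selector requires both classes but accepts them in either order
either_order = [tag.get_text() for tag in demo_soup.select("div.box.wide")]

print(single, exact, either_order)
```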

In my example here, I am going to scrape all the information about the gins that are available from the site 'Master of Malt'.

In [2]:
# soup.find_all(class_="sectionHeader")

url = "https://www.masterofmalt.com/gin/"

# By setting a cookie for this example, I am able to obtain the prices in GBP
# This was something I had to do, as the default when running this was giving me USD
# I investigated the console through my browser and realised that setting this fixed the issue I was having
cookies = dict(MaOMa='VisitorID=556630649&IsVATableCountry=1&CountryID=464&CurrencyID=-1&CountryCodeShort=GB&DeliveryCountrySavedToDB=1')

html = requests.get(url, headers = {"Accept-Language": "en-GB"}, cookies = cookies).text 

soup = BeautifulSoup(html, features="html.parser")

# Check that the object we have created is a `bs4.BeautifulSoup`
print(type(soup))

# Print the title of the page
print(soup.title.get_text())

<class 'bs4.BeautifulSoup'>

	Gin - Master of Malt



Since the gins for the site aren't all on a single page, we need a way to loop through the pages. The first step is to find the pagination section on the page and get all of its links.

Once we have the links, we need a system for tracking which ones we have visited and which we are yet to scrape.
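
One possible shape for that tracking system is a queue of pages still to visit plus a set of pages already seen. The sketch below is an assumption about how it could work, not something this notebook implements: `crawl`, `get_links` and `fake_site` are hypothetical names, and a dictionary stands in for the real pages so that nothing is fetched.

```python
from collections import deque


def crawl(start_url, get_links):
    """Visit every page reachable from start_url exactly once, in discovery order."""
    to_visit = deque([start_url])
    seen = {start_url}
    visited = []
    while to_visit:
        current = to_visit.popleft()
        visited.append(current)  # the real scraper would extract data here
        for link in get_links(current):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# Hypothetical link structure standing in for the pagination links on each page
fake_site = {
    "/gin/": ["/gin/", "/gin/2", "/gin/3"],
    "/gin/2": ["/gin/", "/gin/2", "/gin/3"],
    "/gin/3": ["/gin/", "/gin/2", "/gin/3"],
}
order = crawl("/gin/", lambda page: fake_site.get(page, []))
print(order)
```

In a real run, `get_links` would request the page and return the `href` values found in its pagination section.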

In [3]:
pagination = soup.find_all(class_='list-paging')
# We can print to see what we have, looks as though we have two versions of the same thing
# print(pagination)

# Note on why you access the first element, to go from result set to something you can call find_all on again
# https://stackoverflow.com/questions/24108507/beautiful-soup-resultset-object-has-no-attribute-find-all

# Extract all the anchor tags from the pagination html
pagination_links = pagination[0].find_all('a')
# pagination[0].find_all('a')[0].get('href') == url

Below we loop through the anchor tags and extract the href from every one that doesn't point at our original url. These links will form the basis of the scraper's navigation through the site.

In [4]:
[link.get('href') for link in pagination_links if link.get('href') != url]

['https://www.masterofmalt.com/gin/2',
 'https://www.masterofmalt.com/gin/3',
 'https://www.masterofmalt.com/gin/4',
 'https://www.masterofmalt.com/gin/5',
 'https://www.masterofmalt.com/gin/6']

Inspect the pages that you are going to loop through and note the elements that appear on every one of them. These constant elements act as the starting point for your data collection.



In [5]:
# Original url looping
# The class "boxBgr product-box-wide h-gutter js-product-box-wide" 
#   appears on all the main drink elements for the index page 
main_product_boxes = soup.find_all(class_="boxBgr product-box-wide h-gutter js-product-box-wide")

# Loop through each element and get the current price span/div and then the contained text (price)
[product.find(class_="product-box-wide-price gold").get_text() for product in main_product_boxes]

['£69.95',
 '£17.95',
 '£33.75',
 '£84.95',
 '£84.95',
 '£21.95',
 '£84.95',
 '£21',
 '£14.95',
 '£34.70',
 '£24.95',
 '£27.83',
 '£28.95',
 '£29.95',
 '£27.19',
 '£34.95',
 '£30.20',
 '£23',
 '£52.95',
 '£23.95',
 '£21',
 '£24.99',
 '£22.95',
 '£24.90',
 '£21']

In [6]:
# Get the product ids from the data property on the element
# Could potentially be a way of avoiding duplication of results
[main_product_box.get('data-productid') for main_product_box in main_product_boxes]

['74326',
 '70039',
 '13362',
 '74332',
 '74322',
 '70035',
 '74320',
 '62662',
 '83919',
 '55071',
 '67485',
 '5227',
 '10031',
 '55709',
 '47796',
 '65369',
 '41772',
 '3020',
 '43965',
 '70034',
 '71772',
 '73022',
 '30166',
 '68670',
 '71771']
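
To illustrate how a product id could help avoid duplicated results, the sketch below keeps the first record seen for each id. The `dedupe_by_id` helper, the `product_id` key and the sample records are all hypothetical and only show the idea.

```python
def dedupe_by_id(records, key="product_id"):
    """Keep the first record seen for each id, preserving the original order."""
    unique = {}
    for record in records:
        unique.setdefault(record[key], record)
    return list(unique.values())


# Made-up records, with the first product appearing on two different pages
scraped = [
    {"product_id": "id-1", "name": "Gin A"},
    {"product_id": "id-2", "name": "Gin B"},
    {"product_id": "id-1", "name": "Gin A"},
]
deduped = dedupe_by_id(scraped)
print(deduped)
```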

In [7]:
product_details = []

for product in main_product_boxes:
    name = product.find('h3').get_text()
    volume_strength = product.find(class_="product-box-wide-volume gold").get_text()
    # The rating and review count are optional, so fall back to 'Unknown' when absent
    optional_rating = product.select('span[id$=ratingStars]')
    rating = optional_rating[0].get('title') if optional_rating else 'Unknown'
    optional_review_count = product.select('span[id$=reviewCount]')
    review_count = optional_review_count[0].get_text() if optional_review_count else 'Unknown'
    price = product.find(class_="product-box-wide-price gold").get_text()
    product_details.append([name, volume_strength, rating, review_count, price])

In [8]:
gin_df = pd.DataFrame(product_details)

In [9]:
gin_df.columns = ['Gin', 'Vol_Strength', 'Rating', 'Review_count', 'Price']
gin_df.head()

|   | Gin | Vol_Strength | Rating | Review_count | Price |
|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | 72cl, 45.4% | Rating (5/5) | 4 Reviews | £69.95 |
| 1 | Peaky Blinder Spiced Dry Gin | 70cl, 40% | Rating (4.5/5) | 15 Reviews | £17.95 |
| 2 | Monkey 47 Dry Gin | 50cl, 47% | Rating (4.5/5) | 117 Reviews | £33.75 |
| 3 | Ginvent Calendar (2018 Edition) | 72cl, 45% | Unknown | Unknown | £84.95 |
| 4 | Gin Advent Calendar - Themed (2018 Edition) | 72cl, 43.3% | Rating (5/5) | 1 Review | £84.95 |


In [10]:
split = gin_df['Vol_Strength'].str.split(',', expand = True)

In [11]:
split

|   | 0 | 1 |
|---|---|---|
| 0 | 72cl | 45.4% |
| 1 | 70cl | 40% |
| 2 | 50cl | 47% |
| 3 | 72cl | 45% |
| 4 | 72cl | 43.3% |
| 5 | 70cl | 41.3% |
| 6 | 72cl | 43.3% |
| 7 | 70cl | 43% |
| 8 | 15cl | 40.9% |
| 9 | 70cl | 44% |


In [12]:
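# Splitting on ',' leaves a leading space on the strength, so strip the whitespace
gin_df['Volume'] = split[0].str.strip()
gin_df['Strength'] = split[1].str.strip()
gin_df.drop(columns = ['Vol_Strength'], inplace=True)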

In [13]:
gin_df

|   | Gin | Rating | Review_count | Price | Volume | Strength |
|---|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | Rating (5/5) | 4 Reviews | £69.95 | 72cl | 45.4% |
| 1 | Peaky Blinder Spiced Dry Gin | Rating (4.5/5) | 15 Reviews | £17.95 | 70cl | 40% |
| 2 | Monkey 47 Dry Gin | Rating (4.5/5) | 117 Reviews | £33.75 | 50cl | 47% |
| 3 | Ginvent Calendar (2018 Edition) | Unknown | Unknown | £84.95 | 72cl | 45% |
| 4 | Gin Advent Calendar - Themed (2018 Edition) | Rating (5/5) | 1 Review | £84.95 | 72cl | 43.3% |
| 5 | Aber Falls Orange Marmalade Gin | Rating (4.5/5) | 15 Reviews | £21.95 | 70cl | 41.3% |
| 6 | Gin Advent Calendar (2018 Edition) | Unknown | Unknown | £84.95 | 72cl | 43.3% |
| 7 | Whitley Neill Rhubarb & Ginger Gin | Rating (3.5/5) | 95 Reviews | £21 | 70cl | 43% |
| 8 | Exclusive Gin Tasting Set #1 | Unknown | Unknown | £14.95 | 15cl | 40.9% |
| 9 | Salcombe Gin - Start Point | Rating (5/5) | 9 Reviews | £34.70 | 70cl | 44% |
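
The volume and strength are still strings at this point. As a possible next step that this notebook does not take, the sketch below parses a combined 'volume, strength' string into two numeric columns in one pass using `Series.str.extract` with named groups. The column names `Volume_cl` and `Strength_pct` are made up for the example, and the input values are copied from the table above.

```python
import pandas as pd

# A few 'volume, strength' strings in the same format as the scraped column
vol_strength = pd.Series(["72cl, 45.4%", "70cl, 40%", "50cl, 47%"])

# Each named group becomes a column; astype(float) converts the captured text to numbers
parsed = vol_strength.str.extract(
    r"(?P<Volume_cl>[\d.]+)cl,\s*(?P<Strength_pct>[\d.]+)%"
).astype(float)
print(parsed)
```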


In [14]:
import re

In [15]:
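def extract_rating(string):
    # Pull out the text between the parentheses, e.g. 'Rating (4.5/5)' -> '4.5/5'
    match = re.search(r'\((.*?)\)', string)
    return match.group(1) if match else np.nan

# Quick check that the pattern matches the first rating
bool(re.search(r'\((.*?)\)', gin_df['Rating'][0]))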

True

In [16]:
gin_df['Rating'] = gin_df['Rating'].apply(extract_rating)

In [17]:
gin_df.head()

|   | Gin | Rating | Review_count | Price | Volume | Strength |
|---|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | 5/5 | 4 Reviews | £69.95 | 72cl | 45.4% |
| 1 | Peaky Blinder Spiced Dry Gin | 4.5/5 | 15 Reviews | £17.95 | 70cl | 40% |
| 2 | Monkey 47 Dry Gin | 4.5/5 | 117 Reviews | £33.75 | 50cl | 47% |
| 3 | Ginvent Calendar (2018 Edition) | NaN | Unknown | £84.95 | 72cl | 45% |
| 4 | Gin Advent Calendar - Themed (2018 Edition) | 5/5 | 1 Review | £84.95 | 72cl | 43.3% |


In [18]:
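# Before I fixed the cookies request issue
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
# gin_df['Price'] = gin_df['Price'].str.replace('$', '£', regex=False)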

In [19]:
def extract_review_count(string):
    # Handles '15 Reviews' and '1 Review'; anything else (e.g. 'Unknown') becomes NaN
    if 'Reviews' in string:
        return string.replace('Reviews', '').strip()
    elif '1 Review' in string:
        return '1'
    else:
        return np.nan

In [20]:
gin_df['Review_count'] = gin_df['Review_count'].apply(extract_review_count)

Looking at the first few rows of the table, we now have a much cleaner data structure that will be easier to work with for further analysis.

In [21]:
gin_df.head()

|   | Gin | Rating | Review_count | Price | Volume | Strength |
|---|---|---|---|---|---|---|
| 0 | That Boutique-y Gin Company Advent Calendar (2... | 5/5 | 4.0 | £69.95 | 72cl | 45.4% |
| 1 | Peaky Blinder Spiced Dry Gin | 4.5/5 | 15.0 | £17.95 | 70cl | 40% |
| 2 | Monkey 47 Dry Gin | 4.5/5 | 117.0 | £33.75 | 50cl | 47% |
| 3 | Ginvent Calendar (2018 Edition) | NaN | NaN | £84.95 | 72cl | 45% |
| 4 | Gin Advent Calendar - Themed (2018 Edition) | 5/5 | 1.0 | £84.95 | 72cl | 43.3% |
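
The prices and ratings are still stored as text. As a possible follow-up that is not part of the cleaning done above, the sketch below converts them to numbers on a small stand-in frame. The `sample` frame and the `Price_gbp` and `Rating_out_of_5` column names are hypothetical, and the values are copied from rows shown in the table.

```python
import numpy as np
import pandas as pd

# Stand-in for a few cleaned rows; the missing rating is kept as NaN
sample = pd.DataFrame({
    "Rating": ["5/5", "4.5/5", np.nan],
    "Price": ["£69.95", "£17.95", "£84.95"],
})

# Strip the currency symbol literally (regex=False) and convert to float
sample["Price_gbp"] = sample["Price"].str.replace("£", "", regex=False).astype(float)

# Keep the score before the '/'; NaN ratings stay NaN through the string methods
sample["Rating_out_of_5"] = sample["Rating"].str.split("/").str[0].astype(float)
print(sample)
```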
