# Master of Malt 
Master of Malt, in their own words, is an online retailer of single malt whisky, blended whisky, bourbon, rum, brandy, vodka, gin and many other fine spirits!

After placing an order, I wanted to collect some information about the different gins they have on offer, so I thought I would write a web scraper to automate the process.

In [2]:
# Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import random
import time

## Managing visited pages
The paginated index pages all hang off the overall 'gin' index page defined below, and we need to track which of them we have already visited.

To find all the links in the pagination section of the HTML, we can use the following code.

I was having issues with the default country code and currency when running the script, so I had to pass some extra details via `cookies` to force the website to render in the correct format.

In [3]:
url = "https://www.masterofmalt.com/gin/"
cookies = dict(MaOMa='VisitorID=556630649&IsVATableCountry=1&CountryID=464&CurrencyID=-1&CountryCodeShort=GB&DeliveryCountrySavedToDB=1')
html = requests.get(url, headers = {"Accept-Language": "en-GB"}, cookies = cookies).text 
soup = BeautifulSoup(html, features="html.parser")
pagination = soup.find_all(class_='list-paging')
pagination_links = pagination[0].find_all('a')
[link.get('href') for link in pagination_links]

['https://www.masterofmalt.com/gin/',
 'https://www.masterofmalt.com/gin/2',
 'https://www.masterofmalt.com/gin/3',
 'https://www.masterofmalt.com/gin/4',
 'https://www.masterofmalt.com/gin/5',
 'https://www.masterofmalt.com/gin/6']
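Since the same header and cookie are needed on every request, one option is to set them once on a `requests.Session`. The sketch below is only an illustration, not what the rest of this notebook does: `make_session` and `MAOMA_COOKIE` are hypothetical names, and the cookie value is copied from the cell above. Nothing is fetched until `session.get` is called, so the request itself is left commented out.

```python
import requests

# Cookie value copied from the cell above; it pins the country and currency.
MAOMA_COOKIE = (
    "VisitorID=556630649&IsVATableCountry=1&CountryID=464&CurrencyID=-1"
    "&CountryCodeShort=GB&DeliveryCountrySavedToDB=1"
)


def make_session():
    """Hypothetical helper: a session carrying the header and cookie used above."""
    session = requests.Session()
    session.headers.update({"Accept-Language": "en-GB"})
    session.cookies.set("MaOMa", MAOMA_COOKIE)
    return session


session = make_session()
# html = session.get("https://www.masterofmalt.com/gin/").text  # network call, not run here
```

A session also reuses the underlying connection between requests, which may help when visiting many index pages in a row.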

We will be fetching pagination links many times, so it makes sense to create a function that does what we need, and to call it in a loop that stops at an appropriate time.

In [41]:
def get_pagination_links(url, html_class = 'list-paging'):
    """
    For a given url find the href attributes of anchor tags inside the first
    element that matches a certain HTML class.
    Returns a list of URLs in page order; the caller is responsible for de-duplicating.
    """
    # Cookie forces the site to render with the expected country and currency
    cookies = dict(MaOMa='VisitorID=556630649&IsVATableCountry=1&CountryID=464&CurrencyID=-1&CountryCodeShort=GB&DeliveryCountrySavedToDB=1')
    html = requests.get(url, headers = {"Accept-Language": "en-GB"}, cookies = cookies).text 
    soup = BeautifulSoup(html, features="html.parser")
    pagination = soup.find_all(class_= html_class)
    pagination_links = pagination[0].find_all('a')
    
    # Return the list of URLs found in the pagination block
    return [link.get('href') for link in pagination_links]

In [42]:
get_pagination_links("https://www.masterofmalt.com/gin/")

['https://www.masterofmalt.com/gin/',
 'https://www.masterofmalt.com/gin/2',
 'https://www.masterofmalt.com/gin/3',
 'https://www.masterofmalt.com/gin/4',
 'https://www.masterofmalt.com/gin/5',
 'https://www.masterofmalt.com/gin/6']
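One thing to be aware of: `pagination[0]` raises an `IndexError` if no element with the given class is found, for example when the site returns an error page. A minimal sketch of a more defensive parsing step follows. `extract_pagination_links` is a hypothetical helper that works on an HTML string rather than a URL, and the sample markup is made up for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup


def extract_pagination_links(html, html_class="list-paging"):
    """Hypothetical helper: hrefs from the first element with the given class, or [] if none."""
    soup = BeautifulSoup(html, features="html.parser")
    container = soup.find(class_=html_class)
    if container is None:
        return []
    # href=True skips anchors that have no href attribute
    return [a["href"] for a in container.find_all("a", href=True)]


# Made-up markup for illustration only.
sample_html = """
<div class="list-paging">
  <a href="https://www.masterofmalt.com/gin/">1</a>
  <a href="https://www.masterofmalt.com/gin/2">2</a>
  <a>next</a>
</div>
"""

links = extract_pagination_links(sample_html)
no_links = extract_pagination_links("<p>no pagination here</p>")
print(links)
print(no_links)
```

Splitting the parsing from the download like this also makes it possible to try the logic on a saved page without hitting the website.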

----

In [78]:
starting_url = "https://www.masterofmalt.com/gin/"

# Create master list of pages that we want to visit
all_gin_pages = [starting_url]
visited_pages = []
# setdiff1d returns the sorted, unique values of the first argument that are not in the second
unvisited_pages = np.setdiff1d(all_gin_pages, visited_pages)

while len(unvisited_pages):
    new_page = unvisited_pages[0]
    pages = get_pagination_links(new_page)
    # Merge any newly discovered pages into the master list, dropping duplicates
    all_gin_pages = list(set(all_gin_pages).union(set(pages)))
    visited_pages.append(new_page)
    unvisited_pages = np.setdiff1d(all_gin_pages, visited_pages)
    vp_count = len(visited_pages)
    total_count = len(all_gin_pages)  # pages discovered so far, visited or not
    v_perc = round(vp_count / total_count * 100.0, 2)
    print('%s/%s pages visited: %s%%' % (vp_count, total_count, v_perc))

1/6 pages visited: 16.67%
2/6 pages visited: 33.33%
3/6 pages visited: 50.0%
4/6 pages visited: 66.67%
5/6 pages visited: 83.33%
6/11 pages visited: 54.55%
7/11 pages visited: 63.64%
8/16 pages visited: 50.0%
9/16 pages visited: 56.25%
10/16 pages visited: 62.5%
11/16 pages visited: 68.75%
12/16 pages visited: 75.0%
13/21 pages visited: 61.9%
14/21 pages visited: 66.67%
15/21 pages visited: 71.43%
16/21 pages visited: 76.19%
17/21 pages visited: 80.95%
18/26 pages visited: 69.23%
19/26 pages visited: 73.08%
20/26 pages visited: 76.92%
21/26 pages visited: 80.77%
22/26 pages visited: 84.62%
23/31 pages visited: 74.19%
24/31 pages visited: 77.42%
25/31 pages visited: 80.65%
26/31 pages visited: 83.87%
27/31 pages visited: 87.1%
28/36 pages visited: 77.78%
29/36 pages visited: 80.56%
30/36 pages visited: 83.33%
31/36 pages visited: 86.11%
32/36 pages visited: 88.89%
33/41 pages visited: 80.49%
34/41 pages visited: 82.93%
35/41 pages visited: 85.37%
36/41 pages visited: 87.8%
37/41 pages v

In [80]:
all_gin_pages

['https://www.masterofmalt.com/gin/68',
 'https://www.masterofmalt.com/gin/92',
 'https://www.masterofmalt.com/gin/88',
 'https://www.masterofmalt.com/gin/112',
 'https://www.masterofmalt.com/gin/9',
 'https://www.masterofmalt.com/gin/16',
 'https://www.masterofmalt.com/gin/111',
 'https://www.masterofmalt.com/gin/118',
 'https://www.masterofmalt.com/gin/30',
 'https://www.masterofmalt.com/gin/41',
 'https://www.masterofmalt.com/gin/23',
 'https://www.masterofmalt.com/gin/74',
 'https://www.masterofmalt.com/gin/99',
 'https://www.masterofmalt.com/gin/12',
 'https://www.masterofmalt.com/gin/13',
 'https://www.masterofmalt.com/gin/56',
 'https://www.masterofmalt.com/gin/36',
 'https://www.masterofmalt.com/gin/39',
 'https://www.masterofmalt.com/gin/116',
 'https://www.masterofmalt.com/gin/45',
 'https://www.masterofmalt.com/gin/113',
 'https://www.masterofmalt.com/gin/29',
 'https://www.masterofmalt.com/gin/84',
 'https://www.masterofmalt.com/gin/22',
 'https://www.masterofmalt.com/gin/3

In [81]:
len(all_gin_pages)

118

We can see that there are 118 pages of gins to scroll through on the website.  
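Because `all_gin_pages` is rebuilt from a set on every iteration, the URLs come out in arbitrary order, as the output above shows. If page order matters later on, one way to restore it is to sort on the trailing page number. This is a sketch that assumes every URL has the shape seen above; `page_number` is a hypothetical helper, and the bare index URL is treated as page 1.

```python
def page_number(url):
    """Hypothetical helper: page number from an index URL; the bare index counts as page 1."""
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    return int(last_segment) if last_segment.isdigit() else 1


# A few URLs in the same shape as the output above, deliberately out of order.
pages = [
    "https://www.masterofmalt.com/gin/68",
    "https://www.masterofmalt.com/gin/9",
    "https://www.masterofmalt.com/gin/",
    "https://www.masterofmalt.com/gin/112",
]

sorted_pages = sorted(pages, key=page_number)
print(sorted_pages)
```

Sorting the strings directly would put `gin/112` before `gin/9`, which is why the key converts the last segment to an integer.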

#### Overview of using sets
I thought that sets were the ideal data type here, because we want to loop through each page once and not 'go back' to pages that we have already seen. The loop above relies on two operations: a set difference, which gives the pages still to visit, and a set union, which merges newly found links without duplicates.

In [36]:
np.setdiff1d([1,2,3,4,5], [1,2])

array([3, 4, 5])

In [35]:
set(['url']).union(set(['url', 'url2']))

{'url', 'url2'}
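Putting the two operations together, the same crawl can be sketched with Python's built-in sets alone, without `np.setdiff1d`. This is an alternative formulation rather than the code used above: `crawl` is a hypothetical function, and `fake_site` is made-up data standing in for `get_pagination_links` so that the example runs without touching the website.

```python
def crawl(start, get_links):
    """Visit every page reachable from start exactly once.

    get_links(page) must return an iterable of URLs found on that page.
    """
    all_pages = {start}
    visited = set()
    while all_pages - visited:
        # Pick the smallest unvisited page so the visiting order is deterministic
        page = min(all_pages - visited)
        all_pages |= set(get_links(page))  # union: add newly discovered pages
        visited.add(page)
    return visited


# Stand-in for the real site: each page links to a few neighbouring pages.
fake_site = {
    "gin/": ["gin/", "gin/2", "gin/3"],
    "gin/2": ["gin/", "gin/2", "gin/3", "gin/4"],
    "gin/3": ["gin/2", "gin/3", "gin/4"],
    "gin/4": ["gin/3", "gin/4"],
}

visited = crawl("gin/", fake_site.__getitem__)
print(sorted(visited))
```

The loop ends when the difference `all_pages - visited` is empty, which is the same stopping condition as `len(unvisited_pages)` reaching zero in the NumPy version.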