# Goal of the project

The goal of this notebook is to scrape the locations of supermarkets in the UK, identified by their postcodes. The data will later be used to build a dashboard of the food supply chain in the UK.

# Part 1: Scraping the links of all the cities by letter

### Useful Links

GitHub repository:
    
https://github.com/tlemenestrel/Mining_Food_Supply_Chain_Data

Link to the scraped website:

https://openhours.co.uk

### Importing the necessary modules

In [1]:
import pandas as pd
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup, SoupStrainer

### Links to the website

In [2]:
website_letters_link = "https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on="
website_cities_link  = "https://openhours.co.uk"

### List of all the letters for the pages to scrape

In [3]:
list_letters = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","Y","Z"]

### Lists to fill afterwards with the names of the cities and their respective links

In [4]:
cities = []
cities_links = []

### Scraping the cities and their links

In [5]:
for letter in list_letters:

    # Requesting the page for the current letter
    page = requests.get(website_letters_link + letter)

    # Turning the page into a soup for scraping
    soup = BeautifulSoup(page.content, 'lxml')

    # Every city is listed as an <li> element wrapping an <a> link
    for item in soup.find_all('li'):

        a = item.find('a')

        # Skipping list items that do not contain a link
        if a is None or 'href' not in a.attrs:
            continue

        # Storing the absolute link and the contents of the link
        cities_links.append(website_cities_link + a.attrs['href'])
        cities.append(a.contents)

### Converting the lists of scraped data into a pandas dataframe

In [6]:
cities_df = pd.DataFrame({
    
    "cities_links": cities_links,
    "cities"      : cities
    
})

### Printing the first rows of the dataframe to check if the scraped data is accurate

In [7]:
print (cities_df.head())

                                        cities_links               cities
0                 https://openhours.co.uk/categories       [[Categories]]
1  https://openhours.co.uk/categories/supermarket...      [[Supermarket]]
2  https://openhours.co.uk/categories/supermarket...  [[Choose location]]
3  https://openhours.co.uk/categories/supermarket...                  [A]
4  https://openhours.co.uk/categories/supermarket...                  [B]


### Filtering the dataframe to only keep the links of the cities

In [8]:
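# Taking a copy so that later column assignments do not act on a view of the original dataframe
cities_df = cities_df[cities_df['cities_links'].str.contains('/spots')].copy()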

### Verifying that the dataframe only contains the links of the cities

In [9]:
print (cities_df.head())

                                         cities_links            cities
28  https://openhours.co.uk/spots?city=Abbey+Road&...      [Abbey Road]
29  https://openhours.co.uk/spots?city=Abbey+Wood&...      [Abbey Wood]
30  https://openhours.co.uk/spots?city=Abbots+Lang...  [Abbots Langley]
31  https://openhours.co.uk/spots?city=Aberdare&la...        [Aberdare]
32  https://openhours.co.uk/spots?city=Aberdeen&la...        [Aberdeen]


### Printing out the type of each column of the dataframe

In [10]:
def print_data_type_of_all_columns_of_a_dataframe(df):

    dataTypeSeries = df.dtypes
    print('Data type of each column of the dataframe :')
    print(dataTypeSeries)

In [11]:
print_data_type_of_all_columns_of_a_dataframe(cities_df)

Data type of each column of the dataframe :
cities_links    object
cities          object
dtype: object


### Removing the brackets in the cities column

In [12]:
cities_df['cities'] = cities_df['cities'].str.get(0)

### Verifying that the column has been properly changed

In [13]:
print (cities_df.head())

                                         cities_links          cities
28  https://openhours.co.uk/spots?city=Abbey+Road&...      Abbey Road
29  https://openhours.co.uk/spots?city=Abbey+Wood&...      Abbey Wood
30  https://openhours.co.uk/spots?city=Abbots+Lang...  Abbots Langley
31  https://openhours.co.uk/spots?city=Aberdare&la...        Aberdare
32  https://openhours.co.uk/spots?city=Aberdeen&la...        Aberdeen


### Exporting the dataframe to a csv file

In [14]:
cities_df.to_csv('cities.csv', index=False) 

# Part 2: Scraping the data from all the links of the different cities

## Scraping an individual page

To simplify the task ahead, we will start by scraping an individual page and later integrate it into a for-loop.

In [15]:
individual_supermarket_name = []
individual_supermarket_address = []
individual_supermarket_postcode = []

In [16]:
page = requests.get("https://openhours.co.uk/spots?city=Ossett&lat=53.6798&lng=-1.5801&page=100&q=&search_term_id=583")

# Parsing only the tags that carry a postcode, then turning them into a soup

postcodes_parsed = SoupStrainer(itemprop = 'postalCode')

soup = BeautifulSoup(page.content, "lxml", parse_only=postcodes_parsed)

# Finding the postcodes on the page

for postcode in soup.find_all(itemprop = 'postalCode'):

    # Appending the first child of each postcode tag (normally its text)

    individual_supermarket_postcode.append(postcode.next)

print (individual_supermarket_postcode)

['YO24 2RQ', 'DN8 5BT', 'DN8 5BA', 'DN11 9HT', 'DN11 9HT', 'YO23 2RA', 'YO24 3JQ', 'SK14 2TA', 'SK14 1HB', 'OL9 8AU']
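
As a dependency-free illustration of the same idea, the sketch below collects the text of every element marked with itemprop="postalCode" using only the standard library's html.parser. The class name PostcodeParser and the sample markup are hypothetical; the nested span is modelled on the tag that shows up in the scraping results further down, and the real page structure may differ. The sketch does not handle void tags such as br inside a postcode element.

```python
from html.parser import HTMLParser


class PostcodeParser(HTMLParser):
    """Collects the text of every element marked itemprop="postalCode"."""

    def __init__(self):
        super().__init__()
        self.postcodes = []   # finished postcodes, in document order
        self._depth = 0       # > 0 while inside a postalCode element
        self._buffer = []     # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            # Nested tag inside a postalCode element: only track the depth
            self._depth += 1
        elif dict(attrs).get("itemprop") == "postalCode":
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # The outermost postalCode element has just been closed
                self.postcodes.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Hypothetical markup, including one nested postalCode element
sample_html = """
<ul>
  <li><span itemprop="postalCode">YO24 2RQ</span></li>
  <li><span itemprop="postalCode"><span class="postal-code" itemprop="postalCode">NW5 4EG</span></span></li>
</ul>
"""

parser = PostcodeParser()
parser.feed(sample_html)
print(parser.postcodes)
```

Because the text is gathered from the outermost marked element, a nested pair of tags yields a single plain string instead of a tag object plus a string.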


In [17]:
individual_postcodes_df = pd.DataFrame({
    
    "postcodes": individual_supermarket_postcode
    
})

In [18]:
print (individual_postcodes_df)

  postcodes
0  YO24 2RQ
1   DN8 5BT
2   DN8 5BA
3  DN11 9HT
4  DN11 9HT
5  YO23 2RA
6  YO24 3JQ
7  SK14 2TA
8  SK14 1HB
9   OL9 8AU


## Building the new list of URLs from the data scraped previously

Now that we know how to scrape one page, we can start to scrape all the other pages and generalize our approach to the entire website. For this, we will need to generate a list with all the new URLs. 

### Arranging the previous URLs by removing extra characters

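String to remove from the end of each URL:

&q=&search_term_id=583

In [19]:
# str.rstrip() strips a set of characters rather than a suffix, so it can also
# remove trailing digits of the longitude; an anchored replacement removes only the suffix
cities_df['cities_links'] = cities_df['cities_links'].str.replace(r'&q=&search_term_id=583$', '', regex=True)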
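
One thing to keep in mind when trimming the end of a string: rstrip treats its argument as a set of characters, not as a suffix, and keeps removing characters for as long as they belong to that set. The sketch below contrasts it with an explicit suffix check on plain Python strings. The helper strip_suffix and the example link (including its longitude value) are hypothetical and only serve to show the effect.

```python
SUFFIX = "&q=&search_term_id=583"


def strip_suffix(url, suffix=SUFFIX):
    """Remove `suffix` only when `url` really ends with it."""
    return url[: -len(suffix)] if url.endswith(suffix) else url


# Hypothetical link whose longitude ends in a digit that also occurs in the suffix
link = "https://openhours.co.uk/spots?city=Example&lat=51.4869&lng=0.1075" + SUFFIX

# rstrip() removes every trailing character found in the given set,
# so it continues past the suffix and also removes the final "5"
stripped_by_rstrip = link.rstrip("q=&search_term_id=583")

# The suffix check removes exactly the suffix and nothing else
stripped_by_suffix = strip_suffix(link)

print(stripped_by_rstrip)
print(stripped_by_suffix)
```

With this input, the rstrip version ends in lng=0.107 while the suffix version keeps lng=0.1075, so coordinates ending in 3, 5 or 8 are the ones at risk.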

### Verifying that the characters have been removed from the end of each URL

In [20]:
pd.set_option('display.max_colwidth', None)

print(cities_df.head())

                                                                 cities_links  \
28      https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747   
29        https://openhours.co.uk/spots?city=Abbey+Wood&lat=51.4869&lng=0.107   
30  https://openhours.co.uk/spots?city=Abbots+Langley&lat=51.7057&lng=-0.4176   
31        https://openhours.co.uk/spots?city=Aberdare&lat=51.7144&lng=-3.4492   
32        https://openhours.co.uk/spots?city=Aberdeen&lat=57.1437&lng=-2.0981   

            cities  
28      Abbey Road  
29      Abbey Wood  
30  Abbots Langley  
31        Aberdare  
32        Aberdeen  


### Defining the new list of URLs

In [21]:
URLs = []

### Converting the previous dataframe into a list

In [22]:
list_links = cities_df['cities_links'].values.tolist()

### Checking that the list is of the right size

In [23]:
print (len(list_links))

1390


### Building the new list of URLs

In [24]:
for link in list_links:

    for i in range(1, 101):

        # Appending the page number and the search parameters to the city link

        URLs.append(link + '&page=' + str(i) + '&q=&search_term_id=583')
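
As an alternative to string concatenation, the page URLs can also be assembled with urllib.parse, which takes care of the encoding of the query string. This is only a sketch: the function name paginate and its parameters are hypothetical, and it assumes the query layout seen in the links above (city, lat, lng, then page, q and search_term_id).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paginate(link, pages, search_term_id="583"):
    """Build one URL per page number from a city link without page parameters."""
    parts = urlsplit(link)
    # Existing parameters (city, lat, lng), decoded into (key, value) pairs
    base_params = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in pages:
        params = base_params + [
            ("page", str(page)),
            ("q", ""),
            ("search_term_id", search_term_id),
        ]
        # urlencode() re-encodes the values, e.g. spaces become "+"
        urls.append(urlunsplit(parts._replace(query=urlencode(params))))
    return urls


example = paginate(
    "https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747",
    range(1, 4),
)
for url in example:
    print(url)
```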

### Printing the size of the new list of URLs

In [25]:
print(len(URLs))

139000


### Printing an example of a URL

In [26]:
print(URLs[0])

https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747&page=1&q=&search_term_id=583


# Multi-threading

Now that we have a complete list of URLs, we can start to implement multi-threading.

### Importing the necessary modules

In [27]:
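import time
# ThreadPool has the same map/close/join interface as multiprocessing.Pool,
# but runs its workers as threads instead of processes
from multiprocessing.pool import ThreadPool as Pool
import warnings

warnings.simplefilter('ignore')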

### Readings

Here is a list of good reads on multi-threading, the number of threads to use, and the difference between multi-threading and multi-processing.

<br/>

How to implement multi-threading and multi-processing for a web scraper:

https://medium.com/@apbetahouse45/asynchronous-web-scraping-in-python-using-concurrent-module-a5ca1b7f82e4

Right number of threads to use:

https://www.jstorimer.com/blogs/workingwithcode/7970125-how-many-threads-is-too-many

Difference between multi-processing and multi-threading:

https://medium.com/contentsquare-engineering-blog/multithreading-vs-multiprocessing-in-python-ece023ad55a

<br/>

In this case, I used multi-threading, as the task is I/O-bound rather than CPU-bound: the workers spend most of their time waiting for the server to respond.
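
To illustrate why threads help with I/O-bound work, here is a minimal sketch in which a short sleep stands in for the network wait. The function fake_request and the list of work items are hypothetical; the example uses concurrent.futures from the standard library rather than the pool used in this notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(url):
    """Stand-in for a network call: waits, then returns a label."""
    time.sleep(0.05)  # simulated network latency
    return "fetched " + url


urls = ["page-" + str(i) for i in range(8)]  # hypothetical work items

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # map() returns the results in the same order as the inputs
    fetched = list(executor.map(fake_request, urls))
elapsed = time.perf_counter() - start

print(fetched[:2])
print("Elapsed: {:.2f}s".format(elapsed))
```

Run one after the other, the eight calls would wait for about 0.4 seconds in total; with eight threads the waits overlap, so the elapsed time should be much closer to a single wait.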

## Measuring the difference in performance using multi-threading

Here, we will use a subset of the list of URLs and determine what is the best number of workers for multi-threading.

### Notes

- Set verify = False in requests.get to disable SSL certificate verification, which otherwise raised an error

- Used SoupStrainer to parse only the relevant tags, which improves the performance of the scraper

- Added a pause of 10 ms per request to slow down the scraping and reduce the pressure put on the server

### Creating a subset of 2,000 URLs to be used for performance testing

In [28]:
URLs_subset = URLs[:2000]

### Defining the scraping function

In [29]:
def scrape_postcode(url):

    # Pausing for 10 milliseconds on every call to reduce the load on the server
    # (20 seconds of cumulative delay over 2,000 URLs, shared between the workers)

    time.sleep(0.01)

    supermarkets_postcode = []

    # Requesting the page (SSL certificate verification is disabled)

    page = requests.get(url, verify = False)

    # Parsing only the tags that carry a postcode, then turning them into a soup

    postcodes_parsed = SoupStrainer(itemprop = 'postalCode')

    soup = BeautifulSoup(page.content, "lxml", parse_only=postcodes_parsed)

    # Finding the postcodes on the page

    for postcode in soup.find_all(itemprop = 'postalCode'):

        # Appending the first child of each postcode tag (normally its text)

        supermarkets_postcode.append(postcode.next)

    return supermarkets_postcode

### Creating a pool with 10 workers for multi-threading

In [30]:
start = time.time()

pool = Pool(10)
results = pool.map(scrape_postcode, URLs_subset)
        
pool.close()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links with 10 workers

In [31]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 62.081752s


### Creating a pool with 15 workers for multi-threading

In [32]:
start = time.time()

pool = Pool(15)
results = pool.map(scrape_postcode, URLs_subset)

pool.close()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links with 15 workers

In [33]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 70.302282s


### Creating a pool with 20 workers for multi-threading

In [34]:
start = time.time()

pool = Pool(20)
results = pool.map(scrape_postcode, URLs_subset)

pool.close()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links with 20 workers

In [35]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 79.616263s


### Creating a pool with 25 workers for multi-threading

In [36]:
start = time.time()

pool = Pool(25)
results = pool.map(scrape_postcode, URLs_subset)

pool.close()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links with 25 workers

In [37]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 84.469282s


### Creating a pool with 30 workers for multi-threading

In [38]:
start = time.time()

pool = Pool(30)
results = pool.map(scrape_postcode, URLs_subset)

pool.close()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links with 30 workers

In [39]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 78.566728s


### Creating a pool with 100 workers for multi-threading

In [40]:
start = time.time()

pool = Pool(100)
results = pool.map(scrape_postcode, URLs_subset)

pool.close()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links with 100 workers

In [41]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 79.337509s


### Printing out the scraping results

In [42]:
print (results)
print (len(results))

[['NW8 6PG', 'W9 1UP', 'NW8 8JN', 'W9 1SY', 'NW8 8EX', 'NW1 6TU', 'W2 1DY', 'W2 6EZ', 'W2 1EE', 'NW1 6AE'], ['NW6 4RY', 'NW6 5UA', 'W2 1HB', 'W2 2DS', 'NW3 3NT', 'NW3 3RA', 'W2 2DS', 'W1U 6TS', 'W1U 6TP', 'W2 1RH'], ['NW3 6JR', 'NW6 4HS', 'W2 3PX', 'W2 6ES', 'NW6 6JH', 'W2 2EA', 'NW3 6NN', 'W2 3QA', 'W2 4SB', 'W1U 8NN'], ['NW1 4SA', 'W9 3RU', 'W1U 4SD', 'NW3 4UE', 'W1U 3AA', 'W9 3QA', 'W1U 4SA', 'W2 5RT', 'W1H 7AA', 'NW3 6LU'], ['NW6 6NL', 'W2 3RL', 'W2 4QH', 'W2 3RY', 'NW6 7JN', 'NW1 7AH', 'W2 3HJ', 'NW1 8AN', 'NW1 3AU', 'NW1 7PN'], ['NW6 7TA', 'NW6 6RG', 'W10 4RG', 'W1W 5QQ', 'NW3 4QG', 'NW1 8AA', 'NW1 7JR', 'N1 9LE', 'NW6 1SG', 'NW1 0LU'], ['W1C 2JS', 'NW1 0LT', 'W1W 6PS', 'NW1 0LT', 'NW6 1SG', 'W1W 6BE', 'NW1 3JA', 'NW1 9LJ', 'W11 1LJ', 'W11 1LA'], ['NW6 7QE', 'NW1 0JH', 'W1A 1EX', <span class="postal-code" itemprop="postalCode">NW5 4EG</span>, 'NW5 4EG', 'W1T 5AS', 'NW6 7QE', 'W1K 2PX', 'NW6 1RN', 'W1T 7NE'], ['NW3 2YY', 'W10 6HH', 'W10 6HJ', 'W11 3QG', 'W11 3QE', 'NW1 1TT', 'NW1 

### Conclusion

The scraper does not get faster with more than 10 workers: the run with 10 workers was the fastest (about 62 s), while the runs with 15 to 100 workers took between 70 s and 85 s. Therefore, we will keep this number of workers for scraping all the data.
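
The benchmark cells above repeat the same code for each pool size. As a sketch of a more compact variant, the loop below times several worker counts in one pass. The names benchmark and simulated_scrape are hypothetical, and a short sleep replaces the real request so that the example runs without network access; the timings it prints say nothing about the real website.

```python
import time
from multiprocessing.pool import ThreadPool


def simulated_scrape(url):
    """Stand-in for the real scraping function: waits briefly, returns a list."""
    time.sleep(0.01)
    return [url]


def benchmark(worker_counts, urls, task=simulated_scrape):
    """Return a dict mapping each worker count to the elapsed time in seconds."""
    timings = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ThreadPool(workers) as pool:
            pool.map(task, urls)  # blocks until every URL has been processed
        timings[workers] = time.perf_counter() - start
    return timings


sample_urls = ["url-" + str(i) for i in range(40)]  # hypothetical inputs
timings = benchmark([2, 5, 10], sample_urls)

for workers, seconds in timings.items():
    print("{:>3} workers: {:.3f}s".format(workers, seconds))
```

Passing the real scraping function as the task argument would reproduce the measurements above, with the usual caveat that a single run per pool size is sensitive to network conditions.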

## Scraping all the data using multi-threading

### Defining the function to scrape the data of an individual page

In [43]:
def scrape_postcode(url):

    supermarkets_postcode = []

    # Requesting the page

    page = requests.get(url)

    # Parsing only the tags that carry a postcode, then turning them into a soup

    postcodes_parsed = SoupStrainer(itemprop = 'postalCode')

    soup = BeautifulSoup(page.content, "lxml", parse_only=postcodes_parsed)

    # Finding the postcodes on the page

    for postcode in soup.find_all(itemprop = 'postalCode'):

        # Appending the first child of each postcode tag (normally its text)

        supermarkets_postcode.append(postcode.next)

    return supermarkets_postcode

### Creating the pool for multi-threading

In [44]:
start = time.time()

pool = Pool(30)
results = pool.map(scrape_postcode, URLs)

pool.terminate()
pool.join()
        
end = time.time()

### Printing out the time it took to scrape all the links

In [45]:
print("Time Taken: {:.6f}s".format(end-start))

Time Taken: 5506.458607s


### Printing out the scraping results

In [55]:
print (len(results))
print (results[:10])

139000
[['NW8 6PG', 'W9 1UP', 'NW8 8JN', 'W9 1SY', 'NW8 8EX', 'NW1 6TU', 'W2 1DY', 'W2 6EZ', 'W2 1EE', 'NW1 6AE'], ['NW6 4RY', 'NW6 5UA', 'W2 1HB', 'W2 2DS', 'NW3 3NT', 'NW3 3RA', 'W2 2DS', 'W1U 6TS', 'W1U 6TP', 'W2 1RH'], ['NW3 6JR', 'NW6 4HS', 'W2 3PX', 'W2 6ES', 'NW6 6JH', 'W2 2EA', 'NW3 6NN', 'W2 3QA', 'W2 4SB', 'W1U 8NN'], ['NW1 4SA', 'W9 3RU', 'W1U 4SD', 'NW3 4UE', 'W1U 3AA', 'W9 3QA', 'W1U 4SA', 'W2 5RT', 'W1H 7AA', 'NW3 6LU'], ['NW6 6NL', 'W2 3RL', 'W2 4QH', 'W2 3RY', 'NW6 7JN', 'NW1 7AH', 'W2 3HJ', 'NW1 8AN', 'NW1 3AU', 'NW1 7PN'], ['NW6 7TA', 'NW6 6RG', 'W10 4RG', 'W1W 5QQ', 'NW3 4QG', 'NW1 8AA', 'NW1 7JR', 'N1 9LE', 'NW6 1SG', 'NW1 0LU'], ['W1C 2JS', 'NW1 0LT', 'W1W 6PS', 'NW1 0LT', 'NW6 1SG', 'W1W 6BE', 'NW1 3JA', 'NW1 9LJ', 'W11 1LJ', 'W11 1LA'], ['NW6 7QE', 'NW1 0JH', 'W1A 1EX', <span class="postal-code" itemprop="postalCode">NW5 4EG</span>, 'NW5 4EG', 'W1T 5AS', 'NW6 7QE', 'W1K 2PX', 'NW6 1RN', 'W1T 7NE'], ['NW3 2YY', 'W10 6HH', 'W10 6HJ', 'W11 3QG', 'W11 3QE', 'NW1 1TT'

# Data Cleaning

### Importing the necessary modules

In [149]:
from itertools import chain 

### Flattening the 2-D list into a 1-D list

In [150]:
postcodes = list(chain.from_iterable(results))
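
As a small worked example of the flattening step, the sketch below applies chain.from_iterable to a hypothetical list of per-page results. Pages that returned no postcode are represented by empty lists and simply contribute nothing to the flattened list.

```python
from itertools import chain

# Hypothetical per-page results: one inner list per scraped page
pages = [["NW8 6PG", "W9 1UP"], [], ["NW6 4RY"]]

# chain.from_iterable() walks through the inner lists one after the other
flat = list(chain.from_iterable(pages))

print(flat)
```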

### Printing the first 10 elements of the list to check it has been properly flattened

In [151]:
print (postcodes[:10])

['NW8 6PG', 'W9 1UP', 'NW8 8JN', 'W9 1SY', 'NW8 8EX', 'NW1 6TU', 'W2 1DY', 'W2 6EZ', 'W2 1EE', 'NW1 6AE']


### Converting the list of scraped data into a pandas dataframe

In [152]:
postcodes_df = pd.DataFrame({
    
    "postcodes": postcodes
    
})

### Printing the first rows of the dataframe and its size to check if the scraped data is accurate

In [153]:
print (postcodes_df.head())
print ("\nThe dataframe contains: " + str(len(postcodes_df)) + " elements")

  postcodes
0   NW8 6PG
1    W9 1UP
2   NW8 8JN
3    W9 1SY
4   NW8 8EX

The dataframe contains: 1372000 elements


### Printing the type of each column of the dataframe

In [154]:
print_data_type_of_all_columns_of_a_dataframe(postcodes_df)

Data type of each column of the dataframe :
postcodes    object
dtype: object


### Printing the first 5 rows of the dataframe to check that the postcodes contain no extra white spaces

In [155]:
print (postcodes_df.head())

  postcodes
0   NW8 6PG
1    W9 1UP
2   NW8 8JN
3    W9 1SY
4   NW8 8EX


### Removing the rows that are too long

In [156]:
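# Keeping only the entries shorter than 10 characters; the filtered result
# has to be assigned back for the rows to actually be removed
postcodes_df = postcodes_df[postcodes_df['postcodes'].str.len().lt(10)]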
print (postcodes_df.shape)

(1372000, 1)
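
A length check is a fairly coarse filter. As a possible complement, the sketch below keeps only the values that look like a UK postcode. The regular expression is a deliberately simplified assumption, not the official postcode specification, and the function name looks_like_postcode and the sample values are hypothetical.

```python
import re

# Simplified pattern (an assumption): one or two letters, a digit, an optional
# letter or digit, an optional space, then a digit followed by two letters
POSTCODE_PATTERN = re.compile(r"[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}")


def looks_like_postcode(value):
    """Return True for strings that match the simplified postcode pattern."""
    return isinstance(value, str) and POSTCODE_PATTERN.fullmatch(value.strip()) is not None


# Hypothetical raw values, including leftover markup and a missing value
raw = [
    "NW8 6PG",
    " W9 1UP ",
    "DN11 9HT",
    '<span itemprop="postalCode">NW5 4EG</span>',
    None,
]

clean = [value.strip() for value in raw if looks_like_postcode(value)]
print(clean)
```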


### Removing all the duplicates from the dataframe

In [157]:
postcodes_df = postcodes_df.drop_duplicates(keep='first')
print ("\nThe dataframe contains: " + str(len(postcodes_df)) + " elements")


The dataframe contains: 17415 elements
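
For reference, keeping the first occurrence of each value can also be expressed in plain Python: a dict preserves insertion order, so dict.fromkeys keeps each postcode at the position where it first appeared. The sample list below is hypothetical.

```python
# Hypothetical list with repeated postcodes
postcodes = ["NW8 6PG", "W9 1UP", "NW8 6PG", "W2 2DS", "W9 1UP"]

# dict.fromkeys() keeps one key per distinct value, in order of first appearance
unique_postcodes = list(dict.fromkeys(postcodes))

print(unique_postcodes)
```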


### Printing the first 5 rows of the dataframe to check that the duplicates have been removed

In [158]:
print (postcodes_df.head())

  postcodes
0   NW8 6PG
1    W9 1UP
2   NW8 8JN
3    W9 1SY
4   NW8 8EX


### Exporting the dataframe to a csv file

In [159]:
postcodes_df.to_csv('supermarkets_postcodes.csv', index=False) 