# Goal of the project

The goal of this notebook is to scrape the locations of supermarkets in the UK. The data will later be used to build a dashboard of the UK food supply chain.

# Part 1: Scraping the links of all the cities, letter by letter

### Useful Links

GitHub repository:
    
https://github.com/tlemenestrel/Mining_Food_Supply_Chain_Data

Link to the scraped website:

https://openhours.co.uk

### Importing the necessary modules

In [103]:
import pandas as pd
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

### Links to the website

In [104]:
website_letters_link = "https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on="
website_cities_link  = "https://openhours.co.uk"

### List of all the letters for the pages to scrape

In [105]:
list_letters = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","Y","Z"]
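The list above can also be generated rather than typed out. A minimal sketch using the standard library; the `excluded` set is an assumption reflecting that the hand-written list has no "X" entry, presumably because the site lists no locations under that letter.

```python
import string

# Assumption: "X" is skipped because the hand-written list omits it
excluded = {"X"}

# Every uppercase letter of the alphabet except the excluded ones
list_letters = [letter for letter in string.ascii_uppercase if letter not in excluded]

print(len(list_letters))
```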

### Lists to fill with the names of the cities and their respective links

In [106]:
cities = []
cities_links = []

### Scraping the cities and their links

In [107]:
for letter in list_letters:
    
    # Requesting the page using the url plus the associated letter
    
    page = requests.get(website_letters_link + letter)
    
    # Turning the page into a soup for future scraping
    
    soup = BeautifulSoup(page.content, 'html5lib')
        
    # Finding the name of the city and its link
    
    for j in soup.find_all('li'):
                
        a = j.find('a')
        
        # Skipping list items that do not contain a link
        
        if a is None or 'href' not in a.attrs:
            continue
        
        # Appending the list of cities and their respective links

        cities_links.append(website_cities_link + a.attrs['href'])
        cities.append(a.contents)
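Concatenating the base URL and the `href` works as long as every `href` is a root-relative path, which appears to be the case on this site. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with `href` values that are already absolute. The sample `href` values below are illustrative, not scraped.

```python
from urllib.parse import urljoin

website_cities_link = "https://openhours.co.uk"

# Hypothetical href values, as they might appear in an <a> tag
relative_href = "/spots?city=Aberdeen&lat=57.1437&lng=-2.0981&search_term_id=583"
absolute_href = "https://openhours.co.uk/categories"

# urljoin prefixes the base for a relative path and leaves an absolute URL untouched
relative_link = urljoin(website_cities_link, relative_href)
absolute_link = urljoin(website_cities_link, absolute_href)

print(relative_link)
print(absolute_link)
```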

### Converting the lists of scraped data into a pandas dataframe

In [108]:
cities_df = pd.DataFrame({
    
    "cities_links": cities_links,
    "cities"      : cities
    
})

### Printing the first rows of the dataframe to check if the scraped data is accurate

In [109]:
print (cities_df.head())

                                                                                 cities_links  \
0                                                          https://openhours.co.uk/categories   
1                       https://openhours.co.uk/categories/supermarket-583/choose_subcategory   
2  https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on=A   
3  https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on=A   
4  https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on=B   

                cities  
0       [[Categories]]  
1      [[Supermarket]]  
2  [[Choose location]]  
3                  [A]  
4                  [B]  


### Filtering the dataframe to only keep the links of the cities

In [110]:
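# .copy() makes the filtered dataframe independent, so later column assignments do not raise a SettingWithCopyWarning
cities_df = cities_df[cities_df['cities_links'].str.contains('/spots')].copy()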

### Verifying that the dataframe only contains the links of the cities

In [111]:
print (cities_df.head())

                                                                                    cities_links  \
28      https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747&search_term_id=583   
29       https://openhours.co.uk/spots?city=Abbey+Wood&lat=51.4869&lng=0.1075&search_term_id=583   
30  https://openhours.co.uk/spots?city=Abbots+Langley&lat=51.7057&lng=-0.4176&search_term_id=583   
31        https://openhours.co.uk/spots?city=Aberdare&lat=51.7144&lng=-3.4492&search_term_id=583   
32        https://openhours.co.uk/spots?city=Aberdeen&lat=57.1437&lng=-2.0981&search_term_id=583   

              cities  
28      [Abbey Road]  
29      [Abbey Wood]  
30  [Abbots Langley]  
31        [Aberdare]  
32        [Aberdeen]  
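Each city link already carries the city's name and coordinates in its query string, which may be useful later for the dashboard. A minimal sketch of extracting them with the standard library; the helper name `parse_city_link` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def parse_city_link(link):
    """Return the city name, latitude and longitude embedded in a city link."""
    query = parse_qs(urlparse(link).query)
    return {
        "city": query["city"][0],        # parse_qs decodes "+" as a space
        "lat": float(query["lat"][0]),
        "lng": float(query["lng"][0]),
    }

example = "https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747&search_term_id=583"
print(parse_city_link(example))
```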


### Printing out the type of each column of the dataframe

In [112]:
def print_data_type_of_all_columns_of_a_dataframe(df):

    dataTypeSeries = df.dtypes
    print('Data type of each column of the dataframe :')
    print(dataTypeSeries)

In [113]:
print_data_type_of_all_columns_of_a_dataframe(cities_df)

Data type of each column of the dataframe :
cities_links    object
cities          object
dtype: object


### Removing the brackets in the cities column

In [114]:
cities_df['cities'] = cities_df['cities'].str.get(0)

### Verifying that the column has been properly changed

In [115]:
print (cities_df.head())

                                                                                    cities_links  \
28      https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747&search_term_id=583   
29       https://openhours.co.uk/spots?city=Abbey+Wood&lat=51.4869&lng=0.1075&search_term_id=583   
30  https://openhours.co.uk/spots?city=Abbots+Langley&lat=51.7057&lng=-0.4176&search_term_id=583   
31        https://openhours.co.uk/spots?city=Aberdare&lat=51.7144&lng=-3.4492&search_term_id=583   
32        https://openhours.co.uk/spots?city=Aberdeen&lat=57.1437&lng=-2.0981&search_term_id=583   

            cities  
28      Abbey Road  
29      Abbey Wood  
30  Abbots Langley  
31        Aberdare  
32        Aberdeen  


### Exporting the dataframe to a csv file

In [116]:
cities_df.to_csv('cities.csv', index=False) 

# Part 2: Scraping the data from all the links of the different cities

## Scraping an individual page

To simplify the task ahead, we will start by scraping an individual page and later on integrate it into a for-loop.

In [None]:
individual_supermarket_name = []
individual_supermarket_address = []
individual_supermarket_postcode = []

In [None]:
page = requests.get("https://openhours.co.uk/spots?city=Ossett&lat=53.6798&lng=-1.5801&page=100&q=&search_term_id=583")

# Turning the page into a soup for future scraping

soup = BeautifulSoup(page.content, 'html5lib')

# Finding the postal codes of an individual page

for postcode in soup.find_all(itemprop = 'postalCode'):
    
    # Appending each postcode to the list

    individual_supermarket_postcode.append(postcode.next)

print(individual_supermarket_postcode)

['YO24 2RQ', 'DN8 5BT', 'DN8 5BA', 'DN11 9HT', 'DN11 9HT', 'YO23 2RA', 'YO24 3JQ', 'SK14 2TA', 'SK14 1HB', 'OL9 8AU']


In [None]:
individual_postcodes_df = pd.DataFrame({
    
    "postcodes": individual_supermarket_postcode
    
})

In [None]:
print (individual_postcodes_df)

  postcodes
0  YO24 2RQ
1   DN8 5BT
2   DN8 5BA
3  DN11 9HT
4  DN11 9HT
5  YO23 2RA
6  YO24 3JQ
7  SK14 2TA
8  SK14 1HB
9   OL9 8AU
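The scraped list can contain repeats ('DN11 9HT' appears twice above). A minimal sketch of deduplicating while keeping the original order, then splitting each postcode into its outward and inward parts. The regular expression is a deliberately simplified check, not the full official postcode specification, and the helper name `split_postcodes` is hypothetical.

```python
import re

# Simplified pattern: outward code (2 to 4 characters), a space, inward code (digit + 2 letters)
POSTCODE_PATTERN = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?) ([0-9][A-Z]{2})$")

def split_postcodes(postcodes):
    """Deduplicate postcodes (keeping order) and split them into (outward, inward)."""
    unique = list(dict.fromkeys(postcodes))
    result = []
    for postcode in unique:
        match = POSTCODE_PATTERN.match(postcode.strip().upper())
        if match:  # entries that do not look like a postcode are skipped
            result.append(match.groups())
    return result

sample = ['YO24 2RQ', 'DN8 5BT', 'DN11 9HT', 'DN11 9HT', 'OL9 8AU']
print(split_postcodes(sample))
```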


## Scraping all the pages

Now that we know how to scrape one page, we can start to scrape all the other pages and generalize our approach to the entire website.

### Importing the necessary modules

In [None]:
# sleep and randint are used to add random delays to the scraping, which is done to avoid overloading the website's server

from time import sleep
from random import randint
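A sketch of how the delay could be wrapped in a small helper so that it is easy to tune, or to disable while testing. The helper name `polite_pause` and its `sleep_fn` parameter are hypothetical; the 2 to 10 second range matches the one used in the scraping loop of this notebook.

```python
from random import uniform
from time import sleep

def polite_pause(min_seconds=2, max_seconds=10, sleep_fn=sleep):
    """Sleep for a random duration between min_seconds and max_seconds and return it."""
    delay = uniform(min_seconds, max_seconds)
    sleep_fn(delay)
    return delay

# Passing a no-op sleep_fn shows the chosen delay without actually waiting
delay = polite_pause(sleep_fn=lambda seconds: None)
print(round(delay, 2))
```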

### List to store the scraped data

In [None]:
supermarkets_postcode = []

### Arranging the previous URLs by removing extra characters

String to remove at the end of each URL:
    
&search_term_id=583

In [None]:
# str.rstrip strips any trailing characters found in its argument (treated as a set of characters),
# which can eat into the coordinates, so the exact suffix is removed with an anchored regex instead

cities_df['cities_links'] = cities_df['cities_links'].str.replace(r'&search_term_id=583$', '', regex=True)

### Verifying that the suffix has been removed at the end of each URL

In [None]:
pd.set_option('display.max_colwidth', None)

print(cities_df.head())

                                                                 cities_links  \
28      https://openhours.co.uk/spots?city=Abbey+Road&lat=51.5299&lng=-0.1747   
29       https://openhours.co.uk/spots?city=Abbey+Wood&lat=51.4869&lng=0.1075   
30  https://openhours.co.uk/spots?city=Abbots+Langley&lat=51.7057&lng=-0.4176   
31        https://openhours.co.uk/spots?city=Aberdare&lat=51.7144&lng=-3.4492   
32        https://openhours.co.uk/spots?city=Aberdeen&lat=57.1437&lng=-2.0981   

            cities  
28      Abbey Road  
29      Abbey Wood  
30  Abbots Langley  
31        Aberdare  
32        Aberdeen  
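Note that `str.rstrip` (and the pandas `.str.rstrip` accessor) treats its argument as a set of characters to strip, not as a literal suffix, so it can remove more than intended. A minimal sketch of the difference in plain Python; the helper `strip_suffix` is hypothetical.

```python
def strip_suffix(text, suffix):
    """Remove suffix from the end of text only if text really ends with it."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

url = "https://openhours.co.uk/spots?city=Abbey+Wood&lat=51.4869&lng=0.1075&search_term_id=583"
suffix = "&search_term_id=583"

# rstrip keeps removing trailing characters found in the argument,
# so the final "5" of the longitude is lost as well
print(url.rstrip(suffix))

# strip_suffix removes only the exact suffix
print(strip_suffix(url, suffix))
```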


### Scraping the data

In [None]:
for link in cities_df['cities_links']:
    
    # Pausing for a random 2 to 10 seconds between cities to limit the load on the server
    
    sleep(randint(2, 10))
    
    for i in range(1, 100):
        
        # Requesting the page using the city URL plus the page number
    
        page = requests.get(link + '&page=' + str(i) + '&q=&search_term_id=583')
    
        # Turning the page into a soup for future scraping
    
        soup = BeautifulSoup(page.content, 'html5lib')
        
        # Finding the postcodes of all the supermarkets on the page
    
        for postcode in soup.find_all(itemprop = 'postalCode'):
    
            # Appending each postcode to the list

            supermarkets_postcode.append(postcode.next)
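The loop above requests a fixed 99 pages per city. As a hedged sketch of one possible refinement, paging could stop once a page contributes no postcode that has not been seen before. The function `collect_postcodes` and the `fetch_postcodes` callable are hypothetical; in this notebook, `fetch_postcodes` would wrap the `requests.get` and `BeautifulSoup` calls. A fake fetcher is used here so the example runs without network access.

```python
def collect_postcodes(fetch_postcodes, max_pages=99):
    """Collect unique postcodes page by page, stopping at the first page with nothing new."""
    seen = set()
    ordered = []
    for page_number in range(1, max_pages + 1):
        new_items = [p for p in fetch_postcodes(page_number) if p not in seen]
        if not new_items:
            break  # nothing new on this page, so later pages are not requested
        for postcode in new_items:
            if postcode not in seen:  # guards against repeats within the same page
                seen.add(postcode)
                ordered.append(postcode)
    return ordered

# Fake pages standing in for the website (illustrative values only)
fake_pages = {1: ["YO24 2RQ", "DN8 5BT"], 2: ["DN8 5BT", "OL9 8AU"], 3: ["OL9 8AU"]}

def fake_fetch(page_number):
    return fake_pages.get(page_number, [])

print(collect_postcodes(fake_fetch))
```

Whether this stopping rule suits the site depends on how it paginates, so it is worth checking against a few cities before relying on it.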

### Converting the list of scraped data into a pandas dataframe

In [None]:
postcodes_df = pd.DataFrame({
    
    "postcodes": supermarkets_postcode
    
})

### Printing the first rows of the dataframe to check if the scraped data is accurate

In [None]:
print (postcodes_df.head())

### Exporting the dataframe to a csv file

In [None]:
postcodes_df.to_csv('supermarkets_postcodes.csv', index=False) 