# Goal of the project

The goal of this notebook is to scrape the locations of supermarkets in the UK. The data will later be used to build a dashboard of the UK food supply chain.

# Part 1: Scraping the links of all the cities, letter by letter

### Useful Links

GitHub repository:
    
https://github.com/tlemenestrel/Mining_Food_Supply_Chain_Data

Link to the scraped website:

https://openhours.co.uk

### Importing the necessary modules

In [175]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Links to the website

In [176]:
website_letters_link = "https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on="
website_cities_link  = "https://openhours.co.uk"

### List of all the letters for the pages to scrape

In [177]:
list_letters = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","Y","Z"]
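
The list above can also be generated from the standard library instead of being typed by hand. The sketch below assumes that "X" is left out on purpose, since it is the only letter missing from the original list. The name `skipped_letters` is introduced here for illustration only.

```python
import string

# Hypothetical name: letters assumed to have no page worth scraping.
# "X" is the only letter absent from the hand-written list.
skipped_letters = {"X"}

# Build the list of uppercase letters, leaving out the skipped ones
list_letters = [
    letter for letter in string.ascii_uppercase
    if letter not in skipped_letters
]
```

If a page for "X" turns out to exist, emptying `skipped_letters` is enough to include it.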

### Lists to fill with the names of the cities and their respective links

In [178]:
cities = []
cities_links = []

### Scraping the cities and their links

In [179]:
for letter in list_letters:

    # Requesting the page using the letters url plus the associated letter

    page = requests.get(website_letters_link + letter)

    # Turning the page into a soup for future scraping

    soup = BeautifulSoup(page.content, 'html5lib')

    # Finding the name of the city and its link in each list item

    for item in soup.find_all('li'):

        a = item.find('a')

        # Skipping list items that do not contain a link

        if a is None or 'href' not in a.attrs:
            continue

        # Appending the list of cities and their respective links

        cities_links.append(website_cities_link + a.attrs['href'])
        cities.append(a.contents)
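
The loop builds absolute links by concatenating the base URL with each `href`. As a sketch of an alternative, `urllib.parse.urljoin` from the standard library gives the same result for relative paths and leaves links that are already absolute untouched. The helper name `to_absolute` and the sample `href` values are made up for this example.

```python
from urllib.parse import urljoin

website_cities_link = "https://openhours.co.uk"

def to_absolute(href, base=website_cities_link):
    """Return an absolute URL for an href found on the site (hypothetical helper)."""
    return urljoin(base, href)

# Sample hrefs, invented for this example
print(to_absolute("/spots?city=Aberdare"))
print(to_absolute("https://openhours.co.uk/categories"))
```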

### Converting the lists of scraped data into a pandas dataframe

In [180]:
cities_df = pd.DataFrame({
    
    "cities_links": cities_links,
    "cities"      : cities
    
})

### Printing the first rows of the dataframe to check that the scraped data is accurate

In [181]:
print (cities_df.head())

                                        cities_links               cities
0                 https://openhours.co.uk/categories       [[Categories]]
1  https://openhours.co.uk/categories/supermarket...      [[Supermarket]]
2  https://openhours.co.uk/categories/supermarket...  [[Choose location]]
3  https://openhours.co.uk/categories/supermarket...                  [A]
4  https://openhours.co.uk/categories/supermarket...                  [B]


### Filtering the dataframe to only keep the links of the cities

In [182]:
cities_df = cities_df[cities_df['cities_links'].str.contains('/spots', regex=False)].copy()
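
The filter can be tried on a small toy dataframe before running it on the scraped data. The sketch below uses invented rows shaped like the real ones. Passing `regex=False` matches `/spots` as a literal substring, and `.copy()` makes the result an independent dataframe, which, depending on the pandas version, can avoid a `SettingWithCopyWarning` when a column is reassigned later.

```python
import pandas as pd

# Toy data shaped like the scraped dataframe (values invented for this example)
toy_df = pd.DataFrame({
    "cities_links": [
        "https://openhours.co.uk/categories",
        "https://openhours.co.uk/spots?city=Aberdare",
    ],
    "cities": [["Categories"], ["Aberdare"]],
})

# Boolean mask: True where the link contains the literal substring '/spots'
mask = toy_df["cities_links"].str.contains("/spots", regex=False)

# Keep the matching rows as an independent dataframe
filtered_df = toy_df[mask].copy()
print(filtered_df)
```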

### Verifying that the dataframe only contains the links of the cities

In [183]:
print (cities_df.head())

                                         cities_links            cities
28  https://openhours.co.uk/spots?city=Abbey+Road&...      [Abbey Road]
29  https://openhours.co.uk/spots?city=Abbey+Wood&...      [Abbey Wood]
30  https://openhours.co.uk/spots?city=Abbots+Lang...  [Abbots Langley]
31  https://openhours.co.uk/spots?city=Aberdare&la...        [Aberdare]
32  https://openhours.co.uk/spots?city=Aberdeen&la...        [Aberdeen]
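
The city links carry the city name in a `city` query parameter, as the output above shows. As a cross-check on the `cities` column, the name can be recovered from the link with the standard library. This is only a sketch: `city_from_link` is a hypothetical helper, and the sample URL keeps just the `city` parameter because the rest of the query string is truncated in the printed output.

```python
from urllib.parse import urlparse, parse_qs

def city_from_link(link):
    """Return the value of the 'city' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(link).query)
    values = query.get("city")
    return values[0] if values else None

# '+' in a query string decodes to a space
print(city_from_link("https://openhours.co.uk/spots?city=Abbey+Road"))
```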


### Printing out the type of each column of the dataframe

In [184]:
def print_data_type_of_all_columns_of_a_dataframe(df):

    dataTypeSeries = df.dtypes
    print('Data type of each column of Dataframe :')
    print(dataTypeSeries)

In [185]:
print_data_type_of_all_columns_of_a_dataframe(cities_df)

Data type of each column of Dataframe :
cities_links    object
cities          object
dtype: object


### Removing the brackets in the cities column by keeping the first element of each list

In [186]:
cities_df['cities'] = cities_df['cities'].str.get(0)
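
The brackets appear because each cell of the column holds a one-element list. A minimal illustration on toy data (values invented for this example) shows what `.str.get(0)` does to such a column:

```python
import pandas as pd

# Toy column of one-element lists, mimicking the scraped 'cities' column
toy_cities = pd.Series([["Abbey Road"], ["Aberdare"]])

# .str.get(0) takes the element at position 0 of each list
unwrapped = toy_cities.str.get(0)
print(unwrapped.tolist())
```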

### Verifying that the column has been properly changed

In [187]:
print (cities_df.head())

                                         cities_links          cities
28  https://openhours.co.uk/spots?city=Abbey+Road&...      Abbey Road
29  https://openhours.co.uk/spots?city=Abbey+Wood&...      Abbey Wood
30  https://openhours.co.uk/spots?city=Abbots+Lang...  Abbots Langley
31  https://openhours.co.uk/spots?city=Aberdare&la...        Aberdare
32  https://openhours.co.uk/spots?city=Aberdeen&la...        Aberdeen


### Exporting the dataframe to a csv file

In [188]:
cities_df.to_csv('cities.csv', index=False) 