# Goal of the project

The goal of this notebook is to scrape the locations of supermarkets in the UK from openhours.co.uk. The data will later be used to build a dashboard of the UK food supply chain.

# Part 1: Scraping the links of all the cities, letter by letter

### Useful Links

GitHub repository:
    
https://github.com/tlemenestrel/Mining_Food_Supply_Chain_Data

Link to the scraped website:

https://openhours.co.uk

### Importing the necessary modules

In [94]:
import pandas as pd
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

### Links to the website

In [95]:
website_letters_link = "https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on="
website_cities_link  = "https://openhours.co.uk"

### List of all the letters for the pages to scrape

In [96]:
list_letters = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","Y","Z"]
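
The same list can also be generated instead of typed out. The following is a minimal sketch, assuming that X is left out on purpose because the site has no page for it (the scraped letter navigation printed later in this notebook also goes from W straight to Y). The name `letter_urls` is hypothetical and only introduced for illustration.

```python
import string

website_letters_link = "https://openhours.co.uk/categories/supermarket-583/choose_location?all_locations=true&on="

# Every uppercase letter except X (assumption: the site has no page for X)
list_letters = [letter for letter in string.ascii_uppercase if letter != "X"]

# One URL per letter page (hypothetical helper list, not in the original notebook)
letter_urls = [website_letters_link + letter for letter in list_letters]

print(len(list_letters))
print(letter_urls[0])
```

Building the URLs up front makes it easy to inspect them before any request is sent.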

### Lists to fill with the names of the cities and their respective links

In [97]:
cities = []
cities_links = []

### Scraping the cities and their links

In [98]:
for letter in list_letters:
    
    # Requesting the page using the url plus the associated letter
    
    page = requests.get(website_letters_link + letter)
    
    # Turning the page into a soup for future scraping
    
    soup = BeautifulSoup(page.content, 'html5lib')
        
    # Finding the name of the city and its link
    
    for j in soup.find_all('li'):
                
        a = j.find('a')
        
        # Skipping list items that do not contain a link with an href
        
        if a is None or not a.has_attr('href'):
            continue
        
        # Appending the list of cities and their respective links

        cities_links.append(website_cities_link + a.attrs['href'])
        cities.append(a.contents)
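
The extraction step can be tried offline before running it against the website. The sketch below is illustrative only: `sample_html` is made-up markup, not real site content, and `extract_links` is a hypothetical helper. It uses the built-in `html.parser` instead of `html5lib`, and `get_text()` instead of `.contents`, so each name comes back as a plain string rather than a list.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML standing in for one letter page (not real site content)
sample_html = """
<ul>
  <li><a href="/categories">Categories</a></li>
  <li>No link here</li>
  <li><a href="/spots?city=Abbey+Road">Abbey Road</a></li>
</ul>
"""

def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <li> that holds a link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.find_all("li"):
        anchor = item.find("a", href=True)
        if anchor is None:
            # List items without a link are skipped instead of raising an error
            continue
        pairs.append((anchor.get_text(strip=True), urljoin(base_url, anchor["href"])))
    return pairs

pairs = extract_links(sample_html, "https://openhours.co.uk")
print(pairs)
```

Using `urljoin` rather than string concatenation also copes with links that are already absolute.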

### Converting the lists of scraped data into a pandas dataframe

In [152]:
cities_df = pd.DataFrame({
    
    "cities_links": cities_links,
    "cities"      : cities
    
})

### Printing the entire dataframe to check if the scraped data is accurate

In [153]:
pd.set_option('display.max_rows', None)

print(cities_df.cities)

0                    [[Categories]]
1                   [[Supermarket]]
2               [[Choose location]]
3                               [A]
4                               [B]
5                               [C]
6                               [D]
7                               [E]
8                               [F]
9                               [G]
10                              [H]
11                              [I]
12                              [J]
13                              [K]
14                              [L]
15                              [M]
16                              [N]
17                              [O]
18                              [P]
19                              [Q]
20                              [R]
21                              [S]
22                              [T]
23                              [U]
24                              [V]
25                              [W]
26                              [Y]
27                          

### Filtering the dataframe to only keep the links of the cities

In [154]:
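# .copy() makes the filtered result an independent dataframe, so that
# modifying its columns later does not raise a SettingWithCopyWarning
cities_df = cities_df[cities_df['cities_links'].str.contains('/spots')].copy()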

### Verifying that the dataframe only contains the links of the cities

In [155]:
print(cities_df.head())

                                         cities_links            cities
28  https://openhours.co.uk/spots?city=Abbey+Road&...      [Abbey Road]
29  https://openhours.co.uk/spots?city=Abbey+Wood&...      [Abbey Wood]
30  https://openhours.co.uk/spots?city=Abbots+Lang...  [Abbots Langley]
31  https://openhours.co.uk/spots?city=Aberdare&la...        [Aberdare]
32  https://openhours.co.uk/spots?city=Aberdeen&la...        [Aberdeen]
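
Each city link carries the city name in its `city` query parameter, which gives a second way to check a row. The sketch below uses a shortened, hypothetical link with only that parameter, because the real links contain further parameters that are truncated in the output above.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened, hypothetical link: the real links carry further parameters
# that are truncated in the printed output
link = "https://openhours.co.uk/spots?city=Abbey+Road"

parts = urlsplit(link)
query = parse_qs(parts.query)

# A city link points at the /spots path
is_city_link = parts.path == "/spots"

# "+" in a query string decodes to a space
city_name = query["city"][0]

print(is_city_link, city_name)
```

Comparing `city_name` with the scraped `cities` column would flag rows where the link and the label disagree.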


### Printing out the type of each column of the dataframe

In [156]:
def print_data_type_of_all_columns_of_a_dataframe(df):

    dataTypeSeries = df.dtypes
    print('Data type of each column of Dataframe :')
    print(dataTypeSeries)

In [157]:
print_data_type_of_all_columns_of_a_dataframe(cities_df)

Data type of each column of Dataframe :
cities_links    object
cities          object
dtype: object


### Removing the brackets in the cities column by keeping the first element of each list

In [169]:
cities_df['cities'] = cities_df['cities'].str.get(0)
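
The filtering and unwrapping steps can be checked on a small example first. This is a sketch on a made-up two-row dataframe (`demo_df` is a hypothetical name), shaped like the scraped data, where every entry of `cities` is a one-element list.

```python
import pandas as pd

# Tiny made-up frame shaped like the scraped data
demo_df = pd.DataFrame({
    "cities_links": ["https://openhours.co.uk/categories",
                     "https://openhours.co.uk/spots?city=Aberdare"],
    "cities": [["Categories"], ["Aberdare"]],
})

# Keep only the city rows; .copy() makes the result an independent dataframe
demo_df = demo_df[demo_df["cities_links"].str.contains("/spots")].copy()

# .str.get(0) takes the first element of each list in the column
demo_df["cities"] = demo_df["cities"].str.get(0)

print(demo_df["cities"].tolist())
```

The brackets in the printed output are simply how pandas displays a list, so taking element 0 leaves the bare city name.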

### Verifying that the column has been properly changed

In [170]:
print(cities_df.head())

                                         cities_links          cities
28  https://openhours.co.uk/spots?city=Abbey+Road&...      Abbey Road
29  https://openhours.co.uk/spots?city=Abbey+Wood&...      Abbey Wood
30  https://openhours.co.uk/spots?city=Abbots+Lang...  Abbots Langley
31  https://openhours.co.uk/spots?city=Aberdare&la...        Aberdare
32  https://openhours.co.uk/spots?city=Aberdeen&la...        Aberdeen


### Exporting the dataframe to a csv file

In [172]:
cities_df.to_csv(r'iCloud Drive\Documents\Covid_Project\cities.csv', index=False)
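
A path built with `pathlib` works the same way on every operating system and avoids hand-typed separators. The following is a minimal sketch, not the notebook's actual export: a temporary directory stands in for the real output folder, and `demo_df` and `output_path` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo_df = pd.DataFrame({
    "cities_links": ["https://openhours.co.uk/spots?city=Aberdare"],
    "cities": ["Aberdare"],
})

# A temporary directory stands in for the real output folder (assumption)
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "Covid_Project" / "cities.csv"

    # Create the folder if it does not exist yet
    output_path.parent.mkdir(parents=True, exist_ok=True)

    demo_df.to_csv(output_path, index=False)

    # Reading the file back confirms what was written
    reloaded = pd.read_csv(output_path)

print(list(reloaded.columns))
print(reloaded["cities"].tolist())
```

Reading the file back right after writing it is a cheap check that the columns and values survived the export.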
