# Web Scraping using Beautiful Soup - Python

This notebook follows a tutorial from [Real Python](https://realpython.com/beautiful-soup-web-scraper-python/#decipher-the-information-in-urls).

The scraped website is [PLAYFINDER](https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page=1&refine=true).

In [1]:
import requests
from bs4 import BeautifulSoup
import csv

import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.options.display.max_colwidth = None
pd.set_option("display.float_format", lambda x: '%.2f' % x)

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Understanding the Information in URLs

You can deconstruct a URL into two main parts:

1. The base URL, which represents the path to the search functionality of the website.
2. The specific site location, at the end of the URL, which is the path to the specific resource.

Example 1: `https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html`

In the example above, the base URL is `https://realpython.github.io/fake-jobs/`

The specific site location is `jobs/senior-python-developer-0.html`
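
The split in Example 1 can be reproduced with Python's standard `urllib.parse` module. This is only a minimal sketch: the variable names `base_url`, `site_location`, and `full_url` are chosen here for illustration and do not come from the tutorial.

```python
from urllib.parse import urljoin, urlsplit

base_url = "https://realpython.github.io/fake-jobs/"
site_location = "jobs/senior-python-developer-0.html"

# Joining the base URL and the site location rebuilds the full address.
full_url = urljoin(base_url, site_location)
print(full_url)

# urlsplit breaks the address into its scheme, host, and path.
parts = urlsplit(full_url)
print(parts.scheme, parts.netloc, parts.path)
```

Note that `urljoin` appends the relative path only because the base URL ends with a slash; without it, the last path segment of the base would be replaced.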


Example 2: `https://au.indeed.com/jobs?q=software+developer&l=Australia`

Some websites use query parameters to encode the values that you submit when performing a search. You can think of them as query strings sent to the database to retrieve specific records. The parameters appear at the end of the URL.

For example, if you go to Indeed and search for “software developer” in “Australia”, you’ll see that the URL changes to include these values as query parameters:

`https://au.indeed.com/jobs?q=software+developer&l=Australia`

The query parameters in this URL are `?q=software+developer&l=Australia`. Query parameters consist of three parts:

**1. Start:** The beginning of the query parameters is denoted by a question mark (`?`).

**2. Information:** The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (`key=value`).

**3. Separator:** Every URL can have multiple query parameters, separated by an ampersand symbol (`&`).

Equipped with this information, you can pick apart the URL's query parameters into two key-value pairs:

- `q=software+developer` selects the type of job.
- `l=Australia` selects the location of the job.
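
The three parts described above can be checked with the standard library. The following is a small sketch that parses the Indeed example URL with `urllib.parse`; the variable names `query_string` and `params` are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://au.indeed.com/jobs?q=software+developer&l=Australia"

# Everything after the "?" is the query string.
query_string = urlsplit(url).query

# parse_qs splits on "&" and "=", and decodes "+" as a space.
# Each key maps to a list because a key may appear more than once.
params = parse_qs(query_string)
print(params)
```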


## Inspect the Site Using Developer Tools

In Chrome on macOS, you can open up the developer tools through the menu by selecting View → Developer → Developer Tools. On Windows and Linux, you can access them by clicking the top-right menu button (⋮) and selecting More Tools → Developer Tools.

## Scrape HTML Content From a Page

Inspecting the site shows that the search results span four pages, so the first step is to build the URL for each of them.

In [2]:
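# Page numbers found by inspecting the site's pagination.
page_numbers = [1, 2, 3, 4]

url_list = []
for page in page_numbers:
    # Only the `page` query parameter changes between result pages.
    page_url = f"https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page={page}&refine=true"
    url_list.append(page_url)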


In [3]:
url_list

['https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page=1&refine=true',
 'https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page=2&refine=true',
 'https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page=3&refine=true',
 'https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page=4&refine=true']
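
As an alternative to the long f-string, the same four URLs can be assembled from a dictionary of query parameters. This is a sketch rather than the notebook's method: `SEARCH_URL`, `build_page_url`, and `page_urls` are names introduced here, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

# Path to the search results, without any query parameters.
SEARCH_URL = "https://www.playfinder.com/uk/results/cricket/london"


def build_page_url(page: int) -> str:
    """Return the search URL for one results page."""
    params = {
        "latitude": "51.5522353",
        "longitude": "-0.0616466",
        "radius": "5mi",
        "page": page,
        "refine": "true",
    }
    # urlencode joins the pairs with "&" and escapes values where needed.
    return f"{SEARCH_URL}?{urlencode(params)}"


page_urls = [build_page_url(page) for page in range(1, 5)]
```

Passing a dictionary keeps each parameter visible on its own line and leaves the escaping to `urlencode`.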

In [4]:
# html_text = requests.get(url).text
# soup = BeautifulSoup(html_text, "html.parser")

In [5]:
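all_venues = []

for url in url_list:
    response = requests.get(url)
    # Fail on an HTTP error instead of silently parsing an error page.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Each venue sits in a <div> carrying this class attribute.
    venues = soup.find_all('div', {'class': "col-12 col-lg-5 col-xl-6"})

    all_venues.extend(venues)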

In [6]:
len(all_venues)

74

In [7]:

my_dict = {"name": [], "address": [], "format": [], "surface": []}


for venue in all_venues:
    name = venue.find("h2").text.strip()

    address = venue.find("p", class_="mb-1 body3").text.strip()

    # The first matching paragraph holds both the format and the surface,
    # separated by a whitespace-only line.
    facilities = [f.text.strip() for f in venue.find_all("p", class_="mb-3 body3 d-none d-md-block")]
    format_surface = facilities[0].split('\n                                \n')

    format_ = format_surface[0].split(':')[-1]
    surface_ = format_surface[1].split('\n')[-1]

    my_dict["name"].append(name)
    my_dict["address"].append(address)
    my_dict["format"].append(format_)
    my_dict["surface"].append(surface_)

In [8]:
my_dict.keys()

dict_keys(['name', 'address', 'format', 'surface'])

In [9]:
# my_dict.values()

In [10]:
data = pd.DataFrame(my_dict)

In [11]:
data.head(10)

| | name | address | format | surface |
|---|---|---|---|---|
| 0 | Mossbourne Community Academy | 100 Downs Park Road, Hackney Downs, London, E5 8JY | nets | sports hall |
| 1 | Mossbourne Victoria Park Academy | Victoria Park Road, Hackney, London, E9 7HD | nets | sports hall |
| 2 | Low Hall Sports Ground | South Access Road, Walthamstow, Waltham Forest, E17 8AX | pitch | grass |
| 3 | Hackney Downs Park | Downs Park Road, Shacklewell, Hackney, E5 8NP | full size | artificial |
| 4 | Millfields Park | Millfields Park, Millfields Road, Hackney, Hackney, E5 0AR | full size | grass |
| 5 | London Fields | Richmond Road, Hackney, Hackney, E8 3QN | full size | grass |
| 6 | Leyton Sports Ground | Crawley Road, Leyton, London, E10 6PY | nets | indoor |
| 7 | Leyton Sports Ground | Crawley Road, Leyton, London, E10 6PY | nets | artificial |
| 8 | Springfield Park | Spring Hill, Stamford Hill, Waltham Forest, E5 9EF | full size | grass |
| 9 | SPACe | 31, Falkirk Street, London, N1 6HF | nets | grass |


In [12]:
# url_list = []

# page_checking = True
# page_count = 1


# while page_checking:
#     url = f"https://www.playfinder.com/uk/results/cricket/london?latitude=51.5522353&longitude=-0.0616466&radius=5mi&page={page_count}&refine=true"
#     print(url)
#     html_text = requests.get(url).text
#     soup = BeautifulSoup(html_text, "html.parser")
    
#     last_page = soup.find("div", {'class': "col-9"})#{'class': "row mlp-main mb-3"})   
    
#     print(last_page)
                        
#     if last_page is not None:
#         print("Done loading pages!")
#         page_checking = False
        
#     else: 
#         url_list.append(url)
#         page_count = page_count + 1
#         print("*"*50, page_count)        
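
The commented-out cell above experiments with detecting the last page automatically instead of hardcoding four pages. A minimal sketch of that idea follows, using a different stop condition from the cell: it assumes (this has not been verified against Playfinder) that requesting a page past the last one returns no venue elements. `collect_all_pages`, `fetch_page`, and `fake_site` are hypothetical names; in this notebook, `fetch_page` would wrap the `requests.get` and `soup.find_all` calls from the scraping cell.

```python
from typing import Callable, List


def collect_all_pages(fetch_page: Callable[[int], List[str]], max_pages: int = 50) -> List[str]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    items: List[str] = []
    # max_pages is a safety cap so the loop cannot run forever.
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        if not page_items:
            break
        items.extend(page_items)
    return items


# Stand-in for real requests: two pages of results, then nothing.
fake_site = {1: ["venue A", "venue B"], 2: ["venue C"]}
print(collect_all_pages(lambda page: fake_site.get(page, [])))
```

Taking the fetch step as a function argument keeps the loop testable without network access.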