## Scraping the EPL Stats Website (Deepnote Edition)
> Dynamic website scraper built in Deepnote (designed to run anywhere, e.g. locally). See the project on Deepnote [here](https://deepnote.com/project/19f51d7b-ae79-4c51-906c-dee0138da144).

![EPLSiteScreen](https://sportsdatasolutionsacademy.s3.eu-west-2.amazonaws.com/public/EPLsitescreen.png)

The **[Official EPL Stats Website](https://www.premierleague.com/stats/)** is a great source of quality football data, specifically for the English Premier League. The **[Player Goals Table](https://www.premierleague.com/stats/top/players/goals?se=363)** fetches its data from the ***Pulse Live Football API*** (footballapi.pulselive.com). To see the ```goals?``` request, open your browser's developer tools, go to the Network tab, refresh the page and filter by XHR. However, if you try to request that data yourself through the browser, e.g. by copying and pasting the query to the Pulse Live API, **your request will be denied!** 🙅‍♂️

So, we need to scrape the **HTML**. However, if we simply ***request*** the page's HTML (via an HTTP GET request), we get the initial page load, which defaults to the ***all-time goal scorers list*** (see the ```requests``` example below). We'd also have a problem scraping all the goal scorers because the table is **paginated**. So we know the data in the table is **dynamically loaded**, and a **simple HTTP request** of the page is **not enough**.

**All this means it's time to break out the headless browser, automate with Watir, parse tables with pandas, sit back...and enjoy** 👻🤖🐼💅

#### Scraping Dynamic Pages w/ Requests ❎
> See [Official EPL Goal Stats for Current Season](https://www.premierleague.com/stats/top/players/goals?se=363)

In [4]:
import requests
import pandas as pd

response = requests.get('https://www.premierleague.com/stats/top/players/goals?se=363') # se=363 should load current season data

df = pd.read_html(response.text)
df[0] # Loads the all-time list, not the current season data (/goals?se=363). Also, the table is paginated, so how would we get to the next page of data?

| | Rank | Player | Club | Nationality | Stat | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | 1 | Alan Shearer | - | England | 260 | |
| 1 | 2 | Wayne Rooney | - | England | 208 | |
| 2 | 3 | Andrew Cole | - | England | 187 | |
| 3 | 4 | Sergio Agüero | Manchester City | Argentina | 180 | |
| 4 | 5 | Frank Lampard | - | England | 177 | |
| 5 | 6 | Thierry Henry | - | France | 175 | |
| 6 | 7 | Robbie Fowler | - | England | 163 | |
| 7 | 8 | Jermain Defoe | - | England | 162 | |
| 8 | 9 | Michael Owen | - | England | 150 | |
| 9 | 10 | Les Ferdinand | - | England | 149 | |


#### Scraping Dynamic Pages w/ Watir (Nerodia) & Chromium (Chrome) ✅
> [Watir (Nerodia)](https://github.com/watir/nerodia),
> [ChromeDriver](https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver) (See [```init.ipynb```](https://deepnote.com/project/19f51d7b-ae79-4c51-906c-dee0138da144#%2Finit.ipynb) in Environment Tab)

In [5]:
from nerodia.browser import Browser
import pandas as pd
import time

browser = Browser('chrome', headless=True) # headless=True runs Chrome without opening a visible window 👻
browser.goto('https://www.premierleague.com/stats/top/players/goals?se=363') # Now use the browser to navigate to the EPL Stats Page

time.sleep(4) # Allow data time to load into HTML

goals_df = pd.read_html(browser.html)[0] # Use Pandas to fetch all the tables within the browser html, select the first table it finds ([0])

# Note: On the EPL site, when you've reached the end of the table, the table's Page Next element has 'inactive' added to its class. Use browser tools to inspect the Page Next HTML element on the last page of the goals table to see for yourself.
# Note: Knowing this, we can keep clicking the Page Next button and scraping the table until the element is 'inactive'. In Python we can use `while not`:
while not browser.div(class_name=['paginationBtn', 'paginationNextContainer', 'inactive']).exists:
  print('Next Page')
  browser.div(class_name=['paginationBtn', 'paginationNextContainer']).fire_event('onClick') # Fire the onClick event on the Page Next element. If it were a button element (not a div), we could simply use .click()
  goals_df = pd.concat([goals_df, pd.read_html(browser.html)[0]]) # Append this page's table to the existing goals dataframe (DataFrame.append was removed in pandas 2.0)

browser.close() # Close Browser

goals_df = goals_df[goals_df['Stat'] > 0] # Drop the players at the end of the table who have 0 goals

goals_df = goals_df.dropna(axis=1, how='all') # Drop the stray unnamed column (how='all' removes only columns where every value is NaN)

goals_df.to_csv(r'data/epl_goals_20_21.csv', index=False) # Save dataframe to new csv file

goals_df

Next Page
Next Page
Next Page
Next Page


| | Rank | Player | Club | Nationality | Stat |
|---|---|---|---|---|---|
| 0 | 1.0 | Dominic Calvert-Lewin | Everton | England | 6 |
| 1 | 1.0 | Son Heung-Min | Tottenham Hotspur | South Korea | 6 |
| 2 | 3.0 | Mohamed Salah | Liverpool | Egypt | 5 |
| 3 | 3.0 | Jamie Vardy | Leicester City | England | 5 |
| 4 | 5.0 | Neal Maupay | Brighton and Hove Albion | France | 4 |
| ... | ... | ... | ... | ... | ... |
| 15 | 26.0 | Allan Saint-Maximin | Newcastle United | France | 1 |
| 16 | 26.0 | Romain Saïss | Wolverhampton Wanderers | Morocco | 1 |
| 17 | 26.0 | Bukayo Saka | Arsenal | England | 1 |
| 18 | 26.0 | Raheem Sterling | Manchester City | England | 1 |
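
The cell above waits a fixed 4 seconds for the first page and then reads the table straight after each Page Next click. A possible refinement is to poll until the table contents actually change. The sketch below shows that pattern in plain Python, with a fake paginator standing in for the browser so it runs without Chrome. The names `wait_until`, `scrape_all_pages` and `FakePaginator` are hypothetical helpers made up for this illustration; they are not part of Nerodia or pandas.

```python
import time
from typing import Callable, List


def wait_until(predicate: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline


def scrape_all_pages(read_page: Callable[[], list],
                     has_next: Callable[[], bool],
                     go_next: Callable[[], None],
                     timeout: float = 10.0,
                     interval: float = 0.25) -> List[list]:
    """Collect every page of a paginated table.

    read_page() returns the rows currently displayed,
    has_next() is False once the Next control is inactive,
    go_next() triggers the Next control.
    """
    pages = [read_page()]
    while has_next():
        previous = pages[-1]
        go_next()
        # Wait for the table to change instead of sleeping a fixed time.
        if not wait_until(lambda: read_page() != previous, timeout, interval):
            raise TimeoutError("table did not change after clicking Next")
        pages.append(read_page())
    return pages


class FakePaginator:
    """Stand-in for the browser: serves made-up pages one at a time."""

    def __init__(self, pages: List[list]):
        self._pages = pages
        self._index = 0

    def read_page(self) -> list:
        return self._pages[self._index]

    def has_next(self) -> bool:
        return self._index < len(self._pages) - 1

    def go_next(self) -> None:
        self._index += 1


fake = FakePaginator([["a", "b"], ["c", "d"], ["e"]])
all_pages = scrape_all_pages(fake.read_page, fake.has_next, fake.go_next,
                             timeout=1.0, interval=0.01)
print(all_pages)  # [['a', 'b'], ['c', 'd'], ['e']]
```

With the real browser, `read_page` would need to return something that compares with `!=` as a single boolean, for example the table converted to a list of rows, because comparing two DataFrames with `!=` is element-wise. This check also assumes that two consecutive pages never show identical rows.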


In [7]:
# Run the script version from a notebook (scraping 19/20 season)
!python epl_web_scraper.py

Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
Next Page
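
The contents of `epl_web_scraper.py` are not shown here. One way a script like it could switch seasons is by changing the `se` query parameter, assuming (as the URLs above suggest) that `se` selects the season. The helper names below are hypothetical, and `363` is the only season ID given in this notebook; the ID for 19/20 is not stated here, so it is not guessed.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the goals table, taken from the URLs used above.
BASE_URL = "https://www.premierleague.com/stats/top/players/goals"


def goals_url(season_id: int) -> str:
    """Build the goals-table URL for a season ID (assumes `se` selects the season)."""
    return f"{BASE_URL}?{urlencode({'se': season_id})}"


def season_from_url(url: str) -> Optional[int]:
    """Read the season ID back out of a goals-table URL, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("se")
    return int(values[0]) if values else None


url = goals_url(363)
print(url)                   # https://www.premierleague.com/stats/top/players/goals?se=363
print(season_from_url(url))  # 363
```

Keeping the season ID as a parameter means the same scraping loop can be reused for another season by passing a different ID to `browser.goto(goals_url(...))`.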
