# WhoScored Match-Centre Scraping - Part 01

This is part one of my write-up on how I scraped shot data from the WhoScored website. The first step was to manually copy every match URL from the 'Fixtures' section for each chosen Premier League season.

I did this by opening the fixtures for each month of the season, using this [Chrome extension](https://chromewebstore.google.com/detail/copy-all-urls-free/pnbocjclllbkfkkchadljokjclnpakia?hl=en) to copy every link on the page, and pasting the links into an Excel worksheet. Since fixtures take place from August to May of each season, I had to repeat this copy-and-paste process around 10 times for each Premier League season.

After stacking all these links together, I used the code below to pull out the match URLs I wanted. As an example, I will attach the Excel worksheet with the web links I copied from WhoScored for the 2013/14 Premier League season.

In [2]:
import os

import pandas as pd

### Please import the Excel file named 'PL 2013-14 -- WhoScored Web Links' under the name "raw_df"

# File name of the season's worksheet (used for both the import and the export)
season = '2013-14.xlsx'

# Folder holding the raw links copied from WhoScored
import_folder_path = 'C:/Users/tharu/OneDrive/Desktop/Big-PL-Project/PL-Project/WhoScored Match URLs/Raws/Premier League'
import_file_path = f"{import_folder_path}/{season}"

# Folder the filtered match URLs are written to
export_folder_path = 'C:/Users/tharu/OneDrive/Desktop/Big-PL-Project/PL-Project/WhoScored Match URLs/Final/Premier League'

raw_df = pd.read_excel(import_file_path)
raw_df

Unnamed: 0,Links
0,https://1xbet.whoscored.com/regions/252/tourna...
1,https://1xbet.whoscored.com/ - https://1xbet.w...
2,https://www.launchpass.com/whoscored/first-mon...
3,https://1xbet.whoscored.com/accounts - My account
4,https://1xbet.whoscored.com/accounts/logout - ...
...,...
1845,https://twitter.com/WhoScored - https://twitte...
1846,https://www.instagram.com/officialwhoscored/ -...
1847,https://www.youtube.com/whoscored - https://ww...
1848,https://geo.itunes.apple.com/us/app/whoscored-...


### Note
Each match link in the worksheet has the format "{match URL} - {home team goals}{away team goals}", so the text after the " - " separator is a two-digit scoreline. I used this to pick the match URLs out of the full list of web links, and then saved them to an Excel file.
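To make the pattern concrete, here is a minimal sketch on a handful of made-up entries. The `example.com` URLs and the names `toy_df`, `parts` and `toy_matches` are hypothetical and not taken from the worksheet. It also swaps in a single regular-expression check (`str.fullmatch`) in place of the separate numeric and length checks used in the notebook.

```python
import pandas as pd

# Hypothetical entries in the same "{URL} - {link text}" shape as the worksheet
toy_df = pd.DataFrame({
    "Links": [
        "https://example.com/matches/1/live - 21",    # match link, score 2-1
        "https://example.com/accounts - My account",  # navigation link
        "https://example.com/matches/2/live - 00",    # match link, score 0-0
        "https://example.com/about",                  # no ' - ' separator at all
    ]
})

# Split once on ' - ' into the URL and whatever text follows it
parts = toy_df["Links"].str.split(" - ", n=1, expand=True)
parts.columns = ["Links", "Score"]

# Treat a row as a match link when the trailing text is exactly two digits;
# na=False makes rows with no trailing text count as non-matches
is_match = parts["Score"].str.fullmatch(r"\d{2}", na=False)
toy_matches = parts[is_match].reset_index(drop=True)
print(toy_matches)
```

Only the two `/matches/` rows survive the filter. One limitation of the two-digit rule is that a scoreline in which either team reaches double figures would be three characters long and would be filtered out.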

In [5]:
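# Split each entry into the URL and the text after the first ' - '
raw_df[['Links', 'Flag']] = raw_df['Links'].str.split(' - ', n=1, expand=True)

# Keep rows whose trailing text is purely numeric; .eq(True) also treats
# entries with no ' - ' separator (missing Flag) as non-matches
final_df = raw_df.loc[raw_df['Flag'].str.isnumeric().eq(True)].copy()
final_df = final_df.rename(columns={'Flag': 'Score'})

# A scoreline is two characters long: one digit per team
final_df = final_df[final_df['Score'].astype(str).str.len() == 2]
# Drop any links that point to the comments panel
final_df = final_df[~final_df['Links'].str.contains('comments-panel', case=False, na=False)]

### A first litmus test for this dataframe of WhoScored's match URLs was ensuring that it had 380 rows,
### since that is how many games are played in a single PL season
final_df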

Unnamed: 0,Links,Score
19,https://1xbet.whoscored.com/matches/719852/liv...,10
23,https://1xbet.whoscored.com/matches/719846/liv...,13
27,https://1xbet.whoscored.com/matches/719871/liv...,01
31,https://1xbet.whoscored.com/matches/719873/liv...,20
35,https://1xbet.whoscored.com/matches/719858/liv...,22
...,...,...
1816,https://1xbet.whoscored.com/matches/720887/liv...,20
1820,https://1xbet.whoscored.com/matches/720889/liv...,30
1824,https://1xbet.whoscored.com/matches/720890/liv...,02
1828,https://1xbet.whoscored.com/matches/720892/liv...,12
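The 380-row litmus test can also be written as an explicit check rather than read off the output. The sketch below assumes a 20-team league in which every team hosts every other team once, which gives 20 × 19 = 380 matches. The helper names `expected_matches` and `check_match_count` are made up for illustration.

```python
def expected_matches(n_teams: int = 20) -> int:
    """Matches in a double round-robin: each team hosts every other team once."""
    return n_teams * (n_teams - 1)


def check_match_count(n_rows: int, n_teams: int = 20) -> None:
    """Raise an error if the number of match URLs is not what a full season needs."""
    expected = expected_matches(n_teams)
    if n_rows != expected:
        raise ValueError(f"expected {expected} match URLs, found {n_rows}")


print(expected_matches())  # 20 * 19 = 380
check_match_count(380)     # passes silently for a complete season
```

In the notebook this would be called as `check_match_count(len(final_df))`, so a missing or extra link stops the run before anything is exported.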


In [None]:
# Checking for duplicates!
duplicates = final_df[final_df['Links'].duplicated()]
print(len(duplicates))
print(duplicates)
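By default `duplicated()` flags only the second and later copies of a link. When inspecting duplicates it can help to see every copy side by side, which `keep=False` does. The toy data below is hypothetical and only illustrates the difference, along with a de-duplication that compares on the `Links` column alone.

```python
import pandas as pd

# Hypothetical data: the first and third rows share the same link
toy = pd.DataFrame({
    "Links": [
        "https://example.com/matches/1",
        "https://example.com/matches/2",
        "https://example.com/matches/1",
    ],
    "Score": ["10", "22", "10"],
})

# keep=False marks every copy of a repeated link, not just the later ones
all_copies = toy[toy["Links"].duplicated(keep=False)]

# Keep the first occurrence of each link and drop the rest
deduped = toy.drop_duplicates(subset="Links", keep="first")

print(len(all_copies))
print(len(deduped))
```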

In [None]:
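# Uncomment to drop repeated match URLs (compares on 'Links' only, matching the check above)
#final_df = final_df.drop_duplicates(subset='Links')
#final_df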

In [None]:
### Exporting the final dataframe to the folder of choice
final_df.to_excel(os.path.join(export_folder_path, season), index=False)