## Leeds League History
After Leeds' return to the Premier League in 2020, I wanted to plot their final league position for each season since the formation of the Premier League. Wikipedia lists the league position for each season [here](https://en.wikipedia.org/wiki/List_of_Leeds_United_F.C._seasons), so I decided to build a scraper that pulls this information and stores it as a CSV that I can easily read in for the visual.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re 

In [2]:
# Download the raw HTML of the seasons page
url = "https://en.wikipedia.org/wiki/List_of_Leeds_United_F.C._seasons"
html = requests.get(url).text
# Parse it with Python's built-in HTML parser
soup = BeautifulSoup(html, features="html.parser")

In [3]:
# Collect every table with the 'wikitable' class; the first one is used below
tables = soup.find_all(class_='wikitable')

In [4]:
# The first two <tr> rows are treated as the header, the rest as data rows
header_rows = tables[0].find_all('tr')[:2]
data_rows = tables[0].find_all('tr')[2:]

The table that we are trying to scrape isn't very regular: some rows have 'sub-rows' and double cells. I'm going to define some helper functions that pull the information we need out of each data row.

In [5]:
def find_season(data_row):
    """
    Given a data row, return the season for that row
    """
    return re.search(r'(\d\d\d\d.{1}\d\d).*', data_row.find('th').text).group(1)

def find_data_values(data_row):
    """
    Return the text of every data cell (<td>) in the row, with newlines removed
    """
    return [x.text.replace('\n', '') for x in data_row.find_all('td')]

def find_division(data_row):
    """
    For a given cleaned data row, take the first element, the division
    """
    return data_row[0]

def find_position(data_row):
    """
    For a given cleaned data row, take the ninth element, the final position for that season
    """
    return data_row[8]
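
To see what the season pattern in `find_season` actually captures, here is a minimal sketch that runs it against a few hand-written strings. The sample strings are made up for illustration and are not scraped from the page. The last one assumes the table writes the turn-of-the-century season as `1999–2000`, which would explain the `1999–20` label in the final data frame. `WIDER_PATTERN` is a hypothetical variant, not part of the original notebook.

```python
import re

# Same pattern as find_season: four digits, any one character, two digits
SEASON_PATTERN = re.compile(r'(\d\d\d\d.{1}\d\d).*')

# Hypothetical variant that allows two to four digits after the separator
WIDER_PATTERN = re.compile(r'(\d{4}.\d{2,4})')

# Hand-written header-cell strings (assumed shapes, not real page content)
samples = ["1992–93\n", "2000–01[m]\n", "1999–2000\n"]

extracted = [SEASON_PATTERN.search(s).group(1) for s in samples]
extracted_wider = [WIDER_PATTERN.search(s).group(1) for s in samples]

print(extracted)        # ['1992–93', '2000–01', '1999–20']
print(extracted_wider)  # ['1992–93', '2000–01', '1999–2000']
```

The original pattern always stops after two digits, so a four-digit end year is cut short. Whether the wider pattern is preferable depends on how the season labels are used later on.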

Due to the format of the table, some manual intervention is needed here: `data_rows[72]` is a sub-row of the season that starts at `data_rows[71]`. That season needs two cells in the 'Europe/Other' category because Leeds played in both the FA Charity Shield and the Champions League. For the time being, I will select the rows manually and skip the sub-row, as we don't need that information.
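
As an alternative to hard-coding the row indices, the sub-row could probably be detected by its shape. The sketch below is only an illustration under two assumptions: that every full season row starts with a `<th>` season cell, and that sub-rows have none. To keep it runnable without network access, it uses the standard library's `xml.etree.ElementTree` on a hand-written fragment rather than BeautifulSoup on the real page. The fragment mimics the two 1992–93 rows but is not the actual Wikipedia markup.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment mimicking the shape of the rows (not the real markup)
fragment = """
<table>
  <tr><th>1992–93</th><td>Prem</td><td>17th</td><td>FA Charity Shield</td><td>W</td></tr>
  <tr><td>Champions League</td><td>R2</td></tr>
  <tr><th>1993–94</th><td>Prem</td><td>5th</td><td></td><td></td></tr>
</table>
""".strip()

rows = ET.fromstring(fragment).findall('tr')

# Assumption: a full season row has a <th> cell, a sub-row does not
full_rows = [row for row in rows if row.find('th') is not None]

season_labels = [row.find('th').text for row in full_rows]
print(season_labels)  # ['1992–93', '1993–94']
```

BeautifulSoup's `find` also returns `None` when nothing matches, so the same `row.find('th') is not None` check should carry over to `data_rows`, provided the assumption about `<th>` cells holds for the real table.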

In [6]:
# Show the partial row for reference 
[x.text.replace('\n','') for x in data_rows[71:73]]

['1992–93Prem[l]4212151557625117thR4R3FA Charity ShieldWLee Chapman1727,585',
 'Champions LeagueR2']

In [7]:
# As discussed, I have removed the sub row for easier processing of our elements 
relevant_data_rows = [data_rows[71]] + data_rows[73:]

In [8]:
# Pull the seasons for the rows that we are interested in
seasons = [find_season(x) for x in relevant_data_rows]

In [9]:
# Use the helper function to pull the values, and isolate the ones we are interested in
data_values = [find_data_values(x) for x in relevant_data_rows]
divisions = [find_division(x) for x in data_values]
positions = [find_position(x) for x in data_values]
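
The scraped values still carry some decoration: the partial row shown earlier has a footnote marker in `Prem[l]` and an ordinal suffix in `17th`. For plotting, numeric positions are usually easier to work with. The following is a minimal sketch of one way to clean them. `clean_division` and `clean_position` are hypothetical helpers, not part of the original notebook.

```python
import re

def clean_division(raw):
    """Drop bracketed footnote markers such as '[l]' from a division label."""
    return re.sub(r'\[[^\]]*\]', '', raw).strip()

def clean_position(raw):
    """Return the leading integer of a position such as '17th', or None if absent."""
    match = re.match(r'\s*(\d+)', raw)
    return int(match.group(1)) if match else None

print(clean_division('Prem[l]'))  # Prem
print(clean_position('17th'))     # 17
print(clean_position('3rd'))      # 3
```

They could be applied with list comprehensions in the same way as the existing helpers, for example `[clean_position(x) for x in positions]`.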

In [10]:
# Store data as a dictionary for conversion
data_dict = {
    'season': seasons, 
    'division': divisions, 
    'position': positions
}

# Format as a data frame
pd.DataFrame(data_dict)

Unnamed: 0,season,division,position
0,1992–93,Prem[l],17th
1,1993–94,Prem,5th
2,1994–95,Prem,5th
3,1995–96,Prem,13th
4,1996–97,Prem,11th
5,1997–98,Prem,5th
6,1998–99,Prem,4th
7,1999–20,Prem,3rd
8,2000–01,Prem,4th
9,2001–02,Prem,5th
