# **Covid-19 & the English Premier League**

The coronavirus pandemic hit the sports industry as hard as it hit the rest of the world. The English Premier League (EPL), one of the biggest leagues in the world, was no exception. Covid-19 not only caused huge financial losses but also affected the momentum of games.

In this analysis, we take a closer look at the impact Covid-19 had on matches. Specifically, we look at how the absence of fans affected games during the 2019/20 season, when Covid-19 first hit, and what changed when only home fans were allowed to return late in the 2020/21 season.
To explore these questions, we compare game attributes such as winning & scoring percentages, corners & shots, total points, etc. over the following four seasons:

* **2018/19 Season**
* **2019/20 Season**
* **2020/21 Season**
* **2021/22 Season**

These four seasons were chosen because two of them (2019/20 & 2020/21) were affected by Covid-19, while the 2018/19 and 2021/22 seasons serve as comparisons (i.e., before and after Covid-19).

Other points to note for this analysis are the dates of the Premier League's suspension and return. EPL games were suspended on ***March 13, 2020*** and resumed on ***June 17, 2020***, with no fans in attendance for the rest of the season. Home fans were allowed to return at all grounds beginning ***May 17, 2021***, but away fans could not attend because of travel restrictions. The 2021/22 season began with all fans in attendance.
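
The dates above split the four seasons into attendance phases. A minimal sketch of that mapping, using only the dates stated in this notebook (the function name `attendance_phase` and the phase labels are illustrative, not part of the original analysis; May 23, 2021 is the end of the 2020/21 season used by the season classifier later on):

```python
from datetime import date

SUSPENDED_FROM = date(2020, 3, 13)      # league suspended
RESUMED_ON = date(2020, 6, 17)          # restart behind closed doors
HOME_FANS_FROM = date(2021, 5, 17)      # home fans allowed back
SEASON_2020_21_END = date(2021, 5, 23)  # last match day of 2020/21

def attendance_phase(match_date: date) -> str:
    """Label a match date with the attendance rules in force (illustrative labels)."""
    if match_date < SUSPENDED_FROM:
        return "all_fans"
    if match_date < RESUMED_ON:
        return "suspended"          # no matches expected in this window
    if match_date < HOME_FANS_FROM:
        return "no_fans"
    if match_date <= SEASON_2020_21_END:
        return "home_fans_only"
    return "all_fans"               # 2021/22 season
```

Once the `Date` column has been converted to datetime, this could be applied as `org_df['Date'].dt.date.map(attendance_phase)`.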

### **Research questions:**

  1. How did fan attendance affect the outcome of games before & after Covid-19?
  2. How did the absence of fans affect the outcome of games during Covid-19?
  3. How did the return of home fans during the 2020/21 season affect the outcome of games? 

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from datetime import date, datetime as dt

pd.set_option('display.max_columns', None) # displaying maximum no. of columns

### **Reading the dataset files**

In [None]:
# EPL games from 2019/2020 season
covDf1 = pd.read_csv('/kaggle/input/english-premier-league-datasets-20182021/2019-2020.csv')

# EPL games from 2020/2021 season
covDf2 = pd.read_csv('/kaggle/input/english-premier-league-datasets-20182021/2020-2021.csv')

# EPL games from 2018/2019 season
preCovdf = pd.read_csv('/kaggle/input/english-premier-league-datasets-20182021/2018-2019.csv')

# EPL games from 2021/2022 season
postCovdf = pd.read_csv('/kaggle/input/english-premier-league-datasets-20182021/2021-2022.csv')

### **Cleaning & organizing the datasets**

In [None]:
# combining the four datasets 
org_df = pd.concat([preCovdf,covDf1, covDf2, postCovdf], ignore_index=True) 

# keeping only the first 23 columns
org_df = org_df.iloc[:, 0:23]

# converting the Date column to a datetime data type
org_df['Date'] = pd.to_datetime(org_df['Date'], dayfirst=True)

# removing columns that are not needed
org_df.drop(['Div', 'Referee'], axis=1, inplace=True)

# viewing the first few rows of the dataset
org_df

In [None]:
# describing the dataset
org_df.describe()

In [None]:
# checking the data types & for null values 
org_df.info()

In [None]:
# getting the number of games played per season
totNo_games = org_df.shape[0]/4
totNo_games
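
If each file holds a complete season, the value above should come out to 380: the Premier League has 20 clubs, and every club hosts each of the other 19 once. A minimal sketch of that arithmetic (the helper name `matches_per_season` is illustrative, not part of the notebook):

```python
def matches_per_season(n_teams: int) -> int:
    """Number of matches in a double round-robin league."""
    # every team hosts each of the other teams exactly once
    return n_teams * (n_teams - 1)

expected_games = matches_per_season(20)
print(expected_games)  # 380
```

Comparing this number with `totNo_games` is a quick check that no rows were lost or duplicated when the four files were combined.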

In [None]:
# function that classifies the dataset by game seasons
def PL_seasons(dates):
    # 2018/19 season ended on May 12, 2019
    if dates.Date <= dt(2019, 5, 12):
        return 'PreCov'
    # 2019/20 season ran from Aug 9, 2019 to Jul 26, 2020
    if dt(2019, 8, 9) <= dates.Date <= dt(2020, 7, 26):
        return 'Cov1'
    # 2020/21 season ended on May 23, 2021
    if dt(2020, 7, 26) < dates.Date <= dt(2021, 5, 23):
        return 'Cov2'
    return 'PostCov'

    
# using PL_seasons function to create a column in the dataframe
org_df['Season'] = org_df.apply(PL_seasons, axis=1)

org_df
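
The same labelling can be written as a lookup against the season cut-off dates. This is a sketch of an alternative, not the method used in this notebook; `season_label`, `SEASON_ENDS` and `SEASON_LABELS` are illustrative names. It differs from `PL_seasons` only for dates between May 13 and August 8, 2019, an off-season window in which no league matches are expected.

```python
from bisect import bisect_left
from datetime import date

# last match day of each of the first three seasons (same cut-offs as PL_seasons)
SEASON_ENDS = [date(2019, 5, 12), date(2020, 7, 26), date(2021, 5, 23)]
SEASON_LABELS = ["PreCov", "Cov1", "Cov2", "PostCov"]

def season_label(match_date: date) -> str:
    """Return the season label for a match date."""
    # bisect_left counts how many cut-offs lie strictly before match_date
    return SEASON_LABELS[bisect_left(SEASON_ENDS, match_date)]
```

Keeping the cut-offs in one list means a fifth season could be added by appending one date and one label.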

In [None]:
# repositioning Season column (insert works in place and returns None)
col = org_df.pop("Season")
org_df.insert(1, "Season", col)
org_df

### **Calculating game stats to analyze differences across the four seasons**

In [None]:
# Analyzing defense play over the four seasons

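# calculating home & away "saves": shots on target that did not end in a goal,
# i.e. the home (away) team's on-target shots that the opposing side kept out
Hsaves = org_df['HST'] - org_df['FTHG']
Asaves = org_df['AST'] - org_df['FTAG']

# calculating saves percentage for home & away teams
# (undefined when a team had no shots on target)
org_df['HT_saves%'] = Hsaves / org_df['HST']
org_df['AT_saves%'] = Asaves / org_df['AST']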
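
Two properties of these ratios are worth keeping in mind: a match with no shots on target gives 0/0, which pandas stores as NaN, and the saved share is the complement of the scoring share computed from the same columns. A small worked sketch in plain Python (the helper `shot_outcome_shares` is illustrative, not part of the notebook):

```python
import math

def shot_outcome_shares(shots_on_target: int, goals: int):
    """Return (saved_share, scored_share) of a team's shots on target."""
    if shots_on_target == 0:
        # no shots on target: both shares are undefined
        return math.nan, math.nan
    scored = goals / shots_on_target
    return 1 - scored, scored

# 5 shots on target, 2 goals: 60% kept out, 40% scored
saved, scored = shot_outcome_shares(5, 2)
```

Because the two shares always sum to one, plotting either of them carries the same information.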

In [None]:
# Analyzing offense play over the four seasons

# calculating scoring percentage for home & away teams 
org_df['HT_scoring%'] = org_df['FTHG'] / org_df['HST']
org_df['AT_scoring%'] = org_df['FTAG'] / org_df['AST']

# flagging home & away wins
Hwins = org_df['FTR'] == 'H'
Awins = org_df['FTR'] == 'A'

# calculating winning percentages for home & away teams within each season
org_df['HT_winning%'] = Hwins.groupby(org_df['Season']).transform('mean')
org_df['AT_winning%'] = Awins.groupby(org_df['Season']).transform('mean')

# calculating goal difference for home & away teams
org_df['HT_FTGdiff'] = org_df['FTHG'] - org_df['FTAG']
org_df['AT_FTGdiff'] = org_df['FTAG'] - org_df['FTHG']

# calculating total shots ratio
org_df['TSR'] = org_df['HS'] / (org_df['HS'] + org_df['AS'])
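
A winning percentage is the share of matches with a given result among the matches being compared, so for a season-by-season comparison both the count and the denominator should come from the same season. A minimal sketch in plain Python, with a made-up list of results (the helper `result_shares` is illustrative; the codes `H`, `D`, `A` are the values of the `FTR` column):

```python
from collections import Counter

def result_shares(results):
    """Share of home wins, draws and away wins in a list of FTR codes."""
    counts = Counter(results)
    total = len(results)
    return {code: counts[code] / total for code in ("H", "D", "A")}

# made-up results for four matches, purely for illustration
shares = result_shares(["H", "A", "D", "H"])
print(shares)  # {'H': 0.5, 'D': 0.25, 'A': 0.25}
```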

In [None]:
# calculating total points for home & away teams
org_df['HT_Pts'] = np.where(org_df['FTHG'] > org_df['FTAG'], 3, 
                             (np.where(org_df['FTHG'] < org_df['FTAG'], 0, 1)))
org_df['AT_Pts'] = np.where(org_df['FTAG'] > org_df['FTHG'], 3, 
                             (np.where(org_df['FTAG'] < org_df['FTHG'], 0, 1)))

# Half-time goal diff for home & away teams
org_df['HT_HTGdiff'] = org_df['HTHG'] - org_df['HTAG']
org_df['AT_HTGdiff'] = org_df['HTAG'] - org_df['HTHG']

# getting the number of home & away teams that made comebacks
#   {a comeback here is defined as a match in which a team trailed by
#    three or more goals at half-time and still went on to win}

# each comparison is wrapped in parentheses because & binds tighter than < and >
org_df['HT_Comebacks'] = np.where((org_df['HT_HTGdiff'] < -2) & (org_df['FTHG'] > org_df['FTAG']), 1, 0)
org_df['AT_Comebacks'] = np.where((org_df['AT_HTGdiff'] < -2) & (org_df['FTAG'] > org_df['FTHG']), 1, 0)

# display the first few rows of the dataset
org_df.head()
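
The points and comeback rules in the cell above can be checked on single matches. A sketch in plain Python, assuming a comeback means trailing by at least three goals at half-time (the `< -2` threshold) and still winning; `match_points`, `is_comeback` and the `min_deficit` parameter are illustrative names, not part of the notebook:

```python
def match_points(goals_for: int, goals_against: int) -> int:
    """3 points for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    return 1 if goals_for == goals_against else 0

def is_comeback(ht_for, ht_against, ft_for, ft_against, min_deficit=3):
    """True if the team trailed by at least min_deficit goals at half-time and still won."""
    trailed = (ht_for - ht_against) <= -min_deficit
    won = ft_for > ft_against
    return trailed and won

# 0-3 down at half-time, 4-3 winners at full time
print(is_comeback(0, 3, 4, 3))  # True
```

In the element-wise pandas version, each comparison needs its own parentheses, because `&` binds more tightly than `<` and `>`. A three-goal half-time deficit is rare, so a lower `min_deficit` may give more cases to compare across seasons.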

In [None]:
# grouping the dataset by season and summing the numeric columns
group_df = org_df.groupby(['Season']).sum(numeric_only=True).reset_index()

group_df

In [None]:
# calculating per-game averages for each season

# calculating goals per game
totGoals = group_df['FTHG'] + group_df['FTAG']
Gpergame = totGoals / totNo_games

# average corners per game
totcorners = group_df['HC'] + group_df['AC']
Corpergame = totcorners / totNo_games

# average fouls per game 
totFouls = group_df['HF'] + group_df['AF']
Foulpergame = totFouls / totNo_games

# average yellow cards per game 
totYC = group_df['HY'] + group_df['AY']
YCpergame = totYC / totNo_games

# average red cards per game 
totRC = group_df['HR'] + group_df['AR']
RCpergame = totRC / totNo_games
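
Dividing season totals by `totNo_games` assumes every season has the same number of rows. A sketch of an alternative that averages per match within each season; the scores below are made up for illustration, and only the column names mirror the dataset:

```python
import pandas as pd

# toy rows with made-up scores; only the column names mirror the dataset
toy = pd.DataFrame({
    "Season": ["PreCov", "PreCov", "Cov1", "Cov1"],
    "FTHG": [2, 0, 1, 3],
    "FTAG": [1, 0, 1, 2],
})

# total goals in each match, then the mean per season
toy["TotalGoals"] = toy["FTHG"] + toy["FTAG"]
goals_per_game = toy.groupby("Season")["TotalGoals"].mean()
print(goals_per_game)
```

The same pattern would work for corners, fouls and cards by swapping in the corresponding home and away columns.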

In [None]:
seasons = pd.Series(['Cov1', 'Cov2', 'PostCov', 'PreCov'], name="Seasons")

df = pd.concat([seasons, Gpergame, Corpergame, Foulpergame, YCpergame, RCpergame],
               axis=1, keys= ['Seasons', 'Avg.Goals/game', 'Avg.Corners/game',
                              'Avg.Fouls/game', 'Avg.YC/game', 'Avg.RC/game'])
df

In [None]:
# finding which season had the highest & lowest value for each stat
stats = df.set_index('Seasons')

print("Season with the highest value for each stat:")
print(stats.idxmax())

print("Season with the lowest value for each stat:")
print(stats.idxmin())

In [None]:
# dropping redundant or unnecessary columns
org_df.drop(['HTHG', 'HTAG', 'HTR', 'HS', 'AS', 'HST', 'AST', 'FTHG', 'FTAG'], axis=1, inplace=True)

# converting the final dataframe into a CSV file to
# use it for creating visuals in Tableau
org_df.to_csv('EPL_final.csv', encoding='utf-8', index=False)

# **Tableau Visuals**

https://public.tableau.com/views/Data-480_research/Viz_1?:language=en-US&:display_count=n&:origin=viz_share_link

* **The absence of fans during Covid-19 could have played a role in decreasing the quality of several aspects of the game for home teams.**
    
    **Viz 1:** Total points decreased for home teams during the 2020/21 season (the second Covid-affected season)
    
    **Viz 2:** Total points for home teams increased when Covid-19 restrictions were lifted (2021/22 season)
    
    **Viz 3:** Winning percentage for home teams increased when home fans returned (May 2021, at the end of the 2020/21 season)
    
    **Viz 4:** Scoring percentages were lower for 2018/19 & 2019/20 seasons
    
    **Viz 5:** Winning percentage for home teams decreased by 7% from 2019/20 through 2020/21 season
    
    **Viz 6:** Total points for home teams were lower during Covid seasons
    
    **Viz 7:** Game attributes like corners, fouls, yellow cards & red cards were similar over the course of the four seasons
    
    **Viz 8:** Away teams had more comebacks than home teams in general
    

In conclusion, games differed only slightly across the four seasons. The main difference observed can largely be attributed to the absence of home advantage during the two Covid-19 seasons (2019/20 & 2020/21). We also observed an increase in home teams' winning percentage as soon as home fans returned at the end of the 2020/21 season. However, we cannot attribute this change entirely to the returning home fans, because the window was very short: home fans returned on May 17, 2021, about one week before the season ended.
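
One way to see why a short window limits the conclusion is a rough two-proportion z-test on home-win rates. The sketch below uses the normal approximation and placeholder counts that are NOT taken from the dataset; `two_proportion_z_test` is an illustrative helper, and `n_a=20` merely stands in for a window of about two rounds of fixtures.

```python
import math

def two_proportion_z_test(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for the difference between two win proportions."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    # standard error under the null hypothesis of equal proportions
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# placeholder counts, NOT taken from the dataset
z, p = two_proportion_z_test(wins_a=10, n_a=20, wins_b=133, n_b=360)
print(round(z, 2), round(p, 2))
```

With these placeholder counts, a gap of roughly 13 percentage points in home-win rate still gives a p-value well above 0.05, which illustrates how little a sample of about 20 matches can establish on its own.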