<a href="https://colab.research.google.com/github/radhakrishnan-omotec/football-repo/blob/main/Part1_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Exploration
Before building our xG model, we need to decide what sort of data we are interested in. We need a large collection of shot data, but more importantly the data must describe the kinds of shots that result in goals. The most important factors are likely to be the distance from goal when the shot was taken, the angle with respect to the goal, and the part of the body the shot was taken with.


In [2]:
!git clone https://github.com/radhakrishnan-omotec/football-repo.git

Cloning into 'football-repo'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 24 (delta 7), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (24/24), 940.73 KiB | 10.69 MiB/s, done.
Resolving deltas: 100% (7/7), done.


**Data Extraction and Wrangling**

First we have to import all the files that contain our event data. We will also import our libraries in the cell below.

In [1]:
import pandas as pd
import numpy as np
import os
import json
import requests  # Added for downloading dataset
from urllib.parse import urlparse  # Added for parsing URL
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Arc

In [12]:
# Function to download dataset from URL and save it to the specified path
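def download_dataset(url, save_path):
    response = requests.get(url)
    # Stop early on a failed request instead of saving an error page as the zip file
    response.raise_for_status()
    with open(save_path, 'wb') as file:
        file.write(response.content)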

# Specify the URL and save path
dataset_url = "https://figshare.com/ndownloader/files/14464685"
save_path = "/content/football-repo/dataset/event_data/events.zip"


# Download the dataset
download_dataset(dataset_url, save_path)

# Add code to extract the contents if needed

# Specify the directory where the extracted files are stored
directory = '/content/football-repo/dataset/event_data'

# List every entry in the directory (this includes the zip archive and any subfolders, not only json files)
eventjsonfiles = []
for path in os.listdir(directory):
    eventjsonfiles.append(os.path.join(directory, path))

print(eventjsonfiles)

['/content/football-repo/dataset/event_data/extracted', '/content/football-repo/dataset/event_data/events.zip']
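
The listing above returns every entry in the folder, including the zip archive and the `extracted` subfolder. If only the JSON files are wanted, a filter along the following lines might help. This is a minimal sketch: `list_json_files` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def list_json_files(directory):
    """Return the sorted paths of the .json files directly inside `directory`.

    Hypothetical helper for illustration; subfolders and other file types are skipped.
    """
    return sorted(str(p) for p in Path(directory).glob('*.json') if p.is_file())
```

Calling `list_json_files(extracted_directory)` after the extraction step further down should then give the same kind of list as `jsonfiles`, without any stray non-JSON entries.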


In [13]:
import zipfile

# Specify the directory where the extracted files should be stored
extracted_directory = '/content/football-repo/dataset/event_data/extracted'

# Specify the path to the downloaded zip file
zip_file_path = '/content/football-repo/dataset/event_data/events.zip'

# Create the directory if it doesn't exist
os.makedirs(extracted_directory, exist_ok=True)

# Extract the contents of the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_directory)

# Create a list of the extracted json files
jsonfiles = []
for path in os.listdir(extracted_directory):
    jsonfiles.append(os.path.join(extracted_directory, path))

print("List of extracted JSON files:")
print(jsonfiles)

List of extracted JSON files:
['/content/football-repo/dataset/event_data/extracted/events_World_Cup.json', '/content/football-repo/dataset/event_data/extracted/events_England.json', '/content/football-repo/dataset/event_data/extracted/events_European_Championship.json', '/content/football-repo/dataset/event_data/extracted/events_Spain.json', '/content/football-repo/dataset/event_data/extracted/events_France.json', '/content/football-repo/dataset/event_data/extracted/events_Germany.json', '/content/football-repo/dataset/event_data/extracted/events_Italy.json']


Now we are going to parse the json files and extract all the relevant shot data, storing it in a tidy, separate dataframe.

**Most of this will be done using the pandas and numpy library.**

In [15]:
def shot_matrix(eventdata):
    with open(eventdata) as f:
        data = json.load(f)

    #let's create the dataframe that will store our data, with all the attributes we are interested in
    shots_dataset = pd.DataFrame(columns=['Goal','x','y','playerid','teamid','matchid','header'])

    #remember that the json files include passes, shots, tackles etc., so we need to filter through these
    #let's find all the occurrences of a shot within the set
    #refer to the link in the previous cell for info on the Wyscout event dataset, including tag names
    event_df = pd.DataFrame(data)
    all_shots = event_df[event_df['subEventName']=='Shot']

    #now we need to fill in our shots_dataset matrix by attribute columns
    #we will do this by filtering through the all-shot df (dataframe) we just made
    for index,shot in all_shots.iterrows():
        #here we fill in the columns for goals and headers with binary descriptors
        shots_dataset.at[index,'Goal']=0
        shots_dataset.at[index,'header']=0
        for tag in shot['tags']:
            if tag['id']==101:
                shots_dataset.at[index,'Goal']=1
            elif tag['id']==403:
                shots_dataset.at[index,'header']=1

        #now we are interested in distance from the goal as well as the angle formed with the goal
        #Wyscout's pitch has its origin at the top left of the pitch and is a 100 x 100 grid
        #so x and y are percentages of the pitch length and width, measured from the top left corner
        #most pitches are 105 meters by 68 meters, so we scale by those dimensions
        shots_dataset.at[index,'Y']=shot['positions'][0]['y']*.68
        shots_dataset.at[index,'X']= (100 - shot['positions'][0]['x'])*1.05

        #now we use dummy variables x and y to calc distance and angle attributes
        shots_dataset.at[index,'x']= 100 - shot['positions'][0]['x']
        shots_dataset.at[index,'y']=shot['positions'][0]['y']
        shots_dataset.at[index,'Center_dis']=abs(shot['positions'][0]['y']-50)

        x = shots_dataset.at[index,'x']*1.05
        y = shots_dataset.at[index,'Center_dis']*.68
        shots_dataset.at[index,'Distance'] = np.sqrt(x**2 + y**2)

        #we are interested in the angle made between the width of the goal and the
        #straight line distance to the shot location. A goal is 7.32 meters wide
        #use the law of cosines
        c=7.32
        a=np.sqrt((y-7.32/2)**2 + x**2)
        b=np.sqrt((y+7.32/2)**2 + x**2)
        k = (c**2-a**2-b**2)/(-2*a*b)
        gamma = np.arccos(k)
        if gamma<0:
            gamma = np.pi + gamma
        shots_dataset.at[index,'Angle Radians'] = gamma
        shots_dataset.at[index,'Angle Degrees'] = gamma*180/np.pi

        #lastly we add the identifiers for player, team and match
        shots_dataset.at[index,'playerid']=shot['playerId']
        shots_dataset.at[index,'matchid']=shot['matchId']
        shots_dataset.at[index,'teamid']=shot['teamId']

        print("shots_dataset created :: Counter Index : ", index)
    return shots_dataset
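
The geometry behind the `Distance` and angle columns can be written out explicitly. The symbols below are introduced here for illustration and are not the notebook's own notation: $x_{\%}$ and $y_{\%}$ are the raw Wyscout percentages, $x$ and $y$ are the metric offsets computed inside `shot_matrix`, $d$ is the distance to the centre of the goal, and $c = 7.32$ m is the goal width.

$$
x = 1.05\,(100 - x_{\%}), \qquad y = 0.68\,\lvert y_{\%} - 50 \rvert, \qquad d = \sqrt{x^2 + y^2}
$$

$$
a = \sqrt{x^2 + (y - c/2)^2}, \qquad b = \sqrt{x^2 + (y + c/2)^2}, \qquad \cos\gamma = \frac{a^2 + b^2 - c^2}{2ab}
$$

Here $a$ and $b$ are the distances from the shot location to the two posts, and $\gamma$ is the angle the goal mouth subtends at that location. As a check against the shot at index 117 in the `df.head()` output further down (stored `x = 13`, `y = 27`): $x = 13.65$, $y = 0.68 \cdot 23 = 15.64$, so $d \approx 20.76$ m and $\gamma \approx 0.235$ rad $\approx 13.46^\circ$.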

In [None]:
#Now we read in our json files into our shot_matrix function
all_leagues = []
for file in jsonfiles:
    all_leagues.append(shot_matrix(file))
df = pd.concat(all_leagues)
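
Filling the dataframe cell by cell with `iterrows` and `.at` can be slow on several hundred thousand events and leaves columns such as `Goal` with `object` dtype. A column-wise version is one possible alternative. The sketch below is an assumption rather than the notebook's method: `shot_features` is a hypothetical helper, it takes an already loaded list of event dictionaries, and it covers only the goal flag, header flag, position, and distance columns.

```python
import numpy as np
import pandas as pd

GOAL_TAG, HEADER_TAG = 101, 403          # tag ids used in shot_matrix above
PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0  # metres, same assumption as shot_matrix

def shot_features(events):
    """Hypothetical column-wise helper; `events` is a list of Wyscout-style event dicts."""
    event_df = pd.DataFrame(events)
    shots = event_df[event_df['subEventName'] == 'Shot']

    # Collect the tag ids of each shot into a set for quick membership tests
    tag_ids = shots['tags'].apply(lambda tags: {tag['id'] for tag in tags})
    # The first entry of 'positions' is where the shot was taken
    x_pct = shots['positions'].apply(lambda pos: pos[0]['x']).astype(float)
    y_pct = shots['positions'].apply(lambda pos: pos[0]['y']).astype(float)

    out = pd.DataFrame(index=shots.index)
    out['Goal'] = tag_ids.apply(lambda ids: int(GOAL_TAG in ids))
    out['header'] = tag_ids.apply(lambda ids: int(HEADER_TAG in ids))
    out['X'] = (100 - x_pct) * PITCH_LENGTH / 100   # metres from the goal line
    out['Y'] = y_pct * PITCH_WIDTH / 100            # metres from the top touchline
    lateral = (y_pct - 50).abs() * PITCH_WIDTH / 100
    out['Distance'] = np.hypot(out['X'], lateral)   # metres to the centre of the goal
    return out
```

Because each column is built in one step, the flags come out as integers and the coordinates as floats, so no dtype conversion should be needed afterwards.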

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43078 entries, 117 to 647286
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Goal           43078 non-null  object 
 1   x              43078 non-null  object 
 2   y              43078 non-null  object 
 3   playerid       43078 non-null  object 
 4   teamid         43078 non-null  object 
 5   matchid        43078 non-null  object 
 6   header         43078 non-null  object 
 7   Y              43078 non-null  float64
 8   X              43078 non-null  float64
 9   Center_dis     43078 non-null  float64
 10  Distance       43078 non-null  float64
 11  Angle Radians  43075 non-null  float64
 12  Angle Degrees  43075 non-null  float64
dtypes: float64(6), object(7)
memory usage: 4.6+ MB


In [20]:
df.describe()

| | Y | X | Center_dis | Distance | Angle Radians | Angle Degrees |
|---|---|---|---|---|---|---|
| count | 43078.0 | 43078.0 | 43078.0 | 43078.0 | 43075.0 | 43075.0 |
| mean | 33.473875 | 15.992062 | 11.45926 | 18.592949 | 0.414019 | 23.721566 |
| std | 9.366242 | 8.534094 | 7.681202 | 8.419041 | 0.252483 | 14.466236 |
| min | 0.0 | 0.0 | 0.0 | 0.68 | 0.0 | 0.0 |
| 25% | 26.52 | 9.45 | 5.0 | 12.249445 | 0.250188 | 14.334692 |
| 50% | 33.32 | 13.65 | 11.0 | 17.153297 | 0.327782 | 18.780499 |
| 75% | 40.8 | 23.1 | 17.0 | 24.936 | 0.505984 | 28.990776 |
| max | 68.0 | 103.95 | 50.0 | 103.952224 | 3.141593 | 180.0 |


In [17]:
df.head()

| | Goal | x | y | playerid | teamid | matchid | header | Y | X | Center_dis | Distance | Angle Radians | Angle Degrees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 0 | 13 | 27 | 122940 | 16521 | 2057954 | 0 | 18.36 | 13.65 | 23.0 | 20.758904 | 0.234886 | 13.458001 |
| 154 | 0 | 10 | 69 | 101699 | 14358 | 2057954 | 0 | 46.92 | 10.5 | 19.0 | 16.648616 | 0.283528 | 16.244976 |
| 197 | 0 | 14 | 30 | 101857 | 14358 | 2057954 | 0 | 20.4 | 14.7 | 20.0 | 20.026233 | 0.270761 | 15.513438 |
| 232 | 1 | 7 | 60 | 102157 | 14358 | 2057954 | 1 | 40.8 | 7.35 | 10.0 | 10.013116 | 0.554534 | 31.772473 |
| 372 | 0 | 14 | 38 | 122671 | 16521 | 2057954 | 0 | 25.84 | 14.7 | 12.0 | 16.812959 | 0.380161 | 21.781597 |


**Data Cleaning**

Before we explore the dataset we just created, we should do some data cleaning. With such a large collection of data, it is normal for some values to have been entered incorrectly, for some values to be missing, or for situations to arise that we did not anticipate. For example, we should check why we are encountering an error in arccos.

In [21]:
#find out if the error is producing nan values
df.isnull().values.any()

True

In [23]:
#find how many such nan values
df.isnull().sum().sum()

6

In [22]:
df.isnull().sum()

Goal             0
x                0
y                0
playerid         0
teamid           0
matchid          0
header           0
Y                0
X                0
Center_dis       0
Distance         0
Angle Radians    3
Angle Degrees    3
dtype: int64

In [24]:
df[df.isnull().any(axis=1)]

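| | Goal | x | y | playerid | teamid | matchid | header | Y | X | Center_dis | Distance | Angle Radians | Angle Degrees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 417224 | 1 | 0 | 49 | 4131 | 698 | 2565801 | 0 | 33.32 | 0.0 | 1.0 | 0.68 | NaN | NaN |
| 365140 | 1 | 0 | 52 | 224971 | 2445 | 2516954 | 0 | 35.36 | 0.0 | 2.0 | 1.36 | NaN | NaN |
| 499325 | 1 | 0 | 57 | 206314 | 3161 | 2576251 | 0 | 38.76 | 0.0 | 7.0 | 4.76 | NaN | NaN |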


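So it seems that some goals were recorded as scored from the goal line itself (x = 0). At that position the triangle formed by the shot location and the two posts collapses into a straight line, and the value passed to arccos most likely falls fractionally outside [-1, 1] because of floating-point rounding. Handling these shots properly would require us to rethink how we constructed the angle attribute.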

**Since there were only 3 occurrences of such events, and since they are normally rare, unintentional events, I will remove them from our model. This is mainly to keep things simple.**
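
If these goal-line shots were worth keeping instead, one option would be to compute the angle with `np.arctan2`, which avoids arccos entirely and stays defined when the shot location lies on the goal line. The sketch below is an assumption, not part of the original notebook: `goal_angle` is a hypothetical helper that takes the metric offsets `x` (distance from the goal line) and `y` (lateral distance from the centre of the goal), with `x` assumed non-negative.

```python
import numpy as np

GOAL_WIDTH = 7.32  # metres, as in shot_matrix above

def goal_angle(x, y, goal_width=GOAL_WIDTH):
    """Angle in radians subtended by the goal mouth at a shot location.

    Hypothetical helper: x is the distance from the goal line (>= 0) and
    y the lateral offset from the centre of the goal, both in metres.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Angle between the vectors pointing from the shot location to each post:
    # the cross product is goal_width * x and the dot product is
    # x^2 + y^2 - (goal_width / 2)^2
    return np.arctan2(goal_width * x, x**2 + y**2 - (goal_width / 2) ** 2)
```

For ordinary shots this agrees with the law-of-cosines values in the table above (for example `goal_angle(13.65, 15.64)` is about 0.2349). On the goal line it returns π between the posts and 0 outside them, rather than NaN.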

In [25]:
#dropna returns a new dataframe, so assign the result back to df
df = df.dropna()
df

| | Goal | x | y | playerid | teamid | matchid | header | Y | X | Center_dis | Distance | Angle Radians | Angle Degrees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 0 | 13 | 27 | 122940 | 16521 | 2057954 | 0 | 18.36 | 13.65 | 23.0 | 20.758904 | 0.234886 | 13.458001 |
| 154 | 0 | 10 | 69 | 101699 | 14358 | 2057954 | 0 | 46.92 | 10.50 | 19.0 | 16.648616 | 0.283528 | 16.244976 |
| 197 | 0 | 14 | 30 | 101857 | 14358 | 2057954 | 0 | 20.40 | 14.70 | 20.0 | 20.026233 | 0.270761 | 15.513438 |
| 232 | 1 | 7 | 60 | 102157 | 14358 | 2057954 | 1 | 40.80 | 7.35 | 10.0 | 10.013116 | 0.554534 | 31.772473 |
| 372 | 0 | 14 | 38 | 122671 | 16521 | 2057954 | 0 | 25.84 | 14.70 | 12.0 | 16.812959 | 0.380161 | 21.781597 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646870 | 0 | 5 | 45 | 116269 | 3193 | 2576338 | 0 | 30.60 | 5.25 | 5.0 | 6.254798 | 0.980870 | 56.199735 |
| 646904 | 0 | 7 | 38 | 3548 | 3193 | 2576338 | 0 | 25.84 | 7.35 | 12.0 | 10.982172 | 0.465107 | 26.648679 |
| 647169 | 1 | 10 | 46 | 21177 | 3193 | 2576338 | 0 | 31.28 | 10.50 | 4.0 | 10.846585 | 0.635289 | 36.399362 |
| 647218 | 0 | 21 | 32 | 349102 | 3193 | 2576338 | 0 | 21.76 | 22.05 | 18.0 | 25.219439 | 0.253651 | 14.533147 |


**Now it seems we have some unnecessary columns that stored dummy variables when we computed distance and angles. Let's remove them.**

In [26]:
#drop returns a new dataframe, so assign the result back to df
df = df.drop(columns = ['x','y','Center_dis'])
df

| | Goal | playerid | teamid | matchid | header | Y | X | Distance | Angle Radians | Angle Degrees |
|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 0 | 122940 | 16521 | 2057954 | 0 | 18.36 | 13.65 | 20.758904 | 0.234886 | 13.458001 |
| 154 | 0 | 101699 | 14358 | 2057954 | 0 | 46.92 | 10.50 | 16.648616 | 0.283528 | 16.244976 |
| 197 | 0 | 101857 | 14358 | 2057954 | 0 | 20.40 | 14.70 | 20.026233 | 0.270761 | 15.513438 |
| 232 | 1 | 102157 | 14358 | 2057954 | 1 | 40.80 | 7.35 | 10.013116 | 0.554534 | 31.772473 |
| 372 | 0 | 122671 | 16521 | 2057954 | 0 | 25.84 | 14.70 | 16.812959 | 0.380161 | 21.781597 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646870 | 0 | 116269 | 3193 | 2576338 | 0 | 30.60 | 5.25 | 6.254798 | 0.980870 | 56.199735 |
| 646904 | 0 | 3548 | 3193 | 2576338 | 0 | 25.84 | 7.35 | 10.982172 | 0.465107 | 26.648679 |
| 647169 | 1 | 21177 | 3193 | 2576338 | 0 | 31.28 | 10.50 | 10.846585 | 0.635289 | 36.399362 |
| 647218 | 0 | 349102 | 3193 | 2576338 | 0 | 21.76 | 22.05 | 25.219439 | 0.253651 | 14.533147 |
