# Developing a Model to Detect Offsides in a Football Match

Is it possible to teach a computer the most complex rule of football, the offside, by using very simple models and data from a single game?

# 1. The Offside

A player is in an "offside position" if they are in the opposing team's half of the field and also "nearer to the opponents' goal line than both the ball and the second-last opponent." This is a brief description, and we can develop it a little further by stating the five conditions that must be met.<br>
1. A teammate touches the ball, either by a pass, a header or a shot towards the goal;
2. The pass is not from a corner kick, a goal kick or a throw-in;
3. The receiving player is in the offensive half of the field;
4. The receiving player is closer to the goal line than the ball;
5. There is at most one player from the other team between the receiving player and the goal line.

When all five conditions hold in a given play, the game must be stopped and an indirect free kick is awarded to the defending team.
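
For reference, the positional part of the rule (conditions 3 to 5) can be written as an explicit check. This is only an illustrative sketch, not the model we are building: the function name, the convention that the attacked goal line is at positive x, and the toy coordinates are all assumptions made for this example.

```python
def is_offside_position(receiver_x, ball_x, opponents_x):
    """Return True if a receiver attacking towards +x is in an offside position.

    Assumptions (for illustration only): x = 0 is the halfway line, the
    attacked goal line is at positive x, and the set-piece exceptions
    (corner kick, goal kick, throw-in) are handled elsewhere.
    """
    in_offensive_half = receiver_x > 0
    ahead_of_ball = receiver_x > ball_x
    # opponents that are closer to the goal line than the receiver
    opponents_ahead = sum(1 for x in opponents_x if x > receiver_x)
    return in_offensive_half and ahead_of_ball and opponents_ahead < 2


# hypothetical play: goalkeeper at 50 m, last defender at 30 m, ball at 20 m
print(is_offside_position(35.0, 20.0, [50.0, 30.0, 10.0]))  # True
print(is_offside_position(25.0, 20.0, [50.0, 30.0, 10.0]))  # False
```

The whole point of the project is to see whether a classifier can recover this kind of logic from labelled examples alone.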

### 1.1 Learning by Example

Our goal here is to create a model where these rules are not explicitly written, but are learned from a data set of annotated plays. That is, we are going to feed the model a series of plays, each with the positions of the 22 players (11 per team), the player passing the ball and the receiver, together with a label "offside" or "no_offside". Afterwards, we will evaluate the model to see how good the fit is.

# 2. Data set

We found some collections of data with offsides, but they didn't include the position of the players. Furthermore, we also needed data for the cases where no offside happened. There were two possible paths:
1. create a large number of plays by randomly choosing the positions of all the players;
2. find real data from, at least, one football match.

### 2.1 Random positions

Pros: very easy to create a large number of plays.<br>
Cons: most of the plays are unrealistic for a football match.

**Conclusion:** too much junk.

### 2.2 Real data

Pros: real situations from a football match.<br>
Cons: such data is very hard to find; data cleaning is necessary; data augmentation may be needed.

**Jackpot:** we found **ONE** match!

We have found a data set called <a href="https://old.datahub.io/dataset/magglingen2013">Magglingen2013</a>.<br> 
*Recorded position data of professional football matches. The data includes positions on football field in 100ms steps (10Hz) of the players, and the ball.*<br>
*The games were performed in real competition situation by top swiss football club players (U19). The data was intentionally anonymized, not at last because of the high density of information available.*<br>
*This dataset gives you a glimpse on a possible data stream originating from a live match in some not to distant future.*<br>
*The terrain was an artificial green. Recordings are available from beginning of the first half time up to the end to the second half time.*

There are two matches listed on this website, but the links are broken. We were able to find the file for "*Game TR vs. FT*", so we are using that as our initial data set. It is a JSON file with a timestamp and, for each player, an identifier, the coordinates on the field (x,y), the distance to the ball and whether the player has possession of the ball.

<img src="json.png"></img>

There is a second JSON file, matching the player ID with one of the two teams, and another ID for the ball itself.

<img src="json2.png"></img>
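
To make the two structures concrete, here is a minimal sketch of how one frame and the team description can be read. Only the keys `ts`, `data`, `id`, `x`, `y`, `player` and `team` are the ones used later in this notebook; every value below is made up, and the fields for the distance to the ball and the possession flag are left out because their key names are not used here.

```python
import json

# Hypothetical frame: one timestamp, then one entry per tracked object.
frame_json = '{"ts": 0, "data": [{"id": 5, "x": 0.0, "y": 0.0}, {"id": 7, "x": -12.5, "y": 3.2}]}'
# Hypothetical team description: object ID (as a string key) -> team.
teams_json = '{"player": {"5": {"team": "ball"}, "7": {"team": "team 1"}}}'

frame = json.loads(frame_json)
teams = json.loads(teams_json)

# position of every tracked object in the frame, keyed by its ID as a string
positions = {str(item["id"]): (item["x"], item["y"]) for item in frame["data"]}

for object_id, (x, y) in positions.items():
    print(object_id, teams["player"][object_id]["team"], x, y)
```

Note that the IDs are integers in the game file and string keys in the team file, which is why they are converted before being matched.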

# 3. Data Exploration and Data Cleaning

Let's get to know the data a little better before we try any cleaning. For example, we need to find out the size of the field to be able to spot anomalies.

In [None]:
import json
print('Loading data')

#reading game data
print('Reading game data')
filename='tr-ft.json'
with open(filename) as json_file:
    data_game = json.load(json_file)

#reading team players
print('Reading team data')
filename2='tr-ft-gamedesc.json'    
with open(filename2) as json_file:
    data_teams = json.load(json_file)

data_teams

We need the IDs for each player. In the next step, we will create a list with the teams: first the eleven players from *Team 1*, then the eleven players from *Team 2*.

In [None]:
#list of players, 11 for team 1, 11 for team 2
lineup=[]
for i in data_teams['player'].keys():
    teams=[i,data_teams['player'][i]['team']]
    
    lineup.append(teams)

lineup

The ID **5** is the ball. The IDs are ordered by team, so we don't need the second column.

In [None]:
#keep only the ID of each [ID, team] pair
lineup = [elem[0] for elem in lineup]
lineup

Now that we have processed the team data, let's take a look at the game data. As we have shown before, for each instant we have the position of the ball and of the 22 players.

In [None]:
data_game[0]

We are going to create a DataFrame where the first column is the timestamp and then the position for the ball and each one of the players, in the order of the lineup.

In [None]:
import numpy as np
events = []
for ii in range(len(data_game)):
    new_play = [data_game[ii]['ts']]
    positions = []
    for jj in range(len(data_game[ii]['data'])):
        position_player = [data_game[ii]['data'][jj]['id']]
        position_player.append(data_game[ii]['data'][jj]['x'])
        position_player.append(data_game[ii]['data'][jj]['y'])
        positions.append(position_player)
    
    for kk in lineup:
        temp_x = np.nan
        temp_y = np.nan    
        for ll in range(len(positions)):
            if int(kk) == positions[ll][0]:
                temp_x = positions[ll][1]
                temp_y = positions[ll][2]
        new_play.append(temp_x)
        new_play.append(temp_y)

    events.append(new_play)

print('Number of frames of the game: ',len(events))

In [None]:
columns_events=['timestamp']
for ii in range(len(lineup)):
    columns_events.append('x_'+lineup[ii])
    columns_events.append('y_'+lineup[ii])

In [None]:
import pandas as pd
df_events = pd.DataFrame(events, columns=columns_events)
df_events.head()

We have a DataFrame where each row is the configuration of players on the field at a given instant. Our data has a 10Hz frequency, that is, we have 10 rows for each second. We can see there is very little variation from one row to the next. <br>
Let's have a quick look at some statistics.

In [None]:
df_events.describe()

We need to get rid of those '99999.999' values and replace them with NaN. 

In [None]:
df_events = df_events.replace([99999.999],np.nan)
df_events.describe()

After this first clean up, we can see the columns x_5 and y_5 had a huge drop in the count, at 48908 from the original 68031. That means we don't have the location of the ball for 31 minutes and 52.3 seconds, but the data is available for 81 minutes and 30.8 seconds, consistent with a match duration of 90 minutes. There must be some events where the ball is out of bounds that add up to the missing time. <br>
The positions of the players also show a drop in the count, which is probably related to the half-time break. <br>
Let's take a look at the distribution of (X,Y) for all the players. We will create histograms to try to determine the length and width of the field. 
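
The frame counts quoted above convert to playing time as follows. This is a small arithmetic sketch; the helper name `frames_to_min_sec` is ours, and the only inputs are the two counts from the tables and the 10Hz rate stated in the dataset description.

```python
FRAME_RATE_HZ = 10  # 100 ms steps, as stated in the dataset description

def frames_to_min_sec(n_frames, rate_hz=FRAME_RATE_HZ):
    """Convert a number of frames into (minutes, seconds)."""
    total_seconds = n_frames / rate_hz
    minutes = int(total_seconds // 60)
    return minutes, round(total_seconds - 60 * minutes, 1)

total_frames = 68031  # rows in the DataFrame
ball_frames = 48908   # rows where the ball position is known

print(frames_to_min_sec(ball_frames))                 # (81, 30.8)
print(frames_to_min_sec(total_frames - ball_frames))  # (31, 52.3)
```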

In [None]:
list_of_x = []
list_of_y = []
for column in columns_events:
    if column[:2] == 'x_':
        list_of_x.append(df_events[column].tolist())
    elif column[:2] == 'y_':
        list_of_y.append(df_events[column].tolist())

list_of_x = [ item for elem in list_of_x for item in elem]
list_of_y = [ item for elem in list_of_y for item in elem]
print('X: ',list_of_x[:10])
print('Y: ',list_of_y[:10])


In [None]:
import matplotlib.pyplot as plt

#from the description of the position of the ball
x_min = -52
x_max = 52

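# fixed bin size: 1 m wide bins over ten times the nominal range
bins = np.linspace(10*x_min, 10*x_max, 10*(x_max - x_min) + 1)

# the list contains NaN, so use the NaN-aware minimum and maximum
plt.xlim([np.nanmin(list_of_x), np.nanmax(list_of_x)])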
plt.title('Distribution of X (m)')
plt.xlabel('X (bin size = 1m)')
plt.ylabel('count')
plt.hist(list_of_x, bins=bins, alpha=0.5)
plt.show()

We can see we need to clean more of the data. Most of the data is within the range of positions for the ball, (-52,52), which we got from the table describing the columns. Let's take a closer look.

In [None]:
plt.xlim([-60, 60])
plt.title('Distribution of X (m)')
plt.xlabel('X (bin size = 1m)')
plt.ylabel('count')
plt.hist(list_of_x, bins=bins, alpha=0.5)
plt.show()

From this histogram, we learn two things:
1. The X position is in the direction of the length of the field, with the center at 0 and the goal lines at +/- 52m;
2. We need to clean more data, getting rid of all the positions where the players are more than 2 meters off the field (to allow corner kicks and the goalies retrieving the ball).

In [None]:
#from the description of the position of the ball
y_min = -34
y_max = 34

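# fixed bin size: 1 m wide bins over ten times the nominal range
bins = np.linspace(10*y_min, 10*y_max, 10*(y_max - y_min) + 1)

# the list contains NaN, so use the NaN-aware minimum and maximum
plt.xlim([np.nanmin(list_of_y), np.nanmax(list_of_y)])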
plt.title('Distribution of Y (m)')
plt.xlabel('Y (bin size = 1m)')
plt.ylabel('count')
plt.hist(list_of_y, bins=bins, alpha=0.5)
plt.show()

We can see we need to clean more of the data. Most of the data is within the range of positions for the ball, (-34,34), which we got from the table describing the columns. Let's take a closer look.

In [None]:
plt.xlim([-40, 40])
plt.title('Distribution of Y (m)')
plt.xlabel('Y (bin size = 1m)')
plt.ylabel('count')
plt.hist(list_of_y, bins=bins, alpha=0.5)
plt.show()

From this histogram, we learn two things:
1. The Y position is in the direction of the width of the field, with the center at 0 and the side lines at +/- 34m;
2. We need to clean more data, getting rid of all the positions where the players are more than 2 meters off the field (to allow the throw-ins).

We are replacing every X beyond +/- 54 and every Y beyond +/- 36 with NaN.
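
The rule can be illustrated on single coordinates in plain Python before it is applied to the whole DataFrame. This is a sketch with made-up positions; the constant and function names are hypothetical, while the limits (52 m, 34 m and the 2 m margin) come from the histograms above.

```python
FIELD_HALF_LENGTH = 52.0  # m, goal lines at x = +/- 52
FIELD_HALF_WIDTH = 34.0   # m, side lines at y = +/- 34
MARGIN = 2.0              # m, tolerance for corner kicks and throw-ins

def clean_position(x, y, margin=MARGIN):
    """Return (x, y) with every out-of-range coordinate replaced by NaN."""
    nan = float("nan")
    if abs(x) > FIELD_HALF_LENGTH + margin:
        x = nan
    if abs(y) > FIELD_HALF_WIDTH + margin:
        y = nan
    return x, y

print(clean_position(53.0, 10.0))  # (53.0, 10.0): inside the 2 m margin
print(clean_position(60.0, 10.0))  # (nan, 10.0): x is too far out
```

Each coordinate is checked on its own, which is why a player can end up with a valid y and a missing x.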

In [None]:
for column in columns_events:
    if column[:2] == 'x_':
        #more than 2 m beyond the goal lines
        df_events.loc[df_events[column].abs() > 54, column] = np.nan
    if column[:2] == 'y_':
        #more than 2 m beyond the side lines
        df_events.loc[df_events[column].abs() > 36, column] = np.nan


In [None]:
df_events.describe()

Now we have all the (x,y) inside the field plus a 2m margin. But we have a new problem: we have a different number of values for x and y for a single player, and different numbers for each player.

In [None]:
for column in columns_events:
    print(column,': ',df_events[column].count())

The lowest count is for player 16, with around 42,000 rows. He was probably sent off or injured at the start of the second half.<br>
We have two options:
1. Fill in the gaps in the data, with an average position or a new random location;
2. Remove all the rows with missing data.

Let's take a look at the distribution of these gaps during the game. The ball gets out of the field regularly, so we won't use its position in our analysis.

In [None]:
nans=[]
number_frames_nan=0
sequence_of_nans=0
df_nans=df_events.drop(['x_5', 'y_5'], axis=1).isna()
for ii in range(len(df_events)):
    count_nan=df_nans.loc[[ii]].sum().sum()
    if count_nan>0:
        number_frames_nan += 1
        sequence_of_nans += 1
        if sequence_of_nans > 40:
            sequence_of_nans = 40
    else:
        sequence_of_nans = 0
    nans.append(sequence_of_nans)
print('Number of frames with NaN: ',number_frames_nan)

In [None]:
plt.title('Sequence of frames with NaN')
plt.xlabel('Frames')
plt.ylabel('Number of frames with NaN in sequence')
plt.scatter(range(len(df_events)),nans)

How to read the above plot: most of the time we have full data. In the first half, there is a sequence of 4 consecutive rows with at least one NaN, another sequence with 12 rows, another with 7 rows, and one with 24 rows. Then we see the end of the first half and the restart of the game. In the second half we see four sequences (with 6, 20, 8 and 35 rows). After these events, we see what was expected: one player is out for the remainder of the game.
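
The sequence lengths read off the plot are run lengths of consecutive incomplete frames. A minimal sketch of that computation on a toy mask is shown below; the function name and the mask are hypothetical, and unlike the plotting code above it does not cap the runs at 40.

```python
def incomplete_runs(has_nan):
    """Return the length of every run of consecutive True values."""
    runs = []
    current = 0
    for flag in has_nan:
        if flag:
            current += 1
        elif current > 0:
            runs.append(current)
            current = 0
    if current > 0:
        runs.append(current)
    return runs

# toy mask: True marks a frame where at least one player position is missing
mask = [False, True, True, True, True, False, False, True, True, False]
runs = incomplete_runs(mask)
print(runs)                    # [4, 2]
print([r / 10 for r in runs])  # [0.4, 0.2] seconds at 10 Hz
```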

We have three different groups:
1. Half-time
2. One player out
3. Problematic data

It is important here to remember that the data has a frequency of 10Hz. A sequence of 35 rows, our largest sequence, is equivalent to 3.5 seconds of the game. 

**To keep things simple, we can get rid of all the time frames with incomplete data without compromising our analysis.**

Before we do that, let's make a transformation that will simplify our work later on. The teams switch sides for the second half. We will flip the positions for the second half, multiplying them by *-1*, so that each team's offensive half is always the same, with *team 1* attacking to the right and *team 2* to the left.

In [None]:
#flip only the position columns, not the timestamp
half = int(len(df_events)/2)
df_events.loc[half:, columns_events[1:]] = -1*df_events.loc[half:, columns_events[1:]]

From the original 68031 frames, we have 27505 incomplete rows. This leaves information for 40526 frames, equivalent to 67 minutes and 32 seconds, about 75% of a match. Good enough for us!

In [None]:
df_events_clean=df_events.dropna(subset=columns_events[3:]).reset_index(drop=True)
df_events_clean.describe()

We finally have a clean data set to start our project, with 40526 frames with the position of 22 players in real game situations. Let's create a function to show a random row.

In [None]:
df_events_clean.to_csv('df_clean.csv',index=False)

In [None]:
import math
# Create field
fieldx=[]
fieldy=[]
for i in range(-5200,5200,10):
    fieldx.append(i)
    fieldy.append(-3400)
    fieldx.append(i)
    fieldy.append(3400)

for i in range(-3400,3400,10):
    fieldx.append(-5200)
    fieldy.append(i)
    fieldx.append(5200)
    fieldy.append(i)
    fieldx.append(0)
    fieldy.append(i)

for i in range(360):
    fieldx.append(915*math.cos(math.radians(i)))
    fieldy.append(915*math.sin(math.radians(i)))

for i in range(0,2016,10):
    fieldx.append(5200-1650)
    fieldy.append(i)
    fieldx.append(5200-1650)
    fieldy.append(-i)
    fieldx.append(-5200+1650)
    fieldy.append(i)
    fieldx.append(-5200+1650)
    fieldy.append(-i)

for i in range(5200,5200-1650,-10):
    fieldx.append(i)
    fieldy.append(2016)
    fieldx.append(i)
    fieldy.append(-2016)
    fieldx.append(-i)
    fieldy.append(-2016)
    fieldx.append(-i)
    fieldy.append(2016)

for i in range(0,366+550,10):
    fieldx.append(5200-550)
    fieldy.append(i)
    fieldx.append(5200-550)
    fieldy.append(-i)
    fieldx.append(-5200+550)
    fieldy.append(i)
    fieldx.append(-5200+550)
    fieldy.append(-i)

for i in range(5200,5200-550,-10):
    fieldx.append(i)
    fieldy.append(366+550)
    fieldx.append(i)
    fieldy.append(-366-550)
    fieldx.append(-i)
    fieldy.append(-366-550)
    fieldx.append(-i)
    fieldy.append(366+550)

for i in range(360):
    #penalty arc on the right: keep only the part outside the penalty area
    if 5200-1100 + 915*math.cos(math.radians(i))<5200-1650:
        fieldx.append(5200-1100 + 915*math.cos(math.radians(i)))
        fieldy.append(915*math.sin(math.radians(i)))

for i in range(360):
    if -5200+1100 + 915*math.cos(math.radians(i))>-5200+1650:
        fieldx.append(-5200+1100 + 915*math.cos(math.radians(i)))
        fieldy.append(915*math.sin(math.radians(i)))

fieldx.append(5200-1100)
fieldy.append(0)
fieldx.append(-5200+1100)
fieldy.append(0)

def plot_field(index):
    #getting data for the teams
    g1 = (100*df_events_clean.loc[index,'x_5'],100*df_events_clean.loc[index,'y_5'])
    g2 = (100*df_events_clean.loc[index,columns_events[3:25:2]],100*df_events_clean.loc[index,columns_events[4:25:2]])
    g3 = (100*df_events_clean.loc[index,columns_events[25::2]],100*df_events_clean.loc[index,columns_events[26::2]])
    data = (g2,g3,g1)
    colors = ("red", "blue","green")
    groups = ("team1", "team2", "ball")
    size=(30,30,50)

    # Create plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    plt.scatter(fieldx,fieldy,s=5,c="black",alpha=0.3)    

    for data, color, group, size in zip(data, colors, groups, size):
        x, y = data
        ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none', s=size, label=group)

    plt.ylim(-3700, 3700)
    plt.xlim(-5500, 5500)
    plt.title('Time frame: '+str(index))
    plt.legend(loc=2)
    plt.show()

    

In [None]:
import random
#seed both generators: random picks the frames, numpy creates the noise later on
random.seed(0)
np.random.seed(0)
plot_field(random.randint(0, len(df_events_clean)-1))

There are some frames without the ball, but that is expected. In most frames we can clearly see the two goalies and the players gathering around the ball. As stated before, *team 1* attacks to the right and *team 2* attacks to the left. These plots give us confidence that our data represents actual game situations, so we can move on to data augmentation.

# 4. Data Augmentation

We have more than 40 thousand situations, but let's pump up those numbers! We will create new data by adding noise to the positions of the players. This gives us new plays that stay similar to the real ones. For each row in our data set, we will add a small random offset to the position of every player. We will drop the 'timestamp' column because it won't be used. We can also remove the columns for the ball, as our analysis will use the positions of the passing and receiving players instead.
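
The idea can be sketched on a single play in plain Python before applying it to the whole DataFrame. The play below is made up, the helper name `add_noise` is hypothetical, and the amplitude of 2 m is chosen to match the uniform noise used in the next cells.

```python
import random

def add_noise(play, amplitude=2.0, rng=random):
    """Return a copy of `play` with uniform noise in [-amplitude, amplitude] added to every coordinate."""
    return [value + rng.uniform(-amplitude, amplitude) for value in play]

rng = random.Random(0)           # seeded generator, for reproducibility
play = [10.0, -5.0, 30.5, 12.0]  # hypothetical x, y of two players (m)

# the original play plus two noisy copies
augmented = [play] + [add_noise(play, rng=rng) for _ in range(2)]

print(len(augmented))  # 3
```

Every noisy copy stays within 2 m of the real positions, so the plays remain plausible.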

In [None]:
df_plays = df_events_clean.drop(columns=['timestamp','x_5','y_5'])

In [None]:
repetition = 2
df_plays_rep = df_plays.copy()
for ii in range(repetition):
    df_noise = pd.DataFrame(4*np.random.rand(len(df_plays), len(columns_events[3:]))-2, columns =columns_events[3:])
    #DataFrame.append was removed in pandas 2.0, so we use pd.concat
    df_plays_rep=pd.concat([df_plays_rep, df_plays.add(df_noise)], ignore_index=True)
len(df_plays_rep)

In [None]:
df_plays_rep.loc[0::40526]

We now have more than 120k plays in our data set. We need to add three new columns: the passing player, the receiving player and the flag "offside/no_offside". There are 22x22 = 484 passer/receiver combinations, including the ball coming from the other team and a player keeping the ball. With our 120k plays, that would give a total of 58,843,752 different game situations. This is too many to handle comfortably, so we are going to randomly select 25 combinations per play instead of all 484, leaving us with close to 3 million game situations. For simplicity, the passing and receiving players are identified by their order in the lineup, that is, these columns will hold numbers between 1 and 22.
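
The counts in this paragraph follow directly from the number of clean frames. A short check of the arithmetic, using only figures already given in the text:

```python
n_frames = 40526        # complete frames after cleaning
n_plays = 3 * n_frames  # the original plays plus two noisy copies
pairs = 22 * 22         # every (passer, receiver) combination

print(n_plays)          # 121578
print(n_plays * pairs)  # 58843752
print(n_plays * 25)     # 3039450
```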

In [None]:
df_full_plays = df_plays_rep.copy()
df_full_plays['pass'] = 0
df_full_plays['rec'] = 0
df_full_plays['offside'] = 0

#25 copies of every play (DataFrame.append was removed in pandas 2.0)
df_full = pd.concat([df_full_plays]*25)
    
passing = []
receiving = []

for ii in range(len(df_full)):
    passing.append(random.randint(1,22))
    receiving.append(random.randint(1,22))

df_full['pass'] = passing
df_full['rec'] = receiving

len(df_full)

In [None]:
df_full = df_full.reset_index(drop=True)
df_full.head()

We are now ready to annotate the data, that is, to determine whether each play is an offside or not.

# 5. Data Annotation and Data set Balance

A player is in an "offside position" if they are in the opposing team's half of the field and also "nearer to the opponents' goal line than both the ball and the second-last opponent." This is a brief description, and we can develop it a little further by stating the five conditions that must be met.

1. A teammate touches the ball, either by a pass, a header or a shot towards the goal;
2. The pass is not from a corner kick, a goal kick or a throw-in;
3. The receiving player is in the offensive half of the field;
4. The receiving player is closer to the goal line than the ball;
5. There is at most one player from the other team between the receiving player and the goal line.

When all five conditions hold in a given play, the game must be stopped and an indirect free kick is awarded to the defending team.
We need to include two new conditions:

6. The receiving player is not the same as the passing one. This means the player can keep the ball to himself even if the above conditions happen.
7. The receiving player must be in the field. The passing player can be outside, when we consider the play a throw-in or a corner kick.

For each game situation previously created, we will evaluate these 7 conditions. The offsides are a small subset of the total plays. Besides checking for offsides, we will sort the plays into groups that help us build a more balanced dataset:
1. all 7 conditions are met: the offsides;
2. exactly one condition fails, and it is not condition 5: the "close call" non-offsides;
3. only condition 5 (the number of opponents between the receiving player and the goal line) fails: the most frequent close call;
4. none of the 7 conditions is met;
5. all the other non-offside cases.
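
The five groups listed above can be expressed as a small function of the seven 0/1 condition flags. This is only a sketch of the grouping logic; the function name and the group labels are hypothetical and are not used elsewhere in this notebook.

```python
def play_group(flags):
    """Assign a play to one of the five groups from its seven 0/1 condition flags."""
    met = sum(flags)
    if met == 7:
        return "offside"
    if met == 6:
        # exactly one condition fails; condition 5 is stored in flags[4]
        return "only_condition_5_fails" if flags[4] == 0 else "close_call"
    if met == 0:
        return "no_condition_met"
    return "other"

print(play_group([1, 1, 1, 1, 1, 1, 1]))  # offside
print(play_group([1, 1, 1, 1, 0, 1, 1]))  # only_condition_5_fails
print(play_group([0, 1, 1, 1, 1, 1, 1]))  # close_call
print(play_group([1, 0, 0, 1, 1, 1, 1]))  # other
```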
 

Let's check how many cases are in each of these groups.

In [None]:
offsides = []
zeroconditions = []
condition1 = []
condition2 = []
condition3 = []
condition4 = []
condition5 = []
condition6 = []
condition7 = []

for ii in range(len(df_full)):
    
    if ii%100000 == 0:
        print('Analysing: ',ii)
    
    offsidemark = []
    
    receiver_team = int(df_full.loc[ii]['rec']/12)
    passer_team = int(df_full.loc[ii]['pass']/12)
    receiver = str(lineup[int(df_full.loc[ii]['rec'])])
    passer = str(lineup[int(df_full.loc[ii]['pass'])])
    
    #condition1: 
    #A teammate touches the ball, either by a pass, a headbutt or kicking towards the goal.
    #check to see if pass and rec are from the same team
    if receiver_team == passer_team:
        offsidemark.append(1)
    else:
        offsidemark.append(0)
    
    #condition2: 
    #The pass is not from a corner kick, a goal kick or a throw-in.
    #check to see if passing player is outside the field
    
    if ((df_full.loc[ii]['x_'+passer] < -52) or (df_full.loc[ii]['x_'+passer] > 52) or (df_full.loc[ii]['y_'+passer] < -34) or (df_full.loc[ii]['y_'+passer] > 34)):
        offsidemark.append(0)
    else:
        offsidemark.append(1)

    #condition3: 
    #The receiving player is in the offensive half of the field.
    #check to see if receiving player is on the offensive half
    
    if (((df_full.loc[ii]['x_'+receiver] > 0) and (receiver_team == 0)) or ((df_full.loc[ii]['x_'+receiver] < 0) and (receiver_team == 1))):
        offsidemark.append(1)
    else:
        offsidemark.append(0)
        
    #condition4: 
    #The receiving player is closer to the goal line than the ball.
    #check to see if receiving player is ahead of the passing player
        
    #team1 attacks to the right
    if ((receiver_team == 0) and (df_full.loc[ii]['x_'+receiver] > df_full.loc[ii]['x_'+passer])):
        offsidemark.append(1)
    #team2 attacks to the left
    elif ((receiver_team == 1) and (df_full.loc[ii]['x_'+receiver] < df_full.loc[ii]['x_'+passer])):
        offsidemark.append(1)
    else:
        offsidemark.append(0)
    
    #condition5: 
    #There is only one player from the other team between the receiving player and the goal line.
    #count how many players from the other team are between the receiving player and the goal line
    
    if (receiver_team == 0):
        count_ahead = 0
        #x columns of team 2 are at positions 22, 24, ..., 42 (position 44 is the 'pass' column)
        for jj in range(22,44,2):
            if df_full.iloc[ii,jj] > df_full.loc[ii]['x_'+receiver]:
                count_ahead += 1
            
    if (receiver_team == 1):
        count_ahead = 0
        for jj in range(0,22,2):
            if df_full.iloc[ii,jj] < df_full.loc[ii]['x_'+receiver]:
                count_ahead += 1
    
    if count_ahead < 2:
        offsidemark.append(1)
    else:
        offsidemark.append(0)
    
    
    #condition6: 
    #The receiving player is not the same as the passing one
    #check to see receiver == passer
    
    if passer == receiver:
        offsidemark.append(0)
    else:
        offsidemark.append(1)

    #condition7: 
    #The receiving player must be in the field
    #check to see if receiving player is outside the field
    
    if ((df_full.loc[ii]['x_'+receiver] < -52) or (df_full.loc[ii]['x_'+receiver] > 52) or (df_full.loc[ii]['y_'+receiver] < -34) or (df_full.loc[ii]['y_'+receiver] > 34)):
        offsidemark.append(0)
    else:
        offsidemark.append(1)
        

    #now we check all the conditions and group the cases
    
    #when all conditions are met
    if sum(offsidemark) == 7:
        offsides.append(ii)
    
    #when no conditions are met
    if sum(offsidemark) == 0:
        zeroconditions.append(ii)
        
    #when just one of the conditions isn't met
    if sum(offsidemark) == 6:
        if offsidemark[0] == 0:
            condition1.append(ii)
        if offsidemark[1] == 0:
            condition2.append(ii)
        if offsidemark[2] == 0:
            condition3.append(ii)
        if offsidemark[3] == 0:
            condition4.append(ii)
        if offsidemark[4] == 0:
            condition5.append(ii)
        if offsidemark[5] == 0:
            condition6.append(ii)
        if offsidemark[6] == 0:
            condition7.append(ii)  

In [None]:
print('Number offsides: ',len(offsides))
print('Number of cases with zero conditions: ',len(zeroconditions))
print('Number of cases without condition 1: ',len(condition1))
print('Number of cases without condition 2: ',len(condition2))
print('Number of cases without condition 3: ',len(condition3))
print('Number of cases without condition 4: ',len(condition4))
print('Number of cases without condition 5: ',len(condition5))
print('Number of cases without condition 6: ',len(condition6))
print('Number of cases without condition 7: ',len(condition7))

From the 3 million plays, close to 45 thousand are offsides. Let's start our final dataset by getting all the cases without one of the conditions, excluding the 5th, and the ones with zero conditions.

In [None]:
final_list = []
#all the close calls except condition 5, plus the cases with zero conditions
for group in (condition1, condition2, condition3, condition4, condition6, condition7, zeroconditions):
    final_list.extend(group)

len(final_list)

From the 341,528 cases where only condition 5 fails, we will sample as many plays as we already have in the list.

In [None]:
final_list.extend(random.sample(condition5,len(final_list)))
len(final_list)

Our classification above covers 427,610 cases. That means there are more than 2.5 million other situations. From these, we will again sample as many plays as we already have in the list.

In [None]:
biglist = set([*range(len(df_full))])
biglist = biglist.difference(offsides)
biglist = biglist.difference(zeroconditions)
biglist = biglist.difference(condition1)
biglist = biglist.difference(condition2)
biglist = biglist.difference(condition3)
biglist = biglist.difference(condition4)
biglist = biglist.difference(condition5)
biglist = biglist.difference(condition6)
biglist = biglist.difference(condition7)

len(biglist)

In [None]:
#random.sample needs a sequence, so the set is converted to a sorted list
final_list.extend(random.sample(sorted(biglist),len(final_list)))
len(final_list)

We now have 165,468 plays where there is no offside. We need the same number of offsides, but we only have 44,715. A second round of data augmentation is needed.
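
The size of the non-offside set can be traced back to the group counts quoted in this section. The sketch below only redoes that arithmetic; the variable names are ours.

```python
mapped = 427610            # plays that fall in groups 1 to 4
offsides = 44715           # group 1
only_condition_5 = 341528  # group 3

close_calls = mapped - offsides - only_condition_5  # groups 2 and 4
after_condition_5 = 2 * close_calls                 # plus an equal sample of group 3
no_offside = 2 * after_condition_5                  # plus an equal sample of group 5

print(close_calls, after_condition_5, no_offside)   # 41367 82734 165468
print(no_offside / offsides)                        # about 3.7 non-offsides per offside
```

This ratio is why the offsides have to be replicated a few times before the two classes can be balanced.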

# 6. Second Data Augmentation

We need to increase the number of plays with offsides. The solution is quite simple: excluding throw-ins and corner kicks, the conditions are independent of Y. We will generate new plays by altering the position y for the players not involved in the play.

In [None]:
repetition = 3
df_offsides = df_full.loc[offsides].copy()
y_columns = [col for col in df_offsides.columns if col[:2] == 'y_']
for ii in range(repetition):
    df_offsides1 = df_full.loc[offsides].copy()
    #uniform noise between -2 and 2 for every y column
    noise = 4*np.random.rand(len(df_offsides1), len(y_columns)) - 2
    #the receiver and the passer keep their original y
    for jj in range(len(df_offsides1)):
        receiver = str(lineup[int(df_offsides1.iloc[jj]['rec'])])
        passer = str(lineup[int(df_offsides1.iloc[jj]['pass'])])
        noise[jj, y_columns.index('y_'+receiver)] = 0
        noise[jj, y_columns.index('y_'+passer)] = 0
    df_offsides1[y_columns] = df_offsides1[y_columns].to_numpy() + noise
    df_offsides = pd.concat([df_offsides, df_offsides1], ignore_index=True)

df_offsides['offside']=1
len(df_offsides)
    
    

From this group, we will pick the same number of cases we have on our final list.

In [None]:
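df_final = df_full.loc[final_list].copy()
sample_offsides = random.sample([*range(len(df_offsides))],len(final_list))
df_final = pd.concat([df_final, df_offsides.loc[sample_offsides]], ignore_index=True)
#saved here because the classification section reads this file
df_final.to_csv('df_final.csv',index=False)
len(df_final)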

In [None]:
df_final.head()

In [None]:
df_final.tail()

Our input data set is ready! We have 330,936 plays, with half of them offsides.

Before moving on, let's take a look at some random plays to see if everything is right.

In [None]:
def plot_offside(index):
    #getting data for the teams
    receiver = str(lineup[int(df_final.loc[index,'rec'])])
    passer = str(lineup[int(df_final.loc[index,'pass'])])
    
    g0 = (100*df_final.loc[index,'x_'+receiver],100*df_final.loc[index,'y_'+receiver])
    g1 = (100*df_final.loc[index,'x_'+passer],100*df_final.loc[index,'y_'+passer])
    g2 = (100*df_final.loc[index,columns_events[3:25:2]],100*df_final.loc[index,columns_events[4:25:2]])
    g3 = (100*df_final.loc[index,columns_events[25:47:2]],100*df_final.loc[index,columns_events[26:48:2]])
    data = (g0,g1,g2,g3)
    colors = ("green","purple","red", "blue")
    groups = ("receiver","passer","team1", "team2")
    size=(90,90,30,30)

    # Create plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    plt.scatter(fieldx,fieldy,s=5,c="black",alpha=0.3)    

    for data, color, group, size in zip(data, colors, groups, size):
        x, y = data
        ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none', s=size, label=group)

    plt.ylim(-3700, 3700)
    plt.xlim(-5500, 5500)
    if df_final.iloc[index]['offside'] == 1:
        plt.title('Time frame: '+str(index)+", OFFSIDE")
    else:
        plt.title('Time frame: '+str(index)+", NO OFFSIDE")
            
    plt.legend(loc=2)
    plt.show()


In [None]:
plot_offside(random.randint(0, len(df_final)-1))

# 7. Classification Models

Two different classification models will be evaluated: **Random Forest** and **Support Vector Classifier**. For the RF model, we will vary the number of decision trees (1, 10, 100, 1,000 and 10,000). For the SVC model, which is much more computationally demanding, we will vary the size of the training set (25%, 50% and, hopefully, 75%).
At the end, we will compare their performance using the ROC AUC, the accuracy and the confusion matrix.
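
As a reminder of what these metrics measure, here is a small plain-Python sketch on made-up labels. It is not the evaluation code used below, which relies on scikit-learn; the function name and the labels are hypothetical.

```python
def confusion_counts(true_labels, pred_labels):
    """Return (tn, fp, fn, tp) for binary labels, with 1 meaning offside."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(true_labels, pred_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# hypothetical labels, only to show the computation
truth = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 0, 1, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(truth, pred)
print(tn, fp, fn, tp)                  # 3 1 1 3
print((tp + tn) / len(truth))          # 0.75, the accuracy
print(tp / (tp + fn), fp / (fp + tn))  # 0.75 0.25, one point of the ROC curve
```

Because our classifiers return hard 0/1 predictions, each model contributes a single (false positive rate, true positive rate) point to the ROC curve.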

## 7.1 Random Forest Classification

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import roc_curve, auc


We need two sets of data, the features and the labels. A training set with 75% of the data will be used to generate the model.

In [None]:
import pandas as pd
import numpy as np
df_final = pd.read_csv('df_final.csv')
labels = df_final['offside']
features = df_final.drop('offside',axis=1)
feature_list = list(features.columns)
features = np.array(features)
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

Five different Random Forest models will be generated, with 1, 10, 100, 1,000 and 10,000 trees.

In [None]:
roc_auc=[]
accuracy=[]
errors=[]
confusion=[]
for i in range(5):
    print('Number of trees: ',10**i)
    print('Classify data')
    rfc=RandomForestClassifier(n_estimators=10**i,random_state=0)
    rfc.fit(train_features,train_labels)
    pred_labels=rfc.predict(test_features)
    
    print('Measure results')
    false_positive_rate, true_positive_rate, thresholds = roc_curve(test_labels, pred_labels)
    roc_auc.append(auc(false_positive_rate, true_positive_rate))
    
    accuracy.append(metrics.accuracy_score(test_labels,pred_labels))
    #fraction of misclassified plays (a single number per model)
    errors.append(np.sum(abs(pred_labels - test_labels))/len(test_labels))
    confusion.append(metrics.confusion_matrix(test_labels,pred_labels))


In [None]:
print('ok')

In [None]:
roc_auc

In [None]:
accuracy

In [None]:
confusion

## 7.2 Support Vector Classification

In [None]:
from sklearn import svm

In [None]:
roc_auc1=[]
accuracy1=[]
confusion1=[]
for ii in range(3,0,-1):
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25*ii, random_state = 42)
    print('Classify data')
    svc=svm.SVC(random_state=0)
    svc.fit(train_features,train_labels)
    pred_labels=svc.predict(test_features)
    
    print('Measure results')
    false_positive_rate, true_positive_rate, thresholds = roc_curve(test_labels, pred_labels)
    roc_auc1.append(auc(false_positive_rate, true_positive_rate))
    
    accuracy1.append(metrics.accuracy_score(test_labels,pred_labels))
    confusion1.append(metrics.confusion_matrix(test_labels,pred_labels))
    

In [None]:
roc_auc1

In [None]:
accuracy1

In [None]:
confusion1

# 8. Conclusion

To our surprise, the **Random Forest Classifier with 1,000 trees** was the best model! We were expecting the SVC to fit better, because the offside conditions depend on the relative positions of the players, and we thought a method working in a higher-dimensional space would capture that structure better.

When we compare the best RF to the best SVC, the RF performs far better, with an error smaller than 2%! A very impressive result for such a simple model.

# **Final conclusion: we created a model to make a decision based on the position of the players without explicitly setting the rules with an accuracy of 98.5%**