---

**Author:** *Juan José Serrano Gutiérrez* (*juanjoguti* at *correo.ugr.es* or *juanjo.jjserra* at *gmail.com*)

---

# NFL Big Data Bowl 2021

## Overview

In American football, there is a plethora of defensive strategies and outcomes. The National Football League (NFL) has used previous Kaggle competitions to focus on offensive plays, but as the old proverb goes, "defense wins championships". Though metrics for analyzing quarterbacks, running backs, and wide receivers are consistently a part of public discourse, techniques for analyzing the defensive part of the game lag behind. Identifying player, team, or strategic advantages on the defensive side of the ball would be a significant breakthrough for the game.

This competition uses NFL's Next Gen Stats data, which includes the position and speed of every player on the field during each play. You'll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays. There are several different directions for participants to 'tackle' (ha)—which may require levels of football savvy, data aptitude, and creativity.

## Exploratory Data Analysis

The 2021 Big Data Bowl data contains player tracking, play, game, and player level information for all possible passing plays during the 2018 regular season. For purposes of this event, passing plays are considered to be ones on which a pass was thrown, the quarterback was sacked, or any one of five different penalties was called (defensive pass interference, offensive pass interference, defensive holding, illegal contact, or roughing the passer). Data for linemen (both offensive and defensive) are not provided on any play. The focus of this year's contest is on pass coverage.

To start the EDA, we first load the data:

In [None]:
from pathlib import Path
import pandas as pd

path = Path('../input/nfl-big-data-bowl-2021')
df_games = pd.read_csv(path/'games.csv')
df_players = pd.read_csv(path/'players.csv')
df_plays = pd.read_csv(path/'plays.csv')
df_weeks = pd.read_csv(path/'week1.csv')

We have several data frames, so we are going to analyze them one by one.

### Game data

First of all, let's have a look at the data:

In [None]:
df_games

The `games.csv` file contains the teams playing each game. The key variable is `gameId`.
- `gameId`: Game identifier, unique (numeric)
- `gameDate`: Game Date (time, mm/dd/yyyy)
- `gameTimeEastern`: Start time of game (time, HH:MM:SS, EST)
- `homeTeamAbbr`: Home team three-letter code (text)
- `visitorTeamAbbr`: Visiting team three-letter code (text)
- `week`: Week of game (numeric)

Once we have seen our data, we check the fraction of missing values in each column:

In [None]:
df_games.isnull().sum().sort_values(ascending = False)/len(df_games)

Now that we know there are no missing values, we can start analyzing the data.
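
The same missing-value check is repeated for every table in this notebook, so it could be wrapped in a small helper. The sketch below is only an illustration: `missing_ratio` is a hypothetical name and `toy` is a made-up frame standing in for one of the competition tables.

```python
import numpy as np
import pandas as pd

def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    # isnull().mean() is equivalent to isnull().sum() / len(df)
    return df.isnull().mean().sort_values(ascending=False)

# Toy frame (not competition data) with gaps in two columns
toy = pd.DataFrame({
    'gameId': [1, 2, 3, 4],
    'week': [1, 1, np.nan, 2],
    'route': [None, 'GO', None, None],
})

ratios = missing_ratio(toy)
print(ratios)
```

With the real data, the call would simply be `missing_ratio(df_games)`, `missing_ratio(df_players)` and so on.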

One interesting thing we can find out here is the number of games played on each date:

In [None]:
import plotly.express as px

games_by_date = df_games['gameDate'].value_counts().reset_index()
games_by_date.columns = [ 'date', 'games' ]
games_by_date = games_by_date.sort_values('games')

fig = px.bar(
    games_by_date, 
    y = 'date', 
    x = 'games', 
    orientation = 'h', 
    title = 'Number of games by date',
    labels = {'date': 'Game date', 'games': 'Games played'}
)

fig.show()
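
The chart above orders the dates by number of games. If a chronological order is preferred, the dates can be parsed first, since `gameDate` is documented as `mm/dd/yyyy`. A minimal sketch on a toy series (the values below are made up and only stand in for `df_games['gameDate']`):

```python
import pandas as pd

# Toy stand-in for df_games['gameDate'] (format mm/dd/yyyy)
game_dates = pd.Series([
    '09/09/2018', '09/06/2018', '09/09/2018', '09/10/2018', '09/09/2018',
])

games_by_day = (
    pd.to_datetime(game_dates, format='%m/%d/%Y')
      .value_counts()
      .sort_index()               # chronological instead of by count
      .rename_axis('date')
      .reset_index(name='games')
)
print(games_by_day)
```

The resulting frame has the same `date` and `games` columns used by the bar chart, so it could be passed to `px.bar` unchanged.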

In the same way, we could obtain the number of games by game time:

In [None]:
games_by_time = df_games['gameTimeEastern'].value_counts().reset_index()
games_by_time.columns = [ 'time', 'games' ]
games_by_time = games_by_time.sort_values('games')

fig = px.bar(
    games_by_time, 
    y = 'time', 
    x = 'games', 
    orientation = 'h', 
    title = 'Number of games by time',
    labels = {'time': 'Game time', 'games': 'Games played'}
)

fig.show()

We can also count the games each team played at home and away:

In [None]:
home_games = df_games['homeTeamAbbr'].value_counts().reset_index()
home_games.columns = [ 'team', 'games' ]
home_games = home_games.sort_values('games')

fig = px.bar(
    home_games, 
    y = 'team', 
    x = 'games',
    orientation = 'h', 
    title = 'Number of games at home',
    labels = {'team': 'Home team', 'games': 'Games played'}
)

fig.show()

In [None]:
away_games = df_games['visitorTeamAbbr'].value_counts().reset_index()
away_games.columns = [ 'team', 'games' ]
away_games = away_games.sort_values('games')

fig = px.bar(
    away_games, 
    y = 'team', 
    x = 'games',
    orientation = 'h', 
    title = 'Number of games away',
    labels = {'team': 'Away team', 'games': 'Games played'}
)

fig.show()

In a similar way, we could get the number of games per week:

In [None]:
games_per_week = df_games['week'].value_counts().reset_index()
games_per_week.columns = [ 'week', 'games' ]
games_per_week = games_per_week.sort_values('games')

fig = px.bar(
    games_per_week, 
    y = 'week', 
    x = 'games',
    orientation = 'h', 
    title = 'Number of games per week',
    labels = {'week': 'Week', 'games': 'Games played'}
)

fig.show()
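
All the charts in this section follow the same three steps: count the values, name the two columns and sort by count. As a sketch, those steps could be collected in one helper. The name `count_values` is hypothetical and the series below is toy data, not the competition file.

```python
import pandas as pd

def count_values(series: pd.Series, key: str, value: str) -> pd.DataFrame:
    """Two-column frame of value counts, sorted ascending by count."""
    return (
        series.value_counts()
              .rename_axis(key)           # name of the counted category
              .reset_index(name=value)    # name of the count column
              .sort_values(value)
              .reset_index(drop=True)
    )

# Toy stand-in for df_games['week']
weeks = pd.Series([1, 1, 1, 2, 2, 3])
games_per_week = count_values(weeks, 'week', 'games')
print(games_per_week)
```

Naming the columns through `rename_axis` and `reset_index(name=...)` avoids depending on the default column names that `value_counts().reset_index()` produces, which have changed between pandas versions.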

### Player data

Again, we start by looking at the data:

In [None]:
df_players

Player data: The `players.csv` file contains player-level information from players that participated in any of the tracking data files. The key variable is `nflId`.
- `nflId`: Player identification number, unique across players (numeric)
- `height`: Player height (text)
- `weight`: Player weight (numeric)
- `birthDate`: Date of birth (YYYY-MM-DD)
- `collegeName`: Player college (text)
- `position`: Player position (text)
- `displayName`: Player name (text)

The first thing we must do is check the missing values:

In [None]:
df_players.isnull().sum().sort_values(ascending = False)/len(df_players)

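After that, we convert all heights to feet. Some values are stored as `feet-inches` and the rest as plain inches, so we first bring everything to inches and then divide by 12: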

In [None]:
import numpy as np

players_height = df_players['height'].str.split('-',expand=True)
players_height.columns = [ 'first', 'second' ]

players_height.loc[(players_height['second'].notnull()), 'first'] \
= players_height[players_height['second'].notnull()]['first'].astype(np.int16) * 12 \
+ players_height[players_height['second'].notnull()]['second'].astype(np.int16)

df_players['height'] = players_height['first']
df_players['height'] = df_players['height'].astype(np.float32)
df_players['height'] /= 12
df_players
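
The same conversion can also be written as a small function applied value by value, which may be easier to read and to test. This is only a sketch: `height_to_feet` is a hypothetical helper and the sample strings are made up.

```python
import pandas as pd

def height_to_feet(value: str) -> float:
    """Convert '6-2' (feet-inches) or '74' (inches) to feet."""
    if '-' in value:
        feet, inches = value.split('-')
        total_inches = int(feet) * 12 + int(inches)
    else:
        total_inches = float(value)
    return total_inches / 12

# Toy heights mixing both notations
heights = pd.Series(['6-2', '72', '5-11'])
heights_ft = heights.map(height_to_feet)
print(heights_ft)
```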

Now, we can check the height and weight distributions:

In [None]:
fig = px.histogram(
    df_players,
    x = "height", 
    nbins = 20,
    title = 'Height distribution',
    labels = {'height': 'Height'}
)
fig.show()

In [None]:
fig = px.histogram(
    df_players,
    x = "weight", 
    nbins = 20,
    title = 'Weight distribution',
    labels = {'weight': 'Weight'}
)
fig.show()

With the information we have here we could also get the top 50 colleges by the number of players they produce:

In [None]:
top_50_colleges = df_players['collegeName'].value_counts().reset_index()
top_50_colleges.columns = [ 'college', 'players' ]
top_50_colleges = top_50_colleges.sort_values('players').tail(50)

fig = px.bar(
    top_50_colleges, 
    y = 'college', 
    x = 'players', 
    orientation = 'h', 
    title = 'Top 50 colleges by number of players produced',
    labels = {'college': 'College', 'players': 'Players produced'}
)

fig.show()

Furthermore, we could get the most common positions according to the number of players for each one:

In [None]:
positions = df_players['position'].value_counts().reset_index()
positions.columns = [ 'position', 'players' ]
positions = positions.sort_values('players')

fig = px.bar(
    positions, 
    y = 'position', 
    x = 'players', 
    orientation = 'h', 
    title = 'Most common positions by number of players',
    labels = {'position': 'Position abbr.', 'players': 'Players'}
)

fig.show()

For those interested, here is the meaning of the player position abbreviations:
- WR: Wide Receiver
- CB: Cornerback
- RB: Running Back
- TE: Tight End
- OLB: Outside Linebacker
- QB: Quarterback
- FS: Free Safety
- LB: Linebacker
- SS: Strong Safety
- ILB: Inside Linebacker
- DE: Defensive End
- DB: Defensive Back
- MLB: Middle Linebacker
- DT: Defensive Tackle
- FB: Fullback
- P: Punter
- LS: Long Snapper
- S: Safety
- K: Kicker
- HB: Halfback
- NT: Nose Tackle

### Play data

Once again, we look at the data:

In [None]:
df_plays

This time, there are too many columns to display, so we list their names as follows:

In [None]:
df_plays.columns

Play data: The `plays.csv` file contains play-level information from each game. The key variables are `gameId` and `playId`.
- `gameId`: Game identifier, unique (numeric)
- `playId`: Play identifier, not unique across games (numeric)
- `playDescription`: Description of play (text)
- `quarter`: Game quarter (numeric)
- `down`: Down (numeric)
- `yardsToGo`: Distance needed for a first down (numeric)
- `possessionTeam`: Team on offense (text)
- `playType`: Outcome of dropback: sack or pass (text)
- `yardlineSide`: 3-letter team code corresponding to line-of-scrimmage (text)
- `yardlineNumber`: Yard line at line-of-scrimmage (numeric)
- `offenseFormation`: Formation used by possession team (text)
- `personnelO`: Personnel used by offensive team (text)
- `defendersInTheBox`: Number of defenders in close proximity to line-of-scrimmage (numeric)
- `numberOfPassRushers`: Number of pass rushers (numeric)
- `personnelD`: Personnel used by defensive team (text)
- `typeDropback`: Dropback categorization of quarterback (text)
- `preSnapHomeScore`: Home score prior to the play (numeric)
- `preSnapVisitorScore`: Visiting team score prior to the play (numeric)
- `gameClock`: Time on clock of play (MM:SS)
- `absoluteYardlineNumber`: Distance from end zone for possession team (numeric)
- `penaltyCodes`: NFL categorization of the penalties that occurred on the play. For purposes of this contest, the most important penalties are Defensive Pass Interference (DPI), Offensive Pass Interference (OPI), Illegal Contact (ICT), and Defensive Holding (DH). Multiple penalties on a play are separated by a ; (text)
- `penaltyJerseyNumbers`: Jersey number and team code of the player committing each penalty. Multiple penalties on a play are separated by a ; (text)
- `passResult`: Outcome of the passing play (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, text)
- `offensePlayResult`: Yards gained by the offense, excluding penalty yardage (numeric)
- `playResult`: Net yards gained by the offense, including penalty yardage (numeric)
- `epa`: Expected points added on the play, relative to the offensive team. Expected points is a metric that estimates the average of every next scoring outcome given the play's down, distance, yardline, and time remaining (numeric)
- `isDefensivePI`: An indicator variable for whether or not a DPI penalty occurred on a given play (TRUE/FALSE)

Once again, we check for missing values:

In [None]:
df_plays.isnull().sum().sort_values(ascending = False)/len(df_plays)

As we can see, this time we have some missing values. For the columns *penaltyJerseyNumbers* and *penaltyCodes* the share is very high, around 93-94%, presumably because most plays have no penalty. Since those columns give us little useful information, we remove them:

In [None]:
df_plays = df_plays.drop(columns = ['penaltyJerseyNumbers', 'penaltyCodes'])

There are still some columns with a low percentage of missing values that we have to deal with. For those cases, we have two different approaches:
- Drop all the rows with missing values.
- Fill the missing values using a technique such as linear interpolation.

As the percentages are lower than 1%, we can drop all the affected rows without losing too much information:

In [None]:
df_plays = df_plays.dropna()
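
To make the two approaches concrete, here is a minimal sketch on a toy frame. The values are made up, and the fill rule (median for numeric columns, most frequent value for text columns) is just one simple choice used for illustration instead of interpolation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two columns of df_plays
toy = pd.DataFrame({
    'defendersInTheBox': [6.0, 7.0, np.nan, 6.0, 5.0],
    'offenseFormation': ['SHOTGUN', None, 'SHOTGUN', 'EMPTY', 'SHOTGUN'],
})

# Option 1: drop incomplete rows and measure how much is lost
dropped = toy.dropna()
lost_fraction = 1 - len(dropped) / len(toy)

# Option 2: fill numeric columns with the median, text columns with the mode
filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(f'Rows lost by dropping: {lost_fraction:.0%}')
print(filled)
```

Checking `lost_fraction` on the real `df_plays` before calling `dropna()` is a cheap way to confirm that dropping rows is indeed harmless.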

Then, we can start analyzing the data by plotting the number of plays per team:

In [None]:
plays_per_team = df_plays['possessionTeam'].value_counts().reset_index()
plays_per_team.columns = [ 'team', 'plays' ]
plays_per_team = plays_per_team.sort_values('plays')

fig = px.bar(
    plays_per_team, 
    y = 'team',
    x = 'plays',
    orientation = 'h',
    title = 'Number of plays per team',
    labels = {'team': 'Team', 'plays': 'Number of plays'}
)

fig.show()

We could also check the plays by type:

In [None]:
plays_by_type = df_plays['playType'].value_counts().reset_index()
plays_by_type.columns = [ 'type', 'plays' ]
plays_by_type = plays_by_type.sort_values('plays')

fig = px.pie(
    plays_by_type, 
    names = 'type', 
    values = 'plays',  
    title = 'Number of plays of every type'
)

fig.show()

We have the information to plot the number of plays for each down:

In [None]:
plays_per_down = df_plays['down'].value_counts().reset_index()
plays_per_down.columns = [ 'down', 'plays' ]
plays_per_down = plays_per_down.sort_values('plays')

fig = px.pie(
    plays_per_down, 
    names = 'down', 
    values = 'plays',  
    title = 'Number of plays of every down',
)

fig.show()

We could plot the number of plays per quarter too:

In [None]:
plays_per_quarter = df_plays['quarter'].value_counts().reset_index()
plays_per_quarter.columns = [ 'quarter', 'plays' ]
plays_per_quarter = plays_per_quarter.sort_values('plays')

fig = px.pie(
    plays_per_quarter, 
    names = 'quarter', 
    values = 'plays',  
    title = 'Number of plays per quarter'
)

fig.show()

We could also check the plays by "yards to go" category:

In [None]:
plays_by_yardsToGo = df_plays['yardsToGo'].value_counts().reset_index()
plays_by_yardsToGo.columns = [ 'yardsToGo', 'plays' ]
plays_by_yardsToGo = plays_by_yardsToGo.sort_values('plays')

fig = px.bar(
    plays_by_yardsToGo, 
    y = 'yardsToGo', 
    x = "plays", 
    orientation = 'h', 
    title = 'Number of plays by yards to go',
    labels = {'yardsToGo': 'Yards to go', 'plays': 'Number of plays'}
)

fig.show()

Furthermore, we could have a look at the plays for every team yard side or for every yardline:

In [None]:
plays_by_yardlineSide = df_plays['yardlineSide'].value_counts().reset_index()
plays_by_yardlineSide.columns = [ 'yardlineSide', 'plays' ]
plays_by_yardlineSide = plays_by_yardlineSide.sort_values('plays')

fig = px.bar(
    plays_by_yardlineSide, 
    y = 'yardlineSide', 
    x = 'plays', 
    orientation = 'h', 
    title = 'Number of plays by team yardline side',
    labels = {'yardlineSide': 'Team yardline side', 'plays': 'Number of plays'}
)

fig.show()

In [None]:
plays_per_yardlineNumber = df_plays['yardlineNumber'].value_counts().reset_index()
plays_per_yardlineNumber.columns = [ 'yardlineNumber', 'plays' ]
plays_per_yardlineNumber = plays_per_yardlineNumber.sort_values('plays')

fig = px.bar(
    plays_per_yardlineNumber, 
    y = 'yardlineNumber', 
    x = 'plays', 
    orientation = 'h', 
    title = 'Number of plays by team yardline number',
    labels = {'yardlineNumber': 'Team yardline number', 'plays': 'Number of plays'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_plays, 
    x = 'absoluteYardlineNumber',
    nbins = 50,
    title = 'Absolute Yardline Number distribution',
    labels = {'absoluteYardlineNumber': 'Absolute yardline number'}
)

fig.show()

In addition, we could do a couple of plots to show the relation between the plays and the offensive formations:

In [None]:
plays_per_offensiveFormation = df_plays['offenseFormation'].value_counts().reset_index()
plays_per_offensiveFormation.columns = [ 'offenseFormation', 'plays' ]
plays_per_offensiveFormation = plays_per_offensiveFormation.sort_values('plays')

fig = px.pie(
    plays_per_offensiveFormation, 
    names = 'offenseFormation', 
    values = 'plays',
    title = 'Number of plays for every offense formation type'
)

fig.show()

In [None]:
plays_per_personnelO = df_plays['personnelO'].value_counts().reset_index()
plays_per_personnelO.columns = [ 'personnelO', 'plays' ]
plays_per_personnelO = plays_per_personnelO.sort_values('plays')

fig = px.bar(
    plays_per_personnelO, 
    y = 'personnelO', 
    x = 'plays', 
    orientation = 'h', 
    title = 'Number of plays by personnel O.',
    labels = {'personnelO': 'Personnel O.', 'plays': 'Number of plays'}
)

fig.show()

We could do exactly the same with the defensive data:

In [None]:
plays_by_defendersInBox = df_plays['defendersInTheBox'].value_counts().reset_index()
plays_by_defendersInBox.columns = [ 'defendersInTheBox', 'plays' ]
plays_by_defendersInBox = plays_by_defendersInBox.sort_values('plays')

fig = px.bar(
    plays_by_defendersInBox, 
    x = 'defendersInTheBox', 
    y = 'plays',  
    title = 'Number of plays by number of defenders in the box',
    labels = {'defendersInTheBox': 'Number of defenders in the box', 'plays': 'Number of plays'}
)

fig.show()

In [None]:
plays_per_numberOfPassRushers = df_plays['numberOfPassRushers'].value_counts().reset_index()
plays_per_numberOfPassRushers.columns = [ 'numberOfPassRushers', 'plays' ]
plays_per_numberOfPassRushers = plays_per_numberOfPassRushers.sort_values('plays')

fig = px.bar(
    plays_per_numberOfPassRushers, 
    x = 'numberOfPassRushers', 
    y = 'plays',  
    title = 'Number of plays per number of pass rushers',
    labels = {'numberOfPassRushers': 'Number of pass rushers', 'plays': 'Number of plays'}
)

fig.show()

Moreover, we could have a look at the number of plays for every dropback type or pass result:

In [None]:
plays_per_typeDropback = df_plays['typeDropback'].value_counts().reset_index()
plays_per_typeDropback.columns = [ 'typeDropback', 'plays' ]
plays_per_typeDropback = plays_per_typeDropback.sort_values('plays')

fig = px.pie(
    plays_per_typeDropback, 
    names = 'typeDropback', 
    values = 'plays',  
    title = 'Number of plays per type of Dropback',
    labels = {'typeDropback': 'Type of Dropback', 'plays': 'Number of plays'}
)

fig.show()

In [None]:
plays_by_passResult = df_plays['passResult'].value_counts().reset_index()
plays_by_passResult.columns = [ 'passResult', 'plays' ]
plays_by_passResult = plays_by_passResult.sort_values('plays')

fig = px.pie(
    plays_by_passResult, 
    names = 'passResult', 
    values = 'plays',
    title = 'Number of plays for every pass result'
)

fig.show()

Additionally, we can study the distribution of the play results:

In [None]:
fig = px.histogram(
    df_plays, 
    x = 'playResult',
    nbins = 50,
    title = 'Play result distribution',
    labels = {'playResult': 'Play result'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_plays,
    x = 'offensePlayResult',
    nbins = 50,
    title = 'Offense play result distribution',
    labels = {'offensePlayResult': 'Offense play result'}
)

fig.show()

It is also interesting to look at the score before each play:

In [None]:
fig = px.histogram(
    df_plays, 
    x = 'preSnapHomeScore',
    nbins = 50,
    title = 'Pre-snap home score distribution',
    labels = {'preSnapHomeScore': 'Pre-snap home score'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_plays, 
    x = 'preSnapVisitorScore',
    nbins = 50,
    title = 'Pre-snap visitor score distribution',
    labels = {'preSnapVisitorScore': 'Pre-snap visitor score'}
)

fig.show()

### Tracking data

One more time, we start by looking at the data:

In [None]:
df_weeks

Tracking data: Files `week[week].csv` contain player tracking data from all games in week `[week]`. The key variables are `gameId`, `playId`, and `nflId`. There are 17 weeks to a typical NFL Regular Season, and thus 17 data frames with player tracking data are provided.

Each of the 17 `week[week].csv` files contains player tracking data from all passing plays during Week `[week]` of the 2018 regular season. Nearly all plays from each `[gameId]` are included; certain plays or games with insufficient data are dropped. Each team and player plays no more than 1 game in a given week.
- `time`: Time stamp of play (time, yyyy-mm-dd, hh:mm:ss)
- `x`: Player position along the long axis of the field, 0 - 120 yards. See Figure 1 below. (numeric)
- `y`: Player position along the short axis of the field, 0 - 53.3 yards. See Figure 1 below. (numeric)
- `s`: Speed in yards/second (numeric)
- `a`: Acceleration in yards/second^2 (numeric)
- `dis`: Distance traveled from prior time point, in yards (numeric)
- `o`: Player orientation (deg), 0 - 360 degrees (numeric)
- `dir`: Angle of player motion (deg), 0 - 360 degrees (numeric)
- `event`: Tagged play details, including moment of ball snap, pass release, pass catch, tackle, etc (text)
- `nflId`: Player identification number, unique across players (numeric)
- `displayName`: Player name (text)
- `jerseyNumber`: Jersey number of player (numeric)
- `position`: Player position group (text)
- `team`: Team (away or home) of corresponding player (text)
- `frameId`: Frame identifier for each play, starting at 1 (numeric)
- `gameId`: Game identifier, unique (numeric)
- `playId`: Play identifier, not unique across games (numeric)
- `playDirection`: Direction that the offense is moving (text, left or right)
- `route`: Route run by offensive player (text)
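
This notebook only loads `week1.csv`. As a sketch of how several weekly files could be combined, the hypothetical helper `load_weeks` below reads the requested files and concatenates them. Passing `usecols` keeps only the columns that are needed, which may help with memory. The demo writes two tiny made-up files to a temporary folder so that it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_weeks(folder: Path, weeks, usecols=None) -> pd.DataFrame:
    """Concatenate week{n}.csv files for the requested week numbers."""
    frames = [
        pd.read_csv(folder / f'week{n}.csv', usecols=usecols)
        for n in weeks
    ]
    return pd.concat(frames, ignore_index=True)

# Demo with two tiny files (toy data, not the competition files)
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for n in (1, 2):
        sample = pd.DataFrame({
            'gameId': [n, n],
            'x': [10.0, 20.0],
            'route': ['GO', None],
        })
        sample.to_csv(folder / f'week{n}.csv', index=False)

    tracking = load_weeks(folder, weeks=[1, 2], usecols=['gameId', 'x'])

print(tracking)
```

With the competition data, the call would look like `load_weeks(path, range(1, 18))`, assuming the files keep the `week[week].csv` naming described above.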

Now, we should check if we have missing values:

In [None]:
df_weeks.isnull().sum().sort_values(ascending = False)/len(df_weeks)

As we can see, the percentage of missing values for the column *route* is large, around 72%. This column contains the route run by each offensive player, which could be interesting to know.

However, with less than 30% of the values present we are unlikely to learn much from it, so we remove the column:

In [None]:
df_weeks = df_weeks.drop(columns = ['route'])

The columns *o*, *dir*, *position*, *jerseyNumber* and *nflId* also contain some missing values, but they amount to less than 1% of the total, so we remove the rows containing them:

In [None]:
df_weeks = df_weeks.dropna()

Now that our data is prepared, we can start checking the distribution of each variable:

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 'x',
    nbins = 50,
    title = 'X coordinate distribution'
)

fig.show()

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 'y',
    nbins = 50,
    title = 'Y coordinate distribution'
)

fig.show()

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 's',
    nbins = 50,
    title = 'Speed distribution',
    labels = {'s': 'Speed (yards/second)'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 'a',
    nbins = 50,
    title = 'Acceleration distribution',
    labels = {'a': 'Acceleration (yards/second^2)'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 'dis',
    nbins = 50,
    title = 'Distance distribution',
    labels = {'dis': 'Distance (yards)'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 'o',
    nbins = 50,
    title = 'Player orientation distribution',
    labels = {'o': 'Player orientation (degrees)'}
)

fig.show()

In [None]:
fig = px.histogram(
    df_weeks, 
    x = 'dir',
    nbins = 50,
    title = 'Angle of player motion distribution',
    labels = {'dir': 'Angle of player motion (degrees)'}
)

fig.show()

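We also have some interesting information related to events. The first chart counts the tracking rows recorded for each game, and the second counts the rows for each tagged event other than `None`: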

In [None]:
events_per_game = df_weeks['gameId'].value_counts().reset_index()
events_per_game.columns = [ 'gameId', 'events' ]
events_per_game['gameId'] = events_per_game['gameId'].astype(np.int64).astype(str) + '-'
events_per_game = events_per_game.sort_values('events')

fig = px.bar(
    events_per_game, 
    y = 'gameId', 
    x = 'events', 
    orientation = 'h', 
    title = 'Number of events per game',
    labels = {'events': 'Number of events'},
    height = 1000
)

fig.show()

In [None]:
not_none_events = df_weeks[df_weeks['event'] != 'None']['event'].value_counts().reset_index()
not_none_events.columns = [ 'event', 'actions' ]
not_none_events = not_none_events.sort_values('actions')

fig = px.bar(
    not_none_events, 
    y = 'event', 
    x = 'actions', 
    orientation = 'h', 
    title = 'Not none events',
    labels = {'event': 'Events', 'actions': 'Number of actions'},
    height = 1000
)

fig.show()

Additionally, we can see how often each jersey number appears in the tracking data:

In [None]:
jerseys_popularity = df_weeks['jerseyNumber'].value_counts().reset_index()
jerseys_popularity.columns = [ 'jerseyNumber', 'items' ]
jerseys_popularity['jerseyNumber'] = jerseys_popularity['jerseyNumber'].astype(np.int16).astype(str) + '-'
jerseys_popularity = jerseys_popularity.sort_values('items')

fig = px.bar(
    jerseys_popularity, 
    y = 'jerseyNumber', 
    x = 'items', 
    orientation = 'h', 
    title = 'Most popular jerseys',
    labels = {'jerseyNumber': 'Jersey number', 'items': 'Number of jerseys'},
    height = 1000
)

fig.show()

# Model: evaluation of plays



With the data we have, we might be interested in predicting the *play result*. In order to do this, we should determine which predictors are relevant. The model certainly doesn't need the time stamp or the jersey number, so we can start by removing them:

In [None]:
data = df_weeks.drop(columns = ['time', 'jerseyNumber'])

The team is important, so we replace the *home* and *away* labels with the corresponding team abbreviation:

In [None]:
data = pd.merge(data, df_games[['gameId', 'homeTeamAbbr', 'visitorTeamAbbr']], how = 'inner', on = 'gameId')
data.team = data.apply(lambda x: x.homeTeamAbbr if x.team == 'home' else x.visitorTeamAbbr, axis = 1)
data = data.drop(columns = ['homeTeamAbbr', 'visitorTeamAbbr'])
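
Applying a lambda row by row can be slow on a tracking table with many rows. A vectorized alternative with `np.where` is sketched below on a toy frame. The team codes are made up, and the sketch assumes that `team` only contains `home` or `away` at this point.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged tracking/games frame
toy = pd.DataFrame({
    'team': ['home', 'away', 'home'],
    'homeTeamAbbr': ['PHI', 'PHI', 'BAL'],
    'visitorTeamAbbr': ['ATL', 'ATL', 'BUF'],
})

# Pick the home code where team == 'home', the visitor code otherwise
toy['team'] = np.where(
    toy['team'] == 'home',
    toy['homeTeamAbbr'],
    toy['visitorTeamAbbr'],
)
toy = toy.drop(columns=['homeTeamAbbr', 'visitorTeamAbbr'])
print(toy)
```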

Speed, acceleration and distance are measured at every frame, so we replace these values with each player's averages:

In [None]:
avg = data[['nflId', 's', 'a', 'dis']].groupby('nflId').mean()
data = data.drop(columns = ['s', 'a', 'dis'])
data = pd.merge(data, avg, how = 'inner', on = 'nflId')
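
The same per-player averaging can be done in one step with `groupby(...).transform('mean')`, which keeps the original row and column order and avoids the extra merge. A minimal sketch on toy values:

```python
import pandas as pd

# Toy tracking rows for two players (made-up values)
toy = pd.DataFrame({
    'nflId': [1, 1, 2, 2],
    's': [2.0, 4.0, 1.0, 3.0],
    'a': [1.0, 3.0, 0.0, 2.0],
    'dis': [0.2, 0.4, 0.1, 0.3],
})

cols = ['s', 'a', 'dis']
# Each row receives the mean of its own player (nflId)
toy[cols] = toy.groupby('nflId')[cols].transform('mean')
print(toy)
```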

In order to remove the *play direction*, we could make all the *x* components go in the same direction using the *absolute yardline number*:

In [None]:
def xmod(row):
    if row.playDirection == 'left': return row.absoluteYardlineNumber - row.x
    if row.playDirection == 'right': return row.x - row.absoluteYardlineNumber

data = pd.merge(data, df_plays[['gameId', 'playId', 'absoluteYardlineNumber']], how = 'inner', on = ['gameId', 'playId'])
data['x'] = data.apply(xmod, axis = 1)
data = data.drop(columns = ['playDirection', 'absoluteYardlineNumber'])
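
The transformation above measures *x* relative to the line of scrimmage, with positive values in the direction the offense is moving. A vectorized sketch of the same idea on toy rows is shown below. Note one difference, which is an assumption of this sketch: any value other than `right` is treated as `left`, whereas `xmod` returns nothing for unexpected values.

```python
import numpy as np
import pandas as pd

# Toy rows: one play going left, one going right (made-up coordinates)
toy = pd.DataFrame({
    'x': [30.0, 80.0],
    'absoluteYardlineNumber': [35.0, 75.0],
    'playDirection': ['left', 'right'],
})

# +1 when the offense moves right, -1 when it moves left
sign = np.where(toy['playDirection'] == 'right', 1, -1)
toy['x'] = sign * (toy['x'] - toy['absoluteYardlineNumber'])
toy = toy.drop(columns=['playDirection', 'absoluteYardlineNumber'])
print(toy)
```

Both toy players end up 5 yards beyond the line of scrimmage, regardless of the direction of the play.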

It could also be interesting to know the *height* and the *weight* of the players, because they could influence their *speed* or *acceleration*:

In [None]:
data = pd.merge(data, df_players[['nflId', 'height', 'weight']], how = 'inner', on = 'nflId')

The play data also include some variables related to intensity: some players are more focused when the score is tight or when their team needs to come from behind.

To simplify the model, we make one simple assumption: players always play at their best level. Then, we can remove the following variables from our list of predictors:
- *quarter*
- *down*
- *preSnapVisitorScore*
- *preSnapHomeScore*
- *gameClock*

We already included information about the yardline when we made all the *x* components go in the same direction, so we can leave out *yardlineSide* and *yardlineNumber*. We also remove the *play description*, since free text is difficult to process:

In [None]:
plays_info = df_plays[['gameId', 'playId', 'possessionTeam', 'yardsToGo', 'playType', 'offenseFormation', 
                       'personnelO', 'defendersInTheBox', 'numberOfPassRushers', 'personnelD', 'typeDropback', 
                       'passResult', 'offensePlayResult', 'playResult', 'epa', 'isDefensivePI']]

Now, we can merge our processed data with the relevant play information we have chosen:

In [None]:
data = data[['gameId', 'playId', 'x', 'y', 'team', 'displayName', 'position', 
             'height', 'weight', 's', 'a', 'dis', 'o', 'dir', 'event']]
data = pd.merge(data, plays_info, how = 'inner', on = ['gameId', 'playId'])
data = data.drop(columns = ['gameId', 'playId'])

To be able to apply most of the ML algorithms, we should encode some of our variables. Let's check our types:

In [None]:
data.dtypes

We should recode the bools and also the objects. We could start with the boolean ones:

In [None]:
columns_type_bool = data.dtypes[data.dtypes == bool]
cols_to_transform = list(columns_type_bool.index)
cols_to_transform

In [None]:
data[cols_to_transform] = data[cols_to_transform].astype(int)

Now, we do the same for the objects:

In [None]:
columns_type_object = data.dtypes[data.dtypes == object]
cols_to_transform = list(columns_type_object.index)
cols_to_transform

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data[cols_to_transform] = data[cols_to_transform].apply(lambda col: le.fit_transform(col), axis = 0, result_type = 'expand')
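
Because the single `le` instance is refitted on every column, it only remembers the classes of the last column it saw. If the original labels are needed later, one option is to keep one encoder per column. The sketch below uses a toy frame, and the `encoders` dictionary is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two text columns (made-up values)
toy = pd.DataFrame({
    'team': ['PHI', 'ATL', 'PHI'],
    'position': ['WR', 'CB', 'QB'],
})

# One fitted encoder per column, so every mapping can be inverted later
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original team labels from the integer codes
decoded_team = encoders['team'].inverse_transform(toy['team'])
print(toy)
print(decoded_team)
```

Label encoding assigns integers in alphabetical order, which a linear model will read as a numeric scale. For that model, one-hot encoding might be a more suitable choice.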

Finally, we can fit our model. In order to know how good the model is, we apply *K-fold Cross Validation*. For $k = 10$, we prepare our folds as follows:

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

Y = data['playResult']
X = data.drop(columns = 'playResult')

cv = KFold(n_splits = 10, random_state = 1, shuffle = True)
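
Every play contributes many tracking rows, so a shuffled `KFold` can place rows from the same play in both the training and the test fold. A grouped split keeps each play on one side only. The sketch below uses random toy data. The `play_ids` array is hypothetical: in this notebook it would have to be saved before `gameId` and `playId` are dropped.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 plays with 4 tracking rows each
rng = np.random.default_rng(0)
play_ids = np.repeat(np.arange(6), 4)
X_toy = rng.normal(size=(24, 3))
y_toy = rng.normal(size=24)

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in gkf.split(X_toy, y_toy, groups=play_ids):
    # Plays that appear on both sides of the split (should be none)
    shared = set(play_ids[train_idx]) & set(play_ids[test_idx])
    overlaps.append(len(shared))

print(overlaps)
```

`cross_val_score` accepts a `groups` argument, so a splitter like this could be used in place of `cv` in the cells that follow.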

At this point, we just need to choose the model and a metric to measure how good the results are:

## Linear model

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
scores = cross_val_score(model, X, Y, scoring = 'r2', cv = cv, n_jobs = -1)
print('R^2 (coefficient of determination): %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
scores = cross_val_score(model, X, Y, scoring = 'f1_micro', cv = cv, n_jobs = -1)
print('F1-Score (micro): %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

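Both results look good in their respective metrics: almost 90% ($R^2$) for the Linear Regression Model and 100% (micro F1-Score) for the Random Forest Classifier. These scores should be read with care, though: `X` still contains *offensePlayResult*, which differs from *playResult* only by penalty yardage, and every play contributes many tracking rows that are shuffled across folds, so information about the target very likely leaks into training.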

However, due to memory issues, we're not using all the data available. If we use all the `week[week].csv` files, the results will likely be worse. To mitigate this, we can simplify the model by reducing the *play result* to just three labels:
- 1 if the offense gains yards,
- -1 if the offense loses yards, and
- 0 otherwise.

In [None]:
def process_play_result(value):
    # Map net yards to a label: 1 (gain), -1 (loss), 0 (no change)
    if value == 0: return 0
    if value > 0: return 1
    if value < 0: return -1
    
data['playResult'] = data['playResult'].apply(process_play_result)
Y = data['playResult']
X = data.drop(columns = 'playResult')

# Re-create each model explicitly: `model` still holds the Random Forest from the previous cell
model = LinearRegression()
scores = cross_val_score(model, X, Y, scoring = 'r2', cv = cv, n_jobs = -1)
print('R^2 (coefficient of determination) for Linear Regression Model: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

model = RandomForestClassifier()
scores = cross_val_score(model, X, Y, scoring = 'f1_micro', cv = cv, n_jobs = -1)
print('F1-Score (micro) for Random Forest Model: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
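
The three-label mapping is simply the sign of the play result, so it can also be computed without a Python-level function. A minimal sketch with made-up yardage values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['playResult'] (net yards)
play_result = pd.Series([12, 0, -3, 45, -1])

# np.sign returns 1, 0 or -1 for positive, zero and negative values
labels = np.sign(play_result).astype(int)
print(labels.tolist())
```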

On the one hand, we lose information about the exact play result, so it becomes harder to estimate the exact yardage of a play. On the other hand, we get a more accurate model, so it is easier to determine whether the offense gains or loses ground on each play.