# Task 4 - Feature Engineering II

We add further features to give our models more predictive power.

In [1]:
from ift6758.data.nhl_data_parser import NHLDataParser
data_parser = NHLDataParser()

In [2]:
df = data_parser.get_shot_and_goal_pbp_df_for_season(2016)
df.head()

Unnamed: 0,gameId,timeRemaining,periodNumber,timeInPeriod,isGoal,shotType,emptyNet,xCoord,yCoord,zoneCode,...,shootingPlayer,goalieInNet,previousEvent,timeDiff,previousEventX,previousEventY,rebound,distanceDiff,shotAngleDiff,speed
0,2016020001,1129,1,01:11,0,wrist,0,-77.0,5.0,O,...,Mitch Marner,Craig Anderson,blocked-shot,1.0,-61.0,11.0,0,17.088007,0.0,17.088007
1,2016020001,1027,1,02:53,0,wrist,0,86.0,13.0,O,...,Chris Kelly,Frederik Andersen,giveaway,5.0,54.0,-5.0,0,36.71512,0.0,7.343024
2,2016020001,959,1,04:01,0,wrist,0,23.0,-38.0,N,...,Cody Ceci,Frederik Andersen,missed-shot,18.0,-72.0,0.0,0,102.318131,0.0,5.684341
3,2016020001,914,1,04:46,0,slap,0,33.0,-15.0,O,...,Erik Karlsson,Frederik Andersen,missed-shot,19.0,77.0,-2.0,0,45.880279,0.0,2.414752
4,2016020001,794,1,06:46,0,wrist,0,-34.0,28.0,O,...,Martin Marincin,Craig Anderson,hit,16.0,47.0,34.0,0,81.221918,0.0,5.07637
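
The values in the output above look consistent with distanceDiff being the Euclidean distance between the previous event and the shot, and speed being distanceDiff divided by timeDiff. The sketch below reproduces the second row under that reading. The function name `distance_and_speed` and the NaN fallback for a zero time difference are assumptions made for illustration, not the parser's actual implementation.

```python
import math


def distance_and_speed(x, y, prev_x, prev_y, time_diff):
    """Return (distance from the previous event, distance per second).

    Hypothetical helper: the zero-time fallback to NaN is an assumption.
    """
    distance = math.hypot(x - prev_x, y - prev_y)
    speed = distance / time_diff if time_diff > 0 else float("nan")
    return distance, speed


# Second row of the table: shot at (86, 13), previous event at (54, -5), 5 s apart.
distance, speed = distance_and_speed(86.0, 13.0, 54.0, -5.0, 5.0)
print(round(distance, 6), round(speed, 6))  # compare with distanceDiff and speed in that row
```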


Only the previousEventX, previousEventY, distanceDiff, shotAngleDiff and speed columns should contain NaN values.

In [3]:
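# Count missing values per column (a notebook cell only displays its last expression)
df.isna().sum()
# Shots whose angle changed relative to the previous event
df[df['shotAngleDiff'] > 0.0]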

Unnamed: 0,gameId,timeRemaining,periodNumber,timeInPeriod,isGoal,shotType,emptyNet,xCoord,yCoord,zoneCode,...,shootingPlayer,goalieInNet,previousEvent,timeDiff,previousEventX,previousEventY,rebound,distanceDiff,shotAngleDiff,speed
10,2016020001,575,1,10:25,0,slap,0,34.0,-25.0,O,...,Erik Karlsson,Frederik Andersen,shot-on-goal,9.0,34.0,20.0,1,45.000000,4.460848,5.000000
11,2016020001,574,1,10:26,1,backhand,0,82.0,3.0,O,...,Bobby Ryan,Frederik Andersen,shot-on-goal,1.0,34.0,-25.0,1,55.569776,1.245364,55.569776
14,2016020001,431,1,12:49,1,slap,0,34.0,-1.0,O,...,Erik Karlsson,Frederik Andersen,shot-on-goal,5.0,69.0,-8.0,1,35.693137,20.759783,7.138627
26,2016020001,1072,2,02:08,0,backhand,0,-87.0,8.0,O,...,Mark Stone,Frederik Andersen,shot-on-goal,6.0,-35.0,-19.0,1,58.591808,56.579241,9.765301
31,2016020001,681,2,08:39,0,wrist,0,45.0,-22.0,O,...,William Nylander,Craig Anderson,shot-on-goal,8.0,43.0,29.0,1,51.039201,5.663706,6.379900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80362,2016030416,720,2,08:00,0,wrist,0,73.0,38.0,O,...,Ron Hainsey,Pekka Rinne,shot-on-goal,7.0,81.0,7.0,1,32.015621,25.980421,4.573660
80365,2016030416,574,2,10:26,0,snap,0,56.0,15.0,O,...,Phil Kessel,Pekka Rinne,shot-on-goal,11.0,-83.0,6.0,1,139.291062,20.556045,12.662824
80369,2016030416,398,2,13:22,0,wrist,0,-84.0,-25.0,O,...,Viktor Arvidsson,Matt Murray,shot-on-goal,17.0,-65.0,5.0,1,35.510562,66.921779,2.088857
80378,2016030416,836,3,06:04,0,slap,0,37.0,-1.0,O,...,Mattias Ekholm,Matt Murray,shot-on-goal,5.0,84.0,21.0,1,51.894123,75.505796,10.378825
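
The expectation that NaN only appears in the previous-event columns can be turned into an explicit check. This is a minimal sketch on a small made-up frame. The helper name `unexpected_nan_columns` and the toy data are assumptions for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Columns where NaN is expected, as listed in the text above.
EXPECTED_NAN_COLUMNS = {
    "previousEventX",
    "previousEventY",
    "distanceDiff",
    "shotAngleDiff",
    "speed",
}


def unexpected_nan_columns(frame: pd.DataFrame) -> list[str]:
    """Return the columns that contain NaN but are not expected to."""
    nan_counts = frame.isna().sum()
    columns_with_nan = nan_counts[nan_counts > 0].index
    return sorted(set(columns_with_nan) - EXPECTED_NAN_COLUMNS)


# Made-up rows, only to exercise the helper.
toy = pd.DataFrame(
    {
        "xCoord": [-77.0, 86.0, 23.0],
        "previousEventX": [-61.0, float("nan"), -72.0],
        "speed": [17.1, float("nan"), 5.7],
    }
)
print(unexpected_nan_columns(toy))  # []
```

On the real data, `unexpected_nan_columns(df)` returning an empty list would confirm the expectation.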


In [4]:
df.shape[0]

80389

A single season has around 80000 shot/goal events, and only about 4500 of them are missing the previous event's coordinates. That is 4500/80000 = 0.05625, or roughly 5.6% of events. This is not insignificant, but it is not a major concern either.
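
The share of affected rows can also be computed directly from the DataFrame rather than from rounded counts. The sketch below uses a made-up four-row frame; the helper name `missing_share` is an assumption for illustration.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame, columns: list[str]) -> float:
    """Fraction of rows with NaN in at least one of the given columns."""
    return float(frame[columns].isna().any(axis=1).mean())


# Made-up data: rows 1 and 2 are missing at least one coordinate.
toy = pd.DataFrame(
    {
        "previousEventX": [1.0, float("nan"), 3.0, 4.0],
        "previousEventY": [1.0, float("nan"), float("nan"), 4.0],
    }
)
share = missing_share(toy, ["previousEventX", "previousEventY"])
print(f"{share:.1%}")  # 50.0%
```

Calling `missing_share(df, ["previousEventX", "previousEventY"])` on the season DataFrame would give the exact figure behind the ~5.6% estimate.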

## Bonus features (powerplay)

We add a few extra features to improve our models further. They require knowing whether the teams are at even strength and, if not, the status of the powerplay. The features to add are: time since the powerplay started, number of friendly skaters on-ice (excluding goalie), number of opposing skaters on-ice (excluding goalie).
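
As a rough illustration of how these three features could be derived from a game's penalties, here is a simplified sketch. Everything in it is an assumption rather than the project's implementation: the `Penalty` record, the helpers `skaters_on_ice` and `powerplay_features`, and the team codes are hypothetical. It also ignores minors that end early on a goal, coincidental penalties, overtime formats and pulled goalies, and it assumes the timer runs whenever skater counts differ, whichever team has the advantage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Penalty:
    team: str      # team that was penalized
    start: int     # game time in seconds when the penalty began
    duration: int  # penalty length in seconds (120 for a minor)

    def is_active(self, t: int) -> bool:
        return self.start <= t < self.start + self.duration


def skaters_on_ice(team: str, penalties: list[Penalty], t: int) -> int:
    """Skaters (excluding goalie) for `team` at time t, never fewer than 3."""
    active = sum(1 for p in penalties if p.team == team and p.is_active(t))
    return max(5 - active, 3)


def powerplay_features(shooting_team: str, opposing_team: str,
                       penalties: list[Penalty], t: int) -> dict:
    friendly = skaters_on_ice(shooting_team, penalties, t)
    opposing = skaters_on_ice(opposing_team, penalties, t)
    if friendly == opposing:
        time_since = 0  # even strength: no powerplay in progress
    else:
        # The shorthanded team's earliest active penalty marks the start.
        short_team = opposing_team if friendly > opposing else shooting_team
        starts = [p.start for p in penalties
                  if p.team == short_team and p.is_active(t)]
        time_since = t - min(starts)
    return {
        "timeSincePPStarted": time_since,
        "friendlySkaters": friendly,
        "opposingSkaters": opposing,
    }


# Made-up example: TOR takes a minor at 100 s; OTT shoots at 130 s.
penalties = [Penalty(team="TOR", start=100, duration=120)]
print(powerplay_features("OTT", "TOR", penalties, 130))
# {'timeSincePPStarted': 30, 'friendlySkaters': 5, 'opposingSkaters': 4}
```

At even strength the sketch returns 0 for the timer with 5 skaters a side, which matches the values shown for the 5-on-5 rows in the output further down.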

In [5]:
from ift6758.data.nhl_summary_scraper import NHLSummaryScraper
summary_scraper = NHLSummaryScraper()

In [2]:
df = data_parser.get_shot_and_goal_pbp_df("2021020246")
df.head(5)

Unnamed: 0,gameId,timeRemaining,periodNumber,timeInPeriod,isGoal,shotType,emptyNet,xCoord,yCoord,zoneCode,...,timeDiff,previousEventX,previousEventY,rebound,distanceDiff,shotAngleDiff,speed,timeSincePPStarted,friendlySkaters,opposingSkaters
9,2021020246,1039,1,161,0,wrist,0,-59.0,-24.0,O,...,7.0,52.0,-28.0,0,111.072049,0.0,15.867436,0.0,5.0,5.0
10,2021020246,1026,1,174,0,wrist,0,-54.0,14.0,O,...,13.0,-59.0,-24.0,1,38.327536,16.858399,2.948272,0.0,5.0,5.0
16,2021020246,969,1,231,0,wrist,0,-66.0,15.0,O,...,7.0,82.0,7.0,0,148.216059,0.0,21.173723,0.0,5.0,5.0
28,2021020246,821,1,379,0,wrist,0,45.0,-3.0,O,...,15.0,92.0,13.0,0,49.648766,0.0,3.309918,0.0,5.0,5.0
32,2021020246,792,1,408,0,deflected,0,80.0,3.0,O,...,5.0,34.0,31.0,0,53.851648,0.0,10.77033,0.0,5.0,5.0
