<a href="https://colab.research.google.com/github/the-bucketless/nhl_notebooks/blob/main/nzone_faceoff_powerplay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Alison Lukan](https://twitter.com/AlisonL) wrote a piece about [powerplays carrying over into a new period](https://www.nhl.com/kraken/news/analytics-with-alison-seattle-kraken-carryover-power-play/c-327547706).  A common thought is that the clean ice surface would aid the powerplay, but people have noted that it rarely seems to actually benefit the team up a player (Ray Ferraro mentions it every time it comes up in a game he's working).  Alison's work suggests the anecdotal evidence holds up: powerplays that span multiple periods underperform.  

One thing I didn't see mentioned in Alison's article is a comparison between powerplays that start a new period and those that restart from a neutral zone faceoff mid-period.  Any powerplay at the start of a period has to start at center ice, while one that begins mid-period gets to start in the opponent's end of the rink.  That's a noteworthy advantage to starting mid-period as opposed to on the fresh sheet, so it's plausible that the gap Alison found has more to do with where the powerplay gets its zone start than with the period break itself.  So I figured we'd take a quick look to see if it made a difference.  

We'll look strictly at 5-on-4 situations that have a neutral zone faceoff at some point.  We'll estimate the end of each powerplay by the last event recorded in that strength state.  Any powerplay time after that last event goes uncounted, so the time totals come out short and we'll overestimate how effective the powerplays are, but that shouldn't favor either of the two situations we're interested in.  
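To make the overestimate concrete, here's a minimal sketch with made-up numbers.  The two powerplays, the 120-second penalty length and the 105-second last event are all illustrative assumptions, not values from the data.

```python
def goals_per_two_minutes(goals, seconds):
    """Scale a goal count to a rate per 120 seconds of powerplay time."""
    return goals / seconds * 120


# Hypothetical powerplay A: a goal at 60 seconds ends it, so the last event
# (the goal) lines up with the true end of the powerplay.
# Hypothetical powerplay B: no goal, runs the full 120 seconds, but the last
# recorded event happens at 105 seconds.
true_seconds = 60 + 120
estimated_seconds = 60 + 105

true_rate = goals_per_two_minutes(1, true_seconds)
estimated_rate = goals_per_two_minutes(1, estimated_seconds)

# The estimated rate is higher because the denominator is missing 15 seconds.
print(f"true: {true_rate:.3f}, estimated: {estimated_rate:.3f}")
```

The missing time only ever shrinks the denominator, which is why the bias runs in one direction.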

I feel like someone else has done something like this before, but a quick search on [MetaHockey](https://metahockey.com/) didn't turn up anything.  If anyone knows of previous work I should link to, let me know.  Also, I threw this together pretty quickly, so if I've missed something or done something foolish, some light mockery will be tolerated.  

The play-by-play data comes courtesy of [Harry Shomer](https://twitter.com/offsides_review).

In [1]:
import pandas as pd

This next code block will print out which season it's currently working on.  It should take a minute to get through all of them.

In [2]:
seasons = []

for year in range(2007, 2021):
    season = f"{year}{year + 1}"

    print(season)

    pbp_url = f"https://hockey-data.harryshomer.com/pbp/nhl_pbp{season}.csv.gz"
    pbp = pd.read_csv(pbp_url, compression="gzip")

    pbp["game_seconds"] = pbp.Seconds_Elapsed + (pbp.Period - 1) * 1200
    pbp["pp_goal"] = (pbp.Event == "GOAL") & (pbp.Strength == "5x4")

    # This is to keep track of where changes in strength state take place.
    # There's some data cleaning that ought to be done to make sure this works properly, but we're ignoring that.
    pbp["strength_id"] = ((pbp.Home_Players != pbp.Home_Players.shift(1)) 
                          | (pbp.Away_Players != pbp.Away_Players.shift(1))).cumsum()
    
    is_strength_change = pbp.strength_id != pbp.strength_id.shift(1)

    # We need to keep track of whether or not there's a goal in a strength state at any point after a faceoff.
    # We'll do this by checking whether or not an event's strength_id matches the strength_id of the next goal.
    pbp.loc[pbp.pp_goal, "goal_strength_id"] = pbp.strength_id
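    # Assign the result back rather than filling in place; an in-place fill on a
    # selected column isn't guaranteed to update the DataFrame in recent pandas versions.
    pbp["goal_strength_id"] = pbp["goal_strength_id"].bfill()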
    pbp["has_goal"] = pbp.strength_id == pbp.goal_strength_id
    
    # We need to keep track of the most recent neutral zone faceoff within a given powerplay.
    # To ensure we don't end up with faceoffs from previous powerplays, we're starting this new column with
    # a value of -1 wherever a strength change takes place and NA values elsewhere.
    pbp.loc[is_strength_change, "faceoff_time"] = -1

    # Next, we add in the times when neutral zone faceoffs on a powerplay occurred.
    is_pp_nz_faceoff = (pbp.Event == "FAC") & (pbp.Ev_Zone == "Neu") & (pbp.Strength.isin(["5x4", "4x5"]))
    pbp.loc[is_pp_nz_faceoff, "faceoff_time"] = pbp.game_seconds

    # This way, we can use a forward fill to get the most recent faceoff for every event.
    # Events without a preceding neutral zone faceoff in their strength state will be denoted by a -1.
    pbp["faceoff_time"] = pbp["faceoff_time"].ffill()

    powerplay = (
        pbp.loc[pbp.Strength.isin(["5x4", "4x5"])]
        .groupby(["strength_id", "faceoff_time"], as_index=False)
        .agg({
            "has_goal": "first", 
            "game_seconds": "last"
        })
    )

    nzone_pp = powerplay.loc[powerplay.faceoff_time != -1].copy()
    nzone_pp["pp_time"] = nzone_pp.game_seconds - nzone_pp.faceoff_time

    # In case something's gone wrong, we'll get rid of anything with a negative time.
    # If there are any, we should inspect them instead to see what went wrong, but we'll keep on racing through.
    nzone_pp = nzone_pp[nzone_pp.pp_time >= 0].copy()

    # Periods are 1200 seconds long, so a faceoff at a multiple of 1200 opened a period.
    nzone_pp["is_period_start"] = nzone_pp.faceoff_time % 1200 == 0

    # To make the values a little more interpretable, we're listing them as goals per 2 minutes of powerplay time.
    season_summary = nzone_pp.groupby("is_period_start", as_index=False)[["has_goal", "pp_time"]].sum()
    season_summary["goals_per_minor"] = season_summary.has_goal / season_summary.pp_time * 120
    season_summary["season"] = season

    seasons.append(season_summary)

20072008
20082009
20092010
20102011
20112012
20122013
20132014
20142015
20152016
20162017
20172018
20182019
20192020
20202021
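
For anyone who'd rather see the bookkeeping on something small, here's a sketch of the same three tricks (cumulative-sum strength IDs, back-filling the next goal, and forward-filling the most recent neutral zone faceoff) on a toy frame.  The seven events are invented for illustration, and the player counts here are skater counts chosen to keep the toy readable, which may not match how the real data counts players.

```python
import pandas as pd

# Made-up events: a 5x4 powerplay that runs across a period break at 1200 seconds.
toy = pd.DataFrame({
    "game_seconds": [1100, 1130, 1150, 1200, 1215, 1240, 1260],
    "Home_Players": [5, 5, 5, 5, 5, 5, 5],
    "Away_Players": [5, 4, 4, 4, 4, 4, 5],
    "Strength": ["5x5", "5x4", "5x4", "5x4", "5x4", "5x4", "5x5"],
    "Event": ["SHOT", "FAC", "SHOT", "FAC", "SHOT", "GOAL", "FAC"],
    "Ev_Zone": ["Off", "Off", "Off", "Neu", "Off", "Off", "Neu"],
})

# A new ID starts every time either team's player count changes.
changed = ((toy.Home_Players != toy.Home_Players.shift(1))
           | (toy.Away_Players != toy.Away_Players.shift(1)))
toy["strength_id"] = changed.cumsum()

# Tag each event with the strength_id of the next powerplay goal, then compare.
is_pp_goal = (toy.Event == "GOAL") & (toy.Strength == "5x4")
toy.loc[is_pp_goal, "goal_strength_id"] = toy.strength_id
toy["goal_strength_id"] = toy["goal_strength_id"].bfill()
toy["has_goal"] = toy.strength_id == toy.goal_strength_id

# -1 marks the start of each strength state, so the forward fill can't
# carry a faceoff over from an earlier powerplay.
toy.loc[changed, "faceoff_time"] = -1
is_pp_nz_faceoff = (toy.Event == "FAC") & (toy.Ev_Zone == "Neu") & (toy.Strength == "5x4")
toy.loc[is_pp_nz_faceoff, "faceoff_time"] = toy.game_seconds
toy["faceoff_time"] = toy["faceoff_time"].ffill()

# One row per (strength state, most recent neutral zone faceoff) segment.
segments = (
    toy.loc[toy.Strength == "5x4"]
    .groupby(["strength_id", "faceoff_time"], as_index=False)
    .agg({"has_goal": "first", "game_seconds": "last"})
)

print(toy[["game_seconds", "strength_id", "faceoff_time", "has_goal"]])
print(segments)
```

In the toy, the powerplay splits into two segments: the part before the period-opening faceoff (marked -1 and dropped later) and the part after it, which starts at 1200 and ends at the goal at 1240.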


After all the work is done, we can inspect how things look in each season we have data for.  For those not reading the code comments, goals_per_minor isn't goals per minor penalty but goals per two minutes of powerplay time.

In [3]:
all_seasons = pd.concat(seasons)
all_seasons

|   | is_period_start | has_goal | pp_time | goals_per_minor | season |
|---|---|---|---|---|---|
| 0 | False | 335 | 307572.0 | 0.130701 | 20072008 |
| 1 | True | 30 | 29885.0 | 0.120462 | 20072008 |
| 0 | False | 119 | 111655.0 | 0.127894 | 20082009 |
| 1 | True | 35 | 30176.0 | 0.139183 | 20082009 |
| 0 | False | 120 | 101734.0 | 0.141546 | 20092010 |
| 1 | True | 25 | 26697.0 | 0.112372 | 20092010 |
| 0 | False | 124 | 104619.0 | 0.142230 | 20102011 |
| 1 | True | 27 | 25991.0 | 0.124659 | 20102011 |
| 0 | False | 104 | 96815.0 | 0.128906 | 20112012 |
| 1 | True | 22 | 25340.0 | 0.104183 | 20112012 |
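
As a quick check on what goals_per_minor means, here's the arithmetic for the 20072008 rows redone by hand.  The helper name `goals_per_two_minutes` is just for this sketch and isn't part of the notebook.

```python
def goals_per_two_minutes(goals, pp_seconds):
    """Goals scored per 120 seconds of powerplay time."""
    return goals / pp_seconds * 120


# has_goal and pp_time (in seconds) taken from the 20072008 rows.
in_period_rate = goals_per_two_minutes(335, 307572.0)
period_start_rate = goals_per_two_minutes(30, 29885.0)

print(round(in_period_rate, 6), round(period_start_rate, 6))
```

Both values should line up with the goals_per_minor column for that season.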


Finally, we can look at the grand totals for all the seasons combined.

In [4]:
totals = all_seasons.groupby("is_period_start")[["has_goal", "pp_time"]].sum()
totals["goals_per_minor"] = totals.has_goal / totals.pp_time * 120

totals

| is_period_start | has_goal | pp_time | goals_per_minor |
|---|---|---|---|
| False | 1613 | 1357876.0 | 0.142546 |
| True | 379 | 343991.0 | 0.132213 |


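When I first did this without accounting for multiple neutral zone faceoffs within a powerplay, I got a much different result.  With them accounted for, powerplays that open a period score about 0.132 goals per two minutes against about 0.143 for those restarting from a neutral zone faceoff mid-period.  With a gap that small, it's hard to say the carryover effect isn't just the location of the zone start.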
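
For a rough sense of how much weight that gap can bear, here's a sketch that puts an approximate 95% interval around the ratio of the two rates.  It treats the goal counts as Poisson and the segments as independent, which are my assumptions for this sketch and not something the analysis above checks, so take it as a back-of-the-envelope look and not a proper test.

```python
import math


def rate_ratio_interval(goals_a, seconds_a, goals_b, seconds_b, z=1.96):
    """Ratio of two event rates with a normal-approximation interval on the log scale."""
    ratio = (goals_a / seconds_a) / (goals_b / seconds_b)
    # Standard error of log(ratio) when both counts are treated as Poisson.
    se_log = math.sqrt(1 / goals_a + 1 / goals_b)
    low = ratio * math.exp(-z * se_log)
    high = ratio * math.exp(z * se_log)
    return ratio, low, high


# Period-start totals over in-period totals, taken from the table above.
ratio, low, high = rate_ratio_interval(379, 343991.0, 1613, 1357876.0)
print(f"ratio: {ratio:.3f}, interval: ({low:.3f}, {high:.3f})")
```

With these totals the ratio lands a little under 1 and the interval straddles 1, which fits the cautious reading above.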