# Introduction
There was a thought from my playing days that revolved around a quote by or about Wayne Gretzky, which I'll do my best not to butcher. It concerned why he would take such a long first shift, or why his coach would double-shift him on his first shift. It was so he could catch his "Second Wind": since he was tired after his first shift, he would catch his breath and thereafter play better.

The coach or Wayne would do this intentionally so that he got into the game early and played better for the rest of it. I want to see if this has any credence or is just a myth that a coach made up to play his best player more often at the beginning of the game.

## Hypothesis

Null Hypothesis: The length of a player's first shift has no impact on that player's goal output for the game.

Alternative Hypothesis: A longer first shift increases the player's goals for that game.

## Measurement:
There are plenty of ways to measure a player's effect on a game, but we are going to limit ourselves to goals for that game and move on to other metrics later.

To decide what counts as a "long" shift, we are going to calculate the average shift length for all players in the league and view how it fits on the distribution.

Then, for each player in each game, we compare the first shift to that average length and place the player in a bin (Long Shift, Not Long Shift).

Finally, we compare goals for all players over a season to see whether first-shift length had an impact on goals scored in a game.
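
As a rough sketch of the binning step described above, the snippet below labels each first shift relative to an average shift length. The data frame and the 40-second average are made up for illustration; only the column names `Duration` and `Shift_Category` follow the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy first-shift data (values invented for illustration)
toy_first_shifts = pd.DataFrame({
    "Player_Id": [1, 2, 3, 4],
    "Duration": [25, 40, 41, 75],   # seconds
})

toy_avg_shift = 40  # assumed league-average shift length, in seconds

# Strictly above the average counts as "Long"; everything else is "Not Long"
toy_first_shifts["Shift_Category"] = np.where(
    toy_first_shifts["Duration"] > toy_avg_shift, "Long", "Not Long"
)
print(toy_first_shifts)
```

Note that a shift exactly equal to the average lands in "Not Long" here, which is a choice rather than a rule.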

## Bias/Assumptions
Outcomes/Goals: We know that all players are not the same. For example, Connor McDavid is going to be more valuable during his shifts, and is going to have more shifts, than a below-average NHL player. The hope is that by viewing the entire NHL population as a whole, we average out outliers like Connor McDavid and his less talented counterparts.

Inputs/Shifts: 

I had quite a few thoughts around this one:
The first one: is a shift long if it is just higher than the average, or is there some cutoff that makes a shift long? If, for example, the average shift is 40 seconds, should a shift that is 41 seconds be considered long? To keep the experiment simple, on the first attempt I'm going to say yes, that shift is "long". A future adaptation might be to count a shift as "long" only if it is at least 1 standard deviation above the average shift. Roughly speaking, in a normal distribution, a shift that is 1 s.d. above the mean sits at the 84th percentile. Now that's a long shift.
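
The 84th-percentile figure can be checked with a couple of lines of standard-library Python. This is only a sanity check of the normal-distribution claim; it assumes shift lengths are roughly normal, which real shift data may not be.

```python
from statistics import NormalDist

# Share of a standard normal distribution lying below (mean + 1 s.d.)
share_below_one_sd = NormalDist(mu=0, sigma=1).cdf(1)
print(round(share_below_one_sd * 100, 1))  # about 84.1
```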

Another thought from my playing days: I would sometimes take a really short first shift and then make my second shift "long" to help me get into the game. But for the sake of the experiment we're only going to focus on the first shift. The thought here is that in an NHL game, if you are a forward on the 4th line, your second shift might not occur until (40 (sec) * 4 (# of lines) * 2 (Iterations/Shifts)) = 320 seconds, and 320 / 60 ≈ 5.3 minutes into the game. Within that time, so many "events" (goals, power plays, penalties against, TV timeouts) might occur and affect how shifts are distributed.

Finally, we are only going to consider the first shift of the first period. NHL intermissions are rather long (18 minutes), but we are going to assume that the players are into the game at that point.

## Considerations
Do we want to remove outliers from the data, such as the top and bottom percent of shifts?

Do we want to standardize the shifts, so that it is easier to see whether they are above average?

Do we want to require a shift to be one standard deviation above the mean to count as longer than average?
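
To make the standardizing idea concrete, here is a small sketch that converts durations to z-scores and flags shifts more than one standard deviation above the mean. The durations are invented, and `toy_durations`, `z_scores` and `is_long` are hypothetical names, not part of the analysis below.

```python
import pandas as pd

# Invented shift durations in seconds
toy_durations = pd.Series([20, 35, 40, 45, 110], name="Duration")

# z-score: how many standard deviations each shift sits from the mean
z_scores = (toy_durations - toy_durations.mean()) / toy_durations.std()

# Flag shifts more than one standard deviation above the mean
is_long = z_scores > 1
print(pd.DataFrame({"Duration": toy_durations, "z": z_scores.round(2), "Long": is_long}))
```

With a stricter rule like this, far fewer shifts end up in the "long" group than with a plain above-average cutoff.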

# Experiment

## Type: Diff from Diff
Since we have two categories, short versus long first shift, I am going to do a diff from diff, that is, compare the difference in average outcomes between the two groups. My assumption is that the population of players and their outcomes is fairly uniform, and that the only difference is their first-shift length. Then I can compare the outcomes (goals) to see if first-shift length has an impact.

## Type: Linear Regression
After I have decided whether or not to reject the null hypothesis, I am going to run a linear regression to see whether first-shift length affects the likelihood of a goal.
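
As a minimal sketch of what such a regression could look like, the snippet below fits a straight line of goals against first-shift duration with `numpy.polyfit`. The data is synthetic and drawn so that duration has no effect on goals; it is only meant to show the mechanics, not a result. Since goals are counts, a Poisson or logistic model might be a better fit than a plain linear one.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: first-shift duration (seconds) and goals in that game.
# Goals are drawn independently of duration, so the true slope is zero.
durations = rng.uniform(10, 120, size=5000)
goals = rng.poisson(lam=0.15, size=5000)

# Ordinary least squares fit: goals ~ intercept + slope * duration
slope, intercept = np.polyfit(durations, goals, deg=1)
print(f"slope per second: {slope:.5f}, intercept: {intercept:.3f}")
```

A slope close to zero means an extra second of first shift is not associated with more goals in this toy data.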

# Data

In [None]:
import pandas as pd
import numpy as np
import hockey_scraper as hs
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [None]:
shift2015 = pd.read_csv("../hockey_scraper_data/csvs/nhl_shifts_20152016.csv")
pbp2015 = pd.read_csv("../hockey_scraper_data/csvs/nhl_pbp_20152016.csv")
shift2016 = pd.read_csv("../hockey_scraper_data/csvs/nhl_shifts_20162017.csv")
pbp2016 = pd.read_csv("../hockey_scraper_data/csvs/nhl_pbp_20162017.csv")

In [None]:
#show full output on DataFrame Rows
pd.set_option('display.max_rows', 500)
# Show full number on describes
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [None]:
shift2015[:1]

In [None]:
shift2015RowNumber = shift2015.shape[0]
shift2015.shape

In [None]:
pbp2015[:1]

In [None]:
pbp2015RowNumber = pbp2015.shape[0]
pbp2015.shape

In [None]:
shift2016RowNumber = shift2016.shape[0]
shift2016.shape

In [None]:
pbp2016RowNumber = pbp2016.shape[0]
pbp2016.shape

In [None]:
shift2015.dtypes

In [None]:
pbp2015.dtypes

Important metrics for our experiment:

shift data:
We need every first shift for every player in the data: for every game (Game_Id), find the first shift (Unnamed: 0) for every player (Player_Id), then add that player to the "Long" or "Not Long" category.

pbp2015:
For every game and every player, calculate whether they scored (go through every PBP entry for every game and every player).


# Data Clean Up

In [None]:
# Rename column "Unnamed: 0" to shift_Id
shift2015.rename(columns = {"Unnamed: 0" : "Shift_Id"}, inplace = True)
shift2016.rename(columns = {"Unnamed: 0" : "Shift_Id"}, inplace = True)

In [None]:
# Rename column "Unnamed: 0" to Pbp_Id
pbp2015.rename(columns = {"Unnamed: 0" : "Pbp_Id"}, inplace = True)
pbp2016.rename(columns = {"Unnamed: 0" : "Pbp_Id"}, inplace = True)

# Convert player IDs to ints, not floats

# Replace NaN and infinite values with a placeholder (0), then cast to int.
# Assigning the result back avoids relying on chained in-place replacement.
pbp2015['p1_ID'] = pbp2015['p1_ID'].replace([np.inf, -np.inf, np.nan], 0).astype(int)
pbp2016['p1_ID'] = pbp2016['p1_ID'].replace([np.inf, -np.inf, np.nan], 0).astype(int)

## Remove Unwanted Data

In [None]:
# Drop columns
# Need to keep goalie IDs to drop the goalies from the Shift Data
# The same columns are dropped for both seasons, so list them once
pbpDropColumns = ['Description','Type','Ev_Team', 'Home_Zone', 'Away_Team', 'Home_Team',
       'Time_Elapsed', 'Seconds_Elapsed', 'Strength', 'Ev_Zone','awayPlayer1', 'awayPlayer1_id',
       'awayPlayer2', 'awayPlayer2_id', 'awayPlayer3', 'awayPlayer3_id',
       'awayPlayer4', 'awayPlayer4_id', 'awayPlayer5', 'awayPlayer5_id',
       'awayPlayer6', 'awayPlayer6_id', 'homePlayer1', 'homePlayer1_id',
       'homePlayer2', 'homePlayer2_id', 'homePlayer3', 'homePlayer3_id',
       'homePlayer4', 'homePlayer4_id', 'homePlayer5', 'homePlayer5_id',
       'homePlayer6', 'homePlayer6_id', 'Away_Players', 'Home_Players',
       'Away_Score', 'Home_Score', 'xC', 'yC', 'Home_Coach',
       'Away_Coach']
pbp2015.drop(columns = pbpDropColumns, inplace = True)
pbp2016.drop(columns = pbpDropColumns, inplace = True)

In [None]:
print("Play by Play 2015", pbp2015.shape)
print("Play by Play 2016", pbp2016.shape)

In [None]:
# We want to remove Goalies from the shift Data
# Their shifts will skew the data

# Find all the unique goalie IDs from the Home and Away Goalie Ids in the Play By Play Data
# Union will drop duplicates
allGoalies2015 = np.union1d( pd.unique(pbp2015["Away_Goalie_Id"]), pd.unique(pbp2015["Home_Goalie_Id"]))
# Drop the empty fields
allGoalies2015 = allGoalies2015[~np.isnan(allGoalies2015)]


# Find all the unique goalie IDs from the Home and Away Goalie Ids in the Play By Play Data
# Union will drop duplicates
allGoalies2016 = np.union1d( pd.unique(pbp2016["Away_Goalie_Id"]), pd.unique(pbp2016["Home_Goalie_Id"]))
# Drop the empty fields
allGoalies2016 = allGoalies2016[~np.isnan(allGoalies2016)]

In [None]:
print("Goalie Shape 2015", allGoalies2015.shape)
print("Goalie Shape 2016", allGoalies2016.shape)

In [None]:
# We want to drop all goalies from the shift data
shift2015 = shift2015[~shift2015["Player_Id"].isin(allGoalies2015)]

shift2016 = shift2016[~shift2016["Player_Id"].isin(allGoalies2016)]

In [None]:
print("Shifts without Goalies 2015", shift2015.shape)
print("Row dropped from 2015: ",  str(shift2015RowNumber - shift2015.shape[0]) , "Rows Dropped")
print("Shifts without Goalies 2016", shift2016.shape)
print("Row dropped from 2016: ",  str(shift2016RowNumber - shift2016.shape[0]) , "Rows Dropped")

## Subset Data

In [None]:
# Subset Play by play data to be only goal data for counting later on. 
goalEvents2015 = pbp2015.loc[pbp2015["Event"]== "GOAL", :].copy()
# Rename the Play by Play Id to be Goal_Id
goalEvents2015.rename(columns={'Pbp_Id': 'Goal_Id'}, inplace=True)
# Drop all non-goal columns (this also drops the assist columns p2_ID and p3_ID; keep them if you want to count points)
goalEvents2015.drop(columns=["Date", 
                             "Period", 
                             "Event", 
                             "p1_name", 
                             "p2_name", 
                             "p2_ID", 
                             "p3_name", 
                             "p3_ID", 
                             "Away_Goalie", 
                             "Away_Goalie_Id",  
                             "Home_Goalie", 
                             "Home_Goalie_Id"], inplace = True)
goalEvents2016 = pbp2016.loc[pbp2016["Event"]== "GOAL", :].copy()
goalEvents2016.rename(columns={'Pbp_Id': 'Goal_Id'}, inplace=True)
goalEvents2016.drop(columns=["Date", 
                             "Period", 
                             "Event", 
                             "p1_name", 
                             "p2_name", 
                             "p2_ID", 
                             "p3_name", 
                             "p3_ID", 
                             "Away_Goalie", 
                             "Away_Goalie_Id",  
                             "Home_Goalie", 
                             "Home_Goalie_Id"], inplace = True)

In [None]:
print("Goals in 2015", goalEvents2015.shape[0])
print("Goals in 2016", goalEvents2016.shape[0])

In [None]:
# Limit the Shifts to the first period
firstPeriodShift2015 = shift2015.loc[shift2015["Period"] == 1, :].copy()
firstPeriodShift2016 = shift2016.loc[shift2016["Period"] == 1, :].copy()

In [None]:
# Get a list of all the Game Ids
gameIds2015 = firstPeriodShift2015["Game_Id"].unique()
gameIds2016 = firstPeriodShift2016["Game_Id"].unique()
print("Games in 2015:", gameIds2015.shape[0])
print("Games in 2016:", gameIds2016.shape[0])

In [None]:
def getFirstShift(gameIds, shifts):
    #Go through every game
    firstShift = []
    # For every gameId in the game IDs
    for game in gameIds:
        # Find all the shifts that game
        gameShifts = shifts[(shifts["Game_Id"] == game)]
        # Find the first shift for every player_id
        # Group by the Player Ids
        # Then take out the Shift_Id, Player_Id, Duration and Game_Id fields
        # Then take the first instance of that
        playerShifts = gameShifts.groupby("Player_Id")[["Shift_Id","Game_Id", "Player_Id", "Duration"]].first()
        # Add on the first shift for all those players
        firstShift.append(playerShifts)
    return pd.concat(firstShift)
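
The same "first shift per player per game" idea can also be sketched without a Python loop. This is an alternative under an assumption, not a replacement for the function above: it assumes `Shift_Id` increases in chronological order within a game, which the loop above also relies on implicitly. The toy table is invented for illustration.

```python
import pandas as pd

# Toy shift table; column names follow the ones used above
toy_shifts = pd.DataFrame({
    "Shift_Id":  [0, 1, 2, 3, 4, 5],
    "Game_Id":   [20001, 20001, 20001, 20002, 20002, 20002],
    "Player_Id": [10, 11, 10, 10, 11, 11],
    "Duration":  [45, 30, 50, 60, 20, 35],
})

# Sort so the earliest shift comes first, then keep one row per (game, player)
toy_first = (
    toy_shifts.sort_values("Shift_Id")
    .drop_duplicates(subset=["Game_Id", "Player_Id"], keep="first")
    [["Shift_Id", "Game_Id", "Player_Id", "Duration"]]
    .reset_index(drop=True)
)
print(toy_first)
```

On a full season of shifts this avoids filtering the data frame once per game, which is where most of the time in a loop-based version tends to go.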
            

In [None]:
# Find all the first shifts for 2015
firstShift2015 = getFirstShift(gameIds2015, firstPeriodShift2015)


In [None]:
firstShift2015.head()

In [None]:
firstShift2016 = getFirstShift(gameIds2016, firstPeriodShift2016)

### Understanding Data

In [None]:
print("All Shift Data 2015")
shift2015[["Duration"]].describe()

In [None]:
print("All Shift Data 2016")
shift2016[["Duration"]].describe()

In [None]:
print("All First Shift Data 2015")
firstShift2015[["Duration"]].describe()

In [None]:
print("All First Shift Data 2016")
firstShift2016[["Duration"]].describe()

## Distributions
Skewness is a measure of the asymmetry of a distribution. In a normal distribution, the mean divides the curve symmetrically into two equal parts at the median, and the skewness is zero. When the skewness is negative, the tail of the distribution is longer towards the left-hand side of the curve. When the skewness is positive, the tail of the distribution is longer towards the right-hand side of the curve.


Kurtosis is the second of the two measures that quantify the shape of a distribution. It describes how heavy the tails are, in other words how much of the data sits in outliers, and is often read as the peakedness of the distribution. If the distribution is tall and thin, its kurtosis is greater than 3 (the kurtosis of a normal distribution), with values concentrated near the mean or out at the extremes. A flat distribution, where the values are moderately spread out, has a kurtosis below 3.
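
A quick way to get a feel for both measures is to compute them on a small, deliberately right-skewed sample. The durations below are invented: mostly ordinary shifts plus two very long ones.

```python
import pandas as pd

# Right-skewed toy durations: mostly short shifts plus a few very long ones
toy = pd.Series([30, 35, 38, 40, 42, 44, 45, 47, 50, 55, 120, 200])

print("skew:", toy.skew())                 # positive => longer right tail
print("excess kurtosis:", toy.kurtosis())  # measured relative to a normal distribution
```

Both values come out positive here, which is the pattern to expect from shift data with a handful of extremely long shifts.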

### All Player Shifts

In [None]:
print( "Distribution skew of 2015 Shifts", shift2015["Duration"].skew())
print( "Distribution peakness of 2015 Shifts", shift2015["Duration"].kurtosis())

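With the skewness being greater than 1, at 1.63, the data is highly skewed.

If the distribution is tall and thin, it is called a leptokurtic distribution (kurtosis > 3). Note that pandas' `kurtosis()` reports excess kurtosis (kurtosis minus 3), so the equivalent threshold in the output above is 0.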

In [None]:
fig, ax = plt.subplots()
ax.hist(shift2015["Duration"], bins= 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2015 Shift Length')

fig.tight_layout()
plt.xlim(shift2015["Duration"].min(), shift2015["Duration"].max())
plt.show()

In [None]:
print( "Distribution skew of 2016 Shifts", shift2016["Duration"].skew())
print( "Distribution peakness of 2016 Shifts", shift2016["Duration"].kurtosis())

In [None]:
fig, ax = plt.subplots()
ax.hist(shift2016["Duration"],bins= 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2016 Shift Length')

fig.tight_layout()
plt.xlim(shift2016["Duration"].min(), shift2016["Duration"].max())
plt.show()

### First Shifts

In [None]:
print( "Distribution skew of 2015 First Shifts", firstShift2015["Duration"].skew())
print( "Distribution peakness of 2015 First Shifts", firstShift2015["Duration"].kurtosis())

In [None]:
fig, ax = plt.subplots()
ax.hist(firstShift2015["Duration"],bins= 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2015 First Shift Length')

fig.tight_layout()
plt.xlim(firstShift2015["Duration"].min(), firstShift2015["Duration"].max())
plt.show()

In [None]:
print( "Distribution skew of 2016 First Shifts", firstShift2016["Duration"].skew())
print( "Distribution peakness of 2016 First Shifts", firstShift2016["Duration"].kurtosis())

In [None]:
fig, ax = plt.subplots()
ax.hist(firstShift2016["Duration"],bins = 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2016 First Shift Length')

fig.tight_layout()
plt.xlim(firstShift2016["Duration"].min(), firstShift2016["Duration"].max())
plt.show()

## Outlier Removal

Since my data is right-tailed, meaning there are some shifts that are extremely long, those shifts drag my mean to the right of the median, the middle point of the data.

I propose removing any shifts of 128 seconds or longer. That cutoff is nearly three times the average shift (44 seconds), so it still keeps shifts twice the average length (about 88 seconds), the equivalent of double-shifting your best player, which is what Gretzky supposedly did.

In [None]:
withOutliers2015 = shift2015.shape[0]
shift2015= shift2015[shift2015["Duration"] < 128.0 ]
print("All Shifts Rows dropped:", withOutliers2015 -  shift2015.shape[0])

In [None]:
withOutliers2016 = shift2016.shape[0]
shift2016 = shift2016[shift2016["Duration"] < 128.0 ]
print("All Shifts Rows dropped:", withOutliers2016 -  shift2016.shape[0])

In [None]:
FSOutliers2015 = firstShift2015.shape[0]
firstShift2015= firstShift2015[firstShift2015["Duration"] < 128.0 ]
print("First Rows dropped:", FSOutliers2015 -  firstShift2015.shape[0])

In [None]:
FSOutliers2016 = firstShift2016.shape[0]
firstShift2016= firstShift2016[firstShift2016["Duration"] < 128.0 ]
print("First Rows dropped:", FSOutliers2016 -  firstShift2016.shape[0])

### All Shifts

In [None]:
print( "Distribution skew of 2015 Shifts", shift2015["Duration"].skew())
print( "Distribution peakness of 2015 Shifts", shift2015["Duration"].kurtosis())

Trimming the long tail seems to have brought the data to a relatively acceptable skewness of 0.59 and a peakedness (kurtosis) value of 2.24. We don't want to trim the data too much, because we need enough long shifts to represent a double shift and see the shift's effect on the player's game.

In [None]:
fig, ax = plt.subplots()
ax.hist(shift2015["Duration"], bins= 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2015 Shift Length')

fig.tight_layout()
plt.xlim(shift2015["Duration"].min(), shift2015["Duration"].max())
plt.show()

In [None]:
print( "Distribution skew of 2016 Shifts", shift2016["Duration"].skew())
print( "Distribution peakness of 2016 Shifts", shift2016["Duration"].kurtosis())

In [None]:
fig, ax = plt.subplots()
ax.hist(shift2016["Duration"],bins= 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2016 Shift Length')

fig.tight_layout()
plt.xlim(shift2016["Duration"].min(), shift2016["Duration"].max())
plt.show()

### First Shift

In [None]:
print( "Distribution skew of 2015 First Shifts", firstShift2015["Duration"].skew())
print( "Distribution peakness of 2015 First Shifts", firstShift2015["Duration"].kurtosis())

In [None]:
fig, ax = plt.subplots()
ax.hist(firstShift2015["Duration"],bins= 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2015 First Shift Length')

fig.tight_layout()
plt.xlim(firstShift2015["Duration"].min(), firstShift2015["Duration"].max())
plt.show()

In [None]:
print( "Distribution skew of 2016 First Shifts", firstShift2016["Duration"].skew())
print( "Distribution peakness of 2016 First Shifts", firstShift2016["Duration"].kurtosis())

In [None]:
fig, ax = plt.subplots()
ax.hist(firstShift2016["Duration"],bins = 100, density = True)
ax.set_xlabel('Shift Duration (Seconds)')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of 2016 First Shift Length')

fig.tight_layout()
plt.xlim(firstShift2016["Duration"].min(), firstShift2016["Duration"].max())
plt.show()

# Experiment

In [None]:
# Decide the average shift
avgShift2015 = round(shift2015["Duration"].mean())
avgShift2016 = round(shift2016["Duration"].mean())


In [None]:
print("Average Shift 2015:", avgShift2015)
print("Average Shift 2016:", avgShift2016)

In [None]:
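def isShiftLong(shifts, avgShift):
    # Loop-based equivalent of the np.where call below (not used afterwards).
    # A shift is 'Long' only when it is strictly greater than the average.
    category = []
    for shift in shifts:
        if shift > avgShift:
            category.append('Long')
        else:
            category.append('Short')
    return category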

In [None]:
firstShift2015["Shift_Category"] = np.where(firstShift2015["Duration"] > avgShift2015, 'Long', 'Short')

In [None]:
firstShift2015.reset_index(drop = True, inplace= True)
firstShift2015.head()

In [None]:
firstShift2016["Shift_Category"] = np.where(firstShift2016["Duration"] > avgShift2016, 'Long', 'Short')

In [None]:
firstShift2016.reset_index(drop = True, inplace= True)
firstShift2016.head()

For that game and that player, take the sum of goals. We make another DataFrame that holds the goal total along with the player and game IDs, and then attach the Shift_Category. Afterwards we will show the difference between the two categories.
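
The count-then-merge step can be sketched on toy data as follows. The tables and the `Goals` column name are made up for illustration; `Game_Id`, `Player_Id`, `p1_ID` and `Shift_Category` follow the names used in this notebook.

```python
import pandas as pd

# Toy goal events: one row per goal, scorer in p1_ID
toy_goals = pd.DataFrame({
    "Game_Id": [20001, 20001, 20002],
    "p1_ID":   [10, 10, 11],
})
# Toy first shifts with their category
toy_first = pd.DataFrame({
    "Game_Id":   [20001, 20001, 20002, 20002],
    "Player_Id": [10, 11, 10, 11],
    "Shift_Category": ["Long", "Short", "Short", "Long"],
})

# Count goals per (game, scorer)
goals_per_game = (
    toy_goals.groupby(["Game_Id", "p1_ID"]).size().reset_index(name="Goals")
)

# Left merge keeps players who did not score; their goal count becomes 0
merged = toy_first.merge(
    goals_per_game, how="left",
    left_on=["Game_Id", "Player_Id"], right_on=["Game_Id", "p1_ID"],
).drop(columns="p1_ID")
merged["Goals"] = merged["Goals"].fillna(0).astype(int)
print(merged)
```

The left merge matters: an inner merge would silently drop every player-game without a goal and inflate the averages.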

In [None]:
playerGoalsPer2015 = goalEvents2015.groupby(["Game_Id", "p1_ID"]).count().reset_index()
playerGoalsPer2015.head()

In [None]:
mergedGame2015 = firstShift2015.merge(playerGoalsPer2015, how='left', left_on=["Game_Id",'Player_Id'], right_on = ["Game_Id","p1_ID"])
mergedGame2015.fillna(0, inplace=True)
mergedGame2015.drop(columns = ["p1_ID"], inplace = True)
mergedGame2015.head()

In [None]:
playerGoalsPer2016 = goalEvents2016.groupby(["Game_Id", "p1_ID"]).count().reset_index()

In [None]:
mergedGame2016 = firstShift2016.merge(playerGoalsPer2016, how='left', left_on=["Game_Id",'Player_Id'], right_on = ["Game_Id","p1_ID"])
mergedGame2016.fillna(0, inplace=True)
mergedGame2016.drop(columns = ["p1_ID"], inplace = True)
mergedGame2016.head()

## Experiment Outcome

In order to reject my null hypothesis I will need sufficient evidence to say that the outcome I am observing is not due to chance, that is, that the result for 2015 or 2016 is not a random occurrence.
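
One way to put a number on "not due to chance" is a permutation test. To be clear, this is a swapped-in technique for illustration and not the method used in this notebook, which resamples with a bootstrap further down. The sketch uses synthetic goal counts in which both groups come from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic goals per player-game for two groups drawn from the same
# distribution, so any observed difference is due to chance alone.
long_goals = rng.poisson(lam=0.15, size=2000)
short_goals = rng.poisson(lam=0.15, size=2000)
observed_diff = long_goals.mean() - short_goals.mean()

# Shuffle the group labels many times and recompute the difference each time
pooled = np.concatenate([long_goals, short_goals])
perm_diffs = np.empty(2000)
for i in range(perm_diffs.size):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:2000].mean() - shuffled[2000:].mean()

# One-sided p-value for "Long scores more than Short"
p_value = np.mean(perm_diffs >= observed_diff)
print(f"observed diff: {observed_diff:.4f}, p-value: {p_value:.3f}")
```

A small p-value (say below 0.05) would count as evidence against the null hypothesis; a large one means the observed gap is the kind of thing shuffled labels produce all the time.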

In [None]:
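# Number of player-games in each category (NaNs were filled with 0, so count() returns group sizes, not goals)
mergedGame2015.groupby("Shift_Category")["Goal_Id"].count()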

In [None]:
grouped2015 = mergedGame2015.groupby("Shift_Category")["Goal_Id"].mean()
# Index by label rather than position so the two groups cannot be mixed up
grouped2015Short = grouped2015["Short"]
grouped2015Long = grouped2015["Long"]
grouped2015Diff = grouped2015Long - grouped2015Short
print(grouped2015)
print(grouped2015Diff)

In [None]:
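# Number of player-games in each category (NaNs were filled with 0, so count() returns group sizes, not goals)
mergedGame2016.groupby("Shift_Category")["Goal_Id"].count()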

In [None]:
grouped2016 = mergedGame2016.groupby("Shift_Category")["Goal_Id"].mean()
# Index by label rather than position so the two groups cannot be mixed up
grouped2016Short = grouped2016["Short"]
grouped2016Long = grouped2016["Long"]
grouped2016Diff = grouped2016Long - grouped2016Short
print(grouped2016)
print(grouped2016Diff)

Just looking at the group sizes and the mean goals per game for 2015 and 2016, a longer first shift does not appear to increase the number of goals on average. If anything, players with a shorter first shift score slightly more goals over the game.


Next we will look to see if shift duration and goals/game average are correlated.

Finally, we will look at whether what we are observing is truly real or down to random chance.

In [None]:
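# Shift_Category is a string column, so restrict the correlation to numeric columns
mergedGame2015.corr(numeric_only=True)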

In [None]:
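# Shift_Category is a string column, so restrict the correlation to numeric columns
mergedGame2016.corr(numeric_only=True)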

Goals/Game average is not correlated with First Shift Duration (Correlation Coefficient: -0.00944/-0.01100)

It does not appear that First Shift Duration has any relationship with Goals/Game average. 

Final thoughts:

- I wonder what would happen if we decided the shift category based on the median instead of the mean.

- I wonder if the first shift duration has anything to do with point totals, not just goals.

- I wonder if the second shift should be the long one.

We would like to see, over thousands of seasons, how much the Goals/Game average would differ between our "Long" first shift group and our "Short" first shift group.

For 2015, the Goals/Game average of the "Long" first shift group minus that of the "Short" first shift group was -0.003 goals per game, meaning that on average the Short first shift group scored 0.003 more goals per game than the Long first shift group.

For 2016, the Goals/Game average of the "Long" first shift group minus that of the "Short" first shift group was -0.005 goals per game, meaning that on average the Short first shift group scored 0.005 more goals per game than the Long first shift group.

## Bootstrap

We will take a sample of the first shift group (a sample of players' first shifts) and match it up with the goal data for those games and players. We will then take the means of the two groups, Long and Short, and record the difference, repeating this for 1000 samples.

### Sample Size
Let's determine our n, the sample size for the First Shift data.

In [None]:
# a good sample size is around 10% of the population
## But that is too large. It is greater than 1000
rows = firstShift2015.shape[0]
print("Population of First Shifts", rows)
print("10% of First Shift Rows", rows * .1)

Your confidence level corresponds to something called a "z-score." A z-score is a value that indicates the placement of your raw score (meaning the percent of your confidence level) in any number of standard deviations below or above the population mean.

Z-scores for the most common confidence intervals are:

90% = 1.645
95% = 1.96
99% = 2.576
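
These critical values do not need to be memorized; they can be recovered from the inverse of the normal CDF. The helper name `z_for_confidence` below is made up for this sketch, and it uses only the standard library.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical z value for a confidence level such as 0.95."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: {z_for_confidence(level):.3f}")
```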

In [None]:
# Instead we will use a calculated sample size.
## Our intended level of confidence will be 95%
## where Z is the Z-score corresponding to your desired confidence level 
## p is the estimated proportion of the population with a certain characteristic
## E is the maximum error you are willing to tolerate in your estimate.
## N population size

Z = 1.96
E = 0.05
N = rows
p = .05
n = round(((Z**2 * p * (1-p))/ E**2)/(1 + ((Z**2 * p *(1-p))/((E**2) * N))))
n
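
The calculation in the cell above can be wrapped in a small function to make it easier to try different inputs. This is a sketch of the same formula, not a different method; the function name `sample_size` and the population of 40,000 first shifts are assumptions made for illustration.

```python
def sample_size(z, p, e, population):
    """Sample size for a proportion, with the finite-population adjustment used above."""
    n0 = (z ** 2 * p * (1 - p)) / e ** 2      # size for an infinite population
    return round(n0 / (1 + n0 / population))  # shrink for a finite population

# Hypothetical inputs: 95% confidence, 5% margin of error, 40,000 first shifts
print(sample_size(z=1.96, p=0.5, e=0.05, population=40_000))   # conservative p
print(sample_size(z=1.96, p=0.05, e=0.05, population=40_000))  # p as in the cell above
```

The choice of p matters a lot: p = 0.5 is the most conservative setting and gives a sample several times larger than p = 0.05.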

In [None]:
sample = firstShift2015.sample( n = n, replace = True)
sample

In [None]:
sampleMerged2015 = sample.merge(playerGoalsPer2015, 
                                    how='left', 
                                    left_on=["Game_Id",'Player_Id'], 
                                    right_on = ["Game_Id","p1_ID"])
sampleMerged2015.fillna(0, inplace=True)
sampleMerged2015.drop(columns=["p1_ID"], inplace = True)
sampleMerged2015.head(36)

In [None]:
sampleGrouped2015 = sampleMerged2015.groupby("Shift_Category")["Goal_Id"].mean()
# Index by label rather than position so the two groups cannot be mixed up
sampleGrouped2015Short = sampleGrouped2015["Short"]
sampleGrouped2015Long = sampleGrouped2015["Long"]
sampleGrouped2015Diff = sampleGrouped2015Long - sampleGrouped2015Short
print(sampleGrouped2015)
print(sampleGrouped2015Diff)

In this particular sample, the Short first shift group scored on average 0.018 more goals per game.

In [None]:
def goalsPerGameDiffBoot(shifts, goalsGame, sampleSize, iterations):
    # goalsGame must hold one row per (Game_Id, p1_ID) with the goal count in Goal_Id
    goalDiffs = []
    for i in range(iterations):
        # Draw a random sample of first shifts, with replacement
        sample = shifts.sample(n = sampleSize, replace = True)
        sampleMerged = sample.merge(goalsGame, 
                                    how='left', 
                                    left_on=["Game_Id",'Player_Id'], 
                                    right_on = ["Game_Id","p1_ID"])
        # Players with no matching goal row scored 0 goals
        sampleMerged.fillna(0, inplace=True)
        sampleGrouped = sampleMerged.groupby("Shift_Category")["Goal_Id"].mean()
        # Index by label rather than position
        sampleGroupedShort = sampleGrouped["Short"]
        sampleGroupedLong = sampleGrouped["Long"]
        sampleGroupedDiff = sampleGroupedLong - sampleGroupedShort
        goalDiffs.append(sampleGroupedDiff)
        
    return goalDiffs

In [None]:
# Pass the per-player goal counts (not the raw goal events) so Goal_Id holds goals per game
diffBootData1000 = goalsPerGameDiffBoot(shifts = firstShift2015, 
                                    goalsGame = playerGoalsPer2015, 
                                    sampleSize = n, 
                                    iterations = 1000)

In [None]:
plt.hist(diffBootData1000, bins=20, color='blue', alpha=0.5)
plt.xlabel('Goal Average Differential')
plt.ylabel('Frequency')
plt.title('Histogram of 1000 iterations of Goal Differential')
plt.show()

In [None]:
# Pass the per-player goal counts (not the raw goal events) so Goal_Id holds goals per game
diffBootData10000 = goalsPerGameDiffBoot(shifts = firstShift2015, 
                                    goalsGame = playerGoalsPer2015, 
                                    sampleSize = n, 
                                    iterations = 10000)

In [None]:
plt.hist(diffBootData10000, bins=20, color='blue', alpha=0.5)
plt.xlabel('Goal Average Differential')
plt.ylabel('Frequency')
plt.title('Histogram of 10,000 iterations of Goal Differential')
plt.show()
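
A natural next step after the histograms is to summarize the resampled differences with a percentile interval. The sketch below uses synthetic values as a stand-in for a list like `diffBootData10000`; the numbers are invented and say nothing about the real result.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Stand-in for a list of resampled Long-minus-Short differences
# (synthetic values; in the notebook this would be diffBootData10000)
toy_boot_diffs = rng.normal(loc=-0.003, scale=0.08, size=10_000)

# 95% percentile interval: the middle 95% of the resampled differences
lower, upper = np.percentile(toy_boot_diffs, [2.5, 97.5])
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")

# If the interval contains 0, the data are consistent with "no difference"
contains_zero = lower <= 0 <= upper
print("interval contains zero:", contains_zero)
```

If zero falls inside the interval, there is no basis for rejecting the null hypothesis at that confidence level.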