# Python Hockey Analytics Tutorial

### Overview

Welcome to the TopDownHockey Python Hockey Analytics Tutorial! If you've made it this far, you've already managed to do the hardest thing I've ever had to do with Python: Installing it and installing Jupyter Lab. By comparison, everything you do going forward should be a breeze.

By the end of this tutorial, you will not only have a base-level understanding of Python as a programming language, but you will be comfortable enough in Python to perform small-scope data analysis on your own.

---

### Import Packages

Every one of these packages was already baked into the TopDownHockey_Scraper, so you've already got them installed on your computer. The following lines will import them into your current Python session.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import TopDownHockey_Scraper.TopDownHockey_EliteProspects_Scraper as tdhepscrape

---

### Scrape the Data

We will be using the <code>TopDownHockey_EliteProspects_Scraper</code> <code>module</code> from the <code>TopDownHockey_Scraper</code> package to scrape data from Elite Prospects. 

- A [module](https://docs.python.org/3/tutorial/modules.html) is a file containing Python statements and definitions. It can also be thought of as a code library. A Python package is a collection of modules.

Note that we imported every module as something with a shorthand abbreviation for its full name. This is because in order to call a function from a module, we need to type out the name of the module each time before the function, and it's much easier to type these shorter names. For example, when we call the <code>get_skaters</code> function from the <code>TopDownHockey_EliteProspects_Scraper</code> module later, we will type <code>tdhepscrape.get_skaters</code>. This is more efficient than typing <code>TopDownHockey_Scraper.TopDownHockey_EliteProspects_Scraper.get_skaters</code>.
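To see that an alias is nothing more than a shorter name, here is a small sketch using <code>statistics</code>, a module that ships with Python. It has nothing to do with the scraper, and the goal totals are made up purely for illustration.

```python
# Toy illustration with a standard-library module (not the scraper).
import statistics            # imported under its full name
import statistics as stats   # imported again under a shorter alias

goals = [21, 30, 17]  # made-up numbers

# Both names point at the same module, so both calls run the same function.
mean_via_alias = stats.mean(goals)
mean_via_full_name = statistics.mean(goals)

print(mean_via_alias == mean_via_full_name)
```

The alias only changes what we type; the function being called is identical either way.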

The <code>get_skaters</code> function takes two arguments: One or more leagues and one or more seasons. In this case, we're going to scrape AHL and NHL data for the 2017-2018 and 2018-2019 seasons, which means we're going to build two <code>lists</code> and then feed them to our function.

- A [list](https://docs.python.org/3/tutorial/datastructures.html) is a changeable collection of elements that exist in a certain order. In this case, our lists will be made up of two strings each.

Our function will scrape all data for the leagues and seasons we provide and return a <code>dataframe</code>. 

- A [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is a two-dimensional data structure that contains rows and columns.

We will assign this dataframe to an object called <code>ahl_nhl_skaters_1719</code>.

The "time magic" function at the top of the next cell will tell you how long the scrape takes. It should take about a minute.

In [None]:
%%time

# These are both lists, and each of them contains two strings. We use the equals sign to assign these lists to a named object which gets stored and can be used later.

leagues = ["ahl", "nhl"]
seasons = ["2017-2018", "2018-2019"]

# The get_skaters command not only returns a dataframe once it is completed, but also prints out messages as it runs in order to keep you updated on its progress.

ahl_nhl_skaters_1719 = tdhepscrape.get_skaters(leagues, seasons)

---

### Clean the Data

Data cleaning isn't anybody's favorite, but it's a key part of data analysis. Every minute you spend cleaning your data at the beginning of your analysis is an investment in the final product, and most of the time, you'll get a positive return on investment. It's much easier to identify and fix problems early on.

The first thing we want to do before we can clean our data is take a look at it. Our prior command stored the outputs of our function as a dataframe called ahl_nhl_skaters_1719 which contains all of the data we need. Let's start the process by taking a look at this object.

In [None]:
# To view an object, simply enter its name and then run the cell.

ahl_nhl_skaters_1719 

The bottom left corner of the dataframe we printed tells us we're working with 4,088 rows and 14 columns. When we look at our columns, we see a link, some identifying information, and some boxcar stats (for those unfamiliar with this term, it's just jargon which refers to the stats from games played through plus/minus). 

Our player and playername columns are actually similar, so we can get rid of one; ideally the ugly player column, which contains redundant position data. In order to do this, we're going to use <code>drop</code>.

- [Drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) allows you to specify labels to drop from columns or rows.

The line below might look daunting if you're brand new to things, but it's actually quite simple. Here's how the chain of command works for it:

- Our line starts with <code>ahl_nhl_skaters_1719 =</code>, which means the output of whatever comes after that will be assigned to an object with that name.
- The object we've already created - also named ahl_nhl_skaters_1719 - is what we're going to be applying our functions on. (Essentially, we will be overwriting this.)
- The period in <code>ahl_nhl_skaters_1719.drop</code> tells us we're going to use drop on that object.
- In parentheses, we specify what we will be dropping: columns that are named player.

In [None]:
# drop syntax: dataframe.drop(columns = 'my_column_to_drop')

ahl_nhl_skaters_1719 = ahl_nhl_skaters_1719.drop(columns = 'player')

As you can see, no output was printed because we assigned the output of our command back to our variable instead of just printing it. Let's double check that we did everything right by printing it:

In [None]:
ahl_nhl_skaters_1719

Okay, so we did what we meant to. This new dataframe is better, but it's still far from ideal. We have our team, then our boxcar stats, then a big ugly link, and only after that do we actually see the season, league, player, and position. Ideally, this key identifier information would be at the start, our boxcar stats would come after, and only then would we see the link.

Why do we want to see the link column at all? It's big, ugly, and doesn't seem to add anything. Wouldn't the ideal layout be one that doesn't have this column? 

In a perfect world, yes. But in this imperfect world, we can occasionally mix up two completely different players who have the same name. In order to demonstrate, we'll use <code>loc</code> to select only the players whose names are equal to Sebastian Aho.

- [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) allows us to locate a group of rows and columns based on whether their labels meet a certain condition. (Official documentation doesn't confirm whether loc stands for anything, but I like to think it stands for locate, and it might help you to think the same.)

Before we run this and take a look at our Sebastian Ahos, let's review the chain of commands for this line of code:

- The period in <code>ahl_nhl_skaters_1719.loc</code> tells us that we will use the <code>loc</code> function on this object.
- Within the <code>loc</code> function in brackets, the period in <code>ahl_nhl_skaters_1719.playername</code> tells us we are selecting the playername column from this dataframe. This returns a <code>series</code>. (A [series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) is a one-dimensional array with axis labels).
- We then check whether the <code>playername</code> series is equal to Sebastian Aho, using two equals signs. (In Python, we generally use one equals sign to assign objects and two to determine whether objects are equal.)
- In essence, this locates only the rows where the value in our chosen series - the <code>playername</code> column - is exactly equal to Sebastian Aho.
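If the difference between one and two equals signs still feels abstract, the sketch below walks through it on a tiny made-up dataframe. The names <code>toy</code> and <code>mask</code> and the team labels are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, not the scraped dataframe.
toy = pd.DataFrame({
    "playername": ["Sebastian Aho", "Chris Terry", "Sebastian Aho"],
    "team": ["Team A", "Team B", "Team C"],
})

# One equals sign assigns. Two equals signs compare.
# Comparing a series to a value returns a series of True/False values.
mask = toy.playername == "Sebastian Aho"
print(mask.tolist())

# Passing that series to loc keeps only the rows where it is True.
ahos = toy.loc[mask]
print(len(ahos))
```

Writing the comparison directly inside the brackets, as we do in the next cell, is the same thing without the intermediate variable.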

In [None]:
# loc syntax: dataframe.loc[dataframe.column==value_to_match]

ahl_nhl_skaters_1719.loc[ahl_nhl_skaters_1719.playername=="Sebastian Aho"]

Wait, what? Why isn't there a single player in our dataframe whose name is exactly identical to Sebastian Aho? The defenseman in the Islanders organization might have been a fever dream, but there is definitely a Sebastian Aho who plays center for the Carolina Hurricanes. He scored over a point per game in 2018-2019! It can't be right that there is nobody in here named Sebastian Aho, can it?

Technically, it can be, and it actually is. Our data unfortunately comes with some white spaces. In order to demonstrate this, we're going to use <code>iloc</code>. 

- [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) is very similar to loc, except it locates a group of rows and columns by integer position (row number in this case) rather than by label. (Again, the official documentation doesn't confirm whether this stands for anything, but I like to think it stands for integer locate.)

Here we will use iloc to locate only the row whose integer position is 0. (In Python, indexes begin with 0 rather than 1.) Here's how the chain of commands for this code will play out:

- On its own, <code>ahl_nhl_skaters_1719.playername</code> returns a series containing every player name within the dataframe, in the same order they appear in the dataframe. 
- Within this series, <code>iloc</code> will locate the value at integer position 0, which is the first one.

In [None]:
# iloc syntax: dataframe.column.iloc[integer_to_index]

ahl_nhl_skaters_1719.playername.iloc[0]

Well, would you look at that: There's a blank white space between the y in Terry and the single quote indicating the end of the name. Chris Terry obviously exists in the database, but if we filtered players whose names were exactly Chris Terry, we'd get nothing. Try it out for yourself:

In [None]:
ahl_nhl_skaters_1719.loc[ahl_nhl_skaters_1719.playername=="Chris Terry"]

There are a few ways to handle this issue, but the simplest way is to just clean the white space off of our playername field. In order to do this, we'll use <code>str.strip()</code>. 

- [str.strip](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) removes leading and trailing characters (white space by default) from every string in a series or index.
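Plain Python strings have a <code>strip</code> method that, by default, does the same thing to a single string. Here is a quick sketch with one made-up value so we can see exactly why our comparison failed:

```python
# Toy example: a name with a trailing space, like the one iloc showed us.
raw_name = "Chris Terry "

# repr() prints the quotes too, which makes the stray space visible.
print(repr(raw_name))

# The comparison fails because of the extra space...
print(raw_name == "Chris Terry")

# ...and succeeds once strip() removes leading and trailing white space.
clean_name = raw_name.strip()
print(clean_name == "Chris Terry")
```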

Let's break down the chain of commands in the next cell:

- Rather than assigning the output of our statement to the ahl_nhl_skaters_1719 dataframe, we are specifically assigning it to the <code>ahl_nhl_skaters_1719.playername</code> column.
- We use <code>.str.strip()</code> on our original column to remove the white space at both ends of this string.
- After this is done, we simply want to look at our Sebastian Ahos and our Chris Terrys. Instead of passing just one statement to <code>loc</code>, we pass two, and use <code>|</code> to locate situations where the first OR the second condition is met. (<code>|</code> can be interpreted as or in this case. Python's own <code>or</code> keyword doesn't work element by element on a series, so <code>|</code> it is.)
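Here is a minimal sketch of that OR logic on a made-up dataframe (the names <code>toy</code>, <code>is_terry</code>, and <code>is_aho</code> are hypothetical). It also shows what happens if we reach for the <code>or</code> keyword instead.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({"playername": ["Chris Terry", "Sebastian Aho", "Some Other Player"]})

is_terry = toy.playername == "Chris Terry"
is_aho = toy.playername == "Sebastian Aho"

# | compares the two True/False series row by row.
either = toy.loc[is_terry | is_aho]
print(either.playername.tolist())

# The "or" keyword wants one single True/False, so pandas raises an error.
or_failed = False
try:
    toy.loc[is_terry or is_aho]
except ValueError:
    or_failed = True
print(or_failed)
```

When the comparisons are written inline rather than stored in variables first, each one needs its own parentheses, because Python evaluates <code>|</code> before <code>==</code>.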

In [None]:
#str.strip() syntax: dataframe.column.str.strip()

ahl_nhl_skaters_1719.playername = ahl_nhl_skaters_1719.playername.str.strip()

# We can use loc to locate rows that meet multiple conditions. 
# We use parentheses to specify each case and use | to pass the first OR second condition. 

ahl_nhl_skaters_1719.loc[(ahl_nhl_skaters_1719.playername=="Chris Terry") | (ahl_nhl_skaters_1719.playername=="Sebastian Aho")]

This confirms two things: 

1. Our str.strip() function worked and we successfully got rid of the white space around Chris Terry's name. 
2. There <i>are</i> two Sebastian Ahos.

In theory, we could remedy the Aho problem by using name and position as identifiers, or using the original player column that contained positional data next to playername. Since one Sebastian Aho is a forward and one is a defenseman, this would allow us to differentiate between the two of them.

The problem with this approach is that players with the same names don't always play different positions. For every pair of Sebastian Ahos and Colin Whites who play two different positions, there are pairs of Erik Gustafssons or Erik Karlssons who play the same position. (The other defenseman named Erik Karlsson has never played in the NHL, but he exists, and he <i>has</i> ruined my NHLe models.) While this project has a very small scope, you may eventually transition to projects with a much larger scope and need a process in place that handles these issues. 

Thankfully, every player has their own unique page on Elite Prospects, and thus their own unique link to that page. Take another look at our Sebastian Ahos; Bridgeport's defenseman has a different link than Carolina's forward. They're not both in here, but the same is true for the two Erik Gustafssons. <b>This is why we keep the link.</b> And because we've got the link, we don't need to further worry about pairs like these Sebastian Ahos. We didn't really have to clean our data to begin with, but it's a good practice to ensure that's the case before moving forward.

---

### Make the Data Look Good

We got completely sidetracked by diving down the rabbit hole of Sebastian Ahos, but remember the ideal order I previously laid out for our data: player, team, season, league, and position in that order, then our boxcar stats in the order they came in, and last, that hideous-but-helpful link. We'll use loc to tell Python which columns to keep, and actually pass all of them, just in the order we want. We'll also <code>rename</code> that ugly playername column. 

- [Rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) alters axis labels. 

Let's break down the chain of commands in the following lines of code:

- We are using the <code>loc</code> function on our dataframe.
- Rather than one set of brackets, we use two sets of brackets. Within the outer set of brackets, we simply start with <code>:, </code>. This notifies Python that we will be working on columns rather than rows.
- Within the second set of brackets, we pass a list of column names we wish to keep. In this case, we're actually keeping all of them, just in a different order than they were in before.
- Once the first line is complete, ahl_nhl_skaters_1719 has now been overwritten to reflect our changes. In our next line, we use the rename statement. This is similar to the drop statement we used before, except this time, within curly braces, we enter the original name of the column, a colon, and then the name we'd like that column to be changed to.
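The same two steps on a three-column toy dataframe might look like the sketch below. The column values are made up; only the pattern matters.

```python
import pandas as pd

# Hypothetical toy data with the columns in an unhelpful order.
toy = pd.DataFrame({
    "link": ["url_1", "url_2"],
    "g": [10, 20],
    "playername": ["Player One", "Player Two"],
})

# The colon means "every row"; the list names the columns in the order we want.
toy = toy.loc[:, ["playername", "g", "link"]]

# rename takes a dictionary in curly braces: {old_name: new_name}.
toy = toy.rename(columns={"playername": "player"})

print(list(toy.columns))
```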

In [None]:
# Here we use loc to select columns with certain names. We use :, then pass a list of column names in brackets to locate columns rather than rows.

# loc syntax for locating by column name: dataframe.loc[:, ['column_to_keep_one', 'column_to_keep_two']]

ahl_nhl_skaters_1719 = ahl_nhl_skaters_1719.loc[:, ['playername', 'team', 'season', 'league', 'position', 'gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-', 'link']]

#rename syntax: dataframe.rename(columns = {'old_column_name':'new_column_name'})

ahl_nhl_skaters_1719 = ahl_nhl_skaters_1719.rename(columns = {'playername':'player'})

ahl_nhl_skaters_1719

This looks great! Our data is clean, and we're ready to take a step forward.

---

### Prepare the Data for Analysis

Remember how we used loc in the past to select players whose names were Chris Terry or Sebastian Aho? Now we're going to use loc to build two separate dataframes: One for the 2017-2018 AHL season and one for the 2018-2019 NHL season. 

In [None]:
# We are now selecting rows that meet the first AND second condition, meaning we use & instead of |. 

ahl_skaters_1718 = ahl_nhl_skaters_1719.loc[(ahl_nhl_skaters_1719.season=="2017-2018") & (ahl_nhl_skaters_1719.league=="ahl")]

nhl_skaters_1819 = ahl_nhl_skaters_1719.loc[(ahl_nhl_skaters_1719.season=="2018-2019") & (ahl_nhl_skaters_1719.league=="nhl")]

Now that we've got two separate dataframes set up, we're going to <code>merge</code> those two and create a new dataframe called ahl_1718_nhl_1819. 

- [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) brings together two dataframes that share common columns or indexes. 

Before we run the next cell, let's break down the chain of commands, with a useful image to go along with it:

- We perform the <code>merge</code> function on ahl_skaters_1718 and pass our nhl_skaters_1819 as the first argument to our function. This means we will merge the 2017-2018 AHLers and 2018-2019 NHLers. 
- We merge on link, meaning we use the player link as the common field. In essence, we add in NHL data from 2018-2019 for skaters who have the same link as AHL skaters from 2017-2018.
- The type of merge we will be using is an <code>inner</code> merge. The Venn diagrams below display the different types of merges. (Merge and join are interchangeable in this case.)
- Our inner merge means that we will only be keeping AHL players from 2017-2018 who also appeared in the NHL in 2018-2019.

![img](https://docs.trifacta.com/download/attachments/160412683/JoinVennDiagram.png?version=1&modificationDate=1596167437085&api=v2)
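To make the diagram concrete, here is a small sketch with two made-up dataframes that share only some of their links. The names <code>ahl</code> and <code>nhl</code> and all of the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: url_2 and url_3 appear in both dataframes.
ahl = pd.DataFrame({"link": ["url_1", "url_2", "url_3"], "ahl_gp": [60, 45, 70]})
nhl = pd.DataFrame({"link": ["url_2", "url_3", "url_4"], "nhl_gp": [25, 10, 82]})

# Inner merge: keep only links that appear in BOTH dataframes.
inner = ahl.merge(nhl, on="link", how="inner")
print(inner.link.tolist())

# Left merge, for comparison: keep every row of the first dataframe.
left = ahl.merge(nhl, on="link", how="left")
print(len(left))
```

In the left merge, the player at url_1 stays in the result with a missing <code>nhl_gp</code> value, which is exactly what we don't want here.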

In [None]:
# Merge syntax is first_dataframe.merge(second_dataframe, on = 'common_field', how = 'merge_type')

ahl_1718_nhl_1819 = ahl_skaters_1718.merge(nhl_skaters_1819, on = 'link', how = 'inner')

ahl_1718_nhl_1819

Yikes! Our merge was successful, and we've got a solid sample size worth of 302 players who played in the AHL one year and the NHL the next, but we've got an ugly bunch of columns with _x and _y attached to them. 

This is what happens when we merge two dataframes that have columns with the same name and we don't join on those columns: Columns with the same names get duplicated.
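Here is a minimal sketch of where those suffixes come from, using two tiny made-up dataframes. It also shows <code>suffixes</code>, an optional argument to <code>merge</code> that this tutorial does not use, in case you'd rather choose the labels yourself.

```python
import pandas as pd

# Hypothetical toy data: both dataframes have a gp column.
ahl = pd.DataFrame({"link": ["url_1"], "gp": [60]})
nhl = pd.DataFrame({"link": ["url_1"], "gp": [25]})

# By default, pandas tags the clashing columns with _x (left) and _y (right).
default = ahl.merge(nhl, on="link", how="inner")
print(list(default.columns))

# Alternative, not used in this tutorial: pick the suffixes ourselves.
named = ahl.merge(nhl, on="link", how="inner", suffixes=("_ahl", "_nhl"))
print(list(named.columns))
```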

This isn't the end of the world. It's actually quite easy to handle. We know that the left side of our new dataframe - the one whose column names have _x attached to them - contains AHL data from 2017-2018, and we know that the right side contains NHL data from 2018-2019. (If we forget, the name of our dataframe literally serves as a reminder.) We also know that these players don't change name or position from season to season, so we can just keep the player and position from the left side of our data and drop those on the right side.

So we've got 6 columns to drop. Let's do that first.

In [None]:
# Before, we dropped only one column. Now that we are dropping multiple, we must pass them as a list, which means we must enclose that list in brackets.

ahl_1718_nhl_1819 = ahl_1718_nhl_1819.drop(columns = ['season_x', 'league_x', 'season_y', 'league_y', 'player_y', 'position_y'])

ahl_1718_nhl_1819

That already looks a lot better, but it's still a bit of a mess. Before we do anything else, we should move our link field to the far right side of the dataframe, as far out of sight and out of mind as possible. Instead of typing out all 19 column names in the order we want, though, we're going to speed the process up and have a little bit of fun with lists and the functions attached to them.

- [lists](https://docs.python.org/3/tutorial/datastructures.html) can be operated on in many different ways.

Let's break down exactly what we're doing in this next cell:

- We first create a list which is simply the names of the columns in our dataframe. 
- We use the list-based <code>remove</code> method on our list to remove the first item in our list that is 'link'. 
- Note that unlike assigning changes to other objects like dataframes, lists do not require a strict assignment with an equals sign but instead are automatically overwritten. Simply entering the second line of code here will permanently change the list.
- We then use the list-based <code>append</code> method to add 'link' to the end of our list.
- We then print our list to ensure we've done things right.
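A short sketch on a made-up list shows why no equals sign is needed for these two methods. The names <code>columns</code> and <code>result</code> are just for illustration.

```python
# Toy list of column names.
columns = ["link", "player", "gp"]

# remove changes the list in place and hands back None,
# so there is nothing useful to assign.
result = columns.remove("link")

# append also changes the list in place.
columns.append("link")

print(columns)
print(result)
```

This is also why writing something like <code>my_list = my_list.remove('x')</code> is a classic mistake: it replaces the list with None.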

In [None]:
my_columns = list(ahl_1718_nhl_1819.columns)

my_columns.remove('link')

my_columns.append('link')

my_columns

Now we've got a list of columns in the order we want them. The next step is to use the loc function the same way we used it before, feeding it <code>:,</code> first to indicate that we're interested in the columns rather than rows, and then passing a list of the columns we want. After that's done, we'll rename the columns to make them a bit more interpretable, and then print our new dataframe to make sure everything still looks right.

In [None]:
ahl_1718_nhl_1819 = ahl_1718_nhl_1819.loc[:, my_columns]

ahl_1718_nhl_1819 = ahl_1718_nhl_1819.rename(columns = {'player_x':'player', 'team_x':'ahl_team', 'position_x':'position', 'gp_x':'ahl_gp', 
                                                        'g_x':'ahl_g', 'a_x':'ahl_a', 'tp_x':'ahl_p', 'ppg_x':'ahl_ppg', 'pim_x':'ahl_pim', 
                                                        '+/-_x':'ahl_+/-', 'team_y':'nhl_team', 'gp_y':'nhl_gp', 'g_y':'nhl_g', 'a_y':'nhl_a', 
                                                        'tp_y':'nhl_p', 'ppg_y':'nhl_ppg', 'pim_y':'nhl_pim', '+/-_y':'nhl_+/-'})

ahl_1718_nhl_1819

Hey, this is starting to look like something we can actually work with! We've got our AHL data clearly laid out on one side and our NHL data clearly laid out on the other side. While we've made a lot of progress, we've still got just a few more changes to make. In order to determine what these changes are, let's use <code>dtypes</code>.

- [dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) returns the type that each column is in a dataframe.

In [None]:
ahl_1718_nhl_1819.dtypes

Every column in our dataframe is currently an <code>object</code> type. In order to perform mathematical functions on them, we need to convert them to <code>float</code> types. 

- [float](https://docs.python.org/3/library/functions.html#float) types are floating point numbers. Integers are fine in many cases, but if you're working with decimals like we will be, you need to convert them to float.

Let's test this out with just one column, using just points per game in the AHL. The chain of commands here is quite simple:

- We use <code>ahl_1718_nhl_1819.ahl_ppg</code> to operate on that column, treating it as a series. 
- We use the <code>astype</code> function and pass <code>float</code> as an argument to change it to a float type.
- [astype](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) converts an object in a dataframe to a specific type.

Instead of assigning this to a new object, let's just print it...

In [None]:
# astype syntax: dataframe.series.astype(chosen_type)

ahl_1718_nhl_1819.ahl_ppg.astype(float)

Uh-oh! It looks like we have some values in this column that are simply <code>-</code>, and Python can't convert these to a float type, which makes sense since they contain no numerical data. 

Let's take a look at every row with this value to confirm this is the case:

In [None]:
ahl_1718_nhl_1819[ahl_1718_nhl_1819.ahl_ppg=="-"]

Okay, so we need to handle this before we do anything else. We're also going to want to use our games played column at some point, so we should also make sure we clean that one as well, just in case some of these pesky - values exist in there too.

In order to do this cleaning, we're going to use the <code>where</code> function from the <code>numpy</code> module to change all values that are - to 0, and leave all other values as they were. Remember that we imported numpy as np, so when we call the where function of that module, we simply type np.where instead of numpy.where.

- [numpy.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html?highlight=where#numpy.where) returns one value if a condition is met and another if it isn't. 

In this case, here is the chain of command for these lines:

- We are assigning new values to 4 different columns based on the results of our <code>np.where</code> statement.
- Within our np.where statement, we are assigning a value of 0 if the current value is <code>-</code>. If the current value is not equal to -, we simply return what the value was before.
- We repeat this process for four fields - NHL/AHL games played and NHL/AHL points per game - and then print rows with an ahl_ppg value equal to - to ensure this worked. (We should see none.)
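Here is the same idea as a minimal sketch on a three-value toy series (the name <code>ppg</code> and its values are made up). It also shows <code>pd.to_numeric</code>, an alternative that this tutorial does not use, for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: numbers stored as text, plus one "-" placeholder.
ppg = pd.Series(["0.75", "-", "1.10"])

# Replace "-" with 0 and leave every other value as it was.
replaced = pd.Series(np.where(ppg == "-", 0, ppg))

# With the "-" gone, the conversion to float succeeds.
as_float = replaced.astype(float)
print(as_float.tolist())

# Alternative, not used in this tutorial: values that can't be parsed become NaN.
coerced = pd.to_numeric(ppg, errors="coerce")
print(coerced.isna().tolist())
```

The trade-off is that np.where turns the placeholder into a real 0, while the alternative keeps it marked as missing. For our purposes, 0 is fine, since we're about to drop players with fewer than 20 games anyway.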

In [None]:
# np.where syntax: np.where(condition==True, value_to_return_if_true, value_to_return_if_false)

ahl_1718_nhl_1819.ahl_ppg = np.where(ahl_1718_nhl_1819.ahl_ppg=="-", 0, ahl_1718_nhl_1819.ahl_ppg)
ahl_1718_nhl_1819.nhl_ppg = np.where(ahl_1718_nhl_1819.nhl_ppg=="-", 0, ahl_1718_nhl_1819.nhl_ppg)

ahl_1718_nhl_1819.ahl_gp = np.where(ahl_1718_nhl_1819.ahl_gp=="-", 0, ahl_1718_nhl_1819.ahl_gp)
ahl_1718_nhl_1819.nhl_gp = np.where(ahl_1718_nhl_1819.nhl_gp=="-", 0, ahl_1718_nhl_1819.nhl_gp)

ahl_1718_nhl_1819[ahl_1718_nhl_1819.ahl_ppg=="-"]

Cool, this worked. Now we can change these all to float types. Then, within our new float type columns, we can keep only players who played at least 20 games in both leagues and take a look at them.

In [None]:
ahl_1718_nhl_1819.ahl_ppg = ahl_1718_nhl_1819.ahl_ppg.astype(float) 

ahl_1718_nhl_1819.nhl_ppg = ahl_1718_nhl_1819.nhl_ppg.astype(float) 

ahl_1718_nhl_1819.ahl_gp = ahl_1718_nhl_1819.ahl_gp.astype(float) 

ahl_1718_nhl_1819.nhl_gp = ahl_1718_nhl_1819.nhl_gp.astype(float) 

ahl_1718_nhl_1819 = ahl_1718_nhl_1819.loc[(ahl_1718_nhl_1819.ahl_gp>=20) & (ahl_1718_nhl_1819.nhl_gp>=20)]

ahl_1718_nhl_1819

Our data is now in a position where we can really get to work, which is awesome. 

Unfortunately, our sample size took a big hit; the number of rows in our dataframe tells us we've only got 113 skaters remaining. In order to determine just how much of a concern that is, let's take a look at exactly how many forwards and defensemen we have by building two separate dataframes: One which contains only forwards and one which contains only defensemen, and then using <code>len</code> to determine the length of those dataframes.

- [len](https://docs.python.org/3/library/functions.html#len) returns the number of items (length) in an object. 

In [None]:
# First build two separate dataframes; one for forwards, one for defensemen.

forwards = ahl_1718_nhl_1819.loc[ahl_1718_nhl_1819.position!="D"]
defensemen = ahl_1718_nhl_1819.loc[ahl_1718_nhl_1819.position=="D"]


# Obtain two integers which denote the length of each of these dataframes, and thus the number of players who play each position within the dataframe.
# len syntax: len(object)

forward_count = len(forwards)
defenseman_count = len(defensemen)

The number of forwards and defensemen left are currently stored as two separate variables. Rather than just print them, we're going to use Python's <code>print</code> function to print these values with a message. 

- [print](https://docs.python.org/3/library/functions.html#print) prints an output.

Before we do this, though, we must create new versions of our variables, which are currently integers, that are strings. We do this by using <code>str</code>.

- [str](https://docs.python.org/3/library/stdtypes.html#str) converts an object to a string.

Adding strings to one another and printing them is quite easy, but you have to be careful to make sure everything you're adding together with <code>+</code> is a string.
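The sketch below shows what goes wrong when we skip that step, using a made-up count. It also shows an f-string, which is an alternative this tutorial does not use but which you will see in a lot of Python code.

```python
# Toy value for illustration only.
toy_count = 5

# Adding a string and an integer raises a TypeError...
try:
    message = "Forwards: " + toy_count
except TypeError:
    # ...so we convert the integer to a string first.
    message = "Forwards: " + str(toy_count)

print(message)

# Alternative: an f-string converts the value for us inside the curly braces.
f_message = f"Forwards: {toy_count}"
print(f_message)
```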

In [None]:
# str syntax: str(object). 

forward_count_string = str(forward_count)
defenseman_count_string = str(defenseman_count)

# As long as you're working with multiple strings, you can print (string1 + string2). However, you will receive an error if you try to add a string to an object of a different type, like an integer.

print("We have the following number of forwards: " + forward_count_string)
print("We have the following number of defensemen: " + defenseman_count_string)

So we're working with 77 forwards and 36 defensemen. That's not great, but it's enough for some basic analytics.

---

### Analyze our Data

The first thing we're going to do is determine which teams had the most NHL forwards in 2018-2019 who made the jump from the AHL the prior year. 

(Going forward, I will be referring to the skaters who made this jump, playing at least 20 AHL games in 2017-2018 and 20 NHL games in 2018-2019, as "transitioning players." If I specifically refer to forwards or defensemen who made this jump, I will call them "transitioning forwards" or "transitioning defensemen.") 

In order to do this, we're going to use <code>groupby</code> to group our forward dataframe by team, and count the number of players for each team.

- [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) groups an object in some manner and applies some sort of function to the group before combining the results. 

Let's look at our chain of commands for this next line of code:

- We pass nhl_team as an argument to groupby; this groups by the NHL team column. 
- We extract the player series from our group.
- We use <code>count()</code> to count the number of players in this series for each team. Count is a function that can be applied to a group.
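Here is that chain as a minimal sketch on a made-up dataframe of three players. The team and player names are placeholders.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "nhl_team": ["Team A", "Team B", "Team A"],
    "player": ["Player One", "Player Two", "Player Three"],
})

# Group the rows by team, pick the player column, and count players per team.
counts = toy.groupby("nhl_team").player.count()

print(counts.to_dict())
```

The result is a series whose labels are the team names, which is why the real output in the next cell has no 0, 1, 2 index.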

In [None]:
# groupby syntax: dataframe.groupby('grouping_column').target_column_to_apply_function_to.function()

forward_counts = forwards.groupby('nhl_team').player.count()

forward_counts

Right now we have a series which tells us how many players exist for each team. This is great! But we'd much rather have a dataframe; it's easier to work with and handle. We're going to use the <code>DataFrame</code> function from our <code>pandas</code> module (which we imported as <code>pd</code>) to convert this to a dataframe.

In [None]:
# pd.DataFrame syntax: pd.DataFrame(object_to_convert_to_dataframe)

forward_counts = pd.DataFrame(forward_counts)

Now, we're going to use the <code>sort_values</code> function to see which team had the most players.

- [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) sorts by values along either axis. The default setting, axis 0, sorts the rows by the values in the column we name.

We also want to sort from greatest to least, so we will set ascending to False. (By default, it is set to True.)

In [None]:
# sort_values syntax: dataframe.sort_values(by = 'field', ascending = True/False)

forward_counts.sort_values(by = 'player', ascending = False)

The most common team was "totals" which could more easily be interpreted as "multiple teams". This makes sense. After that, the number of transitioning forwards who played for each team seems to be pretty evenly spread between 1, 2, 3, and 4.

But something else sticks out. Remember how every other dataframe we've built had an index with values of 0 through whatever at the far left? You may not have noticed it, but if you scroll back up, you'll see it was there for each and every one of them. This one doesn't have that. It's a good practice to keep an index. In order to add one, we simply use <code>reset_index</code>.

- [reset_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) resets the index or a level of it.
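As a toy illustration, the sketch below builds a small dataframe whose index is a team name, then sorts it and resets the index. The names <code>counts</code> and <code>ordered</code> and the values are made up.

```python
import pandas as pd

# Hypothetical toy data: team names are the index, as they are after a groupby.
counts = pd.DataFrame({"player": [1, 3, 2]}, index=["Team A", "Team B", "Team C"])
counts.index.name = "nhl_team"

# Sort from greatest to least, then move the team names into a regular column.
ordered = counts.sort_values(by="player", ascending=False).reset_index()

print(list(ordered.columns))
print(ordered.nhl_team.tolist())
print(ordered.index.tolist())
```

After the reset, the team names live in an ordinary column and the index is back to counting up from 0.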

Instead of just re-setting our index, we're also going to sort values again, since we didn't actually assign the dataframe with sorted values to a new variable. 

For beginner coders, it's totally fine to take things line by line, but it's also a lot more efficient to chain multiple statements together in fewer lines if possible. For this one, just to do a bit of practice we're going to do everything in one line: first sorting by the values in our player column, and then resetting the index of the new dataframe whose values are sorted. Once we've done this, let's print it out to make sure we did things right.

In [None]:
forward_counts = forward_counts.sort_values(by = 'player', ascending = False).reset_index()

forward_counts

Now that we've got an index, it's time to take a closer look at it. It goes from 0 to 29, which means there are only 30 different values here. totals is not actually an NHL team either, which means we've only got 29 teams here. There are 31 NHL teams, so there must have been two that didn't play a single transitioning forward in 2018-2019. 

What if we want to know who these teams are? We could go to NHL.com and look for every team on this list, one-by-one, and find the two we're missing. It would work. But it would also be totally inefficient, and prone to mistakes.

There's a much better way to do this. Let's start by using <code>set</code> on the team column in our full dataframe of only NHL skaters to obtain a set of every NHL team, and then take a look at it.

- [set](https://docs.python.org/3/library/stdtypes.html#set) returns every unique value in an object.

In [None]:
# set syntax: set(object)

nhl_team_set = set(nhl_skaters_1819.team)

nhl_team_set
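Here's a quick sketch of what <code>set</code> does on a small list with repeats. The values are invented placeholders, not real scraped teams.

```python
# Toy list with repeated values (invented for illustration).
toy_teams = ["Team A", "Team B", "Team A", "Team C", "Team B"]

# set keeps exactly one copy of each distinct value.
toy_team_set = set(toy_teams)

# Five items went in, but only three distinct values come out.
len(toy_team_set)
```

One thing to keep in mind: a set has no guaranteed order, so the order you see when you print one may differ from the order in the original data.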

Okay, so we've got every NHL team. The next step is to determine which values in this set do not appear in our transitioning forwards. We're going to do this using <code>isin</code>. 

- [isin](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html) determines whether an element from a dataframe is contained in other values.

Notice how isin specifically determines whether an element from a dataframe exists in other values? We're not working with a dataframe here; we're working with a set. We need to comply with the rules and change this set to a dataframe. Let's do this and take a look at it:

In [None]:
nhl_team_df = pd.DataFrame(nhl_team_set)

nhl_team_df

This looks good. Our index goes up to 31, which means we've got 32 values, but one of them is 'totals', which isn't actually a team, so we really have 31 teams.

There is one issue, though: The column name. It's 0. This is what happens when you build a dataframe without column names. 

Let's change this real quick with the same rename function we used before, only this time, instead of passing the name of the column to be renamed as a string, we're going to pass it as an integer, since that's what this 0 is. (So, no quotes go around our 0.)

In [None]:
nhl_team_df = nhl_team_df.rename(columns = {0:'nhl_team'})

Now that our column is named nhl_team, we can get down to business. Remember, our end goal is a version of this dataframe that contains only teams that do not exist in our transitioning forwards. To get there, we'll build a list of the teams in our transitioning forwards dataframe and then filter our NHL team dataframe to include only teams that do not appear in that list.

Before we do this, though, we're actually just going to look at the teams who <i>are</i> in our list of transitioning forwards, just to get comfortable with the isin function. Here's how the chain of commands works here:

- The first line turns the forward_counts.nhl_team series into a list called <code>teams_in_transitioning_forwards</code>.
- The second line uses a typical loc statement with a condition. We first specify that we will be using values from the nhl_team series as indicators. We then apply <code>isin</code> for the filter and pass our new teams_in_transitioning_forwards list to isin.

In [None]:
# list syntax to convert an object to a list: list(object)

teams_in_transitioning_forwards = list(forward_counts.nhl_team)

# The isin function tells us whether or not these teams appear in the list.
# isin syntax: dataframe.column.isin(values)

nhl_team_df.loc[nhl_team_df.nhl_team.isin(teams_in_transitioning_forwards)]

The index tops out at 31, which doesn't quite make sense. If there are 2 values in the original dataframe that don't meet this condition, shouldn't it top out at 29?

Technically, it doesn't have to. There <i>have to be</i> only 30 different values in the printed dataframe (including the pesky 'totals' value), and there actually are; the index just hasn't been reset from the original. If you look closely at the index, you'll manage to find a few missing numbers. Don't believe me? Take a look at the length of our dataframe.

In [None]:
len(nhl_team_df.loc[nhl_team_df.nhl_team.isin(teams_in_transitioning_forwards)])

So, the filter actually did work. Now it's time to find teams that don't meet this condition. This requires just one step: adding the <code>~</code> operator to the start of our filter statement. The <code>~</code> operator inverts the condition, so this filter returns rows that fail to meet our condition instead of rows that do.

In [None]:
missing_teams = nhl_team_df.loc[~nhl_team_df.nhl_team.isin(teams_in_transitioning_forwards)]

missing_teams
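To make the relationship between the two filters concrete, here's a small sketch on toy data. The team names and the list are invented placeholders, not the real values from our scrape.

```python
import pandas as pd

# Toy stand-ins for nhl_team_df and teams_in_transitioning_forwards.
all_teams = pd.DataFrame({"nhl_team": ["Team A", "Team B", "Team C", "Team D"]})
teams_seen = ["Team A", "Team C"]

# Rows whose team appears in the list.
in_list = all_teams.loc[all_teams.nhl_team.isin(teams_seen)]

# Rows whose team does NOT appear in the list: ~ flips every True/False.
not_in_list = all_teams.loc[~all_teams.nhl_team.isin(teams_seen)]

not_in_list
```

Every row lands in exactly one of the two results, and each row keeps its original index value (0 and 2 in one, 1 and 3 in the other), which is the same index behavior we just saw.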

There you go! The two teams who featured zero transitioning forwards were the Columbus Blue Jackets and Detroit Red Wings. I wouldn't read into this being any more than the answer to a trivia question, but it's important to develop the skills required to answer these questions when they come up.

Earlier I said that the number of transitioning forwards on each team looked to be a pretty even spread between 1 and 4. This was clearly incorrect, since there were two teams with values of 0 that I failed to consider. Before we scrutinize the initial claim any further, we ought to fix the portion that we already know was completely false and add those teams and their values of zero to our original dataframe. 

Real quickly, though, let's refresh our memories on what our original dataframe looks like. Instead of printing the whole thing, we're just going to print the <code>head</code> since we just want to see the structure.

- [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) returns the first n rows of a dataframe. If you do not pass an argument to n, it will return 5 rows by default.

In [None]:
# head syntax: dataframe.head(number_of_rows_you_want_to_see)

forward_counts.head()

This has the same structure as our missing teams dataframe, just with an extra player column. We know that the player value for every team in our dataframe of missing teams is 0, so we can just create a new column in that other dataframe using <code>assign</code> and then provide values of 0 to the player column.

- [assign](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html) assigns new columns to a dataframe or overwrites existing ones.

In [None]:
# assign syntax: dataframe.assign(column_name = column_values)

missing_teams = missing_teams.assign(player = 0)

missing_teams

Our missing team dataframe now has the same exact structure as the original. We're going to combine them by using the <code>concat</code> function from <code>pandas</code>.

- [concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) concatenates objects along a particular axis.

In [None]:
# pd.concat syntax for concatenating two dataframes: pd.concat([first_dataframe, second_dataframe])

full_transitioning_forwards = pd.concat([forward_counts, missing_teams])

# Filter out totals since it is not actually a team.

full_transitioning_forwards = full_transitioning_forwards.loc[full_transitioning_forwards['nhl_team']!='totals']

full_transitioning_forwards 

This looks almost perfect, but our index is a bit off. Notice how it goes from 1 through 29, then to 12, and then to 17? We need to reset the index, but this time it's a bit more complicated than before because this dataframe already has a numeric index. To display this complication, we'll print just the head of this dataframe with its index reset:

In [None]:
full_transitioning_forwards.reset_index().head()

Resetting the index on a dataframe that already has one creates a new 'index' column which is ugly and doesn't help anybody since it's not the actual index. We'll want to get rid of that before moving forward.

In [None]:
# Reset the index first, then drop the new 'index' column that is created by doing so.

full_transitioning_forwards = full_transitioning_forwards.reset_index().drop(columns = 'index')

# Print just the head to make sure we've done this correctly.

full_transitioning_forwards.head()
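As an aside, pandas offers a couple of shortcuts that should give the same result: <code>reset_index</code> accepts <code>drop = True</code>, and <code>pd.concat</code> accepts <code>ignore_index = True</code>. Here's a minimal sketch comparing them on toy data (the dataframes below are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for forward_counts and missing_teams, with mismatched indexes.
first = pd.DataFrame({"nhl_team": ["Team A", "Team B"], "player": [3, 1]}, index=[0, 1])
second = pd.DataFrame({"nhl_team": ["Team C"], "player": [0]}, index=[12])

combined = pd.concat([first, second])

# Two-step approach: reset, then drop the leftover 'index' column.
two_step = combined.reset_index().drop(columns="index")

# One-step alternative: drop=True discards the old index instead of keeping it.
one_step = combined.reset_index(drop=True)

# Or avoid the problem at concat time by ignoring the incoming indexes.
at_concat = pd.concat([first, second], ignore_index=True)

one_step
```

All three end up with a clean 0, 1, 2 index, so which one you use is mostly a matter of taste.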

Okay, so we have our dataframe set to go. 

---

### Visualize the Data

Now it's time to visualize our data. In order to do this, we're going to use the <code>countplot</code> function from the <code>seaborn</code> module we imported earlier as <code>sns</code>.

- [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html) shows the number of observations in a categorical bin using bars.

In [None]:
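# countplot syntax: sns.countplot(x = dataframe.column)
# Passing the column as x by keyword works across seaborn versions; newer versions no longer accept it positionally.

sns.countplot(x = full_transitioning_forwards.player)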

So, the teams with at least one transitioning forward actually <i>were</i> rather evenly distributed between 1 and 4 of them, but the two teams with none threw things off a bit.  
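If you ever want the numbers behind a count plot rather than the picture, the pandas <code>value_counts</code> method tallies the same thing in table form. This is a sketch on invented values, not our real player counts:

```python
import pandas as pd

# Toy stand-in for the player column: transitioning forwards per team (invented).
toy_player_counts = pd.Series([1, 2, 2, 3, 0, 0, 4, 1])

# Count how many teams fall into each value, then order the result by value
# so it reads left to right like the bars of a count plot.
teams_per_value = toy_player_counts.value_counts().sort_index()

teams_per_value
```

Each row of the result corresponds to one bar: the index is the bar's label and the value is the bar's height.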

Now let's play with a little more data visualization. Among <i>all</i> players who make the transition - not just forwards - how well does scoring in the AHL predict scoring in the NHL? I like to answer questions like this by determining the correlation between two variables. We can calculate the correlation coefficient between these two using the <code>corrcoef</code> function from <code>numpy</code> (which we imported as <code>np</code>).

- [corrcoef](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) returns a Pearson R correlation coefficient matrix between two array-like sets of values.

In [None]:
# corrcoef syntax: np.corrcoef(array_like_object_1, array_like_object_2)

np.corrcoef(ahl_1718_nhl_1819.ahl_ppg, ahl_1718_nhl_1819.nhl_ppg)

Yeesh, that's kind of an ugly output. We don't really need this entire correlation coefficient matrix; we just need the correlation between the first and second value, which we can extract on its own by adding <code>[0, 1]</code> to the end of our command. This will tell Python to enter the first of the two arrays (the one with an index of 0) and extract the second item (which has an index of 1). Take a look below:

In [None]:
np.corrcoef(ahl_1718_nhl_1819.ahl_ppg, ahl_1718_nhl_1819.nhl_ppg)[0, 1]

Much better. This is the correlation coefficient between the two variables. But I prefer R^2, which tells us how much of the variance in one variable can be explained by the other. To obtain R^2 from a correlation coefficient, simply multiply the value by itself. This can be done by adding <code>**2</code> to the end of the number.

In [None]:
np.corrcoef(ahl_1718_nhl_1819.ahl_ppg, ahl_1718_nhl_1819.nhl_ppg)[0, 1]**2

We really don't need more than two decimal places here, so let's assign this output to a variable called RSQ and then round it to two decimal places.

In [None]:
correlation_coefficient = (np.corrcoef(ahl_1718_nhl_1819.ahl_ppg, ahl_1718_nhl_1819.nhl_ppg)[0, 1])

RSQ = correlation_coefficient**2

RSQ = round(RSQ, 2)

RSQ
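To see how the matrix, the <code>[0, 1]</code> lookup, and the squaring fit together, here's a small worked sketch on two invented arrays (these are not real points-per-game values):

```python
import numpy as np

# Two short, made-up arrays of equal length.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# corrcoef returns a 2x2 matrix: the diagonal holds each array's correlation
# with itself, and the off-diagonal entries hold the correlation between x and y.
matrix = np.corrcoef(x, y)

# Row 0, column 1 is the correlation between x and y.
r = matrix[0, 1]

# Square it to get R^2, then round to two decimal places.
r_squared = round(r**2, 2)

r_squared
```

Here the correlation works out to 0.8, so R^2 is 0.64. The matrix is symmetric, which means <code>[1, 0]</code> would have given us the same number.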

Cool, we've got our R^2. Now it's time to build another data visualization with just a bit more effort than we put into our count plot. This time, we're going to use <code>regplot</code>.

- [regplot](https://seaborn.pydata.org/generated/seaborn.regplot.html) plots data and a linear regression model fit. 

We will feed our first column - the x variable - to regplot as x, and then do the same with y, and then make the optional decision to specify the color teal because I like it.

In [None]:
# regplot syntax: sns.regplot(x = array_like_object_for_x, y = array_like_object_for_y, color = 'your_optional_choice')

sns.regplot(x = ahl_1718_nhl_1819.ahl_ppg, y = ahl_1718_nhl_1819.nhl_ppg, color = 'teal')

Awesome! As we can see, the values aren't grouped too closely around our line but there is a solid correlation. This is already pretty useful, but there's a lot more that we can do to improve it. The first thing that comes to mind for me is adding labels that clearly specify what is going on here by adding column names that are easier to interpret.

In order to add this information, we first write out the same line of code. This lets Python know we will be printing this plot. We then use the xlabel and ylabel functions from matplotlib.pyplot (which we imported as plt) to change the x and y labels of the plot we're going to print. 

In [None]:
sns.regplot(x = ahl_1718_nhl_1819.ahl_ppg, y = ahl_1718_nhl_1819.nhl_ppg, color = 'teal')

# plt.xlabel syntax: plt.xlabel("Label you want")

plt.xlabel("AHL Points Per Game in 2017-2018")
plt.ylabel("NHL Points Per Game in 2018-2019")

That looks so much nicer! There's a lot of empty space in the top left corner, though; just enough to include an R^2. The plt.text function allows us to print text in a chosen location of our plot. We'll repeat this same cell, only with plt.text at the bottom.

In [None]:
sns.regplot(x = ahl_1718_nhl_1819.ahl_ppg, y = ahl_1718_nhl_1819.nhl_ppg, color = 'teal')
plt.xlabel("AHL Points Per Game in 2017-2018")
plt.ylabel("NHL Points Per Game in 2018-2019")

# Build a single string for our R^2 value.

RSQString = "R^2 = " + str(RSQ)

# plt.text syntax: plt.text(x-location, y-location, (text))
# We want to print the R^2 value in the top left corner where there isn't much data. We know that the top left corner has an x-value of about 0.2 and y-value of about 0.7.

plt.text(0.2, 0.7, RSQString)

#We add plt.show() to the end just to show only the plot itself without the text at the top.

plt.show()

Nice work! This is a very solid data visualization. It's nothing too fancy, but it doesn't have to be.

___

### Conduct Deeper Data Analysis

Now it's time to build a very rudimentary NHL equivalency model using the methodology laid out by Gabriel Desjardins in [League Equivalencies](http://hockeyanalytics.com/Research_files/League_Equivalencies.pdf). This is his formula:

- Quality of League x = (Average PPG in NHL in Year 2) / (Average PPG in league x in Year 1)

It's a very simple formula, but it's good enough to get started with and it's surprisingly effective. We're going to calculate this by using numpy's mean function to get the average points per game for each league and then divide them by one another.

In [None]:
# np.mean syntax: np.mean(array_of_values)

nhl_ppg_average = np.mean(ahl_1718_nhl_1819.nhl_ppg)

ahl_ppg_average = np.mean(ahl_1718_nhl_1819.ahl_ppg)

ahl_nhl_equivalency = nhl_ppg_average/ahl_ppg_average

ahl_nhl_equivalency

So one point in the AHL is worth about 0.45 points in the NHL, which is the same value that Desjardins laid out in his paper using older data. Funny how that works.
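If the formula still feels abstract, here's the same arithmetic on three invented players. The numbers are made up to keep the math easy to follow, so the result is not meant to match the real equivalency.

```python
import numpy as np

# Invented points-per-game values for three hypothetical players.
toy_ahl_ppg = np.array([0.6, 1.0, 0.8])  # Year 1, in the AHL
toy_nhl_ppg = np.array([0.3, 0.5, 0.4])  # Year 2, in the NHL

# Average PPG in the NHL in Year 2 divided by average PPG in the AHL in Year 1.
toy_equivalency = np.mean(toy_nhl_ppg) / np.mean(toy_ahl_ppg)

toy_equivalency
```

The averages are 0.4 and 0.8, so in this toy example a point in the AHL would be worth 0.5 points in the NHL.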

This tutorial is very close to being complete. There's just one thing left to do: Define our own function that obtains an NHLe value for a player who played in the AHL. 

You're already somewhat familiar with functions, as you've been using a handful of them throughout this tutorial. To give you a formal definition, a function is a chunk of code that only runs when you call it. We're going to build a function where the user inputs an AHL points per game value, and the function takes that value and automatically spits out an NHLe score. Here's how defining a function works:

- In line 1, we define the name of our function and enter, in parentheses, every argument that the function takes. We finish line 1 with a colon.
- Every line within the function after line 1 is indented. That notifies Python that the line is a part of the function.
- In the function, we write code.

In [None]:
# In line 1, we name our function obtain_nhle_given_ahl_ppg and give it 1 argument to take in: ahl_ppg.
# In line 2, we calculate NHLe by multiplying AHL points per game by 82 and then multiplying that by the AHL to NHL equivalency we obtained previously.
# In line 3, we round this value to the 2nd decimal.
# In line 4, we print the NHLe.

def obtain_nhle_given_ahl_ppg(ahl_ppg):
    nhle = ahl_ppg * 82 * ahl_nhl_equivalency
    nhle = round(nhle, 2)
    print(nhle)

Our function is defined. Nothing happened because we just stored the function and haven't called it yet. It's now time to call it. 

Let's keep it simple and say we're talking about a player who scored 40 points in 40 AHL games and thus had an AHL points per game of 1. If we pass his AHL points per game as an argument to our function, we receive his NHLe:

In [None]:
obtain_nhle_given_ahl_ppg(1)

So, according to the very rudimentary NHLe model we built out, an AHL points per game of 1 has an NHLe of about 37 points per 82 games. Seems quite reasonable to me.
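One possible variation, if you'd like to practice: have the function <code>return</code> its result instead of printing it, so the value can be stored and reused. The sketch below is not part of the original method; the function name, the <code>equivalency</code> parameter, and the <code>games</code> parameter are my own hypothetical choices, and 0.45 is just the approximate equivalency we found above.

```python
def nhle_from_ahl_ppg(ahl_ppg, equivalency, games=82):
    """Return projected NHL points over `games` games, rounded to 2 decimals."""
    # Same arithmetic as before: AHL points per game x games x league equivalency.
    return round(ahl_ppg * games * equivalency, 2)

# Using roughly 0.45 as the AHL-to-NHL equivalency.
example_nhle = nhle_from_ahl_ppg(1, 0.45)

example_nhle
```

Passing the equivalency in as an argument means the function no longer depends on a variable defined elsewhere in the notebook, which makes it easier to reuse with a different league.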

___

# Congratulations on completing the Python Hockey Analytics Tutorial!

Chances are that if you took a Python quiz right now that contained only material we covered in this tutorial, you would still not score extremely well. That's okay! The coolest part about computer programming is that you don't necessarily need to know anything by heart. It helps, but unless you're in a job interview, you can always look back at your previous work and figure out how you did something. 

While this marks the end of our tutorial, I hope it only marks the beginning of your journey into hockey analytics and computer programming. The community needs more smart people asking questions and finding answers, and if you were capable of completing this tutorial, we could use you.

If you have any other comments, questions, concerns, or suggestions for improving this tutorial, do not hesitate to reach out to me on Twitter @TopDownHockey or email me directly at patrick.s.bacon@gmail.com.