Things used:
    
- Assignment vs Print
- import 
- Modules - A module is a file containing Python definitions and statements.
- Package - A collection of Python modules.
- List -  [Python Documentation - Lists](https://docs.python.org/3/tutorial/datastructures.html)
- Function - Block of code which only runs when it is called. 
- Drop - [Pandas Documentation - Drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
- loc
- iloc
- str
- strip()
- rename
- merge
- Pyplot/Seaborn/np.corrcoef



## Python Hockey Analytics Tutorial

### Overview

Welcome to the TopDownHockey Python Hockey Analytics Tutorial! If you've made it this far, you've already managed to do the hardest thing I've ever had to do with Python: Installing it and installing Jupyter Notebook. By comparison, everything you do going forward should be a breeze.

By the end of this tutorial, you will not only have a base-level understanding of Python as a programming language, but you will also be comfortable enough in Python to perform small-scope data analysis on your own.

---

### Step 1: Importing Packages

The following cell imports the packages we'll use. Importing a module runs its top-level code once and makes its definitions available under the name you give it, such as <code>pd</code> for pandas.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import TopDownHockey_Scraper.TopDownHockey_EliteProspects_Scraper as tdhepscrape

### Step 2: Scraping Data

We will be using the <code>TopDownHockey_EliteProspects_Scraper</code> module from the <code>TopDownHockey_Scraper</code> package to scrape data from Elite Prospects. 

A module is a file containing Python statements and definitions; it can also be thought of as a code library.
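To make the idea of a module concrete, here is a minimal sketch that uses the standard library's <code>math</code> module as a stand-in for any module (it is not one of the packages this tutorial relies on). It shows that an alias is just a second name for the same module.

```python
# The standard-library math module stands in for any module here.
import math
import math as m  # the same module, bound to a shorter alias

# Names defined inside a module are reached with dot notation.
root = math.sqrt(16)
same_root = m.sqrt(16)

print(root, same_root)
print(m is math)  # both names point at one module object
```

This is the same pattern as <code>import pandas as pd</code>: the alias only saves typing.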

The <code>get_skaters</code> function takes two arguments: one or more leagues and one or more seasons. In this case, we're going to scrape AHL and NHL data for the 2017-2018 and 2018-2019 seasons, which means we're going to build two lists and then feed them to our function. Our function will scrape all data for the leagues and seasons we provide and return a dataframe. We will store this dataframe as <code>ahl_nhl_skaters_1719</code>.

The <code>%%time</code> cell magic at the top of the next cell will tell you how long the scrape takes. It should take about a minute.

In [None]:
%%time

# These are both lists 

leagues = ["ahl", "nhl"]
seasons = ["2017-2018", "2018-2019"]

ahl_nhl_skaters_1719 = tdhepscrape.get_skaters(leagues, seasons)
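If passing two lists to a function feels abstract, the sketch below may help. <code>list_scrape_targets</code> is a hypothetical helper written for illustration only; it is not part of the scraper and is not a description of how <code>get_skaters</code> works internally. It assumes that every league gets paired with every season.

```python
from itertools import product


def list_scrape_targets(leagues, seasons):
    """Hypothetical helper: pair every league with every season."""
    return list(product(leagues, seasons))


leagues = ["ahl", "nhl"]
seasons = ["2017-2018", "2018-2019"]

targets = list_scrape_targets(leagues, seasons)

# Two leagues and two seasons give four league-season pairs.
for league, season in targets:
    print(league, season)
```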

### Step 2: Clean our Data

Data cleaning isn't fun, but it's a key part of data analysis. Every minute you spend cleaning your data at the beginning of your analysis is an investment in the final product, and most of the time, you'll get a positive return on investment. It's much easier to identify and fix problems early on.

Our end goal here is to look at all skaters who played in the AHL in 2017-2018 and then played in the NHL in 2018-2019. Let's start by taking a look at what we've got. 

In [None]:
ahl_nhl_skaters_1719 

We're working with 4,088 rows of boxcar stats for AHL and NHL skaters. It looks like our <code>player</code> and <code>playername</code> columns hold nearly the same information, so we can get rid of one; ideally the ugly <code>player</code> column, which contains redundant position data. Let's do that using <code>drop</code>, and then take a look at our new dataframe:

In [None]:
ahl_nhl_skaters_1719 = ahl_nhl_skaters_1719.drop(columns = ['player'])

ahl_nhl_skaters_1719
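As a quick aside, here is a small sketch of how <code>drop</code> behaves, using a made-up dataframe (the names and numbers are placeholders, not scraped data). It shows that <code>drop</code> returns a new dataframe and leaves the original alone, which is why the cell above assigns the result back to the same name.

```python
import pandas as pd

# Made-up placeholder rows, not real scraped data.
toy = pd.DataFrame({
    "player": ["Player One (C)", "Player Two (D)"],
    "playername": ["Player One", "Player Two"],
    "gp": [10, 20],
})

# drop returns a new dataframe without the listed columns.
dropped = toy.drop(columns=["player"])

print(list(dropped.columns))  # the new dataframe has two columns
print(list(toy.columns))      # the original still has all three
```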

Much better! But it's still not ideal. The first column is team, then we have our boxcar stats, then a big ugly link, and only after that do we actually see the season, league, player, and position. 

Ideally, the first things we'd see from left to right would be key identifying information like player, season, team, league, and position, then we'd see our boxcar stats in the order they're currently in, and then we'd see the link last.

Why do we want to see the link column at all? It's big, ugly, and doesn't seem to add anything. Wouldn't the real ideal layout be one that doesn't have this column? In a perfect world, yes. But in this imperfect world, we can occasionally mix up two completely different players who have the same name. Let me demonstrate by displaying a filtered dataframe that contains only players whose names are equal to Sebastian Aho:

In [None]:
ahl_nhl_skaters_1719.loc[ahl_nhl_skaters_1719.playername=="Sebastian Aho"]

Wait, what on earth!? This tells us that there actually isn't a single player in our dataframe whose name is exactly identical to Sebastian Aho, which is obviously wrong. Maybe the defenseman in the Islanders organization was just a fever dream, but there is definitely a Sebastian Aho who plays center for the Carolina Hurricanes. He scored over a point per game in 2018-2019! It can't be right that there is nobody in here named Sebastian Aho, can it?

Technically, it can be, and it actually is, because the names in our data come with stray white space. Let's take a look at an example of this by printing out the name of the very first player in the dataframe.

In [None]:
ahl_nhl_skaters_1719.playername.iloc[0]

See that empty space between the y in Terry's name and the single quote indicating the end of the name? That's trailing white space. Chris Terry obviously exists in our data, but if we filtered for players whose names were exactly <code>Chris Terry</code>, we'd get nothing. Try it out for yourself:

In [None]:
ahl_nhl_skaters_1719.loc[ahl_nhl_skaters_1719.playername=="Chris Terry"]

There are a few ways to handle this issue, but the simplest way is to just clean the white space off of our playername field. To do this, we'll first use the <code>str</code> accessor to apply string methods to every value in our playername column, and then use the <code>strip()</code> method to remove leading and trailing white space. Let's do this, and then take a look at a filtered dataframe which contains players whose names are exactly Chris Terry or Sebastian Aho:

In [None]:
ahl_nhl_skaters_1719.playername = ahl_nhl_skaters_1719.playername.str.strip()

ahl_nhl_skaters_1719.loc[(ahl_nhl_skaters_1719.playername=="Chris Terry") | (ahl_nhl_skaters_1719.playername=="Sebastian Aho")]
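Here is the same idea in miniature, as a sketch on a hand-typed series rather than the scraped data. The stray spaces are placed by hand to mimic the problem. Counting exact matches before and after stripping shows why the earlier filter came back empty.

```python
import pandas as pd

# Hand-typed names with stray leading or trailing spaces.
names = pd.Series(["Chris Terry ", " Sebastian Aho", "Sebastian Aho "])

# Exact comparison fails while the extra spaces are present.
matches_before = (names == "Sebastian Aho").sum()

# .str applies a string method to every value in the series.
cleaned = names.str.strip()
matches_after = (cleaned == "Sebastian Aho").sum()

print(matches_before, matches_after)
```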

This confirms two things: first, our <code>strip()</code> call worked and we successfully got rid of the white space around Chris Terry's name, and second, there are two Sebastian Ahos.

In theory, we could remedy this particular problem by using name and position as identifiers, or using the original player column that contained positional data next to playername. Since one Sebastian Aho is a forward and one is a defenseman, this would allow us to differentiate between the two of them.

The problem with this approach is that players with equal names do not always play different positions. For every pair of Sebastian Ahos and Colin Whites who play two different positions, there are pairs of Erik Gustafssons or Erik Karlssons who both play the same position. (The other defenseman named Erik Karlsson has never played in the NHL, but he exists. And he has ruined my NHLe models. I promise.) While this project has a very small scope, you may eventually transition to projects with a much larger scope and need a process in place that handles these issues. 

Thankfully, every player has their own unique page on Elite Prospects, and thus their own unique link to that page. Take another look at our Sebastian Ahos; Bridgeport's defenseman has a different link than Carolina's forward. The same is true for the two Erik Gustafssons. <b>This is why we keep the link.</b> And because we've got the link, we don't need to further worry about pairs like these Sebastian Ahos. We didn't really have to clean our data to begin with, but it's good practice regardless.
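One way to check whether a column is safe to use as an identifier is the <code>is_unique</code> attribute of a pandas series. The sketch below uses a tiny hand-built dataframe; the <code>example.com</code> links are placeholders, not real Elite Prospects URLs.

```python
import pandas as pd

# Hand-built rows; the links are placeholder URLs.
toy = pd.DataFrame({
    "playername": ["Sebastian Aho", "Sebastian Aho", "Chris Terry"],
    "link": [
        "https://example.com/player/1",
        "https://example.com/player/2",
        "https://example.com/player/3",
    ],
})

# Names repeat, so they cannot identify a player on their own.
name_is_unique = toy.playername.is_unique

# Every row has its own link, so link works as an identifier.
link_is_unique = toy.link.is_unique

print(name_is_unique, link_is_unique)
```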

### Step 3: Make it look good

The ideal format I previously laid out for our data is player, season, team, league, and position in that order, then our boxcar stats in the order they came in, and then that hideous-but-helpful link. We'll use <code>loc</code> to tell Python which columns to keep, and actually pass all of them, just in the order we want. We'll also rename that ugly playername column, overwrite the dataframe with this neater one, and then take a look at it:

In [None]:
ahl_nhl_skaters_1719 = ahl_nhl_skaters_1719.loc[:, ['playername', 'season', 'team', 'league', 'position', 'gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-', 'link']]

ahl_nhl_skaters_1719 = ahl_nhl_skaters_1719.rename(columns = {'playername':'player'})

ahl_nhl_skaters_1719
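For reference, here is a stripped-down sketch of the reorder-then-rename pattern on a made-up one-row dataframe. The values are placeholders; only the column handling matters.

```python
import pandas as pd

# Made-up single row with columns in an awkward order.
toy = pd.DataFrame({
    "team": ["Team A"],
    "gp": [10],
    "playername": ["Player One"],
    "link": ["https://example.com/player/1"],
})

# .loc[:, [...]] keeps every row and returns the columns in the listed order.
ordered = toy.loc[:, ["playername", "team", "gp", "link"]]

# rename maps old column names to new ones; unlisted columns are untouched.
renamed = ordered.rename(columns={"playername": "player"})

print(list(renamed.columns))
```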

### Step 4: Prepare our data for analysis

Remember how we used <code>loc</code> earlier to filter for players whose names were Chris Terry or Sebastian Aho? Now we're going to use <code>loc</code> to build two separate dataframes: one for the 2017-2018 AHL season and one for the 2018-2019 NHL season. Then we're going to merge those two and create a new dataframe called <code>ahl_1718_nhl_1819</code>:

In [None]:
ahl_skaters_1718 = ahl_nhl_skaters_1719.loc[(ahl_nhl_skaters_1719.season=="2017-2018") & (ahl_nhl_skaters_1719.league=="ahl")]

nhl_skaters_1819 = ahl_nhl_skaters_1719.loc[(ahl_nhl_skaters_1719.season=="2018-2019") & (ahl_nhl_skaters_1719.league=="nhl")]

ahl_1718_nhl_1819 = ahl_skaters_1718.merge(nhl_skaters_1819, on = 'link', how = 'inner')

ahl_1718_nhl_1819

Yikes! Our merge was successful, and we've got a solid sample of 302 players who meet the criteria, but we've got an ugly bunch of columns with <code>_x</code> and <code>_y</code> attached to them.

This is what happens when we merge two dataframes that have columns with the same name and we don't join across those columns. These two dataframes had the same column names, so everything except for the link field we joined across is duplicated.
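The suffix behavior is easy to reproduce on a small scale. The sketch below merges two made-up dataframes (placeholder teams, links, and numbers) that share the column names <code>team</code> and <code>gp</code>. It also shows the <code>suffixes</code> parameter of <code>merge</code>, an alternative this tutorial does not use, which lets you pick more descriptive endings up front.

```python
import pandas as pd

# Made-up rows; "link" is the only column we join on.
left = pd.DataFrame({"link": ["a", "b"], "team": ["AHL Team 1", "AHL Team 2"], "gp": [50, 60]})
right = pd.DataFrame({"link": ["b", "c"], "team": ["NHL Team 1", "NHL Team 2"], "gp": [30, 40]})

# An inner join keeps only links present in both dataframes.
merged = left.merge(right, on="link", how="inner")
print(list(merged.columns))  # shared names get _x (left) and _y (right)

# Optional: choose your own suffixes instead of the defaults.
labeled = left.merge(right, on="link", how="inner", suffixes=("_ahl", "_nhl"))
print(list(labeled.columns))
```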

This isn't the end of the world. It's actually quite easy to handle. We know that the left side of our new dataframe - the one whose column names have x attached to them - contains AHL data from 2017-2018, and we know that the right side contains NHL data from 2018-2019. (If we forget, the name of our dataframe literally serves as a reminder.) We also know that these players don't change name or position from season to season, so we can just keep the player and position from the left side of our data and drop those on the right side.

So we've got 6 columns to drop. Let's do that first.

In [None]:
ahl_1718_nhl_1819 = ahl_1718_nhl_1819.drop(columns = ['season_x', 'league_x', 'season_y', 'league_y', 'player_y', 'position_y'])

ahl_1718_nhl_1819

That already looks a lot better, but it's still a bit of a mess. Before we do anything else, we want to move our link field to the far right side of the dataframe, as far out of sight and out of mind as possible. Instead of typing out all 19 column names in the order we want, though, we're going to make a list with the column names we want. Then, we're going to remove the link column, append it at the end, and then print our list before moving forward.

In [None]:
my_columns = list(ahl_1718_nhl_1819.columns)

my_columns.remove('link')

my_columns.append('link')

my_columns

Notice that, unlike with dataframes, where you have to explicitly assign the output of a command to a name in order to keep it, <code>append</code> and <code>remove</code> change the list in place. There is nothing to assign; the list itself is updated.
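A short sketch of that difference, using a throwaway list of column names. It also shows a common trap: because these methods change the list in place, they return <code>None</code>, so assigning their result to a name does not give you the list back.

```python
cols = ["link", "player", "gp"]

# remove changes the list itself and returns None.
result = cols.remove("link")

# append also works in place, adding the item at the end.
cols.append("link")

print(result)
print(cols)
```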

Now we've got a list of columns in the order we want them. The next step is to use <code>loc</code> and just feed it our list, and then rename the columns so that each one says which league it came from.

In [None]:
ahl_1718_nhl_1819 = ahl_1718_nhl_1819.loc[:, my_columns]

ahl_1718_nhl_1819 = ahl_1718_nhl_1819.rename(columns = {'player_x':'player', 'team_x':'ahl_team', 'position_x':'position', 'gp_x':'ahl_gp', 
                                                        'g_x':'ahl_g', 'a_x':'ahl_a', 'tp_x':'ahl_p', 'ppg_x':'ahl_ppg', 'pim_x':'ahl_pim', 
                                                        '+/-_x':'ahl_+/-', 'team_y':'nhl_team', 'gp_y':'nhl_gp', 'g_y':'nhl_g', 'a_y':'nhl_a', 
                                                        'tp_y':'nhl_p', 'ppg_y':'nhl_ppg', 'pim_y':'nhl_pim', '+/-_y':'nhl_+/-'})

ahl_1718_nhl_1819

Hey, this is starting to look like something we can actually work with! We've got our AHL data clearly laid out on one side and our NHL data clearly laid out on the other side. While we've made a lot of progress, we've still got just a few more changes to make. 

Every column in our dataset is currently an <code>object</code> type, and in order to handle them mathematically, we need to convert them to float types. But before we can do this, we need to handle all players whose target fields - gp and ppg for each league in this case - are <code>-</code>. We're going to use numpy's <code>where</code> function to change all values that are <code>-</code> to 0, and leave all other values as they were.

In [None]:
ahl_1718_nhl_1819.ahl_gp = np.where(ahl_1718_nhl_1819.ahl_gp=="-", 0, ahl_1718_nhl_1819.ahl_gp)
ahl_1718_nhl_1819.nhl_gp = np.where(ahl_1718_nhl_1819.nhl_gp=="-", 0, ahl_1718_nhl_1819.nhl_gp)

ahl_1718_nhl_1819.ahl_ppg = np.where(ahl_1718_nhl_1819.ahl_ppg=="-", 0, ahl_1718_nhl_1819.ahl_ppg)
ahl_1718_nhl_1819.nhl_ppg = np.where(ahl_1718_nhl_1819.nhl_ppg=="-", 0, ahl_1718_nhl_1819.nhl_ppg)

Now we can change these to float types. If we had tried this earlier, while a column still held a <code>-</code> value, we'd get an error, because the string <code>-</code> cannot be converted to a float. Once we've changed these to float types, we can keep only players who played at least 20 games in both leagues and take a look at them.

In [None]:
ahl_1718_nhl_1819.ahl_gp = ahl_1718_nhl_1819.ahl_gp.astype(float) 

ahl_1718_nhl_1819.nhl_gp = ahl_1718_nhl_1819.nhl_gp.astype(float) 

ahl_1718_nhl_1819.ahl_ppg = ahl_1718_nhl_1819.ahl_ppg.astype(float) 

ahl_1718_nhl_1819.nhl_ppg = ahl_1718_nhl_1819.nhl_ppg.astype(float) 

ahl_1718_nhl_1819 = ahl_1718_nhl_1819.loc[(ahl_1718_nhl_1819.ahl_gp>19) & (ahl_1718_nhl_1819.nhl_gp>19)]

ahl_1718_nhl_1819
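To see the whole replace, convert, and filter sequence in one place, here is a sketch on a made-up three-row column. The values are placeholders, stored as <code>object</code> to mirror the scraped data. It first confirms that converting too early raises an error.

```python
import numpy as np
import pandas as pd

# Made-up games-played values stored as strings, with "-" for missing.
toy = pd.DataFrame({"gp": pd.Series(["25", "-", "40"], dtype=object)})

# Converting before cleaning fails because "-" is not a number.
try:
    toy.gp.astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# np.where(condition, value_if_true, value_if_false)
toy["gp"] = np.where(toy.gp == "-", 0, toy.gp)
toy["gp"] = toy.gp.astype(float)

# Keep rows with at least 20 games played.
kept = toy.loc[toy.gp > 19]

print(conversion_failed)
print(list(toy.gp))
print(len(kept))
```

Note that replacing <code>-</code> with 0 is the choice this tutorial makes; it works here because those rows are filtered out by the games-played cutoff anyway.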

Awesome! Let's use the matplotlib and seaborn libraries to visualize our data. 

In [None]:
sns.set_palette("cubehelix", 8)
sns.regplot(x = "ahl_ppg", y = "nhl_ppg", data = ahl_1718_nhl_1819, color = 'teal')
plt.xlabel("AHL Points Per Game")
plt.ylabel("NHL Points Per Game")

R2 = round((np.corrcoef(ahl_1718_nhl_1819.ahl_ppg, ahl_1718_nhl_1819.nhl_ppg)[0, 1])**2, 2)
plt.text(0.2, 0.7, f'R-squared = {R2}')
plt.show()

In [None]:
ahl_1718_nhl_1819_forwards = ahl_1718_nhl_1819.loc[ahl_1718_nhl_1819.position.str.strip()!="D"]

ahl_1718_nhl_1819_defensemen = ahl_1718_nhl_1819.loc[ahl_1718_nhl_1819.position.str.strip()=="D"]

In [None]:
print('Skaters - R^2 =',round((np.corrcoef(ahl_1718_nhl_1819.ahl_ppg, ahl_1718_nhl_1819.nhl_ppg)[0, 1])**2, 2))
print('Forwards - R^2 =',round((np.corrcoef(ahl_1718_nhl_1819_forwards.ahl_ppg, ahl_1718_nhl_1819_forwards.nhl_ppg)[0, 1])**2, 2))
print('Defensemen - R^2 =',round((np.corrcoef(ahl_1718_nhl_1819_defensemen.ahl_ppg, ahl_1718_nhl_1819_defensemen.nhl_ppg)[0, 1])**2, 2))
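For anyone wondering what the <code>[0, 1]</code> is doing in those lines: <code>np.corrcoef</code> returns a 2x2 correlation matrix, and the entry at row 0, column 1 is the correlation between the two inputs. Squaring it gives R-squared for a simple linear relationship. The sketch below uses small made-up arrays rather than hockey data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# A perfectly linear relationship gives a correlation of 1.
y_perfect = 2 * x + 1
r_perfect = np.corrcoef(x, y_perfect)[0, 1]

# A noisier relationship gives a weaker correlation.
y_noisy = np.array([1.0, 3.0, 2.0, 4.0])
matrix = np.corrcoef(x, y_noisy)
r_noisy = matrix[0, 1]

print(matrix.shape)            # (2, 2)
print(round(r_perfect**2, 2))  # 1.0
print(round(r_noisy**2, 2))    # 0.64
```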