# Inferential statistics
## Part I - Data Cleaning

Your family is very passionate about basketball. You always have discussions over players, games, statistics and whatnot. As you can imagine those discussions never reach a conclusion since everyone is simply sharing their opinion with no statistics to back them up!

![](../images/basket.jpg)

Since you are attending a data analysis bootcamp you'd like to take advantage of your newfound knowledge to finally put an end to your family's discussions. 

Luckily we have found a dataset containing data related to the players of the WNBA for the 2016-2017 season that we can use. 

Let's start with cleaning the data and then we'll continue with a general exploratory analysis and some inferential statistics.

### Dataset

The dataset we will be using contains the statistics from the WNBA players for the 2016-2017 season. You will be able to find more information on the dataset in the [codebook](../data/codebook.md) uploaded to the repository.

### Libraries

First we'll import the necessary libraries and increase the maximum number of displayed columns so that you can see the whole dataset in the same window.

# Codebook

## Dataset

The dataset we are working with contains personal data and game statistics for the 142 players of the WNBA. The data represents the performances of the players during all the games of the 2016/2017 season.

For those of you that are less accustomed to basketball lingo here are some definitions:
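- **Field Goal**: any basket scored during live play, that is, any made shot other than a free throw. Both 2-point and 3-point shots count, so the field goal columns include the 3-point ones.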
- **Free Throws**: shots that are awarded to a player after they are fouled. Play stops and the player can shoot unopposed from behind the free throw line.
- **Rebound**: a ball recovered after a missed shot. If the shot was taken by a teammate it's an Offensive Rebound; if it was taken by an opponent it's a Defensive Rebound.
- **Turnover**: losing possession of the ball before your team has had a chance to shoot.
- **Blocks**: blocking an opponent's shot.
- **Double doubles**: a player is said to have recorded a double-double when they reach double digits in two of the five main statistics: points, rebounds, blocks, steals and assists.
- **Triple doubles**: same as a double-double but with three of the five statistics.
- **Positions**: here's the Wikipedia page if you'd like to better understand the various positions in basketball: https://en.wikipedia.org/wiki/Basketball_positions

## Features Description

| Feature   | Description  |
|:---|:---|
| Name | Name  |
| Team | Team |
| Pos  | Position |
| Height  | Height  |
| Weight  |  Weight |
| BMI  | Body Mass Index |
| Birth\_Place  | Birth place  |
| Birthdate  |  Birthdate |
| Age  |  Age |
| College  |  College |
| Experience  |  Experience |
| G | Games Played |
| MIN | Minutes Played |
| FGM | Field Goals Made |
| FGA | Field Goals Attempts |
| FG% | Field Goals % |
| 3PM | 3Points Made |
| 3PA | 3Points Attempts |
| 3P% | 3Points % |
| FTM | Free Throws made |
| FTA | Free Throws Attempts |
| FT% | Free Throws % |
| OREB | Offensive Rebounds |
| DREB | Defensive Rebounds |
| REB | Total Rebounds |
| AST | Assists |
| STL | Steals |
| BLK | Blocks |
| TO | Turnovers |
| PTS | Total points |
| DD2 | Double doubles |
| TD3 | Triple doubles |

## Source
[WNBA Player Stats 2017]

In [5]:
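# Libraries
import pandas as pd
import os

print("\npandas version:", pd.__version__)

# Use the fully qualified option name: the short form 'max_columns'
# can match more than one option in newer pandas versions.
pd.set_option('display.max_columns', 100)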


pandas version: 0.25.1


### Load the dataset

Load the dataset into a df called `wnba` and take an initial look at it using the `head()` method.

In [15]:
#your code here
wnba = pd.read_csv("../data/wnba.csv")
#os.listdir("../data")
#os.getcwd()
wnba.head(10)

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0
5,Alexis Peterson,SEA,G,170,63.0,21.799308,US,"June 20, 1995",22,Syracuse,R,14,90,9,34,26.5,2,9,22.2,6,6,100.0,3,13,16,11,5,0,11,26,0,0
6,Alexis Prince,PHO,G,188,81.0,22.91761,US,"February 5, 1994",23,Baylor,R,16,112,9,34,26.5,4,15,26.7,2,2,100.0,1,14,15,5,4,3,3,24,0,0
7,Allie Quigley,CHI,G,178,64.0,20.19947,US,"June 20, 1986",31,DePaul,8,26,847,166,319,52.0,70,150,46.7,40,46,87.0,9,83,92,95,20,13,59,442,0,0
8,Allisha Gray,DAL,G,185,76.0,22.20599,US,"October 20, 1992",24,South Carolina,2,30,834,131,346,37.9,29,103,28.2,104,129,80.6,52,75,127,40,47,19,37,395,0,0
9,Allison Hightower,WAS,G,178,77.0,24.302487,US,"June 4, 1988",29,LSU,5,7,103,14,38,36.8,2,11,18.2,6,6,100.0,3,7,10,10,5,0,2,36,0,0


### Check NaN values
As you know, one of our first steps is to check whether the dataset contains any NaN values. Look for the columns that contain NaN values and count how many rows have them.

In [17]:
#your code here
wnba.isnull().sum()

Name            0
Team            0
Pos             0
Height          0
Weight          1
BMI             1
Birth_Place     0
Birthdate       0
Age             0
College         0
Experience      0
Games Played    0
MIN             0
FGM             0
FGA             0
FG%             0
3PM             0
3PA             0
3P%             0
FTM             0
FTA             0
FT%             0
OREB            0
DREB            0
REB             0
AST             0
STL             0
BLK             0
TO              0
PTS             0
DD2             0
TD3             0
dtype: int64

We can see that there are only two NaNs in the whole dataset, one in the Weight column and one in the BMI one. Let's look at the actual rows that contain the NaN values.

In [20]:
#your code here
wnba[wnba.isnull().any(axis = 1)]

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
91,Makayla Epps,CHI,G,178,,,US,"June 6, 1995",22,Kentucky,R,14,52,2,14,14.3,0,5,0.0,2,5,40.0,2,0,2,4,1,0,4,6,0,0


It looks like there is only a single row that has NaN values in it, which is good! Just in case, let's check how much removing a single row may influence our dataset by calculating the percentage of values we will be removing.

In [29]:
#your code here
"""
less than 1%
"""
(len(wnba[wnba.isnull().any(axis = 1)]) / len(wnba)) * 100

0.6993006993006993

It is very important to be as careful as possible when dealing with NaN values and to drop data only when it is strictly necessary. This decision also depends on the nature of our analysis. If, for example, our analysis does not require the Weight and BMI of the players at all, we can simply keep the row, given that the NaN values are only present in the Weight and BMI columns.
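To make the "keep the row" option concrete, here is a minimal sketch on a toy table (the values and the `needed` list are made up for illustration, not taken from the dataset). It restricts `dropna` to the columns the analysis actually needs, so a row that is only missing Weight survives.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (illustrative values only).
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Weight": [71.0, np.nan, 69.0],
    "PTS": [93, 6, 218],
})

# Hypothetical list of columns our analysis depends on.
needed = ["PTS"]

# Drop a row only if one of the needed columns is missing.
kept = toy.dropna(subset=needed)

# Column-wise statistics skip NaN by default, so the incomplete row
# still contributes to PTS and is simply ignored for Weight.
mean_pts = kept["PTS"].mean()
mean_weight = kept["Weight"].mean()
```

With this approach the incomplete row is excluded only from calculations that involve the missing columns.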

In this specific example, let's say our decision is to drop it. Write some code to drop the NaN values. 

In [32]:
#your code here
wnba.dropna(inplace = True)

# checking:
wnba[wnba.isnull().any(axis = 1)]

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3


**Do you think it is a good decision? Think about a case in which you wouldn't want to drop the value.**

In [33]:
#your answer here
"""
The more I practice, read and think about this, the less inclined I am to drop NaNs, so
I'm on the fence about dropping the value.

(Heavily influenced by https://towardsdatascience.com/missing-values-dont-drop-them-f01b1d8ff557
and https://arxiv.org/pdf/1611.09477.pdf, section 2.1, which is mentioned in the Medium
article above.)

The thing is, in this case the dataset is not that big (143 entries, 142 after dropping the NaNs),
so there isn't really a worry about storage or processing power. Also, there may be ways to
easily and quickly check the weight of the player (and from there calculate the BMI), like
scraping a Wikipedia page, a WNBA profile page, or her club's site...

Actually, a quick Google search shows that her weight is missing *everywhere*, meaning that
this data is really not available. Here it might not mean much, but in some cases missing data
will be at least as informative as the data itself.

Another thing we could do (which I'm not sure would be any better) would be to take the
mean BMI and weight of WNBA players of her height and fill the gaps with those, while creating
a new binary feature that flags whether or not the value was originally missing (again,
see the article and paper above). This approach seems more reasonable to me at this point, but
I'm really not sure (and would appreciate direction).

All this being said... I think I would indeed prefer the flagging method, filling the gaps
with the mean weight and BMI for her height, and I'd be curious to see the impact of this
change on the overall data, for it may make no difference or, worse, lead to worse
conclusions.
"""

(142, 32)
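The flag-and-fill idea from the answer above could look roughly like the sketch below. It runs on a toy table, the column name `Weight_was_missing` is a hypothetical choice, and falling back to the overall mean is an assumption for heights with no other players. BMI is recomputed as weight in kg divided by height in metres squared, which reproduces the BMI values shown in the `head()` output (for example 73 kg at 185 cm gives about 21.33).

```python
import numpy as np
import pandas as pd

# Toy data: one player of height 178 has no recorded weight.
toy = pd.DataFrame({
    "Height": [178, 178, 178, 185],
    "Weight": [64.0, 77.0, np.nan, 73.0],
})

# Flag first, so the fact that the value was missing survives the fill.
toy["Weight_was_missing"] = toy["Weight"].isna()

# Mean weight of players with the same height (NaN is skipped);
# fall back to the overall mean if nobody else shares that height.
by_height = toy.groupby("Height")["Weight"].transform("mean")
toy["Weight"] = toy["Weight"].fillna(by_height).fillna(toy["Weight"].mean())

# BMI = weight in kg / (height in m) ** 2
toy["BMI"] = toy["Weight"] / (toy["Height"] / 100) ** 2
```

Whether this is better than dropping the row depends on the analysis; the flag column at least lets us check later how much the filled value matters.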

### Let's make an overview of the dataset
First, check the data types of our data:

In [34]:
#your code here
wnba.dtypes

Name             object
Team             object
Pos              object
Height            int64
Weight          float64
BMI             float64
Birth_Place      object
Birthdate        object
Age               int64
College          object
Experience       object
Games Played      int64
MIN               int64
FGM               int64
FGA               int64
FG%             float64
3PM               int64
3PA               int64
3P%             float64
FTM               int64
FTA               int64
FT%             float64
OREB              int64
DREB              int64
REB               int64
AST               int64
STL               int64
BLK               int64
TO                int64
PTS               int64
DD2               int64
TD3               int64
dtype: object

It looks like most of the data types are correct. The Birthdate column could be cast to a `datetime` type; however, we won't use it in our analysis, so for simplicity let's leave it as an `object`. The Weight column could also be cast to an `int64` type, as all its values are whole numbers.
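For reference, if we did want to convert Birthdate, a minimal sketch could look like this. The sample strings are copied from the `head()` output; the explicit format string is an assumption that every row follows the same "Month day, year" pattern with English month names.

```python
import pandas as pd

# A few Birthdate strings in the same shape as the dataset's column.
birthdates = pd.Series(["January 17, 1994", "May 14, 1982", "August 5, 1994"])

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(birthdates, format="%B %d, %Y")

# Once parsed, date parts are available through the .dt accessor.
birth_years = parsed.dt.year
```

Passing an explicit `format` makes the conversion fail loudly on rows that don't match, instead of silently guessing.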

**Let's change the type of Weight column for practice.**

In [49]:
#your code here
# I think, but I'm not sure, that this is the best way:
wnba["Weight"] = pd.to_numeric(wnba["Weight"], errors = "coerce", downcast = "integer")

# Also, I think it might make sense to change the 'R's into 0 ('R', I think, means Rookie, so, 
# someone that has less than a year as a player), and then converting the dtype of this col to
# int as well, but I'll leave it as is (since there's no indication to do this in this 
# exercise).
print("Unique Experience Entries:", wnba["Experience"].unique())

wnba["Weight"] # note the dtype

Unique Experience Entries: ['2' '12' '4' '6' 'R' '8' '5' '3' '1' '9' '10' '11' '7' '13' '14' '15']


0      71
1      73
2      69
3      84
4      78
       ..
138    70
139    84
140    69
141    84
142    59
Name: Weight, Length: 142, dtype: int8
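The Experience idea mentioned in the code comments above can be sketched as follows. It is not part of the exercise, and it rests on the assumption that `'R'` means rookie and can be read as 0 years of experience.

```python
import pandas as pd

# Sample values in the same shape as the Experience column.
experience = pd.Series(["2", "12", "R", "4", "R"])

# Assumption: 'R' stands for rookie, i.e. zero completed seasons.
experience_years = experience.replace("R", "0").astype(int)
```

After this conversion the column can be used in numeric summaries such as `describe()`.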

**After checking the data types, let's check for outliers using the describe() method.**

In [51]:
#your code here
round(wnba.describe(), 2)

Unnamed: 0,Height,Weight,BMI,Age,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0
mean,184.61,78.98,23.09,27.11,24.43,500.11,74.4,168.7,43.1,14.83,43.7,24.98,39.54,49.42,75.83,22.06,61.59,83.65,44.51,17.73,9.78,32.29,203.17,1.14,0.01
std,8.7,11.0,2.07,3.67,7.08,289.37,55.98,117.17,9.86,17.37,46.16,18.46,36.74,44.24,18.54,21.52,49.67,68.2,41.49,13.41,12.54,21.45,153.03,2.91,0.08
min,165.0,55.0,18.39,21.0,2.0,12.0,1.0,3.0,16.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0
25%,175.75,71.5,21.79,24.0,22.0,242.25,27.0,69.0,37.12,0.0,3.0,0.0,13.0,17.25,71.58,7.0,26.0,34.25,11.25,7.0,2.0,14.0,77.25,0.0,0.0
50%,185.0,79.0,22.87,27.0,27.5,506.0,69.0,152.5,42.05,10.5,32.0,30.55,29.0,35.5,80.0,13.0,50.0,62.5,34.0,15.0,5.0,28.0,181.0,0.0,0.0
75%,191.0,86.0,24.18,30.0,29.0,752.5,105.0,244.75,48.63,22.0,65.5,36.18,53.25,66.5,85.92,31.0,84.0,116.5,66.75,27.5,12.0,48.0,277.75,1.0,0.0
max,206.0,113.0,31.56,36.0,32.0,1018.0,227.0,509.0,100.0,88.0,225.0,100.0,168.0,186.0,100.0,113.0,226.0,334.0,206.0,63.0,64.0,87.0,584.0,17.0,1.0


**Comment on your result. What do you see?**

In [20]:
#your answer here
"""
Games Played min: outlier?
MIN min and max: outliers?
FGM min and max: outliers?
FGA min and max: outliers?
FG% max: outlier?
3PM max: outlier?
3PA max: outlier?
3P% max: outlier?
FTM max: outlier?
FTA max: outlier?
FT% min: outlier?
OREB max: outlier?
DREB min and max: outlier?
REB min and max: outliers?
AST max: outliers?
STL max: outliers?
BLK max: outliers?
TO max: outliers?
PTS min and max: outliers?
DD2 max: outliers? (almost definitely)
TD3 max: outliers? -> almost definitely, though I'm not sure I should interpret
this situation like this.

It's also interesting to compare the mean with the 50 percentile and get a feel for the 
skewness of the data, depending on the variable (if I'm thinking correctly).
"""

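One way to turn the "outlier?" questions above into something checkable is the common 1.5 × IQR rule, sketched below on toy values. This is a swapped-in rule of thumb, not something the exercise prescribes, and `iqr_outliers` is a hypothetical helper. Applied to the `describe()` output, DD2 has a 25% quantile of 0 and a 75% quantile of 1, so the upper fence would be 2.5 and the maximum of 17 falls well outside it.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Toy values shaped like a count column with one extreme entry.
dd2 = pd.Series([0, 0, 0, 0, 1, 1, 2, 17])

flagged = iqr_outliers(dd2)

# A mean well above the median hints at a right-skewed distribution.
right_skewed = dd2.mean() > dd2.median()
```

A flagged value is only a candidate: a star player with many double-doubles is real data, not an error.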
**Now we can save the cleaned data to a new .csv file called `wnba_clean.csv` in the data folder.**

In [52]:
#your code here
# index=False keeps the row index from being written as an extra column
wnba.to_csv("../data/wnba_clean.csv", index=False)
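One detail worth knowing when saving: by default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below illustrates this with a toy DataFrame and in-memory buffers instead of real files.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Name": ["A", "B"], "PTS": [93, 217]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# With index=False only the actual columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()
```

If the index carries no information, as with a plain row counter, leaving it out keeps the cleaned file tidy.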