# Inferential statistics
## Part I - Data Cleaning

Your family is very passionate about basketball. You always have discussions over players, games, statistics and whatnot. As you can imagine, those discussions never reach a conclusion, since everyone is simply sharing opinions with no statistics to back them up!

![](../images/basket.jpg)

Since you are attending a data analysis bootcamp you'd like to take advantage of your newfound knowledge to finally put an end to your family's discussions. 

Luckily we have found a dataset containing data related to the players of the WNBA for the 2016-2017 season that we can use. 

Let's start with cleaning the data and then we'll continue with a general exploratory analysis and some inferential statistics.

### Dataset

The dataset we will be using contains the statistics from the WNBA players for the 2016-2017 season. You will be able to find more information on the dataset in the [codebook](../data/codebook.md) uploaded to the repository.

### Libraries

First we'll import the necessary libraries and increase the maximum number of displayed columns so you can see the whole dataset in the same window.

In [1]:
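import pandas as pd
#the full option name is needed: the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)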

### Load the dataset

Load the dataset into a df called `wnba` and take an initial look at it using the `head()` method.

In [7]:
#your code here
wnba = pd.read_csv('C:\\Users\\guilh\\M2-mini-project2\\data\\wnba.csv')
wnba.head()

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0


In [107]:
#gets a random sample from the dataset with 30 rows
wnba.sample(30)

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
7,Allie Quigley,CHI,G,178,64,20.19947,US,"June 20, 1986",31,DePaul,8,26,847,166,319,52.0,70,150,46.7,40,46,87.0,9,83,92,95,20,13,59,442,0,0
65,Jessica Breland,CHI,F,191,77,21.106878,US,"February 23, 1988",29,North Carolina,5,10,78,9,16,56.3,0,0,0.0,4,5,80.0,5,13,18,2,1,9,3,22,0,0
78,Keisha Hampton,CHI,F,185,78,22.790358,US,"February 22, 1990",27,DePaul,1,30,504,64,157,40.8,14,52,26.9,65,81,80.2,36,59,95,24,20,7,21,207,0,0
31,Carolyn Swords,SEA,C,198,95,24.232221,US,"July 19, 1989",28,Boston College,6,26,218,19,39,48.7,0,0,0.0,16,20,80.0,10,29,39,9,5,4,22,54,0,0
126,Stephanie Talbot,PHO,G,185,87,25.420015,AU,"December 20, 1990",26,Australia,R,30,555,47,114,41.2,15,38,39.5,29,44,65.9,28,58,86,50,22,8,28,138,0,0
111,Renee Montgomery,MIN,G,170,63,21.799308,US,"February 12, 1986",31,Connecticut,9,29,614,71,181,39.2,30,89,33.7,44,51,86.3,12,34,46,96,24,1,43,216,0,0
13,Amber Harris,CHI,F,196,88,22.907122,US,"January 16, 1988",29,Xavier,3,22,146,18,44,40.9,0,10,0.0,5,8,62.5,12,28,40,5,3,9,6,41,0,0
58,Isabelle Harrison,SAN,C,191,83,22.751569,US,"September 27, 1993",23,Kentucky,3,31,832,154,300,51.3,1,2,50.0,55,85,64.7,66,134,200,46,26,24,63,364,5,0
135,Theresa Plaisance,DAL,F,196,91,23.688047,US,"May 18, 1992",25,LSU,4,30,604,80,213,37.6,35,101,34.7,22,24,91.7,38,89,127,24,23,22,24,217,1,0
10,Alysha Clark,SEA,F,180,76,23.45679,US,"July 7, 1987",30,Middle Tennessee,6,30,843,93,183,50.8,20,62,32.3,38,51,74.5,29,97,126,50,22,4,32,244,0,0


### Check NaN values
As you know, one of our first steps is to check if there are any NaN values in the dataset to find any issues. Look for the columns that contain NaN values and count how many rows there are with that value.

In [8]:
#your code here
#this checks if there are any null values
wnba.isnull().values.any()

True

In [14]:
#this checks which columns have null values
wnba.columns[wnba.isnull().any()].tolist()

['Weight', 'BMI']
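The exercise also asks for a count of the NaN values. A minimal sketch of one way to get it, using a small made-up frame (`toy` and its values are hypothetical stand-ins for `wnba`, not data from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `wnba`; values are illustrative only
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Height": [183, 178, 170],
    "Weight": [71.0, np.nan, 69.0],
    "BMI": [21.2, np.nan, 23.9],
})

# Number of NaN values in each column
nan_per_column = toy.isnull().sum()

# Keep only the columns that actually contain NaN values
nan_columns = nan_per_column[nan_per_column > 0]

# Number of rows that contain at least one NaN
rows_with_nan = int(toy.isnull().any(axis=1).sum())

print(nan_columns)
print(rows_with_nan)
```

Running the same three expressions on `wnba` should give the per-column counts and the number of affected rows.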

We can see that there are only two NaNs in the whole dataset, one in the Weight column and one in the BMI column. Let's look at the actual rows that contain the NaN values.

In [33]:
#your code here
wnba[['Weight', 'BMI']]



Unnamed: 0,Weight,BMI
0,71.0,21.200991
1,73.0,21.329438
2,69.0,23.875433
3,84.0,24.543462
4,78.0,25.469388
...,...,...
138,70.0,22.093170
139,84.0,23.025685
140,69.0,22.530612
141,84.0,22.550941


In [28]:
#this shows only the rows that have null values 
wnba[wnba.isnull().any(axis=1)] 

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
91,Makayla Epps,CHI,G,178,,,US,"June 6, 1995",22,Kentucky,R,14,52,2,14,14.3,0,5,0.0,2,5,40.0,2,0,2,4,1,0,4,6,0,0


It looks like there is only a single row that has NaN values in it, which is good! Just in case, let's check how much removing a single row may influence our dataset by calculating the percentage of values we will be removing.

In [34]:
#your code here
#counts the number of columns
len(wnba.columns)

32

In [35]:
#counts the number of rows in the dataframe
len(wnba)

143

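In [44]:
#percentage of rows that contain at least one NaN, i.e. the share of rows we would remove
a = (wnba.isnull().any(axis=1).sum()/len(wnba))*100

In [50]:
#this will print the percentage rounded to the first decimal
print(f'By removing a row we will be removing {a:.1f} % of the rows')

By removing a row we will be removing 0.7 % of the rows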
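As a cross-check, the share of rows that contain at least one NaN can be read directly from a boolean mask, since the mean of a boolean Series is the fraction of `True` values. The sketch below uses a hypothetical four-row frame (`toy`, with made-up values), so its result is not the figure for `wnba`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; one of its four rows has missing values
toy = pd.DataFrame({
    "Weight": [71.0, np.nan, 69.0, 84.0],
    "BMI": [21.2, np.nan, 23.9, 24.5],
})

# True for every row that contains at least one NaN
has_nan = toy.isnull().any(axis=1)

# The mean of a boolean mask is the fraction of True values
pct_rows_with_nan = has_nan.mean() * 100

print(f"{pct_rows_with_nan:.1f} % of rows contain a NaN")
```

The denominator here is the number of rows, not the number of columns, which is what matters when deciding whether dropping rows is acceptable.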


It is very important to be as careful as possible when dealing with NaN values and only drop data when it is strictly necessary. This decision can also be influenced by the nature of our analysis. If, for example, our analysis does not require the Weight and BMI of the players at all, we can simply keep the row, given that the NaN values are only present in the Weight and BMI columns.

In this specific example, let's say our decision is to drop it. Write some code to drop the NaN values. 

In [56]:
#your code here
#drops the columns Weight and BMI; with inplace=False this returns a new dataframe and leaves wnba unchanged
wnba.drop(columns=['Weight', 'BMI'], inplace=False)

Unnamed: 0,Name,Team,Pos,Height,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,Tiffany Hayes,ATL,G,178,US,"September 20, 1989",27,Connecticut,6,29,861,144,331,43.5,43,112,38.4,136,161,84.5,28,89,117,69,37,8,50,467,0,0
139,Tiffany Jackson,LA,F,191,US,"April 26, 1985",32,Texas,9,22,127,12,25,48.0,0,1,0.0,4,6,66.7,5,18,23,3,1,3,8,28,0,0
140,Tiffany Mitchell,IND,G,175,US,"September 23, 1984",32,South Carolina,2,27,671,83,238,34.9,17,69,24.6,94,102,92.2,16,70,86,39,31,5,40,277,0,0
141,Tina Charles,NY,F/C,193,US,"May 12, 1988",29,Connecticut,8,29,952,227,509,44.6,18,56,32.1,110,135,81.5,56,212,268,75,21,22,71,582,11,0
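The cell above removes whole columns. If the goal is instead to remove only the rows that contain NaN values, `dropna()` is one option. A minimal sketch on a hypothetical toy frame (`toy` and `toy_clean` are illustrative names, and the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `wnba`
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Weight": [71.0, np.nan, 69.0],
    "BMI": [21.2, np.nan, 23.9],
})

# dropna() returns a new frame, so the result has to be assigned;
# `subset` limits the NaN check to the listed columns
toy_clean = toy.dropna(subset=["Weight", "BMI"]).reset_index(drop=True)

print(len(toy), "->", len(toy_clean))
```

This keeps every column and loses only the affected rows, at the cost of discarding those players' other statistics.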


**Do you think it is a good decision? Think about a case in which you wouldn't want to drop the value.**

In [13]:
#your answer here
"""BMI (body mass index) is computed from height and weight. For example, we may want to keep BMI
to check whether it is related to the number of games played.
"""

### Let's make an overview of the dataset
First, check the data types of our data:

In [57]:
#your code here
#checks the types of all the columns
wnba.dtypes

Name             object
Team             object
Pos              object
Height            int64
Weight          float64
BMI             float64
Birth_Place      object
Birthdate        object
Age               int64
College          object
Experience       object
Games Played      int64
MIN               int64
FGM               int64
FGA               int64
FG%             float64
3PM               int64
3PA               int64
3P%             float64
FTM               int64
FTA               int64
FT%             float64
OREB              int64
DREB              int64
REB               int64
AST               int64
STL               int64
BLK               int64
TO                int64
PTS               int64
DD2               int64
TD3               int64
dtype: object

It looks like most of the data types are correct. The Birthdate column could be cast to a `datetime` type; however, we won't use it in our analysis, so for simplicity let's leave it as an `object`. The Weight column could also be cast to an `int64` type, as all its values are whole numbers.
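For reference, a minimal sketch of what the Birthdate cast could look like, using three birthdate strings from the rows shown earlier. The explicit `format` string is an assumption based on those visible values, and parsing month names with `%B` assumes an English locale:

```python
import pandas as pd

# Birthdate strings as they appear in the first rows of the dataset
birthdates = pd.Series(
    ["January 17, 1994", "May 14, 1982", "August 5, 1994"],
    name="Birthdate",
)

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(birthdates, format="%B %d, %Y")

print(parsed.dt.year.tolist())
```

Once parsed, the `.dt` accessor exposes components such as year or month, which would be needed for any age-related calculation.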

**Let's change the type of the Weight column for practice.**

In [79]:
#your code here


#fills the NaN values in Weight and BMI with 0; this returns a copy and does not change wnba,
#so calling e.g. wnba['Weight'].head(92) would still show the NaN value
#wnba_filled = (wnba[['Weight', 'BMI']].fillna(0))

#just to check if there are still null values
#wnba_filled.isnull().values.any()

#changes the type of column Weight to int64 and returns a new dataframe without modifying the original df
#wnba.astype({'Weight':'int64'})

False

In [111]:
#assigning the result back changes the dataframe; fillna(inplace=True) on a single column may not update wnba in recent pandas versions
wnba['Weight'] = wnba['Weight'].fillna(0)
wnba['BMI'] = wnba['BMI'].fillna(0)

In [91]:
wnba['Weight'].head(92)

0      71.0
1      73.0
2      69.0
3      84.0
4      78.0
      ...  
87     65.0
88     78.0
89    104.0
90     90.0
91      0.0
Name: Weight, Length: 92, dtype: float64

In [103]:
#changes the type of column Weight to int64 (the string alias avoids needing a numpy import)
wnba['Weight'] = wnba['Weight'].astype('int64')

#shows the types of all the columns to confirm the change
wnba.dtypes

Name             object
Team             object
Pos              object
Height            int64
Weight            int64
BMI             float64
Birth_Place      object
Birthdate        object
Age               int64
College          object
Experience       object
Games Played      int64
MIN               int64
FGM               int64
FGA               int64
FG%             float64
3PM               int64
3PA               int64
3P%             float64
FTM               int64
FTA               int64
FT%             float64
OREB              int64
DREB              int64
REB               int64
AST               int64
STL               int64
BLK               int64
TO                int64
PTS               int64
DD2               int64
TD3               int64
dtype: object
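Filling the NaN with 0 makes the cast to `int64` possible, but it also introduces an artificial weight of 0. A possible alternative, not used in this notebook, is pandas' nullable integer dtype `Int64` (capital I), which can hold integers and missing values together. A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value, mirroring the situation in `wnba`
weights = pd.Series([71.0, np.nan, 69.0], name="Weight")

# "Int64" is the nullable integer dtype; the NaN becomes <NA> instead of
# raising an error or having to be replaced with a placeholder such as 0
weights_int = weights.astype("Int64")

print(weights_int)
```

With this dtype, summary statistics skip the missing value instead of treating it as a real weight of 0.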

**After checking the data types, let's check for outliers using the describe() method.**

In [104]:
#your code here
wnba.describe()

Unnamed: 0,Height,Weight,BMI,Age,Games Played,MIN,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
count,143.0,143.0,142.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0
mean,184.566434,78.426573,23.091214,27.076923,24.356643,496.972028,73.895105,167.622378,42.901399,14.727273,43.426573,24.803497,39.272727,49.111888,75.578322,21.923077,61.160839,83.083916,44.230769,17.608392,9.713287,32.090909,201.79021,1.132867,0.006993
std,8.685068,12.793864,2.073691,3.67917,7.104259,290.77732,56.110895,117.467095,10.111498,17.355919,46.106199,18.512183,36.747747,44.244854,18.712194,21.509276,49.761919,68.302197,41.483017,13.438978,12.520193,21.502017,153.381548,2.90031,0.083624
min,165.0,0.0,18.390675,21.0,2.0,12.0,1.0,3.0,14.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0
25%,176.5,71.0,21.785876,24.0,22.0,240.0,26.0,66.0,36.95,0.0,3.0,0.0,12.0,16.5,71.15,7.0,25.5,34.0,11.0,7.0,2.0,13.5,75.0,0.0,0.0
50%,185.0,79.0,22.873314,27.0,27.0,504.0,69.0,152.0,42.0,10.0,32.0,30.3,29.0,35.0,80.0,13.0,50.0,62.0,33.0,15.0,5.0,28.0,177.0,0.0,0.0
75%,191.0,86.0,24.180715,30.0,29.0,750.0,105.0,244.5,48.55,22.0,65.0,36.15,52.5,66.0,85.85,31.0,84.0,116.0,66.5,27.0,12.0,48.0,277.5,1.0,0.0
max,206.0,113.0,31.55588,36.0,32.0,1018.0,227.0,509.0,100.0,88.0,225.0,100.0,168.0,186.0,100.0,113.0,226.0,334.0,206.0,63.0,64.0,87.0,584.0,17.0,1.0


**Comment on your result. What do you see?**

In [20]:
#your answer here
"""The BMI column has one fewer element (142) than the other columns (143). The Weight column has an outlier,
a minimum of 0, but that is due to the NaN value that was replaced with zero."""
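Reading `describe()` by eye works for a quick look. One common rule of thumb for flagging outliers numerically, not part of the original exercise, is the 1.5 × IQR rule. A minimal sketch on hypothetical weights that include an artificial 0:

```python
import pandas as pd

# Hypothetical weights; the 0 mimics a NaN that was filled with zero
weights = pd.Series([71, 73, 69, 84, 78, 0], name="Weight")

# Interquartile range from the same quartiles that describe() reports
q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = weights[(weights < lower) | (weights > upper)]

print(outliers.tolist())
```

The 1.5 multiplier is a convention rather than a strict test, so flagged values still need a judgment call about whether they are errors or genuine extremes.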

**Now we can save the cleaned data to a new .csv file called `wnba_clean.csv` in the data folder.**

In [110]:
#your code here
#saves the dataframe wnba to a csv file, if the file doesn't exist it will create one
#index=False avoids writing the row index as an extra column
wnba.to_csv('C:\\Users\\guilh\\M2-mini-project2\\data\\wnba_clean.csv', index=False)
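A quick way to verify a saved file is to read it back. The sketch below writes a hypothetical toy frame to a temporary folder (the file name `toy_clean.csv` is illustrative) and compares what `read_csv` returns with and without the index:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned `wnba`
toy = pd.DataFrame({"Name": ["A", "B"], "Weight": [71, 69]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_clean.csv")

    # Without the index, the file holds only the real columns
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # With the default index=True, the index comes back as an extra column
    toy.to_csv(path)
    reloaded_with_index = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

The second list contains an extra `Unnamed: 0` column, which is the saved index being read back as data.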