Think of Python as a builder. A builder has a mental skillset but often needs to complement it with physical tools, for example to measure a slab of concrete or to drive nails into a frame. This is where **modules** come in.

**Modules** are like toolboxes: they give the data scientist the flexibility and power of calling pre-written code (functions, classes and their methods) instead of writing everything from scratch.

In [6]:
# Import modules
import numpy as np
import pandas as pd

By default, pandas limits how many columns are displayed, so Jupyter will not show the entire width of a wide dataframe. I like to turn this limit off unless the dataframe has a very large number of columns.

In [22]:
# Turn off max columns so we can view the entire dataframe width
pd.set_option('display.max_columns', None)
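As a rough sketch of how this display option behaves, the snippet below reads the current limit, removes it, and then uses `pd.option_context` to apply a limit only temporarily. The variable names are my own and the value `10` is an arbitrary choice for illustration.

```python
import pandas as pd

# Remember the current limit so it can be restored at the end
original_limit = pd.get_option('display.max_columns')

# None means "no limit": every column is displayed
pd.set_option('display.max_columns', None)
unlimited = pd.get_option('display.max_columns')

# option_context applies a setting only inside the with-block
with pd.option_context('display.max_columns', 10):
    limit_inside = pd.get_option('display.max_columns')
limit_after = pd.get_option('display.max_columns')

# Put the original setting back
pd.set_option('display.max_columns', original_limit)
```

Using `pd.option_context` is handy when only one particular dataframe needs a different setting.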

In [20]:
# Read our data into the variable 'df'
df = pd.read_csv('all_seasons.csv')

Having imported the necessary tools for our analysis and loaded our data, let's take a high-level look at our dataset.

I downloaded this dataset from Kaggle and I know that it's pretty usable already. More often, though, our data won't be so readily usable, for example if we have scraped it from a webpage or pulled it from an API.

Let's begin by viewing the first five rows of the dataset by calling the ".head()" method on our dataframe variable "df". This is the third time we have used pandas. The first was when we turned off max columns with "pd.set_option()" and the second was when we read our csv file into the notebook with "pd.read_csv()".

If you look at the top, where we imported the modules/libraries, you'll notice we assigned an alias to both NumPy and pandas. This saves us time whenever we call their functions from there on. Instead of having to write "pandas.set_option()" or "numpy.mean()", we can use the aliases and write "pd.set_option()" and "np.mean()".
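To make the alias idea concrete, here is a minimal sketch showing that the alias and the full module name point at the same object. The list `values` is made up for illustration.

```python
import numpy
import numpy as np

values = [1, 2, 3, 4]  # made-up numbers

# The alias and the full name refer to the same module object
same_module = np is numpy

# So both spellings call exactly the same function
mean_full = numpy.mean(values)
mean_alias = np.mean(values)

print(same_module, mean_full, mean_alias)
```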

In [23]:
df.head()

,Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
0,0,Travis Knight,LAL,22.0,213.36,106.59412,Connecticut,USA,1996,1,29,71,4.8,4.5,0.5,6.2,0.127,0.182,0.142,0.536,0.052,1996-97
1,1,Matt Fish,MIA,27.0,210.82,106.59412,North Carolina-Wilmington,USA,1992,2,50,6,0.3,0.8,0.0,-15.1,0.143,0.267,0.265,0.333,0.0,1996-97
2,2,Matt Bullard,HOU,30.0,208.28,106.59412,Iowa,USA,Undrafted,Undrafted,Undrafted,71,4.5,1.6,0.9,0.9,0.016,0.115,0.151,0.535,0.099,1996-97
3,3,Marty Conlon,BOS,29.0,210.82,111.13004,Providence,USA,Undrafted,Undrafted,Undrafted,74,7.8,4.4,1.4,-9.0,0.083,0.152,0.167,0.542,0.101,1996-97
4,4,Martin Muursepp,DAL,22.0,205.74,106.59412,,USA,1996,1,25,42,3.7,1.6,0.5,-14.5,0.109,0.118,0.233,0.482,0.114,1996-97


After viewing the first few rows, I typically like to get a feel for the dimensions of the dataset.

In [26]:
print(f'The dataframe contains {df.shape[0]} rows and {df.shape[1]} columns.')

The dataframe contains 11700 rows and 22 columns.


The same information is available without indexing: "df.shape" on its own returns a (rows, columns) tuple.

In [28]:
df.shape

(11700, 22)

I now normally like to develop a quick TL;DR of my dataset with some summary statistics. This is extremely easy using the ".describe()" method of a pandas dataframe, which by default summarises the numeric columns.

In [30]:
df.describe()

,Unnamed: 0,age,player_height,player_weight,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
count,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0,11700.0
mean,5849.5,27.131966,200.728501,100.526791,51.717179,8.169299,3.564957,1.811179,-2.16641,0.054981,0.141534,0.18538,0.510402,0.131228
std,3377.643409,4.340006,9.169827,12.526481,24.985236,5.956115,2.487498,1.792117,12.076914,0.043595,0.062793,0.052957,0.098306,0.094244
min,0.0,18.0,160.02,60.327736,1.0,0.0,0.0,0.0,-200.0,0.0,0.0,0.0,0.0,0.0
25%,2924.75,24.0,193.04,90.7184,32.0,3.6,1.8,0.6,-6.3,0.021,0.096,0.15,0.479,0.065
50%,5849.5,26.0,200.66,99.79024,58.0,6.7,3.0,1.2,-1.3,0.042,0.132,0.182,0.523,0.103
75%,8774.25,30.0,208.28,108.86208,74.0,11.5,4.7,2.4,3.2,0.084,0.18,0.218,0.559,0.178
max,11699.0,44.0,231.14,163.29312,85.0,36.1,16.3,11.7,300.0,1.0,1.0,1.0,1.5,1.0


Averaging 36.1 points in a season is pretty much unheard of. I know who it was, but for everyone following along at home, let's bring up this record using two more pandas tools in one line of code. First, the ".idxmax()" method returns the index label of the row holding the maximum value in the specified column. We then pass that label to ".iloc[]" (integer location) to pull out the row. This works here because our dataframe has the default 0, 1, 2, ... index, so labels and positions coincide.

The result is the row/record where this value lies.

In [32]:
df.iloc[df['pts'].idxmax()]

Unnamed: 0                   10507
player_name           James Harden
team_abbreviation              HOU
age                           29.0
player_height               195.58
player_weight             99.79024
college              Arizona State
country                        USA
draft_year                    2009
draft_round                      1
draft_number                     3
gp                              78
pts                           36.1
reb                            6.6
ast                            7.5
net_rating                     6.3
oreb_pct                     0.023
dreb_pct                     0.157
usg_pct                      0.396
ts_pct                       0.616
ast_pct                      0.394
season                     2018-19
Name: 10507, dtype: object
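One caveat worth sketching: ".idxmax()" returns an index label, whereas ".iloc[]" expects an integer position. The two only line up when the dataframe has the default index. The toy dataframe below (invented names and numbers, not taken from the dataset) has a non-default index, where ".loc[]" is the safer choice.

```python
import pandas as pd

# Toy data with a non-default index (values invented for illustration)
toy = pd.DataFrame(
    {'player': ['A', 'B', 'C'], 'pts': [10.0, 30.0, 20.0]},
    index=[7, 3, 5],
)

label = toy['pts'].idxmax()    # index label of the highest 'pts' value
top_by_label = toy.loc[label]  # look the row up by its label

# Position-based alternative that is safe to use with .iloc
position = toy['pts'].to_numpy().argmax()
top_by_position = toy.iloc[position]

print(label, top_by_label['player'], top_by_position['player'])
```

Here `toy.iloc[label]` would raise an error, because the label 3 is not a valid position in a three-row dataframe.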

Let's now finish off the data quality checks before diving into the proper EDA using various techniques.

In [34]:
# Check for null values in each series/column, then sum the nulls per column
df.isnull().sum()

Unnamed: 0           0
player_name          0
team_abbreviation    0
age                  0
player_height        0
player_weight        0
college              0
country              0
draft_year           0
draft_round          0
draft_number         0
gp                   0
pts                  0
reb                  0
ast                  0
net_rating           0
oreb_pct             0
dreb_pct             0
usg_pct              0
ts_pct               0
ast_pct              0
season               0
dtype: int64
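Since our dataset reports no nulls, here is a small sketch of what the same check looks like when values are missing. The toy dataframe is invented for illustration: ".isnull()" marks each missing cell as True, and ".sum()" then counts the True values per column.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing values (invented for illustration)
toy = pd.DataFrame({
    'college': ['Iowa', None, 'Providence'],
    'pts': [4.5, np.nan, np.nan],
})

nulls_per_column = toy.isnull().sum()       # one count per column
total_nulls = int(nulls_per_column.sum())   # a single grand total

# Rows that contain at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(nulls_per_column)
print(total_nulls)
```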

# Now the fun begins

Now that we've crossed our T's and dotted our I's to check that our data is of high quality and doesn't need any preprocessing or cleaning, let's go over some EDA techniques to get a better feel for the characteristics of our data.

## Estimates of Location

**Mean:** The sum of all values divided by the number of values.

In [35]:
# With methods
df['reb'].mean()

3.5649572649572736

In [36]:
# Without methods
def manual_mean(values):
    """Return the arithmetic mean of a sequence using a plain loop."""
    total = 0
    count = 0

    # Accumulate the running sum and count one value at a time
    for value in values:
        total += value
        count += 1
    return total / count

manual_mean(df['reb'])

3.5649572649572736

**Trimmed Mean:** Involves sorting the values, removing a fixed, user-specified proportion of them from each tail of the distribution and taking the mean of the remaining values.

This technique helps reduce the impact of outliers and is often preferred to the standard mean when extreme values are present.

In [50]:
# With methods
from scipy.stats import trim_mean

print(f'Without any trim: {trim_mean(df["reb"], 0)}')
print(f'With 1% trim on each end: {trim_mean(df["reb"], 0.01)}')
print(f'With 5% trim on each end: {trim_mean(df["reb"], 0.05)}')
print(f'With 20% trim on each end: {trim_mean(df["reb"], .2)}')

Without any trim: 3.564957264957265
With 1% trim on each end: 3.5056514913657773
With 5% trim on each end: 3.3610161443494775
With 20% trim on each end: 3.1016381766381764


In [53]:
# Without methods
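A minimal sketch of a trimmed mean without SciPy is shown below. The function name `manual_trim_mean` and the sample list are my own. The number of values cut from each end is rounded down, which is how `trim_mean` is documented to behave. Raising an error when nothing would be left is my own choice.

```python
def manual_trim_mean(values, proportion_to_cut):
    """Mean after dropping the given proportion of values from each tail."""
    ordered = sorted(values)
    n = len(ordered)

    # Number of values to drop from each end, rounded down
    n_cut = int(proportion_to_cut * n)
    if 2 * n_cut >= n:
        raise ValueError('proportion_to_cut is too large for this data')

    kept = ordered[n_cut:n - n_cut]
    return sum(kept) / len(kept)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one outlier
print(manual_trim_mean(sample, 0))     # plain mean: 14.5
print(manual_trim_mean(sample, 0.1))   # drops 1 and 100: 5.5
```

Calling `manual_trim_mean(df['reb'], 0.05)` should agree with `trim_mean(df['reb'], 0.05)` up to floating-point rounding.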

**Weighted Mean:** The same as the mean, but every value $x_{i}$ is multiplied by a user-specified weight $w_{i}$ before summing, and the total is divided by the sum of the weights rather than the number of values: $\bar{x}_{w} = \frac{\sum_{i} w_{i} x_{i}}{\sum_{i} w_{i}}$.

This technique is useful when some values are intrinsically more important than others, or when the data collected does not equally represent the different groups we are measuring.

In [55]:
print(f'Using points series as weighting: {np.average(df["reb"], weights=df["pts"])}')

Using points series as weighting: 4.694401595299475
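For comparison, the weighted mean can also be sketched without methods. The function name `manual_weighted_mean` and the small lists are invented for illustration. Note that the total is divided by the sum of the weights.

```python
def manual_weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    weighted_total = 0
    weight_total = 0

    for value, weight in zip(values, weights):
        weighted_total += value * weight
        weight_total += weight
    return weighted_total / weight_total

rebounds = [2.0, 4.0, 10.0]  # made-up values
points = [1.0, 1.0, 2.0]     # made-up weights

print(manual_weighted_mean(rebounds, points))  # (2 + 4 + 20) / 4 = 6.5
```

With equal weights the result falls back to the ordinary mean, which is a quick sanity check on the function.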
