# NBA: What's changed?

<img src="https://media.giphy.com/media/l0MYHiW8ozFLda6ze/giphy.gif">

Welcome to the game of basketball.
This dataset covers the last twenty years of major basketball leagues around the world. It contains the following columns:
* League
* Season
* Season Stage
* Player Name
* Games Played
* Total Minutes
* Field Goals Made
* Field Goal Attempts
* Three-Pointers Made
* Three-Pointer Attempts
* Free Throws Made
* Free Throw Attempts
* Turnovers
* Personal Fouls
* Offensive Rebounds
* Defensive Rebounds
* Total Rebounds
* Total Assists
* Total Steals
* Total Blocks
* Total Points
* Player Birth Year
* Player Birth Month
* Player Birth Date
* Player Height in feet
* Player Height in cm
* Player Weight in lbs
* Player Weight in kg
* Player Nationality
* Player High School
* Player Draft Round
* Player Draft Team

* I should say this at the beginning: I will not focus on a specific player. My goal is to analyze the differences between today and twenty years ago.
* I will only do my analysis on the NBA.
* This data includes regular seasons, playoffs and international games. **I will only look at the regular season in my analysis** because the playoffs play out differently. You do not see the evolution of basketball as clearly in the playoffs, where teams adapt to each other in every round. *You beat your opponent, good for you, now get ready for the next team.* That is the mindset. Match-ups are played according to the opponent, so everything changes: shot selection, playing style, pace and so on.
* Finally, the data does not have *per game* stats, so I will create them from the other columns. I mostly want to look at *points per game*, *minutes per game*, *three-pointers made and attempted per game* and *turnovers per game*. Since the data is a little unbalanced across seasons, per game stats are the fairer comparison anyway.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/basketball-players-stats-per-season-49-leagues/players_stats_by_season_full_details.csv')
print(df.shape)
df.head()

In [None]:
df.columns

In [None]:
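# Keep only the columns used below; .copy() avoids SettingWithCopyWarning
# when new columns are added later.
df = df[['League', 'Season', 'Stage', 'GP', 'MIN', 'FGM',
         'FGA', '3PM', '3PA', 'TOV', 'PF', 'PTS', 'birth_year',
         'height_cm', 'weight_kg', 'nationality']].copy()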

In [None]:
df['Season'].value_counts()

I will do all of the preprocessing on the full dataset because I do not want to increase my workload afterwards. If I split the dataset into NBA and then into different seasons first, I would need to repeat these steps for every single data frame I created.

First, I will create my new columns, *per game* stats.
Then I will fill or drop the NaN values and start with my analysis.

In [None]:
#Points per game
df['PPG'] = df['PTS'] / df['GP']
#Minutes per game
df['MPG'] = df['MIN'] / df['GP']
#Turnover per game
df['TOVPG'] = df['TOV'] / df['GP']
#3P Attempt per game
df['3PAPG'] = df['3PA'] / df['GP']
#3P Made per game
df['3PMPG'] = df['3PM'] / df['GP']
#Field Goal Attempt per game
df['FGAPG'] = df['FGA'] / df['GP']
#Field Goal Made per game
df['FGMPG'] = df['FGM'] / df['GP']
#Fouls per game
df['FPG'] = df['PF'] / df['GP']
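
One thing the divisions above do not guard against is a row with zero games played, which would produce `inf` or `NaN`. Below is a minimal sketch of a helper that handles this. The function name `add_per_game`, the `_per_game` suffix and the toy numbers are my own assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy season totals; the column names mirror the dataset, the numbers are made up.
toy = pd.DataFrame({
    "GP":  [82, 41, 0],
    "PTS": [1640, 410, 0],
    "MIN": [2870, 820, 0],
})

def add_per_game(frame, totals):
    """Return a copy with one '<total>_per_game' column per name in `totals`."""
    out = frame.copy()
    # Zero games becomes NaN, so the division yields NaN instead of inf.
    games = out["GP"].replace(0, np.nan)
    for col in totals:
        out[col + "_per_game"] = out[col] / games
    return out

toy_pg = add_per_game(toy, ["PTS", "MIN"])
print(toy_pg)
```

Rows with `NaN` per game values can then be dropped or inspected separately, depending on what the analysis needs.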

In [None]:
df.columns

In [None]:
df = df.drop(['MIN', 'FGM', 'FGA', '3PM', '3PA', 'TOV', 'PF', 'PTS'], axis=1)

We have some missing values in birth years, weight and height. I will fill them with their means.

In [None]:
df['birth_year'] = df['birth_year'].fillna(df['birth_year'].mean())
df['weight_kg'] = df['weight_kg'].fillna(df['weight_kg'].mean())
df['height_cm'] = df['height_cm'].fillna(df['height_cm'].mean())
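
To see what mean filling does, here is a small sketch on made-up data that counts the missing values before and after. The toy frame and the variable names `missing_before` and `missing_after` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up heights and weights with one gap in each column.
toy = pd.DataFrame({
    "height_cm": [198.0, np.nan, 210.0],
    "weight_kg": [95.0, 110.0, np.nan],
})

cols = ["height_cm", "weight_kg"]
missing_before = toy[cols].isna().sum()

# fillna with a Series fills each column with its own mean.
toy[cols] = toy[cols].fillna(toy[cols].mean())
missing_after = toy[cols].isna().sum()

print(missing_before, missing_after, toy, sep="\n\n")
```

Keep in mind that mean filling pulls the distribution toward its center, so with many gaps the histograms would look narrower than they really are.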

In [None]:
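# Age at the start of the season: season start year minus birth year.
# Vectorized so that every row is covered; a loop over range(0, len(df)-1)
# would skip the last row and leave its age at 0.
df['age'] = df['Season'].str[:4].astype(int) - df['birth_year'].astype(int)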

Also, since we are talking about the NBA, I will group player nationality into American and Non-American.

In [None]:
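# Keep 'United States' and relabel everything else, for every row at once.
df['nationality'] = df['nationality'].where(df['nationality'] == 'United States', 'Non American')

In [None]:
df['nationality'].value_counts()

A row-by-row loop over `range(0, len(df)-1)` stops one row early, which is most likely how a single player from Ukraine could stay loyal to his country and slip through unchanged. The vectorized version above covers every row; the `replace` below stays as a harmless safeguard.

In [None]:
df['nationality'] = df['nationality'].replace('Ukraine', 'Non American')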

In [None]:
df = df.drop(['birth_year'], axis=1)

In [None]:
df.head()

Now the data has no missing values and contains all the information I need.

I will divide the NBA seasons into three eras. I believe the seasons within each era are similar to each other:
* 2000s
* 2010 - 2015
* 2015 - 2020

We extract the NBA and its Regular Seasons from the data.

In [None]:
df = df.loc[df['League'] == 'NBA']
df = df.loc[df['Stage'] == 'Regular_Season']

In [None]:
df['Season'].value_counts()

In [None]:
# Season labels look like '1999 - 2000'; build the list for each era.
seasons_2000 = [f'{y} - {y + 1}' for y in range(1999, 2010)]  # 1999-00 to 2009-10
seasons_2015 = [f'{y} - {y + 1}' for y in range(2010, 2015)]  # 2010-11 to 2014-15
seasons_2020 = [f'{y} - {y + 1}' for y in range(2015, 2020)]  # 2015-16 to 2019-20

df_2000 = df.loc[df['Season'].isin(seasons_2000)]
df_2015 = df.loc[df['Season'].isin(seasons_2015)]
df_2020 = df.loc[df['Season'].isin(seasons_2020)]
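
As an alternative to three separate data frames, the era could also be stored as a column. This is only a sketch on made-up rows: the `era` column name and the bin edges are my own choice, picked so that the split matches the one above.

```python
import pandas as pd

# Season labels in the same 'YYYY - YYYY' format as the dataset.
toy = pd.DataFrame({"Season": ["1999 - 2000", "2009 - 2010", "2010 - 2011",
                               "2014 - 2015", "2015 - 2016", "2019 - 2020"]})

# The first four characters are the starting year of the season.
start_year = toy["Season"].str[:4].astype(int)

# Bins are right-inclusive: (1998, 2009], (2009, 2014], (2014, 2019].
toy["era"] = pd.cut(start_year, bins=[1998, 2009, 2014, 2019],
                    labels=["2000s", "2010-2015", "2015-2020"])
print(toy)
```

With an `era` column, per-era summaries become a single `groupby("era")` call instead of three repeated lines.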

# PLAYER NATIONALITY PIE

In [None]:
plt.rcParams["figure.figsize"] = (20,5)
fig, axs = plt.subplots(1,3)
# value_counts() sorts by frequency, so fix the order to keep labels and slices aligned.
order = ['United States', 'Non American']
labels = 'American', 'Non-American'
explode = (0, 0.1)
fig.suptitle('Nationalities in 2000s(I), 2010-2015(II) and 2015-2020(III)')
for ax, era in zip(axs, (df_2000, df_2015, df_2020)):
    counts = era['nationality'].value_counts().reindex(order, fill_value=0)
    ax.pie(counts, labels=labels, explode=explode, autopct='%1.1f%%');
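
The percentages printed on the pies can also be read as a table. Here is a minimal sketch with made-up rows; the `era` column is an assumption (the notebook keeps the eras in separate data frames) and the numbers are not real results.

```python
import pandas as pd

# Made-up rows: one per player-season, with an era label and a nationality group.
toy = pd.DataFrame({
    "era": ["2000s"] * 4 + ["2015-2020"] * 4,
    "nationality": ["United States"] * 3 + ["Non American"]
                   + ["United States"] * 2 + ["Non American"] * 2,
})

# normalize="index" turns each row (era) into shares that sum to 1.
shares = pd.crosstab(toy["era"], toy["nationality"], normalize="index") * 100
print(shares.round(1))
```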

# PLAYERS' HEIGHTS AND WEIGHTS

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['height_cm'], label='2015-2020')
sns.distplot(df_2015['height_cm'], label='2010-2015')
sns.distplot(df_2000['height_cm'], label='2000-2010')
plt.legend()

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['weight_kg'], label='2015-2020')
sns.distplot(df_2015['weight_kg'], label='2010-2015')
sns.distplot(df_2000['weight_kg'], label='2000-2010')
plt.legend()

* Not going to lie, I thought the height and weight differences between the eras would be huge, but they are not. Nowadays you do not see many big men in the game, but I guess they were always rare.
* Although I should say that the 2000s have the maximum values in both the height and weight departments.
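
Reading maximum values off overlapping density plots is hard, so a small summary table can back this up. The sketch below uses made-up numbers and an assumed `era` column; it only shows the shape of the check, not the real result.

```python
import pandas as pd

# Made-up heights and weights, two players per era.
toy = pd.DataFrame({
    "era": ["2000s", "2000s", "2010-2015", "2010-2015", "2015-2020", "2015-2020"],
    "height_cm": [229.0, 198.0, 221.0, 196.0, 224.0, 193.0],
    "weight_kg": [147.0, 98.0, 131.0, 95.0, 132.0, 92.0],
})

# Mean and max per era; the result has (column, statistic) as column labels.
summary = toy.groupby("era")[["height_cm", "weight_kg"]].agg(["mean", "max"])
print(summary)
```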

# AGE OF THE PLAYERS

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['age'], label='2015-2020')
sns.distplot(df_2015['age'], label='2010-2015')
sns.distplot(df_2000['age'], label='2000-2010')
plt.legend()

* The 2000s and 2010-2015 have very similar graphs here.
* 2015-2020 looks different from those two eras. It has more young players, and it also has more old players than the other two eras.
* I cannot explain exactly why there are more young players, maybe better scouting through social media? BUT I can explain the old players: technology, resources and awareness. Huh, and LeBron James.
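
The "more young and more old players" reading can be checked by computing the share of each group per era. This is a sketch on made-up ages; the cut-offs (21 or younger, 35 or older) and the column names `young` and `veteran` are my own assumptions.

```python
import pandas as pd

# Made-up ages, four players per era.
toy = pd.DataFrame({
    "era": ["2000s"] * 4 + ["2015-2020"] * 4,
    "age": [20, 25, 28, 30, 19, 20, 27, 36],
})

# The mean of a boolean column is the share of True values.
era_share = (
    toy.assign(young=toy["age"] <= 21, veteran=toy["age"] >= 35)
       .groupby("era")[["young", "veteran"]]
       .mean()
)
print(era_share)
```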

# NUMBER OF GAMES PLAYED

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['GP'], label='2015-2020')
sns.distplot(df_2015['GP'], label='2010-2015')
sns.distplot(df_2000['GP'], label='2000-2010')
plt.legend()

* We see that in the 2000s players played more games than they do today.
* In fact, the number went down in each era.
* Teams are trying to save their players for the playoffs, because careers depend more on playoff performance.

# NUMBER OF MINUTES PLAYED PER GAME

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['MPG'], label='2015-2020')
sns.distplot(df_2015['MPG'], label='2010-2015')
sns.distplot(df_2000['MPG'], label='2000-2010')
plt.legend()

* Just like the number of games played, the number of minutes played per game has decreased.
* This is also about protecting players and saving them for the important moments.
* What a time to be an athlete, right?

# POINTS PER GAME

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['PPG'], label='2015-2020')
sns.distplot(df_2015['PPG'], label='2010-2015')
sns.distplot(df_2000['PPG'], label='2000-2010')
plt.legend()

* People would assume that points per game increase every season. But we actually see that the 2010-2015 seasons are the lowest-scoring seasons in recent NBA history.
* Of course, in recent years players have adapted to the three-point era, so the numbers are highest for 2015-2020.

As the nationality pie charts above show, the share of non-American players increases over time. The NBA has always had quality foreign players, but their number is significantly higher in recent years.

# 3-POINTER ATTEMPTS AND MAKES

In [None]:
plt.rcParams["figure.figsize"] = (15,7)
f, axes = plt.subplots(2, 1)
sns.distplot(df_2020['3PAPG'], label='2015-2020', ax=axes[0])
sns.distplot(df_2015['3PAPG'], label='2010-2015', ax=axes[0])
sns.distplot(df_2000['3PAPG'], label='2000-2010', ax=axes[0])

sns.distplot(df_2020['3PMPG'], ax=axes[1])
sns.distplot(df_2015['3PMPG'], ax=axes[1])
sns.distplot(df_2000['3PMPG'], ax=axes[1])
f.legend()

* Total field goals are nearly the same, BUT the story behind the three-pointer is different. There is a huge difference between the eras in three-point attempts and also in makes. But the gap between attempts and makes has not changed much.
* This tells us that although players shoot better these days, they could also shoot back in the day. It is just a difference in how the game is played.
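
The relation between makes and attempts is easier to judge as a percentage: three-point percentage is makes divided by attempts. Below is a minimal sketch on made-up totals. It assumes the raw `3PM` and `3PA` columns are still available (this notebook drops them after building the per game stats) and that an `era` column exists.

```python
import pandas as pd

# Made-up season totals, two players per era.
toy = pd.DataFrame({
    "era": ["2000s", "2000s", "2015-2020", "2015-2020"],
    "3PM": [50, 30, 150, 90],
    "3PA": [140, 90, 400, 260],
})

# Sum makes and attempts per era first, then divide, so that
# high-volume shooters weigh more than players with a handful of attempts.
totals = toy.groupby("era")[["3PM", "3PA"]].sum()
totals["3P%"] = 100 * totals["3PM"] / totals["3PA"]
print(totals.round(1))
```

In this toy example the attempts nearly triple while the percentage moves by less than two points, which is the pattern the bullets above describe.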

# TURNOVERS

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['TOVPG'], label='2015-2020')
sns.distplot(df_2015['TOVPG'], label='2010-2015')
sns.distplot(df_2000['TOVPG'], label='2000-2010')
plt.legend()

* The game has become a little faster nowadays, so it makes sense that 2015-2020 has more turnovers than the 2000s.
* BUT it is surprising that 2010-2015 is the most careless era of them all.
* This tells me that 2010-2015 was the transition era between the 2000s and 2015-2020. Teams have now adapted to this play style and they are more careful.

# PERSONAL FOULS

In [None]:
plt.figure(figsize=(14,7))
sns.distplot(df_2020['FPG'], label='2015-2020')
sns.distplot(df_2015['FPG'], label='2010-2015')
sns.distplot(df_2000['FPG'], label='2000-2010')
plt.legend()

* The old heads were right. The 2000s really were a massacre.

That is it for this notebook. I tried to analyze the NBA over the last twenty seasons. If you liked it, upvotes are welcome. Take care.

<img src="https://media.giphy.com/media/xUPOqo6E1XvWXwlCyQ/giphy.gif">