**Theory**
Stephen Curry and co. changed the narrative around "defense wins championships." What if offense always won championships, and a good defense merely helped? What if the phrase was always backwards - a myth? To check this, I'd like to answer the question by comparing the last ten champions and seeing 

In [2]:
import pandas as pd
import numpy as np

### Player and Coach Data
Below we extract player and coach data for the last ten seasons (2010-2019) from the Sports Reference site (basketball-reference.com). This includes:
- Player average statistics over the course of a season (includes postseason)
- Coach record data over the course of a season and career for experience tracking (includes postseason)
- The last ten NBA champions and runners-up, along with the main performers in each year's postseason

In [31]:
df_players = pd.DataFrame()
df_coaches = pd.DataFrame()

#get data - interested in last ten seasons; remember the 2011 lockout shortened the 2011-12 season (NBA_2012) to 66 games
for i in range(2010, 2020):
    players = 'https://www.basketball-reference.com/leagues/NBA_'+str(i)+'_per_game.html#per_game_stats::none'
    coaches = 'https://www.basketball-reference.com/leagues/NBA_'+str(i)+'_coaches.html#NBA_coaches::none'
    df_player = pd.read_html(players, header=0)[0] #only get the first table on the page as a dataframe
    df_coach = pd.read_html(coaches, header=0)[0]
    df_player['Year'] = i #tag every row with its season
    df_coach['Year'] = i
    df_players = pd.concat([df_players, df_player], ignore_index=True)
    df_coaches = pd.concat([df_coaches, df_coach], ignore_index=True)

print(df_players.shape)
print(df_coaches.shape)

(6371, 31)
(352, 27)


In [41]:
#get data for the last ten champions (first table, first ten rows)
champions = pd.read_html('https://www.basketball-reference.com/playoffs/', header=1)[0][0:10]
del champions['Unnamed: 5']
print(champions.shape)
champions.head()

(10, 9)


Unnamed: 0,Year,Lg,Champion,Runner-Up,Finals MVP,Points,Rebounds,Assists,Win Shares
0,2019.0,NBA,Toronto Raptors,Golden State Warriors,K. Leonard,K. Leonard (732),D. Green (223),D. Green (187),K. Leonard (4.9)
1,2018.0,NBA,Golden State Warriors,Cleveland Cavaliers,K. Durant,L. James (748),D. Green (222),L. James (198),L. James (5.2)
2,2017.0,NBA,Golden State Warriors,Cleveland Cavaliers,K. Durant,L. James (591),K. Love (191),L. James (141),L. James (4.3)
3,2016.0,NBA,Cleveland Cavaliers,Golden State Warriors,L. James,K. Thompson (582),D. Green (228),R. Westbrook (198),L. James (4.7)
4,2015.0,NBA,Golden State Warriors,Cleveland Cavaliers,A. Iguodala,L. James (601),D. Howard (238),L. James (169),S. Curry (3.9)


### Players
The row count is high, but players switch teams and get traded, so the same player can appear more than once in a season. The coaches table is the bigger concern. After reviewing the structure of the players table, the **Rk** variable, which is just a running count, is removed; players in a season can be counted with the newly added Year column, which is more informative.
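As a rough sketch of counting players per season with the Year column, the snippet below compares raw rows with distinct player names. It assumes a traded player shows up in more than one row for the same season; the toy frame, player names, and team codes are invented for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Toy stand-in for df_players; names and team codes are made up.
toy_players = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B", "Player C", "Player C"],
    "Tm":     ["AAA", "BBB", "AAA", "CCC", "CCC"],
    "Year":   [2010, 2010, 2010, 2010, 2011],
})

# Rows per season overcount anyone who appears for more than one team.
rows_per_year = toy_players.groupby("Year").size()

# Distinct names per season give the head count instead.
players_per_year = toy_players.groupby("Year")["Player"].nunique()

print(rows_per_year)
print(players_per_year)
```

Run against the real `df_players`, the same two lines would show how much of the row count comes from repeated players rather than from additional people.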

### Coaches
The count seems higher than expected. 352 rows over ten seasons implies that on average about 35 coaches had a job each season, which for a 30-team league would mean a high turnover rate. There might be some rows with header values in there...

### Champions
All of the features are categorical. Numerical values can be extracted from the points, rebounds, assists, and win shares columns to get the top totals for that postseason if needed. This will be done along with changing the year column from floats to integers. We can also remove the league column because all data here is NBA data.

In [43]:
#Players table
del df_players['Rk']
df_players.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,Arron Afflalo,SG,24,DEN,82,75,27.1,3.3,7.1,0.465,...,0.7,2.4,3.1,1.7,0.6,0.4,0.9,2.7,8.8,2010
1,Alexis Ajinça,C,21,CHA,6,0,5.0,0.8,1.7,0.5,...,0.2,0.5,0.7,0.0,0.2,0.2,0.3,0.8,1.7,2010
2,LaMarcus Aldridge,PF,24,POR,78,78,37.5,7.4,15.0,0.495,...,2.5,5.6,8.0,2.1,0.9,0.6,1.3,3.0,17.9,2010
3,Joe Alexander,SF,23,CHI,8,0,3.6,0.1,0.8,0.167,...,0.3,0.4,0.6,0.3,0.1,0.1,0.0,1.1,0.5,2010
4,Malik Allen,PF,31,DEN,51,3,8.9,0.9,2.3,0.397,...,0.7,0.9,1.6,0.3,0.2,0.1,0.4,1.3,2.1,2010


In [32]:
#Coaches table
df_coaches.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Seasons,Seasons.1,Unnamed: 5,Regular Season,Regular Season.1,Regular Season.2,Regular Season.3,...,Playoffs,Playoffs.1,Playoffs.2,Playoffs.3,Playoffs.4,Playoffs.5,Playoffs.6,Playoffs.7,Playoffs.8,Year
0,,,,w/ Franch,Overall,,Current Season,Current Season,Current Season,w/ Franchise,...,Current Season,Current Season,Current Season,w/ Franchise,w/ Franchise,w/ Franchise,Career,Career,Career,2010
1,Coach,Tm,,#,#,,G,W,L,G,...,G,W,L,G,W,L,G,W,L,2010
2,Mike Woodson,ATL,,6,6,,82,53,29,492,...,11,4,7,29,11,18,29,11,18,2010
3,Doc Rivers,BOS,,6,11,,82,50,32,492,...,24,15,9,71,41,30,86,46,40,2010
4,Larry Brown,CHA,,2,29,,82,44,38,164,...,4,0,4,4,0,4,235,120,115,2010


In [34]:
#inspect the first two rows, which hold header labels rather than values, before removing them
print(df_coaches.iloc[0:2].values)

[[nan nan nan 'w/ Franch' 'Overall' nan 'Current Season' 'Current Season'
  'Current Season' 'w/ Franchise' 'w/ Franchise' 'w/ Franchise' 'Career'
  'Career' 'Career' 'Career' nan 'Current Season' 'Current Season'
  'Current Season' 'w/ Franchise' 'w/ Franchise' 'w/ Franchise' 'Career'
  'Career' 'Career' 2010]
 ['Coach' 'Tm' nan '#' '#' nan 'G' 'W' 'L' 'G' 'W' 'L' 'G' 'W' 'L' 'W%'
  nan 'G' 'W' 'L' 'G' 'W' 'L' 'G' 'W' 'L' 2010]]
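The two label rows printed above would explain the inflated count: two extra rows per season over ten seasons is 20 rows, and 352 - 20 = 332, the shape reported after the cleanup further down. A minimal sketch of checking this per season is shown here, assuming a header row either has no coach name or literally contains the label `Coach`; the toy frame and its values are invented.

```python
import pandas as pd

# Toy stand-in for one season of the raw coaches table (values invented).
toy_coaches = pd.DataFrame({
    "Unnamed: 0": [None, "Coach", "Coach A", "Coach B", "Coach C"],
    "Seasons":    ["w/ Franch", "#", "6", "2", "1"],
    "Year":       [2010] * 5,
})

# Total rows per season, header rows included.
rows_per_year = toy_coaches.groupby("Year").size()

# Flag rows that carry header labels instead of a coach record.
name = toy_coaches["Unnamed: 0"]
is_header = name.isna() | (name == "Coach")
header_rows_per_year = is_header.groupby(toy_coaches["Year"]).sum()

# Remaining rows are actual coach records.
coach_rows_per_year = rows_per_year - header_rows_per_year
print(coach_rows_per_year)
```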


### Labels
The coaching table needs to be transformed. The data has two distinctions:
- Regular season and playoffs -> key will be (R, P), e.g. regular season wins = R-W
- Current season, franchise, and career counts -> (C, F, Car), e.g. current season, regular season wins = CR-W

Key:
- Current season, regular season = **CR**  
- Current season, playoffs = **CP**  
- Franchise, regular season = **FR**  
- Franchise, playoff = **FP**  
- Career regular season (experience) = **Car** 
- Career playoffs = **Car.P**

The column list below is based on the two header rows deciphered above.

In [35]:
#renaming remaining useful columns after review and key definitions
cols = ['Coach', 'F-Seasons', 'Car-Seasons', 'CR-G', 'CR-W', 'CR-L', 'FR-G', 'FR-W', 'FR-L', 'Car-G', 'Car-W', 'Car-L', 'Car-W%', 'CP-G', 'CP-W', 'CP-L', 'FP-G', 'FP-W', 'FP-L', 'Car.P-G', 'Car.P-W', 'Car.P-L']
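The same list can also be built from the key rather than typed out, which makes the naming pattern easier to see. This is only a sketch; `generated_cols` and `stats` are names introduced here for illustration.

```python
# Prefixes follow the key: C = current season, F = franchise, Car = career;
# R = regular season, P = playoffs. Each block has games, wins, losses.
stats = ["G", "W", "L"]

generated_cols = ["Coach", "F-Seasons", "Car-Seasons"]
for prefix in ["CR", "FR", "Car"]:        # regular-season blocks
    generated_cols += [f"{prefix}-{s}" for s in stats]
generated_cols.append("Car-W%")           # career regular-season win percentage
for prefix in ["CP", "FP", "Car.P"]:      # playoff blocks
    generated_cols += [f"{prefix}-{s}" for s in stats]

print(len(generated_cols))
print(generated_cols)
```

The result has 22 names in the same order as the hand-written `cols` list.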

In [38]:
#get coaching data and transform each season's table before appending
df_coaches = pd.DataFrame()
for i in range(2010, 2020):
    coaches = 'https://www.basketball-reference.com/leagues/NBA_'+str(i)+'_coaches.html#NBA_coaches::none'
    df_coach = pd.read_html(coaches, header=0)[0]
    #remove the team column (Unnamed: 1) and the empty spacer columns carried over from the html table
    del df_coach['Unnamed: 1'], df_coach['Unnamed: 2'], df_coach['Unnamed: 5'], df_coach['Unnamed: 16']
    #rename columns and drop the two leftover header rows before appending to the larger dataframe
    df_coach.columns = cols
    df_coach = df_coach[2:].copy() #copy so that adding Year does not write to a slice
    df_coach['Year'] = i
    df_coaches = pd.concat([df_coaches, df_coach], ignore_index=True)

df_coaches.shape

(332, 23)

In [39]:
df_coaches.head()

Unnamed: 0,Coach,F-Seasons,Car-Seasons,CR-G,CR-W,CR-L,FR-G,FR-W,FR-L,Car-G,...,CP-G,CP-W,CP-L,FP-G,FP-W,FP-L,Car.P-G,Car.P-W,Car.P-L,Year
0,Mike Woodson,6,6,82,53,29,492,206,286,492,...,11,4,7,29,11,18,29,11,18,2010
1,Doc Rivers,6,11,82,50,32,492,280,212,831,...,24,15,9,71,41,30,86,46,40,2010
2,Larry Brown,2,29,82,44,38,164,79,85,2310,...,4,0,4,4,0,4,235,120,115,2010
3,Vinny Del Negro,2,2,82,41,41,164,82,82,164,...,5,1,4,12,4,8,12,4,8,2010
4,Mike Brown,5,5,82,61,21,410,272,138,410,...,11,6,5,71,42,29,71,42,29,2010
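Because every column originally held header labels in its first two rows, the record columns may have been parsed as text rather than numbers. A minimal sketch of converting them is shown here, assuming that is the case; the toy frame and its values are invented, and `numeric_cols` is a name introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned coaches table, with numbers stored as text.
toy = pd.DataFrame({
    "Coach":  ["Coach A", "Coach B"],
    "CR-W":   ["53", "50"],
    "Car-W%": [".500", ".520"],
})

# Everything except the coach name should be numeric.
numeric_cols = [c for c in toy.columns if c != "Coach"]

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

After the conversion, sums and averages over wins and losses behave as arithmetic rather than string concatenation.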


In [59]:
#Champions table
champions['Year'] = champions['Year'].astype(int) #years into integers
del champions['Lg'] #delete league name col
champions.head()

Unnamed: 0,Year,Champion,Runner-Up,Finals MVP,Points,Rebounds,Assists,Win Shares
0,2019,Toronto Raptors,Golden State Warriors,K. Leonard,K. Leonard (732),D. Green (223),D. Green (187),K. Leonard (4.9)
1,2018,Golden State Warriors,Cleveland Cavaliers,K. Durant,L. James (748),D. Green (222),L. James (198),L. James (5.2)
2,2017,Golden State Warriors,Cleveland Cavaliers,K. Durant,L. James (591),K. Love (191),L. James (141),L. James (4.3)
3,2016,Cleveland Cavaliers,Golden State Warriors,L. James,K. Thompson (582),D. Green (228),R. Westbrook (198),L. James (4.7)
4,2015,Golden State Warriors,Cleveland Cavaliers,A. Iguodala,L. James (601),D. Howard (238),L. James (169),S. Curry (3.9)


In [78]:
#extract top values from points, rebounds, assists, and win shares
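def extract_values(text):
    #take the text between the parentheses, e.g. 'K. Leonard (732)' -> '732'
    value = text.split('(')[1]
    value = value.split(')')[0]
    #whole numbers become ints; anything else (e.g. win shares) becomes a float
    try:
        fin_value = int(value)
    except ValueError:
        fin_value = float(value)
    return fin_value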
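The function above assumes every cell is a string with a number in parentheses. If a cell were missing or formatted differently, it would raise an error. A more defensive variant could look like the sketch below; `extract_value_safe` and `_PAREN_NUMBER` are hypothetical names introduced here, not part of the notebook.

```python
import re

# A number inside parentheses at the end of the text, e.g. "(732)" or "(4.9)".
_PAREN_NUMBER = re.compile(r"\(([-+]?\d*\.?\d+)\)\s*$")

def extract_value_safe(text):
    """Return the number in trailing parentheses, or None if there is none."""
    if not isinstance(text, str):
        return None                      # e.g. NaN for a missing cell
    match = _PAREN_NUMBER.search(text)
    if match is None:
        return None                      # no "(number)" at the end
    raw = match.group(1)
    # Keep whole numbers as ints and decimals as floats.
    return float(raw) if "." in raw else int(raw)

print(extract_value_safe("K. Leonard (732)"))
print(extract_value_safe("K. Leonard (4.9)"))
print(extract_value_safe("no value here"))
```

Returning `None` instead of raising lets the caller decide how to treat cells that do not follow the expected pattern.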

In [81]:
#rename columns
champions = champions.rename(columns = {'Points':'Top Scorer', 'Rebounds':'Top Rebr', 'Assists':'Top Asst', 
'Win Shares': 'WS Lead'})
#extract top performer values
champions['Points'] = champions['Top Scorer'].apply(extract_values)
champions['Rebounds'] = champions['Top Rebr'].apply(extract_values)
champions['Assists'] = champions['Top Asst'].apply(extract_values)
champions['Win Shares'] = champions['WS Lead'].apply(extract_values)

In [83]:
champions.head()

Unnamed: 0,Year,Champion,Runner-Up,Finals MVP,Top Scorer,Top Rebr,Top Asst,WS Lead,Points,Rebounds,Assists,Win Shares
0,2019,Toronto Raptors,Golden State Warriors,K. Leonard,K. Leonard (732),D. Green (223),D. Green (187),K. Leonard (4.9),732,223,187,4.9
1,2018,Golden State Warriors,Cleveland Cavaliers,K. Durant,L. James (748),D. Green (222),L. James (198),L. James (5.2),748,222,198,5.2
2,2017,Golden State Warriors,Cleveland Cavaliers,K. Durant,L. James (591),K. Love (191),L. James (141),L. James (4.3),591,191,141,4.3
3,2016,Cleveland Cavaliers,Golden State Warriors,L. James,K. Thompson (582),D. Green (228),R. Westbrook (198),L. James (4.7),582,228,198,4.7
4,2015,Golden State Warriors,Cleveland Cavaliers,A. Iguodala,L. James (601),D. Howard (238),L. James (169),S. Curry (3.9),601,238,169,3.9
