An NBA team's franchise value can be predicted by its city's population

**Overview:**

*The Los Angeles Lakers are worth $3B. Jerry Buss bought the team in 1979 for $16.5M, and it is now owned by Jeanie Buss. The team performs worse than most of the NBA, yet its franchise value is one of the highest. If this team were for sale, would you buy it?*

Every basketball team has a franchise value, but this value does not reflect the team's performance. On the contrary, some of the worst-performing teams (such as the LA Lakers and NY Knicks) have the highest franchise values (as of 2017). The highest-paid players do not necessarily play for the highest-valued teams.
To measure a player's impact, sports analysts have invented their own metrics, such as Real Plus-Minus (RPM), Player Impact Estimate (PIE) and so on.
The idea of this notebook is to explore the relationship between metrics such as the team's franchise value, the social media activity of MVPs and the team's performance.


In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from IPython.display import display, HTML

color = sns.color_palette()
# Widen the notebook container to the full browser width.
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline

Let's look at each team's fan attendance statistics. TOTAL and AVG give the total and average attendance for 2017. ELO is the team's Elo rating, a points system in which the winning team takes points from the losing team after every game. CONF is the team's NBA conference, Eastern or Western.
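The Elo exchange described above can be sketched in a few lines. This is only an illustration of the standard Elo formulation, not the formula used to build the ELO column in the dataset: the function names, the K-factor of 20 and the 400-point scale are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner, loser, k=20.0):
    """Return the new (winner, loser) ratings after one game.

    k is an assumed K-factor; it controls how many points change hands.
    """
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


# Hypothetical ratings: an upset moves more points than an expected win.
fav_after, dog_after = update_elo(1600.0, 1400.0)   # favourite wins
dog_upset, fav_upset = update_elo(1400.0, 1600.0)   # underdog wins
print(round(fav_after - 1600.0, 2), round(dog_upset - 1400.0, 2))
```

The points the winner gains are exactly the points the loser gives up, so the total rating in the league stays constant.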


In [None]:
# Load attendance, valuation and Elo data, then sort by total attendance
# to see the most attended teams.
attendance_valuation_elo_df = pd.read_csv("../input/social-power-nba/nba_2017_att_val_elo.csv")
attendance_valuation_elo_df_sorted = attendance_valuation_elo_df.sort_values(by=["TOTAL"], ascending=False)
attendance_valuation_elo_df_sorted.head(6)

Attendance and franchise valuation appear to be correlated. Let's check with a plot of TOTAL attendance against VALUE_MILLIONS:

In [None]:
sns.lmplot(x="TOTAL", y="VALUE_MILLIONS", data=attendance_valuation_elo_df)
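The straight line in a plot like this is a least-squares fit, and its strength can be summarised with the Pearson correlation. A minimal sketch of both numbers, assuming a small invented table (`toy_df` and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for attendance_valuation_elo_df (values are invented).
toy_df = pd.DataFrame({
    "TOTAL": [700_000, 750_000, 800_000, 850_000, 900_000],
    "VALUE_MILLIONS": [900, 1100, 1500, 1900, 2600],
})

# Degree-1 least-squares fit: VALUE_MILLIONS ~ slope * TOTAL + intercept.
slope, intercept = np.polyfit(toy_df["TOTAL"], toy_df["VALUE_MILLIONS"], deg=1)

# Pearson correlation measures how tightly the points follow that line.
pearson_r = toy_df["TOTAL"].corr(toy_df["VALUE_MILLIONS"])

print(f"slope per 1,000 fans: {slope * 1000:.2f}M, r = {pearson_r:.3f}")
```

Running the same two calls on the real columns would put a number on the trend instead of judging it by eye.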

This seems logical: the more people attend games, the higher the team's valuation.
Moreover, ticket sales are one of a team's sources of revenue.
So let's analyze which other factors may drive franchise valuation.

In [None]:
arenas  = pd.read_csv("../input/nba-arenas-pop/NBA_Arenas_Pop.csv")
arenas.head(6)


In [None]:
val_atten = attendance_valuation_elo_df.copy()
val_arena = val_atten.merge(arenas, how="inner", on="TEAM")
df = val_arena.drop(["Unnamed: 0", "GMS"], axis=1)
df.head(6)
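An inner merge silently drops any team whose name is spelled differently in the two files. One way to check for that, sketched here on two invented miniature tables (the frames, team spellings and numbers are hypothetical), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (all values are invented).
valuation = pd.DataFrame({
    "TEAM": ["Chicago Bulls", "Los Angeles Lakers", "LA Clippers"],
    "VALUE_MILLIONS": [2500, 3000, 2000],
})
arena_info = pd.DataFrame({
    "TEAM": ["Chicago Bulls", "Los Angeles Lakers", "Los Angeles Clippers"],
    "CAPACITY": [20000, 19000, 19000],
})

# An outer merge with indicator=True labels each row by where it was found.
check = valuation.merge(arena_info, how="outer", on="TEAM", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["TEAM", "_merge"]]
print(unmatched)
```

Any row listed in `unmatched` would be missing from the inner merge, so an empty result means no team was lost.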

In [None]:
df.to_csv('df.csv', index=False)


The correlation heatmap shows a strong relationship between franchise value and factors such as total and average attendance and the density of the city's population. The cities with the highest density have two basketball teams each, and those two cities have the highest-valued franchises. As we can see, Elo shows no clear relationship with the team's valuation.


In [None]:
# Correlate numeric columns only; text columns such as TEAM cannot be correlated.
corr = df.corr(numeric_only=True)
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=cmap,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
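Colours on a heatmap can be hard to compare, so it may help to read the valuation column of the correlation matrix as a ranked list. A small sketch, assuming an invented table (`toy` and all of its numbers are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df; all numbers are invented for illustration.
toy = pd.DataFrame({
    "TEAM": ["A", "B", "C", "D", "E"],
    "VALUE_MILLIONS": [3000, 2500, 1200, 900, 800],
    "TOTAL": [880_000, 870_000, 760_000, 720_000, 700_000],
    "ELO": [1500, 1540, 1520, 1480, 1560],
})

# Keep numeric columns, take the VALUE_MILLIONS column of the matrix,
# and order the other variables by the strength of their correlation.
value_corr = (
    toy.select_dtypes("number")
       .corr()["VALUE_MILLIONS"]
       .drop("VALUE_MILLIONS")
       .sort_values(key=abs, ascending=False)
)
print(value_corr)
```

Sorting by absolute value keeps strong negative correlations near the top as well.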


As the following graph shows, population density has the biggest impact on franchise valuation: all the light squares, which mark the highest valuations, are on the right side of the map.

In [None]:
valuations2 = df.pivot(index="TEAM", columns="POPULATION_2016", values="VALUE_MILLIONS")
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("NBA Team Valuation in Millions by City Population (2016)")
sns.heatmap(valuations2, linewidths=.5, annot=True, fmt='g', ax=ax)

In [None]:
numerical_df = df.loc[:,["TOTAL", "ELO", "VALUE_MILLIONS", "POPULATION_2016","two teams in city", "CAPACITY", "OPENED"]]

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
print(scaler.fit(numerical_df))
print(scaler.transform(numerical_df))
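With its default settings, MinMaxScaler rescales each column to the range 0 to 1 so that large-valued columns such as TOTAL do not dominate the distance calculations in clustering. The same arithmetic written out in pandas, on an invented two-column table (`raw` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical columns on very different scales (values are invented).
raw = pd.DataFrame({
    "TOTAL": [700_000, 800_000, 900_000],
    "ELO": [1400, 1500, 1650],
})

# Column-wise min-max scaling: (x - min) / (max - min).
scaled = (raw - raw.min()) / (raw.max() - raw.min())
print(scaled)
```

After scaling, each column's smallest value is 0 and its largest is 1, whatever its original units were.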

In [None]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3)
kmeans = k_means.fit(scaler.transform(numerical_df))
df['cluster'] = kmeans.labels_
df.sort_values(by = ["cluster"], ascending = True)
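K-means starts from randomly chosen centroids, so cluster labels can change from run to run unless the random seed is fixed. A hedged sketch of a reproducible fit, using invented points in place of the scaled team features (`points`, `centers` and the seed values are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical scaled features: three well-separated groups of teams.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
points = np.vstack([c + rng.normal(scale=0.02, size=(10, 2)) for c in centers])

# random_state fixes the centroid initialisation; n_init is set explicitly
# because its default has changed between scikit-learn releases.
model_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)
model_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

print(np.array_equal(model_a.labels_, model_b.labels_))
print(np.bincount(model_a.labels_))
```

Passing the same `random_state` in the cell above would keep the `cluster` column stable between notebook runs.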

In [None]:
# Top paid players (by salary)
salary_df = pd.read_csv("../input/social-power-nba/nba_2017_salary.csv")
salary_df_sorted = salary_df.sort_values(by="SALARY", ascending=False)
salary_df_sorted.head(12)

In [None]:
# Player Impact Estimate (PIE), top PIE:
# a player's impact on each individual game they play
pie_df = pd.read_csv("../input/social-power-nba/nba_2017_pie.csv")
pie_df_sorted = pie_df.sort_values(["PIE"], ascending=False)
pie_df_sorted.head(6)

In [None]:
# Real Plus-Minus (RPM), top RPM:
# an ESPN metric based on the net change in score (plus or minus) while each
# player is on the court, adjusted for teammates and opponents.
plus_minus_df = pd.read_csv("../input/social-power-nba/nba_2017_real_plus_minus.csv")
plus_minus_df_sorted = plus_minus_df.sort_values(["RPM"], ascending=False)
plus_minus_df_sorted.head(12)

In [None]:
# Basketball Reference Statistics
br_stats_df = pd.read_csv("../input/social-power-nba/nba_2017_br.csv")
br_stats_df.head()

In [None]:
# Rename columns so the tables can be merged on PLAYER.
plus_minus_df.rename(columns={"NAME": "PLAYER", "WINS": "WINS_RPM"}, inplace=True)
# Keep only the text before the comma in each name.
plus_minus_df["PLAYER"] = plus_minus_df["PLAYER"].str.split(",").str[0]
plus_minus_df.head()

In [None]:

nba_players_df = br_stats_df.copy()
nba_players_df.rename(columns={'Player': 'PLAYER','Pos':'POSITION', 'Tm': "TEAM", 'Age': 'AGE', "PS/G": "POINTS"}, inplace=True)
nba_players_df.drop(["G", "GS", "TEAM"], inplace=True, axis=1)
nba_players_df = nba_players_df.merge(plus_minus_df, how="inner", on="PLAYER")
nba_players_df.head()

In [None]:

pie_df_subset = pie_df[["PLAYER", "PIE", "PACE", "W"]].copy()
nba_players_df = nba_players_df.merge(pie_df_subset, how="inner", on="PLAYER")
nba_players_df.head()

In [None]:
salary_df.rename(columns={'NAME': 'PLAYER'}, inplace=True)
salary_df["SALARY_MILLIONS"] = round(salary_df["SALARY"]/1000000, 2)
salary_df.drop(["POSITION","TEAM", "SALARY"], inplace=True, axis=1)
salary_df.head()

In [None]:
diff = list(set(nba_players_df["PLAYER"].values.tolist()) - set(salary_df["PLAYER"].values.tolist()))

In [None]:
len(diff)


In [None]:

nba_players_with_salary_df = nba_players_df.merge(salary_df)

In [None]:

fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("NBA Player Correlation Heatmap:  2016-2017 Season (STATS & SALARY)")
# Correlate numeric columns only; text columns such as PLAYER cannot be correlated.
corr = nba_players_with_salary_df.corr(numeric_only=True)
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax=ax)

3P: the number of 3-point field goals that a player makes. 3PA: the number of 3-point field goals that a player has attempted. 3P% = 3P/3PA.

FG%: the percentage of field goal attempts that a player makes (FG/FGA). FGA: the number of field goals that a player or team has attempted.

eFG%: field goal percentage adjusted for made 3-point field goals being 1.5 times more valuable than made 2-point field goals.

FT: free throws made. FTA: free throw attempts.
ORB: offensive rebounds. DRB: defensive rebounds.

AST: assists, passes by a player that lead directly to a made basket. STL: steals, when a defensive player takes the ball from an offensive player, causing a turnover.
BLK: blocks.
GP: games played.
WINS_RPM: an estimate of the number of wins each player has contributed to his team's win total for the season.
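The shooting percentages defined above can be computed directly from the counting stats. A minimal sketch using the usual eFG% formula, (FG + 0.5 * 3P) / FGA; the function name and the stat line are hypothetical, and the function assumes at least one attempt of each kind:

```python
def shooting_percentages(fg, fga, three_p, three_pa):
    """Return FG%, 3P% and eFG% as fractions between 0 and 1."""
    fg_pct = fg / fga
    three_pct = three_p / three_pa
    # A made three counts as 1.5 made field goals.
    efg_pct = (fg + 0.5 * three_p) / fga
    return fg_pct, three_pct, efg_pct


# Hypothetical stat line: 10 of 20 from the field, 4 of 10 from three.
fg_pct, three_pct, efg_pct = shooting_percentages(fg=10, fga=20, three_p=4, three_pa=10)
print(fg_pct, three_pct, efg_pct)
```

For this stat line eFG% comes out higher than FG%, because four of the ten made shots were threes.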




There is a relationship between a player's salary and his impact on his team.

In [None]:
sns.lmplot(x="SALARY_MILLIONS", y="WINS_RPM", data=nba_players_with_salary_df)


In [None]:
results = smf.ols('W ~ POINTS', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())


In [None]:
results = smf.ols('W ~ WINS_RPM', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())


In [None]:
results = smf.ols('SALARY_MILLIONS ~ POINTS', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())
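The full summary table is long, and usually only a few numbers from it matter: the slope, its p-value and R-squared. A sketch of pulling those out of a fitted model, using invented data in place of the player table (`toy_players` and the true slope of 0.8 are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: salary rises by about 0.8M per point scored, plus noise.
rng = np.random.default_rng(0)
points = np.linspace(0, 30, 50)
toy_players = pd.DataFrame({
    "POINTS": points,
    "SALARY_MILLIONS": 1.5 + 0.8 * points + rng.normal(scale=1.0, size=50),
})

fit = smf.ols("SALARY_MILLIONS ~ POINTS", data=toy_players).fit()

slope = fit.params["POINTS"]        # estimated change in salary per point
p_value = fit.pvalues["POINTS"]     # test of the null hypothesis slope == 0
r_squared = fit.rsquared            # share of variance explained
print(f"slope={slope:.2f}, p={p_value:.3g}, R^2={r_squared:.2f}")
```

Reading the same three attributes from each `results` object above would make the three regressions easy to compare side by side.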
