**Load and Understand the Data**

In [None]:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

In [None]:
attendance_df = pd.read_csv("../input/nba_2017_attendance.csv")
attendance_df.head()
# Number of games the team played in the season
# Total attendance for the whole season
# Average percentage of stadium capacity that is filled

In [None]:
endorsement_df = pd.read_csv("../input/nba_2017_endorsements.csv")
endorsement_df.head()
# ENDORSEMENT: money the player is paid for advertising

In [None]:
valuations_df = pd.read_csv("../input/nba_2017_team_valuations.csv")
valuations_df.head()

In [None]:
salary_df = pd.read_csv("../input/nba_2017_salary.csv")
salary_df.head()
# SF: small forward
# PG: point guard
# C: center
# PF: power forward
# SALARY: salary in dollars

In [None]:
pie_df = pd.read_csv("../input/nba_2017_pie.csv")
pie_df.head()
# GP: games played in the season (no ties); a full regular season has 82 games
# Win
# Loss
# MIN: average minutes per game this season
# OFFRTG: offensive rating, points scored per 100 possessions
# DEFRTG: defensive rating, points allowed per 100 possessions
# NETRTG: OFFRTG - DEFRTG
# AST ratio: assists per 100 possessions
# OREB%: offensive rebound percentage
# DREB%: defensive rebound percentage (the higher the better)
# TO Ratio: turnover ratio (turnovers per 100 possessions)


In [None]:
plus_minus_df = pd.read_csv("../input/nba_2017_real_plus_minus.csv")
plus_minus_df.head()
# RPM: real plus-minus
# ORPM: offensive RPM (the higher the better)
# DRPM: defensive RPM (the higher the better)
# WINS

In [None]:
br_stats_df = pd.read_csv("../input/nba_2017_br.csv")
br_stats_df.head()
# Rk: rank
# Pos: position
# Tm: team
# G: games played
# GS: games started
# MP: minutes played per game
# FG: field goals
# FGA: field goal attempts
# FT%: free throw percentage
# ORB: offensive rebounds
# DRB: defensive rebounds
# TRB: total rebounds = ORB + DRB
# AST: assists
# STL: steals
# BLK: blocks
# TOV: turnovers
# PF: personal fouls
# PS/

In [None]:
elo_df = pd.read_csv("../input/nba_2017_elo.csv")
elo_df.head()


In [None]:
attendance_valuation_df = attendance_df.merge(valuations_df, how="inner", on="TEAM")

In [None]:
attendance_valuation_df.head()


In [None]:
attendance_valuation_elo_df = pd.read_csv("../input/nba_2017_att_val_elo.csv")

In [None]:
attendance_valuation_elo_df.head()


**Q1: Does a higher salary mean a higher endorsement?**

In [None]:
endorsement_df

In [None]:
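# regex=False treats '$' as a literal character; as a regex it would be the end-of-string anchor
endorsement_df['SALARY'] = endorsement_df['SALARY'].str.replace(',', '', regex=False)
endorsement_df['SALARY'] = endorsement_df['SALARY'].str.replace('$', '', regex=False)
endorsement_df['SALARY'] = endorsement_df['SALARY'].astype(float)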

In [None]:
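# regex=False treats '$' as a literal character; as a regex it would be the end-of-string anchor
endorsement_df['ENDORSEMENT'] = endorsement_df['ENDORSEMENT'].str.replace(',', '', regex=False)
endorsement_df['ENDORSEMENT'] = endorsement_df['ENDORSEMENT'].str.replace('$', '', regex=False)
endorsement_df['ENDORSEMENT'] = endorsement_df['ENDORSEMENT'].astype(float)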

In [None]:
endorsement_df
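
The cleaning step above can also be written as one small helper. This is a minimal sketch rather than the notebook's own method: `parse_currency` is a hypothetical name, and the sample strings are made up for illustration. Plain `str.replace` on a Python string always treats `$` literally, so there is no regex ambiguity.

```python
def parse_currency(text):
    """Convert a string such as "$26,540,100" to a float."""
    # Remove surrounding whitespace, the dollar sign, and thousands separators
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# Hypothetical sample values, not taken from the dataset
samples = ["$1,000,000", "$250,500.75", " $42 "]
parsed = [parse_currency(s) for s in samples]
print(parsed)
```

Assuming the raw column still holds strings, the helper could be applied with something like `endorsement_df['SALARY'].map(parse_currency)`.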

In [None]:
# Total compensation = salary + endorsement
endorsement_df["total"] = endorsement_df.ENDORSEMENT + endorsement_df.SALARY
# Create the figure and keep a handle to its axes
fig, ax = plt.subplots(figsize=(20, 15))
# Plot 1 - background - "total" series; the part left visible is the endorsement
sns.set_color_codes("muted")
sns.barplot(x="total", y="NAME", data=endorsement_df, label="Endorsement", color="b", ax=ax)
# Plot 2 - overlay - salary series drawn on top of the total
sns.set_color_codes("pastel")
sns.barplot(x="SALARY", y="NAME", data=endorsement_df, label="Salary", color="b", ax=ax)
# Add a legend
ax.legend(ncol=2, loc="lower right", frameon=True)
# Add axis labels
ax.set(ylabel="Player Names",
       xlabel="Player Salary and Endorsement")
# Remove the left and bottom spines
sns.despine(left=True, bottom=True)
# reference: https://github.com/noahgift/spot_price_machine_learning/blob/master/notebooks/spot_pricing_ml.ipynb
# reference: http://randyzwitch.com/creating-stacked-bar-chart-seaborn/

Interestingly, Stephen Curry's endorsement income exceeds his salary by a wider margin than any other player's, while Carmelo Anthony has the lowest endorsement income in the group. If endorsement income reflects popularity in the broader business market, this chart suggests that Stephen Curry is more widely embraced by the business world than the other players. However, the relatively low endorsements of Carmelo Anthony and Chris Paul are more likely explained by when they signed their endorsement contracts. Both are veterans who probably agreed to a fixed amount about 10 years ago, when a dollar had more buying power than it does today because of inflation. Inflation data would therefore be needed to make the endorsement amounts comparable across players.
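
The inflation adjustment mentioned above amounts to scaling a nominal amount by the ratio of two price-index values. The sketch below assumes a consumer price index (CPI) is available for the signing year and for the reference year. The function name and every number in it are placeholders for illustration, not real CPI or contract figures.

```python
def to_real_dollars(nominal, cpi_at_signing, cpi_reference):
    """Express a nominal amount in reference-year dollars."""
    if cpi_at_signing <= 0:
        raise ValueError("CPI at signing must be positive")
    # real = nominal * (reference-year index / signing-year index)
    return nominal * cpi_reference / cpi_at_signing


# Placeholder values for illustration only, NOT real CPI or endorsement data
cpi_signing_year = 200.0
cpi_reference_year = 250.0
nominal_endorsement = 8_000_000.0

real_endorsement = to_real_dollars(nominal_endorsement, cpi_signing_year, cpi_reference_year)
print(real_endorsement)
```

Using this would also require each player's contract signing year, which the endorsement file does not appear to include.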

In [None]:
results = smf.ols('ENDORSEMENT ~ SALARY', data=endorsement_df).fit()

In [None]:
print(results.summary())

In [None]:
import numpy as np
A = endorsement_df['SALARY'].values
B = endorsement_df['ENDORSEMENT'].values
# The off-diagonal entries are the Pearson correlation between salary and endorsement
print(np.corrcoef(A, B))

The correlation and the linear regression both show that salary and endorsement are not strongly related. I think this indicates that a player's value on the court and his value in the business world are not directly linked.
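
To make the link between the two results explicit: the off-diagonal entry of `np.corrcoef` is the Pearson coefficient r, and for a regression with a single predictor the R-squared in the OLS summary equals r squared. The following is a minimal pure-Python sketch of that calculation. The toy numbers are hypothetical and are not the notebook's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, not the notebook's data
salary = [1.0, 2.0, 3.0, 4.0]
endorsement = [2.0, 1.0, 4.0, 3.0]

r = pearson_r(salary, endorsement)
r_squared = r ** 2  # equals R-squared of a one-predictor OLS fit
print(r, r_squared)
```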

A question that stems from Q1: since the players with endorsement data are all star players, could there be a relationship between endorsement and age? I assumed that among these superstars the older players would have lower endorsements because they signed their endorsement contracts in earlier years. However, the following analysis does not support this assumption.

In [None]:
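# Note: player_stats_df is created in the Q3 section below
# (merge of endorsement_df and br_stats_df); run that cell first
player_stats_df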

In [None]:
player_stats_df[['ENDORSEMENT','Age','NAME']].sort_values(by = 'ENDORSEMENT', ascending = False)

In [None]:
a = player_stats_df['ENDORSEMENT'].values

In [None]:
b = player_stats_df['Age'].values

In [None]:
print (np.corrcoef(a,b))

**Q2: For teams, what is the relationship between ELO and value in millions?**

In [None]:
attendance_valuation_elo_df.info()

In [None]:
# plotly.plotly moved to the separate chart_studio package in plotly 4
# and is not needed for the offline plot below
import cufflinks as cf
print(cf.__version__)

In [None]:
elo_value = attendance_valuation_elo_df[['TEAM','VALUE_MILLIONS','ELO']].sort_values(by='VALUE_MILLIONS', ascending = False)

In [None]:
# need to set the index of table to be the teams
elo_value = elo_value.set_index('TEAM')

In [None]:
cf.go_offline()
elo_value.iplot(title="Team ELO and Value",
                xTitle="Teams",
                yTitle="",
                # bestfit=True, bestfit_colors=["pink"],
                # subplots=True,
                shape=(4, 1),
                # subplot_titles=True,
                fill=True)

In the chart above, the teams whose value line sits above their ELO line are all from relatively big cities, that is, cities with stronger economies and larger populations. However, this visualization is flawed: value is measured in millions of dollars while ELO is a unitless rating, so the two series should not share one y-axis. I need to do more research on how to add a second axis to this chart.
reference: https://github.com/noahgift/real_estate_ml/blob/master/notebooks/explore_zillow_data_sets.ipynb
reference: https://plot.ly/ipython-notebooks/cufflinks/

Reference of ELO rating: https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/
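
A second y-axis is the usual fix for the unit mismatch noted above (in matplotlib, `Axes.twinx()` creates one). A lighter alternative is to standardize both series so they can share a single axis. The sketch below illustrates that alternative under stated assumptions: `z_scores` is a hypothetical helper and the numbers are invented for illustration, not the team data.

```python
import statistics


def z_scores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / sd for v in values]


# Hypothetical values for illustration only, not the team data
value_millions = [3300.0, 3000.0, 2600.0, 1000.0]
elo = [1500.0, 1600.0, 1700.0, 1450.0]

value_z = z_scores(value_millions)
elo_z = z_scores(elo)
# Both series are now unitless and can be drawn on one shared axis
print(value_z)
print(elo_z)
```

After standardizing, a team whose value z-score is above its ELO z-score is valued higher than its playing strength would rank it, relative to the other teams.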

**Q3: For players, relationship between Assist to Turnover Ratio and Salary?**

In [None]:
#pie_df

In [None]:
br_stats_df.head(5)

In [None]:
endorsement_df

In [None]:
player_stats_df = pd.merge(endorsement_df,br_stats_df , left_on = 'NAME', right_on = 'Player')

In [None]:
player_stats_df

In [None]:
player_stats_df['ast_tov'] = player_stats_df['AST']/player_stats_df['TOV']

In [None]:
sns.lmplot(x="SALARY", y="ast_tov", data=player_stats_df)
plt.show()

In [None]:
sns.barplot(x='ast_tov', y='NAME', data=player_stats_df)

Assist-to-turnover ratio is a widely accepted metric for gauging an NBA player's performance. A player who provides more assists and commits fewer turnovers is more likely to be a supportive teammate and a strong leader who organizes the team, so I assumed such players would earn higher salaries. However, for these ten players we don't see a relationship between assist-to-turnover ratio and salary. This is probably because ten players are too few to reveal one.
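
One way to check how much ten data points can actually show is a permutation test: shuffle one variable many times and count how often the shuffled correlation is at least as strong as the observed one. This is a minimal sketch of that idea, not part of the original analysis. The function names and sample numbers are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Count shuffles whose correlation is at least as strong as observed
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical values for ten players, not the notebook's data
salary = [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 26.0, 28.0, 30.0]
ast_tov = [2.1, 1.4, 3.0, 1.8, 2.5, 1.2, 2.9, 1.6, 2.2, 1.9]

p_value = permutation_p_value(salary, ast_tov)
print(p_value)
```

A large p-value here means the observed correlation is indistinguishable from chance at this sample size. It does not show that no relationship exists.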

**Q4: Correlation of all the data points related to the players?**

In [None]:
wikipedia_df = pd.read_csv("../input/nba_2017_player_wikipedia.csv")
wikipedia_df.head()

In [None]:
twitter_df = pd.read_csv("../input/nba_2017_twitter_players.csv")
twitter_df.head()

In [None]:
player_stats_df.head(5)

In [None]:
player_stats_twitter_df = pd.merge(player_stats_df, twitter_df, left_on = 'NAME', right_on = 'PLAYER')

In [None]:
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("NBA Player Correlation Heatmap")
# Keep only numeric columns; text columns such as names cannot be correlated
corr = player_stats_twitter_df.select_dtypes(include="number").corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap="Oranges",
            ax=ax)

This heatmap shows that some on-court performance metrics are strongly correlated with the monetization metrics, salary and endorsement. I would like to look at these relationships more closely using scatterplots.

In [None]:
from ggplot import *

In [None]:
p = ggplot(player_stats_twitter_df, aes(x="2P%", y="ENDORSEMENT")) + geom_point(size=100, color='orange') + stat_smooth(method='lm')
# Title corrected to match the plotted variables
p + xlab("2P%") + ylab("Endorsement") + ggtitle("NBA Players 2016-2017: 2P% vs Endorsement")

In [None]:
player_stats_twitter_df[['2P%','ENDORSEMENT']].corr()

The scatterplot and the correlation matrix show that 2-Point Field Goal Percentage is strongly correlated with endorsement in this sample.
reference: https://www.basketball-reference.com/about/glossary.html

In [None]:
sns.lmplot(x="FG%", y="ENDORSEMENT", data=player_stats_twitter_df)

In [None]:
player_stats_twitter_df[["FG%", "ENDORSEMENT"]].corr()

Field Goal Percentage is also strongly correlated with endorsement.

In [None]:
sns.lmplot(x="eFG%", y="ENDORSEMENT", data=player_stats_twitter_df)

In [None]:
player_stats_twitter_df[["eFG%", "ENDORSEMENT"]].corr()

Effective Field Goal Percentage is also strongly correlated with endorsement.

In [None]:
sns.lmplot(x="TWITTER_FAVORITE_COUNT", y="ENDORSEMENT", data=player_stats_twitter_df)

In [None]:
player_stats_twitter_df[["TWITTER_FAVORITE_COUNT", "ENDORSEMENT"]].corr()

Twitter favorite count is fairly strongly correlated with endorsement.

However, the factors that correlate strongly with endorsement are only weakly related to salary. Based on the heatmap, salary is more closely related to Defensive Rebounds and Total Rebounds.