This is my attempt to study the impact of social influence using the NBA dataset. The dataset combines multiple data sources, and I'm a newbie to the NBA, so hopefully this will be an interesting analysis for me.

In [None]:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.cluster import KMeans
color = sns.color_palette()
%matplotlib inline

In [None]:
attendance_df = pd.read_csv("../input/nba_2017_attendance.csv")
attendance_df.head()

In [None]:
endorsement_df = pd.read_csv("../input/nba_2017_endorsements.csv")
endorsement_df.head()


In [None]:
valuations_df = pd.read_csv("../input/nba_2017_team_valuations.csv")
valuations_df.head()


In [None]:
salary_df = pd.read_csv("../input/nba_2017_salary.csv")
salary_df.head()

In [None]:
pie_df = pd.read_csv("../input/nba_2017_pie.csv")
pie_df.head()

In [None]:
plus_minus_df = pd.read_csv("../input/nba_2017_real_plus_minus.csv")
plus_minus_df.head()


In [None]:
br_stats_df = pd.read_csv("../input/nba_2017_br.csv")
br_stats_df.head()


In [None]:
elo_df = pd.read_csv("../input/nba_2017_elo.csv")
elo_df.head()


Now that all the datasets are loaded, let's see what each of them looks like and then proceed with the analysis.

In [None]:
attendance_df.describe()

In [None]:
endorsement_df.describe()

In [None]:
valuations_df.describe()

In [None]:
salary_df.describe()

In [None]:
pie_df.describe()

In [None]:
plus_minus_df.describe()

In [None]:
br_stats_df.describe()

In [None]:
elo_df.describe()

As seen above, the mean and median differ substantially for both valuations and salary. Let's plot their histograms to see the shape of each distribution.

In [None]:
valuations_df.hist()

In [None]:
salary_df.hist()

Both the salary and valuation data are right-skewed, so each contains some high-end outliers that a simple model will probably not explain. Let's check whether that is the case.
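The skew can also be checked numerically rather than by eye: compare the mean with the median, compute the sample skewness, and flag values above the usual 1.5 × IQR fence. The sketch below is only an illustration. The helper `skew_report` is my own name, and the numbers in `toy` are made up rather than taken from the dataset. On the real data the same helper could be applied to a numeric column such as `valuations_df["VALUE_MILLIONS"]`.

```python
import pandas as pd

def skew_report(values: pd.Series) -> dict:
    """Summarise skew: mean, median, sample skewness, and upper 1.5*IQR outliers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skewness": values.skew(),  # > 0 indicates a right (positive) skew
        "upper_outliers": values[values > upper_fence].tolist(),
    }

# Illustrative values only (NOT from the dataset): one very large value in the tail.
toy = pd.Series([800, 850, 900, 950, 1000, 1050, 1100, 3300], name="VALUE_MILLIONS")
report = skew_report(toy)
print(report)
```

A mean well above the median together with a positive skewness is the numeric counterpart of the long right tail in the histograms.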

In [None]:
attendance_valuation_df = attendance_df.merge(valuations_df, how="inner", on="TEAM")

In [None]:
attendance_valuation_df.head()


In [None]:
# IPython.core.display is deprecated as an import location; IPython.display is the public one
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
sns.pairplot(attendance_valuation_df, hue="TEAM")

In [None]:
import numpy as np
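# Correlate numeric columns only; recent pandas versions raise on string columns such as TEAM
corr = attendance_valuation_df.select_dtypes("number").corr()
# Generate a mask for the upper triangle
# (np.bool was removed from NumPy; the builtin bool is the replacement)
mask = np.zeros_like(corr, dtype=bool)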
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0.5,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Team valuation appears to be fairly well correlated with Total and Avg attendance. Let's quantify that correlation and see how well these variables explain the variation in valuation.

In [None]:
# Keyword arguments are required by recent pandas versions
valuations = attendance_valuation_df.pivot(index="TEAM", columns="AVG", values="VALUE_MILLIONS")



In [None]:
# Create the figure and axes once, then draw the heatmap onto that axes
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("NBA Team AVG Attendance vs Valuation in Millions:  2016-2017 Season")
sns.heatmap(valuations, linewidths=.5, annot=True, fmt='g', ax=ax)

This plot supports the hypothesis above: for several teams the valuation is very high even though average attendance is not the highest, so attendance alone cannot account for valuation. Two other lines of inquiry suggest themselves. One is the location factor we discussed in class; along those lines, I suggest looking at variables such as average spending and average earnings or salary in these cities and states. The other is the relationship between valuation and the revenue generated by past games, where gate revenue is essentially the ticket price at each location multiplied by average ticket sales. Both data sets should be fairly easy to obtain, but since we do not have them here, let's continue exploring with the current data.
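If ticket-price data were available, the revenue idea could be sketched roughly as follows. Everything here is hypothetical: the `AVG_TICKET_PRICE` column does not exist in the current data, and the team names and numbers are invented purely to show the arithmetic.

```python
import pandas as pd

# Hypothetical teams and numbers, for illustration only.
toy_df = pd.DataFrame({
    "TEAM": ["Team A", "Team B", "Team C"],
    "AVG": [20000, 18000, 15000],             # average attendance per game
    "AVG_TICKET_PRICE": [100.0, 50.0, 80.0],  # hypothetical column, in dollars
})

# Estimated gate revenue per home game = average attendance * average ticket price.
toy_df["GATE_REVENUE_PER_GAME"] = toy_df["AVG"] * toy_df["AVG_TICKET_PRICE"]
print(toy_df.sort_values("GATE_REVENUE_PER_GAME", ascending=False))
```

With real prices, this derived column could be merged on `TEAM` and compared against `VALUE_MILLIONS` in the same way as attendance.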

In [None]:
results = smf.ols('VALUE_MILLIONS ~AVG', data=attendance_valuation_df).fit()


In [None]:
print(results.summary())


Here the adjusted R-squared is fairly low, so my inference is that AVG alone explains only part of the variation in team valuation.
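For reference, adjusted R-squared penalises plain R-squared for the number of predictors: adj. R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A minimal sketch of that formula follows. The function name and the input values are illustrative assumptions, not numbers from the regression summary.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Illustrative inputs only: R^2 = 0.30 from a one-predictor model fitted on 30 rows.
example = adjusted_r_squared(0.30, n_obs=30, n_predictors=1)
print(round(example, 3))
```

With a single predictor and about 30 teams the penalty is small, so a low adjusted value mostly reflects a low R-squared.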

In [None]:
sns.residplot(y="VALUE_MILLIONS", x="AVG", data=attendance_valuation_df)


In [None]:
attendance_valuation_elo_df = attendance_valuation_df.merge(elo_df, how="inner", on="TEAM")


In [None]:
attendance_valuation_elo_df.head()


In [None]:
# Correlate numeric columns only; recent pandas versions raise on string columns
corr_elo = attendance_valuation_elo_df.select_dtypes("number").corr()
# Generate a mask for the upper triangle
# (np.bool was removed from NumPy; the builtin bool is the replacement)
mask = np.zeros_like(corr_elo, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(11, 9))
ax.set_title("NBA Team Correlation Heatmap:  2016-2017 Season (ELO, AVG Attendance, VALUATION IN MILLIONS)")
sns.heatmap(corr_elo, mask=mask, cmap=cmap, vmax=1, center=0.5,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
corr_elo


In [None]:

# lmplot returns a FacetGrid; the `size` argument was renamed to `height` in seaborn 0.9
grid = sns.lmplot(x="ELO", y="AVG", data=attendance_valuation_elo_df, hue="CONF", height=12)
grid.set(xlabel='ELO Score', ylabel='Average Attendance Per Game', title="NBA Team AVG Attendance vs ELO Ranking:  2016-2017 Season")


In [None]:
attendance_valuation_elo_df.groupby("CONF")["ELO"].median()


In [None]:
attendance_valuation_elo_df.groupby("CONF")["AVG"].median()


In [None]:
results = smf.ols('AVG ~ELO', data=attendance_valuation_elo_df).fit()


In [None]:
print(results.summary())


In [None]:
from sklearn.cluster import KMeans


In [None]:
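# Fix the seed so cluster labels (e.g. "cluster == 1" below) are reproducible across runs
k_means = KMeans(n_clusters=3, n_init=10, random_state=0)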


In [None]:
cluster_source = attendance_valuation_elo_df.loc[:,["AVG", "ELO", "VALUE_MILLIONS"]]


In [None]:
kmeans = k_means.fit(cluster_source)


In [None]:
attendance_valuation_elo_df['cluster'] = kmeans.labels_
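One caveat about the clustering above: K-means uses Euclidean distance on the raw values, so the feature with the largest numeric spread tends to dominate the result. A common alternative, which is my suggestion and not what this notebook does, is to standardise the features first. The sketch below shows that variant on a small invented frame. The numbers are illustrative, and `random_state` and `n_init` are assumptions chosen for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented values for six hypothetical teams (NOT from the dataset).
toy = pd.DataFrame({
    "AVG": [21000, 20500, 17000, 16500, 15000, 15500],
    "ELO": [1700, 1680, 1500, 1510, 1400, 1390],
    "VALUE_MILLIONS": [3000, 2800, 1200, 1100, 800, 850],
})

# Rescale each column to zero mean and unit variance so no single feature dominates.
scaled = StandardScaler().fit_transform(toy)

# Cluster on the standardised features instead of the raw ones.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
toy["cluster"] = labels
print(toy)
```

Comparing the labels from the scaled and unscaled fits would show how much the grouping depends on the units of each column.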


In [None]:
# lmplot returns a FacetGrid; the `size` argument was renamed to `height` in seaborn 0.9
grid = sns.lmplot(x="ELO", y="AVG", data=attendance_valuation_elo_df, hue="cluster", height=12, fit_reg=False)
grid.set(xlabel='ELO Score', ylabel='Average Attendance Per Game', title="NBA Team AVG Attendance vs ELO Ranking Clustered on ELO, AVG, VALUE_MILLIONS:  2016-2017 Season")

In [None]:
kmeans.__dict__


In [None]:
kmeans.cluster_centers_


In [None]:
cluster_1 = attendance_valuation_elo_df["cluster"] == 1


In [None]:
attendance_valuation_elo_df[cluster_1]
