Exploration of How Social Media Can Predict Winning Metrics Better Than Salary

Scott Virshup's contributions:
* Wrote documentation for each step's functions. A readable notebook is a good notebook.
* Added visualizations of the Twitter data
* K-means clustering on Wikipedia page views, player salary, and wins

Import libraries, clean the data, and join datasets where applicable

In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
color = sns.color_palette()
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline

In [None]:
attendance_valuation_elo_df = pd.read_csv("../input/nba_2017_att_val_elo.csv");attendance_valuation_elo_df.head()

In [None]:
salary_df = pd.read_csv("../input/nba_2017_salary.csv");salary_df.head()


In [None]:
pie_df = pd.read_csv("../input/nba_2017_pie.csv");pie_df.head()

In [None]:
plus_minus_df = pd.read_csv("../input/nba_2017_real_plus_minus.csv");plus_minus_df.head()

In [None]:
br_stats_df = pd.read_csv("../input/nba_2017_br.csv");br_stats_df.head()

In [None]:

# Standardize column names so this table can be merged on PLAYER
plus_minus_df.rename(columns={"NAME":"PLAYER", "WINS": "WINS_RPM"}, inplace=True)
# Keep only the text before the comma in each name entry
plus_minus_df["PLAYER"] = plus_minus_df["PLAYER"].str.split(",").str[0]
plus_minus_df.head()

In [None]:

nba_players_df = br_stats_df.copy()
nba_players_df.rename(columns={'Player': 'PLAYER','Pos':'POSITION', 'Tm': "TEAM", 'Age': 'AGE', "PS/G": "POINTS"}, inplace=True)
nba_players_df.drop(["G", "GS", "TEAM"], inplace=True, axis=1)
nba_players_df = nba_players_df.merge(plus_minus_df, how="inner", on="PLAYER")
nba_players_df.head()

In [None]:

pie_df_subset = pie_df[["PLAYER", "PIE", "PACE", "W"]].copy()
nba_players_df = nba_players_df.merge(pie_df_subset, how="inner", on="PLAYER")
nba_players_df.head()

In [None]:
salary_df.rename(columns={'NAME': 'PLAYER'}, inplace=True)
salary_df["SALARY_MILLIONS"] = round(salary_df["SALARY"]/1000000, 2)
salary_df.drop(["POSITION","TEAM", "SALARY"], inplace=True, axis=1)
salary_df.head()

**Let's start analyzing the data now**

In [None]:
diff = list(set(nba_players_df["PLAYER"].values.tolist()) - set(salary_df["PLAYER"].values.tolist()))

In [None]:
len(diff)


In [None]:
# Merge the two dataframes (pandas joins on the columns they share)
nba_players_with_salary_df = nba_players_df.merge(salary_df)
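
The set difference computed earlier counts player names that have no match in the salary table. A complementary check, sketched here on made-up rows rather than the notebook's data, is an outer merge with `indicator=True`, which labels every row as matched or as present in only one table. The toy frames and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for nba_players_df and salary_df (values are made up)
players = pd.DataFrame({"PLAYER": ["A. Alpha", "B. Beta", "C. Gamma"],
                        "POINTS": [25.1, 12.3, 8.4]})
salaries = pd.DataFrame({"PLAYER": ["A. Alpha", "C. Gamma", "D. Delta"],
                         "SALARY_MILLIONS": [30.0, 2.5, 11.0]})

# Outer merge keeps every row; the _merge column records where it came from
audit = players.merge(salaries, on="PLAYER", how="outer", indicator=True)

# Players with stats but no salary row (these vanish in a default inner merge)
missing_salary = audit.loc[audit["_merge"] == "left_only", "PLAYER"].tolist()
print(missing_salary)
print(audit["_merge"].value_counts())
```

Rows tagged `left_only` are the ones a default inner merge silently drops, so this is a quick way to see how many players are lost at the join.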

Create a heatmap of NBA player correlations. The result is such a crowded (and small) visual that it is difficult to take in. While the correlations may be relevant, a general mapping like this at the outset simply serves to give a high-level overview.
Some interesting findings:
* DRPM seems to have consistently low correlations with many variables
* 2-point attempts and 2-pointers made are highly correlated, as intuition suggests
* 3-point attempts do not have a large correlation with FG %

In [None]:

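# Create one figure and axes, then draw the heatmap on those axes
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title("NBA Player Correlation Heatmap:  2016-2017 Season (STATS & SALARY)")
# Correlate numeric columns only (text columns such as names cannot be correlated)
corr = nba_players_with_salary_df.select_dtypes("number").corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax=ax)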
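
Because a full correlation heatmap is hard to read, one alternative is to rank the correlations against a single column of interest. The sketch below uses synthetic data, and `top_correlations` is a hypothetical helper, not something from the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlations(df, target, n=5):
    """Return the n columns most correlated (by absolute value) with target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Synthetic stand-in data (not the NBA dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "TARGET": x,
    "STRONG_POS": x + rng.normal(scale=0.1, size=200),
    "STRONG_NEG": -x + rng.normal(scale=0.5, size=200),
    "NOISE": rng.normal(size=200),
})

ranked = top_correlations(demo, "TARGET", n=3)
print(ranked)
```

On the real table, a call such as `top_correlations(nba_players_with_salary_df, "SALARY_MILLIONS")` would list the statistics that move most closely with salary, assuming that column is present.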

Focusing on two variables allows a more precise understanding of their relationship.
SALARY_MILLIONS and WINS_RPM show a strong positive correlation.

In [None]:
sns.lmplot(x="SALARY_MILLIONS", y="WINS_RPM", data=nba_players_with_salary_df)


What follows are **OLS regression outputs**

Dependent: W (wins)
Independent: POINTS
* POINTS is statistically significant and positively associated with wins

Dependent: W (wins)
Independent: WINS_RPM
* WINS_RPM is statistically significant and positively associated with wins
* This result should be interpreted with caution, because the formula for WINS_RPM likely involves wins

Dependent: SALARY_MILLIONS
Independent: POINTS
* POINTS is statistically significant and positively associated with salary

Dependent: SALARY_MILLIONS
Independent: WINS_RPM
* WINS_RPM is statistically significant and positively associated with salary

In [None]:
results = smf.ols('W ~POINTS', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())


In [None]:
results = smf.ols('W ~WINS_RPM', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())


In [None]:
results = smf.ols('SALARY_MILLIONS ~POINTS', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())


In [None]:
results = smf.ols('SALARY_MILLIONS ~WINS_RPM', data=nba_players_with_salary_df).fit()


In [None]:
print(results.summary())
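
The four fits above each print a long summary. To compare them side by side, the slope, p-value, and R-squared can be pulled from each fitted result into one small table. This is a minimal sketch on synthetic data, so the printed numbers are not the notebook's results. The column names mirror the merged player table, and everything else is an assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in with the same column names as the merged player table
rng = np.random.default_rng(42)
n = 150
points = rng.uniform(2, 30, size=n)
demo = pd.DataFrame({
    "POINTS": points,
    "WINS_RPM": 0.4 * points + rng.normal(scale=2.0, size=n),
    "W": 30 + 0.5 * points + rng.normal(scale=5.0, size=n),
    "SALARY_MILLIONS": 1 + 0.6 * points + rng.normal(scale=4.0, size=n),
})

formulas = ["W ~ POINTS", "W ~ WINS_RPM",
            "SALARY_MILLIONS ~ POINTS", "SALARY_MILLIONS ~ WINS_RPM"]

rows = []
for formula in formulas:
    fit = smf.ols(formula, data=demo).fit()
    predictor = formula.split("~")[1].strip()
    # Coefficient, its p-value, and overall fit for the single predictor
    rows.append({"formula": formula,
                 "slope": fit.params[predictor],
                 "p_value": fit.pvalues[predictor],
                 "r_squared": fit.rsquared})

summary_table = pd.DataFrame(rows)
print(summary_table.round(4))
```

A positive slope with a small p-value corresponds to the "statistically significant and positively associated" wording used in the bullet points.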


In [None]:
from ggplot import *


This creates a ggplot scatter plot that encodes three dimensions. The x and y axes are points per game and WINS_RPM, respectively. The third dimension is color, which shows salary in millions.

The main takeaway is that, while there appears to be a positive relationship between points and WINS_RPM, the variance of WINS_RPM increases as points increase. The color dimension, which is harder to read, also appears to be positively correlated with both variables, though few players fall at the largest values of each variable.

In [None]:

p = ggplot(nba_players_with_salary_df,aes(x="POINTS", y="WINS_RPM", color="SALARY_MILLIONS")) + geom_point(size=200)
p + xlab("POINTS/GAME") + ylab("WINS/RPM") + ggtitle("NBA Players 2016-2017:  POINTS/GAME, WINS REAL PLUS MINUS and SALARY")
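
The `ggplot` package used above is a separate third-party port and may not install or import cleanly in newer environments. As a fallback, the same three-dimensional view can be sketched with matplotlib alone, which the notebook already imports. The data below is synthetic and only mimics the shape described in the text (spread growing with points); it is not the NBA dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for nba_players_with_salary_df (values are made up)
rng = np.random.default_rng(1)
points = rng.uniform(2, 30, size=100)
demo = pd.DataFrame({
    "POINTS": points,
    # Noise scale grows with points, so the spread widens to the right
    "WINS_RPM": 0.4 * points + rng.normal(scale=0.1 * points),
    "SALARY_MILLIONS": np.clip(0.8 * points + rng.normal(scale=4, size=100), 0.5, None),
})

fig, ax = plt.subplots(figsize=(10, 7))
# Color encodes the third variable; the colorbar acts as its legend
sc = ax.scatter(demo["POINTS"], demo["WINS_RPM"],
                c=demo["SALARY_MILLIONS"], cmap="viridis", s=60)
fig.colorbar(sc, ax=ax, label="SALARY_MILLIONS")
ax.set_xlabel("POINTS/GAME")
ax.set_ylabel("WINS_RPM")
ax.set_title("POINTS/GAME vs WINS_RPM, colored by salary (synthetic data)")
```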

Start creating new datasets with the wikipedia pageview data and the twitter data

In [None]:
wiki_df = pd.read_csv("../input/nba_2017_player_wikipedia.csv");wiki_df.head()


In [None]:
wiki_df.rename(columns={'names': 'PLAYER', "pageviews": "PAGEVIEWS"}, inplace=True)


In [None]:
# Median of the numeric columns per player (text columns are skipped)
median_wiki_df = wiki_df.groupby("PLAYER").median(numeric_only=True)


In [None]:

median_wiki_df_small = median_wiki_df[["PAGEVIEWS"]]

In [None]:
median_wiki_df_small.head()

In [None]:
median_wiki_df_small = median_wiki_df_small.reset_index()


In [None]:
median_wiki_df_small.head()

In [None]:
nba_players_with_salary_wiki_df = nba_players_with_salary_df.merge(median_wiki_df_small)


In [None]:
twitter_df = pd.read_csv("../input/nba_2017_twitter_players.csv");twitter_df.head()


In [None]:
nba_players_with_salary_wiki_twitter_df = nba_players_with_salary_wiki_df.merge(twitter_df)


In [None]:
nba_players_with_salary_wiki_twitter_df.head()

In [None]:

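# Create one figure and axes, then draw the heatmap on those axes
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title("NBA Player Correlation Heatmap:  2016-2017 Season (STATS & SALARY & TWITTER & WIKIPEDIA)")
# Correlate numeric columns only (text columns such as names cannot be correlated)
corr = nba_players_with_salary_wiki_twitter_df.select_dtypes("number").corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax=ax)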

Everything that follows was added by me, Scott Virshup.

In [None]:
# Create "TWITTER_POSITIVE_MARK", a combined engagement score:
# favorites plus retweets, with retweets weighted double.
nba_players_with_salary_wiki_twitter_df['TWITTER_POSITIVE_MARK'] = round(nba_players_with_salary_wiki_twitter_df['TWITTER_FAVORITE_COUNT'] + 2*(nba_players_with_salary_wiki_twitter_df['TWITTER_RETWEET_COUNT']))

In [None]:
# Scatter plot of salary and twitter positive marks
sns.lmplot(x="SALARY_MILLIONS", y="TWITTER_POSITIVE_MARK", data=nba_players_with_salary_wiki_twitter_df)

A classic twitter metric is "the ratio" - the ratio of replies to favorites (https://fivethirtyeight.com/features/the-worst-tweeter-in-politics-isnt-trump/). Unfortunately, we cannot use this metric in this examination of the data because the twitter data does not include replies. The ratio would potentially lend some insight into the relative popularity on twitter. Some other helpful twitter metrics to include in future data collection would be:
* replies
* followers
* tweets
* tweets per time period average
* verified or not
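
If reply counts were collected in the future, "the ratio" could be computed as sketched below. `TWITTER_REPLY_COUNT` is a hypothetical column that does not exist in the current dataset, and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; TWITTER_REPLY_COUNT is a hypothetical column not in the dataset
tweets = pd.DataFrame({
    "PLAYER": ["A. Alpha", "B. Beta", "C. Gamma"],
    "TWITTER_FAVORITE_COUNT": [1000, 50, 0],
    "TWITTER_REPLY_COUNT": [100, 200, 5],
})

# Replies per favorite; zero favorites becomes NaN instead of dividing by zero
favorites = tweets["TWITTER_FAVORITE_COUNT"].replace(0, np.nan)
tweets["RATIO"] = tweets["TWITTER_REPLY_COUNT"] / favorites
print(tweets)
```

A ratio well above 1 means far more replies than favorites, which is the pattern the linked article associates with poorly received tweets.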

In [None]:
nba_players_with_salary_wiki_twitter_df.head()

Let's do some clustering

In [None]:
# Number of clusters
k_means = KMeans(n_clusters=3)

# Choose the columns that the clusters will be based upon
cluster_source = nba_players_with_salary_wiki_twitter_df.loc[:,["SALARY_MILLIONS", "W", "PAGEVIEWS"]]

# Create the clusters
kmeans = k_means.fit(cluster_source)

# Create a column, 'cluster,' denoting the cluster classification of each row
nba_players_with_salary_wiki_twitter_df['cluster'] = kmeans.labels_

# Create a scatter plot with colors based on the cluster
# (lmplot returns a FacetGrid; `height` replaces the older `size` argument)
grid = sns.lmplot(x="PAGEVIEWS", y="SALARY_MILLIONS", data=nba_players_with_salary_wiki_twitter_df,hue="cluster", height=12, fit_reg=False)
grid.set(xlabel='Wikipedia Pageviews', ylabel='Salary in millions', title="NBA player Wikipedia pageviews vs Salary in millions clustered on SALARY_MILLIONS, W, PAGEVIEWS:  2016-2017 Season")


In the above scatter plot, the clusters are separated by very distinct Wikipedia page-view levels. Cluster 2 has extremely high page views compared to the others. There also appears to be a relationship between pageviews and salary: the lowest-pageview cluster is concentrated at lower salaries than the other two clusters. Cluster 2, the highest-pageview cluster, contains too few players to support general conclusions, however.
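
One likely reason the clusters split almost entirely along page views is that k-means relies on Euclidean distance, so the column with the largest numeric range dominates. A common remedy, sketched here on synthetic data rather than the notebook's table, is to standardize the columns before clustering. The use of `StandardScaler` and a fixed `random_state` are additions for illustration, not part of the original analysis. Note also that cluster numbers such as "cluster 2" are arbitrary labels and can change between runs unless the seed is fixed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: pageviews span thousands, salary and wins span tens
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "SALARY_MILLIONS": rng.uniform(1, 30, size=90),
    "W": rng.integers(20, 68, size=90),
    "PAGEVIEWS": rng.lognormal(mean=7, sigma=1, size=90),
})

# Standardize each column to mean 0 and unit variance before clustering
scaled = StandardScaler().fit_transform(demo)

# A fixed random_state makes the cluster labels reproducible between runs
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
demo["cluster"] = model.labels_

# Per-cluster means in the original units, for interpretation
print(demo.groupby("cluster").mean().round(1))
```

With scaled inputs, salary and wins contribute to the cluster assignment on the same footing as page views, which may produce groups that differ from the ones plotted above.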