## 1. Introduction

The National Basketball Association (NBA), composed of thirty professional basketball teams in the United States and Canada, is one of the most influential athletic leagues in North America. As of 2018, the NBA is the fifth largest sports league in the world with a total revenue of about \\$8 billion. Some of this tremendous value is distributed to NBA players in the form of inordinately large salaries. Most teams have annual payrolls of over \\$100 million, so it is crucial for teams to ensure that players are worth the price of their contracts.

We expect salaries to be driven by two main factors: on-court performance and "social power." Players who help their teams win more games and contend for championship titles should be paid well, so on-court performance data should capture a large portion of the variation in player salaries. The second factor, "social power," is a player’s ability to increase team revenues through ticket and merchandise sales. We hypothesize that players with large social media followings and wide cultural influence will earn larger salaries regardless of their team’s performance.

In [None]:
#Load required packages
library(alr4)
library(tidyverse)
library(olsrr)

## 2. Data and Approach

The data used in this analysis comes from the "Social Power NBA" dataset posted on the Kaggle website.
The data includes on-court performance statistics from a sample of 100 NBA players
during the 2016-2017 NBA season, as well as Twitter engagement metrics for each player.

In [None]:
#Read the data file into nba_data
nba_data <- read.csv("NBA Data.csv")
head(nba_data)

Looking at the scatterplot matrix, we can see several variables that will need transformation. The points in the
followers and free throw attempt rate (ftar) panels are clustered on the left-hand side of their respective plots.
We may need to transform these variables to achieve a linear mean function.

Some players also have zero Twitter followers, and the turnover rate (tov) for some players is zero. 
We will add 0.01 to Twitter followers and 0.001 to turnover rate for each player to ensure the values are strictly positive. 
This allows us to power transform these variables later if necessary.

In [None]:
#View scatterplot of the data
pairs(~ salary + age + netrtg + astpct + rebpct +
        usg + followers + tov + ftar + ws, data = nba_data)

#Add 0.01 to followers and 0.001 to tov
nba_data$followers <- nba_data$followers+0.01
nba_data$tov <- nba_data$tov+0.001
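
The need for the offset can be seen in a small sketch. The vector below is made-up illustration data, not values from the dataset: the log of zero is not finite, which would break a log or power transformation, while the shifted values stay finite.

```r
# Hypothetical follower counts (illustration only), including a zero
followers_raw <- c(0, 12, 350, 4800)

# log(0) is -Inf, so a log transformation fails on the raw values
log_raw <- log(followers_raw)

# Shifting by a small constant keeps every value strictly positive
followers_shifted <- followers_raw + 0.01
log_shifted <- log(followers_shifted)

print(is.finite(log_raw))      # not finite for the zero entry
print(is.finite(log_shifted))  # finite for every entry
```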

We used the multivariate Box-Cox method to simultaneously select the optimal transformations for each predictor. 
Rather than using the estimated powers from the Box-Cox transformation, we used the closest “standard” power transformation 
values. We also ran likelihood ratio tests of two null hypotheses: that no transformation is needed, and that every 
predictor should be log transformed. Both were rejected.

We proceed with a log transformation of age, assist percentage, and Twitter followers, a square
root transformation of win share, and an inverse square root transformation of free throw attempt
rate.

In [None]:
#Multivariate Box-Cox transformation of predictors
summary(trp <- powerTransform(cbind(age,astpct,rebpct,followers,usg,ftar,ws,tov)~1, data=nba_data))
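
Rounding an estimated power to the nearest "standard" value can be sketched as follows. The helper `nearest_standard_power`, the candidate set, and the example estimates are all assumptions made for illustration; they are not the output of `powerTransform` on this dataset.

```r
# Hypothetical helper: snap each estimated power to the closest value
# in a small set of conventional transformations (0 stands for the log).
nearest_standard_power <- function(lambda_hat,
                                   candidates = c(-1, -0.5, 0, 0.5, 1)) {
  sapply(lambda_hat, function(l) candidates[which.min(abs(candidates - l))])
}

# Made-up estimates, for illustration only
lambda_hat <- c(age = 0.12, ws = 0.46, ftar = -0.61)
rounded <- nearest_standard_power(lambda_hat)
print(rounded)
```

With these made-up inputs, the rounded powers correspond to a log, a square root, and an inverse square root transformation, respectively.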

To select the predictors for our initial regression model, we used backward stepwise variable selection with AIC as the 
information criterion. This selection method begins with a model containing all the potential predictors, then removes the 
variable with the least explanatory power according to the information criterion. This process is repeated, removing one 
variable at a time, until the criterion no longer improves.

In [None]:
#Model with all transformed predictors
init_model_1 <- lm(salary~pos+log(age)+netrtg+log(astpct)+rebpct+
                   usg+log(followers)+tov+I(ftar^(-1/2))+I(ws^(1/2)), data=nba_data)

#MODEL 1: Stepwise regression with transformed predictors and untransformed response
step(init_model_1, scope=~1, direction="backward")
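
The elimination loop that `step` performs can be written out by hand. The sketch below is only an illustration: it uses R's built-in `mtcars` data rather than the NBA data, the helper name `backward_aic` is hypothetical, and it relies on `drop1` to report the AIC of each candidate deletion.

```r
# Hypothetical helper: backward elimination by AIC, one term at a time
backward_aic <- function(fit) {
  repeat {
    # Stop when only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break
    # AIC of the current model ("<none>") and of each single-term deletion
    candidates <- drop1(fit)
    best <- rownames(candidates)[which.min(candidates$AIC)]
    # Stop when no deletion lowers the AIC
    if (best == "<none>") break
    fit <- update(fit, as.formula(paste(". ~ . -", best)))
  }
  fit
}

# Illustration on built-in data (not the NBA dataset)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
selected <- backward_aic(full)
print(formula(selected))
```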

## 3. Initial Model
The backward selection algorithm specified the model shown below. The model has an R-squared value of 0.5347. The two most 
significant predictors in the model are win share and Twitter followers, which provides evidence for our initial theory that 
player salary is influenced by both on-court performance and social influence.

In [None]:
summary(m1 <- lm(salary ~ log(age) + usg + log(followers) + tov + 
                    I(ws^(1/2)), data = nba_data))

We used a normal Q-Q plot and a residual histogram to check whether the residuals are approximately normally distributed. 
The residuals fall approximately on the normal line in the Q-Q plot, with some slight skew at the tail ends of the plot.

In [None]:
#Normal qqplot of residuals
ols_plot_resid_qq(m1)

The residual histogram is slightly right-skewed but approximates a normal distribution. These two plots do not reveal any 
obvious violations of our normality assumptions.

In [None]:
#Residual histogram
ols_plot_resid_hist(m1)
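
The same checks can be reproduced with base R graphics, and a Shapiro-Wilk test offers a numerical complement to the visual inspection. The test is an addition that is not part of the original analysis, and the data below are simulated for illustration.

```r
set.seed(1)
# Simulated data (illustration only): linear signal with normal errors
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
res <- residuals(fit)

# Base R counterparts of the Q-Q plot and residual histogram
qqnorm(res)
qqline(res)
hist(res, main = "Residual histogram", xlab = "Residual")

# Shapiro-Wilk test: a small p-value signals a departure from normality
sw <- shapiro.test(res)
print(sw$p.value)
```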

The residual vs. fitted value plot below raises some potential concerns. We suspect there
may be a non-constant variance issue, as the variance appears to be smaller at the tail ends of the
plot than in the middle. Thus, we consider the possibility of transforming the response variable to
achieve a constant variance.

In [None]:
#Plot residual vs. fitted values
ols_plot_resid_fit(m1)
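
A formal check can back up the visual impression from a residual plot. The sketch below hand-codes a studentized Breusch-Pagan (Koenker) score test in base R on simulated data. Both the helper name `bp_test` and the data are assumptions for illustration, and this test is not part of the original analysis.

```r
# Hypothetical helper: studentized Breusch-Pagan test against the fitted values.
# Regress squared residuals on fitted values; n * R^2 is approximately
# chi-squared with 1 degree of freedom under constant variance.
bp_test <- function(fit) {
  aux <- lm(residuals(fit)^2 ~ fitted(fit))
  stat <- length(residuals(fit)) * summary(aux)$r.squared
  c(statistic = stat, p_value = pchisq(stat, df = 1, lower.tail = FALSE))
}

set.seed(2)
# Simulated data (illustration only): error spread grows with x
x <- seq(1, 10, length.out = 300)
y <- 3 + 2 * x + rnorm(300, sd = x)
result <- bp_test(lm(y ~ x))
print(result)
```

A small p-value, as expected for this deliberately heteroscedastic simulation, would support considering a variance-stabilizing transformation of the response.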

## 4. Response Transformation
Next, we use an inverse response plot and a Box-Cox transformation plot to explore potential transformations of our response 
variable (salary). The plots will show us the optimal power transformation of the response. 

In [None]:
#Inverse response plot
inverseResponsePlot(m1)

#Box-Cox transformation plot
boxCox(m1)
summary(powerTransform(m1))

After inspecting these plots, a square root transformation of salary seems to be best to minimize RSS while also stabilizing variance. Because fitted values are squared when mapped back to the salary scale, the square root transformation also guarantees that predicted salaries cannot be negative.
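
How a Box-Cox profile can point to a square root transformation is illustrated below on simulated data. The sketch uses `boxcox` from the MASS package (swapped in for `boxCox` from car so that the example depends only on a package that ships with R) and data generated so that the square root of the response is linear in the predictor. None of the numbers relate to the NBA data.

```r
library(MASS)

set.seed(3)
# Simulated data (illustration only): sqrt(y) is linear in x
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- (5 + sim$x + rnorm(200, sd = 0.5))^2

# y = TRUE stores the response in the fit, which boxcox needs
fit <- lm(y ~ x, data = sim, y = TRUE)

# Profile log-likelihood over a grid of powers; plotit = FALSE returns the grid
profile <- boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda_hat <- profile$x[which.max(profile$y)]
print(lambda_hat)
```

The power that maximizes the profile log-likelihood should land near 0.5 for data built this way, which is the pattern that motivates a square root transformation.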

## 5. Final Model
Next, we fit a regression model of our transformed salary value on all of the initial predictor values. We again used backward selection to automatically select the best predictors for the model.

In [None]:
#Model with transformed predictors and Salary^(1/2)
init_model_2 <- lm(I(salary^(1/2))~pos+log(age)+netrtg+log(astpct)+rebpct+
                     usg+log(followers)+tov+I(ftar^(-1/2))+I(ws^(1/2)), data=nba_data)

In [None]:
#Backward selection algorithm for new model
step(init_model_2, scope=~1, direction="backward")

The backward selection algorithm specified the final model shown below. The adjusted R-squared value is 0.5593, about 2.5 percentage points above the 0.5347 reported for the initial model. The coefficients on age, usage rate, followers, and win share are statistically significant, and all predictors have their expected signs.

In [None]:
#MODEL 2: Model with transformed predictors and response
m2 <- lm(I(salary^(1/2)) ~ log(age) + netrtg + usg + log(followers) + 
        tov + I(ws^(1/2)), data = nba_data)
summary(m2)

Looking at the residual histogram below, there are no major violations of our normality assumptions. However, there are some potential outliers on the left-hand side of the plot.

In [None]:
#Residual histogram
ols_plot_resid_hist(m2)

The residual vs. fitted values plot below does appear to have more constant variance than the initial model, but we are still concerned about the possibility of outliers.

In [None]:
#Plot residual vs. fitted values
ols_plot_resid_fit(m2)

## 6. Outlier Analysis
We use a Cook's distance bar plot to identify potential outliers, that is, observations with an unusually large influence on the fitted model.

In [None]:
#Cook's distance plot
ols_plot_cooksd_bar(m2)
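
Cook's distance can also be computed directly with base R. The sketch below uses simulated data with one planted influential point and flags observations above 4/n, a common rule of thumb. The threshold and the data are assumptions for illustration and are not taken from the analysis above.

```r
set.seed(4)
# Simulated data (illustration only)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.2)
# Plant one point with high leverage and a large residual
x[n] <- 5
y[n] <- 0
fit <- lm(y ~ x)

# Cook's distance for every observation, flagged against the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
print(flagged)
```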

The biggest outliers appear to be observations 17, 38, and 69. As shown below, these correspond to Rudy Gobert, Edy Tavares, and Brook Lopez, respectively.

In [None]:
#Read the data file into nba_data_1 (untransformed data)
nba_data_1 <- read.csv("NBA Data.csv")

#Add predicted salary from final model to nba_data_1
nba_data_1$fitted_salary <- (fitted(m2))^2

#Identify and print outliers
outliers <- c(17, 38, 69)
variables <- c("player", "age", "usg", "followers", "ws", "salary", "fitted_salary")
print(nba_data_1[outliers, variables])

These outliers give us an indication of the types of observations the regression model is unlikely to predict accurately.

Rudy Gobert was in the final year of his 4-year rookie contract with the Utah Jazz during the 2016-2017 season. Gobert was fairly young at 25, and had a high win share value of 14.3. This led the regression model to predict a salary of \\$10.22M, while he earned only \\$2.1M in the 2016-2017 season. After the season, Gobert signed a 4-year contract with the Utah Jazz for \\$25.5M per year. Gobert represents overperforming young players still in their rookie contracts.

Edy Tavares played a single NBA game in 2016-2017 for the Cleveland Cavaliers, and had a below-average win share, usage rate, and number of Twitter followers. The model predicts a salary of \\$40,000, while his actual salary was \\$1M. Tavares entered the league in 2015-2016, giving him one year of experience entering the season. The league minimum salary in 2016-2017 for players with one year of experience was \\$874,636, meaning the Cavaliers were forced to pay Tavares much more than his performance warranted. Tavares is a case of an underperforming player benefiting from a high league minimum salary.

Brook Lopez was 29 years old during the 2016-2017 season with a high usage rate of 29% and an average win share value of 4.9. However, Lopez did not have a Twitter account during the season, so the model was unable to accurately predict his salary. The model predicted a salary of about \\$7.5M, compared to his actual salary of \\$21.6M. Lopez represents productive players without a Twitter account or wide social media influence.

## 7. Conclusion
We see that both game performance and social media following impact NBA player compensation. The final model explains a little over half of the variation in square-root salary (adjusted R-squared of 0.5593), but the outlier analysis reveals some limitations that could be addressed in future work. Specifically, we might consider including a dummy variable to indicate whether a player has a Twitter account, as well as taking the league minimum salary into consideration.
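
Both suggestions can be sketched in a few lines. The data frame, the column names `has_twitter` and `floored_salary`, and the use of a single minimum for every player are assumptions for illustration only, since the actual league minimum depends on years of experience. Treating a zero follower count as "no account" is likewise an assumption.

```r
# Made-up rows for illustration; followers uses the 0.01 offset added earlier
players <- data.frame(
  followers = c(0.01, 25000.01, 1200000.01),
  fitted_salary = c(7.5e6, 4.0e4, 1.2e7)
)

# Indicator for having a Twitter account (assumed: follower count above the offset)
players$has_twitter <- as.integer(players$followers > 0.01)

# Floor predictions at a league minimum; 874636 is the one-year-experience
# minimum for 2016-2017 quoted above, applied to every row for simplicity
league_min <- 874636
players$floored_salary <- pmax(players$fitted_salary, league_min)
print(players)
```

In a refit, `has_twitter` would enter the model formula as an additional predictor, while the floor would be applied to the back-transformed predictions.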

An unfortunate consequence of our model is that transforming the response variable makes it difficult to interpret the magnitude of the coefficient estimates. A future model may attempt to leave the response untransformed to give us more information about the effects of the predictors on the response.
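
Some interpretability can be recovered without refitting. With a square root response, the predicted salary is the square of the linear predictor, so for a predictor that enters the model untransformed, the change in predicted salary per unit change in that predictor is twice the linear predictor times the predictor's coefficient. The sketch below checks this identity on simulated data; all names and numbers are illustrative assumptions.

```r
set.seed(5)
# Simulated data (illustration only): sqrt(y) is linear in x
x <- runif(100, 1, 10)
y <- (20 + 3 * x + rnorm(100))^2
fit <- lm(sqrt(y) ~ x)

# Predicted response on the original scale: square of the linear predictor
predict_original <- function(x_new) {
  predict(fit, newdata = data.frame(x = x_new))^2
}

# Analytic marginal effect at x0: 2 * (linear predictor) * (slope)
x0 <- 5
eta0 <- predict(fit, newdata = data.frame(x = x0))
effect_analytic <- unname(2 * eta0 * coef(fit)["x"])

# Numerical check with a central finite difference
h <- 1e-4
effect_numeric <- unname(
  (predict_original(x0 + h) - predict_original(x0 - h)) / (2 * h)
)
print(c(analytic = effect_analytic, numeric = effect_numeric))
```

Because the effect depends on the linear predictor, it varies from player to player, which is why a single coefficient cannot be read as a fixed dollar amount.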