# National Basketball Association (NBA) Dataset


# Loading the data

In [62]:
NBA = read.csv("C:/Users/Ravi/Downloads/Downloads/NBA_train.csv")

In [None]:
#Now let’s look at the structure and summary of the data
str(NBA)

In [None]:
summary(NBA)

The goal of a basketball team is to make the playoffs.

In [None]:
# How many wins to make the playoffs?
table(NBA$W, NBA$Playoffs)
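
The raw table can be hard to scan. As a minimal sketch, the share of teams making the playoffs at each win total can be computed with `tapply`. The `toy` data frame below is made up for illustration and is not the NBA data; it only assumes a 0/1 `Playoffs` column like the one used in the table.

```r
# Hypothetical toy data: season wins and a 0/1 playoff flag (not the real NBA data)
toy <- data.frame(
  W        = c(30, 35, 35, 40, 40, 42, 42, 45, 50, 50),
  Playoffs = c(0,  0,  0,  0,  1,  1,  1,  1,  1,  1)
)

# Share of teams that made the playoffs at each win total
playoff_rate <- tapply(toy$Playoffs, toy$W, mean)
print(playoff_rate)

# Smallest win total at which every team in the toy data made the playoffs
min_safe_wins <- min(as.numeric(names(playoff_rate)[playoff_rate == 1]))
print(min_safe_wins)  # 42
```

With the real data, replacing `toy` by `NBA` should give the same kind of summary, assuming `Playoffs` is coded as 0/1.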
 

### Adding Additional Variables into the dataset


Now we will add a variable that is the difference between points scored and points allowed.



In [None]:
# Compute Points Difference
NBA$PTSdiff = NBA$PTS - NBA$oppPTS

# Check for linear relationship
plot(NBA$PTSdiff, NBA$W, xlab="Points Difference",ylab="Wins",col="blue",main="PTSdiff vs Wins")

The plot shows a very strong linear relationship between these two variables, so linear regression should be a good way to predict how many wins a team will have given its point difference.

In [None]:
# Linear regression model for wins
WinsReg = lm(W ~ PTSdiff, data=NBA)
summary(WinsReg)

From the summary we can see that the model’s terms are highly significant, and the R-squared of 0.9423 is very high.

### Regression Equation                      
                  
*W* = 41 + 0.0326 × *PTSdiff*

We saw earlier in the table that a team needs to win about 42 games or more to have a good chance of making the playoffs. What does this mean in terms of point difference? For the predicted wins to be greater than or equal to 42, PTSdiff must be greater than or equal to (42 - 41) / 0.0326, which is about 30.67. So a team needs to score at least 31 more points than it allows in order to win at least 42 games.
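
The threshold calculation can be sketched in a few lines of R. The intercept and slope are the rounded values from the regression equation; the variable names are illustrative only.

```r
# Rounded coefficients from the regression equation W = 41 + 0.0326 * PTSdiff
intercept   <- 41
slope       <- 0.0326
target_wins <- 42

# Solve target_wins = intercept + slope * PTSdiff for PTSdiff
required_diff <- (target_wins - intercept) / slope
print(round(required_diff, 2))  # 30.67

# Round up, since PTSdiff must be at least the threshold
print(ceiling(required_diff))   # 31
```

With the fitted model in hand, `coef(WinsReg)` would supply the unrounded intercept and slope, so the result may differ slightly in the decimals.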

### Linear Regression Model For Points Scored
Now let’s build an equation to predict points scored using some common basketball statistics.

The variables used from the dataset are: X2PA for two-point attempts, X3PA for three-point attempts, FTA for free throw attempts, AST for assists, ORB for offensive rebounds, DRB for defensive rebounds, TOV for turnovers, STL for steals and BLK for blocks.

In [None]:
# Linear regression model for points scored
PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data=NBA)
summary(PointsReg)

If we take a look at the summary, we can see that some of our variables are very significant and others less so. For example, steals has only one significance star, and defensive rebounds, turnovers, and blocks don’t seem to be significant at all. We do have a pretty good R-squared value, 0.8992, so these basketball statistics together explain about 90% of the variation in points scored.

### Sum of Squared Errors

In [None]:
PointsReg$residuals
SSE = sum(PointsReg$residuals^2)
print(SSE)

The SSE here is very large: 28,394,314. On its own, the sum of squared errors is not a very interpretable quantity, because it is in squared units and grows with the number of observations.

### Root Mean Squared Error
Root Mean Squared Error is much more interpretable. It’s more like the average error we make in our predictions, in the same units as points. Let’s calculate it here.

In [None]:
RMSE = sqrt(SSE/nrow(NBA))
print(RMSE)

In [None]:
# Average number of points in a season
print(mean(NBA$PTS))

RMSE in our case is 184.4, which is much easier to interpret: it is small relative to the average number of points a team scores in a season.
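
To make the relationship between SSE, RMSE, and the mean concrete, here is a small sketch on made-up numbers. The helper names `sse` and `rmse` and the toy vectors are assumptions for illustration, not part of the analysis.

```r
# Hypothetical helper functions (names are illustrative)
sse  <- function(actual, predicted) sum((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy season point totals, not taken from the NBA data
actual    <- c(8000, 8200, 8400, 8600)
predicted <- c(8100, 8100, 8500, 8500)

print(sse(actual, predicted))   # 40000
print(rmse(actual, predicted))  # 100

# RMSE as a fraction of the average, a rough measure of relative error
print(rmse(actual, predicted) / mean(actual))
```

Each toy prediction is off by exactly 100 points, so the RMSE of 100 matches the typical error, while the SSE of 40000 says little by itself.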

### Removing Insignificant Variables
Now let’s remove some of the insignificant variables, one at a time.

If we look at the summary of our previous model, we can see that turnovers is not significant: its p-value, 0.6859, is very high, so we can remove it from the model.
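
Rather than reading p-values off the printed summary, they can be pulled out of the coefficient table. The sketch below uses R’s built-in `mtcars` data as a stand-in so that it runs without the NBA file; the model and variable names are illustrative only.

```r
# Stand-in model on R's built-in mtcars data (not the NBA data)
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))

# p-values for the predictors only (drop the intercept row)
p_values <- coef_table[-1, "Pr(>|t|)"]
print(sort(p_values, decreasing = TRUE))

# Candidate for removal: the predictor with the largest p-value
weakest <- names(which.max(p_values))
print(weakest)
```

The same pattern applied to `PointsReg` should list turnovers among the largest p-values, consistent with the summary output.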

In [None]:
summary(PointsReg)

In [None]:
PointsReg2 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data=NBA)
summary(PointsReg2)

In our first regression model, PointsReg, we had an R-squared of 0.8992. The R-squared of PointsReg2 is 0.8991, so the two are almost identical.

So there is essentially no loss in removing the turnover variable.

Next we remove defensive rebounds, which has the highest p-value after turnovers.

In [None]:
PointsReg3 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data=NBA)
summary(PointsReg3)

Let’s look at the summary again to see if the R-squared has changed. It is the same, 0.8991, so removing defensive rebounds is also justified.

Next we remove blocks, which has the highest p-value after defensive rebounds.

In [None]:
PointsReg4 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
summary(PointsReg4)

After this step the R-squared value stayed essentially the same, but the model is much simpler than our first model because we removed all the insignificant variables.

Now let’s compute the SSE and RMSE for the new model.
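
A compact way to check what dropping a variable costs is to put R-squared and adjusted R-squared for both models side by side. This is a sketch on the built-in `mtcars` data as a stand-in, and the choice of `drat` as the dropped variable is arbitrary; `update()` refits the model with one term removed.

```r
# Stand-in full model on built-in data (not the NBA data)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Refit without one predictor; "drat" is an arbitrary choice for illustration
reduced <- update(full, . ~ . - drat)

# Collect fit statistics for both models
comparison <- data.frame(
  model         = c("full", "reduced"),
  predictors    = c(length(coef(full)) - 1, length(coef(reduced)) - 1),
  r_squared     = c(summary(full)$r.squared, summary(reduced)$r.squared),
  adj_r_squared = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
)
print(comparison)
```

Plain R-squared can never increase when a predictor is dropped, so a near-zero decrease, as seen between PointsReg and PointsReg4, suggests the removed variables added little.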

In [None]:
# Compute SSE and RMSE for new model
SSE_4 = sum(PointsReg4$residuals^2)
RMSE_4 = sqrt(SSE_4/nrow(NBA))
print(SSE_4)

In [None]:
print(RMSE_4)

If we look at the SSE and RMSE of our new model, we can see that both increased slightly. We don’t mind that, because the model is much simpler and therefore less prone to overfitting.

### Load Test Data
Let’s load the NBA_test data, on which we will apply our model to see how well it predicts points scored.

In [None]:
NBA_test = read.csv("C:/Users/Ravi/Downloads/Downloads/NBA_test.csv")

Make predictions on the test set by applying the regression model built previously.



In [None]:
PointsPredictions = predict(PointsReg4, newdata=NBA_test)

The real test of our model’s accuracy is on a new data set. So let’s compute the SSE, the out-of-sample R-squared, and the RMSE of our model on the test data set.

In [None]:
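# Out-of-sample SSE: squared prediction errors on the test set
SSE = sum((PointsPredictions - NBA_test$PTS)^2)
# Baseline SST: errors from predicting the training-set mean for every test row
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
# Out-of-sample R-squared
R2 = 1 - SSE/SST
R2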

### Compute the RMSE

In [None]:
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE

We see that we have an R-squared value of 0.8127 and a root mean squared error of 196.37 on the test set. The RMSE is a little higher than before, but it’s not too bad: the mean of points is 8370, and we’re making an average error of about 196 points.
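
The out-of-sample R-squared used here compares the model against a baseline that always predicts the training-set mean. A minimal sketch with made-up numbers follows; the function name `oos_r2` and the toy values are assumptions for illustration.

```r
# Hypothetical helper: out-of-sample R-squared against a training-mean baseline
oos_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # model errors on the test set
  sst <- sum((actual - train_mean)^2)  # baseline errors on the test set
  1 - sse / sst
}

# Toy values, not taken from the NBA data
train_mean  <- 8300
test_actual <- c(8000, 8200, 8400, 8600)
test_pred   <- c(8100, 8100, 8500, 8500)

r2 <- oos_r2(test_actual, test_pred, train_mean)
print(r2)  # 0.8
```

Unlike the in-sample R-squared, this quantity can be negative when the model does worse on new data than the training-mean baseline.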