# Chapter 8 Lab 3

## Goal
We are going to learn how to evaluate regression models using the regression metrics discussed in Section 8.2.3 of Chapter 8. We will use sklearn's LinearRegression to fit a model that predicts 'Kills' from the other columns.

## Preparation

Load the required packages below.

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sys
from io import StringIO
from math import sqrt

import warnings
warnings.filterwarnings('ignore')

## Step 1: Load the Data

We have covered these steps in earlier labs, so they are condensed into a single cell here to get to metric evaluation sooner.

In [6]:
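# Load the cleaned dataset and drop the player identifier, which is not a predictor
dota_df1 = pd.read_csv('DoTalicious_cleaned1000players.csv')
dota_df1.drop(['PlayerID'], axis=1, inplace=True)

# Strip stray whitespace from the column names
dota_df1 = dota_df1.rename(columns=lambda x: x.strip())

# Convert TotalTime to numeric; values that cannot be parsed become NaN
dota_df1['TotalTime'] = pd.to_numeric(dota_df1['TotalTime'], errors='coerce')

# Replace the ' SkillLevelNull' placeholder with '1', then cast the column to float
dota_df1.loc[dota_df1['SkillLevel'] ==' SkillLevelNull', 'SkillLevel'] = '1'
dota_df1['SkillLevel'] = dota_df1['SkillLevel'].astype('float')

# Drop the row with index label 517
dota_df1.drop([517], inplace=True)

# Summary statistics for every numeric column
dota_df1.describe()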

| | GamesPlayed | GamesWon | GamesLeft | Ditches | Points | SkillLevel | Kills | KillsPerMin | Deaths | Assists | CreepsKilled | CreepsDenied | NeutralsKilled | TowersDestroyed | RaxsDestroyed | TotalTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 | 867.0 |
| mean | 92.343714 | 50.369089 | 1.650519 | 0.711649 | 1014.840856 | 0.532872 | 608.6609 | 0.131799 | 545.094579 | 951.763552 | 7636.831603 | 677.627451 | 1259.800461 | 70.786621 | 29.740484 | 228340.0 |
| std | 205.574415 | 115.795471 | 3.23932 | 1.808982 | 119.954984 | 0.609597 | 1502.754591 | 0.065829 | 1204.140278 | 2230.8092 | 19801.02441 | 2029.775852 | 2969.501221 | 175.869846 | 74.661056 | 475106.2 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 626.837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1349.0 |
| 25% | 3.0 | 1.0 | 0.0 | 0.0 | 961.886 | 0.0 | 11.0 | 0.09 | 17.5 | 18.0 | 160.5 | 12.0 | 16.0 | 1.0 | 0.0 | 6270.0 |
| 50% | 15.0 | 7.0 | 0.0 | 0.0 | 996.646 | 0.0 | 74.0 | 0.13 | 93.0 | 125.0 | 983.0 | 76.0 | 146.0 | 7.0 | 3.0 | 35160.0 |
| 75% | 90.0 | 47.5 | 2.0 | 1.0 | 1046.975 | 1.0 | 514.5 | 0.17 | 554.0 | 865.0 | 6732.0 | 541.5 | 1130.5 | 62.0 | 26.0 | 223410.0 |
| max | 3156.0 | 1764.0 | 40.0 | 18.0 | 2010.24 | 3.0 | 23742.0 | 0.42 | 16988.0 | 34390.0 | 372360.0 | 43910.0 | 42900.0 | 2601.0 | 1141.0 | 4294920.0 |


Next, we'll split the data, first into predictors and target, and then into training and test sets.

In [9]:
y = dota_df1['Kills']
X = dota_df1.drop(['Kills'], axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
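
To make the roles of 'test_size' and 'random_state' concrete, here is a minimal sketch on made-up data. The arrays `X_demo` and `y_demo` and all the `Xa_`/`Xb_` names are hypothetical stand-ins, not part of the DoTalicious dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 predictor columns (not the lab's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# test_size=0.25 holds out a quarter of the rows for testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101
)
print(Xa_train.shape, Xa_test.shape)

# Repeating the call with the same random_state reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101
)
print(np.array_equal(Xa_test, Xb_test))
```

With 100 rows, the split leaves 75 rows for training and 25 for testing, and the second call returns an identical test set. Fixing 'random_state' is what makes the metrics in this lab reproducible from run to run.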

## Step 2: Build a regression model and evaluate its performance

The output below will, for various reasons, differ from that of the R labs. However, the conclusions are the same.

Also, there are two ways to calculate R-squared: by calling the 'score' method of a fitted LinearRegression model or by importing the 'r2_score' function from sklearn.metrics. We've shown both below.

Lastly, sklearn provides a 'mean_squared_error' function that you can import. To arrive at RMSE, take the square root of the MSE, as shown below.

In [32]:
linreg = LinearRegression().fit(X_train, y_train)

y_pred = linreg.predict(X_train)

r2_score_calc = linreg.score(X_train, y_train)
mse_score = mean_squared_error(y_train, y_pred)
rmse_score = sqrt(mse_score)
mae_score = mean_absolute_error(y_train, y_pred)
r2_met_score = r2_score(y_train, y_pred)

print("The RMSE score for the training data is: ", rmse_score)
print("The LinearRegression built-in R-squared score for the training data is: ", r2_score_calc)
print("The r2_score R-squared score for the training data is: ", r2_met_score)
print("The MAE score for the training data is: ", mae_score)

The RMSE score for the training data is:  135.20203705014563
The LinearRegression built-in R-squared score for the training data is:  0.9922409084856602
The r2_score R-squared score for the training data is:  0.9922409084856602
The MAE score for the training data is:  61.91692833234588


You can see that the metric is the same whether you use the 'score' method of LinearRegression or the separate sklearn 'r2_score' function.
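
To see what these metrics compute, the sketch below works them out by hand with NumPy on four made-up values and checks the results against the sklearn functions. The arrays `y_true` and `y_hat` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical toy values, not taken from the lab's dataset
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat                      # [1, 0, -1, -2]

mse_manual = np.mean(residuals ** 2)            # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)               # square root of the MSE
mae_manual = np.mean(np.abs(residuals))         # (1 + 0 + 1 + 2) / 4 = 1.0

ss_res = np.sum(residuals ** 2)                 # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20
r2_manual = 1 - ss_res / ss_tot                 # 1 - 6/20 = 0.7

# The hand-computed values agree with sklearn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

RMSE and MAE are both in the units of the target (here, kills), while R-squared is unitless. Because RMSE squares the residuals before averaging, large errors weigh more heavily in RMSE than in MAE.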

In [34]:
# Predict on the held-out test set and compute the same metrics as above
y_pred = linreg.predict(X_test)

r2_score_calc_test = linreg.score(X_test, y_test)
mse_score_test = mean_squared_error(y_test, y_pred)
rmse_score_test = sqrt(mse_score_test)
mae_score_test = mean_absolute_error(y_test, y_pred)
r2_met_score_test = r2_score(y_test, y_pred)

print("The RMSE score for the test data is: ", rmse_score_test)
print("The LinearRegression built-in R-squared score for the test data is: ", r2_score_calc_test)
print("The r2_score R-squared score for the test data is: ", r2_met_score_test)
print("The MAE score for the test data is: ", mae_score_test)


The RMSE score for the test data is:  195.99303217330583
The LinearRegression built-in R-squared score for the test data is:  0.980218201314834
The r2_score R-squared score for the test data is:  0.980218201314834
The MAE score for the test data is:  83.42655086959786


Performance on the test set is slightly worse than on the training set, which is to be expected. It is still very good performance, however.
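
One way to make the training/test comparison easier to read is to collect the metrics for both sets in a single table. The sketch below does this on synthetic data; the helper `regression_report` and the `X_syn`/`y_syn` arrays are hypothetical names introduced for illustration, not part of the lab's code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(model, X, y):
    """Return RMSE, MAE and R-squared for a fitted model on (X, y)."""
    pred = model.predict(X)
    return {
        "RMSE": np.sqrt(mean_squared_error(y, pred)),
        "MAE": mean_absolute_error(y, pred),
        "R2": r2_score(y, pred),
    }


# Synthetic stand-in data: a linear signal plus Gaussian noise
rng = np.random.default_rng(101)
X_syn = rng.normal(size=(200, 4))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=101
)
model = LinearRegression().fit(Xs_train, ys_train)

# One column per data split, one row per metric
report = pd.DataFrame({
    "train": regression_report(model, Xs_train, ys_train),
    "test": regression_report(model, Xs_test, ys_test),
})
print(report.round(3))
```

The same helper could be called with the lab's `linreg`, `X_train`, `y_train`, `X_test` and `y_test` to reproduce the numbers printed above in one table.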

At this point, you would normally proceed with either tuning parameters to improve performance or moving forward with the model. 

We will see how to tune parameters in the next lab.

## Conclusion

This lab has demonstrated how to use sklearn's LinearRegression model and then compute common performance metrics such as RMSE, R-squared, and mean absolute error (MAE).