# Unit 1 Assignment

In [None]:
#Shreya Reddy Vurelly
#Krishnasai Chaluvadi

In this assignment, we will focus on sports analytics. This data set is made available by http://www.baseball-reference.com. It contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set. Each row represents a single game. The goal is to predict the attendance at a home team’s game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons including profits.

## Description of Variables

The description of variables are provided in "Baseball - Data Dictionary.docx"

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code is also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

In [2]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(999)

In [3]:
#We will predict the "attendance_binary" value in the data set:

baseball = pd.read_csv("baseball.csv")
baseball.head()

Unnamed: 0,attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333,1
1,0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
2,0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333,0
3,0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333,1
4,0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333,1


In [4]:
# let's split data into train and test sets
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(baseball, test_size=0.3)

In [5]:
# check missing values
train_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [6]:
test_set.isna().sum()

attendance_binary            0
previous_attendance          0
previous_away_team_errors    0
previous_away_team_hits      0
previous_away_team_runs      0
game_type                    0
previous_game_type           0
previous_home_team_errors    0
previous_home_team_hits      0
previous_home_team_runs      0
game_day                     0
previous_game_day            0
temperature                  0
wind_speed                   0
sky                          0
previous_game_duration       0
previous_homewin             0
dtype: int64

In [8]:
train_y = train_set[['attendance_binary']]
test_y = test_set[['attendance_binary']]

train_inputs = train_set.drop(['attendance_binary'], axis=1)
test_inputs = test_set.drop(['attendance_binary'], axis=1)

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [11]:
train_inputs.dtypes

previous_attendance            int64
previous_away_team_errors      int64
previous_away_team_hits        int64
previous_away_team_runs        int64
game_type                     object
previous_game_type            object
previous_home_team_errors      int64
previous_home_team_hits        int64
previous_home_team_runs        int64
game_day                      object
previous_game_day             object
temperature                    int64
wind_speed                     int64
sky                           object
previous_game_duration       float64
previous_homewin               int64
dtype: object

In [12]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [15]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['previous_homewin']

In [16]:
# Be careful: numerical columns already includes the binary columns,
# So, we need to remove the binary columns from numerical columns.

for col in binary_columns:
    numeric_columns.remove(col)

In [17]:
binary_columns

['previous_homewin']

In [18]:
numeric_columns

['previous_attendance',
 'previous_away_team_errors',
 'previous_away_team_hits',
 'previous_away_team_runs',
 'previous_home_team_errors',
 'previous_home_team_hits',
 'previous_home_team_runs',
 'temperature',
 'wind_speed',
 'previous_game_duration']

In [19]:
categorical_columns

['game_type', 'previous_game_type', 'game_day', 'previous_game_day', 'sky']

In [20]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [21]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [22]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [23]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

#passtrough is an optional step. You don't have to use it.

In [25]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

print("Train set transform data:", train_x)
print("Train set transformed data shape:", train_x.shape)

array([[ 1.0759142 , -0.73284426,  0.65755259, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.08802865, -0.73284426, -1.39758022, ...,  0.        ,
         0.        ,  1.        ],
       [-0.18964197, -0.73284426,  1.53832379, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.24819514, -0.73284426, -1.10398982, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.73307527,  0.53175894, -1.69117062, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.59497055, -0.73284426,  1.24473339, ...,  0.        ,
         0.        ,  0.        ]])

In [28]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

print("Train set transform data:", test_x)
print("Train set transformed data shape:", test_x.shape)

Train set transform data: [[-0.76265543  0.53175894 -0.51680901 ...  1.          0.
   1.        ]
 [ 1.12396818 -0.73284426  0.07037179 ...  0.          0.
   1.        ]
 [ 0.72267704 -0.73284426  1.53832379 ...  1.          0.
   1.        ]
 ...
 [-1.466404   -0.73284426 -0.51680901 ...  0.          0.
   1.        ]
 [ 1.11791095 -0.73284426 -0.81039942 ...  1.          0.
   1.        ]
 [-0.36732056  0.53175894  0.07037179 ...  1.          0.
   0.        ]]
Train set transformed data shape: (729, 37)


## Find the Baseline (0.5 point)

In [29]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_y)

In [30]:
from sklearn.metrics import accuracy_score

In [31]:
# This is the baseline Train Accuracy

dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_y, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

Baseline Train Accuracy: 0.5188457008244994


In [32]:
# This is the baseline Test Accuracy

dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_y, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Test Accuracy: 0.5185185185185185


# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)

## SVM Model 1:

In [33]:
from sklearn.svm import SVC
 
lin_svm = SVC(kernel="linear")

lin_svm.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [34]:
from sklearn.metrics import accuracy_score

In [35]:
#Predict the train values
train_y_pred = lin_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8292108362779741

In [36]:
#Predict the test values
test_y_pred = lin_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8436213991769548

In [37]:
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[306,  45],
       [ 69, 309]], dtype=int64)

In [38]:
from sklearn.metrics import classification_report

#We usually create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           0       0.82      0.87      0.84       351
           1       0.87      0.82      0.84       378

    accuracy                           0.84       729
   macro avg       0.84      0.84      0.84       729
weighted avg       0.85      0.84      0.84       729



## SVM Model 2:

In [97]:
from sklearn.svm import SVC

# You need to enter a value for gamma. Remember, gamma controls the shape of the bell curve for rbf
# You can also set it is as gamma='scale'. This will be the default option in future releases

pol_svm_1 = SVC(kernel="poly", degree=3, coef0=1, C=10)

pol_svm_1.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [98]:
#Predict the train values
train_y_pred_1 = pol_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred_1)

0.85924617196702

In [99]:
#Predict the test values
test_y_pred_1 = pol_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred_1)

0.8203017832647462

In [100]:
from sklearn.svm import SVC

# You need to enter a value for gamma. Remember, gamma controls the shape of the bell curve for rbf
# You can also set it is as gamma='scale'. This will be the default option in future releases

pol_svm_2 = SVC(kernel="poly", degree=3, coef0=1, C=1)

pol_svm_2.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [101]:
#Predict the train values
train_y_pred_2 = pol_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred_2)

0.85924617196702

In [102]:
#Predict the test values
test_y_pred_2 = pol_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred_2)

0.8203017832647462

In [103]:
from sklearn.svm import SVC

# You need to enter a value for gamma. Remember, gamma controls the shape of the bell curve for rbf
# You can also set it is as gamma='scale'. This will be the default option in future releases

pol_svm_3 = SVC(kernel="poly", degree=3, coef0=1, C=0.1)

pol_svm.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [104]:
#Predict the train values
train_y_pred_3 = pol_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred_3)

0.85924617196702

In [105]:
#Predict the test values
test_y_pred_3 = pol_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred_3)

0.8203017832647462

## SVM Model 3:

In [106]:
rbf_svm_1 = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm_1.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [107]:
#Predict the train values
train_y_pred_1 = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred_1)

0.8368669022379269

In [108]:
#Predict the test values
test_y_pred_1 = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred_1)

0.821673525377229

In [109]:
rbf_svm_2 = SVC(kernel="rbf", C=1, gamma='scale')

rbf_svm_2.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [110]:
#Predict the train values
train_y_pred_2 = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred_2)

0.8368669022379269

In [111]:
#Predict the test values
test_y_pred_2 = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred_2)

0.821673525377229

In [112]:
rbf_svm_3 = SVC(kernel="rbf", C=0.1, gamma='scale')

rbf_svm_3.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [113]:
#Predict the train values
train_y_pred_3 = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred_3)

0.8368669022379269

In [114]:
#Predict the test values
test_y_pred_3 = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred_3)

0.821673525377229

In [115]:
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred_3)

array([[293,  58],
       [ 72, 306]], dtype=int64)

In [116]:
from sklearn.metrics import classification_report

#We usually create the classification report on test set
print(classification_report(test_y, test_y_pred_3))

              precision    recall  f1-score   support

           0       0.80      0.83      0.82       351
           1       0.84      0.81      0.82       378

    accuracy                           0.82       729
   macro avg       0.82      0.82      0.82       729
weighted avg       0.82      0.82      0.82       729



# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRregression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [45]:
from sklearn.linear_model import SGDClassifier 

# tol = stopping criterion
# eta0 = learning rate
# penalty = regularization term
# max_iter = number of passes over training data (i.e., epochs)

sgd_logreg = SGDClassifier(max_iter=100, penalty=None, eta0=0.01) 

sgd_logreg.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [46]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8180212014134276

In [47]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.803840877914952

## SGD Model 2:

In [73]:
from sklearn.linear_model import SGDClassifier 

# tol = stopping criterion
# eta0 = learning rate
# penalty = regularization term
# max_iter = number of passes over training data (i.e., epochs)

sgd_logreg = SGDClassifier(max_iter=100, penalty='l2', eta0=0.01) 

sgd_logreg.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)


In [74]:
#Predict the train values
train_y_pred = sgd_logreg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8056537102473498

In [75]:
#Predict the test values
test_y_pred = sgd_logreg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.7928669410150891

In [76]:
from sklearn.metrics import confusion_matrix

#We usually create the confusion matrix on test set
confusion_matrix(test_y, test_y_pred)

array([[273,  78],
       [ 73, 305]], dtype=int64)

In [77]:
from sklearn.metrics import classification_report

#We usually create the classification report on test set
print(classification_report(test_y, test_y_pred))

              precision    recall  f1-score   support

           0       0.79      0.78      0.78       351
           1       0.80      0.81      0.80       378

    accuracy                           0.79       729
   macro avg       0.79      0.79      0.79       729
weighted avg       0.79      0.79      0.79       729



## LogisticRegression Model:

In [51]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(penalty='none')

log_reg.fit(train_x, train_y)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [52]:
log_reg.predict(test_x)

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,

In [53]:
# Create a new DataFrame

predictions = pd.DataFrame(log_reg.predict(test_x), columns=['Predicted'])

predictions

Unnamed: 0,Predicted
0,1
1,1
2,1
3,0
4,1
...,...
724,0
725,0
726,0
727,1


In [54]:
# Add the actual to the same DataFrame

predictions['Actual'] = np.array(test_y)

predictions

Unnamed: 0,Predicted,Actual
0,1,0
1,1,1
2,1,1
3,0,0
4,1,1
...,...,...
724,0,0
725,0,0
726,0,0
727,1,1


In [55]:
from sklearn.metrics import accuracy_score

In [56]:
#Predict the train values
train_y_pred = log_reg.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8321554770318021

In [57]:
#Predict the test values
test_y_pred = log_reg.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8257887517146777

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

## Which model performs the best and why? (0.5 points) How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the highest TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (0.5 points)

## Is there any evidence of overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (0.5 points)