# Assignment 1 - Rachel Rozet

This assignment focuses on education. The dataset describes high school students, with each row representing a single student. School administrators want to predict each student's cumulative GPA (CGPA) at graduation so that they can intervene early for struggling students.

## Description of Variables

The descriptions of the variables are provided in "High School - Data Dictionary.docx".

## Goal

Use the **high_school.csv** data set and build a model to predict **CGPA**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Section 1: (6 points in total)

## Data Prep (5.5 points)

### Load Data Set

In [709]:
# Set-up imports for initial data load
import numpy as np
import pandas as pd

np.random.seed(42)

In [711]:
# Import data set
school_data = pd.read_csv("high_school.csv")
school_data.head()

Unnamed: 0,Gender,ParentEdu,ParentMaritalStatus,ExtraCurricular,IsFirstChild,Siblings,Transportation,AvgReadingScore,AvgWritingScore,traveltime,studytime,internet,freetime,absences,CGPA
0,female,bachelor's degree,married,regularly,yes,3.0,school_bus,71,74,2,2,no,3,6,C
1,female,some college,married,sometimes,yes,0.0,,90,88,1,2,yes,3,4,D
2,female,master's degree,single,sometimes,yes,4.0,school_bus,93,91,1,2,yes,3,10,B
3,male,associate's degree,married,never,no,1.0,,56,42,1,3,yes,2,2,F
4,male,some college,married,sometimes,yes,0.0,school_bus,78,75,1,2,no,3,4,C


In [713]:
# Import necessary data prep libraries

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

### Train, Test, Split

In [716]:
# Split data into test and train data sets

from sklearn.model_selection import train_test_split

train, test = train_test_split(school_data, test_size=0.3)
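
The split above depends on the global NumPy seed set earlier. As a minimal sketch, assuming a small synthetic stand-in for `school_data` (the `toy` frame and its values are hypothetical), passing `random_state` and `stratify` could make the split reproducible on its own and keep the grade proportions equal in the train and test sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for school_data: 100 rows with an imbalanced CGPA column
toy = pd.DataFrame({
    "AvgReadingScore": range(100),
    "CGPA": ["A"] * 10 + ["C"] * 30 + ["F"] * 60,
})

# random_state pins the shuffle; stratify keeps each grade's share equal in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["CGPA"]
)

train_share = toy_train["CGPA"].value_counts(normalize=True).sort_index()
test_share = toy_test["CGPA"].value_counts(normalize=True).sort_index()
print(train_share.round(2).to_dict())
print(test_share.round(2).to_dict())
```

Stratifying matters most for the rarer grades, where a purely random split can leave the test set with too few examples to evaluate reliably.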

In [718]:
# Check missing values in train data set

train.isna().sum()

Gender                   0
ParentEdu              105
ParentMaritalStatus     72
ExtraCurricular         35
IsFirstChild            69
Siblings                77
Transportation         181
AvgReadingScore          0
AvgWritingScore          0
traveltime               0
studytime                0
internet                 0
freetime                 0
absences                 0
CGPA                     0
dtype: int64

In [720]:
# Separate the target variable (CGPA) from the model inputs for both the train and test sets

train_target = train[['CGPA']]
test_target = test[['CGPA']]

train_inputs = train.drop(['CGPA'],axis=1)
test_inputs = test.drop(['CGPA'],axis=1)

In [722]:
# Verify that CGPA is not included in train inputs
train_inputs

Unnamed: 0,Gender,ParentEdu,ParentMaritalStatus,ExtraCurricular,IsFirstChild,Siblings,Transportation,AvgReadingScore,AvgWritingScore,traveltime,studytime,internet,freetime,absences
483,male,,,sometimes,yes,2.0,,49,42,2,2,yes,4,12
2004,female,bachelor's degree,married,sometimes,no,3.0,school_bus,65,67,1,2,yes,4,16
277,male,associate's degree,married,never,yes,1.0,school_bus,61,59,2,1,yes,2,22
1552,male,some high school,single,sometimes,yes,0.0,private,31,27,3,1,yes,2,0
221,male,associate's degree,married,regularly,no,1.0,school_bus,77,80,1,3,no,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1638,male,high school,single,regularly,no,5.0,private,53,53,1,2,yes,3,2
1095,female,some college,single,sometimes,yes,1.0,,76,79,1,2,yes,4,8
1130,female,high school,divorced,regularly,yes,1.0,school_bus,75,75,1,3,yes,3,4
1294,male,high school,divorced,sometimes,yes,0.0,,40,46,1,3,yes,4,4


### Identify Numeric, Categorical, and Binary Columns

In [725]:
# Identify data types of columns included as train inputs
train_inputs.dtypes

Gender                  object
ParentEdu               object
ParentMaritalStatus     object
ExtraCurricular         object
IsFirstChild            object
Siblings               float64
Transportation          object
AvgReadingScore          int64
AvgWritingScore          int64
traveltime               int64
studytime                int64
internet                object
freetime                 int64
absences                 int64
dtype: object

In [727]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

### Pipeline Data Transformations

In [730]:
# Numeric transformer:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())])

In [732]:
# Categorical transformer:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
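
To illustrate what `handle_unknown='ignore'` does in the categorical transformer, here is a small sketch on hypothetical category values (not taken from the dataset). A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories for a Transportation-like column
seen = np.array([["private"], ["school_bus"], ["unknown"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(seen)

# "bicycle" never appeared during fit, so its row is encoded as all zeros
new_rows = np.array([["school_bus"], ["bicycle"]], dtype=object)
encoded = encoder.transform(new_rows).toarray()
print(encoded)
```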

In [734]:
# Column transformer for numeric and categorical columns:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='drop',sparse_threshold=0)

In [736]:
# Display numeric columns
numeric_columns

['Siblings',
 'AvgReadingScore',
 'AvgWritingScore',
 'traveltime',
 'studytime',
 'freetime',
 'absences']

In [738]:
# Display categorical columns
categorical_columns

['Gender',
 'ParentEdu',
 'ParentMaritalStatus',
 'ExtraCurricular',
 'IsFirstChild',
 'Transportation',
 'internet']

### Fit and Transform Data Sets

In [741]:
# Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-0.07311637, -1.35881531, -1.65738705, ...,  1.        ,
         0.        ,  1.        ],
       [ 0.62359018, -0.26725329, -0.04930921, ...,  0.        ,
         0.        ,  1.        ],
       [-0.76982292, -0.54014379, -0.56389412, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.76982292,  0.41497297,  0.46527569, ...,  0.        ,
         0.        ,  1.        ],
       [-1.46652948, -1.97281894, -1.40009459, ...,  1.        ,
         0.        ,  1.        ],
       [-0.76982292,  0.41497297,  0.27230635, ...,  0.        ,
         0.        ,  1.        ]])

In [743]:
# Identify train data shape
train_x.shape

(1658, 33)

In [745]:
# Fit and transform the test data
test_x = preprocessor.fit_transform(test_inputs)

test_x

array([[-0.0899091 ,  1.4688247 ,  0.93468229, ...,  0.        ,
         0.        ,  1.        ],
       [-0.0899091 ,  0.59394735,  0.10166056, ...,  0.        ,
         0.        ,  1.        ],
       [-0.0899091 , -0.28093   ,  0.03758196, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.59746046, -2.09798296, -2.14109027, ...,  1.        ,
         0.        ,  1.        ],
       [-0.77727866, -1.49229864, -1.62846151, ...,  0.        ,
         0.        ,  1.        ],
       [-0.77727866, -1.35770212, -1.17991134, ...,  0.        ,
         0.        ,  1.        ]])

In [747]:
# Identify test data shape
test_x.shape

(711, 33)
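
The cell above calls `fit_transform` on the test inputs, which re-estimates the medians, scaling statistics, and category lists from the test rows. The usual convention is to fit the preprocessor on the training inputs only and call `transform` on the test inputs, so that both sets are scaled with the same statistics. A minimal sketch on hypothetical toy arrays (not the notebook's data) shows the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature: the test values sit well above the train values
toy_train = np.array([[0.0], [10.0], [20.0]])
toy_test = np.array([[30.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learn mean/std from train only
test_scaled = scaler.transform(toy_test)        # reuse the train mean/std

# Refitting on the test rows hides the shift between the two sets
test_refit = StandardScaler().fit_transform(toy_test)

print(test_scaled.ravel())
print(test_refit.ravel())
```

In this notebook, the equivalent call would be `preprocessor.transform(test_inputs)`. The accuracies reported below were produced by the cells as written, so they may shift slightly if the notebook is re-run that way.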

### View transformed data in pandas data frame

In [750]:
# Get feature / column names
preprocessor.get_feature_names_out()

array(['num__Siblings', 'num__AvgReadingScore', 'num__AvgWritingScore',
       'num__traveltime', 'num__studytime', 'num__freetime',
       'num__absences', 'cat__Gender_female', 'cat__Gender_male',
       "cat__ParentEdu_associate's degree",
       "cat__ParentEdu_bachelor's degree", 'cat__ParentEdu_high school',
       "cat__ParentEdu_master's degree", 'cat__ParentEdu_some college',
       'cat__ParentEdu_some high school', 'cat__ParentEdu_unknown',
       'cat__ParentMaritalStatus_divorced',
       'cat__ParentMaritalStatus_married',
       'cat__ParentMaritalStatus_single',
       'cat__ParentMaritalStatus_unknown',
       'cat__ParentMaritalStatus_widowed', 'cat__ExtraCurricular_never',
       'cat__ExtraCurricular_regularly', 'cat__ExtraCurricular_sometimes',
       'cat__ExtraCurricular_unknown', 'cat__IsFirstChild_no',
       'cat__IsFirstChild_unknown', 'cat__IsFirstChild_yes',
       'cat__Transportation_private', 'cat__Transportation_school_bus',
       'cat__Transportation_

In [752]:
# Recreate the data set in the processed format (for visualization purposes)
school_train_processed = pd.DataFrame(train_x, columns=preprocessor.get_feature_names_out())

school_train_processed

Unnamed: 0,num__Siblings,num__AvgReadingScore,num__AvgWritingScore,num__traveltime,num__studytime,num__freetime,num__absences,cat__Gender_female,cat__Gender_male,cat__ParentEdu_associate's degree,...,cat__ExtraCurricular_sometimes,cat__ExtraCurricular_unknown,cat__IsFirstChild_no,cat__IsFirstChild_unknown,cat__IsFirstChild_yes,cat__Transportation_private,cat__Transportation_school_bus,cat__Transportation_unknown,cat__internet_no,cat__internet_yes
0,-0.073116,-1.358815,-1.657387,0.783857,-0.062247,0.763217,0.740064,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
1,0.623590,-0.267253,-0.049309,-0.639621,-0.062247,0.763217,1.218065,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,-0.769823,-0.540144,-0.563894,0.783857,-1.276433,-1.227993,1.935065,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,-1.466529,-2.586823,-2.622234,2.207335,-1.276433,-1.227993,-0.693937,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
4,-0.769823,0.551418,0.786891,-0.639621,1.151938,-0.232388,-0.693937,0.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1653,2.017003,-1.085925,-0.949833,-0.639621,-0.062247,-0.232388,-0.454936,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1654,-0.769823,0.483196,0.722568,-0.639621,-0.062247,0.763217,0.262064,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
1655,-0.769823,0.414973,0.465276,-0.639621,1.151938,-0.232388,-0.215936,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1656,-1.466529,-1.972819,-1.400095,-0.639621,1.151938,0.763217,-0.215936,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0


## Find the Baseline (0.5 point)

In [755]:
# Import Dummy Classifier to identify most frequent classification of CGPA
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_target)

In [757]:
# Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_target, dummy_train_pred)

print(f'Baseline Train Accuracy: {baseline_train_acc}')

Baseline Train Accuracy: 0.33293124246079614


In [759]:
# Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_target, dummy_test_pred)

print(f'Baseline Test Accuracy: {baseline_test_acc}')

Baseline Test Accuracy: 0.3263009845288326
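
As a sanity check, a most-frequent baseline scores exactly the share of rows that belong to the class it always predicts. The sketch below uses the per-class test supports that appear in the classification reports in Section 2 (52, 90, 151, 186, 232) and assumes that F is also the most frequent grade in the training set:

```python
# Test-set support per grade, copied from the classification reports in Section 2
support = {"A": 52, "B": 90, "C": 151, "D": 186, "F": 232}

# Assumption: the dummy classifier learned "F" as the most frequent training grade
predicted_grade = "F"

baseline_acc = support[predicted_grade] / sum(support.values())
print(round(baseline_acc, 4))
```

The result, 232 / 711, agrees with the baseline test accuracy printed above.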


# Section 2: (3 points in total)

Build three different SVM models (by changing the kernels, regularization, etc.). Generate their training and test values. Each model is worth 1 point. 

(Add cells as needed)



## SVM Model 1:

In [763]:
# Create SVC linear model
from sklearn.svm import SVC
np.random.seed(42)
 
svm_clf = SVC(kernel="linear", C=7, decision_function_shape='ovr')

svm_clf.fit(train_x, train_target.values.ravel())

In [765]:
# Predict the train values
train_y_pred = svm_clf.predict(train_x)

# Predict the test values
test_y_pred = svm_clf.predict(test_x)

In [767]:
# Train accuracy
accuracy_score(train_target, train_y_pred)

0.6827503015681544

In [769]:
# Test accuracy
accuracy_score(test_target, test_y_pred)

0.6315049226441631

In [771]:
# Create confusion matrix for SVM 1
from sklearn.metrics import confusion_matrix

confusion_matrix(test_target, test_y_pred)

array([[ 37,  14,   1,   0,   0],
       [ 14,  43,  30,   2,   1],
       [  2,  26,  88,  33,   2],
       [  0,   4,  56,  92,  34],
       [  0,   0,   1,  42, 189]])

In [773]:
# Create classification report for SVM 1 on test data set
from sklearn.metrics import classification_report

print(classification_report(test_target, test_y_pred))

              precision    recall  f1-score   support

           A       0.70      0.71      0.70        52
           B       0.49      0.48      0.49        90
           C       0.50      0.58      0.54       151
           D       0.54      0.49      0.52       186
           F       0.84      0.81      0.83       232

    accuracy                           0.63       711
   macro avg       0.61      0.62      0.61       711
weighted avg       0.64      0.63      0.63       711
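
The per-class figures in the report can be derived directly from the confusion matrix. The following sketch copies the SVM 1 matrix shown above and recomputes precision, recall, and overall accuracy with NumPy:

```python
import numpy as np

labels = ["A", "B", "C", "D", "F"]
# Confusion matrix for SVM 1 on the test set (rows = actual, columns = predicted)
cm = np.array([
    [37, 14,  1,  0,   0],
    [14, 43, 30,  2,   1],
    [ 2, 26, 88, 33,   2],
    [ 0,  4, 56, 92,  34],
    [ 0,  0,  1, 42, 189],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # share of each actual grade that was found
precision = correct / cm.sum(axis=0)  # share of each predicted grade that was right
accuracy = correct.sum() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Most errors fall in cells adjacent to the diagonal, meaning the model usually misses by one letter grade rather than by several.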



### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

The train and test accuracies are close (train = 68%, test = 63%), so the model does not appear to be overfitting. A substantially larger gap would call for adjustments, such as stronger regularization (a smaller C), to reduce overfitting to the training set. Although there is room for improvement, the test accuracy is almost double the baseline (33%), so this model is a reasonable starting point for predicting CGPA.

## SVM Model 2:

In [787]:
# Create SVC model with a degree-2 polynomial kernel to check whether the class boundaries are non-linear
np.random.seed(42)
pol_svm2 = SVC(kernel="poly", degree=2, coef0=1.1, C=0.095, decision_function_shape='ovr')

pol_svm2.fit(train_x, train_target.values.ravel())

In [789]:
# Predict the train values
train_y_pred = pol_svm2.predict(train_x)

# Train accuracy
accuracy_score(train_target, train_y_pred)

0.658021712907117

In [791]:
# Predict the test values
test_y_pred = pol_svm2.predict(test_x)

# Test accuracy
accuracy_score(test_target, test_y_pred)

0.6104078762306611

In [793]:
# Get confusion matrix for SVM 2
confusion_matrix(test_target, test_y_pred)

array([[ 14,  30,   8,   0,   0],
       [  1,  43,  42,   3,   1],
       [  0,  12,  99,  37,   3],
       [  0,   1,  63,  81,  41],
       [  0,   0,   3,  32, 197]])

In [795]:
# Get classification report for SVM 2
print(classification_report(test_target, test_y_pred))

              precision    recall  f1-score   support

           A       0.93      0.27      0.42        52
           B       0.50      0.48      0.49        90
           C       0.46      0.66      0.54       151
           D       0.53      0.44      0.48       186
           F       0.81      0.85      0.83       232

    accuracy                           0.61       711
   macro avg       0.65      0.54      0.55       711
weighted avg       0.63      0.61      0.60       711



### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

The train and test accuracies are close (train = 66%, test = 61%), so the model does not appear to be overfitting. A substantially larger gap would call for adjustments to reduce overfitting to the training set.

## SVM Model 3:

In [854]:
# Create SVC model with RBF kernel
np.random.seed(42)
rbf_svm = SVC(kernel="rbf", C=0.2, gamma='scale')

rbf_svm.fit(train_x, train_target.values.ravel())

In [856]:
# Predict the train values
train_y_pred = rbf_svm.predict(train_x)

# Predict the test values
test_y_pred = rbf_svm.predict(test_x)

In [858]:
# Train accuracy
accuracy_score(train_target, train_y_pred)

0.6127864897466827

In [860]:
#Test accuracy
accuracy_score(test_target, test_y_pred)

0.5414908579465542

In [862]:
# Get confusion matrix for SVM 3
confusion_matrix(test_target, test_y_pred)

array([[  1,  27,  24,   0,   0],
       [  0,  12,  72,   5,   1],
       [  0,   1, 108,  36,   6],
       [  0,   0,  70,  64,  52],
       [  0,   0,   6,  26, 200]])

In [864]:
# Get classification report for SVM 3
print(classification_report(test_target, test_y_pred))

              precision    recall  f1-score   support

           A       1.00      0.02      0.04        52
           B       0.30      0.13      0.18        90
           C       0.39      0.72      0.50       151
           D       0.49      0.34      0.40       186
           F       0.77      0.86      0.81       232

    accuracy                           0.54       711
   macro avg       0.59      0.41      0.39       711
weighted avg       0.57      0.54      0.50       711



### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Compared to the earlier models, the gap between train and test accuracy is slightly larger: train is about 61% and test about 54%, a difference of roughly 7 percentage points. A gap of this size is unlikely to count as serious overfitting. However, if a smaller gap is preferred, it can be reduced by decreasing C. In `SVC`, C is inversely related to regularization strength, so a smaller C means stronger regularization. Note that doing so lowers both the train and test accuracy. The correction below decreases C from 0.2 to 0.1. As a result, train accuracy drops to 56% and test accuracy drops to 51%, which narrows the gap from about 7 percentage points (C=0.2) to about 4 (C=0.1).

In [868]:
# Decrease C (i.e., strengthen regularization) to reduce potential overfitting of the model
rbf_svm = SVC(kernel="rbf", C=0.1, gamma='scale')

rbf_svm.fit(train_x, train_target.values.ravel())

In [870]:
# Predict the train values
train_y_pred = rbf_svm.predict(train_x)

# Predict the test values
test_y_pred = rbf_svm.predict(test_x)

In [872]:
# Train accuracy
accuracy_score(train_target, train_y_pred)

0.5566948130277443

In [874]:
# Test accuracy
accuracy_score(test_target, test_y_pred)

0.5147679324894515
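
To make the comparison concrete, the sketch below computes the train-test gap for both RBF runs from the accuracies printed above. The `rbf_runs` dictionary is just a hand-copied record of those outputs:

```python
# Accuracies copied from the RBF SVM outputs above
rbf_runs = {
    "C=0.2": {"train": 0.6127864897466827, "test": 0.5414908579465542},
    "C=0.1": {"train": 0.5566948130277443, "test": 0.5147679324894515},
}

gaps = {}
for name, acc in rbf_runs.items():
    # Gap expressed in percentage points
    gaps[name] = (acc["train"] - acc["test"]) * 100
    print(f"{name}: train-test gap = {gaps[name]:.1f} points, test = {acc['test']:.1%}")
```

The smaller gap at C=0.1 comes at the cost of a lower test accuracy, so the trade-off should be judged on test performance and not on the gap alone.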

# Section 3: (3 points in total)

Build two different SGD models (by changing the penalty, etc. or adding polynomial terms) and one LogisticRegression model. Generate their training and test values. Each model is worth 1 point.

(Add cells as needed)

## SGD Model 1:

In [878]:
# Create SGD linear model with no penalty
from sklearn.linear_model import SGDClassifier
np.random.seed(42)

sgd_class = SGDClassifier(max_iter=130, penalty=None, eta0=0.01) 

sgd_class.fit(train_x, train_target.values.ravel())

In [880]:
# Train accuracy
accuracy_score(sgd_class.predict(train_x), np.array(train_target.values.ravel()))

0.52291917973462

In [882]:
# Get predicted values for test data set
sgd_class.predict(test_x)

array(['B', 'D', 'C', 'F', 'A', 'C', 'C', 'C', 'F', 'D', 'D', 'F', 'C',
       'D', 'F', 'D', 'D', 'F', 'D', 'F', 'D', 'F', 'A', 'F', 'C', 'F',
       'A', 'C', 'D', 'A', 'C', 'C', 'C', 'A', 'F', 'F', 'F', 'D', 'D',
       'C', 'F', 'D', 'C', 'F', 'F', 'F', 'C', 'D', 'A', 'C', 'F', 'D',
       'C', 'F', 'C', 'A', 'D', 'F', 'D', 'D', 'D', 'F', 'F', 'F', 'F',
       'D', 'F', 'F', 'F', 'A', 'C', 'A', 'F', 'C', 'D', 'D', 'D', 'C',
       'F', 'F', 'F', 'C', 'A', 'F', 'F', 'F', 'C', 'D', 'D', 'D', 'A',
       'C', 'F', 'A', 'C', 'A', 'A', 'D', 'D', 'F', 'A', 'C', 'D', 'C',
       'F', 'F', 'C', 'A', 'C', 'F', 'F', 'C', 'C', 'F', 'F', 'D', 'A',
       'A', 'F', 'C', 'F', 'F', 'A', 'F', 'C', 'A', 'F', 'F', 'C', 'D',
       'A', 'A', 'D', 'F', 'A', 'F', 'D', 'D', 'F', 'A', 'F', 'C', 'C',
       'F', 'D', 'D', 'F', 'C', 'A', 'F', 'A', 'D', 'F', 'F', 'F', 'F',
       'F', 'A', 'A', 'C', 'D', 'F', 'F', 'A', 'D', 'A', 'A', 'C', 'F',
       'F', 'F', 'C', 'A', 'F', 'F', 'C', 'D', 'D', 'C', 'C', 'F

In [884]:
# Create a new DataFrame to display predicted values (for visualization purposes)

predictions = pd.DataFrame(sgd_class.predict(test_x), columns=['Test Predicted'])

predictions

Unnamed: 0,Test Predicted
0,B
1,D
2,C
3,F
4,A
...,...
706,F
707,D
708,F
709,F


In [886]:
# Add the actual CGPA values to the same DataFrame for comparison

predictions['Test Actual'] = np.array(test_target.values.ravel())

predictions

Unnamed: 0,Test Predicted,Test Actual
0,B,A
1,D,D
2,C,D
3,F,F
4,A,C
...,...,...
706,F,D
707,D,B
708,F,F
709,F,F


In [888]:
# Test accuracy
accuracy_score(predictions['Test Predicted'], predictions['Test Actual'])

0.5232067510548524

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

The train and test accuracies are nearly identical (train = 52%, test = 52%), so the model is not overfitting. A substantially larger gap would call for adjustments to reduce overfitting to the training set. The model beats the baseline (33%), but it does not perform as well as some of the earlier models, such as the linear SVM. It is therefore a weaker match for this data set than those alternatives.

## SGD Model 2:

In [893]:
# Create SGD model with Elastic Net penalty
np.random.seed(42)
sgd_class_elastic = SGDClassifier(max_iter=50, penalty='elasticnet', l1_ratio=0.5, alpha = 0.005, 
                          eta0=0.01)
sgd_class_elastic.fit(train_x, train_target.values.ravel())

In [690]:
# Train accuracy
accuracy_score(sgd_class_elastic.predict(train_x), np.array(train_target.values.ravel()))

0.5651387213510254

In [692]:
# Get predicted values for test data set
test_pred = sgd_class_elastic.predict(test_x)

In [694]:
# Test accuracy
accuracy_score(test_pred, np.array(test_target.values.ravel()))

0.5414908579465542

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

The train and test accuracies are close (train = approximately 57%, test = approximately 54%), so the model does not appear to be overfitting. A substantially larger gap would call for adjustments to reduce overfitting to the training set.

## LogisticRegression Model:

In [697]:
# Create Logistic Regression model 
from sklearn.linear_model import LogisticRegression
np.random.seed(42)

log_reg = LogisticRegression(max_iter = 170, random_state=42).fit(train_x, train_target.values.ravel())

In [699]:
# Get predictions for train data set
train_pred = log_reg.predict(train_x)

In [701]:
# Train accuracy
accuracy_score(train_pred, np.array(train_target.values.ravel()))

0.6767189384800965

In [703]:
# Get predictions for test data set
test_pred = log_reg.predict(test_x)

In [705]:
# Test accuracy
accuracy_score(test_pred, np.array(test_target.values.ravel()))

0.6286919831223629

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

The train and test accuracies are close (train = 68%, test = 63%), so the model does not appear to be overfitting. A substantially larger gap would call for adjustments to reduce overfitting to the training set. That said, there is room for improvement: the model classifies only about 63% of test students correctly, and it is marginally behind the linear SVM (63.15% test accuracy).

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

**If the train/test values listed here do not match the outputs of models, you will lose points.**
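
One way to list these values is to collect them in a DataFrame. The sketch below copies the train and test accuracies from the model outputs in Sections 1 to 3. The model labels are shorthand chosen here for readability:

```python
import pandas as pd

# Train/test accuracies copied from the model outputs in Sections 1-3
results = pd.DataFrame(
    [
        ("Baseline (most frequent)", 0.33293124246079614, 0.3263009845288326),
        ("SVM 1: linear, C=7", 0.6827503015681544, 0.6315049226441631),
        ("SVM 2: poly degree 2, C=0.095", 0.658021712907117, 0.6104078762306611),
        ("SVM 3: RBF, C=0.2", 0.6127864897466827, 0.5414908579465542),
        ("SVM 3: RBF, C=0.1", 0.5566948130277443, 0.5147679324894515),
        ("SGD 1: no penalty", 0.52291917973462, 0.5232067510548524),
        ("SGD 2: elastic net", 0.5651387213510254, 0.5414908579465542),
        ("Logistic regression", 0.6767189384800965, 0.6286919831223629),
    ],
    columns=["model", "train_accuracy", "test_accuracy"],
)

# Rank by test accuracy, the selection criterion named in the assignment hint
results = results.sort_values("test_accuracy", ascending=False).reset_index(drop=True)
print(results.round(4).to_string(index=False))
```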

## Which model performs the best and why? (1 point) 

Hint: The best model is the one that has the best TEST score (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

The best model assessed in this assignment is the linear SVM, with a test accuracy of 63.15%. It narrowly beats the logistic regression model (62.87%) and the degree-2 polynomial SVM (61.04%). Because the test accuracies of the top three models are so close, an organization looking to deploy one of them should also weigh computation cost, model complexity, and similar practical factors before choosing. The three models would likely perform very similarly in practice, so most of the difference between them would come from external factors such as data volume and model runtime.

## How does your best model compare to the baseline? (1 point)

The linear SVM's test accuracy is almost double the baseline's (63% compared to 33%). There is still room to improve, but the margin over the baseline is large enough that the model merits consideration for use, or for further experimentation and fine-tuning with more training and test data.