# DTSC670: Foundations of Machine Learning Models

## Assignment 1: Johnny Likes Pie

#### Name:

### CodeGrade
Note that this assignment will be automatically graded through CodeGrade and you will have unlimited submission attempts.  When submitting to CodeGrade, your notebook should be named `assignment1.ipynb` and there should be no errors in the file or CodeGrade will not be able to grade it.  Before submitting, I suggest that you restart your kernel and attempt to run all cells again to ensure that there will be no errors when CodeGrade runs your script.

### Details

First, watch the video titled "Should You Play Golf Today" in the "Preparation for Assignment 1" section of Brightspace.  This assignment is meant purely to walk you through some basic steps with Scikit-Learn so that you get used to working with it.

The following data describes features of different types of pie, along with a positive or negative classification of each pie based on whether or not Johnny likes it.  A positive classification means Johnny likes that pie; a negative classification means Johnny does not like that pie.

<img src="JohnnyPies.png" width="600" />

### Import Data

Let's start with some standard imports.

In [1]:
# common imports
import numpy as np
import pandas as pd

# Do not change these options; they allow the CodeGrade auto-grading to function correctly
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore') 

Next you should place the data file called `JohnnyPiesData.csv` and this Jupyter notebook in the same directory.  Use the [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read in the data from the comma-separated values (csv) file to a Pandas DataFrame called `pie_df` and output the data to take a look.

In [2]:
pie_df = pd.read_csv("JohnnyPiesData.csv")
# pie_df 

## Prepare Data for Linear Regression

- Drop the `Example` column from the `pie_df` DataFrame, because it offers no information.

- Encode all categorical data into numeric data via the "One Hot Encoding" technique provided by the Pandas [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function.  

  - Since we are performing ordinary least squares linear regression, we will want to drop one of the newly created Boolean-valued features for each categorical column (output from the `get_dummies()` function).  The full set of indicator columns for a category always sums to 1, so keeping all of them makes the features perfectly collinear with the intercept.  Include `drop_first = True` as an argument to the `get_dummies()` function.

- Store the final features in a DataFrame called `features`.  The one-hot-encoded columns must go in the same order as the original data so that the linear regression coefficients match what CodeGrade is expecting.

- Store the positive class labels in a DataFrame called `response`.  The `response` data must be a DataFrame and not a Series or some of the code towards the end of this notebook may not function correctly and your output might be slightly different than what CodeGrade is expecting.
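
As a rough illustration of what `drop_first = True` does, the sketch below applies `get_dummies()` to a small made-up DataFrame. The column names and values (`Shape`, `Size`, and so on) are hypothetical and are not taken from `JohnnyPiesData.csv`.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy_df = pd.DataFrame({
    "Shape": ["circle", "square", "triangle", "circle"],
    "Size": ["thick", "thin", "thick", "thin"],
})

# Without drop_first, every category level gets its own indicator column
full = pd.get_dummies(toy_df)

# With drop_first=True, the first level (in sorted order) of each column is dropped
reduced = pd.get_dummies(toy_df, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped level becomes the baseline: a row with all of a category's remaining indicators set to 0 belongs to the dropped level.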

**Note:** Since we are not concerned with generalization error in this assignment, we will not split our data into training and test sets. In 'real-world' projects, you would want to split your data to see how your model performs with data that it has never seen before.
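
For reference only, a train/test split might look like the sketch below. It is not part of this assignment, and the arrays `X` and `y` are made-up placeholders rather than the pie data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows; random_state fixes the shuffle for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The model would then be fit on `X_train` and `y_train` only, and scored on the held-out `X_test` and `y_test`.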

In [3]:
#Encode all categorical data into numeric data via the "One Hot Encoding" technique provided by the Pandas get_dummies() 
#function. Since we are performing ordinary least squares linear regression, we will want to drop one of the newly created 
#Boolean-valued features (output from the get_dummies() function) to prevent introducing unwanted correlation in the data. 
#Include drop_first = True as an argument to the get_dummies() function

features = pd.get_dummies(
    pie_df, columns = ['Crust Shape','Crust Size','Crust Shade','Filling Size','Filling Shade', 'Class'], 
    drop_first = True)
# features

In [4]:
# Store the positive class labels in a DataFrame called response. The response data must be a DataFrame and not a Series 
# or some of the code towards the end of this notebook may not function correctly and your output might be slightly 
# different than what CodeGrade is expecting.

response = features[['Class_pos']]
# response

In [5]:
# Drop the uninformative Example column and the response column from the features
features = features.drop(columns=['Example', 'Class_pos'])
# features

## Perform Linear Regression Model Fitting

1. Import the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class from the `sklearn.linear_model` module.

2. Instantiate an object of the `LinearRegression` class called `reg_model`.

3. Train the model by invoking the `fit()` method of the `reg_model` object and passing it `features` and `response`.

In [6]:
from sklearn.linear_model import LinearRegression
reg_model = LinearRegression()
reg_model.fit(features, response)
# reg_model

## Examine Linear Regression Model Parameters

View the trained model parameters by using the `coef_` and `intercept_` attributes of the trained model.

In [7]:
# Coefficients
reg_model.coef_

array([[-0.52586207, -0.83189655, -0.56465517, -0.63793103, -0.92672414,
         0.70258621,  0.12068966, -1.07327586]])

In [8]:
# Intercept
reg_model.intercept_

array([1.56034483])
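
To see how these parameters produce a prediction, the sketch below fits a model on made-up data and rebuilds its predictions by hand. The data and the name `toy_model` are hypothetical. Here `y` is one-dimensional, so `coef_` is a 1-D array, whereas in this notebook `response` is a DataFrame and `coef_` is 2-D.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated exactly by y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

toy_model = LinearRegression().fit(X, y)

# A prediction is the dot product of features and coefficients, plus the intercept
manual_preds = X @ toy_model.coef_ + toy_model.intercept_

# The manual computation should agree with predict()
print(np.allclose(manual_preds, toy_model.predict(X)))
```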

## Making Predictions Using the Linear Regression Model

Evaluate the model's performance on the training data set by invoking the `predict()` method and passing `features` to it.  Save this output as `preds`. 


In [9]:
preds = reg_model.predict(features)
# preds

Below are the results from the linear regression model:

The `Class_pos` column holds the true classification of each pie (positive or negative).  The `Regression_Predictions` column holds the raw predictions made by the linear regression model.  The `Predicted_Responses` column holds the adjusted predictions after applying a cut-off of 0.5: a raw prediction greater than 0.5 becomes 1, and anything else becomes 0.

Note:  Make sure that your `response` is a DataFrame and not a Series or some of the code below may not function correctly.

In [10]:
# resp_comp = Response Comparison

resp_comp = response.copy() 
reg_outputs = [float(reg_model.predict(np.reshape(row, (1, -1)))) for row in features.itertuples(index=False)]
predicted_resp = np.array([1 if reg_output > 0.5 else 0 for reg_output in reg_outputs])
resp_comp = resp_comp.assign(Regression_Predictions = reg_outputs)
resp_comp = resp_comp.assign(Predicted_Responses = predicted_resp)
# resp_comp
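
The same cut-off rule can also be written as a single vectorized comparison. The sketch below uses made-up values in `raw_outputs` (a hypothetical name) rather than the model's real outputs, and it is an alternative illustration, not a replacement for the cell above.

```python
import numpy as np

# Hypothetical raw regression outputs; values can fall outside [0, 1]
raw_outputs = np.array([-0.2, 0.1, 0.5, 0.51, 0.9, 1.3])

# Values strictly greater than 0.5 become 1; everything else becomes 0
labels = (raw_outputs > 0.5).astype(int)

print(labels)
```

Note that a value of exactly 0.5 falls on the 0 side of the cut-off, matching the `> 0.5` test used in the list comprehension.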

## Calculate Model Accuracy

Use the [accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function to calculate the accuracy score of the model.  Save the accuracy score as `acc_score`.

In [11]:
# Keep only the true labels and the thresholded predictions
resp_comp = resp_comp.drop(columns=['Regression_Predictions'])
# resp_comp

In [12]:
from sklearn.metrics import accuracy_score

acc_score = accuracy_score(resp_comp['Class_pos'], resp_comp['Predicted_Responses'])
# acc_score
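
For intuition, `accuracy_score()` with its default settings returns the fraction of predictions that match the true labels. The sketch below checks this on made-up labels; `y_true` and `y_pred` are hypothetical arrays, not the pie data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)

# Equivalent manual computation: the mean of element-wise matches
manual = np.mean(y_true == y_pred)

print(acc, manual)
```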