
# Homework 2 (Feed Forward Neural Networks)

Choose a dataset that you're interested in from among these options (or choose your own data set as long as it's large enough and **you check with me** in advance):

- [Boston Housing Data](https://github.com/selva86/datasets/blob/master/BostonHousing.csv) (More info [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html))
- [Wine Quality Data](https://archive.ics.uci.edu/ml/datasets/wine+quality)
- [Spam Emails](https://archive.ics.uci.edu/ml/datasets/Spambase)
- [European Soccer Data](https://www.kaggle.com/datasets/hugomathien/soccer) (only recommended if you want to spend some time joining and cleaning data)

Then build a deep FEED FORWARD neural network (no convolutional or recurrent layers) using keras/tensorflow, with at least 3 *hidden* layers, to predict either a category or a continuous value.

Make sure that:

- your NN has some sort of regularization (or multiple types if needed)
- you've properly z-scored or otherwise scaled your data before training
- your model architecture and loss function are appropriate for the problem
- you print out at least 2 metrics for both train and test data to examine
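
As a minimal sketch of the scaling requirement above, the snippet below z-scores a small made-up feature matrix with NumPy. The mean and standard deviation are computed on the training rows only and then reused for the test rows, so no information from the test set leaks into training. The arrays and the `zscore_fit` / `zscore_apply` helper names are illustrative assumptions, not part of the assignment; scikit-learn's `StandardScaler` does the same job.

```python
import numpy as np

def zscore_fit(train):
    """Return the per-column mean and standard deviation of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return mean, std

def zscore_apply(data, mean, std):
    """Scale data using statistics that were computed on the training set."""
    return (data - mean) / std

# Made-up values, for illustration only
X_train = np.array([[7.0, 0.27], [6.3, 0.30], [8.1, 0.28], [7.2, 0.23]])
X_test = np.array([[6.8, 0.26], [7.5, 0.32]])

mean, std = zscore_fit(X_train)
X_train_z = zscore_apply(X_train, mean, std)
X_test_z = zscore_apply(X_test, mean, std)  # reuses the training statistics

print(X_train_z.mean(axis=0))  # approximately 0 for every column
print(X_train_z.std(axis=0))   # approximately 1 for every column
```

The test columns will generally not have mean 0 and standard deviation 1 after scaling, which is expected.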

Then, using the SAME predictors and outcome, **build a simpler ML model from 392** (some options listed below with documentation) and check if your Neural Net did better (essentially I want you to PROVE whether you needed a neural network for the task or not).



- [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [KNN Regression](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
- [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
- [Decision Tree Regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
- [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Random Forest Regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
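
One possible way to organize the comparison between the neural network and the simpler model is sketched below. The `compare_models` helper and all of the numbers are hypothetical and only illustrate the idea: evaluate both models on the same test set, with the same metrics, and record for each metric whether higher or lower is better.

```python
def compare_models(nn_metrics, simple_metrics, higher_is_better):
    """Return, for each metric, which model scored better on the test set."""
    verdict = {}
    for name, nn_value in nn_metrics.items():
        simple_value = simple_metrics[name]
        if nn_value == simple_value:
            verdict[name] = "tie"
        elif (nn_value > simple_value) == higher_is_better[name]:
            verdict[name] = "neural network"
        else:
            verdict[name] = "simpler model"
    return verdict

# Hypothetical test-set numbers, for illustration only (not results from any model)
nn_test = {"accuracy": 0.60, "mae": 0.55}
simple_test = {"accuracy": 0.55, "mae": 0.50}
direction = {"accuracy": True, "mae": False}  # accuracy: higher is better; MAE: lower is better

result = compare_models(nn_test, simple_test, direction)
print(result)
```

In this made-up case the neural network wins on accuracy while the simpler model wins on MAE, which is the kind of mixed outcome worth discussing in the report.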


Lastly, create a **technical report** discussing your model building process, the results, and your reflection on it. The report should follow the format in the example, including Introduction, Analysis, Methods, Results, and Reflection sections. Your report is practice for presenting results to non-technical audiences in your Data Science career (e.g. your boss, CEO, shareholders...).

# Technical Report Sections

## Introduction
An introduction should introduce the problem and data you're working on, give some background and relevant detail for the reader, and explain why it is important.

## Analysis
Include any exploratory analysis of your data and a general summary of it (e.g. summary statistics, correlation heatmaps, graphs, information about the data...). Tell the reader about the types of variables you have and some general information about them. Plots and/or tables are always great. This section should also cover any cleaning and joining you did.

If you want a table you can make one with [this website](https://www.tablesgenerator.com/markdown_tables) and paste the markdown table here. For example:
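
A table in that format can also be generated directly from Python. The sketch below uses a small hypothetical helper, `to_markdown_table`, which is not part of the assignment or of any library, and the row contents are placeholders rather than real descriptions of the data.

```python
def to_markdown_table(headers, rows):
    """Build a markdown table string from a list of headers and a list of rows."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(" --- " for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

# Placeholder rows, for illustration only
table = to_markdown_table(
    ["Variable", "Type", "Description"],
    [
        ["alcohol", "continuous", "placeholder description"],
        ["quality", "ordinal", "placeholder description"],
    ],
)
print(table)
```

The printed text can be pasted into a markdown cell or a Quarto document as-is.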

## Methods
Explain the structure of your model and your approach to building it. This can also include changes you made to your model in the process of building it. Someone should be able to read your methods section and *generally* be able to tell what architecture you used. However, REMEMBER that this should be geared towards an audience who might not understand Tensorflow code.

## Results
A detailed discussion of how your model performed and your interpretation of those results.

## Reflection
Reflect on what you learned or discovered in the process of doing the assignment. Write about any struggles you had (and hopefully overcame) during the process, things you would do differently in the future, ways you'll approach similar problems, etc.



# What to Turn In

- PDF of your technical report (rendered through Quarto)
- your code as a .py, .ipynb, or link to github (you must turn it in either as a file, or a link to something that has timestamps of when the file was last edited)
- a README file as a .txt or .md

In [None]:
# Imports
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow.keras as kb
from tensorflow.keras import backend
from tensorflow.keras import regularizers
from tensorflow.keras.utils import to_categorical

from plotnine import *
from ucimlrepo import fetch_ucirepo

from sklearn.model_selection import train_test_split  # train/test splitting (used below)
from sklearn.preprocessing import StandardScaler, LabelBinarizer  # z-score variables
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             mean_squared_error, mean_absolute_error, roc_auc_score)

In [None]:
# Loading data using code from the Machine Learning UCI Repo

# fetch dataset
wine_quality = fetch_ucirepo(id=186)

# data (as pandas dataframes)
X = wine_quality.data.features
y = wine_quality.data.targets

# metadata
#print(wine_quality.metadata)

# variable information
#print(wine_quality.variables)

# Concatenating the features and targets
wine = pd.concat([X, y], axis=1)

# Displaying the first few rows
print(wine.head())


   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0            7.0              0.27         0.36            20.7      0.045   
1            6.3              0.30         0.34             1.6      0.049   
2            8.1              0.28         0.40             6.9      0.050   
3            7.2              0.23         0.32             8.5      0.058   
4            7.2              0.23         0.32             8.5      0.058   

   free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  \
0                 45.0                 170.0   1.0010  3.00       0.45   
1                 14.0                 132.0   0.9940  3.30       0.49   
2                 30.0                  97.0   0.9951  3.26       0.44   
3                 47.0                 186.0   0.9956  3.19       0.40   
4                 47.0                 186.0   0.9956  3.19       0.40   

   alcohol  quality  
0      8.8        6  
1      9.5        6  
2     10.1        6 

In [111]:
# Printing summary statistics
print(wine.describe())

       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  \
count    4898.000000       4898.000000  4898.000000     4898.000000   
mean        6.854788          0.278241     0.334192        6.391415   
std         0.843868          0.100795     0.121020        5.072058   
min         3.800000          0.080000     0.000000        0.600000   
25%         6.300000          0.210000     0.270000        1.700000   
50%         6.800000          0.260000     0.320000        5.200000   
75%         7.300000          0.320000     0.390000        9.900000   
max        14.200000          1.100000     1.660000       65.800000   

         chlorides  free_sulfur_dioxide  total_sulfur_dioxide      density  \
count  4898.000000          4898.000000           4898.000000  4898.000000   
mean      0.045772            35.308085            138.360657     0.994027   
std       0.021848            17.007137             42.498065     0.002991   
min       0.009000             2.000000         

In [106]:
# Setting predictor and predict variables
predictors = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", "chlorides",
              "free_sulfur_dioxide", "total_sulfur_dioxide", "density", "pH", "sulphates", "alcohol"]
predict = "quality"

print(wine.shape)
X = wine[predictors]
y = wine[predict]

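# Creating train and test splits (copies avoid pandas SettingWithCopyWarning when scaling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_test = X_train.copy(), X_test.copy()

# Z-scoring predictors: fit on the training data only, then apply to the test data
z = StandardScaler()
X_train[predictors] = z.fit_transform(X_train[predictors])
X_test[predictors] = z.transform(X_test[predictors])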



(4898, 12)


In [110]:
# build structure of the model
model = kb.Sequential([
    kb.layers.Dense(256, activation = 'relu', input_shape=(X_train.shape[1],)),
    kb.layers.Dropout(0.5),
    kb.layers.Dense(128, activation = 'relu'),
    kb.layers.Dropout(0.3),
    kb.layers.Dense(64, activation = 'relu'),
    kb.layers.Dropout(0.2),
    kb.layers.Dense(32, activation = 'relu'),
    kb.layers.Dropout(0.1),
    kb.layers.Dense(10, activation='softmax')
])

# compile model
model.compile(loss="categorical_crossentropy", optimizer=kb.optimizers.SGD(0.01),
              metrics=['accuracy', 'mae'])

# One-hot encode the targets, as categorical_crossentropy expects (quality is shifted down by 1)

y_train_encoded = to_categorical(y_train - 1, num_classes=10)
y_test_encoded = to_categorical(y_test - 1, num_classes=10)


#fit the model
model.fit(X_train, y_train_encoded, epochs = 100, validation_data=(X_test, y_test_encoded))

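# Evaluate the model and get metrics
train_metrics = model.evaluate(X_train, y_train_encoded)
test_metrics = model.evaluate(X_test, y_test_encoded)

# Keras' 'mae' compares one-hot targets with softmax outputs, so it is not in quality points.
# Decode predictions back to quality scores (undoing the "- 1" shift) for a comparable MAE.
nn_train_mae = mean_absolute_error(y_train, model.predict(X_train).argmax(axis=1) + 1)
nn_test_mae = mean_absolute_error(y_test, model.predict(X_test).argmax(axis=1) + 1)

print(f'Train Loss: {train_metrics[0]}, Train Accuracy: {train_metrics[1]}, Train MAE: {nn_train_mae}')
print(f'Test Loss: {test_metrics[0]}, Test Accuracy: {test_metrics[1]}, Test MAE: {nn_test_mae}')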


Epoch 1/100
...
Epoch 77/100
[remaining training output truncated]
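
The encoding step in the cell above shifts the quality scores down by 1 before one-hot encoding them into 10 classes, so predictions have to be shifted back up by 1 to land on the original quality scale. The NumPy sketch below illustrates that round trip. The `one_hot` helper is an assumed stand-in that mimics what `to_categorical` does, and the quality scores are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer class indices in the range 0 .. num_classes - 1."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

quality = np.array([6, 5, 7, 3, 9])             # made-up quality scores
encoded = one_hot(quality - 1, num_classes=10)  # same "- 1" shift as the notebook

# Stand-in for softmax output: the one-hot rows themselves play the role of probabilities
predicted_quality = encoded.argmax(axis=1) + 1  # undo the shift

print(encoded.shape)
print(predicted_quality)
```

Because the decoded predictions are on the quality scale, they can be passed to the same `accuracy_score` and `mean_absolute_error` calls used for the simpler model.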

In [105]:
# Logistic regression model building
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predictions
train_preds = log_reg.predict(X_train)
test_preds = log_reg.predict(X_test)

# Calculating accuracy
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

# Calculating MAE
train_mae = mean_absolute_error(y_train, train_preds)
test_mae = mean_absolute_error(y_test, test_preds)

# Output metrics
print(f'Logistic Regression Train Accuracy: {train_acc}')
print(f'Logistic Regression Train MAE: {train_mae}')

print(f'Logistic Regression Test Accuracy: {test_acc}')
print(f'Logistic Regression Test MAE: {test_mae}')


Logistic Regression Train Accuracy: 0.5336906584992342
Logistic Regression Train MAE: 0.5222052067381318
Logistic Regression Test Accuracy: 0.5551020408163265
Logistic Regression Test MAE: 0.5112244897959184
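
The two metrics printed above answer different questions. Accuracy counts only exact matches, while MAE on the quality labels measures how far off a prediction is in quality points, so a test MAE of about 0.51 means predictions miss by roughly half a point on average. The small worked example below uses made-up labels and two illustrative helper functions (`exact_accuracy` and `label_mae`, not library functions) to show how the two can diverge.

```python
def exact_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def label_mae(y_true, y_pred):
    """Average absolute distance between true and predicted labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up quality labels, for illustration only
y_true = [5, 6, 6, 7]
y_pred = [5, 6, 5, 5]

acc = exact_accuracy(y_true, y_pred)  # 2 of 4 correct
mae = label_mae(y_true, y_pred)       # (0 + 0 + 1 + 2) / 4

print(acc)  # 0.5
print(mae)  # 0.75
```

Here the last prediction is wrong by 2 points and the third by only 1, which accuracy treats identically but MAE does not.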
