# How the code is organized

The code is organized into two sections:

**Section 1:** Red Wine Analysis, Model Creation and Saving

**Section 2:** White Wine Analysis, Model Creation and Saving

For deployment, refer to the Flask application in the `deployment` folder.

In [None]:
#import the relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Section 1 - Red Wine Analysis, Model Creation and Saving

In [None]:
# read the dataset

red_wine_data = pd.read_csv('datasets/winequality-red.csv', delimiter =';')
red_wine_data.head()
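
The `delimiter=';'` argument above tells pandas that fields are separated by semicolons rather than commas. As a minimal sketch, the same call can be tried on an in-memory sample; the two rows below are made-up illustration values, not taken from the dataset, and `sample` and `demo` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout
sample = io.StringIO(
    '"alcohol";"pH";"quality"\n'
    '9.4;3.51;5\n'
    '9.8;3.20;6\n'
)

# `sep` and `delimiter` are aliases in pandas.read_csv
demo = pd.read_csv(sample, sep=';')
print(demo.columns.tolist())
print(demo.dtypes)
```

If the separator were left at its default (a comma), each row would be read as a single string column, so the dtype check in the sanity checks would no longer report numeric columns.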

### Develop an ML Model for Red Wine

Developing an ML model involves a number of steps.

Let us adopt the following Machine Learning Pipeline:

1. Sanity Check 
2. EDA/Preprocessing
3. Feature Engineering
4. Model Building
5. Model Saving
6. Model Deployment - this is covered under the `deployment` folder

Once the model is deployed, the pipeline extends to include the following steps:

7. Model in Production
8. Observe model behaviour
9. Obtain updated datasets
10. Repeat steps 1 to 9 as required

_Note: these extended steps are not covered in this exercise_

### Sanity Check

1. Shape and data sufficiency: Check if there are sufficient rows of data for an ML problem
2. Datatypes: Check whether all the columns in the given dataset are numeric
3. Missing Values: Check whether there are missing values
4. Zero-variance: Check if there are any zero-variance columns in the dataset
5. Range of numbers in each column: Check if the values within each column are of the same magnitude
6. Correlation: Check correlation between feature columns & target
7. Target: Check for discrete values

In [None]:
# 1 Shape
red_wine_data.shape

In [None]:
# 2 Datatypes
red_wine_data.dtypes

In [None]:
# 3 Missing values
red_wine_data.info()
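
`info()` reports non-null counts, which then have to be compared against the row count by eye. A possible complement, sketched here on a small hypothetical frame (`demo` and its values are made up), is to count missing entries per column directly:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in `alcohol`
demo = pd.DataFrame({"alcohol": [9.4, np.nan, 10.1], "quality": [5, 6, 5]})

# isna() marks missing cells; sum() counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
print("any missing:", bool(missing_per_column.any()))
```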

In [None]:
# 4 Identify zero variance columns
for col in red_wine_data:
    print(col, red_wine_data[col].value_counts().count())
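
The loop above prints the number of distinct values per column; a column with a count of 1 has zero variance. A minimal sketch that returns such columns directly is shown below. The helper name `zero_variance_columns` and the demo frame are assumptions for illustration.

```python
import pandas as pd

def zero_variance_columns(df):
    """Return the names of columns that hold a single distinct value."""
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Hypothetical frame: column `a` is constant, the others are not
demo = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3], "quality": [5, 6, 5]})
print(zero_variance_columns(demo))
```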

In [None]:
# 5 Range of numbers in each column
for col in red_wine_data.columns:
    print(f"Range of {col}: minimum {red_wine_data[col].min()} & maximum {red_wine_data[col].max()}")

In [None]:
# 6 Relationship between features & the target
red_wine_data.corr()

In [None]:
plt.figure(figsize=(16, 6))
sns.heatmap(red_wine_data.corr(), annot=True);

In [None]:
#7. Target: Check for discrete values
red_wine_data['quality'].value_counts()

### Insights / Sanity Check Conclusions

1. **Shape and data sufficiency: Check if there are sufficient rows of data for an ML problem**
    1. **INSIGHT:** The shape of the data is (1599, 12), i.e., the dataset contains ~1600 observations, far more than the number of columns (12). Hence we can apply ML techniques rather than a rule-based statistical approach.


2. **Datatypes: Check whether all the columns in the given dataset are numeric**
    1. **INSIGHT:** `Dtype` indicates that all columns are numeric
    

3. **Missing Values: Check whether there are missing values**
    1. **INSIGHT:** `Non-Null Count` indicates there are no missing values in the dataset


4. **Zero-variance: Check if there are any zero-variance columns in the dataset**
    1. **INSIGHT:** No zero-variance columns found in the dataset


5. **Range of numbers in each column: Check if the values within each column are of the same magnitude**
    1. **INSIGHT:** Within each column, the values fall within a similar order of magnitude and can be plotted on a single axis


6. **Correlation: Check correlation between feature columns & target**
    1. **INSIGHT:** The columns `pH`, `free sulfur dioxide`, `residual sugar` have very weak correlation (0.00 - 0.20)
    2. **INSIGHT:** The columns `fixed acidity`, `citric acid`, `chlorides`, `total sulfur dioxide`, `density` have weak correlation (0.20 - 0.40)
    3. ***Note:*** *absolute values of correlations were considered*


7. **Other Observations:**
    1. **INSIGHT:** Since (a) the target is given and (b) the target is numeric (a score between 0 and 10), we can treat this as a supervised learning problem and frame it as regression.
    2. **INSIGHT:** The target variable, `quality`, takes a small set of discrete values, so classification methods would also apply. However, we will continue with linear regression in this exercise.
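
The correlation bands used in Insight 6 (absolute values below 0.2, and between 0.2 and 0.4) can be sketched as a small helper that reads the target column of `DataFrame.corr()`. The function name `correlation_bands`, its parameters, and the demo frame are assumptions for illustration; the demo values are made up and are not wine data.

```python
import pandas as pd

def correlation_bands(df, target, weak_from=0.2, moderate_from=0.4):
    """Group feature names by absolute Pearson correlation with `target`."""
    strength = df.corr()[target].drop(target).abs()
    return {
        "very_weak": strength[strength < weak_from].index.tolist(),
        "weak": strength[(strength >= weak_from) & (strength < moderate_from)].index.tolist(),
        "stronger": strength[strength >= moderate_from].index.tolist(),
    }

# Hypothetical columns chosen so that each band receives one feature
demo = pd.DataFrame({
    "flat": [1, -2, 2, -2, 1],
    "mild": [0, 1, 0, 0, 1],
    "steep": [2, 4, 6, 8, 10],
    "quality": [1, 2, 3, 4, 5],
})
print(correlation_bands(demo, "quality"))
```

Running the same helper on `red_wine_data` with `target="quality"` would give the lists to compare against Insights 6A and 6B.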


### EDA/Preprocessing
_(Based on the insights from the sanity check, we can now determine how to process the data.)_

#### Checklist of STANDARD EDA items

1. Strategy for missing data
    1. Action: No missing data, so no action is needed


2. Convert categorical to numeric
    1. Action: No categorical data, so no action is needed


3. Dimensionality reduction/Drop the identified columns
    1. Action: Drop the columns identified in Insights 6A and 6B


4. Check for outliers, normalize data in columns to fit a range (*Optional*)
    1. Action: Per Insight 5A, no outliers were identified

#### Approach:
We will follow a four-step approach as outlined below:

Step 1:
1. First, we will process the complete dataset without dropping any columns.
2. We will build the ML model with the complete data, then test and validate the predictions.

Step 2:
1. As per Insight 6A, we will drop the columns that show very weak correlations: `pH`, `free sulfur dioxide`, `residual sugar`.
2. The dataset will thus have 9 columns (8 features plus the target).
3. We will build the ML model with the remaining data, then test and validate the predictions.

Step 3:
1. As per Insight 6B, we will next drop the columns that show weak correlations: `fixed acidity`, `citric acid`, `chlorides`, `total sulfur dioxide`, `density`.
2. The dataset will thus have 4 columns (3 features plus the target).
3. We will build the ML model with the remaining data, then test and validate the predictions.

Step 4:
1. Compare the accuracy of the three models developed
2. Choose the best model for deployment
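
The staged approach above can be sketched as one loop that fits a model per stage and drops further columns each time, without modifying the source frame. This is an illustrative sketch, not the notebook's implementation: `fit_stages`, the synthetic data, and the column names `a` to `d` are assumptions, and it reports the test-split R² from `LinearRegression.score` rather than the accuracy figure used in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_stages(df, target, drop_stages, seed=42):
    """Fit one model per stage; each stage drops its columns on top of earlier ones."""
    results = {}
    dropped = []
    for stage, cols in enumerate([[]] + list(drop_stages), start=1):
        dropped = dropped + list(cols)
        X = df.drop(columns=[target] + dropped)  # df itself is never modified
        y = df[target]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = LinearRegression().fit(X_tr, y_tr)
        results[f"model{stage}"] = {
            "features": list(X.columns),
            "test_r2": model.score(X_te, y_te),
        }
    return results

# Synthetic stand-in data; only column `a` actually drives the target
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
demo["quality"] = 5 + demo["a"] + 0.1 * rng.normal(size=100)

stages = fit_stages(demo, "quality", drop_stages=[["d"], ["b", "c"]])
for name, info in stages.items():
    print(name, info["features"], round(info["test_r2"], 3))
```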


#### Step 1:
Build the ML model with the complete data, test and validate the predictions.

In [None]:
# create the "features and target" data sets
X = red_wine_data.drop('quality',axis=1)
y = red_wine_data['quality']

# split the features and target data sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Model 1 train/test shapes:')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# create and fit a linear regression model
lm_red_wine1 = LinearRegression()
red_model1 = lm_red_wine1.fit(X_train, y_train)

# computing yhat (ie train_predictions) using X (ie train_features)
train_predictions = lm_red_wine1.predict(X_train)
train_prediction = [int(round(x,0)) for x in train_predictions]
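
The cell above predicts on `X_train`, so any score derived from it describes the fit to data the model has already seen. A hedged sketch of scoring both splits follows; it uses synthetic data, hypothetical names (`X_demo`, `demo_model`, and so on), and scikit-learn's `mean_absolute_error` as a stand-in metric rather than the notebook's accuracy figure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a features/target pair
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = 5 + X_demo["f1"] - 0.5 * X_demo["f2"] + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(Xd_train, yd_train)

# Score on data the model has seen and on data it has not
train_mae = mean_absolute_error(yd_train, demo_model.predict(Xd_train))
test_mae = mean_absolute_error(yd_test, demo_model.predict(Xd_test))
print(f"train MAE: {train_mae:.3f}, test MAE: {test_mae:.3f}")
```

A test error far above the training error would suggest overfitting; the two being close is the more reassuring outcome.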

In [None]:
# simple function to compare actual and predicted values
# y and yhat must be row-aligned (e.g. y_train with predictions made on X_train)
def compare_prediction(y, yhat):
    comp_matrix = pd.DataFrame({'Actual': np.asarray(y), 'Predicted': np.asarray(yhat)})
    comp_matrix['Err'] = abs(comp_matrix['Actual'] - comp_matrix['Predicted'])
    comp_matrix['PctErr'] = comp_matrix['Err'] / comp_matrix['Actual'] * 100
    mean_value = np.mean(comp_matrix['PctErr'])
    return comp_matrix, mean_value

In [None]:
# compare actual and predicted values on the training split
comp_matrix, mean = compare_prediction(y_train, train_prediction)
print("Model 1 prediction comparison and mean error:", comp_matrix, mean)

accuracy1 = round((100-mean),2)
print('Model1 accuracy =', accuracy1)
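
The accuracy figure above is 100 minus the mean absolute percentage error of the rounded predictions, which is not the same as the share of exactly correct predictions. A small worked example with made-up scores shows the difference: two of the four predictions below are off by one point, yet the figure comes out at 90.83 while the exact-match rate is 50.0.

```python
import numpy as np

actual = np.array([5, 6, 5, 4])     # hypothetical quality scores
predicted = np.array([5, 5, 6, 4])  # hypothetical rounded predictions

# Per-row percentage error, as in the PctErr column
pct_err = np.abs(actual - predicted) / actual * 100
score = round(float(100 - pct_err.mean()), 2)

# Share of rows where the rounded prediction equals the actual score
exact_match = float((actual == predicted).mean() * 100)

print(score)
print(exact_match)
```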

#### Step 2: Drop columns showing very weak correlations (0.0 - 0.2)
1. Drop pH, free sulfur dioxide, residual sugar columns.
2. Build the ML model, test and validate the predictions.

In [None]:
# 3. Dimensionality reduction: drop the identified columns
lst = ['pH', 'free sulfur dioxide', 'residual sugar']
red_wine_data.drop(lst, axis =1, inplace = True)
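
Because `inplace=True` modifies `red_wine_data` itself, running the cell above a second time raises a `KeyError` (the columns are already gone), and Model 1 can no longer be rebuilt without reloading the CSV. One possible alternative, sketched on a hypothetical frame, is to keep the original and assign the reduced copy to a new name:

```python
import pandas as pd

# Hypothetical stand-in for the wine frame
demo = pd.DataFrame({"pH": [3.5, 3.2], "alcohol": [9.4, 9.8], "quality": [5, 6]})

# drop() without inplace=True returns a new frame and leaves `demo` untouched
reduced = demo.drop(columns=["pH"])

print(demo.columns.tolist())
print(reduced.columns.tolist())
```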

In [None]:
# create the "features and target" data sets
X = red_wine_data.drop('quality',axis=1)
y = red_wine_data['quality']

# split the features and target data sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Model 2 train/test shapes:')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# create and fit a linear regression model
lm_red_wine2 = LinearRegression()
red_model2 = lm_red_wine2.fit(X_train, y_train)

# computing yhat (ie train_predictions) using X (ie train_features)
train_predictions = lm_red_wine2.predict(X_train)
train_prediction = [int(round(x,0)) for x in train_predictions] 

In [None]:
# compare actual and predicted values
comp_matrix, mean = compare_prediction(y_train, train_prediction)
print("Model 2 prediction comparison and mean error:", comp_matrix, mean)

accuracy2 = round((100-mean),2)
print("Model 2 accuracy =", accuracy2)

#### Step 3: Drop columns showing weak correlations (0.2 - 0.4)
1. Drop columns fixed acidity, citric acid, chlorides, total sulfur dioxide, density.
2. Build the ML model, test and validate the predictions.

In [None]:
# 3. Dimensionality reduction: drop the identified columns
lst = ['fixed acidity', 'citric acid', 'chlorides', 'total sulfur dioxide', 'density']
red_wine_data.drop(lst, axis =1, inplace = True)

In [None]:
# create the "features and target" data sets
X = red_wine_data.drop('quality',axis=1)
y = red_wine_data['quality']

# split the features and target data sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Model 3 train/test shapes:')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# create and fit a linear regression model
lm_red_wine3 = LinearRegression()
red_model3 = lm_red_wine3.fit(X_train, y_train)

# computing yhat (ie train_predictions) using X (ie train_features)
train_predictions = lm_red_wine3.predict(X_train)
train_prediction = [int(round(x,0)) for x in train_predictions]

In [None]:
# compare actual and predicted values
comp_matrix, mean = compare_prediction(y_train, train_prediction)
print("Model 3 prediction comparison and mean error:", comp_matrix, mean)

accuracy3 = round((100-mean),2)
print("Model 3 accuracy =", accuracy3)

#### Step 4:

Compare the accuracy of the three models developed.


In [None]:
print("Model1 Accuracy {}".format(accuracy1))
print("Model2 Accuracy {}".format(accuracy2))
print("Model3 Accuracy {}".format(accuracy3))

#### Choose the best model for deployment.

Since the three models give almost the same accuracy, we choose Model 3 because it uses the fewest features.
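
The selection rule stated above (near-equal accuracy, so prefer fewer features) can be written down explicitly. This is a sketch under assumptions: `pick_simplest`, the `tolerance` parameter, and the accuracy numbers are hypothetical and are not the notebook's results.

```python
def pick_simplest(candidates, tolerance=1.0):
    """Pick the model with the fewest features among those within
    `tolerance` accuracy points of the best one.

    candidates: list of (name, accuracy, n_features) tuples.
    """
    best = max(acc for _, acc, _ in candidates)
    close_enough = [c for c in candidates if best - c[1] <= tolerance]
    # Among near-ties, prefer the model with fewer features
    return min(close_enough, key=lambda c: c[2])[0]

# Hypothetical accuracies for illustration only
demo_candidates = [("model1", 91.2, 11), ("model2", 91.0, 8), ("model3", 90.8, 3)]
print(pick_simplest(demo_candidates))
```

With a tighter tolerance the rule falls back to the most accurate model, which makes the trade-off between simplicity and accuracy explicit.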

### Red Wine: Model Saving

In [None]:
# save red_model3 as per analysis
with open('models/red_wine_model.pkl', 'wb') as model_file:
    pickle.dump(obj=red_model3, file=model_file)

In [None]:
# reload the model from disk and check that it was saved properly
with open('models/red_wine_model.pkl', 'rb') as model_file:
    lr_model = pickle.load(model_file)
print(lr_model)
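
Printing the reloaded object confirms only that unpickling succeeded. A stronger check, sketched below with a toy model and a temporary directory (all names and values here are hypothetical), is to confirm that the restored model reproduces the original model's predictions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on made-up data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_demo, y_demo)

# Round-trip through a pickle file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(fitted, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```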

## Section 2 - White Wine Analysis, Model Creation and Saving
We will follow the same steps as we did for Red Wine analysis.

In [None]:
# Read the white wine quality dataset
white_wine_data = pd.read_csv('datasets/winequality-white.csv', delimiter = ';')
white_wine_data.head()

### Develop an ML Model as per the Pipeline
Refer to the corresponding red wine section.

### Sanity Check

In [None]:
# 1 Shape
white_wine_data.shape

In [None]:
# 2 Datatypes
white_wine_data.dtypes

In [None]:
# 3 Missing values
white_wine_data.info()

In [None]:
# 4 Identify zero variance columns
for col in white_wine_data:
    print(col, white_wine_data[col].value_counts().count())

In [None]:
# 5 Range of numbers in each column
for col in white_wine_data.columns:
    print(f"Range of {col}: minimum {white_wine_data[col].min()} & maximum {white_wine_data[col].max()}")

In [None]:
# 6 Relationship between features & the target
white_wine_data.corr()

In [None]:
plt.figure(figsize=(16, 6))
sns.heatmap(white_wine_data.corr(), annot=True);

In [None]:
#7. Target: Check for discrete values
white_wine_data['quality'].value_counts()

### Insights / Sanity Check Conclusions


1. **Shape and data sufficiency: Check if there are sufficient rows of data for an ML problem**
    1. **INSIGHT:** The shape of the data is (4898, 12), i.e., the dataset contains ~4900 observations, far more than the number of columns (12). Hence we can apply ML techniques rather than a rule-based statistical approach.


2. **Datatypes: Check whether all the columns in the given dataset are numeric**
    1. **INSIGHT:** `Dtype` indicates that all columns are numeric
    

3. **Missing Values: Check whether there are missing values**
    1. **INSIGHT:** `Non-Null Count` indicates there are no missing values in the dataset


4. **Zero-variance: Check if there are any zero-variance columns in the dataset**
    1. **INSIGHT:** No zero-variance columns found in the dataset


5. **Range of numbers in each column: Check if the values within each column are of the same magnitude**
    1. **INSIGHT:** Within each column, the values fall within a similar order of magnitude


6. **Correlation: Check correlation between feature columns & target**
    1. **INSIGHT:** The columns `fixed acidity`, `volatile acidity`, `citric acid`, `residual sugar`, `total sulfur dioxide`, `free sulfur dioxide`, `pH`, `sulphates` have very weak correlation (0.00 - 0.20)
    2. **INSIGHT:** The columns `chlorides`, `density` have weak correlation (0.20 - 0.40)
    3. ***Note:*** *absolute values of correlations were considered*


7. **Other Observations:**
    1. **INSIGHT:** Since (a) the target is given and (b) the target is numeric (a score between 0 and 10), we can treat this as a supervised learning problem and frame it as regression.
    2. **INSIGHT:** The target variable, `quality`, takes a small set of discrete values, so classification methods would also apply. However, we will continue with linear regression in this exercise.


### EDA/Preprocessing
Refer to the EDA steps in the red wine analysis.

#### Checklist of STANDARD EDA items

1. Strategy for missing data
    1. Action: No missing data, so no action is needed


2. Convert categorical to numeric
    1. Action: No categorical data, so no action is needed


3. Dimensionality reduction/Drop the identified columns
    1. Action: Drop the columns identified in Insights 6A and 6B


4. Check for outliers, normalize data in columns to fit a range (*Optional*)
    1. Action: Per Insight 5A, no outliers were identified

#### White Wine Analysis Approach After Insights:
We will follow a four-step approach as outlined below:

Step 1:
1. First, we will process the complete dataset without dropping any columns.
2. We will build the ML model with the complete data, then test and validate the predictions.

Step 2:
1. As per Insight 6A, we will drop the columns that show very weak correlations: `fixed acidity`, `volatile acidity`, `citric acid`, `residual sugar`, `total sulfur dioxide`, `free sulfur dioxide`, `pH`, `sulphates`.
2. The dataset will thus have 4 columns (3 features plus the target).
3. We will build the ML model with the remaining data, then test and validate the predictions.

Step 3:
1. As per Insight 6B, we will next drop the columns that show weak correlations: `chlorides`, `density`.
2. The dataset will thus have 2 columns (`alcohol` plus the target).
3. We will build the ML model with the remaining data, then test and validate the predictions.

Step 4:
1. Compare the accuracy of the three models developed
2. Choose the best model for deployment


#### Step 1:
Build the ML model with the complete data, test and validate the predictions.

In [None]:
# create the "features and target" data sets
X = white_wine_data.drop('quality',axis=1)
y = white_wine_data['quality']

# split the features and target data sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('White Wine Model 1 train/test shapes:')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# create and fit a linear regression model
lm_white_wine1 = LinearRegression()
white_model1 = lm_white_wine1.fit(X_train, y_train)

# computing yhat (ie train_predictions) using X (ie train_features)
train_predictions = lm_white_wine1.predict(X_train)
train_prediction = [int(round(x,0)) for x in train_predictions]

In [None]:
# compare actual and predicted values
comp_matrix, mean = compare_prediction(y_train, train_prediction)
print("White Wine Model 1 prediction comparison and mean error:", comp_matrix, mean)

accuracy1 = round((100-mean),2)
print("White Wine Model 1 accuracy =", accuracy1)

#### Step 2: Drop columns showing very weak correlations (0.0 - 0.2)
1. Drop the columns `fixed acidity`, `volatile acidity`, `citric acid`, `residual sugar`, `total sulfur dioxide`, `free sulfur dioxide`, `pH`, `sulphates`.
2. Build the ML model, test and validate the predictions.

In [None]:
# 3. Dimensionality reduction: drop the identified columns
lst = ['fixed acidity', 'volatile acidity', 'citric acid','residual sugar',
       'total sulfur dioxide', 'free sulfur dioxide','pH','sulphates']
white_wine_data.drop(lst, axis=1,inplace=True)

In [None]:
# create the "features and target" data sets
X = white_wine_data.drop('quality',axis=1)
y = white_wine_data['quality']

# split the features and target data sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('White Wine Model 2 train/test shapes:')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# create and fit a linear regression model
lm_white_wine2 = LinearRegression()
white_model2 = lm_white_wine2.fit(X_train, y_train)

# computing yhat (ie train_predictions) using X (ie train_features)
train_predictions = lm_white_wine2.predict(X_train)
train_prediction = [int(round(x,0)) for x in train_predictions]

In [None]:
# compare actual and predicted values
comp_matrix, mean = compare_prediction(y_train, train_prediction)
print("White Wine Model 2 prediction comparison and mean error:", comp_matrix, mean)

accuracy2 = round((100-mean),2)
print("White Wine Model 2 accuracy =", accuracy2)

#### Step 3: Drop columns showing weak correlations (0.2 - 0.4)
1. Drop columns `chlorides` and `density`
2. Build the ML model, test and validate the predictions.

Note: this leaves a single feature column, `alcohol`. A one-feature model is unlikely to be a sound basis for predicting quality, but we build it anyway for comparison.
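
With a single feature, `LinearRegression` reduces to fitting a straight line, quality ≈ slope × alcohol + intercept. The sketch below illustrates this on made-up values that follow an exact linear relation; the numbers and the names `x_demo`, `y_demo`, and `line` are hypothetical and say nothing about the real coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical alcohol values and a made-up exact relation y = 0.5 * x + 1
x_demo = np.array([[9.0], [10.0], [11.0], [12.0]])
y_demo = 0.5 * x_demo.ravel() + 1.0

line = LinearRegression().fit(x_demo, y_demo)
slope = float(line.coef_[0])
intercept = float(line.intercept_)
print(slope, intercept)
```

Every prediction from such a model depends on one number, so two wines with the same alcohol content always receive the same predicted quality.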

In [None]:
# 3. Dimensionality reduction: drop the identified columns
lst = ['chlorides', 'density']
white_wine_data.drop(lst, axis=1,inplace=True)

In [None]:
# create the "features and target" data sets
X = white_wine_data.drop('quality',axis=1)
y = white_wine_data['quality']

# split the features and target data sets into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('White Wine Model 3 train/test shapes:')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# create and fit a linear regression model
lm_white_wine3 = LinearRegression()
white_model3 = lm_white_wine3.fit(X_train, y_train)

# computing yhat (ie train_predictions) using X (ie train_features)
train_predictions = lm_white_wine3.predict(X_train)
train_prediction = [int(round(x,0)) for x in train_predictions]

In [None]:
# compare actual and predicted values
comp_matrix, mean = compare_prediction(y_train, train_prediction)
print("White Wine Model 3 prediction comparison and mean error:", comp_matrix, mean)

accuracy3 = round((100-mean),2)
print("White Wine Model 3 accuracy =", accuracy3)

#### Step 4:

Compare the accuracy of the three models developed.


In [None]:
print("Model1 Accuracy {}".format(accuracy1))
print("Model2 Accuracy {}".format(accuracy2))
print("Model3 Accuracy {}".format(accuracy3))

#### Choose the best model for deployment.
Model 3 predicts quality from a single column, `alcohol`, which is too narrow a basis to rely on. We will be conservative and choose Model 1 going forward.

### White Wine Model Saving

In [None]:
# save white_model1 as per analysis
with open('models/white_wine_model.pkl', 'wb') as model_file:
    pickle.dump(white_model1, model_file)

In [None]:
# reload the model from disk and check that it was saved properly
with open('models/white_wine_model.pkl', 'rb') as model_file:
    lr_model = pickle.load(model_file)
print(lr_model)