### Objective: To predict the quality of wine on a scale of 1-10 (bad to good)

In [1]:
# Importing the basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# For model building and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score,confusion_matrix, ConfusionMatrixDisplay, classification_report, precision_score,roc_auc_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
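# To suppress warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)  # ConvergenceWarning('ignore') only built an unused exception object
warnings.filterwarnings('ignore')  # silence all remaining warnings as well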
# For Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [2]:
df= pd.read_csv(r"Y:\Data\Projects\Machine Learning Project\notebooks\data\QualityPrediction.csv")

In [3]:
# Creating a copy of raw data
df_raw= df.copy()

In [4]:
df.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


#### Observations
All the variables are numerical in nature

Checking for duplicate values

- Duplicate observations will not provide any additional information to the model and hence should be dropped
- Furthermore, when the dataset is split into training and test sets, identical observations may end up in both. This leaks test information into training and can make the test score look better than the model's true performance on unseen data
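As a minimal sketch of the duplicate check described above, the toy frame below (hypothetical values, not rows from the wine dataset) shows what `duplicated()` counts and what `drop_duplicates()` keeps.

```python
import pandas as pd

# Toy frame with hypothetical values; rows 0 and 2 are exact duplicates.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.5],
    "pH": [3.51, 3.20, 3.51, 3.35],
    "quality": [5, 5, 5, 6],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduplicated.shape)
```

Only the later copies are counted as duplicates, which is why the number of rows removed equals the count reported by `duplicated().sum()`.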

In [114]:
print(df.duplicated().sum())

240


There are a total of 240 duplicate entries

In [115]:
df[df.duplicated()]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
4,7.4,0.700,0.00,1.90,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
11,7.5,0.500,0.36,6.10,0.071,17.0,102.0,0.99780,3.35,0.80,10.5,5
27,7.9,0.430,0.21,1.60,0.106,10.0,37.0,0.99660,3.17,0.91,9.5,5
40,7.3,0.450,0.36,5.90,0.074,12.0,87.0,0.99780,3.33,0.83,10.5,5
65,7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.99620,3.41,0.39,10.9,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1563,7.2,0.695,0.13,2.00,0.076,12.0,20.0,0.99546,3.29,0.54,10.1,5
1564,7.2,0.695,0.13,2.00,0.076,12.0,20.0,0.99546,3.29,0.54,10.1,5
1567,7.2,0.695,0.13,2.00,0.076,12.0,20.0,0.99546,3.29,0.54,10.1,5
1581,6.2,0.560,0.09,1.70,0.053,24.0,32.0,0.99402,3.54,0.60,11.3,5


Removing the duplicate entries

In [116]:
df.drop_duplicates(keep='first', inplace = True)

In [117]:
df.shape

(1359, 12)

Now there are a total of 1359 observations and 12 columns (11 features and 1 target variable)

In [118]:
df.reset_index(drop=True, inplace=True ) # drop = True to avoid creating a new column

In [119]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5


In [6]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

Log transformation is one way to reduce skewness (especially right skew), bring a feature closer to a normal distribution, and dampen the effect of outliers.

It is especially helpful for models that assume a normal distribution, such as linear regression.

Here we also try transforming the skewed features of the dataset (for logistic regression) to handle skewness and outliers.
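The effect of the log transformation on skewness can be illustrated on synthetic data. The sketch below assumes a log-normal sample as a stand-in for a right-skewed column; it does not use the wine data itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature (log-normal), a hypothetical stand-in
# for a column such as "residual sugar".
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = skewed.skew()
# log1p(x) = log(1 + x), which is also defined for x = 0.
skew_after = np.log1p(skewed).skew()

print(round(skew_before, 2), round(skew_after, 2))
```

`log1p` is used rather than `log` because some features (for example "citric acid") contain zeros, where a plain logarithm is undefined.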

In [None]:
# transformed_df = np.log1p(df.drop(['quality', 'pH', 'density'], axis=1))
# The transformation only makes sense for skewed (especially right-skewed) numerical features,
# so the target variable 'quality' (the only categorical variable) is excluded.

In [122]:
# transformed_df = pd.concat([transformed_df,df[['pH', 'density','quality']]],axis=1)

**This will be the first preprocessing step used in the transformer**

**However, instead of dropping the columns that should not be transformed, we need to list the columns that should be transformed**

In [123]:
col_for_log_trans= df.columns.drop(['density', 'pH', 'quality']).tolist()
col_for_log_trans

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'sulphates',
 'alcohol']

In [124]:
log_transformation= FunctionTransformer(func= np.log1p, feature_names_out='one-to-one')
# feature_names_out='one-to-one': tells scikit-learn to retain column names

In [125]:
# transformed_df

In [126]:
# transformed_df.columns

#### Observations
- "fixed acidity" has a moderate correlation with several features: "citric acid", "density" and "pH"
- "free sulfur dioxide" and "total sulfur dioxide" also have a moderate correlation
- These correlations become stronger for the transformed data (except between "citric acid" and "fixed acidity")
- We can also check VIF values (after scaling)
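A VIF check can be sketched without extra dependencies. The example below uses synthetic columns with hypothetical names and relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The `variance_inflation_factor` function imported from statsmodels should give comparable values when an intercept column is included, though that is not verified here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

# Hypothetical features: "b" is built from "a", "c" is independent.
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=n),
    "c": rng.normal(size=n),
})

# VIF_j = 1 / (1 - R_j^2); the diagonal of the inverse correlation
# matrix yields the same quantity for every column at once.
corr = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns)

print(vif.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cut-off is a judgment call.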

Dropping the features "fixed acidity" and "free sulfur dioxide" owing to multicollinearity

In [127]:
# df.drop(['fixed acidity', 'free sulfur dioxide'],axis=1,inplace=True)

**Second step of the transformation**

**Only the selected columns are transformed; the remaining columns are dropped because remainder='drop'**

In [None]:
required_col = [col for col in df.columns if col not in ['fixed acidity', 'free sulfur dioxide', 'quality']]
required_col

['volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

## Model Creation

Models will be created based on the following algorithms-
- Logistic Regression on original data
- Decision Tree
- Random Forest
- K Nearest Neighbours
- Naive Bayes
- Support Vector Machine
- Logistic Regression on transformed data


#### Creating independent and dependent variables

In [129]:
x= df.drop('quality',axis=1) # Independent variable to be used for all models
y=df.quality # Target (dependent) variable to be used for all models

In [130]:
x,y

(      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
 0               7.4             0.700         0.00             1.9      0.076   
 1               7.8             0.880         0.00             2.6      0.098   
 2               7.8             0.760         0.04             2.3      0.092   
 3              11.2             0.280         0.56             1.9      0.075   
 4               7.4             0.660         0.00             1.8      0.075   
 ...             ...               ...          ...             ...        ...   
 1354            6.8             0.620         0.08             1.9      0.068   
 1355            6.2             0.600         0.08             2.0      0.090   
 1356            5.9             0.550         0.10             2.2      0.062   
 1357            5.9             0.645         0.12             2.0      0.075   
 1358            6.0             0.310         0.47             3.6      0.067   
 
       free su

#### Splitting the data into training and testing data

In [131]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.2,random_state=42)

In [132]:
x_train.shape,x_test.shape, y_train.shape, y_test.shape

((1087, 11), (272, 11), (1087,), (272,))
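The split above is purely random. Because some quality categories are rare, one option (not used in this notebook) is a stratified split, which keeps class proportions the same in both parts. A minimal sketch on a hypothetical imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class 0, 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Stratification requires at least two samples in every class, so it would need checking against the rarest quality categories before being applied here.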

**The third step for transformer**

#### Feature Scaling

#### Preprocessing before feeding into the model

In [None]:
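# For distance-based models
# Note: every ColumnTransformer entry must be a (name, transformer, columns) tuple.
# The original ('imputer', SimpleImputer(strategy='median')) entry had no column list and
# fails at fit time; it is commented out because df.info() shows no missing values.
preprocessor_scale = ColumnTransformer(transformers=[
    # ('imputer', SimpleImputer(strategy='median'), required_col),
    ('log transformation', log_transformation, col_for_log_trans),
    ('scaling', StandardScaler(), required_col)
], remainder='drop')  # Remaining columns are dropped owing to multicollinearity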

In [None]:
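# For non-distance-based models
# Note: the two-element imputer entry is commented out here as well; it lacks a column list
# and the data has no missing values.
preprocessor_not_scale = ColumnTransformer(transformers=[
    # ('imputer', SimpleImputer(strategy='median'), col_for_log_trans),
    ('log transformation', log_transformation, col_for_log_trans),
], remainder='drop')  # Remaining columns are dropped owing to multicollinearity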

In [135]:
#Preprocessing of distance based(requiring scaling) models
x_train_log_scaled= preprocessor_scale.fit_transform(x_train)
x_train_log_scaled

array([[ 2.29253476,  0.3220835 ,  0.33647224, ..., -0.25982192,
         0.51985077,  1.98424001],
       [ 2.02814825,  0.53062825,  0.07696104, ...,  0.85726724,
        -0.45848943, -0.2158671 ],
       [ 2.05412373,  0.44468582,  0.0295588 , ...,  0.85726724,
        -0.17074231, -0.39920936],
       ...,
       [ 2.05412373,  0.3852624 ,  0.07696104, ..., -0.06268854,
        -0.05564346, -0.76589387],
       [ 2.29253476,  0.27763174,  0.27002714, ..., -0.91693319,
        -0.6311377 , -0.03252484],
       [ 2.31253542,  0.29266961,  0.35065687, ..., -0.85122206,
        -0.6311377 ,  0.88418646]], shape=(1087, 18))

In [136]:
x_test_log_scaled= preprocessor_scale.transform(x_test)

In [137]:
#Preprocessing of non-distance based(not requiring scaling) models
x_train_log_not_scaled= preprocessor_not_scale.fit_transform(x_train)

In [138]:
x_test_log_not_scaled= preprocessor_not_scale.transform(x_test)

In [139]:
print(x_test_log_scaled.shape)  # Should be (n_samples, n_features), not (n_samples, n_features, 1)

(272, 18)


#### List of classification algorithms to train the model

In [140]:
# A dictionary comprising of list of models
models = {
'Logistic Regression on original data' : LogisticRegression(),
'Logistic Regression on transformed data' : LogisticRegression(solver= 'liblinear'),
'Decision Tree': DecisionTreeClassifier(random_state=42,max_depth=4, criterion='gini'),
'Random Forest': RandomForestClassifier(random_state=42),
'K Nearest Neighbours': KNeighborsClassifier(n_neighbors=3),
'Naive Bayes': GaussianNB(),
'Support Vector Machine': SVC(random_state = 42)
}

#### Evaluation of models

In [141]:
# Function to evaluate the models
accuracy=[]
def evaluation(model_name,model, actual, predicted):
    accuracy.append(round(accuracy_score(actual,predicted),2))
    print(f'\nClassification report and parameters for {model_name}')
    print('*'*65)
    print(classification_report(actual,predicted))
    print('Parameters: ',model.get_params())

#### Model creation on various algorithms

In [150]:
# Scale-sensitive models are trained on the output of preprocessor_scale
distance_models = (LogisticRegression, KNeighborsClassifier, SVC)
# Remaining models (tree, forest, naive Bayes) use the output of preprocessor_not_scale
tree_models = (DecisionTreeClassifier, RandomForestClassifier, GaussianNB)

for name, model in models.items():
    if isinstance(model, distance_models):
        model.fit(x_train_log_scaled, y_train)
        y_pred = model.predict(x_test_log_scaled)
    else:  # any model listed in tree_models
        model.fit(x_train_log_not_scaled, y_train)
        y_pred = model.predict(x_test_log_not_scaled)
    evaluation(name, model, y_test, y_pred)



Classification report and parameters for Logistic Regression on original data
*****************************************************************
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00        11
           5       0.67      0.74      0.70       120
           6       0.58      0.63      0.60       103
           7       0.60      0.48      0.54        31
           8       0.00      0.00      0.00         3

    accuracy                           0.62       272
   macro avg       0.31      0.31      0.31       272
weighted avg       0.58      0.62      0.60       272

Parameters:  {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

Classification 

#### Accuracy Scores can be summarized as-

In [151]:
acc_scores=pd.DataFrame(zip(models.keys(),accuracy),columns=['Models', 'Accuracy Score']).sort_values(by='Accuracy Score',ascending=False)

In [152]:
acc_scores

Unnamed: 0,Models,Accuracy Score
3,Random Forest,0.64
0,Logistic Regression on original data,0.62
1,Logistic Regression on transformed data,0.61
6,Support Vector Machine,0.61
5,Naive Bayes,0.59
2,Decision Tree,0.56
4,K Nearest Neighbours,0.5


#### Inferences
- None of the models can predict quality categories that are absent from the dataset
- f1-score for categories '3', '4' and '8' is 0 for all the models except the one trained with the Naive Bayes algorithm (support for these categories is also low)
- Transformed data was expected to perform better with logistic regression, but it did not (0.61 versus 0.62). Note that in the loop above both logistic regression models are fit on the same output of `preprocessor_scale`, so they differ only in solver
- The model trained with the Random Forest algorithm has the highest accuracy score (0.64), although logistic regression (0.62) and SVM (0.61) have comparable values
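The gap between overall accuracy and the zero f1-scores for rare categories can be reproduced on a small hand-made example. The labels below are hypothetical and only illustrate why accuracy alone hides the problem while the macro-averaged f1-score exposes it.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: two common classes (5, 6) and two rare ones (3, 8).
y_true = [5, 5, 5, 5, 6, 6, 6, 6, 3, 8]
# The rare classes are never predicted.
y_hat = [5, 5, 5, 5, 6, 6, 6, 6, 5, 6]

acc = accuracy_score(y_true, y_hat)
# Macro averaging gives every class equal weight, rare or not.
macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)
# Per-class scores, ordered by sorted label: 3, 5, 6, 8.
per_class_f1 = f1_score(y_true, y_hat, average=None, zero_division=0)

print(acc, round(macro_f1, 2), per_class_f1.round(2))
```

This mirrors the classification reports above, where the weighted average stays near the accuracy while the macro average is roughly half of it.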

#### Cross Validation: To evaluate how well the machine learning model generalizes to unseen data

In [158]:
rf = RandomForestClassifier()
# Perform 5-fold cross-validation
scores = cross_val_score(rf, x_train_log_not_scaled, y_train, cv=5, scoring='accuracy')
print("Cross-validation scores:", np.round(scores,2))
print("Mean accuracy:", np.round(scores.mean(),2))

Cross-validation scores: [0.6  0.6  0.61 0.54 0.58]
Mean accuracy: 0.59


##### Observations
- Mean cross-validated accuracy (0.59) is noticeably lower than the accuracy on the single test split (0.64)
- This suggests the single-split score was optimistic and may point to overfitting
- Logistic regression is the next best model

#### Trying cross validation on logistic regression

In [159]:
lr = LogisticRegression(max_iter=1000, random_state=42)
# Perform 5-fold cross-validation
scores = cross_val_score(lr, x_train_log_scaled, y_train, cv=5, scoring='accuracy')
print("Cross-validation scores:", np.round(scores,2))
print("Mean accuracy:", np.round(scores.mean(),2))

Cross-validation scores: [0.64 0.62 0.53 0.52 0.59]
Mean accuracy: 0.58


#### Observations
- For logistic regression too, cross-validated accuracy (0.58) is lower than the test-split accuracy (0.62)

#### Trying cross validation on SVM

In [160]:
svm= SVC()
scores = cross_val_score(svm, x_train_log_scaled, y_train, cv=5, scoring='accuracy')
print("Cross-validation scores:", np.round(scores, 2))
print("Mean accuracy:", np.round(scores.mean(), 2))

Cross-validation scores: [0.57 0.61 0.55 0.56 0.58]
Mean accuracy: 0.57


- Accuracy also falls for SVM (0.57 cross-validated versus 0.61 on the test split)
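In the cross-validation runs above, the features were preprocessed once on the full training set before being split into folds, so each validation fold has already influenced the scaler. One way to avoid this, sketched here on synthetic data, is to pass a pipeline to `cross_val_score` so that scaling is refit inside every fold. The data and model settings are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the wine features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# The scaler is refit on the training part of every fold,
# so the validation part never leaks into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

print(cv_scores.round(2), cv_scores.mean().round(2), cv_scores.std().round(2))
```

Reporting the standard deviation alongside the mean also helps judge whether differences of a few points between models are meaningful.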

#### Hyperparameter tuning

##### Trying to optimize the random forest model by tuning its hyperparameters

In [161]:
# Instantiating random forest object
rf = RandomForestClassifier(max_features='sqrt', random_state=42)

In [162]:
# Dictionary mapping each hyperparameter to the candidate values to try
parameters = {
'criterion': ['gini', 'entropy'],
'max_depth': [None,5,10],
'max_features':['sqrt',None],
'bootstrap': [True, False],
# An integer max_samples is an absolute number of samples drawn per tree,
# and it can only be set when bootstrap=True
'max_samples':[None,2,4],
'max_leaf_nodes':[None,5,10]
}
# Creating a GridSearchCV object
grid_search = GridSearchCV(estimator=rf, cv=5,param_grid = parameters, verbose=1, n_jobs=1,return_train_score=True)

# Fitting the training set on this object
grid_search.fit(x_train_log_not_scaled, y_train)
# Printing the best parameters
print("Best Hyperparameters:")
print(grid_search.best_params_)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best Hyperparameters:
{'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'max_samples': None}


In [164]:
best_model = grid_search.best_estimator_
y_pred2 = best_model.predict(x_test_log_not_scaled)
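Beyond `best_params_`, the fitted search object also exposes the cross-validated score of the winning candidate and a table of all candidates, which helps judge whether tuning changed anything. A minimal sketch on synthetic data, with a deliberately small hypothetical grid:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# A small, hypothetical grid so the sketch runs quickly.
small_grid = {"max_depth": [3, None], "criterion": ["gini", "entropy"]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid=small_grid, cv=3, scoring="accuracy", return_train_score=True,
)
search.fit(X_toy, y_toy)

# One row per candidate; a large train/test gap hints at overfitting.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(search.best_score_)
print(results)
```

Comparing `mean_train_score` with `mean_test_score` per candidate is the reason `return_train_score=True` is worth setting.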

In [None]:
# rf.set_params(
# bootstrap= True,
# criterion= 'gini',
# max_depth = None,
# max_features = 'sqrt',
# oob_score = True,
# max_samples = None,
# max_leaf_nodes = None
# )

In [None]:
# rf.fit(x_train, y_train)
# y_pred2= rf.predict(x_test)

In [None]:
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00        11
           5       0.70      0.73      0.72       120
           6       0.59      0.65      0.62       103
           7       0.61      0.55      0.58        31
           8       0.00      0.00      0.00         3

    accuracy                           0.63       272
   macro avg       0.32      0.32      0.32       272
weighted avg       0.60      0.63      0.62       272



Tuning did not improve the model: test accuracy is 0.63, compared with 0.64 for the default random forest

##### Trying to use robust scaler instead of standard scaler for the logistic regression model

## Conclusion

- The model trained with the Random Forest algorithm performs best among all the models, with a test accuracy of 0.64
- Its drawback is that it has an f1-score of 0 for three categories
- Furthermore, wine quality was supposed to span 10 categories (1-10), but only six are present in the given dataset.
    The model cannot predict categories that are absent from the dataset. A larger dataset might have helped in this respect

In [None]:
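# Note: `preprocessor` was never defined above; `preprocessor_not_scale` is assumed here,
# since a random forest does not need scaled features.
pipe = Pipeline(steps=[
    ('preprocessing', preprocessor_not_scale),
    ('classifier', RandomForestClassifier(random_state=42))
])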

In [None]:
# pipe.fit(x_train,y_train)


In [None]:
# y_pred = pipe.predict(x_test)


In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.6286764705882353
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00        11
           5       0.72      0.71      0.71       120
           6       0.57      0.69      0.62       103
           7       0.58      0.48      0.53        31
           8       0.00      0.00      0.00         3

    accuracy                           0.63       272
   macro avg       0.31      0.31      0.31       272
weighted avg       0.60      0.63      0.61       272

