<a href="https://www.kaggle.com/code/vinciusmayrink/predicting-wine-classes-with-scikit-learn?scriptVersionId=169779859" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Wine classification with Scikit-Learn

# The problem domain

This project aims to develop a machine learning model for predicting wine types (classes) based on a set of analytical characteristics. The model will learn from historical data to identify patterns between the features (e.g., acidity, alcohol content) and the corresponding wine class.

We will explore and compare different machine learning algorithms suitable for classification tasks. Common choices include Support Vector Machines (SVM), Random Forests, or Logistic Regression.

This notebook will serve as a record of our exploration and development process as we build a machine learning model for wine classification.

The data is available at [scikit-learn.org](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine).

# Importing libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing

from pandas.plotting import scatter_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Importing dataset

In [None]:
wine = load_wine(as_frame=True)
print(wine.DESCR)

# Checking the data

In [None]:
print(wine.data.describe())

In [None]:
print(wine.data.info())

# Understanding the data

Before diving into model building, it's crucial to understand the characteristics of our wine dataset. Exploratory Data Analysis (EDA) will help us gain insights into the data, identify potential issues, and guide feature selection for the machine learning model.

The insights gained from EDA will be crucial for data cleaning and feature selection. We might identify features with significant missing values that need to be addressed, or features with low variance that might not be informative for the model. The analysis will also guide which features to include in the model training process.

First, we compute the correlation matrix: Pearson's correlation coefficient between each pair of numerical features, which quantifies the strength and direction of their linear relationship.

In [None]:
# Build a single DataFrame with the features and the class label
# (this must run before df_wine_full is used below)
df_wine_full = pd.concat([wine.data, wine.target], axis=1)
df_wine_full = df_wine_full.rename(columns={'target': 'class'})

corr_matrix = df_wine_full.corr()
print(corr_matrix["class"].sort_values(ascending=False))

In [None]:
attributes = ["alcalinity_of_ash", "nonflavanoid_phenols", "malic_acid", "color_intensity"]
scatter_matrix(df_wine_full[attributes], figsize=(12, 8))
plt.show()
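
As a rough illustration of what `corr()` computes, the sketch below evaluates Pearson's coefficient by hand for one feature against the class label and compares it with the pandas result. The helper name `pearson_r` is hypothetical and not part of the original notebook. Note that the class label is categorical, so treating it as a number here is only a rough heuristic for ranking features.

```python
import numpy as np
from sklearn.datasets import load_wine

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return (x_centered * y_centered).sum() / denominator

wine = load_wine(as_frame=True)
feature = wine.data["alcalinity_of_ash"]

manual_r = pearson_r(feature, wine.target)
pandas_r = feature.corr(wine.target)  # pandas uses Pearson by default

print(manual_r, pandas_r)
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.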

Plotting the pairwise relationships between all available features, colored by class. The features are split into two sets to keep the plots readable.

In [None]:
warnings.filterwarnings('ignore') # ignore seaborn warnings

feature_set_1 = [
    'alcohol', 'malic_acid', 'ash',
    'alcalinity_of_ash', 'magnesium',
    'total_phenols', 'flavanoids', 'class'
]

sns.pairplot(df_wine_full[feature_set_1], hue="class")

In [None]:
feature_set_2 = [
    'nonflavanoid_phenols', 'proanthocyanins',
    'color_intensity', 'hue', 'od280/od315_of_diluted_wines',
    'proline', 'class'
]

sns.pairplot(df_wine_full[feature_set_2], hue="class")

# Model training

We compare two machine learning algorithms suited to classification tasks: Logistic Regression and Random Forest. The data is first split into a training set (80%) and a held-out test set (20%), stratified by class.

In [None]:
X = wine.data[attributes]
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
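
To see what `stratify=y` does, the following sketch compares the class proportions in the full dataset with those in the two splits. It uses the same split parameters as the cell above; the variable names `full_props`, `train_props`, and `test_props` are introduced here for illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fraction of samples belonging to each class, per subset
full_props = y.value_counts(normalize=True).sort_index()
train_props = y_train.value_counts(normalize=True).sort_index()
test_props = y_test.value_counts(normalize=True).sort_index()

print("full: ", full_props.round(3).to_dict())
print("train:", train_props.round(3).to_dict())
print("test: ", test_props.round(3).to_dict())
```

With stratification, the three sets of proportions should be close to each other, which matters on a small dataset where a purely random split could under-represent a class.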

## Logistic Regression

Logistic regression is a statistical method commonly used for classification tasks in machine learning. It estimates the probability of an event happening based on one or more independent variables.
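
As a minimal sketch of the idea, the snippet below fits a logistic regression on the four selected attributes and prints the estimated class probabilities for a few samples. The `StandardScaler` step and `max_iter=1000` are assumptions added here to help the solver converge; they are not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)
attributes = ["alcalinity_of_ash", "nonflavanoid_phenols", "malic_acid", "color_intensity"]
X, y = wine.data[attributes], wine.target

# Scaling is an assumption of this sketch, not a step from the notebook
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
model.fit(X, y)

# One row per sample, one column per class; each row sums to 1
probabilities = model.predict_proba(X.iloc[:3])
predictions = model.predict(X.iloc[:3])

print(probabilities.round(3))
print(predictions)
```

The predicted class is simply the one with the highest estimated probability.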

Tuning the hyperparameters with a grid search:

In [None]:
%%time

param_grid = [
    {'C': 10**np.linspace(-3,3,20)}
]

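# Use penalty='l2' so that C (the inverse regularization strength) takes effect;
# with penalty='none' the C values in the grid would be ignored
log_reg = LogisticRegression(solver='lbfgs', penalty='l2', random_state=42)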
lr_gridsearch = GridSearchCV(log_reg, param_grid, cv=10, scoring='accuracy', 
                             refit=True)
lr_gridsearch.fit(X_train, y_train)

print(lr_gridsearch.best_score_)
print(lr_gridsearch.best_params_)
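
The grid `10**np.linspace(-3, 3, 20)` spaces the candidate values of `C` evenly on a logarithmic scale, which is the same as `np.logspace(-3, 3, 20)`. The sketch below checks that equivalence and shows one possible way to inspect per-candidate scores through `cv_results_`. The smaller 5-point grid, `cv=5`, and `max_iter=5000` are assumptions chosen to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# The notebook's grid, written with logspace
c_values = np.logspace(-3, 3, 20)

wine = load_wine(as_frame=True)
attributes = ["alcalinity_of_ash", "nonflavanoid_phenols", "malic_acid", "color_intensity"]
X, y = wine.data[attributes], wine.target

# Reduced grid for illustration only
small_grid = np.logspace(-3, 3, 5)
search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    {"C": small_grid},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# One row per candidate value of C, best candidates first
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Looking at the whole table, rather than only `best_params_`, shows whether the score is sensitive to `C` or nearly flat across the grid.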

In [None]:
%%time

# penalty='l2' so that the tuned C value actually affects the model
new_log_reg = LogisticRegression(solver='lbfgs', penalty='l2', random_state=42)
new_log_reg.set_params(**lr_gridsearch.best_params_)
scores = cross_val_score(new_log_reg, X_train, y_train, cv=10, scoring="accuracy")
print(scores.mean())

In [None]:
y_train_pred_reg = cross_val_predict(new_log_reg, X_train, y_train, cv=10)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred_reg, normalize="true")
plt.show()

In [None]:
precision_reg = precision_score(y_train, y_train_pred_reg, average=None)
recall_reg = recall_score(y_train, y_train_pred_reg, average=None)
print("Mean precision for logistic regression: " + str(precision_reg.mean()))
print("Mean recall for logistic regression: " + str(recall_reg.mean()))

In [None]:
f1_score_reg = f1_score(y_train, y_train_pred_reg, average=None)
print(f1_score_reg.mean())
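
The cells above compute per-class scores with `average=None` and then take their mean. As a small sketch on made-up labels (not the wine data), this is the same quantity that scikit-learn returns with `average="macro"`:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

for name, metric in [("precision", precision_score),
                     ("recall", recall_score),
                     ("f1", f1_score)]:
    per_class = metric(y_true, y_pred, average=None)   # one score per class
    macro = metric(y_true, y_pred, average="macro")    # unweighted mean over classes
    print(name, per_class.round(3), round(macro, 3))
    assert np.isclose(per_class.mean(), macro)
```

The macro average weights every class equally, regardless of how many samples each class has.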

## Random Forest


Random forest is a powerful and versatile machine learning algorithm that excels in both classification and regression tasks. It operates by constructing a multitude of decision trees at training time, hence the name "forest."
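
To make the "multitude of decision trees" concrete, the sketch below fits a small forest and checks that its class probabilities are the average of the probabilities predicted by its individual trees, which is how scikit-learn's `RandomForestClassifier` combines them. The choice of 25 trees and the use of all 13 features are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()  # plain NumPy arrays are enough for this sketch
X, y = wine.data, wine.target

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Shape: (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged_probas = tree_probas.mean(axis=0)
forest_probas = forest.predict_proba(X)

print(len(forest.estimators_), "trees")
print(np.allclose(averaged_probas, forest_probas))
```

Each tree is trained on a bootstrap sample and considers a random subset of features at each split, so averaging over many such trees reduces the variance of any single one.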

In [None]:
%%time

param_grid = [{
    'max_depth': [2, 4, 8, 16, 32, 64], 
    'min_samples_leaf': [2, 4, 8, 16],
    'n_estimators': [10, 50, 100, 200, 500]
}]

forest_clf = RandomForestClassifier(random_state=42)
rf_gridsearch = GridSearchCV(forest_clf, param_grid, cv=10, scoring='accuracy', refit=True)
rf_gridsearch.fit(X_train, y_train)

print(rf_gridsearch.best_score_)
print(rf_gridsearch.best_params_)

In [None]:
%%time

new_forest_clf = RandomForestClassifier(random_state=42)
new_forest_clf.set_params(**rf_gridsearch.best_params_)
scores = cross_val_score(new_forest_clf, X_train, y_train, cv=10, scoring="accuracy")
print(scores.mean())

In [None]:
y_train_pred_forest = cross_val_predict(new_forest_clf, X_train, y_train, cv=10)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred_forest, normalize="true")
plt.show()

In [None]:
precision_forest = precision_score(y_train, y_train_pred_forest, average=None)
recall_forest = recall_score(y_train, y_train_pred_forest, average=None)
print("Mean precision for random forest: " + str(precision_forest.mean()))
print("Mean recall for random forest: " + str(recall_forest.mean()))

In [None]:
f1_score_forest = f1_score(y_train, y_train_pred_forest, average=None)
print(f1_score_forest.mean())

# Validation

We now evaluate each trained model on the held-out test set. This checks that the model has not simply memorized the training data and can generalize to unseen data.

## Logistic Regression

In [None]:
%%time

# GridSearchCV(refit=True) already refit the best model on the full training set,
# so evaluate that fitted model directly on the held-out test set
final_log_reg = lr_gridsearch.best_estimator_
mean_score = final_log_reg.score(X_test, y_test)  # mean accuracy on the test set
print(mean_score)

In [None]:
# Predict with the model trained on the training set, not one refit on test folds
y_test_pred_reg = lr_gridsearch.best_estimator_.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred_reg, normalize="true")
plt.show()

In [None]:
precision_test_reg = precision_score(y_test, y_test_pred_reg, average=None)
recall_test_reg = recall_score(y_test, y_test_pred_reg, average=None)
print("Mean precision for logistic regression: " + str(precision_test_reg.mean()))
print("Mean recall for logistic regression: " + str(recall_test_reg.mean()))

In [None]:
f1_score_test_reg = f1_score(y_test, y_test_pred_reg, average=None)
print(f1_score_test_reg.mean())

## Random Forest

In [None]:
%%time

# GridSearchCV(refit=True) already refit the best forest on the full training set,
# so evaluate that fitted model directly on the held-out test set
final_forest_clf = rf_gridsearch.best_estimator_
mean_score = final_forest_clf.score(X_test, y_test)  # mean accuracy on the test set
print(mean_score)

In [None]:
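# Predict with the forest trained on the training set, and plot the forest's
# predictions (not the logistic regression ones)
y_test_pred_forest = rf_gridsearch.best_estimator_.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred_forest, normalize="true")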
plt.show()

In [None]:
precision_test_forest = precision_score(y_test, y_test_pred_forest, average=None)
recall_test_forest = recall_score(y_test, y_test_pred_forest, average=None)
print("Mean precision for random forest: " + str(precision_test_forest.mean()))
print("Mean recall for random forest: " + str(recall_test_forest.mean()))

In [None]:
f1_score_test_forest = f1_score(y_test, y_test_pred_forest, average=None)
print(f1_score_test_forest.mean())

# Conclusions

As expected, both models performed slightly worse on the unseen test data than on the training data, which is a common observation in machine learning.

Two key challenges in model performance are overfitting and underfitting. These represent opposite ends of the spectrum in terms of how well a model generalizes to unseen data. In this case, the models exhibit minimal signs of overfitting.
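
One rough way to look for these two failure modes is to compare accuracy on the training set with accuracy on the test set. The sketch below does this for a deliberately constrained forest (`max_depth=1`) and an unconstrained one; both settings and the helper name `train_test_accuracy` are assumptions for illustration and are not the tuned models from this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
attributes = ["alcalinity_of_ash", "nonflavanoid_phenols", "malic_acid", "color_intensity"]
X, y = wine.data[attributes], wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def train_test_accuracy(model):
    """Fit on the training set and return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

shallow_train, shallow_test = train_test_accuracy(
    RandomForestClassifier(max_depth=1, random_state=42)
)
deep_train, deep_test = train_test_accuracy(
    RandomForestClassifier(max_depth=None, random_state=42)
)

print(f"shallow: train={shallow_train:.3f} test={shallow_test:.3f}")
print(f"deep:    train={deep_train:.3f} test={deep_test:.3f}")
```

Low accuracy on both sets suggests underfitting, while a large gap between high training accuracy and lower test accuracy suggests overfitting.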

Comparing the results, both models showed very similar performance on the training and test data. Additionally, the random forest model required significantly more time for training and prediction. Considering these factors, logistic regression emerges as the more efficient choice between the two models evaluated.

# Further reading

* [Other Kaggle datasets](https://www.kaggle.com/datasets)
* [Other Scikit-Learn toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

# Acknowledgements

Materials used for the analysis:

* [Gemini](https://gemini.google.com/)
* [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition](https://learning.oreilly.com/library/view/hands-on-machine-learning/)