Hello everyone,

In this notebook, I will first explore the main characteristics of the variables, and then their relationships with each other. Next, I will create new variables by combining the existing ones with mathematical operations. Finally, I will train a model to predict wine quality on the final variable set.

I will create the new features using an open-source Python library called [Feature-engine](https://feature-engine.readthedocs.io/en/latest/index.html).

Feature-engine classes follow the Scikit-learn API: the fit() method first learns the parameters from the data, and the transform() method then transforms the data using those parameters.

The beauty of using Feature-engine is that we can accommodate all transformations within a Scikit-learn pipeline. In a few lines of code, we can create all the new variables and then train a model on the final dataset. And, when scoring the test set, we only need to feed the raw data to the pipeline to obtain the final predictions.
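
To make the fit / transform idea concrete, here is a minimal sketch of a hand-written transformer placed in a Scikit-learn pipeline. The class `RatioToTrainMean`, the toy data and the logistic regression are all made up for illustration; they are not part of Feature-engine or of this notebook's analysis.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class RatioToTrainMean(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: divides a column by its training-set mean."""

    def __init__(self, variable):
        self.variable = variable

    def fit(self, X, y=None):
        # learn the parameter (the mean) from the data passed to fit
        self.mean_ = X[self.variable].mean()
        return self

    def transform(self, X):
        # apply the learned parameter to any dataset, adding a new column
        X = X.copy()
        X[self.variable + "_ratio"] = X[self.variable] / self.mean_
        return X


# toy data, made up for illustration
X_toy = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

toy_pipe = Pipeline([
    ("ratio", RatioToTrainMean(variable="alcohol")),
    ("model", LogisticRegression()),
])

# fit learns the mean and trains the model; predict only needs the raw data
toy_pipe.fit(X_toy, y_toy)
toy_pred = toy_pipe.predict(X_toy)
```

The pipeline calls fit and transform on each step in order, so the raw dataframe is all we need to pass in at prediction time.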

## I hope you find this kernel useful and if you do, your **UPVOTES** will be highly appreciated.


In [None]:
# let's install Feature-engine

!pip install feature-engine

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import Pipeline


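# import classes from Feature-engine
# (note: later Feature-engine releases replaced these two classes with
# MathFeatures and RelativeFeatures; if this import fails, an earlier
# release of the library is needed)
from feature_engine.creation import MathematicalCombination, CombineWithReferenceFeature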

In [None]:
# Load dataset

data = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

print(data.shape)

data.head()

## Exploratory Data Analysis

Let's have a look at the variables and their relationships.

In [None]:
# check how many wines of each quality there are

# fraction of wines of each quality
(data['quality'].value_counts() / len(data)).sort_index().plot.bar()

# add title and axis labels
plt.title('Wine Quality')
plt.ylabel('Fraction of wines in the data')
plt.xlabel('Wine Quality')
plt.show()

Most wines are of medium to low quality. Only a few are of high quality (>6).

In [None]:
# let's transform the target into binary

# wines with quality of 6 or below will be considered low quality (0),
# wines with quality above 6 will be considered high quality (1)
data['quality'] = np.where(data['quality'] <= 6, 0, 1)

(data['quality'].value_counts() / len(data)).plot.bar()

plt.title('Wine Quality')
plt.ylabel('Fraction of wines in the data')
plt.xlabel('Wine Quality (0 = low, 1 = high)')
plt.show()
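
As a quick check of where the threshold falls, the binarisation can be sketched on a handful of made-up quality scores (the values below are toy data, not taken from the wine dataset):

```python
import numpy as np
import pandas as pd

# toy quality scores, made up for illustration
toy_quality = pd.Series([3, 5, 6, 6, 7, 8])

# scores of 6 or below map to 0 (low), scores above 6 map to 1 (high)
toy_target = pd.Series(np.where(toy_quality <= 6, 0, 1))

# fraction of wines in each class
class_fraction = toy_target.value_counts(normalize=True).sort_index()
```

Note that a score of exactly 6 ends up in the low quality class.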

In [None]:
# let's explore variable distributions with histograms

data.hist(bins=50, figsize=(10,10))

plt.show()

All predictor variables are continuous.

In [None]:
# let's evaluate the mean variable value per wine quality

g = sns.PairGrid(data, x_vars=["quality"], y_vars=data.columns[0:-1])
g.map(sns.barplot)
plt.show()

There doesn't seem to be a difference in pH between wines of low and high quality, but high quality wines tend to have more alcohol, for example.

Similarly, good quality wines tend to have more sulphates but less free and total sulfur dioxide, a compound chemically related to the sulphates.

Good quality wines tend to have more citric acid, yet surprisingly, their pH is not lower. So the pH may be balanced by something else, for example the sulphates.
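
The bars in the plot are simply per-class means, so the same comparison can be read as a table. A minimal sketch with made-up toy values (not the real wine measurements):

```python
import pandas as pd

# toy values, made up for illustration; not taken from the wine dataset
toy = pd.DataFrame({
    "alcohol": [9.5, 10.0, 12.0, 12.5],
    "pH": [3.3, 3.4, 3.3, 3.4],
    "quality": [0, 0, 1, 1],
})

# mean of every variable per quality class (what each bar in the plot shows)
means = toy.groupby("quality").mean()

# difference between high (1) and low (0) quality wines
diff = means.loc[1] - means.loc[0]
```

On the real data, the equivalent call would be data.groupby('quality').mean().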

In [None]:
# now let's explore the data with boxplots

# reorganise for plotting
df = data.melt(id_vars=['quality'])

# capture variables
cols = df.variable.unique()

# plot first 6 columns
g = sns.axisgrid.FacetGrid(df[df.variable.isin(cols[0:6])], col='variable', sharey=False)
g.map(sns.boxplot, 'quality','value')
plt.show()

In [None]:
# plot remaining columns
g = sns.axisgrid.FacetGrid(df[df.variable.isin(cols[6:])], col='variable', sharey=False)
g.map(sns.boxplot, 'quality','value')
plt.show()

In [None]:
data.head()

In [None]:
# the citric acid affects the pH of the wine

plt.scatter(data['citric acid'], data['pH'], c=data['quality'])
plt.xlabel('Citric acid')
plt.ylabel('pH')
plt.show()

In [None]:
# the sulphates may affect the pH of the wine

plt.scatter(data['sulphates'], data['pH'], c=data['quality'])
plt.xlabel('sulphates')
plt.ylabel('pH')
plt.show()

In [None]:
plt.scatter(data['sulphates'], data['citric acid'], c=data['quality'])
plt.xlabel('sulphates')
plt.ylabel('citric acid')
plt.show()

Good quality wines tend to have more citric acid and more sulphates, which may explain why their pH is similar to that of low quality wines.
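
The scatter plots can be complemented with correlation coefficients. A minimal sketch on made-up, perfectly linear toy values (so the coefficients are exactly +1 or -1 up to rounding; the real data will be much noisier):

```python
import pandas as pd

# toy values, made up for illustration; not taken from the wine dataset
toy = pd.DataFrame({
    "citric acid": [0.0, 0.2, 0.4, 0.6],
    "pH": [3.6, 3.4, 3.2, 3.0],
    "sulphates": [0.5, 0.6, 0.7, 0.8],
})

# pairwise Pearson correlation between all columns
corr = toy.corr()
```

On the real data, data[['citric acid', 'sulphates', 'pH']].corr() would give the corresponding table.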

In [None]:
# let's evaluate the relationship between some molecules and the density of the wine

g = sns.PairGrid(data, y_vars=["density"], x_vars=['chlorides','sulphates', 'residual sugar', 'alcohol'])
g.map(sns.regplot)
plt.show()

## Create additional variables

Let's combine variables into new ones to capture additional information.

In [None]:
# combine fixed and volatile acidity to create total acidity
# and mean acidity

combinator = MathematicalCombination(
    variables_to_combine=['fixed acidity', 'volatile acidity'],
    math_operations = ['sum', 'mean'],
    new_variables_names = ['total_acidity', 'average_acidity']
)

data = combinator.fit_transform(data)

# note the new variables at the end of the dataframe
data.head()
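
For reference, here is a plain pandas sketch of what the sum and mean combinations compute, on a few made-up rows. This is an illustration of the arithmetic only, not Feature-engine's implementation:

```python
import pandas as pd

# toy rows, made up for illustration
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "volatile acidity": [0.70, 0.88, 0.28],
})

acids = ["fixed acidity", "volatile acidity"]

# row-wise sum and mean across the selected columns
toy["total_acidity"] = toy[acids].sum(axis=1)
toy["average_acidity"] = toy[acids].mean(axis=1)
```

With only two input variables, the mean is exactly half the sum, so the two new variables carry the same information for a tree-based model.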

In [None]:
# let's combine salts into total minerals and average minerals

combinator = MathematicalCombination(
    variables_to_combine=['chlorides', 'sulphates'],
    math_operations = ['sum', 'mean'],
    new_variables_names = ['total_minerals', 'average_minerals']
)

data = combinator.fit_transform(data)

# note the new variables at the end of the dataframe
data.head()

In [None]:
# let's determine the sulfur dioxide that is not free (total minus free)

combinator = CombineWithReferenceFeature(
    variables_to_combine=['total sulfur dioxide'],
    reference_variables=['free sulfur dioxide'],
    operations=['sub'],
    new_variables_names=['non_free_sulfur_dioxide']
)

data = combinator.fit_transform(data)

# note the new variable at the end of the dataframe
data.head()

In [None]:
# let's calculate the fraction of the total sulfur dioxide that is free

combinator = CombineWithReferenceFeature(
    variables_to_combine=['free sulfur dioxide'],
    reference_variables=['total sulfur dioxide'],
    operations=['div'],
    new_variables_names=['percentage_free_sulfur']
)

data = combinator.fit_transform(data)

# note the new variable at the end of the dataframe
data.head()
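
The subtraction and division against a reference variable can be sketched in plain pandas as follows. The rows are made up for illustration, and the zero-denominator check is an extra precaution that is not part of the original cells:

```python
import numpy as np
import pandas as pd

# toy rows, made up for illustration
toy = pd.DataFrame({
    "free sulfur dioxide": [11.0, 25.0, 15.0],
    "total sulfur dioxide": [34.0, 67.0, 60.0],
})

# 'sub': variable minus reference
toy["non_free_sulfur_dioxide"] = (
    toy["total sulfur dioxide"] - toy["free sulfur dioxide"]
)

# 'div': variable divided by reference
toy["percentage_free_sulfur"] = (
    toy["free sulfur dioxide"] / toy["total sulfur dioxide"]
)

# a zero in the reference column would produce inf, so it is worth checking
has_inf = bool(np.isinf(toy["percentage_free_sulfur"]).any())
```

Despite its name, percentage_free_sulfur is a fraction between 0 and 1, not a value between 0 and 100.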

In [None]:
# let's calculate the ratio of sulphates (sulfur in salt form) to free sulfur dioxide

combinator = CombineWithReferenceFeature(
    variables_to_combine=['sulphates'],
    reference_variables=['free sulfur dioxide'],
    operations=['div'],
    new_variables_names=['percentage_salt_sulfur']
)

data = combinator.fit_transform(data)

# note the new variable at the end of the dataframe
data.head()

In [None]:
# now let's explore the new variables with boxplots

new_vars = [
    'total_acidity',
    'average_acidity',
    'total_minerals',
    'average_minerals',
    'non_free_sulfur_dioxide',
    'percentage_free_sulfur',
    'percentage_salt_sulfur']

# reorganise for plotting
df = data[new_vars+['quality']].melt(id_vars=['quality'])

# capture variables
cols = df.variable.unique()

# plot all the new variables
g = sns.axisgrid.FacetGrid(df[df.variable.isin(cols)], col='variable', sharey=False)
g.map(sns.boxplot, 'quality','value')
plt.show()

## Machine Learning Pipeline

Now we are going to carry out all variable creation within a Scikit-learn Pipeline and add a classifier at the end.

In [None]:
data = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

# make binary target
data['quality'] = np.where(data['quality'] <= 6, 0, 1)

# separate dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['quality'], axis=1),
    data['quality'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

In [None]:
pipe = Pipeline([
    # variable creation
    ('acidity', MathematicalCombination(
        variables_to_combine=['fixed acidity', 'volatile acidity'],
        math_operations = ['sum', 'mean'],
        new_variables_names = ['total_acidity', 'average_acidity']
        )
    ),
    
    ('total_minerals', MathematicalCombination(
        variables_to_combine=['chlorides', 'sulphates'],
        math_operations = ['sum', 'mean'],
        new_variables_names = ['total_minerals', 'average_minerals'],
        )
    ),
    
    ('non_free_sulfur', CombineWithReferenceFeature(
        variables_to_combine=['total sulfur dioxide'],
        reference_variables=['free sulfur dioxide'],
        operations=['sub'],
        new_variables_names=['non_free_sulfur_dioxide'],
        )
    ),
    
    ('perc_free_sulfur', CombineWithReferenceFeature(
        variables_to_combine=['free sulfur dioxide'],
        reference_variables=['total sulfur dioxide'],
        operations=['div'],
        new_variables_names=['percentage_free_sulfur'],
        )
    ),
    
    ('perc_salt_sulfur', CombineWithReferenceFeature(
        variables_to_combine=['sulphates'],
        reference_variables=['free sulfur dioxide'],
        operations=['div'],
        new_variables_names=['percentage_salt_sulfur'],
        )
    ),
    
    # =====  the machine learning model ====
    
    ('gbm', GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=1)),
])

# create new variables, and then train gradient boosting machine
# uses only the training dataset

pipe.fit(X_train, y_train)

In [None]:
# the pipeline takes in the raw data, creates all the new features and then
# makes the prediction with the model trained on the final set of variables

# obtain predictions and determine model performance

pred = pipe.predict_proba(X_train)
print('Train roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))

pred = pipe.predict_proba(X_test)
print('Test roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))
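
A single train / test split gives one estimate of performance. The notebook imports cross_validate without using it, so here is a minimal sketch of how a pipeline could be scored with stratified cross-validation. The data comes from make_classification, an imbalanced synthetic stand-in for the wine data, and the fold settings are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# synthetic, imbalanced stand-in for the wine data (assumption)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

syn_pipe = Pipeline([
    ("gbm", GradientBoostingClassifier(
        n_estimators=10, max_depth=2, random_state=1)),
])

# stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    syn_pipe, X_syn, y_syn,
    cv=cv, scoring="roc_auc", return_train_score=True,
)

mean_test_auc = scores["test_score"].mean()
```

Because the whole pipeline is passed to cross_validate, any feature creation steps would be re-fitted inside each fold, using that fold's training portion only.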

## Feature importance

In [None]:
new_vars = ['total_acidity', 'average_acidity', 'total_minerals', 'average_minerals',
           'non_free_sulfur_dioxide', 'percentage_free_sulfur','percentage_salt_sulfur']

In [None]:
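# the pipeline appends the new variables after the original columns,
# so the feature names are the raw columns followed by new_vars
importance = pd.Series(pipe.named_steps['gbm'].feature_importances_)
importance.index = list(X_train.columns) + new_vars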

importance.sort_values(ascending=False).plot.bar(figsize=(15,5))
plt.ylabel('Feature importance')
plt.show()

We see that some of the variables that we created are somewhat important for the prediction, like average_minerals, total_minerals, and total and average acidity.
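
Typing the new variable names by hand is easy to get wrong. One alternative, sketched below on a toy pipeline, is to take the names from the output of the transformation steps themselves. The function `add_total`, the toy data and the use of FunctionTransformer are all made up for illustration:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_total(X):
    # hypothetical feature creation step: appends the sum of two columns
    X = X.copy()
    X["total"] = X["a"] + X["b"]
    return X


# toy data, made up for illustration
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [6.0, 5.0, 4.0, 3.0, 2.0, 2.0],
})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

toy_pipe = Pipeline([
    ("create", FunctionTransformer(add_total)),
    ("gbm", GradientBoostingClassifier(
        n_estimators=5, max_depth=1, random_state=1)),
])
toy_pipe.fit(X_toy, y_toy)

# run every step except the model to recover the final column names
feature_names = toy_pipe[:-1].transform(X_toy).columns

importance = pd.Series(
    toy_pipe.named_steps["gbm"].feature_importances_,
    index=feature_names,
)
```

Slicing the pipeline with [:-1] returns a new pipeline containing all steps but the last, so the names always match what the model was trained on.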

That is all folks!


## References and further reading

- [Feature-engine](https://feature-engine.readthedocs.io/en/latest/index.html), Python open-source library
- [Python Feature Engineering Cookbook](https://www.packtpub.com/data/python-feature-engineering-cookbook)

## Other Kaggle kernels featuring Feature-engine

- [Feature selection for bank customer satisfaction prediction](https://www.kaggle.com/solegalli/feature-selection-with-feature-engine)
- [Feature engineering and selection for house price prediction](https://www.kaggle.com/solegalli/predict-house-price-with-feature-engine)

