## Red wine regression:
![red wine](https://robbreport.com/wp-content/uploads/2017/06/wine_domred1.jpg?w=1000)
Red wine, often a sign of a luxurious life, comes in a variety of qualities, with prices that depend on them. There has been some research on this topic in the last decade, in which [statistical modeling](https://arxiv.org/pdf/1402.3646.pdf) and [classification algorithms](https://ieeexplore.ieee.org/abstract/document/9104095) have been used and discussed.<br/>
In this notebook, we will perform some EDA of the data, then perform outlier detection and check distributional assumptions on the features, and finally model wine quality with several models (ordinal regression, random forest and XGBoost).<br/>
Here are the different sections of the notebook.<br/>
(1) [Exploratory data analysis and data understanding](#section1)<br/>
(2) [Outlier detection](#section2)<br/>
(3) [Modeling](#section3)<br/>
<br/>
Resources:<br/>
(1) [Ordinal regression using mord api](https://pythonhosted.org/mord/reference.html#mord.OrdinalRidge)<br/>
(2) [mord regression github code](https://github.com/fabianp/mord)<br/>
(3) [sklearn ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)<br/>

## <a id = 'section1'>Exploratory data analysis</a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
print(data.shape)
data.head()

Check which quality values occur in the data.

In [None]:
data['quality'].unique()

Let's shift the quality scores from the 3-8 range to 0-5.

In [None]:
data['quality'] = data['quality'] - 3

In [None]:
data['quality'].value_counts()

Clearly, classes 0, 1 and 5 have very few samples. We will merge 0 and 1 into 2, and merge 5 into 4.

In [None]:
data['quality'] = data['quality'].replace([0,1],2)
data['quality'] = data['quality'].replace(5,4)

In [None]:
print(data['quality'].unique())
data['quality'] = data['quality'] - 2

Hence, the final quality classes are 0, 1 and 2. We will now analyze the features and their relation to the quality variable to get an understanding of the data.
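
The shift-and-merge steps above can also be expressed as one explicit mapping, which makes the final grouping easy to audit. This is only a sketch on a small made-up Series, not on the wine data; `QUALITY_MAP` and `merge_quality` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical mapping from raw quality scores (3-8) to the three merged classes.
# 3, 4, 5 -> 0; 6 -> 1; 7, 8 -> 2, which matches the step-by-step cells above.
QUALITY_MAP = {3: 0, 4: 0, 5: 0, 6: 1, 7: 2, 8: 2}

def merge_quality(quality: pd.Series) -> pd.Series:
    """Map raw quality scores to the merged classes 0, 1, 2."""
    return quality.map(QUALITY_MAP)

raw = pd.Series([3, 4, 5, 6, 7, 8])  # toy input, one example per raw score
merged = merge_quality(raw)
print(merged.tolist())
```

A single dictionary keeps the whole relabeling in one place, at the cost of having to list every raw score explicitly.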

In [None]:
import matplotlib.pyplot as plt
for col in data.columns:
    fig = plt.figure()
    fig.suptitle(col,fontsize = 14)
    plt.hist(data[col].tolist())
    plt.show()

Let's look at each feature in a bit more detail to better understand how it relates to the target variable, i.e. the quality.

### Feature distributions by quality:

In [None]:
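def plot_cols(data,col):
    # one histogram per quality class (0, 1, 2), stacked vertically
    fig,axs = plt.subplots(3,figsize = (15,15))
    fig.suptitle(col+" distribution in different qualities")
    for i in range(3):
        axs[i].hist(data[data['quality']==i][col].tolist())
        axs[i].set_title("quality = "+str(i))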

In [None]:
plot_cols(data,'fixed acidity')

In [None]:
plot_cols(data,'volatile acidity')

In [None]:
plot_cols(data,'citric acid')

In [None]:
plot_cols(data,'residual sugar')

In [None]:
plot_cols(data,'chlorides')

In [None]:
plot_cols(data,'free sulfur dioxide')

In [None]:
plot_cols(data,'total sulfur dioxide')

In [None]:
plot_cols(data,'density')

In [None]:
plot_cols(data,'pH')

In [None]:
plot_cols(data,'sulphates')

In [None]:
plot_cols(data,'alcohol')

## <a id = 'section2'>Outlier detection</a>:
We will now find outliers at the feature level by fitting a probability distribution to each feature.

In [None]:
from scipy import stats

In [None]:
dist = getattr(stats,'norm')
row_list = []
for col in data.columns:
    if col == 'quality':
        continue
    curr_dict = {}
    curr_dict['column'] = col
    # norm.fit returns (loc, scale), i.e. the mean and the standard deviation
    parameter = dist.fit(data[col])
    curr_dict['mean'] = parameter[0]
    curr_dict['variance'] = parameter[1]**2
    test_stat,p_val = stats.kstest(data[col],'norm',parameter)
    curr_dict['test_statistics'] = test_stat
    curr_dict['p-value'] = p_val
    # the KS statistic always lies in [0, 1], so decide from the p-value at the 5% level
    curr_dict['Is_normal'] = (p_val>0.05)*1.0
    row_list.append(curr_dict)
normal_fit_df = pd.DataFrame(row_list)

In [None]:
normal_fit_df

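The table lists the fitted normal parameters and the Kolmogorov-Smirnov test result for each feature. Because the KS statistic always lies between 0 and 1, normality should be judged from the p-value: a feature is consistent with a normal distribution at the 5% level only when its p-value exceeds 0.05. Now, we will cap the feature values at the 2.5th and 97.5th percentiles of the fitted normal distribution, i.e. at mean ± 1.96 standard deviations, at the bottom and top respectively.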
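
To make the decision rule concrete, here is a minimal sketch of the fit-then-test step on synthetic samples rather than the wine features. The helper name `looks_normal` and the sample parameters are assumptions made for illustration. Note also that testing against parameters estimated from the same data makes the KS p-value optimistic, so this is a rough screen rather than a strict test.

```python
import numpy as np
from scipy import stats

def looks_normal(values, alpha=0.05):
    """Fit a normal distribution and run a KS test against the fitted parameters."""
    loc, scale = stats.norm.fit(values)  # scale is the standard deviation
    statistic, p_value = stats.kstest(values, 'norm', args=(loc, scale))
    # Normality is not rejected when the p-value is above alpha.
    return statistic, p_value, p_value > alpha

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=1000)  # synthetic, bell-shaped
skewed_sample = rng.exponential(scale=2.0, size=1000)       # synthetic, strongly skewed

for name, sample in [('normal', normal_sample), ('skewed', skewed_sample)]:
    statistic, p_value, ok = looks_normal(sample)
    print(name, round(statistic, 3), round(p_value, 4), ok)
```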

In [None]:
def value_capper(x,low_val,up_val):
    if x<low_val:
        return low_val
    elif x>up_val:
        return up_val
    return x

In [None]:
normal_fit_df['sd'] = normal_fit_df['variance']**0.5
normal_fit_df['upper_val'] = normal_fit_df['mean'] + 1.96*normal_fit_df['sd']
normal_fit_df['lower_val'] = normal_fit_df['mean'] - 1.96*normal_fit_df['sd']
bounds = normal_fit_df.set_index('column')
for col in data.columns:
    if col == 'quality':
        continue
    low_val = bounds.loc[col,'lower_val']
    up_val = bounds.loc[col,'upper_val']
    # value_capper expects (x, low_val, up_val), in that order
    data[col] = data[col].apply(lambda x : value_capper(x,low_val,up_val))

In [None]:
normal_fit_df
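
The same capping can be sketched in vectorised form with `Series.clip`, shown here on a toy Series rather than the wine data. The function name `cap_to_normal_range` is hypothetical, and the population standard deviation (`ddof=0`) is used on the assumption that it should match the maximum-likelihood estimate returned by `scipy.stats.norm.fit`.

```python
import pandas as pd

def cap_to_normal_range(series: pd.Series, z: float = 1.96) -> pd.Series:
    """Clip values to mean +/- z standard deviations of the series."""
    mean = series.mean()
    sd = series.std(ddof=0)  # population sd, as in a maximum-likelihood normal fit
    return series.clip(lower=mean - z * sd, upper=mean + z * sd)

toy = pd.Series([1.0, 2.0, 2.0, 3.0, 2.0, 2.0, 50.0])  # 50.0 is an obvious outlier
capped = cap_to_normal_range(toy)
print(capped.tolist())
```

Only the extreme value is pulled in; values already inside the band are left unchanged.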

## <a id = 'section3'>Modeling the data</a>:
In this section we will try out different models and analyse the results.

## Ordinal regression

In [None]:
!pip install mord

In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data,test_size = 0.2,shuffle = True,stratify = data['quality'])
X_train = train_data.drop('quality',axis = 1)
X_test = test_data.drop('quality',axis = 1)
Y_train = train_data['quality']
Y_test = test_data['quality']
print(X_train.shape,Y_train.shape,X_test.shape,Y_test.shape)

In [None]:
Y_test

In [None]:
import mord
from sklearn.metrics import classification_report as clr
ord_reg = mord.OrdinalRidge(alpha=10.0, fit_intercept=True, 
                            normalize=False, copy_X=True, 
                            max_iter=10000, tol=0.0001, solver='auto')
ord_reg.fit(X_train,Y_train)
print(ord_reg.score(X_train,Y_train))
print(ord_reg.score(X_test,Y_test))
print(X_train.columns)
print(ord_reg.coef_)
pred_test = ord_reg.predict(X_test)
pred_test = [int(f) for f in pred_test]
pred_train = ord_reg.predict(X_train)
pred_train = [int(f) for f in pred_train]
print(clr(Y_train,pred_train))
print(clr(Y_test,pred_test))

In [None]:
pred_test = ord_reg.predict(X_test)
print(pred_test)

The score reported by `OrdinalRidge` is the negative of the mean absolute error of its rounded predictions, not an accuracy. The performance is very bad for the third class (class 2). Also, some of the features are clearly not captured well enough by a linear relation.
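
One simple way to get ordinal predictions from a ridge regression is to fit it on the integer labels, then round and clip the continuous output to the valid label range. The sketch below illustrates that idea with scikit-learn's `Ridge` on synthetic data. It is not mord's code, and the data, coefficients and bin edges are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic features
latent = X @ np.array([1.0, -0.5, 0.25])      # hidden continuous score
y = np.digitize(latent, bins=[-0.5, 0.5])     # ordered labels 0, 1, 2

ridge = Ridge(alpha=10.0).fit(X, y)
# Round the continuous prediction and clip it to the observed label range.
pred = np.clip(np.round(ridge.predict(X)), y.min(), y.max()).astype(int)

# Negative mean absolute error: 0 is perfect, more negative is worse.
score = -mean_absolute_error(y, pred)
print(score)
```

Unlike accuracy, this score penalises a prediction that is two classes off twice as much as one that is one class off.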

In [None]:
from sklearn.ensemble import RandomForestClassifier as rfc
classifier = rfc(n_estimators = 128,max_depth = 3,
                 class_weight = {0:1,1:1.1,2:1.9},
                 # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in recent scikit-learn
                 max_features = 'sqrt',oob_score = True)
Y_train = [int(f) for f in Y_train]
Y_test = [int(f) for f in Y_test]
classifier.fit(X_train,Y_train)
pred_train = classifier.predict(X_train)                             
print(classifier.oob_score_)
print(clr(Y_train,pred_train))
pred_test = classifier.predict(X_test)
print(clr(Y_test,pred_test))

With the random forest, even after tuning, accuracy stays around 50%, similar to ordinal regression, but the f1-scores have improved. This is still far from an acceptable accuracy, so let's check some other models as well. We will try out XGBoost next.

In [None]:
import numpy as np
from sklearn.utils import class_weight
# recent scikit-learn versions require keyword arguments here
class_weights = list(class_weight.compute_class_weight(class_weight = 'balanced',
                                             classes = np.unique(train_data['quality']),
                                             y = train_data['quality']))

# classes are 0, 1, 2, so the label itself indexes class_weights
w_array = np.array([class_weights[int(label)] for label in Y_train], dtype = 'float')

In [None]:
w_array
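
For reference, the 'balanced' heuristic assigns each class the weight `n_samples / (n_classes * class_count)`, so rarer classes get larger weights. The sketch below reproduces that formula with plain NumPy on toy labels. The function name `balanced_sample_weights` and the labels are made up for illustration.

```python
import numpy as np

def balanced_sample_weights(labels):
    """Per-sample weights using n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    class_weights = len(labels) / (len(classes) * counts)
    lookup = dict(zip(classes, class_weights))
    # Look each weight up by the label itself, not by label - 1.
    return np.array([lookup[label] for label in labels])

toy_labels = [0, 0, 0, 0, 1, 1, 2, 2]  # class 0 is twice as frequent as the others
weights = balanced_sample_weights(toy_labels)
print(weights)
```

With these weights every class contributes the same total weight, and the weights sum to the number of samples.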

In [None]:
import xgboost as xgb
# use a separate name so the xgboost module is not shadowed
xgb_clf = xgb.XGBClassifier(n_estimators = 100,
                        learning_rate = 0.3,
                        verbosity = 0,
                        #max_depth = 5, left at the library default
                        #num_parallel_tree = 32, this didn't improve performance
                        random_state = 42)
xgb_clf.fit(X_train,Y_train,sample_weight = w_array)
pred_train = xgb_clf.predict(X_train)
pred_test = xgb_clf.predict(X_test)
print(clr(Y_train,pred_train))
print(clr(Y_test,pred_test))