In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score
from scipy.stats import shapiro
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from statsmodels.stats.diagnostic import normal_ad, het_breuschpagan
import statsmodels.api as sm
from sklearn.preprocessing import minmax_scale

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

<div class="alert alert-block alert-info"><strong>Content</strong></div>
<div class="list-group">
    <a class="list-group-item list-group-item-action" href="#ds">The Dataset</a>
    <a class="list-group-item list-group-item-action" href="#mda">Missing Data Analysis</a>
    <a class="list-group-item list-group-item-action" href="#out">Outliers</a>
    <a class="list-group-item list-group-item-action" href="#asu">Assumptions</a>
    <a class="list-group-item list-group-item-action" href="#fsu">Fix Assumptions</a>
    <a class="list-group-item list-group-item-action" href="#ols">Ordinary Least Squares (OLS)</a>
    <a class="list-group-item list-group-item-action" href="#mtt">Model Training and Testing</a>
    <a class="list-group-item list-group-item-action" href="#rf">References</a>
    
</div>


In this notebook, I'm going to take a different approach and check whether there is an approximately linear relationship between alcohol content and the other provided factors. I will also try to find out how accurately we can predict alcohol content.

Before starting the analysis, let me give a brief introduction to the technique we are going to use: multiple linear regression. Linear regression is a simple approach to supervised learning and a useful tool for predicting a quantitative response. It can be performed as simple linear regression or multiple linear regression. Both assume that there is an approximately linear relationship between the inputs (X) and the output (Y). The main difference is the number of independent variables they take as inputs: simple linear regression takes a single input, while multiple linear regression takes several. Mathematically, we can write multiple linear regression as:

$$ Y = W_0 + W_1X_1 + W_2X_2 + ... + W_nX_n $$ 
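
To make the equation concrete, here is a minimal sketch that recovers the weights from data with plain NumPy. The data, the sample size, and the "true" weights are all synthetic assumptions chosen for illustration; nothing here comes from the wine dataset.

```python
import numpy as np

# Synthetic data only: 200 samples, 3 inputs, weights chosen for illustration.
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
true_w = np.array([2.0, 0.5, -1.0, 3.0])  # W_0 (intercept), W_1, W_2, W_3

# Prepend a column of ones so that W_0 is estimated together with the slopes.
X_design = np.column_stack([np.ones(n_samples), X])
y = X_design @ true_w + rng.normal(scale=0.1, size=n_samples)

# Ordinary least squares: find w that minimises ||X_design @ w - y||^2.
w_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(w_hat, 2))
```

The column of ones plays the same role as `sm.add_constant` later in this notebook: without it the fitted plane would be forced through the origin.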

I'm not going to discuss linear regression further here; there are plenty of very good articles available if you want to deepen your knowledge. In this task, the output variable (the target variable in machine learning, or the dependent variable in statistical modeling) is alcohol content, and all other variables are inputs (features in machine learning, or independent variables in statistical modeling).

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
df.head()

In [None]:
df.info()

Let's understand the dataset. It contains 1599 samples and 12 variables, including the target variable. Eleven variables are numerical and one (quality) is ordinal.

<div class="alert alert-block alert-success" id='ds'><strong>The Dataset</strong></div>

* fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
* volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
* citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
* residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
* chlorides: the amount of salt in the wine
* free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
* total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
* density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
* pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
* sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
* quality: score between 0 and 10
* alcohol: the percent alcohol content of the wine (output variable)

Ok, now we have some idea about the data. Let's see if there are any missing values in the data.

<div class="alert alert-block alert-success" id='mda'><strong>Missing Data Analysis</strong></div>

In [None]:
msno.matrix(df)

There are no missing values in the dataset. An important next step is to test the linear regression assumptions: if they are not satisfied, the interpretation of the results will not always be valid, which can be very dangerous depending on the application. Before that, let's look at outliers.

<div class="alert alert-block alert-success" id='out'><strong>Outliers</strong></div>

> In statistics, an outlier is an observation point that is distant from other observations. - Wikipedia

Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points.  We can use a box plot to see the outliers graphically.

In [None]:
def plotBoxplot(data):
    fig, axes = plt.subplots(ncols=3, nrows=4, figsize=(15,15))
    fig.tight_layout(pad=4.0)

    col = 0
    row = 0
    colors = ['#bad9e9', '#7ab6d6', '#3c8abd']

    for i, column in enumerate(data.columns):
        sns.boxplot(y=column, data=data, ax=axes[row][col], color=colors[col])

        if (i + 1) % 3 == 0:
            row += 1
            col = 0
        else:
            col += 1
            
plotBoxplot(df)

A box plot uses the interquartile range (IQR) to display the shape of the data and its outliers, but to get a list of the identified outliers we have to calculate the IQR ourselves.

> The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. - Wikipedia


In [None]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

print(IQR)

There are different ways to treat outliers, such as removing them from the dataset or replacing them with KNN imputation. I'm not a domain expert, so in this scenario I will simply remove every row that has a value more than 1.5 × IQR below Q1 or above Q3 in any column.
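
Before applying the rule to the whole dataframe, here is a small sketch of the 1.5 × IQR fences on a toy array. The values are hypothetical and picked so that the quartiles are easy to check by hand.

```python
import numpy as np

# Toy values (hypothetical, not taken from the wine data).
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A point is flagged when it falls outside either fence.
is_outlier = (values < lower_fence) | (values > upper_fence)
print(values[is_outlier])   # points outside the fences
print(values[~is_outlier])  # points that are kept
```

On a dataframe the same comparison is done column by column, and a row is dropped as soon as one of its columns is flagged, which is what `.any(axis=1)` does in the next cell.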

In [None]:
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape

Ok, now it's time to test the assumptions. To do so, I'm going to use the OLS (Ordinary Least Squares) model from statsmodels. Later, I will use the LinearRegression model from sklearn, which also uses the ordinary least squares technique.

<div class='alert alert-block alert-success' id='asu'><strong>Assumptions</strong></div>

### 1. Linearity

This assumes that there is a linear relationship between the independent variables and the dependent variable.

**If this assumption is not satisfied** - The predictions will be extremely inaccurate because our model is underfitting. This is a serious violation that should not be ignored.

**How to detect?** If there is only one independent variable, this is pretty easy to test with a scatter plot. For multiple independent variables, we use a scatter plot of the predicted values against the actual values; the distance of each point from the diagonal is its residual. The points should lie on or around the diagonal line.

In [None]:
#create tmp train/test split for assumptions test
X = df.drop(['alcohol'], axis=1)
y = df['alcohol']

X = sm.add_constant(X)
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=50)

model = sm.OLS(y_train, x_train).fit()
print(model.summary())

In [None]:
y_pred = model.predict(x_test)

In [None]:
#plot the actual vs predicted values
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.regplot(x=y_test, y=y_pred, line_kws={'color':'red'}, ci=None)

plt.xlabel('Actual')
plt.ylabel('Predictions')
plt.title('Prediction vs Actual')

plt.show()

We can see that the predictions are spread relatively evenly around the diagonal line, which indicates that there is a linear relationship between the independent and dependent variables. Ok, let's test the next assumption.

### 2. No Multicollinearity among Independent Variables

This assumes that the independent variables used in the regression are not highly correlated with each other.

**If this assumption is not satisfied** - coefficients and standard errors of the affected variables are unreliable.

**How to detect?** There are a few ways, such as a heatmap of the correlations or the variance inflation factor (VIF).

In [None]:
plt.figure(figsize=(25, 10))
sns.heatmap(df.loc[:, df.columns != 'alcohol'].corr(), annot=True, fmt='.2f')

In [None]:
def calVIF(X):
    vif = pd.DataFrame()
    vif['variables'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

    return vif

calVIF(df.loc[:, df.columns != 'alcohol'])

We can see that some variables are highly correlated, such as density, pH, and fixed acidity. To fix multicollinearity, we can remove variables with a high variance inflation factor. Let's go to the next assumption.
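
To show what the VIF measures, here is a minimal sketch that computes it by hand as VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing column j on all the other columns. The function name `vif_by_hand` and the three synthetic columns are my own assumptions for illustration.

```python
import numpy as np

def vif_by_hand(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X.

    Each auxiliary regression includes an intercept (an assumption of this sketch).
    """
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic columns: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(np.round(vifs, 1))  # order: a, b, c
```

The two nearly identical columns get a large VIF while the independent one stays close to 1. Note that this sketch always adds an intercept to the auxiliary regressions; as far as I know, `variance_inflation_factor` from statsmodels uses the design matrix exactly as it is passed, so the numbers can differ when no constant column is included.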

### 3. Homoscedasticity

This assumes that the error terms have the same variance (no heteroscedasticity).

**If this assumption is not satisfied** - standard errors in the output cannot be relied upon.

**How to detect?** Conduct the Breusch-Pagan test or the Goldfeld-Quandt test.

In [None]:
#Breusch-Pagan test
bp = het_breuschpagan(model.resid, model.model.exog)
print('Lagrange multiplier statistic: {0} and p-value: {1}'.format(bp[0], bp[1]))

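According to the results of the Breusch-Pagan test, the Lagrange multiplier statistic is 302.83 and the p-value is printed as 3.91. A p-value cannot exceed 1, so the printed value is presumably in scientific notation (3.91e-XX) with the exponent cut off. The hypotheses of this test are:

* H0: Homoscedasticity is present
* H1: Homoscedasticity is not present (i.e. heteroscedasticity exists)

Under H0 the statistic follows a chi-square distribution with 11 degrees of freedom (one per predictor), whose 5% critical value is about 19.7. A statistic of 302.83 is far beyond that, so the p-value is well below 0.05 and we reject the null hypothesis: there is evidence of heteroscedasticity in the regression model, and the standard errors in the summary above should be read with caution.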
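
To see where the Lagrange multiplier statistic comes from, here is a minimal sketch of the studentized (Koenker) form of the test: regress the squared residuals on the regressors and use n × R² as the statistic. The helper names and the synthetic data are assumptions of this sketch, and I am not claiming it reproduces the statsmodels implementation line by line.

```python
import numpy as np
from scipy.stats import chi2

def ols_residuals(y, exog):
    """Residuals of an ordinary least squares fit of y on exog."""
    coef, *_ = np.linalg.lstsq(exog, y, rcond=None)
    return y - exog @ coef

def breusch_pagan_lm(resid, exog):
    """LM = n * R^2 of resid^2 regressed on exog (exog must contain a constant)."""
    n_rows, n_cols = exog.shape
    squared = resid ** 2
    coef, *_ = np.linalg.lstsq(exog, squared, rcond=None)
    ss_res = np.sum((squared - exog @ coef) ** 2)
    ss_tot = np.sum((squared - squared.mean()) ** 2)
    lm = n_rows * (1.0 - ss_res / ss_tot)
    # Degrees of freedom: number of regressors excluding the constant.
    p_value = chi2.sf(lm, df=n_cols - 1)
    return lm, p_value

# Synthetic data: one regressor, noise with growing vs. constant spread.
rng = np.random.default_rng(42)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
exog = np.column_stack([np.ones(n), x])
y_hetero = 1.0 + 2.0 * x + rng.normal(scale=x)
y_homo = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

lm_het, p_het = breusch_pagan_lm(ols_residuals(y_hetero, exog), exog)
lm_hom, p_hom = breusch_pagan_lm(ols_residuals(y_homo, exog), exog)
print('heteroscedastic: LM = {0:.2f}, p = {1:.3g}'.format(lm_het, p_het))
print('homoscedastic:   LM = {0:.2f}, p = {1:.3g}'.format(lm_hom, p_hom))
```

A large statistic means the squared residuals can be predicted from the regressors, which is exactly what non-constant variance looks like.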

### 4. Normality of the Error Terms

This assumes that the error terms of the model are normally distributed.

**If this assumption is not satisfied** - violation of this assumption could affect standard errors in the output.

**How to detect?** There are different ways to do so: a histogram or Q-Q plot, or a formal normality test such as the Shapiro-Wilk, Kolmogorov-Smirnov, or Anderson-Darling test.
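
The Q-Q idea can be sketched without drawing anything. `scipy.stats.probplot` pairs the sorted sample with theoretical normal quantiles and fits a straight line through the pairs; the correlation `r` of that fit is close to 1 when the sample is close to normal. Both samples below are synthetic and only there for comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_sample = rng.normal(size=1000)
skewed_sample = rng.exponential(size=1000)

# probplot returns the Q-Q points and (slope, intercept, r) of the fitted line.
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist='norm')
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist='norm')

print('normal sample r:', round(r_normal, 3))
print('skewed sample r:', round(r_skewed, 3))
```

Passing `plot=plt` to `probplot` would draw the actual Q-Q plot, where points bending away from the line show the same thing visually.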

In [None]:
plt.title('Distribution of Residuals')
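# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(model.resid, kde=True)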
plt.show()

To further clarify this, we can do the Anderson-Darling test. Hypotheses for the Anderson-Darling test for the normal distribution are given below:

* H0: The data follow a normal distribution
* H1: The data do not follow a normal distribution

In [None]:
#Anderson-Darling test
def normalityError(residuals):
    p_value = normal_ad(residuals)[1]

    print('p-value from the test:', p_value)

    if p_value < 0.05:
        print('Residuals are not normally distributed')
    else:
        print('Residuals are normally distributed')
        
normalityError(model.resid)

Even though the histogram looks normally distributed, the Anderson-Darling test suggests that the residuals are not normally distributed. To fix this we can apply nonlinear transformations, exclude specific variables (such as long-tailed variables), or remove outliers. Based on the assumption tests, I will fix two violations:

1. Multicollinearity
2. Normality of the Error Terms

Therefore, let's fix those violations.

<div class='alert alert-block alert-success' id='fsu'><strong>Fix Assumptions</strong></div>

### 1. Fix Multicollinearity

As I mentioned earlier, to fix multicollinearity we have to remove highly correlated variables (those with a high variance inflation factor). VIF has no upper limit; a VIF that exceeds 10 is often regarded as indicating multicollinearity.

In [None]:
calVIF(df[['citric acid','residual sugar','density', 'sulphates']])

After removing the highly collinear variables, four variables remain. Let's address the next issue.

### 2. Fix Normality of the Error Terms

Let's measure the skewness of the selected variables. If the skewness is 0, the data are perfectly symmetrical, although that is quite unlikely for real-world data. As a general rule of thumb:
* If skewness is less than -1 or greater than 1, the distribution is highly skewed.
* If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
* If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.

In [None]:
#skewness and kurtosis
def skew(columns, df):
    dfs = df[columns].agg(['skew', 'kurtosis']).transpose()
    for index, row in dfs.iterrows():
        if (abs(row['skew']) > 1):
            dfs.loc[index,'label'] = 'highly skewed'
        elif (abs(row['skew']) > 0.5):
            dfs.loc[index,'label'] = 'moderately skewed'
        else:
            dfs.loc[index,'label'] = 'approximately symmetric'
        
    return dfs

columns = ['citric acid','residual sugar','density', 'sulphates']
skew(columns, df)

We can see that two variables, residual sugar and sulphates, are skewed. To confirm that they are not normally distributed, we can conduct the Shapiro-Wilk test.

The Shapiro-Wilk test uses the hypotheses below:
* H0: The sample comes from a normal distribution.
* H1: The sample does not come from a normal distribution.

If the data are not normally distributed, we can apply a transformation to convert the skewed distribution into a normal or less skewed distribution.
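
Here is a minimal sketch of why a log transformation helps with right-skewed data. The sample is synthetic (drawn from a log-normal distribution, which is an assumption made only so the effect is easy to see), and `sample_skew` is just an alias I use to avoid clashing with the `skew` helper defined in this notebook.

```python
import numpy as np
from scipy.stats import skew as sample_skew

# Synthetic right-skewed sample: log-normal, so its log is exactly normal.
rng = np.random.default_rng(3)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# The log compresses the long right tail.
log_transformed = np.log(right_skewed)

skew_before = sample_skew(right_skewed)
skew_after = sample_skew(log_transformed)
print('skewness before:', round(skew_before, 2))
print('skewness after: ', round(skew_after, 2))
```

By the rule of thumb above, the raw sample counts as highly skewed and the transformed one as approximately symmetric. Real columns are rarely exactly log-normal, so the improvement on the wine data will be less clean than here.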

In [None]:
#Shapiro-Wilk test
def shapiroWilk(columns, df):
    data = []

    for column in columns:
        stat, p = shapiro(df[column])
        if (p < 0.05):
            label = 'The null hypothesis can be rejected.'
        else:
            label = 'The null hypothesis cannot be rejected.'

        data.append([column, stat, p, label])

    return pd.DataFrame(data, columns=['column', 'statistic', 'p-value', 'label'])
    

columns = ['residual sugar', 'sulphates']
shapiroWilk(columns, df)

In [None]:
def transformDistribution(columns, method):
    fig, axes = plt.subplots(ncols=2, nrows=len(columns), figsize=(10, 8))
    fig.tight_layout(pad=4)
    
    for i, column in enumerate(columns):
        if method == 'sqrt':
            trans = np.sqrt(df[column])
        elif method == 'log':
            trans = np.log10(df[column]+1)
        else:
            trans, params = stats.boxcox(df[column]+1)
    
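        # histplot replaces the deprecated distplot
        sns.histplot(df[column], kde=True, ax=axes[i][0])
        sns.histplot(trans, kde=True, ax=axes[i][1])

        # label the right-hand plot with the method that was actually applied
        axes[i][0].set_title('Distribution of {0}'.format(column))
        axes[i][1].set_title('Distribution of {0} ({1})'.format(column, method))
        axes[i][1].set_xlabel('{0} ({1})'.format(column, method))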
        
    plt.show()
    
transformDistribution(['residual sugar', 'sulphates'], 'log')

In [None]:
# note: the preview plots above used log10(x + 1); the natural log is applied here
df['residual_sugar_log'] = np.log(df['residual sugar'])
df['sulphates_log'] = np.log(df['sulphates'])

In [None]:
skew(['residual_sugar_log', 'sulphates_log'], df)

Ok, after the log transformation the distributions are approximately symmetric. Next, let's normalize the data. Normalization (or scaling) brings all the columns into the same range, here 0 to 1 with min-max scaling.
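
Min-max scaling is just x' = (x - min) / (max - min). Here is a minimal sketch with NumPy; the helper name `min_max_scale` and the density-like values are hypothetical and not read from the dataset.

```python
import numpy as np

def min_max_scale(column):
    """Map the smallest value to 0 and the largest to 1, linearly in between."""
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

# Hypothetical density-like values, sorted for readability.
density = np.array([0.9950, 0.9968, 0.9970, 0.9980, 1.0010])
scaled = min_max_scale(density)
print(np.round(scaled, 3))
```

One design choice worth keeping in mind: the minimum and maximum here are taken from all rows. A stricter setup would compute them on the training split only and reuse them on the test split, so that no information from the test data leaks into preprocessing.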

In [None]:
dfScaled = minmax_scale(df[['citric acid','density','residual_sugar_log','sulphates_log']])
 
df['citric_acid_norm'] = dfScaled[:,0]
df['density_norm'] = dfScaled[:,1]
df['residual_sugar_norm'] = dfScaled[:,2]
df['sulphates_norm'] = dfScaled[:,3]

df.head()

<div class="alert alert-block alert-success" id='ols'><strong>Ordinary Least Squares (OLS)</strong></div>

In [None]:
df = df[['citric_acid_norm','density_norm','residual_sugar_norm','sulphates_norm', 'alcohol']]
df.head()

In [None]:
X = df.drop(['alcohol'], axis=1)
y = df['alcohol']

X = sm.add_constant(X)
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=50)

In [None]:
model = sm.OLS(y_train, x_train).fit()
print(model.summary())

**R-squared:** Signifies the “percentage of variation in the dependent variable that is explained by the independent variables”. In our case, 53.7% of the variation of Y (alcohol) is explained by citric acid, density, residual sugar, and sulphates.

**Adj. R-squared:** This is a modified version of R-squared that is adjusted for the number of variables in the regression. It increases only when an additional variable improves the fit by more than would be expected by chance.
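
Both quantities are easy to compute by hand. Here is a minimal sketch using the usual formulas R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of samples and p the number of predictors. The five toy values are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance of y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy values (hypothetical), one predictor.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=5, n_features=1)
print(round(r2, 4), round(adj_r2, 4))
```

The adjusted value is always at most R², and the gap grows as more predictors are added relative to the number of samples.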

According to the results, we can see that all four coefficients are statistically significant. Based on the results, we can interpret the regression equation:


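$$ Y = 10.2203 + 0.9384\,\text{citric}_i - 4.0054\,\text{density}_i + 1.7621\,\text{residual}_i + 1.4542\,\text{sulphates}_i $$

Based on the equation, we can interpret the impact of density as follows: for a 1 unit increase in density, the alcohol content decreases by 4.0054 on average, holding all other variables constant. Because the inputs were min-max scaled, 1 unit here corresponds to moving from the lowest to the highest density in the data.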

In [None]:
y_pred = model.predict(x_test)

In [None]:
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.regplot(x=y_test, y=y_pred, line_kws={'color':'red'}, ci=None)

plt.xlabel('Actual')
plt.ylabel('Predictions')
plt.title('Prediction vs Actual')

plt.show()

We were able to train a decent linear regression model using the available data. However, the model explains only 53.7% of the variance of Y (alcohol), and I think there are additional important factors related to red wine alcohol content. What do you think?

<div class="alert alert-block alert-success" id='mtt'><strong>Model Training and Testing</strong></div>

In [None]:
lr = LinearRegression()
lr.fit(x_train, y_train)

In [None]:
print(lr.intercept_)

In [None]:
print(lr.coef_)

In [None]:
y_pred = lr.predict(x_test)

Model evaluation is very important. It helps you understand the performance of the model and makes it easier to present. There are 3 main metrics for model evaluation in regression:

1. R Square/Adjusted R Square
2. Mean Square Error (MSE)/Root Mean Square Error (RMSE)
3. Mean Absolute Error (MAE)
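
The error-based metrics in this list can be written in a couple of lines each. Here is a minimal sketch with NumPy on four hypothetical predictions; the sklearn functions used in the next cells compute the same quantities.

```python
import numpy as np

# Toy values (hypothetical).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)      # average squared error, punishes large misses
rmse = np.sqrt(mse)             # same units as the target
mae = np.mean(np.abs(errors))   # average absolute error, less sensitive to outliers

print('MSE:', mse)
print('RMSE:', round(rmse, 4))
print('MAE:', mae)
```

RMSE and MAE are in the same units as the target (percent alcohol in this notebook), which makes them easier to read than MSE.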

In [None]:
r2_score(y_test, y_pred)

Well, our model is able to explain 59.6% of the variance of Y (alcohol) on the test data. R Square and Adjusted R Square are relative measures of how well the model fits the dependent variable, while MSE, RMSE, and MAE are better suited to comparing performance between different regression models.

In [None]:
mean_squared_error(y_test, y_pred)

In [None]:
np.sqrt(mean_squared_error(y_test, y_pred))

I hope you've enjoyed my work. If you like it and want to share something with me, leave a comment :)

<div class="alert alert-block alert-success" id='rf'><strong>References</strong></div>

* https://learningwithdata.com/posts/tylerfolkman/the-ultimate-guide-to-linear-regression/
* https://www.analyticsvidhya.com/blog/2021/05/all-you-need-to-know-about-your-first-machine-learning-model-linear-regression/
* https://towardsdatascience.com/the-complete-guide-to-linear-regression-in-python-3d3f8f06bf8
* https://www.keboola.com/blog/linear-regression-machine-learning
* https://rstudio-pubs-static.s3.amazonaws.com/57835_c4ace81da9dc45438ad0c286bcbb4224.html
* https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
* https://www.kaggle.com/nareshbhat/outlier-the-silent-killer
* https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
* https://community.gooddata.com/metrics-and-maql-kb-articles-43/normality-testing-skewness-and-kurtosis-241
* https://towardsdatascience.com/methods-for-normality-test-with-application-in-python-bb91b49ed0f5
* https://medium.com/@TheDataGyan/day-8-data-transformation-skewness-normalization-and-much-more-4c144d370e55
* https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/  
* https://jyotiyadav99111.medium.com/statistics-how-should-i-interpret-results-of-ols-3bde1ebeec01