# <center> Introduction to Regression - Complete Analysis on Wine data </center>

- Report can be found [here](https://github.com/Abhishekmamidi123/Regression-Analysis).

## Problem Definition
- A wine dataset is provided. The task is to analyze the data and build a regression model to predict the quality of the wine.

## Abstract 
- The main goal of this report is to extract maximum knowledge from the Wine data in different ways. The data is analyzed and the plots are shown. A regression model is built to predict the quality of wine using the features provided. The assumptions of regression are also checked.

## Methodology 
1. Description of data 
2. Preprocess data 
3. Visualize data 
4. Build a Regression model 
5. Check Regression Assumptions 
6. Goodness of fit 
7. Compare different Regression methods

## Description of data 
1. **Name of the data**: Wine data from UCI Machine learning repository 
2. **Number of data points**: 4898 
3. **Number of features**: 11 
4. **Target attribute**: Quality of wine 
5. **Range of target attribute**: 3 to 9

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
from IPython.display import HTML, display
from IPython.core import display as ICD
from plotly.offline import init_notebook_mode, iplot

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import pylab
import scipy as sp
import scipy.stats  # make sure sp.stats is loaded for probplot later on

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn import svm
from sklearn import tree
from sklearn import neighbors
from sklearn import linear_model

init_notebook_mode(connected=True)
import warnings
warnings.filterwarnings('ignore')

### Data

In [None]:
PATH = '../input/'
filename = 'winequality-white.csv'
white_data = pd.read_csv(PATH + filename)

data_head = white_data.head()
colorscale = [[0, '#4d004c'],[.5, '#f2e5ff'],[1, '#ffffff']]
df_table = ff.create_table(round(data_head.iloc[:,[0,1,2,3,4,5]], 3), colorscale=colorscale)
py.iplot(df_table, filename='wine_quality')
df_table = ff.create_table(round(data_head.iloc[:,[6,7,8,9,10,11]], 3), colorscale=colorscale)
py.iplot(df_table, filename='wine_quality')

### Features 
1. Fixed acidity 
2. Volatile acidity 
3. Citric acid 
4. Residual sugar 
5. Chlorides 
6. Free sulfur dioxide 
7. Total sulfur dioxide 
8. Density 
9. pH 
10. Sulphates 
11. Alcohol

### Target Attribute 
- Quality of wine

### Distribution of Target Attribute

In [None]:
value_counts = white_data.quality.value_counts()
target_counts = pd.DataFrame({'quality': list(value_counts.index), 'value_count': value_counts})

In [None]:
plt.figure(figsize=(10,4))
g = sns.barplot(x='quality', y='value_count', data=target_counts, capsize=0.3, palette='spring')
g.set_title("Frequency of target class", fontsize=15)
g.set_xlabel("Quality", fontsize=13)
g.set_ylabel("Frequency", fontsize=13)
g.set_yticks([0, 500, 1000, 1500, 2000, 2500])
for p in g.patches:
    g.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', va='center', xytext=(0, 10), 
                textcoords='offset points', fontsize=14, color='black')

#### Analysis:
- The quality of wine ranges from 3 to 9.
- The data is not balanced. The number of data points with quality 6 is very high, while qualities 3 and 9 have very few data points.
- This imbalance may affect the model.

### Distribution of target attribute - Box plot

In [None]:
plt.figure(figsize=(10,3))
sns.boxplot(data=white_data['quality'], orient='horizontal', palette='husl')
plt.title("Distribution of target variable")

### Describe data

In [None]:
white_data.describe().drop(columns=['quality'])

# data_head = white_data.describe().drop(columns=['quality'])
# data_head.columns = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
#        'chlorides', 'free_SO2', 'total_SO2', 'density',
#        'pH', 'sulphates', 'alcohol']
# colorscale = [[0, '#4d004c'],[.5, '#f2e5ff'],[1, '#ffffff']]
# df_table = ff.create_table(round(data_head.iloc[:,[0,1,2,3,4,5,6,7,8,9,10]], 3), colorscale=colorscale)
# py.iplot(df_table, filename='wine_quality')
# df_table = ff.create_table(round(data_head.iloc[:,[6,7,8,9,10]], 3), colorscale=colorscale)
# py.iplot(df_table, filename='wine_quality')

## Preprocess data

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=white_data.drop(columns=['quality']), orient='horizontal', palette='husl')

#### Analysis:
- If we observe the above boxplot, the features have very different ranges.
- We can normalize the data. After normalization all the variables range from 0 to 1, and no information is lost because the rescaling is linear.
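
To make the normalization step concrete, here is a minimal sketch of min-max scaling written with NumPy. The helper name `min_max_scale` and the toy values are illustrative assumptions, not part of the original notebook, which uses scikit-learn's `MinMaxScaler` for the same purpose.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to [0, 1] via (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Two made-up columns on very different scales (illustrative values only).
toy = np.array([[3.0, 100.0],
                [4.0, 150.0],
                [5.0, 300.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Each column is shifted and divided by a constant, so the ordering and relative spacing of the values within a column are preserved.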

### Distribution of features - After normalization 
- This looks better than before, and the distribution of the data is much easier to read.

In [None]:
y = white_data['quality']
white_data = white_data.loc[:, ~white_data.columns.isin(['quality'])]

scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(white_data)
white_data.loc[:,:] = scaled_values

white_data['quality'] = y

In [None]:
data_head = white_data.head()
colorscale = [[0, '#4d004c'],[.5, '#f2e5ff'],[1, '#ffffff']]
df_table = ff.create_table(round(data_head.iloc[:,[0,1,2,3,4,5]], 3), colorscale=colorscale, )
py.iplot(df_table, filename='wine_quality')
df_table = ff.create_table(round(data_head.iloc[:,[6,7,8,9,10,11]], 3), colorscale=colorscale, )
py.iplot(df_table, filename='wine_quality')

In [None]:
columns = list(white_data.columns)
new_column_names = []
for col in columns:
    new_column_names.append(col.replace(' ', '_'))
white_data.columns = new_column_names

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=white_data.drop(columns=['quality']), orient='horizontal', palette='husl')

## Visualize data
### Correlation between features 

In [None]:
corr_matrix = white_data.corr().abs()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

#### Analysis:
- The heatmap shows absolute correlations (`.abs()`), so the sign of each relationship is not visible here.
- The correlation between “density” and “residual sugar” is 0.84.
- The correlation between “alcohol” and “density” is 0.78.
- The correlation between “total sulfur dioxide” and “free sulfur dioxide” is 0.62.
- These are the three pairs of features having a high correlation (>0.5).
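
Reading pairs off a heatmap by eye is error-prone, so a small helper can list them directly. This is a sketch on synthetic data; the function name `correlated_pairs`, the column names, and the generated values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return (feature_a, feature_b, abs_corr) for pairs above the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, no self-pairs
            value = corr.iloc[i, j]
            if value > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 2)))
    return sorted(pairs, key=lambda p: -p[2])

rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": -a + 0.3 * rng.normal(size=300),  # strongly (negatively) related to a
    "c": rng.normal(size=300),             # unrelated to both
})
pairs = correlated_pairs(toy)
print(pairs)
```

Because the helper works on absolute values, the negatively related pair is still reported; checking `df.corr()` without `.abs()` would reveal the direction.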

### Distribution of each feature

In [None]:
features = white_data.copy(deep=True)
features['quality'] = y.astype('str').map({'3': 'Three', '4': 'Four', '5': 'Five', '6': 'Six', '7': 'Seven', '8': 'Eight', '9': 'Nine'})
f, axes = plt.subplots(4, 3, figsize=(15, 10), sharex=True)
sns.distplot(features["fixed_acidity"], rug=False, color="skyblue", ax=axes[0, 0])
sns.distplot(features["volatile_acidity"], rug=False, color="olive", ax=axes[0, 1])
sns.distplot(features["citric_acid"], rug=False, color="gold", ax=axes[0, 2])
sns.distplot(features["residual_sugar"], rug=False, color="teal", ax=axes[1, 0])
sns.distplot(features["chlorides"], rug=False, ax=axes[1, 1])
sns.distplot(features["free_sulfur_dioxide"], rug=False, color="red", ax=axes[1, 2])
sns.distplot(features["total_sulfur_dioxide"], rug=False, color="skyblue", ax=axes[2, 0])
sns.distplot(features["density"], rug=False, color="olive", ax=axes[2, 1])
sns.distplot(features["pH"], rug=False, color="gold", ax=axes[2, 2])
sns.distplot(features["sulphates"], rug=False, color="teal", ax=axes[3, 0])
sns.distplot(features["alcohol"], rug=False, ax=axes[3, 1])

#### Analysis:
- If we observe the distributions, most features look roughly bell-shaped, i.e. approximately Normal.
- There is some fluctuation in the features “sulphates” and “alcohol”.

In [None]:
# np.bool was removed in recent NumPy releases; the built-in bool works everywhere
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.5)]

### Pair plot between features
- This is to understand the relation between features. 

In [None]:
features = white_data.copy(deep=True)
features['quality'] = y.astype('str').map({'3': 'Three', '4': 'Four', '5': 'Five', '6': 'Six', '7': 'Seven', '8': 'Eight', '9': 'Nine'})
sns.pairplot(features, diag_kind='kde', palette='husl', hue='quality')

#### Analysis:
- From this plot, we can see how different features are related to each other.
- The features on the x-axis and y-axis appear in the same order as the columns of the data.

### Pair plot between correlated features

In [None]:
features = white_data.copy(deep=True)
features['quality'] = y.astype('str').map({'3': 'Three', '4': 'Four', '5': 'Five', '6': 'Six', '7': 'Seven', '8': 'Eight', '9': 'Nine'})
sns.pairplot(features, vars=to_drop, diag_kind='kde', palette='husl', hue='quality')

#### Analysis:
- As we have seen above in the correlation plot, there is a high correlation (>0.5) between some of the features.
- Here, we can visualize how these features are correlated.
- If we observe carefully, we cannot easily separate the data points of different quality, because the points of all quality levels overlap.

## Build a Regression model
### Linear Regression
- Linear Regression is fit to the data to see how the independent variables contribute to the dependent variable. Note that scikit-learn's `LinearRegression` solves the least-squares problem directly rather than by Gradient Descent; for this problem both approaches lead to the same coefficients.
- The plot below shows the coefficient (contribution) of each feature.
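
As a small illustration of how the coefficients can be found, the sketch below fits the same linear model twice on synthetic data: once with a direct least-squares solve and once with batch gradient descent on the mean squared error. The data, the "true" coefficients, the learning rate and the iteration count are all made-up assumptions chosen so that gradient descent converges.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, size=(n, 3))            # features already in [0, 1]
true_coef = np.array([2.0, -1.0, 0.5])        # made-up coefficients
y = 3.0 + X @ true_coef + 0.05 * rng.normal(size=n)

A = np.column_stack([np.ones(n), X])          # intercept column + features

# Direct least-squares solve.
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Batch gradient descent on the mean squared error.
theta_gd = np.zeros(A.shape[1])
learning_rate = 0.1
for _ in range(5000):
    gradient = 2.0 / n * A.T @ (A @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print(np.round(theta_ls, 3))
print(np.round(theta_gd, 3))
```

Both estimates agree closely because the squared-error objective is convex; the direct solve simply reaches the minimum without iterating.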

In [None]:
model_reg = LinearRegression().fit(white_data.drop(columns=['quality']), y)
y_true = white_data.quality
y_pred = model_reg.predict(white_data.drop(columns=['quality']))

In [None]:
column_names = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol']
regression_coefficient = pd.DataFrame({'Feature': column_names, 'Coefficient': model_reg.coef_}, columns=['Feature', 'Coefficient'])

In [None]:
column_names = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol']

plt.figure(figsize=(15,5))
g = sns.barplot(x='Feature', y='Coefficient', data=regression_coefficient, capsize=0.3, palette='spring')
g.set_title("Contribution of features towards target variable", fontsize=15)
g.set_xlabel("Feature", fontsize=13)
g.set_ylabel("Degree of Coefficient", fontsize=13)
g.set_yticks([-8, -6, -4, -2, 0, 2, 4, 6, 8])
g.set_xticklabels(column_names)
for p in g.patches:
    g.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', va='center', xytext=(0, 10), 
               textcoords='offset points', fontsize=14, color='black')

### Ordinary Least Squares(OLS)
- In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. 
- The OLS method corresponds to minimizing the sum of squared differences between the observed and predicted values. This minimization leads to the estimators of the parameters of the model. 
- The results of OLS Regression are shown below: 

In [None]:
model_ols = ols("""quality ~ fixed_acidity 
                        + volatile_acidity 
                        + citric_acid
                        + residual_sugar 
                        + chlorides 
                        + free_sulfur_dioxide
                        + total_sulfur_dioxide 
                        + density 
                        + pH 
                        + sulphates 
                        + alcohol""", data=white_data).fit()

In [None]:
model_summary = model_ols.summary()
HTML(
(model_ols.summary()
    .as_html()
    .replace('<th>Dep. Variable:</th>', '<th style="background-color:#c7e9c0;"> Dep. Variable: </th>')
    .replace('<th>Model:</th>', '<th style="background-color:#c7e9c0;"> Model: </th>')
    .replace('<th>Method:</th>', '<th style="background-color:#c7e9c0;"> Method: </th>')
    .replace('<th>No. Observations:</th>', '<th style="background-color:#c7e9c0;"> No. Observations: </th>')
    .replace('<th>  R-squared:         </th>', '<th style="background-color:#aec7e8;"> R-squared: </th>')
    .replace('<th>  Adj. R-squared:    </th>', '<th style="background-color:#aec7e8;"> Adj. R-squared: </th>')
    .replace('<th>coef</th>', '<th style="background-color:#ffbb78;">coef</th>')
    .replace('<th>std err</th>', '<th style="background-color:#c7e9c0;">std err</th>')
    .replace('<th>P>|t|</th>', '<th style="background-color:#bcbddc;">P>|t|</th>')
    .replace('<th>[0.025</th>    <th>0.975]</th>', '<th style="background-color:#ff9896;">[0.025</th>    <th style="background-color:#ff9896;">0.975]</th>'))
)

#### Analysis:
- The R-squared is 0.282 and the Adjusted R-squared is 0.280.
- For each coefficient, the null hypothesis is that the coefficient is zero. If the p-value > 0.05, we fail to reject the null hypothesis; otherwise we reject it.
- The p-values of the features “citric acid” and “chlorides” are greater than 0.05. Also, the contribution of these features is very small.
- So, we can remove these two features from the data.
- Let’s fit the model again and see whether anything changes. The results of OLS Regression after removing these two features are shown below.

In [None]:
model_ols = ols("""quality ~ fixed_acidity 
                        + volatile_acidity 
                        + residual_sugar 
                        + free_sulfur_dioxide
                        + total_sulfur_dioxide 
                        + density 
                        + pH 
                        + sulphates 
                        + alcohol""", data=white_data).fit()

In [None]:
model_summary = model_ols.summary()
HTML(
(model_ols.summary()
    .as_html()
    .replace('<th>Dep. Variable:</th>', '<th style="background-color:#c7e9c0;"> Dep. Variable: </th>')
    .replace('<th>Model:</th>', '<th style="background-color:#c7e9c0;"> Model: </th>')
    .replace('<th>Method:</th>', '<th style="background-color:#c7e9c0;"> Method: </th>')
    .replace('<th>No. Observations:</th>', '<th style="background-color:#c7e9c0;"> No. Observations: </th>')
    .replace('<th>  R-squared:         </th>', '<th style="background-color:#aec7e8;"> R-squared: </th>')
    .replace('<th>  Adj. R-squared:    </th>', '<th style="background-color:#aec7e8;"> Adj. R-squared: </th>')
    .replace('<th>coef</th>', '<th style="background-color:#ffbb78;">coef</th>')
    .replace('<th>std err</th>', '<th style="background-color:#c7e9c0;">std err</th>')
    .replace('<th>P>|t|</th>', '<th style="background-color:#bcbddc;">P>|t|</th>')
    .replace('<th>[0.025</th>    <th>0.975]</th>', '<th style="background-color:#ff9896;">[0.025</th>    <th style="background-color:#ff9896;">0.975]</th>'))
)

- There is no change in the value of R-squared, and the Adjusted R-squared increases by 0.001. So, there is no harm in removing those features from the data. Also, the contribution of these features to predicting the quality of the wine is very small, as shown before. Now, we are left with 9 features.
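
The small rise in Adjusted R-squared follows from its formula, which penalizes the number of features. The sketch below plugs in the rounded R-squared of 0.282 reported above together with the dataset size of 4898; since the inputs are rounded, the results only approximate the values in the OLS summaries.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_features - 1)

n = 4898          # number of data points in this dataset
r2 = 0.282        # rounded R-squared reported for both fits

adj_11 = adjusted_r_squared(r2, n, 11)   # all 11 features
adj_9 = adjusted_r_squared(r2, n, 9)     # after dropping two features
print(round(adj_11, 3), round(adj_9, 3))
```

With R-squared held fixed, fewer features means a smaller penalty, which is why dropping two uninformative features nudges the adjusted value upward.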

In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def goodness(y_true, y_pred):
    mape = mean_absolute_percentage_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r_squared = r2_score(y_true, y_pred)
    return mape, mse, r_squared

### Contribution of features after removing two features 

In [None]:
model = LinearRegression().fit(white_data.drop(columns=['quality', 'citric_acid', 'chlorides']), y)
y_true = white_data.quality
y_pred = model.predict(white_data.drop(columns=['quality', 'citric_acid', 'chlorides']))

In [None]:
column_names = ['fixed_acidity', 'volatile_acidity', 'residual_sugar',
       'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol']
regression_coefficient = pd.DataFrame({'Feature': column_names, 'Coefficient': model.coef_}, columns=['Feature', 'Coefficient'])

In [None]:
column_names = ['fixed_acidity', 'volatile_acidity', 'residual_sugar',
       'free_SO2', 'total_SO2', 'density',
       'pH', 'sulphates', 'alcohol']

plt.figure(figsize=(15,5))
g = sns.barplot(x='Feature', y='Coefficient', data=regression_coefficient, capsize=0.3, palette='spring')
g.set_title("Contribution of features towards target variable", fontsize=15)
g.set_xlabel("Feature", fontsize=13)
g.set_ylabel("Degree of Coefficient", fontsize=13)
g.set_yticks([-8, -6, -4, -2, 0, 2, 4, 6, 8])
g.set_xticklabels(column_names)
for p in g.patches:
    g.annotate(np.round(p.get_height(),decimals=2), 
                (p.get_x()+p.get_width()/2., p.get_height()), 
                ha='center', va='center', xytext=(0, 10), 
               textcoords='offset points', fontsize=14, color='black')

- The coefficients of features have changed a little bit. 

## Check Regression Assumptions 
1. Linearity 
2. Homoskedasticity 
3. Correlation of errors 
4. Normality of errors 

- Let’s check each condition using the predicted values and the errors/residuals. 
- Residuals are the difference between “true value” and the “predicted value”. 

**Note**: The two features “chlorides” and “citric acid” are removed from the data.

### Linearity 
- Use partial regression plots to check linearity.

In [None]:
error = y_true - y_pred
error_info = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred, 'error': error}, columns=['y_true', 'y_pred', 'error'])

In [None]:
fig = plt.figure(figsize=(10,12))
fig = sm.graphics.plot_partregress_grid(model_ols, fig=fig)

#### Analysis:
- If we observe carefully, the partial regression plots between each independent variable and the dependent variable appear roughly linear.
- The linearity condition is satisfied.

### Homoskedasticity 
- To check homoskedasticity, we plot the residuals vs predicted values/fitted values. 
- If we see any kind of funnel shape, we can say that there is heteroskedasticity. 

In [None]:
plt.figure(figsize=(8,5))
g = sns.regplot(x="y_pred", y="error", data=error_info, color='blue')
g.set_title('Check Homoskedasticity', fontsize=15)
g.set_xlabel("predicted values", fontsize=13)
g.set_ylabel("Residual", fontsize=13)

#### Analysis:
- The points are not randomly scattered. We can see a funnel shape to the right, which indicates heteroskedasticity.
- It means that the variance of the residuals is not the same across the fitted values.
- We can conclude that the Homoskedasticity condition doesn’t hold in this case.
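
A visual check can be backed by a numeric one. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic, n times the R-squared from regressing the squared residuals on the predictors, on two synthetic datasets. The helper names and the data are assumptions for illustration; statsmodels also provides `het_breuschpagan` in `statsmodels.stats.diagnostic` for real use.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def breusch_pagan_lm(residuals, X):
    """n * R^2 from regressing squared residuals on X plus an intercept."""
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    A = np.column_stack([np.ones(n), X])
    target = residuals ** 2
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    ss_res = np.sum((target - A @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return n * (1 - ss_res / ss_tot)

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, size=(n, 1))
noise = rng.normal(size=n)
y_constant = 1 + 2 * x[:, 0] + 0.5 * noise                # constant spread
y_funnel = 1 + 2 * x[:, 0] + (0.1 + 2 * x[:, 0]) * noise  # spread grows with x

lm_constant = breusch_pagan_lm(ols_residuals(x, y_constant), x)
lm_funnel = breusch_pagan_lm(ols_residuals(x, y_funnel), x)

CHI2_CRIT_DF1 = 3.841  # 5% critical value, chi-square with 1 degree of freedom
print(lm_constant, lm_funnel, lm_funnel > CHI2_CRIT_DF1)
```

A statistic well above the critical value, as in the funnel-shaped case, is evidence against constant residual variance.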

### Correlation of errors
- The errors should be uncorrelated with each other; a visible pattern in the residuals suggests the model is missing some structure in the data.

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
ax = error_info.error.plot()
ax.set_title('Uncorrelated errors', fontsize=15)
ax.set_xlabel("Data", fontsize=13)
ax.set_ylabel("Residual", fontsize=13)

#### Analysis:
- If we observe the plot, there is no visible correlation/pattern between errors; they look random.
- We can also check this condition using the Durbin-Watson test, whose statistic ranges from 0 to 4:
    - If DW = 2, then there is no correlation.
    - If DW < 2, then the errors are positively correlated.
    - If DW > 2, then the errors are negatively correlated.
- If we perform the Durbin-Watson test, the value of DW is 1.621.
- Taken literally, this suggests that the errors are positively correlated.
- However, DW = 2 corresponds to perfectly uncorrelated errors, and we will rarely get exactly 2 on real data. If the value is close to 2, we can conclude that the errors are approximately uncorrelated.
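
The Durbin-Watson statistic itself is simple to compute: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The helper below is a minimal sketch (the function name is an assumption; statsmodels offers `durbin_watson` in `statsmodels.stats.stattools`), shown on toy residuals rather than the wine data.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """DW = sum((e[t] - e[t-1])^2) / sum(e[t]^2)."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Alternating signs: strongly negatively correlated neighbours.
dw_alternating = durbin_watson_stat([1, -1, 1, -1])   # 12 / 4 = 3.0
# Identical neighbours: strongly positively correlated.
dw_constant = durbin_watson_stat([1, 1, 1, 1])        # 0 / 4 = 0.0
# Independent noise should land close to 2.
rng = np.random.default_rng(0)
dw_random = durbin_watson_stat(rng.normal(size=5000))

print(dw_alternating, dw_constant, round(dw_random, 2))
```

The statistic depends on the order of the rows, so it is most informative when the data has a natural ordering.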

### Normality of error terms
- This can be checked with a normal probability plot or a Quantile-Quantile plot (Q-Q plot). Both compare the ordered residuals with the quantiles of a Normal distribution.

#### Probability plot

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
_ = sp.stats.probplot(error_info.error, plot=ax, fit=True)
ax.set_title('Probability plot', fontsize=15)
ax.set_xlabel("Theoretical Quantiles", fontsize=13)
ax.set_ylabel("Ordered Values", fontsize=13)

#### Quantile-Quantile plot

In [None]:
# fit=True standardizes the residuals so that the 45-degree line is a fair reference
fig = sm.qqplot(error_info.error, fit=True, line='45')
plt.show()

#### Analysis:
- If we observe the above plots, we can conclude that the errors approximately follow a Normal distribution, because the points fluctuate around the line without much deviation.
- The graph is linear.
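
As a numeric complement to the plots, the Jarque-Bera statistic summarizes how far the skewness and kurtosis of a sample are from those of a Normal distribution. The sketch below implements it with NumPy on synthetic samples; the function name and data are illustrative assumptions, and SciPy provides `scipy.stats.jarque_bera` for real use.

```python
import numpy as np

def jarque_bera_stat(x):
    """JB = n/6 * (S^2 + K^2/4), S = skewness, K = excess kurtosis."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    skewness = np.mean(centered ** 3) / variance ** 1.5
    excess_kurtosis = np.mean(centered ** 4) / variance ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

rng = np.random.default_rng(0)
jb_normal = jarque_bera_stat(rng.normal(size=2000))       # Normal sample
jb_skewed = jarque_bera_stat(rng.exponential(size=2000))  # right-skewed sample

CHI2_CRIT_DF2 = 5.991  # 5% critical value, chi-square with 2 degrees of freedom
print(jb_normal, jb_skewed, jb_skewed > CHI2_CRIT_DF2)
```

Applied to the residuals, a value far above the critical value would indicate a departure from normality that a plot might understate.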

### Linear Regression Assumption: Multicollinearity
- If the independent variables are independent of each other, then we say there is no multicollinearity. 
- This can be tested in different ways: 
1. **Correlation plot**: If we observe the plot, there is multicollinearity between variables. 
2. **Variance Inflation Factor (VIF)**: A VIF > 10 is an indication that multicollinearity may be present. With VIF > 100 there is certainly multicollinearity among the variables. 
- We can conclude that multicollinearity among variables exists. 
- If multicollinearity is found in the data, centering the data (that is, subtracting the mean) might help to solve the problem. Another alternative is to conduct a **factor analysis**/**Principal Component Analysis (PCA)** and rotate the factors to ensure the independence of the factors in the linear regression analysis. 
- We can do the same analysis after applying PCA to the data. The transformed features are uncorrelated with each other, so multicollinearity no longer affects the coefficient estimates. 
- The results of OLS Regression are shown below after transforming the feature variables using **PCA**. 
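
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with NumPy on synthetic data containing one nearly duplicated column; the helper name and the data are assumptions for illustration, and statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)              # independent of the others
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two nearly collinear columns receive large VIFs, while the independent column stays close to 1.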

In [None]:
pca = PCA()
transform_X = pca.fit_transform(white_data.drop(columns=['quality']), white_data.quality)

columns = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7',
            'feature_8', 'feature_9', 'feature_10', 'feature_11']
transform_df = pd.DataFrame.from_records(transform_X)
transform_df.columns = columns
transform_df['quality'] = white_data.quality

In [None]:
model_ols_new = ols("""quality ~ feature_1 
                        + feature_2 
                        + feature_3
                        + feature_4 
                        + feature_5 
                        + feature_6 
                        + feature_7 
                        + feature_8 
                        + feature_9 
                        + feature_10 
                        + feature_11""", data=transform_df).fit()

In [None]:
model_summary = model_ols_new.summary()
HTML(
(model_ols_new.summary()
    .as_html()
    .replace('<th>Dep. Variable:</th>', '<th style="background-color:#c7e9c0;"> Dep. Variable: </th>')
    .replace('<th>Model:</th>', '<th style="background-color:#c7e9c0;"> Model: </th>')
    .replace('<th>Method:</th>', '<th style="background-color:#c7e9c0;"> Method: </th>')
    .replace('<th>No. Observations:</th>', '<th style="background-color:#c7e9c0;"> No. Observations: </th>')
    .replace('<th>  R-squared:         </th>', '<th style="background-color:#aec7e8;"> R-squared: </th>')
    .replace('<th>  Adj. R-squared:    </th>', '<th style="background-color:#aec7e8;"> Adj. R-squared: </th>')
    .replace('<th>coef</th>', '<th style="background-color:#ffbb78;">coef</th>')
    .replace('<th>std err</th>', '<th style="background-color:#c7e9c0;">std err</th>')
    .replace('<th>P>|t|</th>', '<th style="background-color:#bcbddc;">P>|t|</th>')
    .replace('<th>[0.025</th>    <th>0.975]</th>', '<th style="background-color:#ff9896;">[0.025</th>    <th style="background-color:#ff9896;">0.975]</th>'))
)

#### Analysis:
- If we observe the p-values of the transformed features, all of them are less than 0.05. The principal components are uncorrelated by construction, so the multicollinearity problem no longer arises. 
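
The reason PCA removes multicollinearity can be checked directly: the component scores have zero correlation with each other. The sketch below performs PCA through an SVD of the centered data on synthetic features; the data and variable names are illustrative assumptions, not the wine features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
X = np.column_stack([
    x1,
    x1 + 0.2 * rng.normal(size=n),   # highly correlated with the first column
    rng.normal(size=n),
])

centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions; projecting onto them gives the scores.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)
off_diag_after = corr_after - np.diag(np.diag(corr_after))

print(np.round(corr_before, 2))
print(np.round(corr_after, 2))
```

When all components are kept, the regression has the same fitted values as before; only the coordinates in which the coefficients are expressed change.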

In [None]:
r2_linear_regression = model_ols_new.rsquared

model_ridge=linear_model.Ridge()
model_ridge.fit(white_data.drop(columns=['quality']),white_data.quality)
y_predict_ridge = model_ridge.predict(white_data.drop(columns=['quality']))
r2_ridge = r2_score(y_true, y_predict_ridge)

model_lasso=linear_model.Lasso()
model_lasso.fit(white_data.drop(columns=['quality']),white_data.quality)
y_predict_lasso = model_lasso.predict(white_data.drop(columns=['quality']))
r2_lasso = r2_score(y_true, y_predict_lasso)  # store the score instead of discarding it

n_neighbors=5
knn=neighbors.KNeighborsRegressor(n_neighbors,weights='uniform')
knn.fit(white_data.drop(columns=['quality']),white_data.quality)
y_predict_knn=knn.predict(white_data.drop(columns=['quality']))
r2_knn = r2_score(y_true, y_predict_knn)

reg = linear_model.BayesianRidge()
reg.fit(white_data.drop(columns=['quality']),white_data.quality)
y_pred_reg=reg.predict(white_data.drop(columns=['quality']))
r2_bayesian = r2_score(y_true, y_pred_reg)

dec = tree.DecisionTreeRegressor(max_depth=6)
dec.fit(white_data.drop(columns=['quality']),white_data.quality)
y1_dec=dec.predict(white_data.drop(columns=['quality']))
r2_dt = r2_score(y_true, y1_dec)

svm_reg=svm.SVR()
svm_reg.fit(white_data.drop(columns=['quality']),white_data.quality)
y1_svm=svm_reg.predict(white_data.drop(columns=['quality']))
r2_svm = r2_score(y_true, y1_svm)

## Discussion
- We can make a lot of changes to improve the accuracy of the model. Some of the conditions above are violated. If we transform the variables accordingly, we may achieve better results. The R-squared score is 0.282, so the model explains only about 28% of the variance, which is poor. So, there is scope to apply different methods to get better models. 
- I have applied different popular Regression methods to the data to compare the results. The table below shows the R-squared of each method.

In [None]:
r2_list = [r2_linear_regression, r2_ridge, r2_knn, r2_dt, r2_bayesian, r2_svm]
r2_names = ['Linear Regression', 'Ridge Regression', 'KNN', 'Decision Tree', 'Bayesian Regression', 'SVM']

col = {'R-squared':r2_list, 'Method':r2_names}
df = pd.DataFrame(data=col, columns=['Method', 'R-squared'])

data_head = df
colorscale = [[0, '#4d004c'],[.5, '#f2e5ff'],[1, '#ffffff']]
df_table = ff.create_table(round(data_head.iloc[:,[0,1]], 3), colorscale=colorscale)
py.iplot(df_table, filename='wine_quality')

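- If we compare R-squared, KNN scored highest among the Regression methods. Note that every score here is computed on the same data used for fitting, so flexible methods such as KNN and Decision Tree are favored, and the linear variants (Ridge, Bayesian Regression) cannot exceed ordinary least squares on this measure. Even so, we can conclude there is a lot of scope to improve the Linear Regression model. 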
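
Since the comparison above scores each model on the data it was fit to, it helps to see how much a flexible method can gain from that alone. The sketch below uses a hand-written 1-nearest-neighbour regressor on synthetic data with a held-out split. The value k=1 is chosen deliberately to exaggerate the effect and differs from the k=5 used in the notebook; the helper names, the data and the split sizes are all assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def one_nn_predict(X_train, y_train, X_query):
    # Squared Euclidean distance from every query row to every training row.
    dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
n = 600
X = rng.uniform(0, 1, size=(n, 3))
y = 5 + 2 * X[:, 0] + rng.normal(size=n)     # weak signal, heavy noise

idx = rng.permutation(n)
train, test = idx[:450], idx[450:]           # hold out 150 rows for testing

r2_train = r_squared(y[train], one_nn_predict(X[train], y[train], X[train]))
r2_test = r_squared(y[test], one_nn_predict(X[train], y[train], X[test]))
print(r2_train, r2_test)
```

On the training rows each point is its own nearest neighbour, so the training score is perfect, while the held-out score is far lower. Scoring all methods on a held-out set would give a fairer comparison.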

## Conclusion  
- We have visualized the wine dataset in several ways, shown in the form of plots.  
- A Linear Regression model is built to predict the target variable. The model is refined by removing features that do not contribute, and the data is transformed using Principal Component Analysis (PCA). 
- The assumptions of Linear Regression are checked and analyzed. 
- In the end, Linear Regression is compared with other popular Regression methods, among which KNN performed best. 
- There is a lot of scope to increase the performance of the Linear Regression model. 
- We can increase the number of samples to build a more robust model. Also, we can add more features that contribute to the wine quality. 

### Thank you for reading till the end.
### Do Upvote if you find it useful. Feedback is always welcome. Please let me know in the comment section below.