## Red Wine Quality Analysis

### Thank you for opening the notebook.
Please share your views in the comments section. Discussion and feedback are welcome :)
This notebook walks through a linear regression on the Red Wine Quality dataset.

## Importing the Relevant Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Reading and Understanding the Data

In [None]:
wine= pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
wine.head()

In [None]:
wine.shape

In [None]:
wine.info()

The dataset has no missing values.
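An explicit count per column makes the missing-value check easier to read than the `info()` summary. The sketch below uses a small made-up frame (the column values and the deliberate `NaN` are illustrative assumptions, not rows from the wine data) to show what the check reports when something is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine data; values are illustrative only.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.0],  # one value deliberately missing
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 6],
})

# Count missing entries per column; all zeros would mean nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

On the real data the equivalent call would be `wine.isnull().sum()`, which should show zero for every column if the statement above holds.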

In [None]:
wine.describe()

## Visualize the Variables

In [None]:
sns.pairplot(wine)

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(wine.corr(),annot=True)

## Splitting the Dataset to Train and Test

### The next step is to divide the data into “attributes” and “labels”. Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted.

In [None]:
X = wine.drop('quality',axis=1)
y = wine['quality']

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.3,random_state=22)

### A common choice is an 80-20 or 70-30 split between the training and test sets. Setting random_state fixes the shuffle, so the same rows land in each set on every run.
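To see what `random_state` does in practice, here is a minimal sketch on synthetic arrays (`X_demo` and `y_demo` are hypothetical stand-ins, not the wine data). Two calls with the same seed should return the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (10 rows, 2 features); not the wine dataset.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce the same partition.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=22)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=22)

n_test_rows = len(y_test_a)                      # 30% of 10 rows
same_split = np.array_equal(y_test_a, y_test_b)  # identical test labels
print(n_test_rows, same_split)
```

Leaving `random_state` unset would give a different shuffle on each run, which makes metrics harder to compare between runs.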

## Building the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Fit the model
lm = LinearRegression()
lm.fit(X_train,y_train)

In [None]:
lm.coef_

### The result should look something like this:

In [None]:
coeff = pd.DataFrame(lm.coef_, X_train.columns, columns=['Coefficient'])
coeff

### The coef_ attribute holds one coefficient per feature. Each is the change in predicted quality for a one-unit increase in that feature, with the other features held fixed.


In [None]:
lm.intercept_

### The intercept (often labeled the constant) is the expected mean value of Y when all X=0.
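The roles of the intercept and the coefficients can be checked on data where the true relationship is known. The sketch below assumes a made-up, noise-free relationship `y = 2 + 3*x1 - x2`; the names `X_demo`, `y_demo`, and `demo_lm` are illustrative and separate from the wine model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 2 + 3*x1 - 1*x2.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 2 + 3 * X_demo[:, 0] - 1 * X_demo[:, 1]

demo_lm = LinearRegression().fit(X_demo, y_demo)

# With every feature at 0, the prediction equals the intercept.
at_zero = demo_lm.predict(np.zeros((1, 2)))[0]

# Raising the first feature by one unit shifts the prediction by its coefficient.
base = demo_lm.predict(np.array([[1.0, 1.0]]))[0]
bumped = demo_lm.predict(np.array([[2.0, 1.0]]))[0]

print("prediction at zero:", at_zero, "intercept:", demo_lm.intercept_)
print("shift for +1 in x1:", bumped - base, "coefficient:", demo_lm.coef_[0])
```

For the wine data, a sample with all features at zero is not physically meaningful, so the intercept there is best read as a baseline offset rather than the quality of a real wine.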

## Making Predictions

In [None]:
pred = lm.predict(X_test)

In [None]:
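# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(y_test - pred, bins=30, kde=True)
plt.title('Distribution of Residuals (Actual - Predicted)')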

### The residuals look approximately normally distributed, so the assumption of normally distributed error terms appears to hold.

### The output looks something like the below:

In [None]:
df= pd.DataFrame({'Actual':y_test,'Predictions':pred})
df['Predictions']= round(df['Predictions'],2)
df.head()

### Though the model is not very precise, the predicted quality scores are reasonably close to the actual ones.

In [None]:
fig, ax = plt.subplots()
ax.scatter(y_test,pred)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
plt.show()

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x='Actual', y='Predictions', data=df)

## Evaluating Model Performance:
    
* Mean Absolute Error (MAE) is the mean of the absolute value of the errors.

* Mean Squared Error (MSE) is the mean of the squared errors.

* Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.
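The three definitions above can be worked through by hand on a few numbers. The sketch below uses four made-up actual and predicted scores (illustrative values, not taken from the wine model) and computes each metric directly with NumPy.

```python
import numpy as np

# Four made-up quality scores and predictions, for illustration only.
actual = np.array([5.0, 6.0, 5.0, 7.0])
predicted = np.array([5.5, 5.0, 5.0, 6.0])

errors = actual - predicted        # [-0.5, 1.0, 0.0, 1.0]

mae = np.mean(np.abs(errors))      # mean of absolute errors
mse = np.mean(errors ** 2)         # mean of squared errors
rmse = np.sqrt(mse)                # square root of the MSE

print("MAE:", mae)    # 0.625
print("MSE:", mse)    # 0.5625
print("RMSE:", rmse)  # 0.75
```

Because the errors are squared before averaging, MSE and RMSE penalize large misses more heavily than MAE does. RMSE is in the same units as the target, which makes it easier to interpret than MSE.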

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, pred))
print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

### We want the RMSE to be as low as possible: the lower the RMSE, the better the model's predictions.

### R-squared

In [None]:
print('R squared (train):', lm.score(X_train, y_train))
print('R squared (test):', lm.score(X_test, y_test))
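R-squared is the share of the variance in the target that the model explains: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on four made-up values (illustrative only, not results from the wine model); the `score` method of `LinearRegression` returns this same quantity for the data passed to it.

```python
import numpy as np

# Four made-up quality scores and predictions, for illustration only.
actual = np.array([5.0, 6.0, 5.0, 7.0])
predicted = np.array([5.5, 5.0, 5.0, 6.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((actual - predicted) ** 2)

# Total sum of squares: error of always predicting the mean of the actuals.
ss_tot = np.sum((actual - actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print("SS_res:", ss_res, "SS_tot:", ss_tot)
print("R squared:", r_squared)
```

A value of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values are possible on held-out data when the model does worse than the mean.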

## Thank you for investing your precious time on this notebook!