### Simple Linear Regression using Wine Quality Data

In the wine quality data, the dependent variable (Y) is wine quality, and the independent
variable (X) we have chosen is alcohol content.

Here we test whether there is a significant relationship between the two, that is, whether
a change in alcohol percentage is a deciding factor in the quality of the wine.

### Importing Necessary Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [2]:
wine_quality = pd.read_csv("winequality-red.csv", sep = ';')

In [3]:
wine_quality.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [4]:
wine_quality.rename(columns = lambda x: x.replace(" ", "_"), inplace = True)

In [5]:
wine_quality.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [6]:
X = wine_quality['alcohol']
y = wine_quality['quality']

In the following step, we will split data into train and test using the 70 percent - 30 percent rule:

In [7]:
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 42)
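
To make the 70/30 split concrete, here is a minimal sketch on ten made-up values (the arrays `toy_X` and `toy_y` are hypothetical stand-ins, not the wine data). It shows the resulting sizes and that fixing `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: ten observations, not the wine dataset.
toy_X = np.arange(10)
toy_y = toy_X * 2

# train_size = 0.7 keeps 70 percent for training; the rest becomes the test set.
tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, train_size = 0.7, random_state = 42)
print(len(tr_X), len(te_X))

# The same random_state yields the same partition on every run.
tr_X2, te_X2, _, _ = train_test_split(toy_X, toy_y, train_size = 0.7, random_state = 42)
print(np.array_equal(te_X, te_X2))
```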

Selecting a single variable out of the DataFrame returns a pandas Series, so we convert each split back into a pandas DataFrame:

In [8]:
x_train = pd.DataFrame(x_train)
x_test = pd.DataFrame(x_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)
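
The Series-versus-DataFrame distinction can be seen in a small sketch. It uses only the alcohol and quality values from the five rows displayed by `head()` above; the names `sample`, `as_series`, and `as_frame` are illustrative:

```python
import pandas as pd

# Toy frame built from the five rows shown by head(); not the full dataset.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

as_series = sample["alcohol"]    # single brackets return a pandas Series
as_frame = sample[["alcohol"]]   # a list of column names returns a DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Selecting with a list of column names, as in `wine_quality[['alcohol']]`, would be one way to avoid the conversion step; it is an alternative, not what the cells above do.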

The following function calculates the mean of a DataFrame column, rounded to two decimal
places. The mean is calculated for both the alcohol (independent) and quality (dependent)
variables:

In [9]:
def mean(values):
    return round(sum(values) / float(len(values)),2)

In [10]:
alcohol_mean = mean(x_train['alcohol'])
quality_mean = mean(y_train['quality'])
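
As a quick sanity check, the helper can be compared with the built-in pandas `Series.mean()`. The sketch below uses only the five alcohol values displayed by `head()` above (the name `sample_alcohol` is illustrative), so the number it prints is not the training-set mean:

```python
import pandas as pd

def mean(values):
    # Same helper as above: arithmetic mean rounded to two decimals.
    return round(sum(values) / float(len(values)), 2)

# The five alcohol values shown by head(); not the full training set.
sample_alcohol = pd.Series([9.4, 9.8, 9.8, 9.8, 9.4])

helper_mean = mean(sample_alcohol)
builtin_mean = round(sample_alcohol.mean(), 2)
print(helper_mean, builtin_mean)
```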

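The variance and covariance are needed to calculate the coefficients of the regression
model. The cells below compute them as unnormalized sums of squared deviations and
cross-products; the omitted 1/n factor cancels in the ratio used for the slope: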

In [11]:
alcohol_variance = round(sum((x_train['alcohol'] - alcohol_mean) ** 2), 2)
quality_variance = round(sum((y_train['quality'] - quality_mean) ** 2), 2)

In [12]:
covariance = round(sum((x_train['alcohol'] - alcohol_mean) * (y_train['quality'] - quality_mean)), 2)

In [13]:
b1 = covariance/alcohol_variance
b0 = quality_mean - b1*alcohol_mean
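
In formula form, the cells above compute the ordinary least squares estimates, where $x_i$ is alcohol, $y_i$ is quality, and bars denote the training-set means:

$$
b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}
$$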

In [14]:
print("\nIntercept (B0):", round(b0,4), "Coefficient (B1):", round(b1,4))


Intercept (B0): 1.6918 Coefficient (B1): 0.377
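
One way to sanity-check hand-computed coefficients is to compare them with a library fit. The sketch below does this with `numpy.polyfit` on the five rows displayed by `head()` (the arrays `xs` and `ys` are illustrative, so the fitted values will differ from the training-set coefficients above):

```python
import numpy as np

# Alcohol and quality from the five rows shown by head(); not the training set.
xs = np.array([9.4, 9.8, 9.8, 9.8, 9.4])
ys = np.array([5, 5, 5, 6, 5], dtype=float)

# Hand computation: sum of cross-products over sum of squared deviations.
sxy = np.sum((xs - xs.mean()) * (ys - ys.mean()))
sxx = np.sum((xs - xs.mean()) ** 2)
slope = sxy / sxx
intercept = ys.mean() - slope * xs.mean()

# polyfit with degree 1 returns the coefficients from highest degree to lowest.
fit_slope, fit_intercept = np.polyfit(xs, ys, 1)
print(np.allclose([slope, intercept], [fit_slope, fit_intercept]))
```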


After computing the coefficients, we predict the quality variable on the test data and
assess the goodness of fit using the R-squared value:

In [15]:
y_test['y_pred'] = b0 + b1 * x_test['alcohol']
R_sqrd = 1 - (sum((y_test['quality'] - y_test['y_pred']) ** 2) / sum((y_test['quality'] - mean(y_test['quality'])) ** 2))
print("Test R-Squared value ", round(R_sqrd, 4))

Test R-Squared value  0.185
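
The `r2_score` function imported at the top is not used in the cells above, but it offers a way to cross-check the manual formula. The sketch below compares the two on the five rows displayed by `head()`, using the coefficients printed earlier; the array names are illustrative, and the value obtained on five rows is not the test-set R-squared:

```python
import numpy as np
from sklearn.metrics import r2_score

# Alcohol and quality from the five rows shown by head(); not the test set.
alcohol = np.array([9.4, 9.8, 9.8, 9.8, 9.4])
quality = np.array([5, 5, 5, 6, 5], dtype=float)

# Predictions from the coefficients printed above (B0 = 1.6918, B1 = 0.377).
predicted = 1.6918 + 0.377 * alcohol

# Manual R-squared: 1 - residual sum of squares / total sum of squares.
ss_res = np.sum((quality - predicted) ** 2)
ss_tot = np.sum((quality - quality.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

sklearn_r2 = r2_score(quality, predicted)
print(np.isclose(manual_r2, sklearn_r2))
```

In the notebook itself, the equivalent call would be `r2_score(y_test['quality'], y_test['y_pred'])`. Because the manual cell uses a mean rounded to two decimals, the two results may differ slightly in the later decimal places.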


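From the test R-squared value, we can conclude that there is no strong relationship between
the quality and alcohol variables in the wine data: alcohol alone explains only about 18.5
percent of the variance in quality on the test set, well below the 0.7 rule-of-thumb
threshold used here.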
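
R-squared measures how much variance the model explains, but it is not itself a significance test. One possible way to obtain a p-value for the slope, assuming SciPy is installed, is `scipy.stats.linregress`. The sketch below runs it on the five rows displayed by `head()` purely to show the call; with so few rows, the p-value says nothing about the full dataset:

```python
import numpy as np
from scipy.stats import linregress

# Alcohol and quality from the five rows shown by head(); not the full dataset.
sample_alcohol = np.array([9.4, 9.8, 9.8, 9.8, 9.4])
sample_quality = np.array([5, 5, 5, 6, 5], dtype=float)

# linregress fits y = intercept + slope * x and reports a two-sided p-value
# for the null hypothesis that the slope is zero.
result = linregress(sample_alcohol, sample_quality)
print("slope:", result.slope, "intercept:", result.intercept)
print("r-squared:", result.rvalue ** 2, "p-value:", result.pvalue)
```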