<a href="https://colab.research.google.com/github/robdnh/ml_course/blob/main/wine_quality_linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import required libraries and download data

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import numpy as np
from git import Repo

Repo.clone_from('https://github.com/robdnh/data.git', '/content/data')

<git.repo.base.Repo '/content/data/.git'>

### 1. Identify a problem we'd like to solve

<em>Given the physicochemical properties of wine, can we predict its quality?</em>

### 2. Load the data into a pandas DataFrame for easy inspection, manipulation, and feature engineering

In [4]:
# Load the red wine quality dataset from the cloned repository
# Data source: https://archive.ics.uci.edu/dataset/186/wine+quality

df = pd.read_csv('data/linear-regression/winequality-red.csv')
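Before choosing features, it usually helps to check the shape, column types, and missing values of the frame. The following is a minimal sketch that uses a tiny in-memory CSV in place of `winequality-red.csv`. The column names are a subset of the real ones, but `toy_csv`, `toy_df`, and all of the numbers are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: three columns, four made-up rows.
toy_csv = io.StringIO(
    "alcohol,pH,quality\n"
    "9.4,3.51,5\n"
    "9.8,3.20,5\n"
    "10.5,3.30,6\n"
    "11.2,3.16,7\n"
)
toy_df = pd.read_csv(toy_csv)

print(toy_df.shape)                # (rows, columns)
print(toy_df.dtypes)               # numeric columns can be used directly
print(toy_df.isna().sum())         # count of missing values per column
print(toy_df.describe().round(2))  # summary statistics per column
```

The same four calls can be run on `df` after loading the real file.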



**Exercise**: <em>Looking at this dataset, what are our features and what is our label?</em>

**Exercise**: <em>Features in a dataset used for linear regression should be relatively independent of one another and should have a roughly linear relationship with the dependent variable (label). How do we check whether the features are independent?</em>

In [5]:
df.corr().round(3)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.0 | -0.256 | 0.672 | 0.115 | 0.094 | -0.154 | -0.113 | 0.668 | -0.683 | 0.183 | -0.062 | 0.124 |
| volatile acidity | -0.256 | 1.0 | -0.552 | 0.002 | 0.061 | -0.011 | 0.076 | 0.022 | 0.235 | -0.261 | -0.202 | -0.391 |
| citric acid | 0.672 | -0.552 | 1.0 | 0.144 | 0.204 | -0.061 | 0.036 | 0.365 | -0.542 | 0.313 | 0.11 | 0.226 |
| residual sugar | 0.115 | 0.002 | 0.144 | 1.0 | 0.056 | 0.187 | 0.203 | 0.355 | -0.086 | 0.006 | 0.042 | 0.014 |
| chlorides | 0.094 | 0.061 | 0.204 | 0.056 | 1.0 | 0.006 | 0.047 | 0.201 | -0.265 | 0.371 | -0.221 | -0.129 |
| free sulfur dioxide | -0.154 | -0.011 | -0.061 | 0.187 | 0.006 | 1.0 | 0.668 | -0.022 | 0.07 | 0.052 | -0.069 | -0.051 |
| total sulfur dioxide | -0.113 | 0.076 | 0.036 | 0.203 | 0.047 | 0.668 | 1.0 | 0.071 | -0.066 | 0.043 | -0.206 | -0.185 |
| density | 0.668 | 0.022 | 0.365 | 0.355 | 0.201 | -0.022 | 0.071 | 1.0 | -0.342 | 0.149 | -0.496 | -0.175 |
| pH | -0.683 | 0.235 | -0.542 | -0.086 | -0.265 | 0.07 | -0.066 | -0.342 | 1.0 | -0.197 | 0.206 | -0.058 |
| sulphates | 0.183 | -0.261 | 0.313 | 0.006 | 0.371 | 0.052 | 0.043 | 0.149 | -0.197 | 1.0 | 0.094 | 0.251 |


A positive correlation between two variables means they tend to move in the same direction, while a negative correlation means they tend to move in opposite directions.
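To make the sign of a correlation concrete, here is a small sketch on made-up numbers (not wine data). The arrays `rises_with_x` and `falls_with_x` are hypothetical and exactly linear in `x`, so the coefficients sit at the two extremes of the Pearson range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rises_with_x = 2.0 * x + 1.0    # moves in the same direction as x
falls_with_x = 10.0 - 3.0 * x   # moves in the opposite direction to x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_pos = np.corrcoef(x, rises_with_x)[0, 1]
r_neg = np.corrcoef(x, falls_with_x)[0, 1]
print(round(r_pos, 3), round(r_neg, 3))
```

Real measurements are noisy, so their coefficients fall somewhere between these extremes, as in the table above.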

**Exercise**: <em>Are there fields with positive correlations and fields with negative correlations? Do the fields seem sufficiently independent? Is this an optimal dataset for linear regression?
**Bonus**: Given what this dataset measures, can anyone explain why?
</em>
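One way to approach the exercise above is to list every pair of features whose absolute correlation exceeds a cut-off. The sketch below uses four features and the coefficients copied from the correlation table. The 0.6 `threshold` is an arbitrary assumption for illustration, not an established rule.

```python
import pandas as pd

cols = ["fixed acidity", "citric acid", "density", "pH"]
# Coefficients copied from the correlation table above.
corr = pd.DataFrame(
    [
        [1.000, 0.672, 0.668, -0.683],
        [0.672, 1.000, 0.365, -0.542],
        [0.668, 0.365, 1.000, -0.342],
        [-0.683, -0.542, -0.342, 1.000],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.6  # assumed cut-off for "strongly correlated"

# Visit each unordered pair once (upper triangle of the matrix).
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
for a, b, r in strong_pairs:
    print(f"{a} vs {b}: r = {r}")
```

With the real data, `df.drop('quality', axis=1).corr()` could take the place of the hand-typed `corr` frame.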


### 3. Normalize values of the dataset, if necessary

In [7]:
# Standardize the numerical features (zero mean, unit variance); the label 'quality' is left unscaled
cols_to_norm = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
                'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
df[cols_to_norm] = StandardScaler().fit_transform(df[cols_to_norm])
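To show what `StandardScaler` does to each column, the sketch below compares it with the formula z = (x - mean) / std on a single made-up column. The `values` array is illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single feature column (shape: 4 rows x 1 column).
values = np.array([[9.4], [9.8], [10.5], [11.2]])

scaled = StandardScaler().fit_transform(values)

# The same result by hand: subtract the column mean, then divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel().round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in different units become comparable.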

### 4. Split the dataset into features (X) and a label (y)

In [8]:
X = df.drop('quality', axis=1)
y = df['quality']

print(X)

#~~~~~~~~~~~~~~~~~

print(y)

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0         -0.528360          0.961877    -1.391472       -0.453218  -0.243707   
1         -0.298547          1.967442    -1.391472        0.043416   0.223875   
2         -0.298547          1.297065    -1.186070       -0.169427   0.096353   
3          1.654856         -1.384443     1.484154       -0.453218  -0.264960   
4         -0.528360          0.961877    -1.391472       -0.453218  -0.243707   
...             ...               ...          ...             ...        ...   
1594      -1.217796          0.403229    -0.980669       -0.382271   0.053845   
1595      -1.390155          0.123905    -0.877968       -0.240375  -0.541259   
1596      -1.160343         -0.099554    -0.723916       -0.169427  -0.243707   
1597      -1.390155          0.654620    -0.775267       -0.382271  -0.264960   
1598      -1.332702         -1.216849     1.021999        0.752894  -0.434990   

      free sulfur dioxide  

### 5. Split the dataset into a training set (x_train, y_train) and a test set (x_test, y_test)

In [9]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=40)
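With `test_size=0.20`, one row in five is held out for evaluation. This notebook scales the full dataset before splitting. A common alternative is to fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The sketch below shows that variant on synthetic data; `X_toy`, `y_toy`, and the other names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=2.0, size=(100, 3))  # synthetic features
y_toy = rng.integers(3, 9, size=100)                    # synthetic integer labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=40
)
print(X_tr.shape, X_te.shape)

scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std
```

Fixing `random_state` makes the split reproducible from run to run.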

### 6. Establish and train a model

In [10]:

model = LinearRegression()

model.fit(x_train, y_train)
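After fitting, a `LinearRegression` model stores one weight per feature in `coef_` and the constant term in `intercept_`. The sketch below fits a model to synthetic data generated from a known, noise-free rule, so the recovered values can be checked against it. `X_toy`, `y_toy`, and `toy_model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))                      # two synthetic features
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0   # known, noise-free rule

toy_model = LinearRegression().fit(X_toy, y_toy)

print(toy_model.coef_.round(3))         # one weight per feature
print(round(toy_model.intercept_, 3))   # prediction when every feature is 0
```

On the wine model, `model.coef_` lines up with the columns of `x_train`. Because the features were standardized, the coefficient magnitudes give a rough indication of each feature's influence on the prediction.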

### 7. Predict labels (y_pred) associated with the test features (x_test)

In [11]:
# Make predictions on the test set
y_pred = model.predict(x_test)
print(y_pred)

[5.97751296 5.21281463 5.69691905 6.37317677 5.58008105 5.48337242
 5.7432569  6.40020606 5.29331222 5.48048552 5.83435203 5.66581677
 6.12753086 6.62703942 5.25984497 6.209118   5.83465586 4.84794634
 5.36646949 5.13044843 5.17652803 5.51946744 5.03371999 5.43955924
 6.55085054 5.35970358 5.82816034 5.25936555 5.84725097 5.371452
 5.29023222 5.55137281 5.29417263 5.11677105 5.12190603 5.56259519
 6.69769539 5.01863612 5.60806605 5.74607154 5.27094904 5.71848435
 5.10925922 5.83913012 6.13965603 5.17787859 5.77693335 6.19608418
 6.61550566 5.09932673 6.40020606 5.25292609 4.94054864 6.02555288
 6.07756061 5.07144153 5.69284514 5.66524378 5.96724117 5.32509552
 5.89724736 5.18343205 5.61877085 6.42350365 5.62736033 5.36435368
 4.99077655 5.6251994  6.23627693 6.32539853 5.31159291 5.20596295
 5.18800136 6.31179876 6.04145832 5.06516142 5.15367544 5.54630124
 5.74632028 6.45531727 5.81519865 6.13196341 5.92048251 5.34343051
 5.90555293 4.98795848 5.92881486 5.06970982 5.71175563 5.131321
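The predictions are continuous values even though the quality labels are integers, because a linear model computes a weighted sum of the features plus an intercept. The sketch below checks that on made-up data; `X_fit`, `y_fit`, and `X_new` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_fit = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y_fit = np.array([5.0, 6.0, 6.0, 7.0])   # made-up quality scores
toy_model = LinearRegression().fit(X_fit, y_fit)

X_new = np.array([[1.5, 0.5], [2.5, 2.0]])
by_predict = toy_model.predict(X_new)
# The same computation written out: features times weights, plus intercept.
by_hand = X_new @ toy_model.coef_ + toy_model.intercept_

print(np.allclose(by_predict, by_hand))
```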

### 8. Measure the accuracy of our predictions to evaluate performance

In [17]:
# sklearn's accuracy_score applies to classification tasks;
# for regression we report error metrics and R-squared instead

mse = mean_squared_error(y_test, y_pred)

mae = mean_absolute_error(y_test, y_pred)

# Mean Absolute Percentage Error (undefined if any actual value is 0)
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100

r2 = r2_score(y_test, y_pred)
mape_str = str(round(mape(y_test, y_pred))) + ' %'  # named mape_str to avoid shadowing the built-in map()
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R-squared Score: {r2}")
print(f"Mean Absolute Percentage Error: {mape_str}")

Mean Squared Error: 0.409384015104224
Mean Absolute Error: 0.494186028580269
R-squared Score: 0.4183385391256881
Mean Absolute Percentage Error: 9 %
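An R-squared score of about 0.42 means the model accounts for roughly 42% of the variance in quality on the test set, compared with always predicting the mean. To show where these numbers come from, the sketch below recomputes MSE, MAE, and R-squared by hand on made-up values; `y_true` and `y_hat` are illustrative, not outputs of this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])   # made-up labels
y_hat = np.array([5.4, 5.8, 5.1, 6.3, 6.2])    # made-up predictions

errors = y_true - y_hat
mse_manual = np.mean(errors ** 2)       # average squared error
mae_manual = np.mean(np.abs(errors))    # average absolute error

ss_res = np.sum(errors ** 2)                     # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MSE penalizes large misses more heavily than MAE because the errors are squared before averaging.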
