# Wine Quality Regression Analysis

This is my first regression analysis. At the end of chapter 2 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow", the author encourages us to show off our skills by working on a Kaggle dataset, and that is what I'm doing here.

These are the steps that I will go through:


1. Look at the big picture
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare the data for Machine Learning algorithms
5. Select a model and train it
6. Fine-tune your model
7. Present your solution

In [None]:
import numpy as np
import pandas as pd

# 1. The Big Picture

This seems like a good candidate for a supervised regression algorithm, as we have a known output variable (quality) that's scored between 0 and 10.

# 2. Get the Data

The data is already provided to us, so all we need to do is load it into pandas.

In [None]:
df = pd.read_csv('../input/winequality-red.csv')

# 3. Discover and Visualize the Data

In [None]:
df.head()

In [None]:
df.info()

There are no missing values.
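
`df.info()` shows the non-null count per column, but the nulls can also be counted directly. A minimal sketch, using a small made-up DataFrame (`toy`, a hypothetical stand-in for `df`) that deliberately contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.0],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 6],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole DataFrame has no missing values.
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

Running the same two expressions on `df` should give all zeros and `True` if the data really has no missing values.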

In [None]:
df.describe()

The wine quality ranges from 3 to 8.

Let's plot histograms of the data.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
df.hist(bins = 50, figsize = (20,15))
plt.show()

Most of the wines have a quality of 5 or 6.
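
The histogram observation can be backed up with exact counts. A small sketch, assuming a made-up `quality` Series as a stand-in for `df['quality']` (the numbers below are illustrative, not the real distribution):

```python
import pandas as pd

# Hypothetical quality scores standing in for df['quality'].
quality = pd.Series([5, 6, 5, 7, 6, 5, 4, 6, 5, 8], name="quality")

# Number of wines per quality score, ordered by score.
counts = quality.value_counts().sort_index()
print(counts)

# Fraction of wines rated 5 or 6.
share_5_or_6 = quality.isin([5, 6]).mean()
print(share_5_or_6)
```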


Let's now look at which features correlate most strongly with the wine quality.

In [None]:
corr_matrix = df.corr()
corr_matrix['quality'].sort_values(ascending = False)

Alcohol, sulphates, and volatile acidity affect the wine quality the most.
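
Sorting by the signed correlation puts strong negative correlations at the bottom of the list, so it can help to rank by absolute value instead. A minimal sketch, where `rank_by_abs_corr` is a hypothetical helper and the data is synthetic (the feature names are borrowed from the dataset, the values are not):

```python
import numpy as np
import pandas as pd

def rank_by_abs_corr(frame, target):
    """Return each feature's correlation with `target`, strongest first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Synthetic data: one positive driver, one negative driver, one irrelevant column.
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10, 1, n)
volatile_acidity = rng.normal(0.5, 0.2, n)
noise_feature = rng.normal(0, 1, n)
quality = 0.5 * alcohol - 3.0 * volatile_acidity + rng.normal(0, 0.3, n)

toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile_acidity": volatile_acidity,
    "noise_feature": noise_feature,
    "quality": quality,
})

ranked = rank_by_abs_corr(toy, "quality")
print(ranked)
```

The signs are kept in the output, so a negatively correlated feature still shows up as negative, just ranked by its strength.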

# 4. Prepare the Data for Machine Learning

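We should first perform feature scaling. Note that the cell below standardizes every column, including the target `quality`, so the errors reported later are in standardized units rather than quality points.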

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns = df.columns)

We should now split the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X = df_scaled.drop(['quality'], axis = 1)
y = df_scaled['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Since all the variables are numerical, the data does not require any more cleaning.
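
The cells above scale the whole DataFrame before splitting. A common alternative is to split first, fit the scaler on the training features only, and leave the target unscaled. This is a sketch of that alternative on synthetic data, not what this notebook does; the `toy` DataFrame and the `random_state=42` seed are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df; the values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10, 1, 100),
    "sulphates": rng.normal(0.65, 0.15, 100),
    "quality": rng.integers(3, 9, 100),
})

X = toy.drop(columns=["quality"])
y = toy["quality"]  # the target stays in its original units

# Split first; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training features only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

With this setup the test set has no influence on the scaling statistics, and an RMSE computed on `y_test` is directly in quality points.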

# 5. Select a Model and Train it

Since this is a simple example, we will try linear regression first, and if the results are not satisfactory we will try another model.

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)

In [None]:
predictions = lm.predict(X_test)

**Model Evaluation**

In [None]:
from sklearn.metrics import mean_squared_error

# RMSE on the test set, in the same (standardized) units as y_test
lm_mse = mean_squared_error(y_test, predictions)
lm_rmse = np.sqrt(lm_mse)
lm_rmse

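0.79? Hmm. This is not an 8% error: because `quality` was standardized along with the features, the RMSE is measured in standard deviations of quality, so a typical prediction is off by about 0.79 standard deviations. That is okay, but I can do better. Let's try a decision tree.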
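
Since the target was standardized, an RMSE in scaled units can be converted back to quality points by multiplying by the standard deviation the scaler used. A minimal sketch, assuming a made-up `quality` column (the real standard deviation comes from the scaler fitted on `df`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical quality scores; the real values come from df['quality'].
quality = np.array([[5.0], [6.0], [5.0], [7.0], [4.0], [6.0], [5.0], [8.0]])

scaler = StandardScaler().fit(quality)
quality_std = scaler.scale_[0]  # standard deviation used for scaling

rmse_scaled = 0.79  # the RMSE reported above, in standardized units
rmse_in_points = rmse_scaled * quality_std
print(rmse_in_points)

# Baseline: always predicting the mean gives an RMSE of 1.0 in scaled units
# on the data the scaler was fitted on.
scaled = scaler.transform(quality)
baseline_rmse = np.sqrt(np.mean((scaled - scaled.mean()) ** 2))
print(baseline_rmse)
```

In this notebook the scaler was fitted on all columns, so the matching entry of `scaler.scale_` would be the one at the position of the `quality` column.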

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)
tree_mse = mean_squared_error(y_test, predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

1.0?? That's even worse: on the standardized target, it is roughly what always predicting the mean quality would score. Let's stick with linear regression.
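
A single train/test split can be noisy, so a comparison like this is often repeated with cross-validation. A sketch under stated assumptions: the data is synthetic with a linear signal, and `cv_rmse` is a hypothetical helper, so the numbers say nothing about the wine data itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, -0.4, 0.2]) + rng.normal(scale=0.8, size=300)

def cv_rmse(model, X, y, folds=5):
    """Return the RMSE of `model` on each cross-validation fold."""
    scores = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=folds
    )
    return np.sqrt(-scores)  # scores are negated MSEs

lin_scores = cv_rmse(LinearRegression(), X, y)
tree_scores = cv_rmse(DecisionTreeRegressor(random_state=0), X, y)

print("linear:", lin_scores.mean(), "+/-", lin_scores.std())
print("tree:  ", tree_scores.mean(), "+/-", tree_scores.std())
```

An unconstrained decision tree tends to fit the noise in the training folds, which is one likely reason it can score worse than linear regression on held-out data.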

# 6. Fine-tune Your Model

I don't know how to do this yet, so I'll skip this step.
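
One common way to approach this step is a cross-validated grid search over a model's hyperparameters. This is a minimal sketch on synthetic data, not a result for the wine dataset; the parameter grid values are arbitrary assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, -0.4, 0.2]) + rng.normal(scale=0.8, size=300)

# Candidate hyperparameters (arbitrary values for illustration).
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root.
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```

After the search, `search.best_estimator_` holds the model refitted on all the data passed to `fit`, which could then be evaluated once on the test set.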

# 7. Present Your Solution

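In the end, I was able to fit the data to a linear model with a test RMSE of about 0.79 on the standardized quality score. This is an error measure in standard deviations, not an accuracy. Maybe I'll revisit this once I am more experienced.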