
# Topic 7: Linear Models and Regularization

1. Inspect the red wine and white wine data files (they have been posted on Carmen) visually and then load each dataset into a pandas dataframe from their respective CSV files.  Separate the features and the targets into two separate arrays - you should have 4 arrays when you are finished (the 'quality' feature in the final column is the target for both files).

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
red=pd.read_csv('/content/drive/MyDrive/winequality-red.csv')
white=pd.read_csv('/content/drive/MyDrive/winequality-white.csv')

In [None]:
#View dataframe columns
print(red.columns)
print(white.columns)

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')


In [None]:
#View first 5 rows of data
print(red.head())
print(white.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

In [None]:
#View summary statistics for each dataframe
print(red.describe())
print(white.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

In [None]:
#Check for missing values
print(red.isnull().sum())
print(white.isnull().sum())

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


In [None]:
# Separate the feature columns (everything except 'quality') into arrays
featuresred = red.drop(columns=['quality']).to_numpy()
featureswhite = white.drop(columns=['quality']).to_numpy()

# Separate the 'quality' target column into arrays
qualityred = red['quality'].to_numpy()
qualitywhite = white['quality'].to_numpy()

2. Using your result from 1 above, train LinearRegression and LASSO models on each dataset.  Compute the R2 value of these models.

In [None]:
# Train red Linear Regression
linear_regred = LinearRegression()
linear_regred.fit(featuresred, qualityred)
r2_linearred = r2_score(qualityred, linear_regred.predict(featuresred))

# Train white Linear Regression
linear_regwhite = LinearRegression()
linear_regwhite.fit(featureswhite, qualitywhite)
r2_linearwhite = r2_score(qualitywhite, linear_regwhite.predict(featureswhite))

In [None]:

# Train red Lasso Regression
lasso_regred = Lasso()
lasso_regred.fit(featuresred, qualityred)
r2_lassored = r2_score(qualityred, lasso_regred.predict(featuresred))

# Train white Lasso Regression
lasso_regwhite = Lasso()
lasso_regwhite.fit(featureswhite, qualitywhite)
r2_lassowhite = r2_score(qualitywhite, lasso_regwhite.predict(featureswhite))
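As a reminder of what `r2_score` reports in the cells above, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below checks that definition against scikit-learn on small synthetic data. The arrays `X_demo` and `y_demo` and the helper `r2_manual` are hypothetical stand-ins, not part of the wine analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (an assumption for illustration, not the wine CSVs)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

model_demo = LinearRegression().fit(X_demo, y_demo)
pred_demo = model_demo.predict(X_demo)

r2_by_hand = r2_manual(y_demo, pred_demo)
r2_sklearn = r2_score(y_demo, pred_demo)
r2_method = model_demo.score(X_demo, y_demo)  # regressors' .score() also returns R²

print(r2_by_hand, r2_sklearn, r2_method)
```

All three values should agree up to floating-point error, so `model.score(X, y)` can be used as shorthand for `r2_score(y, model.predict(X))`.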

3. Now separate the data into training and testing sets using 20% of the data as testing.  Build new LinearRegression and LASSO models on the training sets and compare your R2 values to the ones you found with the full training set above - is there a change in performance?  

In [None]:
#Split red into train and test sets
redx_train, redx_test, redy_train, redy_test = train_test_split(
    featuresred, qualityred, test_size=0.2, random_state=76
)

#Split white into train and test sets
whitex_train, whitex_test, whitey_train, whitey_test = train_test_split(
    featureswhite, qualitywhite, test_size=0.2, random_state=76
)

In [None]:
# Test red Linear Regression model
linear_regred2 = LinearRegression()
linear_regred2.fit(redx_train, redy_train)
redy_pred_linear = linear_regred2.predict(redx_test)
r2_linearred2 = r2_score(redy_test, redy_pred_linear)

# Test white Linear Regression model
linear_regwhite2 = LinearRegression()
linear_regwhite2.fit(whitex_train, whitey_train)
whitey_pred_linear = linear_regwhite2.predict(whitex_test)
r2_linearwhite2 = r2_score(whitey_test, whitey_pred_linear)

In [None]:
#Test red Lasso Regression model
lasso_regred2 = Lasso()
lasso_regred2.fit(redx_train, redy_train)
redy_pred_lasso = lasso_regred2.predict(redx_test)
r2_lassored2 = r2_score(redy_test, redy_pred_lasso)

#Test white Lasso Regression model
lasso_regwhite2 = Lasso()
lasso_regwhite2.fit(whitex_train, whitey_train)
whitey_pred_lasso = lasso_regwhite2.predict(whitex_test)
r2_lassowhite2 = r2_score(whitey_test, whitey_pred_lasso)

In [None]:
# Compare red Linear Regression R² scores
print(f"R² score for red wine training Linear Regression: {r2_linearred}")
print(f"R² score for red wine testing Linear Regression: {r2_linearred2}\n")

# Compare red Lasso R² scores
print(f"R² score for red wine training Lasso Regression: {r2_lassored}")
print(f"R² score for red wine testing Lasso Regression: {r2_lassored2}\n")


# Compare white Linear Regression R² scores
print(f"R² score for white wine training Linear Regression: {r2_linearwhite}")
print(f"R² score for white wine testing Linear Regression: {r2_linearwhite2}\n")


# Compare white Lasso R² scores
print(f"R² score for white wine training Lasso Regression: {r2_lassowhite}")
print(f"R² score for white wine testing Lasso Regression: {r2_lassowhite2}\n")


R² score for red wine training Linear Regression: 0.3605517030386881
R² score for red wine testing Linear Regression: 0.32217200912615784

R² score for red wine training Lasso Regression: 0.032843336246042854
R² score for red wine testing Lasso Regression: 0.029900596214573638

R² score for white wine training Linear Regression: 0.2818703641332858
R² score for white wine testing Linear Regression: 0.23278254439918633

R² score for white wine training Lasso Regression: 0.0403536156127029
R² score for white wine testing Lasso Regression: 0.026397672719520204



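**For every model, the test R² is slightly lower than the R² from the fit in part 2: linear regression drops from 0.361 to 0.322 (red) and from 0.282 to 0.233 (white), and Lasso drops from 0.033 to 0.030 (red) and from 0.040 to 0.026 (white). Note that the "training" scores printed above come from the part 2 models, which were fit on the full dataset rather than on the 80% training split.**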
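For a like-for-like comparison, both scores can be taken from the same 80/20 split, so that the training R² and the test R² belong to the same fitted model. The sketch below shows one way to do that. The helper `train_test_r2` and the synthetic `X_demo`/`y_demo` arrays are hypothetical; with the wine data one would pass `featuresred, qualityred` or `featureswhite, qualitywhite` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split


def train_test_r2(model, X, y, test_size=0.2, random_state=76):
    """Fit on the training split only; return (train R², test R²)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in data (an assumption for illustration, not the wine CSVs)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 5))
true_coef = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=500)

results = {}
for name, model in [("LinearRegression", LinearRegression()), ("Lasso", Lasso())]:
    results[name] = train_test_r2(model, X_demo, y_demo)
    train_r2, test_r2 = results[name]
    print(f"{name}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

A small gap between the two numbers for a given model suggests little overfitting; a low value on both suggests underfitting.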

4. Compare your R2 values for each model's performance on the test set to the training set performance.  Does your model seem like it generalizes well?

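**The linear regression models generalize reasonably well: the test R² is only about 0.04 to 0.05 below the training R², although the absolute fit is modest (R² of roughly 0.2 to 0.4). The Lasso models are underfitting: R² is close to zero (about 0.03 to 0.04) on both the training and test data, most likely because the default penalty (`alpha=1.0`) applied to unscaled features shrinks most coefficients to zero.**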
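One way to probe the underfitting explanation is to standardize the features and refit Lasso over a few smaller `alpha` values, counting how many coefficients stay nonzero. The sketch below does this on synthetic data whose columns have very different scales, loosely imitating the wine features. The data, the coefficient values, and the `alphas` grid are all assumptions chosen for illustration, and nothing here reproduces the wine results.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): unit-scale latent features Z drive y,
# and the observed features X are Z rescaled to very different magnitudes.
rng = np.random.default_rng(2)
Z = rng.normal(size=(400, 5))
y_demo = 5.6 + Z @ np.array([0.5, -0.3, 0.0, 0.2, 0.0]) + rng.normal(scale=0.5, size=400)
X_demo = Z * np.array([1.7, 0.18, 10.0, 0.002, 1.0]) + np.array([8.0, 0.5, 16.0, 1.0, 10.0])

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=76
)

alphas = [1.0, 0.1, 0.01, 0.001]  # illustrative grid, not tuned
nonzero_counts = {}
test_scores = {}
for alpha in alphas:
    # Scaling inside the pipeline keeps the scaler fit on the training split only
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10000))
    pipe.fit(X_train, y_train)
    coef = pipe.named_steps["lasso"].coef_
    nonzero_counts[alpha] = int(np.count_nonzero(coef))
    test_scores[alpha] = pipe.score(X_test, y_test)
    print(f"alpha={alpha}: nonzero coefficients = {nonzero_counts[alpha]}, "
          f"test R² = {test_scores[alpha]:.3f}")
```

If the same pattern held on the wine data (few or no nonzero coefficients at `alpha=1.0`, and a test R² approaching the linear regression value as `alpha` shrinks), that would support reading the low Lasso scores as a consequence of the penalty strength rather than of the model family.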