# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.

In [31]:
import pandas as pd
import numpy as np

In [32]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [42]:
data = data_w.assign(type = 'white')

data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [43]:
pd.crosstab(data.type, data.quality, margins=True)

quality,3,4,5,6,7,8,9,All
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
red,10,53,681,638,199,18,0,1599
white,20,163,1457,2198,880,175,5,4898
All,30,216,2138,2836,1079,193,5,6497


# SVM

# Exercise 6.2

* Standarized the features (not the quality)
* Create a binary target for each type of wine
* Create two Linear SVM's for the white and red wines, repectively.


In [44]:
data_N2 = data[['type', 'quality']]

In [45]:
import warnings
warnings.filterwarnings('ignore')

var=list(data)
var.remove("quality")
var.remove("type")

for i in range(0,len(var)): 
    data_N2[var[i]+"_1"] = (data[var[i]]-data[var[i]].mean())/(data[var[i]].std(ddof=0)) 
data_N2.head()

Unnamed: 0,type,quality,fixed acidity_1,volatile acidity_1,citric acid_1,residual sugar_1,chlorides_1,free sulfur dioxide_1,total sulfur dioxide_1,density_1,pH_1,sulphates_1,alcohol_1
0,white,6,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558
1,white,6,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615
2,white,6,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521
3,white,6,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219
4,white,6,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219


In [46]:
data_N2.loc[data_N2['quality'] >= 7, 'clasificacion'] = '1'#Bueno
data_N2.loc[data_N2['quality'] < 7, 'clasificacion'] = '0' #Malo
pd.crosstab(data_N2.type, data_N2.clasificacion, margins=True)

clasificacion,0,1,All
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
red,1382,217,1599
white,3838,1060,4898
All,5220,1277,6497


In [47]:
from sklearn.svm import SVC # "Support Vector Classifier"
from sklearn.model_selection import train_test_split

In [48]:
data_NR=data_N2[(data_N2["type"]=='red')]
data_NW=data_N2[(data_N2["type"]=='white')]

data_NR_2=data_NR.drop("type",axis=1)
data_NR_2=data_NR_2.drop("quality",axis=1)

data_NW_2=data_NW.drop("type",axis=1)
data_NW_2=data_NW_2.drop("quality",axis=1)

y_r = data_NR_2["clasificacion"].values
X_r = data_NR_2[['fixed acidity_1', 'volatile acidity_1','citric acid_1','residual sugar_1','chlorides_1','free sulfur dioxide_1','total sulfur dioxide_1','density_1','pH_1','sulphates_1','alcohol_1']].values

In [49]:
validation_size = 0.30
seed = 7
X_train_rw, X_test_rw= train_test_split(X_r,test_size=validation_size, random_state=seed)
y_train_rw, y_test_rw = train_test_split(y_r,test_size=validation_size, random_state=seed)

In [50]:
clf_rw = SVC(kernel='linear')
clf_rw.fit(X_train_rw, y_train_rw)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [51]:
clf_rw.score(X_test_rw,y_test_rw)

0.86875

In [52]:
y_w = data_NW_2["clasificacion"].values
X_w = data_NW_2[['fixed acidity_1', 'volatile acidity_1','citric acid_1','residual sugar_1','chlorides_1','free sulfur dioxide_1','total sulfur dioxide_1','density_1','pH_1','sulphates_1','alcohol_1']].values

In [53]:
validation_size = 0.30
seed = 8
X_train_ww, X_test_ww= train_test_split(X_w,test_size=validation_size, random_state=seed)
y_train_ww, y_test_ww = train_test_split(y_w,test_size=validation_size, random_state=seed)

In [54]:
clf_ww = SVC(kernel='linear')
clf_ww.fit(X_train_ww, y_train_ww)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [55]:
clf_ww.decision_function(X_test_ww)

array([-1.00045218, -0.99981722, -1.00032966, ..., -1.00014484,
       -0.99967801, -0.99993767])

In [56]:
clf_ww.predict(X_test_ww)

array(['0', '0', '0', ..., '0', '0', '0'], dtype=object)

In [57]:
clf_ww.score(X_test_ww,y_test_ww)

0.7891156462585034

In [58]:
d = {'Model': ['red wine', 'white wine'],'Accuracy': [clf_rw.score(X_test_rw,y_test_rw),clf_ww.score(X_test_ww,y_test_ww)],}
res= pd.DataFrame(data=d)
res

Unnamed: 0,Model,Accuracy
0,red wine,0.86875
1,white wine,0.789116


# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [59]:
clf_pol_r = SVC(kernel='poly')
clf_rbf_r= SVC(kernel='rbf')
clf_sig_r= SVC(kernel='sigmoid')

clf_pol_w= SVC(kernel='poly')
clf_rbf_w= SVC(kernel='rbf')
clf_sig_w= SVC(kernel='sigmoid')

In [60]:
clf_pol_r.fit(X_train_rw, y_train_rw)
clf_rbf_r.fit(X_train_rw, y_train_rw)
clf_sig_r.fit(X_train_rw, y_train_rw)

clf_pol_w.fit(X_train_ww, y_train_ww)
clf_rbf_w.fit(X_train_ww, y_train_ww)
clf_sig_w.fit(X_train_ww, y_train_ww)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [61]:
sup = {'Model': ['red wine', 'red wine','red wine','red wine','white wine', 'white wine','white wine','white wine'],'Kernel': ['linear','poly', 'rbf','sigmoid','linear','poly', 'rbf','sigmoid'],'Accuracy': [clf_rw.score(X_test_rw,y_test_rw),clf_pol_r.score(X_test_rw, y_test_rw),clf_rbf_r.score(X_test_rw, y_test_rw),clf_sig_r.score(X_test_rw, y_test_rw),clf_ww.score(X_test_ww,y_test_ww),clf_pol_w.score(X_test_ww, y_test_ww),clf_rbf_w.score(X_test_ww, y_test_ww),clf_sig_w.score(X_test_ww, y_test_ww)],}
res3= pd.DataFrame(data=sup)
pd.pivot_table(res3,index=["Model"],values=["Accuracy"],columns=["Kernel"],aggfunc=[np.sum])

Unnamed: 0_level_0,sum,sum,sum,sum
Unnamed: 0_level_1,Accuracy,Accuracy,Accuracy,Accuracy
Kernel,linear,poly,rbf,sigmoid
Model,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3
red wine,0.86875,0.883333,0.8875,0.783333
white wine,0.789116,0.803401,0.819048,0.713605


# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [62]:
#Para vinos rojos rbf, para vinos blancos rbf
Cv=[0.1,1,10,100,1000]
gammav=[0.01,0.001,0.0001]
res4r = pd.DataFrame(0, index=Cv, columns=gammav)
res4w = pd.DataFrame(0, index=Cv, columns=gammav)

In [63]:
for i in range(0,len(Cv)): 
    for j in range(0,len(gammav)):
        clf_rbf_rw1 = SVC(kernel='rbf',C=Cv[i],gamma=gammav[j]).fit(X_train_rw, y_train_rw)
        res4r.iloc[i,j]=clf_rbf_rw1.score(X_test_rw, y_test_rw)

In [64]:
print("Accuracy Vinos Rojos SVM RBF")
res4r

Accuracy Vinos Rojos SVM RBF


Unnamed: 0,0.01,0.001,0.0001
0.1,0.86875,0.86875,0.86875
1.0,0.86875,0.86875,0.86875
10.0,0.883333,0.86875,0.86875
100.0,0.875,0.86875,0.86875
1000.0,0.864583,0.88125,0.86875


In [65]:
for i in range(0,len(Cv)): 
    for j in range(0,len(gammav)):
        clf_rbf_ww1 = SVC(kernel='rbf',C=Cv[i],gamma=gammav[j]).fit(X_train_ww, y_train_ww)
        res4w.iloc[i,j]=clf_rbf_ww1.score(X_test_ww, y_test_ww)

In [66]:
print("Accuracy Vinos Blancos SVM RBF")
res4w

Accuracy Vinos Blancos SVM RBF


Unnamed: 0,0.01,0.001,0.0001
0.1,0.789116,0.789116,0.789116
1.0,0.791156,0.789116,0.789116
10.0,0.816327,0.789116,0.789116
100.0,0.821769,0.796599,0.789116
1000.0,0.82449,0.814966,0.789116


# Exercise 6.5

Compare the results with other methods

In [67]:
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [68]:
lgm=linear_model.LogisticRegression()
clflrr=lgm.fit(X_train_rw, y_train_rw)
clflrw=lgm.fit(X_train_ww, y_train_ww)

In [69]:
x3=data_NR_2.drop("clasificacion",axis=1)
a=clflrr.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit")
a

Coeficientes Vinos Rojos Logit


Unnamed: 0,Coeficientes estimados,Variables
0,0.617697,fixed acidity_1
1,-0.622389,volatile acidity_1
2,-0.156196,citric acid_1
3,1.146875,residual sugar_1
4,-0.373778,chlorides_1
5,0.17337,free sulfur dioxide_1
6,0.004923,total sulfur dioxide_1
7,-1.623533,density_1
8,0.534017,pH_1
9,0.283308,sulphates_1


In [70]:
a=clflrw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit")
a

Coeficientes Vinos Blancos Logit


Unnamed: 0,Coeficientes estimados,Variables
0,0.617697,fixed acidity_1
1,-0.622389,volatile acidity_1
2,-0.156196,citric acid_1
3,1.146875,residual sugar_1
4,-0.373778,chlorides_1
5,0.17337,free sulfur dioxide_1
6,0.004923,total sulfur dioxide_1
7,-1.623533,density_1
8,0.534017,pH_1
9,0.283308,sulphates_1


In [71]:
print("Accuracy Vinos Rojos Logit")
clflrr.score(X_test_rw, y_test_rw)

Accuracy Vinos Rojos Logit


0.8666666666666667

In [72]:
print("Accuracy Vinos Blancos Logit")
clflrw.score(X_test_ww, y_test_ww)

Accuracy Vinos Blancos Logit


0.8142857142857143

In [73]:
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report


# Exercise 6.9
Evaluate the f1score

In [74]:
y_pred_rw=clflrr.predict(X_test_rw)
y_pred_ww=clflrw.predict(X_test_ww)

In [75]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.46428571428571425
f1 Avg Micro RW: 0.8666666666666667
f1 Avg Weighted RW: 0.8066964285714284
f1 Avg None RW: [0.92857143 0.        ]


In [76]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None)) 

f1 Avg Macro WW: 0.6450704225352113
f1 Avg Micro WW: 0.8142857142857143
f1 Avg Weighted WW: 0.7867778097154354
f1 Avg None WW: [0.89014085 0.4       ]


# Exercise 6.10
- Estimate a regularized logistic regression using:
- C = 0.01, 0.1 & 1.0
- penalty = ['l1, 'l2']
- Compare the coefficients and the f1score

# C = 0.01 penalty = ['l1']

In [77]:
lg_rw = LogisticRegression(C=0.01, penalty='l1')
lg_rw.fit(X_train_rw, y_train_rw)

lg_ww = LogisticRegression(C=0.01, penalty='l1')
lg_ww.fit(X_train_ww, y_train_ww)

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [78]:
x3=data_NR_2.drop("clasificacion",axis=1)
a=lg_rw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit C=0,01 P=l1")
a

Coeficientes Vinos Rojos Logit C=0,01 P=l1


Unnamed: 0,Coeficientes estimados,Variables
0,0.0,fixed acidity_1
1,-0.618091,volatile acidity_1
2,0.0,citric acid_1
3,0.0,residual sugar_1
4,-0.103221,chlorides_1
5,0.0,free sulfur dioxide_1
6,0.010022,total sulfur dioxide_1
7,0.0,density_1
8,0.0,pH_1
9,0.0,sulphates_1


In [79]:
b=lg_ww.coef_
b=pd.DataFrame(b.reshape(-1,1))
b.rename(columns={0:"Coeficientes estimados"}, inplace=True)
b["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit C=0,01 P=l1")
a

Coeficientes Vinos Blancos Logit C=0,01 P=l1


Unnamed: 0,Coeficientes estimados,Variables
0,0.0,fixed acidity_1
1,-0.618091,volatile acidity_1
2,0.0,citric acid_1
3,0.0,residual sugar_1
4,-0.103221,chlorides_1
5,0.0,free sulfur dioxide_1
6,0.010022,total sulfur dioxide_1
7,0.0,density_1
8,0.0,pH_1
9,0.0,sulphates_1


In [80]:
y_pred_rw=lg_rw.predict(X_test_rw)
y_pred_ww=lg_ww.predict(X_test_ww)

In [81]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.48948948948948956
f1 Avg Micro RW: 0.8583333333333333
f1 Avg Weighted RW: 0.8095157657657659
f1 Avg None RW: [0.92342342 0.05555556]


In [82]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None))

f1 Avg Macro WW: 0.5667426695430413
f1 Avg Micro WW: 0.8040816326530612
f1 Avg Weighted WW: 0.7521637344727445
f1 Avg None WW: [0.88741204 0.2460733 ]


# C = 0.1 penalty = ['l1']


In [83]:
lg_rw = LogisticRegression(C=0.1, penalty='l1')
lg_rw.fit(X_train_rw, y_train_rw)

lg_ww = LogisticRegression(C=0.1, penalty='l1')
lg_ww.fit(X_train_ww, y_train_ww)

x3=data_NR_2.drop("clasificacion",axis=1)
a=lg_rw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit C=0,1 P=l1")
a

Coeficientes Vinos Rojos Logit C=0,1 P=l1


Unnamed: 0,Coeficientes estimados,Variables
0,0.028533,fixed acidity_1
1,-0.583468,volatile acidity_1
2,0.0,citric acid_1
3,0.0,residual sugar_1
4,-0.173477,chlorides_1
5,0.0,free sulfur dioxide_1
6,0.0,total sulfur dioxide_1
7,0.0,density_1
8,-0.211027,pH_1
9,0.14962,sulphates_1


In [84]:
b=lg_ww.coef_
b=pd.DataFrame(b.reshape(-1,1))
b.rename(columns={0:"Coeficientes estimados"}, inplace=True)
b["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit C=0,1 P=l1")
a

Coeficientes Vinos Blancos Logit C=0,1 P=l1


Unnamed: 0,Coeficientes estimados,Variables
0,0.028533,fixed acidity_1
1,-0.583468,volatile acidity_1
2,0.0,citric acid_1
3,0.0,residual sugar_1
4,-0.173477,chlorides_1
5,0.0,free sulfur dioxide_1
6,0.0,total sulfur dioxide_1
7,0.0,density_1
8,-0.211027,pH_1
9,0.14962,sulphates_1


In [85]:
y_pred_rw=lg_rw.predict(X_test_rw)
y_pred_ww=lg_ww.predict(X_test_ww)

In [86]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.6344159992137979
f1 Avg Micro RW: 0.8708333333333333
f1 Avg Weighted RW: 0.8512339688467397
f1 Avg None RW: [0.92840647 0.34042553]


In [87]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None))

f1 Avg Macro WW: 0.6344878904930056
f1 Avg Micro WW: 0.8142857142857143
f1 Avg Weighted WW: 0.7827208199050352
f1 Avg None WW: [0.89084366 0.37813212]


# C = 1 penalty = ['l1']


In [89]:
lg_rw = LogisticRegression(C=1, penalty='l1')
lg_rw.fit(X_train_rw, y_train_rw)

lg_ww = LogisticRegression(C=1, penalty='l1')
lg_ww.fit(X_train_ww, y_train_ww)

x3=data_NR_2.drop("clasificacion",axis=1)
a=lg_rw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit C=0,1 P=l1")
a

Coeficientes Vinos Rojos Logit C=0,1 P=l1


Unnamed: 0,Coeficientes estimados,Variables
0,0.412831,fixed acidity_1
1,-0.437873,volatile acidity_1
2,0.00541,citric acid_1
3,1.011673,residual sugar_1
4,-0.254534,chlorides_1
5,0.0,free sulfur dioxide_1
6,-0.536883,total sulfur dioxide_1
7,-0.804365,density_1
8,0.0,pH_1
9,0.417887,sulphates_1


In [90]:
b=lg_ww.coef_
b=pd.DataFrame(b.reshape(-1,1))
b.rename(columns={0:"Coeficientes estimados"}, inplace=True)
b["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit C=1 P=l1")
a

Coeficientes Vinos Blancos Logit C=1 P=l1


Unnamed: 0,Coeficientes estimados,Variables
0,0.412831,fixed acidity_1
1,-0.437873,volatile acidity_1
2,0.00541,citric acid_1
3,1.011673,residual sugar_1
4,-0.254534,chlorides_1
5,0.0,free sulfur dioxide_1
6,-0.536883,total sulfur dioxide_1
7,-0.804365,density_1
8,0.0,pH_1
9,0.417887,sulphates_1


In [91]:
y_pred_rw=lg_rw.predict(X_test_rw)
y_pred_ww=lg_ww.predict(X_test_ww)

In [92]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.703062171357872
f1 Avg Micro RW: 0.8833333333333333
f1 Avg Weighted RW: 0.8736931642437366
f1 Avg None RW: [0.93442623 0.47169811]


In [93]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None))

f1 Avg Macro WW: 0.6463996889787158
f1 Avg Micro WW: 0.8156462585034013
f1 Avg Weighted WW: 0.7878545405971844
f1 Avg None WW: [0.89103337 0.401766  ]


# C = 0.01 penalty = ['l2']


In [94]:
lg_rw = LogisticRegression(C=0.01, penalty='l2')
lg_rw.fit(X_train_rw, y_train_rw)

lg_ww = LogisticRegression(C=0.01, penalty='l2')
lg_ww.fit(X_train_ww, y_train_ww)

x3=data_NR_2.drop("clasificacion",axis=1)
a=lg_rw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit C=0,01 P=l2")
a

Coeficientes Vinos Rojos Logit C=0,01 P=l2


Unnamed: 0,Coeficientes estimados,Variables
0,-0.023772,fixed acidity_1
1,-0.422351,volatile acidity_1
2,0.181595,citric acid_1
3,0.23249,residual sugar_1
4,-0.277389,chlorides_1
5,0.081533,free sulfur dioxide_1
6,0.142287,total sulfur dioxide_1
7,-0.308454,density_1
8,-0.17965,pH_1
9,0.049787,sulphates_1


In [95]:
b=lg_ww.coef_
b=pd.DataFrame(b.reshape(-1,1))
b.rename(columns={0:"Coeficientes estimados"}, inplace=True)
b["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit C=0,01 P=l2")
a

Coeficientes Vinos Blancos Logit C=0,01 P=l2


Unnamed: 0,Coeficientes estimados,Variables
0,-0.023772,fixed acidity_1
1,-0.422351,volatile acidity_1
2,0.181595,citric acid_1
3,0.23249,residual sugar_1
4,-0.277389,chlorides_1
5,0.081533,free sulfur dioxide_1
6,0.142287,total sulfur dioxide_1
7,-0.308454,density_1
8,-0.17965,pH_1
9,0.049787,sulphates_1


In [96]:
y_pred_rw=lg_rw.predict(X_test_rw)
y_pred_ww=lg_ww.predict(X_test_ww)

In [97]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.5934117647058823
f1 Avg Micro RW: 0.86875
f1 Avg Weighted RW: 0.8401705882352941
f1 Avg None RW: [0.928      0.25882353]


In [98]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None))

f1 Avg Macro WW: 0.6336722488038278
f1 Avg Micro WW: 0.8163265306122449
f1 Avg Weighted WW: 0.7832446375679458
f1 Avg None WW: [0.8923445 0.375    ]


# C = 0.1 penalty = ['l2']


In [99]:
lg_rw = LogisticRegression(C=0.1, penalty='l2')
lg_rw.fit(X_train_rw, y_train_rw)

lg_ww = LogisticRegression(C=0.1, penalty='l2')
lg_ww.fit(X_train_ww, y_train_ww)

x3=data_NR_2.drop("clasificacion",axis=1)
a=lg_rw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit C=0,1 P=l2")
a

Coeficientes Vinos Rojos Logit C=0,1 P=l2


Unnamed: 0,Coeficientes estimados,Variables
0,0.197251,fixed acidity_1
1,-0.510441,volatile acidity_1
2,0.059071,citric acid_1
3,0.644162,residual sugar_1
4,-0.31677,chlorides_1
5,0.004898,free sulfur dioxide_1
6,-0.115201,total sulfur dioxide_1
7,-0.542953,density_1
8,-0.176215,pH_1
9,0.246406,sulphates_1


In [100]:
b=lg_ww.coef_
b=pd.DataFrame(b.reshape(-1,1))
b.rename(columns={0:"Coeficientes estimados"}, inplace=True)
b["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit C=0,1 P=l2")
a

Coeficientes Vinos Blancos Logit C=0,1 P=l2


Unnamed: 0,Coeficientes estimados,Variables
0,0.197251,fixed acidity_1
1,-0.510441,volatile acidity_1
2,0.059071,citric acid_1
3,0.644162,residual sugar_1
4,-0.31677,chlorides_1
5,0.004898,free sulfur dioxide_1
6,-0.115201,total sulfur dioxide_1
7,-0.542953,density_1
8,-0.176215,pH_1
9,0.246406,sulphates_1


In [101]:
y_pred_rw=lg_rw.predict(X_test_rw)
y_pred_ww=lg_ww.predict(X_test_ww)

In [102]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.6735785336391613
f1 Avg Micro RW: 0.8770833333333333
f1 Avg Weighted RW: 0.8636595338812111
f1 Avg None RW: [0.93131548 0.41584158]


In [103]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None))

f1 Avg Macro WW: 0.6445123007623008
f1 Avg Micro WW: 0.817687074829932
f1 Avg Weighted WW: 0.7879806808378238
f1 Avg None WW: [0.89262821 0.3963964 ]


# C = 1 penalty = ['l2']


In [104]:
lg_rw = LogisticRegression(C=0.1, penalty='l2')
lg_rw.fit(X_train_rw, y_train_rw)

lg_ww = LogisticRegression(C=0.1, penalty='l2')
lg_ww.fit(X_train_ww, y_train_ww)

x3=data_NR_2.drop("clasificacion",axis=1)
a=lg_rw.coef_
a=pd.DataFrame(a.reshape(-1,1))
a.rename(columns={0:"Coeficientes estimados"}, inplace=True)
a["Variables"]=x3.columns
print("Coeficientes Vinos Rojos Logit C=1 P=l2")
a

Coeficientes Vinos Rojos Logit C=1 P=l2


Unnamed: 0,Coeficientes estimados,Variables
0,0.197251,fixed acidity_1
1,-0.510441,volatile acidity_1
2,0.059071,citric acid_1
3,0.644162,residual sugar_1
4,-0.31677,chlorides_1
5,0.004898,free sulfur dioxide_1
6,-0.115201,total sulfur dioxide_1
7,-0.542953,density_1
8,-0.176215,pH_1
9,0.246406,sulphates_1


In [105]:
b=lg_ww.coef_
b=pd.DataFrame(b.reshape(-1,1))
b.rename(columns={0:"Coeficientes estimados"}, inplace=True)
b["Variables"]=x3.columns
print("Coeficientes Vinos Blancos Logit C=1 P=l2")
a

Coeficientes Vinos Blancos Logit C=1 P=l2


Unnamed: 0,Coeficientes estimados,Variables
0,0.197251,fixed acidity_1
1,-0.510441,volatile acidity_1
2,0.059071,citric acid_1
3,0.644162,residual sugar_1
4,-0.31677,chlorides_1
5,0.004898,free sulfur dioxide_1
6,-0.115201,total sulfur dioxide_1
7,-0.542953,density_1
8,-0.176215,pH_1
9,0.246406,sulphates_1


In [106]:
y_pred_rw=lg_rw.predict(X_test_rw)
y_pred_ww=lg_ww.predict(X_test_ww)

In [107]:
print("f1 Avg Macro RW:",f1_score(y_test_rw, y_pred_rw, average='macro'))
print("f1 Avg Micro RW:",f1_score(y_test_rw, y_pred_rw, average='micro'))
print("f1 Avg Weighted RW:",f1_score(y_test_rw, y_pred_rw, average='weighted'))
print("f1 Avg None RW:",f1_score(y_test_rw, y_pred_rw, average=None))

f1 Avg Macro RW: 0.6735785336391613
f1 Avg Micro RW: 0.8770833333333333
f1 Avg Weighted RW: 0.8636595338812111
f1 Avg None RW: [0.93131548 0.41584158]


In [108]:
print("f1 Avg Macro WW:",f1_score(y_test_ww, y_pred_ww, average='macro'))
print("f1 Avg Micro WW:",f1_score(y_test_ww, y_pred_ww, average='micro'))
print("f1 Avg Weighted WW:",f1_score(y_test_ww, y_pred_ww, average='weighted'))
print("f1 Avg None WW:",f1_score(y_test_ww, y_pred_ww, average=None))

f1 Avg Macro WW: 0.6445123007623008
f1 Avg Micro WW: 0.817687074829932
f1 Avg Weighted WW: 0.7879806808378238
f1 Avg None WW: [0.89262821 0.3963964 ]


CONCLUSIÓN: Para los dos tipos de vino la incidencia de las penalidades es mucho mas evidente que la incidencia de C, con el mismo C y con penalidad l1 quedaban 4 variables (para ambos moelos) mientras que con penalidad l2 dejo todas las variables. Es interesante ver que el C afecta de cierta medida la cantidad de variables a dejar, es primordial revisar que tipo de penalidad es la adecuada para el modelo.

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [109]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import SGDRegressor
from sklearn import metrics
from sklearn.linear_model import LinearRegression

In [110]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [111]:
data = data_w.assign(type = 'white')
data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white


In [112]:
data_N6 = data[['type']]

In [113]:
var=list(data)
var.remove("type")
for i in range(0,len(var)): 
    data_N6[var[i]+"_1"] = (data[var[i]]-data[var[i]].mean())/(data[var[i]].std(ddof=0)) 
data_N6.head()

Unnamed: 0,type,fixed acidity_1,volatile acidity_1,citric acid_1,residual sugar_1,chlorides_1,free sulfur dioxide_1,total sulfur dioxide_1,density_1,pH_1,sulphates_1,alcohol_1,quality_1
0,white,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,0.207999
1,white,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,0.207999
2,white,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,0.207999
3,white,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,0.207999
4,white,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,0.207999


In [114]:
data_NR6=data_N6[(data_N2["type"]=='red')]
data_NW6=data_N6[(data_N2["type"]=='white')]

data_NR6_2=data_NR6.drop("type",axis=1)
data_NW6_2=data_NW6.drop("type",axis=1)

y_r = data_NR6_2["quality_1"].values
X_r = data_NR6_2[['fixed acidity_1', 'volatile acidity_1','citric acid_1','residual sugar_1','chlorides_1','free sulfur dioxide_1','total sulfur dioxide_1','density_1','pH_1','sulphates_1','alcohol_1']].values

y_w = data_NW6_2["quality_1"].values
X_w = data_NW6_2[['fixed acidity_1', 'volatile acidity_1','citric acid_1','residual sugar_1','chlorides_1','free sulfur dioxide_1','total sulfur dioxide_1','density_1','pH_1','sulphates_1','alcohol_1']].values

In [115]:
validation_size = 0.30
seed = 7
X_train_rw, X_test_rw= train_test_split(X_r,test_size=validation_size, random_state=seed)
y_train_rw, y_test_rw = train_test_split(y_r,test_size=validation_size, random_state=seed)

In [116]:
validation_size = 0.30
seed = 8
X_train_ww, X_test_ww= train_test_split(X_w,test_size=validation_size, random_state=seed)
y_train_ww, y_test_ww = train_test_split(y_w,test_size=validation_size, random_state=seed)

In [117]:
lm_rw=LinearRegression()
lm_ww=LinearRegression()

In [118]:
lm_rw.fit(X_train_rw, y_train_rw)
lm_ww.fit(X_train_ww, y_train_ww)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [119]:
x3=data_NR6_2.drop("quality_1",axis=1)

In [120]:
result_rw=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lm_rw.coef_})
result_rw.loc[len(result_rw)]=["Intercepto",lm_rw.intercept_]
print("Coeficientes Estimados Vinos Rojos")
result_rw

Coeficientes Estimados Vinos Rojos


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.071316
1,volatile acidity_1,-0.225628
2,citric acid_1,-0.064091
3,residual sugar_1,0.104086
4,chlorides_1,-0.056724
5,free sulfur dioxide_1,0.077883
6,total sulfur dioxide_1,-0.199427
7,density_1,-0.099834
8,pH_1,-0.090825
9,sulphates_1,0.129358


Para los vinos rojos las variables que mas impactan en la calidad del vino son el alcohol y la acidez

In [121]:
result_ww=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lm_ww.coef_})
result_ww.loc[len(result_ww)]=["Intercepto",lm_ww.intercept_]
print("Coeficientes Estimados Vinos Blancos")
result_ww

Coeficientes Estimados Vinos Blancos


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.162301
1,volatile acidity_1,-0.353241
2,citric acid_1,0.002612
3,residual sugar_1,0.528768
4,chlorides_1,-0.005499
5,free sulfur dioxide_1,0.079631
6,total sulfur dioxide_1,0.011639
7,density_1,-0.714809
8,pH_1,0.169781
9,sulphates_1,0.119537


Para los vinos blancos las variables que mas impactan en la calidad del vino son: el azucar residual y la densidad. Esto significa que segun el tipo de vino las propiedades mas importantes son difierentes.

In [122]:
y_pred_rw=lm_rw.predict(X_test_rw)
y_pred_ww=lm_ww.predict(X_test_ww)

In [123]:
print("RMSE RW:",np.sqrt(metrics.mean_squared_error(y_test_rw, y_pred_rw)))
print("RMSE WW:",np.sqrt(metrics.mean_squared_error(y_test_ww, y_pred_ww)))

RMSE RW: 0.7493721842860801
RMSE WW: 0.8602813759403796


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [124]:
from sklearn.linear_model import Ridge
ridgereg_rw = Ridge(alpha=0.1, normalize=True)
ridgereg_ww = Ridge(alpha=0.1, normalize=True)

ridgereg_rw.fit(X_train_rw, y_train_rw)
ridgereg_ww.fit(X_train_ww, y_train_ww)

y_pred_rw= ridgereg_rw.predict(X_test_rw)
y_pred_ww= ridgereg_ww.predict(X_test_ww)

print("RMSE RW:",np.sqrt(metrics.mean_squared_error(y_test_rw, y_pred_rw)))
print("RMSE WW:",np.sqrt(metrics.mean_squared_error(y_test_ww, y_pred_ww)))

RMSE RW: 0.7452638987255246
RMSE WW: 0.8599941688694003


In [125]:
result_rw=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":ridgereg_rw.coef_})
result_rw.loc[len(result_rw)]=["Intercepto",ridgereg_rw.intercept_]
print("Coeficientes Estimados Vinos Rojos alpha=0.1")
result_rw

Coeficientes Estimados Vinos Rojos alpha=0.1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.062419
1,volatile acidity_1,-0.194754
2,citric acid_1,-0.021862
3,residual sugar_1,0.107864
4,chlorides_1,-0.058983
5,free sulfur dioxide_1,0.060086
6,total sulfur dioxide_1,-0.187254
7,density_1,-0.130921
8,pH_1,-0.065295
9,sulphates_1,0.123224


In [126]:
result_rw=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":ridgereg_ww.coef_})
result_rw.loc[len(result_rw)]=["Intercepto",ridgereg_ww.intercept_]
print("Coeficientes Estimados Vinos Blancos alpha = 0.1")
result_rw

Coeficientes Estimados Vinos Blancos alpha = 0.1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,-0.026207
1,volatile acidity_1,-0.317153
2,citric acid_1,0.004017
3,residual sugar_1,0.181052
4,chlorides_1,-0.06057
5,free sulfur dioxide_1,0.093984
6,total sulfur dioxide_1,-0.037827
7,density_1,-0.182394
8,pH_1,0.066188
9,sulphates_1,0.075046


Para los vinos rojos tanto la regresion lineal como la ridge seleccionan las mismas variables dandole mayores coeficientes, pero la magnitud es diferente, se hace evidente que la regresion ridge disminuye el impacto de la acidez volatil y el alcohol en la calidad del vino.

In [127]:
ridgereg_rw = Ridge(alpha=1, normalize=True)
ridgereg_ww = Ridge(alpha=1, normalize=True)

ridgereg_rw.fit(X_train_rw, y_train_rw)
ridgereg_ww.fit(X_train_ww, y_train_ww)

y_pred_rw= ridgereg_rw.predict(X_test_rw)
y_pred_ww= ridgereg_ww.predict(X_test_ww)

print("RMSE RW:",np.sqrt(metrics.mean_squared_error(y_test_rw, y_pred_rw)))
print("RMSE WW:",np.sqrt(metrics.mean_squared_error(y_test_ww, y_pred_ww)))

RMSE RW: 0.7671369844044262
RMSE WW: 0.902661279376848


In [128]:
result_ww=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":ridgereg_rw.coef_})
result_ww.loc[len(result_ww)]=["Intercepto",ridgereg_rw.intercept_]
print("Coeficientes Estimados Vinos Rojos alpha=1")
result_ww

Coeficientes Estimados Vinos Rojos alpha=1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.029422
1,volatile acidity_1,-0.119696
2,citric acid_1,0.031051
3,residual sugar_1,0.045933
4,chlorides_1,-0.039814
5,free sulfur dioxide_1,0.004615
6,total sulfur dioxide_1,-0.107776
7,density_1,-0.108619
8,pH_1,-0.022074
9,sulphates_1,0.074835


In [129]:
result_ww=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":ridgereg_ww.coef_})
result_ww.loc[len(result_ww)]=["Intercepto",ridgereg_ww.intercept_]
print("Coeficientes Estimados Vinos Blancos alpha=1")
result_ww

Coeficientes Estimados Vinos Blancos alpha=1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,-0.0476
1,volatile acidity_1,-0.161647
2,citric acid_1,0.006394
3,residual sugar_1,0.030357
4,chlorides_1,-0.091403
5,free sulfur dioxide_1,0.048169
6,total sulfur dioxide_1,-0.044526
7,density_1,-0.09066
8,pH_1,0.034547
9,sulphates_1,0.038497


A diferencia de los vinos rojos, los vinos blancos tienen un comportamiento diferente pues en la regresion lineal los coeficientes con mayor magnitud estan en la acidez volatil y el alcohol, en cambio en la regresion lineal las variables con mas peso son el alcohol, la acidez volatil, la densidad y el dioxido de sulfuro. 

El variar el alpha hizo que la magnitud de los coeficientes variara, pero aun asi las variables mas importantes siguen siendo las mismas.

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [130]:
from sklearn.linear_model import Lasso
lassoreg_rw = Lasso(alpha=0.01, normalize=True)
lassoreg_ww = Lasso(alpha=0.01, normalize=True)

lassoreg_rw.fit(X_train_rw, y_train_rw)
lassoreg_ww.fit(X_train_ww, y_train_ww)

y_pred_rw= lassoreg_rw.predict(X_test_rw)
y_pred_ww= lassoreg_ww.predict(X_test_ww)

print("RMSE RW:",np.sqrt(metrics.mean_squared_error(y_test_rw, y_pred_rw)))
print("RMSE WW:",np.sqrt(metrics.mean_squared_error(y_test_ww, y_pred_ww)))


RMSE RW: 0.8666842182556884
RMSE WW: 1.009511790345014


In [131]:
result_rw=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lassoreg_rw.coef_})
result_rw.loc[len(result_ww)]=["Intercepto",lassoreg_rw.intercept_]
print("Coeficientes Estimados Vinos Rojos alpha=0.01")
result_rw

Coeficientes Estimados Vinos Rojos alpha=0.01


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.0
1,volatile acidity_1,-0.002895
2,citric acid_1,0.0
3,residual sugar_1,0.0
4,chlorides_1,-0.0
5,free sulfur dioxide_1,-0.0
6,total sulfur dioxide_1,-0.0
7,density_1,-0.0
8,pH_1,-0.0
9,sulphates_1,0.0


In [132]:
result_ww=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lassoreg_ww.coef_})
result_ww.loc[len(result_ww)]=["Intercepto",lassoreg_ww.intercept_]
print("Coeficientes Estimados Vinos Blancos alpha=0.01")
result_ww

Coeficientes Estimados Vinos Blancos alpha=0.01


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,-0.0
1,volatile acidity_1,-0.0
2,citric acid_1,-0.0
3,residual sugar_1,-0.0
4,chlorides_1,-0.0
5,free sulfur dioxide_1,0.0
6,total sulfur dioxide_1,-0.0
7,density_1,-0.0
8,pH_1,0.0
9,sulphates_1,0.0


In [133]:
lassoreg_rw = Lasso(alpha=0.1, normalize=True)
lassoreg_ww = Lasso(alpha=0.1, normalize=True)

lassoreg_rw.fit(X_train_rw, y_train_rw)
lassoreg_ww.fit(X_train_ww, y_train_ww)

y_pred_rw= lassoreg_rw.predict(X_test_rw)
y_pred_ww= lassoreg_ww.predict(X_test_ww)

print("RMSE RW:",np.sqrt(metrics.mean_squared_error(y_test_rw, y_pred_rw)))
print("RMSE WW:",np.sqrt(metrics.mean_squared_error(y_test_ww, y_pred_ww)))

RMSE RW: 0.9134821237216612
RMSE WW: 1.009511790345014


In [134]:
result_rw=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lassoreg_rw.coef_})
result_rw.loc[len(result_ww)]=["Intercepto",lassoreg_rw.intercept_]
print("Coeficientes Estimados Vinos Rojos alpha=0.1")
result_rw

Coeficientes Estimados Vinos Rojos alpha=0.1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.0
1,volatile acidity_1,-0.0
2,citric acid_1,0.0
3,residual sugar_1,0.0
4,chlorides_1,-0.0
5,free sulfur dioxide_1,-0.0
6,total sulfur dioxide_1,-0.0
7,density_1,-0.0
8,pH_1,-0.0
9,sulphates_1,0.0


In [135]:
result_ww=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lassoreg_ww.coef_})
result_ww.loc[len(result_ww)]=["Intercepto",lassoreg_ww.intercept_]
print("Coeficientes Estimados Vinos Blancos alpha=0.1")
result_ww

Coeficientes Estimados Vinos Blancos alpha=0.1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,-0.0
1,volatile acidity_1,-0.0
2,citric acid_1,-0.0
3,residual sugar_1,-0.0
4,chlorides_1,-0.0
5,free sulfur dioxide_1,0.0
6,total sulfur dioxide_1,-0.0
7,density_1,-0.0
8,pH_1,0.0
9,sulphates_1,0.0


In [136]:
from sklearn.linear_model import Lasso
lassoreg_rw = Lasso(alpha=1, normalize=True)
lassoreg_ww = Lasso(alpha=1, normalize=True)

lassoreg_rw.fit(X_train_rw, y_train_rw)
lassoreg_ww.fit(X_train_ww, y_train_ww)

y_pred_rw= lassoreg_rw.predict(X_test_rw)
y_pred_ww= lassoreg_ww.predict(X_test_ww)

print("RMSE RW:",np.sqrt(metrics.mean_squared_error(y_test_rw, y_pred_rw)))
print("RMSE WW:",np.sqrt(metrics.mean_squared_error(y_test_ww, y_pred_ww)))

RMSE RW: 0.9134821237216612
RMSE WW: 1.009511790345014


In [137]:
result_rw=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lassoreg_rw.coef_})
result_rw.loc[len(result_ww)]=["Intercepto",lassoreg_rw.intercept_]
print("Coeficientes Estimados Vinos Rojos alpha=1")
result_rw

Coeficientes Estimados Vinos Rojos alpha=1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,0.0
1,volatile acidity_1,-0.0
2,citric acid_1,0.0
3,residual sugar_1,0.0
4,chlorides_1,-0.0
5,free sulfur dioxide_1,-0.0
6,total sulfur dioxide_1,-0.0
7,density_1,-0.0
8,pH_1,-0.0
9,sulphates_1,0.0


In [138]:
result_ww=pd.DataFrame({"Variables":x3.columns,"Coeficientes_Estimados":lassoreg_ww.coef_})
result_ww.loc[len(result_ww)]=["Intercepto",lassoreg_ww.intercept_]
print("Coeficientes Estimados Vinos Blancos alpha=1")
result_ww

Coeficientes Estimados Vinos Blancos alpha=1


Unnamed: 0,Variables,Coeficientes_Estimados
0,fixed acidity_1,-0.0
1,volatile acidity_1,-0.0
2,citric acid_1,-0.0
3,residual sugar_1,-0.0
4,chlorides_1,-0.0
5,free sulfur dioxide_1,0.0
6,total sulfur dioxide_1,-0.0
7,density_1,-0.0
8,pH_1,0.0
9,sulphates_1,0.0


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

Se desarrollo en el punto 6.5 donde se hizo un modelo logit

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

Se desarrollo en el punto 6.5 donde se hizo un modelo logit