## Mock Exam

### Part 1: Data Pre-Processing

In [1]:
import pandas  as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv('winequality.csv')
df=pd.DataFrame(data)
#Remove all rows that contain NaN values in the original Dataframe
df.dropna(how='all',inplace=True)
df.reset_index(drop=True,inplace=True)
##df.info()

In [3]:
df.drop_duplicates(keep='first', inplace=True)
##df.info()

### Part 2: Data Exploration

In [4]:
df.rename(columns={'fixed acidity' : 'fixed_acidity'},inplace=True)
df.rename(columns={'volatile acidity' : 'volatile_acidity'},inplace=True)
df.rename(columns={'citric acid' : 'citric_acid'},inplace=True)
df.rename(columns={'residual sugar' : 'residual_sugar'},inplace=True)
df.rename(columns={'free sulfur dioxide' : 'free_sulfur_dioxide'},inplace=True)
df.rename(columns={'total sulfur dioxide' : 'total_sulfur_dioxide'},inplace=True)
##df.describe()

In [5]:
##df.corr()

In [6]:
##plt.scatter(df.sulphates, df.quality)
##plt.title("Quality vs Sulphates")

In [7]:
##plt.scatter(df.pH, df.quality)
##plt.title("Quality vs pH")

In [8]:
##plt.scatter(df.citric_acid , df.quality)
##plt.title("Quality vs Citric Acid")

In [9]:
##plt.scatter(df.alcohol, df.quality)
##plt.title("Quality vs Alcohol")

In [10]:
#plt.scatter(df.volatile_acidity, df.quality)
#plt.title("Quality vs Volatile Acidity")

In [11]:
##plt.scatter(df.chlorides, df.quality)
##plt.title("Quality vs Chlorides")

In [12]:
##sns.pairplot(df, height=2.5)
##plt.tight_layout()

In [13]:
##cm = np.corrcoef(df.values.T)
##sns.set(font_scale=0.8)
##hm= sns.heatmap(cm,
##                  cbar=True,
##                  annot=True,
##                  square=True,
##                  fmt ='.2f',
##                  annot_kws={'size':10},
##                  yticklabels=df.columns,
##                  xticklabels=df.columns)

In [14]:
#df.describe()

In [15]:
df.drop(['fixed_acidity','residual_sugar','free_sulfur_dioxide', 
         'pH','chlorides','total_sulfur_dioxide','density'],axis=1,inplace=True)
#df.drop(df.loc[df["quality"]>7.5].index, inplace=True)
#df.drop(df.loc[df["alcohol"]>12].index, inplace=True)
#df.drop(df.loc[df["total_sulfur_dioxide"]>80].index, inplace=True)
#df.drop(df.loc[df["volatile_acidity"]>1.58].index, inplace=True)
##df.info()

### Analysis ....
In this dataset has been eliminated the columns with lower correlation with quality and areas close to 0, so had a low association, all with value in the range 0.2 and -0.2.

### Part 3: Linear Regression

In [16]:
method = []
r2_list_train = []
r2_list_test = []
rmse_list = []

In [17]:
from sklearn.model_selection import train_test_split
response = df['quality']
features = df.drop('quality', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(features, 
                                                   response,
                                                   test_size =0.1, 
                                                   random_state= 42)

In [18]:
model= LinearRegression()
#Training model
model.fit(X_train, Y_train )
#Making predictions 
predictions = model.predict(X_test)

In [19]:
from sklearn.metrics import mean_squared_error

r2_train = model.score(X_train, Y_train)
print("R2 in the train set is ", r2_train)
r2= model.score(X_test, Y_test)
print("R2 in the test set is ", r2)

rmse= mean_squared_error(Y_test, predictions) ** 0.5
print("RMSE in the test set is mse",rmse )

method.append('Linear Regression')
r2_list_train.append(r2_train)
r2_list_test.append(r2)
rmse_list.append(rmse)

R2 in the train set is  0.3426857567444026
R2 in the test set is  0.2948268456209844
RMSE in the test set is mse 0.6004304498602445


In [20]:
##actual_value = Y_test
##plt.scatter(predictions, actual_value, alpha=0.6)
##plt.xlabel("Predicted value")
##plt.ylabel("Actual value")
##plt.title("Linear Regression Model")
##plt.show()

### Using Lasso

In [21]:
#1.Split Data set 
response = df['quality']
features = df.drop('quality', axis=1)
xtrain, xtest, ytrain, ytest = train_test_split(features, \
                                                response, \
                                                test_size=0.1, 
                                                random_state=42)

In [22]:
# 2. Create a list of aplhas
#alphas = 10**np.linspace(6, -2, 20)
alphas = [100, 10, 1, 0.1, 1e-2, 1e-4, 1e-6, 1e-8]
##alphas

In [23]:
lasso = Lasso()
coefs = []

for a in alphas:
    lasso.set_params(alpha = a)
    lasso.fit(features,response)
    coefs.append(lasso.coef_)

##np.shape(coefs)

In [24]:
##plt.plot(alphas, coefs)
##ax = plt.gca()
##ax.set_xscale('log')
##plt.axis('tight')
##plt.xlabel('alpha')
##plt.ylabel('weights')

In [25]:
#Findinfg the best alpha
rmse_test_list = []
r2_test_list = []
r2_train_list = []
best_r2_test = 0
best_alpha = 0

for a in alphas:
    lasso = Lasso(alpha = a, max_iter=1000)
    lasso.fit(xtrain, ytrain)
    pred = lasso.predict(xtest)
    
    r2_train = lasso.score(xtrain,ytrain)
    r2_train_list.append(r2_train)
    
    r2_test = lasso.score(xtest,ytest)
    r2_test_list.append(r2_test)
    
    rmse_test = mean_squared_error(ytest, pred) ** 0.5
    rmse_test_list.append(rmse_test)
    
    if r2_test > best_r2_test:
        best_r2_test = r2_test
        best_alpha = a
    
lasso_result = np.vstack((alphas, \
                         r2_train_list, \
                         r2_test_list, \
                         rmse_test_list)).T
lasso_df = pd.DataFrame(lasso_result, \
                            columns=['Alpha', 'R2 Train', 'R2 Test', 'RMSE'])

##print(lasso_df)
##print(best_alpha)

In [26]:
lasso = Lasso(alpha=best_alpha)
lasso.fit(xtrain, ytrain)
pred = lasso.predict(xtest)

r2_test =lasso.score(xtest,ytest)
r2_train = lasso.score(xtrain,ytrain)
rmse= mean_squared_error(pred,ytest) ** 0.5
##print("R^2 train:", r2_train)
##print("R^2 test:", r2_test)
##print("RMSE test:", rmse)
##print()
##print(pd.Series(lasso.coef_, index = features.columns))

method.append('Lasso')
r2_list_test.append(r2_test)
r2_list_train.append(r2_train)
rmse_list.append(rmse)

### 4 Linear Regression with Scaler

In [27]:
# scale the features
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
zscore = ss.fit_transform(features)
feature_ss = pd.DataFrame(zscore, \
                          index=features.index, columns=features.columns)
feature_ss = feature_ss.reset_index(drop=True)
##print(feature_ss.head())

In [28]:
# Build the model
X_train_ss, X_test_ss, Y_train_ss, Y_test_ss = train_test_split( \
                                                               feature_ss, \
                                                                response, \
                                                                test_size=0.1, \
                                                                random_state=42 )
model_ss = LinearRegression()
model_ss.fit(X_train_ss, Y_train_ss)
pred_ss = model_ss.predict(X_test_ss)

In [29]:
#R2
r2_train_ss = model_ss.score(X_train_ss, Y_train_ss)
##print("R^2 in Training Set:", r2_train_ss)
r2_test_ss = model_ss.score(X_test_ss, Y_test_ss)
##print("R^2 in Test Set:", r2_test_ss)
# rmse
rmse_ss = mean_squared_error(Y_test_ss, pred_ss) ** 0.5
##print('RMSE:', rmse_ss)

In [30]:
# append the result
method.append('Standard Scaler')
r2_list_test.append(r2_test_ss)
r2_list_train.append(r2_train_ss)
rmse_list.append(rmse_ss)

In [31]:
# minmax
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
minmax = mm.fit_transform(features)

feature_mm = pd.DataFrame(minmax, \
                          index=features.index, 
                          columns=features.columns)
feature_mm = feature_mm.reset_index(drop=True)
print(feature_mm.head())

X_train_mm, X_test_mm, Y_train_mm, Y_test_mm = train_test_split(feature_mm, \
                                                               response, \
                                                               test_size=0.1,\
                                                               random_state=42)
model_mm = LinearRegression()
model_mm.fit(X_train_mm, Y_train_mm)

pred_mm = model_mm.predict(X_test_mm)

r2_train_mm = model_mm.score(X_train_mm, Y_train_mm)
##print("R^2 in Training Set:", r2_train_mm)

r2_test_mm = model_mm.score(X_test_mm, Y_test_mm)
##print("R^2 in Test Set:", r2_test_mm)

rmse_mm = mean_squared_error(Y_test_mm, pred_mm) ** 0.5
##print("RMSE:", rmse_mm)

method.append('MinMax Scaler')
r2_list_train.append(r2_train_mm)
r2_list_test.append(r2_test_mm)
rmse_list.append(rmse_mm)

   volatile_acidity  citric_acid  sulphates   alcohol
0          0.397260         0.00   0.137725  0.153846
1          0.520548         0.00   0.209581  0.215385
2          0.438356         0.04   0.191617  0.215385
3          0.109589         0.56   0.149701  0.215385
4          0.369863         0.00   0.137725  0.153846


In [32]:
##actual_value = Y_test_mm
##plt.scatter(pred_mm, actual_value, alpha=0.6)
##plt.xlabel("Predicted value")
##plt.ylabel("Actual value")
##plt.title("Linear with Scaler MinMax")
##plt.show()

## Analysis ...
Create a cell (a Markdown cell) in the Jupyter notebook and write down your observation and conclude if scaling is necessary.

In this case, it's observed that values for Linear Regression with Scaling and Simple Linear Regression are pretty close, almost similar. So, it's unnecessary to apply scaling over this dataset. So, Scaling does not cause benefits over this dataset. Furthermore, the dimensions of the numbers are low and are not in accordance with this type of model.

### Part 5: Linear Regression with Ridge using Scaled Dataset 

In [33]:
# scale the features
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
zscore = ss.fit_transform(features)
feature_ss = pd.DataFrame(zscore, \
                          index=features.index, columns=features.columns)
feature_ss = feature_ss.reset_index(drop=True)
##print(feature_ss.head())

In [34]:
# Build the model
X_train_ss, X_test_ss, Y_train_ss, Y_test_ss = train_test_split( \
                                                               feature_ss, \
                                                                response, \
                                                                test_size=0.1, \
                                                                random_state=42 )
model_ss = LinearRegression()
model_ss.fit(X_train_ss, Y_train_ss)
pred_ss = model_ss.predict(X_test_ss)

In [35]:
# 2. Create a list of aplhas
##alphas = 10 ** np.linspace(6, -2, 20)
alphas = [100, 10, 1, 0.1, 1e-2, 1e-4, 1e-6, 1e-8]
##alphas

In [36]:
ridge = Ridge()
coefs = []
for a in alphas:
    ridge.set_params(alpha = a)
    ridge.fit(features,response)
    coefs.append(ridge.coef_)
##np.shape(coefs)

In [37]:
##plt.plot(alphas, coefs)
##ax = plt.gca()
##ax.set_xscale('log')
##plt.axis('tight')
##plt.xlabel('alpha')
##plt.ylabel('weights')

In [38]:
#Findinfg the best alpha

rmse_test_list = []
r2_test_list = []
r2_train_list = []

best_r2_test = 0
best_alpha = 0

for a in alphas:
    ridge = Ridge(alpha = a, max_iter=1000)
    ridge.fit(X_train_ss, Y_train_ss)
    pred = ridge.predict(X_test_ss)
    
    r2_train = ridge.score(X_train_ss, Y_train_ss)
    r2_train_list.append(r2_train)
    
    r2_test = ridge.score(X_test_ss, Y_test_ss)
    r2_test_list.append(r2_test)
    
    rmse_test = mean_squared_error(Y_test_ss, pred) ** 0.5
    rmse_test_list.append(rmse_test)
    
    if r2_test > best_r2_test:
        best_r2_test = r2_test
        best_alpha = a
    
ridge_result = np.vstack((alphas, \
                         r2_train_list, \
                         r2_test_list, \
                         rmse_test_list)).T
ridge_df = pd.DataFrame(ridge_result, \
                        columns=['Alpha', 'R2 Train', 'R2 Test', 'RMSE'])

##print(ridge_df)
##print(best_alpha)

In [39]:
# apply ridge correction to the model considering the  previous best alpha founded
ridge = Ridge(alpha=best_alpha)
ridge.fit(X_train_ss, Y_train_ss)
pred = ridge.predict(X_test_ss)

r2_test =ridge.score(X_test_ss, Y_test_ss)
r2_train = ridge.score(X_train_ss, Y_train_ss)
rmse= mean_squared_error(pred,Y_test_ss) ** 0.5
##print("R^2 train:", r2_train)
##print("R^2 test:", r2_test)
##print("RMSE test:", rmse)
##print()
##print(pd.Series(ridge.coef_, index = features.columns))

method.append('Ridge using Scaler')
r2_list_train.append(r2_train)
r2_list_test.append(r2_test)
rmse_list.append(rmse)

### Polynomial

In [40]:
from sklearn.model_selection import train_test_split
#Split datasets
response = df['quality'] 
features = df.drop('quality', axis=1) 
X_train, X_test, Y_train, Y_test = train_test_split(features, 
                                                   response,
                                                   test_size =0.1,
                                                   random_state= 1024)
model = LinearRegression()

In [41]:
from sklearn.preprocessing import PolynomialFeatures
#Finding best degree
best_degree = 0
best_mse = 0
best_model = None
best_r2=0

for degree in range(1, 5):
    
    poly_features = PolynomialFeatures(degree=degree) 
    
    X_train_poly = poly_features.fit_transform(X_train) 
    X_test_poly = poly_features.transform(X_test) 
    model.fit(X_train_poly, Y_train)

    y_pred = model.predict(X_test_poly)
    mse = mean_squared_error(Y_test, y_pred)
    
    r2= model.score(X_test_poly, Y_test)
    rmse_test = mean_squared_error(Y_test, y_pred) ** 0.5

    if r2 > best_r2:
        best_r2= r2
        best_degree = degree
        best_mse = rmse_test
        
##print(f"Best degree: {best_degree}")

In [42]:
poli_reg = PolynomialFeatures(degree = best_degree)

X_train_poly = poli_reg.fit_transform(X_train)
X_test_poly = poli_reg.fit_transform(X_test)

model.fit(X_train_poly, Y_train )
prediction = model.predict(X_test_poly)

r2_train= model.score(X_train_poly, Y_train)
r2_test= model.score(X_test_poly, Y_test)
rmse_test = mean_squared_error(Y_test, prediction) ** 0.5
##print("R^2 train:", r2_train)
##print("R^2 test:", r2_test)
##print("RMSE test:", rmse_test)

method.append('Polynomial')
r2_list_train.append(r2_train)
r2_list_test.append(r2_test)
rmse_list.append(rmse)

In [43]:
##actual_value = Y_test
##plt.scatter(predictions, actual_value, alpha=0.6)
##plt.xlabel("Predicted value")
##plt.ylabel("Actual value")
##plt.title("Polynomial Regression Model")
##plt.show()

In [44]:
results = np.vstack((method, r2_list_train,r2_list_test, rmse_list)).T
results_df = pd.DataFrame(results, columns=['Method', 'R2 Training','R2 Test', 'RMSE'])
results_df

Unnamed: 0,Method,R2 Training,R2 Test,RMSE
0,Linear Regression,0.3426857567444026,0.2948268456209844,0.6004304498602445
1,Lasso,0.3426857567443765,0.2948268225072082,0.6004304597005333
2,Standard Scaler,0.3426857567444026,0.2948268456209845,0.6004304498602445
3,MinMax Scaler,0.3426857567444026,0.2948268456209845,0.6004304498602445
4,Ridge using Scaler,0.3411910401469027,0.2979710480382105,0.5990903648307659
5,Polynomial,0.3433307856234657,0.3372789791018367,0.5990903648307659


## Anaysis
In this table, it can be observed that the performance of this dataset is pretty close (lower values R^2) in the different models, which generally means that this dataset has low correlation and underfitting.

### Part 6: Discussion 

1. Create a cell (a Markdown cell) in the Jupyter notebook and answer: From managerial perspective, how can one improve the quality of the wine? Explain your answer.


- Try with polynomial model
- Add fetures
- Add data


2. Create a cell (a Markdown cell) in the Jupyter notebook and answer: What is overfitting? How linear regression with Ridge solve the problem of overfitting?



In linear regression, overfitting can occur when the model is too complex and captures noise in the data as if it were a meaningful pattern. 

Ridge regression is a technique used to address overfitting in linear regression by adding a regularization term to the cost function.

Key points about how Ridge regression addresses overfitting:

1.Shrinkage of Coefficients: The regularization term penalizes large coefficients, preventing them from becoming too extreme. This helps to avoid fitting the noise in the training data.

2.Reduced Sensitivity to Outliers: Ridge regression is more robust to outliers and noisy data because it reduces the impact of individual data points on the model.

3.Better Generalization: By preventing the model from becoming too complex, Ridge regression improves its ability to generalize to new, unseen data.

4.Multicollinearity Handling: Ridge regression is particularly useful when there is multicollinearity (high correlation) among the features. It can stabilize and improve the numerical stability of the coefficient estimates.