# Exercise: Linear Models

In this exercise, we'll explore two types of linear models: one for regression and one for classification. While regression is what you typically think of for a linear model, linear models can also be used effectively in classification problems.

You're tasked with completing the following steps:
1. Load the wine dataset from scikit-learn.
2. For the wine dataset, create a train and test split: 80% train / 20% test.
3. Create a `LogisticRegression` model with these hyperparameters: `random_state=0`, `max_iter=10000`.
4. Evaluate the model on the test dataset.
5. Load the diabetes dataset from scikit-learn.
6. For the diabetes dataset, create a train and test split: 80% train / 20% test.
7. Create an `SGDRegressor` model with these hyperparameters: `random_state=0`, `max_iter=10000`.
8. Evaluate the model on the test dataset.
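
The eight steps above follow the same split, fit, and score pattern for both datasets. As a rough sketch of that pattern, the block below wraps it in a helper. The helper name `split_fit_score` and the fixed split seed are assumptions made for illustration and are not part of the exercise.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression, SGDRegressor
from sklearn.model_selection import train_test_split


def split_fit_score(dataset, model, seed=0):
    """Hypothetical helper: 80/20 split, fit on train, score on test."""
    X_train, X_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.2, random_state=seed
    )
    model.fit(X_train, y_train)
    # .score() returns accuracy for classifiers and R^2 for regressors
    return model.score(X_test, y_test)


# Steps 1-4: wine classification with the hyperparameters from the task list
wine_acc = split_fit_score(
    datasets.load_wine(), LogisticRegression(random_state=0, max_iter=10000)
)

# Steps 5-8: diabetes regression with the hyperparameters from the task list
diabetes_r2 = split_fit_score(
    datasets.load_diabetes(), SGDRegressor(random_state=0, max_iter=10000)
)

print("Wine accuracy:", wine_acc)
print("Diabetes R^2:", diabetes_r2)
```

The cells below work through the same steps one at a time on dataframes, which makes the intermediate results easier to inspect.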

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDRegressor

## Linear Classifier

In [2]:
# Load in the wine dataset
wine = datasets.load_wine()
wine

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [10]:
# Create the wine `data` dataset as a dataframe and name the columns with `feature_names`
feature_names = wine.feature_names

df = pd.DataFrame(wine.data, columns=feature_names)

# Include the target as well
df['target'] = wine.target

In [11]:
# Check your dataframe by `.head()`
df.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


In [19]:
# Split your data with these ratios: train: 0.8 | test: 0.2
# No random_state is set, so the split (and every score below) varies between runs
df_train, df_test = train_test_split(df, test_size=0.2)

In [22]:
# Output the shapes of the train and test sets
print("Train set shape:", df_train.shape)
print("Test set shape:", df_test.shape)

Train set shape: (142, 14)
Test set shape: (36, 14)
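
The split above is random, so the class balance in the 36-row test set can differ from run to run. One possible variation, shown here as a sketch, fixes the seed and stratifies on the target. The `random_state=0` value and the use of `as_frame=True` are assumptions for illustration; the original cell uses neither.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# as_frame=True returns a pandas DataFrame; .frame holds the features plus 'target'
wine = datasets.load_wine(as_frame=True)
df = wine.frame

# stratify keeps the class proportions roughly equal in train and test
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["target"]
)

print("Train set shape:", df_train.shape)
print("Test set shape:", df_test.shape)
print(df_test["target"].value_counts().sort_index())
```

The shapes are the same as in the unstratified split; only the assignment of rows to each set changes.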


In [26]:
# How does the model perform on the training dataset and default model parameters?
# Using the hyperparameters in the requirements, is there improvement?
# Remember we use the test dataset to score the model
clf = LogisticRegression().fit(df_train.drop('target', axis=1), df_train['target'])
clf.score(df_test.drop('target', axis=1), df_test['target'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9166666666666666
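
The warning above suggests two remedies: raising `max_iter` or scaling the data. The sketch below combines both, using the hyperparameters from the task list inside a pipeline so the scaler is fitted on the training data only. The fixed split seed is an assumption added for reproducibility, so the printed accuracy will not necessarily match the values in this notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling and the classifier live in one estimator; fit() scales using train data only
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, max_iter=10000),
)
pipe.fit(X_train, y_train)

# n_iter_ reports how many solver iterations were actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_
acc = pipe.score(X_test, y_test)

print("Iterations used:", n_iter)
print("Test accuracy:", acc)
```

If the iteration count stays well below `max_iter`, the solver converged and the warning should not appear.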

In [27]:
from sklearn.preprocessing import StandardScaler

# Scale the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(df_train.drop('target', axis=1))
X_test_scaled = scaler.transform(df_test.drop('target', axis=1))


In [28]:
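# Define hyperparameters
# Note: these differ from the exercise requirements (random_state=0, max_iter=10000)
hyperparameters = {
    'C': 1.0,  # Inverse of regularization strength (smaller means stronger regularization)
    'solver': 'saga',  # Alternative solver
    'max_iter': 1000  # Increased number of iterations (default is 100)
}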


In [29]:


# Create and fit the model with hyperparameters
clf_hyperparameters = LogisticRegression(**hyperparameters).fit(X_train_scaled, df_train['target'])

# Evaluate model performance on the training set with hyperparameters
train_score_hyperparameters = clf_hyperparameters.score(X_train_scaled, df_train['target'])

# Print the accuracy score on the training set with hyperparameters
print("Accuracy on training set with hyperparameters:", train_score_hyperparameters)




Accuracy on training set with hyperparameters: 1.0


In [30]:
# Score the model on the test set
test_score = clf_hyperparameters.score(X_test_scaled, df_test['target'])

# Print the accuracy score on the test set
print("Accuracy on test set with hyperparameters:", test_score)

Accuracy on test set with hyperparameters: 0.9722222222222222
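
With only 36 test rows, each misclassified wine moves the accuracy by 1/36, so 0.9722 corresponds to 35 of 36 correct and the earlier 0.9167 to 33 of 36. A single split is therefore a coarse estimate. One way to get a steadier number, sketched here, is 5-fold cross-validation over the whole dataset. The choice of `cv=5` and `random_state=0` are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)

# Same model settings as the cell above, wrapped with scaling so that
# each fold fits its own scaler on that fold's training portion
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, solver="saga", max_iter=1000, random_state=0),
)

# For classifiers, an integer cv uses stratified folds
scores = cross_val_score(pipe, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
print("Std deviation:", scores.std())
```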


## Linear Regression

In [31]:
# Load in the diabetes dataset
diabetes = datasets.load_diabetes()

In [33]:
# Create the diabetes `data` dataset as a dataframe and name the columns with `feature_names`
dfd = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Include the target as well
dfd['target'] = diabetes.target

In [34]:
# Check your dataframe by `.head()`
dfd.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


In [35]:
# Split your data with these ratios: train: 0.8 | test: 0.2
# No random_state is set, so the split varies between runs
dfd_train, dfd_test = train_test_split(dfd, test_size=0.2)

In [36]:
# How does the model perform on the training dataset and default model parameters?
# Using the hyperparameters in the requirements, is there improvement?
# Remember we use the test dataset to score the model
reg = SGDRegressor().fit(dfd_train.drop('target', axis=1), dfd_train['target'])
reg.score(dfd_test.drop('target', axis=1), dfd_test['target'])



0.3447577180567909

In [37]:
from sklearn.preprocessing import StandardScaler

# Scale the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(dfd_train.drop('target', axis=1))
X_test_scaled = scaler.transform(dfd_test.drop('target', axis=1))

In [38]:
# Create and fit the model with default parameters
reg_default = SGDRegressor().fit(X_train_scaled, dfd_train['target'])

# Score the model on the test set with default parameters
test_score_default = reg_default.score(X_test_scaled, dfd_test['target'])

# Print the R^2 score on the test set with default parameters
print("R^2 score on test set with default parameters:", test_score_default)


R^2 score on test set with default parameters: 0.44037498206705084


In [39]:

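# Define hyperparameters
# Note: these values match SGDRegressor's defaults in current scikit-learn,
# so any difference from the previous cell comes from random shuffling, not tuning.
# The exercise asks for random_state=0, max_iter=10000 instead.
hyperparameters = {
    'alpha': 0.0001,  # Regularization strength
    'max_iter': 1000,  # Maximum number of iterations
    'tol': 1e-3  # Tolerance for stopping criterion
}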

# Create and fit the model with hyperparameters
reg_hyperparameters = SGDRegressor(**hyperparameters).fit(X_train_scaled, dfd_train['target'])

# Score the model on the test set with hyperparameters
test_score_hyperparameters = reg_hyperparameters.score(X_test_scaled, dfd_test['target'])

# Print the R^2 score on the test set with hyperparameters
print("R^2 score on test set with hyperparameters:", test_score_hyperparameters)

R^2 score on test set with hyperparameters: 0.43897350329595164
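
The two R^2 values above (about 0.440 and 0.439) are nearly identical. A likely explanation is that the chosen hyperparameters coincide with the estimator's defaults, so the gap reflects SGD's random shuffling rather than a real change. The sketch below checks that against the installed scikit-learn version and measures the spread across a few seeds. The seed range and the fixed split seed are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

hyperparameters = {"alpha": 0.0001, "max_iter": 1000, "tol": 1e-3}

# Compare the chosen values with the defaults of the installed version
defaults = SGDRegressor().get_params()
same_as_default = all(
    defaults[name] == value for name, value in hyperparameters.items()
)
print("Matches defaults:", same_as_default)

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Refit with different seeds on the same split to see run-to-run variation
scores = []
for seed in range(5):
    reg = SGDRegressor(random_state=seed, **hyperparameters)
    reg.fit(X_train_scaled, y_train)
    scores.append(reg.score(X_test_scaled, y_test))

print("R^2 per seed:", scores)
print("Spread:", max(scores) - min(scores))
```

If the spread across seeds is similar in size to the gap between the two cells above, the hyperparameter dictionary did not change the model in any meaningful way.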
