# Final Project
## Authors:
- Taylor Tucker
- Virginia Weston
- Tina Jin
- Jeffrey Bradley

## Code for linear regression modeling using Ridge, Lasso, and ElasticNet

Import statements

In [32]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.linear_model import RidgeCV, ElasticNetCV, LassoCV

Importing the continuous dataset

In [33]:
df = pd.read_csv("./cleaned_data.csv", index_col=0)

Splitting the data into features (x) and target (y), then creating the training and testing data

In [34]:
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, random_state=1, test_size=0.3)

Creating a list of L1 ratios for the ElasticNet grid search. The other two regressors are not grid-searched here,
because RidgeCV and LassoCV already choose their regularization strength (alpha) by internal cross-validation.

In [35]:
ratios = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]
grid = {"elasticnetcv__l1_ratio": ratios}
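As a possible alternative to the outer grid search, ElasticNetCV accepts a list of L1 ratios directly and cross-validates over them itself. The sketch below is a minimal illustration on synthetic stand-in data, not the notebook's dataset; the names `X_demo`, `y_demo`, `demo_ratios`, and `pl_en_demo` are made up for the example. The value 0 is left out of the list because scikit-learn does not support automatic alpha grid generation for `l1_ratio=0`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the real data comes from cleaned_data.csv)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)

# Candidate L1 ratios; 0 is omitted because the automatic alpha grid
# is not supported for a pure L2 penalty
demo_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

# ElasticNetCV searches over both l1_ratio and alpha with its own cross-validation
pl_en_demo = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=demo_ratios, cv=5))
pl_en_demo.fit(X_demo, y_demo)

en = pl_en_demo.named_steps["elasticnetcv"]
print("chosen l1_ratio:", en.l1_ratio_)
print("chosen alpha:", en.alpha_)
```

With this form, the selected ratio and regularization strength are read from `l1_ratio_` and `alpha_` on the fitted estimator rather than from `best_params_`.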

Making the pipelines for the learners

In [36]:
pl_lasso = make_pipeline(StandardScaler(), MinMaxScaler(), LassoCV(normalize=True, cv=10, copy_X=True), verbose=True)
pl_ridge = make_pipeline(StandardScaler(), MinMaxScaler(), RidgeCV(normalize=True, cv=10), verbose=True)
pl_en = make_pipeline(StandardScaler(), MinMaxScaler(), ElasticNetCV(normalize=True, cv=10, copy_X=True), verbose=True)
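The `normalize` argument used in these pipelines was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell fails on current releases. A minimal sketch of one possible replacement, on synthetic stand-in data, lets StandardScaler handle the scaling instead. This is not numerically identical to `normalize=True`, so scores may differ from the ones reported in this notebook; all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's cleaned_data.csv)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# No normalize argument: scaling is done by the StandardScaler step
lasso_demo = make_pipeline(StandardScaler(), LassoCV(cv=10))
ridge_demo = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0], cv=10))

lasso_demo.fit(X_tr, y_tr)
ridge_demo.fit(X_tr, y_tr)

# Both estimators pick their regularization strength by internal cross-validation
print("Lasso alpha:", lasso_demo.named_steps["lassocv"].alpha_)
print("Ridge alpha:", ridge_demo.named_steps["ridgecv"].alpha_)
print("Lasso test R^2:", lasso_demo.score(X_te, y_te))
```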

Creating a GridSearchCV for the ElasticNet Learner

In [37]:
gs_en = GridSearchCV(estimator=pl_en, param_grid=grid, n_jobs=-1, refit=True, cv=10, verbose=True, scoring='r2')

Fitting the pipelines for Lasso and Ridge, and fitting the GridSearch for ElasticNet.

In [38]:
pl_lasso.fit(x_train, y_train)
pl_ridge.fit(x_train, y_train)
gs_en.fit(x_train, y_train)

[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline] ........... (step 3 of 3) Processing lassocv, total=   0.1s
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline] ........... (step 3 of 3) Processing ridgecv, total=   0.1s
Fitting 10 folds for each of 11 candidates, totalling 110 fits
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline] ...... (step 3 of 3) Processing elasticnetcv, total=   0.1s


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 110 out of 110 | elapsed:    3.1s finished


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('minmaxscaler', MinMaxScaler()),
                                       ('elasticnetcv',
                                        ElasticNetCV(cv=10, normalize=True))],
                                verbose=True),
             n_jobs=-1,
             param_grid={'elasticnetcv__l1_ratio': [0, 0.1, 0.2, 0.3, 0.4, 0.5,
                                                    0.6, 0.7, 0.8, 0.9, 1]},
             scoring='r2', verbose=True)

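Printing the R^2 scores of the models, including the GridSearchCV. Note that best_score_ is the mean cross-validated R^2 of the best candidate, not a score on the full training set.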

In [39]:
print("Lasso Testing R^2:", pl_lasso.score(x_test, y_test))
print("Ridge Testing R^2:", pl_ridge.score(x_test, y_test))
print("ElasticNet Testing R^2:", gs_en.score(x_test, y_test))
print("ElasticNet Best Training Score:", gs_en.best_score_)
print("ElasticNet Best Params:", gs_en.best_params_)

Lasso Testing R^2: 0.8404422713790248
Ridge Testing R^2: 0.8244072494805195
ElasticNet Testing R^2: 0.8404422713790248
ElasticNet Best Training Score: 0.7950507307111627
ElasticNet Best Params: {'elasticnetcv__l1_ratio': 1}
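For reference, the value returned by `.score` on these regressors is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic stand-in data (an assumption; it does not use the notebook's dataset) and compares it with scikit-learn's own results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score

# Synthetic stand-in data, used only to illustrate the formula
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
model = LassoCV(cv=5).fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree up to floating-point error
print(r2_manual, model.score(X_demo, y_demo), r2_score(y_demo, y_pred))
```

Because R^2 measures explained variance rather than the share of correct predictions, it is a goodness-of-fit score and not an accuracy in the classification sense.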


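As we can see, these scores are not much better than those of the plain linear regression. ElasticNet with an L1 ratio
of 1 is exactly the Lasso, which is why those two test scores are identical. I still want to try the exact same process
with the MaxAbsScaler instead, to conserve spatial relationships, since it only rescales each feature without shifting it.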

In [40]:
df = pd.read_csv("./cleaned_data.csv", index_col=0)

Splitting the data into features (x) and target (y), then creating the training and testing data

In [41]:
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, random_state=1, test_size=0.3)

Creating a list of L1 ratios for the ElasticNet grid search. The other two regressors are not grid-searched here,
because RidgeCV and LassoCV already choose their regularization strength (alpha) by internal cross-validation.

In [42]:
ratios = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1]
grid = {"elasticnetcv__l1_ratio": ratios}

Making the pipelines for the learners

In [43]:
pl_lasso = make_pipeline(MaxAbsScaler(), LassoCV(normalize=True, cv=10, copy_X=True), verbose=True)
pl_ridge = make_pipeline(MaxAbsScaler(), RidgeCV(normalize=True, cv=10), verbose=True)
pl_en = make_pipeline(MaxAbsScaler(), ElasticNetCV(normalize=True, cv=10, copy_X=True), verbose=True)

Creating a GridSearchCV for the ElasticNet Learner

In [44]:
gs_en = GridSearchCV(estimator=pl_en, param_grid=grid, n_jobs=-1, refit=True, cv=10, verbose=True, scoring='r2')

Fitting the pipelines for Lasso and Ridge, and fitting the GridSearch for ElasticNet.

In [45]:
pl_lasso.fit(x_train, y_train)
pl_ridge.fit(x_train, y_train)
gs_en.fit(x_train, y_train)

[Pipeline] ...... (step 1 of 2) Processing maxabsscaler, total=   0.0s
[Pipeline] ........... (step 2 of 2) Processing lassocv, total=   0.1s
[Pipeline] ...... (step 1 of 2) Processing maxabsscaler, total=   0.0s
[Pipeline] ........... (step 2 of 2) Processing ridgecv, total=   0.0s
Fitting 10 folds for each of 11 candidates, totalling 110 fits
[Pipeline] ...... (step 1 of 2) Processing maxabsscaler, total=   0.0s
[Pipeline] ...... (step 2 of 2) Processing elasticnetcv, total=   0.1s


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done  95 out of 110 | elapsed:    2.1s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done 110 out of 110 | elapsed:    2.5s finished


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('maxabsscaler', MaxAbsScaler()),
                                       ('elasticnetcv',
                                        ElasticNetCV(cv=10, normalize=True))],
                                verbose=True),
             n_jobs=-1,
             param_grid={'elasticnetcv__l1_ratio': [0, 0.1, 0.2, 0.3, 0.4, 0.5,
                                                    0.6, 0.7, 0.8, 0.9, 1]},
             scoring='r2', verbose=True)

Printing the R^2 scores of the models, including the GridSearchCV

In [46]:
print("Lasso Testing R^2:", pl_lasso.score(x_test, y_test))
print("Ridge Testing R^2:", pl_ridge.score(x_test, y_test))
print("ElasticNet Testing R^2:", gs_en.score(x_test, y_test))
print("ElasticNet Best Training Score:", gs_en.best_score_)
print("ElasticNet Best Params:", gs_en.best_params_)

Lasso Testing R^2: 0.8404422713790247
Ridge Testing R^2: 0.8244072494805194
ElasticNet Testing R^2: 0.8404422713790247
ElasticNet Best Training Score: 0.7950507307111628
ElasticNet Best Params: {'elasticnetcv__l1_ratio': 1}


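Essentially no difference in R^2 values: the scores differ only in the last decimal place. This is what one would expect, since normalize=True makes each regressor center and rescale the features itself, which cancels any per-feature rescaling done by the scalers beforehand. Since the Lasso seems to be the best model, I will use that on the best features.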
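One way to see why the two scaler setups can give matching scores: any per-feature rescaling or shifting is undone once the features are centered and rescaled again. The sketch below illustrates this on synthetic stand-in data. A final StandardScaler stands in for the estimator's internal `normalize=True` step, which is an assumption made for illustration (the two differ only by a constant factor per feature).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Synthetic stand-in features (assumption: not the notebook's dataset)
X_demo, _ = make_regression(n_samples=100, n_features=4, random_state=1)

# First setup: StandardScaler + MinMaxScaler, then a final standardization
via_minmax = make_pipeline(StandardScaler(), MinMaxScaler(), StandardScaler()).fit_transform(X_demo)

# Second setup: MaxAbsScaler, then the same final standardization
via_maxabs = make_pipeline(MaxAbsScaler(), StandardScaler()).fit_transform(X_demo)

# Both routes should give the same features up to floating-point error
print(np.allclose(via_minmax, via_maxabs))
```

If the regressor sees the same features either way, the fitted model and its R^2 should match as well, apart from rounding in the last digits.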

In [47]:
df = pd.read_csv("./cleaned_data.csv", index_col=0)

Making x, y, and best feature splits and creating the training and testing data

In [48]:
x = df[["Average Amount of Aid", "Graduation Rate"]]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, random_state=1, test_size=0.3)

Making the new pipeline

In [49]:
pl_lasso = make_pipeline(MaxAbsScaler(), LassoCV(normalize=True, verbose=True, cv=10, copy_X=True))

Fitting the pipeline

In [50]:
pl_lasso.fit(x_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

Pipeline(steps=[('maxabsscaler', MaxAbsScaler()),
                ('lassocv', LassoCV(cv=10, normalize=True, verbose=True))])

Printing the R^2 testing score

In [51]:
print("Lasso with the best two features and MaxAbsScaler Accuracy:", pl_lasso.score(x_test, y_test))

Lasso with the best two features and MaxAbsScaler Accuracy: 0.8464922187631583


Using the best two features (i.e. the two that are most correlated with the target), we get a testing R^2 of about 0.846,
even with L2 normalization and feature scaling. That is slightly higher than the 0.840 obtained with all features, so I would argue this is one of the best models we have.
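The notebook does not show how the two best features were chosen. A minimal sketch of one way to rank features by absolute correlation with the target follows; the DataFrame, its column names, and the coefficients are all made up for illustration and are not taken from cleaned_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the target depends on feature_a and feature_b only
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "feature_c": rng.normal(size=n),
})
demo["target"] = 3.0 * demo["feature_a"] - 2.0 * demo["feature_b"] + rng.normal(scale=0.5, size=n)

# As in the notebook, the target is assumed to be the last column
target_col = demo.columns[-1]

# Absolute Pearson correlation of every feature with the target
corr_with_target = demo.corr()[target_col].drop(target_col).abs()

# Keep the two most strongly correlated features
best_two = corr_with_target.nlargest(2).index.tolist()
print(best_two)
```

Ranking on the full dataset, as done here, lets the test rows influence feature selection; computing the correlations on the training split only would avoid that.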