Q1: For each model, what were the optimal hyperparameters?

A1: 
- Linear Regression - {'fit_intercept': True, 'positive': True}
- Support Vector Regression - {'C': 1, 'gamma': 1, 'kernel': 'linear'}
- Decision Tree Regression - {'criterion': 'absolute_error', 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 10, 'splitter': 'best'}
- Neural Network - {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'max_iter': 1000, 'solver': 'adam'}

Q2: What model performed the best? Did multiple models perform similarly?

A2: Linear regression performed best, with an average R-squared score of 0.81 and a root mean squared error (RMSE) of 0.06 across the five cross-validation folds. Support vector regression and decision tree regression performed similarly, with R-squared scores of 0.80 and 0.77 and RMSEs of 0.06 and 0.07, respectively. The neural network performed worst, with an R-squared score of 0.65 and an RMSE of 0.08. Notably, all models have a low RMSE, implying that their predictions are close to the observed values.

Q3: What models did you use and what was their R squared and RMSE on the testing set?

A3: I used the linear regression and support vector regression models. On the testing set, linear regression had an R-squared score of 0.82 and a root mean squared error of 0.06. Support vector regression had an R-squared score of 0.80 and a root mean squared error of 0.06.
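
For reference, the two metrics quoted in the answers above can be computed by hand. The sketch below uses made-up numbers (not the admissions data) and compares the manual formulas with scikit-learn's `r2_score` and `mean_squared_error`. The helper names `rmse` and `r_squared` are only illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([0.90, 0.75, 0.60, 0.80, 0.50])
y_hat = np.array([0.85, 0.80, 0.55, 0.78, 0.58])

def rmse(observed, predicted):
    """Root mean squared error: typical size of a prediction error, in target units."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def r_squared(observed, predicted):
    """Share of the target's variance that the predictions account for."""
    ss_res = np.sum((observed - predicted) ** 2)          # residual sum of squares
    ss_tot = np.sum((observed - np.mean(observed)) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

print("RMSE:", rmse(y_true, y_hat))
print("R squared:", r_squared(y_true, y_hat))

# The manual values should agree with scikit-learn's implementations
print(np.isclose(rmse(y_true, y_hat), mean_squared_error(y_true, y_hat) ** 0.5))
print(np.isclose(r_squared(y_true, y_hat), r2_score(y_true, y_hat)))
```

RMSE is expressed in the units of the target, so it is easiest to judge against the spread of the observed values.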

In [24]:
import pandas as pd

1. Load in the data

In [25]:
admission = pd.read_csv('admissions.csv')

2. Split data into feature frame and target frame

In [26]:
y = admission['Chance of Admit']
X = admission[['GRE Score','TOEFL Score','University Rating','SOP','LOR ','CGPA','Research']]
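
Note that the `'LOR '` column name above carries a trailing space, so it has to be typed exactly as it appears in the CSV header. One possible way to avoid this, sketched on a made-up two-row frame rather than on `admissions.csv`, is to strip whitespace from the column names right after loading:

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV; only the header quirk matters here
demo = pd.DataFrame({'SOP': [4.5, 3.0], 'LOR ': [4.0, 3.5], 'CGPA': [9.1, 8.0]})

# Remove leading/trailing whitespace from every column name
demo.columns = demo.columns.str.strip()

print(demo.columns.tolist())  # ['SOP', 'LOR', 'CGPA']
```

After this step the column would be selected as `'LOR'`, so any later selection lists would need the same spelling.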

3. Train test split

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=42)
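
As a quick check of what the split returns, the sketch below applies the same `train_size=0.8` and `random_state=42` to a made-up ten-row frame (not the admissions data). It shows that the two parts do not overlap and that features and targets stay aligned by index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset, for illustration only
demo_X = pd.DataFrame({'a': range(10), 'b': range(10, 20)})
demo_y = pd.Series(range(10), name='target')

X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, train_size=0.8, random_state=42)

print(X_tr.shape, X_te.shape)             # (8, 2) (2, 2)
print(set(X_tr.index) & set(X_te.index))  # set(): no row is in both parts
print(X_tr.index.equals(y_tr.index))      # True: features and target stay aligned
```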

4. Grid search

In [44]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

Define grid search function

In [39]:
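def grid_search(params, estimator, x_train, y_train):
    """Run a 5-fold grid search and print the best parameter combination."""
    # GridSearchCV maximizes its score, so the MSE is negated: higher is better
    gridSearch = GridSearchCV(estimator=estimator,
                              param_grid=params,
                              scoring='neg_mean_squared_error',
                              cv=5,
                              verbose=0)

    # Fit the estimator for every parameter combination on every fold
    gridSearch.fit(x_train, y_train)

    # Combination with the best mean cross-validated score
    estimator_params = gridSearch.best_params_

    print(estimator_params)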
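
The helper above only prints the winning combination. A possible variant, sketched here on synthetic data from `make_regression` (an assumption, not the admissions set), returns the fitted search object so that the best score and the refit estimator can be reused. The name `grid_search_fit` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

def grid_search_fit(params, estimator, x_train, y_train):
    """Hypothetical variant of grid_search that returns the fitted search object."""
    search = GridSearchCV(estimator=estimator,
                          param_grid=params,
                          scoring='neg_mean_squared_error',
                          cv=5)
    search.fit(x_train, y_train)
    return search

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

search = grid_search_fit({"fit_intercept": [True, False]}, LinearRegression(), X_demo, y_demo)

print(search.best_params_)
# best_score_ is the negated mean MSE across folds; flip the sign and take the root
print((-search.best_score_) ** 0.5)
# best_estimator_ has already been refit on all of x_train with the best parameters
print(type(search.best_estimator_).__name__)
```

Returning the search object also gives access to `cv_results_`, which holds the scores of every combination and not just the best one.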

Find linear regression parameters

In [40]:
parameters = {
              "fit_intercept": [True, False],
              "positive" : [True, False]
             }
grid_search(parameters,LinearRegression(),X_train,y_train)

{'fit_intercept': True, 'positive': True}


Find support vector regression parameters

In [41]:
parameters = {
              'C':[1,10,100,1000],
              'gamma':[1,0.1,0.001,0.0001],
              'kernel':['linear','rbf']}
grid_search(parameters,SVR(),X_train,y_train)

{'C': 1, 'gamma': 1, 'kernel': 'linear'}
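
In the result above `gamma` is reported as 1, but with `kernel='linear'` that value should have no effect: scikit-learn documents `gamma` as the kernel coefficient for the 'rbf', 'poly' and 'sigmoid' kernels. All linear-kernel combinations that share the same `C` therefore score the same, and the reported `gamma` is best read as arbitrary. A small check on synthetic data (an assumption, not the admissions set):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=60, n_features=3, noise=1.0, random_state=0)

# Same C and kernel, very different gamma values
pred_a = SVR(C=1, gamma=1, kernel='linear').fit(X_demo, y_demo).predict(X_demo)
pred_b = SVR(C=1, gamma=0.0001, kernel='linear').fit(X_demo, y_demo).predict(X_demo)

# With a linear kernel the two models are expected to make the same predictions
print(np.allclose(pred_a, pred_b))
```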


Find decision tree parameters

In [43]:
parameters = {
              'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
              'splitter': ['best', 'random'],
              'max_depth': [None, 10, 20, 30, 40, 50],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'max_features': [None, 'sqrt', 'log2']}
grid_search(parameters,DecisionTreeRegressor(),X_train,y_train)


{'criterion': 'absolute_error', 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 10, 'splitter': 'best'}
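
The decision tree grid is by far the largest of the four. Its size can be counted with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` tries. The variable name `tree_grid` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
tree_grid = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']}

n_combinations = len(ParameterGrid(tree_grid))

print(n_combinations)      # 1296 = 4 * 2 * 6 * 3 * 3 * 3
print(n_combinations * 5)  # 6480 cross-validation fits with cv=5
```

Because `max_features` and `splitter='random'` introduce randomness and no `random_state` is set on the tree, the selected combination may differ between runs.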


Find neural network parameters

In [45]:
parameters = {
    'max_iter': [1000],
    'hidden_layer_sizes': [(50,50), (50,50,50), (100,)],
    'activation': ['relu'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}
grid_search(parameters,MLPRegressor(),X_train,y_train)


{'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'max_iter': 1000, 'solver': 'adam'}
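
The search above fits the network on the raw features. Multi-layer perceptrons are generally sensitive to feature scale, so one option is to put a scaler and the network in a single pipeline and search over that. The sketch below is an assumption rather than the method used in this notebook. It runs on synthetic data with a deliberately small network to keep it fast, and it may emit convergence warnings.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=80, n_features=4, noise=5.0, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(),
                     MLPRegressor(hidden_layer_sizes=(10,), max_iter=300, random_state=0))

# Parameters of a pipeline step are addressed as "<step name>__<parameter>";
# make_pipeline names each step after its lowercased class name
param_grid = {'mlpregressor__alpha': [0.0001, 0.05]}

search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
```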


Retrain with optimized parameters

In [48]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import Normalizer
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [49]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Score lists live at module level and are not reset between calls, so the
# means printed by cross_validate cover every fold evaluated so far in the session
r2Scores = []
rmseScores = []

def cross_validate(x,y,estimator):
    for i, (train_index, test_index) in enumerate(cv.split(x, y)):

        ### making training and validation sets
        # Split the data into training and testing sets for this fold
        xTrain, xTest = x.iloc[train_index], x.iloc[test_index]
        yTrain, yTest = y.iloc[train_index], y.iloc[test_index]

        ### feature scaling
        # Normalizer rescales each row (sample) to unit norm. The scaled arrays
        # are computed here, but the model below is fit on the unscaled frames.
        xScaler = Normalizer()
        xColNames = xTrain.columns.values.tolist()
        xTrainScaled = xScaler.fit_transform(xTrain[xColNames])
        xTestScaled = xScaler.transform(xTest[xColNames])

        ### model training
        # reuse the estimator passed in; fit() retrains it on each fold
        clf = estimator
        clf.fit(xTrain, yTrain)

        ### model prediction and evaluation
        # Make predictions on the held-out fold
        y_pred = clf.predict(xTest)

        # Calculate metrics and store them
        r2Score = r2_score(yTest, y_pred)
        r2Scores.append(r2Score)

        # RMSE as the square root of the MSE (works across scikit-learn versions)
        rmseScore = mean_squared_error(yTest, y_pred) ** 0.5
        rmseScores.append(rmseScore)

        print(f"Completed Fold {i}")

    ### Calculate the mean scores across all folds
    avgR2Score = sum(r2Scores) / len(r2Scores)
    print("Mean r squared score:", avgR2Score)

    avgRMSE = sum(rmseScores) / len(rmseScores)
    print("Mean rmse:", avgRMSE)
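
Two details of the function above are worth keeping in mind when reading the results that follow. The score lists are defined outside the function, so they keep growing from one call to the next, and the scaled arrays are never passed to the model. The sketch below shows one possible alternative, under stated assumptions: the lists are local, the estimator is cloned for every fold, and scaling uses `StandardScaler` inside a pipeline instead of `Normalizer`. It runs on synthetic NumPy arrays, so indexing is plain `x[idx]` rather than `.iloc`, and the name `cross_validate_scores` is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def cross_validate_scores(x, y, estimator, n_splits=5, seed=42):
    """Hypothetical helper: mean R squared and RMSE over K folds for one estimator."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2_scores, rmse_scores = [], []  # local, so every call starts from empty lists
    for train_idx, test_idx in folds.split(x):
        # Fresh copy of the estimator; the scaler is fit on the training fold only
        model = make_pipeline(StandardScaler(), clone(estimator))
        model.fit(x[train_idx], y[train_idx])
        y_pred = model.predict(x[test_idx])
        r2_scores.append(r2_score(y[test_idx], y_pred))
        rmse_scores.append(mean_squared_error(y[test_idx], y_pred) ** 0.5)
    return float(np.mean(r2_scores)), float(np.mean(rmse_scores))

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

first = cross_validate_scores(X_demo, y_demo, LinearRegression())
second = cross_validate_scores(X_demo, y_demo, SVR(kernel='linear', C=1))
again = cross_validate_scores(X_demo, y_demo, LinearRegression())

print(first, second)
# Calling the helper for another model in between does not change the first result
print(first == again)
```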

Linear regression

In [50]:
cross_validate(X_train,y_train,LinearRegression(fit_intercept=True,positive=True))

Completed Fold 0
Completed Fold 1
Completed Fold 2
Completed Fold 3
Completed Fold 4
Mean r squared score: 0.813033614518314
Mean rmse: 0.06001386578600224


Support vector regression

In [51]:
cross_validate(X_train,y_train,SVR(C=1,gamma=1,kernel='linear'))

Completed Fold 0
Completed Fold 1
Completed Fold 2
Completed Fold 3
Completed Fold 4
Mean r squared score: 0.798294312936136
Mean rmse: 0.06222150293258977


Decision tree

In [52]:
cross_validate(X_train,y_train,DecisionTreeRegressor(criterion='absolute_error',max_depth=None,max_features='log2',min_samples_leaf=4,min_samples_split=10,splitter='best'))

Completed Fold 0
Completed Fold 1
Completed Fold 2
Completed Fold 3
Completed Fold 4
Mean r squared score: 0.7699251364847342
Mean rmse: 0.06627304104093519


Neural network

In [53]:
cross_validate(X_train,y_train,MLPRegressor(activation='relu',alpha=0.05,hidden_layer_sizes=(50,50,50),learning_rate='adaptive',max_iter=1000,solver='adam'))

Completed Fold 0
Completed Fold 1
Completed Fold 2
Completed Fold 3
Completed Fold 4
Mean r squared score: 0.6510141275481585
Mean rmse: 0.07830902903552128


Retraining linear regression model

In [56]:
lm = LinearRegression(fit_intercept=True,positive=True)

lm.fit(X_train,y_train)

y_pred = lm.predict(X_test)

print(r2_score(y_test,y_pred))
# RMSE as the square root of the MSE (works across scikit-learn versions)
print(mean_squared_error(y_test,y_pred) ** 0.5)

0.8188432567829629
0.060865880415783113


Retraining support vector regression model

In [57]:
svr = SVR(C=1,gamma=1,kernel='linear')
svr.fit(X_train,y_train)

y_pred = svr.predict(X_test)

print(r2_score(y_test,y_pred))
# RMSE as the square root of the MSE (works across scikit-learn versions)
print(mean_squared_error(y_test,y_pred) ** 0.5)

0.7955393040861873
0.06466236333012791
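
The two retraining cells above repeat the same fit, predict and score steps. A compact way to express that pattern is sketched below on synthetic data (an assumption, not the admissions set). The helper name `evaluate_on_test` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def evaluate_on_test(model, x_train, y_train, x_test, y_test):
    """Hypothetical helper: fit on the training set, score on the testing set."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred) ** 0.5

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

models = [('linear regression', LinearRegression()),
          ('support vector regression', SVR(C=1, kernel='linear'))]

results = {}
for name, model in models:
    results[name] = evaluate_on_test(model, X_tr, y_tr, X_te, y_te)
    print(name, results[name])
```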
