In [None]:
Name: Esplanada, Borris A.
Section: CPE32S1
Instructor: Engr. Roman Richard
Date: 7/01/2024

1. Choose any dataset applicable to the classification problem, and also, choose any dataset applicable to the regression problem.
2. Explain your datasets and the problem being addressed.
3. For classification, do the following:
 - Create a base model
 - Evaluate the model with k-fold cross validation
 - Improve the accuracy of your model by applying additional hidden layers
4. For regression, do the following:
 - Create a base model
 - Improve the model by standardizing the dataset
 - Show tuning of layers and neurons (see evaluating small and larger networks)
5. Submit the link to your Google Colab (make sure that it is accessible to me)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Classification Problem
Title: Personal Loan Modeling

Link: https://www.kaggle.com/datasets/teertha/personal-loan-modeling

Explain your datasets and the problem being addressed.

The Personal Loan Modeling dataset contains information about customers of a bank, including their personal and financial details, as well as whether or not they accepted a personal loan offer in the past. The dataset consists of 5,000 rows and 14 columns.

The problem being addressed is whether or not a customer will accept a personal loan offer. This is a binary classification problem, where the goal is to predict whether a customer will accept the loan offer or not based on their personal and financial information.

This is important to banks and other financial institutions because they want to make targeted marketing efforts towards customers who are most likely to accept personal loan offers, so they can increase their chances of making a profit.

By using this dataset, banks can gain insights into what factors may influence a customer's decision to accept a personal loan offer and adjust their marketing strategy accordingly.

For classification, do the following:
- Create a base model
- Evaluate the model with k-fold cross validation
- Improve the accuracy of your model by applying additional hidden layers

In [None]:
import pandas as pd  # needed here because this cell runs before the main import cell

path = "/content/drive/MyDrive/3rdYear/CPE019/hoa7.1/Bank_Personal_Loan_Modelling.csv"
df = pd.read_csv(path)

In [None]:
%pip install "scikeras[tensorflow]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# importing modules

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier  # scikit-learn wrapper used in the k-fold cells

In [None]:
# load dataset


df = pd.read_csv("/content/drive/MyDrive/3rdYear/CPE019/hoa7.1/Bank_Personal_Loan_Modelling.csv")
dataset = df.values

In [None]:
df

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,29,3,40,92697,1,1.9,3,0,0,0,0,1,0
4996,4997,30,4,15,92037,4,0.4,1,85,0,0,0,1,0
4997,4998,63,39,24,93023,2,0.3,3,0,0,0,0,0,0
4998,4999,65,40,49,90034,3,0.5,2,0,0,0,0,1,0



In [None]:
# Separate the features and target variable
X = df.drop("Personal Loan", axis=1)
y = df["Personal Loan"]

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Scale the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
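
One caveat about the scaling cell above: the scaler is fitted on the whole training set, and the k-fold evaluation later in this notebook then splits that same scaled set, so every validation fold has already contributed to the scaling statistics. A minimal sketch of one way to avoid this, assuming a scikit-learn `Pipeline` and using `LogisticRegression` on synthetic data as a stand-in for the Keras model (none of these names come from the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=13, weights=[0.9, 0.1], random_state=42
)

# Inside cross_val_score the scaler is re-fitted on the training part of each fold,
# so the held-out fold never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print("Accuracy per fold:", fold_scores.round(3))
```

Because scikeras estimators follow the scikit-learn interface, a `KerasClassifier` should be usable as the final pipeline step in place of `LogisticRegression`.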

In [None]:
# Create a base model with one hidden layer
def create_base_model():
    model = Sequential()
    model.add(Dense(20, input_dim=13, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

In [None]:
# Evaluate the model with k-fold cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# scikeras takes the model-building function through `model`; `build_fn` is the deprecated name
estimator = KerasClassifier(model=create_base_model, epochs=10, batch_size=32, verbose=0)
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Base model accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))



Base model accuracy: 96.00% (0.42%)
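
Accuracy is easier to interpret next to a trivial baseline. The sketch below uses made-up labels with a 90/10 class split (an assumption for illustration, not the loan data's actual ratio) to show how a classifier that always predicts the majority class scores:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Running the same check on `y_train` would give the figure that the cross-validated accuracy above should be compared against.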


In [None]:
# Improve the accuracy of the model by adding additional hidden layers
def create_improved_model():
    model = Sequential()
    model.add(Dense(20, input_dim=13, activation="relu"))
    model.add(Dense(10, activation="relu"))
    model.add(Dense(5, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
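
To see how much capacity the extra hidden layers add, the trainable parameters of each network can be counted by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper `count_dense_params` below is a hypothetical name introduced only for this sketch:

```python
def count_dense_params(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the input width followed by each layer's unit count.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

base_params = count_dense_params([13, 20, 1])             # 13 inputs -> 20 -> 1
improved_params = count_dense_params([13, 20, 10, 5, 1])  # 13 inputs -> 20 -> 10 -> 5 -> 1
print(base_params, improved_params)  # prints: 301 551
```

These totals should agree with the trainable parameter counts reported by `model.summary()` for the two models.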

In [None]:
# Evaluate the improved model with k-fold cross-validation
# scikeras takes the model-building function through `model`; `build_fn` is the deprecated name
estimator = KerasClassifier(model=create_improved_model, epochs=10, batch_size=32, verbose=0)
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Improved model accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))



Improved model accuracy: 96.70% (0.23%)


In [None]:
# Fit the best model on the entire training set and evaluate on the test set
best_model = create_improved_model()
best_model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
score = best_model.evaluate(X_test, y_test, verbose=0)
print("Test set accuracy: %.2f%%" % (score[1]*100))

Test set accuracy: 98.30%
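
Accuracy alone can hide weak performance on the less frequent class, so it helps to look at the confusion matrix, precision, and recall as well. A minimal sketch on made-up labels and sigmoid outputs (hypothetical values, not produced by the model above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.6, 0.3, 0.1, 0.9, 0.4, 0.8, 0.7])

# Turn probabilities into class labels with a 0.5 threshold.
y_hat = (y_prob >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)
print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For the notebook's model, `y_prob` would come from `best_model.predict(X_test).ravel()` and `y_true` from `y_test`.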


# Regression Problem

Title: California Housing Dataset

Link: https://www.kaggle.com/datasets/camnugent/california-housing-prices?resource=download

Explain your datasets and the problem being addressed.

The California Housing dataset describes housing in districts across California. It has 20,640 rows and 10 attributes, including longitude, latitude, median income, and median house value. The objective is to build a model that predicts the median house value of a district from the other attributes. Because the target is a continuous number, this is a regression problem. The dataset can also be used to study how the individual attributes relate to housing prices in California.

In [None]:
# importing modules

import numpy as np
import pandas as pd

In [None]:
path = "/content/drive/MyDrive/3rdYear/CPE019/hoa7.1/housing.csv"
df = pd.read_csv(path)

In [None]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND
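
The `ocean_proximity` column shown above holds text, and the later cells also impute missing values, so both issues have to be handled before a model can be fitted. A minimal sketch of doing both in one step, assuming scikit-learn's `ColumnTransformer` and a small made-up frame (the values below are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one missing number and one text column (values are made up).
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "median_income": [8.3, 8.3, 7.3, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

preprocess = ColumnTransformer(
    [
        # Replace missing numbers with the column median.
        ("num", SimpleImputer(strategy="median"), ["total_bedrooms", "median_income"]),
        # Turn each category into its own 0/1 column.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
features = preprocess.fit_transform(toy)
print(features)
```

On real data the transformer would be fitted on the training split only and then applied to the test split with `preprocess.transform`.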


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/3rdYear/CPE019/hoa7.1/housing.csv')

# One-hot encoding of categorical variables
df_encoded = pd.get_dummies(df, columns=['ocean_proximity'])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_encoded.drop('median_house_value', axis=1), df['median_house_value'], test_size=0.2, random_state=42)

# Impute missing values: fit the imputer on the training data, then apply it to both splits
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Create a base linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Base Model - Mean Squared Error: {mse:.2f}')
print(f'Base Model - R-Squared: {r2:.2f}')


Base Model - Mean Squared Error: 4909161624.07
Base Model - R-Squared: 0.63
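
The mean squared error is expressed in squared units, which makes it hard to read. Taking the square root gives the root mean squared error (RMSE) in the same units as `median_house_value`. A small sketch using the value printed above:

```python
import math

mse_reported = 4909161624.07  # MSE printed by the base model
rmse = math.sqrt(mse_reported)
print(f"RMSE: {rmse:,.2f}")
```

So the base model's typical prediction error is roughly 70,000 in the units of the target column.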




In [None]:
# Standardize the training and testing sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

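# Refit the linear model on the standardized features.
# Ordinary least squares predictions do not change under feature scaling,
# so the metrics are expected to match the base model.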
model_scaled = LinearRegression()
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
mse_scaled = mean_squared_error(y_test, y_pred_scaled)
r2_scaled = r2_score(y_test, y_pred_scaled)
print(f'Standardized Model - Mean Squared Error: {mse_scaled:.2f}')
print(f'Standardized Model - R-Squared: {r2_scaled:.2f}')

Standardized Model - Mean Squared Error: 4909161624.07
Standardized Model - R-Squared: 0.63
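
The identical scores are expected rather than a sign of a mistake: ordinary least squares with an intercept produces the same predictions whether or not the features are standardized. A minimal check on synthetic data (an illustration only, not the housing data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, -0.5, 0.01]) + rng.normal(size=200)

# Fit once on the raw features and once on the standardized features.
raw_pred = LinearRegression().fit(X_demo, y_demo).predict(X_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_pred = LinearRegression().fit(X_scaled, y_demo).predict(X_scaled)

same_predictions = np.allclose(raw_pred, scaled_pred)
print("Identical predictions:", same_predictions)
```

Standardization matters for models trained by gradient descent, such as the neural networks in the following cells, where features on very different scales can slow or destabilize training.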




In [None]:
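# Define a neural network with `hidden_layers` hidden layers of `neurons` units each
def create_model(input_dim, output_dim, hidden_layers, neurons):
    model = Sequential()
    # First hidden layer also declares the input size
    model.add(Dense(neurons, input_dim=input_dim, activation='relu'))
    # Remaining hidden layers (the first one was added above)
    for _ in range(hidden_layers - 1):
        model.add(Dense(neurons, activation='relu'))
    # Linear output layer for regression
    model.add(Dense(output_dim))
    return model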

# Evaluate small and large networks with varying layers and neurons
results = []
for hidden_layers in [1, 2, 3]:
    for neurons in [10, 50, 100]:
        model = create_model(input_dim=X_train_scaled.shape[1], output_dim=1, hidden_layers=hidden_layers, neurons=neurons)
        model.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
        history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=128, verbose=0)
        # Score this network itself; reusing `mse`/`r2` from the linear model repeats 4909161624.07 / 0.63 in every row
        mse = model.evaluate(X_test_scaled, y_test, verbose=0)
        r2 = r2_score(y_test, model.predict(X_test_scaled, verbose=0).ravel())
        results.append({'hidden_layers': hidden_layers, 'neurons': neurons, 'mse': mse, 'r2': r2})
        print(f'hidden_layers: {hidden_layers}, neurons: {neurons}, mse: {mse:.2f}, r2: {r2:.2f}')

# Print results
df_results = pd.DataFrame(results)
print(df_results)




hidden_layers: 1, neurons: 10, mse: 4909161624.07, r2: 0.63
hidden_layers: 1, neurons: 50, mse: 4909161624.07, r2: 0.63
hidden_layers: 1, neurons: 100, mse: 4909161624.07, r2: 0.63
hidden_layers: 2, neurons: 10, mse: 4909161624.07, r2: 0.63
hidden_layers: 2, neurons: 50, mse: 4909161624.07, r2: 0.63
hidden_layers: 2, neurons: 100, mse: 4909161624.07, r2: 0.63
hidden_layers: 3, neurons: 10, mse: 4909161624.07, r2: 0.63
hidden_layers: 3, neurons: 50, mse: 4909161624.07, r2: 0.63
hidden_layers: 3, neurons: 100, mse: 4909161624.07, r2: 0.63
   hidden_layers  neurons           mse        r2
0              1       10  4.909162e+09  0.625372
1              1       50  4.909162e+09  0.625372
2              1      100  4.909162e+09  0.625372
3              2       10  4.909162e+09  0.625372
4              2       50  4.909162e+09  0.625372
5              2      100  4.909162e+09  0.625372
6              3       10  4.909162e+09  0.625372
7              3       50  4.909162e+09  0.625372
8      

In [None]:
results = []
for hidden_layers in [1, 2, 3]:
    for neurons in [10, 50, 100]:
        model_large = create_model(input_dim=X_train_scaled.shape[1], output_dim=1, hidden_layers=hidden_layers, neurons=neurons)
        model_large.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
        history_large = model_large.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=128, verbose=0)
        # Score this network itself; reusing `mse`/`r2` from the linear model repeats 4909161624.07 / 0.63 in every row
        mse_large = model_large.evaluate(X_test_scaled, y_test, verbose=0)
        r2_large = r2_score(y_test, model_large.predict(X_test_scaled, verbose=0).ravel())
        results.append({'hidden_layers': hidden_layers, 'neurons': neurons, 'mse': mse_large, 'r2': r2_large})
        print(f'hidden_layers: {hidden_layers}, neurons: {neurons}, mse: {mse_large:.2f}, r2: {r2_large:.2f}')

# Print results
df_results = pd.DataFrame(results)
print(df_results)



hidden_layers: 1, neurons: 10, mse: 4909161624.07, r2: 0.63
hidden_layers: 1, neurons: 50, mse: 4909161624.07, r2: 0.63
hidden_layers: 1, neurons: 100, mse: 4909161624.07, r2: 0.63
hidden_layers: 2, neurons: 10, mse: 4909161624.07, r2: 0.63
hidden_layers: 2, neurons: 50, mse: 4909161624.07, r2: 0.63
hidden_layers: 2, neurons: 100, mse: 4909161624.07, r2: 0.63
hidden_layers: 3, neurons: 10, mse: 4909161624.07, r2: 0.63
hidden_layers: 3, neurons: 50, mse: 4909161624.07, r2: 0.63
hidden_layers: 3, neurons: 100, mse: 4909161624.07, r2: 0.63
   hidden_layers  neurons           mse        r2
0              1       10  4.909162e+09  0.625372
1              1       50  4.909162e+09  0.625372
2              1      100  4.909162e+09  0.625372
3              2       10  4.909162e+09  0.625372
4              2       50  4.909162e+09  0.625372
5              2      100  4.909162e+09  0.625372
6              3       10  4.909162e+09  0.625372
7              3       50  4.909162e+09  0.625372
8      
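
A tuning grid is only informative when each configuration is scored with its own predictions. The sketch below shows that pattern end to end. It assumes scikit-learn's `MLPRegressor` on synthetic data as a lightweight stand-in for the Keras networks, so the numbers it prints are unrelated to the housing results:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

rows = []
for hidden_layers in [1, 2]:
    for neurons in [10, 50]:
        net = MLPRegressor(
            hidden_layer_sizes=(neurons,) * hidden_layers,
            max_iter=500,
            random_state=0,
        )
        net.fit(X_tr, y_tr)
        pred = net.predict(X_te)  # fresh predictions for this configuration
        rows.append({
            "hidden_layers": hidden_layers,
            "neurons": neurons,
            "mse": mean_squared_error(y_te, pred),
            "r2": r2_score(y_te, pred),
        })

df_grid = pd.DataFrame(rows)
print(df_grid)
```

With metrics computed per configuration, the rows of the results table differ from one another and can be compared to pick a network size.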

# Conclusion

In conclusion, the student successfully addressed errors such as "ValueError: Input X contains NaN" and "ValueError: could not convert string to float: 'NEAR OCEAN'." They created a Python program for both classification and regression problems. Additionally, the student demonstrated the ability to determine whether a dataset is suitable for a classification or regression problem.

Submit the link to your Google Colab (make sure that it is accessible to me)

Link: https://colab.research.google.com/drive/14lXmBBpnBIA3FR-j_tjT6Or6tizKNhjb?usp=sharing