<h2 style="text-align:center">2242-ASDS-6302-002-MACHINE LEARNING</h2>

<h2 style="text-align:center">Name: Vineesha Mallu</h2>

<h2 style="text-align:center">UTA ID: 1002419747</h2>

<h4 style="text-align:center">Statement of Problem</h4>
<p>The US Census Bureau has provided a dataset containing various metrics, such as population, median income, and median housing prices, for each block group in California. The objective is to build a housing price prediction model using the dataset, which can predict the median house values in any district based on other metrics.</p>
<h4 style="text-align:center">Objective</h4>
<p>Develop a model to predict median house values in California using the given dataset. The model should be capable of predicting
housing prices in any district based on various metrics.</p>

<h4 style="text-align:center">QUESTIONS</h4>

#Importing Required Packages

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

#1. Load and Explore Data:
- Read the "housing.xlsx" file.
- Display the first few rows of the dataset.
- Extract input (X) and output (Y) data from the dataset.

In [65]:
# Reading housing.xlsx file using pandas read_excel method
housing_data = pd.read_excel('housing.xlsx', sheet_name='housing')
print("Displaying first 5 rows:\n")
housing_data.head()

Displaying first 5 rows:



Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


In [66]:
# Extracting input features from the dataset
X = housing_data.drop(columns=['median_house_value'])
# Extracting output feature from the dataset
Y = housing_data['median_house_value']

# Displaying the shapes of X and Y
print("\nShape of input data (X):", X.shape)
print("Shape of output data (Y):", Y.shape)


Shape of input data (X): (20640, 9)
Shape of output data (Y): (20640,)


#2. Handle Missing Values:
- Fill missing values. Imputation method should make sense.

In [67]:
# Checking for missing values in each column using the isnull method
print("Checking Missing values:")
print(housing_data.isnull().sum())

Checking Missing values:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64


In [68]:
# Checking the percentage of missing values in total_bedrooms feature
missing_percentage = (housing_data['total_bedrooms'].isnull().sum() / len(housing_data)) * 100
print("Percentage of missing values in 'total_bedrooms': {:.2f}%".format(missing_percentage))

Percentage of missing values in 'total_bedrooms': 1.00%
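The same percentage can be read off for every column at once, because the mean of a boolean mask is the fraction of True entries. The following is a minimal sketch on a hypothetical four-row frame (the values are illustrative, not taken from housing.xlsx):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one of four 'total_bedrooms' entries is missing
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing rows
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

With one missing entry out of four, the toy 'total_bedrooms' column reports 25%, while the complete column reports 0%.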


Missing values are handled with KNN imputation, which fills each missing 'total_bedrooms' value from the rows that are most similar on the other features. Unlike a single mean or median fill, the imputed value adapts to the size of each block group, so it can be more accurate.
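To make the idea concrete, here is a small sketch of KNN imputation on made-up rows (the numbers are hypothetical and only three columns are used). The row with the missing value is closest to the first two rows on the observed columns, so with n_neighbors=2 its missing entry is filled with the average of their values:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical rows: [total_bedrooms, total_rooms, households]
rows = np.array([
    [100.0,  500.0, 100.0],
    [110.0,  520.0, 105.0],
    [900.0, 5000.0, 800.0],
    [np.nan, 510.0, 102.0],   # bedrooms missing; resembles the first two rows
])

# Distances are computed on the columns observed in both rows,
# then the missing entry is the mean over the 2 nearest donors
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(rows)

print(filled[3, 0])  # mean of 100.0 and 110.0
```

Only the missing cell changes; rows that were already complete pass through unchanged.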

In [69]:
# Computing pairwise correlations between the numeric features using the corr method
numeric_features = housing_data.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_features.corr()
correlation_matrix

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
longitude,1.0,-0.924664,-0.108197,0.044568,0.069608,0.099773,0.05531,-0.015176,-0.045967
latitude,-0.924664,1.0,0.011173,-0.0361,-0.066983,-0.108785,-0.071035,-0.079809,-0.14416
housing_median_age,-0.108197,0.011173,1.0,-0.361262,-0.320451,-0.296244,-0.302916,-0.119034,0.105623
total_rooms,0.044568,-0.0361,-0.361262,1.0,0.93038,0.857126,0.918484,0.19805,0.134153
total_bedrooms,0.069608,-0.066983,-0.320451,0.93038,1.0,0.877747,0.979728,-0.007723,0.049686
population,0.099773,-0.108785,-0.296244,0.857126,0.877747,1.0,0.907222,0.004834,-0.02465
households,0.05531,-0.071035,-0.302916,0.918484,0.979728,0.907222,1.0,0.013033,0.065843
median_income,-0.015176,-0.079809,-0.119034,0.19805,-0.007723,0.004834,0.013033,1.0,0.688075
median_house_value,-0.045967,-0.14416,0.105623,0.134153,0.049686,-0.02465,0.065843,0.688075,1.0


From the correlation table above, total_rooms, population and households are strongly correlated with total_bedrooms (0.93, 0.88 and 0.98 respectively), so these features are used in KNN imputation to find the nearest neighbors.

In [70]:
# Features considered for imputation
columns_for_imputation = ['total_bedrooms','total_rooms','population','households']
# Initializing the KNN imputer with 3 nearest neighbors and fitting it on the selected features
imputer = KNNImputer(n_neighbors=3)
imputed_data = imputer.fit_transform(housing_data[columns_for_imputation])

# Update only the 'total_bedrooms' column in the original DataFrame
housing_data['total_bedrooms'] = imputed_data[:, columns_for_imputation.index('total_bedrooms')]
housing_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


In [71]:
# Checking whether any missing values remain after KNN imputation
housing_data.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
median_house_value    0
dtype: int64

#3. Encode Categorical Data:
- Convert categorical columns in the dataset to numerical data.

In [72]:
#Converting categorical columns in the dataset to numerical data
housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'], drop_first=True)

# Display the first few rows of the encoded dataset
housing_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41,880,129.0,322,126,8.3252,452600,False,False,True,False
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,358500,False,False,True,False
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,352100,False,False,True,False
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,341300,False,False,True,False
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,342200,False,False,True,False
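The encoded table has one column fewer than there are categories because drop_first=True removes the first category (in sorted order) and treats it as the baseline. A minimal sketch on a hypothetical three-category column shows the effect:

```python
import pandas as pd

# Hypothetical column with three categories
toy = pd.DataFrame({"proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND"]})

# drop_first=True drops the first category in sorted order ("INLAND" here),
# so a row with all indicator columns False belongs to that baseline category
encoded = pd.get_dummies(toy, columns=["proximity"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models fitted with an intercept.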


#4. Split the Dataset:
- Split the data into 80% training dataset and 20% test dataset.

In [73]:
# Extracting input features from the dataset
X = housing_data.drop(columns=['median_house_value'])
# Extracting output feature from the dataset
y = housing_data['median_house_value']

# Splitting the data into 80% training dataset and 20% test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

print("The dimension of X_train is {}".format(X_train.shape))
print("The dimension of X_test is {}".format(X_test.shape))

The dimension of X_train is (16512, 12)
The dimension of X_test is (4128, 12)


#5. Standardize Data:
- Standardize the training and test datasets.

In [74]:
# Standardizing with StandardScaler: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
pd.DataFrame(X_train).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1.523462,-0.850719,-0.602847,3.523431,4.248185,1.933739,3.255219,-0.709625,1.466063,-0.017404,-0.35283,-0.382555
1,-1.676442,1.337217,1.377381,-0.162457,-0.109365,-0.544768,-0.351275,-0.651252,-0.682099,-0.017404,-0.35283,-0.382555
2,-1.45215,0.930486,0.189244,0.222041,0.198986,0.040533,0.313079,0.682972,-0.682099,-0.017404,-0.35283,2.614004
3,0.730962,-0.738517,-0.998893,-0.099379,0.065128,0.285215,0.038901,-0.550104,-0.682099,-0.017404,-0.35283,-0.382555
4,-0.540028,-0.238284,0.981336,0.016574,0.509727,-0.427708,0.020447,-1.003551,-0.682099,-0.017404,-0.35283,-0.382555
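The scaler is fitted on the training set and then reused on the test set, so the test data is transformed with the training mean and standard deviation. The sketch below, on synthetic data (all names and values are illustrative), checks that this is what fit_transform followed by transform does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic training features
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))    # synthetic test features

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent: subtract the training mean, divide by the training std (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(np.allclose(test_scaled, manual))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the scaled test columns are only approximately so, which is expected.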


#6. Linear Regression:
- Perform Linear Regression on the training data.
- Predict the output for the test dataset using the fitted model.
- Print the root mean squared error (RMSE) from Linear Regression.

In [75]:
# Initializing and performing Linear Regression on the training data
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predicting the output for the train and test dataset
y_train_pred = lr_model.predict(X_train)
y_pred = lr_model.predict(X_test)

# Calculating train and test R2 score
train_r2 = r2_score(y_train, y_train_pred)
print("R2 score for train data:", train_r2)

test_r2 = r2_score(y_test, y_pred)
print("R2 score for test data:", test_r2)

# Calculating train and test RMSE
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE(train) for Linear Regression: {rmse:.4f}")

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE(test) for Linear Regression: {rmse:.4f}")

R2 score for train data: 0.6464347579023244
R2 score for test data: 0.6462557578816923
RMSE(train) for Linear Regression: 68399.1767
RMSE(test) for Linear Regression: 69476.7456


The R2 scores for the train and test data are approximately 0.6464 and 0.6463, which means that about 64.64% and 64.63% of the variance in the target variable is explained by the features in the training and testing datasets, respectively.
The RMSE for the train and test data is approximately 68399.18 and 69476.75. Because RMSE is expressed in the units of the target (median_house_value), a typical prediction misses the actual value by roughly 68,400 on the training set and 69,500 on the test set. The train and test scores are close, so the model shows little sign of overfitting.
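For reference, both metrics can be computed by hand. The sketch below uses four made-up target values (not from the housing data) and compares the manual formulas with the scikit-learn functions used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values; every prediction is off by 10
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

Here every residual has magnitude 10, so the RMSE is 10, and R2 is 1 - 400/50000 = 0.992.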

#7. Lasso Regression:
- Implement Lasso Regression on the training data.
- Predict the output for the test dataset using the fitted Lasso model.
- Evaluate and print the RMSE for Lasso Regression.

In [77]:
# Initializing and fitting the Lasso Regression model
lasso_model = Lasso(alpha=2.0)  
lasso_model.fit(X_train, y_train)

# Predicting the output for the train and test dataset
y_train_pred = lasso_model.predict(X_train)
y_pred = lasso_model.predict(X_test)

# Calculating train and test R2 score
train_r2 = r2_score(y_train, y_train_pred)
print("R2 score for train data:", train_r2)

test_r2 = r2_score(y_test, y_pred)
print("R2 score for test data:", test_r2)

# Calculating train and test RMSE
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE(train) for Lasso Regression: {rmse:.4f}")

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE(test) for Lasso Regression: {rmse:.4f}")

R2 score for train data: 0.6464347237605994
R2 score for test data: 0.6462559684180953
RMSE(train) for Lasso Regression: 68399.1800
RMSE(test) for Lasso Regression: 69476.7249


The R2 scores for the train and test data are approximately 0.6464 and 0.6463, which means that about 64.64% and 64.63% of the variance in the target variable is explained by the features in the training and testing datasets, respectively.
The RMSE for the train and test data is approximately 68399.18 and 69476.72 in the units of median_house_value, so a typical prediction misses the actual value by roughly 68,400 on the training set and 69,500 on the test set. These scores are nearly identical to those of Linear Regression, which indicates that alpha=2.0 applies very little shrinkage on this data.

#8. Ridge Regression:
- Implement Ridge Regression on the training data.
- Predict the output for the test dataset using the fitted Ridge model.
- Evaluate and print the RMSE for Ridge Regression.

In [78]:
# Initializing and fitting the Ridge Regression model
ridge_model = Ridge(alpha=2.0)  
ridge_model.fit(X_train, y_train)

# Predicting the output for the train and test dataset
y_train_pred = ridge_model.predict(X_train)
y_pred = ridge_model.predict(X_test)

# Calculating train and test R2 score
train_r2 = r2_score(y_train, y_train_pred)
print("R2 score for train data:", train_r2)

test_r2 = r2_score(y_test, y_pred)
print("R2 score for test data:", test_r2)

# Calculating train and test RMSE
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE(train) for Ridge Regression: {rmse:.4f}")

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE(test) for Ridge Regression: {rmse:.4f}")

R2 score for train data: 0.6464344278291291
R2 score for test data: 0.6462692658271749
RMSE(train) for Ridge Regression: 68399.2086
RMSE(test) for Ridge Regression: 69475.4190


The R2 scores for the train and test data are approximately 0.6464 and 0.6463, which means that about 64.64% and 64.63% of the variance in the target variable is explained by the features in the training and testing datasets, respectively.
The RMSE for the train and test data is approximately 68399.21 and 69475.42 in the units of median_house_value, so a typical prediction misses the actual value by roughly 68,400 on the training set and 69,500 on the test set. These scores are nearly identical to those of Linear Regression, which indicates that alpha=2.0 applies very little shrinkage on this data.
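Although Lasso and Ridge score almost the same here, they shrink coefficients differently: the L1 penalty can set a coefficient exactly to zero, while the L2 penalty only makes it smaller. The sketch below illustrates this on synthetic data where the second feature is irrelevant by construction (the data, names and alpha values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
# Only feature 0 drives the target; feature 1 is pure noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=500)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("Lasso coefficients:", lasso_demo.coef_)  # irrelevant feature is zeroed out
print("Ridge coefficients:", ridge_demo.coef_)  # irrelevant feature stays small but nonzero
```

Inspecting coef_ on the fitted housing models in the same way would show whether any of the 12 features is dropped by Lasso at a given alpha.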

#9. Elastic Net Regression:
- Implement Elastic Net Regression on the training data.
- Predict the output for the test dataset using the fitted Elastic Net model.
- Evaluate and print the RMSE for Elastic Net Regression.

In [79]:
# Initializing and fitting the ElasticNet Regression model
elastic_net_model = ElasticNet(alpha=1.0, l1_ratio=0.5) 
elastic_net_model.fit(X_train, y_train)

# Predicting the output for the train and test dataset
y_train_pred = elastic_net_model.predict(X_train)
y_pred = elastic_net_model.predict(X_test)

# Calculating train and test R2 score
train_r2 = r2_score(y_train, y_train_pred)
print("R2 score for train data:", train_r2)

test_r2 = r2_score(y_test, y_pred)
print("R2 score for test data:", test_r2)

# Calculating train and test RMSE
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE(train) for ElasticNet Regression: {rmse:.4f}")

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE(test) for ElasticNet Regression: {rmse:.4f}")

R2 score for train data: 0.565179977635
R2 score for test data: 0.5651703933466987
RMSE(train) for ElasticNet Regression: 75852.6676
RMSE(test) for ElasticNet Regression: 77029.0125


The R2 scores for the train and test data are similar, approximately 0.5652, which means that about 56.52% of the variance in the target variable is explained by the features in both datasets.
The RMSE for the train and test data is approximately 75852.67 and 77029.01 in the units of median_house_value, about 7,500 higher than for the Linear, Lasso and Ridge models. Both train and test scores drop together, which points to underfitting: with alpha=1.0 and l1_ratio=0.5, Elastic Net shrinks the coefficients far more strongly than Lasso or Ridge did at alpha=2.0.
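The gap between Elastic Net and Ridge comes largely from how scikit-learn scales the two objectives: Ridge adds alpha times the squared L2 norm to the plain sum of squared errors, whereas ElasticNet adds its penalties to the squared error divided by 2n, so the same numeric alpha weighs much more heavily. The following sketch on synthetic standardized data (an illustration, not the housing data) compares the two settings used in this notebook:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = StandardScaler().fit_transform(rng.normal(size=(500, 2)))
# True coefficients are 3 and -2, plus unit-variance noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=500)

ridge_demo = Ridge(alpha=2.0).fit(X_demo, y_demo)
enet_demo = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)

print("Ridge coefficients:     ", ridge_demo.coef_)  # close to the true values
print("ElasticNet coefficients:", enet_demo.coef_)   # pulled much further toward zero
```

A smaller alpha for ElasticNet (for example, one chosen by grid search) would be the natural way to make the comparison with Ridge and Lasso fairer.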

#10. Cross-Validation and Grid Search:
- Apply cross-validation on the dataset to assess the models' generalization performance.
- Perform grid search to fine-tune hyperparameters for Ridge and Lasso Regression models.
- Discuss the results of cross-validation and grid search, providing insights into the optimal hyperparameters for the models.

In [84]:
model = LinearRegression()

# Initialize KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
mse_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

# Calculate RMSE scores
rmse_scores = np.sqrt(-mse_scores)

print("Cross-Validation RMSE scores for Linear:", rmse_scores)
print("Mean RMSE across all folds:", rmse_scores.mean())

Cross-Validation RMSE scores for Linear: [69299.07315794 69026.88769617 67708.24589533 65813.19806569
 71608.29843775]
Mean RMSE across all folds: 68691.14065057761


The RMSE scores across the 5 folds range from approximately 65813.20 to 71608.30, with a mean RMSE of approximately 68691.14.
The spread of about 5,800 between the best and worst fold shows how much the model's performance depends on the subset of data it is evaluated on; a smaller spread suggests a more stable model.
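What cross_val_score does with a KFold splitter can be written out as an explicit loop: fit on four folds, score on the fifth, and repeat. The sketch below uses synthetic data (names and values are illustrative) and checks that the manual loop reproduces the per-fold RMSE values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: train on the other folds, evaluate on the held-out fold
manual_rmse = []
for train_idx, val_idx in kf.split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))

# Same splits through cross_val_score; scores are negative MSE, so negate before the root
auto_rmse = np.sqrt(-cross_val_score(LinearRegression(), X_demo, y_demo,
                                     cv=kf, scoring="neg_mean_squared_error"))

print(np.array(manual_rmse))
print(auto_rmse)
```

Because the splitter has a fixed random_state, both approaches see identical folds and give matching scores.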

In [80]:
# Initializing KFold for cross-validation
cv_folds = KFold(n_splits=5, shuffle=True, random_state=10)

# Range of alpha values shared by the Ridge and Lasso grid searches
params = {'alpha':[0.001,0.01,0.1,0.2,0.5,0.9,1.0,5.0,10.0,50.0,100.0]}

In [81]:
# Performing grid search with cross-validation for Ridge
ridge_grid_search = GridSearchCV(Ridge(), params, cv=cv_folds, scoring='neg_mean_squared_error')
ridge_grid_search.fit(X_train, y_train)

# Getting best parameters for Ridge
best_ridge_params = ridge_grid_search.best_params_
print("Best parameters for Ridge:", best_ridge_params)

# Getting best Ridge model
best_ridge_model = ridge_grid_search.best_estimator_

# Assess generalization performance using cross-validation
ridge_cv_scores = cross_val_score(best_ridge_model, X_train, y_train, cv=cv_folds, 
                                  scoring='neg_mean_squared_error')

# Converting negative MSE scores to RMSE
ridge_cv_rmse_scores = np.sqrt(-ridge_cv_scores)
print("Cross-Validation RMSE scores for Ridge:", ridge_cv_rmse_scores)
print("Mean RMSE for Ridge:", ridge_cv_rmse_scores.mean())

# Refitting the best Ridge model on the training data (GridSearchCV already does this when refit=True, the default)
best_ridge_model.fit(X_train, y_train)

# Predicting the output for the train and test dataset
y_train_pred = best_ridge_model.predict(X_train)
y_pred = best_ridge_model.predict(X_test)

# Calculating R2 score for train and test data
train_r2 = r2_score(y_train, y_train_pred)
print("R2 score for train data:", train_r2)

test_r2 = r2_score(y_test, y_pred)
print("R2 score for test data:", test_r2)

# Calculating train and test RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE(train) for Ridge Regression: {rmse_train:.4f}")

rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE(test) for Ridge Regression: {rmse_test:.4f}")

Best parameters for Ridge: {'alpha': 10.0}
Cross-Validation RMSE scores for Ridge: [68698.39511704 68908.71216957 69040.31369858 67480.0438679
 68993.0414666 ]
Mean RMSE for Ridge: 68624.10126393789
R2 score for train data: 0.6464268182104302
R2 score for test data: 0.6463139189160093
RMSE(train) for Ridge Regression: 68399.9447
RMSE(test) for Ridge Regression: 69471.0338


The best value for the regularization parameter (alpha) found by grid search is 10.0.
The R2 scores for both train and test datasets are almost unchanged from the untuned Ridge model (alpha=2.0), at approximately 0.6464 on train and 0.6463 on test in both cases, so the model's ability to explain the variance in the target variable stays the same.
The test RMSE improves only slightly, from 69475.42 to 69471.03, which suggests that tuning the regularization parameter did not significantly impact the model's predictive performance.
The cross-validation RMSE scores range from approximately 67480.04 to 69040.31 across the 5 folds, with a mean RMSE of approximately 68624.10.
The spread of about 1,560 between the best and worst fold shows that performance is fairly consistent across different subsets of the training data; a smaller spread suggests a more stable model.
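One possible refinement, shown here as a sketch on synthetic data rather than as the method used above, is to place the scaler and the model in a Pipeline so that standardization is re-fitted inside every cross-validation fold. The step names "scaler" and "ridge" and the shortened alpha grid are choices made for this illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=200)

# The scaler is fitted on the training part of each fold only
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
alpha_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, alpha_grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=10),
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
# best_score_ is the mean negative MSE over folds; this is the root of the mean MSE,
# which is not identical to the mean of per-fold RMSE values
cv_rmse = np.sqrt(-search.best_score_)
print(best_alpha, cv_rmse)

# best_estimator_ is already refitted on all of the data passed to fit
demo_pred = search.best_estimator_.predict(X_demo)
```

The same pattern works for Lasso by swapping the final pipeline step and the parameter prefix.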

In [82]:
# Performing grid search with cross-validation for Lasso
lasso_grid_search = GridSearchCV(Lasso(), params, cv=cv_folds, scoring='neg_mean_squared_error')
lasso_grid_search.fit(X_train, y_train)

# Getting best parameters for Lasso
best_lasso_params = lasso_grid_search.best_params_
print("Best parameters for Lasso:", best_lasso_params)

# Getting best Lasso model
best_lasso_model = lasso_grid_search.best_estimator_

# Assess generalization performance using cross-validation
lasso_cv_scores = cross_val_score(best_lasso_model, X_train, y_train, cv=cv_folds, scoring='neg_mean_squared_error')

# Converting negative MSE scores to RMSE
lasso_cv_rmse_scores = np.sqrt(-lasso_cv_scores)
print("Cross-Validation RMSE scores for Lasso:", lasso_cv_rmse_scores)
print("Mean RMSE for Lasso:", lasso_cv_rmse_scores.mean())

# Refitting the best Lasso model on the training data (GridSearchCV already does this when refit=True, the default)
best_lasso_model.fit(X_train, y_train)

# Predicting the output for the train and test dataset
y_train_pred = best_lasso_model.predict(X_train)
y_pred = best_lasso_model.predict(X_test)

# Calculating R2 score for train and test data
train_r2 = r2_score(y_train, y_train_pred)
print("R2 score for train data:", train_r2)

test_r2 = r2_score(y_test, y_pred)
print("R2 score for test data:", test_r2)

# Calculating train and test RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print(f"RMSE(train) for Lasso Regression: {rmse_train:.4f}")

rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE(test) for Lasso Regression: {rmse_test:.4f}")

Best parameters for Lasso: {'alpha': 50.0}
Cross-Validation RMSE scores for Lasso: [68684.433      68920.63242333 69043.79007543 67480.83211478
 68997.76363047]
Mean RMSE for Lasso: 68625.49024880368
R2 score for train data: 0.6464134246424034
R2 score for test data: 0.6462406172852683
RMSE(train) for Lasso Regression: 68401.2402
RMSE(test) for Lasso Regression: 69478.2324


The best value for the regularization parameter (alpha) found by grid search is 50.0.
The R2 scores for both train and test datasets are almost unchanged from the untuned Lasso model (alpha=2.0): approximately 0.6464 on train in both cases, and 0.6462 versus 0.6463 on test, so the model's ability to explain the variance in the target variable stays the same.
The test RMSE rises marginally, from 69476.72 to 69478.23, which suggests that tuning the regularization parameter did not significantly impact the model's predictive performance.
The cross-validation RMSE scores range from approximately 67480.83 to 69043.79 across the 5 folds, with a mean RMSE of approximately 68625.49.
The spread of about 1,560 between the best and worst fold shows that performance is fairly consistent across different subsets of the training data; a smaller spread suggests a more stable model.
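To compare all of the fitted models side by side, the RMSE values printed in the sections above can be collected into one table. The numbers below are copied from those outputs; the column names and labels are chosen for this sketch:

```python
import pandas as pd

# Train and test RMSE values as reported in the outputs above
results = pd.DataFrame(
    [
        ("Linear Regression", 68399.1767, 69476.7456),
        ("Lasso (alpha=2.0)", 68399.1800, 69476.7249),
        ("Ridge (alpha=2.0)", 68399.2086, 69475.4190),
        ("ElasticNet (alpha=1.0, l1_ratio=0.5)", 75852.6676, 77029.0125),
        ("Ridge (alpha=10.0, grid search)", 68399.9447, 69471.0338),
        ("Lasso (alpha=50.0, grid search)", 68401.2402, 69478.2324),
    ],
    columns=["model", "rmse_train", "rmse_test"],
)

# Rank by test RMSE, lowest (best) first
ranked = results.sort_values("rmse_test").reset_index(drop=True)
print(ranked)
```

Ridge with alpha=10.0 has the lowest test RMSE, but all models except Elastic Net are within about 7 units of each other, so the differences among them are negligible.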