
**HANDS-ON SESSION-II: DATA MINING TOOLS**

**Regression MODELS**

Import the libraries

In [1]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Preprocessing and pipeline utilities
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Data splitting and cross-validation
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Regression models
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Regression metrics
from sklearn.metrics import make_scorer, mean_squared_error, r2_score

Upload the file

In [2]:
from google.colab import files
uploaded = files.upload()

Saving NY-House-Dataset.csv to NY-House-Dataset.csv


Import the data using pandas

In [3]:
# Load the dataset
ny_house_data = pd.read_csv('/content/NY-House-Dataset.csv')

# Drop every row that has a missing value in any column
ny_house_data.dropna(inplace=True)

Select the Features and the Target Variables

In [5]:
features = ['TYPE', 'BEDS', 'BATH', 'PROPERTYSQFT', 'LATITUDE', 'LONGITUDE']
target = 'PRICE'

Separate features and target

In [7]:
X = ny_house_data[features]
y = ny_house_data[target]

Splitting into training and testing sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
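
To make the split above concrete, here is a minimal sketch on a hypothetical ten-row frame (the `toy_X` and `toy_y` names and values are assumptions for illustration, not part of the housing data). It shows that `test_size=0.2` holds out 20% of the rows and that a fixed `random_state` reproduces the same split on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the housing features and prices
toy_X = pd.DataFrame({"BEDS": range(10), "BATH": range(10, 20)})
toy_y = pd.Series(range(100, 110), name="PRICE")

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Repeating the call with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te2.index))
```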

Define the preprocessor for numerical and categorical data and build the Linear Regression pipeline

In [18]:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['TYPE']),  # encode 'TYPE' column
        ('num', StandardScaler(), ['BEDS', 'BATH', 'PROPERTYSQFT', 'LATITUDE', 'LONGITUDE'])
    ]
)

# Create a pipeline that includes preprocessing and the model
linear_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])
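
As a rough illustration of what the preprocessor produces, the sketch below applies the same two transformers to a small hypothetical frame (the rows, the `TYPE` values, and the `toy_preprocessor` name are made up for the example). It also shows the effect of `handle_unknown='ignore'`: a category that was not seen during fitting is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; the TYPE values are made up for illustration
toy = pd.DataFrame({
    "TYPE": ["Condo", "House", "Condo", "Co-op"],
    "BEDS": [1, 3, 2, 2],
    "PROPERTYSQFT": [600.0, 1800.0, 900.0, 750.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["TYPE"]),
        ("num", StandardScaler(), ["BEDS", "PROPERTYSQFT"]),
    ]
)

encoded = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; densify to inspect
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded
print(encoded.shape)  # 3 one-hot columns followed by 2 scaled columns

# A category unseen during fit is encoded as all zeros, not an error
unseen = pd.DataFrame({"TYPE": ["Townhouse"], "BEDS": [2], "PROPERTYSQFT": [900.0]})
unseen_encoded = toy_preprocessor.transform(unseen)
unseen_encoded = unseen_encoded.toarray() if hasattr(unseen_encoded, "toarray") else unseen_encoded
print(unseen_encoded[0, :3])
```

Because the preprocessor sits inside each `Pipeline`, its scaling statistics and category list are learned from the training rows only and then reused unchanged on the test rows.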


**Linear Regression Model**

In [19]:
# Train the Linear Regression pipeline (preprocessing + model) on the training set
linear_pipeline.fit(X_train, y_train)

# Predict on the testing set
y_pred_linear = linear_pipeline.predict(X_test)

# Evaluate the Linear Regression model
print("Linear Regression Mean Squared Error:", mean_squared_error(y_test, y_pred_linear))
print("Linear Regression R^2 Score:", r2_score(y_test, y_pred_linear))


Linear Regression Mean Squared Error: 23571052435484.266
Linear Regression R^2 Score: 0.06444900142880705
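
An MSE in the order of 1e13 is hard to read because it is expressed in squared price units. A minimal sketch of converting it to RMSE, which is in the same units as `PRICE`, using the value printed above (the variable names are assumptions for illustration):

```python
import numpy as np

# MSE value copied from the Linear Regression output above
mse_linear = 23571052435484.266

# RMSE is the square root of MSE and is expressed in the target's units
rmse_linear = np.sqrt(mse_linear)
print(f"Linear Regression RMSE: {rmse_linear:,.0f}")
```

This works out to roughly 4.9 million in the units of `PRICE`. The same conversion can be applied to the MSE of every other model in this notebook.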


**Random Forest Regressor Model**

In [20]:
# Define pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])

# Train and evaluate
rf_pipeline.fit(X_train, y_train)
y_pred = rf_pipeline.predict(X_test)

# Performance metrics
print("Random Forest Regressor Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Random Forest Regressor R^2 Score:", r2_score(y_test, y_pred))

Random Forest Regressor Mean Squared Error: 10805753576844.055
Random Forest Regressor R^2 Score: 0.5711123388825893


**Random Forest Regressor Model with regularization parameters**

In [21]:
# Define pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Limit the depth of each tree
        min_samples_split=5,   # Minimum samples required to split an internal node
        min_samples_leaf=4,    # Minimum samples required to be at a leaf node
        max_features='sqrt',   # Consider sqrt(n_features) candidate features per split
        random_state=42,
    ))
])

# Train and evaluate
rf_pipeline.fit(X_train, y_train)
y_pred = rf_pipeline.predict(X_test)

# Performance metrics
print("Random Forest Regressor Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Random Forest Regressor R^2 Score:", r2_score(y_test, y_pred))

Random Forest Regressor Mean Squared Error: 45760469148704.016
Random Forest Regressor R^2 Score: -0.8162639417282667


In [22]:
# Define pipeline (second configuration: more trees, shallower depth)
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=300,      # Number of trees
        max_depth=5,           # Limit the depth of each tree
        min_samples_split=5,   # Minimum samples required to split an internal node
        min_samples_leaf=4,    # Minimum samples required to be at a leaf node
        max_features='sqrt',   # Consider sqrt(n_features) candidate features per split
        random_state=42,
    ))
])

# Train and evaluate
rf_pipeline.fit(X_train, y_train)
y_pred = rf_pipeline.predict(X_test)

# Performance metrics
print("Random Forest Regressor Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Random Forest Regressor R^2 Score:", r2_score(y_test, y_pred))

Random Forest Regressor Mean Squared Error: 50799259942686.164
Random Forest Regressor R^2 Score: -1.0162569531478467
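
Both constrained Random Forest configurations report a negative R^2. Since R^2 = 1 - SS_res / SS_tot, a negative value means the predictions have a larger squared error than always predicting the mean of the test targets. A small sketch with hypothetical numbers (the arrays below are chosen only to illustrate the metric):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions, chosen only to illustrate the metric
y_true = np.array([1.0, 2.0, 3.0])
mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
poor_model = np.array([3.0, 3.0, 3.0])               # biased constant prediction

r2_baseline = r2_score(y_true, mean_baseline)  # SS_res equals SS_tot, so R^2 is 0
r2_poor = r2_score(y_true, poor_model)         # SS_res = 5, SS_tot = 2, so R^2 = 1 - 5/2
print(r2_baseline, r2_poor)
```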


**Gradient Boosting Regressor Model**

In [23]:
# Define pipeline
gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42))
])

# Train and evaluate
gb_pipeline.fit(X_train, y_train)
y_pred = gb_pipeline.predict(X_test)

# Performance metrics
print("Gradient Boosting Regressor Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Gradient Boosting Regressor R^2 Score:", r2_score(y_test, y_pred))

Gradient Boosting Regressor Mean Squared Error: 10407087799768.004
Gradient Boosting Regressor R^2 Score: 0.5869356529607583


**Ridge Regression Model**

In [24]:
# Define pipeline
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Ridge(alpha=1.0))
])

# Train and evaluate
ridge_pipeline.fit(X_train, y_train)
y_pred_ridge = ridge_pipeline.predict(X_test)

# Performance metrics
print("Ridge MSE:", mean_squared_error(y_test, y_pred_ridge))
print("Ridge R^2 Score:", r2_score(y_test, y_pred_ridge))


Ridge MSE: 23518431020377.49
Ridge R^2 Score: 0.06653758095168238


**Lasso Regression Model**

In [25]:
# Define pipeline
lasso_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Lasso(alpha=0.1))
])

# Train and evaluate
lasso_pipeline.fit(X_train, y_train)
y_pred_lasso = lasso_pipeline.predict(X_test)

# Performance metrics
print("Lasso Regression MSE:", mean_squared_error(y_test, y_pred_lasso))
print("Lasso Regression R^2 Score:", r2_score(y_test, y_pred_lasso))


Lasso Regression MSE: 23536820404962.473
Lasso Regression R^2 Score: 0.0658076938514488




**Elastic Net Model**

In [26]:
# L1 + L2 regularization: Elastic Net
# l1_ratio balances the L1 and L2 penalties (0 = pure L2, 1 = pure L1)
# Define pipeline
elastic_net_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', ElasticNet(alpha=0.1, l1_ratio=0.5))
])

# Train and evaluate
elastic_net_pipeline.fit(X_train, y_train)
elastic_net_predictions = elastic_net_pipeline.predict(X_test)

# Performance metrics
print("Elastic Net MSE:", mean_squared_error(y_test, elastic_net_predictions))
print("Elastic Net R^2 Score:", r2_score(y_test, elastic_net_predictions))

Elastic Net MSE: 22386460966445.41
Elastic Net R^2 Score: 0.1114662372858638
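
To illustrate how `l1_ratio` moves Elastic Net between the two penalties, the sketch below fits on synthetic data (generated only for this example, not the housing data) and checks that `l1_ratio=1.0` reproduces the Lasso coefficients for the same `alpha`, while `l1_ratio=0.5` mixes the L1 and L2 penalties.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data, generated only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)   # pure L1 penalty
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)  # L1 + L2 mix

print(np.allclose(lasso.coef_, enet_l1.coef_))
print(enet_mix.coef_)
```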


**CONCLUSION:**

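Gradient Boosting Regressor achieves the lowest test MSE (about 1.04e13) and the highest $R^2$ (0.5869) of all the models evaluated.

Random Forest Regressor with default settings is a close second, with an MSE of about 1.08e13 and an $R^2$ of 0.5711.

The linear models (Linear, Ridge, Lasso, and Elastic Net) explain little of the variance in price, with $R^2$ between 0.06 and 0.11, and the two Random Forest configurations with restricted depth and leaf size score below zero ($R^2$ of -0.82 and -1.02), which is worse than predicting the mean price. On this train/test split, Gradient Boosting Regressor is therefore the best model on both MSE and $R^2$, although an $R^2$ of about 0.59 still leaves roughly 41% of the variance in price unexplained.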
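
The comparison above rests on a single train/test split, and the gap between the two best models is small. One way to check whether such a ranking is stable is k-fold cross-validation. The sketch below is a minimal example on synthetic data (`X_demo`, `y_demo`, and the reduced `n_estimators=50` are assumptions made to keep the example fast). In this notebook, the same call would take the full pipelines together with `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; in the notebook this would be X and y,
# with each model wrapped in its preprocessing pipeline
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Shuffled 5-fold split with a fixed seed so both models see the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One R^2 value per fold; the mean and spread summarize stability
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the difference between the mean scores is smaller than the spread across folds, the ranking of the two models should be treated as inconclusive.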
