
In [1]:
import pandas as pd

file_path = './real_estate_value.csv'
data = pd.read_csv(file_path)

data.head()


   HouseAge  DistanceToMRT  NoOfStores  Latitude  Longitude  UnitPrice
0      32.0       84.87882          10  24.98298  121.54024       37.9
1      19.5       306.5947           9  24.98034  121.53951       42.2
2      13.3       561.9845           5  24.98746  121.54391       47.3
3      13.3       561.9845           5  24.98746  121.54391       54.8
4       5.0       390.5684           5  24.97937  121.54245       43.1


In [2]:
from sklearn.model_selection import train_test_split

data.info()

data.describe()

data.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   HouseAge       414 non-null    float64
 1   DistanceToMRT  414 non-null    float64
 2   NoOfStores     414 non-null    int64  
 3   Latitude       414 non-null    float64
 4   Longitude      414 non-null    float64
 5   UnitPrice      414 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 19.5 KB


HouseAge         0
DistanceToMRT    0
NoOfStores       0
Latitude         0
Longitude        0
UnitPrice        0
dtype: int64

# Preprocessing and Splitting the Data


In [3]:
from sklearn.preprocessing import StandardScaler

X = data.drop('UnitPrice', axis=1)
y = data['UnitPrice']

Justification: We separate the features (independent variables) from the target variable (UnitPrice) to prepare for model training and evaluation.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Justification: Holding out 20% of the rows as a test set lets us evaluate the model on data it never saw during training. An 80-20 split is a common default, and fixing random_state=42 makes the split reproducible.
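To make the split sizes concrete, here is a minimal sketch on synthetic arrays with the same row count as the dataset (414). The names `X_demo` and `y_demo` are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (414).
X_demo = np.arange(414 * 5).reshape(414, 5)
y_demo = np.arange(414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# scikit-learn rounds the test share up, so 0.2 * 414 = 82.8 becomes 83 rows,
# leaving 331 rows for training.
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state yields the same partition.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```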

In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

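Justification: Standardize the features to zero mean and unit variance. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.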
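As a rough illustration of what the scaler does, the sketch below uses synthetic data (the arrays `train` and `test` are made up for the example) and checks that `transform` on the test set reuses the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the train/test feature matrices.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std from train
test_scaled = demo_scaler.transform(test)        # reuses the train statistics

# Equivalent manual computation: (x - train mean) / train std,
# using the population standard deviation (NumPy's default, ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(manual, test_scaled))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean or unit variance, because its statistics differ slightly from those of the training set.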


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

numerical_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, X.columns)
    ])


Justification: The pipeline standardizes each feature to zero mean and unit variance, and the ColumnTransformer applies it to every column of X.

In [7]:
numerical_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False))
])


Justification: Adding degree-2 polynomial features (squares and pairwise products of the scaled features) allows the model to capture non-linear relationships and interactions between the features and the target variable.
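A small sketch of the expansion, using toy inputs rather than the dataset: with `degree=2` and `include_bias=False`, two features a and b become a, b, a², ab, b², and five features grow to 20 columns (5 linear, 5 squares, 10 pairwise products).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)

# Two features a=2, b=3 expand to a, b, a^2, a*b, b^2.
small = poly.fit_transform(np.array([[2.0, 3.0]]))
print(small)

# Five input features (as in this dataset) expand to 20 output columns.
wide = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    np.ones((1, 5))
)
print(wide.shape[1])
```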

In [8]:
numerical_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('select', SelectKBest(score_func=f_regression, k=5))
])


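Justification: Feature selection keeps the k=5 features with the highest univariate F-scores against the target, which can reduce noise and may improve model performance. Note that the preprocessor defined in In [6] still holds the scaler-only pipeline, and the models below are trained on X_train_scaled, so the polynomial and selection steps do not affect the reported results.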
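One possible way to make the scaling, polynomial, and selection steps take effect is to put them in a single Pipeline with the estimator and hand that to GridSearchCV, so each step is refit inside every cross-validation fold. The sketch below is an assumption about how this could be wired, not what the notebook runs: it uses synthetic data (`X_demo`, `y_demo`), omits the ColumnTransformer for brevity, and picks a small illustrative grid.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with five features (hypothetical, for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 5))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] ** 2
          + rng.normal(scale=0.1, size=150))

# Preprocessing and model in one estimator.
full_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("model", DecisionTreeRegressor(random_state=42)),
])

# Step parameters are addressed as "<step name>__<parameter>".
search = GridSearchCV(
    full_pipeline,
    param_grid={"select__k": [5, 10], "model__max_depth": [3, 5]},
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# Number of expanded features kept by the best pipeline.
n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(search.best_params_, n_selected)
```

Because the raw features go into the pipeline, the separate `X_train_scaled` array would not be needed with this layout.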

# Fine-tuning and Evaluating Models

### Decision Tree Regressor


In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [10]:
dt_regressor = DecisionTreeRegressor(random_state=42)

In [11]:
param_grid_dt = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [12]:
grid_search_dt = GridSearchCV(estimator=dt_regressor, param_grid=param_grid_dt, cv=5, n_jobs=-1, scoring='r2')
grid_search_dt.fit(X_train_scaled, y_train)

In [13]:
best_dt = grid_search_dt.best_estimator_
print("Best parameters for Decision Tree:", grid_search_dt.best_params_)
print("Best R2 score for Decision Tree:", grid_search_dt.best_score_)

Best parameters for Decision Tree: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best R2 score for Decision Tree: 0.6007565459112857


In [14]:
y_pred_dt = best_dt.predict(X_test_scaled)
print("Decision Tree Performance on Test Set:")
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_dt))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_dt))
print("R2 Score:", r2_score(y_test, y_pred_dt))

Decision Tree Performance on Test Set:
Mean Absolute Error: 4.725469923825084
Mean Squared Error: 41.24276282303145
R2 Score: 0.7541557842677021
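For reference, the three metrics printed above can be reproduced by hand. The sketch below uses a made-up four-point example (`y_true`, `y_hat`) and also converts the reported Decision Tree MSE into an RMSE, which is in the same units as UnitPrice.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values, not taken from the dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_hat
mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))

# Square root of the Decision Tree test MSE reported above.
rmse_dt = np.sqrt(41.24276282303145)
print(round(rmse_dt, 2))
```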


### Random Forest Regressor


In [15]:
from sklearn.ensemble import RandomForestRegressor

In [16]:
rf_regressor = RandomForestRegressor(random_state=42)

In [17]:
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}



In [18]:
grid_search_rf = GridSearchCV(estimator=rf_regressor, param_grid=param_grid_rf, cv=5, n_jobs=-1, scoring='r2')
grid_search_rf.fit(X_train_scaled, y_train)

In [19]:
best_rf = grid_search_rf.best_estimator_
print("Best parameters for Random Forest:", grid_search_rf.best_params_)
print("Best R2 score for Random Forest:", grid_search_rf.best_score_)

Best parameters for Random Forest: {'max_depth': 7, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Best R2 score for Random Forest: 0.6483065233907805


In [20]:
y_pred_rf = best_rf.predict(X_test_scaled)
print("Random Forest Performance on Test Set:")
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred_rf))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_rf))
print("R2 Score:", r2_score(y_test, y_pred_rf))

Random Forest Performance on Test Set:
Mean Absolute Error: 3.854974118743021
Mean Squared Error: 32.12279017036377
R2 Score: 0.8085190802941027
