## LEVEL 2 — HOUSE PRICE PREDICTION (REGRESSION)

## ⚙️ CELL 1 — Header

# Level 2 - Predictive Modeling (Regression)
# Project: House Price Prediction
# Author: Rakesh Mahakur

"""
This notebook builds a regression model to predict house prices
based on property features. The model is evaluated using MSE and R².
"""


## ⚙️ CELL 2 — Load Data

“I load the house price dataset to begin regression modeling.”

In [1]:
import pandas as pd

df = pd.read_csv(r"D:\codveda\Data Set For Task\4) house Prediction Data Set.csv")
df.head()


Unnamed: 0,0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
0,0.02731 0.00 7.070 0 0.4690 6.4210 78...
1,0.02729 0.00 7.070 0 0.4690 7.1850 61...
2,0.03237 0.00 2.180 0 0.4580 6.9980 45...
3,0.06905 0.00 2.180 0 0.4580 7.1470 54...
4,0.02985 0.00 2.180 0 0.4580 6.4300 58...
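
The output above suggests the file was not parsed into separate columns: the first data row was consumed as a header, and every row landed in a single space-separated field. A minimal sketch of one possible fix follows, assuming the raw file is whitespace-delimited with no header row. The column names are an assumption (the 14 values match the layout of the Boston housing data) and are not read from the file; the sample text is the one complete row visible above. If the file also carries a leading comma-separated index column, as the `Unnamed: 0` label hints, that column would need to be split off first.

```python
import io

import pandas as pd

# The only complete row visible in the output above (it was read as the header).
raw = "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00\n"

# Assumed column names (Boston housing layout); not taken from the file itself.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Assumption: whitespace-delimited values and no header row.
housing = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, names=columns)

print(housing.shape)           # one row, fourteen separate columns
print(housing.loc[0, "MEDV"])  # last value in the row, the assumed price target
```

With the real file, `io.StringIO(raw)` would be replaced by the file path used in the cell above.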


## ⚙️ CELL 3 — Split Data

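“From this cell on, I work with the customer churn dataset instead of the house price data. I load the training file (churn-bigml-80.csv) and the separate test file (churn-bigml-20.csv) and inspect the first rows and column names. The train/test split itself happens in the next cell.”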

In [15]:
import pandas as pd

df = pd.read_csv(r"D:\codveda\Data Set For Task\Churn Prdiction Data\churn-bigml-80.csv")
print(df.head())
print(df.columns)




  State  Account length  Area code International plan Voice mail plan  \
0    KS             128        415                 No             Yes   
1    OH             107        415                 No             Yes   
2    NJ             137        415                 No              No   
3    OH              84        408                Yes              No   
4    OK              75        415                Yes              No   

   Number vmail messages  Total day minutes  Total day calls  \
0                     25              265.1              110   
1                     26              161.6              123   
2                      0              243.4              114   
3                      0              299.4               71   
4                      0              166.7              113   

   Total day charge  Total eve minutes  Total eve calls  Total eve charge  \
0             45.07              197.4               99             16.78   
1             27.47   

In [16]:

df_test = pd.read_csv(r"D:\codveda\Data Set For Task\Churn Prdiction Data\churn-bigml-20.csv")
print(df_test.head())


  State  Account length  Area code International plan Voice mail plan  \
0    LA             117        408                 No              No   
1    IN              65        415                 No              No   
2    NY             161        415                 No              No   
3    SC             111        415                 No              No   
4    HI              49        510                 No              No   

   Number vmail messages  Total day minutes  Total day calls  \
0                      0              184.5               97   
1                      0              129.1              137   
2                      0              332.9               67   
3                      0              110.4              103   
4                      0              119.3              117   

   Total day charge  Total eve minutes  Total eve calls  Total eve charge  \
0             31.37              351.6               80             29.89   
1             21.95   

## CELL 4 — Train Model

“In this step, I split the churn data into training and test sets, one-hot encode the categorical features with a ColumnTransformer, and train a Logistic Regression classifier inside a scikit-learn Pipeline to predict customer churn. Numeric features are passed through without scaling. Keeping preprocessing and the classifier in one pipeline ensures the same transformations are applied at training and prediction time.”

In [18]:
from sklearn.model_selection import train_test_split

# Features & Target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (2132, 19)
Test shape: (534, 19)
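
Because churners are a minority class, a plain random split can leave the train and test sets with slightly different churn rates. A minimal sketch of a stratified split, which keeps the class ratio equal in both parts, is shown below. The data is a synthetic stand-in, not the churn file, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the churn file): 100 rows, 15% positives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([True] * 15 + [False] * 85)

# stratify=y_demo asks scikit-learn to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

In the notebook, the same idea would amount to passing `stratify=y` to the existing `train_test_split` call; the reported shapes and metrics would then change.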


In [21]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Identify categorical and numeric columns
categorical_cols = X.select_dtypes(include=['object']).columns
numeric_cols = X.select_dtypes(exclude=['object']).columns

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)
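
To illustrate what `handle_unknown='ignore'` does in the preprocessor above, here is a small sketch on a toy frame. The state codes are taken from the outputs shown earlier, but the two tiny frames are illustrative and are not the real data. A category that was not seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames for illustration only.
train_states = pd.DataFrame({"State": ["KS", "OH", "NJ"]})
new_states = pd.DataFrame({"State": ["OH", "LA"]})  # "LA" was not seen in fitting

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_states)

# Columns follow the sorted categories learned in fit: KS, NJ, OH.
encoded = encoder.transform(new_states).toarray()
print(encoder.categories_[0])
print(encoded)
```

This matters here because the separate test file may contain category values that the training file lacks.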



In [22]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

model.fit(X_train, y_train)

print("Model training completed ✅")


Model training completed ✅


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
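
The warning above says the solver hit `max_iter` before converging and points to feature scaling as one remedy. A minimal sketch of that remedy replaces `'passthrough'` with `StandardScaler` for the numeric columns. The data below is a synthetic stand-in with an invented label rule, not the churn file, and the names `preprocessor_scaled` and `scaled_model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the churn file); column names mimic two churn columns.
rng = np.random.default_rng(0)
n_rows = 200
X_demo = pd.DataFrame({
    "International plan": rng.choice(["Yes", "No"], size=n_rows),
    "Total day minutes": rng.normal(180.0, 50.0, size=n_rows),
})
y_demo = X_demo["Total day minutes"] > 200.0  # label rule invented for the demo

# Same structure as the notebook's preprocessor, but numeric columns are standardized.
preprocessor_scaled = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["International plan"]),
    ("num", StandardScaler(), ["Total day minutes"]),
])

scaled_model = Pipeline(steps=[
    ("preprocessor", preprocessor_scaled),
    ("classifier", LogisticRegression(max_iter=1000)),
])
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used.
print("Iterations used:", scaled_model.named_steps["classifier"].n_iter_[0])
```

Whether scaling removes the warning on the real churn data, and how it affects the scores, would need to be checked by rerunning the notebook.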


In [23]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.8520599250936329
[[437  18]
 [ 61  18]]
              precision    recall  f1-score   support

       False       0.88      0.96      0.92       455
        True       0.50      0.23      0.31        79

    accuracy                           0.85       534
   macro avg       0.69      0.59      0.62       534
weighted avg       0.82      0.85      0.83       534
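
The report above can be rechecked by hand from the confusion matrix, which also puts the 85% accuracy in context. The sketch below uses only the four counts printed above and assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes, in the order False, True).

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 437, 18   # actual False: predicted False / predicted True
fn, tp = 61, 18    # actual True:  predicted False / predicted True
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_true = tp / (tp + fp)   # of predicted churners, how many really churned
recall_true = tp / (tp + fn)      # of real churners, how many were caught

# Accuracy of a trivial model that always predicts "False" (no churn).
baseline_accuracy = (tn + fp) / total

print("Accuracy:", accuracy)
print("Precision (True):", precision_true)
print("Recall (True):", round(recall_true, 2))
print("Always-False baseline accuracy:", baseline_accuracy)
```

On these counts the model's accuracy equals the always-False baseline (both are 455 of 534 correct), so accuracy alone says little here; the recall of 0.23 on the churn class is the more informative number.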



## CELL 5 — Evaluate Model

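“I compute MSE and R² on the classifier's predicted churn labels. Because the target is a boolean class label and not a continuous value, MSE here is simply the misclassification rate (1 - accuracy), and R² is not a meaningful measure of fit. The classification metrics in the previous cell are the appropriate ones for this model.”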

In [24]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))


MSE: 0.14794007490636704
R2 Score: -0.1736263736263739
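
To see where these two numbers come from, the label vectors can be rebuilt from the confusion matrix in the previous cell (TN=437, FP=18, FN=61, TP=18) and scored again. This is a sketch for checking the arithmetic, coding False as 0 and True as 1; `y_true_demo` and `y_pred_demo` are hypothetical names, and the row order is arbitrary.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Rebuild 0/1 label vectors from the confusion-matrix counts: TN, FP, FN, TP.
y_true_demo = np.array([0] * 437 + [0] * 18 + [1] * 61 + [1] * 18)
y_pred_demo = np.array([0] * 437 + [1] * 18 + [0] * 61 + [1] * 18)

mse = mean_squared_error(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
r2 = r2_score(y_true_demo, y_pred_demo)

# With 0/1 labels every squared error is 0 or 1, so MSE is the error rate.
print("MSE:", mse)
print("1 - accuracy:", 1 - acc)
print("R2 Score:", r2)
```

The MSE matches the value printed above because it is the share of misclassified rows (79 of 534). The negative R² only reflects that the number of errors exceeds the variance term of the 0/1 target; it does not describe regression quality.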
