
# Logistic Regression: Classifying Apartments as Expensive or Inexpensive

In this notebook, we will use **Logistic Regression** to classify apartments based on their price per square meter.
The goal is to predict whether an apartment is "expensive" or "inexpensive" using available features such as the number of rooms, area, luxurious status, and more.

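Logistic Regression is a supervised machine learning algorithm used for binary classification problems. It models the probability of the positive class (here: "expensive") by applying the sigmoid function to a linear combination of the features.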
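
To make the idea concrete, here is a minimal sketch of how a linear score is turned into a probability and then into a class label. The `sigmoid` helper and the three scores are hypothetical illustrations and are not part of the notebook's pipeline.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (intercept + weighted features) for three apartments
scores = np.array([-2.0, 0.0, 2.0])
probabilities = sigmoid(scores)

# Assign class 1 ("expensive") when the probability exceeds 0.5
labels = (probabilities > 0.5).astype(int)

print(probabilities.round(3))
print(labels)
```

A score of zero corresponds to a probability of exactly 0.5, which is the decision boundary.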


## Libraries and settings

In [None]:
# Libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.preprocessing import StandardScaler

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Show current working directory
print(os.getcwd())


## Import Data


In [None]:
# Load dataset
df = pd.read_csv('./Data/apartments_data_enriched_cleaned.csv', 
                 delimiter=';', 
                 index_col=0)

# Display dataset
df.head()


## Data Preprocessing

Before building the model, we need to preprocess the data. This includes:
- Handling missing values (if any).
- Creating a binary target variable for classification.


In [None]:
# Check for missing values (printed explicitly, since only a cell's last expression is displayed)
print(df.isna().sum())

# For simplicity, let's define an 'expensive' apartment as one with a price_per_m2 greater than the median
median_price_per_m2 = df['price_per_m2'].median()

# Create a new column 'expensive' that is 1 if the price_per_m2 is greater than the median, and 0 otherwise
df['expensive'] = (df['price_per_m2'] > median_price_per_m2).astype(int)

# Display the updated dataframe
df[['price_per_m2', 'expensive']].head()
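
A median split usually yields roughly balanced classes, but values equal to the median fall into class 0 because the comparison is strict. The sketch below illustrates this on a small, made-up price series (the numbers are hypothetical and not taken from the dataset).

```python
import pandas as pd

# Hypothetical prices per square meter (not from the notebook's dataset)
toy = pd.DataFrame({'price_per_m2': [20.0, 25.0, 30.0, 30.0, 45.0, 60.0]})

# Same rule as above: strictly greater than the median means 'expensive'
median_value = toy['price_per_m2'].median()
toy['expensive'] = (toy['price_per_m2'] > median_value).astype(int)

# Share of each class; ties at the median are counted as 'inexpensive'
class_share = toy['expensive'].value_counts(normalize=True).sort_index()

print('Median:', median_value)
print(class_share)
```

Running `df['expensive'].value_counts(normalize=True)` on the real data is a quick way to check how balanced the two classes actually are.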



## Splitting the Data

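We'll split the dataset into a training set and a test set (80% training, 20% testing) and then standardize the features. The scaler is fitted on the training set only, so no information from the test set leaks into the model.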


In [4]:
# Select features for the model
features = ['rooms', 
            'area', 
            'luxurious', 
            'pop_dens', 
            'mean_taxable_income', 
            'dist_supermarket']
X = df[features]
y = df['expensive']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
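
As an alternative sketch, scaling and model fitting can be bundled in a scikit-learn `Pipeline`, which ensures the scaler only ever sees the training data. The example uses synthetic data from `make_classification` as a stand-in for the apartment features; the names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with six features (not the apartment dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# 80/20 split; stratify keeps the class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# The pipeline fits the scaler on the training data only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# Means learned by the scaler come from the training set alone
scaler_means = pipe.named_steps['standardscaler'].mean_

test_accuracy = pipe.score(X_te, y_te)
print(f'Test accuracy: {test_accuracy:.2f}')
```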


## Building the Logistic Regression Model

Now we'll train a logistic regression model on the training set.


In [None]:
# Train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Display the model coefficients
coefficients = pd.DataFrame({
    'Feature': features,
    'Coefficient': model.coef_[0]
})
coefficients
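
Because the features were standardized, each coefficient describes the change in the log-odds of "expensive" for a one-standard-deviation increase in that feature. Exponentiating a coefficient gives an odds ratio, which is often easier to read. The sketch below uses hypothetical coefficient values, not the ones fitted in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on standardized features (not the fitted values)
demo = pd.DataFrame({
    'Feature': ['area', 'mean_taxable_income'],
    'Coefficient': [-0.5, 0.8],
})

# exp(coefficient) is the multiplicative change in the odds per +1 std. dev.
demo['Odds ratio'] = np.exp(demo['Coefficient'])

print(demo)
```

An odds ratio below 1 lowers the odds of the positive class, and one above 1 raises them. The same transformation can be applied to the `coefficients` table via `np.exp(coefficients['Coefficient'])`.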


## Model Evaluation

Let's evaluate the model on the test set using various metrics.


### Confusion Matrix and Classification Report

In [None]:
# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix as a heatmap
plt.figure(figsize=(4, 4))
sns.heatmap(cm, 
            annot=True, 
            fmt='d', 
            cmap='Blues',
            cbar=False,
            xticklabels=['inexpensive', 'expensive'], 
            yticklabels=['inexpensive', 'expensive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Print classification report
print('\nClassification report:\n', classification_report(y_test, y_pred))
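
The figures in the classification report follow directly from the four cells of the confusion matrix. The following sketch recomputes them by hand on a small set of made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) and compares the result with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = expensive, 0 = inexpensive)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(f'precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}')

# The manual values agree with scikit-learn's implementations
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
```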


### ROC Curve and AUC

In [None]:

# ROC Curve
y_prob = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()
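
The AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. As a small sketch with made-up scores (all `*_demo` names are hypothetical), the trapezoidal area from `auc`, the direct `roc_auc_score`, and a manual pairwise count give the same value.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for four apartments
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve via the trapezoidal rule
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, y_score_demo)
auc_trapezoid = auc(fpr_demo, tpr_demo)

# Same quantity computed directly from labels and scores
auc_direct = roc_auc_score(y_true_demo, y_score_demo)

# Manual check: fraction of (positive, negative) pairs ranked correctly
positives = [s for s, t in zip(y_score_demo, y_true_demo) if t == 1]
negatives = [s for s, t in zip(y_score_demo, y_true_demo) if t == 0]
correct_pairs = sum(p > n for p in positives for n in negatives)
auc_pairwise = correct_pairs / (len(positives) * len(negatives))

print(auc_trapezoid, auc_direct, auc_pairwise)
```

The pairwise count shown here ignores tied scores, which do not occur in this toy example.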



## Conclusion

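We have trained a logistic regression model to classify apartments as "expensive" or "inexpensive", where "expensive" means a price per square meter above the dataset median.
The model was evaluated on the test set using a confusion matrix, a classification report (precision, recall, F1-score, and accuracy), and the ROC curve with its AUC.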


### Jupyter notebook --footer info-- (please always provide this at the end of each notebook)

In [None]:
import os
import platform
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('-----------------------------------')