## Final Activity 1

#### Objectives:
- Develop regression models for a given dataset and interpret the regression models.
- Evaluate the regression models based on the regression metrics.

#### Simple Linear Regression (SLR)

Predict the discount to give on a house based on the damage found in it. The dataset has one column for the number of damages and another for the discount that can be given (around USD 1 for every damage discovered).

#### Multiple Linear Regression (MLR)

Predict the price of a house by regressing the price column on ONLY the following attributes/columns in the dataset:
- size (in square meters)
- number of bedrooms
- number of bathrooms
- number of extra rooms
- presence of a garage (0 - no, 1 - yes)
- presence of a garden (0 - no, 1 - yes)
- if the house is in a subdivision (0 - no, 1 - yes)
- if the house is located in a city (0 - no, 1 - yes)
- if the house is solar powered (0 - no, 1 - yes)
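
As a rough illustration of what a multiple linear regression estimates, the sketch below fits a least-squares model on synthetic data with NumPy. Everything in it is an assumption made for illustration: the three features are stand-ins for a subset of the columns listed above, and the "true" coefficients and prices are made up rather than taken from the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for three of the listed attributes (values are made up).
size = rng.uniform(40, 200, n)      # square meters
bedrooms = rng.integers(1, 6, n)    # count
garage = rng.integers(0, 2, n)      # 0 = no, 1 = yes

# Hypothetical "true" relationship, used only to generate toy prices.
true_intercept = 20000.0
true_coefs = np.array([1500.0, 8000.0, 12000.0])
X = np.column_stack([size, bedrooms, garage])
price = true_intercept + X @ true_coefs + rng.normal(0, 1000, n)

# Least-squares fit; the leading column of ones estimates the intercept.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, price, rcond=None)

fitted_intercept, fitted_coefs = beta[0], beta[1:]
print(f"intercept: {fitted_intercept:.2f}")
for name, coef in zip(["size", "bedrooms", "garage"], fitted_coefs):
    print(f"{name}: {coef:.2f}")
```

Each fitted coefficient is read as the expected change in price for a one-unit change in that attribute while the other attributes are held fixed. For the 0/1 columns, it is the expected price difference between houses with and without the feature.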

In [None]:
# Import Dataset
import pandas as pd
import numpy as np

raw = pd.read_csv('data/housePriceData.csv')

raw

### Separate columns and split train-test data

In [None]:
from sklearn.model_selection import train_test_split

# Separate columns by model
slr_features = raw[['damages']].copy()
slr_target = raw['discount'].copy()
mlr_features = raw[['size', 'bedrooms', 'bathrooms', 'extraRooms', 'garage', 'garden', 'inSubdivision', 'inCity',
                    'solarPowered']].copy()
mlr_target = raw['price'].copy()

# Hold out 10% of the rows for testing; a fixed random_state makes the split reproducible
slr_features_train, slr_features_test, slr_target_train, slr_target_test = train_test_split(
    slr_features, slr_target, test_size=0.1, random_state=42
)

mlr_features_train, mlr_features_test, mlr_target_train, mlr_target_test = train_test_split(
    mlr_features, mlr_target, test_size=0.1, random_state=42
)
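
To make the idea of a 90/10 shuffled split concrete, here is a minimal NumPy sketch. It is not how `train_test_split` is implemented internally; the row count and seed are hypothetical, and the point is only that shuffling with a fixed seed gives the same train and test rows on every run.

```python
import numpy as np

n_rows = 50            # hypothetical dataset size
test_fraction = 0.1    # hold out about 10% of the rows


def shuffled_split(n, fraction, seed):
    """Return (train_idx, test_idx) from a seeded shuffle of row positions."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)
    n_test = int(round(fraction * n))
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = shuffled_split(n_rows, test_fraction, seed=42)
train_again, test_again = shuffled_split(n_rows, test_fraction, seed=42)

print(len(train_idx), "training rows,", len(test_idx), "test rows")
print("same split on rerun:", np.array_equal(test_idx, test_again))
```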

## Simple Linear Regression

a. Display the descriptive statistics of the discount amounts

In [None]:
# Describe the target (discount), not the feature (damages)
discount_stats = slr_target_train.describe()
print("Descriptive Statistics for Discount Amounts:")
discount_stats

b. Create a scatter plot using the “damages” column as the independent variable and the “discount” column as the dependent variable.

In [None]:
import matplotlib.pyplot as plt

slr_train_data = pd.concat([slr_features_train, slr_target_train], axis=1)

plt.figure(figsize=(10, 6))
plt.scatter(slr_train_data['damages'], slr_train_data['discount'], color='yellow', edgecolor='black', alpha=0.7)
plt.title('Discount vs. Number of Damages', fontsize=14)
plt.xlabel('Number of Damages', fontsize=12)
plt.ylabel('Discount (USD)', fontsize=12)
plt.grid(True, linestyle='-', alpha=0.3)

# Fit a first-degree polynomial (straight line) and overlay it as a trendline
z = np.polyfit(slr_train_data['damages'], slr_train_data['discount'], 1)
p = np.poly1d(z)
plt.plot(slr_train_data['damages'], p(slr_train_data['damages']), color='blue', linestyle='-', label='Trendline')

plt.legend()
plt.show()

c. Determine the correlation between the “damages” and “discount” columns.

In [None]:
correlation = slr_train_data[['damages', 'discount']].corr().iloc[0, 1]
print(f"Correlation between Damages and Discount: {correlation:.2f}")
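
For reference, the Pearson coefficient that `.corr()` reports by default can be computed by hand. The sketch below uses a few made-up `damages` and `discount` values (not rows from the dataset) and checks the manual result against `np.corrcoef`.

```python
import numpy as np

# Made-up illustrative values, not taken from the housing dataset.
damages = np.array([0, 1, 2, 3, 4, 5], dtype=float)
discount = np.array([0.2, 0.9, 2.1, 2.8, 4.2, 5.1])

# Pearson r: sum of co-deviations divided by the product of deviation norms.
dx = damages - damages.mean()
dy = discount - discount.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_numpy = np.corrcoef(damages, discount)[0, 1]
print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

A value close to 1 indicates a strong positive linear relationship, which is what one would expect if the discount grows by a roughly fixed amount per damage.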

d. Create a simple linear regression model using the “damages” column as the independent variable and the “discount” column as the dependent variable.

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize and train the model
slr_model = LinearRegression()
slr_model.fit(slr_features_train, slr_target_train)

# Get the slope (coefficient) and intercept
slope = slr_model.coef_[0]
intercept = slr_model.intercept_

print(f"SLR Equation: discount = {intercept:.2f} + {slope:.2f} * damages")
print(f"Slope (damages coefficient): {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
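
With a single feature, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below applies those formulas to made-up values (an assumption, not the dataset) and compares the result with `np.polyfit`.

```python
import numpy as np

# Made-up illustrative values, not taken from the housing dataset.
damages = np.array([0, 1, 2, 3, 4, 5], dtype=float)
discount = np.array([0.2, 0.9, 2.1, 2.8, 4.2, 5.1])

dx = damages - damages.mean()
dy = discount - discount.mean()

# slope = sum(dx * dy) / sum(dx^2); intercept = mean(y) - slope * mean(x)
slope_manual = (dx * dy).sum() / (dx ** 2).sum()
intercept_manual = discount.mean() - slope_manual * damages.mean()

# np.polyfit with degree 1 returns [slope, intercept].
slope_fit, intercept_fit = np.polyfit(damages, discount, 1)

print(f"manual:  discount = {intercept_manual:.4f} + {slope_manual:.4f} * damages")
print(f"polyfit: discount = {intercept_fit:.4f} + {slope_fit:.4f} * damages")
```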

e. Evaluate the developed regression model based on different performance metrics and discuss the OLS regression results.

In [None]:
# Performance Metrics
# Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error

# Predict on test data
slr_predictions = slr_model.predict(slr_features_test)

# Calculate metrics
mae = mean_absolute_error(slr_target_test, slr_predictions)
mse = mean_squared_error(slr_target_test, slr_predictions)
rmse = root_mean_squared_error(slr_target_test, slr_predictions)

print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
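
The three metrics above are simple functions of the prediction errors. The sketch below computes them directly with NumPy on a small made-up set of actual and predicted values (an assumption for illustration). It also computes R-squared, which the cell above does not report, as an additional commonly used metric.

```python
import numpy as np

# Made-up actual and predicted values, used only to show the arithmetic.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.5, 5.5, 2.0, 8.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # back in the units of the target

# R-squared: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

MAE and RMSE are in the same units as the discount, so they are the easiest to interpret. RMSE penalizes large errors more heavily than MAE does.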

In [None]:
# OLS Regression
import statsmodels.api as sm

# Add a constant (intercept) to the features
X_train = sm.add_constant(slr_features_train)
ols_model = sm.OLS(slr_target_train, X_train).fit()

# Print OLS summary
ols_model.summary()
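
When discussing the OLS summary, it can help to know where its main numbers come from. The sketch below reproduces the coefficient estimates, standard errors, t-statistics, and R-squared with plain NumPy on synthetic data. The data are an assumption: a discount of roughly 1 unit per damage plus noise, loosely mirroring the task description rather than the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Synthetic data: about 1 unit of discount per damage, plus random noise.
damages = rng.integers(0, 10, n).astype(float)
discount = 1.0 * damages + rng.normal(0, 0.5, n)

# Design matrix with a constant column, analogous to sm.add_constant.
X = np.column_stack([np.ones(n), damages])
beta, *_ = np.linalg.lstsq(X, discount, rcond=None)

residuals = discount - X @ beta
dof = n - X.shape[1]                     # observations minus parameters
sigma2 = residuals @ residuals / dof     # residual variance estimate

# Standard errors are the square roots of the diagonal of sigma^2 * (X'X)^-1.
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

ss_tot = np.sum((discount - discount.mean()) ** 2)
r_squared = 1 - (residuals @ residuals) / ss_tot

for name, b, se, t in zip(["const", "damages"], beta, std_err, t_stats):
    print(f"{name:8s} coef={b:.4f}  std err={se:.4f}  t={t:.2f}")
print(f"R-squared: {r_squared:.4f}")
```

A large t-statistic for `damages` (with a correspondingly small p-value in the summary) suggests the slope is distinguishable from zero, and R-squared indicates how much of the variation in the discount the model accounts for.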

f. Predict the discount amounts on the remaining 10% unseen data.

In [None]:
# Predict discounts for the test set (10% unseen data)
test_predictions = slr_model.predict(slr_features_test)

# Create a DataFrame to compare actual vs. predicted discounts
results = pd.DataFrame({
    'Actual Discount': slr_target_test,
    'Predicted Discount': test_predictions,
    'Absolute Error (|Actual - Predicted|)': np.abs(slr_target_test - test_predictions)
})

print("Predictions on Unseen Test Data:")
results.round(2)