# Multiple Linear Regression

In this notebook, we will continue practicing:

- How to perform regression with multiple variables
- How to handle categorical variables in regression
- How to evaluate and compare models to determine the best one


## Imports

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data manipulation and processing
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # more attractive visualizations
import statsmodels.api as sm  # fitting regression models
import statsmodels.formula.api as smf
from scipy.stats import spearmanr, pearsonr, chi2_contingency

# Getting the Data

In [None]:
#https://www.kaggle.com/datasets/gauravduttakiit/real-estate-price/data
property_dataset = pd.read_csv("real_estate_price_size_year_view.csv")
property_dataset.head()

|   | price | size | year | view |
|---|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 | No sea view |
| 1 | 228581.528 | 656.22 | 2009 | No sea view |
| 2 | 281626.336 | 487.29 | 2018 | Sea view |
| 3 | 401255.608 | 1504.75 | 2015 | No sea view |
| 4 | 458674.256 | 1275.46 | 2009 | Sea view |


In [None]:
# Data dictionary
# price: the selling price of the property.
# size: the size of the property in square feet.
# year: the year the property was built.
# view: whether the property has a sea view ("Sea view") or not ("No sea view").

## What is the size of the dataset?

## Get insights about the data using descriptive statistics.
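As a minimal sketch of the two questions above, the block below uses a stand-in DataFrame named `sample` (a hypothetical name) built from only the five rows shown by `head()`. On the real data, the same calls would be made on `property_dataset`, and the numbers would differ.

```python
import pandas as pd

# Stand-in data: only the five rows shown by head() above, not the full CSV.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "size": [643.09, 656.22, 487.29, 1504.75, 1275.46],
    "year": [2015, 2009, 2018, 2015, 2009],
    "view": ["No sea view", "No sea view", "Sea view", "No sea view", "Sea view"],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = sample.shape
print(f"{n_rows} rows x {n_cols} columns")

# count, mean, std, min, quartiles, and max for each numeric column
numeric_summary = sample.describe()
print(numeric_summary)

# For a text column, describe() reports count, unique, top, and freq
print(sample["view"].describe())
```

Note that `sample.size` is a DataFrame attribute (the total number of cells), so the "size" column has to be selected with brackets, as in `sample["size"]`.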

## Which numeric features are correlated with each other?

## Based on the correlation matrix, which features seem to be most correlated with the target variable (price)?
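One possible way to approach the two correlation questions is sketched below. It again uses the five `head()` rows as a stand-in (`sample` is a hypothetical name), so the coefficients it prints are illustrative only and will not match the full dataset.

```python
import pandas as pd

# Stand-in data: only the five rows shown by head() above, not the full CSV.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "size": [643.09, 656.22, 487.29, 1504.75, 1275.46],
    "year": [2015, 2009, 2018, 2015, 2009],
    "view": ["No sea view", "No sea view", "Sea view", "No sea view", "Sea view"],
})

# Pairwise Pearson correlations between the numeric columns
corr_matrix = sample[["price", "size", "year"]].corr()
print(corr_matrix)

# Correlation of each feature with the target, strongest (by absolute value) first
price_corr = corr_matrix["price"].drop("price")
order = price_corr.abs().sort_values(ascending=False).index
price_corr = price_corr.loc[order]
print(price_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top, since the sign only indicates direction.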

## What is the data type of the "view" column, and how many unique values does it have?

## To work with the "view" column, we need to convert it to binary values. Convert "view" using the .map() function so that "Sea view" is represented as 1 and "No sea view" as 0.
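A short sketch of inspecting and encoding the column follows. The stand-in DataFrame `sample` and the new column name `view_binary` are assumptions made for illustration; the exercise may equally overwrite "view" in place.

```python
import pandas as pd

# Stand-in data: only the five rows shown by head() above, not the full CSV.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "size": [643.09, 656.22, 487.29, 1504.75, 1275.46],
    "year": [2015, 2009, 2018, 2015, 2009],
    "view": ["No sea view", "No sea view", "Sea view", "No sea view", "Sea view"],
})

# Text columns show up as object or a string dtype, depending on the pandas version
print(sample["view"].dtype)
print(sample["view"].nunique())       # number of distinct values
print(sample["view"].value_counts())  # how often each value occurs

# Map each label to 0/1; any label missing from the dict would become NaN
sample["view_binary"] = sample["view"].map({"Sea view": 1, "No sea view": 0})

# Sanity check: every row was mapped
assert sample["view_binary"].notna().all()
print(sample[["view", "view_binary"]])
```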

# First regression model

## To start, let's create a regression model that uses only the "size" feature to predict "price". First, create a scatter plot to visualize the relationship between size and price.

In [None]:
# Suppose we want to predict price from size
X = property_dataset['size']
y = property_dataset['price']
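A scatter plot of these two variables could be drawn roughly as follows. This is a sketch on the five `head()` rows only (`sample` is a hypothetical stand-in), and the axis labels and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: only the five rows shown by head() above, not the full CSV.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "size": [643.09, 656.22, 487.29, 1504.75, 1275.46],
})

# Predictor on the x-axis, target on the y-axis
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(sample["size"], sample["price"])
ax.set_xlabel("size (square feet)")
ax.set_ylabel("price")
ax.set_title("Price vs. size")
```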

## Based on the scatter plot, does there appear to be a relationship between size and price?

## Calculate the Spearman and Pearson correlation coefficients between size and price. What do these coefficients indicate about the relationship between these two variables?
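Both coefficients are available in `scipy.stats`. Pearson measures linear association, while Spearman works on ranks and measures monotonic association. The sketch below runs on the five `head()` rows only (`sample` is a hypothetical stand-in), so with so few points the p-values carry little meaning.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Stand-in data: only the five rows shown by head() above, not the full CSV.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "size": [643.09, 656.22, 487.29, 1504.75, 1275.46],
})

# Each function returns the coefficient and a two-sided p-value
pearson_r, pearson_p = pearsonr(sample["size"], sample["price"])
spearman_rho, spearman_p = spearmanr(sample["size"], sample["price"])

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

A large gap between the two coefficients can hint at a relationship that is monotonic but not linear, or at outliers that influence Pearson more than Spearman.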

## Now, let's create a linear regression model using "size" as the independent variable to predict "price". Use the statsmodels library to fit the model and display the summary of the regression results.

## We already know about R-squared from the previous class. What is the R-squared value of this model, and what does it indicate about the model's performance?

## Let's evaluate the model using RMSE (Root Mean Squared Error). Calculate the RMSE for this regression model.

In [None]:
#calculate RMSE
#rmse = np.sqrt(np.mean((predictions - y) ** 2))

## Compare the RMSE to the mean price of the properties in the dataset. What does this comparison tell you about the model's predictive accuracy?

In [None]:
# what the RMSE means in this context
# (rmse / mean_price) * 100


# Second regression model with a categorical variable

## Does having a sea view impact the price of the property? Let's investigate this by incorporating the "view" feature into our regression model. Use the chi-squared test to determine if there is a significant association between "view" and "price".

## Is the chi-squared test enough to determine whether there is an association between a categorical and a numeric variable?
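The chi-squared test of independence compares two categorical variables, so a numeric column such as price has to be binned before it can be used. The sketch below splits price at its median, which is an arbitrary assumption made only for illustration, and it runs on the five `head()` rows, far too few for the test to be reliable.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Stand-in data: only the five rows shown by head() above, not the full CSV.
sample = pd.DataFrame({
    "price": [234314.144, 228581.528, 281626.336, 401255.608, 458674.256],
    "view": ["No sea view", "No sea view", "Sea view", "No sea view", "Sea view"],
})

# Assumption: turn price into two categories by splitting at the median
above_median = sample["price"] > sample["price"].median()
price_band = above_median.map(
    {True: "above median", False: "at or below median"}
).rename("price_band")

# Contingency table of counts: view (rows) by price band (columns)
table = pd.crosstab(sample["view"], price_band)
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")

# A complementary look that keeps price numeric: mean price per view group
print(sample.groupby("view")["price"].mean())
```

Binning discards information about how far apart the prices are, and the result depends on where the cut is placed, which is one reason to also look at the numeric comparison.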

## Create a new regression model that includes both "size" and "view" as independent variables to predict "price". Fit the model using statsmodels and display the summary of the regression results.

## Compare the R-squared value of this new model with the previous model that only included "size". How does the inclusion of "view" affect the model's performance?

## Also use RMSE to evaluate this new model. Interpret the results.

## Is this model better than the previous one?

## Interpret the coefficient of the "view" variable in the context of the model. What does this coefficient tell you about the impact of having a sea view on the price of the property?

# Third regression model with all features

## Create a final regression model that includes all three features: "size", "view", and "year". Fit the model using statsmodels and display the summary of the regression results.

## Is this model better than the previous two models? Justify your answer using R-squared and RMSE.

## Interpret the coefficients of all variables in the context of the model. What do these coefficients tell you about the impact of each feature on the price of the property? Create the final equation for predicting price based on size, view, and year.
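In general form, before any fitted numbers are plugged in, the final equation can be written as follows, assuming "view" is coded as 0 for "No sea view" and 1 for "Sea view":

$$
\widehat{\text{price}} = \beta_0 + \beta_1 \cdot \text{size} + \beta_2 \cdot \text{view} + \beta_3 \cdot \text{year}
$$

Here $\beta_1$ is the expected change in price for one additional square foot with view and year held fixed, $\beta_2$ is the expected price difference between a sea-view and a non-sea-view property of the same size and year, and $\beta_3$ is the expected change in price for a property built one year later. The estimated values are read from the `params` attribute of the fitted statsmodels result.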