# <font color="brown">Understanding Matrix Form and sklearn comparison </font>

## <font color = "brown">Problem Statement </font>
Build a model to predict the price of a house from the data provided.

## <font color = "brown">Data Dictionary </font>
* price: Price of house
* area: Area of house
* bedrooms: Number of bedrooms
* bathrooms: Number of bathrooms
* stories: Number of floors
* mainroad: Is the property situated on the main road?
* guestroom: Does the house have a guest room?
* basement: Does the house have a basement?
* hotwaterheating: Does the house have hot water heating?
* airconditioning: Does the house have air conditioning?
* parking: Number of parking spaces available
* prefarea: Is the property in a preferred area?
* furnishingstatus: Type of furnishing of the house

### <font color="blue"> Importing the libraries required </font>

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, root_mean_squared_error
import statsmodels.api as sm

### <font color="blue"> Reading the input file </font>

In [16]:
data = pd.read_csv("Housing.csv")

In [17]:
#Creating a copy of the data
df = data.copy()

**Let us check the first few rows of the data**

In [18]:
df.head()

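| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |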


# <font color="brown">Preparing the data </font>

Creating the independent (feature) and dependent (target) datasets

In [19]:
X = df.drop('price', axis=1)
y = df['price']

Creating dummy variables for the categorical independent variables

In [20]:
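# One-hot encode every object-typed column; drop_first=True drops one level per
# variable so the dummies are not perfectly collinear with the intercept.
X = pd.get_dummies(data=X,
                   columns=X.select_dtypes(include="object").columns.to_list(),
                   drop_first=True,
                   dtype=int)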

X.head()

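| | area | bedrooms | bathrooms | stories | parking | mainroad_yes | guestroom_yes | basement_yes | hotwaterheating_yes | airconditioning_yes | prefarea_yes | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7420 | 4 | 2 | 3 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 8960 | 4 | 4 | 4 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 9960 | 3 | 2 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 7500 | 4 | 2 | 2 | 3 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 7420 | 4 | 1 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |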
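
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The three rows are made-up values that only borrow column names from the housing data, and the categorical columns are listed explicitly instead of being selected by dtype.

```python
import pandas as pd

# Toy data (hypothetical values); column names mirror the housing dataset.
toy = pd.DataFrame({
    "area": [7420, 8960, 9960],
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Each categorical column becomes (number of levels - 1) indicator columns.
# The first level in sorted order ("no", "furnished") is dropped and acts
# as the reference category.
encoded = pd.get_dummies(
    toy,
    columns=["mainroad", "furnishingstatus"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.to_list())
print(encoded)
```

A row with zeros in both `furnishingstatus_*` columns is therefore a furnished house, which is why no `furnishingstatus_furnished` column appears in the encoded data.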


Splitting the data into train (70%) and test (30%) sets

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
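
As a quick illustration of what `test_size=0.3` and `random_state=1` do, the sketch below splits a small synthetic array (not the housing data) twice with the same seed and checks that the two splits agree.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# 30% of the rows go to the test set; the rest go to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` matters here because the matrix calculation and the sklearn model are only comparable if both are fit on exactly the same training rows.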

## <font color = "brown"> Using the matrix method to calculate the coefficients </font>

Adding a constant column to the train and test features so that the model includes an intercept.

In [22]:
X_mat_train = sm.add_constant(X_train)
X_mat_test = sm.add_constant(X_test)
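
With the constant column in place, the next cells compute the ordinary least squares solution in closed form. As a reminder of where that formula comes from (standard OLS algebra, assuming $X^\top X$ is invertible; $X$ stands for `X_mat_train` and $y$ for `y_train`):

$$
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \, \lVert y - X\beta \rVert^2 \\
-2\,X^\top (y - X\hat{\beta}) &= 0 \\
X^\top X \,\hat{\beta} &= X^\top y \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y
\end{aligned}
$$

In the notebook's names, `XtX` is $X^\top X$, `XtX_inv` is its inverse, `Xty` is $X^\top y$, and `beta` is $\hat{\beta}$.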

In [23]:
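# X'X: cross-product of the design matrix (constant column included)
XtX = np.dot(X_mat_train.T, X_mat_train)

In [24]:
# Inverse of X'X
XtX_inv = np.linalg.inv(XtX)

In [25]:
# X'y: cross-product of the design matrix and the target
Xty = np.dot(X_mat_train.T, y_train)

In [26]:
# beta = (X'X)^-1 X'y; beta[0] is the intercept, beta[1:] the feature coefficients
beta = np.dot(XtX_inv, Xty)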

In [27]:
beta

array([ 7.31952766e+04,  2.47035990e+02,  4.08463022e+04,  1.02113955e+06,
        5.25832065e+05,  2.73070484e+05,  4.70516854e+05,  2.74402848e+05,
        5.47653834e+05,  8.17377075e+05,  6.14433328e+05,  5.10839754e+05,
       -6.70279607e+04, -3.96087615e+05])
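
Forming the explicit inverse mirrors the formula, but it is not the only way to get the same coefficients. The sketch below uses synthetic data (not the housing data) and two swapped-in alternatives, `np.linalg.solve` on the normal equations and `np.linalg.lstsq` on the design matrix, which are generally better behaved numerically when the features are close to collinear.

```python
import numpy as np

# Synthetic regression problem: 200 rows, 3 features plus a constant column.
rng = np.random.default_rng(0)
n_rows, n_features = 200, 3
features = rng.normal(size=(n_rows, n_features))
X_design = np.column_stack([np.ones(n_rows), features])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X_design @ true_beta + rng.normal(scale=0.1, size=n_rows)

# 1) Explicit inverse, as in the cells above.
beta_inv = np.linalg.inv(X_design.T @ X_design) @ (X_design.T @ y)

# 2) Solve the normal equations X'X beta = X'y without forming the inverse.
beta_solve = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# 3) Least squares directly on the design matrix.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.allclose(beta_inv, beta_solve))
print(np.allclose(beta_inv, beta_lstsq))
```

On a well-conditioned problem like this one, all three routes agree to floating-point precision; the differences only start to matter when $X^\top X$ is nearly singular.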

## <font color="brown"> Model with sklearn </font>

In [28]:
lr_org = LinearRegression()
lr_org.fit(X_train, y_train)

In [29]:
lr_org.intercept_

np.float64(73195.27662571426)

In [30]:
lr_org.coef_

array([ 2.47035990e+02,  4.08463022e+04,  1.02113955e+06,  5.25832065e+05,
        2.73070484e+05,  4.70516854e+05,  2.74402848e+05,  5.47653834e+05,
        8.17377075e+05,  6.14433328e+05,  5.10839754e+05, -6.70279607e+04,
       -3.96087615e+05])
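
`LinearRegression` fits the intercept separately by default, so `intercept_` corresponds to the first entry of the matrix solution and `coef_` to the remaining entries. A minimal sketch of checking that numerically, using synthetic data and hypothetical variable names rather than the notebook's objects:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 rows, 2 features, known intercept and slopes.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 5.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

# sklearn fit: no constant column needed, fit_intercept=True is the default.
model = LinearRegression().fit(X_demo, y_demo)

# Matrix solution: prepend a column of ones and solve the normal equations.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_demo = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

# Compare within floating-point tolerance instead of testing exact equality.
intercept_matches = bool(np.isclose(beta_demo[0], model.intercept_))
coefs_match = bool(np.allclose(beta_demo[1:], model.coef_))
print(intercept_matches, coefs_match)
```

A tolerance-based check such as `np.allclose` is a compact complement to a side-by-side table, since tiny differences from floating-point rounding are expected between the two methods.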

## <font color="brown"> Comparing the results of matrix calculation and sklearn </font>

In [35]:
feature_names = X_train.columns
matrix_intercept = beta[0]
matrix_coefficients = beta[1:]
sklearn_intercept = lr_org.intercept_
sklearn_coefficients = lr_org.coef_

# Create DataFrame
comparison_df = pd.DataFrame({
    "Variable": ["Intercept"] + list(feature_names),
    "Matrix Calculation": [matrix_intercept] + list(matrix_coefficients),
    "Sklearn Model": [sklearn_intercept] + list(sklearn_coefficients),
    "Difference": [
        matrix_intercept - sklearn_intercept
    ] + list(matrix_coefficients - sklearn_coefficients)
})

pd.options.display.float_format = "{:,.4f}".format
pd.set_option("display.max_columns", None)  # Ensure all columns are displayed
pd.set_option("display.width", 1000)       # Setting a wide display width to prevent wrapping
pd.set_option("display.colheader_justify", "center")  # Center-align headers for better readability

# Display the DataFrame
print(comparison_df)

               Variable              Matrix Calculation  Sklearn Model  Difference
0                         Intercept       73,195.2766      73,195.2766    0.0000  
1                              area          247.0360         247.0360   -0.0000  
2                          bedrooms       40,846.3022      40,846.3022   -0.0000  
3                         bathrooms    1,021,139.5494   1,021,139.5494    0.0000  
4                           stories      525,832.0646     525,832.0646   -0.0000  
5                           parking      273,070.4837     273,070.4837    0.0000  
6                      mainroad_yes      470,516.8540     470,516.8540   -0.0000  
7                     guestroom_yes      274,402.8482     274,402.8482   -0.0000  
8                      basement_yes      547,653.8341     547,653.8341   -0.0000  
9               hotwaterheating_yes      817,377.0748     817,377.0748   -0.0000  
10              airconditioning_yes      614,433.3280     614,433.3280   -0.0000  
11  