**Foreword:** I kept this notebook while I was learning Linear Regression. It is the third part of a three-notebook series that explains the process of applying Linear Regression to a dataset. You can find all parts through the links below:

[**1. First Linear Regression Model**](https://www.kaggle.com/salmankhi/my-first-regression-mode)

[**2. Multiple Variable Linear Regression Model**](https://www.kaggle.com/salmankhi/multiple-linear-regression)

[**3. Linear Regression on Categorical Data**](https://www.kaggle.com/salmankhi/regression-on-categorical-data)

## Regression Analysis on Categorical Data
So far, we have only applied regression analysis to numerical data. But regression can also be applied to categorical data. To do so, we substitute numerical dummy values for the categories.

**For the example,** we will use the same real estate data with an additional 'view' column. The data contains the prices, sizes, years of construction and view options of the houses.

Of course, we can and should use all of the variables as regressors in our analysis. However, because I want to visualize as much as possible, I will not use 'year' as a regressor in this particular example. I will still point out how all three independent variables can be used to predict the price of a given house.

## 1. Importing Libraries

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data manipulation and processing
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # more attractive visualizations
import statsmodels.api as sm  # will help us apply regression model

In [None]:
sns.set()  # (optional), will make all matplotlib visualizations appear in seaborn skins

## 2. Loading Data

In [None]:
raw_data = pd.read_csv("../input/real-estate-price/real_estate_price_size_year_view.csv")
raw_data.head()

In [None]:
raw_data["view"].value_counts()  # bracket access avoids any clash with DataFrame attributes

Here we can see that the 'view' column has only two unique values, which we need to replace with numerical data.

## 3. Replacing Categorical Data with Dummy Values
We can either edit the 'view' column in place or add another column for our dummy values. Here, we will edit the same column, using the map() function to replace 'No sea view' with 0 and 'Sea view' with 1.

In [None]:
data = raw_data.copy()  # making a copy so that our original data remains intact

In [None]:
data["view"] = data["view"].map({"No sea view": 0, "Sea view": 1})
data.head()
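One thing to watch with map(): any label that is missing from the dictionary (a typo, different capitalization, a third category) silently becomes NaN. Below is a minimal sketch of a check, using a small made-up DataFrame. The `toy` frame and the "Sea View" spelling are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the third label is deliberately capitalized differently
toy = pd.DataFrame({"view": ["No sea view", "Sea view", "Sea View"]})

mapping = {"No sea view": 0, "Sea view": 1}
mapped = toy["view"].map(mapping)

# Labels that are not keys of the mapping come back as NaN
unmapped = toy.loc[mapped.isna(), "view"].unique().tolist()
print(unmapped)
```

If the printed list is empty, every label was covered by the mapping.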

## 4. Saving Dependent Variable and Regressors

In [None]:
y = data["price"]  # dependent variable
x1 = data[["size", "view"]]  # use data[["size", "year", "view"]] to include every regressor in the dataset

## 5. Price - Size Scatter Plot

In [None]:
plt.scatter(data["size"], y)  # size on the x-axis, price on the y-axis, matching the labels below
plt.xlabel("size", fontsize = 20)
plt.ylabel("price", fontsize = 20)
plt.show()

## 6. Applying Regression

In [None]:
x = sm.add_constant(x1)
result = sm.OLS(y, x).fit()
result.summary()

**The summary tells us:**
- b0 = 7.748e+04
- b1 = 218.7521
- b2 = 5.756e+04
- P>|t| for all of the coefficients is 0.000, meaning each regressor is statistically significant in explaining the variability of 'y'
- The p-value of the F-statistic is very small, so we can reject the H0 that b1 = b2 = 0

We would have a b3 as well if we had included 'year' alongside these two regressors (i.e. 'size', 'view').

## 7. Substituting values in Multiple Linear Regression Equation

We can find the price of a house for a given 'size' and 'view' using the following equation.
<center>y_hat = 7.748e+04 + (218.7521 * "size") + (5.756e+04 * "view")</center>

For example, for a house of size = 500 with no sea view, the predicted price is about 186856,

for a house of size = 700 with a sea view, the predicted price is about 288166,

and for a house of size = 300 with a sea view, the predicted price is about 200666.
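The arithmetic behind these three figures can be sketched as a small function. It uses the rounded coefficients from the summary, so results may differ slightly from what the fitted model itself would return. The name `predict_price` is just an illustrative helper.

```python
# Rounded coefficients as read from the regression summary
B0, B1, B2 = 7.748e+04, 218.7521, 5.756e+04

def predict_price(size, view):
    """Return the predicted price for a size and a 0/1 sea-view flag."""
    return B0 + B1 * size + B2 * view

# The three example houses: (size, view)
for size, view in [(500, 0), (700, 1), (300, 1)]:
    print(size, view, round(predict_price(size, view), 2))
```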

## 8. Visualization of Results

With the equation in hand, we can also split it into two equations for predicting house prices, one for each view option:

- With Sea View: y_hat = 135040 + (218.7521 * "size") (As view = 1 here, and 7.748e+04 + 5.756e+04 = 135040)
- Without Sea View: y_hat = 7.748e+04 + (218.7521 * "size") (As view = 0 here)

In [None]:
plt.scatter(data["size"], y)
y_hat_sv = 135040 + (218.7521 * data["size"])
y_hat_nsv = 7.748e+04 + (218.7521 * data["size"])
fig = plt.plot(data["size"], y_hat_sv, lw = 2, c = "orange")
fig = plt.plot(data["size"], y_hat_nsv, lw = 2, c = "green")
plt.xlabel("size", fontsize = 20)
plt.ylabel("price", fontsize = 20)
plt.show()

- orange line shows regression results for houses with the sea view
- green line shows regression results for houses without the sea view

Both lines have the same slope but different intercepts.
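To see why a 0/1 dummy only shifts the intercept, here is a sketch on made-up, noise-free data. It uses `numpy.linalg.lstsq` rather than statsmodels, purely to keep the example small. The numbers are invented and have nothing to do with the real estate dataset.

```python
import numpy as np

# Made-up, noise-free data: price = 1000 + 2*size + 500*view
size = np.array([10.0, 20.0, 30.0, 40.0, 15.0, 25.0, 35.0, 45.0])
view = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
price = 1000 + 2 * size + 500 * view

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(size), size, view])
(b0, b1, b2), *_ = np.linalg.lstsq(X, price, rcond=None)

intercept_no_view = b0    # view = 0: the dummy term drops out
intercept_view = b0 + b2  # view = 1: the dummy coefficient adds to the intercept
print(b1, intercept_no_view, intercept_view)
```

The slope `b1` is shared by both groups, and the dummy coefficient `b2` is exactly the vertical gap between the two lines.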

In [None]:
plt.scatter(data["size"], y, c=data['view'],cmap='RdYlGn_r')
y_hat_sv = 135040 + (218.7521 * data["size"])
y_hat_nsv = 7.748e+04 + (218.7521 * data["size"])
fig = plt.plot(data["size"], y_hat_sv, lw = 2, c = "orange")
fig = plt.plot(data["size"], y_hat_nsv, lw = 2, c = "green")
plt.xlabel("size", fontsize = 20)
plt.ylabel("price", fontsize = 20)
plt.show()

Houses with and without sea views are separated by color here, and it can be clearly observed that, for the same size, houses with a sea view are more expensive than houses without one.

## 9. Making Predictions
So far, we have predicted values by manually plugging a given house into the equation. But we can also use the predict() method of the fitted statsmodels result to make predictions from the regression model we have built.

In [None]:
x.head(2)  # our regressor and constant that we added before applying sm.OLS()

const is simply a column of 1s that multiplies b0; it does not change anything, as 1*b0 = b0.

We will create a DataFrame of new houses whose prices we want to predict using our model. This DataFrame must have a column for the constant and for every regressor (i.e. 'const', 'size' and 'view'), in the same order as in x.

In [None]:
new_houses = pd.DataFrame({'const': 1,'size': [400, 600, 800, 1000], 'view': [0, 1, 1, 0]}, index = range(4))
new_houses

In [None]:
new_house_prices = result.predict(new_houses)  # apply the fitted model to the new houses (saved in new_houses)
new_house_prices

- result is the variable in which we saved the output of sm.OLS().fit()

- predict() is the method that applies the model to predict the dependent variable; it takes the new DataFrame (holding the regressors' values) as its argument

- the statement returns a Series holding the predicted prices of the given houses at the corresponding indices
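For a linear model like this one, each prediction is the sum of every coefficient multiplied by the matching column value. The sketch below reproduces that by hand with the rounded coefficients from the summary, so the results can differ slightly from what result.predict() returns. The names `params` and `manual_prices` are illustrative.

```python
import pandas as pd

# Rounded coefficients from the summary, labeled like the columns of x
params = pd.Series({"const": 7.748e+04, "size": 218.7521, "view": 5.756e+04})

new_houses = pd.DataFrame(
    {"const": 1, "size": [400, 600, 800, 1000], "view": [0, 1, 1, 0]}
)

# Row-wise dot product: const*b0 + size*b1 + view*b2
manual_prices = new_houses.dot(params)
print(manual_prices)
```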

**Next,** we can create a new houses DataFrame containing everything (i.e. all the independent variables and the predicted prices of the houses). For that, we will convert new_house_prices from a Series to a DataFrame, because join() accepts a Series only if it has a name, and wrapping it in a DataFrame gives the new column a label.

In [None]:
new_houses_prices = pd.DataFrame({"Predicted Prices": new_house_prices})
new_houses.join(new_houses_prices)
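As an alternative sketch, DataFrame.join() can also take a Series directly, provided the Series has a name, so rename() is enough to attach the predictions. The frame and the price values below are placeholders for illustration only, not model output.

```python
import pandas as pd

new_houses = pd.DataFrame({"size": [400, 600], "view": [0, 1]})

# Placeholder predictions; these numbers are illustrative only
predicted = pd.Series([165000.0, 266000.0])

# rename() gives the Series a name, which join() uses as the column label
combined = new_houses.join(predicted.rename("Predicted Prices"))
print(combined)
```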

That is how regression is applied to categorical data.