# REAL ESTATE SALES ANALYSIS 

# Project Goal

This project uses multiple linear regression to analyze house sales in a northwestern county.


## OVERVIEW

- This notebook gives a brief overview of a business that conducts regression analysis on house sales.
- It explains why regression analysis matters for understanding the factors that influence house prices, such as waterfront location and views, and for predicting future trends.
- It highlights the value the business brings to clients through data-driven insights and accurate predictions.
- It includes a market analysis to identify the target market for the regression analysis services.
- It analyzes the demand for real estate market insights and the need for accurate predictions.

# 1. Business Understanding

## a.) Introduction 

- The business specializes in providing regression analysis services for house sales. It understands the importance of utilizing data-driven insights to understand the factors that influence house prices and predict future trends accurately. By conducting thorough market analysis, the business identifies the target market and the demand for real estate market insights and predictions.

- Overall, the business understands the significance of regression analysis in the context of house sales. It leverages data, statistical expertise, and predictive modeling to provide valuable insights to clients, enabling them to make informed decisions in the dynamic real estate market of the northwestern county.

## b.) Problem Statement 

- The real estate industry faces the challenge of accurately understanding the factors that influence house prices and predicting future trends. Many stakeholders, including buyers, sellers, investors, and lenders, seek reliable insights to make informed decisions. However, the complexity of the market and the multitude of variables involved make it difficult to obtain accurate predictions and data-driven insights.
- There is a need for a specialized business that comprehends the intricacies of house sale regression analysis. Such a business should possess a deep understanding of the real estate market, employ robust methodologies for data collection and analysis, develop accurate regression models, and deliver clear and understandable reports to clients.

- By addressing these challenges, the business can provide reliable predictions of house prices, identify significant factors influencing the market, and offer actionable insights to clients. This will empower stakeholders to make informed decisions, mitigate risks, optimize investments, and maximize returns in the dynamic real estate industry.

## c.) Main Objective 

The primary focus is on delivering value to clients by:

- Predicting House Prices: Developing robust regression models that consider various factors such as location, size, amenities, and market trends to accurately predict house prices. The objective is to provide clients with reliable estimates of property values, empowering them to make informed buying, selling, or investment decisions.

- Identifying Market Influencers: Analyzing the data to identify significant factors that impact the real estate market, such as economic indicators, neighborhood characteristics, interest rates, and supply and demand dynamics. The objective is to help clients understand the key drivers of property values and recognize emerging trends.

- Ensuring Data Accuracy and Reliability: Implementing robust data collection, cleaning, and preprocessing methodologies to ensure the accuracy and reliability of the data used in the regression analysis. The objective is to provide clients with reliable and trustworthy insights to support their decision-making processes.

- Building Strong Client Relationships: Prioritizing client needs and fostering strong relationships to understand their specific requirements and tailor analysis and insights accordingly. The objective is to provide personalized services and cultivate long-term collaborations and repeat business.

## d.) Specific Objective

- Providing Actionable Insights: Presenting findings and insights in clear and understandable reports, customized based on client requirements. The objective is to provide clients with actionable recommendations that enable them to optimize their real estate strategies, mitigate risks, and maximize returns.

## e.) Experimental Design

1. Data Collection
2. Data Cleaning
3. Preparing the Training Data
4. Modelling and Analysis
5. Conclusions and Recommendations
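
The steps above can be sketched end to end on a tiny synthetic dataset. This is a minimal illustration under stated assumptions, not the project's actual model: the prices are generated from a made-up formula, only the column names `sqft_living` and `bedrooms` are borrowed from the dataset, and NumPy's least-squares solver stands in for `statsmodels` to keep the example small.

```python
import numpy as np
import pandas as pd

# 1. Data collection: a synthetic stand-in for the real house sales data.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
})
# Hypothetical relationship, used only to generate example prices.
toy["price"] = (
    50_000
    + 200 * toy["sqft_living"]
    + 10_000 * toy["bedrooms"]
    + rng.normal(0, 5_000, n)
)

# 2. Data cleaning: drop rows with missing values (none here, shown for completeness).
toy = toy.dropna()

# 3. Hold out the last 20% of rows for testing.
train, test = toy.iloc[:160], toy.iloc[160:]


def design_matrix(frame):
    """Return the predictors with a leading column of ones for the intercept."""
    X = frame[["sqft_living", "bedrooms"]].to_numpy()
    return np.column_stack([np.ones(len(frame)), X])


# 4. Modelling: ordinary least squares fitted on the training rows.
coef, *_ = np.linalg.lstsq(design_matrix(train), train["price"].to_numpy(), rcond=None)

# 5. Evaluation: R-squared on the held-out rows.
actual = test["price"].to_numpy()
pred = design_matrix(test) @ coef
r_squared = 1 - np.sum((actual - pred) ** 2) / np.sum((actual - actual.mean()) ** 2)
print(coef, r_squared)
```

Because the synthetic prices are almost perfectly linear in the predictors, the fitted slope for `sqft_living` lands close to the value of 200 used to generate them. Real sales data will be far noisier.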

# 2. Data Understanding 

This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder of this Git repository. The description of the column names can be found in column_names.md in the same folder. As with most real-world datasets, the columns are not perfectly described, so we will work out what they mean as we go.

We will ignore some or all of the following features:

- date
- view
- sqft_above
- sqft_basement
- yr_renovated
- zipcode
- lat
- long
- sqft_living15
- sqft_lot15

## a.) Importing Data

The libraries used in the analysis are imported below.

In [32]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

## b.) Loading Data 

Next, let's load the house sales data for the northwestern county.

In [33]:
# Relative path: the CSV lives in the repository's data folder.
data = pd.read_csv("data/kc_house_data.csv")
data.head(10)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503
5,7237550310,5/12/2014,1230000.0,4,4.5,5420,101930,1.0,NO,NONE,...,11 Excellent,3890,1530.0,2001,0.0,98053,47.6561,-122.005,4760,101930
6,1321400060,6/27/2014,257500.0,3,2.25,1715,6819,2.0,NO,NONE,...,7 Average,1715,?,1995,0.0,98003,47.3097,-122.327,2238,6819
7,2008000270,1/15/2015,291850.0,3,1.5,1060,9711,1.0,NO,,...,7 Average,1060,0.0,1963,0.0,98198,47.4095,-122.315,1650,9711
8,2414600126,4/15/2015,229500.0,3,1.0,1780,7470,1.0,NO,NONE,...,7 Average,1050,730.0,1960,0.0,98146,47.5123,-122.337,1780,8113
9,3793500160,3/12/2015,323000.0,3,2.5,1890,6560,2.0,NO,NONE,...,7 Average,1890,0.0,2003,0.0,98038,47.3684,-122.031,2390,7570


In [34]:
data.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,17755.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,4580474000.0,540296.6,3.3732,2.115826,2080.32185,15099.41,1.494096,1788.596842,1970.999676,83.636778,98077.951845,47.560093,-122.213982,1986.620318,12758.283512
std,2876736000.0,367368.1,0.926299,0.768984,918.106125,41412.64,0.539683,827.759761,29.375234,399.946414,53.513072,0.138552,0.140724,685.230472,27274.44195
min,1000102.0,78000.0,1.0,0.5,370.0,520.0,1.0,370.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,322000.0,3.0,1.75,1430.0,5040.0,1.0,1190.0,1951.0,0.0,98033.0,47.4711,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,1560.0,1975.0,0.0,98065.0,47.5718,-122.231,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10685.0,2.0,2210.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,9410.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0
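
In the summary above, the `count` for `yr_renovated` (17,755) is lower than for the other columns (21,597), which points to missing values. `sqft_basement` is absent altogether, which suggests pandas read it as text rather than numbers. Both can be checked directly, as sketched below on a small hand-made frame. The rows are illustrative values modelled on the preview, not a reload of the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with the same kinds of problems:
# NaN in waterfront / yr_renovated and a '?' placeholder in sqft_basement.
sample = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 257500.0],
    "waterfront": [np.nan, "NO", "NO", "NO"],
    "sqft_basement": ["0.0", "400.0", "0.0", "?"],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

# Number of missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Non-numeric columns are the ones describe() leaves out by default.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

Note that the `'?'` entry is not counted as missing, because pandas sees it as an ordinary string.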


## c.) Data Cleaning

### i.) Handling Missing Values

Some values are recorded as a question mark (for example in sqft_basement), so they have to be replaced.

In [35]:
data.replace('?', 0.0, inplace=True)
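
Replacing `'?'` with `0.0` treats an unknown value as zero, which is an assumption worth keeping in mind. The column is also likely to remain text afterwards, because its other entries were read as strings. One alternative, sketched here on hypothetical values, is `pd.to_numeric` with `errors="coerce"`, which converts the column to numbers and turns unparseable entries into `NaN` so they can be handled explicitly.

```python
import pandas as pd

# Hypothetical values mimicking sqft_basement as read from the CSV.
basement = pd.Series(["0.0", "400.0", "?", "910.0"], name="sqft_basement")

# errors="coerce" converts anything that cannot be parsed as a number into NaN.
basement_numeric = pd.to_numeric(basement, errors="coerce")

print(basement_numeric)
print("unparseable entries:", int(basement_numeric.isna().sum()))
```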

Now let's deal with the missing values by dropping the rows that contain them, and then any columns that still contain them.

In [36]:
data = data.dropna()
data.head(30)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503
5,7237550310,5/12/2014,1230000.0,4,4.5,5420,101930,1.0,NO,NONE,...,11 Excellent,3890,1530.0,2001,0.0,98053,47.6561,-122.005,4760,101930
6,1321400060,6/27/2014,257500.0,3,2.25,1715,6819,2.0,NO,NONE,...,7 Average,1715,0.0,1995,0.0,98003,47.3097,-122.327,2238,6819
8,2414600126,4/15/2015,229500.0,3,1.0,1780,7470,1.0,NO,NONE,...,7 Average,1050,730.0,1960,0.0,98146,47.5123,-122.337,1780,8113
9,3793500160,3/12/2015,323000.0,3,2.5,1890,6560,2.0,NO,NONE,...,7 Average,1890,0.0,2003,0.0,98038,47.3684,-122.031,2390,7570
11,9212900260,5/27/2014,468000.0,2,1.0,1160,6000,1.0,NO,NONE,...,7 Average,860,300.0,1942,0.0,98115,47.69,-122.292,1330,6000
13,6054650070,10/7/2014,400000.0,3,1.75,1370,9680,1.0,NO,NONE,...,7 Average,1370,0.0,1977,0.0,98074,47.6127,-122.045,1370,10208
14,1175000570,3/12/2015,530000.0,5,2.0,1810,4850,1.5,NO,NONE,...,7 Average,1810,0.0,1900,0.0,98107,47.67,-122.394,1360,4850


In [37]:
data = data.dropna(axis=1)
data

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.7210,-122.319,1690,7639
3,2487200875,12/9/2014,604000.0,4,3.00,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.00,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503
5,7237550310,5/12/2014,1230000.0,4,4.50,5420,101930,1.0,NO,NONE,...,11 Excellent,3890,1530.0,2001,0.0,98053,47.6561,-122.005,4760,101930
6,1321400060,6/27/2014,257500.0,3,2.25,1715,6819,2.0,NO,NONE,...,7 Average,1715,0.0,1995,0.0,98003,47.3097,-122.327,2238,6819
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21591,2997800021,2/19/2015,475000.0,3,2.50,1310,1294,2.0,NO,NONE,...,8 Good,1180,130.0,2008,0.0,98116,47.5773,-122.409,1330,1265
21592,263000018,5/21/2014,360000.0,3,2.50,1530,1131,3.0,NO,NONE,...,8 Good,1530,0.0,2009,0.0,98103,47.6993,-122.346,1530,1509
21593,6600060120,2/23/2015,400000.0,4,2.50,2310,5813,2.0,NO,NONE,...,8 Good,2310,0.0,2014,0.0,98146,47.5107,-122.362,1830,7200
21594,1523300141,6/23/2014,402101.0,2,0.75,1020,1350,2.0,NO,NONE,...,7 Average,1020,0.0,2009,0.0,98144,47.5944,-122.299,1020,2007
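
`dropna()` with no arguments removes every row that has at least one missing value, so a follow-up `dropna(axis=1)` finds nothing left to drop. The sketch below illustrates this on a hypothetical frame. It also shows the `subset` argument, which limits the check to chosen columns and can preserve more rows; whether that trade-off suits this project is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in 'waterfront' and 'yr_renovated' only.
frame = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "waterfront": [np.nan, "NO", "NO", "NO"],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

rows_dropped = frame.dropna()             # removes rows 0 and 2
cols_after = rows_dropped.dropna(axis=1)  # no NaN remains, so no column is removed
print(rows_dropped.shape, cols_after.shape)

# Only require 'waterfront' to be present, so row 2 is kept.
subset_dropped = frame.dropna(subset=["waterfront"])
print(subset_dropped.index.tolist())
```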


### ii.) Remove Unnecessary Columns

The missing values are now handled. Next, let's remove the columns we chose to ignore, so that only the features used for modelling remain.


In [52]:
print(data.columns)

Index(['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'condition', 'grade', 'yr_built', 'lat'],
      dtype='object')


In [54]:
# Specify the columns to be removed
remove_columns = ['date', 'view', 'sqft_above', 'sqft_basement', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

# Remove existing columns
existing_columns = [col for col in remove_columns if col in data.columns]
data.drop(columns=existing_columns, inplace=True)

In [57]:
data.head(20)

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,grade,yr_built
1,6414100192,538000.0,3,2.25,2570,7242,2.0,NO,Average,7 Average,1951
3,2487200875,604000.0,4,3.0,1960,5000,1.0,NO,Very Good,7 Average,1965
4,1954400510,510000.0,3,2.0,1680,8080,1.0,NO,Average,8 Good,1987
5,7237550310,1230000.0,4,4.5,5420,101930,1.0,NO,Average,11 Excellent,2001
6,1321400060,257500.0,3,2.25,1715,6819,2.0,NO,Average,7 Average,1995
8,2414600126,229500.0,3,1.0,1780,7470,1.0,NO,Average,7 Average,1960
9,3793500160,323000.0,3,2.5,1890,6560,2.0,NO,Average,7 Average,2003
11,9212900260,468000.0,2,1.0,1160,6000,1.0,NO,Good,7 Average,1942
13,6054650070,400000.0,3,1.75,1370,9680,1.0,NO,Good,7 Average,1977
14,1175000570,530000.0,5,2.0,1810,4850,1.5,NO,Average,7 Average,1900


The data now contains only the columns relevant for modelling.

### iii.) Handling Categorical Values

In [58]:
# Encode the text-valued columns; 'categorical_variable' was a placeholder, not a real column.
encoded_data = pd.get_dummies(data, columns=['waterfront', 'condition', 'grade'])
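
`pd.get_dummies` needs the names of columns that actually exist in the frame. In the preview above, `waterfront`, `condition` and `grade` hold text, so they are the natural candidates. The sketch below works on a hypothetical frame (the `"YES"` waterfront label is assumed, since the preview only shows `"NO"`). It makes two illustrative choices that are not the notebook's own: `grade` is kept as an ordinal integer taken from its leading number, and `drop_first=True` drops one level per variable so the dummies are not perfectly collinear with the regression intercept.

```python
import pandas as pd

# Hypothetical rows using category labels like those in the preview above.
houses = pd.DataFrame({
    "price": [538000.0, 604000.0, 510000.0, 468000.0],
    "waterfront": ["NO", "NO", "YES", "NO"],  # "YES" is an assumed label
    "condition": ["Average", "Very Good", "Average", "Good"],
    "grade": ["7 Average", "7 Average", "8 Good", "7 Average"],
})

# 'grade' starts with a number, so it can be kept as an ordinal integer.
houses["grade"] = houses["grade"].str.split().str[0].astype(int)

# One-hot encode the remaining text columns, dropping the first level of each.
encoded = pd.get_dummies(houses, columns=["waterfront", "condition"], drop_first=True)
print(encoded.columns.tolist())
```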

Let's check whether price is correlated with the other variables.
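
One way to do this is `DataFrame.corr`, which computes pairwise Pearson correlations between numeric columns. The sketch below uses only the first five rows of the cleaned preview above rather than the full dataset, so the resulting numbers are illustrative and should not be read as findings.

```python
import pandas as pd

# First five rows of the cleaned preview above (numeric columns only).
preview = pd.DataFrame({
    "price": [538000.0, 604000.0, 510000.0, 1230000.0, 257500.0],
    "bedrooms": [3, 4, 3, 4, 3],
    "bathrooms": [2.25, 3.0, 2.0, 4.5, 2.25],
    "sqft_living": [2570, 1960, 1680, 5420, 1715],
})

# Pearson correlation of every other column with price, strongest first.
price_corr = preview.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

On the full dataset the same call would be applied to the numeric columns of `data`, after the text columns have been encoded or set aside.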