## Final Project Submission

* Student names: Amos Kipkirui, Brian Muli, Emilly Njue, Swaleh Athuman, Samwel Kagwi
* Student pace: Full time
* Scheduled project review date/time: 21/04/2023
* Instructor name: 
* Blog post URL: https://github.com/swalehmwadime/dsc-phase-2-project-v2-3.git


# BUSINESS UNDERSTANDING

### INTRODUCTION

The real estate market is a dynamic and ever-changing industry, and accurate prediction of housing prices is crucial for both buyers and sellers. In order to make informed decisions, stakeholders in the real estate industry need access to reliable and comprehensive data.

The King County House Sales dataset is a valuable resource for understanding the dynamics of the real estate market in King County. This dataset contains detailed information on house sale prices including a wide range of features such as the number of bedrooms, bathrooms, square footage, location, and more. 

This dataset allows for in-depth analysis and modeling to understand the factors that influence housing prices in the region, and serves as a valuable resource for developing and testing predictive models for accurate price predictions.


We will provide an overview of the King County House Sales dataset, including its key features, data quality, and potential use cases. We will also highlight the significance of this dataset for evaluating regression models to predict housing prices in King County, and the potential benefits it can offer to stakeholders in the real estate industry. 

### Business Problem
A real estate agency located in King County wants to advise homeowners on how home renovations might increase the value of their homes, and by what amount. The agency plans to use the King County House Sales dataset to recommend the renovations most likely to raise a home's sale price.

### PROBLEM STATEMENT
To aid in making these recommendations, we will attempt to answer these questions:

1. How well can multiple linear regression predict house sale prices in King County?

2. What are the key factors that significantly impact housing prices in King County?

3. How does the number of bedrooms, bathrooms, and square footage of a house correlate with its sale price in King County?

4. How does the overall grade of a house, year built, and year renovated affect its sale price in King County?

# Data Understanding
This project uses the King County House Sales dataset.

In [1]:
#Importing required libraries

import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
# Set the Seaborn style
sns.set_style("darkgrid")

import statsmodels.api as sm

In [3]:
data=pd.read_csv('kc_house_data.csv')
data

FileNotFoundError: [Errno 2] No such file or directory: 'kc_house_data.csv'

In [4]:
import pandas as pd

def analyze(data):
    
    # Check the shape (rows and columns) of the DataFrame
    print("Shape of the DataFrame:")
    print(data.shape)

    # Display the first few rows of the DataFrame
    print("\nFirst few rows of the DataFrame:")
    print(data.head())

    # Check the data types of each column in the DataFrame
    print("\nData types of columns in the DataFrame:")
    print(data.dtypes)

    # Check for missing values in the DataFrame
    print("\nMissing values in the DataFrame:")
    print(data.isnull().sum())

    # Drop columns with missing values (note: this modifies `data` in place)
    data.dropna(axis=1, inplace=True)

    # Check for duplicate rows in the DataFrame
    print("\nDuplicate rows in the DataFrame:")
    print(data.duplicated().sum())

    # Check for unique values in each column of the DataFrame
    print("\nUnique values in each column of the DataFrame:")
    for col in data.columns:
        unique_values = data[col].nunique()
        print(f"{col}: {unique_values}")

    # Check value counts of categorical columns in the DataFrame
    print("\nValue counts of categorical columns:")
    for col in data.columns:
        if data[col].dtype == 'object':
            print(f"{col}:")
            print(data[col].value_counts())
            print()
            
analyze(data)


NameError: name 'data' is not defined

In [None]:
data.describe()

From the description we can see that:

- The average price of a house sold in King County is about $540,297.

- The maximum price of a house sold in King County is $7,700,000.

- The average house sold in King County has about 3 bedrooms and 2 bathrooms; one house has 33 bedrooms.

- The average living area is 2,080 sq ft; the largest house has 13,540 sq ft of living area.

In [None]:
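# Restrict the correlation to numeric columns; non-numeric columns cannot be correlated
pd.DataFrame(data.corr(numeric_only=True)['price']).sort_values(by='price',ascending=False)

In [None]:
sns.heatmap(data.corr(numeric_only=True),annot=True,cmap='Blues')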

In [None]:
data = data[['price', 'sqft_living', 'sqft_living15', 'bathrooms', 'bedrooms']]

y = data['price']

X = data[['sqft_living']]

data

In [None]:
data.plot.scatter(x="sqft_living", y="price");

In [None]:
model = sm.OLS(y, sm.add_constant(X))
fit = model.fit()

print(fit.summary())

### Interpretation of the model

* The model is statistically significant overall, with an F-statistic p-value well below 0.05

* The model explains about 49% of the variance in price

* The model coefficients (`const` and `sqft_living`) are both statistically significant, with t-statistic p-values well below 0.05

* The intercept (`const`) is about -43,990: this is the price the model would predict for a house with a "sqft_living" of 0. No house has zero living area, so the value has no practical interpretation.

* For each increase of 1 in "sqft_living", we see an associated increase in price of about $280.
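
To make the interpretation above concrete, the sketch below plugs the rounded estimates quoted in the bullets (about -43,990 for `const` and about 280 per square foot) into the fitted line. The function name `predict_price_simple` is hypothetical, and the numbers are the rounded values from the text rather than a re-fitted model.

```python
# Rounded estimates quoted in the interpretation above (not re-fitted here).
INTERCEPT = -43_990  # `const` term, in dollars
SLOPE = 280          # dollars per additional square foot of living area


def predict_price_simple(sqft_living: float) -> float:
    """Return the fitted price for a given living area under the one-predictor model."""
    return INTERCEPT + SLOPE * sqft_living


# 2,080 sq ft is the average living area reported in the data description.
average_home = predict_price_simple(2080)
larger_home = predict_price_simple(2180)

print(f"Predicted price at 2,080 sq ft: ${average_home:,.0f}")
print(f"Added value of 100 extra sq ft: ${larger_home - average_home:,.0f}")
```

The difference between the two predictions depends only on the slope, which is why the slope, not the intercept, is the quantity of interest for renovation advice.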

In [None]:
X_update = data[['sqft_living', 'sqft_living15', 'bathrooms', 'bedrooms']]

data

In [None]:
second_model = sm.OLS(y, sm.add_constant(X_update))
model_fit = second_model.fit()

print(model_fit.summary())

### **Initial model interpretation**
* The model explains about `51%` of the variance in `price`, up from `49%` for the first model, which used a single predictor (`sqft_living`).

* The model is `statistically significant` overall, with an F-statistic p-value well below 0.05.

* The model coefficients (`const`, `sqft_living`, `sqft_living15`, `bathrooms`, `bedrooms`) are all statistically significant, with t-statistic p-values well below 0.05.

* If a house had `0 sqft_living, 0 sqft_living15, 0 bathrooms and 0 bedrooms` we would expect the price to be about $21,240

* For each increase of `1 sqft in sqft_living`, we see an associated increase in price of about $275

* For each increase of `1 sqft in sqft_living15`, we see an associated increase in price of about $60

* For each increase of `1 bathroom`, we see an associated increase in price of about $7,230

* For each increase of `1 bedroom` , we see an associated decrease in price of about $55,580
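
As a rough illustration of how the coefficients above combine, the sketch below adds up the rounded values quoted in the bullets for a hypothetical house. The helper `predict_price_multi` and the example house are assumptions made for illustration; nothing here is re-fitted from the data.

```python
# Rounded coefficients quoted in the interpretation above (not re-fitted here).
COEFFICIENTS = {
    "const": 21_240,
    "sqft_living": 275,
    "sqft_living15": 60,
    "bathrooms": 7_230,
    "bedrooms": -55_580,
}


def predict_price_multi(house: dict) -> float:
    """Sum the intercept and each coefficient times its feature value."""
    price = COEFFICIENTS["const"]
    for feature, value in house.items():
        price += COEFFICIENTS[feature] * value
    return price


# Hypothetical house, chosen only for illustration.
base = {"sqft_living": 2000, "sqft_living15": 1800, "bathrooms": 2, "bedrooms": 3}
# The same house with one more bedroom and every other feature unchanged.
extra_bedroom = {**base, "bedrooms": 4}

print(f"Base prediction:          ${predict_price_multi(base):,.0f}")
print(f"With one more bedroom:    ${predict_price_multi(extra_bedroom):,.0f}")
```

The negative bedroom coefficient applies with living area held fixed, so one plausible reading is that an extra bedroom within the same floor space means smaller rooms. It should not be read as evidence that adding a bedroom by extending the house lowers its value.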




In [None]:
X_metric = X_update.copy()

# Convert the area columns from square feet to square meters (1 sq ft = 0.092903 sq m);
# bathrooms and bedrooms are counts and are left unchanged
sqft_cols = ['sqft_living', 'sqft_living15']
X_metric[sqft_cols] = X_metric[sqft_cols] * 0.092903

X_metric

In [None]:
metric_model = sm.OLS(y, sm.add_constant(X_metric)).fit()
print(metric_model.summary())
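
Converting square feet to square meters only rescales a predictor, so the fit itself should not change: the slope is divided by the conversion factor, while the intercept and R-squared stay the same. The sketch below checks this on a small made-up sample with a hand-written one-predictor least-squares fit. The helper `fit_line` and the data are assumptions for illustration and are not taken from the King County dataset.

```python
SQFT_TO_SQM = 0.092903  # conversion factor used in the notebook


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up living areas (sq ft) and prices, for illustration only.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [250_000, 390_000, 500_000, 640_000, 810_000]

# The same areas expressed in square meters.
sqm = [value * SQFT_TO_SQM for value in sqft]

b0_ft, b1_ft, r2_ft = fit_line(sqft, price)
b0_m, b1_m, r2_m = fit_line(sqm, price)

print(f"per sq ft: slope={b1_ft:.2f}, intercept={b0_ft:.0f}, R^2={r2_ft:.4f}")
print(f"per sq m : slope={b1_m:.2f}, intercept={b0_m:.0f}, R^2={r2_m:.4f}")
```

In the metric model summary, the same pattern would be expected: only the `sqft_living` and `sqft_living15` coefficients change, each by a factor of 1 / 0.092903.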