# Housing Sales Price Study Notebook

## Objectives
*   Answer business requirement 1: 
    * The client wants to understand the patterns in the house attributes and learn which variables are most strongly correlated with SalePrice.
    * Visualize the relevant variables against the SalePrice.

## Inputs

* outputs/datasets/collection/HousePricing.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App

## Additional

* The study is performed on the raw dataset. The dataset is studied before and after any changes.

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')


---

# Load Data

Load the dataset and inspect its columns, data types and non-null counts

In [None]:
import pandas as pd
#df = pd.read_csv("outputs/datasets/collection/HousePricing.csv")
df = pd.read_csv("outputs/datasets/HousePricing.csv")
df.info()

# Data Exploration

We want to become more familiar with the dataset: check variable types and distributions, missing levels, and what these variables mean in a business context

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation Study

Correlation analysis function

In [None]:
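def Correlation(df, method, key=None, ascending=True):
    # Correlate every column with SalePrice and drop SalePrice's self-correlation by name,
    # so the result does not depend on where SalePrice lands after sorting
    correlation = df.corr(method=method)['SalePrice'].drop('SalePrice')

    # key must be a callable (e.g. abs) or None; key=False raises a TypeError
    return correlation.sort_values(key=key, ascending=ascending)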
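
To illustrate what ranking by absolute correlation does, here is a minimal sketch on made-up data. The helper name `rank_by_target_correlation` and the toy values are assumptions for illustration only and are not part of the notebook or the dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from HousePricing.csv)
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "GrLivArea": [800, 1000, 1500, 1600, 2100],  # rises with price
    "Unfinished": [1, 1, 0, 1, 0],               # tends to fall with price
    "Noise": [3, 1, 4, 1, 5],
})

def rank_by_target_correlation(frame, target="SalePrice", method="spearman"):
    """Hypothetical helper: correlations with `target`, strongest first by magnitude."""
    corr = frame.corr(method=method)[target].drop(target)
    # key=abs sorts by magnitude, so strong negative correlations are not pushed to the bottom
    return corr.sort_values(key=abs, ascending=False)

ranked = rank_by_target_correlation(toy)
print(ranked)
```

Sorting by magnitude keeps strongly negative features near the top of the list, which is why negative one-hot columns can appear among the top 20.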

First we replace `NaN` values with the placeholder 'Missing' so that we can apply the one-hot encoder

In [None]:
ohe = df.fillna('Missing', inplace=False)

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=ohe.columns[ohe.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(ohe)
df_ohe.head(3)
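
A side effect of filling every `NaN` with the string 'Missing' is that a numeric column containing missing values stops being numeric and is then one-hot encoded value by value. The sketch below shows this on invented data; `pd.get_dummies` is used here only as a stand-in for the feature_engine `OneHotEncoder`, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame, values invented for illustration
toy = pd.DataFrame({
    "MasVnrArea": [0.0, 120.0, np.nan, 0.0],
    "SalePrice": [120000, 210000, 150000, 135000],
})

filled = toy.fillna("Missing")
# MasVnrArea now mixes floats with the string 'Missing', so it is no longer a numeric column
print(filled.dtypes)

# Stand-in for the one-hot encoding step: one indicator column per distinct value
encoded = pd.get_dummies(filled, columns=["MasVnrArea"])
print(encoded.columns.tolist())
```

This is likely why indicator columns such as `MasVnrArea_0.0` and `GarageYrBlt_Missing` show up in the correlation ranking.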


In [None]:
spearman = Correlation(df_ohe, 'spearman', key=abs, ascending=False)
spearman[:20]

In [None]:
pearson = Correlation(df_ohe, 'pearson', key=abs,ascending=False)
pearson[:20]

* We can see some strong negative correlations, which we will explore further

We will consider the top positive and negative correlation levels in `df_ohe` and study the associated variables in `df`

We will therefore investigate the following relationships:

Positive Correlation:
* SalePrice with 1stFlrSF
* SalePrice with OverallQual
* SalePrice with GrLivArea
* SalePrice with GarageArea
* SalePrice with TotalBsmtSF
* SalePrice with YearBuilt

Negative Correlation:
* SalePrice with KitchenQual_TA
* SalePrice with GarageFinish_Unf
* SalePrice with MasVnrArea_0.0
* SalePrice with GarageYrBlt_Missing
* SalePrice with GarageFinish_None

We will not study 'YearRemodAdd' further: when a house has had no remodeling or additions, this value is the same as the construction date, so we cannot tell the two cases apart
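
One way to gauge how often this ambiguity occurs is to count the rows where the two years coincide. This is a minimal sketch on invented values; on the real data the same comparison could be run against `df`, assuming both columns are present.

```python
import pandas as pd

# Toy values, invented for illustration
toy = pd.DataFrame({
    "YearBuilt":    [1960, 1995, 2003, 1978],
    "YearRemodAdd": [1960, 2005, 2003, 1978],
})

# True where the remodel year carries no information beyond the build year
same_year = toy["YearRemodAdd"] == toy["YearBuilt"]
share_same = same_year.mean()
print(f"Share of rows where YearRemodAdd equals YearBuilt: {share_same:.0%}")
```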

In [None]:
vars_to_study = [     
    '1stFlrSF', 'GarageArea', 
    'GrLivArea', 'OverallQual', 
    'TotalBsmtSF', 'YearBuilt', 
]

# EDA on selected variables

In [None]:
df_eda = df_ohe.filter(vars_to_study + ['SalePrice'])

## Variables Distribution by SalePrice

- We plot each selected variable against SalePrice

In [None]:
plt.figure(figsize=(15,10))

for i, attribute in enumerate(vars_to_study, 1):
    plt.subplot(2,3, i)
    sns.scatterplot(data=df_eda, x=attribute, y='SalePrice', hue='SalePrice')
    plt.title(f'Sale Price vs. {attribute}')
    plt.xlabel(attribute)
    plt.ylabel('Sale Price')

plt.tight_layout()
plt.show()

In [None]:
feat_study = ['KitchenQual_TA', 'GarageFinish_Unf',
              'MasVnrArea_0.0', 'GarageYrBlt_Missing',
              'GarageFinish_None']

# The one-hot columns exist in df_ohe; df_eda only holds vars_to_study and SalePrice
plt.figure(figsize=(20,15))
for i, feature in enumerate(feat_study, 1):
    plt.subplot(3,3, i)
    sns.barplot(data=df_ohe, x=feature, y='SalePrice')
    #plt.xticks(rotation=90)
    plt.title(f'Sale Price by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Sale Price')

plt.tight_layout()
plt.show()
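
The bar heights above are group means, since the mean is seaborn's default estimator for `barplot`. The same comparison can be read off numerically, as in this minimal sketch on invented values (the toy frame is an assumption, not data from the study):

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "GarageFinish_Unf": [1, 0, 1, 0, 0, 1],
    "SalePrice": [110000, 220000, 130000, 180000, 260000, 120000],
})

# Mean SalePrice for houses without (0) and with (1) the flag set
mean_price = toy.groupby("GarageFinish_Unf")["SalePrice"].mean()
gap = mean_price.loc[0] - mean_price.loc[1]
print(mean_price)
print(f"Mean price gap between the two groups: {gap:,.0f}")
```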

---

# Conclusions and Next steps

The correlation results and the plots point to the same conclusions.

**Top positively correlated features**

- Houses with larger garages (GarageArea) tend to have a higher Sale Price, suggesting that buyers value spacious garages.

- A larger total basement area (TotalBsmtSF) is associated with a higher Sale Price, which suggests that basement area is an important factor in house valuation.

- The Sale Price tends to rise with the size of the first floor (1stFlrSF), which shows the significance of main-level living space in the housing market.

- The Sale Price tends to be higher for houses with better Overall Quality (OverallQual), which supports the view that quality is a key determinant of property value.

- A larger above-grade living area (GrLivArea) is associated with a higher Sale Price, which reflects the market's valuation of living space.

- More recently built houses (YearBuilt) tend to have a higher Sale Price.

**Top negatively correlated features**

- 'KitchenQual_TA' indicates that houses with average kitchen quality tend to have a lower Sale Price.

- The Sale Price is typically lower when the garage is unfinished, as shown by 'GarageFinish_Unf'.

- Houses without any masonry veneer area tend to have a lower Sale Price, as indicated by 'MasVnrArea_0.0'.

- Houses with no recorded garage construction year ('GarageYrBlt_Missing') tend to have a lower Sale Price.

- Houses without garages usually have a lower Sale Price, based on 'GarageFinish_None'.

**Taken together, these patterns suggest that buyers are willing to pay premiums for more space and higher quality in homes.**

---