# **DATA PROFILING and CLEANING**

## Objectives

* Take a closer look at the data: understand data types, distributions, gaps (i.e. missing values) and duplicates, and address them.

## Inputs

* Raw data (house_prices_records.csv)

## Outputs

* One cleaned dataset of house_prices_records that is ready for Exploratory Data Analysis.

## Additional Comments

* In the Data Collection phase, we inspected the inherited_houses.csv file manually. This was easy because it has only 4 rows of data. We concluded that the only difference from "house_prices_records.csv" is the absence of SalePrice. For data cleaning, we focus only on the primary dataset (house_prices_records.csv), because the inherited houses dataset is (1) irrelevant for EDA and (2) complete and ready to use as it is.


---

# Change working directory

* We assume you store the notebooks in a subfolder, so when running the notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Data Profiling

## Schema validation

Check that the data conforms to the schema outlined in the metadata

In [None]:
import pandas as pd

# Load the raw data and work on a copy so the source frame stays untouched
df_source_data = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")
df = df_source_data.copy()
df.head(5)

There are 24 columns. The metadata describes each column name, which gives additional context and helps determine the expected data type.
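
As a rough illustration of what a schema check could look like, the sketch below compares a frame against an expected column-to-type mapping. The `expected_schema` mapping, the `validate_schema` helper and the tiny `sample` frame are hypothetical and stand in for the project metadata and `house_prices_records.csv`; the real expected types should be taken from the metadata.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical expected schema: column name -> kind of data expected.
# The real mapping should be derived from the project metadata.
expected_schema = {
    "1stFlrSF": "integer",
    "KitchenQual": "string",
    "LotFrontage": "float",
    "SalePrice": "integer",
}

# Checker functions from pandas for each kind used above
kind_checkers = {
    "integer": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
    "string": ptypes.is_string_dtype,
}

# Tiny illustrative frame standing in for the raw dataset
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262],
    "KitchenQual": ["Gd", "TA"],
    "LotFrontage": [65.0, None],
    "SalePrice": [208500, 181500],
})

def validate_schema(frame, schema):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for col, kind in schema.items():
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
        elif not kind_checkers[kind](frame[col]):
            problems.append(f"{col}: expected {kind}, found {frame[col].dtype}")
    # Columns present in the data but absent from the schema
    extra = sorted(set(frame.columns) - set(schema))
    problems.extend(f"unexpected column: {col}" for col in extra)
    return problems

schema_problems = validate_schema(sample, expected_schema)
print(schema_problems)
```

An empty list means every expected column is present with a compatible type and no unexpected columns were found.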

## Initial Data profiling

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df=df, minimal=True)
profile.to_notebook_iframe()

In [None]:
# Import libraries to display images
from PIL import Image
from IPython.display import display

In [None]:
img = Image.open('docs/images/profilereport_flags.jpg')
display(img)

```EnclosedPorch``` and ```WoodDeckSF``` contain values in fewer than 15% of rows. Hence, we cannot use them for further correlation analysis or for predictive purposes.

We can ask the data collector why these values are missing and whether a missing value means anything, which may uncover further insights.
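
The profile report's missing-value flags can also be cross-checked directly in pandas. The sketch below is a minimal example on a made-up 10-row frame (`toy`); the 85% cut-off in `MAX_MISSING_SHARE` is an assumption chosen to mirror the "fewer than 15% populated" observation, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Made-up frame: two sparse columns and one complete column
toy = pd.DataFrame({
    "EnclosedPorch": [np.nan] * 9 + [0.0],
    "WoodDeckSF": [120.0] + [np.nan] * 9,
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077],
})

# Share of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed threshold: flag columns with more than 85% missing values
MAX_MISSING_SHARE = 0.85
cols_too_sparse = sorted(missing_share[missing_share > MAX_MISSING_SHARE].index)
print(cols_too_sparse)
```

On the real data, the same two lines applied to `df` would list the candidate columns to drop.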

```[2ndFlrSF, MasVnrArea, OpenPorchSF]``` each contain between 40% and 60% zeros. This is concerning. Let's have a look at how these zeros are distributed across rows.
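
One way to look at these columns more closely is to compute the share of zeros per column and summarise only the non-zero values. This is a sketch on a made-up frame (`toy`) with invented values; on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Made-up values for the three zero-heavy columns
toy = pd.DataFrame({
    "2ndFlrSF": [0, 854, 0, 756, 0, 0],
    "MasVnrArea": [196.0, 0.0, 162.0, 0.0, 350.0, 0.0],
    "OpenPorchSF": [61, 0, 42, 35, 0, 0],
})
cols = ["2ndFlrSF", "MasVnrArea", "OpenPorchSF"]

# Share of rows that are exactly zero in each column
zero_share = (toy[cols] == 0).mean()
print(zero_share)

# Mask the zeros as NaN so describe() summarises the non-zero values only
nonzero_summary = toy[cols].where(toy[cols] != 0).describe()
print(nonzero_summary)
```

Separating the zeros from the rest makes it easier to judge whether zero is a plausible "feature absent" value or a sign of a data problem.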

In [None]:
# Find rows where all three columns are zero
zero_rows = df[(df['2ndFlrSF'] == 0) & (df['MasVnrArea'] == 0) & (df['OpenPorchSF'] == 0)]

# Count the number of such rows
num_zero_rows = zero_rows.shape[0]

num_zero_rows

In a real-life scenario, it is very unlikely for all three values to be zero at the same time.

The data in these rows may be corrupted, but we cannot be sure.

**However, it is important to note that 271 out of 1460 rows is about 18.6% of the original dataset. That is a lot of data.**

Let's check whether there are rows where four or more columns have missing values.

In [None]:
# Check for rows where four or more columns have missing values
rows_with_four_or_more_missing = df[df.isnull().sum(axis=1) >= 4]

# Count the number of such rows
num_rows_with_four_or_more_missing = rows_with_four_or_more_missing.shape[0]

# Print the number of rows with four or more missing values
print(f'Number of rows with four or more missing values: {num_rows_with_four_or_more_missing}')



110 out of 1460 rows is less than 10% of the data (about 7.5%). Because the missing values in these rows appear to follow a pattern rather than occur at random, we assume that these rows are of low data quality and drop them from the analysis. This also gives imputation its best chance, since we then impute only values that are missing at random.
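
Whether missing values cluster in the same rows can be checked by counting missing values per row and by correlating the missingness indicators of the columns. The sketch below uses a made-up frame (`toy`) with placeholder column names `A` to `E`; it illustrates the idea and is not the analysis that produced the figure of 110 rows.

```python
import numpy as np
import pandas as pd

# Made-up frame: rows 1 and 3 are missing in four columns at once
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 5.0],
    "B": [1.0, np.nan, 3.0, np.nan, np.nan],
    "C": [1.0, np.nan, np.nan, np.nan, 5.0],
    "D": [1.0, np.nan, 3.0, np.nan, 5.0],
    "E": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# How many rows have 0, 1, 2, ... missing values
missing_per_row = toy.isna().sum(axis=1)
missing_count_distribution = missing_per_row.value_counts().sort_index()
print(missing_count_distribution)

# Rows at or above the threshold of four missing values
rows_flagged = int((missing_per_row >= 4).sum())
print(rows_flagged)

# Correlation between missingness indicators (1 = missing, 0 = present).
# Columns that are never missing are dropped: a constant indicator
# has no defined correlation.
missing_indicator = toy.isna().astype(int)
missing_indicator = missing_indicator.loc[:, missing_indicator.nunique() > 1]
missing_corr = missing_indicator.corr()
print(missing_corr)
```

High correlations between indicators suggest that values go missing together, which supports treating the affected rows as a group rather than imputing each column independently.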

#### Section wrap-up: we now know which rows and columns to drop, and why

In [None]:
# Create a dataframe with all rows except those with four or more missing values
df_dropped_rows = df.drop(rows_with_four_or_more_missing.index)

# Drop the EnclosedPorch and WoodDeckSF columns
df_dropped_rows_cols = df_dropped_rows.drop(columns=['EnclosedPorch', 'WoodDeckSF'])


---

# Data distribution and correlation

Let's look at each variable's distribution and its correlation with other variables. This will help us understand the data and will affect the imputation strategy down the line.
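
Before plotting, a quick numeric summary can give a first impression. The sketch below computes skewness and the Pearson correlation with `SalePrice` on a made-up frame (`toy`); the values are invented for illustration. Using skewness to choose between mean and median imputation is a common rule of thumb and an assumption here, not a decision taken in this notebook.

```python
import pandas as pd

# Made-up frame with three numeric columns and one categorical column
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "YearBuilt": [1950, 1970, 1990, 2000, 2010],
    "SalePrice": [100000, 150000, 200000, 250000, 300000],
    "KitchenQual": ["TA", "TA", "Gd", "Gd", "Ex"],
})

# Restrict to numeric columns; skewness and Pearson correlation need numbers
numeric = toy.select_dtypes(include="number")

# Skewness per column: values far from 0 indicate an asymmetric distribution,
# where the median is usually a safer imputation value than the mean
skewness = numeric.skew()
print(skewness)

# Pearson correlation of each numeric column with the target, strongest first
corr_with_target = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_target)
```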

---

In [None]:
df.dtypes

In [None]:
cat_data_col=('BsmtExposure','BsmtFinType1','GarageFinish','KitchenQual','OverallCond','OverallQual')


In [None]:
df_copy=df.copy()

In [None]:
for cat in cat_data_col:
    print(cat)
    df_copy[cat]=df_copy[cat].astype('category')

In [None]:
df_copy.dtypes

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [None]:
# Pairwise plots of the variables in df_copy
sns.pairplot(df_copy)

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df=df_copy)
profile.to_notebook_iframe()