# **DATA CLEANING NOTEBOOK**

## Objectives

* Check which variables in the dataset have missing values
* Handle the missing values in the dataset
* Clean the data

## Inputs

* Dataset from the data collection step, stored under outputs/datasets/collection/house_prices.csv (raw, uncleaned dataset from Kaggle)

## Outputs

* Cleaned data to be stored in outputs/datasets/cleaned

## Conclusions drawn from this data cleaning step

* xxx 


---

# Change working directory

* Change the working directory from the current folder to its parent folder
* os.getcwd() ('get current working directory') returns the current working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\My_Folders\\CodeInstitute\\Project_5_files\\Project-5\\Project-5\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\My_Folders\\CodeInstitute\\Project_5_files\\Project-5\\Project-5'
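
Re-running the cells above from the top moves the working directory up one more level on each pass. A minimal sketch of a guard against that, assuming the notebooks live in a folder named jupyter_notebooks (as in the path shown above). The helper name project_root is hypothetical and not part of the original notebook:

```python
import os


def project_root(current_dir, notebook_folder="jupyter_notebooks"):
    """Return the parent folder only when current_dir is the notebook folder.

    notebook_folder is an assumption based on the path printed above.
    """
    normalized = os.path.normpath(current_dir)
    # basename is the last path component, e.g. 'jupyter_notebooks'
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    # Already outside the notebook folder: leave the directory unchanged
    return current_dir


# Safe to run repeatedly: only the first call actually moves up one level
os.chdir(project_root(os.getcwd()))
```

With this guard, the directory is only changed while the notebook is still running from inside the notebook folder.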

# Load Data

In [4]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices.csv")
df.head(10)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
5,796,566.0,1.0,No,732,GLQ,64,,480,Unf,...,85.0,0.0,30,5,5,796,,1993,1995,143000
6,1694,0.0,3.0,Av,1369,GLQ,317,,636,RFn,...,75.0,186.0,57,5,8,1686,,2004,2005,307000
7,1107,983.0,3.0,Mn,859,ALQ,216,,484,,...,,240.0,204,6,7,1107,,1973,1973,200000
8,1022,752.0,2.0,No,0,Unf,952,,468,Unf,...,51.0,0.0,0,5,7,952,,1931,1950,129900
9,1077,0.0,2.0,No,851,GLQ,140,,205,RFn,...,50.0,0.0,4,6,5,991,,1939,1950,118000


---

# Exploration of data

The head of the dataset suggests possible missing data recorded as zero values, as well as undefined/missing values (NaN).
The variables/features that have missing values have to be determined.

* A total of ten variables have missing data, as determined by len(variables_with_missing_data)
* Features 'EnclosedPorch' and 'WoodDeckSF' have a substantial share of missing data, around 90% of the dataset

In [5]:
variables_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
variables_with_missing_data

['2ndFlrSF',
 'BedroomAbvGr',
 'BsmtExposure',
 'BsmtFinType1',
 'EnclosedPorch',
 'GarageFinish',
 'GarageYrBlt',
 'LotFrontage',
 'MasVnrArea',
 'WoodDeckSF']

In [6]:
len(variables_with_missing_data)

10

In [7]:
df[variables_with_missing_data].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   2ndFlrSF       1374 non-null   float64
 1   BedroomAbvGr   1361 non-null   float64
 2   BsmtExposure   1422 non-null   object 
 3   BsmtFinType1   1315 non-null   object 
 4   EnclosedPorch  136 non-null    float64
 5   GarageFinish   1225 non-null   object 
 6   GarageYrBlt    1379 non-null   float64
 7   LotFrontage    1201 non-null   float64
 8   MasVnrArea     1452 non-null   float64
 9   WoodDeckSF     155 non-null    float64
dtypes: float64(7), object(3)
memory usage: 114.2+ KB
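
The checks above count NaN values only; zero values are not flagged by isna(). A small sketch of how both could be counted side by side for the numeric columns. The toy frame and its values are illustrative assumptions, not rows from the house price dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, np.nan, 0.0],
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan],
    "GarageFinish": ["RFn", None, "Unf", "RFn"],
})

# Zero counts only make sense for numeric columns
numeric = toy.select_dtypes(include="number")

summary = pd.DataFrame({
    "n_missing": numeric.isna().sum(),  # NaN entries per column
    "n_zero": (numeric == 0).sum(),     # explicit zeros per column
})
print(summary)
```

Whether a zero means "feature not present" or "value not recorded" still has to be judged per variable; the counts only show where to look.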


## Run a Profile Report for each variable with missing data

Use the ProfileReport class of the ydata_profiling library to generate a profile report on the variables with missing values.
The profile report is displayed in the Jupyter Notebook as an iframe.


In [None]:
# This code was taken from the data cleaning notebook of walkthrough project 2
# and adjusted for my variable name for missing data
# ydata_profiling is specified in requirements.txt in this forked repo

from ydata_profiling import ProfileReport

if variables_with_missing_data:
    profile = ProfileReport(df=df[variables_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")



# Data Cleaning

## Assessing Missing Data Levels

Missing data levels are shown in a DataFrame that includes absolute levels, relative levels and data type.

---

In [10]:
# EvaluateMissingData function taken from walkthrough project 2, under the headings "Data Cleaning" and "Assessing Missing Data Levels"

def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                   "PercentageOfDataset": missing_data_percentage,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data


In [11]:
EvaluateMissingData(df)

Unnamed: 0,RowsWithMissingData,PercentageOfDataset,DataType
EnclosedPorch,1324,90.68,float64
WoodDeckSF,1305,89.38,float64
LotFrontage,259,17.74,float64
GarageFinish,235,16.1,object
BsmtFinType1,145,9.93,object
BedroomAbvGr,99,6.78,float64
2ndFlrSF,86,5.89,float64
GarageYrBlt,81,5.55,float64
BsmtExposure,38,2.6,object
MasVnrArea,8,0.55,float64
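
As a cross-check on the percentages, the same share of missing values can be obtained from the mean of the boolean isna() mask. This is a sketch on a toy frame with made-up values, not a replacement for the function above:

```python
import numpy as np
import pandas as pd

# Toy frame: column 'a' has 2 of 4 values missing, column 'b' has none
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0], "b": [1, 2, 3, 4]})

# The mean of a boolean mask is the fraction of True values, i.e. the share of NaN
pct_missing = (toy.isna().mean() * 100).round(2)

# Keep only columns with missing data, largest share first
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(pct_missing)
```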


NOTE

* Two features/variables, ['EnclosedPorch', 'WoodDeckSF'], each have close to 90% missing values
* Both features are numerical (area in square feet). A missing value likely suggests that no enclosed porch or wood deck exists for that specific house
* Imputing values for these two features (such as the median or mean of those houses where data for EnclosedPorch or WoodDeckSF does exist) is likely not appropriate
* Using only about 10% of the dataset to impute the other 90% may introduce bias. Imputation would imply that a porch or wood deck of a certain area exists, even for houses that may not have either feature
* Review of the features suggests that both variables ['EnclosedPorch', 'WoodDeckSF'] are unlikely to contribute significantly to the predictive power for house price
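
The concern about imputation can be illustrated with a toy column in which only one value in ten is observed. The numbers below are assumptions chosen for illustration and do not come from the dataset:

```python
import numpy as np
import pandas as pd

# Toy column: 9 missing values and 1 observed value, mimicking ~90% missing data
porch = pd.Series([np.nan] * 9 + [120.0], name="EnclosedPorch")

# Median imputation: every house is assigned the observed area
median_filled = porch.fillna(porch.median())

# Zero fill: assumes a missing value means the feature does not exist
zero_filled = porch.fillna(0)

# Share of houses that appear to have a porch under each strategy
share_with_porch_median = (median_filled > 0).mean()
share_with_porch_zero = (zero_filled > 0).mean()
print(share_with_porch_median, share_with_porch_zero)
```

Under median imputation every house appears to have a porch, while the zero fill keeps the share at the observed level. Neither is necessarily correct, which is one reason dropping the two columns may be the simpler option.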


---

# Push files to Repo


In [9]:
import os
try:
    # Create the folder for the cleaned data (path taken from the Outputs section above);
    # exist_ok=True avoids an error if the folder already exists
    os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)
except Exception as e:
    print(e)