# **Notebook 1: Load data and data cleaning**

## Objectives

* Fetch data from Kaggle and save as raw data 
* Inspect data
* Save raw datasets
* Save cleaned data under outputs/datasets/cleaned

## Inputs

* Kaggle dataset downloaded from https://www.kaggle.com/datasets/codeinstitute/housing-prices-data

## Outputs

* Raw datasets are found in folder inputs/datasets/raw/house_prices
* Generate Dataset: outputs/datasets/collection/house_prices_records.csv
* Generate cleaned datasets: outputs/datasets/cleaned/house_prices_records_cleaned.csv and outputs/datasets/cleaned/inherited_houses_cleaned.csv

## Conclusions

* Data cleaning pipeline
* Drop variables: `['EnclosedPorch', 'WoodDeckSF']`
* Use median and mode imputation to replace missing values in numerical and categorical variables respectively.
* Handle mismatching data types by converting floats into integers



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor the working directory must be changed.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/housing'
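
As a side note, the same parent-folder lookup can be sketched with pathlib instead of os.path. This is only an illustration: it works on the path string shown in the output above and does not change the working directory. `PurePosixPath` is used so the result does not depend on the operating system.

```python
from pathlib import PurePosixPath

# Path reported by os.getcwd() in the notebook output above
notebook_dir = PurePosixPath('/workspaces/housing/jupyter_notebooks')

# .parent plays the same role as os.path.dirname()
project_root = notebook_dir.parent

print(project_root)
```

In a live notebook, `os.chdir()` would still be needed to actually switch to that folder.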

# Section 1: Load and inspect the data

The data has been downloaded from: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data

First we create the folders in the file path

In [4]:
import os
try:
  os.makedirs(name='inputs/datasets/raw/house_prices')
except Exception as e:
  print(e)

[Errno 17] File exists: 'inputs/datasets/raw/house_prices'
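
The "File exists" message above is harmless: it only means the folder was already there. A possible alternative, sketched below, is to pass `exist_ok=True` to `os.makedirs()` so that repeated runs stay silent. The temporary directory is an assumption of this sketch and stands in for the project root, so running it does not touch the real folders.

```python
import os
import tempfile

# A temporary directory stands in for the project root (assumption of this sketch)
with tempfile.TemporaryDirectory() as tmp_root:
    raw_dir = os.path.join(tmp_root, 'inputs', 'datasets', 'raw', 'house_prices')

    # The first call creates the nested folders
    os.makedirs(raw_dir, exist_ok=True)
    # The second call finds them already present and raises no error
    os.makedirs(raw_dir, exist_ok=True)

    created = os.path.isdir(raw_dir)

print(created)
```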


Then drag and drop both CSV files into the folder.
- house-metadata.txt can be dragged and dropped into inputs/datasets/raw for reference.

Next we load the data and show the first rows to get an idea of the data we are working with.

In [5]:
import pandas as pd

# Load the historical house prices records data
records_df = pd.read_csv('inputs/datasets/raw/house_prices/house_prices_records.csv')

# Load the inherited houses data
inherited_df = pd.read_csv('inputs/datasets/raw/house_prices/inherited_houses.csv')

# Check the first few rows of each DataFrame
print(records_df.head())

1stFlrSF  2ndFlrSF  BedroomAbvGr BsmtExposure  BsmtFinSF1 BsmtFinType1  \
0       856     854.0           3.0           No         706          GLQ   
1      1262       0.0           3.0           Gd         978          ALQ   
2       920     866.0           3.0           Mn         486          GLQ   
3       961       NaN           NaN           No         216          ALQ   
4      1145       NaN           4.0           Av         655          GLQ   

   BsmtUnfSF  EnclosedPorch  GarageArea GarageFinish  ...  LotFrontage  \
0        150            0.0         548          RFn  ...         65.0   
1        284            NaN         460          RFn  ...         80.0   
2        434            0.0         608          RFn  ...         68.0   
3        540            NaN         642          Unf  ...         60.0   
4        490            0.0         836          RFn  ...         84.0   

   MasVnrArea OpenPorchSF  OverallCond  OverallQual  TotalBsmtSF  WoodDeckSF  \
0       196.0  

In [6]:
print(inherited_df.head())

1stFlrSF  2ndFlrSF  BedroomAbvGr BsmtExposure  BsmtFinSF1 BsmtFinType1  \
0       896         0             2           No       468.0          Rec   
1      1329         0             3           No       923.0          ALQ   
2       928       701             3           No       791.0          GLQ   
3       926       678             3           No       602.0          GLQ   

   BsmtUnfSF  EnclosedPorch  GarageArea GarageFinish  ...  LotArea  \
0      270.0              0       730.0          Unf  ...    11622   
1      406.0              0       312.0          Unf  ...    14267   
2      137.0              0       482.0          Fin  ...    13830   
3      324.0              0       470.0          Fin  ...     9978   

   LotFrontage MasVnrArea  OpenPorchSF  OverallCond  OverallQual  TotalBsmtSF  \
0         80.0        0.0            0            6            5        882.0   
1         81.0      108.0           36            6            6       1329.0   
2         74.0        0

# Section 2: Data overview

Explore the data to spot any anomalies. First let's take a look at records_df:

In [7]:
print(records_df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

And show missing values:

In [None]:
print(records_df.isnull().sum())

Then we do the same for inherited_df:

In [None]:
print(inherited_df.info())

And explore missing values for inherited houses. There are none here!

In [None]:
print(inherited_df.isnull().sum())

Great, now we have taken a first look at the data and can start doing some initial cleaning.

# Section 3: Cleaning - Data types mismatch

At this point, we spot that the data types differ between the records_df and inherited_df datasets for several variables. Is this a problem?

It can be. Discrepancies between the data types of the same variable in records_df and inherited_df can cause issues down the line, especially when building and using predictive models.

A model and its preprocessing steps are fitted on one dataset and later applied to the other. If a variable is an integer in one dataset and a float in the other, type-dependent steps may treat the two differently, which makes the results harder to trust.

To address this, we make sure that the same variables have the same kind of data type in both datasets. Given that these variables hold whole-number quantities (e.g. the number of square feet or the number of bedrooms), it makes more sense for them to be integers.

First we take a look at a sample of the data (top 20 rows) and see that the floats appear to be whole numbers.

Therefore, where a variable is a float in one dataset and an integer in the other, it is converted into an integer in both.

In [None]:
print(records_df.head(20))

In [None]:
print(inherited_df.head())
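
The visual check of the first rows can be backed by a programmatic one. The sketch below is a minimal example on a small made-up frame (the values and the helper `is_whole_number_column` are hypothetical, not part of the notebook). It reports, per column, whether every non-missing value is a whole number.

```python
import pandas as pd

# Hypothetical sample: one column of whole-number floats, one with a fractional value
sample_df = pd.DataFrame({
    '2ndFlrSF': [854.0, 0.0, None],
    'LotFrontage': [65.0, 80.5, None],
})

def is_whole_number_column(series):
    """Return True if every non-missing value has no fractional part."""
    values = series.dropna()
    return bool((values % 1 == 0).all())

whole_number_report = {
    col: is_whole_number_column(sample_df[col]) for col in sample_df.columns
}
print(whole_number_report)
```

A column that reports False would need a rounding decision before it can be converted to an integer type.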

So we convert the variables listed below in the records_df dataframe to integers. Please note that the variable "GarageYrBlt" is also converted from a float to an integer in both datasets. This is not because of a data type mismatch, but because year values are whole numbers and do not involve decimal points.

Given that there are missing values in some of the columns that we want to convert, we'll use the nullable integer type "Int64". Note that this type is case sensitive.

In [None]:
float_cols_records = ['2ndFlrSF', 'BedroomAbvGr', 'EnclosedPorch', 'LotFrontage', 'MasVnrArea', 'WoodDeckSF', 'GarageYrBlt']

for col in float_cols_records:
    records_df[col] = records_df[col].astype('Int64')
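
To illustrate why the capitalised "Int64" matters, here is a small sketch on a made-up column (the values are hypothetical). The nullable type keeps the missing entry, whereas the lowercase NumPy type "int64" is expected to reject it.

```python
import pandas as pd

# Hypothetical column with one missing value
bedrooms = pd.Series([3.0, None, 4.0], name='BedroomAbvGr')

# The nullable integer type keeps the gap as a missing value
nullable = bedrooms.astype('Int64')
print(nullable.dtype, int(nullable.isna().sum()))

# The NumPy integer type cannot hold missing values, so this cast should fail
try:
    bedrooms.astype('int64')
    plain_cast_failed = False
except (ValueError, TypeError) as error:
    plain_cast_failed = True
    print(error)
```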


Now let's do the same for the inherited_df dataframe. In this case there are no missing values, so we can use the "int64" type.

In [None]:
float_cols_inherited = ['BsmtFinSF1', 'BsmtUnfSF', 'GarageArea', 'TotalBsmtSF', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']

for col in float_cols_inherited:
    inherited_df[col] = inherited_df[col].astype('int64')


Now we check the data types again to ensure the conversion was successful. First for the records_df:

In [None]:
print(records_df.info())


And here for inherited_df. Comparing them, we see that the variables converted above are now integers in both datasets (nullable "Int64" in records_df, "int64" in inherited_df), which will make analyses more reliable.

In [None]:
print(inherited_df.info())
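
Comparing two `info()` printouts by eye is error-prone, so a small helper can list the differences directly. The sketch below uses two made-up frames and a hypothetical helper `dtype_mismatches`, neither of which is part of the notebook. Note that it compares dtype names literally, so it also reports the nullable "Int64" versus the NumPy "int64".

```python
import pandas as pd

# Hypothetical frames standing in for records_df and inherited_df
left_df = pd.DataFrame({
    'GarageArea': [548, 460],
    'LotFrontage': pd.array([65, None], dtype='Int64'),
})
right_df = pd.DataFrame({
    'GarageArea': [730, 312],
    'LotFrontage': [80, 81],
})

def dtype_mismatches(df_a, df_b):
    """Map each shared column whose dtype names differ to the pair of names."""
    shared = [col for col in df_a.columns if col in df_b.columns]
    return {
        col: (str(df_a[col].dtype), str(df_b[col].dtype))
        for col in shared
        if str(df_a[col].dtype) != str(df_b[col].dtype)
    }

mismatches = dtype_mismatches(left_df, right_df)
print(mismatches)
```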

# Section 4: Cleaning - Handle missing data values

Moving on, we handle the missing values. As a reminder, here are the counts of missing values per column in records_df. There were no missing values in inherited_df.

In [None]:
print(records_df.isnull().sum())

Looking at the missing values in records_df, the columns 'EnclosedPorch' and 'WoodDeckSF' are missing most of their values (over 80% of the rows). Imputing that many values would not give us reliable data, so we drop these columns from both datasets.

For the other columns, we fill missing values with a simple strategy: the median for numerical columns and the most frequent value (mode) for categorical columns.

In [None]:
# Drop columns with too many missing values
records_df = records_df.drop(['EnclosedPorch', 'WoodDeckSF'], axis=1)
inherited_df = inherited_df.drop(['EnclosedPorch', 'WoodDeckSF'], axis=1)
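
The columns to drop were picked by reading the missing-value counts. As a rough sketch of how that choice could be derived from the data, the example below computes the share of missing values per column on a made-up frame and selects the columns above a threshold. The frame, the variable names and the 0.8 threshold are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: 'EnclosedPorch' is missing in five of six rows
toy_df = pd.DataFrame({
    'EnclosedPorch': [0.0, None, None, None, None, None],
    'LotFrontage': [65.0, 80.0, None, 60.0, 84.0, 85.0],
    'GarageArea': [548, 460, 608, 642, 836, 480],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = toy_df.isnull().mean()

threshold = 0.8  # assumed cut-off, mirroring the "over 80%" remark
sparse_cols = missing_share[missing_share > threshold].index.tolist()

trimmed_df = toy_df.drop(columns=sparse_cols)
print(sparse_cols)
print(list(trimmed_df.columns))
```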

In [None]:
# Fill missing values in numerical columns with the median
for col in ['2ndFlrSF', 'BedroomAbvGr', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']:
    records_df[col] = records_df[col].fillna(records_df[col].median())
    inherited_df[col] = inherited_df[col].fillna(inherited_df[col].median())

In [None]:
# Fill missing values in categorical columns with the most frequent value
for col in ['BsmtFinType1', 'GarageFinish']:
    records_df[col] = records_df[col].fillna(records_df[col].mode()[0])
    inherited_df[col] = inherited_df[col].fillna(inherited_df[col].mode()[0])
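
To make the two imputation strategies concrete, here is a minimal sketch on a made-up frame (all values are hypothetical). One detail worth knowing: the median of an "Int64" column comes back as a float, so the sketch converts it to an integer before filling. In this toy data the median is a whole number. A fractional median would need an explicit rounding decision.

```python
import pandas as pd

# Hypothetical frame with one gap in a numerical and one in a categorical column
toy_df = pd.DataFrame({
    'LotFrontage': pd.array([60, None, 80, 70], dtype='Int64'),
    'GarageFinish': ['RFn', None, 'Unf', 'RFn'],
})

# Median of the non-missing numerical values (returned as a float)
numeric_fill = toy_df['LotFrontage'].median()
# Most frequent category; mode() ignores missing values by default
categorical_fill = toy_df['GarageFinish'].mode()[0]

# int() is safe here only because the toy median is a whole number
toy_df['LotFrontage'] = toy_df['LotFrontage'].fillna(int(numeric_fill))
toy_df['GarageFinish'] = toy_df['GarageFinish'].fillna(categorical_fill)

print(toy_df)
```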

Then we print the count of missing values in each column to confirm that none are left.

In [None]:
print(records_df.isnull().sum())

We do the same for inherited_df, for consistency. Here we can also see that the dropped variables are no longer present.

In [None]:
print(inherited_df.isnull().sum())

Great! Now neither dataset has missing values, and the data types are aligned across both datasets.

---

## Save data

Create a collection folder for records_df, then manually drag and drop a copy of the raw dataset there for future use.

In [8]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

[Errno 17] File exists: 'outputs/datasets/collection'


Now we create folders for the cleaned data and save it to new CSV files.

In [None]:
import os

# Define file paths
records_file_path = 'outputs/datasets/cleaned/house_prices_records_cleaned.csv'
inherited_file_path = 'outputs/datasets/cleaned/inherited_houses_cleaned.csv'

# Create the directories in the file path
os.makedirs(os.path.dirname(records_file_path), exist_ok=True)
os.makedirs(os.path.dirname(inherited_file_path), exist_ok=True)

# Save the CSV files
records_df.to_csv(records_file_path, index=False)
inherited_df.to_csv(inherited_file_path, index=False)
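
After saving, a quick round trip can confirm that the files were written as expected. The sketch below is only an illustration under stated assumptions: it saves a made-up frame into a temporary directory (standing in for the project root) with a hypothetical file name, reads it back and compares shape and column names.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame
cleaned_df = pd.DataFrame({'LotArea': [8450, 9600], 'KitchenQual': ['Gd', 'TA']})

# A temporary directory stands in for the project root (assumption of this sketch)
with tempfile.TemporaryDirectory() as tmp_root:
    file_path = os.path.join(tmp_root, 'outputs', 'datasets', 'cleaned', 'toy_cleaned.csv')
    os.makedirs(os.path.dirname(file_path), exist_ok=True)

    # Save without the index, as in the cell above, then read the file back
    cleaned_df.to_csv(file_path, index=False)
    reloaded_df = pd.read_csv(file_path)

same_shape = reloaded_df.shape == cleaned_df.shape
same_columns = list(reloaded_df.columns) == list(cleaned_df.columns)
print(same_shape, same_columns)
```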

---

# Push files to Repo

Great! Now you can push the changes to your GitHub repo using the Git commands (git add, git commit, git push).