# Data Cleaning Tutorial

This tutorial walks through a Python script that uses pandas and numpy to clean a dataset step by step. The steps cover handling missing values, removing duplicates, and fixing inconsistencies in the data. We assume you already have a DataFrame loaded with your data. Here’s how you can clean it:

## 1. Import Required Libraries

First, import the necessary libraries. If you haven't installed pandas and numpy yet, you can do so with `pip install pandas numpy`. The later sections on KNN and MICE imputation also need scikit-learn (`pip install scikit-learn`).

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)




## 2. Load Your Data

Load your data into a DataFrame. This example assumes the data is in a CSV file.

We use the Ames Housing Dataset from Kaggle: https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset

About the dataset: the Ames Housing Dataset is a well-known dataset in machine learning and data analysis. It describes residential homes in Ames, Iowa, USA, and is often used for regression tasks, particularly for predicting house prices.

Here are some key details about the Ames Housing Dataset:

- Number of Instances: The dataset consists of 2,930 instances or observations.
- Number of Features: There are 79 different features or variables that describe various aspects of the residential properties.
- Target Variable: The target variable in the dataset is the "SalePrice," representing the sale price of the houses.
- Data Types: The features include both numerical and categorical variables, covering a wide range of aspects such as lot size, number of rooms, location, construction, and more.

The dataset is widely used for regression modeling, feature engineering, and predictive analytics on housing prices. Note that the file loaded in this tutorial has 1,460 rows and 81 columns (see the shape printed in section 3.1), so it holds only part of the 2,930 observations listed above.

In [2]:
# Remote copy of the data, kept for reference; uncomment to load it instead of the local file
# data_url = 'https://raw.githubusercontent.com/chriskhanhtran/kaggle-house-price/refs/heads/master/Data/train.csv'
data_url = 'dataset/AmesHousing.csv'
df = pd.read_csv(data_url)

display(df.head())
print()

print(df.dtypes)
print()

print(df.shape)
print()


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000



Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
BsmtExposure      object
BsmtFinType1      object
BsmtFinSF1         int64
BsmtFinType2      object
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
Heating           object

## 3. Handling Missing Values

Missing values can be handled in several ways depending on the context: removing rows, filling with a specific value, or imputing based on other data.

### 3.1 Check the missing values

In [3]:
# Checking for missing values in each column
missing_values = df.isnull().sum()

# Displaying only columns that have missing values
missing_values = missing_values[missing_values > 0]
print("Columns with missing values:")
print(missing_values)
print()

print(df.shape)
print()


Columns with missing values:
LotFrontage      259
Alley           1369
MasVnrType       872
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

(1460, 81)
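Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not the housing data) that reports the percentage of missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing entries in each column
missing_share = toy.isnull().mean().mul(100).round(1)

# Keep only columns with missing data, most affected first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, the same two lines would rank the columns listed above by how much of each is missing.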



### 3.2 Remove Unnecessary Columns

The missing-value counts above show that six columns are missing in a large share of the 1,460 rows: Alley (1,369), MasVnrType (872), FireplaceQu (690), PoolQC (1,453), Fence (1,179), and MiscFeature (1,406). The first five rows printed by the head() method show the same pattern. Columns this sparse carry little information for most rows, so one option is to drop them.

We can drop these columns in the following way:

In [4]:
dropdf = df.copy()           # Copy the data frame first into new df

to_drop = ['Alley', 
           'MasVnrType', 
           'FireplaceQu',
           'PoolQC', 
           'Fence', 
           'MiscFeature']
dropdf.drop(to_drop, inplace=True, axis=1)

In [5]:
display(dropdf.head())
print(dropdf.shape)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,0,12,2008,WD,Normal,250000


(1460, 75)
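Instead of typing the column names by hand, the list can be derived from a missing-value threshold. This is a sketch under assumptions: the toy frame and the 40% cut-off (`max_missing_share`) are made up for illustration. Going by the counts in section 3.1, a 40% cut-off would pick the same six columns on this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; PoolQC is missing in 3 of 4 rows
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

max_missing_share = 0.4  # assumed cut-off; tune it for your own data

# Fraction of missing values per column, then the columns above the cut-off
share = toy.isnull().mean()
to_drop = share[share > max_missing_share].index.tolist()

# drop() returns a new frame, so the original stays untouched
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.shape)
```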


### 3.3 Removing rows with missing values

In [6]:
newdf = dropdf.copy()           # Copy the data frame first into new df
newdf.dropna(inplace=True)  # This will remove all rows with any missing values.


In [7]:
# Checking for missing values in each column
missing_values = newdf.isnull().sum()

# Displaying only columns that have missing values
missing_values = missing_values[missing_values > 0]
print("Columns with missing values:")
print(missing_values)
print()

print(newdf.shape)
print()


Columns with missing values:
Series([], dtype: int64)

(1094, 75)
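Dropping every row with any missing value removed 366 of the 1,460 rows here. When that is too costly, `dropna` can be limited with its `subset` and `thresh` parameters. The sketch below uses a made-up frame; which columns count as essential is an assumption you would make for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in different columns
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseF"],
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
})

# Drop a row only when a column we treat as essential is missing
essential = toy.dropna(subset=["Electrical"])

# Keep rows that have at least 2 non-missing values
mostly_complete = toy.dropna(thresh=2)

print(essential.shape)
print(mostly_complete.index.tolist())
```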



### 3.4 Filling missing values with a specific value

In [8]:
filldf = dropdf.copy()           # Copy the data frame into new df
filldf.fillna(value=0, inplace=True)  # Replace all NaNs with 0.


In [9]:
# Checking for missing values in each column
missing_values = filldf.isnull().sum()

# Displaying only columns that have missing values
missing_values = missing_values[missing_values > 0]
print("Columns with missing values:")
print(missing_values)
print()

print(filldf.shape)
print()


Columns with missing values:
Series([], dtype: int64)

(1460, 75)
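Filling everything with 0 also writes the number 0 into text columns such as GarageType. One alternative, sketched here on made-up data, is to pass `fillna` a dictionary with one fill value per column. The label `"Missing"` is a placeholder chosen for this example, not a value from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "MasVnrArea": [196.0, np.nan, 162.0],
    "GarageType": ["Attchd", np.nan, "Detchd"],
})

# One fill value per column: 0 for the area, a text label for the category
fill_values = {"MasVnrArea": 0, "GarageType": "Missing"}
filled = toy.fillna(value=fill_values)
print(filled)
```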



### 3.5 Imputing missing values

For numerical data, you might want to fill missing values with the mean or median:


In [10]:
imputingdf = dropdf.copy()

print(imputingdf['LotFrontage'].mean())
print()

display(imputingdf[['LotFrontage']].head(10))
print()


70.04995836802665



   LotFrontage
0         65.0
1         80.0
2         68.0
3         60.0
4         84.0
5         85.0
6         75.0
7          NaN
8         51.0
9         50.0





In [11]:
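# Assign the result back instead of calling fillna(inplace=True) on the column;
# the in-place form on a selected column is deprecated in recent pandas versions
imputingdf['LotFrontage'] = imputingdf['LotFrontage'].fillna(imputingdf['LotFrontage'].mean())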

display(imputingdf[['LotFrontage']].head(10))
print()


   LotFrontage
0    65.000000
1    80.000000
2    68.000000
3    60.000000
4    84.000000
5    85.000000
6    75.000000
7    70.049958
8    51.000000
9    50.000000
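The mean is sensitive to outliers, so the median is often a safer default, and it can be computed per group rather than over the whole column. The sketch below uses made-up values; grouping by Neighborhood is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two neighborhoods with one gap each
toy = pd.DataFrame({
    "Neighborhood": ["CollgCr", "CollgCr", "CollgCr", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [65.0, np.nan, 75.0, 50.0, 52.0, np.nan],
})

# Option 1: one median for the whole column
overall_median = toy["LotFrontage"].median()
global_fill = toy["LotFrontage"].fillna(overall_median)

# Option 2: the median of each row's own neighborhood
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")
group_fill = toy["LotFrontage"].fillna(group_median)

print(global_fill.tolist())
print(group_fill.tolist())
```

The group-wise version fills each gap with a value typical for similar rows, which tends to distort the column's distribution less than a single global value.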





For categorical data, you can use the mode:

In [12]:
categoricaldf = dropdf.copy()

print(categoricaldf['BsmtQual'].mode()[0])
print()

print(categoricaldf[['BsmtQual']].head(20))
print()


TA

   BsmtQual
0        Gd
1        Gd
2        Gd
3        TA
4        Gd
5        Gd
6        Ex
7        Gd
8        TA
9        TA
10       TA
11       Ex
12       TA
13       Gd
14       TA
15       TA
16       TA
17      NaN
18       TA
19       TA



In [13]:
# Assign the result back instead of calling fillna(inplace=True) on the column
categoricaldf['BsmtQual'] = categoricaldf['BsmtQual'].fillna(categoricaldf['BsmtQual'].mode()[0])

print(categoricaldf[['BsmtQual']].head(20))
print()

   BsmtQual
0        Gd
1        Gd
2        Gd
3        TA
4        Gd
5        Gd
6        Ex
7        Gd
8        TA
9        TA
10       TA
11       Ex
12       TA
13       Gd
14       TA
15       TA
16       TA
17       TA
18       TA
19       TA



## 4. Advanced Imputation Techniques

Simple fills with the mean, median, or mode are not always suitable, because they ignore how a column relates to the rest of the data. This section introduces more sophisticated imputation strategies using pandas and sklearn.

### 4.1 Imputation Using Interpolation (useful for time series data)

Interpolation estimates a missing value from the known values around it, so it only makes sense when the rows have a meaningful order, as in a time series. The housing rows have no such order, so the cell below mainly demonstrates the syntax.

In [15]:
interpolationDF = dropdf.copy()

# Interpolating missing values
# Linear interpolation for the numeric column
interpolationDF['LotFrontage'] = dropdf['LotFrontage'].interpolate(method='linear', limit_direction='both')
# Forward fill for the categorical column (interpolate() is meant for numeric data)
interpolationDF['BsmtQual'] = dropdf['BsmtQual'].ffill()

display(interpolationDF[['LotFrontage', 'BsmtQual']].head(20))
print()


   LotFrontage BsmtQual
0         65.0       Gd
1         80.0       Gd
2         68.0       Gd
3         60.0       TA
4         84.0       Gd
5         85.0       Gd
6         75.0       Ex
7         63.0       Gd
8         51.0       TA
9         50.0       TA
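To show where interpolation fits naturally, here is a small sketch on a made-up time series with unevenly spaced dates. The values and dates are assumptions for illustration. `method="linear"` treats the rows as equally spaced, while `method="time"` weights by the actual gap between timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the gap sits 1 day after the first and 3 days before the last
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
series = pd.Series([10.0, np.nan, 18.0], index=idx)

# Position-based: the missing point is halfway between its neighbors
linear = series.interpolate(method="linear")

# Time-based: the missing point is a quarter of the way along the 4-day span
time_based = series.interpolate(method="time")

print(linear.iloc[1], time_based.iloc[1])
```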





### 4.2 K-Nearest Neighbors Imputation

KNN imputation fills a missing value from the rows that look most similar. scikit-learn's KNNImputer, used below, works on numeric data: for each missing entry it takes the n_neighbors nearest rows that have a value for that feature and uses their mean (optionally weighted by distance). Distances are computed on the features that both rows have observed.

In [17]:
# for numerical
from sklearn.impute import KNNImputer

# Create an imputer object with KNN
imputer = KNNImputer(n_neighbors=5, weights="uniform")

# Selecting numeric columns only
numeric_cols = df.select_dtypes(include=[np.number]).columns
imputedKNNDF = df[numeric_cols]

# Fit on the dataset and transform
df_imputedKNN = pd.DataFrame(imputer.fit_transform(imputedKNNDF), columns=imputedKNNDF.columns)
display(df_imputedKNN.head(10))
print()


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1.0,60.0,65.0,8450.0,7.0,5.0,2003.0,2003.0,196.0,706.0,0.0,150.0,856.0,856.0,854.0,0.0,1710.0,1.0,0.0,2.0,1.0,3.0,1.0,8.0,0.0,2003.0,2.0,548.0,0.0,61.0,0.0,0.0,0.0,0.0,0.0,2.0,2008.0,208500.0
1,2.0,20.0,80.0,9600.0,6.0,8.0,1976.0,1976.0,0.0,978.0,0.0,284.0,1262.0,1262.0,0.0,0.0,1262.0,0.0,1.0,2.0,0.0,3.0,1.0,6.0,1.0,1976.0,2.0,460.0,298.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,181500.0
2,3.0,60.0,68.0,11250.0,7.0,5.0,2001.0,2002.0,162.0,486.0,0.0,434.0,920.0,920.0,866.0,0.0,1786.0,1.0,0.0,2.0,1.0,3.0,1.0,6.0,1.0,2001.0,2.0,608.0,0.0,42.0,0.0,0.0,0.0,0.0,0.0,9.0,2008.0,223500.0
3,4.0,70.0,60.0,9550.0,7.0,5.0,1915.0,1970.0,0.0,216.0,0.0,540.0,756.0,961.0,756.0,0.0,1717.0,1.0,0.0,1.0,0.0,3.0,1.0,7.0,1.0,1998.0,3.0,642.0,0.0,35.0,272.0,0.0,0.0,0.0,0.0,2.0,2006.0,140000.0
4,5.0,60.0,84.0,14260.0,8.0,5.0,2000.0,2000.0,350.0,655.0,0.0,490.0,1145.0,1145.0,1053.0,0.0,2198.0,1.0,0.0,2.0,1.0,4.0,1.0,9.0,1.0,2000.0,3.0,836.0,192.0,84.0,0.0,0.0,0.0,0.0,0.0,12.0,2008.0,250000.0
5,6.0,50.0,85.0,14115.0,5.0,5.0,1993.0,1995.0,0.0,732.0,0.0,64.0,796.0,796.0,566.0,0.0,1362.0,1.0,0.0,1.0,1.0,1.0,1.0,5.0,0.0,1993.0,2.0,480.0,40.0,30.0,0.0,320.0,0.0,0.0,700.0,10.0,2009.0,143000.0
6,7.0,20.0,75.0,10084.0,8.0,5.0,2004.0,2005.0,186.0,1369.0,0.0,317.0,1686.0,1694.0,0.0,0.0,1694.0,1.0,0.0,2.0,0.0,3.0,1.0,7.0,1.0,2004.0,2.0,636.0,255.0,57.0,0.0,0.0,0.0,0.0,0.0,8.0,2007.0,307000.0
7,8.0,60.0,75.6,10382.0,7.0,6.0,1973.0,1973.0,240.0,859.0,32.0,216.0,1107.0,1107.0,983.0,0.0,2090.0,1.0,0.0,2.0,1.0,3.0,1.0,7.0,2.0,1973.0,2.0,484.0,235.0,204.0,228.0,0.0,0.0,0.0,350.0,11.0,2009.0,200000.0
8,9.0,50.0,51.0,6120.0,7.0,5.0,1931.0,1950.0,0.0,0.0,0.0,952.0,952.0,1022.0,752.0,0.0,1774.0,0.0,0.0,2.0,0.0,2.0,2.0,8.0,2.0,1931.0,2.0,468.0,90.0,0.0,205.0,0.0,0.0,0.0,0.0,4.0,2008.0,129900.0
9,10.0,190.0,50.0,7420.0,5.0,6.0,1939.0,1950.0,0.0,851.0,0.0,140.0,991.0,1077.0,0.0,0.0,1077.0,1.0,0.0,1.0,0.0,2.0,2.0,5.0,2.0,1939.0,1.0,205.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,2008.0,118000.0





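KNNImputer only accepts numeric input, so for categorical columns we can write our own helper that encodes the categories as integers, imputes them, and decodes the result: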

In [18]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

def find_category_mappings(df, variable):
    # Map each observed category to an integer code (NaN is left out)
    return {k: i for i, k in enumerate(df[variable].dropna().unique())}

def imputation(df1, cols):
    df = df1.copy()

    # Encode the categories as integers; NaN stays NaN
    mappings = {variable: find_category_mappings(df, variable) for variable in cols}
    for variable in cols:
        df[variable] = df[variable].map(mappings[variable])

    # Scale to [0, 1], impute with KNN, then undo the scaling
    mm = MinMaxScaler()
    scaled = mm.fit_transform(df)
    imputed = KNNImputer().fit_transform(scaled)
    df = pd.DataFrame(mm.inverse_transform(imputed), columns=df.columns, index=df.index)

    # Round the encoded columns to the nearest integer code and map back to labels
    # (other numeric columns are left unrounded)
    for variable in cols:
        inv_map = {v: k for k, v in mappings[variable].items()}
        df[variable] = df[variable].round().astype(int).map(inv_map)
    return df

# get some categorical columns
# Note: with a single column there are no other features to measure distance on,
# so KNNImputer is expected to fall back to the column mean here; pass several
# related columns to get an actual neighbor search
knnDF = dropdf[['BsmtQual']]

knn_DF = imputation(knnDF, ['BsmtQual'])
display(knn_DF.head(20))
print()


  BsmtQual
0       Gd
1       Gd
2       Gd
3       TA
4       Gd
5       Gd
6       Ex
7       Gd
8       TA
9       TA





### 4.3 MICE (Multiple Imputation by Chained Equations)

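MICE fills in missing values with chained equations: each variable with missing data is modeled in turn from the other variables, and the cycle repeats for several rounds. It is particularly useful for complex datasets where missingness is spread over several related variables. scikit-learn's IterativeImputer, used below, follows this chained-equations approach but returns a single completed dataset by default.

For further reading, see https://medium.com/@brijesh_soni/topic-9-mice-or-multivariate-imputation-with-chain-equation-f8fd435ca91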

In [19]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Initialize the MICE imputer
mice_imputer = IterativeImputer()

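# get some numerical columns (only one here, so the imputer has no other
# features to model LotFrontage from and the result equals the column mean)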
imputedDF = dropdf[['LotFrontage']]

# Fit on the dataset and transform it
df_imputed = pd.DataFrame(mice_imputer.fit_transform(imputedDF), columns=imputedDF.columns)
display(df_imputed.head(20))
print()


   LotFrontage
0    65.000000
1    80.000000
2    68.000000
3    60.000000
4    84.000000
5    85.000000
6    75.000000
7    70.049958
8    51.000000
9    50.000000
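The imputed value for row 7 above is the column mean from section 3.5, because a single column gives the imputer nothing to regress on. The sketch below shows the effect of adding a related column. The data are synthetic: the linear relation between LotArea and LotFrontage and all numbers are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: frontage depends linearly on area, plus noise
rng = np.random.default_rng(0)
area = rng.uniform(5000, 15000, size=200)
frontage = 0.005 * area + rng.normal(0, 2, size=200)
true_frontage = frontage.copy()

toy = pd.DataFrame({"LotArea": area, "LotFrontage": frontage})
toy.loc[toy.index[:20], "LotFrontage"] = np.nan  # hide the first 20 values

# Each column with gaps is modeled from the other columns
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Compare against plain mean imputation on the hidden values
model_error = np.abs(imputed["LotFrontage"][:20] - true_frontage[:20]).mean()
mean_error = np.abs(toy["LotFrontage"].mean() - true_frontage[:20]).mean()
print(model_error, mean_error)
```

On data with a relation like this, the model-based fill should land much closer to the hidden values than the column mean does.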





### General Tips:

- Choosing Methods: The choice of method depends heavily on the nature of your data and the reasons for the missing data. For instance, time series data often benefits from interpolation, whereas datasets with a complex mixture of numerical and categorical variables might benefit more from MICE or KNN.

- Normalizing Data: Especially for methods like KNN, scaling the data (normalizing or standardizing) can improve the imputation results because KNN is distance-based.

- Parameter Tuning: For KNN and MICE, the parameters such as the number of neighbors in KNN and the number of imputations in MICE can affect the performance significantly. Experimenting with these can yield better results.
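The scaling tip can be sketched as follows: standardize the columns, impute in the scaled space, then transform back to the original units. The toy values and the choice of `StandardScaler` with `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; LotArea is on a much larger scale than the others
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0],
    "OverallQual": [7.0, 6.0, 7.0, 7.0, 8.0],
})

# StandardScaler ignores NaN when fitting and keeps it in the output
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Distances are now computed on comparable scales
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Back to the original units
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)
print(imputed)
```

Without scaling, LotArea would dominate the distance simply because its numbers are larger.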


### Advantages of MICE Compared with Other Methods

Multiple Imputation by Chained Equations (MICE) offers several advantages over other imputation methods, particularly when handling complex datasets with multiple types of missing data. Here’s a detailed look at the benefits of using MICE:

### 1. **Handles Different Types of Variables**
   - **Versatility with Data Types:** MICE can impute missing values in mixed-type data, including continuous, binary, ordinal, and nominal data. This is because MICE uses different imputation models for different types of variables, making it highly adaptable.

### 2. **Reflects Uncertainty in Imputations**
   - **Multiple Imputations:** Unlike single imputation methods, MICE generates multiple complete datasets by repeating the imputation process several times, each time creating plausible values based on a predictive model. This approach acknowledges the uncertainty inherent in any imputation process, as the true value of the missing data is not known.

### 3. **Reduces Bias**
   - **Model-Based Imputation:** MICE works by fitting a sequence of regression models and uses the results to estimate the missing values. This method can reduce the bias that often accompanies simpler methods like mean or median imputation, especially when the data is not missing completely at random (MCAR).

### 4. **Improves Accuracy**
   - **Chained Equations:** By using a set of different predictive models, each tailored to the specific variable's distribution and relationship with other variables in the dataset, MICE can yield more accurate imputations compared to methods that use a single general model for all variables.

### 5. **Handles Missingness That Depends on Observed Data**
   - **Missing at Random (MAR):** MICE models each variable with missing data conditional on the others, so it captures the dependencies among variables and works well when missingness depends on observed values (Missing at Random). When missingness depends on the unobserved values themselves (Missing Not at Random), MICE can still give biased results, as can the simpler methods.

### 6. **Better Estimates of Variability**
   - **Statistical Inference:** The multiple datasets generated allow for statistical analysis that reflects the uncertainty due to the missing data. This is a major advantage when conducting inferential statistics, as it leads to more reliable standard errors and confidence intervals than single imputation methods.

### 7. **Flexibility in Modeling**
   - **Customizable Models:** MICE allows the researcher to specify different models for imputing different variables, using the information most relevant to each type of missing data. This flexibility can improve the quality of imputation where different variables have different underlying distributions.

### Conclusion
While MICE is computationally more intensive and complex compared to simpler methods like mean imputation or even K-Nearest Neighbors (KNN), its ability to provide a nuanced, less biased, and statistically sound approach to handling missing data makes it particularly useful in advanced statistical analyses where the quality of imputation can significantly impact the results.
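To make the idea of multiple imputations concrete, here is a rough sketch using scikit-learn's IterativeImputer with `sample_posterior=True` and a different seed for each run. The synthetic data, the choice of five runs, and the pooled statistic (the mean of `y`) are all assumptions for illustration. It shows only part of the usual pooling procedure, which also combines the variance within each completed dataset.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: y depends on x, and 60 values of y are hidden
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(scale=0.5, size=300)
toy = pd.DataFrame({"x": x, "y": y})
toy.loc[toy.index[:60], "y"] = np.nan

# Draw several completed datasets; each seed samples different plausible values
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(toy)
    estimates.append(completed[:, 1].mean())  # statistic of interest: mean of y

# Pool the estimates and look at how much they vary between imputations
pooled_mean = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))
print(pooled_mean, between_var)
```

The spread between the runs is what a single imputation hides: it is the extra uncertainty caused by not knowing the missing values.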

## 5. Removing Duplicates

Duplicates can skew your data analysis, so it's important to remove them:

In [20]:
dropdf.drop_duplicates(inplace=True)
display(dropdf.head(20))
print()

print(dropdf.shape)
print()


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,0,1,2008,WD,Normal,118000



(1460, 75)
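The shape is unchanged, so no fully identical rows were found. That is expected when a column such as Id is unique per row. The sketch below, on made-up data, counts duplicates first and then checks them on a subset of columns. Which columns define a duplicate (`key_cols`) is an assumption you would make for your own data.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 match on everything except Id
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 8450, 9600],
    "SalePrice": [208500, 181500, 208500, 181000],
})

# Full-row duplicates: none, because Id differs in every row
full_dupes = int(toy.duplicated().sum())

# Duplicates judged on selected columns only
key_cols = ["LotArea", "SalePrice"]
dupes_on_key = int(toy.duplicated(subset=key_cols, keep="first").sum())

# Keep the first occurrence of each key combination
deduped = toy.drop_duplicates(subset=key_cols, keep="first")
print(full_dupes, dupes_on_key, deduped.shape)
```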



## 6. Addressing Inconsistencies

Data inconsistencies, such as variations in text format, can affect categorical data analysis.

### 6.1 Standardizing text data

Convert all text to the same case (e.g., lower case):

In [21]:
dropdf['MSZoning'] = dropdf['MSZoning'].str.lower()

display(dropdf[['MSZoning']].head(20))
print()

print(dropdf.shape)
print()


  MSZoning
0       rl
1       rl
2       rl
3       rl
4       rl
5       rl
6       rl
7       rl
8       rm
9       rl



(1460, 75)
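Case is only one source of inconsistency. Stray whitespace and alternative spellings are also common. The sketch below uses made-up raw values, and the rule that maps `"c (all)"` to `"c"` is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical raw values with mixed case, padding, and a variant spelling
raw = pd.Series([" RL", "rl ", "Rm", "C (all)", None], name="MSZoning")

# Trim whitespace, then lower-case; missing values stay missing
cleaned = raw.str.strip().str.lower()

# Collapse known variants into one label (assumed mapping)
cleaned = cleaned.replace({"c (all)": "c"})

print(cleaned.value_counts())
```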



### 6.2 Fixing formats in data

For dates or other specific types of data, ensure consistent formats:

In [30]:
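# Build a dummy daily date column with exactly one date per row
dropdf['dummy_date'] = pd.date_range(start='2020-11-03', periods=len(dropdf), freq='D')
# Parse with an explicit format (a no-op here because the column is already
# datetime, but this is the call to use when dates arrive as strings)
dropdf['dummy_date'] = pd.to_datetime(dropdf['dummy_date'], format='%Y-%m-%d')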

display(dropdf[['dummy_date']].head(20))
print()

print(dropdf.shape)
print()

  dummy_date
0 2020-11-03
1 2020-11-04
2 2020-11-05
3 2020-11-06
4 2020-11-07
5 2020-11-08
6 2020-11-09
7 2020-11-10
8 2020-11-11
9 2020-11-12



(1460, 76)
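Real date columns usually arrive as strings, and some entries may not parse at all. The following sketch on made-up strings uses `errors="coerce"` so that unparseable entries become `NaT` instead of raising an error. The sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw date strings, including an invalid entry and a missing one
raw_dates = pd.Series(["2020-11-03", "2020-11-04", "not a date", None])

# Entries that do not match the format become NaT
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Once parsed, dates can be written out in any single consistent format
as_text = parsed.dt.strftime("%d/%m/%Y")

print(parsed)
print(as_text.iloc[0])
```

Counting the `NaT` values afterwards shows how many entries need manual attention.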



## 7. Checking the clean data

After cleaning, it’s good practice to check the first few rows of your DataFrame to ensure everything looks correct:

In [31]:
display(dropdf.head())
print()


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,dummy_date
0,1,60,rl,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,0,2,2008,WD,Normal,208500,2020-11-03
1,2,20,rl,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,0,5,2007,WD,Normal,181500,2020-11-04
2,3,60,rl,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,0,9,2008,WD,Normal,223500,2020-11-05
3,4,70,rl,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,0,2,2006,WD,Abnorml,140000,2020-11-06
4,5,60,rl,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,0,12,2008,WD,Normal,250000,2020-11-07
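Looking at the first rows catches obvious problems, but a few automated checks scale better. The sketch below builds a small report on a made-up frame. The specific checks, such as flagging negative prices, are assumptions about what "clean" means for this kind of data.

```python
import pandas as pd

# Hypothetical cleaned frame
toy = pd.DataFrame({
    "MSZoning": ["rl", "rm", "rl"],
    "LotFrontage": [65.0, 80.0, 68.0],
    "SalePrice": [208500, 181500, 223500],
})

# Collect a few sanity checks in one place
report = {
    "rows": len(toy),
    "missing_cells": int(toy.isnull().sum().sum()),
    "duplicate_rows": int(toy.duplicated().sum()),
    "negative_prices": int((toy["SalePrice"] < 0).sum()),
}
print(report)
```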





## 8. Saving the cleaned data

Once your data is cleaned, you may want to save it back to a CSV:

In [26]:
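url_clean_data = 'dataset/Clean_data.csv'
# This saves the KNN-imputed numeric columns from section 4.2;
# swap in another DataFrame (for example dropdf) to save that one instead
df_imputedKNN.to_csv(url_clean_data, index=False)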
print('Data was saved!')


Data was saved!
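A quick way to confirm the file was written as intended is to read it back and compare. This sketch writes a made-up frame to a temporary directory, so the file name and contents are assumptions for illustration. Note that CSV stores dates as text, so they need `parse_dates` when reloading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame with a date column
cleaned = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0],
    "dummy_date": pd.to_datetime(["2020-11-03", "2020-11-04", "2020-11-05"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Clean_data.csv"
    cleaned.to_csv(path, index=False)

    # Reload and ask pandas to parse the date column again
    reloaded = pd.read_csv(path, parse_dates=["dummy_date"])

same_shape = reloaded.shape == cleaned.shape
same_values = reloaded["LotFrontage"].tolist() == cleaned["LotFrontage"].tolist()
print(same_shape, same_values)
```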


This tutorial is a basic guide to start cleaning your data using Python. Depending on the specific needs of your data and the nuances of what “clean” means in your context, you may need to apply more specialized cleaning steps.

#### Thank you