# <font face = 'Impact' color = '#FFAEBC' > Data cleaning through imputation techniques </font>
#### <font face = 'Times New Roman' color = '#B5E5CF'> License: GPL v3.0</font>
#### <font face = 'Times New Roman' color = '#B5E5CF'> Author and Trainer: Paolo Hilado MSc. (Data Science)</font>
This notebook provides a practical introduction to handling missing data through various imputation techniques. You'll explore methods such as mean, median, and mode imputation, as well as forward/backward fill, group-wise strategies, and a brief look at more advanced approaches like interpolation and model-based imputation. The goal is to equip you with the tools to clean and prepare datasets effectively for analysis or modeling.


In [None]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import researchpy as rp

In [None]:
# Load the dataset
df = pd.read_excel("PracticeTest.xlsx")
df.head(5)

In [None]:
# Check whether the dataframe contains any missing values (returns a single True/False)
df.isnull().values.any()

In [None]:
# Out of curiosity, subset the dataframe to view the rows that contain missing values.
missing_df1 = df[df.isnull().all(axis=1)]  # rows in which every column is missing
missing_df2 = df[df.isnull().any(axis=1)]  # rows in which at least one column is missing

In [None]:
missing_df1

In [None]:
missing_df2
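
The difference between `.all(axis=1)` and `.any(axis=1)` is easier to see on a small frame. The sketch below uses a hypothetical toy dataframe (`toy`, with made-up columns `a` and `b`), not the contents of `PracticeTest.xlsx`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from PracticeTest.xlsx
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [2.0, 5.0, np.nan, 8.0],
})

# Rows in which every column is missing (often safe to drop)
all_missing = toy[toy.isnull().all(axis=1)]

# Rows in which at least one column is missing (candidates for imputation)
any_missing = toy[toy.isnull().any(axis=1)]

print(all_missing.index.tolist())  # [2]
print(any_missing.index.tolist())  # [1, 2]
```

Rows that are entirely empty carry no information to impute from, so they are usually handled separately from partially missing rows.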

In [None]:
df.info()

In [None]:
# Check out the descriptive analysis results using .describe() which excludes missing cases.
df.describe().T

In [None]:
# Determine the columns in the data frame that have missing cases.
missing_columns = df.isnull().sum()
# Show only columns with at least one missing value
# missing_columns = missing_columns[missing_columns > 0]
missing_columns
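
Raw counts are often easier to judge alongside the share of rows they represent. This is a minimal sketch on a hypothetical toy dataframe (`toy`, with made-up columns `score`, `age`, and `id`), assuming you want only the affected columns, sorted from most to least missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the notebook's df
toy = pd.DataFrame({
    "score": [10.0, np.nan, 30.0, np.nan],
    "age": [21.0, 22.0, np.nan, 24.0],
    "id": [1, 2, 3, 4],
})

# isnull().sum() counts missing cells; isnull().mean() gives their proportion
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Keep only columns with at least one missing value, worst first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```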

In [None]:
df.head()

In [None]:
# Select columns 1 to 30 by position; .copy() makes df1 independent of df
df1 = df.iloc[:, 1:31].copy()
df1.head()

In [None]:
# Doing mean imputation
# Check out the mean for each feature
df1.mean()

In [None]:
df.head()

In [None]:
# Perform mean imputation for each column in your dataframe
# (each column's mean is rounded to 2 decimal places before filling)
df1 = df1.fillna(df1.mean().round(2))
df1.head()

In [None]:
df1.mean()
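
Comparing the means before and after imputation shows little change, which can be misleading. The sketch below, on a hypothetical toy column `s`, illustrates that filling with the (unrounded) mean leaves the mean intact but shrinks the standard deviation, because every filled value sits exactly at the centre.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from PracticeTest.xlsx
s = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

# Fill the gaps with the mean of the observed values
imputed = s.fillna(s.mean())

print(s.mean(), imputed.mean())  # 5.0 5.0
print(s.std(), imputed.std())    # about 2.58 before, 2.0 after
```

The reduced spread can make later statistical tests look more certain than the data justify.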

In [None]:
df

In [None]:
# Reload the dataset so that median imputation starts from the original data
df = pd.read_excel("PracticeTest.xlsx")
df.head(5)

In [None]:
# Select columns 1 to 30 by position; .copy() makes df1 independent of df
df1 = df.iloc[:, 1:31].copy()
df1.head()

In [None]:
# Doing median imputation
# Check out the median for each feature
df1.median()

In [None]:
# Perform median imputation for each column in your dataframe
df1 = df1.fillna(df1.median())
df1.head()
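
One common reason to prefer the median over the mean is robustness to extreme values. The sketch below uses a hypothetical skewed column `s` with a single outlier to show how far apart the two fill values can be.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with one extreme value (200.0)
s = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 200.0])

print(s.mean())    # 49.2, pulled upward by the outlier
print(s.median())  # 12.0, close to the typical values

# The same gap receives very different values depending on the method
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())
print(mean_filled[3], median_filled[3])
```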

In [None]:
# Reload the dataset so that mode imputation starts from the original data
df = pd.read_excel("PracticeTest.xlsx")
df.head(5)

In [None]:
# Check the most frequently occurring value (the mode) of the variable Sex
df['Sex'].mode()
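
Note that `.mode()` returns a Series, not a single value, because several values can tie for most frequent. The sketch below uses a hypothetical column `sex` with a deliberate tie to show what indexing with `[0]` picks in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with a tie between "F" and "M"
sex = pd.Series(["M", "F", "F", "M", np.nan])

# mode() ignores NaN by default and returns all tied values in sorted order
modes = sex.mode()
print(modes.tolist())  # ['F', 'M']

# Taking [0] silently picks the first value in sorted order
filled = sex.fillna(modes[0])
print(filled.tolist())  # ['M', 'F', 'F', 'M', 'F']
```

When more than one mode is returned, it is worth deciding deliberately which value to use instead of relying on sort order.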

In [None]:
# Perform mode imputation for column Sex
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])
df.head(5)
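
The introduction also mentions forward/backward fill, interpolation, and group-wise strategies. The following is a minimal sketch of those methods on a hypothetical toy dataframe (`toy`, with made-up columns `group` and `value`). It assumes the row order is meaningful for the fill and interpolation steps, which may not hold for the practice dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from PracticeTest.xlsx
toy = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B"],
    "value": [1.0, np.nan, 3.0, 5.0, np.nan, 40.0],
})

# Forward fill: carry the last observed value forward
ffilled = toy["value"].ffill()

# Backward fill: pull the next observed value backward
bfilled = toy["value"].bfill()

# Linear interpolation between the neighbouring observed values
interpolated = toy["value"].interpolate(method="linear")

# Group-wise imputation: fill each gap with the median of its own group
group_filled = toy["value"].fillna(
    toy.groupby("group")["value"].transform("median")
)

print(ffilled.tolist())       # [1.0, 1.0, 3.0, 5.0, 5.0, 40.0]
print(bfilled.tolist())       # [1.0, 3.0, 3.0, 5.0, 40.0, 40.0]
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 5.0, 22.5, 40.0]
print(group_filled.tolist())  # [1.0, 3.0, 3.0, 5.0, 40.0, 40.0]
```

Forward fill and interpolation both borrow from group A to fill the gap in group B (row 4), whereas the group-wise fill uses only values from group B.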

# <font color = '#FFAEBC' > Remember that these basic data imputation techniques should be used with caution and avoided as much as possible. </font>


# <font color = '#FFAEBC' > It is always best practice to verify with the primary source and enter the accurate data. When the missing value can be computed from existing features in the data frame, that is also a good alternative. </font>
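
As an illustration of computing a missing value from existing features, the sketch below assumes a hypothetical dataframe in which `total` is defined as `written + practical`. These column names are invented for the example and may not exist in `PracticeTest.xlsx`.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; the notebook's dataset may not contain them
toy = pd.DataFrame({
    "written": [40.0, 35.0, 45.0],
    "practical": [50.0, 45.0, 40.0],
    "total": [90.0, np.nan, 85.0],
})

# Recompute the missing total exactly instead of estimating it
toy["total"] = toy["total"].fillna(toy["written"] + toy["practical"])
print(toy["total"].tolist())  # [90.0, 80.0, 85.0]
```

Because the filled value is derived from a known relationship, it is exact and not a statistical guess, provided the relationship really holds for every row.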