### Techniques for Treating Missing or Null Values
Treating missing values is a critical step in preparing features for model building. Missing values can cause issues with model convergence and bias the results. Several techniques can be employed to handle missing values effectively. The choice of technique depends on the nature of the data and the problem at hand. Here are some commonly used techniques:

1. **Deletion**: In some cases, if the proportion of missing values is very small, the simplest approach is to remove the rows or columns with missing values. However, this should be done carefully, as it may lead to a loss of valuable information.

2. **Forward Fill/Backward Fill**: In time-series data or ordered data, missing values can be filled using the most recent known value (forward fill) or the next known value (backward fill).

3. **Mean/Median/Mode Imputation**: For numerical features, a common technique is to impute missing values with the mean or median of the non-missing values in that feature; for categorical features, the mode (most frequent value) is used instead. This approach is straightforward but may introduce bias.

4. **K-Nearest Neighbors (KNN) Imputation**: KNN imputation involves finding the K nearest neighbors of a data point with missing values and using their values to impute the missing ones. This method takes into account the similarity between data points.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv(r"D:\Data\Test.csv")

In [3]:
data.shape

(4001, 14)

In [4]:
data.head()

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,Today,Age,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure
0,1.0,Laraine,Medendorp,Female,93.0,19644,45137,69.0,Executive Secretary,Health,Mass Customer,N,Yes,11.0
1,2.0,Eli,Bockman,Male,81.0,29571,45137,,Administrative Officer,Financial Services,Mass Customer,N,Yes,16.0
2,3.0,Arlin,Dearle,Male,61.0,19744,45137,,Recruiting Manager,Property,Mass Customer,N,Yes,15.0
3,4.0,Talbot,,Male,33.0,22557,45137,,,IT,Mass Customer,N,No,7.0
4,5.0,Sheila-kathryn,Calton,Female,56.0,28258,45137,,Senior Editor,,Affluent Customer,N,Yes,8.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4001 entries, 0 to 4000
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   customer_id                          4000 non-null   float64
 1   first_name                           4000 non-null   object 
 2   last_name                            3875 non-null   object 
 3   gender                               4000 non-null   object 
 4   past_3_years_bike_related_purchases  4000 non-null   float64
 5   DOB                                  3913 non-null   object 
 6   Today                                4001 non-null   int64  
 7   Age                                  1 non-null      float64
 8   job_title                            3494 non-null   object 
 9   job_industry_category                3344 non-null   object 
 10  wealth_segment                       4000 non-null   object 
 11  deceased_indicator            

## *`Checking the null values using isnull() and isna()`*

In [6]:
# Checking the null values using isnull().
# isnull() and isna() are aliases, so either one can be used.
data.isnull().sum()

customer_id                               1
first_name                                1
last_name                               126
gender                                    1
past_3_years_bike_related_purchases       1
DOB                                      88
Today                                     0
Age                                    4000
job_title                               507
job_industry_category                   657
wealth_segment                            1
deceased_indicator                        1
owns_car                                  1
tenure                                   88
dtype: int64

In [7]:
# Checking the null values using isna().
# Converting the counts into percentages to see the share of null values per column.
round((data.isna().sum()/len(data))*100,2)

customer_id                             0.02
first_name                              0.02
last_name                               3.15
gender                                  0.02
past_3_years_bike_related_purchases     0.02
DOB                                     2.20
Today                                   0.00
Age                                    99.98
job_title                              12.67
job_industry_category                  16.42
wealth_segment                          0.02
deceased_indicator                      0.02
owns_car                                0.02
tenure                                  2.20
dtype: float64
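
As a small check on the two cells above, the sketch below uses a toy DataFrame (hypothetical values, not the customer data) to show that `isnull()` and `isna()` produce the same mask, and that taking the mean of that mask gives the missing fraction per column directly.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the customer data.
toy = pd.DataFrame({
    "gender": pd.Series(["Female", "Male", None, "Male"], dtype=object),
    "tenure": [11.0, np.nan, np.nan, 7.0],
})

# isnull() is an alias of isna(), so both masks are identical.
same_mask = toy.isnull().equals(toy.isna())

# The mean of a boolean mask is the fraction of missing values per column.
missing_pct = (toy.isna().mean() * 100).round(2)

print(same_mask)
print(missing_pct)
```

Using `isna().mean()` avoids dividing by `len(data)` by hand, but both forms give the same percentages.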

### 1. Dropping or Deleting missing values
* Remove rows that contain missing values if fewer than 5% of the rows in the DataFrame are affected.
* Remove a whole column if more than 50% of its values are missing.

- **Almost every column in this data has missing values. For columns where fewer than 5% of the values are null, we can drop the affected rows.**
- **When the dataset is small, preserving rows matters more, so filling the null values is preferable to dropping them.**
- **The *`Age`* column is 99.98% null, well above the 50% threshold, so it carries almost no information and we can drop it.**

In [8]:
data.drop(['Age'], axis=1, inplace=True)
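
The two rules of thumb above can be sketched as a threshold-based routine. This is a minimal illustration on a toy frame; the column names, the values, and the constants `COL_DROP_THRESHOLD` and `ROW_DROP_THRESHOLD` are assumptions chosen to mirror the 50% and 5% guidelines, not fixed rules.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 40 rows, "age" almost empty, "tenure" with one gap.
n = 40
age = np.full(n, np.nan)
age[0] = 69.0
tenure = np.arange(n, dtype=float)
tenure[3] = np.nan
toy = pd.DataFrame({"customer_id": np.arange(1, n + 1), "age": age, "tenure": tenure})

COL_DROP_THRESHOLD = 0.50  # assumed cutoff: drop columns with more than 50% missing
ROW_DROP_THRESHOLD = 0.05  # assumed cutoff: drop rows only if under 5% are affected

# Step 1: drop columns whose missing fraction exceeds the column threshold.
missing_frac = toy.isna().mean()
cols_to_drop = missing_frac[missing_frac > COL_DROP_THRESHOLD].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

# Step 2: drop incomplete rows only when few enough rows are affected.
affected_rows = reduced.isna().any(axis=1).mean()
cleaned = reduced.dropna() if affected_rows < ROW_DROP_THRESHOLD else reduced

print(cols_to_drop, len(toy), len(cleaned))
```

When the affected share of rows is above the row threshold, the sketch leaves the rows in place so that one of the imputation techniques below can be applied instead.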

### 2. Forward Fill/Backward Fill:
* Forward fill and backward fill are imputation techniques that replace a missing value with a neighboring observation.
* Forward fill replaces a missing value with the previous observation's value.
* Backward fill replaces a missing value with the next observation's value.
#### Advantages:
* Forward fill and backward fill are straightforward to implement and understand.
#### Disadvantages:
* Both assume that the row order is meaningful. When it is not, or when the gaps are long, the filled values can be inaccurate and can bias the data.

In [9]:
# Forward-fill only the job_title column.
# (fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are the direct equivalents.)
data['job_title'] = data['job_title'].ffill()

In [10]:
# Backward-fill only the DOB column.
data['DOB'] = data['DOB'].bfill()
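
To make the direction of each fill concrete, here is a minimal sketch on a hypothetical daily series (the dates and values are made up for illustration). It also shows an edge case: a gap at the end of the series has no "next" value, so backward fill leaves it missing.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps in the middle and at the end.
s = pd.Series(
    [10.0, np.nan, np.nan, 40.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = s.ffill()          # each gap takes the last known value before it
backward = s.bfill()         # each gap takes the next known value after it
limited = s.ffill(limit=1)   # fill at most one consecutive gap per run

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "ffill_limit_1": limited}))
```

The `limit` argument is one way to avoid carrying a stale value across a long run of missing observations.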

### 3. Mean/Median/Mode Imputation:

* Missing values in a dataset are commonly filled using mean, median, or mode imputation. Let's explore each in turn.
### *`Mean Imputation`*

#### Advantages:
- Mean imputation is easy to implement and computationally cheap.
- It preserves the mean of the observed values. The variance is not preserved: it shrinks, as noted in the disadvantages below.
- Numerical data: mean imputation applies to numerical features only.

#### Disadvantages:
1. Underestimation of true variability
2. Impact on correlation structure
3. Sensitivity to outliers
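
The first disadvantage can be checked on synthetic data. The sketch below (random values and a 30% missing rate, both arbitrary assumptions) removes some values, fills them with the mean, and compares the summary statistics before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature (assumed normal, mean 50, std 10) with 30% of values removed.
full = pd.Series(rng.normal(50, 10, 1000))
with_gaps = full.copy()
missing_pos = rng.choice(len(full), size=300, replace=False)
with_gaps.iloc[missing_pos] = np.nan

# Mean imputation: every gap receives the same value.
imputed = with_gaps.fillna(with_gaps.mean())

print("mean before/after:", with_gaps.mean(), imputed.mean())
print("std  before/after:", with_gaps.std(), imputed.std())
```

The mean is unchanged, while the standard deviation drops, because the imputed points all sit exactly at the center of the distribution.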

### *`Median Imputation`*

#### Advantages:
1. Robustness against outliers
2. Maintains data order
3. Applicable to skewed data

#### Disadvantages:
1. Ignores data distribution
2. Lack of variability preservation

### *`Mode Imputation`*

#### Advantages:
1. Suitable for categorical data
2. Preserves data categories

#### Disadvantages:

1. Applicability largely limited to categorical (or discrete) data

2. Biased representation of the most frequent category

- Because gender is a categorical column, we can use the mode to fill its missing values.

In [11]:
# For the past_3_years_bike_related_purchases column we can use either the mean or the median.
data.past_3_years_bike_related_purchases.mean()

48.89

In [12]:
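# Assign the result back instead of calling fillna(..., inplace=True) on a column accessor.
data['past_3_years_bike_related_purchases'] = data['past_3_years_bike_related_purchases'].fillna(data['past_3_years_bike_related_purchases'].mean())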

In [13]:
data.past_3_years_bike_related_purchases.median()

48.0

In [14]:
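# Alternative to the mean fill in In [12]: use one or the other, since a second fill has nothing left to replace.
data['past_3_years_bike_related_purchases'] = data['past_3_years_bike_related_purchases'].fillna(data['past_3_years_bike_related_purchases'].median())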

In [15]:
# Take the most frequent category of the gender column.
data.gender.mode()[0]

'Female'

In [16]:
data['gender'] = data['gender'].fillna(data['gender'].mode()[0])

## 🌟Observations:
- Before filling with the mean or median, check the **`distribution and skewness of the data.`**
- If the data contains outliers, prefer the median, because the mean is pulled toward outliers.
- Distribution plots such as histplot and kdeplot help visualize where the mean and median fall.
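
One way to turn these observations into code is to compute the skewness first and pick the fill value from it. This is a minimal sketch on made-up values; the cutoff `SKEW_CUTOFF = 1.0` is an assumed rule of thumb, not a standard.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one extreme outlier (400) and two missing values.
s = pd.Series([20, 22, 25, 27, 30, 31, 33, 400, np.nan, np.nan], dtype=float)

SKEW_CUTOFF = 1.0  # assumed rule of thumb for "strongly skewed"

skewness = s.skew()  # missing values are ignored by default
# Strong skew or outliers: use the median; otherwise the mean is reasonable.
fill_value = s.median() if abs(skewness) > SKEW_CUTOFF else s.mean()
filled = s.fillna(fill_value)

print("skewness:", skewness)
print("mean:", s.mean(), "median:", s.median(), "chosen:", fill_value)
```

Here the single outlier drags the mean far above most observations, so the median is the more representative fill value.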
### scikit-learn's `SimpleImputer()` automates mean, median, and mode imputation. It handles both numerical and categorical data (the mean and median strategies apply to numerical columns only).
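
A minimal sketch of this imputer on a toy frame follows. The column names and values are hypothetical; `strategy="median"` and `strategy="most_frequent"` are the scikit-learn names for median and mode imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values): one numerical and one categorical column.
toy = pd.DataFrame({
    "purchases": [93.0, 81.0, np.nan, 33.0, 56.0],
    "gender": pd.Series(["Female", "Male", "Male", np.nan, "Male"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # numerical columns
cat_imputer = SimpleImputer(strategy="most_frequent")  # mode; also works on strings

# fit_transform returns a 2-D array, so flatten it before assigning to a column.
toy["purchases"] = num_imputer.fit_transform(toy[["purchases"]]).ravel()
toy["gender"] = cat_imputer.fit_transform(toy[["gender"]]).ravel()

print(toy)
```

Because the imputer stores the fitted statistic, calling `transform` later on new data reuses the value learned from the training data rather than recomputing it.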

### 4. K-Nearest Neighbors (KNN) Imputation:

In [25]:
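from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
# fit_transform returns a 2-D array, so flatten it before assigning to the column.
# Note: with tenure as the only feature there is nothing to measure distance on,
# so KNNImputer falls back to the column mean for the rows with missing tenure.
data['tenure'] = imputer.fit_transform(data[['tenure']]).ravel()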

* KNNImputer works only with numerical data.
* Internally it uses the K-Nearest Neighbors algorithm: it measures the distance between data points from their numerical feature values and fills each missing value from the nearest neighbors.
* It is not suitable for imputing missing values in categorical or other non-numeric data.
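
KNN imputation becomes more informative when the imputer sees more than one numerical feature, because the other features define which rows count as neighbors. The sketch below uses a toy frame (hypothetical values, with `n_neighbors=2` chosen for the small size) in which `purchases` separates two clear groups.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical): two groups of customers, tenure missing in rows 2 and 5.
toy = pd.DataFrame({
    "purchases": [10.0, 12.0, 11.0, 90.0, 92.0, 91.0],
    "tenure":    [2.0,  3.0,  np.nan, 18.0, 20.0, np.nan],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Row 2 borrows tenure from rows 0 and 1; row 5 borrows from rows 3 and 4.
print(imputed)
print("global mean of tenure:", toy["tenure"].mean())
```

Each missing tenure is filled from its own group rather than with the global mean. Since the method is distance based, features on very different scales would usually be scaled first; that step is omitted here for brevity.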