## Mean, Median, and Mode Imputation

Mean, median, and mode imputation are techniques used in data preprocessing to handle missing data in a dataset. They involve replacing missing values with the mean, median, or mode of the available data in the respective column. Each method has its own advantages and is applicable in different scenarios:

1. Mean Imputation:
   - In mean imputation, missing values are replaced with the mean (average) value of the non-missing data in the column.
   - This method is suitable for continuous numerical data, such as age, income, or temperature, where the distribution of values is approximately normal.
   - Because every missing entry receives the same value, mean imputation shrinks the column's variance and pulls the data towards the center of the distribution. If the values are not missing completely at random (MCAR), it can also bias the estimated mean.

2. Median Imputation:
   - Median imputation replaces missing values with the median value, which is the middle value when the data is sorted in ascending order.
   - The median is more robust to outliers than the mean, which makes this method suitable for data with skewed or non-normal distributions.
   - It is the better choice when the column contains extreme values, because a few outliers can shift the mean substantially while leaving the median almost unchanged.

3. Mode Imputation:
   - Mode imputation is used for categorical or nominal data. It replaces missing values with the most frequent category (mode) in the column.
   - This method is appropriate for data like color, nationality, or product type, where the concept of mean or median is not meaningful.
   - Mode imputation is rarely a good fit for continuous data, where most values occur only a few times and the most frequent one may not be representative.
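
To make the contrast between mean and median imputation concrete, here is a minimal sketch on a toy income column. The numbers are made up for illustration and are not taken from the Titanic data; the point is only to show how a single outlier changes the value that gets filled in.

```python
import numpy as np
import pandas as pd

# Toy income column (made-up values): one extreme value and two missing entries.
income = pd.Series([30.0, 32.0, 35.0, np.nan, 31.0, 500.0, np.nan])

# Both statistics are computed from the non-missing values only (NaN is skipped).
mean_value = income.mean()      # pulled upward by the outlier 500.0
median_value = income.median()  # middle of the sorted observed values

# Fill the gaps with each statistic to compare the results.
filled_mean = income.fillna(mean_value)
filled_median = income.fillna(median_value)

print("mean fill value:  ", mean_value)
print("median fill value:", median_value)
```

With these toy values the mean lands far above every typical income, while the median stays among them, which is why the median is usually preferred for skewed columns.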


In [1]:
import pandas as pd
import numpy as np

import warnings
# Silence all warnings to keep the notebook output short (this also hides deprecation notices).
warnings.filterwarnings('ignore')

In [2]:
dataframe = pd.read_csv('Titanic.csv')

In [3]:
dataframe.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
dataframe.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [5]:
dataframe.shape

(891, 12)

In [6]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
dataframe.select_dtypes(include = 'object').head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S


In [8]:
dataframe.isna().sum().any()

True

In [9]:
dataframe.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [10]:
dataframe['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

Suppose the "Age" column has missing entries, stored as "NaN" (which stands for "Not a Number"). Mean imputation replaces each of those entries with the average of the observed ages.

Here's how you can do it:

1. Calculate the mean (average) age from the non-missing entries in the "Age" column.
2. Replace every missing ("NaN") value in the "Age" column with that mean.



In [11]:
dataframe_2 = dataframe.copy()

In [12]:
dataframe_2.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
mean = dataframe_2['Age'].mean()

In [14]:
mean

29.69911764705882

In [15]:
dataframe_2['Age'] = dataframe_2['Age'].fillna(mean)

In [16]:
dataframe_2['Age'].isna().sum()

0
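
Filling every gap with the same value keeps the column mean unchanged but lowers its spread. The sketch below illustrates this on synthetic data rather than on the Titanic file; the distribution parameters, the 20% missing rate, and the random seed are all arbitrary assumptions chosen for the demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic "age-like" column (assumed parameters, not the Titanic data).
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=30.0, scale=14.0, size=1000))

# Remove roughly 20% of the entries completely at random.
missing_mask = rng.random(1000) < 0.2
with_missing = values.mask(missing_mask)

# Mean imputation: every gap receives the mean of the observed values.
filled = with_missing.fillna(with_missing.mean())

std_observed = with_missing.std()  # spread of the observed values only
std_filled = filled.std()          # spread after imputation (smaller)

print("std before imputation:", std_observed)
print("std after imputation: ", std_filled)
```

The imputed column has the same mean as the observed data but a smaller standard deviation, because the filled entries add no deviation from the mean.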

If a feature such as "Age" has missing values (NaN), replacing them with the median is usually the safer choice when the data is skewed or contains outliers, that is, when it does not follow a normal (Gaussian) distribution. In this dataset the mean age (29.70) sits slightly above the median (28.0), which hints at a mild right skew.

Here's how to do it in simple terms:

1. Calculate the median age from the available data in the "Age" feature. The median is the middle value when all the ages are sorted from smallest to largest.

2. Replace all the missing (NaN) ages in the "Age" feature with this calculated median age.


In [17]:
dataframe_3 = dataframe.copy()

In [18]:
dataframe_3.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [19]:
dataframe_3.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [20]:
median = dataframe_3['Age'].median()

In [21]:
median

28.0

In [24]:
dataframe_3['Age'] = dataframe_3['Age'].fillna(median)

In [25]:
dataframe_3['Age'].isna().sum()

0

You can also replace missing values in the "Age" feature with the mode, which is the most frequent value in the column. In this dataset the most common age is 24, so every missing age is set to 24. Because "Age" is numeric, this is shown for illustration only; mode imputation is normally reserved for categorical features.

1. Find the mode (most frequent age) in the "Age" feature.
2. Replace all the missing (NaN) ages in the "Age" feature with this mode.


In [26]:
dataframe_4 = dataframe.copy()

In [27]:
dataframe_4.isna().sum().any()

True

In [28]:
dataframe_4.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [30]:
mode = dataframe_4['Age'].mode()[0]

In [31]:
mode

24.0
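
The `[0]` index is needed because `Series.mode()` returns a Series rather than a single number: several values can tie for most frequent. The following small sketch, using made-up ages, shows a tie and how taking the first element resolves it. Picking the first (smallest) mode is a convention assumed here, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Made-up ages in which 24.0 and 30.0 both occur twice.
ages = pd.Series([22.0, 24.0, 24.0, 30.0, 30.0, np.nan])

# mode() ignores NaN by default and returns all tied values in sorted order.
modes = ages.mode()
print(modes.tolist())

# Taking element 0 selects the smallest of the tied modes.
first_mode = modes[0]
filled = ages.fillna(first_mode)
print(filled.tolist())
```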

In [32]:
dataframe_4['Age'] = dataframe_4['Age'].fillna(mode)

In [33]:
# Still True: only Age was imputed; Cabin and Embarked keep their missing values.
dataframe_4.isna().sum().any()

True

In [34]:
dataframe_4.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
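
Mode imputation is a more natural fit for a categorical column such as "Embarked", which still has 2 missing values above. The sketch below applies the same two steps to a small stand-in DataFrame; the values are made up so that the example runs without the Titanic file, and the name `toy` is hypothetical.

```python
import pandas as pd

# Toy stand-in for a categorical column such as Embarked (made-up values).
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None, "S"]})

# Step 1: find the most frequent category (missing values are ignored).
most_frequent = toy["Embarked"].mode()[0]

# Step 2: replace the missing entries with that category.
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

print("fill value:", most_frequent)
print("remaining missing:", toy["Embarked"].isna().sum())
```

The same pattern could be applied to `dataframe_4['Embarked']`. "Cabin" is a different case: with 687 of 891 values missing, filling it with a single category would mostly invent data.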