Link: https://www.w3resource.com/pandas/dataframe/dataframe-fillna.php

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("titanic.csv")

In [3]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.shape

(891, 12)

In [5]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Method 1: Dropping the row.

In [6]:
# Drop the rows where Embarked is missing (2 rows in this dataset)
df.dropna(subset=['Embarked'], inplace=True)

In [7]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [8]:
df.shape

(889, 12)

You can also use the thresh parameter of dropna, for example to keep only the rows with at least 2 non-NA values:

        df.dropna(thresh=2)
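
To see how thresh counts non-missing values per row, here is a minimal sketch on a small made-up frame. The `toy` data is hypothetical and not taken from titanic.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

# Row 0 has 3 non-NA values, row 1 has 1, row 2 has 0.
kept = toy.dropna(thresh=2)  # keep rows with at least 2 non-NA values
print(kept.index.tolist())
```

Only row 0 meets the threshold here; lowering thresh to 1 would also keep row 1.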


### Method 2: Dropping the column.

In [9]:
df.drop(['Cabin', 'Age'], axis = 1, inplace=True)

In [10]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [11]:
df.shape

(889, 10)
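
Instead of naming the columns by hand, one option is to drop every column whose share of missing values exceeds a cutoff. The sketch below assumes a hypothetical `toy` frame and an arbitrary cutoff of 0.5; neither comes from the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, 7.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
})

# Fraction of missing values per column: id 0.0, mostly_missing 0.75, some_missing 0.25
missing_fraction = toy.isnull().mean()

# Assumed cutoff: drop columns with more than 50% missing values.
cols_to_drop = missing_fraction[missing_fraction > 0.5].index
reduced = toy.drop(columns=cols_to_drop)
print(list(reduced.columns))
```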

### Method 3: Filling the Missing Values – Imputation

Fill a numerical variable with its mean or median.

Fill a categorical variable with its mode (the most frequent value).

Fill a numerical variable with 0, -999, or another number that does not occur in the data, so that a model can tell these entries apart from real observations.

Fill a categorical variable with a new category that stands for "missing".
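
The four options above can be sketched with plain fillna calls. The `toy` frame and its column names are hypothetical, chosen only so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "port": ["S", "C", None, "S"],
})

# Numerical: mean, median, or a sentinel value that cannot occur in the data.
age_mean = toy["age"].fillna(toy["age"].mean())
age_median = toy["age"].fillna(toy["age"].median())
age_sentinel = toy["age"].fillna(-999)

# Categorical: the mode, or a new category for "missing".
port_mode = toy["port"].fillna(toy["port"].mode()[0])
port_new = toy["port"].fillna("U")
```

Each call returns a new Series, so the original `toy` frame stays untouched until a result is assigned back to a column.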

In [12]:
df = pd.read_csv("titanic.csv")

#### Filling with a specific word or letter: missing Embarked values are filled with "U" (unknown)

In [13]:
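# Assign the result back to the column. Calling fillna(..., inplace=True) on
# df.Embarked is chained assignment and may not update df in newer pandas versions.
df['Embarked'] = df['Embarked'].fillna("U")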

In [14]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

#### Imputing with mean

In [15]:
mean_value = df.Age.mean()

In [16]:
df.Age.isnull().sum()

177

In [17]:
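# Assign back instead of using inplace=True on the selected column,
# which may not update df in newer pandas versions.
df['Age'] = df['Age'].fillna(mean_value)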

In [18]:
df.Age.isnull().sum()

0

The same pattern works with the median (df.Age.median()) and, for categorical columns, with the mode (for example df.Embarked.mode()[0]).

#### Using SimpleImputer() from sklearn.impute

Link: https://www.youtube.com/watch?v=thnUTFqANfE&t=1205s

In [19]:
df = pd.read_csv("titanic.csv")

In [20]:
# Uncomment the line below if the imputation throws the error
# "Input contains NaN, infinity or a value too large for dtype('float64')"

#df.replace([np.inf, -np.inf], np.nan, inplace=True)
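
As a rough illustration of what that replace call does, the sketch below turns infinite values into NaN so that they are counted, and later imputed, as missing. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with infinite values, only for illustration.
toy = pd.DataFrame({"fare": [7.25, np.inf, -np.inf, np.nan]})

# isnull() does not treat inf as missing, so only 1 value counts before the replace.
missing_before = int(toy["fare"].isnull().sum())

# Convert +inf and -inf to NaN; they are now reported as missing too.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
missing_after = int(cleaned["fare"].isnull().sum())
print(missing_before, missing_after)
```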

In [21]:
from sklearn.impute import SimpleImputer

In [22]:
# missing_values can also be set explicitly inside SimpleImputer (e.g. missing_values=None). The default is np.nan.
# The default strategy is "mean".

mean_imputer = SimpleImputer(strategy="mean") # default is mean
mode_imputer = SimpleImputer(strategy="most_frequent") # mode imputation
median_imputer = SimpleImputer(strategy="median") # median imputation
constant_imputer = SimpleImputer(strategy="constant", fill_value='S') # fills missing values with a constant, which can be a string or a number
#imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=80)

##### Mean imputation

In [23]:
df['Age'] = mean_imputer.fit_transform(df['Age'].values.reshape(-1,1))

# The column can also be selected by position with iloc.
#df.iloc[:,5] = mean_imputer.fit_transform(df.iloc[:,5].values.reshape(-1, 1))

In [24]:
df.Age.isnull().sum()

0

##### Mode Imputation

In [25]:
df['Embarked'] = mode_imputer.fit_transform(df['Embarked'].values.reshape(-1,1))

In [26]:
df.Embarked.isnull().sum()

0

##### Median Imputation

In [27]:
df1 = pd.read_csv("titanic.csv")

In [28]:
# Impute on the freshly loaded df1; df['Age'] was already filled by the mean imputer above.
df1['Age'] = median_imputer.fit_transform(df1['Age'].values.reshape(-1,1))

In [29]:
df1.Age.isnull().sum()

0

##### Constant imputation

In [30]:
df1.Embarked.isnull().sum()

2

In [31]:
# Use df1's own Embarked column as input; df['Embarked'] was already filled by the mode imputer.
df1['Embarked'] = constant_imputer.fit_transform(df1['Embarked'].values.reshape(-1,1))

In [32]:
df1.Embarked.isnull().sum()

0

##### SimpleImputer can also be applied to multiple columns
First select the columns that need imputation.

In [33]:
data = [[12, np.nan, 34], [10, 32, np.nan],
        [np.nan, 11, 20]]

In [34]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [35]:
data = imputer.fit_transform(data)

In [36]:
data

array([[12. , 21.5, 34. ],
       [10. , 32. , 27. ],
       [11. , 11. , 20. ]])
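
The same idea carries over to a DataFrame: separate the numerical and categorical columns, then apply a suitable imputer to each group. This is a minimal sketch on a hypothetical `toy` frame; the column names and the choice of strategies are assumptions, not part of the Titanic example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "fare": [7.0, 71.0, np.nan],
    "port": pd.Series(["S", np.nan, "S"], dtype=object),
})

# Segregate the columns by the kind of imputation they need.
num_cols = ["age", "fare"]
cat_cols = ["port"]

# Mean for the numerical columns, most frequent value for the categorical one.
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy.isnull().sum())
```

Passing a list of column names keeps the input two-dimensional, so no reshape(-1, 1) is needed.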