# Missing values


In some cases, a dataset is incomplete and contains missing values. In practice, this is the usual situation, so being able to handle missing values is a prerequisite for analyzing the data.

In this section, we will look at how to handle missing data.

In [1]:
import numpy as np
import pandas as pd

from pandas import Series,DataFrame

### Series

In [2]:
ser1 = Series([np.nan,1,2,3,np.nan],
             index=['row1','row2','row3','row4','row5'])
ser1

row1    NaN
row2    1.0
row3    2.0
row4    3.0
row5    NaN
dtype: float64

In [3]:
ser1.isnull()          #returns True when there is a missing value

row1     True
row2    False
row3    False
row4    False
row5     True
dtype: bool

In [4]:
ser1.dropna()      #drops the rows with missing values

row2    1.0
row3    2.0
row4    3.0
dtype: float64
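As a small sketch of how the two calls above relate, the boolean mask from `isnull()` can be summed to count the missing values, and its complement `notnull()` selects the same rows that `dropna()` keeps. The names `ser`, `n_missing` and `present` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values and labels as ser1 in the cells above
ser = pd.Series([np.nan, 1, 2, 3, np.nan],
                index=['row1', 'row2', 'row3', 'row4', 'row5'])

# True counts as 1 when summed, so this is the number of missing values
n_missing = int(ser.isnull().sum())

# Boolean indexing with notnull() keeps only the rows that have a value
present = ser[ser.notnull()]

print(n_missing)               # 2
print(list(present.index))     # ['row2', 'row3', 'row4']
```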

### DataFrames

In [5]:
npn = np.nan

df = DataFrame([[1,npn,3],[4,5,npn],[8,7,9],[npn,npn,npn]])
df

Unnamed: 0,0,1,2
0,1.0,,3.0
1,4.0,5.0,
2,8.0,7.0,9.0
3,,,


In [6]:
df.dropna()     # Eliminates the whole row if there is a missing value in that row (by default axis=0, along rows)

Unnamed: 0,0,1,2
2,8.0,7.0,9.0


In [7]:
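df.dropna(axis=1)   # drops every column that contains a missing value; each column has at least one here, so no columns remain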

0
1
2
3
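The empty result above follows from row 3 being entirely missing, which puts a NaN in every column. The sketch below, which rebuilds the same data under the illustrative name `frame`, shows two ways to keep columns in that situation: relaxing the rule with `how='all'`, or dropping the all-NaN row before dropping columns.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same data as df in the cells above
frame = pd.DataFrame([[1, nan, 3], [4, 5, nan], [8, 7, 9], [nan, nan, nan]])

# Default how='any': every column has at least one NaN, so all are dropped
no_cols = frame.dropna(axis=1)

# how='all' drops only columns that are entirely NaN; none are, so all stay
all_cols = frame.dropna(axis=1, how='all')

# Dropping the all-NaN row first leaves column 0 without missing values
col_zero = frame.dropna(how='all').dropna(axis=1)

print(no_cols.shape)             # (4, 0)
print(all_cols.shape)            # (4, 3)
print(list(col_zero.columns))    # [0]
```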


In [8]:
df.dropna(how='all')    #the row is eliminated only if all its values are missing

Unnamed: 0,0,1,2
0,1.0,,3.0
1,4.0,5.0,
2,8.0,7.0,9.0


In [9]:
df.dropna(thresh=3)      # keep the row only if it has at least 3 non-NaN values

Unnamed: 0,0,1,2
2,8.0,7.0,9.0


In [10]:
df.dropna(thresh=2)    # keep the row only if it has at least 2 non-NaN values

Unnamed: 0,0,1,2
0,1.0,,3.0
1,4.0,5.0,
2,8.0,7.0,9.0
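One way to read `thresh` is as a filter on the number of non-missing values per row. The sketch below computes that count explicitly and checks that filtering on it selects the same rows as `dropna(thresh=2)`. The names `frame`, `non_missing`, `by_mask` and `by_thresh` are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same data as df in the cells above
frame = pd.DataFrame([[1, nan, 3], [4, 5, nan], [8, 7, 9], [nan, nan, nan]])

# Count the non-NaN values in each row
non_missing = frame.notna().sum(axis=1)

# Keep rows with at least 2 values, first by hand and then with thresh
by_mask = frame[non_missing >= 2]
by_thresh = frame.dropna(thresh=2)

print(non_missing.tolist())      # [2, 2, 3, 0]
print(by_mask.index.tolist())    # [0, 1, 2]
print(by_thresh.index.tolist())  # [0, 1, 2]
```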


In [11]:
df

Unnamed: 0,0,1,2
0,1.0,,3.0
1,4.0,5.0,
2,8.0,7.0,9.0
3,,,


In [12]:
df.dropna(thresh=2,inplace=True)
df

Unnamed: 0,0,1,2
0,1.0,,3.0
1,4.0,5.0,
2,8.0,7.0,9.0


If the dataset has many missing values scattered across rows and columns, dropping them with dropna() can discard a large amount of useful data.

In such cases, the fillna() method comes to the rescue.

In [13]:
df.fillna(value='FILLED')

Unnamed: 0,0,1,2
0,1.0,FILLED,3
1,4.0,5,FILLED
2,8.0,7,9
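A string placeholder is easy to spot, but it mixes text into otherwise numeric columns. As a hedged alternative, `fillna()` also accepts a dictionary that maps column labels to fill values, so each column can get its own numeric placeholder. The fill values `0` and `-1` below are arbitrary choices for illustration, and `frame` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same data as df after the all-NaN row was dropped
frame = pd.DataFrame([[1, nan, 3], [4, 5, nan], [8, 7, 9]])

# Keys are column labels, values are the per-column fill values
filled = frame.fillna(value={1: 0, 2: -1})

print(filled[1].tolist())    # [0.0, 5.0, 7.0]
print(filled[2].tolist())    # [3.0, -1.0, 9.0]
print(frame[1].isna().sum()) # 1, the original frame is left unchanged
```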


Filling missing values with the column mean leaves each column's mean unchanged, which limits the statistical distortion introduced into a numerical dataset.

In [14]:
df.fillna(value=df.mean())

Unnamed: 0,0,1,2
0,1.0,6.0,3.0
1,4.0,5.0,6.0
2,8.0,7.0,9.0
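To make the statistical effect concrete, the sketch below compares column statistics before and after a mean fill on the same data. It suggests that the column means are preserved, while the spread of a filled column shrinks because the new values sit exactly at the mean. The names `frame` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same data as df after the all-NaN row was dropped
frame = pd.DataFrame([[1, nan, 3], [4, 5, nan], [8, 7, 9]])

# mean() skips NaN by default, so each column mean uses only observed values
filled = frame.fillna(value=frame.mean())

# Column means are the same before and after filling
means_match = bool(np.allclose(frame.mean(), filled.mean()))
print(means_match)                       # True

# The sample standard deviation of column 1 drops from about 1.414 to 1.0
print(frame[1].std(), filled[1].std())
```

The shrinking standard deviation is the trade-off of this approach: the more values are filled, the more the column's variability is understated.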


In [15]:
df

Unnamed: 0,0,1,2
0,1.0,,3.0
1,4.0,5.0,
2,8.0,7.0,9.0


In [16]:
df[1] = df[1].fillna(value=df[1].mean())   # assign back; inplace=True on a selected column may not update df in recent pandas
df

Unnamed: 0,0,1,2
0,1.0,6.0,3.0
1,4.0,5.0,
2,8.0,7.0,9.0
