# How do I handle missing values in pandas?

In [None]:
import pandas as pd

In [4]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [5]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59


In the output above, the Colors Reported value for row 18236 is NaN rather than a color. NaN stands for 'Not a Number' and indicates a missing value. Below are a few methods for dealing with null values.

###### 16.1 - isnull( )

In [6]:
ufo.isnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,False,True,False,False,False
18237,False,True,False,False,False
18238,False,True,True,False,False
18239,False,False,False,False,False
18240,False,True,False,False,False


isnull() returns True wherever a value is missing and False otherwise.
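As a minimal sketch of the same behavior without downloading the dataset, the hypothetical `toy` DataFrame below (not part of the UFO data) mixes `None` and `np.nan`; pandas treats both as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    'City': ['Ithaca', None, 'Holyoke'],
    'Colors Reported': [np.nan, 'RED', np.nan],
})

# isnull() returns a DataFrame of the same shape holding True/False flags
mask = toy.isnull()
print(mask)
```

Each True in `mask` marks a cell that had no value in `toy`.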

###### 16.2 - notnull( )

In [8]:
ufo.notnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,True,False,True,True,True
18237,True,False,True,True,True
18238,True,False,False,True,True
18239,True,True,True,True,True
18240,True,False,True,True,True


notnull() works as the opposite of isnull(): it returns True if the value is not null and False if it is missing.
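The relationship between the two methods can be checked directly. This is a small sketch on a hypothetical `toy` DataFrame (an assumption for illustration), comparing `notnull()` with the negation of `isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    'Shape Reported': ['DISK', np.nan, 'OVAL'],
    'State': ['IA', 'WI', 'FL'],
})

is_missing = toy.isnull()
is_present = toy.notnull()

# ~ flips every Boolean, so the negated isnull() mask should equal notnull()
same = is_present.equals(~is_missing)
print(same)
```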

In [9]:
ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

The output above shows the number of missing values in each column.

The example below shows how this works.

In [10]:
pd.Series([True,False,True]).sum()

2

The dummy Series contains Boolean values: two True and one False. When performing mathematical operations on them, pandas treats True as 1 and False as 0, so the sum is 2.

The output of ufo.isnull().sum() is at the column level. As covered in previous notes, sum() defaults to axis=0, which combines the rows within each column. Passing axis=1 does the reverse and sums across the columns of each row, as in the example below.

In [14]:
ufo.isnull().sum(axis=1).head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

In [15]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


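The output above shows that each of the first five rows has exactly one null value (Colors Reported), which matches the per-row counts from ufo.isnull().sum(axis=1).head().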
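The two axes can be compared side by side on a small frame. This is only a sketch; the `toy` DataFrame and its values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    'City': ['Ithaca', np.nan, 'Holyoke'],
    'Colors Reported': [np.nan, np.nan, 'RED'],
    'State': ['NY', 'NJ', 'CO'],
})

# axis=0 (the default): one count per column
per_column = toy.isnull().sum()

# axis=1: one count per row
per_row = toy.isnull().sum(axis=1)

print(per_column)
print(per_row)
```

Both results add up to the same total number of missing cells; only the grouping differs.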

In [17]:
ufo[ufo.City.isnull()]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00


The code above filters the DataFrame to the rows where the City column is null. It returns 25 rows (only some of which are shown above), which are the raw data behind the City count in ufo.isnull().sum().
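The link between the filter and the count can be sketched on a hypothetical `toy` DataFrame (made up for illustration). Bracket notation `toy['City']` is used here in place of `toy.City`; the two are equivalent for this column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    'City': ['Ithaca', np.nan, 'Holyoke', np.nan],
    'State': ['NY', 'LA', 'CO', 'CA'],
})

# The Boolean Series selects only the rows where City is missing
missing_city = toy[toy['City'].isnull()]

# Summing the same Boolean Series gives the number of rows selected
n_missing = toy['City'].isnull().sum()

print(missing_city)
print(len(missing_city) == n_missing)
```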

#### Options to deal with Null

###### Drop Nulls

In [18]:
ufo.shape

(18241, 5)

In [19]:
ufo.dropna(how='any').shape

(2486, 5)

dropna() drops a row if the value in any column is missing. By default, dropna() uses how='any', so calling it without arguments gives the same result.

In [21]:
ufo.dropna().shape

(2486, 5)

In [22]:
ufo.dropna(how='all').shape

(18241, 5)

how='all' drops a row only if all of its values are missing. The shape stays at (18241, 5) because no row is entirely empty.
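The difference between how='any' and how='all' is easier to see on a frame that does contain a fully empty row. The `toy` DataFrame below is a hypothetical example, not part of the UFO data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing
toy = pd.DataFrame({
    'City': ['Ithaca', np.nan, np.nan],
    'Shape Reported': ['TRIANGLE', 'DISK', np.nan],
})

# how='any' keeps only rows with no missing value at all
rows_any = toy.dropna(how='any').shape

# how='all' removes only rows where every value is missing
rows_all = toy.dropna(how='all').shape

print(rows_any, rows_all)
```

Neither call modifies `toy`; dropna() returns a new DataFrame unless inplace=True is passed.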

In [25]:
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

(15576, 5)

This drops a row if either of the two columns ('City', 'Shape Reported') is missing for that row. Missing values in other columns are ignored.
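As a rough sketch of how subset narrows the check, the hypothetical `toy` DataFrame below (made up for illustration) has a missing Colors Reported value in a row that is otherwise complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    'City': ['Ithaca', np.nan, 'Holyoke'],
    'Colors Reported': [np.nan, 'RED', np.nan],
    'Shape Reported': ['TRIANGLE', 'DISK', np.nan],
})

# Only City and Shape Reported are inspected, so row 0 survives
# even though its Colors Reported value is missing
kept = toy.dropna(subset=['City', 'Shape Reported'], how='any')

# Without subset, every row has at least one missing value and is dropped
kept_no_subset = toy.dropna(how='any')

print(kept)
print(kept_no_subset.shape)
```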

###### Filling Missing Values

In [26]:
ufo['Shape Reported'].value_counts(dropna=False)

LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
PYRAMID         1
DOME            1
FLARE           1
HEXAGON         1
Name: Shape Reported, dtype: int64

We can see that NaN has a count of 2644. The missing values can be replaced with 'NotReceived'; the process is shown below.

In [27]:
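# Assign the result back; inplace=True on a selected column may not update ufo in recent pandas versions
ufo['Shape Reported'] = ufo['Shape Reported'].fillna(value='NotReceived')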

In [28]:
ufo['Shape Reported'].value_counts(dropna=False)

LIGHT          2803
NotReceived    2644
DISK           2122
TRIANGLE       1889
OTHER          1402
CIRCLE         1365
SPHERE         1054
FIREBALL       1039
OVAL            845
CIGAR           617
FORMATION       434
VARIOUS         333
RECTANGLE       303
CYLINDER        294
CHEVRON         248
DIAMOND         234
EGG             197
FLASH           188
TEARDROP        119
CONE             60
CROSS            36
DELTA             7
CRESCENT          2
ROUND             2
PYRAMID           1
HEXAGON           1
DOME              1
FLARE             1
Name: Shape Reported, dtype: int64

NaN has been replaced by NotReceived, which now carries the count of 2644.
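The same fill can be tried on a small hypothetical Series (the `shapes` values are made up for illustration). This sketch keeps the original untouched and stores the filled copy under a new name, which makes the before and after counts easy to compare.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
shapes = pd.Series(['LIGHT', np.nan, 'DISK', np.nan, 'LIGHT'],
                   name='Shape Reported')

# fillna() returns a new Series; shapes itself is not changed
filled = shapes.fillna('NotReceived')

# dropna=False keeps missing values in the tally, if any remain
counts = filled.value_counts(dropna=False)
print(counts)
```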