_This introduction to pandas is derived from Data School's pandas Q&A, supplemented with my own notes and code._

## Handling missing values in pandas

In [1]:
import pandas as pd

In [2]:
url = 'http://bit.ly/uforeports'
ufo = pd.read_csv(url)

In [4]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59


**pandas marks missing values as NaN (they appear as empty fields in the output above)**

In [5]:
# isnull() checks every cell for a missing value
# it returns a DataFrame of booleans: True where the value is NaN, False otherwise
ufo.isnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,False,True,False,False,False
18237,False,True,False,False,False
18238,False,True,True,False,False
18239,False,False,False,False,False
18240,False,True,False,False,False


In [6]:
# notnull() is the opposite of isnull(): True where a value is present
ufo.notnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,True,False,True,True,True
18237,True,False,True,True,True
18238,True,False,False,True,True
18239,True,True,True,True,True
18240,True,False,True,True,True
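
The relationship between the two masks can be checked on a small example. This is a minimal sketch that assumes a made-up `toy` DataFrame (not the UFO data) so it runs without downloading anything. It also uses `isna()`, which pandas provides as an alias of `isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the UFO reports
toy = pd.DataFrame({
    'City': ['Ithaca', None, 'Holyoke'],
    'Shape Reported': ['TRIANGLE', 'DISK', np.nan],
})

missing = toy.isnull()    # True where a value is missing
present = toy.notnull()   # True where a value is present

# The two masks are exact opposites of each other
masks_are_opposite = missing.equals(~present)

# isna() is an alias of isnull() and returns the same mask
alias_matches = toy.isna().equals(missing)
```

Both `None` and `np.nan` are treated as missing here, which is why the example mixes them on purpose.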


In [14]:
# count the missing values in each column
# sum() treats True as 1 and False as 0, so it counts the True values
ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

In [11]:
# we create a pandas series of booleans
booleans = pd.Series([True, False, True])

In [13]:
# calling sum() on a boolean Series counts the True values
booleans.sum()

# sum() uses axis=0 by default, which is why ufo.isnull().sum()
# adds down each column; the following call is equivalent
# booleans.sum(axis=0)

2
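
The same counting trick extends to rows and to proportions. The sketch below assumes a small hypothetical `toy` DataFrame rather than the UFO data. It relies only on `sum()` and `mean()` treating `True` as 1 and `False` as 0.

```python
import pandas as pd

# Hypothetical toy data: City is missing in two of the four rows
toy = pd.DataFrame({
    'City': ['Ithaca', None, 'Holyoke', None],
    'State': ['NY', 'LA', 'MA', 'NH'],
})
mask = toy.isnull()

per_column = mask.sum()        # axis=0 (default): add down each column
per_row = mask.sum(axis=1)     # axis=1: add across each row
share_missing = mask.mean()    # mean of booleans = fraction of True values
```

Here `per_column['City']` is 2 and `share_missing['City']` is 0.5, since half of the City values are missing.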

In [17]:
# filter with a boolean Series to keep only the rows where City is missing (25 rows in total)
ufo[ufo.City.isnull()]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00
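
Boolean filtering like this can be combined across columns. The following is a sketch on a hypothetical `toy` DataFrame, not the UFO data. It uses bracket notation for `'Shape Reported'` because attribute access such as `ufo.City` only works for column names without spaces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the UFO reports
toy = pd.DataFrame({
    'City': ['Ithaca', None, 'Holyoke', None],
    'Shape Reported': ['TRIANGLE', 'DISK', np.nan, np.nan],
    'State': ['NY', 'LA', 'MA', 'NH'],
})

# Boolean Series: True for the rows where City is missing
city_missing = toy['City'].isnull()

# Passing the mask inside [] keeps only the True rows
rows_without_city = toy[city_missing]

# Masks combine with & (and) and | (or)
both_missing = toy[toy['City'].isnull() & toy['Shape Reported'].isnull()]

# Rows with at least one missing value in any column
any_missing = toy[toy.isnull().any(axis=1)]
```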


**What do we do about the missing values?**

**Method 1: drop missing values**

In [18]:
ufo.shape

(18241, 5)

In [19]:
# drop a row if any of its 5 columns has a missing value
# how='any' is the default, so it can be omitted
ufo.dropna(how='any').shape

(2486, 5)

In [20]:
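# dropna returns a new DataFrame, so ufo itself is unchanged
# pass inplace=True (the boolean, not the string 'true') to modify ufo directly
# an in-place call returns None, so check the shape in a separate step
# ufo.dropna(how='any', inplace=True)
# ufo.shape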
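
The difference between the two calling styles is easy to see on a small example. This sketch assumes a hypothetical two-row `toy` DataFrame. It shows that `dropna` with `inplace=True` returns `None`, which is why chaining `.shape` directly onto such a call raises an `AttributeError`.

```python
import pandas as pd

# Hypothetical toy data: the second row has no City
toy = pd.DataFrame({'City': ['Ithaca', None], 'State': ['NY', 'LA']})

# Without inplace, dropna returns a new DataFrame and leaves toy untouched
cleaned = toy.dropna()
rows_before = toy.shape[0]    # still 2

# With inplace=True the method modifies toy itself and returns None
result = toy.dropna(inplace=True)
rows_after = toy.shape[0]     # now 1
```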

In [21]:
# drop a row only if all of its columns are missing
# no row qualifies here because State and Time are never missing
ufo.dropna(how='all').shape

(18241, 5)

In [23]:
# drop a row if either City or Shape Reported is missing
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

(15576, 5)

In [25]:
# drop row if both City and Shape Reported are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape

(18237, 5)
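
The four `dropna` variants above can be compared side by side on data small enough to check by hand. This is a sketch under the assumption of a hypothetical four-row `toy` DataFrame. The row counts in the comments apply to this toy data only, not to the UFO dataset.

```python
import pandas as pd

# Hypothetical toy data: State is never missing, the other columns sometimes are
toy = pd.DataFrame({
    'City': ['Ithaca', None, None, 'Holyoke'],
    'Colors Reported': [None, None, None, 'RED'],
    'Shape Reported': ['TRIANGLE', None, 'DISK', 'LIGHT'],
    'State': ['NY', 'LA', 'CA', 'MA'],
})
cols = ['City', 'Shape Reported']

# Drop a row if any column is missing: only the last row is complete
any_rows = toy.dropna(how='any').shape[0]                   # 1

# Drop a row only if every column is missing: State keeps all rows alive
all_rows = toy.dropna(how='all').shape[0]                   # 4

# Restrict the check to two columns with subset
either_rows = toy.dropna(subset=cols, how='any').shape[0]   # 2
both_rows = toy.dropna(subset=cols, how='all').shape[0]     # 3
```

None of these calls changes `toy`, because each one returns a new DataFrame.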

**Method 2: Filling missing values**

In [28]:
# this shows missing values (NaN)
ufo['Shape Reported'].value_counts(dropna=False)

LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
PYRAMID         1
FLARE           1
DOME            1
Name: Shape Reported, dtype: int64

In [29]:
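# assign the filled column back to ufo to keep the change
# (inplace=True on a selected column may not update ufo in recent pandas versions)
ufo['Shape Reported'] = ufo['Shape Reported'].fillna(value='VARIOUS')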

In [32]:
# the 2644 missing values are now counted under VARIOUS (333 + 2644 = 2977)
ufo['Shape Reported'].value_counts(dropna=False)

VARIOUS      2977
LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
FLARE           1
DOME            1
PYRAMID         1
Name: Shape Reported, dtype: int64
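
Filling can also be done for several columns in one call. The sketch below assumes a hypothetical `toy` DataFrame. The `'UNKNOWN'` placeholder for City is an illustrative choice, not something taken from the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the UFO reports
toy = pd.DataFrame({
    'City': ['Ithaca', None, 'Holyoke'],
    'Shape Reported': ['TRIANGLE', np.nan, 'VARIOUS'],
})

# Fill one column and assign the result back to keep the change
toy['Shape Reported'] = toy['Shape Reported'].fillna('VARIOUS')
various_count = (toy['Shape Reported'] == 'VARIOUS').sum()

# A dict maps column names to fill values, one per column
filled = toy.fillna({'City': 'UNKNOWN'})
remaining_missing = filled.isnull().sum().sum()
```

After the first fill `various_count` is 2, because the filled value is indistinguishable from the `'VARIOUS'` that was already in the column. The same merging happens in the UFO data above, so a distinct placeholder is worth considering when the two cases need to stay separate.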