**HANDLING MISSING DATA**

**IMPORT NUMPY AND PANDAS**

In [None]:
import numpy as np 
import pandas as pd 


**NONE : PYTHONIC MISSING DATA**

**None: None is a Python singleton object that is often used for missing data in Python code.**

In [None]:
vals = np.array([3, None, 8, 7]) 
vals


array([3, None, 8, 7], dtype=object)

**AGGREGATING OVER NONE RAISES AN ERROR**

In [None]:
vals.sum() 


TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
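The failure above can be reproduced in a small sketch. Because the array contains None, NumPy stores generic Python objects (dtype=object), and the sum falls back to Python-level addition, which is undefined for an int and None. The names `vals_obj` and `error_message` are illustrative and not part of the notebook.

```python
import numpy as np

# None forces NumPy to store generic Python objects instead of native numbers.
vals_obj = np.array([3, None, 8, 7])
print(vals_obj.dtype)

# The aggregation adds elements one by one and fails on int + None.
try:
    vals_obj.sum()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```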

**MISSING NUMERICAL DATA**

**Because NaN is a floating-point value, NumPy stores any array that contains it as float64: the integers in the array are upcast to floating-point numbers instead of being kept as integers.**

**NaN : NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.**

In [None]:
vals1 = np.array([5, np.nan, 8, 7]) 
vals1.dtype

dtype('float64')
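A short sketch of how NaN behaves in arithmetic may help here: any operation involving NaN returns NaN, so ordinary aggregates over the array are NaN as well. The names `vals_nan`, `total`, and `mask` are illustrative.

```python
import numpy as np

vals_nan = np.array([5, np.nan, 8, 7])

# Arithmetic with NaN always yields NaN, so a plain sum is NaN too.
total = vals_nan.sum()
print(1 + np.nan, 0 * np.nan, total)

# NaN is not equal to itself, so detect it with np.isnan rather than ==.
mask = np.isnan(vals_nan)
print(mask)
```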

**NumPy provides special aggregations that ignore these missing values:**

In [None]:
np.nansum(vals1), np.nanmin(vals1), np.nanmax(vals1)

(20.0, 5.0, 8.0)

**NaN AND NONE IN PANDAS**

In [None]:
pd.Series([3, np.nan, None,4])

0    3.0
1    NaN
2    NaN
3    4.0
dtype: float64

In [None]:
Z = pd.Series(range(3), dtype=int)
Z

0    0
1    1
2    2
dtype: int64

In [None]:
Z[2] = None
Z

0    0.0
1    1.0
2    NaN
dtype: float64
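The upcast to float64 shown above is the default behavior. As a hedged alternative, pandas also offers a nullable integer extension dtype, requested with the string "Int64" (capital I), which keeps the values as integers and marks the gap with pd.NA. The name `z_nullable` is illustrative.

```python
import pandas as pd

# "Int64" is the nullable integer extension dtype, distinct from NumPy's int64.
z_nullable = pd.Series([0, 1, None], dtype="Int64")

print(z_nullable)
print(z_nullable.dtype)
print(z_nullable.isna().tolist())
```

This is one option among several; the float64 upcast remains what pandas does when no dtype is requested.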

**OPERATING ON NULL VALUES**

**isnull(): Generate a boolean mask indicating missing values**

In [None]:
data = pd.Series([3, np.nan, None,'hello'])

In [None]:
data.isnull()

0    False
1     True
2     True
3    False
dtype: bool
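Since the mask is boolean, it can also be aggregated. The sketch below counts the missing entries and computes their share; `data_demo`, `n_missing`, and `frac_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

data_demo = pd.Series([3, np.nan, None, 'hello'])

# True counts as 1, so summing the mask counts the missing entries.
n_missing = data_demo.isnull().sum()

# The mean of the mask is the fraction of entries that are missing.
frac_missing = data_demo.isnull().mean()

print(n_missing, frac_missing)
```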

**notnull(): Opposite of isnull()**

In [None]:
data[data.notnull()]

0        3
3    hello
dtype: object

**dropna(): Return a filtered version of the data**

In [None]:
data.dropna()

0        3
3    hello
dtype: object

In [None]:
df = pd.DataFrame([[1,np.nan,3],
                   [3,4,np.nan],
                   [4,5,2]])
df

   0    1    2
0  1  NaN  3.0
1  3  4.0  NaN
2  4  5.0  2.0


In [None]:
df[3] = np.nan
df

   0    1    2   3
0  1  NaN  3.0 NaN
1  3  4.0  NaN NaN
2  4  5.0  2.0 NaN


In [None]:
df.dropna(axis='columns', how='all')

   0    1    2
0  1  NaN  3.0
1  3  4.0  NaN
2  4  5.0  2.0
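To make the role of the how parameter concrete, the sketch below rebuilds the same frame under the illustrative name `df_demo` and compares the default how='any' with how='all'.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1, np.nan, 3],
                        [3, 4, np.nan],
                        [4, 5, 2]])
df_demo[3] = np.nan  # a column that is entirely missing

# Default how='any' on rows: every row has a NaN in column 3, so none survive.
rows_any = df_demo.dropna()

# how='any' on columns: only column 0 has no missing values.
cols_any = df_demo.dropna(axis='columns')

# how='all' on columns: only the fully missing column 3 is dropped.
cols_all = df_demo.dropna(axis='columns', how='all')

print(len(rows_any), list(cols_any.columns), list(cols_all.columns))
```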


**thresh: thresh takes an integer giving the minimum number of non-null values a row or column must contain to be kept; rows or columns with fewer non-null values are dropped.**

In [None]:
df.dropna(axis='rows', thresh=3)

   0    1    2   3
2  4  5.0  2.0 NaN
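One way to check what thresh will keep is to count the non-null values per row or column first. The sketch below does this on a copy of the frame named `df_demo` (an illustrative name).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1, np.nan, 3],
                        [3, 4, np.nan],
                        [4, 5, 2]])
df_demo[3] = np.nan

# Number of non-null values in each row and in each column.
non_null_per_row = df_demo.notnull().sum(axis=1)
non_null_per_col = df_demo.notnull().sum(axis=0)

# Keep rows with at least 3 non-null values, columns with at least 2.
kept_rows = df_demo.dropna(axis='rows', thresh=3)
kept_cols = df_demo.dropna(axis='columns', thresh=2)

print(non_null_per_row.tolist(), non_null_per_col.tolist())
print(list(kept_rows.index), list(kept_cols.columns))
```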


**fillna(): Return a copy of the data with missing values filled or imputed**

In [None]:
data = pd.Series([1, np.nan, 2, 8, None], index=list('smvec'))
data

s    1.0
m    NaN
v    2.0
e    8.0
c    NaN
dtype: float64

**We can fill NA entries with a single value. Zero is a common choice; here we use 89 so the filled entries are easy to spot:**

In [None]:
data.fillna(89)

s     1.0
m    89.0
v     2.0
e     8.0
c    89.0
dtype: float64
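A constant is only one possible fill value. As a sketch, the missing entries could instead be replaced by a statistic computed from the observed data, such as the mean; whether that is appropriate depends on the dataset. The names `data_demo`, `mean_value`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

data_demo = pd.Series([1, np.nan, 2, 8, None], index=list('smvec'))

# Series.mean skips NaN by default, so this is the mean of 1, 2 and 8.
mean_value = data_demo.mean()

# Replace each missing entry with that mean.
filled = data_demo.fillna(mean_value)

print(mean_value)
print(filled)
```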

**We can specify a forward-fill to propagate the previous value forward:**

In [None]:
data.ffill()  # same result as fillna(method='ffill'), which is deprecated since pandas 2.1

s    1.0
m    1.0
v    2.0
e    8.0
c    8.0
dtype: float64

**Or we can specify a back-fill to propagate the next value backward:**

In [None]:
data.bfill()  # same result as fillna(method='bfill'), which is deprecated since pandas 2.1

s    1.0
m    2.0
v    2.0
e    8.0
c    NaN
dtype: float64
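The last entry stays NaN after a back-fill because there is no later value to copy. The sketch below shows two related options, under the illustrative names `both` and `limited`: chaining a forward-fill after the back-fill to cover the trailing gap, and using the limit parameter to cap how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

data_demo = pd.Series([1, np.nan, 2, 8, None], index=list('smvec'))

# Back-fill alone leaves the trailing NaN in place.
back_filled = data_demo.bfill()

# Chaining a forward-fill fills the trailing gap from the previous value.
both = data_demo.bfill().ffill()

# limit=1 fills at most one consecutive NaN after each valid value.
limited = pd.Series([1, np.nan, np.nan, 4]).ffill(limit=1)

print(back_filled.tolist(), both.tolist(), limited.tolist())
```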

In [None]:
df

   0    1    2   3
0  1  NaN  3.0 NaN
1  3  4.0  NaN NaN
2  4  5.0  2.0 NaN


In [None]:
df.ffill(axis=1)  # same result as fillna(method='ffill', axis=1), which is deprecated since pandas 2.1

     0    1    2    3
0  1.0  1.0  3.0  3.0
1  3.0  4.0  4.0  4.0
2  4.0  5.0  2.0  2.0


**axis=1 (or axis='columns') runs horizontally across the columns, so a fill with axis=1 moves along each row from one column to the next. The same convention applies to the pandas method drop: axis=1 removes columns, and axis=0 (or axis='index') removes rows from the dataset.**
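The axis convention for drop can be checked with a small sketch. The frame `df_demo` and the result names are illustrative; the label 1 happens to exist both as a row label and as a column label in this frame.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1, np.nan, 3],
                        [3, 4, np.nan],
                        [4, 5, 2]])

# axis=1 looks up the label among the columns and removes that column.
without_col = df_demo.drop(1, axis=1)

# axis=0 looks up the label in the index and removes that row.
without_row = df_demo.drop(1, axis=0)

print(list(without_col.columns), list(without_row.index))
```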

In [None]:
df.bfill(axis=1)  # same result as fillna(method='bfill', axis=1), which is deprecated since pandas 2.1

     0    1    2   3
0  1.0  3.0  3.0 NaN
1  3.0  4.0  NaN NaN
2  4.0  5.0  2.0 NaN


In [None]:
df.bfill(axis=0)  # same result as fillna(method='bfill', axis=0), which is deprecated since pandas 2.1

   0    1    2   3
0  1  4.0  3.0 NaN
1  3  4.0  2.0 NaN
2  4  5.0  2.0 NaN
