# Operations & Missing Values

In [1]:
import pandas as pd
import numpy as np

In [13]:
rng = np.random.RandomState(47)

In [6]:
x = pd.Series([1.,2.,3.], index=[0,1,2])
y = pd.Series([4.,5.,6.], index=[1,2,3])

Pandas uses `NaN` to mark the result of an operation at any index label where a value is undefined or null, for example a label that exists in only one of the operands.

In [7]:
x + y

0    NaN
1    6.0
2    8.0
3    NaN
dtype: float64

The `add` method performs the same addition but accepts a `fill_value` argument, which is substituted for a value that is missing from one of the operands before adding.

In [11]:
x.add(y, fill_value=0.0)

0    1.0
1    6.0
2    8.0
3    6.0
dtype: float64
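
The other arithmetic methods accept the same `fill_value` argument. A minimal sketch using `sub` and `mul` on the two Series from above (redefined here so the block runs on its own); the fill values 0.0 and 1.0 are chosen only for illustration:

```python
import pandas as pd

x = pd.Series([1., 2., 3.], index=[0, 1, 2])
y = pd.Series([4., 5., 6.], index=[1, 2, 3])

# Labels 0 and 3 exist in only one Series, so fill_value stands in for the missing side.
diff = x.sub(y, fill_value=0.0)   # label 0: 1 - 0, label 3: 0 - 6
prod = x.mul(y, fill_value=1.0)   # label 0: 1 * 1, label 3: 1 * 6

print(diff)
print(prod)
```

A neutral fill value (0 for addition and subtraction, 1 for multiplication) leaves the value that is present unchanged apart from its sign in the subtraction case.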

In [15]:
mat_a = pd.DataFrame(rng.randint(0,10,(2,2)))
mat_b = pd.DataFrame(rng.randint(0,10,(3,3)))

In [16]:
mat_a

   0  1
0  8  3
1  0  7


In [17]:
mat_b

   0  1  2
0  0  7  7
1  1  7  2
2  2  1  7


In [18]:
mat_a + mat_b

     0     1    2
0  8.0  10.0  NaN
1  1.0  14.0  NaN
2  NaN   NaN  NaN


Any index or column label that is present in only one of the two DataFrames produces `NaN` in the result. We can handle those missing values with the `fill_value` argument of the `add` method.  
In this case we will use the mean of `mat_a` to fill the missing values.

In [23]:
fill_value = mat_a.stack().mean()
fill_value

4.5

This call substitutes the mean of `mat_a` (4.5) for every cell that is missing from `mat_a`, so the sum contains no `NaN` values.  
In effect, `mat_a` is treated as a 3x3 matrix whose row 2 and column 2 hold the mean value.

In [26]:
new_mat = mat_a.add(mat_b, fill_value=fill_value)
new_mat

     0     1     2
0  8.0  10.0  11.5
1  1.0  14.0   6.5
2  6.5   5.5  11.5
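
One caveat is that `fill_value` only applies where exactly one of the two operands is missing a cell. A minimal sketch, using two small hypothetical frames (`left` and `right`, not part of the notebook) whose labels do not overlap at all:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame([[1.0, 2.0]], index=[0], columns=["x", "y"])
right = pd.DataFrame([[10.0]], index=[1], columns=["z"])

# The result covers the union of the labels: rows 0-1, columns x, y, z.
total = left.add(right, fill_value=0.0)
print(total)

# Cells present in one frame are filled on the other side;
# cells missing from both frames stay NaN.
print(np.isnan(total.loc[0, "z"]))
```

In the example above this situation does not arise, because `mat_b` defines every cell of the 3x3 result.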


## Operations on DataFrames

In [28]:
new_mat - mat_b

     0    1    2
0  8.0  3.0  4.5
1  0.0  7.0  4.5
2  4.5  4.5  4.5


In [29]:
new_mat * mat_b

      0     1     2
0   0.0  70.0  80.5
1   1.0  98.0  13.0
2  13.0   5.5  80.5


Subtracting a row from a DataFrame is applied row-wise: the Series is aligned with the column labels and subtracted from every row. Here `mat_b.iloc[0]` is `[0, 7, 7]`.

In [34]:
new_mat - mat_b.iloc[0]

     0     1     2
0  8.0   3.0   4.5
1  1.0   7.0  -0.5
2  6.5  -1.5   4.5
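
To subtract a column instead of a row, the `sub` method takes an `axis` argument. A small sketch on a frame with the same values as `mat_b` (named `mat` here so the block is self-contained):

```python
import pandas as pd

mat = pd.DataFrame([[0, 7, 7], [1, 7, 2], [2, 1, 7]])

# Default: the Series index is matched against the column labels (row-wise).
by_row = mat - mat.iloc[0]

# axis=0: the Series index is matched against the row labels (column-wise).
by_col = mat.sub(mat[0], axis=0)

print(by_row)
print(by_col)
```

In `by_row` the first row becomes all zeros, while in `by_col` the first column becomes all zeros.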


Let's multiply the matrix `mat_b` by a vector.  
This is a matrix-vector product: `(3x3) @ (3x1) => 3x1`

In [35]:
a = pd.Series(rng.randint(0,5,3))
a

0    3
1    4
2    0
dtype: int64

In [36]:
mat_b @ a

0    28
1    31
2    10
dtype: int64

To multiply each row of the matrix element-wise by the vector instead, use the `*` operator rather than `@`.

In [37]:
mat_b * a

   0   1  2
0  0  28  0
1  3  28  0
2  6   4  0
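
The two operators are related: summing the element-wise product along each row gives the matrix-vector product. A short sketch with the same values as `mat_b` and `a` (renamed `mat` and `vec` so the block runs on its own):

```python
import pandas as pd

mat = pd.DataFrame([[0, 7, 7], [1, 7, 2], [2, 1, 7]])
vec = pd.Series([3, 4, 0])

matvec = mat @ vec     # matrix-vector product, one value per row
same = mat.dot(vec)    # method form of the @ operator
scaled = mat * vec     # column j of every row is multiplied by vec[j]

# Summing the element-wise product across each row recovers the product above.
row_sums = scaled.sum(axis=1)

print(matvec)
print(row_sums)
```

Both `@` and `*` align the Series index with the DataFrame's column labels, so the labels must match for the result to be meaningful.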


## Missing values handling

In [44]:
exa_arr = [0, None, 1]
values = np.array(exa_arr)
values.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Because the list contains `None`, NumPy builds an array of Python objects, and adding `None` to an integer is undefined, so `sum()` raises an error.
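
The array's dtype shows why this happens. A small sketch comparing a list that holds `None` with one that holds `np.nan` (the names `with_none` and `with_nan` are only illustrative):

```python
import numpy as np

with_none = np.array([0, None, 1])
with_nan = np.array([0, np.nan, 1])

# None cannot be stored as a number, so NumPy falls back to generic Python objects.
print(with_none.dtype)

# np.nan is itself a float, so the array keeps a numeric dtype.
print(with_nan.dtype)
```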

In [49]:
exa_arr = [0, np.nan, 1]
values = np.array(exa_arr)
values.sum()

nan

NumPy treats `nan` as a regular floating-point value: it does not raise an error, but any arithmetic involving `nan` returns `nan`.

NumPy can also ignore `nan` values through its NaN-aware functions such as `np.nansum`.

In [51]:
np.nansum(values)

1.0
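
`np.nansum` has siblings for other aggregations. A brief sketch on the same values, showing `np.nanmean`, `np.nanmax`, and the `np.isnan` mask that locates the missing entries:

```python
import numpy as np

values = np.array([0, np.nan, 1])

print(np.mean(values))     # nan propagates through the regular mean
print(np.nanmean(values))  # mean of the non-NaN entries only
print(np.nanmax(values))   # max of the non-NaN entries only
print(np.isnan(values))    # boolean mask marking the NaN positions
```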

Pandas, on the other hand, converts `None` to `NaN` in a numeric Series and skips missing values in aggregations by default.

In [45]:
ser = pd.Series([0, None, 1])
ser.sum()

1.0

In [46]:
ser = pd.Series([0, np.nan, 1])
ser.sum()

1.0
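
The skipping behavior is controlled by the `skipna` argument of the aggregation methods. A minimal sketch on the same Series:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, None, 1])

print(ser.dtype)              # None was stored as NaN, so the Series is float64
print(ser.sum())              # default skipna=True: the missing value is skipped
print(ser.sum(skipna=False))  # the missing value propagates, as in NumPy
print(ser.count())            # number of non-missing entries
```

Passing `skipna=False` is useful when a missing value should invalidate the result rather than be silently ignored.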

### Detecting null values

The `isnull()` and `notnull()` methods check for null values. The result is a boolean Series.

In [53]:
ser.isnull()

0    False
1     True
2    False
dtype: bool

We can use this boolean mask to select only the rows that have no missing values.

In [54]:
ser[ser.notnull()]

0    0.0
2    1.0
dtype: float64
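
The same methods work on a DataFrame, where they are often combined with `sum`, `any`, or `all`. A sketch on a small hypothetical frame (`frame` is not part of the notebook):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = frame.isnull().sum()

# Rows with at least one missing value, and rows with none.
rows_with_missing = frame[frame.isnull().any(axis=1)]
complete_rows = frame[frame.notnull().all(axis=1)]

print(missing_per_column)
print(complete_rows)
```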

### Drop/Fill missing values
Another way to handle missing values is to **drop** or **fill** them using the following methods:  
1. `dropna()`
2. `fillna()`

By default, calling `dropna()` on a DataFrame removes every row that contains at least one missing value.

In [64]:
df = pd.DataFrame(rng.randint(0,10,(2,2)))
df.iloc[0,0] = np.nan
df

     0  1
0  NaN  6
1  6.0  1


In [65]:
df.dropna()

     0  1
1  6.0  1


We can also remove whole columns instead by specifying the `axis` argument.

In [66]:
df.dropna(axis=1)

   1
0  6
1  1


If we do not want to delete important data because of a few NaN values, we can use the `how` argument to control which rows/columns are removed. With `how='all'`, a row or column is dropped only if all of its values are missing.

In [67]:
df.dropna(axis=1, how='all')

     0  1
0  NaN  6
1  6.0  1


In [68]:
df_1 = pd.DataFrame(rng.randint(0,10,(2,2)))
df_1[0] = np.nan

df_1

     0  1
0  NaN  3
1  NaN  8


In [70]:
df_1.dropna(axis=1, how='all')

   1
0  3
1  8
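
Besides `how`, `dropna()` also accepts `thresh` (the minimum number of non-missing values a row must have to be kept) and `subset` (the columns to inspect). A sketch on a hypothetical three-column frame, not taken from the notebook:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, 6.0, np.nan],
})

any_missing = frame.dropna()              # drop rows with at least one NaN
all_missing = frame.dropna(how="all")     # drop rows where every value is NaN
at_least_two = frame.dropna(thresh=2)     # keep rows with 2 or more non-NaN values
only_c = frame.dropna(subset=["c"])       # look at column "c" only

print(all_missing)
print(at_least_two)
```

Row 1 has a single valid value, so it survives `how="all"` and `subset=["c"]` but not `thresh=2`.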


### Filling missing values

In [74]:
df

     0  1
0  NaN  6
1  6.0  1


In [75]:
df.fillna(0)

     0  1
0  0.0  6
1  6.0  1
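
A single constant is not the only option: `fillna()` also accepts a Series or a dict keyed by column label, so each column can get its own fill value. A sketch on a hypothetical frame, filling with the column means and with hand-picked per-column values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]})

# frame.mean() is a Series indexed by column label; NaN is skipped when averaging.
by_mean = frame.fillna(frame.mean())

# A dict maps each column label to its own fill value.
by_dict = frame.fillna({"a": 0.0, "b": -1.0})

print(by_mean)
print(by_dict)
```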


Another interesting way to fill a missing value is to propagate the previous valid value forward (forward fill).

In [82]:
df_2 = pd.DataFrame(rng.randint(0,10,(4,4)))
df_2.iloc[1,2] = np.nan
df_2.iloc[2,0] = np.nan
df_2

     0  1    2  3
0  8.0  8  8.0  6
1  5.0  8  NaN  2
2  NaN  9  7.0  4
3  8.0  2  1.0  2


In [83]:
df_2.ffill()  # same result as fillna(method='ffill'), which recent pandas versions deprecate

     0  1    2  3
0  8.0  8  8.0  6
1  5.0  8  8.0  2
2  5.0  9  7.0  4
3  8.0  2  1.0  2


Like `ffill`, there is also `bfill`, which fills a missing value with the next valid value.

In [84]:
df_2.bfill()  # same result as fillna(method='bfill'), which recent pandas versions deprecate

     0  1    2  3
0  8.0  8  8.0  6
1  5.0  8  7.0  2
2  8.0  9  7.0  4
3  8.0  2  1.0  2
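
Two details are worth checking when filling this way: a missing value with no valid neighbor in the fill direction stays missing, and the `limit` argument caps how many consecutive gaps are filled. A sketch on a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # the leading NaN has no previous value and stays NaN
limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap
backward = s.bfill()         # every NaN here has a later valid value
both = s.ffill().bfill()     # forward fill, then back-fill whatever is left

print(forward)
print(limited)
print(both)
```

Chaining the two fills, as in `both`, is one way to make sure no gaps remain at either end.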
