This section covers tools for handling missing data, duplicate data, string manipulation, and other analytical data transformations.

### Handling Missing Data

By default, all descriptive statistics on pandas objects exclude missing data. pandas marks missing values with the floating-point value NaN (Not a Number), a *sentinel value* that is easy to detect.
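
As a small illustration of that default, the sketch below computes a mean with and without skipping NaN. The Series `values` is made up for this example and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the notebook.
values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped: (1.0 + 3.0) / 2
mean_strict = values.mean(skipna=False)  # NaN propagates into the result
non_missing = values.count()             # counts only the non-NA entries

print(mean_default, mean_strict, non_missing)
```

Passing `skipna=False` is one way to make a statistic fail loudly (return NaN) when any value is missing.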

In [2]:
import pandas as pd
import numpy as np

a_series = pd.Series(['artichoke', 'letucce', np.nan, 'cilantro'])

a_series

0    artichoke
1      letucce
2          NaN
3     cilantro
dtype: object

In [3]:
a_series.isnull()

0    False
1    False
2     True
3    False
dtype: bool

When cleaning data for analysis, it is often worth analyzing the missing data itself to identify data collection problems or potential biases caused by the gaps.
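
One possible way to start such an analysis is to count how much is missing and where. The sketch below assumes a hypothetical `survey` DataFrame whose column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the columns and values are invented.
survey = pd.DataFrame({
    "age": [34, np.nan, 29, 51],
    "income": [np.nan, np.nan, 42000.0, 58000.0],
    "city": ["Oslo", "Lima", None, "Pune"],
})

missing_per_column = survey.isnull().sum()   # number of NA values in each column
missing_fraction = survey.isnull().mean()    # share of NA values in each column
rows_with_missing = survey[survey.isnull().any(axis=1)]  # rows with at least one NA

print(missing_per_column)
print(missing_fraction)
```

A column with a much higher missing share than the others, such as `income` here, is a natural place to look for a collection problem.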

The built-in Python *None* value is also treated as NA in object arrays.
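
A minimal check of that behavior, using an illustrative Series that mixes `None` and `np.nan`:

```python
import numpy as np
import pandas as pd

# Illustrative data: one None and one NaN among strings.
with_none = pd.Series(["artichoke", None, np.nan, "cilantro"])

mask = with_none.isnull()  # True for both None and NaN
print(mask.tolist())
```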

dropna() returns a new object with the NA values removed.

In [4]:
a_series.dropna()

0    artichoke
1      letucce
3     cilantro
dtype: object

In [13]:
df = pd.DataFrame([[1, 4, 2, 8],
                   ['asdf', np.nan, 'sdf', 'df'],
                   [9, 0, np.nan, 3]])

df

Unnamed: 0,0,1,2,3
0,1,4.0,2,8
1,asdf,,sdf,df
2,9,0.0,,3


In [16]:
cleaned = df.dropna() # on a DataFrame, dropna() drops every row that contains at least one NaN

In [15]:
cleaned

Unnamed: 0,0,1,2,3
0,1,4.0,2,8


dropna(how='all') drops only the rows in which every value is NA. With axis=1, the same rule applies to columns instead of rows.

In [17]:
df[2] = np.nan


In [18]:
df

Unnamed: 0,0,1,2,3
0,1,4.0,,8
1,asdf,,,df
2,9,0.0,,3


In [19]:
df.dropna(how='all', axis=1) # drops only the columns in which every value is NaN

Unnamed: 0,0,1,3
0,1,4.0,8
1,asdf,,df
2,9,0.0,3
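
The example above applies how='all' to columns. For the row-wise case, the following sketch compares the default how='any' with how='all' on a small frame built for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 is entirely NA, row 2 is only partly NA.
frame = pd.DataFrame([[1.0, 2.0],
                      [np.nan, np.nan],
                      [np.nan, 5.0]])

any_dropped = frame.dropna()           # default how='any': drops rows 1 and 2
all_dropped = frame.dropna(how="all")  # drops only row 1

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
```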


In [27]:
df2 = pd.DataFrame(np.random.randn(6, 7))

df2

Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,0.991239,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,-0.759632,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,-0.243128,0.433749,-0.619012,0.317007
3,0.98694,-1.001859,0.899056,1.379279,0.24188,-1.571206,-0.501158
4,0.320348,-0.660196,-1.892148,1.822202,1.313109,-0.278788,0.655076
5,0.16226,-0.176792,0.094787,-2.543953,0.589935,-1.037856,-0.078592


In [34]:
df2.iloc[:4, 3] = np.nan

df2

Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,,0.433749,-0.619012,0.317007
3,0.98694,-1.001859,0.899056,,0.24188,-1.571206,-0.501158
4,0.320348,-0.660196,-1.892148,1.822202,1.313109,-0.278788,0.655076
5,0.16226,-0.176792,0.094787,-2.543953,0.589935,-1.037856,-0.078592


In [35]:
df2.iloc[3:, 4:] = np.nan

df2

Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,,0.433749,-0.619012,0.317007
3,0.98694,-1.001859,0.899056,,,,
4,0.320348,-0.660196,-1.892148,1.822202,,,
5,0.16226,-0.176792,0.094787,-2.543953,,,


In [36]:
df2.dropna() # every row now contains at least one NaN, so no rows remain

Unnamed: 0,0,1,2,3,4,5,6


In [41]:
df2.dropna(thresh=5) # keeps only the rows with at least 5 non-NA values

Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,,0.433749,-0.619012,0.317007
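
To make the meaning of thresh concrete, the sketch below counts the non-NA values per row and shows that filtering on that count gives the same rows as dropna(thresh=...). The frame is illustrative and smaller than df2.

```python
import numpy as np
import pandas as pd

# Illustrative frame with 4, 3, and 1 non-NA values per row.
frame = pd.DataFrame([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, np.nan],
    [1.0, np.nan, np.nan, np.nan],
])

non_na_per_row = frame.count(axis=1)   # number of non-NA values in each row
by_thresh = frame.dropna(thresh=3)     # keep rows with at least 3 non-NA values
by_mask = frame[non_na_per_row >= 3]   # the same filter written as a boolean mask

print(non_na_per_row.tolist())
print(by_thresh.index.tolist())
```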


Instead of dropping missing values, fillna() replaces them with a given value.

In [42]:
df2.fillna(2)

Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,2.0,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,2.0,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,2.0,0.433749,-0.619012,0.317007
3,0.98694,-1.001859,0.899056,2.0,2.0,2.0,2.0
4,0.320348,-0.660196,-1.892148,1.822202,2.0,2.0,2.0
5,0.16226,-0.176792,0.094787,-2.543953,2.0,2.0,2.0


Passing a dict to fillna() fills each column's NAs with a different value. Columns that are not listed in the dict are left unchanged.

In [43]:
df2.fillna({3: 6.8, 4: 0})

Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,6.8,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,6.8,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,6.8,0.433749,-0.619012,0.317007
3,0.98694,-1.001859,0.899056,6.8,0.0,,
4,0.320348,-0.660196,-1.892148,1.822202,0.0,,
5,0.16226,-0.176792,0.094787,-2.543953,0.0,,
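
Per-column fill values do not have to be written out by hand. As one possible variation, not shown in the original notebook, fillna() also accepts a Series indexed by column label, so each column can be filled with its own mean.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names "a" and "b" are made up.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 4.0, 8.0]})

column_means = frame.mean()          # Series indexed by column label; NaN is skipped
filled = frame.fillna(column_means)  # each column's NAs get that column's mean

print(filled)
```

Whether a mean is a sensible replacement depends on the data, so this is a sketch of the mechanism and not a recommendation.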


fillna() returns a new object by default, but passing inplace=True modifies the existing object instead.

In [48]:
df.fillna(1, inplace=True) # replaces NA in place and returns None

In [46]:
df

Unnamed: 0,0,1,2,3
0,1,4.0,1.0,8
1,asdf,1.0,1.0,df
2,9,0.0,1.0,3
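
For contrast with the in-place call, the following sketch shows the default behavior on an illustrative Series: the original object keeps its NaN and the filled values live in a new object.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the notebook.
original = pd.Series([1.0, np.nan, 3.0])

filled = original.fillna(0)  # new object; `original` is left untouched

print(original.isnull().sum())  # the NaN is still present in `original`
print(filled.tolist())
```

Reassigning the result, as in `original = original.fillna(0)`, is a common alternative to inplace=True.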


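Forward filling propagates the last valid value in each column into the NAs that follow it. NAs with no valid value before them, as in column 3, stay missing.

In [55]:
df2.ffill() # same result as fillna(method='ffill'), which recent pandas versions deprecate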


Unnamed: 0,0,1,2,3,4,5,6
0,-0.052232,0.622271,0.913073,,-0.663992,-1.609501,-1.790384
1,-0.922147,0.077835,-1.170292,,-0.321846,0.817847,-0.642239
2,-0.919701,-2.264835,0.076041,,0.433749,-0.619012,0.317007
3,0.98694,-1.001859,0.899056,,0.433749,-0.619012,0.317007
4,0.320348,-0.660196,-1.892148,1.822202,0.433749,-0.619012,0.317007
5,0.16226,-0.176792,0.094787,-2.543953,0.433749,-0.619012,0.317007
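
Two related options are worth knowing alongside forward filling: a limit on how many consecutive NAs get filled, and backward filling. The sketch below uses an illustrative Series and the ffill() and bfill() methods.

```python
import numpy as np
import pandas as pd

# Illustrative data: a leading NaN and a gap of two NaNs.
series = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = series.ffill()                 # gap filled with 1.0; the leading NaN stays
forward_limited = series.ffill(limit=1)  # fills at most one consecutive NaN per gap
backward = series.bfill()                # fills each NaN with the next valid value

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```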
