# The pandas NA Trap

## Why are some of my values missing, pandas?

More like this over at [datasciencehorrorstories.com](http://www.datasciencehorrorstories.com)

In [1]:
import pandas as pd

Create a fake chemistry-themed DataFrame

In [2]:
df = pd.DataFrame({
    "symbol": ["K", "NA", "S", "O", "B"],
    "name": ["Potassium", "Sodium", "Sulphur", "Oxygen", "Boron"],
})
df

,name,symbol
0,Potassium,K
1,Sodium,NA
2,Sulphur,S
3,Oxygen,O
4,Boron,B


Write it out to csv

In [3]:
df.to_csv("elements.csv", index=False)

Read it back in (it's the exact same DataFrame, remember!)

In [4]:
df2 = pd.read_csv("elements.csv")
df2

,name,symbol
0,Potassium,K
1,Sodium,NaN
2,Sulphur,S
3,Oxygen,O
4,Boron,B


Wait, is that a NaN?

In [5]:
df2.isnull().sum()

name      0
symbol    1
dtype: int64

It **is** a NaN!

In [6]:
%%html
<p style="font-size: 40px; padding: 10px 0px;" >&#x2639;</p>
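
It helps to see where the value actually gets lost. The following is a minimal sketch that uses an in-memory string instead of `elements.csv` (the `io.StringIO` round trip and the variable names are illustrative assumptions, not part of the original notebook). It suggests that the written CSV still contains the text `NA` and that the conversion to NaN happens on read:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "symbol": ["K", "NA"],
    "name": ["Potassium", "Sodium"],
})

# Writing is harmless: the CSV text still contains the literal string "NA"
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading with default settings is where "NA" is turned into NaN
roundtrip = pd.read_csv(io.StringIO(csv_text))
print(roundtrip["symbol"].isna().tolist())
```

So the fix belongs on the `read_csv` side, not the `to_csv` side.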

Q: How do we fix this?

A: Tell pandas not to try to be clever about NA values, by passing `na_filter=False`

In [7]:
df_tryagain = pd.read_csv("elements.csv", na_filter=False)
df_tryagain

,name,symbol
0,Potassium,K
1,Sodium,NA
2,Sulphur,S
3,Oxygen,O
4,Boron,B


Looks good!

In [8]:
df_tryagain.isnull().sum()

name      0
symbol    0
dtype: int64

In [9]:
%%html
<p style="font-size: 50px; padding: 10px 0px;" >&#x263A;</p>
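
One caveat: `na_filter=False` switches off missing-value detection entirely, so genuinely empty cells come back as empty strings too. The sketch below compares it with a narrower option, assuming a hypothetical extra `mass` column with one blank entry (that column and the in-memory CSV are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical data: sodium's mass is left blank on purpose
csv_text = (
    "symbol,name,mass\n"
    "K,Potassium,39.1\n"
    "NA,Sodium,\n"
    "S,Sulphur,32.1\n"
)

# na_filter=False: nothing is treated as missing, so the blank stays ""
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(repr(raw.loc[1, "mass"]))

# Narrower: drop the default NA list but still treat empty fields as missing
narrow = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)
print(narrow)
```

With the narrower settings the symbol `NA` survives as text, while the blank mass is still read as NaN and the column stays numeric.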

## Final Thoughts

### So what values does pandas filter out?

From the pandas documentation:

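By default the following values are interpreted as NaN: `''`, `'#N/A'`, `'#N/A N/A'`, `'#NA'`, `'-1.#IND'`, `'-1.#QNAN'`, `'-NaN'`, `'-nan'`, `'1.#IND'`, `'1.#QNAN'`, `'N/A'`, `'NA'`, `'NULL'`, `'NaN'`, `'nan'`.

The exact list depends on your pandas version; newer releases also recognise strings such as `'<NA>'`, `'n/a'`, `'null'` and `'None'`, so check the `read_csv` documentation for the version you have installed.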
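
A quick way to check this on your own installation is to feed a few candidate strings through `read_csv` and see which come back as NaN. This is only a sketch; the token list and the `missing` control value are arbitrary choices for illustration:

```python
import io

import pandas as pd

# A few strings from the default list, plus one that is not on it
tokens = ["NA", "N/A", "NULL", "nan", "missing"]
csv_text = "value\n" + "\n".join(tokens) + "\n"

parsed = pd.read_csv(io.StringIO(csv_text))

# Map each original token to whether pandas read it as NaN
result = dict(zip(tokens, parsed["value"].isna().tolist()))
print(result)
```

Only `missing` should come back as an ordinary string here; the rest are swallowed by the default filter.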

### What if I have my own codes for missing values?

You can also do the opposite, i.e. tell pandas to treat **more** values as NaN by passing them in `na_values`. Whether the default list above is applied as well is controlled by `keep_default_na` (it is `True` unless you say otherwise).

In [10]:
df_nooxygen = pd.read_csv("elements.csv", na_values=["O"])
df_nooxygen

,name,symbol
0,Potassium,K
1,Sodium,NaN
2,Sulphur,S
3,Oxygen,NaN
4,Boron,B


And if we want to keep our custom code `"O"` as missing but have the literal string `"NA"` (our sodium) read back in as text:

In [11]:
df_sodium_is_ok = pd.read_csv("elements.csv", na_values=["O"], keep_default_na=False)
df_sodium_is_ok

,name,symbol
0,Potassium,K
1,Sodium,NA
2,Sulphur,S
3,Oxygen,NaN
4,Boron,B
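
`na_values` also accepts a dict, which limits a missing-value code to specific columns. The sketch below assumes a hypothetical `mass` column where `-` means "unknown" (both the column and the code are invented for illustration), and leaves every other column untouched:

```python
import io

import pandas as pd

# Hypothetical data: "-" marks an unknown atomic mass
csv_text = (
    "symbol,name,mass\n"
    "K,Potassium,39.1\n"
    "NA,Sodium,-\n"
    "O,Oxygen,16.0\n"
)

per_column = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"mass": ["-"]},  # applies to the mass column only
    keep_default_na=False,      # do not apply the default NA list
)
print(per_column)
print(per_column.isnull().sum())
```

Here both `NA` and `O` stay in the `symbol` column as ordinary strings, and only the `-` in `mass` is read as NaN.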
