# The Tale of the Missing Groups
## Where are you hiding my NaNs, `pandas`?

More like this over at [datasciencehorrorstories.com](http://datasciencehorrorstories.com)

In [2]:
import pandas as pd
import numpy as np

Let's assume our data is a web server log of people viewing pages on our Pet Salon website.

Our simplified dataset will contain an IP address and a page url, meaning a person from that IP address has viewed that page.

In [9]:
df = pd.DataFrame({
        "ip": ["0.0.0.1", "0.0.0.1", "0.0.0.1", "0.0.0.2", "0.0.0.2", np.nan, "0.0.0.2", "0.0.0.3"],
        "page": ["/home", "/home", "/cat-haircuts", "/home", "/login", "/dog-shampoos", "/login", "/profile"]
    })
df

|   | ip      | page          |
|---|---------|---------------|
| 0 | 0.0.0.1 | /home         |
| 1 | 0.0.0.1 | /home         |
| 2 | 0.0.0.1 | /cat-haircuts |
| 3 | 0.0.0.2 | /home         |
| 4 | 0.0.0.2 | /login        |
| 5 | NaN     | /dog-shampoos |
| 6 | 0.0.0.2 | /login        |
| 7 | 0.0.0.3 | /profile      |


We want to count how many pages a typical user will view before leaving our site.

However, we sometimes fail to record a user's IP address, so ***we also want to know how many page views we couldn't assign to a user***.

You would think we could just group by the IP address column and do a count, right?

In [11]:
df.groupby("ip")["page"].count()

ip
0.0.0.1    3
0.0.0.2    3
0.0.0.3    1
Name: page, dtype: int64

Hey, our missing page is... well, missing!

By default, `pandas` drops rows whose group key is NaN when grouping, so the page view without an IP address never lands in any group.
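A quick sanity check makes the gap visible: compare the number of rows in the log with the number of rows that ended up in a group. This is only a sketch on the toy data from above, rebuilt here so the snippet runs on its own; the variable names `total_rows`, `grouped_rows` and `missing_keys` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same toy log as above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    "ip": ["0.0.0.1", "0.0.0.1", "0.0.0.1", "0.0.0.2", "0.0.0.2", np.nan, "0.0.0.2", "0.0.0.3"],
    "page": ["/home", "/home", "/cat-haircuts", "/home", "/login", "/dog-shampoos", "/login", "/profile"],
})

counts = df.groupby("ip")["page"].count()

total_rows = len(df)                       # every page view in the log
grouped_rows = int(counts.sum())           # page views that landed in some group
missing_keys = int(df["ip"].isna().sum())  # page views with no IP recorded

print(total_rows, grouped_rows, missing_keys)  # prints: 8 7 1
```

If the first two numbers differ, the difference is the rows that `groupby` silently left out, and here it matches the number of missing IP addresses.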

One workaround is to fill the missing values with a placeholder value first, and then do the grouping.

In [13]:
df["ip"] = df["ip"].fillna("unknown user")
df.groupby("ip")["page"].count()

ip
0.0.0.1         3
0.0.0.2         3
0.0.0.3         1
unknown user    1
Name: page, dtype: int64

Now we have what we need!
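If you are on a newer `pandas`, there may be a way to skip the placeholder entirely. The sketch below assumes pandas 1.1 or later, where `groupby` accepts a `dropna` argument; with `dropna=False` the NaN keys get a group of their own. The name `raw_df` is just an illustrative name for a fresh copy of the log, since `df["ip"]` was overwritten by `fillna` above, and using the median as the "typical user" figure is my own assumption.

```python
import numpy as np
import pandas as pd

# Fresh copy of the original log (raw_df is a hypothetical name),
# with the missing IP still stored as NaN.
raw_df = pd.DataFrame({
    "ip": ["0.0.0.1", "0.0.0.1", "0.0.0.1", "0.0.0.2", "0.0.0.2", np.nan, "0.0.0.2", "0.0.0.3"],
    "page": ["/home", "/home", "/cat-haircuts", "/home", "/login", "/dog-shampoos", "/login", "/profile"],
})

# dropna=False keeps rows with a NaN key as their own group.
# Assumes pandas >= 1.1, where groupby gained this parameter.
counts = raw_df.groupby("ip", dropna=False)["page"].count()
print(counts)

# Page views we could not assign to any user.
unassigned = int(counts[counts.index.isna()].sum())

# One possible "typical user" figure: the median over identified users only.
typical = counts[counts.index.notna()].median()
print(unassigned, typical)
```

Keeping the key as NaN avoids one pitfall of the placeholder approach: a string like "unknown user" becomes just another value in the column, so it is easy to forget to exclude it when computing per-user statistics later on.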