# Cleaning data with `pandas`


**Questions**
- How can we deal with imperfect data?
- What is tidy data?

**Objectives**
- Clean a dataset using `pandas`

In the previous section we learned how to read data in with `pandas` and how to access the data within a dataframe. Here we will be looking at how to "clean" our data while still staying within the bounds of reproducible science.

## Exercise: Read in the data
If you haven't already, import `pandas` and read the exoplanet data into a dataframe. I'm calling the dataframe `exo_df`, the same as in the last section. If you're reading in the data for the first time now, rename the first column to `name` rather than `# name`. You can check the [previous section](01_intro_pandas.ipynb) if you need a refresher on how to do this. Check that the data has been read correctly by looking at your dataframe.

In [1]:
import pandas as pd

In [2]:
exo_df = pd.read_csv("../data/Exoplanet_catalog_2019.csv")

In [3]:
exo_df.rename(columns={'# name':'name'}, inplace=True)

In [4]:
exo_df

Unnamed: 0,name,orbital_period,semi_major_axis,eccentricity,discovered,detection_type,star_name,star_distance,star_metallicity,star_mass,star_sp_type
0,11 Com b,326.030000,1.290000,0.23100,2008.0,Radial Velocity,11 Com,110.6000,-0.350,2.700,G8 III
1,11 UMi b,516.220000,1.540000,0.08000,2009.0,Radial Velocity,11 UMi,119.5000,0.040,1.800,K4III
2,14 And b,185.840000,0.830000,0.00000,2008.0,Radial Velocity,14 And,76.4000,-0.240,2.200,K0III
3,14 Her b,1773.400000,2.770000,0.36900,2002.0,Radial Velocity,14 Her,18.1000,0.430,0.900,K0 V
4,16 Cyg B b,799.500000,1.680000,0.68900,1996.0,Radial Velocity,16 Cyg B,21.4100,0.080,1.010,G2.5 V
5,18 Del b,993.300000,2.600000,0.08000,2008.0,Radial Velocity,18 Del,73.1000,-0.052,2.300,G6III
6,1SWASP J1407 b,3725.000000,3.900000,,2012.0,Primary Transit,1SWASP J1407,133.0000,,0.900,
7,24 Boo b,30.350600,0.190000,0.04200,2018.0,Radial Velocity,24 Boo,100.0000,-0.770,0.990,G3IV
8,24 Sex b,452.800000,1.333000,0.09000,2010.0,Radial Velocity,24 Sex,72.2084,-0.030,1.540,G5
9,24 Sex c,883.000000,2.080000,0.29000,2010.0,Radial Velocity,24 Sex,72.2084,-0.030,1.540,G5


As you may remember from your Semester 1 coursework, there are some gaps in the original CSV file. Not every object has data available for every variable. This is a hassle when we're using something like `np.loadtxt` to read in the data, but `pandas` can handle missing values using the **`NaN`** value.

`NaN` stands for 'Not a Number' and is the value that `pandas` fills your empty cells with. 

## Information: NaN, null, NA...

`NaN` is probably the most common way to show missing data, but you might also come across `NA` or `null`. If you're using some data where the missing values are represented by something else (for example, `99.999` or `----`), you can use the `na_values` argument in `pd.read_csv` to tell it what corresponds to missing data.
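
As a minimal sketch of how `na_values` works, here is a made-up CSV (not the exoplanet catalogue) where missing values are written as `99.999` or `----`. The column names and numbers are invented for illustration, and `io.StringIO` just stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text where missing values are written as 99.999 or ----
raw_csv = io.StringIO(
    "name,orbital_period,star_mass\n"
    "planet a,326.03,2.7\n"
    "planet b,99.999,1.8\n"
    "planet c,185.84,----\n"
)

# na_values lists the extra strings that should be read as missing data
toy_df = pd.read_csv(raw_csv, na_values=["99.999", "----"])

print(toy_df)
print(toy_df.isna().sum())  # number of missing values in each column
```

Both placeholder values should come out as `NaN`, and the `star_mass` column is read as numbers rather than text because `----` is no longer treated as a string.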

We can select all the rows in a dataframe that contain a `NaN` value using the `isna()` method. This is another logical operation: `isna()` returns `True` for every cell that is missing and `False` for every cell that isn't. Because we want to select all the rows with a `NaN` and don't care which column it's in, we use `.any(axis=1)`, where `axis=1` says to do the operation on each row.

In [5]:
exo_df[exo_df.isna().any(axis=1)]

Unnamed: 0,name,orbital_period,semi_major_axis,eccentricity,discovered,detection_type,star_name,star_distance,star_metallicity,star_mass,star_sp_type
6,1SWASP J1407 b,3725.000000,3.90000,,2012.0,Primary Transit,1SWASP J1407,133.000,,0.900,
11,38 Vir b,825.900000,1.82000,0.030,2016.0,Primary Transit,38 Vir,,0.070,1.180,F6 V
30,75 Cet b,691.900000,2.10000,,2012.0,Radial Velocity,75 Cet,81.500,0.000,2.490,G3III
34,AD 3116 b,1.982796,,0.146,2017.0,Primary Transit,AD 3116,186.540,,0.276,M3.9
35,AD Leo b,2.225990,0.02500,0.030,2019.0,Radial Velocity,AD Leo,4.966,,,M4V
37,BD+03 2562 b,481.900000,1.30000,0.200,2017.0,Radial Velocity,BD+03 2562,2618.000,-0.710,1.140,
39,BD+15 2375 b,153.220000,0.57600,0.001,2016.0,Radial Velocity,BD+15 2375,774.000,-0.220,1.080,
43,BD+20 274 c,578.200000,1.30000,,2012.0,Radial Velocity,BD+20 274,,-0.460,0.800,K5
47,BD+48 738 b,392.600000,1.00000,0.200,2011.0,Radial Velocity,BD +48 738,,-0.200,0.740,K0III
48,BD+49 828 b,2590.000000,4.20000,0.350,2015.0,Radial Velocity,BD+49 828,,-0.190,1.520,KO


At the bottom of the output we can see the size of the dataframe that's returned: 2866 rows by 11 columns. Compare that to the original size of our dataframe... that's a lot of `NaN`s!
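
If you want to see what `isna().any(axis=1)` is doing one step at a time, here is a small sketch on a made-up dataframe. The names `toy_df`, `cell_mask` and `row_mask` are just for this example, and the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up values, not taken from the exoplanet catalogue
toy_df = pd.DataFrame(
    {
        "name": ["planet a", "planet b", "planet c"],
        "eccentricity": [0.2, np.nan, 0.1],
        "star_mass": [1.0, 0.9, np.nan],
    }
)

cell_mask = toy_df.isna()         # True wherever a cell is missing
row_mask = cell_mask.any(axis=1)  # True for rows with at least one missing cell

print(row_mask.tolist())          # [False, True, True]
print(toy_df[row_mask].shape)     # (2, 3)
print(cell_mask.sum())            # missing values counted per column
```

Summing the cell-by-cell mask is a quick way to see which columns are responsible for most of the gaps.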

It's handy to check which rows have missing data, but what if we're only concerned with those that aren't missing data? Then we can do the inverse to return only the rows **without** `NaN` values.

We can get this with almost exactly the same command, just one character different. We change `exo_df.isna()` to `~exo_df.isna()`. The `~` is another way of saying `not`: it swaps every `True` for `False` and vice versa.

In [6]:
exo_df[~exo_df.isna().any(axis=1)]

Unnamed: 0,name,orbital_period,semi_major_axis,eccentricity,discovered,detection_type,star_name,star_distance,star_metallicity,star_mass,star_sp_type
0,11 Com b,326.030000,1.290000,0.23100,2008.0,Radial Velocity,11 Com,110.6000,-0.350,2.700,G8 III
1,11 UMi b,516.220000,1.540000,0.08000,2009.0,Radial Velocity,11 UMi,119.5000,0.040,1.800,K4III
2,14 And b,185.840000,0.830000,0.00000,2008.0,Radial Velocity,14 And,76.4000,-0.240,2.200,K0III
3,14 Her b,1773.400000,2.770000,0.36900,2002.0,Radial Velocity,14 Her,18.1000,0.430,0.900,K0 V
4,16 Cyg B b,799.500000,1.680000,0.68900,1996.0,Radial Velocity,16 Cyg B,21.4100,0.080,1.010,G2.5 V
5,18 Del b,993.300000,2.600000,0.08000,2008.0,Radial Velocity,18 Del,73.1000,-0.052,2.300,G6III
7,24 Boo b,30.350600,0.190000,0.04200,2018.0,Radial Velocity,24 Boo,100.0000,-0.770,0.990,G3IV
8,24 Sex b,452.800000,1.333000,0.09000,2010.0,Radial Velocity,24 Sex,72.2084,-0.030,1.540,G5
9,24 Sex c,883.000000,2.080000,0.29000,2010.0,Radial Velocity,24 Sex,72.2084,-0.030,1.540,G5
10,30 Ari B b,335.100000,0.995000,0.28900,2009.0,Radial Velocity,30 Ari B,39.4000,0.120,1.220,F6V


That's a lot fewer rows. Oh dear. But what if we don't care about some of the columns with missing data?

In your coursework you were asked to only look at the planet name, period, semi-major axis, detection technique and star mass columns. So let's create a new dataframe that only contains these columns. I'll call this one `planet_df`.

First we need to check the names of the columns we want, then we can make our new dataframe using these names:



In [7]:
exo_df.columns

Index(['name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')

In [8]:
planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']]

In [9]:
planet_df

Unnamed: 0,name,orbital_period,semi_major_axis,detection_type,star_mass
0,11 Com b,326.030000,1.290000,Radial Velocity,2.700
1,11 UMi b,516.220000,1.540000,Radial Velocity,1.800
2,14 And b,185.840000,0.830000,Radial Velocity,2.200
3,14 Her b,1773.400000,2.770000,Radial Velocity,0.900
4,16 Cyg B b,799.500000,1.680000,Radial Velocity,1.010
5,18 Del b,993.300000,2.600000,Radial Velocity,2.300
6,1SWASP J1407 b,3725.000000,3.900000,Primary Transit,0.900
7,24 Boo b,30.350600,0.190000,Radial Velocity,0.990
8,24 Sex b,452.800000,1.333000,Radial Velocity,1.540
9,24 Sex c,883.000000,2.080000,Radial Velocity,1.540


Now we can check how many of these rows contain `NaN` values. Rather than print out the dataframe and read the number underneath, we can use the `shape` attribute, which gives the size as (rows, columns).

In [10]:
planet_df[planet_df.isna().any(axis=1)].shape

(1783, 5)

This dataframe has fewer rows with missing data than the first, but still has a lot. We only care about the rows that don't have any missing data. We could use the same operation as we did before to return only the complete rows, but an easier way to do it is to use `dropna`. This removes any rows with missing data. If we'd done this on the original dataframe we'd have lost a lot of rows that are perfectly fine for our analysis but have missing data in another column. By using our new dataframe we don't remove data unnecessarily. Setting `inplace=True` tells `pandas` to edit the dataframe itself rather than return an edited copy and leave the original unchanged.
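
As an aside, `dropna` also has a `subset` argument that only looks for missing data in the columns you list. This is an alternative to the approach used in this section, not a replacement for it. The sketch below uses a made-up dataframe to compare the two behaviours.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy_df = pd.DataFrame(
    {
        "name": ["planet a", "planet b", "planet c"],
        "orbital_period": [326.03, 516.22, np.nan],
        "eccentricity": [np.nan, 0.08, 0.0],
    }
)

# Drop rows only when one of the listed columns is missing
kept = toy_df.dropna(subset=["name", "orbital_period"])

print(kept["name"].tolist())  # ['planet a', 'planet b']
print(toy_df.dropna().shape)  # (1, 3): plain dropna checks every column
```

Here `planet a` survives even though its eccentricity is missing, because we didn't ask `dropna` to check that column.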

In [11]:
planet_df.dropna(inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


## Information: `SettingWithCopyWarning`
You may have got a warning when you ran the `dropna(inplace=True)` command. First - don't panic! This is a warning, not an error. Here the warning is telling us that we're editing a copy of part of another dataframe, not the original dataframe. That's exactly what we intended here so it's not a problem. It becomes more important when you're doing more complex data analysis (more complex than we'll reach in this course), but for now we just need to understand what it's telling us and move on.
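
If you'd rather not see the warning, one common approach is to make the copy explicit with `.copy()` when you select the columns. This is a sketch on made-up data (the names `exo_toy` and `planet_toy` are just for this example), and whether the warning appears at all can depend on your version of `pandas`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the full catalogue
exo_toy = pd.DataFrame(
    {
        "name": ["planet a", "planet b", "planet c"],
        "eccentricity": [0.2, 0.1, np.nan],
        "star_mass": [1.0, np.nan, 0.9],
    }
)

# .copy() makes it explicit that planet_toy is a separate dataframe
planet_toy = exo_toy[["name", "star_mass"]].copy()
planet_toy.dropna(inplace=True)

print(planet_toy.shape)  # (2, 2)
print(exo_toy.shape)     # (3, 3): the original is untouched
```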


Let's take another look at our `planet_df` dataframe:

In [12]:
planet_df

Unnamed: 0,name,orbital_period,semi_major_axis,detection_type,star_mass
0,11 Com b,326.030000,1.290000,Radial Velocity,2.700
1,11 UMi b,516.220000,1.540000,Radial Velocity,1.800
2,14 And b,185.840000,0.830000,Radial Velocity,2.200
3,14 Her b,1773.400000,2.770000,Radial Velocity,0.900
4,16 Cyg B b,799.500000,1.680000,Radial Velocity,1.010
5,18 Del b,993.300000,2.600000,Radial Velocity,2.300
6,1SWASP J1407 b,3725.000000,3.900000,Primary Transit,0.900
7,24 Boo b,30.350600,0.190000,Radial Velocity,0.990
8,24 Sex b,452.800000,1.333000,Radial Velocity,1.540
9,24 Sex c,883.000000,2.080000,Radial Velocity,1.540


It definitely has fewer rows than before, but we should still check that they're all complete...

## Exercise: Check your data

Use the `isna` method to check that you have properly cleaned the `planet_df` dataframe. Check the shape of the dataframe returned when you use `isna` and its inverse.

[solution]()

## Solution+: Check your data


In [13]:
planet_df[planet_df.isna().any(axis=1)]

Unnamed: 0,name,orbital_period,semi_major_axis,detection_type,star_mass


In [14]:
planet_df[planet_df.isna().any(axis=1)].shape

(0, 5)

In [15]:
planet_df[~planet_df.isna().any(axis=1)]

Unnamed: 0,name,orbital_period,semi_major_axis,detection_type,star_mass
0,11 Com b,326.030000,1.290000,Radial Velocity,2.700
1,11 UMi b,516.220000,1.540000,Radial Velocity,1.800
2,14 And b,185.840000,0.830000,Radial Velocity,2.200
3,14 Her b,1773.400000,2.770000,Radial Velocity,0.900
4,16 Cyg B b,799.500000,1.680000,Radial Velocity,1.010
5,18 Del b,993.300000,2.600000,Radial Velocity,2.300
6,1SWASP J1407 b,3725.000000,3.900000,Primary Transit,0.900
7,24 Boo b,30.350600,0.190000,Radial Velocity,0.990
8,24 Sex b,452.800000,1.333000,Radial Velocity,1.540
9,24 Sex c,883.000000,2.080000,Radial Velocity,1.540


In [16]:
planet_df[~planet_df.isna().any(axis=1)].shape

(2049, 5)
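
A more compact check is to count the missing cells directly. The sketch below uses a small hypothetical helper, `count_missing`, on made-up data. For a properly cleaned dataframe it should return zero.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the total number of missing cells in a dataframe (hypothetical helper)."""
    # The first sum counts per column, the second adds the columns together
    return int(df.isna().sum().sum())


# Made-up values for illustration only
dirty = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
clean = dirty.dropna()

print(count_missing(dirty))  # 2
print(count_missing(clean))  # 0
```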

:solution+

## Key Points:
- Missing data is represented as `NaN` or 'Not a Number'
- You can use the `na_values` argument of `pd.read_csv` to tell `pandas` which values correspond to missing data.
- `isna()` can be used to find missing data
- You can use `~` to return the inverse of a logical expression.