# Processing Dirty Data

## Background 

This is fake data generated to demonstrate the capabilities of `pyjanitor`.  It contains a bunch of common problems that we regularly encounter when working with data.  Let's go fix it!

## Load Packages

Importing `janitor` (the module provided by the `pyjanitor` package) is all that's needed to give pandas DataFrames extra methods for working with your data.

In [None]:
import pandas as pd
import janitor

## Load Data

In [None]:
df = pd.read_excel('dirty_data.xlsx', engine='openpyxl')
df

## Cleaning Column Names

There are several problems with this data. First, the column names are not lowercase, and they contain spaces. This makes them cumbersome to reference in code. To solve this, we can use the `clean_names()` method.

In [None]:
df_clean = df.clean_names()
df_clean.head(2)

Notice that the column names are now lowercase, with the spaces replaced by underscores.
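
To make the effect concrete, here is a rough sketch of the same idea in plain pandas. It is only an approximation: the column names are hypothetical, and the real `clean_names()` handles more cases than this simple lowercase-and-underscore rule.

```python
import pandas as pd

# Hypothetical column names, chosen only for illustration.
raw = pd.DataFrame({"First Name": ["Ada"], "Hire Date": [36526]})

# Rough stand-in for clean_names(): trim, lowercase, spaces -> underscores.
approx = raw.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

print(list(approx.columns))
```

The original frame is left untouched, which matches the method-chaining style used throughout this notebook.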

If you look closely at the unclean dataset, you'll notice one row and one column that are entirely empty. We can fix this too! Building on the code block above, let's remove the empty row and column using the `remove_empty()` method:

In [None]:
df_clean = df.clean_names().remove_empty()
df_clean.head(9).tail(4)

Now this is starting to shape up well!
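
For intuition, dropping fully empty rows and columns can be sketched in plain pandas as shown here. The toy data is made up, and this is an approximation of the behavior rather than the library's own implementation (for example, it does not reset the index).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the middle row and the "blank" column are entirely empty.
toy = pd.DataFrame(
    {
        "name": ["Ada", np.nan, "Grace"],
        "blank": [np.nan, np.nan, np.nan],
        "salary": [63000.0, np.nan, 55000.0],
    }
)

# Drop rows where every value is missing, then columns where every value is missing.
trimmed = toy.dropna(axis="index", how="all").dropna(axis="columns", how="all")

print(trimmed)
```

Note that `how="all"` only removes rows and columns with no data at all; a row with some missing values is kept.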

## Renaming Individual Columns

Next, let's rename some of the columns. `%_allocated` and `full_time_` (where the trailing underscore stands in for the original `?`) still carry traces of non-alphanumeric characters, which makes them a bit harder to use. We can rename them using the `rename_column()` method:

In [None]:
df_clean = (
    df
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
)

df_clean.head(5)
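
As a point of comparison, the same two renames can be written as one plain pandas call. This is a sketch on made-up values, not a replacement for the chain above.

```python
import pandas as pd

# Toy frame (hypothetical values) with the two awkward column names.
toy = pd.DataFrame({"%_allocated": [1.0, 0.5], "full_time_": ["Yes", "No"]})

# pandas takes a mapping of old name -> new name.
renamed = toy.rename(
    columns={"%_allocated": "percent_allocated", "full_time_": "full_time"}
)

print(list(renamed.columns))
```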

Note how we now have really nice column names! You might be wondering why I'm not modifying the two certification columns -- that is the next thing we'll tackle.

## Coalescing Columns

If we look more closely at the two `certification` columns, we'll see that they look like this:

In [None]:
df_clean[['certification', 'certification_1']]

Rows 8 and 11 have NaN in the left certification column, but have a value in the right certification column. Let's assume for a moment that the left certification column is intended to record the first certification that a teacher had obtained. In this case, the values in the right certification column on rows 8 and 11 should be moved to the first column. Let's do that with pyjanitor, using the `coalesce()` method, which keeps the first non-null value found across the listed columns in each row:

In [None]:
df_clean = (
    df
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce(
        column_names=['certification', 'certification_1'],
        new_column_name='certification'
    )
)

df_clean

Awesome stuff! Instead of two columns of scattered data, we now have one densely populated column.
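
The coalescing idea can be sketched in plain pandas as well. The values here are invented for illustration, and dropping the second column afterwards is an assumption made to mirror the single-column result described above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second row is missing its left-hand value.
toy = pd.DataFrame(
    {
        "certification": ["Physical ed", np.nan, "Vocal music"],
        "certification_1": ["Theater", "Instr. music", np.nan],
    }
)

# Keep the left value where present; otherwise fall back to the right value.
toy["certification"] = toy["certification"].fillna(toy["certification_1"])

# Drop the now-redundant second column.
merged = toy.drop(columns="certification_1")

print(merged)
```

Where both columns hold a value, the left one wins, so the second certification in such rows is discarded.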

## Dealing with Excel Dates

Finally, notice how the `hire_date` column isn't formatted as a date. Instead, it holds Excel serial numbers, which count days rather than spelling out a calendar date.
To clean up this data, we can use the `convert_excel_date()` method.

In [None]:
df_clean = (
    df
    .clean_names()
    .remove_empty()
    .rename_column('%_allocated', 'percent_allocated')
    .rename_column('full_time_', 'full_time')
    .coalesce(['certification', 'certification_1'], 'certification')
    .convert_excel_date('hire_date')
)
df_clean
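
To see what such a conversion involves, here is a minimal pandas-only sketch. It assumes the common Excel 1900 date system, in which serial numbers count days from 1899-12-30. The serial values are chosen for illustration, and this is not claimed to be how `convert_excel_date` is implemented.

```python
import pandas as pd

# Example serial numbers (illustrative, not taken from the dataset).
serials = pd.Series([36526, 39448], name="hire_date")

# Interpret each number as days elapsed since the assumed Excel epoch.
converted = pd.to_datetime(serials, unit="D", origin="1899-12-30")

print(converted)
```

Workbooks saved with the 1904 date system use a different epoch, so the `origin` would need to change for those files.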

We have a cleaned dataframe!