# Data Wrangling 1: Data Cleaning

## Data Wrangling
Before starting any sort of data analysis, we must first ensure that the data is in a form that works for our process. This might involve finding a subset of the data, changing the shape of the data, or even adding information that is not present when the data is received. All of this is referred to as **Data Wrangling**, also known as Data Manipulation. There are three main types of actions that can be taken to reformat the data.
- Data Cleaning
- Data Transformation
- Data Enrichment

Taken together, these actions make Data Wrangling the majority of the Data Analysis workflow. It is a critical step in making sure that we perform an analysis that is logically sound and can produce useful insights. However, great care must be taken to ensure that our manipulation does not create deceptive insights.
> "If you torture the data long enough, it will confess to anything."
—Ronald Coase, winner of a Nobel Prize in Economics

The ethics of Data Science is not a conversation that falls within the scope of this course, but it should be obvious to the reader that we do not want to intentionally deceive anyone who consumes our findings. One way to reduce this risk is to be transparent in your analysis and preparation. Providing information on how you reached your conclusions will give those who receive your results the chance to make their own judgements about your findings.

## Data Cleaning
If you have read through the previous notebooks, you will find a few examples of how we have performed data cleaning. Below is a list of some common steps that might be taken in a data cleaning process.
- Renaming columns
- Sorting and Reordering rows
- Data type conversions
- Deduplicating data
- Addressing missing or invalid data
- Filtering to the desired subset of data

This is by no means an exhaustive list, nor should you expect to perform these steps in this exact order. These steps are simply a set of blocks that you have at your disposal when building the entire pipeline for your data. This notebook strives only to showcase the functionality of Pandas and to teach the syntax for each of these functions. The steps, their order, and how many repetitions of each step will depend entirely on the data set in question, the format that you receive it in, and the intended use case for your data.
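Deduplicating, addressing missing data, and filtering appear in the list above but are not walked through later in this notebook, so here is a minimal sketch of how those three blocks might be chained. The `raw` dataframe, its column names, and its values are made up for illustration, and the threshold of 20 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical data: one exact duplicate row and one missing temperature
raw = pd.DataFrame({
    'station': ['A', 'A', 'B', 'B'],
    'temp_C': [21.2, 21.2, None, 18.3],
})

cleaned = (
    raw.drop_duplicates()                 # remove exact repeated rows
       .dropna(subset=['temp_C'])         # remove rows with no temperature
       .loc[lambda d: d.temp_C >= 20]     # keep only the subset of interest
)
print(cleaned)
```

Whether dropping rows is the right way to handle missing values depends on the data set; it is only one of several options.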


In [1]:
import pandas as pd

### Renaming Columns
Renaming columns could be one of your steps, especially if you receive data in a format that is not very human-readable. For example, you may receive a spreadsheet with columns that make heavy use of acronyms related to the retrieval process. You might choose to rename the column to something that is a bit more descriptive, e.g. rename "pow" to "place-of-work" (yes, this is a real example from a real government CSV that I've seen recently). Let's take an example from the book *Hands-On Data Analysis with Pandas*.

We first read in some temperature data for New York City.

In [2]:
temp_ny = pd.read_csv("https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_03/data/nyc_temperatures.csv")
temp_ny.head()

Unnamed: 0,attributes,datatype,date,station,value
0,"H,,S,",TAVG,2018-10-01T00:00:00,GHCND:USW00014732,21.2
1,",,W,2400",TMAX,2018-10-01T00:00:00,GHCND:USW00014732,25.6
2,",,W,2400",TMIN,2018-10-01T00:00:00,GHCND:USW00014732,18.3
3,"H,,S,",TAVG,2018-10-02T00:00:00,GHCND:USW00014732,22.7
4,",,W,2400",TMAX,2018-10-02T00:00:00,GHCND:USW00014732,26.1


This table is in what is called a long format. Notice the column `datatype`, which tells us what the `value` column means. Luckily, in our case we know (from the way we received this data) that everything in the `value` column is a temperature in Celsius. We also know that the `attributes` column is a collection of flags that gives us information about the data collection process. We can rename those two columns to make the table clearer to read.

In [4]:
temp_ny.rename(
    columns={
        'value': 'temp_C',
        'attributes': 'flags'
    },
    inplace=True
)
temp_ny.head()

Unnamed: 0,flags,datatype,date,station,temp_C
0,"H,,S,",TAVG,2018-10-01T00:00:00,GHCND:USW00014732,21.2
1,",,W,2400",TMAX,2018-10-01T00:00:00,GHCND:USW00014732,25.6
2,",,W,2400",TMIN,2018-10-01T00:00:00,GHCND:USW00014732,18.3
3,"H,,S,",TAVG,2018-10-02T00:00:00,GHCND:USW00014732,22.7
4,",,W,2400",TMAX,2018-10-02T00:00:00,GHCND:USW00014732,26.1


In the above cell, we passed a dictionary object into the `columns` argument. Using the syntax `{ 'current_name': 'new_name' }`, we can rename as many columns as we need simultaneously. This instantly makes our dataframe much more human-readable. Additionally, we have used the `inplace=True` argument to signify that we want to rename the existing dataframe. Without this value, we would have created a new dataframe that needs to be saved to a variable.
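To make the role of `inplace` concrete, here is a minimal sketch on a small made-up dataframe (the names `df` and `renamed` and the two rows are illustrative, not the NYC data). Without `inplace=True`, `rename` leaves the original untouched and returns a renamed copy that has to be captured:

```python
import pandas as pd

# Hypothetical two-row frame that mimics the original column names
df = pd.DataFrame({'value': [21.2, 25.6], 'attributes': ['H,,S,', ',,W,2400']})

# No inplace=True: the result must be saved to a variable
renamed = df.rename(columns={'value': 'temp_C', 'attributes': 'flags'})

print(list(df.columns))       # original names are unchanged
print(list(renamed.columns))  # new names live on the returned frame
```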

We can also pass in functions as a way of transforming all the column names at the same time.

In [5]:
print("before: \n", temp_ny.columns)
print("after: \n", temp_ny.rename(str.upper,axis='columns').columns)

before: 
 Index(['flags', 'datatype', 'date', 'station', 'temp_C'], dtype='object')
after: 
 Index(['FLAGS', 'DATATYPE', 'DATE', 'STATION', 'TEMP_C'], dtype='object')


### Changing the Index
In addition to renaming the columns of a dataframe, we can also change the index. The index is a special column of the dataframe that labels each row. For example, `loc` selects rows by their index labels.

In [6]:
temp_ny.loc[5:10]

Unnamed: 0,flags,datatype,date,station,temp_C
5,",,W,2400",TMIN,2018-10-02T00:00:00,GHCND:USW00014732,19.4
6,"H,,S,",TAVG,2018-10-03T00:00:00,GHCND:USW00014732,21.8
7,",,W,2400",TMAX,2018-10-03T00:00:00,GHCND:USW00014732,25.0
8,",,W,2400",TMIN,2018-10-03T00:00:00,GHCND:USW00014732,18.9
9,"H,,S,",TAVG,2018-10-04T00:00:00,GHCND:USW00014732,21.3
10,",,W,2400",TMAX,2018-10-04T00:00:00,GHCND:USW00014732,26.1
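One detail worth noticing in the output above is that `loc[5:10]` returns six rows: `loc` slices by label and includes the end label. The sketch below contrasts this with position-based `iloc`, using a small made-up dataframe (the name `df` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with the default 0..11 integer index
df = pd.DataFrame({'temp_C': [float(i) for i in range(12)]})

by_label = df.loc[5:10]      # labels 5 through 10, end label included
by_position = df.iloc[5:10]  # positions 5 through 9, end position excluded

print(len(by_label), len(by_position))
```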


However, we can also use other columns as our index. For example, we can set the date to be the index of this dataframe.

In [7]:
date_index = temp_ny.set_index('date')
date_index.head()

Unnamed: 0_level_0,flags,datatype,station,temp_C
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-10-01T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,21.2
2018-10-01T00:00:00,",,W,2400",TMAX,GHCND:USW00014732,25.6
2018-10-01T00:00:00,",,W,2400",TMIN,GHCND:USW00014732,18.3
2018-10-02T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,22.7
2018-10-02T00:00:00,",,W,2400",TMAX,GHCND:USW00014732,26.1


With the date as the index, we can also reference the ranges of dates instead of needing to find the integer index for those rows.

In [8]:
date_index['2018-10-01':'2018-10-10']

Unnamed: 0_level_0,flags,datatype,station,temp_C
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-10-01T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,21.2
2018-10-01T00:00:00,",,W,2400",TMAX,GHCND:USW00014732,25.6
2018-10-01T00:00:00,",,W,2400",TMIN,GHCND:USW00014732,18.3
2018-10-02T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,22.7
2018-10-02T00:00:00,",,W,2400",TMAX,GHCND:USW00014732,26.1
2018-10-02T00:00:00,",,W,2400",TMIN,GHCND:USW00014732,19.4
2018-10-03T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,21.8
2018-10-03T00:00:00,",,W,2400",TMAX,GHCND:USW00014732,25.0
2018-10-03T00:00:00,",,W,2400",TMIN,GHCND:USW00014732,18.9
2018-10-04T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,21.3


In [10]:
date_index['2018-10-01':'2018-10-02']

Unnamed: 0_level_0,flags,datatype,station,temp_C
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-10-01T00:00:00,"H,,S,",TAVG,GHCND:USW00014732,21.2
2018-10-01T00:00:00,",,W,2400",TMAX,GHCND:USW00014732,25.6
2018-10-01T00:00:00,",,W,2400",TMIN,GHCND:USW00014732,18.3


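One downside is that this index is still made of strings, so a slice compares the labels as text. The label `'2018-10-02T00:00:00'` sorts after `'2018-10-02'`, which is why the second slice above returns no rows for October 2. We will talk about how to convert the column to a proper datetime in the next section.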
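As a rough illustration of why the index type matters for slicing, the sketch below builds a tiny made-up dataframe (names and values are assumptions) and slices it twice: once with a string index and once with a datetime index, using `pd.to_datetime` ahead of its introduction in the next section.

```python
import pandas as pd

# Hypothetical rows in the same timestamp format as the NYC data
df = pd.DataFrame({
    'date': ['2018-10-01T00:00:00', '2018-10-01T00:00:00',
             '2018-10-02T00:00:00', '2018-10-03T00:00:00'],
    'temp_C': [21.2, 25.6, 22.7, 21.8],
})

# String index: labels are compared as text
string_index = df.set_index('date')
string_slice = string_index.loc['2018-10-01':'2018-10-02']

# Datetime index: '2018-10-02' is understood as the whole day
datetime_index = df.assign(date=pd.to_datetime(df['date'])).set_index('date')
datetime_slice = datetime_index.loc['2018-10-01':'2018-10-02']

print(len(string_slice), len(datetime_slice))
```

With the string index the October 2 row is left out, while the datetime index includes it.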

### Type Conversion
When pandas reads in a CSV, it does its best to interpret what kind of data you are reading in. Below are the data types that Pandas inferred from the CSV we read in.

In [11]:
temp_ny.dtypes

flags        object
datatype     object
date         object
station      object
temp_C      float64
dtype: object

It read every column as `object` (the type pandas uses here for strings) except for the temperature. This may not have been the case if the original column (`value`) had contained a mix of string values and numeric values.

Another observation is that our date column is being interpreted as a string. We conveniently have a datetime type built into pandas, which offers a lot more functionality when it comes to sorting, ordering, and aggregating. Let's convert this column by using the `to_datetime` function in the Pandas library.
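To show what a mixed column looks like, here is a small sketch that reads a made-up CSV from a string (the CSV text and the `mixed` name are assumptions). One non-numeric entry is enough to keep the whole column from being read as a number; `pd.to_numeric` with `errors='coerce'` is one way to recover numeric values, turning anything unparseable into `NaN`.

```python
import io
import pandas as pd

# Hypothetical CSV in which one temperature was recorded as text
csv_text = "date,value\n2018-10-01,21.2\n2018-10-02,missing\n2018-10-03,18.3\n"
mixed = pd.read_csv(io.StringIO(csv_text))

# The stray string prevents pandas from inferring a numeric column
raw_is_numeric = pd.api.types.is_numeric_dtype(mixed['value'])
print(raw_is_numeric)

# Coerce to numbers; unparseable entries become NaN
mixed['value'] = pd.to_numeric(mixed['value'], errors='coerce')
print(mixed.dtypes)
```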

In [12]:
temp_ny_copy = temp_ny.copy()
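# Assign the converted column directly; in recent pandas versions `.loc[:, 'date'] = ...` may keep the object dtype
temp_ny_copy['date'] = pd.to_datetime(temp_ny_copy['date'])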
temp_ny_copy.dtypes

flags               object
datatype            object
date        datetime64[ns]
station             object
temp_C             float64
dtype: object

In [14]:
temp_ny_copy.head()

Unnamed: 0,flags,datatype,date,station,temp_C
0,"H,,S,",TAVG,2018-10-01,GHCND:USW00014732,21.2
1,",,W,2400",TMAX,2018-10-01,GHCND:USW00014732,25.6
2,",,W,2400",TMIN,2018-10-01,GHCND:USW00014732,18.3
3,"H,,S,",TAVG,2018-10-02,GHCND:USW00014732,22.7
4,",,W,2400",TMAX,2018-10-02,GHCND:USW00014732,26.1


Now our `date` column is a proper datetime, which allows us to perform more useful operations, such as summarizing the range of dates:

In [19]:
temp_ny_copy.date.describe()


count                      93
unique                     31
top       2018-10-01 00:00:00
freq                        3
first     2018-10-01 00:00:00
last      2018-10-31 00:00:00
Name: date, dtype: object
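Beyond `describe`, a datetime column exposes the `.dt` accessor and supports date arithmetic. The following is a minimal sketch on two made-up dates (the `dates` series is an assumption, chosen to match the first and last dates above).

```python
import pandas as pd

# Hypothetical series holding the first and last dates of the month
dates = pd.to_datetime(pd.Series(['2018-10-01T00:00:00', '2018-10-31T00:00:00']))

day_names = dates.dt.day_name()   # weekday name for each date
span = dates.max() - dates.min()  # subtraction yields a Timedelta

print(day_names.tolist())
print(span)
```

Neither operation is available while the column is still stored as strings.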

An issue specific to the `datetime` datatype is the existence of timezones. This could be critical if you are measuring data from multiple regions of the world, or if you have servers operating on different timezone standards. Pandas can manage and convert between timezones if we tell it which timezone the data is in. One caveat is that the dataframe methods used here, `tz_localize` and `tz_convert`, operate on the index rather than on a regular column. Combining our knowledge from the previous section, we can:
1. convert a column's data type to `datetime`
2. set that column as the index of the dataframe
3. define the timezone for that index

Below uses the copy of the dataframe we created earlier, where the `date` column's data type has already been converted.

In [20]:
temp_ny_copy.set_index('date', inplace=True)
temp_ny_copy = temp_ny_copy.tz_localize('EST')
temp_ny_copy.head()

Unnamed: 0_level_0,flags,datatype,station,temp_C
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-10-01 00:00:00-05:00,"H,,S,",TAVG,GHCND:USW00014732,21.2
2018-10-01 00:00:00-05:00,",,W,2400",TMAX,GHCND:USW00014732,25.6
2018-10-01 00:00:00-05:00,",,W,2400",TMIN,GHCND:USW00014732,18.3
2018-10-02 00:00:00-05:00,"H,,S,",TAVG,GHCND:USW00014732,22.7
2018-10-02 00:00:00-05:00,",,W,2400",TMAX,GHCND:USW00014732,26.1


Notice the `-05:00` at the end of each index label. The `EST` code denotes a fixed offset that is 5 hours behind UTC, which is defined as +0 hours.
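Because `EST` is a fixed offset, it does not follow daylight saving time, which New York observes in October. The sketch below compares it with the region-based name `America/New_York` on two made-up dates (the `idx` index is an assumption; it relies on the timezone database that pandas normally has available).

```python
from datetime import timedelta

import pandas as pd

# Hypothetical dates: one during daylight saving time, one outside it
idx = pd.DatetimeIndex(['2018-10-01', '2018-12-01'])

fixed = idx.tz_localize('EST')                  # always UTC-5
regional = idx.tz_localize('America/New_York')  # follows daylight saving rules

fixed_offsets = [ts.utcoffset() for ts in fixed]
regional_offsets = [ts.utcoffset() for ts in regional]

print(fixed_offsets)
print(regional_offsets)
```

Which one is appropriate depends on how the timestamps were recorded, so treat this as something to check against the data source rather than a rule.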

Taking this one step further, we can also convert from one timezone to another. If we needed all of our data converted to UTC for example, we can call `tz_convert` on our dataframe and pass in the timezone code.

In [21]:
temp_ny_copy = temp_ny_copy.tz_convert('UTC')
temp_ny_copy.head()

Unnamed: 0_level_0,flags,datatype,station,temp_C
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-10-01 05:00:00+00:00,"H,,S,",TAVG,GHCND:USW00014732,21.2
2018-10-01 05:00:00+00:00,",,W,2400",TMAX,GHCND:USW00014732,25.6
2018-10-01 05:00:00+00:00,",,W,2400",TMIN,GHCND:USW00014732,18.3
2018-10-02 05:00:00+00:00,"H,,S,",TAVG,GHCND:USW00014732,22.7
2018-10-02 05:00:00+00:00,",,W,2400",TMAX,GHCND:USW00014732,26.1


Notice now that the tail end of each index label has `+00:00` and the time has changed to `05:00:00`. Also note that much of this can be done when the dataframe is first created. In the following example, we interpret the date column as a timestamp, set the index to be that column, set the timezone, and convert it to UTC in a single chained statement.

In [22]:
initialize_with_date = pd.read_csv(
    "https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_03/data/nyc_temperatures.csv",
    index_col='date',
    parse_dates=True
).tz_localize('EST').tz_convert('UTC')
initialize_with_date.head()

Unnamed: 0_level_0,attributes,datatype,station,value
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-10-01 05:00:00+00:00,"H,,S,",TAVG,GHCND:USW00014732,21.2
2018-10-01 05:00:00+00:00,",,W,2400",TMAX,GHCND:USW00014732,25.6
2018-10-01 05:00:00+00:00,",,W,2400",TMIN,GHCND:USW00014732,18.3
2018-10-02 05:00:00+00:00,"H,,S,",TAVG,GHCND:USW00014732,22.7
2018-10-02 05:00:00+00:00,",,W,2400",TMAX,GHCND:USW00014732,26.1


The two main differences here are the use of the arguments `index_col` and `parse_dates`, and that we called `tz_localize` and `tz_convert` one after the other.

An alternative way of changing data types is the `assign` method. This method can convert data types, create new columns, overwrite existing columns, and compute complex expressions as column values. We can additionally use the `astype` method to convert between data types, e.g. float to integer. All of this can be done in a single function call. Consider the following example:

In [23]:
temp_ny.dtypes

flags        object
datatype     object
date         object
station      object
temp_C      float64
dtype: object

In [24]:
new_df = temp_ny.assign(
    date=pd.to_datetime(temp_ny.date), # convert to datetime datatype
    temp_C_whole=temp_ny.temp_C.astype('int'), # cast floating values to integers
    temp_F=(temp_ny.temp_C * 9/5) + 32, # convert from Celsius to Fahrenheit
    temp_F_whole=lambda x: x.temp_F.astype('int') # cast the newly created Fahrenheit column to integers
    )
new_df.head()

Unnamed: 0,flags,datatype,date,station,temp_C,temp_C_whole,temp_F,temp_F_whole
0,"H,,S,",TAVG,2018-10-01,GHCND:USW00014732,21.2,21,70.16,70
1,",,W,2400",TMAX,2018-10-01,GHCND:USW00014732,25.6,25,78.08,78
2,",,W,2400",TMIN,2018-10-01,GHCND:USW00014732,18.3,18,64.94,64
3,"H,,S,",TAVG,2018-10-02,GHCND:USW00014732,22.7,22,72.86,72
4,",,W,2400",TMAX,2018-10-02,GHCND:USW00014732,26.1,26,78.98,78


In [25]:
new_df.dtypes

flags                   object
datatype                object
date            datetime64[ns]
station                 object
temp_C                 float64
temp_C_whole             int32
temp_F                 float64
temp_F_whole             int32
dtype: object

We could naturally combine the above `assign` call with the methods to assign the `date` column as the index, set the timezone, etc.
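A minimal sketch of such a combination is shown here on a small made-up dataframe (the `df` and `chained` names and the values are assumptions). The `lambda` form lets each step refer to the dataframe as it exists at that point in the chain.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical two-row frame with string dates
df = pd.DataFrame({
    'date': ['2018-10-01T00:00:00', '2018-10-02T00:00:00'],
    'temp_C': [21.2, 22.7],
})

chained = (
    df.assign(
        date=lambda x: pd.to_datetime(x.date),   # convert to datetime
        temp_F=lambda x: x.temp_C * 9 / 5 + 32,  # Celsius to Fahrenheit
    )
    .set_index('date')    # use the converted column as the index
    .tz_localize('EST')   # attach the timezone to the index
)
print(chained)
```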

One additional data type I want to call out is the `category` datatype. Using `assign` once more:

In [26]:
new_df = new_df.assign(
    station=temp_ny.station.astype('category'),
    datatype=temp_ny.datatype.astype('category')
)
new_df.head()

Unnamed: 0,flags,datatype,date,station,temp_C,temp_C_whole,temp_F,temp_F_whole
0,"H,,S,",TAVG,2018-10-01,GHCND:USW00014732,21.2,21,70.16,70
1,",,W,2400",TMAX,2018-10-01,GHCND:USW00014732,25.6,25,78.08,78
2,",,W,2400",TMIN,2018-10-01,GHCND:USW00014732,18.3,18,64.94,64
3,"H,,S,",TAVG,2018-10-02,GHCND:USW00014732,22.7,22,72.86,72
4,",,W,2400",TMAX,2018-10-02,GHCND:USW00014732,26.1,26,78.98,78


In [27]:
new_df.dtypes

flags                   object
datatype              category
date            datetime64[ns]
station               category
temp_C                 float64
temp_C_whole             int32
temp_F                 float64
temp_F_whole             int32
dtype: object

Using this data type allows us to perform data type specific functionality. For example, using `describe` on categorical data gives us the count, number of unique values, the mode, and the number of times it occurs in the dataset.

In [28]:
new_df.describe(include='category')

Unnamed: 0,datatype,station
count,93,93
unique,3,1
top,TAVG,GHCND:USW00014732
freq,31,93
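A categorical column also exposes the `.cat` accessor. As a small sketch on made-up values (the `datatype` series here is an assumption, not the full column), it shows the distinct categories and the integer code that stands in for each value:

```python
import pandas as pd

# Hypothetical series with the three measurement types
datatype = pd.Series(['TAVG', 'TMAX', 'TMIN', 'TAVG']).astype('category')

categories = list(datatype.cat.categories)  # the distinct labels
codes = datatype.cat.codes.tolist()         # integer code for each row

print(categories)
print(codes)
```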


### Reordering and Sorting
Depending on your context, it may be useful to order the rows in a specific way. The simplest way to do this is by using `sort_values`. This method takes the column name(s), the order of the sort (ascending or descending), and a few other keyword arguments that can be looked up if needed.

In [34]:
new_df.sort_values(by='temp_C', ascending=False).head(10)

Unnamed: 0,flags,datatype,date,station,temp_C,temp_C_whole,temp_F,temp_F_whole
19,",,W,2400",TMAX,2018-10-07,GHCND:USW00014732,27.8,27,82.04,82
28,",,W,2400",TMAX,2018-10-10,GHCND:USW00014732,27.8,27,82.04,82
31,",,W,2400",TMAX,2018-10-11,GHCND:USW00014732,26.7,26,80.06,80
4,",,W,2400",TMAX,2018-10-02,GHCND:USW00014732,26.1,26,78.98,78
10,",,W,2400",TMAX,2018-10-04,GHCND:USW00014732,26.1,26,78.98,78
25,",,W,2400",TMAX,2018-10-09,GHCND:USW00014732,25.6,25,78.08,78
1,",,W,2400",TMAX,2018-10-01,GHCND:USW00014732,25.6,25,78.08,78
7,",,W,2400",TMAX,2018-10-03,GHCND:USW00014732,25.0,25,77.0,77
27,"H,,S,",TAVG,2018-10-10,GHCND:USW00014732,23.8,23,74.84,74
30,"H,,S,",TAVG,2018-10-11,GHCND:USW00014732,23.4,23,74.12,74


Passing in multiple columns in a list sorts by the first column and uses each later column to break ties.

In [33]:
new_df.sort_values(by=['temp_C', 'date'], ascending=False).head(10)

Unnamed: 0,flags,datatype,date,station,temp_C,temp_C_whole,temp_F,temp_F_whole
28,",,W,2400",TMAX,2018-10-10,GHCND:USW00014732,27.8,27,82.04,82
19,",,W,2400",TMAX,2018-10-07,GHCND:USW00014732,27.8,27,82.04,82
31,",,W,2400",TMAX,2018-10-11,GHCND:USW00014732,26.7,26,80.06,80
10,",,W,2400",TMAX,2018-10-04,GHCND:USW00014732,26.1,26,78.98,78
4,",,W,2400",TMAX,2018-10-02,GHCND:USW00014732,26.1,26,78.98,78
25,",,W,2400",TMAX,2018-10-09,GHCND:USW00014732,25.6,25,78.08,78
1,",,W,2400",TMAX,2018-10-01,GHCND:USW00014732,25.6,25,78.08,78
7,",,W,2400",TMAX,2018-10-03,GHCND:USW00014732,25.0,25,77.0,77
27,"H,,S,",TAVG,2018-10-10,GHCND:USW00014732,23.8,23,74.84,74
30,"H,,S,",TAVG,2018-10-11,GHCND:USW00014732,23.4,23,74.12,74
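The `ascending` argument also accepts a list with one entry per column, so each column can have its own direction. Here is a minimal sketch on made-up rows (the `df` frame and its values are assumptions) that sorts the hottest temperatures first and breaks ties with the earliest date:

```python
import pandas as pd

# Hypothetical rows with a tie on temperature
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-10-10', '2018-10-02', '2018-10-07']),
    'temp_C': [27.8, 26.1, 27.8],
})

# temp_C descending, then date ascending within equal temperatures
mixed_order = df.sort_values(by=['temp_C', 'date'], ascending=[False, True])
print(mixed_order)
```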


We can also sort an index (either the row index or the column index). In the following example, we sort the column names by using the `sort_index` method.

In [35]:
new_df.sort_index(axis='columns').head()

Unnamed: 0,datatype,date,flags,station,temp_C,temp_C_whole,temp_F,temp_F_whole
0,TAVG,2018-10-01,"H,,S,",GHCND:USW00014732,21.2,21,70.16,70
1,TMAX,2018-10-01,",,W,2400",GHCND:USW00014732,25.6,25,78.08,78
2,TMIN,2018-10-01,",,W,2400",GHCND:USW00014732,18.3,18,64.94,64
3,TAVG,2018-10-02,"H,,S,",GHCND:USW00014732,22.7,22,72.86,72
4,TMAX,2018-10-02,",,W,2400",GHCND:USW00014732,26.1,26,78.98,78


When we sorted the rows by value above, notice that the index showed the original index of each row. Whenever reordering, filtering, or manipulating rows of data, pandas keeps the original index, i.e., the row at index 5 holds the same data as the 6th data row of the original CSV (remember that Python uses zero-based indexing). This could leave you with rows that are "out of order" or with gaps in the index. We can create a new index that follows the current row order by using `reset_index()`.

In [37]:
new_df.loc[(new_df.temp_C >= 20)
           & (new_df.temp_C <= 25)
           & (new_df.datatype == 'TAVG')]\
    .sort_values('temp_C')\
    .reset_index()

Unnamed: 0,index,flags,datatype,date,station,temp_C,temp_C_whole,temp_F,temp_F_whole
0,12,"H,,S,",TAVG,2018-10-05,GHCND:USW00014732,20.3,20,68.54,68
1,21,"H,,S,",TAVG,2018-10-08,GHCND:USW00014732,20.9,20,69.62,69
2,0,"H,,S,",TAVG,2018-10-01,GHCND:USW00014732,21.2,21,70.16,70
3,9,"H,,S,",TAVG,2018-10-04,GHCND:USW00014732,21.3,21,70.34,70
4,6,"H,,S,",TAVG,2018-10-03,GHCND:USW00014732,21.8,21,71.24,71
5,24,"H,,S,",TAVG,2018-10-09,GHCND:USW00014732,21.8,21,71.24,71
6,3,"H,,S,",TAVG,2018-10-02,GHCND:USW00014732,22.7,22,72.86,72
7,18,"H,,S,",TAVG,2018-10-07,GHCND:USW00014732,22.8,22,73.04,73
8,30,"H,,S,",TAVG,2018-10-11,GHCND:USW00014732,23.4,23,74.12,74
9,27,"H,,S,",TAVG,2018-10-10,GHCND:USW00014732,23.8,23,74.84,74


In [39]:
new_df.reset_index()

This pushes the original index to a new column and creates a new numeric index based on the current order and filter.
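If the old index is not needed, `reset_index` accepts `drop=True` to discard it instead of keeping it as a column. A minimal sketch on a made-up dataframe (the names `df`, `warm`, `kept`, and `dropped` are assumptions):

```python
import pandas as pd

# Hypothetical frame, filtered and sorted so the index is out of order
df = pd.DataFrame({'temp_C': [21.2, 25.6, 18.3, 22.7]})
warm = df.loc[df.temp_C >= 21].sort_values('temp_C')

kept = warm.reset_index()              # old index becomes an 'index' column
dropped = warm.reset_index(drop=True)  # old index is discarded

print(kept)
print(dropped)
```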

## Closing Thoughts

There are many other methods that could be useful in a data cleaning process. These are only a small handful of things that could be done. Once again, the specifics of your data cleaning process are going to depend entirely on your data set, how you have retrieved it, knowledge of the measurement methods and tools, knowledge of error on those measurements, and many other pieces of information. Always keep in mind end questions like the following:
- which data are going to be useful moving forward while still providing a clear picture?
- are there any clear outliers that might be caused by bad measurements or intentionally deceptive data?
- is the data currently in the format that I need it, such as correct units and data types?

Questions like these will guide which processes you undertake.