# Common Messy Datasets

The previous notebooks focused on one particular type of messy dataset: one where the column names are actually variable values rather than variable names. This was illustrated with the dataset on arrival delay. The `melt` method quickly tidies these basic datasets, but datasets often need more manipulation to become tidy. This notebook covers several more common messy datasets.

## Most common messy data problems

1. Column names are variable values, not variable names.
1. Multiple variables are stored in one column.
1. Variables are stored in both rows and columns.
1. Multiple types of observational units are stored in the same table.
1. A single observational unit is stored in multiple tables.

The first type of messy data was covered in the previous notebook. This notebook will cover the next three examples.

## Multiple variables are stored in one column

A tidy dataset requires that each column hold the values of a single variable.

### Column names appear as values in a column

Take a look at the dataset below. Notice how the `Value` column has both numeric and string data types and the `Info` column contains variable names.

In [None]:
import pandas as pd
df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'] * 3,
                        'Info': ['Age'] * 3 + ['Salary'] * 3 + ['Hair Color'] * 3, 
                        'Value': [10, 15, 20, 3, 4, 5, 'Brown', 'Pink','Red']},
                 columns=['State', 'Info', 'Value'])
df
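
The mixed types in `Value` can also be confirmed programmatically. The following is a minimal sketch that rebuilds only the `Value` column from the cell above and counts the Python type of each entry; the names `values` and `type_counts` are chosen here for illustration.

```python
import pandas as pd

# The same values as the Value column above
values = pd.Series([10, 15, 20, 3, 4, 5, 'Brown', 'Pink', 'Red'])

# Map each entry to the name of its Python type, then count the names
type_counts = values.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

A column that reports more than one type here is a hint that it may be holding more than one variable.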

### The fix
This dataset has three variables in a single column. You can think of it as 'overly melted'. Pivoting it with the **`pivot`** method will make it tidy.

In [None]:
df.pivot(index='State', columns='Info', values='Value')

In [None]:
df_tidy = df.pivot(index='State', columns='Info', values='Value').reset_index()
df_tidy = df_tidy.rename_axis(None, axis='columns')
df_tidy
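
To see why this shape can be called 'overly melted', it may help to go in the other direction. The sketch below starts from a small tidy frame with made-up values, melts it into the long shape, and then pivots it back. The names `tidy`, `long`, and `wide` are illustrative only.

```python
import pandas as pd

# A small tidy frame (values are made up for illustration)
tidy = pd.DataFrame({'State': ['Texas', 'Arizona'],
                     'Age': [10, 15],
                     'Salary': [3, 4]})

# Melting stacks the Age and Salary columns into name/value pairs,
# which produces the 'overly melted' shape
long = tidy.melt(id_vars='State', var_name='Info', value_name='Value')

# Pivoting reverses the melt and restores one column per variable
wide = (long.pivot(index='State', columns='Info', values='Value')
            .reset_index()
            .rename_axis(None, axis='columns'))
print(wide)
```

Note that `pivot` sorts the index, so the rows of `wide` may come back in a different order than in `tidy`.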

### Checking data types
Whenever a single column holds a mix of variables, it might also hold a mix of data types. It's important to check the data types after reshaping the data.

In [None]:
df_tidy.dtypes

### Changing data types
Both `Age` and `Salary` should be integers but instead are objects. We need to change their data types. We have seen this previously with the function `pd.to_numeric`, but you can also use the `astype` method. They both do nearly the same thing, except `pd.to_numeric` gives you more options (which were needed in a previous notebook). We will use each here.

In [None]:
df_tidy['Age'] = df_tidy['Age'].astype('int')
df_tidy['Salary'] = pd.to_numeric(df_tidy['Salary'])
df_tidy.dtypes
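
One of the extra options is the `errors` parameter of `pd.to_numeric`. The sketch below uses a hypothetical column, not part of the dataset above, in which one entry cannot be parsed as a number. It shows `astype` raising an error where `pd.to_numeric` with `errors='coerce'` substitutes a missing value.

```python
import pandas as pd

# Hypothetical column in which one entry is not a valid number
raw = pd.Series(['10', '15', 'unknown'], dtype='object')

# astype is strict: a value that cannot be converted raises an error
try:
    raw.astype('int')
    astype_failed = False
except ValueError:
    astype_failed = True

# pd.to_numeric can instead replace unparseable values with NaN
coerced = pd.to_numeric(raw, errors='coerce')
print(astype_failed)
print(coerced)
```

Because NaN is a float, the coerced result is a float column even though the valid entries are whole numbers.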

## Two or more values are stored in the same cell
Two or more values, whether of the same variable or of different variables, can be stored in the same cell of a DataFrame. You will need to extract the desired quantities, which might require regular expressions. Let's take a look at a dataset with multiple variables stored in a single cell.

In [None]:
geo = pd.DataFrame({'City':['Houston', 'Dallas', 'Austin'], 
                   'Geolocation':['(29.7604° N, 95.3698° W)', 
                                  '32.7767° N, 96.7970° W', 
                                  '30.2672° N, 97.7431° W']})
geo

### Identify the Variables
The first step in tidying data is identifying the variables. The `Geolocation` column has quite a lot of information packed into it. We will parse it into 4 separate variables.

* latitude 
* latitude direction
* longitude
* longitude direction

### Extracting information with regular expressions using the `str` accessor

The `extract` string method takes a regular expression with **capture groups** and returns each captured group as a new column.

Our regular expression has four capture groups, one for each variable.

```
([0-9.]+).*?([NS]).*?([0-9.]+).*?([EW])
```

### My explanation

* `([0-9.]+)` - This is a capture group that matches one or more of the digits 0-9 and the literal `.`
* `.*?([NS])` - Matches any number of characters in a non-greedy fashion before capturing N or S.
* The pattern then repeats to capture the longitude and its direction.
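
Before applying a pattern to a whole column, it can be useful to try it on one string. This is a small sketch using Python's built-in `re` module, which shares its regular expression syntax with the pandas string methods. The variable names are illustrative.

```python
import re

pattern = r'([0-9.]+).*?([NS]).*?([0-9.]+).*?([EW])'

# Try the pattern on the first Geolocation value, parentheses included
match = re.search(pattern, '(29.7604° N, 95.3698° W)')

# groups() returns the text captured by each of the four groups, in order
parts = match.groups()
print(parts)
```

The leading parenthesis is skipped because `re.search` scans forward until the first capture group can match.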

In [None]:
geo_extract = geo['Geolocation'].str.extract(r'([0-9.]+).*?([NS]).*?([0-9.]+).*?([EW])')
geo_extract

### Column names

By default, pandas labels the columns of the resulting DataFrame with integers. We would like these new columns appended to our original DataFrame under meaningful names.

### Creating multiple new column names

It is possible to create several new columns in our original DataFrame at once: select a list of new column names and assign the extracted DataFrame to that selection.

In [None]:
geo[['latitude', 'latitude direction', 'longitude', 'longitude direction']] = geo_extract
geo
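
An alternative worth knowing is that `str.extract` uses the names of named capture groups as column names, so the naming can happen inside the pattern. The sketch below rebuilds a two-row version of the dataset. Group names cannot contain spaces, so the underscore names used here differ slightly from the column names above; `geo_demo`, `named`, and `geo_named` are illustrative names.

```python
import pandas as pd

geo_demo = pd.DataFrame({'City': ['Houston', 'Dallas'],
                         'Geolocation': ['(29.7604° N, 95.3698° W)',
                                         '32.7767° N, 96.7970° W']})

# (?P<name>...) is a named capture group; the name becomes the column label
pattern = (r'(?P<latitude>[0-9.]+).*?(?P<latitude_direction>[NS])'
           r'.*?(?P<longitude>[0-9.]+).*?(?P<longitude_direction>[EW])')
named = geo_demo['Geolocation'].str.extract(pattern)

# join aligns on the row index and appends the new columns
geo_named = geo_demo.join(named)
print(geo_named)
```

Either approach gives the same values; named groups simply keep the pattern and the column names in one place.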

### Dropping the original column
We can remove the **`Geolocation`** column as we have finished processing it.

In [None]:
geo = geo.drop(columns='Geolocation')
geo

### Check and change Data Types
Both latitude and longitude should be numeric (floats), but because they were extracted from a string, they remain strings. Let's change them to floats.

In [None]:
geo.dtypes

In [None]:
geo['latitude'] = geo['latitude'].astype('float')
geo['longitude'] = geo['longitude'].astype('float')
geo.dtypes

We have now tidied this dataset, which had multiple values stored in a single cell.

In [None]:
geo

## Variables are stored in both rows and columns
A more difficult situation occurs when variables are stored down a column and across the column names. Pivoting and melting may have to be used together to make it tidy. Let's take a look at the example below. 

In [None]:
tfp = pd.read_csv('../data/tidy/temp_flow_pressure.csv')
tfp

### Identifying the Variables
Identifying variables in this dataset is not as straightforward as it is in others. There are variable values stored across multiple rows and variable values stored as column names.

We can use the following as variables:

* Group
* Pressure
* Temperature
* Flow
* Years

The years are column names, and pressure, temperature, and flow are values in the `Property` column. The `Group` column is the only one in the correct place.

### Melt the years
Tidying this particular dataset must happen in multiple stages. We won't be able to tidy each variable at the same time. We will begin by melting the year column names into a single column.

In [None]:
tfp_melt = tfp.melt(id_vars=['Group', 'Property'], 
                    value_vars=['2012', '2013', '2014', '2015', '2016'],
                    var_name='Year')
tfp_melt.head(10)
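
When every column outside `id_vars` should be melted, `value_vars` can be left out, since `melt` then uses all remaining columns. The sketch below demonstrates this on a small frame with made-up numbers that only imitates the shape of the table above; it assumes the real file has no columns other than the identifiers and the years.

```python
import pandas as pd

# Made-up values; only the layout mirrors the dataset above
demo = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                     'Property': ['Pressure', 'Flow', 'Pressure', 'Flow'],
                     '2012': [928, 567, 933, 575],
                     '2013': [873, 494, 870, 499]})

# Listing the year columns explicitly
explicit = demo.melt(id_vars=['Group', 'Property'],
                     value_vars=['2012', '2013'],
                     var_name='Year')

# Omitting value_vars melts every column not named in id_vars
implicit = demo.melt(id_vars=['Group', 'Property'], var_name='Year')
print(explicit.equals(implicit))
```

Listing the columns explicitly is still the safer choice when the table might contain extra columns that should not be melted.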

### Need to pivot Property
We now need to pivot the **Property** column so that the values become column names, and keep Group and Year as columns. The values will come from the `value` column.

### Problem! Older versions of `pivot` only work with a single column as the index

In pandas versions before 1.1, passing a list to the `index` parameter raises an error, because pandas treats `['Group', 'Year']` as values to pivot rather than as column names. Newer versions accept a list, so whether the cell below fails depends on the version you have installed.

In [None]:
tfp_melt.pivot(index=['Group', 'Year'], columns='Property', values='value')

### Using `pivot_table`

The `pivot_table` method allows us to keep multiple columns in the index in any pandas version. It always aggregates, but because there is only one value per combination of `Group`, `Year`, and `Property`, aggregation functions such as `max` or `min` all return that single value.

In [None]:
tfp_tidy = tfp_melt.pivot_table(index=['Group', 'Year'], columns='Property', 
                                values='value', aggfunc='max')
tfp_tidy
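
A different route to the same shape, shown here only as an alternative and not as the method this notebook uses, is to move the key columns into the index and call `unstack`. The sketch below uses made-up numbers in the same long layout. Unlike `pivot_table`, `unstack` does not aggregate, so a repeated key raises an error instead of being silently combined.

```python
import pandas as pd

# Made-up long-format data in the same layout as tfp_melt
long = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                     'Year': ['2012', '2012', '2012', '2012'],
                     'Property': ['Pressure', 'Flow', 'Pressure', 'Flow'],
                     'value': [928, 567, 933, 575]})

# Put all key columns in the index, then move Property into the columns
wide = (long.set_index(['Group', 'Year', 'Property'])['value']
            .unstack('Property')
            .reset_index()
            .rename_axis(None, axis='columns'))
print(wide)

# Repeat the first row so that one key combination appears twice
dupes = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dupes.set_index(['Group', 'Year', 'Property'])['value'].unstack('Property')
    raised = False
except ValueError:
    raised = True
print(raised)
```

Which approach fits better depends on whether duplicates should be aggregated or treated as a data problem.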

### Verify that there is one value per intersection
Using `size` as the aggregation function counts the values at each intersection of index and column. Every count should be 1.

In [None]:
tfp_melt.pivot_table(index=['Group', 'Year'], columns='Property', 
                     values='value', aggfunc='size')
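
The same check can be made on the long data before pivoting. This is a minimal sketch on made-up numbers that looks for repeated combinations of the key columns with `duplicated` and, equivalently, counts rows per combination with `groupby`. The names `long`, `keys`, `has_duplicates`, and `counts` are illustrative.

```python
import pandas as pd

# Made-up long-format data in the same layout as tfp_melt
long = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                     'Year': ['2012', '2012', '2012', '2012'],
                     'Property': ['Pressure', 'Flow', 'Pressure', 'Flow'],
                     'value': [928, 567, 933, 575]})

keys = ['Group', 'Year', 'Property']

# True if any combination of the key columns appears more than once
has_duplicates = long.duplicated(subset=keys).any()

# Number of rows for each combination of the key columns
counts = long.groupby(keys).size()
print(has_duplicates)
print(counts)
```

If `has_duplicates` were true, the choice of aggregation function in `pivot_table` would start to matter.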

### Clean-up

In [None]:
tfp_tidy = tfp_tidy.reset_index()
tfp_tidy = tfp_tidy.rename_axis(None, axis='columns')
tfp_tidy

In [None]:
tfp_tidy.dtypes

### Convert Year to integer

In [None]:
tfp_tidy['Year'] = tfp_tidy['Year'].astype('int')
tfp_tidy.dtypes

In [None]:
tfp_tidy

## Steps to produce tidy data
No single set of procedures will always produce a tidy dataset, but the following guidelines may help you turn messy data into tidy data.

1. Identify each variable
1. Look for variable values masquerading as column names
1. Look for column names masquerading as variable values
1. Examine the five common types of messy data listed at the top of this notebook to see which one your dataset most closely resembles
1. You will likely need to use `melt`, `pivot`, and `pivot_table`
1. You might need to separate different variables into their own DataFrame to make for easier tidying
1. Parse string data with the `str` accessor with the help of regular expressions.

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Make the `tidy/country_hour_price.csv` dataset tidy by putting all the hour columns into a single column.</span>

### Exercise 2
<span  style="color:green; font-size:16px">If the resulting DataFrame from Exercise 1 has the strings 'HOUR1' and 'HOUR2' as values in the hour column, then extract just the numerical part of the strings and reassign the result to the hour column.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Tidy the `tidy/flights_status.csv` dataset.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Tidy the `tidy/metrics.csv` dataset.</span>