# Lesson 5 Class Exercises: Tidy Data

With these class exercises we learn a few new things.  When new knowledge is introduced you'll see the icon shown on the right: 
<span style="float:right; margin-left:10px; clear:both;">![Task](https://github.com/spficklin/Data-Analytics-With-Python/blob/master/media/new_knowledge.png?raw=true)</span>

## Tidy Summary:
### Rules for Tidy data
+ Each variable forms a unique column in the data frame.
+ Each observation forms a row in the data frame.
+ Each **type** of observational unit needs its own table.

### Spotting messy data
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.

## Get Started
Import the NumPy and pandas packages.
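A minimal sketch of the import step, using the conventional `np` and `pd` aliases that the code later in these exercises relies on:

```python
import numpy as np
import pandas as pd

# Printing the versions is optional, but it helps when comparing results with classmates.
print(np.__version__, pd.__version__)
```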

## Exercise 1:  Review of Tidy Practice
### Task 1: Task 3b from the Practice Notebook
Download the [PI_DataSet.txt](https://hivdb.stanford.edu/download/GenoPhenoDatasets/PI_DataSet.txt) file from [HIV Drug Resistance Database](https://hivdb.stanford.edu/pages/genopheno.dataset.html). Store the file in the same directory as the practice notebook for this assignment.

Here is the meaning of data columns:
- SeqID:  a numeric identifier for a unique HIV isolate protease sequence.  Note: disruption of the protease inhibits HIV’s ability to reproduce.
- The next 8 columns are identifiers for unique protease inhibitor class drugs.
  - The values in these columns are the fold resistance over wild type (the HIV strain susceptible to all drugs).
  - Fold change is the ratio of the drug concentration needed to inhibit the isolate to the concentration needed to inhibit the wild type.
- The latter columns, with P as a prefix, are the positions of the amino acids in the protease. 
  - '-' indicates consensus.
  - '.' indicates no sequence.
  - '#' indicates an insertion. 
  - '~' indicates a deletion.
  - '*' indicates a stop codon.
  - A letter indicates a one-letter amino acid substitution.
  - Two or more amino acid codes indicate a mixture.

Import this dataset into your notebook, view the top few rows of the data and respond to these questions:

What are the variables?

What are the observations?

What are the values?  

What is the observational unit?

What makes this dataset untidy?
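As a starting point for the import step in this task, here is a hedged sketch. It assumes the downloaded file is tab-delimited. So that the block runs without the download, it reads a tiny stand-in whose rows and values are made up and whose name `sample` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for PI_DataSet.txt (values are invented).
sample = (
    "SeqID\tFPV\tATV\tP1\tP2\n"
    "1001\t2.5\tNA\t-\t-\n"
    "1002\t0.7\t1.1\t-\tK\n"
)

# With the real file (assumed tab-delimited): pd.read_csv("PI_DataSet.txt", sep="\t")
pi_data = pd.read_csv(io.StringIO(sample), sep="\t")
print(pi_data.head())
```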

### Task 2: Task 3c from the practice notebook

Using the data retrieved in task 3b, generate a data frame containing a tidied set of values for drug concentration fold change. Be sure to:

- Remove all columns except the SeqID and the protease inhibitors.
- Set the column names as ‘SeqID’, ‘Drug’ and ‘Fold_change’.
- Order the data frame first by sequence ID and then by drug name.
- Reset the row indexes.
- Display the first 10 elements.
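The steps listed above can be sketched on a toy frame. The frame `toy` and its values are hypothetical; only the shape of the operation (select, melt, sort, reset the index) carries over to the real data.

```python
import pandas as pd

# Hypothetical miniature: one ID column, two drug columns, one position column.
toy = pd.DataFrame({
    "SeqID": [2, 1],
    "FPV": [2.5, 0.7],
    "ATV": [None, 1.1],
    "P1": ["-", "-"],
})
drug_cols = ["FPV", "ATV"]

fold = (
    toy[["SeqID"] + drug_cols]                       # keep only the ID and drug columns
    .melt(id_vars="SeqID", var_name="Drug", value_name="Fold_change")
    .sort_values(["SeqID", "Drug"])                  # order by sequence, then drug
    .reset_index(drop=True)                          # renumber the rows 0..n-1
)
print(fold.head(10))
```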

### Task 3: Tidy everything
In Task 2 above we only tidied up the drug fold change. Now let's tidy up the rest of the table.
+ The other observational units are the amino acid sequences and the mutation list. Create a separate tidy table for each unit.
+ For the amino acid position variant table, be sure to remove the 'P' from the amino acid position and order the rows by SeqID and then by position.
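One possible way to strip the 'P' prefix is sketched below on a hypothetical long-format frame; the column names `Position` and `AA` are assumptions. Converting the position to an integer matters because string sorting would place `P10` before `P2`.

```python
import pandas as pd

# Hypothetical long-format positions table (values are invented).
toy = pd.DataFrame({
    "SeqID": [1, 1, 1],
    "Position": ["P10", "P2", "P1"],
    "AA": ["-", "K", "-"],
})

# Drop the leading 'P' and convert to int so that sorting is numeric.
toy["Position"] = toy["Position"].str.replace("^P", "", regex=True).astype(int)
toy = toy.sort_values(["SeqID", "Position"]).reset_index(drop=True)
print(toy)
```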

## Exercise 2:  More Tidy Practice

Let's revisit the weather data from the Tidy paper, which contains the daily weather records for five months in 2010 for the MX17004 weather station in Mexico. Each day of the month has its own column (e.g. d1, d2, d3, etc.).  The example data only provides the first 8 days. Run the following code to get the data into the notebook:
```python
data = [['MX17004',2010,1,'tmax',None,None,None,None,None,None,None,None],
        ['MX17004',2010,1,'tmin',None,None,None,None,None,None,None,None],
        ['MX17004',2010,2,'tmax',None,27.3,24.1,None,None,None,None,None],
        ['MX17004',2010,2,'tmin',None,14.4,14.4,None,None,None,None,None],
        ['MX17004',2010,3,'tmax',None,None,None,None,32.1,None,None,None],
        ['MX17004',2010,3,'tmin',None,None,None,None,14.2,None,None,None],
        ['MX17004',2010,4,'tmax',None,None,None,None,None,None,None,None],
        ['MX17004',2010,4,'tmin',None,None,None,None,None,None,None,None],
        ['MX17004',2010,5,'tmax',None,None,None,None,None,None,None,None],
        ['MX17004',2010,5,'tmin',None,None,None,None,None,None,None,None]]
headers = ['id','year','month','element','d1','d2','d3','d4','d5','d6','d7','d8']
weather = pd.DataFrame(data, columns=headers)
weather
```


What makes this dataset untidy?

The solution for how to tidy this data is in the notebook from Lesson 5. However, we're going to try a slightly different approach. It uses the same steps but in a different order.

First melt the data appropriately to get the day as its own column.  Name the melted dataframe `weather_melted`. Remove the `d` from the beginning of the day and convert it to an integer. Print the first 5 rows:
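A hedged sketch of the melt step on a two-row, two-day excerpt of the table. The names `toy_weather`, `toy_melted`, and the value column name `temp` are assumptions, not names the exercise requires.

```python
import pandas as pd

# Two-row, two-day excerpt of the weather table (February only).
toy_weather = pd.DataFrame(
    [["MX17004", 2010, 2, "tmax", None, 27.3],
     ["MX17004", 2010, 2, "tmin", None, 14.4]],
    columns=["id", "year", "month", "element", "d1", "d2"],
)

# Every column not listed in id_vars is stacked into (day, temp) pairs.
toy_melted = toy_weather.melt(
    id_vars=["id", "year", "month", "element"],
    var_name="day",
    value_name="temp",
)

# "d1" -> 1, "d2" -> 2
toy_melted["day"] = toy_melted["day"].str.replace("d", "", regex=False).astype(int)
print(toy_melted.head())
```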

Now that the day is melted into its own column, pivot the data so that the two variables, tmax and tmin, each get their own column. Name the resulting dataframe `weather_pivoted`.  Print the top few rows.
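The pivot can be sketched with `pivot_table` on a small long-format frame. The names `toy_long`, `toy_wide`, and `temp` are hypothetical. Passing `values` as a list is one way to end up with a two-level column index like the one discussed in the following steps.

```python
import pandas as pd

# Small long-format frame: one row per (day, element) pair.
toy_long = pd.DataFrame({
    "id": ["MX17004"] * 4,
    "year": [2010] * 4,
    "month": [2] * 4,
    "day": [2, 2, 3, 3],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "temp": [27.3, 14.4, 24.1, 14.4],
})

# Each distinct value of "element" becomes its own column.
toy_wide = toy_long.pivot_table(
    index=["id", "year", "month", "day"],
    columns="element",
    values=["temp"],
)
print(toy_wide.head())
```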

Notice that we have multi-level indexing. Reduce this to a typical one-level index using the `reset_index` function.

Notice, however, we still have MultiIndexing on the column.  We can remove this by simply resetting the column names.
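Both clean-up steps can be sketched on a hand-built frame that has a multi-level row index and a two-level column index. The names `wide` and `flat` are hypothetical, and the list comprehension is just one of several ways to rename the columns.

```python
import pandas as pd

# Hand-built frame with a 4-level row index and a 2-level column index.
wide = pd.DataFrame(
    [[27.3, 14.4], [24.1, 14.4]],
    index=pd.MultiIndex.from_tuples(
        [("MX17004", 2010, 2, 2), ("MX17004", 2010, 2, 3)],
        names=["id", "year", "month", "day"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [("temp", "tmax"), ("temp", "tmin")], names=[None, "element"]
    ),
)

# Move the index levels back into ordinary columns.
flat = wide.reset_index()

# Each column label is now a tuple such as ("id", "") or ("temp", "tmax");
# keep the lower level when it is non-empty, otherwise the upper level.
flat.columns = [lower or upper for upper, lower in flat.columns]
print(flat)
```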

<span style="float:right; margin-left:10px; clear:both;">![Task](https://github.com/spficklin/Data-Analytics-With-Python/blob/master/media/new_knowledge.png?raw=true)</span>

Finally, let's convert the year, month and day to a datetime object.  Previously, when we wanted to convert a date stored as a string to a `datetime` object we used the `pd.to_datetime` function. However, our date is spread across three different columns and is not a string. In the Tidy Data lesson we did this using the `datetime` package but it was not well explained. Let's look at this more closely.

The [`datetime` module](https://docs.python.org/3/library/datetime.html) provides a variety of tools for working with dates. The one that will most help us is the `datetime.datetime` constructor.  See [documentation here](https://docs.python.org/3/library/datetime.html#datetime.datetime).  We can use it to create the `datetime` objects that we need. But this is a Python module and not a Pandas module, so it does not accept a Series.  We must therefore use the `apply` function of the Pandas dataframe. Remember that the `apply` function takes the name of a function or a function itself! Review the following code.

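```python
import datetime

def create_date(row):
    # Cast to int: values pulled from a row may arrive as floats or NumPy types.
    return datetime.datetime(year=int(row["year"]), month=int(row["month"]), day=int(row["day"]))

# axis=1 passes each row (as a Series) to create_date.
weather_pivoted["date"] = weather_pivoted.apply(create_date, axis=1)
```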

When the `apply` function was first introduced in the [L04-Pandas_Part2.ipynb Lesson](./L04-Pandas_Part2-Practice.ipynb#4.2-Apply) we supplied function names like `print` or `np.sum`. That worked because, by default, `apply` passes each column to the function (i.e. it works down the rows of each column).  We need to calculate the date, which combines values across columns. We can provide the `axis=1` argument to `apply` so that each row is passed instead, but we only need 3 columns to form a date, and our melted/pivoted dataframe has more than just the 3 date-specific columns.

To solve this challenge, we have to create our own function to give to the `apply` function.  In the code above, the `create_date` function provides this functionality. Here, the function receives a Series object we call `row` and inside the function we call the `datetime.datetime` function and pass in the corresponding values from the row that can be used to make the `datetime` object.

```python
import datetime

def create_date(row):
    # Cast to int so the constructor receives plain integers.
    return datetime.datetime(year=int(row["year"]), month=int(row["month"]), day=int(row["day"]))

weather_pivoted["date"] = weather_pivoted.apply(create_date, axis=1)
weather_pivoted.head()
```
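To try the row-wise approach without the full weather table, here is a self-contained sketch on a hypothetical frame named `toy`. As a cross-check, it also builds the same dates with `pd.to_datetime`, which can assemble dates from a dataframe whose columns are named `year`, `month`, and `day`. That second route is an alternative, not the method this lesson asks for.

```python
import datetime
import pandas as pd

# Hypothetical frame with the three date parts plus one extra column.
toy = pd.DataFrame({
    "year": [2010, 2010],
    "month": [2, 3],
    "day": [2, 5],
    "tmax": [27.3, 32.1],
})

def create_date(row):
    # The float column makes every row a float Series, so the int casts are required here.
    return datetime.datetime(year=int(row["year"]), month=int(row["month"]), day=int(row["day"]))

# Row-wise: create_date is called once per row.
toy["date"] = toy.apply(create_date, axis=1)

# Vectorized cross-check using only the three date columns.
toy["date_vectorized"] = pd.to_datetime(toy[["year", "month", "day"]])
print(toy)
```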

## Exercise 3: More Tidy Practice
Consider the following billboard dataset described in the Tidy paper.  This dataset contains the weekly rank of songs from the moment they enter the Billboard Top 100 to the subsequent 75 weeks.  First load the data. You'll find it in the data directory here:  `../data/billboard.csv`.  Save the data with the name `billboard`. List the top 10 lines:

Do a quick review of the data
+ List the columns.
+ List the data types.
+ Are there missing values?  Should we worry about missing values?
+ Are there duplicates?  Should we worry about any duplicates?
+ What fields are meant to be categorical?  And for those check the categories to make sure there is nothing unexpected there.
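The review checklist above maps onto a handful of pandas calls. The sketch below runs them on a hypothetical frame `toy`; its column names and values are invented and are not taken from `billboard.csv`.

```python
import pandas as pd

# Hypothetical frame with a repeated row and some missing values.
toy = pd.DataFrame({
    "artist": ["A", "B", "B"],
    "genre": ["Rock", "Rap", "Rap"],
    "wk1": [87, 91, 91],
    "wk2": [82.0, None, None],
})

print(toy.columns.tolist())                    # column names
print(toy.dtypes)                              # data type of each column
missing_per_column = toy.isna().sum()          # count of missing values per column
n_duplicates = toy.duplicated().sum()          # rows identical to an earlier row
genre_counts = toy["genre"].value_counts()     # categories and how often each occurs
print(missing_per_column, n_duplicates, genre_counts, sep="\n")
```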

What makes this data untidy?

Let's tidy this data into a variable named `billboard_tidy`.

Perform the following:
1. Remove rows with missing values
2. Convert the week to an actual number
3. Convert the rank column to an integer

Next, calculate the actual date for the rank.  We have the date entered, so we just need to add the elapsed time (the number of weeks, converted to days) to the date entered to get the actual date for the rank. We haven't learned all of the date time functions, but here are some hints:

- `pd.to_timedelta`: converts a number (or a Series of numbers) into a time difference, expressed in a chosen unit (e.g. days, hours, minutes, seconds)
- `pd.DateOffset`: 
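A hedged sketch of how these two hints can be used. The frame `toy` and its column names are hypothetical, and it assumes that week 1 falls on the date entered, so the offset is `week - 1` weeks.

```python
import pandas as pd

# Hypothetical frame: the date a song entered the chart and a week number.
toy = pd.DataFrame({
    "date_entered": pd.to_datetime(["2000-02-26", "2000-02-26"]),
    "week": [1, 3],
})

# Convert elapsed weeks to days, then to a time difference that can be added to a date.
elapsed = pd.to_timedelta((toy["week"] - 1) * 7, unit="D")
toy["date"] = toy["date_entered"] + elapsed
print(toy)

# pd.DateOffset shifts a single timestamp by calendar units.
shifted = pd.Timestamp("2000-02-26") + pd.DateOffset(weeks=2)
print(shifted)
```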
