# Data Wrangling

Here are some fun facts about your Python skills after 1.5 weeks:

+ You can create and manipulate a Jupyter notebook.
+ You can download a CSV file to your computer and upload it to JupyterHub.
+ You can read a CSV file to create a `DataFrame`.
+ You can call `head` to get a new shorter `DataFrame` with just the top few rows.
+ You can extract a column from a `DataFrame`, and you're aware that the column type is `Series`.
+ You can, given an example, figure out how to use a Python dictionary to create a `DataFrame`.
+ You can create a list of values in Python and name it using an assignment statement.
+ You can calculate the average of a `Series` or a list, and can find the maximum and minimum values using built-in functions.
+ You can name any data you want using variables.
+ You can print any data you have.

# What's next?

__Subsetting__:

+ Create a new `DataFrame` containing a subset of the columns.
+ Create a new `DataFrame` containing a subset of the rows
  - by index, using slicing
  - by looking for rows that match our criteria
+ Renaming columns

# PanTHERIA ecological dataset

Dataset: PanTHERIA

A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature. It also includes spatial databases of mammalian geographic ranges and global climatic and anthropogenic variables.

We have saved a copy of the data locally in file [pantheria.txt](pantheria.txt). This uses tabs instead of commas to separate data. When we call `read_csv`, we need to warn it by passing in `'\t'` as the separator. That's what `sep='\t'` does.
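
To see what the separator changes without needing the real file, here is a minimal sketch that reads a tiny made-up tab-separated text from memory. The names `tsv_text`, `tab_df`, and `comma_df` and the values are hypothetical, not taken from PanTHERIA.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk.
tsv_text = "name\tmass_g\nwolf\t30000\nmouse\t20\n"

# With sep='\t', each tab starts a new column.
tab_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# With the default comma separator, each whole line lands in a single column.
comma_df = pd.read_csv(io.StringIO(tsv_text))

print(tab_df.shape)    # (2, 2)
print(comma_df.shape)  # (2, 1)
```

If a freshly loaded `DataFrame` has only one very wide column, a wrong separator is a likely cause.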

Let's start exploring!

In [None]:
import pandas as pd
pantheria_df = pd.read_csv('pantheria.txt', sep='\t')
pantheria_df.head()

Hmm, those columns are sometimes a bit cryptic, but mostly they seem pretty good.

Let's look at all of the column names:

In [None]:
pantheria_df.columns

Well, that's a lot. At least it's not like the dataset in the Social Sciences version of this course, GGR274, which has 350 columns.

Let's make a smaller `DataFrame` with only the columns we care about for this lecture. Here they are:

- `MSW05_Order` (higher taxonomic grouping)
- `MSW05_Genus` (name of genus)
- `MSW05_Species` (name of species)
- `5-1_AdultBodyMass_g` (average body mass within species)
- `24-1_TeatNumber` (number of teats)
- `6-2_TrophicLevel` (herbivore vs carnivore, etc.)
- `25-1_WeaningAge_d` (average age when infants are weaned)
- `15-1_LitterSize` (how many offspring are birthed in each litter)

In [None]:
pantheria_df['MSW05_Species'].head()

Let's put all these column names in a list that we'll name `important_columns`. (Ooh, what's up with that indentation?)

In [None]:
important_columns = ["MSW05_Order", "MSW05_Genus", "MSW05_Species",
                     "5-1_AdultBodyMass_g", "24-1_TeatNumber", "6-2_TrophicLevel",
                     "25-1_WeaningAge_d","15-1_LitterSize"]

Just like we can extract a single column to get a `Series`, we can extract a subset of the columns to get a new `DataFrame`:

In [None]:
sub_pantheria_df = pantheria_df[important_columns]
sub_pantheria_df.head()
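
The difference between the two kinds of extraction is easier to see on a small table. This is a sketch on a made-up `toy_df` (hypothetical names and values, not PanTHERIA data): a single string inside the brackets gives a `Series`, while a list of strings gives a `DataFrame`.

```python
import pandas as pd

# Hypothetical toy data, not taken from PanTHERIA.
toy_df = pd.DataFrame({
    'genus': ['Canis', 'Felis', 'Mus'],
    'species': ['lupus', 'catus', 'musculus'],
    'mass_g': [30000.0, 4000.0, 20.0],
})

one_column = toy_df['genus']               # a string selects one column
two_columns = toy_df[['genus', 'mass_g']]  # a list selects several columns

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(list(two_columns.columns))   # ['genus', 'mass_g']
```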

### Let's remove all that MSW and number stuff from the names

Here's a Python dictionary mapping icky names to better names:

In [None]:
column_names = {
    'MSW05_Order': 'order',
    'MSW05_Genus': 'genus',
    'MSW05_Species': 'species',
    '5-1_AdultBodyMass_g': 'body_mass_g',
    '24-1_TeatNumber': 'teat_number',
    '6-2_TrophicLevel': 'trophic_level',
    '25-1_WeaningAge_d': 'weaning_age_d',
    '15-1_LitterSize': 'litter_size'
}

In [None]:
column_names['6-2_TrophicLevel']

### Renaming columns!

This code makes a new `DataFrame` with the column names replaced. (The `columns=` part is required: it tells `rename` to apply the dictionary to the column names rather than the row labels.)

In [None]:
clean_pantheria_df = sub_pantheria_df.rename(columns=column_names)
clean_pantheria_df.head()
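
Here is a small sketch of what `rename` does and does not do, using a made-up one-row table (`toy_df` and `toy_names` are hypothetical names). It shows that the original `DataFrame` keeps its old column names, and that leaving out `columns=` leaves the columns untouched, because the dictionary is then matched against the row labels.

```python
import pandas as pd

# Hypothetical one-row table that reuses two of the original column names.
toy_df = pd.DataFrame({'MSW05_Genus': ['Canis'], '5-1_AdultBodyMass_g': [30000.0]})
toy_names = {'MSW05_Genus': 'genus', '5-1_AdultBodyMass_g': 'body_mass_g'}

# With columns=, the dictionary is applied to the column names.
renamed_df = toy_df.rename(columns=toy_names)

# Without columns=, the dictionary is matched against the row labels,
# so nothing matches and the column names stay as they were.
no_keyword_df = toy_df.rename(toy_names)

print(list(renamed_df.columns))     # ['genus', 'body_mass_g']
print(list(toy_df.columns))         # ['MSW05_Genus', '5-1_AdultBodyMass_g']
print(list(no_keyword_df.columns))  # ['MSW05_Genus', '5-1_AdultBodyMass_g']
```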

# Subsetting rows

There are two ways to subset by row:

+ By row index
+ By describing the columns to look in and the values to select for

## By row index: slicing

Every `DataFrame` has an attribute called `iloc` that selects by integer position. Inside the square brackets, give the row range, then the column range:

In [None]:
clean_pantheria_df.iloc[0:3, 1:4] # Rows 0, 1, and 2; columns 1, 2, and 3

As with list slicing, the second index is not included in the result.

## By using the `DataFrame` row labels

Every `DataFrame` also has an attribute called `loc` that uses the labels, not the integer positions. Watch that second index!

(The row labels are integers in `clean_pantheria_df`.)

In [None]:
clean_pantheria_df.loc[0:3, 'order':'species']

### `iloc` vs `loc`

+ With `iloc`, the second part of a range is not included.
+ With `loc`, the second part of a range __is__ included.
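
A quick way to check this rule is to count the rows each one returns. The sketch below uses a made-up five-row `toy_df` (hypothetical names and values) whose row labels are the default integers 0 to 4.

```python
import pandas as pd

# Hypothetical toy data with default integer row labels 0..4.
toy_df = pd.DataFrame({
    'a': [10, 11, 12, 13, 14],
    'b': [20, 21, 22, 23, 24],
    'c': [30, 31, 32, 33, 34],
})

by_position = toy_df.iloc[0:3, 0:2]  # rows 0, 1, 2; columns a, b
by_label = toy_df.loc[0:3, 'a':'b']  # rows 0, 1, 2, 3; columns a, b

print(by_position.shape)  # (3, 2)
print(by_label.shape)     # (4, 2)
```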

# In which rows is the genus `Canis`?

This gets a `Series` of Boolean values. Because `=` is used for assignment, Python uses `==` for equality.

In [None]:
clean_pantheria_df['genus'] == 'Canis'

Ooh! That's fun. We can give this to `loc` as the row specification.
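
As a minimal sketch of that idea on made-up data (`toy_df`, `is_canis`, and `canis_df` are hypothetical names), `loc` keeps exactly the rows where the Boolean `Series` is `True`, and the `:` asks for all columns:

```python
import pandas as pd

# Hypothetical toy data, not taken from PanTHERIA.
toy_df = pd.DataFrame({
    'genus': ['Canis', 'Felis', 'Canis', 'Mus'],
    'species': ['lupus', 'catus', 'latrans', 'musculus'],
})

is_canis = toy_df['genus'] == 'Canis'  # one True/False value per row
canis_df = toy_df.loc[is_canis, :]     # keep the True rows, all columns

print(is_canis.tolist())        # [True, False, True, False]
print(canis_df.index.tolist())  # [0, 2]
```

Note that the selected rows keep their original row labels (0 and 2 here) rather than being renumbered.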

# Practice

A reminder of `clean_pantheria_df` info:

In [None]:
clean_pantheria_df.head()

How do you get a `DataFrame` for rows where `teat_number` is `8.0`?

In [None]:
# Get the True/False Series from the column.
teat_series = clean_pantheria_df['teat_number'] == 8.0
teats_8_df = clean_pantheria_df.loc[teat_series, :]
teats_8_df.head()

How do you get a `DataFrame` for rows where `teat_number` is `8.0` or greater, but with only the `body_mass_g`, `teat_number`, and `trophic_level` columns?

In [None]:
teat_series = clean_pantheria_df['teat_number'] >= 8.0
teats_8_df = clean_pantheria_df.loc[teat_series, 'body_mass_g':'trophic_level']
teats_8_df.head()

How do you get a `DataFrame` with only the `genus`, `species`, and `body_mass_g` columns for every row _except_ those where the `genus` column is `Canis`? (Hint: `!=` means "not equal to", the opposite of `==`.)

In [None]:
genus_series = clean_pantheria_df['genus'] != 'Canis'
non_canis_df = clean_pantheria_df.loc[genus_series, 'genus':'body_mass_g']
non_canis_df.head()