# More Pandas

In [None]:
import numpy as np
import pandas as pd
import requests
from matplotlib import pyplot as plt

%matplotlib inline

# These next lines ensure that the notebook
# stays current with respect to active .py files.
# See here:
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html

%load_ext autoreload
%autoreload 2

![panda](http://res.freestockphotos.biz/thumbs/3/3173-illustration-of-a-giant-panda-eating-bamboo-th.png)

## Scenario

You have decided to start your own animal shelter, and you want to get an idea of what that will entail and gather information for planning. In this lecture, we'll look at a real data set collected by Austin Animal Center. The code below will return the last 1000 animal outcomes that have occurred. We will use our `pandas` skills from the last lecture and learn some new ones in order to explore these data further.



## Agenda

SWBAT:

- Use `.map()`, `.apply()`, and `.applymap()` from the `pandas` library
- Use `np.where()` and `np.select()` from the `numpy` library
- Use lambda functions in coordination with the above functions
- Explain what a groupby object is and split a DataFrame using `.groupby()`

## Getting started: Exploratory Data Analysis (EDA)

Let's take a moment to download and to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). 

We can also read the data right off the web without downloading it, as we do below.

Let's take a look at the data:

In [None]:
url = 'https://data.austintexas.gov/resource/9t4d-g238.json'
response = requests.get(url)
# Stop early with a clear error if the request failed.
response.raise_for_status()
animals = pd.DataFrame(response.json())
animals.head()

In [None]:
animals.info()

One way to become familiar with your data is to start asking questions. In your EDA notebooks, **markdown** will be especially helpful in tracking these questions and your methods of answering the questions.  

For example, a simple first question we might ask, after being presented with the above dataset, would be:

### What is the most commonly adopted animal type in the dataset?

We can then begin thinking about what parts of the DataFrame we need to answer the question.

    What features do we need?
     - "animal_type"
    What type of logic and calculation do we perform?
     - Let's use `.value_counts()` to count the different animal types
    What type of visualization would help us answer the question?
     - A bar chart would be good for this purpose

In [None]:
animals['animal_type'].value_counts()

In [None]:
fig, ax = plt.subplots()

# Compute the counts once and reuse them for the labels and the bar widths.
type_counts = animals['animal_type'].value_counts()
ax.barh(type_counts.index, width=type_counts)
ax.set_xlabel('count');

Questions lead to other questions. For the above example, the visualization raises the question, what "Other" animals are being adopted?

To find out, we need to know where the type of animal for "Other" is encoded.   
    
    What features do we need to find the most common kind of animal within the "Other" category?
     - "breed"

In [None]:
animals[animals['animal_type'] == 'Other']['breed'].value_counts()
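
The filter-then-count pattern above can be sketched on a small hand-made DataFrame. The `toy` data here are invented for illustration and are not drawn from the Austin data set. Using `.loc` with a boolean mask selects the rows and the column in a single step:

```python
import pandas as pd

# Invented illustration data; not from the Austin Animal Center data set.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Other', 'Cat', 'Other', 'Other'],
    'breed': ['Pit Bull', 'Bat', 'Domestic Shorthair', 'Bat', 'Raccoon'],
})

# Boolean mask: True for the rows whose animal_type is 'Other'.
is_other = toy['animal_type'] == 'Other'

# Select the 'breed' column for those rows, then count each breed.
other_breeds = toy.loc[is_other, 'breed'].value_counts()
print(other_breeds)
```

Chained indexing, as in the cell above, gives the same counts; `.loc` simply makes the row and column selection explicit.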

### Quick Exploration

Here are some good EDA steps *whatever* your dataset:

In [None]:
# Use info to check nulls, datatypes, and shape

animals.info()

In [None]:
# Use describe to gain a bit more detail about certain features.

animals.describe()

In [None]:
# Use value counts to check a categorical feature's distribution

animals['outcome_type'].value_counts()

## `pandas`'s `.apply()`, `Series.map()`, and `df.applymap()` vs. `numpy`'s `.where()` and `.select()`

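`.apply()`, `.map()`, and `.applymap()` are `pandas`-native methods for applying transformations to a Series or DataFrame. `np.where()` and `np.select()` are `numpy` functions that can often do the same job.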

### `.applymap()`

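`.applymap()` is used to effect changes in *all* the values of a DataFrame. (As of `pandas` 2.1, `.applymap()` is deprecated in favor of `DataFrame.map()`, which does the same thing.)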

In [None]:
# This line will apply the base `type()` function to 
# all entries of the DataFrame.

animals.applymap(type)
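
As a minimal sketch of element-wise application that runs on both older and newer `pandas` versions, the snippet below picks `DataFrame.map()` when it exists and falls back to `.applymap()` otherwise. The `toy` DataFrame and the `elementwise` name are made up for this example.

```python
import pandas as pd

# Invented illustration data.
toy = pd.DataFrame({'name': ['Rex', 'Tom'], 'age_days': [730, 90]})

# DataFrame.map exists in pandas 2.1 and later; older versions only
# have applymap. Both call the function once per cell.
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap

# Apply the built-in type() to every cell of the DataFrame.
cell_types = elementwise(type)
print(cell_types)
```

The result has the same shape, index, and columns as the input, with each cell replaced by the function's return value.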

### `Series.map()` and `.apply()`

The `.map()` method takes a function as input that it will then apply to every entry in the Series. The `.apply()` method is similar.

In [None]:
animals['age_upon_outcome'].value_counts()

In [None]:
def young(age):
    if age == '3 days':
        return 'less than 1 week'
    else:
        return age

In [None]:
animals['new_age1'] = animals['age_upon_outcome'].map(young)
animals['new_age1']
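
To make the comparison between `.map()` and `.apply()` concrete, here is a small sketch on an invented Series. It also shows that `.map()` accepts a dictionary in place of a function; keys missing from the dictionary become `NaN`, so the example fills those back in from the original values.

```python
import pandas as pd

# Invented illustration data.
ages = pd.Series(['3 days', '2 years', '3 days', '1 month'])

def young(age):
    return 'less than 1 week' if age == '3 days' else age

# For an element-wise function, Series.map and Series.apply agree.
via_map = ages.map(young)
via_apply = ages.apply(young)

# Dictionary form: unmatched entries become NaN, so restore them
# from the original Series afterwards.
via_dict = ages.map({'3 days': 'less than 1 week'}).fillna(ages)

print((via_map == via_apply).all(), (via_map == via_dict).all())
```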

### Slower Than `numpy`

In general, `np.where()` and `np.select()` are faster, because they operate on whole arrays at once rather than calling a Python function on each entry:

In [None]:
animals['new_age2'] = np.where(animals['age_upon_outcome'] == '3 days',
                              'less than 1 week', animals['age_upon_outcome'])
animals['new_age2']

In [None]:
(animals['new_age1'] != animals['new_age2']).sum()

In [None]:
%timeit animals['new_age1'] = animals['age_upon_outcome'].map(young)

In [None]:
%timeit animals['new_age2'] = np.where(animals['age_upon_outcome'] == '3 days',\
                                'less than 1 week', animals['age_upon_outcome'])
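
For reference, `np.where(condition, x, y)` picks from `x` where the condition is true and from `y` elsewhere. A minimal sketch on an invented Series:

```python
import numpy as np
import pandas as pd

# Invented illustration data.
ages = pd.Series(['3 days', '2 years', '3 days'])

# Element by element: take 'less than 1 week' where the condition
# holds, otherwise keep the original value.
relabeled = np.where(ages == '3 days', 'less than 1 week', ages)
print(relabeled)
```

Note that `np.where()` returns a `numpy` array rather than a Series; assigning it to a DataFrame column, as in the cells above, lines the values up by position.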

### More Sophisticated Mapping

Let's use `.map()` to turn `sex_upon_outcome` into a three-way category: male, female, or unknown.

First, explore the unique values:

In [None]:
animals['sex_upon_outcome'].unique()

In [None]:
def sex_mapper(status):
    if status in ['Neutered Male', 'Intact Male']:
        return 'Male'
    elif status in ['Spayed Female', 'Intact Female']:
        return 'Female'
    else:
        return 'Unknown'

In [None]:
animals['new_sex1'] = animals['sex_upon_outcome'].map(sex_mapper)
animals['new_sex1']

Again, `numpy` will be faster:

In [None]:
conditions = [animals['sex_upon_outcome'] == 'Neutered Male',
             animals['sex_upon_outcome'] == 'Intact Male',
             animals['sex_upon_outcome'] == 'Spayed Female',
             animals['sex_upon_outcome'] == 'Intact Female',
             animals['sex_upon_outcome'] == 'Unknown',
             animals['sex_upon_outcome'] == 'NULL']

choices = ['Male', 'Male', 'Female', 'Female', 'Unknown', 'Unknown']

In [None]:
# Rows matching none of the conditions (e.g. missing values) get the default.
animals['new_sex2'] = np.select(conditions, choices, default='Unknown')
animals['new_sex2']

In [None]:
(animals['new_sex1'] != animals['new_sex2']).sum()

In [None]:
%timeit animals['new_sex1'] = animals['sex_upon_outcome'].map(sex_mapper)

In [None]:
%timeit animals['new_sex2'] = np.select(conditions, choices, default='Unknown')
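
A compact variant of the `np.select()` approach, sketched on invented data: `.isin()` collapses each pair of conditions into one, and the `default` argument plays the role of the `else` branch in `sex_mapper`. The `status` Series is made up for this example.

```python
import numpy as np
import pandas as pd

# Invented illustration data, including a missing value.
status = pd.Series(['Neutered Male', 'Spayed Female', 'Unknown',
                    None, 'Intact Female'])

# One condition per output category; .isin() checks several labels at once.
conditions = [status.isin(['Neutered Male', 'Intact Male']),
              status.isin(['Spayed Female', 'Intact Female'])]
choices = ['Male', 'Female']

# Anything that matches no condition falls through to the default.
sex = np.select(conditions, choices, default='Unknown')
print(sex)
```

Without an explicit `default`, `np.select()` fills unmatched rows with `0`, which does not mix well with string choices.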

### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

Let's use a lambda function to get rid of 'Other' in the 'animal_type' column.

In [None]:
animals[animals['animal_type'] == 'Other']

In [None]:
animals['animal_type'].map(lambda x: np.nan if x == 'Other' else x)[[0, 15, 53]]
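
To see that a lambda is just an unnamed function, the sketch below writes the same transformation both ways on an invented Series. The name `other_to_nan` is made up for this example.

```python
import numpy as np
import pandas as pd

# Invented illustration data.
types = pd.Series(['Dog', 'Other', 'Cat'])

# Named version of the function.
def other_to_nan(x):
    return np.nan if x == 'Other' else x

named = types.map(other_to_nan)

# Anonymous version, defined right in the call.
anonymous = types.map(lambda x: np.nan if x == 'Other' else x)

print(named.isna().tolist(), anonymous.isna().tolist())
```

A lambda is convenient for one-off, single-expression logic; anything longer is usually easier to read and test as a named function.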

## Methods for Re-Organizing DataFrames: `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. (And if you haven't, you'll see it very soon!) Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.

It is most useful when we have numeric types that can be aggregated, so let's give ourselves a numeric type by turning "age_upon_outcome" into a number of days.

In [None]:
animals['age_split'] = animals['age_upon_outcome'].str.split(" ")
animals['age_split']

To bring singular units such as "1 year" in line with the plural forms used elsewhere, we'll pluralize them, and do the same for the other time increments.

In [None]:
def pluralize(x):
    if x[-1][-1] != 's':
        return [x[0], x[-1] + 's']
    else:
        return x
    
animals['age_split'] = animals['age_split'].map(pluralize)

In [None]:
animals['age_split']

In [None]:
def count_days(x):
    """
    This function will convert ages into numbers of days.
    """
    if x[-1] == 'days':
        return int(x[0])
    elif x[-1] == 'weeks':
        return int(x[0]) * 7
    elif x[-1] == 'months':
        return int(x[0]) * 30
    elif x[-1] == 'years':
        return int(x[0]) * 365
    else:
        return np.nan

In [None]:
animals['age_days'] = animals['age_split'].map(count_days).astype(float)
animals['age_days']
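
As an alternative to the split, pluralize, and count steps above, the conversion can be sketched with a regular expression and a lookup table. This is one possible approach, not the method used in the rest of this lecture; the `DAYS_PER_UNIT` table is a hypothetical helper, it uses the same 30-day month and 365-day year approximations as `count_days`, and the sample ages are invented.

```python
import numpy as np
import pandas as pd

# Invented illustration data, including one unparseable entry.
ages = pd.Series(['3 days', '1 week', '2 months', '1 year', 'NULL'])

# Hypothetical lookup table with the same approximations as count_days.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

# Capture the number and the singular unit; a trailing 's' is optional.
# Rows that do not match the pattern come back as NaN in both columns.
parts = ages.str.extract(r'^(-?\d+)\s+(day|week|month|year)s?$')

# Multiply the count by the number of days in its unit.
age_days = parts[0].astype(float) * parts[1].map(DAYS_PER_UNIT)
print(age_days)
```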

In [None]:
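# numeric_only=True keeps text columns out of the mean (recent pandas errors otherwise).
animals.groupby('animal_type').mean(numeric_only=True)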
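
A grouped aggregation splits the rows by key, applies the aggregate to each piece, and combines the results into one object indexed by the keys. A minimal sketch on invented data:

```python
import pandas as pd

# Invented illustration data.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog', 'Cat'],
    'name': ['Rex', 'Tom', 'Fido', 'Kit'],
    'age_days': [730.0, 90.0, 365.0, 30.0],
})

# Mean of a single column per group: returns a Series.
mean_age = toy.groupby('animal_type')['age_days'].mean()

# Mean of every numeric column per group: returns a DataFrame.
# numeric_only=True leaves out the text column 'name'.
numeric_means = toy.groupby('animal_type').mean(numeric_only=True)

print(mean_age)
```

Selecting the column before aggregating, as in `mean_age`, avoids computing statistics for columns you do not need.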

Calling `.groupby()` without an aggregation returns a [DataFrameGroupBy](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) object:

In [None]:
animals.groupby(['animal_type', 'outcome_type'])

#### .groups and .get_group()

In [None]:
# This returns each group indexed by the group name,
# along with the row indices of each value.

animals.groupby('animal_type').groups

Once we know what type of object we are working with, a suite of attributes and methods opens up. One attribute we can look at is `.groups`, used above; the `.get_group()` method returns the rows belonging to a single group.

In [None]:
animals.groupby('animal_type').get_group('Dog')
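
The relationship between `.groups`, `.get_group()`, and iteration over a groupby object can be sketched on a small invented DataFrame:

```python
import pandas as pd

# Invented illustration data.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog'],
    'name': ['Rex', 'Tom', 'Fido'],
})

grouped = toy.groupby('animal_type')

# .groups maps each group name to the row labels in that group.
dog_rows = list(grouped.groups['Dog'])

# .get_group() returns those rows as an ordinary DataFrame.
dogs = grouped.get_group('Dog')

# Iterating yields (group name, sub-DataFrame) pairs.
sizes = {name: len(frame) for name, frame in grouped}

print(dog_rows, sizes)
```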

We can also group by multiple columns, which likewise returns a DataFrameGroupBy object:

In [None]:
animals.groupby(['animal_type', 'outcome_type'])

In [None]:
animals.groupby(['animal_type', 'outcome_type']).groups.keys()

#### Aggregating

In [None]:
# Just as with single-column groups, we can aggregate over groups
# defined by multiple columns. numeric_only=True skips the text columns.

animals.groupby(['animal_type', 'outcome_type']).mean(numeric_only=True)

In [None]:
# We can then get a specific group, such as cats that were adopted

animals.groupby(['animal_type', 'outcome_type']).get_group(('Cat', 'Adoption'))
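
When grouping by several columns, each group is identified by a tuple of key values, and `.agg()` can compute more than one statistic at a time. A minimal sketch on invented data:

```python
import pandas as pd

# Invented illustration data.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Dog', 'Cat', 'Cat', 'Cat'],
    'outcome_type': ['Adoption', 'Transfer', 'Adoption', 'Adoption', 'Transfer'],
    'age_days': [730.0, 365.0, 90.0, 30.0, 60.0],
})

grouped = toy.groupby(['animal_type', 'outcome_type'])

# Several aggregates at once; the result has a two-level row index.
summary = grouped['age_days'].agg(['mean', 'count'])

# A single group is addressed by a tuple, one value per grouping column.
adopted_cats = grouped.get_group(('Cat', 'Adoption'))

print(summary)
```

Rows of `summary` can be looked up with the same tuples, for example `summary.loc[('Cat', 'Adoption')]`.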