<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#More-Pandas" data-toc-modified-id="More-Pandas-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>More Pandas</a></span><ul class="toc-item"><li><span><a href="#Loading-the-Data" data-toc-modified-id="Loading-the-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Loading the Data</a></span></li></ul></li><li><span><a href="#Exploratory-Data-Analysis-(EDA)" data-toc-modified-id="Exploratory-Data-Analysis-(EDA)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploratory Data Analysis (EDA)</a></span><ul class="toc-item"><li><span><a href="#Inspecting-the-Data" data-toc-modified-id="Inspecting-the-Data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Inspecting the Data</a></span></li><li><span><a href="#Question-1:-What-animal-types-are-in-the-dataset?" data-toc-modified-id="Question-1:-What-animal-types-are-in-the-dataset?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Question 1: What animal types are in the dataset?</a></span></li><li><span><a href="#Question-2:-What-&quot;Other&quot;-animals-are-in-the-dataset?" data-toc-modified-id="Question-2:-What-&quot;Other&quot;-animals-are-in-the-dataset?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Question 2: What "Other" animals are in the dataset?</a></span></li><li><span><a href="#Question-3:-How-old-are-the-animals-in-our-dataset?" 
data-toc-modified-id="Question-3:-How-old-are-the-animals-in-our-dataset?-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Question 3: How old are the animals in our dataset?</a></span><ul class="toc-item"><li><span><a href="#Series.map()-and-Series.apply()" data-toc-modified-id="Series.map()-and-Series.apply()-3.4.1"><span class="toc-item-num">3.4.1&nbsp;&nbsp;</span><code>Series.map()</code> and <code>Series.apply()</code></a></span></li><li><span><a href="#Slower-Than-numpy" data-toc-modified-id="Slower-Than-numpy-3.4.2"><span class="toc-item-num">3.4.2&nbsp;&nbsp;</span>Slower Than <code>numpy</code></a></span></li><li><span><a href="#More-Sophisticated-Mapping" data-toc-modified-id="More-Sophisticated-Mapping-3.4.3"><span class="toc-item-num">3.4.3&nbsp;&nbsp;</span>More Sophisticated Mapping</a></span></li><li><span><a href="#Lambda-Functions" data-toc-modified-id="Lambda-Functions-3.4.4"><span class="toc-item-num">3.4.4&nbsp;&nbsp;</span>Lambda Functions</a></span></li></ul></li></ul></li><li><span><a href="#Level-Up:-.applymap()" data-toc-modified-id="Level-Up:-.applymap()-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Level Up: <code>.applymap()</code></a></span></li></ul></div>

![panda](http://res.freestockphotos.biz/thumbs/3/3173-illustration-of-a-giant-panda-eating-bamboo-th.png)

In [None]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

# Objectives

- Use lambda functions and DataFrame methods to transform data
- Handle missing data
- Perform one-hot encoding on categorical columns of a DataFrame

# More Pandas

Suppose you were interested in opening an animal shelter. To inform your planning, it would be useful to analyze data from other shelters to understand their operations. In this lecture, we'll analyze animal outcome data from the Austin Animal Center.  

## Loading the Data

Let's take a moment to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). 

We can also ingest the data directly from the web. The code below loads JSON data for the last 1000 animals to leave the center from this [JSON file](https://data.austintexas.gov/resource/9t4d-g238.json).

In [None]:
json_url = 'https://data.austintexas.gov/resource/9t4d-g238.json'
animals = pd.read_json(json_url)
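
To experiment without network access, the same call can be tried on an in-memory JSON string. This is a minimal sketch: the three records below are made up for illustration and only mimic the column names used later in this lesson (`animal_type`, `breed`, `age_upon_outcome`).

```python
from io import StringIO

import pandas as pd

# Hypothetical records shaped like the API response: a JSON array of objects.
sample_json = """
[
  {"animal_type": "Dog", "breed": "Beagle Mix", "age_upon_outcome": "2 years"},
  {"animal_type": "Cat", "breed": "Domestic Shorthair", "age_upon_outcome": "3 months"},
  {"animal_type": "Other", "breed": "Bat", "age_upon_outcome": "1 year"}
]
"""

# pd.read_json accepts a URL, a file path, or a file-like object.
sample_animals = pd.read_json(StringIO(sample_json))

print(sample_animals.shape)
print(sample_animals.columns.tolist())
```

Each JSON object becomes one row, and each key becomes a column.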

# Exploratory Data Analysis (EDA)

Exploring a new dataset is essential for understanding what it contains. This will generate ideas for processing the data and questions to try to answer in further analysis.

## Inspecting the Data

Let's take a look at a few rows of data.

In [None]:
animals.head()

The `.info()` and `.describe()` methods provide a useful overview of the data.

In [None]:
animals.info()

In [None]:
animals.describe()
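
When a DataFrame mixes numeric and text columns, `.describe()` summarizes only the numeric ones by default. The sketch below uses a small made-up DataFrame (the `weight_lb` column is hypothetical and not part of the shelter data) to show how `exclude="number"` switches the summary to the non-numeric columns.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Other"],
    "weight_lb": [40.0, 9.5, 55.0, 1.2],
})

# Default: count, mean, std, min, quartiles, max for numeric columns.
numeric_summary = toy.describe()

# Non-numeric columns: count, unique, top (most frequent value), freq.
text_summary = toy.describe(exclude="number")

print(numeric_summary)
print(text_summary)
```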

In [None]:
# Use value counts to check a categorical feature's distribution

animals['color'].value_counts()
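
Two optional arguments to `.value_counts()` are often handy during EDA: `normalize=True` returns proportions instead of counts, and `dropna=False` keeps missing values as their own row. A minimal sketch on a made-up Series of colors:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
colors = pd.Series(["Black", "White", "Black", np.nan, "Brown", "Black"])

counts = colors.value_counts()                    # missing values are dropped by default
shares = colors.value_counts(normalize=True)      # proportions of the non-missing rows
with_missing = colors.value_counts(dropna=False)  # NaN is counted as its own category

print(counts)
print(shares)
print(with_missing)
```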

Now that we have a sense of the data available to us, we can focus on more specific questions. These questions may or may not be directly relevant to your goal (e.g., helping plan a new shelter), but they will always help you better understand your data.

In your EDA notebooks, **markdown** will be especially helpful for tracking these questions and how you answered them.

## Question 1: What animal types are in the dataset?

We can begin by thinking about which parts of the DataFrame we need to answer the question.

* What features do we need?
    - "animal_type"
* What type of logic and calculation do we perform?
    - Let's use `.value_counts()` to count the different animal types
* What type of visualization would help us answer the question?
    - A bar chart would be good for this purpose

In [None]:
animals['animal_type'].value_counts()

In [None]:
fig, ax = plt.subplots()

animal_type_values = animals['animal_type'].value_counts()

ax.barh(
    y=animal_type_values.index,
    width=animal_type_values.values
)
ax.set_xlabel('count');

In [None]:
animals['animal_type'].hist()

Questions lead to other questions. For the above example, the visualization raises the question...

## Question 2: What "Other" animals are in the dataset?

To find out, we need to know whether the dataset records the specific kind of animal for rows labeled "Other" and, if so, where to find it.

**Discussion**: Where might we look to find animal types within the Other category?

<details>
    <summary>
        Answer
    </summary>
        The breed column.
</details>

In [None]:
# Your exploration here

Let's use that column to answer our question.

In [None]:
mask_other_animals = animals['animal_type'] == 'Other'
animals[mask_other_animals]['breed'].value_counts()
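
The filtering step above can be broken into pieces on a small made-up DataFrame. This sketch uses `.loc` to select rows and a column in a single step, which is one common way to write the same selection.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Other", "Cat", "Other", "Other"],
    "breed": ["Beagle", "Bat", "Siamese", "Raccoon", "Bat"],
})

mask = toy["animal_type"] == "Other"    # boolean Series: True where the row matches
other_breeds = toy.loc[mask, "breed"]   # rows where mask is True, 'breed' column only
breed_counts = other_breeds.value_counts()

print(mask.tolist())
print(breed_counts)
```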

## Question 3: How old are the animals in our dataset?

Let's try to answer this with the `age_upon_outcome` variable to learn some new `pandas` tools.

In [None]:
animals['age_upon_outcome'].value_counts()

### `Series.map()` and `Series.apply()`

The `.map()` method applies a transformation to every entry in the Series. This transformation "maps" each value from the Series to a new value. A transformation can be defined by a function, a Series, or a dictionary; usually we'll use functions.

The `.apply()` method is similar to the `.map()` method for Series, but can only use functions. It has more powerful uses when working with DataFrames.
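
The difference between mapping with a function and mapping with a dictionary shows up in how unmatched values are handled. A minimal sketch on a made-up Series (the helper name `add_s` is hypothetical):

```python
import pandas as pd

# Toy data for illustration only.
ages = pd.Series(["1 year", "2 years", "3 months"])

def add_s(age):
    """Return '1 years' for '1 year'; leave every other value unchanged."""
    return "1 years" if age == "1 year" else age

via_map = ages.map(add_s)      # function applied to each value
via_apply = ages.apply(add_s)  # same result for a Series

# With a dictionary, values that are not keys become NaN.
via_dict = ages.map({"1 year": "1 years"})

print(via_map.tolist())
print(via_dict.tolist())
```

Because a dictionary lookup has no "otherwise" branch, a function is often the simpler choice when most values should pass through unchanged.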

In [None]:
def one_year(age):
    # Make the singular '1 year' match the plural form of the other values
    if age == '1 year':
        return '1 years'
    else:
        return age

In [None]:
animals['new_age1'] = animals['age_upon_outcome'].map(one_year)
animals['new_age1'].value_counts()
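
The same `.map()` pattern can go one step further toward answering the question by turning the age strings into numbers. The sketch below assumes every value looks like `'<number> <unit>'` with units of days, weeks, months, or years, and uses approximate conversion factors (30 days per month, 365 per year). Both the helper `age_to_days` and these factors are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Assumed, approximate conversion factors.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age):
    """Convert a string like '2 years' to an approximate number of days."""
    try:
        number, unit = age.split()
        return int(number) * DAYS_PER_UNIT[unit.rstrip("s")]  # 'years' -> 'year'
    except (AttributeError, ValueError, KeyError):
        # Missing or unexpected values stay missing.
        return np.nan

# Toy data for illustration only.
ages = pd.Series(["2 years", "1 year", "3 months", "2 weeks", "5 days", np.nan])
age_days = ages.map(age_to_days)

print(age_days.tolist())
```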

### Slower Than `numpy`

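In general, the vectorized `numpy` functions `np.where()` and `np.select()` are faster than `.map()` with a Python function, because they operate on whole arrays rather than calling the function once per element: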

In [None]:
animals['new_age2'] = np.where(animals['age_upon_outcome'] == '1 year',
                              '1 years', animals['age_upon_outcome'])
animals['new_age2']

In [None]:
(animals['new_age1'] != animals['new_age2']).sum()

In [None]:
%timeit animals['new_age1'] = animals['age_upon_outcome'].map(one_year)

In [None]:
%timeit animals['new_age2'] = np.where(animals['age_upon_outcome'] == '1 year', \
                              '1 years', animals['age_upon_outcome'])
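
Outside of IPython, where the `%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. The Series below is made up, and the measured times will vary by machine and library version, so treat the printed numbers as a rough indication only.

```python
import timeit

import numpy as np
import pandas as pd

# Toy data for illustration only: 10,000 age strings.
ages = pd.Series(["1 year", "2 years", "3 months", "1 year"] * 2500)

def one_year(age):
    return "1 years" if age == "1 year" else age

def with_map():
    return ages.map(one_year)

def with_where():
    return np.where(ages == "1 year", "1 years", ages)

# Total seconds for 20 runs of each approach.
map_seconds = timeit.timeit(with_map, number=20)
where_seconds = timeit.timeit(with_where, number=20)

print(f".map():     {map_seconds:.4f} s")
print(f"np.where(): {where_seconds:.4f} s")
```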

### More Sophisticated Mapping

Let's use `.map()` to turn `sex_upon_outcome` into a category with three values (called **ternary**): male, female, or unknown.

First, explore the unique values:

In [None]:
animals['sex_upon_outcome'].unique()

In [None]:
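def sex_mapper(status):
    # Collapse the status labels to 'Male', 'Female', or 'Unknown'
    if status in ['Neutered Male', 'Intact Male']:
        return 'Male'
    elif status in ['Spayed Female', 'Intact Female']:
        return 'Female'
    else:
        # Anything else, including missing values, counts as unknown
        return 'Unknown'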

In [None]:
animals['new_sex1'] = animals['sex_upon_outcome'].map(sex_mapper)
animals['new_sex1']
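
Since `.map()` also accepts a dictionary, the same recoding can be sketched as a lookup table. Values that are not keys in the dictionary (including missing values) come back as `NaN`, so `.fillna('Unknown')` plays the role of the `else` branch. The name `SEX_LOOKUP` and the sample Series are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table mirroring the if/elif branches of sex_mapper.
SEX_LOOKUP = {
    "Neutered Male": "Male",
    "Intact Male": "Male",
    "Spayed Female": "Female",
    "Intact Female": "Female",
}

# Toy data for illustration only.
statuses = pd.Series(
    ["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown", np.nan]
)

# Unmatched and missing values become NaN, then are filled with 'Unknown'.
new_sex = statuses.map(SEX_LOOKUP).fillna("Unknown")

print(new_sex.tolist())
```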

Again, `numpy` will be faster:

In [None]:
conditions = [animals['sex_upon_outcome'] == 'Neutered Male',
             animals['sex_upon_outcome'] == 'Intact Male',
             animals['sex_upon_outcome'] == 'Spayed Female',
             animals['sex_upon_outcome'] == 'Intact Female',
             animals['sex_upon_outcome'] == 'Unknown',
             animals['sex_upon_outcome'] == 'NULL']

choices = ['Male', 'Male', 'Female', 'Female', 'Unknown', 'Unknown']

In [None]:
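# Rows matching no condition (e.g. missing values) get the default,
# mirroring the else branch of sex_mapper
animals['new_sex2'] = np.select(conditions, choices, default='Unknown')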
animals['new_sex2']

In [None]:
(animals['new_sex1'] != animals['new_sex2']).sum()
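
One detail matters when comparing the two results: `np.select()` assigns its `default` argument to rows that match none of the conditions, and that default is `0` unless you set it. With string choices, an integer default would not match the `'Unknown'` label from the function-based version and, depending on the `numpy` version, may raise an error. A minimal sketch on made-up data with an explicit string default:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the NaN row matches no condition.
statuses = pd.Series(["Neutered Male", "Spayed Female", "Intact Male", np.nan, "Unknown"])

conditions = [
    statuses.isin(["Neutered Male", "Intact Male"]),
    statuses.isin(["Spayed Female", "Intact Female"]),
]
choices = ["Male", "Female"]

# Rows where every condition is False receive the default value.
new_sex = np.select(conditions, choices, default="Unknown")

print(new_sex.tolist())
```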

In [None]:
%timeit animals['new_sex1'] = animals['sex_upon_outcome'].map(sex_mapper)

In [None]:
%timeit animals['new_sex2'] = np.select(conditions, choices, default='Unknown')

### Lambda Functions

Simple functions can be defined inline, right where they are used, with the `lambda` keyword. These are called **lambda functions**. They are **anonymous**: unless you assign one to a name, it is discarded as soon as the expression that uses it finishes.

Let's use a lambda function to replace 'Other' with a missing value in the `animal_type` column.

In [None]:
animals[animals['animal_type'] == 'Other']

In [None]:
animals['animal_type'].value_counts()

In [None]:
animals['animal_type'].map(lambda x: np.nan if x == 'Other' else x).value_counts()
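
A lambda function behaves exactly like a named function with the same body. The sketch below, on a made-up Series with a hypothetical helper `drop_other`, shows the equivalence and also why 'Other' vanishes from the counts: `.value_counts()` skips `NaN` unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
types = pd.Series(["Dog", "Other", "Cat", "Other", "Dog"])

def drop_other(x):
    return np.nan if x == "Other" else x

with_def = types.map(drop_other)
with_lambda = types.map(lambda x: np.nan if x == "Other" else x)

counts = with_lambda.value_counts()                      # NaN rows are skipped
counts_with_nan = with_lambda.value_counts(dropna=False)  # NaN rows are counted

print(with_def.equals(with_lambda))
print(counts)
print(counts_with_nan)
```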

# Level Up: `.applymap()`

`.applymap()` applies a transformation to each element of a DataFrame. In pandas 2.1 and later, `.applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.

In [None]:
# This line will apply the built-in `type()` function to
# all entries of the DataFrame.

animals.applymap(type)
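
Because the element-wise method is named `DataFrame.map()` in pandas 2.1 and later and `.applymap()` in earlier versions, code that must run on both can check which one is available. A minimal sketch on a made-up DataFrame that measures the length of each entry's string form:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"name": ["Rex", "Tom"], "age_years": [2, 5]})

as_text = toy.astype(str)  # convert every entry to a string first

# DataFrame.map exists in pandas >= 2.1; older versions only have applymap.
if hasattr(pd.DataFrame, "map"):
    lengths = as_text.map(len)
else:
    lengths = as_text.applymap(len)

print(lengths)
```

The check is made on the `pd.DataFrame` class rather than on the instance so that a column that happens to be called `map` cannot interfere with it.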