# 2. More Series Methods

### Objectives

+ Understand how to use the following methods to handle missing data: **`isna`**, **`notna`**, **`fillna`**, **`dropna`**
+ Find the percentage of missing values by chaining the **`mean`** method to the **`isna`** method
+ Sort values and the index with **`sort_values`** and **`sort_index`**
+ Randomly **`sample`** a Series
+ Find the index of the max and min with **`idxmax`** and **`idxmin`**
+ Explore uniqueness with **`unique`**, **`nunique`** and **`drop_duplicates`**

## Other Useful Methods
In the previous notebook, we covered the most essential and common attributes and statistical methods for Pandas Series objects. In this notebook, we will cover several other useful and common methods from the [Series API](http://pandas.pydata.org/pandas-docs/stable/api.html#series).

In [None]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

In [None]:
duration = movie['duration']
duration.head()

## Methods for handling missing values
Pandas provides the following methods to handle missing values:

* **`isna`** - Returns a Series of booleans based on whether each value is missing or not
* **`notna`** - Exact opposite of **`isna`**
* **`fillna`** - Fills missing values in a variety of ways
* **`dropna`** - Drops the missing values from the Series
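
As a small illustration, the four methods can be tried on a made-up Series. The `toy` Series and its labels below are invented for this sketch and are not part of the movie data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from movie.csv
toy = pd.Series([100.0, np.nan, 95.0, np.nan], index=['a', 'b', 'c', 'd'])

missing_mask = toy.isna()    # True where a value is missing
present_mask = toy.notna()   # True where a value is present
filled = toy.fillna(0)       # missing values replaced by the constant 0
dropped = toy.dropna()       # entries with missing values removed

print(missing_mask.tolist())
print(present_mask.tolist())
print(filled.tolist())
print(dropped.index.tolist())
```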

### Counting the number of missing values
Pandas doesn't have a single method that counts missing values, but you can find the count in two ways:

* Use the **`count`** method to find the number of non-missing values and subtract this from the total number of values
* Use the **`isna`** method to return a Series of booleans and chain the **`sum`** method

In [None]:
len(duration) - duration.count()

In [None]:
duration.isna().sum()

## Finding the percentage of missing values
To find the percentage of missing values in a Series, we can chain the **`mean`** method to the **`isna`** method. The result is a fraction between 0 and 1; multiply it by 100 to express it as a percentage.

In [None]:
duration.isna().mean()

### Alternate calculation
The last calculation might be confusing. Alternatively, we could be more explicit and calculate the percentage of missing values by dividing the number of missing values by the total size of the Series:

In [None]:
total = len(duration)
num_missing = total - duration.count()
num_missing / total

### Why does taking the mean of the boolean Series work?
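The mean is defined as the sum divided by the total number of values. When a boolean Series is summed, each **`True`** counts as 1 and each **`False`** as 0, so the sum is just the number of missing values. Dividing that count by the total gives the fraction that is missing.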
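
This reasoning can be checked on a made-up Series. The values below are invented for illustration, and both calculations should give the same fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 of the 5 values are missing
toy = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

is_missing = toy.isna()
num_missing = is_missing.sum()             # True is summed as 1, False as 0
fraction_by_hand = num_missing / len(toy)  # sum divided by the total
fraction_by_mean = is_missing.mean()       # the same value in one step

print(num_missing, fraction_by_hand, fraction_by_mean)
```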

## Filling missing values
Occasionally, it will be necessary to fill missing values. Pandas provides the **`fillna`** method to do so. There are many strategies for replacing missing values, but here we will only cover filling them with a constant. A popular choice is the median or mean of the Series.

In [None]:
duration.head()

Find the median and replace missing values with it.

In [None]:
median = duration.median()
duration.fillna(median).head()

You can use any constant number directly as well:

In [None]:
duration.fillna(-99).head()
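
Note that **`fillna`** returns a new Series and leaves the original untouched, so the result has to be assigned to a variable to keep it. A minimal sketch on a made-up Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.Series([80.0, np.nan, 100.0, 120.0])

toy_median = toy.median()            # the median ignores missing values
toy_filled = toy.fillna(toy_median)  # new Series; toy itself is unchanged

print(toy_filled.tolist())
print(toy.isna().sum())              # the original still has its missing value
```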

## Dropping missing values
The **`dropna`** method removes the missing values from the Series and returns the result. Notice that the size of the returned Series is smaller than the original.

In [None]:
duration.dropna().size

# Sorting
The **`sort_values`** method sorts the Series from least to greatest by default. It places missing values at the end.

In [None]:
duration.sort_values().head()

The **`ascending`** parameter can be set to **`False`** to sort from greatest to least:

In [None]:
duration.sort_values(ascending=False).head()
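
Missing values stay at the end even when sorting from greatest to least. The **`na_position`** parameter of **`sort_values`** can move them to the front instead. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the default integer index 0..3
toy = pd.Series([120.0, np.nan, 90.0, 150.0])

descending = toy.sort_values(ascending=False)    # missing value still last
na_first = toy.sort_values(na_position='first')  # missing value moved to the front

print(descending.index.tolist())
print(na_first.index.tolist())
```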

## Sorting the index
Since Series also have an index, Pandas allows you to sort by it as well with the **`sort_index`** method.

In [None]:
duration.sort_index().head()

In [None]:
duration.sort_index(ascending=False).head()

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">What percentage of actor 1 Facebook likes are missing?</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Use the `notna` method to find the number of non-missing values in the actor 1 Facebook likes column. Verify that this number is the same as the result of the `count` method.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Use one line of code to fill the missing values of `actor1_fb` with the maximum of `actor2_fb`. Save this result to variable `actor1_fb_full`</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Verify the results of problem 3 by selecting just the values of `actor1_fb_full` that were filled with the maximum of `actor2_fb`.</span>

In [None]:
# your code here

# Explore More Methods and their parameters
In this section, you can learn and practice with other methods and their parameters. There are far too many to cover during a lecture, so they are left for you to explore on your own.

## Take a random sample of the Series
The **`sample`** method is used to take a random sample of the Series.

In [None]:
duration.sample(10)

By default it selects without replacement. Set parameter **`replace`** to **`True`** to select with replacement:

In [None]:
duration.sample(10, replace=True)

Set parameter **`frac`** to a number between 0 and 1 to select a random fraction of the values. For instance, the following selects 0.1% of them.

In [None]:
duration.sample(frac=.001)
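
Because the selection is random, repeated calls to **`sample`** will usually return different values. Passing an integer to the **`random_state`** parameter makes the selection reproducible. A small sketch on a made-up Series (the seed 0 is an arbitrary choice):

```python
import pandas as pd

# Hypothetical toy data: the integers 0 through 99
toy = pd.Series(range(100))

first = toy.sample(5, random_state=0)
second = toy.sample(5, random_state=0)  # same seed, same selection
tenth = toy.sample(frac=0.1, random_state=0)

print(first.equals(second))
print(len(tenth))
```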

## Index of maximum and minimum
Instead of finding the maximum or minimum of the values of the Series, you can return the index of the maximum or minimum with **`idxmax`** and **`idxmin`**.

In [None]:
duration.idxmax()

Verify the result by sorting:

In [None]:
duration.sort_values(ascending=False).head()

You can also verify the result with boolean indexing:

In [None]:
duration[duration == duration.max()]

Test **`idxmin`**

In [None]:
duration.idxmin()

Verify by directly selecting the duration at that index label and comparing it with the minimum:

In [None]:
duration.loc['Shaun the Sheep']

In [None]:
duration.min()
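
The relationship between these methods can be checked on a small made-up Series (the labels and values are invented): the label returned by **`idxmax`** selects the maximum value, and likewise for **`idxmin`**.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with string labels as the index
toy = pd.Series([90.0, 150.0, np.nan, 45.0],
                index=['film a', 'film b', 'film c', 'film d'])

longest = toy.idxmax()   # missing values are skipped by default
shortest = toy.idxmin()

print(longest, toy.loc[longest] == toy.max())
print(shortest, toy.loc[shortest] == toy.min())
```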

# Uniqueness
There are a few methods that deal with unique values in a Series:
* **`unique`** - Returns a NumPy array of all the unique values in order of their appearance
* **`nunique`** - Returns the number of unique values in the Series. It is an aggregation method
* **`drop_duplicates`** - Returns a Pandas Series of just the unique values. By default, it keeps the first value it encounters

In [None]:
duration.unique()

In [None]:
duration.nunique()

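Compare the length of the array returned by **`unique`** with the result of **`nunique`**. Because **`unique`** keeps the missing value as one of its entries, its length is one greater when the Series contains missing values.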

In [None]:
len(duration.unique())

By default, **`nunique`** does not count missing values.

In [None]:
duration.nunique(dropna=False)

**`drop_duplicates`** returns a Series rather than an array. It keeps the first occurrence of each value, along with its index label, and drops the later duplicates.

In [None]:
duration_unique_series = duration.drop_duplicates()

In [None]:
duration_unique_series.head(10)

The size will match the length of the array returned by the **`unique`** method:

In [None]:
duration_unique_series.size
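
How the three methods treat duplicates and missing values can be seen on a made-up Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with repeated values and two missing values
toy = pd.Series([100.0, 90.0, 100.0, np.nan, 90.0, np.nan])

unique_values = toy.unique()                  # NumPy array; the missing value appears once
n_unique = toy.nunique()                      # missing values are not counted
n_unique_with_na = toy.nunique(dropna=False)  # the missing value counts as one more
deduplicated = toy.drop_duplicates()          # Series keeping the first occurrences

print(len(unique_values), n_unique, n_unique_with_na)
print(deduplicated.index.tolist())
```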

# More exercises

### Problem 5
<span  style="color:green; font-size:16px">Randomly sample the `actor1_fb_full` Series with replacement to select three values. Use random state 54321.</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">How many unique directors are there?</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Select the `year` column, sort it, and drop any duplicates.</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Get the same result as problem 7 by dropping duplicates first and then sorting. Which method is faster?</span>

In [None]:
# your code here