# More Series Methods II

## Overview

### Less common, but important methods
In this chapter, we cover several less common, but still useful, Series methods that round out your ability to analyze data with pandas. Let's begin by reading in our data and selecting a Series.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head()

In [None]:
salary = emp['salary']
salary.head()

### The `describe` method
The `describe` method performs many of the most common aggregations at once, returning a Series with the name of each aggregation in the index.

In [None]:
salary.describe()

### The `quantile` method

By default, the `quantile` method returns the median. Its parameter `q`, a number between 0 and 1, represents a fraction of the distribution. For instance, if `q` is .2, pandas returns the value below which 20% of the values fall. A quantile of .2 corresponds to the 20th **percentile**.

In [None]:
salary.quantile(q=.2)

Let's find the 99th percentile (.99 quantile) of salaries.

In [None]:
salary.quantile(q=.99)
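A requested quantile usually falls between two observed values. The sketch below uses a small made-up Series (not the employee data) to show that pandas interpolates linearly between the neighboring values by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from the employee dataset
toy = pd.Series([10, 20, 30, 40, 50])

median = toy.quantile()      # q defaults to 0.5
q20 = toy.quantile(q=0.2)    # lies between 10 and 20, so pandas interpolates

print(median)  # 30.0
print(q20)     # about 18.0, which is 80% of the way from 10 to 20
```

The `interpolation` parameter of `quantile` controls this behavior if you would rather get an observed value back.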

### Changing the quantiles in the `describe` method
The `describe` method has a (misnamed) parameter called `percentiles` that you can use to choose which quantiles of the Series are returned. Pass it a list of numbers between 0 and 1 to return as many quantiles as you would like. By default, it returns the quantiles .25, .5, and .75.

In [None]:
salary.describe(percentiles=[.01, .1, .3, .4, .5, .6, .8, .9, .99, .999])

The `quantile` method also accepts a list.

In [None]:
salary.quantile(q=[.1, .2, .99])

### The `agg` method
While the `describe` method lets you choose the exact quantiles you like, it doesn't let you choose which aggregations are computed. The `agg` method does: pass it a list of aggregation method names as strings.

In [None]:
salary.agg(['min', 'max', 'median'])
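As a quick check of what `agg` returns, here is a minimal sketch on a made-up Series. The result is itself a Series whose index holds the aggregation names, in the order you passed them.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.Series([1, 2, 3, 4])

result = toy.agg(['min', 'max', 'median'])
print(result)

# Individual aggregations can be selected by name
print(result['median'])  # 2.5
```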

## More accumulation methods
The `cummin` method was used in a previous chapter. Below are examples of the other three accumulation methods: `cummax`, `cumsum`, and `cumprod`.

In [None]:
sal_head = salary.head()

In [None]:
sal_head.cummax()

In [None]:
sal_head.cumsum()

In [None]:
sal_head.cumprod()
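Salaries make the accumulated values hard to verify by eye, so here is a small sketch with made-up single-digit numbers. Each result has the same length as the original Series, and each position reflects all values up to and including that position.

```python
import pandas as pd

# Hypothetical toy data chosen so the results are easy to check by hand
toy = pd.Series([3, 1, 4, 1, 5])

accumulated = pd.DataFrame({
    'value': toy,
    'cummin': toy.cummin(),    # smallest value seen so far
    'cummax': toy.cummax(),    # largest value seen so far
    'cumsum': toy.cumsum(),    # running total
    'cumprod': toy.cumprod(),  # running product
})
print(accumulated)
```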

## The `rank` method
The `rank` method provides a numerical rank for each value in the Series. It's as if each value were in a competition and there was a leaderboard. Note that this method does NOT sort the values. It simply returns each value's rank, beginning at 1. By default, the smallest value is given a rank of 1.

There are different ways to break ties when using `rank`. By default, pandas assigns tied values the average of the ranks they would occupy. You can break ties in five different ways with the `method` parameter: `'average'`, `'min'`, `'max'`, `'first'`, and `'dense'`.

In [None]:
sal_head

In [None]:
sal_head.rank()
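The first few salaries may not contain any ties, so the sketch below uses a made-up Series with one tie to compare the tie-breaking options side by side. It also shows `ascending=False`, which gives rank 1 to the largest value instead.

```python
import pandas as pd

# Hypothetical toy data with a tie between the two 20s
toy = pd.Series([50, 20, 20, 90])

methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: toy.rank(method=m) for m in methods})
ranks.insert(0, 'value', toy)
print(ranks)

# Rank from largest to smallest instead
highest_first = toy.rank(ascending=False)
print(highest_first)
```

The two tied values would occupy ranks 1 and 2, so `'average'` gives both 1.5, `'min'` gives both 1, `'max'` gives both 2, `'first'` gives 1 and 2 in order of appearance, and `'dense'` gives both 1 without leaving a gap before the next rank.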

## Differencing methods: `diff` and `pct_change`
These methods take the difference between the current value and some other value. By default, the other value is the immediately preceding one.

In [None]:
# print out the Series to visually verify
sal_head

In [None]:
sal_head.diff()

The first parameter, `periods`, determines which two values are subtracted. For instance, we can subtract the value two positions back from the current one like this:

In [None]:
sal_head.diff(periods=2)

If you'd like to take the difference between the current value and an upcoming value, use negative numbers. Below, we take the difference between the current value and the very next one.

In [None]:
sal_head.diff(-1)

The `pct_change` method works analogously but returns the relative change as a fraction of the other value (0.1 means a 10% increase) rather than the raw difference.

In [None]:
sal_head.pct_change()

In [None]:
sal_head.pct_change(-1)
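To make the arithmetic concrete, the following sketch rebuilds both methods from the `shift` method on a made-up Series. It assumes there are no missing values in the data, which keeps the comparison simple.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.Series([100.0, 110.0, 99.0, 132.0])

difference = toy.diff()        # current value minus the previous value
relative = toy.pct_change()    # difference divided by the previous value

# The same results built by hand with shift
manual_difference = toy - toy.shift(1)
manual_relative = toy / toy.shift(1) - 1

print(pd.DataFrame({'value': toy, 'diff': difference, 'pct_change': relative}))
```

The first row is missing in both results because there is no previous value to compare against.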

## Calling methods after an operation
Let's say you would like to find the sum of all the salaries after giving everyone a $5,000 bonus. You can do this in one line like this:

In [None]:
(salary + 5000).sum()

### Must use parentheses
The syntax above might look confusing, but it does the same thing as the following two lines. The parentheses ensure that the addition happens before `sum` is called.

In [None]:
salary_bonus = salary + 5000
salary_bonus.sum()

### Explanation
Writing `(salary + 5000).sum()` first adds 5,000 to each value in the `salary` Series. This produces a temporary Series, which has the same methods available as any other Series. We then call the `sum` method on this temporary Series.

### What is a temporary Series?
The word `temporary` is used to describe a Series object that is not assigned to any variable. It is only held in memory temporarily during the execution of that one statement.
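The placement of the parentheses matters because a method call binds more tightly than `+`. This minimal sketch, using a made-up Series, shows the two readings producing different kinds of results.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.Series([1, 2, 3])

# Add first, then sum the temporary Series: a single number
with_parens = (toy + toy).sum()

# Sum first (6), then add that number to every value: still a Series
without_parens = toy + toy.sum()

print(with_parens)              # 12
print(without_parens.tolist())  # [7, 8, 9]
```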

### A temporary list
Let's see another example of a temporary object. Here, the temporary list `[-99, -11]` is never assigned to a variable and only exists during the execution of the second line of code below:

In [None]:
a = [1, 4, 10]
b = [-99, -11] + a
b

## Randomly sample a Series
The `sample` method is used to take a random sample of the Series. We use the `duration` column from the `movie` dataset for the rest of the examples in this chapter.

In [None]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

In [None]:
duration = movie['duration']

In [None]:
duration.sample(10)

By default, the `sample` method selects without replacement, meaning that each value can be selected at most once. Set the `replace` parameter to `True` to select with replacement. This is required if you'd like to create a sample larger than the original Series.

In [None]:
duration.sample(10, replace=True)

Instead of setting an exact number of samples to return, you can set the `frac` parameter to a number between 0 and 1 to select a random fraction of values. For instance, the following selects .1% of the movie durations.

In [None]:
duration.sample(frac=.001)
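Each call to `sample` normally returns different values. If you need a reproducible sample, the `random_state` parameter accepts an integer seed. The sketch below uses a made-up Series, and the seed 42 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.Series([90, 120, 75], index=['A', 'B', 'C'])

# The same seed produces the same sample every time
first = toy.sample(2, random_state=42)
second = toy.sample(2, random_state=42)
print(first.equals(second))  # True

# With replacement, the sample can be larger than the original Series
bigger = toy.sample(10, replace=True, random_state=42)
print(len(bigger))  # 10
```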

## Index of maximum and minimum
Instead of finding the maximum or minimum of the values of the Series, you can return the index of the maximum or minimum with `idxmax` and `idxmin`.

In [None]:
duration.idxmax()

Verify the result by sorting:

In [None]:
duration.sort_values(ascending=False).head()

You can also verify the result with boolean indexing:

In [None]:
duration[duration == duration.max()]

Let's test `idxmin` as well.

In [None]:
duration.idxmin()

Verify by directly selecting the duration of that movie and comparing it with the minimum:

In [None]:
duration.loc['Shaun the Sheep']

In [None]:
duration.min()
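One detail worth knowing is what happens when the extreme value appears more than once. In the made-up Series below, two labels share the maximum: `idxmax` returns only the first of them, while boolean indexing returns every match.

```python
import pandas as pd

# Hypothetical toy data with the maximum appearing twice
toy = pd.Series([90, 120, 120, 75], index=['A', 'B', 'C', 'D'])

print(toy.idxmax())  # 'B', the first label holding the maximum
print(toy.idxmin())  # 'D'

# Boolean indexing finds every label that holds the maximum
all_max = toy[toy == toy.max()]
print(all_max.index.tolist())  # ['B', 'C']
```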

## Uniqueness
There are a few methods that deal with unique values in a Series:

* `unique` - Returns a numpy array of all the unique values in order of their appearance
* `nunique` - Returns the number of unique values in the Series. It is an aggregation method
* `drop_duplicates` - Returns a pandas Series of just the unique values. By default, it keeps the first value it encounters

### The `unique` method
The `unique` method is rare in that it returns a numpy array which can be quite confusing. If you'd like to keep the data in a pandas Series, use the `drop_duplicates` method explained below.

In [None]:
duration.unique()

### The `nunique` method

In [None]:
duration.nunique()

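Compare this with the length of the array returned by `unique`. The array includes the missing value if one is present, so its length can be one more than the result of `nunique`.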

In [None]:
len(duration.unique())

By default, `nunique` does not count missing values. If there are missing values, then the following should report one more than the default.

In [None]:
duration.nunique(dropna=False)
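The effect of missing values on these counts is easier to see on a small made-up Series than on the full movie data. Here there are two distinct numbers plus a repeated missing value.

```python
import pandas as pd

# Hypothetical toy data: two distinct numbers and two missing values
toy = pd.Series([90.0, 120.0, float('nan'), 90.0, float('nan')])

print(toy.nunique())              # 2, missing values are ignored
print(toy.nunique(dropna=False))  # 3, missing counts as one extra value
print(len(toy.unique()))          # 3, unique keeps a single missing value
```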

### The `drop_duplicates` method
The `drop_duplicates` method is similar to `unique` but returns a pandas Series. By default, it keeps the first occurrence of each value.

In [None]:
duration_unique_series = duration.drop_duplicates()

In [None]:
duration_unique_series.head(10)

The size matches the length of the result from the `unique` method:

In [None]:
duration_unique_series.size

### Why does it matter that `drop_duplicates` keeps the first value?
A Series is composed of both an index and the values. Both `unique` and `drop_duplicates` only consider the values of a Series. But the index will likely be different for values that are the same, so the choice of which occurrence to keep matters with `drop_duplicates`. Set the `keep` parameter to `'last'` to keep the very last occurrence, or to `False` to drop every value that has a duplicate.
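The three options for `keep` can be compared on a made-up Series in which the value 90 appears under two different index labels. Notice how the surviving labels change.

```python
import pandas as pd

# Hypothetical toy data: 90 appears under labels 'A' and 'C'
toy = pd.Series([90, 120, 90, 75], index=['A', 'B', 'C', 'D'])

keep_first = toy.drop_duplicates()             # keeps 'A', drops 'C'
keep_last = toy.drop_duplicates(keep='last')   # keeps 'C', drops 'A'
keep_none = toy.drop_duplicates(keep=False)    # drops both 'A' and 'C'

print(keep_first.index.tolist())  # ['A', 'B', 'D']
print(keep_last.index.tolist())   # ['B', 'C', 'D']
print(keep_none.index.tolist())   # ['B', 'D']
```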

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Randomly sample the `actor1_fb` column as a Series with replacement to select three values. Use random state 54321.</span>

### Exercise 2
<span  style="color:green; font-size:16px">How many unique directors are there?</span>

### Exercise 3
<span  style="color:green; font-size:16px">Select the `year` column, sort it, and drop any duplicates.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Get the same result as Exercise 3 by dropping duplicates first and then sorting. Which approach is faster?</span>