# 2. More Series Methods

### Objectives

+ Understand how to use the following methods to handle missing data: **`isna`**, **`notna`**, **`fillna`**, **`dropna`**
+ Find the percentage of missing values by chaining the **`mean`** method to the **`isna`** method
+ Sort values and the index with **`sort_values`** and **`sort_index`**
+ Randomly **`sample`** a Series
+ Find the index of the max and min with **`idxmax`** and **`idxmin`**
+ Explore uniqueness with **`unique`**, **`nunique`** and **`drop_duplicates`**

## Other Useful Methods
In the previous notebook, we covered the most essential and common attributes and statistical methods for Pandas Series objects. In this notebook, we will cover several other useful and common methods from the [Series API](http://pandas.pydata.org/pandas-docs/stable/api.html#series).

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


In [2]:
duration = movie['duration']
duration.head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

## Methods for handling missing values
Pandas provides the following methods to handle missing values:

* **`isna`** - Returns a Series of booleans based on whether each value is missing or not
* **`notna`** - Exact opposite of **`isna`**
* **`fillna`** - fills missing values in a variety of ways
* **`dropna`** - Drops the missing values from the Series
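As a quick illustration of these four methods, here is a minimal sketch on a small made-up Series. The name `s` and its values are hypothetical and are not part of the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
s = pd.Series([90.0, np.nan, 120.0], index=['a', 'b', 'c'])

is_missing = s.isna()     # True where the value is missing
not_missing = s.notna()   # exact opposite of isna
filled = s.fillna(0)      # missing value replaced with 0
dropped = s.dropna()      # missing value removed

print(is_missing.tolist())   # [False, True, False]
print(not_missing.tolist())  # [True, False, True]
print(filled.tolist())       # [90.0, 0.0, 120.0]
print(dropped.tolist())      # [90.0, 120.0]
```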

### Counting the number of missing values
Pandas does not have a dedicated method that counts missing values, but you can count them in two ways:

* Use the **`count`** method to find the number of non-missing values and subtract this from the total number of values
* Use the **`isna`** method to return a Series of booleans and chain the **`sum`** method

In [3]:
len(duration) - duration.count()

15

In [4]:
duration.isna().sum()

15

## Finding the percentage of missing values
To find the percentage of missing values in a Series, chain the **`mean`** method to the **`isna`** method. The result is a proportion between 0 and 1; multiply it by 100 to express it as a percentage.

In [5]:
duration.isna().mean()

0.0030512611879576893

### Alternate calculation
The previous calculation might be confusing. Alternatively, we could be more explicit and calculate the proportion of missing values by dividing the number of missing values by the total size of the Series:

In [6]:
total = len(duration)
num_missing = total - duration.count()
num_missing / total

0.0030512611879576893

### Why does taking the mean of the boolean Series work?
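The mean is defined as the sum of the values divided by the number of values. In a boolean Series, **`True`** counts as 1 and **`False`** counts as 0, so the sum is just the number of missing values. Dividing that count by the length of the Series gives the proportion that is missing.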
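The reasoning above can be checked on a tiny example. This is only a sketch; the Series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 of 4 values is missing
s = pd.Series([1.0, np.nan, 3.0, 4.0])

mask = s.isna()
num_missing = mask.sum()          # True counts as 1, False as 0
fraction = num_missing / len(s)   # sum divided by the number of values

print(num_missing)   # 1
print(fraction)      # 0.25
print(mask.mean())   # 0.25, same as the explicit calculation
```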

## Filling missing values
Occasionally, it will be necessary to fill missing values. Pandas provides the **`fillna`** method to do so. There are many strategies for replacing missing values, but we will only cover filling them with a constant here. A popular choice is the median or mean of the Series.

In [7]:
duration.head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

Find the median and replace missing values with it.

In [8]:
median = duration.median()
duration.fillna(median).head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens    103.0
Name: duration, dtype: float64

You can use any constant number directly as well:

In [9]:
duration.fillna(-99).head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens    -99.0
Name: duration, dtype: float64
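Note that **`fillna`** returns a new Series and leaves the original untouched, which is why `duration` still shows `NaN` between calls above. The sketch below shows this on hypothetical toy data, filling with the mean instead of the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
s = pd.Series([100.0, np.nan, 200.0])

# mean skips missing values by default: (100 + 200) / 2
filled = s.fillna(s.mean())

print(filled.tolist())   # [100.0, 150.0, 200.0]
print(s.isna().sum())    # 1, the original Series still has its missing value

# Reassign the result to keep the filled values
s = s.fillna(s.mean())
print(s.isna().sum())    # 0
```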

## Dropping missing values
The **`dropna`** method simply removes the values from the Series that are missing. Notice that the size of the Series has decreased.

In [10]:
duration.dropna().size

4901
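The decrease in size equals the number of missing values, and the remaining values keep their original index labels. A minimal sketch on made-up data (the labels `x`, `y`, `z` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
s = pd.Series([178.0, np.nan, 148.0], index=['x', 'y', 'z'])

dropped = s.dropna()

print(dropped.size)             # 2
print(dropped.index.tolist())   # ['x', 'z'], labels are preserved
print(s.size)                   # 3, the original Series is unchanged

# The size difference is the number of missing values
print(s.size - dropped.size == s.isna().sum())   # True
```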

# Sorting
The **`sort_values`** method sorts the Series from least to greatest by default. It places missing values at the end.

In [11]:
duration.sort_values().head()

title
The Touch                                7.0
Shaun the Sheep                          7.0
Robot Chicken                           11.0
Vessel                                  14.0
Wal-Mart: The High Cost of Low Price    20.0
Name: duration, dtype: float64

The **`ascending`** parameter can be set to **`False`** to sort from greatest to least:

In [12]:
duration.sort_values(ascending=False).head()

title
Trapped                     511.0
Carlos                      334.0
Blood In, Blood Out         330.0
Heaven's Gate               325.0
The Legend of Suriyothai    300.0
Name: duration, dtype: float64
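Missing values stay at the end even when sorting in descending order. To move them to the front, **`sort_values`** accepts the **`na_position`** parameter. The sketch below uses a hypothetical toy Series and prints the index labels so the ordering is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the default integer index is 0, 1, 2, 3
s = pd.Series([3.0, np.nan, 1.0, 2.0])

asc = s.sort_values()                        # NaN last by default
desc = s.sort_values(ascending=False)        # NaN is still last
na_first = s.sort_values(na_position='first')

print(asc.index.tolist())        # [2, 3, 0, 1]
print(desc.index.tolist())       # [0, 3, 2, 1]
print(na_first.index.tolist())   # [1, 2, 3, 0]
```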

## Sorting the index
Since Series also have an index, Pandas allows you to sort by it as well with the **`sort_index`** method.

In [13]:
duration.sort_index().head()

title
#Horror                       101.0
10 Cloverfield Lane           104.0
10 Days in a Madhouse         111.0
10 Things I Hate About You     97.0
10,000 B.C.                    22.0
Name: duration, dtype: float64

In [14]:
duration.sort_index(ascending=False).head()

title
Æon Flux                    93.0
xXx: State of the Union    101.0
xXx                        132.0
eXistenZ                   115.0
[Rec] 2                     85.0
Name: duration, dtype: float64
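String labels are compared character by character using their Unicode code points, so symbols and digits come before uppercase letters, which come before lowercase letters. This is why `#Horror` appears first and `Æon Flux` appears last in ascending order. The following sketch uses hypothetical labels and assumes a pandas version that supports the **`key`** parameter of **`sort_index`** (added in pandas 1.1) for a case-insensitive sort.

```python
import pandas as pd

# Hypothetical toy data with mixed-case and symbol labels
s = pd.Series([1, 2, 3, 4], index=['apple', 'Zebra', '10 things', '#tag'])

default_order = s.sort_index().index.tolist()
print(default_order)   # ['#tag', '10 things', 'Zebra', 'apple']

# Case-insensitive sort: compare lowercased labels instead
ignore_case = s.sort_index(key=lambda idx: idx.str.lower()).index.tolist()
print(ignore_case)     # ['#tag', '10 things', 'apple', 'Zebra']
```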

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">What percentage of actor 1 Facebook likes are missing?</span>

In [15]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Use the `notna` method to find the number of non-missing values in the actor 1 Facebook like column. Verify that this number matches the result of the `count` method.</span>

In [16]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Use one line of code to fill the missing values of `actor1_fb` with the maximum of `actor2_fb`. Save this result to variable `actor1_fb_full`</span>

In [17]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Verify the results of problem 3 by selecting just the values of `actor1_fb_full` that were filled by `actor2_fb`.</span>

In [18]:
# your code here

# Explore More Methods and their parameters
In the section below, you can learn and practice other methods and their parameters. There are far too many to cover during a lecture, so they are left for you to explore on your own.

## Take a random sample of the Series
The **`sample`** method is used to take a random sample of the Series.

In [19]:
duration.sample(10)

title
Phone Booth                                          81.0
The Girlfriend Experience                            27.0
Justin Bieber: Never Say Never                      115.0
Quo Vadis                                           171.0
Creepshow                                           130.0
Cats Don't Dance                                     75.0
Teenage Mutant Ninja Turtles: Out of the Shadows    112.0
Philomena                                            98.0
Hansel & Gretel Get Baked                            86.0
Plush                                                99.0
Name: duration, dtype: float64

By default it selects without replacement. Set parameter **`replace`** to **`True`** to select with replacement:

In [20]:
duration.sample(10, replace=True)

title
Imaginary Heroes        111.0
Latter Days             107.0
Eddie the Eagle         106.0
Cinderella Man          144.0
Love Me Tender           89.0
The Tailor of Panama    109.0
Gran Torino             116.0
Jaws 2                  131.0
Moms' Night Out          98.0
Stargate SG-1            44.0
Name: duration, dtype: float64

Set parameter **`frac`** to a number between 0 and 1 to select a random fraction of the values. For instance, the following selects 0.1% of them.

In [21]:
duration.sample(frac=.001)

title
Love's Abiding Joy           87.0
Get Over It                  87.0
A Low Down Dirty Shame      100.0
The Time Traveler's Wife    107.0
Dr. Dolittle 2               87.0
Name: duration, dtype: float64
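Because **`sample`** is random, each run returns different rows. Passing an integer to the **`random_state`** parameter makes the result reproducible. The sketch below uses a hypothetical Series of 100 integers and an arbitrary seed of 0.

```python
import pandas as pd

# Hypothetical toy data: the integers 0 through 99
s = pd.Series(range(100))

# The same seed yields the same sample every time
first = s.sample(5, random_state=0)
second = s.sample(5, random_state=0)
print(first.equals(second))   # True

# frac=0.1 selects 10% of the 100 values
tenth = s.sample(frac=0.1)
print(len(tenth))             # 10
```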

## Index of maximum and minimum
Instead of finding the maximum or minimum of the values of the Series, you can return the index of the maximum or minimum with **`idxmax`** and **`idxmin`**.

In [22]:
duration.idxmax()

'Trapped'

Verify results by sorting: 

In [23]:
duration.sort_values(ascending=False).head()

title
Trapped                     511.0
Carlos                      334.0
Blood In, Blood Out         330.0
Heaven's Gate               325.0
The Legend of Suriyothai    300.0
Name: duration, dtype: float64

You can also verify with boolean indexing:

In [24]:
duration[duration == duration.max()]

title
Trapped    511.0
Name: duration, dtype: float64

Test **`idxmin`**

In [25]:
duration.idxmin()

'Shaun the Sheep'

Verify by directly selecting the duration and finding the minimum:

In [26]:
duration.loc['Shaun the Sheep']

7.0

In [27]:
duration.min()

7.0
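When several labels share the extreme value, **`idxmin`** and **`idxmax`** return only the first one encountered, and missing values are skipped by default. Boolean indexing returns every tied label. A minimal sketch with hypothetical labels:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two labels tie for the minimum
s = pd.Series([7.0, 511.0, 7.0, np.nan],
              index=['first', 'long', 'second', 'unknown'])

print(s.idxmin())   # first
print(s.idxmax())   # long

# Boolean indexing finds all labels that share the minimum
tied = s[s == s.min()].index.tolist()
print(tied)         # ['first', 'second']
```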

# Uniqueness
There are a few methods that deal with unique values in a Series:
* **`unique`** - Returns a NumPy array of all the unique values in order of their appearance
* **`nunique`** - Returns the number of unique values in the Series. It is an aggregation method
* **`drop_duplicates`** - Returns a Pandas Series of just the unique values. By default, it keeps the first value it encounters

In [28]:
duration.unique()

array([178., 169., 148., 164.,  nan, 132., 156., 100., 141., 153., 183.,
       106., 151., 150., 143., 173., 136., 186., 113., 201., 194., 147.,
       131., 124., 135., 195., 108., 104., 165., 130., 142., 125., 123.,
       103., 118., 140., 149., 114., 116., 154., 122.,  93.,  98.,  91.,
       158.,  96., 127., 110., 144., 152.,  94., 126., 112., 176.,  95.,
        97., 109., 128., 102., 101., 120., 121., 182., 166., 137., 184.,
       206., 138., 157., 115., 111.,  89., 105., 119., 129., 146.,  88.,
        99.,  90.,  85.,  92., 196., 133., 215.,  60., 117., 107.,  82.,
       159., 174., 134.,  77., 170.,  76., 171.,  84.,  22., 145.,  78.,
       240., 172.,  87., 216., 192.,  44.,  83., 139.,  86., 162.,  54.,
        80.,  25.,  74.,  81., 177.,  73.,  43.,  45., 163.,  30., 212.,
       187., 189., 188., 280., 155.,  64., 190.,  75., 220., 160.,  52.,
       325., 251., 202., 330., 289., 161.,  28.,  79.,  63., 511.,  42.,
       167., 193., 175., 185., 219.,   7., 271.,  5

In [29]:
duration.nunique()

191

Compare the number of values returned by **`unique`** with the result of **`nunique`**. They differ by one because **`unique`** includes the missing value:

In [30]:
len(duration.unique())

192

By default, **`nunique`** does not count missing values.

In [31]:
duration.nunique(dropna=False)

192

**`drop_duplicates`** returns a Series rather than an array. It drops the duplicated values and keeps the index labels of the values that remain.

In [32]:
duration_unique_series = duration.drop_duplicates()

In [33]:
duration_unique_series.head(10)

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
John Carter                                   132.0
Spider-Man 3                                  156.0
Tangled                                       100.0
Avengers: Age of Ultron                       141.0
Harry Potter and the Half-Blood Prince        153.0
Name: duration, dtype: float64

Size will match length of result from **`unique`** method:

In [34]:
duration_unique_series.size

192
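The three methods can be compared side by side on a small example. The sketch below uses hypothetical data and also shows the **`keep`** parameter of **`drop_duplicates`**, which controls whether the first or last occurrence is retained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with repeated values and two missing values
s = pd.Series([5.0, 3.0, 5.0, np.nan, 3.0, np.nan], index=list('abcdef'))

print(s.nunique())               # 2, missing values are not counted
print(s.nunique(dropna=False))   # 3
print(len(s.unique()))           # 3, unique includes the missing value

keep_first = s.drop_duplicates().index.tolist()
keep_last = s.drop_duplicates(keep='last').index.tolist()
print(keep_first)   # ['a', 'b', 'd']
print(keep_last)    # ['c', 'e', 'f']
```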

# More exercises

### Problem 5
<span  style="color:green; font-size:16px">Randomly sample the `actor1_fb_full` Series with replacement to select three values. Use random state 54321.</span>

In [35]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">How many unique directors are there?</span>

In [36]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Select the `year` column, sort it, and drop any duplicates.</span>

In [37]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Get the same result as problem 7 by dropping duplicates first and then sorting. Which method is faster?</span>

In [38]:
# your code here