# Series Sorting, Ranking, and Uniqueness

In this chapter, we cover important methods for sorting and ranking the values in our Series, along with finding unique values and removing duplicates. We read in the movie dataset, set the title as the index, and select the `imdb_score` column as a Series.

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
score = movie['imdb_score']
score.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

In [2]:
type(score)

pandas.core.series.Series

## Sorting

The `sort_values` method sorts the Series from least to greatest by default. It places missing values at the end. You may call it without any arguments.

In [3]:
score.sort_values().head(3)

title
Justin Bieber: Never Say Never    1.6
Foodfight!                        1.7
Disaster Movie                    1.9
Name: imdb_score, dtype: float64

To sort from greatest to least, set the `ascending` parameter to `False`.

In [4]:
score.sort_values(ascending=False).head(3)

title
Towering Inferno            9.5
The Shawshank Redemption    9.3
The Godfather               9.2
Name: imdb_score, dtype: float64

### Making missing values appear first

By default, all missing values are placed at the end of the resulting Series when sorting. You can change this so that they appear first by setting the `na_position` parameter to `'first'`. This is a good way to quickly view all the missing values in your Series. Here, we sort the `duration` column so that its missing values come first.

In [5]:
movie['duration'].sort_values(na_position='first').head()

title
Star Wars: Episode VII - The Force Awakens      NaN
Harry Potter and the Deathly Hallows: Part II   NaN
Harry Potter and the Deathly Hallows: Part I    NaN
Black Water Transit                             NaN
War & Peace                                     NaN
Name: duration, dtype: float64
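
As a minimal, self-contained sketch of the same idea, the toy Series `s` below (made up for illustration, not part of the movie dataset) shows where missing values land with and without `na_position='first'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset
s = pd.Series([3.0, np.nan, 1.0, 2.0, np.nan], index=list('abcde'))

default_sorted = s.sort_values()                  # missing values placed last
na_first = s.sort_values(na_position='first')     # missing values placed first

# The non-missing values are sorted the same way in both results
print(default_sorted.dropna().tolist())  # [1.0, 2.0, 3.0]
print(na_first.dropna().tolist())        # [1.0, 2.0, 3.0]

# Only the position of the missing values changes
print(default_sorted.tail(2).isna().all())  # True
print(na_first.head(2).isna().all())        # True
```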

### Sorting the index

Since Series also have an index, pandas allows you to sort by it as well with the `sort_index` method.

In [7]:
score.sort_index().head(3)

title
#Horror                  3.3
10 Cloverfield Lane      7.3
10 Days in a Madhouse    7.5
Name: imdb_score, dtype: float64

As with `sort_values`, the same `ascending` parameter exists to change the sort from greatest to least.

In [8]:
score.sort_index(ascending=False).head(3)

title
Æon Flux                   5.5
xXx: State of the Union    4.3
xXx                        5.8
Name: imdb_score, dtype: float64

Python uses the Unicode code point (an integer) of each character to compare strings. We can use the built-in `ord` function to find the code point of a character. For instance, the character `'#'` has code point 35, which is less than the code points of the characters `'1'` and `'A'`. The movie '#Horror' has the smallest starting character code point and appears first when sorted from least to greatest.

In [9]:
ord('#')

35

In [10]:
ord('1')

49

In [11]:
ord('A')

65

When sorting in the opposite direction, the movie `'Æon Flux'` appears first because it begins with the 'Æ' character, which has code point 198 and is the largest starting character in our Series. Also, all ASCII lowercase letters (a-z) have higher code points than all ASCII uppercase letters (A-Z), so movies that begin with a lowercase letter, such as 'xXx', appear near the top.

In [12]:
ord('Æ')

198

In [13]:
ord('x')

120

In [14]:
ord('a')

97

In [15]:
ord('Z')

90
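
The code point ordering can be checked on a small example. The sketch below uses made-up labels (not titles from the dataset) and assumes pandas 1.1 or later, where `sort_index` accepts a `key` function. Passing a key that lowercases the labels is one possible way to get a case-insensitive sort.

```python
import pandas as pd

# Hypothetical labels chosen to mix uppercase and lowercase first letters
fruit = pd.Series([1, 2, 3, 4], index=['banana', 'Apple', 'cherry', 'Zebra'])

# Default: labels are compared by Unicode code point, so 'Z' (90) sorts before 'b' (98)
by_code_point = fruit.sort_index()
print(by_code_point.index.tolist())
# ['Apple', 'Zebra', 'banana', 'cherry']

# Assumption: pandas >= 1.1, where `key` is applied to the index before sorting
case_insensitive = fruit.sort_index(key=lambda labels: labels.str.lower())
print(case_insensitive.index.tolist())
# ['Apple', 'banana', 'cherry', 'Zebra']
```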

## Ranking

The `rank` method provides a numerical ranking for each value in the Series. By default, it ranks the values in ascending order beginning at 1. This method is easier to understand when working on a smaller Series. Let's assign the first 10 scores to the variable name `score10`.

In [16]:
score10 = score.head(10)
score10

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
John Carter                                   6.6
Spider-Man 3                                  6.2
Tangled                                       7.8
Avengers: Age of Ultron                       7.5
Harry Potter and the Half-Blood Prince        7.5
Name: imdb_score, dtype: float64

Every value in this Series will now be ranked from least to greatest. The movie with the lowest score gets a ranking of 1, while the greatest gets a ranking of 10.

In [17]:
score10.rank()

title
Avatar                                         9.0
Pirates of the Caribbean: At World's End       4.5
Spectre                                        3.0
The Dark Knight Rises                         10.0
Star Wars: Episode VII - The Force Awakens     4.5
John Carter                                    2.0
Spider-Man 3                                   1.0
Tangled                                        8.0
Avengers: Age of Ultron                        6.5
Harry Potter and the Half-Blood Prince         6.5
Name: imdb_score, dtype: float64

This method can be confusing the first time it is used. First of all, it does NOT sort the data. Notice that the titles in the index are in the same order as the original.

It provides the *rank*, just like you would rank runners in a race. If you look at the original data, the movie Spider-Man 3 has the lowest `imdb_score` at 6.2. In the Series resulting from the `rank` method, it gets the value 1. The next lowest score is 6.6 from the movie John Carter, which results in a ranking of 2, followed by Spectre with a ranking of 3.

### Ranking ties

After Spectre, Pirates of the Caribbean: At World's End and Star Wars: Episode VII - The Force Awakens are tied with a score of 7.1. There are several methods available to choose how ties are ranked. By default, pandas uses the 'average' method, which gives each tied value the average of the ranks the values would have received had they not been tied.

For example, there are two movies tied for the fourth rank. If they were not tied, they would be ranked 4 and 5. Averaging these produces 4.5, and both movies are given this rank. From here, the ranking continues at 6.

If five movies were tied for the fourth rank (instead of two), their non-tied ranks would be 4, 5, 6, 7, and 8, for an average rank of 6. Each of the five movies would be given this rank. The ranking would then continue with 9.

There are actually two sets of ties in the above dataset. Avengers: Age of Ultron and Harry Potter and the Half-Blood Prince both have an `imdb_score` of 7.5 and are given the average rank of 6.5 as their non-tied ranks would be 6 and 7.
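
The averaging described above can be reproduced by hand. The sketch below rebuilds the ten scores without their titles, assigns untied ranks with `method='first'`, and then averages those ranks within each group of equal scores. Using `groupby` here is just one convenient way to do the averaging, not necessarily how pandas computes it internally.

```python
import pandas as pd

# The ten scores shown above, in their original order (titles omitted)
scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7.8, 7.5, 7.5])

# Ranks each value would receive if ties were broken by order of appearance
untied = scores.rank(method='first')

# Average the untied ranks within each group of equal scores
manual_average = untied.groupby(scores).transform('mean')

print(manual_average.tolist())
# [9.0, 4.5, 3.0, 10.0, 4.5, 2.0, 1.0, 8.0, 6.5, 6.5]

# Matches the default 'average' method
print((manual_average == scores.rank()).all())  # True
```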

### Change tie method

The `method` parameter changes how pandas handles ties. With the 'dense' method, each tied movie is given the same rank, and the ranks that follow do not skip any numbers. Here, the first set of ties is given rank 4, which is immediately followed by the second set, which is given rank 5. The highest numerical rank is only 8 using dense ranking and not 10 as before.

In [18]:
score10.rank(method='dense')

title
Avatar                                        7.0
Pirates of the Caribbean: At World's End      4.0
Spectre                                       3.0
The Dark Knight Rises                         8.0
Star Wars: Episode VII - The Force Awakens    4.0
John Carter                                   2.0
Spider-Man 3                                  1.0
Tangled                                       6.0
Avengers: Age of Ultron                       5.0
Harry Potter and the Half-Blood Prince        5.0
Name: imdb_score, dtype: float64

In [19]:
score10.rank(method='min')

title
Avatar                                         9.0
Pirates of the Caribbean: At World's End       4.0
Spectre                                        3.0
The Dark Knight Rises                         10.0
Star Wars: Episode VII - The Force Awakens     4.0
John Carter                                    2.0
Spider-Man 3                                   1.0
Tangled                                        8.0
Avengers: Age of Ultron                        6.0
Harry Potter and the Half-Blood Prince         6.0
Name: imdb_score, dtype: float64

In [20]:
score10.rank(method='max')

title
Avatar                                         9.0
Pirates of the Caribbean: At World's End       5.0
Spectre                                        3.0
The Dark Knight Rises                         10.0
Star Wars: Episode VII - The Force Awakens     5.0
John Carter                                    2.0
Spider-Man 3                                   1.0
Tangled                                        8.0
Avengers: Age of Ultron                        7.0
Harry Potter and the Half-Blood Prince         7.0
Name: imdb_score, dtype: float64

In [21]:
score10.rank(method='first')

title
Avatar                                         9.0
Pirates of the Caribbean: At World's End       4.0
Spectre                                        3.0
The Dark Knight Rises                         10.0
Star Wars: Episode VII - The Force Awakens     5.0
John Carter                                    2.0
Spider-Man 3                                   1.0
Tangled                                        8.0
Avengers: Age of Ultron                        6.0
Harry Potter and the Half-Blood Prince         7.0
Name: imdb_score, dtype: float64

The three cells above use the other methods available for handling ties:

* 'min' - give each tied value the lowest rank in its group
* 'max' - give each tied value the highest rank in its group
* 'first' - break ties by order of appearance, so the tied value that comes first in the Series gets the lower rank
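
To compare the tie methods side by side, one option is to collect them as columns of a DataFrame. This is only a sketch: it rebuilds the ten scores without their titles, and the names `methods`, `comparison`, and `tied` are chosen here for illustration.

```python
import pandas as pd

# The ten scores shown above, in their original order (titles omitted)
scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7.8, 7.5, 7.5])

# One column of ranks per tie-handling method
methods = ['average', 'min', 'max', 'first', 'dense']
comparison = pd.DataFrame({m: scores.rank(method=m) for m in methods})
comparison.insert(0, 'score', scores)

# Keep only the rows whose score appears more than once
tied = comparison[scores.duplicated(keep=False)]
print(tied)
```

For the two movies scoring 7.1, this gives 4.5 under 'average', 4 under 'min', 5 under 'max', 4 and 5 under 'first', and 4 under 'dense', in agreement with the outputs above.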

### Rank from greatest to least

For movies, it makes more sense to rank the movie with the highest score as 1, which is done by setting the `ascending` parameter to `False`. The 'min' method of ranking is used to handle ties.

In [22]:
score10.rank(ascending=False, method='min')

title
Avatar                                         2.0
Pirates of the Caribbean: At World's End       6.0
Spectre                                        8.0
The Dark Knight Rises                          1.0
Star Wars: Episode VII - The Force Awakens     6.0
John Carter                                    9.0
Spider-Man 3                                  10.0
Tangled                                        3.0
Avengers: Age of Ultron                        4.0
Harry Potter and the Half-Blood Prince         4.0
Name: imdb_score, dtype: float64
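
One way to read a descending 'min' rank is as one plus the number of movies with a strictly higher score. The sketch below checks this interpretation on the ten scores (rebuilt here without their titles). It is meant as a sanity check, not a description of how pandas computes ranks.

```python
import pandas as pd

# The ten scores shown above, in their original order (titles omitted)
scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7.8, 7.5, 7.5])

ranked = scores.rank(ascending=False, method='min')

# For each score, count how many scores are strictly greater, then add 1
manual = scores.apply(lambda value: (scores > value).sum() + 1).astype(float)

print(manual.tolist())
# [2.0, 6.0, 8.0, 1.0, 6.0, 9.0, 10.0, 3.0, 4.0, 4.0]
print((ranked == manual).all())  # True
```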

## Uniqueness

There are a few methods that deal with unique values in a Series:

* `unique` - Returns a numpy array of all the unique values in order of their appearance
* `nunique` - Returns the number of unique values in the Series
* `drop_duplicates` - Returns a pandas Series of just the unique values

### The `unique` method

The `unique` method returns each unique value in the Series preserving the order of its appearance. Let's select the `content_rating` column as a Series and use the unique method to get all the unique ratings. Interestingly, it returns a numpy array and NOT a pandas Series.

In [23]:
unique_ratings = movie['content_rating'].unique()
unique_ratings

array(['PG-13', nan, 'PG', 'G', 'R', 'TV-14', 'TV-PG', 'TV-MA', 'TV-G',
       'Not Rated', 'Unrated', 'Approved', 'TV-Y', 'NC-17', 'X', 'TV-Y7',
       'GP', 'Passed', 'M'], dtype=object)

### The `nunique` method

The `nunique` method returns the number of unique values in the Series.

In [24]:
movie['content_rating'].nunique()

18

You might expect the number of unique values to be the same as the length of the array returned from the `unique` method. This might not be the case, as `nunique` does not count missing values by default. Since there are missing values in this Series, `nunique` returns one less. Also note that `nunique` is considered an aggregating method, as it returns a single value.

In [None]:
len(unique_ratings)

You can choose to count a single unique missing value with `nunique` by setting the `dropna` parameter to `False`. This will add one to the count if any missing values are present.

In [25]:
movie['content_rating'].nunique(dropna=False)

19
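
The relationship between `unique`, `nunique`, and the `dropna` parameter can be seen on a small, made-up Series of ratings (hypothetical data, not the movie dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with two missing values
ratings = pd.Series(['PG', 'R', np.nan, 'PG', 'G', np.nan])

print(len(ratings.unique()))          # 4, the missing value appears once in the array
print(ratings.nunique())              # 3, missing values are not counted
print(ratings.nunique(dropna=False))  # 4, missing values are counted once
```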

### The `drop_duplicates` method

The `drop_duplicates` method is similar to `unique` but returns a pandas Series. By default, it keeps the first occurrence of each value.

In [26]:
duration_unique_series = movie['content_rating'].drop_duplicates()
duration_unique_series.head()

title
Avatar                                        PG-13
Star Wars: Episode VII - The Force Awakens      NaN
Tangled                                          PG
Monsters University                               G
The Lovers                                        R
Name: content_rating, dtype: object

It does not discard missing values, so it returns the same number of values as the array returned from the `unique` method, which is also the count given by `nunique` when `dropna` is set to `False`.

In [None]:
len(duration_unique_series)

### Why does it matter that `drop_duplicates` keeps the first value?

A Series is composed of both an index and the values. Both `unique` and `drop_duplicates` only consider the values of a Series. However, equal values usually have different index labels, so which occurrence is kept matters with `drop_duplicates`. Set the `keep` parameter to `'last'` to keep the last occurrence of each value, or to `False` to drop every value that appears more than once. Notice how the index for the movie rated 'G' is different.

In [27]:
movie['content_rating'].drop_duplicates(keep='last').head(7)

title
Arthur                     TV-Y
The Powerpuff Girls       TV-Y7
Hang 'Em High                 M
Home Movies               TV-PG
The Broadway Melody      Passed
Happy Valley              TV-MA
Sunday School Musical         G
Name: content_rating, dtype: object
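
The three settings of `keep` are easier to compare on a tiny example. The Series below is made up for illustration, with hypothetical labels `m1` through `m5` standing in for movie titles:

```python
import pandas as pd

# Hypothetical toy ratings; 'PG' and 'R' each appear twice
ratings = pd.Series(['PG', 'R', 'PG', 'G', 'R'],
                    index=['m1', 'm2', 'm3', 'm4', 'm5'])

keep_first = ratings.drop_duplicates()             # keep='first' is the default
keep_last = ratings.drop_duplicates(keep='last')   # keep the last occurrence
keep_none = ratings.drop_duplicates(keep=False)    # drop every repeated value

print(keep_first.index.tolist())  # ['m1', 'm2', 'm4']
print(keep_last.index.tolist())   # ['m3', 'm4', 'm5']
print(keep_none.index.tolist())   # ['m4']
```

The values kept by `'first'` and `'last'` are the same set of ratings, but the index labels attached to them differ.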

### Preference for `drop_duplicates`

Both the `unique` and `drop_duplicates` methods accomplish similar tasks. The `unique` method returns its results as a numpy array, which isn't ideal. In my opinion, it's better to keep all your results as pandas objects (Series or DataFrames). Working with fewer types of objects makes your data analysis easier. Additionally, the `drop_duplicates` method is more flexible with the availability of the `keep` parameter.

The `unique` method is only available for Series, unlike `drop_duplicates`, which is available for both Series and DataFrames. Because the `unique` method's functionality is a subset of `drop_duplicates`, I recommend using `drop_duplicates`.

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Select the column holding the number of reviews as a Series and sort it from greatest to least.</span>

In [28]:
movie.head(2)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1


In [30]:
reviews = movie['num_reviews']
reviews.sort_values(ascending=False)

title
The Dark Knight Rises                813.0
Prometheus                           775.0
Django Unchained                     765.0
Skyfall                              750.0
Mad Max: Fury Road                   739.0
                                     ...  
Her Cry: La Llorona Investigation      NaN
Dutch Kills                            NaN
The Ridges                             NaN
On the Downlow                         NaN
The Mongol King                        NaN
Name: num_reviews, Length: 4916, dtype: float64

### Exercise 2

<span style="color:green; font-size:16px">Find the number of unique actors in each of the actor columns. Do not count missing values. Use three separate calls to `nunique`.</span>

In [31]:
movie['actor1'].nunique()

2095

In [32]:
movie['actor2'].nunique()

3030

In [33]:
movie['actor3'].nunique()

3519

In [34]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Exercise 3
<span style="color:green; font-size:16px">Select the `year` column, sort it, and drop any duplicates.</span>

In [36]:
movie['year'].sort_values().drop_duplicates()

title
Intolerance: Love's Struggle Throughout the Ages    1916.0
Over the Hill to the Poorhouse                      1920.0
The Big Parade                                      1925.0
Metropolis                                          1927.0
The Broadway Melody                                 1929.0
                                                     ...  
The Wolverine                                       2013.0
An American in Hollywood                            2014.0
The Lovers                                          2015.0
Pete's Dragon                                       2016.0
Star Wars: Episode VII - The Force Awakens             NaN
Name: year, Length: 92, dtype: float64

### Exercise 4
<span  style="color:green; font-size:16px">Get the same result as Exercise 3 by dropping duplicates first and then sorting it. Which method is faster?</span>

In [37]:
movie['year'].drop_duplicates().sort_values()

title
Intolerance: Love's Struggle Throughout the Ages    1916.0
Over the Hill to the Poorhouse                      1920.0
The Big Parade                                      1925.0
Metropolis                                          1927.0
Pandora's Box                                       1929.0
                                                     ...  
The Lone Ranger                                     2013.0
The Hobbit: The Battle of the Five Armies           2014.0
Spectre                                             2015.0
Batman v Superman: Dawn of Justice                  2016.0
Star Wars: Episode VII - The Force Awakens             NaN
Name: year, Length: 92, dtype: float64
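
Note that the two results hold the same years but different titles in the index, because sorting first changes which occurrence of each year counts as the "first" one. To compare speed outside a notebook, one option is the standard library's `timeit` module. The sketch below uses randomly generated years as a hypothetical stand-in for the `year` column, so the timings it prints are only indicative and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for movie['year']: 5,000 random years, no missing values
rng = np.random.default_rng(0)
years = pd.Series(rng.integers(1916, 2017, size=5000).astype(float))

def sort_then_drop():
    # Sort all values first, then remove duplicates
    return years.sort_values().drop_duplicates()

def drop_then_sort():
    # Remove duplicates first, then sort the much shorter result
    return years.drop_duplicates().sort_values()

# Total time in seconds for 20 runs of each approach
print(timeit.timeit(sort_then_drop, number=20))
print(timeit.timeit(drop_then_sort, number=20))

# Both approaches produce the same values, though index labels can differ
same_values = sort_then_drop().tolist() == drop_then_sort().tolist()
print(same_values)  # True
```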

### Exercise 5

<span style="color:green; font-size:16px">Rank each movie by duration from greatest to least and then sort this ranking from least to greatest. Output the top 10 values. Do you get the same result by sorting the duration from greatest to least?</span>

In [38]:
movie['duration'].rank(ascending=False).sort_values().head(10)

title
Trapped                      1.0
Carlos                       2.0
Blood In, Blood Out          3.0
Heaven's Gate                4.0
The Legend of Suriyothai     5.0
Das Boot                     6.0
Apocalypse Now               7.0
The Company                  8.0
Gods and Generals            9.0
Gettysburg                  10.0
Name: duration, dtype: float64

In [40]:
movie['duration'].sort_values(ascending=False).head(10)

title
Trapped                     511.0
Carlos                      334.0
Blood In, Blood Out         330.0
Heaven's Gate               325.0
The Legend of Suriyothai    300.0
Das Boot                    293.0
Apocalypse Now              289.0
The Company                 286.0
Gods and Generals           280.0
Gettysburg                  271.0
Name: duration, dtype: float64

### Exercise 6

<span style="color:green; font-size:16px">Select actor1 as a Series and sort it from least to greatest, but have missing values appear first. Output the first 10 values.</span>

In [41]:
actor1 = movie['actor1'] 

In [43]:
actor1.sort_values(na_position='first').head(10)

title
Pink Ribbons, Inc.                  NaN
Sex with Strangers                  NaN
The Harvest/La Cosecha              NaN
Ayurveda: Art of Being              NaN
The Brain That Sings                NaN
The Blood of My Brother             NaN
Counting                            NaN
Get Rich or Die Tryin'          50 Cent
The Good Dinosaur          A.J. Buckley
Queen of the Damned             Aaliyah
Name: actor1, dtype: object