# Series Sorting, Ranking, and Uniqueness

In this chapter, we cover methods for sorting and ranking the values in a Series, along with finding unique values and removing duplicates. We begin by reading in the movie dataset, setting the title as the index, and selecting the `imdb_score` column as a Series.

In [None]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
score = movie['imdb_score']
score.head()

## Sorting

The `sort_values` method sorts the Series from least to greatest by default. It places missing values at the end. You may call it without any arguments.

In [None]:
score.sort_values().head(3)

To sort from greatest to least, set the `ascending` parameter to `False`.

In [None]:
score.sort_values(ascending=False).head(3)
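The behavior described above can be checked on a small Series that does not depend on the movie dataset. This is a minimal sketch; the name `toy` and its values are made up for illustration. It suggests that missing values stay at the end in both sort directions.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value (illustrative, not from the dataset)
toy = pd.Series([7.1, np.nan, 6.2, 8.5], index=['a', 'b', 'c', 'd'])

ascending = toy.sort_values()                  # least to greatest
descending = toy.sort_values(ascending=False)  # greatest to least

# In both results the missing value (label 'b') is the last entry
print(ascending)
print(descending)
```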

### Placing missing values first

By default, all missing values are placed at the bottom of the resulting Series. You can make them appear first by setting the `na_position` parameter to `'first'`. This is a good way to quickly view the missing values in your Series. Here, we sort the `duration` column so that its missing values come first.

In [None]:
movie['duration'].sort_values(na_position='first').head()
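As a small self-contained sketch of the same idea, the example below uses made-up durations (the name `toy` is hypothetical). Counting the missing values with `isna().sum()` and passing that count to `head` is one possible way to view exactly the missing entries.

```python
import numpy as np
import pandas as pd

# Made-up durations with two missing values (illustrative only)
toy = pd.Series([95.0, np.nan, 120.0, np.nan, 88.0])

na_first = toy.sort_values(na_position='first')

# Number of missing values, used to slice off just the missing rows
n_missing = int(toy.isna().sum())
print(na_first.head(n_missing))
```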

### Sorting the index
Since Series also have an index, pandas allows you to sort by it as well with the `sort_index` method.

In [None]:
score.sort_index().head(3)

In [None]:
score.sort_index(ascending=False).head(3)

Python uses the Unicode code point (an integer) of each character to compare strings. We can use the built-in `ord` function to find each character's code point. For instance, the character '#' has code point 35, which is less than the code points for '1' and 'A'. The movie '#Horror' has the smallest starting character code point and appears first when sorted from least to greatest.

In [None]:
ord('#')

In [None]:
ord('1')

In [None]:
ord('A')

When sorting in the opposite direction, the movie 'Æon Flux' appears first because it begins with the character 'Æ', which has code point 198, the largest of any starting character in the dataset. All lowercase letters have higher code points than all uppercase letters. Since most movie titles begin with a capital letter, the few that begin with a lowercase letter appear near the top of this descending sort.

In [None]:
ord('Æ')

In [None]:
ord('x')

In [None]:
ord('a')

In [None]:
ord('Z')
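To see the code point ordering in one place, here is a minimal sketch with a handful of titles. '#Horror' and 'Æon Flux' come from the text above; the other titles and the name `toy` are assumptions chosen only to cover a digit, an uppercase letter, and a lowercase letter.

```python
import pandas as pd

# '#Horror' and 'Æon Flux' are from the text; the rest are illustrative
titles = ['Zodiac', 'eXistenZ', '#Horror', 'Æon Flux', '10 Days', 'Avatar']
toy = pd.Series(range(len(titles)), index=titles)

# Starting characters and their code points
first_chars = {t: ord(t[0]) for t in titles}
print(first_chars)

# Ascending: symbol, digit, uppercase, lowercase, then 'Æ'
ascending_titles = toy.sort_index().index.tolist()
print(ascending_titles)

# Descending: the exact reverse
descending_titles = toy.sort_index(ascending=False).index.tolist()
print(descending_titles)
```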

## Ranking

The `rank` method provides a numerical ranking for each value in the Series. By default, it ranks the values in ascending order beginning at 1. This method is easier to understand when working on a smaller Series. Let's assign the first 10 scores to the variable name `score10`.

In [None]:
score10 = score.head(10)
score10

Every value in this Series will now be ranked from least to greatest. The lowest scoring value gets a ranking of 1, while the greatest gets a ranking of 10.

In [None]:
score10.rank()

This method can be confusing the first time it is used. First of all, it does NOT sort the data. Notice that the titles in the index are in the same order as the original.

It provides the rank, just like you would rank runners in a race. If you look at the original data, the movie Spider-Man has the lowest `imdb_score` at 6.2. In the Series resulting from the `rank` method, it gets the value 1. The next lowest score is 6.6 from the movie John Carter which results in a ranking of 2 followed by Spectre with a ranking of 3.

### Handling ties

After Spectre come Pirates of the Caribbean and Star Wars: Episode VII, which are tied with a score of 7.1. There are several methods available for ranking ties. By default, pandas uses the 'average' method, which gives each tied value the average of the ranks they would have received had they not been tied.

For example, there are two movies tied for the fourth rank. If they were not tied, they would be ranked 4 and 5. The average of these is 4.5, and each movie is given this rank. The ranking then continues at 6.

If 5 movies were tied for the fourth rank (instead of 2), their non-tied ranks would be 4, 5, 6, 7, and 8, for an average rank of 6. Each movie would be given this rank and the ranking would continue at 9.

There are actually two sets of ties in the above dataset. Avengers: Age of Ultron and Harry Potter and the Half-Blood Prince both have an `imdb_score` of 7.5 and are given the average rank of 6.5 as their non-tied ranks would be 6 and 7.
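The averaging arithmetic can be reproduced on a small Series. The sketch below uses illustrative scores chosen to mirror the tie pattern described above (two values at 7.1 and two at 7.5); they are not read from the dataset, and the names `toy` and `five_way` are hypothetical.

```python
import pandas as pd

# Illustrative scores with two pairs of ties (7.1 and 7.5)
toy = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.2, 7.8, 7.5, 7.5])

# Default method='average': tied values share the mean of their ranks
avg_rank = toy.rank()
print(avg_rank.tolist())

# Five values tied for the fourth rank share the average of 4, 5, 6, 7, 8
five_way = pd.Series([1, 2, 3, 4, 4, 4, 4, 4, 5]).rank()
print(five_way.tolist())
```

The two 7.1 values both receive 4.5, the two 7.5 values both receive 6.5, and in the second Series the five tied values all receive 6 with the next value ranked 9.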

### Change tie handling

Use the `method` parameter to change how pandas handles ties. The 'dense' method gives each tied movie the same rank and does not skip any numbers when moving to the next value. Here, the first set of ties is given rank 4, which is immediately followed by the second set, which is given rank 5.

In [None]:
score10.rank(method='dense')

There are three other methods to handle ties:
* 'min' - give each tie the minimum rank number
* 'max' - give each tie the maximum rank number
* 'first' - give tied values distinct ranks, assigned in the order they appear in the Series
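
One way to compare the tie-handling methods is to rank the same small Series with each of them and place the results side by side. This is a sketch on made-up values; the names `toy` and `comparison` are hypothetical.

```python
import pandas as pd

# Made-up values with a single two-way tie at 20
toy = pd.Series([10, 20, 20, 30])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per tie-handling method
comparison = pd.DataFrame({m: toy.rank(method=m) for m in methods})
comparison.insert(0, 'value', toy)
print(comparison)
```

Only 'dense' avoids skipping a rank after the tie, which is why its largest rank is 3 rather than 4.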

### Rank for greatest to least

For movies, it makes more sense to rank the movie with the highest score as 1, which is done by setting the `ascending` parameter to `False`.

In [None]:
score10.rank(ascending=False, method='first')
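
The cell above uses `method='first'`, which hides ties by giving every movie a distinct rank. As an alternative sketch on made-up values (the name `toy` is hypothetical), the 'min' and 'dense' methods keep ties visible while still ranking the highest value as 1.

```python
import pandas as pd

# Made-up values with a two-way tie at 20
toy = pd.Series([10, 20, 20, 30])

# Highest value gets rank 1; tied values share the lowest rank in their group
desc_min = toy.rank(ascending=False, method='min')
print(desc_min.tolist())

# Same direction, but without skipping a rank after the tie
desc_dense = toy.rank(ascending=False, method='dense')
print(desc_dense.tolist())
```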

## Uniqueness

There are a few methods that deal with unique values in a Series:

* `unique` - Returns a numpy array of all the unique values in order of their appearance
* `nunique` - Returns the number of unique values in the Series
* `drop_duplicates` - Returns a pandas Series of just the unique values

### The `unique` method

The `unique` method returns each unique value in the Series, preserving the order of its appearance. Let's select the `content_rating` column as a Series and use the `unique` method to get all the unique ratings. Interestingly, it returns a numpy array and NOT a pandas Series.

In [None]:
unique_ratings = movie['content_rating'].unique()
unique_ratings

### The `nunique` method
The `nunique` method returns the number of unique values in the Series.

In [None]:
movie['content_rating'].nunique()

You might expect the number of unique values to be the same as the length of the array returned from the `unique` method. This might not be the case, as `nunique` does not count missing values by default. Since there are missing values in this Series, `nunique` returns one less.

In [None]:
len(unique_ratings)

You can choose to count a unique missing value with `nunique` by setting the `dropna` parameter to `False`. This will add one to the count if any missing values are present.

In [None]:
movie['content_rating'].nunique(dropna=False)
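
The off-by-one relationship between `unique` and `nunique` can be seen on a small Series with missing values. This is a sketch with made-up ratings; the name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings with two missing values
toy = pd.Series(['PG', 'R', np.nan, 'PG', np.nan])

n_from_unique = len(toy.unique())         # the missing value counts once
n_default = toy.nunique()                 # missing values are ignored
n_with_na = toy.nunique(dropna=False)     # the missing value counts once

print(n_from_unique, n_default, n_with_na)
```

Note that however many missing values are present, they add at most one to the count.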

### The `drop_duplicates` method
The `drop_duplicates` method is similar to `unique` but returns a pandas Series. By default, it keeps the first occurrence of each value.

In [None]:
rating_unique_series = movie['content_rating'].drop_duplicates()
rating_unique_series.head()

It will contain the same number of values as the array returned from the `unique` method.

In [None]:
len(rating_unique_series)

### Why does it matter that `drop_duplicates` keeps the first value?
A Series is composed of both an index and the values. Both `unique` and `drop_duplicates` only consider the values of a Series. However, the index labels will likely differ for values that are the same, so order does matter with `drop_duplicates`. Set the `keep` parameter to `'last'` to keep the last occurrence of each value, or to `False` to drop every value that has a duplicate. Notice how the index for the movie rated 'G' is different.

In [None]:
movie['content_rating'].drop_duplicates(keep='last').head(7)
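
The effect of the three `keep` options is easiest to see on the index labels of a small Series. The sketch below uses made-up ratings and single-letter labels standing in for movie titles; the name `toy` is hypothetical.

```python
import pandas as pd

# Made-up ratings; 'PG' appears at labels 'a' and 'c'
toy = pd.Series(['PG', 'R', 'PG', 'G'], index=['a', 'b', 'c', 'd'])

keep_first = toy.drop_duplicates()             # keeps 'PG' at label 'a'
keep_last = toy.drop_duplicates(keep='last')   # keeps 'PG' at label 'c'
keep_none = toy.drop_duplicates(keep=False)    # drops both 'PG' rows

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```

The values kept by the first two calls are the same set, but the index label attached to 'PG' changes, and `keep=False` leaves only the values that never repeat.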

## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Select the column holding the number of reviews as a Series and sort it from greatest to least.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Find the number of unique actors in each of the actor columns. Do not count missing values. Use three separate calls to `nunique`.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Select the `year` column, sort it, and drop any duplicates.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Get the same result as Exercise 3 by dropping duplicates first and then sorting. Which method is faster?</span>

### Exercise 5

<span  style="color:green; font-size:16px">Rank each movie by duration from greatest to least and then sort this ranking from least to greatest. Output the top 10 values. Do you get the same result by sorting the duration from greatest to least?</span>

### Exercise 6

<span  style="color:green; font-size:16px">Select actor1 as a Series and sort it from least to greatest, but have missing values show up first. Output the first 10 values.</span>