# More Series Methods

In this chapter, we cover several more Series methods that are less common but still useful and important for analyzing data with pandas.

* `agg` - Computes multiple aggregations at once
* `idxmin`/`idxmax` - Return the index label of the min/max value
* `diff`/`pct_change` - Find the difference/percent change from one value to the next
* `sample` - Randomly samples values in a Series
* `nsmallest`/`nlargest` - Return the smallest/largest `n` values
* `replace` - Replaces one or more values in a variety of ways

Let's begin by reading in the movie dataset and selecting the `imdb_score` column as a Series.

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
score = movie['imdb_score']
score.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

## The `agg` method

The `agg` method allows you to compute several aggregations simultaneously. Provide it a list with the names of the aggregation methods as **strings**. For instance, the following computes the minimum and maximum, returning the result as a Series.

In [2]:
score.agg(['min', 'max'])

min    1.6
max    9.5
Name: imdb_score, dtype: float64

You may provide any number of aggregation methods to the `agg` method, which is similar to `describe`, but calculates just the aggregations you desire.

In [3]:
score.agg(['min', 'max', 'count', 'nunique'])

min           1.6
max           9.5
count      4916.0
nunique      78.0
Name: imdb_score, dtype: float64
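As a minimal sketch of the same idea on a self-contained Series (the values below are made up for illustration and are not read from the movie dataset), passing a list of names returns a Series labeled by those names, while passing a single string returns just one value.

```python
import pandas as pd

# Hypothetical scores, not taken from the movie dataset
s = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1])

# A list of method names returns a Series indexed by those names
result = s.agg(['min', 'max', 'count', 'nunique'])
print(result)

# A single string returns only that one aggregated value
single = s.agg('max')
print(single)
```

Because all results are stored in one Series, integer results such as `count` are shown as floats alongside the other values, just as in the output above.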

## The index of the minimum and maximum

The `min` and `max` methods return the minimum and maximum values of a Series. Occasionally, you'll want to know the index label of these values, which you can find with the `idxmin` and `idxmax` methods. Let's find the names of the movies with the worst and best scores.

In [4]:
score.idxmin()

'Justin Bieber: Never Say Never'

In [5]:
score.idxmax()

'Towering Inferno'

Let's verify these results by dropping any missing values and sorting the Series.

In [6]:
score_sorted = score.dropna().sort_values(ascending=False)

We can now output the first and last few values to verify.

In [7]:
score_sorted.head(3)

title
Towering Inferno            9.5
The Shawshank Redemption    9.3
The Godfather               9.2
Name: imdb_score, dtype: float64

In [8]:
score_sorted.tail(3)

title
The Helix... Loaded               1.9
Foodfight!                        1.7
Justin Bieber: Never Say Never    1.6
Name: imdb_score, dtype: float64

Both `idxmin` and `idxmax` always return a single index label. If two or more values share the min/max, then pandas returns the index label that appears first in the Series. Since a single value is returned, `idxmin` and `idxmax` are considered aggregation methods.
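The tie-breaking behavior can be checked on a tiny example. This is only a sketch with a made-up Series and hypothetical labels. The final line uses a boolean comparison, which is one possible way to see every tied label and is not something `idxmin` does for you.

```python
import pandas as pd

# Hypothetical Series where two labels share the minimum and two share the maximum
s = pd.Series([3, 1, 5, 1, 5], index=['a', 'b', 'c', 'd', 'e'])

first_min = s.idxmin()  # label of the first occurrence of the minimum
first_max = s.idxmax()  # label of the first occurrence of the maximum
print(first_min, first_max)

# To list every label tied for the minimum, compare against the minimum value
all_min = s[s == s.min()].index.tolist()
print(all_min)
```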

## The `nsmallest` and `nlargest` methods

The `nsmallest` and `nlargest` methods are convenience methods that quickly return the `n` smallest or largest values in a Series, in sorted order. By default, they return 5 values. Use the parameter `n` to choose how many values to return. Here, we select the top 4 movies by score.

In [9]:
score.nlargest(n=4)

title
Towering Inferno            9.5
The Shawshank Redemption    9.3
The Godfather               9.2
Dekalog                     9.1
Name: imdb_score, dtype: float64

By default, `nlargest` and `nsmallest` return exactly `n` values even if there are ties. Let's produce a similar result by calling `sort_values` and returning the first five values. You'll notice that two movies are tied for the fourth highest score. By default, `nlargest` returns the first one.

In [10]:
score.sort_values(ascending=False).head()

title
Towering Inferno            9.5
The Shawshank Redemption    9.3
The Godfather               9.2
Dekalog                     9.1
Kickboxer: Vengeance        9.1
Name: imdb_score, dtype: float64

If you'd like to keep the top `n` values and ties, set the `keep` parameter to the string `'all'`. There is only one other movie with a value of 9.1, but if there were more, all of them would be returned here.

In [11]:
score.nlargest(n=4, keep='all')

title
Towering Inferno            9.5
The Shawshank Redemption    9.3
The Godfather               9.2
Dekalog                     9.1
Kickboxer: Vengeance        9.1
Name: imdb_score, dtype: float64

The `nsmallest` method works analogously and returns the smallest `n` values.

In [12]:
score.nsmallest(n=3)

title
Justin Bieber: Never Say Never    1.6
Foodfight!                        1.7
Disaster Movie                    1.9
Name: imdb_score, dtype: float64

By default, the first tie is kept, but setting `keep` to `'last'` returns the last occurrence of the nth ranked value. Notice the last index label is different than above.

In [13]:
score.nsmallest(n=3, keep='last')

title
Justin Bieber: Never Say Never    1.6
Foodfight!                        1.7
The Helix... Loaded               1.9
Name: imdb_score, dtype: float64
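To see the three `keep` options side by side, here is a small sketch on a made-up Series in which three labels are tied for third place. The data and labels are hypothetical.

```python
import pandas as pd

# Hypothetical scores: 'b', 'd', and 'e' are tied at 9.1
s = pd.Series([9.5, 9.1, 9.3, 9.1, 9.1, 8.0], index=list('abcdef'))

first = s.nlargest(n=3)                    # default keep='first'
last = s.nlargest(n=3, keep='last')        # keep the last tied occurrence
everything = s.nlargest(n=3, keep='all')   # keep every tied value

print(first.index.tolist())
print(last.index.tolist())
print(len(everything))
```

With `keep='all'`, the result can be longer than `n`, so it is worth checking the length when the exact number of rows matters.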

## Differencing methods `diff` and `pct_change`

The `diff` method takes the difference between the current value and some other value. By default, the other value is the immediately preceding one. The first value in the Series has no previous value, so its difference will be missing in the result. Let's read in a small sample of Microsoft's stock dataset, found in the stocks folder, containing 10 trading days' worth of information.

In [14]:
msft = pd.read_csv('../data/stocks/msft_sample.csv')
msft

Unnamed: 0,date,open,high,low,close,adjusted_close,volume,dividend_amount
0,2019-10-08,137.08,137.76,135.62,135.67,135.67,25550500,0.0
1,2019-10-09,137.46,138.7,136.97,138.24,138.24,19749900,0.0
2,2019-10-10,138.49,139.67,138.25,139.1,139.1,17654600,0.0
3,2019-10-11,140.12,141.03,139.5,139.68,139.68,25446000,0.0
4,2019-10-14,139.69,140.29,139.52,139.55,139.55,13304300,0.0
5,2019-10-15,140.06,141.79,139.81,141.57,141.57,19695700,0.0
6,2019-10-16,140.79,140.99,139.53,140.41,140.41,20751600,0.0
7,2019-10-17,140.95,141.42,139.02,139.69,139.69,21460600,0.0
8,2019-10-18,139.76,140.0,136.5638,137.41,137.41,27654449,0.0
9,2019-10-21,138.45,138.5,137.01,138.39,138.39,20668059,0.0


Let's select the `adjusted_close` column as a Series and call the `diff` method on it. The difference between the second and first values is 2.57 and is now the new second value in the returned Series.

In [16]:
ac = msft['adjusted_close']
ac.diff()

0     NaN
1    2.57
2    0.86
3    0.58
4   -0.13
5    2.02
6   -1.16
7   -0.72
8   -2.28
9    0.98
Name: adjusted_close, dtype: float64

It's possible to control which two values are subtracted. By default, the `periods` parameter is set to 1. Here, we change it to 3. The first possible difference happens between the fourth (139.68) and first (135.67) values, resulting in 4.01. The first three values do not have a value three positions before them, so their differences are missing.

In [17]:
ac.diff(periods=3)

0     NaN
1     NaN
2     NaN
3    4.01
4    1.31
5    2.47
6    0.73
7    0.14
8   -4.16
9   -2.02
Name: adjusted_close, dtype: float64

We can take the difference between the current value and a value further ahead by using negative integers. Here, we take the current value and subtract the second value following it. The last two values are missing as they do not have two values ahead.

In [18]:
ac.diff(-2)

0   -3.43
1   -1.44
2   -0.45
3   -1.89
4   -0.86
5    1.88
6    3.00
7    1.30
8     NaN
9     NaN
Name: adjusted_close, dtype: float64
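One way to reason about the `periods` parameter is that `diff` behaves like subtracting a shifted copy of the Series from itself. The sketch below rebuilds the `adjusted_close` values shown above by hand so that it runs without the data file, and it uses the `shift` method purely for illustration.

```python
import pandas as pd

# Adjusted closing prices copied from the msft sample shown above
ac = pd.Series([135.67, 138.24, 139.10, 139.68, 139.55,
                141.57, 140.41, 139.69, 137.41, 138.39])

# Positive periods: current value minus the value 3 positions earlier
lag3 = ac - ac.shift(3)

# Negative periods: current value minus the value 2 positions later
lead2 = ac - ac.shift(-2)

# Both should match the results of diff
pd.testing.assert_series_equal(ac.diff(periods=3), lag3)
pd.testing.assert_series_equal(ac.diff(-2), lead2)
print(lag3.round(2).tolist())
```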

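The `pct_change` method works analogously but returns the relative change instead: the difference divided by the other value. The result is expressed as a fraction, so 0.018943 means an increase of about 1.9%.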

In [19]:
ac.pct_change()

0         NaN
1    0.018943
2    0.006221
3    0.004170
4   -0.000931
5    0.014475
6   -0.008194
7   -0.005128
8   -0.016322
9    0.007132
Name: adjusted_close, dtype: float64

In [None]:
ac.pct_change(-2)
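As a rough check of what `pct_change` computes, the sketch below compares it with a manual calculation on the first five `adjusted_close` values from the sample above. The use of `diff` and `shift` here is only one way to express the formula, not necessarily how pandas implements it.

```python
import pandas as pd

# First five adjusted closing prices from the msft sample shown above
ac = pd.Series([135.67, 138.24, 139.10, 139.68, 139.55])

# Fractional change: (current - previous) / previous
manual = ac.diff() / ac.shift()
builtin = ac.pct_change()

# Negative periods compare against a later value instead
manual_lead = (ac - ac.shift(-2)) / ac.shift(-2)
builtin_lead = ac.pct_change(-2)

# Multiply by 100 to express the change as a percentage
as_percent = builtin * 100
print(as_percent.round(2).tolist())
```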

## Randomly sample a Series

The `sample` method is great for randomly sampling the values in your Series. Set the `n` parameter of the `sample` method to an integer to return that many randomly selected values.

In [20]:
score.sample(n=5)

title
I Origins                         7.3
Alleluia! The Devil's Carnival    7.4
Highlander                        7.2
Rise of the Guardians             7.3
Shopgirl                          6.4
Name: imdb_score, dtype: float64

By default, the sampling is done without replacement, so there is no possibility of selecting the same item twice. If you attempt to choose a sample larger than the number of values in the Series, you'll get an error.

In [21]:
score.sample(n=5000)

ValueError: Cannot take a larger sample than population when 'replace=False'

However, you can sample with replacement, meaning that you can get duplicate items, by setting the `replace` parameter to `True`.

In [22]:
score.sample(n=5000, replace=True).head()

title
Your Highness                       5.6
Event Horizon                       6.7
Perfume: The Story of a Murderer    7.5
The Phantom                         4.9
The Ladykillers                     6.2
Name: imdb_score, dtype: float64
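The difference between the two sampling modes is easiest to see on a very small Series. The following sketch uses made-up values and labels, and it sets `random_state` only so that the sample is reproducible.

```python
import pandas as pd

# Hypothetical three-value Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Without replacement, asking for more values than exist raises a ValueError
try:
    s.sample(n=5)
    raised = False
except ValueError:
    raised = True

# With replacement, n may exceed the length, so some labels must repeat
big = s.sample(n=5, replace=True, random_state=0)
has_duplicates = big.index.duplicated().any()
print(raised, len(big), has_duplicates)
```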

You can also sample a fraction of the dataset with the `frac` parameter. Here we take a random sample of 15% of the data.

In [23]:
score_sample = score.sample(frac=0.15)
score_sample.head()

title
Aliens in the Attic           5.4
Good Intentions               5.2
The Family                    7.5
Blades of Glory               6.3
Curse of the Golden Flower    7.0
Name: imdb_score, dtype: float64

Let's verify that the sample is indeed 15% of the total length of the original.

In [24]:
len(score_sample)

737

In [25]:
len(score) * 0.15

737.4
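The sample length is a whole number, so pandas rounds the product of `frac` and the length of the Series. The sketch below uses a stand-in Series of integers with the same length as `score` (4916 values) rather than the real data, and sets `random_state` to show that the same seed reproduces the same sample.

```python
import pandas as pd

# Stand-in Series with the same number of values as the score Series
s = pd.Series(range(4916))

sub = s.sample(frac=0.15, random_state=1)

# frac * len(s) is rounded to a whole number of values
print(len(sub), round(len(s) * 0.15))

# The same random_state selects the same values again
again = s.sample(frac=0.15, random_state=1)
print(sub.equals(again))
```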

## The `replace` method

The `replace` method replaces particular values in the Series with other values. There are a lot of options with the `replace` method to handle many different kinds of replacement. Let's select the color column from the movie dataset as a Series.

In [26]:
color = movie['color']
color.head()

title
Avatar                                        Color
Pirates of the Caribbean: At World's End      Color
Spectre                                       Color
The Dark Knight Rises                         Color
Star Wars: Episode VII - The Force Awakens      NaN
Name: color, dtype: object

The simplest way to replace a value in the Series is to pass the `replace` method two arguments. The first is the value you'd like to replace and the second is the replacement value. Here, we replace the exact string 'Color' with 'Colour'.

In [27]:
color.replace('Color', 'Colour').head()

title
Avatar                                        Colour
Pirates of the Caribbean: At World's End      Colour
Spectre                                       Colour
The Dark Knight Rises                         Colour
Star Wars: Episode VII - The Force Awakens       NaN
Name: color, dtype: object

The `replace` method works with Series of all data types. Here we use the `score` Series to replace the value 7.1 with 999.

In [28]:
score.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

In [29]:
score.replace(7.1, 999).head()

title
Avatar                                          7.9
Pirates of the Caribbean: At World's End      999.0
Spectre                                         6.8
The Dark Knight Rises                           8.5
Star Wars: Episode VII - The Force Awakens    999.0
Name: imdb_score, dtype: float64

You might think you can replace specific words within strings, and you would be correct, but doing so necessitates more effort. Let's take a look at the `genres` column as a Series.

In [30]:
genres = movie['genres']
genres.head()

title
Avatar                                        Action|Adventure|Fantasy|Sci-Fi
Pirates of the Caribbean: At World's End             Action|Adventure|Fantasy
Spectre                                             Action|Adventure|Thriller
The Dark Knight Rises                                         Action|Thriller
Star Wars: Episode VII - The Force Awakens                        Documentary
Name: genres, dtype: object

Let's say we are interested in replacing the string 'Adventure' with 'Adv' to shorten the length of each string in this column. The following won't work.

In [31]:
genres.replace('Adventure', 'Adv').head()

title
Avatar                                        Action|Adventure|Fantasy|Sci-Fi
Pirates of the Caribbean: At World's End             Action|Adventure|Fantasy
Spectre                                             Action|Adventure|Thriller
The Dark Knight Rises                                         Action|Thriller
Star Wars: Episode VII - The Force Awakens                        Documentary
Name: genres, dtype: object

By default, the `replace` method works by matching the **entire** value in the Series. The genre must be exactly 'Adventure', without any other text surrounding it, for it to be replaced. It is possible to do this within-string replacement, but you'll need to understand regular expressions first. Setting the `regex` parameter to `True` will do the trick. The following is presented with some caution. You should not use the `regex` parameter until you understand the fundamentals of regular expressions, which are thoroughly covered in their own part of the book.

In [32]:
genres.replace('Adventure', 'Adv', regex=True).head()

title
Avatar                                        Action|Adv|Fantasy|Sci-Fi
Pirates of the Caribbean: At World's End             Action|Adv|Fantasy
Spectre                                             Action|Adv|Thriller
The Dark Knight Rises                                   Action|Thriller
Star Wars: Episode VII - The Force Awakens                  Documentary
Name: genres, dtype: object
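The contrast between whole-value and within-string replacement can be reproduced on a small, made-up Series in the same pipe-separated style. The last call uses a dictionary, which is another form that `replace` accepts for mapping several whole values at once. The abbreviations are hypothetical.

```python
import pandas as pd

# Hypothetical genre strings, not read from the movie dataset
genres = pd.Series(['Action|Adventure', 'Adventure', 'Comedy'])

# Default: only values that equal 'Adventure' exactly are replaced
whole = genres.replace('Adventure', 'Adv')

# regex=True: the pattern is matched anywhere inside each string
within = genres.replace('Adventure', 'Adv', regex=True)

# A dictionary maps several whole values to their replacements in one call
mapped = genres.replace({'Adventure': 'Adv', 'Comedy': 'Com'})

print(whole.tolist())
print(within.tolist())
print(mapped.tolist())
```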

## Exercises

Read in the employee dataset by executing the cell below and use it for the following exercises.

In [33]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.1,Male,Hispanic
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White


### Exercise 1

<span style="color:green; font-size:16px">Find the minimum, maximum, mean, median, and standard deviation of the salary column. Return the result as a Series.</span>

In [34]:
salary = emp['salary'].agg(['min', 'max', 'mean', 'median', 'std'])
salary

min         9912.000000
max       342784.000000
mean       58206.761571
median     56956.640000
std        23322.315285
Name: salary, dtype: float64

### Exercise 2

<span style="color:green; font-size:16px">Use the `idxmax` and `idxmin` methods to find the index where the maximum and minimum salaries are located in the DataFrame. Then use the `loc` indexer to select both of those rows as a DataFrame.</span>

In [36]:
salary = emp['salary']
print(f'max index = {salary.idxmax()}')
print(f'min index = {salary.idxmin()}')

max index = 1732
min index = 1183


In [38]:
mx = salary.idxmax()
mn = salary.idxmin()

rows = [mx, mn]

emp.loc[rows, :]

Unnamed: 0,dept,title,hire_date,salary,sex,race
1732,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,White
1183,Library,CUSTOMER SERVICE CLERK,2016-01-19,9912.0,Female,Hispanic


### Exercise 3

<span style="color:green; font-size:16px">Repeat exercise 2, but do so on the `imdb_score` column from the movie dataset.</span>

In [41]:
score = movie['imdb_score']

mx = score.idxmax()
mn = score.idxmin()

rows = [mx, mn]

movie.loc[rows, :]

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Towering Inferno,,Color,,65.0,John Blanchard,0.0,Martin Short,770.0,Andrea Martin,179.0,...,176.0,,Comedy,,10,,English,Canada,,9.5
Justin Bieber: Never Say Never,2011.0,Color,G,115.0,Jon M. Chu,209.0,Usher Raymond,569.0,Sean Kingston,69.0,...,41.0,73000942.0,Documentary|Music,84.0,74351,boyhood friend|manager|plasma tv|prodigy|star,English,USA,13000000.0,1.6


### Exercise 4

<span style="color:green; font-size:16px">The `idxmax` and `idxmin` methods are aggregations as they return a single value. Use the `agg` method to return the min/max `imdb_score` and the label for each score.</span>

In [42]:
score.agg(['min', 'idxmin', 'max', 'idxmax'])

min                                  1.6
idxmin    Justin Bieber: Never Say Never
max                                  9.5
idxmax                  Towering Inferno
Name: imdb_score, dtype: object

### Exercise 5

<span style="color:green; font-size:16px">Read in 20 years of Microsoft stock data, setting the 'date' column as the index. Find the top 5 largest one-day percentage gains in the `adjusted_close` column.</span>

In [44]:
msft = pd.read_csv('../data/stocks/msft20.csv',
                   parse_dates=['date'], index_col='date')
msft.head(2)

date,open,high,low,close,adjusted_close,volume,dividend_amount
1999-10-19,88.25,89.25,85.25,86.313,27.8594,69945600,0.0
1999-10-20,91.563,92.375,90.25,92.25,29.7758,88090600,0.0


In [48]:
adj = msft['adjusted_close']

adj.pct_change().nlargest()

date
2000-10-19    0.195654
2008-10-13    0.186043
2008-11-21    0.122646
2002-05-08    0.111181
2001-01-03    0.105183
Name: adjusted_close, dtype: float64

### Exercise 6

<span style="color:green; font-size:16px">Randomly sample the `actor1` column as a Series with replacement to select three values. Use random state 12345. Setting a random state ensures that the same random sample is chosen regardless of which machine is used.</span>

In [49]:
act = movie['actor1']

In [50]:
act.sample(n=3, replace=True, random_state=12345)

title
Hoop Dreams        William Gates
Only the Strong    Antoni Corone
Babel                  Brad Pitt
Name: actor1, dtype: object

### Exercise 7

<span style="color:green; font-size:16px">Select the title column from the employee dataset as a Series. Replace all occurrences of 'POLICE OFFICER' and 'SENIOR POLICE OFFICER' with 'POLICE'. You can use a list as the first argument passed to the `replace` method.</span>

In [51]:
title = emp['title']

title.replace(['POLICE OFFICER', 
              'SENIOR POLICE OFFICER'], 'POLICE')

0                      POLICE SERGEANT
1           ASSISTANT CITY ATTORNEY II
2              SENIOR SLUDGE PROCESSOR
3                               POLICE
4                               POLICE
                     ...              
24303                           POLICE
24304    SENIOR PROCUREMENT SPECIALIST
24305        WATER SERVICE INSPECTOR I
24306    HUMAN SERVICE PROGRAM MANAGER
24307                  POLICE SERGEANT
Name: title, Length: 24308, dtype: object