# Numeric Series Methods

Our main accomplishment up to this point has been selecting subsets of data. We have not changed the data or produced any interesting calculations. In this chapter, we begin calling methods on Series that contain numeric data to produce results of interest.

## Calling Series methods

We have already called a few methods, such as `head`, `tail`, `isna`, and `set_index`. This chapter covers many more methods that work on numeric Series, such as `min`, `max`, `mean`, and `median`. A numeric Series is one that contains integers or floats. Many of these methods also work on data of other types, but all of them work on numeric Series.

### Minimally sufficient pandas

There are over 200 methods available to both the Series and DataFrame. It can be overwhelming to think about having to learn and memorize this staggering amount of functionality. The good news is that many of these methods are unnecessary and don't add any extra functionality. Furthermore, many methods are remnants from the early days of pandas and have few or no use cases, or have been **deprecated**. When a method is deprecated, its use is discouraged and it will likely be removed from the library in the future. For example, the `ix` indexer was useful in the earlier days of pandas to simultaneously select rows and columns of data. It was deprecated in favor of the `loc` and `iloc` indexers and removed in pandas version 1.0.

I suggest using a subset of the pandas library that allows you to do as many tasks as possible. I focus on the subset of pandas that maximizes both performance and readability. Since there is so much functionality, power users of pandas can think of very creative and complex code to accomplish different tasks. This is not necessarily a positive thing. When working with a group of other data analysts, it can confuse those who are not familiar with the syntax. One of my most popular blog posts is titled [Minimally Sufficient Pandas][1] and goes into greater detail on this.

We begin our exploration of attributes and methods with Series objects. It is far simpler to focus on a single column of data than multiple columns in a DataFrame.

### View the API for a complete list of functionality

Software libraries use the term **Application Programming Interface** or **API** for the complete set of functionality they make available to users. The pandas API reference can be [found here][2]. This is a huge list, but as mentioned above, only a subset of this page is needed for the vast majority of tasks. You may find it useful to navigate to the [Series API][3] section of the documentation so that you have a full list of its functionality for this chapter.

## City of Houston Employee Data

We will use a public dataset containing City of Houston employee information on their position, race, sex, and salary. This dataset was last updated in July of 2019 and contains nearly all of the employees for the City of Houston. Notice that the column `hire_date` can be read in as a datetime.

[1]: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
[2]: http://pandas.pydata.org/pandas-docs/stable/reference/index.html
[3]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html

In [62]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [1]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head(3)

| Unnamed: 0 | dept | title | hire_date | salary | sex | race |
|---|---|---|---|---|---|---|
| 0 | Police | POLICE SERGEANT | 2001-12-03 | 87545.38 | Male | White |
| 1 | Other | ASSISTANT CITY ATTORNEY II | 2010-11-15 | 82182.0 | Male | Hispanic |
| 2 | Houston Public Works | SENIOR SLUDGE PROCESSOR | 2006-01-09 | 49275.0 | Male | Black |


### Select a single column as a Series

Let's select the numeric `salary` column as a Series and use it to explore the Series API.

In [2]:
salary = emp['salary']
salary.head()

0    87545.38
1    82182.00
2    49275.00
3    75942.10
4    69355.26
Name: salary, dtype: float64

Let's verify that we have a Series object.

In [3]:
type(salary)

pandas.core.series.Series

## Core Series attributes

Before calling methods, we cover some of the [many pandas Series attributes](http://pandas.pydata.org/pandas-docs/stable/reference/series.html#attributes). The attributes to be aware of are:

* `index`
* `values`
* `shape`
* `size`
* `dtype`

The `index` and `values` were covered in a previous chapter. Only `shape`, `size`, and `dtype` are new. The `shape` and `size` attributes are nearly identical to one another. They both return the number of values in the Series. The `shape` attribute is a one-item tuple, while `size` returns an integer. The `dtype` attribute returns the data type of the values. Remember that all values in the Series share the same data type. Let's display these now.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#attributes

In [4]:
salary.shape

(24308,)

In [5]:
salary.size

24308

In [6]:
salary.dtype

dtype('float64')

### `len` function also returns the number of values

The built-in `len` function returns the same number as the `size` attribute. 

In [7]:
len(salary)

24308

Even though they both report the same number, I typically use the `len` function, as it returns the number of rows when used on a DataFrame. The DataFrame `size` attribute returns the total number of values in the DataFrame (rows times columns).
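
To make the difference concrete, here is a minimal sketch using a small, made-up DataFrame (the column names `a` and `b` are placeholders, not part of the employee data). It compares `len`, `size`, and `shape` on a DataFrame with three rows and two columns.

```python
import pandas as pd

# A small, made-up DataFrame with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows = len(df)     # number of rows: 3
n_values = df.size   # rows times columns: 6
dims = df.shape      # (rows, columns) tuple: (3, 2)

print(n_rows, n_values, dims)
```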

## Arithmetic operators

Before calling any Series methods, we cover how a Series works with the built-in Python operators. All of the following common arithmetic operators can be used with a Series of numeric data.

* `+` - Addition
* `-` - Subtraction
* `*` - Multiplication
* `/` - Division
* `//` - Floor division
* `**` - Exponentiation
* `%` - Modular division (returns the remainder)

All of the arithmetic operators operate on every value in the Series. Let's see some examples beginning by adding 5 to every value in the Series. We save the result to a variable named `result`, which is also a Series, and output the first few values.

In [8]:
result = salary + 5
result.head(3)

0    87550.38
1    82187.00
2    49280.00
Name: salary, dtype: float64

Let's raise each value in the Series to the 0.2 power with the exponentiation operator, `**`.

In [9]:
result = salary ** 0.2
result.head(3)

0    9.737482
1    9.615134
2    8.680112
Name: salary, dtype: float64

Let's divide each value in the Series by 173. The single forward slash is referred to as **true division** and keeps the decimal part of each result.

In [10]:
result = salary / 173
result.head(3)

0    506.042659
1    475.040462
2    284.826590
Name: salary, dtype: float64

Two forward slashes are used for **floor division**. Each result is rounded down to the whole number at or below it (not to the nearest one), which for positive values is the same as dropping the decimals.

In [11]:
result = salary // 173
result.head(3)

0    506.0
1    475.0
2    284.0
Name: salary, dtype: float64
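
The salaries above are all positive, so floor division looks like it simply drops the decimals. The following sketch, which uses two made-up values rather than the salary data, shows what happens with a negative number: the result moves down toward negative infinity.

```python
import pandas as pd

# Made-up values: one positive, one negative
s = pd.Series([7.5, -7.5])

true_div = s / 2    # 3.75 and -3.75
floored = s // 2    # 3.0 and -4.0 (rounded down, not toward zero)

print(floored)
```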

### Isn't this chapter about calling methods?

Although the above operations are not actual methods and do not use dot notation, they work similarly to methods. You can think of them as methods that take exactly one argument, the other object that is being operated on.
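
This analogy can be taken literally. pandas provides named methods such as `add` that mirror the arithmetic operators. The sketch below uses a tiny made-up Series to check that the operator and the method give the same result. It is only an illustration, and the operators remain the more readable choice.

```python
import pandas as pd

# Made-up values for illustration
s = pd.Series([1.0, 2.0, 3.0])

with_operator = s + 5     # operator syntax
with_method = s.add(5)    # equivalent method call with one argument

# equals returns True when both Series hold the same values and dtype
same = with_operator.equals(with_method)
print(same)
```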

### Arithmetic operations are vectorized

All the above arithmetic operations are **vectorized**. This means that the operation is applied to each value in the Series without you writing an explicit for-loop. Python lists do not work like this and require an explicit for-loop to operate on each value.
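
As a rough comparison, the sketch below adds 5 to three made-up numbers, first stored in a Python list and then in a Series. The list needs a loop (written here as a list comprehension), while the Series handles every value in a single expression.

```python
import pandas as pd

values = [10, 20, 30]   # made-up numbers

# A list needs an explicit loop; `values + 5` would raise a TypeError
plus_five_list = [v + 5 for v in values]

# A Series applies the operation to every value at once
s = pd.Series(values)
plus_five_series = s + 5

print(plus_five_list)
print(plus_five_series.tolist())
```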

## Comparison operations

The following six comparison operators work similarly to their arithmetic analogs from above:

* `<` - Less than
* `<=` - Less than or equal to
* `>` - Greater than
* `>=` - Greater than or equal to
* `==` - Equal to
* `!=` - Not equal to

In the boolean selection chapters, we used these vectorized comparison operations (without the terminology) to produce a Series of booleans. Let's see a few examples below beginning by testing whether each salary is greater than 50,000.

In [12]:
result = salary > 50_000
result.head(3)

0     True
1     True
2    False
Name: salary, dtype: bool

Here, we test whether each salary is not equal to 82,182. We first display the first few salaries for reference.

In [14]:
salary.head(3)

0    87545.38
1    82182.00
2    49275.00
Name: salary, dtype: float64

In [13]:
result = salary != 82_182
result.head(3)

0     True
1    False
2     True
Name: salary, dtype: bool

## Boolean and bitwise operators

Python has three boolean operators, the keywords `and`, `or`, and `not`. These operators cannot perform vectorized boolean operations because Python evaluates them on the truth value of an entire object rather than value by value. Instead, pandas and numpy rely on the bitwise and, or, and not operators (`&`, `|`, and `~`, respectively) to perform vectorized boolean operations. These were thoroughly covered in previous chapters. Let's complete one example as a review and determine whether a salary is less than 50,000 or greater than 100,000.

In [16]:
result = (salary < 50000) | (salary > 100000)
result.head(3)

0    False
1    False
2     True
Name: salary, dtype: bool
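
To see why the keyword operators are not an option, the sketch below tries the same condition with `or` on a small made-up Series. Python asks the Series for a single truth value, pandas refuses, and a `ValueError` is raised. The bitwise `|` version works as expected.

```python
import pandas as pd

# Made-up salaries for illustration
s = pd.Series([40_000, 75_000, 120_000])

try:
    (s < 50_000) or (s > 100_000)   # keyword operator: not vectorized
    error_message = None
except ValueError as err:
    error_message = str(err)        # pandas reports an ambiguous truth value

# The bitwise operator compares value by value
result = (s < 50_000) | (s > 100_000)

print(error_message)
print(result.tolist())
```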

## Statistical methods

We now call *actual* methods that compute [basic descriptive statistics](http://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats) on a numeric Series. You might want to follow that link to see the full list of statistical methods. We call the methods explicitly with dot notation. It is useful to place these methods into two categories: those that **aggregate** and those that do not.

### Aggregation methods

A method that performs an aggregation returns a **single** number to summarize the Series. Examples of methods that aggregate are:

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-missing values
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns the given percentile of the distribution

### Non-aggregation methods

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:

* `abs` - takes the absolute value of each value in the Series
* `round` - rounds each value to the given number of decimal places
* `cummin` - keeps track of the current minimum
* `cummax` - keeps track of the current maximum
* `cumsum` - accumulates the sum

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats

In [17]:
salary.cummin()

0        87545.38
1        82182.00
2        49275.00
3        49275.00
4        49275.00
           ...   
24303     9912.00
24304     9912.00
24305     9912.00
24306     9912.00
24307         NaN
Name: salary, Length: 24308, dtype: float64

In [18]:
salary.cummax()

0         87545.38
1         87545.38
2         87545.38
3         87545.38
4         87545.38
           ...    
24303    342784.00
24304    342784.00
24305    342784.00
24306    342784.00
24307          NaN
Name: salary, Length: 24308, dtype: float64

## Aggregation methods

Let's see a few examples of common aggregation methods beginning by finding the minimum value in the Series with the `min` method.

In [19]:
salary.min()

9912.0

Find the maximum value of a Series with the `max` method.

In [20]:
salary.max()

342784.0

Find the total of all salaries with the `sum` method.

In [21]:
salary.sum()

1359826363.82

The `median` method returns the median (50th percentile) of the Series.

In [22]:
salary.median()

56956.64

Use the `quantile` method to return the given percentile of the Series. It accepts values between 0 and 1. By default, it returns the 50th percentile. Below, we pass it 0.95 to return the 95th percentile of salary. This means that 95 percent of the employees for the City of Houston have this salary or below.

In [23]:
salary.quantile(0.95)

96063.9030000001
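
Here is a small sketch of `quantile` on five made-up values, where the results are easy to check by eye. It also passes a list of quantiles, which returns a Series instead of a single number. The values and variable names are only for illustration.

```python
import pandas as pd

# Five made-up, evenly spaced values
s = pd.Series([10, 20, 30, 40, 50])

median_default = s.quantile()        # no argument: the 50th percentile, 30.0
first_quartile = s.quantile(0.25)    # 20.0
several = s.quantile([0.25, 0.75])   # a Series indexed by 0.25 and 0.75

print(median_default, first_quartile)
print(several)
```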

### The `count` method

The `count` method returns the number of non-missing values. It does NOT return the total number of values in the Series. Since this number is less than `len(salary)`, we know missing values exist.

In [24]:
salary.count()

23362

In [25]:
len(salary)

24308

### The `std` and `var` methods

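The standard deviation and variance of a numeric Series may be computed with the `std` and `var` methods. Both divide by the number of non-missing values minus `ddof` (delta degrees of freedom). The `ddof` parameter defaults to 1, which gives the sample standard deviation and variance. Set it to 0 for the population versions.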

In [26]:
salary.std()

23322.315284661538

The variance is the square of the standard deviation.

In [27]:
salary.var()

543930390.2371572
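
The effect of `ddof` is easiest to see on a few made-up values. In the sketch below, the eight numbers have a mean of 5 and squared deviations that total 32, so the variance is 32 / 7 with the default `ddof=1` and 32 / 8 with `ddof=0`.

```python
import pandas as pd

# Made-up values with mean 5; squared deviations sum to 32
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = s.var()              # divides by n - 1 = 7
population_var = s.var(ddof=0)    # divides by n = 8
population_std = s.std(ddof=0)    # square root of the population variance

print(sample_var, population_var, population_std)
```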

### pandas ignores missing values by default

One big difference between pandas and numpy is that pandas ignores missing values by default. When calling aggregation methods such as `sum` or `mean`, pandas ignores any missing value as if that piece of data did not exist. On the other hand, numpy returns `nan` for its aggregation methods when one or more values are missing. Let's verify this by extracting the values of `salary` as a numpy array and then calling the array `sum` method.

In [28]:
salary.values.sum()

nan

We can make a pandas Series behave like numpy by setting the `skipna` parameter to `False`. Most of the aggregation methods above, including `sum`, `min`, `max`, `mean`, `median`, `std`, and `var`, have the `skipna` parameter available.

In [29]:
salary.sum(skipna=False)

nan
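
The same behavior can be reproduced on a tiny made-up Series with one missing value. This sketch compares the pandas default, the `skipna=False` setting, and the numpy array `sum` method. It also uses numpy's `nansum` function, which is how numpy skips missing values when asked to.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing value
s = pd.Series([1.0, np.nan, 3.0])

pandas_default = s.sum()              # skips the missing value: 4.0
pandas_strict = s.sum(skipna=False)   # nan
numpy_default = s.to_numpy().sum()    # nan
numpy_skip = np.nansum(s.to_numpy())  # 4.0
non_missing = s.count()               # 2

print(pandas_default, pandas_strict, numpy_default, numpy_skip, non_missing)
```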

### The `describe` method

The `describe` method returns several aggregations at once as a Series. The name of the aggregation is placed in the index. By default, it returns the count (number of non-missing values), mean, standard deviation, min, 25th, 50th, and 75th percentiles, and the max.

In [30]:
salary.describe()

count     23362.000000
mean      58206.761571
std       23322.315285
min        9912.000000
25%       41122.000000
50%       56956.640000
75%       69355.260000
max      342784.000000
Name: salary, dtype: float64

Use the `percentiles` parameter to control which percentiles get returned. Pass it a list of all the percentiles (numbers between 0 and 1) you would like returned.

In [31]:
salary.describe(percentiles=[0.1, 0.2, 0.5, 0.8, 0.9, 0.99])

count     23362.000000
mean      58206.761571
std       23322.315285
min        9912.000000
10%       33030.000000
20%       38314.000000
50%       56956.640000
80%       75942.100000
90%       86993.900000
99%      130036.000000
max      342784.000000
Name: salary, dtype: float64

In [32]:
salary.describe(percentiles=[0.2, 0.4, 0.99])

count     23362.000000
mean      58206.761571
std       23322.315285
min        9912.000000
20%       38314.000000
40%       50066.000000
50%       56956.640000
99%      130036.000000
max      342784.000000
Name: salary, dtype: float64

Notice that the 50th percentile (the median) appears in the output above even though it was not in the list passed to `percentiles`.

## Non-Aggregation methods

In this section, we'll cover some basic non-aggregation Series methods. They return an entirely new Series with the same number of values as the original. For instance, the `abs` method takes the absolute value of each individual value in the Series. In this example, none of the values in the Series are negative, so the values remain the same.

In [33]:
salary.abs().head(3)

0    87545.38
1    82182.00
2    49275.00
Name: salary, dtype: float64

The `round` method rounds each value to the number of decimal places given by the first argument. Negative integers may be used to round to places left of the decimal point. In the following example, we pass -2 to round to the nearest hundred.

In [38]:
salary.round(-2).head(3)

0    87500.0
1    82200.0
2    49300.0
Name: salary, dtype: float64
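
The sketch below rebuilds the first three salaries by hand, so it runs without the data file, and rounds them to a few different places. It shows how the sign of the argument decides which side of the decimal point is rounded.

```python
import pandas as pd

# The first three salaries shown above, typed in by hand
s = pd.Series([87545.38, 82182.00, 49275.00])

one_decimal = s.round(1)   # one place right of the decimal point
hundreds = s.round(-2)     # nearest hundred
thousands = s.round(-3)    # nearest thousand

print(hundreds.tolist())
print(thousands.tolist())
```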

### Accumulation methods

There are a few accumulation methods that work by keeping track of previous data. For instance, the `cummin` method keeps track of the current minimum value in the Series. It begins at the top with the first value. Since it's the first, it will be the minimum. It then continues down the Series to the second value. If the second value is less than the first, then it will be the new minimum. If not, the first value will remain as the minimum. It returns a Series with the same length as the original of all the current minimums.

In [39]:
salary.cummin().head(10)

0    87545.38
1    82182.00
2    49275.00
3    49275.00
4    49275.00
5    44616.00
6    39998.00
7    39998.00
8    39998.00
9    39998.00
Name: salary, dtype: float64
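
The steps described above can be sketched as an explicit loop. This is only meant to illustrate the logic on made-up values with no missing data. It is not how pandas implements `cummin` internally.

```python
import pandas as pd

# Made-up values with no missing data
s = pd.Series([5.0, 3.0, 4.0, 1.0, 2.0])

running = []
current = float("inf")          # start above any possible value
for value in s:
    current = min(current, value)   # keep the smaller of the two
    running.append(current)

manual = pd.Series(running, index=s.index)

# The loop gives the same result as the built-in method
matches = manual.equals(s.cummin())
print(manual.tolist(), matches)
```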

The `cummax` method works similarly, but keeps track of the maximum value. The `cumsum` method accumulates the sum beginning from the first value.

In [40]:
salary.cumsum().head()

0     87545.38
1    169727.38
2    219002.38
3    294944.48
4    364299.74
Name: salary, dtype: float64

### Non-aggregation methods return an entirely new Series

The non-aggregation methods return an entirely new Series and do not modify the calling Series. This is a crucial concept to understand. pandas has only a few operations and methods that modify objects in-place. Nearly all of the time, a new object is returned. To show this, we assign the result of the `round` method to a variable name.

In [41]:
salary_round = salary.round(decimals=-3)
salary_round.head(3)

0    88000.0
1    82000.0
2    49000.0
Name: salary, dtype: float64

Let's verify that the calling object has not changed. The `salary` Series is the calling object, i.e., the one that is calling the method and remains unchanged.

In [42]:
salary.head(3)

0    87545.38
1    82182.00
2    49275.00
Name: salary, dtype: float64

## Series methods with a non-default index

Let's use a different Series that does not use the default `RangeIndex` to run some of the same methods as above. We'll read in the movie dataset with the title as the index and select the `imdb_score` column as a Series.

In [43]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
score = movie['imdb_score']
score.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

All of the methods in this chapter work exactly the same way as they do with the default index. They operate on the **values** of the Series and NOT on the index, which is merely a label for the values. Let's show this by taking the mean of the scores. Notice how a single value is returned. The index has nothing to do with this calculation.

In [44]:
score.mean()

6.437428803905615
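
One way to convince yourself of this is to build two Series with identical values but different indexes and compare their results. The sketch below uses three made-up scores and placeholder labels `x`, `y`, and `z`.

```python
import pandas as pd

values = [7.9, 7.1, 6.8]   # made-up scores

default_index = pd.Series(values)                         # RangeIndex 0, 1, 2
labeled_index = pd.Series(values, index=["x", "y", "z"])  # placeholder labels

# Aggregations depend only on the values, so the results are identical
same_mean = default_index.mean() == labeled_index.mean()
same_max = default_index.max() == labeled_index.max()
print(same_mean, same_max)
```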

Let's show this is the case with another method and calculate the statistical variance with `var`.

In [45]:
score.var()

1.2719375585109596

Calling the non-aggregation methods is where some confusion might arise. Below, we round each score to the nearest whole number. When no argument for decimal place is given, as done below, it defaults to rounding to the nearest whole number. Since we are not aggregating, a Series is returned, and the original index remains with it. Again, no calculation is done on the index. The calculation is only applied to the values.

In [46]:
score.round().head()

title
Avatar                                        8.0
Pirates of the Caribbean: At World's End      7.0
Spectre                                       7.0
The Dark Knight Rises                         8.0
Star Wars: Episode VII - The Force Awakens    7.0
Name: imdb_score, dtype: float64
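
You may have noticed in the output above that 8.5 was rounded to 8.0 and not 9.0. Values that fall exactly halfway between two whole numbers are rounded to the nearest even number, which is also how numpy rounds. The sketch below shows this on a few made-up values.

```python
import pandas as pd

# Made-up values that sit exactly halfway between two whole numbers
halves = pd.Series([0.5, 1.5, 2.5, 8.5])

rounded = halves.round()   # 0.0, 2.0, 2.0, 8.0: each goes to the even neighbor
print(rounded.tolist())
```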

Here, we find the current maximum value with the `cummax` method. Avatar retains the highest score until it is surpassed by 'The Dark Knight Rises'.

In [47]:
score.cummax().head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.9
Spectre                                       7.9
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    8.5
Name: imdb_score, dtype: float64

## Operations on a boolean Series

All of the above methods were called on a Series with numeric values. In this section, we will execute a few of the same aggregation and non-aggregation methods on a Series of booleans. Let's create a boolean Series by determining which movies had a score greater than eight.

In [48]:
score_8 = score > 8
score_8.head()

title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: imdb_score, dtype: bool

We can use this Series to filter the data just like we did in the chapters on boolean selection.

In [49]:
only_8 = score[score_8]
only_8.head()

title
The Dark Knight Rises         8.5
The Avengers                  8.1
Captain America: Civil War    8.2
Toy Story 3                   8.3
WALL·E                        8.4
Name: imdb_score, dtype: float64

We can determine the number of movies that have a score greater than eight by finding the length of this result.

In [50]:
len(only_8)

249

### Sum a boolean Series

Boolean selection is not needed to find the number of movies with a score greater than eight. Instead, we can call the `sum` method on the original boolean Series.

In [51]:
score_8.sum()

249

### Boolean values are treated as numeric

When performing arithmetic calculations, pandas treats boolean values as numeric. `False` evaluates as 0 and `True` evaluates as 1. With the `score_8` boolean Series, there are 249 `True` values with the rest being `False`. Calling the `sum` method on any boolean Series returns the number of `True` values in that Series.

It is possible to compute this sum without first assigning the boolean Series to a new variable name. We can surround the condition in parentheses and then call the `sum` method.

In [52]:
(score > 8).sum()

249

### Explanation of this one line of code

Let's examine the code `(score > 8).sum()`. Python first evaluates the expression in parentheses, `score > 8`. This results in a boolean Series, which has the same methods available as any other Series. We then call the `sum` method on this boolean Series to get the desired result.
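
Because `True` counts as 1 and `False` as 0, other aggregation methods work on a boolean Series as well. As a small sketch on made-up values, `mean` returns the fraction of values that are `True`.

```python
import pandas as pd

# Made-up boolean values: three True out of four
flags = pd.Series([True, False, True, True])

n_true = flags.sum()       # count of True values: 3
frac_true = flags.mean()   # fraction of True values: 0.75

print(n_true, frac_true)
```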

## Exercises

Continue to use the `score` Series for the first several exercises.

### Exercise 1

<span  style="color:green; font-size:16px">What is the data type of `score` and how many values does it contain?</span>

In [53]:
score.dtype

dtype('float64')

In [59]:
score.size

4916

In [55]:
len(score)

4916

### Exercise 2

<span  style="color:green; font-size:16px">What are the maximum and minimum scores?</span>

In [63]:
score.min()
score.max()

1.6

9.5

### Exercise 3

<span  style="color:green; font-size:16px">How many movies have scores greater than 6?</span>

In [68]:
(score>6).sum()

3368

### Exercise 4

<span  style="color:green; font-size:16px">How many movies have scores greater than 4 and less than 7?</span>

In [69]:
((4 < score) & (score < 7)).sum()

3021

### Exercise 5

<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

In [71]:
med = score.median()
avg = score.mean()

med-avg

0.16257119609438497

In [75]:
score.median() - score.mean()

0.16257119609438497

### Exercise 6

<span  style="color:green; font-size:16px">Add 1 to every value of `score` and then calculate the median.</span>

In [72]:
new = score + 1
new.median()

7.6

In [73]:
(score+1).median()

7.6

### Exercise 7

<span  style="color:green; font-size:16px">Calculate the median of `score` and add 1 to this. Why is this value the same as Exercise 6?</span>

In [74]:
score.median() + 1

7.6

The values match because adding 1 to every score shifts each value by the same amount without changing their order, so the middle value shifts by 1 as well.

### Exercise 8

<span style="color:green; font-size:16px">Return a Series that has only scores above the 99.9th percentile.</span>

In [79]:
x = score.quantile(.999)
new = score > x
new

title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Name: imdb_score, Length: 4916, dtype: bool

In [81]:
x = score.quantile(.999)
new = score[(score>x)]
new

title
The Shawshank Redemption    9.3
Towering Inferno            9.5
Dekalog                     9.1
The Godfather               9.2
Kickboxer: Vengeance        9.1
Name: imdb_score, dtype: float64

In [86]:
filt = score > score.quantile(.999)
score[filt]

title
The Shawshank Redemption    9.3
Towering Inferno            9.5
Dekalog                     9.1
The Godfather               9.2
Kickboxer: Vengeance        9.1
Name: imdb_score, dtype: float64

### Exercise 9

<span style="color:green; font-size:16px">Assign the gross column of the movie dataset to its own variable name as a Series. Round it to the nearest million.</span>

In [83]:
gross = movie['gross']
gross.round(-5)

title
Avatar                                        760500000.0
Pirates of the Caribbean: At World's End      309400000.0
Spectre                                       200100000.0
The Dark Knight Rises                         448100000.0
Star Wars: Episode VII - The Force Awakens            NaN
                                                 ...     
Signed Sealed Delivered                               NaN
The Following                                         NaN
A Plague So Pleasant                                  NaN
Shanghai Calling                                      0.0
My Date with Drew                                100000.0
Name: gross, Length: 4916, dtype: float64

Note that `round(-5)` rounds to the nearest hundred thousand, which is why the values above end in five zeros. To round to the nearest million as the exercise asks, pass `-6` instead: `gross.round(-6)`.

### Exercise 10

<span  style="color:green; font-size:16px">Calculate the cumulative sum of the gross Series and then select the 99th integer location.</span>

In [87]:
cum = gross.cumsum()
cum.iloc[99]

23119723385.0

### Exercise 11

<span  style="color:green; font-size:16px">Select the first 100 values of the gross Series and then calculate the sum. Does the result match Exercise 10?</span>

In [90]:
gross.iloc[0:100].sum()

23119723385.0

The results match because the cumulative sum at integer location 99 is the running total of the values at locations 0 through 99, which are exactly the values that `gross.iloc[0:100]` selects.