# Series Attributes and Statistical Methods

Our main accomplishment up to this point has been selecting subsets of data. We have not changed the data or made many interesting calculations. Our selections have happened in two ways:

* Selection by label and integer location
* Selection by actual values (boolean selection and the `query` method)

Other than `query`, these selections all used the square brackets.

## Calling methods on a Series/DataFrame

In this chapter we will call many methods that perform actions on our DataFrame. We have actually already called some methods such as the `head`, `tail`, `isna`, and `set_index` methods. There are around 250 methods available to both DataFrames and Series.

### Use a subset of methods

It can be quite overwhelming to think about having to learn and memorize this staggering amount of functionality. The good news is that many of these methods are unnecessary and don't add any extra functionality. Furthermore, many methods are remnants from the early days of pandas and have few or no use cases or have been **deprecated**. When a method is deprecated, its use is discouraged and it will likely be removed from the library in a future release.

### Minimally sufficient pandas

I suggest using a subset of the pandas library that allows you to do as many tasks as possible. I focus on the subset of pandas that maximizes both performance and readability. Since there is so much functionality, power users of pandas can think of very creative and complex code to accomplish different tasks. This is not necessarily a positive thing and when working with a group of other data analysts can lead to confusion for those that are not familiar with the syntax. One of my most popular blog posts is titled [Minimally Sufficient Pandas][1] and goes into great detail on this.

## Series Attributes and Methods

We begin our exploration of attributes and methods with Series objects. It is far simpler to focus on a single column of data than multiple columns in a DataFrame.

### View the API for a complete list of functionality

Libraries use the term **Application Programming Interface** or **API** to refer to all of the functionality they make available to users. The pandas API reference, which lists and describes this functionality, can be found [here][2]. This is a huge list, but as mentioned above, only a subset of this page is needed for the vast majority of tasks.

### The best of the pandas Series API

The pandas Series object is a single dimension of data and easier to work with than an entire DataFrame. We start with it and cover the most basic and important methods below. You may find it useful to navigate to the [Series API][3] section of the documentation so that you can have a full list of the functionality.

### City of Houston Employee Data

We will use a public dataset containing City of Houston employee information on their position, race, gender, and salary. Notice that the column `hire_date` can be read in as a datetime.

[1]: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
[2]: http://pandas.pydata.org/pandas-docs/stable/reference/index.html
[3]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head(3)

This dataset was last updated in July of 2019 and contains nearly all of the employees for the City of Houston.

In [None]:
emp.shape

### Select a single column as a Series
Let's select the `salary` column as a Series and use it to explore the Series API.

In [None]:
salary = emp['salary']
salary.head()

Let's verify that we have a Series object.

In [None]:
type(salary)

## Core Series Attributes
pandas Series have [many attributes][1], but only a few are important to know. The attributes to be aware of are:

* `index`
* `values`
* `size`
* `dtype`

The `index` and `values` were covered in a previous chapter. Only `size` and `dtype` are new. The size represents the total number of values in the Series. The `dtype` returns the data type of the values. Remember that all values in the Series share the same data type. Let's display these now.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#attributes

In [None]:
salary.size

In [None]:
salary.dtype

### `len` function also returns the number of values

The built-in `len` function returns the same number as the `size` attribute. 

In [None]:
len(salary)

Even though they both report the same number, I typically use the `len` function, as it returns the number of rows when used on a DataFrame. The DataFrame `size` attribute returns the total number of values in the DataFrame.

In [None]:
len(emp)

In [None]:
emp.size
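
As a minimal sketch of the difference between `len` and `size`, the following uses a small, made-up DataFrame. The name `toy` and its values are assumptions for illustration and are not part of the employee data.

```python
import pandas as pd

# Hypothetical toy data: 3 rows and 2 columns
toy = pd.DataFrame({
    "dept": ["Police", "Fire", "Library"],
    "salary": [60000.0, 55000.0, 42000.0],
})

n_rows = len(toy)              # number of rows
n_values = toy.size            # total number of values (rows * columns)
shape_rows, shape_cols = toy.shape

print(n_rows, n_values, shape_rows * shape_cols)

# For a single column (a Series), len and size agree
col = toy["salary"]
print(len(col), col.size)
```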

## Arithmetic operators
Series have the ability to work with all of the following common arithmetic operators:

* `+` - Addition
* `-` - Subtraction
* `*` - Multiplication
* `/` - Division
* `//` - Floor division
* `**` - Exponentiation
* `%` - Modular division (returns the remainder)

All of the arithmetic operators operate on every value in the Series. Let's see some examples and begin by adding 5 to every value in the Series.

In [None]:
result = salary + 5
result.head(3)

Raise each value in the Series to the .2 power.

In [None]:
result = salary ** .2
result.head(3)

Divide each value in the Series by 173. This single division sign is referred to as **true division** and returns all decimal values.

In [None]:
result = salary / 173
result.head(3)

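Two division signs are used for **floor division**. The result is rounded down to the nearest whole number. For positive values such as salaries, this is the same as truncating the decimals (and not rounding to the nearest integer).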

In [None]:
result = salary // 173
result.head(3)
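
Floor division and truncation only differ for negative results. The sketch below uses a made-up two-value Series (an assumption for illustration, since the salaries are all positive) to compare true division, floor division, and truncation with numpy's `trunc` function.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to include a negative number
s = pd.Series([7.0, -7.0])

true_div = s / 2              # keeps the decimals
floor_div = s // 2            # rounds down toward negative infinity
truncated = np.trunc(s / 2)   # drops the decimals (rounds toward zero)

print(true_div.tolist())
print(floor_div.tolist())
print(truncated.tolist())
```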

### Isn't this chapter about calling methods?

Although the above operations are not actual methods and do not use dot notation, they work similarly to methods. You can think of them as methods that take exactly one parameter, the other object that is being operated on.

### Arithmetic operations are vectorized

All the above arithmetic operations are **vectorized**. This means that the operation was applied to each value in the Series without an explicit writing of a `for` loop. Python lists do not work like this and require an explicit for loop to operate on each value.
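
Here is a small sketch of the contrast between a list and a Series. The values are made up for illustration.

```python
import pandas as pd

values = [100.0, 250.0, 400.0]   # hypothetical data

# A list needs an explicit loop (here, a list comprehension)
loop_result = [v + 5 for v in values]

# Adding a number directly to a list is an error
try:
    values + 5
    list_error = None
except TypeError as err:
    list_error = type(err).__name__

# A Series applies the operation to every value without a loop
s = pd.Series(values)
vec_result = s + 5

print(loop_result)
print(list_error)
print(vec_result.tolist())
```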

## Comparison Operations

The following six comparison operators work similarly to their arithmetic analogs from above:

* `< ` - Less than
* `<=` - Less than or equal to
* `> ` - Greater than
* `>=` - Greater than or equal to
* `==` - Equal to
* `!=` - Not equal to

In the boolean selection chapters, we used these vectorized comparison operations (without the terminology) to produce Series of booleans. Let's see a few examples below beginning by testing whether each salary is greater than 50,000.

In [None]:
result = salary > 50000
result.head(3)

Here, we test whether each salary is not equal to 82,182.

In [None]:
result = salary != 82182
result.head(3)

## Boolean and bitwise operators

Python has three boolean operators, the keywords `and`, `or`, and `not`. These keywords work on a single truth value and cannot be made to operate element by element, so they cannot perform vectorized boolean operations. Instead, pandas and numpy rely on the bitwise and, or, and not operators, respectively `&`, `|`, and `~`, to perform vectorized boolean operations. They were thoroughly covered in the preceding chapters. Let's do one example to review by determining whether a salary is less than 50,000 or greater than 100,000.

In [None]:
result = (salary < 50000) | (salary > 100000)
result.head(3)
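
To see why the keywords do not work, the sketch below tries `or` on a made-up Series (the values are an assumption for illustration) and then repeats the test with the bitwise operators.

```python
import pandas as pd

s = pd.Series([30000, 75000, 120000])   # hypothetical salaries

# The `or` keyword asks for a single truth value of the whole Series,
# which pandas refuses to provide
try:
    (s < 50000) or (s > 100000)
    keyword_error = None
except ValueError as err:
    keyword_error = type(err).__name__

# The bitwise operators work element by element
either = (s < 50000) | (s > 100000)
neither = ~either

print(keyword_error)
print(either.tolist())
print(neither.tolist())
```

The parentheses around each condition are needed because the bitwise operators have higher precedence than the comparison operators.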

## Statistical methods

We now call *actual* methods that compute [basic descriptive statistics][1] on a numerical Series. You might want to click the previous link to have the list of all the possible statistical methods. We call the methods explicitly with dot notation. It is useful to place these methods into two categories - those that **aggregate** and those that do not.

### Aggregation methods

A method that performs an aggregation returns a **single** number to summarize the Series. Examples of methods that aggregate are:

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-missing values
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns the given percentile of the distribution

### Non-aggregation methods

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:

* `abs` - takes absolute value
* `round` - round to the nearest given decimal place
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats

## Aggregation methods
Let's see a few examples of common aggregation methods. Let's begin by summing every value in the Series with the `sum` method.

In [None]:
salary.sum()

Get the minimum value of the Series with the `min` method.

In [None]:
salary.min()

Get the maximum value of a Series with the `max` method.

In [None]:
salary.max()

Use the `quantile` method to return the given percentile of the Series. It accepts values between 0 and 1. By default, it returns the 50th percentile. Below, we pass it .95 to return the 95th percentile of salary. This means that 95 percent of the employees for the city of Houston have this salary or below.

In [None]:
salary.quantile(.95)
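
The following sketch runs `quantile` on a tiny, made-up Series so the results can be checked by hand. Passing a list of quantiles is not covered above. It returns a Series with the requested quantiles in the index.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])   # hypothetical data

default_q = s.quantile()              # defaults to the 50th percentile
median = s.median()                   # same value as the default quantile
q75 = s.quantile(.75)                 # 75th percentile
several = s.quantile([.25, .75])      # a list returns a Series

print(default_q, median, q75)
print(several)
```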

### The `count` method

The `count` method returns the number of non-missing values. It does NOT return the total number of values in the Series. Since this number is less than `len(salary)`, we know missing values exist.

In [None]:
salary.count()

In [None]:
len(salary)
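
The difference between `count` and `len` is easy to see on a small Series with known missing values. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])   # hypothetical data, two missing

n_total = len(s)               # all values, missing or not
n_present = s.count()          # non-missing values only
n_missing = s.isna().sum()     # number of missing values

print(n_total, n_present, n_missing)
```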

### pandas ignores missing values by default

One big difference between pandas and numpy is that pandas ignores missing values by default. When calling aggregation methods such as `sum` or `mean`, pandas ignores any missing value as if that piece of data did not exist. numpy returns `nan` for its aggregation methods when one or more values are missing. Let's verify this by extracting the values of `salary` as a numpy array and then calling the array `sum` method.

In [None]:
salary.values.sum()

We can make pandas Series behave like numpy by setting the `skipna` parameter to `False`. Most of the aggregation and accumulation methods, such as `sum`, `mean`, `min`, `max`, and `cumsum`, have the `skipna` parameter available.

In [None]:
salary.sum(skipna=False)
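
The sketch below reproduces this behavior on a made-up Series with one missing value. It also uses numpy's `nansum` function, which is not discussed above, to show that numpy can skip missing values when asked explicitly.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])   # hypothetical data, one missing

pandas_sum = s.sum()                  # missing value is ignored
pandas_mean = s.mean()                # mean of the two non-missing values
numpy_sum = s.values.sum()            # nan, since a value is missing
strict_sum = s.sum(skipna=False)      # pandas behaving like numpy
numpy_nansum = np.nansum(s.values)    # numpy ignoring missing values

print(pandas_sum, pandas_mean, numpy_sum, strict_sum, numpy_nansum)
```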

### The `describe` method

The `describe` method returns several aggregations at once as a Series. The name of the aggregation is placed in the index. By default, it returns the count (number of non-missing values), mean, standard deviation, min, max, and the 25th, 50th (median), and 75th percentiles.

In [None]:
salary.describe()

You can use the `percentiles` parameter to control which percentiles get returned. Pass it a list of all the percentiles (numbers between 0 and 1) you would like returned.

In [None]:
salary.describe(percentiles=[.1, .2, .5, .8, .9, .99])
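
Since `describe` returns a Series, individual statistics can be selected from it by label. The sketch below uses made-up data and assumes the percentile labels are formatted as strings such as `'10%'`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data

summary = s.describe(percentiles=[.1, .9])
print(summary)

# Select single statistics by their index label
n = summary["count"]
average = summary["mean"]
lowest = summary["min"]
highest = summary["max"]
```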

## Non-Aggregation methods

Many of these computational methods aggregate and return a single value, but others do not. For instance, the `abs` method takes the absolute value of each individual value in the Series. It returns a Series with the same number of values as the original. In this example, none of the values in the Series are negative, so the values remain the same.

In [None]:
salary.abs().head(3)

The `round` method rounds each value to the nearest given decimal place. Use the `decimals` parameter to determine the place of the rounding. Negative numbers may be used to round places to the left of the decimal. In the following example, we round to the nearest thousand.

In [None]:
salary.round(decimals=-3).head(3)
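
Here is a small sketch of how the `decimals` parameter changes the result. The values are made up and chosen to avoid ties that fall exactly halfway between two rounded values.

```python
import pandas as pd

s = pd.Series([61234.0, 45600.0, 38499.0])   # hypothetical salaries

nearest_thousand = s.round(decimals=-3)
nearest_hundred = s.round(decimals=-2)

print(nearest_thousand.tolist())
print(nearest_hundred.tolist())
```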

### Accumulation methods

There are a few accumulation methods that work by keeping track of previous data. For instance, the `cummin` method keeps track of the current minimum value in the Series. It begins at the top with the first value. Since it's the first, it will be the minimum. It then continues down the Series to the second value. If the second value is less than the first, then it will be the new minimum. If not, then the first value will remain as the minimum. It returns a Series the same length as the original of all the current minimums. With our Series, the first salary remains the lowest until the fifth value which remains the lowest until the 10th value.

In [None]:
salary.cummin().head(10)

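The `cumsum` method works in the same manner, but keeps a running total of all the values seen so far.

In [None]: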
salary.cumsum()
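
The accumulation methods are easiest to follow on a few values that can be traced by hand. The Series below is made up for illustration.

```python
import pandas as pd

s = pd.Series([5, 7, 3, 4, 1])   # hypothetical data

running_min = s.cummin()   # smallest value seen so far
running_max = s.cummax()   # largest value seen so far
running_sum = s.cumsum()   # total of all values so far

print(running_min.tolist())
print(running_max.tolist())
print(running_sum.tolist())

# The last cumulative sum equals the sum of the entire Series
print(running_sum.iloc[-1] == s.sum())
```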

### Non-aggregation methods return an entirely new Series

The non-aggregation methods return an entirely new Series and do not modify the calling Series. This is a crucial concept to understand. pandas has only a few operations and methods that modify objects in-place. Nearly all of the time, a new object is returned. Here, we begin by assigning the result of the `round` method to a variable name.

In [None]:
salary_round = salary.round(decimals=-3)
salary_round.head(3)

Let's verify that the calling object has not changed. The `salary` Series is the calling object, i.e., the one that is calling the method and remains unchanged.

In [None]:
salary.head(3)
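
The same check can be sketched with the `abs` method on a made-up Series that contains negative values, where a change would be easy to spot.

```python
import pandas as pd

original = pd.Series([-2.5, 1.0, -4.0])   # hypothetical data

absolute = original.abs()                 # a new Series is returned

print(absolute.tolist())
print(original.tolist())                  # the calling Series is unchanged
print(absolute is original)               # two different objects
```

To keep the result, assign it to a variable name, as was done with `salary_round` above.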

## Series with a non-default index

Let's use a different Series that does not use the default `RangeIndex` to run some of the same methods as above. We'll read in the movie dataset with the title as the index and select the `imdb_score` column as a Series.

In [None]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
score = movie['imdb_score']
score.head()

All of the methods in this chapter work the exact same way as they do with the default index. They all operate on the **values** of the Series and NOT on the index. The index is merely a label for the values. The methods do calculations on the values. Let's show this by taking the mean of the scores. Notice how a single value is returned. The index has nothing to do with these calculations.

In [None]:
score.mean()

Let's calculate the statistical variance with the `var` method.

In [None]:
score.var()

Calling the non-aggregation methods is where some confusion might arise. Below, we round each score to the nearest whole number. Since we are not aggregating, a Series is returned and the original index remains with it. Again, no calculation is done on the index. The calculation is only applied to the values.

In [None]:
score.round().head()

Here, we take the current maximum value with the `cummax` method. The index helps out here by informing us which movie is attached to the score. Avatar retains the highest score until it is surpassed by The Dark Knight Rises.

In [None]:
score.cummax().head()
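
The sketch below repeats these ideas on a tiny Series with string labels. The titles and scores are made up for illustration and are not taken from the movie dataset.

```python
import pandas as pd

# Hypothetical titles and scores
toy_score = pd.Series(
    [7.9, 7.1, 6.8, 8.6],
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

average = toy_score.mean()          # aggregation: a single number, no index
rounded = toy_score.round()         # non-aggregation: index is kept
running_max = toy_score.cummax()    # non-aggregation: index is kept

print(average)
print(rounded)
print(running_max)
```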

## Operations on a boolean Series

All of the above methods were called on a Series with numeric values. In this section, we will call several of the same aggregation and non-aggregation methods on a Series of booleans. Let's begin by creating a boolean Series by determining which movies had a score greater than eight.

In [None]:
score_8 = score > 8
score_8.head()

We can use this Series to filter the data just like we did in the chapters on boolean selection.

In [None]:
only_8 = score[score_8]
only_8.head()

We can determine the number of movies that have a score greater than eight by finding the length of this result.

In [None]:
len(only_8)

### Sum a boolean Series

We can find the number of movies with a score greater than 8 without doing boolean selection. Instead, we can call the `sum` method.

In [None]:
score_8.sum()

### Boolean values are treated as numeric

When performing arithmetic calculations, pandas treats boolean values as numeric. `False` evaluates as 0 and `True` evaluates as 1. With the `score_8` boolean Series, there are 249 `True` values with the rest being `False`. Calling the `sum` method on any boolean Series returns the number of `True` values in that Series.

It is possible to compute this sum without first assigning the boolean Series to a new variable name. We can surround the condition in parentheses and then call the `sum` method.

In [None]:
(score > 8).sum()

### Explanation of this one line of code

Let's examine the line `(score > 8).sum()`. Python first evaluates the expression in parentheses - `score > 8`. This results in a Series, which has all the available methods as any other Series. We then call the `sum` method on this Series to get the desired result.
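
As a minimal sketch with made-up scores, the following confirms that summing the boolean Series gives the same count as filtering and taking the length. Because `True` counts as 1 and `False` as 0, the `mean` method on a boolean Series returns the fraction of `True` values.

```python
import pandas as pd

toy_score = pd.Series([8.5, 6.0, 9.1, 7.2])   # hypothetical scores

count_by_filter = len(toy_score[toy_score > 8])   # boolean selection, then length
count_by_sum = (toy_score > 8).sum()              # number of True values
fraction = (toy_score > 8).mean()                 # proportion of True values

print(count_by_filter, count_by_sum, fraction)
```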

## Exercises

Continue to use the `score` Series for the first several exercises.

### Exercise 1

<span  style="color:green; font-size:16px">What is the data type of `score` and how many values does it contain?</span>

### Exercise 2

<span  style="color:green; font-size:16px">What are the maximum and minimum scores?</span>

### Exercise 3

<span  style="color:green; font-size:16px">How many movies have scores greater than 6?</span>

### Exercise 4

<span  style="color:green; font-size:16px">How many movies have scores greater than 4 and less than 7?</span>

### Exercise 5

<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

### Exercise 6

<span  style="color:green; font-size:16px">Add 1 to every value of `score` and then calculate the median.</span>

### Exercise 7

<span  style="color:green; font-size:16px">Calculate the median of `score` and add 1 to this. Why is this value the same as Exercise 6?</span>

### Exercise 8

<span  style="color:green; font-size:16px">Return a Series that has only scores above the 99.9th percentile.</span>

### Exercise 9

<span  style="color:green; font-size:16px">Assign the gross column of the movie dataset to its own variable name. Round it to the nearest million.</span>

### Exercise 10

<span  style="color:green; font-size:16px">Calculate the cumulative sum of the gross Series and then select the 99th integer location.</span>

### Exercise 11

<span  style="color:green; font-size:16px">Select the first 100 values of the gross Series and then calculate the sum. Does the result match Exercise 10?</span>