## Only selected subsets of data
Up to this point, we have only selected subsets of data. We have not changed the data or made any operations on the data. Our selections have happened in two ways:

* Selection by label and integer location
* Selection by actual values of the DataFrame (Boolean Selection)

Both of these methods use Python's indexing operator, brackets **`[]`**. Boolean selection builds a Series of booleans with the comparison operators, such as **`<`** and **`>`**.

## Calling methods on a Series/DataFrame
Selecting subsets of data does not usually require calling methods using dot notation. In this notebook we will call many methods that will perform actions on our DataFrame. We have actually already done some of this with **`head`**, **`tail`**, **`isna`**, and **`set_index`**.

More than 250 methods are available to both DataFrames and Series.

## Use a small subset of methods
It can be quite overwhelming to think about having to learn and memorize this staggering amount of functionality. The good news is that many of these methods are unnecessary and don't add any extra functionality. Furthermore, many methods are remnants from the early days of Pandas and have few or no use cases, or have been **deprecated**. When a method is deprecated, its use is discouraged and it will likely be removed from the library in the future.

## Minimally Sufficient Pandas
I suggest using a subset of the Pandas library that allows you to do as many tasks as possible. You should strive to write Pandas code as simply as possible to maximize both performance and readability. Since there is so much functionality, power users of Pandas can come up with very creative and complex code to accomplish different tasks. This is NOT a positive thing: when working with a group of other data scientists, it can confuse those who are not familiar with the syntax.

## Knowing lots of tricks doesn't help
It is not uncommon to see Pandas experts provide several different answers to the same question. Nearly all operations can be accomplished with a small subset of the available operations.

# Begin with the Series
We will begin our exploration of the Pandas library by retrieving attributes and calling methods from Series objects.

## View the API for complete list of functionality
The term **Application Programming Interface**, or **API**, is used throughout programming to describe the functionality that a language or library makes available. The Pandas API reference can be found [here][1]. This is a huge list, but as mentioned above, only a small subset of this page is needed for the vast majority of tasks.

## The best of the Pandas Series API
The Pandas Series object is a single column of data and easier to work with than an entire DataFrame. We start with it and cover the most basic and important methods below. Navigate to the [Series API][2] section of Pandas.

### City of Houston Employee Data
We will use a small public dataset from City of Houston employees with information on their position, race, gender, and salary. Notice that the columns `hire_date` and `job_date` can be coerced to datetimes.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html
[2]: http://pandas.pydata.org/pandas-docs/stable/api.html#series

In [1]:
import pandas as pd

emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date', 'job_date'])
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,job_date
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic,Female,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic,Female,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Male,1989-06-19,1994-10-22


## Select a single column as a Series
Let's select the **`salary`** column as a Series and use it to explore the API.

In [2]:
salary = emp['salary']
salary.head()

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
Name: salary, dtype: float64

In [3]:
type(salary)

pandas.core.series.Series

## Core Series Attributes
Pandas Series have [many attributes][1], but only a few are important to know. The attributes to be aware of are:

* `index`
* `values`
* `size`
* `dtype`

The **`index`** and **`values`** were covered in a previous notebook. Only **`size`** and **`dtype`** are new. The size represents the total number of values in the Series. And the **`dtype`** is the data type of the values. Let's display these now.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#attributes

In [4]:
salary.size

2000

In [5]:
salary.dtype

dtype('float64')

### `len` function instead of `size` attribute
The built-in **`len`** function returns the same number as the **`size`** attribute. They are both equally acceptable.

In [6]:
len(salary)

2000

# Basic arithmetic operations
Adding 5 to each value in a Series is an example of a basic arithmetic operation. For these, we use the operator symbol (the plus sign here):

In [8]:
salary.head()

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
Name: salary, dtype: float64

In [7]:
sal_head = salary.head()
sal_head + 5

0    121867.0
1     26130.0
2     45284.0
3     63171.0
4     56352.0
Name: salary, dtype: float64

### Other arithmetic operations on a Series
Let's see some examples of this with a small Series to avoid long output:

In [9]:
sal_head - 1

0    121861.0
1     26124.0
2     45278.0
3     63165.0
4     56346.0
Name: salary, dtype: float64

In [10]:
# raise to the power of three
sal_head ** 3

0    1.809693e+15
1    1.783072e+13
2    9.283046e+13
3    2.520288e+14
4    1.789008e+14
Name: salary, dtype: float64

In [11]:
# floor division
sal_head // 17

0    7168.0
1    1536.0
2    2663.0
3    3715.0
4    3314.0
Name: salary, dtype: float64

## Isn't this notebook about calling methods?
Although the above operations are not using dot notation, they are technically invoking something called **special methods**. Special methods are an advanced Python topic. 
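
As a rough illustration of what "special methods" means here, the sketch below uses a small made-up Series (not the employee data) and shows that the `+` operator gives the same result as calling the `__add__` special method directly. Calling `__add__` by hand is only for demonstration; in practice you would always write the operator.

```python
import pandas as pd

# A small hypothetical Series standing in for a few salaries
s = pd.Series([100.0, 200.0, 300.0])

with_operator = s + 5         # the usual way to add
with_special = s.__add__(5)   # the special method that + invokes

# Both approaches produce identical Series
same = with_operator.equals(with_special)
print(same)
```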

## Arithmetic operations are vectorized
All the above arithmetic operations were **vectorized**. This means that each operation was applied to every value in the Series without explicitly writing a **`for`** loop. Python lists do NOT work like this.
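
To make the contrast with lists concrete, here is a minimal sketch using made-up numbers. With a list, we have to loop over the values ourselves, and adding a number directly to a list raises a `TypeError`. With a Series, the same addition is applied to every value at once.

```python
import pandas as pd

values = [1, 4, 10]  # hypothetical example data

# A list needs an explicit loop (here, a list comprehension)
list_result = [v + 5 for v in values]

# Adding a number directly to a list raises a TypeError
try:
    values + 5
    list_add_failed = False
except TypeError:
    list_add_failed = True

# A Series applies the operation to every value without a loop
series_result = pd.Series(values) + 5

print(list_result)
print(list_add_failed)
print(series_result.tolist())
```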

## We've already done this with the comparison operators

In the boolean indexing notebooks, we used the vectorized comparison operators to produce Series of booleans. The same thing is happening now with the arithmetic operators.

# Descriptive Statistics Methods
We will now call methods that compute [basic descriptive statistics][1] of a numerical Series. We will do so explicitly with dot notation. It is useful to place these methods into two categories - those that **aggregate** and those that do not.

A method that performs an aggregation returns a **single** number to represent the description. Examples of methods that aggregate are:
* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-na values
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns given percentile of distribution

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:
* `abs` - takes absolute value
* `round` - round to the nearest given decimal
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another
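
The difference between the two categories is easiest to see by looking at what comes back. The sketch below uses a small made-up Series: the aggregation returns one number, while the non-aggregation methods return a Series with the same number of values as the original.

```python
import pandas as pd

s = pd.Series([3.0, -1.0, 2.0])  # hypothetical example data

total = s.sum()        # aggregation: a single number
running = s.cumsum()   # non-aggregation: one value per row
absolute = s.abs()     # non-aggregation: one value per row

print(total)
print(running.tolist())
print(absolute.tolist())
print(len(running) == len(s))
```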

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats

## Aggregation methods
Let's use some of the popular built-in aggregation methods:

In [12]:
salary.sum()

105178319.0

In [13]:
salary.min()

24960.0

Count non-missing values. Since this number is less than 2,000, we know missing values exist.

In [14]:
salary.count()

1886

# Pandas ignores missing values by default
One big difference between Pandas and NumPy is that Pandas ignores missing values by default. When calling aggregation methods such as **`sum`** or **`mean`**, Pandas ignores any missing value as if that piece of data did not exist.
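
Here is a small sketch of that difference, using made-up data with one missing value. The NumPy array's `sum` method returns `nan` because the missing value propagates, while the Pandas Series skips it.

```python
import numpy as np
import pandas as pd

data = [10.0, np.nan, 30.0]  # hypothetical data with one missing value

numpy_sum = np.array(data).sum()      # the missing value propagates
pandas_sum = pd.Series(data).sum()    # the missing value is skipped
pandas_mean = pd.Series(data).mean()  # averaged over the 2 present values

print(numpy_sum)    # nan
print(pandas_sum)   # 40.0
print(pandas_mean)  # 20.0
```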

## Non-Aggregation Descriptive Statistic Methods

In [15]:
sal_head.abs()

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
Name: salary, dtype: float64

In [16]:
# round to the nearest thousand
sal_head.round(decimals=-3)

0    122000.0
1     26000.0
2     45000.0
3     63000.0
4     56000.0
Name: salary, dtype: float64

### Accumulation methods
There are a few accumulation methods that work by keeping track of previous data. For instance, the `cummin` method keeps track of the current minimum value in the Series. It begins at the top at the first value. Since it's the first, it will be the minimum. It then continues down the Series to the second value. If the second value is less than the first, then it will be the new minimum. It returns a Series of all the current minimums.

In [17]:
sal_head.cummin()

0    121862.0
1     26125.0
2     26125.0
3     26125.0
4     26125.0
Name: salary, dtype: float64

## The non-aggregation methods return an entirely new Series
The non-aggregation methods return an entirely new Series and do not modify the calling Series. This is a crucial concept to understand. Pandas has only a few operations and methods that modify objects in-place. Nearly all of the time, a new object is returned. We verify that the calling Series has not changed.

In [18]:
sal_head_round = sal_head.round(decimals=-3)

In [19]:
sal_head_round

0    122000.0
1     26000.0
2     45000.0
3     63000.0
4     56000.0
Name: salary, dtype: float64

In [20]:
sal_head

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
Name: salary, dtype: float64

# Operations on a boolean Series
One nice property of boolean Series is that their values evaluate to 0/1. False evaluates to 0 and True evaluates to 1. This makes for some nice shortcuts when answering some queries.

Let's create a boolean Series determining whether an employee is white or not.

In [21]:
race = emp['race'] 
filt = race == 'White'
filt.head()

0    False
1    False
2     True
3     True
4     True
Name: race, dtype: bool

If we are interested in the number of employees that are white, we could do boolean selection like this and then find the length of the result.

In [22]:
just_white = race[filt]
just_white.head()

2    White
3    White
4    White
7    White
8    White
Name: race, dtype: object

In [23]:
len(just_white)

665

### Just sum a boolean Series to find the number that meet the condition

In [24]:
filt.sum()

665

We can even shorten this to a single line of code.

In [25]:
(emp['race'] == 'White').sum()

665

## Explanation of this one line of code
Writing **`(emp['race'] == 'White').sum()`** first compares each value in the race column to the string `'White'`. This is the code within the parentheses: `emp['race'] == 'White'`.

This produces a temporary Series, which has the same methods available as any other Series. We then call the **`sum`** method on this temporary Series.
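
The same idea can be checked on a tiny made-up Series (the values below are hypothetical, not taken from the employee data). Because `True` counts as 1 and `False` as 0, summing the boolean Series gives the same count as selecting the matching rows and taking the length.

```python
import pandas as pd

# Hypothetical values, not taken from the employee data
race = pd.Series(['White', 'Hispanic', 'White', 'Black', 'White'])

is_white = race == 'White'

count_by_selection = len(race[is_white])  # select first, then count
count_by_sum = is_white.sum()             # sum the booleans directly

print(True + True)         # 2
print(count_by_selection)  # 3
print(count_by_sum)        # 3
```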

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index and assign the `imdb_score` as a Series to variable `score`. Output the first 5 values.</span>

In [26]:
# your code here
movie = pd.read_csv('../data/movie.csv', index_col='title')

In [27]:
score = movie['imdb_score']
score.head(5)

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">What is the data type of `score` and how many values does it contain?</span>

In [28]:
# your code here
score.dtype

dtype('float64')

In [30]:
score.count()

4916

In [31]:
score.size

4916

In [32]:
len(score)

4916

### Problem 3
<span  style="color:green; font-size:16px">What is the maximum and minimum score?</span>

In [34]:
# your code here
score.min()

1.6

In [35]:
score.max()

9.5

### Problem 4
<span  style="color:green; font-size:16px">How many missing values are there in the `score`?</span>

In [None]:
# total number of values minus the number of non-missing values
score.size - score.count()

In [40]:
# your code here
filt = score.isna()
score[filt]

Series([], Name: imdb_score, dtype: float64)

### Problem 5
<span  style="color:green; font-size:16px">Read the docstrings on how the `rank` method works and then rank the first 10 values in `score` from highest to lowest.</span>

In [None]:
# your code here


### Problem 6
<span  style="color:green; font-size:16px">How many movies have scores greater than 6? (Remember that True/False evaluates to 1/0)</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">How many movies have scores greater than 4 and less than 7?</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

In [None]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Add 1 to every value of `score` and then calculate the median.</span>

In [None]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Calculate the median of `score` and add 1 to this. Why is this value the same as problem 9?</span>

In [None]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">Return a Series that has only scores above the 99.9th percentile</span>

In [None]:
# your code here

# Explore More Methods and their parameters
In the section below, you can learn and practice with other methods and their parameters. There are far too many to cover during a lecture, so they are left for you to explore on your own.


### Skipping missing values
Pandas provides you with control over how missing data is handled with the **`skipna`** parameter which exists for all the aggregation methods. By default it is **`True`**. If you set it to **`False`** and your Series contains missing values, then Pandas will return a missing value as the result. Experiment with this parameter:

In [None]:
# your code here
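
If you would like a starting point, here is one possible sketch with a small made-up Series containing a single missing value. It compares the default behavior with `skipna=False` for `sum` and `max`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # hypothetical data, one missing value

default_sum = s.sum()             # skipna=True by default
strict_sum = s.sum(skipna=False)  # the missing value propagates
strict_max = s.max(skipna=False)  # same behavior for other aggregations

print(default_sum)  # 4.0
print(strict_sum)   # nan
print(strict_max)   # nan
```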

### `describe` and `quantile`
There are more built-in aggregation methods such as `describe` and `quantile`.

In [None]:
salary.describe()

By default, the `quantile` method returns the median. It uses parameter `q` as a number between 0 and 1. We can change it to return the 20th and 99th percentiles respectively.

In [None]:
salary.quantile(q=.2)

In [None]:
salary.quantile(q=.99)
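
As a quick check of the default behavior, the sketch below uses a small made-up Series and confirms that calling `quantile` with no arguments matches `median`. The last line computes the 20th percentile, which for this data comes out at about 1.8 because Pandas interpolates linearly between neighboring values by default.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical example data

default_q = s.quantile()  # q defaults to 0.5
median = s.median()
q20 = s.quantile(q=0.2)   # falls between the first two values

print(default_q == median)
print(q20)
```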

In [None]:
# your code here

# More accumulation methods

In [None]:
sal_head.cummax()

In [None]:
sal_head.cumsum()

## The `rank` method
This will provide a numerical rank to each value in the Series. It's as if each value were in a competition and there was a leaderboard. Experiment with the `method` parameter. There are many different types of ranking that can be done.

In [None]:
sal_head

In [None]:
sal_head.rank()

In [None]:
# your code here
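
Here is one way to start experimenting, using a small made-up Series with a tie. It is only a sketch of a few options: the `method` parameter controls how ties are ranked, and `ascending=False` gives the highest value rank 1.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])  # hypothetical scores with one tie

average_rank = s.rank()               # ties share the average rank
min_rank = s.rank(method='min')       # ties share the lowest rank
dense_rank = s.rank(method='dense')   # like 'min', but no gaps after ties
descending = s.rank(ascending=False)  # highest value gets rank 1

print(average_rank.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())      # [1.0, 2.0, 2.0, 4.0]
print(dense_rank.tolist())    # [1.0, 2.0, 2.0, 3.0]
print(descending.tolist())    # [4.0, 2.5, 2.5, 1.0]
```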

# Differencing methods  `diff` and `pct_change`
These methods take the difference between the current value and some other value. By default, the other value is the immediately preceding one.

In [None]:
# print out the Series to visually verify
sal_head

In [None]:
sal_head.diff()

The first parameter, **`periods`**, determines which two values are subtracted. For instance we can subtract the 2nd previous value from the current like this:

In [None]:
sal_head.diff(periods=2)

Or even reverse the subtraction by using negative values:

In [None]:
sal_head.diff(-1)

The **`pct_change`** method works analogously but divides that difference by the other value, returning the change as a fraction (0.1 means 10%).

In [None]:
sal_head.pct_change()

In [None]:
sal_head.pct_change(-1)
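
The arithmetic is easier to follow with round numbers. The sketch below uses a made-up Series that grows by 10% at each step, so the values can be verified by hand. The first entries are missing because there is no earlier value to compare against.

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])  # hypothetical example data

diff1 = s.diff()           # current value minus the previous value
diff2 = s.diff(periods=2)  # current value minus the value two rows earlier
pct = s.pct_change()       # the difference divided by the previous value

print(diff1.tolist())  # [nan, 10.0, 11.0]
print(diff2.tolist())  # [nan, nan, 21.0]
print(pct.round(2).tolist())
```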

In [None]:
# your code here

# Calling methods after an operation
Let's say, you would like to find the sum of all the salaries after giving everyone a $5,000 bonus. You can do this in one line like this:

In [None]:
(salary + 5000).sum()

### Must use parentheses
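The parentheses are required: they make sure the addition happens first, so that **`sum`** is called on the resulting Series rather than on the number 5000. The one-line syntax might be confusing, but it is doing the same thing as this: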

In [None]:
salary_bonus = salary + 5000
salary_bonus.sum()

## Explanation
Writing **`(salary + 5000).sum()`** first adds 5,000 to each value in the salary Series. This produces a temporary Series, which has the same methods available as any other Series. We then call the **`sum`** method on this temporary Series.
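
We can sanity-check this on a tiny made-up Series (the numbers below are hypothetical). The one-line and two-line versions agree, and both match the arithmetic done by hand: the original sum plus 5,000 for each non-missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, 300.0])  # hypothetical salaries, one missing

one_line = (s + 5000).sum()   # sum called on the temporary Series

bonus = s + 5000              # same Series, but assigned to a variable
two_lines = bonus.sum()

# Original sum plus one bonus per non-missing value
by_hand = s.sum() + 5000 * s.count()

print(one_line)   # 10400.0
print(two_lines)  # 10400.0
print(by_hand)    # 10400.0
```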

### What is a temporary Series?
The word **`temporary`** is used to describe a Series object that is not assigned to any variable. It is only held in memory temporarily during the execution of that one statement.

## A temporary list
Let's see another example of a temporary object. Here, the temporary list **`[-99, -11]`** is never assigned to a variable and only exists during the execution of the second line of code below:

In [None]:
a = [1, 4, 10]
b = [-99, -11] + a
b