# More DataFrame Methods

In this chapter, we cover several less common, but still useful, DataFrame methods that you need to know in order to be fully capable of analyzing data with pandas.

* `agg` - Compute multiple aggregations at once
* `idxmax` and `idxmin` - Return the index of the max/min
* `diff` and `pct_change` - Find the difference/percent change from one value to the next
* `sample` - Randomly sample rows/columns
* `nsmallest`/`nlargest` - Return the bottom/top `n` values
* `replace` - Replace one or more values in a variety of ways
* `corr` - Compute the correlation between each pair of numeric columns

Let's read in the movie dataset with the title in the index and select just the numeric columns.

In [None]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title').select_dtypes('number')
movie.head()

## The `agg` method

The `agg` method allows us to calculate several aggregations at once by providing it a list of the aggregation methods as strings. Here, we find the min, max, and number of unique values for each column.

In [None]:
aggs = movie.agg(['min', 'max', 'nunique'])
aggs

This returned data might be easier to read when transposed. Let's transpose the results with the `T` attribute.

In [None]:
aggs.T
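
To see the shape of the result without the movie file, here is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical). Each aggregation becomes a row label, and transposing moves the aggregations into the columns.

```python
import pandas as pd

# Small made-up DataFrame; column names and values are hypothetical
toy = pd.DataFrame({'duration': [90, 120, 120],
                    'gross': [1.5, 3.0, 2.0]})

# One row per aggregation, one column per original column
toy_aggs = toy.agg(['min', 'max', 'nunique'])
print(toy_aggs)

# Transposed: one row per original column, one column per aggregation
print(toy_aggs.T)
```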

## The index of the minimum and maximum

The `idxmin` and `idxmax` methods return the index label where the minimum or maximum value occurs for each column. When we call the `idxmax` method on our DataFrame, we learn that the movie with the longest duration is 'Trapped', the movie with the highest gross is 'Avatar', the one with the highest IMDB score is 'Towering Inferno', and so on. Depending on your pandas version, these methods may raise an error when used with string columns, so it is safest to call them on numeric columns only.

In [None]:
movie.idxmax()
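
The following is a small sketch on made-up data (the titles and values are hypothetical) showing that both methods return a Series of index labels, one per column.

```python
import pandas as pd

# Made-up data with hypothetical movie titles in the index
toy = pd.DataFrame({'duration': [90, 150, 120],
                    'score': [8.1, 6.5, 7.0]},
                   index=['Movie A', 'Movie B', 'Movie C'])

# Index label of the largest value in each column
longest = toy.idxmax()
print(longest)

# Index label of the smallest value in each column
shortest = toy.idxmin()
print(shortest)
```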

## Differencing methods `diff` and `pct_change`

The `diff` and `pct_change` methods work just as they do on a Series. Let's read in the `stocks10` dataset which contains the closing stock price for ten stocks beginning from 2010.

In [None]:
stocks = pd.read_csv('../data/stocks/stocks10.csv', index_col='date', 
                     parse_dates=['date'])
stocks.head()

The `diff` method takes the difference between the current value and the nth value preceding it. Below, we get the change in price from two trading days prior.

In [None]:
stocks.diff(2).head()

The `pct_change` method returns the percentage change as a fraction. Here, we round the number and multiply by 100 so the results show actual percentages.

In [None]:
stocks.pct_change(2).round(3).head() * 100
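
As a rough check of what these two methods compute, the sketch below uses a made-up price column (the ticker `AAA` and its prices are hypothetical). It shows that the percent change over `n` periods is the `n`-period difference divided by the value `n` rows earlier.

```python
import pandas as pd

# Made-up closing prices for a hypothetical ticker
prices = pd.DataFrame({'AAA': [100.0, 110.0, 99.0, 120.0]})

# Difference from the value two rows earlier; the first two rows are missing
two_day_diff = prices.diff(2)
print(two_day_diff)

# Percent change (as a fraction) from the value two rows earlier
pct = prices.pct_change(2)
print(pct)

# The same fraction computed by hand: difference / earlier value
manual = prices.diff(2) / prices.shift(2)
print(manual)
```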

## The `sample` method

The `sample` method randomly samples rows or columns from the DataFrame. Here, we select three random rows. By default, sampling is done without replacement, so these will be three unique rows.

In [None]:
movie.sample(3)

It's possible to randomly sample columns by setting the `axis` parameter to 'columns' or 1.

In [None]:
movie.sample(5, axis='columns').head()

Use the `frac` parameter to select a random fraction of the rows and set `replace` equal to `True` to sample with replacement. Here, we select a random 25% of the rows with replacement.

In [None]:
movie.sample(frac=0.25, replace=True).shape
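
Because sampling is random, each call returns different rows. One way to make a sample reproducible is the `random_state` parameter of `sample`. The sketch below uses a made-up DataFrame and an arbitrary seed value.

```python
import pandas as pd

# Made-up DataFrame with ten rows
toy = pd.DataFrame({'x': range(10)})

# The same random_state yields the same rows on every call
first = toy.sample(3, random_state=999)
second = toy.sample(3, random_state=999)
print(first.equals(second))

# Half of the rows, sampled with replacement, so rows may repeat
half = toy.sample(frac=0.5, replace=True, random_state=999)
print(half.shape)
```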

## The `nsmallest` and `nlargest` methods

The `nsmallest` and `nlargest` methods offer an alternative to `sort_values` when you only need the bottom or top rows. Pass them the number of rows to return as an integer and the name of the column used to determine the ordering. The following returns the three rows with the highest values in the `gross` column.

In [None]:
movie.nlargest(3, 'gross')

It is possible to duplicate this with `sort_values` together with the `head` method.

In [None]:
movie.sort_values('gross', ascending=False).head(3)

### Why use `nsmallest/nlargest`?

While `nsmallest/nlargest` can be duplicated with `sort_values`, in theory, `nsmallest/nlargest` should perform better as they use a [selection algorithm][1] rather than a full sort. The `nsmallest/nlargest` methods can also keep every row that ties for the last position by setting the `keep` parameter to `'all'`, in which case more than `n` rows may be returned.

[1]: https://en.wikipedia.org/wiki/Selection_algorithm
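
To illustrate how ties are handled, here is a minimal sketch on made-up data (the titles and gross values are hypothetical). With the default `keep='first'`, exactly `n` rows come back. With `keep='all'`, every row tied with the last selected value is also returned.

```python
import pandas as pd

# Made-up data in which two movies tie for the second-highest gross
toy = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                    'gross': [10, 30, 20, 20]})

# Default keep='first': exactly two rows, the first tied row wins
default = toy.nlargest(2, 'gross')
print(default)

# keep='all': both tied rows are kept, so three rows are returned
with_ties = toy.nlargest(2, 'gross', keep='all')
print(with_ties)
```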

## The `corr` method

The `corr` method computes the correlation between every pair of numeric columns in the DataFrame. By default, it computes Pearson's correlation coefficient which is a metric that determines how well the two variables are linearly related, returning a score ranging between -1 and 1. When an increase in one variable always corresponds with the same relative increase in the other variable, a perfect positive linear relationship exists and yields a correlation of 1.

For example, the relationship between Celsius and Fahrenheit is a perfect positive relationship. An increase of one degree Celsius always corresponds to an increase of 1.8 degrees Fahrenheit. A perfect negative linear relationship does the opposite and yields a correlation of -1: an increase in one variable always corresponds with the same relative decrease in the other.
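
The temperature example can be checked with a small sketch. The temperatures are made up, and the `negated` column is a hypothetical addition used only to show a perfect negative relationship.

```python
import pandas as pd

# Made-up temperatures in Celsius
temps = pd.DataFrame({'celsius': [-10.0, 0.0, 15.0, 30.0]})

# Fahrenheit is an exact linear function of Celsius
temps['fahrenheit'] = temps['celsius'] * 1.8 + 32

# Hypothetical column that falls exactly as Celsius rises
temps['negated'] = -temps['celsius']

# Pearson correlation between every pair of columns
temp_corr = temps.corr()
print(temp_corr.round(2))
```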

The result of the `corr` method is a square DataFrame (it has the same number of rows as columns) where the row labels are the same as the original column names. Let's call the `corr` method now to compute the correlation between each pair of stocks.

In [None]:
stocks.corr().round(2)

Take a look at the first column of data. This is the pairwise correlation between MSFT and all other stocks. For example, the correlation between MSFT and TSLA is 0.72. This means that there is a tendency for the stocks MSFT and TSLA to move in the same direction. One should not read too much into correlation. By itself, correlation does not imply a causal relationship between the variables. It is just one metric to provide some information about the linear relationship between two variables.

The above DataFrame is also **symmetric**. All values along the diagonal are 1, as each stock has a perfect correlation with itself. Each value above the diagonal equals its mirror-image value below the diagonal, as the correlation between two stocks is the same regardless of their order.

Notice that the technology stocks, MSFT, AAPL, AMZN, and FB are all highly correlated with one another. The energy stocks, XOM and SLB, are also highly correlated with one another, but less correlated with the technology stocks.

### Series correlation method

Series also have a `corr` method. You must pass it a Series to find its correlation. Below, we get the correlation between MSFT and AAPL, which is the same value found in the DataFrame above.

In [None]:
stocks['MSFT'].corr(stocks['AAPL'])

## The `replace` method

The `replace` method can be used to replace values in your DataFrame. It is very powerful and flexible. It is also quite complex as there are many different combinations of parameters to handle a variety of different kinds of replacement. Let's read in the first 5 rows of the San Francisco employee compensation dataset dropping the year column. Each numeric column is rounded to the nearest ten-thousand.

In [None]:
sf_emp_head = pd.read_csv('../data/sf_employee_compensation.csv', nrows=5)
sf_emp_head = sf_emp_head.drop(columns='year').round(-4)
sf_emp_head

The `replace` method has two main parameters, `to_replace` and `value`. The simplest application is to set each one to a single value. Below, we replace all of the values equal to 10,000 with 9,999. All values in the entire DataFrame are searched to be replaced.

In [None]:
sf_emp_head.replace(to_replace=10000, value=9999)

The `replace` method can also replace exact strings. Here, we replace 'Public Protection' with 'PP'.

In [None]:
sf_emp_head.replace(to_replace='Public Protection', value='PP')

Instead of using two parameters, you can set `to_replace` to a dictionary to map the old values to the new values. When using a dictionary, you do not use the parameter `value`. Below, we replace 'Community Health' with 'Health'.

In [None]:
sf_emp_head.replace(to_replace={'Community Health': 'Health'})

You can replace as many values as you'd like with a dictionary. The first parameter is `to_replace`, so we can call this method without explicitly providing the parameter name. We import `numpy` to help replace all zeros with missing values.

In [None]:
import numpy as np
sf_emp_head.replace({'Community Health':'Health', 0: np.nan})

### Specifying which columns to search for replacement

Calling `replace` as we did above replaces all values in all columns that match the value to replace. Instead, we might be interested in only replacing values in a particular column, or replacing the same value with different values depending on the column.

We can specify which values to replace in which columns by using a dictionary of dictionaries, where the keys of the outer dictionary are the column names and the values are dictionaries of original values mapped to their replacements. Take a look at the following dictionary. When passed to the `replace` method, it instructs it to replace 0 with a missing value and 60,000 with 99,999 for just the overtime column. The retirement column will have 0 replaced with -999.

```python
{'overtime':{0: np.nan, 
             60000: 99999}, 
 'retirement': {0: -999}}
```

Let's use this dictionary to make the specific replacement.

In [None]:
sf_emp_head.replace({'overtime':{0: np.nan, 60000:99999}, 
                     'retirement': {0:-999}})
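
The sketch below applies the same nested dictionary to a small made-up DataFrame (the values are hypothetical) so the effect can be seen without the data file. Note that the 60,000 in the retirement column is left alone, since that replacement is listed only for overtime.

```python
import numpy as np
import pandas as pd

# Made-up values; both columns contain 0 and 60,000
toy = pd.DataFrame({'overtime': [0, 60000, 10000],
                    'retirement': [0, 20000, 60000]})

# Outer keys are column names, inner dictionaries map old -> new values
replaced = toy.replace({'overtime': {0: np.nan, 60000: 99999},
                        'retirement': {0: -999}})
print(replaced)
```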

### Replacing Substrings

By default, the `replace` method searches for exact strings. Attempting to replace 'Public' with 'Pub.' will do nothing in our DataFrame as there is no exact value 'Public'.

In [None]:
sf_emp_head.replace({'Public':'Pub.'})

In order to replace a substring, you must set the `regex` parameter to `True`.

In [None]:
sf_emp_head.replace({'Public':'Pub.'}, regex=True)
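
Here is a minimal side-by-side sketch of the two behaviors on made-up data (the column name and values are hypothetical). With `regex=True`, the dictionary keys are treated as regular expression patterns, so any characters with special meaning in a pattern would need escaping.

```python
import pandas as pd

# Made-up string column; no cell is exactly equal to 'Public'
toy = pd.DataFrame({'organization_group': ['Public Protection',
                                           'Community Health',
                                           'Public Works']})

# Exact matching: nothing changes
exact = toy.replace({'Public': 'Pub.'})
print(exact)

# Pattern matching: the substring is replaced wherever it appears
substring = toy.replace({'Public': 'Pub.'}, regex=True)
print(substring)
```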

## Methods available only to Series and not DataFrames

There are more than a few methods that are available only to Series objects, but the following are the most important.

### No `str` or `dt` accessor or `unique` method

DataFrames have no special methods just for strings or datetimes. There is no `str` or `dt` accessor. They can only be used on Series objects. Also, the `unique` method is only available to Series.
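
In practice, this means selecting a single column as a Series first. The sketch below uses made-up data with hypothetical column names. It also shows `nunique`, which, unlike `unique`, is available on DataFrames.

```python
import pandas as pd

# Made-up data with hypothetical column names
toy = pd.DataFrame({'dept': ['police', 'fire'],
                    'city': ['houston', 'houston']})

# Select one column as a Series, then use the str accessor
upper_dept = toy['dept'].str.upper()
print(upper_dept)

# unique is a Series method
cities = toy['city'].unique()
print(cities)

# The DataFrame itself has no str accessor
print(hasattr(toy, 'str'))

# nunique works on the whole DataFrame
counts = toy.nunique()
print(counts)
```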

## Exercises

Execute the following cell to read in the City of Houston dataset and use it to answer the next exercises.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

### Exercise 1

<span style="color:green; font-size:16px">Find the relative frequency of departments for all employees and then find the relative frequency of departments for the top 100 salaries. Compare the differences.</span>

### Exercise 2

<span style="color:green; font-size:16px">Sample 100 rows of data with replacement using a random state value of 999. Then find the count of each unique department as a Series.</span>

### Stocks dataset

Use the following stocks dataset for the remaining exercises.

In [None]:
stocks = pd.read_csv('../data/stocks/stocks10.csv', index_col='date', parse_dates=['date'])
stocks.head(3)

### Exercise 3

<span style="color:green; font-size:16px">Find the day that each stock had its largest percentage one-day drop in price.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find the min, max, and date of the min and max for each stock. Return a DataFrame with the stock ticker symbols in the index and the aggregations as column names.</span>