# DataFrame Descriptive Statistic Methods

## Overview
DataFrames have the same [descriptive statistical methods][1] as Series. Again, we distinguish between methods that aggregate and those that do not. A method that performs an aggregation returns a **single** number to summarize a sequence of values. Examples of methods that aggregate are:

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
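* `count` - returns the number of non-missing values
* `describe` - returns most of the above aggregations at once, as a Series when called on a Series and as a DataFrame when called on a DataFrame
* `quantile` - returns the value at a given quantile (percentile) of the distribution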

Any method that does not return a single value is not an aggregation. Some examples of these methods are:
* `abs` - takes the absolute value
* `round` - rounds to the given number of decimal places
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another
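
To make the distinction concrete, here is a minimal sketch on a small Series. The values are made up for illustration and are not taken from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series([3, -1, 4, -1, 5])

total = s.sum()         # aggregation: collapses the Series to one number
running = s.cumsum()    # non-aggregation: one result per original element
magnitudes = s.abs()    # non-aggregation: same length as the original

print(total)
print(running.tolist())
print(len(magnitudes) == len(s))
```

The aggregation gives back a single scalar, while the other two methods return a Series with the same number of elements as the original.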

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

Let's begin by reading in the college dataset and setting the index to be the institution name.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 40)
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head()

## Major Differences between DataFrame and Series Methods
When calling one of the above methods on a DataFrame, it is applied to each individual column by default. For instance, if we call the `sum` method, each column will be summed individually. Calling the `sum` method on a Series produces a single scalar value, while calling it on a DataFrame produces one value for each column.
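
As a quick sketch of this difference, the toy DataFrame below (the column names `a` and `b` are hypothetical) is summed once as a single column and once as a whole.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for this sketch
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

series_total = df['a'].sum()    # Series -> one scalar
column_totals = df.sum()        # DataFrame -> one value per column

print(series_total)
print(column_totals)
```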

### Select numeric columns
Many of the statistical methods above work only for numeric columns. We will begin with a DataFrame of only numeric columns by selecting all the columns that have undergraduate race proportion data. These columns are located together, starting with `ugds_white` and ending with `ugds_unkn`. We select them with the `loc` indexer, which allows us to use slice notation for the columns.

In [None]:
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
college_race.head()

### Take the mean of each column
Let's demonstrate calling the `mean` aggregation method on each column.

In [None]:
college_race.mean()

### Did you notice what type of object was returned?
pandas takes the mean of each column and returns a Series. The new Series has the column names as the index and the mean as the values. Let's see a couple more aggregations:

In [None]:
college_race.max()

In [None]:
college_race.std()

### Potentially confusing orientation
The above results should be fairly easy to understand. If someone asked you what the standard deviation of the `ugds_hisp` column was, you would easily be able to respond with the correct number. What is potentially confusing is the orientation of the result. We began with a DataFrame, aggregated every column, and were returned a Series which displays all of the columns vertically. The orientation of the columns changed. It might have been easier to understand the operation if the columns remained horizontal as in the following diagram.

![](images/df_agg_keep_dim.png)

### DataFrames are collections of columns
It's important to think of DataFrames as a collection of columns as opposed to a collection of rows. The column is the fundamental component of the DataFrame, and it is the column that most methods act on by default, as seen with the aggregation methods above.

## Changing the Direction of the Operation
Since DataFrames are two-dimensional, we might want to complete an operation horizontally across the rows instead of vertically down the columns.

### The `axis` parameter controls the direction of the operation
Nearly all DataFrame methods have an `axis` parameter. This is a crucial parameter to understand. It controls the direction of the operation. By default, operations happen vertically down each column.

### Referencing each axis by number and by label
DataFrames are two-dimensional and therefore have two axes. Both the rows and the columns may be referenced with a number or a string label. The rows are referenced by the number 0 and also by the label 'index'. The columns are referenced by the number 1 and also by the label 'columns'.

### Default value of `axis` is 0
For most DataFrame methods, the default value of the `axis` parameter is 0. Depending on your pandas version, you may see `None` instead of 0 in the method signature, but if you don't explicitly set it, pandas will use 0. You can also refer to it as the string 'index'. Let's take the mean of each column again, but instead use the string 'index' for the value of the `axis` parameter. This is the exact same thing as the default.

In [None]:
college_race.mean(axis='index')

This is the exact same thing as `axis=0`, which again is the default.

In [None]:
college_race.mean(axis=0)

Since operating down each column is the default, it's not necessary to specify it as such and most people do not do so.

### Change the direction of the operation with `axis='columns'`
Let's change the direction of the operation and sum each row by setting the `axis` parameter to 'columns'. The total should equal 1 as each row contains the entire race distribution of a single school.

In [None]:
college_race.sum(axis='columns').head()

You can also use `axis=1`.

In [None]:
college_race.sum(axis=1).head()
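
If you want to check the "total should equal 1" claim programmatically, one option is to compare the row totals to 1 with a tolerance, since the values are floats. The sketch below uses made-up proportions for three hypothetical schools. Using `numpy.isclose` for the comparison is my own choice here, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Hypothetical proportions for three made-up schools
props = pd.DataFrame(
    {
        'group_a': [0.5, 0.1, 0.25],
        'group_b': [0.3, 0.2, 0.25],
        'group_c': [0.2, 0.7, 0.5],
    },
    index=['School X', 'School Y', 'School Z'],
)

# One total per row
row_totals = props.sum(axis='columns')

# Compare with a tolerance rather than ==, since these are floats
all_close_to_one = np.isclose(row_totals, 1.0).all()
print(all_close_to_one)
```

On real data, rows with missing or rounded values may not total exactly 1, which is why a tolerance-based comparison is usually the safer check.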

### I prefer to use `axis='columns'`
I prefer to use `axis='columns'` over `axis=1`. The reason for this is that the string 'columns' is more descriptive than the integer 1. I also use `axis='index'` instead of `axis=0` for the same reason. That said, many people prefer using an integer to refer to the axis so you will come across it often.

### Confusion between string 'index' and 'columns'
It's definitely confusing and difficult to remember which direction the operation is going to happen in. As with the examples above, using the string 'index' sums up each column while using the string 'columns' sums up each row.

![](images/pandas_axes.png)

A little trick that helps me remember is that when using `axis='columns'` the result is going to be the same length as a column in the DataFrame. Let's verify this below.

In [None]:
college_race.shape

In [None]:
len(college_race.sum(axis='columns'))

### Summary of `axis`
* axis 0 - the default axis for most DataFrame methods. My preferred reference label is 'index'. The operations happen vertically, down each column. `df.sum()` finds the sum of each column individually.
* axis 1 - my preferred reference label is 'columns'. The operations happen horizontally, left to right across each row. `df.sum(axis='columns')` sums each row individually.
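
The summary above can be verified on a toy DataFrame (hypothetical columns `a` and `b`). This sketch checks that the string labels and the integers refer to the same axes and shows the length of each result.

```python
import pandas as pd

# Hypothetical toy data with 3 rows and 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down_columns = df.sum(axis='index')     # same as axis=0 and the default
across_rows = df.sum(axis='columns')    # same as axis=1

print(down_columns.equals(df.sum(axis=0)))
print(across_rows.equals(df.sum(axis=1)))

# One value per column versus one value per row
print(len(down_columns), len(across_rows))
```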

## Non-Aggregation DataFrame methods
The non-aggregation DataFrame methods keep the shape of the DataFrame but can change each value. Let's round all the values to two digits.

In [None]:
college_race.round(2).head()

### Some methods don't have an `axis` parameter
Methods such as `round` work independently of the axis and therefore do not have an `axis` parameter. Other methods, such as `cumsum`, do have an `axis` parameter. Let's call `cumsum` in both directions.

In [None]:
college_race.cumsum().head()

In [None]:
college_race.cumsum(axis='columns').head()
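
The two directions are easier to follow on numbers small enough to add in your head. Here is a minimal sketch with a hypothetical two-column DataFrame.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down = df.cumsum()                    # accumulate down each column
across = df.cumsum(axis='columns')    # accumulate left to right across each row

print(down)
print(across)
```

In both cases the result has the same shape as the original DataFrame. Only the direction of accumulation differs.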

### Get Summary Statistics for all columns with the `describe` method
The `describe` method calculates several summary statistics for each column and is a great way to inspect all of your data at once. Notice that a DataFrame is returned with the name of each summary statistic in the **index**.

In [None]:
college_race.describe()

### The `describe` method with non-numeric columns
The `college_race` DataFrame from above contains only numeric columns. If `describe` is called on a DataFrame containing a mix of numeric and non-numeric columns, then summary statistics for just the numeric columns will be returned. The others will be ignored. The original `college` DataFrame contains a mix of data types. Let's call `describe` on it. Notice how the number of columns after calling `describe` decreased.

In [None]:
college.shape

In [None]:
college.describe()

In [None]:
college.describe().shape

### Calling `describe` on non-numeric columns
The `describe` method can work with non-numeric columns. Set the `include` parameter to the data type you would like to summarize, given as a string. Let's see the summary of the object columns. Notice that pandas returns a completely different set of summary statistics that make more sense with strings.

In [None]:
college.describe(include='object')
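
To see exactly which statistics come back for object columns, here is a small sketch. The column names and values are hypothetical, and I explicitly set the `city` column to the object data type so that the example does not depend on how your pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data; 'city' is forced to object dtype for this sketch
df = pd.DataFrame({
    'city': pd.Series(['Austin', 'Boston', 'Austin', 'Denver'], dtype='object'),
    'enrollment': [100, 250, 175, 300],
})

text_summary = df.describe(include='object')

print(text_summary)
print(list(text_summary.index))
```

Only the object column is summarized, and the statistics are the count, the number of unique values, the most frequent value, and how often that value occurs.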

### Transposing a DataFrame with the `T` attribute
Transposing a DataFrame flips the data over its diagonal. The columns and the rows switch places, so the first column becomes the first row. The `.T` attribute transposes the DataFrame. I find this useful after running the `describe` method when there are many columns, as it's easier to read many rows of data than many columns of data.

In [None]:
college.describe().T
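
A quick way to convince yourself of what the transpose does is to compare shapes before and after. This sketch uses a hypothetical two-column DataFrame.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10.0, 20.0, 30.0]})

summary = df.describe()    # statistics in the index, columns across the top
transposed = summary.T     # columns in the index, statistics across the top

print(summary.shape)
print(transposed.shape)
print(list(transposed.index))
```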

## Nuisance Columns
Above, we called common statistical methods from the `college_race` DataFrame, which was composed of only numeric columns. It is possible to call these same methods from DataFrames composed of any combination of data types.

### Dropping columns that don't work with the method
pandas allows you to call these statistical methods on DataFrames that contain columns with data types that don't work for that particular method. The entire `college` DataFrame contains string and numeric columns. Taking the mean of a string column does not work. In pandas versions before 2.0, instead of raising an error, pandas will **silently** drop this column. From pandas 2.0 onward, it raises a `TypeError` unless you pass `numeric_only=True`. These DataFrame columns that don't compute with certain methods are sometimes referred to as **nuisance columns** in the documentation.

Let's read in the `bikes` DataFrame, which has a larger number of string columns and some datetime columns as well. We will work with just the first 100 rows of this dataset which we can choose when reading in the data with the `nrows` parameter.

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'], nrows=100)
bikes.shape

In [None]:
bikes.head()

Taking the mean of this DataFrame will drop all the non-numeric columns. Only boolean, integer, and float columns will work.

In [None]:
bikes.mean()

### The `round` method
The `round` method works in a slightly different manner. Instead of dropping the non-numeric columns, it keeps them in the resulting DataFrame as they were. This is quite nice, as we can use `round` on a DataFrame with both string and numeric columns without error and without dropping columns.

In [None]:
bikes.round(-1).head()
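
Here is a minimal sketch of that behavior on a hypothetical DataFrame with one string column and one float column. The names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one string column and one float column
df = pd.DataFrame({
    'name': pd.Series(['x', 'y'], dtype='object'),
    'value': [1.26, 2.74],
})

rounded = df.round(1)    # only the numeric column is rounded

print(rounded)
print(list(rounded.columns))
```

The string column passes through untouched, and the column order is preserved.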

### Many methods do work with non-numeric data types
Many of the aggregation methods do work with string and datetime columns. Let's find the max of all the `bikes` columns.

In [None]:
bikes.max()

The `sum` method is valid for string (but not datetime) columns and concatenates all the values together to produce one long string. This usually isn't something you'd like to do. It's also a computationally expensive operation. The following call to `sum` took 8 seconds on the full dataset (50k rows) on my machine.

In [None]:
bikes.sum()

### Use `numeric_only=True`
The `sum` method, as well as all the other aggregation methods, provides the boolean parameter `numeric_only`. Its default was `None` in pandas versions before 2.0 and is `False` from 2.0 onward. By setting it to `True`, pandas will only apply the method to boolean, integer, and float columns. The following operation took only 7 ms on the full dataset on my machine, more than 1,000 times faster than the previous one.

In [None]:
bikes.sum(numeric_only=True)
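
The effect of `numeric_only=True` is easy to see on a tiny example. The sketch below uses a hypothetical DataFrame (the column names are not from the `bikes` dataset) with one string column and two numeric columns.

```python
import pandas as pd

# Hypothetical data: one string column and two numeric columns
df = pd.DataFrame({
    'station': pd.Series(['A', 'B', 'C'], dtype='object'),
    'trips': [3, 5, 2],
    'temperature': [60.5, 71.0, 65.5],
})

# The string column is excluded before any summing takes place
numeric_totals = df.sum(numeric_only=True)

print(numeric_totals)
```

Passing `numeric_only=True` explicitly also makes the intent clear to anyone reading the code, regardless of which pandas version they are running.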

### The slow `mean` method
In pandas versions before 2.0, the `mean` method is also extremely slow on DataFrames with string columns, even though it only works on numeric columns. This is because pandas takes the `sum` of all the columns first and then divides by the length. The reason pandas doesn't just skip over string columns is that they are technically object columns, and an object column can hold any data type. So, the only way for pandas to decide whether or not the `mean` will work on an object column is to actually sum up every value first and then attempt to divide by the length. If that fails, it skips the column. The issue is that this is extremely slow for string columns, since strings can be summed. pandas only fails after the string column has been concatenated together, when it attempts to divide by the length. If you want to take the `mean` of a DataFrame with string columns, make sure you set `numeric_only` to `True`.

In [None]:
bikes.head()

In [None]:
bikes.mean()

Even on this small dataset of 100 rows, there is a substantial performance difference.

In [None]:
%timeit -n 5 bikes.mean()

In [None]:
%timeit -n 5 bikes.mean(numeric_only=True)

## Exercises

Execute the following cell before attempting the exercises.

In [None]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
movie = pd.read_csv('../data/movie.csv', index_col='title')

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie.</span>

### Exercise 3
<span  style="color:green; font-size:16px">What percentage of movies have more than 10,000 total actor FB likes?</span>

### Exercise 4
<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or fewer total actor FB likes.</span>

### Exercise 5
<span  style="color:green; font-size:16px">From Exercise 4, it appears that movies with more than 10,000 total actor FB likes gross 2.5 times as much. This may be due to the fact that newer movies have more actors that are recognized by FB users. Find the median year produced for both groups.</span>

### Exercise 6
<span  style="color:green; font-size:16px">For the movies made in the year 2016, what is the median of the total actor FB likes?</span>

### Exercise 7
<span  style="color:green; font-size:16px">Write a function that has a single parameter, `year`. Have it return the median of the total actor FB likes for the given year. Test your function with the year 2016 and verify the result with Exercise 6.</span>

### Exercise 8
<span  style="color:green; font-size:16px">Write a loop to print out the year and the median total actor FB likes for that year, for each year from 1990 to 2016.</span>

### Exercise 9
<span  style="color:green; font-size:16px">Using the **college** dataset, find the number of non-missing values in each column and again for each row.</span>

### Exercise 10
<span  style="color:green; font-size:16px">What is the average number of missing values for each row?</span>

### Exercise 11
<span  style="color:green; font-size:16px">The `ugds` column of the college dataset contains the total undergraduate population. What is the least number of colleges it would take to have a total of more than 5 million students?</span>