# 6. DataFrame Descriptive Statistic Methods
DataFrames have the same [descriptive statistical methods][1] as Series. Again, we distinguish between methods that aggregate and those that do not.

A method that performs an aggregation returns a **single** number to represent the description. Examples of methods that aggregate are:
* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-na values
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns given percentile of distribution

Any other method that does not return a single value is not an aggregation. Some examples of these methods are:
* `abs` - takes absolute value
* `round` - round to the nearest given decimal
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats
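
To make the distinction concrete, here is a minimal sketch on a small Series. The values are made up for illustration and do not come from any dataset used in this notebook. The aggregation collapses the Series to one scalar, while the non-aggregations return one value per input value.

```python
import pandas as pd

# Hypothetical toy data, not taken from the college dataset
s = pd.Series([3, -1, 4, -1, 5])

total = s.sum()         # aggregation: collapses the Series to one scalar
running = s.cumsum()    # non-aggregation: one output value per input value
absolute = s.abs()      # non-aggregation: same length as the original

print(total)
print(running.tolist())
print(absolute.tolist())
```

The result of `cumsum` and `abs` has the same length as `s`, which is a quick way to tell a non-aggregation from an aggregation.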

In [None]:
import pandas as pd
pd.options.display.max_columns = 50
college = pd.read_csv('../data/college.csv')
college.head()

# Major Differences between DataFrame and Series Methods
When calling one of the above methods on a DataFrame, it is applied to each individual column by default. For instance, if we call the **`sum`** method, each column will be summed individually. Calling the **`sum`** method on a Series produces a single scalar value, while a DataFrame produces a sum for each column.

### Select numeric columns
Many of the statistical methods above work only on numeric columns. We will select all the columns that have undergraduate race proportion data. These columns are located together, starting at **`ugds_white`** and ending at **`ugds_unkn`**.

In [None]:
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
college_race.head()

### Take the mean of each column
Let's demonstrate calling the **`mean`** aggregation method on each column.

In [None]:
college_race.mean()

## Did you notice what type of object was returned?
Pandas takes the mean of each column and returns a Series. The new Series has the column names as the index and the mean as the values.
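
Here is a small self-contained sketch of the same idea. The numbers are hypothetical stand-ins for a few rows of proportions, and the column `ugds_other` is an invented name used only for illustration.

```python
import pandas as pd

# Hypothetical proportions; 'ugds_other' is an invented column name
df = pd.DataFrame({
    'ugds_white': [0.50, 0.25, 0.75],
    'ugds_other': [0.25, 0.50, 0.125],
    'ugds_unkn': [0.25, 0.25, 0.125],
})

col_means = df.mean()   # one mean per column

print(type(col_means))
print(col_means.index.tolist())
```

The index of the returned Series matches the column names of the DataFrame, so you can look up a single column's mean with `col_means['ugds_white']`.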

Let's see a couple more aggregations:

In [None]:
college_race.max()

In [None]:
college_race.std()

## Changing the Direction of the Operation
Since DataFrames are two-dimensional, you might be interested in performing an operation across the rows, such as summing each row.

## The `axis` parameter controls the direction of the operation.
Nearly all DataFrame methods have an **`axis`** parameter. This is a crucial parameter to understand, as it controls the direction of the operation. By default, operations happen down each column.

## Referencing each axis by number and by label
DataFrames are two-dimensional and therefore have two axes. The rows are referenced by the number 0 and also by the label 'index'. The columns are referenced by the number 1 and also by the label 'columns'.
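
As a quick check that the number and the label refer to the same axis, here is a sketch on a tiny made-up DataFrame (the column names `a` and `b` are placeholders).

```python
import pandas as pd

# Hypothetical toy frame with placeholder column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down_by_number = df.sum(axis=0)
down_by_label = df.sum(axis='index')
across_by_number = df.sum(axis=1)
across_by_label = df.sum(axis='columns')

# Each pair should be identical
print(down_by_number.equals(down_by_label))
print(across_by_number.equals(across_by_label))
```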

## Default value of `axis` is 0
The default value for the **`axis`** parameter is 0. You can also refer to it as 'index'. Let's take the mean again for each column, but instead use the string 'index' for the value of the **`axis`** parameter.

In [None]:
college_race.mean(axis='index')

This is the exact same thing as **`axis=0`**, which is the default:

In [None]:
college_race.mean(axis=0)

## Use `axis='columns'`
Let's sum each row by changing the direction of the operation, setting the **`axis`** parameter equal to **`'columns'`**. Each total should equal 1, as each row contains the complete race distribution of a single school.

In [None]:
college_race.sum(axis='columns').head()

You can also use **`axis=1`**

In [None]:
college_race.sum(axis=1).head()

## I always use `axis='columns'`
I always use **`axis='columns'`** and never **`axis=1`**. The reason for this is that the string 'columns' is much more descriptive than the integer 1. I also always use **`axis='index'`** instead of **`axis=0`** for the same reason.

## Confusion between string 'index' and 'columns'
It's definitely confusing and difficult to remember in which direction the operation is going to happen. A little trick that helps me remember is that when using **`axis='columns'`**, the result is going to be the same length as a **column** in the DataFrame.

![][1]

[1]: images/df_axis.jpg

In [None]:
college_race.shape

In [1]:
len(college_race.sum(axis=1))


### Summary of `axis`
* axis 0 - default axis for all DataFrame methods. Its preferred reference label is 'index'. The operations happen vertically, up and down columns. **`df.sum()`** finds the sum of each column individually.
* axis 1 - Its preferred reference label is 'columns'. The operations happen horizontally, left to right. **`df.sum(axis='columns')`** sums each row individually.
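
The length trick from above can be checked directly against the shape of the DataFrame. This is a sketch on a made-up 4-row, 2-column frame, not on the college data.

```python
import pandas as pd

# Hypothetical 4-row, 2-column frame
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
n_rows, n_cols = df.shape

per_column = df.sum(axis='index')    # one value per column
per_row = df.sum(axis='columns')     # one value per row

print(len(per_column) == n_cols)
print(len(per_row) == n_rows)
```

With **`axis='columns'`** the result is as long as a column (one entry per row), and with **`axis='index'`** it has one entry per column.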

# Non-Aggregation DataFrame methods
The non-aggregation DataFrame methods keep the shape of the DataFrame but can change each value. Let's round all the values to two digits.

In [None]:
college_race.round(2).head()

## Some of the methods don't have an `axis` parameter
Methods such as **`round`** work independently of the axis and therefore do not have an **`axis`** parameter. Other methods however, such as **`cumsum`**, do have an **`axis`** parameter.

Let's call **`cumsum`** in both directions.

In [None]:
college_race.cumsum().head()

In [None]:
college_race.cumsum(axis='columns').head()
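
The output on the college data is wide, so the two directions can be easier to follow on a tiny example. This sketch uses a made-up frame with placeholder column names.

```python
import pandas as pd

# Hypothetical toy frame with placeholder column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down = df.cumsum()                   # running total down each column
across = df.cumsum(axis='columns')   # running total left to right within each row

print(down)
print(across)
```

In both cases the result keeps the shape of the original DataFrame, since `cumsum` is not an aggregation.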

## Get Summary Statistics for all columns with the `describe` method
The describe method calculates several summary statistics for each column and is a great way to inspect all of your data at once. Notice that a DataFrame is returned with the name of each summary statistic in the **index**.

In [None]:
college_race.describe()

### The `describe` method with non-numeric columns
The **`college_race`** DataFrame from above contains only numeric columns. If **`describe`** is called on a DataFrame containing a mix of numeric and non-numeric columns, then summary statistics for just the numeric columns will be returned. The others will be ignored.

The original **`college`** DataFrame contains a mix of data types. Let's use describe on it. Notice how the number of columns after calling **`describe`** decreased.

In [None]:
college.shape

In [None]:
college.describe()

In [None]:
college.describe().shape

### Calling `describe` on non-numeric columns
The **`describe`** method can also work with non-numeric columns. Pass the **`include`** parameter a string naming the data type you would like to summarize (or a list of such strings for several types). Let's see the summary of the string (`object`) columns from the **`bikes`** DataFrame. Notice that the summary statistics are very different.

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.describe(include='object')
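
To see how the two kinds of summary differ without loading a file, here is a sketch on a small mixed-type DataFrame. The column names and values are hypothetical and are not taken from `bikes.csv`. The string column is explicitly given the `object` dtype so that `include='object'` selects it.

```python
import pandas as pd

# Hypothetical mixed-type data; 'gender' is forced to object dtype
df = pd.DataFrame({
    'gender': pd.Series(['Male', 'Female', 'Male'], dtype=object),
    'tripduration': [993, 623, 1040],
})

numeric_summary = df.describe()                  # numeric columns only
object_summary = df.describe(include='object')   # object columns only

print(numeric_summary.columns.tolist())
print(object_summary.index.tolist())
```

For object columns, the summary reports the count, the number of unique values, the most frequent value, and its frequency, instead of the mean and quartiles.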

## Transposing a DataFrame with the `T` attribute
Transposing a DataFrame 'turns' the data 90 degrees. The columns and the rows switch places. The first column is now the first row, etc...

The **`.T`** attribute transposes the DataFrame. I find this useful after running the **`describe`** method with long output.

In [None]:
college.describe().T
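
A quick way to confirm what the transpose did is to compare shapes before and after. This sketch uses a made-up two-column frame.

```python
import pandas as pd

# Hypothetical toy frame with placeholder column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

summary = df.describe()    # statistics in the index, columns across the top
flipped = summary.T        # columns in the index, statistics across the top

print(summary.shape)
print(flipped.shape)
```

After transposing, each original column becomes a row, which tends to be easier to read when the DataFrame has many columns.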

# Exercises

In [None]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')
college_race = college.loc[:, 'ugds_white':'ugds_unkn']
movie = pd.read_csv('../data/movie.csv', index_col='title')

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

In [1]:
# your code here
import pandas as pd

In [2]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1
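
The cells above only load the data, so here is a sketch of the remaining step for Problem 1. It is run on just the five rows displayed by `head`, which means the numbers are not the answer for the full dataset. On the real data, the same two calls would be applied to `movie[['actor1_fb', 'actor2_fb', 'actor3_fb']]`.

```python
import pandas as pd

# Only the five rows displayed by head(), not the full movie dataset
head_rows = pd.DataFrame({
    'actor1_fb': [1000.0, 40000.0, 11000.0, 27000.0, 131.0],
    'actor2_fb': [936.0, 5000.0, 393.0, 23000.0, 12.0],
    'actor3_fb': [855.0, 1000.0, 161.0, 23000.0, float('nan')],
})

actor_means = head_rows.mean()      # mean of each column, skipping missing values
print(actor_means)
print(actor_means.idxmax())         # label of the column with the highest mean
```

The `idxmax` method returns the index label of the largest value, which here is the name of the column with the highest mean.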


### Problem 2
<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie</span>

In [6]:
# your code here
cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
movie_fb = movie[cols]
movie_fb.head()

title,actor1_fb,actor2_fb,actor3_fb
Avatar,1000.0,936.0,855.0
Pirates of the Caribbean: At World's End,40000.0,5000.0,1000.0
Spectre,11000.0,393.0,161.0
The Dark Knight Rises,27000.0,23000.0,23000.0
Star Wars: Episode VII - The Force Awakens,131.0,12.0,


In [10]:
movie_fb_total = movie_fb.sum(axis='columns')
movie_fb_total.head()

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">What percentage of movies have more than 10,000 total actor FB likes?</span>

In [16]:
# your code here
(movie_fb_total > 10000).mean()


0.2982099267697315

### Problem 4
<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or fewer total actor FB likes.</span>

In [17]:
# your code here
filt = movie_fb_total > 10000
movie_gross = movie['gross']
movie_gross.head()

title
Avatar                                        760505847.0
Pirates of the Caribbean: At World's End      309404152.0
Spectre                                       200074175.0
The Dark Knight Rises                         448130642.0
Star Wars: Episode VII - The Force Awakens            NaN
Name: gross, dtype: float64

In [19]:
movie_gross[filt].median()

42391915.5

In [20]:
movie_gross[~filt].median()

16815752.5

### Problem 5
<span  style="color:green; font-size:16px">From problem 4, it appears that movies with more than 10,000 total actor FB likes gross 2.5 times as much. This may be due to the fact that newer movies have more actors that are recognized by FB users. Find the median year produced for both groups.</span>

In [23]:
# your code here
movie_year = movie['year']
movie_year[filt].median()

2006.0

In [24]:
movie_year = movie['year']
movie_year[~filt].median()

2005.0

### Problem 6
<span  style="color:green; font-size:16px">For the movies made in the year 2016, what is the median of the total actor FB likes?</span>

In [28]:
# your code here
filt = movie_year == 2016

movie_2016 = movie[filt]
movie_2016.head(3)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Batman v Superman: Dawn of Justice,2016.0,Color,PG-13,183.0,Zack Snyder,0.0,Henry Cavill,15000.0,Lauren Cohan,4000.0,...,2000.0,330249062.0,Action|Adventure|Sci-Fi,673.0,371639,based on comic book|batman|sequel to a reboot|...,English,USA,250000000.0,6.9
Captain America: Civil War,2016.0,Color,PG-13,147.0,Anthony Russo,94.0,Robert Downey Jr.,21000.0,Scarlett Johansson,19000.0,...,11000.0,407197282.0,Action|Adventure|Sci-Fi,516.0,272670,based on comic book|knife|marvel cinematic uni...,English,USA,250000000.0,8.2
Star Trek Beyond,2016.0,Color,PG-13,122.0,Justin Lin,681.0,Sofia Boutella,998.0,Melissa Roxburgh,119.0,...,105.0,130468626.0,Action|Adventure|Sci-Fi|Thriller,322.0,53607,hatred|sequel|space opera|star trek|third part,English,USA,185000000.0,7.5


In [33]:
cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
movie_2016_3actor = movie_2016[cols]
movie_2016_3actor.sum(axis='columns').median()

3571.5

### Problem 7
<span  style="color:green; font-size:16px">Write a function that has a single parameter, `year`. Have it return the median of the total actor FB likes for the given year. Test your function with the year 2016 and verify the result with problem 6.</span>

In [35]:
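# your code here
def find_median_movie_fb(year):
    # Columns holding the Facebook likes of the three listed actors
    cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
    # Keep only the movies made in the given year
    movie_year = movie[movie['year'] == year]
    # Total the likes across each row, then take the median of those totals
    return movie_year[cols].sum(axis='columns').median()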

In [36]:
find_median_movie_fb(2016)

3571.5

### Problem 8
<span  style="color:green; font-size:16px">Write a loop to print out the year and median total actor FB likes for that year from 1990 to 2016</span>

In [45]:
for year in range(1990, 2017):
    print(year, find_median_movie_fb(year))

1990 2017.0
1991 2436.0
1992 2147.5
1993 2018.0
1994 2368.5
1995 2612.0
1996 2692.5
1997 1964.0
1998 2482.0
1999 2595.0
2000 2378.0
2001 2424.0
2002 2146.0
2003 2019.0
2004 2298.0
2005 2072.0
2006 2359.0
2007 2002.5
2008 2400.0
2009 2145.0
2010 2411.0
2011 2818.5
2012 2426.0
2013 2420.0
2014 2084.0
2015 2063.0
2016 3571.5


### Problem 9
<span  style="color:green; font-size:16px">Using the **college** dataset, find the number of non-missing values in each column and again for each row.</span>

In [None]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">What is the average number of missing values for each row?</span>

In [None]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">The `UGDS` column of the college dataset contains the total undergraduate population. What is the fewest number of colleges it would take to have a total of more than 5 million students?</span>

In [None]:
# your code here