# Numeric DataFrame Methods

In this chapter, we cover [statistical methods][1] for DataFrames with mainly numeric columns. These methods are nearly identical to those available to a Series. Again, we distinguish between methods that aggregate and those that do not. A method that performs an aggregation returns a **single** number to summarize the values. Any method that does not return a single value is not an aggregation. We begin by reading in the San Francisco employee compensation dataset.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

In [18]:
import pandas as pd
sf_emp = pd.read_csv('../data/sf_employee_compensation.csv')
sf_emp.head(3)

Unnamed: 0,year,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,Personnel Technician,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,2013,General Administration & Finance,Planner 2,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,2013,Public Protection,Firefighter,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2


## Aggregation methods

The following are some common aggregation methods available for DataFrames:

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-missing values
* `describe` - returns most of the above aggregations in one DataFrame
* `quantile` - returns the given percentile of the distribution
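
Two of the methods in this list, `count` and `quantile`, are easy to misread from their one-line descriptions. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not part of the San Francisco dataset). It shows that `count` skips missing values and that `quantile` expects a number between 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column 'b'
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, np.nan, 30.0, 40.0]})

# count ignores missing values, so 'b' has one fewer than 'a'
non_missing = toy.count()

# quantile takes a number between 0 and 1; 0.5 is the median
middle = toy.quantile(0.5)

print(non_missing)
print(middle)
```

Here the 0.5 quantile matches what `toy.median()` would return, since both skip the missing value.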

### Differences between DataFrame and Series methods

When calling an aggregation method on a DataFrame, it is applied to each individual column by default. For instance, calling the `sum` method sums each column individually. A single value is returned for each column. Calling the `sum` method on a Series produces a single scalar value.
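
As a small sketch of this difference, assuming a made-up two-column DataFrame named `toy` (not the chapter's dataset), the same method name returns a Series when called on the DataFrame and a scalar when called on one of its columns:

```python
import pandas as pd

# Hypothetical data used only for illustration
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

col_sums = toy.sum()      # one value per column, returned as a Series
x_sum = toy['x'].sum()    # a single scalar for one column

print(type(col_sums).__name__)
print(col_sums['x'], x_sum)
```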

### Select numeric columns

Some of the statistical methods above work only with numeric columns. To call these methods successfully, we'll select only the columns with compensation information.

In [20]:
comp = sf_emp.loc[:, 'salaries':]
comp.head(3)

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2


### Take the mean of each column

Let's demonstrate taking the mean of each column by calling the `mean` aggregation method.

In [21]:
comp.mean()

salaries             53715.441133
overtime              4201.272687
other salaries        2816.296542
retirement           10484.755614
health and dental     9382.390735
other benefits        4053.381941
dtype: float64

### Did you notice what type of object was returned?

pandas takes the mean of each column and returns a Series. The new Series uses the old column names as the index and the calculated mean as the values. Let's call a couple of different aggregation methods.

In [22]:
comp.max()

salaries             645739.46
overtime             258124.17
other salaries       239294.57
retirement           120791.40
health and dental     36369.96
other benefits        37563.46
dtype: float64

In [23]:
comp.std()

salaries             47686.502923
overtime             11601.573498
other salaries        6637.820066
retirement            9922.598455
health and dental     7379.199008
other benefits        4171.974274
dtype: float64

### Potentially confusing orientation

The above results should be fairly easy to understand. If someone asked you what the standard deviation of the `overtime` column was, you would easily be able to respond with the correct number. What may be potentially confusing is the orientation of the result. We began with a DataFrame and were returned a Series, which is visually displayed in the notebook as a vertical sequence of values. The orientation of the columns changed. It might have been easier to understand the operation if the columns remained horizontal as in the following image.

![DataFrame aggregation keeping the columns horizontal](images/df_agg_keep_dim.png)

### DataFrames are collections of columns

It's good to think of DataFrames as a collection of columns as opposed to a collection of rows. It is the column that is the fundamental component of the DataFrame. Each column has a data type and all values in that column are the same data type. It is the column that is acted on by default by most of the methods as demonstrated with the aggregations above. 
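
A minimal sketch of this column-first view, using a made-up DataFrame (the name `toy` and its columns are assumptions for illustration): each column reports its own data type, and selecting a single column gives back a Series.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns
toy = pd.DataFrame({'name': ['Ann', 'Bob'],
                    'hours': [40, 35],
                    'rate': [20.5, 18.0]})

# Each column carries its own data type
print(toy.dtypes)

# Selecting one column gives a Series with a single dtype
hours = toy['hours']
print(type(hours).__name__, hours.dtype)
```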

## Changing the direction of the operation

Since DataFrames are two-dimensional, we might want to complete an operation horizontally across the rows instead of vertically down the columns.

### The `axis` parameter controls the direction of the operation

Most DataFrame methods have an `axis` parameter. This is a crucial parameter to understand as it controls the direction of the operation. By default, operations take place vertically down each column.

### Each axis may be referenced by number or string label

DataFrames are two-dimensional and therefore have two axes. Both the rows and the columns may be referenced with either a number or a string label. The rows are referenced by the number 0 and also by the label `'index'`. The columns are referenced by the number 1 and also by the label `'columns'`.

### Default value of `axis` is 0

For most DataFrame methods, the default value of the `axis` parameter is 0. Depending on your pandas version, you may see `None` in the method signature, but if you don't explicitly set it, pandas will use 0. You can also refer to it with the string `'index'`. Let's take the mean of each column again, but use the string `'index'` for the value of the `axis` parameter. This produces the exact same result as calling it with the defaults.

[1]: images/df_agg_keep_dim.png

In [24]:
comp.mean()

salaries             53715.441133
overtime              4201.272687
other salaries        2816.296542
retirement           10484.755614
health and dental     9382.390735
other benefits        4053.381941
dtype: float64

In [25]:
comp.mean(axis='index')

salaries             53715.441133
overtime              4201.272687
other salaries        2816.296542
retirement           10484.755614
health and dental     9382.390735
other benefits        4053.381941
dtype: float64

We could have set `axis` to 0, which also returns the same result.

In [26]:
comp.mean(axis=0)

salaries             53715.441133
overtime              4201.272687
other salaries        2816.296542
retirement           10484.755614
health and dental     9382.390735
other benefits        4053.381941
dtype: float64

Since the default behavior is to act vertically, it's not necessary to specify the axis parameter as such, and most people do not do so when calculating aggregations on each column. I recommend calling aggregation methods that act vertically without using the `axis` parameter.

### Change the direction of the operation with `axis='columns'`

Let's change the direction of the operation and sum each row by setting the `axis` parameter to the string `'columns'`. This gives us the total compensation for each employee.

In [27]:
total_emp_com = comp.sum(axis='columns')
total_emp_com.head(10)

0    104242.87
1     96627.18
2    239777.06
3     46485.63
4     45680.15
5     14095.03
6     86078.82
7    112333.62
8    206498.36
9     66604.32
dtype: float64

A Series is returned with the same length as the DataFrame. Let's verify this is the case.

In [28]:
len(comp), len(total_emp_com)

(50000, 50000)

Instead of using the string 'columns', you can set `axis` to 1 to achieve the same result.

In [29]:
comp.sum(axis=1).head(10)

0    104242.87
1     96627.18
2    239777.06
3     46485.63
4     45680.15
5     14095.03
6     86078.82
7    112333.62
8    206498.36
9     66604.32
dtype: float64

### Use either `axis='columns'` or `axis=1`

You are free to use either `axis='columns'` or `axis=1` as they both accomplish the same exact task.

### Difficult to remember

It's definitely confusing and difficult to remember which direction the operation is going to happen. As with the examples above, using 'index' or 0 sums up each column while using 'columns' or 1 sums up each row.

![Explanation of the DataFrame axes](images/df_axes_explanation.png)

A little trick that helps me remember is that when setting `axis='columns'` the result is going to be the same length as a column in the DataFrame.

### Summary of the `axis` parameter

* **axis 0**
    * Default axis for most DataFrame methods
    * Also referenced by the string 'index'
    * Operations happen vertically, up and down the columns
    * Example - `df.sum()` computes the sum of each column individually
* **axis 1**
    * Also referenced by the string 'columns'
    * Operations happen horizontally, left to right across each row
    * Example - `df.sum(axis='columns')` computes the sum of each row individually
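
The summary above can be checked on a tiny example. This is a sketch on a made-up DataFrame named `toy` (an assumption for illustration); it compares the length of each result with the shape of the DataFrame.

```python
import pandas as pd

# Hypothetical data: 3 rows and 2 columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down_columns = toy.sum()                # same as axis=0 or axis='index'
across_rows = toy.sum(axis='columns')   # same as axis=1

# One value per column versus one value per row
print(len(down_columns), toy.shape[1])
print(len(across_rows), toy.shape[0])
```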

## Non-Aggregation methods

The non-aggregation DataFrame methods do not return a single value for each column, and instead return a DataFrame that usually has the same shape as the original. Here are some common non-aggregation methods.

* `abs` - takes absolute value
* `round` - round to the nearest given decimal place
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
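
Three of these methods, `abs`, `cummin`, and `cummax`, can be sketched on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration). Each call returns a DataFrame with the same shape as the original.

```python
import pandas as pd

# Hypothetical data containing negative values
toy = pd.DataFrame({'a': [3, -1, 2], 'b': [-5, 4, 6]})

absolute = toy.abs()        # absolute value of every element
running_min = toy.cummin()  # running minimum down each column
running_max = toy.cummax()  # running maximum down each column

print(absolute)
print(running_min)
print(running_max)
```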

Let's use the `round` method to round each column to the nearest thousand. Remember that a negative number of decimals rounds to the left of the decimal point.

In [30]:
comp.round(-3).head(3)

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,71000.0,0.0,0.0,14000.0,13000.0,6000.0
1,68000.0,0.0,0.0,13000.0,10000.0,6000.0
2,117000.0,60000.0,19000.0,25000.0,16000.0,3000.0


You can use the `round` method on DataFrames that contain non-numeric data. pandas will intelligently ignore the columns where rounding is not possible.

In [31]:
sf_emp.round(-3).head(3)

Unnamed: 0,year,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2000,Public Protection,Personnel Technician,71000.0,0.0,0.0,14000.0,13000.0,6000.0
1,2000,General Administration & Finance,Planner 2,68000.0,0.0,0.0,13000.0,10000.0,6000.0
2,2000,Public Protection,Firefighter,117000.0,60000.0,19000.0,25000.0,16000.0,3000.0


All numeric columns from above were rounded to the nearest thousand including the year. In many cases, you'll want to round different columns to different decimal places. You can do so by providing the `round` method a dictionary mapping the column name to the decimal place. Below, we round only the salaries and retirement columns.

In [32]:
sf_emp.round({'salaries': -3, 'retirement': -1}).head()

Unnamed: 0,year,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,Personnel Technician,71000.0,0.0,0.0,14040.0,12918.24,5872.04
1,2013,General Administration & Finance,Planner 2,68000.0,0.0,0.0,13030.0,10047.52,5608.37
2,2013,Public Protection,Firefighter,117000.0,59975.43,19037.3,24800.0,15788.97,3222.2
3,2013,Community Health,IT Operations Support Admn III,32000.0,0.0,0.0,6790.0,5262.99,2574.91
4,2013,Community Health,Special Nurse,30000.0,0.0,5898.73,960.0,0.0,9230.03


### Some methods don't have an `axis` parameter

Methods such as `round` work independently of the axis and therefore do not have an `axis` parameter. Other non-aggregation methods such as `cumsum` do have an `axis` parameter. Called with the defaults (`axis=0`), the `cumsum` method computes the cumulative sum of each column individually.

In [34]:
comp.cumsum().head(3)

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,139355.07,0.0,0.0,27068.81,22965.76,11480.41
2,256311.79,59975.43,19037.3,51865.25,38754.73,14702.61


Changing the direction of the operation, the `cumsum` method calculates the cumulative sum of each row individually.

In [35]:
comp.cumsum(axis='columns').head(3)

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,71414.01,71414.01,71414.01,85452.59,98370.83,104242.87
1,67941.06,67941.06,67941.06,80971.29,91018.81,96627.18
2,116956.72,176932.15,195969.45,220765.89,236554.86,239777.06


The values in the last column of the above DataFrame are equal to the sum of the entire row.

In [36]:
comp.sum(axis=1).head(3)

0    104242.87
1     96627.18
2    239777.06
dtype: float64

### Summary statistics for all columns with the `describe` method

The describe method calculates several summary statistics for each column and is a nice way to inspect all of your data at once. Notice that a DataFrame is returned with the name of each summary statistic in the index. By default, it returns the 25th, 50th, and 75th percentiles. You can customize these by passing in a list of numbers between 0 and 1 to the `percentiles` parameter.

In [37]:
comp.describe(percentiles=[.1, .4, .5, .99])

Unnamed: 0,salaries,overtime,other salaries,retirement,health and dental,other benefits
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,53715.441133,4201.272687,2816.296542,10484.755614,9382.390735,4053.381941
std,47686.502923,11601.573498,6637.820066,9922.598455,7379.199008,4171.974274
min,-2984.52,-18458.15,-604.85,-13692.12,-287.37,-8584.14
10%,0.0,0.0,0.0,0.0,0.0,0.0
40%,32216.216,0.0,0.0,5630.182,7566.94,1994.89
50%,52181.955,0.0,164.77,10427.54,11416.36,3107.04
99%,186236.458,57417.594,25787.0027,37884.39,29507.92,18707.4025
max,645739.46,258124.17,239294.57,120791.4,36369.96,37563.46


### The `describe` method with non-numeric columns

The `comp` DataFrame from above contains only numeric columns. If `describe` is called on a DataFrame containing a mix of numeric and non-numeric columns, then summary statistics for just the numeric columns will be returned. The others will be ignored. The original `sf_emp` DataFrame contains a mix of data types. Let's call `describe` on it. Notice how the number of columns after calling `describe` decreased from 9 to 7.

In [38]:
sf_emp.shape

(50000, 9)

In [39]:
sf_emp.describe()

Unnamed: 0,year,salaries,overtime,other salaries,retirement,health and dental,other benefits
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,2016.53176,53715.441133,4201.272687,2816.296542,10484.755614,9382.390735,4053.381941
std,1.877153,47686.502923,11601.573498,6637.820066,9922.598455,7379.199008,4171.974274
min,2013.0,-2984.52,-18458.15,-604.85,-13692.12,-287.37,-8584.14
25%,2015.0,5281.285,0.0,0.0,0.0,2106.925,398.465
50%,2017.0,52181.955,0.0,164.77,10427.54,11416.36,3107.04
75%,2018.0,85455.0025,1907.26,2727.59,17227.3125,13371.03,6433.7825
max,2019.0,645739.46,258124.17,239294.57,120791.4,36369.96,37563.46


In [23]:
sf_emp.describe().shape

(8, 7)


### Calling `describe` on non-numeric columns

The `describe` method can work with non-numeric columns, but you'll need to set the `include` parameter to a string of the data type you would like to use. Below, a summary of the object (string) columns is produced. Notice that pandas returns a completely different set of summary statistics that make more sense with strings.

In [40]:
sf_emp.describe(include='object')

Unnamed: 0,organization group,job
count,50000,50000
unique,7,1140
top,"Public Works, Transportation & Commerce",Transit Operator
freq,12751,3105
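
The two kinds of summary can be compared side by side on a small example. This is a sketch on a made-up DataFrame (the name `toy` and its values are assumptions); the text column is created with `dtype=object` explicitly so that `include='object'` selects it regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data: one object (string) column and one numeric column
toy = pd.DataFrame({
    'dept': pd.Series(['fire', 'police', 'fire'], dtype=object),
    'pay': [100.0, 90.0, 110.0],
})

text_summary = toy.describe(include='object')  # count, unique, top, freq
number_summary = toy.describe()                # numeric statistics only

print(text_summary)
print(number_summary)
```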


### Transposing a DataFrame with the `T` attribute

Transposing a DataFrame 'rotates' the data 90 degrees. The columns and the rows switch places. The first column is now the first row. The `.T` attribute transposes the DataFrame. I find this useful after running the `describe` method when there are many columns, as it's easier to read many rows of data as opposed to many columns of data.

In [25]:
sf_emp.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,50000.0,2016.53176,1.877153,2013.0,2015.0,2017.0,2018.0,2019.0
salaries,50000.0,53715.441133,47686.502923,-2984.52,5281.285,52181.955,85455.0025,645739.46
overtime,50000.0,4201.272687,11601.573498,-18458.15,0.0,0.0,1907.26,258124.17
other salaries,50000.0,2816.296542,6637.820066,-604.85,0.0,164.77,2727.59,239294.57
retirement,50000.0,10484.755614,9922.598455,-13692.12,0.0,10427.54,17227.3125,120791.4
health and dental,50000.0,9382.390735,7379.199008,-287.37,2106.925,11416.36,13371.03,36369.96
other benefits,50000.0,4053.381941,4171.974274,-8584.14,398.465,3107.04,6433.7825,37563.46
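
The effect of `.T` on the shape and labels can be sketched on a tiny made-up DataFrame (the name `toy` is an assumption for illustration): the column names become the index, and the shape is reversed.

```python
import pandas as pd

# Hypothetical data: 3 rows and 2 columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

flipped = toy.T   # rows and columns switch places

print(toy.shape, flipped.shape)
print(list(flipped.index))
```

Transposing twice returns the original layout.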



## Nuisance Columns

Above, we called common statistical methods on a DataFrame composed of only numeric columns. It's possible to call these same methods on DataFrames composed of any combination of data types.

### Dropping columns that don't work with the method

pandas allows you to call these statistical methods on DataFrames containing columns with data types that don't work for that particular method. The entire `sf_emp` DataFrame contains string and numeric columns. Taking the mean of a string column does not work. Instead of raising an error, older versions of pandas **silently** drop these columns. These DataFrame columns that don't compute with certain methods are sometimes referred to as **nuisance columns**. Note that this behavior was deprecated in later 1.x releases, and starting with pandas 2.0 the same call raises a `TypeError` unless you pass `numeric_only=True`.

Let's show this by calling the `mean` method on the San Francisco employee compensation dataset with all of the original columns. We will work with only 100 rows of the data, which will be explained shortly.

In [27]:
sf_emp_100 = sf_emp.head(100)
sf_emp_100.head(3)

Unnamed: 0,year,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,2013,Public Protection,Personnel Technician,71414.01,0.0,0.0,14038.58,12918.24,5872.04
1,2013,General Administration & Finance,Planner 2,67941.06,0.0,0.0,13030.23,10047.52,5608.37
2,2013,Public Protection,Firefighter,116956.72,59975.43,19037.3,24796.44,15788.97,3222.2


Calling the mean method drops the two columns containing strings from the result. No error is raised.

In [28]:
sf_emp_100.mean()


year                  2013.0000
salaries             66812.8721
overtime              4680.5567
other salaries        3718.1512
retirement           12658.8683
health and dental     9044.6313
other benefits        4585.1201
dtype: float64

In [32]:
sf_emp_100.dtypes

year                    int64
organization group     object
job                    object
salaries              float64
overtime              float64
other salaries        float64
retirement            float64
health and dental     float64
other benefits        float64
dtype: object

### Many methods do work with non-numeric data types

Many of the aggregation methods do work with string and datetime columns. Let's find the max of all the `sf_emp_100` columns.

In [38]:
sf_emp_100.max()

year                                                     2013
organization group    Public Works, Transportation & Commerce
job                                     X-Ray Laboratory Aide
salaries                                            285446.37
overtime                                             59975.43
other salaries                                       24897.03
retirement                                            54710.5
health and dental                                    15788.97
other benefits                                       17780.94
dtype: object

The `sum` method is valid for string (but not datetime) columns and concatenates all the values together to produce one long string. This usually isn't something you'd like to do. It's also a computationally expensive operation. The following call to `sum` took about 4 seconds on the full dataset (50k rows) on my machine.

In [39]:
sf_emp_100.sum()

year                                                             201300
organization group    Public ProtectionGeneral Administration & Fina...
job                   Personnel TechnicianPlanner 2FirefighterIT Ope...
salaries                                                     6681287.21
overtime                                                      468055.67
other salaries                                                371815.12
retirement                                                   1265886.83
health and dental                                             904463.13
other benefits                                                458512.01
dtype: object
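
The concatenation is easier to see on a few values. This is a sketch on a made-up Series (the name `jobs` and its values are assumptions); `dtype=object` is set explicitly so the strings are stored as plain Python objects, as they are in the output above.

```python
import pandas as pd

# Hypothetical data: three strings stored in an object Series
jobs = pd.Series(['Planner', 'Nurse', 'Clerk'], dtype=object)

# Summing strings joins them end to end
combined = jobs.sum()
print(combined)
```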

### Use `numeric_only=True`

The `sum` method, as well as all the other aggregation methods, provides the boolean parameter `numeric_only`, which defaults to `False`. By setting it to `True`, pandas will only apply the method to boolean, integer, and float columns. The following operation took only 7 ms on the full dataset on my machine, which is roughly 500 times faster than the previous one (about 4 seconds).

In [40]:
sf_emp.sum(numeric_only=True)

year                 1.008266e+08
salaries             2.685772e+09
overtime             2.100636e+08
other salaries       1.408148e+08
retirement           5.242378e+08
health and dental    4.691195e+08
other benefits       2.026691e+08
dtype: float64
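
The same idea can be sketched on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration). The second option, `select_dtypes`, is an alternative that this chapter does not otherwise use; it keeps only the numeric columns before aggregating and should give the same result.

```python
import pandas as pd

# Hypothetical data: one string column and two numeric columns
toy = pd.DataFrame({'job': ['Planner', 'Firefighter', 'Nurse'],
                    'salaries': [60000.0, 110000.0, 30000.0],
                    'overtime': [0.0, 50000.0, 1000.0]})

# Option 1: ask the method itself to skip non-numeric columns
means_flag = toy.mean(numeric_only=True)

# Option 2: keep only numeric columns first, then aggregate
means_select = toy.select_dtypes('number').mean()

print(means_flag)
print(means_select)
```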

### The slow `mean` method

The `mean` method can also be extremely slow on DataFrames with string columns, even though it only produces results for numeric columns. This is because pandas takes the `sum` of all the columns first and then divides by the length. The reason pandas doesn't just skip over string columns is that they are technically object columns, and an object column can hold any data type. The only way for pandas to decide whether or not the `mean` will work on an object column is to actually sum up every value first and then attempt to divide by the length. If that fails, then it will skip it. The issue with this is that it is extremely slow for string columns, since strings can be summed. pandas only fails after the string column has been concatenated together, when it attempts to divide by the length. If you want to take the `mean` on a DataFrame with string columns, make sure you set `numeric_only` to `True`.

Even on this small dataset of 100 rows, there is a measurable performance difference. Each timing below comes from a single run, so the exact numbers are noisy.

In [41]:
%timeit -n 1 -r 1 sf_emp_100.mean()

1.95 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)




In [42]:
%timeit -n 1 -r 1 sf_emp_100.mean(numeric_only=True)

1.35 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Exercises

Execute the following cell to read in the movie dataset with the title in the index and select all three actor Facebook like columns.

In [42]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
actor_fb = movie[cols]
actor_fb.head(3)

title,actor1_fb,actor2_fb,actor3_fb
Avatar,1000.0,936.0,855.0
Pirates of the Caribbean: At World's End,40000.0,5000.0,1000.0
Spectre,11000.0,393.0,161.0


### Exercise 1
<span  style="color:green; font-size:16px">Calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

In [43]:
actor_fb.mean()

actor1_fb    6494.488491
actor2_fb    1621.923516
actor3_fb     631.276313
dtype: float64

### Exercise 2

<span  style="color:green; font-size:16px">The result of exercise 1 is a Series of three values. Can you call a method on this Series to choose the column name with the highest mean Facebook likes?</span>

In [49]:
actor_fb.mean().idxmax()

'actor1_fb'

### Exercise 3

<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie</span>

In [44]:
actor_fb.sum(axis=1)

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
                                               ...   
Signed Sealed Delivered                        1425.0
The Following                                  1753.0
A Plague So Pleasant                              0.0
Shanghai Calling                               2154.0
My Date with Drew                               125.0
Length: 4916, dtype: float64

### Exercise 4
<span  style="color:green; font-size:16px">What percentage of movies have more than 10,000 total actor FB likes?</span>

In [45]:
sum_likes = actor_fb.sum(axis='columns')

In [46]:
sum_likes > 10000

title
Avatar                                        False
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Length: 4916, dtype: bool

In [47]:
(sum_likes > 10000).mean()

0.2982099267697315

In [49]:
more_than_10 = sum_likes[sum_likes > 10000]
len(more_than_10) / len(sum_likes)

0.2982099267697315
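
Both answers agree because pandas treats `True` as 1 and `False` as 0, so the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up numbers (the name `likes` and its values are assumptions, not taken from the movie dataset):

```python
import pandas as pd

# Hypothetical like totals for four movies
likes = pd.Series([500, 12000, 30000, 8000])

over = likes > 10000                 # boolean Series
frac_mean = over.mean()              # fraction of True values
frac_count = over.sum() / len(over)  # count of True divided by length

print(frac_mean, frac_count)
```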

### Exercise 5

<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or fewer total actor FB likes.</span>

In [61]:
movie.head(1)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9


In [63]:
movie['sum_likes'] = movie[['actor1_fb', 'actor2_fb', 'actor3_fb']].sum(axis=1)

In [65]:
filt = movie['sum_likes'] > 10000

In [72]:
more_than = movie[filt]['gross'].median()
more_than

42391915.5

In [73]:
less_than = movie[~filt]['gross'].median()
less_than

16815752.5

In [74]:
more_than/less_than

2.5209645241864735

### Exercise 6

<span  style="color:green; font-size:16px">From exercise 5, it appears that movies with more than 10,000 total actor FB likes gross 2.5 times as much. This may be due to the fact that newer movies have more actors that are recognized by FB users. Find the median year produced for both groups.</span>

In [75]:
movie[filt]['year'].median()

2006.0

In [76]:
movie[~filt]['year'].median()

2005.0

### Exercise 7

<span  style="color:green; font-size:16px">For each movie made in the year 2016, what is the median of the total actor FB likes?</span>

In [77]:
filt = movie['year'] == 2016
movie[filt]['sum_likes'].median()

3571.5

### Exercise 8

<span  style="color:green; font-size:16px">Write a function that has a single parameter, `year`. Have it return the median of the total actor FB likes for the given year. Test your function with the year 2016 and verify the result with Exercise 7.</span>

In [78]:
def med_likes(year):
    filt = movie['year'] == year
    return movie[filt]['sum_likes'].median()

In [79]:
med_likes(2016)

3571.5

### Exercise 9

<span  style="color:green; font-size:16px">Write a loop to print out the year and median total actor FB likes for that year from 1990 to 2016</span>

In [80]:
for i in range(1990, 2017):
    print(i, med_likes(i))

1990 2017.0
1991 2436.0
1992 2147.5
1993 2018.0
1994 2368.5
1995 2612.0
1996 2692.5
1997 1964.0
1998 2482.0
1999 2595.0
2000 2378.0
2001 2424.0
2002 2146.0
2003 2019.0
2004 2298.0
2005 2072.0
2006 2359.0
2007 2002.5
2008 2400.0
2009 2145.0
2010 2411.0
2011 2818.5
2012 2426.0
2013 2420.0
2014 2084.0
2015 2063.0
2016 3571.5


Use the college dataset with the institution name as the index for the remaining exercises.

In [81]:
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head(3)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [4]:
college.columns

Index(['city', 'stabbr', 'hbcu', 'menonly', 'womenonly', 'relaffil',
       'satvrmid', 'satmtmid', 'distanceonly', 'ugds', 'ugds_white',
       'ugds_black', 'ugds_hisp', 'ugds_asian', 'ugds_aian', 'ugds_nhpi',
       'ugds_2mor', 'ugds_nra', 'ugds_unkn', 'pptug_ef', 'curroper', 'pctpell',
       'pctfloan', 'ug25abv', 'md_earn_wne_p10', 'grad_debt_mdn_supp'],
      dtype='object')

### Exercise 10

<span  style="color:green; font-size:16px">Find the number of non-missing values in each column and again for each row.</span>

In [82]:
college.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7535 entries, Alabama A & M University to Excel Learning Center-San Antonio South
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                7535 non-null   object 
 1   stabbr              7535 non-null   object 
 2   hbcu                7164 non-null   float64
 3   menonly             7164 non-null   float64
 4   womenonly           7164 non-null   float64
 5   relaffil            7535 non-null   int64  
 6   satvrmid            1185 non-null   float64
 7   satmtmid            1196 non-null   float64
 8   distanceonly        7164 non-null   float64
 9   ugds                6874 non-null   float64
 10  ugds_white          6874 non-null   float64
 11  ugds_black          6874 non-null   float64
 12  ugds_hisp           6874 non-null   float64
 13  ugds_asian          6874 non-null   float64
 14  ugds_aian           6874 non-null   float64
 15  ug

In [103]:
college.count()

city                  7535
stabbr                7535
hbcu                  7164
menonly               7164
womenonly             7164
relaffil              7535
satvrmid              1185
satmtmid              1196
distanceonly          7164
ugds                  6874
ugds_white            6874
ugds_black            6874
ugds_hisp             6874
ugds_asian            6874
ugds_aian             6874
ugds_nhpi             6874
ugds_2mor             6874
ugds_nra              6874
ugds_unkn             6874
pptug_ef              6853
curroper              7535
pctpell               6849
pctfloan              6849
ug25abv               6718
md_earn_wne_p10       6413
grad_debt_mdn_supp    7503
dtype: int64

In [104]:
college.count(axis=1)

instnm
Alabama A & M University                                  26
University of Alabama at Birmingham                       26
Amridge University                                        24
University of Alabama in Huntsville                       26
Alabama State University                                  26
                                                          ..
SAE Institute of Technology  San Francisco                 5
Rasmussen College - Overland Park                          5
National Personal Training Institute of Cleveland          5
Bay Area Medical Academy - San Jose Satellite Location     5
Excel Learning Center-San Antonio South                    5
Length: 7535, dtype: int64
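
To make the two directions of `count` concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` frame, its labels, and its values are assumptions chosen for illustration and are not part of the college dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with a few missing values
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": ["x", "y", None],
        "c": [np.nan, np.nan, 9.0],
    },
    index=["r1", "r2", "r3"],
)

# Default (axis=0): one count per column
col_counts = toy.count()

# axis=1: one count per row
row_counts = toy.count(axis=1)

print(col_counts)
print(row_counts)
```

Both `None` and `np.nan` are treated as missing, so each of them lowers the count for its row and its column.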

In [83]:
college.describe()

Unnamed: 0,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,...,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv
count,7164.0,7164.0,7164.0,7535.0,1185.0,1196.0,7164.0,6874.0,6874.0,6874.0,...,6874.0,6874.0,6874.0,6874.0,6874.0,6853.0,7535.0,6849.0,6849.0,6718.0
mean,0.014238,0.009213,0.005304,0.190975,522.819409,530.76505,0.005583,2356.83794,0.510207,0.189997,...,0.013813,0.004569,0.02395,0.016086,0.045181,0.226639,0.923291,0.530643,0.522211,0.410021
std,0.118478,0.095546,0.072642,0.393096,68.578862,73.469767,0.074519,5474.275871,0.286958,0.224587,...,0.070196,0.033125,0.031288,0.050172,0.09344,0.24647,0.266146,0.225544,0.283616,0.228939
min,0.0,0.0,0.0,0.0,290.0,310.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,475.0,482.0,0.0,117.0,0.2675,0.036125,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.3578,0.3329,0.2415
50%,0.0,0.0,0.0,0.0,510.0,520.0,0.0,412.5,0.5557,0.10005,...,0.0026,0.0,0.0175,0.0,0.0143,0.1504,1.0,0.5215,0.5833,0.40075
75%,0.0,0.0,0.0,0.0,555.0,565.0,0.0,1929.5,0.747875,0.2577,...,0.0073,0.0025,0.0339,0.0117,0.0454,0.3769,1.0,0.7129,0.745,0.572275
max,1.0,1.0,1.0,1.0,765.0,785.0,1.0,151558.0,1.0,1.0,...,1.0,0.9983,0.5333,0.9286,0.9027,1.0,1.0,1.0,1.0,1.0


### Exercise 11

<span  style="color:green; font-size:16px">What is the average number of non-missing values for each row?</span>

In [84]:
college.count(axis=1).mean()

22.70763105507631
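
As a sanity check on this kind of result, the average count per row should equal the total number of non-missing cells divided by the number of rows. The sketch below verifies that on a hypothetical two-column frame (the `toy` data is an assumption, not taken from the college dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three rows, three non-missing cells in total
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 9.0]})

# Count non-missing values in each row, then average those counts
per_row_mean = toy.count(axis=1).mean()

# Same quantity computed from the column counts
overall = toy.count().sum() / len(toy)

print(per_row_mean, overall)
```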

### Exercise 12

<span style="color:green; font-size:16px">The `ugds` column of the college dataset contains the total undergraduate population. What is the smallest number of colleges needed to reach a total of more than 5 million students?</span>

* Sort the colleges from largest to smallest by `ugds`
* Take the cumulative sum of the sorted `ugds` values
* Find the first row where the cumulative sum exceeds 5 million

In [11]:
college['ugds'].sort_values()

instnm
Education and Technology Institute                        0.0
Taft University System                                    0.0
Prince Institute-Rocky Mountains                          0.0
Lyme Academy College of Fine Arts                         0.0
American Conservatory Theater                             0.0
                                                         ... 
SAE Institute of Technology  San Francisco                NaN
Rasmussen College - Overland Park                         NaN
National Personal Training Institute of Cleveland         NaN
Bay Area Medical Academy - San Jose Satellite Location    NaN
Excel Learning Center-San Antonio South                   NaN
Name: ugds, Length: 7535, dtype: float64

In [16]:
sorted_list = college.sort_values(by='ugds', ascending=False)
sorted_list.head(10)

instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,...,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp,cum_pop
University of Phoenix-Arizona,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,151558.0,...,0.0131,0.3152,0.0,1,0.6009,0.592,,,33000.0,151558.0
Ivy Tech Community College,Indianapolis,IN,0.0,0.0,0.0,0,,,0.0,77657.0,...,0.0003,0.0354,0.635,1,0.5153,0.3384,0.478,29400,13000.0,77657.0
Miami Dade College,Miami,FL,0.0,0.0,0.0,0,,,0.0,61470.0,...,0.0521,0.028,0.5824,1,0.5399,0.0921,0.3503,30100,8500.0,61470.0
Lone Star College System,The Woodlands,TX,0.0,0.0,0.0,0,,,0.0,59920.0,...,0.019,0.0292,0.6863,1,0.3405,0.1984,0.3201,32900,11000.0,59920.0
Houston Community College,Houston,TX,0.0,0.0,0.0,0,,,0.0,58084.0,...,0.0911,0.0198,0.7027,1,0.668,0.3348,0.4751,32500,10750.0,58084.0
University of Central Florida,Orlando,FL,0.0,0.0,0.0,0,590.0,595.0,0.0,52280.0,...,0.01,0.0067,0.3043,1,0.3814,0.45,0.2083,42900,18350.0,52280.0
Liberty University,Lynchburg,VA,0.0,0.0,0.0,1,525.0,510.0,0.0,49340.0,...,0.0135,0.2626,0.4458,1,0.4984,0.6648,0.6265,35600,23250.0,49340.0
Texas A & M University-College Station,College Station,TX,0.0,0.0,0.0,0,580.0,615.0,0.0,46941.0,...,0.0131,0.0022,0.1049,1,0.2183,0.332,0.0308,53900,19000.0,46941.0
American Public University System,Charles Town,WV,0.0,0.0,0.0,0,,,1.0,44924.0,...,0.0071,0.0499,0.9316,1,0.3255,0.3388,0.8147,PrivacySuppressed,18543.5,44924.0
Ashford University,San Diego,CA,0.0,0.0,0.0,0,,,0.0,44744.0,...,0.002,0.0192,0.0003,1,0.5944,0.7203,0.8997,39100,32823.0,44744.0


In [85]:
sorted_list['ugds'].cumsum().head(10)


instnm
University of Phoenix-Arizona             151558.0
Ivy Tech Community College                229215.0
Miami Dade College                        290685.0
Lone Star College System                  350605.0
Houston Community College                 408689.0
University of Central Florida             460969.0
Liberty University                        510309.0
Texas A & M University-College Station    557250.0
American Public University System         602174.0
Ashford University                        646918.0
Name: ugds, dtype: float64
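
The cumulative sum gives the running total, but we still need to count how many colleges it takes to cross the threshold. One possible way to finish is sketched below on made-up numbers: the `enroll` Series and the threshold of 700 are stand-ins for `college['ugds']` and 5 million. The sketch assumes the grand total actually exceeds the threshold.

```python
import pandas as pd

# Hypothetical enrollments, including one missing value
enroll = pd.Series(
    [500.0, 100.0, None, 300.0, 200.0],
    index=list("ABCDE"),
    name="ugds",
)
threshold = 700  # stand-in for 5 million

# Sort largest to smallest; sort_values places missing values last by default
cum_pop = enroll.sort_values(ascending=False).cumsum()

# Number of colleges whose running total is still at or below the threshold
# (comparisons with NaN are False, so missing rows are not counted)
n_below = (cum_pop <= threshold).sum()

# One more college pushes the total strictly above the threshold
# (assumes the overall total is larger than the threshold)
n_needed = n_below + 1

print(cum_pop)
print(n_needed)
```

Taking the largest colleges first is what makes the count minimal: any other choice of the same number of colleges would have an equal or smaller total.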

### Exercise 13

<span style="color:green; font-size:16px">Call the `describe` method, but make it work only for the string columns.</span>

In [86]:
college.describe(include='object')

Unnamed: 0,city,stabbr,md_earn_wne_p10,grad_debt_mdn_supp
count,7535,7535,6413,7503
unique,2514,59,598,2038
top,New York,CA,PrivacySuppressed,PrivacySuppressed
freq,87,773,822,1510
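
Another way to reach the non-numeric columns is to exclude the numeric ones with the `exclude` parameter of `describe`. The sketch below uses a small hypothetical frame (`toy` and its values are assumptions). For a DataFrame that holds only string and numeric columns, this should select the same columns as `include='object'`.

```python
import pandas as pd

# Hypothetical frame with two string columns and one numeric column
toy = pd.DataFrame(
    {
        "city": ["Tempe", "Miami", "Miami"],
        "stabbr": ["AZ", "FL", "FL"],
        "ugds": [151558.0, 61470.0, 100.0],
    }
)

# Summarize every column that is not numeric
str_summary = toy.describe(exclude="number")

print(str_summary)
```

For non-numeric columns, `describe` reports `count`, `unique`, `top` (the most frequent value), and `freq` (how often that value occurs) instead of the mean and quantiles.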


In [89]:
college.dtypes

city                   object
stabbr                 object
hbcu                  float64
menonly               float64
womenonly             float64
relaffil                int64
satvrmid              float64
satmtmid              float64
distanceonly          float64
ugds                  float64
ugds_white            float64
ugds_black            float64
ugds_hisp             float64
ugds_asian            float64
ugds_aian             float64
ugds_nhpi             float64
ugds_2mor             float64
ugds_nra              float64
ugds_unkn             float64
pptug_ef              float64
curroper                int64
pctpell               float64
pctfloan              float64
ug25abv               float64
md_earn_wne_p10        object
grad_debt_mdn_supp     object
dtype: object

### Exercise 14

<span style="color:green; font-size:16px">Call the `max` method, but only return columns that are numeric.</span>

In [91]:
college.max()


city            Zanesville
stabbr                  WY
hbcu                   1.0
menonly                1.0
womenonly              1.0
relaffil                 1
satvrmid             765.0
satmtmid             785.0
distanceonly           1.0
ugds              151558.0
ugds_white             1.0
ugds_black             1.0
ugds_hisp              1.0
ugds_asian          0.9727
ugds_aian              1.0
ugds_nhpi           0.9983
ugds_2mor           0.5333
ugds_nra            0.9286
ugds_unkn           0.9027
pptug_ef               1.0
curroper                 1
pctpell                1.0
pctfloan               1.0
ug25abv                1.0
dtype: object

In [92]:
college.max(numeric_only=True)

hbcu                 1.0000
menonly              1.0000
womenonly            1.0000
relaffil             1.0000
satvrmid           765.0000
satmtmid           785.0000
distanceonly         1.0000
ugds            151558.0000
ugds_white           1.0000
ugds_black           1.0000
ugds_hisp            1.0000
ugds_asian           0.9727
ugds_aian            1.0000
ugds_nhpi            0.9983
ugds_2mor            0.5333
ugds_nra             0.9286
ugds_unkn            0.9027
pptug_ef             1.0000
curroper             1.0000
pctpell              1.0000
pctfloan             1.0000
ug25abv              1.0000
dtype: float64
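
Selecting the numeric columns first and then aggregating is an alternative to passing `numeric_only=True`. The following is a minimal sketch on a hypothetical frame (`toy` and its values are assumptions) that compares the two approaches.

```python
import pandas as pd

# Hypothetical frame: one string column, one integer column, one float column
toy = pd.DataFrame(
    {
        "city": ["Tempe", "Miami"],
        "relaffil": [0, 1],
        "ugds": [151558.0, 61470.0],
    }
)

# Let the aggregation method drop the non-numeric columns
via_flag = toy.max(numeric_only=True)

# Keep only the numeric columns first, then aggregate
via_select = toy.select_dtypes("number").max()

print(via_flag)
print(via_select)
```

Because the result mixes an integer column with a float column, pandas returns a single `float64` Series, which is why an integer column such as `relaffil` is displayed with a decimal point.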