In [26]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)

3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
1.17.2
0.25.1


# Applying Functions to DataFrames

Previously, we stored our grade data as a csv file.  Let's retrieve it with the Pandas `read_csv()` function.

In [27]:
gradebook = pd.read_csv('gradebook.csv')
gradebook

,student,midterm,final
0,Ben,88,85
1,May,78,82
2,Sue,92,51
3,Blake,56,85
4,Amy,79,91
5,Steve,92,79


Notice that we no longer have student names as our Index.  This is because there's no way to specify which variable is an index in a csv file, so that information is lost.  We can go back to the way we had things before with `set_index()`.

In [28]:
gradebook = gradebook.set_index('student')
gradebook

student,midterm,final
Ben,88,85
May,78,82
Sue,92,51
Blake,56,85
Amy,79,91
Steve,92,79
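
As an alternative to calling `set_index()` afterwards, `read_csv()` accepts an `index_col` argument that restores the index while reading.  Here is a minimal sketch; the in-memory `csv_text` is only a stand-in that we assume mirrors the layout of `gradebook.csv`.

```python
import io
import pandas as pd

# Stand-in for gradebook.csv (assumed layout: student, midterm, final).
csv_text = """student,midterm,final
Ben,88,85
May,78,82
Sue,92,51
Blake,56,85
Amy,79,91
Steve,92,79
"""

# index_col tells read_csv which column to use as the row labels.
gradebook = pd.read_csv(io.StringIO(csv_text), index_col="student")
print(gradebook.index.name)      # student
print(list(gradebook.columns))   # ['midterm', 'final']
```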


Let's also throw in some missing values, just to make things more interesting.

In [29]:
gradebook.loc['Steve','midterm'] = np.nan
gradebook.loc['Amy','final'] = np.nan
gradebook

student,midterm,final
Ben,88.0,85.0
May,78.0,82.0
Sue,92.0,51.0
Blake,56.0,85.0
Amy,79.0,
Steve,,79.0


It's time to apply some functions!  Finding the mean score for each exam is about as easy as you could imagine:

In [30]:
gradebook.mean()

midterm    78.6
final      76.4
dtype: float64

Notice:

- Pandas usually assumes you want missing values ignored, that is, left out so that the computation can continue.  (Remember that NumPy would have computed the means as NaN.)  This is usually what we want, but there's a risk that you never stop to think about your missing values because Pandas doesn't force you to.  What if the missing values represent students who fell behind in the course?  Are the means still a fair measure of how well students are doing?

You can override this behavior if you want with `skipna`.

In [31]:
gradebook.mean(skipna = False)

midterm   NaN
final     NaN
dtype: float64
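
To make the contrast with NumPy concrete, here is a small sketch using the midterm scores as a standalone Series (rebuilt by hand here so the example is self-contained).  It also shows `count()`, which is one simple way to check how many values a mean is actually based on.

```python
import numpy as np
import pandas as pd

midterm = pd.Series([88, 78, 92, 56, 79, np.nan])

print(np.mean(midterm.to_numpy()))     # nan: NumPy propagates the missing value
print(np.nanmean(midterm.to_numpy()))  # 78.6: NumPy's NaN-aware variant
print(midterm.mean())                  # 78.6: pandas skips NaN by default
print(midterm.count())                 # 5: number of non-missing values
```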

- By default, `mean()` operates 'along the rows', which we call axis 0: it moves down the rows and returns one value per column.  This should make sense for most DataFrames.  Remember that different columns could represent totally different things.  It doesn't make much sense to average 15 mg of caffeine with 4 pets.

It is possible to take the mean horizontally ('along the columns'), which gives one value per row.  Just set `axis = 1`.

In [32]:
gradebook.mean(axis = 1)

student
Ben      86.5
May      80.0
Sue      71.5
Blake    70.5
Amy      79.0
Steve    79.0
dtype: float64
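
If the axis numbers are hard to remember, one way to think about it is that `axis` names the dimension that gets collapsed.  The sketch below rebuilds the gradebook by hand and also uses the string alias `"columns"`, which pandas accepts in place of `1`.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

per_exam = gradebook.mean(axis=0)      # collapses the rows: one value per column
per_student = gradebook.mean(axis=1)   # collapses the columns: one value per row

print(per_exam.index.tolist())     # ['midterm', 'final']
print(per_student.index.tolist())  # the six student names

# The string alias gives the same result as axis=1.
same = gradebook.mean(axis="columns")
print(same.equals(per_student))
```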

Notice that the `mean()` method returns a Series.

In [33]:
avg = gradebook.mean()
type(avg)

pandas.core.series.Series

We can subtract this Series from the DataFrame, to find 'de-meaned' grades for each exam.

In [34]:
gradebook - avg

student,midterm,final
Ben,9.4,8.6
May,-0.6,5.6
Sue,13.4,-25.4
Blake,-22.6,8.6
Amy,0.4,
Steve,,2.6


Notice that the Series gets matched against the *columns*.

Also, the Series is *broadcast*, just like in NumPy.  Here, we're trying to subtract a 1-by-2 Series from a 6-by-2 DataFrame.  The Series is effectively copied 6 times so that the dimensions line up.
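
Because the default matching is against the columns, subtracting a Series of per-student averages needs a different spelling.  A minimal sketch, assuming we want to center each student's scores on their own average: the `sub()` method takes an `axis` argument that says which labels to align on.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

row_avg = gradebook.mean(axis=1)   # one average per student

# The plain operator matches student names against column labels,
# finds no overlap, and fills the result with NaN.
mismatched = gradebook - row_avg

# sub(..., axis=0) aligns the Series with the row index instead.
centered = gradebook.sub(row_avg, axis=0)
print(centered.loc["Ben"].tolist())   # [1.5, -1.5]
```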

DataFrames have a lot of other methods to compute statistics for the columns (or rows).

In [36]:
gradebook.sum()

midterm    393.0
final      382.0
dtype: float64

In [37]:
gradebook.sum(axis=1)

student
Ben      173.0
May      160.0
Sue      143.0
Blake    141.0
Amy       79.0
Steve     79.0
dtype: float64

In [38]:
gradebook.min()

midterm    56.0
final      51.0
dtype: float64

In [39]:
gradebook.min(axis=1)

student
Ben      85.0
May      78.0
Sue      51.0
Blake    56.0
Amy      79.0
Steve    79.0
dtype: float64

In [40]:
gradebook.std()

midterm    13.957077
final      14.415270
dtype: float64
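
Combining `mean()` and `std()` with broadcasting gives standardized scores (z-scores) in one line.  This is a sketch of a common recipe rather than something the gradebook needs; note that `std()` uses the sample standard deviation (dividing by n - 1) by default.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

# Both Series are matched against the columns and broadcast down the rows.
z = (gradebook - gradebook.mean()) / gradebook.std()
print(z.round(2))
```

Each column of `z` has mean 0 and standard deviation 1, and the missing values stay missing.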

You can use the `corr()` and `cov()` methods to get a full correlation or covariance matrix, respectively.

In [41]:
gradebook.corr()

,midterm,final
midterm,1.0,-0.571463
final,-0.571463,1.0
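
Missing values matter here too: `corr()` leaves out rows where either value in a pair is missing.  With only two columns, that amounts to dropping Amy and Steve first, as this sketch checks against NumPy's `corrcoef()`.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

# Keep only the students with both scores (Ben, May, Sue, Blake).
complete = gradebook.dropna()
r_numpy = np.corrcoef(complete["midterm"], complete["final"])[0, 1]

# pandas excludes the incomplete pairs on its own.
r_pandas = gradebook["midterm"].corr(gradebook["final"])

print(round(r_numpy, 6), round(r_pandas, 6))
```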


If you have too many variables, computing the full matrix may be too much information.  You can also get the correlation of each column with a specific Series using the `corrwith()` method.

In [42]:
gradebook.corrwith(gradebook.mean(axis=1))

midterm    0.437976
final      0.492324
dtype: float64

## Applying Custom Functions

Quite frequently, you'll want to apply your own custom function to the columns and rows of a DataFrame.  You can do this with the general (and very important) method, `apply()`. 

In [141]:
g_range = lambda x: x.max() - x.min()

In [142]:
gradebook.apply(g_range)

midterm    36.0
final      34.0
dtype: float64

You can apply your function either along the rows or along the columns.

In [143]:
gradebook.apply(g_range, axis = 1)

student
Ben       3.0
May       4.0
Sue      41.0
Blake    29.0
Amy       0.0
Steve     0.0
dtype: float64
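
For a simple function like this one, the same result can often be written directly with built-in methods, which tends to be faster on large DataFrames.  The sketch below compares the two spellings for both axes; `apply()` remains the general tool when no built-in combination exists.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

g_range = lambda x: x.max() - x.min()

# Column-wise: apply() hands each column to g_range as a Series.
via_apply = gradebook.apply(g_range)
vectorized = gradebook.max() - gradebook.min()

# Row-wise: with axis=1, each row is handed over as a Series instead.
via_apply_rows = gradebook.apply(g_range, axis=1)
vectorized_rows = gradebook.max(axis=1) - gradebook.min(axis=1)

print(np.allclose(via_apply, vectorized))
print(np.allclose(via_apply_rows, vectorized_rows))
```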

You'll sometimes see people use the `agg()` method in these situations.  `agg` is short for aggregate; the name mainly emphasizes that we're losing a dimension.

In [144]:
gradebook.agg(g_range)

midterm    36.0
final      34.0
dtype: float64

It's also possible to apply a function that returns an entire Series.  In this case, we wouldn't lose a dimension in the result (and we can't use agg).

In [17]:
def g_min_max(x):
    return pd.Series([x.min(),x.max()], index = ['min','max'])

In [18]:
gradebook.apply(g_min_max)

,midterm,final
min,56.0,51.0
max,92.0,85.0
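
For this particular min/max table, a similar result can be sketched by passing a list of function names to `agg()`.  This is shown only for comparison with the custom function; each name in the list becomes one row of the result.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

# Each string names a built-in aggregation; the result has one row per name.
summary = gradebook.agg(["min", "max"])
print(summary)
```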


This is less common, but you should know that you can apply a function to every element of a DataFrame.  This is done with the `applymap()` method.  It's named after the `map()` method for a Series.

In [19]:
gradebook.applymap(lambda x: x+1)

student,midterm,final
Ben,89.0,86.0
May,79.0,83.0
Sue,93.0,52.0
Blake,57.0,86.0
Amy,80.0,
Steve,,80.0
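
Two caveats are worth a sketch.  First, in pandas 2.1 and later the element-wise method is called `DataFrame.map()` and `applymap()` is deprecated, so the code below picks whichever one is available.  Second, for plain arithmetic like adding 1, the ordinary operator does the same job without calling a Python function on every element.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap.
elementwise = gradebook.map if hasattr(gradebook, "map") else gradebook.applymap
bumped = elementwise(lambda x: x + 1)

# The vectorized spelling of the same operation.
vectorized = gradebook + 1

print(bumped.equals(vectorized))   # True
```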


## Methods for Getting to Know Variables

Let's look at a really handy method for when you're trying to do some basic data exploration.  `describe()` gives you some common statistics, and it's a great place to start when you're just digging into a dataset.

In [43]:
gradebook.describe()

,midterm,final
count,5.0,5.0
mean,78.6,76.4
std,13.957077,14.41527
min,56.0,51.0
25%,78.0,79.0
50%,79.0,82.0
75%,88.0,85.0
max,92.0,85.0


`describe()` behaves differently, depending on the type of variable.  For example, let's add a variable for a letter grade.

In [44]:
gradebook['letter_grade'] = pd.cut(gradebook.mean(axis=1), (0,70,80,90,100), labels = ('F','C','B','A'))
gradebook

student,midterm,final,letter_grade
Ben,88.0,85.0,B
May,78.0,82.0,C
Sue,92.0,51.0,C
Blake,56.0,85.0,C
Amy,79.0,,C
Steve,,79.0,C


In [150]:
gradebook.letter_grade.describe()

count     6
unique    2
top       C
freq      5
Name: letter_grade, dtype: object
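
By default, calling `describe()` on the whole DataFrame summarizes only the numeric columns.  To see both kinds of summary side by side, one option is `include="all"`, sketched below on a hand-built copy of the gradebook; statistics that don't apply to a column show up as NaN.

```python
import numpy as np
import pandas as pd

gradebook = pd.DataFrame(
    {"midterm": [88, 78, 92, 56, 79, np.nan],
     "final": [85, 82, 51, 85, np.nan, 79]},
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
)
gradebook["letter_grade"] = pd.cut(
    gradebook.mean(axis=1), (0, 70, 80, 90, 100), labels=("F", "C", "B", "A")
)

# include="all" combines the numeric and the categorical summaries in one table.
summary = gradebook.describe(include="all")
print(summary)
```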

A method that is even more important is `value_counts()`.  This returns a Series that tells you how often each value occurs in a variable.

In [151]:
gradebook.letter_grade.value_counts()

C    5
B    1
F    0
A    0
Name: letter_grade, dtype: int64
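
Counts are often easier to compare as proportions.  A minimal sketch, using a hand-built categorical Series that mirrors the letter grades above: passing `normalize=True` to `value_counts()` returns shares that sum to 1.

```python
import pandas as pd

letter_grade = pd.Series(
    pd.Categorical(
        ["B", "C", "C", "C", "C", "C"],
        categories=["F", "C", "B", "A"],
        ordered=True,
    ),
    index=pd.Index(["Ben", "May", "Sue", "Blake", "Amy", "Steve"], name="student"),
    name="letter_grade",
)

# normalize=True divides each count by the number of non-missing values.
shares = letter_grade.value_counts(normalize=True)
print(shares)
```

Categories that never occur (F and A here) are still listed, with a share of 0.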

Let's save our work so we can use it later.

In [152]:
gradebook.to_csv('gradebook_v2.csv')