# Aggregate functions

When working with a dataset, it's often a good idea to compute summary statistics. 

**Summary statistics** can tell you about your outliers, whether your data is symmetrical, and how tightly grouped your data is.

**Aggregate Methods**

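Aggregate Method | Description
--- | ---
sum | sum of values
cumsum | cumulative sum (returns a running total for each row rather than a single value)
mean | mean of values
median | median (middle value) of values
min | minimum
max | maximum
mode | most frequent value(s)
std | sample standard deviation (normalized by N-1 by default)
var | sample variance (normalized by N-1 by default)
quantile | value at a given quantile (for example, 0.5 gives the median)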
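
As a rough illustration of the methods in the table, here is a minimal sketch on a small made-up Series. The name `values` and its contents are hypothetical and are not part of the car loan dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the car loan dataset
values = pd.Series([2, 4, 4, 6, 9])

total = values.sum()              # single number
average = values.mean()
middle = values.median()
smallest, largest = values.min(), values.max()
most_common = values.mode()       # a Series, because several values can tie
spread = values.std()             # sample standard deviation (ddof=1 by default)
variance = values.var()           # sample variance (ddof=1 by default)
quartile = values.quantile(0.25)  # value at the 25th percentile
running = values.cumsum()         # same length as the input, one total per row
```

Note that `cumsum` returns one value per row, while the other methods shown here reduce the column to a single result (or, for `mode`, to the tied most frequent values).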

**Load Data**

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [16]:
# import data
filename = 'car_financing.xlsx'
df = pd.read_excel(filename)

In [17]:
# Filter: keep only the Toyota Sienna rows at a 7.02% interest rate
car_filter = df['car_type']=='Toyota Sienna'
interest_filter = df['interest_rate']==0.0702
df = df.loc[car_filter & interest_filter, :]

# Renaming columns, approach 1: dictionary substitution using the rename method
df = df.rename(columns={'Starting Balance': 'starting_balance',
                        'Interest Paid': 'interest_paid', 
                        'Principal Paid': 'principal_paid',
                        'New Balance': 'new_balance'})

# Renaming columns, approach 2: list replacement
# Only changing Month -> month, but we need to list the rest of the columns
df.columns = ['month',
              'starting_balance',
              'Repayment',
              'interest_paid',
              'principal_paid',
              'new_balance',
              'term',
              'interest_rate',
              'car_type']

# Dropping columns, approach 1: the drop method
# This approach allows you to drop multiple columns at a time 
df = df.drop(columns=['term'])

# Dropping columns, approach 2: the del statement (one column at a time)
del df['Repayment']

df.shape

(60, 7)

The car loan dataset is a payment table for a $34,690 loan on a Toyota Sienna at a 7.02% interest rate over 60 months. A natural question is how much total interest is paid over the course of the loan.

In [18]:
df.head()

|   | month | starting_balance | interest_paid | principal_paid | new_balance | interest_rate | car_type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 34689.96 | 202.93 | 484.3 | 34205.66 | 0.0702 | Toyota Sienna |
| 1 | 2 | 34205.66 | 200.1 | 487.13 | 33718.53 | 0.0702 | Toyota Sienna |
| 2 | 3 | 33718.53 | 197.25 | 489.98 | 33228.55 | 0.0702 | Toyota Sienna |
| 3 | 4 | 33228.55 | 194.38 | 492.85 | 32735.7 | 0.0702 | Toyota Sienna |
| 4 | 5 | 32735.7 | 191.5 | 495.73 | 32239.97 | 0.0702 | Toyota Sienna |


To answer this, we use the `sum` method, which sums the values in a column. Write the name of the DataFrame (`df`), select the column of interest in single brackets (`df['interest_paid']`), and call `.sum()`. The result is the total amount of interest paid over the course of the loan: $6,450.27.

In [12]:
# sum the values in a column
# total amount of interest paid over the course of the loan
df['interest_paid'].sum()

np.float64(6450.2699999999995)
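
The trailing digits in the output above come from floating-point arithmetic rather than from the data. A minimal sketch of rounding a total to cents for display, using only the first five interest payments shown in `df.head()` (the variable names are illustrative):

```python
import pandas as pd

# First five interest payments from the table above
interest = pd.Series([202.93, 200.10, 197.25, 194.38, 191.50])

total = interest.sum()
# Binary floating point cannot represent most decimal fractions exactly,
# so the raw total may carry a tiny error in its last digits.
total_rounded = round(total, 2)  # round to cents for display
print(f"${total_rounded:,.2f}")  # prints $986.16
```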

You can also call the `sum` method on an entire DataFrame to sum each column separately. The total for the `interest_paid` column matches what we got when we summed that column alone, but the `car_type` column looks different. The reason is that its values are strings, and in Python, adding two strings concatenates them.


In [13]:
# sum the values in each column
df.sum()

month                                                            1830
starting_balance                                           1118598.13
interest_paid                                                 6450.27
principal_paid                                               34690.29
new_balance                                                1083907.84
interest_rate                                                   4.212
car_type            Toyota SiennaToyota SiennaToyota SiennaToyota ...
dtype: object
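
If the concatenated strings are not wanted, one option is the `numeric_only` parameter of `DataFrame.sum`. The sketch below uses a hypothetical two-row frame named `toy` that mirrors two of the columns above; it is not the notebook's `df`.

```python
import pandas as pd

# Hypothetical two-row frame mirroring two columns of the loan table
toy = pd.DataFrame({
    "interest_paid": [202.93, 200.10],
    "car_type": ["Toyota Sienna", "Toyota Sienna"],
})

all_columns = toy.sum()                    # car_type strings are concatenated
numeric_only = toy.sum(numeric_only=True)  # car_type is left out of the result
```

With `numeric_only=True`, the result only contains the numeric columns, so the string column no longer appears in the output.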

Adding two strings directly shows the same concatenation behavior:

In [14]:
'Toyota Sienna' + 'Toyota Sienna'

'Toyota SiennaToyota Sienna'

If you ever don't know exactly what a method does, I encourage you to use Python's built-in `help` function: write `help(df['interest_paid'].sum)` and run the cell with Shift+Enter. The `sum` method sums down all the values in a column. However, by default, it ignores all NA or null values when computing the result, so if the `interest_paid` column has any NaNs in it, they are skipped. That's why it's always important to know exactly what a method, or in this case an aggregate function, does: our total may silently exclude missing values.


In [None]:
# Notice in the help text that by default sum skips missing values (skipna=True).
help(df['interest_paid'].sum)
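
The effect of this default can be sketched on a small made-up Series. The name `payments` and its values are hypothetical; `skipna` is the parameter of `sum` that controls whether missing values are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing payment
payments = pd.Series([100.0, np.nan, 50.0])

skipped = payments.sum()               # NaN is ignored by default
strict = payments.sum(skipna=False)    # NaN propagates into the result
missing_count = payments.isna().sum()  # number of missing values in the column
```

Passing `skipna=False` makes a missing value visible in the result instead of silently dropping it, and `isna().sum()` counts how many values are missing.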

To check whether you have NaNs in your DataFrame, use the built-in `info` method. For this dataset, it reports one NaN in the `interest_paid` column.

In [None]:
# The info method gives the column datatypes + number of non-null values
# Notice that we seem to have 60 non-null values for all but the interest_paid column. 
df.info()

When working with a dataset, it's often a good idea to use aggregate functions on your various columns. They can tell you how tightly grouped your data is, what your outliers are, and other potentially useful information, such as how much interest you'll pay over the course of a loan (using the `sum` method).