# Optimising Dataframes and Chunked Processing

In this guided project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from Lending Club, a marketplace for personal loans that matches borrowers with investors. More information about the marketplace is available [on its website](https://www.lendingclub.com/public/how-peer-lending-works.action).

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-) that Lending Club charges.

We'll be working with a dataset of loans approved from 2007-2011, which can be downloaded from [Lending Club's website](https://www.lendingclub.com/info/download-data.action). The `desc` column has already been removed to make the system run more quickly.

If we read in the entire data set, it will consume about 67 megabytes of memory. Let's imagine that we only have 10 megabytes of memory available throughout this project, to practice the techniques mentioned above. 

In [1]:
import pandas as pd
pd.options.display.max_columns = 99

In [2]:
first_five = pd.read_csv('loans_2007.csv', nrows=5)
first_five

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [3]:
thousand_chunk = pd.read_csv('loans_2007.csv', nrows=1000)
mem_usage = thousand_chunk.memory_usage(deep=True).sum()/(1024*1024)
print('{:.2f} MB of memory used for 1000 rows'.format(mem_usage))

1.55 MB of memory used for 1000 rows


In [24]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_rows = 0
mem_usage = 0
print('3000 rows per chunk')
for i, chunk in enumerate(chunk_iter):
    mem_usage = chunk.memory_usage(deep=True).sum()/(1024*1024)
    print('Chunk {}: {:.2f} MB'.format(i, mem_usage))
    total_rows += len(chunk)
    total_mem_usage += mem_usage
print('{:.3f} MB total memory usage for the dataset'.format(total_mem_usage))
print('{} total rows in the dataset'.format(total_rows))

3000 rows per chunk
Chunk 0: 4.65 MB
Chunk 1: 4.64 MB
Chunk 2: 4.65 MB
Chunk 3: 4.65 MB
Chunk 4: 4.64 MB
Chunk 5: 4.65 MB
Chunk 6: 4.64 MB
Chunk 7: 4.65 MB
Chunk 8: 4.65 MB
Chunk 9: 4.65 MB
Chunk 10: 4.66 MB
Chunk 11: 4.66 MB
Chunk 12: 4.66 MB
Chunk 13: 4.90 MB
Chunk 14: 0.88 MB
174.492 MB total memory usage for the dataset
42538 total rows in the dataset


## Exploring the data in chunks

We will now continue to understand the column types better while using dataframe chunks.

### String or int
We will start by looking at the data types available in each chunk

In [5]:
import numpy as np

chunk_iter = pd.read_csv('loans_2007.csv',chunksize=3000)
numeric = []
string = []
for chunk in chunk_iter:
    nums = chunk.select_dtypes(include=[np.number]).shape[1]
    numeric.append(nums)
    strs = chunk.select_dtypes(include=['object']).shape[1]
    string.append(strs)

print('Numeric type columns per chunk:')
print(numeric)
print('\nString type columns per chunk:')
print(string)

Numeric type columns per chunk:
[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]

String type columns per chunk:
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


The data types across the chunks seem to be consistent, except for the final 2 chunks. 

It seems like one column in particular (the `id` column) is being cast to `int64` in the last 2 chunks but not in the earlier chunks. Since the `id` column won't be useful for analysis, visualization, or predictive modelling let's ignore this column.

### Unique values

How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?

In [6]:
chunk_iter = pd.read_csv('loans_2007.csv',chunksize=3000)

# calculating unique values for each chunk
uniques = {}
for chunk in chunk_iter:
    strings_only = chunk.select_dtypes(include=['object'])
    cols = strings_only.columns
    for c in cols:
        val_counts = strings_only[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

# grouping statistics across chunks          
unique_stats = {
    'column_name': [],
    'total_values': [],
    'unique_values': [],
}
for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()    
    
    total_values = 0
    for k,v in u_group.items():
        total_values += v
    unique_stats['column_name'].append(col)
    unique_stats['total_values'].append(total_values)
    unique_stats['unique_values'].append(u_group.shape[0])

pd.DataFrame(unique_stats).sort_values(by=['unique_values']).reset_index(drop=True)

Unnamed: 0,column_name,total_values,unique_values
0,application_type,42535,1
1,initial_list_status,42535,1
2,term,42535,2
3,pymnt_plan,42535,2
4,verification_status,42535,3
5,home_ownership,42535,5
6,grade,42535,7
7,loan_status,42535,9
8,emp_length,41423,11
9,purpose,42535,14


We can see the top 11 columns above which have less than 50 unique values. 

### Missing values

Which float columns have no missing values and could be candidates for conversion to the integer type?

In [7]:
chunk_iter = pd.read_csv('loans_2007.csv',chunksize=3000)

missing = []
for chunk in chunk_iter:
    floats = chunk.select_dtypes(include=['float'])
    missing.append(floats.apply(pd.isnull).sum())

combined_missing = pd.concat(missing)
combined_missing = combined_missing.groupby(combined_missing.index).sum().sort_values(ascending=False)
combined_missing = combined_missing.to_frame(name='missing_vals').reset_index().rename(columns={'index': 'col_name'})
combined_missing.head(10)

Unnamed: 0,col_name,missing_vals
0,pub_rec_bankruptcies,1368
1,chargeoff_within_12_mths,148
2,collections_12_mths_ex_med,148
3,tax_liens,108
4,acc_now_delinq,32
5,open_acc,32
6,delinq_amnt,32
7,pub_rec,32
8,delinq_2yrs,32
9,total_acc,32


We can see the 10 columns with the most missing values here. The remaining columns are missing less that 50 values out of the 40,000 rows of the dataset. The main columns we would have to focus on are `pub_rec_bankrupcies`, `chargeoff_within_12_mths`, `collections_12_mths_ex_med`, and `tax_liens`.

As we go forward, we will not drop the data rows for now. Instead we will insert zeros if needed for manipulation/conversions. 

## Optimising string columns

Next, we can move to determine which string columns we can convert to a numeric type if we clean them. For example, the `int_rate` column is only a string because of the "%" sign at the end.

Out of all the object colums, the following have been identified as useful for analysis:

- `term`
- `sub_grade`
- `emp_title`
- `home_ownership`
- `verification_status`
- `issue_d`
- `purpose`
- `earliest_cr_line`
- `revol_util`
- `last_pymnt_d`
- `last_credit_pull_d`

In [8]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

## Create dictionary (key: column, value: list of Series objects representing each chunk's value counts)
str_cols_vc = {}
for chunk in chunk_iter:
    str_cols = chunk.select_dtypes(include=['object'])
    for col in str_cols.columns:
        current_col_vc = str_cols[col].value_counts()
        if col in str_cols_vc:
            str_cols_vc[col].append(current_col_vc)
        else:
            str_cols_vc[col] = [current_col_vc]
            
## Combine the value counts.
combined_vcs = {}
for col in str_cols_vc:
    combined_vc = pd.concat(str_cols_vc[col])
    final_vc = combined_vc.groupby(combined_vc.index).sum()
    combined_vcs[col] = final_vc

The following should be converted to *numerical*:

- `term`
- `revol_util`

The following should be converted to *datetime*:

- `issue_d`
- `earliest_cr_line`
- `last_pymnt_d`
- `last_credit_pull_d`

Of the remaining 5 cols, the following 4 can be set to *category* dtype:

- `sub_grade`
- `home_ownership`
- `verification_status`
- `purpose`

In [9]:
convert_col_dtypes = {
    "sub_grade": "category", 
    "home_ownership": "category", 
    "verification_status": "category", 
    "purpose": "category"
}

chunk_iter = pd.read_csv(
    'loans_2007.csv', 
    chunksize=3000, 
    dtype=convert_col_dtypes, 
    parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
)

chunk = next(chunk_iter)
term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
revol_cleaned = chunk['revol_util'].str.rstrip("%")
chunk['term'] = pd.to_numeric(term_cleaned)
chunk['revol_util'] = pd.to_numeric(revol_cleaned)    
chunk.dtypes

id                                     int64
member_id                            float64
loan_amnt                            float64
funded_amnt                          float64
funded_amnt_inv                      float64
term                                   int64
int_rate                              object
installment                          float64
grade                                 object
sub_grade                           category
emp_title                             object
emp_length                            object
home_ownership                      category
annual_inc                           float64
verification_status                 category
issue_d                       datetime64[ns]
loan_status                           object
pymnt_plan                            object
purpose                             category
title                                 object
zip_code                              object
addr_state                            object
dti       

## Optimising numeric columns

It looks like we were able to realize some powerful memory savings by converting to the category type and converting string columns to numeric ones.

In [10]:
def is_integer(vals):
    for v in vals:
        if not np.isnan(v):
            split = str(v).split('.')
            if int(split[-1]) != 0:
                return False
    return True

In [11]:
chunk_iter = pd.read_csv(
    'loans_2007.csv', 
    chunksize=3000, 
    dtype=convert_col_dtypes, 
    parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
)

chunk = next(chunk_iter)
term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
revol_cleaned = chunk['revol_util'].str.rstrip("%")
chunk['term'] = pd.to_numeric(term_cleaned)
chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    
## checking if float data are actually integers
float_cols = chunk.select_dtypes(include=['float'])
float_cols.agg(is_integer)

member_id                      True
loan_amnt                      True
funded_amnt                    True
funded_amnt_inv               False
installment                   False
annual_inc                    False
dti                           False
delinq_2yrs                    True
inq_last_6mths                 True
open_acc                       True
pub_rec                        True
revol_bal                      True
revol_util                    False
total_acc                      True
out_prncp                     False
out_prncp_inv                 False
total_pymnt                   False
total_pymnt_inv               False
total_rec_prncp               False
total_rec_int                 False
total_rec_late_fee            False
recoveries                    False
collection_recovery_fee       False
last_pymnt_amnt               False
collections_12_mths_ex_med     True
policy_code                    True
acc_now_delinq                 True
chargeoff_within_12_mths    

The following *float* columns can be converted to *int*:

- `member_id`
- `loan_amnt`
- `funded_amnt`
- `delinq_2yrs`
- `inq_last_6mths`
- `open_acc`
- `pub_rec`
- `revol_bal`
- `total_acc`
- `out_prncp`
- `out_prncp_inv`
- `collections_12_mths_ex_med`
- `policy_code`
- `acc_now_delinq`
- `chargeoff_within_12_mths`
- `delinq_amnt`
- `pub_rec_bankruptcies`
- `tax_liens`

In [18]:
convert_col_dtypes = {
    "sub_grade": "category", 
    "home_ownership": "category", 
    "verification_status": "category", 
    "purpose": "category",
}

int_col_names = [
    'member_id', 'loan_amnt', 'funded_amnt', 'delinq_2yrs', 'inq_last_6mths', 
    'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp',
    'out_prncp_inv', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths',
    'delinq_amnt'
]

chunk_iter = pd.read_csv(
    'loans_2007.csv', 
    chunksize=3000, 
    dtype=convert_col_dtypes, 
    parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
)

chunk = next(chunk_iter)
term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
revol_cleaned = chunk['revol_util'].str.rstrip("%")
chunk['term'] = pd.to_numeric(term_cleaned)
chunk['revol_util'] = pd.to_numeric(revol_cleaned)
chunk[int_col_names] = chunk[int_col_names].fillna(0).astype('int64')
chunk.dtypes

id                                     int64
member_id                              int64
loan_amnt                              int64
funded_amnt                            int64
funded_amnt_inv                      float64
term                                   int64
int_rate                              object
installment                          float64
grade                                 object
sub_grade                           category
emp_title                             object
emp_length                            object
home_ownership                      category
annual_inc                           float64
verification_status                 category
issue_d                       datetime64[ns]
loan_status                           object
pymnt_plan                            object
purpose                             category
title                                 object
zip_code                              object
addr_state                            object
dti       

In [25]:
chunk_iter = pd.read_csv(
    'loans_2007.csv', 
    chunksize=3000, 
    dtype=convert_col_dtypes, 
    parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
)

total_rows = 0
total_mem_usage = 0
for i, chunk in enumerate(chunk_iter):
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    chunk[int_col_names] = chunk[int_col_names].fillna(0).astype('int64')
    mem_usage = chunk.memory_usage(deep=True).sum()/(1024*1024)
    print('Chunk {}: {:.2f} MB'.format(i, mem_usage))
    total_rows += len(chunk)
    total_mem_usage += mem_usage
print('{:.3f} MB total memory usage for the dataset'.format(total_mem_usage))
print('{} total rows in the dataset'.format(total_rows))

Chunk 0: 2.94 MB
Chunk 1: 2.94 MB
Chunk 2: 2.94 MB
Chunk 3: 2.94 MB
Chunk 4: 2.94 MB
Chunk 5: 2.94 MB
Chunk 6: 2.94 MB
Chunk 7: 2.94 MB
Chunk 8: 2.94 MB
Chunk 9: 2.94 MB
Chunk 10: 2.95 MB
Chunk 11: 2.96 MB
Chunk 12: 2.96 MB
Chunk 13: 3.20 MB
Chunk 14: 0.58 MB
42.061 MB total memory usage for the dataset
42538 total rows in the dataset


In total, we were able to reduce the dataset memory usage from almost 175MB, down to 42MB. That is a significant difference, and would have huge implications especially if we were working on a larger scale. 

This was achieved with only a few actions of optimising column data types as required. Some to string columns were converted to categories and numeric types, while some float types went to integers. All helped in optimising the storage, along with some basic cleaning. 

## Next steps...

In this project we were able to optimise a dataframe's memory footprint and work with dataframe chunks. Here's an idea for some next steps:

- Create a function that automates the work we just did, so that we can use it on other Lending Club data sets. This function should:
    - Determine the optimal chunk size based on the memory constraints provided
    - Determine which string columns can be converted to numeric ones by removing the % character
    - Determine which numeric columns can be converted to more space efficient representations