# Processing dataframes in chunks

We'll work with a dataset of loans from the [Lending Club approved](https://www.lendingclub.com/investing/peer-to-peer) from 2007-2011. The entire dataset consums about 67MB of memory, but we want to work with 10MB of memory.

In [113]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns = 99

## Loading and inspecting the data

In [114]:
loans_5 = pd.read_csv('loans_2007.csv',nrows=5)
loans_5

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


From the first five rows, we can see that there are many columns, and we may be able to reduce memory usage by selecting relevant columns and optimising data types.

Before we do that, let's see how many rows we can read in to converge on memory usage under 5 MB.

In [115]:
max_mem = {}
for i in [1000,1500,2000,2500,3000,3500]:
    memory_footprints = []
    chunk_iter = pd.read_csv("loans_2007.csv", chunksize=i)
    for chunk in chunk_iter:
        mem = chunk.memory_usage(deep=True).sum()/(2**20)
        memory_footprints.append(mem)
    max_mem[i] = np.array(memory_footprints).max()   

In [116]:
max_mem

{1000: 1.6121845245361328,
 1500: 2.43115234375,
 2000: 3.2185497283935547,
 2500: 4.02269172668457,
 3000: 4.896956443786621,
 3500: 5.699357986450195}

Based on the chunk sizes tested above, it looks like we should use chunks of 3000 rows to ensure that we don't exceed 5MB of memory.

## Optimising data types

Before we start determining the number of columns with different data types to see which ones can be optimised, let's see how many rows the dataframe has.

In [117]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_rows = 0
for chunk in chunk_iter:
    total_rows += len(chunk)
print(total_rows)

42538


#### Number of numeric and string columns

In [118]:
# Numeric columns
loans_chunks = pd.read_csv('loans_2007.csv',chunksize=3000)

numeric = []
string = []
for lc in loans_chunks:
    nums = lc.select_dtypes(include=[np.number]).shape[1]
    numeric.append(nums)
    strs = lc.select_dtypes(include=['object']).shape[1]
    string.append(strs)

print(numeric)
print(string)

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


It appears that the last 2 chunks have 1 less numerical column. Let's try to find out which one it is

In [119]:
# Are string columns consistent across chunks?
obj_cols = []
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    chunk_obj_cols = chunk.select_dtypes(include=['object']).columns.tolist()
    if len(obj_cols) > 0:
        is_same = obj_cols == chunk_obj_cols
        if not is_same:
            print("overall obj cols:", obj_cols, "\n")
            print("chunk obj cols:", chunk_obj_cols, "\n")    
    else:
        obj_cols = chunk_obj_cols

overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

chunk obj cols: ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 



Conclusion: 31 numeric columns, 21 obj columns.

It appears that the 'id' column is numeric in all but the last 2 chunks. We will see whether to convert it to a suitable numeric data type or to ignore it altogether since it does not contain much useful information.

In [120]:
# Read column names from file
columns = list(pd.read_csv("loans_2007.csv", nrows =1))
print(columns)

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']


#### Unique values in string columns

In [121]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, 
                         usecols =[i for i in columns if i != 'id'])

string_uniques = {}
for chunk in chunk_iter:
    strs = chunk.select_dtypes(include=['object'])
    cols = strs.columns
    for c in cols:
        val_counts = strs[c].value_counts()
        if c in string_uniques:
            string_uniques[c].append(val_counts)
        else:
            string_uniques[c] = [val_counts]

uniques_combined = {}
for col in string_uniques:
    u_concat = pd.concat(string_uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    if u_group.shape[0] / total_rows < 0.5:
        print(col, u_group.shape[0])

term 2
int_rate 394
grade 7
sub_grade 35
emp_length 11
home_ownership 5
verification_status 3
issue_d 55
loan_status 9
pymnt_plan 2
purpose 14
title 21264
zip_code 837
addr_state 50
earliest_cr_line 530
revol_util 1119
initial_list_status 1
last_pymnt_d 103
last_credit_pull_d 108
application_type 1


It appears that all the 20 string columns that we've included have fewer than 50% unique values. 

In [122]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000,
                         usecols =[i for i in columns if i != 'id'])

null_num = {}
for chunk in chunk_iter:
    num = chunk.select_dtypes(include=[np.number])
    cols = num.columns
    for c in cols:
        null_vals = num[c].isna().sum()
        if c in null_num:
            null_num[c].append(null_vals)
        else:
            null_num[c] = [null_vals] 

nulls_combined = {}
for c in null_num:
    n_concat = np.array(null_num[c]).sum()
    nulls_combined[c] = n_concat

In [123]:
nulls_combined

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108}

All of the remaining 30 numeric columns have null values

#### Calculating memory usage across chunks

In [124]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000,
                         usecols =[i for i in columns if i != 'id'])
memory_usage = []
for chunk in chunk_iter:
    mem = chunk.memory_usage(deep=True) / 2**20
    memory_usage.append(mem)

total_mem = np.array(memory_usage).sum()
print(total_mem)

65.70592975616455


All the chunks, excluding the id column, use about 65.7MB of memory in total

### Optimising data types

We can achieve the greatest memory improvements by converting the string columns to a numeric type. Let's convert all the columns where the values are less than 50% unique to the category type, and the columns that contain numeric values to the float type. we'll also check if any of the float columns can be downcast.

In [125]:
loans_5.select_dtypes(include=['object'])

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,last_credit_pull_d,application_type
0,36 months,10.65%,B,B2,,10+ years,RENT,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,Jan-1985,83.7%,f,Jan-2015,Jun-2016,INDIVIDUAL
1,60 months,15.27%,C,C4,Ryder,< 1 year,RENT,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,Apr-1999,9.4%,f,Apr-2013,Sep-2013,INDIVIDUAL
2,36 months,15.96%,C,C5,,10+ years,RENT,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,Nov-2001,98.5%,f,Jun-2014,Jun-2016,INDIVIDUAL
3,36 months,13.49%,C,C1,AIR RESOURCES BOARD,10+ years,RENT,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,Feb-1996,21%,f,Jan-2015,Apr-2016,INDIVIDUAL
4,60 months,12.69%,B,B5,University Medical Group,1 year,RENT,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,Jan-1996,53.9%,f,Jun-2016,Jun-2016,INDIVIDUAL


We can convert the following string columns to numeric types to save space:

- term (remove months)
- int_rate (remove %)
- revol_util (remove %)

We can also convert the following to datetime:

- issue_d
- earliest_cr_line
- last_pymnt_d
- last_credit_pull_d

We can convert the following string columns to categories:

- grade
- sub-grade
- emp_length
- home_ownership
- verification_status
- loan_status
- pymnt_plan
- purpose
- addr_state

Since the initial_list_status and application_type columns only have 1 unique value each, we will drop them.

#### Making changes to string columns

In [130]:
# Creating a chunk iterator that parses date columns, excludes
# less useful columns and converts some to categories
convert_col_dtypes = {"grade": "category", "sub_grade": "category", 
    "home_ownership": "category", "emp-length": "category",
    "loan_status": "category", "pymnt_plan": "category",
    "addr_state": "category","verification_status": "category", 
    "purpose": "category"}

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000,
                         dtype=convert_col_dtypes,
                         parse_dates=["issue_d", 
                                      "earliest_cr_line", 
                                      "last_pymnt_d", 
                                      "last_credit_pull_d"],
                         usecols =[i for i in columns if i 
                                   not in ['id',
                                           'initial_list_status',
                                          'application_type']])

In [131]:
rev_memory_usage = []
for chunk in chunk_iter:
    # Converting some to numeric
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    int_cleaned = chunk['int_rate'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    chunk['int_rate'] = pd.to_numeric(int_cleaned)
    # Downcasting float columns
    num_col = chunk.select_dtypes(include=[np.number])
    ncols = num_col.columns
    for nc in ncols:
        chunk[nc] = pd.to_numeric(chunk[nc], downcast='float')
        
    # checking memory usage
    rev_mem = chunk.memory_usage(deep=True)
    rev_memory_usage.append(rev_mem)

total_mem_rev = np.array(rev_memory_usage).sum()/2**20
print(total_mem_rev)
    
chunk.dtypes

18.19203758239746


member_id                            float32
loan_amnt                            float32
funded_amnt                          float32
funded_amnt_inv                      float32
term                                 float32
int_rate                             float32
installment                          float32
grade                               category
sub_grade                           category
emp_title                             object
emp_length                            object
home_ownership                      category
annual_inc                           float32
verification_status                 category
issue_d                       datetime64[ns]
loan_status                         category
pymnt_plan                          category
purpose                             category
title                                 object
zip_code                              object
addr_state                          category
dti                                  float32
delinq_2yr

Removing some unnecessary columns, converting string columns to numeric, datetime or category columns, and downcasting float columns enabled us to reduce the total memory usage by about 3.6 times from 65.7MB to 18.2MB.

Potential next steps could include creating a function to automate the work I have done, and determining whether more columns could be dropped.