# Optimizing Dataframes and Processing in Chunks

In this project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors. We can read more about the marketplace [on its website](https://www.lendingclub.com/public/how-peer-lending-works.action).

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214463677) that Lending Club charges.

If we read in the entire data set, it will consume about 67 megabytes of memory. Let's imagine that we only have __10 megabytes of memory available__ throughout this project.

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 99

In [2]:
# Read the first five rows to check for data quality issues
first_five = pd.read_csv("loans_2007.csv", nrows=5)
first_five

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [3]:
print(first_five.dtypes)

id                              int64
member_id                     float64
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
term                           object
int_rate                       object
installment                   float64
grade                          object
sub_grade                      object
emp_title                      object
emp_length                     object
home_ownership                 object
annual_inc                    float64
verification_status            object
issue_d                        object
loan_status                    object
pymnt_plan                     object
purpose                        object
title                          object
zip_code                       object
addr_state                     object
dti                           float64
delinq_2yrs                   float64
earliest_cr_line               object
inq_last_6mths                float64
open_acc    

In [4]:
print(first_five.shape)

(5, 52)


In [5]:
# Measure the memory used by the first 1000 rows, then size the chunks accordingly
first_1000 = pd.read_csv("loans_2007.csv", nrows=1000)
print(first_1000.memory_usage(deep=True).sum() / (1024 * 1024))

1.5502548217773438


Since we can spare 10 MB of memory for the dataframe, let's read the file in chunks of 3000 rows. At about 1.55 MB per 1000 rows, each chunk should take a little under 5 MB, which leaves a comfortable safety margin.
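
The chunk size can be estimated from the 1000-row sample measured above. The sketch below is only illustrative: `estimate_chunksize` is a hypothetical helper (not part of pandas), and it assumes memory grows roughly linearly with the number of rows, which may not hold if later rows contain much longer strings.

```python
import math


def estimate_chunksize(sample_mb: float, sample_rows: int, target_mb: float) -> int:
    """Estimate how many rows fit in target_mb, assuming linear growth."""
    mb_per_row = sample_mb / sample_rows
    return math.floor(target_mb / mb_per_row)


# The 1000-row sample took about 1.55 MB
sample_mb, sample_rows = 1.55, 1000

# Largest chunk that should stay under a 5 MB target
max_rows = estimate_chunksize(sample_mb, sample_rows, target_mb=5.0)
print(max_rows)

# Projected size of a 3000-row chunk, in MB
projected_mb = 3000 * sample_mb / sample_rows
print(round(projected_mb, 2))
```

Rounding the estimate down to 3000 rows keeps the arithmetic simple and leaves some headroom.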

In [6]:
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum() / (1024 * 1024))

4.649059295654297
4.644805908203125
4.646563529968262
4.647915840148926
4.644108772277832
4.645991325378418
4.644582748413086
4.646951675415039
4.645077705383301
4.64512825012207
4.657840728759766
4.656707763671875
4.663515090942383
4.896956443786621
0.880854606628418


As expected, each full chunk takes a little under 5 MB (between 4.6 and 4.9 MB), well within the 10 MB limit.

### Total rows in the dataset

In [7]:
total_rows = 0
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    total_rows += len(chunk)
print(total_rows)

42538


## Exploring the Data in Chunks

Working chunk by chunk, we will:
- count the numeric columns and the string columns;
- identify string columns, if any, where the number of unique values is less than 50% of the non-null entries in the column (we can read these in as the `category` type when using `pandas.read_csv()`);
- identify float columns that have no missing values, since some of these could be candidates for conversion to an integer type;
- compute the total memory usage across all of the chunks.
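
To illustrate why a string column with few distinct values is a good `category` candidate, here is a minimal sketch on synthetic data. The `grades` series is a made-up stand-in for a column like `grade`, not the actual loans data.

```python
import pandas as pd

# Synthetic stand-in for a low-cardinality string column such as `grade`
grades = pd.Series(["A", "B", "C", "B", "A"] * 1000)

# Bytes used when stored as strings versus as a categorical
as_strings = grades.memory_usage(deep=True)
as_category = grades.astype("category").memory_usage(deep=True)

# Share of distinct values among non-null entries (the 50% rule of thumb)
unique_ratio = grades.nunique() / grades.count()

print(as_strings, as_category, unique_ratio)
```

A categorical stores each distinct value once and keeps small integer codes per row, so the saving grows as the ratio of unique values to rows shrinks.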

In [8]:
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
# to store number of string columns in each chunk
num_string_cols = []
num_numeric_cols = []
for chunk in chunk_iter:
    string_cols = len(chunk.select_dtypes(include=["object"]).columns)
    numeric_cols = len(chunk.select_dtypes(include=[np.number]).columns)
    num_string_cols.append(string_cols)
    num_numeric_cols.append(numeric_cols)
print(num_string_cols)
print(num_numeric_cols)

[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]
[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]


We observe that every chunk has 21 object columns and 31 numeric columns, except for the last two chunks, where the counts are 22 and 30. Let's list the object and numeric column names for these last two chunks and compare them with the first chunk.

In [9]:
total_chunks = len(num_string_cols)
first_chunk_string_cols = []
first_chunk_numeric_cols = []
second_last_chunk_string_cols = []
second_last_chunk_numeric_cols = []
last_chunk_string_cols = []
last_chunk_numeric_cols = []
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk_no, chunk in enumerate(chunk_iter, start=1):
    if chunk_no == 1:
        first_chunk_string_cols = chunk.select_dtypes(include=["object"]).columns.tolist()
        first_chunk_numeric_cols = chunk.select_dtypes(include=[np.number]).columns.tolist()
    elif chunk_no == total_chunks - 1:
        second_last_chunk_string_cols = chunk.select_dtypes(include=["object"]).columns.tolist()
        second_last_chunk_numeric_cols = chunk.select_dtypes(include=[np.number]).columns.tolist()
    elif chunk_no == total_chunks:
        last_chunk_string_cols = chunk.select_dtypes(include=["object"]).columns.tolist()
        last_chunk_numeric_cols = chunk.select_dtypes(include=[np.number]).columns.tolist()

In [10]:
print(first_chunk_string_cols)
print(first_chunk_numeric_cols)

['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']


In [11]:
print(second_last_chunk_string_cols)
print(second_last_chunk_numeric_cols)

['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']


In [12]:
print(last_chunk_string_cols)
print(last_chunk_numeric_cols)

['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']


In [13]:
print(set(second_last_chunk_string_cols) - set(first_chunk_string_cols))

{'id'}


In [14]:
print(set(last_chunk_string_cols) - set(first_chunk_string_cols))

{'id'}


In [15]:
print(set(first_chunk_numeric_cols) - set(second_last_chunk_numeric_cols))

{'id'}


In [16]:
print(set(first_chunk_numeric_cols) - set(last_chunk_numeric_cols))

{'id'}


The `id` column is read as a string in the last two chunks instead of as a numeric value, which explains why those chunks have one more string column and one fewer numeric column than the others. Since the `id` column won't be useful for analysis, visualization, or predictive modelling, let's ignore it.
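
The sketch below reproduces this kind of per-chunk dtype mismatch on a tiny in-memory CSV. It rests on an assumption: that the last chunks of the real file contain at least one non-numeric `id` entry. The CSV text here is invented for illustration, and `usecols` is shown as one possible way to skip the column, not necessarily the approach used in this project.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the third row has a non-numeric `id`
csv_text = "id,loan_amnt\n1,5000\n2,2500\nnot-a-number,2400\n4,10000\n"

is_numeric_per_chunk = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # dtypes are inferred separately for every chunk
    is_numeric_per_chunk.append(pd.api.types.is_numeric_dtype(chunk["id"]))

print(is_numeric_per_chunk)

# Skipping the column at read time avoids the inconsistency altogether
kept_cols = pd.read_csv(
    io.StringIO(csv_text), usecols=lambda name: name != "id"
).columns.tolist()
print(kept_cols)
```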

### Finding string columns for optimization

In [17]:
col_data = {}
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    string_cols = chunk.select_dtypes(include=["object"]).columns.tolist()
    for col in string_cols:
        # Create the entry the first time a column is seen
        stats = col_data.setdefault(col, {"uniques": set(), "total_non_nulls": 0})
        stats["uniques"].update(chunk[col].dropna().unique())
        # count() returns the number of non-null values
        stats["total_non_nulls"] += chunk[col].count()

In [18]:
for col, data in col_data.items():
    print(col, len(data["uniques"]))

term 2
int_rate 394
grade 7
sub_grade 35
emp_title 30658
emp_length 11
home_ownership 5
verification_status 3
issue_d 55
loan_status 9
pymnt_plan 2
purpose 14
title 21264
zip_code 837
addr_state 50
earliest_cr_line 530
revol_util 1119
initial_list_status 1
last_pymnt_d 103
last_credit_pull_d 108
application_type 1
id 3538


We are interested in string columns where the unique values make up less than 50% of the non-null values in the column:

In [19]:
# candidate string columns for optimization
cand_string_cols_opt = []
for col, data in col_data.items():
    if (len(data["uniques"]) / data["total_non_nulls"]) < 0.5:
        print(col, len(data["uniques"]))
        cand_string_cols_opt.append(col)

term 2
int_rate 394
grade 7
sub_grade 35
emp_length 11
home_ownership 5
verification_status 3
issue_d 55
loan_status 9
pymnt_plan 2
purpose 14
zip_code 837
addr_state 50
earliest_cr_line 530
revol_util 1119
initial_list_status 1
last_pymnt_d 103
last_credit_pull_d 108
application_type 1


In [20]:
col_val_counts = {}
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    for col in cand_string_cols_opt:
        if col in col_val_counts:
            col_val_counts[col].append(chunk[col].value_counts())
        else:
            col_val_counts[col] = [chunk[col].value_counts()]

In [21]:
col_val_counts_overall = {}
for col in col_val_counts:
    combined_vc_series = pd.concat(col_val_counts[col])
    col_vc_final = combined_vc_series.groupby(combined_vc_series.index).sum()
    col_val_counts_overall[col] = col_vc_final
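
The merge performed in the cell above (concatenate the per-chunk counts, then sum the rows that share a label) can be checked on a small example. The two series below are hypothetical chunks, not the real data.

```python
import pandas as pd

# Two hypothetical chunks of a `term`-like column
chunk_a = pd.Series(["36 months", "36 months", "60 months"])
chunk_b = pd.Series(["60 months", "36 months"])

per_chunk_counts = [chunk_a.value_counts(), chunk_b.value_counts()]

# Stack the partial counts, then add up rows that share the same label
stacked = pd.concat(per_chunk_counts)
combined = stacked.groupby(stacked.index).sum()

# Reference result computed on the full data in one go
expected = pd.concat([chunk_a, chunk_b]).value_counts()

print(combined.to_dict() == expected.to_dict())
```

Only the small per-chunk count tables are kept between iterations, so the full column never has to be held in memory at once.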

In [22]:
for col in col_val_counts_overall:
    print(col)
    print(col_val_counts_overall[col])
    print("---------------------------------------------------------------\n")

term
 36 months    31534
 60 months    11001
Name: term, dtype: int64
---------------------------------------------------------------

int_rate
  5.42%    573
  5.79%    410
  5.99%    347
  6.00%     19
  6.03%    447
          ... 
 23.59%      4
 23.91%     11
 24.11%      3
 24.40%      1
 24.59%      1
Name: int_rate, Length: 394, dtype: int64
---------------------------------------------------------------

grade
A    10183
B    12389
C     8740
D     6016
E     3394
F     1301
G      512
Name: grade, dtype: int64
---------------------------------------------------------------

sub_grade
A1    1142
A2    1520
A3    1823
A4    2905
A5    2793
B1    1882
B2    2113
B3    2997
B4    2590
B5    2807
C1    2264
C2    2157
C3    1658
C4    1370
C5    1291
D1    1053
D2    1485
D3    1322
D4    1140
D5    1016
E1     884
E2     791
E3     668
E4     552
E5     499
F1     392
F2     308
F3     236
F4     211
F5     154
G1     141
G2     107
G3      79
G4      99
G5      86
Name: sub_grade

Based on the value counts above, we can read the following columns with the `category` dtype instead of the default `object` dtype when calling `pandas.read_csv()`:
- `grade`
- `sub_grade`
- `emp_length`
- `home_ownership`
- `verification_status`
- `loan_status`
- `purpose`
- `addr_state`


We can also observe that it's better if we read certain columns as dates instead of `object`:
- `issue_d`
- `earliest_cr_line`
- `last_pymnt_d`
- `last_credit_pull_d`

We can also clean and convert the columns `term` and `int_rate` to numeric types.
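
A minimal sketch of these conversions on a few sample values follows. The series are hypothetical, shaped like the columns above. `pd.to_numeric` and an explicit `format="%b-%Y"` are assumptions chosen for this illustration; they may differ from the calls used in the cells below.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the columns above
term = pd.Series([" 36 months", " 60 months", None])
int_rate = pd.Series([" 10.65%", " 15.27%", None])
issue_d = pd.Series(["Dec-2011", "Jan-1985", None])

# Keep the leading number of months; missing values become NaN
term_num = pd.to_numeric(term.str.strip().str.split().str[0])

# Drop the trailing percent sign before converting
int_rate_num = pd.to_numeric(int_rate.str.strip().str.rstrip("%"))

# An explicit format avoids guessing how "Dec-2011" should be parsed
issue_date = pd.to_datetime(issue_d, format="%b-%Y")

print(term_num.tolist())
print(int_rate_num.tolist())
print(issue_date.dt.year.tolist())
```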

In [23]:
# Passed later as the `dtype` argument of `pandas.read_csv()`
wanted_dtypes = {
    "grade": "category",
    "sub_grade": "category",
    "emp_length": "category",
    "home_ownership": "category",
    "verification_status": "category",
    "loan_status": "category",
    "purpose": "category",
    "addr_state": "category"
}

### Finding float columns that have no missing values and could be candidates for conversion to integer type

In [24]:
missing_value_counts_float_cols = []
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    chunk["term"] = chunk["term"].str.strip().str.split().str[0]
    chunk["term"] = chunk["term"].astype(np.float64)
    chunk["int_rate"] = chunk["int_rate"].str.strip().str.rstrip("%")
    chunk["int_rate"] = chunk["int_rate"].astype(np.float64)
    float_cols = chunk.select_dtypes(include="float")
    missing_value_counts_float_cols.append(float_cols.isnull().sum())
    
missing_value_counts_combined = pd.concat(missing_value_counts_float_cols)
missing_value_counts_final = missing_value_counts_combined.groupby(missing_value_counts_combined.index).sum()
missing_value_counts_final

acc_now_delinq                  32
annual_inc                       7
chargeoff_within_12_mths       148
collection_recovery_fee          3
collections_12_mths_ex_med     148
delinq_2yrs                     32
delinq_amnt                     32
dti                              3
funded_amnt                      3
funded_amnt_inv                  3
inq_last_6mths                  32
installment                      3
int_rate                         3
last_pymnt_amnt                  3
loan_amnt                        3
member_id                        3
open_acc                        32
out_prncp                        3
out_prncp_inv                    3
policy_code                      3
pub_rec                         32
pub_rec_bankruptcies          1368
recoveries                       3
revol_bal                        3
tax_liens                      108
term                             3
total_acc                       32
total_pymnt                      3
total_pymnt_inv     

Every column has one or more null values, so none of these float columns are candidates for conversion to integer columns. However, we can still downcast the `np.float64` columns to save memory. Before doing so in a real analysis, we should consider which computations we need on these columns, because a smaller float type stores fewer significant digits and can lose precision or overflow. For the purpose of demonstrating the effect of our memory optimization, we will downcast all `np.float64` columns.
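
The trade-off can be seen on a small example. The `amounts` series is hypothetical, and its values were chosen to be exactly representable as `float32`; how `pd.to_numeric` treats values that are not may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical float column with a missing value, as in the loans data
amounts = pd.Series([5000.0, 2500.5, np.nan, 10000.25])

# downcast="float" converts to float32 when the values allow it
small = pd.to_numeric(amounts, downcast="float")
print(amounts.dtype, small.dtype)

# Bytes used by the values alone (index excluded)
print(amounts.memory_usage(index=False), small.memory_usage(index=False))

# float32 has a 24-bit significand, so large whole numbers can lose precision
big = np.float64(16_777_217.0)  # 2**24 + 1
print(float(np.float32(big)) == float(big))
```

Halving the bytes per value is the gain; the last line shows the cost, since 2**24 + 1 cannot be stored exactly as a `float32`.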

In [25]:
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    chunk["term"] = chunk["term"].str.strip().str.split().str[0]
    chunk["term"] = chunk["term"].astype(np.float64)
    chunk["int_rate"] = chunk["int_rate"].str.strip().str.rstrip("%")
    chunk["int_rate"] = chunk["int_rate"].astype(np.float64)
    float_cols_names = chunk.select_dtypes(include="float").columns
    for col in float_cols_names:
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")

## Memory Usage when chunk dataframes are still unoptimized

In [26]:
mem_usage_MB = 0
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    size = (chunk.memory_usage(deep=True).sum()) / (1024 * 1024)
    mem_usage_MB += size
print(mem_usage_MB)

66.21605968475342


## Memory Usage upon optimization

In [27]:
mem_usage_opt_MB = 0
chunk_iter = pd.read_csv("loans_2007.csv", chunksize=3000, dtype=wanted_dtypes,
                         parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d",
                                      "last_credit_pull_d"]
                        )
for chunk in chunk_iter:
    chunk["term"] = chunk["term"].str.strip().str.split().str[0]
    chunk["term"] = chunk["term"].astype(np.float64)
    chunk["int_rate"] = chunk["int_rate"].str.strip().str.rstrip("%")
    chunk["int_rate"] = chunk["int_rate"].astype(np.float64)
    float_cols_names = chunk.select_dtypes(include="float").columns
    for col in float_cols_names:
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    
    mem_usage_opt_MB += chunk.memory_usage(deep=True).sum() / (1024 * 1024)

print(mem_usage_opt_MB)

26.542840003967285


Thus, after optimizing our columns, total memory usage across all chunks drops from 66.2 MB to 26.5 MB, a reduction of roughly 60%.