# Optimizing DataFrames and Processing in Chunks

In this project, we'll practice with chunked dataframes and optimize a dataframe's memory usage. We'll work with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace [on its website](https://www.lendingclub.com/help/personal-loan-faq).

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to lend. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the origination fee that Lending Club charges.

We'll work with a dataset of loans approved from 2007-2011. Although Lending Club no longer hosts the data, a comprehensive view of the data is available on Kaggle [here](https://www.kaggle.com/datasets/wordsforthewise/lending-club/data). The `desc` column has been removed to make the system run faster.

If we read in the entire dataset, it consumes about 65 megabytes of memory. Let's imagine that we only have 10 megabytes of memory available throughout this project, so that everything needs to be processed in chunks. We'll start by reading in and checking the first five rows.

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = 99

loans_chunk = pd.read_csv('loans_2007.csv', nrows=5)
loans_chunk

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


There are no apparent quality issues. Next, we read in 1,000 rows and check their total memory usage. From that, we can roughly estimate how large each chunk can be if we want to stay under 5 MB (half of our budget, to be on the conservative side).

In [2]:
loans_chunk = pd.read_csv('loans_2007.csv', nrows=1000)
loans_chunk_memory_usage = loans_chunk.memory_usage(deep=True).sum() / (1024 ** 2)
print(loans_chunk_memory_usage)

1.5273666381835938


A 1,000-row chunk takes up approx. 1.53 MB, so chunks of 3,000 rows each (roughly 4.6 MB) stay under our 5 MB limit. Let's check how many rows the whole dataset has in total.

In [3]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_rows = 0
for chunk in chunk_iter:
    total_rows += len(chunk)
print(total_rows)

42538
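As an aside on the chunk size chosen above: the estimate is simple arithmetic on the 1,000-row measurement. The sketch below uses a hypothetical helper, `rows_per_chunk`, which is not part of the original notebook, and it assumes that the average row size in the sample is representative of the whole file.

```python
import math

def rows_per_chunk(sample_rows, sample_mb, budget_mb):
    """Estimate how many rows fit into budget_mb, given a measured sample."""
    mb_per_row = sample_mb / sample_rows
    return math.floor(budget_mb / mb_per_row)

# Values taken from the 1000-row measurement above; 5 MB is our self-imposed limit.
estimate = rows_per_chunk(sample_rows=1000, sample_mb=1.5273666381835938, budget_mb=5)
print(estimate)
```

The estimate comes out somewhat above 3,000 rows. Rounding down to 3,000 leaves some headroom, which is useful because string columns can make individual rows larger than the sample average.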


Lastly, let's check how much memory is used overall (summing up over all chunks).


In [4]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_memory_usage = 0
for chunk in chunk_iter:
    total_memory_usage += chunk.memory_usage(deep=True).sum() / (1024 ** 2)
print(total_memory_usage)

65.24251079559326
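The two passes above (row count and memory usage) could also be combined into a single pass over the file. The following is a minimal sketch on a small in-memory CSV with made-up values, standing in for `loans_2007.csv`; it additionally tracks the largest chunk, which is the number that matters for a memory budget.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for loans_2007.csv (hypothetical data).
csv_text = "loan_amnt,grade\n" + "\n".join(
    f"{1000 + i},{'ABC'[i % 3]}" for i in range(10)
)

total_rows = 0
total_mb = 0.0
peak_chunk_mb = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_mb = chunk.memory_usage(deep=True).sum() / (1024 ** 2)
    total_rows += len(chunk)                       # running row count
    total_mb += chunk_mb                           # summed size of all chunks
    peak_chunk_mb = max(peak_chunk_mb, chunk_mb)   # largest single chunk seen

print(total_rows, total_mb, peak_chunk_mb)
```

Note that summing per-chunk sizes counts each chunk's index separately, so the total is an approximation of what a single dataframe would occupy.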


## Exploring the Data in Chunks

Let's find out how many columns have a numeric and how many have a string type.

In [5]:
# Counting numeric types by chunk
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
numeric_counts =  [len(chunk.select_dtypes(['float64', 'int64']).columns) for chunk in chunk_iter]
print(numeric_counts)

# Counting string types by chunk
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
string_counts =  [len(chunk.select_dtypes(['O']).columns) for chunk in chunk_iter]
print(string_counts)

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


In most chunks, 31 columns are read as numeric and 21 as strings, but in the last two chunks the split is 30 and 22. Let's check which column(s) are responsible for this.

In [6]:
# Identify columns with inconsistent dtypes
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
dtypes_list = [chunk.dtypes for chunk in chunk_iter]

inconsistent_columns = set()
for i in range(1, len(dtypes_list)):
    diff = dtypes_list[i].compare(dtypes_list[i-1])
    inconsistent_columns.update(diff.index)

print(list(inconsistent_columns))

['id']


The `id` column seems to contain data in the last two chunks that is interpreted as a string. It is not a crucial variable for the analysis, so we will go on treating it as a string.
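To illustrate how a column can change type between chunks, here is a small sketch on a made-up in-memory file. It assumes that the cause is a non-numeric entry late in the file, which is one plausible explanation and not something verified on the loans data. Passing an explicit `dtype` for the column makes every chunk agree.

```python
import io
import pandas as pd

# Hypothetical miniature file: the final line holds text where an id is expected.
csv_text = "\n".join([
    "id,loan_amnt",
    "1,5000",
    "2,2500",
    "3,2400",
    "4,10000",
    "Total rows: 4,",
])

# Without a dtype hint, pandas infers the type of `id` separately for each chunk.
inferred = [str(chunk["id"].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

# With an explicit dtype, every chunk uses the same type for `id`.
forced = [str(chunk["id"].dtype)
          for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                   dtype={"id": str})]

print(inferred)
print(forced)
```

In this toy file, the first chunks are inferred as integers and only the last chunk falls back to a string type, while the `forced` list shows a single consistent type.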

Let's check how many non-missing values each string column has, how many of those are unique, and what the share is of unique values to total non-missing values in each string column. Those columns with less than 50% of unique values are candidates for a conversion into the categorical datatype.

In [7]:
# How many non-missing values in each string column?

# Create a list of object columns
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
obj_cols = set()  # any column that at least once is counted as object will be added to this set
for chunk in chunk_iter:
    obj_cols.update(chunk.select_dtypes(include=['object']).columns)
obj_cols = list(obj_cols)  # convert set to list for easier handling

# Count non-missing values, unique values and share of unique in each object column
obj_dict = {}
for col in obj_cols:
    chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=[col])
    non_missing = [chunk[col].dropna() for chunk in chunk_iter]
    combined_non_missing = pd.concat(non_missing)
    total_non_missing = len(combined_non_missing)
    unique_non_missing = combined_non_missing.nunique()
    share_unique = unique_non_missing / total_non_missing
    obj_dict[col] = {
        'total_non_missing': total_non_missing,
        'unique_non_missing': unique_non_missing,
        'share_unique': share_unique
    }

print(obj_dict)

{'sub_grade': {'total_non_missing': 42535, 'unique_non_missing': 35, 'share_unique': 0.0008228517691313037}, 'home_ownership': {'total_non_missing': 42535, 'unique_non_missing': 5, 'share_unique': 0.00011755025273304338}, 'revol_util': {'total_non_missing': 42445, 'unique_non_missing': 1119, 'share_unique': 0.026363529273177054}, 'verification_status': {'total_non_missing': 42535, 'unique_non_missing': 3, 'share_unique': 7.053015163982603e-05}, 'emp_length': {'total_non_missing': 41423, 'unique_non_missing': 11, 'share_unique': 0.0002655529536730802}, 'term': {'total_non_missing': 42535, 'unique_non_missing': 2, 'share_unique': 4.702010109321735e-05}, 'pymnt_plan': {'total_non_missing': 42535, 'unique_non_missing': 2, 'share_unique': 4.702010109321735e-05}, 'earliest_cr_line': {'total_non_missing': 42506, 'unique_non_missing': 530, 'share_unique': 0.012468827930174564}, 'emp_title': {'total_non_missing': 39909, 'unique_non_missing': 30658, 'share_unique': 0.7681976496529604}, 'applicat

To make this easier to read, let's print the columns with less than 50% unique and more than 50% unique values separately.

In [8]:
# Only show the columns where the share of unique values is < 0.5
filtered_dict = {k: v for k, v in obj_dict.items() if v['share_unique'] < 0.5}
print(filtered_dict)

print('\n')

# Only show the columns where the share of unique values is > 0.5
filtered_dict = {k: v for k, v in obj_dict.items() if v['share_unique'] > 0.5}
print(filtered_dict)

{'sub_grade': {'total_non_missing': 42535, 'unique_non_missing': 35, 'share_unique': 0.0008228517691313037}, 'home_ownership': {'total_non_missing': 42535, 'unique_non_missing': 5, 'share_unique': 0.00011755025273304338}, 'revol_util': {'total_non_missing': 42445, 'unique_non_missing': 1119, 'share_unique': 0.026363529273177054}, 'verification_status': {'total_non_missing': 42535, 'unique_non_missing': 3, 'share_unique': 7.053015163982603e-05}, 'emp_length': {'total_non_missing': 41423, 'unique_non_missing': 11, 'share_unique': 0.0002655529536730802}, 'term': {'total_non_missing': 42535, 'unique_non_missing': 2, 'share_unique': 4.702010109321735e-05}, 'pymnt_plan': {'total_non_missing': 42535, 'unique_non_missing': 2, 'share_unique': 4.702010109321735e-05}, 'earliest_cr_line': {'total_non_missing': 42506, 'unique_non_missing': 530, 'share_unique': 0.012468827930174564}, 'application_type': {'total_non_missing': 42535, 'unique_non_missing': 1, 'share_unique': 2.3510050546608674e-05}, 'i

As it turns out, the majority of string columns have far fewer than 50% unique values. Only `id`, `emp_title`, and `title` have so many unique values that keeping them as strings makes sense. The other string columns could potentially be converted to the categorical type.
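To see why a low share of unique values makes a column a good candidate, consider this sketch with synthetic data. The values are made up and only mimic a low-cardinality column such as `grade`. A categorical column stores each distinct string once and keeps small integer codes per row.

```python
import pandas as pd

# Synthetic stand-in for a low-cardinality string column (values are made up).
grades = pd.Series(["A", "B", "C", "D"] * 2500, dtype="object")
as_category = grades.astype("category")

object_bytes = grades.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.codes.dtype)  # integer codes used internally
```

With only four distinct values, the codes fit into 8-bit integers, so the categorical version needs a small fraction of the memory. For a column where nearly every value is unique, the saving disappears, because every string still has to be stored once.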

Let's check which of the float columns have no missing values and might be candidates for a conversion to the integer type.

In [13]:
# Which float columns have no missing values?
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
float_cols = set()
for chunk in chunk_iter:
    float_cols.update(chunk.select_dtypes(include=['float64']).columns)
float_cols = list(float_cols)

float_dict = {}
for col in float_cols:
    chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=[col])
    non_missing_cases = [chunk[~(chunk.isnull())].count() for chunk in chunk_iter]
    total_non_missing = pd.concat(non_missing_cases).sum()
    share_non_missing = total_non_missing / (total_rows - 1)
    float_dict[col] = {
        'total_non_missing': total_non_missing,
        'share_non_missing': share_non_missing
    }

print(float_dict)

{'total_rec_int': {'total_non_missing': np.int64(42535), 'share_non_missing': np.float64(0.9999529821096927)}, 'tax_liens': {'total_non_missing': np.int64(42430), 'share_non_missing': np.float64(0.9974845428685615)}, 'loan_amnt': {'total_non_missing': np.int64(42535), 'share_non_missing': np.float64(0.9999529821096927)}, 'out_prncp_inv': {'total_non_missing': np.int64(42535), 'share_non_missing': np.float64(0.9999529821096927)}, 'collection_recovery_fee': {'total_non_missing': np.int64(42535), 'share_non_missing': np.float64(0.9999529821096927)}, 'chargeoff_within_12_mths': {'total_non_missing': np.int64(42390), 'share_non_missing': np.float64(0.9965441850624163)}, 'dti': {'total_non_missing': np.int64(42535), 'share_non_missing': np.float64(0.9999529821096927)}, 'total_rec_late_fee': {'total_non_missing': np.int64(42535), 'share_non_missing': np.float64(0.9999529821096927)}, 'pub_rec_bankruptcies': {'total_non_missing': np.int64(41170), 'share_non_missing': np.float64(0.96786327197498

It turns out that every float column has at least one missing value. Because NumPy's integer types cannot represent `NaN`, we can't simply convert these columns to integers.
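The following sketch shows the problem on a made-up series: casting floats with a gap to a NumPy integer type fails, while pandas' nullable `Int64` type keeps the gap as `<NA>`. The values are illustrative and not taken from the loans data.

```python
import numpy as np
import pandas as pd

counts = pd.Series([0.0, 1.0, np.nan, 3.0])  # made-up values with one gap

try:
    counts.astype("int64")
    numpy_cast_ok = True
except ValueError:
    # NumPy integers have no representation for NaN, so the cast is refused.
    numpy_cast_ok = False

# pandas' nullable integer dtype keeps the missing entry as <NA>.
nullable = counts.astype("Int64")

print(numpy_cast_ok, nullable.dtype, nullable.isna().sum())
```

The nullable type is therefore one way to store whole-number columns that contain missing values without falling back to floats.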

## Optimizing Numeric Columns

We can automate the steps above with a function, `load_optimized_dataframe`, that assigns a suitable `dtype` to each column. This is what we do here.

In [22]:
def load_optimized_dataframe(filename, sample_fraction=0.1, chunksize=1000, use_float_for_nan_ints=False,
                             use_float_for_nan_bools=False, encoding='latin1', **kwargs):
    sample_list = []
    chunk_iter = pd.read_csv(filename, encoding=encoding, chunksize=chunksize, **kwargs)

    for chunk in chunk_iter:
        sample_list.append(chunk.sample(frac=sample_fraction, random_state=1))

    # Combine the per-chunk samples; each chunk was already sampled above,
    # so sampling again here would shrink the sample to sample_fraction ** 2
    file_sample = pd.concat(sample_list)

    dtype_dict = {}
    parse_dates = []

    for col in file_sample.columns:
        col_data = file_sample[col].dropna()

        if col_data.dtype == 'object':
            try:
                pd.to_datetime(col_data)
                parse_dates.append(col)
            except (ValueError, TypeError):
                if col_data.nunique() / len(col_data) < 0.5:
                    dtype_dict[col] = 'category'

        elif col_data.dtype == 'float64':
            if col_data.mod(1).eq(0).all():
                min_val, max_val = col_data.min(), col_data.max()

                if set(col_data.unique()).issubset({0, 1}):
                    if file_sample[col].isna().any():
                        dtype_dict[col] = 'float32' if use_float_for_nan_bools else 'boolean'
                    else:
                        dtype_dict[col] = 'bool'
                else:
                    for dtype in [np.int8, np.int16, np.int32, np.int64]:
                        if np.iinfo(dtype).min <= min_val <= max_val <= np.iinfo(dtype).max:
                            dtype_dict[col] = dtype
                            break
                    if file_sample[col].isna().any():
                        dtype_dict[col] = 'float32' if use_float_for_nan_ints else 'Int64'
            else:
                dtype_dict[col] = 'float32' if col_data.abs().max() < np.finfo(np.float32).max else 'float64'

        elif col_data.dtype == 'int64':
            min_val, max_val = col_data.min(), col_data.max()

            if set(col_data.unique()).issubset({0, 1}):
                # An int64 column in the sample holds no NaN, so a plain bool suffices here
                dtype_dict[col] = 'bool'
            else:
                for dtype in [np.int8, np.int16, np.int32, np.int64]:
                    if np.iinfo(dtype).min <= min_val <= max_val <= np.iinfo(dtype).max:
                        dtype_dict[col] = dtype
                        break

    df = pd.read_csv(filename, dtype=dtype_dict, parse_dates=parse_dates, encoding=encoding, **kwargs)
    return df

The function draws a sample from each chunk and then uses the combined sample to determine a suitable `dtype` for each column. Since every string column is tested for whether it can be parsed as dates, this will lead to some user warnings, which we can ignore. Let's see how well this function will optimize our dataframe.

In [23]:
loans = load_optimized_dataframe('loans_2007.csv', sample_fraction=1, chunksize=3000)
print(loans.info(memory_usage='deep'))

[Output: repeated user warnings pointing at `pd.to_datetime(col_data)` and at the final `pd.read_csv(...)` call, omitted here]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   id                          42538 non-null  object        
 1   member_id                   42535 non-null  Int64         
 2   loan_amnt                   42535 non-null  Int64         
 3   funded_amnt                 42535 non-null  Int64         
 4   funded_amnt_inv             42535 non-null  float32       
 5   term                        42535 non-null  category      
 6   int_rate                    42535 non-null  category      
 7   installment                 42535 non-null  float32       
 8   grade                       42535 non-null  category      
 9   sub_grade                   42535 non-null  category      
 10  emp_title                   39909 non-null  object        
 11  emp_length                  41423 non-null  category  

Not bad! From initially about 65 MB of memory usage, we managed to bring this down to 16.6 MB, just by assigning the appropriate `dtypes`. Sweet!
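If even the optimized dataframe is too large for the available memory, the same kind of dtype mapping can also be passed to a chunked read. The sketch below assumes a hand-written `dtype_dict` and a small made-up in-memory CSV; neither is produced by the function above.

```python
import io
import pandas as pd

# Made-up miniature file standing in for loans_2007.csv.
csv_text = "term,loan_amnt\n" + "\n".join(
    f"{'36 months' if i % 2 else '60 months'},{1000.0 + i}" for i in range(12)
)

# Hypothetical mapping of the kind a dtype-inference step could produce.
dtype_dict = {"term": "category", "loan_amnt": "float32"}

total = 0.0
seen_dtypes = set()
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtype_dict, chunksize=5):
    seen_dtypes.add(str(chunk["loan_amnt"].dtype))  # dtype applied to this chunk
    total += float(chunk["loan_amnt"].sum())        # aggregate chunk by chunk

print(seen_dtypes, total)
```

This way, each chunk already arrives with the smaller types, and only the running aggregate has to stay in memory.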