### Exploratory data analysis of the Lending Club peer-to-peer loan portfolio

Lending Club, founded in 2006, was a pioneering peer-to-peer (P2P) lending platform in the US. It provided a marketplace where individual investors could fund loans directly for borrowers seeking personal loans, debt consolidation, or other financial needs.

A snapshot of its lending data taken in April 2019 (covering originations to the end of 2018) can be obtained from Kaggle:
```
#!/bin/bash
curl -L -o ./lending-club.zip https://www.kaggle.com/api/v1/datasets/download/wordsforthewise/lending-club
unzip lending-club.zip
```

First, let's load the data into a pandas dataframe so we can do some simple exploratory analysis.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm.auto import tqdm
tqdm.pandas()

In [2]:
df = pd.read_csv('../lendingclub/accepted_2007_to_2018Q4.csv.gz', low_memory=False)

Drop loans with missing issue date.

In [3]:
df.issue_d.isna().sum()

33

In [4]:
df = df[df.issue_d.notna()].reset_index(drop=True)

Convert timestamps to appropriate data types.

In [5]:
df.issue_d = pd.to_datetime(df.issue_d, format='%b-%Y')
df.last_pymnt_d = pd.to_datetime(df.last_pymnt_d, format='%b-%Y')
df.next_pymnt_d = pd.to_datetime(df.next_pymnt_d, format='%b-%Y')

In [6]:
snapshot_date = df.last_pymnt_d.max()

If no payments have been made at all, populate `last_pymnt_d` with the issue date, then bump any payment dated in the month of issue (month 0) to the following month.

In [7]:
df.last_pymnt_d = df.last_pymnt_d.combine_first(df.issue_d)
df.loc[df.issue_d==df.last_pymnt_d, 'last_pymnt_d'] = df.last_pymnt_d + pd.tseries.offsets.DateOffset(months=1)
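The effect of these two assignments is easiest to see on a few rows. The sketch below uses a made-up three-loan frame (the dates are illustrative, not taken from the dataset) and applies the same fallback and one-month bump.

```python
import pandas as pd

# Hypothetical toy frame: the second loan has never made a payment,
# the third has a payment recorded in the same month it was issued.
toy = pd.DataFrame({
    'issue_d': pd.to_datetime(['2018-01-01', '2018-03-01', '2018-06-01']),
    'last_pymnt_d': pd.to_datetime(['2018-09-01', None, '2018-06-01']),
})

# Fall back to the issue date where no payment date was recorded.
toy['last_pymnt_d'] = toy['last_pymnt_d'].combine_first(toy['issue_d'])

# Move any payment dated in the month of issue to the following month.
same_month = toy['issue_d'] == toy['last_pymnt_d']
toy.loc[same_month, 'last_pymnt_d'] = (
    toy.loc[same_month, 'last_pymnt_d'] + pd.DateOffset(months=1)
)

print(toy)
```

The first loan is left untouched, while the second and third end up with a last payment date one month after issue.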

### Limitations of a single monthly snapshot

Because the dataset is a single snapshot, we cannot analyze loans consistently across their lifecycle: comparing new loans (e.g., 3 months old) with closed ones (e.g., 10+ years old) is like judging a film by one frame. For closed loans, we cannot tell whether they repaid early, repaid late because of deferments, or closed after a spell in arrears. For "up-to-date" loans, we cannot tell which ones struggled in the past or have run beyond their 60-month term. Credit risk modeling needs time series data to compare loans fairly (e.g., 2007 vs. 2017 vintages) at equivalent lifecycle stages and to reconstruct the behaviors, such as payment momentum, deferment impacts, or hidden delinquency patterns, that define true risk.

To bridge this gap, we’ll simulate a monthly time series up to April 2019. First, we derive each loan’s expected payment schedule, including its maturity date, from its installment and term. This lets us model deviations (e.g., early or late payments, deferments) and infer historical trends. Without it, we risk misjudging performance, such as labeling a loan “current” despite prior arrears, or overlooking systemic risks (e.g., cohorts prone to late-term defaults). Historical snapshots, even simulated ones, turn static data into a narrative of how portfolios behave over time, not just where they stand today.

In [8]:
df['term_numeric'] = pd.to_numeric(df.term.str.replace('months', ''), errors='coerce')
df['maturity_d'] = df.apply(lambda x: x.issue_d + pd.tseries.offsets.DateOffset(months=x.term_numeric), axis=1)
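A row-wise `apply` over roughly 2.3 million loans can be slow. As a possible alternative, the sketch below computes maturity dates one term at a time, assuming `term_numeric` takes only a handful of distinct integer values. It runs on a hypothetical toy frame, not the real data, and is not the method used in the cell above.

```python
import pandas as pd

# Hypothetical toy frame with the two columns the maturity date depends on.
toy = pd.DataFrame({
    'issue_d': pd.to_datetime(['2015-01-01', '2016-07-01', '2018-12-01']),
    'term_numeric': [36, 60, 36],
})

# Add the term to the issue date once per distinct term, not once per row.
parts = [
    grp['issue_d'] + pd.DateOffset(months=int(term))
    for term, grp in toy.groupby('term_numeric')
]

# Assignment aligns on the index, so the original row order is preserved.
toy['maturity_d'] = pd.concat(parts)

print(toy)
```

Rows whose term could not be parsed would be dropped by `groupby` and come back as `NaT` after the index-aligned assignment.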

### Inflate the dataset to represent the full monthly time series from the point of origination up to the true report date

By enumerating expected installments month by month from each loan’s issue date to the April 2019 snapshot, we create a timeline of anticipated payments. Where this timeline extends beyond a loan's maturity date (e.g., a 60-month term ending in 2018), we flag the post-maturity months (the `train` mask in the function below) so they can be treated as n/a, ensuring the model reflects contractual obligations rather than speculative extrapolation. Using the `loan_status`, `last_pymnt_d`, and `total_pymnt` data, we approximate actual payment behavior against this baseline. For example, a loan marked "charged off" with sparse payments would show persistent gaps in its reconstructed timeline, while a "fully paid" loan might reveal early settlements or deferments.

This basic attribution model is a starting point. Refinements such as principal/interest splits, fee assessments, or hardship flags could resolve ambiguities (e.g., distinguishing forbearance from delinquency). Even in simplified form, the simulated time series lets us ask questions a static snapshot cannot answer: How did prepayments cluster in certain vintages? Did post-2015 loans exhibit slower principal reduction? While crude, this approach highlights systemic risks (e.g., cohorts with rising late-term defaults) and shows where deeper analysis is most needed. Future iterations can layer on complexity, but even a "placeholder" timeline grounds credit risk analysis in how loans evolve, not just in cross-sectional snapshots.

In [9]:
def basic_pymnt_attr(
    snapshot_date, 
    issue_d, 
    total_pymnt, 
    recoveries, 
    last_pymnt_amnt, 
    last_pymnt_d, 
    installment, 
    maturity_d,
    **kwargs):

    # enumerate a sequence of report dates from issue to snapshot date
    report_d = pd.date_range(issue_d, snapshot_date, freq='MS', inclusive='right')
    n_report_d = len(report_d)
    n_pymnt = (report_d < last_pymnt_d).sum()

    # spread the total pymnt made over the vector of months paid
    with np.errstate(divide='ignore', invalid='ignore'):
        pymnt = np.float32(total_pymnt - recoveries - last_pymnt_amnt) / n_pymnt
    pymnt = np.full(n_report_d, pymnt)
    pymnt[report_d >= last_pymnt_d] = 0

    # surplus with respect to the initial schedule (indicative of paid early)
    #surplus = pymnt - np.maximum(0, pymnt - installment)
    #last_pymnt_amnt += np.nansum(surplus)

    # calculate a mask for training the data (i.e. up to maturity)
    train = report_d <= maturity_d

    # backload the surplus onto the last payment (will be 0 for charged off) and return
    pymnt[report_d == last_pymnt_d] = last_pymnt_amnt
    return pd.Series({
        'pymnt': pymnt.tolist(), 
        'train': train.tolist()
    })
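To make the attribution concrete, the sketch below walks through the same steps for a single hypothetical loan. All dates and amounts are invented for illustration, and the maturity date is deliberately short so that the mask has something to exclude.

```python
import numpy as np
import pandas as pd

# Hypothetical loan: issued Jan 2018, last payment Jun 2018, snapshot Sep 2018.
issue_d = pd.Timestamp('2018-01-01')
last_pymnt_d = pd.Timestamp('2018-06-01')
snapshot_date = pd.Timestamp('2018-09-01')
maturity_d = pd.Timestamp('2018-07-01')  # artificially short term
total_pymnt, recoveries, last_pymnt_amnt = 900.0, 0.0, 500.0

# Month-start report dates after issue, up to and including the snapshot.
report_d = pd.date_range(issue_d, snapshot_date, freq='MS', inclusive='right')

# Months strictly before the last payment share the remaining total equally.
n_pymnt = (report_d < last_pymnt_d).sum()
pymnt = np.full(len(report_d), (total_pymnt - recoveries - last_pymnt_amnt) / n_pymnt)

# Nothing is paid after the last payment date; the final payment lands on it.
pymnt[report_d >= last_pymnt_d] = 0
pymnt[report_d == last_pymnt_d] = last_pymnt_amnt

# Months up to and including maturity are kept for training.
train = report_d <= maturity_d

print(pd.DataFrame({'pymnt': pymnt, 'train': train}, index=report_d))
```

Here four months of 100 are followed by the final payment of 500 and then three zero months, so the reconstructed payments sum back to `total_pymnt`. The last two months fall after maturity and are masked out.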

In [10]:
df = df.progress_apply(lambda x: basic_pymnt_attr(snapshot_date, **x.squeeze()), axis=1).join(df.id)

  0%|          | 0/2260668 [00:00<?, ?it/s]

In [17]:
df

| | pymnt | train | id |
|---|---|---|---|
| 0 | [119.41815863715277, 119.41815863715277, 119.4... | [True, True, True, True, True, True, True, Tru... | 68407277 |
| 1 | [4950.662109375, 4950.662109375, 4950.66210937... | [True, True, True, True, True, True, True, Tru... | 68355089 |
| 2 | [405.44850068933823, 405.44850068933823, 405.4... | [True, True, True, True, True, True, True, Tru... | 68341763 |
| 3 | [827.948902027027, 827.948902027027, 827.94890... | [True, True, True, True, True, True, True, Tru... | 66310712 |
| 4 | [268.5900065104167, 268.5900065104167, 268.590... | [True, True, True, True, True, True, True, Tru... | 68476807 |
| ... | ... | ... | ... |
| 2260663 | [543.4642857142857, 543.4642857142857, 543.464... | [True, True, True, True, True, True, True, Tru... | 89885898 |
| 2260664 | [517.5996442522321, 517.5996442522321, 517.599... | [True, True, True, True, True, True, True, Tru... | 88977788 |
| 2260665 | [858.7274693080357, 858.7274693080357, 858.727... | [True, True, True, True, True, True, True, Tru... | 88985880 |
| 2260666 | [562.8036221590909, 562.8036221590909, 562.803... | [True, True, True, True, True, True, True, Tru... | 88224441 |


In [18]:
df.to_json('accepted_2007_to_2018Q4.jsonl.gz', orient='records', lines=True)
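For downstream work, the list-valued columns usually need to be unpacked into one row per loan-month. The sketch below shows one way to do that, using a small stand-in file with the same three columns rather than the real export. The `month` column and the temporary file path are illustrative names, not part of the original notebook.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the exported file: same columns, two hypothetical loans.
toy = pd.DataFrame({
    'pymnt': [[100.0, 100.0, 500.0], [50.0, 0.0]],
    'train': [[True, True, False], [True, True]],
    'id': [1, 2],
})
path = os.path.join(tempfile.mkdtemp(), 'toy.jsonl.gz')
toy.to_json(path, orient='records', lines=True)

# Compression is inferred from the .gz extension on both write and read.
loaded = pd.read_json(path, lines=True)

# Unpack both list columns in lockstep: one row per loan-month.
long_df = loaded.explode(['pymnt', 'train'], ignore_index=True)

# Number the months since issue within each loan (the first report month is 1).
long_df['month'] = long_df.groupby('id').cumcount() + 1
long_df['pymnt'] = long_df['pymnt'].astype(float)

print(long_df)
```

On the full dataset this long format would have one row per loan per month, so it may be worth exploding in chunks or filtering on `train` first.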