# Data Wrangling

Data wrangling takes up a substantial share of most data professionals' working time. It covers the steps you take to organize and clean data before analysis: merging and appending datasets, finding and fixing typos, and creating new variables.

For this exercise, we'll use public data from [LendingClub](https://www.lendingclub.com), an online loan provider. There are two datasets: one with the LendingClub loans (`loans`) and one with the other loans held by each consumer (`credit_record`).

Let's start with loading some libraries.

- **pandas**: Main library for manipulating data
- **numpy**: Numerical arrays that pandas builds on
- **urllib**: Downloads files

In [19]:
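import pandas as pd
import numpy as np
# Import the submodule explicitly: `import urllib` alone does not guarantee
# that `urllib.request` is loaded
import urllib.request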

In [20]:
# Store the URLs in variables
URL_LOANS = "https://www.dropbox.com/s/0f7jetfmcy18y1n/loans.csv.gz?dl=1"
URL_CREDIT_RECORD = "https://www.dropbox.com/s/qh45gs56s4omq8o/credit_record.csv.gz?dl=1"

# Download the files
urllib.request.urlretrieve(URL_LOANS, "loans.csv.gz")
urllib.request.urlretrieve(URL_CREDIT_RECORD, "credit_record.csv.gz")

# Load the files (pandas infers gzip compression from the .gz extension)
loans = pd.read_csv("loans.csv.gz")
credit_record = pd.read_csv("credit_record.csv.gz")

In [21]:
# Print the first few rows of the loans
loans.head()

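| Unnamed: 0 | member_id | loan_status | status_dt | int_rate | dti |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2007-07-20 | 10.65 | 27.65 |
| 1 | 2 | 1 | 2007-07-20 | 7.9 | 11.2 |
| 2 | 3 | 1 | 2007-07-20 | 17.58 | 18.79 |
| 3 | 4 | 1 | 2007-07-20 | 20.3 | 23.21 |
| 4 | 5 | 0 | 2007-07-20 | 9.91 | 7.83 |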
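
The `Unnamed: 0` column is what pandas typically produces when a CSV was written together with its index and then read back without `index_col`. The sketch below reproduces this on a made-up in-memory CSV (the text and values are illustrative, not the LendingClub file) and shows how `index_col=0` would avoid the extra column. Whether that is the right choice for the real files is an assumption.

```python
import io

import pandas as pd

# Hypothetical CSV text with an empty first header cell, as written by
# DataFrame.to_csv() when the index is included
csv_text = ",member_id,int_rate\n0,1,10.65\n1,2,7.9\n"

# Default read: the unlabeled first column becomes "Unnamed: 0"
with_unnamed = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_unnamed.columns))
print(list(with_index.columns))
```

The rest of this walkthrough leaves the column in place, so the outputs keep showing it.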


In [22]:
# Print the first few rows of the credit record file
credit_record.head()

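| Unnamed: 0 | member_id | loan_id | record_dt | open_acc | delinq_payment |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2005-07-12 | 1 | 0 |
| 1 | 1 | 1 | 2005-08-12 | 1 | 0 |
| 2 | 1 | 1 | 2005-09-12 | 1 | 0 |
| 3 | 1 | 1 | 2005-10-12 | 1 | 0 |
| 4 | 1 | 1 | 2005-11-12 | 1 | 0 |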
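
Before merging two tables, it can help to check how the join key behaves in each of them: the preview suggests `member_id` appears once per row in `loans` but many times in `credit_record`. The sketch below runs that check on small made-up frames (`toy_loans` and `toy_credit` are hypothetical names and data) and shows the `validate` argument of `merge` raising `pandas.errors.MergeError` when the stated relationship does not hold.

```python
import pandas as pd

# Made-up stand-ins for the two datasets
toy_loans = pd.DataFrame({"member_id": [1, 2, 3], "int_rate": [10.65, 7.9, 17.58]})
toy_credit = pd.DataFrame({"member_id": [1, 1, 2], "delinq_payment": [0, 1, 0]})

# Is the key unique in each table?
loans_key_unique = toy_loans["member_id"].is_unique
credit_key_unique = toy_credit["member_id"].is_unique

# A one-to-many merge is consistent with these keys
merged = toy_loans.merge(toy_credit, on="member_id", how="inner", validate="1:m")

# Claiming one-to-one is not, so pandas raises MergeError
try:
    toy_loans.merge(toy_credit, on="member_id", validate="1:1")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(loans_key_unique, credit_key_unique, len(merged), one_to_one_ok)
```

Passing `validate` costs little and turns a silent row explosion into an explicit error.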


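Let's say we want to add to the `loans` dataset the number of times an individual has been delinquent on a payment in the last 6 months, the last year, and the last 2 years, measured relative to the loan's status date (`status_dt`). We'll convert the date columns, attach each loan's status date to the member's credit records, flag the delinquent payments that fall in each window, and sum the flags per member.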


In [50]:
# Convert dates
loans['status_dt'] = pd.to_datetime(loans['status_dt'])
credit_record['record_dt'] = pd.to_datetime(credit_record['record_dt'])

# Attach each loan's status date to that member's credit records
# (one loan row per member, many credit records per member)
delinq_totals = loans[['member_id', 'status_dt']].merge(
    credit_record,
    how = "inner",
    on = "member_id",
    validate = "1:m"
)

# Years between the credit record and the loan status date
delinq_totals['years_since_rec'] = (
    delinq_totals['status_dt'] - delinq_totals['record_dt']
) / pd.Timedelta('365 days')

# Flag delinquent payments that fall inside each window.
# Note: a record dated after status_dt has a negative years_since_rec
# and therefore also passes these `<` tests.
delinq_totals["delinq_6mo"] = (delinq_totals['years_since_rec'] < 0.5) * delinq_totals['delinq_payment']
delinq_totals["delinq_1yr"] = (delinq_totals['years_since_rec'] < 1)   * delinq_totals['delinq_payment']
delinq_totals["delinq_2yr"] = (delinq_totals['years_since_rec'] < 2)   * delinq_totals['delinq_payment']

# Aggregate
delinq_totals = delinq_totals.groupby('member_id').agg({
    'delinq_6mo' : 'sum',
    'delinq_1yr' : 'sum',
    'delinq_2yr' : 'sum'
}).reset_index()
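
To see the windowing logic in isolation, here is a small worked example on made-up records for a single member (the data and the `toy` name are hypothetical). It is a variation on the cell above, not the same computation: it adds a lower bound of zero so that a record dated after the status date is excluded, which assumes such records should not count toward "the last N months".

```python
import pandas as pd

# One member, four delinquent records: 80, 292, and 687 days before the
# status date, plus one record 12 days after it
toy = pd.DataFrame({
    "member_id": [1, 1, 1, 1],
    "status_dt": pd.to_datetime(["2007-07-20"] * 4),
    "record_dt": pd.to_datetime(
        ["2007-05-01", "2006-10-01", "2005-09-01", "2007-08-01"]
    ),
    "delinq_payment": [1, 1, 1, 1],
})

# Elapsed time in (365-day) years; negative for the record after status_dt
toy["years_since_rec"] = (toy["status_dt"] - toy["record_dt"]) / pd.Timedelta("365 days")

# Window upper bounds in years, keyed by output column name
windows = {"delinq_6mo": 0.5, "delinq_1yr": 1, "delinq_2yr": 2}

for col, upper in windows.items():
    # Assumed variation: require 0 <= elapsed < upper
    in_window = (toy["years_since_rec"] >= 0) & (toy["years_since_rec"] < upper)
    toy[col] = in_window * toy["delinq_payment"]

totals = toy.groupby("member_id")[list(windows)].sum().reset_index()
print(totals)
```

With the lower bound, the counts for this member are 1, 2, and 3. Without it, the record dated after the status date would be added to every window.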

Merge the aggregates back to the loans dataset, using a left merge so that every loan is kept.

In [53]:
# Merge the aggregates back to the loans dataset.
loans_w_delinq = loans.merge(
    delinq_totals,
    how = "left",
    on = "member_id",
    validate = "1:1"
)
loans_w_delinq.head()

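| Unnamed: 0 | member_id | loan_status | status_dt | int_rate | dti | delinq_6mo | delinq_1yr | delinq_2yr |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2007-07-20 | 10.65 | 27.65 | 0 | 0 | 0 |
| 1 | 2 | 1 | 2007-07-20 | 7.9 | 11.2 | 0 | 0 | 0 |
| 2 | 3 | 1 | 2007-07-20 | 17.58 | 18.79 | 0 | 0 | 0 |
| 3 | 4 | 1 | 2007-07-20 | 20.3 | 23.21 | 0 | 0 | 0 |
| 4 | 5 | 0 | 2007-07-20 | 9.91 | 7.83 | 0 | 0 | 0 |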
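
Because the aggregates were built from an inner merge, a loan whose member has no rows in `credit_record` would come out of the left merge with `NaN` in the three new columns. The sketch below, on made-up frames with hypothetical names, shows one way to surface and handle that: `indicator=True` adds a `_merge` column marking unmatched rows, and `fillna(0)` treats them as having no delinquencies. Whether zero is the right interpretation for the real data is an assumption.

```python
import pandas as pd

# Made-up data: member 3 has a loan but no aggregated credit history
toy_loans = pd.DataFrame({"member_id": [1, 2, 3]})
toy_totals = pd.DataFrame({
    "member_id": [1, 2],
    "delinq_6mo": [0, 1],
    "delinq_1yr": [1, 1],
    "delinq_2yr": [2, 1],
})

merged = toy_loans.merge(
    toy_totals,
    how="left",
    on="member_id",
    validate="1:1",
    indicator=True,  # adds a "_merge" column: "both" or "left_only" here
)

# Count loans that found no match in the aggregates
unmatched = int((merged["_merge"] == "left_only").sum())

# Assumed interpretation: no credit records means zero delinquencies
delinq_cols = ["delinq_6mo", "delinq_1yr", "delinq_2yr"]
merged[delinq_cols] = merged[delinq_cols].fillna(0).astype(int)

print(unmatched)
print(merged)
```

Checking the unmatched count first makes it clear how many rows the fill value will affect.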
