# Data Wrangling

Data wrangling takes up a substantial portion of every data professional's working life. It encompasses the steps you take to organize and clean data for analysis, including merging and appending datasets, finding typos, and creating new variables.

For this exercise, we'll be using public data from [LendingClub](https://www.lendingclub.com), an online loan provider. There are two datasets: one with the LendingClub loans (`loans`) and one with the other loans held by each consumer (`credit_record`).

Let's start by loading some libraries.
- **readr**: Load datasets
- **dplyr**: Manipulate data
- **lubridate**: Handle dates

In [3]:
library(readr)
library(dplyr)
library(lubridate)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




Let's load the data! Each dataset is a CSV file compressed with `gzip`. `readr` can read compressed files directly, but we need to download them first.

In [4]:
# Store the URL in a variable
URL_LOANS <- "https://www.dropbox.com/s/0f7jetfmcy18y1n/loans.csv.gz?dl=1"
URL_CREDIT_RECORD <- "https://www.dropbox.com/s/qh45gs56s4omq8o/credit_record.csv.gz?dl=1"

# Download the files
download.file(URL_LOANS, "loans.csv.gz")
download.file(URL_CREDIT_RECORD, "credit_record.csv.gz")

# Load the files
loans <- read_csv("loans.csv.gz")
credit_record <- read_csv("credit_record.csv.gz")


── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
cols(
  member_id = col_double(),
  loan_status = col_double(),
  status_dt = col_date(format = ""),
  int_rate = col_double(),
  dti = col_double()
)



── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
cols(
  member_id = col_double(),
  loan_id = col_double(),
  record_dt = col_date(format = ""),
  open_acc = col_double(),
  delinq_payment = col_double()
)
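
The column specification messages show the types `readr` guessed for each file. One way to make the load reproducible, and to silence the message, is to declare the types through `col_types`. The sketch below is self-contained: it writes a two-row illustrative sample to a temporary gzip-compressed CSV (the name `loans_sample` and the temporary file are assumptions for the example, not part of the exercise) and reads it back using the column names from the `loans` specification.

```r
library(readr)

# Write a tiny illustrative sample to a temporary .csv.gz file
tmp_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp_path, "w")
writeLines(
  c("member_id,loan_status,status_dt,int_rate,dti",
    "1,1,2007-07-20,10.65,27.65",
    "2,1,2007-07-20,7.9,11.2"),
  con
)
close(con)

# Declare the column types up front instead of relying on type guessing
loans_sample <- read_csv(
  tmp_path,
  col_types = cols(
    member_id   = col_double(),
    loan_status = col_double(),
    status_dt   = col_date(format = ""),
    int_rate    = col_double(),
    dti         = col_double()
  )
)
```

If a column cannot be parsed as its declared type, `readr` reports parsing problems rather than silently switching types, which can surface dirty values early.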




In [5]:
# Print the first few rows of the loans
head(loans)

member_id,loan_status,status_dt,int_rate,dti
<dbl>,<dbl>,<date>,<dbl>,<dbl>
1,1,2007-07-20,10.65,27.65
2,1,2007-07-20,7.9,11.2
3,1,2007-07-20,17.58,18.79
4,1,2007-07-20,20.3,23.21
5,0,2007-07-20,9.91,7.83
6,1,2007-07-20,16.29,18.18


In [6]:
# Print the first few rows of the credit record file
head(credit_record)

member_id,loan_id,record_dt,open_acc,delinq_payment
<dbl>,<dbl>,<date>,<dbl>,<dbl>
1,1,2005-07-12,1,0
1,1,2005-08-12,1,0
1,1,2005-09-12,1,0
1,1,2005-10-12,1,0
1,1,2005-11-12,1,0
1,1,2005-12-12,1,0
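
The preview suggests that `credit_record` holds many rows per `member_id`, so any join on that key will be one-to-many. Before joining, it can help to check how the key behaves in each table. The following is a minimal sketch on toy stand-ins (`toy_loans` and `toy_credit` are hypothetical and their values illustrative); the same calls should work on the real `loans` and `credit_record`.

```r
library(dplyr)

# Toy stand-ins for the two datasets (values are illustrative)
toy_loans <- tibble(member_id = c(1, 2, 3))
toy_credit <- tibble(
  member_id = c(1, 1, 1, 2),
  record_dt = as.Date(c("2005-07-12", "2005-08-12", "2005-09-12", "2006-01-12"))
)

# Number of credit records per member
records_per_member <- toy_credit %>%
  count(member_id, name = "n_records")

# Members that appear more than once in the loans table (ideally none)
dup_loans <- toy_loans %>%
  count(member_id) %>%
  filter(n > 1)

# Members in the loans table with no credit record at all
unmatched <- toy_loans %>%
  anti_join(toy_credit, by = "member_id")
```

Here `unmatched` contains member 3, who would be dropped by an inner join and would receive missing values in a left join.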


Let's say we want to add to the `loans` dataset the number of times an individual has been delinquent on a payment in the last 2 years, the last year, and the last 6 months. Here are the steps we'll take.

- Subset the loans to the columns we need.
- Join the result to the credit record dataset.
- Calculate the length of time between the credit record date and the loan status date.
- Group by the individual member.
- For each member, sum the delinquent payment indicator multiplied by an indicator for the relevant time period.

In [7]:
# Calculate delinquent payment aggregates by time period
delinq_totals <- loans %>% 
  select(member_id, status_dt) %>%
  inner_join(credit_record, by = "member_id") %>%
  mutate(years_since_rec = time_length(interval(record_dt, status_dt), unit = "year")) %>%
  group_by(member_id) %>%
  summarize(
    delinq_6mo = sum(delinq_payment * (years_since_rec <= 0.5)),
    delinq_1yr = sum(delinq_payment * (years_since_rec <= 1)),
    delinq_2yr = sum(delinq_payment * (years_since_rec <= 2))
  ) %>%
  ungroup()
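
To see how multiplying the two indicators works, here is a small worked sketch for a single, made-up member (the `toy` data is hypothetical). A logical comparison such as `years_since_rec <= 0.5` is treated as 1 or 0 in arithmetic, so each product is 1 only when the payment was delinquent and the record falls inside the window.

```r
library(dplyr)
library(lubridate)

# One hypothetical member with four credit records before the status date
toy <- tibble(
  member_id = 1,
  status_dt = as.Date("2007-07-20"),
  record_dt = as.Date(c("2005-01-20", "2006-01-20", "2006-10-20", "2007-04-20")),
  delinq_payment = c(1, 1, 0, 1)
)

toy_totals <- toy %>%
  # Roughly 2.5, 1.5, 0.75 and 0.25 years before the status date
  mutate(years_since_rec = time_length(interval(record_dt, status_dt), unit = "year")) %>%
  group_by(member_id) %>%
  summarize(
    delinq_6mo = sum(delinq_payment * (years_since_rec <= 0.5)),
    delinq_1yr = sum(delinq_payment * (years_since_rec <= 1)),
    delinq_2yr = sum(delinq_payment * (years_since_rec <= 2))
  )

# delinq_6mo = 1, delinq_1yr = 1, delinq_2yr = 2
toy_totals
```

The oldest delinquency falls outside every window, the one from about 1.5 years back counts only toward `delinq_2yr`, and the most recent one counts toward all three. Note that the `<=` comparisons also count any record dated after `status_dt`, since `years_since_rec` is then negative. If such rows exist and should be excluded, `between(years_since_rec, 0, 0.5)` is one option.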


Next, merge the aggregates back into the `loans` dataset. A left join keeps every row of `loans`, whether or not the member has a credit record.

In [8]:
# Left-join the aggregates onto the loans dataset
loans_w_delinq <- loans %>%
  left_join(delinq_totals, by = "member_id")
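
With a left join, any member missing from the aggregates gets `NA` in the joined columns. Whether those should become zeros depends on whether a missing credit record really means "no delinquencies", which is an assumption to verify against the data. Under that assumption, one possible approach is sketched below on toy data (`toy_loans` and `toy_totals` are hypothetical; `across()` assumes dplyr 1.0 or later).

```r
library(dplyr)

# Toy data: member 2 has no row in the aggregates
toy_loans <- tibble(member_id = c(1, 2), int_rate = c(10.65, 7.9))
toy_totals <- tibble(member_id = 1, delinq_6mo = 1, delinq_1yr = 1, delinq_2yr = 2)

toy_joined <- toy_loans %>%
  left_join(toy_totals, by = "member_id") %>%
  # Replace missing counts with 0 in every delinq_* column
  mutate(across(starts_with("delinq_"), ~ coalesce(.x, 0)))
```

Member 2 is kept in `toy_joined` with zero counts instead of `NA`.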

In [10]:
# Print the first few rows
head(loans_w_delinq)

member_id,loan_status,status_dt,int_rate,dti,delinq_6mo,delinq_1yr,delinq_2yr
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
50,1,2007-07-20,12.69,15.31,1,1,1
68,1,2007-07-20,9.91,22.88,1,1,1
77,1,2007-07-20,14.27,6.94,1,2,2
195,1,2007-07-20,11.71,8.73,1,1,1
202,1,2007-07-20,12.42,19.78,1,1,1
300,1,2007-07-20,20.99,21.2,1,1,1
