# Data Wrangling

Data wrangling takes up a substantial share of most data professionals' time. It covers the steps you take to organize and clean data before analysis, such as merging and appending datasets, finding typos, and creating new variables.

For this exercise, we'll use public data from [LendingClub](https://www.lendingclub.com), an online loan provider. There are two datasets: one with the LendingClub loans (`loans`) and one with the other loans held by each consumer (`credit_record`).

Let's load the data! Both files are in Stata's `.dta` format, which we can read into Stata with the `use` command.

In [None]:
* Store the URLs in local macros
local URL_LOANS "https://www.dropbox.com/s/606068orwclejin/loans.dta?dl=1"
local URL_CREDIT_RECORD "https://www.dropbox.com/s/o7suo1qxrt8gcp7/credit_record.dta?dl=1"

* Download the files
copy "`URL_LOANS'" "loans.dta", replace
copy "`URL_CREDIT_RECORD'" "credit_record.dta", replace

In [None]:
* Print the first few rows of the loans
use "loans.dta", clear
list in 1/10

In [None]:
* Print the first few rows of the credit record file
use "credit_record.dta", clear
list in 1/10
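
Before joining two files, it helps to check which one has a single row per member, since that decides whether a `1:1` or `m:1` merge is appropriate. The sketch below uses small made-up datasets rather than the real files; the values and the `loan_amnt` variable are hypothetical and only stand in for the structure described above.

```stata
* Toy stand-in for the loans file: one row per member (hypothetical values)
clear
input long member_id float loan_amnt
101 5000
102 12000
103 8000
end

* isid stops with an error unless member_id uniquely identifies the rows
isid member_id

* Toy stand-in for the credit record file: several rows per member
clear
input long member_id byte delinq_payment
101 0
101 1
102 0
end

* duplicates report shows how many member_id values repeat
duplicates report member_id

* capture records the failure in _rc instead of stopping the run
capture isid member_id
display "isid return code on the toy credit records: " _rc
```

A nonzero return code from the second `isid` indicates that `member_id` repeats in that file, which is the situation where the many-rows file goes on the `m` side of the merge.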

Let's say we want to add to the `loans` dataset the number of times each individual has been delinquent on a payment in the last 2 years, the last year, and the last 6 months. Here are the steps we'll take.

- First, subset the loans to the columns we need.
- Join the subset to the credit record dataset.
- Calculate the length of time between the credit record date and the loan status date.
- Group by the individual member.
- Collapse to one row per member, summing the product of an indicator for a delinquent payment and an indicator for the relevant time period.
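
To make the last three steps concrete, here is a minimal sketch on made-up records for two members. All dates and values are hypothetical, and the string date variables (`record_str`, `status_str`) exist only so the toy data can be typed in; the real files may store their dates differently.

```stata
* Hypothetical credit records, already joined to each member's status date
clear
input long member_id str9 record_str str9 status_str byte delinq_payment
1 "01jan2015" "01jan2016" 1
1 "01sep2015" "01jan2016" 1
1 "01jan2013" "01jan2016" 1
2 "01dec2015" "01jan2016" 0
2 "01mar2014" "01jan2016" 1
end

* Convert the strings to Stata daily dates
gen record_dt = date(record_str, "DMY")
gen status_dt = date(status_str, "DMY")
format record_dt status_dt %td

* Elapsed time in years between the record date and the status date
gen years_since_rec = (status_dt - record_dt) / 365.25

* Indicator product: 1 only if the payment was delinquent AND inside the window
gen delinq_6mo = delinq_payment * (years_since_rec <= 0.5)
gen delinq_1yr = delinq_payment * (years_since_rec <= 1)
gen delinq_2yr = delinq_payment * (years_since_rec <= 2)

* Summing the products within each member gives the counts
collapse (sum) delinq_6mo delinq_1yr delinq_2yr, by(member_id)
list
```

In this toy data, member 1 ends up with 1, 2, and 2 delinquencies in the 6-month, 1-year, and 2-year windows, and member 2 with 0, 0, and 1. Multiplying two 0/1 indicators acts as a logical "and", so the sum counts only the rows that satisfy both conditions.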

In [None]:
* Load a subset of the columns we need from loans.dta
use member_id status_dt using "loans.dta", clear

* Save the subset as a temporary file
tempfile loans_subset
save `loans_subset', replace

* Now load the credit records data
use "credit_record.dta", clear

* Merge the status date into the credit records data
merge m:1 member_id using `loans_subset', gen(_m_loans_subset)

* Calculate the number of years (as a fraction) between the record date and status date
gen years_since_rec = (status_dt - record_dt) / 365.25

* Create variables that we will collapse over
gen delinq_6mo = delinq_payment * (years_since_rec <= 0.5)
gen delinq_1yr = delinq_payment * (years_since_rec <= 1)
gen delinq_2yr = delinq_payment * (years_since_rec <= 2)

* Collapse the data
collapse (sum) delinq_6mo delinq_1yr delinq_2yr, by(member_id)

* List the first few rows
list in 1/10

* Save as a temporary file to merge back into the loans records
tempfile delinq_totals
save `delinq_totals', replace
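
One thing worth checking in the window indicators: a comparison such as `years_since_rec <= 0.5` is also true for negative values, so a credit record dated after the status date would be counted. Whether such records exist in these files is not something the data description tells us, so treat the following as a sketch under that assumption. It uses made-up values and hypothetical variable names (`delinq_6mo_simple`, `delinq_6mo_bounded`) to compare the simple comparison with a bounded one built on `inrange()`.

```stata
* Hypothetical records: one inside the window, one dated after the
* status date (negative elapsed time), and one well outside the window
clear
input long member_id float years_since_rec byte delinq_payment
1  0.25 1
1 -0.40 1
1  3.00 1
end

* The simple comparison also counts the record with negative elapsed time
gen delinq_6mo_simple = delinq_payment * (years_since_rec <= 0.5)

* inrange() keeps only records between 0 and 0.5 years before the status date
gen delinq_6mo_bounded = delinq_payment * inrange(years_since_rec, 0, 0.5)

list
```

Here the simple version flags two rows and the bounded version flags one. Which behavior is appropriate depends on whether later records should count toward a member's history.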

In [None]:
* Load the original loans data again
use "loans.dta", clear

* Merge the delinquency totals into the loans data
merge 1:1 member_id using `delinq_totals', gen(_m_delinq_totals)

* List first few rows
list in 1/10
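
After a merge, the generated match variable tells us which rows came from which file, and it is worth tabulating before moving on. The sketch below runs on made-up data with hypothetical names (`totals_demo`, `_m_demo`, `loan_amnt`). Dropping rows that have no loan and recoding unmatched members to zero delinquencies are assumptions about what the analysis needs, not steps taken above.

```stata
* Hypothetical delinquency totals, including member 3, who has no loan
clear
input long member_id byte delinq_2yr
1 2
3 1
end
tempfile totals_demo
save `totals_demo'

* Hypothetical loans data: members 1 and 2
clear
input long member_id float loan_amnt
1 5000
2 12000
end

merge 1:1 member_id using `totals_demo', gen(_m_demo)

* 1 = only in the loans data, 2 = only in the totals, 3 = matched
tabulate _m_demo

* Assumption: rows without a loan are not needed, and a member with
* no credit records has zero delinquencies
drop if _m_demo == 2
replace delinq_2yr = 0 if _m_demo == 1
list
```

In the toy data this leaves one row per loan, with member 2's count set to zero rather than missing.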