# Part I - Prosper Loan Data Exploration
## by Zadock Mainda

## Introduction

This dataset describes 81 variables for 113,937 loans taken at a credit facility between Nov 2005 and Mar 2014. Since it is impractical to investigate all 81 variables in this project, we will curate a short list of variables to investigate.




## Preliminary Wrangling


In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import datetime


%matplotlib inline



In [72]:
#Read dataset into a df using pandas 

LoanData = pd.read_csv('prosperLoanData.csv')

#### Structure of the dataset

Retrieve a sample of 5 rows so that we can have a broad overview of the dataframe

In [74]:
LoanData.sample(5)

| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13591 | 42823572423737265788207 | 717399 | 2013-02-21 15:55:44.413000000 | | 60 | Current | | 0.23872 | 0.2139 | 0.2039 | ... | -131.67 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 204 |
| 104411 | 0F45355175382393508F122 | 611682 | 2012-07-14 16:14:43.517000000 | | 60 | Current | | 0.17849 | 0.1551 | 0.1451 | ... | -176.13 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 171 |
| 86988 | 5EB8360335898135138B7CF | 1195981 | 2014-03-08 06:59:33.540000000 | | 36 | Current | | 0.13124 | 0.1029 | 0.0929 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 100 |
| 80267 | 543633852189678123C867D | 112717 | 2007-03-18 19:16:40.893000000 | D | 36 | Completed | 2010-02-09 00:00:00 | 0.17219 | 0.165 | 0.15 | ... | -53.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 114 |
| 12145 | 0D7A35574439600505B40FA | 636938 | 2012-09-10 13:03:20.337000000 | | 60 | Past Due (31-60 days) | | 0.24682 | 0.2218 | 0.2118 | ... | -261.12 | -235.64 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 174 |


In [79]:
# dataframe dimensions

LoanData.shape

(113937, 81)

According to the shape attribute, the dataframe is made up of 113,937 rows and 81 columns.

### The main features of interest in this dataset are listed below:


In [101]:
selectedCols = [
    'ListingKey', 'ListingCreationDate','CreditGrade', 
    'Term','LoanStatus', 'BorrowerRate', 'ProsperScore', 
    'EmploymentStatus', 'EmploymentStatusDuration', 
    'IsBorrowerHomeowner', 'CurrentDelinquencies',
    'PublicRecordsLast10Years', 'DebtToIncomeRatio', 'IncomeRange', 
    'LoanOriginalAmount', 'MonthlyLoanPayment','Recommendations'
     ] 

Create a new dataframe that comprises only the columns listed above:

In [169]:
loan_new = LoanData[selectedCols]

## Assessing Data

### Quality issues


1. ListingCreationDate is stored as a string instead of a datetime object
2. EmploymentStatus contains overlapping descriptors *('Employed' & 'Full-time')*
3. IncomeRange is stored as a string instead of a categorical datatype
4. CreditGrade is stored as a string object


In [170]:
loan_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 17 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ListingKey                113937 non-null  object 
 1   ListingCreationDate       113937 non-null  object 
 2   CreditGrade               28953 non-null   object 
 3   Term                      113937 non-null  int64  
 4   LoanStatus                113937 non-null  object 
 5   BorrowerRate              113937 non-null  float64
 6   ProsperScore              84853 non-null   float64
 7   EmploymentStatus          111682 non-null  object 
 8   EmploymentStatusDuration  106312 non-null  float64
 9   IsBorrowerHomeowner       113937 non-null  bool   
 10  CurrentDelinquencies      113240 non-null  float64
 11  PublicRecordsLast10Years  113240 non-null  float64
 12  DebtToIncomeRatio         105383 non-null  float64
 13  IncomeRange               113937 non-null  o

Check for duplicated rows in the dataframe

In [171]:

loan_new.duplicated().sum()

0

There are no fully duplicated rows in the loan_new dataframe.
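Note that `duplicated()` called without arguments only flags rows that match an earlier row in every column. As a minimal sketch on invented toy data (the values below are illustrative and not taken from the dataset), the same check can be restricted to a key column such as `ListingKey` through the `subset` parameter. Whether this distinction matters for `loan_new` is not verified here.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'ListingKey': ['A1', 'A1', 'B2'],
    'Term': [36, 60, 36],
})

# Count rows that are identical to an earlier row across every column
full_row_dupes = toy.duplicated().sum()

# Count rows that repeat an earlier ListingKey, regardless of other columns
key_dupes = toy.duplicated(subset='ListingKey').sum()

print(full_row_dupes, key_dupes)
```

In the toy frame the two `A1` rows differ in `Term`, so only the key-based check counts one of them as a duplicate.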

Create a custom function that returns unique items in a column. This will assist with reducing repetitive code while at the same time giving us a glimpse of how the specified columns are populated. 

In [172]:
# The dataframe is hardcoded: loan_new

def entries(columnName):
    """Return the unique values found in the given column of loan_new.

    param columnName: name of the column to inspect
    """
    return loan_new[columnName].unique()
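Because `entries` reads `loan_new` directly, it cannot be reused on another dataframe such as a cleaned copy. One possible alternative, sketched here with a hypothetical name `unique_entries`, takes the dataframe as a parameter. The toy frame only reuses the three `Term` values that appear in this notebook.

```python
import pandas as pd

def unique_entries(df, column_name):
    """Return the unique values of `column_name` in `df`, in order of first appearance."""
    return df[column_name].unique()

# Toy frame built from the Term values seen in this notebook
toy = pd.DataFrame({'Term': [36, 60, 36, 12]})
terms = unique_entries(toy, 'Term')
print(terms)
```

The same call would then work on any dataframe, for example `unique_entries(loan_new, 'LoanStatus')`.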

In [173]:
#List unique entries in the LoanStatus column

entries('LoanStatus')

array(['Completed', 'Current', 'Past Due (1-15 days)', 'Defaulted',
       'Chargedoff', 'Past Due (16-30 days)', 'Cancelled',
       'Past Due (61-90 days)', 'Past Due (31-60 days)',
       'Past Due (91-120 days)', 'FinalPaymentInProgress',
       'Past Due (>120 days)'], dtype=object)

In [180]:
#List unique entries in the Term column

entries('Term')

array([36, 60, 12], dtype=int64)

In [175]:
#List unique entries in the EmploymentStatus column

entries('EmploymentStatus')

array(['Self-employed', 'Employed', 'Not available', 'Full-time', 'Other',
       nan, 'Not employed', 'Part-time', 'Retired'], dtype=object)

In [176]:
#List unique entries in the IsBorrowerHomeowner column

entries('IsBorrowerHomeowner')

array([ True, False])

In [177]:
#List unique entries in the Recommendations column

entries('Recommendations')

array([ 0,  2,  1,  4,  3,  9,  5, 16, 39, 21,  7, 14,  8,  6, 24, 19, 18],
      dtype=int64)

In [178]:
#List unique entries in the IncomeRange column

entries('IncomeRange')

array(['$25,000-49,999', '$50,000-74,999', 'Not displayed', '$100,000+',
       '$75,000-99,999', '$1-24,999', 'Not employed', '$0'], dtype=object)

In [179]:
#List unique entries in the CreditGrade column

entries('CreditGrade')

array(['C', nan, 'HR', 'AA', 'D', 'B', 'E', 'A', 'NC'], dtype=object)

## Cleaning

###### Make copies of the original data

In [182]:
loan_clean = loan_new.copy()
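Working on a copy keeps the cleaning steps from altering the dataframe they were derived from. A minimal sketch on toy data (the frame and values are invented for illustration) shows that changes to the copy leave the original untouched:

```python
import pandas as pd

# Toy stand-in for the original dataframe
original = pd.DataFrame({'Term': [36, 60]})

# copy() returns an independent dataframe (a deep copy by default)
working = original.copy()
working['Term'] = working['Term'] * 2

print(original['Term'].tolist(), working['Term'].tolist())
```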

### Issue #1: ListingCreationDate is a string instead of datetime object 

#### Define:
* Change the datatype of ListingCreationDate from string to Datetime

#### Code

In [12]:
#Change datatype from string to datetime

loan_clean['ListingCreationDate'] = pd.to_datetime(loan_clean['ListingCreationDate'])

In [13]:
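#Create a month column holding the abbreviated month name as a string (e.g. 'Feb')

loan_clean['month'] = pd.to_datetime(loan_clean['ListingCreationDate']).dt.strftime('%b')

In [14]:
#Create a year column holding the four-digit year as a string (e.g. '2013')

loan_clean['year'] = pd.to_datetime(loan_clean['ListingCreationDate']).dt.strftime('%Y')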
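Quality issues #2 to #4 listed under Assessing Data could be handled along the same lines. The following is only a sketch under stated assumptions: the helper name `clean_categories` is hypothetical, folding 'Full-time' into 'Employed' is one possible way to resolve the overlap, and the category orders (including where 'Not displayed', 'Not employed', 'NC' and 'HR' are placed) are assumptions rather than something the dataset defines. The toy frame reuses values from the unique-entry listings above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ASSUMED orders, lowest to highest; adjust to suit the analysis
INCOME_ORDER = ['Not displayed', 'Not employed', '$0', '$1-24,999',
                '$25,000-49,999', '$50,000-74,999', '$75,000-99,999',
                '$100,000+']
GRADE_ORDER = ['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']

def clean_categories(df):
    """Return a copy of df with EmploymentStatus consolidated and
    IncomeRange / CreditGrade converted to ordered categoricals."""
    out = df.copy()
    # Issue #2 (assumed fix): treat 'Full-time' as 'Employed'
    out['EmploymentStatus'] = out['EmploymentStatus'].replace('Full-time', 'Employed')
    # Issues #3 and #4: ordered categorical dtypes; missing values stay missing
    out['IncomeRange'] = out['IncomeRange'].astype(
        CategoricalDtype(INCOME_ORDER, ordered=True))
    out['CreditGrade'] = out['CreditGrade'].astype(
        CategoricalDtype(GRADE_ORDER, ordered=True))
    return out

# Toy frame built from values seen in the unique-entry listings
toy = pd.DataFrame({
    'EmploymentStatus': ['Full-time', 'Employed', 'Retired'],
    'IncomeRange': ['$100,000+', '$0', '$25,000-49,999'],
    'CreditGrade': ['AA', None, 'HR'],
})
cleaned = clean_categories(toy)
print(cleaned.dtypes)
```

Ordered categoricals let plots and sorts follow the intended order instead of alphabetical order, which is the main reason to prefer them over plain strings here.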

### What is the structure of your dataset?

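The dataset contains 113,937 loans (rows) described by 81 variables (columns). The exploration here uses `loan_new`, a 17-column subset whose columns hold a mix of string (object), integer, float and boolean values.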

### What is/are the main feature(s) of interest in your dataset?

The main features of interest are the 17 columns listed in `selectedCols` above.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.



### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
>You can write a summary of the main findings and reflect on the steps taken during the data exploration.




