# Prosper Loans Data Exploration
## by Ahmed Abdelhafez

## Preliminary Wrangling

> This notebook explores a dataset of Prosper loans.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy import stats
%matplotlib inline

In [None]:
#load the data
loans = pd.read_csv('../input/prosper-loan/prosperLoanData.csv')

In [None]:
#view the data shape and features
print(loans.shape)
print(loans.dtypes)
print(loans.head(10))

In [None]:
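# keep only the columns used in this analysis; .copy() avoids SettingWithCopyWarning on later edits
loans = loans[['Term', 'LoanStatus', 'BorrowerAPR', 'BorrowerRate', 'LenderYield',
               'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn',
               'IncomeRange', 'Recommendations', 'Investors']].copy()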

In [None]:
#view the data shape and features
print(loans.shape)
print(loans.dtypes)
print(loans.head(10))

In [None]:
loans.info()

The data has many null values that need to be cleaned. Because so many rows are affected, dropping them would discard a large share of the data, so we fill the missing values in each affected feature with that feature's mean (mean imputation).
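
A minimal sketch of mean imputation on a toy frame, assuming the same column names as in the notebook. The values are made up for illustration, and `toy`, `null_share`, and `imputed` are hypothetical names that do not appear elsewhere.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loans data (values are made up for illustration)
toy = pd.DataFrame({
    'EstimatedLoss': [0.02, np.nan, 0.06, np.nan],
    'EstimatedReturn': [0.05, 0.07, np.nan, 0.09],
})

# Share of missing values per column, to judge whether dropping rows is viable
null_share = toy.isnull().mean()

# Mean imputation: replace each NaN with the mean of its own column
imputed = toy.fillna(toy.mean())
```

Mean imputation preserves each column's mean but shrinks its variance and places every filled row on a single value, which can show up as a spike in the histograms of the imputed features.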

In [None]:
loans['EstimatedEffectiveYield'] = loans['EstimatedEffectiveYield'].fillna(loans['EstimatedEffectiveYield'].mean())

In [None]:
loans['EstimatedLoss'] = loans['EstimatedLoss'].fillna(loans['EstimatedLoss'].mean())

In [None]:
loans['EstimatedReturn'] = loans['EstimatedReturn'].fillna(loans['EstimatedReturn'].mean())

Then, to complete the cleaning process, we drop the duplicated rows.

In [None]:
loans.drop_duplicates(inplace=True)
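
One caveat worth checking: once only a subset of columns is kept, two distinct loans that happen to share every selected value count as duplicates. The sketch below illustrates this on made-up data. `ListingKey` is used as an assumed identifier column, and all values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'ListingKey':   ['A1', 'B2', 'C3'],   # assumed identifier column
    'Term':         [36, 36, 60],
    'BorrowerRate': [0.15, 0.15, 0.20],
})

# With the identifier present, no row repeats
dupes_with_key = toy.duplicated().sum()

# Without the identifier, the first two loans become indistinguishable
subset = toy[['Term', 'BorrowerRate']]
dupes_without_key = subset.duplicated().sum()
deduped = subset.drop_duplicates()
```

Comparing the row count before and after `drop_duplicates` shows how many records this step removes.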

In [None]:
loans.info()

### What is the structure of your dataset?

> The dataset contains 113,937 loan records before duplicates are dropped, and we keep 11 features. Most variables are numeric, but LoanStatus, IncomeRange, and Term are categorical; LoanStatus and IncomeRange are stored as strings.
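
Because IncomeRange is stored as strings, plots and sorts order its levels arbitrarily or alphabetically. A minimal sketch of converting such a column to an ordered categorical follows. The level names below are an assumption and should be checked against `loans['IncomeRange'].unique()` before use.

```python
import pandas as pd

# Assumed income levels, lowest to highest; verify against the real column
income_levels = ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999',
                 '$75,000-99,999', '$100,000+']
income_type = pd.CategoricalDtype(categories=income_levels, ordered=True)

# Toy column with made-up rows
toy = pd.DataFrame({'IncomeRange': ['$50,000-74,999', '$0', '$100,000+', '$1-24,999']})
toy['IncomeRange'] = toy['IncomeRange'].astype(income_type)

# Sorting now follows the declared order rather than alphabetical order
sorted_levels = toy['IncomeRange'].sort_values().tolist()
```

Any value not listed in `income_levels` becomes NaN after the conversion, so levels such as non-numeric responses need to be added to the list or handled separately.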

### What is/are the main feature(s) of interest in your dataset?

> The main interest is to figure out which factors affect a borrower's interest rate (BorrowerRate) and APR (BorrowerAPR).

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The Term and Estimated Return.

## Univariate Exploration

First, we plot the distributions of the variables of interest.

In [None]:
base_color = sb.color_palette()[0];

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sb.histplot(loans['BorrowerRate'], kde = True);


In [None]:
sb.histplot(loans['BorrowerAPR'], kde = True);

BorrowerRate and BorrowerAPR both have roughly normal distributions, with relatively high kernel density toward the tails.
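
One way to put a number on "roughly normal with heavy tails" is to compute skewness and excess kurtosis with `scipy.stats`. The sketch below uses synthetic data, not the loan data: a normal bulk plus a small cluster of high values, with all parameters chosen purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic rates: a normal bulk plus a cluster of high values (illustrative only)
rng = np.random.default_rng(42)
rates = np.concatenate([rng.normal(0.18, 0.05, 900), rng.normal(0.32, 0.01, 100)])

skewness = stats.skew(rates)             # 0 for a symmetric distribution
excess_kurtosis = stats.kurtosis(rates)  # 0 for a normal distribution (Fisher definition)
```

On the real data, the same two calls could be applied to `loans['BorrowerRate']` and to `loans['BorrowerAPR'].dropna()`.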

Second, we look at the other features in the dataset to see their distributions and trends.

In [None]:
pltData = loans['LoanStatus'].value_counts()
plt.bar(pltData.index, pltData)
plt.xticks(rotation = 90);

In [None]:
sb.histplot(loans['LenderYield']);

In [None]:
sb.histplot(loans['EstimatedEffectiveYield']);

In [None]:
sb.histplot(loans['EstimatedLoss']);

In [None]:
sb.histplot(loans['EstimatedReturn']);

In [None]:
pltData = loans['IncomeRange'].value_counts()
plt.bar(pltData.index, pltData)
plt.xticks(rotation = 90);

In [None]:
sb.countplot(data = loans, x = 'Recommendations', color = base_color);

In [None]:
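# bin edges evenly spaced on a log10 scale, from 1 up past the maximum investor count
log_max = np.log10(loans['Investors'].max())
bins = 10 ** np.arange(0, log_max + 0.2, 0.2)
plt.hist(loans['Investors'], bins = bins)
plt.xscale('log')
# tick positions every half decade, labelled with rounded values
# (8 labels: assumes the maximum lies between 1,000 and about 3,160)
ticks = np.arange(0, log_max + 0.5, 0.5)
labels = [1, 3, 10, 30, 100, 300, 1000, 3000]
plt.xticks(10 ** ticks, labels);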

In [None]:
sb.countplot(data = loans, x = 'Term', color = base_color);

### General Comments:
<ul>
    <li>Most of the distributions are roughly normal, with some values that have relatively high frequencies.</li>
    <li>For the "Term" feature, 36 is the most common term, followed by 60 and then 12.</li>
    <li>'0' is the most common value for "Recommendations".</li>
</ul>

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> The distributions are roughly normal, with a high frequency of high-valued points toward the tail.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> The "Investors" feature was heavily skewed, so a log-scaled x-axis with logarithmically spaced bins was used to reveal the shape of its distribution.
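
A minimal sketch of how logarithmically spaced bins are built, using a small made-up array in place of the real "Investors" column. The names `investors`, `step`, and `bin_edges` are hypothetical.

```python
import numpy as np

# Made-up, right-skewed counts standing in for the Investors column
investors = np.array([1, 2, 3, 5, 8, 20, 45, 120, 300, 1189])

# Bin edges evenly spaced in log10 units, starting at 10**0 = 1
step = 0.2
log_edges = np.arange(0, np.log10(investors.max()) + step, step)
bin_edges = 10 ** log_edges

# Each bin spans the same multiplicative factor (10**0.2, about 1.58x)
counts, _ = np.histogram(investors, bins = bin_edges)
```

Because every bin covers the same ratio rather than the same width, small and large counts get comparable resolution once the axis is drawn on a log scale.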

## Bivariate Exploration

> First, we generate a scatter matrix to spot relationships in the data quickly and guide further investigation.

In [None]:
pd.plotting.scatter_matrix(loans,figsize = [21,21]);

After a quick look at the relationships between the features, we plot the categorical features against the features of interest: BorrowerRate and BorrowerAPR.

In [None]:
sb.boxplot(data = loans, x = 'Term', y = 'BorrowerAPR', color = base_color);

In [None]:
sb.boxplot(data = loans, x = 'Term', y = 'BorrowerRate', color = base_color);

In [None]:
sb.boxplot(data = loans, x = 'LoanStatus', y = 'BorrowerRate', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'LoanStatus', y = 'BorrowerAPR', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'IncomeRange', y = 'BorrowerRate', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'IncomeRange', y = 'BorrowerAPR', color = base_color);
plt.xticks(rotation = 90);

Along the way, we also examine the relationships between the categorical variables and the other features.

In [None]:
sb.boxplot(data = loans, x = 'Term', y = 'EstimatedLoss', color = base_color);

In [None]:
sb.boxplot(data = loans, x = 'Term', y = 'EstimatedReturn', color = base_color);

In [None]:
sb.boxplot(data = loans, x = 'IncomeRange', y = 'EstimatedReturn', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'IncomeRange', y = 'EstimatedLoss', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'LoanStatus', y = 'EstimatedLoss', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'LoanStatus', y = 'EstimatedReturn', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'LoanStatus', y = 'Investors', color = base_color);
plt.xticks(rotation = 90);

In [None]:
sb.boxplot(data = loans, x = 'LoanStatus', y = 'LenderYield', color = base_color);
plt.xticks(rotation = 90);

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
<ul>
<li>As the term increases from 12 to 36 to 60, the borrower rate also increases.</li>
<li>BorrowerRate and BorrowerAPR are positively correlated, with varying strength, with the following variables: LenderYield, EstimatedEffectiveYield, EstimatedLoss, and EstimatedReturn.</li>
</ul>
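
The strength of these relationships can be quantified with Pearson correlation coefficients. The sketch below runs on made-up values, not the loan data, and `toy` and `corr_with_rate` are hypothetical names.

```python
import pandas as pd

# Made-up values standing in for a few numeric loan columns
toy = pd.DataFrame({
    'BorrowerRate':  [0.08, 0.12, 0.15, 0.20, 0.28],
    'LenderYield':   [0.07, 0.11, 0.14, 0.19, 0.27],
    'EstimatedLoss': [0.02, 0.03, 0.06, 0.08, 0.15],
})

# Pearson correlation of every other column with BorrowerRate
corr_with_rate = toy.corr()['BorrowerRate'].drop('BorrowerRate')
```

On the real data, the string columns would need to be excluded first, for example with `loans.select_dtypes('number').corr()`.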

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> The Term is positively correlated with the estimated return.
> Loan Status does not vary much with BorrowerRate. There is, however, a slight positive association with LenderYield, and the Completed status is much more common at lower Lender Yields.

## Multivariate Exploration

> In this part, we explore relationships that involve more than two features at a time.

In [None]:
g = sb.FacetGrid(data = loans, col = 'Term', height = 3,
                margin_titles = True)
g.map(plt.scatter, 'EstimatedReturn', 'BorrowerRate');

In [None]:
g = sb.FacetGrid(data = loans, col = 'Term', height = 3,
                margin_titles = True)
g.map(plt.scatter, 'EstimatedLoss', 'BorrowerRate');

In [None]:
g = sb.FacetGrid(data = loans, col = 'Term', height = 3,
                margin_titles = True)
g.map(plt.scatter, 'LenderYield', 'BorrowerRate');

In [None]:
g = sb.FacetGrid(data = loans, col = 'Term', height = 3,
                margin_titles = True)
g.map(plt.scatter, 'EstimatedEffectiveYield', 'BorrowerRate');

In [None]:
plt.scatter(data = loans, x = 'LenderYield', y = 'BorrowerRate', c = 'EstimatedReturn')
plt.colorbar(label = 'EstimatedReturn')
plt.xlabel('LenderYield')
plt.ylabel('BorrowerRate');

In [None]:
plt.scatter(data = loans, x = 'LenderYield', y = 'BorrowerRate', c = 'EstimatedLoss')
plt.colorbar(label = 'EstimatedLoss')
plt.xlabel('LenderYield')
plt.ylabel('BorrowerRate');

In [None]:
plt.scatter(data = loans, x = 'LenderYield', y = 'BorrowerRate', c = 'EstimatedEffectiveYield')
plt.colorbar(label = 'EstimatedEffectiveYield')
plt.xlabel('LenderYield')
plt.ylabel('BorrowerRate');

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> For Estimated Return, the positive correlation with BorrowerRate is stronger when the term is 36 or 60, whereas for Estimated Loss the correlation is stronger at terms 12 and 60.

> The positive correlation between LenderYield and BorrowerRate holds across all terms.
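
Per-term comparisons like these can be backed by a correlation computed within each Term group. The sketch below uses made-up values to show the mechanics. `rate_return_corr` and `corr_by_term` are hypothetical names.

```python
import pandas as pd

# Made-up rows: three loans per term
toy = pd.DataFrame({
    'Term':            [12, 12, 12, 36, 36, 36, 60, 60, 60],
    'BorrowerRate':    [0.10, 0.15, 0.20, 0.10, 0.18, 0.25, 0.12, 0.20, 0.30],
    'EstimatedReturn': [0.06, 0.05, 0.07, 0.05, 0.08, 0.11, 0.06, 0.09, 0.13],
})

def rate_return_corr(group):
    """Pearson correlation between BorrowerRate and EstimatedReturn in one group."""
    return group['BorrowerRate'].corr(group['EstimatedReturn'])

# One coefficient per term, keyed by the term value
corr_by_term = {term: rate_return_corr(group) for term, group in toy.groupby('Term')}
```

Swapping `'EstimatedReturn'` for `'EstimatedLoss'` or `'LenderYield'` gives the matching per-term figures for the other facet plots.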

### Were there any interesting or surprising interactions between features?

> EstimatedEffectiveYield tends to be higher where LenderYield and BorrowerRate rise together. The same applies to EstimatedReturn.

> High EstimatedLoss values are less common along the strong positive trend between LenderYield and BorrowerRate.