# Problem Statement (Gramener Case Study)
## Introduction
Solving this assignment will give you an idea about how real business problems are solved using EDA. In this case study, apart from applying the techniques you have learnt in EDA, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.

  

## Business Understanding
You work for a consumer finance company which specialises in offering various types of loans to urban customers. When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
- If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

 

The data given below contains the information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

 

In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

![image.png](attachment:image.png)

When a person applies for a loan, the company can take one of two decisions:

- Loan accepted: If the company approves the loan, there are 3 possible scenarios:
    - Fully paid: The applicant has fully repaid the loan (the principal and the interest).
    - Current: The applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.
    - Charged-off: The applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan.
- Loan rejected: The company rejected the loan (for example, because the candidate did not meet its requirements). Since the loan was rejected, these applicants have no transactional history with the company, so this data is not available to the company (and thus not in this dataset).
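
The loan statuses described above can be turned into a default indicator. The following is a minimal sketch in base R on toy data, not the prescribed method: the column name `loan_status` and the exact status labels are assumptions that should be checked against the data dictionary, and setting aside 'Current' loans is one possible choice, made here because their final outcome is not yet known.

```r
# Toy data standing in for the loan file.
# ASSUMPTION: the status column is called `loan_status` and uses these labels.
loans <- data.frame(
  id = 1:6,
  loan_status = c("Fully paid", "Charged-off", "Current",
                  "Fully paid", "Charged-off", "Fully paid"),
  stringsAsFactors = FALSE
)

# Loans still in progress have no known outcome yet, so set them aside.
closed_loans <- loans[loans$loan_status != "Current", ]

# 1 = defaulted (charged-off), 0 = fully paid.
closed_loans$defaulted <- as.integer(closed_loans$loan_status == "Charged-off")

# Share of closed loans that ended in default.
default_rate <- mean(closed_loans$defaulted)
print(default_rate)
```

With a flag like `defaulted` in place, every later comparison reduces to asking how its mean changes across groups of applicants.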

# Business Objectives
This company is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface. 

 

As with most other lending companies, lending to ‘risky’ applicants is the largest source of financial loss (called credit loss). Credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest amount of loss to the lenders. In this case, the customers labelled as 'charged-off' are the 'defaulters'.

 

If these risky loan applicants can be identified, then such loans can be reduced, thereby cutting down the amount of credit loss. Identifying such applicants using EDA is the aim of this case study.

 

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment. 
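
One simple way to look for driver variables is to compare the default rate across the levels of a candidate variable. The sketch below uses base R on toy data; the helper `default_rate_by`, the column `grade` and the 0/1 column `defaulted` are hypothetical names introduced only for illustration and are not taken from the dataset.

```r
# Toy data. ASSUMPTION: `grade` is a categorical attribute and
# `defaulted` is a 0/1 flag (1 = charged-off); both names are hypothetical.
loans <- data.frame(
  grade = c("A", "A", "A", "B", "B", "C", "C", "C"),
  defaulted = c(0, 0, 1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Default rate and group size for each level of one variable,
# sorted so the riskiest level comes first.
default_rate_by <- function(data, variable, target = "defaulted") {
  rates <- tapply(data[[target]], data[[variable]], mean)
  counts <- tapply(data[[target]], data[[variable]], length)
  result <- data.frame(
    level = names(rates),
    n = as.integer(counts),
    default_rate = as.numeric(rates),
    stringsAsFactors = FALSE
  )
  result[order(-result$default_rate), ]
}

rates_by_grade <- default_rate_by(loans, "grade")
print(rates_by_grade)
```

A variable whose levels show clearly different default rates, on groups that are not too small, is a candidate driver; reporting `n` next to the rate helps avoid over-reading tiny groups.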


To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

 

# Data Understanding
 

Download the dataset from below. It contains the complete loan data for all loans issued from 2007 to 2011.

The loan dataset is in EDA_Resources.

The meaning of these variables is given in the data dictionary in EDA_Resources.

# Results Expected

1. Write all your code in one well-commented R file; briefly mention the insights and observations from the analysis.
2. Present the overall approach of the analysis in a presentation:
    - Mention the problem statement and the analysis approach briefly.
    - Explain the results of univariate, bivariate analysis etc. in business terms.
    - Include visualisations and summarise the most important results in the presentation.
 

You need to submit the following two components:

- R commented file: Should include detailed comments and should not contain unnecessary pieces of code.
- PPT: Make a PPT to present your analysis to the chief data scientist of your company (and thus you should include both the technical and the business aspects). The PPT should be concise, clear, and to the point. Submit the PPT after converting it into PDF format.
 

Important Note: You are supposed to code entirely in R. All your plots (i.e. the ones you create during EDA and those that you choose to put in PPT) must be created in R, though you may recreate the same in Tableau as well for better aesthetics. Please submit the PPT in a PDF format. Please make sure to rename your R script as "Group_Facilitator_RollNo_main.R".

# Evaluation Rubric
 

| Criteria | Meets expectations | Does not meet expectations |
| --- | --- | --- |
| Data understanding and preparation (10%) | All data quality issues are correctly identified and reported.<br><br>Wherever required, the meanings of the variables are correctly interpreted and written either in the comments or the PPT. | Data quality issues are overlooked or are not identified correctly such as outliers, missing values and other data quality issues.<br><br>The variables are interpreted incorrectly or the meaning of variables is not mentioned in either the comments or the PPT. |
| Data Cleaning and Manipulation (20%) | Data quality issues are addressed in the right way (missing value imputation, outlier treatment and other kinds of data redundancies, etc.).<br><br>If applicable, data is converted to a suitable and convenient format to work with using the right methods.<br><br>Manipulation of strings and dates is done correctly wherever required. | Data quality issues are not addressed correctly.<br><br>The variables are not converted to an appropriate format for analysis.<br><br>String and date manipulation is not done correctly or is done using complex methods. |
| Data analysis (40%) | The right problem is solved which is coherent with the needs of the business. The analysis has a clear structure and the flow is easy to understand.<br><br>Univariate and segmented univariate analysis are done correctly and appropriate realistic assumptions are made wherever required. The analyses successfully identify at least the 5 important driver variables (i.e. variables which are strong indicators of default).<br><br>Business-driven, type-driven and data-driven metrics are created for the important variables and utilised for analysis. The explanation for creating the derived metrics is mentioned and is reasonable.<br><br>Bivariate analysis is performed correctly and is able to identify the important combinations of driver variables. The combinations of variables are chosen such that they make business or analytical sense.<br><br>The most useful insights are explained correctly in the comments.<br><br>Appropriate plots (ggplot or Tableau) are created to present the results of the analysis. The choice of plots for respective cases is correct. The plots should clearly present the relevant insights and should be easy to read. The axes and important data points are labelled correctly. | The analyses do not address the right problem or deviate from the business objectives. The analysis lacks a clear structure and is not easy to follow.<br><br>The univariate and bivariate analysis is not performed in sufficient detail and thus some crucial insights are missed out. The analyses are not able to identify enough important driver variables.<br><br>New metrics are not derived wherever appropriate. The explanation for creating the derived metrics is either not mentioned or the metrics are not reasonable.<br><br>Derived metrics are not analysed correctly/are insufficiently utilised.<br><br>Important insights are not mentioned in the report or the R file. Relevant plots are not created. The choice of plots is not ideal and the plots are either difficult to interpret or lack clarity or neatness. Relevant insights are not clearly presented by the plots. The axes and important data points are not labelled correctly/neatly. |
| Presentation and Recommendations (20%) | The presentation has a clear structure, is not too long, and explains the most important results concisely in simple language.<br><br>The recommendations to solve the problems are realistic, actionable and coherent with the analysis.<br><br>If any assumptions are made, they are stated clearly. | The presentation lacks structure, is too long or does not put emphasis on the important observations. The language used is complicated for business people to understand.<br><br>The recommendations to solve the problems are either unrealistic, non-actionable or incoherent with the analysis.<br><br>Contains unnecessary details or lacks the important ones.<br><br>Assumptions made, if any, are not stated clearly. |
| Conciseness and readability of the code (10%) | The code is concise and syntactically correct. Wherever appropriate, built-in functions and standard libraries are used instead of writing long code (if-else statements, for loops, etc.).<br><br>Custom functions are used to perform repetitive tasks.<br><br>The code is readable with appropriately named variables and detailed comments are written wherever necessary. | Long and complex code used instead of shorter built-in functions.<br><br>Custom functions are not used to perform repetitive tasks resulting in the same piece of code being repeated multiple times.<br><br>Code readability is poor because of vaguely named variables or lack of comments wherever necessary. |
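
As an illustration of the rubric's points on string and date manipulation, built-in functions and custom functions, the sketch below wraps two small cleaning steps in reusable, vectorised helpers. It is only a sketch under stated assumptions: the column names `int_rate` and `issue_d`, the formats `"10.65%"` and `"Dec-11"`, and the helper names are hypothetical and should be checked against the actual data and data dictionary.

```r
# ASSUMPTION: percentages arrive as text such as "10.65%".
# Strip the percent sign and convert to a number.
parse_percent <- function(x) {
  as.numeric(sub("%", "", x, fixed = TRUE))
}

# ASSUMPTION: dates arrive as English "Mon-yy" text such as "Dec-11",
# and all years fall in the 2000s (the data covers 2007 to 2011).
# Using month.abb avoids depending on the system locale.
parse_month_year <- function(x) {
  month <- match(substr(x, 1, 3), month.abb)    # "Dec" -> 12
  year <- 2000L + as.integer(substr(x, 5, 6))   # "11"  -> 2011
  as.Date(sprintf("%04d-%02d-01", year, month)) # first day of that month
}

# Toy data with hypothetical column names.
raw <- data.frame(
  int_rate = c("10.65%", "7.90%"),
  issue_d = c("Dec-11", "Jan-08"),
  stringsAsFactors = FALSE
)

# Both helpers work on whole columns, so no loops are needed.
raw$int_rate_num <- parse_percent(raw$int_rate)
raw$issue_date <- parse_month_year(raw$issue_d)
print(raw)
```

Keeping each cleaning rule in one named function means the same logic can be applied to every column with that format without repeating code.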

