# Data Science Challenge

In [1]:
# If you'd like to install packages that are not installed by default, uncomment the last two lines of this cell and replace <package list> with a list of your packages.
# This will ensure your notebook has all the dependencies and works everywhere

#import sys
#!{sys.executable} -m pip install <package list>

In [2]:
#Libraries
import pandas as pd
pd.set_option("display.max_columns", 101)

## Data Description

Column | Description
:---|:---
`id` | Unique identifier of a bank
`institution_name` | Name of the bank
`institution_type` | Type of bank with respect to its operation (savings bank, savings association or commercial bank)
`charter_type` | Type of bank defined with respect to its area of operation (STATE, FEDERAL, STATE/FEDERAL)
`headquarters` | Location of the bank's headquarters in the US (City, State)
`latitude` | Latitude of the bank's State
`longitude` | Longitude of the bank's State
`failure_date` | Date on which the bank was officially declared as a failure
`insurance_provider` | Insurance provider to the bank depositors (FDIC, FSLIC, RTC, BIF, SAIF, DIF)
`failure_outcome` | Bailout plan for the bank after failure (PAYOUT, ACQUISITION, PRIVATIZATION, TRANSFER, MANAGEMENT CHANGE)
`total_deposits` | Total deposits in the bank at the time of failure (in thousands of dollars)
`total_assets` | Total assets in the bank at the time of failure (in thousands of dollars)
`liquidity` | Percentage of total available liquid assets at the time of failure
`estimated_loss` | Net expected loss (in thousands of dollars)

## Data Wrangling & Visualization

In [3]:
# Dataset is already loaded below
data = pd.read_csv("train.csv")

In [4]:
data.head()

Unnamed: 0,id,institution_name,institution_type,charter_type,headquarters,latitude,longitude,failure_date,insurance_provider,failure_outcome,total_deposits,total_assets,liquidity,estimated_loss
0,1,DELTA SECURITY BANK AND TRUST COMPANY,COMMERCIAL BANK,STATE,"FERRIDAY, LA",31.630166,-91.554565,1973-01-19,FDIC,PAYOUT,0.8079,0.978,0.826074,0.269135
1,2,BANK OF STURGIS,COMMERCIAL BANK,STATE,"STURGIS, KY",37.546714,-87.983914,1937-07-03,FDIC,PAYOUT,0.0213,0.0246,0.865854,0.078526
2,3,OCONTO COUNTY STATE BANK,COMMERCIAL BANK,STATE,"OCONTO FALLS, WI",44.87388,-88.14288,1939-01-04,FDIC,PAYOUT,0.0346,0.0386,0.896373,0.065794
3,4,THE FIRST NAT BK & TR CO. OF OKLAHOMA,COMMERCIAL BANK,FEDERAL,"OKLAHOMA CITY, OK",35.472989,-97.517054,1986-07-14,FDIC,ACQUISITION,130.1346,175.4157,0.741864,22.0721
4,5,THE BANK OF BRONSON,COMMERCIAL BANK,STATE,"BRONSON, KS",37.895871,-95.073308,1985-08-23,FDIC,PAYOUT,0.9294,0.9604,0.967722,0.244277


In [5]:
#Explore columns
data.columns

Index(['id', 'institution_name', 'institution_type', 'charter_type',
       'headquarters', 'latitude', 'longitude', 'failure_date',
       'insurance_provider', 'failure_outcome', 'total_deposits',
       'total_assets', 'liquidity', 'estimated_loss'],
      dtype='object')

In [6]:
#Description
data.describe()

Unnamed: 0,id,latitude,longitude,total_deposits,total_assets,liquidity,estimated_loss
count,2660.0,2660.0,2660.0,2427.0,2504.0,2545.0,2660.0
mean,1330.5,36.576786,-93.239347,26.777879,31.211454,0.947292,6.07582
std,768.020182,6.290209,15.121618,97.85274,116.580617,0.227383,31.702348
min,1.0,-32.88418,-157.855676,0.0,0.0014,0.0,0.0
25%,665.75,32.425746,-98.493387,1.46285,1.492775,0.88317,0.2211
50%,1330.5,35.960395,-94.74049,4.6973,4.72415,0.950415,0.86235
75%,1995.25,40.853905,-84.839522,16.9113,17.5171,0.996296,3.462375
max,2660.0,69.75,136.66339,2007.2099,2545.5112,8.897978,1243.8005


## Visualization, Modeling, Machine Learning

Build a model that predicts the `estimated_loss` of a failed bank and identify how different features influence the model's predictions. Explain the findings clearly to both technical and non-technical audiences, using comments and visualizations where appropriate.
- **Build an optimized model that effectively solves the business problem.**
- **The model will be evaluated on the basis of mean absolute percentage error (MAPE).**
- **Read the test.csv file and prepare features for testing.**
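
The evaluation metric named above can be sketched as follows. The exact scoring formula used by the challenge is not given here, so this is an assumed, conventional definition. Because the training summary shows a minimum `estimated_loss` of 0.0, the denominator is guarded with a small `eps`; that guard and the helper name `mape` are assumptions for illustration.

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-9):
    """Mean absolute percentage error, returned as a fraction (0.1 == 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Guard against division by zero when the true loss is exactly 0.
    denom = np.maximum(np.abs(y_true), eps)
    return float(np.mean(np.abs(y_true - y_pred) / denom))

# True values taken from the training preview; predictions are made up.
score = mape([0.269135, 0.078526, 22.0721], [0.25, 0.10, 20.0])
print(f"MAPE: {score:.3f}")
```

Rows with a true loss near zero dominate this metric, so it may be worth inspecting them separately during validation. scikit-learn's `mean_absolute_percentage_error` could be used instead of a hand-written helper.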

In [7]:
#Loading Test data
test_data=pd.read_csv('test.csv')
test_data.head()

Unnamed: 0,id,institution_name,institution_type,charter_type,headquarters,latitude,longitude,failure_date,insurance_provider,failure_outcome,total_deposits,total_assets,liquidity
0,1,CENTER S & LA,SAVINGS ASSOCIATION,FEDERAL/STATE,"CLIFTON, NJ",40.858433,-74.163755,1991-01-25,RTC,ACQUISITION,12.5372,13.2562,0.945761
1,2,THE FIRST NATIONAL BANK OF ONAGA,COMMERCIAL BANK,FEDERAL,"ONAGA, KS",39.488885,-96.169999,1985-07-23,FDIC,ACQUISITION,2.2259,2.2379,0.994638
2,3,EXPRESSWAY BANK,COMMERCIAL BANK,STATE,"OKLAHOMA CITY, OK",35.472989,-97.517054,1987-03-12,FDIC,ACQUISITION,,1.9089,
3,4,TREASURE STATE BANK,COMMERCIAL BANK,STATE,"GLASGOW, MT",48.195591,-106.635556,1989-06-09,FDIC,ACQUISITION,1.315,1.4553,0.903594
4,5,JEFFERSON NATIONAL BANK,COMMERCIAL BANK,FEDERAL,"WATERTOWN, NY",43.974784,-75.910756,1993-02-26,BIF,ACQUISITION,,25.6014,0.971244
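
The test preview contains missing `total_deposits` and `liquidity` values, and a `charter_type` value (`FEDERAL/STATE`) that may not appear in the training data. One possible way to prepare features is sketched below: statistics are learned on the training set only and the test columns are aligned to the training columns. The helper `prepare_features`, the column selection and the median imputation are assumptions for illustration, not a prescribed approach; the toy frames reuse rows from the previews above.

```python
import numpy as np
import pandas as pd

NUMERIC = ["total_deposits", "total_assets", "liquidity"]
CATEGORICAL = ["charter_type", "failure_outcome"]

def prepare_features(df, medians, columns=None):
    """Build a numeric feature matrix (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    out["failure_year"] = pd.to_datetime(df["failure_date"]).dt.year
    for col in NUMERIC:
        # Impute with medians computed on the training set only.
        out[col] = df[col].fillna(medians[col])
    # The state is the text after the last comma in "City, State".
    state = df["headquarters"].str.rsplit(",", n=1).str[-1].str.strip()
    cats = df[CATEGORICAL].assign(state=state)
    out = out.join(pd.get_dummies(cats, dtype=float))
    if columns is not None:
        # Align to the training columns: unseen categories are dropped,
        # categories absent from this frame are filled with 0.
        out = out.reindex(columns=columns, fill_value=0.0)
    return out

train_toy = pd.DataFrame({
    "headquarters": ["FERRIDAY, LA", "OKLAHOMA CITY, OK"],
    "failure_date": ["1973-01-19", "1986-07-14"],
    "charter_type": ["STATE", "FEDERAL"],
    "failure_outcome": ["PAYOUT", "ACQUISITION"],
    "total_deposits": [0.8079, 130.1346],
    "total_assets": [0.978, 175.4157],
    "liquidity": [0.826074, 0.741864],
})
test_toy = pd.DataFrame({
    "headquarters": ["OKLAHOMA CITY, OK", "CLIFTON, NJ"],
    "failure_date": ["1987-03-12", "1991-01-25"],
    "charter_type": ["STATE", "FEDERAL/STATE"],
    "failure_outcome": ["ACQUISITION", "ACQUISITION"],
    "total_deposits": [np.nan, 12.5372],
    "total_assets": [1.9089, 13.2562],
    "liquidity": [np.nan, 0.945761],
})

medians = train_toy[NUMERIC].median()
X_train = prepare_features(train_toy, medians)
X_test = prepare_features(test_toy, medians, columns=X_train.columns)
print(X_test.head())
```

In the notebook, `train_toy` and `test_toy` would be replaced by `data` and `test_data`. Dropping unseen categories is only one option; mapping them to an explicit "other" level is another.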




**The management wants to know the most important features for the model.**

> #### Task:
- **Visualize the top 20 features and their feature importance.**
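
A minimal sketch of one way to produce such a chart, assuming a tree-based model from scikit-learn and its impurity-based `feature_importances_`. The data here is synthetic and the names `feature_0` to `feature_24` are placeholders; in the notebook, `X` and `y` would be the prepared training features and `estimated_loss`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data: only feature_0 and feature_1 drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 25)),
                 columns=[f"feature_{i}" for i in range(25)])
y = 3.0 * X["feature_0"] - 2.0 * X["feature_1"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Keep the 20 largest importances, sorted so the largest bar is drawn on top.
importance = pd.Series(model.feature_importances_, index=X.columns)
top20 = importance.nlargest(20).sort_values()

ax = top20.plot.barh(figsize=(8, 6))
ax.set_xlabel("Impurity-based importance")
ax.set_title("Top 20 features")
plt.tight_layout()
```

Impurity-based importances can favor high-cardinality features, so permutation importance on a held-out split may be a useful cross-check.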


> #### Task:
- **Submit the predictions on the test dataset using the optimized model** <br/>
    For each record in the test set (`test.csv`), predict the value of the `estimated_loss` variable. Submit a CSV file with a header row and one row per test entry. 

The file (`submissions.csv`) should have exactly 2 columns:
   - **id**
   - **estimated_loss**
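
The submission frame can be assembled along these lines. This is a sketch under stated assumptions: `test_ids` stands in for `test_data["id"]`, `predictions` stands in for the model output (the values below are made up), and clipping negative predictions to zero is an assumed post-processing step rather than a stated requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data["id"] and the model's predictions.
test_ids = pd.Series([1, 2, 3, 4, 5], name="id")
predictions = np.array([1.2, 0.4, -0.1, 0.3, 5.0])  # made-up values

submission_df = pd.DataFrame({
    "id": test_ids,
    # Assumption: a loss cannot be negative, so negative predictions are clipped to 0.
    "estimated_loss": np.clip(predictions, 0, None),
})

# Sanity checks against the required format: exactly two columns, one row per id.
assert list(submission_df.columns) == ["id", "estimated_loss"]
assert submission_df["id"].is_unique
```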

In [0]:
#Submission (assumes submission_df already holds the id and estimated_loss columns)
submission_df.to_csv('submissions.csv',index=False)

---