# Evaluation: Basic Modeling

In this task, we ask you to do some basic model building. The dataset comes from a competition on [Kaggle](https://www.kaggle.com/c/home-credit-default-risk/overview).

_Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities._

## Download Data

The following cells download the data to your `./data` directory.

In [None]:
# this dataset is about 160 MB
!curl -L -o ./data/application_train.csv "https://www.dropbox.com/s/y9k7cwvpmokua0f/application_train.csv?dl=0"

In [2]:
# this dataset is about 170 MB
!curl -L -o ./data/bureau.csv "https://www.dropbox.com/s/bdrzlo9dp3yrk53/bureau.csv?dl=0"

In [3]:
# this dataset is about 23 MB, used for evaluation
!curl -L -o ./data/application_test.csv "https://www.dropbox.com/s/rp85beqodnstg2s/application_test.csv?dl=0"

In [None]:
task_id = "eval_loans"
# please fill in your participant id, e.g., "experiment_6"
pid = ...

from midas import Midas
m = Midas(pid, task_id)

train_df = m.from_file("./data/application_train.csv")
train_df

### Data: application

Below is what we know about the data. Note that the original Kaggle documentation did not provide much information either, so you may just have to infer from context, or work with the unknown.

* `TARGET` is the field you will be asked to predict, given the other columns.
* A value of 1 for `TARGET` means the client was more than X days late on at least one of the first Y installments of the loan in our sample.
* All the time fields are relative to the application: `DAYS_BIRTH`, `DAYS_EMPLOYED`, `DAYS_REGISTRATION`, and `DAYS_ID_PUBLISH`.
* The housing information should be self-explanatory.
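
Because the time fields are offsets relative to the application, converting them to years usually makes them easier to read. Below is a minimal sketch in plain pandas (not the Midas dataframe API). The toy values are made up, and the negative sign (days *before* the application) is an assumption to check against the real data.

```python
import pandas as pd

# Toy rows standing in for the application data; the values are made up.
# Assumption: offsets are stored as negative numbers (days before application).
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "DAYS_EMPLOYED": [-637, -1188, -225],
})

# abs() makes the result positive regardless of the sign convention,
# then we convert days to years and round to one decimal place.
toy["YEARS_BIRTH"] = (toy["DAYS_BIRTH"].abs() / 365).round(1)
toy["YEARS_EMPLOYED"] = (toy["DAYS_EMPLOYED"].abs() / 365).round(1)

print(toy)
```

`YEARS_EMPLOYED` is a hypothetical column name introduced here for illustration only.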

Header | Description | Special
---|--------- | -----------
SK_ID_CURR | ID of loan in our sample |
TARGET | 1: payment difficulties, 0: other |
NAME_CONTRACT_TYPE | Identification if loan is cash or revolving |
CODE_GENDER | Gender of the client |
FLAG_OWN_CAR | Flag if the client owns a car |
FLAG_OWN_REALTY | Flag if client owns a house or flat |
CNT_CHILDREN | Number of children the client has |
AMT_INCOME_TOTAL | Income of the client |
AMT_CREDIT | Credit amount of the loan |
AMT_ANNUITY | Loan annuity |
AMT_GOODS_PRICE | For consumer loans it is the price of the goods for which the loan is given |
NAME_TYPE_SUITE | Who was accompanying client when he was applying for the loan |
NAME_INCOME_TYPE | "Clients income type (businessman, working, maternity leave,…)" |
NAME_EDUCATION_TYPE | Level of highest education the client achieved |
NAME_FAMILY_STATUS | Family status of the client |
NAME_HOUSING_TYPE | "What is the housing situation of the client (renting, living with parents, ...)" |
REGION_POPULATION_RELATIVE | Normalized population of region where client lives (higher number means the client lives in more populated region) | normalized 
DAYS_BIRTH | Client's age in days at the time of application |
DAYS_EMPLOYED | How many days before the application the person started current employment |
DAYS_REGISTRATION | How many days before the application did client change his registration |
DAYS_ID_PUBLISH | How many days before the application did client change the identity document with which he applied for the loan |
OWN_CAR_AGE | Age of client's car |
FLAG_MOBIL | "Did client provide mobile phone (1=YES, 0=NO)" |
FLAG_EMP_PHONE | "Did client provide work phone (1=YES, 0=NO)" |
FLAG_WORK_PHONE | "Did client provide home phone (1=YES, 0=NO)" |
FLAG_CONT_MOBILE | "Was mobile phone reachable (1=YES, 0=NO)" |
FLAG_PHONE | "Did client provide home phone (1=YES, 0=NO)" |
FLAG_EMAIL | "Did client provide email (1=YES, 0=NO)" |
OCCUPATION_TYPE | What kind of occupation does the client have |
CNT_FAM_MEMBERS | How many family members does client have |
REGION_RATING_CLIENT | "Our rating of the region where client lives (1,2,3)" |
REGION_RATING_CLIENT_W_CITY | "Our rating of the region where client lives with taking city into account (1,2,3)" |
WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for the loan |
HOUR_APPR_PROCESS_START | Approximately at what hour did the client apply for the loan | rounded
REG_REGION_NOT_LIVE_REGION | "Flag if client's permanent address does not match contact address (1=different, 0=same, at region level)" |
REG_REGION_NOT_WORK_REGION | "Flag if client's permanent address does not match work address (1=different, 0=same, at region level)" |
LIVE_REGION_NOT_WORK_REGION | "Flag if client's contact address does not match work address (1=different, 0=same, at region level)" |
REG_CITY_NOT_LIVE_CITY | "Flag if client's permanent address does not match contact address (1=different, 0=same, at city level)" |
REG_CITY_NOT_WORK_CITY | "Flag if client's permanent address does not match work address (1=different, 0=same, at city level)" |
LIVE_CITY_NOT_WORK_CITY | "Flag if client's contact address does not match work address (1=different, 0=same, at city level)" |
ORGANIZATION_TYPE | Type of organization where client works |
EXT_SOURCE_1 | Normalized score from external data source | normalized
EXT_SOURCE_2 | Normalized score from external data source | normalized
EXT_SOURCE_3 | Normalized score from external data source | normalized
APARTMENTS_AVG | housing information | normalized
BASEMENTAREA_AVG | housing information | normalized
YEARS_BEGINEXPLUATATION_AVG | housing information | normalized
YEARS_BUILD_AVG | housing information | normalized
COMMONAREA_AVG | housing information | normalized
ELEVATORS_AVG | housing information | normalized
ENTRANCES_AVG | housing information | normalized
FLOORSMAX_AVG | housing information | normalized
FLOORSMIN_AVG | housing information | normalized
LANDAREA_AVG | housing information | normalized
LIVINGAPARTMENTS_AVG | housing information | normalized
LIVINGAREA_AVG | housing information | normalized
NONLIVINGAPARTMENTS_AVG | housing information | normalized
NONLIVINGAREA_AVG | housing information | normalized
APARTMENTS_MODE | housing information | normalized
BASEMENTAREA_MODE | housing information | normalized
YEARS_BEGINEXPLUATATION_MODE | housing information | normalized
YEARS_BUILD_MODE | housing information | normalized
COMMONAREA_MODE | housing information | normalized
ELEVATORS_MODE | housing information | normalized
ENTRANCES_MODE | housing information | normalized
FLOORSMAX_MODE | housing information | normalized
FLOORSMIN_MODE | housing information | normalized
LANDAREA_MODE | housing information | normalized
LIVINGAPARTMENTS_MODE | housing information | normalized
LIVINGAREA_MODE | housing information | normalized
NONLIVINGAPARTMENTS_MODE | housing information | normalized
NONLIVINGAREA_MODE | housing information | normalized
APARTMENTS_MEDI | housing information | normalized
BASEMENTAREA_MEDI | housing information | normalized
YEARS_BEGINEXPLUATATION_MEDI | housing information | normalized
YEARS_BUILD_MEDI | housing information | normalized
COMMONAREA_MEDI | housing information | normalized
ELEVATORS_MEDI | housing information | normalized
ENTRANCES_MEDI | housing information | normalized
FLOORSMAX_MEDI | housing information | normalized
FLOORSMIN_MEDI | housing information | normalized
LANDAREA_MEDI | housing information | normalized
LIVINGAPARTMENTS_MEDI | housing information | normalized
LIVINGAREA_MEDI | housing information | normalized
NONLIVINGAPARTMENTS_MEDI | housing information | normalized
NONLIVINGAREA_MEDI | housing information | normalized
FONDKAPREMONT_MODE | housing information | normalized
HOUSETYPE_MODE | housing information | normalized
TOTALAREA_MODE | housing information | normalized
WALLSMATERIAL_MODE | housing information | normalized
EMERGENCYSTATE_MODE | housing information | normalized
OBS_30_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings with observable 30 DPD (days past due) default |
DEF_30_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings defaulted on 30 DPD (days past due) |
OBS_60_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings with observable 60 DPD (days past due) default |
DEF_60_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings defaulted on 60 DPD (days past due) |
DAYS_LAST_PHONE_CHANGE | How many days before application did client change phone |
FLAG_DOCUMENT_2 | Did client provide document 2 |
FLAG_DOCUMENT_3 | Did client provide document 3 |
FLAG_DOCUMENT_4 | Did client provide document 4 |
FLAG_DOCUMENT_5 | Did client provide document 5 |
FLAG_DOCUMENT_6 | Did client provide document 6 |
FLAG_DOCUMENT_7 | Did client provide document 7 |
FLAG_DOCUMENT_8 | Did client provide document 8 |
FLAG_DOCUMENT_9 | Did client provide document 9 |
FLAG_DOCUMENT_10 | Did client provide document 10 |
FLAG_DOCUMENT_11 | Did client provide document 11 |
FLAG_DOCUMENT_12 | Did client provide document 12 |
FLAG_DOCUMENT_13 | Did client provide document 13 |
FLAG_DOCUMENT_14 | Did client provide document 14 |
FLAG_DOCUMENT_15 | Did client provide document 15 |
FLAG_DOCUMENT_16 | Did client provide document 16 |
FLAG_DOCUMENT_17 | Did client provide document 17 |
FLAG_DOCUMENT_18 | Did client provide document 18 |
FLAG_DOCUMENT_19 | Did client provide document 19 |
FLAG_DOCUMENT_20 | Did client provide document 20 |
FLAG_DOCUMENT_21 | Did client provide document 21 |
AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the client one hour before application |
AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) |
AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) |
AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) |
AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the client 3 months before application (excluding one month before application) |
AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the client one year before application (excluding last 3 months before application) |
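
The six `AMT_REQ_CREDIT_BUREAU_*` descriptions suggest that each window excludes the shorter ones, so the counts should not overlap. If that reading is right (it is an assumption, not something the documentation confirms), summing them gives the total number of enquiries in the year before the application. A minimal pandas sketch with made-up counts:

```python
import pandas as pd

window_cols = [
    "AMT_REQ_CREDIT_BUREAU_HOUR",
    "AMT_REQ_CREDIT_BUREAU_DAY",
    "AMT_REQ_CREDIT_BUREAU_WEEK",
    "AMT_REQ_CREDIT_BUREAU_MON",
    "AMT_REQ_CREDIT_BUREAU_QRT",
    "AMT_REQ_CREDIT_BUREAU_YEAR",
]

# Two toy clients; the counts are made up.
toy = pd.DataFrame([[0, 0, 1, 2, 0, 3], [0, 1, 0, 0, 1, 0]], columns=window_cols)

# Assumption: the windows are disjoint, so a row-wise sum is the yearly total.
# The new column name is hypothetical. Missing values are skipped by sum().
toy["AMT_REQ_CREDIT_BUREAU_TOTAL_1Y"] = toy[window_cols].sum(axis=1)

print(toy["AMT_REQ_CREDIT_BUREAU_TOTAL_1Y"].tolist())
```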

### Data: bureau

All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.

Can be joined to the previous dataset via `SK_ID_CURR`.
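
Since the bureau data has many rows per client, it usually helps to aggregate it to one row per `SK_ID_CURR` before joining; otherwise application rows get duplicated. The sketch below uses plain pandas on toy frames (not the Midas dataframe API). `EXAMPLE_VALUE` and `N_PREV_CREDITS` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-ins for the two files; all values are made up.
application = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "TARGET": [0, 1, 0],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100003],
    "EXAMPLE_VALUE": [10.0, 20.0, 30.0],  # hypothetical column
})

# One row per client: how many previous credits were reported.
prev_counts = (
    bureau.groupby("SK_ID_CURR").size().rename("N_PREV_CREDITS").reset_index()
)

# Left join keeps every application, including clients with no bureau rows.
joined = application.merge(prev_counts, on="SK_ID_CURR", how="left")
joined["N_PREV_CREDITS"] = joined["N_PREV_CREDITS"].fillna(0).astype(int)

print(joined)
```

A left join is used so that clients without any bureau history stay in the training data, with a count of zero.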

<font color="gray">The original Kaggle contest was not as fully documented, so this is all we know.</font>

## Task One: Exploratory Data Analysis (5 minutes)

Please get a basic sense of the data and write down any relevant insights. <font color="gray">E.g., only X percent of the applicants are women, or most of the loans are for purpose Y.</font>

Please treat this document as a resource to be shared with your (imaginary) team. You can use either comments or markdown cells to report your insights.
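
Insights like these can be checked with a couple of summary statistics. The following is a rough sketch in plain pandas on a toy frame, not the Midas dataframe API. The category labels `"F"` and `"M"` are assumptions about how `CODE_GENDER` is encoded.

```python
import pandas as pd

# Toy rows; the values are made up and the "F"/"M" labels are assumed.
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "F", "M"],
    "TARGET": [0, 1, 0, 1, 0],
})

# Share of applicants in each category (fractions that sum to 1).
gender_share = toy["CODE_GENDER"].value_counts(normalize=True)

# Fraction of clients with TARGET == 1 within each category.
target_rate = toy.groupby("CODE_GENDER")["TARGET"].mean()

print(gender_share)
print(target_rate)
```

The overall mean of `TARGET` is also worth checking, since a heavily imbalanced target changes which evaluation metrics are meaningful.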

## Relevant Midas Techniques

* **Sampling for faster interactions**: The original data may be too large to analyze interactively; you may see a large delay. However, you can sample a subset of the data for interactive analysis first, e.g., `sample_df = train_df.sample(k=500)`, and then verify your results on the full dataset with static visualizations. You can record the query by copying it from the cell dropdown (📋), or by directly snapping the visualization (📷), which will contain the code to derive the data in a comment.
* **Data cleaning** by modifying or adding new processed columns. <font color="gray">For instance, you might want to create a new column `YEARS_BIRTH` from `DAYS_BIRTH` to get a more readable age distribution, e.g., `sample_df['YEARS_BIRTH'] = m.np.round(sample_df['DAYS_BIRTH'] / 365)`</font>
* **Logging your insights**: often the EDA results will infr

## Task Two: Build a Basic Model (35 minutes)

Try to help _Home Credit_ with their prediction task! We ask you to come up with an explainable model based on the exploration you just performed. Please limit the number of features in your final model. If you have any intuitions about why a model is not working, please record them as well.

* Feel free to use `sklearn` or whatever library you are comfortable with, e.g., `from sklearn.linear_model import LogisticRegression`.
* Feel free to pass a Pandas dataframe into the library by calling `train_df.to_df()`, but we ask you to do the data manipulation in our dataframe language.
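
As a starting point, a small logistic regression on a handful of features might look like the sketch below. It runs on synthetic data so that it is self-contained. The choice of features, the generated labels, and the scaling step are all assumptions for illustration, not a recommended model. With the real data you would start from the Pandas dataframe returned by `train_df.to_df()`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data; every value is generated.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "EXT_SOURCE_2": rng.uniform(0, 1, n),
    "AMT_INCOME_TOTAL": rng.normal(150_000, 40_000, n),
})
# Made-up rule: a lower EXT_SOURCE_2 makes TARGET == 1 more likely.
p = 1 / (1 + np.exp(4 * (toy["EXT_SOURCE_2"] - 0.5)))
toy["TARGET"] = (rng.uniform(0, 1, n) < p).astype(int)

features = ["EXT_SOURCE_2", "AMT_INCOME_TOTAL"]
X_train, X_val, y_train, y_val = train_test_split(
    toy[features], toy["TARGET"], test_size=0.25, random_state=0, stratify=toy["TARGET"]
)

# Scaling keeps the coefficients comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# ROC AUC on held-out rows, using the predicted probability of class 1.
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
coefs = model.named_steps["logisticregression"].coef_
print("validation AUC:", auc)
print(dict(zip(features, coefs[0])))
```

Printing the coefficients next to the feature names is one simple way to keep the model explainable: the sign and size of each coefficient on the scaled features shows the direction and relative strength of its effect.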