# MLE challenge - Feature engineering

### Notebook 1

In this notebook we compute five features for the **credit risk** dataset.
Each row in the dataset represents a loan that a user took out on a given date.

These features are roughly defined as follows:

**nb_previous_loans:** number of loans granted to a given user, before the current loan.

**avg_amount_loans_previous:** average amount of the loans granted to a user before the current loan.

**age:** user age in years.

**years_on_the_job:** years the user has been in employment.

**flag_own_car:** flag indicating whether the user owns a car.

The problem: as implemented in this notebook, the feature `avg_amount_loans_previous` takes too long to compute for all 777,715 rows of the dataset.




In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/raw_data/dataset_credit_risk.csv')

In [3]:
df.shape

(777715, 24)

In [4]:
df.head()

Unnamed: 0,loan_id,id,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,name_income_type,name_education_type,name_family_status,...,flag_work_phone,flag_phone,flag_email,occupation_type,cnt_fam_members,status,birthday,job_start_date,loan_date,loan_amount
0,208089,5044500,F,N,Y,0,45000.0,Pensioner,Secondary / secondary special,Widow,...,0,0,0,,1.0,0,1955-08-04,3021-09-18,2019-01-01,133.714974
1,112797,5026631,F,N,Y,0,99000.0,Working,Secondary / secondary special,Separated,...,0,0,0,Medicine staff,1.0,0,1972-03-30,1997-06-05,2019-01-01,158.800558
2,162434,5036645,M,Y,N,0,202500.0,Working,Incomplete higher,Married,...,0,0,0,Drivers,2.0,0,1987-03-24,2015-02-22,2019-01-01,203.608487
3,144343,5033584,F,N,Y,0,292500.0,Working,Higher education,Married,...,0,0,0,,2.0,0,1973-03-15,2009-06-29,2019-01-01,113.204964
4,409695,5085755,F,Y,Y,1,112500.0,Commercial associate,Secondary / secondary special,Civil marriage,...,0,0,0,Core staff,3.0,0,1989-10-15,2019-07-03,2019-01-01,109.37626


In [5]:
# Parse dates first so the sort is chronological rather than lexicographic
df["loan_date"] = pd.to_datetime(df.loan_date)
df = df.sort_values(by=["id", "loan_date"])
df = df.reset_index(drop=True)
df

Unnamed: 0,loan_id,id,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,name_income_type,name_education_type,name_family_status,...,flag_work_phone,flag_phone,flag_email,occupation_type,cnt_fam_members,status,birthday,job_start_date,loan_date,loan_amount
0,1008,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-02-01,102.283361
1,1000,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-02-15,136.602049
2,1012,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-02-17,114.733694
3,1011,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-05-20,103.539050
4,1003,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-07-05,112.948147
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
777710,172506,5150487,M,Y,N,0,202500.0,Working,Secondary / secondary special,Married,...,0,0,0,Drivers,2.0,0,1968-08-08,2015-10-13,2020-09-10,117.792205
777711,172513,5150487,M,Y,N,0,202500.0,Working,Secondary / secondary special,Married,...,0,0,0,Drivers,2.0,0,1968-08-08,2015-10-13,2020-10-13,105.778335
777712,172512,5150487,M,Y,N,0,202500.0,Working,Secondary / secondary special,Married,...,0,0,0,Drivers,2.0,0,1968-08-08,2015-10-13,2020-10-16,112.319242
777713,172500,5150487,M,Y,N,0,202500.0,Working,Secondary / secondary special,Married,...,0,0,0,Drivers,2.0,0,1968-08-08,2015-10-13,2020-11-25,113.627617


#### Feature nb_previous_loans

In [6]:
df_grouped = df.groupby("id")
df["nb_previous_loans"] = df_grouped["loan_date"].rank(method="first") - 1
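Because the frame is already sorted by `id` and `loan_date`, ranking with `method="first"` and subtracting 1 should amount to a running count of rows within each user. The sketch below checks this on a small, made-up frame (`toy` is illustrative, not part of the dataset). Note that two loans on the same date get different counts here, so the second of them is counted as having one previous loan.

```python
import pandas as pd

# Hypothetical toy data: user 7 has two loans on the same date.
toy = (
    pd.DataFrame(
        {
            "id": [7, 7, 7, 9],
            "loan_date": pd.to_datetime(
                ["2019-01-05", "2019-01-05", "2019-03-01", "2019-02-01"]
            ),
        }
    )
    .sort_values(["id", "loan_date"])
    .reset_index(drop=True)
)

# The notebook's approach: rank dates within each user, ties broken by row order.
by_rank = toy.groupby("id")["loan_date"].rank(method="first") - 1

# Alternative: number the rows of each user in their current order.
by_count = toy.groupby("id").cumcount()

print(pd.DataFrame({"by_rank": by_rank, "by_count": by_count}))
```

`rank` returns floats while `cumcount` returns integers, so the two columns differ in dtype but not in value on sorted input.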

#### Feature avg_amount_loans_previous

In [7]:
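def avg_amount_loans_prev(df):
    """For each loan of a single user, average the amounts of that user's strictly earlier loans."""
    # dtype=float avoids the empty-Series dtype warning; loans with no earlier loan stay NaN
    avg = pd.Series(index=df.index, dtype=float)
    for i in df.index:
        df_aux = df.loc[df.loan_date < df.loan_date.loc[i], :]
        avg.at[i] = df_aux.loan_amount.mean()
    return avg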

In [8]:
# This loop is the bottleneck: it filters the whole DataFrame once per user,
# so it takes far too long on the full dataset.
user_averages = []
for user in df.id.unique():
    df_user = df.loc[df.id == user, :]
    user_averages.append(avg_amount_loans_prev(df_user))
# Series.append was removed in pandas 2.0; concatenate once at the end instead
avg_amount_loans_previous = pd.concat(user_averages)



In [9]:
df["avg_amount_loans_previous"] = avg_amount_loans_previous
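One possible way around the slow loop is to replace it with grouped cumulative sums. The sketch below is not the notebook's implementation: the function name `avg_amount_loans_previous_fast` and the `toy` frame are made up for illustration. It assumes the columns `id`, `loan_date` and `loan_amount` and keeps the strict "earlier date" rule of `avg_amount_loans_prev`, so loans taken on the same date do not count as previous to each other.

```python
import pandas as pd


def avg_amount_loans_previous_fast(df: pd.DataFrame) -> pd.Series:
    """Mean loan_amount over each user's loans with a strictly earlier loan_date."""
    # Total amount and number of loans per (user, date); groupby sorts the keys.
    daily = (
        df.groupby(["id", "loan_date"])["loan_amount"]
        .agg(total="sum", n="count")
        .reset_index()
    )
    by_user = daily.groupby("id")
    # Running totals up to each date, minus that date's own loans.
    daily["prev_total"] = by_user["total"].cumsum() - daily["total"]
    daily["prev_n"] = by_user["n"].cumsum() - daily["n"]
    # No earlier loans -> NaN, matching the mean of an empty selection.
    daily["avg_prev"] = daily["prev_total"] / daily["prev_n"].where(daily["prev_n"] > 0)
    # Broadcast the per-date value back to every loan row, keeping the row order.
    merged = df[["id", "loan_date"]].merge(
        daily[["id", "loan_date", "avg_prev"]], on=["id", "loan_date"], how="left"
    )
    return pd.Series(
        merged["avg_prev"].to_numpy(), index=df.index, name="avg_amount_loans_previous"
    )


# Hypothetical toy data: user 1 has two loans on the same date.
toy = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "loan_date": pd.to_datetime(
            ["2019-01-01", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
        ),
        "loan_amount": [100.0, 200.0, 300.0, 50.0, 70.0],
    }
)
toy_avg = avg_amount_loans_previous_fast(toy)
print(toy_avg)
```

Because the averages come from differences of cumulative sums, they may differ from the loop's results in the last few floating-point digits, so it is worth comparing the two on a sample of users before swapping one for the other.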

#### Feature age

In [10]:
from datetime import datetime, date

In [11]:
df['birthday'] = pd.to_datetime(df['birthday'], errors='coerce')


In [12]:
df['age'] = (pd.to_datetime('today').normalize() - df['birthday']).dt.days // 365
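Dividing a day count by 365 ignores leap days, so it can report a birthday slightly early, and using today's date means the feature changes whenever the notebook is re-run. A minimal sketch of a calendar-based alternative follows; the helper `age_in_years` and the fixed reference date are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def age_in_years(birthday: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Completed calendar years between each birthday and ref (NaN where the birthday is NaT)."""
    # True where the birthday has already occurred in ref's calendar year.
    had_birthday = (birthday.dt.month < ref.month) | (
        (birthday.dt.month == ref.month) & (birthday.dt.day <= ref.day)
    )
    return ref.year - birthday.dt.year - (~had_birthday).astype(int)


ref = pd.Timestamp("2020-06-14")  # hypothetical fixed reference date
birthdays = pd.to_datetime(pd.Series(["2000-06-15", "2000-06-14", None]))

calendar_age = age_in_years(birthdays, ref)
day_count_age = (ref - birthdays).dt.days // 365  # the notebook's approximation

print(pd.DataFrame({"calendar": calendar_age, "days // 365": day_count_age}))
```

For the first birthday, one day short of the 20th anniversary, the calendar rule gives 19 while the day-count rule already gives 20, because five leap days have accumulated. The same helper could be applied to `job_start_date`.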

#### Feature years_on_the_job

In [13]:
df['job_start_date'] = pd.to_datetime(df['job_start_date'], errors='coerce')

In [14]:
df['years_on_the_job'] = (pd.to_datetime('today').normalize() - df['job_start_date']).dt.days // 365

#### Feature flag_own_car

In [15]:
# 'N' -> 0, any other value (including missing) -> 1
df['flag_own_car'] = (df['flag_own_car'] != 'N').astype(int)
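The encoding above maps every value other than `'N'` to 1, which includes missing or unexpected values. Whether the raw column contains any such values is not shown in this notebook, so the sketch below uses a made-up Series `raw` to contrast that lenient rule with a stricter explicit mapping that leaves anything unrecognised as NaN.

```python
import pandas as pd

# Hypothetical raw values, including a missing entry and a lowercase variant.
raw = pd.Series(["Y", "N", None, "y"], dtype=object)

# Lenient rule: everything that is not exactly 'N' becomes 1.
lenient = (raw != "N").astype(int)

# Strict rule: only the two known codes are mapped; the rest become NaN.
strict = raw.map({"Y": 1, "N": 0})

print(pd.DataFrame({"raw": raw, "lenient": lenient, "strict": strict}))
```

With the strict mapping, `strict.isna().sum()` gives a quick count of rows that need a closer look before training.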

## Save dataset for model training

In [16]:
df = df[['id', 'age', 'years_on_the_job', 'nb_previous_loans', 'avg_amount_loans_previous', 'flag_own_car', 'status']]


In [17]:
df.to_csv('train_model.csv', index=False)