# Data Ingestion & Brief EDA notebook
This notebook is for incrementally developing modular classes that ingest the individual tables of the loan application dataset.

The goals for data ingestion in src are to:
1) Ingest all data sources (bronze tables) as individual pandas dataframes
2) Merge all datasets into one large dataframe suitable for analysis (one silver table)

delete later: goal is to predict the loan outcome for finished loans at the time of loan start

## EDA
Start developing the data ingestion and merging code. Modularizing the code is not a priority yet; exploratory investigation is fine here.

In [21]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# url to raw content in repo
data_url= 'https://raw.githubusercontent.com/vvbauman/sample-work-loan-application/feature/data-ingestion/dataset/'

In [22]:
# ingest all data sources as pandas dataframes - these are saved as bronze tables
account= pd.read_csv(data_url + 'account.txt', sep= ';')
card= pd.read_csv(data_url + 'card.txt', sep= ';')
client= pd.read_csv(data_url + 'client.txt', sep= ';')
disp= pd.read_csv(data_url + 'disp.txt',  sep= ';')
district= pd.read_csv(data_url + 'district.txt',  sep= ';')
loan= pd.read_csv(data_url + 'loan.txt', sep= ';')
order= pd.read_csv(data_url + 'order.txt', sep= ';')
transactions= pd.read_csv(data_url + 'trans.txt', sep= ';')
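
The eight `read_csv` calls above share the same separator and URL prefix, so they could be folded into one helper as a first step toward the modular ingestion code. The sketch below is one possible shape, not the notebook's implementation: `load_bronze_tables` and `TABLE_FILES` are hypothetical names, and `base` is assumed to be a prefix such as `data_url`.

```python
import pandas as pd

# hypothetical mapping of dataframe name -> file name, mirroring the cell above
TABLE_FILES= {
    'account': 'account.txt',
    'card': 'card.txt',
    'client': 'client.txt',
    'disp': 'disp.txt',
    'district': 'district.txt',
    'loan': 'loan.txt',
    'order': 'order.txt',
    'transactions': 'trans.txt',
}

def load_bronze_tables(base : str, table_files : dict = None, sep : str = ';') -> dict:
    """Read each file found under base and return the dataframes keyed by table name."""
    table_files= TABLE_FILES if table_files is None else table_files
    return {
        name: pd.read_csv(base + file_name, sep= sep)
        for name, file_name in table_files.items()
    }
```

With this helper, `bronze= load_bronze_tables(data_url)` would give `bronze['loan']`, `bronze['account']`, and so on, which keeps the per-table steps that follow easy to loop over.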

In [25]:
# check for null values. If fewer than 5% of rows contain nulls, drop those rows
# optional subset: list of column names, passed to DataFrame.dropna() so that only those columns are checked for nulls
def drop_nulls(df : pd.DataFrame, subset : list = None):
    # None (instead of a mutable [] default) means "check every column";
    # an empty subset would make dropna() check no columns, so no rows would be dropped
    cols= list(subset) if subset else list(df.columns)

    # fraction of rows with at least one null in the checked columns
    percent_null= df[cols].isnull().any(axis= 1).mean()

    if percent_null == 0:
        print('No null values in dataframe. Returned dataframe is same as input dataframe')
        return df
    elif percent_null < 0.05:
        print('Less than 5% of rows are missing data. Returned dataframe is same as input dataframe with these rows dropped')
        return df.dropna(subset= cols)
    else:
        print('5% or more of rows are missing data. Returning dataframe without dropping rows')
        return df

account= drop_nulls(account)
card= drop_nulls(card)
client= drop_nulls(client)
disp= drop_nulls(disp)
district= drop_nulls(district)
loan= drop_nulls(loan)
order= drop_nulls(order)
transactions= drop_nulls(transactions, subset= ['trans_id', 'account_id', 'date'])

# if any rows are dropped, this would be a good place to save these tables as silver tables, since they have undergone some preprocessing

No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
No null values in dataframe. Returned dataframe is same as input dataframe
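
The threshold messages above refer to the share of rows with missing data. As a hedged illustration on made-up data (the toy dataframe below is not part of the loan dataset), the share of rows with at least one null can differ a lot from the number of null cells divided by the row count:

```python
import numpy as np
import pandas as pd

# toy dataframe, invented for illustration: 4 rows, 3 null cells, all in the same row
toy= pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

# null cells divided by number of rows: the same row is counted once per null column
cells_per_row= toy.isnull().sum().sum() / len(toy)

# share of rows that contain at least one null
row_fraction= toy.isnull().any(axis= 1).mean()

print(cells_per_row)  # 0.75
print(row_fraction)   # 0.25
```

The row-level figure is the one that matches a rule of the form "drop rows if fewer than 5% of rows are affected".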


In [7]:
# get info for all dataframes, to understand how the star schema can be configured
for i in [account, card, client, disp, district, loan, order, transactions]:
    print(i.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4500 entries, 0 to 4499
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   account_id   4500 non-null   int64 
 1   district_id  4500 non-null   int64 
 2   frequency    4500 non-null   object
 3   date         4500 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 140.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   card_id  892 non-null    int64 
 1   disp_id  892 non-null    int64 
 2   type     892 non-null    object
 3   issued   892 non-null    object
dtypes: int64(2), object(2)
memory usage: 28.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5369 entries, 0 to 5368
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   client_id     53

In [35]:
# take advantage of the fact there are common column names across dataframes and assume they mean the same thing across dataframes
# (i.e., will be able to join on these columns and there will be common elements within these columns across the multiple tables)

# confirm that the common column among account, loan, order, transactions, and disp is "account_id"
print(set(account.columns) & set(loan.columns) & set(order.columns) & set(transactions.columns) & set(disp.columns))

# confirm that the common column among disp and client is "client_id"
print(set(disp.columns) & set(client.columns))

# card can be joined with disp on "disp_id" - disp has both "account_id" and "disp_id" columns
# no common columns in the district dataframe with any of the other dataframes. Do merges for above dataframes, then see how district fits in

{'account_id'}
{'client_id'}
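
The set intersections above check two specific groups of tables. A possible generalization, sketched here with a hypothetical helper `shared_columns`, lists the shared column names for every pair of tables. The toy frames only carry column names taken from the `info()` output and the merge results in this notebook; they hold no data.

```python
from itertools import combinations

import pandas as pd

def shared_columns(tables : dict) -> dict:
    """Map each pair of table names to the column names they share (pairs with none are omitted)."""
    shared= {}
    for (name_a, df_a), (name_b, df_b) in combinations(tables.items(), 2):
        common= set(df_a.columns) & set(df_b.columns)
        if common:
            shared[(name_a, name_b)]= common
    return shared

# empty stand-ins for three of the tables (column order in disp is an assumption)
toy_tables= {
    'account': pd.DataFrame(columns= ['account_id', 'district_id', 'frequency', 'date']),
    'card': pd.DataFrame(columns= ['card_id', 'disp_id', 'type', 'issued']),
    'disp': pd.DataFrame(columns= ['disp_id', 'client_id', 'account_id', 'type']),
}

for pair, cols in shared_columns(toy_tables).items():
    print(pair, sorted(cols))
```

A shared name is only a hint: `type` appears in both card and disp but describes different things, so it is not a join key even though it shows up in the intersection.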


In [54]:
# inner-join account, loan, order, transactions, and disp dataframes on account_id
# we only want details on accounts with a loan
account_id_merge= (account
                    .merge(loan, on= ['account_id'], how= 'inner', suffixes= ('_account', '_loan')) 
                    .merge(order, on= ['account_id'], how= 'inner', suffixes= ('_loan', '_order'))
                    .merge(transactions, on= ['account_id'], how= 'inner', suffixes= ('_order', '_transactions'))
                    .merge(disp, on= ['account_id'], how= 'inner', suffixes= ('_transactions', '_disp'))
                     )
account_id_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330720 entries, 0 to 330719
Data columns (total 27 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   account_id             330720 non-null  int64  
 1   district_id            330720 non-null  int64  
 2   frequency              330720 non-null  object 
 3   date_account           330720 non-null  int64  
 4   loan_id                330720 non-null  int64  
 5   date_loan              330720 non-null  int64  
 6   amount_loan            330720 non-null  int64  
 7   duration               330720 non-null  int64  
 8   payments               330720 non-null  float64
 9   status                 330720 non-null  object 
 10  order_id               330720 non-null  int64  
 11  bank_to                330720 non-null  object 
 12  account_to             330720 non-null  int64  
 13  amount_order           330720 non-null  float64
 14  k_symbol_order         330720 non-nu
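
The merged frame has 330720 rows, far more than the 4500 accounts, which is what happens when several tables each hold many rows per `account_id` and are inner-joined in sequence. A small sketch with invented toy data (not the real tables) shows the effect, along with one way to detect it using the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# toy data, invented for illustration: one account with 1 loan, 2 orders, 3 transactions
loan_toy= pd.DataFrame({'account_id': [1], 'loan_id': [10]})
order_toy= pd.DataFrame({'account_id': [1, 1], 'order_id': [100, 101]})
trans_toy= pd.DataFrame({'account_id': [1, 1, 1], 'trans_id': [1000, 1001, 1002]})

merged= loan_toy.merge(order_toy, on= 'account_id').merge(trans_toy, on= 'account_id')
print(len(merged))  # 6: every order row is paired with every transaction row

# validate raises pandas.errors.MergeError when the keys are not unique on the stated side
try:
    loan_toy.merge(order_toy, on= 'account_id', validate= 'one_to_one')
    relationship_ok= True
except pd.errors.MergeError:
    relationship_ok= False
print(relationship_ok)  # False
```

One option, which this notebook does not do, would be to aggregate order and transactions to one row per `account_id` before merging, so the result keeps one row per loan.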

In [66]:
# inner-join disp and client dataframes on client_id, then merge with card on disp_id, then merge with account_id_merge on account_id
client_id_merge= disp.merge(client, on= ['client_id'], how= 'inner', suffixes= ('_disp', '_client'))
disp_id_merge= client_id_merge.merge(card, on= ['disp_id'], how= 'inner', suffixes= ('_client_disp', '_card'))

silver= disp_id_merge.merge(account_id_merge, on= ['account_id'], how= 'inner', suffixes= ('', '_disp_id'))
silver.info() # silver dataframe with 7/8 tables merged has 75599 rows

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75599 entries, 0 to 75598
Data columns (total 35 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   disp_id                75599 non-null  int64  
 1   client_id              75599 non-null  int64  
 2   account_id             75599 non-null  int64  
 3   type_client_disp       75599 non-null  object 
 4   birth_number           75599 non-null  int64  
 5   district_id            75599 non-null  int64  
 6   card_id                75599 non-null  int64  
 7   type_card              75599 non-null  object 
 8   issued                 75599 non-null  object 
 9   district_id_disp_id    75599 non-null  int64  
 10  frequency              75599 non-null  object 
 11  date_account           75599 non-null  int64  
 12  loan_id                75599 non-null  int64  
 13  date_loan              75599 non-null  int64  
 14  amount_loan            75599 non-null  int64  
 15  du

In [8]:
# final goal is to predict loan outcome for finished loans at the time of loan start
# in status column in loan, A is good loan that is finished, B is bad loan that is finished, 
# C is good loan that is unfinished, D is bad loan that is unfinished
district.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040
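
The district table shares no column name with the other tables, but its `A1` column holds values 1, 2, 3, ... in the rows shown. Assuming (this is not confirmed anywhere in the notebook) that `A1` is the district identifier that `district_id` refers to, the join could be sketched as follows. The account rows are invented for illustration; the district rows are copied from the output above.

```python
import pandas as pd

# three district rows copied from the head() output above (only three columns kept)
district_toy= pd.DataFrame({
    'A1': [1, 2, 3],
    'A2': ['Hl.m. Praha', 'Benesov', 'Beroun'],
    'A3': ['Prague', 'central Bohemia', 'central Bohemia'],
})

# invented account rows for illustration
account_toy= pd.DataFrame({'account_id': [11, 12], 'district_id': [2, 1]})

# ASSUMPTION: A1 is the district identifier, so rename it to match district_id
district_renamed= district_toy.rename(columns= {'A1': 'district_id'})

# left join keeps every account even if its district is missing from the lookup
joined= account_toy.merge(district_renamed, on= 'district_id', how= 'left')
print(joined[['account_id', 'district_id', 'A2']])
```

The `Unnamed: 0` column in the output suggests that district.txt carries an unlabeled index column; passing `index_col= 0` to `pd.read_csv` would be one way to absorb it.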


In [22]:
np.unique(loan.status.values, return_counts= True)

(array(['A', 'B', 'C', 'D'], dtype=object),
 array([203,  31, 403,  45], dtype=int64))
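
Since the goal is to predict the outcome of finished loans, only statuses A and B are relevant as labels. A minimal sketch of that filter, on invented toy rows and with a hypothetical target column named `bad_loan`, could look like this:

```python
import pandas as pd

# toy loan rows, invented for illustration; status codes follow the comment above
loan_toy= pd.DataFrame({
    'loan_id': [1, 2, 3, 4, 5],
    'status': ['A', 'B', 'C', 'D', 'A'],
})

# keep finished loans only (A = good, B = bad); C and D are still running
finished= loan_toy[loan_toy['status'].isin(['A', 'B'])].copy()

# hypothetical target column: 1 for a bad finished loan, 0 for a good one
finished['bad_loan']= (finished['status'] == 'B').astype(int)

print(finished['bad_loan'].tolist())  # [0, 1, 0]
```

With the counts above, this would leave 203 + 31 = 234 finished loans, of which about 13% are bad, so the classes are imbalanced.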