## Credit Scoring with Zettapark for Python Set-up Notebook


### 1. Singdata Lakehouse Trial Account

The prerequisite is a Singdata Lakehouse account. If you do not have one, you can [contact us](https://www.singdata.com/contactus) for a free trial.

After signing up for the trial, bookmark the URL of your Singdata Lakehouse account and save your credentials; both are needed in this lab.


This version requires Zettapark **0.1.2** or higher.
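Once the package is installed, the version requirement above can be checked with a short sketch like the one below. It assumes the distribution is registered under the name used by the `pip install` command in this notebook (`clickzetta_zettapark_python`); the helper names are illustrative, and the comparison only looks at the leading numeric parts of the version string, so a pre-release such as `0.1.2rc1` is treated conservatively as too old.

```python
from importlib import metadata

MIN_VERSION = "0.1.2"  # minimum stated in this notebook
PACKAGE = "clickzetta_zettapark_python"  # assumed distribution name (from the pip command)


def version_tuple(version: str) -> tuple:
    """Turn '0.1.2' into (0, 1, 2); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_VERSION) -> bool:
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str = PACKAGE):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print(f"{PACKAGE} is not installed")
elif meets_minimum(found):
    print(f"{PACKAGE} {found} satisfies >= {MIN_VERSION}")
else:
    print(f"{PACKAGE} {found} is older than {MIN_VERSION}; upgrade it")
```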

### 2. Python Libraries

The following libraries are needed to run this demo. In this section, install any Python library that is missing from your environment by uncommenting and running the matching cell.

In [1]:
# !pip install -q --upgrade clickzetta_zettapark_python

In [2]:
# !pip install scikit-plot

In [3]:
# !pip install pyarrow==6.0.0

In [4]:
# !pip install seaborn

In [5]:
# !pip install matplotlib
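To decide which of the cells above to uncomment, a small check such as the following can list the libraries that are not importable yet. This is only a sketch: the import names are assumptions (`scikitplot` is the module that `scikit-plot` installs, and `clickzetta.zettapark` is the import path used later in this notebook), and the helper names are illustrative.

```python
import importlib.util

# Import names (not pip names) for the libraries listed in this section.
REQUIRED_MODULES = [
    "clickzetta.zettapark",
    "scikitplot",
    "pyarrow",
    "seaborn",
    "matplotlib",
]


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError is raised when a parent package (e.g. "clickzetta") is missing.
        return False


def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    return [name for name in names if not is_importable(name)]


print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```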

### 3. File Download

#### 3.1 The Dataset

In [6]:
! curl -o data/credit_files.csv https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/refs/heads/main/Zettapark-credit-scoring/data/credit_files.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  292k  100  292k    0     0  66280      0  0:00:04  0:00:04 --:--:-- 69100


In [7]:
! curl -o data/credit_request.csv https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/refs/heads/main/Zettapark-credit-scoring/data/credit_request.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6068  100  6068    0     0   2297      0  0:00:02  0:00:02 --:--:--  2297
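Note that `curl -o` does not create the `data/` directory; if it does not exist yet, the downloads above fail. As an alternative, the sketch below does the same download from Python and creates the target directory first. The `download` helper is hypothetical and not part of the original notebook.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download(url: str, dest: str) -> Path:
    """Fetch url into dest, creating the parent directory if needed."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example usage with the dataset URLs from this section (requires network access):
# BASE = ("https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/"
#         "refs/heads/main/Zettapark-credit-scoring/data")
# download(f"{BASE}/credit_files.csv", "data/credit_files.csv")
# download(f"{BASE}/credit_request.csv", "data/credit_request.csv")
```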


#### 3.2 The config.json credential file

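Edit the file below with the credentials of your Singdata Lakehouse account and save it as `config/config.json`; it is used to connect to Singdata Lakehouse in this notebook and in the main notebook. The download cell that follows saves the template as `config/config_tobe_renamed.json`, so rename it to `config.json` once it is filled in: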


```

{
  "username": "<username>",
  "password": "<password>",
  "service": "<service url>",
  "instance": "<instance id>",
  "workspace": "<workspace>",
  "schema": "<schema>",
  "vcluster": "<vcluster>",
  "sdk_job_timeout": 60,
  "hints": {
    "sdk.job.timeout": 60,
    "query_tag": "test_zettapark_credit_scoring"
  }
}
```   

In [8]:
! curl -o config/config_tobe_renamed.json https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/refs/heads/main/Zettapark-credit-scoring/config/config.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   321  100   321    0     0    138      0  0:00:02  0:00:02 --:--:--   138
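Before connecting, it can help to confirm that no `<placeholder>` values from the template are left in the file. The sketch below is one way to do that; the list of required keys is taken from the template shown above, while the helper names and the rule "a value wrapped in angle brackets is unfilled" are assumptions made for illustration.

```python
import json
import re

# Connection keys from the config.json template in this section.
REQUIRED_KEYS = ["username", "password", "service", "instance",
                 "workspace", "schema", "vcluster"]
PLACEHOLDER = re.compile(r"^<.*>$")  # e.g. "<username>"


def unfilled_keys(config: dict) -> list:
    """Return required keys that are missing, empty, or still a '<placeholder>'."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if (not isinstance(value, str) or not value.strip()
                or PLACEHOLDER.match(value.strip())):
            problems.append(key)
    return problems


def load_config(path: str = "config/config.json") -> dict:
    """Load the config file and fail early if it still needs editing."""
    with open(path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    problems = unfilled_keys(config)
    if problems:
        raise ValueError(f"Fill in these keys in {path}: {', '.join(problems)}")
    return config
```

Calling `load_config()` in place of a bare `json.load` raises a readable error while the file is still the unedited template.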


### 4. The Database

Before running the cell below, fill in the connection parameters for your Singdata Lakehouse environment in the config.json file.

In [None]:
import pandas as pd
import json
from clickzetta.zettapark.session import Session
import clickzetta.zettapark.functions as F
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

# Read the connection parameters from the config file
with open('config/config.json', 'r') as config_file:
    config = json.load(config_file)

schema = config['schema']
vcluster = config['vcluster']

print("Connecting to Lakehouse.....\n")

# create session
session = Session.builder.configs(config).create()

session.sql(f"CREATE SCHEMA IF NOT EXISTS {schema}").collect()
session.sql(f"CREATE VCLUSTER IF NOT EXISTS {vcluster} VCLUSTER_SIZE=1 VCLUSTER_TYPE = GENERAL").collect()

print(session.sql("SELECT current_instance_id(), current_workspace(),current_workspace_id(), current_schema(), current_user(),current_user_id(), current_vcluster()").collect())

print("\nConnected!...\n")
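The cell above interpolates `schema` and `vcluster` from config.json directly into SQL strings. A light check such as the sketch below can catch typos or stray characters before they reach the `CREATE` statements. The `safe_identifier` helper is hypothetical, and the allowed character set (letters, digits, underscore) is an assumption that may be stricter than what Singdata Lakehouse actually accepts.

```python
import re

# Assumed "plain" identifier shape: starts with a letter or underscore,
# followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Not a plain SQL identifier: {name!r}")
    return name


# Example usage with the variables from the cell above:
# session.sql(f"CREATE SCHEMA IF NOT EXISTS {safe_identifier(schema)}").collect()
```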

### 5. The Tables

There are 2 tables associated with this demo:

* CREDIT_FILES: This table contains the credits currently on file, along with each credit's standing, that is, whether the loan is being repaid or there are issues with repayment. This dataset is used for historical analysis and to build a machine learning model that scores new applications.

* CREDIT_REQUESTS: This table contains the new credit requests that the bank needs to approve or decline based on the ML model.
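The practical difference between the two tables is the label column: CREDIT_FILES carries `credit_standing`, while CREDIT_REQUESTS does not, since that is what the model has to predict. The sketch below makes this explicit using the column names that appear in the cell outputs later in this notebook; the `split_label` helper is illustrative only.

```python
# Column names as shown in the outputs of this notebook.
CREDIT_FILES_COLUMNS = [
    "credit_request_id", "credit_amount", "credit_duration", "purpose",
    "installment_commitment", "other_parties", "credit_standing",
    "credit_score", "checking_balance", "savings_balance", "existing_credits",
    "assets", "housing", "qualification", "job_history", "age", "sex",
    "marital_status", "num_dependents", "residence_since", "other_payment_plans",
]
CREDIT_REQUESTS_COLUMNS = [
    "credit_request_id", "credit_amount", "credit_duration", "purpose",
    "installment_commitment", "other_parties", "credit_score",
    "checking_balance", "savings_balance", "existing_credits", "assets",
    "housing", "qualification", "job_history", "age", "sex",
    "marital_status", "num_dependents", "residence_since", "other_payment_plans",
]


def split_label(training_columns, scoring_columns):
    """Return (shared columns, columns present only in the training table)."""
    scoring = set(scoring_columns)
    shared = [c for c in training_columns if c in scoring]
    only_in_training = [c for c in training_columns if c not in scoring]
    return shared, only_in_training


shared, label = split_label(CREDIT_FILES_COLUMNS, CREDIT_REQUESTS_COLUMNS)
print("Label column(s):", label)
print("Shared columns:", len(shared))
```

With the pandas DataFrames loaded in section 5, the same helper can be called as `split_label(credit_files.columns, credit_requests.columns)`.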


#### 5.1 CREDIT_FILES Table



After running the cell below, log in to your Singdata Lakehouse environment and make sure the table was created. It should have 2,940 rows.

In [10]:
credit_files = pd.read_csv('data/credit_files.csv')
credit_files.columns = credit_files.columns.str.lower()
session.sql("drop table if exists CREDIT_FILES").collect()
session.write_pandas(credit_files, "CREDIT_FILES", auto_create_table=True, quote_identifiers=False)

<clickzetta.zettapark.table.Table at 0x7fe58538e990>
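The row count can also be checked from the notebook itself instead of the Lakehouse UI. The sketch below assumes that the DataFrame returned by `session.table(...)` offers a `count()` method, as Snowpark-style DataFrame APIs usually do; this is an assumption about Zettapark, and `verify_row_count` is a hypothetical helper.

```python
def verify_row_count(session, table_name: str, expected_rows: int) -> int:
    """Compare the table's row count with the number of rows that were uploaded."""
    # Assumption: the table DataFrame exposes count(), returning an int.
    actual = session.table(table_name).count()
    if actual != expected_rows:
        raise RuntimeError(
            f"{table_name}: expected {expected_rows} rows, found {actual}"
        )
    return actual


# Example usage after the upload cell above:
# verify_row_count(session, "CREDIT_FILES", len(credit_files))
```

The same call works for the CREDIT_REQUESTS table once it has been written.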

In [11]:
credit_df = session.table("CREDIT_FILES")
credit_df.schema

StructType([StructField('`credit_request_id`', LongType(), nullable=True), StructField('`credit_amount`', LongType(), nullable=True), StructField('`credit_duration`', LongType(), nullable=True), StructField('`purpose`', StringType(), nullable=True), StructField('`installment_commitment`', LongType(), nullable=True), StructField('`other_parties`', StringType(), nullable=True), StructField('`credit_standing`', StringType(), nullable=True), StructField('`credit_score`', LongType(), nullable=True), StructField('`checking_balance`', DoubleType(), nullable=True), StructField('`savings_balance`', DoubleType(), nullable=True), StructField('`existing_credits`', LongType(), nullable=True), StructField('`assets`', StringType(), nullable=True), StructField('`housing`', StringType(), nullable=True), StructField('`qualification`', StringType(), nullable=True), StructField('`job_history`', LongType(), nullable=True), StructField('`age`', LongType(), nullable=True), StructField('`sex`', StringType(), 

In [12]:
credit_df.toPandas().head()

| | credit_request_id | credit_amount | credit_duration | purpose | installment_commitment | other_parties | credit_standing | credit_score | checking_balance | savings_balance | ... | assets | housing | qualification | job_history | age | sex | marital_status | num_dependents | residence_since | other_payment_plans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 147 | 8600 | 6 | Vehicle | 1 | | good | 466 | -728.12 | 17.0 | ... | | own | skilled | 15 | 39 | F | Married | 1 | 4 | none |
| 1 | 248 | 12040 | 6 | Vehicle | 4 | | good | 202 | 0.0 | 2443.0 | ... | | rent | skilled | 1 | 35 | M | Single | 1 | 1 | bank |
| 2 | 112 | 3920 | 15 | Tuition | 4 | | good | 736 | 4696.0 | 143.0 | ... | Investments | rent | skilled | 1 | 23 | F | Married | 1 | 4 | none |
| 3 | 222 | 12000 | 12 | Tuition | 4 | | good | 732 | -25.35 | 0.0 | ... | Investments | rent | skilled | 3 | 23 | F | Married | 1 | 4 | bank |
| 4 | 162 | 10550 | 18 | Vehicle | 4 | | good | 507 | 0.0 | 510.0 | ... | Investments | own | skilled | 1 | 30 | F | Married | 1 | 1 | none |


In [13]:
credit_df.toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   credit_request_id       2940 non-null   int64  
 1   credit_amount           2940 non-null   int64  
 2   credit_duration         2940 non-null   int64  
 3   purpose                 2940 non-null   object 
 4   installment_commitment  2940 non-null   int64  
 5   other_parties           271 non-null    object 
 6   credit_standing         2940 non-null   object 
 7   credit_score            2940 non-null   int64  
 8   checking_balance        2940 non-null   float64
 9   savings_balance         2940 non-null   float64
 10  existing_credits        2940 non-null   int64  
 11  assets                  2489 non-null   object 
 12  housing                 2940 non-null   object 
 13  qualification           2940 non-null   object 
 14  job_history             2940 non-null   
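The non-null counts above show that two columns are sparsely populated: `other_parties` is filled for only 271 of 2,940 rows and `assets` for 2,489. As a worked example, the sketch below turns those counts into a missing-value share (roughly 91% and 15%); the `missing_share` helper is illustrative, and the counts are copied from the `info()` output.

```python
def missing_share(non_null: int, total: int) -> float:
    """Fraction of rows where a column has no value."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 - non_null / total


TOTAL_ROWS = 2940  # RangeIndex size reported by info()
NON_NULL = {"other_parties": 271, "assets": 2489}  # counts from info()

for column, count in NON_NULL.items():
    print(f"{column}: {missing_share(count, TOTAL_ROWS):.1%} missing")

# With pandas, the same shares for every column at once would be:
# credit_df.toPandas().isna().mean()
```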

#### 5.2 CREDIT_REQUESTS Table

After running the cell below, log in to your Singdata Lakehouse environment and make sure the table was created. It should have 60 rows.

In [14]:
credit_requests = pd.read_csv('data/credit_request.csv')
credit_requests.columns = credit_requests.columns.str.lower()
session.sql("drop table if exists CREDIT_REQUESTS").collect()
session.write_pandas(credit_requests, "CREDIT_REQUESTS", auto_create_table=True, quote_identifiers=False)

<clickzetta.zettapark.table.Table at 0x7fe50b7556d0>

In [15]:
credit_req_df = session.table("CREDIT_REQUESTS")
credit_req_df.schema

StructType([StructField('`credit_request_id`', LongType(), nullable=True), StructField('`credit_amount`', LongType(), nullable=True), StructField('`credit_duration`', LongType(), nullable=True), StructField('`purpose`', StringType(), nullable=True), StructField('`installment_commitment`', LongType(), nullable=True), StructField('`other_parties`', StringType(), nullable=True), StructField('`credit_score`', LongType(), nullable=True), StructField('`checking_balance`', DoubleType(), nullable=True), StructField('`savings_balance`', DoubleType(), nullable=True), StructField('`existing_credits`', LongType(), nullable=True), StructField('`assets`', StringType(), nullable=True), StructField('`housing`', StringType(), nullable=True), StructField('`qualification`', StringType(), nullable=True), StructField('`job_history`', LongType(), nullable=True), StructField('`age`', LongType(), nullable=True), StructField('`sex`', StringType(), nullable=True), StructField('`marital_status`', StringType(), n

In [16]:
credit_req_df.toPandas().head()

| | credit_request_id | credit_amount | credit_duration | purpose | installment_commitment | other_parties | credit_score | checking_balance | savings_balance | existing_credits | assets | housing | qualification | job_history | age | sex | marital_status | num_dependents | residence_since | other_payment_plans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 182 | 44550 | 36 | Business | 2 | | 656 | 259.0 | 628.0 | 2 | Home | own | highly skilled | 1 | 30 | M | Divorced | 1 | 2 | stores |
| 1 | 188 | 11750 | 16 | Vehicle | 2 | | 549 | 1892.0 | 723.0 | 3 | Vehicle | free | unemployed | 0 | 68 | M | Single | 1 | 3 | none |
| 2 | 122 | 38680 | 24 | Vehicle | 4 | | 356 | 0.0 | 586.0 | 2 | Vehicle | rent | highly skilled | 7 | 41 | F | Married | 1 | 2 | none |
| 3 | 245 | 34470 | 12 | Tuition | 4 | | 749 | 0.0 | 8102.0 | 1 | Home | own | unskilled | 1 | 35 | F | Married | 2 | 3 | none |
| 4 | 447 | 18420 | 36 | Vehicle | 4 | | 672 | -463.77 | 428.0 | 1 | Vehicle | own | skilled | 1 | 34 | F | Married | 1 | 4 | none |


In [17]:
credit_req_df.toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   credit_request_id       60 non-null     int64  
 1   credit_amount           60 non-null     int64  
 2   credit_duration         60 non-null     int64  
 3   purpose                 60 non-null     object 
 4   installment_commitment  60 non-null     int64  
 5   other_parties           8 non-null      object 
 6   credit_score            60 non-null     int64  
 7   checking_balance        60 non-null     float64
 8   savings_balance         60 non-null     float64
 9   existing_credits        60 non-null     int64  
 10  assets                  49 non-null     object 
 11  housing                 60 non-null     object 
 12  qualification           60 non-null     object 
 13  job_history             60 non-null     int64  
 14  age                     60 non-null     int6