# Setup

After creating a project, users typically install the libraries they need and source the datasets required for preliminary analysis.

Users can choose a runtime with a suitable programming kernel and editor. Python is the most commonly used language. Academic data scientists often prefer R for its implementations of statistical algorithms and mathematical functions.

Here we choose Python as the kernel, with an editor that supports Jupyter notebooks.
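
Once a session is running, it can be worth confirming which interpreter the kernel actually uses. The following is a minimal sketch using only the standard library; the helper name `describe_kernel` is hypothetical and not part of CML.

```python
import platform
import sys


def describe_kernel() -> str:
    """Return a one-line summary of the Python interpreter behind this session."""
    # sys.version_info holds (major, minor, micro, ...) as integers.
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"{platform.python_implementation()} {version} ({sys.executable})"


print(describe_kernel())
```

The printed path shows which environment later `pip install` calls are expected to affect, provided `pip` resolves to the same interpreter.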

## Environment setup

Environment setup requires knowing what is already installed and which additional steps are needed to configure further packages. Out-of-the-box runtimes ship with pre-installed packages. For the list of pre-installed packages, refer to the [Cloudera documentation](https://docs.cloudera.com/machine-learning/cloud/runtimes-preinstalled-packages/topics/ml-runtimes-packaging.html).
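
The packages present in a running session can also be listed from inside the notebook. The sketch below relies on the standard-library `importlib.metadata` module (Python 3.8+); the helper name `installed_packages` is an illustrative assumption, and the output depends entirely on the runtime in use.

```python
from importlib import metadata


def installed_packages() -> dict:
    """Map each installed distribution name to its version string."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            packages[name] = dist.metadata.get("Version", "unknown")
    return packages


packages = installed_packages()
# Print the first few entries in pip's "name==version" style.
for name in sorted(packages, key=str.lower)[:10]:
    print(f"{name}=={packages[name]}")
```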

### Package install

If required packages are not pre-installed, users can install them using either Jupyter magic commands or the CML session terminal. We start by installing the pandas and imblearn packages.

In [2]:
!pip install pandas

Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/9e/0d/91a9fd2c202f2b1d97a38ab591890f86480ecbb596cbc56d035f6f23fdcc/pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting tzdata>=2022.1 (from pandas)
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.8/341.8 kB 4.9 MB/s eta 0:00:00
Downloading pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 44.1 MB/s eta 0:00:00
Installing collected packages: tzdata, pandas
Successfully installed pandas-2.0.3 tzdata-2023.3


In [3]:
# Test if pandas is installed
import pandas as pd

In [4]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn (from imblearn)
  Obtaining dependency information for imbalanced-learn from https://files.pythonhosted.org/packages/a3/9e/fbe60a768502af54563dcb59ca7856f5a8833b3ad5ada658922e1ab09b7f/imbalanced_learn-0.11.0-py3-none-any.whl.metadata
  Downloading imbalanced_learn-0.11.0-py3-none-any.whl.metadata (8.3 kB)
Collecting scipy>=1.5.0 (from imbalanced-learn->imblearn)
  Obtaining dependency information for scipy>=1.5.0 from https://files.pythonhosted.org/packages/a3/d3/f88285098505c8e5d141678a24bb9620d902c683f11edc1eb9532b02624e/scipy-1.11.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading scipy-1.11.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (59 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.1/59.1 kB 1.5 MB/s eta 0:00:00
Collecting scikit-learn>=1.0.2 (from imbalanced-learn->imble

In [5]:
# Test if imblearn is installed
import imblearn
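
Rather than importing each package in its own cell, the import checks above can be folded into one helper that reports which modules are still missing. This is a minimal sketch; `missing_modules` is a hypothetical name, and it expects top-level import names, which may differ from the pip package name (the pip log above shows `imblearn` pulling in the `imbalanced-learn` distribution).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names used in this notebook (not necessarily the pip names).
required = ["pandas", "imblearn"]
missing = missing_modules(required)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Only the modules reported as missing then need a `pip install`.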

## Data

Users can access data from the local file system, from cloud storage, or from any of the data services in CDP.

### Local data

The root directory in a CML project session is `/home/cdsw`. Subdirectories can be created under it, and data can be stored in the desired directory for local access.
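
Creating such a subdirectory could look like the sketch below. The helper `ensure_data_dir` and its default name `data` are illustrative assumptions; the project root is passed in as a parameter so the same code also works outside a CML session.

```python
from pathlib import Path


def ensure_data_dir(project_root, name="data"):
    """Create project_root/name if needed and return it as a Path."""
    data_dir = Path(project_root) / name
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


# In a CML session the project root would be /home/cdsw:
# data_dir = ensure_data_dir("/home/cdsw")
```

Because the call is idempotent, rerunning the notebook cell does not fail when the directory already exists.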

Users typically prefer this approach because they are used to storing and accessing data on their laptops. For the purpose of this project, we will use this approach. However, the standard recommendation is to store data elsewhere, such as cloud storage or HDFS, or to access it through services like Hive, Impala and Spark. This allows access control to be enforced.

To keep things simple, Credit Card Default has been chosen as the use case. The data is sourced from the UCI datasets and has been downloaded and saved locally under the `data` directory.
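
When a dataset is kept as a ZIP archive, pandas can read it directly with `compression='zip'` only if the archive holds a single data file. The sketch below checks that with the standard library and returns the CSV header without loading the full file; `peek_zipped_csv` is a hypothetical helper, and UTF-8 encoding is an assumption.

```python
import csv
import io
import zipfile


def peek_zipped_csv(zip_path):
    """Return (member_name, header_columns) for a ZIP holding a single CSV."""
    with zipfile.ZipFile(zip_path) as archive:
        # Ignore directory entries; only real files count.
        members = [m for m in archive.namelist() if not m.endswith("/")]
        if len(members) != 1:
            raise ValueError(f"expected exactly one file, found {members}")
        with archive.open(members[0]) as raw:
            # Assumption: the CSV is UTF-8 encoded.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            header = next(csv.reader(text), [])
    return members[0], header


# Example call with the path used in this notebook:
# name, columns = peek_zipped_csv("/home/cdsw/data/UCI_Credit_Card.csv.zip")
```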

In [6]:
# Test if data is accessible
pd_df = pd.read_csv('/home/cdsw/data/UCI_Credit_Card.csv.zip', compression='zip')

In [7]:
pd_df.head()

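|   | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |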


## Data Exploration & Model Training

Refer to the [Baseline Model](./1_Baseline_Model.ipynb) notebook for next steps.