<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

# RapidsAI

<center><img src="https://miro.medium.com/max/700/0*NnhdtK0Zs9BfLu0d.png" width="800" height="400" /></center>

## Table of Contents

1. [Introduction](#section1)<br>
2. [Pre-Requisites](#section2)
3. [Installing Rapids in Google Colab](#section3)
4. [Importing Libraries](#section4)
5. [Reading the dataset using pandas and cuDF](#section5)
    
6. [Data Pre-processing](#section6)
    - 6.1.  [Dropping unnecessary columns](#section601)<br/>
    - 6.2.  [Label Encoding](#section602)<br/>
    - 6.3.  [Splitting Dependent and Independent variables](#section603)<br/>
    - 6.4.  [Standardising the values](#section604)
    - 6.5.  [Splitting into training and testing](#section605)
    - 6.6.  [Converting the pandas dataframe to a GPU Dataframe using cuDF](#section606)
    
7.  [Model Creation](#section7)<br>
    
    - 7.1.  [Scikit-learn Logistic Regression](#section701)
    - 7.2.  [cuML Logistic Regression](#section702)
    - 7.3.  [Scikit-learn Random-forest](#section703)
    - 7.4.  [cuML Random-forest](#section704)

8.  [Conclusions](#section801) 

<a id=section1></a>
### 1. Introduction 
<center><img src="https://miro.medium.com/max/700/1*wIyHwhV39p77Grug_3Cq4A.png" width="800" height="400" /></center>


### 1.1 Accelerated Data Science — What is RAPIDS?

RAPIDS is a “suite of open source software libraries and APIs” grouped together for the purpose of providing users the ability to “execute end-to-end data science and analytics pipelines entirely on GPUs.”

RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
The suite also focuses on common data preparation tasks for data science, including a pandas-like dataframe API that integrates with a variety of machine learning algorithms to avoid typical serialization costs.
RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.

### 1.2 Libraries and APIs Overview
- **cuDF**: pandas-like dataframe manipulation library
- **cuML**: collection of ML libraries that provide GPU versions of algorithms available in scikit-learn
- **cuGraph**: NetworkX-like graph analytics API
- **cuDNN**: NVIDIA's GPU-accelerated library of deep neural network primitives, used by frameworks such as TensorFlow and PyTorch; it is a separate NVIDIA library rather than part of the RAPIDS suite

You can look at the official RAPIDS cheatsheet [here](https://rapids.ai/assets/files/cheatsheet.pdf), which shows that the cuDF API closely mirrors pandas.

### 2. Pre-Requisites

The RAPIDS GPU-accelerated data science suite runs on NVIDIA GPUs with the Pascal architecture or newer (compute capability 6.0+).
Users must also have CUDA 9.2, 10.0, or 10.1.2 with the corresponding NVIDIA driver, and either an Ubuntu or a CentOS operating system. The following GPUs are supported for running RAPIDS in Colab:
- Tesla T4
- Tesla V100
- Tesla P40
- Tesla P4
- Tesla P100
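
The compute-capability requirement above can be sketched as a simple lookup. This is only an illustration: the helper `supports_rapids` and the table `COMPUTE_CAPABILITY` are hypothetical names, and the Tesla K80 entry is added purely as a counter-example of a pre-Pascal card.

```python
# Compute capabilities published by NVIDIA for the GPUs listed above.
COMPUTE_CAPABILITY = {
    "Tesla T4": 7.5,
    "Tesla V100": 7.0,
    "Tesla P40": 6.1,
    "Tesla P4": 6.1,
    "Tesla P100": 6.0,
    "Tesla K80": 3.7,  # Kepler card, included only as a counter-example
}

MIN_CAPABILITY = 6.0  # Pascal or newer, per the requirement above


def supports_rapids(gpu_name):
    """Return True if the named GPU meets the minimum compute capability.

    Hypothetical helper: GPU names missing from the table count as unsupported.
    """
    capability = COMPUTE_CAPABILITY.get(gpu_name)
    return capability is not None and capability >= MIN_CAPABILITY


print(supports_rapids("Tesla T4"))   # True
print(supports_rapids("Tesla K80"))  # False
```

In practice the GPU name would come from the `!nvidia-smi` output; if Colab assigns an unsupported card, the runtime has to be reset until a supported one is allocated.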
 
You can use a GPU-powered Google Colab notebook by clicking [here](https://colab.research.google.com/drive/1rY7Ln6rEE1pOlfSHCYOVaqt8OvDO35J0#forceEdit=true&sandboxMode=true&scrollTo=CtNdk7PSafKP).



In [1]:
!nvidia-smi

Sat Oct  3 09:49:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    12W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### 3. Installing Rapids in Google Colab

- Step 1 — Set the Runtime
    - In Google Colab, click Runtime in the top toolbar
    - Click Change runtime type
    - Select GPU for Hardware accelerator
- Step 2 - Run RAPIDS install script
    - Once the Runtime has been set to GPU, execute the following script in
      a code cell to install RAPIDS


In [2]:
# Install RAPIDS

!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 185 (delta 5), reused 0 (delta 0), pack-reused 171
Receiving objects: 100% (185/185), 57.14 KiB | 11.43 MiB/s, done.
Resolving deltas: 100% (67/67), done.
PLEASE READ
********************************************************************************************************
Changes:
1. IMPORTANT CHANGES: RAPIDS on Colab will be pegged to 0.14 Stable until further notice.
2. Default stable version is now 0.14.  Nightly will redirect to 0.14.
3. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.14, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <version/label>"
        Examples: '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.14', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh stable', o

#Setup:
Set up script installs
1. Install most recent Miniconda release compatible with Google Colab's Python install  (3.6.7)
1. Removes incompatible files
1. Install RAPIDS libraries
1. Set necessary environment variables
1. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions
1. If running v0.11 or higher, updates pyarrow library to 0.15.x.
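
The two `sys.path` lines in the install cell above splice a new directory into the module search path just before an existing one, presumably so that the conda-installed RAPIDS packages are found ahead of Colab's own `dist-packages`. A minimal sketch of that list manipulation, using a stand-in list instead of the real `sys.path` (the helper name `insert_before` is hypothetical):

```python
def insert_before(paths, anchor, new_entry):
    """Return a copy of `paths` with `new_entry` placed just before `anchor`."""
    idx = paths.index(anchor)  # raises ValueError if the anchor is absent
    return paths[:idx] + [new_entry] + paths[idx:]


# Stand-in for sys.path; the real list on Colab contains more entries.
example_path = [
    "",
    "/usr/lib/python3.6",
    "/usr/local/lib/python3.6/dist-packages",
]

patched = insert_before(
    example_path,
    "/usr/local/lib/python3.6/dist-packages",
    "/usr/local/lib/python3.6/site-packages",
)
print(patched)
```

Because Python searches `sys.path` in order, the entry inserted earlier in the list takes precedence when both directories contain a module with the same name.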

### Problem Statement

**The process of issuing loans has increased in complexity over the years due to the different possibilities, market demands and clients’ circumstances.** Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders. This has made banks highly regulated entities that are expected to act responsibly when providing loans.

<center><img src="https://raw.githubusercontent.com/insaid2018/Domain_Case_Studies/master/Finance/loan2.png"></center>


### 4. Importing Libraries

In [4]:
import cudf as cu
import pandas as pd

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


### 5. Reading the dataset using pandas and cuDF

The data covers loans from **2007 to 2015**, labelled by whether the **customer defaulted**, and can be retrieved from this <a href="https://storage.googleapis.com/industryanalytics/LoanDefaultData.csv">link</a>.

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 887,379 | 22 | 131 MB |

| Id | Features | Description |
| :--| :--| :--|
|01|**cust_id**|Unique ID of customer|
|02|**year**|Loan Applied Year|
|03|**state**|State where loan was approved|
|04|**date_issued**|Date when loan was issued|
|05|**date_final**|Final date of loan payment|
|06|**emp_duration**|Employment duration(in years). Range=[0, 10] , where 0 indicates < 1 year and 10 indicates >=10 years|
|07|**own_type**|A status provided by the borrower during registration.  Possible values are MORTGAGE, OTHER, NONE, ANY, RENT, OWN|
|08|**income_type**|Income categorization of customer. Possible values are Low, Medium, High|
|09|**app_type**|Signifies whether the loan is an individual application or a joint application|
|10|**loan_purpose**|Signifies the requirement of loan|
|11|**interest_payments**|Signifies the type of interest payments, categorized under Low and High|
|12|**grade**|Assigned loan grade by the company|
|13|**annual_pay**|Annual salary of the customer|
|14|**loan_amount**|Loan amount required by the customer|
|15|**interest_rate**|Interest rate on the lent money|
|16|**loan_duration**|Loan repayment duration in months(36 or 60)|
|17|**dti**|Debt-to-Income(DTI) is the percentage of a consumer's monthly gross income that goes toward paying debts.|
|18|**total_pymnt**|Total amount that has been paid so far|
|19|**total_rec_prncp**|Total recovered principal amount so far|
|20|**recoveries**|Amount that has yet to be recovered|
|21|**installment**|Monthly payment owed by the borrower|
|22|**is_default**|Whether the customer has defaulted or not|

The **target feature** in the acquired data set is **is_default** and its values are:

|Target Feature|Potential Values|
| :-- | :-- |
|**is_default**|0: Not default|
||1: Default|

In [5]:
%%time
df=pd.read_csv('https://storage.googleapis.com/industryanalytics/LoanDefaultData.csv')
print(df.shape)

(887379, 22)
CPU times: user 1.74 s, sys: 390 ms, total: 2.13 s
Wall time: 8.39 s


In [6]:
df.memory_usage(index=True).sum()

156178832
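
The figure reported by `memory_usage` can be reproduced by hand, which also shows what it does and does not count. The sketch below assumes 8 bytes per cell (int64, float64 and object pointers alike) and 128 bytes for the RangeIndex; the latter is an assumption about this pandas version, chosen because it makes the arithmetic match the output above.

```python
ROWS = 887_379
COLUMNS = 22
BYTES_PER_CELL = 8       # int64, float64 and object pointers are all 8 bytes
RANGE_INDEX_BYTES = 128  # assumption: size reported for the RangeIndex

shallow_bytes = ROWS * COLUMNS * BYTES_PER_CELL + RANGE_INDEX_BYTES
print(shallow_bytes)                        # 156178832
print(round(shallow_bytes / 1024**2, 1))    # the same figure in MiB
```

This is a shallow count: for object columns it measures only the pointers, not the strings they refer to. Calling `df.memory_usage(deep=True)` would include the string contents and report a larger number.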

In [9]:
%%time
df1= cu.read_csv('https://storage.googleapis.com/industryanalytics/LoanDefaultData.csv')
print(df1.shape)

(887379, 22)
CPU times: user 1.36 s, sys: 1.29 s, total: 2.65 s
Wall time: 4.79 s


- In this run, cuDF read the CSV in 4.79 s of wall time against 8.39 s for pandas, a little over half the time. Both timings include downloading the file, and the gap is likely to be more noticeable on larger datasets.

In [7]:
df.head()

Unnamed: 0,cust_id,year,state,date_issued,date_final,emp_duration,own_type,income_type,app_type,loan_purpose,interest_payments,grade,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,is_default
0,180675,2007,Andhra Pradesh,01/12/2007,1032009,10.0,MORTGAGE,Low,INDIVIDUAL,debt_consolidation,Low,C,73000,25000,10.91,36 months,22.13,13650.38,8767.32,2207.65,817.41,1
1,85781,2007,Rajasthan,01/06/2007,1072010,0.5,RENT,Low,INDIVIDUAL,other,Low,C,40000,1400,10.91,36 months,8.61,1663.04,1400.0,0.0,45.78,0
2,85675,2007,Manipur,01/06/2007,1062010,10.0,RENT,Low,INDIVIDUAL,other,High,E,25000,1000,14.07,36 months,16.27,1231.38,1000.0,0.0,34.21,0
3,84918,2007,Andhra Pradesh,01/09/2007,1042008,10.0,MORTGAGE,Low,INDIVIDUAL,other,Low,A,65000,5000,7.43,36 months,0.28,5200.44,5000.0,0.0,155.38,0
4,84670,2007,Arunachal Pradesh,01/06/2007,1082009,10.0,MORTGAGE,High,INDIVIDUAL,other,Low,A,300000,5000,7.75,36 months,5.38,5565.65,5000.0,0.0,156.11,0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 22 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   cust_id            887379 non-null  int64  
 1   year               887379 non-null  int64  
 2   state              887379 non-null  object 
 3   date_issued        887379 non-null  object 
 4   date_final         887379 non-null  int64  
 5   emp_duration       887379 non-null  float64
 6   own_type           887379 non-null  object 
 7   income_type        887379 non-null  object 
 8   app_type           887379 non-null  object 
 9   loan_purpose       887379 non-null  object 
 10  interest_payments  887379 non-null  object 
 11  grade              887379 non-null  object 
 12  annual_pay         887379 non-null  int64  
 13  loan_amount        887379 non-null  int64  
 14  interest_rate      887379 non-null  float64
 15  loan_duration      887379 non-null  object 
 16  dt

### 6. Data Pre-processing

#### 6.1 Dropping unnecessary columns

In [11]:
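# Drop identifier, location and date columns that are not used as features.
# errors='ignore' keeps the cell safe to re-run: without it, running the cell
# again raises a KeyError once the columns are already gone.
df.drop(labels = ['cust_id', 'state', 'date_issued', 'date_final'], axis = 1, inplace = True, errors = 'ignore')
print(df.shape)
df.head()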

#### 6.2 Label Encoding

In [12]:
ordered_labels = ['income_type', 'app_type', 'interest_payments', 'grade', 'loan_duration']
encode = LabelEncoder()

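for i in ordered_labels:
  # Encode only string (object-dtype) columns.
  # isinstance(df[i].dtype, object) is always True, so compare the dtype itself.
  if df[i].dtype == object:
    df[i] = encode.fit_transform(df[i])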
print('Label Encoding Success!')
print('Data Shape:', df.shape)
df.head()

Label Encoding Success!
Data Shape: (887379, 18)


Unnamed: 0,year,emp_duration,own_type,income_type,app_type,loan_purpose,interest_payments,grade,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,is_default
0,2007,10.0,MORTGAGE,1,0,debt_consolidation,1,2,73000,25000,10.91,0,22.13,13650.38,8767.32,2207.65,817.41,1
1,2007,0.5,RENT,1,0,other,1,2,40000,1400,10.91,0,8.61,1663.04,1400.0,0.0,45.78,0
2,2007,10.0,RENT,1,0,other,0,4,25000,1000,14.07,0,16.27,1231.38,1000.0,0.0,34.21,0
3,2007,10.0,MORTGAGE,1,0,other,1,0,65000,5000,7.43,0,0.28,5200.44,5000.0,0.0,155.38,0
4,2007,10.0,MORTGAGE,0,0,other,1,0,300000,5000,7.75,0,5.38,5565.65,5000.0,0.0,156.11,0
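
To make the encoded values above easier to read, here is a minimal pure-Python sketch of the mapping that scikit-learn's `LabelEncoder` builds: the distinct values are sorted and numbered from zero. The helper name `fit_label_mapping` and the short sample list are illustrative only.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer, in sorted order of the values."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Small illustrative sample of the income_type column.
income_type = ["Low", "Low", "Low", "Low", "High", "Medium"]
mapping = fit_label_mapping(income_type)
encoded = [mapping[v] for v in income_type]

print(mapping)  # {'High': 0, 'Low': 1, 'Medium': 2}
print(encoded)  # [1, 1, 1, 1, 0, 2]
```

Note that the codes follow alphabetical order, not the semantic order Low < Medium < High. If the ordering matters to the model, an explicit mapping would be needed instead.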


In [14]:
data = pd.get_dummies(data=df, columns = ['own_type', 'loan_purpose'])
print('Data Shape:', data.shape)  # report the encoded frame, not the original df
data.head()

Data Shape: (887379, 36)


Unnamed: 0,year,emp_duration,income_type,app_type,interest_payments,grade,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,is_default,own_type_ANY,own_type_MORTGAGE,own_type_NONE,own_type_OTHER,own_type_OWN,own_type_RENT,loan_purpose_car,loan_purpose_credit_card,loan_purpose_debt_consolidation,loan_purpose_educational,loan_purpose_home_improvement,loan_purpose_house,loan_purpose_major_purchase,loan_purpose_medical,loan_purpose_moving,loan_purpose_other,loan_purpose_renewable_energy,loan_purpose_small_business,loan_purpose_vacation,loan_purpose_wedding
0,2007,10.0,1,0,1,2,73000,25000,10.91,0,22.13,13650.38,8767.32,2207.65,817.41,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,2007,0.5,1,0,1,2,40000,1400,10.91,0,8.61,1663.04,1400.0,0.0,45.78,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,2007,10.0,1,0,0,4,25000,1000,14.07,0,16.27,1231.38,1000.0,0.0,34.21,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,2007,10.0,1,0,1,0,65000,5000,7.43,0,0.28,5200.44,5000.0,0.0,155.38,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,2007,10.0,0,0,1,0,300000,5000,7.75,0,5.38,5565.65,5000.0,0.0,156.11,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
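
The dummy columns above follow a simple rule: one 0/1 column per distinct category, named with the original column as a prefix. A minimal pure-Python sketch of that idea (the helper `one_hot` is hypothetical and is not how pandas implements `get_dummies` internally):

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# Small illustrative sample of the own_type column.
own_type = ["MORTGAGE", "RENT", "RENT", "MORTGAGE", "OWN"]
columns, rows = one_hot(own_type, prefix="own_type")

print(columns)  # ['own_type_MORTGAGE', 'own_type_OWN', 'own_type_RENT']
for row in rows:
    print(row)
```

In the notebook, `own_type` expands to 6 columns and `loan_purpose` to 14, so the 18-column frame becomes 18 - 2 + 20 = 36 columns.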


#### 6.3 Splitting the dependent and independent variables

In [15]:
X, y = data.drop('is_default', axis = 1), data['is_default']
print('X Shape:', X.shape)
print('y Shape:', y.shape)

X Shape: (887379, 35)
y Shape: (887379,)


#### 6.4 Standardising the values

In [16]:
std_scale = StandardScaler()
scale_fit = std_scale.fit_transform(X)

X_data = pd.DataFrame(scale_fit, columns = X.columns)
print('Data Shape:', X_data.shape)
X_data.head()

Data Shape: (887379, 35)


Unnamed: 0,year,emp_duration,income_type,app_type,interest_payments,grade,annual_pay,loan_amount,interest_rate,loan_duration,dti,total_pymnt,total_rec_prncp,recoveries,installment,own_type_ANY,own_type_MORTGAGE,own_type_NONE,own_type_OTHER,own_type_OWN,own_type_RENT,loan_purpose_car,loan_purpose_credit_card,loan_purpose_debt_consolidation,loan_purpose_educational,loan_purpose_home_improvement,loan_purpose_house,loan_purpose_major_purchase,loan_purpose_medical,loan_purpose_moving,loan_purpose_other,loan_purpose_renewable_energy,loan_purpose_small_business,loan_purpose_vacation,loan_purpose_wedding
0,-5.565138,1.126029,-0.351868,-0.024004,0.95239,0.153586,-0.031339,1.214486,-0.533275,-0.654724,0.231112,0.7739,0.454251,5.276457,1.559025,-0.001839,1.000299,-0.007507,-0.014323,-0.330681,-0.818732,-0.100442,-0.55016,0.832332,-0.021838,-0.249058,-0.064769,-0.140912,-0.098577,-0.078349,-0.225373,-0.025464,-0.108777,-0.073251,-0.051496
1,-5.565138,-1.582528,-0.351868,-0.024004,0.95239,0.153586,-0.5414,-1.583231,-0.533275,-0.654724,-0.555363,-0.749029,-0.657724,-0.112082,-1.600978,-0.001839,-0.999701,-0.007507,-0.014323,-0.330681,1.2214,-0.100442,-0.55016,-1.201443,-0.021838,-0.249058,-0.064769,-0.140912,-0.098577,-0.078349,4.437084,-0.025464,-0.108777,-0.073251,-0.051496
2,-5.565138,1.126029,-0.351868,-0.024004,-1.04999,1.677281,-0.773246,-1.63065,0.187879,-0.654724,-0.109771,-0.803869,-0.718097,-0.112082,-1.64836,-0.001839,-0.999701,-0.007507,-0.014323,-0.330681,1.2214,-0.100442,-0.55016,-1.201443,-0.021838,-0.249058,-0.064769,-0.140912,-0.098577,-0.078349,4.437084,-0.025464,-0.108777,-0.073251,-0.051496
3,-5.565138,1.126029,-0.351868,-0.024004,0.95239,-1.370109,-0.15499,-1.15646,-1.327458,-0.654724,-1.03993,-0.299621,-0.114363,-0.112082,-1.152141,-0.001839,1.000299,-0.007507,-0.014323,-0.330681,-0.818732,-0.100442,-0.55016,-1.201443,-0.021838,-0.249058,-0.064769,-0.140912,-0.098577,-0.078349,4.437084,-0.025464,-0.108777,-0.073251,-0.051496
4,-5.565138,1.126029,-2.866061,-0.024004,0.95239,-1.370109,3.477264,-1.15646,-1.254429,-0.654724,-0.743257,-0.253223,-0.114363,-0.112082,-1.149151,-0.001839,1.000299,-0.007507,-0.014323,-0.330681,-0.818732,-0.100442,-0.55016,-1.201443,-0.021838,-0.249058,-0.064769,-0.140912,-0.098577,-0.078349,4.437084,-0.025464,-0.108777,-0.073251,-0.051496
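
Standardisation replaces each value x with (x - mean) / std, computed per column, so every column ends up with mean 0 and standard deviation 1. A minimal sketch on five `loan_amount` values from the earlier preview; the results will not match the table above, because `StandardScaler` there uses the mean and standard deviation of the whole dataset.

```python
from statistics import fmean, pstdev


def standardise(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)  # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in column]


loan_amount = [25000, 1400, 1000, 5000, 5000]  # first five rows shown earlier
z = standardise(loan_amount)
print([round(v, 3) for v in z])
```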


#### 6.5 Splitting into training and testing dataset

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size = 0.2, random_state = 42, stratify = y)
print('Train Shape:', X_train.shape, y_train.shape)
print('Test Shape:', X_test.shape, y_test.shape)

Train Shape: (709903, 35) (709903,)
Test Shape: (177476, 35) (177476,)
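
The train and test sizes above can be checked by hand: with a fractional `test_size`, scikit-learn rounds the test share up and gives the remainder to the training set. A short worked calculation:

```python
import math

n_samples = 887_379
test_size = 0.2

n_test = math.ceil(n_samples * test_size)  # test share is rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 709903 177476
```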


#### 6.6 Converting the pandas dataframe to a GPU Dataframe using cuDF

In [18]:
X_train_gdf = cu.DataFrame.from_pandas(X_train)
X_test_gdf = cu.DataFrame.from_pandas(X_test)
y_train_gdf = cu.DataFrame.from_pandas(pd.DataFrame(y_train))
y_test_gdf = cu.DataFrame.from_pandas(pd.DataFrame(y_test))

We need to cast the 64-bit columns to 32-bit types for cuML: float32 for the features and int32 for the labels.

In [19]:
X_train_gdf=X_train_gdf.astype('float32')
y_train_gdf=y_train_gdf.astype('int32')
y_test_gdf=y_test_gdf.astype('int32')
X_test_gdf=X_test_gdf.astype('float32')
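
Besides satisfying cuML, the cast halves the memory needed for the feature matrix. A rough calculation for the training features, counting only the raw values (any overhead added by cuDF or the CUDA runtime is not included):

```python
import struct

rows, cols = 709_903, 35  # shape of X_train

bytes_float64 = rows * cols * struct.calcsize("d")  # 8 bytes per value
bytes_float32 = rows * cols * struct.calcsize("f")  # 4 bytes per value

print(round(bytes_float64 / 1024**2, 1), "MiB as float64")
print(round(bytes_float32 / 1024**2, 1), "MiB as float32")
```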


Checking how much GPU memory is in use once the dataframes are on the GPU

In [20]:
!nvidia-smi

Sat Oct  3 10:10:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    31W /  70W |    749MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### 7. Model Creation 

#### 7.1 Scikit-learn Logistic Regression

In [21]:
import time
start_time = time.time()
log  = LogisticRegression(max_iter=200)
log.fit(X_train, y_train)
print("Training Time with Pandas dataframe: %s seconds" % (str(time.time() - start_time)))

Training Time with Pandas dataframe: 15.903741836547852 seconds
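
The same start/stop timing pattern is repeated in every model cell. One possible way to factor it out is a small context manager; this is a sketch, not part of the original notebook, and the name `timed` is hypothetical. It uses `time.perf_counter`, which is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} seconds")


# Stand-in workload; in the notebook this would be a call such as log.fit(...)
with timed("Summing a range"):
    total = sum(range(1_000_000))
```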


#### 7.2 cuML Logistic Regression

In [22]:
from cuml import LogisticRegression as lgr

In [23]:
y_train_gdf=y_train_gdf.astype('float32')

In [24]:
import time
start_time = time.time()
log_gpu  = lgr(max_iter=200)
log_gpu.fit(X_train_gdf, y_train_gdf)
print("GPU Training Time with GPU dataframe: %s seconds" % (str(time.time() - start_time)))

GPU Training Time with GPU dataframe: 3.7464096546173096 seconds


#### 7.3 Scikit-learn Random-forest

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
import time
start_time = time.time()
rfc = RandomForestClassifier(n_estimators = 100, max_depth = 5)
rfc.fit(X_train, y_train)
print("Training Time with Pandas dataframe: %s seconds" % (str(time.time() - start_time)))

Training Time with Pandas dataframe: 51.74715781211853 seconds


#### 7.4 cuML Random-forest

In [28]:
from cuml import RandomForestClassifier as curf

In [30]:
y_train_gdf=y_train_gdf.astype('int32')

In [31]:
import time
start_time = time.time()
rfc_gpu = curf(n_estimators = 100, max_depth = 5)
rfc_gpu.fit(X_train_gdf, y_train_gdf)
print("GPU Training Time with GPU dataframe: %s seconds" % (str(time.time() - start_time)))

GPU Training Time with GPU dataframe: 1.4308819770812988 seconds


### 8. Conclusions

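- In this run, cuML's Random Forest trained in about 1.4 seconds against about 51.7 seconds for scikit-learn's, roughly 36 times faster
- cuML's Logistic Regression trained in about 3.7 seconds against about 15.9 seconds for scikit-learn's, roughly 4 times faster
- Reading the CSV with cuDF took a little over half the wall time of pandas (4.79 s against 8.39 s)
- The syntax of cuDF and cuML is similar to that of pandas and scikit-learn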
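
The speedups can be computed directly from the timings printed in the cells above. These are single-run measurements on one Colab session, so the ratios should be read as indicative only.

```python
# (CPU seconds, GPU seconds) as printed in the notebook outputs above
timings = {
    "read_csv": (8.39, 4.79),
    "Logistic Regression": (15.903741836547852, 3.7464096546173096),
    "Random Forest": (51.74715781211853, 1.4308819770812988),
}

speedups = {name: cpu / gpu for name, (cpu, gpu) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
# read_csv: 1.8x
# Logistic Regression: 4.2x
# Random Forest: 36.2x
```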