# Dataframes
The SAP HANA Python Client API for machine learning algorithms (Python Client API for ML) provides a set of client-side Python functions for accessing and querying SAP HANA data, and a set of functions for developing machine learning models.

The Python Client API for ML consists of two main parts:

<ul>
<li>A set of machine learning APIs for different algorithms.</li>
<li>The SAP HANA dataframe, which provides a set of methods for analyzing data in SAP HANA without bringing that data to the client.</li>
</ul>

This library uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA.
<br>
<br>
<img src="images/highlevel_overview2_new.png" title="Python API Overview" style="float:left;" width="300" height="50" />
<br>
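A dataframe represents a database table or, more generally, the result of any SQL SELECT statement. Most dataframe operations only build a new SQL statement; no data is transferred from the database to the client until you explicitly ask for it, for example with collect().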
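
To make the "no data leaves the database until asked" idea concrete, here is a minimal sketch of a lazily evaluated dataframe. It is not the hana_ml implementation: the `LazyFrame` class, the `PEOPLE` table, and the use of an in-memory SQLite database in place of SAP HANA are all assumptions made for illustration.

```python
import sqlite3


class LazyFrame:
    """Toy stand-in for a database-backed dataframe (hypothetical, not hana_ml)."""

    def __init__(self, conn, select_statement):
        self.conn = conn
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery; nothing is executed yet.
        sql = 'SELECT * FROM ({}) AS "T" WHERE {}'.format(self.select_statement, condition)
        return LazyFrame(self.conn, sql)

    def head(self, n):
        # Again only string composition, no database round trip.
        sql = 'SELECT * FROM ({}) AS "T" LIMIT {}'.format(self.select_statement, int(n))
        return LazyFrame(self.conn, sql)

    def collect(self):
        # The only method that runs the query and returns rows to the client.
        return self.conn.execute(self.select_statement).fetchall()


# SQLite stands in for the database here (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "PEOPLE" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "PEOPLE" VALUES (?, ?)', [(1, 57), (2, 39), (3, 71)])

frame = LazyFrame(conn, 'SELECT * FROM "PEOPLE"')
older = frame.filter('"AGE" > 50').head(5)
print(older.select_statement)  # still just a SQL string
rows = older.collect()         # the query is executed here
print(rows)
```

Each method returns a new object holding a longer SQL string, which is the same nesting pattern visible in the `select_statement` outputs throughout this notebook.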

In [1]:
from hana_ml import dataframe
import logging

## Setup connection and data sets
Let us load some data into SAP HANA tables. The data is loaded into four tables: the full set, the test set, the training set, and the validation set (DBM2_RFULL_TBL, DBM2_RTEST_TBL, DBM2_RTRAINING_TBL, and DBM2_RVALIDATION_TBL, respectively).

The data relates to direct marketing campaigns of a Portuguese banking institution. More information about the data set is available at https://archive.ics.uci.edu/ml/datasets/bank+marketing#.

To do that, a connection is created and passed to the loader. The config file <b>config/e2edata.ini</b> controls the connection parameters. Edit it to point to your SAP HANA instance.

In [2]:
from data_load_utils import DataSets, Settings
url, port, user, pwd = Settings.load_config("../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
full_tbl, training_tbl, validation_tbl, test_tbl = DataSets.load_bank_data(connection_context)

### Simple DataFrame
<table align="left"><tr><td>
</td><td><img src="images/Dataframes_1.png" style="float:left;" width="600" height="400" /></td></tr></table>

In [3]:
dataset1 = connection_context.table(training_tbl)
# Alternatively, connection_context.sql("SELECT ...") wraps an arbitrary SELECT statement
print(dataset1.select_statement)

SELECT * FROM "DBM2_RTRAINING_TBL"


### Simple Operations
#### Drop duplicates

In [4]:
dataset2 = dataset1.drop_duplicates()
print(dataset2.select_statement)

SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_0"


#### Remove a column

In [5]:
dataset3 = dataset2.drop(["LABEL"])
print(dataset3.select_statement)

SELECT "ID", "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_0") AS "DT_1"


#### Replace null values with a specific value

In [6]:
dataset4 = dataset2.fillna(25, ["AGE"])
print(dataset4.select_statement)

SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_0") dt
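
The generated statement relies on `COALESCE`, which returns its first non-null argument. The sketch below shows that behaviour on a tiny table. It uses an in-memory SQLite database rather than SAP HANA, and the table `T` and its rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for the database (assumption); table "T" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "T" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "T" VALUES (?, ?)', [(1, 57), (2, None), (3, 34)])

# Same pattern as the statement above: only "AGE" is wrapped in COALESCE.
filled = conn.execute(
    'SELECT "ID", COALESCE("AGE", 25) AS "AGE" FROM "T" ORDER BY "ID"'
).fetchall()
print(filled)  # the NULL age of ID 2 is replaced by 25
```

Columns that are not listed in the `fillna` call are passed through unchanged, which is why only `"AGE"` appears inside `COALESCE` in the output above.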


### Bring data to client
#### Fetch 5 rows to the client as a <b>pandas DataFrame</b>

In [7]:
dataset4.head(5).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,27178,57,housemaid,married,basic.4y,no,yes,no,cellular,nov,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.021,5195,no
1,31377,39,blue-collar,divorced,basic.9y,unknown,no,no,cellular,may,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.334,5099,no
2,5987,34,blue-collar,married,basic.9y,no,no,no,telephone,may,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,12963,41,blue-collar,married,unknown,no,no,yes,cellular,jul,...,1,999,0,nonexistent,1.4,93.918,-42.7,4.962,5228,no
4,5479,32,management,married,university.degree,no,no,no,telephone,may,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no


In [8]:
pd1 = dataset4.head(5).collect()
print(type(pd1))

<class 'pandas.core.frame.DataFrame'>


### Projection
<img src="images/Projection.png" style="float:left;" width="150" height="750" />

In [9]:
dsp = dataset4.select("ID", "AGE", "JOB", ('"AGE"*2', "TWICE_AGE"))
dsp.head(5).collect()  # collect() brings the data to the client

Unnamed: 0,ID,AGE,JOB,TWICE_AGE
0,7792,37,entrepreneur,74
1,7841,32,blue-collar,64
2,7857,39,admin.,78
3,7916,36,technician,72
4,7921,48,blue-collar,96


In [10]:
dsp.select_statement

'SELECT "ID", "AGE", "JOB", "AGE"*2 AS "TWICE_AGE" FROM (SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_0") dt) AS "DT_3"'
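
The statement above suggests how the arguments to `select` are turned into a select list: plain names become quoted columns, and an `(expression, alias)` tuple becomes `expression AS "alias"`. The following sketch reproduces that mapping. `build_projection` and the `"SRC"` alias are hypothetical names chosen for this example, not part of hana_ml.

```python
def build_projection(source_sql, *columns):
    """Build a SELECT over source_sql from column names and (expression, alias) tuples.

    Hypothetical helper for illustration only.
    """
    parts = []
    for col in columns:
        if isinstance(col, tuple):
            # Computed column: the expression is used verbatim, the alias is quoted.
            expression, alias = col
            parts.append('{} AS "{}"'.format(expression, alias))
        else:
            # Plain column name: quote it as an identifier.
            parts.append('"{}"'.format(col))
    return 'SELECT {} FROM ({}) AS "SRC"'.format(", ".join(parts), source_sql)


sql = build_projection(
    'SELECT * FROM "DBM2_RTRAINING_TBL"',
    "ID", "AGE", ('"AGE"*2', "TWICE_AGE"),
)
print(sql)
```

Because the expression is passed through as written, column names inside it need their own quoting, as in `'"AGE"*2'`.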

### Filtering Data
<img src="images/Filter.png" style="float:left;" width="200" height="100" />

In [11]:
dataset4.filter('AGE > 60').head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,30134,79,retired,married,basic.9y,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
1,30215,83,retired,married,basic.4y,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
2,30242,81,retired,married,professional.course,no,no,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
3,30295,69,retired,married,university.degree,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
4,30380,66,unemployed,single,basic.4y,no,no,no,telephone,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
5,30384,66,unemployed,single,basic.4y,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,yes
6,30391,71,retired,divorced,basic.4y,no,yes,no,telephone,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
7,29567,68,retired,married,high.school,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099,yes
8,29669,71,retired,married,university.degree,no,no,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099,yes
9,37171,70,retired,married,basic.4y,unknown,yes,no,cellular,aug,...,1,999,0,nonexistent,-2.9,92.201,-31.4,0.883,5076,no


In [12]:
dataset4.filter('AGE > 60').select_statement

'SELECT * FROM (SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_0") dt) AS "DT_3" WHERE AGE > 60'

### Sorting
<img src="images/Sort.png" style="float:left;" width="200" height="100" />

In [13]:
dataset4.filter('AGE>60').sort(['AGE']).head(2).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,38841,61,retired,married,basic.4y,no,yes,no,cellular,nov,...,1,9,3,failure,-3.4,92.649,-30.1,0.714,5017,yes
1,39643,61,management,married,university.degree,no,no,no,cellular,may,...,2,999,1,failure,-1.8,93.876,-40.0,0.682,5008,yes


### Simple Joins
<img src="images/Join.png" style="float:left;" width="300" height="200" />

In [14]:
condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)
dataset5 = dataset4.join(dataset2, condition)

In [15]:
dataset5.head(5).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,7792,37,entrepreneur,married,university.degree,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.865,5228,no
1,7841,32,blue-collar,married,professional.course,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.865,5228,no
2,7857,39,admin.,married,university.degree,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.865,5228,no
3,7916,36,technician,married,professional.course,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.865,5228,no
4,7921,48,blue-collar,married,basic.4y,no,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.865,5228,no
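
The join condition in cell 14 is an ordinary SQL predicate assembled with `str.format`. The sketch below shows what such a string looks like and how it behaves in a join. The aliases `"DT_L"` and `"DT_R"`, the two tables, and the use of SQLite with an inner join are assumptions for illustration. The real `quoted_name` values depend on the dataframes involved.

```python
import sqlite3

# Hypothetical stand-ins for dataset4.quoted_name and dataset2.quoted_name.
left_name, right_name = '"DT_L"', '"DT_R"'
condition = '{}."ID"={}."ID"'.format(left_name, right_name)
print(condition)

# SQLite stands in for the database; both tables are made up for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "LEFT_TBL" ("ID" INTEGER, "AGE" INTEGER)')
conn.execute('CREATE TABLE "RIGHT_TBL" ("ID" INTEGER, "JOB" TEXT)')
conn.executemany('INSERT INTO "LEFT_TBL" VALUES (?, ?)', [(1, 57), (2, 39), (3, 34)])
conn.executemany(
    'INSERT INTO "RIGHT_TBL" VALUES (?, ?)',
    [(1, "housemaid"), (3, "blue-collar"), (4, "admin.")],
)

# The condition string is placed verbatim into the ON clause.
join_sql = (
    'SELECT {l}."ID", {l}."AGE", {r}."JOB" '
    'FROM "LEFT_TBL" AS {l} INNER JOIN "RIGHT_TBL" AS {r} ON {c} '
    'ORDER BY {l}."ID"'
).format(l=left_name, r=right_name, c=condition)
joined = conn.execute(join_sql).fetchall()
print(joined)  # only IDs present in both tables survive the inner join
```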


### Describing a dataframe
<img src="images/Describe.png" style="float:left;" width="300" height="200" />

In [16]:
dataset4.describe().collect()

Unnamed: 0,column,count,unique,nulls,mean,std,min,max,median,25_percent_cont,25_percent_disc,50_percent_cont,50_percent_disc,75_percent_cont,75_percent_disc
0,ID,16895,16895,0,21282.286652,12209.759725,5.0,41187.0,21786.0,10583.5,10583.0,21786.0,21786.0,32067.5,32068.0
1,AGE,16895,78,0,40.051376,10.716907,17.0,98.0,38.0,32.0,32.0,38.0,38.0,47.0,47.0
2,DURATION,16895,1267,0,263.96567,264.331384,0.0,4918.0,184.0,107.0,107.0,184.0,184.0,324.0,324.0
3,CAMPAIGN,16895,35,0,2.344658,2.428449,1.0,43.0,2.0,1.0,1.0,2.0,2.0,3.0,3.0
4,PDAYS,16895,24,0,944.406688,226.331944,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
5,PREVIOUS,16895,7,0,0.209529,0.53945,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,EMP_VAR_RATE,16895,10,0,-0.038798,1.621945,-3.4,1.4,1.1,-1.8,-1.8,1.1,1.1,1.4,1.4
7,CONS_PRICE_IDX,16895,26,0,93.538844,0.579189,92.201,94.767,93.444,93.075,93.075,93.444,93.444,93.994,93.994
8,CONS_CONF_IDX,16895,26,0,-40.334123,4.86572,-50.8,-26.9,-41.8,-42.7,-42.7,-41.8,-41.8,-36.4,-36.4
9,EURIBOR3M,16895,283,0,3.499297,1.777986,0.634,5.045,4.856,1.313,1.313,4.856,4.856,4.961,4.961
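
The `_cont` and `_disc` columns differ only for some rows, for example the 25th percentile of `ID` (10583.5 versus 10583.0). The column names suggest the usual SQL distinction between continuous (interpolated) and discrete percentiles. The sketch below implements both definitions in plain Python, assuming that this is what the columns represent. The function names and the sample data are made up for illustration.

```python
import math


def percentile_cont(values, p):
    """Interpolated percentile, following the SQL PERCENTILE_CONT definition."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)  # zero-based fractional row position
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Linear interpolation between the two neighbouring values.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def percentile_disc(values, p):
    """Smallest actual value whose cumulative distribution reaches p."""
    ordered = sorted(values)
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


sample = [10, 20, 30, 40]
for p in (0.25, 0.5, 0.75):
    print(p, percentile_cont(sample, p), percentile_disc(sample, p))
```

The discrete variant always returns a value that occurs in the data, while the continuous variant may return a value between two observations.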


In [17]:
dataset4.describe().select_statement

'SELECT * FROM (SELECT "SimpleStats".*, "Percentiles"."25_percent_cont", "Percentiles"."25_percent_disc", "Percentiles"."50_percent_cont", "Percentiles"."50_percent_disc", "Percentiles"."75_percent_cont", "Percentiles"."75_percent_disc" FROM (select \'ID\' as "column", COUNT("ID") as "count", COUNT(DISTINCT "ID") as "unique", SUM(CASE WHEN "ID" is NULL THEN 1 ELSE 0 END) as "nulls", AVG("ID") as "mean", STDDEV("ID") as "std", MIN("ID") as "min", MAX("ID") as "max", MEDIAN("ID") as "median" FROM (SELECT "ID", COALESCE("AGE", 25) AS "AGE", "JOB", "MARITAL", "EDUCATION", "DBM_DEFAULT", "HOUSING", "LOAN", "CONTACT", "DBM_MONTH", "DAY_OF_WEEK", "DURATION", "CAMPAIGN", "PDAYS", "PREVIOUS", "POUTCOME", "EMP_VAR_RATE", "CONS_PRICE_IDX", "CONS_CONF_IDX", "EURIBOR3M", "NREMPLOYED", "LABEL" FROM (SELECT DISTINCT * FROM (SELECT * FROM "DBM2_RTRAINING_TBL") AS "DT_0") dt) AS "DT_3" UNION ALL select \'AGE\' as "column", COUNT("AGE") as "count", COUNT(DISTINCT "AGE") as "unique", SUM(CASE WHEN "AGE" 

### Saving a dataframe

In [18]:
dataset4.head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,27178,57,housemaid,married,basic.4y,no,yes,no,cellular,nov,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.021,5195,no
1,31377,39,blue-collar,divorced,basic.9y,unknown,no,no,cellular,may,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.334,5099,no
2,5987,34,blue-collar,married,basic.9y,no,no,no,telephone,may,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,12963,41,blue-collar,married,unknown,no,no,yes,cellular,jul,...,1,999,0,nonexistent,1.4,93.918,-42.7,4.962,5228,no
4,5479,32,management,married,university.degree,no,no,no,telephone,may,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
5,33491,53,technician,married,professional.course,no,no,no,cellular,may,...,5,999,1,failure,-1.8,92.893,-46.2,1.291,5099,no
6,30259,56,entrepreneur,married,university.degree,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
7,35092,36,management,divorced,university.degree,no,unknown,unknown,cellular,may,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.25,5099,no
8,7744,51,blue-collar,married,unknown,unknown,yes,no,telephone,may,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.864,5191,no
9,38755,29,admin.,single,university.degree,no,no,yes,cellular,nov,...,1,3,1,success,-3.4,92.649,-30.1,0.715,5017,yes


In [19]:
dataset4.count()

16895

In [20]:
dataset4.save("#MYTEST2")

<hana_ml.dataframe.DataFrame at 0x9e4f2e8>

In [21]:
dataset8 = connection_context.table("#MYTEST2")

In [22]:
dataset8.head(10).collect()

Unnamed: 0,ID,AGE,JOB,MARITAL,EDUCATION,DBM_DEFAULT,HOUSING,LOAN,CONTACT,DBM_MONTH,...,CAMPAIGN,PDAYS,PREVIOUS,POUTCOME,EMP_VAR_RATE,CONS_PRICE_IDX,CONS_CONF_IDX,EURIBOR3M,NREMPLOYED,LABEL
0,27178,57,housemaid,married,basic.4y,no,yes,no,cellular,nov,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.021,5195,no
1,31377,39,blue-collar,divorced,basic.9y,unknown,no,no,cellular,may,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.334,5099,no
2,5987,34,blue-collar,married,basic.9y,no,no,no,telephone,may,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,12963,41,blue-collar,married,unknown,no,no,yes,cellular,jul,...,1,999,0,nonexistent,1.4,93.918,-42.7,4.962,5228,no
4,5479,32,management,married,university.degree,no,no,no,telephone,may,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
5,33491,53,technician,married,professional.course,no,no,no,cellular,may,...,5,999,1,failure,-1.8,92.893,-46.2,1.291,5099,no
6,30259,56,entrepreneur,married,university.degree,no,yes,no,cellular,apr,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.365,5099,no
7,35092,36,management,divorced,university.degree,no,unknown,unknown,cellular,may,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.25,5099,no
8,7744,51,blue-collar,married,unknown,unknown,yes,no,telephone,may,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.864,5191,no
9,38755,29,admin.,single,university.degree,no,no,yes,cellular,nov,...,1,3,1,success,-3.4,92.649,-30.1,0.715,5017,yes


In [23]:
dataset8.count()

16895
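
The matching counts before and after `save` are what one would expect if saving materializes the dataframe's SELECT statement into a new table. In SAP HANA, a table name that starts with `#` denotes a local temporary table. The sketch below shows the general idea with an in-memory SQLite database. The `CREATE TEMP TABLE ... AS` statement, the table names, and the rows are assumptions for illustration and not necessarily the SQL that hana_ml issues.

```python
import sqlite3

# SQLite stands in for the database (assumption); "SRC" is a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "SRC" ("ID" INTEGER, "AGE" INTEGER)')
conn.executemany('INSERT INTO "SRC" VALUES (?, ?)', [(1, 57), (2, None), (3, 34)])

# The statement a lazily built dataframe would carry around.
select_statement = 'SELECT "ID", COALESCE("AGE", 25) AS "AGE" FROM "SRC"'

# Materialize the query result into a session-scoped temporary table.
conn.execute('CREATE TEMP TABLE "MYTEST2" AS ' + select_statement)

source_count = conn.execute(
    'SELECT COUNT(*) FROM ({})'.format(select_statement)
).fetchone()[0]
saved_count = conn.execute('SELECT COUNT(*) FROM "MYTEST2"').fetchone()[0]
print(source_count, saved_count)  # both counts should match
```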