# Rule Generator (Decision Tree algorithm) Spark Example

The Rule Generator (Decision Tree algorithm) creates rules from a labelled dataset (stored as a Koalas DataFrame). The algorithm generates rules by extracting the highest-performing branches from a tree ensemble model.

**Use this module when the dataset is too large to load into memory. In that case, the standard Rule Generator algorithm cannot be used, as it relies on Pandas and Sklearn.**

## Requirements

To run, you'll need the following:

* A labelled, processed dataset (nulls imputed, categorical features encoded).

----

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDTSpark
from iguanas.metrics.classification import FScore

import databricks.koalas as ks
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql import SparkSession

## Create Spark session

In [2]:
spark = SparkSession.builder.config('spark.dynamicAllocation.enabled', True).getOrCreate()

21/12/17 17:17:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/17 17:17:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Read in data

Let's read in some labelled, processed dummy data.

In [3]:
X_train = ks.read_csv(
    'dummy_data/X_train.csv', 
    index_col='eid'
)
y_train = ks.read_csv(
    'dummy_data/y_train.csv', 
    index_col='eid'
).squeeze()
X_test = ks.read_csv(
    'dummy_data/X_test.csv', 
    index_col='eid'
)
y_test = ks.read_csv(
    'dummy_data/y_test.csv', 
    index_col='eid'
).squeeze()

----

## Generate rules

### Set up class parameters

Now we can set our class parameters for the Rule Generator. Here we're using the F1 score as the rule performance metric (you can choose a different function from the `metrics.classification` module or create your own). 

**Note that if you're using the FScore, Precision or Recall score as the optimisation function, use the *FScore*, *Precision* or *Recall* classes in the *metrics.classification* module rather than the same functions from Sklearn's *metrics* module, since Sklearn's functions do not work on Koalas DataFrames.**

**Please see the class docstring for more information on each parameter.**

In [4]:
fs = FScore(beta=1)
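
To make the metric concrete, the sketch below computes an F-beta score for a single rule in plain Python. It is an illustration of the standard formula only, not the Iguanas implementation; the function name `f_beta` and the toy labels are assumptions introduced for this example.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels given as equal-length sequences of 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rule fires on a positive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # rule fires on a negative
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rule misses a positive
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy data (hypothetical): target labels and the 0/1 flags of one rule
y_true = [1, 1, 1, 0, 0, 0]
rule_flags = [1, 1, 0, 1, 0, 0]
score = f_beta(y_true, rule_flags, beta=1)
print(round(score, 4))
```

With `beta=1`, precision and recall are weighted equally; a larger `beta` weights recall more heavily.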

In [5]:
params = {
    'n_total_conditions': 4,
    'metric': fs.fit,
    'tree_ensemble': RandomForestClassifier(numTrees=5, seed=0),
    'precision_threshold': 0.5,
    'target_feat_corr_types': 'Infer',
    'verbose': 1
}
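
As a rough illustration of what "extracting the highest-performing branches" can mean, the sketch below walks a toy decision tree and keeps only the root-to-leaf paths whose leaf precision meets a threshold. It is not the `RuleGeneratorDTSpark` implementation: the nested-dict tree format, the feature names, the function `extract_rules` and the reading of `precision_threshold` and `n_total_conditions` as a branch filter and a cap on conditions per rule are all assumptions made for this example. The class docstring remains the reference for the actual parameter behaviour.

```python
# Hypothetical toy tree: internal nodes hold a split, leaves hold training counts
toy_tree = {
    "feature": "num_orders_1day", "threshold": 3.5,
    "left": {"leaf": True, "n_pos": 2, "n_total": 50},
    "right": {
        "feature": "order_total", "threshold": 100.0,
        "left": {"leaf": True, "n_pos": 4, "n_total": 10},
        "right": {"leaf": True, "n_pos": 9, "n_total": 12},
    },
}


def extract_rules(node, precision_threshold, max_conditions, path=()):
    """Return rule strings for branches whose leaf precision meets the threshold."""
    if node.get("leaf"):
        precision = node["n_pos"] / node["n_total"] if node["n_total"] else 0.0
        keep = path and len(path) <= max_conditions and precision >= precision_threshold
        return [" & ".join(path)] if keep else []
    feat, thr = node["feature"], node["threshold"]
    # Left child takes rows with feature <= threshold, right child the rest
    left = extract_rules(node["left"], precision_threshold, max_conditions,
                         path + (f"({feat} <= {thr})",))
    right = extract_rules(node["right"], precision_threshold, max_conditions,
                          path + (f"({feat} > {thr})",))
    return left + right


rules = extract_rules(toy_tree, precision_threshold=0.5, max_conditions=4)
print(rules)
```

In this toy tree only one leaf has a precision of at least 0.5 (9 of 12), so a single two-condition rule is produced.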

### Instantiate class and run fit method

Once the parameters have been set, we can run the `fit` method to generate rules.

In [6]:
rg = RuleGeneratorDTSpark(**params)

In [7]:
X_rules = rg.fit(
    X=X_train, 
    y=y_train, 
    sample_weight=None
)

--- Calculating correlation of features with respect to the target ---

21/12/17 17:17:47 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
21/12/17 17:17:51 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

--- Returning column datatypes ---
--- Creating Spark DataFrame for training ---
--- Training tree ensemble ---
--- Extracting rules from tree ensemble ---

### Outputs

The `fit` method returns a dataframe containing one binary column per generated rule, as applied to the training dataset (see the `Attributes` section in the class docstring for a description of each attribute generated):

In [8]:
X_rules.head()

| eid | RGDT_Rule_20211217_0 | RGDT_Rule_20211217_1 | RGDT_Rule_20211217_2 | RGDT_Rule_20211217_3 | RGDT_Rule_20211217_4 | RGDT_Rule_20211217_5 | RGDT_Rule_20211217_6 | RGDT_Rule_20211217_7 | RGDT_Rule_20211217_8 | RGDT_Rule_20211217_9 | RGDT_Rule_20211217_10 | RGDT_Rule_20211217_11 | RGDT_Rule_20211217_12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 867-8837095-9305559 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 974-5306287-3527394 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 584-0112844-9158928 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 956-4190732-7014837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 349-7005645-8862067 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


----

## Apply rules to a separate dataset

Use the `transform` method to apply the generated rules to a separate dataset.

In [11]:
X_rules_test = rg.transform(X=X_test)
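
Conceptually, applying a rule to a dataset means evaluating its conditions on every row and recording 1 where all of them hold and 0 otherwise. The plain-Python sketch below shows that idea on a handful of records. It is not how `transform` is implemented on Koalas DataFrames; the rule representation, the feature names and the helper `apply_rules` are assumptions made for illustration.

```python
import operator

# Comparison operators supported by this sketch
OPS = {"<=": operator.le, ">": operator.gt}

# Hypothetical rules: name -> list of (feature, operator, threshold) conditions
rules = {
    "Rule_0": [("num_orders_1day", ">", 3.5), ("order_total", ">", 100.0)],
    "Rule_1": [("order_total", "<=", 20.0)],
}

# Hypothetical records keyed by row identifier
records = {
    "a": {"num_orders_1day": 5, "order_total": 250.0},
    "b": {"num_orders_1day": 1, "order_total": 15.0},
    "c": {"num_orders_1day": 4, "order_total": 60.0},
}


def apply_rules(rules, records):
    """Return {row_id: {rule_name: 0 or 1}}; 1 means every condition holds."""
    return {
        row_id: {
            name: int(all(OPS[op](row[feat], thr) for feat, op, thr in conds))
            for name, conds in rules.items()
        }
        for row_id, row in records.items()
    }


flags = apply_rules(rules, records)
for row_id, row_flags in flags.items():
    print(row_id, row_flags)
```

Each inner dictionary plays the role of one row of binary rule columns.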



### Outputs

The `transform` method returns a dataframe giving the binary columns of the rules as applied to the given dataset:

In [13]:
X_rules_test.head()

| eid | RGDT_Rule_20211217_0 | RGDT_Rule_20211217_1 | RGDT_Rule_20211217_2 | RGDT_Rule_20211217_3 | RGDT_Rule_20211217_4 | RGDT_Rule_20211217_5 | RGDT_Rule_20211217_6 | RGDT_Rule_20211217_7 | RGDT_Rule_20211217_8 | RGDT_Rule_20211217_9 | RGDT_Rule_20211217_10 | RGDT_Rule_20211217_11 | RGDT_Rule_20211217_12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 975-8351797-7122581 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 785-6259585-7858053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 057-4039373-1790681 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 095-5263240-3834186 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 980-3802574-0009480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


----