# FIT5202 Data processing for big data

##  Activity: Machine Learning with Spark (Classification Using Decision Tree and Random Forest)

Last week we learnt about the basics of machine learning with Apache Spark. **`MLlib`** is Apache Spark's scalable machine learning library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.

We looked into transformers, estimators and the machine learning pipeline in last week's tutorial activity.

This week's lecture covered the decision tree and random forest algorithms. In this activity we will look at how to use these two popular families of classification and regression methods: Decision Trees and Random Forests.


# Decision Tree

A *decision tree* is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

In this exercise, we will use Apache Spark to build a decision tree. The dataset (shown below) lists the conditions that determine whether a game of tennis can be played outside. Each record gives the values of outlook, temperature, humidity and wind, together with the outcome: whether the game was played under those conditions.

![Dataset for Decision Tree](https://camo.githubusercontent.com/750443a0828b170b12a3eeaf42b7c1aa5e7c25b8/68747470733a2f2f63646e2d696d616765732d312e6d656469756d2e636f6d2f6d61782f3630302f312a426e3364345a3632736f66334b3455315f3070536c512e6a706567)

We will build the decision tree like the one below.
![Decision Tree](https://camo.githubusercontent.com/210366841be13b6b5ff9fa3e4e8e7819679c5ad4/68747470733a2f2f63646e2d696d616765732d312e6d656469756d2e636f6d2f6d61782f3830302f312a546c547a677438495f35645553624d5a6d524b7971512e6a706567)

We will now walk through the code and explain what each part does.

## 1. Include the required libraries

In [45]:
import operator
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
import numpy as np
# Uncomment the following line to install networkx if it is not already available
# !pip install networkx
import networkx as nx
from matplotlib import pyplot as plt

%matplotlib inline



## 2.  Instantiate the spark context

We import **`SparkContext`** from **`pyspark`**; it is the main entry point for Spark Core functionality. DataFrames are created from various input sources through a **`SparkSession`** object or, as in this notebook, through the older **`SQLContext`** wrapper around the `SparkContext`.
A [DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame) is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession. Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: [DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame), [Column](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column).


In [46]:
sc.stop()  # stop any running SparkContext so that only one is active
sc = SparkContext(master="local[*]", appName="Decision Tree")
sqlContext = SQLContext(sc)
attr_name_info_gain = {}
G = nx.DiGraph()

## 3.  Declare schema of the dataset. 
Two variables hold the schema: `attrs` lists the attribute names and `attrs_type` maps each attribute name to its type.

In [47]:
attrs = ["outlook","temp","humidity","wind"]
attrs_type = {"outlook":"string","temp":"string","humidity":"string","wind":"string"}

## 4.  Calculate the Information gain

In [48]:
def calculate_info_gain(entropy, joined_df, total_elements):
    attr_entropy = 0.0
    for anAttributeData in joined_df.rdd.collect():
        yes_class_count = anAttributeData[1]
        no_class_count = anAttributeData[2]
        # an outer join leaves None where a value never occurs with that outcome
        if yes_class_count is None:
            yes_class_count = 0
        if no_class_count is None:
            no_class_count = 0

        count_of_class = yes_class_count + no_class_count
        classmap = {'y' : yes_class_count, 'n' : no_class_count}
        attr_entropy = attr_entropy + ((count_of_class / total_elements) *\
                                       calculate_entropy(count_of_class, classmap))

    gain = entropy - attr_entropy

    return gain
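
As a sanity check on what `calculate_info_gain` computes, the sketch below repeats the calculation in plain Python, without Spark. The counts are an assumption: they correspond to the widely used 14-row play-tennis table (9 "yes", 5 "no", split by outlook), so substitute the counts from your own dataset. The helper names `entropy` and `info_gain` are illustrative and are not part of the notebook.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c == 0:
            continue  # 0 * log2(0) is treated as 0
        p = c / total
        result -= p * math.log2(p)
    return result

def info_gain(parent_counts, partitions):
    """Parent entropy minus the size-weighted entropy of each partition."""
    total = sum(parent_counts)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent_counts) - weighted

# Assumed (yes, no) counts for the classic play-tennis table.
parent = [9, 5]
by_outlook = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rain

parent_entropy = entropy(parent)
gain_outlook = info_gain(parent, by_outlook)
print(round(parent_entropy, 3), round(gain_outlook, 3))
```

A partition in which one class has a count of zero (the `[4, 0]` entry here) contributes no entropy, which is why such branches end in a leaf.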


## 5.  Attribute information gain data preparation function

In [49]:
def get_attr_info_gain_data_prep(attr_name, data, entropy, total_elements, where_condition):

    if not where_condition:
        attr_grp_y = data.where(col('y') == 'yes').groupBy(attr_name).agg({"y": 'count'})\
            .withColumnRenamed('count(y)','played_count')
    else:
        attr_grp_y = data.where(" y like '%yes%'  " + where_condition).groupBy(attr_name).agg({"y": 'count'})\
            .withColumnRenamed('count(y)','played_count')

    if not where_condition:
        attr_grp_n = data.where(col('y') == 'no').groupBy(attr_name).agg({"y": 'count'})\
            .withColumnRenamed(attr_name,'n_' + attr_name)\
            .withColumnRenamed('count(y)','not_played_count')
    else:
        attr_grp_n = data.where(" y like '%no%'  " + where_condition).groupBy(attr_name).agg({"y": 'count'})\
            .withColumnRenamed(attr_name,'n_' + attr_name)\
            .withColumnRenamed('count(y)','not_played_count')

    # Full outer join so that attribute values seen with only one outcome are kept;
    # the missing count comes back as None and is replaced by 0 in calculate_info_gain.
    joined_df = attr_grp_y.join(attr_grp_n, on = [col(attr_grp_y.columns[0]) == col(attr_grp_n.columns[0])], how='outer' )\
        .select(attr_grp_y.columns[0], attr_grp_y.columns[1],\
                 attr_grp_n.columns[1])

    gain_for_attribute = calculate_info_gain(entropy, joined_df, total_elements)
    attr_name_info_gain[attr_name] = gain_for_attribute
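
The outer join matters because an attribute value may occur only among the "yes" records or only among the "no" records. The sketch below mimics that join with plain dictionaries to show where the `None` values come from. The function name `outer_join_counts` and the counts are hypothetical and only serve as an illustration.

```python
def outer_join_counts(yes_counts, no_counts):
    """Mimic a full outer join of two {value: count} mappings.

    Returns rows of (value, played_count, not_played_count). A side that has
    no records for a value is reported as None, as an outer join would do.
    """
    values = sorted(set(yes_counts) | set(no_counts))
    return [(v, yes_counts.get(v), no_counts.get(v)) for v in values]

# Hypothetical grouped counts: 'overcast' never occurs with y == 'no'.
yes_counts = {"sunny": 2, "overcast": 4, "rain": 3}
no_counts = {"sunny": 3, "rain": 2}

rows = outer_join_counts(yes_counts, no_counts)
# Replace None with 0 before doing any arithmetic on the counts.
cleaned = [(v, y or 0, n or 0) for v, y, n in rows]
print(rows)
print(cleaned)
```

An inner join would silently drop the `overcast` row, and the weighted entropy would then be computed over too few records.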


## 6.  Calculate the entropy of the elements

In [50]:
def calculate_entropy(total_elements, elements_in_each_class):
    # for target set S having 2 class 0 and 1, the entropy is -p0logp0 -p1logp1
    # here the log is of base 2
    # elements_in_each_class is a dictionary where the key is class label and the
    # value is number of elements in that class
    keysInMap = list(elements_in_each_class.keys())
    entropy = 0.0

    for aKey in keysInMap:
        number_of_elements_in_class = elements_in_each_class.get(aKey)
        if number_of_elements_in_class == 0:
            continue
        ratio = number_of_elements_in_class/total_elements
        entropy = entropy - ratio * np.log2(ratio)

    return entropy
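
For reference, the quantities computed by `calculate_entropy` and `calculate_info_gain` can be written as follows. Here $S$ is the set of records on the current branch, $p_c$ is the fraction of records in $S$ that belong to class $c$, and $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$. These are the standard information-gain definitions, given as a summary of what the code does rather than as a quotation from the lecture.

$$
H(S) = -\sum_{c} p_c \log_2 p_c,
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
$$

Classes with $p_c = 0$ are skipped in the sum, which matches the `continue` branch in `calculate_entropy`.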


## 7.  Process the data

As we build the tree, each branch must only see the records that belong to it. The `where_condition` argument carries the predicates that select those records.

`process_dataset` computes the entropy of the current branch and then the information gain of every attribute that has not been used yet. For each such attribute, the records are grouped by attribute value, separately for the 'yes' and 'no' outcomes. Its parameters are:

- **excludedAttrs**: the attributes that have already been processed, so that they are not processed again
- **data**: the Spark DataFrame loaded from the file
- **played**: the number of records on this branch where the match was played
- **notplayed**: the number of records on this branch where the match was not played
- **where_condition**: the condition used to select the data. It is blank on the first call; from the second iteration onwards, once the root of the tree has been found, it holds the predicates for the current branch and grows as more attributes are processed
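
To make the role of `where_condition` concrete, the sketch below shows how the predicate string could grow as the tree gets deeper. The helper `extend_condition` is hypothetical, and the attribute values `sunny` and `high` are assumed for illustration; the string format follows the one used in `build_tree`.

```python
def extend_condition(where_condition, attr, value, attr_type="string"):
    """Append one more predicate for the branch attr == value."""
    if attr_type == "string":
        return where_condition + " and " + attr + "=='" + str(value) + "'"
    return where_condition + " and " + attr + "==" + str(value)

root = ""  # first call: no filter, the whole dataset is used
level1 = extend_condition(root, "outlook", "sunny")
level2 = extend_condition(level1, "humidity", "high")

# The queries start with 'where 1==1' so that every predicate can begin with ' and '.
query = "select * from data where 1==1" + level2
print(query)
```

Because every predicate starts with ` and `, the queries in the notebook always begin with a condition that is trivially true (`1==1`) or with the filter on `y`.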



In [51]:
def process_dataset(excludedAttrs, data, played, notplayed, where_condition):
    total_elements = played + notplayed
    subs_info = {"played" : played, "notplayed" : notplayed}
    entropy = calculate_entropy(total_elements, subs_info)
    print ("entropy is " + str(entropy))
    global attr_name_info_gain
    attr_name_info_gain = dict()

    for attr in attrs:
        if attr not in excludedAttrs:
            get_attr_info_gain_data_prep(attr, data, entropy, total_elements, where_condition)


## 8. Build the Tree

In [52]:
def build_tree(max_gain_attr, processed_attrs, data, where_condition):
    attrValues = sqlContext.sql("select distinct " + max_gain_attr + " from data  where 1==1 " + where_condition)
    orig_where_condition = where_condition

    for aValueForMaxGainAttr in attrValues.rdd.collect():
        adistinct_value_for_attr = aValueForMaxGainAttr[0]
        G.add_edges_from([(max_gain_attr, adistinct_value_for_attr)])

        if attrs_type[max_gain_attr] == "string":
            where_condition = str(orig_where_condition + " and " + max_gain_attr + "=='" + adistinct_value_for_attr + "'")
        else:
            where_condition = str(orig_where_condition + " and " + max_gain_attr + "==" + str(adistinct_value_for_attr))

        played_for_attr = sqlContext.sql("select * from data where y like '%yes%' " + where_condition).count()
        notplayed_for_attr = sqlContext.sql("select * from data where y like '%no%' " + where_condition).count()
        # if either count is zero, the entropy on this branch is zero, so attach the leaf node(s)
        if played_for_attr == 0 or notplayed_for_attr == 0:
            leaf_node = sqlContext.sql("select distinct y from data where 1==1 " + where_condition)
            for leaf_node_data in leaf_node.rdd.collect():
                G.add_edges_from([(adistinct_value_for_attr, str(leaf_node_data[0]))])
            continue
        process_dataset(processed_attrs, data, played_for_attr, notplayed_for_attr, where_condition)
        if not attr_name_info_gain: # we processed all attributes
            # attach leaf node
            leaf_node = sqlContext.sql("select distinct y from data where 1==1 " + where_condition)
            for leaf_node_data in leaf_node.rdd.collect():
                G.add_edges_from([(adistinct_value_for_attr, str(leaf_node_data[0]))])
            continue # we are done for this branch of tree

        # get the attr with max info gain under aValueForMaxGainAttr
        # sort by info gain
        sorted_by_info_gain = sorted(attr_name_info_gain.items(), key=operator.itemgetter(1), reverse=True)
        new_max_gain_attr = sorted_by_info_gain[0][0]
        if sorted_by_info_gain[0][1] == 0:
            # under this where condition, records dont have entropy
            leaf_node = sqlContext.sql("select distinct y from data where 1==1 " + where_condition)
            # there might be more than one leaf node
            for leaf_node_data in leaf_node.rdd.collect():
                G.add_edges_from([(adistinct_value_for_attr, str(leaf_node_data[0]))])
            continue # we are done for this branch of tree

        G.add_edges_from([(adistinct_value_for_attr, new_max_gain_attr)])
        # pass a copy so that attributes used on this branch stay available to sibling branches
        build_tree(new_max_gain_attr, processed_attrs + [new_max_gain_attr], data, where_condition)
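
The recursion in `build_tree` can be hard to follow because it is interleaved with SQL queries and graph updates. The sketch below shows the same idea (pick the attribute with the highest information gain, split on its values, recurse until a branch is pure) on a small in-memory list of dictionaries. It is a simplified illustration, not the notebook's implementation: the rows are made up, the function names are hypothetical, the result is a nested dictionary instead of a `networkx` graph, and each branch receives its own list of remaining attributes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, attrs, target):
    """Return (attribute, gain) for the attribute with the highest information gain."""
    base = entropy([r[target] for r in rows])
    gains = {}
    for a in attrs:
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [r[target] for r in rows if r[a] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        gains[a] = base - remainder
    best = max(gains, key=gains.get)
    return best, gains[best]

def id3(rows, attrs, target="y"):
    """Build a nested dict {attr: {value: subtree_or_label}}."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: the branch is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attrs:
        return majority
    attr, gain = best_attribute(rows, attrs, target)
    if gain < 1e-12:  # no attribute reduces the entropy any further
        return majority
    remaining = [a for a in attrs if a != attr]
    return {attr: {v: id3([r for r in rows if r[attr] == v], remaining, target)
                   for v in sorted(set(r[attr] for r in rows))}}

# Made-up rows, for illustration only.
rows = [
    {"outlook": "sunny", "wind": "weak", "y": "no"},
    {"outlook": "sunny", "wind": "strong", "y": "no"},
    {"outlook": "overcast", "wind": "weak", "y": "yes"},
    {"outlook": "overcast", "wind": "strong", "y": "yes"},
    {"outlook": "rain", "wind": "weak", "y": "yes"},
    {"outlook": "rain", "wind": "strong", "y": "no"},
]
tree = id3(rows, ["outlook", "wind"])
print(tree)
```

On these rows `outlook` is chosen at the root, the `sunny` and `overcast` branches are pure, and only the `rain` branch needs a second split on `wind`.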


## 9. Load the dataset and draw the graph 

In [53]:
# The file must contain the columns listed in `attrs` plus the label column `y`.
data = sqlContext.read.format('csv').option('header', 'true')\
        .option('delimiter', ';').load("lss_myDataset.txt")

data.createOrReplaceTempView('data')
played = sqlContext.sql("select * from data WHERE y like  '%yes%' ").count()
notplayed = sqlContext.sql("select * from data WHERE y like  '%no%' ").count()
process_dataset([], data, played, notplayed, '')
# sort by info gain
sorted_by_info_gain = sorted(attr_name_info_gain.items(), key=operator.itemgetter(1), reverse=True)

processed_attrs = []
max_gain_attr = sorted_by_info_gain[0][0]
processed_attrs.append(max_gain_attr)
build_tree(max_gain_attr, processed_attrs, data, '')
nx.draw(G, with_labels=True)
plt.tight_layout()
plt.savefig("Graph.png", format="png")

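entropy is 0.9612366047228759


AnalysisException: "cannot resolve '`outlook`' given input columns: [Gender, y, Lover, GPA, Laptop];;\n'Aggregate ['outlook], ['outlook, count(y#1815) AS count(y)#1859L]\n+- Filter (y#1815 = yes)\n   +- Relation[Gender#1811,Lover#1812,GPA#1813,Laptop#1814,y#1815] csv\n"

The run recorded above fails because the file that was loaded has the columns `Gender`, `Lover`, `GPA`, `Laptop` and `y`, not the `outlook`, `temp`, `humidity` and `wind` attributes declared in step 3. Point `load()` at the tennis dataset shown at the start of this activity (semicolon-delimited, with a header row) and re-run the cell to build and draw the tree.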
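
An error like the one recorded above can be caught earlier by comparing the expected attribute names with the columns of the loaded file before any query runs. The sketch below is one possible check; `missing_columns` is a hypothetical helper, and in the notebook the second argument would be `data.columns`.

```python
def missing_columns(expected, actual):
    """Return the expected column names that are absent from `actual`."""
    present = set(actual)
    return [name for name in expected if name not in present]

attrs = ["outlook", "temp", "humidity", "wind"]
# Column list copied from the AnalysisException message above.
loaded_columns = ["Gender", "y", "Lover", "GPA", "Laptop"]

missing = missing_columns(attrs + ["y"], loaded_columns)
if missing:
    print("Dataset is missing columns:", missing)
```

Raising an error at this point, instead of only printing, would stop the notebook before `process_dataset` is called with the wrong data.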

## 10. Stopping the Spark Context

In [31]:
sc.stop()

## Congratulations on finishing this activity. See you next week.