**Tutorial 2**

This tutorial uses the Weka package for machine learning. Weka is a well-known machine learning application and a set of libraries that can be used from within a programming language. It was created at the University of Waikato, New Zealand (https://www.cs.waikato.ac.nz/ml/weka/) and is accompanied by a data mining textbook taught in schools around the world (https://www.cs.waikato.ac.nz/ml/weka/book.html). The advantage of using Weka's Python package is that its implementations of the algorithms are complete, comprehensive and easy to use. Let's see below.


First install Weka's Python package.

In [None]:
! pip install python-weka-wrapper3

Collecting python-weka-wrapper3
  Using cached python-weka-wrapper3-0.2.14.tar.gz (15.9 MB)
  Preparing metadata (setup.py) ... done
Collecting python-javabridge>=4.0.0 (from python-weka-wrapper3)
  Using cached python-javabridge-4.0.3.tar.gz (1.3 MB)
  Preparing metadata (setup.py) ... done
Collecting configurable-objects (from python-weka-wrapper3)
  Downloading configurable-objects-0.0.1.tar.gz (4.4 kB)
  Preparing metadata (setup.py) ... done
Collecting simple-data-flow (from python-weka-wrapper3)
  Downloading simple-data-flow-0.0.1.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-weka-wrapper3, python-javabridge, configurable-objects, simple-data-flow
  Building wheel for python-weka-wrapper3 (setup.py) ... done
  Created wheel for python-weka-wrapper3: filename=python_weka_wrapper3-0.2.14-py3-none-any.whl size=14496261 sha256=ab3837d82113283cfa10c9081cad954081e365bca

Weka is built on Java, so below we set up Java and launch the Java Virtual Machine (JVM) from the Python environment. Don't worry about understanding this code.

In [None]:
import os
import sys
# Make the Java 11 installation visible to Python (the path below is the one on this notebook's machine)
sys.path.append("/usr/lib/jvm/java-11-openjdk-amd64/bin/")
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"



In [None]:

import weka.core.jvm as jvm
jvm.start()

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['/usr/local/lib/python3.10/dist-packages/javabridge/jars/rhino-1.7R4.jar', '/usr/local/lib/python3.10/dist-packages/javabridge/jars/runnablequeue.jar', '/usr/local/lib/python3.10/dist-packages/javabridge/jars/cpython.jar', '/usr/local/lib/python3.10/dist-packages/weka/lib/core.jar', '/usr/local/lib/python3.10/dist-packages/weka/lib/weka.jar', '/usr/local/lib/python3.10/dist-packages/weka/lib/arpack_combined.jar', '/usr/local/lib/python3.10/dist-packages/weka/lib/mtj.jar', '/usr/local/lib/python3.10/dist-packages/weka/lib/python-weka-wrapper.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled


We shall now upload a dataset file. Weka works most easily with the ARFF format, although it can load CSV too. We upload an .arff file because the correct data type of each variable (categorical or numeric) is already defined in it.
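
To give an idea of what an ARFF file looks like, here is a minimal sketch that builds one as a string. The four attribute declarations are the ones printed for the bank data further down in this tutorial; the relation name and the two data rows are made up for illustration.

```python
# A minimal ARFF sketch: a header declaring each attribute and its type,
# followed by comma-separated data rows.
arff_lines = [
    "@relation bank_sample",                                  # hypothetical relation name
    "",
    "@attribute age numeric",                                 # numeric attribute
    "@attribute duration numeric",                            # numeric attribute
    "@attribute poutcome {unknown,failure,other,success}",    # categorical attribute
    "@attribute y {no,yes}",                                  # categorical class attribute
    "",
    "@data",
    "30,79,unknown,no",                                       # made-up row
    "45,650,success,yes",                                     # made-up row
]
arff_text = "\n".join(arff_lines) + "\n"
print(arff_text)
```

Because the types are declared in the header, a loader does not have to guess whether a column is categorical or numeric, which is what can go wrong with a plain CSV file.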

In [None]:
from google.colab import files
uploaded = files.upload()

Saving bank.arff to bank.arff


Let's load our dataset into memory using the following code. The dataset file uploaded here is bank.arff. Note that the loaded data in memory is not a Pandas data frame.

In [None]:
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
#data_file = 'german_credit.arff'
#data_file="churn.arff"
data_file="bank.arff"
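# Load the file, then keep only age, duration, poutcome and the class y
# (columns 1, 12, 16 and 17, counted from 1), as selected in the Appendix.
# The outputs shown below were produced with this reduced data set.
data = loader.load_file(data_file)
data = data.subset(col_range='1, 12, 16, 17')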

print('Data set size: ', data.num_instances)

Data set size:  4521


In [None]:
#Let's look at the attributes and their types
# We have two data types here: categorical and numeric.
for i in range(data.num_attributes):
  print ("index ",i)
  print(data.attribute(i))

index  0
@attribute age numeric
index  1
@attribute duration numeric
index  2
@attribute poutcome {unknown,failure,other,success}
index  3
@attribute y {no,yes}


The class attribute in the data shown above is y, at index 3, as can be observed in the output. I am setting the class attribute here.

In [None]:

# Index of the class attribute is 0 (Creditability) for the German credit data set
# Index of the class attribute is 20 (Churn) for the Churn data set
# Index of the class attribute is 16 (y) for the full bank data set, and 3 (y) for the reduced bank data used here
# Again, you can see all the index numbers for attributes by running the previous cell
class_idx=3
print('Will be classifying on: ', data.attribute(class_idx))
data.class_index = class_idx


Will be classifying on:  @attribute y {no,yes}


Time to split the dataset into a training set and a test set.

In [None]:
# Splitting 66% for training and 34% for testing using a seed of 1 for random number generator
train, test = data.train_test_split(66.0, Random(1))
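
To see what a 66% split means in numbers, here is a small sketch in plain Python. It assumes the training size is the percentage of rows rounded to the nearest whole number; under that assumption the sizes agree with the 1537 test instances reported in the evaluation output of this tutorial. The shuffling part is only an illustration: Python's `random` module is not Weka's `Random` class, so the rows that land in each part will differ from the Weka split.

```python
import random

def percentage_split_sizes(n_instances, train_percent):
    """Return (train_size, test_size) for a percentage split."""
    train_size = int(round(n_instances * train_percent / 100.0))
    return train_size, n_instances - train_size

# 4521 is the data set size printed earlier in this tutorial
train_size, test_size = percentage_split_sizes(4521, 66.0)
print(train_size, test_size)  # 2984 1537

# Illustration only: shuffle the row indices with a seeded generator, then cut.
indices = list(range(4521))
random.Random(1).shuffle(indices)
train_idx, test_idx = indices[:train_size], indices[train_size:]
```

Using a fixed seed means that the same rows end up in the training and test sets every time the cell is run, so results can be repeated.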

We are now going to train a decision tree. It is a C4.5 decision tree, and its name in Weka is J48. The good thing about J48 is that it implements the C4.5 algorithm as described in theory and as we studied it. The C4.5 algorithm can handle numeric and categorical attributes by itself, so there is no need to convert categorical features (or variables) to numeric features using one-hot encoding.
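
As a reminder of the theory, C4.5 chooses splits using the gain ratio (information gain divided by split information). The sketch below computes it for one categorical attribute on a tiny made-up data set. It is a simplified illustration, not Weka's J48 code: the function names and the toy records are my own, and numeric thresholds, missing values and pruning are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of a categorical attribute: information gain / split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class after splitting on the attribute
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute values themselves
    return info_gain / split_info if split_info > 0 else 0.0

# Made-up toy records (not taken from bank.arff)
poutcome = ["success", "success", "failure", "failure", "unknown", "unknown"]
y = ["yes", "yes", "no", "no", "no", "yes"]
print(gain_ratio(poutcome, y))
```

Note that the categorical values are used directly as strings, with no one-hot encoding.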

In [None]:
# We are generating a pruned C4.5 decision tree, with a confidence factor used for pruning of 0.25.
# You can change it to different threshold values to change the size of the tree.
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls.build_classifier(train)
# See the tree below.
print(cls)

J48 pruned tree
------------------

duration <= 221: no (1780.0/54.0)
duration > 221
|   duration <= 645
|   |   poutcome = unknown
|   |   |   age <= 59: no (746.0/79.0)
|   |   |   age > 59
|   |   |   |   age <= 70
|   |   |   |   |   age <= 60: no (10.0/1.0)
|   |   |   |   |   age > 60
|   |   |   |   |   |   age <= 68: yes (13.0/5.0)
|   |   |   |   |   |   age > 68: no (3.0)
|   |   |   |   age > 70: yes (5.0)
|   |   poutcome = failure: no (99.0/21.0)
|   |   poutcome = other: no (41.0/15.0)
|   |   poutcome = success: yes (47.0/12.0)
|   duration > 645: yes (240.0/117.0)

Number of Leaves  : 	10

Size of the tree : 	17



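In the above tree, a leaf such as ": yes (13.0/5.0)" means that the class predicted at the leaf is yes, that 13 training records reached this leaf when the tree was evaluated on the training set after it was built, and that 5 of them were incorrectly predicted. A leaf with a single number, such as "no (3.0)", has no incorrectly predicted training records.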
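
If you want to read these numbers programmatically, the following sketch parses a leaf line from the printed tree. The helper `leaf_stats` and its regular expression are my own illustration and not part of Weka's Python package; the two example lines are copied from the tree above.

```python
import re

# Matches the end of a J48 leaf line: "<class> (<reached>)" or "<class> (<reached>/<wrong>)"
LEAF_PATTERN = re.compile(r"(\S+) \((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")

def leaf_stats(line):
    """Return (predicted class, records reaching the leaf, records misclassified)."""
    match = LEAF_PATTERN.search(line)
    if match is None:
        raise ValueError("not a leaf line: " + line)
    label, reached, wrong = match.groups()
    # A leaf without "/<wrong>" has no misclassified training records
    return label, float(reached), (float(wrong) if wrong else 0.0)

for line in ["age <= 68: yes (13.0/5.0)", "age > 68: no (3.0)"]:
    label, reached, wrong = leaf_stats(line)
    print(label, reached, wrong, "leaf accuracy on training data:", (reached - wrong) / reached)
```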

In [None]:
import weka.plot.graph as graph  # If pygraphviz is installed, you can also plot the tree as a graph, but it may not work
graph.plot_dot_graph(cls.graph)

ERROR:weka.plot.graph:Pygraphviz is not installed, cannot generate graph plot!


In [None]:
# Let's evaluate it on the test set

evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary())


Correctly Classified Instances        1366               88.8744 %
Incorrectly Classified Instances       171               11.1256 %
Kappa statistic                          0.4275
Mean absolute error                      0.1582
Root mean squared error                  0.2866
Relative absolute error                 77.2531 %
Root relative squared error             89.1181 %
Total Number of Instances             1537     



Here "Correctly Classified Instances" gives the accuracy, and "Total Number of Instances" is the total number of records in the test set. Ignore everything else as we have not studied those measures.
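
The accuracy can be checked by hand from the two counts in the summary above. This is just the arithmetic, written as a short sketch:

```python
# Counts copied from the J48 evaluation summary above
correct, incorrect = 1366, 171

total = correct + incorrect      # total number of test records
accuracy = correct / total       # fraction of test records classified correctly
print(f"{total} test instances, accuracy = {accuracy:.4%}")
```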

In [None]:
# Here are all the metrics
#print ("Class Index ", class_idx)
print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
# Rows are the actual classes and columns the predicted classes, in the order printed by the previous line.
# For the class at the first position, TP is the top-left cell and TN is the bottom-right cell.
print(evl.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl.true_positive_rate(class_position))
print("FP",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl.true_positive_rate(class_position))
print("FP",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


Classes at different positions are  @attribute y {no,yes}
confusion Matrix
[[1284.   73.]
 [  98.   82.]]

Evaluation from the perspective of class at position 0
TP  0.94620486366986
FP 0.5444444444444444
Precision  0.9290882778581766
Recall  0.94620486366986

Evaluation from the perspective of class at position 1
TP  0.45555555555555555
FP 0.05379513633014001
Precision  0.5290322580645161
Recall  0.45555555555555555
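
The rates printed above can be recomputed from the confusion matrix, which helps to see what each one means. The sketch below is plain Python and assumes that rows are actual classes and columns are predicted classes; the function name `per_class_metrics` is my own. Note that the values labelled "TP" and "FP" in the output are rates, not counts.

```python
def per_class_metrics(matrix, positive):
    """TP rate, FP rate, precision and recall, treating class `positive` as the positive class.

    matrix[i][j] is the number of records of actual class i predicted as class j.
    """
    n = len(matrix)
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp                          # actual positive, predicted otherwise
    fp = sum(matrix[i][positive] for i in range(n)) - tp     # predicted positive, actually otherwise
    tn = sum(map(sum, matrix)) - tp - fn - fp
    tp_rate = tp / (tp + fn)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp_rate,                                   # recall is the same as the TP rate
    }

# Confusion matrix copied from the J48 output above (class order: no, yes)
j48_matrix = [[1284, 73], [98, 82]]
for position, label in enumerate(["no", "yes"]):
    print(label, per_class_metrics(j48_matrix, position))
```

The tree is much better at recognising the "no" class than the "yes" class, which the overall accuracy alone does not show.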


**Naive Bayes**

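Below is the code to run the Naive Bayes algorithm. Weka's version of Naive Bayes is suited to both numeric and categorical features (attributes or variables): as the printed model shows, numeric attributes are summarised by a per-class mean and standard deviation, and categorical attributes by per-class counts (https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html).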

In [None]:

nb = Classifier(classname="weka.classifiers.bayes.NaiveBayes")
nb.build_classifier(train)
#let's understand the NB model by printing it
print(nb)

Naive Bayes Classifier

                  Class
Attribute            no      yes
                 (0.89)   (0.11)
age
  mean           41.0167  42.6333
  std. dev.      10.1781  13.1354
  weight sum        2643      341
  precision       1.0794   1.0794

duration
  mean          224.3443 559.0086
  std. dev.     204.4182 400.6621
  weight sum        2643      341
  precision       3.6239   3.6239

poutcome
  unknown         2221.0    224.0
  failure          292.0     39.0
  other            101.0     27.0
  success           33.0     55.0
  [total]         2647.0    345.0
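
To understand how the printed model is used, here is a simplified sketch that scores one made-up record with the numbers above: a normal (Gaussian) density for the numeric attributes and relative counts for poutcome. The record and the function names are my own, and I assume the class priors are proportional to the weight sums. Weka's own implementation may differ in details (for example smoothing of the priors and the "precision" values in the model), so treat the result as approximate.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Numbers copied from the printed Naive Bayes model: (mean, std. dev.) and counts per class
model = {
    "no": {"weight": 2643, "age": (41.0167, 10.1781), "duration": (224.3443, 204.4182),
           "poutcome": {"unknown": 2221.0, "failure": 292.0, "other": 101.0, "success": 33.0}},
    "yes": {"weight": 341, "age": (42.6333, 13.1354), "duration": (559.0086, 400.6621),
            "poutcome": {"unknown": 224.0, "failure": 39.0, "other": 27.0, "success": 55.0}},
}

def posterior(age, duration, poutcome):
    """Class probabilities: prior times the per-attribute likelihoods, then normalised."""
    total_weight = sum(c["weight"] for c in model.values())
    scores = {}
    for label, c in model.items():
        prior = c["weight"] / total_weight   # assumption: prior proportional to the weight sum
        nominal = c["poutcome"][poutcome] / sum(c["poutcome"].values())
        scores[label] = (prior * gaussian_pdf(age, *c["age"])
                         * gaussian_pdf(duration, *c["duration"]) * nominal)
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

probs = posterior(age=40, duration=600, poutcome="success")  # hypothetical record
print(probs)
```

Even though only about 11% of the training records are "yes", a long call duration together with a previous successful outcome makes "yes" the more probable class for this record.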




In [None]:
# Time for evaluation on the test set
evl_nb = Evaluation(train)
evl_nb.test_model(nb, test)
print(evl_nb.summary())


Correctly Classified Instances        1367               88.9395 %
Incorrectly Classified Instances       170               11.0605 %
Kappa statistic                          0.3452
Mean absolute error                      0.1424
Root mean squared error                  0.2929
Relative absolute error                 69.5067 %
Root relative squared error             91.0816 %
Total Number of Instances             1537     



In [None]:
#Here are all the metrics for Naive Bayes

print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line
print(evl_nb.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb.true_positive_rate(class_position))
print("FP",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb.true_positive_rate(class_position))
print("FP",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


Classes at different positions are  @attribute y {no,yes}
confusion Matrix
[[1310.   47.]
 [ 123.   57.]]

Evaluation from the perspective of class at position 0
TP  0.9653647752394989
FP 0.6833333333333333
Precision  0.9141660851360781
Recall  0.9653647752394989

Evaluation from the perspective of class at position 1
TP  0.31666666666666665
FP 0.034635224760501106
Precision  0.5480769230769231
Recall  0.31666666666666665


**Appendix**

Using the following code you can find the best attributes with the BestFirst search algorithm in Weka. Again, it is not necessary to understand the whole code below, but if you want to learn more about BestFirst and CfsSubsetEval, you can go here: https://weka.sourceforge.io/doc.dev/weka/attributeSelection/package-summary.html. You can also replace them with other options available on that site.


In [None]:
from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
attsel = AttributeSelection()
attsel.search(search)
attsel.evaluator(evaluator)
attsel.select_attributes(data)

print("# attributes: " + str(attsel.number_attributes_selected))
print("attributes: " + str(attsel.selected_attributes))
print("result string:\n" + attsel.results_string)

# attributes: 3
attributes: [ 0 11 15 16]
result string:


=== Attribute Selection on all input data ===

Search Method:
	Best first.
	Start set: no attributes
	Search direction: forward
	Stale search after 5 node expansions
	Total number of subsets evaluated: 97
	Merit of best subset found:    0.095

Attribute Subset Evaluator (supervised, Class (nominal): 17 y):
	CFS Subset Evaluator
	Including locally predictive attributes

Selected attributes: 1,12,16 : 3
                     age
                     duration
                     poutcome
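
For curious readers, the "merit" in the output above comes from the idea behind correlation-based feature selection (CFS): a good subset has attributes that are strongly related to the class but weakly related to each other. Below is a sketch of the usual CFS merit formula. The function name and the example correlation values are made up, and how Weka measures the correlations is not shown here.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k attributes.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean
    attribute-class correlation and r_ff the mean attribute-attribute correlation.
    """
    return (k * mean_feature_class_corr) / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)

# Hypothetical correlation values, for illustration only
print(cfs_merit(3, 0.3, 0.0))  # attributes unrelated to each other: higher merit
print(cfs_merit(3, 0.3, 1.0))  # fully redundant attributes: lower merit
```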



Weka's best-first search resulted in the above attribute selection. Let's create a new copy of the dataset with those attributes only.

In [None]:
# As you see above, Weka judged only attributes 1, 12 and 16 (age, duration and poutcome) to be important for the bank data set.
# So we load the data again and keep only those attributes plus the class attribute y (attribute 17).
from weka.filters import Filter

data2 = loader.load_file(data_file)
# Filtering method 1: remove a range of attributes with the Remove filter
# (the range 5-21 below was written for the German credit data set)
#remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "5-21"])
#remove.inputformat(data2)
#filtered_data = remove.filter(data2)

#print(filtered_data.subset(row_range="1-10"))

In [None]:
#Filtering method 2
#Another way of filtering columns is the following code. Here we are keeping only attributes 1, 12, 16 and 17 (counted from 1).
filtered_data=data2.subset(col_range='1, 12, 16, 17')


Now you can use the above filtered data set as the input data set in the code examples shown above and repeat the experiments.

More examples on the use of different functionalities of Weka's Python package are here for curious readers:
http://fracpete.github.io/python-weka-wrapper3/examples.html

In [None]:
#If you are done stop the JVM (Java Virtual Machine)
jvm.stop()

It turns out that Weka's Python package is easier to use and more comprehensive than many other Python packages.



```
For CIND 119 course at Ryerson
  by Syed Shariyar Murtaza,Ph.D.
```

