**Tutorial 2**

This tutorial uses the Weka package for machine learning. Weka is well-known machine learning software and a set of libraries that can be used from within a programming language. Weka was created at the University of Waikato, New Zealand (https://www.cs.waikato.ac.nz/ml/weka/). It is accompanied by a data mining textbook that is taught in schools around the world (https://www.cs.waikato.ac.nz/ml/weka/book.html). The advantage of using Weka's Python package is that its implementations of the algorithms are complete, comprehensive and easy to use. Let's see below.


First install Weka's Python package.

In [None]:
! pip install python-weka-wrapper3

Collecting python-weka-wrapper3
  Downloading python-weka-wrapper3-0.2.8.tar.gz (14.4 MB)
Collecting python-javabridge>=4.0.0
  Downloading python-javabridge-4.0.3.tar.gz (1.3 MB)
Building wheels for collected packages: python-weka-wrapper3, python-javabridge
  Building wheel for python-weka-wrapper3 (setup.py) ... done
  Created wheel for python-weka-wrapper3: filename=python_weka_wrapper3-0.2.8-py3-none-any.whl size=12991113 sha256=18e34f6c0d98e9e2bdb3054c0e51c93a8b336140be639af571cccd07fa754c0c
  Stored in directory: /root/.cache/pip/wheels/42/03/23/d9c07aa47a84f9a0003dbb38240edac8b8c682ad4290b6a3d1
  Building wheel for python-javabridge (setup.py) ... done
  Created wheel for python-javabridge: filename=python_javabridge-4.0.3-cp37-cp37m-linux_x86_64.whl size=1628129 sha256=4382732b59e7612030287bcf03b19e96c6ba5d4254dfd2afa074adc77b0

Weka is built on Java, so below we set up Java and launch it from the Python environment. Don't worry about understanding this code.

In [None]:
import os
import sys

# Tell the wrapper where Java lives; this is the OpenJDK 11 path used in this notebook
sys.path.append("/usr/lib/jvm/java-11-openjdk-amd64/bin/")
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
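If the JDK is installed somewhere else on your machine, the hard-coded path above will not exist. The sketch below shows one way to check a few candidate folders before setting `JAVA_HOME`. The helper `pick_java_home` is a hypothetical name of mine, not part of Weka or its wrapper, and the only candidate path listed is the one this tutorial already uses.

```python
import os

def pick_java_home(candidates):
    """Return the first candidate directory that contains bin/java, else None."""
    for path in candidates:
        if os.path.isfile(os.path.join(path, "bin", "java")):
            return path
    return None

# Candidate locations: an existing JAVA_HOME (if any), then the tutorial's path.
candidates = [
    os.environ.get("JAVA_HOME", ""),
    "/usr/lib/jvm/java-11-openjdk-amd64",
]
java_home = pick_java_home([c for c in candidates if c])

if java_home is None:
    print("No JDK found in the candidate folders; install Java first.")
else:
    os.environ["JAVA_HOME"] = java_home
    print("Using JAVA_HOME =", java_home)
```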



In [None]:

import weka.core.jvm as jvm
jvm.start()

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['/usr/local/lib/python3.7/dist-packages/javabridge/jars/rhino-1.7R4.jar', '/usr/local/lib/python3.7/dist-packages/javabridge/jars/runnablequeue.jar', '/usr/local/lib/python3.7/dist-packages/javabridge/jars/cpython.jar', '/usr/local/lib/python3.7/dist-packages/weka/lib/python-weka-wrapper.jar', '/usr/local/lib/python3.7/dist-packages/weka/lib/weka.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled


We shall now upload a dataset file. Weka works most easily with the ARFF format, although it can load CSV too. We upload an .arff file because the correct data type of each variable (categorical or numeric) is already defined in it.
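To see what "data types defined in the file" means, here is a tiny, made-up ARFF file held in a Python string, together with a small reader for its header. The relation name `tiny_bank`, the two data rows and the function `attribute_types` are only for illustration. The reader is a simplified sketch and does not cover the full ARFF syntax (quoted names, dates, strings).

```python
# A tiny, hypothetical ARFF file: the header declares each attribute's type,
# and the rows after @data follow the same column order.
arff_text = """@relation tiny_bank
@attribute age numeric
@attribute marital {married,single,divorced}
@attribute y {no,yes}
@data
30,married,no
33,single,yes
"""

def attribute_types(text):
    """Map attribute name -> 'numeric' or the list of allowed categories."""
    types = {}
    for line in text.splitlines():
        if line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Categorical (nominal) attribute: keep its list of values
                types[name] = spec.strip("{}").split(",")
            else:
                types[name] = spec
    return types

print(attribute_types(arff_text))
```

A CSV file has no such header, so the loader has to guess whether a column is numeric or categorical.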

In [None]:
from google.colab import files
uploaded = files.upload()

Saving bank.arff to bank.arff


Let's load our dataset into memory using the following code. The dataset file that I have uploaded is bank.arff. Note that the data loaded in memory is not a Pandas data frame.

In [None]:
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
#data_file = 'german_credit.arff'
#data_file="churn.arff"
data_file="bank.arff"
data = loader.load_file(data_file)

print('Data set size: ', data.num_instances)

Data set size:  4521


In [None]:
#Let's look at the attributes and their types
# We have two data types here: categorical and numeric.
for i in range(data.num_attributes):
  print ("index ",i)
  print(data.attribute(i))

index  0
@attribute age numeric
index  1
@attribute job {unemployed,services,management,blue-collar,self-employed,technician,entrepreneur,admin.,student,housemaid,retired,unknown}
index  2
@attribute marital {married,single,divorced}
index  3
@attribute education {primary,secondary,tertiary,unknown}
index  4
@attribute default {no,yes}
index  5
@attribute balance numeric
index  6
@attribute housing {no,yes}
index  7
@attribute loan {no,yes}
index  8
@attribute contact {cellular,unknown,telephone}
index  9
@attribute day numeric
index  10
@attribute month {oct,may,apr,jun,feb,aug,jan,jul,nov,sep,mar,dec}
index  11
@attribute duration numeric
index  12
@attribute campaign numeric
index  13
@attribute pdays numeric
index  14
@attribute previous numeric
index  15
@attribute poutcome {unknown,failure,other,success}
index  16
@attribute y {no,yes}


The index of the class attribute in our data is 16 (y), as can be seen in the listing above. I am setting the class attribute here.

In [None]:

# index of class attribute is 0 (Creditability) for the German credit card data set
# index of class attribute is 20 (Churn) for the Churn data set
# index of class attribute is 16 (y) for the bank data set
# Again, you can see all the index numbers for attributes by running the previous cell
class_idx=16
print('Will be classifying on: ', data.attribute(class_idx))
data.class_index = class_idx


Will be classifying on:  @attribute y {no,yes}


Time to split the dataset into a training set and a test set.

In [None]:
# Splitting 66% for training and 34% for testing using a seed of 1 for random number generator
train, test = data.train_test_split(66.0, Random(1))
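The idea behind a seeded percentage split can be sketched in plain Python. This is not Weka's code: the function `percentage_split` is my own, the records it picks will differ from Weka's, and rounding the cut point is an assumption (it agrees with the 1537 test instances reported in the evaluation further down).

```python
import random

def percentage_split(rows, percent, seed):
    """Shuffle a copy of rows with a fixed seed, then cut it at percent%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * percent / 100.0)  # assumed rounding rule
    return shuffled[:cut], shuffled[cut:]

rows = list(range(4521))  # stand-ins for the 4521 records of bank.arff
train_rows, test_rows = percentage_split(rows, 66.0, seed=1)
print(len(train_rows), len(test_rows))
```

Because the seed is fixed, running the split again gives exactly the same two sets, which makes experiments repeatable.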

We are now going to train a decision tree. It is a C4.5 decision tree, and its name in Weka is J48. The good thing about J48 is that it implements the C4.5 algorithm as described in theory and as we studied it. The C4.5 algorithm can handle numeric and categorical attributes by itself, so there is no need to convert categorical features (or variables) to numeric features by using one-hot encoding.
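As a reminder of how C4.5 can score both kinds of attributes without one-hot encoding, here is a minimal sketch of its gain ratio criterion on six made-up records. The data and the function names are hypothetical, and the 221 threshold is borrowed from the tree printed further down. The real algorithm also searches over thresholds, handles missing values and prunes the tree, none of which is shown here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(groups):
    """groups: one list of class labels per branch of a candidate split."""
    all_labels = [y for g in groups for y in g]
    n = len(all_labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    info_gain = entropy(all_labels) - remainder
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups if g)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy records: (duration, housing, y)
toy = [(100, "no", "no"), (150, "yes", "no"), (200, "no", "no"),
       (300, "yes", "no"), (400, "no", "yes"), (700, "yes", "yes")]

# Numeric attribute: a binary split on a threshold
numeric_split = [[y for d, _, y in toy if d <= 221],
                 [y for d, _, y in toy if d > 221]]

# Categorical attribute: one branch per category
categorical_split = [[y for _, h, y in toy if h == "no"],
                     [y for _, h, y in toy if h == "yes"]]

print("duration <= 221:", gain_ratio(numeric_split))
print("housing:", gain_ratio(categorical_split))
```

On this toy data the duration split scores higher, so a tree would split on duration first.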

In [None]:
# We are generating a pruned C4.5 decision tree, with a confidence factor used for pruning of 0.25.
# You can change it to different threshold values to change the size of the tree.
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls.build_classifier(train)
# See the tree below. 
print(cls)

J48 pruned tree
------------------

duration <= 221: no (1780.0/54.0)
duration > 221
|   duration <= 645
|   |   poutcome = unknown
|   |   |   contact = cellular
|   |   |   |   month = oct: yes (9.0/3.0)
|   |   |   |   month = may: no (55.0/8.0)
|   |   |   |   month = apr
|   |   |   |   |   day <= 20
|   |   |   |   |   |   housing = no
|   |   |   |   |   |   |   day <= 16: yes (4.0/1.0)
|   |   |   |   |   |   |   day > 16: no (3.0)
|   |   |   |   |   |   housing = yes: no (38.0/1.0)
|   |   |   |   |   day > 20
|   |   |   |   |   |   marital = married: yes (6.0)
|   |   |   |   |   |   marital = single: yes (3.0)
|   |   |   |   |   |   marital = divorced: no (2.0)
|   |   |   |   month = jun
|   |   |   |   |   marital = married: yes (6.0/1.0)
|   |   |   |   |   marital = single: no (4.0/1.0)
|   |   |   |   |   marital = divorced: yes (2.0)
|   |   |   |   month = feb
|   |   |   |   |   day <= 7: no (23.0/1.0)
|   |   |   |   |   day > 7: yes (9.0/2.0)
|   |   |   |   mon

In the tree above, a leaf such as ": yes (9.0/3.0)" means that the class predicted at the leaf is yes, that 9 training records reached this leaf when the finished tree was evaluated on the training set, and that 3 of them were predicted incorrectly. A leaf with a single number, such as ": no (3.0)", had no incorrect predictions.
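The leaf annotations can also be read programmatically. The sketch below parses a few lines copied from the printed tree and reports the share of training records that each leaf gets wrong. The regular expression and the function `parse_leaf` are my own and only cover the leaf format seen in this output.

```python
import re

# Matches ": <class> (<reached>)" or ": <class> (<reached>/<wrong>)" at line end
LEAF = re.compile(r": (\S+) \((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")

def parse_leaf(line):
    """Return (class label, records reached, records wrong), or None if not a leaf."""
    m = LEAF.search(line)
    if m is None:
        return None
    return m.group(1), float(m.group(2)), float(m.group(3) or 0.0)

lines = [
    "duration <= 221: no (1780.0/54.0)",
    "duration > 221",
    "|   |   |   |   month = oct: yes (9.0/3.0)",
    "|   |   |   |   |   |   |   day > 16: no (3.0)",
]
for line in lines:
    leaf = parse_leaf(line)
    if leaf is not None:
        label, reached, wrong = leaf
        print(label, reached, wrong, "training error:", round(wrong / reached, 4))
```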

In [None]:
import weka.plot.graph as graph  # If pygraphviz is installed, you can plot the tree as a graph too; without it, this only logs an error
graph.plot_dot_graph(cls.graph)

ERROR:weka.plot.graph:Pygraphviz is not installed, cannot generate graph plot!


In [None]:
# Let's evaluate it on the test set

evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary())


Correctly Classified Instances        1360               88.4841 %
Incorrectly Classified Instances       177               11.5159 %
Kappa statistic                          0.3435
Mean absolute error                      0.1525
Root mean squared error                  0.3114
Relative absolute error                 74.4399 %
Root relative squared error             96.8266 %
Total Number of Instances             1537     



Here "Correctly Classified Instances" gives the accuracy, and "Total Number of Instances" is the number of records in the test set. Ignore the other figures, as we have not studied them.
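The accuracy in the summary is simply the share of correctly classified test records. The short check below redoes that arithmetic with the two counts from the summary above; the variable names are my own.

```python
correct, incorrect = 1360, 177   # counts copied from the summary above
total = correct + incorrect      # should equal "Total Number of Instances"
accuracy = 100.0 * correct / total
print(total, round(accuracy, 4))
```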

In [None]:
# Here are all the metrics
#print ("Class Index ", class_idx)
print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line and TN will be for the class at second position
print(evl.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl.true_positive_rate(class_position))
print("FP",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl.true_positive_rate(class_position))
print("FP",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


Classes at different positions are  @attribute y {no,yes}
confusion Matrix
[[1300.   57.]
 [ 120.   60.]]

Evaluation from the perspective of class at position 0
TP  0.9579955784819455
FP 0.6666666666666666
Precision  0.9154929577464789
Recall  0.9579955784819455

Evaluation from the perspective of class at position 1
TP  0.3333333333333333
FP 0.04200442151805453
Precision  0.5128205128205128
Recall  0.3333333333333333
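All of the numbers above can be recomputed by hand from the confusion matrix, which helps to see what each metric means. The sketch below assumes that rows are the actual classes and columns are the predicted classes (this reading reproduces Weka's figures). The function `class_metrics` is my own, not a Weka call.

```python
def class_metrics(matrix, k):
    """Metrics for class k, treating k as 'positive' and every other class as 'negative'."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                    # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp     # predicted k, actually something else
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # recall is the same quantity as the TP rate
    }

j48_matrix = [[1300, 57],
              [120, 60]]            # copied from the output above
for k, name in enumerate(["no", "yes"]):
    print(name, class_metrics(j48_matrix, k))
```

Note that the tree finds only a third of the actual "yes" records, even though its overall accuracy is high.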


**Naive Bayes**

Below is the code to run the Naive Bayes algorithm. This version of Naive Bayes handles both numeric and categorical features (attributes or variables).
(https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html)

In [None]:

nb = Classifier(classname="weka.classifiers.bayes.NaiveBayes")
nb.build_classifier(train)
#let's understand the NB model by printing it
print(nb)

Naive Bayes Classifier

                      Class
Attribute                no       yes
                     (0.89)    (0.11)
age
  mean               41.0167   42.6333
  std. dev.          10.1781   13.1354
  weight sum            2643       341
  precision           1.0794    1.0794

job
  unemployed            75.0      12.0
  services             253.0      30.0
  management           544.0      84.0
  blue-collar          587.0      39.0
  self-employed        121.0      15.0
  technician           469.0      49.0
  entrepreneur          94.0      11.0
  admin.               271.0      43.0
  student               42.0      16.0
  housemaid             63.0      13.0
  retired              110.0      35.0
  unknown               26.0       6.0
  [total]             2655.0     353.0

marital
  married             1678.0     190.0
  single               672.0     103.0
  divorced             296.0      51.0
  [total]             2646.0     344.0

education
  primary              3
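To see how the printed model turns into a prediction, here is a simplified sketch that uses only two attributes, age (numeric) and job (categorical), with numbers copied from the output above. It is not Weka's exact computation: the real model multiplies over all attributes and treats numeric values slightly differently (see the "precision" row). The person being scored, a 25-year-old student, is made up. Notice that each [total] equals the weight sum plus the number of categories (for job, 2643 + 12 = 2655), which is consistent with the counts starting at 1 rather than 0.

```python
from math import exp, pi, sqrt

def gaussian(x, mean, std):
    """Normal density, used as the likelihood of a numeric value."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

# Numbers copied from the printed model
priors = {"no": 0.89, "yes": 0.11}
age_stats = {"no": (41.0167, 10.1781), "yes": (42.6333, 13.1354)}  # (mean, std. dev.)
student_counts = {"no": (42.0, 2655.0), "yes": (16.0, 353.0)}      # (job = student, [total])

def posterior(age):
    """P(class | age, job = student) using only these two attributes."""
    scores = {}
    for c in priors:
        mean, std = age_stats[c]
        count, total = student_counts[c]
        scores[c] = priors[c] * gaussian(age, mean, std) * (count / total)
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

print(posterior(25))  # hypothetical 25-year-old student
```

For this person the probability of "yes" comes out higher than the prior of 0.11, because students are relatively more common among the "yes" records.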

In [None]:
# Time for evaluation on the test set
evl_nb = Evaluation(train)
evl_nb.test_model(nb, test)
print(evl_nb.summary())


Correctly Classified Instances        1346               87.5732 %
Incorrectly Classified Instances       191               12.4268 %
Kappa statistic                          0.4005
Mean absolute error                      0.1512
Root mean squared error                  0.3084
Relative absolute error                 73.8292 %
Root relative squared error             95.8985 %
Total Number of Instances             1537     



In [None]:
#Here are all the metrics for Naive Bayes

print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line
print(evl_nb.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb.true_positive_rate(class_position))
print("FP",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb.true_positive_rate(class_position))
print("FP",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


Classes at different positions are  @attribute y {no,yes}
confusion Matrix
[[1261.   96.]
 [  95.   85.]]

Evaluation from the perspective of class at position 0
TP  0.9292557111274871
FP 0.5277777777777778
Precision  0.9299410029498525
Recall  0.9292557111274871

Evaluation from the perspective of class at position 1
TP  0.4722222222222222
FP 0.0707442888725129
Precision  0.4696132596685083
Recall  0.4722222222222222


**Appendix**

Using the following code, you can find the best attributes with the BestFirst search algorithm in Weka. Again, it is not necessary to understand the whole code below, but if you want to learn more about BestFirst and CfsSubsetEval, you can go to https://weka.sourceforge.io/doc.dev/weka/attributeSelection/package-summary.html. You can also replace them with other options listed on that site.


In [None]:
from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
attsel = AttributeSelection()
attsel.search(search)
attsel.evaluator(evaluator)
attsel.select_attributes(data)

print("# attributes: " + str(attsel.number_attributes_selected))
print("attributes: " + str(attsel.selected_attributes))
print("result string:\n" + attsel.results_string)

# attributes: 3
attributes: [ 0 11 15 16]
result string:


=== Attribute Selection on all input data ===

Search Method:
	Best first.
	Start set: no attributes
	Search direction: forward
	Stale search after 5 node expansions
	Total number of subsets evaluated: 97
	Merit of best subset found:    0.095

Attribute Subset Evaluator (supervised, Class (nominal): 17 y):
	CFS Subset Evaluator
	Including locally predictive attributes

Selected attributes: 1,12,16 : 3
                     age
                     duration
                     poutcome



Weka's best-first search selected the attributes above. Let's create a new copy of the dataset with only those attributes.
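The selected attributes are reported with 0-based indices ([0 11 15 16], where 16 is the class y), while the filter ranges used next are 1-based. The small helper below converts between the two so that the range strings do not have to be worked out by hand. The function `to_ranges` is my own and is not part of Weka's Python package.

```python
def to_ranges(numbers):
    """Compress integers into a range string such as '2-11,13-15'."""
    parts, start, prev = [], None, None
    for n in sorted(numbers):
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)

selected_zero_based = [0, 11, 15, 16]   # copied from the output above
num_attributes = 17                     # bank.arff has attributes at index 0..16

keep = [i + 1 for i in selected_zero_based]                        # 1-based
drop = [i for i in range(1, num_attributes + 1) if i not in keep]

print("keep:", ",".join(map(str, keep)))
print("drop:", to_ranges(drop))
```

The "drop" string is the kind of range a Remove filter takes, and the "keep" string is the kind of range a column subset takes.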

In [None]:
# As you see above, only attributes 1, 12 and 16 (age, duration and poutcome) are judged important by Weka for the bank data set. So we are going to load
# the data again and remove attributes 2-11 and 13-15. Attribute 17 (y) is the class attribute, so we'll keep that too
from weka.filters import Filter

data2 = loader.load_file(data_file)
# Filtering method 1
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "2-11,13-15"])
remove.inputformat(data2)
filtered_data = remove.filter(data2)

print(filtered_data.subset(row_range="1-10"))

@relation 'bank-weka.filters.unsupervised.attribute.Remove-R2-11,13-15-weka.filters.unsupervised.instance.RemoveRange-V-R1-10-weka.filters.MultiFilter-Fweka.filters.unsupervised.instance.RemoveRange -V -R 1-10-S1'

@attribute age numeric
@attribute duration numeric
@attribute poutcome {unknown,failure,other,success}
@attribute y {no,yes}

@data
30,79,unknown,no
33,220,failure,no
35,185,failure,no
30,199,unknown,no
59,226,unknown,no
35,141,failure,no
36,341,other,no
39,151,unknown,no
41,57,unknown,no
43,313,failure,no


In [None]:
#Filtering method 2
#Another way of filtering columns is the following code. Here we are keeping only attributes 1, 12, 16 and 17.
filtered_data=data2.subset(col_range='1,12,16,17')
print(filtered_data)

@relation 'bank-weka.filters.unsupervised.attribute.Remove-V-R1,12,16,17-weka.filters.MultiFilter-Fweka.filters.unsupervised.attribute.Remove -V -R 1,12,16,17-S1'

@attribute age numeric
@attribute duration numeric
@attribute poutcome {unknown,failure,other,success}
@attribute y {no,yes}

@data
30,79,unknown,no
33,220,failure,no
35,185,failure,no
30,199,unknown,no
59,226,unknown,no
35,141,failure,no
36,341,other,no
39,151,unknown,no
41,57,unknown,no
43,313,failure,no
39,273,unknown,no
43,113,unknown,no
36,328,unknown,no
20,261,unknown,yes
31,89,failure,no
40,189,unknown,no
56,239,unknown,no
37,114,failure,no
25,250,unknown,no
31,148,other,no
38,96,unknown,no
42,140,unknown,no
44,109,unknown,no
44,125,unknown,no
26,169,unknown,no
41,182,unknown,no
55,247,unknown,no
67,119,failure,no
56,149,unknown,no
53,74,unknown,no
68,897,unknown,yes
31,81,unknown,no
59,40,unknown,no
32,958,unknown,yes
49,354,unknown,yes
42,150,unknown,no
78,97,unknown,yes
32,132,unknown,yes
33,765,failure,yes
23,16,u

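Now you can use the filtered data set above as the input data set in the code examples shown earlier and repeat the experiments. Remember to set the class index on the filtered data first; y is now the last of its four attributes.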

More examples on the use of different functionalities of Weka's Python package are here for curious readers:
http://fracpete.github.io/python-weka-wrapper3/examples.html

In [None]:
# When you are done, stop the JVM (Java Virtual Machine)
jvm.stop()

As this tutorial shows, Weka's Python package is easy to use and comprehensive.



```
For CIND 119 course at Ryerson
  by Syed Shariyar Murtaza,Ph.D.
```

