# Neural Networks Project
## Self-labeled techniques for semi-supervised learning
<div style="text-align: right">
    Mark Laane <br />
    Rome, 2017
</div>

### Introduction ###
The aim of this project is to reimplement some of the techniques surveyed by Isaac Triguero et al. in [1] and to independently reproduce the reported results.

Two self-labeled techniques were chosen: Standard Self-Training and Tri-Training. They are evaluated on the Bupa and Abalone datasets. The implementation is written in Python using the pandas and scikit-learn libraries.

### Implemented self-labeled algorithms ###
Two algorithms are implemented: Standard Self-Training and Tri-Training.
#### Standard Self-Training ####
Implementation: [Standard Self-Training](StandardSelfTraining.py)

The implementation is based on the description of the algorithm in [2].

Training a Standard Self-Training classifier is an iterative process. The base classifier is first trained on the initially labeled samples. It is then used to label the unlabeled samples, and the classifier is retrained with the most confident predictions added to the training set. The process is repeated until the classifier output stabilizes.
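
The loop described above can be sketched as follows. This is only an illustration and not the code in `StandardSelfTraining.py`: the helper name `self_training_fit`, the confidence `threshold`, the `max_iter` cap and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB


def self_training_fit(base, X_lab, y_lab, X_unl, threshold=0.9, max_iter=20):
    """Illustrative self-training loop (hypothetical helper, not the project's code)."""
    X_lab, y_lab, X_unl = np.asarray(X_lab), np.asarray(y_lab), np.asarray(X_unl)
    clf = clone(base).fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(X_unl) == 0:
            break
        proba = clf.predict_proba(X_unl)
        # Assumed selection rule: keep predictions whose top probability is high enough
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing new to add, so the labeling has stabilized
        new_labels = clf.classes_[proba.argmax(axis=1)[confident]]
        # Move the confidently labeled samples into the labeled set and retrain
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unl = X_unl[~confident]
        clf = clone(base).fit(X_lab, y_lab)
    return clf


# Synthetic data: two well-separated blobs, only 5 labeled samples per class
rng = np.random.RandomState(0)
X0 = rng.normal(loc=-3.0, size=(50, 2))
X1 = rng.normal(loc=3.0, size=(50, 2))
X_lab = np.vstack([X0[:5], X1[:5]])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([X0[5:], X1[5:]])

model = self_training_fit(GaussianNB(), X_lab, y_lab, X_unl)
pred = model.predict(np.array([[-3.0, -3.0], [3.0, 3.0]]))
```

The sketch requires a base classifier that exposes `predict_proba`. Other confidence measures are possible, and the project's implementation may use a different one.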
#### Tri-Training ####
Implementation: [Tri-Training](tri_training.py)

The implementation is based on the description of the algorithm in [3].

In Tri-Training, three base classifiers are trained on randomly subsampled sets of the labeled data. Each of them is then iteratively retrained on data labeled by the other two base classifiers. The prediction is made by majority voting over the three base classifiers.
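
A heavily simplified sketch of this scheme is shown below. It is not the code in `tri_training.py` and it leaves out the error-rate conditions that the algorithm in [3] uses to decide whether pseudo-labeled samples are accepted. The helper names `tri_training_fit` and `tri_training_predict`, the fixed number of `rounds` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


def tri_training_fit(base, X_lab, y_lab, X_unl, rounds=3, seed=0):
    """Simplified tri-training (hypothetical helper): no error-rate checks."""
    X_lab, y_lab, X_unl = np.asarray(X_lab), np.asarray(y_lab), np.asarray(X_unl)
    # Initial models: one bootstrap sample of the labeled data per classifier
    models = []
    for i in range(3):
        Xb, yb = resample(X_lab, y_lab, random_state=seed + i, stratify=y_lab)
        models.append(clone(base).fit(Xb, yb))
    for _ in range(rounds):
        preds = [m.predict(X_unl) for m in models]
        updated = []
        for i in range(3):
            j, k = [n for n in range(3) if n != i]
            # Samples on which the two peer classifiers agree get that label
            agree = preds[j] == preds[k]
            X_new = np.vstack([X_lab, X_unl[agree]])
            y_new = np.concatenate([y_lab, preds[j][agree]])
            updated.append(clone(base).fit(X_new, y_new))
        models = updated
    return models


def tri_training_predict(models, X):
    """Majority vote of three classifiers; a three-way tie falls back to the first."""
    votes = [m.predict(X) for m in models]
    return np.where(votes[1] == votes[2], votes[1], votes[0])


# Synthetic data: two well-separated blobs, only 5 labeled samples per class
rng = np.random.RandomState(0)
X0 = rng.normal(loc=-3.0, size=(50, 2))
X1 = rng.normal(loc=3.0, size=(50, 2))
X_lab = np.vstack([X0[:5], X1[:5]])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([X0[5:], X1[5:]])

models = tri_training_fit(DecisionTreeClassifier(random_state=0), X_lab, y_lab, X_unl)
pred = tri_training_predict(models, np.array([[-3.0, -3.0], [3.0, 3.0]]))
```

The voting rule relies on there being exactly three voters: if the second and third classifiers agree, they form the majority; otherwise the first classifier either sides with one of them or breaks the tie.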

In [1]:
from StandardSelfTraining import StandardSelfTraining
from tri_training import TriTraining

### Base classifiers ###
Four different base classifiers are used, all provided by the scikit-learn library. They are configured according to the parameters described in the paper.

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

class base_classifiers:
    KNN = KNeighborsClassifier(
        n_neighbors=3,
        metric="euclidean",
        #n_jobs=2  # Parallelize work on CPUs
    )
    NB = GaussianNB(
        priors=None
    )
    SMO = SVC(
        C=1.0,
        kernel='poly',
        degree=1,
        tol=0.001,
        # Epsilon parameter missing?
    )
    CART = DecisionTreeClassifier(
        criterion='entropy',
        # splitter='best',
        # max_depth=None,
        # min_samples_split=2,
        min_samples_leaf=2,
        # min_weight_fraction_leaf=0.0,
        # max_features=None,
        # random_state=None,
        # max_leaf_nodes=None,
        # min_impurity_split=1e-07,
        # class_weight=None,
        # presort=False,
    )

In total, eight classifiers are trained (two techniques, each with four different base classifiers):

In [3]:
# All classifiers used for testing
classifiers = [
    TriTraining("TriTraining (KNN)", base_classifiers.KNN),
    TriTraining("TriTraining (NB)", base_classifiers.NB),
    TriTraining("TriTraining (SVM)", base_classifiers.SMO),
    TriTraining("TriTraining (CART)", base_classifiers.CART),
    StandardSelfTraining("Self-Training (KNN)", base_classifiers.KNN),
    StandardSelfTraining("Self-Training (NB)", base_classifiers.NB),
    StandardSelfTraining("Self-Training (SVM)", base_classifiers.SMO),
    StandardSelfTraining("Self-Training (CART)", base_classifiers.CART)
]
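
Note that each estimator object, for example `base_classifiers.KNN`, is passed to two wrappers. Whether `TriTraining` and `StandardSelfTraining` copy the estimator internally is not shown here. If they do not, one way to guarantee that no fitted state is shared is `sklearn.base.clone`, as in this small sketch (the variable names are illustrative only):

```python
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

# One configured, unfitted template estimator
shared = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# clone() returns a new unfitted estimator with the same parameters,
# so each wrapper could be given its own independent copy
copy_a = clone(shared)
copy_b = clone(shared)
```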

### Datasets ###
In this project two standard datasets are used: Bupa and Abalone, both obtained from the KEEL dataset repository. Each dataset is available with four labeling ratios: 10%, 20%, 30% and 40%.

In [4]:
path_to_datasets = "../Datasets/"

# All datasets used for testing
dataset_names = ["bupa", "abalone"]
labeling_rates = [10, 20, 30, 40]

The datasets are loaded with the pandas library.

In [5]:
import pandas as pd


def load_dataset(path):
    """Load one dataset; text after "@" is treated as a comment and skipped."""
    return pd.read_csv(path, header=None, sep=", ", engine="python", comment="@")


def load_datasets(dataset_name, labeling_rate=10):
    """Load 3 datasets: training (tra), transitive (trs) and testing (tst)."""
    partial_path = "{0}SSC_{1}labeled/{2}/{2}-10-1".format(path_to_datasets, labeling_rate, dataset_name)
    dataframes = {t: load_dataset(partial_path + t + ".dat") for t in ["tra", "trs", "tst"]}
    return dataframes
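
To make the loading step concrete, the sketch below shows which file paths the format string produces and how the same `read_csv` arguments handle a header that starts with `@`. The helper `dataset_paths` and the in-memory sample text are invented for illustration; in particular, the `unlabeled` marker and the sample values are assumptions, not data from the real files.

```python
import io

import pandas as pd

path_to_datasets = "../Datasets/"


def dataset_paths(dataset_name, labeling_rate=10):
    """Hypothetical helper: build the three file paths without reading them."""
    partial_path = "{0}SSC_{1}labeled/{2}/{2}-10-1".format(
        path_to_datasets, labeling_rate, dataset_name)
    return {t: partial_path + t + ".dat" for t in ["tra", "trs", "tst"]}


paths = dataset_paths("bupa", 20)
print(paths["tra"])  # ../Datasets/SSC_20labeled/bupa/bupa-10-1tra.dat

# Made-up sample in the same layout: "@" header lines, then ", "-separated rows
sample = io.StringIO(
    "@relation toy\n"
    "@attribute a real [0.0, 10.0]\n"
    "@data\n"
    "1.0, 2.0, 1\n"
    "3.0, 4.0, unlabeled\n"
)
df = pd.read_csv(sample, header=None, sep=", ", engine="python", comment="@")
print(df.shape)  # (2, 3)
```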

### Training and scoring ###
The classifier is trained on the training dataset. Its transitive (transductive) classification performance is then measured on the same data. Finally, the testing dataset is used to measure performance on unseen data.

In [6]:
def train_and_score(clf, dataframes, categorical=None):
    """
    Given a classifier and a dict of datasets,
    train the classifier on the training dataset
    and score it on the transitive and testing datasets.
    """
    # Avoid a mutable default argument; None means "no categorical columns"
    categorical = categorical or []

    # Train on the training set; the last column holds the labels
    training = dataframes["tra"]
    Xtra = training.iloc[:, :-1]
    ytra = training.iloc[:, -1]
    Xtra = pd.get_dummies(Xtra, columns=categorical)
    clf.fit(Xtra, ytra)

    # Score on the transitive set
    transitive = dataframes["trs"]
    Xtrs = transitive.iloc[:, :-1]
    ytrs = transitive.iloc[:, -1].astype(str)
    Xtrs = pd.get_dummies(Xtrs, columns=categorical)
    transitive_score = clf.score(Xtrs, ytrs)

    # Score on the testing set
    testing = dataframes["tst"]
    Xtst = testing.iloc[:, :-1]
    ytst = testing.iloc[:, -1].astype(str)
    Xtst = pd.get_dummies(Xtst, columns=categorical)
    testing_score = clf.score(Xtst, ytst)
    return (transitive_score, testing_score)
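
One thing to watch in `train_and_score` is that `pd.get_dummies` is applied to each split separately, so a category that is absent from one split yields a different set of columns. The sketch below, on made-up values, shows one possible way to align a split to the training columns with `DataFrame.reindex`. It is a suggestion, not part of the project's code.

```python
import pandas as pd

# Made-up rows: column 0 is categorical, column 1 is numeric
train = pd.DataFrame({0: ["M", "F", "I"], 1: [0.45, 0.53, 0.33]})
test = pd.DataFrame({0: ["M", "M"], 1: [0.50, 0.41]})

X_train = pd.get_dummies(train, columns=[0])
X_test = pd.get_dummies(test, columns=[0])  # only the "M" indicator column appears

# Give the test matrix exactly the training columns;
# indicator columns for categories absent from the test split are filled with 0
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)
```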

Below is the main loop of the program, which trains all the classifiers at each labeling ratio and records the results.

In [7]:
# Columns in each dataset that are categorical and need to be one-hot encoded
categorical_columns = [[], [0]]
results = pd.DataFrame(columns=('classifier', 'dataset', 'labeling_rate', "transitive_accuracy", "testing_accuracy"))
for classifier in classifiers:
    print(classifier.name)
    for dataset_name, categorical in zip(dataset_names, categorical_columns):
        print("dataset:", dataset_name, "\t", end="")
        for labeling_rate in labeling_rates:
            print("#", end="")
            dataframes = load_datasets(dataset_name, labeling_rate)          
            transitive_score, testing_score = train_and_score(classifier, dataframes, categorical=categorical)
            results.loc[len(results.index)] = [classifier.name, dataset_name, labeling_rate, transitive_score, testing_score]
        print()
    print("--------")

TriTraining (KNN)
dataset: bupa 	####
dataset: abalone 	#

  check = self.filled(0).__eq__(other)
  score = y_true == y_pred


###
--------
TriTraining (NB)
dataset: bupa 	####
dataset: abalone 	####
--------
TriTraining (SVM)
dataset: bupa 	####
dataset: abalone 	####
--------
TriTraining (CART)
dataset: bupa 	####
dataset: abalone 	#

  check = self.filled(0).__eq__(other)
  score = y_true == y_pred


#

  check = self.filled(0).__eq__(other)
  score = y_true == y_pred


##
--------
Self-Training (KNN)
dataset: bupa 	####
dataset: abalone 	####
--------
Self-Training (NB)
dataset: bupa 	####
dataset: abalone 	####
--------
Self-Training (SVM)
dataset: bupa 	####
dataset: abalone 	####
--------
Self-Training (CART)
dataset: bupa 	####
dataset: abalone 	####
--------


In [8]:
results.head()

|   | classifier | dataset | labeling_rate | transitive_accuracy | testing_accuracy |
|---|---|---|---|---|---|
| 0 | TriTraining (KNN) | bupa | 10.0 | 0.645161 | 0.571429 |
| 1 | TriTraining (KNN) | bupa | 20.0 | 0.687097 | 0.6 |
| 2 | TriTraining (KNN) | bupa | 30.0 | 0.664516 | 0.485714 |
| 3 | TriTraining (KNN) | bupa | 40.0 | 0.690323 | 0.485714 |
| 4 | TriTraining (KNN) | abalone | 10.0 | 0.0 | 0.203349 |


The results are organized in a table similar to the one in the paper:

In [9]:
pd.pivot_table(results, values=None, index=['dataset', 'classifier'], columns=['labeling_rate'])

| dataset | classifier | transitive_accuracy (labeling_rate=10.0) | transitive_accuracy (20.0) | transitive_accuracy (30.0) | transitive_accuracy (40.0) | testing_accuracy (labeling_rate=10.0) | testing_accuracy (20.0) | testing_accuracy (30.0) | testing_accuracy (40.0) |
|---|---|---|---|---|---|---|---|---|---|
| abalone | Self-Training (CART) | 0.234321 | 0.292689 | 0.330845 | 0.266525 | 0.162679 | 0.215311 | 0.177033 | 0.181818 |
| abalone | Self-Training (KNN) | 0.222044 | 0.243597 | 0.278859 | 0.309701 | 0.169856 | 0.177033 | 0.191388 | 0.215311 |
| abalone | Self-Training (NB) | 0.135041 | 0.057097 | 0.077313 | 0.05677 | 0.126794 | 0.057416 | 0.064593 | 0.038278 |
| abalone | Self-Training (SVM) | 0.191353 | 0.191569 | 0.194615 | 0.195096 | 0.188995 | 0.188995 | 0.184211 | 0.184211 |
| abalone | TriTraining (CART) | 0.0 | 0.0 | 0.3751 | 0.435235 | 0.203349 | 0.222488 | 0.177033 | 0.203349 |
| abalone | TriTraining (KNN) | 0.0 | 0.251601 | 0.300986 | 0.332889 | 0.203349 | 0.191388 | 0.229665 | 0.239234 |
| abalone | TriTraining (NB) | 0.108887 | 0.107791 | 0.099174 | 0.114339 | 0.107656 | 0.124402 | 0.117225 | 0.141148 |
| abalone | TriTraining (SVM) | 0.214839 | 0.192369 | 0.215942 | 0.210554 | 0.208134 | 0.191388 | 0.200957 | 0.19378 |
| bupa | Self-Training (CART) | 0.658065 | 0.670968 | 0.625806 | 0.741935 | 0.685714 | 0.6 | 0.685714 | 0.657143 |
| bupa | Self-Training (KNN) | 0.616129 | 0.667742 | 0.670968 | 0.7 | 0.6 | 0.514286 | 0.542857 | 0.514286 |
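
For readers unfamiliar with `pd.pivot_table`, the sketch below applies the same call to a tiny frame of placeholder numbers (not project results) to show how the row and column indices are formed. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Placeholder values only, in the same layout as the `results` frame
toy = pd.DataFrame({
    "classifier": ["A", "A", "B", "B"],
    "dataset": ["d1"] * 4,
    "labeling_rate": [10, 20, 10, 20],
    "transitive_accuracy": [0.5, 0.6, 0.7, 0.8],
    "testing_accuracy": [0.4, 0.5, 0.6, 0.7],
})

# Rows are (dataset, classifier) pairs; columns are (metric, labeling_rate) pairs
table = pd.pivot_table(toy, values=None, index=["dataset", "classifier"],
                       columns=["labeling_rate"])

# A single cell is addressed by a row tuple and a column tuple
cell = table.loc[("d1", "B"), ("transitive_accuracy", 10)]
```

With `values=None`, every remaining numeric column becomes a top-level column group, and the default aggregation is the mean, which leaves the values unchanged when each cell has exactly one entry.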


------
[1]: Triguero I, García S, Herrera F (2015) Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl Inf Syst 42:245–284

[2]: Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, pp 189–196

[3]: Zhou ZH, Li M (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans Knowl Data Eng 17:1529–1541