# Neural Networks Project
## Self-labeled techniques for semi-supervised learning
<div style="text-align: right">
    Mark Laane <br />
    Rome, 2017
</div>

### Introduction ###
The aim of this project is to reimplement some of the techniques surveyed by Isaac Triguero et al. in [1] and to independently reproduce the reported results. A report of the project is also provided: [Project Report](Neural Networks Project Report Mark Laane.pdf)

Two self-labeled techniques are chosen from the paper: Standard Self-Training and Tri-Training. They are applied to the Abalone and Dermatology datasets. The implementation is written in Python and uses the Pandas and scikit-learn libraries.

### Datasets ###
Two standard datasets are used: Abalone and Dermatology. Both are loaded from mldata.org.

In [1]:
#Datasets are stored in a python dictionary
datasets = {}

from sklearn.datasets import fetch_mldata
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Fetch abalone dataset from mldata.org
data = fetch_mldata("abalone")
# Preprocessing pipe for abalone dataset encodes categorical feature
# and scales the features
preprocessing_pipe = make_pipeline(
    #OneHotEncoder on "Sex" feature
    OneHotEncoder(categorical_features=[0], sparse=False),
    #Scale all from 0 to 1
    MinMaxScaler())
# Apply preprocessing pipe to dataset and store the dataset in dict.
datasets["abalone"] = {
    "X": preprocessing_pipe.fit_transform(data.data),
    "y": data.target
}

In [2]:
#Dermatology dataset is loaded from mldata.org and used as-is
data = fetch_mldata("uci-20070111 dermatology")
datasets["dermatology"] = {
    "X": data.data[:,0:-1],
    "y": data.data[:,-1]
}
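
The preprocessing applied to the Abalone data (one-hot encoding of the categorical "Sex" feature, followed by scaling every feature to the range 0 to 1) can be illustrated without scikit-learn. The sketch below is only an illustration of the two steps: the helper names `one_hot_first_column` and `min_max_scale` and the toy rows are hypothetical and are not part of the project code.

```python
def one_hot_first_column(rows):
    """Replace the categorical first column with one indicator column per category."""
    categories = sorted({row[0] for row in rows})
    encoded = []
    for row in rows:
        indicators = [1.0 if row[0] == c else 0.0 for c in categories]
        encoded.append(indicators + [float(v) for v in row[1:]])
    return encoded


def min_max_scale(rows):
    """Scale every column linearly to the range [0, 1]."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical toy rows (sex, length, weight); not taken from the Abalone data.
toy_rows = [("M", 0.4, 10.0), ("F", 0.5, 20.0), ("I", 0.6, 30.0)]
toy_X = min_max_scale(one_hot_first_column(toy_rows))
```

Each row of `toy_X` has five columns: three indicator columns for the categories (in sorted order F, I, M) followed by the two scaled numeric features.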

### Base classifiers ###
Three different base classifiers are used: k-nearest neighbors (KNN), Gaussian naive Bayes (NB), and a CART decision tree. They are provided by the scikit-learn library and configured according to the parameters described in the paper.

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# Class that only holds a collection of different 
# base classifiers for usage with SSL methods.
class base_classifiers:
    KNN = KNeighborsClassifier(
        n_neighbors=3,
        metric="euclidean",
        #n_jobs=2  # Parallelize work on CPUs
    )
    NB = GaussianNB(
        priors=None
    )
    #SVM = SVC(
    #    C=1.0,
    #    kernel='poly',
    #    degree=1,
    #    tol=0.001,
    #)
    CART = DecisionTreeClassifier(
        criterion='entropy',
        # splitter='best',
        # max_depth=None,
        # min_samples_split=2,
        min_samples_leaf=2,
        # min_weight_fraction_leaf=0.0,
        # max_features=None,
        # random_state=None,
        # max_leaf_nodes=None,
        # min_impurity_split=1e-07,
        # class_weight=None,
        # presort=False,
    )

### Implemented self-labeled algorithms ###
Two algorithms are implemented: Standard Self-Training and Tri-Training.
#### Standard Self-Training ####
Implementation: [Standard Self-Training](standard_self_training.py)<br />
The implementation is based on the description of the algorithm in [2].
Training a Standard Self-Training classifier is an iterative process. The base classifier is first trained on the initial labeled samples. It is then used to label the unlabeled samples, and the classifier is retrained with the most confident predictions added to its training set. The process is repeated until the classifier output stabilizes.
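
The iterative process can be sketched in plain Python. This is a minimal illustration and not the code in `standard_self_training.py`: the 1-nearest-neighbour base classifier, the use of distance as the confidence measure, the `batch_size` parameter, and stopping once the unlabeled pool is empty are all assumptions made for the example.

```python
import math


def nearest_labeled(point, labeled):
    """Return (distance, label) of the labeled sample closest to `point`."""
    return min((math.dist(point, x), y) for x, y in labeled)


def self_train(labeled, unlabeled, batch_size=1):
    """
    Self-training with a 1-nearest-neighbour base classifier.

    In every round each pooled point is given the label of its nearest
    labeled neighbour; the `batch_size` points with the smallest distance
    (treated here as the most confident predictions) are moved to the
    labeled set. For 1-NN, growing the labeled set is the retraining step.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        # Tuples of (distance, predicted label, point), most confident first.
        scored = sorted(nearest_labeled(p, labeled) + (p,) for p in pool)
        for _dist, label, point in scored[:batch_size]:
            labeled.append((point, label))
            pool.remove(point)
    return labeled


# Hypothetical one-dimensional toy data: two labeled points, four unlabeled.
seed_samples = [((0.0,), "a"), ((10.0,), "b")]
unlabeled_points = [(1.0,), (2.0,), (9.0,), (8.0,)]
final_labels = dict(self_train(seed_samples, unlabeled_points))
```

Because newly labeled points immediately serve as neighbours for the next round, labels spread outward from the initial samples one batch at a time.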
#### Tri-Training ####
Implementation: [Tri-Training](tri_training.py)<br />
The implementation is based on the description of the algorithm in [3].
In Tri-Training, three base classifiers are trained on randomly subsampled sets of the labeled data. Each of them is then iteratively retrained on unlabeled samples that have been labeled by the other two base classifiers. The final prediction is made by majority voting among the three base classifiers.
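
Two building blocks of this scheme are easy to show in isolation: the majority vote used for prediction, and the selection of samples on which two classifiers agree so that they can be handed to the third. The sketch below is an illustration only and is not taken from `tri_training.py`; the function names are hypothetical, and the additional conditions that [3] places on accepting newly labeled samples are left out.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by most of the classifiers for one sample."""
    return Counter(votes).most_common(1)[0][0]


def agreed_pseudo_labels(preds_j, preds_k):
    """
    Select the unlabeled samples on which two classifiers predict the same
    label. The result is a list of (sample_index, label) pairs that could
    be offered to the third classifier as extra training data.
    """
    return [(i, a) for i, (a, b) in enumerate(zip(preds_j, preds_k)) if a == b]


# Hypothetical predictions of three classifiers on three unlabeled samples.
preds = [
    ["a", "b", "a"],  # classifier 0
    ["a", "b", "b"],  # classifier 1
    ["b", "b", "b"],  # classifier 2
]

# Majority vote per sample (columns of `preds`).
voted = [majority_vote(sample_votes) for sample_votes in zip(*preds)]

# Candidate training samples for classifier 0, from classifiers 1 and 2.
for_clf_0 = agreed_pseudo_labels(preds[1], preds[2])
```

Here classifiers 1 and 2 agree on the last two samples, so only those two would be considered as new training data for classifier 0.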

In total, six different classifiers are trained: two techniques, each with three different base classifiers.

In [4]:
from standard_self_training import StandardSelfTraining
from tri_training import TriTraining

# All classifiers used for testing
classifiers = [
    TriTraining("TriTraining (KNN)", base_classifiers.KNN),
    TriTraining("TriTraining (NB)", base_classifiers.NB),
    #TriTraining("TriTraining (SVM)", base_classifiers.SVM),
    TriTraining("TriTraining (CART)", base_classifiers.CART),
    StandardSelfTraining("Self-Training (KNN)", base_classifiers.KNN),
    StandardSelfTraining("Self-Training (NB)", base_classifiers.NB),
    #StandardSelfTraining("Self-Training (SVM)", base_classifiers.SVM),
    StandardSelfTraining("Self-Training (CART)", base_classifiers.CART)
]
labeling_rates = [0.10, 0.20, 0.30, 0.40]

### Training and scoring ###

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

def _training_scoring_iteration(clf, X, y, training_index, test_index, labeling_rate):
    """ 
    One iteration of fully training and scoring a 
    classifier on given data (one Kfold split)
    """
    # The test set is held out: 1/10 of the data
    X_test, y_test = X[test_index], y[test_index]

    # Split the training fold into a labeled part (labeling_rate of the fold)
    # and an unlabeled part, which doubles as the transductive set
    split_data = train_test_split(
        X[training_index],
        y[training_index],
        test_size=labeling_rate,
        random_state=42
    )
    (X_unlabeled, X_labeled, y_unlabeled, y_labeled) = split_data

    #Training set - 9/10 of data
    X_train = np.concatenate((X_labeled, X_unlabeled))
    y_train = np.concatenate((
        y_labeled.astype(str),
        np.full_like(y_unlabeled.astype(str), "unlabeled")
    ))
    
    #Train the classifier
    clf.fit(X_train, y_train)
    
    #Score the classifier
    transductive_score = clf.score(X_unlabeled, y_unlabeled.astype(str))
    testing_score = clf.score(X_test, y_test.astype(str))
    
    return transductive_score, testing_score
    
def train_and_score(clf, X, y, cv, labeling_rate):
    """
    Perform KFold cross-validation of a classifier on a given data
    and labelling rate
    """
    transductive_scores = []
    testing_scores = []
    for training_index, test_index in cv.split(X,y):
        transductive_score, testing_score = _training_scoring_iteration(clf, X, y, training_index, test_index, labeling_rate)
        
        transductive_scores.append(transductive_score)
        testing_scores.append(testing_score)
        print("#", end="")
    print()
    return {
        "trans_mean": np.mean(transductive_scores),
        "test_mean": np.mean(testing_scores),
        "trans_std": np.std(transductive_scores),
        "test_std": np.std(testing_scores)
    }
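
The way a labeling rate turns one training fold into a partly labeled training set can be shown on a small list of labels. This is a standard-library sketch under stated assumptions rather than the project code: `mask_labels` is a hypothetical helper, it rounds the number of kept labels with `round`, and it samples with `random` instead of using `train_test_split`. Only the `"unlabeled"` sentinel string is taken from the cell above.

```python
import random

UNLABELED = "unlabeled"  # same sentinel string as in the cell above


def mask_labels(y, labeling_rate, seed=42):
    """Keep round(labeling_rate * len(y)) labels and replace the rest with the sentinel."""
    n_labeled = round(labeling_rate * len(y))
    keep = set(random.Random(seed).sample(range(len(y)), n_labeled))
    # Labels are converted to strings so they can share a list with the sentinel.
    return [str(label) if i in keep else UNLABELED for i, label in enumerate(y)]


# Hypothetical class labels of one training fold.
y_fold = [3, 1, 2, 1, 3, 2, 1, 1, 2, 3]
y_masked = mask_labels(y_fold, labeling_rate=0.2)
```

With ten samples and a labeling rate of 0.2, two entries of `y_masked` keep their (stringified) label and the other eight carry the sentinel.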

In [6]:
from sklearn.model_selection import KFold
import pandas as pd

""" 
The main loop for testing 
all classifiers with 
all datasets and 
all labeling rates
"""
results = None
for classifier in classifiers:
    print(classifier.name)
    for dataset_name, dataset in datasets.items():
        print("dataset:", dataset_name, "\t")
        for labeling_rate in labeling_rates:
            print("rate:", labeling_rate, end=" ")

            test_info = { "classifier": classifier.name, "dataset":dataset_name, "labeling_rate":labeling_rate}
            # Folds are not shuffled, so random_state would have no effect here
            cv = KFold(n_splits=10)
            scores = train_and_score(classifier, dataset["X"], dataset["y"], cv, labeling_rate)

            if results is None:
                results = pd.DataFrame([{**test_info, **scores}])
            else:
                results.loc[len(results.index)] = {**test_info, **scores}
    print()
    print("--------")

TriTraining (KNN)
dataset: dermatology
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########
dataset: abalone
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########

--------
TriTraining (NB)
dataset: dermatology
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########
dataset: abalone
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########

--------
TriTraining (CART)
dataset: dermatology
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########
dataset: abalone
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########

--------
Self-Training (KNN)
dataset: dermatology
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########
dataset: abalone
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########

--------
Self-Training (NB)
dataset: dermatology
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########
dataset: abalone
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########

--------
Self-Training (CART)
dataset: dermatology
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########
dataset: abalone
rate: 0.1 ##########
rate: 0.2 ##########
rate: 0.3 ##########
rate: 0.4 ##########

--------


### Results ###

In [7]:
results

| | classifier | dataset | labeling_rate | test_mean | test_std | trans_mean | trans_std |
|---|---|---|---|---|---|---|---|
| 0 | TriTraining (KNN) | dermatology | 0.1 | 0.488889 | 0.060131 | 0.44708 | 0.036302 |
| 1 | TriTraining (KNN) | dermatology | 0.2 | 0.650375 | 0.070127 | 0.670488 | 0.028531 |
| 2 | TriTraining (KNN) | dermatology | 0.3 | 0.73521 | 0.083414 | 0.742178 | 0.021418 |
| 3 | TriTraining (KNN) | dermatology | 0.4 | 0.784309 | 0.085764 | 0.800408 | 0.028411 |
| 4 | TriTraining (KNN) | abalone | 0.1 | 0.184359 | 0.103033 | 0.185527 | 0.06267 |
| 5 | TriTraining (KNN) | abalone | 0.2 | 0.173551 | 0.10149 | 0.143877 | 0.094572 |
| 6 | TriTraining (KNN) | abalone | 0.3 | 0.155177 | 0.112807 | 0.203701 | 0.009568 |
| 7 | TriTraining (KNN) | abalone | 0.4 | 0.203248 | 0.050367 | 0.14352 | 0.094432 |
| 8 | TriTraining (NB) | dermatology | 0.1 | 0.304505 | 0.200854 | 0.303841 | 0.175848 |
| 9 | TriTraining (NB) | dermatology | 0.2 | 0.133559 | 0.102494 | 0.131341 | 0.088101 |


In [8]:
pd.pivot_table(results, values=None, index=['dataset', 'classifier'], columns=['labeling_rate'])

Column headers give the score followed by the labeling rate in parentheses.

| dataset | classifier | test_mean (0.1) | test_mean (0.2) | test_mean (0.3) | test_mean (0.4) | test_std (0.1) | test_std (0.2) | test_std (0.3) | test_std (0.4) | trans_mean (0.1) | trans_mean (0.2) | trans_mean (0.3) | trans_mean (0.4) | trans_std (0.1) | trans_std (0.2) | trans_std (0.3) | trans_std (0.4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| abalone | Self-Training (CART) | 0.177636 | 0.188643 | 0.19774 | 0.200624 | 0.04812 | 0.052186 | 0.042104 | 0.049635 | 0.199894 | 0.198384 | 0.20203 | 0.196249 | 0.0089 | 0.010321 | 0.014942 | 0.009393 |
| abalone | Self-Training (KNN) | 0.215711 | 0.21068 | 0.209482 | 0.202054 | 0.063134 | 0.063786 | 0.056126 | 0.058924 | 0.21748 | 0.216107 | 0.213241 | 0.205958 | 0.009506 | 0.008401 | 0.009116 | 0.010749 |
| abalone | Self-Training (NB) | 0.08043 | 0.085939 | 0.070136 | 0.067502 | 0.027688 | 0.034251 | 0.027342 | 0.027958 | 0.076111 | 0.0826 | 0.076694 | 0.067753 | 0.017747 | 0.01156 | 0.016192 | 0.01352 |
| abalone | TriTraining (CART) | 0.189849 | 0.20157 | 0.172337 | 0.16444 | 0.056379 | 0.057807 | 0.074017 | 0.068343 | 0.179912 | 0.177374 | 0.197811 | 0.198289 | 0.060278 | 0.059797 | 0.013502 | 0.013589 |
| abalone | TriTraining (KNN) | 0.184359 | 0.173551 | 0.155177 | 0.203248 | 0.103033 | 0.10149 | 0.112807 | 0.050367 | 0.185527 | 0.143877 | 0.203701 | 0.14352 | 0.06267 | 0.094572 | 0.009568 | 0.094432 |
| abalone | TriTraining (NB) | 0.073991 | 0.08309 | 0.100075 | 0.104854 | 0.039526 | 0.059929 | 0.047361 | 0.037573 | 0.043179 | 0.051639 | 0.05366 | 0.071561 | 0.043297 | 0.051677 | 0.05385 | 0.046911 |
| dermatology | Self-Training (CART) | 0.734234 | 0.882658 | 0.887913 | 0.90991 | 0.084154 | 0.093092 | 0.076882 | 0.062707 | 0.776031 | 0.901316 | 0.902362 | 0.932149 | 0.084967 | 0.033661 | 0.023257 | 0.023461 |
| dermatology | Self-Training (KNN) | 0.49467 | 0.678003 | 0.724324 | 0.765165 | 0.053112 | 0.08659 | 0.072453 | 0.073864 | 0.453488 | 0.673888 | 0.732635 | 0.796865 | 0.036719 | 0.027315 | 0.025003 | 0.032399 |
| dermatology | Self-Training (NB) | 0.28506 | 0.196396 | 0.196396 | 0.196396 | 0.081652 | 0.069305 | 0.069305 | 0.069305 | 0.260744 | 0.194363 | 0.215726 | 0.22393 | 0.054078 | 0.012955 | 0.0409 | 0.035261 |
| dermatology | TriTraining (CART) | 0.821922 | 0.87973 | 0.888063 | 0.909835 | 0.090365 | 0.084876 | 0.084047 | 0.063917 | 0.800315 | 0.908905 | 0.899744 | 0.935179 | 0.055557 | 0.026136 | 0.027704 | 0.02232 |


------
[1]: Isaac Triguero et al. "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study", 2015

[2]: Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, pp 189–196

[3]: Zhou ZH, Li M (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans Knowl Data Eng 17:1529–1541