# Notebook 4: Classification

This notebook performs the classification task using the machine learning models presented in section 3.6 of the thesis.
The datasets created by the ensemble feature selection method of notebook #3 serve as input. Both the default and the parameter-tuned models are evaluated.
Based on the performance of the different classifiers, an ensemble semi-supervised self-training approach is applied that combines the Random Forest, SVM, XGBoost and Deep Neural Network models.

In section 4.1 of this notebook, the data is loaded. Section 4.2 then compares the performance of the default classifiers with that of their tuned counterparts. Based on this comparison, 4.3 defines the classifier functions that 4.4 uses to construct the self-training method. Finally, 4.5 classifies the remaining documents and 4.6 exports the labels of the unlabeled corpus.
Optionally, the class distributions of the preliminarily labeled dataset, the newly labeled dataset and the entire document corpus can be computed.

The results are reported in section 4.3.2 of the thesis.

Table of Contents:
* [4.1 Loading data](#load)
* [4.2 Evaluating the performances of the different models](#scores)
* [4.3 Functions for successive usage of best models for training and predicting](#models)
* [4.4 Iterative ensemble self-training](#training)
* [4.5 Labeling of remaining documents](#remaining)
* [4.6 Export final predictions](#export)
* [Optional: Exploring class distributions](#dist)

In [None]:
# import modules
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from collections import Counter
from scipy.sparse import vstack
import string
from copy import deepcopy
from tqdm import tqdm_notebook
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, RepeatedStratifiedKFold, cross_val_score, cross_validate, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB

import keras
from keras.models import Sequential, Model
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Bidirectional, concatenate, Dropout
from keras.layers import Input, Dense, Embedding, Flatten, Conv1D, MaxPooling1D, MaxPooling2D
from keras.utils import to_categorical # same function the cells below call as keras.utils.to_categorical
from keras.callbacks import EarlyStopping
from keras import Input

from imblearn.combine import SMOTETomek

from xgboost import XGBClassifier

# define function for saving objects (e.g. save_pickle(contracts_labeled, 'Pickles/contracts_labeled.pickle'))
def save_pickle(objectname, picklename):
    with open(picklename, "wb") as pickle_out: # context manager closes the file even if dumping fails
        pickle.dump(objectname, pickle_out)
    print(picklename, 'successfully pickled.')
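
The `save_pickle` helper only covers writing. A matching reader could look like the following minimal sketch. `load_pickle` is a hypothetical name that does not appear in the notebook, and the temporary file merely stands in for the real pickle paths. Unpickling can execute code stored in the file, so this should only be applied to files you created yourself.

```python
import os
import pickle
import tempfile

def load_pickle(picklename):
    """Hypothetical counterpart to save_pickle: read one object back from disk."""
    with open(picklename, "rb") as pickle_in:
        return pickle.load(pickle_in)

# round trip through a temporary file (stand-in for the real pickle paths)
example_labels = [0, 3, 6, 10]
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.pickle")
    with open(example_path, "wb") as pickle_out:
        pickle.dump(example_labels, pickle_out)
    restored_labels = load_pickle(example_path)

print(restored_labels)
```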

# 4.1 Loading data <a id="load"></a>

In [None]:
# load labeled_X
# load labeled_y
# load unlabeled_X
    
# check for correct lengths
print(labeled_X.shape, type(labeled_X))
print(len(labeled_y), type(labeled_y))
print(unlabeled_X.shape, type(unlabeled_X))

# 4.2 Evaluating the performances of the different models <a id="scores"></a>

In [None]:
# construct 5 shuffled and stratified folds for evaluating classifier performance
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# function to plot the confusion matrix
def plot_cm(cm):
    df_cm = pd.DataFrame(cm, range(len(set(labeled_y))), range(len(set(labeled_y))))
    df_cm.index.name = 'True'
    df_cm.columns.name = 'Predicted'
    ax = plt.axes()
    sns.set(font_scale=1.4) # for label size
    sns.heatmap(df_cm, annot=True, annot_kws={"size": 10}, ax = ax, fmt='g') # font size
    plt.show()
    
# function to return the name of object
def namestr(obj):
    return [name for name in globals() if globals()[name] is obj]

In [None]:
# Include dummy classifier as baseline model

dummy_clf = DummyClassifier(strategy="most_frequent")
f1_w = list()
dummy_cm = np.zeros((7, 7))

for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
    train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
    train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
    clf_deep = deepcopy(dummy_clf)
    clf_new = clf_deep.fit(train_X, train_y)
    pred_new = clf_new.predict(test_X)
    f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
    d_cm = confusion_matrix(test_y, pred_new)
    dummy_cm += d_cm

print('F1 weighted:', np.mean(f1_w))
print('Single folds:', f1_w)
plot_cm(dummy_cm)
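
To make the reported metric concrete, the following plain-Python sketch recomputes a support-weighted F1 score by hand and applies it to a most-frequent baseline on a made-up label vector. `weighted_f1` is a hypothetical helper written for illustration. It is meant to mirror what `f1_score(..., average='weighted')` reports, but it is not scikit-learn's implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score

# made-up labels: class 0 is the most frequent one
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0] * len(toy_true)  # what a most-frequent baseline would predict
baseline_f1 = weighted_f1(toy_true, toy_pred)
print(round(baseline_f1, 4))
```

Only the majority class contributes to the score here, which is why such a baseline sets a low bar on imbalanced data.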

In [None]:
# Multinomial naive Bayes (MNB) model: default and tuned

mnb_default = MultinomialNB()
mnb_best = MultinomialNB(alpha=0.01, fit_prior=True)

mnb = [mnb_default, mnb_best]

for i in range(len(mnb)): # evaluate default and optimized classifier 
    print(namestr(mnb[i]))
    f1_w = list()
    naive_bayes_cm = np.zeros((7, 7))

    for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
        train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
        train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
        clf_deep = deepcopy(mnb[i])
        clf_new = clf_deep.fit(train_X, train_y)
        pred_new = clf_new.predict(test_X)
        f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
        n0_cm = confusion_matrix(test_y, pred_new)
        naive_bayes_cm += n0_cm

    print('F1 weighted:', np.mean(f1_w))
    print(f1_w)
    plot_cm(naive_bayes_cm)
    print()
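
The cross-validation loop in these cells is repeated for every classifier. One possible way to factor it out is sketched below. `evaluate_cv` is a hypothetical helper, the toy data comes from `make_classification`, and `sklearn.base.clone` is swapped in for `deepcopy` to obtain a fresh, unfitted estimator. Passing `labels` to `confusion_matrix` keeps the matrix shape fixed even if a fold happens to miss a class.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold

def evaluate_cv(clf, X, y, cv, n_classes):
    """Return per-fold weighted F1 scores and the summed confusion matrix."""
    scores = []
    total_cm = np.zeros((n_classes, n_classes))
    for train_ix, test_ix in cv.split(X, y):
        model = clone(clf).fit(X[train_ix], y[train_ix])  # unfitted copy per fold
        pred = model.predict(X[test_ix])
        scores.append(f1_score(y[test_ix], pred, average='weighted',
                               labels=np.unique(y[test_ix]), zero_division=0))
        total_cm += confusion_matrix(y[test_ix], pred, labels=np.arange(n_classes))
    return scores, total_cm

# toy data only; the notebook uses labeled_X and labeled_y with 7 classes
X_toy, y_toy = make_classification(n_samples=120, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_toy, cm_toy = evaluate_cv(DummyClassifier(strategy="most_frequent"),
                                 X_toy, y_toy, cv_toy, n_classes=3)
print('F1 weighted:', np.mean(scores_toy))
```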

In [None]:
# Support Vector Machine (SVM) model: default and tuned

svc_default = SVC(kernel = 'rbf', class_weight='balanced', decision_function_shape='ovo')
svc_best = SVC(C=4, class_weight='balanced', decision_function_shape='ovo', gamma='auto', kernel='rbf',                   
               max_iter=-1, random_state=42, shrinking=True, tol=0.001)

svc = [svc_default, svc_best]

for i in range(len(svc)): # evaluate default and optimized classifier 
    print(namestr(svc[i]))
    f1_w = list()
    svc_cm = np.zeros((7, 7))

    for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
        train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
        train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
        clf_deep = deepcopy(svc[i])
        clf_new = clf_deep.fit(train_X, train_y)
        pred_new = clf_new.predict(test_X)
        f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
        s_cm = confusion_matrix(test_y, pred_new)
        svc_cm += s_cm

    print('F1 weighted:', np.mean(f1_w))
    print(f1_w)
    plot_cm(svc_cm) 
    print()
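
Both SVM variants use `class_weight='balanced'`. According to the scikit-learn documentation, this weights each class by `n_samples / (n_classes * count)`, so rare classes receive proportionally larger weights. The sketch below reproduces that formula in plain Python on made-up labels. `balanced_class_weights` is a hypothetical helper for illustration only.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# made-up labels: class 1 is three times rarer than class 0
toy_weights = balanced_class_weights([0, 0, 0, 1])
print(toy_weights)
```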

In [None]:
# Random Forest (RF) model: default and tuned

rf_default = RandomForestClassifier(n_jobs=-1, random_state = 42, class_weight = 'balanced')
rf_best = RandomForestClassifier(bootstrap=True, class_weight='balanced', criterion='gini',
                                 max_depth=90, max_features='sqrt', max_samples=0.3, # 'sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases no longer accept 'auto'
                                 min_samples_leaf=2, min_samples_split=10, n_estimators=130,
                                 n_jobs=-1, random_state=42)

rf = [rf_default, rf_best]

for i in range(len(rf)): # evaluate default and optimized classifier 
    print(namestr(rf[i]))
    f1_w = list()
    rfc_cm = np.zeros((7, 7))

    for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
        train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
        train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
        clf_deep = deepcopy(rf[i])
        clf_new = clf_deep.fit(train_X, train_y)
        pred_new = clf_new.predict(test_X)
        f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
        rf_cm = confusion_matrix(test_y, pred_new)
        rfc_cm += rf_cm

    print('F1 weighted:', np.mean(f1_w))
    print(f1_w)
    plot_cm(rfc_cm)
    print()

In [None]:
# XGBoost (XGB) model: default and tuned

class MyXGBClassifier(XGBClassifier):
    @property
    def coef_(self):
        return None

xgb_default = MyXGBClassifier(booster = 'gbtree', objective= 'multi:softmax', num_class = 7)
xgb_best = MyXGBClassifier(learning_rate=0.3, booster = 'gbtree', objective= 'multi:softmax', num_class = 7, 
                           max_depth = 100, min_child_weight = 1, gamma = 0.2, scale_pos_weight = 1, reg_alpha = 0.25, 
                           reg_lambda = 1.5, eval_metric = 'merror', n_estimators=300, random_state = 42, n_jobs = -1)   

xgb = [xgb_default, xgb_best]

for i in range(len(xgb)): # evaluate default and optimized classifier 
    print(namestr(xgb[i]))
    f1_w = list()
    xgb_cm = np.zeros((7, 7))

    for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
        train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
        train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
        clf_deep = deepcopy(xgb[i])
        clf_new = clf_deep.fit(train_X, train_y)
        pred_new = clf_new.predict(test_X)
        f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
        gb_cm = confusion_matrix(test_y, pred_new)
        xgb_cm += gb_cm

    print('F1 weighted:', np.mean(f1_w))
    print(f1_w)
    plot_cm(xgb_cm)
    print()

In [None]:
# Deep Neural Network (DNN) model: basic and tuned

# define function for plotting the training history of neural network    
def plot_history(history):
    fig, (ax1, ax2) = plt.subplots(2, sharex=True,figsize=(10, 6))
    fig.suptitle('Model history')
    ax1.plot(history.history['categorical_accuracy'])
    ax1.plot(history.history['val_categorical_accuracy'])
    ax1.legend(['train', 'test'], loc='upper left')
    ax1.set_ylabel('categorical accuracy')

    ax2.plot(history.history['loss'])
    ax2.plot(history.history['val_loss'])
    ax2.set_xlabel('epoch')
    ax2.set_ylabel('loss')
    plt.show()

# ----------------------------------------------------------------------------------------------------------

# Basic NN: only input and output layer

# function to retrieve the NN architecture 
def get_default_NNmodel():
    model = keras.Sequential()
    model.add(keras.layers.Dense(614, kernel_initializer=keras.initializers.he_normal(seed=1), activation='relu', input_dim=614))
    model.add(keras.layers.Dense(7, kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=4), activation='softmax'))
    model.compile(optimizer='adam', # setting adaptive learning rate for gradient descent 
                  loss='categorical_crossentropy',
                  metrics=[keras.metrics.CategoricalAccuracy(), #metrics.Accuracy(),
                           keras.metrics.AUC()]) #by default ROC curve
    print(model.summary())
    return model

# evaluate performance
print('Basic DNN \n')
f1_w = list()
NN_cm = np.zeros((7, 7))

for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
    train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
    train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
    train_yy = keras.utils.to_categorical(train_y)
    
    model = get_default_NNmodel() #retrieve neural network architecture
    history = model.fit(train_X, train_yy, epochs=500, batch_size=512, validation_split=0.1, callbacks=[EarlyStopping(monitor='val_categorical_accuracy', min_delta=0.001, patience=3, restore_best_weights = True)])
    pred_y = model.predict(test_X, batch_size=64, verbose=1)
    pred_new = np.argmax(pred_y, axis=1)
    
    f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
    N_cm = confusion_matrix(test_y, pred_new)
    NN_cm += N_cm
    plot_history(history)

print('F1 weighted:', np.mean(f1_w))
print(f1_w)   
plot_cm(NN_cm)

 
# ----------------------------------------------------------------------------------------------------------

# Tuned DNN: input and output layer with two hidden layers in-between

# function to retrieve the DNN architecture 
def get_NNmodel():
    model = keras.Sequential()
    model.add(keras.layers.Dense(614, kernel_initializer=keras.initializers.he_normal(seed=1), activation='relu', input_dim=614))
    model.add(keras.layers.Dropout(0.5))
    model.add(keras.layers.Dense(307, kernel_initializer=keras.initializers.he_normal(seed=2), activation='relu'))
    model.add(keras.layers.Dropout(0.4))
    model.add(keras.layers.Dense(153, kernel_initializer=keras.initializers.he_normal(seed=3), activation='relu'))
    model.add(keras.layers.Dropout(0.3))
    model.add(keras.layers.Dense(7, kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=4), activation='softmax'))
    model.compile(optimizer='adam', # setting adaptive learning rate for gradient descent 
                  loss='categorical_crossentropy',
                  metrics=[keras.metrics.CategoricalAccuracy(),
                           keras.metrics.AUC()])
    print(model.summary())
    return model

    
# evaluate performance
print('Tuned DNN \n')
f1_w = list()
NN_cm = np.zeros((7, 7))

for train_ix, test_ix in kfold.split(labeled_X, labeled_y):
    train_X, test_X = labeled_X[train_ix], labeled_X[test_ix]
    train_y, test_y = labeled_y[train_ix], labeled_y[test_ix]
    
    # usage of resampling methods
    smt = SMOTETomek(random_state=42)
    X_res, y_res = smt.fit_resample(train_X, train_y)
    train_yy = keras.utils.to_categorical(y_res)
    
    model = get_NNmodel() #retrieve neural network architecture
    history = model.fit(X_res, train_yy, epochs=500, batch_size=512, validation_split=0.1, callbacks=[EarlyStopping(monitor='val_categorical_accuracy', min_delta=0.001, patience=3, restore_best_weights = True)]) # min_delta=0.00001
    pred_y = model.predict(test_X, batch_size=64, verbose=1)
    pred_new = np.argmax(pred_y, axis=1)
    
    f1_w.append(f1_score(test_y, pred_new, average = 'weighted', labels=np.unique(test_y)))
    N_cm = confusion_matrix(test_y, pred_new)
    NN_cm += N_cm
    plot_history(history)

print('F1 weighted:', np.mean(f1_w))
print(f1_w)   
plot_cm(NN_cm)
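
Both networks stop training via `EarlyStopping` with `min_delta=0.001` and `patience=3`. The following plain-Python sketch shows how such a patience rule behaves on a made-up validation accuracy curve. It is a simplified model of the idea, not Keras' implementation, which has further options such as `mode` and `baseline`. `simulate_early_stopping` is a hypothetical name.

```python
def simulate_early_stopping(values, min_delta=0.001, patience=3):
    """Return (stop_epoch, best_epoch) for a metric that is expected to increase."""
    best_value = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, value in enumerate(values):
        if value - min_delta > best_value:  # gain larger than min_delta counts as improvement
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:  # no sufficient gain for `patience` epochs in a row
                return epoch, best_epoch
    return len(values) - 1, best_epoch

# made-up curve: tiny gains after epoch 1 stay below min_delta
val_accuracy = [0.50, 0.60, 0.6005, 0.6002, 0.6004, 0.70]
stop_epoch, best_epoch = simulate_early_stopping(val_accuracy)
print(stop_epoch, best_epoch)
```

In this toy run the later jump to 0.70 is never reached, which illustrates the trade-off of a short patience. With `restore_best_weights=True`, the weights from the best epoch would be the ones kept.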

# 4.3 Functions for successive usage of best models for training and predicting <a id="models"></a>

In [None]:
# Function for successive usage of best SVM model

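def svc_model(data_X, data_Y, unlabeled_X):
    svc = deepcopy(svc_best)
    # .toarray() returns a plain ndarray; .todense() returns np.matrix, which recent scikit-learn versions reject
    svc_trained = svc.fit(data_X.toarray(), data_Y)
    pred_svc = svc_trained.predict(unlabeled_X.toarray())
    print('SVC completed.')
    return pred_svc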

In [None]:
# Function for successive usage of best RF model

def rf_model(data_X, data_Y, unlabeled_X):
    rf = deepcopy(rf_best)
    rf_trained = rf.fit(data_X, data_Y)
    pred_rf = rf_trained.predict(unlabeled_X)
    print('Random Forest completed.')
    return pred_rf

In [None]:
# Function for successive usage of best XGB model

class MyXGBClassifier(XGBClassifier):
    @property
    def coef_(self):
        return None
    
def xgb_model(data_X, data_Y, unlabeled_X):
    xgb = deepcopy(xgb_best)
    # .toarray() yields a plain ndarray instead of the np.matrix returned by .todense()
    xgb_trained = xgb.fit(data_X.toarray(), data_Y)
    pred_xgb = xgb_trained.predict(unlabeled_X.toarray())
    print('XGBoost completed.')
    return pred_xgb

In [None]:
# Function for successive usage of best DNN model

def NN_model(data_X, data_Y, unlabeled_X):
    #create (stratified) training and test set for constructing the neural network
    X_train, X_test, y_train, y_test = train_test_split(data_X, data_Y, test_size=0.15, shuffle = True, stratify = data_Y, random_state=42)
    yy_test = keras.utils.to_categorical(y_test)

    # usage of resampling methods
    smt = SMOTETomek(random_state=42)
    X_res, y_res = smt.fit_resample(X_train, y_train)
    yy_train = keras.utils.to_categorical(y_res)

    model = get_NNmodel() #retrieve neural network architecture
    history = model.fit(X_res, yy_train, epochs=500, batch_size=512, validation_data=(X_test, yy_test), callbacks=[EarlyStopping(monitor='val_categorical_accuracy', min_delta=0.001, patience=3, restore_best_weights = True)])
    plot_history(history) #plot training history

    pred_NN = model.predict(unlabeled_X)
    pred_FNN = pred_NN.argmax(axis=1)
    print('Neural Network completed.')
    return pred_FNN

# 4.4 Iterative ensemble self-training <a id="training"></a>

In [None]:
# create copies of datasets for transformation
lab_X = deepcopy(labeled_X)
lab_y = deepcopy(labeled_y)
unlab_X = deepcopy(unlabeled_X)

text_labels = ['Agreement', 'Amendment', 'Attachment', 'LOI', 'NDA', 'Offer', 'SOW']
unlabeled_labels = [10] * unlab_X.shape[0] # '10' as placeholder for final labels
training_iterations = 1 # initialize counter of training iterations

c = 1001 # initialize counter of new assignments above the threshold so that the loop is entered
fraction_unlabeled = unlab_X.shape[0] / (unlab_X.shape[0]+lab_X.shape[0]) # initialize fraction of unlabeled documents in the entire corpus

# run while a) > 1000 new assignments & b) unlabeled corpus > 5% of entire corpus
while (c > 1000) and (fraction_unlabeled > 0.05):
    print('Iteration', training_iterations, '\n') # print current iteration
    c = 0 # reset counter of unambiguous assignments
    
    # train models on labeled corpus and predict unlabeled documents 
    pred_rf = rf_model(lab_X, lab_y, unlab_X)
    pred_svc = svc_model(lab_X, lab_y, unlab_X)
    pred_NN = NN_model(lab_X, lab_y, unlab_X)
    pred_xgb = xgb_model(lab_X, lab_y, unlab_X)
    
    # check for correct number of predictions 
    print('Number of RF predictions:', len(pred_rf))
    print('Number of SVM predictions:', len(pred_svc))
    print('Number of DNN predictions:', len(pred_NN))
    print('Number of XGB predictions:', len(pred_xgb), '\n')
    
    indices = list() # reset index of newly labeled documents
    predicted_labels = list()  # reset labels of newly labeled documents

    for i in range(len(pred_rf)): # iterate through all unlabeled documents
        guess_rf = pred_rf[i] #access random forest prediction for this document
        guess_svc = pred_svc[i] #access SVM prediction for this document
        guess_NN = pred_NN[i] #access deep neural network prediction for this document
        guess_xgb = pred_xgb[i] #access XGBoost prediction for this document
        predictions_all = [guess_rf, guess_svc, guess_NN, guess_xgb] # combine all predictions for this document
        x = len(set(predictions_all)) #evaluate number of different predictions 
        if x == 1: # if unanimous assignment
            indices.append(i) # add index of this document
            predicted_labels.append(predictions_all[0]) #add label of this document
            c += 1 #increase counter of unanimous assignments
    
    # print number and share of unanimous assignments
    print('{} unanimous assignments (out of {}) completed -> {} missing. \n'.format(c, len(pred_rf), len(pred_rf)-c))

    #evaluate the distribution of new labels
    n_pred = dict(sorted(dict(Counter(predicted_labels)).items())) #aggregate list of new labels
    for k, v in n_pred.items():
        print('{}: {} assignments ({}%)'.format(text_labels[k], v, round(v/len(predicted_labels) *100, 2))) # print absolute and relative frequency of new assignments
    print()
        
    new_labeled = unlab_X[indices] #access the newly assigned documents
    print('Shape of newly assigned documents:', new_labeled.shape) #check the shape of newly assigned documents

    #update corpora
    lab_X = vstack([lab_X, new_labeled]) # add newly assigned documents to labeled corpus
    lab_y = np.append(lab_y, np.array(predicted_labels)) # add new labels to labels collection
    indices_set = set(indices) # set membership keeps the filter below linear in the corpus size
    p = [i for i in range(unlab_X.shape[0]) if i not in indices_set] # access indexes which were not newly assigned -> remain unlabeled
    unlab_X = unlab_X[p] # update unlabeled corpus
    # check corpora for correct lengths
    print('Shape of updated labeled corpus', lab_X.shape)
    print('Length of updated labels collection', lab_y.shape)
    print('Shape of updated unlabeled corpus', unlab_X.shape, '\n')
    
    # update label distribution 
    print('Distribution of labels before:', Counter(unlabeled_labels)) #print distribution of labels before updating
    unlabeled_counter = -1
    new_labels_counter = 0
    for i in range(len(unlabeled_labels)): #iterate through all labels
        if new_labels_counter == len(indices): #exit loop if labels of all newly assigned documents were updated (also covers an iteration without new assignments)
            break
        if unlabeled_labels[i] == 10: # check if document is still unlabeled
            unlabeled_counter +=1 # increase unlabeled counter
            if unlabeled_counter == indices[new_labels_counter]: # check for correct index to update
                unlabeled_labels[i] = predicted_labels[new_labels_counter] # update index to the assigned label
                new_labels_counter +=1 # increase counter of new labels
    print('Distribution of labels afterwards:', Counter(unlabeled_labels)) #print distribution of labels after updating
       
    training_iterations += 1 #update number of training iterations
    fraction_unlabeled = unlab_X.shape[0] / (unlab_X.shape[0]+lab_X.shape[0]) #compute share of unlabeled documents of the entire corpus
    
    print('------------------------------------------------------------------------------ \n')
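
The loop above can be condensed into the following plain-Python sketch, under the assumption that each model is a function with the same `(data_X, data_Y, unlabeled_X)` signature as in 4.3. The two toy classifiers, the one-dimensional data and the names `self_train` and `remaining` are hypothetical. Instead of counting placeholders, this variant tracks the original position of every unlabeled document, which is an alternative bookkeeping choice rather than the notebook's method.

```python
def centroid_model(data_X, data_y, pool):
    """Toy classifier: assign each value to the class with the nearest mean."""
    by_class = {}
    for x, y in zip(data_X, data_y):
        by_class.setdefault(y, []).append(x)
    centroids = {y: sum(xs) / len(xs) for y, xs in by_class.items()}
    return [min(centroids, key=lambda y: abs(x - centroids[y])) for x in pool]

def neighbour_model(data_X, data_y, pool):
    """Toy classifier: copy the label of the nearest labeled value."""
    return [min(zip(data_X, data_y), key=lambda pair: abs(x - pair[0]))[1] for x in pool]

def self_train(models, lab_X, lab_y, unlab_X, min_new=0, min_fraction=0.05, placeholder=10):
    lab_X, lab_y = list(lab_X), list(lab_y)
    final_labels = [placeholder] * len(unlab_X)
    remaining = list(range(len(unlab_X)))  # original positions that are still unlabeled
    n_new = min_new + 1  # forces the first pass, like c = 1001 in the cell above
    fraction = len(remaining) / (len(remaining) + len(lab_X))
    while n_new > min_new and fraction > min_fraction:
        pool = [unlab_X[i] for i in remaining]
        votes_per_model = [model(lab_X, lab_y, pool) for model in models]
        agreed = {}  # position in pool -> unanimous label
        for pos, votes in enumerate(zip(*votes_per_model)):
            if len(set(votes)) == 1:
                agreed[pos] = votes[0]
        for pos, label in agreed.items():
            final_labels[remaining[pos]] = label  # write back via the original position
            lab_X.append(pool[pos])
            lab_y.append(label)
        remaining = [orig for pos, orig in enumerate(remaining) if pos not in agreed]
        n_new = len(agreed)
        fraction = len(remaining) / (len(remaining) + len(lab_X))
    return final_labels, remaining

final_labels, remaining = self_train(
    [centroid_model, neighbour_model],
    lab_X=[0.0, 1.0, 10.0], lab_y=[0, 0, 1],
    unlab_X=[2.0, 9.0, 5.4],
)
print(final_labels, remaining)
```

In this toy run the two models agree on the first two values and keep disagreeing on the third, so the loop ends after a pass without new assignments and one document keeps the placeholder.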

# 4.5 Labeling of remaining documents <a id="remaining"></a>

In [None]:
# access the predictions of the models for the remaining documents
pred_rf_unlabeled = pred_rf[p]
pred_svc_unlabeled = pred_svc[p]
pred_NN_unlabeled = pred_NN[p]
pred_xgb_unlabeled = pred_xgb[p]

#check for correct length
print(len(pred_rf_unlabeled))
print(len(pred_svc_unlabeled))
print(len(pred_NN_unlabeled))
print(len(pred_xgb_unlabeled))

In [None]:
# assigning classes to the remaining documents by majority voting or, in case of a draw, by the prediction of the best performing classifier (guess_svc)

amount_pred = list()
final_pred = list()
counter_31 = 0
counter_22 = 0

for i in range(len(pred_rf_unlabeled)): #access each remaining unlabeled document and load its predictions from the models
    guess_rf = pred_rf_unlabeled[i]
    guess_svc = pred_svc_unlabeled[i]
    guess_NN = pred_NN_unlabeled[i]
    guess_xgb = pred_xgb_unlabeled[i]
    predictions_all = [guess_rf, guess_svc, guess_NN, guess_xgb] #combine predictions
    
    x = len(set(predictions_all)) #assess number of different predictions
    amount_pred.append(x) # add number of different predictions to list
    
    #evaluate final label based on number of predictions
    if x == 1:
        print('Mistake.') # print 'Mistake' because unanimous assignments should have taken place in the iterative self-training
        continue
    
    if x == 2:
        c = Counter(predictions_all)
        if c.most_common(1)[0][1] == 3: # if predictions were 3:1
            pred_class = c.most_common(1)[0][0]
            final_pred.append(pred_class) # assign class by majority voting
            counter_31 += 1 #increase counter
            continue
        if c.most_common(1)[0][1] == 2: #if predictions were 2:2
            final_pred.append(guess_svc) # assign class by guess_svc as best-performing model
            counter_22 += 1 #increase counter
            continue
    
    if x == 3:
        c = Counter(predictions_all)
        pred_class = c.most_common(1)[0][0] # assign class by majority voting
        final_pred.append(pred_class)
        continue
        
    if x == 4:
        final_pred.append(guess_svc) # assign class by guess_svc as best-performing model
        continue

# print the distribution of number of different predictions
print('Number of distinct predictions per document (e.g. 3 means a 2:1:1 split of the 4 predictions, so a majority class exists):\n{} \n'.format(Counter(amount_pred)))
print('For 2 different predictions: \n{} times 3:1 and {} times 2:2 \n'.format(counter_31, counter_22))

n_pred = dict(sorted(dict(Counter(final_pred)).items())) # assess the class distribution of remaining documents
for k, v in n_pred.items():
    print('{}: {} assignments ({}%)'.format(text_labels[k], v, round(v/len(final_pred) *100, 2))) # print distribution of final assignments
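
The case distinction above (3:1, 2:2, 2:1:1 and four different votes) can also be expressed as one rule: take the strict majority if there is one, otherwise fall back to a designated model. The sketch below illustrates this under the assumption that votes are ordered `[rf, svc, NN, xgb]`, so position 1 holds the SVM prediction. `resolve_vote` is a hypothetical helper. Unlike the cell above, it also returns a label for unanimous votes instead of skipping them.

```python
from collections import Counter

def resolve_vote(votes, tiebreak_position=1):
    """Return the strict majority label, or the vote at tiebreak_position on a draw."""
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    has_unique_winner = len(ranked) == 1 or ranked[1][1] < top_count
    return top_label if has_unique_winner else votes[tiebreak_position]

# votes ordered as [rf, svc, NN, xgb]
print(resolve_vote([2, 2, 2, 5]))  # 3:1 split
print(resolve_vote([2, 5, 5, 2]))  # 2:2 split, falls back to position 1
print(resolve_vote([1, 3, 3, 4]))  # 2:1:1 split
print(resolve_vote([0, 1, 2, 3]))  # four different votes, falls back to position 1
```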

In [None]:
# final update of label distribution 

print('Distribution of labels before:', Counter(unlabeled_labels)) #print distribution of labels before updating
print('Number of final predictions (labels):', len(final_pred))

new_labels_counter = 0
for i in range(len(unlabeled_labels)): #iterate through all labels
    if unlabeled_labels[i] == 10: # check if document is still unlabeled
        unlabeled_labels[i] = final_pred[new_labels_counter] # update to the assigned label
        new_labels_counter += 1 # increase counter of new labels
    if new_labels_counter == len(final_pred): #exit loop if labels of all newly assigned documents were updated
        break

print('Distribution of labels afterwards:', Counter(unlabeled_labels)) #print distribution of labels after updating: no '10' should appear anymore

# 4.6 Export final predictions <a id="export"></a>

In [None]:
print(len(unlabeled_labels)) # check for correct length
save_pickle(unlabeled_labels, 'Pickles/3_unlabeled_y.pickle')

# Optional: Exploring class distributions <a id="dist"></a>

In [None]:
# Evaluate class distribution of originally labeled corpus

print(len(labeled_y)) #print number of documents in originally labeled corpus
final_counter = dict(Counter(list(labeled_y))) #aggregate document types of corpus
print(final_counter) #print aggregated class distribution
print()

for key in sorted(final_counter): #iterate through document types
    # print absolute and relative frequency of this document type
    print('{}: {} assignments ({} %)'.format(text_labels[key], final_counter[key], round(final_counter[key]/len(labeled_y) *100, 2)))

In [None]:
# Evaluate class distribution of originally unlabeled corpus

print(len(unlabeled_labels)) #print number of documents in originally unlabeled corpus
final_counter = dict(Counter(unlabeled_labels)) #aggregate document types of corpus
print(final_counter) #print aggregated class distribution
print()

for key in sorted(final_counter): #iterate through document types
    # print absolute and relative frequency of this document type
    print('{}: {} assignments ({} %)'.format(text_labels[key], final_counter[key], round(final_counter[key]/len(unlabeled_labels) *100, 2)))

In [None]:
# Evaluate class distribution of entire corpus (originally labeled + unlabeled)
both_labels = list(labeled_y) #convert labels to list for concatenation
both_labels.extend(unlabeled_labels) # add newly assigned labels to original labels -> labels of entire corpus

print(len(both_labels)) #print number of documents in entire corpus
final_counter = dict(Counter(both_labels)) #aggregate document types of corpus
print(final_counter) #print aggregated class distribution
print()

for key in sorted(final_counter): #iterate through document types
    # print absolute and relative frequency of this document type
    print('{}: {} assignments ({} %)'.format(text_labels[key], final_counter[key], round(final_counter[key]/len(both_labels) *100, 2)))
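
The three cells above differ only in the label collection they summarize. A shared helper could look like the following sketch. `class_distribution` is a hypothetical name and the label list is made up. The helper assumes every label is a valid index into the list of class names, so a leftover placeholder such as `10` would raise an `IndexError`.

```python
from collections import Counter

def class_distribution(labels, names):
    """Map each class name to (absolute count, share in percent), ordered by class index."""
    counts = Counter(labels)
    total = len(labels)
    return {names[key]: (counts[key], round(counts[key] / total * 100, 2))
            for key in sorted(counts)}

class_names = ['Agreement', 'Amendment', 'Attachment', 'LOI', 'NDA', 'Offer', 'SOW']
toy_distribution = class_distribution([0, 0, 1, 6], class_names)
for name, (count, share) in toy_distribution.items():
    print('{}: {} assignments ({} %)'.format(name, count, share))
```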