# Active learning pour la classification de tweets
*Saber Zerhoudi*



## Introduction
L’objectif est l’analyse et la détection automatique de sujets sur un corpus ou un flux de tweets. Les méthodes employées sont celles utilisées couramment en Machine Learning supervisé et non supervisé. L’approche choisie utilise une première phase de sélection basée sur le traitement supervisé des Hashtags puis une phase de traitement automatisé de l’ensemble des données textuelles. Pour finir, on test des méthodes de sélection des tweets les plus importants pour chacun des topics pour définir celle à utiliser.


## Bibliothèques Python utilisées
Les données ont été traitées en utilisant  les bibliothèques suivantes *scipy* et *numpy* et le processus d'apprentissage/validation a été construit avec *scikit-learn* et *libact*. Les diagrammes ont été créées en utilisant *matplotlib* et *seaborn*.

In [None]:
import copy
import time
import codecs
import random

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

from strategies.strategies import USampling, CMBSampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from libact.base.dataset import Dataset
from libact.models import LogisticRegression, SVM
from libact.query_strategies import UncertaintySampling, RandomSampling, QueryByCommittee


## Source de données

Les données comprenaient trois fichiers CSV:
`corpus_2015_id-time-text.csv` (30911 tweets) et `oriane_pos_id-time-text.csv` qui contient que ceux qui sont positives et `oriane_neg_id-time-text.csv` qui contient ceux négative.

Le format des fichiers CSV était le suivant :

| Id | Timestamp  | Tweet |
|------|------|------|
|   673860306414759936  | 1449495791425 | On reste calme et on suit le live-tweet de la #FakeDesLumièresn#leresistant https://t.co/l2AbZysqjS |

Les données comprenaient aussi trois fichiers TXT qui correspondent respectivement au contenu des fichiers CSV après avoir vectoriser les tweets à l'aide de W2V :
`vectors_2015.txt` (30911 tweets) et `vectors_2015_pos.txt` qui contient que ceux qui sont positives et `vectors_2015_neg.txt` qui contient ceux négative.

Le format des fichiers TXT était le suivant :

| W2V (dim 100) | Id | Timestamp  | Tweet |
|------|------|------|------|
|   [ -4.57972735e-02  ...  -5.48090134e-03]  |   673860306414759936  | 1449495791425 | On reste calme et on suit le live-tweet de la #FakeDesLumièresn#leresistant https://t.co/l2AbZysqjS |


# Traitement de données
## Fonctions utilisés en main()
        

In [5]:
# Ouvrir un fichier en format txt
def openfile_txt(filepath):
    with open(filepath, 'r', encoding='utf16') as f:
        file = f.read().split('"\n[')
    return file

# Simulation du w2v : étant donné un tweet_id on retourne le vecteur associé
def simulate_w4v(tweet_id):
    element_id = ids_list.index(tweet_id)
    vectorized = vectors_list[element_id]
    return vectorized

# Parcours le fichier des vecteurs en format txt et stock les vecteurs et les ids sur deux listes 
def get_vectors_list(filepath):
    vectors_list_x, ids_list_x = [], []
    with open(filepath, 'r', encoding='utf16') as f:
        file = f.read().split('"\n[')
        for line in file:
            parts = line.replace('\n', '').replace('    ', ' ').replace('  ', ' ').replace('  ', ' ').split(";")
            vectors_list_x.append(parts[0])
            ids_list_x.append(parts[1].replace(' ', ''))
    return vectors_list_x, ids_list_x

# À partir d'un tweet_id on vérifie s'il est positif ou négatif
def define_label(tweet_id):
    with open(pos_filepath, 'r', encoding='utf16') as f:
        next(f)
        for line in f.readlines():
            parts = line.split(";")
            tweets = parts[0].replace('"', '')
            if tweet_id in tweets:
                label = 1
                break
            else:
                label = 0
    return label

# À partir d'un tweet_id on retourne le Tweet associé
def define_tweet_by_id(line_id):
    with open(csv_filepath, 'r', encoding='utf16') as fp:
        for i, line in enumerate(fp):
            if i == line_id:
                parts = line.split(";")
                tweet = parts[2]
            elif i > line_id:
                break
    return tweet

# On randomise les listes des datas et labels X et y tout on gardant le même ordre des couples de (X, y)
def randomize(X, y):
    permutation = np.random.permutation(y.shape[0])
    X2 = X[permutation]
    y2 = y[permutation]
    return X2, y2

# On construit les targets (labels) et datas (vecteurs) à partir du fichier corpus CSV
def build_dataset(file):
    target, data = [], []
    for line in file:
        z = np.array(define_label(line[1].replace(' ', '')))
        target.append(z)
        x = np.fromstring(line[0].replace(']', '').replace('[', '').replace('  ', ' '), sep=' ')
        data.append(x)
    target = np.asarray(target)
    data = np.asarray(data)
    return target, data

# En se basant sur un Dataset déséquilibré, on crée un Dataset équilibré qui contient 1000 Tweet pos et 1000 Tweet neg au hasard
def balance_dataset():
    file_pos_ids, file_neg_ids = [], []
    file_pos = openfile_txt(pos_filepath_txt)
    for line in file_pos:
        parts = line.replace('\n', '').replace('    ', ' ').replace('  ', ' ').replace('  ', ' ').split(";")
        file_pos_ids.append(parts)
    pos_part = random.sample(file_pos_ids, 1000)

    file_neg = openfile_txt(neg_filepath_txt)
    for line in file_neg:
        parts = line.replace('\n', '').replace('    ', ' ').replace('  ', ' ').replace('  ', ' ').split(";")
        file_neg_ids.append(parts)
    neg_part = random.sample(file_neg_ids, 1000)

    balanced_txt_file = pos_part+neg_part
    random.shuffle(balanced_txt_file)
    return balanced_txt_file

# Simuler l'interaction Homme-Machine pour définir le label des Tweets : Pour chaque Tweet choisi on cherche son label (pos/neg)
def simulate_human_decision(line_id):
    with open(csv_filepath, 'r', encoding='utf16') as fp:
        for i, line in enumerate(fp):
            if i == line_id:
                parts = line.split(";")
                tweet_id = parts[0]
                label = define_label(tweet_id)
            elif i > line_id:
                break
    return label

# On split le Dataset en train et test en partant de 50 tweets labelés
def split_train_test(file):
    target = build_dataset(file)
    n_labeled = 50

    X = target[1]
    y = target[0]
    print(np.shape(X))
    print(np.shape(y))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

    while (np.sum(y_train[:n_labeled]) < 25):
        X_rand, y_rand = randomize(X, y)
        X_train, X_test, y_train, y_test = train_test_split(X_rand, y_rand, test_size=0.2, stratify=y_rand)

    print(np.concatenate([y_train[:n_labeled], [None] * (len(y_train) - n_labeled)]))

    trn_ds = Dataset(X_train, np.concatenate([y_train[:n_labeled], [None] * (len(y_train) - n_labeled)]))
    tst_ds = Dataset(X_test, y_test)

    return trn_ds, tst_ds

## Fonction main()

En partant d'un jeu de donnée qui contient 2000 Tweets (dont 1000 sont pos et 1000 sont neg) on définit 50 Tweets labelés et on simule les 100 prochains. Le choix des Tweets est définit à chaque fois en utilisant un algorithme de classification différent ainsi que des méthodes de décisions différentes.

Les méthodes de classification testé sont : **RandomForest**, SVM(linear), LogisticRegression.
Les algorithmes d'active learning testé sont : UncertaintySampling, CMB Sampling(Combination of active learning algorithms (distance-based (DIST), diversity-based (DIV))), Random Sampling, **QueryByCommittee**, QUIRE.
Et les fonctions de décisions pour le choix des queries utilisé sont : **Vote Entropy, Kullback-Leibler Divergence** , distance-based (DIST), diversity-based (DIV), Max Margin, Least Confident

En sortie, on affiche la courbe de l'évolution de l'Accuracy par rapport au nombre de queries traitées et un fichier TXT qui contient les différentes queries traitées à chaque étape.


In [None]:
def main():
    global pos_filepath, neg_filepath_txt, pos_filepath_txt, dataset_filepath, csv_filepath, vectors_list, ids_list
    # Les chemins de fichier à changer
    dataset_filepath = "/Users/dndesign/Desktop/active_learning/vecteurs_et_infos/vectors_2015.txt"
    csv_filepath = "/Users/dndesign/Desktop/active_learning/donnees/corpus_2015_id-time-text.csv"
    pos_filepath = "/Users/dndesign/Desktop/active_learning/donnees/oriane_pos_id-time-text.csv"
    pos_filepath_txt = "/Users/dndesign/Desktop/active_learning/vecteurs_et_infos/vectors_2015_pos.txt"
    neg_filepath_txt = "/Users/dndesign/Desktop/active_learning/vecteurs_et_infos/vectors_2015_neg.txt"
    vectors_list, ids_list = get_vectors_list(dataset_filepath)

    timestr = time.strftime("%Y%m%d_%H%M%S")
    text_file = codecs.open("task_" + str(timestr) + ".txt", "w", "utf-8")

    print("Loading data...")
    text_file.write("Loading data...\n")
    # Open this file
    t0 = time.time()
    file = openfile_txt(dataset_filepath)
    num_lines = sum(1 for line in file)
    print("Treating " + str(num_lines) + " entries...")
    text_file.write("Treating : %s entries...\n" % str(num_lines))

    # Number of queries to ask human to label
    quota = 100
    E_out1, E_out2, E_out3, E_out4, E_out6, E_out7 = [], [], [], [], [], []
    balanced_file = balance_dataset()
    trn_ds, tst_ds = split_train_test(balanced_file)

    # model = SVM(kernel='linear')
    # model = LogisticRegression()
    model = RandomForestClassifier()

    ''' UncertaintySampling (Least Confident)

        UncertaintySampling : it queries the instances about which 
        it is least certain how to label

        Least Confident : it queries the instance whose posterior 
        probability of being positive is nearest 0.5
    '''
    qs = UncertaintySampling(trn_ds, method='lc', model=LogisticRegression(C=.01))
    # model.train(trn_ds)
    model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
    predicted = model.predict(tst_ds.format_sklearn()[0])
    score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
    E_out1 = np.append(E_out1, 1 - score)
    # E_out1 = np.append(E_out1, 1 - model.score(tst_ds))

    ''' UncertaintySampling (Max Margin) 

    '''
    trn_ds2 = copy.deepcopy(trn_ds)
    qs2 = USampling(trn_ds2, method='mm', model=SVM(kernel='linear'))
    # model.train(trn_ds2)
    model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
    predicted = model.predict(tst_ds.format_sklearn()[0])
    score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
    E_out2 = np.append(E_out2, 1 - score)
    # E_out2 = np.append(E_out2, 1 - model.score(tst_ds))

    ''' CMB Sampling   
        Combination of active learning algorithms (distance-based (DIST), diversity-based (DIV)) 
    '''
    trn_ds3 = copy.deepcopy(trn_ds)
    qs3 = CMBSampling(trn_ds3, model=SVM(kernel='linear'))
    # model.train(trn_ds3)
    model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
    predicted = model.predict(tst_ds.format_sklearn()[0])
    score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
    E_out3 = np.append(E_out3, 1 - score)
    # E_out3 = np.append(E_out3, 1 - model.score(tst_ds))

    ''' Random Sampling   
        Random : it chooses randomly a query
    '''
    trn_ds4 = copy.deepcopy(trn_ds)
    qs4 = RandomSampling(trn_ds4, random_state=1126)
    # model.train(trn_ds4)
    model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
    predicted = model.predict(tst_ds.format_sklearn()[0])
    score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
    E_out4 = np.append(E_out4, 1 - score)
    # E_out4 = np.append(E_out4, 1 - model.score(tst_ds))

    ''' QueryByCommittee (Vote Entropy)

        QueryByCommittee : it keeps a committee of classifiers and queries 
        the instance that the committee members disagree, it  also examines 
        unlabeled examples and selects only those that are most informative 
        for labeling

        Vote Entropy : a way of measuring disagreement 

        Disadvantage : it does not consider the committee members’ class 
        distributions. It also misses some informative unlabeled examples 
        to label 
    '''
    trn_ds6 = copy.deepcopy(trn_ds)
    qs6 = QueryByCommittee(trn_ds6, disagreement='vote',
                           models=[LogisticRegression(C=1.0),
                                   LogisticRegression(C=0.01),
                                   LogisticRegression(C=100)],
                           random_state=1126)
    # model.train(trn_ds6)
    model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
    predicted = model.predict(tst_ds.format_sklearn()[0])
    score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
    E_out6 = np.append(E_out6, 1 - score)
    # E_out6 = np.append(E_out6, 1 - model.score(tst_ds))

    ''' QueryByCommittee (Kullback-Leibler Divergence)

            QueryByCommittee : it examines unlabeled examples and selects only 
            those that are most informative for labeling

            Disadvantage :  it misses some examples on which committee members 
            disagree
    '''
    trn_ds7 = copy.deepcopy(trn_ds)
    qs7 = QueryByCommittee(trn_ds7, disagreement='kl_divergence',
                           models=[LogisticRegression(C=1.0),
                                   LogisticRegression(C=0.01),
                                   LogisticRegression(C=100)],
                           random_state=1126)
    # model.train(trn_ds7)
    model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
    predicted = model.predict(tst_ds.format_sklearn()[0])
    score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
    E_out7 = np.append(E_out7, 1 - score)
    # E_out7 = np.append(E_out7, 1 - model.score(tst_ds))

    with sns.axes_style("darkgrid"):
        fig = plt.figure()
        ax = fig.add_subplot(1, 1, 1)

    query_num = np.arange(0, 1)
    p1, = ax.plot(query_num, E_out1, 'red')
    p2, = ax.plot(query_num, E_out2, 'blue')
    p3, = ax.plot(query_num, E_out3, 'green')
    p4, = ax.plot(query_num, E_out4, 'orange')
    p6, = ax.plot(query_num, E_out6, 'black')
    p7, = ax.plot(query_num, E_out7, 'purple')
    plt.legend(
        ('Least Confident', 'Max Margin', 'Distance Diversity CMB', 'Random Sampling', 'Vote Entropy', 'KL Divergence'),
        loc=1)
    plt.ylabel('Accuracy')
    plt.xlabel('Number of Queries')
    plt.title('Active Learning - Query choice strategies')
    plt.ylim([0, 1])
    plt.show(block=False)

    for i in range(quota):
        print("\n#################################################")
        print("Query number " + str(i) + " : ")
        print("#################################################\n")
        text_file.write("\n#################################################\n")
        text_file.write("Query number %s : " % str(i))
        text_file.write("\n#################################################\n")

        ask_id = qs.make_query()
        print("\033[4mUsing Uncertainty Sampling (Least confident) :\033[0m")
        print("Tweet :" + define_tweet_by_id(ask_id), end='', flush=True)
        print("Simulating human response : " + str(simulate_human_decision(ask_id)) + " \n")
        text_file.write("Using Uncertainty Sampling (Least confident) :\n")
        text_file.write("Tweet : %s \n" % str(define_tweet_by_id(ask_id)))
        text_file.write("Simulating human response : %s \n\n" % str(simulate_human_decision(ask_id)))
        trn_ds.update(ask_id, simulate_human_decision(ask_id))
        # model.train(trn_ds)
        model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
        predicted = model.predict(tst_ds.format_sklearn()[0])
        score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
        E_out1 = np.append(E_out1, 1 - score)
        # E_out1 = np.append(E_out1, 1 - model.score(tst_ds))

        ask_id = qs2.make_query()
        print("\033[4mUsing Uncertainty Sampling (Max Margin) :\033[0m")
        print("Tweet :" + define_tweet_by_id(ask_id), end='', flush=True)
        print("Simulating human response : " + str(simulate_human_decision(ask_id)) + " \n")
        text_file.write("Using Uncertainty Sampling (Smallest Margin) :\n")
        text_file.write("Tweet : %s \n" % str(define_tweet_by_id(ask_id)))
        text_file.write("Simulating human response : %s \n\n" % str(simulate_human_decision(ask_id)))
        trn_ds2.update(ask_id, simulate_human_decision(ask_id))
        # model.train(trn_ds2)
        model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
        predicted = model.predict(tst_ds.format_sklearn()[0])
        score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
        E_out2 = np.append(E_out2, 1 - score)
        # E_out2 = np.append(E_out2, 1 - model.score(tst_ds))

        ask_id = qs3.make_query()
        print("\033[4mUsing CMB Distance-Diversity Sampling :\033[0m")
        print("Tweet :" + define_tweet_by_id(ask_id), end='', flush=True)
        print("Simulating human response : " + str(simulate_human_decision(ask_id)) + " \n")
        text_file.write("Using Uncertainty Sampling (Entropy) :\n")
        text_file.write("Tweet : %s \n" % str(define_tweet_by_id(ask_id)))
        text_file.write("Simulating human response : %s \n\n" % str(simulate_human_decision(ask_id)))
        trn_ds3.update(ask_id, simulate_human_decision(ask_id))
        # model.train(trn_ds3)
        model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
        predicted = model.predict(tst_ds.format_sklearn()[0])
        score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
        E_out3 = np.append(E_out3, 1 - score)
        # E_out3 = np.append(E_out3, 1 - model.score(tst_ds))

        ask_id = qs4.make_query()
        print("\033[4mUsing Random Sampling :\033[0m")
        print("Tweet :" + define_tweet_by_id(ask_id), end='', flush=True)
        print("Simulating human response : " + str(simulate_human_decision(ask_id)) + " \n")
        text_file.write("Using Random Sampling :\n")
        text_file.write("Tweet : %s \n" % str(define_tweet_by_id(ask_id)))
        text_file.write("Simulating human response : %s \n\n" % str(simulate_human_decision(ask_id)))
        trn_ds4.update(ask_id, simulate_human_decision(ask_id))
        # model.train(trn_ds4)
        model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
        predicted = model.predict(tst_ds.format_sklearn()[0])
        score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
        E_out4 = np.append(E_out4, 1 - score)
        # E_out4 = np.append(E_out4, 1 - model.score(tst_ds))

        ask_id = qs6.make_query()
        print("\033[4mUsing QueryByCommittee (Vote Entropy) :\033[0m")
        print("Tweet :" + define_tweet_by_id(ask_id), end='', flush=True)
        print("Simulating human response : " + str(simulate_human_decision(ask_id)) + " \n")
        text_file.write("Using QueryByCommittee (Vote Entropy) :\n")
        text_file.write("Tweet : %s \n" % str(define_tweet_by_id(ask_id)))
        text_file.write("Simulating human response : %s \n\n" % str(simulate_human_decision(ask_id)))
        trn_ds6.update(ask_id, simulate_human_decision(ask_id))
        # model.train(trn_ds6)
        model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
        predicted = model.predict(tst_ds.format_sklearn()[0])
        score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
        E_out6 = np.append(E_out6, 1 - score)
        # E_out6 = np.append(E_out6, 1 - model.score(tst_ds))

        ask_id = qs7.make_query()
        print("\033[4mUsing QueryByCommittee (KL Divergence) :\033[0m")
        print("Tweet :" + define_tweet_by_id(ask_id), end='', flush=True)
        print("Simulating human response : " + str(simulate_human_decision(ask_id)) + " \n")
        text_file.write("Using QueryByCommittee (KL Divergence) :\n")
        text_file.write("Tweet : %s \n" % str(define_tweet_by_id(ask_id)))
        text_file.write("Simulating human response : %s \n\n" % str(simulate_human_decision(ask_id)))
        trn_ds7.update(ask_id, simulate_human_decision(ask_id))
        # model.train(trn_ds7)
        model.fit(trn_ds.format_sklearn()[0], trn_ds.format_sklearn()[1])
        predicted = model.predict(tst_ds.format_sklearn()[0])
        score = accuracy_score(tst_ds.format_sklearn()[1], predicted)
        E_out7 = np.append(E_out7, 1 - score)
        # E_out7 = np.append(E_out7, 1 - model.score(tst_ds))

        ax.set_xlim((0, i + 1))
        ax.set_ylim((0, max(max(E_out1), max(E_out2), max(E_out3), max(E_out4), max(E_out6), max(E_out7)) + 0.2))
        query_num = np.arange(0, i + 2)
        p1.set_xdata(query_num)
        p1.set_ydata(E_out1)
        p2.set_xdata(query_num)
        p2.set_ydata(E_out2)
        p3.set_xdata(query_num)
        p3.set_ydata(E_out3)
        p4.set_xdata(query_num)
        p4.set_ydata(E_out4)
        p6.set_xdata(query_num)
        p6.set_ydata(E_out6)
        p7.set_xdata(query_num)
        p7.set_ydata(E_out7)

        plt.draw()

    t2 = time.time()
    time_total = t2 - t0
    print("\n\n\n#################################################\n")
    print("Execution time : %fs \n\n" % time_total)
    text_file.write("\n\n\n#################################################\n")
    text_file.write("Execution time : %fs \n" % time_total)
    text_file.close()
    input("Press any key to save the plot...")
    plt.savefig('task_' + str(timestr) + '.png')

    print("Done")


if __name__ == '__main__':
    main()

# Résultats et Remarques

Partant d'un Dataset (30911 Tweets) très déséquilibré avec 10 queries déjà labelées et en simulant 10 queries pour les labeler, on se retrouve avec le résultat suivant :
<img src="task_20180102_024943.png">

Et quand on essaie d'équilibrer le jeu de donnée en partant de 1000 Tweets positifs et 1000 négatifs pour un total de 2000 queries random, et que cette fois-ci on part de 50 queries déjà labelées et on simule 100 queries à labeler, on arrive au résultat suivant :
### Random Forest
<img src="task_20180117_143134.png">
### SVM(linear)
<img src="task_20180113_220114.png">