# BERT for Tweets Classification
Source : https://towardsdatascience.com/bert-text-classification-in-3-lines-of-code-using-keras-264db7e7a358

You need ktrain, Keras and TensorFlow to run this notebook. It is strongly advised to use either Google Colab or a TensorFlow installation configured to run on a good GPU.

Because training uses a lot of RAM and processing time, the results we obtained on AICrowd were produced with the small training set of tweets. If you have a powerful computer, we advise training on the full set to get better results.

This is our best submission on AICrowd, with accuracy: 0.876 and F1 score: 0.879.

In [None]:
!pip3 install ktrain

In [None]:
import ktrain
from ktrain import text
import numpy as np

## Helpers

In [None]:
def divide_test_train(tweets):
    """Divide an array of tweets into two parts: the first 75% and the remaining 25%"""
    index = int(0.75 * len(tweets))
    return tweets[:index], tweets[index:]

def tweets_txt(file_name):
    """Parse a txt file into an array of tweets and remove duplicates"""
    # The context manager closes the file even if reading fails
    with open(file_name, "r") as f:
        tweets = [line.strip() for line in f]
    # set() removes duplicates but does not preserve the original line order
    return np.array(list(set(tweets)))

def tweets_txt_test(file_name):
    """Parse a txt file and return an array of tweets without removing duplicates"""
    # Order and duplicates are kept so that predictions line up with the test ids
    with open(file_name, "r") as f:
        tweets = [line.strip() for line in f]
    return np.array(tweets)
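
Because tweets_txt deduplicates through a set, the order of the returned array is arbitrary and, for strings, can change between Python runs, so the 75%/25% split made by divide_test_train is not guaranteed to be the same each time. As a minimal sketch, assuming reproducibility matters for your experiments, the hypothetical helper dedupe_keep_order below (not part of the original notebook) removes duplicates while keeping the first occurrence of each tweet in file order:

```python
import numpy as np

def dedupe_keep_order(lines):
    """Hypothetical helper: drop duplicates, keeping first occurrences in order."""
    # dict preserves insertion order (Python 3.7+), so keys come out in input order
    return np.array(list(dict.fromkeys(lines)))

def divide_test_train(tweets):
    """Same 75% / 25% split as the helper defined above."""
    index = int(0.75 * len(tweets))
    return tweets[:index], tweets[index:]

# Made-up sample tweets, for illustration only
sample = ["good day", "bad day", "good day", "so tired", "love it"]
unique = dedupe_keep_order(sample)
train, test = divide_test_train(unique)
print(unique.tolist())
print(train.tolist(), test.tolist())
```

With this variant the same input file always yields the same train and test parts, which makes runs easier to compare.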

## Prepare folder for BERT

To prepare the folder, create this structure in your current directory: a BERT_folder directory containing train and test subdirectories, each with a pos and a neg subdirectory.
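
Note that open() does not create missing directories, so the four leaf folders (train/pos, train/neg, test/pos, test/neg under BERT_folder) must exist before the next cell runs. A minimal sketch for creating them with the standard library, assuming the root folder is named BERT_folder as in the rest of this notebook (the function name make_bert_folders is ours, for illustration):

```python
import os

def make_bert_folders(root="BERT_folder"):
    """Create root/{train,test}/{pos,neg} and return the created paths."""
    paths = []
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            path = os.path.join(root, split, label)
            # exist_ok=True makes the call safe to re-run
            os.makedirs(path, exist_ok=True)
            paths.append(path)
    return paths

created = make_bert_folders()
print(created)
```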

In [None]:
tweets_pos = tweets_txt("Datasets/twitter-datasets/train_pos.txt")
tweets_neg = tweets_txt("Datasets/twitter-datasets/train_neg.txt")

tweets_pos_train, tweets_pos_test = divide_test_train(tweets_pos)
tweets_neg_train, tweets_neg_test = divide_test_train(tweets_neg)

# Write one tweet per file: BERT_folder/<split>/<label>/<index>.txt
parts = [
    ("pos", "train", tweets_pos_train),
    ("pos", "test", tweets_pos_test),
    ("neg", "train", tweets_neg_train),
    ("neg", "test", tweets_neg_test),
]
for label, split, tweets in parts:
    for i, t in enumerate(tweets):
        # The context manager closes each file as soon as it is written
        with open("BERT_folder/%s/%s/%d.txt" % (split, label, i), "w") as f:
            f.write(t)
    print("DONE", label, split)

## Train

The maxlen of 199 comes from the CNN part, where, after reduction, we saw that at most 199 tokens were useful.
Since this algorithm is time-consuming, reducing maxlen is a good way to save time. Furthermore, on Google Colab we tried leaving maxlen unconstrained, but it gave worse results.
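
One way to choose a value such as maxlen is to look at the length distribution of the tweets. The sketch below is an illustration only: it counts whitespace-separated tokens, which is an assumption on our part and differs from both the CNN preprocessing mentioned above and BERT's WordPiece tokenization. The name token_length_stats is hypothetical.

```python
import numpy as np

def token_length_stats(tweets):
    """Return (max, 99th percentile) of whitespace token counts."""
    lengths = np.array([len(t.split()) for t in tweets])
    return int(lengths.max()), float(np.percentile(lengths, 99))

# Made-up sample tweets, for illustration only
sample = ["i love this", "worst day ever , truly", "ok"]
longest, p99 = token_length_stats(sample)
print(longest, p99)
```

If the 99th percentile is far below the maximum, a smaller maxlen would truncate only a few tweets while reducing training time.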

In [None]:
(x_train_small, y_train_small), (x_test_small, y_test_small), preproc_small = text.texts_from_folder(
    "BERT_folder",
    maxlen=199,
    preprocess_mode='bert',
    train_test_names=['train', 'test'],
    classes=['pos', 'neg'])

In [None]:
model_small = text.text_classifier('bert', (x_train_small, y_train_small), preproc=preproc_small)
learner_small = ktrain.get_learner(model_small,
                                   train_data=(x_train_small, y_train_small),
                                   val_data=(x_test_small, y_test_small),
                                   batch_size=10)

The original BERT paper and the tutorial we followed both mention 2e-5 as a good learning rate, so we used it.

Since training takes a long time, we ran only 1 epoch with this model. With a more constrained model (maxlen = 128 and batch_size = 32), we tried 3 epochs, but the result was not better.
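
The next cell calls fit_onecycle, which trains with a 1cycle-style learning rate policy: the rate rises to the maximum given (here 2e-5) and then decays. As a rough illustration of the idea, and not ktrain's exact implementation, the sketch below computes a symmetric triangular schedule. The function one_cycle_lr, its start_div parameter and the step count are hypothetical.

```python
def one_cycle_lr(step, total_steps, max_lr=2e-5, start_div=10.0):
    """Triangular schedule: ramp up to max_lr at mid-training, then back down."""
    base_lr = max_lr / start_div
    half = total_steps / 2
    # Distance from the peak, as a fraction in [0, 1]
    frac = abs(step - half) / half
    return max_lr - (max_lr - base_lr) * frac

total_steps = 1000  # hypothetical: number of batches in one epoch
for step in (0, 250, 500, 750, 1000):
    print(step, one_cycle_lr(step, total_steps))
```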

In [None]:
learner_small.fit_onecycle(2e-5, 1)

## Predict

In [None]:
predictor = ktrain.get_predictor(learner_small.model, preproc_small)

In [None]:
tweets_test = tweets_txt_test("Datasets/twitter-datasets/test_data.txt")

In [None]:
result = predictor.predict(tweets_test)

In [None]:
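# Write the submission CSV: ids start at 1, labels are mapped to 1 (pos) or -1 (neg)
label_to_int = {"pos": 1, "neg": -1}
with open("submission.csv", "w") as f:
    f.write("Id,Prediction\n")
    for tweet_id, label in enumerate(result, start=1):
        # Any label other than "pos"/"neg" is written unchanged
        f.write("%d,%s\n" % (tweet_id, label_to_int.get(label, label)))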