# NLP case study


The following table summarizes the datasets used throughout this notebook.

| dataset ID | dataset name | is_dialectical | is_MSA (Modern Standard Arabic) | is_balanced | num_of_tweets | num_of_pos_tweets | num_of_neg_tweets |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | [arabic-sentiment-twitter-corpus](https://www.kaggle.com/mksaad/arabic-sentiment-twitter-corpus) | Yes | No/Minority | Yes | 58,751 | 29,849 | 28,902 |
| 2 | [SS2030](https://www.kaggle.com/snalyami3/arabic-sentiment-analysis-dataset-ss2030-dataset) | Yes - Saudi dialect only | No/Minority | Yes | 4,252 | 2,436 | 1,816 |
| 3 | [100k Arabic Reviews](https://www.kaggle.com/abedkhooli/arabic-100k-reviews) | No/Minority | Yes | Yes | 66,666 | 33,333 | 33,333 |
| 4 | [ArSAS](https://homepages.inf.ed.ac.uk/wmagdy/resources.htm) | Yes - mixed dialects | No/Minority | Yes | 11,784 | 4,400 | 7,384 |

*(For a more detailed analysis of the datasets, see [this](https://www.kaggle.com/yasmeenhany/dataset-analysis) companion notebook.)*
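
As a quick sanity check on the `is_balanced` column, the positive share of each dataset can be computed directly from the counts in the table. This is only a minimal sketch: the `datasets` dictionary and the `positive_share` helper are illustrative names and are not part of the notebook.

```python
# Tweet counts copied from the table above: (num_of_pos_tweets, num_of_neg_tweets)
datasets = {
    "arabic-sentiment-twitter-corpus": (29_849, 28_902),
    "SS2030": (2_436, 1_816),
    "100k Arabic Reviews": (33_333, 33_333),
    "ArSAS": (4_400, 7_384),
}

def positive_share(pos, neg):
    """Fraction of positive examples in a dataset."""
    return pos / (pos + neg)

for name, (pos, neg) in datasets.items():
    print(f"{name}: {positive_share(pos, neg):.1%} positive")
```

A share far from 50% (ArSAS works out to roughly 37% positive with these counts) suggests reading plain accuracy alongside the majority-class baseline for that dataset.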

In [1]:
import os
import re
# from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

%matplotlib inline

#### Importing the dataset


In [2]:
pd.set_option('display.max_colwidth', 280)
train_neg = pd.read_csv("dataset/train_Arabic_tweets_negative_20190413.tsv", 
                        sep="\t", header=None,  quoting=csv.QUOTE_NONE)
train_pos = pd.read_csv("dataset/train_Arabic_tweets_positive_20190413.tsv", 
                        sep="\t", header=None,  quoting=csv.QUOTE_NONE)
train_neg.rename(columns={0:'label', 1:'tweet'}, inplace=True)
train_pos.rename(columns={0:'label', 1:'tweet'}, inplace=True)
train_neg['label'] = 0
train_pos['label'] = 1
train_df = pd.concat([train_neg, train_pos], axis=0).reset_index(drop=True)


### Visualizing the first 10 rows of the training dataset: 

In [5]:
train_df.head(10)

## Feature engineering


In [17]:
from nltk.corpus import stopwords
import emoji
#Stats about Text
def avg_word(sentence):
    words = sentence.split()
    if len(words) == 0:
        return 0
    return (sum(len(word) for word in words)/len(words))

def emoji_counter(sentence):
    return emoji.emoji_count(sentence)



In [18]:
train_df['word_count'] = train_df['tweet'].apply(lambda x: len(str(x).split(" ")))
train_df['char_count'] = train_df['tweet'].str.len() ## this also includes spaces
train_df['avg_char_per_word'] = train_df['tweet'].apply(lambda x: avg_word(x))
stop = set(stopwords.words('arabic'))  # a set makes the membership test below much faster than a list
train_df['stopwords'] = train_df['tweet'].apply(lambda x: len([w for w in x.split() if w in stop]))
train_df['emoji_count'] = train_df['tweet'].apply(lambda x: emoji_counter(x))
train_df = train_df.sort_values(by='word_count', ascending=False)
train_df.head()

### As we can see, our training dataset is full of emojis, hashtags (#), mentions (@), punctuation and links, so a common practice is to remove them, on the assumption that they carry little signal for the classical classifiers

In [19]:
from nltk.corpus import stopwords
from textblob import TextBlob
import re
from dsaraby import DSAraby
ds = DSAraby()
from tashaphyne.stemming import ArabicLightStemmer
from nltk.stem.isri import ISRIStemmer

stops = set(stopwords.words("arabic"))
stop_word_comp = {"،","آض","آمينَ","آه","آهاً","آي","أ","أب","أجل","أجمع","أخ","أخذ","أصبح","أضحى","أقبل","أقل","أكثر","ألا","أم","أما","أمامك","أمامكَ","أمسى","أمّا","أن","أنا","أنت","أنتم","أنتما","أنتن","أنتِ","أنشأ","أنّى","أو","أوشك","أولئك","أولئكم","أولاء","أولالك","أوّهْ","أي","أيا","أين","أينما","أيّ","أَنَّ","أََيُّ","أُفٍّ","إذ","إذا","إذاً","إذما","إذن","إلى","إليكم","إليكما","إليكنّ","إليكَ","إلَيْكَ","إلّا","إمّا","إن","إنّما","إي","إياك","إياكم","إياكما","إياكن","إيانا","إياه","إياها","إياهم","إياهما","إياهن","إياي","إيهٍ","إِنَّ","ا","ابتدأ","اثر","اجل","احد","اخرى","اخلولق","اذا","اربعة","ارتدّ","استحال","اطار","اعادة","اعلنت","اف","اكثر","اكد","الألاء","الألى","الا","الاخيرة","الان","الاول","الاولى","التى","التي","الثاني","الثانية","الذاتي","الذى","الذي","الذين","السابق","الف","اللائي","اللاتي","اللتان","اللتيا","اللتين","اللذان","اللذين","اللواتي","الماضي","المقبل","الوقت","الى","اليوم","اما","امام","امس","ان","انبرى","انقلب","انه","انها","او","اول","اي","ايار","ايام","ايضا","ب","بات","باسم","بان","بخٍ","برس","بسبب","بسّ","بشكل","بضع","بطآن","بعد","بعض","بك","بكم","بكما","بكن","بل","بلى","بما","بماذا","بمن","بن","بنا","به","بها","بي","بيد","بين","بَسْ","بَلْهَ","بِئْسَ","تانِ","تانِك","تبدّل","تجاه","تحوّل","تلقاء","تلك","تلكم","تلكما","تم","تينك","تَيْنِ","تِه","تِي","ثلاثة","ثم","ثمّ","ثمّة","ثُمَّ","جعل","جلل","جميع","جير","حار","حاشا","حاليا","حاي","حتى","حرى","حسب","حم","حوالى","حول","حيث","حيثما","حين","حيَّ","حَبَّذَا","حَتَّى","حَذارِ","خلا","خلال","دون","دونك","ذا","ذات","ذاك","ذانك","ذانِ","ذلك","ذلكم","ذلكما","ذلكن","ذو","ذوا","ذواتا","ذواتي","ذيت","ذينك","ذَيْنِ","ذِه","ذِي","راح","رجع","رويدك","ريث","رُبَّ","زيارة","سبحان","سرعان","سنة","سنوات","سوف","سوى","سَاءَ","سَاءَمَا","شبه","شخصا","شرع","شَتَّانَ","صار","صباح","صفر","صهٍ","صهْ","ضد","ضمن","طاق","طالما","طفق","طَق","ظلّ","عاد","عام","عاما","عامة","عدا","عدة","عدد","عدم","عسى","عشر","عشرة","علق","على","عليك","عليه","عليها","علًّ","عن","عند","عندما","عوض","عين","عَدَسْ","عَمَّا","
غدا","غير","ـ","ف","فان","فلان","فو","فى","في","فيم","فيما","فيه","فيها","قال","قام","قبل","قد","قطّ","قلما","قوة","كأنّما","كأين","كأيّ","كأيّن","كاد","كان","كانت","كذا","كذلك","كرب","كل","كلا","كلاهما","كلتا","كلم","كليكما","كليهما","كلّما","كلَّا","كم","كما","كي","كيت","كيف","كيفما","كَأَنَّ","كِخ","لئن","لا","لات","لاسيما","لدن","لدى","لعمر","لقاء","لك","لكم","لكما","لكن","لكنَّما","لكي","لكيلا","للامم","لم","لما","لمّا","لن","لنا","له","لها","لو","لوكالة","لولا","لوما","لي","لَسْتَ","لَسْتُ","لَسْتُم","لَسْتُمَا","لَسْتُنَّ","لَسْتِ","لَسْنَ","لَعَلَّ","لَكِنَّ","لَيْتَ","لَيْسَ","لَيْسَا","لَيْسَتَا","لَيْسَتْ","لَيْسُوا","لَِسْنَا","ما","ماانفك","مابرح","مادام","ماذا","مازال","مافتئ","مايو","متى","مثل","مذ","مساء","مع","معاذ","مقابل","مكانكم","مكانكما","مكانكنّ","مكانَك","مليار","مليون","مما","ممن","من","منذ","منها","مه","مهما","مَنْ","مِن","نحن","نحو","نعم","نفس","نفسه","نهاية","نَخْ","نِعِمّا","نِعْمَ","ها","هاؤم","هاكَ","هاهنا","هبّ","هذا","هذه","هكذا","هل","هلمَّ","هلّا","هم","هما","هن","هنا","هناك","هنالك","هو","هي","هيا","هيت","هيّا","هَؤلاء","هَاتانِ","هَاتَيْنِ","هَاتِه","هَاتِي","هَجْ","هَذا","هَذانِ","هَذَيْنِ","هَذِه","هَذِي","هَيْهَاتَ","و","و6","وا","واحد","واضاف","واضافت","واكد","وان","واهاً","واوضح","وراءَك","وفي","وقال","وقالت","وقد","وقف","وكان","وكانت","ولا","ولم","ومن","مَن","وهو","وهي","ويكأنّ","وَيْ","وُشْكَانََ","يكون","يمكن","يوم","ّأيّان"}
ArListem = ArabicLightStemmer()


def to_arabic(text):
    return ds.transliterate(text)

def stem(text):
    zen = TextBlob(text)
    words = zen.words
    cleaned = list()
    for w in words:
        ArListem.light_stem(w)
        cleaned.append(ArListem.get_root())
    return " ".join(cleaned)

import pyarabic.araby as araby
def normalizeArabic(text):
    text = text.strip()
    text = re.sub("[إأٱآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    noise = re.compile(""" ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)
    text = re.sub(noise, '', text)
    text = re.sub(r'(.)\1+', r"\1\1", text) # Remove elongation: collapse repeated characters to two
    return araby.strip_tashkeel(text)
    
def remove_stop_words(text):
    zen = TextBlob(text)
    words = zen.words
    return " ".join([w for w in words if not w in stops and not w in stop_word_comp and len(w) >= 2])
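
The elongation rule used in `normalizeArabic` can be tried in isolation. The sketch below wraps the same regular expression in a hypothetical helper, `collapse_elongation`, which is not part of the notebook; the sample strings are made up for illustration.

```python
import re

def collapse_elongation(text):
    """Shorten any run of 3 or more identical characters to exactly two."""
    return re.sub(r"(.)\1+", r"\1\1", text)

# An elongated Arabic word built programmatically so the repeat count is explicit
elongated = "جم" + "ي" * 4 + "ل"
print(collapse_elongation(elongated))
print(collapse_elongation("coooool"))
print(collapse_elongation("1000"))
```

Runs of exactly two characters are left alone, so genuine double letters survive. The same rule also applies to digits, so a number such as `1000` becomes `100`, which may or may not be acceptable depending on the task.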


In [10]:
import string
def preprocess(text):
    # Links pattern (raw string, so the regex escapes reach `re` unchanged)
    link_pattern = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    # Emojis pattern
    emojis_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    non_arabic_letters_pattern = re.compile('[a-zA-Z]')
    # Removing links first: once punctuation is stripped, URLs no longer match the pattern
    new_text = link_pattern.sub(' ', text)
    # Removing emojis:
    new_text = emojis_pattern.sub(r'', new_text)
    # Removing punctuation:
    new_text = re.sub(r'[^\w\s]', '', new_text)
    # Removing non-Arabic (Latin) letters:
    new_text = non_arabic_letters_pattern.sub('', new_text)
    return new_text
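
The order of the cleaning steps matters: a link can only be matched while its punctuation (`://`, `.`, `/`) is still present. The sketch below illustrates this with two deliberately simplified patterns, `URL` and `PUNCT`, which are assumptions for the example and not the notebook's exact regexes.

```python
import re

# Simplified, assumed patterns for illustration only
URL = re.compile(r"https?://\S+")
PUNCT = re.compile(r"[^\w\s]")

def links_then_punct(text):
    """Remove URLs while they are still intact, then strip punctuation."""
    return PUNCT.sub("", URL.sub(" ", text))

def punct_then_links(text):
    """Strip punctuation first; the URL pattern then has nothing left to match."""
    return URL.sub(" ", PUNCT.sub("", text))

sample = "رائع https://t.co/abc123 جدا"
print(links_then_punct(sample).split())
print(punct_then_links(sample).split())
```

In the second ordering the link degrades into a single alphanumeric token. A later "remove Latin letters" step would still leave its digits behind in the text.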

## Preprocessing the train split


In [11]:
#removing punctuation :
train_df_clean=pd.DataFrame()
train_df_clean['tweet'] = train_df['tweet'].apply(preprocess)
train_df_clean['label'] = train_df['label']
train_df_clean.head(10)

### Now that we cleaned our data we can load the test split and do the same for that portion 


In [157]:
test_pos = pd.read_csv("../input/arabic-sentiment-twitter-corpus/test_Arabic_tweets_positive_20190413.tsv", 
                       sep="\t", header=None,  quoting=csv.QUOTE_NONE)
test_neg = pd.read_csv("../input/arabic-sentiment-twitter-corpus/test_Arabic_tweets_negative_20190413.tsv", 
                       sep="\t", header=None,  quoting=csv.QUOTE_NONE)
test_pos.rename(columns={0:'label', 1:'tweet'}, inplace=True)
test_neg.rename(columns={0:'label', 1:'tweet'}, inplace=True)
test_neg['label']=0
test_pos['label']=1
test_df = pd.concat([test_neg, test_pos], axis=0).reset_index(drop=True)
test_df_clean = pd.DataFrame()
test_df_clean['label'] = test_df['label']
test_df_clean['tweet'] = test_df['tweet'].apply(preprocess)
test_df_clean

In [158]:
train_df_clean.isna().sum()

### Note that we use a train/validation/test split on our dataset to check that the model can generalise to unseen data
- The total number of instances is 47,000 (train file) + 11,751 (test file) = 58,751, so the test split is about 20% of the data.
- The cell below holds out 10% of the training tweets for validation (`test_size=0.1`), which gives roughly 72% train, 8% validation and 20% test.
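
The split proportions can be worked out with a few lines of arithmetic. This is a rough sketch, assuming that 47,000 and 11,751 are the sizes of the train and test files and that a fraction `val_fraction` of the train file is held out for validation; `split_sizes` and `as_shares` are hypothetical helpers, and the rounding inside `train_test_split` may differ by one example.

```python
def split_sizes(n_train_file, n_test_file, val_fraction):
    """Return (train, validation, test) sizes when `val_fraction` of the
    training file is held out for validation."""
    n_val = round(n_train_file * val_fraction)
    n_train = n_train_file - n_val
    return n_train, n_val, n_test_file

def as_shares(sizes):
    """Express each split size as a share of the total, rounded to 2 decimals."""
    total = sum(sizes)
    return tuple(round(s / total, 2) for s in sizes)

for val_fraction in (0.1, 0.25, 0.33):
    sizes = split_sizes(47_000, 11_751, val_fraction)
    print(val_fraction, sizes, as_shares(sizes))
```

Because the test file is fixed at about 20% of all instances, only the train and validation shares move when `val_fraction` changes.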

In [175]:
from sklearn.model_selection import train_test_split
#old train split
X = train_df.tweet.values
y = train_df.label.values
#processed train split
X_train_new = train_df_clean.tweet.values
y_train_new = train_df_clean.label.values

# The train val split is used by the DL approach but not classical ML
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.1, random_state=1111)
X_train_clean , X_val_clean , y_train_clean , y_val_clean = train_test_split(X_train_new, y_train_new, 
                                                             test_size=0.1, random_state=1111)
#old test split
X_test = test_df.tweet.values
y_test = test_df.label.values
#processed test split
X_test_clean = test_df_clean.tweet.values
y_test_clean = test_df_clean.label.values

### Define the pipeline

In [190]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
def train_model(model, data, targets):
    text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', model),
    ])
    text_clf.fit(data, targets)
    return text_clf
def get_accuracy(trained_model,X, y):
    predicted = trained_model.predict(X)
    accuracy = np.mean(predicted == y)
    return accuracy
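
To make the `vect` and `tfidf` steps less of a black box, here is a small pure-Python sketch of TF-IDF weighting. It follows the smoothed-IDF formula and L2 normalisation that scikit-learn's `TfidfTransformer` uses by default, but it tokenizes by whitespace only, which is a simplification: `CountVectorizer` lowercases and drops single-character tokens by default. The `tfidf` function and the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalisation, one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(["good good film", "bad film"])
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in every document (`film` here) get the lowest IDF, so rarer terms such as `bad` end up with a larger weight within the same document.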

### Metrics from old dataset:

In [185]:
from sklearn.naive_bayes import MultinomialNB
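# The model is fit on the full training data (X, y), which contains the validation
# tweets, so the validation accuracy below is not a held-out estimate.
trained_clf_multinomial_nb = train_model(MultinomialNB(), X, y)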
test_accuracy       = get_accuracy(trained_clf_multinomial_nb,X_test, y_test)
validation_accuracy = get_accuracy(trained_clf_multinomial_nb,X_val, y_val)
training_accuracy   = get_accuracy(trained_clf_multinomial_nb,X_train, y_train)

print(f"test accuracy with MultinomialNB: {test_accuracy:.4f}")
print(f"validation accuracy with MultinomialNB: {validation_accuracy:.4f}")
print(f"Train accuracy with MultinomialNB: {training_accuracy:.4f}")

### Metrics from the processed dataset :

In [194]:
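from sklearn.naive_bayes import MultinomialNB
# Fit on the processed tweets so that training and evaluation use the same preprocessing
trained_clf_multinomial_nb = train_model(MultinomialNB(), X_train_new, y_train_new)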
test_accuracy       = get_accuracy(trained_clf_multinomial_nb,X_test_clean, y_test_clean)
validation_accuracy = get_accuracy(trained_clf_multinomial_nb,X_val_clean, y_val_clean)
training_accuracy   = get_accuracy(trained_clf_multinomial_nb,X_train_clean, y_train_clean)

print(f"test accuracy with MultinomialNB: {test_accuracy:.4f}")
print(f"validation accuracy with MultinomialNB: {validation_accuracy:.4f}")
print(f"Train accuracy with MultinomialNB: {training_accuracy:.4f}")

#### Load other test datasets (SS2030 and 100k Arabic Reviews) to test how well the model generalizes to other Arabic tweets/short texts

In [192]:
df_ss2030 = pd.read_csv("../input/arabic-sentiment-analysis-dataset-ss2030-dataset/Arabic Sentiment Analysis Dataset - SS2030.csv")
# Rename columns to match convention
df_ss2030 = df_ss2030.rename(columns = {"text":"tweet", "Sentiment": "label"})

In [136]:
df_reviews = pd.read_csv("../input/arabic-100k-reviews/ar_reviews_100k.tsv", delimiter="\t")
# Create a mapping for the labels such that we use the same convention across all datasets
label_mapping = {"Positive": 1, "Negative":0}
# Filter to only have positive and negative reviews, i.e. remove mixed reviews
# (`.copy()` avoids pandas' SettingWithCopyWarning on the assignment below)
df_reviews = df_reviews[df_reviews.label != "Mixed"].copy()
df_reviews["label"] = df_reviews["label"].map(label_mapping)
# Rename columns to match convention
df_reviews = df_reviews.rename(columns = {"text":"tweet"})

<a id="1"></a>
# Classical ML approach
#### Using tf-idf features

In [137]:
# Helper functions 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
def train_model(model, data, targets):
    text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', model),
    ])
    text_clf.fit(data, targets)
    return text_clf
def get_accuracy(trained_model,X, y):
    predicted = trained_model.predict(X)
    accuracy = np.mean(predicted == y)
    return accuracy

<a id="1.2"></a>
### Evaluate classifiers on other datasets

In [None]:
def print_all_accuracies(dataset_name, dataset):
    # Trained classifiers (assumed to be defined in earlier cells), keyed by display name
    classifiers = {
        "Decision Tree": trained_clf_decision_tree,
        "Multinomial NB": trained_clf_multinomial_nb,
        "Linear SVC": trained_clf_linearSVC,
        "Random Forest": trained_clf_random_forest,
    }
    for clf_name, clf in classifiers.items():
        accuracy = get_accuracy(clf, dataset.tweet.values, dataset.label.values)
        print(f"{dataset_name} dataset accuracy with {clf_name}: {accuracy:.2f}")

In [None]:
print_all_accuracies("SS2030", df_ss2030)
print_all_accuracies("100k Arabic Reviews", df_reviews)
print_all_accuracies("ArSAS", df_arsas)

<a id= "1.3"> </a>
### Summary of Classic ML Results:
- Best classifier found for the `arabic-sentiment-twitter-corpus` dataset: **RandomForestClassifier** 
- Performance across test datasets (numbers represent accuracy):

| Dataset | Decision Tree | Multinomial NB | Linear SVC | Random Forest |
| :---: | :---: | :---: | :---: | :---: |
| arabic-sentiment-twitter-corpus test subset | 0.77 | 0.79 | 0.79 | **0.80** |
| SS2030 | 0.52 | **0.59** | 0.58 | 0.55 |
| 100k reviews | 0.54 | **0.60** | 0.58 | 0.59 |
| ArSAS | 0.51 | 0.65 | 0.61 | **0.66** |
    

    
- It appears that **Multinomial NB** can sometimes outperform Random Forest, but the differences are small (at most 0.04 accuracy in the table above). 
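
One way to compare the classifiers beyond row-by-row winners is to average their accuracy over the datasets they were not trained on. The sketch below only re-uses the numbers from the summary table; the `results` dictionary and `mean_accuracy` helper are illustrative names, and an unweighted mean over three datasets is a rough summary, not a significance test.

```python
# Accuracies copied from the summary table above
results = {
    "arabic-sentiment-twitter-corpus test subset": {
        "Decision Tree": 0.77, "Multinomial NB": 0.79, "Linear SVC": 0.79, "Random Forest": 0.80},
    "SS2030": {
        "Decision Tree": 0.52, "Multinomial NB": 0.59, "Linear SVC": 0.58, "Random Forest": 0.55},
    "100k reviews": {
        "Decision Tree": 0.54, "Multinomial NB": 0.60, "Linear SVC": 0.58, "Random Forest": 0.59},
    "ArSAS": {
        "Decision Tree": 0.51, "Multinomial NB": 0.65, "Linear SVC": 0.61, "Random Forest": 0.66},
}

def mean_accuracy(results, skip=()):
    """Unweighted mean accuracy per classifier, ignoring datasets listed in `skip`."""
    rows = [scores for name, scores in results.items() if name not in skip]
    classifiers = rows[0].keys()
    return {clf: sum(row[clf] for row in rows) / len(rows) for clf in classifiers}

out_of_domain = mean_accuracy(results, skip=("arabic-sentiment-twitter-corpus test subset",))
for clf, acc in sorted(out_of_domain.items(), key=lambda kv: -kv[1]):
    print(f"{clf}: {acc:.3f}")
```

With these numbers, Multinomial NB has the highest out-of-domain mean (about 0.61), with Random Forest close behind (0.60).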


# Deep Learning Approach
- Given that the Random Forest classifier wasn't generalizing well to other datasets (possibly overfitting), I decided to try a DL approach using a pretrained model (i.e. leveraging a much larger pretraining corpus as a way of overcoming overfitting). For that I chose the [Arabic-BERT model](https://github.com/alisafaya/Arabic-BERT) by Ali Safaya.  
> The models were pretrained on ~8.2 Billion words:
> - Arabic version of OSCAR (unshuffled version of the corpus) - filtered from Common Crawl
> - Recent dump of Arabic Wikipedia

In [None]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
from transformers import AutoTokenizer, AutoModel

<a id="2.1"> </a>
### BERT-mini
Code adapted from https://skimai.com/fine-tuning-bert-for-sentiment-analysis/

In [None]:
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")

<a id="2.1.1"> </a>
##### Preprocessing

In [None]:
# Define preprocessing util function
import re
import unicodedata

def text_preprocessing(text):
    """
    - Normalize unicode to NFC
    - Remove entity mentions (eg. '@united')
    - Correct errors (eg. '&amp;' to '&')
    - Collapse whitespace and replace URLs with '<URL>'
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Normalize unicode encoding
    text = unicodedata.normalize('NFC', text)

    # Remove '@name'
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Collapse runs of whitespace and strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Replace every URL, wherever it appears, with a '<URL>' placeholder
    text = re.sub(r'https?://\S+', '<URL>', text)

    return text

In [None]:
# Create a function to tokenize a set of texts
import emoji
import unicodedata
def preprocessing_for_bert(data, version="mini", text_preprocessing_fn = text_preprocessing ):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []
    tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic") if version == "mini" else AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

    # For every sentence...
    for i,sent in enumerate(data):
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=text_preprocessing_fn(sent),  # Preprocess sentence
            add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,                  # Max length to truncate/pad
            padding='max_length',        # Pad sentence to max length
            #return_tensors='pt',           # Return PyTorch tensor
            return_attention_mask=True,     # Return attention mask
            truncation = True 
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks
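
The truncate/pad and attention-mask steps listed in the comments above can be illustrated without loading a tokenizer. This is a simplified sketch: `pad_and_mask` is a hypothetical helper, and the special-token ids (`cls_id`, `sep_id`, `pad_id`) are made-up values, not the ids used by the Arabic-BERT vocabulary.

```python
def pad_and_mask(token_ids, max_len, cls_id, sep_id, pad_id=0):
    """Add [CLS]/[SEP], truncate or pad to `max_len`, and build the attention mask."""
    body = token_ids[: max_len - 2]          # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)                    # 1 = real token, attended to
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding, ignored

# Short input: padded up to max_len
print(pad_and_mask([7, 8, 9], max_len=8, cls_id=2, sep_id=3))
# Long input: truncated down to max_len
print(pad_and_mask(list(range(10, 30)), max_len=6, cls_id=2, sep_id=3))
```

Every sequence comes out with the same length, which is what allows the lists to be stacked into a single tensor at the end of `preprocessing_for_bert`.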

In [None]:
# Specify `MAX_LEN`
MAX_LEN =  280

# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X[0]])[0].squeeze().numpy())
print('Original: ', X[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)

<a id="2.1.2"> </a>
##### Create data loaders for training and validation sets

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

<a id="2.1.3"> </a>
##### Define model initialization class and functions

In [None]:
%%time
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassifier class
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=False, version="mini"):
        """
        @param    freeze_bert (bool): Set `False` to fine-tune the BERT model
        @param    version (str): "mini" loads `asafaya/bert-mini-arabic`; any other
                      value loads `asafaya/bert-base-arabic`
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in = 256 if version == "mini" else 768
        H, D_out = 50, 2

        # Instantiate BERT model
        self.bert = AutoModel.from_pretrained("asafaya/bert-mini-arabic") if version == "mini" else AutoModel.from_pretrained("asafaya/bert-base-arabic")
        # Instantiate a feed-forward classifier with one hidden layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(H, D_out)
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

In [None]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW  # `transformers.AdamW` is deprecated in favour of the PyTorch optimizer

def initialize_model(epochs=4, version="mini"):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False, version=version)
    # Move the model to the selected device (GPU if available)
    bert_classifier.to(device)

    # Create the optimizer
    optimizer = AdamW(params=list(bert_classifier.parameters()),
                      lr=5e-5,           # Learning rate
                      eps=1e-8,          # Epsilon value
                      weight_decay=0.0   # Matches the default of the deprecated `transformers.AdamW`
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
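
For intuition about what the scheduler does to the learning rate, the multiplier applied by a linear schedule with warmup can be written in a few lines. This is a sketch of the rule as I understand it (linear ramp-up during warmup, then linear decay to zero); `linear_schedule_factor` is a hypothetical helper, and the toy `total_steps` is far smaller than a real run.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
total_steps = 10  # toy value; the notebook uses len(train_dataloader) * epochs
for step in (0, 5, 10):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

With `num_warmup_steps=0`, as in `initialize_model`, training starts at the full learning rate and decays linearly until the final step.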

<a id="2.1.4"> </a>
##### Define model train and evaluate functions

In [None]:
import random
import time
import torch
import torch.nn as nn
# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the BertClassifier model.
    """
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
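
`evaluate` averages the per-batch accuracies, which matches the overall accuracy only when all batches have the same size. The toy sketch below (hypothetical helpers, made-up flags) shows how a short last batch can shift the average.

```python
def batch_mean_accuracy(correct_flags, batch_size):
    """Mean of per-batch accuracies, as computed batch by batch."""
    batches = [correct_flags[i:i + batch_size]
               for i in range(0, len(correct_flags), batch_size)]
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def overall_accuracy(correct_flags):
    """Accuracy over all examples at once."""
    return sum(correct_flags) / len(correct_flags)

# Four correct predictions followed by one wrong one, with batch_size=4:
# the first batch scores 100%, the last batch (a single example) scores 0%.
flags = [1, 1, 1, 1, 0]
print(batch_mean_accuracy(flags, batch_size=4))
print(overall_accuracy(flags))
```

On a validation set with hundreds of batches and only one short batch, the difference is usually negligible, but accumulating correct counts and dividing once avoids it entirely.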

<a id="2.1.5"> </a>
##### Initialize and train model

In [None]:
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

<a id="2.1.6"> </a>
##### Save model

In [None]:
# Saving the model for future runs

import pickle
filename = 'trained_model_mini_with_emojis.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(bert_classifier, f)

Load model (Uncomment to avoid retraining in future runs)

In [None]:
# # Loading the model (to avoid retraining in reruns)

# import pickle
# filename = 'trained_model_mini_with_emojis.sav'
# f = open(filename, 'rb')
# bert_classifier = pickle.load(f)

<a id="2.1.7"> </a>
##### Define prediction and test set evaluation functions

In [None]:
import torch.nn.functional as F

def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)
    
    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)

    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()

    return probs
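
The softmax step at the end of `bert_predict` turns the two logits per tweet into probabilities that sum to one. A minimal pure-Python sketch for a single pair of logits follows; the `softmax` helper and the example logits are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0])                  # [negative logit, positive logit]
print([round(p, 4) for p in probs])
print("predicted positive" if probs[1] >= 0.5 else "predicted negative")
```

For two classes, the positive probability is at least 0.5 exactly when the positive logit is at least as large as the negative one, so thresholding at 0.5 gives the same decision as taking the `argmax` of the logits (ties aside).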

In [None]:
from sklearn.metrics import accuracy_score, roc_curve, auc

def evaluate_roc(probs, y_true, model_name, dataset_name, test_dataset_name):
    """
    - Print AUC and accuracy on the test set
    - Plot ROC
    @params    probs (np.array): an array of predicted probabilities with shape (len(y_true), 2)
    @params    y_true (np.array): an array of the true values with shape (len(y_true),)
    """
    preds = probs[:, 1]
    fpr, tpr, threshold = roc_curve(y_true, preds)
    roc_auc = auc(fpr, tpr)
    print(f'AUC: {roc_auc:.4f}')
       
    # Get accuracy over the test set
    y_pred = np.where(preds >= 0.5, 1, 0)
    accuracy = accuracy_score(y_true, y_pred)
    print(f'Accuracy: {accuracy*100:.2f}%')
    
    # Plot ROC AUC
    plt.title(f" ROC of {model_name}  trained on {dataset_name} dataset & evaluated on the {test_dataset_name} dataset ")
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
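
The two metrics printed by `evaluate_roc` measure different things: accuracy depends on the 0.5 cut-off, while AUC only depends on how the positive-class probabilities rank the examples. The sketch below illustrates this on toy data, using the standard interpretation of AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The function names and the toy values are assumptions for illustration; the notebook itself relies on `roc_curve` and `auc` from scikit-learn.

```python
def accuracy_at_threshold(pos_probs, y_true, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in pos_probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, y_true))
    return correct / len(y_true)

def pairwise_auc(pos_probs, y_true):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This quadratic form is for illustration only.
    """
    pos = [p for p, label in zip(pos_probs, y_true) if label == 1]
    neg = [p for p, label in zip(pos_probs, y_true) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical positive-class probabilities and true labels
toy_probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]

print(f"Accuracy: {accuracy_at_threshold(toy_probs, toy_labels):.3f}")
print(f"AUC: {pairwise_auc(toy_probs, toy_labels):.3f}")
```

In this toy case the ranking is better than the thresholded predictions suggest, which is why the two numbers can diverge on the external datasets as well.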

<a id="2.1.8"> </a>
##### Predict and evaluate validation subset

In [None]:
# Compute predicted probabilities on the validation set
probs = bert_predict(bert_classifier, val_dataloader)

# Evaluate the Bert classifier
evaluate_roc(probs, y_val, "BERT-mini", "arabic-sentiment-twitter-corpus", "arabic-sentiment-twitter-corpus validation")

<a id="2.1.9"> </a>
##### Predict and evaluate test subset

In [None]:
# Run `preprocessing_for_bert` on the test set
print('Tokenizing data...')
test_inputs, test_masks = preprocessing_for_bert(X_test)

# Create the DataLoader for our test set
test_dataset = TensorDataset(test_inputs, test_masks)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)

In [None]:
# Compute predicted probabilities on the test set
probs = bert_predict(bert_classifier, test_dataloader)

# Get predictions from the probabilities
threshold = 0.5
preds = np.where(probs[:, 1] > threshold, 1, 0)

# Fraction of tweets predicted positive (non-negative)
print("non-negative tweets ratio ", preds.sum()/len(preds))

In [None]:
# Evaluate the Bert classifier for unseen test data
evaluate_roc(probs, y_test,"BERT-mini", "arabic-sentiment-twitter-corpus","arabic-sentiment-twitter-corpus test")

<a id="2.1.10"> </a>
##### Predict and evaluate on other test datasets

In [None]:
# Evaluate the performance of the current `bert_classifier` on a test dataset
def evaluate_dataset(sents, labels, model_name, dataset_name, test_dataset_name):
    """Tokenize `sents`, predict with the global `bert_classifier`, then print
    AUC/accuracy and plot the ROC curve via `evaluate_roc`.
    """
    test_inputs, test_masks = preprocessing_for_bert(sents)

    # Create the DataLoader for our test set
    test_dataset = TensorDataset(test_inputs, test_masks)
    test_sampler = SequentialSampler(test_dataset)
    test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)

    # Compute predicted probabilities on the test set
    probs = bert_predict(bert_classifier, test_dataloader)

    # `evaluate_roc` prints the metrics and shows the plot; it returns None
    evaluate_roc(probs, labels, model_name, dataset_name, test_dataset_name)

In [None]:
# Evaluate on the SS2030 Dataset
evaluate_dataset(df_ss2030.tweet.values, df_ss2030.label.values,"BERT-mini", "arabic-sentiment-twitter-corpus", "SS2030")

In [None]:
# Evaluate on the 100k Arabic Reviews Dataset
evaluate_dataset(df_reviews.tweet.values, df_reviews.label.values,"BERT-mini", "arabic-sentiment-twitter-corpus", "100K Reviews")

In [None]:
# Evaluate on the ArSAS Dataset
evaluate_dataset(df_arsas.tweet.values, df_arsas.label.values,"BERT-mini", "arabic-sentiment-twitter-corpus", "ArSAS")

<a id="2.1.11"> </a>
##### Summary of performance on test datasets

| Model | arabic-sentiment-twitter-corpus test subset | SS2030 | 100k reviews | ArSAS
| :---: | :---: | :---: | :---: | :---: |
| RandomForestClassifier | 0.798 | 0.554 | 0.587 | 0.660
| BERT-mini | 0.900 | 0.639 | 0.599 | 0.691

<center><i>numbers shown represent accuracy</i></center>

In [None]:
# Helper function to get the prediction of a single tweet's sentiment (can be used for random tweet testing)
def predict_tweet_sentiment(tweet):
    df = pd.DataFrame([tweet])
    df = df.rename(columns = {0:"tweet"})
    print(df.tweet.values)
    test_inputs, test_masks = preprocessing_for_bert(df.tweet.values)

    # Create the DataLoader for our test set
    test_dataset = TensorDataset(test_inputs, test_masks)
    test_sampler = SequentialSampler(test_dataset)
    test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)
    # Compute predicted probabilities on the test set
    probs = bert_predict(bert_classifier, test_dataloader)
    print(probs)
    # Get predictions from the probabilities
    threshold = 0.5
    preds = np.where(probs[:, 1] > threshold, "positive", "negative")

#     print("no-negative tweets ratio ", preds.sum()/len(preds))
    return preds


*Although fine-tuning Arabic BERT improved generalization to the other datasets, a large gap remains between accuracy on the arabic-sentiment-twitter-corpus dataset and accuracy on the others. I suspected that the high accuracy on the first dataset comes from the model using emojis as cues: the dataset analysis [notebook](https://www.kaggle.com/yasmeenhany/dataset-analysis?scriptVersionId=64595722) shows that almost 80% of its tweets contain emojis, while the tweets/texts in the other datasets are far less saturated with them. To test this hypothesis, I trained the same model with emojis removed in the preprocessing step and checked how accuracy changed.*

<a id="2.2"> </a>
### BERT-mini without emojis

<a id="2.2.1"> </a>
##### Define modified preprocessing function

In [None]:
def remove_emojis(sent):
    text =  emoji.demojize(sent)
    text= re.sub(r'(:[!_\-\w]+:)', '', text)
    return text
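
`remove_emojis` works in two steps: `emoji.demojize` turns each emoji into a shortcode such as `:red_heart:`, and the regular expression then deletes every `:name:` span. The sketch below isolates the second step using only the standard library, on text that is assumed to be demojized already. The names `SHORTCODE_PATTERN` and `strip_shortcodes` are illustrative and not part of the notebook.

```python
import re

# Same pattern as in `remove_emojis`: a colon, one or more word/!/_/- characters, a colon
SHORTCODE_PATTERN = re.compile(r'(:[!_\-\w]+:)')

def strip_shortcodes(demojized_text):
    """Remove ':name:' style shortcodes from text that has already been demojized."""
    return SHORTCODE_PATTERN.sub('', demojized_text)

# A shortcode left behind by demojize is removed (note the doubled space that remains)
print(repr(strip_shortcodes("good :red_heart: day")))

# Caveat: ordinary text that happens to look like ':word:' is removed as well
print(repr(strip_shortcodes("12:30:45")))
```

The second call shows a side effect worth keeping in mind: any colon-delimited token in the original tweet, such as the middle part of a timestamp, is stripped too. Since `\w` is Unicode-aware for Python 3 string patterns, the same applies to Arabic words written between two colons.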

In [None]:
# Variant of `text_preprocessing` that removes emojis first
def text_preprocessing_no_emojis(text):
    """
    - Remove emojis
    - Apply the original `text_preprocessing` steps (remove entity mentions
      such as '@united', correct errors such as '&amp;' to '&')
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Remove emojis
    text = remove_emojis(text)

    return text_preprocessing(text)

<a id="2.2.2"> </a>
##### Preprocess and create data loaders

In [None]:
# Specify `MAX_LEN`
MAX_LEN =  280

# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X[0]], text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
print('Original: ', X[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, text_preprocessing_fn=text_preprocessing_no_emojis)
val_inputs, val_masks = preprocessing_for_bert(X_val, text_preprocessing_fn=text_preprocessing_no_emojis)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

<a id="2.2.3"> </a>
##### Train

In [None]:
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

*As hypothesized, the presence or absence of emojis strongly affects accuracy: it fell from 0.90 to 0.79 when the only change was removing emojis in the preprocessing step.*

<a id="2.2.4"> </a>
##### Evaluate on test datasets

In [None]:
# Evaluate on the unseen test data
evaluate_dataset(X_test, y_test,"BERT-mini no emojis", "arabic-sentiment-twitter-corpus", "arabic-sentiment-twitter-corpus test")

In [None]:
# Evaluate on the SS2030 Dataset
evaluate_dataset(df_ss2030.tweet.values, df_ss2030.label.values,"BERT-mini no emojis", "arabic-sentiment-twitter-corpus", "SS2030")

In [None]:
# Evaluate on the 100k Arabic Reviews Dataset
evaluate_dataset(df_reviews.tweet.values, df_reviews.label.values,"BERT-mini no emojis","arabic-sentiment-twitter-corpus", "100k Arabic Reviews")

In [None]:
# Evaluate on the ArSAS Dataset
evaluate_dataset(df_arsas.tweet.values, df_arsas.label.values,"BERT-mini no emojis", "arabic-sentiment-twitter-corpus", "ArSAS")

<a id="2.2.5"> </a>
##### Summary of performance on test datasets

| Model | arabic-sentiment-twitter-corpus test subset | SS2030 | 100k reviews | ArSAS
| :---: | :---: | :---: | :---: | :---: |
| RandomForestClassifier | 0.798 | 0.554 | 0.587 | 0.660
| BERT-mini | 0.900 | 0.639 | 0.599 | 0.691
| BERT-mini without emojis | 0.785 | 0.628 | 0.632 | 0.663

<center><i>numbers shown represent accuracy</i></center>

Compared to BERT-mini with emojis, the accuracy of BERT-mini without emojis dropped on the `arabic-sentiment-twitter-corpus` test subset, SS2030 and ArSAS, and rose only on 100k reviews. The drop is expected, since the earlier model was learning from emojis, which is undesired behavior (we want a text sentiment classifier). Compared to the Random Forest Classifier, BERT-mini without emojis is slightly worse on the `arabic-sentiment-twitter-corpus` test subset but better on the other test datasets (2, 3 and 4). Given that this version generalizes better to other datasets, let's see how BERT-base without emojis performs in comparison.

<a id="2.3"> </a>
### BERT-base

<a id="2.3.1"> </a>
##### Preprocess and create data loaders

In [None]:
# Specify `MAX_LEN`
MAX_LEN =  280

# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X[0]], version="base", text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
print('Original: ', X[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, version="base", text_preprocessing_fn=text_preprocessing_no_emojis)
val_inputs, val_masks = preprocessing_for_bert(X_val, version="base", text_preprocessing_fn=text_preprocessing_no_emojis)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)



<a id="2.3.2"> </a>
##### Train

In [None]:
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2, version="base")
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

<a id="2.3.3"> </a>
##### Save trained model

In [None]:
import pickle

filename = 'trained_model_base_without_emojis.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(bert_classifier, f)
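
The save step above and the commented-out loading cell earlier in the notebook form a round trip. The sketch below shows that round trip with the standard library only, using a plain dictionary as a stand-in for the trained model. The object, the file name, and the temporary directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object behaves the same way
stand_in_model = {"name": "demo-classifier", "weights": [0.1, -0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")

    # Save: the context manager closes the file once the dump is complete
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: read the object back for reuse without retraining
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)
```

Pickling a whole PyTorch module ties the file to the exact class definitions available at load time. PyTorch's documentation recommends saving `model.state_dict()` instead when portability matters, which may be worth considering here.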

<a id="2.3.4"> </a>
##### Evaluate on test datasets

In [None]:
# Evaluate on the unseen test data
evaluate_dataset(X_test, y_test,"BERT-base", "arabic-sentiment-twitter-corpus", "arabic-sentiment-twitter-corpus test")

In [None]:
# Evaluate on the SS2030 Dataset
evaluate_dataset(df_ss2030.tweet.values, df_ss2030.label.values,"BERT-base", "arabic-sentiment-twitter-corpus", "SS2030")

In [None]:
# Evaluate on the 100k Arabic Reviews Dataset
evaluate_dataset(df_reviews.tweet.values, df_reviews.label.values,"BERT-base", "arabic-sentiment-twitter-corpus", "100K Arabic Reviews")

In [None]:
# Evaluate on the ArSAS Dataset
evaluate_dataset(df_arsas.tweet.values, df_arsas.label.values,"BERT-base", "arabic-sentiment-twitter-corpus", "ArSAS")

<a id="2.3.5"> </a>
##### Summary of performance on test datasets

| Model | arabic-sentiment-twitter-corpus test subset | SS2030 | 100k reviews | ArSAS
| :---: | :---: | :---: | :---: | :---: |
| RandomForestClassifier | 0.798 | 0.554 | 0.587 | 0.660
| BERT-mini | 0.900 | 0.639 | 0.599 | 0.691
| BERT-mini without emojis | 0.785 | 0.628 | 0.632 | 0.663
| BERT-base without emojis | 0.803 |  0.652 | 0.652 | 0.699

<center><i>numbers shown represent accuracy</i></center>

*BERT-base slightly improved overall performance on the unseen datasets, but the model still does not generalize well after being trained on the arabic-sentiment-twitter-corpus dataset. To address this, we will train the model on each of the other datasets and see how that affects its ability to generalize.*

<a id="2.4"> </a>
### DL approach trained on other datasets

In [None]:
# Helper function that encapsulates all training logic
from sklearn.model_selection import train_test_split
def train_val_test_split(df):
    X = df.tweet.values
    y = df.label.values

    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2020)
    X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size=0.5, random_state=2020)
    return X_train, X_val, X_test, y_train, y_val, y_test
def preprocess_and_train(X_train, X_val, y_train,y_val):

    # Print sentence 0 and its encoded token ids
    token_ids = list(preprocessing_for_bert([X_train[0]], text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
    print('Original: ', X_train[0])
    print('Token IDs: ', token_ids)

    # Run function `preprocessing_for_bert` on the train set and the validation set
    print('Tokenizing data...')
    train_inputs, train_masks = preprocessing_for_bert(X_train, text_preprocessing_fn=text_preprocessing_no_emojis)
    val_inputs, val_masks = preprocessing_for_bert(X_val, text_preprocessing_fn=text_preprocessing_no_emojis)
    from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

    # Convert other data types to torch.Tensor
    train_labels = torch.tensor(y_train)
    val_labels = torch.tensor(y_val)

    # For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
    batch_size = 16

    # Create the DataLoader for our training set
    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create the DataLoader for our validation set
    val_data = TensorDataset(val_inputs, val_masks, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
    set_seed(42) 
    bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
    train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)
    return bert_classifier
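
`train_val_test_split` produces an 80/10/10 split by holding out 20% of the data and then halving the held-out part. The sketch below estimates the resulting subset sizes without scikit-learn. It assumes the held-out count is rounded up at each step, which is how `train_test_split` is understood to treat a float `test_size`; the function name `split_sizes` is illustrative.

```python
import math

def split_sizes(n_samples, test_fraction=0.2, test_share_of_holdout=0.5):
    """Approximate (train, val, test) counts for the two-step split.

    Assumption: the held-out count is rounded up at each step.
    """
    # First split: hold out 20% of all samples
    n_holdout = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_holdout

    # Second split: half of the held-out samples become the test set,
    # the remainder becomes the validation set
    n_test = math.ceil(test_share_of_holdout * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(split_sizes(1000))
# 11,784 is the ArSAS tweet count from the dataset table, assuming all rows are kept
print(split_sizes(11784))
```

Because both calls in the notebook use the same `random_state`, the split is reproducible across reruns.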

In [None]:
# Concatenate all of the tweets from arabic-sentiment-twitter-corpus to treat it as 1 test dataset in this section
df_twitter_corpus = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)

<a id="2.4.1"> </a>
#### Train using the SS2030 Dataset

In [None]:

from sklearn.model_selection import train_test_split
MAX_LEN = 280
X = df_ss2030.tweet.values
y = df_ss2030.label.values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2020)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size=0.5, random_state=2020)


# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X_train[0]], text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
print('Original: ', X_train[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, text_preprocessing_fn=text_preprocessing_no_emojis)
val_inputs, val_masks = preprocessing_for_bert(X_val, text_preprocessing_fn=text_preprocessing_no_emojis)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

In [None]:
# Evaluate on the unseen dataset
evaluate_dataset(X_test, y_test,"BERT-mini no emojis", "SS2030", "SS2030 test")

In [None]:
# Evaluate on the twitter corpus dataset
evaluate_dataset(df_twitter_corpus.tweet.values, df_twitter_corpus.label.values,"BERT-mini no emojis", "SS2030", "arabic-sentiment-twitter-corpus" )

In [None]:
# Evaluate on the 100k Arabic Reviews Dataset
evaluate_dataset(df_reviews.tweet.values, df_reviews.label.values,"BERT-mini no emojis", "SS2030", "100K Arabic Reviews")

In [None]:
# Evaluate on the ArSAS Dataset
evaluate_dataset(df_arsas.tweet.values, df_arsas.label.values,"BERT-mini no emojis", "SS2030", "ArSAS")

<a id="2.4.2"> </a>
#### Train using the 100k Arabic Reviews dataset

In [None]:

from sklearn.model_selection import train_test_split
MAX_LEN = 280
X = df_reviews.tweet.values
y = df_reviews.label.values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2020)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size=0.5, random_state=2020)


# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X_train[0]], text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
print('Original: ', X_train[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, text_preprocessing_fn=text_preprocessing_no_emojis)
val_inputs, val_masks = preprocessing_for_bert(X_val, text_preprocessing_fn=text_preprocessing_no_emojis)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

In [None]:
# Evaluate on unseen test dataset
evaluate_dataset(X_test, y_test,"BERT-mini no emojis", "100K Arabic Reviews", "100K Arabic Reviews test")

In [None]:
# Evaluate on the twitter corpus dataset
evaluate_dataset(df_twitter_corpus.tweet.values, df_twitter_corpus.label.values,"BERT-mini no emojis", "100K Reviews", "arabic-sentiment-twitter-corpus" )

In [None]:
# Evaluate on the SS2030 Dataset
evaluate_dataset(df_ss2030.tweet.values, df_ss2030.label.values,"BERT-mini no emojis", "100K Reviews", "SS2030")

In [None]:
# Evaluate on the ArSAS Dataset
evaluate_dataset(df_arsas.tweet.values, df_arsas.label.values,"BERT-mini no emojis", "100K Reviews", "ArSAS")

<a id="2.4.3"> </a>
#### Train using the ArSAS Dataset

In [None]:

from sklearn.model_selection import train_test_split
MAX_LEN = 280
X = df_arsas.tweet.values
y = df_arsas.label.values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2020)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size=0.5, random_state=2020)


# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X_train[0]], text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
print('Original: ', X_train[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, text_preprocessing_fn=text_preprocessing_no_emojis)
val_inputs, val_masks = preprocessing_for_bert(X_val, text_preprocessing_fn=text_preprocessing_no_emojis)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

In [None]:
# Evaluate on the unseen ArSAS test subset
evaluate_dataset(X_test, y_test,"BERT-mini no emojis", "ArSAS", "ArSAS test")

In [None]:
# Evaluate on the twitter corpus dataset
evaluate_dataset(df_twitter_corpus.tweet.values, df_twitter_corpus.label.values,"BERT-mini no emojis", "ArSAS", "arabic-sentiment-twitter-corpus")

In [None]:
# Evaluate on the SS2030 Dataset
evaluate_dataset(df_ss2030.tweet.values, df_ss2030.label.values,"BERT-mini no emojis", "ArSAS", "SS2030")

In [None]:
# Evaluate on the 100k reviews Dataset
evaluate_dataset(df_reviews.tweet.values, df_reviews.label.values,"BERT-mini no emojis", "ArSAS", "100K Arabic Reviews")

<a id="2.4.4"> </a>
#### Summary of performance on test datasets


| training dataset | test subset accuracy |  arabic-sentiment-twitter-corpus accuracy | SS2030 accuracy | 100k reviews accuracy | ArSAS accuracy
| :---: | :---: | :---: | :---: | :---: | :---: |
| arabic-sentiment-twitter-corpus | 0.785 | - | 0.579 | 0.570 | 0.697
| SS2030 | 0.847 | 0.492 | - | 0.502 | 0.628
| 100k reviews | 0.885 | 0.585 | 0.626 | - | 0.547
| ArSAS | 0.879 | 0.616 | 0.641 | 0.641 | - |

<center><i>numbers shown represent accuracy</i></center>

**The model trained on the ArSAS dataset appears to generalize better to the other test datasets (higher accuracy). Given that promising result, let's try using ArSAS to train BERT-base.**
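
One way to compare the rows of the summary table is to average each model's accuracy over the datasets it was not trained on. The sketch below does this with the values copied from the table. The unweighted mean is an assumption made for illustration (the datasets differ considerably in size), and the variable names are not part of the notebook.

```python
# Accuracies copied from the summary table (BERT-mini without emojis).
# Column order: twitter corpus, SS2030, 100k reviews, ArSAS.
# None marks the training dataset itself, which is excluded from the average.
external_accuracy = {
    "arabic-sentiment-twitter-corpus": [None, 0.579, 0.570, 0.697],
    "SS2030": [0.492, None, 0.502, 0.628],
    "100k reviews": [0.585, 0.626, None, 0.547],
    "ArSAS": [0.616, 0.641, 0.641, None],
}

def mean_external(scores):
    """Unweighted mean over the datasets the model was not trained on."""
    kept = [s for s in scores if s is not None]
    return sum(kept) / len(kept)

mean_by_training_set = {
    name: mean_external(scores) for name, scores in external_accuracy.items()
}
best_training_set = max(mean_by_training_set, key=mean_by_training_set.get)

for name, value in mean_by_training_set.items():
    print(f"{name}: {value:.3f}")
print("Best training dataset by mean external accuracy:", best_training_set)
```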

<a id="2.4.5"> </a>
#### Train BERT-base Using the ArSAS Dataset

In [None]:

from sklearn.model_selection import train_test_split
MAX_LEN = 280
X = df_arsas.tweet.values
y = df_arsas.label.values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2020)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test,test_size=0.5, random_state=2020)


# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X_train[0]], version="base", text_preprocessing_fn=text_preprocessing_no_emojis)[0].squeeze().numpy())
print('Original: ', X_train[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train, version="base", text_preprocessing_fn=text_preprocessing_no_emojis)
val_inputs, val_masks = preprocessing_for_bert(X_val, version="base", text_preprocessing_fn=text_preprocessing_no_emojis)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2, version="base")
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

In [None]:
# Evaluate on the ArSAS test set
evaluate_dataset(X_test, y_test,"BERT-base no emojis", "ArSAS", "ArSAS test")

In [None]:
# Evaluate on the twitter corpus dataset
evaluate_dataset(df_twitter_corpus.tweet.values, df_twitter_corpus.label.values,"BERT-base no emojis", "ArSAS", "arabic-sentiment-twitter-corpus")

In [None]:
# Evaluate on the SS2030 Dataset
evaluate_dataset(df_ss2030.tweet.values, df_ss2030.label.values,"BERT-base no emojis", "ArSAS", "SS2030")

In [None]:
# Evaluate on the 100k reviews Dataset
evaluate_dataset(df_reviews.tweet.values, df_reviews.label.values,"BERT-base no emojis", "ArSAS", "100K Arabic Reviews")

As expected, BERT-base (no emojis) trained on ArSAS yields the best results so far in terms of generalizing to unseen test datasets.

<a id="3"> </a>
# Final summary of all experiments

| model | with emojis| training dataset  |  arabic-sentiment-twitter-corpus accuracy | SS2030 accuracy | 100k reviews accuracy | ArSAS accuracy 
| :---: | :---: | :---: |  :---: | :---: | :---: | :---: |
| RandomForestClassifier | Yes | arabic-sentiment-twitter-corpus | *0.798*| 0.550 | 0.585 | 0.659 
| BERT-mini| Yes | arabic-sentiment-twitter-corpus | *0.900*| 0.639 | 0.599 | 0.691 
| BERT-mini| No | arabic-sentiment-twitter-corpus | *0.785* | 0.579 | 0.570 | **0.697** 
| BERT-base | No | arabic-sentiment-twitter-corpus | *0.803*  |  0.652| 0.652| **0.699** 
| BERT-mini| No | SS2030  | 0.492 | *0.847* | 0.502 | 0.628 
| BERT-mini| No | 100k reviews  | 0.585 | 0.626 | *0.885* | 0.547 
| BERT-mini| No | ArSAS | **0.616** | **0.641** | **0.641** | *0.879*
| BERT-base | No | ArSAS | ***0.648*** | ***0.679*** | ***0.741*** |*0.899*

<center><i>numbers shown represent accuracy</i></center>


**Notes:**
* *Italic numbers on the diagonal represent accuracies of the unseen test subsets (same dataset as training set)*
* *Bold numbers represent the highest BERT-base and BERT-mini accuracies for each external dataset/column (excluding the test subset of training dataset)*

### Summary & Conclusion

#### Summary of experiments
In this notebook my goal was to train an Arabic sentiment analysis classifier that is robust and has consistent performance regardless of the dataset used to evaluate it. Here's a summary of the experiments I've done:
- I first tried the classic ML approach and found that while it has a fast training time and good performance (measured by accuracy score) on the test subset of the dataset it's trained on, its performance significantly dropped when evaluated on other datasets.
- Then I tried a finetuning approach on a DL model that is pretrained on a very large corpus of Arabic text. The first model I tried in this category was a BERT-mini model that *did not* discard emojis in its preprocessing step. Similarly to the classical ML approach, this model performed well on the test subset of the dataset it's trained on, but failed to generalize on the other test datasets.
- I attempted a version of the same model that removes emojis in its preprocessing step. This caused the accuracy scores to drop both on the test subset and the other test datasets. This tells us that the model had been using emojis as sentiment cues. This is an undesired behavior because we want a model that infers sentiment from Arabic text, not from emojis. 
- The next step was to switch to BERT-base, a larger model with roughly 10x more parameters than BERT-mini. This improved performance on the SS2030 and 100k reviews datasets, while accuracy on ArSAS barely changed.
- After trying different versions of the model, changing the training dataset was the logical next experiment. Training on ArSAS gave the best accuracy on unseen datasets.
- Given that result, I next trained a BERT-base model using ArSAS, and this version ended up outperforming the BERT-base model trained on the arabic-sentiment-twitter-corpus.

#### Conclusion:
- Out of the different model/training dataset combinations I've tried in this notebook, BERT-base trained on ArSAS proved to be the best one for the task of Arabic text sentiment analysis.
- Even though all of the datasets (except for the 100k Reviews dataset) consist of dialectal Arabic tweets, they seem to differ intrinsically in topics and vocabulary. This is discussed in more detail in the companion dataset analysis [notebook](https://www.kaggle.com/yasmeenhany/dataset-analysis). These differences make it hard for a model trained on one dataset to generalize well to the others.
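
As a rough illustration of what a vocabulary difference between two datasets means, the sketch below computes the Jaccard overlap of two token sets. This is not the analysis used in the companion notebook; the metric, the whitespace tokenization, and the toy English sentences are all assumptions made for illustration.

```python
def jaccard_overlap(texts_a, texts_b):
    """Share of distinct whitespace tokens that appear in both collections."""
    vocab_a = {token for text in texts_a for token in text.split()}
    vocab_b = {token for text in texts_b for token in text.split()}
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)

# Toy stand-ins for two datasets with different topics
toy_tweets = ["the match was great", "great goal tonight"]
toy_reviews = ["the hotel room was clean", "great service"]

print(f"Vocabulary overlap: {jaccard_overlap(toy_tweets, toy_reviews):.2f}")
```

A low overlap between a training dataset and a test dataset would be one plausible reason for a drop in accuracy on the latter.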