# Tweet sentiment analysis
In this challenge, we classify tweets as positive, negative or neutral. We implemented several models, from simple baselines to more advanced ones, then compared their results and selected the best-performing one.

### Summary
> __1. Data Preparation__

> __2. BERT Transformer__

> __3. Bag of Words__

> __4. Word2Vec__

In [None]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: 
#               https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# plotting
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")

# general NLP preprocessing and basic tools
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# train/test split
from sklearn.model_selection import train_test_split
# basic machine learning models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# our evaluation metric for sentiment classification
from sklearn.metrics import fbeta_score

# Don't print the warnings
import warnings
warnings.filterwarnings("ignore")

## 1. Data Preparation 
### a. Loading the data

In [None]:
train_data_path='/kaggle/input/eurecom-aml-2021-challenge-3/train.csv'
test_data_path='/kaggle/input/eurecom-aml-2021-challenge-3/test.csv'

train_df = pd.read_csv(train_data_path)
test_df = pd.read_csv(test_data_path)

#train_df = train_df.loc[:1000]

### b. Data inspection

In [None]:
print("len(train_df) =",len(train_df),
    "\nlen(test_df) =",len(test_df),
    "\nlen(train_df) + len(test_df) =",len(train_df)+len(test_df))

In [None]:
train_df.head()

In [None]:
test_df.head()

### c. Data preprocessing

In [None]:
def missing_zero_values_table(df,df_name):
    '''
    Inputs:
        df: pandas table
        df_name: string of the pandas table name
    Output:
        "df_name has columns_nb columns and rows_nb Rows. There are columns_nb columns that have missing values."
    '''
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
    columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    mz_table = mz_table[mz_table.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    print (df_name + " has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
        "There are " + str(mz_table.shape[0]) + " columns that have missing values.")
    return mz_table
    
missing_zero_values_table(train_df,"train_df")

We start off by **converting the labels to numbers**. This is a requirement for the submission and numerical inputs are generally more compatible with machine learning libraries.
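
The mapping described above, together with the inverse mapping needed to turn predictions back into sentiment names, can be sketched as follows. This is a minimal illustration that assumes the 0/1/2 encoding used in this notebook; the helpers `encode` and `decode` are hypothetical names, and the exact label format expected by the submission should be checked against the competition rules.

```python
# Hypothetical helpers illustrating the label <-> integer mapping.
label_to_id = {"negative": 0, "neutral": 1, "positive": 2}
# The inverse mapping is derived from the forward one so the two cannot drift apart.
id_to_label = {idx: name for name, idx in label_to_id.items()}

def encode(labels):
    """Map sentiment strings to integer class ids."""
    return [label_to_id[label] for label in labels]

def decode(ids):
    """Map integer class ids back to sentiment strings."""
    return [id_to_label[idx] for idx in ids]

encoded = encode(["positive", "negative", "neutral"])
print(encoded)          # [2, 0, 1]
print(decode(encoded))  # ['positive', 'negative', 'neutral']
```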

In [None]:
# The label -1 causes problems with the transformer, so classes are encoded as 0, 1, 2 here.
# Before submission, the predictions must be mapped back to the labels
# expected by the competition (to be consistent with the test labels).
positive_label, neutral_label, negative_label = 2,1,0
target_conversion = {
    'neutral': neutral_label,
    'positive': positive_label,
    'negative': negative_label
}

train_df['target'] = train_df['sentiment'].map(target_conversion)

### Target distribution 

In [None]:
import matplotlib.patches as mpatches
fig, ax= plt.subplots(figsize =(3,3))

ax = sns.countplot(x='target', data=train_df, palette=['#DC143C',"#FFD700","#32CD32"]);
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x()+0.16, p.get_height()/2.2))

patch1 = mpatches.Patch(color='#DC143C', label='Negative')
patch2 = mpatches.Patch(color="#FFD700", label='Neutral')
patch3 = mpatches.Patch(color="#32CD32", label='Positive')

plt.legend(handles=[patch1, patch2, patch3], 
           bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title("Distribution of the labels")
plt.show()

The distribution is relatively well-balanced.
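
One way to put a number on "well-balanced" is to compute the share of each class and the ratio between the largest and the smallest class. The sketch below uses made-up labels, not the competition data, and `class_proportions` is a hypothetical helper.

```python
from collections import Counter

def class_proportions(labels):
    """Return the share of each class as a dict of label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up labels for illustration only.
toy_labels = [0, 1, 1, 2, 2, 1, 0, 2, 1, 1]
proportions = class_proportions(toy_labels)
print(proportions)
# Ratio between the largest and the smallest class (1.0 means perfectly balanced).
imbalance_ratio = max(proportions.values()) / min(proportions.values())
print(imbalance_ratio)
```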

The "selected_text" column has already been partially cleaned, but we can still improve it.

In [None]:
from wordcloud import WordCloud, STOPWORDS , ImageColorGenerator

original_column = 'selected_text'  # column to preprocess: "text" or "selected_text"

# Concatenate the tweets of each class into one string per word cloud:
df_positive = train_df[train_df['target']==positive_label]
df_neutral = train_df[train_df['target']==neutral_label]
df_negative = train_df[train_df['target']==negative_label]
tweet_all = " ".join(review for review in train_df[original_column])
tweet_positive = " ".join(review for review in df_positive[original_column])
tweet_neutral = " ".join(review for review in df_neutral[original_column])
tweet_negative = " ".join(review for review in df_negative[original_column])

fig, ax = plt.subplots(4, 1, figsize  = (30,30))
# Create and generate a word cloud image:
wordcloud_aLL = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_all)
wordcloud_positive = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_positive)
wordcloud_neutral = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_neutral)
wordcloud_negative = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_negative)

# Display the generated image:
ax[0].imshow(wordcloud_aLL, interpolation='bilinear')
ax[0].set_title('All Tweets', fontsize=30, pad=25)
ax[0].axis('off')
ax[1].imshow(wordcloud_positive, interpolation='bilinear')
ax[1].set_title('Tweets under positive Class',fontsize=30, pad=25)
ax[1].axis('off')
ax[2].imshow(wordcloud_neutral, interpolation='bilinear')
ax[2].set_title('Tweets under neutral Class',fontsize=30, pad=25)
ax[2].axis('off')
ax[3].imshow(wordcloud_negative, interpolation='bilinear')
ax[3].set_title('Tweets under negative Class',fontsize=30, pad=25)
ax[3].axis('off')
plt.show()

### d. Data Cleaning

### d.1 Remove punctuation
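
As a minimal sketch of this step, punctuation can also be deleted in a single pass with a translation table. The helper name `strip_punctuation` is hypothetical; keeping `'@'` mirrors the choice made in this notebook.

```python
import string

# Every ASCII punctuation character except '@', which is kept so that
# mentions such as '@user' can still be recognised later.
punctuation_to_drop = string.punctuation.replace("@", "")
# With three arguments, str.maketrans builds a table that deletes
# every character listed in the third argument.
drop_table = str.maketrans("", "", punctuation_to_drop)

def strip_punctuation(text):
    """Delete punctuation (except '@') in a single pass."""
    return text.translate(drop_table)

cleaned = strip_punctuation("@bob: wow!!! that's great...")
print(cleaned)  # @bob wow thats great
```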

In [None]:
import string
s = string.punctuation
print(s)
s = s.translate({ord(i): None for i in '@'})
print(s)

In [None]:
import re
import string

# All ASCII punctuation except '@', which is kept so that mentions can be detected later
punctuation_string = string.punctuation.replace('@', '')

def remove_punct(text, punctuation_string):
    """Remove the given punctuation characters and all digits from a tweet."""
    text = "".join([char for char in text if char not in punctuation_string])
    text = re.sub('[0-9]+', '', text)
    return text

head_nb = 10 # nb of lines to print when using pd.head(head_nb)

train_df['Tweet_punct'] = train_df[original_column].apply(lambda x: remove_punct(x,punctuation_string))
train_df.head(head_nb)

### d.2 Tokenization
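
The choice of split pattern matters for mentions. The sketch below compares two regular expressions on a made-up tweet; the second pattern is an assumption about the intended behaviour (keeping `'@'` attached to its token), not something stated in the source.

```python
import re

sample = "@alice cant wait   for friday"

# Splitting on runs of non-word characters treats '@' as a separator,
# so the mention marker is lost and an empty token appears at the start.
tokens_nonword = re.split(r"\W+", sample)
# Excluding '@' from the separator class keeps the mention in one token.
tokens_keep_at = re.split(r"[^\w@]+", sample)

print(tokens_nonword)  # ['', 'alice', 'cant', 'wait', 'for', 'friday']
print(tokens_keep_at)  # ['@alice', 'cant', 'wait', 'for', 'friday']
```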

In [None]:
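def tokenization(text):
    # Split on runs of characters that are neither word characters nor '@'.
    # Keeping '@' attached to its token lets replace_words() drop mentions;
    # splitting on '\W+' would strip the '@' and leave the user name behind.
    text = re.split(r'[^\w@]+', text)
    return text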

train_df['Tweet_tokenized'] = train_df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
train_df.head(head_nb)

### d.3 Word Replacement

In [None]:
replaced_words = [("hmmyou",""),("sry","sorry"),("inlove","in love"),("thats",""),("wanna",""),
                  ("soo","so"),("inlove","in love"),("amazingwell","amazing well"),
                  ("messagesorry","message sorry"),("½",""),("tomorrowneed","tomorrow need"),
                  ("tomorrowis","tomorrow is"),("amusedtime","amused time"),("weekendor","weekend or"),
                  ("competitionhope","competition hope"),("partypicnic","party picnic"),
                  ("ahmazing","amazing"),("wont","will not"),("didnt","did not"),("dont","do not"),
                  ("lookin","looking"),("u","you"),("youre","you are"),("nite","night"),("isnt","is not"),
                  ("k",""),("is",""),("doesnt","does not"),("l",""),("x",""),("c",""),("ur","your"),
                  ("e",""),("yall","you all"),("he",""),("us",""),("okim","ok i am"),("jealousi","jealous"),
                  ("srry","sorry"),("itll","it will"),("vs",""),("weeknend","weekend"),("w",""),
                  ("yr","year"),("youve","you have"),("havent","have not"),("iï",""),("gonna","going to"),
                  ("gimme","give me"),("ti",""),("ta",""),("thru","through"),("th",""),("imma","i am going to"),
                  ("wasnt","was not"),("arent","are not"), ("bff","best friend forever"),("sometimesdid","sometimes did"),
                  ("waitt","wait"),("bday","birthday"),("toobut","too but"),("showerand","shower and"),
                  ("innit","is not it"),("surgury","surgery"),("soproudofyo","so proud of you"),("p",""),
                  ("couldnt","could not"),("dohforgot","forgot"),("rih","right"),("b",""),("bmovie","movie"),
                  ("pleaseyour","please your"),("tonite","tonight"),("grea","great"),("se",""),("soonso","soon so"),
                  ("gettin","getting"),("blowin","blowing"),("coz","because"),("thanks","thank"),("st",""),("rd",""),
                  ("gtta","have got to"),("gotta","have got to"),("anythingwondering","anything wondering"),
                  ("annoyedy","annoyed"),("p",""),("beatiful","beautiful"),("multitaskin","multitasking"),
                  ("nightmornin","night morning"),("thankyou","thank you"),("iloveyoutwoooo","i love you two"),
                  ("tmwr","tomorrow"),("wordslooks","words looks"),("ima","i am going to"),("liek","like"),("mr",""),
                  ("allnighter","all nighter"),("tho","though"),("ed",""),("fyou",""),("footlong","foot long"),
                  ("placepiggy","place piggy"),("semiflaky","semi flaky"),("gona","going to"),("tmr","tomorrow"),
                  ("ppl","people"),("n",""),("dis","this"),("dun","done"),("houseee","house"),("havee","have"),
                  ("studyingwhew","studying whew"),("awwyoure","aww you are"),("softyi","softy"),
                  ("weddingyou","wedding you"),("hassnt","has not"),("lowerleft","lower left"),("anywayss","anyway"),
                  ("adoarble","adorable"),("blogyeahhhh","blog yeahhhh"),("billsim","bills i am"),("ps",""),
                  ("cheescake","cheesecake"),("morningafternoonnight","morning after noon night"),
                  ("allstudying","all studying"),("ofcoooursee","of course"),("jst","just"),("shes","she is"),
                  ("sonicswhich","sonics which"),("ouchwaited","ouch waited"),("itll","it will"),("orreply","or reply"),
                  ("somethin","something"),("fridayand","friday and"),("outta","out of"),("herenever","here never")
                 ] 

def replace_words(text, replaced_words):
    """Replace known misspellings/abbreviations and blank out URLs, mentions and mis-encoded tokens.

    The token list is modified in place and returned.
    """
    replacements = dict(replaced_words)
    # Any token containing one of these markers is blanked out
    # (empty tokens are dropped later, together with the stopwords)
    noise_markers = ("http", "@", "www.", "Â", "Ã", "½")
    for ind, word in enumerate(text):
        if any(marker in word for marker in noise_markers):
            text[ind] = ""
        elif word in replacements:
            text[ind] = replacements[word]
    return text

train_df['Tweet_tokenized'] = train_df['Tweet_tokenized'].apply(lambda x: replace_words(x,replaced_words))
train_df.head(head_nb)

### d.4 Remove stopwords
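
Negations are usually part of standard stopword lists, yet they carry sentiment. The sketch below shows the effect on a made-up example; the small hand-written stopword set stands in for NLTK's English list, and `drop_stopwords` is a hypothetical helper.

```python
# A tiny hand-written stopword set standing in for NLTK's English list.
toy_stopwords = {"i", "am", "the", "a", "is", "not", "no", "nor", "this"}
negations = {"not", "no", "nor"}
# Keep negations: removing them would turn "not happy" into "happy".
filtered_stopwords = toy_stopwords - negations

def drop_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

tokens = ["i", "am", "not", "happy"]
print(drop_stopwords(tokens, toy_stopwords))       # ['happy']
print(drop_stopwords(tokens, filtered_stopwords))  # ['not', 'happy']
```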

In [None]:
stopword = nltk.corpus.stopwords.words('english')
print("stopword:\n",stopword)
print("\n\n There are some words that we want to keep, for example 'no', 'nor','not'\n")
words_to_keep = ["not","no","nor"]
stopword = [elem for elem in stopword if elem not in words_to_keep]
stopword.extend(["im","theyre","ive","p","alot","er",""]) # Other stopwords to remove
print("stopword:\n",stopword,"\n")

def remove_stopwords(text,stopword):
    text = [word for word in text if word not in stopword]
    return text
    
train_df['Tweet_nonstop'] = train_df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x,stopword))
pd.options.display.max_rows = 4000
train_df.head(head_nb)

### d.5 Stemming/Lemmatization - Transforming any form of a word to its root word
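
The difference between the two approaches can be illustrated with a toy example: stemming strips suffixes with rules and may produce non-words, while lemmatization looks up the dictionary form. The code below is only a sketch with hypothetical helpers; it is neither the Porter algorithm nor the WordNet lemmatizer used in this notebook.

```python
# Toy illustration only: real stemmers (e.g. Porter) apply many more rules,
# and real lemmatizers look words up in a lexical database such as WordNet.
SUFFIXES = ("ing", "ies", "es", "s")
TOY_LEMMAS = {"studies": "study", "running": "run", "cats": "cat"}

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "running", "cats"]:
    print(word, "->", crude_stem(word), "|", toy_lemmatize(word))
```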

In [None]:
stemmed_or_lemmatized = "Tweet_lemmatized" # "Tweet_lemmatized" OR "Tweet_stemmed"

### d.5.a Stemming

In [None]:
ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

train_df['Tweet_stemmed'] = train_df['Tweet_nonstop'].apply(lambda x: stemming(x))
train_df.head(head_nb)

### d.5.b Lemmatization

In [None]:
wn = nltk.WordNetLemmatizer()

def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

train_df['Tweet_lemmatized'] = train_df["Tweet_nonstop"].apply(lambda x: lemmatizer(x))
train_df.head(head_nb)

Now, we join the tokens of the chosen column (`stemmed_or_lemmatized`) back into a single string and store the result in `Tweet_lemmatized`. The following steps require strings rather than lists.

In [None]:
train_df['Tweet_lemmatized'] = train_df[stemmed_or_lemmatized].apply(lambda x: ' '.join(str(e) for e in x))
train_df.head(head_nb)

In [None]:
train_scruting = pd.DataFrame()
train_scruting['selected_text'] = train_df['selected_text']
train_scruting['Tweet_lemmatized'] = train_df['Tweet_lemmatized']
train_scruting.to_csv('train_scruting.csv', index=False)

### e. Tweet length

In [None]:
def plot_dist3(df, feature, title):
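    '''
    Plot the histogram, empirical CDF and box plot of the character count of a text column.
    Input:
        df: [Pandas] Dataset
        feature: [String] Column of tweets
        title: [String] Title of the figure
    '''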
    df['Character_Count'] = df[feature].apply(lambda x: len(str(x)))
    feature = 'Character_Count'
    # Creating a customized chart. and giving in figsize and everything.
    fig = plt.figure(constrained_layout=True, figsize=(18, 8))
    # Creating a grid of 3 cols and 3 rows.
    grid = gridspec.GridSpec(ncols=3, nrows=3, figure=fig)

    # Customizing the histogram grid.
    ax1 = fig.add_subplot(grid[:2, :2])
    # Set the title.
    ax1.set_title('Histogram')
    # plot the histogram.
    sns.distplot(df.loc[:, feature],
                 hist=True,
                 kde=True,
                 ax=ax1,
                 color='#e74c3c')
    ax1.set(ylabel='Frequency')
    ax1.xaxis.set_major_locator(MaxNLocator(nbins=20))

    # Customizing the ecdf_plot.
    ax2 = fig.add_subplot(grid[2:, :2])
    # Set the title.
    ax2.set_title('Empirical CDF')
    # Plotting the ecdf_Plot.
    sns.distplot(df.loc[:, feature],
                 ax=ax2,
                 kde_kws={'cumulative': True},
                 hist_kws={'cumulative': True},
                 color='#e74c3c')
    ax2.xaxis.set_major_locator(MaxNLocator(nbins=20))
    ax2.set(ylabel='Cumulative Probability')

    # Customizing the Box Plot.
    ax3 = fig.add_subplot(grid[:, 2])
    # Set title.
    ax3.set_title('Box Plot')
    # Plotting the box plot.
    sns.boxplot(y=feature, data=df, ax=ax3, color='#e74c3c')
    ax3.yaxis.set_major_locator(MaxNLocator(nbins=20))

    plt.suptitle(f'{title}', fontsize=24)
    
plot_dist3(train_df, "text",'Characters per all Tweets for the column "text"')
plot_dist3(train_df, "Tweet_lemmatized",'Characters per all Tweets for the column "Tweet_lemmatized"')

In [None]:
def plot_word_number_histogram(textne, textpo, textng,column_name):
    
    """A function for comparing word counts"""

    fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(18, 6), sharey=True)
    sns.distplot(textne.str.split().map(lambda x: len(x)), ax=axes[0], color='#e74c3c')
    sns.distplot(textpo.str.split().map(lambda x: len(x)), ax=axes[1], color='#e74c3c')
    sns.distplot(textng.str.split().map(lambda x: len(x)), ax=axes[2], color='#e74c3c')
    
    axes[0].set_xlabel('Word Count')
    axes[0].set_title('neutral')
    axes[1].set_xlabel('Word Count')
    axes[1].set_title('positive')
    axes[2].set_xlabel('Word Count')
    axes[2].set_title('negative')
    
    fig.suptitle('Word counts in tweets for the column "{}"'.format(column_name), fontsize=24, va='baseline')
    
    fig.tight_layout()
    
final_column = 'Tweet_lemmatized'
analysed_column = 'target'
    
plot_word_number_histogram(train_df[train_df[analysed_column] == neutral_label][original_column],
                           train_df[train_df[analysed_column] == positive_label][original_column],
                           train_df[train_df[analysed_column] == negative_label][original_column],original_column)

plot_word_number_histogram(train_df[train_df[analysed_column] == neutral_label][final_column],
                           train_df[train_df[analysed_column] == positive_label][final_column],
                           train_df[train_df[analysed_column] == negative_label][final_column],final_column)

In [None]:
def plot_word_len_histogram(textne, textpo, textng, column_name):
    
    """A function for comparing average word length"""
    
    fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(18, 6), sharey=True)
    sns.distplot(textne.str.split().apply(lambda x: [len(i) for i in x]).map(
        lambda x: np.mean(x)),
                 ax=axes[0], color='#e74c3c')
    sns.distplot(textpo.str.split().apply(lambda x: [len(i) for i in x]).map(
        lambda x: np.mean(x)),
                 ax=axes[1], color='#e74c3c')
    sns.distplot(textng.str.split().apply(lambda x: [len(i) for i in x]).map(
        lambda x: np.mean(x)),
                 ax=axes[2], color='#e74c3c')
    
    axes[0].set_xlabel('Mean Word Length')
    axes[0].set_ylabel('Density')
    axes[0].set_title('neutral')
    axes[1].set_xlabel('Mean Word Length')
    axes[1].set_title('positive')
    axes[2].set_xlabel('Mean Word Length')
    axes[2].set_title('negative')
    
    fig.suptitle('Mean Word Lengths for the column "{}"'.format(column_name), fontsize=24, va='baseline')
    fig.tight_layout()
    
plot_word_len_histogram(train_df[train_df[analysed_column] == neutral_label][original_column],
                           train_df[train_df[analysed_column] == positive_label][original_column],
                           train_df[train_df[analysed_column] == negative_label][original_column],original_column)
    
plot_word_len_histogram(train_df[train_df[analysed_column] == neutral_label][final_column],
                           train_df[train_df[analysed_column] == positive_label][final_column],
                           train_df[train_df[analysed_column] == negative_label][final_column],final_column)

### f. N-gram

### f.1 Common unigrams for all tweets:

In [None]:
from collections import Counter
import plotly.express as px

def top_most_common_words(dataset,column_name,top_nb,sentiment):
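    """
    Plot the most common words of a text column as a horizontal bar chart.
    Inputs:
        dataset: [Pandas] Dataset
        column_name: [String] Name of the column we are interested in
        top_nb: [Int] Number of most common words to display
        sentiment: [String] "all", "positive", "neutral" or "negative" (used in the title only)
    """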
    dataset['temp_list'] = dataset[column_name].apply(lambda x:str(x).split())
    top = Counter([item for sublist in dataset['temp_list'] for item in sublist])
    temp = pd.DataFrame(top.most_common(top_nb))
    temp.columns = ['Common_words','count']
    fig = px.bar(temp, x="count", y="Common_words",title='Common Words in {} for {} tweets'.format(column_name,sentiment), orientation='h', 
             width=700, height=700,color='Common_words',text='count')
    return fig.show(renderer="notebook")
    
top_nb=25

top_most_common_words(train_df,original_column,top_nb,"all")

top_most_common_words(train_df,final_column,top_nb,"all")

### f.2 Most common unigrams (1-gram):

In [None]:
top_most_common_words(train_df[train_df[analysed_column]==neutral_label],final_column,top_nb,"neutral")
top_most_common_words(train_df[train_df[analysed_column]==positive_label],final_column,top_nb,"positive")
top_most_common_words(train_df[train_df[analysed_column]==negative_label],final_column,top_nb,"negative")

### f.3 Most common 2-grams:
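
An n-gram count can mean two different things: the total number of occurrences, or the number of tweets containing the n-gram at least once. The sketch below contrasts both on made-up tweets; `ngrams` is a hypothetical helper, and wrapping the n-grams in `set()` is what selects the per-tweet count.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

tweets = ["so so so happy", "so happy today"]

occurrences = Counter()      # every occurrence counts
tweet_frequency = Counter()  # each tweet counts an n-gram at most once
for tweet in tweets:
    grams = ngrams(tweet.split(), 2)
    occurrences.update(grams)
    tweet_frequency.update(set(grams))

print(occurrences[("so", "so")])         # 2
print(tweet_frequency[("so", "so")])     # 1
print(tweet_frequency[("so", "happy")])  # 2
```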

In [None]:
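import collections

# Count, for each 2-gram, the number of tweets that contain it
# (set() ensures a 2-gram is counted at most once per tweet).
c = collections.Counter()
for i in train_df["Tweet_lemmatized"]:
    x = i.rstrip().split(" ")
    c.update(set(zip(x[:-1], x[1:])))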
    
# c.most_common()

### f.4 Most common 3-grams:

In [None]:
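# Count, for each 3-gram, the number of tweets that contain it
# (set() ensures a 3-gram is counted at most once per tweet).
c = collections.Counter()
for i in train_df["Tweet_lemmatized"]:
    x = i.rstrip().split(" ")
    c.update(set(zip(x[:-2], x[1:-1], x[2:])))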
    
# c.most_common()

### g. Apply the preprocessing on the test df
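
The test set must go through exactly the same steps as the training set. One way to make that hard to get wrong is to wrap the steps in a single function and call it on both splits. The sketch below is a simplified stand-in for the notebook's pipeline: `clean_tweet`, the toy stopword set and the sample texts are all hypothetical.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)
TOY_STOPWORDS = {"the", "a", "is"}  # illustrative, not NLTK's list

def clean_tweet(text):
    """Apply punctuation removal, lowercasing, tokenization and stopword removal."""
    text = text.translate(PUNCT_TABLE).lower()
    tokens = [tok for tok in re.split(r"\s+", text) if tok]
    return " ".join(tok for tok in tokens if tok not in TOY_STOPWORDS)

train_texts = ["The movie is GREAT!!!"]
test_texts = ["A great day, is it not?"]
# The same function is applied to both splits.
print([clean_tweet(t) for t in train_texts])  # ['movie great']
print([clean_tweet(t) for t in test_texts])   # ['great day it not']
```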

In [None]:
test_df['Tweet_punct'] = test_df[original_column].apply(lambda x: remove_punct(x,punctuation_string))
test_df['Tweet_tokenized'] = test_df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
test_df['Tweet_tokenized'] = test_df['Tweet_tokenized'].apply(lambda x: replace_words(x,replaced_words))
test_df['Tweet_nonstop'] = test_df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x,stopword))
test_df['Tweet_stemmed'] = test_df['Tweet_nonstop'].apply(lambda x: stemming(x))
test_df['Tweet_lemmatized'] = test_df['Tweet_nonstop'].apply(lambda x: lemmatizer(x))
test_df['Tweet_lemmatized'] = test_df[stemmed_or_lemmatized].apply(lambda x: ' '.join(str(e) for e in x))

In [None]:
test_df.head(10)

## Initializing the training and validation datasets

We create a validation dataset from the training data with:
<h4 align="center">(90% train_df - 10% val_df)</h4>
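
A plain random split does not guarantee that the class proportions are the same in both parts. A stratified split, which scikit-learn's `train_test_split` offers through its `stratify` argument, splits each class separately. The pure-Python sketch below illustrates the idea on made-up labels; `stratified_split` is a hypothetical helper and not the method used in this notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=2021):
    """Return (train_indices, val_indices) keeping per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Each class contributes the same fraction to the validation set.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Made-up labels: 20 negative, 50 neutral, 30 positive.
toy_labels = [0] * 20 + [1] * 50 + [2] * 30
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
print(sorted(toy_labels[i] for i in val_idx))
```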

In [None]:
# Hold out 10% of the training data for validation (not stratified; see the commented-out option)
X_train, X_val, y_train, y_val = train_test_split(
    train_df[final_column], train_df["target"], test_size=0.1, random_state=2021
)  # stratify=train_df["target"]

## 2. BERT Transformer
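
BERT expects every sequence in a batch to have the same length, so shorter sequences are padded and an attention mask tells the model which positions are real tokens. The sketch below illustrates that idea only; it is not the tokenizer's implementation, `pad_and_mask` is a hypothetical helper, and unlike a real tokenizer it does not preserve the final `[SEP]` token when truncating.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad a list of token ids and build its attention mask."""
    truncated = token_ids[:max_len]
    n_pad = max_len - len(truncated)
    input_ids = truncated + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(truncated) + [0] * n_pad
    return input_ids, attention_mask

# Made-up ids: 101 and 102 stand in for [CLS] and [SEP].
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```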

In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def text_preprocessing(text):
    """
    - Remove entity mentions (eg. '@united')
    - Correct errors (eg. '&amp;' to '&')
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Remove '@name'
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Collapse runs of whitespace into a single space and strip both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=text_preprocessing(sent),  # Preprocess sentence
            add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,             # Max length to truncate/pad
            padding='max_length',           # Pad sentence to max length (replaces the deprecated `pad_to_max_length`)
            truncation=True,                # Truncate sentences longer than max length
            #return_tensors='pt',           # Return PyTorch tensor
            return_attention_mask=True      # Return attention mask
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

In [None]:
# Specify `MAX_LEN`
MAX_LEN = 160

print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train.values)
val_labels = torch.tensor(y_val.values)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 32

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

In [None]:
%%time
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassifier class
class BertClassifier(nn.Module):
    """BERT model for classification tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    freeze_bert (bool): Set `True` to freeze the BERT weights and train
                      only the classifier head; set `False` to fine-tune BERT as well
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 50, 3

        # Instantiate BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Instantiate a feed-forward classifier with one hidden layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(H, D_out)
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

In [None]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)

    # Move the model to the selected device (GPU if available, otherwise CPU)
    bert_classifier.to(device)

    # Create the optimizer (PyTorch's AdamW; the `AdamW` class formerly shipped
    # with `transformers` is deprecated)
    optimizer = AdamW(bert_classifier.parameters(),
                      lr=1e-5,           # Learning rate
                      eps=1e-8,          # Epsilon for numerical stability
                      weight_decay=0.0   # Matches the default of the former transformers AdamW
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler

In [None]:
import random
import time
import pickle

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the BertClassifier model.

    Relies on the global `optimizer` and `scheduler` created by `initialize_model`.
    """
    # Start training loop
    best_val = 0.
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")
                
                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)
            train_loss, train_accuracy = evaluate(model, train_dataloader)
            
            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
        
        # `val_accuracy` only exists when evaluation is enabled, so check the flag first
        if evaluation and val_accuracy > best_val:
            # save the model to disk
            filename = 'finalized_model.sav'
            with open(filename, 'wb') as f:
                pickle.dump(model, f)
            best_val = val_accuracy
            print('Best validation accuracy reached. Saved model classifier.')
        print('_'*50,"\n")
            
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []
#     all_probs = []
#     all_preds = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

#         all_probs.append(logits)
        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()
#         all_preds.append(preds)

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy #, torch.cat(all_preds), torch.cat(all_probs)
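
The training loop above clips the gradient norm to 1.0 before each optimizer step. As a rough illustration of what that clipping does, here is a minimal pure-Python sketch. The helper `clip_by_global_norm` and the toy gradients are our own assumptions for the example, not part of PyTorch; `torch.nn.utils.clip_grad_norm_` applies the same idea to the model's parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient vectors so that their joint L2 norm is at most max_norm."""
    # The norm is computed over all gradients together, not per vector
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Gradients are only ever scaled down, never up
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Toy gradients for two hypothetical parameter groups: joint norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

The direction of the update is preserved; only its length is capped, which is why clipping helps against exploding gradients without changing which way the parameters move.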

In [None]:
%%time
set_seed(42)    # Set seed for reproducibility
# Use the same number of epochs for the model initialization and the training loop
bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
train(bert_classifier, train_dataloader, val_dataloader, epochs=3, evaluation=True)

In [None]:
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score
def evaluate_with_auc(model, dataloader):
    """Evaluate the model on a dataloader and return the average loss, the accuracy,
    the AUC, the one-hot encoded labels and the predicted class probabilities.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    output_accuracy = []
    output_loss = []
    all_logits = []
    all_b_labels = []
#     all_probs = []
#     all_preds = []

    # For each batch in the dataloader...
    for batch in dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        output_loss.append(loss.item())

#         all_probs.append(logits)
        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()
#         all_preds.append(preds)

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        output_accuracy.append(accuracy)
        
        all_logits.append(logits)
        all_b_labels.append(b_labels)

        
    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)
    
    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()
    
    all_b_labels = torch.cat(all_b_labels, dim=0)
    b_labels_dummies = pd.get_dummies(all_b_labels.cpu().numpy())
    #print(b_labels_dummies)
    
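    # With one-hot targets, scikit-learn treats the problem as multilabel and ignores
    # `multi_class`: the score is the macro-average of the per-class (one-vs-rest) AUCs.
    output_auc = roc_auc_score(b_labels_dummies, probs, average='macro')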
        
    # Compute the average accuracy and loss over the validation set.
    output_loss = np.mean(output_loss)
    output_accuracy = np.mean(output_accuracy)

    return output_loss, output_accuracy, output_auc, b_labels_dummies,probs #, torch.cat(all_preds), torch.cat(all_probs)
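
To make the AUC reported by `evaluate_with_auc` more concrete, the sketch below computes a macro-averaged one-vs-rest AUC by hand, which is what `roc_auc_score` returns when it receives one-hot targets. The helpers `binary_auc` and `macro_ovr_auc` and the toy data are assumptions made for illustration; they are not the scikit-learn implementation.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random positive is ranked above a random negative.
    Ties count for one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(labels, probs):
    """Average the one-vs-rest AUC of each class, giving every class the same weight."""
    n_classes = len(probs[0])
    aucs = []
    for c in range(n_classes):
        y_bin = [1 if y == c else 0 for y in labels]      # class c against the rest
        aucs.append(binary_auc(y_bin, [p[c] for p in probs]))
    return sum(aucs) / n_classes, aucs

# Toy example: 6 tweets, 3 classes (0 = negative, 1 = neutral, 2 = positive)
labels = [0, 1, 2, 1, 0, 2]
probs = [[0.7, 0.2, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.3, 0.6],
         [0.5, 0.2, 0.3],   # a neutral tweet the model gets wrong
         [0.6, 0.3, 0.1],
         [0.2, 0.2, 0.6]]
macro_auc, per_class_auc = macro_ovr_auc(labels, probs)
print(per_class_auc, macro_auc)
```

In this toy example only the neutral class is ranked imperfectly, so its AUC pulls the macro average below 1 even though the other two classes are separated perfectly.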

In [None]:
Y_val_dummies = pd.get_dummies(y_val.values)
Y_train_dummies = pd.get_dummies(y_train.values)

In [None]:
train_loss, train_accuracy, train_auc, b_labels_train_dum, train_probs = evaluate_with_auc(bert_classifier,train_dataloader)
val_loss, val_accuracy, val_auc , b_labels_val_dum, val_probs = evaluate_with_auc(bert_classifier,val_dataloader)

In [None]:
print("(train_loss = {:.4f}, train_accuracy = {:.4f})".format(train_loss, train_accuracy/100))
print("\n(train_loss = {:.4f},train_auc = {:.4f})".format(train_loss, train_auc))

print("\n(val_loss = {:.4f}, val_accuracy = {:.4f})".format(val_loss, val_accuracy/100))
print("\n(val_loss = {:.4f}, val_auc = {:.4f})".format(val_loss, val_auc))

#### Predict on the training and validation sets

In [None]:
train_prediction = train_probs.argmax(axis=1)
val_prediction = val_probs.argmax(axis=1)

In [None]:
import torch.nn.functional as F

def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)
    
    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)
    
    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()
    
    return probs.argmax(axis=1), probs

# Reload the best model saved during training
with open('finalized_model.sav', 'rb') as f:
    bert_classifier = pickle.load(f)
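
The last step of `bert_predict` turns logits into probabilities with a softmax and then takes the most probable class. A minimal pure-Python sketch of that step is shown below; the function names and the toy logits are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits_batch):
    """Return the predicted class index and the class probabilities for each row."""
    probs = [softmax(row) for row in logits_batch]
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return preds, probs

# Toy logits for two tweets over the classes (negative, neutral, positive)
logits_batch = [[2.0, 0.5, -1.0], [-0.3, 0.1, 1.2]]
preds, probs = predict(logits_batch)

# Class indices {0, 1, 2} map back to the sentiment labels {-1, 0, 1}
sentiments = [p - 1 for p in preds]
print(preds, sentiments)
```

Because softmax is monotonic, taking the argmax of the probabilities gives the same class as taking the argmax of the logits; the probabilities are only needed for metrics such as the AUC.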

In [None]:
import pandas as pd
import numpy as np
from numpy import interp  # scipy.interp was only a deprecated alias of numpy.interp

from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc, accuracy_score
from sklearn.preprocessing import LabelBinarizer

def class_report(y_true, y_pred, y_score=None, average='macro'):
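    '''
    Build a per-class classification report for a multiclass problem.

    Inputs:
        y_true: the actual labels
        y_pred: the predicted labels
        y_score: optional class scores or probabilities, used to add an AUC column
        average: 'micro' or 'macro', how the summary AUC is computed
    Output:
        DataFrame with precision, recall, f1-score, support (actual count) and
        pred (predicted count) per class, plus an 'avg / total' row
    '''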
    if y_true.shape != y_pred.shape:
        print("Error! y_true %s is not the same shape as y_pred %s" % (
              y_true.shape,
              y_pred.shape)
        )
        return

    lb = LabelBinarizer()

    if len(y_true.shape) == 1:
        lb.fit(y_true)

    #Value counts of predictions
    labels, cnt = np.unique(
        y_pred,
        return_counts=True)
    n_classes = len(labels)
    pred_cnt = pd.Series(cnt, index=labels)

    metrics_summary = precision_recall_fscore_support(
            y_true=y_true,
            y_pred=y_pred,
            labels=labels)

    avg = list(precision_recall_fscore_support(
            y_true=y_true, 
            y_pred=y_pred,
            average='weighted'))

    metrics_sum_index = ['precision', 'recall', 'f1-score', 'support']
    class_report_df = pd.DataFrame(
        list(metrics_summary),
        index=metrics_sum_index,
        columns=labels)

    support = class_report_df.loc['support']
    total = support.sum() 
    class_report_df['avg / total'] = avg[:-1] + [total]

    class_report_df = class_report_df.T
    class_report_df['pred'] = pred_cnt
    # Avoid chained assignment, which may silently write to a copy
    class_report_df.loc['avg / total', 'pred'] = total

    if not (y_score is None):
        fpr = dict()
        tpr = dict()
        roc_auc = dict()
        for label_it, label in enumerate(labels):
            fpr[label], tpr[label], _ = roc_curve(
                (y_true == label).astype(int), 
                y_score[:, label_it])

            roc_auc[label] = auc(fpr[label], tpr[label])

        if average == 'micro':
            if n_classes <= 2:
                fpr["avg / total"], tpr["avg / total"], _ = roc_curve(
                    lb.transform(y_true).ravel(), 
                    y_score[:, 1].ravel())
            else:
                fpr["avg / total"], tpr["avg / total"], _ = roc_curve(
                        lb.transform(y_true).ravel(), 
                        y_score.ravel())

            roc_auc["avg / total"] = auc(
                fpr["avg / total"], 
                tpr["avg / total"])

        elif average == 'macro':
            # First aggregate all false positive rates
            all_fpr = np.unique(np.concatenate([
                fpr[i] for i in labels]
            ))

            # Then interpolate all ROC curves at this points
            mean_tpr = np.zeros_like(all_fpr)
            for i in labels:
                mean_tpr += interp(all_fpr, fpr[i], tpr[i])

            # Finally average it and compute AUC
            mean_tpr /= n_classes

            fpr["macro"] = all_fpr
            tpr["macro"] = mean_tpr

            roc_auc["avg / total"] = auc(fpr["macro"], tpr["macro"])

        class_report_df['AUC'] = pd.Series(roc_auc)
        
    class_report_df.rename(index={0:'-1 (negative)',1:'0 (neutral)',2:'1 (positive)'}, inplace=True)
    class_report_df["support"] = class_report_df["support"].astype(int)
    class_report_df["pred"] = class_report_df["pred"].astype(int)
#     values = confusion_matrix(y_true, y_pred, normalize="all").diagonal()
#     class_report_df.insert(0, 'accuracy', pd.DataFrame(np.append(values, np.mean(values))).values)
    return class_report_df
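
The precision, recall and F1 columns produced by `class_report` can be derived directly from counts of true positives, false positives and false negatives. The sketch below shows that computation on toy labels; `per_class_prf` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def per_class_prf(y_true, y_pred, labels):
    """Compute precision, recall, F1 and support for each class from raw labels."""
    report = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn}
    return report

# Toy labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_prf(y_true, y_pred, labels=[0, 1, 2])
for label, metrics in report.items():
    print(label, metrics)
```

Here the neutral class has perfect recall but lower precision, because one negative tweet was also predicted as neutral.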

In [None]:
print("For the training:\n")
class_report(np.array(b_labels_train_dum).argmax(axis=1), train_prediction)

In [None]:
import itertools
from sklearn.metrics import confusion_matrix
    
def plot_confusion_matrix(cm, classes, title, normalize=False, cmap=plt.cm.Greens):
    """Plot a confusion matrix (adapted from an earlier AML group challenge)."""
    if normalize:
        # Normalize each row before plotting so colors and text show the same values
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    cm = np.round(cm, 2)

    plt.figure(figsize=(5, 4))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    # White text on dark cells, black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()

cm=confusion_matrix(np.array(b_labels_train_dum).argmax(axis=1), train_prediction)
plot_confusion_matrix(cm, classes=["negative","neutral","positive"],title='Confusion matrix for training')

In [None]:
print("For the validation:\n")
class_report(np.array(b_labels_val_dum).argmax(axis=1), val_prediction)

In [None]:
cm = confusion_matrix(np.array(b_labels_val_dum).argmax(axis=1), val_prediction)
plot_confusion_matrix(cm, classes=["negative","neutral","positive"],title='Confusion matrix for validation')
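
The `normalize` option of `plot_confusion_matrix` divides each row by its sum, so each cell becomes the fraction of a true class that received a given prediction. A small pure-Python sketch of both steps, with hypothetical helper names and toy labels:

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def normalize_rows(cm):
    """Divide each row by its total so that every row sums to 1 (empty rows stay at 0)."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# Toy labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm_counts = confusion_matrix_counts(y_true, y_pred, n_classes=3)
cm_rates = normalize_rows(cm_counts)
print(cm_counts)
print(cm_rates)
```

After row normalization the diagonal holds the per-class recall, which makes classes of different sizes directly comparable.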

In [None]:
# Creating a submission
test_inputs, test_masks = preprocessing_for_bert(test_df[final_column].values)
test_data = TensorDataset(test_inputs, test_masks)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

test_predictions, _ = bert_predict(bert_classifier, test_dataloader)

submission_df = pd.DataFrame()
submission_df['textID'] = test_df['textID']
submission_df['sentiment'] = test_predictions - 1  # map class indices {0, 1, 2} back to labels {-1, 0, 1}
submission_df.to_csv('TA_baseline_NB.csv', index=False)
submission_df[final_column] = test_df[final_column]
submission_df["selected_text"] = test_df["selected_text"]
submission_df.to_csv('To_analyse.csv', index=False)
#submission_df

In [None]:
raise SystemExit("Exit from script") # to stop the running here 

## 3. Bag of Words

### LSTM

In [None]:
import re
import gensim
from nltk.tokenize.treebank import TreebankWordDetokenizer
from keras.preprocessing.text import Tokenizer

# bag of words using Keras
from keras.preprocessing.sequence import pad_sequences
from keras import regularizers
from tqdm import tqdm


max_words = 5000
max_len = 200

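# Shared across calls so that every split uses the same word index
_fitted_tokenizer = None

def Tokenize(data):
    """Convert texts to padded integer sequences.

    The tokenizer is fit on the first call only (the training texts) and reused afterwards.
    """
    global _fitted_tokenizer
    if _fitted_tokenizer is None:
        _fitted_tokenizer = Tokenizer(num_words=max_words)
        _fitted_tokenizer.fit_on_texts(data)
    sequences = _fitted_tokenizer.texts_to_sequences(data)
    return pad_sequences(sequences, maxlen=max_len)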


def inverse_dummy_function(val_pred_lstm):
    """Map one-hot rows [1,0,0], [0,1,0], [0,0,1] back to the labels -1, 0, 1."""
    # The position of the 1 is the class index; shifting by one gives the sentiment label
    return (np.argmax(val_pred_lstm, axis=1) - 1).tolist()
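
A word index is normally built from the training texts only and then reused, so that the same word maps to the same integer in the training, validation and test sets. The sketch below illustrates this together with left padding, in plain Python. The helpers `fit_vocab` and `texts_to_padded` are assumptions for illustration and only approximate what the Keras `Tokenizer` and `pad_sequences` do (for instance, they split on whitespace only).

```python
from collections import Counter

def fit_vocab(texts, max_words):
    """Assign ids 1..max_words-1 to the most frequent words; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    most_common = counts.most_common(max_words - 1)
    return {w: i + 1 for i, (w, _) in enumerate(most_common)}

def texts_to_padded(texts, vocab, max_len):
    """Map words to ids (dropping unknown words), keep the last max_len ids, pad on the left."""
    padded = []
    for t in texts:
        seq = [vocab[w] for w in t.lower().split() if w in vocab]
        seq = seq[-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

train_texts = ["good good day", "bad day"]
val_texts = ["good movie"]              # "movie" never appears in the training texts

vocab = fit_vocab(train_texts, max_words=10)        # fit on the training texts only
train_pad = texts_to_padded(train_texts, vocab, max_len=4)
val_pad = texts_to_padded(val_texts, vocab, max_len=4)
print(vocab)
print(train_pad, val_pad)
```

If a separate vocabulary were fit on each split, the id of "good" could differ between training and validation, and the embedding layer would see unrelated inputs at prediction time.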

In [None]:
#put the right target
target_conversion = {
    0: -1,
    1: 0,
    2: 1,
    
}

y_train_labels = y_train.map(target_conversion)
y_val_labels = y_val.map(target_conversion)

In [None]:
#one-hot encoding
Y_train_dummies= pd.get_dummies(y_train_labels).values
Y_val_dummies= pd.get_dummies(y_val_labels).values

#Adapting input values for LSTM
X_train_tokenized = Tokenize(X_train)
X_val_tokenized = Tokenize(X_val)

In [None]:
import tensorflow as tf
from keras.models import Sequential
from keras import layers
from keras import regularizers
from keras import backend as K
from keras.layers import Embedding
from keras.callbacks import ModelCheckpoint

# Build the LSTM classifier: embedding -> LSTM -> dense -> dropout -> softmax.
vocabulary_size = max_words  # 5000
def get_keras_model(lstm_units,
                    neurons_dense,
                    dropout_rate,
                    embedding_size,
                    max_text_len):
    # define the layers.
    
    inputs = tf.keras.Input(shape=(max_text_len,))
    x = layers.Embedding(vocabulary_size, embedding_size)(inputs)
    x = layers.LSTM(units=lstm_units)(x)
    x = layers.Dense(neurons_dense, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)

    outputs = layers.Dense(3, activation='softmax')(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
import tensorflow as tf
tf.keras.backend.clear_session()


max_text_len = 663
model_lstm_cv = get_keras_model(8,
                        47,
                        0.052,
                        58,
                        max_text_len)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Specify the training configuration.
model_lstm_cv.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#tokenizer0 = Tokenizer(num_words=vocabulary_size)
#tokenizer0.fit_on_texts(data_df['Tweet_lemmatized'].values)
#X_train_seq0 = tokenizer0.texts_to_sequences(data_df['Tweet_lemmatized'].values)
X_train_seq0_padded = pad_sequences(X_train_tokenized, maxlen=max_text_len)
X_val_seq0_padded = pad_sequences(X_val_tokenized, maxlen=max_text_len)
#y_train0 = df_train0['sentiment'].values

# Save the model whenever the validation accuracy improves (checked after every epoch).
checkpoint1 = ModelCheckpoint("best_model1.hdf5", monitor='val_accuracy', verbose=1,
                              save_best_only=True, mode='auto', save_weights_only=False)

In [None]:
model_lstm_cv.fit(x=X_train_seq0_padded,
          y=Y_train_dummies,
          batch_size=50, 
          epochs=3,
          validation_data=(X_val_seq0_padded, Y_val_dummies),
          callbacks=[checkpoint1])

### Evaluating the model

In [None]:
#validation
val_pred_model_lstm_cv_prob = model_lstm_cv.predict(X_val_seq0_padded)
val_pred_model_lstm_cv = (val_pred_model_lstm_cv_prob == val_pred_model_lstm_cv_prob.max(axis=1)[:,None]).astype(int)
val_pred_lstm_cv = inverse_dummy_function(val_pred_model_lstm_cv)

#training
train_pred_model_lstm_cv_prob = model_lstm_cv.predict(X_train_seq0_padded)
train_pred_model_lstm_cv = (train_pred_model_lstm_cv_prob == train_pred_model_lstm_cv_prob.max(axis=1)[:,None]).astype(int)
train_pred_lstm_cv = inverse_dummy_function(train_pred_model_lstm_cv)

In [None]:
accuracy_model = (val_pred_lstm_cv == y_val_labels.values).mean()
print('The validation accuracy of our LSTM classifier is: {:.2f}%'.format(accuracy_model*100))

accuracy_model = (train_pred_lstm_cv == y_train_labels.values).mean()
print('The training accuracy of our LSTM classifier is: {:.2f}%'.format(accuracy_model*100))

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

# One-hot targets: the score is the macro-averaged one-vs-rest AUC
auc_lstm_val = roc_auc_score(Y_val_dummies, val_pred_model_lstm_cv_prob)
auc_lstm_train = roc_auc_score(Y_train_dummies, train_pred_model_lstm_cv_prob)
print('The validation AUC of our LSTM classifier is: {:.2f}'.format(auc_lstm_val))
print('The training AUC of our LSTM classifier is: {:.2f}'.format(auc_lstm_train))

#### Submitting the LSTM predictions

In [None]:
X_test_tokenized = Tokenize(test_df[final_column].values)

X_test_seq0_padded = pad_sequences(X_test_tokenized, maxlen=max_text_len)
test_pred_lstm = model_lstm_cv.predict(X_test_seq0_padded)

test_pred_lstm = (test_pred_lstm == test_pred_lstm.max(axis=1)[:,None]).astype(int)
test_pred_lstm_final = inverse_dummy_function(test_pred_lstm)

submission_df = pd.DataFrame()
submission_df['textID'] = test_df['textID']
submission_df['sentiment'] = test_pred_lstm_final
submission_df.to_csv('LSTM_submission.csv', index=False)

### K-NN & RandomForest & MultiLayerPerceptron

In [None]:
# preparation of data :

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import xgboost as xgb

countV = CountVectorizer()  # Bag of words
countV.fit(X_train)  # Fit the vocabulary on the training tweets


def train_test_model(model, train_df, test_df, y, y_test, model_name):
    """Fit a bag-of-words pipeline and print the AUC and accuracy on both splits."""
    model_bow = Pipeline([('countV_bayes', countV), ('model', model)])
    model_bow.fit(train_df, y)
    probas_train = model_bow.predict_proba(train_df)
    probas_test = model_bow.predict_proba(test_df)
    y_pred_train = model_bow.predict(train_df)
    y_pred_test = model_bow.predict(test_df)
    # One-hot targets: the score is the macro-averaged one-vs-rest AUC
    auc_train = roc_auc_score(pd.get_dummies(y).values, probas_train)
    auc_test = roc_auc_score(pd.get_dummies(y_test).values, probas_test)
    accuracy_train = np.mean(y_pred_train == y)
    accuracy_test = np.mean(y_pred_test == y_test)

    print("Training score using {}: AUC = {}, accuracy = {}".format(model_name, auc_train, accuracy_train))
    print("Held-out score using {}: AUC = {}, accuracy = {}".format(model_name, auc_test, accuracy_test))
    print("END##############################")
    
train_test_model(RandomForestClassifier(n_estimators=300,max_depth=40),X_train,X_val,y_train,y_val,"Random Forest")
train_test_model(KNeighborsClassifier(n_neighbors=4),X_train,X_val,y_train,y_val,"K - nearest Neighbors")
train_test_model(MLPClassifier(solver='lbfgs',
                                         alpha=1e-5,
                                         hidden_layer_sizes=(300,300,300),
                                         random_state=1),X_train,X_val,y_train,y_val,"MLP")
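
The bag-of-words representation used above turns each tweet into a vector of word counts over a vocabulary learned from the training tweets. As a rough illustration, here is a minimal pure-Python sketch. The helpers `fit_vocabulary` and `transform` are assumptions for this example and simplify what `CountVectorizer` does (they split on whitespace and keep every token).

```python
from collections import Counter

def fit_vocabulary(texts):
    """Learn the vocabulary from the training texts; columns are sorted alphabetically."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocab):
    """Count how often each vocabulary word appears in each text; unknown words are ignored."""
    rows = []
    for t in texts:
        counts = Counter(w for w in t.lower().split() if w in vocab)
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

train_texts = ["i love this", "this is bad bad"]
val_texts = ["love this movie"]          # "movie" is not in the training vocabulary

vocab = fit_vocabulary(train_texts)
X_train_bow = transform(train_texts, vocab)
X_val_bow = transform(val_texts, vocab)
print(list(vocab))
print(X_train_bow, X_val_bow)
```

Word order is lost in this representation: only the counts remain, which is what the Random Forest, K-NN and MLP classifiers above receive as input.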

## 4. Word2Vec

### K-NN & RandomForest & MultiLayerPerceptron

In [None]:
## Word2Vec 

import gensim

# Lowercase and tokenize each training tweet
tweets = X_train.apply(lambda x: tokenization(x.lower())).values

model_w2v = gensim.models.Word2Vec(
            tweets,
            vector_size=663, # desired no. of features/independent variables
            window=5, # context window size
            min_count=2, # Ignores all words with total frequency lower than 2.                                  
            sg = 1, # 1 for skip-gram model
            hs = 0,
            negative = 10, # for negative sampling
            workers= 32, # no.of cores
            seed = 34
) 

model_w2v.train(tweets, total_examples= len(tweets), epochs=20)

def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0
    for word in tokens:
        try:
            vec += model_w2v.wv[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

train_size = len(tweets)
train_arrays = np.zeros((train_size, 663))

def wordLists(data) :
    size = len(data)
    output = np.zeros((size, 663))
    for i in range(size):
        output[i,:] = word_vector(data[i], 663)
    return output 
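
The `word_vector` function above represents a tweet by the average of the vectors of its words, skipping words that are not in the Word2Vec vocabulary. The sketch below shows the same idea with a toy dictionary of 2-dimensional vectors; the `embeddings` dictionary and the helper `average_vector` are assumptions for illustration and stand in for the trained `model_w2v.wv`.

```python
def average_vector(tokens, embeddings, size):
    """Average the vectors of the known tokens; return a zero vector if none is known."""
    total = [0.0] * size
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:          # token not in the vocabulary
            continue
        total = [a + b for a, b in zip(total, vec)]
        count += 1
    if count:
        total = [v / count for v in total]
    return total

# Toy 2-dimensional embeddings (a trained model would provide these)
embeddings = {"good": [1.0, 0.0], "day": [0.0, 1.0]}

tweet_vec = average_vector(["good", "day", "unknownword"], embeddings, size=2)
empty_vec = average_vector(["unknownword"], embeddings, size=2)
print(tweet_vec, empty_vec)
```

Averaging gives every tweet a fixed-size vector regardless of its length, which is what the scikit-learn classifiers below require, at the cost of discarding word order.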

In [None]:
train_wordw2v = wordLists(X_train.apply(lambda x: tokenization(x.lower())).values)
val_wordw2v = wordLists(X_val.apply(lambda x: tokenization(x.lower())).values)
test_wordw2v = wordLists(test_df.Tweet_stemmed.values)

In [None]:
def train_test_model_W2V(model, train_df, test_df_, y, y_test, model_name):
    """Fit a classifier on averaged Word2Vec features, print AUC and accuracy, and return it."""
    clf = model  # local name, so the global Word2Vec model `model_w2v` is not shadowed
    clf.fit(train_df, y)
    probas_train = clf.predict_proba(train_df)
    probas_test = clf.predict_proba(test_df_)
    y_pred_train = clf.predict(train_df)
    y_pred_test = clf.predict(test_df_)
    # One-hot targets: the score is the macro-averaged one-vs-rest AUC
    auc_train = roc_auc_score(pd.get_dummies(y).values, probas_train)
    auc_test = roc_auc_score(pd.get_dummies(y_test).values, probas_test)
    accuracy_train = np.mean(y_pred_train == y)
    accuracy_test = np.mean(y_pred_test == y_test)

    print("Training score using {}: AUC = {}, accuracy = {}".format(model_name, auc_train, accuracy_train))
    print("Held-out score using {}: AUC = {}, accuracy = {}".format(model_name, auc_test, accuracy_test))
    print("END##############################")
    return clf


In [None]:
train_test_model_W2V(RandomForestClassifier(n_estimators=300,max_depth=40),train_wordw2v,val_wordw2v,y_train,y_val,"Random Forest")
train_test_model_W2V(KNeighborsClassifier(n_neighbors=4),train_wordw2v,val_wordw2v,y_train,y_val,"K - nearest Neighbors")
mlp = train_test_model_W2V(MLPClassifier(solver='lbfgs',
                                         alpha=1e-5,
                                         hidden_layer_sizes=(300,300,300),
                                         random_state=1),train_wordw2v,val_wordw2v,y_train,y_val,"MLP")

### LSTM 

In [None]:
model_lstm_cv.fit(x=X_train_seq0_padded,
          y=Y_train_dummies,
          batch_size=50, 
          epochs=3,
          validation_data=(X_val_seq0_padded, Y_val_dummies),
          callbacks=[checkpoint1])

#validation
val_pred_model_lstm_cv_prob = model_lstm_cv.predict(X_val_seq0_padded)
val_pred_model_lstm_cv = (val_pred_model_lstm_cv_prob == val_pred_model_lstm_cv_prob.max(axis=1)[:,None]).astype(int)
val_pred_lstm_cv = inverse_dummy_function(val_pred_model_lstm_cv)

#training
train_pred_model_lstm_cv_prob = model_lstm_cv.predict(X_train_seq0_padded)
train_pred_model_lstm_cv = (train_pred_model_lstm_cv_prob == train_pred_model_lstm_cv_prob.max(axis=1)[:,None]).astype(int)
train_pred_lstm_cv = inverse_dummy_function(train_pred_model_lstm_cv)

accuracy_model = (val_pred_lstm_cv == y_val_labels.values).mean()
print('The validation accuracy of our LSTM classifier is: {:.2f}%'.format(accuracy_model*100))

accuracy_model = (train_pred_lstm_cv == y_train_labels.values).mean()
print('The training accuracy of our LSTM classifier is: {:.2f}%'.format(accuracy_model*100))

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

# One-hot targets: the score is the macro-averaged one-vs-rest AUC
auc_lstm_val = roc_auc_score(Y_val_dummies, val_pred_model_lstm_cv_prob)
auc_lstm_train = roc_auc_score(Y_train_dummies, train_pred_model_lstm_cv_prob)
print('The validation AUC of our LSTM classifier is: {:.2f}'.format(auc_lstm_val))
print('The training AUC of our LSTM classifier is: {:.2f}'.format(auc_lstm_train))

### Conclusion 
To conclude, through this challenge we explored several text representation techniques: bag-of-words, Word2Vec and BERT. We learned which basic preprocessing steps can be applied to text data. By comparing the models on accuracy and AUC, we selected the BERT transformer, and our submissions use its outputs. Natural language processing is a broad domain, and we saw how difficult it is to tune models and preprocess the data in order to reach good AUC and accuracy. For future work, we could extend our model to identify the words on which it bases its classification of a tweet as positive, negative or neutral.
