## Classification
The aim of this section is to apply supervised learning methods to build a classification model that determines whether a tweet describes an actual emergency. Depending on its success, the same approach could be used to predict a variety of other themes, given appropriate class labelling. The class labels have already been assigned under the column _target_, which shows 1 if the tweet has been classified as an emergency, and 0 if not. The original [file can be found here](/tweets-raw.csv).

It is important to note up front that this dataset faces the classic class imbalance problem: emergency tweets (1) constitute only 18.5% of records before cleaning. Hence, techniques will need to be applied to account for the imbalance. For instance, the F1-score is more telling than accuracy, because a model that labels every tweet as a non-emergency would still score a high accuracy. We chose to oversample rather than undersample, because undersampling would mean discarding about 7k non-emergency (0) records and leave an even smaller training set after cleaning.
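As a rough illustration of random oversampling, the sketch below duplicates randomly chosen minority-class rows until both classes are the same size. The function name `random_oversample`, the seed, and the toy tweets are assumptions made for this example and are not taken from the notebook. It is usually safest to oversample the training split only, so that duplicated rows do not leak into the test set.

```python
import random

def random_oversample(texts, labels, minority_label=1, seed=33):
    """Duplicate random minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pairs = list(zip(texts, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    if not minority:
        raise ValueError("no minority-class rows to sample from")
    # Sample with replacement to make up the shortfall between the classes
    shortfall = max(len(majority) - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(shortfall)]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [t for t, _ in balanced], [y for _, y in balanced]

# Toy training data (hypothetical tweets, not from the dataset)
train_texts = ["nice day", "flood on main street", "lunch time", "new phone", "great game"]
train_labels = [0, 1, 0, 0, 0]

balanced_texts, balanced_labels = random_oversample(train_texts, train_labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Because the minority rows are exact copies, this adds no new information; it only changes how heavily the classifier weighs the emergency class during training.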

The rest of this report proceeds in the following order:
1. [Data pre-processing](#1-data-preprocessing). The tweets are provided as is. Besides the actual text, they contain emojis, vulgarities, and hashtags with varying characters. Locations range from actual values such as United States of America to "hell" or "jesus".
2. [Feature engineering](#2-Feature-engineering). The keyword column, which contains the "emergency" word in the sentence, will be added to the feature list. The sentences will be tokenised and vectorised using a Term Frequency-Inverse Document Frequency (TF-IDF) approach.
3. Dataset splitting. We will need a training set and a test set. We have decided to employ the holdout method, which uses 2/3 of the data for model training and the remaining 1/3 for testing.
4. Model selection. We will attempt to use decision tree induction, linear regression, and (Complement) Naïve Bayes classification. We may also consider ensemble methods, namely random forest and boosting via AdaBoost.
5. Training phase. The models will be trained on the keyword and the tokenised sentences as features.
6. Evaluation phase. Here, we will apply metrics based on the confusion matrix, the Receiver Operating Characteristic (ROC) curve, and the F1-score as previously mentioned.


In [None]:
! pip install Keras
! pip install tensorflow
! pip install scikit-learn


### 1. Data Preprocessing

In this phase, we remove punctuation and emojis. We are aware that this might affect sentence semantics, especially if we later choose to adopt encoder-only transformers, but the step is relatively easy to roll back. For now, emojis and punctuation will not be considered.

We also found it important to remove stop words and numbers, and to apply lemmatisation (reducing words to their base form, for example by stripping _-ing_ endings). Source: Web Data Mining, Bing Liu
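The stop-word and number removal described above could look roughly like the following. This is a minimal sketch: the `STOP_WORDS` set is a tiny illustrative list and `clean_tokens` is a hypothetical helper, neither of which comes from the notebook. Lemmatisation is left out here because it needs a lexicon; NLTK's `WordNetLemmatizer` is one option.

```python
# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list)
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "of", "to", "and"}

def clean_tokens(text):
    """Lower-case, split on whitespace, then drop stop words and tokens containing digits."""
    tokens = text.lower().split()
    return [
        t for t in tokens
        if t not in STOP_WORDS and not any(ch.isdigit() for ch in t)
    ]

print(clean_tokens("The fire is spreading to 3 houses on Elm Street"))
# → ['fire', 'spreading', 'houses', 'elm', 'street']
```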

In [None]:
import pandas as pd
import numpy as np

# Read the CSV file
df = pd.read_csv('tweetsv2.csv')

# The counts show an imbalanced dataset: non-emergency tweets outnumber emergency tweets by more than 3 to 1
df['target'].value_counts()

In [None]:
import pandas as pd
import re
# from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Read the CSV file
df = pd.read_csv('tweetsv2.csv')

# Emoji removal source taken from: https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16 (emoji presentation)
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

df['cleantweet'] = df['text.1'].apply(lambda x: remove_emojis(remove_punctuation(str(x))))

print(df['cleantweet'][0])

display(df.head())

### 2. Feature Engineering

In [None]:
feature1 = df['cleantweet']
classlabel = df['target']
len(feature1)
classlabel.head()
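To make the TF-IDF weighting concrete, here is a small sketch of the textbook variant, tf × ln(N / df), written in plain Python. The function `tfidf` and the three example sentences are assumptions for illustration. scikit-learn's `TfidfVectorizer`, used later in the notebook, applies a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical example sentences, not taken from the dataset
docs = ["fire in the forest", "party in the park", "forest fire spreading"]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that appear in only one document, such as "park", receive a higher weight than terms shared across documents, such as "in".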

### 3. Dataset splitting

In [None]:
from sklearn.model_selection import train_test_split
feature1_train, feature1_test, classlabel_train, classlabel_test = train_test_split(feature1, classlabel, test_size = 0.33, random_state = 33)
len(feature1_test)

### 4. Model selection

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report


### 5. Training phase

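Removing stop words (`stop_words = 'english'`) gave mixed results: MNB improved slightly, while CNB and SVC dipped marginally. Adding an n-gram range then lifted CNB noticeably, while MNB fell back and SVC stayed roughly the same. The scores below are the accuracy values printed by the pipelines.

| Configuration | MNB | CNB | SVC |
|---|---|---|---|
| Original | 0.846 | 0.867 | 0.894 |
| After stop words dropped | 0.856 | 0.863 | 0.892 |
| After n-gram range added | 0.847 | 0.889 | 0.893 |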


In [None]:
# Create a pipeline for the different classification functions
pipeline_MNB = Pipeline([('tfidf', TfidfVectorizer(stop_words = 'english', ngram_range=(1,3))), ('clf', MultinomialNB())])
pipeline_CNB = Pipeline([('tfidf', TfidfVectorizer(stop_words = 'english', ngram_range=(1,3))), ('clf', ComplementNB())])
pipeline_SVC = Pipeline([('tfidf', TfidfVectorizer(stop_words = 'english', ngram_range=(1,3))), ('clf', LinearSVC())])

pipeline_MNB.fit(feature1_train, classlabel_train)
predictMNB = pipeline_MNB.predict(feature1_test)
print(f"MNB: {accuracy_score(classlabel_test, predictMNB):.3f}")

pipeline_CNB.fit(feature1_train, classlabel_train)
predictCNB = pipeline_CNB.predict(feature1_test)
print(f"CNB: {accuracy_score(classlabel_test, predictCNB):.3f}")

pipeline_SVC.fit(feature1_train, classlabel_train)
predictSVC = pipeline_SVC.predict(feature1_test)
print(f"SVC: {accuracy_score(classlabel_test, predictSVC):.3f}")

### 6. Evaluation phase

In [None]:
print(classification_report(classlabel_test, predictSVC))
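To see why the F1-score is more informative than accuracy on imbalanced data, the sketch below derives both from confusion-matrix counts. The helper names and the synthetic labels are assumptions: the labels simply mirror the 18.5% emergency share quoted earlier and are not the notebook's real predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Synthetic test set: 185 emergencies among 1000 tweets
y_true = [1] * 185 + [0] * 815
# A degenerate classifier that labels every tweet as a non-emergency
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(f"accuracy: {accuracy:.3f}, F1: {f1_from_counts(tp, fp, fn):.3f}")
# → accuracy: 0.815, F1: 0.000
```

The degenerate classifier never finds an emergency, yet its accuracy looks respectable; the F1-score of zero exposes the failure.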

In [None]:
msg = "Severe weather expected in Lyon"
outcome = pipeline_SVC.predict([msg])
print('class label is ' + str(outcome))

msg2 = "Intense flying cow expected in Lyon"
outcome2 = pipeline_SVC.predict([msg2])
print('class label is ' + str(outcome2))