# EARTHQUAKE TWEETS CLASSIFICATION

This notebook is the first step of my final project for the JEDHA bootcamp: developing an app that displays emergency-call tweets on a map during an earthquake.

I ran the notebook on Google Colab because the final part requires a GPU and my local computer does not have one. The deployment step at the end therefore covers how to push a model from a Colab notebook to a Hugging Face repository.

### Install and Import Necessary Packages


In [None]:
#!pip install transformers datasets

In [None]:
#!pip install transformers[torch]

In [None]:
#!pip install torch

In [None]:
#!pip install torchinfo

In [None]:
#!pip install huggingface_hub

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
import json
from wordcloud import WordCloud

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import LSTM, Embedding, Bidirectional

from tensorflow.keras.models import Model
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam

from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torchinfo import summary

import huggingface_hub

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

## PART 1: EDA & TEXT PREPROCESSING

The dataset, which I annotated manually in doccano, is in JSON format. Let's load it.

In [None]:
df = pd.read_json('earthquake10K.json')
df.head()

In [None]:
# Each label is stored as a one-element list; keep only its first element
df['label'] = df['label'].str[0]

df.head()

### Exploring the Data

In [None]:
print("The annotated dataset contains", df.shape[0], 'tweets.')

I initially annotated the data with three categories:
- rescue calls (calls by or for people trapped under the rubble at the time)
- urgent needs (urgent demands for food, heating or clothing)
- other

However, the class distribution turned out to be highly imbalanced, so I merged the first two labels into a single one that denotes all kinds of emergency calls.

In [None]:
# I annotated the data with three labels, but the urgent-needs class occurs very rarely:
df['label'].hist()

In [None]:
def merge_urgents(x):
    if x == 'Urgent_need' or x == 'Rescue_call':
        return 'emergency_call'
    else:
        return x
df['label'] = df['label'].apply(lambda x : merge_urgents(x))
set(df['label'])
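
To make the effect of the merge concrete, the sketch below counts labels before and after merging. The label list is invented for illustration and does not reflect the real class counts, and `MERGE_MAP` and `merge_label` are hypothetical names.

```python
from collections import Counter

# Hypothetical label sample, NOT the real dataset counts.
labels = ['Other'] * 8 + ['Rescue_call'] * 3 + ['Urgent_need'] * 1

MERGE_MAP = {'Rescue_call': 'emergency_call', 'Urgent_need': 'emergency_call'}

def merge_label(label):
    # Labels missing from the map (e.g. 'Other') are returned unchanged.
    return MERGE_MAP.get(label, label)

before = Counter(labels)
after = Counter(merge_label(label) for label in labels)

print(before)
print(after)
```

A dictionary lookup does the same job as the `if`/`else` in `merge_urgents` and is easy to extend if more labels need merging later.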

In [None]:
for label in set(df['label']):
    #print("WordCloud for", i)
    words = ''
    for document in df[df['label'] == label]['text']:
        words += document + ' '
    wordcloud = WordCloud(width=600, height=400, background_color='#696969', colormap='Set3').generate(words)
    plt.imshow(wordcloud)
    plt.title("WordCloud for {} tweets".format(label))
    plt.show()

At first glance, the two categories do not look very different, because the most frequent tokens are URLs, hashtags and user tags. Let's generate the word clouds again after cleaning the text.

### Data Cleaning 
Social media data is known to be more challenging for algorithms because it is far from formal, standardized language and contains many typos, abbreviations, slang terms and emojis. So the first job is to make the text as understandable as possible for the algorithm. To this end:
- I will remove emojis and special characters.
- I will also remove hashtags (#...) and user tags (@...), since I observed them in every tweet regardless of content. Specifically, the hashtag '#enkazaltında', which translates as 'under the rubble', was added to every tweet even when it was not a call for help.
- Another issue is the frequent abbreviations used for places and addresses. There are many different ways to refer to the epicenter of the earthquake, for example (Kahramanmaraş, KMaraş, Maraş...). Abbreviations are not necessarily a hard challenge for complex models such as Transformers, but they can mislead simpler models. Therefore I will standardize them as best I can, since addresses are the most crucial part of our data.
- I will make sure that special Turkish letters such as I are lowercased properly (it should become ı, not i).
- Finally, I will normalize whitespace.


In [None]:
corpus = ' '.join(df['text'])
unique_char = set(re.findall(r'.', corpus)) #gives all characters, r'\w' gives alphanumerical
print(unique_char)

In [None]:
abbreviations = [
    ('apt', 'Apartmanı'),
    ('Apt', 'Apartmanı'),
    ('APT', 'Apartmanı'),
    ('apart', 'Apartmanı'),
    ('Apart', 'Apartmanı'),
    ('APART', 'Apartmanı'),
    ('sok', 'Sokak'),
    ('sk', 'Sokak'),
    ('Sok', 'Sokak'),
    ('Sk', 'Sokak'),
    ('SOK', 'Sokak'),
    ('SK', 'Sokak'),
    ('cad', 'Caddesi'),
    ('Cad', 'Caddesi'),
    ('CAD', 'Caddesi'),
    ('cd', 'Caddesi'),
    ('Cd', 'Caddesi'),
    ('CD', 'Caddesi'),
    ('bşk', 'başkanlığı'),
    ('bul', 'Bulvarı'),
    ('blv', 'Bulvarı'),
    ('Blv', 'Bulvarı'),
    ('BLV', 'Bulvarı'),
    ('bulv', 'Bulvarı'),
    ('Bulv', 'Bulvarı'),
    ('BULV', 'Bulvarı'),
    ('mey', 'meydanı'),
    ('meyd', 'meydanı'),
    ('ecz', 'Eczanesi'),
    ('Ecz', 'Eczanesi'),
    ('ECZ', 'Eczanesi'),
    ('mh', 'Mahallesi'),
    ('mah', 'Mahallesi'),
    ('Mh', 'Mahallesi'),
    ('Mah', 'Mahallesi'),
    ('MH', 'Mahallesi'),
    ('MAH', 'Mahallesi'),
    ('şb', 'şube'),
    ('maraş', 'Kahramanmaraş'),
    ('maras', 'Kahramanmaraş'),
    ('Maraş', 'Kahramanmaraş'),
    ('Maras', 'Kahramanmaraş'),
    ('MARAŞ', 'Kahramanmaraş'),
    ('MARAS', 'Kahramanmaraş'),
    ('kmaraş', 'Kahramanmaraş'),
    ('kmaras', 'Kahramanmaraş'),
    ('KMaraş', 'Kahramanmaraş'),
    ('KMaras', 'Kahramanmaraş'),
    ('KMARAŞ', 'Kahramanmaraş'),
    ('KMARAS', 'Kahramanmaraş'),
    ('antep', 'Gaziantep'),
    ('Antep', 'Gaziantep'),
    ('ANTEP', 'Gaziantep'),
    ('anteb', 'Gaziantep'),
    ('Anteb', 'Gaziantep'),
    ('ANTEB', 'Gaziantep'),
    ('Urfa', 'Şanlıurfa'),
    ('urfa', 'Şanlıurfa'),
    ('URFA', 'Şanlıurfa'),

    ]

def normalize_abbreviations(text):
    # Dotted variants first: the '.' would otherwise split them at a word boundary
    for dotted in ('k.maraş', 'K.maraş', 'K.Maraş', 'k.maras', 'K.maras', 'K.Maras'):
        text = text.replace(dotted, 'Kahramanmaraş')
    # Expand each abbreviation only when it appears as a whole word
    for abbreviation, replacement in abbreviations:
        text = re.sub(rf'\b{re.escape(abbreviation)}\b', replacement, text)
    # Collapse repeated whitespace once, after all replacements
    text = re.sub(r'\s\s+', ' ', text)
    return text
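
The word-boundary anchors (`\b`) are what keep short abbreviations such as `sk` from being expanded inside longer words. A minimal sketch with a three-entry table (`SAMPLE_ABBREVIATIONS` and `expand` are illustrative names, not part of the notebook):

```python
import re

# Small illustrative subset of the abbreviation table.
SAMPLE_ABBREVIATIONS = [('sk', 'Sokak'), ('mah', 'Mahallesi'), ('cad', 'Caddesi')]

def expand(text, table=SAMPLE_ABBREVIATIONS):
    for short, full in table:
        # \b matches only at a word boundary, so 'sk' inside 'iskele' is left alone.
        text = re.sub(rf'\b{re.escape(short)}\b', full, text)
    return text

print(expand('atatürk mah 12 sk iskele cad'))
# atatürk Mahallesi 12 Sokak iskele Caddesi
```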


In [None]:
def turkish_lowercase_conversion(text):
    # Replace the Turkish dotted capital "İ" with lowercase "i"
    text = text.replace("İ", "i")

    # Replace the dotless capital "I" with lowercase dotless "ı"
    text = text.replace("I", "ı")

    # Convert the remaining text to lowercase
    text = text.lower()

    return text

text = 'ŞAZİBEY MAHALLESİ HAYDAR ALİYEV BULVARI YUNUS APARTMANI A BLOK ACİLEN EKİBE İHTİYACIMIZ VAR LÜTFEN SESİMİZİ DUYURUN YARDIM EDİN'
turkish_lowercase_conversion(text)
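
The reason for the two explicit replacements is that Python's `str.lower()` is locale-independent. The short sketch below shows the two problem cases and a compact variant of the same fix (`tr_lower` is a hypothetical helper name):

```python
# Python's str.lower() does not apply Turkish casing rules.
print('I'.lower())       # plain 'i', whereas Turkish expects dotless 'ı'
print(len('İ'.lower()))  # 2: 'i' followed by the combining dot U+0307

def tr_lower(text):
    # Map the two problematic capitals first, then lowercase the rest.
    return text.replace('İ', 'i').replace('I', 'ı').lower()

print(tr_lower('ACİL YARDIM'))  # acil yardım
```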


In [None]:
#keep punctuation but make sure there are space after each or remove punctuation
#remove all tags, remove all links, remove all hashtags

def clean(text):
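    #remove all links first, so that abbreviations inside URLs are not expanded:
    clean_text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    #expand abbreviations on the link-free text:
    clean_text = normalize_abbreviations(clean_text)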
    #remove all non-alphanumerical characters except @ and # while keeping special turkish letters: sçÇğĞıİöÖşŞüÜ@#
    clean_text = re.sub(r'[^\w\sçÇğĞıİöÖşŞüÜ@#]', ' ', clean_text)
    #remove any entity that follows @ and #
    i = 0
    while i < (len(clean_text)):
        if clean_text[i] == '@' or clean_text[i]== '#':
            a=0
            while i+a < len(clean_text):
                if clean_text[i+a] == ' ':
                    break
                else:
                    a += 1
            clean_text = clean_text[:i] + clean_text[i+a+1:]
        else:
            i+=1
    #conversion to lowercase; as noted above, str.lower() mishandles the letters I and İ,
    #so we go step by step:
    clean_text = clean_text.replace("İ", "i")
    clean_text = clean_text.replace("I", "ı")
    clean_text = clean_text.lower()
    clean_text = re.sub(r'\s\s+', ' ', clean_text)
    return clean_text
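
The character-by-character loop that drops `@` and `#` entities can also be expressed with regular expressions. The sketch below is an alternative, not the notebook's implementation: it assumes a URL runs until the next whitespace (a simplification of the pattern used in `clean`) and also strips leading and trailing spaces. `strip_links_and_tags` is a hypothetical name.

```python
import re

URL_RE = re.compile(r'https?://\S+')  # assumption: a URL ends at the next whitespace
TAG_RE = re.compile(r'[@#]\S*')       # an '@' or '#' plus everything up to whitespace

def strip_links_and_tags(text):
    text = URL_RE.sub(' ', text)
    text = TAG_RE.sub(' ', text)
    # Collapse the gaps left behind by the removals.
    return re.sub(r'\s+', ' ', text).strip()

print(strip_links_and_tags('@afad yardım edin #enkazaltında https://t.co/abc Hatay'))
# yardım edin Hatay
```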


In [None]:
df = df[['id','text','label']].copy()  # work on an explicit copy before assigning to a column
df['text'] = df['text'].apply(clean)
df.head()

#### Now let's try out the wordclouds one more time.

In [None]:
stops = stopwords.words('turkish')

In [None]:
for label in set(df['label']):
    #print("WordCloud for", i)
    words = ''
    for document in df[df['label'] == label]['text']:
        words += document + ' '
    wordcloud = WordCloud(width=600, height=400, background_color='#696969', colormap='Set3').generate(words)
    plt.imshow(wordcloud)
    plt.title("WordCloud for {} tweets".format(label))
    plt.show()

This time the distinction between the two word clouds is much clearer.
- The second image, which shows the most frequent words in tweets labeled 'Other', contains mostly common Turkish stop words such as "ve" (and), "bu" (this), "bir" (one/a/an) and "çok" (much), as one would expect. Other frequent words are "Allah" (God), "geçmiş olsun" (roughly "get well soon") and "lütfen" (please), which also makes sense because these are mostly tweets expressing sympathy and praying for the victims.
- The first image, on the other hand, shows the frequent words in the "emergency call" tweets: "lütfen" (please), "yardım / yardım edin" (help), "enkaz altında" (under the rubble), "hatay / antakya" (one of the most affected cities), "mahallesi" (neighborhood), "sokak" / "caddesi" (street) and "apartmanı" (building). This also makes sense, since people who make emergency calls often share locations. Interestingly, in these tweets these words were used even more frequently than typical stop words.

### Text Preprocessing
- Before training the model, I will lastly convert the target variable to integers, tokenize the texts and represent them as vectors. 
- For vectorization I will first use the CountVectorizer(), however, I will explore and compare the performance of Tfidf vectors as well while tuning the hyperparameters of the models.
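
To illustrate the difference between the two representations, here is a pure-Python sketch on a toy corpus. It assumes whitespace tokenization and uses the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, followed by L2 normalization; it is not a drop-in replacement for `CountVectorizer` or `TfidfTransformer`, and all names in it are illustrative.

```python
import math
from collections import Counter

docs = ['yardım edin lütfen', 'lütfen dua edin', 'enkaz altında yardım']  # toy corpus

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger weights.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def count_vector(doc):
    counts = Counter(doc)
    return [counts[t] for t in vocab]

def tfidf_vector(doc):
    raw = [c * idf(t) for c, t in zip(count_vector(doc), vocab)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(vocab)
print(count_vector(tokenized[0]))
print([round(v, 3) for v in tfidf_vector(tokenized[0])])
```

Count vectors treat every occurrence equally, while tf-idf down-weights terms that appear in many documents, which is why both are worth comparing during tuning.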

In [None]:
target_map = {'Other': 0, 'emergency_call': 1}
df['target'] = df['label'].map(target_map)

In [None]:
text_train, text_test, Y_train, Y_test = train_test_split(df['text'], df['target'], random_state = 42)

In [None]:
## tokenize the texts and turn them into vectors
stops = stopwords.words('turkish')
vectorizer = CountVectorizer(stop_words = None, max_features = None)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)


## PART 2: MACHINE LEARNING

### 2.1 Logistic Regression

In [None]:
LR = LogisticRegression(max_iter = 500)
LR.fit(X_train, Y_train)


In [None]:
print("train score:", LR.score(X_train, Y_train))
print("test score:", LR.score(X_test, Y_test))

Ptrain = LR.predict(X_train)
Ptest = LR.predict(X_test)
print("train F1:", f1_score(Y_train, Ptrain))
print("test F1:", f1_score(Y_test, Ptest))


The basic logistic regression model achieved quite good accuracy on both sets. However, since the dataset is highly imbalanced, accuracy is not the most appropriate metric for evaluating the model. The F1 score on the test set is 0.87, which is still decent.
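
A small worked example shows why accuracy is misleading here. The class counts below are hypothetical, chosen only to illustrate the effect: a "model" that always predicts the majority class scores high on accuracy while missing every emergency tweet.

```python
# Hypothetical class counts for illustration only.
n_other, n_emergency = 900, 100

# A baseline that always predicts the majority class 'Other' (label 0).
tp, fp = 0, 0
fn, tn = n_emergency, n_other

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, recall, f1)  # 0.9 0.0 0.0
```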

#### Hyperparameter Tuning
Here not only will I apply grid search to find optimum parameters for the model but also for the preprocessor (i.e. count vectorizer vs tfidf).

In [None]:
pipe = Pipeline([
    ('c_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(max_iter=1000))])



parameters = {
    'c_vect__max_features': [2000, 5000, None],
    'c_vect__stop_words': [stops, None],
    'tfidf__use_idf': [True, False],
    'clf__C': np.logspace(-3,3,3),
}

# Metrics of interest; the search below optimizes recall only
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
}


LR_best = GridSearchCV(pipe,
                           param_grid = parameters,
                           scoring = 'recall',
                           cv = 10,
                           n_jobs = -1)

# Open question: what would happen if we passed all the scores here and refit on recall?

LR_best.fit(text_train, Y_train)
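
The cell above defines a `scoring` dictionary but passes only `scoring = 'recall'` to the search. For reference, `GridSearchCV` also accepts a dictionary of metrics, in which case `refit` must name the metric used to pick the best model. The sketch below demonstrates this on synthetic data from `make_classification`, not on the tweet dataset; the grid and sample sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced toy data (assumption: stands in for the real features).
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

scoring = {'accuracy': 'accuracy', 'precision': 'precision',
           'recall': 'recall', 'f1': 'f1'}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 1.0, 100.0]},
    scoring=scoring,   # every metric is recorded in cv_results_
    refit='recall',    # the best model is chosen and refit by recall
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
for name in scoring:
    # Mean cross-validated score of the selected candidate for each metric.
    print(name, search.cv_results_[f'mean_test_{name}'][search.best_index_])
```

With this setup the model is still selected on recall, but accuracy, precision and F1 of the same candidate can be read from `cv_results_` without rerunning the search.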

In [None]:
LR_best.best_params_

In [None]:
P_test = LR_best.predict(text_test)
P_train = LR_best.predict(text_train)
print(classification_report(Y_test, P_test))

In [None]:
def plot_cm(ax, cm, title):
    classes = [0, 1]
    df_cm = pd.DataFrame(cm, index=classes, columns=classes)
    sns.heatmap(df_cm, annot=True, fmt='g', ax = ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Target")
    ax.set_title(title)


In [None]:
cm_train = confusion_matrix(Y_train, P_train, normalize='true')
cm_test = confusion_matrix(Y_test, P_test, normalize='true')

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
plot_cm(axes[0], cm_train, title='Confusion Matrix for Train Set')
plot_cm(axes[1], cm_test, title='Confusion Matrix for Test Set')

plt.tight_layout()
plt.show()

### 2.2 Naive Bayes Classifier

In [None]:
NB = MultinomialNB()
NB.fit(X_train, Y_train)

In [None]:
print("train score:", NB.score(X_train, Y_train))
print("test score:", NB.score(X_test, Y_test))

Ptrain = NB.predict(X_train)
Ptest = NB.predict(X_test)
print("train F1:", f1_score(Y_train, Ptrain))
print("test F1:", f1_score(Y_test, Ptest))


#### Hyperparameter tuning for Naive Bayes

In [None]:
pipe = Pipeline([
    ('c_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])



parameters = {
    'c_vect__max_features': [2000, 5000, None],
    'c_vect__stop_words': [stops, None],
    'tfidf__use_idf': [True, False],
    # MultinomialNB is tuned through its smoothing parameter alpha (it has no C);
    # the values below are an assumed grid, adjust as needed
    'clf__alpha': [0.01, 0.1, 1.0],
}


NB_best = GridSearchCV(pipe,
                           param_grid = parameters,
                           scoring = 'recall',
                           cv = 10,
                           n_jobs = -1)

NB_best.fit(text_train, Y_train)

In [None]:
NB_best.best_params_

In [None]:
P_test = NB_best.predict(text_test)
P_train = NB_best.predict(text_train)
print(classification_report(Y_test, P_test))


In [None]:
cm_train = confusion_matrix(Y_train, P_train, normalize='true')
cm_test = confusion_matrix(Y_test, P_test, normalize='true')

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
plot_cm(axes[0], cm_train, title='Confusion Matrix for Train Set')
plot_cm(axes[1], cm_test, title='Confusion Matrix for Test Set')

plt.tight_layout()
plt.show()

### 2.3 Support Vector Machine

In [None]:
SVM = SVC(kernel = "linear", random_state = 42, probability = True)
SVM.fit(X_train, Y_train)

In [None]:

print("train score:", SVM.score(X_train, Y_train))
print("test score:", SVM.score(X_test, Y_test))

Ptrain = SVM.predict(X_train)
Ptest = SVM.predict(X_test)
print("train F1:", f1_score(Y_train, Ptrain))
print("test F1:", f1_score(Y_test, Ptest))


#### Hyperparameter tuning for SVM

In [None]:
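# Tune on a 1000-tweet subsample of the training set
X_sample = text_train.sample(n=1000, random_state=42)
Y_sample = Y_train.loc[X_sample.index]  # select labels by index so they stay aligned with the texts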

In [None]:
pipe = Pipeline([
    ('c_vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC())])



parameters = [{
    'c_vect__max_features': [2000, 5000, None],
    'c_vect__stop_words': [stops, None],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.25, 0.5, 0.75, 1],
    'clf__kernel' : ["linear"]},
    {'c_vect__max_features': [2000, 5000, None],
    'c_vect__stop_words': [stops, None],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.25, 0.5, 0.75, 1],
    'clf__kernel' : ["rbf"],
    'clf__gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]



SVM_best = GridSearchCV(pipe,
                           param_grid = parameters,
                           scoring = 'recall',
                           cv = 10,
                           n_jobs = -1)
SVM_best.fit(X_sample, Y_sample)



In [None]:
SVM_best.best_params_

In [None]:
P_test = SVM_best.predict(text_test)
P_train = SVM_best.predict(text_train)
print(classification_report(Y_test, P_test))


In [None]:
cm_train = confusion_matrix(Y_train, P_train, normalize='true')
cm_test = confusion_matrix(Y_test, P_test, normalize='true')

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
plot_cm(axes[0], cm_train, title='Confusion Matrix for Train Set')
plot_cm(axes[1], cm_test, title='Confusion Matrix for Test Set')

plt.tight_layout()
plt.show()

### 2.4 Recurrent Neural Networks : LSTM

In [None]:
df['targets'] = df['label'].astype("category").cat.codes
K = df['targets'].max() + 1

In [None]:
X_tv, X_test, Y_tv, Y_test = train_test_split(df['text'], df['target'], test_size = 1000, random_state = 42)
X_train, X_val, Y_train, Y_val = train_test_split(X_tv,Y_tv, random_state = 42)

In [None]:
MAX_VOCAB_SIZE = 2000
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test = tokenizer.texts_to_sequences(X_test)
sequences_val = tokenizer.texts_to_sequences(X_val)

word2idx = tokenizer.word_index
V = len(word2idx)
print('Number of unique tokens:', V)

In [None]:
# pad sequences so that we get a N x T matrix
data_train = pad_sequences(sequences_train)
print('Shape of data train tensor:', data_train.shape)

# get sequence length
T = data_train.shape[1]

data_test = pad_sequences(sequences_test, maxlen=T)
print('Shape of data test tensor:', data_test.shape)

data_val = pad_sequences(sequences_val, maxlen =T)
print('Shape of data validation tensor:', data_val.shape)
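
For readers unfamiliar with `pad_sequences`, the sketch below mimics its default behaviour in pure Python: zeros are added at the front (`padding='pre'`) and overly long sequences lose their first tokens (`truncating='pre'`). `pad_pre` is a hypothetical helper and the token ids are made up.

```python
def pad_pre(sequences, maxlen=None, value=0):
    # Mirrors the Keras defaults: pad at the front, truncate from the front.
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

train = [[5, 2], [7, 1, 3, 9]]
print(pad_pre(train))                            # [[0, 0, 5, 2], [7, 1, 3, 9]]
print(pad_pre([[4, 4, 4, 4, 4, 8]], maxlen=4))   # [[4, 4, 4, 8]]
```

This is why `T` is taken from the training tensor and then passed as `maxlen` for the validation and test sets: all three matrices must share the same width.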

In [None]:
D = 30
i = Input(shape=(T,))
x = Embedding(V + 1, D)(i)
#x = LSTM(8, return_sequences=True)(x)
x = LSTM(8, return_sequences=False)(x)
#x = GlobalMaxPooling1D()(x)
x = Dense(1)(x)

model = Model(i, x)

In [None]:
# Compile and fit
model.compile(
  loss=BinaryCrossentropy(from_logits=True),
  optimizer=Adam(learning_rate=0.001),
  metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)

print('Training model...')
r = model.fit(
  data_train,
  Y_train,
  epochs=10,
  validation_data=(data_val, Y_val),
  batch_size=128,
)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(9, 4))

# Plot loss per iteration on the first subplot (index 0)
ax[0].plot(r.history['loss'], label='train loss')
ax[0].plot(r.history['val_loss'], label='val loss')
ax[0].set_title('Loss per Iteration')
ax[0].legend()

# Plot accuracy per iteration on the second subplot (index 1)
ax[1].plot(r.history['accuracy'], label='train acc')
ax[1].plot(r.history['val_accuracy'], label='val acc')
ax[1].set_title('Accuracy per Iteration')
ax[1].legend()

fig.suptitle('Learning Curves for LSTM', fontsize=14)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

In [None]:
model.save('LSTM_1.h5')

In [None]:
P_train = ((model.predict(data_train) > 0) * 1.0).flatten()
P_val = ((model.predict(data_val) > 0) * 1.0).flatten()
P_test = ((model.predict(data_test) > 0) * 1.0).flatten()

print("Train acc:", accuracy_score(Y_train, P_train))
print("Val acc:", accuracy_score(Y_val, P_val))
print("Test acc:", accuracy_score(Y_test, P_test))

print("Train F1:", f1_score(Y_train, P_train))
print("Val F1:", f1_score(Y_val, P_val))
print("Test F1:", f1_score(Y_test, P_test))
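
The predictions above are thresholded at 0 rather than 0.5 because the final `Dense(1)` layer outputs raw logits (the loss is built with `from_logits=True`). A quick check that both thresholds are equivalent, using a hand-written `sigmoid` and arbitrary example logits:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for logit in (-2.0, -0.1, 0.1, 3.0):
    # A logit above 0 corresponds to a probability above 0.5, and vice versa.
    print(logit, round(sigmoid(logit), 3), int(logit > 0), int(sigmoid(logit) > 0.5))
```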

In [None]:
cm_train = confusion_matrix(Y_train, P_train, normalize='true')
cm_val = confusion_matrix(Y_val, P_val, normalize='true')
cm_test = confusion_matrix(Y_test, P_test, normalize='true')

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
plot_cm(axes[0], cm_train, title='Confusion Matrix for Train Set')
plot_cm(axes[1], cm_val, title='Confusion Matrix for Validation Set')
plot_cm(axes[2], cm_test, title='Confusion Matrix for Test Set')

plt.tight_layout()
plt.show()

In [None]:
D = 30
i = Input(shape=(T,))
x = Embedding(V + 1, D)(i)


x = Bidirectional(LSTM(8, return_sequences=False))(x)

x = Dense(1)(x)

model_2 = Model(i, x)

# Compile and fit
model_2.compile(
  loss=BinaryCrossentropy(from_logits=True),
  optimizer=Adam(learning_rate=0.001),
  metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)


print('Training model...')
r = model_2.fit(
  data_train,
  Y_train,
  epochs=8,
  validation_data=(data_val, Y_val),
  batch_size=128,
)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(9, 4))

# Plot loss per iteration on the first subplot (index 0)
ax[0].plot(r.history['loss'], label='train loss')
ax[0].plot(r.history['val_loss'], label='val loss')
ax[0].set_title('Loss per Iteration')
ax[0].legend()

# Plot accuracy per iteration on the second subplot (index 1)
ax[1].plot(r.history['accuracy'], label='train acc')
ax[1].plot(r.history['val_accuracy'], label='val acc')
ax[1].set_title('Accuracy per Iteration')
ax[1].legend()

fig.suptitle('Learning Curves for Bidirectional LSTM', fontsize=14)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

In [None]:
model_2.save('LSTM_2.h5')

In [None]:
P_train = ((model_2.predict(data_train) > 0) * 1.0).flatten()
P_val = ((model_2.predict(data_val) > 0) * 1.0).flatten()
P_test = ((model_2.predict(data_test) > 0) * 1.0).flatten()

print("Train acc:", accuracy_score(Y_train, P_train))
print("Val acc:", accuracy_score(Y_val, P_val))
print("Test acc:", accuracy_score(Y_test, P_test))

print("Train F1:", f1_score(Y_train, P_train))
print("Val F1:", f1_score(Y_val, P_val))
print("Test F1:", f1_score(Y_test, P_test))

In [None]:
cm_train = confusion_matrix(Y_train, P_train, normalize='true')
cm_val = confusion_matrix(Y_val, P_val, normalize='true')
cm_test = confusion_matrix(Y_test, P_test, normalize='true')

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
plot_cm(axes[0], cm_train, title='Confusion Matrix for Train Set')
plot_cm(axes[1], cm_val, title='Confusion Matrix for Validation Set')
plot_cm(axes[2], cm_test, title='Confusion Matrix for Test Set')

plt.tight_layout()
plt.show()

### 2.5 Transfer Learning : BERTurk

In [None]:
df_raw = pd.read_json('earthquake10K.json')
# Each label is stored as a one-element list; keep only its first element
df_raw['label'] = df_raw['label'].str[0]

df_raw.head()


In [None]:
def merge_urgents(x):
  if x == 'Urgent_need' or x == 'Rescue_call':
    return 'emergency_call'
  else:
    return x
df_raw['label'] = df_raw['label'].apply(lambda x : merge_urgents(x))
set(df_raw['label'])

In [None]:
target_map = {'Other': 0, 'emergency_call': 1}
df_raw['target'] = df_raw['label'].map(target_map)

In [None]:
df2 = df_raw[['text', 'target']]
df2.columns = ['sentence', 'label']
train_val_set, test_set = train_test_split(df2, test_size = 0.1, random_state = 123)
len(train_val_set)

In [None]:
train_val_set.to_csv('data.csv', index=None)

In [None]:
raw_dataset = load_dataset('csv', data_files='data.csv')
raw_dataset

In [None]:
split = raw_dataset['train'].train_test_split(test_size=0.3, seed=42)
split

In [None]:
checkpoint = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    return tokenizer(batch['sentence'], truncation=True)
    #include truncation but not padding
    #padding will be automatically done by trainer

tokenized_datasets = split.map(tokenize_fn, batched=True)
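
Leaving padding to the trainer means each batch is padded only to the length of its own longest sequence, instead of padding the whole dataset to one global length. The pure-Python sketch below illustrates the idea; `pad_batch` is a hypothetical helper, the token ids are invented, and a pad id of 0 is an assumption rather than something read from the tokenizer.

```python
def pad_batch(batch, pad_id=0):
    # Pad every sequence only up to the longest one in this batch.
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Hypothetical token ids, not real tokenizer output.
batch = [[2, 45, 3], [2, 17, 88, 9, 3]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Batches of short tweets therefore stay short, which saves memory and compute compared with padding everything to the model's maximum length.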

In [None]:
bert_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)

In [None]:
summary(bert_model)

In [None]:
training_args = TrainingArguments(
  output_dir='training_dir',
  evaluation_strategy='epoch',
  save_strategy='epoch',
  num_train_epochs=5, #5 epochs
  per_device_train_batch_size=16,
  per_device_eval_batch_size=64,
)

In [None]:
def compute_metrics(logits_and_labels):
  logits, labels = logits_and_labels
  predictions = np.argmax(logits, axis=-1)
  acc = np.mean(predictions == labels)
  f1 = f1_score(labels, predictions, average='macro')
  prec = precision_score(labels, predictions)
  rec = recall_score(labels, predictions)
  return {'accuracy': acc, 'f1': f1, 'precision': prec, 'recall': rec}

In [None]:
trainer = Trainer(
    bert_model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
!ls training_dir

In [None]:
from transformers import pipeline
savedmodel = pipeline('text-classification',
                      model='training_dir/checkpoint-788',
                      device=0)

In [None]:
train_pred = savedmodel(split['train']['sentence'])
val_pred = savedmodel(split['test']['sentence'])
val_pred[:5]

In [None]:
def get_label(d):
  return int(d['label'].split('_')[1])
train_pred = [get_label(d) for d in train_pred]
val_pred = [get_label(d) for d in val_pred]

In [None]:
print("train acc:", accuracy_score(split['train']['label'], train_pred))
print("train f1:", f1_score(split['train']['label'], train_pred, average='macro'))
print("train prec:", precision_score(split['train']['label'], train_pred))
print("train rec:", recall_score(split['train']['label'], train_pred))


print("acc:", accuracy_score(split['test']['label'], val_pred))
print("f1:", f1_score(split['test']['label'], val_pred, average='macro'))
print("prec:", precision_score(split['test']['label'], val_pred))
print("rec:", recall_score(split['test']['label'], val_pred))

In [None]:
test_set.head()

In [None]:
target = test_set['label']
sentences = test_set['sentence'].tolist()

In [None]:
test_pred = savedmodel(sentences)
test_pred = [get_label(d) for d in test_pred]

In [None]:
print("acc:", accuracy_score(target, test_pred))
print("f1:", f1_score(target, test_pred, average='macro'))
print("prec:", precision_score(target, test_pred, average=None))
print("rec:", recall_score(target, test_pred, average=None))

In [None]:
cm_train = confusion_matrix(split['train']['label'], train_pred, normalize='true')
cm_val = confusion_matrix(split['test']['label'], val_pred, normalize='true')
cm_test = confusion_matrix(target, test_pred, normalize='true')

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
plot_cm(axes[0], cm_train, title='Confusion Matrix for Train Set')
plot_cm(axes[1], cm_val, title='Confusion Matrix for Validation Set')
plot_cm(axes[2], cm_test, title='Confusion Matrix for Test Set')

plt.tight_layout()
plt.show()

## PART 3 : MODEL SELECTION

- Although Logistic Regression and the LSTM models achieved impressive accuracy and F1 scores, the confusion matrices show that the fine-tuned BERTurk model clearly outperformed all other models, especially in terms of recall.
- Despite high accuracy and decent F1, the other models (except SVM, which performed worst) were weak on recall (between 0.83 and 0.86), which is the most important metric for this project.
- Recall shows how many of the relevant instances are retrieved, that is:
    True Positives / (False Negatives + True Positives). In this context, it tells us what percentage of the emergency-call tweets an algorithm actually detected.
- There is usually a trade-off between precision and recall. Precision shows how many of the retrieved instances are actually relevant (what percentage of the tweets classified as "emergency" were in fact emergencies).

- In some applications precision is more relevant than recall: false negatives can be tolerated because false positives pose a much bigger problem.
- This project, however, is a case where false negatives cannot be risked, as misclassifying an emergency tweet as "other" can have fatal consequences. False positives are much less of a problem: we lose little if we wrongly treat a tweet as an emergency and pin it on our map, since users will still be reading the tweets on the map.

- Therefore, I decided that the transformer model's good performance on all scores, and the improvement of over 5% it offers on recall, fully justifies the computational cost and complexity of GPU-intensive transformers, since the stakes are HUMAN LIVES.
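
The precision/recall trade-off described above can be made concrete with a small sketch. The labels and scores are invented for illustration and are not outputs of any model in this notebook; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = emergency) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1, 0.8, 0.45]

for threshold in (0.5, 0.3):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    print(threshold, precision_recall(y_true, y_pred))
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.75 to 1.0 in this toy example, while precision drops from 0.75 to about 0.67: more emergencies are caught at the cost of more false alarms.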


## PART 4: MODEL DEPLOYMENT
Since I ran the project on Colab rather than on my local machine, this step requires one more import:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Save the fine-tuned weights to Google Drive under a fixed file name
path = "/content/gdrive/MyDrive/BERT_finetuned"
torch.save(bert_model.state_dict(), path)

In [None]:
savedmodel.save_pretrained("bert-earthquake-tweets-classification")
tokenizer.save_pretrained("bert-earthquake-tweets-classification")

In [None]:
!sudo apt-get install git-lfs

In [None]:
!huggingface-cli login
!huggingface-cli repo create berturk-earthquake-tweets-classification

In [None]:
!git clone https://huggingface.co/yhaslan/berturk-earthquake-tweets-classification

In [None]:
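# Each "!" command runs in its own subshell, so "cd" must be chained with the commands that depend on it
!cd berturk-earthquake-tweets-classification && echo "Hello!" >> README.md && git add . && git commit -m "Update from $USER"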

In [None]:
savedmodel.save_pretrained("path/to/repo/clone/berturk-earthquake-tweets-classification")
tokenizer.save_pretrained("path/to/repo/clone/berturk-earthquake-tweets-classification")

In [None]:
huggingface_hub.upload_folder(folder_path='training_dir/checkpoint-788',
    repo_id="yhaslan/berturk-earthquake-tweets-classification",
    repo_type="model"
)