# Download the Dataset from Kaggle

In [1]:
! pip install -q kaggle

In [2]:
from google.colab import files
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [3]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
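
The three shell commands above can also be expressed in Python, which may be convenient outside Colab. This is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` and the demo token contents are hypothetical, and the demo writes to a temporary directory rather than the real home directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and make it owner-read/write only."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like `cp`
    target.chmod(0o600)                            # like `chmod 600`
    return target


# Demo against a throwaway directory so no real credentials are touched.
demo_home = Path(tempfile.mkdtemp())
demo_token = demo_home / "kaggle.json"
demo_token.write_text('{"username": "demo", "key": "not-a-real-key"}')
installed = install_kaggle_token(demo_token, home=demo_home)
print(installed, oct(installed.stat().st_mode & 0o777))
```

The restrictive file mode matters because the Kaggle CLI warns when the token is readable by other users.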

In [4]:
!pwd

/content


In [5]:
! kaggle datasets list

ref                                                                   title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
arnabchaki/data-science-salaries-2023                                 Data Science Salaries 2023 💸                         25KB  2023-04-13 09:55:16          24133        676  1.0              
mauryansshivam/netflix-ott-revenue-and-subscribers-csv-file           Netflix OTT Revenue and Subscribers (CSV File)        2KB  2023-05-13 17:40:23           1523         36  1.0              
darshanprabhu09/stock-prices-for                                      Stock prices of Amazon , Microsoft , Google, Apple   85KB  2023-05-16 15:17:16           1359         34  1.0              
fatihb/coffee-quality-data-cqi

In [6]:
! kaggle competitions download -c nlp-getting-started

Downloading nlp-getting-started.zip to /content
100% 593k/593k [00:00<00:00, 1.38MB/s]


In [7]:
!unzip nlp-getting-started.zip

Archive:  nlp-getting-started.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [8]:
import pandas as pd

data = pd.read_csv('train.csv')

data.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [None]:
len(data)

7613

# Handle missing values

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer

from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, f1_score, recall_score, precision_score, accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

In [10]:
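# Fill missing keywords with the placeholder 'others'
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='others')
# fit_transform returns a 2-D array of shape (n, 1); flatten it back to one column
data['keyword'] = imp.fit_transform(data['keyword'].values.reshape(-1, 1)).ravel()

# Drop the columns that are not used as model inputs
data = data.drop(['location', 'id'], axis=1)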

# BERT

**BERT** (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing (NLP) model developed by researchers at Google AI Language. It was introduced in a paper titled "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" in 2018.

BERT significantly advanced the state of the art on a wide range of NLP tasks, including question answering, named entity recognition, and sentiment analysis.

Here are some key points about BERT:

**Bidirectionality**: Traditional autoregressive language models, which are trained to predict the next word in a sentence, can only use the context on one side of a word (the words to its left or to its right, depending on the model). BERT is bidirectional: it builds the representation of each word from all the words in the sentence, on both sides.

**Pretraining and Fine-Tuning**: BERT is trained in two steps. In the pretraining step, the model learns general language representations from a large unlabeled text corpus (BooksCorpus and English Wikipedia in the original paper). In the fine-tuning step, the pretrained model is trained further on a specific task using labeled data.

**Transformers**: BERT is based on the encoder of the Transformer architecture, which uses self-attention to learn contextual relationships between words (or sub-words) in a text.

**WordPiece Tokenization**: BERT uses WordPiece tokenization, which splits rare or unknown words into smaller sub-word units (down to single characters if necessary). This mitigates the out-of-vocabulary problem in NLP.

**Handling of Long-Range Dependencies**: A key advantage of the Transformer architecture is its ability to handle long-range dependencies in text. Because self-attention connects every pair of positions directly, BERT can use words that are far away in the sentence to understand the context of a word.
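
The bidirectionality point above can be made concrete with attention masks. The following is a simplified sketch, not BERT's actual implementation: the helper names `causal_mask` and `bidirectional_mask` are hypothetical, and a 1 in row i, column j simply means "token i may look at token j".

```python
def causal_mask(n):
    """Row i marks which positions token i may attend to: itself and earlier ones."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every position, as in a BERT-style encoder."""
    return [[1] * n for _ in range(n)]


tokens = ["forest", "fire", "near", "canada"]
for name, mask in [("left-to-right", causal_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, m in zip(tokens, row) if m]
        print(f"  {token:>7} sees {visible}")
```

In the left-to-right case "forest" sees only itself, while in the bidirectional case every token sees the whole sentence.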

**Preprocessing for BERT**:

When using BERT or similar transformer models, traditional text preprocessing steps like stop word removal, stemming, and lemmatization are usually not necessary and can even be counterproductive.

BERT models are pre-trained on a large corpus of text data in a way that allows them to understand the context of words and sentences, and this understanding includes things like the role of stop words and the different forms of a word. Removing stop words or reducing words to their stems could cause the model to lose valuable contextual information.

**Vectorization and BERT**

Instead of a separate vectorization step (for example bag-of-words or TF-IDF), BERT relies on tokenization: the text is split into tokens, and each token is mapped to its integer ID in BERT's vocabulary. The token IDs are then passed into the BERT model, which generates a high-dimensional vector (or embedding) for each token.

These embeddings are based on the context of the words in the sentence and are used as the input for whatever task the model is being fine-tuned for, such as text classification, named entity recognition, or question answering. The fine-tuning process involves training additional layers on top of the pre-trained BERT model to perform the specific task.
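
To illustrate how sub-word tokenization and ID lookup fit together, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece. It is a simplified stand-in, not the library's tokenizer: the function `wordpiece`, the tiny `TOY_VOCAB`, and its integer IDs are all made up for the example (the real vocabulary has about 30,000 entries with different IDs).

```python
# Hypothetical miniature vocabulary; the IDs are invented for illustration.
TOY_VOCAB = {"[UNK]": 0, "wild": 1, "##fires": 2, "pour": 3, "##s": 4, "school": 5}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


for word in ["wildfires", "pours", "school", "quake"]:
    pieces = wordpiece(word, TOY_VOCAB)
    ids = [TOY_VOCAB[p] for p in pieces]
    print(word, pieces, ids)
```

A word that is missing from the vocabulary as a whole, such as "wildfires", is still represented through pieces that are present, which is how the out-of-vocabulary problem is reduced.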

## Build the model

In [None]:
pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The Transformers library provides a high-level API for easily working with various transformer models, including BERT. It offers pre-trained models, tokenizers, and utilities for fine-tuning and inference.

In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [None]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer, TFBertForSequenceClassification


# Load the WordPiece tokenizer that matches the 'bert-base-uncased' checkpoint.
# It splits the input text into BERT-compatible tokens.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the input text
tokenized_texts = [tokenizer.tokenize(text) for text in data['text']]

# Convert tokens to input IDs: each token is replaced with its index in the BERT vocabulary.
# Note: this manual route does not add the [CLS]/[SEP] special tokens or an attention mask.
input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_texts]

# Pad the sequences to a fixed length (max_length): shorter sequences get padding
# tokens (ID 0) appended, and longer ones are truncated at the end.
max_length = 64
input_ids_pad = pad_sequences(input_ids, maxlen=max_length, dtype="long", truncating="post", padding="post")

# Convert labels to integers
labels = data['target'].astype(int)




In [None]:
len(tokenized_texts)

7613

In [None]:
tokenized_texts[:5]

[['our',
  'deeds',
  'are',
  'the',
  'reason',
  'of',
  'this',
  '#',
  'earthquake',
  'may',
  'allah',
  'forgive',
  'us',
  'all'],
 ['forest', 'fire', 'near', 'la', 'ron', '##ge', 'sas', '##k', '.', 'canada'],
 ['all',
  'residents',
  'asked',
  'to',
  "'",
  'shelter',
  'in',
  'place',
  "'",
  'are',
  'being',
  'notified',
  'by',
  'officers',
  '.',
  'no',
  'other',
  'evacuation',
  'or',
  'shelter',
  'in',
  'place',
  'orders',
  'are',
  'expected'],
 ['13',
  ',',
  '000',
  'people',
  'receive',
  '#',
  'wild',
  '##fires',
  'evacuation',
  'orders',
  'in',
  'california'],
 ['just',
  'got',
  'sent',
  'this',
  'photo',
  'from',
  'ruby',
  '#',
  'alaska',
  'as',
  'smoke',
  'from',
  '#',
  'wild',
  '##fires',
  'pour',
  '##s',
  'into',
  'a',
  'school']]

In [None]:
input_ids_pad

array([[ 2256, 15616,  2024, ...,     0,     0,     0],
       [ 3224,  2543,  2379, ...,     0,     0,     0],
       [ 2035,  3901,  2356, ...,     0,     0,     0],
       ...,
       [23290,  1012,  6365, ...,     0,     0,     0],
       [ 2610, 11538,  2044, ...,     0,     0,     0],
       [ 1996,  6745,  1024, ...,     0,     0,     0]])
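
The padded array above starts each row directly with word IDs and carries no attention mask, so the model cannot tell padding from real tokens and never sees the `[CLS]` and `[SEP]` markers it was pretrained with. The sketch below shows, in plain Python, what a fuller encoding would look like. The helper `encode` is hypothetical; the IDs 101, 102, and 0 are those of `[CLS]`, `[SEP]`, and `[PAD]` in the `bert-base-uncased` vocabulary. In practice, calling the tokenizer directly, for example `tokenizer(texts, padding='max_length', truncation=True, max_length=64)`, should produce both `input_ids` and `attention_mask` in one step.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs in the bert-base-uncased vocabulary


def encode(token_ids, max_length):
    """Wrap IDs in [CLS] ... [SEP], truncate, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)               # 1 = real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding


# 3224, 2543, 2379 are the first IDs of the second row shown above.
input_ids, attention_mask = encode([3224, 2543, 2379], max_length=8)
print(input_ids)
print(attention_mask)
```

Passing the mask along with the IDs lets the attention layers ignore the padded positions.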

In [None]:
# Split the dataset into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(input_ids_pad, labels, stratify=labels, test_size=0.2)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.05)
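
Because the second split is taken from the remaining 80%, the validation set is 5% of that remainder, not 5% of the full dataset. The arithmetic can be sketched as follows, assuming (as I understand scikit-learn's handling of a float `test_size`) that the held-out size is rounded up; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_frac, val_frac):
    """Sizes for two chained splits where each held-out part is rounded up."""
    n_test = math.ceil(test_frac * n_samples)
    n_rest = n_samples - n_test
    n_val = math.ceil(val_frac * n_rest)
    return n_rest - n_val, n_val, n_test


n_train, n_val, n_test = split_sizes(7613, test_frac=0.2, val_frac=0.05)
print(n_train, n_val, n_test)
```

For the 7613 rows of `train.csv`, this works out to roughly 76% training, 4% validation, and 20% test. Passing a fixed `random_state` to both calls would make the split reproducible between runs.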

In [None]:
# Load the pre-trained BERT model for sequence classification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define the optimizer and loss function for training the model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model with the optimizer and loss function
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Train the model on the training set
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    epochs=3,
    batch_size=32
)

# Evaluate the model on the test set (separate names avoid overwriting the `loss` object)
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3
Test loss: 0.44076111912727356, Test accuracy: 0.8345370888710022
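
The classification head outputs two raw scores (logits) per tweet, which is why the loss is created with `from_logits=True`. As a rough illustration of what that loss computes for a single example, the sketch below applies a softmax and takes the negative log-probability of the true class. The function names and the example logits are made up; Keras averages this quantity over the batch.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(logits, label):
    """Negative log-probability of the integer class `label`."""
    return -math.log(softmax(logits)[label])


logits = [-1.2, 2.3]  # made-up scores for classes 0 and 1
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # same role as argmax
print(probs, predicted, sparse_cross_entropy(logits, label=1))
```

A model that is undecided between the two classes (equal logits) has a loss of ln 2, about 0.693, so the reported test loss of about 0.44 is better than chance but leaves room for improvement.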


In [None]:
def performance(model):
    """Show accuracy, precision, recall, and F1 on the train and test splits."""
    # The model returns logits; argmax picks the predicted class
    y_pred_test = np.argmax(model.predict(X_test)['logits'], axis=-1)
    y_pred_train = np.argmax(model.predict(X_train)['logits'], axis=-1)

    metrics_list = [accuracy_score, precision_score, recall_score, f1_score]
    train_performance = [m(y_train, y_pred_train) for m in metrics_list]
    test_performance = [m(y_test, y_pred_test) for m in metrics_list]

    performance_df = pd.DataFrame([train_performance, test_performance],
                                  columns=['accuracy', 'precision', 'recall', 'f1'],
                                  index=['train', 'test'])
    display(performance_df)


In [None]:
performance(model)



|       | accuracy | precision | recall   | f1       |
|-------|----------|-----------|----------|----------|
| train | 0.92325  | 0.966636  | 0.850764 | 0.905006 |
| test  | 0.834537 | 0.851399  | 0.744648 | 0.794454 |
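
For readers who want to see what the four scikit-learn metrics measure, here is a small hand-rolled sketch for binary labels. `binary_metrics` is a hypothetical helper and the toy labels are invented; the last lines check that the F1 value in the test row above is the harmonic mean of the reported precision and recall.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


toy = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])  # made-up labels
print(toy)

# F1 is the harmonic mean of precision and recall (test-row values from the table above).
p, r = 0.851399, 0.744648
f1_from_table = 2 * p * r / (p + r)
print(round(f1_from_table, 4))
```

The gap between precision and recall in both rows indicates that the model misses real disaster tweets more often than it raises false alarms.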


# LSTM

In [11]:
import numpy as np
import pandas as pd
# For preprocessing
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# For model training
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding


In [56]:

# Prepare the data to be fed to the model:

# Split the data into training, validation, and test sets
train_data, test_data = train_test_split(data, test_size=0.2)
train_data, val_data = train_test_split(train_data, test_size=0.05)

# Tokenize the text data
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_data['text'])

# Convert the text data to sequences
X_train = tokenizer.texts_to_sequences(train_data['text'])
X_val = tokenizer.texts_to_sequences(val_data['text'])
X_test = tokenizer.texts_to_sequences(test_data['text'])

# Pad the sequences to a fixed length
max_length = 32
X_train = pad_sequences(X_train, maxlen=max_length, padding='post', truncating='post')
X_val = pad_sequences(X_val, maxlen=max_length, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_length, padding='post', truncating='post')
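
The word-level tokenizer used here behaves differently from WordPiece: whole words are ranked by frequency, and anything unseen or ranked at `num_words` or beyond is replaced by the OOV index. The sketch below is a simplified imitation of that behaviour as I understand it, not the Keras implementation (it only lowercases and splits on whitespace); `fit_word_index` and `to_sequence` are hypothetical names.

```python
from collections import Counter


def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 0 is reserved for padding, 1 for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices, sending unseen or too-rare words to the OOV index."""
    oov = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        i = word_index.get(word)
        ids.append(i if i is not None and i < num_words else oov)
    return ids


word_index = fit_word_index(["fire near the forest", "fire in the school", "fire alarm"])
sequence = to_sequence("fire near the lake", word_index, num_words=4)
print(sequence)
```

Every emitted index stays below `num_words`, so an `Embedding` layer fed with these sequences needs an `input_dim` of at least `num_words` to have a row for each possible index.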


In [57]:
model = Sequential()
# input_dim must cover every index the tokenizer can emit (num_words=10000 above)
model.add(Embedding(input_dim=10000, output_dim=32, input_length=max_length))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
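
To get a feel for the size of this network compared with BERT, the trainable parameters of each layer can be counted by hand. This is a back-of-the-envelope sketch with hypothetical helper names; it uses the standard formulas for an embedding table, an LSTM with one bias vector per gate, and a dense layer, and it sizes the embedding to the tokenizer's `num_words=10000`.

```python
def embedding_params(vocab_size, dim):
    """One trainable vector of length `dim` per vocabulary index."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


print("Embedding:", embedding_params(10000, 32))  # sized to num_words=10000
print("LSTM:     ", lstm_params(32, 64))
print("Dense:    ", dense_params(64, 1))
```

Calling `model.summary()` after building the model is the usual way to confirm such counts.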

In [58]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [59]:
model.fit(X_train, train_data['target'], validation_data=(X_val, val_data['target']), epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fb359215d50>

In [60]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, test_data['target'])
print(f'Test loss: {loss}, Test accuracy: {accuracy}')

Test loss: 0.4725279211997986, Test accuracy: 0.7912015914916992


In [42]:
y_pred = model.predict(X_test)
y_pred_classes = (y_pred > 0.5).astype('int32')



In [43]:


def performance(model):
    """Show accuracy, precision, recall, and F1 on the train and test splits."""
    y_train = train_data['target']
    y_test = test_data['target']
    # The sigmoid output is a probability; threshold at 0.5 to get class labels
    y_pred_test = (model.predict(X_test) > 0.5).astype('int32').ravel()
    y_pred_train = (model.predict(X_train) > 0.5).astype('int32').ravel()

    metrics_list = [accuracy_score, precision_score, recall_score, f1_score]
    train_performance = [m(y_train, y_pred_train) for m in metrics_list]
    test_performance = [m(y_test, y_pred_test) for m in metrics_list]

    performance_df = pd.DataFrame([train_performance, test_performance],
                                  columns=['accuracy', 'precision', 'recall', 'f1'],
                                  index=['train', 'test'])
    display(performance_df)

In [61]:
performance(model)



|       | accuracy | precision | recall   | f1       |
|-------|----------|-----------|----------|----------|
| train | 0.889196 | 0.946237  | 0.784759 | 0.857966 |
| test  | 0.791202 | 0.828678  | 0.663189 | 0.736755 |
