<a href="https://colab.research.google.com/github/tanzam085-a11y/fairseq/blob/main/BiLSTM_%E2%80%94_Sequence_Classification_(Text)_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

BiLSTM — Sequence Classification (Text)


# Task
Analyze the provided text classification task, focusing on sentiment analysis using BiLSTM and uni-directional LSTM models. The analysis should include data loading, cleaning, tokenization, model building, training, evaluation, and comparison of the two models. The dataset to be used is the Sentiment140 dataset provided as a zip file at "/content/drive/MyDrive/taskfolder/Sentiment140 dataset with 1.6 million tweets.zip".

## Load and sample data

### Subtask:
Load the dataset and sample a smaller portion for faster experimentation.


**Reasoning**:
Load the CSV into a pandas DataFrame, supplying the column names explicitly because the file has no header row, then draw a fixed-seed random sample of 100,000 rows for faster experimentation.



In [13]:
import pandas as pd

column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
df = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='latin-1', names=column_names)

# Sample a smaller portion of the DataFrame
sampled_df = df.sample(n=100000, random_state=42)

# Display the first few rows of the sampled DataFrame
display(sampled_df.head())

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 541200 | 0 | 2200003196 | Tue Jun 16 18:18:12 PDT 2009 | NO_QUERY | LaLaLindsey0609 | @chrishasboobs AHHH I HOPE YOUR OK!!! |
| 750 | 0 | 1467998485 | Mon Apr 06 23:11:14 PDT 2009 | NO_QUERY | sexygrneyes | @misstoriblack cool , i have no tweet apps fo... |
| 766711 | 0 | 2300048954 | Tue Jun 23 13:40:11 PDT 2009 | NO_QUERY | sammydearr | @TiannaChaos i know just family drama. its la... |
| 285055 | 0 | 1993474027 | Mon Jun 01 10:26:07 PDT 2009 | NO_QUERY | Lamb_Leanne | School email won't open and I have geography ... |
| 705995 | 0 | 2256550904 | Sat Jun 20 12:56:51 PDT 2009 | NO_QUERY | yogicerdito | upper airways problem |
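
The `read_csv` call above assumes the archive named in the task description has already been unpacked under `/content`. A minimal sketch of that step, using only the standard library; `extract_csv` is a hypothetical helper, not part of the original notebook, and the commented-out paths are the ones quoted in this document.

```python
import zipfile
from pathlib import Path


def extract_csv(zip_path, out_dir):
    """Extract every .csv member of zip_path into out_dir and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a CSV file (readme files, folders, ...).
            if member.lower().endswith(".csv"):
                extracted.append(Path(archive.extract(member, path=out_dir)))
    return extracted


# Hypothetical usage with the paths named in this notebook:
# extract_csv(
#     "/content/drive/MyDrive/taskfolder/Sentiment140 dataset with 1.6 million tweets.zip",
#     "/content",
# )
```

The function returns the extracted paths, so the caller can pass the first one to `pd.read_csv` instead of hard-coding the file name.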


## Clean tweets

### Subtask:
Define and apply a function that removes URLs and mentions and lowercases tweets.


**Reasoning**:
Define `clean_tweet` to strip URLs and @-mentions and lowercase the text, then apply it to the 'text' column and store the result in 'cleaned_text'. Tokenization is handled later by the Keras `Tokenizer`.



In [14]:
import re

def clean_tweet(tweet):
    """Removes URLs and mentions from a tweet and lowercases it."""
    # Remove URLs (http\S+ already covers https links)
    tweet = re.sub(r"http\S+|www\S+", '', tweet)
    # Remove mentions
    tweet = re.sub(r"@\S+", '', tweet)
    # Convert to lowercase
    tweet = tweet.lower()
    return tweet

# Apply the clean_tweet function to the 'text' column
sampled_df['cleaned_text'] = sampled_df['text'].apply(clean_tweet)

# Display the first few rows with the new 'cleaned_text' column
display(sampled_df[['text', 'cleaned_text']].head())

| | text | cleaned_text |
|---|---|---|
| 541200 | @chrishasboobs AHHH I HOPE YOUR OK!!! | ahhh i hope your ok!!! |
| 750 | @misstoriblack cool , i have no tweet apps fo... | cool , i have no tweet apps for my razr 2 |
| 766711 | @TiannaChaos i know just family drama. its la... | i know just family drama. its lame.hey next ... |
| 285055 | School email won't open and I have geography ... | school email won't open and i have geography ... |
| 705995 | upper airways problem | upper airways problem |
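
Deleting a mention or URL leaves the surrounding spaces in place, so a cleaned tweet can begin with a space or contain runs of spaces. The sketch below shows one way to tidy that up; `clean_tweet_normalized` is a hypothetical variant, not the function used in this notebook. The Keras tokenizer used later splits on whitespace and discards empty tokens, so this change is likely cosmetic for the model.

```python
import re


def clean_tweet_normalized(tweet):
    """Hypothetical variant of clean_tweet that also tidies whitespace."""
    tweet = re.sub(r"http\S+|www\S+", "", tweet)  # remove URLs
    tweet = re.sub(r"@\S+", "", tweet)            # remove mentions
    tweet = tweet.lower()
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", tweet).strip()


example = "@someuser Check this out http://example.com  SO cool!"
print(clean_tweet_normalized(example))  # check this out so cool!
```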


## Prepare data

### Subtask:
Map the sentiment labels to 0/1 and split the data into training and validation sets.


**Reasoning**:
The raw labels are 0 (negative) and 4 (positive). Map them to 0 and 1 so they suit a sigmoid output trained with binary cross-entropy, then hold out 20% of the sample for validation.



In [15]:
from sklearn.model_selection import train_test_split

sampled_df['target'] = sampled_df['target'].map({0: 0, 4: 1})

X = sampled_df['cleaned_text']
y = sampled_df['target']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)

Shape of X_train: (80000,)
Shape of X_val: (20000,)
Shape of y_train: (80000,)
Shape of y_val: (20000,)
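
The split above is purely random, so the two classes are only approximately balanced across the training and validation sets. If exact proportions matter, `train_test_split` accepts a `stratify` argument. A minimal sketch on a toy DataFrame; the `toy` data is invented for illustration and stands in for `sampled_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for sampled_df: five negative and five positive tweets.
toy = pd.DataFrame({
    "cleaned_text": [f"tweet {i}" for i in range(10)],
    "target": [0] * 5 + [1] * 5,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy["cleaned_text"], toy["target"],
    test_size=0.2, random_state=42,
    stratify=toy["target"],  # keep the 0/1 ratio the same in both splits
)

# Class shares in each split.
print(y_tr.value_counts(normalize=True).sort_index())
print(y_va.value_counts(normalize=True).sort_index())
```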


## Tokenize and pad sequences

### Subtask:
Use Keras Tokenizer to tokenize the text data and pad sequences to a fixed length.


**Reasoning**:
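Fit a Keras `Tokenizer` on the training text only, limiting the vocabulary to `max_words` = 10,000 indices and mapping every other word to an `<OOV>` token. Convert both splits to integer sequences, then pad or truncate each sequence at the end to `max_len` = 100 tokens and print the resulting shapes.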



In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000
max_len = 100

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_val_sequences = tokenizer.texts_to_sequences(X_val)

X_train_padded = pad_sequences(X_train_sequences, maxlen=max_len, padding='post', truncating='post')
X_val_padded = pad_sequences(X_val_sequences, maxlen=max_len, padding='post', truncating='post')

print("Shape of X_train_padded:", X_train_padded.shape)
print("Shape of X_val_padded:", X_val_padded.shape)

Shape of X_train_padded: (80000, 100)
Shape of X_val_padded: (20000, 100)
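
To make the tokenize-and-pad step concrete, here is a simplified pure-Python illustration of what it produces. It is not the Keras implementation: it skips the punctuation filtering and lowercasing that `Tokenizer` applies by default, and the helper names (`build_word_index`, `encode`, `pad_post`) are hypothetical.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for OOV, 0 for padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word, oov)
        ids.append(idx if idx < num_words else oov)
    return ids


def pad_post(ids, max_len):
    """Truncate at the end, then right-pad with zeros (padding='post', truncating='post')."""
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


train_texts = ["good day", "bad day", "good good mood"]
word_index = build_word_index(train_texts)
print(word_index)
print(pad_post(encode("good unseen day", word_index, num_words=10), max_len=5))
```

The word "unseen" never appeared in the training texts, so it is encoded as the OOV index 1, and the three-token sequence is right-padded with zeros to length 5.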


## Build and train BiLSTM model

### Subtask:
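Build a stacked Bidirectional LSTM model and train it on the prepared data.


**Reasoning**:
Build a 128-dimensional embedding followed by two Bidirectional LSTM layers (64 and 32 units per direction), dropout of 0.5, and a single sigmoid output unit. Compile with Adam and binary cross-entropy, then train for 5 epochs with a batch size of 32.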



In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

embedding_dim = 128

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

history = model.fit(X_train_padded, y_train, epochs=5, batch_size=32, validation_data=(X_val_padded, y_val))



Epoch 1/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 541s 214ms/step - accuracy: 0.7208 - loss: 0.5404 - val_accuracy: 0.7825 - val_loss: 0.4536
Epoch 2/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 541s 216ms/step - accuracy: 0.8194 - loss: 0.3990 - val_accuracy: 0.7882 - val_loss: 0.4511
Epoch 3/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 563s 217ms/step - accuracy: 0.8534 - loss: 0.3356 - val_accuracy: 0.7866 - val_loss: 0.4705
Epoch 4/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 554s 214ms/step - accuracy: 0.8760 - loss: 0.2873 - val_accuracy: 0.7796 - val_loss: 0.5294
Epoch 5/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 556s 211ms/step - accuracy: 0.8993 - loss: 0.2367 - val_accuracy: 0.7763 - val_loss: 0.5979

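In the log above, training accuracy keeps rising while validation loss bottoms out early and then climbs, which is the usual signature of overfitting. The short sketch below reads that off the logged numbers; the lists are copied by hand from the five epochs of this run.

```python
# Metrics copied from the training log above (first run).
val_loss = [0.4536, 0.4511, 0.4705, 0.5294, 0.5979]
val_accuracy = [0.7825, 0.7882, 0.7866, 0.7796, 0.7763]
train_accuracy = [0.7208, 0.8194, 0.8534, 0.8760, 0.8993]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
# Gap between training and validation accuracy after the last epoch.
gap_at_end = train_accuracy[-1] - val_accuracy[-1]

print(f"Lowest val_loss at epoch {best_epoch}: {val_loss[best_epoch - 1]}")
print(f"Train/val accuracy gap after epoch 5: {gap_at_end:.4f}")
```

Validation loss is lowest at epoch 2, so the later epochs mostly fit the training set. In Keras, an `EarlyStopping` callback that monitors `val_loss` with `restore_best_weights=True` is one common way to stop near that point; it was not used in this run.
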

## Inspect the full dataset

**Reasoning**:
Compute the percentage of missing values in each column of the full DataFrame `df` and display the first rows of the result.



In [18]:
df_missing_percentage = df.isnull().sum() / len(df) * 100
df_missing_percentage = df_missing_percentage.to_frame(name='missing_percentage')
display(df_missing_percentage.head())

| | missing_percentage |
|---|---|
| target | 0.0 |
| id | 0.0 |
| date | 0.0 |
| flag | 0.0 |
| user | 0.0 |


**Reasoning**:
Print the number of unique values in the 'user' column, then display the value counts for the 'flag' column.



In [19]:
print(f"Number of unique values in 'user': {df['user'].nunique()}")
display(df['flag'].value_counts())

Number of unique values in 'user': 659775


| flag | count |
|---|---|
| NO_QUERY | 1600000 |


**Reasoning**:
The 'flag' column holds a single value (NO_QUERY) in all 1,600,000 rows, so it carries no information and can be dropped. The 'id' column is likely a unique identifier with no modeling value and can be dropped as well. The 'date' column is stored as an object (string) type and needs converting to datetime before any time-based analysis or feature engineering. The 'user' column has 659,775 unique values, too many to use directly as a categorical feature; it is kept for now and may be removed or analyzed later. Displaying the first rows of the modified DataFrame confirms the changes.



**Reasoning**:
The previous attempt to convert the 'date' column to datetime failed due to an incorrect format string that did not account for the timezone name (PDT). The error message suggests using the 'mixed' format option, which allows pandas to infer the format for each element. This is a more robust approach when the exact format string is uncertain or varies slightly.



In [21]:
df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=False)
display(df.head())

| | target | date | user | text |
|---|---|---|---|---|
| 0 | 0 | 2009-04-06 22:19:45 | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 2009-04-06 22:19:49 | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 2009-04-06 22:19:53 | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 2009-04-06 22:19:57 | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 2009-04-06 22:19:57 | Karoli | @nationwideclass no, it's not behaving at all.... |
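
With `format='mixed'`, pandas infers the format of each element separately. An alternative sketch, assuming every date string follows the layout seen in the raw data (weekday, month, day, time, timezone abbreviation, year): drop the timezone token and parse with one explicit format. The result is timezone-naive, like the table above. The `dates` Series is a toy stand-in for `df['date']`.

```python
import pandas as pd

# Toy stand-in for df['date'], using the string layout shown in the raw data.
dates = pd.Series([
    "Mon Apr 06 22:19:45 PDT 2009",
    "Tue Jun 16 18:18:12 PDT 2009",
])

# Drop the timezone abbreviation (assumed to be a 3-4 letter uppercase token),
# then parse with an explicit format so pandas does not have to guess.
without_tz = dates.str.replace(r"\s[A-Z]{3,4}\s", " ", regex=True)
parsed = pd.to_datetime(without_tz, format="%a %b %d %H:%M:%S %Y")

print(parsed)
```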


In [22]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

embedding_dim = 128

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

history = model.fit(X_train_padded, y_train, epochs=5, batch_size=32, validation_data=(X_val_padded, y_val))



Epoch 1/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 554s 219ms/step - accuracy: 0.7214 - loss: 0.5380 - val_accuracy: 0.7865 - val_loss: 0.4463
Epoch 2/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 556s 217ms/step - accuracy: 0.8209 - loss: 0.3939 - val_accuracy: 0.7874 - val_loss: 0.4568
Epoch 3/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 546s 219ms/step - accuracy: 0.8528 - loss: 0.3377 - val_accuracy: 0.7839 - val_loss: 0.4884
Epoch 4/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 553s 215ms/step - accuracy: 0.8737 - loss: 0.2911 - val_accuracy: 0.7783 - val_loss: 0.5134
Epoch 5/5
2500/2500 ━━━━━━━━━━━━━━━━━━━━ 567s 217ms/step - accuracy: 0.8940 - loss: 0.2487 - val_accuracy: 0.7735 - val_loss: 0.5957

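
The task also calls for a uni-directional LSTM baseline, which this section does not train. As a rough guide to how the two models would differ in size, the sketch below counts trainable parameters with the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). The uni-directional layout is an assumption: the same unit counts as the BiLSTM above, without the `Bidirectional` wrappers.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates with input, recurrent and bias weights."""
    return 4 * (input_dim * units + units * units + units)


embedding_dim, max_words = 128, 10000
embedding = max_words * embedding_dim

# BiLSTM stack from this notebook: each Bidirectional wrapper holds two LSTMs,
# and its output width is 2 * units because both directions are concatenated.
bi_first = 2 * lstm_params(embedding_dim, 64)
bi_second = 2 * lstm_params(2 * 64, 32)
bi_dense = 2 * 32 + 1
bilstm_total = embedding + bi_first + bi_second + bi_dense

# Hypothetical uni-directional counterpart with the same unit counts.
uni_first = lstm_params(embedding_dim, 64)
uni_second = lstm_params(64, 32)
uni_dense = 32 + 1
lstm_total = embedding + uni_first + uni_second + uni_dense

print(f"BiLSTM parameters: {bilstm_total:,}")
print(f"LSTM parameters:   {lstm_total:,}")
```

Under these assumptions the embedding layer accounts for most of the parameters in both models, so the bidirectional layers add relatively little to the total size.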