<a href="https://colab.research.google.com/github/shraddha-an/nlp/blob/main/lstm_so_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Classification with an LSTM**

This notebook continues my NLP case study, which compares different models for classifying the quality of Stack Overflow questions. This installment trains an LSTM on top of a learned word embedding.

**Dataset**: **[Stack Overflow Questions](https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate)**

Other models in this series:

1. **[Training a Word Embedding](https://github.com/shraddha-an/nlp/blob/main/word_embedding_classification.ipynb)**
2. **[Pre-trained GloVe Embedding](https://github.com/shraddha-an/nlp/blob/main/pretrained_glove_classification.ipynb)**
3. **[fastText Classifier](https://github.com/shraddha-an/nlp/blob/main/so_fasttext.ipynb)**
4. **[BERT Model](https://github.com/shraddha-an/nlp/blob/main/so_bert.ipynb)**

## **1) Data Preparation**

In [1]:
# Importing libraries
# Data Manipulation/Handling
import pandas as pd, numpy as np

# NLP Preprocessing
from gensim.utils import simple_preprocess

In [4]:
# Importing data: train.csv for training, valid.csv as the held-out test set
dataset = pd.read_csv('train.csv')[['Body', 'Y']].rename(columns = {'Body': 'questions', 'Y': 'category'})
ds = pd.read_csv('valid.csv')[['Body', 'Y']].rename(columns = {'Body': 'questions', 'Y': 'category'})

# Simple NLP Preprocessing
X_train = dataset.iloc[:, 0].apply(lambda x: ' '.join(simple_preprocess(x)))
X_test = ds.iloc[:, 0].apply(lambda x: ' '.join(simple_preprocess(x)))

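# One-hot encoding the target labels for the train & test sets
# (an explicit float dtype keeps the targets numeric across pandas versions)
y_train = pd.get_dummies(dataset[['category']], dtype = 'float32')
y_test = pd.get_dummies(ds[['category']], dtype = 'float32')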


In [5]:
print(X_train.shape, '\n', X_test.shape)

(45000,) 
 (15000,)
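
To make the label encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding a single label column produces. The three label strings are assumed for illustration (they are not printed anywhere in this notebook), and `one_hot` is a hypothetical helper, not a pandas function. Columns are ordered alphabetically here, which is how I understand `pd.get_dummies` to order them.

```python
# Hypothetical helper: one-hot encode a list of string labels.
def one_hot(labels):
    # Sort the distinct classes so the column order is deterministic
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # a single 1 in the column of the true class
        rows.append(row)
    return classes, rows

labels = ['HQ', 'LQ_EDIT', 'LQ_CLOSE', 'HQ']  # assumed label values
classes, encoded = one_hot(labels)

print(classes)  # ['HQ', 'LQ_CLOSE', 'LQ_EDIT']
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each row has exactly one non-zero entry, which is the target shape that a 3-unit softmax layer trained with `categorical_crossentropy` expects.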


## **2) Tokenization**

In [6]:
# Setting the size of the vocabulary & sequence length for the embedding
seq_len = 100
vocab_size = 2100

# Tokenization
from keras.preprocessing.text import Tokenizer

tk = Tokenizer(num_words = vocab_size)
tk.fit_on_texts(X_train)

X_train_seq = tk.texts_to_sequences(X_train)
X_test_seq = tk.texts_to_sequences(X_test)

word_index = tk.word_index

# Padding
from keras.preprocessing.sequence import pad_sequences

X_train_seq = pad_sequences(X_train_seq, padding = 'post', maxlen = seq_len)
X_test_seq = pad_sequences(X_test_seq, padding = 'post', maxlen = seq_len)
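
The tokenization and padding steps above can be sketched in plain Python. This is an approximation of the behaviour under stated assumptions, not the Keras implementation: words are ranked by frequency starting at index 1, only indices below `num_words` are kept, unknown words are dropped, and sequences longer than `maxlen` lose their earliest tokens (the default `truncating = 'pre'`) before zeros are appended (`padding = 'post'`). The helper names and the toy sentences are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency; ties keep first-seen order. Index 0 is reserved for padding.
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    # Drop unknown words and words whose index is not below num_words
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad_post(seq, maxlen):
    # Keep the last maxlen tokens, then pad with zeros on the right
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

train_texts = ['how to sort a list', 'how to reverse a list in python']
word_index = build_word_index(train_texts)  # fitted on training text only

new_texts = train_texts + ['sort python']
padded = [pad_post(to_sequence(t, word_index, num_words=6), maxlen=4)
          for t in new_texts]

print(word_index)
print(padded)  # [[2, 5, 3, 4], [1, 2, 3, 4], [5, 0, 0, 0]]
```

Note that the first sentence loses its first token rather than its last. If the start of a question matters more than its end, passing `truncating = 'post'` to `pad_sequences` would be worth trying.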

## **3) Training the LSTM Model**

In [7]:
# Embedding + LSTM
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim = vocab_size, output_dim = 8, input_length = seq_len))
model.add(LSTM(units = 10, activation = 'tanh'))
model.add(Dense(units = 3, activation = 'softmax'))
model.compile(optimizer = 'adam', metrics = ['accuracy'], loss = 'categorical_crossentropy')
model.summary()

history = model.fit(X_train_seq, y_train, epochs = 10, batch_size = 512, verbose = 1)

# Saving the model
#model.save('saved_models/lstm_79.h5')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 8)            16800     
_________________________________________________________________
lstm (LSTM)                  (None, 10)                760       
_________________________________________________________________
dense (Dense)                (None, 3)                 33        
Total params: 17,593
Trainable params: 17,593
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
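
As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch below uses the standard formulas: an embedding stores one vector per vocabulary entry, an LSTM has four gates that each own an input kernel, a recurrent kernel and a bias, and a dense layer has a weight matrix plus a bias. The variable names are mine, chosen to mirror the hyperparameters used in this notebook.

```python
vocab_size, embed_dim, lstm_units, n_classes = 2100, 8, 10, 3

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: weight matrix + bias
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
# 16800 760 33 17593
```

Almost all of the model's capacity sits in the embedding table; the LSTM itself contributes only 760 weights.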


## **4) Evaluating Performance on Test Set**

In [8]:
# Evaluation
loss, acc = model.evaluate(X_test_seq, y_test, verbose = 1)
print('\nAccuracy: {}\nLoss: {}'.format(acc, loss))



Accuracy: 0.7816666960716248
Loss: 0.5557811856269836
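
For one-hot targets and softmax outputs, the reported accuracy is the fraction of rows where the highest-probability class matches the true class. Here is a minimal sketch of that computation; the probabilities are made-up toy values, not predictions from this model, and both helper functions are hypothetical.

```python
def argmax(values):
    # Index of the largest value in a list
    return max(range(len(values)), key=values.__getitem__)

def categorical_accuracy(y_true, y_prob):
    # A row counts as correct when the predicted class equals the true class
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Toy one-hot targets and softmax-style outputs (illustrative values only)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.6, 0.3],
          [0.5, 0.3, 0.2],   # predicted class 0, true class 2: a miss
          [0.2, 0.5, 0.3]]

accuracy = categorical_accuracy(y_true, y_prob)
print(accuracy)  # 0.75
```

The same argmax step, applied to `model.predict(X_test_seq)`, would recover predicted class indices for a confusion matrix or per-class metrics.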
