<a href="https://colab.research.google.com/github/shraddha-an/nlp/blob/main/so_fasttext.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Classification with fastText**
This notebook continues my **[NLP Case Study](https://colab.research.google.com/github/shraddha-an/nlp/blob/main/word_embedding_classification.ipynb)**, which compares different embeddings for classifying the quality of Stack Overflow questions.

**Dataset**: **[Stack Overflow Questions](https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate)**


Other models in this series:

1. [Training an Embedding](https://github.com/shraddha-an/nlp/blob/main/word_embedding_classification.ipynb)
2. [Pre-trained GloVe Embedding](https://github.com/shraddha-an/nlp/blob/main/pretrained_glove_classification.ipynb)
3. [BERT Model](https://github.com/shraddha-an/nlp/blob/main/so_bert.ipynb)

## **Prerequisites:**

**1) fastText**: fastText is an open-source library from the Facebook AI Research lab for learning word representations and for fast, accurate text classification on very large datasets.

Follow **[these instructions](https://fasttext.cc/docs/en/supervised-tutorial.html)** to install fastText.




## **1) Installing fastText**


In [None]:
# Installing & Building fastText
!wget https://github.com/facebookresearch/fastText/archive/0.2.0.zip
!unzip 0.2.0.zip
%cd fastText-0.2.0
!make

# Installing the Python bindings from the unpacked source tree
!pip install .

## **2) Data Preprocessing**

In [2]:
# Importing the required libraries
# Data Manipulation/Handling
import numpy as np, pandas as pd

# NLP Preprocessing
from gensim.utils import simple_preprocess

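# fastText (release 0.2.0, installed above, exposes the Python module as `fastText`;
# later releases renamed it to `fasttext`)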
import fastText

In [3]:
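# Loading the training set (train.csv) and the held-out set (valid.csv, used here as the test set),
# keeping only the question body and its quality label, and renaming both columns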
dataset = pd.read_csv('train.csv')[['Body', 'Y']].rename(columns = {'Body': 'questions', 'Y': 'category'})
ds = pd.read_csv('valid.csv')[['Body', 'Y']].rename(columns = {'Body': 'questions', 'Y': 'category'})


In [4]:
# Simple NLP preprocessing: simple_preprocess lowercases the text, tokenizes it,
# and drops digits, punctuation, and very short or very long tokens
dataset.iloc[:, 0] = dataset.iloc[:, 0].apply(lambda x: ' '.join(simple_preprocess(x)))
ds.iloc[:, 0] = ds.iloc[:, 0].apply(lambda x: ' '.join(simple_preprocess(x)))
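
To make the cleaning step above more concrete, here is a rough standard-library approximation of what `simple_preprocess` does with its default settings. It is a sketch, not gensim's actual code: the function name `approx_simple_preprocess` and the sample sentence are made up for illustration, and the behaviour it mimics (lowercasing, keeping runs of non-digit word characters, dropping tokens shorter than 2 or longer than 15 characters) is my reading of the defaults.

```python
import re

# Runs of "word" characters that are not digits (underscores count as word characters).
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Hypothetical stand-in for gensim's simple_preprocess (defaults assumed)."""
    tokens = ALPHABETIC.findall(text.lower())
    # Keep tokens within the length bounds and skip ones starting with an underscore.
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

sample = "I am having 2 different tables: m_master & tbl_appointment!"
cleaned = " ".join(approx_simple_preprocess(sample))
print(cleaned)
```

This matches what the preview of the DataFrames shows further down: single-letter words such as "I" disappear, while identifiers like `m_master` survive because the underscore counts as a word character.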

In [5]:
# Prefixing each row of categories with '__label__'
dataset.iloc[:, 1] = dataset.iloc[:, 1].apply(lambda x: '__label__' + x)
ds.iloc[:, 1] = ds.iloc[:, 1].apply(lambda x: '__label__' + x)


In [6]:
# Flipping the column positions as needed by fastText: __label__cat1 Text
dataset = dataset[['category', 'questions']]
ds = ds[['category', 'questions']]

# Looking at the DataFrames
ds.head(5), dataset.tail(4)

(           category                                          questions
 0  __label__LQ_EDIT  am having different tables like select from sy...
 1  __label__LQ_EDIT  have two table m_master and tbl_appointment th...
 2       __label__HQ  trying to extract us states from wiki url and ...
 3  __label__LQ_EDIT  so new to wanna make an application that can e...
 4  __label__LQ_EDIT  basically have this array array array sub comp...,
                 category                                          questions
 44996  __label__LQ_CLOSE  am working on learning python and was wonderin...
 44997  __label__LQ_CLOSE  it looks like it costs days per month in azure...
 44998  __label__LQ_CLOSE  any questions want to implement quiz that clic...
 44999  __label__LQ_CLOSE  very new to programming and teaching myself ma...)

In [8]:
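# Saving the DataFrames as space-separated text files, one example per line
# Note: with sep = ' ', pandas wraps any question containing a space in double quotes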
dataset.to_csv('train.txt', index = False, sep = ' ', header = None)
ds.to_csv('test.txt', index = False, sep = ' ', header = None)
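
Because `to_csv` applies CSV quoting whenever a field contains the separator, the saved questions end up wrapped in double quotes. If you would rather feed fastText plain `__label__X text` lines, one option is to write the file by hand. The helper below is a minimal sketch using only the standard library; `write_fasttext_file` and the sample rows are hypothetical, and this is an alternative to the cell above, not what produced the score reported in this notebook.

```python
import os
import tempfile

def write_fasttext_file(path, labels, texts):
    """Write one '<label> <text>' line per example, with no CSV quoting."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in zip(labels, texts):
            # Collapse all whitespace (including newlines) so each example stays on one line.
            clean = " ".join(str(text).split())
            handle.write(f"{label} {clean}\n")

# Tiny illustrative sample in the same shape as the DataFrames above.
labels = ["__label__HQ", "__label__LQ_EDIT"]
texts = ["trying to extract us states", "have two table m_master"]

path = os.path.join(tempfile.mkdtemp(), "train_sample.txt")
write_fasttext_file(path, labels, texts)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

With the DataFrames from this notebook, the call would look like `write_fasttext_file('train.txt', dataset['category'], dataset['questions'])`. Retraining on a file written this way may give a slightly different score, since the quote characters no longer attach to the first and last tokens.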


## **3) Training fastText**

In [25]:
# Training the fastText classifier
# wordNgrams = 2 adds word bigram features; epoch = 20 makes 20 passes over the training data
model = fastText.train_supervised('train.txt', wordNgrams = 2, epoch = 20)

## **4) Evaluating fastText**

In [26]:
# Evaluating fastText's performance on the test set
# Returns (number of examples, precision@1, recall@1)
model.test('test.txt')

(15000, 0.8256, 0.8256)
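
The tuple above can be read as follows: 15,000 test questions, precision@1 of 0.8256, and recall@1 of 0.8256. Since every question carries exactly one label and the model is scored on its single top prediction, both numbers coincide with plain accuracy. The snippet below is a small sketch that unpacks the tuple; `summarize_test_result` is a hypothetical helper, not part of fastText.

```python
def summarize_test_result(result):
    """Turn fastText's (N, precision@1, recall@1) tuple into a readable summary."""
    n_examples, precision_at_1, recall_at_1 = result
    # With one true label and one predicted label per example,
    # precision@1 equals the fraction of correctly classified examples.
    n_correct = round(n_examples * precision_at_1)
    return {
        "examples": n_examples,
        "correct": n_correct,
        "precision@1": precision_at_1,
        "recall@1": recall_at_1,
    }

# The values reported by model.test('test.txt') above.
summary = summarize_test_result((15000, 0.8256, 0.8256))
print(summary)
```

In other words, the classifier assigned the right quality label to 12,384 of the 15,000 held-out questions.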