# Text classification via Ensemble of FastText and scikit-learn

Text classification is a common solution to many industry problems, including chatbots, sentiment analysis, document tagging, and recommendations. This article focuses on [scikit-learn](http://scikit-learn.org), a mature machine learning library for Python, and [fastText](https://fasttext.cc/), a relatively new library from Facebook AI Research.

The data comes from publicly available [reddit](http://reddit.com/) posts hosted [in a Google BigQuery repository](https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts?pli=1). The subset used here is included in this page's [GitHub repository](https://github.com/victorkwak/FastTextEnsemble), so there is no need to download it separately to follow this tutorial. Using Google Cloud to download the data is outside the scope of this article.

## Setup

I recommend downloading [Anaconda](https://www.anaconda.com/download/) for the base Python libraries needed for this tutorial. You will also need fastText, which is available from its [GitHub repository](https://github.com/facebookresearch/fastText).

## Reading in the data

We will use [pandas](https://pandas.pydata.org/) to read in and explore the data.

In [17]:
import pandas as pd

# Unzip the zip file included in the repository first so that this CSV exists.
data = pd.read_csv('data/cs_subs.csv').dropna().drop_duplicates()

print('Number of unique subreddits in the dataset:', len(data['subreddit'].unique()))
print('Number of title posts in dataset:', data.shape[0])
data.sample(10)

Number of unique subreddits in the dataset: 136
Number of title posts in dataset: 615825


| Unnamed: 0 | title | score | subreddit |
|---|---|---|---|
| 357036 | LUKS help | 2 | linuxquestions |
| 952 | Dragons when upgrading from Angular 2.1.x to 2... | 5 | Angular2 |
| 261678 | Aruba S3500-24p noise level | 0 | networking |
| 437806 | [Bugs] And yet another bug in the new App Stor... | 7 | iOSBeta |
| 567615 | torvalds/linux has infinite contributors | 181 | ProgrammerHumor |
| 333320 | After deleting files, Mac became slow | 0 | mac |
| 450697 | How can I find a URL needle in an internet hay... | 1 | computerscience |
| 102575 | Are there any differences between modern web a... | 5 | cscareerquestions |
| 211682 | I had no idea you had to unroot before encrypt... | 1 | Android |
| 150866 | Dart talk by kevmoo in Seattle Feb 21 (RSVP re... | 5 | dartlang |


### Filtering out subreddits with 150 or fewer posts

In [18]:
counts = data['subreddit'].value_counts()
counts = counts[counts > 150]
top_values = list(counts.index)
data = data[data['subreddit'].isin(top_values)]

In [19]:
print('Number of unique subreddits after filtering:', len(data['subreddit'].unique()))
print('Number of title posts after filtering:', data.shape[0])

Number of unique subreddits after filtering: 117
Number of title posts after filtering: 614464
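
The count-based filter can be illustrated without pandas. The snippet below is a minimal sketch on made-up labels, with a hypothetical threshold of 2 standing in for the 150 used above. As in the cell above, the comparison is strict, so a subreddit with exactly the threshold number of posts is dropped.

```python
from collections import Counter

# Hypothetical subreddit labels, one per post (toy data, not from the dataset).
labels = ["linux"] * 5 + ["rust"] * 3 + ["mac"] * 1

min_posts = 2  # stand-in for the threshold of 150 used in the article

counts = Counter(labels)

# Keep only the labels whose count is strictly greater than the threshold.
kept_labels = {label for label, count in counts.items() if count > min_posts}

# Drop every post that belongs to a label that did not make the cut.
kept_posts = [label for label in labels if label in kept_labels]

print(sorted(kept_labels))
print(len(kept_posts))
```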


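The dataset is very skewed. To trim it, we compute the average reddit score (upvotes minus downvotes) for each subreddit and keep only the posts that score at or above their subreddit's average. I use the mean rather than the median because a median cutoff would simply cut every subreddit in half. Hopefully, filtering by mean will take relatively larger chunks out of the more popular subreddits than the less popular ones, since a handful of very high-scoring posts can pull the mean well above the median.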

In [20]:
# Mean score per subreddit, as a dict keyed by subreddit name.
means = data.groupby('subreddit')['score'].mean().to_dict()

# For each subreddit, keep only the posts scoring at or above that subreddit's mean.
filtered = []
for subreddit in data['subreddit'].unique():
    in_subreddit = data['subreddit'] == subreddit
    filtered.append(data.loc[in_subreddit & (data['score'] >= means[subreddit])])
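
To see why a mean cutoff behaves differently from a median cutoff, here is a small standard-library sketch. The two score lists are invented for illustration and are not taken from the dataset: one has a long tail of viral posts, the other has scores clustered together.

```python
from statistics import mean, median

# Hypothetical scores for two subreddits (toy values, not from the dataset).
scores = {
    "popular": [1, 1, 2, 2, 3, 5, 40, 500],  # a few viral posts, long tail
    "niche": [1, 2, 2, 3, 3, 4],             # scores clustered together
}

kept_by_mean = {}
kept_by_median = {}
for name, values in scores.items():
    # Count the posts that survive each kind of cutoff.
    kept_by_mean[name] = sum(s >= mean(values) for s in values)
    kept_by_median[name] = sum(s >= median(values) for s in values)

print(kept_by_mean)
print(kept_by_median)
```

In this toy case the median keeps half of each group, while the mean keeps only one of the eight posts in the long-tailed group and still keeps half of the clustered one.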

In [21]:
filtered_data = pd.concat(filtered)
filtered_data['subreddit'].value_counts()

Android                6771
linuxquestions         3893
cscareerquestions      3772
learnpython            3060
webdev                 2565
hackernews             2563
iOSBeta                2424
Windows10              2338
linux4noobs            2274
ProgrammerHumor        2182
networking             2133
linux                  2013
windows                2005
androiddev             1901
javascript             1880
learnprogramming       1813
softwaregore           1746
ios                    1737
androidthemes          1630
java                   1567
chrome                 1548
rust                   1466
Python                 1464
web_design             1361
javahelp               1326
aws                    1253
arduino                1205
iOSProgramming         1189
mac                    1150
csshelp                1100
                       ... 
operabrowser            221
mongodb                 197
macapps                 189
windowsinsiders         189
LanguageTechnology  

**Splitting data into train (60%), val (20%), and test (20%).**

In [22]:
from sklearn.model_selection import train_test_split

X = filtered_data['title']
y = filtered_data['subreddit']
# Hold out 40% of the data, then split that held-out part evenly into test and validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=17)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=31)
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)
print(y_train.shape)
print(y_test.shape)
print(y_val.shape)

(58960,)
(19654,)
(19654,)
(58960,)
(19654,)
(19654,)
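
The two-step split can be checked with a little arithmetic. The sketch below assumes that the held-out share is rounded up to a whole number of rows, which is consistent with the shapes printed above. The total is simply the sum of those three shapes.

```python
import math

# Total number of rows, taken from the shapes printed above.
n_samples = 58960 + 19654 + 19654

# First split: test_size=0.4 holds out 40% of the rows (assumed rounded up).
n_holdout = math.ceil(0.4 * n_samples)
n_train = n_samples - n_holdout

# Second split: test_size=0.5 sends half of the held-out rows to validation.
n_val = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_val

print(n_train, n_test, n_val)
for size in (n_train, n_test, n_val):
    print(round(size / n_samples, 3))
```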


In [23]:
from sklearn.preprocessing import LabelEncoder


label_encoder = LabelEncoder()
label_encoder.fit(data['subreddit'])

y_train = label_encoder.transform(y_train)
y_val = label_encoder.transform(y_val)
y_test = label_encoder.transform(y_test)
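
For readers unfamiliar with label encoding, the idea is to map each distinct class name to an integer id, with ids assigned in sorted order. The following is a plain-Python sketch of that idea on made-up names. It is not the scikit-learn implementation.

```python
# Hypothetical subreddit names standing in for data['subreddit'].
subreddits = ["linux", "Android", "rust", "Android", "Python"]

# Distinct names in sorted order; uppercase letters sort before lowercase ones.
classes = sorted(set(subreddits))

# Map each name to its position in the sorted list.
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in subreddits]

# Going back from ids to names is a list lookup.
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

With the real encoder, `label_encoder.inverse_transform` plays the role of the `decoded` lookup, which is handy when turning predicted ids back into subreddit names.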

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
X_train_vectors = vectorizer.fit_transform(X_train)
X_val_vectors = vectorizer.transform(X_val)
X_test_vectors = vectorizer.transform(X_test)
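
As a rough picture of what the vectorizer computes, the sketch below builds TF-IDF weights by hand for three invented titles. It assumes scikit-learn's documented defaults (smoothed IDF, then L2 normalisation of each row) and skips stop-word removal and the library's own tokenisation. Treat it as an illustration of the weighting, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical titles (toy corpus, not from the dataset).
docs = [
    "linux kernel update",
    "linux help",
    "rust compiler update",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many titles does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(tokens):
    """Return the L2-normalised TF-IDF weights of one tokenised title."""
    tf = Counter(tokens)
    raw = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


vector = tfidf(tokenized[0])
print(vector)
```

In this toy corpus "kernel" appears in one title while "linux" and "update" each appear in two, so "kernel" receives the largest weight in the first title.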

In [25]:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


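def top_n_accuracy(y_true, probs, n=5):
    """Fraction of samples whose true label is among the n highest-scoring classes."""
    # Column indices of the n largest probabilities in each row, best first.
    top_n_list = [np.argsort(-prob)[:n] for prob in probs]
    # 1 if the true label is among a row's top n, else 0.
    hits = [int(label in top_n) for label, top_n in zip(y_true, top_n_list)]
    return np.sum(hits) / y_true.shape[0]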
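
Here is a dependency-free sketch of the same metric on a tiny, invented probability matrix, to show how the score changes with `n`. The function name `top_n_accuracy_sketch` is hypothetical. Like the function above, it assumes that column `i` of the probability matrix corresponds to encoded class id `i`.

```python
def top_n_accuracy_sketch(y_true, probs, n=5):
    """Fraction of rows whose true class id is among the n largest scores."""
    hits = 0
    for true_label, row in zip(y_true, probs):
        # Class ids ordered from highest to lowest probability.
        ranked = sorted(range(len(row)), key=lambda i: -row[i])
        hits += true_label in ranked[:n]
    return hits / len(y_true)


# Hypothetical true class ids and per-class probabilities for three samples.
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],  # class 0 is ranked first: a hit at n=1
    [0.5, 0.1, 0.4],  # class 2 is ranked second: a hit from n=2
    [0.6, 0.1, 0.3],  # class 1 is ranked last: a hit only at n=3
]

top1 = top_n_accuracy_sketch(y_true, probs, n=1)
top2 = top_n_accuracy_sketch(y_true, probs, n=2)
top3 = top_n_accuracy_sketch(y_true, probs, n=3)
print(top1, top2, top3)
```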

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

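# Linear SVM. probability=True enables predict_proba at the cost of an internal
# cross-validation step, which slows training.
svm = SVC(kernel='linear', decision_function_shape='ovr', C=0.2, max_iter=1000, probability=True)
svm.fit(X_train_vectors, y_train)
svm_predictions = svm.predict(X_val_vectors)

print(accuracy_score(y_val, svm_predictions))
# print(classification_report(y_val, svm_predictions))

# With the default average=None, this returns one F1 score per class.
_, _, svm_f1, _ = precision_recall_fscore_support(y_val, svm_predictions, beta=1)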
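
Because `svm_f1` holds one value per class rather than a single number, it may help to see how a per-class F1 score is derived. The helper below, `per_class_f1`, is a hypothetical plain-Python sketch on invented labels. It is not scikit-learn's implementation.

```python
def per_class_f1(y_true, y_pred):
    """Return {class id: F1 score}, computed from per-class precision and recall."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Hypothetical true and predicted class ids.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

f1 = per_class_f1(y_true, y_pred)
print(f1)
```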

In [43]:
from gensim import utils

def format_for_fastext(X, y, filename):
    """Write one '__label__<id> <tokens>' line per title to data/<filename>."""
    prefix = '__label__'
    # The context manager closes the file even if writing fails partway.
    with open(''.join(['data/', filename]), 'w') as f:
        for title, label in zip(X, y):
            # simple_preprocess lowercases and tokenizes the title.
            tokens = utils.simple_preprocess(title.lower())
#             tokens = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
            f.write(''.join([prefix, str(label), ' ', ' '.join(tokens), '\n']))

format_for_fastext(X_train, y_train, 'reddit_fasttext_train.txt')
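
Each line of the training file has the form `__label__<id> <tokens>`. The sketch below shows what such lines look like for two titles from the sample above. The tokenizer is a rough regular-expression stand-in for gensim's `simple_preprocess` (lowercase, alphabetic runs only, very short and very long tokens dropped), and both the label ids and the helper name `to_fasttext_line` are hypothetical.

```python
import re


def to_fasttext_line(title, label, prefix="__label__"):
    """Format one training line; the tokenizer only approximates simple_preprocess."""
    words = re.findall(r"[^\W\d]+", title.lower())
    # Drop one-letter tokens and tokens longer than 15 characters.
    tokens = [word for word in words if 2 <= len(word) <= 15]
    return f"{prefix}{label} {' '.join(tokens)}\n"


# Label ids 57 and 3 are made up for illustration.
line_a = to_fasttext_line("LUKS help", 57)
line_b = to_fasttext_line("Dragons when upgrading from Angular 2.1.x to 2...", 3)

print(line_a, end="")
print(line_b, end="")
```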

In [56]:
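def test_fasttext(y, X, classifier, n=1):
    """Return a 0/1 array marking titles whose true label is in the top n predictions."""
    match = []
    for true, string in zip(y, X):
        # predict returns (labels, probabilities); labels look like '__label__12'.
        labels = classifier.predict(string, n)[0]
        predictions = [int(label.split('__label__')[1]) for label in labels]
        match.append(int(true in predictions))
    return np.array(match)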
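
The parsing step can be tried without a trained model. The sketch below uses a hand-written tuple shaped like a `(labels, probabilities)` prediction result for three candidates. The values and the helper name `parse_labels` are invented for illustration.

```python
def parse_labels(labels, prefix="__label__"):
    """Strip the fastText label prefix and return integer class ids."""
    return [int(label[len(prefix):]) for label in labels]


# Hypothetical prediction for one title, best candidate first.
fake_prediction = (("__label__12", "__label__4", "__label__57"), [0.61, 0.22, 0.05])

predicted_ids = parse_labels(fake_prediction[0])

# Suppose the true class id is 4: a miss at n=1 but a hit at n=3.
true_id = 4
top1_hit = int(true_id in predicted_ids[:1])
top3_hit = int(true_id in predicted_ids[:3])

print(predicted_ids, top1_hit, top3_hit)
```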

In [57]:
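import fastText  # newer releases of the Python bindings are imported as `fasttext` (lowercase)

classifier = fastText.train_supervised(
    input='data/reddit_fasttext_train.txt',
    lr=0.1,    # learning rate
    epoch=30,  # passes over the training file
    dim=64,    # size of the word vectors
    minn=2,    # shortest character n-gram
    maxn=5,    # longest character n-gram
)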

correct = test_fasttext(y_val, X_val, classifier)
correct.sum() / y_val.size

0.36704996438384047