# Text classification via Ensemble of FastText and scikit-learn

Text classification is becoming a common solution to many industry problems, including chatbots, sentiment analysis, document tagging, and recommendations. This article focuses on [scikit-learn](http://scikit-learn.org), a mature machine learning library in Python, and [fastText](https://fasttext.cc/), a relatively new library from Facebook AI Research.

The data we will be using is publicly available [reddit](http://reddit.com/) data, hosted [in a Google BigQuery repository](https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts?pli=1). A copy of the data used here is included in this page's [GitHub repository](https://github.com/victorkwak/FastTextEnsemble), so you do not need to download it separately to follow this tutorial. Using Google Cloud to fetch the data is outside the scope of this article.

## Setup

I recommend downloading [Anaconda](https://www.anaconda.com/download/) for the base Python libraries needed for this tutorial. You will also need to download fastText, which is available from its [GitHub repository](https://github.com/facebookresearch/fastText).

## Reading in the data

We will use [pandas](https://pandas.pydata.org/) to read in and explore the data.

In [9]:
import pandas as pd

# Unzip the included zip file first to get this CSV, then drop rows with missing values.
data = pd.read_csv('data/cs_subs.csv').dropna()

print('Number of unique subreddits in the dataset:', len(data['subreddit'].unique()))
print('Number of title posts in dataset:', data.shape[0])
data.sample(10)

Number of unique subreddits in the dataset: 136
Number of title posts in dataset: 624281


| Unnamed: 0 | title | score | subreddit |
|---|---|---|---|
| 179519 | After turning my phone while playing Solitaire... | 85 | softwaregore |
| 142100 | Useful Linux Commands To Harden Your System | 99 | commandline |
| 319650 | The RedMonk Programming Language Rankings: Jun... | 29 | golang |
| 327975 | Sure thing, Windows | 11 | softwaregore |
| 619161 | Identity crises over being an AI Researcher. | 106 | softwaregore |
| 380642 | Forget Windows Use Linux is a USB-Bootable Dis... | 302 | Android |
| 239470 | गुम या चोरी हो चुके फ़ोन का लोकेशन जाने \| How t... | 1 | Android |
| 402439 | Getting celery to run on windows | 2 | learnpython |
| 317152 | The Concept Video of LG V30 Is Just Too Alluri... | 1 | Android |
| 146894 | Alcatel A30 ($60) and Moto G5 Plus ($185) are ... | 96 | Android |


### Filtering out subreddits that have 150 or fewer posts

In [10]:
# Count posts per subreddit and keep only subreddits with more than 150 posts.
counts = data['subreddit'].value_counts()
counts = counts[counts > 150]
top_values = list(counts.index)
data = data[data['subreddit'].isin(top_values)]

In [12]:
print('Number of unique subreddits after filtering:', len(data['subreddit'].unique()))
print('Number of title posts after filtering:', data.shape[0])

Number of unique subreddits after filtering: 117
Number of title posts after filtering: 622909


We have a very skewed dataset. Let's use the average reddit score (upvotes minus downvotes) of each subreddit as a filtering threshold. I want to use the mean rather than the median, since filtering at the median would simply cut each subreddit's posts roughly in half. Hopefully filtering by the mean will take relatively larger chunks out of the more popular subreddits than the less popular ones.
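To make the mean-versus-median reasoning concrete, here is a minimal sketch on made-up data. The `toy` DataFrame and its subreddit names are hypothetical, and the sketch assumes that a more popular subreddit has a heavier-tailed score distribution (a few posts with very high scores), which may or may not hold for the real data. It uses `groupby(...).transform(...)` rather than the explicit loop used in this article, purely for brevity.

```python
import pandas as pd

# Hypothetical toy data: 'big' has one very high-scoring post (heavy tail),
# 'small' has fairly uniform scores.
toy = pd.DataFrame({
    'subreddit': ['big'] * 6 + ['small'] * 4,
    'score': [1, 1, 2, 3, 5, 300, 4, 5, 6, 7],
})

# Broadcast each subreddit's mean and median score back onto its own rows.
group_mean = toy.groupby('subreddit')['score'].transform('mean')
group_median = toy.groupby('subreddit')['score'].transform('median')

# Keep only the posts scoring at or above their subreddit's threshold.
kept_by_mean = toy[toy['score'] >= group_mean]
kept_by_median = toy[toy['score'] >= group_median]

print('kept by mean:  ', kept_by_mean['subreddit'].value_counts().to_dict())
print('kept by median:', kept_by_median['subreddit'].value_counts().to_dict())
```

With these made-up numbers, the mean filter keeps 1 of the 6 `big` posts and 2 of the 4 `small` posts, while the median filter keeps half of each group. The single outlier pulls the mean of `big` far above most of its posts, which is the effect the mean filter relies on.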

In [14]:
# Mean score of each subreddit.
means = {}
for subreddit in data['subreddit'].unique():
    means[subreddit] = data[data['subreddit'] == subreddit]['score'].mean()

# For each subreddit, keep only the posts scoring at or above that subreddit's mean.
filtered = []

for subreddit in data['subreddit'].unique():
    filtered.append(data.loc[(data['subreddit'] == subreddit) & (data['score'] >= means[subreddit])])