# Corpus Language Test
The sole purpose of this notebook is to verify the following hypothesis about the clickbait corpus:
> All post titles are English.

In [1]:
import pandas as pd
from os import path
import langdetect

pd.set_option('display.max_colwidth', 1000)

## Load data

In [2]:
data_path = "../data"
dataset = "clickbait17-train-170331"

In [3]:
df = pd.read_json(path.join(data_path, dataset, "instances.jsonl"), lines=True, encoding='utf8')

## Language check
We use the `langdetect` Python package to determine the language of the post title.

In [4]:
df['lang_postText'] = df['postText'].apply(lambda x: langdetect.detect(x[0]))
non_english = df[df['lang_postText'] != 'en']
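
The cell above assumes that every `postText` list contains at least one non-empty title and that the detector never fails. A more defensive variant might look like the sketch below. The helper name `detect_title_language` and its `detector` parameter are hypothetical and not part of the notebook; any callable that maps a string to a language code can be passed in, and the broad `except` is an assumption about how such a detector reports undetectable input.

```python
from typing import Callable, Optional, Sequence


def detect_title_language(
    post_text: Sequence[str],
    detector: Callable[[str], str],
) -> Optional[str]:
    """Return the language code of the first title, or None if undetectable."""
    # postText is stored as a list; an empty list or a blank title carries no signal.
    if not post_text or not post_text[0].strip():
        return None
    try:
        return detector(post_text[0])
    except Exception:
        # Assumption: the detector signals "no usable features" by raising,
        # as langdetect does with LangDetectException.
        return None
```

With the notebook's imports, this could be applied as `df['postText'].apply(lambda x: detect_title_language(x, langdetect.detect))`. Note that `langdetect` is non-deterministic by default; setting `langdetect.DetectorFactory.seed = 0` before the first call makes the results reproducible.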

Let's first check how many entries are detected as non-English.

In [5]:
print("{} / {}".format(len(non_english), len(df)))

33 / 2459


Just 33 out of 2459 entries (about 1.3%) are detected as non-English. Now we manually verify whether the detection was correct.

In [6]:
non_english[['postText', 'lang_postText']]

| | postText | lang_postText |
|---|---|---|
| 35 | [How do dogs donate blood?] | af |
| 239 | [Mexico elects first independent governor 'El Bronco'] | es |
| 293 | [I’ve hit a sleep wall and I’m seeing double] | af |
| 300 | [Are you a “yuccie?”] | es |
| 321 | [India frustrated against Bangladesh as Ajinkya Rahane makes 98] | id |
| 466 | [MORE: WWE legend #DustyRhodes dies at 69.] | af |
| 550 | [Look inside Apple's swanky new Upper East Side store in NYC Old bank vault is a VIP showroom:] | af |
| 653 | [37 difficult questions from my mixed-race son] | fr |
| 686 | [Harriet Harman v David Cameron at #PMQs:] | sv |
| 764 | [Teenager denies murdering boy of 15] | af |


We observe that all flagged post titles are in fact English and were misclassified as non-English. We are therefore confident that the entire corpus is indeed English.
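
To see which labels the detector falls back on for these short English titles, the ten rows listed above can be tallied by language code. This is only an illustrative sketch over the displayed rows, not over all 33 flagged entries; the variable names are hypothetical.

```python
from collections import Counter

# Language codes of the ten displayed rows, in the order shown in the table.
displayed_codes = ["af", "es", "af", "es", "id", "af", "af", "fr", "sv", "af"]

# Count how often each (incorrect) language label was assigned.
counts = Counter(displayed_codes)

for code, n in counts.most_common():
    print(f"{code}: {n}")
```

In this sample, Afrikaans (`af`) accounts for half of the misclassifications.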