# Introduction

* In this notebook, we analyze job titles to detect whether they are in English, using three methods: FastText, langdetect, and langid. We measure the processing time of each method and compare their results.

# Load Data

In [1]:
%pip install setuptools==67.6.1


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

df1 = pd.read_csv('2024-11-12 2_54pm.csv')



In [3]:
df1.shape

(10000000, 2)

# Data Analysis


In [3]:
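# Assumption: the cell that creates `df` is not shown. The output below
# (1000 records) suggests a 1,000-row subset of `df1`, taken here with head().
df = df1.head(1000).copy()

print(f"Total number of records: {len(df)}")

# Count the rows that have no job title.
missing_titles = df['JOB_TITLE'].isnull().sum()
print(f"Missing 'JOB_TITLE' entries: {missing_titles}")

# Drop them so that every detector receives a string.
df = df.dropna(subset=['JOB_TITLE'])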



Total number of records: 1000
Missing 'JOB_TITLE' entries: 0


# Language Detection Methods


* Using FastText



In [4]:
!pip install fasttext

import fasttext

!wget -q -O lid.176.ftz https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

model_fasttext = fasttext.load_model('lid.176.ftz')

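def detect_language_fasttext(text):
    # FastText predicts one line at a time, so newlines are replaced with spaces.
    prediction = model_fasttext.predict(text.replace('\n', ' '), k=1)
    # Labels look like '__label__en'; strip the prefix to keep the language code.
    lang = prediction[0][0].replace('__label__', '')
    return lang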
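
`predict` also returns a probability alongside each label, which the function above discards. The following is a minimal sketch of how that score could be used to flag low-confidence predictions, assuming the `(labels, probabilities)` pair that `predict` returns for a single string. The helper name `parse_fasttext_prediction` and the 0.5 threshold are hypothetical and not part of the notebook.

```python
def parse_fasttext_prediction(prediction, min_confidence=0.5):
    """Return (language, confidence), or ('unknown', confidence) below the threshold.

    `prediction` is expected to look like the result of
    model_fasttext.predict(text, k=1): a pair (labels, probabilities).
    """
    labels, probabilities = prediction
    lang = labels[0].replace('__label__', '')
    confidence = float(probabilities[0])
    if confidence < min_confidence:
        return 'unknown', confidence
    return lang, confidence


# Hand-written stand-in for a real prediction, for illustration only.
example_prediction = (('__label__en',), [0.97])
print(parse_fasttext_prediction(example_prediction))
```

A threshold like this may help with very short titles, where any detector has little text to work with; the right value would need to be tuned on labeled data.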




* Using langdetect



In [5]:
!pip install langdetect

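from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# Fix the seed so that langdetect returns the same result on every run.
DetectorFactory.seed = 0

def detect_language_langdetect(text):
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty or digits only).
        return 'unknown'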




* Using langid

In [6]:
!pip install langid

import langid

def detect_language_langid(text):
    # classify() returns a (language, score) pair; only the language code is kept.
    lang, _score = langid.classify(text)
    return lang




# Measuring Resource Usage


In [7]:
import time

# perf_counter() is a monotonic, high-resolution clock intended for timing code.
start_time = time.perf_counter()
df['lang_fasttext'] = df['JOB_TITLE'].apply(detect_language_fasttext)
time_fasttext = time.perf_counter() - start_time
print(f"FastText - Time taken: {time_fasttext:.2f} seconds")

start_time = time.perf_counter()
df['lang_langdetect'] = df['JOB_TITLE'].apply(detect_language_langdetect)
time_langdetect = time.perf_counter() - start_time
print(f"langdetect - Time taken: {time_langdetect:.2f} seconds")

start_time = time.perf_counter()
df['lang_langid'] = df['JOB_TITLE'].apply(detect_language_langid)
time_langid = time.perf_counter() - start_time
print(f"langid - Time taken: {time_langid:.2f} seconds")


FastText - Time taken: 0.02 seconds
langdetect - Time taken: 18.60 seconds
langid - Time taken: 5.53 seconds
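
The cell above measures wall-clock time only. The following is a minimal sketch of a helper that also records peak Python memory with the standard-library `tracemalloc` module. The names `measure_detector` and `dummy_detector` are hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native code such as FastText's C++ core would likely be underreported.

```python
import time
import tracemalloc


def measure_detector(detector, titles):
    """Run `detector` on every title; return (results, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    results = [detector(title) for title in titles]
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return results, elapsed, peak


def dummy_detector(text):
    # Stand-in for illustration: labels pure-ASCII text as English.
    return 'en' if text.isascii() else 'unknown'


titles = ['project manager', 'software engineer', 'Geschäftsführer']
results, elapsed, peak = measure_detector(dummy_detector, titles)
print(results)
print(f"Time taken: {elapsed:.4f} seconds, peak memory: {peak / 1024:.1f} KiB")
```

Any of the three `detect_language_*` functions could be passed in place of `dummy_detector`. Tracing slows execution, so timings taken with it enabled are not comparable to the ones reported above.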


# Comparing Results

In [8]:
df['is_english_fasttext'] = df['lang_fasttext'] == 'en'
df['is_english_langdetect'] = df['lang_langdetect'] == 'en'
df['is_english_langid'] = df['lang_langid'] == 'en'

df['all_agree'] = df[['is_english_fasttext', 'is_english_langdetect', 'is_english_langid']].nunique(axis=1) == 1

agreement_rate = df['all_agree'].mean() * 100
print(f"Methods agree on {agreement_rate:.2f}% of the records.")

df_discrepancies = df[~df['all_agree']]

print(f"Number of discrepancies: {len(df_discrepancies)}")

df_discrepancies[['JOB_TITLE', 'lang_fasttext', 'lang_langdetect', 'lang_langid']].head(10)


Methods agree on 35.80% of the records.
Number of discrepancies: 642


| | JOB_TITLE | lang_fasttext | lang_langdetect | lang_langid |
|---|---|---|---|---|
| 0 | project manager | en | hr | en |
| 1 | manager | en | tl | en |
| 2 | owner | en | pl | en |
| 3 | intern | en | de | en |
| 4 | software engineer | en | af | en |
| 5 | assistant manager | en | en | fr |
| 6 | sales associate | en | it | fr |
| 7 | customer service representative | en | en | sv |
| 9 | administrative assistant | en | pt | en |
| 11 | consultant | en | it | en |
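
The overall agreement rate does not show which method is the odd one out. The following is a minimal sketch of a pairwise comparison, using only the ten discrepancy rows shown above as input. The function name `pairwise_agreement` is hypothetical, and because these rows were selected for disagreement, the resulting rates are not representative of the full sample.

```python
from itertools import combinations

# The ten discrepancy rows shown above: (fasttext, langdetect, langid).
rows = [
    ('en', 'hr', 'en'),  # project manager
    ('en', 'tl', 'en'),  # manager
    ('en', 'pl', 'en'),  # owner
    ('en', 'de', 'en'),  # intern
    ('en', 'af', 'en'),  # software engineer
    ('en', 'en', 'fr'),  # assistant manager
    ('en', 'it', 'fr'),  # sales associate
    ('en', 'en', 'sv'),  # customer service representative
    ('en', 'pt', 'en'),  # administrative assistant
    ('en', 'it', 'en'),  # consultant
]
methods = ['fasttext', 'langdetect', 'langid']


def pairwise_agreement(rows, methods):
    """Share of rows where each pair of methods gives the same English/non-English flag."""
    rates = {}
    for i, j in combinations(range(len(methods)), 2):
        same = sum((row[i] == 'en') == (row[j] == 'en') for row in rows)
        rates[(methods[i], methods[j])] = same / len(rows)
    return rates


for pair, rate in pairwise_agreement(rows, methods).items():
    print(pair, f"{rate:.0%}")
```

On these ten rows, FastText and langid agree most often, while langdetect disagrees with both on most titles.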


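# Initial Conclusion

* FastText is the fastest method (0.02 seconds, versus 5.53 seconds for langid and 18.60 seconds for langdetect), and in the discrepancies shown above it is the only one that labels every title as English. We will therefore use it for the second step: detecting the language of the top 50% most frequent titles.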
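
The selection for the second step could be sketched as follows. This assumes that "top 50%" means the more frequent half of the distinct titles, which is only one possible reading. The function name `top_half_by_frequency` and the sample data are hypothetical.

```python
from collections import Counter


def top_half_by_frequency(titles):
    """Return the most frequent half of the distinct titles, most frequent first."""
    counts = Counter(titles)
    ranked = [title for title, _count in counts.most_common()]
    keep = (len(ranked) + 1) // 2  # round up so an odd count keeps the middle title
    return ranked[:keep]


# Made-up sample for illustration; not taken from the dataset.
sample_titles = ['manager', 'owner', 'manager', 'intern', 'manager', 'owner', 'consultant']
print(top_half_by_frequency(sample_titles))
```

Running the detector once per distinct title, rather than once per row, would also avoid repeating work on duplicated titles.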