Welcome to our first lab for Module 3!

Today we are going to look at text classification across 42 languages. Now that we've learned about *precision*, *recall*, and *f-score*, we can use these metrics to look at how well our methods work for languages that aren't English!

In [None]:
from text_analytics import text_analytics
import os
import pandas as pd

ai = text_analytics()
print("Done!")

The task that we'll be looking at is simple but important: identifying the register or source that data comes from. So, we'll be learning how to determine whether a sample comes from Wikipedia, Twitter, or OpenSubtitles. This can be important in a pipeline; for example, we might want to have a different model for each type of data. 

Regardless, for our purposes this task lets us compare results across all these languages. Let's load a CSV file that contains the classification results.

In [None]:
file = "register.csv"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file)
print(df)
print("Done!")

This is a lot of information, too much to look at. So, let's narrow this down to the *f-score*. Since this is the harmonic mean of precision and recall, it makes sense to focus on just this one metric.

In [None]:
df = df.loc[:,["Language", "Register", "F-Score"]]
print(df)
print("Done!")
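
As a quick reminder of why the *f-score* is a reasonable single summary, here is a minimal sketch of the harmonic mean. The function name `f_score` and the numbers are made up for illustration; they are not taken from `register.csv`.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # convention: avoid dividing by zero when both are 0
    return 2 * precision * recall / (precision + recall)

# Made-up values for illustration only.
balanced = f_score(0.9, 0.9)   # precision and recall agree
lopsided = f_score(1.0, 0.5)   # perfect precision, poor recall

print(round(balanced, 3), round(lopsided, 3))
```

Notice that the lopsided case scores lower than the simple average of 1.0 and 0.5 would suggest: the harmonic mean penalizes a classifier that does well on only one of the two metrics.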

Now, this is still a bunch of information. The *Weighted_AVG* gives us an overview of how well the classifier works across all the registers. So let's just look at it.

In [None]:
df = df.loc[df["Register"] == "Weighted_AVG"]
print(df)
print("Done!")
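
To see what a weighted average does, here is a small sketch. It assumes that *Weighted_AVG* weights each register's f-score by the number of samples in that register, which is the common convention but is not confirmed by the file itself. The function name, register labels, scores, and sample counts below are all hypothetical.

```python
def weighted_average(scores: dict, supports: dict) -> float:
    """Average the scores, weighting each class by its number of samples."""
    total = sum(supports.values())
    return sum(scores[name] * supports[name] for name in scores) / total

# Hypothetical per-register f-scores and sample counts (not real results).
scores = {"Wikipedia": 0.98, "Twitter": 0.90, "OpenSubtitles": 0.94}
supports = {"Wikipedia": 500, "Twitter": 300, "OpenSubtitles": 200}

weighted = weighted_average(scores, supports)
unweighted = sum(scores.values()) / len(scores)

print(round(weighted, 3), round(unweighted, 3))
```

Under this assumption, registers with more samples pull the average toward their own score, so the weighted and unweighted averages can differ.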

Since we now have only one row per language, we can also make the "Language" column the index.

In [None]:
df = df.set_index("Language", drop = True)
df = df.drop(columns = ["Register"])
print(df)
print("Done!")

Now, let's make a chart!

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [20,12]
plt.rcParams['figure.dpi'] = 120

ax = sns.barplot(y = df.index, x = "F-Score", data = df)
print("Done!")
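
With 42 bars, the chart can be easier to read if the languages are sorted by score first. Here is a minimal sketch using a toy data frame; the language labels and scores are placeholders, not values from our results.

```python
import pandas as pd

# Toy data with the same shape as our frame: "Language" index, "F-Score" column.
toy = pd.DataFrame(
    {"F-Score": [0.90, 0.99, 0.95]},
    index=pd.Index(["lang_a", "lang_b", "lang_c"], name="Language"),
)

# Sort so the best-scoring language comes first.
ranked = toy.sort_values("F-Score", ascending=False)
print(ranked)
```

The same `sort_values` call should work on our real `df` before it is passed to `sns.barplot`.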

So the same basic methods work for a lot of languages. But some languages, like Japanese and Chinese, are going to require special processing because their writing systems are so different.

You'll notice that "Language" here is a three-letter code. We're using the ISO 639-2 codes. That's because languages are known by many different names, both to their own speakers and to speakers of other languages. So, we need to use an international standard to be consistent. For reference, the codes we've just used are listed below with their informal English names.

| Code       | Language     | Code     | Language     |
| :--------- | :----------- | :------- | :----------- |
|  ara       | Arabic       | lav      | Latvian      |
|  bul       | Bulgarian    | lit      | Lithuanian   | 
|  cat       | Catalan      | mkd      | Macedonian   | 
|  ces       | Czech        | nld      | Dutch        | 
|  dan       | Danish       | nor      | Norwegian    | 
|  deu       | German       | pol      | Polish       | 
|  ell       | Greek        | por      | Portuguese   | 
|  eng       | English      | ron      | Romanian     | 
|  est       | Estonian     | rus      | Russian      | 
|  fas       | Farsi        | slk      | Slovak       | 
|  fin       | Finnish      | slv      | Slovenian    | 
|  fra       | French       | spa      | Spanish      | 
|  hin       | Hindi        | sqi      | Albanian     | 
|  hun       | Hungarian    | swe      | Swedish      | 
|  ind       | Indonesian   | tam      | Tamil        | 
|  isl       | Icelandic    | tgl      | Tagalog      | 
|  ita       | Italian      | tur      | Turkish      | 
|  jpn       | Japanese     | ukr      | Ukrainian    | 
|  kat       | Georgian     | urd      | Urdu         | 
|  kaz       | Kazakh       | vie      | Vietnamese   | 
|  kor       | Korean       | zho      | Chinese      | 
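
If you want readable names instead of codes, for example as chart labels, a small lookup like the one sketched below can help. Only a handful of entries from the table are included here, and the names `CODE_TO_NAME` and `language_name` are hypothetical, not part of `text_analytics`.

```python
# A few entries copied from the table above; extend as needed.
CODE_TO_NAME = {
    "ara": "Arabic",
    "deu": "German",
    "jpn": "Japanese",
    "zho": "Chinese",
}

def language_name(code: str) -> str:
    """Return the English name for a code, or the code itself if unknown."""
    return CODE_TO_NAME.get(code.lower(), code)

print(language_name("ara"))
print(language_name("xyz"))  # not in the lookup, so the code comes back unchanged
```

Falling back to the code itself means a missing entry never breaks a chart; it just shows up unlabeled.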

Let's take a look at what the data for these languages looks like. We'll use "ara" for Arabic, but you can try other languages by replacing that with the correct code from the table.

In [None]:
file = "Register.ara.gz"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file, index_col = 0)
print(df)
print("Done!")

So, today we've looked a bit further at classification results. We've seen that most (but not all) languages have similar results on the same task. And we've had a chance to look at some non-English data.