This notebook demonstrates how to use a transformers model to detect the language of markdown cells in the training data of this competition: https://www.kaggle.com/competitions/AI4Code/data. The dataset contains notebooks in multiple languages, so knowing the language of each notebook can help you choose a better tokenizer and model. For example:

* The dataset contains Japanese, Chinese, and Thai notebooks. These languages do not put spaces between words, so splitting on whitespace and punctuation (spaces, colons, semicolons, ...) does not yield word tokens.
* If you want to handle these languages, you need a language-specific tokenizer.
* Alternatively, you can remove these notebooks when training a model on notebooks in space-delimited, alphabetic languages, because incorrectly tokenized text adds noise to the model.
* ...
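
To make the first point concrete, here is a minimal sketch showing that whitespace splitting recovers words in English but not in Japanese. The two sample sentences are made up for illustration and are not taken from the dataset.

```python
# Illustrative sentences (not from the dataset).
english = "This notebook explores the training data"
japanese = "このノートブックは学習データを調べます"

# str.split() with no arguments splits on runs of whitespace.
english_tokens = english.split()
japanese_tokens = japanese.split()

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```

The English sentence splits into separate words, while the Japanese sentence comes back as a single token because it contains no spaces.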

In any case, it is important to detect the language of our markdown cells, so I created this notebook as an example. Please don't run this notebook in a Kaggle kernel: it took 2 days on my computer with a GPU, so it will never complete in a Kaggle kernel. If you want to use the detected language labels from this notebook directly, you can use this dataset:
* https://www.kaggle.com/datasets/astrung/google-ai4code-train-markdown-language

**Please upvote both this notebook and the dataset if they are helpful to you**

Here is a short summary of the languages in the training set, given as the fraction of markdown cells per label (for example, 0.758186 means about 75.8%):

    en         0.758186
    unknown    0.159912
    pt         0.015380
    tr         0.010126
    ru         0.009829
    it         0.009750
    ja         0.007455
    es         0.006610
    zh         0.004808
    fr         0.003614
    hi         0.003609
    ur         0.003102
    nl         0.001805
    vi         0.001731
    de         0.001352
    sw         0.001291
    pl         0.000670
    th         0.000428
    bg         0.000165
    ar         0.000123
    el         0.000054

The language labels come from this model: https://huggingface.co/papluca/xlm-roberta-base-language-detection. I use it to predict the language in this notebook, but you can try other models, too.

Please note that some languages in the training data have no corresponding label in this language model. For example, some notebooks are in Indonesian, but since the model has no Indonesian label, it predicts some other class with low confidence. I assign these cases to "unknown", using `prob < 0.8` as the threshold. So the "unknown" label means the detection model predicted a label with low confidence, and the text may be in a language outside the 20 languages the model supports.
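
The thresholding rule can be sketched in plain Python as follows. This is only an illustration of the rule, mirroring the `language_prob < 0.8` comparison used in the code later in this notebook; `normalize_prediction` and the sample predictions are hypothetical.

```python
def normalize_prediction(label, score, threshold=0.8):
    """Map a low-confidence prediction to ('unknown', 0.0).

    Predictions with a score at or above `threshold` are returned unchanged.
    """
    if score < threshold:
        return "unknown", 0.0
    return label, score

# Hypothetical predictions, shaped like the dicts the classifier returns.
predictions = [
    {"label": "en", "score": 0.99},
    {"label": "sw", "score": 0.41},  # e.g. text in a language the model lacks
]
normalized = [normalize_prediction(p["label"], p["score"]) for p in predictions]
print(normalized)
```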

In [None]:
import os
import pandas as pd
from transformers import pipeline
from bs4 import BeautifulSoup
from markdown import markdown

In [None]:
arr = os.listdir('../input/AI4Code/train')

In [None]:
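dfs = []
for file in arr:
    # Read from the same directory that was listed into `arr`.
    df = pd.read_json(os.path.join('../input/AI4Code/train', file), orient='columns')
    df['file'] = os.path.join('train', file)
    # Keep only the non-code (markdown) cells.
    df = df[df['cell_type'] != 'code'].copy()
    dfs.append(df)
len(dfs)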

In [None]:
generator = pipeline('text-classification', model='papluca/xlm-roberta-base-language-detection')
# for df in dfs:
#     df['source2'] = df['source'].apply(lambda x: x[:513])
#     df['detected_language_result'] = generator(df['source2'].values.tolist())
#     df['language_label'] = df['detected_language_result'].apply(lambda x: x['label'])
#     df['language_prob'] = df['detected_language_result'].apply(lambda x: x['score'])
#     df = df.drop(columns=['detected_language_result', 'source2'])
# #     print(df.head())

In [None]:
dfs[20000]

In [None]:
df_train = pd.concat(dfs)
df_train

In [None]:
def convert_markdown(text):
    # Render the markdown to HTML, then keep only the text nodes.
    html = markdown(text)
    text = ' '.join(BeautifulSoup(html, 'html.parser').find_all(string=True)).replace('\n', ' ')
    # Keep only the first 513 characters to bound the input length.
    return text[:513]
df_train['source2'] = df_train['source'].apply(convert_markdown)
# generator(df_train['source2'].values[:1000].tolist())
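
To see what the HTML-to-text step does, here is a dependency-free sketch that uses the standard library's `html.parser` in place of BeautifulSoup. It is not the code used to produce the dataset; `TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is made up. Unlike `convert_markdown`, it also collapses repeated whitespace.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html, max_chars=513):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes, then collapse newlines and repeated spaces.
    text = " ".join(" ".join(parser.chunks).split())
    return text[:max_chars]

sample_html = "<h1>Data loading</h1>\n<p>We read the <em>train</em> files.</p>"
print(html_to_text(sample_html))
```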

In [None]:
df_train

In [None]:
df_train['detected_language_result'] = None
df_train['language_label'] = None
df_train['language_prob'] = None

In [None]:
def infer_lang(df):
    # Work on a copy so the slice passed in is not modified.
    df = df.copy()
    # The pipeline returns one {'label': ..., 'score': ...} dict per input text.
    df['detected_language_result'] = generator(df['source2'].values.tolist())
    df['language_label'] = df['detected_language_result'].apply(lambda x: x['label'])
    df['language_prob'] = df['detected_language_result'].apply(lambda x: x['score'])
    return df
#     df = df.drop(columns=['detected_language_result', 'source2'])

In [None]:
df_result = []
for start in range(0, len(df_train), 3000):
    # Slice rows by position, 3000 at a time.
    df_result.append(infer_lang(df_train.iloc[start:start+3000]))
    print(start)
#     if start > 10:
#         break
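
The loop above processes the rows in fixed-size chunks and prints the start offset as a progress marker. A minimal sketch of the same pattern on a plain list, with a stub standing in for the model, looks like this. `iter_chunks` and `fake_classifier` are hypothetical names, and the chunk size of 3 is chosen only to keep the example small.

```python
def iter_chunks(items, chunk_size):
    """Yield (start, chunk) pairs covering `items` in order."""
    for start in range(0, len(items), chunk_size):
        yield start, items[start:start + chunk_size]

def fake_classifier(texts):
    """Stub standing in for the real pipeline; labels every text as 'en'."""
    return [{"label": "en", "score": 1.0} for _ in texts]

texts = [f"cell {i}" for i in range(7)]
results = []
for start, chunk in iter_chunks(texts, chunk_size=3):
    results.extend(fake_classifier(chunk))
    print(start)  # progress marker, as in the loop above

print(len(results))
```

The last chunk is simply shorter than the others, so no special handling is needed when the number of rows is not a multiple of the chunk size.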

In [None]:
len(df_result)

In [None]:
df_final = pd.concat(df_result)
df_final

In [None]:
# df_final = df_final.drop(columns=['source2', 'detected_language_result'])
df_final.head()

In [None]:
df_final.language_label.value_counts()

In [None]:
df_final[df_final.language_label == 'pt'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'it'].sort_values('language_prob').tail(30)

In [None]:
df_final[df_final.language_label == 'tr'].sort_values('language_prob')

# ur doesn't look good. Let's check carefully

In [None]:
df_final[df_final.language_label == 'ur'].sort_values('language_prob').tail(400).head(30)

In [None]:
df_final[df_final.language_label == 'ja']

In [None]:
df_final[df_final.language_label == 'th'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'hi'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'ru']

# sw doesn't look good. Let's check carefully

In [None]:
df_final[df_final.language_label == 'sw'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'es'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'zh'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'nl'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'fr'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'vi'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'de'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'pl'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'bg'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'el'].sort_values('language_prob')

In [None]:
df_final[df_final.language_label == 'ar'].sort_values('language_prob')

In [None]:
df_final.language_label.value_counts()/len(df_final)

In [None]:
df_norm = df_final.copy()
# Predictions below 0.8 confidence are relabeled 'unknown' with probability 0.
low_conf = df_norm['language_prob'] < 0.8
df_norm.loc[low_conf, 'language_label'] = 'unknown'
df_norm.loc[low_conf, 'language_prob'] = 0

In [None]:
df_norm.language_label.value_counts()/len(df_final)

In [None]:
df_norm.to_csv('df_language.csv', index=False)
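
Since the motivation is to know the language of each notebook, one way to use the saved labels is a per-notebook majority vote over the markdown cells. This is a minimal sketch on hand-written rows rather than the real CSV. The `file` and `language_label` keys match the columns written above, while `notebook_language` and the sample values are hypothetical.

```python
from collections import Counter, defaultdict

def notebook_language(rows):
    """Map each notebook file to its most common cell-level language label."""
    labels_by_file = defaultdict(list)
    for row in rows:
        labels_by_file[row["file"]].append(row["language_label"])
    # On a tie, most_common keeps the label that was seen first.
    return {
        file: Counter(labels).most_common(1)[0][0]
        for file, labels in labels_by_file.items()
    }

rows = [
    {"file": "train/a.json", "language_label": "en"},
    {"file": "train/a.json", "language_label": "en"},
    {"file": "train/a.json", "language_label": "unknown"},
    {"file": "train/b.json", "language_label": "ja"},
    {"file": "train/b.json", "language_label": "ja"},
    {"file": "train/b.json", "language_label": "en"},
]
languages = notebook_language(rows)
print(languages)
```

A notebook-level label like this could then be used to pick a tokenizer or to filter notebooks before training, as suggested at the top of this notebook.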