# Filtering/Cleaning data based on language using FastText

<b>FastText</b> : https://fasttext.cc/docs/en/language-identification.html

    FastText is Facebook's free, open-source, lightweight library for learning text representations and text classifiers. It runs on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices.
    
Model download (lid.176.ftz) : https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz

<b> Python Binding Installation: </b>  https://github.com/vrasneur/pyfasttext




In [1]:
from pyfasttext import FastText

In [2]:
model = FastText("lid.176.ftz")
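If pyfasttext is not available, the same model file can also be loaded with the official `fasttext` Python package. The following is a minimal sketch under that assumption, not part of the original notebook; the helper name `top_languages` is hypothetical. The official `predict` call returns labels prefixed with `__label__` and rejects text containing newlines, so the helper handles both.

```python
def top_languages(model, text, k=2):
    """Return [(language, probability), ...] for the k most likely languages.

    `model` is assumed to be an object loaded with the official package:
        import fasttext
        model = fasttext.load_model("lid.176.ftz")
    """
    # The official predict() works on one line at a time, so drop newlines.
    labels, probabilities = model.predict(text.replace("\n", " "), k=k)
    # Labels come back as "__label__en"; strip the prefix to get "en".
    return [
        (label.replace("__label__", ""), float(probability))
        for label, probability in zip(labels, probabilities)
    ]
```

With a loaded model, `top_languages(model, "hi this is language identification")` gives a list in the same `(language, probability)` shape that the pyfasttext cells in this notebook work with.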

<u><b> Supported Languages: </b></u>

af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh

<b> Language codes: </b> https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes (two-letter codes only; the three-letter entries above, such as ceb, are not ISO 639-1 codes)
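Because the model can only predict the codes listed above, a typo such as `eng` instead of `en` in a list of languages to keep would silently match nothing. A small sketch of a sanity check follows; the names `SUPPORTED_LANGUAGES` and `unsupported_codes` are hypothetical helpers, and the set is copied from the supported-language list above.

```python
# The 176 codes from the supported-language list above.
SUPPORTED_LANGUAGES = frozenset("""
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr
ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu
fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy
ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky
la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn
nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu
rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw
ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo
wa war wuu xal xmf yi yo yue zh
""".split())


def unsupported_codes(codes):
    """Return the entries of `codes` that the model cannot predict, sorted."""
    return sorted(set(codes) - SUPPORTED_LANGUAGES)


# Example: "eng" is not a code the model emits, so it is reported.
print(unsupported_codes(["en", "fr", "eng"]))
```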

In [3]:
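def predict_sentence_language(sentence, n_nearest_language = 5):
    # Return up to n_nearest_language (language, probability) pairs, best first.
    # Blank input returns [] instead of None so callers can call len() on it.
    if sentence.strip() != "":
        return model.predict_proba_single(sentence, n_nearest_language)
    return []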
    

In [4]:
print(predict_sentence_language("hi this is language identification",2))

[('en', 0.9650917056298861), ('hi', 0.0027506044766796153)]


In [5]:
def clean_document(document, language_to_be_kept = ['en'], min_freq_filtering = 0.5, sentence_delimiter ='.', as_list =0):
    
    result_document = []
    document_split = document.strip().split(sentence_delimiter)
    
    for sentence in document_split:
        
        # Top-1 prediction: [(language, probability)]; empty for blank sentences
        pred_res = predict_sentence_language(sentence, 1)
        
        # Keep the sentence only if its language is wanted and the
        # predicted probability exceeds the cutoff
        if pred_res:
            language, probability = pred_res[0]
            if language in language_to_be_kept and probability > min_freq_filtering:
                result_document.append(sentence)
    if(as_list == 0):
        # Rejoin with the same delimiter that was used for splitting
        return sentence_delimiter.join(result_document)
    else:
        return result_document
            

In [6]:
print(clean_document("hi this is a verification check. should give me these two sentences. cette phrase devrait être supprimée"))

hi this is a verification check. should give me these two sentences


<b> clean_document :</b>
<table style="width: 100%; align='center'" border="1">
    <tr><th>Parameter Name</th><th>Description</th><th>Default</th></tr>
    <tr><td>document</td><td>text to clean, as a single string</td><td>required (no default)</td></tr>
    <tr><td>language_to_be_kept</td><td>Python list of language codes whose sentences are kept</td><td>['en']</td></tr>
    <tr><td>min_freq_filtering</td><td>minimum predicted probability for a sentence to be kept [0.0-1.0]</td><td>0.5</td></tr>
    <tr><td>sentence_delimiter</td><td>string on which the document is split into sentences</td><td>'.' (dot)</td></tr>
    <tr><td>as_list</td><td>return type: 0 = single string, 1 = list of sentences</td><td>0</td></tr>
</table> 

===================================================================

To do:
    1. Variable cutoff for each selected language
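
One possible shape for the per-language cutoff is a dictionary mapping each language to keep onto its own minimum probability. The sketch below is only an illustration of that idea, not the notebook's implementation: `clean_document_with_cutoffs` and `fake_predict` are hypothetical names, and `fake_predict` is a hard-coded stand-in for `predict_sentence_language` so that the example runs without the model file.

```python
def clean_document_with_cutoffs(document, predict, cutoffs, sentence_delimiter="."):
    """Keep sentences whose top language is in `cutoffs` and whose
    probability exceeds that language's own cutoff.

    predict: callable(sentence, k) -> [(language, probability), ...]
    cutoffs: dict such as {"en": 0.5, "fr": 0.8}
    """
    kept = []
    for sentence in document.strip().split(sentence_delimiter):
        if not sentence.strip():
            continue  # skip empty pieces, e.g. after a trailing delimiter
        predictions = predict(sentence, 1)
        if not predictions:
            continue
        language, probability = predictions[0]
        # Languages without an entry in `cutoffs` are always dropped.
        if language in cutoffs and probability > cutoffs[language]:
            kept.append(sentence)
    return kept


def fake_predict(sentence, k=1):
    # Hypothetical stand-in with made-up probabilities, for illustration only.
    if "phrase" in sentence:
        return [("fr", 0.7)]
    return [("en", 0.9)]


text = "hi this is a verification check. cette phrase devrait être supprimée."
# French needs more than 0.8 here, so the French sentence is dropped.
print(clean_document_with_cutoffs(text, fake_predict, {"en": 0.5, "fr": 0.8}))
# Lowering the French cutoff to 0.6 keeps both sentences.
print(clean_document_with_cutoffs(text, fake_predict, {"en": 0.5, "fr": 0.6}))
```

In real use, `predict_sentence_language` would be passed as the `predict` argument in place of the stand-in.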