# [Text2Text](https://github.com/artitw/text2text): Cross-lingual natural language processing and generation toolkit

## How Cross-Lingual NLP Models Work (click to watch)
[![Cross-Lingual Models](http://img.youtube.com/vi/caZLVcJqsqo/0.jpg)](https://youtu.be/caZLVcJqsqo "Cross-Lingual Models")

In [None]:
pip install -q -U text2text

In [None]:
### Text Handler API quick start
import text2text as t2t
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M" #Remove this line for the larger model
h = t2t.Handler(["Hello, World!"], src_lang="en") #Initialize with some text
h.tokenize() #[['▁Hello', ',', '▁World', '!']]

Better speed can be achieved with apex installed from https://www.github.com/artitw/apex.


[['▁Hello', ',', '▁World', '!']]

In [None]:
h.vectorize() #array([[0.18745188, 0.05658336, 0.15895301, ..., 0.46946704, 0.6332584 , 0.43805206]], dtype=float32)

In [None]:
h.tfidf() #[{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]

[{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]

In [None]:
h.bm25() #[{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]

In [None]:
h.search(queries=["Hello"]).toarray() #array([[0.5]])

array([[0.5]])

In [None]:
h.translate(tgt_lang="zh") #['你好,世界!']

['你好,世界!']

In [None]:
h.summarize() #["World ' s largest world"]

['Hello , World!  Have you ever been a member of the World War II.']

In [None]:
h.question() #[('What is the name of the world you are in?', 'The world')]

[('What is the name of the Hello , World!', 'world')]

In [None]:
h.variate() #['Hello the world!', 'Welcome to the world.', 'Hello to the world!',...

['Hello the world!',
 'My brother and sister!',
 'Welcome to the world!',
 'Hello, the world',
 '“No to the world”',
 'It is clearly!',
 'Welcome to the light!',
 'Hello to the world!',
 'Hello to the world!',
 'Hello, it’s a good thing!',
 'Congratulations to all!',
 'Hello to the world!',
 'Hello to the world!',
 'Good Morning, the World',
 'Welcome to the World!',
 'Hello the world!',
 'Hello the world!',
 'Good morning to the world!',
 'Hello, World!',
 'Hello the world!',
 'Fortunately the world!',
 'Hello to the world!',
 'We do it!',
 'Welcome to the world!',
 'Hello to the world!',
 'Go to the world!',
 'Member of the Board of Directors of the Board of Directors of the Board of Directors of the Board of Directors',
 'The world.',
 'Hello to the world!',
 'and it is!',
 'Hello, for the world!',
 'Hello to the world!',
 'Hello to the world!',
 'Hello to the world!',
 'Hello to the world!',
 'Hello to the world!',
 'Welcome to the world!',
 'Congratulations to the world!',
 'Hello

In [None]:
t2t.Handler(["Hello, World! [SEP] Hello, what?"]).answer() #['World']

['World']

In [None]:
t2t.Handler(["Hello, World! [SEP] Hello, what?"]).measure() #[2]

[2]

In [None]:
### Languages Available
t2t.Transformer.LANGUAGES

{'af': 'Afrikaans',
 'am': 'Amharic',
 'ar': 'Arabic',
 'ast': 'Asturian',
 'az': 'Azerbaijani',
 'ba': 'Bashkir',
 'be': 'Belarusian',
 'bg': 'Bulgarian',
 'bn': 'Bengali',
 'br': 'Breton',
 'bs': 'Bosnian',
 'ca': 'Catalan_Valencian',
 'ceb': 'Cebuano',
 'cs': 'Czech',
 'cy': 'Welsh',
 'da': 'Danish',
 'de': 'German',
 'el': 'Greeek',
 'en': 'English',
 'es': 'Spanish',
 'et': 'Estonian',
 'fa': 'Persian',
 'ff': 'Fulah',
 'fi': 'Finnish',
 'fr': 'French',
 'fy': 'Western_Frisian',
 'ga': 'Irish',
 'gd': 'Gaelic_Scottish_Gaelic',
 'gl': 'Galician',
 'gu': 'Gujarati',
 'ha': 'Hausa',
 'he': 'Hebrew',
 'hi': 'Hindi',
 'hr': 'Croatian',
 'ht': 'Haitian_Haitian_Creole',
 'hu': 'Hungarian',
 'hy': 'Armenian',
 'id': 'Indonesian',
 'ig': 'Igbo',
 'ilo': 'Iloko',
 'is': 'Icelandic',
 'it': 'Italian',
 'ja': 'Japanese',
 'jv': 'Javanese',
 'ka': 'Georgian',
 'kk': 'Kazakh',
 'km': 'Central_Khmer',
 'kn': 'Kannada',
 'ko': 'Korean',
 'lb': 'Luxembourgish_Letzeburgesch',
 'lg': 'Ganda',
 'ln':

In [None]:
# Sample texts
article_en = 'The Secretary-General of the United Nations says there is no military solution in Syria.'
 
notre_dame_str = "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student - run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one - page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a liberal newspaper, Common Sense was published. Likewise, in 2003, when other students believed that the paper showed a liberal bias, the conservative paper Irish Rover went into production. Neither paper is published as often as The Observer; however, all three are distributed to all students. Finally, in Spring 2008 an undergraduate journal for political science research, Beyond Politics, made its debut."
 
bacteria_str = "Bacteria are a type of biological cell. They constitute a large domain of prokaryotic microorganisms. Typically a few micrometres in length, bacteria have a number of shapes, ranging from spheres to rods and spirals. Bacteria were among the first life forms to appear on Earth, and are present in most of its habitats."
 
bio_str = "Biology is the science that studies life. What exactly is life? This may sound like a silly question with an obvious answer, but it is not easy to define life. For example, a branch of biology called virology studies viruses, which exhibit some of the characteristics of living entities but lack others. It turns out that although viruses can attack living organisms, cause diseases, and even reproduce, they do not meet the criteria that biologists use to define life."
 

In [None]:
### Tokenization
t2t.Handler([
         "Let's go hiking tomorrow", 
         "안녕하세요.", 
         "돼지꿈을 꾸세요~~"
         ]).tokenize()

[['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow'],
 ['▁안녕', '하세요', '.'],
 ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~']]

In [None]:
### Embeddings
t2t.Handler([
         "Let's go hiking tomorrow", 
         "안녕하세요.", 
         "돼지꿈을 꾸세요~~"
         ]).vectorize()

In [None]:
### TF-IDF
t2t.Handler([
         "Let's go hiking tomorrow", 
         "안녕하세요.", 
         "돼지꿈을 꾸세요~~"
         ]).tfidf()

[{"'": 0.3535533905932738,
  'ing': 0.3535533905932738,
  'orrow': 0.3535533905932738,
  's': 0.3535533905932738,
  '▁Let': 0.3535533905932738,
  '▁go': 0.3535533905932738,
  '▁hik': 0.3535533905932738,
  '▁tom': 0.3535533905932738},
 {'.': 0.5773502691896258,
  '▁안녕': 0.5773502691896258,
  '하세요': 0.5773502691896258},
 {'~~': 0.3535533905932738,
  '▁': 0.3535533905932738,
  '▁꾸': 0.3535533905932738,
  '꿈': 0.3535533905932738,
  '돼': 0.3535533905932738,
  '세요': 0.3535533905932738,
  '을': 0.3535533905932738,
  '지': 0.3535533905932738}]
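The equal weights in this output are consistent with L2-normalised rows: when every token occurs once in its text and in no other text, all raw weights in a row are equal, so each becomes 1/sqrt(n) for n distinct tokens (1/sqrt(8) ≈ 0.3536, 1/sqrt(3) ≈ 0.5774). Below is a minimal pure-Python sketch of that arithmetic. The helper `tfidf_l2` and its smoothed idf formula are assumptions for illustration, not the library's implementation.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Hypothetical helper: token lists -> list of {token: weight}, rows L2-normalised."""
    n_docs = len(docs)
    # Document frequency: in how many texts each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption; the library may use a different formula).
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Token lists copied from the tokenization output shown earlier.
docs = [
    ['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow'],
    ['▁안녕', '하세요', '.'],
    ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~'],
]
weights = tfidf_l2(docs)
print(weights[0]['▁go'], weights[1]['.'])
```

Because the idf factor is identical for every token in a row here, it cancels during normalisation, which is why the choice of idf formula does not affect these particular numbers.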

In [None]:
### BM25
t2t.Handler([
         "Let's go hiking tomorrow", 
         "안녕하세요.", 
         "돼지꿈을 꾸세요~~"
         ]).bm25()

In [None]:
### Search
t2t.Handler([
         "Let's go hiking tomorrow, let's go!", 
         "안녕하세요.", 
         "돼지꿈을 꾸세요~~",
         ]).search(queries=["go", "안녕"]).toarray()

array([[0.4472136 , 0.        , 0.        ],
       [0.        , 0.57735027, 0.        ]])
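The scores above can be reproduced as dot products between L2-normalised token-count vectors. The sketch below is a simplified illustration, not the library's code: it ignores idf (which cancels here because each token appears in only one text), the helper names are hypothetical, and the token list for the first text is an assumed tokenization.

```python
import math
from collections import Counter

def l2_normalised_counts(tokens):
    """Map each token to its count divided by the L2 norm of all counts."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def score(query_tokens, doc_tokens):
    """Dot product of the two normalised vectors (cosine similarity)."""
    q = l2_normalised_counts(query_tokens)
    d = l2_normalised_counts(doc_tokens)
    return sum(w * d.get(t, 0.0) for t, w in q.items())

# Assumed tokenization of "Let's go hiking tomorrow, let's go!"
doc_en = ['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow',
          ',', '▁let', "'", 's', '▁go', '!']
# Tokenization of "안녕하세요." as shown in the tokenization output.
doc_ko = ['▁안녕', '하세요', '.']

score_go = score(['▁go'], doc_en)      # 2 / sqrt(20)
score_ko = score(['▁안녕'], doc_ko)     # 1 / sqrt(3)
score_miss = score(['▁go'], doc_ko)    # no shared tokens
print(score_go, score_ko, score_miss)
```

Each row of the result matrix corresponds to one query and each column to one indexed text, so a query only scores against texts that share at least one token with it.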

In [None]:
#### Multiple queries on a single index
bm25_index = t2t.Handler([
                       article_en, 
                       notre_dame_str, 
                       bacteria_str, 
                       bio_str
                       ]).bm25(output="matrix")

search_results_bm25_1 = t2t.Handler().search(
    queries=["wonderful life", "university students"], 
    vector_class=t2t.Bm25er,
    index=bm25_index)

search_results_bm25_2 = t2t.Handler().search(
    queries=["Earth creatures are cool", "United Nations"], 
    vector_class=t2t.Bm25er,
    index=bm25_index)

In [None]:
#### Using TF-IDF embeddings index
tfidf_index = t2t.Handler([
                       article_en, 
                       notre_dame_str, 
                       bacteria_str, 
                       bio_str
                       ]).tfidf(output="matrix")

search_results_tf1 = t2t.Handler().search(
    queries=["wonderful life", "university students"], 
    vector_class=t2t.Tfidfer,
    index=tfidf_index)

search_results_tf2 = t2t.Handler().search(
    queries=["Earth creatures are cool", "United Nations"], 
    vector_class=t2t.Tfidfer,
    index=tfidf_index)

In [None]:
#### Using neural embeddings index
embedding_index = t2t.Handler([
                       article_en, 
                       notre_dame_str, 
                       bacteria_str, 
                       bio_str
                       ]).vectorize()

search_results_em1 = t2t.Handler().search(
    queries=["wonderful life", "university students"],
    vector_class=t2t.Vectorizer,
    index=embedding_index)

search_results_em2 = t2t.Handler().search(
    queries=["Earth creatures are cool", "United Nations"],
    vector_class=t2t.Vectorizer,
    index=embedding_index)

In [None]:
#### Blending BM25, TF-IDF, and neural embedding scores
import numpy as np

np.mean(
    np.array([
              search_results_bm25_1,
              search_results_tf1,
              search_results_em1,
              ]), axis=0)

# averaged scores matrix
matrix([[ 0.00486117, -0.01890325,  0.53769584,  0.82506883],
        [ 0.0435048 ,  1.68977281,  0.01238902,  0.01266839]])

array([[ 0.00729176, -0.02835486,  0.0024925 ,  0.08656652],
       [ 0.06525719,  0.13328168,  0.0185835 ,  0.01900256]])
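A plain average weights each scorer by the scale of its raw scores, so a scorer with larger values can dominate the blend. One possible variation, sketched here in pure Python with made-up numbers, rescales each score matrix to [0, 1] before averaging. The helpers `min_max` and `blend` are hypothetical and not part of the library.

```python
def min_max(matrix):
    """Rescale all values of a 2-D list to the range [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant matrix carries no ranking information.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / span for v in row] for row in matrix]

def blend(matrices, weights=None):
    """Weighted average of min-max scaled score matrices of the same shape."""
    weights = weights or [1.0] * len(matrices)
    total = sum(weights)
    scaled = [min_max(m) for m in matrices]
    n_rows, n_cols = len(scaled[0]), len(scaled[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, scaled)) / total
             for c in range(n_cols)]
            for r in range(n_rows)]

# Illustrative values only: rows are queries, columns are indexed texts.
bm25_scores = [[0.0, 2.0], [4.0, 1.0]]
cosine_scores = [[0.1, 0.5], [0.9, 0.3]]
blended = blend([bm25_scores, cosine_scores])
print(blended)
```

Passing `weights=[2.0, 1.0]` would favour the first scorer; whether rescaling helps depends on the data, so it is worth comparing against the plain mean.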

In [None]:
### Levenshtein Sub-word Edit Distance
t2t.Handler([
         "Hello, World! [SEP] Hello, what?", 
         "안녕하세요. [SEP] 돼지꿈을 꾸세요~~"
        ]).measure(metric="levenshtein_distance")

[2, 8]
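The distances above count edits over sub-word tokens rather than characters. A minimal sketch of the standard dynamic-programming Levenshtein distance applied to token lists follows. The function is a generic implementation, not the library's code, and the tokenization of "Hello, what?" is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# First pair: the second token list is an assumed tokenization.
d_en = levenshtein(['▁Hello', ',', '▁World', '!'],
                   ['▁Hello', ',', '▁what', '?'])

# Second pair: token lists copied from the tokenization output shown earlier.
d_ko = levenshtein(['▁안녕', '하세요', '.'],
                   ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~'])
print([d_en, d_ko])
```

The second pair shares no tokens, so the distance equals the length of the longer sequence.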

In [None]:
### Translation
# Default translator model
t2t.Handler([article_en, notre_dame_str, bacteria_str, bio_str], src_lang='en').translate(tgt_lang='zh')

In [None]:
# Smaller model to save time and memory for development
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"
t2t.Handler(["I would like to go hiking tomorrow."], 
        src_lang="en"
        ).translate(tgt_lang='zh')


In [None]:
# Alternative translator model: mBART-50
# Note the language code difference: mBART-50 uses locale-style codes such as en_XX and zh_CN
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/mbart-large-50-many-to-many-mmt"
t2t.Transformer.LANGUAGES = {
  'af_ZA': 'Afrikaans',
  'ar_AR': 'Arabic',
  'az_AZ': 'Azerbaijani',
  'bn_IN': 'Bengali',
  'cs_CZ': 'Czech',
  'de_DE': 'German',
  'en_XX': 'English',
  'es_XX': 'Spanish',
  'et_EE': 'Estonian',
  'fa_IR': 'Persian',
  'fi_FI': 'Finnish',
  'fr_XX': 'French',
  'gl_ES': 'Galician',
  'gu_IN': 'Gujarati',
  'he_IL': 'Hebrew',
  'hi_IN': 'Hindi',
  'hr_HR': 'Croatian',
  'id_ID': 'Indonesian',
  'it_IT': 'Italian',
  'ja_XX': 'Japanese',
  'ka_GE': 'Georgian',
  'kk_KZ': 'Kazakh',
  'km_KH': 'Khmer',
  'ko_KR': 'Korean',
  'lt_LT': 'Lithuanian',
  'lv_LV': 'Latvian',
  'mk_MK': 'Macedonian',
  'ml_IN': 'Malayalam',
  'mn_MN': 'Mongolian',
  'mr_IN': 'Marathi',
  'my_MM': 'Burmese',
  'ne_NP': 'Nepali',
  'nl_XX': 'Dutch',
  'pl_PL': 'Polish',
  'ps_AF': 'Pashto',
  'pt_XX': 'Portuguese',
  'ro_RO': 'Romanian',
  'ru_RU': 'Russian',
  'si_LK': 'Sinhala',
  'sl_SI': 'Slovene',
  'sv_SE': 'Swedish',
  'sw_KE': 'Swahili',
  'ta_IN': 'Tamil',
  'te_IN': 'Telugu',
  'th_TH': 'Thai',
  'tl_XX': 'Tagalog',
  'tr_TR': 'Turkish',
  'uk_UA': 'Ukrainian',
  'ur_PK': 'Urdu',
  'vi_VN': 'Vietnamese',
  'xh_ZA': 'Xhosa',
  'zh_CN': 'Chinese'
}
t2t.Handler(["I would like to go hiking tomorrow."], 
        src_lang="en_XX"
        ).translate(tgt_lang='zh_CN')


In [None]:
### Question Answering. Question must follow context with ` [SEP] ` in between.
t2t.Handler([
         "Hello, this is Text2Text! [SEP] What is this?", 
         "It works very well. It's awesome! [SEP] How is it?"
         ]).answer()

In [None]:
t2t.Handler(["很喜欢陈慧琳唱歌。 [SEP] 喜欢做什么?"], 
        src_lang="zh",
        ).answer()
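When contexts and questions live in separate variables, the input strings can be assembled with a small helper. The sketch below only builds strings in the ` [SEP] ` format described above. `qa_input` is a hypothetical name and not part of the library.

```python
SEP = " [SEP] "  # delimiter between context and question, as described above

def qa_input(context, question):
    """Join a context and a question into a single question-answering input."""
    return f"{context.strip()}{SEP}{question.strip()}"

inputs = [
    qa_input("Hello, this is Text2Text!", "What is this?"),
    qa_input("It works very well. It's awesome!", "How is it?"),
]
print(inputs)
```

The resulting list can be passed to `t2t.Handler(...)` in the same way as the hand-written strings.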

In [None]:
### Question Generation
t2t.Handler([
            bio_str,
            bio_str,
            bio_str,
            bio_str,
            bio_str,
            "I will go to school today to take my math exam.",
            "I will go to school today to take my math exam.",
            "Tomorrow is my cousin's birthday. He will turn 24 years old.",
            notre_dame_str,
            bacteria_str,
            bacteria_str,
            bacteria_str,
            "I will go to school today to take my math exam. [SEP] school",
            "I will go to school today to take my math exam. [SEP] exam",
            "I will go to school today to take my math exam. [SEP] math",
          ], src_lang='en').question()

In [None]:
t2t.Handler(["很喜欢陈慧琳唱歌。"], src_lang='zh').question()

In [None]:
### Summarization
t2t.Handler([notre_dame_str, bacteria_str, bio_str], src_lang='en').summarize()

In [None]:
### Variation
# Useful for augmenting training data
t2t.Handler([bacteria_str], src_lang='en').variate()

In [None]:
### Training / Fine-tuning
# Finetune cross-lingual model on your data
result = t2t.Handler(["Hello, World! [TGT] 你好,世界!"], 
            src_lang="en",
            tgt_lang="zh",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()

# Load and use the model from the saved directory
t2t.Transformer.PRETRAINED_TRANSLATOR = "model_dir"
t2t.Handler(["Hello, World!"], src_lang="en").translate(tgt_lang="zh")
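For more than one training example, the ` [TGT] `-delimited strings can be generated from a list of parallel sentence pairs. This is a minimal sketch of the string formatting only. The variable names are illustrative and the single pair is the one used above.

```python
TGT = " [TGT] "  # delimiter between source and target text, as in the example above

# Parallel (source, target) sentence pairs; replace with your own data.
pairs = [
    ("Hello, World!", "你好,世界!"),
]

training_texts = [f"{src}{TGT}{tgt}" for src, tgt in pairs]
print(training_texts)
```

The list `training_texts` can then be passed as the first argument to `t2t.Handler(...)` before calling `fit()`.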