<a href="https://colab.research.google.com/github/timmiyassine/nlp-tp1/blob/main/NLP_using_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**NLP Tasks with Transformers using HuggingFace**

*   Sentiment analysis
*   Text generation
*   Named entity recognition (NER)
*   Question answering
*   Filling masked text
*   Summarization
*   Translation
*   Feature extraction




###**Install packages**

In [3]:
!pip install transformers sentencepiece

Collecting transformers
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.16-py3-none-any.whl (50 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)

###**Dataset**

In [4]:
array_of_texts = [      
            ["I live in the city of Meknes", "English"],
            ["Hello my friend how are you?","English"],
            ["أنا اعيش في مدينة مكناس" , "Arabic"],
            ["لا أحب عملي بالمرة", "Arabic"],
            ["Je vis en ville Il n y a pas beaucoup d air", "French"],
            ["Je n'aime pas mon travail", "French"]
]

print(array_of_texts)

[['I live in the city of Meknes', 'English'], ['Hello my friend how are you?', 'English'], ['أنا اعيش في مدينة مكناس', 'Arabic'], ['لا أحب عملي بالمرة', 'Arabic'], ['Je vis en ville Il n y a pas beaucoup d air', 'French'], ["Je n'aime pas mon travail", 'French']]


In [5]:
from transformers import pipeline, AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("akhooli/gpt2-small-arabic")
model = AutoModel.from_pretrained("akhooli/gpt2-small-arabic")


Some weights of the model checkpoint at akhooli/gpt2-small-arabic were not used when initializing GPT2Model: ['lm_head.weight']
- This IS expected if you are initializing GPT2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
my_classifier_using_pipeline = pipeline('sentiment-analysis' ,  model='akhooli/xlm-r-large-arabic-sent')


Some weights of the model checkpoint at akhooli/xlm-r-large-arabic-sent were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



###**Preprocessing data**

In [8]:
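import re
import string

import nltk
from nltk.corpus import stopwords

# The stopword lists are downloaded here but are not applied by preprocessing_data below
nltk.download('stopwords')

# Arabic and typographic punctuation plus the ASCII set from string.punctuation
punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ''' + string.punctuation


def preprocessing_data(text, language):
    """Normalize `text` according to its language label."""
    if language == "Arabic":
        # Normalize letter variants (alef forms, alef maqsura, hamza carriers,
        # ta marbuta, gaf) to a single canonical letter
        text = re.sub("[إأآا]", "ا", text)
        text = re.sub("ى", "ي", text)
        text = re.sub("ؤ", "ء", text)
        text = re.sub("ئ", "ء", text)
        text = re.sub("ة", "ه", text)
        text = re.sub("گ", "ك", text)
    elif language == "English":
        # Lower-case the text and strip punctuation characters
        text = text.lower()
        text = ''.join(ch for ch in text if ch not in punctuations)
    # Any other language (French in this notebook) is returned unchanged
    return text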

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
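The Arabic branch above applies six character-level substitutions one after another. As a minimal sketch, the same mapping can be expressed as a single translation table; `ARABIC_NORMALIZATION` and `normalize_arabic` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Sketch: the Arabic letter normalization from preprocessing_data as one table.
# Code points are written as escapes so the mapping is unambiguous.
ARABIC_NORMALIZATION = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
    "\u0624": "\u0621",  # waw with hamza        -> hamza
    "\u0626": "\u0621",  # ya with hamza         -> hamza
    "\u0629": "\u0647",  # ta marbuta            -> ha
    "\u06AF": "\u0643",  # gaf                   -> kaf
})


def normalize_arabic(text):
    """Apply all substitutions in a single pass over the string."""
    return text.translate(ARABIC_NORMALIZATION)


print(normalize_arabic("أنا اعيش في مدينة مكناس"))
```

A translation table keeps the whole mapping in one place, which may be easier to extend than a chain of `re.sub` calls.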


####**Text classification (Sentiment analysis) - English, French and Arabic languages**

In [9]:
for text, language in array_of_texts:
  res = my_classifier_using_pipeline(preprocessing_data(text, language))[0]
  # Map the raw labels to readable names: LABEL_2 -> Positive, LABEL_1 -> Negative, anything else -> Mixed
  res['label'] = "Positive" if res['label'] == 'LABEL_2' else ("Negative" if res['label'] == 'LABEL_1' else "Mixed")
  # 'score' is the classifier's confidence in the predicted label, not a measured accuracy
  print("'{}' is {} {} text with confidence = {}%".format(text, res['label'], language, round(res['score']*100, 2)))


'I live in the city of Meknes' is Positive English text with confidence = 42.09%
'Hello my friend how are you?' is Mixed English text with confidence = 42.05%
'أنا اعيش في مدينة مكناس' is Mixed Arabic text with confidence = 38.15%
'لا أحب عملي بالمرة' is Negative Arabic text with confidence = 81.6%
'Je vis en ville Il n y a pas beaucoup d air' is Mixed French text with confidence = 51.35%
'Je n'aime pas mon travail' is Negative French text with confidence = 95.15%
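Several of the scores above sit close to 40%, which for a three-label classifier is only slightly better than a coin toss. The sketch below reuses the label mapping from the loop above and flags low-confidence predictions; the 0.6 threshold and the names `LABEL_NAMES` and `describe_prediction` are illustrative assumptions, not values taken from the model card.

```python
# Label mapping copied from the loop above; LABEL_0 falls through to "Mixed".
LABEL_NAMES = {"LABEL_2": "Positive", "LABEL_1": "Negative"}
MIN_CONFIDENCE = 0.6  # hypothetical cut-off, chosen only for illustration


def describe_prediction(result, min_confidence=MIN_CONFIDENCE):
    """Turn one pipeline result dict into a readable label with its confidence."""
    label = LABEL_NAMES.get(result["label"], "Mixed")
    score = result["score"]
    if score < min_confidence:
        return f"{label} (low confidence: {score:.2%})"
    return f"{label} ({score:.2%})"


# Result dicts shaped like the pipeline output, using scores printed above
print(describe_prediction({"label": "LABEL_1", "score": 0.9515}))
print(describe_prediction({"label": "LABEL_2", "score": 0.4209}))
```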


###**Text generation (English, French and Arabic languages)**

In [10]:
from transformers import pipeline, set_seed
generator_english_text = pipeline('text-generation', model='gpt2')
generator_arabic_text = pipeline('text-generation', model='akhooli/gpt2-small-arabic')
generator_french_text = pipeline('text-generation', model='dbddv01/gpt2-french-small')
set_seed(42)


In [11]:
for text , lang in array_of_texts:
  if lang == "Arabic":
    print(generator_arabic_text(text , max_length=30, num_return_sequences=5))
  elif lang == "English":
    print(generator_english_text(text , max_length=30, num_return_sequences=5))
  else:
    print(generator_french_text(text , max_length=30, num_return_sequences=5))

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I live in the city of Meknese. I bought it off an auction site, and I think the community I live in may not appreciate it'}, {'generated_text': "I live in the city of Meknesh in Israel and I'm doing what I do because it's really important. This is the kind of kind"}, {'generated_text': 'I live in the city of Meknes," said the father, who goes by his single name.\n\nThat is all this has told him that'}, {'generated_text': 'I live in the city of Meknese."\n\nKolima and Czaja are also the owners of a popular seafood restaurant located east'}, {'generated_text': "I live in the city of Meknes, near the border with Ethiopia and Ethiopia as well as Ethiopia's largest Arab population. It is not easy,"}]


[{'generated_text': 'Hello my friend how are you? You are my daughter and it was your turn to get married?"\n\nI had no desire to respond to his'}, {'generated_text': "Hello my friend how are you? (Frownes growled as his eyes caught Harry's.) 'The boy must have the ability of living and"}, {'generated_text': "Hello my friend how are you? Well, all the time that we're in my office, I'm watching films in order to make money, but"}, {'generated_text': 'Hello my friend how are you?\n\nShe looked at me and then back to me to say hello. I know nothing of myself. My clothes'}, {'generated_text': 'Hello my friend how are you? I look amazing in you," she replied and got down off of her desk with the note in her hand.\n'}]


[{'generated_text': 'أنا اعيش في مدينة مكناس في المغرب. بدأ التعليم الرسمي بمدرسة سيدي بلعباس في الفترة ما بين سنة 1956 و1988. وفي سنة 1966 صدر'}, {'generated_text': 'أنا اعيش في مدينة مكناس، حيث كان والده مهندسا معماريا، درس الابتدائية بمدرسة أولاد جغوب التي بناها الملك الناصر ابن السلطان الناصر محمد علي بن'}, {'generated_text': 'أنا اعيش في مدينة مكناس، فقد توفي والده وهو في الحادية والعشرين، وقد عمل موظفا في البلاط الملكي. وقد تعلم القرآن عن جده الشيخ مصطفى بن'}, {'generated_text': 'أنا اعيش في مدينة مكناس. سنة 2011 اعلن امران عن وفاة شخصين من عائلة اكلا، وتم اعلان ان شخصا ثالثا من عائلة اكلا بعد'}, {'generated_text': 'أنا اعيش في مدينة مكناس، وكان ابن عمه الأمير عبد الله بن قاسم. شارك في تأسيس حركة التوحيد في عام 1919، وهي الجبهة التي شكلت حديثا'}]


[{'generated_text': 'لا أحب عملي بالمرة. ثم قام هو بتأسيس النادي الأهلي السعودي وتم تعيين الأستاذ خالد سعيد، وكيلا للنادي في عام 1977. إلا أنه لم يكن موجودا'}, {'generated_text': 'لا أحب عملي بالمرة و من أكثر شروحه ما كتب من حيث الشعر. و في سن 12 سنة تقريبا، بدأت فترة الدراسة في بيت الدين'}, {'generated_text': 'لا أحب عملي بالمرة الأولى. فأكمل مشواره المهني في الكويت سنة 1965 ثم سافر إلى ألمانيا الغربية حيث عين مديرا لأحد الفنادق في فرانكفورت في ألمانيا الغربية'}, {'generated_text': 'لا أحب عملي بالمرة في عام 1956. كان أول عمل فني منفرد من القرن العشرين. تلقى على براءة اختراع في عام 1963 من قبل منظمة الغذاء والدواء'}, {'generated_text': 'لا أحب عملي بالمرة، فقربه وهو في السادسة عشرة من عمره، حتى بلغت الثامنة عشرة من العمر، قال في هذه المرجل إنه كان من'}]


[{'generated_text': "Je vis en ville Il n y a pas beaucoup d air de mer avec la terre, sauf en certaines zones où aucune d'elles a été utilisée"}, {'generated_text': "Je vis en ville Il n y a pas beaucoup d air dans ces montagnes. Néanmoins il est probable qu'Alfonso est le précurseur de María"}, {'generated_text': 'Je vis en ville Il n y a pas beaucoup d air comme le soleil qui les sépare. Cette partie de son écUnivers est également appelée « brouillard'}, {'generated_text': 'Je vis en ville Il n y a pas beaucoup d air, des sols très pauvres et des conditions météorologiques difficiles. Le nombre et le traitement du'}, {'generated_text': "Je vis en ville Il n y a pas beaucoup d air dans cette ville. Lors des saisons similaires, le gouvernement américain n'est pas en mesure"}]
[{'generated_text': "Je n'aime pas mon travail et c'est une expérience qui force des gens à me faire des choses. »À la suite de ce travail"}, {'generated_text': "Je n'aime pas mon travail, je veux de ne pas faire
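Each generated text above starts with the prompt itself, which makes the five variants hard to compare at a glance. A minimal sketch for keeping only the continuation follows; `continuation_only` is a hypothetical helper, and the sample input is shortened from the English output above.

```python
def continuation_only(prompt, outputs):
    """Return each generated text with the leading prompt removed."""
    continuations = []
    for item in outputs:
        text = item["generated_text"]
        # The generation output repeats the prompt before the new tokens
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


sample = [{"generated_text": "Hello my friend how are you? Well, all the time"}]
print(continuation_only("Hello my friend how are you?", sample))
```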

###**Named entity recognition (NER) English, French and Arabic languages**

In [13]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

ner_english_recognition = pipeline("ner", model="dslim/bert-base-NER", tokenizer="dslim/bert-base-NER")
ner_arabic_recognition = pipeline("ner", model="hatmimoha/arabic-ner", tokenizer="hatmimoha/arabic-ner")
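# Note: as its name indicates, this checkpoint is a French part-of-speech tagger,
# so it returns POS tags rather than named entities
ner_french_recognition = pipeline("ner", model="gilf/french-postag-model", tokenizer="gilf/french-postag-model")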


In [14]:
for text , lang in array_of_texts:
  if lang == "Arabic":
    print(ner_arabic_recognition(text))
  elif lang == "English":
    print(ner_english_recognition(text))
  else:
    print(ner_french_recognition(text))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity': 'B-LOC', 'score': 0.9934355, 'index': 7, 'word': 'Me', 'start': 22, 'end': 24}, {'entity': 'I-LOC', 'score': 0.747593, 'index': 8, 'word': '##k', 'start': 24, 'end': 25}, {'entity': 'I-LOC', 'score': 0.6653653, 'index': 9, 'word': '##nes', 'start': 25, 'end': 28}]
[]
[{'entity': 'LABEL_12', 'score': 0.99600494, 'index': 1, 'word': 'أنا', 'start': 0, 'end': 3}, {'entity': 'LABEL_12', 'score': 0.99960184, 'index': 2, 'word': 'اعيش', 'start': 4, 'end': 8}, {'entity': 'LABEL_12', 'score': 0.99960136, 'index': 3, 'word': 'في', 'start': 9, 'end': 11}, {'entity': 'LABEL_12', 'score': 0.9979107, 'index': 4, 'word': 'مدينة', 'start': 12, 'end': 17}, {'entity': 'LABEL_4', 'score': 0.9988, 'index': 5, 'word': 'مكناس', 'start': 18, 'end': 23}]
[{'entity': 'LABEL_12', 'score': 0.9999264, 'index': 1, 'word': 'لا', 'start': 0, 'end': 2}, {'entity': 'LABEL_12', 'score': 0.99991745, 'index': 2, 'word': 'أحب', 'start': 3, 'end': 6}, {'entity': 'LABEL_12', 'score': 0.99994904, 'index': 3, 'wo
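In the English output above, the single word "Meknes" is reported as three WordPiece tokens (`Me`, `##k`, `##nes`). The sketch below merges such `##` continuation pieces back into one entity. `merge_wordpieces` is a hypothetical helper, and keeping the minimum score of the merged pieces is an assumption made here as a conservative choice. Recent versions of the pipeline may also offer built-in grouping through an `aggregation_strategy` argument, which is worth checking against the installed release.

```python
def merge_wordpieces(tokens):
    """Merge BERT-style '##' continuation pieces into whole-word entities."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            last = merged[-1]
            last["word"] += tok["word"][2:]      # drop the '##' marker
            last["end"] = tok["end"]             # extend the character span
            last["score"] = min(last["score"], tok["score"])
        else:
            merged.append({
                "entity": tok["entity"].split("-")[-1],  # drop the B-/I- prefix
                "word": tok["word"],
                "score": tok["score"],
                "start": tok["start"],
                "end": tok["end"],
            })
    return merged


# Token dicts copied from the English output above
tokens = [
    {"entity": "B-LOC", "score": 0.9934355, "index": 7, "word": "Me", "start": 22, "end": 24},
    {"entity": "I-LOC", "score": 0.747593, "index": 8, "word": "##k", "start": 24, "end": 25},
    {"entity": "I-LOC", "score": 0.6653653, "index": 9, "word": "##nes", "start": 25, "end": 28},
]
print(merge_wordpieces(tokens))
```

This only rejoins sub-word pieces; it does not group separate words that belong to the same multi-word entity.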

###**Question answering**

In [15]:
from transformers import pipeline

##### English context
question = "What is the capital of the Morocco?"
context = r"Morocco, a North African country bordering the Atlantic Ocean and Mediterranean Sea, is distinguished by its Berber, Arabian and European cultural influences. Marrakesh’s medina, a mazelike medieval quarter, offers entertainment in its Djemaa el-Fna square and souks (marketplaces) selling ceramics, jewelry and metal lanterns. The capital Rabat’s Kasbah of the Udayas is a 12th-century royal fort overlooking the water."
# Generating an answer to the question in context
qa_english_sentence = pipeline("question-answering")
answer = qa_english_sentence(question=question, context=context)
# Print the answer
print(f"English Question: {question}")
print(f"English Answer: '{answer['answer']}' with accuracy {answer['score']}")

##### French context
qa_french_sentence = pipeline('question-answering', model='fmikaelian/camembert-base-squad', tokenizer='fmikaelian/camembert-base-squad')
question_fr = "Quelle est la capitale du Maroc ?"
context_fr = "Le Maroc est un pays du nord-ouest de l'Afrique. Sa longue côte donnant sur l'océan Atlantique se termine au cap Spartel, limite occidentale du détroit de Gibraltar et où débute le littoral méditerranéen. Au sud du Maroc se trouve le territoire contesté Sahara occidental, revendiqué et contrôlé en grande partie par le Maroc. À l'est et au sud-est, le Maroc est limitrophe de l'Algérie. À l'ouest-sud-ouest et à quelque distance de la côte atlantique se situent les îles Canaries tandis qu'à 672 km à l'ouest-nord-ouest du littoral marocain, on rencontre Madère. Au nord du détroit de Gibraltar se trouve l'Espagne. La capitale administrative est Rabat. Parmi les grandes villes remarquables on trouve Casablanca, Agadir, Fès, Marrakech, Meknès, Tanger, Oujda, Nador."
french_answer = qa_french_sentence(question =  question_fr , context = context_fr)
# Print the answer
print(f"French Question: {question_fr}")
print(f"French Answer: '{french_answer['answer']}' with accuracy {french_answer['score']}")


No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



English Question: What is the capital of the Morocco?
English Answer: 'Rabat’s Kasbah of the Udayas' with accuracy 0.7165263891220093



French Question: Quelle est la capitale du Maroc ?
French Answer: ' Rabat.' with accuracy 0.9795011878013611
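Two details of the answers above are worth handling before comparing them with an expected value. The French answer carries a leading space and a trailing period, and the English answer is a longer span that merely contains the city name. Note also that `score` is the model's confidence in the extracted span rather than an accuracy. The following is a small sketch of a comparison helper; `normalize_answer` and `contains_answer` are hypothetical names.

```python
import string


def normalize_answer(answer):
    """Lower-case the answer and strip surrounding whitespace and ASCII punctuation."""
    return answer.strip().strip(string.punctuation).strip().lower()


def contains_answer(prediction, expected):
    """Check whether the expected answer appears inside the predicted span."""
    return normalize_answer(expected) in normalize_answer(prediction)


# Predictions copied from the outputs above
print(normalize_answer(" Rabat."))
print(contains_answer("Rabat’s Kasbah of the Udayas", "Rabat"))
```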


###**Filling masked text in English,French and Arabic languages**

In [1]:
from transformers import pipeline
english_fill_mask = pipeline('fill-mask', model='bert-large-uncased-whole-word-masking')
french_fill_mask  = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")
arabic_fill_mask = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-ca')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at CAMeL-Lab/bert-base-camelbert-ca were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a 

In [4]:
print(english_fill_mask("Hi I'm a [MASK] data scientist."))
print(french_fill_mask("L'examen est <mask>."))
print(arabic_fill_mask("الهدف من الدراسة هو [MASK] ."))

[{'sequence': "hi i'm a computer data scientist.", 'score': 0.4872608184814453, 'token': 3274, 'token_str': 'computer'}, {'sequence': "hi i'm a freelance data scientist.", 'score': 0.04428230598568916, 'token': 15919, 'token_str': 'freelance'}, {'sequence': "hi i'm a research data scientist.", 'score': 0.033498384058475494, 'token': 2470, 'token_str': 'research'}, {'sequence': "hi i'm a data data scientist.", 'score': 0.03199975937604904, 'token': 2951, 'token_str': 'data'}, {'sequence': "hi i'm a math data scientist.", 'score': 0.026594799011945724, 'token': 8785, 'token_str': 'math'}]
[{'sequence': "L'examen est terminé.", 'score': 0.2890785038471222, 'token': 3818, 'token_str': 'terminé'}, {'sequence': "L'examen est ouvert.", 'score': 0.06130659952759743, 'token': 1422, 'token_str': 'ouvert'}, {'sequence': "L'examen est gratuit.", 'score': 0.043661560863256454, 'token': 434, 'token_str': 'gratuit'}, {'sequence': "L'examen est annulé.", 'score': 0.03470303490757942, 'token': 16737, '
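The prompts above use two different mask tokens: `[MASK]` for the BERT-based English and Arabic models and `<mask>` for CamemBERT. As a sketch, a prompt can be written once with a neutral placeholder and filled in per model; `with_mask` and the `{mask}` placeholder are assumptions introduced for illustration.

```python
def with_mask(template, mask_token, placeholder="{mask}"):
    """Replace the neutral placeholder with the mask token a given model expects."""
    if placeholder not in template:
        raise ValueError(f"template must contain {placeholder!r}")
    return template.replace(placeholder, mask_token)


print(with_mask("Hi I'm a {mask} data scientist.", "[MASK]"))
print(with_mask("L'examen est {mask}.", "<mask>"))

# With the pipelines defined above, the token can be read from the tokenizer
# instead of being hard-coded, for example:
# english_fill_mask(with_mask("Hi I'm a {mask} data scientist.",
#                             english_fill_mask.tokenizer.mask_token))
```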

###**Summarization**

In [5]:
from transformers import pipeline
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



In [6]:
article_summary = """Marrakech, Morocco (CNN)Renewable energy is taking off in Morocco.
In 2014 the country opened the largest wind farm in Africa, valued at $1.4 billion, in the southwest near the city of Tarfaya. Then, in early 2016, it switched on the first facility of the world's largest concentrated solar plant, Noor-1, on the fringe of the Sahara desert. When completed in 2018, it will power one million homes and make Morocco a solar superpower.
And while the country is still heavily reliant on energy imports (90 percent in 2013, according to the World Bank), it plans to generate 40 percent of its energy from renewables by 2020.
Following this road has led to Morocco hosting the UN's 2016 summit on climate change, COP22, in Marrakech. So what lies ahead?
Cutting the energy bill
Morocco ranks seventh in the world in the 2016 Climate Change Performance Index, and is the only non-European country in the top 20.
It's also one of only five countries to have achieved a "sufficient" rating for its efforts to keep warming below 2°C in the Climate Action Tracker (no country has achieved the "role model" rating as of yet, so "sufficient" is currently the best grade)."""

context_fr = "Le Maroc est un pays du nord-ouest de l'Afrique. Sa longue côte donnant sur l'océan Atlantique se termine au cap Spartel, limite occidentale du détroit de Gibraltar et où débute le littoral méditerranéen. Au sud du Maroc se trouve le territoire contesté Sahara occidental, revendiqué et contrôlé en grande partie par le Maroc. À l'est et au sud-est, le Maroc est limitrophe de l'Algérie. À l'ouest-sud-ouest et à quelque distance de la côte atlantique se situent les îles Canaries tandis qu'à 672 km à l'ouest-nord-ouest du littoral marocain, on rencontre Madère. Au nord du détroit de Gibraltar se trouve l'Espagne. La capitale administrative est Rabat. Parmi les grandes villes remarquables on trouve Casablanca, Agadir, Fès, Marrakech, Meknès, Tanger, Oujda, Nador."

print("Summary:",summarizer(article_summary, max_length=130, min_length=30, do_sample=False)[0]['summary_text'])
print("Summary:",summarizer(context_fr, max_length=130, min_length=30, do_sample=False)[0]['summary_text'])

Summary:  Morocco plans to generate 40 percent of its energy from renewables by 2020 . The country is hosting the UN's 2016 summit on climate change, COP22, in Marrakech .
Summary:  Le Maroc est un pays du nord-ouest de l'Afrique . Sa longue côte donnant sur l'océan Atlantique se termine au cap Spartel, limite occidentale du détroit de Gibraltar . Au sud du Maroc se trouve le territorie contesté Sahara occidental .
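The default checkpoint reported above, `sshleifer/distilbart-cnn-12-6`, was trained on English news, and the French "summary" is almost entirely copied from the opening sentences of the input. The sketch below gives a rough way to quantify that. `compression_ratio` and `copied_fraction` are hypothetical helpers using plain whitespace tokenization, and the example strings are shortened from the French text above.

```python
def compression_ratio(source, summary):
    """Length of the summary relative to the source, counted in whitespace tokens."""
    return len(summary.split()) / len(source.split())


def copied_fraction(source, summary):
    """Fraction of summary words that also occur in the source (case-insensitive)."""
    source_words = {w.strip(".,").lower() for w in source.split()}
    summary_words = [w.strip(".,").lower() for w in summary.split()]
    summary_words = [w for w in summary_words if w]  # drop bare punctuation tokens
    if not summary_words:
        return 0.0
    return sum(w in source_words for w in summary_words) / len(summary_words)


source = "Le Maroc est un pays du nord-ouest de l'Afrique. La capitale administrative est Rabat."
summary = "Le Maroc est un pays du nord-ouest de l'Afrique ."
print(compression_ratio(source, summary))
print(copied_fraction(source, summary))
```

A copied fraction close to 1 suggests the model is selecting sentences rather than rephrasing them.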


###**Translation**

In [7]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_ar = "السلطات المختصة تعيد فرض قوانين صارمة"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => ['The authorities are reintroducing strict rules.']


['The authorities are reintroducing strict rules.']
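The cell above hard-codes the Arabic to English direction. As a sketch, the same steps can be wrapped in a function that accepts the language labels used in `array_of_texts`. `MBART50_CODES` and `translate` are hypothetical names. The function mirrors the calls in the original cell, including `tokenizer.lang_code_to_id`, which is assumed to be available as in the transformers 4.10.0 release installed at the top of this notebook.

```python
# mBART-50 language codes for the three languages used in this notebook
MBART50_CODES = {"Arabic": "ar_AR", "English": "en_XX", "French": "fr_XX"}


def translate(text, source, target, model, tokenizer):
    """Translate `text` with an already loaded mBART-50 model and tokenizer."""
    tokenizer.src_lang = MBART50_CODES[source]
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded,
        # Force the decoder to start with the target-language token
        forced_bos_token_id=tokenizer.lang_code_to_id[MBART50_CODES[target]],
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


# Example call, reusing `model`, `tokenizer` and `article_ar` from the cell above:
# translate(article_ar, "Arabic", "English", model, tokenizer)
```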

###**Feature extraction**

In [8]:
from transformers import pipeline, AutoTokenizer

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am data scientist")




Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [9]:
print(features[0])
print(len(features[0]))  # 6: one vector per token, including the special tokens

[[-0.004540705122053623, 0.060030438005924225, -0.011161313392221928, -0.12302959710359573, 0.07138173282146454, -0.13693442940711975, -0.04459403455257416, 0.08033394068479538, 0.05772212892770767, -0.04234227165579796, -0.055484019219875336, 0.07428386807441711, 0.04028827324509621, 0.029407696798443794, 0.054055847227573395, 0.05230886861681938, -0.031010109931230545, 0.005907146260142326, 0.0556052066385746, 0.011840774677693844, -0.014026502147316933, 0.009590055793523788, -0.07618232071399689, 0.07943598926067352, 0.015575905330479145, -0.013951631262898445, 0.11029195785522461, 0.0781882107257843, -0.04894213378429413, -1.4031407772563398e-05, -0.0052127111703157425, -0.014335749670863152, 0.001761345425620675, 0.05901962146162987, 0.0014778276672586799, 0.01818060874938965, 0.06327041238546371, 0.032311055809259415, -0.08511326462030411, 0.05002790689468384, -0.0033054249361157417, -0.014438709244132042, 0.02113744430243969, 0.018039241433143616, -0.01986449398100376, 0.0016447
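The pipeline returns one vector per token, so `features[0]` is a list of six vectors rather than a single sentence embedding. One common way to obtain a fixed-size representation is to average the token vectors (mean pooling) and compare sentences with cosine similarity. The sketch below uses tiny stand-in vectors instead of the real model output; `mean_pool` and `cosine_similarity` are hypothetical helpers.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy stand-ins for features[0] of two different sentences
toy_a = [[1.0, 0.0], [0.0, 1.0]]
toy_b = [[2.0, 2.0]]
print(mean_pool(toy_a))
print(cosine_similarity(mean_pool(toy_a), mean_pool(toy_b)))
```

With the real output, `mean_pool(features[0])` would give a single vector whose length equals the model's hidden size.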