<a href="https://colab.research.google.com/github/rama96/hugging-face/blob/master/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
! pip install transformers
from transformers import pipeline

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.2


## 1. Sentiment analysis - Classifies text as either negative or positive. Nothing special here.

In [3]:
classifier = pipeline("sentiment-analysis")
classifier("I've had a wonderful day")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998874664306641}]

In [6]:
classifier(["I've had a wonderfully bad day","I've had a wonderfully bad day"])

[{'label': 'NEGATIVE', 'score': 0.8680425882339478},
 {'label': 'NEGATIVE', 'score': 0.8680425882339478}]
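
Each result is a dict with a `label` and a `score`, so downstream code usually needs a small amount of post-processing. The sketch below is one possible way to do that, using the outputs recorded above as literal data so that no model has to be loaded. The helper names `signed_score` and `confident_label`, and the 0.9 threshold, are illustrative assumptions and not part of the transformers API.

```python
# Outputs copied from the two sentiment cells above (no model download needed).
results = [
    {"label": "POSITIVE", "score": 0.9998874664306641},
    {"label": "NEGATIVE", "score": 0.8680425882339478},
]


def signed_score(result):
    """Map a {'label', 'score'} dict to a value in [-1, 1]."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


def confident_label(result, threshold=0.9):
    """Return the label, or 'UNSURE' when the score is below the threshold.

    The default threshold is an arbitrary illustrative choice.
    """
    return result["label"] if result["score"] >= threshold else "UNSURE"


for r in results:
    print(confident_label(r), round(signed_score(r), 3))
```

With this threshold, the mixed sentence "wonderfully bad day" is flagged as unsure instead of being reported as plainly negative.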

## 2. Zero-shot classification - Lets you choose your own labels. The model is pretrained and can be used for classification directly, without fine-tuning on those labels.

In [8]:
classifier_2 = pipeline("zero-shot-classification")
classifier_2(
    "This is a course about transformers library" , candidate_labels = ['education','politics','murder']
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about transformers library',
 'labels': ['education', 'politics', 'murder'],
 'scores': [0.9611843228340149, 0.026535388082265854, 0.012280362658202648]}

In [12]:
classifier_2(
    "I am going to kill this guy" , candidate_labels = ['education','politics','homicide']
)

{'sequence': 'I am going to kill this guy',
 'labels': ['homicide', 'politics', 'education'],
 'scores': [0.9635788202285767, 0.02948072925209999, 0.006940451916307211]}

In [13]:
classifier_2(
    "I am going to kill this guy" , candidate_labels = ['murder','politics','homicide']
)

{'sequence': 'I am going to kill this guy',
 'labels': ['murder', 'homicide', 'politics'],
 'scores': [0.6468082070350647, 0.3427066504955292, 0.010485121980309486]}
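
The last two cells show that near-synonymous labels ("murder" and "homicide") split the probability between them, because in the default single-label mode the scores are normalized across the candidate labels. A minimal sketch of how one might detect such ambiguous cases, again using the recorded outputs as literal data. The helper `top_label` and the 0.5 margin are hypothetical choices for illustration.

```python
# Results copied from the zero-shot cells above.
clear = {
    "labels": ["homicide", "politics", "education"],
    "scores": [0.9635788202285767, 0.02948072925209999, 0.006940451916307211],
}
ambiguous = {
    "labels": ["murder", "homicide", "politics"],
    "scores": [0.6468082070350647, 0.3427066504955292, 0.010485121980309486],
}


def top_label(result, min_margin=0.5):
    """Return the best label, or None if the runner-up is too close.

    Expects at least two candidate labels.
    """
    ranked = sorted(zip(result["scores"], result["labels"]), reverse=True)
    (best_score, best_label), (second_score, _) = ranked[0], ranked[1]
    if best_score - second_score < min_margin:
        return None  # two labels compete for the same meaning
    return best_label


for res in (clear, ambiguous):
    print(top_label(res), "sum of scores =", round(sum(res["scores"]), 6))
```

A `None` result is a hint that the candidate labels overlap and should probably be merged or reworded.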

## 3. Text generation - Generates a continuation of the text you give it

In [15]:
generator = pipeline("text-generation")
generator('In this course , we will talk about ')

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course , we will talk about \xa0the \xa0gibson-based language by Gavrid S. Varela, which was created and brought to life by an international team of scientists. The project developed by Varela'}]

In [17]:
generator('Arsenal is a ')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Arsenal is a \xa0important league. I would expect you to look carefully at your club's stats so you can make better determinations as to how good or bad the team is. If we are looking at a\xa0better\xa0favourite,"}]
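
Note that `generated_text` repeats the prompt before the newly generated words. If only the continuation is needed, it can be sliced off after the fact. The helper below is a small sketch of that idea; `continuation` is a hypothetical name, and the sample string is a shortened excerpt of the output above.

```python
def continuation(prompt, generated_text):
    """Return only the newly generated part of a text-generation result."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text  # prompt was not echoed; return the text unchanged


prompt = "Arsenal is a "
# Shortened excerpt of the recorded output (\xa0 is a non-breaking space).
output = [{"generated_text": "Arsenal is a \xa0important league."}]

print(repr(continuation(prompt, output[0]["generated_text"])))
```

The text-generation pipeline also accepts `return_full_text=False`, which should give the same effect without manual slicing.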

## 3 (continued). Specifying a model - huggingface.co/models lists models that can be used for different tasks. The list can be filtered by task (e.g. text generation, text classification). Below is an example of how to specify a model for a task.

In [20]:
# distilgpt2 is a lighter (distilled) version of gpt2

generator_2 = pipeline("text-generation", model="distilgpt2")
generator_2('In this course , we will talk about ',
            max_length=30,           # maximum total length in tokens (prompt included), not words
            num_return_sequences=2)  # number of independent continuations to generate


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course , we will talk about irc.net/forum/net-forum/wiki/index.php?thread=364448'},
 {'generated_text': 'In this course , we will talk about eryxialandic.\nIn this Course we will talk about eryxialandic.'}]
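
The warnings printed by the earlier cells recommend naming both the model and its revision instead of relying on task defaults. One way to keep those choices in a single place is sketched below. The `DEFAULTS` table and the `pipeline_kwargs` helper are assumptions made for illustration; the model names and revision hashes are copied from the "No model was supplied" log lines above.

```python
# Model names and revisions copied from the log lines printed earlier.
DEFAULTS = {
    "sentiment-analysis": ("distilbert-base-uncased-finetuned-sst-2-english", "af0f99b"),
    "zero-shot-classification": ("facebook/bart-large-mnli", "c626438"),
    "text-generation": ("gpt2", "6c0e608"),
}


def pipeline_kwargs(task):
    """Build explicit keyword arguments for transformers.pipeline()."""
    model, revision = DEFAULTS[task]
    return {"task": task, "model": model, "revision": revision}


# Usage (requires transformers and a model download, so it is not run here):
#   from transformers import pipeline
#   classifier = pipeline(**pipeline_kwargs("sentiment-analysis"))
print(pipeline_kwargs("text-generation"))
```

Pinning the revision keeps results reproducible even if the default model for a task changes in a later release.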

## 4. Fill mask - Predicts the missing (masked) word in a sentence

In [23]:
# With no model given, fill-mask defaults to distilroberta-base, whose mask token is <mask>

unmasker = pipeline("fill-mask")
unmasker('In this course , we will talk about <mask> models.',
         top_k=4)  # return the 4 most likely fillers

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.050060659646987915,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'In this course, we will talk about mathematical models.'},
 {'score': 0.044097475707530975,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'In this course, we will talk about predictive models.'},
 {'score': 0.029750045388936996,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'In this course, we will talk about computational models.'},
 {'score': 0.016164317727088928,
  'token': 499,
  'token_str': ' future',
  'sequence': 'In this course, we will talk about future models.'}]
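
The scores are probabilities over the model's vocabulary, so summing them shows how much of the distribution the top candidates cover. A quick sketch using the recorded output above (only the `score` and `token_str` fields are copied):

```python
# Scores and tokens copied from the fill-mask output above.
candidates = [
    {"score": 0.050060659646987915, "token_str": " mathematical"},
    {"score": 0.044097475707530975, "token_str": " predictive"},
    {"score": 0.029750045388936996, "token_str": " computational"},
    {"score": 0.016164317727088928, "token_str": " future"},
]

# token_str keeps the leading space of the sub-word token; strip it for display.
words = [c["token_str"].strip() for c in candidates]
covered = sum(c["score"] for c in candidates)

print(words)
print(f"top-{len(candidates)} candidates cover {covered:.1%} of the probability mass")
```

Here the four best candidates account for only about 14% of the probability, which suggests the model sees many plausible fillers for this sentence.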

## 5. NER (named entity recognition) - Identifies entities such as persons, organizations, and locations in a sentence

In [29]:
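# grouped_entities=True merges sub-word tokens that belong to the same entity;
# newer transformers releases express this as aggregation_strategy="simple"
ner = pipeline("ner",
               grouped_entities = True)
ner("My name is Rama and i work as a data scientist at OYO , a vacation rental company head quarted in Amsterdam")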

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.99886036,
  'word': 'Rama',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.9971631,
  'word': 'OYO',
  'start': 50,
  'end': 53},
 {'entity_group': 'LOC',
  'score': 0.9979184,
  'word': 'Amsterdam',
  'start': 98,
  'end': 107}]

In [30]:
ner("My name is Rama and i work as a data scientist at Air BnB , a vacation rental company head quarted in Amsterdam")

[{'entity_group': 'PER',
  'score': 0.9988146,
  'word': 'Rama',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.995313,
  'word': 'Air BnB',
  'start': 50,
  'end': 57},
 {'entity_group': 'LOC',
  'score': 0.99731964,
  'word': 'Amsterdam',
  'start': 102,
  'end': 111}]
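
The `start` and `end` fields are character offsets into the input string, so they can be used to mark up the original text. The sketch below tags each entity in place, working from right to left so that earlier offsets stay valid. The `annotate` helper and the `[word|TYPE]` format are illustrative assumptions; the entity data is copied from the output above.

```python
text = ("My name is Rama and i work as a data scientist at Air BnB , "
        "a vacation rental company head quarted in Amsterdam")

# Entities copied from the NER output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Rama", "start": 11, "end": 15},
    {"entity_group": "ORG", "word": "Air BnB", "start": 50, "end": 57},
    {"entity_group": "LOC", "word": "Amsterdam", "start": 102, "end": 111},
]


def annotate(text, entities):
    """Wrap each entity span as [word|TYPE] using its character offsets."""
    out = text
    # Replace from the last entity to the first so offsets are not shifted.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        span = out[ent["start"]:ent["end"]]
        out = out[:ent["start"]] + f"[{span}|{ent['entity_group']}]" + out[ent["end"]:]
    return out


print(annotate(text, entities))
```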

## 6. Question answering - Extracts the answer to a question from a given context

In [31]:
question_answer = pipeline("question-answering")
question_answer(
    question="Where do i work?",
    context="My name is Rama and i operate as a data scientist at Air BnB , a vacation rental company head quarted in Amsterdam",
    )

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9506286382675171, 'start': 53, 'end': 60, 'answer': 'Air BnB'}

In [32]:
question_answer(
    question="Where are you based out of ?",
    context="My name is Rama and i operate as a data scientist at Air BnB , a vacation rental company head quarted in Amsterdam",
    )

{'score': 0.602066159248352, 'start': 105, 'end': 114, 'answer': 'Amsterdam'}
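
This kind of question answering is extractive: the answer is always a literal span of the context, identified by `start` and `end`. The sketch below checks that property on the two recorded results and shows the answer with some surrounding text. The `highlight` helper and its 20-character window are hypothetical.

```python
context = ("My name is Rama and i operate as a data scientist at Air BnB , "
           "a vacation rental company head quarted in Amsterdam")

# Results copied from the two question-answering cells above.
results = [
    {"score": 0.9506286382675171, "start": 53, "end": 60, "answer": "Air BnB"},
    {"score": 0.602066159248352, "start": 105, "end": 114, "answer": "Amsterdam"},
]


def highlight(context, result, window=20):
    """Show the answer with up to `window` characters of context on each side."""
    lo = max(0, result["start"] - window)
    hi = min(len(context), result["end"] + window)
    before = context[lo:result["start"]]
    after = context[result["end"]:hi]
    return f"{before}**{result['answer']}**{after}"


for r in results:
    # The answer is exactly the slice of the context given by the offsets.
    assert context[r["start"]:r["end"]] == r["answer"]
    print(highlight(context, r))
```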

## 7. Summarization - Writes a summarized version of a long article

In [33]:
summarizer = pipeline("summarization")
summarizer("My name is Rama and i operate as a data scientist at Air BnB , a vacation rental company head quarted in Amsterdam")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Your max_length is set to 142, but you input_length is only 29. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)


[{'summary_text': " Rama is a data scientist at Air BnB, a vacation rental company head quarted in Amsterdam . She is also the data scientist behind the company's website . Rama says she is an expert in data analysis and data science . She says the company is looking to use its data to improve the quality of its rental properties ."}]
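
The input here is far shorter than the default `max_length`, and the resulting "summary" is longer than the input and contains details that were never stated. The warning suggests lowering `max_length` to roughly half the input length (14 for an input of 29 tokens). A rough sketch of that rule follows; the `summary_lengths` helper is hypothetical, and halving again for `min_length` is an assumption made so that the minimum never exceeds the maximum.

```python
def summary_lengths(input_tokens):
    """Suggest generation lengths (in tokens) for a short input.

    Mirrors the hint in the warning above: max_length is about half
    of the input length. min_length is halved again (an assumption).
    """
    max_length = max(2, input_tokens // 2)
    min_length = max(1, max_length // 2)
    return {"max_length": max_length, "min_length": min_length}


# The warning reports an input_length of 29 tokens for the sentence above.
print(summary_lengths(29))

# Usage (requires the summarizer from the cell above; not run here):
#   summarizer(text, **summary_lengths(29))
```

Summarization is meant for long articles, so for a single sentence like this one the more reliable fix is simply not to summarize it.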

## 8. Translation 

In [34]:
translator = pipeline("translation" , model = "Helsinki-NLP/opus-mt-fr-en")
translator("ce cours est produit par Rama")

ValueError: ignored

The `ValueError` is raised because the tokenizer for `Helsinki-NLP/opus-mt-fr-en` depends on the `sentencepiece` package, which is not installed yet. The next cell installs it; re-run the translation cell afterwards.

In [35]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97
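
To avoid discovering a missing optional dependency only after the model files have been downloaded, the check can be done up front. A minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list of packages to check is an assumption based on the error above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# sentencepiece is the package the translation model's tokenizer needed above.
missing = missing_packages(["sentencepiece"])
if missing:
    print("Install before creating the pipeline: pip install " + " ".join(missing))
else:
    print("All optional dependencies found.")
```

In Colab, a package installed mid-session may only be picked up after the failing cell is re-run, and in some cases after the runtime is restarted.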
