<a href="https://colab.research.google.com/github/srisowmya1212/NLPTasksUsingModels/blob/main/NLPTasksUsingPipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers



In [2]:
from transformers import pipeline

1. Text Summarization


The input to this task is a passage of text, and the model returns a summary whose length is bounded by the `min_length` and `max_length` parameters. Here the minimum length is 5 and the maximum length is 30; both limits are counted in tokens, not words.

In [4]:
summarizer = pipeline(
    "summarization", model="t5-base", tokenizer="t5-base", framework="tf"
)
# Named `text` rather than `input` so the built-in input() is not shadowed.
text = "A transformer model is the most common architecture of a large language model. It consists of an encoder and a decoder. A transformer model processes data by tokenizing the input, then simultaneously conducting mathematical equations to discover relationships between tokens."

summarizer(text, min_length=5, max_length=30)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


[{'summary_text': 'a transformer model processes data by tokenizing the input . it then conducts mathematical equations to discover relationships between tokens .'}]
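
The summary comes back lowercased, with a space before each period. A minimal sketch of tidying it for display follows, using the result recorded above; `tidy_summary` is a hypothetical helper, not part of `transformers`.

```python
import re

# Result copied from the summarization cell above.
result = [{'summary_text': 'a transformer model processes data by tokenizing the input . it then conducts mathematical equations to discover relationships between tokens .'}]

def tidy_summary(text: str) -> str:
    """Hypothetical helper: fix spacing before punctuation and capitalize sentences."""
    # Drop the whitespace that precedes punctuation marks.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text.strip())
    # Uppercase the first letter of the text and of each sentence after ., ! or ?.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

summary = tidy_summary(result[0]["summary_text"])
print(summary)
```

The pipeline returns a list with one dictionary per input, so the text is read from `result[0]["summary_text"]`.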

2. Question Answering


In this task, we provide a question and a context. The model is extractive: it selects the span of the context that is most likely to answer the question and returns it with a confidence score. It also returns the start and end character offsets of that span within the context.

Note: other checkpoints fine-tuned for the summarization task shown earlier include bart-large-cnn, t5-small, t5-large, t5-3b, and t5-11b.

In [5]:
qa_pipeline = pipeline(model="deepset/roberta-base-squad2")

qa_pipeline(
    question="Where do I work?",
    context="I work as a Software Engineer at EPSoft. I like to develop my own applications.",
)

{'score': 0.8975484371185303, 'start': 33, 'end': 39, 'answer': 'EPSoft'}
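
The `start` and `end` values are character offsets into the context, with `end` exclusive, so slicing the context recovers the answer. A small sketch using the output recorded above follows; the `MIN_SCORE` threshold is an assumption added for illustration, not something the pipeline applies.

```python
# Context and result copied from the question-answering cell above.
context = "I work as a Software Engineer at EPSoft. I like to develop my own applications."
answer = {'score': 0.8975484371185303, 'start': 33, 'end': 39, 'answer': 'EPSoft'}

# Slice the context with the returned offsets (end is exclusive).
span = context[answer["start"]:answer["end"]]
print(span)

# Hypothetical confidence threshold: treat low-scoring answers as "no answer".
MIN_SCORE = 0.5
accepted = span if answer["score"] >= MIN_SCORE else None
print(accepted)
```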

3. Named Entity Recognition


Named Entity Recognition (NER) identifies the words in a text that name persons, organizations, locations, and so on, and classifies them by category. The input is a sentence, and the model returns each named entity along with its category, a confidence score, and its character offsets in the text.

In [8]:
ner_classifier = pipeline(
    model="dslim/bert-base-NER-uncased", aggregation_strategy="simple"
)
sentence = "My name is Srisowmya. I enjoy while eating food like Biryani, noodles.And I like to travel places which has grennary like Araku"
entity = ner_classifier(sentence)
print(entity)

Some weights of the model checkpoint at dslim/bert-base-NER-uncased were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER', 'score': 0.92290306, 'word': 'srisowmya', 'start': 11, 'end': 20}, {'entity_group': 'MISC', 'score': 0.70935935, 'word': 'biryani', 'start': 53, 'end': 60}, {'entity_group': 'LOC', 'score': 0.8874147, 'word': 'araku', 'start': 122, 'end': 127}]
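
Because the model is uncased, the `word` field is lowercased. The offsets still point into the original sentence, so the original casing can be recovered by slicing. The sketch below does that with the output recorded above and groups the entities by category; the grouping is an illustrative choice.

```python
from collections import defaultdict

# Sentence and entities copied from the NER cell above.
sentence = "My name is Srisowmya. I enjoy while eating food like Biryani, noodles.And I like to travel places which has grennary like Araku"
entities = [
    {'entity_group': 'PER', 'score': 0.92290306, 'word': 'srisowmya', 'start': 11, 'end': 20},
    {'entity_group': 'MISC', 'score': 0.70935935, 'word': 'biryani', 'start': 53, 'end': 60},
    {'entity_group': 'LOC', 'score': 0.8874147, 'word': 'araku', 'start': 122, 'end': 127},
]

# Group the original-cased surface forms by entity category.
by_label = defaultdict(list)
for ent in entities:
    surface = sentence[ent["start"]:ent["end"]]
    by_label[ent["entity_group"]].append(surface)

print(dict(by_label))
```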


4. Part-of-Speech Tagging


Part-of-speech (PoS) tagging labels each word in a text with its grammatical category, such as noun, pronoun, or verb. The model returns the tagged words along with their probability scores and character offsets.

In [9]:
pos_tagger = pipeline(
    model="vblagoje/bert-english-uncased-finetuned-pos",
    aggregation_strategy="simple",
)
pos_tagger("I am Software Engineer and I live in Hyderabad")

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PRON',
  'score': 0.9995009,
  'word': 'i',
  'start': 0,
  'end': 1},
 {'entity_group': 'AUX',
  'score': 0.99804807,
  'word': 'am',
  'start': 2,
  'end': 4},
 {'entity_group': 'NOUN',
  'score': 0.98996794,
  'word': 'software engineer',
  'start': 5,
  'end': 22},
 {'entity_group': 'CCONJ',
  'score': 0.9992336,
  'word': 'and',
  'start': 23,
  'end': 26},
 {'entity_group': 'PRON',
  'score': 0.9994727,
  'word': 'i',
  'start': 27,
  'end': 28},
 {'entity_group': 'VERB',
  'score': 0.99846905,
  'word': 'live',
  'start': 29,
  'end': 33},
 {'entity_group': 'ADP',
  'score': 0.99945444,
  'word': 'in',
  'start': 34,
  'end': 36},
 {'entity_group': 'PROPN',
  'score': 0.9985177,
  'word': 'hyderabad',
  'start': 37,
  'end': 46}]
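
A tagger's output is often filtered by tag. As a rough sketch, the snippet below pairs each tag with the original-cased text and keeps only the nouns and proper nouns; only the fields it needs are copied from the output above.

```python
# Sentence from the PoS cell above; tags reduced to the fields used here.
sentence = "I am Software Engineer and I live in Hyderabad"
tags = [
    {'entity_group': 'PRON', 'start': 0, 'end': 1},
    {'entity_group': 'AUX', 'start': 2, 'end': 4},
    {'entity_group': 'NOUN', 'start': 5, 'end': 22},
    {'entity_group': 'CCONJ', 'start': 23, 'end': 26},
    {'entity_group': 'PRON', 'start': 27, 'end': 28},
    {'entity_group': 'VERB', 'start': 29, 'end': 33},
    {'entity_group': 'ADP', 'start': 34, 'end': 36},
    {'entity_group': 'PROPN', 'start': 37, 'end': 46},
]

# Pair each tag with the text it covers, sliced from the original sentence.
pairs = [(sentence[t["start"]:t["end"]], t["entity_group"]) for t in tags]

# Keep only common and proper nouns.
nouns = [word for word, tag in pairs if tag in {"NOUN", "PROPN"}]
print(nouns)
```

Note that `aggregation_strategy="simple"` merged the adjacent NOUN tokens, which is why "Software Engineer" appears as one entry.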

5. Text Classification


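Sentiment analysis is one form of text classification. Here the model labels each input as POSITIVE or NEGATIVE based on its tone and returns a confidence score for that label.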

In [10]:
text_classifier = pipeline(
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
text_classifier("This movie is horrible!")

[{'label': 'NEGATIVE', 'score': 0.9997865557670593}]

In [11]:
text_classifier("I like Biryani")

[{'label': 'POSITIVE', 'score': 0.999057948589325}]
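
The `score` is a probability. For a two-label classifier like this one, it is typically obtained by applying a softmax to the model's raw logits and reporting the largest value. The sketch below shows that step in plain Python; the logit values are invented for illustration and are not the model's actual outputs.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
logits = [4.2, -3.5]  # hypothetical logits for a clearly negative sentence

probs = softmax(logits)
best = max(range(len(labels)), key=probs.__getitem__)
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```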

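6. Text Generation


Here the model continues a prompt. With `do_sample=False` it decodes greedily, always picking the most probable next token.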

In [12]:
text_generator = pipeline(model="gpt2")
text_generator("I am working as a software engineer ", do_sample=False)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am working as a software engineer \xa0and I am working on a project that will allow me to create a new version of the game. I am also working on a new game that will allow me to create a new version of the game.'}]
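
The repetition in this output is characteristic of greedy decoding: always choosing the most probable next token can lock the model into a loop. The toy sketch below illustrates the effect with a hand-written next-token table. The table is entirely hypothetical and conditions only on the previous token, whereas GPT-2 conditions on the whole preceding context.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
next_token_probs = {
    "I": {"am": 0.9, "was": 0.1},
    "am": {"working": 0.6, "also": 0.4},
    "working": {"on": 0.7, "as": 0.3},
    "on": {"a": 0.8, "the": 0.2},
    "a": {"game": 0.55, "project": 0.45},
    "game": {".": 0.6, "and": 0.4},
    ".": {"I": 0.9, "The": 0.1},
}

def greedy_decode(start, steps):
    """Extend `start` by up to `steps` tokens, always taking the most likely one."""
    tokens = [start]
    for _ in range(steps):
        choices = next_token_probs.get(tokens[-1])
        if not choices:
            break  # no known continuation for this token
        tokens.append(max(choices, key=choices.get))
    return tokens

tokens = greedy_decode("I", 13)
print(" ".join(tokens))
```

Sampling (`do_sample=True`) would occasionally pick a lower-probability token and so break out of the cycle, at the cost of non-deterministic output.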

7. Text Translation


Here we translate text from one language to another, using English to French as the example.

In [13]:
en_fr_translator = pipeline("translation_en_to_fr", model="t5-small")
en_fr_translator("Hi, How are you?")

[{'translation_text': 'Bonjour, Comment êtes-vous ?'}]
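
The task name `translation_en_to_fr` encodes the source and target languages. T5 checkpoints are conventionally prompted with a text prefix such as "translate English to French: ". The sketch below shows how such a prefix could be derived from the task name; `t5_prefix` and the `LANGUAGE_NAMES` table are hypothetical helpers written for illustration, not the library's implementation.

```python
# Hypothetical mapping from language codes to the names used in T5-style prefixes.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German", "ro": "Romanian"}

def t5_prefix(task: str) -> str:
    """Build a T5-style prefix from a task name like 'translation_en_to_fr'."""
    parts = task.split("_")
    if len(parts) != 4 or parts[0] != "translation" or parts[2] != "to":
        raise ValueError(f"Expected 'translation_XX_to_YY', got {task!r}")
    src, tgt = parts[1], parts[3]
    return f"translate {LANGUAGE_NAMES[src]} to {LANGUAGE_NAMES[tgt]}: "

prompt = t5_prefix("translation_en_to_fr") + "Hi, How are you?"
print(prompt)
```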