The Transformers library, developed by Hugging Face, provides a common interface for loading, fine-tuning, and running transformer-based models for natural language processing (NLP) and other machine learning tasks. Its key capabilities include:

1. **Pre-trained Models**: Transformers provides access to a wide range of pre-trained transformer-based models. These models are trained on massive datasets and have achieved state-of-the-art performance on various NLP tasks. You can easily load these pre-trained models and use them for tasks like text classification, sentiment analysis, named entity recognition, translation, text generation, and more.

2. **Fine-tuning**: Transformers allows you to fine-tune pre-trained models on your own datasets. This is particularly useful when you have a specific NLP task that you want to solve, and you can adapt a pre-trained model to perform well on your task with relatively little data and effort.

3. **Tokenization**: The library includes tokenizers for various transformer-based models. Tokenization is a crucial step in preparing text data for input into these models, and Transformers makes it easy to tokenize text efficiently.

4. **Inference and Generation**: You can use Transformers to generate text and perform inference with your fine-tuned models. This is useful for tasks like chatbots, text completion, and text summarization.

5. **Model Architecture**: Transformers exposes each model's configuration and building blocks as regular Python objects. You can inspect the layers, weights, and parameters of a loaded model, which is valuable for researchers and developers who want to understand and customize model behavior.

6. **Model Hub**: Transformers is integrated with Hugging Face's Model Hub, where you can find a wide variety of pre-trained models, model checkpoints, and resources shared by the NLP community. It makes it easy to discover, download, and use models for different tasks.

7. **Datasets**: Transformers can be used in conjunction with Hugging Face's Datasets library, which offers a vast collection of NLP datasets for various tasks. You can seamlessly load and preprocess these datasets for training and evaluation.

8. **Model Deployment**: While Transformers primarily focuses on model development and experimentation, it can also be used as a starting point for deploying NLP models in production systems.

9. **Multi-lingual Support**: Many of the pre-trained models available in Transformers support multiple languages, making it valuable for multilingual applications.

In summary, Transformers is a comprehensive library that simplifies working with transformer-based models in NLP and AI tasks. It abstracts many of the complexities associated with these models, making it accessible to a wide range of developers and researchers for various natural language processing tasks.
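To make the tokenization step (item 3 above) more concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the idea behind the WordPiece-style tokenizers used by BERT-family models. The tiny vocabulary and the `wordpiece_tokenize` helper are hypothetical and exist only for illustration; real tokenizers ship with each model and also handle casing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not taken from any real model.
vocab = {"hug", "##ging", "face", "token", "##ization"}
tokens = [
    piece
    for word in "hugging face tokenization".split()
    for piece in wordpiece_tokenize(word, vocab)
]
print(tokens)
```

With a real model, the matching tokenizer is loaded alongside the model, so this splitting happens automatically inside the library.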

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Hugging Face is a company and open-source community that specializes in natural language processing (NLP) and artificial intelligence (AI). They are known for their contributions to the field of NLP, particularly for developing and maintaining popular open-source libraries and tools for working with various NLP models, including transformers.

Here are some key aspects of Hugging Face:

1. **Transformers Library**: Hugging Face is perhaps best known for its "Transformers" library, which provides a wide range of pre-trained NLP models, including BERT, GPT-2, RoBERTa, and many others. These models can be used for various NLP tasks such as text classification, machine translation, sentiment analysis, and more. The library also includes easy-to-use interfaces for fine-tuning these models on custom datasets.

2. **Datasets**: Hugging Face maintains a repository of NLP datasets that are commonly used for training and evaluating NLP models. These datasets are available for free and can be easily accessed and integrated into your NLP projects.

3. **Model Hub**: Hugging Face hosts a Model Hub where you can find pre-trained models, model checkpoints, and other resources. It's a central place for the NLP community to share and discover models.

4. **Transformers Ecosystem**: The Transformers library is a core part of Hugging Face's ecosystem, but they also offer other libraries and tools for tasks like text generation, model deployment, and more.

5. **Community and Collaboration**: Hugging Face has a strong open-source community that actively contributes to the development of their libraries and tools. They encourage collaboration and provide resources for researchers, developers, and practitioners in the NLP and AI fields.

Hugging Face has played a significant role in democratizing access to powerful NLP models and has made it easier for developers and researchers to work with state-of-the-art NLP techniques. Their tools and resources have been widely adopted in the AI and NLP communities.

In [None]:
# https://huggingface.co/

!pip install transformers
!pip install datasets
!pip install sentencepiece



In [None]:
import datasets
import huggingface_hub
import matplotlib.pyplot as plt
import transformers

In [None]:
text = """
Hugging Face is a company and open-source community that specializes in natural language processing (NLP) and artificial intelligence (AI). They are known for their contributions to the field of NLP, particularly for developing and maintaining popular open-source libraries and tools for working with various NLP models, including transformers.

"""

# Text Classification

Text classification is a natural language processing (NLP) task that assigns one or more predefined categories (labels) to a piece of text based on its content or context. It is a fundamental task in NLP with a wide range of applications, including:

1. **Sentiment Analysis**: Determining the emotional tone of a piece of text, such as whether a movie review is positive, negative, or neutral.

2. **Topic Classification**: Categorizing news articles, blog posts, or documents into predefined topics, such as sports, politics, technology, or entertainment.

3. **Spam Detection**: Identifying whether an email or message is spam.

4. **Language Detection**: Determining the language in which a text is written.

5. **Intent Recognition**: Identifying the intent behind a user's query or message, such as a request for information, a complaint, or a greeting.

6. **Text Categorization**: Organizing documents in a digital library or content repository for easy retrieval.

7. **Authorship Attribution**: Determining the likely author of a text based on writing style, as used in forensic linguistics and literary analysis.

8. **Medical Diagnosis**: Classifying medical records or clinical notes into categories like diseases, symptoms, or treatments.

In [None]:
from transformers import pipeline

# Build a text-classification pipeline; no model is given, so the library picks its default
classifier = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
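The warning above suggests naming the model and revision explicitly. A minimal sketch of how that could look follows; the model ids and revisions are copied from the default-model log messages in this notebook, while the `PINNED_MODELS` mapping and the `build_pinned_pipeline` helper are hypothetical names introduced here for illustration.

```python
# Model ids and revisions as reported by the default-model messages in this notebook.
PINNED_MODELS = {
    "text-classification": ("distilbert-base-uncased-finetuned-sst-2-english", "af0f99b"),
    "ner": ("dbmdz/bert-large-cased-finetuned-conll03-english", "f2482bf"),
    "summarization": ("sshleifer/distilbart-cnn-12-6", "a4f8f3e"),
    "question-answering": ("distilbert-base-cased-distilled-squad", "626af31"),
}


def build_pinned_pipeline(task, **kwargs):
    """Create a pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily so the mapping above can be inspected without loading the library.
    from transformers import pipeline

    model, revision = PINNED_MODELS[task]
    return pipeline(task, model=model, revision=revision, **kwargs)


# Example usage (downloads the model on first use):
# classifier = build_pinned_pipeline("text-classification")
```

Pinning both values keeps results reproducible if the library's default model for a task changes in a later release.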


The pipeline() function in the Hugging Face Transformers library builds a ready-to-use inference pipeline for a common NLP task. It bundles tokenization, the model forward pass, and post-processing behind a single call. The general form is:

from transformers import pipeline

# Create a pipeline for a specific task (task_name, model_name and tokenizer_name are placeholders)
nlp_task = pipeline(task_name, model=model_name, tokenizer=tokenizer_name)

task_name (str): The task to perform, such as "text-classification", "sentiment-analysis" (an alias of "text-classification"), "question-answering", "ner" (named entity recognition), "summarization", or "translation". The available tasks depend on the pipelines supported by the installed version of the library.

model_name (str, optional): The name or path of the pre-trained model to use for the task. If not provided, the default model for the selected task is used.

tokenizer_name (str, optional): The name or path of the tokenizer associated with the model. If not provided, the tokenizer that matches the model is loaded.

Once you've created the pipeline, you can use it to perform the task on input text. For example, with a sentiment analysis pipeline you can analyze the sentiment of a text by passing it to the pipeline:

result = nlp_task("This is a positive example.")
print(result)

The result contains the output of the task, and its structure varies by task. For sentiment analysis with the default model, it looks something like this (the score shown is illustrative):
[{'label': 'POSITIVE', 'score': 0.987}]

The label names come from the model's configuration; a model that defines no label names returns generic ones such as 'LABEL_0' and 'LABEL_1'.
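The `score` in a classification result is a probability derived from the model's raw output scores (logits). The sketch below shows that post-processing step in plain Python for a single-label, two-class model: a softmax over the logits followed by picking the most likely class. The logits and the `to_prediction` helper are hypothetical; a real pipeline does the equivalent internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one input; the label names follow the sentiment model used in this notebook.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = to_prediction([-2.0, 3.0], id2label)
print(prediction)
```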

Example
-------
from transformers import pipeline

# Load a model fine-tuned for NER; a base checkpoint such as distilbert-base-uncased has no trained NER head
pipe = pipeline(task="ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

# Run the pipeline on a single input string
predictions = pipe("President Joe Biden met with Chinese President Xi Jinping in Beijing on Monday.")

# Each prediction is a dict describing one entity span
for prediction in predictions:
    print(prediction)

Output
------
Each printed dict has the keys entity_group, score, word, start, and end, where start and end are character offsets into the input. This model uses the CoNLL-2003 label set (PER, ORG, LOC, MISC), so person names are reported as PER and locations as LOC. The label set has no date category, so "Monday" is not expected to be tagged as a date.

In [None]:
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9981080293655396}]

In [None]:
# Use a pipeline as a high-level helper, this time with an explicitly named summarization model
from transformers import pipeline
pipe = pipeline("summarization", model="human-centered-summarization/financial-summarization-pegasus")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at human-centered-summarization/financial-summarization-pegasus and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
pipe(text)

Your max_length is set to 64, but your input_length is only 61. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=30)


[{'summary_text': 'Hugging Face is a company and open-source community that specializes in natural language processing.'}]
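The warning above appears because the requested maximum summary length is longer than the input itself. One way to avoid it is to derive `max_length` from the input length. The sketch below uses a hypothetical heuristic (half the input length, with a lower bound); the `suggest_max_length` helper and its default values are assumptions, not part of the library.

```python
def suggest_max_length(input_length, ratio=0.5, floor=10):
    """Pick a summary max_length (in tokens) as a fraction of the input length."""
    return max(floor, int(input_length * ratio))


# 61 is the input length reported in the warning above.
max_length = suggest_max_length(61)
print(max_length)

# The value can then be passed to the pipeline call, for example:
# pipe(text, max_length=max_length)
```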


# Named Entity Recognition (NER)

Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

NER is a useful task for a variety of applications, such as:

1. **Information extraction**: NER can be used to extract key information from text, such as the names of people, places, and organizations. This information can then be used to populate databases, create knowledge graphs, and power other applications.

2. **Question answering**: NER can improve question answering systems by identifying the named entities in a question and using them to search for the answer.

3. **Machine translation**: NER can improve machine translation by identifying the named entities in a sentence so that they are translated accurately.

4. **Text summarization**: NER can improve text summarization by identifying the most important named entities in a text so that the summary keeps them.

NER is a challenging task because named entities can be expressed in a variety of ways. For example, the person name "John Doe" can also be expressed as "Mr. Doe" or "J. Doe." Additionally, named entities can be nested within other named entities. For example, the location name "New York" is nested within the organization name "New York University."

There are a variety of different approaches to NER, but most modern NER systems are based on deep learning. Deep learning NER systems are able to achieve high accuracy by learning to identify the patterns that are associated with different types of named entities.

Here is an example of an NER system in action:

Input text:

President Joe Biden met with Chinese President Xi Jinping in Beijing on Monday.

NER output:

PERSON, Joe Biden
PERSON, Xi Jinping
LOCATION, Beijing
DATE, Monday

NER systems are used in a variety of products and services, such as search engines, social media platforms, and customer relationship management (CRM) systems, and they are a building block for many other NLP applications.
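Token-level NER models usually predict one tag per token, often in the BIO scheme (B- starts an entity, I- continues it, O is outside any entity), and a post-processing step merges those tags into entity spans like the ones listed above. The following is a minimal sketch of that merging step. The tags are written by hand for the example sentence, not produced by a model, and `merge_bio` is a hypothetical helper.

```python
def merge_bio(tokens, tags):
    """Merge per-token BIO tags into a list of entity spans."""
    entities = []
    current = None  # entity being built, or None when outside an entity
    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag != "O" else None
        continues = (
            tag.startswith("I-")
            and current is not None
            and current["entity_group"] == label
        )
        if continues:
            current["word"] += " " + token
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current is not None:
            entities.append(current)
        current = {"entity_group": label, "word": token} if label else None
    if current is not None:
        entities.append(current)
    return entities


tokens = ["President", "Joe", "Biden", "met", "with", "Chinese", "President",
          "Xi", "Jinping", "in", "Beijing", "on", "Monday", "."]
# Hand-written tags for illustration only.
tags = ["O", "B-PERSON", "I-PERSON", "O", "O", "O", "O",
        "B-PERSON", "I-PERSON", "O", "B-LOCATION", "O", "B-DATE", "O"]

entities = merge_bio(tokens, tags)
for entity in entities:
    print(entity["entity_group"], entity["word"], sep=", ")
```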

In [None]:
text = "Modi is prime minister of India and he speaks hindi."

In [None]:
ner = pipeline('ner', aggregation_strategy = 'simple')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
import pandas as pd
out = ner(text)
print(pd.DataFrame(out))

  entity_group     score   word  start  end
0          PER  0.994760   Modi      0    4
1          LOC  0.998302  India     26   31
2         MISC  0.802459  hindi     46   51


In [None]:
out

[{'entity_group': 'PER',
  'score': 0.9947597,
  'word': 'Modi',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.99830186,
  'word': 'India',
  'start': 26,
  'end': 31},
 {'entity_group': 'MISC',
  'score': 0.80245876,
  'word': 'hindi',
  'start': 46,
  'end': 51}]
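The `start` and `end` values in this output are character offsets into the input text, and `score` can be used to drop low-confidence entities. The sketch below works on a copy of the output shown above (with the scores as plain floats); `group_entities` and its `min_score` parameter are hypothetical names for illustration.

```python
from collections import defaultdict

text = "Modi is prime minister of India and he speaks hindi."

# Copied from the pipeline output shown above.
out = [
    {"entity_group": "PER", "score": 0.9947597, "word": "Modi", "start": 0, "end": 4},
    {"entity_group": "LOC", "score": 0.99830186, "word": "India", "start": 26, "end": 31},
    {"entity_group": "MISC", "score": 0.80245876, "word": "hindi", "start": 46, "end": 51},
]


def group_entities(entities, min_score=0.0):
    """Group entity words by type, keeping only those at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# For this output, slicing the input with start/end recovers each entity's word.
for entity in out:
    assert text[entity["start"]:entity["end"]] == entity["word"]

print(group_entities(out))
print(group_entities(out, min_score=0.9))
```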

# Summarization

In [None]:
text = """
Elon Reeve Musk (/ˈiːlɒn/ EE-lon; born June 28, 1971) is a business magnate and investor. Musk is the founder, chairman, CEO and chief technology officer of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; founder of the Boring Company; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is the wealthiest person in the world, with an estimated net worth of US$226 billion as of September 2023, according to the Bloomberg Billionaires Index, and $249 billion according to Forbes, primarily from his ownership stakes in both Tesla and SpaceX.[4][5][6]

Musk was born in Pretoria, South Africa, and briefly attended the University of Pretoria before immigrating to Canada at age 18, acquiring citizenship through his Canadian-born mother. Two years later, he matriculated at Queen's University in Kingston, Ontario. Musk later transferred to the University of Pennsylvania, and received bachelor's degrees in economics and physics there. He moved to California in 1995 to attend Stanford University. However, Musk dropped out after two days and, with his brother Kimbal, co-founded online city guide software company Zip2. The startup was acquired by Compaq for $307 million in 1999, and with $12 million of the money he made, that same year Musk co-founded X.com, a direct bank. X.com merged with Confinity in 2000 to form PayPal.

In 2002, eBay acquired PayPal for $1.5 billion, and that same year, with $100 million of the money he made, Musk founded SpaceX, a spaceflight services company. In 2004, he became an early investor in electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.). He became its chairman and product architect, assuming the position of CEO in 2008. In 2006, Musk helped create SolarCity, a solar-energy company that was acquired by Tesla in 2016 and became Tesla Energy. In 2013, he proposed a hyperloop high-speed vactrain transportation system. In 2015, he co-founded OpenAI, a nonprofit artificial intelligence research company. The following year, Musk co-founded Neuralink—a neurotechnology company developing brain–computer interfaces—and the Boring Company, a tunnel construction company. In 2022, he acquired Twitter for $44 billion. He subsequently merged the company into newly created X Corp. and rebranded the service as X the following year. In March 2023, he founded xAI, an artificial-intelligence company.

Musk has expressed views that have made him a polarizing figure. He has been criticized for making unscientific and misleading statements, including that of spreading COVID-19 misinformation, and promoting conspiracy theories. His Twitter ownership has been similarly controversial, including laying off a large number of employees, an increase in hate speech on the platform and changes to Twitter Blue verification were criticized. In 2018, the U.S. Securities and Exchange Commission (SEC) sued him for falsely tweeting that he had secured funding for a private takeover of Tesla. To settle the case, Musk stepped down as the chairman of Tesla and paid a $20 million fine.

"""

In [None]:
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
outputs = summarizer(text, clean_up_tokenization_spaces = True, max_length = 60)
print(outputs[0]['summary_text'])

 Elon Musk is the founder, chairman, CEO and chief technology officer of SpaceX. He is the wealthiest person in the world, with an estimated net worth of US$226 billion as of September 2023. In 2002, eBay acquired PayPal for $1.5 billion, with $100
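The summary above stops mid-sentence because generation was cut off at `max_length = 60` tokens. If a clean ending matters more than length, one option is to trim the text back to the last complete sentence. This is a minimal sketch; `trim_to_last_sentence` is a hypothetical helper, and its simple punctuation rule would also treat an abbreviation such as "Inc." as a sentence end.

```python
import re


def trim_to_last_sentence(summary):
    """Drop any trailing fragment after the last sentence-ending punctuation mark."""
    summary = summary.strip()
    # A sentence end is ., ! or ? followed by whitespace or the end of the string,
    # so the period inside "$1.5" does not count.
    ends = [match.end() for match in re.finditer(r"[.!?](?=\s|$)", summary)]
    return summary[: ends[-1]] if ends else summary


# The summary printed above, copied verbatim.
raw = (
    " Elon Musk is the founder, chairman, CEO and chief technology officer of SpaceX. "
    "He is the wealthiest person in the world, with an estimated net worth of US$226 "
    "billion as of September 2023. In 2002, eBay acquired PayPal for $1.5 billion, with $100"
)
trimmed = trim_to_last_sentence(raw)
print(trimmed)
```

Raising `max_length` in the summarizer call is the other obvious option, at the cost of a longer summary.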



# Question Answering

In [None]:
text = "Modi is prime minister of India and he speaks hindi."

In [None]:
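# Build an extractive question-answering pipeline (default model, since none is given)
reader = pipeline('question-answering')
question = "Who is Modi?"
# The reader returns the answer span found in the context, with its score and character offsets
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])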

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



      score  start  end                   answer
0  0.830831      8   31  prime minister of India


In [None]:
outputs

{'score': 0.8308310508728027,
 'start': 8,
 'end': 31,
 'answer': 'prime minister of India'}
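An extractive question-answering model does not generate text. It scores every token of the context as a possible start and as a possible end of the answer, and the reader returns the best-scoring span. The sketch below shows that selection step in plain Python. The whitespace tokens and the scores are hypothetical, and `best_span` is an illustrative helper rather than the library's implementation.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) token indices maximizing start score + end score, with start <= end."""
    best = (float("-inf"), 0, 0)
    for i, start_score in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


context = "Modi is prime minister of India and he speaks hindi."
tokens = context.split()

# Hypothetical per-token scores, chosen by hand for illustration.
start_scores = [0.1, 0.0, 2.5, 0.3, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0]
end_scores = [0.0, 0.0, 0.1, 0.4, 0.0, 2.8, 0.0, 0.0, 0.0, 0.3]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

In the real output above, `start` and `end` are character offsets into the context rather than token indices, so `text[8:31]` gives the same answer string.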