<a href="https://colab.research.google.com/github/sanketrk/Clustering-of-Network-Complaint-Data/blob/master/3_Hands_on_Perform_different_tasks_with_state_of_the_art_(SOTA)_transformer_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP applications with Transformers

## Leveraging Transformers Pipelines

__Here you will leverage various transformer pipelines to solve real-world problems__

Pipelines encapsulate the end-to-end workflow of an NLP task:

 1. *Tokenization*: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
 2. *Inference*: Map every token into a more meaningful representation.
 3. *Decoding*: Use the above representation to generate and/or extract the final output for the underlying task.
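
The three stages can be illustrated with a deliberately tiny, library-free sketch. Everything in it is a toy stand-in: the whitespace tokenizer, the hand-written `TOY_LEXICON`, and the function names are assumptions made for illustration, and none of it reflects how `transformers` implements these steps internally.

```python
# Toy illustration of the tokenization -> inference -> decoding flow.
# NOTE: hypothetical stand-ins only; real pipelines use learned subword
# tokenizers and neural networks, not word lists.

TOY_LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "broken": -1.0}


def tokenize(text):
    """Stage 1: split raw text into lowercase word tokens."""
    return [word.strip(".,!?").lower() for word in text.split()]


def infer(tokens):
    """Stage 2: map every token to a numeric representation (a toy score)."""
    return [TOY_LEXICON.get(token, 0.0) for token in tokens]


def decode(scores):
    """Stage 3: turn the representation into a task-level output."""
    total = sum(scores)
    return {"label": "POSITIVE" if total >= 0 else "NEGATIVE", "score_sum": total}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does behind one call."""
    return decode(infer(tokenize(text)))


print(toy_pipeline("The blender arrived broken. Terrible!"))
```

A real pipeline hides the same three steps behind a single call, which is why the rest of this notebook only ever calls the pipeline object directly.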

The overall API is exposed to the end user through the `pipeline()` function, with the following
structure:

```python
from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using custom model/tokenizer as str
pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')
```


These models are already fine-tuned for specific tasks and are available on the Hugging Face Hub.

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

### Install dependencies

In [None]:
!pip install -q transformers

In [None]:
from transformers import pipeline
import pandas as pd

In [None]:
pd.set_option('display.max_colwidth', None)

# 1. Customer Feedback Analytics

Scenario: You are working for a retail company that stocks various products. You will use transformer LLMs to extract sentiment and emotion from customer reviews.

## Example Demo:

In [None]:
sample_reviews = [
    "The PureTaste kitchen blender is far from efficient. It struggles with even soft fruits and often overheats. I had to return it within a week.",
    "The EverGreen plant fertilizer is magic in a bottle! Within weeks of using it, my plants have shown significant growth and look healthier than ever. A thumbs up from an avid gardener.",
    "Purchased the NovaTech laptop last month, and it's been a smooth experience. The processing speed is fast, and the graphics are stunning. Best decision for my work and gaming needs.",
    "The StarGlow nightlight looked promising, but it's far dimmer than advertised. Doesn't serve the purpose at all. Quite disappointed."
]

sample_review_df = pd.DataFrame(sample_reviews, columns=['review'])
sample_review_df

### Load a sentiment analysis transformer model

In [None]:
sentiment_model = pipeline('sentiment-analysis', device=0)

### Perform sentiment analysis for given reviews

In [None]:
reviews = sample_review_df['review'].values
reviews

In [None]:
reviews[0]

In [None]:
sentiment_model(reviews[0])

In [None]:
sentiment_model(reviews[0])[0]['label']

In [None]:
sentiments = []

for review in reviews:
  sentiments.append(sentiment_model(review)[0]['label'])

sentiments
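
The loop above calls the model once per review. Text-classification pipelines also accept a Python list of strings and return one result dict per input, so a small helper can label a whole column in a single call. The sketch below rests on a few assumptions: `label_texts` is a hypothetical helper name, and `fake_sentiment_model` is a made-up stand-in that only mimics the pipeline's output shape so the block runs without downloading a model.

```python
def label_texts(model, texts):
    """Return the top label for each text using a single batched call.

    `model` is any callable that takes a list of strings and returns a list
    of {"label": ..., "score": ...} dicts, one per input.
    """
    # Convert to a plain list, the batch input type the pipeline documents
    # (`.values` on a DataFrame column yields a NumPy array instead).
    results = model(list(texts))
    return [result["label"] for result in results]


def fake_sentiment_model(texts):
    """Hypothetical stand-in for a real pipeline; NOT a real model."""
    return [
        {"label": "NEGATIVE" if "disappointed" in text.lower() else "POSITIVE",
         "score": 1.0}
        for text in texts
    ]


print(label_texts(fake_sentiment_model, ["Quite disappointed.", "Works well."]))
```

With the real pipeline, the call would look like `label_texts(sentiment_model, reviews)`, and the same helper works for the emotion model.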

In [None]:
sample_review_df['sentiment'] = sentiments
sample_review_df

### Load an emotion detector transformer model

In [None]:
emotion_model = pipeline('sentiment-analysis',
                         model='SamLowe/roberta-base-go_emotions',
                         device=0)

### Perform emotion analysis for given reviews

In [None]:
emotions = []

for review in reviews:
  emotions.append(emotion_model(review)[0]['label'])

emotions

In [None]:
sample_review_df['emotion'] = emotions
sample_review_df

### Plot some basic visuals

In [None]:
sample_review_df['sentiment'].value_counts().plot(kind='bar');

In [None]:
sample_review_df['emotion'].value_counts().plot(kind='bar');

## Your Turn:

Given more customer reviews, repeat the analysis shown above.

1. For each review get the sentiment and emotion using the same models
2. Store the emotion and sentiment in separate columns in the dataframe
3. Show the dataframe with all the columns
4. Create basic visuals

In [None]:
reviews = [
    "I had high hopes for the UltraClean vacuum cleaner, but it hardly picks up any dirt. Also, the battery life is abysmal. Not worth the price.",
    "The SunBurst solar charger was advertised as being highly efficient, but it barely charges my devices. Even under direct sunlight, it takes forever. A complete letdown.",
    "Bought the AirLite drone, and it's not user-friendly at all. The controls are confusing, and the camera quality is mediocre at best. Wish I had gone for another brand.",
    "Bought the LumiGlow skincare serum a month ago, and the results are visible! My skin feels rejuvenated and the fine lines have reduced. Definitely worth the investment.",
    "The TerraFit hiking boots are outstanding. They're not only durable but also provide excellent support during long treks. My feet have never felt so comfortable during hikes.",
    "I was skeptical about the QuickCook microwave oven initially, but it has surpassed all my expectations. The multi-function features and quick heating technology are truly innovative. A must-have in every kitchen!",
    "The Prestige juicer is a disappointment. It constantly gets jammed, and cleaning it is a nightmare. I regret this purchase.",
    "I recently purchased the AquaBliss water filter, and I'm genuinely impressed! Not only was the installation straightforward, but the water tastes remarkably fresh. Highly recommended for anyone wanting pure drinking water.",
    "The SoundWave headphones are a game-changer. They fit comfortably, and the noise-cancellation feature is top-notch. I can finally enjoy my music without any interruptions.",
    "The GlamorEyes mascara is terrible. It clumps easily and doesn't provide the volume it promises. Definitely won't be repurchasing.",
    "The StreamLine smartwatch is a fantastic addition to my tech collection. It tracks my health metrics accurately, and the battery life is commendable. It's stylish and functional.",
    "I've been using the SilkSoft hair conditioner for the past two weeks, and my hair has never felt better. It's silky, tangle-free, and smells divine. Highly recommend for those with frizzy hair.",
    "The LuxeShine nail polish chips easily and doesn't last more than a day. The colors are lovely, but the quality leaves much to be desired.",
    "Tried the PowerPulse wireless earbuds, and they keep disconnecting randomly. The sound quality is also underwhelming. Not what I expected for the price.",
    "The ZoomDrive USB stick is both compact and efficient. Transferring files has never been faster, and its durable design ensures I don't have to worry about data loss.",
    "Opted for the FreshBreeze air purifier, and it has made a noticeable difference in my home. Allergies are at bay, and the air feels cleaner and fresher.",
    "The ClearVue window cleaner left streaks on my windows. I had to redo the cleaning with another product. It's more hassle than it's worth.",
    "The MegaGrill BBQ set looked sturdy online, but in reality, it's flimsy and poorly constructed. Cooking on it has been a challenge.",
    "The UrbanWalk sneakers are not only trendy but also incredibly comfortable. I've worn them for both workouts and casual outings, and they've been perfect for both.",
    "The SwiftWrite pen set is not worth the money. The ink smudges, and the pens run out way too quickly. I'll be looking for alternatives."
]

review_df = pd.DataFrame(reviews, columns=['review'])
review_df.head()

In [None]:
reviews = review_df['review'].values

sentiments = []
emotions = []

# 2. Medical Report Q&A and Summary

Scenario: You are working for a healthcare company and you need to quickly create a report summary for a brain cancer report and also test the feasibility of a chatbot.

## Your Turn:

Use the models mentioned below and try Q&A and Summarization

### Question Answering

Using the model and the context report provided below, ask the following questions and review the answers you get:

1. What is cancer?
2. What is brain cancer?
3. What are traditional methods to detect brain cancer?
4. What is Magnetic Resonance Imaging?
5. What is biopsy?
6. What can AI do for brain cancer?
7. How can deep learning be used for brain cancer?

In [None]:
nlp_qa = pipeline('question-answering',
                  model='deepset/roberta-base-squad2',
                  device=0)

In [None]:
context = """
Cancer Detection for the Brain: An Overview of Modern Methods and the Role of Computer Vision

Cancer, a group of diseases characterized by the uncontrolled growth and spread of abnormal cells, has long been a significant concern in medical science. Among the various types of cancer, brain cancer is particularly alarming due to its location in the central nervous system. Detecting brain cancer early is crucial, as it significantly improves the chances of successful treatment and patient survival. This essay will delve into the methods available for detecting brain cancer, emphasizing early detection techniques and the growing role of computer vision in this arena.

1. Traditional Detection Methods:
a. Magnetic Resonance Imaging (MRI): This is the most commonly used technique for diagnosing brain tumors. MRI uses powerful magnets and radio waves to produce detailed images of the brain. These images can help identify abnormal growths, which may then be biopsied for confirmation.

b. Computed Tomography (CT) Scan: Although not as detailed as MRI, CT scans are faster and can provide valuable information about the presence and location of tumors. They use X-rays to produce cross-sectional images of the brain.

c. Biopsy: A biopsy is the removal of a small amount of tissue for examination under a microscope. This is the most definitive way to diagnose cancer. Surgeons usually perform biopsies using a needle or during surgery.

2. Early Detection Techniques:
a. Advanced Imaging: Advanced MRI techniques, such as functional MRI (fMRI) and magnetic resonance spectroscopy (MRS), can detect early changes in the brain's functionality or chemistry, even before a tumor becomes apparent.

b. Liquid Biopsy: This is a non-invasive method where doctors test the blood for tiny fragments of DNA shed by cancer cells. While still in the research phase for brain cancer, it holds promise for early detection.

c. Neurological Exams: Regular neurological exams can catch subtle changes in brain function, such as coordination, memory, or reflexes, which might be indicative of a developing tumor.

3. Computer Vision and Brain Cancer Detection:
With the advancement of artificial intelligence (AI) and machine learning, computer vision has emerged as a potent tool in the medical field, especially in cancer detection.

a. Image Segmentation: Computer vision algorithms can segment MRI or CT scan images to highlight potential tumor regions accurately. This can be particularly useful in identifying small tumors that might be overlooked by the human eye.

b. Predictive Analysis: Using vast datasets, computer vision systems can predict the likelihood of a tumor being malignant or benign. This assists doctors in deciding whether a biopsy is necessary.

c. Automated Monitoring: By comparing current images with previous scans, computer vision can detect minute changes over time, enabling the identification of tumors at a very early stage.

d. Enhanced Visualization: Computer vision can enhance the visualization of scans, making it easier for radiologists to interpret the results. Techniques like deep learning can amplify the differences between healthy and cancerous tissues.

In conclusion, while traditional methods of brain cancer detection remain foundational, the integration of modern techniques, especially computer vision, is revolutionizing the early detection and diagnosis of this life-threatening disease. The synergy of human expertise and machine precision promises a brighter future for patients at risk of brain cancer. As research progresses and technology becomes more advanced, we can anticipate even more accurate and early detection methods, potentially saving countless lives.
"""

In [None]:
nlp_qa(context=context, question='What is cancer?')
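
A question-answering pipeline returns a dict with `answer`, `score`, `start`, and `end` keys for each question. To work through the whole question list, one option is a small loop that collects the answer and its confidence for every question. In this sketch, `ask_all` is a hypothetical helper and `fake_qa` is a made-up stand-in that mimics only the output shape, so the block runs without a model.

```python
def ask_all(qa, context, questions):
    """Ask every question against the same context and collect the results."""
    rows = []
    for question in questions:
        result = qa(question=question, context=context)
        rows.append({
            "question": question,
            "answer": result["answer"],
            "score": round(result["score"], 3),
        })
    return rows


def fake_qa(question, context):
    """Hypothetical stand-in; returns the first sentence as the 'answer'."""
    answer = context.strip().split(".")[0]
    return {"answer": answer, "score": 0.5, "start": 0, "end": len(answer)}


toy_context = "A biopsy is the removal of a small amount of tissue. It confirms a diagnosis."
for row in ask_all(fake_qa, toy_context, ["What is biopsy?"]):
    print(row)
```

With the real model, `ask_all(nlp_qa, context, questions)` would return one row per question; low scores are a hint that the answer span may not be reliable.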

### Summarization

Use the model mentioned below and summarize the brain cancer report from above

In [None]:
summarizer = pipeline('summarization',
                      model='philschmid/bart-large-cnn-samsum',
                      device=0)

In [None]:
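# Summarize the report; truncation guards against inputs longer than the model accepts
result = summarizer(context, truncation=True)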

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by sent_tokenize in newer NLTK releases

In [None]:
summary = result[0]['summary_text']
print('\n'.join(nltk.sent_tokenize(summary)))
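
Summarization models accept only a limited input length, so a report much longer than this one may need to be split before summarizing. The sketch below groups paragraphs into chunks under a word budget. It is a rough illustration under stated assumptions: `chunk_paragraphs` and `max_words` are hypothetical names, and word count is only a crude proxy for the model's token count.

```python
def chunk_paragraphs(text, max_words=300):
    """Group paragraphs into chunks of at most `max_words` words each.

    Word count is only a rough proxy for model tokens (an assumption of this
    sketch). A single paragraph longer than `max_words` becomes its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


toy_report = "one two three\n\nfour five\n\nsix seven eight nine"
print(chunk_paragraphs(toy_report, max_words=5))
```

Each chunk could then be passed to `summarizer` separately and the partial summaries joined into one.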

# 3. Table / Database Q & A

Scenario: You are working for a sports website that wants quick factual answers from a sports database table. Unfortunately, no one knows or wants to learn SQL.

Here you as an AI expert will try to use a transformer LLM to do table Q&A!

## Example Demo:

### Load the Table QA Transformer model

In [None]:
table_qa = pipeline('table-question-answering',
                    model='neulab/omnitab-large-finetuned-wtq',
                    device=0)

### Look at the sample data table we have shown

In [None]:
data = {
    "year": [1896, 1900, 1904, 2004, 2008, 2012],
    "city": ["athens", "paris", "st. louis", "athens", "beijing", "london"]
}
table = pd.DataFrame.from_dict(data)
table

### Ask questions based on the table

In [None]:
table_qa(table=table,
         query="In which year did beijing host the Olympic Games?")

## Your turn:

Given the following sample data table, ask the questions below and check the answers you get:


- Which country won the most medals in the table for olympics?
- Leading country with the most medals in olympics 1988?
- What is the max value for Participating Nations in the olympics?
- Which year had the maximum Participating Nations in the olympics?
- Host city and host country of 2020 olympics?
- Host country who hosted the most number of olympic games?
- How many olympic games were hosted by USA?

In [None]:
# Data for the last 10 Summer Olympics
data = {
    "Year": [2020, 2016, 2012, 2008, 2004, 2000, 1996, 1992, 1988, 1984],
    "Host City": ["Tokyo", "Rio de Janeiro", "London", "Beijing", "Athens", "Sydney", "Atlanta", "Barcelona", "Seoul", "Los Angeles"],
    "Host Country": ["Japan", "Brazil", "United Kingdom", "China", "Greece", "Australia", "USA", "Spain", "South Korea", "USA"],
    "Participating Nations": [205, 207, 204, 204, 201, 199, 197, 169, 159, 140],
    "Athletes": [11338, 11238, 10568, 10942, 10625, 10651, 10318, 9356, 8391, 6829],
    "Sports": [33, 28, 26, 28, 28, 28, 26, 25, 23, 21],
    "Leading Country": ["USA", "USA", "USA", "China", "USA", "USA", "USA", "Unified Team", "Soviet Union", "USA"],
    "Total Medals": [113, 121, 104, 100, 103, 97, 101, 112, 132, 174]
}

# Creating the dataframe
table = pd.DataFrame(data)
table

In [None]:
table_qa(table=table,
         query="Which country won the most medals in the table for olympics?")
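
Because table QA models can return wrong answers, it helps to have reference values computed directly from the data to compare against. The sketch below derives them in plain Python from the columns of the table above. The variable names are illustrative, and "most medals" is interpreted as the largest single-Games total, which is an assumption about what the question means.

```python
from collections import Counter

# Columns copied from the Olympics table above (only those needed here).
data = {
    "Year": [2020, 2016, 2012, 2008, 2004, 2000, 1996, 1992, 1988, 1984],
    "Host City": ["Tokyo", "Rio de Janeiro", "London", "Beijing", "Athens",
                  "Sydney", "Atlanta", "Barcelona", "Seoul", "Los Angeles"],
    "Host Country": ["Japan", "Brazil", "United Kingdom", "China", "Greece",
                     "Australia", "USA", "Spain", "South Korea", "USA"],
    "Participating Nations": [205, 207, 204, 204, 201, 199, 197, 169, 159, 140],
    "Leading Country": ["USA", "USA", "USA", "China", "USA", "USA", "USA",
                        "Unified Team", "Soviet Union", "USA"],
    "Total Medals": [113, 121, 104, 100, 103, 97, 101, 112, 132, 174],
}

# Turn the column-oriented dict into one dict per Games.
rows = [dict(zip(data, values)) for values in zip(*data.values())]
by_year = {row["Year"]: row for row in rows}

most_nations = max(rows, key=lambda row: row["Participating Nations"])
top_medal_row = max(rows, key=lambda row: row["Total Medals"])
host_counts = Counter(row["Host Country"] for row in rows)

reference = {
    "max participating nations": most_nations["Participating Nations"],
    "year with max participating nations": most_nations["Year"],
    "leading country in 1988": by_year[1988]["Leading Country"],
    "host of 2020": (by_year[2020]["Host City"], by_year[2020]["Host Country"]),
    "most frequent host country": host_counts.most_common(1)[0][0],
    "games hosted by USA": host_counts["USA"],
    # Interpreted as the largest single-Games total (an assumption).
    "highest single-Games medal total": (top_medal_row["Leading Country"],
                                         top_medal_row["Total Medals"]),
}

for name, value in reference.items():
    print(f"{name}: {value}")
```

Comparing these values with the `table_qa` answers shows quickly which kinds of questions (lookups, counts, maxima) the model handles well.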

# 4. News Categorization

Scenario: You are working for a news website that wants to free its frustrated staff from manually sorting news articles into category sections, so they can spend their time creating more content and be happier!

Here you as an AI expert will try to use a zero-shot transformer LLM to do news categorization for three major news categories - sports, business and technology.

## Example Demo:

### Load the zero shot model

In [None]:
categorizer = pipeline('zero-shot-classification',
                       device=0)

### View the sample news articles

In [None]:
# Sample data with fictitious names for 5 articles
data = {
    "news_article": [
        "Riverside Roarers clinch victory in the regional football finals!",
        "Tech giant NeoTech announces its latest smartphone, the NeoTech Ultra.",
        "Stock markets see a steady rise, with BlueChip Enterprises leading the rally.",
        "InnovateSoft unveils a groundbreaking app that aids remote learning.",
        "Commodities market shaken as precious metal prices fluctuate unexpectedly."
    ],
    "actual_category": [
        "sports",
        "technology",
        "business",
        "technology",
        "business"
    ]
}

# Creating the dataframe
df = pd.DataFrame(data)
df

### Perform text categorization

In [None]:
news_categories = ['sports', 'business', 'technology']

In [None]:
articles = df['news_article'].values
articles

In [None]:
articles[0]

In [None]:
categorizer(articles[0], news_categories)

In [None]:
categorizer(articles[0], news_categories)['labels'][0]

In [None]:
predicted_categories = []

for article in articles:
    predicted_categories.append(categorizer(article, news_categories)['labels'][0])

predicted_categories

### Compare human labeled and AI model labeled categories

In [None]:
df['predicted_category'] = predicted_categories
df
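
Beyond eyeballing the two columns, the comparison can be summarized as an accuracy figure plus a list of disagreements. The sketch below does this in plain Python. The helper names `accuracy` and `mismatches` are hypothetical, and the `predicted` list is a placeholder made up for illustration, not real model output.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty inputs")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def mismatches(actual, predicted):
    """Return (index, actual, predicted) for every disagreement."""
    return [(i, a, p)
            for i, (a, p) in enumerate(zip(actual, predicted))
            if a != p]


actual = ["sports", "technology", "business", "technology", "business"]
# Placeholder predictions for illustration only, NOT real model output.
predicted = ["sports", "technology", "business", "business", "business"]

print(accuracy(actual, predicted))
print(mismatches(actual, predicted))
```

With the dataframe above, the inputs would be `df['actual_category'].tolist()` and `predicted_categories`.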

## Your turn:

For the following news articles, perform the same exercise as shown above

In [None]:
# Sample data with fictitious names
data = {
    "news_article": [
        "Local team, the Riverside Roarers, wins the national championship!",
        "Olympics postponed due to unforeseen circumstances, world athletes in shock.",
        "New athletic shoes by SportsMaster promise better performance for athletes.",
        "Stock markets hit an all-time high; investors rejoice as GlobalTrade Inc. leads the surge.",
        "Tech giant NeoTech set to acquire startup InnovateSoft, sparking industry buzz.",
        "Latest quarterly reports show unexpected profits for FinCorp Industries.",
        "Revolutionary new AI tech from AIRevolution promises to change the way we work.",
        "Major tech conference TechFest 2023 postponed due to health concerns.",
        "Breakthrough in quantum computing achieved by Pinnacle University researchers.",
        "Financial analysts at MoneyMatters predict a bullish year for commodities, especially gold."
    ],
    "actual_category": [
        "sports",
        "sports",
        "sports",
        "business",
        "business",
        "business",
        "technology",
        "technology",
        "technology",
        "business"
    ]
}

# Creating the dataframe
df = pd.DataFrame(data)
df

In [None]:
articles = df['news_article'].values
predicted_categories = []



In [None]:
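# Release the earlier pipelines so their GPU memory can be reclaimed
del categorizer
del table_qa
del summarizer
del nlp_qa
del emotion_model
del sentiment_model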

In [None]:
import gc
gc.collect()

In [None]:
import torch
torch.cuda.empty_cache()

## Bonus: Zero-Shot Object Detection

In [None]:
detector = pipeline(task="zero-shot-object-detection",
                    model="google/owlv2-base-patch16-ensemble",
                    device=0)

In [None]:
!curl "https://i.imgur.com/WLF73qt.png" -o image.png

In [None]:
import numpy as np
from PIL import Image

In [None]:
image = Image.open('image.png')
image

In [None]:
predictions = detector(
    image,
    candidate_labels=["cat", "plant", "computer", "computer mouse"],
)
predictions

In [None]:
df = pd.DataFrame(predictions)
df.head()

In [None]:
predictions = df.loc[df.groupby('label')['score'].idxmax()].to_dict('records')
predictions
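
The detector returns a list of dicts, each with a `score`, a `label`, and a `box` holding `xmin`, `ymin`, `xmax`, and `ymax`. The pandas `groupby` above keeps the highest-scoring box per label; the same idea can be sketched in plain Python with an optional score cutoff. Here `best_per_label` and `min_score` are hypothetical names, and `toy_predictions` is made-up data that only mimics the output shape.

```python
def best_per_label(predictions, min_score=0.0):
    """Keep the highest-scoring detection per label, dropping weak ones."""
    best = {}
    for pred in predictions:
        if pred["score"] < min_score:
            continue  # skip detections below the confidence cutoff
        current = best.get(pred["label"])
        if current is None or pred["score"] > current["score"]:
            best[pred["label"]] = pred
    return list(best.values())


# Made-up detections for illustration only, NOT real model output.
toy_predictions = [
    {"score": 0.40, "label": "cat",
     "box": {"xmin": 0, "ymin": 0, "xmax": 10, "ymax": 10}},
    {"score": 0.90, "label": "cat",
     "box": {"xmin": 2, "ymin": 2, "xmax": 12, "ymax": 12}},
    {"score": 0.05, "label": "plant",
     "box": {"xmin": 5, "ymin": 5, "xmax": 8, "ymax": 8}},
]

print(best_per_label(toy_predictions, min_score=0.1))
```

A cutoff like this can reduce clutter when drawing boxes, at the cost of hiding objects the model is unsure about.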

In [None]:
from PIL import ImageDraw

draw = ImageDraw.Draw(image)

for prediction in predictions:
    box = prediction["box"]
    label = prediction["label"]
    score = prediction["score"]

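    # Read the corners by key rather than relying on dict ordering
    xmin, ymin, xmax, ymax = (box[key] for key in ("xmin", "ymin", "xmax", "ymax"))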
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="red")

image