# [Hugging Face](https://huggingface.co/)
![](https://huggingface.co/front/assets/huggingface_logo.svg)

Many beginners may not be aware of this package, but I believe it's the G.O.A.T. (Greatest of All Time).  
It's a handy package for beginners and experts alike because it is so simple to use.  
It has tonnes of [pretrained models](https://huggingface.co/models) and also lets us train our own with both TensorFlow and PyTorch.  

It can be used for almost all NLP-related tasks, such as:
- [Sequence Classification](https://huggingface.co/transformers/task_summary.html#sequence-classification), e.g. Sentiment Analysis
- [Question Answering](https://huggingface.co/transformers/task_summary.html#extractive-question-answering)
- [Summarisation](https://huggingface.co/transformers/task_summary.html#summarization)
- and many [more](https://huggingface.co/transformers/task_summary.html)

In this notebook I'll briefly walk you through Sentiment Analysis and Summarisation using pretrained models to get you started with Hugging Face.

In [None]:
import transformers
import pandas as pd
import re

In [None]:
data = pd.read_csv("../input/pfizer-vaccine-tweets/vaccination_tweets.csv")
data.head()

As we're only focusing on the tweets, we'll extract the `text` column.  
Let's print the first 5 tweets in the dataset.

In [None]:
tweets = data['text'].values
tweets[:5]

As observed above, the tweets are truncated, and the full text can be found at the trailing URLs.  
Before beginning our task, let's first preprocess the data to remove URLs and emojis.

In [None]:
def data_preprocess(words):

    # remove emojis and any other non-ASCII characters
    words = words.encode('ascii', 'ignore')
    words = words.decode()

    # split the string into words on any whitespace (spaces, tabs, newlines),
    # so that a URL following a line break is still seen as its own word
    words = words.split()

    # remove URLs
    words = [word for word in words if not word.startswith('http')]
    words = ' '.join(words)

    # replace punctuation with spaces
    words = re.sub(r"[^0-9a-zA-Z]+", " ", words)

    # collapse repeated spaces
    words = re.sub(' +', ' ', words)
    return words

In [None]:
tweets = [data_preprocess(tweet) for tweet in tweets]
tweets[:5]
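
To make the effect of the preprocessing concrete, here is a minimal sketch on a single made-up tweet. The helper name `clean_tweet` and the sample text are hypothetical and only for illustration; the helper follows the same steps as above, assuming words are split on any whitespace.

```python
import re

def clean_tweet(text):
    # Hypothetical helper that follows the same steps as the preprocessing above.
    # Drop emojis and other non-ASCII characters.
    text = text.encode("ascii", "ignore").decode()
    # Drop any whitespace-separated token that looks like a URL.
    text = " ".join(tok for tok in text.split() if not tok.startswith("http"))
    # Replace runs of punctuation with a single space.
    text = re.sub(r"[^0-9a-zA-Z]+", " ", text)
    # Collapse repeated spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()

# Made-up sample tweet, not taken from the dataset.
sample = "Vaccine day! 💉 #PfizerBioNTech\nhttps://t.co/abc123"
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that hashtags and mentions lose their `#` and `@` symbols but keep their text, which is why run-together words appear in the cleaned tweets.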

## [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html)
Pipelines are a great and easy way to use models for inference. They are objects that abstract away most of the complex code in the library, offering a simple API dedicated to several tasks.  
Creating a pipeline downloads the pretrained model once; after that, it is cached and can be reused whenever required.

In [None]:
sentiment = transformers.pipeline('sentiment-analysis')
summarizer = transformers.pipeline("summarization")
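
When no model is named, each pipeline falls back to a default checkpoint, which may change between library versions. If you want reproducible results, one option is to pin the checkpoint explicitly through the `model` argument. The sketch below is only an illustration: the helper `build_pipeline` is hypothetical, and the checkpoint names are assumptions that should be checked against the model hub before use.

```python
# Assumed checkpoint names; verify them on the Hugging Face model hub first.
MODEL_NAMES = {
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
    "summarization": "sshleifer/distilbart-cnn-12-6",
}

def build_pipeline(task):
    """Create a pipeline for `task` with an explicitly pinned checkpoint."""
    # Imported lazily so that defining this helper does not need the library.
    import transformers
    return transformers.pipeline(task, model=MODEL_NAMES[task])

# Usage (downloads the checkpoint on the first call):
# sentiment = build_pipeline("sentiment-analysis")
# summarizer = build_pipeline("summarization")
```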

Sentiment for tweet at index 1
> **While the world has been on the wrong side of history this year hopefully the biggest vaccination effort we ve ev**

In [None]:
sentiment(tweets[1])

Sentiment for tweet at index 4
> **Explain to me again why we need a vaccine BorisJohnson MattHancock whereareallthesickpeople PfizerBioNTech**

In [None]:
sentiment(tweets[4])

We can easily use the same API for batches of data as given below.  
This might take some time.

In [None]:
sentiment(tweets[:5])

In [None]:
tweet_sentiment_data = sentiment(tweets)
tweet_sentiment_data = pd.DataFrame(tweet_sentiment_data)
tweet_sentiment_data.head()

In [None]:
tweet_sentiment_data['label'].value_counts()

Hence we observe that the dataset has more **Negative** tweets than **Positive** ones.  
As the pretrained model was trained on a large, standardised dataset, we can trust its predictions to a reasonable extent. However, if you want to train a model on your own data, you can refer to the guide [here](https://huggingface.co/transformers/task_summary.html#sequence-classification).

NOTE: The score here refers to the probability the model assigns to the predicted label.
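
Since the score is a probability, it can be used to set aside predictions the model is unsure about. Here is a minimal sketch, assuming predictions come as a list of dictionaries with `label` and `score` keys, as the pipeline returns them. The values below are made up for illustration, and the function name and the `0.9` threshold are arbitrary choices.

```python
from collections import Counter

# Made-up predictions in the same shape the sentiment pipeline returns.
example_predictions = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.55},
    {"label": "NEGATIVE", "score": 0.61},
    {"label": "POSITIVE", "score": 0.93},
]

def confident_label_counts(predictions, min_score=0.9):
    """Count labels, keeping only predictions scoring at least `min_score`."""
    return Counter(p["label"] for p in predictions if p["score"] >= min_score)

counts = confident_label_counts(example_predictions)
print(counts)
```

Comparing these counts with the unfiltered ones gives a rough idea of how much the Negative/Positive split depends on borderline predictions.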

Let's try summarization on some tweets. As tweets themselves are short, we'll join the first 25 into a single text and see the results. Just like sentiment analysis, the same approach can be extended to larger batches.

In [None]:
summarizer(' '.join(tweets[:25]))

In [None]:
summarizer(' '.join(tweets[-25:]))
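
The join-then-summarize step could be extended to the whole dataset by grouping tweets into fixed-size chunks first. Below is a minimal sketch; the helper `chunk_and_join` and the placeholder tweets are hypothetical, and the chunk size of 25 simply mirrors the cells above.

```python
def chunk_and_join(texts, chunk_size=25):
    """Join consecutive groups of `chunk_size` texts into single strings."""
    return [
        " ".join(texts[start:start + chunk_size])
        for start in range(0, len(texts), chunk_size)
    ]

# Placeholder tweets standing in for the cleaned dataset.
example_tweets = [f"tweet {i}" for i in range(60)]
documents = chunk_and_join(example_tweets)
print(len(documents))

# Each joined string could then be passed to the summarizer, e.g.:
# summaries = [summarizer(doc) for doc in documents]
```

The last chunk may hold fewer tweets than the others, so its summary is based on less text.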

I hope the notebook helped you in getting started with NLP and Hugging Face.