## Pipelines API

### Importing the Pipeline

### Select a Task

### Load the Pipeline

### Run the Pipeline

In [46]:
# --------------------
# Import pipeline
# --------------------
from transformers import pipeline

# --------------------
# Create the task pipeline
# --------------------
task_pipeline = pipeline(task="sentiment-analysis")

# --------------------
# Create the model pipeline
# --------------------
model_pipeline = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")

text = "this a non sentence. I am not sure what to do with it."

# --------------------
# Predict the sentiment
# --------------------
task_output = task_pipeline(text)
model_output = model_pipeline(text)

print(f"Sentiment from task_pipeline: {task_output[0]['label']}; Sentiment from model_pipeline: {model_output[0]['label']}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification f

Sentiment from task_pipeline: NEGATIVE; Sentiment from model_pipeline: NEGATIVE
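
The log above warns against relying on the default model in production. One way to address that, sketched here, is to name the checkpoint and revision explicitly. The model id and revision hash are the ones reported in the log; `PINNED` and `load_pinned_pipeline` are hypothetical names introduced for illustration. The `transformers` import is deferred so the settings can be inspected without loading the model.

```python
# Settings copied from the log message above; PINNED is a hypothetical name.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def load_pinned_pipeline(settings=PINNED):
    """Build a pipeline with an explicit model and revision (downloads on first use)."""
    from transformers import pipeline  # deferred import: only needed when loading

    return pipeline(**settings)
```

Calling `load_pinned_pipeline()` should behave like `task_pipeline` above, but without depending on whichever default the library picks.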


### Multiple Sentences

In [47]:
text1 = ["this a non sentence. I am not sure what to do with it.", "I am happy to be here"]

# Predict the sentiment of multiple sentences
task_output = task_pipeline(text1)
model_output = model_pipeline(text1)

import pandas as pd
# convert the output to a dataframe including the text
df = pd.DataFrame(task_output)
df["text"] = text1
print(df)



      label     score                                               text
0  NEGATIVE  0.999614  this a non sentence. I am not sure what to do ...
1  POSITIVE  0.999875                              I am happy to be here
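
The cell above computes `model_output` but only tabulates `task_output`. A minimal sketch of putting both side by side follows, assuming each pipeline returns one `{"label": ..., "score": ...}` dict per input text, as the DataFrame above suggests. `combine_outputs` is a hypothetical helper, and the example values are hand-written, not real model output.

```python
def combine_outputs(texts, task_output, model_output):
    """Pair each text with the label and score returned by two pipelines."""
    if not (len(texts) == len(task_output) == len(model_output)):
        raise ValueError("texts and both outputs must have the same length")
    rows = []
    for text, task_result, model_result in zip(texts, task_output, model_output):
        rows.append({
            "text": text,
            "task_label": task_result["label"],
            "task_score": task_result["score"],
            "model_label": model_result["label"],
            "model_score": model_result["score"],
        })
    return rows


# Hand-written results in the pipeline's output format (illustrative values only)
example = combine_outputs(
    ["I am happy to be here"],
    [{"label": "POSITIVE", "score": 0.99}],
    [{"label": "POSITIVE", "score": 0.99}],
)
print(example[0]["task_label"], example[0]["model_label"])
```

In the notebook, `pd.DataFrame(combine_outputs(text1, task_output, model_output))` would give one row per sentence with both models' predictions.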


### Using a model other than the default

In [None]:
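# Note: "distilbert-base-uncased" is the base checkpoint, not one fine-tuned for sentiment,
# so its classification head is newly initialized and its labels are not meaningful.
distilbert_pipeline = pipeline(model="distilbert-base-uncased", task="sentiment-analysis")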
distilbert_output = distilbert_pipeline(text1)
df_2 = pd.DataFrame(distilbert_output)
df_2["text"] = text1


# -------------------
# And another model 
# -------------------
bert_pipeline = pipeline(model="kwang123/bert-sentiment-analysis", task="sentiment-analysis")
bert_output = bert_pipeline(text1)
df_3 = pd.DataFrame(bert_output)
df_3["text"] = text1


print(df_2)
print(df_3)


### Import a small subset of the sentiment df (first 1000 reviews)

In [50]:
ds = pd.read_csv('/Users/aishwaryaravichandran/Documents/nlp/nlp_assignment_0_group_work/CS6120/Assignments/A3/review_set.csv')
ds.shape


(210000, 9)

In [51]:
subset = ds.head(1000).copy()
subset.shape

(1000, 9)
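
The cell above takes the first 1000 rows with `head`, which is not a random subset. If a random one is wanted, one option is to draw row positions with a fixed seed so the subset is reproducible. This is a sketch using only the standard library; `random_row_positions` is a hypothetical helper, and the row count is the one printed by `ds.shape`.

```python
import random


def random_row_positions(n_rows, k, seed=0):
    """Return k distinct row positions drawn uniformly from range(n_rows), sorted."""
    if k > n_rows:
        raise ValueError("cannot sample more rows than exist")
    rng = random.Random(seed)  # a fixed seed makes the subset reproducible
    return sorted(rng.sample(range(n_rows), k))


positions = random_row_positions(210000, 1000, seed=0)
print(len(positions), len(set(positions)))
```

The positions can then be used as `subset = ds.iloc[positions].copy()`. pandas also offers `ds.sample(n=1000, random_state=0)`, which does the same in one call.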

In [52]:
column_dtype = subset['review'].dtype
print(column_dtype)

object


### Apply sentiment analysis to 10, 100, and then 1000 reviews and extrapolate the time it would take for a large number (1M) of reviews

In [53]:
task_pipeline = pipeline(task="sentiment-analysis")

model_pipeline = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")

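# Drop rows with a missing review first: assigning the result of .dropna() back to a
# column realigns on the index and leaves the missing values in place.
subset = subset.dropna(subset=['review']).copy()

# Make sure every remaining review is a string
subset['review'] = subset['review'].astype(str)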

# .copy() gives an independent DataFrame, so adding a column below does not write to a slice of subset
result_10 = subset.head(10).copy()

sentiment_scores = []

for review in result_10['review']:
    # Classify only the first 501 characters so long reviews stay within the model's 512-token limit
    sentiment_result = task_pipeline(review[0:501])
    sentiment_scores.append(sentiment_result[0]['label'])

result_10['sentiment_analysis'] = sentiment_scores

output_10 = result_10['sentiment_analysis']

print(output_10)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification f

0    NEGATIVE
1    NEGATIVE
2    NEGATIVE
3    NEGATIVE
4    NEGATIVE
5    NEGATIVE
6    NEGATIVE
7    NEGATIVE
8    POSITIVE
9    POSITIVE
Name: sentiment_analysis, dtype: object


In [54]:
# .copy() gives an independent DataFrame, so adding a column below does not write to a slice of subset
result_100 = subset.head(100).copy()

sentiment_scores = []

for review in result_100['review']:
    # Classify only the first 501 characters so long reviews stay within the model's 512-token limit
    sentiment_result = task_pipeline(review[0:501])
    sentiment_scores.append(sentiment_result[0]['label'])

result_100['sentiment_analysis'] = sentiment_scores

output_100 = result_100['sentiment_analysis']

print(output_100)

0     NEGATIVE
1     NEGATIVE
2     NEGATIVE
3     NEGATIVE
4     NEGATIVE
        ...   
95    NEGATIVE
96    NEGATIVE
97    NEGATIVE
98    NEGATIVE
99    POSITIVE
Name: sentiment_analysis, Length: 100, dtype: object

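The loops in this section call the pipeline once per review. Because the pipeline also accepts a list of strings (as in the multiple-sentences cell), the reviews could instead be sent in batches, which may reduce per-call overhead. This is a sketch only: `batched` and `classify_in_batches` are hypothetical helpers, the batch size of 32 is an arbitrary assumption, and `classify` stands for any callable that maps a list of strings to a list of `{"label": ..., "score": ...}` dicts.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def classify_in_batches(classify, reviews, batch_size=32, max_chars=501):
    """Truncate each review, classify the reviews batch by batch, and collect the labels."""
    truncated = [review[:max_chars] for review in reviews]
    labels = []
    for batch in batched(truncated, batch_size):
        labels.extend(result["label"] for result in classify(batch))
    return labels
```

In the notebook this would be called as `classify_in_batches(task_pipeline, list(result_100['review']))`. Whether it is actually faster depends on the hardware and backend, so it is worth timing both variants.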

In [55]:
# .copy() gives an independent DataFrame, so adding a column below does not write to a slice of subset
result_1000 = subset.head(1000).copy()

sentiment_scores = []

for review in result_1000['review']:
    # Classify only the first 501 characters so long reviews stay within the model's 512-token limit
    sentiment_result = task_pipeline(review[0:501])
    sentiment_scores.append(sentiment_result[0]['label'])

result_1000['sentiment_analysis'] = sentiment_scores

output_1000 = result_1000['sentiment_analysis']

print(output_1000)

0      NEGATIVE
1      NEGATIVE
2      NEGATIVE
3      NEGATIVE
4      NEGATIVE
         ...   
995    NEGATIVE
996    NEGATIVE
997    NEGATIVE
998    POSITIVE
999    POSITIVE
Name: sentiment_analysis, Length: 1000, dtype: object


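The heading for this step asks for a time estimate for 1M reviews, but the cells do not record any timings. A minimal way to get one, sketched under the assumption that cost grows linearly with the number of reviews, is to time each run with `time.perf_counter` and scale the per-review average. `seconds_per_review` and `extrapolate_seconds` are hypothetical helper names; `classify` stands for a pipeline such as `task_pipeline`.

```python
import time


def seconds_per_review(classify, reviews, max_chars=501):
    """Classify reviews one at a time and return the average wall-clock seconds per review."""
    if not reviews:
        raise ValueError("reviews must not be empty")
    start = time.perf_counter()
    for review in reviews:
        classify(review[:max_chars])
    elapsed = time.perf_counter() - start
    return elapsed / len(reviews)


def extrapolate_seconds(per_review, n_reviews=1_000_000):
    """Linear extrapolation: assumes the cost per review stays constant at scale."""
    return per_review * n_reviews
```

For each `n` in 10, 100, and 1000, `seconds_per_review(task_pipeline, list(subset['review'].head(n)))` gives a rate, and `extrapolate_seconds(rate) / 3600` converts it to estimated hours for 1M reviews. Comparing the three rates shows whether the linear assumption roughly holds; the 10-review figure is likely to be the noisiest.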
### Find another model on the portal and repeat the above step

In [56]:
# Note: this is the same checkpoint that task_pipeline defaulted to (see the log in the previous step)
model_pipeline = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")

# .copy() gives an independent DataFrame, so adding a column below does not write to a slice of subset
result_10 = subset.head(10).copy()

sentiment_scores = []

for review in result_10['review']:
    # Classify only the first 501 characters so long reviews stay within the model's 512-token limit
    sentiment_result = model_pipeline(review[0:501])
    sentiment_scores.append(sentiment_result[0]['label'])

result_10['sentiment_analysis'] = sentiment_scores

output_10 = result_10['sentiment_analysis']

print(output_10)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


0    NEGATIVE
1    NEGATIVE
2    NEGATIVE
3    NEGATIVE
4    NEGATIVE
5    NEGATIVE
6    NEGATIVE
7    NEGATIVE
8    POSITIVE
9    POSITIVE
Name: sentiment_analysis, dtype: object


In [57]:
# .copy() gives an independent DataFrame, so adding a column below does not write to a slice of subset
result_100 = subset.head(100).copy()

sentiment_scores = []

for review in result_100['review']:
    # Use model_pipeline here (this step evaluates the explicitly named model, not task_pipeline)
    sentiment_result = model_pipeline(review[0:501])
    sentiment_scores.append(sentiment_result[0]['label'])

result_100['sentiment_analysis'] = sentiment_scores

output_100 = result_100['sentiment_analysis']

print(output_100)

0     NEGATIVE
1     NEGATIVE
2     NEGATIVE
3     NEGATIVE
4     NEGATIVE
        ...   
95    NEGATIVE
96    NEGATIVE
97    NEGATIVE
98    NEGATIVE
99    POSITIVE
Name: sentiment_analysis, Length: 100, dtype: object


In [58]:
# .copy() gives an independent DataFrame, so adding a column below does not write to a slice of subset
result_1000 = subset.head(1000).copy()

sentiment_scores = []

for review in result_1000['review']:
    # Use model_pipeline here (this step evaluates the explicitly named model, not task_pipeline)
    sentiment_result = model_pipeline(review[0:501])
    sentiment_scores.append(sentiment_result[0]['label'])

result_1000['sentiment_analysis'] = sentiment_scores

output_1000 = result_1000['sentiment_analysis']

print(output_1000)

0      NEGATIVE
1      NEGATIVE
2      NEGATIVE
3      NEGATIVE
4      NEGATIVE
         ...   
995    NEGATIVE
996    NEGATIVE
997    NEGATIVE
998    POSITIVE
999    POSITIVE
Name: sentiment_analysis, Length: 1000, dtype: object

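Repeating the step with a second model is most useful when the two sets of labels are compared, but the cells above overwrite `output_10`, `output_100`, and `output_1000` on each run. A small sketch of such a comparison follows, assuming the labels from each model are kept in separate lists. `agreement_rate` and `label_counts` are hypothetical helpers, and the example labels are made up for illustration.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two models predicted the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def label_counts(labels):
    """Count how often each label occurs."""
    return dict(Counter(labels))


# Made-up labels, only to show the call pattern
first_model = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
second_model = ["NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
print(agreement_rate(first_model, second_model))
print(label_counts(second_model))
```

Since the model named in this step is the same checkpoint the default pipeline resolved to, full agreement would be expected here; the comparison becomes informative once a genuinely different model is substituted.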