# Dataset and Library
This notebook uses the [Sentiment Labelled Sentences](https://archive-beta.ics.uci.edu/ml/datasets/sentiment+labelled+sentences) dataset from the open-source UCI Machine Learning Repository as sample data, together with [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) models from the Hugging Face Python library, [transformers](https://huggingface.co/transformers).

# Dataset download and preprocessing

In [1]:
%%capture
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip

In [2]:
!unzip '/content/sentiment labelled sentences.zip'

Archive:  /content/sentiment labelled sentences.zip
   creating: sentiment labelled sentences/
  inflating: sentiment labelled sentences/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/sentiment labelled sentences/
  inflating: __MACOSX/sentiment labelled sentences/._.DS_Store  
  inflating: sentiment labelled sentences/amazon_cells_labelled.txt  
  inflating: sentiment labelled sentences/imdb_labelled.txt  
  inflating: __MACOSX/sentiment labelled sentences/._imdb_labelled.txt  
  inflating: sentiment labelled sentences/readme.txt  
  inflating: __MACOSX/sentiment labelled sentences/._readme.txt  
  inflating: sentiment labelled sentences/yelp_labelled.txt  
  inflating: __MACOSX/._sentiment labelled sentences  


In [3]:
import pandas as pd

In [4]:
# Each file is tab-separated: the review text, then a 0/1 sentiment label.
df1 = pd.read_csv('/content/sentiment labelled sentences/amazon_cells_labelled.txt', delimiter='\t', names=['review', 'labelled_sentiment'])
df2 = pd.read_csv('/content/sentiment labelled sentences/imdb_labelled.txt', delimiter='\t', names=['review', 'labelled_sentiment'])
df3 = pd.read_csv('/content/sentiment labelled sentences/yelp_labelled.txt', delimiter='\t', names=['review', 'labelled_sentiment'])

In [5]:
df = pd.concat([df1,df2,df3],axis=0,ignore_index=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2748 entries, 0 to 2747
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   review              2748 non-null   object
 1   labelled_sentiment  2748 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 43.1+ KB
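
`df.info()` reports 2748 rows. If each of the three files is expected to contribute 1000 sentences (an assumption based on the dataset description, not checked in this notebook), some lines may have been merged during parsing. One possible cause is that `pd.read_csv` treats double quotes as quoting characters by default, so a review that begins with a stray `"` can absorb the lines that follow it. The sketch below uses made-up in-memory text rather than the real files and reads it with `quoting=csv.QUOTE_NONE`, which keeps every line as its own row:

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for one tab-separated file. The quote characters are
# added here only to exercise the parser; they are not copied from the dataset.
raw = (
    '"Good case", Excellent value.\t1\n'
    'Appetite instantly gone.\t0\n'
    'The mic is "great.\t1\n'
)

toy = pd.read_csv(
    io.StringIO(raw),
    delimiter='\t',
    names=['review', 'labelled_sentiment'],
    quoting=csv.QUOTE_NONE,  # treat " as an ordinary character
)

# Compare the number of parsed rows with the number of input lines.
print(len(toy), len(raw.splitlines()))
```

Comparing `len(df1)`, `len(df2)`, and `len(df3)` with the line counts of the three files would show whether this applies here.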


In [7]:
df

| | review | labelled_sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |
| ... | ... | ... |
| 2743 | I think food should have flavor and texture an... | 0 |
| 2744 | Appetite instantly gone. | 0 |
| 2745 | Overall I was not impressed and would not go b... | 0 |
| 2746 | The whole experience was underwhelming, and I ... | 0 |


# Sentiment score calculation and labelling with transformers

In [8]:
%%capture
!pip install transformers

In [9]:
from transformers import pipeline

In [10]:
# No model name is passed, so the pipeline falls back to its default checkpoint.
distilbert_classifier = pipeline('sentiment-analysis', truncation=True)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [11]:
print(df['review'][1])
test = distilbert_classifier(df['review'][1])
print(test)

Good case, Excellent value.
[{'label': 'POSITIVE', 'score': 0.9998685121536255}]


In [12]:
test[0]['label']

'POSITIVE'

In [13]:
test[0]['score']

0.9998685121536255

In [14]:
df.head()

| | review | labelled_sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [15]:
df['distilbert_sentiment'] = df['review'].apply(lambda x : distilbert_classifier(x))
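
The row-by-row `apply` above calls the model once per review. A `transformers` pipeline also accepts a list of strings and returns one result dict per input, so the reviews can be sent in chunks instead. The helper below is a minimal sketch of that idea. `classify_in_batches` and `fake_classifier` are hypothetical names introduced for illustration, and the stand-in classifier exists only so the example runs without downloading a model:

```python
def classify_in_batches(texts, classifier, batch_size=32):
    """Call `classifier` on consecutive chunks of `texts` and collect the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # The classifier is expected to return one dict per input string.
        results.extend(classifier(chunk))
    return results


def fake_classifier(batch):
    """Stand-in for a real pipeline (assumption): a keyword rule, not a model."""
    return [
        {'label': 'POSITIVE' if 'great' in text.lower() else 'NEGATIVE', 'score': 1.0}
        for text in batch
    ]


reviews = ['The mic is great.', 'Appetite instantly gone.', 'Great for the jawbone.']
out = classify_in_batches(reviews, fake_classifier, batch_size=2)
print([r['label'] for r in out])
```

With the real pipeline this would be called as `classify_in_batches(df['review'].tolist(), distilbert_classifier)`. Each result is then a single dict rather than the one-element list that the `apply` version stores in each cell.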

In [16]:
df.head()

| | review | labelled_sentiment | distilbert_sentiment |
|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | [{'label': 'NEGATIVE', 'score': 0.999408602714... |
| 1 | Good case, Excellent value. | 1 | [{'label': 'POSITIVE', 'score': 0.999868512153... |
| 2 | Great for the jawbone. | 1 | [{'label': 'POSITIVE', 'score': 0.999779641628... |
| 3 | Tied to charger for conversations lasting more... | 0 | [{'label': 'NEGATIVE', 'score': 0.999404191970... |
| 4 | The mic is great. | 1 | [{'label': 'POSITIVE', 'score': 0.999868988990... |
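
Each cell of `distilbert_sentiment` holds a one-element list containing a dict, which is awkward to filter or aggregate. A small sketch of splitting it into separate label and score columns follows. The column names `distilbert_label` and `distilbert_score` are illustrative choices, and the second row is a placeholder rather than a real model output:

```python
import pandas as pd

toy = pd.DataFrame({
    'review': ['Good case, Excellent value.', 'Appetite instantly gone.'],
    'distilbert_sentiment': [
        [{'label': 'POSITIVE', 'score': 0.9998685121536255}],
        [{'label': 'NEGATIVE', 'score': 0.9}],  # placeholder, not a real model output
    ],
})

# Take the single dict out of each list and pull its two fields apart.
toy['distilbert_label'] = toy['distilbert_sentiment'].apply(lambda r: r[0]['label'])
toy['distilbert_score'] = toy['distilbert_sentiment'].apply(lambda r: r[0]['score'])

print(toy[['review', 'distilbert_label', 'distilbert_score']])
```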


In [17]:
roberta_classifier = pipeline("sentiment-analysis",model="siebert/sentiment-roberta-large-english",truncation = True)

In [18]:
df['roberta_sentiment'] = df['review'].apply(lambda x : roberta_classifier(x))

In [20]:
df.head()

| | review | labelled_sentiment | distilbert_sentiment | roberta_sentiment |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | [{'label': 'NEGATIVE', 'score': 0.999408602714... | [{'label': 'NEGATIVE', 'score': 0.999331653118... |
| 1 | Good case, Excellent value. | 1 | [{'label': 'POSITIVE', 'score': 0.999868512153... | [{'label': 'POSITIVE', 'score': 0.998826086521... |
| 2 | Great for the jawbone. | 1 | [{'label': 'POSITIVE', 'score': 0.999779641628... | [{'label': 'POSITIVE', 'score': 0.998665452003... |
| 3 | Tied to charger for conversations lasting more... | 0 | [{'label': 'NEGATIVE', 'score': 0.999404191970... | [{'label': 'NEGATIVE', 'score': 0.999490618705... |
| 4 | The mic is great. | 1 | [{'label': 'POSITIVE', 'score': 0.999868988990... | [{'label': 'POSITIVE', 'score': 0.998643219470... |


In [19]:
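# The load warning printed below reports that 'classifier.weight' and 'classifier.bias' are newly initialized.
bert_classifier = pipeline("sentiment-analysis", model="barissayil/bert-sentiment-analysis-sst", truncation=True)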

Some weights of the model checkpoint at barissayil/bert-sentiment-analysis-sst were not used when initializing BertForSequenceClassification: ['cls_layer.bias', 'cls_layer.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at barissayil/bert-sentiment-analysis-sst and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
df['bert_sentiment'] = df['review'].apply(lambda x : bert_classifier(x))

In [22]:
df.head()

| | review | labelled_sentiment | distilbert_sentiment | roberta_sentiment | bert_sentiment |
|---|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | [{'label': 'NEGATIVE', 'score': 0.999408602714... | [{'label': 'NEGATIVE', 'score': 0.999331653118... | [{'label': 'LABEL_1', 'score': 0.5846217870712... |
| 1 | Good case, Excellent value. | 1 | [{'label': 'POSITIVE', 'score': 0.999868512153... | [{'label': 'POSITIVE', 'score': 0.998826086521... | [{'label': 'LABEL_1', 'score': 0.6400012969970... |
| 2 | Great for the jawbone. | 1 | [{'label': 'POSITIVE', 'score': 0.999779641628... | [{'label': 'POSITIVE', 'score': 0.998665452003... | [{'label': 'LABEL_1', 'score': 0.5707916617393... |
| 3 | Tied to charger for conversations lasting more... | 0 | [{'label': 'NEGATIVE', 'score': 0.999404191970... | [{'label': 'NEGATIVE', 'score': 0.999490618705... | [{'label': 'LABEL_1', 'score': 0.6403395533561... |
| 4 | The mic is great. | 1 | [{'label': 'POSITIVE', 'score': 0.999868988990... | [{'label': 'POSITIVE', 'score': 0.998643219470... | [{'label': 'LABEL_1', 'score': 0.6146497130393... |
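
To compare the three models with `labelled_sentiment`, the predicted labels need to be mapped to 0/1 first. The sketch below does this for the five rows shown above, with the scores left out because `head()` truncates them. `LABEL_TO_INT` and `accuracy` are hypothetical names, and mapping `LABEL_1` to 1 is an assumption. Note that the BERT model always answers `LABEL_1` here with scores near 0.6, which is consistent with the load warning that its classifier weights were newly initialized, so its predictions should probably not be read as meaningful.

```python
import pandas as pd

# Assumed mapping from pipeline labels to the dataset's 0/1 labels.
# The LABEL_0 / LABEL_1 entries are a guess; the model does not name its classes.
LABEL_TO_INT = {'POSITIVE': 1, 'NEGATIVE': 0, 'LABEL_1': 1, 'LABEL_0': 0}


def accuracy(frame, pred_col, truth_col='labelled_sentiment'):
    """Fraction of rows where the mapped prediction equals the human label."""
    predicted = frame[pred_col].apply(lambda r: LABEL_TO_INT[r[0]['label']])
    return float((predicted == frame[truth_col]).mean())


# The five rows displayed by df.head(), labels only.
confident = ['NEGATIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE', 'POSITIVE']
toy = pd.DataFrame({
    'labelled_sentiment': [0, 1, 1, 0, 1],
    'distilbert_sentiment': [[{'label': lab}] for lab in confident],
    'roberta_sentiment': [[{'label': lab}] for lab in confident],
    'bert_sentiment': [[{'label': 'LABEL_1'}] for _ in range(5)],
})

for col in ['distilbert_sentiment', 'roberta_sentiment', 'bert_sentiment']:
    print(col, accuracy(toy, col))
```

These figures cover five rows only and say nothing about the full dataset. Running `accuracy(df, ...)` on the complete DataFrame would give the actual comparison.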
