# Add Labels to a Dataset for Sentiment Analysis

https://thecleverprogrammer.com/2021/11/24/add-labels-to-a-dataset-for-sentiment-analysis/

Data scientists spend a lot of time preparing datasets, because raw data often contains errors and is sometimes unlabeled. Labels have to be added before such data can be used for a supervised task. Sentiment analysis is a common example: the data arrives as reviews or comments from users, and each one needs a sentiment label before the dataset is ready to use.

## Add labels to a dataset for sentiment analysis

To add labels to unlabeled data for sentiment analysis, we can use VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment model that needs no training data. We can access it through the NLTK library in Python. Let’s import the necessary Python libraries and the unlabeled dataset that we want to label:

In [1]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Huang\AppData\Roaming\nltk_data...


True

In [2]:
import numpy as np
import pandas as pd

In [19]:
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/reviews%20data.csv")
df.tail()

Unnamed: 0,Review
20486,"best kept secret 3rd time staying charm, not 5..."
20487,great location price view hotel great quick pl...
20488,"ok just looks nice modern outside, desk staff ..."
20489,hotel theft ruined vacation hotel opened sept ...
20490,"people talking, ca n't believe excellent ratin..."


In [20]:
sentiments = SentimentIntensityAnalyzer()

In [23]:
# Score every review; each result is a dict with neg, neu, pos and compound
df['scores'] = df['Review'].apply(sentiments.polarity_scores)

In [24]:
df.tail()

Unnamed: 0,Review,scores
20486,"best kept secret 3rd time staying charm, not 5...","{'neg': 0.063, 'neu': 0.665, 'pos': 0.272, 'co..."
20487,great location price view hotel great quick pl...,"{'neg': 0.0, 'neu': 0.57, 'pos': 0.43, 'compou..."
20488,"ok just looks nice modern outside, desk staff ...","{'neg': 0.131, 'neu': 0.724, 'pos': 0.145, 'co..."
20489,hotel theft ruined vacation hotel opened sept ...,"{'neg': 0.15, 'neu': 0.671, 'pos': 0.179, 'com..."
20490,"people talking, ca n't believe excellent ratin...","{'neg': 0.193, 'neu': 0.668, 'pos': 0.14, 'com..."


In [25]:
df = pd.concat([df.drop('scores', axis=1), df.agg(lambda x: pd.Series(x.scores), axis=1)], axis=1)
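
The row-wise `pd.Series` expansion above works, but it can be slow on larger datasets. As a possible alternative, the sketch below builds the four score columns in one step from a list of dictionaries. The small `demo` frame is a stand-in assumed for illustration; its values are copied from rows shown in this notebook's output.

```python
import pandas as pd

# Stand-in for the real DataFrame: two rows with a dict-valued 'scores' column.
demo = pd.DataFrame({
    "Review": [
        "great location price view hotel great quick pl...",
        "people talking, ca n't believe excellent ratin...",
    ],
    "scores": [
        {"neg": 0.0, "neu": 0.57, "pos": 0.43, "compound": 0.9753},
        {"neg": 0.193, "neu": 0.668, "pos": 0.14, "compound": -0.6071},
    ],
})

# Turn the list of dicts into a DataFrame with one column per key,
# reusing the original index so the rows stay aligned.
expanded = pd.DataFrame(demo["scores"].tolist(), index=demo.index)

# Replace the dict column with the four numeric columns.
demo = pd.concat([demo.drop(columns="scores"), expanded], axis=1)
print(demo.columns.tolist())
```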

In [26]:
df.tail()

Unnamed: 0,Review,neg,neu,pos,compound
20486,"best kept secret 3rd time staying charm, not 5...",0.063,0.665,0.272,0.9834
20487,great location price view hotel great quick pl...,0.0,0.57,0.43,0.9753
20488,"ok just looks nice modern outside, desk staff ...",0.131,0.724,0.145,0.2629
20489,hotel theft ruined vacation hotel opened sept ...,0.15,0.671,0.179,0.9867
20490,"people talking, ca n't believe excellent ratin...",0.193,0.668,0.14,-0.6071


As the output above shows, four new columns now hold the sentiment scores of the Review column. The next task is to turn these scores into labels. Following the thresholds conventionally used with VADER, a compound score of 0.05 or higher is labeled Positive, a compound score of -0.05 or lower is labeled Negative, and anything in between is labeled Neutral.

In [30]:
score = df["compound"].values
sentiment = []
# Assign one label per review based on its compound score
for i in score:
    if i >= 0.05:
        sentiment.append('Positive')
    elif i <= -0.05:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
df['Sentiment'] = sentiment
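
The same thresholds can also be wrapped in a small function, which makes the rule easier to test and reuse. This is only a sketch: the name `label_sentiment` and its `threshold` parameter are hypothetical, chosen here for illustration.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'Positive', 'Negative' or 'Neutral'."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Compound scores copied from rows shown earlier, plus 0.0 as a neutral case.
sample_scores = [0.9834, 0.2629, -0.6071, 0.0]
sample_labels = [label_sentiment(s) for s in sample_scores]
print(sample_labels)
```

With the DataFrame from this notebook, `df["Sentiment"] = df["compound"].apply(label_sentiment)` should produce the same column as the loop.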

In [31]:
df.head()

Unnamed: 0,Review,neg,neu,pos,compound,Sentiment
0,nice hotel expensive parking got good deal sta...,0.072,0.643,0.285,0.9747,Positive
1,ok nothing special charge diamond member hilto...,0.11,0.701,0.189,0.9787,Positive
2,nice rooms not 4* experience hotel monaco seat...,0.081,0.7,0.219,0.9889,Positive
3,"unique, great stay, wonderful time hotel monac...",0.06,0.555,0.385,0.9912,Positive
4,"great stay great stay, went seahawk game aweso...",0.135,0.643,0.221,0.9797,Positive


In [32]:
df.Sentiment.value_counts()

Positive    18831
Negative     1569
Neutral        91
Name: Sentiment, dtype: int64
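
The counts above show that the labels are heavily skewed toward Positive. A quick sketch of the class shares, using the counts copied from that output (the variable names are illustrative):

```python
# Label counts copied from the value_counts() output.
counts = {"Positive": 18831, "Negative": 1569, "Neutral": 91}

total = sum(counts.values())  # 20491 reviews in total

# Fraction of the dataset that falls in each class.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

On the actual DataFrame, `df.Sentiment.value_counts(normalize=True)` returns these fractions directly.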

We now have six columns in the labeled dataset. Review was the only original column; we added four columns with the sentiment scores and, finally, one column with the label derived from those scores. If you only want the text and label columns, you can drop the others before saving. The cells below rename the columns and save the labeled data:

In [33]:
df.columns = ["Review", "Negative", "Neutral", "Positive", "Compound", "Sentiment"]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Review     20491 non-null  object 
 1   Negative   20491 non-null  float64
 2   Neutral    20491 non-null  float64
 3   Positive   20491 non-null  float64
 4   Compound   20491 non-null  float64
 5   Sentiment  20491 non-null  object 
dtypes: float64(4), object(2)
memory usage: 960.6+ KB


In [34]:
df.to_csv("new_data.csv", index=False)
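
If you only need the text and its label, one way to do that is to select those two columns before saving. The sketch below uses a two-row stand-in frame and writes to an in-memory buffer so that it runs on its own; with the real `df` you would pass a file name instead. The file name mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the labeled DataFrame. Score values are copied from rows shown
# in the notebook output; the labels follow the 0.05 / -0.05 thresholds.
labeled = pd.DataFrame({
    "Review": [
        "nice hotel expensive parking got good deal sta...",
        "people talking, ca n't believe excellent ratin...",
    ],
    "Negative": [0.072, 0.193],
    "Neutral": [0.643, 0.668],
    "Positive": [0.285, 0.14],
    "Compound": [0.9747, -0.6071],
    "Sentiment": ["Positive", "Negative"],
})

# Keep only the text and the label.
text_and_label = labeled[["Review", "Sentiment"]]

# Write CSV to an in-memory buffer; pass a path such as
# "labeled_reviews.csv" instead to save to disk.
buffer = io.StringIO()
text_and_label.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```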