# Add Sentiment label to the webhose dataset
In this notebook we add sentiment scores to ~7k articles from the webhose dataset. For this we use three different sentiment analysers: the dictionary-based NLTK (VADER) and TextBlob, and the StanfordNLP library.
The Stanford analysis is very slow: it took nearly 14 hours to run over the 7k articles, which is why we did not use a larger sample.

In [1]:
# Import libraries
import pandas as pd
import sys
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

# Set path for function import and import functions
sys.path.append('../../Functions')

from datasets import load_stratified_dataset # Loads from the webhose data in a balanced manner (equal distribution of categories)
from stanford_sentiment import Sentiment_StanfordNLP # Self-written function in which the Stanford-Sentiment-Analysis is applied to text

#### Load and process the data

In [2]:
# Load the data 
df_orig = load_stratified_dataset('../dataset_categories/dataset.csv', 'category', 7000, random_seed=47)

# Copy the dataframe
df = df_orig.copy()

# Drop unused columns
df.drop(['organizations', 'locations', 'published', 'category', 'site', 'country', 'text_length'], axis=1, inplace=True)

#### NLTK and TextBlob
Both analysers tokenize the input and score the full text as one entity, so each article gets a single score per analyser.
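To make "score the full string as one entity" concrete, here is a minimal sketch of a dictionary-based scorer. It is a toy: the lexicon `TOY_LEXICON` and the function `toy_document_score` are invented for this illustration and are not how VADER or TextBlob compute their scores, which use far larger lexicons and additional rules.

```python
import re

# Toy lexicon, invented for this illustration only (assumption, not from any library)
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

def toy_document_score(text):
    """Score the whole text as one unit: the average valence of all known words."""
    # Lowercase and split into word tokens; sentence boundaries are ignored
    tokens = re.findall(r"[a-z']+", text.lower())
    valences = [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]
    if not valences:
        return 0.0  # No known words: treat the text as neutral
    return sum(valences) / len(valences)

print(toy_document_score("The food was great. The service was bad."))
```

Note that the two sentences are not scored separately: all tokens of the text go into one pool and yield one number.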

In [7]:
# Sentiment Analysis with TextBlob and NLTK

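# NLTK (VADER); the lexicon has to be downloaded once with nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
# polarity_scores returns a dict with the keys neg, neu, pos and compound (in that order)
df[['neg', 'neu', 'pos', 'Vader_Score']] = df['text'].apply(sid.polarity_scores).apply(pd.Series)
df.drop(['neg', 'neu', 'pos'], axis=1, inplace=True) # We only want the overall (compound) score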

# TextBlob
df[['TextBlob_Score', 'subjectivity']] = df['text'].apply(lambda x:TextBlob(x).sentiment).apply(pd.Series)
df.drop('subjectivity', axis=1, inplace=True) # We only need the polarity score

#### StanfordNLP
The Stanford sentiment analyser scores each sentence separately, which means that for an article with n sentences it returns n scores.
Since we only want a single score per article, we simply use the mean of the n sentence scores as the final result.
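The averaging step described above can be sketched as follows. This is only an illustration under assumptions: `mean_sentence_score` is a hypothetical helper, the input numbers are made up, and the actual `Sentiment_StanfordNLP` function may handle this differently (for example for empty texts).

```python
from statistics import mean

def mean_sentence_score(sentence_scores):
    """Collapse per-sentence scores into one article score (None if there are no sentences)."""
    scores = list(sentence_scores)
    if not scores:
        return None  # Nothing to average, so the article gets a missing value
    return mean(scores)

# Made-up per-sentence scores for a three-sentence article
article_score = mean_sentence_score([1, 3, 2])
print(article_score)
```

With a plain mean every sentence counts equally, regardless of its length, so one strongly scored short sentence weighs as much as a long neutral one.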

In [None]:
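# Sentiment Analysis with StanfordNLP
Stanford_Score = []
# Iterate over the texts by position, so this also works if the index is not 0..n-1
for i, text in enumerate(df['text']):
    print(i, end='\r') # Progress indicator
    try:
        Stanford_Score.append(Sentiment_StanfordNLP(text))
    except Exception: # If a single article fails, store a missing score and continue
        Stanford_Score.append(None)
df['Stanford_Score'] = Stanford_Score # A list is assigned by position, without index alignment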

In [9]:
# Export the dataset
df.to_csv('dataset_with_sentiment.csv')