# Applied Data Science: VADER Improvement




VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-built, lexicon- and rule-based sentiment analysis tool for natural-language text. It was designed for sentiment expressed in social media text, so it handles emoticons, capitalization, and context-dependent sentiment scoring.

Here are some key aspects of the VADER library:

* Lexicon and Rule-Based Approach: VADER uses a combination of a sentiment lexicon (a predefined list of words and their associated sentiment scores) and a set of grammatical and syntactical rules to determine the sentiment of a piece of text.

* Valence Scores: The lexicon assigns a valence score to each word, indicating how positive or negative the word is. In the lexicon these scores range from -4 (extremely negative) to +4 (extremely positive), with 0 representing neutrality. Only the compound score reported for a whole text is normalized to the range -1 to 1.

* Emoticon Handling: VADER is designed to handle sentiments expressed through emoticons, making it suitable for analyzing text data from social media platforms where emoticons are commonly used to convey emotions.

* Capitalization and Punctuation: VADER treats capitalization (for example, a word written in ALL CAPS) and punctuation (for example, exclamation marks) as signals that amplify the intensity of the sentiment.

* Contextual Valence Shifting: VADER recognizes some forms of valence shifting, where the sentiment of a word changes with its context, such as negation ("not good") and degree modifiers ("very good", "slightly good").

* Sentiment Intensity: VADER provides a compound score that summarizes the overall sentiment of a piece of text. It is computed by summing the rule-adjusted valence of each word and normalizing the sum to lie between -1 (most negative) and +1 (most positive).
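
The normalization behind the compound score can be sketched in a few lines. As far as the open-source `vaderSentiment` code goes, the sum of valences `x` is squashed with `x / sqrt(x² + α)` using `α = 15`; the function name and the sample valences below are illustrative assumptions, not values taken from the real lexicon.

```python
import math


def normalize_valence_sum(valence_sum, alpha=15):
    """Squash an unbounded sum of word valences into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clip defensively so rounding can never push the result outside [-1, 1]
    return max(-1.0, min(1.0, score))


# Hypothetical rule-adjusted valences for the words of one sentence
word_valences = [1.9, 0.0, -0.6, 3.1]
compound = normalize_valence_sum(sum(word_valences))
print(round(compound, 4))
```

The squashing means that a handful of strongly worded terms quickly pushes the compound score toward its bounds, while long neutral passages leave it near zero.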

VADER is implemented in Python and ships with the NLTK (Natural Language Toolkit) library as `nltk.sentiment.vader`; it is also available as the standalone `vaderSentiment` package. It is widely used for quick sentiment analysis, especially when training a machine learning model is not feasible or necessary. However, VADER may not perform as well as more sophisticated machine learning models on certain types of data or in specific domains.

## Setup environment

In [1]:
import pandas as pd
import os
import nltk
# nltk.download('vader_lexicon')  # uncomment and run once to fetch the VADER lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer
os.chdir("C:/Users/humic/OneDrive/Documents/Ecole/SorbonneFTD/Cours/Applied_Data_Science_Finance/project")

### Import Data

In [2]:
df = pd.read_csv("transformed_dataset_hugo.csv")

In [3]:
df

Unnamed: 0.2,Unnamed: 0.1,Date,Unnamed: 0,link,content,transform,stem,jaccard_similarity,pessimism,pessimism_2,date_2,diff
0,0,1998-06-09,274,https://www.ecb.europa.eu/press/pressconf/1998...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",,0.400000,0.022017,1998-06-09,-206
1,1,1998-07-08,273,https://www.ecb.europa.eu/press/pressconf/1998...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.436981,0.335165,0.020192,1998-07-08,-177
2,2,1998-09-01,272,https://www.ecb.europa.eu/press/pressconf/1998...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.452153,0.384106,0.019661,1998-09-01,-122
3,3,1998-10-13,271,https://www.ecb.europa.eu/press/pressconf/1998...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.447518,0.308219,0.015801,1998-10-13,-80
4,4,1998-11-03,270,https://www.ecb.europa.eu/press/pressconf/1998...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.457980,0.344371,0.016169,1998-11-03,-59
...,...,...,...,...,...,...,...,...,...,...,...,...
270,270,2023-05-04,4,https://www.ecb.europa.eu/press/pressconf/2023...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.492620,0.187500,0.009989,2023-05-04,8889
271,271,2023-05-23,3,https://www.ecb.europa.eu/press/pressconf/2023...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.519481,0.245614,0.013346,2023-05-23,8908
272,272,2023-07-27,2,https://www.ecb.europa.eu/press/pressconf/2023...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.522124,0.266667,0.012618,2023-07-27,8973
273,273,2023-09-14,1,https://www.ecb.europa.eu/press/pressconf/2023...,MONETARY POLICY STATEMENTPRESS CONFERENCEChris...,"['monetary', 'policy', 'statementpress', 'conf...","['monetari', 'polici', 'statementpress', 'conf...",0.532348,0.326531,0.016684,2023-09-14,9022


In [4]:
# Define the text_cleansing function
def text_cleansing(text):
    # Find the index of the first occurrence of "answers" in the text
    index = text.find("answers")
    if index != -1:
        # Extract the text after the first occurrence of "answers" and strip any leading or trailing spaces
        text_cleaned = text[index + len("answers"):].strip()
        text_cleaned = text_cleaned.split("We are now ready to take your questions.")[0]
        text_cleaned = text_cleaned.split("We are now at your disposal for questions.")[0]
        text_cleaned = text_cleaned.split("We are now at your disposal, should you have any questions.")[0]
        text_cleaned = text_cleaned.split("Transcript of the questions asked and the answers given by")[0]
        text_cleaned = text_cleaned.split("We stand ready to answer any questions you may have.")[0]
        text_cleaned = text_cleaned.split("CONTACT")[0]
        text_cleaned = text_cleaned.split("You may also be interested")[0]
        text_cleaned = text_cleaned.split("Related topics")[0]
        return text_cleaned
    else:
        return text  # Return the original text if "answers" is not found
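
The chain of `split` calls above amounts to cutting the text at the earliest closing phrase. A data-driven variant could look like the sketch below; `extract_statement` and `END_MARKERS` are hypothetical names, the marker strings are copied from the function above, and unlike the original this version also strips trailing whitespace.

```python
# Closing phrases copied from text_cleansing; extend this tuple to cover new layouts
END_MARKERS = (
    "We are now ready to take your questions.",
    "We are now at your disposal for questions.",
    "We are now at your disposal, should you have any questions.",
    "Transcript of the questions asked and the answers given by",
    "We stand ready to answer any questions you may have.",
    "CONTACT",
    "You may also be interested",
    "Related topics",
)


def extract_statement(text, start_marker="answers", end_markers=END_MARKERS):
    """Return the text between the first start marker and the earliest end marker."""
    start = text.find(start_marker)
    if start == -1:
        # No start marker: fall back to the original text
        return text
    body = text[start + len(start_marker):]
    # Position of each end marker in the body (-1 when absent)
    positions = [body.find(marker) for marker in end_markers]
    cut = min((pos for pos in positions if pos != -1), default=len(body))
    return body[:cut].strip()


sample = (
    "Jump to the transcript of the questions and answers "
    "Good afternoon. Rates are unchanged. "
    "We are now ready to take your questions. Question: ..."
)
print(extract_statement(sample))
```

Keeping the markers in one tuple makes it easier to audit which phrases are handled when the page layout changes.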


In [5]:
# Keep the raw statements in their own DataFrame so that a new column can be added safely
text_ecb = df[["content"]].copy()
# Apply the text_cleansing function to the 'content' column of the dataframe
text_ecb['cleaned_text'] = text_ecb['content'].apply(text_cleansing)


In [6]:
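def compute_sentiment_score_vader(ecb_statement, display_outputs=False):
    """Score an ECB statement sentence by sentence with VADER.

    Returns a one-row DataFrame holding the statement, the per-sentence
    scores, the averaged pos/neg/neu scores, and the overall label.
    The statement-level summary is printed on every call; per-sentence
    scores are printed only when display_outputs is True.
    """
    # Initialize VADER sentiment intensity analyzer
    sid = SentimentIntensityAnalyzer()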

    # Split the text into sentences based on punctuation '.'
    sentences = ecb_statement.split('.')

    # Remove any empty strings from the list
    sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

    # List to store sentiment scores for each sentence as dictionaries
    sentiment_scores_list = []

    # Compute sentiment scores for each sentence
    for sentence in sentences:
        ss = sid.polarity_scores(sentence)
        
        # Create a dictionary to store sentence content and its sentiment scores
        sentence_data = {
            "content": sentence,
            "positive_score": ss['pos'],
            "negative_score": ss['neg'],
            "neutral_score": ss['neu']
        }
        
        # Append the dictionary to the list
        sentiment_scores_list.append(sentence_data)

    # Compute average sentiment scores
    total_positive = sum(score['positive_score'] for score in sentiment_scores_list)
    total_negative = sum(score['negative_score'] for score in sentiment_scores_list)
    total_neutral = sum(score['neutral_score'] for score in sentiment_scores_list)

    # Guard against a statement with no sentences, which would otherwise divide by zero
    n_sentences = max(len(sentiment_scores_list), 1)
    average_positive = round(total_positive / n_sentences, 2)
    average_negative = round(total_negative / n_sentences, 2)
    average_neutral = round(total_neutral / n_sentences, 2)

    # Create a dictionary containing average sentiment scores
    vader_sentiment_output = {
        "average_positive": average_positive,
        "average_negative": average_negative,
        "average_neutral": average_neutral
    }

    # Determine if the statement is positive, neutral, or negative based on average scores
    if average_positive > average_negative and average_positive > average_neutral:
        sentiment = "Positive"
    elif average_negative > average_positive and average_negative > average_neutral:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"

    # Create a DataFrame
    df_data = {
        "ecb_statement": [ecb_statement],
        "sentiment_scores_list": [sentiment_scores_list],
        "vader_sentiment_output": [vader_sentiment_output],
        "overall_sentiment": [sentiment]
    }

    df = pd.DataFrame(df_data)
    
    if display_outputs:
        # Print the per-sentence results
        print("##########################################")
        print("Sentiment Scores for Each Sentence:")
        print("##########################################")

        for sentence_data in sentiment_scores_list:
            print(f"Sentence: {sentence_data['content']}")
            print(f"Positive Score: {sentence_data['positive_score']}, Negative Score: {sentence_data['negative_score']}, Neutral Score: {sentence_data['neutral_score']}")
            print("-" * 50)

    print("##########################################")
    print("Sentiment Scores of the ECB Statement:")
    print("##########################################")
    print("\nAverage Sentiment Scores of the ECB Statement:")
    print(f"Average Positive Score: {average_positive}")
    print(f"Average Negative Score: {average_negative}")
    print(f"Average Neutral Score: {average_neutral}")

    print(f"\nOverall Sentiment of the Statement: {sentiment}")
    print("\n\n")

    # Return the DataFrame
    return df
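
VADER's `pos`, `neg` and `neu` values are proportions of the text that sum to roughly 1, so in formal central-bank prose the neutral share dominates and the three-way comparison in the function above labels a statement Neutral whenever `neu` is the largest share (for instance with averages of 0.11, 0.07 and 0.82). Two possible alternatives are sketched here as assumptions, not as the notebook's method: comparing only the positive and negative shares, or thresholding the compound score at the conventional ±0.05. The function names and the `margin` parameter are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Three-way label from a VADER compound score in [-1, 1]."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def label_from_net_tone(average_positive, average_negative, margin=0.01):
    """Label from the gap between positive and negative shares, ignoring neutral."""
    net_tone = average_positive - average_negative
    if net_tone > margin:
        return "Positive"
    if net_tone < -margin:
        return "Negative"
    return "Neutral"


# Averages of 0.11 (positive) and 0.07 (negative) give a small positive net tone
print(label_from_net_tone(0.11, 0.07))
print(label_from_compound(-0.3))
```

Either rule lets the label vary across statements; the choice of `margin` or `threshold` is a modelling decision that would need to be checked against the data.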

In [8]:
text1

'Good afternoon, the Vice-President and I welcome you to our press conference. I would like to thank Governor Stournaras for his kind hospitality and express our special gratitude to his staff for the excellent organisation of today’s meeting of the Governing Council.The Governing Council today decided to keep the three key ECB interest rates unchanged. The incoming information has broadly confirmed our previous assessment of the medium-term inflation outlook. Inflation is still expected to stay too high for too long, and domestic price pressures remain strong. At the same time, inflation dropped markedly in September, including due to strong base effects, and most measures of underlying inflation have continued to ease. Our past interest rate increases continue to be transmitted forcefully into financing conditions. This is increasingly dampening demand and thereby helps push down inflation.We are determined to ensure that inflation returns to our two per cent medium-term target in a 
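
The statement above contains sentences glued together without a space ("Council.The", "inflation.We"), which a plain `split('.')` handles, but the same split also breaks decimal numbers such as "2.5" into two fragments. A regex-based splitter is one possible refinement; the sketch below is an assumption rather than part of the original pipeline, `split_sentences` is a hypothetical name, and it will still miss boundaries after uppercase abbreviations.

```python
import re

# Split after ., ! or ? followed by whitespace, or between a lowercase
# letter plus period and an uppercase letter (sentences glued together)
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[a-z]\.)(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences while leaving decimal numbers intact."""
    parts = _SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


example = "Rates rose by 2.5 per cent. Inflation eased.Growth slowed."
for sentence in split_sentences(example):
    print(sentence)
```

Unlike `split('.')`, this keeps the closing punctuation on each sentence, so VADER still sees any exclamation marks.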

In [7]:
# Score every cleaned statement and collect the one-row results in a single DataFrame

text1 = text_ecb['cleaned_text'][0]  # first cleaned statement, kept for inspection

# Iterating over the statements directly avoids index arithmetic and repeated concatenation
df_ecb_sentiment_score = pd.concat(
    [compute_sentiment_score_vader(statement, False) for statement in text_ecb['cleaned_text']],
    ignore_index=True,
    sort=False,
)

df_ecb_sentiment_score.to_csv('sentiment_score_vader.csv', index=False)

##########################################
Sentiment Scores of the ECB Statement:
##########################################

Average Sentiment Scores of the ECB Statement:
Average Positive Score: 0.11
Average Negative Score: 0.07
Average Neutral Score: 0.82

Overall Sentiment of the Statement: Neutral



##########################################
Sentiment Scores of the ECB Statement:
##########################################

Average Sentiment Scores of the ECB Statement:
Average Positive Score: 0.09
Average Negative Score: 0.05
Average Neutral Score: 0.86

Overall Sentiment of the Statement: Neutral



##########################################
Sentiment Scores of the ECB Statement:
##########################################

Average Sentiment Scores of the ECB Statement:
Average Positive Score: 0.1
Average Negative Score: 0.05
Average Neutral Score: 0.84

Overall Sentiment of the Statement: Neutral



##########################################
Sentiment Scores of the ECB Statement