### Data Loading

In [23]:
df = pd.read_csv("tweets-data.csv")
tweets_df = df  # the cells below refer to this same DataFrame as tweets_df

### Text Cleaning

This section removes URLs, mentions, hashtags (including the hashtag text), and the `RT` retweet marker, converts the text to lowercase, and strips punctuation. It then drops tokens that appear in a small hand-written list of common English stop words or that are two characters or shorter. The cleaned text is stored in a new `Cleaned_Tweet` column of the `tweets_df` DataFrame.

In [31]:
import pandas as pd
import re
stop_words = set([
    "the", "and", "is", "in", "to", "of", "for", "that", "on", "with",
    "as", "this", "it", "at", "be", "are", "was", "from", "by", "an"
])

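def clean_tweet(sentence):
    """Strip tweet markup, lowercase, and drop stop words and short tokens."""
    # Remove URLs, @mentions, #hashtags, and the standalone "RT" marker;
    # \b keeps "RT" inside longer words (e.g. "SMART") from being cut out
    sentence = re.sub(r"http\S+|@\S+|#\S+|\bRT\b", "", str(sentence))
    # Lowercase, then drop everything except word characters and whitespace
    sentence = re.sub(r"[^\w\s]", "", sentence.lower())
    tokens = sentence.split()
    # Keep tokens longer than two characters that are not stop words
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return " ".join(tokens)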

tweets_df['Cleaned_Tweet'] = tweets_df['Tweets'].apply(clean_tweet)
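
To make the effect of each cleaning stage easier to see, the following sketch traces one made-up tweet through the same three steps. The helper `trace_cleaning`, the shortened `STOP_WORDS` set, and the sample text are all hypothetical and exist only for illustration. The sketch also assumes a word-bounded `\bRT\b` pattern, so that the letters "RT" inside a longer uppercase word are left alone.

```python
import re

# Assumed, abbreviated stop-word list for illustration only
STOP_WORDS = {"the", "and", "is", "in", "to", "of", "for", "on", "it"}

def trace_cleaning(text):
    """Return the intermediate result of each cleaning stage."""
    stages = {}
    # 1. Drop URLs, @mentions, #hashtags, and the standalone retweet marker
    stripped = re.sub(r"http\S+|@\S+|#\S+|\bRT\b", "", text)
    stages["stripped"] = stripped
    # 2. Lowercase, then remove punctuation
    lowered = re.sub(r"[^\w\s]", "", stripped.lower())
    stages["lowered"] = lowered
    # 3. Keep tokens longer than two characters that are not stop words
    tokens = [t for t in lowered.split() if t not in STOP_WORDS and len(t) > 2]
    stages["cleaned"] = " ".join(tokens)
    return stages

# Made-up sample tweet
sample = "RT @user: It's a SMART move for the team! https://t.co/abc #win"
for name, value in trace_cleaning(sample).items():
    print(f"{name:>8}: {value!r}")
```

For this sample the final stage gives `its smart move team`: the apostrophe in "It's" is removed as punctuation, so `its` survives as a three-letter token, while `a`, `for`, and `the` are filtered out.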

### Sentiment Analysis Function

This section uses NLTK's VADER analyzer to score the sentiment of a given text. The compound score determines the label: `'positive'` at 0.05 or above, `'negative'` at -0.05 or below, and `'neutral'` in between. The function returns the label together with the compound score.

In [32]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# VADER needs its lexicon; this is a no-op if it is already downloaded
nltk.download("vader_lexicon", quiet=True)

# Build the analyzer once instead of on every call
sid_obj = SentimentIntensityAnalyzer()

def sentiment_score(sentence):
    """Return (label, compound score) for a piece of text."""
    sentiment_dict = sid_obj.polarity_scores(str(sentence))
    compound = sentiment_dict['compound']

    # Decide sentiment as positive, negative, or neutral
    if compound >= 0.05:
        label = 'positive'
    elif compound <= -0.05:
        label = 'negative'
    else:
        label = 'neutral'
    return label, compound
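
The threshold logic can be checked on its own, without NLTK. The sketch below isolates it in a hypothetical helper, `label_from_compound`, which is not part of the notebook. The 0.05 cutoffs are taken from the code above and the sample scores are made up, apart from 0.296, which appears in the output table.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    # Both comparisons are inclusive, matching the >= and <= used above
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# A score sitting exactly on a cutoff is not treated as neutral
for value in (0.296, 0.05, 0.0, -0.05, -0.4):
    print(value, label_from_compound(value))
```

Because the comparisons are inclusive, a score of exactly 0.05 is labeled positive and exactly -0.05 negative. Only scores strictly between the two cutoffs come out neutral.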

### Applying Sentiment Analysis and Saving Data

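This section applies `sentiment_score` to each tweet to get its sentiment label and score, which are stored in the `vader_sentiment` and `vader_score` columns, and saves the result to `tweets_with_sentiment.csv`.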
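
The cell that fills the two sentiment columns is not shown here. The sketch below is one way that step could look, not the notebook's actual code. It assumes a scoring function that returns a `(label, score)` tuple and assumes the `Cleaned_Tweet` column is the one being scored. `toy_score` is a placeholder with made-up rules, not VADER, and `demo_df` is a made-up DataFrame, so the example runs without NLTK.

```python
import pandas as pd

def toy_score(text):
    """Placeholder scorer (NOT VADER) that returns (label, score)."""
    score = 0.5 if "good" in text else -0.5 if "bad" in text else 0.0
    if score >= 0.05:
        return "positive", score
    if score <= -0.05:
        return "negative", score
    return "neutral", score

# Made-up data standing in for tweets_df
demo_df = pd.DataFrame(
    {"Cleaned_Tweet": ["good news today", "bad weather", "billion dollar"]}
)

# apply() yields one (label, score) tuple per row; zip(*...) transposes
# those tuples into one sequence of labels and one sequence of scores
labels, scores = zip(*demo_df["Cleaned_Tweet"].apply(toy_score))
demo_df["vader_sentiment"] = list(labels)
demo_df["vader_score"] = list(scores)

print(demo_df)
```

In the notebook, the same pattern would be applied to `tweets_df` with the real scoring function in place of `toy_score`.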

In [33]:
tweets_df.to_csv("tweets_with_sentiment.csv", index=False)
tweets_df[['Tweets', 'Cleaned_Tweet', 'vader_sentiment', 'vader_score']].head()

| | Tweets | Cleaned_Tweet | vader_sentiment | vader_score |
|---|---|---|---|---|
| 0 | @jacksonhinklle #wagner with 6.2 billion dolla... | billion dollar | neutral | 0.0 |
| 1 | Pobrecito es discapacitado\n#Reddetuiterosdemo... | pobrecito discapacitado | neutral | 0.0 |
| 2 | News from the EIR Daily Alert\n\n“#Putin Addre... | news eir daily alert addressed people armed ju... | positive | 0.296 |
| 3 | It's Messi day #Messi𓃵 #Messi36 #Russia #bigst... | its messi day | neutral | 0.0 |
| 4 | Il passaggio chiave di Machiavelli era questo ... | passaggio chiave machiavelli era questo princi... | neutral | 0.0 |
