## https://tinyurl.com/ANLPColab3Part3
Go to "File" -> "Save a Copy in Drive...". This creates your own copy of the notebook in your Google Drive, so any changes you make don't affect the shared notebook.

### Load dataset

Three commonly used Python packages for sentiment analysis are:
1. TextBlob
2. VADER
3. SentiWordNet

We will use the TextBlob package in class to perform sentiment analysis on a popular dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np


In [2]:
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ANLP 2024 Colab Notebooks/Session3 Colab Notebooks/Reddit_Data.csv')
df.head(5)

Mounted at /content/drive


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/Colab Notebooks/ANLP 2024 Colab Notebooks/Session3 Colab Notebooks/Reddit_Data.csv'
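
A `FileNotFoundError` like the one above usually means the CSV sits in a different folder of your Drive than the path in the cell. The sketch below is one way to locate it before calling `pd.read_csv`. The helper name `find_csv` is hypothetical (not part of Colab or pandas), and the Drive root `/content/drive/My Drive` is assumed from the path used above.

```python
from pathlib import Path

def find_csv(root, filename="Reddit_Data.csv"):
    """Return every path under `root` whose file name equals `filename`."""
    root = Path(root)
    if not root.exists():
        # Drive is not mounted (or we are not running in Colab)
        return []
    return sorted(root.rglob(filename))

# Assumed Drive root, taken from the path in the cell above
matches = find_csv("/content/drive/My Drive")
if matches:
    print("Found:", matches[0])
else:
    print("Reddit_Data.csv not found; upload it to your Drive or adjust the path.")
```

If a match is found, pass `matches[0]` to `pd.read_csv`. Searching a large Drive recursively can take a while, so narrowing `root` to the course folder may help.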

In [None]:
df.shape #dataset contains a total of 37249 rows

In [None]:
df.category.value_counts() #shows the count of each category

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

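#Visualising sentiment categories
#Order the counts explicitly so each bar matches its label (1=Positive, 0=Neutral, -1=Negative)
labels=['Positive','Neutral','Negative']
counts=df.category.value_counts().reindex([1,0,-1])
sns.barplot(x=labels,y=counts.values)
plt.show()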

## Preprocessing the data

In [None]:
df.isna().sum() #Finding empty rows

In [None]:
df = df.dropna() #Removing empty rows from the dataset
df.isna().sum()
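
To see what `isna().sum()` and `dropna()` do, here is a small sketch on a made-up three-row frame. The toy data is an assumption for illustration only; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up rows; the second comment is missing
toy = pd.DataFrame({
    "clean_comment": ["good movie", None, "bad day"],
    "category": [1, 0, -1],
})

missing_before = toy.isna().sum()   # missing values per column
toy_clean = toy.dropna()            # drop every row with at least one missing value
missing_after = toy_clean.isna().sum()

print(missing_before)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Note that `dropna()` returns a new DataFrame, which is why the notebook assigns the result back to `df`.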

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

import string
import re
from nltk.corpus import stopwords

from nltk.stem import WordNetLemmatizer

lemma_ = WordNetLemmatizer() #for lemmatization
# pstem = nltk.PorterStemmer() #for stemming

# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def cleanData(text):

    # Convert all uppercase characters to lowercase
    text = text.lower()

    # Regular expression that replaces any character that is not a letter or a digit with a space
    text = re.sub(r"[^A-Za-z0-9]",' ', text)

    # The regular expression above already takes care of punctuation; below is an alternative that removes only punctuation.
    text = ''.join([char for char in text if char not in string.punctuation])

    # Lemmatization: keep only non-empty words that are not stop words
    text = [lemma_.lemmatize(word) for word in text.split(' ') if ((word not in stop_words) and len(word)!=0)]

    # Let's try without stemming
    # text = [pstem.stem(word) for word in text.split(' ') if ((word not in stop_words) and len(word)!=0)]

    return ' '.join(text)
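
To make the cleaning steps concrete, here is a minimal, standard-library-only sketch of the same pipeline on one sentence. It is a simplification: `TOY_STOP_WORDS` is a tiny hand-picked list standing in for NLTK's English stop words, the function name `toy_clean` is hypothetical, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"the", "is", "a", "and", "to", "of", "this"}

def toy_clean(text):
    # 1. Lowercase everything
    text = text.lower()
    # 2. Replace every character that is not a letter or digit with a space
    text = re.sub(r"[^a-z0-9]", " ", text)
    # 3. Split on whitespace (this also discards empty strings) and drop stop words
    tokens = [word for word in text.split() if word not in TOY_STOP_WORDS]
    return " ".join(tokens)

print(toy_clean("This is GREAT!!! The plot, and the acting..."))
# great plot acting
```

With the real function, the lemmatizer would additionally reduce words such as "movies" to "movie".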

In [None]:
#Clean the dataset by applying the function
df['clean_comment'] = df['clean_comment'].apply(cleanData)

In [None]:
df.head(5)

## Sentiment Analysis using TextBlob

TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.

The two measures that are used to analyze the sentiment are:

* Polarity – how positive or negative the opinion is. Polarity ranges from -1 to 1 (1 is most positive, 0 is neutral, -1 is most negative)
* Subjectivity – how subjective the opinion is. Subjectivity ranges from 0 to 1 (0 being very objective and 1 being very subjective)

We can use TextBlob(text).sentiment to get the Polarity and Subjectivity values.

In [None]:
from textblob import TextBlob

# function to calculate subjectivity
def getSubjectivity(sentiment):
    return TextBlob(sentiment).sentiment.subjectivity

# function to calculate polarity
def getPolarity(sentiment):
    return TextBlob(sentiment).sentiment.polarity

# function to analyze the sentiment
def analysis(score):
    if score < 0:
        return -1
    elif score == 0:
        return 0
    else:
        return 1
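
The `analysis` function above labels a comment as neutral only when its polarity is exactly 0. As a possible variation (not part of the original notebook), the sketch below treats a small band around zero as neutral. The function name `label_with_band` and the default `eps=0.05` are assumptions chosen for illustration, not tuned values.

```python
def label_with_band(score, eps=0.05):
    """Map a polarity score to -1 / 0 / 1, treating |score| <= eps as neutral."""
    if score < -eps:
        return -1
    if score > eps:
        return 1
    return 0

for s in (-0.6, -0.02, 0.0, 0.03, 0.4):
    print(s, "->", label_with_band(s))
```

Setting `eps=0` reproduces the behaviour of `analysis`. Whether a wider neutral band improves agreement with the `category` labels is something to check on the data rather than assume.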

In [None]:
textBlob = pd.DataFrame(df[['clean_comment','category']])
textBlob['Subjectivity'] = textBlob['clean_comment'].apply(getSubjectivity)
textBlob['Polarity'] = textBlob['clean_comment'].apply(getPolarity)
textBlob['Analysis'] = textBlob['Polarity'].apply(analysis)
textBlob.head()

In [None]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

print("Evaluation metrics:\n",classification_report(textBlob['category'],textBlob['Analysis']))
print("\nAccuracy is:\n",accuracy_score(textBlob['category'],textBlob['Analysis']))
print("\nConfusion Matrix:\n",confusion_matrix(textBlob['category'],textBlob['Analysis']))
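
To make the numbers printed above easier to read, here is a by-hand sketch of what accuracy and the confusion matrix measure, using only the standard library. The labels in `y_true` and `y_pred` are made up for illustration. The layout follows scikit-learn's convention: rows are true labels, columns are predicted labels, both in sorted order.

```python
from collections import Counter

# Made-up labels: y_true plays the role of `category`, y_pred of `Analysis`
y_true = [1, 0, -1, 1, 0, -1, 1, 1]
y_pred = [1, 0, 0, 1, 1, -1, 0, 1]

labels = sorted(set(y_true) | set(y_pred))  # [-1, 0, 1]

# Accuracy: share of comments whose predicted label equals the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: matrix[i][j] counts items with true label labels[i]
# that were predicted as labels[j]
pair_counts = Counter(zip(y_true, y_pred))
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

print("Accuracy:", accuracy)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions, so dividing its sum by the total number of comments gives the accuracy again.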

In [None]:
print(textBlob.category.value_counts())
print(textBlob.Analysis.value_counts())