### Python with Pandas Continued
## Melt, Stack and Unstack.
## Basic text/sentiment analysis

# Melt
(Kind of the opposite of pivot)

* Pandas melt() transforms a DataFrame from a wide format to a long format by converting specified columns into rows. This is useful for data analysis and visualization when you need to reshape your data.
* Basic syntax: dataframe.melt(id_vars=..., value_vars=..., var_name=..., value_name=...)

- id_vars: These are the columns you want to keep as is.
- value_vars: These are the columns you want to "melt" into rows. If not specified, all columns not in id_vars will be melted.
- var_name and value_name: var_name is the name of the new column that holds the names of the melted columns (the value_vars), and value_name is the name of the new column that holds their corresponding values.

In [None]:
import pandas as pd

# Sample DataFrame
data = {'FName': ['A', 'B', 'C'],
        'LName': ['X', 'Y', 'Z'],
        'Math': [1, 2, 3],
        'Chem': [88, 95, 80],
        'CS': [50, 55, 60],
        'Phy': [78, 85, 90]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Melt the DataFrame
melted_df = df.melt(id_vars=['FName','LName'], value_vars=["Math","Chem"],var_name='Subject', value_name='Score')
print("\nMelted DataFrame:")
print(melted_df)

Original DataFrame:
  FName LName  Math  Chem  CS  Phy
0     A     X     1    88  50   78
1     B     Y     2    95  55   85
2     C     Z     3    80  60   90

Melted DataFrame:
  FName LName Subject  Score
0     A     X    Math      1
1     B     Y    Math      2
2     C     Z    Math      3
3     A     X    Chem     88
4     B     Y    Chem     95
5     C     Z    Chem     80
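
Two details from the description above can be checked with a small sketch: leaving out value_vars melts every column that is not in id_vars, and pivot() can reverse the reshape. The trimmed two-subject table and the names long_df and wide_df are illustrative, not part of the original example.

```python
import pandas as pd

# Toy data in the style of the example above, trimmed to two subjects
df = pd.DataFrame({
    "FName": ["A", "B", "C"],
    "Math": [1, 2, 3],
    "Chem": [88, 95, 80],
})

# No value_vars given: every column not listed in id_vars is melted
long_df = df.melt(id_vars="FName", var_name="Subject", value_name="Score")
print(long_df)

# pivot() goes the other way: one row per FName, one column per Subject
wide_df = long_df.pivot(index="FName", columns="Subject", values="Score")
print(wide_df)
```

The pivot only works here because each (FName, Subject) pair occurs once; with duplicate pairs, pivot() raises an error and an aggregating method such as pivot_table() would be needed.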


## Learning Melt, Stack, Unstack with Movie Ratings
- Feel free to apply previous concepts here as well.

Imagine you’re analyzing movie reviews – not just the star ratings but also the nuances in reviewer opinions. Today, we’ll reshape a dataset of movie reviews and then explore how sentiment analysis can reveal the ‘mood’ behind the rating.

Dataset Description:
* Synthetic movie review dataset with:
* Identifiers: review_id, movie, reviewer
* Aspect Ratings: Columns like plot_rating, acting_rating, and visuals_rating
* Text Review: A column review_text with review comments.

In [None]:
import pandas as pd
from textblob import TextBlob  # For sentiment analysis

#Create a sample dataset
data = {
    "review_id": [1, 2, 3],
    "movie": ["Movie A", "Movie B", "Movie A"],
    "reviewer": ["Alice", "Bob", "Charlie"],
    "plot_rating": [8, 5, 7],
    "acting_rating": [7, 6, 8],
    "visuals_rating": [9, 4, 7],
    "review_text": [
        "I loved the movie. The plot was captivating and the acting was superb!",
        "The movie was mediocre, with poor visuals but decent acting.",
        "A great experience! The visuals and acting were top-notch, though the plot was predictable."
    ]
}

# data = {
#     "review_id": [1, 2, 3,4],
#     "movie": ["Movie A", "Movie B", "Movie A","Movie A"],
#     "reviewer": ["Alice", "Bob", "Charlie","Charlie"],
#     "plot_rating": [8, 5, 7,7],
#     "acting_rating": [7, 6, 8,8],
#     "visuals_rating": [9, 4, 7,7],
#     "review_text": [
#         "I loved the movie. The plot was captivating and the acting was superb!",
#         "The movie was mediocre, with poor visuals but decent acting.",
#         "A great experience! The visuals and acting were top-notch, though the plot was predictable.", "A"
#     ]
# }
df = pd.DataFrame(data)
print(df.head())


   review_id    movie reviewer  plot_rating  acting_rating  visuals_rating  \
0          1  Movie A    Alice            8              7               9   
1          2  Movie B      Bob            5              6               4   
2          3  Movie A  Charlie            7              8               7   

                                         review_text  
0  I loved the movie. The plot was captivating an...  
1  The movie was mediocre, with poor visuals but ...  
2  A great experience! The visuals and acting wer...  


## Example usage of melt
Transform “wide” columns (each aspect rating) into a “long” format where you have one column for the aspect name and one for its rating.

* A layout with one observation per row and one variable per column is often easier to group, filter, and plot.
* “id_vars” remain unchanged while “value_vars” are melted.

In [None]:
# Melt the rating columns into two new columns: 'aspect' and 'rating'
df_melt = df.melt(
    id_vars=["review_id", "movie", "reviewer", "review_text"],
    value_vars=["plot_rating", "acting_rating", "visuals_rating"],
    var_name="aspect",
    value_name="rating"
)
print(df_melt)

   review_id    movie reviewer  \
0          1  Movie A    Alice   
1          2  Movie B      Bob   
2          3  Movie A  Charlie   
3          1  Movie A    Alice   
4          2  Movie B      Bob   
5          3  Movie A  Charlie   
6          1  Movie A    Alice   
7          2  Movie B      Bob   
8          3  Movie A  Charlie   

                                         review_text          aspect  rating  
0  I loved the movie. The plot was captivating an...     plot_rating       8  
1  The movie was mediocre, with poor visuals but ...     plot_rating       5  
2  A great experience! The visuals and acting wer...     plot_rating       7  
3  I loved the movie. The plot was captivating an...   acting_rating       7  
4  The movie was mediocre, with poor visuals but ...   acting_rating       6  
5  A great experience! The visuals and acting wer...   acting_rating       8  
6  I loved the movie. The plot was captivating an...  visuals_rating       9  
7  The movie was mediocre, 
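
One reason the long format is handy: a single groupby on the aspect column summarises all ratings at once. The sketch below rebuilds only the rating columns of the sample data; the names ratings, long_ratings, and aspect_means are illustrative.

```python
import pandas as pd

# Rating columns from the sample movie dataset
ratings = pd.DataFrame({
    "review_id": [1, 2, 3],
    "plot_rating": [8, 5, 7],
    "acting_rating": [7, 6, 8],
    "visuals_rating": [9, 4, 7],
})

# Wide -> long: one row per (review, aspect) pair
long_ratings = ratings.melt(id_vars="review_id", var_name="aspect", value_name="rating")

# Mean rating per aspect, computed in one step on the long table
aspect_means = long_ratings.groupby("aspect")["rating"].mean()
print(aspect_means)
```

In the wide table the same summary would need one mean() call per rating column, or a column selection, before the results could be compared.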

## There are other methods to reshape DataFrames, such as stack() and unstack().

### We will come back to these after looking at some text analysis methods and plotting methods with seaborn.
These will help with your homework assignment.
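
As a brief preview, here is a minimal sketch of what stack() and unstack() do, assuming a small table indexed by reviewer. The table and the names stacked and unstacked are illustrative.

```python
import pandas as pd

# Small wide table with the reviewer as the index
ratings = pd.DataFrame(
    {"plot_rating": [8, 5, 7], "acting_rating": [7, 6, 8]},
    index=pd.Index(["Alice", "Bob", "Charlie"], name="reviewer"),
)

# stack(): column labels move into an inner index level, giving a long Series
stacked = ratings.stack()
print(stacked)

# unstack(): the inner index level moves back out to columns, giving a wide DataFrame
unstacked = stacked.unstack()
print(unstacked)
```

Unlike melt(), which works on ordinary columns, stack() and unstack() move labels between the column axis and the row index, so they are most natural when the identifiers already live in the index.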

## Sentiment analysis using Python libraries

* We have numerical ratings for the movie reviews, but the text fields also carry rich information that can be useful.


### We’ll explore how movie reviews and tweets express opinions and compare different sentiment tools, with some hands-on activities.

### Objectives:
* Understand the theory and limitations of sentiment analysis.
* Learn and compare three different sentiment analysis approaches (plus an optional fourth), using:
1) VADER (rule-based, tuned for social media)
2) TextBlob (simple library for basic sentiment)
3) Hugging Face Transformers (state-of-the-art deep learning models)
4) Flair (optional, for more exploration)
* Process larger example datasets (IMDb movie reviews & Twitter data).
* Test and compare some methods.

Note: We will only look at the basics and introduce the libraries; you are encouraged to explore them in depth outside the classroom.


### Datasets

* Many openly available datasets ship with Python libraries. We have already seen seaborn datasets in the last example (they are more numerical in nature). For text, there are datasets in libraries such as "datasets" (https://huggingface.co/datasets), "scikit-learn", etc.

* We will explore two example datasets: 1) a small IMDb movie review dataset, and 2) a small Twitter dataset.



In [1]:
!pip install datasets
# Run the line above once to install, then comment it out
import pandas as pd
from datasets import load_dataset

movie_data = pd.DataFrame(load_dataset("dvilasuero/mini-imdb")["train"])
# Or alternatively: movie_data = pd.read_parquet("hf://datasets/dvilasuero/mini-imdb/data/train-00000-of-00001-262690e8d377a261.parquet")
print(movie_data)


                                                 text  label
0   i haven't felt so much pain just by gritting m...      0
1   I saw this movie years ago at a film festival,...      0
2   ...that are actually better in German than in ...      1
3   I am not a fan of Ang Lee. I hate crouching ti...      1
4   Luciana (Carla Borelli) is sent to a mental in...      0
5   A great movie documentary telling of the early...      1
6   This movie was a great way to get some underst...      1
7   Being that the only foreign films I usually li...      0
8   There really isn't much to say about this "fil...      0
9   Not their greatest film, but better not bad at...      1
10  On the surface "The Chamber" is about a young ...      1
11  This noisy aimless mess was an attempt to cash...      0
12  When people make movies as bad as this, do the...      0
13  Watching "The Fox and the Child" was an intoxi...      1
14  Spoilers<br /><br />SILENT SCREAM ain't much o...      0
15  I could rave about t

In [7]:


tweet_data = pd.DataFrame(load_dataset("stepp1/tweet_emotion_intensity")["train"])
print(tweet_data)


         id                                              tweet    class  \
0     40815  Loved @Bethenny independence msg on @WendyWill...     fear   
1     10128  @mark_slifer actually maybe we were supposed t...  sadness   
2     40476  I thought the nausea and headaches had passed ...     fear   
3     20813  Anger, resentment, and hatred are the destroye...    anger   
4     40796  new tires &amp; an alarm system on my car. fwm...     fear   
...     ...                                                ...      ...   
3955  30007  Go follow #beautiful #Snowgang ♥@Amynicolehill...     fear   
3956  40033  Maybe I'm too cynical for my own good, but I'm...    anger   
3957  20651  If Payet goes either in Jan or @ the seasons e...    anger   
3958  10795  @TeamShanny legit why i am so furious with him...    anger   
3959  10789  @UnitedFrontRev @JuanDeznuts @LucidHurricane_ ...     fear   

     sentiment_intensity class_intensity  labels  
0                    low        fear_low       4

## VADER Example

Valence Aware Dictionary and sEntiment Reasoner (VADER) ships with the Natural Language Toolkit (NLTK) and is tuned for social media text.

The compound score is a polarity between -1 (most negative) and +1 (most positive).

https://www.nltk.org/api/nltk.sentiment.vader.html



In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# Initialize VADER
vader = SentimentIntensityAnalyzer()

# Example sentence for individual scoring
sentence = "I absolutely loved the movie, though some parts were a bit slow."
scores = vader.polarity_scores(sentence)
print("VADER Scores:", scores)

# Function to apply to a DataFrame column
def vader_sentiment(text):
    return vader.polarity_scores(text)['compound']

movie_data['vader_sentiment'] = movie_data['text'].apply(vader_sentiment)
print(movie_data.iloc[0]['text'])  # integer-location based indexing for pandas DataFrames
print(movie_data.iloc[0]['vader_sentiment'])

VADER Scores: {'neg': 0.0, 'neu': 0.682, 'pos': 0.318, 'compound': 0.6361}
i haven't felt so much pain just by gritting my teeth and sitting through a movie in a long, loooong time. i've seen worse movies - movies that were less inspired and movies that were more wholly inept - but only a handful, if that, were even nearly as painful. i think what really ruined this for me is that it had some vaguely good/okay concepts, but they were so horribly fleshed out that i just kept yelling at my vcr. the dialog was horrible, the acting was pretty damned bad, and the general premise was fairly weak (though if anything else had existed to keep this movie afloat, maybe it would have been salvageable).<br /><br />there's nothing i hate more than when characters in horror movies catch on too quickly, and these people were freaking savants in that regard, especially toward the end. oh, wait, there's one thing i hated more than that - the characters themselves! i've rarely seen a more unlikable bunch

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [3]:
print(movie_data.iloc[1]['text'])  # integer-location based indexing for pandas DataFrames
print(movie_data.iloc[1]['vader_sentiment'])

I saw this movie years ago at a film festival, and ended up looking it up here after it came up in conversation with friends last night, partly to prove to them that I was not making it up, and partly to see for myself if there was actually any record of the film's existence, or if it had sunk into some kind of merciful oblivion after doing the festival circuit.<br /><br />In my festival-going days, I sat through a lot of films that cleared virtually the entire theatre, and usually took a certain pleasure in being one of the last few survivors who made it through to the closing credits. This was the film that caused me to reconsider that practice. Of all the cinematic trainwrecks I've sat through, this was far and away the very worst.<br /><br />I don't even know if I can fully explain why. It's not just that it's essentially two hours of vomiting, disembowelment and cannibalism, interspersed with about the least erotic sex scenes ever committed to film. It's not even just that the abo

#### Hands-on activity
Pick 3 tweets and apply VADER's SentimentIntensityAnalyzer. How well do you think it is doing?

In [13]:
print(tweet_data.columns)
tweet1=tweet_data.iloc[1]['tweet']
scores=vader.polarity_scores(tweet1)
print("VADER Scores:", scores )
print(tweet1)

Index(['id', 'tweet', 'class', 'sentiment_intensity', 'class_intensity',
       'labels'],
      dtype='object')
VADER Scores: {'neg': 0.212, 'neu': 0.648, 'pos': 0.14, 'compound': -0.3527}
@mark_slifer actually maybe we were supposed to die and my donation saved our lives?? #optimism
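
To judge how well VADER is doing, it can help to turn the compound score into a coarse label. The sketch below uses a cutoff of ±0.05, a commonly used convention for VADER compound scores; the helper name compound_to_label and the threshold parameter are illustrative assumptions, not part of NLTK.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the outputs printed above
print(compound_to_label(0.6361))   # the example movie sentence
print(compound_to_label(-0.3527))  # the tweet tagged #optimism
```

The tweet comes out as negative even though its author tagged it #optimism, which is the kind of mismatch worth discussing in this activity.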


## TextBlob Example
TextBlob makes sentiment analysis simple with its built-in sentiment property.
Polarity ranges from -1 (negative) to +1 (positive); subjectivity ranges from 0 (objective) to 1 (subjective).
https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.sentiment


In [14]:
from textblob import TextBlob

# Example using TextBlob
review = "The movie was fantastic and inspiring, but the ending was disappointing."
blob = TextBlob(review)
print("TextBlob Polarity:", blob.sentiment.polarity)
print("TextBlob Subjectivity:", blob.sentiment.subjectivity)

# Apply TextBlob sentiment to Twitter data
tweet_data['textblob_sentiment'] = tweet_data['tweet'].apply(lambda x: TextBlob(x).sentiment.polarity)
print(tweet_data[['tweet', 'textblob_sentiment']])

TextBlob Polarity: 0.10000000000000002
TextBlob Subjectivity: 0.8666666666666666
                                                  tweet  textblob_sentiment
0     Loved @Bethenny independence msg on @WendyWill...            0.575000
1     @mark_slifer actually maybe we were supposed t...            0.000000
2     I thought the nausea and headaches had passed ...           -0.100000
3     Anger, resentment, and hatred are the destroye...           -0.700000
4     new tires &amp; an alarm system on my car. fwm...            0.170455
...                                                 ...                 ...
3955  Go follow #beautiful #Snowgang ♥@Amynicolehill...            0.716667
3956  Maybe I'm too cynical for my own good, but I'm...            0.020000
3957  If Payet goes either in Jan or @ the seasons e...            0.000000
3958  @TeamShanny legit why i am so furious with him...           -0.400000
3959  @UnitedFrontRev @JuanDeznuts @LucidHurricane_ ...            0.000000

[3960 

#### Hands-on activity
Pick 3 movie reviews and apply TextBlob sentiment analysis. How well do you think it is doing?

In [23]:
movie_data.head()
review1=movie_data.iloc[1]['text']
# review1 = movie_data['text'][1]  # equivalent here, because the index is the default 0..n-1
blob = TextBlob(review1)
print(review1)
print('TextBlob Polarity:', blob.sentiment.polarity)  # -1 (negative) to +1 (positive)
print("TextBlob Subjectivity:", blob.sentiment.subjectivity)  # 0 (objective) to 1 (subjective)

I saw this movie years ago at a film festival, and ended up looking it up here after it came up in conversation with friends last night, partly to prove to them that I was not making it up, and partly to see for myself if there was actually any record of the film's existence, or if it had sunk into some kind of merciful oblivion after doing the festival circuit.<br /><br />In my festival-going days, I sat through a lot of films that cleared virtually the entire theatre, and usually took a certain pleasure in being one of the last few survivors who made it through to the closing credits. This was the film that caused me to reconsider that practice. Of all the cinematic trainwrecks I've sat through, this was far and away the very worst.<br /><br />I don't even know if I can fully explain why. It's not just that it's essentially two hours of vomiting, disembowelment and cannibalism, interspersed with about the least erotic sex scenes ever committed to film. It's not even just that the abo
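
Reading individual reviews gives an impression; a rough number can complement it. The sketch below checks how often the sign of a sentiment score agrees with a 0/1 label. It assumes that label 1 means a positive review, and the scores in the table are made-up placeholders, not real VADER or TextBlob outputs. In the notebook, the columns would come from movie_data instead.

```python
import pandas as pd

# Hypothetical scores for four reviews (placeholders, not real model output)
scored = pd.DataFrame({
    "label": [0, 0, 1, 1],  # assumed: 0 = negative review, 1 = positive review
    "vader_sentiment": [-0.80, 0.35, 0.90, 0.60],
    "textblob_sentiment": [-0.20, -0.10, 0.40, 0.05],
})

def sign_accuracy(scores, labels):
    """Share of rows where (score > 0) agrees with the 0/1 label."""
    predicted = (scores > 0).astype(int)
    return (predicted == labels).mean()

for column in ["vader_sentiment", "textblob_sentiment"]:
    print(column, sign_accuracy(scored[column], scored["label"]))
```

Cutting at zero is a design choice for this sketch; scores close to zero carry little information, so a neutral band around zero may be a fairer comparison.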

## Hugging-Face Transformers Example
Use a pre-trained sentiment analysis pipeline from Hugging Face to capture nuanced sentiments.

https://pypi.org/project/transformers/


This library is not only for text; there are pretrained models for image and audio tasks as well. Try using it in your projects. https://huggingface.co/models

List of tasks: https://huggingface.co/tasks





In [24]:
# You can list the most downloaded models for a specific task as follows:
from huggingface_hub import list_models

for model in list_models(limit=10, sort="downloads", direction=-1, filter="text-classification"):
    print(model.id)

# for model in list_models(limit=10, sort="downloads", direction=-1, filter="text-generation"):
#     print(model.id)

cross-encoder/ms-marco-MiniLM-L6-v2
distilbert/distilbert-base-uncased-finetuned-sst-2-english
papluca/xlm-roberta-base-language-detection
MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
facebook/bart-large-mnli
ProsusAI/finbert
microsoft/deberta-xlarge-mnli
cardiffnlp/twitter-roberta-base-sentiment-latest
cardiffnlp/twitter-roberta-base-sentiment
cross-encoder/ms-marco-MiniLM-L4-v2


* NOTE: The result differs from the previous two approaches: the pipeline returns a label and a confidence score for the classification.

In [25]:
from transformers import pipeline

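# Create sentiment analysis pipeline
# Caution: facebook/bart-large-mnli is a natural language inference (NLI) model, so its
# labels (contradiction / neutral / entailment) are not sentiment classes. A sentiment-tuned
# model such as distilbert/distilbert-base-uncased-finetuned-sst-2-english fits this task better.
sentiment_pipeline = pipeline("sentiment-analysis", model="facebook/bart-large-mnli")
# Some models on the Hub are gated and need an access token; BERT, BART, GPT-2 and many others can be downloaded freely
# Example sentence
result = sentiment_pipeline("I was not impressed by the movie; it felt outdated and dull.")
print("Transformers Result:", result)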





Transformers Result: [{'label': 'neutral', 'score': 0.9820747971534729}]
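
Because the pipeline returns a label plus a confidence rather than a polarity, comparing it with VADER or TextBlob needs a conversion step. A minimal sketch follows, assuming the model uses POSITIVE / NEGATIVE label names (label sets differ between models, so check the model card). The helper signed_score and the example predictions are hypothetical, not real model output.

```python
def signed_score(prediction):
    """Convert one {'label': ..., 'score': ...} dict to a value in [-1, 1]."""
    label = prediction["label"].upper()
    if label == "POSITIVE":
        return prediction["score"]
    if label == "NEGATIVE":
        return -prediction["score"]
    return 0.0  # any other label (e.g. 'neutral') is treated as no polarity

# Hypothetical pipeline-style outputs (placeholders, not real results)
examples = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.75},
    {"label": "neutral", "score": 0.98},
]
print([signed_score(p) for p in examples])
```

The confidence of a classifier is not the same thing as sentiment intensity, so the signed value is only a rough way to line the three tools up side by side.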


#### Hands-on activity
Pick a single tweet and run sentiment analysis on it using transformers with a BERT model. How well do you think it is doing?

In [26]:
# The model id must match the Hub listing exactly ("sst-2"); a misspelled id raises an OSError
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
result = sentiment_pipeline(tweet1)
print('Transformers Result: ', result)

### Optional "Flair" example

* Flair is another framework for NLP. The Python library supports numerous text analysis applications.

https://flair.readthedocs.io/en/latest/

NOTE: There is no need to install every framework; only use this one if you are interested in exploring it.



In [None]:
# Uncomment the next line to install flair if desired
# !pip install flair
from flair.models import TextClassifier
from flair.data import Sentence

# Load the sentiment classifier from Flair
classifier = TextClassifier.load('en-sentiment')
sentence = Sentence("The service at the theater was outstanding, yet the movie fell flat.")
classifier.predict(sentence)
print("Flair Sentiment:", sentence.labels)

## In the next class session, we will go over plotting with seaborn and matplotlib.
## We will also briefly cover chunking DataFrames and parallel processing.
