# Proportional sentiment of each Reddit r/movies thread

This was a last-minute add-on to feature-engineer sentiment labels for the Reddit comments I'm using.

This was run in Google Colab to make use of the Cloud GPU capability, as running it on my machine was far too slow.

The GPU and RAM checks below follow the instructions in [this notebook from Google](https://colab.research.google.com/notebooks/pro.ipynb).

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Oct 20 21:45:57 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    46W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 89.6 gigabytes of available RAM

You are using a high-RAM runtime!


In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


In [4]:
import time
start_time = time.time()
def time_check(start=None):
    if start:
        t = time.time() - start
    else:
        t = time.time() - start_time
    print(f'Time check: {t//60:.0f} minutes and {t%60:.0f} seconds')

In [5]:
import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import pipeline
from google.colab import files

In [6]:
tf.test.gpu_device_name()

'/device:GPU:0'

In [7]:
# Read in dataset

data_url = "https://raw.githubusercontent.com/zshoorbajee/reddit-movie-comments-nlp/main/data/data_final.csv"
df = pd.read_csv(data_url)

In [8]:
df.head()

Unnamed: 0,id,tconst,title,originalTitle,comments,runtimeMinutes,startYear,post_date_utc,post_year,post_month,post_day,genres,numVotes,averageRating
0,vzcwal,tt13406136,the princess,The Princess,Joey King needs a new agent. She’s proven she ...,94.0,2022,1657851000.0,2022,7,14,"Action,Drama,Fantasy",11474,5.6
1,vzcwal,tt13406136,the princess,The Princess,"Silly, but entertaining and non stop action",94.0,2022,1657851000.0,2022,7,14,"Action,Drama,Fantasy",11474,5.6
2,vzcwal,tt13406136,the princess,The Princess,"The yassification of The Raid\n\nActually, thi...",94.0,2022,1657851000.0,2022,7,14,"Action,Drama,Fantasy",11474,5.6
3,vzcwal,tt13406136,the princess,The Princess,"Honestly, this was pretty fun. The plot is no...",94.0,2022,1657851000.0,2022,7,14,"Action,Drama,Fantasy",11474,5.6
4,vzcwal,tt13406136,the princess,The Princess,"Man, I loved this movie. Yeah, it was campy, b...",94.0,2022,1657851000.0,2022,7,14,"Action,Drama,Fantasy",11474,5.6


### "Imploding" the dataframe

Currently, each comment is its own row, giving the dataset over 70,000 rows. I need to get the proportion of positive, neutral, and negative comments for each movie, so it is easier to treat each movie as one row.

In [9]:
# "Implode" comments so each movie is one row

comments_imploded = df.groupby('id')['comments'].agg(list)
df = df.drop_duplicates(subset='id').drop(columns=['comments'])
df = df.join(comments_imploded, on='id')
df['n_comments'] = df['comments'].apply(len)
df.head()

Unnamed: 0,id,tconst,title,originalTitle,runtimeMinutes,startYear,post_date_utc,post_year,post_month,post_day,genres,numVotes,averageRating,comments,n_comments
0,vzcwal,tt13406136,the princess,The Princess,94.0,2022,1657851000.0,2022,7,14,"Action,Drama,Fantasy",11474,5.6,[Joey King needs a new agent. She’s proven she...,21
21,vzcw0a,tt11671006,the man from toronto,The Man from Toronto,110.0,2022,1657851000.0,2022,7,14,"Action,Adventure,Comedy",43386,5.8,[ O offence to Woody but I feel like the origi...,23
44,vzcvsd,tt9288046,the sea beast,The Sea Beast,115.0,2022,1657851000.0,2022,7,14,"Adventure,Animation,Comedy",35834,7.1,[Absolutely crazy that Netflix dropped this an...,77
121,vzcvkz,tt5151570,mrs harris goes to paris,Mrs. Harris Goes to Paris,115.0,2022,1657851000.0,2022,7,14,"Comedy,Drama",4798,7.1,[This was so cute it just made me smile the wh...,22
143,vzcv66,tt9411972,where the crawdads sing,Where the Crawdads Sing,125.0,2022,1657851000.0,2022,7,14,"Drama,Mystery,Thriller",28694,7.1,[I did enjoy her house representing the 2 diff...,93
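
To make the "implode" step easier to follow, here is a minimal sketch of the same groupby/join pattern on a tiny made-up dataframe. The `toy` frame, its ids, and its comments are hypothetical and only there for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: two comments for one thread, one for another
toy = pd.DataFrame({
    "id": ["a1", "a1", "b2"],
    "title": ["film a", "film a", "film b"],
    "comments": ["loved it", "too long", "fine"],
})

# Collect every comment of a thread into one list, keyed by thread id
comments_by_id = toy.groupby("id")["comments"].agg(list)

# Keep one row per thread, then attach the list of comments to it
movies = toy.drop_duplicates(subset="id").drop(columns=["comments"])
movies = movies.join(comments_by_id, on="id")
movies["n_comments"] = movies["comments"].apply(len)

print(movies)
```

Each thread ends up as a single row whose `comments` cell holds a Python list, which is the shape the sentiment function further down expects.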


### Get the sentiment label of each comment
I'm using the Hugging Face Transformers library with a commonly used sentiment analysis model trained on 58 million tweets. The model is called [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment).
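
The model reports its classes as `LABEL_0`, `LABEL_1`, and `LABEL_2`, which its model card maps to negative, neutral, and positive. The sketch below shows that mapping on a prediction in the shape the pipeline returns (a dict with `label` and `score`). The helper name `readable_label` and the sample prediction are hypothetical, and the score is a made-up placeholder rather than a real model output.

```python
# Label ids as described on the cardiffnlp/twitter-roberta-base-sentiment model card
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def readable_label(prediction):
    """Map one pipeline prediction dict to a readable sentiment name."""
    return LABEL_NAMES[prediction["label"]]


# Hypothetical prediction for illustration only; the score is not a real result
sample_prediction = {"label": "LABEL_2", "score": 0.9}
print(readable_label(sample_prediction))
```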

In [10]:
# Commonly used sentiment analysis model from HuggingFace Transformers  

sentiment_pipeline = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment")

Downloading:   0%|          | 0.00/747 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/499M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

In [11]:
# Function to get a proportion of sentiment labels for each comment thread.

def get_sentiments(comments_list, raw_counts=False):
  """
  Takes in a list of comments as strings.
  Returns a dictionary with the proportion of negative, neutral,
  and positive comments based on a Transformers model.
  If raw_counts is True, returns the raw counts instead.
  Comments that the pipeline fails on are skipped and not counted.
  """
  # Label ids used by cardiffnlp/twitter-roberta-base-sentiment
  label_names = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
  sentiments_count = {"negative": 0, "neutral": 0, "positive": 0}
  comment_count = 0

  for comment in comments_list:
    try:
      sentiment = sentiment_pipeline(comment)
    except Exception:
      # Skip comments the pipeline cannot process (e.g. input too long for the model)
      continue
    label = sentiment[0]['label']
    if label in label_names:
      sentiments_count[label_names[label]] += 1
    comment_count += 1

  if raw_counts:
    return sentiments_count
  if comment_count == 0:
    # No comment could be classified, so the proportions are undefined
    return {name: float("nan") for name in sentiments_count}
  return {name: count / comment_count for name, count in sentiments_count.items()}
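
A possible variation, sketched here under assumptions rather than as what this notebook ran: pass the whole thread to the pipeline in one call so it can batch the comments. The function name `get_sentiments_batched`, the `classifier` parameter, and the `batch_size` default are hypothetical. It relies on the pipeline accepting a list of strings plus the `truncation` and `batch_size` call arguments. Unlike the loop above, over-length comments are truncated and counted instead of skipped, so the proportions can differ slightly.

```python
from collections import Counter

# Label ids as described on the cardiffnlp/twitter-roberta-base-sentiment model card
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def get_sentiments_batched(comments_list, classifier, batch_size=32):
    """Classify a whole thread in one call and return label proportions."""
    # Keep only non-empty strings (a pandas NaN is a float, so it is dropped here)
    texts = [c for c in comments_list if isinstance(c, str) and c.strip()]
    if not texts:
        return {name: 0.0 for name in LABEL_NAMES.values()}
    # truncation=True asks the tokenizer to cut over-length comments
    predictions = classifier(texts, truncation=True, batch_size=batch_size)
    counts = Counter(LABEL_NAMES[p["label"]] for p in predictions)
    return {name: counts[name] / len(texts) for name in LABEL_NAMES.values()}


def fake_classifier(texts, **kwargs):
    """Stand-in for the real pipeline, for illustration only."""
    return [
        {"label": "LABEL_2" if "good" in t else "LABEL_0", "score": 1.0}
        for t in texts
    ]


print(get_sentiments_batched(["good film", "bad film", None], fake_classifier))
# prints {'negative': 0.5, 'neutral': 0.0, 'positive': 0.5}
```

With the real model, `classifier` would be the `sentiment_pipeline` object. The pipeline in this notebook is created without a `device` argument, and Transformers pipelines generally run on CPU unless one is given, so constructing it with `device=0` may be worth trying on a GPU runtime.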

In [12]:
# Test the function:

example_sentiments = df['comments'].head().apply(get_sentiments)
example_sentiments

0      {'negative': 0.23809523809523808, 'neutral': 0...
21     {'negative': 0.4090909090909091, 'neutral': 0....
44     {'negative': 0.24, 'neutral': 0.12, 'positive'...
121    {'negative': 0.18181818181818182, 'neutral': 0...
143    {'negative': 0.3068181818181818, 'neutral': 0....
Name: comments, dtype: object

In [13]:
# Confirmed that it works.

In [14]:
df['sentiments_norm'] = df['comments'].apply(get_sentiments)

In [15]:
time_check()

Time check: 66 minutes and 57 seconds


In [16]:
df['neg_norm'] = df['sentiments_norm'].apply(lambda x: x['negative'])
df['ntrl_norm'] = df['sentiments_norm'].apply(lambda x: x['neutral'])
df['pos_norm'] = df['sentiments_norm'].apply(lambda x: x['positive'])

In [17]:
reddit_movie_sentiments = df[['id', 'title', 'neg_norm', 'ntrl_norm', 'pos_norm']]

In [18]:
reddit_movie_sentiments.to_csv("reddit_movie_sentiments.csv", header=True, index=False)

In [19]:
files.download('reddit_movie_sentiments.csv')
