# AAAI ICWSM 2024 | Tutorial | June 3rd | 2-6pm
## <u>Collectivist and Perspectivist Approaches to Studying Online Toxicity</u>
### Yotam Shmargad, School of Government & Public Policy, University of Arizona

#### This is the <b>first</b> of two notebooks we will discuss during the tutorial. The notebook is written in Python and walks participants through analyzing YouTube comments using a *Collectivist* approach. The notebook:
1.   Collects YouTube videos for <b>two</b> separate queries
2.   Collects a set of comments for the videos from each query
3.   Analyzes *hatefulness* in the two sets of comments
4.   Compares the average hatefulness in the two sets *(descriptive norm)*
5.   Compares the association between hatefulness and like counts in the two sets *(injunctive norm)*


### 1. Collect YouTube videos for <b>two</b> separate queries

In [None]:
# Install library for YouTube data collection https://youtube-data-api.readthedocs.io/en/latest/youtube_api.html
!pip install youtube-data-api

In [None]:
# Import libraries for data collection, management, and analysis
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from youtube_api import YouTubeDataAPI
import statsmodels.api as sm

In [None]:
# Obtain API Key from https://console.cloud.google.com/apis/
# Authenticate with YouTube - place the API Key you obtained between the quotes
yt = YouTubeDataAPI("")
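
Pasting the key between the quotes works, but it is easy to share a notebook with the key still in it. A minimal sketch of an alternative is to read the key from an environment variable. The helper name `load_api_key` and the variable name `YOUTUBE_API_KEY` are assumptions made for this illustration; they are not part of `youtube-data-api`.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error.

    Both the function name and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key

# Usage (left commented out so this cell does not fail when the variable is unset):
# yt = YouTubeDataAPI(load_api_key())
```

In Colab or Jupyter, the variable can be set for the current session with `os.environ["YOUTUBE_API_KEY"] = "..."` in a cell that is deleted before sharing.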

In [None]:
# Collect 50 videos using a query search - place your search term between the quotes
vids1 = yt.search("trump",max_results = 50)

In [None]:
# Collect 50 videos using a different query
vids2 = yt.search("biden",max_results = 50)

In [None]:
# Print number of videos collected for each query
print("Query 1:",len(vids1))
print("Query 2:",len(vids2))

In [None]:
# Retrieve information for the first video in the first query
vids1[0]

In [None]:
# Retrieve Video ID information for the first video in the first query
vids1[0]['video_id']

In [None]:
# Place video data into pandas tables
vids1_df = pd.DataFrame(vids1)
vids2_df = pd.DataFrame(vids2)

In [None]:
# Print table for videos from first query
vids1_df

### 2. Collect a set of comments for the videos from each query

In [None]:
# Retrieve 20 comments from the first video for first query
com = yt.get_video_comments(vids1[0]['video_id'],get_replies = False,max_results = 20)

In [None]:
# Print number of comments collected for first video of first query
len(com)

In [None]:
# Retrieve information for the first comment for first video in first query
com[0]

In [None]:
# Place comment data into a pandas table
com_df = pd.DataFrame(com)

In [None]:
# Extract three columns from pandas table
small_com_df = com_df[['video_id','comment_like_count','text']]

In [None]:
# Print table for comments from first video of first query
small_com_df

In [None]:
# Print number of rows and columns of pandas table
small_com_df.shape

In [None]:
# Append the table to itself to test 'concat' function - this should duplicate each row
pd.concat([small_com_df,small_com_df])

In [None]:
# Define empty table for storing comments from videos in first query
comments1 = pd.DataFrame(columns=['video_id','comment_like_count', 'text'])

In [None]:
# Print empty table
comments1

In [None]:
# Test concat function with the empty table
pd.concat([comments1,small_com_df],ignore_index = True)

In [None]:
# Iterate through videos from first query, pull up to 20 comments per video, and append them to a single table
for i in vids1:
  try:
    c = yt.get_video_comments(i['video_id'],get_replies = False,max_results = 20)
    cdf = pd.DataFrame(c)
    if cdf.shape[0] > 0:
      scdf = cdf[['video_id','comment_like_count','text']]
      comments1 = pd.concat([comments1,scdf],ignore_index=True)
    print(i['video_id'],'success!',cdf.shape[0])
  except Exception as e:
    # Catch API errors (e.g., comments disabled) without swallowing keyboard interrupts
    print(i['video_id'],'fail',e)

In [None]:
# Define empty table for storing comments from videos in second query
comments2 = pd.DataFrame(columns=['video_id','comment_like_count', 'text'])

In [None]:
# Iterate through videos from second query, pull up to 20 comments per video, and append them to a single table
for i in vids2:
  try:
    c = yt.get_video_comments(i['video_id'],get_replies = False,max_results = 20)
    cdf = pd.DataFrame(c)
    if cdf.shape[0] > 0:
      scdf = cdf[['video_id','comment_like_count','text']]
      comments2 = pd.concat([comments2,scdf],ignore_index=True)
    print(i['video_id'],'success!',cdf.shape[0])
  except Exception as e:
    # Catch API errors (e.g., comments disabled) without swallowing keyboard interrupts
    print(i['video_id'],'fail',e)

In [None]:
# Print number of rows and columns of two tables of comments
print("Query 1:",comments1.shape)
print("Query 2:",comments2.shape)

In [None]:
# Print text of first comment from first query
comments1.text[0]

### 3. Analyze *hatefulness* in the two sets of comments

In [None]:
# Install library for hatefulness analysis https://github.com/pysentimiento/pysentimiento
!pip install pysentimiento

In [None]:
# Import library for text analysis
from pysentimiento import create_analyzer

In [None]:
# Load hatefulness analyzer in English
hate = create_analyzer(task="hate_speech",lang="en")

In [None]:
# Test hatefulness analyzer on text of first comment
hate.predict(comments1.text[0])

In [None]:
# Save results of hatefulness analysis of first comment
h = hate.predict(comments1.text[0])

In [None]:
# Print probabilities from hatefulness analysis of first comment
h.probas

In [None]:
# Print probability of hate in first comment
h.probas['hateful']

In [None]:
# Create empty (missing-valued) numeric columns to store probability of hate for all comments
comments1['hateful'] = np.nan
comments2['hateful'] = np.nan

In [None]:
# Print table of comments from first query with added column
comments1

In [None]:
# Iterate through comments of first query, analyze for hatefulness, then place scores in the table
# NOTE: this may take several minutes to complete
for index,row in comments1.iterrows():
  h = hate.predict(row['text'])
  comments1.at[index, 'hateful'] = h.probas['hateful']
  if((index + 1) % 50 == 0):
    print(index + 1,end=" ")

In [None]:
# Print comments and hatefulness probabilities for first query
comments1

In [None]:
# Iterate through comments of second query, analyze for hatefulness, then place scores in the table
# NOTE: this may take several minutes to complete
for index,row in comments2.iterrows():
  h = hate.predict(row['text'])
  comments2.at[index, 'hateful'] = h.probas['hateful']
  if((index + 1) % 50 == 0):
    print(index + 1,end=" ")

In [None]:
# Print comments and hatefulness probabilities for second query
comments2
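
The two scoring loops above can also be written as a single helper that takes the table and a prediction function. This is a sketch under the assumption that the prediction function returns an object with a `probas` dictionary containing a `'hateful'` key, as `hate.predict` does in the cells above. `add_hate_scores` and `_stub_predict` are hypothetical names; the stub replaces the real model so the example runs without downloading it.

```python
from types import SimpleNamespace
import pandas as pd

def add_hate_scores(comments, predict):
    """Return a copy of `comments` with a float 'hateful' column."""
    scored = comments.copy()
    scored['hateful'] = [float(predict(text).probas['hateful'])
                         for text in scored['text']]
    return scored

def _stub_predict(text):
    """Hypothetical stand-in for hate.predict; NOT a real classifier."""
    score = 0.9 if 'hate' in text.lower() else 0.1
    return SimpleNamespace(probas={'hateful': score})

demo_comments = pd.DataFrame({'text': ['I hate this', 'nice video']})
demo_scored = add_hate_scores(demo_comments, _stub_predict)
print(demo_scored)
```

With the real analyzer, the call would be `comments1 = add_hate_scores(comments1, hate.predict)`. Because the scores are stored as floats from the start, no later type conversion is needed for that column.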

### 4. Compare the average hatefulness in the two sets *(descriptive norm)*

In [None]:
# Create columns to store indicator for first/second query
comments1['query'] = 0
comments2['query'] = 1

In [None]:
# Append the two tables together
# NOTE: the name 'all' shadows Python's built-in all() function for the rest of this notebook
all = pd.concat([comments1,comments2],ignore_index = True)

In [None]:
# Print table of comments
all

In [None]:
# Print variable types
all.dtypes

In [None]:
# Convert variable types to numbers
all['comment_like_count'] = all['comment_like_count'].astype(int)
all['hateful'] = all['hateful'].astype(float)

In [None]:
# Print variable types after changes
all.dtypes

In [None]:
# Create binary variable capturing if probability of hatefulness > 50%
all['hate_thresh'] = all['hateful'].apply(lambda x: 1 if x > .5 else 0)

In [None]:
# Print number of comments (rows) where probability of hatefulness > 50%
all.loc[all['hate_thresh'] == 1].shape[0]

In [None]:
# Create column of all 1s to add a constant to the models
all['constant'] = 1

In [None]:
# Print table with all of the created variables
all

In [None]:
# Model the difference in hate probability across the two queries using OLS regression
model1 = sm.OLS(all['hateful'],all[['query','constant']])

In [None]:
# Print Model 1 results
model1.fit().summary()
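
When reading the Model 1 summary, it helps to know that an OLS regression on a 0/1 indicator plus a constant reproduces the group means: the constant is the mean for query 0, and the coefficient on `query` is the difference between the two means. The sketch below checks this on a small made-up table (the numbers are invented for illustration, not tutorial results) using NumPy's least-squares solver instead of `statsmodels`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'query':   [0, 0, 0, 1, 1, 1],
    'hateful': [0.10, 0.20, 0.30, 0.40, 0.50, 0.90],
})

# Average hatefulness within each query
means = toy.groupby('query')['hateful'].mean()

# Least-squares fit with the same column order as Model 1: ['query', 'constant']
X = np.column_stack([toy['query'], np.ones(len(toy))])
coef, *_ = np.linalg.lstsq(X, toy['hateful'].to_numpy(), rcond=None)
slope, intercept = coef

print('constant:', intercept, ' mean for query 0:', means.loc[0])
print('query coefficient:', slope, ' difference in means:', means.loc[1] - means.loc[0])
```

The regression framing adds standard errors and p-values to what is otherwise a comparison of two averages.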

In [None]:
# Model the difference in hate prevalence across the two queries using OLS regression
model2 = sm.OLS(all['hate_thresh'],all[['query','constant']])

In [None]:
# Print Model 2 results
model2.fit().summary()

In [None]:
# Model the difference in hate prevalence across the two queries using logistic regression
# (a binomial GLM, which uses the logit link by default)
model3 = sm.GLM(all['hate_thresh'],all[['query','constant']],family=sm.families.Binomial())

In [None]:
# Print Model 3 results
model3.fit().summary()
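
Logistic regression coefficients are on the log-odds scale. With a single 0/1 predictor and a constant, the fitted coefficient on `query` equals the log of the odds ratio from the 2x2 table of query by hate indicator, so exponentiating it gives the odds ratio. The sketch below computes that quantity directly from made-up data (invented for illustration only), without fitting a model.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'query':       [0, 0, 0, 0, 1, 1, 1, 1],
    'hate_thresh': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of comments above the threshold within each query
rate = toy.groupby('query')['hate_thresh'].mean()

# Convert each share to odds, then compare the two queries
odds = rate / (1 - rate)
odds_ratio = odds.loc[1] / odds.loc[0]
log_odds_ratio = np.log(odds_ratio)

print('odds by query:', odds.to_dict())
print('odds ratio:', odds_ratio, ' log odds ratio:', log_odds_ratio)
```

For a fitted model, `np.exp(model3.fit().params)` would give the same kind of quantity for each coefficient.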

### 5. Compare the association between hatefulness and like counts in the two sets *(injunctive norm)*

In [None]:
# Create columns capturing the interaction between query and hatefulness
all['interaction'] = all['query']*all['hateful']
all['int_thresh'] = all['query']*all['hate_thresh']

In [None]:
# Print updated table with all variables
all

In [None]:
# Model associating hate probability and likes across the two queries using OLS
model4 = sm.OLS(all['comment_like_count'],all[['query','hateful','interaction','constant']])

In [None]:
# Print Model 4 results
model4.fit().summary()
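
In Model 4, the coefficient on `hateful` is the slope of likes on hatefulness for query 0, and the slope for query 1 is that coefficient plus the coefficient on `interaction`. The interaction term is therefore the difference between the two queries in how likes relate to hatefulness. The sketch below illustrates this on made-up data constructed so that likes = 1 + 2 x hateful for query 0 and likes = 3 + 5 x hateful for query 1. The numbers are invented, and NumPy's least-squares solver stands in for `statsmodels`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'query':   [0, 0, 0, 1, 1, 1],
    'hateful': [0.1, 0.5, 0.9, 0.2, 0.4, 0.8],
})
# Query 0: likes = 1 + 2*hateful; query 1: likes = 3 + 5*hateful
toy['likes'] = np.where(toy['query'] == 0,
                        1 + 2 * toy['hateful'],
                        3 + 5 * toy['hateful'])
toy['interaction'] = toy['query'] * toy['hateful']
toy['constant'] = 1

# Same column order as Model 4
X = toy[['query', 'hateful', 'interaction', 'constant']].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, toy['likes'].to_numpy(dtype=float), rcond=None)
b_query, b_hateful, b_interaction, b_constant = coef

print('slope for query 0:', b_hateful)
print('slope for query 1:', b_hateful + b_interaction)
```

A positive interaction coefficient would suggest that hateful comments attract relatively more likes under the second query than under the first, and a negative one the reverse.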

In [None]:
# Model associating hate prevalence and likes across the two queries using OLS
model5 = sm.OLS(all['comment_like_count'],all[['query','hate_thresh','int_thresh','constant']])

In [None]:
# Print Model 5 results
model5.fit().summary()

In [None]:
# Model associating hate probability and likes across the two queries using Negative Binomial
model6 = sm.GLM(all['comment_like_count'],all[['query','hateful','interaction','constant']],family=sm.families.NegativeBinomial())

In [None]:
# Print Model 6 results
model6.fit().summary()

In [None]:
# Model associating hate prevalence and likes across the two queries using Negative Binomial
model7 = sm.GLM(all['comment_like_count'],all[['query','hate_thresh','int_thresh','constant']],family=sm.families.NegativeBinomial())

In [None]:
# Print Model 7 results
model7.fit().summary()