# AAAI ICWSM 2024 | Tutorial | June 3rd | 2-6pm
## <u>Collectivist and Perspectivist Approaches to Studying Online Toxicity</u>
### Yotam Shmargad, School of Government & Public Policy, University of Arizona

#### This is the <b>second</b> of two notebooks we will discuss during the tutorial. The notebook is written in Python and walks participants through analyzing YouTube comments from a *Perspectivist* approach. The notebook:
1.   Collects YouTube videos for <b>two</b> separate queries
2.   Collects comments AND replies for videos from each query and organizes data into comment-reply pairs
3.   Analyzes *hatefulness* in the two sets of comments and replies
4.   Compares the two sets in how hate in comments *(descriptive norm)* is associated with hate in replies
5.   Compares the two sets in how hate *(descriptive norm)* and like counts *(injunctive norm)* of comments <b>interact</b> in their association with hate in replies (i.e., Theory of Normative Social Behavior)

### 1. Collect YouTube videos for <b>two</b> separate queries

In [None]:
# Install library for YouTube data collection https://youtube-data-api.readthedocs.io/en/latest/youtube_api.html
!pip install youtube-data-api

In [None]:
# Import libraries for data collection, management, and analysis
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from youtube_api import YouTubeDataAPI
import statsmodels.api as sm

In [None]:
# Obtain API Key from https://console.cloud.google.com/apis/
# Authenticate with YouTube - place the API Key you obtained between the quotes
yt = YouTubeDataAPI("")

In [None]:
# Collect up to 50 videos using a query search - place your search term between the quotes
vids1 = yt.search("trump",max_results = 50)

In [None]:
# Collect up to 50 videos using a different query
vids2 = yt.search("biden",max_results = 50)

In [None]:
# Print number of videos collected for each query
print("Query 1:",len(vids1))
print("Query 2:",len(vids2))

### 2. Collect comments AND replies for videos from each query and organize data into comment-reply pairs

In [None]:
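# Retrieve up to 500 comments (including replies) for the ith video from second query
i = 0  # index of the video to inspect; 0 selects the first video returned
com = yt.get_video_comments(vids2[i]['video_id'],get_replies = True,max_results = 500)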

In [None]:
# Place comment data into a pandas table
com_df = pd.DataFrame(com)

In [None]:
# Print rows that have a period (.) in the comment_id, which indicates a reply
com_df[com_df['comment_id'].str.contains('.', regex=False)]

In [None]:
# Subset comments to two different dataframes: 1) popular comments (those with at least one reply) and 2) replies
pop_df = com_df[com_df['reply_count'] > 0]
rep_df = com_df[com_df['comment_parent_id'].notnull()]

In [None]:
# Print dimensions of comment df, popular comment df, and reply df
print("All comments:",com_df.shape)
print("Popular comments:",pop_df.shape)
print("Replies:",rep_df.shape)

In [None]:
# Extract relevant columns from new dataframes
small_pop_df = pop_df[['video_id','comment_id','comment_like_count','text']]
small_rep_df = rep_df[['video_id','comment_parent_id','comment_like_count','text']]

In [None]:
# Merge the two dataframes so that data about comments and their replies are in the same row
mer_df = pd.merge(small_pop_df,small_rep_df,left_on='comment_id', right_on='comment_parent_id', how='inner', suffixes=('_com', '_rep'))

In [None]:
# Print merged dataframe
mer_df

In [None]:
# Print dimensions of merged dataframe
mer_df.shape
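
To make the pairing step concrete, here is a minimal sketch on made-up data (the IDs, texts, and like counts are invented for illustration). It reuses the column names from the cells above and shows that the inner merge produces one row per reply, repeats the parent comment's fields on each of those rows, and drops comments that received no replies.

```python
import pandas as pd

# Toy data (invented): c1 has two replies, c2 has one reply, c3 has none
toy = pd.DataFrame([
    {'video_id': 'v1', 'comment_id': 'c1',    'comment_parent_id': None, 'reply_count': 2, 'comment_like_count': 5, 'text': 'comment A'},
    {'video_id': 'v1', 'comment_id': 'c1.r1', 'comment_parent_id': 'c1', 'reply_count': 0, 'comment_like_count': 1, 'text': 'first reply to A'},
    {'video_id': 'v1', 'comment_id': 'c1.r2', 'comment_parent_id': 'c1', 'reply_count': 0, 'comment_like_count': 0, 'text': 'second reply to A'},
    {'video_id': 'v1', 'comment_id': 'c2',    'comment_parent_id': None, 'reply_count': 1, 'comment_like_count': 3, 'text': 'comment B'},
    {'video_id': 'v1', 'comment_id': 'c2.r1', 'comment_parent_id': 'c2', 'reply_count': 0, 'comment_like_count': 2, 'text': 'reply to B'},
    {'video_id': 'v1', 'comment_id': 'c3',    'comment_parent_id': None, 'reply_count': 0, 'comment_like_count': 9, 'text': 'comment C'},
])

# Same subsetting and merge as in the cells above
toy_pop = toy[toy['reply_count'] > 0][['video_id', 'comment_id', 'comment_like_count', 'text']]
toy_rep = toy[toy['comment_parent_id'].notnull()][['video_id', 'comment_parent_id', 'comment_like_count', 'text']]
toy_pairs = pd.merge(toy_pop, toy_rep, left_on='comment_id', right_on='comment_parent_id',
                     how='inner', suffixes=('_com', '_rep'))

# One row per reply; the suffixes mark which side each shared column came from
print(toy_pairs[['comment_id', 'text_com', 'text_rep']])
```

Because comment A appears on two rows, any statistic computed on the comment-side columns of the merged table weights comments by their number of replies.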

In [None]:
# Define empty table for storing comment-reply pairs from videos in first query
comreps1 = pd.DataFrame(columns=['video_id_com','comment_id','comment_like_count_com', 'text_com','video_id_rep','comment_parent_id','comment_like_count_rep','text_rep'])

In [None]:
# Iterate through videos from first query, pull 500 comments per video, organize into comment-reply pairs, and append them to a single table
for i in vids1:
  try:
    c = yt.get_video_comments(i['video_id'],get_replies = True,max_results = 500)
    cdf = pd.DataFrame(c)
    pdf = cdf[cdf['reply_count'] > 0]
    rdf = cdf[cdf['comment_parent_id'].isnull()==False]
    spdf = pdf[['video_id','comment_id','comment_like_count','text']]
    srdf = rdf[['video_id','comment_parent_id','comment_like_count','text']]
    mdf = pd.merge(spdf,srdf,left_on='comment_id', right_on='comment_parent_id', how='inner', suffixes=('_com', '_rep'))
    comreps1 = pd.concat([comreps1,mdf],ignore_index=True)
    print(i['video_id'],'success!',mdf.shape[0])
  except Exception as e:
    print(i['video_id'],'fail',e)

In [None]:
# Print dimensions of comment-reply table for first query
comreps1.shape

In [None]:
# Define empty table for storing comment-reply pairs from videos in second query
comreps2 = pd.DataFrame(columns=['video_id_com','comment_id','comment_like_count_com', 'text_com','video_id_rep','comment_parent_id','comment_like_count_rep','text_rep'])

In [None]:
# Iterate through videos from second query, pull 500 comments per video, organize into comment-reply pairs, and append them to a single table
for i in vids2:
  try:
    c = yt.get_video_comments(i['video_id'],get_replies = True,max_results = 500)
    cdf = pd.DataFrame(c)
    pdf = cdf[cdf['reply_count'] > 0]
    rdf = cdf[cdf['comment_parent_id'].isnull()==False]
    spdf = pdf[['video_id','comment_id','comment_like_count','text']]
    srdf = rdf[['video_id','comment_parent_id','comment_like_count','text']]
    mdf = pd.merge(spdf,srdf,left_on='comment_id', right_on='comment_parent_id', how='inner', suffixes=('_com', '_rep'))
    comreps2 = pd.concat([comreps2,mdf],ignore_index=True)
    print(i['video_id'],'success!',mdf.shape[0])
  except Exception as e:
    print(i['video_id'],'fail',e)

In [None]:
# Print dimensions of comment-reply table for second query
comreps2.shape
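
The two loops above differ only in which video list they read and which table they fill, so the shared logic could be factored into one function. The sketch below is one possible refactoring, not part of the original notebook: `build_pairs`, `collect_pairs`, and the `fetch_comments` argument are hypothetical names, and a stub with invented data stands in for the API call so the example runs offline.

```python
import pandas as pd

PAIR_COLUMNS = ['video_id_com', 'comment_id', 'comment_like_count_com', 'text_com',
                'video_id_rep', 'comment_parent_id', 'comment_like_count_rep', 'text_rep']

def build_pairs(comments):
    """Organize one video's comments (a list of dicts) into comment-reply pairs."""
    cdf = pd.DataFrame(comments)
    pdf = cdf[cdf['reply_count'] > 0][['video_id', 'comment_id', 'comment_like_count', 'text']]
    rdf = cdf[cdf['comment_parent_id'].notnull()][['video_id', 'comment_parent_id', 'comment_like_count', 'text']]
    return pd.merge(pdf, rdf, left_on='comment_id', right_on='comment_parent_id',
                    how='inner', suffixes=('_com', '_rep'))

def collect_pairs(videos, fetch_comments):
    """Apply build_pairs to every video, skipping videos that raise an error."""
    frames = []
    for v in videos:
        try:
            frames.append(build_pairs(fetch_comments(v['video_id'])))
        except Exception as err:
            print(v['video_id'], 'fail:', err)
    if not frames:
        return pd.DataFrame(columns=PAIR_COLUMNS)
    return pd.concat(frames, ignore_index=True)

# Stub standing in for the API call (invented data) so the sketch runs offline
def fake_fetch(video_id):
    if video_id == 'bad':
        raise RuntimeError('comments unavailable')
    return [
        {'video_id': video_id, 'comment_id': 'c1', 'comment_parent_id': None,
         'reply_count': 1, 'comment_like_count': 4, 'text': 'a comment'},
        {'video_id': video_id, 'comment_id': 'c1.r1', 'comment_parent_id': 'c1',
         'reply_count': 0, 'comment_like_count': 0, 'text': 'a reply'},
    ]

demo = collect_pairs([{'video_id': 'v1'}, {'video_id': 'bad'}, {'video_id': 'v2'}], fake_fetch)
print(demo.shape)
```

In the notebook itself, the call would presumably look like `comreps1 = collect_pairs(vids1, lambda vid: yt.get_video_comments(vid, get_replies=True, max_results=500))`, and likewise for `vids2`.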

### 3. Analyze *hatefulness* in the two sets of comments and replies

In [None]:
# Install library for hatefulness analysis https://github.com/pysentimiento/pysentimiento
!pip install pysentimiento

In [None]:
# Import library for text analysis
from pysentimiento import create_analyzer

In [None]:
# Load hatefulness analyzer in English
hate = create_analyzer(task="hate_speech",lang="en")

In [None]:
# Create empty columns to store probabilities of hate for all comments and replies
comreps1['hate_com'] = ''
comreps1['hate_rep'] = ''
comreps2['hate_com'] = ''
comreps2['hate_rep'] = ''

In [None]:
# Print table of comment-reply pairs from first query with added column
comreps1

In [None]:
# Initialize object to store comment_id of previous comment in first query
lastcom1 = 'abcdefg'

In [None]:
# Iterate through comments and replies of first query, analyze for hatefulness, then place scores in the table
# NOTE: this may take several minutes to complete
for index,row in comreps1.iterrows():
  if(row['comment_id'] != lastcom1):
    h_com = hate.predict(row['text_com'])
    comreps1.at[index, 'hate_com'] = h_com.probas['hateful']
    lastcom1 = row['comment_id']
  else:
    comreps1.at[index, 'hate_com'] = h_com.probas['hateful']
  h_rep = hate.predict(row['text_rep'])
  comreps1.at[index, 'hate_rep'] = h_rep.probas['hateful']
  if((index + 1) % 50 == 0):
    print(index + 1,end=" ")

In [None]:
# Print comments, replies, and hatefulness probabilities for first query
comreps1

In [None]:
# Initialize object to store comment_id of previous comment in second query
lastcom2 = 'abcdefg'

In [None]:
# Iterate through comments and replies of second query, analyze for hatefulness, then place scores in the table
# NOTE: this may take several minutes to complete
for index,row in comreps2.iterrows():
  if(row['comment_id'] != lastcom2):
    h_com = hate.predict(row['text_com'])
    comreps2.at[index, 'hate_com'] = h_com.probas['hateful']
    lastcom2 = row['comment_id']
  else:
    comreps2.at[index, 'hate_com'] = h_com.probas['hateful']
  h_rep = hate.predict(row['text_rep'])
  comreps2.at[index, 'hate_rep'] = h_rep.probas['hateful']
  if((index + 1) % 50 == 0):
    print(index + 1,end=" ")
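
The loops above avoid re-scoring a comment only when its rows are adjacent, because they remember just the previous `comment_id`. A dictionary keyed on `comment_id` is an alternative that does not depend on row order. The sketch below illustrates the idea with a stand-in scorer (`fake_score`, a hypothetical function returning made-up values) in place of `hate.predict(...).probas['hateful']`, so it runs without downloading the model.

```python
calls = []  # records every text sent to the scorer

def fake_score(text):
    """Stand-in scorer (hypothetical): returns a made-up probability."""
    calls.append(text)
    return 0.9 if 'awful' in text else 0.1

# Invented comment-reply pairs; comment c1 appears on non-adjacent rows
pairs = [
    {'comment_id': 'c1', 'text_com': 'awful people', 'text_rep': 'agreed'},
    {'comment_id': 'c2', 'text_com': 'nice video',   'text_rep': 'thanks'},
    {'comment_id': 'c1', 'text_com': 'awful people', 'text_rep': 'you are awful'},
]

cache = {}  # comment_id -> score of the comment text
for row in pairs:
    if row['comment_id'] not in cache:
        cache[row['comment_id']] = fake_score(row['text_com'])
    row['hate_com'] = cache[row['comment_id']]
    row['hate_rep'] = fake_score(row['text_rep'])  # every reply is scored once

# Two distinct comments plus three replies were scored
print(len(calls))
```

With the merged tables built above, rows for the same comment are already adjacent, so the original approach works; the cache only matters if the table is shuffled or sorted differently before scoring.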

In [None]:
# Print comments, replies, and hatefulness probabilities for second query
comreps2

### 4. Compare the two sets in how hate in comments *(descriptive norm)* is associated with hate in replies

In [None]:
# Create columns to store indicator for first/second query
comreps1['query'] = 0
comreps2['query'] = 1

In [None]:
# Append the two tables together (NOTE: the name `all` shadows Python's built-in all() for the rest of the notebook)
all = pd.concat([comreps1,comreps2],ignore_index = True)

In [None]:
# Print table of comments
all

In [None]:
# Print variable types
all.dtypes

In [None]:
# Convert variable types to numbers
all['comment_like_count_com'] = all['comment_like_count_com'].astype(int)
all['hate_com'] = all['hate_com'].astype(float)
all['comment_like_count_rep'] = all['comment_like_count_rep'].astype(int)
all['hate_rep'] = all['hate_rep'].astype(float)

In [None]:
# Print variable types after changes
all.dtypes

In [None]:
# Create binary variables capturing if probability of hatefulness in comments and replies > 50%
all['hate_thresh_com'] = all['hate_com'].apply(lambda x: 1 if x > .5 else 0)
all['hate_thresh_rep'] = all['hate_rep'].apply(lambda x: 1 if x > .5 else 0)
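
How many items count as hateful depends on where the cutoff is placed. The sketch below uses made-up probabilities (not output from the analyzer) to show a vectorized equivalent of the `apply`/`lambda` pattern and to count flagged items at a few cutoffs. Note that the comparison is strict, so a score of exactly .5 is not flagged.

```python
import pandas as pd

# Made-up hatefulness probabilities, for illustration only
scores = pd.Series([0.05, 0.45, 0.50, 0.55, 0.65, 0.95])

# The apply/lambda pattern and its vectorized equivalent give the same flags
flag_apply = scores.apply(lambda x: 1 if x > .5 else 0)
flag_vec = (scores > .5).astype(int)
same = bool((flag_apply == flag_vec).all())

# Number of items flagged at each cutoff
counts = {t: int((scores > t).sum()) for t in [.5, .6, .9]}
print(same, counts)
```

The higher the cutoff, the fewer positive cases remain, which is one reason stricter thresholds tend to require more data.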

In [None]:
# Print number of comment-reply pairs where comment has probability of hatefulness > 50%
# (.shape[0] counts rows; .size would count rows x columns)
all.loc[all['hate_thresh_com'] == 1].shape[0]

In [None]:
# Print number of comment-reply pairs where reply has probability of hatefulness > 50%
# (.shape[0] counts rows; .size would count rows x columns)
all.loc[all['hate_thresh_rep'] == 1].shape[0]

In [None]:
# Create column of all 1s to add a constant to the models
all['constant'] = 1

In [None]:
# Model how comment hate probability is associated with reply hate probability using OLS regression
model1 = sm.OLS(all['hate_rep'],all[['hate_com','constant']])

In [None]:
# Print Model 1 results
model1.fit().summary()

In [None]:
# Model how comment hate prevalence is associated with reply hate prevalence using OLS regression
model2 = sm.OLS(all['hate_thresh_rep'],all[['hate_thresh_com','constant']])

In [None]:
# Print Model 2 results
model2.fit().summary()

In [None]:
# Model how comment hate prevalence is associated with reply hate prevalence using Logit regression
model3 = sm.GLM(all['hate_thresh_rep'],all[['hate_thresh_com','constant']],family=sm.families.Binomial())

In [None]:
# Print Model 3 results
model3.fit().summary()
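
Coefficients from the Binomial GLM are on the log-odds scale, which can be hard to read directly. The sketch below converts a pair of hypothetical coefficients (invented for illustration, not results from this notebook) into an odds ratio and into predicted probabilities of a hateful reply.

```python
import math

# Hypothetical logit coefficients, NOT estimates from this notebook
b_const = -3.0  # log-odds of a hateful reply when the comment is not flagged as hateful
b_hate = 1.2    # change in log-odds when the comment is flagged as hateful

def inv_logit(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

odds_ratio = math.exp(b_hate)             # multiplicative change in the odds
p_not_hateful = inv_logit(b_const)        # P(hateful reply | comment not flagged)
p_hateful = inv_logit(b_const + b_hate)   # P(hateful reply | comment flagged)

print(round(odds_ratio, 2), round(p_not_hateful, 3), round(p_hateful, 3))
```

With the fitted model, the same conversion could be applied to all coefficients at once with `np.exp(model3.fit().params)`.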

In [None]:
# Create columns capturing the interaction between query and hatefulness of comments
all['interaction'] = all['query']*all['hate_com']
all['int_thresh'] = all['query']*all['hate_thresh_com']

In [None]:
# Model how comment-reply hate probability association varies across the two queries using OLS regression
model4 = sm.OLS(all['hate_rep'],all[['hate_com','query','interaction','constant']])

In [None]:
# Print Model 4 results
model4.fit().summary()
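
Written out, Model 4 can be read as the following equation, where the coefficient symbols are labels introduced here for the columns `hate_com`, `query`, `interaction`, and `constant`:

```latex
\text{hate\_rep}_i = \beta_0 + \beta_1\,\text{hate\_com}_i + \beta_2\,\text{query}_i
                   + \beta_3\,(\text{query}_i \times \text{hate\_com}_i) + \varepsilon_i

% Slope of hate_com within each query:
%   query = 0 (first query):   \beta_1
%   query = 1 (second query):  \beta_1 + \beta_3
```

The coefficient on `interaction` is therefore the difference between the two queries in how strongly comment hate is associated with reply hate.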

In [None]:
# Model how comment-reply hate prevalence association varies across the two queries using Logit regression
model5 = sm.GLM(all['hate_thresh_rep'],all[['hate_thresh_com','query','int_thresh','constant']],family=sm.families.Binomial())

In [None]:
# Print Model 5 results
model5.fit().summary()

### 5. Compare the two sets in how hate *(descriptive norm)* and like counts *(injunctive norm)* of comments <b>interact</b> in their association with hate in replies (i.e., Theory of Normative Social Behavior)

In [None]:
# Create columns capturing the interaction between like counts and hatefulness of comments
all['tnsb'] = all['comment_like_count_com']*all['hate_com']
all['tnsb_thresh'] = all['comment_like_count_com']*all['hate_thresh_com']

In [None]:
# Model how comment-reply hate probability association varies with like counts using OLS regression
model6 = sm.OLS(all['hate_rep'],all[['hate_com','comment_like_count_com','tnsb','constant']])

In [None]:
# Print Model 6 results
model6.fit().summary()

In [None]:
# Model how comment-reply hate prevalence association varies with like counts using Logit regression
model7 = sm.GLM(all['hate_thresh_rep'],all[['hate_thresh_com','comment_like_count_com','tnsb_thresh','constant']],family=sm.families.Binomial())

In [None]:
# Print Model 7 results
model7.fit().summary()

In [None]:
# Create column capturing the interaction between query and like counts of comments
all['q_like'] = all['comment_like_count_com']*all['query']

In [None]:
# Create columns capturing the TRIPLE interaction between query, like counts, and hatefulness of comments
all['tnsb_int'] = all['comment_like_count_com']*all['hate_com']*all['query']
all['tnsb_int_thresh'] = all['comment_like_count_com']*all['hate_thresh_com']*all['query']

In [None]:
# Model how comment-reply hate probability association varies with like counts AND query using OLS regression
model8 = sm.OLS(all['hate_rep'],all[['hate_com','comment_like_count_com','query','tnsb','interaction','q_like','tnsb_int','constant']])

In [None]:
# Print Model 8 results
model8.fit().summary()
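
Model 8 includes all three variables, their pairwise products, and the triple product. Using shorthand introduced here ($H$ for `hate_com`, $L$ for `comment_like_count_com`, $Q$ for `query`), it can be written as:

```latex
\text{hate\_rep}_i = \beta_0 + \beta_1 H_i + \beta_2 L_i + \beta_3 Q_i
                   + \beta_4 (L_i H_i)      % tnsb
                   + \beta_5 (Q_i H_i)      % interaction
                   + \beta_6 (Q_i L_i)      % q_like
                   + \beta_7 (Q_i L_i H_i)  % tnsb_int
                   + \varepsilon_i

% Slope of comment hate, as a function of likes and query:
\frac{\partial\,\text{hate\_rep}}{\partial H} = \beta_1 + \beta_4 L + \beta_5 Q + \beta_7 Q L
```

Under this reading, the moderating role of like counts is $\beta_4$ for the first query and $\beta_4 + \beta_7$ for the second, so the coefficient on `tnsb_int` captures how the hate-by-likes interaction differs between the two queries.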

In [None]:
# Model how comment-reply hate prevalence association varies with like counts AND query using Logit regression
model9 = sm.GLM(all['hate_thresh_rep'],all[['hate_thresh_com','comment_like_count_com','query','tnsb_thresh','int_thresh','q_like','tnsb_int_thresh','constant']],family=sm.families.Binomial())

In [None]:
# Print Model 9 results
model9.fit().summary()

#### <u>Additional considerations</u>
1.   Comment like counts are skewed --> could transform with a logarithm (e.g., log(1 + likes), since counts can be 0)
2.   The threshold of .5 for hate is arbitrary --> higher thresholds (e.g., .6, .9) are sometimes used<br>--> likely need more data
3.   Add fixed/random effects for video, comment, and/or authors<br>--> likely need more data
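
As a minimal sketch of the first consideration, the example below applies a log(1 + x) transform to made-up like counts. The values are invented, and `np.log1p` is used so that comments with zero likes remain defined (they map to 0).

```python
import numpy as np

# Invented like counts with a long right tail
likes = np.array([0, 0, 1, 2, 3, 10, 250, 12000])

# log(1 + x) compresses large counts while keeping zeros at zero
log_likes = np.log1p(likes)

print(np.round(log_likes, 2))
```

In the notebook, this could be stored in a new column (for example `all['log_likes'] = np.log1p(all['comment_like_count_com'])`, where `log_likes` is a hypothetical name) and used in place of the raw like counts when building the interaction terms.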