<a href="https://colab.research.google.com/github/ykim71/google_toxicity/blob/main/google_toxicity_update.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load File via GDrive



In [1]:
"""
Run this code and it will bring your Google account access permission. 
This gives Colab direct access for any files in your Google Drive.
OR, upload file from local
"""
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [2]:
"""
This code changes your working directory that Colab is on. I created and set 'toxicity' folder in my Google Drive. 
I can load and save data on the 'toxicity' folder in my Google Drive
"""
%cd drive/'MyDrive'/toxicity/

/content/drive/MyDrive/toxicity


## load your file to Colab

In [3]:
"""
load your file on Colab using following code. Replace 'sample_code_review.csv' with your file name. 
I set my data as 'sample_text' so you can replce it other name. 
"""

import pandas as pd

sample_text = pd.read_csv('sample_df.csv') # OR, /content/sample_code_review.csv


In [11]:
"""
take random 3 samples to see if data has loaded successfully; 'text' is the column that you want to analyze.
"""
#len(sample_text)
sample_text = sample_text.sample(100)
sample_text.sample(3)

Unnamed: 0,Unnamed: 0.1,text
162,257675,"Cannot definitively say. Dr. Stephen Smith, Di..."
961,132025,"Erin, we also learned tonight that the Califor..."
456,358089,Let me get to your list so we don`t waste a se...


# Perpective API toxicity 



> Language Attributes: https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages



> API Request: https://developers.perspectiveapi.com/s/docs-get-started (note UT Google Account may not work; recommend using personal Google account for request)



In [5]:
"""
load packages/libraries
"""
from googleapiclient import discovery
from googleapiclient.errors import HttpError


In [6]:
"""
Enter your API here;
"""
API_KEY='your-api'


In [7]:
"""
Run this code if you want to analyze text data 4 measures of Toxicity, Likely to reject, Insult, and Identity Attact. 
See below comments for other variables and descriptions in detail.

"""
# variable descriptions: https://github.com/conversationai/perspectiveapi
# you can replace toxicity attributes here:
analyze_request = {
   'comment': { 'text': 'xx'}, # setting formats (id, text)
   'requestedAttributes': {'TOXICITY@6': {}, # see the actual variable name from the Perspective API page
                           'LIKELY_TO_REJECT@2': {}, 
                           'INSULT': {}, 
                           'IDENTITY_ATTACK': {} 
                           },
   'doNotStore': True, # for other settings, https://developers.perspectiveapi.com/s/about-the-api-methods
   'languages' : 'en'
}



In [8]:
# for a single text
import json

def incivility_measures(text):
  
  service = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,)
  
  analyze_request['comment']['text'] = text
  
  response = service.comments().analyze(body=analyze_request).execute()
  i = json.loads(json.dumps(response, indent=2))
  
  toxicity = i['attributeScores']['TOXICITY@6']['summaryScore']['value']
  reject = i['attributeScores']['LIKELY_TO_REJECT@2']['summaryScore']['value']
  insult = i['attributeScores']['INSULT']['summaryScore']['value']
  identity = i['attributeScores']['IDENTITY_ATTACK']['summaryScore']['value']
  
  print("text:" + text + "\ntoxicity:" + str(toxicity) + "\nreject:" + str(reject) + "\ninsult:" + str(insult) + "\nidentity:" + str(identity))


In [None]:

text = 'To all those thinking about voting for Trump, remember...  IN YOUR HEART... YOU KNOW HE\'S SHIT'
incivility_measures(text)

text = sample_text.text[1]
incivility_measures(text)


text:To all those thinking about voting for Trump, remember...  IN YOUR HEART... YOU KNOW HE'S SHIT
toxicity:0.94727516
reject:0.99898785
insult:0.71120167
identity:0.09659086
text:And with red tape slashed, scientists, medical researchers, our doctors are now rapidly developing not only treatments, but hopefully a vaccine. This unprecedented cooperation between the government, private industry, this will not forever change the way this country and the world will deal with future pandemics and crises, just like the travel ban.
toxicity:0.072587214
reject:0.06286466
insult:0.009431887
identity:0.0027008436


In [9]:
# run this code chunk
import csv
import codecs
import json
import time
import pandas as pd

def incivility_for_chunks(sample_text_file, text):
  
  start = time.time()

  service = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,)
  
  comments_toxicity_list = []
  comments_reject_list = []
  comments_insult_list = []
  comments_identity_list = []
  
  for i in text: 
    analyze_request['comment']['text'] = i
    
    try:
      response = service.comments().analyze(body=analyze_request).execute()
      i = json.loads(json.dumps(response, indent=2))
    
      comments_toxicity = i['attributeScores']['TOXICITY@6']['summaryScore']['value']
      comments_reject = i['attributeScores']['LIKELY_TO_REJECT@2']['summaryScore']['value']
      comments_insult = i['attributeScores']['INSULT']['summaryScore']['value']
      comments_identity = i['attributeScores']['IDENTITY_ATTACK']['summaryScore']['value']
        
    except HttpError:
      comments_toxicity = "error"
      comments_reject = "error"
      comments_insult = "error"
      comments_identity = "error"
      time.sleep(10.0) # added 10 second pause when error occurs
            
    comments_toxicity_list.append(comments_toxicity)
    comments_reject_list.append(comments_reject)
    comments_insult_list.append(comments_insult)
    comments_identity_list.append(comments_identity)
        
  temp = pd.DataFrame({'toxicity': comments_toxicity_list,
                       'reject': comments_reject_list, 
                       'insult': comments_insult_list, 
                       'attack': comments_identity_list})
  
  end=time.time()
  print("complete time: ", round(end -start, 2))

  return temp

In [12]:

"""
HERE is the code you need to chage. My data file name is 'sample_text' and the column name is 'text'. 
You can change your file and the text column name here.  
For other example, if your data name is 'df' and the text column name is 'comment', 
the first line of following code is supposed to be:

text = df.comment.values.tolist(): 

"""

text = sample_text.text.values.tolist()

"""
I switched the processs with a definition function for multiple running

add arguments into the definition below; (1) 'sample_text' is the datafile you have, 
'text' is the text data that you want to analyze here (we already assign text here)
For other example, if your data name is 'df' and the text column name is 'comment',

text = df.comment.values.tolist()

incivility_for_chunks(df, text)

"""
measures = incivility_for_chunks(sample_text, text)


complete time:  127.96


In [13]:
# merge output with the text 
sample_text_merge = pd.concat([sample_text.reset_index(drop=True), measures], axis=1)


In [14]:
"""
see if measures have been computed successfully;
"""
sample_text_merge

Unnamed: 0,Unnamed: 0.1,text,toxicity,reject,insult,attack
0,42640,A health screening component can be a temperat...,0.06701,0.093243,0.007019,0.001979
1,105046,"Right now, the data does not support the routi...",0.091227,0.127259,0.010135,0.002359
2,207151,So I think what Ive been encouraged by is that...,error,error,error,error
3,382259,People go -- people see social distancing. The...,0.059191,0.051179,0.025027,0.005402
4,230131,And we kind of looked the other way in terms o...,error,error,error,error
...,...,...,...,...,...,...
95,265538,"The nice thing about these measures, Laura, is...",error,error,error,error
96,332809,"Now, a new report estimates that the real numb...",error,error,error,error
97,371313,There was delay after delay after delay as hea...,0.096617,0.098703,0.018323,0.002322
98,83204,The project leader told a British newspaper th...,0.128401,0.387891,0.009166,0.005032


In [22]:
"""
check if there's error; errors can occur due to many reasons such as the API limit (in this case you may have to re-run those that aren't processed) 
or the Perpective API can't analyze the text because of different languages or so.

"""

sample_text_merge[sample_text_merge['toxicity']=="error"]


Unnamed: 0,Unnamed: 0.1,text,toxicity,reject,insult,attack
2,207151,So I think what Ive been encouraged by is that...,error,error,error,error
4,230131,And we kind of looked the other way in terms o...,error,error,error,error
5,427080,And there will be no letting up on continuing ...,error,error,error,error
6,205755,"And this is something, Wolf, we have actually ...",error,error,error,error
7,372105,Go to China. The entire phenomenon seems rever...,error,error,error,error
8,327524,Im very conservative it comes to taxpayer mone...,error,error,error,error
13,216855,"Great, great. So, your colleague Senator Cotto...",error,error,error,error
17,358089,Let me get to your list so we don`t waste a se...,error,error,error,error
18,373075,"No doubt, there will be many similarities, but...",error,error,error,error
94,247670,"And, Geraldo, thats insane when you say -- irr...",error,error,error,error


In [24]:
"""
I'm going to re-run those errors and merge them to those completed;
(1) save complete cases separately
(2) select error cases and remove error columns (toxicity, etc)
"""
complete_cases = sample_text_merge[sample_text_merge['toxicity']!="error"]
error_cases = sample_text_merge[sample_text_merge['toxicity']=="error"]
error_cases.drop(['toxicity','reject', 'insult','attack'], axis=1, inplace=True)


In [25]:
error_cases

Unnamed: 0,Unnamed: 0.1,text
2,207151,So I think what Ive been encouraged by is that...
4,230131,And we kind of looked the other way in terms o...
5,427080,And there will be no letting up on continuing ...
6,205755,"And this is something, Wolf, we have actually ..."
7,372105,Go to China. The entire phenomenon seems rever...
8,327524,Im very conservative it comes to taxpayer mone...
13,216855,"Great, great. So, your colleague Senator Cotto..."
17,358089,Let me get to your list so we don`t waste a se...
18,373075,"No doubt, there will be many similarities, but..."
94,247670,"And, Geraldo, thats insane when you say -- irr..."


In [26]:

text = error_cases.text.values.tolist()

measures = incivility_for_chunks(error_cases, text)

# merge output with the text 
error_cases = pd.concat([error_cases.reset_index(drop=True), measures], axis=1)


complete time:  1.2


In [27]:
error_cases

Unnamed: 0,Unnamed: 0.1,text,toxicity,reject,insult,attack
0,207151,So I think what Ive been encouraged by is that...,0.147925,0.224447,0.025027,0.009434
1,230131,And we kind of looked the other way in terms o...,0.237338,0.238034,0.061102,0.102216
2,427080,And there will be no letting up on continuing ...,0.030776,0.093212,0.007209,0.002682
3,205755,"And this is something, Wolf, we have actually ...",0.244812,0.236526,0.016937,0.007992
4,372105,Go to China. The entire phenomenon seems rever...,0.28136,0.127106,0.047935,0.058402
5,327524,Im very conservative it comes to taxpayer mone...,0.105799,0.464183,0.010211,0.001887
6,216855,"Great, great. So, your colleague Senator Cotto...",0.119004,0.177849,0.018646,0.004606
7,358089,Let me get to your list so we don`t waste a se...,0.156157,0.204883,0.028061,0.003459
8,373075,"No doubt, there will be many similarities, but...",0.160419,0.180249,0.027347,0.004847
9,247670,"And, Geraldo, thats insane when you say -- irr...",0.208115,0.643321,0.086853,0.003219


In [28]:
"""
check the output again; see if the error is due to the API limit or different language issues;
if there's error again, repeat the code above;

"""
error_cases[error_cases['toxicity']=='error']

Unnamed: 0,Unnamed: 0.1,text,toxicity,reject,insult,attack


In [29]:
"""
merge all files

"""

sample_text_final = pd.concat([complete_cases, error_cases])
sample_text_final.sort_index(inplace=True)


In [30]:
sample_text_final

Unnamed: 0,Unnamed: 0.1,text,toxicity,reject,insult,attack
0,42640,A health screening component can be a temperat...,0.06701,0.093243,0.007019,0.001979
0,207151,So I think what Ive been encouraged by is that...,0.147925,0.224447,0.025027,0.009434
1,105046,"Right now, the data does not support the routi...",0.091227,0.127259,0.010135,0.002359
1,230131,And we kind of looked the other way in terms o...,0.237338,0.238034,0.061102,0.102216
2,427080,And there will be no letting up on continuing ...,0.030776,0.093212,0.007209,0.002682
...,...,...,...,...,...,...
92,77991,And so now were dealing with an American probl...,0.154927,0.052736,0.010135,0.009286
93,87358,Its science. The president is asking us to sta...,0.334079,0.385375,0.168611,0.004736
97,371313,There was delay after delay after delay as hea...,0.096617,0.098703,0.018323,0.002322
98,83204,The project leader told a British newspaper th...,0.128401,0.387891,0.009166,0.005032


In [None]:
"""
Exporting the results -- Save the data to your Google Drive OR the Colab environment. 
You can also find your data at the folder icon at the left side and download it.
"""

sample_text_final.to_csv('toxicity_done.csv')