# Data Cleaning
Script for cleaning the data in the quotes Excel sheet so that it can be used as VADER input.

Each row should contain:
* Text id (from 'full' tab)
* Text name ('full')
* Quotes
 * All quotes in 'full' tab
 * All quotes from the same file in the same row
 * Quotes separated by newline (merge quotes)
* Non-quotes ('non_quoted_text', copied as is)
* Speaker ('full')
 * Merged and newline separated, as with Quotes
* Verb ('full')
 * Merged and newline separated, as with Quotes

Output CSV header names: text_id, text_name, quotes, non_quotes, speakers, verbs

In [1]:
# run this cell only when reading the data from Google Drive (e.g. in Colab)
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

## Important: replace the file paths with the correct ones when running this code locally

In [44]:
# extract relevant columns from both excel sheets

# replace this with relevant file path
fp = '/content/drive/My Drive/evaluation_quotes/quotes.xlsx'

full_df = pd.read_excel(fp, sheet_name='full', usecols=["text_id", "text_name", "quote", "speaker", "verb"])

non_quotes_df = pd.read_excel(fp, sheet_name = 'non_quoted_text')

In [46]:
# check number of articles in the non_quoted_text sheet
len(non_quotes_df)

100

In [45]:
# check number of articles in the full sheet
unique_ids = full_df['text_id'].nunique()

unique_names = full_df['text_name'].nunique()

print("number of unique ids: ", unique_ids)
print("number of unique names: ", unique_names)

number of unique ids:  96
number of unique names:  96
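The two counts differ (100 rows in `non_quoted_text`, 96 unique ids in `full`), which suggests that some articles have no rows in the `full` sheet. A minimal sketch for listing them is given here; it uses small made-up frames in place of the real sheets, so the ids, names and texts are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the two sheets; ids, names and texts are made up.
full_df = pd.DataFrame({
    "text_id": ["a1", "a1", "b2"],
    "text_name": ["art_one", "art_one", "art_two"],
    "quote": ["q1", "q2", "q3"],
})
non_quotes_df = pd.DataFrame({
    "text_id": ["a1", "b2", "c3"],
    "text_name": ["art_one", "art_two", "art_three"],
    "non_quoted_text": ["t1", "t2", "t3"],
})

# Articles present in non_quoted_text but without any row in full.
missing = non_quotes_df[~non_quotes_df["text_id"].isin(full_df["text_id"])]
missing_ids = missing["text_id"].tolist()
print(missing_ids)
```

Run against the real dataframes, the same `isin` filter would show which articles are absent from `full`.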


In [47]:
# merge quotes, speakers and verbs per article, separated by newlines
merged_df = full_df.groupby(['text_id', 'text_name']).agg({
    'quote': lambda x: '\n'.join(x),
    'speaker': lambda x: '\n'.join(map(str, x)),
    'verb': lambda x: '\n'.join(map(str, x))
}).reset_index()

merged_df.head()

Unnamed: 0,text_id,text_name,quote,speaker,verb
0,01aceb45444212877ad3c6b8a340ac85,2021_02_05_ShaniaOBrien,In the official statement made this morning\nt...,the NUS\nthe NUS\nJackie Chen from the SA Labo...,said\nsaid\nreporting\ncontinues\nnan\nreached
1,054c82651b895adb42592c3b55b04fde,2021_10_10_MaxShanahan,"that ""when casuals do claim the actual hours t...",Staff\nA spokesperson for the USyd Casuals Net...,told\ntelling\ntold\ncriticised\nnan\nhighligh...
2,0740ab6bebf7c4c8575f950bfce8d8a8,2021_05_02_ClaireOllivain,it's pretty obvious that there's no threat her...,The security\nThe security\nEAG member Holly H...,said\nsaid\nsaid\ntold\ntold\nsaid\nsaid\ntold...
3,07d7f15966bdf625f5358fbb179b5033,2021_11_27_MarlowHurst_ShaniaOBrien_SamuelGarrett,that they were now available in the bathrooms ...,Mills\n2021 Sydney University Dramatic Society...,reported\nsaid\nindicated\nnoted\naccording to...
4,0a7f70b8d6612b7964adca2db0ae0242,2022_05_11_RileyVaughan,The University of Sydney Union has been using ...,USU President Prudence Wilkins-Wheat\nnan,told\nnan
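The literal `nan` entries in the `speaker` and `verb` columns come from `map(str, x)`, which turns missing values into the string `'nan'`. If those placeholders are not wanted, one possible alternative is to skip missing values before joining. This is a sketch on made-up data; `join_present` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

def join_present(values):
    """Join the non-missing values with newlines ('' if all are missing)."""
    return "\n".join(str(v) for v in values if pd.notna(v))

# Toy data with missing speakers and verbs; all values are made up.
toy = pd.DataFrame({
    "text_id": ["a1", "a1", "b2"],
    "text_name": ["art_one", "art_one", "art_two"],
    "quote": ["first quote", "second quote", "third quote"],
    "speaker": ["Alice", None, None],
    "verb": ["said", None, "told"],
})

merged = (
    toy.groupby(["text_id", "text_name"])
    .agg({"quote": join_present, "speaker": join_present, "verb": join_present})
    .reset_index()
)
print(merged)
```

Note the trade-off: dropping missing values means the n-th speaker or verb no longer necessarily lines up with the n-th quote. If that positional alignment matters, keeping an explicit placeholder (as the original cell effectively does) may be preferable.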


In [48]:
# merge the aggregated full dataframe with the non_quotes dataframe;
# pd.merge defaults to an inner join, so only articles present in both dataframes are kept
# resulting headers: text_id, text_name, quote, speaker, verb, non_quoted_text
output_df = pd.merge(merged_df, non_quotes_df, on=['text_id', 'text_name'])

output_df.head()

Unnamed: 0,text_id,text_name,quote,speaker,verb,non_quoted_text
0,01aceb45444212877ad3c6b8a340ac85,2021_02_05_ShaniaOBrien,In the official statement made this morning\nt...,the NUS\nthe NUS\nJackie Chen from the SA Labo...,said\nsaid\nreporting\ncontinues\nnan\nreached,"US condemns ""horrific"" assault on internationa..."
1,054c82651b895adb42592c3b55b04fde,2021_10_10_MaxShanahan,"that ""when casuals do claim the actual hours t...",Staff\nA spokesperson for the USyd Casuals Net...,told\ntelling\ntold\ncriticised\nnan\nhighligh...,"fter USyd's denial, Fair Work Ombudsman issues..."
2,0740ab6bebf7c4c8575f950bfce8d8a8,2021_05_02_ClaireOllivain,it's pretty obvious that there's no threat her...,The security\nThe security\nEAG member Holly H...,said\nsaid\nsaid\ntold\ntold\nsaid\nsaid\ntold...,ensions escalate at UTSSA; President calls sec...
3,07d7f15966bdf625f5358fbb179b5033,2021_11_27_MarlowHurst_ShaniaOBrien_SamuelGarrett,that they were now available in the bathrooms ...,Mills\n2021 Sydney University Dramatic Society...,reported\nsaid\nindicated\nnoted\naccording to...,SU Board Meeting: Honourary Secretary resigns\...
4,0a7f70b8d6612b7964adca2db0ae0242,2022_05_11_RileyVaughan,The University of Sydney Union has been using ...,USU President Prudence Wilkins-Wheat\nnan,told\nnan,REAKING: USU election loophole allows voter fr...
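Because `pd.merge` performs an inner join by default, articles that appear in only one of the two dataframes are dropped from `output_df`. If every article in `non_quoted_text` should be kept, a left join from that sheet is one option. The sketch below uses made-up frames, and `all_articles` is a hypothetical name:

```python
import pandas as pd

# Toy stand-ins for merged_df and non_quotes_df; all values are made up.
merged_df = pd.DataFrame({
    "text_id": ["a1", "b2"],
    "text_name": ["art_one", "art_two"],
    "quote": ["q1\nq2", "q3"],
})
non_quotes_df = pd.DataFrame({
    "text_id": ["a1", "b2", "c3"],
    "text_name": ["art_one", "art_two", "art_three"],
    "non_quoted_text": ["t1", "t2", "t3"],
})

all_articles = pd.merge(
    non_quotes_df,
    merged_df,
    on=["text_id", "text_name"],
    how="left",              # keep every article from non_quoted_text
    validate="one_to_one",   # raise if a key is duplicated on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Articles that had no matching row in merged_df (their quote is NaN).
unmatched = all_articles.loc[all_articles["_merge"] == "left_only", "text_id"].tolist()
print(unmatched)
```

Articles without quotes then carry `NaN` in the `quote`, `speaker` and `verb` columns, which any downstream step would need to handle.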


## Remember to replace the output file path with your own

In [49]:
# write the dataframe to an Excel file
# replace this file path with the relevant output path
output = '/content/drive/My Drive/evaluation_quotes/quotes_input.xlsx'

output_df.to_excel(output, index=False)
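The header names listed at the top of this notebook (`text_id, text_name, quotes, non_quotes, speakers, verbs`) differ from the column names written here (`quote`, `speaker`, `verb`, `non_quoted_text`). If the output should follow that list, a rename step could be added before writing. This is a sketch on made-up data; `HEADER_MAP` and `COLUMN_ORDER` are hypothetical names, and the CSV is written to an in-memory buffer rather than a real path:

```python
import io
import pandas as pd

# Mapping from the notebook's column names to the header names in the spec.
HEADER_MAP = {
    "quote": "quotes",
    "non_quoted_text": "non_quotes",
    "speaker": "speakers",
    "verb": "verbs",
}
COLUMN_ORDER = ["text_id", "text_name", "quotes", "non_quotes", "speakers", "verbs"]

# Toy stand-in for output_df; all values are made up.
toy_output = pd.DataFrame({
    "text_id": ["a1"],
    "text_name": ["art_one"],
    "quote": ["q1\nq2"],
    "speaker": ["Alice\nBob"],
    "verb": ["said\ntold"],
    "non_quoted_text": ["body text"],
})

renamed = toy_output.rename(columns=HEADER_MAP)[COLUMN_ORDER]

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
renamed.to_csv(buffer, index=False)
header_line = buffer.getvalue().splitlines()[0]
print(header_line)
```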

# Playing around with VADER
Installed the vaderSentiment analysis tool and extracted the quotes from the first news article to check input compatibility.

In [50]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [51]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [56]:
# test VADER on the quotes of the first article
quotes = output_df['quote'].iloc[0]

quotes = quotes.split('\n')
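Splitting on `'\n'` reverses the newline merge from the cleaning step. The same split applied to the `speaker` or `verb` columns would also return the `nan` placeholders, and any blank lines would come back as empty strings. A small helper along the following lines could filter those out; `split_merged` is a hypothetical name, and the sketch assumes that no genuine value is literally `nan` and that individual quotes contain no newlines of their own:

```python
def split_merged(cell, skip=("", "nan")):
    """Split a newline-merged cell into a list, dropping blank and 'nan' entries."""
    parts = (part.strip() for part in str(cell).split("\n"))
    return [part for part in parts if part not in skip]

# Made-up merged cell with a blank line and a 'nan' placeholder.
merged_quotes = "first quote\n\nsecond quote\nnan"
quotes = split_merged(merged_quotes)
print(quotes)
```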

In [57]:
analyzer = SentimentIntensityAnalyzer()
for quote in quotes:
    vs = analyzer.polarity_scores(quote)
    print("{:-<65} {}".format(quote, str(vs)))

In the official statement made this morning---------------------- {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
that they "…[stand] in solidarity with this student, and International Student workers across the country who are faced with such abuse daily {'neg': 0.159, 'neu': 0.758, 'pos': 0.083, 'compound': -0.4588}
that many students in their district were paid as low as $5 an hour {'neg': 0.139, 'neu': 0.861, 'pos': 0.0, 'compound': -0.2732}
"Universities and Governments fail to adequately arm student workers with the resources they desperately need to navigate workplace issues {'neg': 0.264, 'neu': 0.736, 'pos': 0.0, 'compound': -0.7269}
"International students in Australia face a barrage of malprotection in their workplaces. From most not being paid the award rate, to sexual harassment and workplace bullying, and unfair termination if and when they do speak out about their treatment." {'neg': 0.214, 'neu': 0.714, 'pos': 0.071, 'compound': -0.7906}
"taking students on pub cr