# Part Two - Analysing related article sentiment and bias
This workbook considers the second part of the project - taking a group of related articles and assessing their sentiment and scope for bias.
In an ideal world, where we each want to have a fair and balanced view on every topic, we would hope to read articles covering a range of perspectives on the same story.
Having developed the means to find related articles in part one of this project, this part applies several sentiment analysis techniques to look at the distribution of their sentiment. It ultimately arrives at a score to convey how balanced the coverage of a story is.

## Preparation
### Imports
First some of the required packages must be imported.

In [2]:
import argparse
import pandas as pd
import numpy as np
import operator
import nltk as nl
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import ParameterGrid
import statistics
import random

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

ModuleNotFoundError: No module named 'google.cloud'

### Parameter configuration
The parameters that control the NLP-related calculations, and that specify the domain for any grid search, are captured in the runParams dict. This includes the location of the key input files.
The runParams dict is converted into an sklearn ParameterGrid, even if there is no grid search requirement (in which case it's processed as a single scenario grid search).

In [None]:
runParams={'sentiment_library':   ['google','vader','stanford'],
           'input_file':          ['./data/articles.csv'],
           'article_id_list':     [[120639,80103,25225,21502,57362,120636]],
           'sentiment_sentences': [5],
           'article_stats':       [False]}

# Use parameter grid even if there is only one set of parameters
parameterGrid=ParameterGrid(runParams)
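
To make the grid expansion concrete, the following sketch mimics what a parameter grid does with a dict of lists: it produces one scenario per combination of values. It uses only `itertools.product` from the standard library as a stand-in for `ParameterGrid`. The helper name `expand_grid` is hypothetical, and the ordering of the scenarios is a choice made for this sketch rather than a statement about sklearn.

```python
from itertools import product

def expand_grid(param_dict):
    """Return one dict per combination of the listed parameter values."""
    keys = sorted(param_dict)  # fixed key order, chosen for this sketch
    return [dict(zip(keys, combo))
            for combo in product(*(param_dict[k] for k in keys))]

# A cut-down copy of runParams (input file and article list omitted for brevity)
run_params = {
    'sentiment_library': ['google', 'vader', 'stanford'],
    'sentiment_sentences': [5],
    'article_stats': [False],
}

scenarios = expand_grid(run_params)
for scenario in scenarios:
    print(scenario)
```

Only `sentiment_library` lists more than one value, so the grid holds three scenarios, one per library.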

### File loader for news corpus

This is the same function as used in Part One.

In [None]:
def getInputDataAndDisplayStats(filename,processDate,printSummary=False):

	df=pd.read_csv(filename)

	df=df.drop_duplicates('content')
	df=df[~df['content'].isnull()]

	# There are a large number of junk articles, many of which either don't make sense or
	# just contain a headline - as such they are useless for this analysis and may distort
	# results if left in place
	df=df[df['content'].str.len()>=200]

	# Find and remove summary NYT "briefing" articles to avoid confusing the clustering
	targetString="(Want to get this briefing by email?"
	df['NYT summary']=df['content'].map(lambda d: d[:len(targetString)]==targetString)
	df=df[df['NYT summary']==False]

	# The following removes a warning that appears in many of the Atlantic articles.
	# Since it is commonly at the beginning, it brings a lot of noise to the search for similar articles
	# And subsequently to the assessment of sentiment
	targetString="For us to continue writing great stories, we need to display ads.             Please select the extension that is blocking ads.     Please follow the steps below"
	# regex=False so that the full stops in the target string are matched literally
	df['content']=df['content'].str.replace(targetString,'',regex=False)

	# This is also for some Atlantic articles for the same reasons as above
	targetString="This article is part of a feature we also send out via email as The Atlantic Daily, a newsletter with stories, ideas, and images from The Atlantic, written specially for subscribers. To sign up, please enter your email address in the field provided here."
	df=df[df['content'].str.contains(targetString,regex=False)==False]

	# This is also for some Atlantic articles for the same reasons as above
	targetString="This article is part of a feature we also send out via email as Politics  Policy Daily, a daily roundup of events and ideas in American politics written specially for newsletter subscribers. To sign up, please enter your email address in the field provided here."
	df=df[df['content'].str.contains(targetString,regex=False)==False]

	# More Atlantic-specific removals (for daily summaries with multiple stories contained)
	df=df[df['content'].str.contains("To sign up, please enter your email address in the field")==False]

	# Remove daily CNN summary
	targetString="CNN Student News"
	df=df[df['content'].str.contains(targetString)==False]

	if printSummary:
		print("\nArticle counts by publisher:")
		print(df['publication'].value_counts())

		print("\nArticle counts by date:")
		print(df['date'].value_counts())
		
	# Restrict to articles on the provided input date.
	# This date is considered mandatory for topic clustering but is not required for sentiment
	# since sentiment only processes a specified list of articles.
	# For topic clustering it is essential to have the date as it is
	# enormously significant in article matching.
	if processDate!=None:
		df=df[df['date']==processDate]
	df.reset_index(inplace=True, drop=True)

	# Remove non-ASCII characters
	df['content no nonascii']=df['content'].map(lambda x: removeNonASCIICharacters(x))

	print("\nFinal dataset:\n\nDate:",processDate,"\n")
	print(df['publication'].value_counts())

	return df

##########################################################################################

def removeNonASCIICharacters(textString): 
    return "".join(i for i in textString if ord(i)<128)

### Load the articles from the corpus
In addition to returning the data frame, the function prints the number of articles per publication. No run date is supplied here, so articles from all dates are retained. There is a relatively good mix of political viewpoints covered. More discussion of this is provided in the project report.

In [None]:
articleDataFrame=getInputDataAndDisplayStats(runParams['input_file'][0],None,False)

## Define the sentiment analysis classes
Three NLP libraries are supported for this part of the project: Vader, Google, and Stanford Core NLP. Classes will be defined to enable each of them.
### Parent Sentiment Analysis Class
The main sentiment analyser class provides a consistent wrapper and interface around classes specific to the various NLP libraries. Its constructor creates and embeds an instance of the appropriate class. It provides additional standard interfaces to trigger the analysis on an article, and to return the overall score for that article.
Additionally it manages the scaling of results across classes in order to facilitate consistency.

In [None]:
class SentimentAnalyser():

	scaleMin=-1.
	scaleMax=1.

    # Initializer / Instance attributes
	def __init__(self, library):
		if library=='google':
			self.analyser=GoogleSentimentAnalyser()
		elif library=='stanford':
			self.analyser=StanfordSentimentAnalyser()
		elif library=='vader':
			self.analyser=NLTKVaderSentimentAnalyser()
		else:
			# Fail early rather than leave self.analyser undefined
			raise ValueError("ERROR - NO RECOGNISED LIBRARY: "+str(library))

	def getOverallArticleScore(self,articleResults):

		# Google returns a document score, but it is an int, which is not useful when comparing documents
		# Hence computing the average of the sentence scores here instead
		# Google's document score is here: articleResults.document_sentiment.score
		numSentences=0.
		totalSentScore=0.
		for sentence in articleResults:
			numSentences+=1
			totalSentScore+=self.analyser.getSentenceScoreFromResults(sentence)

		value=(totalSentScore/numSentences-self.analyser.scaleMin)/(self.analyser.scaleMax-self.analyser.scaleMin)
		return self.scaleMin+value*(self.scaleMax-self.scaleMin)

	def generateResults(self,textToAnalyse):
		return self.analyser.generateResults(textToAnalyse)
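
The scaling step in `getOverallArticleScore` is a linear map from the library's own range onto the parent class's -1 to +1 range. The sketch below restates that arithmetic as a standalone function. The name `rescale` and the sample sentence scores are illustrative assumptions, not part of the project code.

```python
def rescale(value, src_min, src_max, dst_min=-1.0, dst_max=1.0):
    """Linearly map value from [src_min, src_max] onto [dst_min, dst_max]."""
    fraction = (value - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)

# Stanford-style sentence scores on a 0..4 scale (illustrative values only)
stanford_sentence_scores = [1, 2, 3, 3]
mean_score = sum(stanford_sentence_scores) / len(stanford_sentence_scores)

# A mean of 2.25 on the 0..4 scale sits slightly above the midpoint
print(rescale(mean_score, 0.0, 4.0))

# A score already on the -1..+1 scale maps to (effectively) itself
print(rescale(0.3, -1.0, 1.0))
```

With this mapping the Google and VADER scores pass through essentially unchanged, while the Stanford 0 to 4 scale is recentred so that its midpoint of 2 becomes 0.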

### Google NLP library class
The following class wraps the Google Cloud natural language client. It provides interfaces to the relevant Google methods and packages the results for return to the parent. Note that appropriate Google Cloud Platform credentials must be available - see the project report for details.

In [None]:
class GoogleSentimentAnalyser():

	scaleMin=-1.
	scaleMax=1.

	def __init__(self):
		self.client=language.LanguageServiceClient()
		return

	def generateResults(self,textToAnalyse):
		document=types.Document(
								content=textToAnalyse,
								type=enums.Document.Type.PLAIN_TEXT
								)
		return self.client.analyze_sentiment(document=document).sentences

	def getSentenceScoreFromResults(self,sentenceResults):
		return sentenceResults.sentiment.score

### Vader library class
This class provides the same interface for the VADER sentiment analyser packaged with NLTK.

In [3]:
class NLTKVaderSentimentAnalyser():
# Per NLTK Vader user guide: https://pypi.org/project/vaderSentiment/
# Typical threshold values (used in the literature cited on this page) are:
#   positive sentiment: compound score >= 0.05
#   neutral sentiment:  compound score > -0.05 and compound score < 0.05
#   negative sentiment: compound score <= -0.05

	scaleMin=-1.
	scaleMax=1.

	def __init__(self):
		self.nltkVaderAnalyser=SentimentIntensityAnalyzer()
		return

	def generateResults(self,textToAnalyse):
		ss=[]
		for sentence in nl.sent_tokenize(textToAnalyse):
			ss.append(self.nltkVaderAnalyser.polarity_scores(sentence))
		return ss

	def getSentenceScoreFromResults(self,sentenceResults):
		return sentenceResults['compound']
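
The thresholds quoted in the class comment can be applied to a `compound` score as follows. This is a minimal sketch: the function name `classify_compound` and the sample scores are made up for illustration, and the notebook itself works with the raw compound value rather than these labels.

```python
def classify_compound(compound, threshold=0.05):
    """Label a VADER compound score using the thresholds quoted above."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Illustrative compound scores, not real analyser output
for score in [0.62, 0.01, -0.05, -0.4]:
    print(score, classify_compound(score))
```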

### Stanford Core NLP library class
The same interface is provided for Stanford Core NLP. Note that the Stanford Core NLP server must be running locally for this class to work. See the instructions in the project report.

In [4]:
class StanfordSentimentAnalyser():

	scaleMin=0.
	scaleMax=4.

	def __init__(self):
		from pycorenlp import StanfordCoreNLP
		self.nlp=StanfordCoreNLP('http://localhost:9000')
		return

	def generateResults(self,textToAnalyse):
		return self.nlp.annotate(textToAnalyse,
								properties={
											 'annotators': 'sentiment',
											 'outputFormat': 'json',
											 'timeout': 100000,  # NB The original example had 1000 and that caused time-out errors
											})["sentences"]

	def getSentenceScoreFromResults(self,sentenceResults):
		return int(sentenceResults["sentimentValue"])

## Balance/Score computation methods
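Two functions are defined to score how well a set of articles covers a range of perspectives. The first uses the population standard deviation of the article scores. The second, which is the one called in the main process, multiplies the proportion of populated sentiment buckets by one minus the absolute value of the mean score.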

In [5]:
def computePopulationBalanceScore(articleScoreDict,sentimentClass):
	# Extract values from dict, then normalise to be within -1 to +1
	# Then compute population standard deviation as the balance score
	population=[-1.+(x-sentimentClass.scaleMin)/(sentimentClass.scaleMax-sentimentClass.scaleMin)*(1.-(-1.)) for x in articleScoreDict.values()]
	return statistics.pstdev(population)

def computePopulationBalanceScoreHistoMean(articleScoreDict,sentimentClass):
	# Extract values from dict, then rescale them (see the note on the 0.86 factor)
	# Then count how many equal-width sentiment buckets contain at least one article
	numBuckets=len(articleScoreDict)
	articleValues=pd.Series(articleScoreDict)
	
	# Based on 10,000 random article samples, Google's sentiment score for these articles lies within +/- 0.86
	# So, scale all scores by dividing by that value to rescale to +/- 1.00 before computing balance score
	# Ideally this should be factored in at the individual NLP library class level 
	articleValues=articleValues/0.86

	populatedBuckets=0
	for i in range(numBuckets):
		bucketFrom=sentimentClass.scaleMin+i*(sentimentClass.scaleMax-sentimentClass.scaleMin)/numBuckets
		bucketTo=bucketFrom+(sentimentClass.scaleMax-sentimentClass.scaleMin)/numBuckets
		# The following is to ensure the top of the highest bucket is counted somewhere
		# and doesn't fall out due to treatment of inequalities in ranges
		if bucketTo==sentimentClass.scaleMax:
			bucketTo+=0.001
		numSamples=((bucketFrom<=articleValues) & (articleValues<bucketTo)).sum()
		if numSamples>0:
			populatedBuckets+=1

	# Score computed as proportion of buckets which are populated (more buckets implies a more balanced view)
	# This has a value between 0 and 1.
	# This is in turn multiplied by the distance between the mean and 1.
	# So, if mean is in center (i.e. at 0) then things are balanced, so score is not decreased
	# Otherwise, score is decreased proportionately
	return (populatedBuckets/numBuckets * (1.-abs(articleValues.mean())))
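
To see how the two scores behave, the sketch below restates both ideas in plain Python and applies them to two toy sets of article scores. It is a simplified restatement rather than the functions above: it omits the 0.86 rescaling, assumes the scores are already on the -1 to +1 scale, and assigns buckets by index with clamping instead of the range comparison. The function names and the sample scores are hypothetical.

```python
import statistics

def balance_by_spread(scores):
    """Population standard deviation of scores already on the -1..+1 scale."""
    return statistics.pstdev(scores)

def balance_by_buckets(scores, scale_min=-1.0, scale_max=1.0):
    """Share of populated buckets, damped by how far the mean sits from 0."""
    num_buckets = len(scores)  # one bucket per article, as in the notebook
    width = (scale_max - scale_min) / num_buckets
    populated = set()
    for score in scores:
        index = int((score - scale_min) / width)
        # Clamp so that scores at (or beyond) the limits land in an end bucket
        populated.add(max(0, min(index, num_buckets - 1)))
    mean = sum(scores) / len(scores)
    return len(populated) / num_buckets * (1.0 - abs(mean))

balanced = [-0.5, 0.0, 0.5]     # one article in each third of the scale
one_sided = [0.58, 0.64, 0.70]  # all articles clustered on the positive side

for name, scores in [('balanced', balanced), ('one-sided', one_sided)]:
    print(name, round(balance_by_spread(scores), 3), round(balance_by_buckets(scores), 3))
```

The evenly spread set fills every bucket and has a mean of zero, so its bucket score is at the top of the range. The clustered set fills a single bucket and has a mean well away from zero, so it is penalised twice.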

## Story map loading
The story map file is an optional way to provide the algorithm with:
- A list of stories
- Each story has a name (for reference/convenience)
- Each story contains a list of articles that pertain to that story

The objective is ultimately to process each story, and within each story to measure the sentiment of each article, then (still within the story) to compute the score for the balance of the coverage (of that story).

This section of code is taken from the other workbook.

In [6]:
def setupStoryMapAndReportList(args=None,reportArticleList=None,storyMapFileName=None):
	# Story Map is used in fitting if grid search is applied (As ground truth)
	# It is also used in graph if no threshold provided (to determine colours, not to determine location)
	# Report Article List is used at the end to create a report with, for each
	# article in the list, the set of articles within tolerance, and the key words for each
	if args==None:
		articleList=reportArticleList
		fileName=storyMapFileName
	else:
		articleList=args['article_id_list']
		fileName=args['story_map_validation']

	reportArticleList=articleList
	if fileName!=None:
		storyMap=readStoryMapFromFile(fileName)
		if reportArticleList==None:
			reportArticleList=[]
			for story, articleList in storyMap.items():
				reportArticleList.append(articleList[0])
	else:
		storyMap=None
	return storyMap,reportArticleList

def readStoryMapFromFile(filename):
	return readDictFromCsvFile(filename,'StoryMap')

##########################################################################################

def readGridParameterRangeFromFile(filename):
	return readDictFromCsvFile(filename,'GridParameters')

##########################################################################################

def readDictFromCsvFile(filename,schema):
	gridParamDict={}
	with open(filename, 'r') as f:
		for row in f:
			row=row.rstrip("\r\n") # Exclude the trailing newline, if there is one
			row=row.split(",")
			key=row[0]
			vals=row[1:]
			
			if schema=='GridParameters':
				if key in ['story_threshold','tfidf_maxdf']:
					finalVals=list(float(n) for n in vals)
				elif key in ['ngram_max','tfidf_mindf','max_length']:
					finalVals=list(int(n) for n in vals)
				elif key in ['lemma_conversion','tfidf_binary']:
					finalVals=list(str2bool(n) for n in vals)
				elif key in ['parts_of_speech']:
					listlist=[]
					for v in vals:
						listlist.append(v.split("+"))
					finalVals=listlist
				elif key in ['tfidf_norm','nlp_library']:
					finalVals=vals
				else:
					print(key)
					print("KEY ERROR")
					return
			elif schema=='StoryMap':
				finalVals=list(int(n) for n in vals)
			else:
				print(schema)
				print("SCHEMA ERROR")
				return
			
			gridParamDict[key]=finalVals
	return gridParamDict
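
The file layout implied by the `StoryMap` branch above is one story per line: the story name first, followed by its article IDs, all separated by commas. The sketch below parses two such lines from an in-memory file. The function name `parse_story_map` is hypothetical, skipping blank lines is an addition made for this sketch, and the sample lines reuse story names and IDs from the story map printed later in this workbook.

```python
import io

def parse_story_map(lines):
    """Build {story name: [article ids]} from 'name,id,id,...' lines."""
    story_map = {}
    for line in lines:
        fields = line.rstrip("\r\n").split(",")
        if not fields[0]:
            continue  # skip blank lines (not done in the notebook's reader)
        story_map[fields[0]] = [int(n) for n in fields[1:]]
    return story_map

# An in-memory stand-in for a story map file
sample_file = io.StringIO(
    "Farage,37252,37468,46175\n"
    "Venezuela,172079,57375,190522\n"
)

story_map = parse_story_map(sample_file)
for story, article_ids in story_map.items():
    print(story, ":", article_ids)
```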

### Load the story map from file

In [7]:
storyMap,reportArticleList=setupStoryMapAndReportList(storyMapFileName='storyMapForValidation.csv')

Inspecting the story map we see that it forms a dict containing a key corresponding to the name of the story and a value containing a list of the article IDs germane to that story.

In [8]:
for story, articleList in storyMap.items():
    print(story,":",articleList)

Trump meeting : [151832, 110126, 172078, 48306, 57365, 190512, 26536, 71335, 21499, 23872, 142033, 110133, 23888, 71336, 57366, 71339]
Brazil impeachment : [120639, 80103, 25225, 21502, 57362, 120636, 110141]
Kaepernick : [40617, 40543, 39520, 80109, 80101, 47403]
Clinton Guccifer : [214888, 85803, 47979]
Farage : [37252, 37468, 46175]
Anthony Weiner : [49480, 110144, 142300, 214934]
SpaceX : [38658, 134545, 172095, 214894]
Safe space : [21448, 78169, 78171]
Lauer debate : [43447, 47078, 138709]
Venezuela : [172079, 57375, 190522]
Iran deal : [158005, 48823, 57373, 120634]
Penn State : [80094, 157527, 214892]
David Brown : [172085, 80096, 141886]


### Augment story map with user requested specific article list
A requested article list is explicitly part of the input parameters and may vary through each iteration of the main loop. This data may not be consistent with any provided story map file. So the following function will be required in order to reconcile any differences.

In [None]:
def collapseRequestedArticleListIntoStoryList(requestedArticleList,storyMap):
	# Check that the explicitly requested articles are all contained in the storyList
	# If they aren't, add a new story to contain them

	# If the storyMap was empty, it will be None,
	# so initialise as a dictionary ready for adding new values
	if storyMap==None:
		newStoryMap={}
	else:
		newStoryMap=storyMap.copy()

	found=False
	for story,articleListFromMap in newStoryMap.items():
		if len(articleListFromMap)==len(requestedArticleList):
			y=sum([x in articleListFromMap for x in requestedArticleList])
			if y==len(articleListFromMap):
				found=True
                
	# If there is no complete story exactly matching then add a new story to the list
	# With the first article ID as the key (arbitrarily)
	if not found:
		newStoryMap[requestedArticleList[0]]=requestedArticleList
	return newStoryMap
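
The reconciliation rule can be illustrated with a compact, set-based sketch. This is not the function above: it compares the requested IDs with each story as sets, which matches the length-and-membership check only when no IDs are repeated. The function name `add_requested_story` is hypothetical, and the IDs are taken from the story map shown earlier.

```python
def add_requested_story(requested_ids, story_map=None):
    """Return a copy of story_map, adding the requested list if no story matches it exactly."""
    new_map = dict(story_map or {})
    requested = set(requested_ids)
    if not any(set(ids) == requested for ids in new_map.values()):
        # Key the new story by its first article ID (an arbitrary choice)
        new_map[requested_ids[0]] = list(requested_ids)
    return new_map

base_map = {'Farage': [37252, 37468, 46175]}

# Same articles as an existing story, in a different order: nothing is added
unchanged = add_requested_story([46175, 37252, 37468], base_map)

# A list that matches no story exactly: a new story is added
extended = add_requested_story([120639, 80103], base_map)

print(unchanged)
print(extended)
```

By the same rule, the six-article list in runParams does not exactly match the seven-article "Brazil impeachment" story, so it would be added as a separate story keyed by 120639.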

## MAIN PROCESS
Note that the earlier steps need to have been run before this step in order to populate the following:
- articleDataFrame
- parameterGrid

Also, recall that for Google Cloud Platform to work the necessary credentials need to have been created and referenced in the environment.

And for Stanford Core NLP to work, the Stanford Java server needs to be running locally.

- java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 1000000

Vader will operate without any additional services, provided the Python environment has all the correct packages, per my project requirements file.
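
Before starting a long run it can help to check these prerequisites up front. The following is a rough sketch under stated assumptions: it looks for the `GOOGLE_APPLICATION_CREDENTIALS` environment variable (Google client libraries can also find credentials by other means, so a missing variable is only a hint), and it tests whether anything is listening on the port used by the Stanford class above (localhost:9000). Both helper names are hypothetical.

```python
import os
import socket

def google_credentials_configured():
    """True if GOOGLE_APPLICATION_CREDENTIALS is set and points at an existing file."""
    path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    return bool(path) and os.path.isfile(path)

def server_listening(host='localhost', port=9000, timeout=1.0):
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("Google credentials file found:", google_credentials_configured())
print("Server listening on localhost:9000:", server_listening())
```

A successful connection only shows that some process is listening on the port, not that it is the Stanford server.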

In [None]:
for i,currentParams in enumerate(parameterGrid):
	if len(parameterGrid)>1:
		print("Combination:",i+1,"of",len(parameterGrid))
		print(currentParams)

	# The base storyMap may be appended with an iteration specific
	# article list, depending on the request details in the parameter grid
	iterationStoryMap=collapseRequestedArticleListIntoStoryList(currentParams['article_id_list'],
																storyMap)

	sentimentAnalyser=SentimentAnalyser(currentParams['sentiment_library'])

	for story,articleList in iterationStoryMap.items():
		articleSentScores={}
		print("ANALYSING STORY:",story,"using",currentParams['sentiment_library'])
		print("Number of articles in story:",len(articleList))
		for article in articleList:

			articleContent=articleDataFrame[articleDataFrame['id']==article].iloc[0]['content']

			# if requested, only use the first few sentences for analysis
			if currentParams['sentiment_sentences']!=None:
				articleSentences=nl.sent_tokenize(articleContent)
				textToAnalyse=' '.join(articleSentences[:currentParams['sentiment_sentences']])	
			else:
				textToAnalyse=articleContent

			results=sentimentAnalyser.generateResults(textToAnalyse)

			articleSentScores[article]=sentimentAnalyser.getOverallArticleScore(results)

		# Sort and display results
		sortedArticleSentScores=sorted(articleSentScores.items(), key=operator.itemgetter(1))
		print("\nArticle sentiments, most positive first:")
		for article in reversed(sortedArticleSentScores):
			print(article[0],":", round(article[1],3),articleDataFrame[articleDataFrame['id']==article[0]].iloc[0]['publication'])

		# This only works because each article's score is constrained to be in -1 to +1
		# (for the population standard deviation variant, that bounds the score between 0 and 1)
		# The histogram/mean variant is used here; it reads the scale limits from the
		# SentimentAnalyser class attributes, which is why the class itself is passed in
		# Should arguably build this into the classes somewhere, but so far I don't have any
		# classes that pertain to the population, rather than to an individual article
		print("\nBALANCE SCORE:",round(computePopulationBalanceScoreHistoMean(articleSentScores,SentimentAnalyser)*100.,1),"\n")

## Article inspection
To inspect the raw content of one of the articles, substitute the desired ID in the following command and run it.

In [None]:
articleDataFrame[articleDataFrame['id']==80101]['content'].values[0]