# Part One - Clustering news articles according to inferred shared stories
There are two parts to this project. The first part deals with takes a body of news articles and clustering them into stories. That part is covered in this workbook.
The second part takes the results of the clustering (i.e. a list of articles pertaining to a single story) and ranks them in order to determine if the set of articles correspond to a balanced representation of the story, or a biased representation. This second part is covered in a second notebook.

To clarify some terms:
- ARTICLE - a single article printed by one news publication
- STORY - the underlying event that an article is in reference to

In many instances the universe of publications will feature multiple articles on any one story. The questions were are ultimately seeking to address here are:
- FURTHER READING - given that a user has read one article, which other articles should the user read in order to not have a biased perspective on the underlying event?
- FAIRNESS - is the complete set of published articles on that story biased or fair?

This workbook contains code required to:
- load a large corpus of new articles
- process them using a series of NLP methods
- group the articles into inferred stories
- analyse and graph the results
- perform a grid search if required in order to optimise the hyper parameters

## Preparation
### Imports
The first step is to import some of the required packages.

In [1]:
import argparse
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
import csv

### Parameter configuration
The parameters used to control the NLP-related calculations, and to specify the domain for any grid search are captured in the runParams dict. This includes specification of the location of the key input files.
The runParams dict is converted into an sklearn ParameterGrid, even if there is no grid search requirement (in which case it's processed as a single scenario grid search).

NB All parameters need to be lists (or lists of lists) - requirement of ParameterGrid.

In [2]:
runParams={'tfidf_maxdf':      [0.5],
           'input_file':       ['./data/articles.csv'],
           'story_threshold':  [0.26],
           'process_date':     ['2016-09-01'],
           'parts_of_speech':  [['PROPER', 'VERB']],
           'lemma_conversion': [False],
           'ngram_max':        [3],
           'tfidf_binary':     [False],
           'tfidf_norm':       ['l2'],
           'nlp_library':      ['nltk'],
           'max_length':       [50],
           'stop_words_file':  ['./data/stopWords.txt'],
           'tfidf_mindf':      [2],
           'display_graph':    [True],
           'article_stats':    [False]}

# Use parameter grid even if there is only set of parameters
parameterGrid=ParameterGrid(runParams)

### NLP Libraries
Two NLP libraries are used in this worksheet - ntkl and spaCy. Depending on which is/are requested in the run parameters, the following section of code loads the relevant packages. In addition it initialises a dictionary to translate a common set of Parts of Speech into the corresponding set of tokens specific to each library.

In [24]:
# Load and initialise required NLP libraries
pos_nlp_mapping={}
nl=None
wordnet_lemmatizer=None
nlp=None
if 'spaCy' in runParams['nlp_library']:
	import spacy
	nlp=spacy.load('en')
	pos_nlp_mapping['spaCy']={'VERB':['VERB'],'PROPER':['PROPN'],'COMMON':['NOUN']}

if 'nltk' in runParams['nlp_library']:
	import nltk as nl
	if True in runParams['lemma_conversion']:
		from nltk.stem import WordNetLemmatizer
		wordnet_lemmatizer=WordNetLemmatizer()
	else:
		wordnet_lemmatizer=None
	pos_nlp_mapping['nltk']={'VERB':['VB','VBD','VBG','VBN','VBP','VBZ'],'PROPER':['NNP','NNPS'],'COMMON':['NN','NNS']}

### File loader for news article corpus
The following function loads the text file specified in the run parameters and converts it into a Pandas data frame.
It proceeds to perform some clean up on the data, effectively removing articles that are in some sense or other corrupt and will not be able to processed by the algorithm.
This set includes summary articles which effectively contain a single sentence on a large number of stories. It also removes some standardised text common to certain articles, since that text contains no information about the story itself and hence will create noise in the algorithm.

#### Note on article dates
The loader function below needs to be called with the selected articles to be constrained to a single date. This is because a key feature of "news" reporting is that it is current. As a result two articles are hugely more likely to pertain to the same story if they are published on the same day. Including the full set of dates has the effect of creating a lot of noise for the vectorizing and clustering, thus resulting in both slow performance and inaccurate results.
There are additional techniques for pairing articles across different dates. These are discussed in the project report.

In [25]:
def getInputDataAndDisplayStats(filename,processDate,printSummary=False):

	df=pd.read_csv(filename)
	df=df.drop_duplicates('content')
	df=df[~df['content'].isnull()]

	# There are a large number of junk articles, many of which either don't make sense or
	# just contain a headline - as such they are useless for this analysis and may distort
	# results if left in place
	df=df[df['content'].str.len()>=200]

	# Find and remove summary NYT "briefing" articles to avoid confusing the clustering
	targetString="(Want to get this briefing by email?"
	df['NYT summary']=df['content'].map(lambda d: d[:len(targetString)]==targetString)
	df=df[df['NYT summary']==False]

	# The following removes a warning that appears in many of the Atlantic articles.
	# Since it is commonly at the beginning, it brings a lot of noise to the search for similar articles
	# And subsequently to the assessment of sentiment
	targetString="For us to continue writing great stories, we need to display ads.             Please select the extension that is blocking ads.     Please follow the steps below"
	df['content']=df['content'].str.replace(targetString,'')

	# This is also for some Atlantic articles for the same reasons as above
	targetString="This article is part of a feature we also send out via email as The Atlantic Daily, a newsletter with stories, ideas, and images from The Atlantic, written specially for subscribers. To sign up, please enter your email address in the field provided here."
	df=df[df['content'].str.contains(targetString)==False]

	# This is also for some Atlantic articles for the same reasons as above
	targetString="This article is part of a feature we also send out via email as Politics  Policy Daily, a daily roundup of events and ideas in American politics written specially for newsletter subscribers. To sign up, please enter your email address in the field provided here."
	df=df[df['content'].str.contains(targetString)==False]

	# More Atlantic-specific removals (for daily summaries with multiple stories contained)
	df=df[df['content'].str.contains("To sign up, please enter your email address in the field")==False]

	# Remove daily CNN summary
	targetString="CNN Student News"
	df=df[df['content'].str.contains(targetString)==False]

	if printSummary:
		print("\nArticle counts by publisher:")
		print(df['publication'].value_counts())

		print("\nArticle counts by date:")
		print(df['date'].value_counts())
		
	# Restrict to articles on the provided input date.
	# This date is considered mandatory for topic clustering but is not required for sentiment
	# since sentiment only processes a specified list of articles.
	# For topic clustering it is essential to have the date as it is
	# enormously significant in article matching.
	if processDate!=None:
		df=df[df['date']==processDate]
	df.reset_index(inplace=True, drop=True)

	# Remove non-ASCII characters
	df['content no nonascii']=df['content'].map(lambda x: removeNonASCIICharacters(x))

	print("\nFinal dataset:\n\nDate:",processDate,"\n")
	print(df['publication'].value_counts())

	return df

##########################################################################################

def removeNonASCIICharacters(textString): 
    return "".join(i for i in textString if ord(i)<128)

### Load the articles from the corpus
In addition the function will return the number of articles per publication (for the requested run date). Here we see there is a relatively good mix of political viewpoints covered. More discussion of this is provided in the project report.

In [4]:
# Load corpus of articles from file
# 0 index is required because the parameters are forced to be lists by ParameterGrid
articleDataFrame=getInputDataAndDisplayStats(runParams['input_file'][0],
											 runParams['process_date'][0],
											 runParams['article_stats'][0])


Final dataset:

Date: 2016-09-01 

Breitbart           50
Buzzfeed News       35
NY Times            28
NY Post             27
NPR                 26
Atlantic            24
Washington Post     22
CNN                 20
Reuters             15
Business Insider    15
Guardian            14
Fox News            14
National Review     12
Name: publication, dtype: int64


### Inspect loaded articles
Now that the articles are loaded, the only attributes that will be used are the ID and the 'content no non-ascii' column.

In [5]:
display(articleDataFrame)

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,NYT summary,content no nonascii
0,3722,21413,"One Star Over, a Planet That Might Be Another ...",NY Times,Kenneth Chang,2016-09-01,2016.0,9.0,,Another Earth could be circling the star right...,False,Another Earth could be circling the star right...
1,3748,21448,University of Chicago Strikes Back Against Cam...,NY Times,"Richard Pérez-Peña, Mitch Smith and Stephanie ...",2016-09-01,2016.0,9.0,,The anodyne welcome letter to incoming freshme...,False,The anodyne welcome letter to incoming freshme...
2,3754,21454,Quake Exposes Italy's Challenge to Retrofit It...,NY Times,Gaia Pianigiani and Elisabetta Povoledo,2016-09-01,2016.0,9.0,,"CASETTA, Italy — Romano Camassi, a seismolo...",False,"CASETTA, Italy Romano Camassi, a seismolog..."
3,3755,21455,"A Cheaper Airbag, and Takata's Road to a Deadl...",NY Times,Hiroko Tabuchi,2016-09-01,2016.0,9.0,,"In the late 1990s, General Motors got an unexp...",False,"In the late 1990s, General Motors got an unexp..."
4,3772,21474,Gene Wilder Dies at 83 Star of ‘Willy Wonka' a...,NY Times,Daniel Lewis,2016-09-01,2016.0,9.0,,"Gene Wilder, who established himself as one of...",False,"Gene Wilder, who established himself as one of..."
...,...,...,...,...,...,...,...,...,...,...,...,...
297,143552,214894,Watch SpaceX's rocket explode in a massive fir...,Washington Post,Christian Davenport,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160902000145/htt...,As SpaceX prepared to test fire the p...,False,As SpaceX prepared to test fire the p...
298,143553,214895,This photographer's photos show the tender mom...,Washington Post,Kenneth Dickerman,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160902000145/htt...,Old family photographs on Marisa Vesco...,False,Old family photographs on Marisa Vesco...
299,143584,214934,How Anthony Weinerâ€™s risque messages shaped ...,Washington Post,Sarah Jeong,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160904003253/htt...,In the aftermath of Anthony Weiner s late...,False,In the aftermath of Anthony Weiner s late...
300,143592,214943,Stop touting the crazy hours you work. It help...,Washington Post,Jena McGregor,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160904003253/htt...,"As Labor Day approaches, and a single day...",False,"As Labor Day approaches, and a single day..."


#### Inspect full article corpus
Although the remaining set of data are not used, for the purpose of exploring the dataset, the following breakdown can be obtained. Note that at the bottom of the list are some misformatted dates. These are not pertinent to the date being used for the example here, so it is not necessary to address that at this point.

In [6]:
getInputDataAndDisplayStats(runParams['input_file'][0],
                            runParams['process_date'][0],
                            printSummary=True)


Article counts by publisher:
Breitbart           104
NY Post              61
CNN                  57
Reuters              56
NPR                  54
NY Times             50
Washington Post      50
Buzzfeed News        48
Atlantic             48
Business Insider     41
Guardian             35
National Review      32
Fox News             28
Name: publication, dtype: int64

Article counts by date:
2016-12-02    362
2016-09-01    302
Name: date, dtype: int64

Final dataset:

Date: 2016-09-01 

Breitbart           50
Buzzfeed News       35
NY Times            28
NY Post             27
NPR                 26
Atlantic            24
Washington Post     22
CNN                 20
Reuters             15
Business Insider    15
Guardian            14
Fox News            14
National Review     12
Name: publication, dtype: int64


Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,NYT summary,content no nonascii
0,3722,21413,"One Star Over, a Planet That Might Be Another ...",NY Times,Kenneth Chang,2016-09-01,2016.0,9.0,,Another Earth could be circling the star right...,False,Another Earth could be circling the star right...
1,3748,21448,University of Chicago Strikes Back Against Cam...,NY Times,"Richard Pérez-Peña, Mitch Smith and Stephanie ...",2016-09-01,2016.0,9.0,,The anodyne welcome letter to incoming freshme...,False,The anodyne welcome letter to incoming freshme...
2,3754,21454,Quake Exposes Italy's Challenge to Retrofit It...,NY Times,Gaia Pianigiani and Elisabetta Povoledo,2016-09-01,2016.0,9.0,,"CASETTA, Italy — Romano Camassi, a seismolo...",False,"CASETTA, Italy Romano Camassi, a seismolog..."
3,3755,21455,"A Cheaper Airbag, and Takata's Road to a Deadl...",NY Times,Hiroko Tabuchi,2016-09-01,2016.0,9.0,,"In the late 1990s, General Motors got an unexp...",False,"In the late 1990s, General Motors got an unexp..."
4,3772,21474,Gene Wilder Dies at 83 Star of ‘Willy Wonka' a...,NY Times,Daniel Lewis,2016-09-01,2016.0,9.0,,"Gene Wilder, who established himself as one of...",False,"Gene Wilder, who established himself as one of..."
...,...,...,...,...,...,...,...,...,...,...,...,...
297,143552,214894,Watch SpaceX's rocket explode in a massive fir...,Washington Post,Christian Davenport,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160902000145/htt...,As SpaceX prepared to test fire the p...,False,As SpaceX prepared to test fire the p...
298,143553,214895,This photographer's photos show the tender mom...,Washington Post,Kenneth Dickerman,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160902000145/htt...,Old family photographs on Marisa Vesco...,False,Old family photographs on Marisa Vesco...
299,143584,214934,How Anthony Weinerâ€™s risque messages shaped ...,Washington Post,Sarah Jeong,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160904003253/htt...,In the aftermath of Anthony Weiner s late...,False,In the aftermath of Anthony Weiner s late...
300,143592,214943,Stop touting the crazy hours you work. It help...,Washington Post,Jena McGregor,2016-09-01,2016.0,9.0,https://web.archive.org/web/20160904003253/htt...,"As Labor Day approaches, and a single day...",False,"As Labor Day approaches, and a single day..."


## NLP Processing
### Stop words
In processing natural language, it is necessary to suppress words that convey little value. Typically these are common words such as "a", "man", "Friday", etc. They are referred to as Stop Words. A list of these files is in an included file and is loaded below. This file is independent of whether nltk or spaCy is used for some of the other NLP features. (it will only be applied towards the end of the NLP processing)

In [9]:
def loadStopWords(stopWordsFileName):
	stop_words=[]
	f=open(stopWordsFileName, 'r')
	for l in f.readlines():
		stop_words.append(l.replace('\n', ''))
	return stop_words


# Load stop words now - these will be deleted from final text by processor before vectorizing
# 0 index is required because the parameters are forced to be lists by ParameterGrid
stop_words=loadStopWords(runParams['stop_words_file'][0])

### NLTK
The following function provides the NLTK pre-processing. Specifically, it:
- restricts the text to the requested parts-of-speech (according to the run parameters). This includes Proper Nouns, Common Nouns, Verbs, etc. Words of different parts-of-speech are more/less important in determining the story relayed by an article - for example, adjectives are not important. The goal is to reduce the number of words in a way that eliminates those that cause noise rather than adding value - and thus makes the algorithm operate more effectively.
- applies lemmatisation (optionally), thus substituting a word for its root - the intention here being to reduce the final universe of words in the corpus, and thus make finding related articles easier.
- truncates the length of the article to the degree requested. (testing has shown that restricting to the first few paragraph increases the ease with which the algorithm can relate articles)

In [10]:
def stringNLTKProcess(nl,stringToConvert,partsOfSpeech,stop_words,maxWords=None,lemmatizer=None):
	sentences=nl.sent_tokenize(stringToConvert)
	str=[]
	for sentence in sentences:
		wordString=[]
		for word,pos in nl.pos_tag(nl.word_tokenize(sentence)):
			# The following condition avoids any POS which corresponds to punctuation (and takes all others)
			if partsOfSpeech==None:
				if pos[0]>='A' and pos[0]<='Z':
					wordString.append(word)
			elif pos in partsOfSpeech:
				wordString.append(word)
		for wrd in wordString:
			wrdlower=wrd.lower()
			if wrdlower not in stop_words and wrdlower!="'s":
				if maxWords==None or len(str)<maxWords:
					if lemmatizer==None:
						str.append(wrdlower)
					else:
						str.append(lemmatizer.lemmatize(wrd.lower(), pos='v'))
			if maxWords!=None and len(str)==maxWords:
				return ' '.join(str)
	return ' '.join(str)

##########################################################################################

def removeSpacesAndPunctuation(textString): 
    return "".join(i for i in textString if (ord(i)>=48 and ord(i)<=57) or (ord(i)>=97 and ord(i)<=122))

### spaCy
The second NLP library supported here is spaCy. It is being used to provide the same features as NLTK - although there are pros and cons to the specific implementations of each library.

In [11]:
def stringSpaCyProcess(nlp,stringToConvert,partsOfSpeech,maxWords,stop_words,lemmatize):
	doc=nlp(stringToConvert)
	if partsOfSpeech==None:
		spacyTokens=[w for w in doc]
	else:
		spacyTokens=[w for w in doc if w.pos_ in partsOfSpeech]

	str=[]
	for spt in spacyTokens:
		if lemmatize:
			wrd=spt.lemma_
		else:
			wrd=spt.text
		wrdlower=removeSpacesAndPunctuation(wrd.lower())
		# The middle term below is correctly wrd.lower() not wrdlower since the function call
		# above strips out the --, and I don't want to compare with 'pron' in case that
		# finds false matches
		if wrdlower not in stop_words and wrd.lower()!='-pron-' and not wrdlower=='':
			if maxWords==None or len(str)<maxWords:
				str.append(wrdlower)
		if maxWords!=None and len(str)==maxWords:
				return ' '.join(str)		
	return ' '.join(str)

## Prepare data for analysing results
In order to score how well the algorithm is assigning articles to stories, it is useful (but optional) to provide a file containing a "story map". This file effectively specifies which articles belong to which stories. It is incomplete, but it is sufficiently extensive to demonstrate the effectiveness of the results.
In addition, a list of article IDs can be provided to drive the post vectorization validation. For the purposes of this workbook, this second list is being derived from the story map.
### Setup story map and testing list

In [22]:
def setupStoryMapAndReportList(args=None,reportArticleList=None,storyMapFileName=None):
	# Story Map is used in fitting if grid search is applied (As ground truth)
	# It is also used in graph if no threshold provided (to determine colours, not to determine location)
	# Report Article List is used at the end to create a report with, for each
	# article in the list, the set of articles within tolerance, and the key words for each
	if args==None:
		articleList=reportArticleList
		fileName=storyMapFileName
	else:
		articleList=args['article_id_list']
		fileName=args['story_map_validation']

	reportArticleList=articleList
	if fileName!=None:
		storyMap=readStoryMapFromFile(fileName)
		if reportArticleList==None:
			reportArticleList=[]
			for story, articleList in storyMap.items():
				reportArticleList.append(articleList[0])
	else:
		storyMap=None
	return storyMap,reportArticleList

def readStoryMapFromFile(filename):
	return readDictFromCsvFile(filename,'StoryMap')

##########################################################################################

def readGridParameterRangeFromFile(filename):
	return readDictFromCsvFile(filename,'GridParameters')

##########################################################################################

def readDictFromCsvFile(filename,schema):
	gridParamDict={}
	with open(filename, 'r') as f:
		for row in f:
			row=row[:-1] # Exclude the carriage return
			row=row.split(",")
			key=row[0]
			vals=row[1:]
			
			if schema=='GridParameters':
				if key in ['story_threshold','tfidf_maxdf']:
					finalVals=list(float(n) for n in vals)
				elif key in ['ngram_max','tfidf_mindf','max_length']:
					finalVals=list(int(n) for n in vals)
				elif key in ['lemma_conversion','tfidf_binary']:
					finalVals=list(str2bool(n) for n in vals)
				elif key in ['parts_of_speech']:
					listlist=[]
					for v in vals:
						listlist.append(v.split("+"))
					finalVals=listlist
				elif key in ['tfidf_norm','nlp_library']:
					finalVals=vals
				else:
					print(key)
					print("KEY ERROR")
					return
			elif schema=='StoryMap':
				finalVals=list(int(n) for n in vals)
			else:
				print(schema)
				print("SCHEMA ERROR")
				return
			
			gridParamDict[key]=finalVals
	return gridParamDict

### Load the story map from file

In [13]:
storyMap,reportArticleList=setupStoryMapAndReportList(storyMapFileName='storyMapForValidation.csv')

Inspecting the story map we see that it forms a dict containing a key corresponding to the name of the story and a value containing a list of the article IDs germane to that story.

In [14]:
for story, articleList in storyMap.items():
    print(story,":",articleList)

Trump meeting : [151832, 110126, 172078, 48306, 57365, 190512, 26536, 71335, 21499, 23872, 142033, 110133, 23888, 71336, 57366, 71339]
Brazil impeachment : [120639, 80103, 25225, 21502, 57362, 120636, 110141]
Kaepernick : [40617, 40543, 39520, 80109, 80101, 47403]
Clinton Guccifer : [214888, 85803, 47979]
Farage : [37252, 37468, 46175]
Anthony Weiner : [49480, 110144, 142300, 214934]
SpaceX : [38658, 134545, 172095, 214894]
Safe space : [21448, 78169, 78171]
Lauer debate : [43447, 47078, 138709]
Venezuela : [172079, 57375, 190522]
Iran deal : [158005, 48823, 57373, 120634]
Penn State : [80094, 157527, 214892]
David Brown : [172085, 80096, 141886]


In [15]:
print(reportArticleList)

[151832, 120639, 40617, 214888, 37252, 49480, 38658, 21448, 43447, 172079, 158005, 80094, 172085]


## Algorithm preparation
The algorithm can be considered as having the following steps:
- Preprocessing for NLP features
- Conversion to TF-IDF vectors
- A Relatedness Score for a set of vectors
- A main processing loop for tying everything together

### Preprocess and vectorizing
This section provides the necessary code for the natural language processing requirements.

Two NLP libraries are supported. The user can choose (in the run parameters) between:
- NLTK
- SpaCy

The various natural language functions will be applied (to the extent requested in the run parameters).
- Lemmatization
- Remove stop words
- Restrict to specific parts-of-speech
- Constrain overall length
- n-grams

Once that's done, the body of the processed articles will be analysed and converted into tf-idf values.


In [26]:
def preprocessAndVectorize(articleDataFrame,args,pos_nlp_mapping,nlp,nl,wordnet_lemmatizer,stop_words):
	# Map the input parts of speech list to the coding required for the specific NLP library
	if args['parts_of_speech'][0]!='ALL':
		partsOfSpeech=[]
		for pos in args['parts_of_speech']:
			partsOfSpeech.append(pos_nlp_mapping[args['nlp_library']][pos])
		partsOfSpeech=[item for sublist in partsOfSpeech for item in sublist]
	else:
		partsOfSpeech=None

	# Processing of text depends on NLP library choice
	if args['nlp_library']=='spaCy':
		articleDataFrame['input to vectorizer']=articleDataFrame['content no nonascii'].map(lambda x: stringSpaCyProcess(nlp,
																									   x,
																									   partsOfSpeech=partsOfSpeech,
																									   maxWords=args['max_length'],
																									   stop_words=stop_words,
																									   lemmatize=args['lemma_conversion']))
	elif args['nlp_library']=='nltk':
		articleDataFrame['input to vectorizer']=articleDataFrame['content no nonascii'].map(lambda x: stringNLTKProcess(nl,
																									  x,
																									  partsOfSpeech=partsOfSpeech,
																									  stop_words=stop_words,
																									  maxWords=args['max_length'],
																									  lemmatizer=wordnet_lemmatizer))
	else:
		print("PROBLEM... NO VALID NLP LIBRARY... MUST BE nltk OR spaCy")

	# To get default values a couple of parameters need to be not passed if not specified on the command line
	# Passing as None behaves differently to passing no parameter (which would invoke the default value)
	optArgsForVectorizer={}
	if args['tfidf_maxdf'] != None:
		optArgsForVectorizer['max_df']=args['tfidf_maxdf']
	if args['tfidf_mindf'] != None:
		optArgsForVectorizer['min_df']=args['tfidf_mindf']

	# Create and run the vectorizer
	vectorizer=TfidfVectorizer(analyzer='word',
   	    	                   ngram_range=(1,args['ngram_max']),
       	    	               lowercase=True,
           	    	    	   binary=args['tfidf_binary'],
               		    	   norm=args['tfidf_norm'],
							   **optArgsForVectorizer)
	tfidfVectors=vectorizer.fit_transform(articleDataFrame['input to vectorizer'])
	terms=vectorizer.get_feature_names()
	return tfidfVectors, terms

### Scoring
Scores must be computed for each pair of articles, for the following reasons:
- To determine the proposed clustering of articles in stories (and to evaluate this clustering against a ground truth story map)
- To evaluate the grid parameters in order to choose the preferable combination

In [27]:
def scoreCurrentParamGuess(tfidfVectors,storyMap,articleDataFrame,threshold,printErrors=False):
	# Work with distances relative to first item in each cluster - even though this is clearly arbitrary since that
	# point could be an outlier in the cluster and hence might cause problems.
	# But I have to start somewhere - and can refine it later if needed.

	nonZeroCoords=initialiseAllNonZeroCoords(tfidfVectors)
	score=0
	outGood=0
	outBad=0
	inGood=0
	inBad=0
	for story, storyArticles in storyMap.items():
		leadArticleIndex=articleDataFrame[articleDataFrame['id']==storyArticles[0]].index[0]
		# Compute score of all articles in corpus relative to first article in story (.product)
		# Then count through list relative to threshold (add one for a good result, subtract one for a bad result)
		scores=productRelatednessScores(tfidfVectors,nonZeroCoords,leadArticleIndex)
		rankedIndices=np.argsort(scores)
		foundRelatedArticles=[]
		# THE SORTING HERE IS NOT STRICTLY REQUIRED, BUT I COULD USE IT SO THAT ONCE THE THRESHOLD IS PASSED
		# IN THE LOOP, THEN I INFER THE REMAINING RESULTS
		for article in reversed(rankedIndices):
			thisArticleIndex=articleDataFrame['id'][article]
			if thisArticleIndex in storyArticles:
				if scores[article]>=threshold:
					score+=1
					inGood+=1
				else:
					score-=1
					inBad+=1
					if printErrors:
						print("ERROR:",thisArticleIndex,"should be in",story)
			else: # article not supposed to be in range
				if scores[article]<=threshold:
					score+=1
					outGood+=1
				else:
					score-=1
					outBad+=1
					if printErrors:
						print("ERROR:",thisArticleIndex,"should NOT be in",story)
	scoreDict={'score':score,'inGood':inGood,'inBad':inBad,'outGood':outGood,'outBad':outBad}
	return scoreDict

##########################################################################################

def initialiseAllNonZeroCoords(tfidfVectors):
# This function just exists since it seems to be expensive and I'd rather not call it multiple times
# Hence it is intended to be called outside of loops in order to simplify the row specific processing
	values=[]
	nzc=zip(*tfidfVectors.nonzero())

	# In Python 3 the zip can only be iterated through one time before it is automatically released
	# So need to copy the results otherwise the main loop below will no longer work
	pointList=[]
	for i,j in nzc:
		pointList.append([i,j])		

	for row in range(tfidfVectors.shape[0]):
		rowList=[]
		for i,j in pointList:
			if row==i:
				rowList.append(j)
		values.append(rowList)

	return values

### Relatedness Scoring measure
The Relatedness Score is computed between a pair of articles by taking the dot product of the values across each dimension of the pair's TF-IDF vectors.
This has the following behaviour characteristics:
- If a term is important in both articles, that term will have a high impact on the article relatedness
- If a term is not important in either or both articles, that term will have a high impact on the article relatedness

The non-linearity coming from the product ensures a more contextual and intuitive scoring than the conventional Euclidean measure. (thus resulting in more usable pairings)

In [28]:
def productRelatednessScores(tfidfVectors,nonZeroCoords,refRow):
	scores=[0]*tfidfVectors.shape[0]
	for toRow in range(tfidfVectors.shape[0]):
		scores[toRow]=sum([(tfidfVectors[toRow,w]*tfidfVectors[refRow,w]) for w in nonZeroCoords[refRow] if w in nonZeroCoords[toRow]])
	return scores

In [30]:
import nltk; nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\goldm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## Run the algorithm
Now that all the pieces are in place, a loop is run to tie everything together - calling the vectorizer and scoring the results.
If we are running in GridSearch mode, the loop will repeat and keep track of the best results achieved.

In [32]:
# Loop across all parameter combinations in grid to determine best set
# If not doing grid search, will just pass through the loop once
bestParamScoreDict={'score':-1000000}
bestParams=parameterGrid[0]
for i,currentParams in enumerate(parameterGrid):
	if len(parameterGrid)>1:
		print("Combination:",i+1,"of",len(parameterGrid))
		print(currentParams)

	# Determine tf-idf vectors
	# terms is just used later on if analysis of final results is requested
	tfidfVectors,terms=preprocessAndVectorize(articleDataFrame,
											  currentParams,
											  pos_nlp_mapping,
											  nlp,
											  nl,
											  wordnet_lemmatizer,
											  stop_words)

	# Compute scores if threshold provided (meaning as part of grid search)
	if 'story_threshold' in currentParams and currentParams['story_threshold']!=None:
		scoreDict=scoreCurrentParamGuess(tfidfVectors,storyMap,articleDataFrame,currentParams['story_threshold'])
		print(scoreDict)

		# Update best so far
		if scoreDict['score']>=bestParamScoreDict['score']:
			if len(parameterGrid)>1:
				print(i+1,"is the best so far!")
			bestParams=currentParams
			bestParamScoreDict=scoreDict
	# End grid/parameter loop

{'score': 3914, 'inGood': 60, 'inBad': 2, 'outGood': 3860, 'outBad': 4}


### Tidy up by restoring to best run before proceeding with analysis

In [33]:
# Set threshold to input value from best (and possibly only) run for use in results analysis
# Unless not specified at all
if 'story_threshold' in bestParams and bestParams['story_threshold']!=None:
	threshold=bestParams['story_threshold']
else:
	threshold=None


# If there was a real parameter grid, then output/refresh results
if len(parameterGrid)>1:
	print("BEST PARAMETERS:")
	print(bestParams)
	print(bestParamScoreDict)
	scoreCurrentParamGuess(tfidfVectors,storyMap,articleDataFrame,threshold,printErrors=True)
	# Recreate vector for best results in loop
	# terms is just used later on if analysis of final results is requested
	tfidfVectors,terms=preprocessAndVectorize(articleDataFrame,
											  bestParams,
											  pos_nlp_mapping,
											  nlp,
											  nl,
											  wordnet_lemmatizer,
											  stop_words)

## Analysis of results
### Produce graphs
In order to produce a graph of the results, the TF-IDF vectors are reduced to two dimensions.

Clustering is computed using the full n dimensions, with the threshold determining which articles end up grouped into shared stories.

The graphs is ultimately rendered using Bokeh.

In [35]:
# Reduce vector space to two dimensions
# Then produce Bokeh graph
def graphVectorSpace(tfidfVectors,extraColumns,dateForTitle,storyMap,threshold):
	# Better results seem to be obtained by breaking the dimensionality reduction into two steps

	# First reduce to fifty dimensions with SVD
	from sklearn.decomposition import TruncatedSVD
	svd=TruncatedSVD(n_components=50, random_state=0)
	svdResults=svd.fit_transform(tfidfVectors)

	# Next continue to two dimensions with TSNE
	from sklearn.manifold import TSNE
	tsneModel=TSNE(n_components=2, verbose=0, random_state=0, n_iter=500)
	tsneResults=tsneModel.fit_transform(svdResults)
	tfidf2dDataFrame=pd.DataFrame(tsneResults)
	tfidf2dDataFrame.columns=['x','y']

	tfidf2dDataFrame['publication']=extraColumns['publication']	
	tfidf2dDataFrame['id']=extraColumns['id']	
	tfidf2dDataFrame['content']=extraColumns['content no nonascii'].map(lambda x: x[:200])

	# All articles will be marked as NA to indicate that they have not been assigned to a story
	# Then those which have been assigned one will be updated to refer to that
	tfidf2dDataFrame['category']='NA'

	# If the threshold is not provided, then just graph the vector space as is
	# With colours indicating desired story grouping
	# This still has value because it shows how well stories cluster together
	if threshold==None:
		graphTitle=("TF-IDF article clustering - story assignment from map - "+dateForTitle[0])
		for story, storyArticles in storyMap.items():
			for article in storyArticles:
				if len(tfidf2dDataFrame[tfidf2dDataFrame['id']==article].index)==1:
					i=tfidf2dDataFrame[tfidf2dDataFrame['id']==article].index[0]
					tfidf2dDataFrame['category'][i]=story
	else:
		graphTitle=("TF-IDF article clustering - story assignment computed - "+dateForTitle[0])
		nonZeroCoords=initialiseAllNonZeroCoords(tfidfVectors)
		for story, storyArticles in storyMap.items():
			leadArticleIndex=extraColumns[extraColumns['id']==storyArticles[0]].index[0]
			# Compute score of all articles in corpus relative to first article in story (.product)
			scores=productRelatednessScores(tfidfVectors,nonZeroCoords,leadArticleIndex)
			rankedIndices=np.argsort(scores)
			for article in rankedIndices:
				if scores[article]>=threshold:
					tfidf2dDataFrame['category', article]=story

	import bokeh.plotting as bp
	from bokeh.models import HoverTool
	from bokeh.plotting import show,output_notebook
	from bokeh.palettes import d3
	import bokeh.models as bmo

	output_notebook()
	plot_tfidf=bp.figure(plot_width=800, plot_height=800, title=graphTitle,
						 tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
						 x_axis_type=None, y_axis_type=None, min_border=1)

	numCats=len(tfidf2dDataFrame['category'].unique())
	palette=d3['Category20'][numCats]
	color_map=bmo.CategoricalColorMapper(factors=tfidf2dDataFrame['category'].map(str).unique(), palette=palette)

	plot_tfidf.scatter(x='x', y='y', color={'field': 'category', 'transform': color_map}, 
						legend='category',source=tfidf2dDataFrame)
	hover=plot_tfidf.select(dict(type=HoverTool))
	plot_tfidf.legend.click_policy="hide"
	hover.tooltips={"id": "@id", "publication": "@publication", "content":"@content", "category":"@category"}

	show(plot_tfidf)

### Interpreting the graph
Each dot on the scattergraph corresponds to an article.
Most of the stories covered that day (and available in the dataset) are actually not represented in more than one publication. So many articles are effectively their own unique story - these are indicated by NA.
There is a particularly large cluster of these around the center of the graph. This appears to be an artefact of the SVD - the articles don't contain strong distinctive terms, and hence have small values on both axes. (this is likely party a result of the min DF being set to 2 in the example)

To view the details of any story, hover over the corresponding dot.

Considerably more detail and analysis is provided in the project report document.

In [39]:
graphVectorSpace(tfidfVectors,
				 articleDataFrame[['id','publication','content no nonascii']],
				 runParams['process_date'],
				 storyMap,
				 threshold)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tfidf2dDataFrame['category'][article]=story


ValueError: unexpected tool name 'previewsave', possible tools are pan, xpan, ypan, xwheel_pan, ywheel_pan, wheel_zoom, xwheel_zoom, ywheel_zoom, zoom_in, xzoom_in, yzoom_in, zoom_out, xzoom_out, yzoom_out, click, tap, crosshair, box_select, xbox_select, ybox_select, poly_select, lasso_select, box_zoom, xbox_zoom, ybox_zoom, save, undo, redo, reset, help, box_edit, line_edit, point_draw, poly_draw, poly_edit or hover

For the avoidance of doubt - recall that the story names in the legend (Safe space, Trump meeting, etc) are taken directly from the input story map titles. These names are not inferred from the data! (just the groupings are inferred)
### Investigate inferred clustering vs given story

In [None]:
def produceRequestedReportDetails(tfidfVectors,articleDataFrame,reportArticleList,threshold,storyMap,terms):

	# tfidfVectors is a sparse matrix, for efficiency it's useful to determine once only which
	# coordinates have non-zero values
	nonZeroCoords=initialiseAllNonZeroCoords(tfidfVectors)

	topNwords=25

	# Create list of articles to process
	# If a list is provided in command line arguments, use that
	storyMapGood=0.0
	encounteredStoriesList=[]
	for index,row in articleDataFrame.iterrows():
		if row['id'] in reportArticleList:
			ref_index=index
			print("-----")
			print("-----")
			print("LEAD ARTICLE IN STORY:",row['id'])
			print("-----")

			if threshold==None:
				articleIndexList=[index]
			else:
				# Score and rank all articles relative to this one
				# Count number of items that are greater than or equal to threshold
				# Then truncate the list beyond those items
				scores=productRelatednessScores(tfidfVectors,nonZeroCoords,ref_index)
				rankedIndices=np.argsort(scores)
				numItemsInRange=sum(x>=threshold for x in scores)
				articleIndexList=rankedIndices[-numItemsInRange:]

			# If there is a story map, find out which story this article is meant to belong to
			targetStory=None
			if storyMap!=None:
				for story,articleList in storyMap.items():
					if row['id'] in articleList:
						targetStory=story
						targetArticleList=articleList
						encounteredStoriesList.append(targetStory)
					
			# For just those articles that are within threshold of the lead article
			# Print out the key terms and their tf-idf scores
			# Then count the number of articles that are correctly assigned to the story
			# (if there is a ground truth storyMap provided)
			for article in reversed(articleIndexList):
				if targetStory!=None:
					# If this is officially part of the same story, update the counts
					if articleDataFrame['id'][article] in targetArticleList:
						storyMapGood+=1.0

				print("MEMBER ARTICLE:",articleDataFrame['id'][article])
				if threshold!=None:
					print("Score :",scores[article])
				print(articleDataFrame['publication'][article])
				print(articleDataFrame['content'][article][:500])
				print("PASSED TO VECTORIZER AS:")
				print(articleDataFrame['input to vectorizer'][article])
				print()
				printTopNwordsForArticle(tfidfVectors,terms,articleNum=article,n=topNwords)
				print("-----")
			print("-----")

	# If there is a storyMap, print out the percentage results for the inferred allocation
	# Note that it should be just relative to the number of stories actually encountered
	# So if the user requests a specific set of articles and those articles don't cover
	# the full set of stories, then they shouldn't be counted as errors.
	if storyMap!=None:
		storyMapSize=sum([len(storyMap[story]) for story in encounteredStoriesList])
		print("\n\nPERCENTAGE OF STORIES ALLOCATED IN LINE WITH MAP:",100.*storyMapGood/storyMapSize)

	return

def printTopNwordsForArticle(tfidfVectors,terms,articleNum,n):
	vect=tfidfVectors[articleNum].toarray()[0]
	topn1=np.argsort(vect)
	for t in reversed(topn1[-n:]):
		if vect[t]>0.001:
			print(terms[t],":",round(vect[t],5))

### Clustering report for articles
The output of the following cell is broken down as follows:
- One section per story in the story map
- The ID of the lead story (which is taken from the reportArticleList provided
- For each article which has a pair score with this lead article greater than the threshold:
 - The ID of that related article
 - The score for the pair
 - The name of the publisher of the article
 - The first few lines of the actual article content
 - The derived corresponding text that is provided to the TF-IDF vectorizer (i.e. processed for lemmatization, parts-of-speech, stop words, etc)
 - The 25 most significant terms (in TF-IDF) in that article, along with that term's correspondong TF-IDF value for the article
At the very end of the report is a line which indicates the percentage of articles which were allocated to the correct stories (according to the story map).

In [None]:
# Continue with outputting from best results if requested
produceRequestedReportDetails(tfidfVectors,articleDataFrame,reportArticleList,threshold,storyMap,terms)