## Overview
- Using Python 3, NLTK, spaCy

## Assessing spaCy's Entity Extraction Capacity

- You should run the entity extractor on a sample of 100 sentences and report on how accurately it performs.
- You do not need to describe your assessment of each sentence individually, but you should give overall statistics and use illustrative examples from your sample.
- Your analysis should be broken down by the type of named entity.
- When an error occurs you should describe the nature of the error. You should distinguish the following cases:
 - where the wrong entity type is assigned to a span
 - where the wrong span is identified
 - where an entity is missed altogether
- A confusion matrix should be used here to summarise what you have found.
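
One way to organise the assessment is to pair each hand-annotated entity with spaCy's prediction and tally the outcomes. The sketch below assumes each entity is a `(start, end, label)` tuple and that an entity missing on either side is `None`. The helper names `classify_error` and `confusion_counts`, the `NONE` label and the four toy pairs are invented for illustration; they are not real results.

```python
from collections import Counter

def classify_error(gold, pred):
    """Name the outcome for one (gold, predicted) pair of (start, end, label) tuples."""
    if gold is None:
        return "spurious"      # spaCy found an entity that was not annotated
    if pred is None:
        return "missed"        # annotated entity not found at all
    if (gold[0], gold[1]) != (pred[0], pred[1]):
        return "wrong span"
    if gold[2] != pred[2]:
        return "wrong type"
    return "correct"

def confusion_counts(pairs):
    """Count (gold label, predicted label) combinations, using 'NONE' for a missing side."""
    counts = Counter()
    for gold, pred in pairs:
        gold_label = gold[2] if gold else "NONE"
        pred_label = pred[2] if pred else "NONE"
        counts[(gold_label, pred_label)] += 1
    return counts

# Toy pairs, made up for illustration only.
toy_pairs = [
    ((0, 2, "PERSON"), (0, 2, "PERSON")),    # correct
    ((5, 6, "GPE"), (5, 6, "PERSON")),       # wrong type
    ((9, 11, "PERSON"), (10, 11, "PERSON")), # wrong span
    ((14, 15, "GPE"), None),                 # missed
]

for (gold_label, pred_label), n in sorted(confusion_counts(toy_pairs).items()):
    print(gold_label, "->", pred_label, ":", n)
print(Counter(classify_error(g, p) for g, p in toy_pairs))
```

Note that a wrong-span case with a matching label still lands on the diagonal of the matrix, so span errors are worth reporting separately using the `classify_error` tally.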

---

In [None]:
import sys
sys.path.append(r'C:\Users\gamer\Documents\resources')  # Amend to the path where you unpacked resources.zip
import re
import csv
import math
import random
from random import seed
from collections import defaultdict, Counter
from itertools import zip_longest
from operator import itemgetter, attrgetter, methodcaller

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pylab import rcParams
from IPython.display import display
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline
params = {'legend.fontsize': 'large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'large',
          'axes.titlesize': 'large',
          'xtick.labelsize': 'large',
          'ytick.labelsize': 'large'}
pylab.rcParams.update(params)


In [None]:
import spacy
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from nltk.corpus import gutenberg
nlp = spacy.load('en')
#moby = gutenberg.raw('melville-moby_dick.txt')
emma = gutenberg.raw('austen-emma.txt')
#alice = gutenberg.raw('carroll-alice.txt')
#persuasion = gutenberg.raw('austen-persuasion.txt')
#sense = gutenberg.raw('austen-sense.txt')
#parsed_moby = nlp(moby)
parsed_emma = nlp(emma)
#parsed_alice = nlp(alice)
#parsed_persuasion = nlp(persuasion)
#parsed_sense = nlp(sense)

In [None]:
text = parsed_emma
nounphrases = [[nounphrase.text, nounphrase.root.head.text] for nounphrase in parsed_emma.noun_chunks]
print("There were {} noun phrases found.".format(len(nounphrases)))
display(pd.DataFrame(nounphrases))
# To remove "\n" characters, apply re.sub(r"\s+", " ", <string>): it replaces every run of one or more whitespace characters (matched by the regular expression r"\s+") with a single space.

In [None]:
seed(164746)
sample_size = 20
my_sample = random.sample(list(parsed_emma.sents), sample_size)  # select a random sample of sentences
for sent in my_sample:
    cleaned = re.sub(r"\s+", " ", sent.text)  # collapse runs of whitespace; the raw string avoids an invalid-escape warning
    print(cleaned, "\n")

## Gender Classifier


- You should present the code of your gender classifier and explain how it works.
- You should use the names.csv data (found within resources.zip).
- Your code should deal with cases where a character is referred to by more than just their first name (e.g. "John Jones").
- Your code should deal with cases where a character is referred to using a title (e.g. "Mrs Smith").
- By running your gender classifier on a sample of data and reviewing the results, provide an indication of how accurate your gender classifier is. What proportion of names are being correctly analysed?
- Deal with situations where just a surname is used (e.g. "Smith") after the gender of that character has been revealed (e.g. "Mrs Smith") before.
---
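
The surname-only requirement can be approached by remembering which gender a surname was last seen with. The sketch below is one possible approach, not a prescribed solution: `TITLE_GENDER`, `resolve_surnames` and the labels `"male"`, `"female"` and `"unknown"` are assumptions made for illustration, and the title list is deliberately incomplete.

```python
# Assumed, incomplete title list; extend it for your own novels.
TITLE_GENDER = {"mr": "male", "sir": "male", "mrs": "female", "miss": "female"}

def resolve_surnames(mentions):
    """Return (mention, gender) pairs, reusing the gender of a surname seen earlier with a title."""
    surname_gender = {}
    resolved = []
    for mention in mentions:
        tokens = mention.lower().replace(".", "").split()
        if len(tokens) >= 2 and tokens[0] in TITLE_GENDER:
            gender = TITLE_GENDER[tokens[0]]
            surname_gender[tokens[-1]] = gender   # remember the surname for later mentions
        elif len(tokens) == 1:
            gender = surname_gender.get(tokens[0], "unknown")
        else:
            gender = "unknown"
        resolved.append((mention, gender))
    return resolved

print(resolve_surnames(["Smith", "Mrs Smith", "Smith"]))
```

A surname shared by two characters (for example a married couple) is overwritten by the most recent titled mention, which is a limitation worth discussing in your accuracy review.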

- First, define a function `guess_gender(name,gender_map)` that returns `gender_map[name]` when we (think we) know the gender of `name`. When `name` does not appear in `gender_map` (i.e. it maps to `'unknown'`), the function strips off all but the first token of `name` and tries that instead.

- Second, write a function extend_gender_map(gender_map) that returns a gender map with additional mappings added for as many male and female titles as you can think of.

- Third, adapt the line in the cell that builds `names_with_gender`

      names_with_gender = [(name,gender_map[name.lower()]) for name,count in named_entity_counts(text,entity_type).most_common(number_of_entities)]

  to use your `guess_gender` function rather than directly applying `gender_map`.
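
The first two steps might look roughly like the following. This is a minimal sketch under stated assumptions: the gender labels `"male"` and `"female"` are guesses at what names.csv contains, the title lists are short examples, and `toy_map` stands in for the real `gender_map`.

```python
from collections import defaultdict

def extend_gender_map(gender_map):
    """Return a copy of gender_map with a few titles added (assumed, incomplete lists)."""
    extended = defaultdict(lambda: "unknown", gender_map)
    for title in ("mr", "mr.", "sir", "lord"):
        extended[title] = "male"
    for title in ("mrs", "mrs.", "miss", "ms", "lady"):
        extended[title] = "female"
    return extended

def guess_gender(name, gender_map):
    """Look up the whole name first, then fall back to its first token."""
    name = name.strip().lower()
    if not name:
        return "unknown"
    gender = gender_map.get(name, "unknown")  # .get avoids adding keys to a defaultdict
    if gender != "unknown":
        return gender
    return gender_map.get(name.split()[0], "unknown")

# Toy stand-in for the map built from names.csv.
toy_map = extend_gender_map({"john": "male", "emma": "female"})
for name in ("John Jones", "Mrs Smith", "Emma", "Smith"):
    print(name, "->", guess_gender(name, toy_map))
```

With these helpers, the list comprehension would call `guess_gender(name, gender_map)` in place of `gender_map[name.lower()]`.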

In [None]:
def create_gender_map(dict_reader):
    """Map each lower-cased name to its most frequent gender; unseen names map to 'unknown'."""
    names_info = defaultdict(lambda: {"gender": "", "freq": 0.0})
    for row in dict_reader:  # each row provides 'name', 'gender' and 'freq'
        name = row["name"].lower()
        freq = float(row["freq"])
        if names_info[name]["freq"] < freq:  # is this gender more frequent?
            names_info[name]["gender"] = row["gender"]
            names_info[name]["freq"] = freq
    gender_map = defaultdict(lambda: "unknown")
    for name in names_info:
        gender_map[name] = names_info[name]["gender"]
    return gender_map

with open("names.csv", newline="") as names_file:  # the file is closed once it has been read
    gender_map = create_gender_map(csv.DictReader(names_file))

In [None]:
def named_entity_counts(document,named_entity_label):
    # Use ent.text rather than ent.string, which is not available in spaCy 3
    occurrences = [ent.text.strip() for ent in document.ents
                   if ent.label_ == named_entity_label and ent.text.strip()]
    return Counter(occurrences)

text = parsed_emma
entity_type = 'PERSON'
number_of_entities = 10
names_with_gender = [(name,gender_map[name.lower()]) for name,count in named_entity_counts(text,entity_type).most_common(number_of_entities)]
display(pd.DataFrame(names_with_gender,columns=["Name","Gender"]))

## Building feature sets that characterise the way a character is portrayed


- You should explore a number of alternative ways of characterising the way a person is portrayed by a novelist. This should include code for the following (more detail on the first two is given below):
 - Finding the entities within a novel.
 - Finding the main characters of a novel.
 - Finding the least frequently occurring characters within a novel.
- You should describe the code that you have written to create feature sets for characters.
- You should describe the code that showed how you were able to extract features in situations where one of the pronouns "he", "she", "his" and "her" is used in a novel.

In [None]:
import sys
sys.path.append(r'C:\Users\gamer\Documents\resources')
import re
import csv
import math
import random
from random import seed
from collections import defaultdict, Counter
from itertools import zip_longest
from operator import itemgetter, attrgetter, methodcaller

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pylab import rcParams
from IPython.display import display
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline
params = {'legend.fontsize': 'large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'large',
          'axes.titlesize': 'large',
          'xtick.labelsize': 'large',
          'ytick.labelsize': 'large'}
pylab.rcParams.update(params)
import spacy
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from nltk.corpus import gutenberg
nlp = spacy.load('en')
from GutenbergCorpus import GutenbergCorpusReader as gcr
reader = gcr.GutenbergCorpusReader()       

In [None]:
for author in authors:
    print("{0}: {1}".format(author,len(authors[author])))

In [None]:
works = reader.get_authors_works('Roosevelt, Theodore')  # replace with the name of your chosen author
for work in works:
    print(work["title"])
# The name should be a comma-and-space separated string, with each part of the name capitalised (e.g. authors['Walker, Anne'])

In [None]:
#In this cell run spaCy on a random novel by your chosen author

In [None]:
parsed_novel = nlp(works[3]["text"])  # parse one work by the chosen author; later cells refer to parsed_novel

In the blank cell below, define a function `get_entities_in(parsed_novel,entity_type)` that takes two inputs:
- `parsed_novel` is the result of running spaCy on the raw text of some novel
- `entity_type` is one of the spaCy entity types, e.g. "PERSON"

The output should be a list of the text for each entity appearing in `parsed_novel` that is of type `entity_type`

spaCy can sometimes return entities with an empty text representation, and you don't want to include these in the output.

It is helpful to normalise the text as follows:
- convert the text for each entity to lower case using `lower()`
- remove any surrounding white space, using `strip()`

Run your function on your parsed novel and look at the first 10 characters.
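
A minimal sketch of `get_entities_in` following the description above. It relies only on each entity having `text` and `label_` attributes, as spaCy spans do; `toy_doc` is a hypothetical stand-in built with `SimpleNamespace` so that the example runs without loading a model.

```python
from types import SimpleNamespace

def get_entities_in(parsed_novel, entity_type):
    """Return the normalised text of every entity of the given type, skipping empty ones."""
    entities = []
    for ent in parsed_novel.ents:
        text = ent.text.strip().lower()
        if ent.label_ == entity_type and text:
            entities.append(text)
    return entities

# Hypothetical stand-in for a parsed novel (not real spaCy output).
toy_doc = SimpleNamespace(ents=[
    SimpleNamespace(text=" Emma ", label_="PERSON"),
    SimpleNamespace(text="Highbury", label_="GPE"),
    SimpleNamespace(text="\n", label_="PERSON"),        # empty after stripping
    SimpleNamespace(text="Mr. Knightley", label_="PERSON"),
])

print(get_entities_in(toy_doc, "PERSON")[:10])  # ['emma', 'mr. knightley']
```

On a real document you would pass the object returned by `nlp(...)` in place of `toy_doc`.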


Your next idea is to define a function `get_main_characters(parsed_novel,num_characters)` that takes two inputs:
- `parsed_novel` is the result of running spaCy on the raw text of some novel
- `num_characters` is a positive whole number, specifying how many of the main characters should be returned

The output will be a list of the `num_characters` most frequently occurring `"PERSON"` entities in `parsed_novel`.

In the blank cell below, implement `get_main_characters`.
- This function should make use of the `get_entities_in` function you have just defined
- You can use `Counter` to produce a counter from a list of elements - try `Counter(["a","b","a","c","b"])`
- Once you have a `Counter` you can use `Counter`'s `most_common` method to find the most common characters
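
Putting the hints above together, a sketch of `get_main_characters` could look like this. The compact `get_entities_in` is repeated so that the block runs on its own, and `toy_doc` is a hypothetical stand-in for a parsed novel.

```python
from collections import Counter
from types import SimpleNamespace

def get_entities_in(parsed_novel, entity_type):
    """Normalised, non-empty entity texts of the given type."""
    return [ent.text.strip().lower() for ent in parsed_novel.ents
            if ent.label_ == entity_type and ent.text.strip()]

def get_main_characters(parsed_novel, num_characters):
    """Return the num_characters most frequent PERSON entities, most frequent first."""
    counts = Counter(get_entities_in(parsed_novel, "PERSON"))
    return [name for name, _ in counts.most_common(num_characters)]

# Hypothetical stand-in: Emma mentioned three times, Harriet twice, Elton once.
mentions = ["Emma", "Harriet", "Emma", "Elton", "Harriet", "Emma"]
toy_doc = SimpleNamespace(ents=[SimpleNamespace(text=m, label_="PERSON") for m in mentions])

print(get_main_characters(toy_doc, 2))  # ['emma', 'harriet']
```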

Use the idea from `get_main_characters` to find the number of characters that are mentioned only once in a book, i.e. the least common characters.
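
As a sketch of the single-mention idea, the hypothetical helper below takes a list of normalised names (for example the output of `get_entities_in(parsed_novel, "PERSON")`) and keeps those whose count is exactly one. The name `characters_mentioned_once` and the toy list are assumptions for illustration.

```python
from collections import Counter

def characters_mentioned_once(names):
    """Return the names that occur exactly once, in order of first appearance."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count == 1]

# Toy list of normalised PERSON mentions (made up).
toy_names = ["emma", "harriet", "emma", "elton", "harriet", "emma", "weston"]

once = characters_mentioned_once(toy_names)
print(len(once), once)  # 2 ['elton', 'weston']
```

Entities seen only once are often NER mistakes rather than genuine characters, so it is worth inspecting a few of them by hand.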

Extracting Feature Sets for Characters:

We now turn to the issue of extracting feature sets for characters or sets of characters. The base code has been provided; the parts that still need doing are listed below.

 - Write your code so that it is possible to specify any set of relations of interest, e.g. both nsubj and dobj.
 - Refine your solution further by removing the most commonly occurring verbs. Adapt a copy of the code that you created when solving the previous exercise so that contexts involving the most common verbs are not displayed.
   - Hint: use a Counter to determine the count of each verb in a set of novels, and then use most_common(n) to find the most common n verbs.
 - Refine it further. Your goal should be to identify other aspects of the context in which a character is mentioned that you think will help to provide a richer characterisation of the way that a character is being portrayed by the author.
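
The common-verb refinement can be sketched on the kind of structure that `get_interesting_contexts` returns, i.e. a mapping from character name to a `Counter` of verb lemmas. As a simplifying assumption, the verb totals here are computed from the extracted contexts rather than from every verb in the novels; `most_common_verbs`, `remove_common_verbs` and the toy counts are illustrative only.

```python
from collections import Counter

def most_common_verbs(contexts, n):
    """Return the set of the n verbs with the highest total count across all characters."""
    totals = Counter()
    for verb_counts in contexts.values():
        totals.update(verb_counts)
    return {verb for verb, _ in totals.most_common(n)}

def remove_common_verbs(contexts, n):
    """Return a copy of contexts without the n most common verbs."""
    common = most_common_verbs(contexts, n)
    return {character: Counter({verb: count for verb, count in verb_counts.items()
                                if verb not in common})
            for character, verb_counts in contexts.items()}

# Toy contexts (made-up counts).
toy_contexts = {
    "emma": Counter({"say": 5, "think": 2, "smile": 1}),
    "harriet": Counter({"say": 3, "blush": 2}),
}

print(remove_common_verbs(toy_contexts, 1))
```

Passing `target_rels = {'nsubj', 'dobj'}` to the existing function already covers the first bullet, since `rels` is tested with `in`.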

In [None]:
def get_interesting_contexts(novels,rels,num_characters):
    
    def of_interest(ent,rels,main_characters):
        return (ent.text.strip().lower() in main_characters 
                and ent.label_ == 'PERSON' 
                and ent.root.head.pos_ == 'VERB'
                and ent.root.dep_ in rels)  

    contexts = defaultdict(Counter)    
    for parsed_novel in novels:
        main_characters = get_main_characters(parsed_novel,num_characters)
        for ent in parsed_novel.ents:
            if of_interest(ent,rels,main_characters):
                contexts[ent.text.strip().lower()][ent.root.head.lemma_] += 1
    return contexts

novels = {parsed_novel} #use a set here to allow for the possibility of having multiple texts
number_of_characters_per_text = 8
target_rels = {'nsubj'} #use set to allow for the possibility of several target dependency relations
target_contexts = get_interesting_contexts(novels,target_rels,number_of_characters_per_text)
display(pd.DataFrame.from_dict(target_contexts).applymap(lambda x: '' if math.isnan(x) else x))


- In the cell below, describe the code that shows how you were able to extract features in situations where one of the pronouns "he", "she", "his" and "her" is used in a novel.
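
One possible shape for the pronoun features is sketched below. It assumes tokens expose `lower_`, `dep_` and `head.lemma_`, as spaCy tokens do, and that possessive pronouns carry the `poss` relation; `PRONOUN_GENDER`, `get_pronoun_contexts` and the stand-in tokens are hypothetical and do not come from a real parse.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace

PRONOUN_GENDER = {"he": "male", "his": "male", "she": "female", "her": "female"}

def get_pronoun_contexts(tokens, rels):
    """Count the head lemma of each gendered pronoun whose dependency relation is in rels."""
    contexts = defaultdict(Counter)
    for token in tokens:
        gender = PRONOUN_GENDER.get(token.lower_)
        if gender is not None and token.dep_ in rels:
            contexts[gender][token.head.lemma_] += 1
    return contexts

def fake_token(word, dep, head_lemma):
    """Build a stand-in token with only the attributes used above."""
    return SimpleNamespace(lower_=word, dep_=dep, head=SimpleNamespace(lemma_=head_lemma))

toy_tokens = [
    fake_token("she", "nsubj", "smile"),
    fake_token("he", "nsubj", "say"),
    fake_token("her", "poss", "sister"),
    fake_token("her", "dobj", "see"),   # ignored below because dobj is not requested
    fake_token("emma", "nsubj", "walk"),
]

print(dict(get_pronoun_contexts(toy_tokens, {"nsubj", "poss"})))
```

With a real parse, `tokens` would simply be the spaCy document itself, since iterating over it yields tokens.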

## Investigating differences in the way genders are portrayed

- You should make it clear how you have aggregated feature sets across the male and female characters appearing in at least two collections of novels. These collections could be novels by different authors, different sets of authors, or sets of novels that were written at different periods in history.
- You should discuss the result of measuring the cosine similarity of the aggregated male and female feature sets. The reason to consider different sets of novels is to look at differences in gender-based cosine similarity in the different collections.
- In the last section you should have considered a number of alternative ways of deriving feature sets for characters. In this section, you should present the results of using these alternative approaches.
- You should explain what you have done to assess the cosine similarity of pairs of feature sets that are aggregated over randomly selected characters (i.e. characters that aren't split up on the basis of gender). This should provide an indication as to whether the differences you find when making a gender-based comparison are meaningful.
- You should explain how you went about assessing what the impact would be of an imbalance in the number of male and female characters. Is there a gender imbalance in your gender-based comparison?
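
The random baseline and the imbalance check can share one routine: draw two random groups of characters, of whatever sizes you like, and compare their aggregated feature sets. The sketch below is one way to do that under stated assumptions; `cosine` is a small pure-Python stand-in so that the block runs without scikit-learn, and `random_split_similarities`, its parameters and the toy features are hypothetical.

```python
import math
import random
from collections import Counter

def cosine(counter1, counter2):
    """Cosine similarity of two Counters treated as sparse vectors."""
    dot = sum(weight * counter2[key] for key, weight in counter1.items())
    norm1 = math.sqrt(sum(w * w for w in counter1.values()))
    norm2 = math.sqrt(sum(w * w for w in counter2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def random_split_similarities(character_features, size_a, size_b, trials, rng):
    """Repeatedly compare two disjoint, randomly chosen groups of characters."""
    names = sorted(character_features)
    similarities = []
    for _ in range(trials):
        chosen = rng.sample(names, size_a + size_b)
        group_a = sum((character_features[n] for n in chosen[:size_a]), Counter())
        group_b = sum((character_features[n] for n in chosen[size_a:]), Counter())
        similarities.append(cosine(group_a, group_b))
    return similarities

# Toy feature sets (made up).
toy_features = {
    "emma": Counter({"say": 2, "smile": 1}),
    "harriet": Counter({"say": 1, "blush": 1}),
    "knightley": Counter({"say": 3, "ride": 1}),
    "elton": Counter({"say": 1, "preach": 2}),
}

sims = random_split_similarities(toy_features, size_a=1, size_b=3, trials=5, rng=random.Random(0))
print(min(sims), max(sims))
```

Setting `size_a` and `size_b` to the actual numbers of male and female characters shows what range of similarities arises from group size alone, which gives the gender-based figure something to be compared against.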

## Help
#### Aggregating feature sets

Once you are satisfied with the feature sets that you are able to build for a character, you are ready to undertake your analysis of the way characters are being portrayed based on gender.

- Select a set of novels
- Parse each of the novels with spaCy (this might take a while)
- Determine the settings of any parameters that are needed by the code you have written to produce the character feature sets, e.g.
 - the number of characters to consider in each novel
 - the number of most common verbs to disregard
- Run your code that builds feature sets for characters over all of the novels under consideration
- Build two aggregated feature sets, one for all female characters and one for all male characters
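
The final aggregation step could be sketched as follows, assuming per-character feature sets are `Counter` objects keyed by character name and that a separate lookup gives each character's gender. The function name `aggregate_by_gender`, the labels `"male"` and `"female"`, and the toy data are assumptions for illustration.

```python
from collections import Counter

def aggregate_by_gender(character_features, character_gender):
    """Sum the feature Counters of all characters of each known gender."""
    aggregated = {"male": Counter(), "female": Counter()}
    for character, features in character_features.items():
        gender = character_gender.get(character, "unknown")
        if gender in aggregated:          # characters of unknown gender are left out
            aggregated[gender].update(features)
    return aggregated

# Toy data (made up).
toy_features = {
    "emma": Counter({"say": 2, "smile": 1}),
    "harriet": Counter({"say": 1, "blush": 1}),
    "knightley": Counter({"say": 3, "ride": 1}),
    "highbury": Counter({"lie": 1}),
}
toy_gender = {"emma": "female", "harriet": "female", "knightley": "male"}

totals = aggregate_by_gender(toy_features, toy_gender)
print(totals["female"])
print(totals["male"])
```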

In the next cell, we look at how to measure the difference between these two aggregated feature sets and how to assess whether the difference you find is significant.

#### Measuring the similarity of two feature sets

The code cell below shows how to compare the similarity of two feature sets. This is now explained.

- We are given two feature sets: `A` and `B`.
- Initially, each feature set is represented as a `Counter` which is a dictionary where the keys are the features and each feature (key) is mapped to a positive number which corresponds to the strength (weight) of that feature. 
 - feature set `A` has features `'a', 'b' and 'c'` with weights `1, 2 and 3`, respectively.
 - feature set `B` has features `'b', 'c', 'd' and 'e'` with weights `3, 4, 5 and 6`, respectively.
- Note that they share some, but not all of their features.
- Our goal is to represent both feature sets as lists in such a way that each position in the lists is consistently used for a particular feature.
- For example, we could use a list with 5 positions, where the weight of feature `'a'` is held in the first position, the weight of feature `'b'` is held in the second position, and so on.
 - with this scheme the feature list for `A` would be the list: `[1,2,3,0,0]`, and the feature list for `B` would be `[0,3,4,5,6]`.
- The function `counters_to_feature_lists` takes two feature sets, each of which is a `Counter`, and returns two lists, one for each of the inputs, where both lists use the same feature representation.
- In the first line of the function, the counters are added together. This is done because the keys of the resulting counter (which is named `combined`) can be used to produce consistent mappings of the counters to lists, as the second and third lines of the function show.
- Once consistent list representations are produced for the two feature sets, we can use the `cosine_similarity` function from `sklearn` as a measure of how similar the lists are, and therefore, how similar the feature sets are.
- For non-negative feature weights such as these counts, `cosine_similarity` returns a real number between 0 and 1, with 1 indicating that the two lists point in the same direction (they are identical up to scaling), and 0 indicating that the two inputs share no features.
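
As a hand-worked check for the example lists `[1,2,3,0,0]` and `[0,3,4,5,6]` (a calculation from the definition of cosine similarity, not output copied from the notebook):

$$
\cos(A,B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{1\cdot 0 + 2\cdot 3 + 3\cdot 4 + 0\cdot 5 + 0\cdot 6}{\sqrt{1^2+2^2+3^2}\;\sqrt{3^2+4^2+5^2+6^2}}
= \frac{18}{\sqrt{14}\,\sqrt{86}} \approx 0.519
$$

Only the shared features `'b'` and `'c'` contribute to the numerator, while every feature contributes to the norms in the denominator.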

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

A = Counter({'a':1, 'b':2, 'c':3})
B = Counter({'b':3, 'c':4, 'd':5, 'e':6})

def counters_to_feature_lists(counter1,counter2):
    combined = counter1 + counter2 
    list1 = [counter1[key] for key in combined]
    list2 = [counter2[key] for key in combined]
    return list1,list2

L1,L2 = counters_to_feature_lists(A,B)
print(L1)
print(L2)
cosine_similarity([L1], [L2])[0,0]

# Adapt for use in section 4