_Carry overs from previous notebook_
    
## Issues

1. Annotated data issues:
  - (Fixed by sorting chars) different annotators may list the characters of a pair in a different order
  - (Fixed by dropping NR chars / affinity) NR still present with some comments
2. Book source issues:
  - different editions of the same book may be available with different levels of changes (from additions / deletions to non-rendering fonts using Æ and others)
  - licensing text needs to be removed, as it introduces extra noise
3. book-nlp output issues:
  - as is, it causes issues when loaded via pandas, as occasionally a token's label is too long or there are some other issues
  - may result in the same character being represented as multiple characters
4. Current algorithm issues:
  - there is no way to limit characters to only the 'important ones'. We can do some sort of validation, but either an external part should be responsible for selecting the main characters (and it is not yet certain those would be present in the annotated relations), or the current system should be updated to account for that
  - using the sentiment of sentences that mention both characters isn't the best approach; some papers mention other ways of doing this
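
To make the last point concrete, here is a minimal sketch of the "sentiment of sentences mentioning two characters" idea. It is not the code in `baseline.py`: the function `pair_affinity`, the naive sentence splitter and the tiny `TOY_LEXICON` are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical toy lexicon on the same 0..1 scale as the annotations;
# a real system would use a proper sentiment model instead.
TOY_LEXICON = {"love": 1.0, "friend": 1.0, "hate": 0.0, "enemy": 0.0}

def pair_affinity(text, char_1, char_2, lexicon=TOY_LEXICON, default=0.5):
    """Average lexicon score over sentences that mention both characters."""
    # naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    scores = []
    for sentence in sentences:
        if char_1 in sentence and char_2 in sentence:
            words = re.findall(r"[a-z']+", sentence.lower())
            hits = [lexicon[w] for w in words if w in lexicon]
            if hits:
                scores.append(sum(hits) / len(hits))
    # fall back to a neutral score when no sentence carries any signal
    return sum(scores) / len(scores) if scores else default

text = ("Anna is a friend of Boris. Boris does love Anna. "
        "Anna met Carl. Boris began to hate Anna.")
anna_boris = pair_affinity(text, "Anna", "Boris")
anna_carl = pair_affinity(text, "Anna", "Carl")
```

Even this toy version shows the weakness: a sentence that merely mentions both names contributes as much as one that actually describes their relationship.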

## Next steps

1. There is still a long way to go with this amount of data and this baseline: tuning sentiment, removing stopwords, fixing all of the issues above.
2. I have another dataset that needs some transforming to be compatible with the current one.
3. Trying other baselines.

In [95]:
import os
import re

import pandas as pd
import numpy as np
import collections as col
from sklearn import metrics

import books_utils as bu
import baseline

In [86]:
import importlib
importlib.reload(baseline)

<module 'baseline' from '/Users/sudodoki/Projects/AI_ML/projector-nlp/final-project-public/experiment-2/baseline.py'>

### Reading in annotations / books

In [39]:
annotations = pd.read_csv('../data/character_relation_annotations.txt.gz', sep='\t')
# dropping rows where affinity or either character is NR - might transform this later based on category
annotations = annotations[(annotations['affinity'] != 'NR') & (annotations['character_1'] != 'NR') & (annotations['character_2'] != 'NR')].copy()
# regex=True is needed explicitly: newer pandas treats the pattern as a literal string by default
annotations['book_name'] = (annotations['title'] + ' ' + annotations['author']).str.replace(r"\s", "_", regex=True)
print(annotations.shape)
# making sure no NR in character_1/character_2/affinity
annotations.describe()

(2137, 11)


| | annotator | change | title | author | character_1 | character_2 | affinity | coarse_category | fine_category | detail | book_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2137 | 2137 | 2137 | 2137 | 2137 | 2137 | 2137 | 2137 | 2137 | 2137 | 2137 |
| unique | 14 | 3 | 109 | 49 | 1005 | 825 | 3 | 4 | 30 | 528 | 109 |
| top | annotator_1 | no | Timon of Athens | William Shakespeare | Joseph K. | Timon | positive | social | friend | NR | Hamlet_William_Shakespeare |
| freq | 760 | 1712 | 20 | 613 | 15 | 17 | 1120 | 886 | 342 | 1591 | 20 |


In [42]:
# working around multiple annotators giving different marks - transforming into
# real value based on mapping and averaging it
def avg(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)
affinity_mapping = {
    'positive': 1,
    'neutral': 0.5,
    'negative': 0
}
annotations['num_affinity'] = annotations['affinity'].map(affinity_mapping)
# book_name -> 'char_1:char_2' -> list of numeric affinities from all annotators
by_book_annotations = col.defaultdict(lambda: col.defaultdict(list))
def add_books_annotations(row):
    book_name = row['book_name']
    # sorting makes the pair key independent of the order an annotator used
    char_1, char_2 = sorted([row['character_1'], row['character_2']])
    by_book_annotations[book_name][char_1 + ':' + char_2].append(row['num_affinity'])

annotations.apply(add_books_annotations, axis=1)

# collecting rows in a list first: DataFrame.append was removed in pandas 2.0
rows = []
for book in by_book_annotations:
    for pair in by_book_annotations[book]:
        char_1, char_2 = pair.split(':')
        rows.append({
            'book_name': book, 'char_1': char_1, 'char_2': char_2, 'affinity': avg(by_book_annotations[book][pair])
        })
all_df = pd.DataFrame(rows, columns=['book_name', 'char_1', 'char_2', 'affinity'])
# voila: all_y holds affinity as real values, all_X holds book_name, char_1 and char_2
all_y = all_df['affinity'].copy()
all_X = all_df.drop('affinity', axis=1)
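
As a small worked example of the averaging above, the sketch below uses three made-up annotations (the `toy` list is hypothetical, not taken from the dataset). Two annotators rate the same pair in opposite order, and sorting the names makes both land on one key.

```python
from collections import defaultdict

affinity_mapping = {'positive': 1, 'neutral': 0.5, 'negative': 0}

# hypothetical annotations: (character_1, character_2, affinity)
toy = [
    ("Hamlet", "Ophelia", "positive"),
    ("Ophelia", "Hamlet", "neutral"),   # same pair, reversed order
    ("Hamlet", "Claudius", "negative"),
]

scores = defaultdict(list)
for char_a, char_b, affinity in toy:
    key = ':'.join(sorted([char_a, char_b]))
    scores[key].append(affinity_mapping[affinity])

averaged = {pair: sum(vals) / len(vals) for pair, vals in scores.items()}
print(averaged)  # {'Hamlet:Ophelia': 0.75, 'Claudius:Hamlet': 0.0}
```

Note that disagreeing annotators produce intermediate values such as 0.75, which is why the target is treated as a real number rather than a class at this stage.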

In [44]:
# making sure all raw txt files are present for books
titles = annotations['title'].unique()
# first author listed for each title
authors = [annotations.loc[annotations['title'] == title, 'author'].iloc[0] for title in titles]
existing_files = []
names = []
for title, author in zip(titles, authors):
    # raw string avoids the invalid-escape warning for "\s"
    name = re.sub(r"\s", "_", '{} {}'.format(title, author))
    names.append(name)
    file = '../data/books/{}.txt'.format(name)
    existing_files.append(os.path.isfile(file))
len(titles), len(existing_files), all(existing_files)

(109, 109, True)

In [49]:
books = [bu.Book(name, book_NLP_folder="../data/bookNLP_output", source_folder="../data/books") for name in names]

In [51]:
for book in books:
    if len(book.tokens) < 100:
        print(book.name, len(book.tokens), book.tokens.shape)
# nothing is printed, so every book has at least 100 tokens; yet there is still an issue
# with parsing the bookNLP token output into dataframes: under the hood we skip the lines that fail to parse
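
For reference, one way such line skipping can be done with pandas is sketched below. This is an assumption about the failure mode (stray quote characters or rows with extra fields), not a description of what `bu.Book` actually does; the helper `read_tokens_tolerant` and the sample columns are made up for illustration.

```python
import csv
import io

import pandas as pd

def read_tokens_tolerant(source):
    """Read a tab-separated token table, skipping rows with too many fields."""
    return pd.read_csv(
        source,
        sep="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in tokens as plain text
        on_bad_lines="skip",     # drop rows with more fields than the header (pandas >= 1.3)
        dtype=str,
        keep_default_na=False,   # keep tokens like "NA" as strings
    )

# made-up sample: row 0 has an unbalanced quote, row 1 has an extra field
sample = io.StringIO(
    'tokenId\toriginalWord\tpos\n'
    '0\t"Hello\tNN\n'
    '1\tworld\tNN\textra\n'
    '2\t.\t.\n'
)
tokens = read_tokens_tolerant(sample)
print(len(tokens))  # 2
```

Skipping rows silently loses tokens, so it is worth counting how many lines were dropped per book before trusting downstream results.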

In [62]:
book = books[0]
book_name = book.name
subset = all_X[all_X['book_name'] == book_name]
present_chars = pd.concat([subset['char_1'], subset['char_2']]).unique()
for char in book.characters.meaningful:
    name = bu.book_name_to_annotated_name(book_name, char, present_chars, False)
    if name:
        print(name, "|", bu.longest_name(char))
# Okay, new challenge - we need to 'merge' characters

Don Quixote | Don Quixotes
Altisidora | ALTISIDORA
Don Quixote | Senor Don Quixote
Sampson Carrasco | SAMSON CARRASCO
Don Quixote | lord Don Quixote
Rocinante | Rocinante
The Duke and Duchess | Duke of Sesa
Cervantes | Miguel de Cervantes
The Duke and Duchess | Duke Ricardo
Cide Hamete Benengeli | Cide Hamete Benengeli
Sancho Panza | Sancho Panzas
Dapple | Dapple
Sancho Panza | Senor Don Sancho Panza
Dulcinea del Toboso | lady Dona Dulcinea del Toboso


### Reproducing baseline results

In [133]:
import importlib
# reloading books_utils first so that baseline picks up the fresh version
importlib.reload(bu)
importlib.reload(baseline)

<module 'baseline' from '/Users/sudodoki/Projects/AI_ML/projector-nlp/final-project-public/experiment-2/baseline.py'>

In [123]:
base_predictor = baseline.create_for(books, all_X)

Looping over books: 100%|██████████| 109/109 [5:24:54<00:00, 166.29s/it]   


In [136]:
def score_to_label(val):
    if val <= 0.33:
        return 'negative'
    if val <= 0.66:
        return 'neutral'
    return 'positive'

In [137]:
y_predicted = baseline.predict(base_predictor, all_X)
y_predicted.clip(0, 1, inplace=True)

In [138]:
# sklearn convention is (y_true, y_pred); MSE is symmetric, so the value is unchanged
metrics.mean_squared_error(all_y, y_predicted)

0.18553792295085625

In [139]:
print(metrics.classification_report(all_y.map(score_to_label), y_predicted.map(score_to_label)))

             precision    recall  f1-score   support

   negative       0.35      0.08      0.13       397
    neutral       0.22      0.86      0.35       312
   positive       0.60      0.12      0.20       759

avg / total       0.45      0.27      0.21      1468



In [104]:
# one more baseline, just to check - always predicting the majority class (positive)
y_all_positive = np.ones_like(y_predicted)
print(metrics.mean_squared_error(all_y, y_all_positive))
print(metrics.classification_report(all_y.map(score_to_label), [score_to_label(val) for val in y_all_positive]))
# well, at least our baseline has a lower MSE (0.186 vs 0.309), although its weighted F1 is lower (0.21 vs 0.35)

0.3092465656221617
             precision    recall  f1-score   support

   negative       0.00      0.00      0.00       397
    neutral       0.00      0.00      0.00       312
   positive       0.52      1.00      0.68       759

avg / total       0.27      0.52      0.35      1468




In [105]:
# Books issues with chars
# WARNING: Mrs. John Dashwood might have multiple aliases: ['Mrs. Dashwood', 'John Dashwood']
# WARNING: Hector son of Priam might have multiple aliases: ['Hector', 'Priam']
# WARNING: Mrs. Joe Gargery might have multiple aliases: ['Joe Gargery', 'Mrs. Joe']
# WARNING: Mrs. Ned Hale might have multiple aliases: ['Mrs. Ned Hale', 'Ned Hale']
# WARNING: Elizabeth-Jane Newson might have multiple aliases: ['Elizabeth-Jane Newson', 'Newson']

In [124]:
# names flagged by the alias warnings above
bad_chars = {
    "Mrs. John Dashwood",
    "Hector son of Priam",
    "Mrs. Joe Gargery",
    "Mrs. Ned Hale",
    "Elizabeth-Jane Newson",
}
for i, book in enumerate(books):
    for char in book.characters.meaningful:
        if bu.longest_name(char) in bad_chars:
            print(i, book.name, char['id'], bu.longest_name(char))

3 Sense_and_Sensibility_Jane_Austen 3 Mrs. John Dashwood
32 The_Iliad_Homer 78 Hector son of Priam
55 Great_Expectations_Charles_Dickens 92 Mrs. Joe Gargery
70 Ethan_Frome_Edith_Wharton 6 Mrs. Ned Hale
102 The_Mayor_of_Casterbridge_Thomas_Hardy 42 Elizabeth-Jane Newson


Even though I identified these chars, I'm not sure what to do with all of this:

1. Mrs. John Dashwood is an identifier for the wife of Mr. John Dashwood
2. "son of Priam" mentions Priam but in no way refers to Priam himself
3. Mrs. Joe Gargery is an identifier for the wife of Mr. Joe Gargery
4. Not even sure about this one, but I think once again this is an identifier for the wife of Mr. Ned Hale
5. Newson is used as a reference to the father of Elizabeth-Jane Newson

so I'm considering adding the following rules to bu.book_name_to_annotated_name()

```
Mrs. + Name != Name
X son of Y != Y
Firstname Lastname != Lastname
```

Although the first rule could be handled differently if we had gender annotations for characters in the dataset (given that bookNLP infers gender), bookNLP itself gets confused by Mrs. John Dashwood: it treats her as the same character as John Dashwood, rather than treating Mr. John Dashwood as a character of his own.

🤔 might be an issue for the future
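
A minimal sketch of how the three rules above could be checked, assuming plain string matching is enough. The helper `violates_alias_rules` is hypothetical and is not part of `books_utils`; it only decides whether a candidate alias should be rejected for a given full name.

```python
import re

def violates_alias_rules(full_name, alias):
    """Return True if `alias` should NOT be merged with `full_name`."""
    full = full_name.strip()
    short = alias.strip().lower()
    if full.lower() == short:
        return False
    # Rule 1: Mrs. + Name != Name
    match = re.match(r"^mrs\.?\s+(.+)$", full, re.IGNORECASE)
    if match and match.group(1).lower() == short:
        return True
    # Rule 2: X son of Y != Y
    match = re.match(r"^(.+?)\s+son of\s+(.+)$", full, re.IGNORECASE)
    if match and match.group(2).lower() == short:
        return True
    # Rule 3: Firstname Lastname != Lastname
    # (strict: this also rejects a bare surname for single-bearer names)
    parts = full.split()
    if len(parts) >= 2 and parts[-1].lower() == short:
        return True
    return False

print(violates_alias_rules("Mrs. John Dashwood", "John Dashwood"))  # True
print(violates_alias_rules("Hector son of Priam", "Hector"))        # False
```

Rule 3 is the riskiest of the three: it would also stop a lone surname from matching a character who is the only bearer of that name, so it probably needs a check that another character shares the surname.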