# NLP Assignment 2 (40% of grade): Vector Space Semantics for Similarity between EastEnders Characters

In this assignment, you will be creating a vector representation of a document containing lines spoken by a character in the EastEnders script data (i.e. from the file `training.csv`), then improving that representation such that each character vector is maximally distinguished from the other character documents. This distinction is measured by how well a simple information retrieval classification method can select documents from validation and test data as belonging to the correct class of document (i.e. deciding which character spoke the lines by measuring the similarity of those character document vectors to those built in training).

As the lines are not evenly distributed in terms of frequency, this coursework stipulates you can only use a maximum of the first **300 lines** of each character in the training data `training.csv` to create the training documents and a maximum of the first **50 lines** in the validation and test data (from `val.csv` and `test.csv`). This makes it more challenging, as the number of lines spoken by a character can't be used directly or otherwise as a feature.

A simple vector representation for each character document is done for you to start with in this code, as is the pipeline of similarity-based information retrieval evaluation. You need to improve the character vector representations by pre-processing, feature extraction and transformation techniques, as per Questions 1-6 below, which you need to complete as instructed.

**Refer to the material in units 8-9 for conceptual background.**

In [38]:
! pip install nltk


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [39]:
! pip install spacy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [40]:
! pip install contractions


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [41]:
! python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m30.2 MB/s[0m eta [36m0:00:00[0m00:01[0m0:01[0m
[?25h
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [42]:
import string
import re
import spacy
from contractions import fix
import numpy as np
from numpy.linalg import norm
import pandas as pd
from collections import Counter, OrderedDict

import seaborn as sns
import matplotlib.pyplot as plt

import math

import nltk

from sklearn.feature_extraction import DictVectorizer
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction import DictVectorizer

import itertools

%matplotlib inline
pd.options.display.max_colwidth=500


[nltk_data] Downloading package punkt_tab to /Users/salva/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/salva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/salva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [43]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)
val_path ='val.csv'
val_data = pd.read_csv(val_path,  delimiter="\t", skip_blank_lines = True)
test_path ='test.csv'
test_data = pd.read_csv(test_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
train_data

Unnamed: 0,Episode,Scene,Scene_info,Character_name,Line,Gender
0,1350,1,DESERTED CAR PARK EXT NIGHT,SHIRLEY,"Look at ya, not a mark on ya. And you think you're an unlucky man.",FEMALE
1,1350,1,DESERTED CAR PARK EXT NIGHT,OTHER,Shirl...,MALE
2,1350,2,R&R INT NIGHT,JACK,Oi. Where have you been? Huh? What were the texts about?,MALE
3,1350,2,R&R INT NIGHT,RONNIE,Nothing. Nothing. I'll be with you in two minutes yeah?,FEMALE
4,1350,2,R&R INT NIGHT,JACK,"Well I've got mates here I wanted to have a chat with them, instead I've been serving behind the bar.",MALE
...,...,...,...,...,...,...
14008,1399,55,SQUARE EXT DAY LIGHT,OTHER,"Dad? Okay ... alright, just one drink alright. But that's all. It doesn't mean anything. It's just a drink.",MALE
14009,1399,55,SQUARE EXT DAY LIGHT,MAX,Thanks Bradley. Thanks mate... It means the world to me...,MALE
14010,1399,55,SQUARE EXT DAY LIGHT,OTHER,You alright...,MALE
14011,1399,55,SQUARE EXT DAY LIGHT,MAX,"Yeah, yeah, yeah. I'm fine.",MALE


In [44]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
    their lines joined together as a single string, with end of line _EOL_
    markers between them

    Improve this for Q3
    
    ::max_line_count:: the maximum number of lines to be added per character
    """
    character_docs = {}
    character_line_count = {}
    for line, name, gender in zip(df.Line, df.Character_name, df.Gender):
        #remove the empty lines
        if (isinstance(line, float) and math.isnan(line)) or line == "":
            continue
        if not name in character_docs.keys():
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name]==max_line_count:
            continue
        character_docs[name] += str(line)   + " _EOL_ "  # adding an end-of-line token
        character_line_count[name]+=1
    print("lines per character", character_line_count)
    return character_docs

In [45]:
# print out the number of words each character has in the training set
# only use the first 300 lines of each character
train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=300)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

lines per character {'SHIRLEY': 300, 'OTHER': 300, 'JACK': 300, 'RONNIE': 300, 'TANYA': 300, 'SEAN': 300, 'ROXY': 300, 'MAX': 300, 'IAN': 300, 'JANE': 300, 'STACEY': 300, 'PHIL': 300, 'HEATHER': 300, 'MINTY': 300, 'CHRISTIAN': 300, 'CLARE': 300}
Num. Characters:  16 

SHIRLEY Number of Words:  3100
OTHER Number of Words:  2673
JACK Number of Words:  3707
RONNIE Number of Words:  3005
TANYA Number of Words:  3291
SEAN Number of Words:  2868
ROXY Number of Words:  3119
MAX Number of Words:  3884
IAN Number of Words:  3467
JANE Number of Words:  3128
STACEY Number of Words:  3235
PHIL Number of Words:  3129
HEATHER Number of Words:  3262
MINTY Number of Words:  3310
CHRISTIAN Number of Words:  3278
CLARE Number of Words:  3623
total words 52079


In [46]:
train_character_docs["SHIRLEY"]

'Look at ya, not a mark on ya. And you think you\'re an unlucky man. _EOL_ I\'m gonna get help. Oh where\'s my phone? Oh Kevin. Kevin you smashed it, didn\'t ya? Kevin, Kevin, where\'s your phone? _EOL_ No you\'re not, ssh, shut up. _EOL_ Fire brigade and ambulance. There\'s been an accident. On an industrial estate in Walford. ...Um, the Marsh Lane industrial estate. Please come quick. My husband- he\'s not my husband- my friend. He\'s trapped in the car. Please come quick... Shirley Carter. 82 82B George Street, Walford, E20. Please hurry, please come quick. _EOL_ Kevin. Kevin! _EOL_ Kevin I\'m gonna go to the main road - _EOL_ To make sure they know where to go. _EOL_ Kevin I\'ll be five minutes. _EOL_ You\'ll be fine. You\'re talking. I\'ll be five minutes. _EOL_ It\'s alright, it\'s alright, it\'s alright, it\'s alright. _EOL_ Go away. _EOL_ I don\'t know what to say. It\'s a nightmare. _EOL_ He asked me to go with him. _EOL_ Yeah. _EOL_ Between us? Of course not. _EOL_ We just we

defining grid search options

In [70]:



################################################################################
# 1. Define your parameter grids
################################################################################

PREPROCESS_OPTIONS = [
    # Each tuple will represent some flags you'd toggle in pre_process()
    {"lower": True,"remove_punct": True, "expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":False,"lemmatize":False},
    {"remove_punct": True,"lower": False,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":False,"pos_filtering":False,"lemmatize":False},
    {"remove_punct": True,"lower": True,"expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": False,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": True,"expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":False,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": True,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": False,"expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": True,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":False,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": False,"expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": True,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":False},
    {"remove_punct": True,"lower": False,"expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":False,"pos_filtering":False,"lemmatize":True},
    {"remove_punct": True,"lower": True,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":False,"lemmatize":True},
    {"remove_punct": True,"lower": True,"expand_contractions": True, "normalize_space":False,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":False,"pos_filtering":True,"lemmatize":True},
    {"remove_punct": True,"lower": False,"expand_contractions": False, "normalize_space":True,"remove_repeat":True,"remove_word_numbers":True,"remove_filter":True,"stop_words":True,"pos_filtering":True,"lemmatize":True},

    
]

FEATURE_METHODS = [
    # Each is a list of strings for to_feature_vector_dictionary()
    ["bigram", "sentiment", "emotion", "linguistic"],
    ["bigram", "trigram", "pos", "linguistic"],
    ["bigram", "pos","emotion","linguistic"], 
    ["bigram", "sentiment", "emotion", "linguistic","pos"],
    ["bigram","sentiment","emotion"],
    ["bigram","emotion","linguistic"],
    ["bigram","emotion","pos"]
]

MATRIX_TYPES = [
    ["tfidf"],                 # only tf-idf
    ["tfidf", "selection"],    # tf-idf + feature selection
    ["selction"],                        # no tf-idf, no feature selection (example)
]

In [71]:

#preprocessing function with all options that were tested in the first question
def pre_process(text,params=None):
    # 4. Remove specific punctuation (.,!?) by replacing them with a space
    if params.get("remove_punct", True):
        text = re.sub(r'\b[.,!?-]+|[.,!?-]+\b', '', text)
    # 1. Remove repeated characters
    if params.get("remove_repeat", True):
        text = re.sub(r'(.)\1+', r'\1\1', text)
    if params.get("lower", True):
        text = text.lower()
        text = text.replace('_eol_','_EOL_')

    # 2. Expand contractions
    if params.get("expand_contractions", True):
        text = fix(text)
    if params.get("remove_repeat",True):
        # 4- remove repeated characters
        text = re.sub(r'(.)\1+', r'\1\1', text)
    # 3. Normalize spacing
    if params.get("normalize_space", True):
        text = re.sub(r'\s+', ' ', text)
    
    # 3. Tokenize text, preserving _EOL_ as a separate token
    tokens = word_tokenize(text)

    # 5. Remove words containing numbers
    if params.get("remove_word_numbers",True):
        tokens = [token for token in tokens if not any(char.isdigit() for char in token)]
   
    
    # 4. Remove filler words
    if params.get("remove_filter",True):
        filler_words = {'um', 'uh', 'hmm', 'ah', 'er', 'haha', 'hehe', 'huh'}
        tokens = [word for word in tokens if word.lower() not in filler_words or word == '_EOL_']
    if params.get("stop_words",True):
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stop_words]
    
    
    
    # 5. POS-based filtering and Named Entity Recognition
    if params.get("pos_filtering",True):
        # Reconstruct text with preserved _EOL_ for further processing
        text = ' '.join(tokens)
        nlp = spacy.load("en_core_web_sm")
        sentences = text.split('_EOL_')  # Split into sentences using `_EOL_` as a boundary
        processed_tokens = []
    
        for sentence in sentences:
            doc = nlp(sentence.strip())  # Process each sentence
            for token in doc:
                if token.ent_type_:  # Replace named entities with their type
                    processed_tokens.append(token.ent_type_)
                else: #token.pos_ in {'NOUN', 'VERB', 'ADJ'} or token.text == '_EOL_':  # Keep important POS or _EOL_
                    processed_tokens.append(token.text)
            # Add `_EOL_` back to indicate sentence boundary
            processed_tokens.append('_EOL_')

        tokens = processed_tokens
        text = ' '.join(tokens)
        #tokens = text.split()
    
    # 5. Contextual Lemmatization
    if params.get("lemmatize",True):
        # Reconstruct text with preserved _EOL_ for further processing
        text = ' '.join(tokens)
        nlp = spacy.load("en_core_web_sm")
        lemmatized_text = []
        for sentence in text:
            doc = nlp(sentence)
            lemmatized_text.append(' '.join([token.lemma_ if token.pos_ in {'VERB', 'NOUN'} else token.text for token in doc]))
    
    
        # Remove the trailing _EOL_ if it's added redundantly
        if text and text[-1] == '_EOL_':
            text = text[:-1]
        tokens = text.split()
    
    return tokens

'''
def pre_process(text):
    """Pre-process all the concatenated lines of a character, 
    using tokenization, spelling normalization and other techniques.
    
    Initially just a tokenization on white space. Improve this for Q1.
    
    ::character_text:: a string with all of one character's lines
     """
    # 1- converting text to lowercase
    text = text.lower()
    text = text.replace('_eol_','_EOL_')
    # 2- remove punctuation (.,!?-) at the beginning and end of wordsby replacing them with a space
    
    text = re.sub(r'\b[.,!?-]+|[.,!?-]+\b', '', text)
    
    # 3- expand contractions
    text = fix(text)

    # 4- remove repeated characters
    text = re.sub(r'(.)\1+', r'\1\1', text)

    # 5- using nltk's word tokenizer instead of splitting by whitespaces
    tokens = word_tokenize(text)   # just a simple tokenization, to be replaced
    # 6- remove tokens with digits
    tokens = [token for token in tokens if not any(char.isdigit() for char in token)]
   
    # 7- remove filler words
    filler_words = {'um', 'uh', 'hmm', 'ah', 'er', 'haha', 'hehe', 'huh'}
    tokens = [word for word in tokens if word.lower() not in filler_words or word == '_EOL_']
    
    # 8- removing stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    return tokens
'''


'\ndef pre_process(text):\n    """Pre-process all the concatenated lines of a character, \n    using tokenization, spelling normalization and other techniques.\n    \n    Initially just a tokenization on white space. Improve this for Q1.\n    \n    ::character_text:: a string with all of one character\'s lines\n     """\n    # 1- converting text to lowercase\n    text = text.lower()\n    text = text.replace(\'_eol_\',\'_EOL_\')\n    # 2- remove punctuation (.,!?-) at the beginning and end of wordsby replacing them with a space\n    \n    text = re.sub(r\'\x08[.,!?-]+|[.,!?-]+\x08\', \'\', text)\n    \n    # 3- expand contractions\n    text = fix(text)\n\n    # 4- remove repeated characters\n    text = re.sub(r\'(.)\x01+\', r\'\x01\x01\', text)\n\n    # 5- using nltk\'s word tokenizer instead of splitting by whitespaces\n    tokens = word_tokenize(text)   # just a simple tokenization, to be replaced\n    # 6- remove tokens with digits\n    tokens = [token for token in tokens if not any(

Original text for SHIRLEY

In [72]:
train_character_docs["SHIRLEY"]

'Look at ya, not a mark on ya. And you think you\'re an unlucky man. _EOL_ I\'m gonna get help. Oh where\'s my phone? Oh Kevin. Kevin you smashed it, didn\'t ya? Kevin, Kevin, where\'s your phone? _EOL_ No you\'re not, ssh, shut up. _EOL_ Fire brigade and ambulance. There\'s been an accident. On an industrial estate in Walford. ...Um, the Marsh Lane industrial estate. Please come quick. My husband- he\'s not my husband- my friend. He\'s trapped in the car. Please come quick... Shirley Carter. 82 82B George Street, Walford, E20. Please hurry, please come quick. _EOL_ Kevin. Kevin! _EOL_ Kevin I\'m gonna go to the main road - _EOL_ To make sure they know where to go. _EOL_ Kevin I\'ll be five minutes. _EOL_ You\'ll be fine. You\'re talking. I\'ll be five minutes. _EOL_ It\'s alright, it\'s alright, it\'s alright, it\'s alright. _EOL_ Go away. _EOL_ I don\'t know what to say. It\'s a nightmare. _EOL_ He asked me to go with him. _EOL_ Yeah. _EOL_ Between us? Of course not. _EOL_ We just we

Preprocessed text for shirley

In [73]:
nlp = spacy.load("en_core_web_sm")
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
from transformers import pipeline

def add_sentiment_scores(character_doc, counts):
    text = ' '.join(character_doc)
    sentiment = sia.polarity_scores(text)
    counts['sentiment_pos'] = sentiment['pos']
    counts['sentiment_neg'] = sentiment['neg']
    counts['sentiment_neu'] = sentiment['neu']
    return counts


# Load emotion classifier
emotion_classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=None)

def add_emotion_scores(character_doc, counts, chunk_size=500):
    """
    Adds emotion scores to the feature dictionary by analyzing chunks of text.
    Splits the text into smaller chunks if it exceeds the model's input size.
    
    ::character_doc:: List of tokens representing the character's dialogue.
    ::counts:: Dictionary of feature counts.
    ::chunk_size:: Maximum number of characters in each chunk.
    """
    text = ' '.join(character_doc)
    # Split the text into smaller chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    
    # Aggregate emotion scores
    emotion_scores = Counter()
    for chunk in chunks:
        # Classify emotions for each chunk
        emotions = emotion_classifier(chunk[:chunk_size])  # Truncate chunk to fit model limits
        emotions=emotions[0]

        for emotion in emotions:  # Iterate through predictions
            emotion_scores[emotion['label']] += emotion['score']  # Aggregate scores
    
    # Normalize by the number of chunks to get average scores
    for emotion, score in emotion_scores.items():
        counts[f"emotion_{emotion}"] = score / len(chunks)
    
    return counts




def to_feature_vector_dictionary(character_doc,method):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
    counts = Counter(character_doc)  # for now a simple count

    #1- add bigrams
    if "bigram" in method:
        bigrams = zip(character_doc, character_doc[1:])
        bigram_counts = Counter(['_'.join(bigram) for bigram in bigrams])
        counts.update(bigram_counts)
    
    #2- add trigrams
    if "trigram" in method:
        trigrams = zip(character_doc, character_doc[1:],character_doc[2:])
        trigram_counts = Counter(['_'.join(trigram) for trigram in trigrams])
        counts.update(trigram_counts)
    #3- add pos counts
    if "pos" in method:
        doc = nlp(' '.join(character_doc))
        pos_counts = Counter([token.pos_ for token in doc])
        counts.update({f"POS_{pos}": count for pos, count in pos_counts.items()})
    if "sentiment" in method:
        counts = add_sentiment_scores(character_doc,counts)
    if "emotion" in method:
        counts = add_emotion_scores(character_doc, counts)
    if "linguistic" in method:
        counts['avg_sentence_length'] = len(character_doc) / max(1, character_doc.count('_EOL_'))
        counts['type_token_ratio'] = len(set(character_doc)) / len(character_doc)

    # add the extra features, for now just adding one count for each extra feature
    for feature in method:
        counts[feature] += 1
    return dict(counts)  

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/salva/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Device set to use mps:0


Items of the training chorpus for CHRISTIAN

feature vectors with the best combination of added features: bigram,sentiment,emotion,linguistic

In [74]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
tfidf_transformer = TfidfTransformer()
feature_selector = SelectKBest(chi2, k=10000) 

def create_document_matrix_from_corpus(corpus, matrix_type, extra_features, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q2.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    #converting to feature dictionaries
    feature_dict = [to_feature_vector_dictionary(doc,extra_features) for name,doc in corpus]
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit(feature_dict)
    doc_feature_matrix = corpusVectorizer.transform(feature_dict)

    #apply tf-idf transformation
    if "tfidf" in matrix_type:
        if fitting:
            tfidf_transformer.fit(doc_feature_matrix)
        doc_feature_matrix = tfidf_transformer.transform(doc_feature_matrix)

    #feature selection with k-best
    if "selection" in matrix_type:
        if fitting:
            feature_selector.fit(doc_feature_matrix, [name for name, doc in corpus])
        doc_feature_matrix = feature_selector.transform(doc_feature_matrix)

    
    return doc_feature_matrix


In [75]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity

In [76]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes an information retrieval based on training data feature matrix and test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a numpy matrix N x M shape where N = number of characters M = number of features
    ::test_feature_matrix::  a numpy matrix N x M shape where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare the left out character against the mean
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        # print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dafaframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)

In [77]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

In [None]:
def run_experiment_once(preprocess_params,matrix_types,extra_features):
    train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=300)
    #print('Num. Characters: ',len(train_character_docs.keys()),"\n")
    total_words = 0
    for name in train_character_docs.keys():
        print(name, 'Number of Words: ',len(train_character_docs[name].split()))
        total_words += len(train_character_docs[name].split())
    #print("total words", total_words)
    # create list of pairs of (character name, pre-processed character) 
    training_corpus = [(name, pre_process(doc,preprocess_params)) for name, doc in sorted(train_character_docs.items())]
    train_labels = [name for name, doc in training_corpus]
    training_feature_matrix = create_document_matrix_from_corpus(training_corpus,matrix_types,extra_features,fitting=True)

    # get the validation data- only 50 lines used for each character
    val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=50)
    #print('Num. Characters: ',len(val_character_docs.keys()),"\n")
    total_words = 0
    for name in val_character_docs.keys():
        #print(name, 'Num of Words: ',len(val_character_docs[name].split()))
        total_words += len(val_character_docs[name].split())
    #print("total words", total_words)

    # create list of pairs of (character name, pre-processed character) 
    val_corpus = [(name, pre_process(doc,preprocess_params)) for name, doc in sorted(val_character_docs.items())]
    val_labels = [name for name, doc in val_corpus]
    val_feature_matrix = create_document_matrix_from_corpus(val_corpus,matrix_types,extra_features,fitting=False)
    mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)
    return {
        "mean_rank": mean_rank,
        "mean_cosine_similarity": mean_cosine_simliarity,
        "accuracy": acc,
    }
def run_grid_search(
    preprocess_grid,
    feature_methods_grid,
    matrix_types_grid,
    checkpoint_path="grid_search_partial3.csv"
):
    """
    Loops over all combinations of the given parameter grids.
    Calls run_experiment_once(...) for each combo.
    Logs and returns results sorted by some metric.
    """
    results = []

    # Build all possible combinations
    for preprocess_params in preprocess_grid:
        for feats in feature_methods_grid:
            for mat_types in matrix_types_grid:
                    
                    # Actually run training & validation
                    metrics = run_experiment_once(preprocess_params,feats,mat_types)
                    # Store the results with the params
                    row = {
                        "preprocess": preprocess_params,
                        "feature_methods": feats,
                        "matrix_types": mat_types,
                        "mean_rank": metrics["mean_rank"],
                        "mean_cosine_similarity": metrics["mean_cosine_similarity"],
                        "accuracy": metrics["accuracy"]
                    }
                    results.append(row)
                    print("Tried:", row)
                    # ---------------------------------------------------------
                    df_partial = pd.DataFrame(results)
                    df_partial.to_csv(checkpoint_path, index=False)

    # Convert to DataFrame for easy analysis
    df_results = pd.DataFrame(results)

    # Example: sort by mean_rank ascending (lower is better) or accuracy descending
    df_results.sort_values("mean_rank", inplace=True)  
    return df_results


Running the grid search on different combinations

In [79]:
run_grid_search(PREPROCESS_OPTIONS,FEATURE_METHODS,MATRIX_TYPES)

lines per character {'SHIRLEY': 300, 'OTHER': 300, 'JACK': 300, 'RONNIE': 300, 'TANYA': 300, 'SEAN': 300, 'ROXY': 300, 'MAX': 300, 'IAN': 300, 'JANE': 300, 'STACEY': 300, 'PHIL': 300, 'HEATHER': 300, 'MINTY': 300, 'CHRISTIAN': 300, 'CLARE': 300}
SHIRLEY Number of Words:  3100
OTHER Number of Words:  2673
JACK Number of Words:  3707
RONNIE Number of Words:  3005
TANYA Number of Words:  3291
SEAN Number of Words:  2868
ROXY Number of Words:  3119
MAX Number of Words:  3884
IAN Number of Words:  3467
JANE Number of Words:  3128
STACEY Number of Words:  3235
PHIL Number of Words:  3129
HEATHER Number of Words:  3262
MINTY Number of Words:  3310
CHRISTIAN Number of Words:  3278
CLARE Number of Words:  3623
lines per character {'OTHER': 50, 'HEATHER': 43, 'TANYA': 50, 'JACK': 50, 'RONNIE': 40, 'JANE': 50, 'STACEY': 50, 'SEAN': 50, 'PHIL': 32, 'SHIRLEY': 50, 'ROXY': 30, 'IAN': 50, 'MINTY': 50, 'CHRISTIAN': 30, 'CLARE': 28, 'MAX': 50}
mean rank 1.6875
mean cosine similarity 0.9463616322890502


KeyboardInterrupt: 

# Q1. Improve pre-processing (20 marks)
Using the pre-processing techniques you have learned in the module, improve the `pre_process` function above, which currently just tokenizes text based on white space.

When developing, use the first 300 and 50 lines from the training and validation files, as per above. To check the improvements by using the different techniques, use the `compute_IR_evaluation_scores` function as above. The **mean rank** is the main metric you need to focus on improving throughout this assignment, where the target/best possible performance is **1** (i.e. all test/validation data character documents are closest to their corresponding training data character documents) and the worst is **16**. Initially, the code in this template achieves a mean rank of **4.3**  and accuracy of **0.25** on the test set and a mean rank of **3.6** and accuracy of **0.31** on the validation set - you should be looking to improve those, particularly getting the mean rank as close to 1 as possible.


# Q2. Improve linguistic feature extraction (30 marks)
Use the feature extraction techniques you have learned to improve the `to_feature_vector_dictionary` and `create_document_matrix_from_corpus` functions above. Examples of extra features could include extracting n-grams of different lengths and including POS tags. You could also use sentiment analysis or another text classifier's result when applied to the features for each character document. You could even use a gender classifier trained on the same data using the GENDER column **(but DO NOT USE the GENDER column directly in the features for the final vector)**.

Matrix transformation techniques like TF-IDF (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) are commonly used to improve the `create_document_matrix_from_corpus` function, which currently only uses a dictionary vectorizer (`DictVectorizer`) which straight-forwardly maps from the feature dictionaries produced for each character document to a sparse matrix.

Other options include using feature selection/reduction with techniques like minimum/maximum document frequency and/or feature selection like k-best selection using different statistical tests https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html. 

Again, develop your system using the training and validation sets and note the effect/improvement in mean rank with the techniques you use.

# Q3. Add dialogue context and scene features (15 marks)
Adjust `create_character_document_from_dataframe` and the other functions appropriately so the data incorporates the context of the line spoken by the characters in terms of the lines spoken by other characters in the same scene (before and after the target character's lines). HINT: you should use the *Episode* and *Scene* columns to check which characters are in the same scene to decide whether to include their lines or not. Only the lines from the same *Scene* can be added as the context as lines from different *Scene* are irrelevant.  **(but DO NOT USE the GENDER and CHARACTER columns directly)**.

# Q4. Parameter Search (15 marks)
It is a good practice to conduct a systematic parameter search instead of a random search as this will give you more 
reliable results. Given the scope of this assignment, it is possible to conduct a **grid search** on options you decided to try within the individual questions. The grid search should be done within the individual questions (i.e. Q1-Q3), and the later question should adopt the best settings from the previous questions. There is no need to do a grid search over all configurations from all questions as this will easily make the search unrealistic. E.g. Suppose we need 32, 90, and 4 runs to finish the grid search within questions, a cross-question grid search would need 32x90x4 = 11520 runs!                                                                      

# Q5. Analyse the similarity results (10 marks)
From your system so far run on the training/validation sets, identify the heldout character vectors ranked closest to each character's training vector which are not the character themselves, and those furthest away, as displayed using the `plot_heat_map_similarity` function. In your report, try to ascribe reasons why this is the case, particularly for those where there isn't a successful highest match between the target character in the training set and that character's vector in the heldout set yet. Observations you could make include how their language use is similar, resulting in similar word or ngram features.

# Q6. Run on final test data  (10 marks)
Test your best system using the code below to train on the training data (using the first 300 lines per character maximum) and do the final testing on the test file (using the first 50 lines per character maximum).

Make any necessary adjustments such that it runs in the same way as the training/testing regime you developed above- e.g. making sure any transformer objects are initialized before `create_document_matrix_from_corpus` is called. Make sure your best system is left in the notebook and it is clear what the mean rank and accuracy of document selection are on the test data.