# Q1. Improve pre-processing (20 marks)
Using the pre-processing techniques you have learned in the module, improve the `pre_process` function above, which currently just tokenizes text based on white space.
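
To make the starting point concrete, here is a minimal sketch comparing whitespace tokenization with a split on non-word characters. The example sentence is invented for illustration, and this is only one possible first step, not a recommended final pipeline.

```python
import re

text = "Oi! You owe me 42 quid, don't you?"  # invented example line

# Whitespace tokenization keeps punctuation glued to the words.
whitespace_tokens = text.split()

# Splitting on runs of non-word characters strips punctuation, but it also
# breaks contractions apart and can yield empty strings, which are dropped here.
nonword_tokens = [t for t in re.split(r"\W+", text) if t]

print(whitespace_tokens)
print(nonword_tokens)
```

Note that the second approach turns `don't` into the two tokens `don` and `t`, so a later filter on token length or a contraction-aware tokenizer may be worth trying.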

When developing, use the 90% train and 10% validation data split from the training file, using the first 360 lines from the training split and first 40 lines from the validation split, as per above. To check the improvements by using the different techniques, use the `compute_IR_evaluation_scores` function as above. The **mean rank** is the main metric you need to focus on improving throughout this assignment, where the target/best possible performance is **1** (i.e. all test/validation data character documents are closest to their corresponding training data character documents) and the worst is **16**. Initially the code in this template achieves a mean rank of **5.12**  and accuracy of **0.3125** on the test set and a mean rank of **4.5** and accuracy of - you should be looking to improve those, particularly getting the mean rank as close to 1 as possible.
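
As a rough illustration of the two metrics (this is not the template's `compute_IR_evaluation_scores` implementation, which may differ in detail), mean rank and accuracy can be computed from a similarity matrix as follows. The function name and the toy similarity values are assumptions made for this sketch.

```python
def mean_rank_and_accuracy(similarity, train_labels, heldout_labels):
    """similarity[i][j] is the similarity of held-out doc i to training doc j."""
    ranks = []
    for i, target in enumerate(heldout_labels):
        # Order training documents from most to least similar to held-out doc i.
        order = sorted(range(len(train_labels)),
                       key=lambda j: similarity[i][j], reverse=True)
        ranked_labels = [train_labels[j] for j in order]
        # Rank 1 means the correct character was the closest match.
        ranks.append(ranked_labels.index(target) + 1)
    mean_rank = sum(ranks) / len(ranks)
    accuracy = sum(1 for r in ranks if r == 1) / len(ranks)
    return mean_rank, accuracy


labels = ["Max", "Jane", "Phil"]
toy_similarity = [
    [0.9, 0.2, 0.1],  # Max: correct document ranked 1st
    [0.8, 0.6, 0.1],  # Jane: correct document ranked 2nd
    [0.5, 0.4, 0.3],  # Phil: correct document ranked 3rd
]
mean_rank, accuracy = mean_rank_and_accuracy(toy_similarity, labels, labels)
print(mean_rank, accuracy)
```

With 16 characters, a system that always ranks the correct character first scores a mean rank of 1, and one that always ranks it last scores 16.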


In [None]:
# Baseline mean rank (tokenize on whitespace): 4.5, accuracy: 0.25

# Normalization
# tokens = [token.lower() for token in tokens]
# Mean rank: 4.875, accuracy: 0.375

# Stop words
# tokens = [token for token in tokens if token not in stopwords.words('english')]
# Mean rank: 3.4375, accuracy: 0.5

# Lemmatization
# tokens = [lemmatizer.lemmatize(token) for token in tokens]
# Mean rank: 4.5, accuracy: 0.3125

# Porter Stemmer (without lowercasing)
# tokens = [porter.stem(token, to_lowercase=False) for token in tokens]
# Mean rank: 4.5, accuracy: 0.3125
# Porter Stemmer (with lowercasing)
# tokens = [porter.stem(token, to_lowercase=True) for token in tokens]
# Mean rank: 5.0, accuracy: 0.375



# Several methods
# split on runs of non-word characters (punctuation and white space)
# tokens = re.split(r'\W+', character_text)
# # remove empty tokens
# tokens = [t for t in tokens if t]
# # convert to lower case
# tokens = [t.lower() for t in tokens]
# # remove stop words
# stop_words = set(stopwords.words('english'))
# tokens = [t for t in tokens if t not in stop_words]
# # remove numbers
# tokens = [t for t in tokens if not t.isdigit()]
# # remove single character tokens
# tokens = [t for t in tokens if len(t)>1]
# # remove tokens with non-alphabetic characters
# tokens = [t for t in tokens if t.isalpha()]
# # stem tokens
# porter = PorterStemmer()
# tokens = [porter.stem(t) for t in tokens]
# Mean rank: 2.0, accuracy: 0.6875

# Num to word
# # tokenize on white space
# tokens = character_text.split()
# # convert numbers to words (e.g. 1 -> one)
# tokens = [num2words(int(t)) if t.isdigit() else t for t in tokens]

# Best so far
# returns mean rank of 1.8125, accuracy: 0.6875
def pre_process(character_text):
    """Pre-process all the concatenated lines of a character,
    using tokenization, normalization and other techniques.

    Splits on non-word characters, lowercases, spells out digit tokens,
    keeps alphabetic tokens only, then lemmatizes, stems and removes stop words.

    ::character_text:: a string with all of one character's lines
    """
    # split on runs of non-word characters (punctuation and white space)
    tokens = re.split(r'\W+', character_text)

    # convert to lower case
    tokens = [t.lower() for t in tokens]

    # convert digit tokens to words (e.g. 1 -> one); results containing
    # hyphens or spaces (e.g. 'forty-two') are dropped by the filter below
    tokens = [num2words(int(t)) if t.isdigit() else t for t in tokens]

    # keep purely alphabetic tokens (this also removes empty strings)
    tokens = [t for t in tokens if t.isalpha()]

    # lemmatize tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # stem tokens
    porter = PorterStemmer()
    tokens = [porter.stem(t) for t in tokens]
    
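    # remove stop words (note: matching happens after stemming, so stems
    # such as 'thi' from 'this' are not in the list and are kept)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]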
    
    return tokens

# Q2. Improve linguistic feature extraction (20 marks)
Use the feature extraction techniques you have learned to improve the `to_feature_vector_dictionary` function above. Examples of extra features could include extracting n-grams of different lengths and including POS-tags. You could also use sentiment analysis or another text classifier's result when applied to the features for each character document. You could even use a gender classifier trained on the same data using the GENDER column **(but DO NOT USE the GENDER column directly in the features for the final vector)**.

You could use feature selection/reduction with techniques like minimum/maximum document frequency and/or feature selection like k-best selection using different statistical tests https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html. Again, develop on 90% training and 10% validation split and note the effect/improvement in mean rank with the techniques you use.
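
A minimal sketch of two of the ideas above, n-gram extraction and document-frequency filtering, using only the standard library. The function names, thresholds and toy documents are assumptions for illustration; in the real pipeline the document frequencies would be computed on the training documents only.

```python
from collections import Counter


def ngram_counts(tokens, max_n=2):
    """Count all n-grams of length 1..max_n, joined with spaces."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return dict(counts)


def filter_by_document_frequency(feature_dicts, min_df=2, max_df_ratio=1.0):
    """Keep features seen in at least min_df documents and at most
    max_df_ratio of all documents."""
    n_docs = len(feature_dicts)
    df = Counter()
    for features in feature_dicts:
        df.update(features.keys())  # each document counts once per feature
    keep = {f for f, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}
    return [{f: v for f, v in features.items() if f in keep}
            for features in feature_dicts]


docs = [
    ["get", "out", "of", "my", "pub"],
    ["get", "out", "now"],
    ["my", "pub", "my", "rules"],
]
features = [ngram_counts(d) for d in docs]
filtered = filter_by_document_frequency(features, min_df=2)
print(filtered)
```

Raising `min_df` removes features that only one character ever uses, which reduces noise but can also discard exactly the distinctive phrases that identify a character, so the effect on mean rank is worth measuring either way.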

In [None]:
# Baseline = 1.8125

# # replace the counts with the log of the counts 
#     for key in counts.keys():
#         counts[key] = np.log(counts[key])
#     return counts
# Mean rank: 1.625

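# # tf-idf-style weighting (note: the 'idf' term here uses within-document counts, not corpus-level document frequency)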
#     counts = Counter(character_doc)
#     counts = dict(counts)
#     tf = {k: v/len(character_doc) for k, v in counts.items()}
#     idf = {k: math.log(len(character_doc)/v) for k, v in counts.items()}
#     tf_idf = {k: v*idf[k] for k, v in tf.items()}
#     counts = tf_idf
#     return counts
# Mean rank: 1.5

# # tf-idf-style weighting over unigrams and bigrams (same within-document 'idf' as above)
# # extract n-grams
#     n_grams = []
#     for i in range(1, 3):
#         n_grams += list(ngrams(character_doc, i))
#     counts = Counter(n_grams)  # for now a simple count
#     counts = dict(counts)
#     # tf-idf
#     tf = {k: v/len(character_doc) for k, v in counts.items()}
#     idf = {k: math.log(len(character_doc)/v) for k, v in counts.items()}
#     tf_idf = {k: v*idf[k] for k, v in tf.items()}
#     counts = tf_idf
#     return counts
# Mean rank: 1.3125


# # extract n-grams
#     n_grams = []
#     for i in range(1, 4):
#         n_grams += list(ngrams(character_doc, i))
#     counts = Counter(n_grams)  # for now a simple count
#     counts = dict(counts)
#     # remove low frequency n-grams
#     for key in list(counts.keys()):
#         if counts[key] < 2:
#             del counts[key]
#     return counts
# Mean rank: 2.75, accuracy: 0.5

# # extract n-grams
#     n_grams = []
#     for i in range(1, 4):
#         n_grams += list(ngrams(character_doc, i))
#     counts = Counter(n_grams)  # for now a simple count
#     counts = dict(counts)
#     return counts
# Mean rank: 2.0, accuracy: 0.6875

#  # pos tags
#     pos_tags = nltk.pos_tag(character_doc)
#     pos_tags = Counter(pos_tags)
#     pos_tags = dict(pos_tags)
    
#     return pos_tags
# Mean rank: 2.0, accuracy: 0.625

# Adding pos tags and sentiment scores
# # pos tags
#     pos_tags = nltk.pos_tag(character_doc)
#     pos_tags = Counter(pos_tags)
#     pos_tags = dict(pos_tags)
#     # add sentiment scores to each pos_tag
#     sentiment_scores = []
#     for word in character_doc:
#         sentiment_scores.append(TextBlob(word).sentiment.polarity)
#     sentiment_scores = Counter(sentiment_scores)
#     sentiment_scores = dict(sentiment_scores)
#     # add sentiment_score keys to pos_tags
#     for key in sentiment_scores.keys():
#         pos_tags[key] = sentiment_scores[key]
#     return pos_tags
# Mean rank: 1.8125

# Extracting bigrams and unigrams
# # extract bigrams and unigrams
#     unigrams = Counter(character_doc)
#     unigrams = dict(unigrams)
#     # extract bigrams
#     bigrams = list(ngrams(character_doc, 2))
#     bigrams = [b[0] + '_' + b[1] for b in bigrams]
#     bigrams = Counter(bigrams)
#     bigrams = dict(bigrams)
#     # make unigrams into a Counter
#     counts = Counter(unigrams)
#     counts = dict(counts)
#     # remove underscores from bigrams
#     bigrams = {k.replace('_', ' '): v for k, v in bigrams.items()}
#     # combine unigrams and bigrams
#     counts.update(bigrams)
#     return counts
# Mean Rank: 1.875, accuracy: 0.75

# Combine pos tags and bigrams
# # extract pos tags
#     pos_tags = nltk.pos_tag(character_doc)
#     pos_tags = Counter(pos_tags)
#     pos_tags = dict(pos_tags)
#     # extract bigrams and add pos tags to them
#     bigrams = list(ngrams(character_doc, 2))
#     bigrams = [b[0] + '_' + b[1] for b in bigrams]
#     bigrams = Counter(bigrams)
#     bigrams = dict(bigrams)
#     bigrams = nltk.pos_tag(bigrams)
#     bigrams = Counter(bigrams)
#     bigrams = dict(bigrams)
#     # Update pos_tags dictionary with bigrams
#     pos_tags.update(bigrams)
#     return pos_tags
# Mean rank: 2.0625, accuracy: 0.5625


# Q3. Analyse the similarity results (10 marks)
From your system so far run on the 90%/10% training/validation split, identify the heldout character vectors ranked closest to each character's training vector which are not the character themselves, and those furthest away, as displayed using the `plot_heat_map_similarity` function. In your report, try to ascribe reasons why this is the case, particularly for those where there isn't a successful highest match between the target character in the training set and that character's vector in the heldout set yet. Observations you could make include how their language use is similar, resulting in similar word or ngram features.
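
The closest and furthest held-out characters can also be read off programmatically rather than from the heat map alone. The sketch below assumes sparse feature dictionaries and cosine similarity; the helper names and toy vectors are hypothetical and are not part of the template.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse feature dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_and_furthest(train_vectors, heldout_vectors):
    """For each training character, return the (closest, furthest)
    held-out character, excluding the character themselves."""
    result = {}
    for name, train_vec in train_vectors.items():
        sims = {other: cosine(train_vec, vec)
                for other, vec in heldout_vectors.items() if other != name}
        result[name] = (max(sims, key=sims.get), min(sims, key=sims.get))
    return result


train = {
    "Max": {"love": 2, "car": 1},
    "Jane": {"love": 1, "baby": 2},
    "Phil": {"pub": 3, "car": 1},
}
heldout = {
    "Max": {"love": 1, "car": 1},
    "Jane": {"baby": 1, "love": 1},
    "Phil": {"pub": 1},
}
result = closest_and_furthest(train, heldout)
print(result)
```

Inspecting the shared features behind a high similarity (the `shared` set above) is one way to find the overlapping vocabulary that explains a confusion between two characters.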

| Character | Most similar to | Least similar to |
|---|---|---|
| Christian | Jane | Shirley |
| Clare | Max | Ronnie |
| Heather | Phil | Shirley |
| Ian | Max | Ronnie |
| Jack | Max | Shirley |
| Jane | Christian | Ronnie |
| Max | Sean | Ronnie |
| Minty | Max | Ronnie |
| Other | Minty | Ronnie |
| Phil | Max | Ronnie |
| Ronnie | Max | Shirley |
| Roxy | Max | Shirley |
| Sean | Max | Ronnie |
| Shirley | Max | Ronnie |
| Stacey | Max | Ronnie |
| Tanya | Max | Ronnie |

# Q4. Add dialogue context and scene features (20 marks)
Adjust `create_character_document_from_dataframe` and the other functions appropriately so the data incorporates the context of the line spoken by the characters in terms of the lines spoken by other characters in the same scene (before and after the target character's lines). HINT: you should use the *Episode* and *Scene* columns to check which characters are in the same scene to decide whether to include their lines or not. You can also use the **scene_info** column to extract information about the scene **(but DO NOT USE the GENDER and CHARACTER columns directly)**.
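
One way to picture the scene check is the sketch below, which works on plain tuples instead of the dataframe. The function name, the `_PREV_`/`_NEXT_` markers and the toy rows are assumptions for illustration. Speaker names are used only to decide whose line is context and never enter the document text.

```python
from collections import defaultdict


def add_scene_context(rows, window=1):
    """rows: list of (episode, scene, character, line) in script order.
    Returns one document per character, with neighbouring lines from the
    same episode and scene added as marked context."""
    docs = defaultdict(list)
    for i, (episode, scene, character, line) in enumerate(rows):
        lo, hi = max(0, i - window), min(len(rows), i + window + 1)
        for j in range(lo, hi):
            ep_j, sc_j, speaker_j, line_j = rows[j]
            if (ep_j, sc_j) != (episode, scene):
                continue  # never take context from a different scene
            if j == i:
                docs[character].append(line_j)  # the character's own line
            elif speaker_j != character:
                marker = "_PREV_" if j < i else "_NEXT_"
                docs[character].append(f"{marker} {line_j}")
    return {name: " _EOL_ ".join(parts) for name, parts in docs.items()}


rows = [
    (1, 1, "Max", "Where is she?"),
    (1, 1, "Tanya", "Gone."),
    (1, 2, "Max", "Right."),
    (1, 2, "Phil", "Drink?"),
]
context_docs = add_scene_context(rows)
print(context_docs)
```

Markers containing underscores only help if the pre-processing step keeps them; a filter that retains alphabetic tokens only would remove them, leaving context words indistinguishable from the character's own words.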

In [None]:
##### Don't use this #####
# # Create one document per character
# def create_character_document_from_dataframe(df, max_line_count):
#     """Returns a dict with the name of the character as key,
#     their lines joined together as a single string, with end of line _EOL_
#     markers between them.
    
#     ::max_line_count:: the maximum number of lines to be added per character
#     """
#     character_docs = {}
#     character_line_count = {}
#     for line, name, scene in zip(range(len(df.Line)), df.Character_name, df.episode_scene):
#         if not name in character_docs.keys():
#             character_docs[name] = ""
#             character_line_count[name] = 0
#         if character_line_count[name]==max_line_count:
#             continue
#         character_docs[name] += f'{str(df.Line.iloc[line])}'  + " _EOL_ "  # adding an end-of-line token
#         character_line_count[name]+=1

#         character_docs[name] += str(df.Scene_info.iloc[line]) + " "

#         character_docs[name] += str(df.episode_scene.iloc[line]) + " _SOL_ "  # adding an end-of-line token
#     print("lines per character", character_line_count)
#     return character_docs

In [None]:
# # Mine
# # Create one document per character
# def create_character_document_from_dataframe(df, max_line_count):
#     """Returns a dict with the name of the character as key,
#     their lines joined together as a single string, with end of line _EOL_
#     markers between them.
    
#     ::max_line_count:: the maximum number of lines to be added per character
#     """
#     character_docs = {}
#     character_line_count = {}
#     for line, name, episode_scenes in zip(df.Line, df.Character_name, df.episode_scene):
#         if not name in character_docs.keys():
#             character_docs[name] = ""
#             character_line_count[name] = 0
#         if character_line_count[name]==max_line_count:
#             continue
#         character_docs[name] += str(line)  + " _EOL_ "  # adding an end-of-line token
#         character_docs[name] += str(episode_scenes) + " _SOL_ "  # adding an end-of-line token
#         character_line_count[name]+=1
#     print("lines per character", character_line_count)
#     return character_docs

In [None]:
# # Best so far
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
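    """Returns a dict with the name of the character as key and, as value,
    their lines joined together as a single string, with end of line _EOL_
    markers between them. Each line is surrounded by the lines spoken
    immediately before and after it in the same episode and scene.

    ::max_line_count:: the maximum number of lines to be added per character
    """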
    character_docs = {}
    character_line_count = {}
    for line, name, episode, scene in zip(range(len(df.Line)), df.Character_name, df.Episode, df.Scene):
        if not name in character_docs.keys():
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name]==max_line_count:
            continue
        # add the previous line as context if it is in the same episode and scene
        if line > 0 and df.Episode.iloc[line-1] == episode and df.Scene.iloc[line-1] == scene:
            character_docs[name] += str(df.Line.iloc[line-1]) + " _EOL_ "
        # the character's own line
        character_docs[name] += str(df.Line.iloc[line])  + " _EOL_ "
        # add the next line as context if it is in the same episode and scene
        if line < len(df.Line)-1 and df.Episode.iloc[line+1] == episode and df.Scene.iloc[line+1] == scene:
            character_docs[name] += str(df.Line.iloc[line+1])  + " _EOL_ "
        character_line_count[name]+=1
    print("lines per character", character_line_count)
    return character_docs

# Q5. Improve the vectorization method (20 marks)
Use a matrix transformation technique like TF-IDF (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to improve the `create_document_matrix_from_corpus` function, which currently only uses a dictionary vectorizer (`DictVectorizer`) which straightforwardly maps from the feature dictionaries produced for each character document to a sparse matrix.
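
To show what such a transformation does to a count matrix, here is a small standard-library sketch of TF-IDF weighting. It follows the formula that, to the best of my understanding, scikit-learn's `TfidfTransformer` uses by default (smoothed IDF, then L2 normalization of each row); the function name and toy counts are assumptions, and the sketch is not a replacement for the library class.

```python
import math


def tfidf_transform(count_rows):
    """count_rows: list of equal-length lists of raw feature counts,
    one row per document. Returns TF-IDF weighted, L2-normalized rows."""
    n_docs = len(count_rows)
    n_features = len(count_rows[0])
    # Document frequency: in how many documents does each feature occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_features)]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in count_rows]
    # Scale each document vector to unit length.
    normalized = []
    for row in weighted:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm if norm else 0.0 for x in row])
    return normalized


# Feature 0 occurs in both documents; features 1 and 2 in one document each.
counts = [
    [3, 1, 0],
    [2, 0, 1],
]
tfidf_rows = tfidf_transform(counts)
print(tfidf_rows)
```

Features shared by every character document get the smallest IDF weight, so words that distinguish a character contribute relatively more to the cosine similarity.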

As the `create_document_matrix_from_corpus` is designed to be used both in training/fitting (with `fitting` set to `True`) and in transformation alone on test/validation data (with `fitting` set to `False`), make sure you initialize any transformers you want to try in the same place as `corpusVectorizer = DictVectorizer()` before you call 
`create_document_matrix_from_corpus`. Again, develop on 90% training 10% validation split and note the effect/improvement in mean rank with each technique you try.
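
The fit-once, transform-many pattern described above can be sketched with a toy vectorizer. `TinyVectorizer` and `create_matrix` are hypothetical stand-ins written for this illustration; they mimic the way a fitted `DictVectorizer` ignores features it did not see during fitting, but they are not the library implementation.

```python
class TinyVectorizer:
    """Toy dictionary vectorizer (hypothetical, for illustration only)."""

    def __init__(self):
        self.vocabulary = None

    def fit(self, feature_dicts):
        # The vocabulary is learned from the fitting data only.
        names = sorted({f for d in feature_dicts for f in d})
        self.vocabulary = {f: i for i, f in enumerate(names)}
        return self

    def transform(self, feature_dicts):
        if self.vocabulary is None:
            raise RuntimeError("fit must be called before transform")
        rows = []
        for d in feature_dicts:
            row = [0.0] * len(self.vocabulary)
            for f, v in d.items():
                if f in self.vocabulary:  # unseen features are ignored
                    row[self.vocabulary[f]] = float(v)
            rows.append(row)
        return rows


vectorizer = TinyVectorizer()  # initialized once, before any matrix is built


def create_matrix(feature_dicts, fitting=False):
    if fitting:
        vectorizer.fit(feature_dicts)
    return vectorizer.transform(feature_dicts)


train = [{"pub": 2, "car": 1}, {"baby": 1}]
heldout = [{"pub": 1, "wedding": 3}]
train_matrix = create_matrix(train, fitting=True)
heldout_matrix = create_matrix(heldout)  # reuses the training vocabulary
print(train_matrix, heldout_matrix)
```

Because the held-out matrix is built with the training vocabulary, both matrices have the same columns and their rows can be compared directly.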

In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizer which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTransformer = TfidfTransformer()  # re-weights the raw counts with TF-IDF

def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.

    Vectorizes the feature dictionaries and applies a TF-IDF transformation (Q5).

    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """

    # build the feature dictionaries once and reuse them below
    feature_dicts = [to_feature_vector_dictionary(doc, []) for name, doc in corpus]

    # uses the global vectorizer and transformer, fitting them on training data only
    if fitting:
        corpusVectorizer.fit(feature_dicts)
        tfidfTransformer.fit(corpusVectorizer.transform(feature_dicts))

    doc_feature_matrix = corpusVectorizer.transform(feature_dicts)
    doc_feature_matrix = tfidfTransformer.transform(doc_feature_matrix)

    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)

# mean rank: 1.25, accuracy: 0.875

# Q6. Run on final test data  (10 marks)
Test your best system using the code below to train on all of the training data (using the first 400 lines per character maximum) and do the final testing on the test file (using the first 40 lines per character maximum).

Make any necessary adjustments such that it runs in the same way as the training/testing regime you developed above, e.g. making sure any transformer objects are initialized before `create_document_matrix_from_corpus` is called. Make sure your best system is left in the notebook and that it is clear what the mean rank and accuracy of document selection are on the test data.