# Name: Swati Thapa

# Student ID: 210151488

# Q1. Improve pre-processing (10 marks)
Using the pre-processing techniques you have learned in the module, improve the `pre_process` function above, which currently just tokenizes text based on white space.

When developing, use the 90% train and 10% validation data split from the training file, using the first 360 lines from the training split and first 40 lines from the validation split, as per above. To check the improvements by using the different techniques, use the `compute_IR_evaluation_scores` function as above. The **mean rank** is the main metric you need to focus on improving throughout this assignment, where the target/best possible performance is **1** (i.e. all test/validation data character documents are closest to their corresponding training data character documents) and the worst is **16**. Initially the code in this template achieves a mean rank of **5.12** and accuracy of **0.3125** on the test set; you should be looking to improve those, particularly getting the mean rank as close to 1 as possible.
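
To make the metric concrete, here is a minimal sketch of how mean rank and accuracy follow from a table of similarity scores. The character names and scores are made up for illustration, and `rank_of_target` is a hypothetical helper, not part of the assignment template.

```python
def rank_of_target(similarities, target):
    """Return the 1-based rank of `target` when the candidates are
    sorted by similarity, highest first."""
    ordered = sorted(similarities, key=similarities.get, reverse=True)
    return ordered.index(target) + 1

# Hypothetical scores: each validation character document (outer key) is
# compared against every training character document (inner keys).
toy_scores = {
    "alice": {"alice": 0.91, "bob": 0.40, "carol": 0.35},
    "bob":   {"alice": 0.72, "bob": 0.65, "carol": 0.30},
    "carol": {"alice": 0.50, "bob": 0.61, "carol": 0.58},
}

ranks = [rank_of_target(sims, name) for name, sims in toy_scores.items()]
mean_rank = sum(ranks) / len(ranks)
accuracy = sum(r == 1 for r in ranks) / len(ranks)
print(ranks, mean_rank, accuracy)
```

A mean rank of 1 would require every validation document to be most similar to the training document of the same character; accuracy only counts those rank-1 cases.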


!pip install num2words  # required to convert numbers to words

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

In [None]:
import string
import re
import numpy as np
from numpy.linalg import norm
import pandas as pd
from collections import Counter, OrderedDict

import seaborn as sns
import matplotlib.pyplot as plt

import nltk

from sklearn.feature_extraction import DictVectorizer

%matplotlib inline
pd.options.display.max_colwidth=500
import num2words

In [None]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data

In [None]:
all_train_data.isnull().sum() #Observed 12 null values

In [None]:
all_train_data.dropna(inplace=True)  # drop rows containing null values

In [None]:
# Split into training and test data for heldout validation with random samples of 9:1 train/heldout split
from random import shuffle, seed

seed(0) # set a seed for reproducibility so same split is used each time

episode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = episode_scene_column
episode_scenes = sorted(set(episode_scene_column.values))  # set ordering is arbitrary, so sort for reproducibility

shuffle(episode_scenes)

print(len(episode_scenes))
episode_split = int(0.9*len(episode_scenes))
training_ep_scenes = episode_scenes[:episode_split]
test_ep_scenes = episode_scenes[episode_split:]
print(len(training_ep_scenes), len(test_ep_scenes))

def train_or_heldout_eps(val):
    if val in training_ep_scenes:
        return "training"
    return "heldout"

all_train_data['train_heldout'] = all_train_data['episode_scene'].apply(train_or_heldout_eps)

In [None]:
print('Raw Data: ',np.shape(all_train_data))
train_data = all_train_data[all_train_data['train_heldout']=='training']
val_data = all_train_data[all_train_data['train_heldout']=='heldout']
print('Train set: ',np.shape(train_data))
print('Validation set: ',np.shape(val_data))

### Pre-processing used:
a.	Lower-casing all the words.

b.	Splitting with a custom regular-expression pattern:

   •	NLTK's default tokenization was not used because it splits I’ll into two tokens, I and ’ll.

   •	Splitting on the pattern keeps contractions like I’ll as a single token and handles tokens like ..you.

c.	Removing stop words and punctuation tokens to discard very common words and stray punctuation.

d.	Converting numeric tokens to words, e.g. 25 becomes twenty-five.

e.	Reducing each word to its base form using lemmatization.

In [None]:
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import string
import re
def pre_process(character_text):
    """Pre-process all the concatenated lines of a character, 
    using tokenization, spelling normalization and other techniques.
    
    Initially just a tokenization on white space. Improve this for Q1.
    
    ::character_text:: a string with all of one character's lines
    """
    
    lemmatizer = WordNetLemmatizer()  # initialise the lemmatizer
    character_text = character_text.lower()  # lower-case everything (so _EOL_ becomes _eol_)
    # split on whitespace and stray apostrophes; punctuation directly after a word becomes its own token
    pattern = r"\s+|(?<=\s)'|'(?=\s)|'(?!.*\.\.) |(?<=\w)([,..!/?])"
    result = [s for s in re.split(pattern, character_text) if s]
    # tokens to discard: English stop words, single punctuation marks and ".."
    stop = set(stopwords.words('english')) | set(string.punctuation) | {".."}
    filtered_sentence = [w for w in result if w not in stop]
    # spell out purely numeric tokens, e.g. "25" becomes "twenty-five";
    # isdecimal() is used because isnumeric() also accepts characters int() cannot parse
    filtered_sentence = [num2words.num2words(int(w)) if w.isdecimal() else w
                         for w in filtered_sentence]
    # reduce each remaining token to its base form
    lem_sent = [lemmatizer.lemmatize(w) for w in filtered_sentence]

    return lem_sent

In [None]:
#Example on how regex is working in our preprocess

pattern = r"\s+|(?<=\s)'|'(?=\s)|'(?!.*\.\.) |(?<=\w)([,..!/?])"
words = """hello my name is 'joe..' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)

#### Contractions such as "what's" are kept as single tokens. By contrast, NLTK's default `word_tokenize` splits them, so a word like "I'll" becomes "I" and "'ll".
For instance:

In [None]:
word_data = "I'll be at home. Where do you want to go? Oh no....."

nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)

In [None]:
tokens = pre_process(word_data) #demo of pre-process function
print(tokens)

In [None]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
    their lines joined together as a single string, with end of line _EOL_
    markers between them.
    
    ::max_line_count:: the maximum number of lines to be added per character
    """
    character_docs = {}
    character_line_count = {}
    for line, name, gender in zip(df.Line, df.Character_name, df.Gender):
        if not name in character_docs.keys():
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name]==max_line_count:
            continue
        character_docs[name] += str(line)   + " _EOL_ "  # adding an end-of-line token
        #character_docs[name] += str(line)   + " "  # adding an end-of-line token
        character_line_count[name]+=1
    print("lines per character", character_line_count)
    return character_docs

In [None]:
# print out the number of words each character has in the training set
# only use the first 360 lines of each character
train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=360)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

In [None]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process(doc)) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [None]:
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split()))
    total_words += len(val_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process(doc)) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]

In [None]:
def to_feature_vector_dictionary(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
    counts = Counter(character_doc)  # for now a simple count
    counts = dict(counts)
    # add the extra features, for now just adding one count for each extra feature
    for feature in extra_features:
        counts[feature] = counts.get(feature, 0) + 1  # .get avoids a KeyError for unseen features
    return counts  

In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
def compute_cosine_similarity(v1, v2):
    """Takes a pair of vectors v1 and v2 (1-d arrays e.g. [0, 0.5, 0.5])
    returns the cosine similarity between the vectors
    """
    
    # compute cosine similarity manually
    manual_cosine_similarity = np.dot(v1, v2)  /(norm(v1) * norm(v2))
    
    return manual_cosine_similarity
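
One edge case is worth noting: if a validation document shares no features with the training vocabulary, its vector is all zeros and the division above yields `nan` together with a NumPy runtime warning. The following is a minimal pure-Python sketch of a guarded variant; `safe_cosine_similarity` is a hypothetical name, and returning 0.0 for a zero vector is an assumption about the desired behaviour rather than part of the template.

```python
import math

def safe_cosine_similarity(v1, v2):
    """Cosine similarity of two equal-length sequences of numbers.
    Returns 0.0 when either vector has zero length (assumed convention)."""
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (norm1 * norm2)

print(safe_cosine_similarity([1, 0], [0, 1]))
print(safe_cosine_similarity([0, 0], [1, 2]))
```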

In [None]:
def compute_IR_evaluation_scores(train_feature_matrix, test_feature_matrix, train_labels, test_labels):
    """
    Computes an information retrieval based on training data feature matrix and test data feature matrix
    returns 4-tuple:
    ::mean_rank:: mean of the ranking of the target document in terms of similarity to the query/test document
    1 is the best possible score.
    ::mean_cosine_similarity:: mean cosine similarity score for the target document vs. the test document of the same class
    ::accuracy:: proportion of test documents correctly classified
    ::df:: a data frame with all the similarity measures of the test documents vs. train documents
    
    params:
    ::train_feature_matrix:: a numpy matrix N x M shape where N = number of characters M = number of features
    ::test_feature_matrix::  a numpy matrix N x M shape where N = number of characters M = number of features
    ::train_labels:: a list of character names for the training data in order consistent with train_feature_matrix
    ::test_labels:: a list of character names for the test data in order consistent with test_feature_matrix
    """
    rankings = []
    all_cosine_similarities = []
    pairwise_cosine_similarity = []
    pairs = []
    correct = 0
    for i, target in enumerate(test_labels):
        # compare the left out character against the mean
        idx = i 
        fm_1 = test_feature_matrix.toarray()[idx]
        all_sims = {}
        print("target:", target)
        for j, other in enumerate(train_labels):
            fm_2 = train_feature_matrix.toarray()[j]
            manual_cosine_similarity = compute_cosine_similarity(fm_1, fm_2)
            pairs.append((target, other))
            pairwise_cosine_similarity.append(manual_cosine_similarity)
            if other == target:
                all_cosine_similarities.append(manual_cosine_similarity)
            all_sims[other] = manual_cosine_similarity

            # print(target, other, manual_cosine_similarity)
        sorted_similarities = sorted(all_sims.items(),key=lambda x:x[1],reverse=True)
        # print(sorted_similarities)
        ranking = {key[0]: rank for rank, key in enumerate(sorted_similarities, 1)}
        # print("Ranking for target", ranking[target])
        if ranking[target] == 1:
            correct += 1
        rankings.append(ranking[target])
        # print("*****")
    mean_rank = np.mean(rankings)
    mean_cosine_similarity = np.mean(all_cosine_similarities)
    accuracy = correct/len(test_labels)
    print("mean rank", np.mean(rankings))
    print("mean cosine similarity", mean_cosine_similarity)
    print(correct, "correct out of", len(test_labels), "/ accuracy:", accuracy )
    
    # get a dataframe showing all the similarity scores of training vs test docs
    df = pd.DataFrame({'doc1': [x[0] for x in pairs], 'doc2': [x[1] for x in pairs],
                       'similarity': pairwise_cosine_similarity})

    # display characters which are most similar and least similar
    df.loc[[df.similarity.values.argmax(), df.similarity.values.argmin()]]
    return (mean_rank, mean_cosine_similarity, accuracy, df)

In [None]:
def plot_heat_map_similarity(df):
    """Takes a dataframe with header 'doc1, doc2, similarity'
    Plots a heatmap based on the similarity scores.
    """
    test_labels =  sorted(list(set(df.sort_values(['doc1'])['doc1'])))
    # add padding 1.0 values to either side
    cm = [[1.0,] * (len(test_labels)+2)]
    for target in test_labels:
        new_row = [1.0]
        for x in df.sort_values(['doc1', 'doc2'])[df['doc1']==target]['similarity']:
            new_row.append(x)
        new_row.append(1.0)
        cm.append(new_row)
    cm.append([1.0,] * (len(test_labels)+2))
    #print(cm)
    labels = [""] + test_labels + [""]
    fig = plt.figure(figsize=(20,20))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Similarity matrix between documents as vectors')
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels( labels, rotation=45)
    ax.set_yticklabels( labels)

    for i in range(len(cm)):
        for j in range(len(cm)):

            text = ax.text(j, i, round(cm[i][j],3),
                           ha="center", va="center", color="w")

    plt.xlabel('Training Vector Doc')
    plt.ylabel('Test Vector Doc')
    #fig.tight_layout()
    plt.show()

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
plot_heat_map_similarity(df)

## Result:
With this pre-processing, the mean rank on the validation set is 2.375.

# Q2. Improve linguistic feature extraction (15 marks)
Use the feature extraction techniques you have learned to improve the `to_feature_vector_dictionary` function above. Examples of extra features could include extracting n-grams of different lengths and including POS-tags. You could also use sentiment analysis and gender classification (using the same data) as additional features.

You could use some feature selection/reduction with techniques like minimum document frequency and/or feature selection like k-best selection using different criteria https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html. Again, develop on 90% training and 10% validation split and note the effect/improvement in mean rank with the techniques you use.
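
As an illustration of two of the techniques mentioned above, the sketch below counts n-grams with the standard library and then drops features that occur in fewer than a minimum number of documents. The helper names `ngram_counts` and `filter_by_min_df` and the toy documents are hypothetical; this is not the feature set used in the experiments that follow.

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count n-grams, joining each one into a single string feature."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter("_".join(gram) for gram in grams)

def filter_by_min_df(feature_dicts, min_df=2):
    """Keep only features that occur in at least `min_df` documents."""
    doc_freq = Counter(feat for d in feature_dicts for feat in d)
    return [{f: c for f, c in d.items() if doc_freq[f] >= min_df}
            for d in feature_dicts]

# Toy pre-processed documents, one token list per character
docs = [["oh", "my", "god"], ["oh", "my", "days"], ["get", "out"]]
bigram_dicts = [dict(ngram_counts(doc, n=2)) for doc in docs]
print(bigram_dicts)
print(filter_by_min_df(bigram_dicts, min_df=2))
```

Note that n-grams built this way would also span `_eol_` markers unless the token list is split into lines first.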

## 2.1 Using count, one previous word and POS tag

In [None]:
features={} #Dictionary to store mean rank of each feature combination

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    features_dict = {}  # maps each word to its list of string-valued features
    counts = Counter(character_doc)  # simple count per word

    for key, value in counts.items():
        if key == '_eol_':  # _eol_ only marks line boundaries, so it is not a feature
            continue
        lis_values = ["Count_" + str(value)]
        # tag the word once in isolation; the tag is appended for every occurrence below
        pos_feature = "POS_" + pos_tag([key])[0][1]
        for i, token in enumerate(character_doc):
            if token != key:
                continue
            # the first word of a line has no previous word
            if i == 0 or character_doc[i-1] == '_eol_':
                prev_word = " "
            else:
                prev_word = character_doc[i-1]
            lis_values.append("PRE_" + str(prev_word))
            lis_values.append(pos_feature)
        features_dict[key] = lis_values

    return features_dict

In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary(doc,[name]) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc,[name]) for name, doc in corpus])
    
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
print('\n')
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features["Count + ONE previous word+ POS tag"]=mean_rank

## 2.2 Using count, one previous word, one next word and POS tag
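
The previous/next-word lookup used in this section can be sketched on its own, independently of the feature dictionary. In the sketch below, `context_triples` is a hypothetical helper and the blank-string padding mirrors the `" "` placeholder used in the feature functions; the token list is made up.

```python
def context_triples(tokens, boundary="_eol_", pad=" "):
    """Return (previous, word, next) for every non-boundary token.

    Neighbours that fall outside the token list or on a line boundary
    are replaced by `pad`.
    """
    triples = []
    for i, word in enumerate(tokens):
        if word == boundary:
            continue
        has_prev = i > 0 and tokens[i - 1] != boundary
        has_next = i + 1 < len(tokens) and tokens[i + 1] != boundary
        prev_word = tokens[i - 1] if has_prev else pad
        next_word = tokens[i + 1] if has_next else pad
        triples.append((prev_word, word, next_word))
    return triples

tokens = ["go", "home", "_eol_", "fine", "_eol_"]
print(context_triples(tokens))
```

Checking the index explicitly (`i > 0`, `i + 1 < len(tokens)`) keeps the boundary handling correct even when a document does not end with an `_eol_` marker.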

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_2(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    
    features_dict = {}  # maps each word to its list of string-valued features
    counts = Counter(character_doc)  # simple count per word

    for key, value in counts.items():
        if key == '_eol_':  # _eol_ only marks line boundaries, so it is not a feature
            continue
        lis_values = ["Count_" + str(value)]
        # tag the word once in isolation; the tag is appended for every occurrence below
        pos_feature = "POS_" + pos_tag([key])[0][1]
        for i, token in enumerate(character_doc):
            if token != key:
                continue
            # previous word: blank at the start of a line
            if i == 0 or character_doc[i-1] == '_eol_':
                prev_word = " "
            else:
                prev_word = character_doc[i-1]
            lis_values.append("PRE_" + str(prev_word))

            # next word: blank at the end of a line
            if i >= len(character_doc)-1 or character_doc[i+1] == '_eol_':
                next_word = " "
            else:
                next_word = character_doc[i+1]
            lis_values.append("POST_" + str(next_word))

            lis_values.append(pos_feature)
        features_dict[key] = lis_values

    return features_dict


In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
   
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary_2(doc,[name]) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_2(doc,[name]) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

#print(training_corpus)
training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
print('\n')
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features["Count + ONE previous word+ ONE next word+ POS tag"]=mean_rank

## 2.3 One previous and one next word

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_3(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
    features_dict = {}  # maps each word to its list of string-valued features
    counts = Counter(character_doc)  # simple count per word

    for key, value in counts.items():
        if key == '_eol_':  # _eol_ only marks line boundaries, so it is not a feature
            continue
        lis_values = ["Count_" + str(value)]
        for i, token in enumerate(character_doc):
            if token != key:
                continue
            # previous word: blank at the start of a line
            if i == 0 or character_doc[i-1] == '_eol_':
                prev_word = " "
            else:
                prev_word = character_doc[i-1]
            lis_values.append("PRE_" + str(prev_word))

            # next word: blank at the end of a line
            if i >= len(character_doc)-1 or character_doc[i+1] == '_eol_':
                next_word = " "
            else:
                next_word = character_doc[i+1]
            lis_values.append("POST_" + str(next_word))
        features_dict[key] = lis_values

    return features_dict


In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
   
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary_3(doc,[name]) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_3(doc,[name]) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

#print(training_corpus)
training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
print('\n')
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features["Count +ONE previous word + ONE next word"]=mean_rank

## 2.4 Two previous words and POS tag

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_4(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
  
    features_dict = {}  # maps each word to its list of string-valued features
    counts = Counter(character_doc)  # simple count per word

    for key, value in counts.items():
        if key == '_eol_':  # _eol_ only marks line boundaries, so it is not a feature
            continue
        lis_values = ["Count_" + str(value)]
        # tag the word once in isolation; the tag is appended for every occurrence below
        pos_feature = "POS_" + pos_tag([key])[0][1]
        for i, token in enumerate(character_doc):
            if token != key:
                continue
            if i == 0 or character_doc[i-1] == '_eol_':  # first word of a line
                prev_word_1 = " "
                prev_word_2 = " "
            elif i == 1 or character_doc[i-2] == '_eol_':  # second word of a line
                prev_word_1 = character_doc[i-1]
                prev_word_2 = " "
            else:
                prev_word_1 = character_doc[i-1]
                prev_word_2 = character_doc[i-2]
            lis_values.append("PRE1_" + str(prev_word_1))
            lis_values.append("PRE2_" + str(prev_word_2))
            lis_values.append(pos_feature)
        features_dict[key] = lis_values

    return features_dict


In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    #print(len(corpus))
    # for i in range(0,len(corpus)):
    #   print(corpus[i][0])
    #print(corpus[2][0])
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary_4(doc,[name]) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_4(doc,[name]) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

#print(training_corpus)
training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
print('\n')
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features["Count + TWO previous words+ POS tag"]=mean_rank

## 2.5 Testing on the best feature combination:
i.e. "Count + ONE previous word+ ONE next word+ POS tag", here with a single POS tag per word (tagged in context) instead of one tag per occurrence.

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_5(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    features_dict = {}  # maps each word to its list of string-valued features
    counts = Counter(character_doc)  # simple count per word
    # tag the whole document in context; dict() keeps the last tag seen for each word
    pos = dict(pos_tag(character_doc))

    for key, value in counts.items():
        if key == '_eol_':  # _eol_ only marks line boundaries, so it is not a feature
            continue
        lis_values = ["Count_" + str(value)]
        if key in pos:  # add just one POS tag per word
            lis_values.append("POS_" + str(pos[key]))

        for i, token in enumerate(character_doc):
            if token != key:
                continue
            # previous word: blank at the start of a line
            if i == 0 or character_doc[i-1] == '_eol_':
                prev_word = " "
            else:
                prev_word = character_doc[i-1]
            lis_values.append("PRE_" + str(prev_word))

            # next word: blank at the end of a line
            if i >= len(character_doc)-1 or character_doc[i+1] == '_eol_':
                next_word = " "
            else:
                next_word = character_doc[i+1]
            lis_values.append("POST_" + str(next_word))
        features_dict[key] = lis_values

    return features_dict


In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    #print(len(corpus))
    # for i in range(0,len(corpus)):
    #   print(corpus[i][0])
    #print(corpus[2][0])
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary_5(doc,[name]) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_5(doc,[name]) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

#print(training_corpus)
training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
print('\n')
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features["Count + ONE previous word+ ONE next word+ POS tag (only one)"]=mean_rank

## Conclusion

In [None]:
features

In [None]:
# Bar chart of the mean rank obtained with each feature set
names = list(features.keys())
values = list(features.values())
fig, ax = plt.subplots()
ax.bar(range(len(names)), values)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=90)
ax.set_ylabel("Mean rank")
plt.show()

## Result: 

Of the feature sets above, “Count + ONE previous word+ ONE next word+ POS tag (only one)” gave the best mean rank, 1.812.
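
The count and neighbouring-word features compared above can be sketched without NLTK. This is a minimal illustration, not the notebook's function: `context_features` is a hypothetical helper, the POS tag is left out to keep it dependency-free, and the token list is made up.

```python
from collections import Counter

def context_features(tokens, eol="_eol_"):
    """Hypothetical helper: per-token count plus previous/next word within a line."""
    features = {}
    counts = Counter(tokens)
    for i, tok in enumerate(tokens):
        if tok == eol:
            continue  # the end-of-line marker is not a feature
        values = features.setdefault(tok, ["Count_" + str(counts[tok])])
        # neighbours are blank at the edges of the document and across line boundaries
        prev_word = tokens[i - 1] if i > 0 and tokens[i - 1] != eol else " "
        next_word = tokens[i + 1] if i + 1 < len(tokens) and tokens[i + 1] != eol else " "
        values.append("PRE_" + prev_word)
        values.append("POST_" + next_word)
    return features

# Made-up tokens standing in for one pre-processed character document
example = context_features(["get", "out", "_eol_", "get", "lost", "_eol_"])
print(example["get"])
```

Each occurrence of a word adds one PRE_ and one POST_ entry, so frequent words contribute more context features than rare ones.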

# Q3. Add dialogue context data and features (15 marks)
Adjust `create_character_document_from_dataframe` and the other functions appropriately so the data incorporates the context of the line spoken by the characters in terms of the lines spoken by other characters in the same scene (immediately before and after). You can also use **scene information** from the other columns **(but NOT the gender and character names directly)**.

In [None]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data

In [None]:
# Split into training and test data for heldout validation with random samples of 9:1 train/heldout split
from random import shuffle, seed

seed(0) # set a seed for reproducibility so same split is used each time

epsiode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = epsiode_scene_column
episode_scenes = sorted(list(set([x for x in epsiode_scene_column.values]))) # set function is random, need to sort!

shuffle(episode_scenes)

print(len(episode_scenes))
episode_split = int(0.9*len(episode_scenes))
training_ep_scenes = episode_scenes[:episode_split]
test_ep_scenes = episode_scenes[episode_split:]
print(len(training_ep_scenes), len(test_ep_scenes))

def train_or_heldout_eps(val):
    if val in training_ep_scenes:
        return "training"
    return "heldout"

all_train_data['train_heldout'] = all_train_data['episode_scene'].apply(train_or_heldout_eps)

In [None]:
print('Raw Data: ',np.shape(all_train_data))
train_data = all_train_data[all_train_data['train_heldout']=='training']
val_data = all_train_data[all_train_data['train_heldout']=='heldout']
print('Train set: ',np.shape(train_data))
print('Validation set: ',np.shape(val_data))

In [None]:
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import string
import re
def pre_process(character_text):
    """Pre-process all the concatenated lines of a character, 
    using tokenization, spelling normalization and other techniques.
    
    Initially just a tokenization on white space. Improve this for Q1.
    
    ::character_text:: a string with all of one character's lines
    """
    
    lemmatizer = WordNetLemmatizer() # initialise the lemmatizer
    character_text=character_text.lower() # convert to lower case
    # split on whitespace, on stray apostrophes and on punctuation that follows a word
    pattern = r"\s+|(?<=\s)'|'(?=\s)|'(?!.*\.\.) |(?<=\w)([,..!/?])"
    result = [s for s in re.split(pattern, character_text) if s]
    stop_words = stopwords.words('english') # English stop words
    punctuation=list(string.punctuation) # punctuation tokens to delete
    punctuation.append("..") # also delete the ".." token
    stop=set(stop_words+punctuation) # every token except stop words and unwanted punctuation is kept
    filtered_sentence = [w for w in result if w not in stop]
    # convert numeric tokens to words
    filtered_sentence = [num2words.num2words(w) if w.isnumeric() else w for w in filtered_sentence]
    # lemmatize each remaining token
    lem_sent = [lemmatizer.lemmatize(w) for w in filtered_sentence]

    return lem_sent

In [None]:
train_data.isnull().sum()

In [None]:
val_data.isnull().sum()

In [None]:
train_data = train_data.dropna() # The training split contains null values, so drop those rows

In [None]:
train_data.isnull().sum()

In [None]:
train_data

Some pre-processing, namely stop word and punctuation removal, is applied to the “Line” column first, because this question uses token counts as the feature.
Reasons for using counts:

a. Since this is a drama, characters often address each other by name in their lines, while a character rarely says his/her own name. The counts of the names a character uses therefore help to tell the characters apart.

b. Scene_info values are easy to count, which shows which characters are more likely to appear in which scenes.
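
As a small illustration of reason (a), the sketch below counts tokens after stop word removal. The two lines and the tiny stop list are made up for the example and are not taken from `training.csv`.

```python
from collections import Counter

# Toy lines, made up for illustration; not taken from training.csv
lines = ["Shirl , where is the car ?", "I told you , Shirl , it is gone"]
stop = {"where", "is", "the", "i", "you", "it", ",", "?"}  # tiny stand-in stop list

# lower-case, split on whitespace and drop the stop tokens
tokens = [w for line in lines for w in line.lower().split() if w not in stop]
counts = Counter(tokens)
print(counts.most_common(1))
```

Once the stop words are gone, the name the speaker keeps using is the most frequent token, which is the kind of signal the count feature relies on.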


In [None]:
stop_words = stopwords.words('english')
punctuation=list(string.punctuation)
punctuation.append("..")
stop=stop_words+punctuation

In [None]:
train_data

In [None]:
train_data['Line'] = train_data['Line'].apply(word_tokenize)
train_data['Line'] = train_data['Line'].apply(lambda words: [word for word in words if word not in stop])

In [None]:
val_data['Line'] = val_data['Line'].apply(word_tokenize) #preprocessing validation set too
val_data['Line'] = val_data['Line'].apply(lambda words: [word for word in words if word not in stop])

In [None]:
#After preprocessing the "Line" column, join the words back into a sentence in the train data
no_stop=[]
for i in train_data['Line']:
  k=' '.join(i)
  no_stop.append(k)
train_data['Line']=no_stop

In [None]:
#After preprocessing the "Line" column, join the words back into a sentence in the val data
no_stop=[]
for i in val_data['Line']:
  k=' '.join(i)
  no_stop.append(k)
val_data['Line']=no_stop

In [None]:
train_data

In [None]:
val_data

In [None]:
train_data = train_data.reset_index(drop=True) # reset the index, as some rows of all_train_data went to the validation set
train_data

In [None]:
val_data = val_data.reset_index(drop=True) # reset the index, as some rows of all_train_data went to the train set
val_data

As the question asks, a column “pre_current_post” is added to the data frame. For each line it holds the scene info, the previous line, the current line and the next line, where the previous and next lines are included only if they are in the same scene and spoken by a different character.

Conditions covered:

a. Previous line: if the current line is the first line of a scene, or the previous line is from the same character, “NONE” is added; otherwise the previous line is tokenized and each word is prefixed with PRE_.

b. Next line: if the current line is the last line of a scene, or the next line is from the same character, “NONE” is added; otherwise the next line is tokenized and each word is prefixed with POST_.

c. The current line is tokenized and its words have no prefix.

d. The spaces in the scene info are replaced with “_” so that it is treated as one single word.

e. _EOL_ is added at the end to mark where the context of the current line ends and the next line starts.
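
The conditions above can be sketched on a toy script. This is only an illustration under stated assumptions: `add_context` is a hypothetical helper working on plain tuples rather than the data frame, and the three rows are made up.

```python
def add_context(rows):
    """rows: list of (scene_id, scene_info, character, line) tuples in script order.
    Returns one token list per row. Hypothetical helper, not the notebook's function."""
    docs = []
    for i, (scene, info, name, line) in enumerate(rows):
        # condition d: scene info becomes one single token
        tokens = [info.replace(",", "").replace(" ", "_")]
        prev = rows[i - 1] if i > 0 else None
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        # condition a: previous line, same scene and different character
        if prev and prev[0] == scene and prev[2] != name:
            tokens += ["PRE_" + w for w in prev[3].split()]
        else:
            tokens.append("NONE")
        # condition c: current line without prefix
        tokens += line.split()
        # condition b: next line, same scene and different character
        if nxt and nxt[0] == scene and nxt[2] != name:
            tokens += ["POST_" + w for w in nxt[3].split()]
        else:
            tokens.append("NONE")
        # condition e: end-of-line marker
        tokens.append("_EOL_")
        docs.append(tokens)
    return docs

# Made-up rows: two lines in scene 1-1, one line in scene 1-2
rows = [
    ("1-1", "CAR PARK EXT NIGHT", "A", "Look out"),
    ("1-1", "CAR PARK EXT NIGHT", "B", "Why"),
    ("1-2", "CAFE INT DAY", "B", "Tea please"),
]
docs = add_context(rows)
for d in docs:
    print(d)
```

The third row gets “NONE” on both sides because its neighbours belong to a different scene.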


In [None]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
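    and as value the list of tokens from their lines, each line wrapped in
    its scene and dialogue context and followed by an _EOL_ marker.

    For each line the tokens are: the scene info, the previous line in the same
    scene (tokens prefixed with PRE_), the current line, and the next line in
    the same scene (tokens prefixed with POST_). NONE is used when the
    neighbouring line is missing or spoken by the same character.

    ::max_line_count:: the maximum number of lines to be added per character
    (accepted for compatibility; this version does not truncate the lines)
    """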
    result = []
    
    Scene_info_list=[]
    for i in df['Scene_info']: # few Scene_info "SLATERS', KITCHEN/FRONT ROOM INT DAY LIGHT" are in this format
      k=i.replace(",","") 
      k=k.replace(' ','_')# therefore making it into "SLATERS'_KITCHEN/FRONT_ROOM_INT_DAY_LIGHT" format
      Scene_info_list.append(k)
    df['Scene_info']=Scene_info_list
    episodeCol = df.Episode
    sceneCol = df.Scene
    lineCol = df.Line
    charname=df.Character_name
    episcene= df.episode_scene
    scene_info=df.Scene_info


    currEpisode = episodeCol[0]
    currScene = sceneCol[0]
      
    for i in range(0, len(episodeCol)): 
        temp = []

        ep = episodeCol[i]
        scene = sceneCol[i]
        line = lineCol[i]
        name=charname[i]
        episcene_con=episcene[i]
        scene_info_con=scene_info[i]

        lineList = line.split()

        if(i != 0): #For previous character line
          prevEp = episodeCol[i-1]
          prevScene = sceneCol[i-1]
          prevLine = lineCol[i-1]
          prevChar = charname[i-1]
        else:
          prevEp = -1
          prevScene = -1
          prevLine = 'NONE'  
          prevChar = -1      


        prevLineList = prevLine.split()

        if(i != len(episodeCol)-1): #for post character line
          nextEp = episodeCol[i+1]
          nextScene = sceneCol[i+1]
          nextLine = lineCol[i+1]  
          nextChar = charname[i+1]
        else:
          nextEp = -1
          nextScene = -1
          nextLine = 'NONE'  
          nextChar = -1


        nextLineList = nextLine.split()


        if(ep == prevEp and scene == prevScene and name!=prevChar): 
            append_str = 'PRE_'
            pre_res = [append_str + sub for sub in prevLineList]
            temp.append(scene_info_con)
            temp.extend(pre_res)
        else:
            temp.append(scene_info_con)
            temp.append('NONE')
        

        temp.extend(lineList)

        if(ep == nextEp and scene == nextScene and name!=nextChar):
            append_str = 'POST_'
            pre_res = [append_str + sub for sub in nextLineList]
            temp.extend(pre_res)

        else:
            temp.append('NONE')
        
        result.append(temp)
        

    df['pre_current_post']=result # new column in the dataframe with previous line, current line and next line
    uni_dict={} # one token list per unique character
    for name in df['Character_name'].unique():
      line_list=[]
      for pre_post in df.loc[df['Character_name']==name, 'pre_current_post']:
        line_list.extend(pre_post + ['_EOL_']) # _EOL_ marks the end of the current line
      uni_dict[name]=line_list

    return uni_dict

In [None]:
# print out the number of words each character has in the training set
# only use the first 360 lines of each character
train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=360)
for k,v in train_character_docs.items():
  train_character_docs[k]=",".join(v)
#train_character_docs
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split(',')))
    total_words += len(train_character_docs[name].split(','))
print("total words", total_words)

Format examples:

a. When the current line is the first line in a scene:
`DESERTED_CAR_PARK_EXT_NIGHT,NONE,Look,ya,mark,ya,And,think,'re,unlucky,man,POST_Shirl,POST_...,_EOL_`

b. General format, when both the previous and the next line satisfy the conditions:
`R&R_INT_NIGHT,PRE_Okay,Are,alright,You,'ve,bit,since,got,POST_Are,POST_alright,_EOL_`


In [None]:
train_data

In [None]:
train_character_docs

In [None]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process(doc)) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [None]:
val_data

In [None]:
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
for k,v in val_character_docs.items():
  val_character_docs[k]=",".join(v)
val_character_docs
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split(',')))
    total_words += len(val_character_docs[name].split(','))
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process(doc)) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]

In [None]:
def to_feature_vector_dictionary(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
    counts = Counter(character_doc)  # for now a simple count
    counts = dict(counts)
    # add the extra features, for now just adding one count for each extra feature
    for feature in extra_features:
        counts[feature] = counts.get(feature, 0) + 1 # counts is a plain dict, so a missing key needs a default
    
    for i in list(counts.keys()): #deleting "none" and "_eol_" key as these don't have much meaning in the dictionary 
      if i=='none' or i=='_eol_':
        del counts[i]
    #print(counts)
    return counts  

In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

## Result:
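Adding the scene information and the previous and next lines from the same scene gave a mean rank of 1.5 on the validation set.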
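
For reference, the mean rank metric can be sketched in plain Python as below. This is an assumption about how the metric works (rank of the correct training document by cosine similarity, averaged over validation documents); the real `compute_IR_evaluation_scores` is defined earlier in the notebook and may differ in detail. `cosine`, `mean_rank_sketch` and the vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_rank_sketch(train_vecs, val_vecs, train_labels, val_labels):
    """Average position of the correct character when training documents
    are sorted by similarity to each validation document (1 is best)."""
    ranks = []
    for vec, label in zip(val_vecs, val_labels):
        sims = [(cosine(vec, t), name) for t, name in zip(train_vecs, train_labels)]
        ranked = [name for _, name in sorted(sims, key=lambda p: p[0], reverse=True)]
        ranks.append(ranked.index(label) + 1)
    return sum(ranks) / len(ranks)

# Made-up vectors for three characters
train_vecs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
val_vecs = [[1, 0.1, 0], [0.5, 0.4, 0.9], [0, 0.2, 1]]
labels = ["A", "B", "C"]
result = mean_rank_sketch(train_vecs, val_vecs, labels, labels)
print(result)
```

Here A and C are ranked first and B is ranked third, so the mean rank is (1 + 3 + 1) / 3.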

# Q4. Improve the vectorization method (10 marks)
Use a matrix transformation technique like TF-IDF (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to improve the `create_document_matrix_from_corpus` function, which currently only uses a dictionary vectorizor (`DictVectorizer`) which straight-forwardly maps from the feature dictionaries produced for each character document to a sparse matrix.

As the `create_document_matrix_from_corpus` is designed to be used both in training/fitting (with `fitting` set to `True`) and in transformation alone on test/validation data (with `fitting` set to `False`), make sure you initialize any transformers you want to try in the same place as `corpusVectorizer = DictVectorizer()` before you call 
`create_document_matrix_from_corpus`. Again, develop on 90% training 10% validation split and note the effect/improvement in mean rank with each technique you try.
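
As a rough worked example of what the transformation does, the sketch below re-implements the weighting in plain Python, assuming scikit-learn's documented defaults for `TfidfTransformer` (smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 normalisation) together with `sublinear_tf=True` (tf replaced by `1 + ln(tf)`). `tfidf_sketch` is a hypothetical name and the count matrix is made up; in the notebook the idf values are fitted on the training documents and reused for the validation documents.

```python
import math

def tfidf_sketch(count_rows):
    """Sublinear TF-IDF with smoothed idf and L2 row normalisation."""
    n_docs = len(count_rows)
    n_feats = len(count_rows[0])
    # document frequency: number of documents in which each feature occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_feats)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in count_rows:
        weights = [(1 + math.log(c)) * idf[j] if c > 0 else 0.0
                   for j, c in enumerate(row)]
        norm = math.sqrt(sum(w * w for w in weights))
        out.append([w / norm if norm else 0.0 for w in weights])
    return out

# Made-up counts: 2 character documents, 3 features
counts = [[3, 1, 0], [2, 0, 4]]
weights = tfidf_sketch(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

Features that occur in every document get the lowest idf, so features specific to one character gain relative weight before the cosine similarity is computed.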

## 4.1 Applying TF-IDF to the Q3 features, i.e. scene_info + previous line + current line + post line

In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# DictVectorizer produces sparse vectors from the feature dicts;
# TfidfTransformer then re-weights them (sublinear_tf uses 1 + log(tf))
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf= True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features_tfidf={}
features_tfidf["scene_info + previous line + current line + post line"]=mean_rank

### Note:
The cells below repeat the data loading, splitting and pre-processing steps from Q1 and Q2, because Q3 added a new column to the data frame and used different features.

In [None]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data

In [None]:
all_train_data.dropna(inplace=True) #dropping rows which has null values

In [None]:
# Split into training and test data for heldout validation with random samples of 9:1 train/heldout split
from random import shuffle, seed

seed(0) # set a seed for reproducibility so same split is used each time

epsiode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = epsiode_scene_column
episode_scenes = sorted(list(set([x for x in epsiode_scene_column.values]))) # set function is random, need to sort!

shuffle(episode_scenes)

print(len(episode_scenes))
episode_split = int(0.9*len(episode_scenes))
training_ep_scenes = episode_scenes[:episode_split]
test_ep_scenes = episode_scenes[episode_split:]
print(len(training_ep_scenes), len(test_ep_scenes))

def train_or_heldout_eps(val):
    if val in training_ep_scenes:
        return "training"
    return "heldout"

all_train_data['train_heldout'] = all_train_data['episode_scene'].apply(train_or_heldout_eps)

In [None]:
print('Raw Data: ',np.shape(all_train_data))
train_data = all_train_data[all_train_data['train_heldout']=='training']
val_data = all_train_data[all_train_data['train_heldout']=='heldout']
print('Train set: ',np.shape(train_data))
print('Validation set: ',np.shape(val_data))

In [None]:
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import string
import re
def pre_process(character_text):
    """Pre-process all the concatenated lines of a character, 
    using tokenization, spelling normalization and other techniques.
    
    Initially just a tokenization on white space. Improve this for Q1.
    
    ::character_text:: a string with all of one character's lines
    """
    
    lemmatizer = WordNetLemmatizer() # initialise the lemmatizer
    character_text=character_text.lower() # convert to lower case
    # split on whitespace, on stray apostrophes and on punctuation that follows a word
    pattern = r"\s+|(?<=\s)'|'(?=\s)|'(?!.*\.\.) |(?<=\w)([,..!/?])"
    result = [s for s in re.split(pattern, character_text) if s]
    stop_words = stopwords.words('english') # English stop words
    punctuation=list(string.punctuation) # punctuation tokens to delete
    punctuation.append("..") # also delete the ".." token
    stop=set(stop_words+punctuation) # every token except stop words and unwanted punctuation is kept
    filtered_sentence = [w for w in result if w not in stop]
    # convert numeric tokens to words
    filtered_sentence = [num2words.num2words(w) if w.isnumeric() else w for w in filtered_sentence]
    # lemmatize each remaining token
    lem_sent = [lemmatizer.lemmatize(w) for w in filtered_sentence]

    return lem_sent

In [None]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
    their lines joined together as a single string, with end of line _EOL_
    markers between them.
    
    ::max_line_count:: the maximum number of lines to be added per character
    """
    character_docs = {}
    character_line_count = {}
    for line, name, gender in zip(df.Line, df.Character_name, df.Gender):
        if not name in character_docs.keys():
            character_docs[name] = ""
            character_line_count[name] = 0
        if character_line_count[name]==max_line_count:
            continue
        character_docs[name] += str(line)   + " _EOL_ "  # adding an end-of-line token
        #character_docs[name] += str(line)   + " "  # adding an end-of-line token
        character_line_count[name]+=1
    print("lines per character", character_line_count)
    return character_docs

In [None]:
# print out the number of words each character has in the training set
# only use the first 360 lines of each character
train_character_docs = create_character_document_from_dataframe(train_data, max_line_count=360)
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split()))
    total_words += len(train_character_docs[name].split())
print("total words", total_words)

In [None]:
train_character_docs

In [None]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process(doc)) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [None]:
val_character_docs = create_character_document_from_dataframe(val_data, max_line_count=40)
print('Num. Characters: ',len(val_character_docs.keys()),"\n")
total_words = 0
for name in val_character_docs.keys():
    print(name, 'Num of Words: ',len(val_character_docs[name].split()))
    total_words += len(val_character_docs[name].split())
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
val_corpus = [(name, pre_process(doc)) for name, doc in sorted(val_character_docs.items())]
val_labels = [name for name, doc in val_corpus]

## 4.2 TF-IDF using count, POS tag and one previous word

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    features_dict={}
    counts = Counter(character_doc)  # simple count of each token

    for key, value in counts.items():
      lis_values=["Count_"+str(value)]
      # POS tag of the word, tagged on its own without sentence context
      tag=pos_tag([key])[0][1]
      for i in range(0,len(character_doc)):
        if(key==character_doc[i]):
            # previous word: blank at the start of the document or of a line
            if i==0 or character_doc[i-1]=='_eol_':
              prev_word=" "
            else:
              prev_word=character_doc[i-1]
            lis_values.append("PRE_"+str(prev_word))
            lis_values.append("POS_" + str(tag))
      features_dict[key]=lis_values

    # drop the _eol_ key, which only marks the start of the next line
    features_dict.pop('_eol_', None)
    return features_dict

In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# DictVectorizer produces sparse vectors from the feature dicts;
# TfidfTransformer then re-weights them (sublinear_tf uses 1 + log(tf))
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf= True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features_tfidf["Count + ONE previous word+ POS tag"]=mean_rank

In [None]:
features_tfidf

## 4.3 TF-IDF using count, POS tag, one previous word and one next word

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_2(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
   
    features_dict={}
    counts = Counter(character_doc)  # simple count of each token

    for key, value in counts.items():
      lis_values=["Count_"+str(value)]
      # POS tag of the word, tagged on its own without sentence context
      tag=pos_tag([key])[0][1]

      for i in range(0,len(character_doc)):
        if(key==character_doc[i]):
            # previous word: blank at the start of the document or of a line
            if i==0 or character_doc[i-1]=='_eol_':
              prev_word=" "
            else:
              prev_word=character_doc[i-1]
            lis_values.append("PRE_"+str(prev_word))

            # next word: blank at the end of the document or of a line
            if i>=len(character_doc)-1 or character_doc[i+1]=='_eol_':
              next_word=" "
            else:
              next_word=character_doc[i+1]
            lis_values.append("POST_" +str(next_word))

            lis_values.append("POS_" + str(tag))
      features_dict[key]=lis_values

    # drop the _eol_ key, which only marks line boundaries
    features_dict.pop('_eol_', None)
    return features_dict


In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# DictVectorizer produces sparse vectors from the feature dicts;
# TfidfTransformer then re-weights them (sublinear_tf uses 1 + log(tf))
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf= True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary_2(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_2(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features_tfidf["Count + ONE previous word+ ONE next word+ POS tag"]=mean_rank

## 4.4 TF-IDF using count, one previous word and one next word

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_3(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    features_list=[]
    features_dict={}
    counts = Counter(character_doc)  # for now a simple count
    counts=dict(counts)
   


    features_dict={}
    for key, value in counts.items():
      lis_values=[]
      lis_values.append("Count_"+str(value))
      
      for i in range(0,len(character_doc)):
        if(key==character_doc[i]):
            if i == 0 or character_doc[i-1] == '_eol_':  # no previous word at document start or after a line break
              prev_word=" "
            else:
              prev_word=character_doc[i-1]
            lis_values.append("PRE_"+str(prev_word))
            features_dict[key]=lis_values

            if ((i>=len(character_doc)-1)|(character_doc[i]=='_eol_')): #for post word
              next_word=" "
            else:
              next_word=character_doc[i+1]
              if next_word=='_eol_':
                next_word=" "
            lis_values.append("POST_" +str(next_word))
            features_dict[key]=lis_values

      for i in list(features_dict.keys()):
        if i=='_eol_':
          del features_dict[i]
    print(features_dict)
    return (features_dict)


In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# corpusVectorizer turns feature dicts into sparse vectors, then re-weights them with TF-IDF
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf=True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary_3(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_3(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features_tfidf["Count +ONE previous word + ONE next word"]=mean_rank

## 4.5 Bigrams: TF-IDF using count and two previous words

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_4(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    
    features_list=[]
    features_dict={}
    counts = Counter(character_doc)  # for simple count
    counts=dict(counts)


    features_dict={}
    for key, value in counts.items():
      lis_values=[]
      lis_values.append("Count_"+str(value))
      for i in range(0,len(character_doc)):
        if(key==character_doc[i]):
            if i == 0 or character_doc[i-1] == '_eol_':  # no previous word at document start or after a line break
              prev_word_1=" "
              prev_word_2=" "
            elif i == 1 or character_doc[i-2] == '_eol_':  # only one previous word is available
              prev_word_1=character_doc[i-1]
              prev_word_2=" "
            else:
              prev_word_1=character_doc[i-1]
              prev_word_2=character_doc[i-2]
            lis_values.append("PRE1_"+str(prev_word_1))
            lis_values.append("PRE2_"+str(prev_word_2))


            # POS tag of the current token, tagged in isolation
            lis_values.append("POS_" + str(pos_tag([character_doc[i]])[0][1]))
            features_dict[key]=lis_values

      for i in list(features_dict.keys()):
        if i=='_eol_':
          del features_dict[i]
    print(features_dict)
    return (features_dict)


In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# corpusVectorizer turns feature dicts into sparse vectors, then re-weights them with TF-IDF
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf=True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary_4(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_4(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features_tfidf["Count + Two previous word(Bi-grams)"]=mean_rank

## 4.6 TF-IDF testing on the best feature combination
i.e. "Count + ONE previous word + ONE next word + POS tag (only one)"

In [None]:
from nltk import pos_tag
def to_feature_vector_dictionary_5(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
      
    features_list=[]
    features_dict={}
    counts = Counter(character_doc)  # for simple count
    counts=dict(counts)
   

    pos=dict(pos_tag(character_doc))
    features_dict={}
    for key, value in counts.items():
      lis_values=[]
      lis_values.append("Count_"+str(value))
      if key in pos:
        lis_values.append("POS_"+str(pos[key]))
      
      for i in range(0,len(character_doc)):
        if(key==character_doc[i]):
            if i == 0 or character_doc[i-1] == '_eol_':  # no previous word at document start or after a line break
              prev_word=" "
            else:
              prev_word=character_doc[i-1]
            lis_values.append("PRE_"+str(prev_word))
            features_dict[key]=lis_values

            if ((i>=len(character_doc)-1)|(character_doc[i]=='_eol_')): #for post word
              next_word=" "
            else:
              next_word=character_doc[i+1]
              if next_word=='_eol_':
                next_word=" "
            lis_values.append("POST_" +str(next_word))
            features_dict[key]=lis_values

      for i in list(features_dict.keys()):
        if i=='_eol_':
          del features_dict[i]
    print(features_dict)
    return (features_dict)


In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# corpusVectorizer turns feature dicts into sparse vectors, then re-weights them with TF-IDF
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf=True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary_5(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary_5(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
val_feature_matrix = create_document_matrix_from_corpus(val_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, val_feature_matrix, train_labels, val_labels)

In [None]:
features_tfidf["Count + ONE previous word+ ONE next word+ POS tag (only one)"]=mean_rank

## Conclusion

In [None]:
features_tfidf

In [None]:
from pylab import *
names = list(features_tfidf.keys())
values = list(features_tfidf.values())
fig, ax = plt.subplots()
for position, value in enumerate(values):
    ax.bar(position, value)  # one call per bar so each feature set gets its own colour
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=90)
ax.set_ylabel("Mean rank")
plt.show()

## Result:

As can be observed from the comparison above, applying TF-IDF to the Q3 feature set (scene_info + previous line + current line + post line) gave the best mean rank, 1.06.
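To make the metric concrete, here is a minimal sketch of how a mean rank can be derived from cosine similarities. It is an independent illustration under stated assumptions, not the notebook's `compute_IR_evaluation_scores`: the helper names `cosine` and `mean_rank` and the toy vectors are hypothetical, and the vectors are plain dense lists rather than sparse matrices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_rank(train_vectors, train_labels, val_vectors, val_labels):
    """Average 1-based rank of the correct training document for each held-out document."""
    ranks = []
    for vec, gold in zip(val_vectors, val_labels):
        # Sort training documents from most to least similar to this held-out document.
        scored = sorted(
            zip(train_labels, train_vectors),
            key=lambda pair: cosine(vec, pair[1]),
            reverse=True,
        )
        ranked_labels = [label for label, _ in scored]
        ranks.append(ranked_labels.index(gold) + 1)
    return sum(ranks) / len(ranks)


# Toy data: three "characters" in a two-dimensional feature space.
train = [[1, 0], [0, 1], [1, 1]]
labels = ["A", "B", "C"]
held_out = [[2, 0.1], [0.2, 1], [1, 0]]
score = mean_rank(train, labels, held_out, labels)
print(score)
```

In the toy data the first two held-out documents rank their own character first, while the third ranks it second, so the mean rank is 4/3. A mean rank of 1 would mean every held-out document is closest to its own character's training document.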

# Q5. Select and test the best vector representation method (10 marks)
Finish the optimization of your vector representations by selecting the best combination of the techniques you tried in Q1-3 and test using the code below to train on all of the training data (using the first 400 lines per character maximum) and do the final testing on the test file (using the first 40 lines per character maximum).

Make any necessary adjustments such that it runs in the same way as the training/testing regime you developed above, e.g. making sure any transformer objects are initialized before `create_document_matrix_from_corpus` is called. Make sure your best system is left in the notebook and it is clear what the mean rank and accuracy of document selection are on the test data.
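The "first N lines per character" restriction mentioned above can be sketched as follows. This is a minimal illustration assuming the data is available as (character, line) pairs in file order; the helper name `first_lines_per_character` and the toy rows are hypothetical.

```python
from collections import defaultdict


def first_lines_per_character(rows, max_line_count):
    """Group (character, line) pairs by character, keeping at most max_line_count lines each."""
    docs = defaultdict(list)
    for character, line in rows:
        if len(docs[character]) < max_line_count:
            docs[character].append(line)
    return dict(docs)


rows = [
    ("A", "first"),
    ("B", "hello"),
    ("A", "second"),
    ("A", "third"),
]
capped = first_lines_per_character(rows, max_line_count=2)
print(capped)
```

With `max_line_count=2`, the third line of character "A" is dropped, while "B" keeps its single line.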

## 5.1 Feature used: "Count + ONE previous word + ONE next word + POS tag (only one)"
Since the feature "Count + ONE previous word + ONE next word + POS tag (only one)" gave the best (lowest) mean rank, I will use it for the test data.

In [None]:
# Load in training data and display in pandas dataframe
train_path='training.csv'
all_train_data = pd.read_csv(train_path,  delimiter="\t", skip_blank_lines = True)
test_path ='test.csv'
test_data = pd.read_csv(test_path,  delimiter="\t", skip_blank_lines = True)

# Inspect
all_train_data

In [None]:
episode_scene_column = all_train_data.Episode.astype(str) + "-" + all_train_data.Scene.astype(str)
all_train_data['episode_scene'] = episode_scene_column


episode_scene_column = test_data.Episode.astype(str) + "-" + test_data.Scene.astype(str)
test_data['episode_scene'] = episode_scene_column


In [None]:
all_train_data['Scene_info'].value_counts().head(12).plot(kind='bar')

In [None]:
all_train_data

In [None]:
all_train_data['Character_name'].value_counts().plot(kind='bar') 

In [None]:
names=all_train_data['Character_name'].unique()
names_dict={}
for i in names:
  k=all_train_data.loc[all_train_data['Character_name']==i]
  names_dict[i]=np.array(k['Gender'].head(1))
print(names_dict)

In [None]:
names_dict_gen={} #making dic with character name and it's gender

for key,value in names_dict.items():
  names_dict_gen[key]=value[0]

In [None]:
names_dict_gen

In [None]:
del names_dict_gen["OTHER"] # OTHER covers characters of both genders, so it is removed
names_dict_gen

In [None]:
gender_count_dict={}
gender_list=['FEMALE', 'MALE']
for i in gender_list:
  gender_count_dict[i]=list(names_dict_gen.values()).count(i)

In [None]:
gender_count_dict

In [None]:
names = list(gender_count_dict.keys())
values = list(gender_count_dict.values())
fig, ax = plt.subplots()
for position, value in enumerate(values):
    ax.bar(position, value)  # one call per bar so each gender gets its own colour
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=90)
ax.set_ylabel("Number of main characters")
plt.show()

In [None]:
all_train_data['Scene_info'].value_counts().head(3)

## 5.2 Data Analysis observed for all_train_data:

a. The three most frequent scenes are "CAFE_INT_DAY_LIGHT", "VIC_DOWNSTAIRS_INT_DAY_LIGHT" and "BRIDGE_STREET_EXT_DAY_LIGHT".

b. There are 8 female and 7 male main characters in total, excluding OTHER, which covers characters of both genders.

In [None]:
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import string
import re
def pre_process(character_text):
    """Pre-process all the concatenated lines of a character, 
    using tokenization, spelling normalization and other techniques.
    
    Initially just a tokenization on white space. Improve this for Q1.
    
    ::character_text:: a string with all of one character's lines
    """
    
    lemmatizer = WordNetLemmatizer()  # initialise the lemmatizer
    character_text = character_text.lower()  # convert everything to lower case
    # split on white space, on apostrophes next to white space, and on punctuation that follows a word character
    pattern = r"\s+|(?<=\s)'|'(?=\s)|'(?!.*\.\.) |(?<=\w)([,.!/?])"
    result = [s for s in re.split(pattern, character_text) if s]
    # drop English stop words and punctuation tokens (including "..")
    stop = set(stopwords.words('english')) | set(string.punctuation) | {".."}
    filtered_sentence = [w for w in result if w not in stop]
    for indx, token in enumerate(filtered_sentence):
        if token.isnumeric():
            filtered_sentence[indx] = num2words.num2words(token)  # spell out digits, e.g. "2" -> "two"
    lem_sent = [lemmatizer.lemmatize(token) for token in filtered_sentence]  # lemmatize each remaining token

    return lem_sent

In [None]:
all_train_data.isnull().sum() #12 null values in all_train_data

In [None]:
test_data.isnull().sum() #1 null value in test_data

In [None]:
#Dropping null values
all_train_data.dropna(inplace=True)
test_data.dropna(inplace=True)

In [None]:
all_train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

In [None]:
all_train_data

In [None]:
stop_words = stopwords.words('english') #stopword and punctuation to be removed from "Line" column
punctuation=list(string.punctuation)
punctuation.append("..")
stop=stop_words+punctuation

In [None]:
all_train_data['Line'] = all_train_data['Line'].apply(word_tokenize)
all_train_data['Line'] = all_train_data['Line'].apply(lambda words: [word for word in words if word not in stop])

In [None]:
test_data['Line'] = test_data['Line'].apply(word_tokenize)
test_data['Line'] = test_data['Line'].apply(lambda words: [word for word in words if word not in stop])

In [None]:
#Joining words after tokenizing in train_data
no_stop=[]
for i in all_train_data['Line']:
  k=' '.join(i)
  no_stop.append(k)
all_train_data['Line']=no_stop

In [None]:
#Joining words after tokenizing in test_data
no_stop=[]
for i in test_data['Line']:
  k=' '.join(i)
  no_stop.append(k)
test_data['Line']=no_stop

In [None]:
all_train_data

In [None]:
test_data

In [None]:
all_train_data = all_train_data.reset_index(drop=True) # resetting the index in all_train_data after dropping rows
all_train_data

In [None]:
test_data = test_data.reset_index(drop=True) # resetting the index in test_data after dropping rows
test_data

In [None]:
# Create one document per character
def create_character_document_from_dataframe(df, max_line_count):
    """Returns a dict with the name of the character as key,
    their lines joined together as a single string, with end of line EOL
    markers between them.
    
    ::max_line_count:: the maximum number of lines to be added per character
    """
    #incorporates the context of the line spoken by the characters in terms of the lines spoken by other characters in the same scene (immediately before and after)
    result = []
    
    Scene_info_list=[]
    for i in df['Scene_info']: # few Scene_info "SLATERS', KITCHEN/FRONT ROOM INT DAY LIGHT" are in this format
      k=i.replace(",","") 
      k=k.replace(' ','_')# therefore making it into "SLATERS'_KITCHEN/FRONT_ROOM_INT_DAY_LIGHT" format
      Scene_info_list.append(k)
    df['Scene_info']=Scene_info_list
    episodeCol = df.Episode
    sceneCol = df.Scene
    lineCol = df.Line
    charname=df.Character_name
    episcene= df.episode_scene
    scene_info=df.Scene_info


    currEpisode = episodeCol[0]
    currScene = sceneCol[0]
      
    for i in range(0, len(episodeCol)): 
        temp = []

        ep = episodeCol[i]
        scene = sceneCol[i]
        line = lineCol[i]
        name=charname[i]
        episcene_con=episcene[i]
        scene_info_con=scene_info[i]

        lineList = line.split()

        if(i != 0): #For previous character line
          prevEp = episodeCol[i-1]
          prevScene = sceneCol[i-1]
          prevLine = lineCol[i-1]
          prevChar = charname[i-1]
        else:
          prevEp = -1
          prevScene = -1
          prevLine = 'NONE'  
          prevChar = -1      


        prevLineList = prevLine.split()

        if(i != len(episodeCol)-1): #for post character line
          nextEp = episodeCol[i+1]
          nextScene = sceneCol[i+1]
          nextLine = lineCol[i+1]  
          nextChar = charname[i+1]
        else:
          nextEp = -1
          nextScene = -1
          nextLine = 'NONE'  
          nextChar = -1


        nextLineList = nextLine.split()


        if(ep == prevEp and scene == prevScene and name!=prevChar): 
            append_str = 'PRE_'
            pre_res = [append_str + sub for sub in prevLineList]
            temp.append(scene_info_con)
            temp.extend(pre_res)
        else:
            temp.append(scene_info_con)
            temp.append('NONE')
        

        temp.extend(lineList)

        if(ep == nextEp and scene == nextScene and name!=nextChar):
            append_str = 'POST_'
            pre_res = [append_str + sub for sub in nextLineList]
            temp.extend(pre_res)

        else:
            temp.append('NONE')
        
        result.append(temp)
        

    df['pre_current_post'] = result  # new column: scene info + previous line + current line + post line
    uni_dict = {}  # maps each character name to its concatenated token list
    for name in df['Character_name'].unique():
        # keep only the first max_line_count lines of this character
        char_lines = df.loc[df['Character_name'] == name, 'pre_current_post'].head(max_line_count)
        line_list = []
        for pre_post in char_lines:
            line_list.extend(pre_post + ['_EOL_'])  # _EOL_ marks the end of the current line
        uni_dict[name] = line_list

    return uni_dict

In [None]:
# print out the number of words each character has in the training set
# only use the first 400 lines of each character
train_character_docs = create_character_document_from_dataframe(all_train_data, max_line_count=400)
for k,v in train_character_docs.items():
  train_character_docs[k]=",".join(v)
#train_character_docs
print('Num. Characters: ',len(train_character_docs.keys()),"\n")
total_words = 0
for name in train_character_docs.keys():
    print(name, 'Number of Words: ',len(train_character_docs[name].split(',')))
    total_words += len(train_character_docs[name].split(','))  # tokens are comma-separated, so split on ','
print("total words", total_words)

In [None]:
all_train_data

In [None]:
train_character_docs

In [None]:
# create list of pairs of (character name, pre-processed character) 
training_corpus = [(name, pre_process(doc)) for name, doc in sorted(train_character_docs.items())]
train_labels = [name for name, doc in training_corpus]

In [None]:
test_data

In [None]:
test_character_docs = create_character_document_from_dataframe(test_data, max_line_count=40)
for k,v in test_character_docs.items():
  test_character_docs[k]=",".join(v)
test_character_docs
print('Num. Characters: ',len(test_character_docs.keys()),"\n")
total_words = 0
for name in test_character_docs.keys():
    print(name, 'Num of Words: ',len(test_character_docs[name].split(',')))
    total_words += len(test_character_docs[name].split(','))
print("total words", total_words)

# create list of pairs of (character name, pre-processed character) 
test_corpus = [(name, pre_process(doc)) for name, doc in sorted(test_character_docs.items())]
test_labels = [name for name, doc in test_corpus]

In [None]:
def to_feature_vector_dictionary(character_doc, extra_features=[]):
    """Converts a list of pre-processed tokens and extra features
    to a Dictionary as a function of the tokens.
    
    Initially just a simple count. Improve this for Q2.
    
    ::character_doc:: a list of pre-processed tokens
    ::extra_features:: any extra features for the character to be added to feature vector dict
    """
    counts = Counter(character_doc)  # for now a simple count
    # add the extra features, for now just adding one count for each extra feature
    for feature in extra_features:
        counts[feature] += 1  # a Counter returns 0 for unseen keys, so new features are safe to add
    counts = dict(counts)

    # 'none' (no previous/post line) and '_eol_' (end of line) are only markers, so drop them
    for marker in ('none', '_eol_'):
        counts.pop(marker, None)
    print(counts)
    return counts

## 5.3 Without TF-IDF

In [None]:
corpusVectorizer = DictVectorizer()   # corpusVectorizor which will just produce sparse vectors from feature dicts
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
        corpusVectorizer.fit([to_feature_vector_dictionary(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
test_feature_matrix = create_document_matrix_from_corpus(test_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, test_feature_matrix, train_labels, test_labels)

In [None]:
all_train_test={}
all_train_test["Scene_info + previous line + current line + post line (WITHOUT TF-IDF)"]=mean_rank

## 5.4 With TF-IDF
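As a rough guide to what the TF-IDF step does, the sketch below re-implements the weighting by hand. It is a minimal illustration assuming the scikit-learn defaults for `TfidfTransformer` (smoothed idf, L2 normalisation) together with `sublinear_tf=True`; the helper names `fit_idf` and `tfidf_vector` and the toy documents are hypothetical. The idf values are learned from the training documents only and then reused for unseen documents.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed inverse document frequencies from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Weight one tokenised document with sublinear tf and the fitted idf, then L2-normalise."""
    counts = Counter(term for term in doc if term in idf)  # terms unseen in training are dropped
    weights = {term: (1 + math.log(tf)) * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = [["pub", "pub", "car"], ["car", "market"]]
idf = fit_idf(train_docs)                              # fit on training data only
train_vec = tfidf_vector(train_docs[0], idf)
test_vec = tfidf_vector(["pub", "pub", "taxi"], idf)   # "taxi" was never seen in training
print(train_vec)
print(test_vec)
```

The sublinear term `1 + log(tf)` dampens the effect of very frequent tokens, and terms that occur in every training document (such as "car" here) receive the lowest idf.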

In [None]:
# Any matrix transformers (e.g. tf-idf transformers) should be initialized here
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# corpusVectorizer turns feature dicts into sparse count vectors, then re-weights them with TF-IDF
corpusVectorizer = Pipeline([('count', DictVectorizer()),
                 ('tfid', TfidfTransformer(sublinear_tf=True))])


def create_document_matrix_from_corpus(corpus, fitting=False):
    """Method which fits different vectorizers
    on data and returns a matrix.
    
    Currently just does simple conversion to matrix by vectorizing the dictionary. Improve this for Q3.
    
    ::corpus:: a list of (class_label, document) pairs.
    ::fitting:: a boolean indicating whether to fit/train the vectorizers (should be true on training data)
    """
    
    # uses the global variable of the corpus Vectorizer to improve things
    if fitting:
      corpusVectorizer.fit([to_feature_vector_dictionary(doc) for name, doc in corpus])
    doc_feature_matrix = corpusVectorizer.transform([to_feature_vector_dictionary(doc) for name, doc in corpus])
    
    #training_feature_matrix[0].toarray()
    return doc_feature_matrix

training_feature_matrix = create_document_matrix_from_corpus(training_corpus, fitting=True)
test_feature_matrix = create_document_matrix_from_corpus(test_corpus, fitting=False)

In [None]:
mean_rank, mean_cosine_simliarity, acc, df = compute_IR_evaluation_scores(training_feature_matrix, test_feature_matrix, train_labels, test_labels)

In [None]:
all_train_test["Scene_info + previous line + current line + post line (WITH TF-IDF)"]=mean_rank

In [None]:
all_train_test

## Result:
Achieved a mean rank of 1 on the test data for the feature set 'Scene_info + previous line + current line + post line (WITH TF-IDF)'.