# Text Similarity for Document Retrieval

#### Import Libraries

In [1]:
# plotly.offline renders charts locally instead of pushing them to the cloud
import plotly
import plotly.offline as pyo
# gives us the Data and Figure objects
from plotly.graph_objs import *

# import cufflinks to easily plot pandas data frames
import cufflinks as cf

# work with cufflinks offline and set its theme
cf.go_offline()
cf.set_config_file(theme='white')

import pandas as pd
import numpy as np

import spacy
nlp = spacy.load('en_core_web_lg')

import warnings
warnings.filterwarnings('ignore')

In [2]:
# display all text in a cell without truncation
# I wanted to read the text in full; comment this line out and restart the notebook
# if you want to see the usual truncated df cells
# (None means "no limit"; pandas versions before 1.0 expect -1 instead)
pd.set_option('display.max_colwidth', None)

#### A Fix for the Missing Stop Words in spaCy's Large Language Models

In [3]:
# fix for the missing stop words in spaCy's large language models en_core_web_lg and en_core_web_md 
# source: https://github.com/explosion/spaCy/issues/1574#issuecomment-346184948
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

### Reading and Exploring Our Data

In [4]:
df = pd.read_excel('ac_curriculum_2018.xlsx', index_col=0)

In [5]:
df.head()

Unnamed: 0,dcterms_subject.prefLabel,text
5,English,"In the Foundation year, students communicate with peers, teachers, known adults and students from other classes. Students engage with a variety of texts for enjoyment. They listen to, read and view spoken, written and multimodal texts in which the primary purpose is to entertain, as well as some texts designed to inform. These include traditional oral texts, picture books, various types of stories, rhyming verse, poetry, non-fiction, film, multimodal texts and dramatic performances. They participate in shared reading, viewing and storytelling using a range of literary texts, and recognise the entertaining nature of literature.Students create a range of imaginative, informative and persuasive texts including pictorial representations, short statements, performances, recounts and poetry."
6,English,"In Year 1, students communicate with peers, teachers, known adults and students from other classes.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret spoken, written and multimodal texts designed to entertain and inform. These encompass traditional oral texts including Aboriginal stories, picture books, various types of stories, rhyming verse, poetry, non-fiction, film, dramatic performances and texts used by students as models for constructing their own texts. Students create a variety of imaginative, informative and persuasive texts including recounts, procedures, performances, literary retellings and poetry."
7,English,"In Year 2, students communicate with peers, teachers, students from other classes and community members.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret spoken, written and multimodal texts in which the primary purpose is to entertain, as well as texts designed to inform and persuade. These encompass traditional oral texts, picture books, various types of print and digital stories, simple chapter books, rhyming verse, poetry, non-fiction, film, multimodal texts, dramatic performances and texts used by students as models for constructing their own work. Literary texts that support and extend Year 2 students as independent readers involve sequences of events that span several pages and present unusual happenings within a framework of familiar experiences. Informative texts present new content about topics of interest and topics being studied in other areas of the curriculum. These texts include language features such as varied sentence structures, some unfamiliar vocabulary, a significant number of high-frequency sight words and words that need to be decoded phonically, and a range of punctuation conventions, as well as illustrations and diagrams that support and extend the printed text. Students create a range of imaginative, informative and persuasive texts including imaginative retellings, reports, performances, poetry and expositions."
8,English,"In Years 3 and 4, students experience learning in familiar contexts and a range of contexts that relate to study in other areas of the curriculum. They interact with peers and teachers from other classes and schools in a range of face-to-face and online/virtual environments.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret spoken, written and multimodal texts in which the primary purpose is aesthetic, as well as texts designed to inform and persuade. These encompass traditional oral texts including Aboriginal stories, picture books, various types of print and digital texts, simple chapter books, rhyming verse, poetry, non-fiction, film, multimodal texts, dramatic performances and texts used by students as models for constructing their own workLiterary texts that support and extend students in Years 3 and 4 as independent readers describe complex sequences of events that extend over several pages and involve unusual happenings within a framework of familiar experiences. Informative texts include content of increasing complexity and technicality about topics of interest and topics being studied in other areas of the curriculum. These texts use complex language features, including varied sentence structures, some unfamiliar vocabulary, a significant number of high-frequency sight words and words that need to be decoded phonically, and a variety of punctuation conventions, as well as illustrations and diagrams that support and extend the printed text.Students create a range of imaginative, informative and persuasive types of texts including narratives, procedures, performances, reports, reviews, poetry and expositions."
9,English,"In Years 3 and 4, students experience learning in familiar contexts and a range of contexts that relate to study in other areas of the curriculum. They interact with peers and teachers from other classes and schools in a range of face-to-face and online/virtual environments.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret spoken, written and multimodal texts in which the primary purpose is aesthetic, as well as texts designed to inform and persuade. These encompass traditional oral texts including Aboriginal stories, picture books, various types of print and digital texts, simple chapter books, rhyming verse, poetry, non-fiction, film, multimodal texts, dramatic performances and texts used by students as models for constructing their own work. Literary texts that support and extend students in Years 3 and 4 as independent readers describe complex sequences of events that extend over several pages and involve unusual happenings within a framework of familiar experiences. Informative texts include content of increasing complexity and technicality about topics of interest and topics being studied in other areas of the curriculum. These texts use complex language features, including varied sentence structures, some unfamiliar vocabulary, a significant number of high-frequency sight words and words that need to be decoded phonically, and a variety of punctuation conventions, as well as illustrations and diagrams that support and extend the printed text. Students create a range of imaginative, informative and persuasive types of texts including narratives, procedures, performances, reports, reviews, poetry and expositions."


In [59]:
#df.tail()

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 5 to 86
Data columns (total 2 columns):
dcterms_subject.prefLabel    43 non-null object
text                         42 non-null object
dtypes: object(2)
memory usage: 1.0+ KB


In [7]:
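# cast to str so spaCy can parse every row; note that the one missing value (NaN) becomes the string 'nan'
df['text'] = df['text'].astype(str)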

In [8]:
type(df['text'].iloc[0])

str

Note how the type changes when we process the text using spaCy:

### Parsing the Docs using spaCy's `nlp.pipe`

In [9]:
df['text_parsed'] = list(nlp.pipe(df['text']))

In [10]:
type(df['text_parsed'].iloc[0])

spacy.tokens.doc.Doc

In [11]:
#df.dropna()

### Extracting and Plotting Term Frequency

In [12]:
# first tokenize the text
df['tokenized'] = df['text_parsed'].apply(lambda doc: [str(t) for t in doc if not t.is_punct and not t.is_stop])

In [13]:
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer

def get_word_freq(df, tokenized_var, ngram_range=1):
    """Returns a data frame of n-gram frequencies over df[tokenized_var], most frequent first"""

    def get_ngrams(tokenized_text, n):
        # join each n-gram tuple into a single space-separated string
        return [' '.join(grams) for grams in ngrams(tokenized_text, n)]

    docs = df[tokenized_var]
    if ngram_range > 1:
        # work on a separate Series so the input df is not modified
        docs = docs.apply(lambda x: get_ngrams(x, ngram_range))

    # the input is already tokenized, so the tokenizer is the identity function
    vec = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
    doc_matrix = vec.fit_transform(docs)

    # sum the counts over all documents to get corpus-level frequencies
    tf = np.squeeze(np.asarray(doc_matrix.sum(axis=0)))

    # get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
    tf_df = pd.DataFrame(tf, columns=['Frequency'], index=vec.get_feature_names_out())

    return tf_df.sort_values('Frequency', ascending=False)
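
To make the n-gram step concrete, here is a minimal standard-library sketch of the counting that `get_word_freq` performs: slide a window of size `n` over each token list, join each window into a string, and count the strings across all documents. The helper name `count_ngrams` and the toy token lists are illustrative assumptions; they are not part of the notebook's data or of any library.

```python
from collections import Counter

def count_ngrams(tokenized_docs, n=1):
    """Count n-grams across a list of token lists (hypothetical helper)."""
    counts = Counter()
    for tokens in tokenized_docs:
        # zip over n shifted views of the token list to form the n-grams
        windows = zip(*(tokens[i:] for i in range(n)))
        counts.update(' '.join(window) for window in windows)
    return counts

# toy token lists, made up for illustration
toy_docs = [
    ['students', 'create', 'texts'],
    ['students', 'create', 'persuasive', 'texts'],
]

unigram_counts = count_ngrams(toy_docs, n=1)
bigram_counts = count_ngrams(toy_docs, n=2)

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

The notebook's version does the same counting through `CountVectorizer`, which is more convenient once the result has to end up in a data frame.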

#### Most Frequent Unigrams

In [14]:
get_word_freq(df, 'tokenized', 1).head(20).iplot(kind='bar', title='Most Frequent Unigrams')

#### Most Frequent Bigrams

In [15]:
get_word_freq(df, 'tokenized', ngram_range=2).head(20).iplot(kind='bar', color='purple', title='Most Frequent Bigrams')

#### Most Frequent Trigrams

In [16]:
get_word_freq(df, 'tokenized', ngram_range=3).head(20).iplot(kind='bar', color='red', 
                                                             layout=dict(title='Most Frequent Trigrams',
                                                                         margin=dict(b=130)))

### Text Length Distribution

In [17]:
df['length'] = df['tokenized'].apply(len)

In [18]:
df['length'].iplot(kind='box', boxpoints='outliers', color='#004cff', title='Text Length (# words) Distribution')

### Text Preprocessing 

#### Trying two different cleaning methods

We'll try 2 different cleaning methods:
- Remove the stop words and punctuation only
- Remove all but the nouns, since they carry most of the information

In [19]:
df['text_nouns'] = df['text_parsed'].apply(lambda doc: nlp(" ".join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']])))
df['text_nostop'] = df['text_parsed'].apply(lambda doc: nlp(" ".join([str(t) for t in doc if not t.is_stop and not t.is_punct])))

Let's now print out some sentences to compare the results:

#### No Cleaning

In [20]:
idx = [0, 1, 2]
for i in idx:
    print("[{0}]".format(i), df['text_parsed'].iloc[i])
    print()

[0] In the Foundation year, students communicate with peers, teachers, known adults and students from other classes. Students engage with a variety of texts for enjoyment. They listen to, read and view spoken, written and multimodal texts in which the primary purpose is to entertain, as well as some texts designed to inform. These include traditional oral texts, picture books, various types of stories, rhyming verse, poetry, non-fiction, film, multimodal texts and dramatic performances. They participate in shared reading, viewing and storytelling using a range of literary texts, and recognise the entertaining nature of literature.Students create a range of imaginative, informative and persuasive texts including pictorial representations, short statements, performances, recounts and poetry.

[1] In Year 1, students communicate with peers, teachers, known adults and students from other classes.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret

#### Removing Stopwords and Punctuation Only

In [21]:
for i in idx:
    print("[{0}]".format(i), df['text_nostop'].iloc[i])
    print()

[0] Foundation year students communicate peers teachers known adults students classes Students engage variety texts enjoyment listen read view spoken written multimodal texts primary purpose entertain texts designed inform include traditional oral texts picture books types stories rhyming verse poetry non fiction film multimodal texts dramatic performances participate shared reading viewing storytelling range literary texts recognise entertaining nature literature Students create range imaginative informative persuasive texts including pictorial representations short statements performances recounts poetry

[1] Year 1 students communicate peers teachers known adults students classes Students engage variety texts enjoyment listen read view interpret spoken written multimodal texts designed entertain inform encompass traditional oral texts including Aboriginal stories picture books types stories rhyming verse poetry non fiction film dramatic performances texts students models constructin

#### Removing All but Nouns

In [22]:
for i in idx:
    print("[{0}]".format(i), df['text_nouns'].iloc[i])
    print()

[0] Foundation year students peers teachers adults students classes Students variety texts enjoyment texts purpose texts texts picture books types stories verse poetry film texts performances reading storytelling range texts nature literature Students range texts representations statements performances recounts poetry

[1] Year students peers teachers adults students classes Students variety texts enjoyment texts texts stories picture books types stories verse poetry film performances texts students models texts Students variety texts recounts procedures performances retellings poetry

[2] Year students peers teachers students classes community members Students variety texts enjoyment texts purpose texts texts picture books types print stories chapter books verse poetry film texts performances texts students models work texts Year students readers sequences events pages happenings framework experiences texts content topics interest topics areas curriculum texts language features senten

#### No Cleaning

In [23]:
print(df['text_parsed'].iloc[0].similarity(df['text_parsed'].iloc[1]))
print(df['text_parsed'].iloc[0].similarity(df['text_parsed'].iloc[2]))
print(df['text_parsed'].iloc[0].similarity(df['text_parsed'].iloc[20]))

0.9949758768866656
0.9923086001798976
0.914938915419828


#### No Stopwords or Punctuation

In [24]:
print(df['text_nostop'].iloc[0].similarity(df['text_nostop'].iloc[1]))
print(df['text_nostop'].iloc[0].similarity(df['text_nostop'].iloc[2]))
print(df['text_nostop'].iloc[0].similarity(df['text_nostop'].iloc[20]))

0.9928864039456686
0.9815621648266192
0.8104190415543462


#### Only Nouns

In [25]:
print(df['text_nouns'].iloc[0].similarity(df['text_nouns'].iloc[1]))
print(df['text_nouns'].iloc[0].similarity(df['text_nouns'].iloc[2]))
print(df['text_nouns'].iloc[0].similarity(df['text_nouns'].iloc[20]))

0.9857674492770195
0.9756167513633999
0.7518748208115867



As explained in <a href='https://spacy.io/usage/vectors-similarity' target='_blank'>spaCy's docs</a>, the vector of a full document is simply the average of the vectors of its words. If the documents share many words that lie in the same semantic region (stop words such as "he", "was", "this", ...), those words dominate the average and the remaining vocabulary "cancels out", so even unrelated documents can end up with very high similarity scores, as we see here.

Removing stop words and punctuation improved the results a bit, but the results of the last option (keeping only the nouns) make the most sense to me, so I'll stick with it.
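
A toy illustration of this averaging effect, using made-up 3-d vectors rather than real spaCy embeddings (the vectors and the helper names `mean_vector` and `cosine` are assumptions for the sake of the example): two documents whose content words point in orthogonal directions still look very similar once each is averaged with the same block of stop words.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up "word vectors": two unrelated content words and one stop word
content_a = [1.0, 0.0, 0.0]
content_b = [0.0, 1.0, 0.0]
stop_word = [0.0, 0.0, 1.0]

# each toy document is one content word plus three stop words
doc_a = mean_vector([content_a] + [stop_word] * 3)
doc_b = mean_vector([content_b] + [stop_word] * 3)

print(cosine(content_a, content_b))  # content words alone
print(cosine(doc_a, doc_b))          # averaged with the shared stop words
```

With these numbers the content words alone have similarity 0, while the averaged documents come out at about 0.9, purely because of the shared stop words.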

### Computing Similarity Matrix

Now all we need to do is compute the pair-wise similarity matrix for the texts. To do this we need to:

- Extract the embedding for each document (text block). This is done using spaCy's pre-trained word vectors, which we can access simply through the `.vector` attribute of a spaCy `Doc` or `Token` object. The embedding of a full document (text block) is simply the average of its constituent word vectors.

- Once we have a vector for each text block, use the `cosine_similarity` function from scikit-learn to compute the pair-wise similarity between the vectors.
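
In symbols, the two steps above can be written as follows. The notation is ours, introduced only for this illustration: $w_1, \dots, w_n$ are the word vectors of a document $d$, and $v_a$, $v_b$ are the resulting vectors of two documents $d_a$ and $d_b$.

$$
v_d = \frac{1}{n} \sum_{i=1}^{n} w_i ,
\qquad
\mathrm{sim}(d_a, d_b) = \frac{v_a \cdot v_b}{\lVert v_a \rVert \, \lVert v_b \rVert}
$$

The similarity matrix $S$ then has entries $S_{ab} = \mathrm{sim}(d_a, d_b)$, so it is symmetric and has ones on its diagonal, up to floating-point rounding.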

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

# get the embedding matrix
embedding_matrix = np.array([doc.vector for doc in df['text_nouns']])

# compute pair-wise similarity
similarity_matrix = cosine_similarity(embedding_matrix)


# have a look
similarity_matrix

array([[1.0000002 , 0.9857673 , 0.97561705, ..., 0.8292966 , 0.79353344,
        0.75033784],
       [0.9857673 , 1.0000004 , 0.9636292 , ..., 0.8064437 , 0.77263045,
        0.73675513],
       [0.97561705, 0.9636292 , 0.99999976, ..., 0.8571835 , 0.8232752 ,
        0.7957817 ],
       ...,
       [0.8292966 , 0.8064437 , 0.8571835 , ..., 0.99999964, 0.98041916,
        0.88410026],
       [0.79353344, 0.77263045, 0.8232752 , ..., 0.98041916, 0.9999996 ,
        0.85864794],
       [0.75033784, 0.73675513, 0.7957817 , ..., 0.88410026, 0.85864794,
        0.99999994]], dtype=float32)

In [27]:
similarity_matrix.shape

(43, 43)

### Automate the Process Using a Function

Now let's wrap everything we've done above in a reusable function that we can call on the original dataframe:

In [28]:
############ MAIN FUNCTION ############
def compute_similarity_matrix(df, text_field, preprocessing=None, indicies=None, plot=False, colorscale='ylorrd'):
    """
    computes pair-wise cosine similarity for the documents in the input df
    
    Parameters
    ----------
    df:                (pandas DF) the original df 
    text_field:        (string) name of the column in df containing the docs
    preprocessing:     (string) the cleaning method to be applied to the docs
    indicies:          (iterable) index labels in the df representing document ids to be compared
    plot:              (boolean) whether to plot the similarity matrix
    colorscale:        (str) the colorscale of the plotted heatmap
    
    Returns
    -------
    pair-wise similarity scores of the documents in df (pandas DF)
    """
    
    # if you want to compute similarities for particular documents
    if indicies is not None:
        df = df.loc[indicies]
        
    # extract text vectors
    embedding_matrix = build_embedding_matrix(df, text_field, preprocessing)
    
    # compute similarity 
    similarity_matrix = pd.DataFrame(cosine_similarity(embedding_matrix), 
                                     index=df[text_field].index, 
                                     columns=df[text_field].index).round(3)
    
    # don't plot on the entire data frame
    # use indicies argument to reduce the size and make it plottable
    if plot:
        similarity_matrix.T.iplot(kind='heatmap',  colorscale=colorscale,
                                  layout=dict(title='Similarity of Texts', 
                                              height=600, margin=dict(l=110, b=100)))
        
    
    return similarity_matrix

In [29]:

############ HELPER FUNCTIONS ############
def build_embedding_matrix(df, text_field, preprocessing=None):
    """
    Computes embedding matrix for a set of documents
    
    Parameters
    ----------
    df:                (pandas DF) the original df 
    text_field:        (string) name of the column in df containing the docs
    preprocessing:     (string) the cleaning method to be applied to the docs
    
    Returns
    -------
    embedding matrix where each row is a vector representing the embedding of a document (2-d np array)
    """
    
    assert preprocessing in [None, 'stopwords', 'nouns_only'], \
    "preprocessing should be either 'stopwords' or 'nouns_only' or None"
    
    df = df.copy()
    
    # parse the texts using spacy nlp object
    df[text_field] = list(nlp.pipe(df[text_field], disable=['parser', 'ner']))
    
    # clean the docs
    if preprocessing:
        df[text_field] = df[text_field].apply(lambda doc: process(doc, preprocessing))
        
    # extract text vectors
    embedding_matrix = np.array([doc.vector for doc in df[text_field]])
    
    return embedding_matrix


def process(document, preprocessing=None):
    """
    Cleans a parsed document according to the chosen preprocessing method
    
    Parameters
    ----------
    document: spacy Doc
        The document we want to process
        
    preprocessing: str
        Preprocessing steps to be applied to document. e.g. removing stopwords.
        
    Returns
    -------
    spacy Doc cleaned according to (preprocessing)
    """
        
    # remove stopwords and punct (cleaning option 1)
    if preprocessing == 'stopwords':
        clean_doc = nlp(" ".join([str(t) for t in document if not t.is_stop and not t.is_punct]), disable=['parser', 'ner'])
    
    # remove all but nouns (cleaning option 2)
    elif preprocessing == 'nouns_only':
        clean_doc = nlp(" ".join([str(t) for t in document if t.pos_ in ['NOUN', 'PROPN']]), disable=['parser', 'ner'])
        
    else:
        clean_doc = document
    
    return clean_doc

Let's try our function. We'll first try producing the same similarity matrix as before:

In [30]:
sim_matrix_nouns = compute_similarity_matrix(df, 'text', preprocessing='nouns_only')

In [31]:
sim_matrix_nouns.head()

Unnamed: 0,5,6,7,8,9,10,11,12,13,14,...,59,60,61,62,63,64,65,84,85,86
5,1.0,0.986,0.976,0.967,0.967,0.946,0.95,0.946,0.946,0.932,...,0.814,0.827,0.794,0.785,0.788,0.76,0.827,0.829,0.794,0.75
6,0.986,1.0,0.964,0.95,0.95,0.93,0.932,0.925,0.925,0.907,...,0.788,0.805,0.768,0.757,0.758,0.73,0.805,0.806,0.773,0.737
7,0.976,0.964,1.0,0.993,0.993,0.977,0.979,0.974,0.974,0.964,...,0.847,0.858,0.831,0.821,0.824,0.799,0.853,0.857,0.823,0.796
8,0.967,0.95,0.993,1.0,1.0,0.987,0.988,0.985,0.985,0.98,...,0.869,0.88,0.853,0.845,0.847,0.824,0.872,0.874,0.834,0.816
9,0.967,0.95,0.993,1.0,1.0,0.987,0.988,0.985,0.985,0.979,...,0.871,0.882,0.855,0.848,0.849,0.827,0.874,0.877,0.837,0.818


In [32]:
# check if it's the same we computed before
sim_matrix_nouns.values == similarity_matrix.round(3)

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

#### Compute the Similarity for Particular Indices or a Partial Data Frame

In [34]:
# particular indices (the labels must match the integer index of df)
compute_similarity_matrix(df, text_field='text', indicies=[5, 7, 9], 
                          preprocessing='nouns_only')

Unnamed: 0,5,7,9
5,1.0,0.976,0.967
7,0.976,1.0,0.993
9,0.967,0.993,1.0

In [35]:
# range of indices
compute_similarity_matrix(df, text_field='text', indicies=df.index[1:10], preprocessing='nouns_only')

Unnamed: 0,6,7,8,9,10,11,12,13,14
6,1.0,0.964,0.95,0.95,0.93,0.932,0.925,0.925,0.907
7,0.964,1.0,0.993,0.993,0.977,0.979,0.974,0.974,0.964
8,0.95,0.993,1.0,1.0,0.987,0.988,0.985,0.985,0.98
9,0.95,0.993,1.0,1.0,0.987,0.988,0.985,0.985,0.979
10,0.93,0.977,0.987,0.987,1.0,0.999,0.994,0.994,0.99
11,0.932,0.979,0.988,0.988,0.999,1.0,0.996,0.996,0.991
12,0.925,0.974,0.985,0.985,0.994,0.996,1.0,1.0,0.995
13,0.925,0.974,0.985,0.985,0.994,0.996,1.0,1.0,0.995
14,0.907,0.964,0.98,0.979,0.99,0.991,0.995,0.995,1.0


#### Plotting the Similarity Matrix

In [36]:
sim_matrix_range = compute_similarity_matrix(df, text_field='text', indicies=df.index[:20],
                                             preprocessing='nouns_only', plot=True)

Let's explore some results from this plot:

#### Example of a Low Similarity (0.444)

In [37]:
print(df['text'].loc['VCADAR045'])
print()
print(df['text'].loc['VCAMAE041'])

KeyError: 'VCADAR045'

#### Example of a High Similarity (0.888)

In [36]:
print(df['text'].loc['VCADRR046'])
print()
print(df['text'].loc['VCADAR046'])

Analyse a range of drama from contemporary and past times, including the drama of Aboriginal and Torres Strait Islander peoples to explore differing viewpoints and develop understanding of drama practice across local, national and international contexts

Analyse a range of dance from contemporary and past times, including dance of Aboriginal and Torres Strait Islander peoples, to explore differing viewpoints and develop understanding of dance practice across local, national and international contexts


<strong>Sounds Very Good!</strong>

## Document Retrieval

One of the important applications of word vectors is document retrieval. Since we saw how easily we can compute pair-wise similarities between a set of documents, it's tempting to take it a step further and see how we can retrieve the document in the corpus that is closest to a query document.

Let's see how we can do this. First, we'll compute the embedding matrix as before, but one for each cleaning method:

In [38]:
embeddings_nouns = build_embedding_matrix(df, 'text', 'nouns_only')
embeddings_nostop = build_embedding_matrix(df, 'text', 'stopwords')

Now we'll build our function to retrieve the closest document. The idea is as follows:

1. Convert the query document to a vector using the same approach we used to convert the docs in the corpus to vectors.

2. Now the query document and the documents in the corpus are all represented as vectors.

3. Compute the similarity between the query document vector and all document vectors in the corpus.

4. Extract the document in the corpus whose vector has the highest similarity score with the query vector, along with this score.
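
Before the spaCy-based implementation, steps 3 and 4 can be sketched on toy data using only the standard library. The 3-d vectors below are made up and merely stand in for averaged word embeddings, and `retrieve_closest` is a hypothetical helper name used only in this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_closest(query_vector, corpus_vectors):
    """Return (index, score) of the corpus vector most similar to the query."""
    # step 3: similarity between the query and every document vector
    scores = [cosine(query_vector, vector) for vector in corpus_vectors]
    # step 4: position of the highest score, together with that score
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx, scores[best_idx]

# made-up document vectors standing in for averaged word embeddings
corpus_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]
query_vector = [0.2, 0.9, 0.0]

best_idx, best_score = retrieve_closest(query_vector, corpus_vectors)
print(best_idx, round(best_score, 3))
```

Here the second toy document (index 1) is retrieved, because its vector points in nearly the same direction as the query vector.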

Let's do this:

In [39]:
######## MAIN FUNCTION ##########
def get_most_similar_text(query, df, text_field, embedding_matrix, preprocessing=None):
    """
    Transforms query into vector, and computes cosine similarity 
    of query vector against the document vectors of the corpus.
    
    Parameters
    ----------
    query:                (string) document to compare
    df:                   (pandas DF) the original df
    text_field:           (string) name of the column in df containing the docs
    embedding_matrix:     (2-d np array) the matrix in which each doc is represented as a vector
    preprocessing:        (string) the cleaning method to be applied to (query)                   
    
    Returns
    -------
    (similarity score, most similar document) tuple
    """
    
    query_vector = encode_query(query, preprocessing)
    
    #compute similarities
    similarities = compute_similarities(query_vector, embedding_matrix)
    
    #grab most similar document
    closest_idx, closest_score = get_most_similar(similarities)
    
    return (closest_score, df[text_field].values[closest_idx])

In [40]:
######## HELPER FUNCTIONS ##########
def encode_query(query, preprocessing=None):
    
    spacy_query = nlp(query)
    
    processed_query = process(spacy_query, preprocessing)
    
    return [processed_query.vector]


def compute_similarities(query_vector, embedding_matrix):
    
    similarities = cosine_similarity(embedding_matrix, query_vector)
    return similarities

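def get_most_similar(similarities):
    
    # grab most similar document
    # similarities has shape (n_docs, 1), so argmax over the flattened array is the row index
    closest_idx = np.argmax(similarities)
    # indexing a row returns a 1-element array, which is why the scores print in brackets
    closest_score = similarities[closest_idx]
    
    return (closest_idx, closest_score)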

Now let's apply our algorithm to a set of arbitrary queries. Note that I wrote these queries myself; they're not in the corpus.

In [41]:
queries = ['this is the best dance ever!! what a terrific dancer!',
           'The Political Landscape is shifting drammatically',
           'Judicial systems should not discriminate against minorities',
           'The movie industry is lacking imagination',
           'International relations have worsened since Trump has become the President of the United States',
           "Darwin's book, The Origin of Species.",
           "Political Parties are batteling over various proposed bills",
           "Physics is the science that enables us to understand laws of nature"]

#### Results for Nouns-only Cleaning Method:

In [42]:
for i, query in enumerate(queries):
    print('{count}: query sentence:\n{query}.\nClosest sentence:'.format(count=i, query=query))
    closest_score, closest_sentence = get_most_similar_text(query, df, 'text', embeddings_nouns, 'nouns_only')
    print(closest_score, closest_sentence, '\n')

0: query sentence:
this is the best dance ever!! what a terrific dancer!.
Closest sentence:
[0.4014761] In Year 1, students communicate with peers, teachers, known adults and students from other classes.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret spoken, written and multimodal texts designed to entertain and inform. These encompass traditional oral texts including Aboriginal stories, picture books, various types of stories, rhyming verse, poetry, non-fiction, film, dramatic performances and texts used by students as models for constructing their own texts. Students create a variety of imaginative, informative and persuasive texts including recounts, procedures, performances, literary retellings and poetry. 

1: query sentence:
The Political Landscape is shifting drammatically.
Closest sentence:
[0.6229957]   The science inquiry skills and science as a human endeavour strands are described across a two-year band. In their planning, sch

#### Results for Removing Stopwords (and Punct.) Cleaning Method:

In [43]:
for i, query in enumerate(queries):
    print('{count}: query sentence:\n{query}.\nClosest sentence:'.format(count=i, query=query))
    closest_score, closest_sentence = get_most_similar_text(query, df, 'text', embeddings_nostop, 'stopwords')
    print(closest_score, closest_sentence, '\n')

0: query sentence:
this is the best dance ever!! what a terrific dancer!.
Closest sentence:
[0.56845224] In Year 1, students communicate with peers, teachers, known adults and students from other classes.Students engage with a variety of texts for enjoyment. They listen to, read, view and interpret spoken, written and multimodal texts designed to entertain and inform. These encompass traditional oral texts including Aboriginal stories, picture books, various types of stories, rhyming verse, poetry, non-fiction, film, dramatic performances and texts used by students as models for constructing their own texts. Students create a variety of imaginative, informative and persuasive texts including recounts, procedures, performances, literary retellings and poetry. 

1: query sentence:
The Political Landscape is shifting drammatically.
Closest sentence:
[0.6773884] In Years 9 and 10, students interact with peers, teachers, individuals, groups and community members in a range of face-to-face a

### Notes on the Results:

Both methods return good results, but the second cleaning method (removing only stop words and punctuation) clearly produces better results in general, particularly for query number 5 (the theory of evolution).


So I'm changing my mind now and will settle on the second cleaning method. 

Now re-compute the pair-wise similarity matrix and save it to disk:

In [44]:
sim_matrix_best = compute_similarity_matrix(df, 'text', preprocessing='stopwords')

In [45]:
import os

# create the output folder first; to_csv does not create missing directories
os.makedirs('data', exist_ok=True)
sim_matrix_best.to_csv('data/sim_matrix_best.csv')

### Suggested Further Analysis

Because our corpus is very small, the best course of action was what we just did: take word vectors pre-trained on a large corpus and apply them directly to the small corpus.

But once our corpus is large enough, it's better to train our own word vectors on it. This usually yields vectors that represent the words in the corpus better, because the model can capture the relationships between words that are specific to this corpus instead of relying on a general corpus that may not share its characteristics.

To do this, we could use the gensim library to train the word vectors, save the model, and import it into spaCy to use as before. Instead of spaCy's pre-trained vectors, we'd then have our own.