## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to provide descriptive notes on any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Rishabh Sharma
    - Email: rs3738@drexel.edu
- Group member 2
    - Name: Shai Wudkwych
    - Email: sw3468@drexel.edu

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment group 1: Textual feature extraction and numerical comparison

## Module B _(35 points)_ Key word in context

Key word in context (KWiC) is a common format for concordance lines, i.e., contextualized instances of principal words used in a book. More generally, KWiC is essentially the concept behind the utility of 'find in page' on document viewers and web browsers. This module builds up a KWiC utility for finding key word-containing sentences, and 'most relevant' paragraphs, quickly.

__B1.__ _(3 points)_ Start by writing a function called `load_book` that reads in a book based on a provided `book_id` string and returns a list of `paragraphs` from the book. When book data is loaded, you should remove the space characters at the beginning and end of the text (e.g., using `strip()`). Then, to split books into paragraphs, use the `re.split()` method to split the input wherever there are two or more new lines. Note that the books are in the provided `data/books` directory.

Note: this module is not focused on text pre-processing beyond a split into paragraphs; you do _not_ need to remove markup or non-substantive content.
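As a minimal sketch of the strip-then-split step described above, the snippet below works on a small in-memory string rather than a book file. The `sample` text is made up for illustration, and the pattern `\n{2,}` is one possible reading of "two or more new lines".

```python
import re

# Made-up sample text: leading blank lines, a four-newline gap, one trailing newline
sample = "\n\nFirst paragraph,\nstill first.\n\nSecond paragraph.\n\n\n\nThird paragraph.\n"

# Strip surrounding whitespace first, then split on runs of two or more newlines
paragraphs = re.split(r"\n{2,}", sample.strip())
print(paragraphs)

# Without strip(), the leading blank lines produce an empty first "paragraph"
unstripped = re.split(r"\n{2,}", sample)
print(unstripped)
```

Single newlines inside a paragraph are preserved, since only runs of two or more newlines act as separators.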


In [2]:
# B1:Function(3/3)

import re

def load_book(book_id):
    #---Your code start here---
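    # Books live in ./data/books/<book_id>.txt
    book = './data/books/' + book_id + '.txt'
    with open(book, 'r') as f:
        # Remove leading/trailing whitespace, as the prompt requires
        text = f.read().strip()
    # Split wherever two or more newlines occur (whitespace between them allowed)
    paragraphs = re.split(r'\n\s*\n', text)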
    #---Your code ends here---
    
    return paragraphs

To test your function, let's apply it to look at a few paragraphs from book 84.

In [3]:
# B1:SanityCheck
paragraphs = load_book('84')
print(len(paragraphs))
print(paragraphs[10])

725
I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight.  Do you understand this
feeling?  This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid.  I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight.  There, Margaret, the sun is forever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour.  There--for with your leave, my sister, I will put
some trust in preceding navigators--there snow and frost are banished;
and, sailing over a calm sea, we may be wafted to a land surpassing in
wonders and in beauty every region hitherto discovered on the habitable
globe.  Its productions and fe

__B2.__ _(10 points)_ Next, write a function called `kwic(paragraphs, search_terms)` that accepts a list of string `paragraphs` and a set of `search_terms` strings. The function should:

1. initialize `data` as a `defaultdict` of lists;
2. loop over the `paragraphs` and apply `spacy`'s processing to produce a `doc` for each;
3. loop over the `doc.sents` resulting from each `paragraph`;
4. loop over the words in each `sentence`;
5. check: `if` a `word` is `in` the `search_terms` set;
6. `if` (5), then `.append()` the reference under `data[word]` as a list: `[[i, j, k], sentence]`, where `i`, `j`, and `k` refer to the paragraph-in-book, sentence-in-paragraph, and word-in-sentence indices, respectively.

Your output, `data`, should then be a default dictionary of lists of the format:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```

Note that we have imported spacy and set it up to use the `en_core_web_sm` English model. This will require you to install spacy by running `pip install spacy` and to download the model by running the command `python -m spacy download en_core_web_sm`.
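To make the target data structure concrete, here is a minimal sketch of the six steps above that does not use spaCy. The helpers `toy_sentences` and `toy_kwic` are hypothetical stand-ins: the regex sentence splitter and tokenizer are rough approximations chosen only so the example runs on its own, and they will not match spaCy's output on real text.

```python
import re
from collections import defaultdict

def toy_sentences(paragraph):
    """Hypothetical stand-in for spaCy: yield each sentence as a list of token strings."""
    # Split after sentence-final punctuation, then separate words from punctuation
    for sent in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
        if sent:
            yield re.findall(r"\w+|[^\w\s]", sent)

def toy_kwic(paragraphs, search_terms=frozenset()):
    data = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):
        for j, sentence in enumerate(toy_sentences(paragraph)):
            for k, word in enumerate(sentence):
                if word in search_terms:
                    # [i, j, k] = paragraph, sentence, and word indices
                    data[word].append([[i, j, k], sentence])
    return data

toy_paragraphs = ["The sea was calm. We sailed north.", "No sea is safe."]
result = toy_kwic(toy_paragraphs, {"sea"})
print(result["sea"])
```

Each entry pairs the `[i, j, k]` location with the full token list of the sentence containing the match.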

In [4]:
# B2:Function(10/10)

from collections import defaultdict
import spacy
nlp = spacy.load("en_core_web_sm")

def kwic(paragraphs, search_terms = {}):
    #---Your code starts here---
    data = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):
        doc = nlp(paragraph)  # spaCy tokenizes and sentence-splits the paragraph
        for j, sentence in enumerate(doc.sents):
            for k, word in enumerate(sentence):
                if word.text in search_terms:
                    # Record [paragraph, sentence, word] indices plus the sentence's tokens
                    data[word.text].append([[i, j, k], [token.text for token in sentence]])
    #---Your code ends here---
    
    return(data)

Now, let's test your function using the paragraphs from your `load_book` function.

In [5]:
# B2:SanityCheck
kwic(paragraphs, {'Ocean', 'seas'})

defaultdict(list,
            {'Ocean': [[[11, 2, 26],
               ['I',
                '\n',
                'have',
                'read',
                'with',
                'ardour',
                'the',
                'accounts',
                'of',
                'the',
                'various',
                'voyages',
                'which',
                'have',
                '\n',
                'been',
                'made',
                'in',
                'the',
                'prospect',
                'of',
                'arriving',
                'at',
                'the',
                'North',
                'Pacific',
                'Ocean',
                '\n',
                'through',
                'the',
                'seas',
                'which',
                'surround',
                'the',
                'pole',
                '.',
                ' ']]],
             'seas': [[[11, 2, 30],
             

__B3.__ _(2 points)_ Let's test your `kwic` search function's utility using the pre-processed `paragraphs` from book `84` for the key words `Frankenstein` and `monster` in context. Answer the inline questions about these tests.

In [6]:
# B3:SanityCheck
results = kwic(paragraphs, {"Frankenstein", "monster"})

print("# of sentences 'Frankenstein' appears in: {}".format(len(results['Frankenstein'])))
print("# of sentences 'monster' appears in: {}".format(len(results['monster'])))
print()

print(" ".join(results['Frankenstein'][0][1]))
print()
print(" ".join(results['monster'][0][1]))

# of sentences 'Frankenstein' appears in: 27
# of sentences 'monster' appears in: 31

Frankenstein ,

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


In [7]:
# B3:Inline(1/2)

# Is the kwic function fast or slow? Print "Fast" or "Slow"
print("Slow")

Slow


In [8]:
# B3:Inline(1/2)

# How many sentences does the word Frankenstein appear in? Print the integer (0 is just a placeholder).
print(27)

27


__B4.__ _(10 pts)_ The cost of _indexing_ a given book turns out to be the limiting factor here for kwic. Presently, we have our pre-processing `load_book` function just splitting a document into paragraphs. Rewrite the `load_book` function to do some additional preprocessing. Specifically, this function should be modified to:

1. split a `book` into paragraphs and loop over them, but
2. process each paragraph with `spacy`;
3. store the `document` as a triple-nested list, so that each word _string_ is reachable via three indices: `word = document[i][j][k]`;
4. record an `index = defaultdict(list)` containing a list of `[i,j,k]` lists for each word; and
5. `return document, index`

Pre-computing the `index` will allow us to efficiently look up the locations of each word's instance in `document`, and the triple-list format of our document will allow us fast access to extract the sentence for KWiC. 
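The relationship between the triple-nested `document` and the `index` can be sketched on a tiny hand-tokenized example. The names `toy_document` and `toy_index` and their contents are invented for illustration and do not come from the book data.

```python
from collections import defaultdict

# toy_document[i][j][k] -> word k of sentence j of paragraph i
toy_document = [
    [["The", "sea", "was", "calm", "."], ["We", "sailed", "north", "."]],
    [["No", "sea", "is", "safe", "."]],
]

# One pass over the document records every location of every word
toy_index = defaultdict(list)
for i, paragraph in enumerate(toy_document):
    for j, sentence in enumerate(paragraph):
        for k, word in enumerate(sentence):
            toy_index[word].append([i, j, k])

print(toy_index["sea"])

# A stored location leads straight back to the word and its sentence
i, j, k = toy_index["sea"][1]
print(toy_document[i][j][k], toy_document[i][j])
```

After the one-time indexing pass, finding a word's sentences takes a dictionary lookup plus list indexing rather than a rescan of the whole text.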

In [12]:
# B4:Function(10/10)

def load_book(book_id):
    #---Your code starts here---
    
    book = './data/books/'+book_id+'.txt'
    with open(book, 'r') as f:
        text = f.read()
    ## splitting books into paragraphs
    paragraphs = re.split(r'\n\s*\n', text)
    ## processing each paragraph with spacy
    document = []
    index = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):
        doc = nlp(paragraph)
        document.append([])
        for j, sentence in enumerate(doc.sents):
            document[i].append([])
            for k, word in enumerate(sentence):
                document[i][j].append(word.text)
                index[word.text].append([i,j,k])

    #---Your code ends here---
                    
    return(document, index)

Now, let's test your new function on `book_id` = `'84'`. We'll use the returned document to access a particular sentence and print out the `[i,j,k]` locations of the word `'monster'` from `index`.

In [13]:
# B4:SanityCheck

# load the book
document, index = load_book("84")

In [14]:
# B4:SanityCheck

# Output the indices for monster
index['monster']

[[125, 9, 57],
 [137, 3, 6],
 [140, 3, 4],
 [143, 1, 4],
 [244, 3, 29],
 [262, 3, 18],
 [281, 0, 2],
 [322, 1, 35],
 [346, 8, 6],
 [381, 12, 5],
 [398, 1, 46],
 [438, 1, 10],
 [440, 0, 3],
 [478, 6, 8],
 [479, 6, 6],
 [511, 1, 8],
 [528, 0, 1],
 [539, 18, 22],
 [561, 3, 31],
 [586, 3, 43],
 [588, 10, 72],
 [607, 3, 2],
 [616, 2, 11],
 [634, 8, 9],
 [640, 1, 21],
 [645, 4, 17],
 [654, 6, 5],
 [664, 4, 2],
 [674, 0, 39],
 [674, 1, 2],
 [710, 12, 1]]

__B5.__ _(5 pts)_ Finally, make a new function called `fast_kwic` that takes a `document` and `index` from our new `load_book` function as well as a provided set of `search_terms` (just like our original `kwic` function). The function should loop through all specified `search_terms`, look up the indices in `index[word]` for the key word-containing sentences, and use them to extract these sentences from `document` into the same data structure as output by __B2__:
```
data['word'] = [[[i, j, k], ["These", "are", "sentences", "containing", "the", "word", "'word'", "."]],
                ...,]
```
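One detail that may matter when looking up terms: indexing a `defaultdict` with a missing key inserts that key with an empty list. The sketch below, on invented toy data, uses `index.get(word, [])` so that searching for an absent term leaves the index unchanged. The name `toy_fast_kwic` and the toy inputs are hypothetical, and this is only one way to handle the case.

```python
from collections import defaultdict

toy_document = [
    [["The", "sea", "was", "calm", "."], ["We", "sailed", "north", "."]],
    [["No", "sea", "is", "safe", "."]],
]
toy_index = defaultdict(list)
for i, paragraph in enumerate(toy_document):
    for j, sentence in enumerate(paragraph):
        for k, word in enumerate(sentence):
            toy_index[word].append([i, j, k])

def toy_fast_kwic(document, index, search_terms=frozenset()):
    data = defaultdict(list)
    for word in search_terms:
        # .get() avoids creating an empty index entry for unseen terms
        for i, j, k in index.get(word, []):
            data[word].append([[i, j, k], document[i][j]])
    return data

hits = toy_fast_kwic(toy_document, toy_index, {"sea", "whale"})
print(dict(hits))
print("whale" in toy_index)
```

With plain `index[word]` the results would be the same, but every unmatched search term would leave an empty entry behind in the index.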

In [15]:
# B5:Function(5/5)

def fast_kwic(document, index, search_terms = {}):
    
    #---Your code starts here---
    data = defaultdict(list)
    for word in search_terms:
        for i, j, k in index[word]:
            data[word].append([[i, j, k], document[i][j]])
    #---Your code ends here---
    
    return(data)

To test our new function, let's apply it to the same keywords as before: `Frankenstein` and `monster`. Note that the output from this sanity check should be the same as the one from **B3**.

In [16]:
# B5:SanityCheck

fast_results = fast_kwic(document, index, {'Frankenstein', 'monster'})

print("# of sentences 'Frankenstein' appears in: {}".format(len(fast_results['Frankenstein'])))
print("# of sentences 'monster' appears in: {}".format(len(fast_results['monster'])))
print()

print(" ".join(fast_results['Frankenstein'][7][1]))
print()
print(" ".join(fast_results['monster'][0][1]))

# of sentences 'Frankenstein' appears in: 27
# of sentences 'monster' appears in: 31

She nursed Madame 
 Frankenstein , my aunt , in her last illness , with the greatest affection 
 and care and afterwards attended her own mother during a tedious 
 illness , in a manner that excited the admiration of all who knew her , 
 after which she again lived in my uncle 's house , where she was beloved 
 by all the family .  

I started from my sleep with horror ; a cold dew covered my forehead , my 
 teeth chattered , and every limb became convulsed ; when , by the dim and 
 yellow light of the moon , as it forced its way through the window 
 shutters , I beheld the wretch -- the miserable monster whom I had 
 created .  


__B6.__ _(5 pts)_ Your goal here is to modify the pre-processing in `load_book` one more time! Make a small modification to the signature: `load_book(book_id, pos = True, lemma = True):`, so that it accepts two boolean arguments, `pos` and `lemma`, specifying how to identify each word as a key term. In particular, each word will now be represented in both `document` and `index` as a tuple: `heading = (text, tag)`, where `text` contains the `word.text` attribute from `spacy` if `lemma = False`, and the `word.lemma_` attribute if `lemma = True`. Similarly, `tag` should be left empty as `""` if `pos = False` and otherwise contain `word.pos_`.

Note that this function's output should still consist of a `document` and `index` in the same format, aside from the replacement of `word` with `heading`. This allows the output to be used in `fast_kwic` as before, with search terms now specified more precisely by their textual features.
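The four combinations of `pos` and `lemma` can be sketched without loading spaCy. Here `ToyToken` is a hypothetical stand-in exposing only the `text`, `lemma_`, and `pos_` attributes, and `make_heading` is an illustrative helper name rather than part of the assignment.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, with only the attributes used here
ToyToken = namedtuple("ToyToken", ["text", "lemma_", "pos_"])

def make_heading(word, pos=True, lemma=True):
    """Build the (text, tag) tuple that identifies a word as a key term."""
    text = word.lemma_ if lemma else word.text
    tag = word.pos_ if pos else ""
    return (text, tag)

token = ToyToken(text="seas", lemma_="sea", pos_="NOUN")

# Map each (pos, lemma) flag pair to the heading it produces
headings = {
    (pos, lemma): make_heading(token, pos=pos, lemma=lemma)
    for pos in (True, False)
    for lemma in (True, False)
}
for flags, heading in headings.items():
    print(flags, heading)
```

Because headings are tuples, they are hashable and can serve directly as dictionary keys in `index` and as members of a `search_terms` set.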

In [31]:
# B6:Function(5/5)

def load_book(book_id, pos = True, lemma = True):

    #---Your code starts here---
    book = './data/books/'+book_id+'.txt'
    with open(book, 'r') as f:
        text = f.read()
    paragraphs = re.split(r'\n\s*\n', text)
    document = []
    index = defaultdict(list)
    for i, paragraph in enumerate(paragraphs):
        doc = nlp(paragraph)
        document.append([])
        for j, sentence in enumerate(doc.sents):
            document[i].append([])
            for k, word in enumerate(sentence):
                # Use the lemma or the surface form, and the POS tag or an empty string
                text = word.lemma_ if lemma else word.text
                tag = word.pos_ if pos else ""
                heading = (text, tag)
                document[i][j].append(heading)
                index[heading].append([i,j,k])


    #---Your code ends here---
    
    return document, index

In [32]:
# B6:SanityCheck
document, index = load_book("84", pos = True, lemma = True)

In [37]:
# B6:SanityCheck
document, index = load_book("84", pos = True, lemma = False)

print(fast_kwic(document, index, search_terms = {('cold', 'NOUN')})[('cold', 'NOUN')][0][1])

[('The', 'DET'), ('\n', 'SPACE'), ('cold', 'NOUN'), ('is', 'AUX'), ('not', 'PART'), ('excessive', 'ADJ'), (',', 'PUNCT'), ('if', 'SCONJ'), ('you', 'PRON'), ('are', 'AUX'), ('wrapped', 'VERB'), ('in', 'ADP'), ('furs', 'NOUN'), ('--', 'PUNCT'), ('a', 'DET'), ('dress', 'NOUN'), ('which', 'PRON'), ('I', 'PRON'), ('have', 'AUX'), ('\n', 'SPACE'), ('already', 'ADV'), ('adopted', 'VERB'), (',', 'PUNCT'), ('for', 'SCONJ'), ('there', 'PRON'), ('is', 'VERB'), ('a', 'DET'), ('great', 'ADJ'), ('difference', 'NOUN'), ('between', 'ADP'), ('walking', 'VERB'), ('the', 'DET'), ('\n', 'SPACE'), ('deck', 'NOUN'), ('and', 'CCONJ'), ('remaining', 'VERB'), ('seated', 'VERB'), ('motionless', 'NOUN'), ('for', 'ADP'), ('hours', 'NOUN'), (',', 'PUNCT'), ('when', 'SCONJ'), ('no', 'DET'), ('exercise', 'NOUN'), ('\n', 'SPACE'), ('prevents', 'VERB'), ('the', 'DET'), ('blood', 'NOUN'), ('from', 'ADP'), ('actually', 'ADV'), ('freezing', 'VERB'), ('in', 'ADP'), ('your', 'PRON'), ('veins', 'NOUN'), ('.', 'PUNCT'), (' '

In [38]:
# B6:SanityCheck
print("Sentence with ('cold', 'ADJ'):")

print(fast_kwic(document, index, search_terms = {('cold', 'ADJ')})[('cold', 'ADJ')][0][1])

Sentence with ('cold', 'ADJ'):
[('I', 'PRON'), ('am', 'AUX'), ('already', 'ADV'), ('far', 'ADV'), ('north', 'ADV'), ('of', 'ADP'), ('London', 'PROPN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('as', 'SCONJ'), ('I', 'PRON'), ('walk', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('streets', 'NOUN'), ('of', 'ADP'), ('\n', 'SPACE'), ('Petersburgh', 'PROPN'), (',', 'PUNCT'), ('I', 'PRON'), ('feel', 'VERB'), ('a', 'DET'), ('cold', 'ADJ'), ('northern', 'ADJ'), ('breeze', 'NOUN'), ('play', 'VERB'), ('upon', 'SCONJ'), ('my', 'PRON'), ('cheeks', 'NOUN'), (',', 'PUNCT'), ('which', 'PRON'), ('\n', 'SPACE'), ('braces', 'VERB'), ('my', 'PRON'), ('nerves', 'NOUN'), ('and', 'CCONJ'), ('fills', 'VERB'), ('me', 'PRON'), ('with', 'ADP'), ('delight', 'NOUN'), ('.', 'PUNCT'), (' ', 'SPACE')]
