# Setup

First, import the libraries and modules needed to run the code in this notebook.

In [None]:
# Import necessary dependencies
import os
from random import randrange

import pandas as pd
import spacy
from nltk import Text, word_tokenize

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

The sections below show a few ways to process documents and build concordances from them using the NLTK and spaCy Python libraries. With the spaCy methods, you can save your concordances to CSV files for later analysis.

<div class='alert alert-warning' role='alert'>
<strong>Working with document files</strong>
   <ul>
       <li>By default, this notebook uses an example corpus comprising the five volumes of The Works of Edgar Allan Poe, downloaded from <a href='https://gutenberg.org/ebooks/search/?query=%22the+works+of+edgar+allan+poe%22&submit_search=Go%21' target='_blank'>Project Gutenberg</a>. You can use your own corpus files by adding them to a new folder/directory in the same location as this notebook, like the "example-corpus" directory. In that case, make sure to update the <code>file</code> parameter of the <a href='https://docs.python.org/3/library/functions.html#open' target='_blank'><code>open()</code> function</a> to reflect the directory and files you've added. In the <a href='#Keyword-in-Context-(KWIC)-Concordance-with-NLTK'>Keyword in Context (KWIC) Concordance with NLTK</a> and <a href='#With-a-single-document'>Keyword in Sentences Concordance with spaCy: With a single document</a> sections, you will update this in the first code cell. In the <a href='#With-an-entire-corpus'>Keyword in Sentences Concordance with spaCy: With an entire corpus</a> section, you will need to update the value of the <code>corpus_dir</code> variable.
       </li>
       <li>Make sure that the document you're using is a plain text file (*.txt) encoded in UTF-8 (an encoding system for Unicode). On Windows, you can check the encoding of a plain text file by opening it in Notepad and looking in the bottom right corner of the window, where the encoding is shown, e.g., "UTF-8 with BOM".
       </li>
       <li>If your file is not encoded in UTF-8, you can update the <code>encoding</code> parameter of the <code>open()</code> function in the cells with code for opening documents (the first code cell of the first two sections) by replacing the text inside the quotation marks, as in <code>encoding='utf8'</code>. If you don't know what should replace <code>utf8</code>, check Python's <a href='https://docs.python.org/3/library/codecs.html#standard-encodings' target='_blank'>list of standard encodings</a>, find the correct encoding under the "Aliases" column, and use the corresponding "Codec".  
       </li>
    </ul>
</div>
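
As a rough illustration of the encoding advice above, the sketch below tries UTF-8 first and falls back to a second codec if decoding fails. The helper name `read_text` and the default fallback `cp1252` are assumptions chosen for illustration and are not part of this notebook. `utf-8-sig` is a standard Python codec that reads UTF-8 and drops a leading byte order mark (BOM) if one is present.

```python
def read_text(path, fallback_encoding='cp1252'):
    """Read a text file as UTF-8, falling back to another codec if that fails.

    Hypothetical helper: the name and the default fallback are illustrative.
    """
    try:
        # 'utf-8-sig' decodes UTF-8 and strips a leading BOM if present
        with open(path, encoding='utf-8-sig') as file:
            return file.read()
    except UnicodeDecodeError:
        # Not valid UTF-8, so retry with the fallback codec
        with open(path, encoding=fallback_encoding) as file:
            return file.read()
```

A fallback only helps if it is the file's real encoding, so it is still worth checking the file as described above rather than relying on the default.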

# Keyword in Context (KWIC) Concordance with NLTK

This section uses a Python library called NLTK, which will enable us to create an interactive concordance to search for occurrences of keywords in context (KWIC).

**Resources**
- [NLTK :: Sample usage for concordance](https://www.nltk.org/howto/concordance.html)
- [NLTK :: nltk.text module :: concordance method](https://www.nltk.org/api/nltk.text.html#nltk.text.Text.concordance)
- [Python concordance command in NLTK](https://newbedev.com/python-concordance-command-in-nltk)

In [None]:
# Open the document
with open(os.path.join('example-corpus', 'The Works of Edgar Allan Poe — Volume 1 by Edgar Allan Poe.txt'), encoding='utf8') as file:
    document = file.read()

In [None]:
# Create an NLTK Text object
nltk_text = Text(word_tokenize(document))

In [None]:
# Set the keyword or phrase to be searched (a list of words is matched as a phrase)
keyword = ["old", "lady"]

# Print the concordance lines
nltk_text.concordance(keyword, width=79, lines=25)
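
To show roughly what a KWIC concordance does, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name `kwic`, the whitespace tokenization, and the `window` parameter (number of context tokens on each side) are assumptions made for illustration.

```python
def kwic(tokens, phrase, window=4):
    """Return (left, match, right) context strings for each occurrence of phrase.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    phrase = [word.lower() for word in phrase]
    if not phrase:
        return []
    lowered = [token.lower() for token in tokens]
    n = len(phrase)
    lines = []
    for i in range(len(tokens) - n + 1):
        # Compare a sliding window of tokens against the phrase
        if lowered[i:i + n] == phrase:
            left = ' '.join(tokens[max(0, i - window):i])
            match = ' '.join(tokens[i:i + n])
            right = ' '.join(tokens[i + n:i + n + window])
            lines.append((left, match, right))
    return lines


# Whitespace tokenization keeps the example dependency-free
sample = 'The old lady smiled and the other old lady frowned'
for left, match, right in kwic(sample.split(), ['old', 'lady'], window=2):
    print(f'{left:>20} [{match}] {right}')
```

Each result keeps the left context, the match, and the right context separate, which is what lets a concordance line up the keyword in a single column.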

# Keyword in Sentences Concordance with spaCy

This section uses a Python library called spaCy, which will allow us to iterate over the sentences in documents and create a list of them.

**Resources**
- [Doc · spaCy API Documentation · Doc.sents](https://spacy.io/api/doc#sents)
- [spaCy - Doc.sents Property](https://www.tutorialspoint.com/spacy/spacy_doc_sents.htm)

## With a single document

<div class="alert alert-block alert-info">You can skip the step for opening the document in the cell below if you already did it in the NLTK section above.</div>

In [None]:
# Open the document
with open(os.path.join('example-corpus', 'The Works of Edgar Allan Poe — Volume 1 by Edgar Allan Poe.txt'), encoding='utf8') as file:
    document = file.read()

In [None]:
# Create a spaCy Doc object
doc = nlp(document)

In [None]:
# Create a list of parsed sentences in the document
sentences = list(doc.sents)

In [None]:
# Check a random sentence
sentences[randrange(len(sentences))]

In [None]:
# Check the number of sentences
len(sentences)

In [None]:
# Set the keyword(s) to be searched
keywords = ["beautiful", "lady", "death"]

# Create a list to store keywords and the sentences in which they appear
keyword_in_sentences = []

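for sentence in sentences:
    # Flatten line breaks so each sentence fits on one CSV row
    sentence_text = str(sentence).replace('\n', ' ')
    for keyword in keywords:
        # Case-sensitive substring match
        if keyword in sentence_text:
            keyword_in_sentences.append({'Keyword': keyword, 'Sentence': sentence_text})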
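
The loop above relies on Python's `in` operator, which is a case-sensitive substring test, so `lady` also matches `malady` and `death` does not match `Death`. If whole-word, case-insensitive matching is closer to what you want, one option is a regular expression with word boundaries. The sketch below works on plain strings so that it runs without spaCy; the function name `find_keywords` is an assumption for illustration.

```python
import re


def find_keywords(sentence_text, keywords):
    """Return the keywords that occur as whole words in sentence_text.

    Hypothetical helper: uses word-boundary, case-insensitive matching.
    """
    found = []
    for keyword in keywords:
        # re.escape guards against keywords containing regex metacharacters
        pattern = r'\b' + re.escape(keyword) + r'\b'
        if re.search(pattern, sentence_text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


print(find_keywords('Death ended the malady.', ['lady', 'death']))  # prints ['death']
```

With a spaCy sentence, you would pass `str(sentence)` as `sentence_text`.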

In [None]:
# Check the number of keyword matches found
len(keyword_in_sentences)

In [None]:
# Convert the list of keywords and sentences into a DataFrame
headers = ['Keyword', 'Sentence']

keyword_in_sentences_df = pd.DataFrame(keyword_in_sentences, columns=headers)

In [None]:
# Check the first 10 rows of the DataFrame
keyword_in_sentences_df[:10]

In [None]:
# Write the DataFrame to a CSV file named after the document
document_name = os.path.splitext(os.path.basename(file.name))[0]

keyword_in_sentences_df.to_csv('keyword_in_sentences_%s.csv' % document_name, index=False, encoding='utf-8')

## With an entire corpus

In [None]:
# Set the directory containing corpus files
corpus_dir = 'example-corpus'

# Set the keyword(s) to be searched
keywords = ["beautiful", "lady", "death"]

# Create a list to store keywords and the sentences in which they appear
keyword_in_sentences = []

# Iterate through all files in the corpus directory
for filename in sorted(os.listdir(corpus_dir)):

    # Skip anything that is not a plain text file
    if not filename.endswith('.txt'):
        continue

    # Open the document
    document_name = os.path.splitext(filename)[0]
    with open(os.path.join(corpus_dir, filename), 'r', encoding='utf-8') as file:
        document = file.read()

    # Create a processed Doc object
    doc = nlp(document)

    # Create a list of parsed sentences in the document
    sentences = list(doc.sents)

    # Add keywords and the sentences in which they appear to the list
    for sentence in sentences:
        sentence_text = str(sentence).replace('\n', ' ')
        for keyword in keywords:
            if keyword in sentence_text:
                keyword_in_sentences.append({'Document': document_name, 'Keyword': keyword, 'Sentence': sentence_text})
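
The file handling in the loop above can also be factored into a small generator that yields only `.txt` files, in a stable order. This is a sketch using the standard-library `pathlib` module; the function name `iter_text_files` is an assumption, and the spaCy processing would still happen in the loop that consumes it.

```python
from pathlib import Path


def iter_text_files(corpus_dir, encoding='utf-8'):
    """Yield (document_name, text) for each .txt file in corpus_dir.

    Hypothetical helper; files are visited in sorted order.
    """
    for path in sorted(Path(corpus_dir).glob('*.txt')):
        # path.stem is the file name without its final extension
        yield path.stem, path.read_text(encoding=encoding)
```

For example, `for document_name, document in iter_text_files(corpus_dir):` could stand in for the `os.listdir()` loop and the `open()` call, leaving the rest of the loop body unchanged.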

In [None]:
# Convert the list of keywords and sentences into a DataFrame
headers = ['Document', 'Keyword', 'Sentence']
            
keyword_in_sentences_df = pd.DataFrame(keyword_in_sentences, columns=headers)

In [None]:
# Check the first 10 rows of the DataFrame
keyword_in_sentences_df[:10]

In [None]:
# Write the DataFrame to a CSV file named after the corpus directory
corpus_name = os.path.basename(os.path.normpath(corpus_dir))
keyword_in_sentences_df.to_csv('keyword_in_sentences_%s.csv' % corpus_name, index=False, encoding='utf-8')
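
For the later analysis mentioned in the setup section, a saved CSV can be read back and summarized. The sketch below counts rows per keyword with the standard-library `csv` module; the function name `count_keywords` is an assumption for illustration, and the same summary could be produced with pandas.

```python
import csv
from collections import Counter


def count_keywords(csv_path):
    """Count how many rows each keyword has in a concordance CSV.

    Hypothetical helper; assumes the file has a 'Keyword' column.
    """
    with open(csv_path, encoding='utf-8', newline='') as csv_file:
        return Counter(row['Keyword'] for row in csv.DictReader(csv_file))
```

Calling `count_keywords('keyword_in_sentences_example-corpus.csv').most_common()` would list the keywords from most to least frequent, assuming that file was written by the cell above.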