# Modeling Text Reuse



## Review Part 1

For Part 1 of this practicum, you used OpenRefine to clean two collections of data: a collection of Shakespeare's Sonnets from Project Gutenberg and a small selection of texts from Chronicling America.

![image](../_images/OpenRefine.png)


### Questions?

## Part 2: Modeling Text Reuse: Anthologizing & reprinting Shakespeare's Sonnets



What if we wanted to know what parts of a work (like parts of Shakespeare's sonnets) have been reprinted in a widely popular nineteenth-century anthology of verse?

To explore how we might answer this question, we're going to compare the cleaned text of Shakespeare's Sonnets (from [Project Gutenberg](https://www.gutenberg.org/cache/epub/1105/pg1105.html)) with the text of Francis Palgrave's *The Golden Treasury: Of the best Songs and Lyrical Pieces In the English Language* (1861) (also from [Project Gutenberg](https://gutenberg.org/ebooks/19221)).

The text files are available at:

> `../_datasets/texts/literature/shakespeare-sonnets.txt`  
> `../_datasets/texts/literature/palgrave-the-golden-treasury.txt`


We're going to be using a technique called **"text-reuse detection."** 

It's a method with an *interesting* history (it's often used in industry as a plagiarism detector).

The particular text matching algorithm that we're going to use is from the [Middlemarch Critical Histories](https://github.com/lit-mod-viz/middlemarch-critical-histories) project's ["text-matcher"](https://github.com/JonathanReeve/text-matcher/) package. This is a Python package, designed by Jonathan Reeve, that allows you to find quotations from one text in another text or in a directory of text files.
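To make the idea of text-reuse detection a bit more concrete, here is a minimal sketch of one common approach: break each text into overlapping runs of words (n-grams) and look for the runs that both texts share. This is only an illustration and not the code inside `text-matcher`; the function name `word_ngrams` and the two sample strings are our own.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in a string."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Two made-up snippets: a "source" and a text that quotes part of it.
source = "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."
anthology = "The poet asks, shall I compare thee to a summer's day, and then answers."

# Any trigram present in both sets is a candidate instance of reuse.
shared = word_ngrams(source) & word_ngrams(anthology)
for gram in sorted(shared):
    print(" ".join(gram))
```

A real matcher also has to join neighboring shared n-grams into longer passages and tolerate small differences in spelling, which this sketch leaves out.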

### Install `text-matcher`

In [32]:
!pip3 install --user text-matcher


In [33]:
from text_matcher import matcher

### Import the Natural Language Toolkit (`nltk`) & stopwords list
We're going to be using a library called the Natural Language Toolkit (`nltk`), which contains a handy list of pre-curated stopwords.

In [34]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sceckert/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Let's take a look at the stopwords:

In [50]:
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
print(stops)

{"hadn't", 'about', 'in', 'mustn', 'yourself', "it's", 'its', 'o', 'itself', 'me', 'won', 'not', 'can', 'her', 'them', 'once', 've', 'she', 'but', 'haven', 'until', 'through', 'herself', 'doing', 'the', 'whom', 'all', 'nor', "isn't", 'themselves', "shan't", 'shouldn', 'no', 'himself', "mightn't", 'from', 'hadn', 'when', "you'll", 'into', 'what', 'm', 'such', 'those', 'any', 'few', 'than', 'aren', 'because', 'doesn', 'theirs', 'my', 'their', 're', 'i', "mustn't", 'again', "wasn't", "didn't", 'between', 'an', "shouldn't", 'by', 'under', 'needn', 'on', 'be', 'below', 'and', 'at', 'hers', 'most', 'if', 'don', 'other', 'hasn', 'there', 'for', 'both', 'who', 't', "aren't", 'll', 'd', 'same', 'now', "she's", 'we', 'where', 'yours', 'isn', 'or', 'you', 'here', "hasn't", 'being', 'been', 'during', 'didn', 'ma', 'more', "doesn't", 'that', 'only', 'does', 'am', 'these', 'they', "weren't", "you'd", 'after', 'against', 'before', 'above', 'up', "that'll", 'off', 'y', "wouldn't", 'how', 'yourselves',
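Why would a text matcher care about stopwords? Very common words show up in almost every text, so matches built only from them tell us little about reuse. The sketch below filters them out of a single line of verse. To keep it runnable without downloading anything, it uses `sample_stops`, a small hand-picked subset of the words printed above (our own stand-in, not the full NLTK list).

```python
# A hand-picked subset of the stopwords printed above (an assumption for
# illustration; in the notebook you would use the full `stops` set instead).
sample_stops = {"the", "and", "in", "my", "i", "that", "for", "from", "but", "we"}

line = "From fairest creatures we desire increase"

# Lowercase the line, split on whitespace, and drop anything in the stop set.
content_words = [word for word in line.lower().split() if word not in sample_stops]
print(content_words)
```

What remains are the words most likely to be distinctive to this particular line.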

### Defining filepaths for the text matcher

In the directory we're currently working in, I have added a few files that we can try running.
We've imported our text matcher, `matcher`, and now we're going to define two text files for the matcher to process by opening and reading text files in our directory.

In [54]:
text_a = matcher.Text(open('../_datasets/texts/literature/shakespeare-sonnets.txt').read(), 'Shakespeare Sonnets')
text_b = matcher.Text(open('../_datasets/texts/literature/palgrave-the-golden-treasury.txt').read(), 'Palgrave Golden Treasury')

In [56]:
matcher.Matcher(text_a, text_b).match()

9 total matches found.
Extending match forwards with words: think think
Extending match forwards with words: brow brow
Extending match backwards with words: thes thos
Extending match forwards with words: loss loss
Extending match forwards with words: mor mor
Extending match forwards with words: il il
Extending match forwards with words: fee fee
Extending match forwards with words: let let
Extending match forwards with words: dying dying
Extending match forwards with words: rar rar
Extending match forwards with words: pin pin
Extending match forwards with words: aggrav aggrav
Extending match forwards with words: thy thy
Extending match forwards with words: stor stor
Extending match forwards with words: buy buy
Extending match forwards with words: term term
Extending match forwards with words: divin divin
Extending match forwards with words: sel sel
Extending match forwards with words: hour hour
Extending match forwards with words: dross dross
Extending match forwards with words: within 

(6,
 [(36794, 37420),
  (38708, 39171),
  (39200, 39353),
  (45792, 46411),
  (94507, 95050),
  (94842, 95072)],
 [(18996, 19657),
  (34611, 35105),
  (35134, 35293),
  (48874, 49529),
  (61184, 61776),
  (61554, 61798)])
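The result above appears to be a tuple of three things: the number of matches, a list of (start, end) positions in the first text, and a list of (start, end) positions in the second text. Here is a small sketch for reading it, with the values copied from the output above. It assumes that the first span in one list corresponds to the first span in the other, and so on; the variable names are our own.

```python
# Values copied from the match output above.
num_matches = 6
locations_a = [(36794, 37420), (38708, 39171), (39200, 39353),
               (45792, 46411), (94507, 95050), (94842, 95072)]
locations_b = [(18996, 19657), (34611, 35105), (35134, 35293),
               (48874, 49529), (61184, 61776), (61554, 61798)]

# Pair up the spans (assuming they are listed in corresponding order)
# and report how many characters each one covers.
pairs = list(zip(locations_a, locations_b))
for i, ((a_start, a_end), (b_start, b_end)) in enumerate(pairs, start=1):
    print(f"Match {i}: sonnets[{a_start}:{a_end}] ({a_end - a_start} chars) "
          f"<-> Palgrave[{b_start}:{b_end}] ({b_end - b_start} chars)")
```

Notice that the last two spans in each list overlap, which may be worth a closer look when you read the matched passages.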

How else can we use the text matcher? We could compare a text against a directory of text files, rather than a single file.

---

## Alternate method for running `text-matcher`:

We can also run the text matcher as we would on the command line, which produces the same output and writes the match locations to a log file called `log.txt`, like so:

In [62]:
# To run the text matcher on the command line, run:
!text-matcher ../_datasets/texts/literature/shakespeare-sonnets.txt ../_datasets/texts/literature/palgrave-the-golden-treasury.txt

9 total matches found.
Extending match forwards with words: think think
Extending match forwards with words: brow brow
Extending match backwards with words: thes thos
Extending match forwards with words: loss loss
Extending match forwards with words: mor mor
Extending match forwards with words: il il
Extending match forwards with words: fee fee
Extending match forwards with words: let let
Extending match forwards with words: dying dying
Extending match forwards with words: rar rar
Extending match forwards with words: pin pin
Extending match forwards with words: aggrav aggrav
Extending match forwards with words: thy thy
Extending match forwards with words: stor stor
Extending match forwards with words: buy buy
Extending match forwards with words: term term
Extending match forwards with words: divin divin
Extending match forwards with words: sel sel
Extending match forwards with words: hour hour
Extending match forwards with words: dross dross
Extending match forward

Let's read in the log file that we created:

In [58]:
import pandas as pd

In [63]:
shakespeare_quotations_df = pd.read_csv('log.txt')

In [64]:
shakespeare_quotations_df

| Unnamed: 0 | Text A | Text B | Threshold | Cutoff | N-Grams | Num Matches | Text A Length | Text B Length | Locations in A | Locations in B |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ../_datasets/texts/literature/shakespeare-sonn... | ../_datasets/texts/literature/palgrave-the-gol... | 3 | 5 | 3 | 6 | 118679 | 485394 | [(36794, 37420), (38708, 39171), (39200, 39353... | [(18996, 19657), (34611, 35105), (35134, 35293... |


The numbers in "Locations in A" and "Locations in B" are character index numbers marking the start and end of each match. We can look at one of these location pairs.
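One thing to watch for: when a log like this is read from a CSV file, the location columns most likely arrive as text rather than as Python lists. A minimal sketch of turning such a cell back into a list of tuples with the standard library's `ast.literal_eval` follows. The string `raw_cell` is a stand-in that reuses the first three spans from the output above.

```python
import ast

# A cell from a "Locations in A" column, as it would look when read in as text.
raw_cell = "[(36794, 37420), (38708, 39171), (39200, 39353)]"

# literal_eval parses Python literals (lists, tuples, numbers) without running code.
locations = ast.literal_eval(raw_cell)

first_start, first_end = locations[0]
print(first_start, first_end)
```

With a DataFrame, the same function could be applied to the whole column, for example with `.apply(ast.literal_eval)`.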

Let's open up the Shakespeare sonnets text file and look at the first pair of locations, (36794, 37420):

In [None]:
# Read Shakespeare's sonnets into a string called shakespeare_sonnets_text
with open('../_datasets/texts/literature/shakespeare-sonnets.txt') as file_a: 
    shakespeare_sonnets_text = file_a.read()

Let's use the first match to print just the text of that match.

In [47]:
shakespeare_sonnets_text[36794:37420]

'your slave what should I do but tend,\n  Upon the hours, and times of your desire?\n  I have no precious time at all to spend;\n  Nor services to do, till you require.\n  Nor dare I chide the world-without-end hour,\n  Whilst I, my sovereign, watch the clock for you,\n  Nor think the bitterness of absence sour,\n  When you have bid your servant once adieu;\n  Nor dare I question with my jealous thought\n  Where you may be, or your affairs suppose,\n  But, like a sad slave, stay and think of nought\n  Save, where you are, how happy you make those.\n    So true a fool is love, that in your will,\n    Though you do anything, he thinks'
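The match above starts partway through the first line of the sonnet ("Being your slave..."), so it can help to see a little of the text on either side. Here is a small sketch of a helper that does this. The name `show_with_context` and the double-bracket markers are our own choices, and the sample string is a short stand-in for the full sonnets file.

```python
def show_with_context(text, start, end, context=40):
    """Return the matched span with up to `context` characters on each side."""
    before = text[max(0, start - context):start]
    match = text[start:end]
    after = text[end:end + context]
    # Double brackets mark where the match begins and ends.
    return f"{before}[[{match}]]{after}"

# Stand-in text; in the notebook you would pass shakespeare_sonnets_text
# and a real pair such as (36794, 37420).
sample = "Being your slave what should I do but tend, Upon the hours, and times of your desire?"
start = sample.index("your slave")
end = sample.index(" Upon")

result = show_with_context(sample, start, end, context=10)
print(result)
```

The same helper would work on the Palgrave text with a pair from "Locations in B", which makes it easier to compare the two versions of a passage.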

## Running the text-matcher to compare one text with a directory of texts
To compare textA.txt with every text file in sampletextdir/, run `!text-matcher textA.txt sampletextdir/`
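If you would rather stay in Python, a rough equivalent is to loop over the files yourself, reusing the `matcher.Text` and `matcher.Matcher` calls from earlier in this notebook. This is a sketch under assumptions: the helper names `list_text_files` and `match_against_directory` are ours, and we only pick up files ending in `.txt`, which may differ from what the command-line tool does.

```python
import tempfile
from pathlib import Path

def list_text_files(directory):
    """Return the .txt files in a directory, sorted by name."""
    return sorted(Path(directory).glob("*.txt"))

def match_against_directory(path_a, directory):
    """Run the matcher on one text against each .txt file in a directory."""
    # Imported here so the listing helper above works without the package.
    from text_matcher import matcher
    results = {}
    for path_b in list_text_files(directory):
        # Build both Text objects fresh for every comparison.
        text_a = matcher.Text(Path(path_a).read_text(), Path(path_a).stem)
        text_b = matcher.Text(path_b.read_text(), path_b.stem)
        results[path_b.name] = matcher.Matcher(text_a, text_b).match()
    return results

# Quick check of the listing helper using a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.txt", "a.txt", "notes.md"]:
        (Path(tmp) / name).write_text("placeholder")
    found = [path.name for path in list_text_files(tmp)]
print(found)
```

Calling `match_against_directory('textA.txt', 'sampletextdir/')` would then return one result per file, keyed by file name.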

### Fine tuning the parameters of the text matcher

Take a look at the ['Usage'](https://github.com/JonathanReeve/text-matcher/tree/c04e54f3a4d36ab79e5f204809b2eb0d687d5b62#usage) section of the text-matcher. Note how we could change the parameters to search for longer or shorter n-grams when finding matches.
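To get a feel for why the n-gram size matters, here is a small experiment that does not use `text-matcher` at all. It counts how many word n-grams two short passages share as `n` grows. The function `shared_ngram_count` is our own, and the "reprint" line is an invented variant with two small changes.

```python
import re

def shared_ngram_count(text_a, text_b, n):
    """Count the distinct word n-grams that appear in both strings."""
    def grams(text):
        words = re.findall(r"[a-z']+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return len(grams(text_a) & grams(text_b))

original = "Let me not to the marriage of true minds admit impediments"
reprint = "Let me not to the union of true minds admit any impediments"

# Try several n-gram sizes and record how many n-grams the two lines share.
counts = {n: shared_ngram_count(original, reprint, n) for n in (2, 3, 4, 5)}
print(counts)
```

Shorter n-grams find more overlap but also more coincidental matches, while longer n-grams are stricter and break as soon as a single word differs. The run logged above used an n-gram size of 3.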