# University of Virginia: Intertextuality Detection
This notebook sets all of the parameters you need to run intertextuality detection on a corpus of interest. The code itself is distributed across several different files, which you are welcome to look at. Anything that you should change as a matter of course is located in this file.

This suite of code loads in a corpus, identifies all intertextuality according to set parameters, and returns several summary visualizations.

### Dependencies
You will need to install the python Levenshtein library for this to function properly. You can do this using
pip install python-Levenshtein

### Caution:
This will run significantly slower on Windows computers because I have disabled multithreading.

## The Program:
First you will need to load in the scripts that will conduct the actual analysis. Just run this cell and do not change anything here. These are required for the analysis to run properly.

In [1]:
import prepare_corpus, index_corpus, detect_intertexuality, compile_and_filter_results, align_quotes, form_quote_system

### Configure the analysis
Here you can specify what constitutes a quote. You can change these values.

**seedlength** (Integer) specifies how many characters should be in the seed you will use to start the analysis. Ten seems to work well for English, four for Chinese, but your results may vary.

**threshold** (Float between 0 and 1) specifies the percent similarity necessary to consider something a quote. .95 will return very similar matches, .6 will return very noisy ones.

**matchlength** (Integer) sets the minimum length a string must be to be considered a match. 40 or longer works well for English, otherwise you will catch a lot of meaningless sentence fragments.

**max_comp** (Integer) sets the length of the sliding window of characters to run the algorithm on. The lower the number, the faster the algorithm will run, but please do not set this below 100 (there is no upper limit). Lower numbers will result in more fragmentary results, but the ends will have less noise.
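
To make the roles of **threshold** and **matchlength** concrete, here is a minimal sketch of a similarity test. It is an illustration only: `edit_distance`, `similarity`, and `is_quote` are hypothetical names, and the normalization used here (edit distance divided by the longer length) is an assumption. The scripts rely on the python-Levenshtein library, whose scoring may differ.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed score: 1.0 for identical strings, lower as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


def is_quote(a, b, threshold, matchlength):
    """Accept a pair only if it is long enough and similar enough."""
    long_enough = min(len(a), len(b)) >= matchlength
    return long_enough and similarity(a, b) >= threshold
```

With a check like this, raising `threshold` rejects looser paraphrases, while raising `matchlength` rejects short fragments regardless of how similar they are.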

In [2]:
# Set seedlength
seedlength = 10

# Matches must be above this percent similar. Use floats between 0 and 1
# .8 works well for prose Chinese documents. .9 works well for prose
# English
threshold = .95

# Set the minimum length of an acceptable match. The shorter the length
# the more noisy the results are.
matchlength = 20

# Set this to limit the similarity comparison to the last n characters.
# Set to None for no limit. Setting a limit significantly
# speeds the calculations up.
max_comp = 100

### Corpus Preparation
How do you want to handle the corpus? Do you want to remove words or delete whitespace? You can also create an index to speed up the analysis with large corpora at the expense of disk space.

**toremove** (String, filename) is a file with words to be removed. Put one word per line in this file.

**deletewhitespace** (Boolean) lets you delete whitespace from the analysis

**create_index** (Boolean) specifies if you want to create an Index.

In [3]:
# File that contains words or characters to be removed from the analysis
# One item per line in the file. This is probably not necessary for most English
# analyses.
toremove = "remove.txt"

# Remove whitespace? If set to False, this will replace one or more white spaces with
# a single space. Otherwise, all whitespace gets deleted.
deletewhitespace = False

####################
# INDEXING OPTIONS #
####################
# Index the corpus? True to do so, False to skip
# Indexing the corpus significantly speeds the analysis up when working with
# large corpora, but will use significant system resources.
create_index = False
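
The two whitespace behaviors described above, together with item removal, can be sketched as follows. This is an assumption about how such cleaning might look, not the code in `prepare_corpus`; the function names `normalize_whitespace` and `remove_items` are hypothetical.

```python
import re


def normalize_whitespace(text, deletewhitespace):
    """Delete all whitespace, or collapse each run to a single space."""
    if deletewhitespace:
        return re.sub(r"\s+", "", text)
    return re.sub(r"\s+", " ", text)


def remove_items(text, items):
    """Strip every listed word or character from the text.

    Longer items are removed first so that a short item does not
    break up a longer one before it can be matched.
    """
    for item in sorted(items, key=len, reverse=True):
        text = text.replace(item, "")
    return text
```

Deleting whitespace entirely tends to matter for languages written without spaces between words, whereas collapsing it keeps word boundaries intact for English.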

### Input Information
Here you can specify where your corpus is and how you want to handle it. 

**corpusfolder** (String, folder name) takes the name of the folder containing your corpus.

**textstoanalyze** (String, filename) This is optional. If you only want to study certain files, place one file name per line in this text file. Otherwise, leave as None

**corpuscomposition** (String, filename) is like the above file, but it limits the documents against which you compare the files in textstoanalyze. Set to None if you don't want to limit the analysis



In [4]:
corpusfolder = "corpus"


# By default, the script will compare every document in the corpus
# against every other document.
# Optionally, you can provide a file with a list of titles to analyze
# Set to None if you do not wish to use this. This will also default
# to None if the listed file does not exist.
# This file should just contain one filename per line separated with a
# carriage return.
textstoanalyze = None

# You can also limit the part of the corpus you want to compare against
# By default the provided texts to analyze will be compared against all docs
corpuscomposition = None

# Align quotes occurring between the following documents. Provide at
# least two. If None, all quotes will be aligned. If your corpus contains
# significant reuse, this may be slow.
alignment_docs = None
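
As a rough illustration of the list-file convention described above (one filename per line, falling back to None when the file is missing), a loader might look like this. `load_title_list` is a hypothetical helper, not a function from the analysis scripts.

```python
import os


def load_title_list(path):
    """Return the filenames listed one per line, or None if no list is usable."""
    if path is None or not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines and trim trailing newlines or stray spaces.
        return [line.strip() for line in handle if line.strip()]
```

Returning None rather than an empty list keeps "no restriction" distinct from "restrict to nothing".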



### Output Information
Configure how the information is output. 

**result_directory** (String, foldername) will contain the files generated by the intertextuality algorithm. These are later amalgamated into the compiled results file.

**filteredresultfile** (String, filename) is the name of the file with the compiled results.

**alignmentoutput** (String, filename) is the name of the file for the aligned results.

**DEBUG** (Boolean) If you set this to True, the model will start over every time you run it. Otherwise, it will track which files have already been studied. It is best to set this to False if you are working with a large corpus that takes a long time.

**edgefile** (String, filename) is the name of a file that contains network data for the intertextuality (which can be loaded directly into Gephi)

**scorelimit** (Integer) is the minimum number of characters that must be shared between two documents to appear in the network stored in edgefile

In [5]:
##################
# GENERAL OUTPUT #
##################
# IMPORTANT!!!!! If DEBUG is set to True, this folder will be deleted if it exists!!!!!
result_directory = 'results'

# Intertextuality Output
filteredresultfile = "corpus_results.txt"

# Alignment Output
alignmentoutput = "corpus_alignment.txt"


# Debug:
DEBUG = True


#*******************#
# OUTPUT FILE EDGES #
#*******************#
# Output
edgefile = 'edgetable.csv'

# Set a minimum threshold for similarity for recording an edge.
# 100 means one 100-character quote
# or alternatively ten 10-character quotes (or something like that)
scorelimit = 100


# contains the lengths of all the texts in the corpus. Used for viz.
corpus_text_lengths = "corpus_text_lengths.txt"
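
The effect of **scorelimit** can be sketched as follows: shared characters are summed per document pair, and only pairs at or above the limit become edges. This is an assumption about the approach, not the code in `form_quote_system`. The function names and the input tuples are hypothetical, and the `Source`, `Target`, `Weight` header is the column layout Gephi recognizes for edge tables, which may not match the actual edgefile.

```python
import csv
from collections import defaultdict


def build_edges(quotes, scorelimit):
    """quotes: iterable of (doc_a, doc_b, shared_length) tuples."""
    totals = defaultdict(int)
    for doc_a, doc_b, length in quotes:
        # Treat the pair as undirected so A-B and B-A accumulate together.
        pair = tuple(sorted((doc_a, doc_b)))
        totals[pair] += length
    return [(a, b, score)
            for (a, b), score in sorted(totals.items())
            if score >= scorelimit]


def write_edges(edges, edgefile):
    """Write the edges as a CSV edge table."""
    with open(edgefile, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        writer.writerows(edges)
```

Under this scheme, one 100-character quote and ten 10-character quotes between the same two documents both produce an edge of weight 100.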


### Filtering Results
Once you have calculated the quotes, you have the option of filtering quotes that occur very frequently. This may not be necessary for you, but it helps if you are catching a lot of stock phrases in your search.

**filtercommon** (Boolean) Set to True to remove very common quotes

**shortquotelength** (Integer) How long are these quotes?

**repmax** (Integer) How many times is considered frequent?

**filtersimilar** (Boolean) This will filter out phrases that are highly similar to the common phrases that are also being filtered. This can slow down the operation significantly.

**similaritythreshold** (Float between 1 and 0) Percent of similarity to remove similar quotes



In [6]:
#**********************#
# FILTERING PARAMETERS #
#**********************#

# Filter the common, short quotes?
filtercommon = True
# What length constitutes "short"?
shortquotelength = 40
# How many repetitions constitute common?
repmax = 100

# Should quotes similar to the common ones be filtered?
# This will add significant slowdown depending on how many
# quotes are included
filtersimilar = False
# What is the similarity threshold?
similaritythreshold = .8
# Limit check? If this is true, similarity will only be checked
# for quotes that start with the same characters. This speeds the
# code up significantly
limitcheck = True
limextent = 2
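
A minimal sketch of the common-quote filter, assuming a quote is dropped when it is shorter than `shortquotelength` and occurs at least `repmax` times. Whether the real comparisons are inclusive is an assumption, and both function names are hypothetical. The `same_prefix` helper illustrates what `limitcheck` and `limextent` describe: comparing only quotes that begin with the same characters.

```python
from collections import Counter


def filter_common(quotes, shortquotelength, repmax):
    """Drop quotes that are both short and frequently repeated."""
    counts = Counter(quotes)
    return [quote for quote in quotes
            if not (len(quote) < shortquotelength and counts[quote] >= repmax)]


def same_prefix(a, b, limextent):
    """True when two quotes share their first limextent characters."""
    return a[:limextent] == b[:limextent]
```

Restricting similarity checks to quotes that pass `same_prefix` trades some recall for speed, since quotes that differ in their opening characters are never compared.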


### Options that don't need to be changed
The following set of options can be changed but really need not be changed. These are mostly internal files that are intermediate in the analysis or options that can be complicated to deal with.

#### Alignment Parameters
You can set the parameters for the alignment algorithm, which takes the results of the intertextuality algorithm and finds an optimal alignment between matching quotes. You probably don't need to adjust any of these

#### Maximum child tasks
This is related to the multiprocessing module and ensures a balance between memory usage and speed. You can ignore this.

#### Front loading
You can optionally have the longest texts processed first. This will speed up the algorithm a bit but can make early stages seem slow.

In [7]:
# Match, mismatch, and gap scores
matchscore = 1
misalignscore = -1
mismatchscore = -1

# Limit the length of text that will be aligned
# This significantly speeds up the algorithm when
# aligning very long quotes. This divides the quotes
# into blocks of chunklim length. It tries to divide
# the chunks in places where the alignment is exact
# So overlap looks at the 10 characters before and after
# the proposed break. When it finds rangematch exact
# characters, it inserts a break in the middle.
chunklim = 200
overlap = 30
rangematch = 12

# This following setting is necessary because of the multiprocessing module
# The higher the maxtasks, the faster the processing is but the more memory
# use fluctuates. If the index is around 2.5 GB, use 50; if it is under 1 GB, use 150.
# Set to None if you don't want to have processes expire, but watch out for
# large memory use spikes. The multiprocessing occurs at the document level,
# so if you have fewer documents, you can also use fewer tasks
maxchildtasks = 150

# You can sort so the longest texts will be processed first. This will speed
# up overall processing time at the cost of RAM usage.
frontloading = False
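
To show how the three scores interact, here is a sketch of a standard global (Needleman-Wunsch style) alignment. It assumes `misalignscore` plays the role of the gap score, following the comment above. The function `align` is hypothetical, and the chunking controlled by `chunklim`, `overlap`, and `rangematch` is not modeled. The alignment in `align_quotes` may work differently.

```python
def align(a, b, matchscore=1, mismatchscore=-1, misalignscore=-1):
    """Globally align two strings, using '-' to mark gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * misalignscore
    for j in range(1, cols):
        score[0][j] = j * misalignscore
    for i in range(1, rows):
        for j in range(1, cols):
            pair = matchscore if a[i - 1] == b[j - 1] else mismatchscore
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + misalignscore,
                              score[i][j - 1] + misalignscore)

    # Walk back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = matchscore if a[i - 1] == b[j - 1] else mismatchscore
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + misalignscore:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]
```

The score table grows with the product of the two lengths, which is one reason splitting very long quotes into blocks, as `chunklim` does, can speed things up considerably.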

### Run Analysis
The section below should not be changed. It sends the necessary information to the various packages, which then run. Run and enjoy!

In [8]:
# set index and pickle file
indexfile = 'index.db'
picklefile = "corpus.pickle"

prepare_corpus.run(picklefile, toremove, corpusfolder, deletewhitespace)

if create_index:
    index_corpus.run(seedlength, picklefile, indexfile)

detect_intertexuality.run(seedlength, threshold, matchlength, max_comp, textstoanalyze, corpuscomposition, picklefile, indexfile, create_index, result_directory, maxchildtasks, frontloading, DEBUG)

compile_and_filter_results.run(filtercommon, shortquotelength, repmax, filtersimilar, similaritythreshold, limitcheck, limextent, result_directory, filteredresultfile)

form_quote_system.run(scorelimit, filteredresultfile, edgefile)

align_quotes.run(alignment_docs, matchscore, misalignscore, mismatchscore, chunklim, overlap, rangematch, filteredresultfile, alignmentoutput)

56 documents of 56 completed
1349894 from 56 documents.
Analyzing text 1 vs 55 (length: 13584)
Operation completed in 1.03 seconds (averaging 1.03, in total 1.03)
Analyzing text 2 vs 54 (length: 14048)
Operation completed in 0.87 seconds (averaging 0.95, in total 1.90)
Analyzing text 3 vs 53 (length: 22974)
Operation completed in 1.01 seconds (averaging 0.97, in total 2.92)
Analyzing text 4 vs 52 (length: 44424)
Operation completed in 1.42 seconds (averaging 1.08, in total 4.34)
Analyzing text 5 vs 51 (length: 18295)
Operation completed in 0.90 seconds (averaging 1.05, in total 5.24)
Analyzing text 6 vs 50 (length: 23331)
Operation completed in 0.95 seconds (averaging 1.03, in total 6.19)
Analyzing text 7 vs 49 (length: 21333)
Operation completed in 0.92 seconds (averaging 1.02, in total 7.11)
Analyzing text 8 vs 48 (length: 2190)
Operation completed in 0.58 seconds (averaging 0.96, in total 7.69)
Analyzing text 9 vs 47 (length: 9675)
Operation completed in 0.72 seconds (averaging 0.93