# ferguson

text-analysis files on ferguson grand jury documents

the documents were compiled by @MitchFraas

download the source docs

Mitch also loaded them into Voyant Tools

## caveat utilitor

I haven't interpreted the results yet, nor tried to visualize them; I'm just providing my R script and my initial output files for DH folks, data journalists, and others to explore for themselves. The R script does include a section for making word clouds for the various topics, where the size of a word corresponds roughly to the importance of that word in the topic. For more on MALLET and on visualizing the results, see the Journal of Digital Humanities and the work of Ted Underwood, Matt Jockers, Andrew Goldstone, and others.
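
If you want to try the word-cloud idea without digging through the script, here is a minimal sketch (not the script in this repo). It assumes you have already pulled out, for one topic, a table of words and their weights, e.g. from MALLET's word-topic counts output; the file and column names below are hypothetical.

```r
# Minimal word-cloud sketch; NOT the repo's R script.
# Assumes a hypothetical CSV with columns 'word' and 'weight' for a single topic.
library(wordcloud)

topic1 <- read.csv("topic1-word-weights.csv", stringsAsFactors = FALSE)

wordcloud(words        = topic1$word,
          freq         = topic1$weight,
          min.freq     = 1,             # don't drop low-weight words outright
          max.words    = 50,            # keep only the 50 heaviest words
          scale        = c(4, 0.5),     # size range between biggest and smallest word
          random.order = FALSE)         # place the most important words in the centre
```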

As you look at the topic labels file, you'll see 'october', 'november', 'september', and similar terms (e.g. roman numerals and words that appear in the header/footer of every page) that should really be added to the stoplist file, after which the analysis should be re-run. The stoplist currently being used is the default MALLET stoplist.
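
One way to extend the stoplist in R before re-running is sketched below; the paths and extra terms are illustrative, not what was actually used here.

```r
# Sketch: append project-specific stopwords to a copy of MALLET's default stoplist.
# "stoplists/en.txt" is the default English stoplist shipped with MALLET;
# the output path and the extra terms are examples only.
stoplist <- readLines("stoplists/en.txt")
extra    <- c("october", "november", "september",      # month names from the transcripts
              tolower(as.character(as.roman(1:50))))   # roman numerals i, ii, iii, ...
writeLines(unique(c(stoplist, extra)), "stoplists/en-ferguson.txt")
# then point MALLET's import step at the new file (e.g. via its stoplist-file option)
# and re-run the topic model
```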

Further analysis might look at which documents or which topics correlate with one another (the easiest way is Excel's CORREL function, or cor() in R). One could also look at correlations between words within the documents. Is Brown always described the same way by witnesses? Do certain topics or discourses associate with Brown more than with Wilson (and vice versa)? One could also do sentiment analysis, to see how Brown and Wilson are portrayed by the witnesses, the prosecutor, and so on.
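
If you'd rather do the correlations in R than in Excel, a rough sketch follows. It assumes the doc-topics output is a table with an id column or two followed by one topic-proportion column per topic; the file name and exact column layout depend on how MALLET was run, so check yours first.

```r
# Sketch: topic-topic and document-document correlations from a doc-topics table.
# The layout of MALLET's doc-topics output varies by version; the file name is hypothetical.
doc_topics <- read.table("output-files-to-be-analyzed/doc-topics.txt",
                         header = FALSE, stringsAsFactors = FALSE)

props     <- doc_topics[, -(1:2)]   # drop the doc index and filename columns
topic_cor <- cor(props)             # which topics rise and fall together across documents
doc_cor   <- cor(t(props))          # which documents have similar topic mixtures
round(topic_cor, 2)
```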

- Update 8.30 Nov 26: I compiled all of the text into a single file, then broke it apart into 1000-line chunks. I tried to clean up the text a bit to remove some of the extraneous material (title pages, etc.), but it was very rough-and-ready, so some artifacts ('goreperry.com' and the like) will have crept in. At any rate, the output files have now been updated, and I've also uploaded the source files as well as my trimmed files, so other folks can run their own analyses.
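
For anyone who wants to redo the chunking differently, the step is roughly equivalent to the sketch below (not the exact commands I ran; the combined-file name is a stand-in).

```r
# Sketch: split one big combined text file into 1000-line chunk files.
# "ferguson-all.txt" is a hypothetical name for the combined file.
all_lines <- readLines("ferguson-all.txt")
chunks    <- split(all_lines, ceiling(seq_along(all_lines) / 1000))

for (i in seq_along(chunks)) {
  writeLines(chunks[[i]],
             sprintf("jury-docs-split-into-1000-line-chunks/chunk-%03d.txt", i))
}
```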