# Digital Humanities Python Exercise
Created July 9, 2018  
In this exercise, you will download TIFF images from the [Gulag Support Documents Database](http://lcweb2.loc.gov/frd/gulagquery.html) maintained by the Library of Congress. You will then run OCR (optical character recognition) on the files, which turns the images into full-text searchable documents. Finally, you will do some analysis of the OCR'ed files in Python.

### Working with Jupyter Notebook
If you haven't worked with Jupyter Notebook before, here are a few tips.
* Run code by pressing Shift + Enter. 
* You can run the cells out of order. The number beside In shows the order in which the cells were run.
* Cells that are currently running will have an * instead of a number.
* When you are typing the name of a variable you have already created or a function you want to call, you can press Tab to autocomplete or show the possibilities. This can be a big timesaver and prevent typos.

## Downloading the Files

In [70]:
#urllib.request allows you to open and read urls 
import urllib.request

# To download one of the images from the database you can use the code below
urllib.request.urlretrieve("http://lcweb2.loc.gov/frd/tfrussia/gulag000/000000f8.tif", 
                          "images/sample-download.tif")



('images/sample-download.tif', <http.client.HTTPMessage at 0x11bf78828>)
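
One thing to watch for: `urlretrieve` does not create missing folders, so the call above raises a `FileNotFoundError` if there is no `images` folder next to your notebook. Here is a minimal sketch for creating the folders first. The folder names come from the paths used in this exercise; the function name `ensure_folders` is just something I made up for the example.

```python
import os

# Folder names taken from the paths used in this exercise.
FOLDERS = ["images", "plain-text"]

def ensure_folders(folders, base="."):
    """Create each folder under base if it is missing; return the full paths."""
    paths = []
    for name in folders:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths
```

Running `ensure_folders(FOLDERS)` once before downloading is enough; running it again does no harm.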

In [71]:
# To see the documentation for urllib.request.urlretrieve run the code below. 
# You can also use google to get more info. 

help(urllib.request.urlretrieve)

Help on function urlretrieve in module urllib.request:

urlretrieve(url, filename=None, reporthook=None, data=None)
    Retrieve a URL into a temporary location on disk.
    
    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.
    
    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.
    
    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
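
The help text mentions a `reporthook` argument. As a small sketch of how it could be used, the code below builds a callable that prints a rough percentage while a file downloads. The names `percent_done` and `make_progress_hook` are made up for this example, and the percentage is only an estimate because it is based on whole blocks.

```python
def percent_done(block_num, block_size, total_size):
    """Estimate download progress as a whole percentage, or None if unknown."""
    if total_size <= 0:
        # urlretrieve passes -1 when the server does not report a file size.
        return None
    received = block_num * block_size
    return min(100, received * 100 // total_size)

def make_progress_hook(label):
    """Return a callable that urlretrieve can use as its reporthook."""
    def hook(block_num, block_size, total_size):
        percent = percent_done(block_num, block_size, total_size)
        if percent is None:
            print(label, "size unknown, still downloading")
        else:
            print(label, str(percent) + "%")
    return hook

# Usage (needs a network connection and an existing images folder):
# urllib.request.urlretrieve(
#     "http://lcweb2.loc.gov/frd/tfrussia/gulag000/000000f8.tif",
#     "images/sample-download.tif",
#     reporthook=make_progress_hook("000000f8"))
```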



### Practice Problem

In [121]:
#Here is a list of the last three characters of the file names in the urls for the images you want to download

images=["0f8","114","12f","106","110","100","112","10b","11e","12e","129","115","0fe","117","0fa",
       "10f", "126", "119","118", "0fd","11a","0ff","123","11d","125","11f","102","10e","12b",
       "10a","104","127","12c","124","12a","11c","103","101","0fc","12d","11b","10d","116",
        "0f9","122","128","10c","120","0fb","121","0f7","105", "107","108","109","111","113"]


In [None]:
# Finish filling in this for loop to download the rest of the images. 

for ??? in ??? : 
    print(???) #Print out something to help you know how far along your loop is
    urllib.request.urlretrieve(??? + item + '.tif', "images/???" + item +".tif")
    

### Solution
Below is my solution to this problem. Try not to scroll down until you have made a diligent effort to finish writing the loop yourself. Note: your answer may be slightly different from mine. There are normally multiple ways to do things in Python.

In [122]:
#Below is my example answer. 
#Yours may be different. There are normally multiple ways to do the same thing in python.

for item in images: 
    print(item)
    urllib.request.urlretrieve("http://lcweb2.loc.gov/frd/tfrussia/gulag000/00000"+ item + '.tif',
                               "images/00000" + item +".tif")

0f8
114
12f
106
110
100
112
10b
11e
12e
129
115
0fe
117
0fa
10f
126
119
118
0fd
11a
0ff
123
11d
125
11f
102
10e
12b
10a
104
127
12c
124
12a
11c
103
101
0fc
12d
11b
10d
116
0f9
122
128
10c
120
0fb
121
0f7
105
107
108
109
111
113


Double check to make sure the files downloaded before proceeding. You should have 50+ tifs in your images folder.
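
You can also do this check in Python instead of opening the folder by hand. This is a sketch, assuming the files were saved as `images/00000` + item + `.tif` like in my solution; the function name `missing_downloads` is made up for this example.

```python
import glob
import os

def missing_downloads(codes, folder="images"):
    """Return the codes from the list whose .tif file is not in the folder."""
    found = glob.glob(os.path.join(folder, "*.tif"))
    present = {os.path.basename(path) for path in found}
    return [code for code in codes if "00000" + code + ".tif" not in present]
```

Calling `missing_downloads(images)` should give you an empty list if every download worked. Anything left in the list is a file you need to download again.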

## OCR'ing the files
Now that you have a folder of tifs, we need to OCR them and create plain text files that we can analyze in Python.
We will start by importing the libraries we need. 

In [123]:
#Importing our libraries

import glob # This lets us find files whose names match a wildcard pattern like images/*
import pytesseract # This is a library for OCR'ing
from PIL import Image  # This is a library for working with images


Run the code below. **Don't panic! You will get a nasty looking error message.**

In [124]:
#OCRing images and creating txt files
for item in glob.glob('images/*'):
    print (item[6:-4])
    with open('plain-text/'+ item[6:-4]+'.txt', 'w') as f:
        f.write((pytesseract.image_to_string(Image.open(item))))
        f.close()  # optional: the with statement already closes the file
        
        

/000000f7


OSError: encoder error -2 when writing image file

### Question: 
Spend a few minutes trying to look up the error message and see what you can find. 

What are your techniques for trying to find a solution? (i.e. what terms are you using to google your problem?)

Did you find anything useful? 



**This is a very tricky problem to solve. Don't get discouraged if you can't understand all the suggestions you find in your search.**

### Answer (sort of): 
I would recommend using the error message to search for a solution. So, in this case, I would search for "OSError: encoder error -2 when writing image file." If you followed this approach, you should have found a few Stack Overflow posts and a GitHub issue along with a few other sources. In general, Stack Overflow and GitHub are good sources for trying to troubleshoot your code.

In this case, you likely learned that the problem seems to be rooted in the PIL library and its configuration. There are a few suggestions for how to fix the problem from several years ago. Try to find the most recent solution. Sometimes you will have to do quite a bit of sleuthing. For example, take a look at this GitHub thread from 2013 that seems to relate to the issue we are experiencing: https://github.com/python-pillow/Pillow/issues/396.

In general, we wouldn't want to rely on a thread from 5+ years ago, but the last comment in the thread is from June 2018 and links to a much newer issue. Click that link to go to this GitHub issue: https://github.com/madmaze/pytesseract/issues/127.

Reading through this thread, it seems like the issue is an old style of TIFF conversion that the PIL (or Pillow) library uses. One of the recommendations is to "Add passthrough option in order to skip any conversion - raw image mode." So to get this OCR to work, you could open each image on your computer and export it to a new file format. **Are there ways you could do this in bulk using a program already installed on your computer?**

For the sake of simplicity, I have done this conversion for you using a command-line tool called ImageMagick. You could also do this in bulk with a GUI (graphical user interface) like Adobe Bridge. You can find these new files in the folder called images-converted.
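
If you want to try the bulk conversion yourself, here is a rough sketch of how it could be driven from Python. It is not the exact command I ran. It assumes ImageMagick 7, whose command-line program is called `magick` (older versions use `convert` instead), and it assumes JPEG as the new format. The function names are made up for this example.

```python
import glob
import os
import shutil
import subprocess

def build_convert_commands(src_folder="images", dst_folder="images-converted",
                           new_ext=".jpg"):
    """Build one ImageMagick command per .tif file, without running anything."""
    commands = []
    for src in sorted(glob.glob(os.path.join(src_folder, "*.tif"))):
        stem = os.path.splitext(os.path.basename(src))[0]  # name without .tif
        dst = os.path.join(dst_folder, stem + new_ext)
        commands.append(["magick", src, dst])  # assumes ImageMagick 7
    return commands

def convert_all(commands, dst_folder="images-converted"):
    """Run the commands, but only if the magick program can be found."""
    if shutil.which("magick") is None:
        raise RuntimeError("Could not find the magick command on this computer")
    os.makedirs(dst_folder, exist_ok=True)
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failure
```

Building the list of commands separately lets you print it and check it before anything is actually converted.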


In [125]:
#OCRing images and creating txt files
for item in glob.glob('images-converted/*'):
    print (item[17:-4])
    with open('plain-text/'+ item[17:-4]+'.txt', 'w') as f:
        f.write((pytesseract.image_to_string(Image.open(item))))
        f.close()  # optional: the with statement already closes the file
        
#This code will take some time to run. So look at the questions below while you wait.        

000000f7
000000f8
000000f9
000000fa
000000fb
000000fc
000000fd
000000fe
000000ff
00000100
00000101
00000102
00000103
00000104
00000105
00000106
00000107
00000108
00000109
0000010a
0000010b
0000010c
0000010d
0000010e
0000010f
00000110
00000111
00000112
00000113
00000114
00000115
00000116
00000117
00000118
00000119
0000011a
0000011b
0000011c
0000011d
0000011e
0000011f
00000120
00000121
00000122
00000123
00000124
00000125
00000126
00000127
00000128
00000129
0000012a
0000012b
0000012c
0000012d
0000012e
0000012f


### Question:  
Why did we use the code **item[17:-4]** instead of **item** or **item[6:-4]**? 

### Answer: 
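We wanted to reuse the filename for the text files. To do that, we needed to eliminate the folder name in front of the filename. Since we switched folders from images to images-converted, we need to skip more characters: `images-converted/` is 17 characters long, so the slice starts at 17. (`images/` is 7 characters long, so **item[6:-4]** kept the slash, which is why the output of the first attempt started with a "/".) The -4 eliminates the last four characters of the string, which are **.tif or .jpg** for all of our files.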
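
Counting characters by hand is easy to get wrong. As an alternative sketch, Python's `pathlib` module can pull out the filename for you no matter how long the folder name is. The function name `text_path_for` is made up for this example.

```python
from pathlib import Path

def text_path_for(image_path, out_folder="plain-text"):
    """Return the .txt path that matches an image path, whatever its folder."""
    stem = Path(image_path).stem  # file name without the folder or extension
    return (Path(out_folder) / (stem + ".txt")).as_posix()

print(text_path_for("images-converted/000000f7.tif"))  # plain-text/000000f7.txt
print("images/000000f7.tif"[6:-4])                     # /000000f7
```

The second print shows the slice from the first attempt, which keeps the slash because `images/` is 7 characters long, not 6.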

### Open Ended Question(s):
* How could you learn more about pytesseract?
* Use the cell below to find the documentation without leaving the jupyter notebook. 
* What other methods could you use? 

### Practice Problem
Use the documentation you found to explain the code we just ran/are running. Add comments to each line after # in the cell below. Commenting your code will make it easier to go back to and to share with others. 

Note: You might want to look at some documentation on the [with statement](http://www.pythonforbeginners.com/files/with-statement-in-python) to understand lines 4-6.

It is ok if you don't understand every step perfectly. Just aim for a basic understanding of what is going on. There is no shame in looking things up now or after you have been coding for years! 


In [None]:
#OCRing images and creating txt files
for item in glob.glob('images-converted/*'): #
    print (item[17:-4]) #
    with open('plain-text/'+ item[17:-4]+'.txt', 'w') as f: #
        f.write((pytesseract.image_to_string(Image.open(item)))) #
        f.close() #

## Dealing with Messy Data
In a perfect world, we would be dealing with nicely typed and legible images. Unfortunately, much of the data we are using is far from perfect. Let's start by viewing a few sample files to see how well our OCR program worked.

In the code below, change the file variable to see how well different images were OCR'ed. Remember to use tab completion to save yourself time!

In [194]:
file = "plain-text/000000ff.txt"

# Open the file, read everything in it, and close it again automatically
with open(file, "r") as f:
    text = f.read()

print(text)



. «m . \'

“AUCUAIVEIS

66m COUNTER INTELLIGENCE CORPS GROUP
UNITED STATES ARMY, EUROPE

 

APO :54 us ARMY t
. ’ '~ (:34 E \
D—JOL’N ‘ ‘ ‘ {’3
W: W Win at 118 from]. (c) i
1‘0: hast-m. m .2 Butt, a: R
mm Batu Am, Inn-0pc Cb . 4;
no m3. us my ‘ &
mm; - We ' \‘u

  
   
  
   
   
  
  

:“nl
—LA‘

1. lift-mo Mm, un bulgm, CI“ 22 Aunt 1955.

2, hum-ind Miamunt Bayon- WWI
mom-umupruuuuumah.

human-n- #:1111113 mﬂxmmmmm,
W' um]. bom25 brunt-y mm M may
mum; “Riel-um, m, w. ' '

BY GDR USAINSCOM FOIPO
lath Para 1-603 DOD 5200. 1-3

V 11ml: Lag .
“"m.5m(w)' " Gulch-limo

FOR OFF‘ClAL USE ONLmﬁed 1o pfar 43. SR 380-32010.c:v1:i::nx:: 1'
REGRADING DATA CANNOT BE PREDEIERMINED ”(M o. , c j G , .

51 um V
' ' - ‘1 “r' .5 nm cot- , ,

x/nx scum nu or «on; m (2) ngggQﬁgﬁgﬁﬁmmzmm #.
no at when m ham u an Air Force _'" Laylﬂvm _ ~2., V:
In- ot other In moan. B _p33m¢;:tch‘;c}‘cmy.orl.31A5r>au;lzctzly‘gmu:ho ed 7
disclosur: of such icfmmnion wul be wandered to -

" ' iijmbﬁw .9! AE’S'EiE ‘

v ,


In [192]:


words = text.split(' ')
print (words)

AttributeError: 'list' object has no attribute 'strip'
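
A note on `split(' ')`: it only splits at single space characters. Line breaks stay stuck inside the words, and every extra space gives you an empty string. Here is a small sketch of the difference, using a made-up sample string in the style of our OCR output. Calling `split()` with no argument splits on any run of spaces, tabs, or line breaks.

```python
sample = "CORPS GROUP\nUNITED STATES ARMY,  EUROPE\n\nAPO"

on_spaces = sample.split(" ")   # splits only at single space characters
on_whitespace = sample.split()  # splits at any run of whitespace

print(on_spaces)
print(on_whitespace)
```

In the first list you will see items like `'GROUP\nUNITED'` and an empty string where the two spaces were. The second list has one clean item per word, though punctuation such as the comma in `'ARMY,'` is still attached.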

In [184]:
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped)

str_list = list(filter(None, stripped))

print(str_list)

['', '«m', '', '\n\n“AUCUAIVEIS\n\n66m', 'COUNTER', 'INTELLIGENCE', 'CORPS', 'GROUP\nUNITED', 'STATES', 'ARMY', 'EUROPE\n\n', '\n\nAPO', '54', 'us', 'ARMY', 't\n', '’', '', '34', 'E', '\nD—JOL’N', '‘', '‘', '‘', '’3\nW', 'W', 'Win', 'at', '118', 'from', 'c', 'i\n1‘0', 'hastm', 'm', '2', 'Butt', 'a', 'R\nmm', 'Batu', 'Am', 'Inn0pc', 'Cb', '', '4\nno', 'm3', 'us', 'my', '‘', '\nmm', '', 'We', '', '‘u\n\n', '', '\n', '', '', '\n', '', '\n', '', '', '\n', '', '', '\n', '', '\n', '', '\n\n“nl\n—LA‘\n\n1', 'liftmo', 'Mm', 'un', 'bulgm', 'CI“', '22', 'Aunt', '1955\n\n2', 'humind', 'Miamunt', 'Bayon', 'WWI\nmomumupruuuuumah\n\nhumann', '1111113', 'mﬂxmmmmm\nW', 'um', 'bom25', 'brunty', 'mm', 'M', 'may\nmum', '“Rielum', 'm', 'w', '', '\n\nBY', 'GDR', 'USAINSCOM', 'FOIPO\nlath', 'Para', '1603', 'DOD', '5200', '13\n\nV', '11ml', 'Lag', '\n“m5mw', '', 'Gulchlimo\n\nFOR', 'OFF‘ClAL', 'USE', 'ONLmﬁed', '1o', 'pfar', '43', 'SR', '38032010cv1inx', '1\nREGRADING', 'DATA', 'CANNOT', 'BE', 'PREDEIERMINED