# Lecture Notes + Exercises: Information Retrieval

In this unit, we'll use Python to turn a collection of loose text documents into a real-life database. (Note: This database was created for a project by R. Terman and E. Voeten, using much the same process you'll be learning here.)

The lecture and problem set will leverage your new Python skills, especially working with text, lists, and dictionaries; writing for-loops, conditional statements, and functions; and "thinking" like a programmer.

**About the Data**

We'll be creating a database from [Universal Periodic Review outcome reports](http://www.ohchr.org/EN/HRBodies/UPR/Pages/BasicFacts.aspx).

The Universal Periodic Review (UPR) is a process run by the United Nations Human Rights Council, which involves a periodic review of the human rights records of all 193 UN Member States.

Reviews take place through an interactive discussion between the State under review and other UN Member States. During this discussion any UN Member State can pose questions, offer comments, and/or make recommendations to the State under review. The State under review can then respond, stating which recommendations it rejects, accepts, will consider, etc. Reports are then drawn up detailing this discussion.

We will be analyzing outcome reports from the 2014 Universal Periodic Reviews of 42 countries, which we retrieved [here](http://www.ohchr.org/EN/HRBodies/UPR/Pages/Documentation.aspx) and formatted as text documents.

The goal is to convert these semi-structured texts to a tabular dataset of **recommendations** with the following variables:

1. Text of recommendation (*text*)
2. Country to which the recommendation is directed (*to*)
3. Country that is making the recommendation (*from*)
4. The year when the review took place (*year*)
5. The response to the recommendation, i.e. whether the reviewed country rejects, accepts, etc. (*decision*)

In other words, we want to turn this:

<img src="img/text.png" width="600">

into this:

<img src="img/tabular.png" width="400">



___

# Part 1: Reading Data

During lecture, we'll be completing the first part of our project: reading and structuring data from our text files. Specifically, we need to know how to:

1. Read filenames from a directory
2. Read data from a single file
3. Clean the text to select only the information we care about and format it in a way that's easy to work with
4. Repeat steps 2-3 above by looping through all files in our directory, reading, cleaning, and structuring as we go

First, let's import some helpful modules.

In [1]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

## 1. Read filenames from directory

### 1.1 

Print every filename in the directory 'data/txts'

In [2]:
dir = 'data/txts'
for file_name in os.listdir(dir):
    print file_name

afghanistan2014.txt
albania2014.txt
bangladesh2013.txt
belize2013.txt
bolivia2014.txt
botswana2013.txt
cotedivoire2014.txt
djibouti2013.txt
elsalvador2014.txt
fiji2014.txt
jordan2013.txt
kazakhstan2014.txt
monaco2013.txt
montenegro2013.txt
sanmarino2014.txt
serbia2013.txt
turkmenistan2013.txt
tuvalu2013.txt


### 1.2 

Print only the filenames that end with ".txt" (hint: use the `endswith` method).

In [3]:
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print file_name

afghanistan2014.txt
albania2014.txt
bangladesh2013.txt
belize2013.txt
bolivia2014.txt
botswana2013.txt
cotedivoire2014.txt
djibouti2013.txt
elsalvador2014.txt
fiji2014.txt
jordan2013.txt
kazakhstan2014.txt
monaco2013.txt
montenegro2013.txt
sanmarino2014.txt
serbia2013.txt
turkmenistan2013.txt
tuvalu2013.txt


### 1.3 

Print the full (relative) path of each filename ending with ".txt". For example, the first one should be "data/txts/afghanistan2014.txt".

In [4]:
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print dir + '/' + file_name

data/txts/afghanistan2014.txt
data/txts/albania2014.txt
data/txts/bangladesh2013.txt
data/txts/belize2013.txt
data/txts/bolivia2014.txt
data/txts/botswana2013.txt
data/txts/cotedivoire2014.txt
data/txts/djibouti2013.txt
data/txts/elsalvador2014.txt
data/txts/fiji2014.txt
data/txts/jordan2013.txt
data/txts/kazakhstan2014.txt
data/txts/monaco2013.txt
data/txts/montenegro2013.txt
data/txts/sanmarino2014.txt
data/txts/serbia2013.txt
data/txts/turkmenistan2013.txt
data/txts/tuvalu2013.txt
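
Joining paths with `dir + '/' + file_name` works on Unix-like systems. A minimal sketch of a more portable variant, assuming we wrap the logic in a hypothetical helper `list_txt_paths` (not part of the original notebook), uses `os.path.join` and sorts the listing, since `os.listdir` returns names in arbitrary order:

```python
import os
import shutil
import tempfile


def list_txt_paths(directory):
    """Return sorted paths of the .txt files in `directory`.

    `list_txt_paths` is a hypothetical helper for illustration only.
    """
    paths = []
    for file_name in sorted(os.listdir(directory)):
        if file_name.endswith(".txt"):
            # os.path.join inserts the separator appropriate for the OS
            paths.append(os.path.join(directory, file_name))
    return paths


# Demonstrate on a throwaway directory so the example is self-contained.
demo_dir = tempfile.mkdtemp()
for name in ["fiji2014.txt", "albania2014.txt", "notes.md"]:
    open(os.path.join(demo_dir, name), "w").close()

demo_paths = list_txt_paths(demo_dir)
demo_names = [os.path.basename(p) for p in demo_paths]
shutil.rmtree(demo_dir)  # clean up the temporary directory
```

Here `demo_names` contains only the two `.txt` files, in alphabetical order; `notes.md` is skipped.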


### 1.4 

Using the filename, print the country each document is about (hint: use slicing)

In [5]:
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print file_name[:-8]

afghanistan
albania
bangladesh
belize
bolivia
botswana
cotedivoire
djibouti
elsalvador
fiji
jordan
kazakhstan
monaco
montenegro
sanmarino
serbia
turkmenistan
tuvalu


___

## 2. Read data from a single file

Let's work with just one text, "cotedivoire2014.txt".

In [6]:
file_name = "cotedivoire2014.txt"

### 2.1

Create a dictionary called `upr` that stores the `'country'` and `'year'` keys from the `file_name`.

In [7]:
upr = {}
upr['country'] = file_name[:-8]
upr['year'] = file_name[-8:-4]
upr

{'country': 'cotedivoire', 'year': '2014'}
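
Slicing with fixed offsets assumes every filename ends in a four-digit year followed by ".txt". As a hedged alternative, the sketch below uses the `re` module to check that assumption explicitly. The helper `parse_file_name` and the pattern are hypothetical, inferred from the filenames listed earlier:

```python
import re

# Pattern assumed from the filenames above: lowercase letters, a 4-digit year, ".txt"
FILE_NAME_PATTERN = re.compile(r"^([a-z]+)(\d{4})\.txt$")


def parse_file_name(file_name):
    """Return {'country': ..., 'year': ...}, or None if the name does not match.

    `parse_file_name` is a hypothetical helper, not part of the original notebook.
    """
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return {"country": match.group(1), "year": match.group(2)}


upr_demo = parse_file_name("cotedivoire2014.txt")
not_a_report = parse_file_name("README.md")
```

For a matching name, `upr_demo` holds the same keys and values as the `upr` dictionary built with slicing; a name that does not fit the pattern yields `None` instead of a silently wrong country.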

### 2.2 

Read the file into an object called `text`, then make sure to close the file. Be sure to open the file in universal newline mode, `'rU'`. Later, we'll put this object in the dictionary we just made above.

In [8]:
file_name = "cotedivoire2014.txt"
with open(dir+'/'+ file_name,'rU') as f:
    text = f.read()

In [9]:
# take a look at the first 1000 characters
text[:1000]

'\nDistr.: General 7 July 2014 English Original: English/French \nGeneral Assembly \nHuman Rights Council Twenty-seventh session \nAgenda item 6 \nUniversal Periodic Review \nReport of the Working Group on the Universal Periodic Review* \nC\x99te d\xd5Ivoire \n* The annex to the present report is circulated as received. \n\nGE.14-07583  (E)    280714 300714 \n*1407583* \nContents \n\nParagraphs Page \nIntroduction............................................................................................................. 1\xd04 3 \n\nI. Summary of the proceedings of the review process................................................ 5\xd0126 3 \n\nA. Presentation by the State under review........................................................... 5\xd021 3 \n\nB. Interactive dialogue and responses by the State under review........................ 22\xd0126 5 \n\nII. Conclusions and/or recommendations .................................................................... 127\xd0130 14  Ann
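
The escapes `\x99`, `\xd5`, and `\xd0` in the output above are raw bytes that have not been decoded into characters. A minimal sketch, assuming the files are encoded as Mac OS Roman (a guess based on those byte values, not something these notes state), shows how such bytes could be decoded:

```python
# Assumption: the bytes are Mac OS Roman; check the real files before relying on this.
raw_name = b"C\x99te d\xd5Ivoire"
decoded_name = raw_name.decode("mac_roman")

raw_range = b"5\xd0126"
decoded_range = raw_range.decode("mac_roman")
```

Under this assumption the country name decodes to "Côte d’Ivoire", and the byte between the paragraph numbers decodes to an en dash.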

___

## 3. Clean the text

### 3.1 

Split `text` into a list of lines.

In [10]:
text = text.split("\n")
len(text)

715

In [11]:
text[3:15]

['Human Rights Council Twenty-seventh session ',
 'Agenda item 6 ',
 'Universal Periodic Review ',
 'Report of the Working Group on the Universal Periodic Review* ',
 'C\x99te d\xd5Ivoire ',
 '* The annex to the present report is circulated as received. ',
 '',
 'GE.14-07583  (E)    280714 300714 ',
 '*1407583* ',
 'Contents ',
 '',
 'Paragraphs Page ']

### 3.2 

Delete all empty lines from `text`.

In [12]:
text = filter(None,text)
len(text)

525
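
`filter(None, text)` keeps only the truthy items, so empty strings are dropped. In Python 2 it returns a list, which is why `len(text)` works above; in Python 3 it returns a lazy iterator instead. A sketch of an equivalent list comprehension that behaves the same in both versions, shown on a made-up list `sample_lines`:

```python
# `sample_lines` is invented data for illustration
sample_lines = ["Contents ", "", "Paragraphs Page ", "", ""]

# keep every line that is not the empty string
non_empty = [line for line in sample_lines if line]
```

A line containing only spaces is still truthy, so it survives either approach; use `if line.strip()` to drop those as well.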

### 3.3 

We want to slice the text so that we only keep the part in the "Conclusions and/or Recommendations" section. Find all lines that mention the phrase "conclusions and/or recommendations" (case insensitive). Put all of these lines into a list called `conclusionsList`.

In [13]:
conclusionsList = [line for line in text if "conclusions and/or recommendations" in line.lower()]

### 3.4 

Notice that the section starts with the second mention of "conclusions and/or recommendations" (the first is the table of contents) and ends with the third mention. 

Make two objects called `ConclusionsStart` and `ConclusionsEnd`, containing the index of the second and third mention of "conclusions and/or recommendations", respectively. Hint: Use the `index()` method.

In [14]:
ConclusionsStart = text.index(conclusionsList[1])
ConclusionsEnd = text.index(conclusionsList[2])

In [15]:
## Note: we can get the same results with a list comprehension over enumerate():
## Uncomment to try it out
## ConclusionsList = [i for i, j in enumerate(text) if "conclusions and/or recommendations" in j.lower()]
## ConclusionsStart = ConclusionsList[1]
## ConclusionsEnd = ConclusionsList[2]
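
To see why the second and third matches are the ones we want, here is a small worked example on an invented list `toy` that mimics the report layout (table of contents entry, section heading, closing disclaimer). The contents of `toy` are made up for illustration:

```python
toy = [
    "II. Conclusions and/or recommendations ..... 127-130 14",  # table of contents
    "Introduction",
    "II. Conclusions and/or recommendations**",                  # section heading
    "127. The recommendations listed below enjoy support:",
    "127.1 Ratify the treaty (Ghana);",
    "130. All conclusions and/or recommendations reflect the position of the State.",
]

phrase = "conclusions and/or recommendations"

# positions of every line that mentions the phrase, ignoring case
positions = [i for i, line in enumerate(toy) if phrase in line.lower()]

start, end = positions[1], positions[2]
section = toy[start + 1:end + 1]  # skip the heading, keep the closing paragraph
```

Working with positions from `enumerate` also avoids a subtle issue with `index()`, which returns the position of the first line equal to the given text: if two matching lines were identical, `index()` would return the same position for both.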

### 3.5 

Using these indices, slice the `text` list so that it only contains the lines in the section we want (skipping the section heading itself, but including the closing paragraph you identified above).

In [16]:
text = text[ConclusionsStart+1:ConclusionsEnd+1] 

In [17]:
text[:15]

['127. The recommendations listed below enjoy the support of C\x99te d\xd5Ivoire: ',
 '127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); ',
 '127.2 Make efforts towards the ratification of the OP-CAT (Chile); ',
 '127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); ',
 '127.4 Accede to the OP-CAT as soon as possible (Uruguay); ',
 '127.5 Consider ratifying OP-CAT (Burkina Faso); ',
 '127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); ',
 '127.7 Consider acceding to the ICRMW (Chad); ',
 '127.8 Make efforts towards the ratification of ICCPR-OP 2 (Chile); ',
 '** The conclusions and recommendations have not been edited. ',
 '127.9 Ratify ICCPR-OP 2 (Rwanda) to abol

### 3.6 

We'd like to remove some paragraphs in this list that don't mean anything to us.

In [18]:
# get rid of the weird lines
text = [line for line in text if '**' not in line]
text = [line for line in text if 'recommendations have not been edited.' not in line]
text = [line for line in text if 'recommendations will not be edited.' not in line]

In [19]:
text[-5:]

['The following recommendations did not enjoy the support of C\x99te d\xd5Ivoire and would thus be noted: ',
 '129.1 Take further steps to prevent discrimination on the grounds of gender identity and sexual orientation and to raise awareness on its consequences (Netherlands); ',
 '129.2 Conduct such specific awareness-raising campaigns which can help to sensitize the general Ivorian public regarding the rights of LGBTI persons (Slovenia). ',
 '130. ',
 'All conclusions and/or recommendations contained in the present report reflect the position of the submitting State(s) and/or the State under review. They should not be construed as endorsed by the Working Group as a whole. ']

### 3.7 

Notice how the last few lines are split, with the paragraph number and text on different lines -- deviating from the pattern of the rest of the text. Let's fix that using a `while` loop.

In [20]:
# merge lines so that each line starts with a paragraph number
def mergeLines(l):
    '''
    Take a list of lines `l` and merge broken paragraphs: any line that
    does not start with a digit is joined onto the line before it.
    The list is modified in place and also returned.
    '''
    i = 0
    while i < len(l):
        # i > 0: the first line has no previous line to merge into
        if i > 0 and not l[i][:1].isdigit():
            l[i-1:i+1] = [' '.join(l[i-1:i+1])]
        else:
            i = i+1
    return l
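
The in-place slice assignment can be hard to follow at first. As an illustration of the same idea, here is a sketch of a variant that builds a new list instead of modifying its input. The name `merge_lines_copy` and the sample lines are hypothetical:

```python
def merge_lines_copy(lines):
    """Return a new list where lines not starting with a digit are joined
    onto the previous line. The first line is always kept as-is.

    Hypothetical variant for illustration; it does not modify `lines`.
    """
    merged = []
    for line in lines:
        if merged and not line[:1].isdigit():
            # no paragraph number: glue this line onto the previous one
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged


# invented sample with a paragraph number split from its text
broken_lines = [
    "129.2 Conduct campaigns (Slovenia). ",
    "130. ",
    "All conclusions reflect the position of the State. ",
]
fixed_lines = merge_lines_copy(broken_lines)
```

After the call, `fixed_lines` has two items, and `broken_lines` still has its original three.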

In [21]:
text = mergeLines(text)

In [22]:
text[-5:]

['128.6 Better protect LGBTI persons and persons with AIDS against any act of discrimination and violence and review its legislation in this context (Switzerland). ',
 '129.  The following recommendations did not enjoy the support of C\x99te d\xd5Ivoire and would thus be noted: ',
 '129.1 Take further steps to prevent discrimination on the grounds of gender identity and sexual orientation and to raise awareness on its consequences (Netherlands); ',
 '129.2 Conduct such specific awareness-raising campaigns which can help to sensitize the general Ivorian public regarding the rights of LGBTI persons (Slovenia). ',
 '130.  All conclusions and/or recommendations contained in the present report reflect the position of the submitting State(s) and/or the State under review. They should not be construed as endorsed by the Working Group as a whole. ']

### 3.8 

We probably don't need that last line, the disclaimer saying that the conclusions should not be read as endorsed by the Working Group as a whole.

In [23]:
# get rid of that disclaimer paragraph
text = [line for line in text if 'endorsed by the working group' not in line.lower()]

In [24]:
text[:20]

['127. The recommendations listed below enjoy the support of C\x99te d\xd5Ivoire: ',
 '127.1 Consider the accession to core human rights instruments (Lesotho); and to other main international human rights treaties that it is not yet a party to (Philippines); ',
 '127.2 Make efforts towards the ratification of the OP-CAT (Chile); ',
 '127.3 Ratify the OP-CAT (Ghana, Tunisia), as recommended previously in 2009 (Czech Republic) and take policy measures to prevent torture and ill-treatment (Estonia); ',
 '127.4 Accede to the OP-CAT as soon as possible (Uruguay); ',
 '127.5 Consider ratifying OP-CAT (Burkina Faso); ',
 '127.6 Ratify the International Convention on the Protection of the Rights of All Migrant Workers and Members of Their Families (ICRMW) (Ghana); ',
 '127.7 Consider acceding to the ICRMW (Chad); ',
 '127.8 Make efforts towards the ratification of ICCPR-OP 2 (Chile); ',
 '127.9 Ratify ICCPR-OP 2 (Rwanda) to abolish death penalty (France, Montenegro); ',
 '127.10 Accede to the 

### 3.9

Now that we have cleaned the text, let's put that list in our dictionary.

In [25]:
upr['text'] = text
upr.keys()

['country', 'text', 'year']

___

## 4. Loop through Files

### 4.1

Using the code you wrote above, loop through all the .txt files in the directory, structuring and cleaning the data as you go. Each .txt file will be stored as a dictionary. Store each dictionary in a list.

In [26]:
l = []
broken = []  # files that could not be processed, with the error message
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print 'processing ' + file_name + '...'
        try:
            upr = {}
            upr['country'] = file_name[:-8]
            upr['year'] = file_name[-8:-4]
            with open(dir + '/' + file_name, 'rU') as f:
                text = f.read() # read in text
            text = text.split('\n') # make a list
            text = filter(None, text) # get rid of empty string items

            # take only the conclusions and/or recommendations section
            conclusionsList = [line for line in text if "conclusions and/or recommendations" in line.lower()]
            ConclusionsStart = text.index(conclusionsList[1]) # second mention: the section heading
            ConclusionsEnd = text.index(conclusionsList[2]) # third mention: the disclaimer
            text = text[ConclusionsStart+1:ConclusionsEnd+1]

            # get rid of the weird lines
            text = [line for line in text if '**' not in line]
            text = [line for line in text if 'recommendations have not been edited.' not in line]
            text = [line for line in text if 'recommendations will not be edited.' not in line]
            text = [line.replace('\xd2','') for line in text]
            text = [line.replace('\t','') for line in text]
            text = [line.lstrip(" ") for line in text]

            # merge lines so that each line is its own paragraph, starting with a paragraph number
            text = mergeLines(text)

            # get rid of that disclaimer paragraph
            text = [line for line in text if 'endorsed by the working group' not in line.lower()]

            upr['text'] = text

            # append to list
            l.append(upr)

        except Exception, e:
            # record the failure and move on to the next file
            broken.append(file_name + ': ' + str(e))

processing afghanistan2014.txt...
processing albania2014.txt...
processing bangladesh2013.txt...
processing belize2013.txt...
processing bolivia2014.txt...
processing botswana2013.txt...
processing cotedivoire2014.txt...
processing djibouti2013.txt...
processing elsalvador2014.txt...
processing fiji2014.txt...
processing jordan2013.txt...
processing kazakhstan2014.txt...
processing monaco2013.txt...
processing montenegro2013.txt...
processing sanmarino2014.txt...
processing serbia2013.txt...
processing turkmenistan2013.txt...
processing tuvalu2013.txt...
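
The cleaning steps inside the loop can also be packaged as a function that operates on a list of lines, which makes them testable without touching the file system. A minimal sketch under that assumption follows; `clean_lines` and the sample `doc` are hypothetical and cover only the main steps shown in these notes:

```python
PHRASE = "conclusions and/or recommendations"


def clean_lines(lines):
    """Return the recommendation paragraphs from a list of raw report lines."""
    lines = [line for line in lines if line]  # drop empty lines
    hits = [i for i, line in enumerate(lines) if PHRASE in line.lower()]
    lines = lines[hits[1] + 1:hits[2] + 1]  # section body plus closing paragraph
    lines = [line for line in lines if "**" not in line]  # drop footnote lines
    merged = []
    for line in lines:  # join lines that lack a paragraph number
        if merged and not line[:1].isdigit():
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    # drop the closing disclaimer paragraph
    return [line for line in merged
            if "endorsed by the working group" not in line.lower()]


# invented miniature report for illustration
doc = [
    "II. Conclusions and/or recommendations ... 127-130",
    "",
    "II. Conclusions and/or recommendations**",
    "127. The recommendations listed below enjoy support:",
    "127.1 Ratify the treaty (Ghana);",
    "** The conclusions and recommendations have not been edited.",
    "130. ",
    "All conclusions and/or recommendations reflect the position of the State. "
    "They should not be construed as endorsed by the Working Group as a whole.",
]
cleaned = clean_lines(doc)
```

If a document mentions the phrase fewer than three times, `hits[2]` raises an `IndexError`; in the loop above, that kind of failure is what the `try`/`except` block records in `broken`.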


In [27]:
first_upr = l[0]

print "Country:",first_upr['country']
print "Year:",first_upr['year']
print "Text:",first_upr['text'][:10], "..."

Country: afghanistan
Year: 2014
Text: ['136. The recommendations formulated during the interactive dialogue and listed below have been examined by Afghanistan and enjoy its support: ', '136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); ', '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ', '136.3 Make further efforts to ensure the implementation of the legal framework which guarantees human rights, including the Constitution (Japan); ', '136.4 Further fulfil the internationally taken human rights obligations as well as integrate them into the national legislation (Kazakhstan); ', '136.5 Further strengthen its efforts to review its legislative framework and make necessary adjustments to it in order to ensure that it is in conformity with Afghanistan\xd5s international human rights obligations (Norway); ', '13

## What Next?

What patterns do you see in the text that you think we can harness to make our tabular dataset?