# Lecture Notes + Exercises: Information Retrieval

In this unit, we'll use Python to turn a collection of loose text documents into a real-life database. (Note: This database was created for a project by R. Terman and E. Voeten, and was built using much the same process you'll be learning here.)

The lecture and problem set will draw on your new Python skills, especially working with text, lists, and dictionaries; writing for-loops, conditional statements, and functions; and "thinking" like a programmer.

**About the Data**

We'll be creating a database from [Universal Periodic Review outcome reports](http://www.ohchr.org/EN/HRBodies/UPR/Pages/BasicFacts.aspx).

The Universal Periodic Review (UPR) is a process run by the United Nations Human Rights Council, which involves a periodic review of the human rights records of all 193 UN Member States.

Reviews take place through an interactive discussion between the State under review and other UN Member States. During this discussion, any UN Member State can pose questions, make comments, and/or make recommendations to the State under review. The State under review can then respond, stating which recommendations it rejects, accepts, will consider, etc. Reports are then drawn up detailing this discussion.

We will be analyzing outcome reports from the 2014 Universal Periodic Reviews of 42 countries, which we retrieved [here](http://www.ohchr.org/EN/HRBodies/UPR/Pages/Documentation.aspx) and formatted as text documents.

The goal is to convert these semi-structured texts to a tabular dataset of **recommendations** with the following variables:

1. Text of recommendation (*text*)
2. Country to which the recommendation is directed (*to*)
3. Country that is making the recommendation (*from*)
4. The year when the review took place (*year*)
5. The response to the recommendation, i.e., whether the reviewed country rejects, accepts, etc. (*decision*)

In other words, we want to turn this:

<img src="img/text.png" width="600">

into this:

<img src="img/tabular.png" width="400">



___

# Part 1: Reading Data

During lecture, we'll be completing the first part of our project: reading and structuring data from our text files. Specifically, we need to know how to:

1. Read filenames from a directory
2. Read data from a single file
3. Clean the text to select only the information we care about and format it in a way that's easy to work with
4. Repeat steps 2-3 above by looping through all files in our directory, reading, cleaning, and structuring as we go

First, let's import some helpful modules:

In [1]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

## 1. Read filenames from directory

First let's assign a variable `dir` containing the relative path to our text files.

In [28]:
dir = 'data/txts'
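
As a minimal sketch of how `os.listdir` behaves, the snippet below builds a throwaway folder and lists its contents. The `demo_dir` folder and the file names in it are made up for illustration; they stand in for the real `data/txts` folder. Note that `os.listdir` returns bare file names in arbitrary order, so sorting them keeps the output predictable.

```python
import os
import tempfile

# Hypothetical stand-in for 'data/txts': a temporary folder with three empty files.
demo_dir = tempfile.mkdtemp()
for name in ['afghanistan2014.txt', 'albania2014.txt', 'notes.md']:
    open(os.path.join(demo_dir, name), 'w').close()

# os.listdir returns file names only (no directory prefix), in arbitrary order.
demo_files = sorted(os.listdir(demo_dir))
for name in demo_files:
    print(name)
```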

### 1.1 

Print every filename in the directory 'data/txts'

In [None]:
# YOUR CODE HERE

### 1.2 

Print only the filenames that end with ".txt" (hint: use the `endswith` method)

In [None]:
# YOUR CODE HERE
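
If `endswith` is new to you, here is a small sketch of how it behaves on a made-up list of names (not the real directory listing). It returns `True` or `False`, so it slots straight into an `if` statement.

```python
# Made-up file names for illustration only.
sample_names = ['tuvalu2013.txt', '.DS_Store', 'readme.md', 'chile2014.txt']

# str.endswith returns True or False, so it can be used directly in an `if`.
txt_names = []
for name in sample_names:
    if name.endswith('.txt'):
        txt_names.append(name)

print(txt_names)
```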

### 1.3 

Print the full (relative) path of each filename ending with ".txt". For example, the first one should be "data/txts/afghanistan2014.txt".

In [None]:
# YOUR CODE HERE
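
One way to build such a path is `os.path.join`, sketched below for a single name. The variable names `folder`, `name`, and `full_path` are only illustrative. Joining with `os.path.join` rather than gluing strings together with `+` means you don't have to worry about the separator yourself.

```python
import os

folder = 'data/txts'            # same relative path as the `dir` variable above
name = 'afghanistan2014.txt'    # one file name, as returned by os.listdir

# os.path.join inserts the path separator for the current operating system.
full_path = os.path.join(folder, name)
print(full_path)
```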

### 1.4 

Using the filename, print the country each document is about (hint: use slicing)

In [None]:
# YOUR CODE HERE

___

## 2. Read data from a single file

Let's work with just one text, "cotedivoire2014.txt".

In [6]:
file_name = "cotedivoire2014.txt"

### 2.1

Create a dictionary called `upr` with `'country'` and `'year'` keys, taking both values from `file_name`.

In [36]:
upr = {}
upr['country'] = file_name[:-8]
upr['year'] = file_name[-8:-4]
upr

{'country': 'cotedivoire', 'year': '2014'}
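
The slices above count from the end of the string: the last four characters are the `.txt` extension, and the four before that are the year. A minimal sketch with a different, made-up file name shows each piece on its own. It assumes every file name follows the `countryYYYY.txt` pattern.

```python
name = 'chile2014.txt'    # made-up example following the countryYYYY.txt pattern

extension = name[-4:]     # last 4 characters
year = name[-8:-4]        # the 4 characters before the extension
country = name[:-8]       # everything before the year

print(country)
print(year)
print(extension)
```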

### 2.2 

Read the file into an object called `text`, then make sure to close the file. Be sure to open the file in universal newline mode, `'rU'`, so that every style of line ending is read as `'\n'`. Later, we'll put this object in the dictionary we just made above.

In [8]:
# YOUR CODE HERE
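
The open, read, close pattern can be sketched on a throwaway file, as below. The file path and contents are made up so the example runs anywhere. One caveat: in Python 3, text mode already translates line endings, and the `'U'` flag was removed in Python 3.11, so this sketch uses plain `'r'`.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
# (the path and contents are made up for illustration).
handle, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(handle)
with open(demo_path, 'w') as f:
    f.write('first line\nsecond line\n')

# Open, read everything into one string, and close explicitly.
f = open(demo_path, 'r')
demo_text = f.read()
f.close()

print(f.closed)          # True once the file has been closed
print(repr(demo_text))
os.remove(demo_path)
```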

In [9]:
# take a look at the first 1000 characters
text[:1000]

'\nDistr.: General 7 July 2014 English Original: English/French \nGeneral Assembly \nHuman Rights Council Twenty-seventh session \nAgenda item 6 \nUniversal Periodic Review \nReport of the Working Group on the Universal Periodic Review* \nC\x99te d\xd5Ivoire \n* The annex to the present report is circulated as received. \n\nGE.14-07583  (E)    280714 300714 \n*1407583* \nContents \n\nParagraphs Page \nIntroduction............................................................................................................. 1\xd04 3 \n\nI. Summary of the proceedings of the review process................................................ 5\xd0126 3 \n\nA. Presentation by the State under review........................................................... 5\xd021 3 \n\nB. Interactive dialogue and responses by the State under review........................ 22\xd0126 5 \n\nII. Conclusions and/or recommendations .................................................................... 127\xd0130 14  Ann
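
The `\x99`, `\xd5`, and `\xd0` escapes in the output above are raw bytes that Python could not show as ASCII. One plausible reading, and it is only an assumption, is that the files were saved in the Mac OS Roman encoding, where those bytes stand for `ô`, `’`, and `–`. A quick check on the country name:

```python
# Bytes copied from the output above.
raw = b'C\x99te d\xd5Ivoire'

# Assumption: the files were saved in the Mac OS Roman encoding.
decoded = raw.decode('mac_roman')

# Compare against the expected characters (o with circumflex, right single quote).
print(decoded == u'C\xf4te d\u2019Ivoire')
```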

___

## 3. Clean the text

### 3.1 

Split `text` into a list of lines.

In [None]:
# YOUR CODE HERE
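
As a small illustration of what splitting does, here is a made-up string loosely modeled on the start of the report. Each `'\n'` becomes a boundary between list items, and a blank line turns into an empty string.

```python
sample = 'Contents \n\nParagraphs Page \nIntroduction'

# Splitting on the newline character gives one list element per line;
# the blank line becomes an empty string.
sample_lines = sample.split('\n')
print(sample_lines)
```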

In [30]:
# test
print len(text)
text[3:15]

100


['82.3. Continue its efforts to accede to the remaining core international human rights treaties, which will strengthen the domestic legislation with regard to the promotion and protection of human rights, including freedom of religion or belief (Turkey); ',
 '82.4. Work closely with the OHCHR and the Council for considering eventual participation to the core international instruments on human rights (Viet Nam); ',
 '82.5. Further continue internal consultations and request the technical assistance of relevant UN institutions with regards to the accession to the core international human rights treaties (Azerbaijan); ',
 '82.6. Increase efforts to swiftly ratify fundamental treaties on human rights, such as ICCPR and ICESCR, also by taking advantage of the available international technical assistance to address possible shortcomings in fulfilling the requirements of the international treaties (Italy); ',
 '82.7. Put in place, with the technical cooperation of OHCHR and the financial sup

### 3.2 

Delete all empty lines from `text`.

In [None]:
text = filter(None,text)
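
To see why `filter(None, ...)` works, here is a sketch on a made-up list. With `None` as the first argument, `filter` keeps only "truthy" items, and an empty string is falsy. A line holding only spaces is still truthy, so it would survive.

```python
lines = ['Contents ', '', 'Paragraphs Page ', '', 'Introduction']

# filter(None, ...) keeps only "truthy" items, and an empty string is falsy.
# Wrapping in list() makes the result a list on Python 3 as well as Python 2.
non_empty = list(filter(None, lines))
print(non_empty)

# An equivalent list comprehension.
also_non_empty = [line for line in lines if line]
```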

### 3.3 

We want to slice the text so that we only keep the part in the "Conclusions and/or Recommendations" section. 

Find all lines that mention the phrase "conclusions and/or recommendations", ignoring case. Put all of these lines into a list called `conclusionsList`.

In [None]:
# YOUR CODE HERE
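
A common way to make a match case insensitive is to lower-case each line before testing it. The sketch below does that on three made-up lines; the names `lines`, `phrase`, and `matches` are only illustrative.

```python
# Made-up lines for illustration only.
lines = [
    'II. Conclusions and/or recommendations ...... 127',
    'A. Presentation by the State under review',
    'II. Conclusions and/or recommendations**',
]

phrase = 'conclusions and/or recommendations'

# Lower-casing each line before the `in` test makes the match case insensitive.
matches = [line for line in lines if phrase in line.lower()]
print(len(matches))
```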

### 3.4 

Notice that the section starts with the second mention of "conclusions and/or recommendations" (the first is the table of contents) and ends with the third mention. 

Make two objects called `ConclusionsStart` and `ConclusionsEnd`, containing the index of the second and third mention of "conclusions and/or recommendations", respectively. Hint: Use the `index()` method

In [None]:
# YOUR CODE HERE
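
Here is a toy sketch of how `index()` turns a matched line back into a position. The list contents are made up. One caveat: `index()` returns the position of the first item equal to the value, so this approach assumes the matched lines are not identical to one another.

```python
# Made-up lines for illustration only.
lines = ['a', 'TOC: conclusions', 'b', 'Section: conclusions', 'c', 'End: conclusions']
matches = [line for line in lines if 'conclusions' in line]

# list.index(value) returns the position of the first item equal to `value`.
start = lines.index(matches[1])   # second mention
end = lines.index(matches[2])     # third mention
print(start)
print(end)
```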

### 3.5 

Using these indices, slice the `text` list so that it only contains the lines in the section we want (including both the start and end paragraphs you identified above)

In [None]:
# YOUR CODE HERE
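
Remember that a slice stops just before its end index. A toy sketch with made-up values shows how to include the end line as well:

```python
lines = ['zero', 'one', 'two', 'three', 'four', 'five']
start, end = 2, 4

# A slice stops just before its end index, so add 1 to include line `end`.
section = lines[start:end + 1]
print(section)
```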

In [31]:
# test
text[:15]

['82. The recommendations formulated during the interactive dialogue listed below enjoy the support of Tuvalu: ',
 '82.1. Continue the efforts to achieve accession to the main human rights international instruments and their consistent incorporation into domestic legislation (Costa Rica); ',
 '82.2. Consider ratifying new international human rights instruments which would assist in strengthening its legal and institutional framework for the promotion and protection of human rights (Nicaragua); ',
 '82.3. Continue its efforts to accede to the remaining core international human rights treaties, which will strengthen the domestic legislation with regard to the promotion and protection of human rights, including freedom of religion or belief (Turkey); ',
 '82.4. Work closely with the OHCHR and the Council for considering eventual participation to the core international instruments on human rights (Viet Nam); ',
 '82.5. Further continue internal consultations and request the technical assis

### 3.6 

We'd like to remove some paragraphs in this list that don't mean anything to us.

In [None]:
text = [line for line in text if '**' not in line]
text = [line for line in text if 'recommendations have not been edited.' not in line]
text = [line for line in text if 'recommendations will not be edited.' not in line]

In [37]:
text[-5:]

['84.23. Adopt, as a matter of priority, all legal and administrative measures to prohibit and punish corporal punishment of children in all settings, including at home (Uruguay); ',
 '84.24. Adopt legal and administrative measures to eliminate all forms of corporal punishment of children (Chile); ',
 '84.25. Adopt necessary legislative and administrative measures to guarantee freedom of religion (Mexico); ',
 '84.26. Make changes to the Constitution Amendment Act of 2010 to fully guarantee freedom of religion or belief (Canada); ',
 '84.27. Amend or repeal the Religious Organisations Act so as to establish a legal framework ensuring that everyone is free to practice his or her own religious faith without penalty (Ireland). ']

### 3.7 

Notice how the last few lines are split, with the paragraph number and text on different lines -- deviating from the pattern of the rest of the text. Let's fix that using a `while` loop.

Read through it and try to work out what it's doing.

In [20]:
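# merge lines so that each line starts with a number
def mergeLines(l):
    '''
    This function takes in a list of lines `l` and merges broken paragraph lines
    (any line that doesn't start with a number is joined onto the line before it)
    '''
    # start at 1: the first line has no previous line to merge into
    i = 1
    while i < len(l):
        # l[i][:1] is '' for an empty line, so this test never raises an IndexError
        if not l[i][:1].isdigit():
            # join line i onto line i-1, then check the same index again
            l[i-1:i+1] = [' '.join(l[i-1:i+1])]
        else:
            i = i+1
    return l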

Now let's apply that function to our `text`

In [21]:
text = mergeLines(text)
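
To watch the merge happen on something small, here is a sketch with a made-up list of broken lines. The function `merge_lines_demo` repeats the same idea as `mergeLines` so the example runs on its own. It starts at index 1, on the assumption that the first line is already a numbered paragraph.

```python
def merge_lines_demo(lines):
    # Same idea as mergeLines above, repeated here so the example runs on its own.
    i = 1
    while i < len(lines):
        if not lines[i][:1].isdigit():
            # Glue line i onto the previous line; the list gets one item shorter,
            # so the same index i is checked again on the next pass.
            lines[i - 1:i + 1] = [' '.join(lines[i - 1:i + 1])]
        else:
            i = i + 1
    return lines

# Made-up broken lines for illustration only.
broken = ['84.26. Make changes', 'to the Act (Canada);', '84.27.', 'Amend or repeal', 'the Act (Ireland).']
merged = merge_lines_demo(broken)
print(merged)
```

Note that the list is changed in place: after the call, `broken` and `merged` are the same list object.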

In [32]:
# test
text[-5:]

['84.23. Adopt, as a matter of priority, all legal and administrative measures to prohibit and punish corporal punishment of children in all settings, including at home (Uruguay); ',
 '84.24. Adopt legal and administrative measures to eliminate all forms of corporal punishment of children (Chile); ',
 '84.25. Adopt necessary legislative and administrative measures to guarantee freedom of religion (Mexico); ',
 '84.26. Make changes to the Constitution Amendment Act of 2010 to fully guarantee freedom of religion or belief (Canada); ',
 '84.27. Amend or repeal the Religious Organisations Act so as to establish a legal framework ensuring that everyone is free to practice his or her own religious faith without penalty (Ireland). ']

### 3.8 

We probably don't need that last line -- on how the conclusions don't reflect the position of the working group

In [33]:
#YOUR CODE HERE

In [34]:
text[:20]

['82. The recommendations formulated during the interactive dialogue listed below enjoy the support of Tuvalu: ',
 '82.1. Continue the efforts to achieve accession to the main human rights international instruments and their consistent incorporation into domestic legislation (Costa Rica); ',
 '82.2. Consider ratifying new international human rights instruments which would assist in strengthening its legal and institutional framework for the promotion and protection of human rights (Nicaragua); ',
 '82.3. Continue its efforts to accede to the remaining core international human rights treaties, which will strengthen the domestic legislation with regard to the promotion and protection of human rights, including freedom of religion or belief (Turkey); ',
 '82.4. Work closely with the OHCHR and the Council for considering eventual participation to the core international instruments on human rights (Viet Nam); ',
 '82.5. Further continue internal consultations and request the technical assis

### 3.9 

Now that we have cleaned text, let's put that list in our dictionary.

In [25]:
upr['text'] = text
upr.keys()

['country', 'text', 'year']

___

## 4. Loop through Files

### 4.1 

Using the code you wrote above, loop through all the .txt files in the directory, structuring and cleaning the data as you go. Each txt file will be stored as a dictionary. Store each dictionary in a list called `l`.

In [None]:
#YOUR CODE HERE
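
The overall shape of such a loop might look like the sketch below. It is a skeleton under stated assumptions, not a full solution: the folder `demo_dir`, the sample file contents, and the names `uprs` and `record` are all made up, and the cleaning step only drops empty lines.

```python
import os
import tempfile

# Hypothetical stand-in for 'data/txts', so the sketch runs anywhere.
demo_dir = tempfile.mkdtemp()
samples = {
    'tuvalu2013.txt': '82.1. First recommendation (Costa Rica);\n\n82.2. Second (Nicaragua);\n',
    'chile2014.txt': '121.1. Another recommendation (Peru);\n',
    'notes.md': 'not a report\n',
}
for name, body in samples.items():
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write(body)

uprs = []
for name in sorted(os.listdir(demo_dir)):
    # Skip anything that is not a .txt report.
    if not name.endswith('.txt'):
        continue
    record = {'country': name[:-8], 'year': name[-8:-4]}
    with open(os.path.join(demo_dir, name)) as f:
        lines = f.read().split('\n')
    # Placeholder cleaning step: only drops empty lines. The real loop would
    # also slice out the conclusions section and merge broken lines.
    record['text'] = [line for line in lines if line]
    uprs.append(record)

print(len(uprs))
```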

In [35]:
# test
first_upr = l[0]

print "Country:", first_upr['country']
print "Year:", first_upr['year']
print "Text:", first_upr['text'][:10], "..."

Country: afghanistan
Year: 2014
Text: ['136. The recommendations formulated during the interactive dialogue and listed below have been examined by Afghanistan and enjoy its support: ', '136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); ', '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ', '136.3 Make further efforts to ensure the implementation of the legal framework which guarantees human rights, including the Constitution (Japan); ', '136.4 Further fulfil the internationally taken human rights obligations as well as integrate them into the national legislation (Kazakhstan); ', '136.5 Further strengthen its efforts to review its legislative framework and make necessary adjustments to it in order to ensure that it is in conformity with Afghanistan\xd5s international human rights obligations (Norway); ', '13

## What Next?

What patterns do you see in the text that you think we can harness to make our tabular dataset?