# Problem Set 5: Information Retrieval

## Part 2: Structuring + Manipulating Data

In this problem set, you'll be continuing to work with the data you imported during lecture. You'll be structuring and manipulating that data to turn it into a tabular dataset.

Specifically, you'll be:

1. Splitting the text into "chunks" that correspond to major sections
2. Repeating the steps in part 1 by creating a function and looping through documents
3. Getting information about specific recommendations
4. Writing a CSV file from our data
5. Writing an algorithm that does steps 1-4 with less structured data

To get started, import the following modules and run the code from lecture.

In [146]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

___

## Preface: Run Code from Lecture

Run the code below to get started.

In [147]:
# merge lines so that each line starts with a number
def mergeLines(l):
    '''
    This function takes in a list of lines `l` and merges broken paragraph lines
    (a line that doesn't start with a number is merged into the line before it).
    '''
    i = 0
    while i < len(l):
        if not l[i][0].isdigit():
            l[i-1:i+1] = [' '.join(l[i-1:i+1])]
        else:
            i = i+1
    return(l)
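
To see what `mergeLines` does, here is a minimal sketch on a made-up list of lines. The function body is repeated so the cell runs on its own, and the toy lines are invented for illustration. The sketch assumes the first line starts with a digit, since the function has no earlier line to merge a leading non-numbered line into.

```python
def mergeLines(l):
    # same logic as the lecture function, repeated so this cell is self-contained
    i = 0
    while i < len(l):
        if not l[i][0].isdigit():
            # glue this line onto the end of the previous one
            l[i-1:i+1] = [' '.join(l[i-1:i+1])]
        else:
            i = i + 1
    return l

# made-up lines: two paragraphs are broken across several lines
toy = ['1. First paragraph starts here',
       'and continues on this line',
       '1.1 A sub-paragraph',
       'that is also broken',
       'twice',
       '2. Last paragraph']

merged = mergeLines(toy)
# merged now has three items, each starting with a paragraph number
```

Note that the list is modified in place, so `toy` and `merged` refer to the same list afterwards.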

In [148]:
l = []
broken = []  # files that failed to parse, kept across the whole loop
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print 'processing ' + file_name + '...'
        try:
            dic = {}
            dic['country'] = file_name[:-8]
            dic['year'] = file_name[-8:-4]
            f = open(dir + '/' + file_name,'rU')
            text = f.read() # read in text
            f.close()
            text = text.split('\n') # make a list
            text = filter(None, text) # get rid of empty string items       
             
            # take only the conclusions and/or recommendations section
            ConclusionsStart = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][1]) # starting from the bottom
            ConclusionsEnd = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][2]) # the last one is the disclaimer
            text = text[ConclusionsStart+1:ConclusionsEnd+1] 
            
            # get rid of the weird lines
            text = [line for line in text if '**' not in line]
            text = [line for line in text if 'recommendations have not been edited.' not in line]
            text = [line for line in text if 'recommendations will not be edited.' not in line]
            text = [line.replace('\xd2','') for line in text]
            text = [line.replace('\t','') for line in text]
            text = [line.lstrip(" ") for line in text]
            
            # merge lines so that each line is its own paragraph, starting with a paragraph number
            text = mergeLines(text)
            
            # get rid of that disclaimer paragraph
            text = [line for line in text if 'endorsed by the working group' not in line.lower()]
            
            dic['text'] = text 
            
            # append to list
            l.append(dic)
              
        except Exception as e:
            broken.append(file_name +str(e)) 

processing afghanistan2014.txt...
processing albania2014.txt...
processing bangladesh2013.txt...
processing belize2013.txt...
processing bolivia2014.txt...
processing botswana2013.txt...
processing cotedivoire2014.txt...
processing djibouti2013.txt...
processing elsalvador2014.txt...
processing fiji2014.txt...
processing jordan2013.txt...
processing kazakhstan2014.txt...
processing monaco2013.txt...
processing montenegro2013.txt...
processing sanmarino2014.txt...
processing serbia2013.txt...
processing turkmenistan2013.txt...
processing tuvalu2013.txt...


## 1. Chunk into sections

### 1.1

We'll first be working with a single UPR report. Make an object called `upr` that contains the fifth item in the list `l`.

In [149]:
# YOUR CODE HERE

### 1.2

These texts have 3 sections each. The first section contains the recommendations the country supports. The second section contains recommendations the country will consider. The third contains recommendations the country explicitly rejects.

Each section starts with a main paragraph number (e.g. **123**). The individual recommendations are then noted as subparagraphs (e.g. **123.1, 123.2**, etc.).

The problem is, we don't know what these paragraph numbers are *a priori*. Luckily, Rochelle wrote you a function that takes a document and returns the main paragraph numbers in that document's "Conclusions and Recommendations" section.

In [150]:
# function to find main paragraphs numbers in each upr
def mainParagraphs(upr):
    '''
    This function takes in a upr and returns the main paragraph numbers in the 'recommendations' section.
    '''
    firstParagraph = upr['text'][0].partition(" ")[0]
    if '.' in firstParagraph:
        firstParagraph = firstParagraph.replace(".","")
    
    # the first section contains the supported recs
    firstParagraph = int(firstParagraph)
    
    # the second section contains the considered recs
    secondParagraph = firstParagraph + 1
    
    # the third section contains the rejected recs
    thirdParagraph = secondParagraph + 1
    
    return(firstParagraph, secondParagraph, thirdParagraph)

# Uncomment to test
mainParagraphs(upr)

(113, 114, 115)
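
If the string methods in `mainParagraphs` are unfamiliar, they can be tried out in isolation. The sketch below uses a made-up header line (not taken from the data) to show what `partition` and `replace` return at each step.

```python
# a made-up header line, for illustration only
line = '113.  A made-up header paragraph'

# partition splits at the FIRST occurrence of the separator
# and returns a 3-tuple: (before, separator, after)
head, sep, rest = line.partition(' ')

# strip the trailing dot, then convert the remaining digits to an integer
number = int(head.replace('.', ''))
```

Here `head` is `'113.'` and `number` is the integer `113`.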

In your own words, explain how this function works. Feel free to make new cells and explore each line of the function, and/or litter the function with print statements.

***YOUR MARKDOWN ANSWER HERE***

### 1.3

Using this function, create three new variables, `support_P`, `consider_P`, and `reject_P`. Each variable should point to the paragraph number that is associated with that section. In this case, `support_P` should point to `113`. 

**Important: Don't just assign the numbers. Remember, we're going to generalize this code for all the uprs, and the individual paragraph numbers will change. So use the results from the function above.**
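
As a reminder, a function that returns several values hands them back as a tuple, and a tuple can be unpacked into separate names in one assignment. A minimal sketch, where `first_three` is a hypothetical stand-in and not part of the problem set:

```python
def first_three(start):
    # hypothetical helper: returns three consecutive integers as a tuple
    return (start, start + 1, start + 2)

# unpack the returned tuple into three variables at once
a, b, c = first_three(10)
```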

In [164]:
# YOUR CODE HERE

In [165]:
# test your code
print support_P
print consider_P
print reject_P

113
114
115


### 1.4 

Create a new dictionary called `sections`. The dictionary should have three keys: `support`, `consider`, and `reject`. Each key should point to a list of lines in that section. Your final dictionary should look something like this:

```
{ 'support': [113. Some text, 113.1 A supported rec, 113.2 Another rec, 113.3 Even more...],
    
  'consider': [114. A new section, 114.1 A considered rec, 114.2 Even more recs, 114.3 You get the picture...] ,
  
  'reject': [115. The last section, 115.1 A rejected rec, 115.2 Even more recs, 115.3 Etc ...]
}
```

*hint*: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method.
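
A small sketch of `.startswith()` on made-up lines may help. Two details are worth noticing: the method needs a string, so an integer paragraph number has to be converted with `str()` first, and a bare number prefix can also match a longer number. The toy lines and variable names below are invented for illustration.

```python
# made-up lines, for illustration only
lines = ['7. Header for section seven',
         '7.1 First item',
         '8. Header for section eight',
         '8.1 Another item',
         '70.1 Belongs to paragraph seventy']

prefix = str(7)  # startswith() expects a string, not an int

# a bare prefix also matches '70.1 ...'
loose = [line for line in lines if line.startswith(prefix)]

# adding the dot restricts the match to paragraph 7 and its subparagraphs
strict = [line for line in lines if line.startswith(prefix + '.')]
```

Whether the looser match is a problem depends on the paragraph numbers that actually occur in a document.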

In [None]:
# YOUR CODE HERE

In [152]:
# test your code
sections['support'][:5]

['113.  The recommendations formulated during the interactive dialogue/listed below have been examined by the Plurinational State of Bolivia and enjoy the support of the State party: ',
 '113.1 Incorporate the Rome Statute into national law (Mexico); ',
 '113.2 Consider ratifying the United Nations Educational, Scientific and Cultural Organization (UNESCO) Convention against Discrimination in Education (Ghana); ',
 '113.3 Ratify the UNESCO Convention against Discrimination in Education and ensure that primary education is free and compulsory for all (Portugal); ',
 '113.4 Ratify the Protocol to the American Convention for Human Rights (Norway); ']

### 1.5 Just keep recommendations

Take a look at one of the sections in the `sections` dictionary. For each section, the first line is the header paragraph, explaining how these recommendations are linked (e.g. they've all been supported or rejected). Below that are the items belonging to that section, containing specific recommendations. The latter is what we really care about. So we can delete the first item of each list contained in the `sections` dictionary. 

Go ahead and delete the first line of the list values in `sections`.

**Note: There are multiple ways to do this.**
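
Two of the possible approaches, sketched on a made-up dictionary (the keys and values are placeholders): slicing builds new lists and leaves the originals alone, while `del` changes the existing lists in place.

```python
# made-up dictionary of lists, for illustration only
toy = {'a': ['header', 'x', 'y'], 'b': ['header', 'z']}

# Option 1: slicing returns a NEW list without the first item
sliced = {key: value[1:] for key, value in toy.items()}

# Option 2: del removes the first item from the EXISTING list
for key in toy:
    del toy[key][0]
```

With the in-place version, running the cell a second time removes another line from each list, so it is worth re-running the earlier cells before re-testing.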

In [None]:
# YOUR CODE HERE

In [153]:
# test your code
sections['support'][:5]

['113.1 Incorporate the Rome Statute into national law (Mexico); ',
 '113.2 Consider ratifying the United Nations Educational, Scientific and Cultural Organization (UNESCO) Convention against Discrimination in Education (Ghana); ',
 '113.3 Ratify the UNESCO Convention against Discrimination in Education and ensure that primary education is free and compulsory for all (Portugal); ',
 '113.4 Ratify the Protocol to the American Convention for Human Rights (Norway); ',
 '113.5 Further strengthening, as to its funding and independence, of the national preventive mechanism under the Optional Protocol to the Convention against Torture (OP-CAT) so that it can function effectively and impartially (Czech Republic); ']

___

## 2.  Make a function

### 2.1 

We want to do the above parsing for each document in the list. Using the code you wrote in section 1, write a function called `parse` that takes a document (from the original upr list) and returns a dictionary object `sections` for that document.

In [179]:
# YOUR (FUNCTION) CODE HERE

In [171]:
# test your code
parse(upr)['support'][:5]

['82.1. Continue the efforts to achieve accession to the main human rights international instruments and their consistent incorporation into domestic legislation (Costa Rica); ',
 '82.2. Consider ratifying new international human rights instruments which would assist in strengthening its legal and institutional framework for the promotion and protection of human rights (Nicaragua); ',
 '82.3. Continue its efforts to accede to the remaining core international human rights treaties, which will strengthen the domestic legislation with regard to the promotion and protection of human rights, including freedom of religion or belief (Turkey); ',
 '82.4. Work closely with the OHCHR and the Council for considering eventual participation to the core international instruments on human rights (Viet Nam); ',
 '82.5. Further continue internal consultations and request the technical assistance of relevant UN institutions with regards to the accession to the core international human rights treaties (A

### 2.2 

Loop through all the dictionaries in the list `l` and assign a new `sections` key to the output from the above function. When you're done, your list `l` should contain items that look like this (remember: dictionaries are unordered):

```
[ 
    {'country': 'afghanistan',
     'year': '2014',
     'sections': {'support': <some list>,
                  'consider': <some list>,
                  'reject': <some list>,
                 }
    },

    {'country': 'albania',
     'year': '2014',
     'sections': {'support': <some list>,
                  'consider': <some list>,
                  'reject': <some list>,
                 }
    },

    ...
]
```               
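
The general pattern of adding a computed key to every dictionary in a list can be sketched as follows. The data and the `count_words` helper are hypothetical and only stand in for the list `l` and your own function.

```python
def count_words(doc):
    # hypothetical helper: computes something from one dictionary
    return len(doc['text'].split())

# made-up list of dictionaries, for illustration only
docs = [{'name': 'a', 'text': 'one two'},
        {'name': 'b', 'text': 'three'}]

# each loop variable refers to the dictionary itself,
# so assigning a key changes the item stored in the list
for doc in docs:
    doc['n_words'] = count_words(doc)
```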

In [172]:
# YOUR CODE HERE

In [155]:
#uncomment to test
l[0]['sections']['support'][:5]

['136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); ',
 '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ',
 '136.3 Make further efforts to ensure the implementation of the legal framework which guarantees human rights, including the Constitution (Japan); ',
 '136.4 Further fulfil the internationally taken human rights obligations as well as integrate them into the national legislation (Kazakhstan); ',
 '136.5 Further strengthen its efforts to review its legislative framework and make necessary adjustments to it in order to ensure that it is in conformity with Afghanistan\xd5s international human rights obligations (Norway); ']

___

## 3. Get information about specific recommendations

### 3.1 

Take a look at a recommendation. I've given you a sample one below.

In [156]:
# get the first line, from the first section, of the first upr in `l`
rec = l[0]['sections']['support'][0]
print rec

136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); 


Notice that they're all formatted the same way, with the recommending country in parentheses at the end.

In the cell below, find a way to pull out the recommending country. (Hint: Use the `split` method.)
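
One possible way to use `split` here, sketched on made-up recommendations (the text and country names are invented). The second example shows why it may be safer to take the last parenthesised chunk rather than the first, since a recommendation can contain other parentheses earlier in the text.

```python
# made-up recommendations, for illustration only
rec1 = '1.1 Do something useful (Freedonia); '
rec2 = '1.2 Ratify the Example Treaty (ET) without delay (Sylvania); '

# split on '(' and keep the LAST piece, then cut everything from ')' onwards
country1 = rec1.split('(')[-1].split(')')[0]
country2 = rec2.split('(')[-1].split(')')[0]

# taking the piece after the FIRST '(' would pick up the abbreviation instead
first_chunk = rec2.split('(')[1].split(')')[0]
```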

In [173]:
# YOUR CODE HERE

### 3.2 

Create a function called `get_country` that takes an individual recommendation and returns the recommending country.

In [174]:
# YOUR CODE HERE

In [175]:
# test your code
get_country(l[0]['sections']['support'][0])

'Ethiopia'

### 3.3

We now want to create a new list called `reclist` containing just individual recommendations. Each recommendation should be a dictionary with the following keys: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (2013 or 2014 here)
4. `decision`: whether the recommendation was supported, rejected, etc.
5. `text`: the text of the recommendation

Create your `reclist` by looping through your list `l`. (Hint: You'll need to use loops within loops.)
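
The "loops within loops" pattern for flattening nested data into one row per item can be sketched on a made-up structure. All names and values below are invented and only mirror the shape of a list of dictionaries that each hold a dictionary of lists.

```python
# made-up nested data: a list of dicts, each holding a dict of lists
rooms = [{'room': 'attic', 'boxes': {'old': ['lamp', 'rug'], 'new': ['fan']}},
         {'room': 'cellar', 'boxes': {'old': ['jar']}}]

rows = []
for room in rooms:                                # outer loop: one dictionary per room
    for label, items in room['boxes'].items():    # middle loop: one list per label
        for item in items:                        # inner loop: one row per item
            rows.append({'room': room['room'],
                         'label': label,
                         'item': item})
```

Each innermost pass has access to the variables of all the enclosing loops, which is what lets a row combine information from every level.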

In [None]:
# YOUR CODE HERE

In [180]:
# test your code
reclist[1]

{'decision': 'support',
 'from': 'Indonesia',
 'text': '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ',
 'to': 'afghanistan',
 'year': '2014'}

## 4. Write a CSV

We now want to write a CSV file containing the information from `reclist`.

### 4.1
To do this, first create a variable called `keys` containing the keys from the first dictionary in `reclist`.

In [None]:
# YOUR CODE HERE

In [177]:
#test your code
keys

['text', 'to', 'decision', 'from', 'year']

### 4.2 

Using the `DictWriter` class from the `csv` module, write the contents of `reclist` into a file called `upr-recs.csv`.
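
For reference, a minimal sketch of how `csv.DictWriter` is typically used, on made-up rows. It assumes Python 3 and writes to an in-memory buffer instead of a real file so that it can be run anywhere; with a real file you would pass an open file object instead (in Python 2, which this notebook uses, the file is usually opened in `'wb'` mode).

```python
import csv
import io

# made-up rows, for illustration only
rows = [{'name': 'a', 'size': 3},
        {'name': 'b', 'size': 5}]
fieldnames = ['name', 'size']   # column order in the output

buffer = io.StringIO()          # stands in for an open file
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()            # first row: the column names
writer.writerows(rows)          # one row per dictionary

output = buffer.getvalue()
```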

In [178]:
# YOUR CODE HERE

## 5. Gotcha!

I lied. Not all the upr documents have just 3 paragraphs following the same structure. Some of them have 2. Some of them have 4. And while most have `supported` and `considered` as sections, others have recommendations they refer to as `already implemented`. 

To sum up, there are 4 kinds of sections contained in these documents:

1. Supported recommendations
2. Already implemented recommendations
3. Considered recommendations
4. Rejected recommendations

But any combination of these four can be present in any given document. There's no way to tell, *a priori*, how many sections the "Conclusions and Recommendations" portion will contain.

(The full text collection is in the directory `data/txts-extra`)

Write me an algorithm -- or a list of step-by-step instructions -- that walks me through how you would solve for this. That is, what extra steps would you have to include from the code above to get it to work? Remember, the final product has to be a CSV with each row being a recommendation, and a column for `year`, `to`, `from`, `text`, and `decision`.
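
As a starting point only, one possible building block is sketched below: group consecutive lines by their main paragraph number with `itertools.groupby`, then label each group from the wording of its header line. Everything here is an assumption made for illustration: the toy lines are invented, and the phrases in `HEADER_RULES` are hypothetical placeholders that would need to be replaced after reading the real headers in `data/txts-extra`.

```python
import re
from itertools import groupby

def main_number(line):
    # leading integer of a line, e.g. '6.1 text' -> 6
    return int(re.match(r'\d+', line).group())

# HYPOTHETICAL phrase -> label rules; more specific phrases come first
HEADER_RULES = [('already implemented', 'implemented'),
                ('not enjoy', 'reject'),
                ('will be examined', 'consider'),
                ('enjoy the support', 'support')]

def label_for(header):
    text = header.lower()
    for phrase, label in HEADER_RULES:
        if phrase in text:
            return label
    return 'unknown'

# made-up lines, for illustration only
toy_lines = ['5. These recommendations enjoy the support of the State:',
             '5.1 First supported item (Freedonia); ',
             '5.2 Second supported item (Sylvania); ',
             '6. These recommendations are considered already implemented:',
             '6.1 An implemented item (Freedonia); ',
             '7. These recommendations did not enjoy the support of the State:',
             '7.1 A rejected item (Sylvania); ']

sections = {}
for number, group in groupby(toy_lines, key=main_number):
    group = list(group)
    header, recs = group[0], group[1:]   # first line of each group is the header
    sections[label_for(header)] = recs
```

This sketch does not depend on how many sections a document has, but it does depend on the header wording being predictable, which is exactly what would need checking against the real files.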

***YOUR ANSWER HERE***

*****

# Optional Challenge

Code it!

****
# How to submit homework

Once you're finished with your homework and want to submit it, follow these instructions:
    
1. Go to File --> Download as PDF
2. Submit that PDF on bCourses!