# Problem Set: Information Retrieval

## Part 2: Structuring + Manipulating Data

In this problem set, you'll be continuing to work with the data you imported during lecture. You'll be structuring and manipulating that data to turn it into a tabular dataset.

Specifically, you will:

1. Split the text into "chunks" that correspond to major sections
2. Parse those chunks into main text and subitems
3. Repeat steps 1-2 by creating a function and looping through documents
4. Get information about specific recommendations
5. Write a CSV file from the data

To get started, import the following modules and run the code from lecture.

In [3]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

___

## Preface: Run Code from Lecture

Run the code below to get started

In [4]:
# merge lines so that each line starts with a number
def mergeLines(l):
    '''
    This function takes in a list of lines `l` and merges broken paragraph lines
    (a line that doesn't start with a number is merged into the line before it)
    '''
    i = 0
    while i < len(l):
        if not l[i][0].isdigit():
            l[i-1:i+1] = [' '.join(l[i-1:i+1])]
        else:
            i = i+1
    return(l)
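
`mergeLines` assumes that the first line starts with a number and that no line is empty. As a sketch only, here is a more defensive variant that returns a new list instead of editing the input in place. The name `merge_lines_safe` and the sample lines are made up for illustration and are not part of the lecture code.

```python
def merge_lines_safe(lines):
    """Hypothetical variant of mergeLines: returns a new list of merged paragraphs."""
    merged = []
    for line in lines:
        if not line:
            continue  # skip empty strings, which would break line[0]
        if line[0].isdigit() or not merged:
            merged.append(line)  # a numbered line (or the very first line) starts a paragraph
        else:
            merged[-1] = merged[-1] + ' ' + line  # continuation of the previous paragraph
    return merged

sample = ['113. Header text',
          '113.1 First rec that',
          'wraps onto a second line',
          '113.2 Second rec']
print(merge_lines_safe(sample))
```

The wrapped line is joined to `113.1`, so the result has three paragraphs.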

In [5]:
l=[]
broken = [] # files that failed to parse (kept outside the loop so it is not reset for every file)
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print 'processing ' + file_name + '...'
        try:
            dic = {}
            dic['country'] = file_name[:-8]
            dic['year'] = file_name[-8:-4]
            f = open(dir + '/' + file_name,'rU')
            text = f.read() # read in text
            f.close() # call close(); a bare `f.close` does nothing
            text = text.split('\n') # make a list
            text = filter(None, text) # get rid of empty string items       
             
            # take only the conclusions and/or recommendations section
            ConclusionsStart = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][1]) # second match (index 1): start of the section
            ConclusionsEnd = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][2]) # third match (index 2): the disclaimer
            text = text[ConclusionsStart+1:ConclusionsEnd+1] 
            
            # get rid of the weird lines
            text = [line for line in text if '**' not in line]
            text = [line for line in text if 'recommendations have not been edited.' not in line]
            text = [line for line in text if 'recommendations will not be edited.' not in line]
            text = [line.replace('\xd2','') for line in text]
            text = [line.replace('\t','') for line in text]
            text = [line.lstrip(" ") for line in text]
            
            # merge lines so that each line is its own paragraph, starting with a paragraph number
            text = mergeLines(text)
            
            # get rid of that disclaimer paragraph
            text = [line for line in text if 'endorsed by the working group' not in line.lower()]
            
            dic['text'] = text 
            
            # append to list
            l.append(dic)
              
        except Exception as e:
            broken.append(file_name + ': ' + str(e)) 

processing afghanistan2014.txt...
processing albania2014.txt...
processing bangladesh2013.txt...
processing belize2013.txt...
processing bolivia2014.txt...
processing botswana2013.txt...
processing cotedivoire2014.txt...
processing djibouti2013.txt...
processing elsalvador2014.txt...
processing fiji2014.txt...
processing jordan2013.txt...
processing kazakhstan2014.txt...
processing monaco2013.txt...
processing montenegro2013.txt...
processing sanmarino2014.txt...
processing serbia2013.txt...
processing turkmenistan2013.txt...
processing tuvalu2013.txt...
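
The loading loop gets `country` and `year` by slicing the file name. This works because every name ends in a four-digit year followed by `.txt`. A small sketch of what the two slices do, using one of the file names from the output above:

```python
file_name = 'afghanistan2014.txt'

# the last 8 characters are the year plus the extension: '2014.txt'
country = file_name[:-8]   # everything before the year
year = file_name[-8:-4]    # the four digits before '.txt'

print(country)
print(year)
```

Note that `year` is stored as a string, not an integer.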


## 1. Chunk into sections

### 1.1

We'll first be working with a single UPR report. Make an object called `upr` that contains the fifth item in the list `l`.

In [6]:
# assign `upr` as the fifth item in the list
upr = l[4]

### 1.2

These texts have 3 sections each. The first section contains the recommendations the country supports. The second section contains recommendations the country will consider. The third contains recommendations the country explicitly rejects. 

Each section starts with a main paragraph number (e.g. **123**). The individual recommendations are then numbered as subparagraphs (e.g. **123.1, 123.2**, etc.).

The problem is, we don't know what these paragraph numbers are *a priori*. Luckily, Rochelle wrote you a function that takes a document and returns the main paragraph numbers in that document's "Conclusions and Recommendations" section.

In [7]:
# function to find main paragraphs numbers in each upr
def mainParagraphs(upr):
    '''
    This function takes in a upr and returns the main paragraph numbers in the 'recommendations' section.
    '''
    firstParagraph = upr['text'][0].partition(" ")[0]
    if '.' in firstParagraph:
        firstParagraph = firstParagraph.replace(".","")
    
    # the first section contains the supported recs
    firstParagraph = int(firstParagraph)
    
    # the second section contains the considered recs
    secondParagraph = firstParagraph + 1
    
    # the third section contains the rejected recs
    thirdParagraph = secondParagraph + 1
    
    return(firstParagraph, secondParagraph, thirdParagraph)

# Uncomment to test
mainParagraphs(upr)

(113, 114, 115)

In your own words, explain how this function works. Feel free to make new cells and explore each line of the function, and/or litter the function with print statements.
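
If you want a starting point for exploring, the sketch below runs the first few steps of the function on one sample line. The line is a shortened version of a header from the Bolivia report. The variable names are only for illustration.

```python
line = '113.  The recommendations formulated during the interactive dialogue ...'

# partition(" ") splits at the first space and returns (before, separator, after)
first_token = line.partition(" ")[0]
print(first_token)

# drop the trailing period and convert to an integer
number = int(first_token.replace(".", ""))
print(number)

# the function assumes the next two sections are numbered consecutively
print(number + 1)
print(number + 2)
```

One thing to notice: `replace(".", "")` removes every period, so this only works if the first line is a main paragraph such as `113.` and not a subparagraph such as `113.1`.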

***YOUR ANSWER HERE***

### 1.3

Using this function, create three new variables, `support_P`, `consider_P`, and `reject_P`. Each variable should point to the paragraph number that is associated with that section. In this case, `support_P` should point to `113`. 

**Important: Don't just assign the numbers. Remember, we're going to generalize this code for all the uprs, and the individual paragraph numbers will change. So use the results from the function above.**

In [8]:
support_P = mainParagraphs(upr)[0]
consider_P = mainParagraphs(upr)[1]
reject_P = mainParagraphs(upr)[2]

# test your code
print support_P
print consider_P
print reject_P

113
114
115


### 1.4 

Create a new dictionary called `sections`. The dictionary should have three keys: `support`, `consider`, and `reject`. Each key should point to a list of lines in that section. Your final dictionary should look something like this:

```
{ 'support': [113. Some text, 113.1 A supported rec, 113.2 Another rec, 113.3 Even more...],
    
  'consider': [114. A new section, 114.1 A considered rec, 114.2 Even more recs, 114.3 You get the picture...] ,
  
  'reject': [115. The last section, 115.1 A rejected rec, 115.2 Even more recs, 115.3 Etc ...]
}
```

*hint*: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method.

In [9]:
sections = {}
sections['support'] = [s for s in upr['text'] if s.startswith(str(support_P))]
sections['consider'] = [s for s in upr['text'] if s.startswith(str(consider_P))]
sections['reject'] = [s for s in upr['text'] if s.startswith(str(reject_P))]

# test your code
sections['support'][:5]

['113.  The recommendations formulated during the interactive dialogue/listed below have been examined by the Plurinational State of Bolivia and enjoy the support of the State party: ',
 '113.1 Incorporate the Rome Statute into national law (Mexico); ',
 '113.2 Consider ratifying the United Nations Educational, Scientific and Cultural Organization (UNESCO) Convention against Discrimination in Education (Ghana); ',
 '113.3 Ratify the UNESCO Convention against Discrimination in Education and ensure that primary education is free and compulsory for all (Portugal); ',
 '113.4 Ratify the Protocol to the American Convention for Human Rights (Norway); ']
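
Keep in mind that `.startswith()` compares characters, not numbers, so the prefix `'113'` would also match a line numbered `1130.1`. That is unlikely in these reports, but if you want a stricter test, one possible sketch is to compare the part of the paragraph number before the first period. The helper name `in_section` and the sample lines are assumptions for illustration.

```python
def in_section(line, main_number):
    """Hypothetical helper: True if the line's paragraph number belongs to main_number."""
    token = line.partition(" ")[0]        # e.g. '113.4' or '113.'
    return token.split(".")[0] == str(main_number)

lines = ['113. Header', '113.1 A rec', '1130.1 Not ours', '114. Next header']

loose = [s for s in lines if s.startswith('113')]
strict = [s for s in lines if in_section(s, 113)]

print(len(loose))
print(strict)
```

The loose version keeps three lines, while the strict version keeps only the two that really belong to paragraph 113.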

### 1.5 Just keep recommendations

Take a look at one of the sections in the `sections` dictionary. For each section, the first line is the header paragraph, explaining how these recommendations are linked (e.g. they've all been supported or rejected). Below that are the items belonging to that section, containing specific recommendations. The latter is what we really care about. So we can delete the first item of each list contained in the `sections` dictionary. 

Go ahead and delete the first line of the list values in `sections`.

**Note: There are multiple ways to do this.**

In [10]:
for key in sections.keys():
    sections[key] = sections[key][1:]
    
# test your code
sections['support'][:5]

['113.1 Incorporate the Rome Statute into national law (Mexico); ',
 '113.2 Consider ratifying the United Nations Educational, Scientific and Cultural Organization (UNESCO) Convention against Discrimination in Education (Ghana); ',
 '113.3 Ratify the UNESCO Convention against Discrimination in Education and ensure that primary education is free and compulsory for all (Portugal); ',
 '113.4 Ratify the Protocol to the American Convention for Human Rights (Norway); ',
 '113.5 Further strengthening, as to its funding and independence, of the national preventive mechanism under the Optional Protocol to the Convention against Torture (OP-CAT) so that it can function effectively and impartially (Czech Republic); ']
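
Since there are multiple ways to drop the header line, here is a sketch of two alternatives on a toy dictionary (the toy data is made up). The first builds a new dictionary. The second changes the lists in place.

```python
toy = {'support': ['113. Header', '113.1 A rec'],
       'reject': ['115. Header', '115.1 Another rec']}

# Option A: build a new dictionary, slicing off the first item of each list
trimmed = {key: lines[1:] for key, lines in toy.items()}

# Option B: delete the first item of each list in place
for lines in toy.values():
    del lines[0]

print(trimmed == toy)
```

Option A leaves the original untouched, which is handy if you want to re-run a cell. Option B is shorter but cannot be run twice without deleting real recommendations.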

___

## 2.  Make a function

### 2.1 

We want to do the above parsing for each document in the list. Using the code you wrote in section 1, write a function that takes a document (one item from the list `l`) and returns a dictionary object `sections`.

In [11]:
# turn it into a function
def parse(document):
    
    upr = document
    
    # find the paragraph numbers associated with each section
    support_P = mainParagraphs(upr)[0]
    consider_P = mainParagraphs(upr)[1]
    reject_P = mainParagraphs(upr)[2]
    
    # separate upr out into chunks by mainP, i.e. the support, reject, consider chunks 
    # by identifying subparagraphs that start with the main paragraph numbers.
    sections = {}
    sections['support'] = [s for s in upr['text'] if s.startswith(str(support_P))]
    sections['consider'] = [s for s in upr['text'] if s.startswith(str(consider_P))]
    sections['reject'] = [s for s in upr['text'] if s.startswith(str(reject_P))]
    
    for key in sections.keys():
        sections[key] = sections[key][1:]
    
    return(sections)

### 2.2 

Loop through all the dictionaries in the list `l` and store the output of the above function under a new `sections` key. When you're done, your list `l` should contain items that look like this (remember: dictionaries are unordered):

```
[ 
    {'country': 'afghanistan',
     'year': '2014',
     'sections': {'support': <some list>,
                  'consider': <some list>,
                  'reject': <some list>
                 }
    },

    {'country': 'albania',
     'year': '2014',
     'sections': {'support': <some list>,
                  'consider': <some list>,
                  'reject': <some list>
                 }
    },
    ...
]
```               

In [12]:
# apply to all docs
for i in l:
    i['sections'] = parse(i)

#uncomment to test
l[0]['sections']['support'][:5]

['136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); ',
 '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ',
 '136.3 Make further efforts to ensure the implementation of the legal framework which guarantees human rights, including the Constitution (Japan); ',
 '136.4 Further fulfil the internationally taken human rights obligations as well as integrate them into the national legislation (Kazakhstan); ',
 '136.5 Further strengthen its efforts to review its legislative framework and make necessary adjustments to it in order to ensure that it is in conformity with Afghanistan\xd5s international human rights obligations (Norway); ']

___

## 3. Get information about specific recommendations

### 3.1 

Take a look at a recommendation. I've given you a sample one below.

In [13]:
# get the first line, from the first section, of the first upr in `l`
rec = l[0]['sections']['support'][0]
print rec

136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); 


Notice that they're all formatted the same way, with the recommending country in parentheses at the end.

In the cell below, find a way to pull out the recommending country. (Hint: Use the `split` method.)

In [14]:
# your code here
rec.split('(')[-1].split(')')[0]

'Ethiopia'

### 3.2 

Create a function called `get_country` that takes an individual recommendation and returns the recommending country.

In [15]:
def get_country(rec):
    country = rec.split('(')[-1].split(')')[0]
    return(country)

# test your code
get_country(l[0]['sections']['support'][0])

'Ethiopia'
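
The `split` approach returns the whole string when a recommendation has no parentheses at all. As an alternative sketch, a regular expression can anchor the match to the end of the line and return an empty string when nothing is found. The function name `get_country_re` and the sample recommendation are assumptions for illustration.

```python
import re

def get_country_re(rec):
    """Hypothetical regex-based variant of get_country."""
    # last parenthesized group, followed only by whitespace, ';' or '.' until the end
    match = re.search(r'\(([^()]*)\)[\s;.]*$', rec)
    return match.group(1) if match else ''

rec = '136.1 Strengthen the mechanism (OP-CAT) to fully protect human rights (Ethiopia); '
print(get_country_re(rec))
print(repr(get_country_re('a line without any parentheses')))
```

The earlier `(OP-CAT)` is ignored because it is not at the end of the line.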

### 3.3

We now want to create a new list called `reclist` containing just individual recommendations. Each recommendation should be a dictionary with the following keys: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (2013 or 2014 here)
4. `decision`: whether the recommendation was supported, rejected, etc.
5. `text`: the text of the recommendation

Create your `reclist` by looping through your list `l`. (Hint: You'll need to use loops within loops.)

In [16]:
# make dictionaries for each individual recommendation item
reclist = []
for upr in l:
    for decision in upr['sections'].keys():
        items = upr['sections'][decision]
        for rec in items:
            dic = {}
            dic['to'] = upr['country']
            dic['year'] = upr['year']
            dic['decision'] = decision
            dic['from'] = get_country(rec)
            dic['text'] = rec
            reclist.append(dic)  

# uncomment to test
print len(reclist)
reclist[1]

2772


{'decision': 'support',
 'from': 'Indonesia',
 'text': '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ',
 'to': 'afghanistan',
 'year': '2014'}
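
A quick way to sanity-check a list like `reclist` is to count how many recommendations fall under each decision. The sketch below uses `collections.Counter` on a small made-up list. With the real data you would pass `reclist` instead of `toy_reclist`.

```python
from collections import Counter

# toy stand-in for reclist (made-up rows with only the keys needed here)
toy_reclist = [
    {'to': 'afghanistan', 'decision': 'support'},
    {'to': 'afghanistan', 'decision': 'reject'},
    {'to': 'albania', 'decision': 'support'},
]

counts = Counter(rec['decision'] for rec in toy_reclist)
print(counts['support'])
print(counts['reject'])
```

If a decision you expect is missing from the counts, that usually points to a section that was never parsed.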

## 4. Write a CSV

We now want to write a CSV file containing the information from `reclist`.

To do this, first create a variable called `keys` containing the keys from the first dictionary in `reclist`.

In [17]:
#writing column headings
keys = reclist[0].keys()
keys

['text', 'to', 'decision', 'from', 'year']

Using the `DictWriter` class from the `csv` module, write `reclist` to a file called `upr-recs.csv`.

In [18]:
#writing the rest
with open('upr-recs.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(reclist)
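
The cell above targets Python 2, where CSV files are opened in binary mode (`'wb'`). In Python 3, the `csv` module expects a text-mode file opened with `newline=''`. Below is a minimal Python 3 sketch. The toy row, the temporary path, and the explicit `fieldnames` list are assumptions for illustration.

```python
import csv
import os
import tempfile

toy_reclist = [
    {'to': 'afghanistan', 'from': 'Ethiopia', 'year': '2014',
     'decision': 'support', 'text': '136.1 A sample recommendation (Ethiopia); '},
]
fieldnames = ['to', 'from', 'year', 'decision', 'text']  # explicit column order

path = os.path.join(tempfile.mkdtemp(), 'upr-recs-sample.csv')

# Python 3: text mode with newline='' so the csv module controls line endings
with open(path, 'w', newline='') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(toy_reclist)

# read the file back to check what was written
with open(path, newline='') as input_file:
    rows = list(csv.DictReader(input_file))
print(rows[0]['from'])
```

Passing the column names yourself also gives a predictable column order, instead of whatever order `keys()` happens to return.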

## 5. Gotcha!

I lied. Not all the upr documents have just 3 paragraphs following the same structure. Some of them have 2. Some of them have 4. And while most have `supported` and `considered` as sections, others have recommendations they refer to as `already implemented`. 

To sum up, there are 4 kinds of sections contained in these documents:

1. Supported recommendations
2. Already implemented recommendations
3. Considered recommendations
4. Rejected recommendations

But any combination of these four can be present in any given document. There's no way to tell, a priori, how many sections the "Conclusions and Recommendations" portion will contain.

Write me an algorithm -- or a list of step-by-step instructions -- that walks me through how you would solve for this. That is, what extra steps would you have to include from the code above to get it to work? Remember, the final product has to be a CSV with each row being a recommendation, and a column for `year`, `to`, `from`, `text`, and `decision`.

***YOUR ANSWER HERE***

*****

# Optional Challenge

Code it!

First we have to reload the data, this time from the directory `txts-extra`.

In [19]:
l=[]
broken = [] # files that failed to parse (kept outside the loop so it is not reset for every file)
dir = 'data/txts-extra'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print 'processing ' + file_name + '...'
        try:
            dic = {}
            dic['country'] = file_name[:-8]
            dic['year'] = file_name[-8:-4]
            f = open(dir + '/' + file_name,'rU')
            text = f.read() # read in text
            f.close() # call close(); a bare `f.close` does nothing
            text = text.split('\n') # make a list
            text = filter(None, text) # get rid of empty string items       
             
            # take only the conclusions and/or recommendations section
            ConclusionsStart = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][1]) # second match (index 1): start of the section
            ConclusionsEnd = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][2]) # third match (index 2): the disclaimer
            text = text[ConclusionsStart+1:ConclusionsEnd+1] 
            
            # get rid of the weird lines
            text = [line for line in text if '**' not in line]
            text = [line for line in text if 'recommendations have not been edited.' not in line]
            text = [line for line in text if 'recommendations will not be edited.' not in line]
            text = [line.replace('\xd2','') for line in text]
            text = [line.replace('\t','') for line in text]
            text = [line.lstrip(" ") for line in text]
            
            # merge lines so that each line is its own paragraph, starting with a paragraph number
            text = mergeLines(text)
            
            # get rid of that disclaimer paragraph
            text = [line for line in text if 'endorsed by the working group' not in line.lower()]
            
            dic['text'] = text 
            
            # append to list
            l.append(dic)
              
        except Exception as e:
            broken.append(file_name + ': ' + str(e)) 

processing afghanistan2014.txt...
processing albania2014.txt...
processing angola2014.txt...
processing azerbaijan2013.txt...
processing bahamas2013.txt...
processing bangladesh2013.txt...
processing barbados2013.txt...
processing belize2013.txt...
processing bhutan2014.txt...
processing bolivia2014.txt...
processing bosniaandherzegovina2014.txt...
processing botswana2013.txt...
processing bruneidarussalam2014.txt...
processing burkinafaso2013.txt...
processing burundi2013.txt...
processing cambodia2014.txt...
processing cameroon2013.txt...
processing canada2013.txt...
processing capeverde2013.txt...
processing centralafricanrepublic2013.txt...
processing chad2013.txt...
processing chile2014.txt...
processing china2013.txt...
processing colombia2013.txt...
processing comoros2014.txt...
processing congo2013.txt...
processing costarica2014.txt...
processing cotedivoire2014.txt...
processing cuba2013.txt...
processing cyprus2014.txt...
processing democraticpeoplesrepublicofkorea2014.txt..

Next we change the `mainParagraphs` function so that it gives us all the main paragraph numbers in the document. Note that this function does not assume 3 sections. If the document has 2 main sections, it'll give us 2 numbers. If it has 4 main sections, it'll give us 4 numbers, etc.

In [26]:
# function to find main paragraphs numbers in each upr
def mainParagraphs(upr):
    '''
    This function takes in a upr and returns the main paragraph numbers in the 'recommendations' section.
    There are usually 2-4 main paragraphs. Sometimes I refer to these main paragraph sections as "chunks".
    '''    
    numbers = []
    for line in upr['text']:
        paragraph = line.partition(" ")[0]
        if paragraph[-1] == '.':
            paragraph = paragraph[:-1]
                
        # float() handles subparagraphs like '109.1'; int() below drops the decimal part
        numbers.append(float(paragraph))
        
    # a set keeps each main paragraph number once (note: sets are unordered)
    return set([int(n) for n in numbers])

# Uncomment to test
mainParagraphs(l[3])

{109, 110}
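
Because a set has no guaranteed order, code that loops over this result may visit the sections in any order. If you want them in document order, one option is to return a sorted list. The sketch below works on a plain list of lines rather than a `upr` dictionary. The name `main_paragraph_numbers` and the sample lines are assumptions for illustration.

```python
def main_paragraph_numbers(lines):
    """Hypothetical variant: sorted list of the main paragraph numbers in `lines`."""
    numbers = set()
    for line in lines:
        token = line.partition(" ")[0].rstrip(".")   # '109.' -> '109', '109.10' stays
        numbers.add(int(float(token)))               # 109.1 -> 109
    return sorted(numbers)

sample = ['110. Header two', '110.1 A rec',
          '109. Header one', '109.1 A rec', '109.10 Another rec']
print(main_paragraph_numbers(sample))
```

Even though the sample lines are out of order, the result lists 109 before 110.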

Now we create a function (similar to `parse` in 2.1) that does the following:
1. inputs a dictionary from `l` that represents a document
2. outputs a list, `sections`, that holds 2-4 dictionaries, each dictionary representing a section in the document. So instead of `sections` being a dictionary (as above) it is now a list of dictionaries, with each dictionary having two keys: 'decision' (support, reject, etc) and 'items' (list of recs)

In [27]:
# turn it into a function
def parse(upr):
    
    # separate upr out into chunks by mainP, i.e. the support, reject, consider chunks 
    # by identifying subparagraphs that start with the main paragraph numbers.
    sections = []
    for n in mainParagraphs(upr):
        dic = {}
        dic['paragraph'] = n
        dic['text'] = [s for s in upr['text'] if s.startswith(str(n))]
        sections.append(dic)
    
    # drop all the chunks with only 1 paragraph -- these don't actually contain any recs.
    # (build a new list: removing items from a list while looping over it skips elements)
    sections = [dic for dic in sections if len(dic['text']) > 1]
    
    # parse into header (first line) and items (list of recommendations)
    for dic in sections:
        dic['header'] = dic['text'][0]
        dic['items'] = dic['text'][1:]
    
    # assign a decision
    for dic in sections: 
        text = dic['header'] 
        decision = ''
        if 'implemented' in text or 'process of implementation' in text:
            decision = 'implemented'
        elif 'will be examined' in text or 'will examine' in text or "further examined" in text or \
        "Responses to the following recommendations will be provided" in text or \
        "will be included in the outcome report" in text or "will be provided in due course" in text or \
        "course of the discussion" in text:
            decision = 'consider'
        elif 'not enjoy the support' in text or 'reject' in text or 'cannot be accepted' in text:
            decision = 'reject'
        elif 'support' in text and 'did not enjoy the support' not in text:
            decision = 'support'
        elif 'have been noted by' in text or 'were noted by' in text:
            decision = 'noted'
        elif 'do not reflect the current situation' in text:
            decision = 'reject'
        else:
            decision = 'unknown'
        dic['decision'] = decision
        
    # delete unnecessary keys
    for dic in sections:
        del dic['text']
        del dic['paragraph']
        del dic['header']
    
    return(sections)
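
A note on dropping the single-paragraph chunks: calling `remove` on a list while looping over that same list can skip elements, so it is safer to build a new list. The sketch below shows the difference on made-up chunks.

```python
chunks = [{'text': ['1. a']},
          {'text': ['2. b']},
          {'text': ['3. c', '3.1 d']}]

# Removing while iterating: after each removal the loop skips the next element
buggy = list(chunks)
for dic in buggy:
    if len(dic['text']) == 1:
        buggy.remove(dic)
print(len(buggy))

# Building a new list checks every element exactly once
kept = [dic for dic in chunks if len(dic['text']) > 1]
print(len(kept))
```

The first approach leaves two chunks behind, because the `'2. b'` chunk is never checked. The second keeps only the chunk that has recommendations.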

Now we loop through all the dictionaries in the list `l` and store the output of the above function under a new `sections` key. When we're done, our list `l` should contain items that look like this (remember: dictionaries are unordered):
```
[ 
    {'country': 'afghanistan',
     'year': '2014',
     'sections': [
                  {'decision': <some string>,
                   'items': <some list> },

                  {'decision': <some string>,
                   'items': <some list> }
                 ]
    },

    {'country': 'albania',
     'year': '2014',
     'sections': [
                  {'decision': <some string>,
                   'items': <some list> },

                  {'decision': <some string>,
                   'items': <some list> }
                 ]
    },
    ...
]
```   

In [28]:
# apply to all docs
for i in l:
    i['sections'] = parse(i)

#uncomment to test
l[0]['sections'][1]['decision']

'consider'
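
The decision is assigned by looking for key phrases in each section's header, and the order of the checks matters. The sketch below expresses the same idea as an ordered table of rules. It covers only a subset of the phrases used in `parse`, and the names `RULES` and `classify_header` are assumptions for illustration.

```python
# (decision, phrases) pairs, checked in order; a simplified subset of the phrases in parse
RULES = [
    ('implemented', ['implemented', 'process of implementation']),
    ('consider', ['will be examined', 'will examine', 'further examined']),
    ('reject', ['not enjoy the support', 'reject', 'cannot be accepted']),
    ('support', ['support']),
    ('noted', ['have been noted by', 'were noted by']),
]

def classify_header(header):
    """Hypothetical helper: return the first decision whose phrases appear in the header."""
    for decision, phrases in RULES:
        if any(phrase in header for phrase in phrases):
            return decision
    return 'unknown'

header = ('113.  The recommendations formulated during the interactive dialogue/listed below '
          'have been examined by the Plurinational State of Bolivia and enjoy the support '
          'of the State party: ')
print(classify_header(header))
print(classify_header('The recommendations below did not enjoy the support of the State: '))
```

The `reject` rule has to come before the `support` rule, because a header saying "did not enjoy the support" also contains the word "support".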

We now want to create a new list called `reclist` containing just individual recommendations. Each recommendation should be a dictionary with the following keys: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (2013 or 2014 here)
4. `decision`: whether the recommendation was accepted, rejected, etc.
5. `text`: the text of the recommendation

In [29]:
# make dictionaries for each individual recommendation item
reclist = []
for upr in l:
    for section in upr['sections']:
        for item in section['items']:
            dic = {}
            dic['to'] = upr['country']
            dic['year'] = upr['year']
            dic['decision'] = section['decision']
            dic['from'] = item.split('(')[-1].split(')')[0]
            dic['text'] = item
            reclist.append(dic)  

# uncomment to test
print len(reclist)
reclist[200]

14736


{'decision': 'consider',
 'from': 'Estonia',
 'text': '137.24 Ratify the Kampala Amendments to the Rome Statute and the Agreement on the Privileges and Immunities of the International Criminal Court (Estonia); ',
 'to': 'afghanistan',
 'year': '2014'}

Writing the CSV works the same way as in section 4.