# Problem Set: Information Retrieval

## Part 2: Structuring + Manipulating Data

In this problem set, you'll be continuing to work with the data you imported during lecture. You'll be structuring and manipulating that data to turn it into a tabular dataset.

In [32]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

___

## Preface: Run Code from Lecture

Run the code below to get started.

In [33]:
# merge lines so that each line starts with a number
def mergeLines(l):
    '''
    This function takes in a list of lines `l` and merges broken paragraph lines:
    any line that doesn't start with a number is joined onto the line before it.
    '''
    i = 0
    while i < len(l):
        # the i > 0 guard avoids looping forever if the very first line has no number
        if i > 0 and not l[i][0].isdigit():
            l[i-1:i+1] = [' '.join(l[i-1:i+1])]
        else:
            i = i+1
    return(l)
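
To see what `mergeLines` does, here is a minimal sketch on a made-up list of lines. The sample strings are invented for illustration, and the function is restated under the hypothetical name `merge_lines` (with a guard for the first line) so that the block runs on its own.

```python
def merge_lines(lines):
    """Join any line that does not start with a digit onto the line before it."""
    i = 0
    while i < len(lines):
        if i > 0 and not lines[i][0].isdigit():
            # fold line i into line i-1; do not advance, the list just got shorter
            lines[i-1:i+1] = [' '.join(lines[i-1:i+1])]
        else:
            i += 1
    return lines

# hypothetical sample lines, not taken from a real report
sample = [
    '101. Header text',
    '101.1 First item that',
    'wraps onto a second line',
    '101.2 Second item',
]
merged = merge_lines(list(sample))  # pass a copy: the function edits in place
```

Note that the list is modified in place, which is why the sketch passes a copy and keeps `sample` unchanged.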

In [34]:
l = []
broken = []  # files that failed to parse, with the error message
dir = 'data/txts'
for file_name in os.listdir(dir):
    if file_name.endswith(".txt"):
        print 'processing ' + file_name + '...'
        try:
            dic = {}
            dic['country'] = file_name[:-8]
            dic['year'] = file_name[-8:-4]
            f = open(dir + '/' + file_name,'rU')
            text = f.read() # read in text
            f.close()
            text = text.split('\n') # make a list
            text = filter(None, text) # get rid of empty string items       
             
            # take only the conclusions and/or recommendations section
            ConclusionsStart = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][1]) # starting from bottom
            ConclusionsEnd = text.index([line for line in text if "conclusions and/or recommendations" in line.lower()][2]) # the last one is the disclaimer
            text = text[ConclusionsStart+1:ConclusionsEnd+1] 
            
            # get rid of the weird lines
            text = [line for line in text if '**' not in line]
            text = [line for line in text if 'recommendations have not been edited.' not in line]
            text = [line for line in text if 'recommendations will not be edited.' not in line]
            text = [line.replace('\xd2','') for line in text]
            text = [line.replace('\t','') for line in text]
            text = [line.lstrip(" ") for line in text]
            
            # merge lines so that each line is its own paragraph, starting with a paragraph number
            text = mergeLines(text)
            
            # get rid of that disclaimer paragraph
            text = [line for line in text if 'endorsed by the working group' not in line.lower()]
            
            dic['text'] = text 
            
            # append to list
            l.append(dic)
              
        except Exception,e:
            broken.append(file_name +str(e)) 

processing afghanistan2014.txt...
processing albania2014.txt...
processing angola2014.txt...
processing bhutan2014.txt...
processing bolivia2014.txt...
processing bosniaandherzegovina2014.txt...
processing bruneidarussalam2014.txt...
processing cambodia2014.txt...
processing chile2014.txt...
processing comoros2014.txt...
processing costarica2014.txt...
processing cotedivoire2014.txt...
processing cyprus2014.txt...
processing democraticpeoplesrepublicofkorea2014.txt...
processing democraticrepublicofthecongo2014.txt...
processing dominica2014.txt...
processing dominicanrepublic2014.txt...
processing egypt2014.txt...
processing elsalvador2014.txt...
processing eritrea2014.txt...
processing ethiopia2014.txt...
processing fiji2014.txt...
processing gambia2014.txt...
processing guinea2014.txt...
processing iran2014.txt...
processing iraq2014.txt...
processing italy2014.txt...
processing kazakhstan2014.txt...
processing Macedonia2014.txt...
processing madagascar2014.txt...
processing newzeal

## 1. Chunking

1.0 We'll first be working with a single UPR report. Make an object `upr` that contains the fifth item in the list `l`.

In [35]:
# assign `upr` as the fifth item in the list (index 4)
upr = l[4]

These texts have sections for conclusions/recommendations the country either accepts, rejects, considers, etc. So the first task is to split the document into sections pertaining to those decisions.

The problem is, we don't know how many sections a document has a priori. Luckily, Rochelle wrote you a function that will tell you how many sections there are in a document.

1.1 Using this function, how many sections are there in this `upr`?

In [36]:
# function to find main paragraphs numbers in each upr
def mainParagraphs(upr):
    '''
    This function takes in a upr and returns the main paragraph numbers in the 'recommendations' section.
    There are usually 2-4 main paragraphs. Sometimes I refer to these main paragraph sections as "chunks".
    '''
    firstParagraph = upr['text'][0].partition(" ")[0]
    if '.' in firstParagraph:
            firstParagraph = firstParagraph.replace(".","")
    
    mainParagraphs = []
    for line in upr['text']:
        paragraph = line.partition(" ")[0]
        if paragraph[-1] == '.':
            paragraph = paragraph[:-1]
                
        mainParagraphs.append(float(paragraph))
        
    # make a list of the main paragraph numbers
    mainParagraphs = set([int(n) for n in mainParagraphs if int(n)>= int(firstParagraph)])
    return mainParagraphs 

# Uncomment to test
# mainParagraphs(upr)
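
As a starting point for exploring the function, the sketch below isolates its central step on a few made-up lines: take the leading token of each line, strip a trailing period, and let `int()` collapse sub-paragraph numbers onto their main paragraph. The sample lines are hypothetical, and the final `sorted` call is an addition for readable output (the function itself returns an unordered set).

```python
# hypothetical lines in the same "number text" shape as upr['text']
lines = [
    '113. Header',
    '113.1 First item',
    '113.2 Second item',
    '114. Another header',
    '114.1 Item',
]

numbers = []
for line in lines:
    token = line.partition(' ')[0]   # '113.' or '113.1'
    if token.endswith('.'):
        token = token[:-1]           # '113.' -> '113'
    numbers.append(float(token))     # 113.0, 113.1, 113.2, ...

# int() truncates 113.1 and 113.2 to 113, and the set drops the repeats
main_numbers = sorted(set(int(n) for n in numbers))
```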

1.2 In your own words, explain how this function works. Feel free to make new cells and explore each line of the function, and/or litter the function with print statements.

 ...

1.3 Now let's split up the text into sections. Create a new empty list called `sections`. For each section in the `upr`, create a dictionary with two keys: `paragraph-number`, containing the main paragraph number, and `full-text`, containing all the sub-paragraphs in that section. Your final list should look something like this:

```
[
    { "paragraph-number": 101,
      "full-text": [ 101. Some text, 101.1 More text, 101.2 Another line, 101.3 Even more] } ,
    
    { "paragraph-number": 102,
      "full-text": [ 102. A new line, 102.1 Some text, 102.2 Even more text, 102.3 You get the picture ] }
]
```

In [37]:
sections = []
for n in mainParagraphs(upr):
    dic = {}
    dic['paragraph'] = n
    dic['text'] = [s for s in upr['text'] if s.startswith(str(n))]
    sections.append(dic)

1.4 Some sections aren't really sections -- they're just single paragraphs with no subitems. We should delete those, because they don't contain any actual recommendations. Loop through the sections, and if a section contains only one line (meaning it has no subitems), delete that section. How many sections are we left with?

In [38]:
# delete all the chunks with only 1 paragraph
# (build a new list rather than removing items while iterating, which can skip elements)
sections = [dic for dic in sections if len(dic['text']) > 1]
len(sections)

3
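
One thing to watch for when deleting sections: removing items from a list while looping over that same list can skip elements, because everything after the removed item shifts one position to the left. The sketch below uses made-up sections to compare an in-loop `remove` with building a filtered list. The data and variable names are hypothetical.

```python
# hypothetical sections: 'b' and 'c' are adjacent single-line sections
sections_demo = [
    {'paragraph': 'a', 'text': ['1. header', '1.1 item']},
    {'paragraph': 'b', 'text': ['2. lone paragraph']},
    {'paragraph': 'c', 'text': ['3. lone paragraph']},
]

# removing inside the loop: after 'b' is removed, 'c' slides into its slot
# and the loop never visits it
looped = list(sections_demo)
for dic in looped:
    if len(dic['text']) == 1:
        looped.remove(dic)

# building a new list checks every element exactly once
filtered = [dic for dic in sections_demo if len(dic['text']) > 1]

remaining_after_loop = [dic['paragraph'] for dic in looped]
remaining_after_filter = [dic['paragraph'] for dic in filtered]
```

Here the in-loop version leaves `'c'` behind, while the filtered list keeps only `'a'`.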

## 2. Parsing into main text and subitems

2.1 Take a look at one of the sections in the `sections` list. For each section, the first paragraph is the header paragraph, explaining how these recommendations are linked (e.g. they've all been accepted or rejected). Below that are the items belonging to that section, containing specific recommendations. We want to parse the sections to capture this structure, dividing the lines into 'header' and 'items'.

Looping through each dictionary in the `sections` list, add the keys `header` and `items` to the dictionary, with their respective lines. The `header` key should contain only one line, while the `items` key should contain a list of lines. In your final result, the `sections` list should look like this:


```
[
    { "paragraph-number": 101,
      "full-text": [ 101. Some text, 101.1 More text, 101.2 Another line, 101.3 Even more]
      "header": 101. Some text
      "items": [101.1 More text, 101.2 Another line, 101.3 Even more] } ,
     
    { "paragraph-number": 102,
      "full-text": [ 102. A new section, 102.1 Some text, 102.2 Even more text, 102.3 You get the picture ] 
      "header": 102. A new section
      "items": [102.1 Some text, 102.2 Even more text, 102.3 You get the picture] }
]
```

In [56]:
for dic in sections:
    dic['header'] = dic['text'][0]
    dic['items'] = dic['text'][1:]

# uncomment to test
sections[0]['header']

'113.  The recommendations formulated during the interactive dialogue/listed below have been examined by the Plurinational State of Bolivia and enjoy the support of the State party: '

2.2 Now that we have the header paragraph isolated, we can use that to derive the 'decision' of the section -- e.g. accept, reject, etc. 

For each dictionary in the `sections` list, add a key called `decision` with the decision of that section. Your results should look something like this (ignore the content details, and remember: dictionaries are not ordered):

```
[
    { "paragraph-number": 101,
      "full-text": [ 101. Some text, 101.1 More text, 101.2 Another line, 101.3 Even more]
      "header": 101. Some text
      "items": [101.1 More text, 101.2 Another line, 101.3 Even more] 
      "decision": "accept" } ,
     
    { "paragraph-number": 102,
      "full-text": [ 102. A new section, 102.1 Some text, 102.2 Even more text, 102.3 You get the picture ] 
      "header": 102. A new section
      "items": [102.1 Some text, 102.2 Even more text, 102.3 You get the picture] 
      "decision": "reject" }
]
```



In [58]:
# assign a decision
for dic in sections: 
    text = dic['header'] 
    decision = ''
    if 'implemented' in text or 'process of implementation' in text:
        decision = 'implemented'
    elif 'will be examined' in text or 'will examine' in text or "further examined" in text or "Responses to the following recommendations will be provided" in text or "will be included in the outcome report" in text or "will be provided in due course" in text or "course of the discussion" in text:
        decision = 'consider'
    elif 'not enjoy the support' in text or 'reject' in text or 'cannot be accepted' in text:
        decision = 'reject'
    elif 'support' in text and 'did not enjoy the support' not in text:
        decision = 'support'
    elif 'have been noted by' in text or 'were noted by' in text:
        decision = 'noted'
    elif 'do not reflect the current situation' in text:
        decision = 'reject'
    else:
        decision = 'unknown'
    dic['decision'] = decision

# uncomment to test
sections[0]['decision']

'support'
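
The order of the checks matters: a rejection header such as "did not enjoy the support" also contains the word "support", so the reject test has to come first. The sketch below shows the same idea as an ordered table of rules. It is an illustrative alternative, not the notebook's own code: `RULES`, `classify`, the shortened phrase lists, and the sample headers are all assumptions.

```python
# Ordered (decision, phrases) rules; earlier rules win, as in an if/elif chain.
# The phrase lists are a shortened, illustrative subset.
RULES = [
    ('consider', ['will be examined', 'will examine', 'further examined']),
    ('reject', ['not enjoy the support', 'reject', 'cannot be accepted']),
    ('support', ['support']),
    ('noted', ['have been noted by', 'were noted by']),
]

def classify(header):
    """Return the decision of the first rule with a phrase found in header."""
    for decision, phrases in RULES:
        if any(phrase in header for phrase in phrases):
            return decision
    return 'unknown'

# hypothetical header sentences
accepted = classify('The recommendations listed below enjoy the support of the State:')
rejected = classify('The recommendations below did not enjoy the support of the State:')
other = classify('The following paragraph is procedural.')
```

Keeping the phrases in a list makes it easier to add a new wording when a header comes back as `'unknown'`.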

___

## 3.  Making a Function

3.1 We want to do the above parsing for each document in the list. Using the code you wrote in sections 1 and 2, write a function that takes a document and returns an object `parsed_sections`.

In [41]:
# turn it into a function
def parse(document):
    
    upr = document
    
    # separate upr out into chunks by mainP, i.e. the support, reject, consider chunks 
    # by identifying subparagraphs that start with the main paragraph numbers.
    sections = []
    for n in mainParagraphs(upr):
        dic = {}
        dic['paragraph'] = n
        dic['text'] = [s for s in upr['text'] if s.startswith(str(n))]
        sections.append(dic)
    
    # delete all the chunks with only 1 paragraph
    # (build a new list rather than removing items while iterating, which can skip elements)
    sections = [dic for dic in sections if len(dic['text']) > 1]
    
    # parse into main-text and items
    for dic in sections:
        dic['header'] = dic['text'][0]
        dic['items'] = dic['text'][1:]
    
    # assign a decision
    for dic in sections: 
        text = dic['header'] 
        decision = ''
        if 'implemented' in text or 'process of implementation' in text:
            decision = 'implemented'
        elif 'will be examined' in text or 'will examine' in text or "further examined" in text or "Responses to the following recommendations will be provided" in text or "will be included in the outcome report" in text or "will be provided in due course" in text or "course of the discussion" in text:
            decision = 'consider'
        elif 'not enjoy the support' in text or 'reject' in text or 'cannot be accepted' in text:
            decision = 'reject'
        elif 'support' in text and 'did not enjoy the support' not in text:
            decision = 'support'
        elif 'have been noted by' in text or 'were noted by' in text:
            decision = 'noted'
        elif 'do not reflect the current situation' in text:
            decision = 'reject'
        else:
            decision = 'unknown'
        dic['decision'] = decision

    return(sections)

3.2 Loop through all the documents in the list `l` and assign the output of the above function to a new `sections` key in each document.

In [59]:
# apply to all docs
for i in l:
    i['sections'] = parse(i)

#uncomment to test
l[0]['sections'][0]['header']

'136. The recommendations formulated during the interactive dialogue and listed below have been examined by Afghanistan and enjoy its support: '

___

## 4. Getting the final list of recommendations

4.1 Take a look at a recommendation. Notice that they're all formatted the same way, with the recommending country in parentheses at the end.

Below, I've given you a sample recommendation. Find a way to pull out the recommending country. (Hint: Use the `split` method.)

In [48]:
rec = l[0]['sections'][0]['items'][0] # the first line, from the first section, of the first upr in `l`
print rec

# your code here
rec.split('(')[-1].split(')')[0]

136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); 


'Ethiopia'
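
The `split` approach works on well-formed lines. As a hedged alternative, the sketch below uses the `re` module to take the text inside the last pair of parentheses and to return `None` when there are no parentheses at all. The helper name `recommending_country` and the second and third sample strings are made up for illustration.

```python
import re

def recommending_country(rec):
    """Return the text inside the last pair of parentheses, or None."""
    matches = re.findall(r'\(([^()]*)\)', rec)
    return matches[-1].strip() if matches else None

sample = '136.1 To further build up on its effort to fully protect human rights in the country (Ethiopia); '
country = recommending_country(sample)

# an earlier parenthetical in the sentence does not get in the way
tricky = '1.2 Ratify the Convention (as soon as possible) and report back (Chile);'
country_tricky = recommending_country(tricky)

# with no parentheses, split() would hand back the whole line; this returns None
missing = recommending_country('1.3 A line without a country')
```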

4.2 We now want to create a new list `reclist` containing just the recommendations. Each recommendation should be a dictionary with the following keys: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (all 2014 here)
4. `decision`: whether the recommendation was accepted, rejected, etc.
5. `text`: the text of the recommendation

In [54]:
# make dictionaries for each individual recommendation item
reclist = []
for upr in l:
    for section in upr['sections']:
        for item in section['items']:
            dic = {}
            dic['to'] = upr['country']
            dic['year'] = upr['year']
            dic['decision'] = section['decision']
            dic['from'] = item.split('(')[-1].split(')')[0]
            dic['text'] = item
            reclist.append(dic)  

# uncomment to test
print len(reclist)
reclist[1]

7382


{'decision': 'support',
 'from': 'Indonesia',
 'text': '136.2 Continue and deepen efforts to firmly root human rights values and principles in the Government system, including through human rights training to state officials (Indonesia); ',
 'to': 'afghanistan',
 'year': '2014'}
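
Once the recommendations are in a flat list of dictionaries, quick summaries become easy. The sketch below tallies records by decision using `groupby` and `itemgetter`, both imported at the top of this notebook. The records are made-up stand-ins with the same keys as `reclist`.

```python
from itertools import groupby
from operator import itemgetter

# hypothetical records in the same shape as the dictionaries in reclist
recs = [
    {'to': 'a', 'from': 'X', 'year': '2014', 'decision': 'support', 'text': '1.1 ...'},
    {'to': 'a', 'from': 'Y', 'year': '2014', 'decision': 'reject', 'text': '2.1 ...'},
    {'to': 'b', 'from': 'X', 'year': '2014', 'decision': 'support', 'text': '1.1 ...'},
]

by_decision = itemgetter('decision')
counts = {}
# groupby only groups adjacent items, so sort by the same key first
for decision, group in groupby(sorted(recs, key=by_decision), key=by_decision):
    counts[decision] = len(list(group))
```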

## 5. Write a CSV

In [None]:
#writing column headings
# the columns are the keys of a recommendation dictionary, not of a document in `l`
keys = reclist[0].keys()
keys

In [None]:
#writing the rest
with open('2013data.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(reclist)
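
To check that the rows round-trip, here is a small sketch that writes one record to an in-memory buffer and reads it back. It is written in Python 3 syntax as an assumption (under the Python 2 used in this notebook, `StringIO.StringIO` would play the role of `io.StringIO`). The record is an abbreviated sample, and the explicit `fieldnames` list is an illustrative choice that fixes the column order.

```python
import csv
import io

# abbreviated sample record with the same keys as the dictionaries in reclist
recs = [
    {'to': 'afghanistan', 'from': 'Ethiopia', 'year': '2014',
     'decision': 'support', 'text': '136.1 ...'},
]
fieldnames = ['to', 'from', 'year', 'decision', 'text']  # explicit column order

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(recs)

# rewind and read the rows back as dictionaries keyed by the header row
buffer.seek(0)
rows = list(csv.DictReader(buffer))
header_line = buffer.getvalue().splitlines()[0]
```

Every value comes back as a string, so `year` would need converting if it is to be treated as a number later.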