## Parsing Amnesty International Annual Reports
Each annual report came as a single .pdf file covering all countries. I converted them to .txt files. My aim in this notebook is to parse the information on each country into its own row.

In [1]:
## Importing needed libraries
from functools import reduce
import pandas as pd
import re
import os

In [2]:
## Loading data from each year
with open('../data/Amnesty_reports/Amnesty_2009.txt') as file: Amnesty2009 = file.read()
with open('../data/Amnesty_reports/Amnesty_2010.txt') as file: Amnesty2010 = file.read()
with open('../data/Amnesty_reports/Amnesty_2011.txt') as file: Amnesty2011 = file.read()
with open('../data/Amnesty_reports/Amnesty_2012.txt') as file: Amnesty2012 = file.read()
with open('../data/Amnesty_reports/Amnesty_2013.txt') as file: Amnesty2013 = file.read()
with open('../data/Amnesty_reports/Amnesty_2014.txt') as file: Amnesty2014 = file.read()
with open('../data/Amnesty_reports/Amnesty_2015.txt') as file: Amnesty2015 = file.read()
with open('../data/Amnesty_reports/Amnesty_2016.txt') as file: Amnesty2016 = file.read()
with open('../data/Amnesty_reports/Amnesty_2017.txt') as file: Amnesty2017 = file.read()
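
The repeated `with open(...)` lines can also be written as a loop. The following is a minimal sketch, assuming the files keep the `Amnesty_<year>.txt` naming shown above; `load_reports` and the returned dict are hypothetical names that do not appear elsewhere in this notebook, and the UTF-8 encoding is an assumption about how the .txt files were exported.

```python
from pathlib import Path

def load_reports(directory, years):
    """Read Amnesty_<year>.txt for each year into a {year: text} dict."""
    reports = {}
    for year in years:
        path = Path(directory) / f"Amnesty_{year}.txt"
        # Encoding is an assumption; adjust it to match the exported files.
        reports[year] = path.read_text(encoding="utf-8")
    return reports

# Hypothetical usage:
# reports = load_reports('../data/Amnesty_reports', range(2009, 2018))
```

With this layout, `reports[2010]` would hold the same text as `Amnesty2010`.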

In [4]:
# Deleting troublesome characters and other ancillary phrases/words
clean_2010 = Amnesty2010.replace('\n',' ')      # replacing every newline with a space
clean_2010 = clean_2010.replace('Amnesty International Report 2010',' ')   # delete all instances of this phrase 
                                                                           # that appeared on each page of the report

clean_2010 = re.sub(r"\d+", "", clean_2010)    # deleting all numbers
clean_2010 = re.sub(",", "", clean_2010)    # deleting all commas
print(clean_2010[:1000])                   # snapshot of cleaned data


# 2011
clean_2011 = Amnesty2011.replace('\n',' ')
clean_2011 = clean_2011.replace('Amnesty International Report 2011',' ')
clean_2011 = re.sub(r"\d+", "", clean_2011)
clean_2011 = re.sub(",", "", clean_2011)
print('\n', clean_2011 [:1000])


# 2012
clean_2012 = Amnesty2012.replace('\n',' ')
clean_2012 = clean_2012.replace('Amnesty International Report 2012',' ')
clean_2012 = re.sub(r"\d+", "", clean_2012)
clean_2012 = re.sub(",", "", clean_2012)
print('\n', clean_2012 [:1000])


# 2013
clean_2013 = Amnesty2013.replace('\n',' ')
clean_2013 = clean_2013.replace('Amnesty International Report 2013',' ')
clean_2013 = re.sub(r"\d+", "", clean_2013)
clean_2013 = re.sub(",", "", clean_2013)
print('\n', clean_2013 [:1000])


# 2014
clean_2014 = Amnesty2014.replace('\n',' ')
clean_2014 = clean_2014.replace('Amnesty International Report 2014/15',' ')
clean_2014 = re.sub(r"\d+", "", clean_2014)
clean_2014 = re.sub(",", "", clean_2014)
print(clean_2014 [:1000])


# 2015
clean_2015 = Amnesty2015.replace('\n',' ')
clean_2015 = clean_2015.replace('Amnesty International Report 2015/16',' ')
clean_2015 = re.sub(r"\d+", "", clean_2015)
clean_2015 = re.sub(",", "", clean_2015)
print('\n', clean_2015 [:1000])


# 2016
clean_2016 = Amnesty2016.replace('\n',' ')
clean_2016 = clean_2016.replace('Amnesty International Report 2016/17',' ')
clean_2016 = re.sub(r"\d+", "", clean_2016)
clean_2016 = re.sub(",", "", clean_2016)
print('\n', clean_2016 [:1000])

# 2017
clean_2017 = Amnesty2017.replace('\n',' ')
clean_2017 = clean_2017.replace('Amnesty International Report 2017/18',' ')
clean_2017 = re.sub(r"\d+", "", clean_2017)
clean_2017 = re.sub(",", "", clean_2017)
print('\n', clean_2017 [:1000])

AFGHANISTAN ISLAMIC REPUBLIC OF AFGHANISTAN Head of state and government: Death penalty: Population: Life expectancy: Under- mortality (m/f): Adult literacy: Hamid Karzai retentionist . million . years / per   per cent Afghan people continued to suffer widespread human rights violations and violations of international humanitarian law more than seven years after the USA and its allies ousted the Taleban. Access to health care education and humanitarian aid deteriorated particularly in the south and south-east of the country due to escalating armed conflict between Afghan and international forces and the Taleban and other armed groups. Conflict-related violations increased in northern and western Afghanistan areas previously considered relatively safe. Background The Taleban and other anti-government groups stepped up attacks against civilians including attacks on schools and health clinics across the country. Allegations of electoral fraud during the  presidential elections reflected w
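
The same four cleaning steps are applied to every year, so they can be collected in one helper. This is a sketch only: `clean_report` and its `running_header` parameter are hypothetical names, and the sample string is made up for illustration.

```python
import re

def clean_report(text, running_header):
    """Apply the cleaning steps used above to one report's raw text."""
    text = text.replace('\n', ' ')            # join lines into one long string
    text = text.replace(running_header, ' ')  # drop the per-page running header
    text = re.sub(r'\d+', '', text)           # remove all runs of digits
    text = text.replace(',', '')              # remove all commas
    return text

# Made-up sample, not taken from a real report
sample = "Amnesty International Report 2010\nALBANIA Population: 3,200 people"
cleaned = clean_report(sample, 'Amnesty International Report 2010')
print(cleaned.split())
```

The order matters: the running header contains the year, so it has to be removed before the digits are stripped, otherwise the phrase no longer matches.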

Now comes the actual chore of manually coding the country names into the data. The pattern is case-sensitive, which is why the names below mix upper and lower case. In addition, the reports were inconsistent in how they wrote the names.

In [5]:
# Coding in country names
countries = ' (?=AFGHANISTAN|ALBANIA|ALGERIA|ANGOLA|ARGENTINA|ARMENIA|AUSTRALIA|AUSTRIA|AZERBAIJAN|\
BAHAMAS|BAHRAIN|BANGLADESH|BELARUS|BELGIUM|BENIN|BOLIVIA|BOSNIA|BOTSWANA|BRAZIL|BRUNEI|BULGARIA|BURKINA|\
CAMBODIA|CAMEROON|CANADA|CENTRAL|CHAD|CHILE|CHINA|COLOMBIA|CÔTE|CROATIA|CUBA|CYPRUS|CZECH|CONGO|DENMARK|\
DOMINICAN|ECUADOR|EGYPT|EL SALVADOR|EQUATORIAL|ERITREA|ESTONIA|ETHIOPIA|FIJI|FINLAND|FRANCE|GABON|GAMBIA|\
GEORGIA|GERMANY|GHANA|GREECE|GUATEMALA|Republic of Guinea|HAITI|HONDURAS|HUNGARY|INDIA|INDONESIA|\
Islamic Republic of Iran Head|Republic of Iraq|REPUBLIC OF IRELAND|State of Israel|ITALY|JAMAICA|JAPAN|\
JORDAN|KAZAKHSTAN|KENYA|KOREA|KUWAIT|KYRGYZSTAN|LAOS|LATVIA|LEBANON|LESOTHO|LIBERIA|State of Libya|\
BURUNDI|LITHUANIA|MACEDONIA|MADAGASCAR|MALAWI|MALAYSIA|MALDIVES|MALI|MALTA|MAURITANIA|MEXICO|MOLDOVA|\
MONGOLIA|MONTENEGRO|Kingdom of Morocco|MOZAMBIQUE|MYANMAR|NAMIBIA|NAURU|NEPAL|NETHERLANDS|NEW ZEALAND|\
NICARAGUA|NIGER|NIGERIA|NORWAY|OMAN|PAKISTAN|State of Palestine Head|Independent State of Papua New Guinea|\
PARAGUAY|PERU|PHILIPPINES|POLAND|PORTUGAL|PUERTO|QATAR|ROMANIA|RUSSIAN|RWANDA|Kingdom of Saudi Arabia|\
SENEGAL|SERBIA|SIERRA|SINGAPORE|SLOVAKIA|SLOVENIA|SOMALIA|SOUTH AFRICA|SOUTH SUDAN|SPAIN|SRI LANKA|\
Republic of the Sudan|SWAZILAND|SWEDEN|SWITZERLAND|Syrian Arab Republic|TAIWAN|TAJIKISTAN|TANZANIA|THAILAND|\
TIMOR-LESTE|TOGO|TUNISIA|TURKEY|TURKMENISTAN|UGANDA|UKRAINE|UNITED ARAB EMIRATES|UNITED KINGDOM|\
UNITED STATES OF AMERICA|URUGUAY|UZBEKISTAN|VENEZUELA|VIET|YEMEN|ZAMBIA|ZIMBABWE|SAINT KITTS AND NEVIS|\
SOLOMON ISLANDS|SURINAME|TONGA|TRINIDAD AND TOBAGO|GUINEA-BISSAU|GUYANA|HOLY SEE|DJIBOUTI|VANUATU)'
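
The pattern above is a space followed by a lookahead, `(?=...)`. The lookahead checks that a country name comes next without consuming it, so when the text is split, each name stays at the start of its own segment. A toy illustration, with a three-name pattern and an invented text standing in for the real ones:

```python
import re

# Toy pattern with three names; the real `countries` pattern lists many more.
toy_pattern = r' (?=ALBANIA|ALGERIA|ANGOLA)'
toy_text = 'preface text ALBANIA Head of state ALGERIA Head of state ANGOLA Head of state'

segments = re.split(toy_pattern, toy_text)
for segment in segments:
    print(repr(segment))
```

Only the leading space is consumed by the split. Because matching is case-sensitive, a lower-case mention such as "Albania" in running text does not trigger a split, while any upper-case occurrence does.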

#### Parsing the data for each report and year

In [6]:
#2010
Amn_2010 = re.split(countries, clean_2010)
Amn_2010 = pd.DataFrame(Amn_2010)
Amn_2010.columns= ['Report_2010']
print('Countries reported in 2010: ', Amn_2010.shape)


#2011
Amn_2011 = re.split(countries, clean_2011)
Amn_2011 = pd.DataFrame(Amn_2011)
Amn_2011.columns= ['Report_2011']
print('Countries reported in 2011: ', Amn_2011.shape)


#2012
Amn_2012 = re.split(countries, clean_2012)
Amn_2012 = pd.DataFrame(Amn_2012)
Amn_2012.columns= ['Report_2012']
print('Countries reported in 2012: ', Amn_2012.shape)


#2013
Amn_2013 = re.split(countries, clean_2013)
Amn_2013 = pd.DataFrame(Amn_2013)
Amn_2013.columns= ['Report_2013']
print('Countries reported in 2013: ', Amn_2013.shape)


#2014
Amn_2014 = re.split(countries, clean_2014)
Amn_2014 = pd.DataFrame(Amn_2014)
Amn_2014.columns= ['Report_2014']
print('Countries reported in 2014: ', Amn_2014.shape)


#2015
Amn_2015 = re.split(countries, clean_2015)
Amn_2015 = pd.DataFrame(Amn_2015)
Amn_2015.columns= ['Report_2015']
print('Countries reported in 2015: ', Amn_2015.shape)


#2016
Amn_2016 = re.split(countries, clean_2016)
Amn_2016 = pd.DataFrame(Amn_2016)
Amn_2016.columns= ['Report_2016']
print('Countries reported in 2016: ', Amn_2016.shape)

#2017
Amn_2017 = re.split(countries, clean_2017)
Amn_2017 = pd.DataFrame(Amn_2017)
Amn_2017.columns= ['Report_2017']
print('Countries reported in 2017: ', Amn_2017.shape)

Countries reported in 2010:  (270, 1)
Countries reported in 2011:  (268, 1)
Countries reported in 2012:  (265, 1)
Countries reported in 2013:  (277, 1)
Countries reported in 2014:  (162, 1)
Countries reported in 2015:  (162, 1)
Countries reported in 2016:  (161, 1)
Countries reported in 2017:  (159, 1)
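
Note that the shapes printed above count segments, not countries: `re.split` returns whatever precedes the first match plus one segment per match, and a name that occurs more than once produces more than one segment. A quick way to see which names repeat is to count the first word of each segment. The sketch below uses an invented text and a two-name pattern; it is a diagnostic idea, not part of the original workflow.

```python
import re
from collections import Counter

toy_pattern = r' (?=ALBANIA|ALGERIA)'
# Invented text in which each name appears twice
toy_text = ('intro ALBANIA contents entry ALGERIA contents entry '
            'ALBANIA Head of state ALGERIA Head of state')

segments = re.split(toy_pattern, toy_text)
first_words = Counter(s.split()[0] for s in segments if s.strip())

print(len(segments))
print(first_words.most_common())
```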


In [7]:
# Making an empty column in each dataframe for country names
Amn_2010['Country'] = ''
Amn_2011['Country'] = ''
Amn_2012['Country'] = ''
Amn_2013['Country'] = ''
Amn_2014['Country'] = ''
Amn_2015['Country'] = ''
Amn_2016['Country'] = ''
Amn_2017['Country'] = ''

# Assigning names of respective countries 
Amn_2010.loc[Amn_2010.Country == '', 'Country'] = Amn_2010.Report_2010.str.split().str.get(0)
Amn_2011.loc[Amn_2011.Country == '', 'Country'] = Amn_2011.Report_2011.str.split().str.get(0)
Amn_2012.loc[Amn_2012.Country == '', 'Country'] = Amn_2012.Report_2012.str.split().str.get(0)
Amn_2013.loc[Amn_2013.Country == '', 'Country'] = Amn_2013.Report_2013.str.split().str.get(0)
Amn_2014.loc[Amn_2014.Country == '', 'Country'] = Amn_2014.Report_2014.str.split().str.get(0)
Amn_2015.loc[Amn_2015.Country == '', 'Country'] = Amn_2015.Report_2015.str.split().str.get(0)
Amn_2016.loc[Amn_2016.Country == '', 'Country'] = Amn_2016.Report_2016.str.split().str.get(0)
Amn_2017.loc[Amn_2017.Country == '', 'Country'] = Amn_2017.Report_2017.str.split().str.get(0)
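
One caveat with taking the first whitespace-separated token: multi-word entries in `countries` that share a first word, such as `SOUTH AFRICA` and `SOUTH SUDAN`, end up with the same label. A possible alternative, shown here as a sketch on made-up rows, is to capture the full name at the start of each segment with `str.extract`. The `name_pattern` below is a hypothetical three-name alternation; in practice it would need to mirror the names in `countries`.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({'Report': [
    'SOUTH AFRICA Republic of South Africa Head of state',
    'SOUTH SUDAN Republic of South Sudan Head of state',
    'ALBANIA Republic of Albania Head of state',
]})

# First-token approach: both "SOUTH ..." rows get the same label.
toy['first_token'] = toy['Report'].str.split().str.get(0)

# Alternative: capture the full name anchored at the start of the segment.
name_pattern = r'^(SOUTH AFRICA|SOUTH SUDAN|ALBANIA)'
toy['full_name'] = toy['Report'].str.extract(name_pattern, expand=False)

print(toy[['first_token', 'full_name']])
```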

Some countries appear in more than one row, so I drop the duplicates now, keeping the last occurrence of each.

In [8]:
## Deleting duplicates
Amn_2010_c = Amn_2010.drop_duplicates(subset=['Country'], keep='last')
Amn_2011_c = Amn_2011.drop_duplicates(subset=['Country'], keep='last')
Amn_2012_c = Amn_2012.drop_duplicates(subset=['Country'], keep='last')
Amn_2013_c = Amn_2013.drop_duplicates(subset=['Country'], keep='last')
Amn_2014_c = Amn_2014.drop_duplicates(subset=['Country'], keep='last')
Amn_2015_c = Amn_2015.drop_duplicates(subset=['Country'], keep='last')
Amn_2016_c = Amn_2016.drop_duplicates(subset=['Country'], keep='last')
Amn_2017_c = Amn_2017.drop_duplicates(subset=['Country'], keep='last')

# Checking the change in the size of each dataframe
print('2010 data: deleting duplicates resulted in a drop from', len(Amn_2010), 'to', len(Amn_2010_c))
print('2011 data: deleting duplicates resulted in a drop from', len(Amn_2011), 'to', len(Amn_2011_c))
print('2012 data: deleting duplicates resulted in a drop from', len(Amn_2012), 'to', len(Amn_2012_c))
print('2013 data: deleting duplicates resulted in a drop from', len(Amn_2013), 'to', len(Amn_2013_c))
print('2014 data: deleting duplicates resulted in a drop from', len(Amn_2014), 'to', len(Amn_2014_c))
print('2015 data: deleting duplicates resulted in a drop from', len(Amn_2015), 'to', len(Amn_2015_c))
print('2016 data: deleting duplicates resulted in a drop from', len(Amn_2016), 'to', len(Amn_2016_c))
print('2017 data: deleting duplicates resulted in a drop from', len(Amn_2017), 'to', len(Amn_2017_c))

2010 data: deleting duplicates resulted in a drop from 270 to 144
2011 data: deleting duplicates resulted in a drop from 268 to 142
2012 data: deleting duplicates resulted in a drop from 265 to 139
2013 data: deleting duplicates resulted in a drop from 277 to 145
2014 data: deleting duplicates resulted in a drop from 162 to 148
2015 data: deleting duplicates resulted in a drop from 162 to 148
2016 data: deleting duplicates resulted in a drop from 161 to 148
2017 data: deleting duplicates resulted in a drop from 159 to 148
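
With `keep='last'`, the row that survives for each country is simply the one that appears latest in the text. Whether that is always the full country entry is an assumption worth checking. The sketch below, on invented rows, compares it with a different rule, keeping the longest segment per country; that second rule is an alternative for comparison, not the method used in this notebook.

```python
import pandas as pd

# Invented rows: a short early mention followed by the full entry
toy = pd.DataFrame({
    'Country': ['ALBANIA', 'ALGERIA', 'ALBANIA', 'ALGERIA'],
    'Report': ['ALBANIA contents', 'ALGERIA contents',
               'ALBANIA Head of state and a long entry',
               'ALGERIA Head of state and a long entry'],
})

# Rule used above: keep the last occurrence of each country
last_kept = toy.drop_duplicates(subset=['Country'], keep='last')

# Alternative rule: keep the longest segment for each country
longest_idx = toy['Report'].str.len().groupby(toy['Country']).idxmax()
longest_kept = toy.loc[longest_idx]

print(last_kept.index.tolist())
print(longest_kept.index.tolist())
```

If the two rules select different rows on the real data, those countries are the ones to inspect by hand.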


In [9]:
# Putting it all together
Amnesty_final = reduce(lambda x,y: pd.merge(x,y, on='Country', how='outer'), [Amn_2010_c,Amn_2011_c,Amn_2012_c,Amn_2013_c,Amn_2014_c,Amn_2015_c,Amn_2016_c,Amn_2017_c])
Amnesty_final = Amnesty_final.set_index('Country').reset_index()
Amnesty_final.head()

Unnamed: 0,Country,Report_2010,Report_2011,Report_2012,Report_2013,Report_2014,Report_2015,Report_2016,Report_2017
0,AFGHANISTAN,AFGHANISTAN Head of state and government: Deat...,AFGHANISTAN Head of state and government: Deat...,AFGHANISTAN Head of state and government: Deat...,AFGHANISTAN Head of state and government: Hami...,AFGHANISTAN Islamic Republic of Afghanistan He...,AFGHANISTAN Islamic Republic of Afghanistan He...,AFGHANISTAN Islamic Republic of Afghanistan He...,AFGHANISTAN Islamic Republic of Afghanistan He...
1,ALBANIA,ALBANIA Head of state: Head of government: Dea...,ALBANIA Head of state: Head of government: Dea...,ALBANIA Head of state: Head of government: Dea...,ALBANIA Head of state: Bujar Nishani (replaced...,ALBANIA Republic of Albania Head of state: Buj...,ALBANIA Republic of Albania Head of state: Buj...,ALBANIA Republic of Albania Head of state: Buj...,ALBANIA Republic of Albania Head of state: Ili...
2,ALGERIA,ALGERIA Head of state: Head of government: Dea...,ALGERIA A Head of state: Head of government: D...,ALGERIA Head of state: Head of government: Dea...,ALGERIA Head of state: Head of government: Abd...,ALGERIA People’s Democratic Republic of Algeri...,ALGERIA People’s Democratic Republic of Algeri...,ALGERIA other ill-treatment counter-terrorism ...,ALGERIA People’s Democratic Republic of Algeri...
3,ANGOLA,ANGOLA Head of state: Head of government: Deat...,ANGOLA Head of state: Head of government: Deat...,ANGOLA Head of state and government: Death pen...,ANGOLA Head of state and government: José Edua...,ANGOLA related to freedom of expression associ...,ANGOLA Republic of Angola Head of state and go...,ANGOLA Republic of Angola Head of state and go...,ANGOLA Republic of Angola Head of state and go...
4,ARGENTINA,ARGENTINA ARGENTINE REPUBLIC Head of state and...,ARGENTINA ARGENTINE REPUBLIC Head of state and...,ARGENTINA ARGENTINE REPUBLIC Head of state and...,ARGENTINA ARGENTINE REPUBLIC Head of state and...,ARGENTINA Argentine Republic Head of state and...,ARGENTINA Argentine Republic Head of state and...,ARGENTINA Argentine Republic Head of state and...,ARGENTINA Argentine Republic Head of state and...
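
The `reduce` call folds the list of yearly dataframes into one by merging them pairwise on `Country`. With `how='outer'`, a country missing from one year is kept and gets `NaN` in that year's column. A small sketch with three invented frames:

```python
from functools import reduce
import pandas as pd

# Invented frames standing in for three yearly tables
y1 = pd.DataFrame({'Country': ['ALBANIA', 'ALGERIA'], 'Report_A': ['a1', 'a2']})
y2 = pd.DataFrame({'Country': ['ALGERIA', 'ANGOLA'], 'Report_B': ['b2', 'b3']})
y3 = pd.DataFrame({'Country': ['ALBANIA', 'ANGOLA'], 'Report_C': ['c1', 'c3']})

merged = reduce(
    lambda left, right: pd.merge(left, right, on='Country', how='outer'),
    [y1, y2, y3],
)
print(merged)
```

Each country appears once in the result, with one column per input frame.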


In [10]:
Amnesty_final.to_csv('../data/Amnesty2010_2017.csv')
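
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. A small sketch writing to an in-memory buffer instead of a file, with an invented one-row frame:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Country': ['ALBANIA'], 'Report_2010': ['ALBANIA Head of state']})

with_index = io.StringIO()
toy.to_csv(with_index)                  # default: index written as first column

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # index column omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```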