## Parsing Amnesty International Annual Reports
The source data covers 159 countries in a single .pdf file, which I converted to a .txt file. The aim of this notebook is to parse the report so that each country occupies one row.

In [1]:
## Importing needed libraries
import pandas as pd
import re

In [2]:
## Loading data
with open('../data/Amnesty_reports/Amnesty_2018.txt') as file:
    Amnesty2018 = file.read()

In [3]:
# Deleting troublesome characters and other ancillary phrases/words
data_clean = Amnesty2018.replace('\n', ' ')      # replacing every "\n" with a space
data_clean = data_clean.replace('Amnesty International Report 2017/18', ' ')   # delete all instances of this phrase
                                                                               # that appeared on each page of the report

data_clean = re.sub(r"\d+", "", data_clean)    # deleting all numbers (raw string avoids an invalid-escape warning)
data_clean[:1000]                   # snapshot of cleaned data

'AFGHANISTAN Islamic Republic of Afghanistan Head of state and government: Muhammad Ashraf Ghani The civilian population suffered widespread human rights abuses as a result of the continuing conflict. Conflict-related violence led to deaths, injuries and displacement. Civilian casualties continued to be high; the majority were killed or injured by armed insurgent groups, but a significant minority by pro-government forces. The number of people internally displaced by conflict rose to more than  million; about . million Afghan refugees lived outside the country. Gender-based violence against women and girls persisted by state and non-state actors. An increase in public punishments of women by armed groups applying Shari’a law was reported. Human rights defenders received threats from both state and non-state actors; journalists faced violence and censorship. Death sentences continued to be imposed; five people were executed in November. Members of the Hazara minority group and Shi’a con
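
The cleaning steps above can be wrapped in one small function. This is a minimal sketch: the function name `clean_report` and the sample string are hypothetical, and the final whitespace collapse is an extra step that the cell above does not perform.

```python
import re

def clean_report(raw, footer='Amnesty International Report 2017/18'):
    """Apply the same cleaning steps as the cell above to one raw string."""
    text = raw.replace('\n', ' ')        # line breaks become spaces
    text = text.replace(footer, ' ')     # drop the phrase repeated on each page
    text = re.sub(r'\d+', '', text)      # drop all digits
    # Extra step (not in the original cell): collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample for illustration only
sample = ('ALBANIA\nRepublic of Albania\n'
          'Amnesty International Report 2017/18 63\nHead of state')
cleaned = clean_report(sample)
print(cleaned)
```

Because every digit is removed, figures inside the text are lost as well, which is why the snapshot above reads "more than  million".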

Now comes the actual chore of parsing the data. I had to manually code in the countries. The pattern is case-sensitive, which is why the list mixes upper-case names (e.g. `ALBANIA`) with mixed-case phrases (e.g. `Republic of Guinea`).

In [4]:
countries = re.split(' (?=AFGHANISTAN|ALBANIA|ALGERIA|ANGOLA|ARGENTINA|ARMENIA|AUSTRALIA|AUSTRIA|AZERBAIJAN|\
BAHRAIN|BANGLADESH|BELARUS|BELGIUM|BENIN|BOLIVIA|BOSNIA|BOTSWANA|BRAZIL|BRUNEI|BULGARIA|BURKINA|BURUNDI|\
CAMBODIA|CAMEROON|CANADA|CENTRAL|CHAD|CHILE|CHINA|COLOMBIA|CÔTE|CROATIA|CUBA|CYPRUS|CZECH|CONGO|DENMARK|\
DOMINICAN|ECUADOR|EGYPT|EL SALVADOR|EQUATORIAL|ERITREA|ESTONIA|ETHIOPIA|FIJI|FINLAND|FRANCE|GABON|GAMBIA|\
GEORGIA|GERMANY|GHANA|GREECE|GUATEMALA|Republic of Guinea|HAITI|HONDURAS|HUNGARY|INDIA|INDONESIA|\
Islamic Republic of Iran Head|Republic of Iraq|REPUBLIC OF IRELAND|State of Israel|ITALY|JAMAICA|JAPAN|\
JORDAN|KAZAKHSTAN|KENYA|KOREA|KUWAIT|KYRGYZSTAN|LAOS|LATVIA|LEBANON|LESOTHO|LIBERIA|State of Libya|\
LITHUANIA|MACEDONIA|MADAGASCAR|MALAWI|MALAYSIA|MALDIVES|MALI|MALTA|MAURITANIA|MEXICO|MOLDOVA|\
MONGOLIA|MONTENEGRO|Kingdom of Morocco|MOZAMBIQUE|MYANMAR|NAMIBIA|NAURU|NEPAL|NETHERLANDS|NEW ZEALAND|\
NICARAGUA|NIGER|NIGERIA|NORWAY|OMAN|PAKISTAN|State of Palestine Head|Independent State of Papua New Guinea|\
PARAGUAY|PERU|PHILIPPINES|POLAND|PORTUGAL|PUERTO|QATAR|ROMANIA|RUSSIAN|RWANDA|Kingdom of Saudi Arabia|\
SENEGAL|SERBIA|SIERRA|SINGAPORE|SLOVAKIA|SLOVENIA|SOMALIA|SOUTH AFRICA|SOUTH SUDAN|SPAIN|SRI LANKA|\
Republic of the Sudan|SWAZILAND|SWEDEN|SWITZERLAND|Syrian Arab Republic|TAIWAN|TAJIKISTAN|TANZANIA|THAILAND|\
TIMOR-LESTE|TOGO|TUNISIA|TURKEY|TURKMENISTAN|UGANDA|UKRAINE|UNITED ARAB EMIRATES|UNITED KINGDOM|\
UNITED STATES OF AMERICA|URUGUAY|UZBEKISTAN|VENEZUELA|VIET|YEMEN|ZAMBIA|ZIMBABWE)', data_clean)

countries = pd.DataFrame(countries)
countries.columns = ['Report_2018']
print(countries.shape)
countries.head()

(159, 1)


Unnamed: 0,Report_2018
0,AFGHANISTAN Islamic Republic of Afghanistan He...
1,ALBANIA Republic of Albania Head of state: Ili...
2,ALGERIA People’s Democratic Republic of Algeri...
3,ANGOLA Republic of Angola Head of state and go...
4,ARGENTINA Argentine Republic Head of state and...
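
The long alternation in cell 4 could also be assembled from a Python list, which is easier to edit than a single string. The sketch below assumes a hypothetical helper `split_on_markers` and a made-up two-country sample. It also appends `\b` after each marker, which the original pattern does not do, so that a marker only matches as a whole word.

```python
import re

def split_on_markers(text, markers):
    """Split text at every space that is followed by one of the markers."""
    # re.escape guards against regex metacharacters inside a marker
    alternation = '|'.join(re.escape(m) for m in markers)
    # Lookahead keeps the marker at the start of the next piece;
    # \b (an addition to the original pattern) requires a word boundary after it
    pattern = r' (?=(?:' + alternation + r')\b)'
    return re.split(pattern, text)

# Made-up sample for illustration only
sample = ('NIGER Republic of the Niger Head of state: A '
          'NIGERIA Federal Republic of Nigeria Head of state: B')
sections = split_on_markers(sample, ['NIGER', 'NIGERIA'])
for section in sections:
    print(section)
```

As in the original cell, matching is case-sensitive, so the lower-case "Niger" and "Nigeria" inside the text do not trigger a split.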


Next, I wanted to merge this data with the CIRI country codes, along with the division into regions and sub-regions. That is done below:

In [5]:
cntry = pd.read_csv('../data/Amnesty_reports/CountryCodes.csv')
cntry = cntry.drop(cntry.columns[0], axis=1)
cntry.head()

Unnamed: 0,COUNTRY,CODE
0,Afghanistan,101
1,Albania,104
2,Algeria,107
3,Angola,113
4,Argentina,119


In [6]:
if len(cntry) == len(countries):
    print('Equal length dataframes, looks good to merge')
else:
    print('MISMATCHED LENGTHS! Check dataframes again')


Equal length dataframes, looks good to merge


In [9]:
Countries = pd.concat([cntry, countries], axis=1)   # column-wise join on the row index, so row order must match
Countries.head()

Unnamed: 0,COUNTRY,CODE,Report_2018
0,Afghanistan,101,AFGHANISTAN Islamic Republic of Afghanistan He...
1,Albania,104,ALBANIA Republic of Albania Head of state: Ili...
2,Algeria,107,ALGERIA People’s Democratic Republic of Algeri...
3,Angola,113,ANGOLA Republic of Angola Head of state and go...
4,Argentina,119,ARGENTINA Argentine Republic Head of state and...
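
Equal lengths alone do not guarantee that row *i* of the code table describes the same country as row *i* of the parsed reports. One possible extra check, sketched below, looks for each country name near the start of its report text. The helper `misaligned_rows`, the 80-character window and the demo frame are all assumptions for illustration. Rows flagged this way still need a manual look, since official names in the report may differ from the names in the code table.

```python
import pandas as pd

def misaligned_rows(df, name_col='COUNTRY', text_col='Report_2018', window=80):
    """Return rows whose country name is not found near the start of the text."""
    names = df[name_col].str.lower()
    heads = df[text_col].str[:window].str.lower()
    # True where the name is missing from the first `window` characters
    mask = [name not in head for name, head in zip(names, heads)]
    return df.loc[mask, [name_col, text_col]]

# Made-up demo: the second row is deliberately mismatched
demo = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania'],
    'Report_2018': [
        'AFGHANISTAN Islamic Republic of Afghanistan Head of state',
        "ALGERIA People's Democratic Republic of Algeria Head of state",
    ],
})
suspect = misaligned_rows(demo)
print(suspect)
```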


In [10]:
CIRI = pd.read_csv('../data/Amnesty_reports/CIRI_codes.csv')
CIRI.head()

Unnamed: 0,CTRY,YEARS,CIRI,REGION_code,REGION,SUBREGION_code,SUBREGION
0,Afghanistan,1981-2011,101,142,Asia,62,South-central Asia
1,Albania,1981-2011,104,150,Europe,39,Southern Europe
2,Algeria,1981-2011,107,2,Africa,15,Northern Africa
3,Andorra,1993-2011,110,150,Europe,39,Southern Europe
4,Angola,1981-2011,113,2,Africa,17,Middle Africa


In [11]:
Countries_CIRI = pd.merge(Countries, CIRI, how='left', left_on='CODE', right_on='CIRI')
Countries_CIRI.head()

Unnamed: 0,COUNTRY,CODE,Report_2018,CTRY,YEARS,CIRI,REGION_code,REGION,SUBREGION_code,SUBREGION
0,Afghanistan,101,AFGHANISTAN Islamic Republic of Afghanistan He...,Afghanistan,1981-2011,101,142,Asia,62,South-central Asia
1,Albania,104,ALBANIA Republic of Albania Head of state: Ili...,Albania,1981-2011,104,150,Europe,39,Southern Europe
2,Algeria,107,ALGERIA People’s Democratic Republic of Algeri...,Algeria,1981-2011,107,2,Africa,15,Northern Africa
3,Angola,113,ANGOLA Republic of Angola Head of state and go...,Angola,1981-2011,113,2,Africa,17,Middle Africa
4,Argentina,119,ARGENTINA Argentine Republic Head of state and...,Argentina,1981-2011,119,419,Latin America and the Caribbean,5,South America
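
A left merge keeps every report row, but any code missing from the CIRI table silently produces NaN in the region columns. The sketch below uses the `indicator` and `validate` parameters of `pd.merge` to surface such rows. The two small frames are made up, and 'Atlantis' with code 999 is a hypothetical unmatched entry.

```python
import pandas as pd

# Made-up stand-ins for the Countries and CIRI frames
left = pd.DataFrame({'COUNTRY': ['Afghanistan', 'Albania', 'Atlantis'],
                     'CODE': [101, 104, 999]})
right = pd.DataFrame({'CTRY': ['Afghanistan', 'Albania'],
                      'CIRI': [101, 104],
                      'REGION': ['Asia', 'Europe']})

merged = pd.merge(
    left, right, how='left', left_on='CODE', right_on='CIRI',
    validate='many_to_one',   # raises if a CIRI code appears twice on the right
    indicator=True,           # adds a '_merge' column describing each row's origin
)

# Rows present only on the left found no matching CIRI code
unmatched = merged.loc[merged['_merge'] == 'left_only', 'COUNTRY'].tolist()
print(unmatched)
```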


In [12]:
Countries_CIRI.to_csv('../data/parsed_Amnesty.csv')