# Project Data Sample Review 

In [1]:
import pandas as pd
import numpy as np
import os
from collections import Counter
%matplotlib inline
pd.set_option('display.max_rows', 300)

In [2]:
projects=pd.read_csv('../Data/EWS_Published Project_Listing_11JUN18.csv')
projects = projects[projects['EWS ID'].notnull()]

In [3]:
projects.shape

(6832, 62)

**Example Row**

In [4]:
projects.iloc[0]

EWS ID                                                               29164
ProjectNumber                                            AFDB-P-TN-BB0-007
Published                                                        Published
Bank Risk Rating                                                         U
Project Status                                                    Proposed
EWS URL                  https://ews.rightsindevelopment.org/projects/p...
Detailed Analysis URL                                                  NaN
Project Name                                    TUNISIA FERTILIZER PROJECT
City                                                                   NaN
Country Count                                                            1
Country 1                                                          Tunisia
Country 2                                                              NaN
Country 3                                                              NaN
Country 4                

**Null Check**

Fraction of non-null values per column (1.0 means no missing values).

In [5]:
projects.count()/len(projects.index)

EWS ID                   1.000000
ProjectNumber            1.000000
Published                1.000000
Bank Risk Rating         1.000000
Project Status           0.937500
EWS URL                  1.000000
Detailed Analysis URL    0.000000
Project Name             1.000000
City                     0.169936
Country Count            1.000000
Country 1                0.909690
Country 2                0.035714
Country 3                0.019760
Country 4                0.011856
Country 5                0.007758
Country 6                0.005123
Country 7                0.003367
Country 8                0.001903
Country 9                0.001610
Country 10               0.001464
Country 11               0.001025
Country 12               0.000585
Borrower or Client       0.762002
Private Actor Count      1.000000
Private Actor 1          0.077283
Private Actor 2          0.018004
Private Actor 3          0.006587
Private Actor 4          0.003074
Private Actor 5          0.002196
Private Actor 

## Project Description Column Will Likely Be Most Useful for Matching

**Notes**

* Some descriptions are very short, so matching against them may be unreliable.
* Other fields (Country, Borrower or Client, etc.) will likely be useful as well.
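One way to put a number on the "short description" concern is to count words per description. The sketch below is only illustrative: the helper name `short_description_share`, the `min_words` threshold, and the toy series are assumptions, and the toy series stands in for `projects['Project Description']`.

```python
import pandas as pd

def short_description_share(descriptions, min_words=20):
    """Fraction of non-null descriptions with fewer than min_words words."""
    texts = descriptions.dropna().astype(str)
    # Split on whitespace and count the resulting tokens per description
    word_counts = texts.str.split().str.len()
    return float((word_counts < min_words).mean())

# Toy data standing in for projects['Project Description'] (hypothetical values
# except the second entry, which is quoted from the sample above)
toy = pd.Series([
    "Road upgrade.",
    "The objective of the project is to adapt and scale a loan product to "
    "finance water and sanitation solutions to poor and vulnerable population in Brazil.",
    None,
])

share = short_description_share(toy, min_words=10)
print(share)
```

On the real data, running the helper with a few different thresholds would show how much of the dataset is likely to be hard to match on description alone.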

In [6]:
for i in projects.sample(15)['Project Description']:
    print(i)
    print('*****\n')

The proposed project involves IFC investing an INR equivalent of approximately US$50 million in a combination of instruments in one or more Special Purpose Vehicles ("SPV") promoted by Mahindra Lifespace Developers Limited ("MLDL" or the "Company" or the "Sponsor") set up for the development of three industrial clusters (ICs) around established industrial areas in Rajasthan, Gujarat, and Maharashtra (the "Project").
*****

The objective of the project is to adapt and scale a loan product to finance water and sanitation solutions to poor and vulnerable population in Brazil.
*****

The proposed financing consists of two separate long-term loans to be granted directly to Durlicouros Industria e Comercio de Couros, Exportacao e Importacao Ltda. in Brazil and to Durli Leather S.A. and Veneza Inversiones S.A. in Paraguay. The Borrowers will use the proceeds of the loans to support the construction, operation, and maintenance of two (2) new leather manufacturing plants (one (1) in Brazil and 

## Looking at the Sector Data 

This dataset could help tag the news articles with sector information.

In [7]:
def get_category_cols(category, additional_removes=None):
    # Columns whose name contains the category, minus the '<category> Count' column
    cols = [i for i in projects.columns if category in i]
    cols.remove(category + ' Count')
    for col in additional_removes or []:
        cols.remove(col)
    return cols

sector_cols = get_category_cols('Sector')
# .values replaces as_matrix(), which was removed in pandas 1.0
all_sectors = projects[sector_cols].values.flatten()


In [8]:
for i in Counter(all_sectors): print(i)

nan
Transport
Hydropower
Infrastructure
Climate and Environment
Finance
Industry and Trade
Humanitarian Response
Construction
Education and Health
Communications
Mining
Law and Government
Energy
Technical Cooperation
Agriculture and Forestry
Water and Sanitation


# NOTE 

The same approach might also work for classifying the bank and country of each article.
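As a rough illustration of that idea, the distinct values from the project columns could serve as a lookup vocabulary for tagging article text. This is a minimal sketch, not the method used in this notebook: `build_vocab` and `tag_text` are hypothetical helpers, and the sample values are hard-coded stand-ins for `all_countries`.

```python
import re

def build_vocab(values):
    """Map lowercased value -> original value, skipping nulls and blanks."""
    return {v.lower(): v for v in values if isinstance(v, str) and v.strip()}

def tag_text(text, vocab):
    """Return vocabulary entries found in text as whole terms (case-insensitive)."""
    lowered = text.lower()
    hits = []
    for key, original in vocab.items():
        # Lookarounds instead of \b so entries ending in ')' still match
        pattern = r'(?<!\w)' + re.escape(key) + r'(?!\w)'
        if re.search(pattern, lowered):
            hits.append(original)
    return sorted(hits)

# Stand-in for the flattened country columns, including a null
countries = build_vocab(['Cambodia', 'Chad', float('nan'), 'India'])
tags = tag_text('ADB commits loan for all-weather roads in India', countries)
print(tags)
```

Exact string matching like this would miss spelling variants and abbreviations, so it is probably best treated as a baseline.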

**Countries**

In [9]:
country_cols = get_category_cols('Country')
all_countries = projects[country_cols].values.flatten()  # .values replaces the removed as_matrix()
country_counter = Counter(all_countries)
print(len(country_counter))
country_counter

170


Counter({'Afghanistan': 25,
         'Albania': 10,
         'Angola': 6,
         'Argentina': 57,
         'Armenia': 26,
         'Austria': 11,
         'Azerbaijan': 15,
         'Bahamas': 8,
         'Bangladesh': 64,
         'Barbados': 3,
         'Belarus': 7,
         'Belgium': 10,
         'Belize': 9,
         'Benin': 6,
         'Bhutan': 17,
         'Bolivia': 30,
         'Bosnia and Herzegovina': 15,
         'Brazil': 79,
         'Bulgaria': 9,
         'Burkina Faso': 7,
         'Burundi': 1,
         'Cambodia': 42,
         'Cameroon': 11,
         'Cape Verde': 3,
         'Central African Republic': 7,
         'Chad': 10,
         'Chile': 22,
         'China': 128,
         'Colombia': 53,
         'Congo, Democratic Republic of': 7,
         'Congo, Republic of': 3,
         'Cook Islands': 1,
         'Costa Rica': 23,
         'Croatia': 13,
         'Cyprus': 1,
         'Czech Republic': 11,
         'Denmark': 9,
         'Djibouti': 3,
         'Do

**Banks**

In [10]:
banks_cols = get_category_cols('Bank', ['Bank Risk Rating'])
all_banks = projects[banks_cols].values.flatten()  # .values replaces the removed as_matrix()
all_banks = [i for i in all_banks if pd.notnull(i)]  # nulls behave oddly in this one, so drop them before counting
banks_counter = Counter(all_banks)
print(len(banks_counter))
banks_counter

13


Counter({'African Development Bank (AFDB)': 45,
         'Asian Development Bank (ADB)': 644,
         'Asian Infrastructure Investment Bank (AIIB)': 51,
         'European Bank for Reconstruction and Development (EBRD)': 179,
         'European Investment Bank (EIB)': 462,
         'Green Climate Fund (GCF)': 60,
         'Inter-American Development Bank (IADB)': 574,
         'Inter-American Investment Corporation (IIC)': 67,
         'International Finance Corporation (IFC)': 332,
         'Multilateral Investment Guarantee Agency (MIGA)': 39,
         'Netherlands Development Finance Company (FMO)': 206,
         'New Development Bank (NDB)': 12,
         'World Bank (WB)': 337})

In [19]:
for b in banks_counter: print(b)

Netherlands Development Finance Company (FMO)
Asian Development Bank (ADB)
Inter-American Investment Corporation (IIC)
New Development Bank (NDB)
Asian Infrastructure Investment Bank (AIIB)
World Bank (WB)
Inter-American Development Bank (IADB)
African Development Bank (AFDB)
European Bank for Reconstruction and Development (EBRD)
European Investment Bank (EIB)
International Finance Corporation (IFC)
Multilateral Investment Guarantee Agency (MIGA)
Green Climate Fund (GCF)


------------

# Compare to the Labeled Data 

In [8]:
labels = pd.read_csv('../Temp_Output/june14_temp_labeled.csv',encoding = "ISO-8859-1")

In [10]:
labels.head()

Unnamed: 0,article_id,published,title,HyperLink,url,feed_label,Sectors,Bank1,Bank2,Country1,Country2,Projects(EWSProjectID),EWS Project Name,EWS hyperlink
0,10f9ed2,1/11/18,ADB Provides Support for Three Infrastructure ...,http://moderndiplomacy.eu/2018/01/11/adb-provi...,http://moderndiplomacy.eu/2018/01/11/adb-provi...,NEWS ADB - All Streams,Infrastructure,ADB,,Cambodia,,"ADB-41123-015, ADB-48158-001, ADB-41435-053",Road Network Improvement Project (formerly Sec...,https://ewsdata.rightsindevelopment.org/projec...
1,ac62f9df,1/30/18,ADB commits $250 loan for all-weather roads in...,http://www.ddinews.gov.in/business/adb-commits...,http://www.ddinews.gov.in/business/adb-commits...,NEWS ADB - All Streams,Transport,ADB,,India,,ADB-48226-002,Second Rural Connectivity Investment Program (,https://ewsdata.rightsindevelopment.org/projec...
2,d1d79dd8,2/20/18,ADB Provides $360 Million for Rolling Stock to...,http://feedproxy.google.com/~r/adb_news/~3/v9s...,http://feedproxy.google.com/~r/adb_news/~3/v9s...,NEWS ADB - All Streams,Transport,ADB,,Bangladesh,,ADB-50312-003,Railway Rolling Stock Operations Improvement P...,https://ewsdata.rightsindevelopment.org/projec...
3,f0d65e5,2/25/18,ADB provides financing to Thailand's B.Grimm P...,https://www.dealstreetasia.com/stories/adb-b-g...,https://www.dealstreetasia.com/stories/adb-b-g...,NEWS ADB - All Streams,"Construction, Finance",ADB,,Thailand,,ADB-50410-001,ASEAN Distributed Power Project: Initial Pover...,https://ewsdata.rightsindevelopment.org/projec...
4,4a557358,2/26/18,ADB's $235m loan to support B.Grimm Power expa...,https://www.power-technology.com/news/adbs-235...,https://www.power-technology.com/news/adbs-235...,NEWS ADB - All Streams,"Construction, Finance",ADB,,Thailand,,ADB-50410-001,ASEAN Distributed Power Project: Initial Pover...,https://ewsdata.rightsindevelopment.org/projec...


-----------

## Projects 

**Valid EWS HyperLink**

Some projects are linked from more than one article:

In [22]:
labels[(labels['EWS hyperlink'].notnull()) & (labels['EWS hyperlink'].str.contains('//ews')) ]['EWS hyperlink'].value_counts().head()

https://ewsdata.rightsindevelopment.org/projects/50410-001-asean-distributed-power-project-initial-poverty-a/    8
https://ewsdata.rightsindevelopment.org/projects/20150676-tanap-trans-anatolian-natural-gas-pipeline/            6
https://ewsdata.rightsindevelopment.org/projects/p146330-ng-electricity-transmission-project/                    3
https://ewsdata.rightsindevelopment.org/projects/49188-tiryaki-agro-trading/                                     2
https://ewsdata.rightsindevelopment.org/projects/ec-l1111-quito-metropolitan-urban-transport-system/             2
Name: EWS hyperlink, dtype: int64

Number of unique projects that were matched to articles:

In [23]:
labels[(labels['EWS hyperlink'].notnull()) & (labels['EWS hyperlink'].str.contains('//ews')) ]['EWS hyperlink'].nunique()

39

**Process to Check How Many We Can Match**

In [24]:
project_labels = labels[ labels['EWS hyperlink'].notnull()]
project_labels = project_labels[project_labels['EWS hyperlink'].str.contains('https://ews')]

In [25]:
project_ids = []
for i in project_labels['Projects(EWSProjectID)']:
    project_ids += [j.strip() for j in  i.split(',')]

In [26]:
project_ids = list(set(project_ids))
print(len(project_ids))

47


In [14]:
set(project_ids) - set(projects.ProjectNumber.unique())

{'ADB-48226-002',
 'EIB-20140596',
 'EIB-20150676',
 'Figeac Aero Regional',
 'Tranche 2 in EWS',
 'WB-P146330',
 'WB-P148775',
 'WB-P160383',
 'WB-P160408',
 'missing Tranche 3?: https://www.adb.org/projects/36330-043/main#project-overview'}
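The unmatched set above mixes values that look like project IDs with free-text notes left by labelers. A possible way to separate the two is a pattern check. The pattern below is an assumption inferred from the sample values (an uppercase prefix, a hyphen, then uppercase letters, digits, or hyphens), not an EWS specification, and `split_ids` is a hypothetical helper.

```python
import re

# Assumed ID shape, inferred from values such as 'ADB-48226-002' and 'WB-P146330'
ID_PATTERN = re.compile(r'^[A-Z]{2,5}-[A-Z0-9-]+$')

def split_ids(ids):
    """Split values into ID-like strings and everything else (likely notes)."""
    well_formed = sorted(i for i in ids if ID_PATTERN.match(i))
    free_text = sorted(i for i in ids if not ID_PATTERN.match(i))
    return well_formed, free_text

# A subset of the unmatched values shown above
unmatched = {
    'ADB-48226-002',
    'EIB-20140596',
    'Figeac Aero Regional',
    'Tranche 2 in EWS',
    'WB-P146330',
}
well_formed, free_text = split_ids(unmatched)
print(well_formed)
print(free_text)
```

The ID-like values that still fail to match are the ones worth checking against the project listing, while the free-text entries point to labeling cleanup.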

## Sector

In [55]:
def clean(x):
    # Lowercase and trim strings; non-string values (e.g. NaN) become 'ERROR'
    try:
        return x.lower().strip()
    except AttributeError:
        return 'ERROR'

In [56]:
sector_data = labels[labels.Sectors.notnull()].copy()  # copy avoids SettingWithCopyWarning
sector_data['Sectors'] = sector_data.Sectors.apply(clean)


In [57]:
sector_data.Sectors.value_counts()

energy                                                68
infrastructure                                        42
finance                                               35
water and sanitation                                  25
transportation                                        14
transport                                             13
health                                                12
communications                                         9
construction, finance                                  8
0                                                      6
trade                                                  6
hydropower                                             6
trade and industry                                     4
agriculture and forestry, industry and trade           3
water and sanitation, climate                          3
finance, energy                                        3
infrastructure, transportation                         3
education                      

## Banks 

In [58]:
banks_data = labels[labels.Bank1.notnull()].copy()  # copy avoids SettingWithCopyWarning
banks_data['Bank1'] = banks_data.Bank1.apply(clean)
banks_data['Bank2'] = banks_data.Bank2.apply(clean)

In [59]:
banks_data.Bank1.value_counts()

abd                 188
afbd                 68
ebrd                 50
aiib                 43
adb                  25
eib                  16
wb                    8
idb                   4
idb / idb invest      1
Name: Bank1, dtype: int64

In [60]:
banks_data.Bank2.value_counts()

ERROR         397
jica            2
eib             1
gcf             1
ifc             1
world bank      1
Name: Bank2, dtype: int64
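The bank labels above contain variants such as `abd`, `afbd`, and `world bank` that do not line up with the acronyms in the project listing. One option is a small alias table applied after cleaning. The sketch below is hypothetical: treating `abd` and `afbd` as typos of `adb` and `afdb` is an assumption that should be confirmed with whoever labeled the data, and `normalize_bank` is not part of this notebook.

```python
import pandas as pd

# Hypothetical alias table; each mapping is an assumption to verify with the labelers
BANK_ALIASES = {
    'abd': 'adb',
    'afbd': 'afdb',
    'world bank': 'wb',
}

def normalize_bank(value):
    """Lowercase, trim, and map known variants; keep nulls as None instead of 'ERROR'."""
    if pd.isnull(value):
        return None
    key = str(value).lower().strip()
    return BANK_ALIASES.get(key, key)

# Toy values standing in for labels['Bank1']
raw = pd.Series(['ABD ', 'adb', 'World Bank', None, 'afbd'])
normalized = raw.apply(normalize_bank)
counts = normalized.value_counts()
print(counts)
```

Returning `None` for missing values keeps them out of `value_counts()` by default, so they no longer show up as an `ERROR` category.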

## Country 

In [61]:
country_data = labels[labels.Country1.notnull()].copy()  # copy avoids SettingWithCopyWarning
country_data['Country1'] = country_data.Country1.apply(clean)
country_data['Country2'] = country_data.Country2.apply(clean)


In [63]:
country_data.Country1.value_counts()

india                                                                                                       79
philippines                                                                                                 34
pakistan                                                                                                    25
nigeria                                                                                                     22
bangladesh                                                                                                  19
sri lanka                                                                                                   11
mongolia                                                                                                    11
turkey                                                                                                      10
thailand                                                                                                    10
e

In [62]:
country_data.Country2.value_counts()

ERROR    399
nauru      1
Name: Country2, dtype: int64