# Question 1

Problem objective: This problem measures the candidate's competency in automation, understanding and accessing APIs, unstructured data processing, and visualization.

Problem description:

-	Using the World Bank Documents & Reports API, extract metadata related to the Economic and Sector work conducted by the Bank from January 1st, 2010 to April 1st, 2024.
-	Use documents written in English.
-	Then, extract country names from the document title (not from any other metadata component).
-	Produce an animated chart showing the evolution of the number and the percentage of documents by country and year.
 


## Import Libraries

In [179]:
import requests
import json
import re
import pandas as pd
import plotly.express as px
# !pip install spacy
import spacy
!python -m spacy download en_core_web_lg



## Fetch Documents

In [180]:
def fetch_documents(start_date, end_date, language='English'):
    """
    Fetches all documents related to 'Economic and Sector Work' from the World Bank Documents & Reports API 
    within the specified date range and language.
    
    Args:
        start_date (str): The start date in the format 'YYYY-MM-DD'.
        end_date (str): The end date in the format 'YYYY-MM-DD'.
        language (str): The language of documents to fetch. Defaults to 'English'.
    
    Returns:
        dict: Document metadata keyed by document ID (one nested dict per document).
    """
    base_url = "https://search.worldbank.org/api/v2/wds"
    documents = {}
    
    params = {
        'format': 'json',
        'lang_exact': language,
        'strdate': start_date,
        'enddate': end_date,
        'qterm': 'Economic and Sector Work',
        'fl': 'docdt,docty,count',
        'rows': 500,
        'os': 0
    }
    
    try:
        while True:
            response = requests.get(base_url, params=params, timeout=30)  # Do not hang forever on a stalled connection
            response.raise_for_status()  # Raises HTTPError for bad requests (4XX or 5XX)
            
            data = response.json()
            if 'documents' in data:
                documents.update(data['documents'])
                print(f"Fetched {len(data['documents'])} documents. Total fetched: {len(documents)}")
                if params['os'] + len(data['documents']) < data['total']:
                    params['os'] += len(data['documents'])
                else:
                    break  # All documents fetched
            else:
                print("No documents found or error in response format.")
                break
            
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    
    return documents


In [181]:
documents = fetch_documents("2010-01-01", "2024-04-01")
documents_df = pd.DataFrame(documents).T.reset_index(names=['document_id'])
documents_df.head()

Fetched 501 documents. Total fetched: 501
Fetched 501 documents. Total fetched: 1001
Fetched 501 documents. Total fetched: 1501
Fetched 501 documents. Total fetched: 2001
Fetched 501 documents. Total fetched: 2501
Fetched 501 documents. Total fetched: 3001
Fetched 501 documents. Total fetched: 3501
Fetched 501 documents. Total fetched: 4001
Fetched 501 documents. Total fetched: 4501
Fetched 501 documents. Total fetched: 5001
Fetched 501 documents. Total fetched: 5501
Fetched 501 documents. Total fetched: 6001
Fetched 501 documents. Total fetched: 6501
Request failed: 503 Server Error: Service Unavailable for url: https://search.worldbank.org/api/v2/wds?format=json&lang_exact=English&strdate=2010-01-01&enddate=2024-04-01&qterm=Economic+and+Sector+Work&fl=docdt%2Cdocty%2Ccount&rows=500&os=6513
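
The run above stopped at offset 6513 because a single 503 response ended the whole loop. One common remedy is to retry a failed call a few times with exponential backoff before giving up. The sketch below is a minimal, standard-library illustration of that idea and is not part of the original notebook: `call_with_retries` and `flaky_request` are hypothetical names, and `flaky_request` is a stand-in for the real `requests.get` call.

```python
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt seconds between attempts (exponential backoff)
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for the HTTP call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return {"documents": {}, "total": 0}


delays = []  # Record the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=delays.append)
print(result, delays)
```

In `fetch_documents`, the wrapped function would be the request for one page, and `retry_on` would more sensibly be narrowed to `requests.exceptions.RequestException`.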


| | document_id | id | count | docty | entityids | docdt | abstracts | display_title | pdfurl | listing_relative_url | url_friendly_title | new_url | guid | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | D23811621 | 23811621 | Philippines | Economic Updates and Modeling | {'entityid': '000477144_20150116101016'} | 2015-01-16T00:00:00Z | {'cdata!': 'In the third quarter of 2014 (Q3),... | Philippine economic update :\n maki... | http://documents.worldbank.org/curated/en/4714... | /research/2015/01/23811621/philippine-economic... | http://documents.worldbank.org/curated/en/4714... | 2015/01/23811621/philippine-economic-update-ma... | 471411468057360432 | http://documents.worldbank.org/curated/en/4714... |
| 1 | D26408130 | 26408130 | Myanmar | Economic Updates and Modeling | {'entityid': '090224b0849162da_3_0'} | 2016-05-01T00:00:00Z | {'cdata!': 'After two years of strong economic... | Myanmar economic monitor :\n growin... | http://documents.worldbank.org/curated/en/2320... | /research/2016/05/26408130/myanmar-economic-mo... | http://documents.worldbank.org/curated/en/2320... | 2017/06/26408130/myanmar-economic-monitor-grow... | 232051468186846783 | http://documents.worldbank.org/curated/en/2320... |
| 2 | D31202125 | 31202125 | Sao Tome and Principe | Country Economic Memorandum | {'entityid': '090224b086e2abe6_1_0'} | 2019-06-26T00:00:00Z | {'cdata!': 'The purpose of this background not... | Country Economic Memorandum :\n Bac... | http://documents.worldbank.org/curated/en/1420... | /research/2019/06/31202125/country-economic-me... | http://documents.worldbank.org/curated/en/1420... | 2019/06/31202125/Country-Economic-Memorandum-B... | 142041562906624878 | http://documents.worldbank.org/curated/en/1420... |
| 3 | D25169507 | 25169507 | Malawi | Economic Updates and Modeling | {'entityid': '090224b083156ac6_1_0'} | 2015-10-01T00:00:00Z | {'cdata!': 'The Malawi economic monitor (MEM) ... | Malawi economic monitor :\n adjusti... | http://documents.worldbank.org/curated/en/4277... | /research/2015/10/25169507/malawi-economic-mon... | http://documents.worldbank.org/curated/en/4277... | 2015/10/25169507/malawi-economic-monitor-adjus... | 427721468190759173 | http://documents.worldbank.org/curated/en/4277... |
| 4 | D31016637 | 31016637 | Türkiye | Country Economic Memorandum | {'entityid': '090224b086de8d53_2_0'} | 2019-04-29T00:00:00Z | {'cdata!': 'Turkey’s pace of income convergenc... | Turkey Country Economic\n Memorandu... | http://documents.worldbank.org/curated/en/3056... | /research/2019/04/31016637/turkey-country-econ... | http://documents.worldbank.org/curated/en/3056... | 2019/04/31016637/Turkey-Country-Economic-Memor... | 305601561046781065 | http://documents.worldbank.org/curated/en/3056... |


In [182]:
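# Reload the metadata from a CSV saved earlier with to_csv (left commented out)
# instead of fetching it again from the API.
# documents_df.to_csv('world_bank_documents.csv', index=False)
# low_memory=False lets pandas infer each column's dtype from the whole file,
# which avoids the mixed-type DtypeWarning on columns 14-16.
documents_df = pd.read_csv('world_bank_documents.csv', low_memory=False)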



In [183]:
documents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47711 entries, 0 to 47710
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   document_id           47711 non-null  object 
 1   id                    47710 non-null  float64
 2   count                 47709 non-null  object 
 3   docty                 46985 non-null  object 
 4   entityids             47710 non-null  object 
 5   docdt                 47710 non-null  object 
 6   abstracts             25320 non-null  object 
 7   display_title         47710 non-null  object 
 8   pdfurl                45309 non-null  object 
 9   listing_relative_url  47710 non-null  object 
 10  url_friendly_title    47710 non-null  object 
 11  new_url               41810 non-null  object 
 12  guid                  47710 non-null  float64
 13  url                   47710 non-null  object 
 14  seccl                 1000 non-null   object 
 15  disclstat          

## Extract Country Names

In [184]:
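# Collapse runs of whitespace (including newlines) into single spaces. Hyphens are left
# in place so that hyphenated names such as "Guinea-Bissau" can still be matched later.
documents_df["display_title"] = documents_df["display_title"].apply(lambda x: re.sub(r'\s+', ' ', str(x)).strip())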


In [185]:
# Build a list of country names and a mapping from English demonyms to common country names
def fetch_country_data():
    url = "https://restcountries.com/v3.1/all"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error body
    data = response.json()

    # Map each demonym (feminine and masculine forms) to the country's common name.
    # Note: a demonym shared by several entries keeps only the last one seen
    # (e.g. 'american' -> 'Northern Mariana Islands' in the output below).
    demonym_to_country = {}
    country_ls = []
    for country in data:
        name = country['name']['common']
        country_ls.append(name)
        demonyms = country.get('demonyms', {}).get('eng', {})
        for form in ('f', 'm'):
            demonym = demonyms.get(form)
            if demonym:
                demonym_to_country[demonym.lower()] = name
    return demonym_to_country, country_ls

# Fetch the data and print the demonym mapping
mapping, country_ls = fetch_country_data()
print(mapping)

{'moldovan': 'Moldova', 'american': 'Northern Mariana Islands', 'mahoran': 'Mayotte', 'nauruan': 'Nauru', 'mozambican': 'Mozambique', 'brazilian': 'Brazil', 'cape verdian': 'Cape Verde', 'equatorial guinean': 'Equatorial Guinea', 'albanian': 'Albania', 'virgin islander': 'British Virgin Islands', 'niuean': 'Niue', 'palauan': 'Palau', 'nigerian': 'Nigeria', 'gambian': 'Gambia', 'somali': 'Somalia', 'yemeni': 'Yemen', 'malaysian': 'Malaysia', 'dominican': 'Dominican Republic', 'british': 'United Kingdom', 'malagasy': 'Madagascar', 'sahrawi': 'Western Sahara', 'cypriot': 'Cyprus', 'antiguan, barbudan': 'Antigua and Barbuda', 'irish': 'Ireland', 'paraguayan': 'Paraguay', 'sri lankan': 'Sri Lanka', 'south african': 'South Africa', 'kuwaiti': 'Kuwait', 'algerian': 'Algeria', 'croatian': 'Croatia', 'martinican': 'Martinique', 'sierra leonean': 'Sierra Leone', 'rwandan': 'Rwanda', 'syrian': 'Syria', 'saint vincentian': 'Saint Vincent and the Grenadines', 'kosovar': 'Kosovo', 'saint lucian': 'S

In [186]:
# Manually add some mappings for names that appear in our dataset
mapping["lao pdr"] = "Laos"
mapping["lao"] = "Laos"
mapping["philippine"] = "Philippines"
mapping["sao tome and príncipe"] = "São Tomé and Príncipe"
mapping["sao tome and principe"] = "São Tomé and Príncipe"
mapping["republic of congo"] = "Republic of the Congo"
mapping["congo"] = "Republic of the Congo"
mapping["kyrgyz republic"] = "Kyrgyzstan"

In [187]:
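# Sort the mapping keys longest-first so that multi-word keys (e.g. 'lao pdr') are tried before their substrings ('lao')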
sorted_demonyms = sorted(mapping.keys(), key=len, reverse=True)

In [188]:
# Compile the pattern once rather than on every call. Because sorted_demonyms is ordered
# longest-first, longer keys win over shorter keys that are substrings of them.
demonym_pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, sorted_demonyms)) + r')\b',
    flags=re.IGNORECASE,
)

def replace_demonyms(text):
    # Replace every demonym or alias found in the text with its common country name
    return demonym_pattern.sub(lambda match: mapping[match.group(0).lower()], text)

In [189]:
# Replace demonyms in titles with common country names
# E.g. Chinese -> China, Cuban -> Cuba
documents_df["display_title"] = documents_df["display_title"].apply(replace_demonyms)
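
To see why the keys are sorted longest-first, consider the two manual entries `lao` and `lao pdr`. Regex alternation tries alternatives from left to right, so if `lao` comes first it matches inside "Lao PDR" and leaves "PDR" behind. The sketch below reproduces this with a two-entry mapping; `aliases`, `build_pattern` and `replace_aliases` are illustrative names, not part of the notebook.

```python
import re

# Two-entry mapping, just large enough to show the ordering issue.
aliases = {"lao": "Laos", "lao pdr": "Laos"}


def build_pattern(keys):
    # Alternatives are tried left to right, so the order of keys matters.
    return re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b", re.IGNORECASE)


def replace_aliases(text, keys):
    pattern = build_pattern(keys)
    return pattern.sub(lambda m: aliases[m.group(0).lower()], text)


title = "Lao PDR economic monitor"

shortest_first = sorted(aliases, key=len)
longest_first = sorted(aliases, key=len, reverse=True)

print(replace_aliases(title, shortest_first))  # Laos PDR economic monitor
print(replace_aliases(title, longest_first))   # Laos economic monitor
```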

In [190]:
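# Find every name from the country list that occurs in a title
def find_all_occurrences(text, word_list):
    # Try longer names first so that e.g. "Guinea-Bissau" is not matched as just "Guinea"
    words = sorted(word_list, key=len, reverse=True)
    pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, words)), re.IGNORECASE)
    return pattern.findall(text)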

# Example usage
text = 'São Tomé and Príncipe - Country Economic Memorandum : Background Note 15 - Blue Economy and Environmental Resiliency'

occurrences = find_all_occurrences(text, country_ls)
print(occurrences)

['São Tomé and Príncipe']


In [191]:
# Load the English NLP model
nlp_lg = spacy.load('en_core_web_lg')

In [192]:
from tqdm import tqdm

country_col = []
# Extract countries from each title
for title in tqdm(documents_df["display_title"]):
    # First try exact matches against the known country list
    countries = find_all_occurrences(title, country_ls)
    if len(countries) == 0:
        # Fall back to spaCy named-entity recognition. GPE (geopolitical entity) covers
        # countries but also cities and states, so this step can return non-country names.
        doc = nlp_lg(str(title))
        countries = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
    # Deduplicate so that a country is counted at most once per document
    countries = list(set(countries))
    country_col.append(countries)

# Add a new column to store the list of countries for each title
documents_df["countries"] = country_col

100%|██████████| 47711/47711 [03:32<00:00, 224.02it/s]
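
Two things can distort the counts at this point: the case-insensitive regex returns names as they are spelled in the title (so "KENYA" and "Kenya" would be counted separately), and the GPE fallback can return cities or regions. One possible post-processing step, sketched here as an assumption rather than something the notebook does, is to map every candidate to its canonical spelling and drop anything that is not in the country list. `canonicalize_countries` is a hypothetical helper.

```python
def canonicalize_countries(candidates, known_countries):
    """Map candidates to their canonical spelling; drop names that are not known countries."""
    canonical = {name.lower(): name for name in known_countries}
    result = []
    for candidate in candidates:
        name = canonical.get(candidate.strip().lower())
        if name is not None and name not in result:
            result.append(name)
    return result


# Illustrative values only: a short country list and mixed-quality candidates.
known = ["Kenya", "Vietnam", "Laos"]
candidates = ["KENYA", "Nairobi", "kenya", "Vietnam"]
print(canonicalize_countries(candidates, known))  # ['Kenya', 'Vietnam']
```

The trade-off is that a strict filter also discards valid spellings that the country list does not contain, so the manual mappings would need to cover those cases.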


In [193]:
# Explode the country lists so that each (document, country) pair gets its own row
documents_df_count = documents_df.explode("countries")
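
The grouping in the next cell relies on a `year` column whose creation is not shown in this notebook (it may already be present in the cached CSV). A minimal sketch of one way to derive it, assuming `docdt` values have the `YYYY-MM-DDTHH:MM:SSZ` form seen in the preview above; `year_from_docdt` is a hypothetical helper.

```python
from datetime import datetime


def year_from_docdt(docdt):
    """Return the year of a 'YYYY-MM-DDTHH:MM:SSZ' timestamp, or None if it cannot be parsed."""
    try:
        return datetime.strptime(docdt, "%Y-%m-%dT%H:%M:%SZ").year
    except (TypeError, ValueError):
        # TypeError: missing value (None or NaN); ValueError: unexpected format
        return None


print(year_from_docdt("2015-01-16T00:00:00Z"))  # 2015
print(year_from_docdt(None))                    # None
```

It could be applied with `documents_df["docdt"].apply(year_from_docdt)`; `pd.to_datetime(documents_df["docdt"], errors="coerce").dt.year` is a vectorized alternative.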

In [194]:
# Count the number of occurrences of each country in each year.
# Keep only rows with count > 5 and a name shorter than 25 characters, which removes most spurious extractions.
country_counts_per_year = documents_df_count.groupby(['year', 'countries']).size().reset_index(name='count')
country_counts_per_year.rename(columns={"countries": "country"}, inplace=True)

filtered_countries = country_counts_per_year[(country_counts_per_year['count'] > 5) & (country_counts_per_year['country'].str.len() < 25)]

filtered_countries

| | year | country | count |
|---|---|---|---|
| 0 | 2010.0 | Afghanistan | 41 |
| 1 | 2010.0 | Albania | 24 |
| 3 | 2010.0 | Angola | 8 |
| 5 | 2010.0 | Argentina | 33 |
| 6 | 2010.0 | Armenia | 37 |
| ... | ... | ... | ... |
| 3154 | 2023.0 | Uganda | 7 |
| 3159 | 2023.0 | Vietnam | 6 |
| 3165 | 2024.0 | Bangladesh | 7 |
| 3175 | 2024.0 | Ethiopia | 7 |


In [195]:
# Use assign() to build a new DataFrame; writing into the filtered slice in place
# triggers pandas' SettingWithCopyWarning.
filtered_countries = filtered_countries.assign(year=filtered_countries['year'].astype(int))



## Visualization

In [196]:
# Sum the per-country counts within each year. This is the denominator for the percentages;
# it covers only the countries kept by the filter above, not every document published that year.
total_docs_by_year = filtered_countries.groupby('year')['count'].sum().reset_index(name='total_docs')

# Merge the yearly totals back into the per-country table.
filtered_countries = pd.merge(filtered_countries, total_docs_by_year, on='year')

# Calculate each country's share of that year's total, in percent.
filtered_countries['percentage'] = (filtered_countries['count'] / filtered_countries['total_docs']) * 100
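
The percentage is simply a country's count divided by the sum of counts in the same year, times 100 (for Afghanistan in 2010, 41 of 2497, about 1.64%). The following standard-library sketch shows the same computation on a few made-up counts; the numbers are illustrative and not taken from the dataset.

```python
from collections import defaultdict

# Illustrative counts only: (year, country) -> number of documents
counts = {
    (2010, "Afghanistan"): 30,
    (2010, "Albania"): 10,
    (2011, "Afghanistan"): 5,
    (2011, "Albania"): 15,
}

# Denominator: total documents per year across the countries present
totals = defaultdict(int)
for (year, _country), n in counts.items():
    totals[year] += n

# Share of each country within its year, in percent
percentages = {key: 100 * n / totals[key[0]] for key, n in counts.items()}

for (year, country), pct in sorted(percentages.items()):
    print(year, country, f"{pct:.1f}%")
```

Because the denominator only includes the rows that are present, shares within each year always sum to 100, even when many countries were filtered out beforehand.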


In [197]:
filtered_countries

| | year | country | count | total_docs | percentage |
|---|---|---|---|---|---|
| 0 | 2010 | Afghanistan | 41 | 2497 | 1.641970 |
| 1 | 2010 | Albania | 24 | 2497 | 0.961153 |
| 2 | 2010 | Angola | 8 | 2497 | 0.320384 |
| 3 | 2010 | Argentina | 33 | 2497 | 1.321586 |
| 4 | 2010 | Armenia | 37 | 2497 | 1.481778 |
| ... | ... | ... | ... | ... | ... |
| 1351 | 2023 | Uganda | 7 | 187 | 3.743316 |
| 1352 | 2023 | Vietnam | 6 | 187 | 3.208556 |
| 1353 | 2024 | Bangladesh | 7 | 20 | 35.000000 |
| 1354 | 2024 | Ethiopia | 7 | 20 | 35.000000 |


In [198]:
filtered_countries_sorted = filtered_countries.sort_values(by=['year', 'count'], ascending=[True, True])

# Passing two columns to y uses plotly's wide-form mode: one trace per column.
# barmode="group" puts the count and percentage bars side by side instead of stacking them.
fig = px.bar(filtered_countries_sorted,
             x="country",
             y=["count", "percentage"],
             animation_frame="year",
             barmode="group",
             title="Number and Percentage of Documents by Country and Year",
             labels={"value": "Documents (count or % of yearly total)", "variable": "Measure"},
             # Fix the y-axis range so the bars stay comparable across frames.
             range_y=[0, max(filtered_countries['count']) + 10],
             height=1600,
             width=2000)

# Update layout for better readability and alignment.
fig.update_layout(
    xaxis={'categoryorder': 'total descending'},
    xaxis_title="Country",
    yaxis_title="Number / Percentage of Documents"
)

# Show each animation frame for 2000 ms when the Play button is pressed.
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 2000

fig.show()
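
Animated Plotly Express charts tend to behave best when every category is present in every frame; here, a country that passes the filter in one year but not in another is simply absent from some frames. One possible preparation step, sketched below as an assumption and not as part of the notebook, is to add a zero-count row for every missing (year, country) pair before plotting. `fill_missing_frames` is a hypothetical helper that works on plain dictionaries.

```python
from itertools import product


def fill_missing_frames(rows):
    """Return rows with a zero-count entry added for every missing (year, country) pair."""
    years = sorted({r["year"] for r in rows})
    countries = sorted({r["country"] for r in rows})
    existing = {(r["year"], r["country"]): r for r in rows}
    return [
        existing.get((year, country), {"year": year, "country": country, "count": 0})
        for year, country in product(years, countries)
    ]


# Illustrative rows only: Kenya has no entry for 2011.
rows = [
    {"year": 2010, "country": "Kenya", "count": 8},
    {"year": 2010, "country": "Peru", "count": 6},
    {"year": 2011, "country": "Peru", "count": 9},
]
dense = fill_missing_frames(rows)
print(len(dense))  # 4
```

The result can be turned back into a DataFrame with `pd.DataFrame(dense)`; the percentages would then need to be computed after this step so that the added rows get a value of 0.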
