
Qarik-Project (slides)

Authors: Xiaoyan Ding, Jakob Hansen, Shasha Liao, Shuo Xu

Actionable Insights

See our executive summary for a brief overview of what we've done and why you should care.

Background

Large swaths of the world's information are hidden away in unstructured and poorly accessible formats. For instance, it is estimated that there are trillions of PDF files in existence, with tens of millions more posted to the internet each day. These formats pose a challenge for data extraction and analysis. PDFs are a double challenge: much of the information they contain is encoded in natural language, and further, many such files are generated by scanning paper documents, leaving them without computer-readable text. Reliably obtaining machine-usable information from PDF files is therefore a valuable capability.

The World Bank is an international development organization headquartered in Washington, D.C. that provides loans and aid for projects in developing countries. The Bank has existed for decades, and while a large amount of its data is readily available via a standardized API, much of it is still buried in PDF files. For example, the Bank's loan agreements are only available as unstructured PDF files.

Objective

Our goal is to demonstrate the ability to extract useful information and insights from unstructured PDF files. We do this by analyzing a collection of World Bank loan agreements. These are formulaic but inconsistently structured.

Approach

We received a collection of 3,205 PDF files scraped from the World Bank's database of loan agreements. Some of these PDFs contained machine-readable text, while others were image-only scans with no embedded text. We first built a pipeline to extract and clean the text from these files: we used the open-source package OCRmyPDF to perform OCR on the scanned files, and Apache Tika to programmatically extract the text. We then discarded extremely short files and files with a high proportion of symbols likely to be OCR artifacts.
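For illustration, the core of such a pipeline might look like the following sketch, using OCRmyPDF's Python API and the tika package. The filtering thresholds and the symbol pattern are placeholders, not the values used in the project:

```python
import re
import ocrmypdf
from tika import parser

def ensure_text_layer(pdf_in: str, pdf_out: str) -> None:
    """Run OCR, leaving pages that already have a text layer untouched."""
    ocrmypdf.ocr(pdf_in, pdf_out, skip_text=True)

def extract_text(pdf_path: str) -> str:
    """Pull the text layer out of a PDF with Apache Tika."""
    parsed = parser.from_file(pdf_path)
    return parsed.get("content") or ""

def is_usable(text: str, min_chars: int = 2000,
              max_symbol_frac: float = 0.2) -> bool:
    """Discard very short documents and documents dominated by OCR artifacts.

    The thresholds here are illustrative placeholders.
    """
    if len(text) < min_chars:
        return False
    symbols = len(re.findall(r"[^\w\s.,;:()'\"-]", text))
    return symbols / max(len(text), 1) <= max_symbol_frac
```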

With the full text extracted and cleaned, we set about finding features that could be consistently extracted from the agreements (a sketch of this kind of pattern-based extraction follows the list). These included:

  • Segmentation into sections
  • Date of the agreement
  • Name of country making the agreement
  • Name of the project
  • Amount of the loan
  • Ending date of the loan agreement
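The preambles of these agreements are formulaic enough that pattern matching recovers many of these fields. The regular expressions below are simplified stand-ins, not the exact rules used in our pipeline:

```python
import re

# Illustrative patterns for the formulaic preamble of a loan agreement;
# a real pipeline would need several variants per field.
DATE_RE = re.compile(r"dated\s+(\w+\s+\d{1,2},\s+\d{4})", re.IGNORECASE)
AMOUNT_RE = re.compile(
    r"amount\s+of\s+([A-Z]{3}|\$|US\$)?\s*([\d,]+(?:\.\d+)?)", re.IGNORECASE
)
BORROWER_RE = re.compile(r"between\s+(.+?)\s+\(the\s+Borrower\)", re.IGNORECASE)

def extract_fields(text: str) -> dict:
    """Pull the date, borrower, and loan amount from an agreement's text."""
    fields = {}
    if m := DATE_RE.search(text):
        fields["date"] = m.group(1)
    if m := BORROWER_RE.search(text):
        fields["borrower"] = m.group(1)
    if m := AMOUNT_RE.search(text):
        fields["currency"], fields["amount"] = m.group(1), m.group(2)
    return fields
```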

The World Bank has a database of information about projects which is accessible via an API. However, the loan agreements are not linked to the entries in this database. Using fuzzy string matching on the project names, we were able to match many of the agreements to projects with high confidence and, from the project database, find the economic sector corresponding to each agreement. However, many documents remained unmatched. This provided an opportunity to build a model that uses information from the agreements themselves to predict the project sector.
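The matching step might look like the following sketch using the rapidfuzz library (the library choice and score cutoff are assumptions for illustration):

```python
from rapidfuzz import fuzz, process

def match_project(agreement_name: str, api_project_names: list[str],
                  min_score: float = 90.0):
    """Match an agreement's project name against the API's project list.

    Returns the best-matching API name, or None below the confidence cutoff.
    """
    result = process.extractOne(
        agreement_name, api_project_names, scorer=fuzz.token_sort_ratio
    )
    if result is not None:
        name, score, _ = result
        if score >= min_score:
            return name
    return None
```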

Code and notebooks used in feature extraction and data cleaning are in the folder extraction_pipeline.

Results

Sector Classification

We built models to classify agreements by their primary sector, as categorized by the World Bank. Our initial models were unsupervised: they predicted the sector from computed similarities between embeddings of the project name and description and embeddings of the sector and subsector names. The best of these models used pretrained Word2Vec embeddings of the relevant words in the documents and computed the average similarity between words in the project name and description and words in the sector name. Evaluated on the agreements matched to the World Bank database, it attained 47% accuracy over 11 classes: significantly better than chance, but still wrong more often than not.
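The following sketch shows the shape of this approach, using gensim's pretrained Google News vectors. The sector labels are illustrative rather than the World Bank's exact taxonomy, and n_similarity compares the mean vectors of the two word sets, a close relative of the average-similarity score described above:

```python
import gensim.downloader as api

# Pretrained Word2Vec vectors; a large download on first use.
wv = api.load("word2vec-google-news-300")

# Illustrative sector labels, not the World Bank's exact taxonomy.
SECTORS = ["agriculture", "education", "energy", "finance", "health",
           "transportation", "water"]

def predict_sector(doc_words: list[str]) -> str:
    """Assign the sector whose name is most similar to the document's words."""
    words = [w for w in doc_words if w in wv]
    # n_similarity is the cosine similarity between the mean vectors
    # of the two word sets.
    scores = {s: wv.n_similarity(words, [s]) for s in SECTORS}
    return max(scores, key=scores.get)
```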

We also used the subset of projects we were able to match with the World Bank database to train a supervised model. These models used features produced by simple document embeddings of the project names and descriptions. We constructed tf-idf vectors for each document, and built a latent semantic indexing (LSI) model to reduce the vectors to a topic space that identifies some correlations between words. We then used these embeddings as inputs to a number of simple classifier models. An ensemble of three classifiers (a random forest, a logistic regression, and a kernel support vector machine) was able to achieve 76% accuracy over 10 classes on a held-out test set. (One class only had 6 labeled examples, so we dropped it from the model.)

Confusion matrix for the ensemble model
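The supervised pipeline can be sketched with scikit-learn, where TruncatedSVD over tf-idf vectors plays the role of the LSI embedding described above; the topic count and classifier hyperparameters here are illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# TruncatedSVD over tf-idf vectors is scikit-learn's latent semantic
# analysis, equivalent to the LSI embedding; 200 topics is illustrative.
embed = make_pipeline(TfidfVectorizer(stop_words="english"),
                      TruncatedSVD(n_components=200))

# Soft-voting ensemble of the three classifier types described above.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",
)

model = make_pipeline(embed, ensemble)
# Usage: model.fit(train_texts, train_sectors)
#        model.score(test_texts, test_sectors)
```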

Code and notebooks used to build the classifiers are in the folder classification.

Data Analysis

One major signal we found in the data was a marked increase in the total amount lent by the Bank in the period after the 2008 financial crisis. This increase is likely related to the additional financial pressures faced by governments during that period. The financial sector and public administration, two sectors very strongly affected by the crisis, saw particularly large increases in lending.

Total loan amounts by sector, 1999-2019

We applied the entity analysis tools in Google Cloud Platform's Natural Language API to extract the total amount of each loan, along with its currency, and then converted all loan amounts to current US dollars. This allowed us to compare the total amount borrowed across countries.

Loan Amount Map
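As a sketch of how this works with the Cloud Natural Language API: PRICE entities carry the numeric value and currency code in their metadata. Credentials setup and the conversion to current US dollars are omitted here:

```python
from google.cloud import language_v1

def extract_loan_amounts(text: str) -> list[tuple[str, str]]:
    """Return (currency, value) pairs for monetary entities in the text."""
    client = language_v1.LanguageServiceClient()
    document = {"content": text,
                "type_": language_v1.Document.Type.PLAIN_TEXT}
    response = client.analyze_entities(request={"document": document})
    amounts = []
    for entity in response.entities:
        if entity.type_ == language_v1.Entity.Type.PRICE:
            # PRICE entities expose "currency" and "value" in their metadata.
            amounts.append(
                (entity.metadata.get("currency"), entity.metadata.get("value"))
            )
    return amounts
```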

It is also interesting to ask whether these loans actually helped developing countries grow their economies. We downloaded time-series GDP data from the World Bank's open data portal and compared it with the total loan amounts for the top 10 borrower countries. The plot shows a positive correlation between loan amounts and total GDP.

GDP versus total loan amount for the top 10 borrower countries
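For reference, GDP in current US dollars is indicator NY.GDP.MKTP.CD in the World Bank's public API; a single country's series can be fetched as in the following sketch (the project may have used a bulk download instead):

```python
import requests

def gdp_series(country_code: str) -> dict[int, float]:
    """Fetch annual GDP (current US$) for a country from the World Bank API."""
    url = (f"https://api.worldbank.org/v2/country/{country_code}"
           f"/indicator/NY.GDP.MKTP.CD")
    data = requests.get(url, params={"format": "json", "per_page": 100}).json()
    # data[0] is paging metadata; data[1] is the list of yearly observations.
    return {int(row["date"]): row["value"]
            for row in data[1] if row["value"] is not None}

# Usage: gdp_series("CN") returns a {year: GDP} mapping.
```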

Code to generate the plots can be found here.
