# Sector prediction from loan documents

Here we build an unsupervised model to predict the primary sector of a project from the project description and project title given in the loan document.

Some caveats: 
- Not all loan documents have a project description or a project title, and the model will probably not perform well on these.
- We are able to match data from the World Bank database with some of the loan documents, but the majority are unmatched.

The main idea is to embed the project description and the sector names as vectors in a document space, and match each project description with the nearest sector name. To give additional semantic information to the sector names, we append all the subsector names given by the World Bank.

Producing good documents for embedding will probably require cleaning. 

Does doc2vec work well with stopwords removed? Maybe a baseline tf-idf or LSI model would be best to start with.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
import gensim

## Load and collate all the data

It begins in a few different tables, and we need to join them together to get all the relevant features and labels.

In [2]:
descriptions_table = pd.read_csv("word_bank_project_description.csv")

In [3]:
descriptions_table.head()

Unnamed: 0.1,Unnamed: 0,year,month,day,id,name,project_desc,top_common_40
0,0,1991,september,12,558671468103155868,conformed-copy--l3361--kolubara-b-thermal-powe...,SCHEDULE 2\n\n Description ...,serbia rehabilitation thermal monitoring progr...
1,1,2001,may,31,473681468094784501,conformed-copy--l7054--railway-restructuring-p...,SCHEDULE 2\n\nDescription of the Project\n\nTh...,borrower staff program objectives transport se...
2,2,1991,november,19,892131468028134392,china--beijing-environmental-project-:-loan-34...,,
3,3,2013,may,20,945021468224685176,"official-documents-loan-agreement,-l8236-cn-cl...",SCHEDULE1\n\nProject Description\n\nThe objec...,urban rehabilitation water construction city s...
4,4,1991,may,7,107521468303622178,conformed-copy--l3230--third-highway-sector-pr...,SCHEDULE 2\n\nDescription of the Project\n\nTh...,maintenance improve highway planning including...


In [4]:
features_with_project_id = pd.read_pickle("extracted_features_with_project_id.pkl")

In [5]:
world_bank_projects = pd.read_pickle("world_bank_projects/world_bank_projects_json.pkl")

In [6]:
descriptions_table.project_desc.count()

2906

In [7]:
features_with_project_id['id'] = features_with_project_id.id.apply(int)

In [8]:
features_with_project_id.head()

Unnamed: 0,year,month,day,id,name,countries,Total Amount,Project Name,wb_project_id
0,1991,september,12,558671468103155868,conformed-copy--l3361--kolubara-b-thermal-powe...,"[Serbia, United States]",300000000,Kolubara B Thermal Power and Lignite Mine Project,
1,2001,may,31,473681468094784501,conformed-copy--l7054--railway-restructuring-p...,"[Poland, United States]",110000000,Railway Restructuring Project,
2,1991,november,19,892131468028134392,china--beijing-environmental-project-:-loan-34...,"[China, United States]",60100000,Beijing Environmental Project,
3,2013,may,20,945021468224685176,"official-documents-loan-agreement,-l8236-cn-cl...","[China, United States]",150000000,Liaoning Coastal Economic Zone Urban Infrastru...,
4,1991,may,7,107521468303622178,conformed-copy--l3230--third-highway-sector-pr...,"[Bosnia and Herzegovina, United States]",55000000,Third Highway Sector Project,


In [9]:
agreements_features = descriptions_table.merge(features_with_project_id[['id','Project Name', 'wb_project_id']], left_on='id', right_on='id', how='inner')

In [10]:
agreements_features = agreements_features.merge(world_bank_projects[['id','major_sector_name']], left_on='wb_project_id',right_on='id',how='left')

In [11]:
len(agreements_features)

3195

In [12]:
agreements_features.sample(10)

Unnamed: 0.1,Unnamed: 0,year,month,day,id_x,name,project_desc,top_common_40,Project Name,wb_project_id,id_y,major_sector_name
1547,1548,1994,november,15,638351468096860331,pakistan--financial-sector-deepening-and-inter...,SCHEDULE 2 Description of the Project The obje...,development insurance strengthening financial ...,Financial Sector Deepening and Intermediation ...,,,
844,845,2000,may,19,750091468282904743,conformed-copy--l4056--uttar-pradesh-rural-wat...,SCHEDULE 2\n\n Description...,water supply sanitation environmental construc...,Uttar Pradesh Rural Water Supply and\n ...,,,
2100,2104,1993,september,15,178261468292835233,conformed-copy--c2542--public-sector-adjustmen...,SCHEDULE 1\n\n Withdrawal of the ...,expenditures bank shall borrower loan goods pr...,Public Sector Adjustment Loan,,,
261,261,1992,january,27,208111468233360870,conformed-copy--c2313--national-drainage-proje...,,,National Drainage Project,,,
2156,2160,2003,october,17,834001468350174170,conformed-copy--l4709--rural-land-titling-and-...,SCHEDULE 2 \n \nDescription of the Project \n\...,land development farm including cadastre train...,Rural Land Titling and Cadastre Development Pr...,,,
3194,3198,2014,february,18,277931468279946325,"official-documents-loan-agreement,-l8314-mx-cl...",SCHEDULE 2\n\nThe Borrower shall repay th...,borrower means 's education bank loan section ...,Third Upper Secondary Education\nDevelopment P...,,,
920,921,1996,september,13,762561468020643363,conformed-copy--l4027--second-henan-provincial...,SCHEDULE 2\n\nDescription of the Project\n\nTh...,henan road construction improvement safety pro...,Second Henan Provincial Highway Project,,,
2170,2174,1996,august,12,150211468267591336,conformed-copy--l4030--social-sectors-strategy...,SCHEDULE 2\n\n Descripti...,sector development capacity strategies includi...,Social Sectors Strategy and Capacity Building ...,,,
476,476,2013,july,29,414031468228875616,official-documents-amendment-to-the-loan-agree...,,,202,,,
1038,1039,1991,june,28,662691468116369316,conformed-copy--l3324--state-and-provincial-ro...,SCHEDULE 2\n\n Description...,development programs local training including ...,Employment Promotion and Services Project,,,


In [13]:
# remove all agreements where we have no data whatsoever. if there's a description or a name, we keep it, though.
agreements_features.dropna(how='all',inplace=True,subset=['project_desc','Project Name'])

In [14]:
len(agreements_features)
#we don't lose very many

3189

In [15]:
with open("world_bank_sectors.txt") as f:
  sectors = f.read()

sectors_list = sectors.split("\n")

sector_dict = dict()
current_major_sector = ''
for sector in sectors_list:
    if sector == '':
        current_major_sector = ''
        continue
    if current_major_sector == '':
        current_major_sector = sector
        sector_dict[current_major_sector] = []
        continue
    if current_major_sector in sector_dict.keys():
        sector_dict[current_major_sector].append(sector)

reverse_sector_dict = dict()
for major_sector in sector_dict:
    for sector in sector_dict[major_sector]:
        reverse_sector_dict[sector] = major_sector

In [16]:
sector_numbers = dict()
i = 0
for sector in sector_dict:
    sector_numbers[sector] = i
    i += 1

In [17]:
agreements_features['sector_id'] = [sector_numbers[sector] if type(sector) == str else None for sector in agreements_features.major_sector_name.values]

In [18]:
full_sector_description = [' '.join([x] + sector_dict[x]) for x in sector_dict]

In [19]:
full_sector_description

['Agriculture Agricultural Extension, Research, and Other Support Activities Crops Fisheries Forestry Irrigation and Drainage Livestock Other Agriculture, Fishing and Forestry Public Administration - Agriculture, Fishing and Forestry',
 'Education Adult, Basic and Continuing Education Early Childhood Education Other Education Primary Education Public Administration - Education Secondary Education Tertiary Education Workforce Development/Skills',
 'Energy and Extractives Energy Transmission and Distribution Mining Non-Renewable Energy Generation Oil and Gas Other Energy and Extractives Public Administration - Energy and Extractives Renewable Energy Biomass Renewable Energy Geothermal Renewable Energy Hydro Renewable Energy Solar Renewable Energy Wind',
 'Financial Sector Banking Institutions Capital Markets Insurance and Pension Other Non-bank Financial Institutions Public Administration - Financial Sector',
 'Health Health Health Facilities and Construction Public Administration - Heal

## Clean the data

We're going to start with a simple bag-of-words based model, so we'll need to remove uninformative words. These come in two types:
- common words: those that are too common across the dataset to convey much information
- rare words: those that only occur a couple of times, so that there's no sense in using them for comparison. these also include ocr typos.

We will remove these from the dictionary so they don't get taken into account in the comparison.

First, tokenize all the texts to split them into words.

In [20]:
descriptions_tokens = [nltk.word_tokenize(description.lower()) if type(description) == str else [] for description in agreements_features.project_desc.values]
project_names_tokens = [nltk.word_tokenize(name.lower()) if type(name) == str else [] for name in agreements_features["Project Name"].values]
sector_descriptions_tokens = [nltk.word_tokenize(desc.lower()) for desc in full_sector_description]

Construct a gensim dictionary, which counts all the words and their occurrences.

In [21]:
dictionary = gensim.corpora.Dictionary([x for x in descriptions_tokens if x is not None])

In [22]:
len(dictionary)

30892

We now filter out the words that appear in at most 5 documents and the words that are in more than 50% of the documents, as well as the overall 50 most frequent words.

In [23]:
dictionary.filter_extremes(no_below=5,no_above=.5)

In [24]:
len(dictionary)

6623

In [25]:
dictionary.filter_n_most_frequent(50)

In [26]:
len(dictionary)

6573

Add in the project names and filter again.

In [27]:
dictionary.add_documents([x for x in project_names_tokens if x is not None])

In [28]:
len(dictionary)

7598

In [29]:
dictionary.filter_extremes(no_below=2,no_above=.9)

In [30]:
len(dictionary)

6877

Finally, add in the sector descriptions.

In [31]:
dictionary.add_documents(sector_descriptions_tokens)

We won't remove any of the words from these; we need all we can get.

In [32]:
len(dictionary)

6891

Now we use this dictionary to construct a sequence of vector embeddings. First simple bag of words:

In [33]:
description_bow = [dictionary.doc2bow(description) if description is not None else None for description in descriptions_tokens]
project_name_bow = [dictionary.doc2bow(name) if name is not None else None for name in project_names_tokens]
sector_descriptions_bow = [dictionary.doc2bow(name) if name is not None else None for name in sector_descriptions_tokens]

In [34]:
corpus = list(filter(lambda x: x is not None, description_bow + project_name_bow + sector_descriptions_bow))

Now we get tf-idf vectors for these documents.

In [35]:
tfidf = gensim.models.TfidfModel(corpus)

In [36]:
description_tfidf = [tfidf[desc] if desc is not None else None for desc in description_bow]
project_name_tfidf = [tfidf[name] if name is not None else None for name in project_name_bow]
sector_descriptions_tfidf = [tfidf[name] if name is not None else None for name in sector_descriptions_bow]

In [37]:
agreements_features['project_description_tfidf'] = description_tfidf
agreements_features['project_name_tfidf'] = project_name_tfidf
#agreements_features['sector_name_tfidf'] = sector_name_tfidf

Finally, we use latent semantic indexing to reduce the dimensionality, removing noise. This is just an SVD on the tf-idf matrix.

In [38]:
lsi = gensim.models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=200)

In [39]:
description_lsi = [lsi[desc] if desc is not None else None for desc in description_tfidf]
project_name_lsi = [tfidf[name] if name is not None else None for name in project_name_tfidf]
sector_desc_lsi = [tfidf[name] if name is not None else None for name in sector_descriptions_tfidf]

## Making predictions

We'll now use similarities in the document space to predict sectors. 

In [40]:
similarities = gensim.similarities.MatrixSimilarity(sector_desc_lsi)
description_similarities = similarities.get_similarities(description_lsi)
project_name_similarities = similarities.get_similarities(project_name_lsi)

In [41]:
description_based_topic_predictions = description_similarities.argmax(axis=1)
project_name_topic_predictions = project_name_similarities.argmax(axis=1)

In [42]:
agreements_features['description_topic_predictions'] = description_based_topic_predictions
agreements_features['project_name_topic_predictions'] = project_name_topic_predictions

In [43]:
description_test_set = agreements_features.dropna(how='any',inplace=False,subset=['project_desc','sector_id'])
correct_desc_pred = description_test_set.description_topic_predictions == description_test_set.sector_id
description_accuracy = sum(correct_desc_pred) / len(correct_desc_pred)
print("Accuracy:", description_accuracy, "on", len(correct_desc_pred), "predictions")

Accuracy: 0.15249266862170088 on 682 predictions


In [44]:
name_test_set = agreements_features.dropna(how='any',inplace=False,subset=['Project Name','sector_id'])
correct_name_pred = name_test_set.project_name_topic_predictions == name_test_set.sector_id
name_accuracy = sum(correct_name_pred) / len(correct_name_pred)
print("Accuracy:", name_accuracy, "on", len(correct_name_pred), "predictions")

Accuracy: 0.3897364771151179 on 721 predictions


Interestingly, the project names do better on a larger test set than the full project descriptions, despite having far less information. Maybe we should train the model differently.

Some things to try:
- Use only the words that appear in the sector descriptions. (This makes sense, and probably should have been the case from the beginning.)
- Different thresholds for filtering words out.
- More sophisticated models like word2vec or doc2vec that include relationships between words. We probably want some pretrained model that knows that certain words are related to terms in the sector descriptions, even if they don't appear there.