# Sector prediction from loan documents

Here we build an unsupervised model to predict the primary sector of a project from the project description and project title given in the loan document.

Some caveats: 
- Not all loan documents have a project description or a project title, and the model will probably not perform well on these.
- We are able to match data from the World Bank database with some of the loan documents, but the majority are unmatched.

The main idea is to embed the project description and the sector names as vectors in a document space, and match each project description with the nearest sector name. To give additional semantic information to the sector names, we append all the subsector names given by the World Bank.
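
As a rough illustration of the matching idea, here is a minimal sketch that uses plain word counts and cosine similarity instead of the models built later in this notebook. The `toy_sectors` texts and the helper names are made up for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sector documents: the sector name followed by some subsector names.
toy_sectors = {
    "Education": "education primary education secondary education tertiary education",
    "Energy and Extractives": "energy and extractives mining oil and gas renewable energy hydro",
}

def predict_sector(description, sectors):
    """Return the name of the sector whose document is most similar to the description."""
    query = bag_of_words(description)
    return max(sectors, key=lambda name: cosine_similarity(query, bag_of_words(sectors[name])))

prediction = predict_sector("rehabilitation of hydro energy generating units", toy_sectors)
print(prediction)
```

Appending the subsector names matters here: without "hydro" and the repeated "energy" in the sector document, the toy description would have fewer words in common with any sector.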

Producing good documents for embedding will probably require cleaning. 

Does doc2vec work well with stopwords removed? Maybe a baseline tf-idf or LSI model would be best to start with.

In [53]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
import gensim
import enum

## Load and collate all the data

The data starts out in several separate tables, which we need to join to get all the relevant features and labels.

In [3]:
descriptions_table = pd.read_csv("word_bank_project_description.csv")

In [4]:
descriptions_table.head()

Unnamed: 0.1,Unnamed: 0,year,month,day,id,name,project_desc,top_common_40
0,0,1991,september,12,558671468103155868,conformed-copy--l3361--kolubara-b-thermal-powe...,SCHEDULE 2\n\n Description ...,serbia rehabilitation thermal monitoring progr...
1,1,2001,may,31,473681468094784501,conformed-copy--l7054--railway-restructuring-p...,SCHEDULE 2\n\nDescription of the Project\n\nTh...,borrower staff program objectives transport se...
2,2,1991,november,19,892131468028134392,china--beijing-environmental-project-:-loan-34...,,
3,3,2013,may,20,945021468224685176,"official-documents-loan-agreement,-l8236-cn-cl...",SCHEDULE1\n\nProject Description\n\nThe objec...,urban rehabilitation water construction city s...
4,4,1991,may,7,107521468303622178,conformed-copy--l3230--third-highway-sector-pr...,SCHEDULE 2\n\nDescription of the Project\n\nTh...,maintenance improve highway planning including...


In [5]:
features_with_project_id = pd.read_pickle("extracted_features_with_project_id.pkl")

In [6]:
world_bank_projects = pd.read_pickle("world_bank_projects/world_bank_projects_json.pkl")

In [7]:
descriptions_table.project_desc.count()

2906

In [8]:
features_with_project_id['id'] = features_with_project_id.id.apply(int)

In [9]:
features_with_project_id.head()

Unnamed: 0,year,month,day,id,name,countries,Total Amount,Project Name,wb_project_id
0,1991,september,12,558671468103155868,conformed-copy--l3361--kolubara-b-thermal-powe...,"[Serbia, United States]",300000000,Kolubara B Thermal Power and Lignite Mine Project,
1,2001,may,31,473681468094784501,conformed-copy--l7054--railway-restructuring-p...,"[Poland, United States]",110000000,Railway Restructuring Project,
2,1991,november,19,892131468028134392,china--beijing-environmental-project-:-loan-34...,"[China, United States]",60100000,Beijing Environmental Project,
3,2013,may,20,945021468224685176,"official-documents-loan-agreement,-l8236-cn-cl...","[China, United States]",150000000,Liaoning Coastal Economic Zone Urban Infrastru...,
4,1991,may,7,107521468303622178,conformed-copy--l3230--third-highway-sector-pr...,"[Bosnia and Herzegovina, United States]",55000000,Third Highway Sector Project,


In [10]:
agreements_features = descriptions_table.merge(features_with_project_id[['id','Project Name', 'wb_project_id']], left_on='id', right_on='id', how='inner')

In [11]:
agreements_features = agreements_features.merge(world_bank_projects[['id','major_sector_name']], left_on='wb_project_id',right_on='id',how='left')
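
Since most agreements have no matching World Bank project, it can help to measure the match rate directly. The sketch below uses two made-up tables (`toy_agreements`, `toy_projects`) with the same key columns as above. The `indicator` and `validate` arguments are standard options of pandas `merge`.

```python
import pandas as pd

# Hypothetical stand-ins for the agreements and World Bank project tables.
toy_agreements = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "wb_project_id": ["P001", None, "P003", None],
})
toy_projects = pd.DataFrame({
    "id": ["P001", "P003", "P999"],
    "major_sector_name": ["Education", "Health", "Financial Sector"],
})

merged = toy_agreements.merge(
    toy_projects,
    left_on="wb_project_id",
    right_on="id",
    how="left",
    suffixes=("_agreement", "_project"),
    indicator=True,            # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",    # raises if a project id is duplicated on the right
)

# Fraction of agreements that found a matching project (and so have a label).
match_rate = (merged["_merge"] == "both").mean()
print(match_rate)
```

With `validate="many_to_one"`, a left merge cannot add rows, so an unexpected change in row count would show up as an error rather than silently.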

In [12]:
len(agreements_features)

3195

In [13]:
agreements_features.sample(10)

Unnamed: 0.1,Unnamed: 0,year,month,day,id_x,name,project_desc,top_common_40,Project Name,wb_project_id,id_y,major_sector_name
1898,1900,1992,december,16,760141468045251729,conformed-copy--l3536--third-primary-and-secon...,SCHEDULE 2\n\n Description ...,provision educational primary secondary traini...,Third Primary and Secondary Education Sector P...,,,
1167,1168,1991,november,26,916971468272053819,egypt--national-drainage-project-:-loan-3417--...,,,National Drainage Project,,,
1717,1718,2011,may,10,727071468251100676,loan-agreement-for-loan-8029-in-conformed.txt,SCHEDULE 2 \n \nAmortization Schedule \n\nThe ...,date principal loan payment bank paragraph bor...,e-Delivery of Public Services Development Poli...,,,
1776,1778,1992,january,15,441261468203647983,conformed-copy--l3422--primary-education-devel...,SCHEDULE 2\n\nDescription of the Project\n\nTh...,training education teachers primary management...,Primary Education Development Project,,,
98,98,2011,september,22,655611468005729826,loan-agreement-for-loan-8024-az-conformed.txt,SCHEDULE 1 \n \nProject Description \n\n \n\n ...,provision consultants services market capital ...,Capital Markets Modernization Project,P120321,P120321,Financial Sector
1889,1891,2017,june,28,374591499706222690,official-documents-loan-agreement-for-loan-875...,SCHEDULE 2\n\nAmortization Schedule\n\nThe f...,loan date principal payment paragraph bank sec...,Second Competitiveness and Fiscal Management P...,,,
1173,1174,2012,february,15,533511468218386526,official-documents-loan-agreement-for-loan-812...,SCHEDULE 2 Amortization Schedule The following...,date principal payment loan bank borrower para...,Third Development Policy Operation,,,
1508,1509,2004,october,18,907781468222587588,loan-agreement-l4751-cn-conformed.txt,SCHEDULE 2 \n\nDescription of the Project \n\n...,flood control management coal capacity scenic ...,Hunan Urban Development Project,P075730,P075730,Water/Sanitation/Waste
2846,2850,2013,december,16,246261468040135600,official-documents-loan-agreement-for-loan-831...,SCHEDULE 2\n\nAmortization Schedule\n\nThe f...,date principal loan payment bank paragraph bor...,Economic Stabilization and Foundations for Gro...,,,
626,626,2006,september,12,345501468312857965,"loan-agreement,-l4824-tr-conformed.txt",SCHEDULE 1 \n \nProject Description \n \n\nThe...,generation maintenance business power plant sy...,Electricity Generation Rehabilitation and \nR...,P085561,P085561,Energy and Extractives


In [14]:
# remove all agreements where we have no data whatsoever. if there's a description or a name, we keep it, though.
agreements_features.dropna(how='all',inplace=True,subset=['project_desc','Project Name'])

In [15]:
len(agreements_features)
#we don't lose very many

3189

In [16]:
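# The parsing below assumes one name per line: each block starts with a major
# sector, followed by its subsectors, and blocks are separated by blank lines.
with open("world_bank_sectors.txt") as f:
    sectors = f.read()

sectors_list = sectors.split("\n")

# Map each major sector to the list of its subsectors.
sector_dict = dict()
current_major_sector = ''
for sector in sectors_list:
    if sector == '':
        # A blank line ends the current block.
        current_major_sector = ''
        continue
    if current_major_sector == '':
        # The first line of a block is the major sector name.
        current_major_sector = sector
        sector_dict[current_major_sector] = []
        continue
    sector_dict[current_major_sector].append(sector)

# Reverse lookup: subsector name -> major sector name.
reverse_sector_dict = dict()
for major_sector in sector_dict:
    for sector in sector_dict[major_sector]:
        reverse_sector_dict[sector] = major_sector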

In [17]:
# Assign each major sector an integer id, in file order.
sector_numbers = {sector: i for i, sector in enumerate(sector_dict)}

In [18]:
# Unmatched agreements have no sector name (NaN), so they get no sector id.
agreements_features['sector_id'] = [sector_numbers[sector] if isinstance(sector, str) else None for sector in agreements_features.major_sector_name.values]

In [19]:
full_sector_description = [' '.join([x] + sector_dict[x]) for x in sector_dict]

In [20]:
full_sector_description

['Agriculture Agricultural Extension, Research, and Other Support Activities Crops Fisheries Forestry Irrigation and Drainage Livestock Other Agriculture, Fishing and Forestry Public Administration - Agriculture, Fishing and Forestry',
 'Education Adult, Basic and Continuing Education Early Childhood Education Other Education Primary Education Public Administration - Education Secondary Education Tertiary Education Workforce Development/Skills',
 'Energy and Extractives Energy Transmission and Distribution Mining Non-Renewable Energy Generation Oil and Gas Other Energy and Extractives Public Administration - Energy and Extractives Renewable Energy Biomass Renewable Energy Geothermal Renewable Energy Hydro Renewable Energy Solar Renewable Energy Wind',
 'Financial Sector Banking Institutions Capital Markets Insurance and Pension Other Non-bank Financial Institutions Public Administration - Financial Sector',
 'Health Health Health Facilities and Construction Public Administration - Heal

## Clean the data

We're going to start with a simple bag-of-words based model, so we'll need to remove uninformative words. These come in two types:
- common words: those that are too common across the dataset to convey much information
- rare words: those that occur only a few times, so there's little sense in using them for comparison. These also include OCR typos.

We will remove these from the dictionary so they don't get taken into account in the comparison.
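
To make the two kinds of filtering concrete, here is a small pure-Python sketch of document-frequency filtering. It is meant to mirror the `no_below` / `no_above` thresholds used with the gensim dictionary below, but it is an illustration on made-up documents, not gensim's implementation.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Made-up documents; "rehabilitaton" stands in for an OCR-style typo.
toy_docs = [
    ["the", "borrower", "road", "maintenance"],
    ["the", "borrower", "school", "teachers"],
    ["the", "borrower", "road", "rehabilitaton"],
    ["the", "school", "road", "construction"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "the" is dropped for being in every document, and the words that occur in a single document (including the typo) are dropped for being too rare.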

First, tokenize all the texts to split them into words.

In [21]:
descriptions_tokens = [nltk.word_tokenize(description.lower()) if type(description) == str else [] for description in agreements_features.project_desc.values]
project_names_tokens = [nltk.word_tokenize(name.lower()) if type(name) == str else [] for name in agreements_features["Project Name"].values]
sector_descriptions_tokens = [nltk.word_tokenize(desc.lower()) for desc in full_sector_description]

Construct a gensim dictionary, which counts all the words and their occurrences.

In [22]:
dictionary = gensim.corpora.Dictionary([x for x in descriptions_tokens if x is not None])

In [23]:
len(dictionary)

30892

We now filter out the words that appear in fewer than 5 documents and the words that appear in more than 70% of the documents, and then drop the 30 most frequent of the remaining words.

In [24]:
dictionary.filter_extremes(no_below=5,no_above=.7)

In [25]:
len(dictionary)

6662

In [26]:
dictionary.filter_n_most_frequent(30)

In [27]:
len(dictionary)

6632

Add in the project names and filter again.

In [28]:
dictionary.add_documents([x for x in project_names_tokens if x is not None])

In [29]:
len(dictionary)

7623

In [30]:
dictionary.filter_extremes(no_below=2,no_above=.9)

In [31]:
len(dictionary)

6908

Finally, add in the sector descriptions.

In [32]:
dictionary.add_documents(sector_descriptions_tokens)

We won't remove any of the words from these; we need all we can get.

In [33]:
len(dictionary)

6918

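An alternative is to build the dictionary from the sector descriptions alone, since similarity to the sector descriptions is all we care about. This is left commented out below: as noted under "Making predictions", it did not help.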

In [34]:
#dictionary = gensim.corpora.Dictionary(sector_descriptions_tokens)

In [35]:
#len(dictionary)

Now we use this dictionary to construct a sequence of vector embeddings. First simple bag of words:

In [36]:
description_bow = [dictionary.doc2bow(description) if description is not None else None for description in descriptions_tokens]
project_name_bow = [dictionary.doc2bow(name) if name is not None else None for name in project_names_tokens]
sector_descriptions_bow = [dictionary.doc2bow(name) if name is not None else None for name in sector_descriptions_tokens]

In [37]:
corpus = list(filter(lambda x: x is not None, description_bow + project_name_bow + sector_descriptions_bow))

Now we get tf-idf vectors for these documents.

In [38]:
tfidf = gensim.models.TfidfModel(corpus)

In [39]:
description_tfidf = [tfidf[desc] if desc is not None else None for desc in description_bow]
project_name_tfidf = [tfidf[name] if name is not None else None for name in project_name_bow]
sector_descriptions_tfidf = [tfidf[name] if name is not None else None for name in sector_descriptions_bow]
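
For intuition, here is a pure-Python sketch of a tf-idf weighting: raw term frequency times log2(N / document frequency), with each vector then scaled to unit length. We believe this matches the defaults of gensim's `TfidfModel`, but treat that as an assumption. The documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # number of documents containing each token

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        weights = {tok: tf * math.log2(n_docs / doc_freq[tok]) for tok, tf in counts.items()}
        # Tokens that occur in every document get weight 0 and are dropped.
        weights = {tok: w for tok, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {tok: w / norm for tok, w in weights.items()}
        vectors.append(weights)
    return vectors

toy_docs = [
    ["the", "road", "road", "bridge"],
    ["the", "road", "school"],
    ["the", "school", "teachers"],
]
vectors = tfidf_vectors(toy_docs)
print(vectors[0])
```

In the first document, "bridge" outweighs "road" even though "road" occurs twice, because "bridge" appears in only one document.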

In [40]:
agreements_features['project_description_tfidf'] = description_tfidf
agreements_features['project_name_tfidf'] = project_name_tfidf
#agreements_features['sector_name_tfidf'] = sector_name_tfidf

Finally, we use latent semantic indexing (LSI) to reduce the dimensionality and remove noise. LSI is a truncated SVD of the tf-idf matrix; here we keep 300 dimensions.

In [122]:
lsi = gensim.models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=300)

In [123]:
description_lsi = [lsi[desc] if desc is not None else None for desc in description_tfidf]
project_name_lsi = [lsi[name] if name is not None else None for name in project_name_tfidf]
sector_desc_lsi = [lsi[name] if name is not None else None for name in sector_descriptions_tfidf]

In [124]:
agreements_features['project_description_lsi'] = description_lsi
agreements_features['project_name_lsi'] = project_name_lsi
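
The following sketch shows what the truncated SVD buys us, using NumPy on a tiny made-up term-document matrix rather than gensim's `LsiModel`. Two documents that share no words end up close together in the reduced space because their words co-occur in a third document.

```python
import numpy as np

# Made-up term-document matrix: rows are terms, columns are documents.
#              d0   d1   d2   d3   d4
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],   # road
    [1.0, 0.0, 1.0, 0.0, 0.0],   # highway
    [0.0, 0.0, 0.0, 1.0, 1.0],   # school
    [0.0, 0.0, 0.0, 1.0, 0.0],   # teacher
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Full SVD, then keep only the k strongest directions ("topics").
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = (U[:, :k].T @ X).T   # shape (n_docs, k): each document in topic space

# d1 contains only "road" and d2 only "highway": no words in common.
raw_similarity = cosine(X[:, 1], X[:, 2])
lsi_similarity = cosine(doc_topics[1], doc_topics[2])
cross_similarity = cosine(doc_topics[1], doc_topics[3])
print(raw_similarity, lsi_similarity, cross_similarity)
```

This is the effect we hope for when a project description uses vocabulary that is related to, but not literally present in, a sector description.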

## Making predictions

We'll now use similarities in the document space to predict sectors. 

In [125]:
similarities = gensim.similarities.MatrixSimilarity(sector_desc_lsi)
description_similarities = similarities[description_lsi]
project_name_similarities = similarities[project_name_lsi]

In [126]:
description_based_topic_predictions = description_similarities.argmax(axis=1)
project_name_topic_predictions = project_name_similarities.argmax(axis=1)

In [127]:
agreements_features['description_topic_predictions'] = description_based_topic_predictions
agreements_features['project_name_topic_predictions'] = project_name_topic_predictions

In [128]:
description_test_set = agreements_features.dropna(how='any',inplace=False,subset=['project_desc','sector_id'])
correct_desc_pred = description_test_set.description_topic_predictions == description_test_set.sector_id
description_accuracy = sum(correct_desc_pred) / len(correct_desc_pred)
print("Accuracy:", description_accuracy, "on", len(correct_desc_pred), "predictions")

Accuracy: 0.4574780058651026 on 682 predictions


In [129]:
name_test_set = agreements_features.dropna(how='any',inplace=False,subset=['Project Name','sector_id'])
correct_name_pred = name_test_set.project_name_topic_predictions == name_test_set.sector_id
name_accuracy = sum(correct_name_pred) / len(correct_name_pred)
print("Accuracy:", name_accuracy, "on", len(correct_name_pred), "predictions")

Accuracy: 0.406380027739251 on 721 predictions


With the settings shown here, the descriptions reach about 46% accuracy on 682 documents and the project names about 41% on a slightly larger set of 721. The project names come surprisingly close given how much less information they contain. Maybe we should train the model differently.

Some things to try:
- Use only the words that appear in the sector descriptions. Tried it: it doesn't change the performance of the project name prediction, and it makes the description-based prediction worse. Possibly the LSI was able to find connections between topics and words using the project descriptions, which would be a reason to keep all the words.
- Different thresholds for filtering words out. The project name prediction accuracy is pretty robust to this, which makes sense.
- Figure out how to weight words based on the length of the sector description they belong to. Class 7 is the most common true label but is predicted much less often than it occurs, probably because its sector description is the shortest.
- More sophisticated models like word2vec or doc2vec that include relationships between words. We probably want some pretrained model that knows that certain words are related to terms in the sector descriptions, even if they don't appear there.
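
To check which classes are being missed, per-class recall is more informative than overall accuracy. Here is a minimal sketch; the helper name and the toy labels are made up and are not results from this notebook.

```python
from collections import Counter

def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels: class 7 is the most common but is often predicted as something else.
toy_true = [7, 7, 7, 7, 9, 9, 0, 0]
toy_pred = [9, 7, 0, 9, 9, 9, 0, 7]

recall = per_class_recall(toy_true, toy_pred)
print(recall)
```

Applied to the real test sets, the same function would take the `sector_id` column and one of the prediction columns.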

In [120]:
description_test_set.project_name_topic_predictions.value_counts()

9     102
5      88
3      79
6      72
7      65
10     64
0      51
8      48
4      41
1      39
2      33
Name: project_name_topic_predictions, dtype: int64

In [121]:
description_test_set.sector_id.value_counts()

7.0     122
9.0     100
10.0     99
0.0      78
2.0      62
4.0      52
5.0      49
1.0      43
8.0      38
3.0      33
6.0       6
Name: sector_id, dtype: int64

Interesting: although the classes are unbalanced, the predictions do not simply follow the class frequencies, so class imbalance is not the reason the project name-based classification does better than chance.
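
To put "better than chance" in numbers, we can compute a few trivial baselines from the true-label counts printed above. This is a sketch: the three baseline definitions are our choice of comparison, not part of the model.

```python
# True-label counts copied from the sector_id.value_counts() output above.
label_counts = {7: 122, 9: 100, 10: 99, 0: 78, 2: 62, 4: 52, 5: 49, 1: 43, 8: 38, 3: 33, 6: 6}
total = sum(label_counts.values())

# Always predict the most common class.
majority_baseline = max(label_counts.values()) / total

# Guess uniformly at random among the classes.
uniform_baseline = 1 / len(label_counts)

# Guess at random, following the true class proportions.
proportional_baseline = sum((count / total) ** 2 for count in label_counts.values())

print(total, majority_baseline, uniform_baseline, proportional_baseline)
```

The strongest of these, always predicting class 7, would be right on 122 of the 682 documents, or roughly 18%.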

In [50]:
sector_descriptions_tokens[7]

['public',
 'admin',
 'central',
 'government',
 '(',
 'central',
 'agencies',
 ')',
 'law',
 'and',
 'justice',
 'other',
 'public',
 'administration',
 'sub-national',
 'government']

## A Doc2Vec model

Let's try using a more sophisticated model: Doc2Vec. This is technically a neural network, although a very shallow one, which is trained to predict words from local context (like word2vec) combined with a document context vector.

We need to get our corpus in a slightly different format to use with gensim's doc2vec.

In [61]:
doc2vec_corpus = [gensim.models.doc2vec.TaggedDocument(tokens, [i]) for i, tokens in enumerate(filter(lambda x: len(x) > 0, descriptions_tokens + project_names_tokens + sector_descriptions_tokens))]

In [62]:
doc2vec_corpus[:2]

[TaggedDocument(words=['schedule', '2', 'description', 'of', 'the', 'project', 'the', 'objectives', 'of', 'the', 'project', 'are', 'to', ':', '(', 'i', ')', 'meet', 'the', 'growth', 'in', 'demand', 'for', 'electricity', 'in', 'serbia', ',', 'including', 'its', 'supplies', 'to', 'other', 'republics', 'of', 'the', 'guarantor', ';', '(', 'ii', ')', 'facilitate', 'rehabilitation', 'of', 'aging', 'thermal', 'generating', 'units', 'in', 'serbia', ';', '(', 'iii', ')', 'introduce', 'a', 'systematic', 'air-quality', 'monitoring', 'and', 'study', 'program', 'in', 'the', 'kolubara', 'area', ';', 'and', '(', 'iv', ')', 'improve', 'the', 'financial', 'performance', 'of', 'the', 'borrower', 'and', 'ep', '(', 's', ')', 'and', 'strengthen', 'their', 'capacities', 'in', 'financial', 'management', 'and', 'accounting', '.', 'the', 'project', 'consists', 'of', 'the', 'following', 'parts', ',', 'subject', 'to', 'such', 'modifications', 'thereof', 'as', 'the', 'borrower', 'and', 'the', 'bank', 'may', 'agre

In [63]:
doc2vec = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=3, epochs=40)

In [64]:
doc2vec.build_vocab(doc2vec_corpus)

In [65]:
doc2vec.train(doc2vec_corpus, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

In [70]:
description_vectors = [doc2vec.infer_vector(desc) for desc in descriptions_tokens]
project_name_vectors = [doc2vec.infer_vector(name) for name in project_names_tokens]
sector_description_vectors = [doc2vec.infer_vector(desc) for desc in sector_descriptions_tokens]

In [77]:
description_distances_doc2vec = np.zeros((len(sector_description_vectors),len(description_vectors)))
for i in range(len(description_vectors)):
    for j in range(len(sector_description_vectors)):
        description_distances_doc2vec[j,i] = np.linalg.norm(description_vectors[i]-sector_description_vectors[j])
project_name_distances_doc2vec = np.zeros((len(sector_description_vectors),len(project_name_vectors)))
for i in range(len(project_name_vectors)):
    for j in range(len(sector_description_vectors)):
        project_name_distances_doc2vec[j,i] = np.linalg.norm(project_name_vectors[i]-sector_description_vectors[j])
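
The nested loops above can also be written with NumPy broadcasting. The sketch below is an equivalent formulation on made-up 2-dimensional vectors. `pairwise_distances` is a hypothetical helper name, and it keeps the same (sector, document) layout as the arrays above.

```python
import numpy as np

def pairwise_distances(doc_vectors, sector_vectors):
    """Euclidean distances with shape (n_sectors, n_docs)."""
    docs = np.asarray(doc_vectors, dtype=float)         # (n_docs, dim)
    sectors = np.asarray(sector_vectors, dtype=float)   # (n_sectors, dim)
    # Broadcasting builds every sector-document difference at once.
    diffs = sectors[:, None, :] - docs[None, :, :]      # (n_sectors, n_docs, dim)
    return np.linalg.norm(diffs, axis=2)

# Made-up vectors standing in for the inferred doc2vec vectors.
toy_docs = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
toy_sectors = np.array([[0.0, 0.0], [3.0, 0.0]])

distances = pairwise_distances(toy_docs, toy_sectors)
predictions = distances.argmin(axis=0)   # nearest sector for each document
print(distances.shape, predictions)
```

The intermediate array has `n_sectors * n_docs * dim` entries, which is small for 11 sectors and a few thousand 50-dimensional vectors.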

In [81]:
doc2vec_description_predictions = np.argmin(description_distances_doc2vec,axis=0)
doc2vec_project_name_predictions = np.argmin(project_name_distances_doc2vec,axis=0)

In [82]:
doc2vec_description_predictions

array([9, 1, 4, ..., 0, 4, 7])

In [83]:
agreements_features['description_topic_predictions_doc2vec'] = doc2vec_description_predictions
agreements_features['project_name_topic_predictions_doc2vec'] = doc2vec_project_name_predictions

In [84]:
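# Rebuild the test set so that it includes the doc2vec prediction columns added above.
description_test_set = agreements_features.dropna(how='any',inplace=False,subset=['project_desc','sector_id'])
correct_desc_pred = description_test_set.description_topic_predictions_doc2vec == description_test_set.sector_id
description_accuracy = sum(correct_desc_pred) / len(correct_desc_pred)
print("Description-based Accuracy:", description_accuracy, "on", len(correct_desc_pred), "predictions")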

Description-based Accuracy: 0.3592375366568915 on 682 predictions


In [85]:
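# Rebuild the test set so that it includes the doc2vec prediction columns added above.
name_test_set = agreements_features.dropna(how='any',inplace=False,subset=['Project Name','sector_id'])
correct_name_pred = name_test_set.project_name_topic_predictions_doc2vec == name_test_set.sector_id
name_accuracy = sum(correct_name_pred) / len(correct_name_pred)
print("Project Name-based Accuracy:", name_accuracy, "on", len(correct_name_pred), "predictions")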

Project Name-based Accuracy: 0.18585298196948682 on 721 predictions


In [98]:
w = 0.95 #weighting factor for joint prediction
combined_distances = w * description_distances_doc2vec + (1-w) * project_name_distances_doc2vec
combined_predictions = np.argmin(combined_distances,axis=0)
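
The weight `w` was picked by hand. One way to explore it is to sweep a few values and compare accuracies, as in this sketch. The distance arrays and labels are made up, and `sweep_weights` is a hypothetical helper. Choosing `w` on the same labeled documents used for evaluation would make the reported accuracy optimistic.

```python
import numpy as np

def sweep_weights(desc_dist, name_dist, labels, weights):
    """Accuracy of the combined nearest-sector prediction for each weight."""
    results = {}
    for w in weights:
        combined = w * desc_dist + (1 - w) * name_dist
        predictions = combined.argmin(axis=0)
        results[w] = float((predictions == labels).mean())
    return results

# Made-up distances with shape (n_sectors, n_docs) and made-up true labels.
toy_desc = np.array([[0.2, 0.9, 0.6],
                     [0.8, 0.1, 0.4]])
toy_name = np.array([[0.4, 0.45, 0.1],
                     [0.6, 0.55, 0.9]])
toy_labels = np.array([0, 1, 0])

accuracy_by_weight = sweep_weights(toy_desc, toy_name, toy_labels, [0.0, 0.5, 1.0])
print(accuracy_by_weight)
```

In this toy case each source alone gets one document wrong, and the equal-weight combination gets all three right.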

In [99]:
agreements_features['predictions_doc2vec'] = combined_predictions

In [100]:
combined_test_set = agreements_features.dropna(how='any',inplace=False,subset=['project_desc','Project Name','sector_id'])
correct_joint_pred = combined_test_set.predictions_doc2vec == combined_test_set.sector_id
joint_accuracy = sum(correct_joint_pred) / len(correct_joint_pred)
print("Combined Accuracy:", joint_accuracy, "on", len(correct_joint_pred), "predictions")


Combined Accuracy: 0.35043988269794724 on 682 predictions


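With doc2vec, the descriptions (about 36% accuracy) do much better than the project names (about 19%), and combining the two (about 35%) does not help. After fixing some bugs, all of these are worse than the LSI model, which reached about 46% on descriptions and 41% on project names.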

In [111]:
agreements_features.to_pickle("classification_features.pkl")