# Sector prediction from loan documents

Here we build an unsupervised model to predict the primary sector of a project from the project description and project title given in the loan document.

Some caveats: 
- Not all loan documents have a project description or a project title, and the model will probably not perform well on these.
- We are able to match data from the World Bank database with some of the loan documents, but the majority are unmatched.

The main idea is to embed the project description and the sector names as vectors in a document space, and match each project description with the nearest sector name. To give additional semantic information to the sector names, we append all the subsector names given by the World Bank.
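
To make the matching idea concrete, here is a minimal sketch using plain bag-of-words counts and cosine similarity. The sector texts and the description are hypothetical toy data, and the real notebook uses tf-idf and LSI vectors rather than raw counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical sector "documents": sector name plus its subsector names.
toy_sectors = {
    "Transportation": "transportation roads highways railways ports",
    "Education": "education primary secondary tertiary schools",
}
toy_description = "rehabilitation of rural roads and highways"

sector_bows = {name: Counter(text.split()) for name, text in toy_sectors.items()}
desc_bow = Counter(toy_description.split())

# Predict the sector whose document is closest to the description.
toy_prediction = max(sector_bows, key=lambda name: cosine(desc_bow, sector_bows[name]))
print(toy_prediction)
```

Appending subsector names matters here: without "roads" and "highways" in the sector text, the description would share no words with any sector.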

Producing good documents for embedding will probably require cleaning. 

Does doc2vec work well with stopwords removed? Maybe a baseline tf-idf or LSI model would be best to start with.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
import gensim
import enum

## Load the data



In [6]:
agreements_features = pd.read_pickle("../extraction_pipeline/agreement_features_join_projects.pkl")

In [7]:
agreements_features.head()

Unnamed: 0,id,date,country_code,country_name,named_countries,Project Name,project_desc,amount_standard,currency_standard,closing_month,closing_day,closing_year,loan_length in year,wb_project_id,major_sector_name,sector1_name,sector2_name,sector3_name
0,658601468273303198,1990-01-05,jo,Jordan,"[Jordan, United States]",Industry and Trade Policy Adjustment Loan,SCHEDULE 1\n\n Withdrawal of the Procee...,150000000.0,us dollar,December,31,1992,2.0,,,,,
1,784351468234894668,1990-01-10,br,Brazil,"[Brazil, United States]",Third Agricultural Research Project,SCHEDULE 2\n\nDescription of the Project\n\nTh...,47000000.0,us dollar,June,30,1995,5.0,,,,,
2,891941468049466312,1990-01-11,ma,Malaysia,"[Malaysia, United States]",Rubber Industry Smallholders\n Deve...,SCHEDULE 2\n\n Description of the...,71000000.0,us dollar,December,31,1993,3.0,,,,,
3,125511468051878371,1990-01-11,ma,Malaysia,"[Malaysia, United States]",Highway Rehabilitation and Improvement Project,SCHEDULE 2\n\n Description of the...,83200000.0,us dollar,June,30,1995,5.0,P099112,Transportation,Rural and Inter-Urban Roads,Sub-National Government,
4,733341468033245855,1990-01-16,ivc,Cote d'Ivoire,[United States],Abidjan Environmental Protection Project,SCHEDULE 2\n\nDescription of the Project\n\nTh...,21900000.0,us dollar,December,31,1993,3.0,,,,,


In [8]:
# remove all agreements where we have no data whatsoever. if there's a description or a name, we keep it, though.
agreements_features.dropna(how='all',inplace=True,subset=['project_desc','Project Name'])

In [9]:
len(agreements_features)
#we don't lose very many

3189

Get the sector names and subsector names.

In [11]:
with open("../world_bank_sectors.txt") as f:
  sectors = f.read()

sectors_list = sectors.split("\n")

sector_dict = dict()
current_major_sector = ''
for sector in sectors_list:
    if sector == '':
        current_major_sector = ''
        continue
    if current_major_sector == '':
        current_major_sector = sector
        sector_dict[current_major_sector] = []
        continue
    if current_major_sector in sector_dict.keys():
        sector_dict[current_major_sector].append(sector)

reverse_sector_dict = dict()
for major_sector in sector_dict:
    for sector in sector_dict[major_sector]:
        reverse_sector_dict[sector] = major_sector
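
The loop above implies a particular layout for `world_bank_sectors.txt`: blocks separated by blank lines, with the major sector on the first line of each block and its subsectors on the following lines. The file itself is not shown, so this layout is an assumption. Under that assumption, a roughly equivalent parser can be sketched as follows (`parse_sector_blocks` and the example text are hypothetical):

```python
def parse_sector_blocks(text):
    """Map each major sector to the list of subsector lines in its block."""
    blocks = {}
    for block in text.split("\n\n"):
        lines = [line for line in block.split("\n") if line]
        if lines:
            # First line of a block is the major sector, the rest are subsectors.
            blocks[lines[0]] = lines[1:]
    return blocks

# Hypothetical excerpt in the assumed layout.
example_text = (
    "Education\nPrimary Education\nSecondary Education\n"
    "\n"
    "Health\nHealth Facilities and Construction\n"
)
example_dict = parse_sector_blocks(example_text)

# Reverse lookup from subsector to major sector.
example_reverse = {sub: major for major, subs in example_dict.items() for sub in subs}
print(example_dict)
```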

In [12]:
sector_numbers = dict()
i = 0
for sector in sector_dict:
    sector_numbers[sector] = i
    i += 1

Assign each sector an id.

In [13]:
agreements_features['sector_id'] = [sector_numbers[sector] if type(sector) == str else None for sector in agreements_features.major_sector_name.values]

In [14]:
full_sector_description = [' '.join([x] + sector_dict[x]) for x in sector_dict]

In [15]:
full_sector_description

['Agriculture Agricultural Extension, Research, and Other Support Activities Crops Fisheries Forestry Irrigation and Drainage Livestock Other Agriculture, Fishing and Forestry Public Administration - Agriculture, Fishing and Forestry',
 'Education Adult, Basic and Continuing Education Early Childhood Education Other Education Primary Education Public Administration - Education Secondary Education Tertiary Education Workforce Development/Skills',
 'Energy and Extractives Energy Transmission and Distribution Mining Non-Renewable Energy Generation Oil and Gas Other Energy and Extractives Public Administration - Energy and Extractives Renewable Energy Biomass Renewable Energy Geothermal Renewable Energy Hydro Renewable Energy Solar Renewable Energy Wind',
 'Financial Sector Banking Institutions Capital Markets Insurance and Pension Other Non-bank Financial Institutions Public Administration - Financial Sector',
 'Health Health Health Facilities and Construction Public Administration - Heal

## Prepare the data

We're going to start with a simple bag-of-words based model, so we'll need to remove uninformative words. These come in two types:
- common words: those that are too common across the dataset to convey much information
- rare words: those that occur only a couple of times, so there's no sense in using them for comparison. These also include OCR typos.

We will remove these from the dictionary so they don't get taken into account in the comparison.

First, tokenize all the texts to split them into words.

In [16]:
descriptions_tokens = [nltk.word_tokenize(description.lower()) if type(description) == str else [] for description in agreements_features.project_desc.values]
project_names_tokens = [nltk.word_tokenize(name.lower()) if type(name) == str else [] for name in agreements_features["Project Name"].values]
sector_descriptions_tokens = [nltk.word_tokenize(desc.lower()) for desc in full_sector_description]

Construct a gensim dictionary, which counts all the words and their occurrences.

In [17]:
dictionary = gensim.corpora.Dictionary([x for x in descriptions_tokens if x is not None])

In [18]:
len(dictionary)

30615

We now filter out the words that appear in fewer than 5 documents and the words that are in more than 80% of the documents, as well as the overall 30 most frequent words.

In [19]:
dictionary.filter_extremes(no_below=5,no_above=.8)

In [20]:
len(dictionary)

6617

In [21]:
dictionary.filter_n_most_frequent(30)

In [22]:
len(dictionary)

6587
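
As a rough illustration of what the document-frequency filter does, here is a pure-Python sketch. It mirrors the `no_below` and `no_above` parameters only; gensim's `keep_n` cap is not modeled, and the function name and toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["the", "road", "project"],
    ["the", "school", "project"],
    ["the", "road", "bridge"],
    ["the", "clinic"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.8)
print(sorted(kept))
```

In this toy corpus "the" is dropped for being in every document, and "school", "bridge" and "clinic" are dropped for appearing in only one.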

Add in the project names and filter again.

In [23]:
dictionary.add_documents([x for x in project_names_tokens if x is not None])

In [24]:
len(dictionary)

7626

In [25]:
dictionary.filter_extremes(no_below=2,no_above=.5)

In [26]:
len(dictionary)

6859

Finally, add in the sector descriptions.

In [27]:
dictionary.add_documents(sector_descriptions_tokens)

We won't remove any of the words from these; we need all we can get.

In [28]:
len(dictionary)

6869

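An alternative would be to build the dictionary from the sector descriptions alone, since similarity to the sector descriptions is all we care about. The cells below are left commented out because this did not help (see the notes after the accuracy results).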

In [29]:
#dictionary = gensim.corpora.Dictionary(sector_descriptions_tokens)

In [30]:
#len(dictionary)

Now we use this dictionary to construct a sequence of vector embeddings. First simple bag of words:

In [31]:
description_bow = [dictionary.doc2bow(description) if description is not None else None for description in descriptions_tokens]
project_name_bow = [dictionary.doc2bow(name) if name is not None else None for name in project_names_tokens]
sector_descriptions_bow = [dictionary.doc2bow(name) if name is not None else None for name in sector_descriptions_tokens]

In [32]:
corpus = list(filter(lambda x: x is not None, description_bow + project_name_bow + sector_descriptions_bow))

Now we get tf-idf vectors for these documents.

In [33]:
tfidf = gensim.models.TfidfModel(corpus)

In [34]:
description_tfidf = [tfidf[desc] if desc is not None else None for desc in description_bow]
project_name_tfidf = [tfidf[name] if name is not None else None for name in project_name_bow]
sector_descriptions_tfidf = [tfidf[name] if name is not None else None for name in sector_descriptions_bow]

In [35]:
agreements_features['project_description_tfidf'] = description_tfidf
agreements_features['project_name_tfidf'] = project_name_tfidf
#agreements_features['sector_name_tfidf'] = sector_name_tfidf
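
For intuition, the weighting can be sketched by hand. This assumes gensim's default scheme as we understand it: raw term frequency times log2(N / document frequency), followed by L2 normalization. The helper name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # tf * idf, with idf = log2(N / df); tokens in every document get weight 0.
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            vectors.append({t: w / norm for t, w in weights.items() if w > 0})
        else:
            vectors.append({})
    return vectors

toy_docs = [
    ["road", "road", "project"],
    ["school", "project"],
    ["road", "bridge", "project"],
]
toy_tfidf = tfidf_vectors(toy_docs)
print(toy_tfidf[2])
```

A word like "project" that occurs in every document carries no weight, which is the same motivation as the `no_above` filter earlier.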

Finally, we use latent semantic indexing to reduce the dimensionality, removing noise. This is just an SVD on the tf-idf term-document matrix.

In [36]:
lsi = gensim.models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=100)

In [37]:
description_lsi = [lsi[desc] if desc is not None else None for desc in description_tfidf]
project_name_lsi = [lsi[name] if name is not None else None for name in project_name_tfidf]
sector_desc_lsi = [lsi[name] if name is not None else None for name in sector_descriptions_tfidf]

In [38]:
agreements_features['project_description_lsi'] = description_lsi
agreements_features['project_name_lsi'] = project_name_lsi
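
To illustrate the SVD view of LSI, here is a small NumPy sketch on a hypothetical term-document matrix. gensim computes the decomposition incrementally and its exact scaling conventions may differ, so treat this as a conceptual sketch rather than a reimplementation.

```python
import numpy as np

# Hypothetical tf-idf term-document matrix: rows are terms, columns are documents.
X = np.array([
    [1.0, 0.9, 0.0, 0.0],   # "road"
    [0.8, 1.0, 0.0, 0.1],   # "highway"
    [0.0, 0.0, 1.0, 0.9],   # "school"
    [0.0, 0.1, 0.9, 1.0],   # "teacher"
])

k = 2  # number of latent topics to keep (the notebook uses 100)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]

# Project each document onto the top-k left singular vectors.
docs_lsi = U_k.T @ X  # shape (k, n_docs)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 are both about roads; document 2 is about schools.
sim_same_topic = cosine(docs_lsi[:, 0], docs_lsi[:, 1])
sim_other_topic = cosine(docs_lsi[:, 0], docs_lsi[:, 2])
print(sim_same_topic, sim_other_topic)
```

Keeping only the top `k` directions discards the small cross-topic entries (the 0.1 values), which is the noise-removal effect mentioned above.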

## Making predictions

We'll now use similarities in the document space to predict sectors. 

In [39]:
similarities = gensim.similarities.MatrixSimilarity(sector_desc_lsi)
description_similarities = similarities[description_lsi]
project_name_similarities = similarities[project_name_lsi]

In [40]:
description_similarities.shape

(3189, 11)

In [41]:
agreements_features['description_similarities_lsi'] = np.split(description_similarities,description_similarities.shape[0],axis=0)
agreements_features['project_name_similarities_lsi'] = np.split(project_name_similarities,project_name_similarities.shape[0],axis=0)

In [42]:
description_based_topic_predictions = description_similarities.argmax(axis=1)
project_name_topic_predictions = project_name_similarities.argmax(axis=1)

In [43]:
agreements_features['description_topic_predictions'] = description_based_topic_predictions
agreements_features['project_name_topic_predictions'] = project_name_topic_predictions
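
One thing to watch with `argmax`: a row of all-zero similarities (presumably what an empty description or name produces) yields index 0, so such rows are silently assigned to the first sector. A minimal sketch of flagging them instead, with a hypothetical helper name and a sentinel value of -1:

```python
import numpy as np

def predict_with_mask(similarities, no_prediction=-1):
    """Argmax over sectors, but flag rows that carry no signal."""
    similarities = np.asarray(similarities, dtype=float)
    predictions = similarities.argmax(axis=1)
    # A row with every similarity equal to zero has nothing to rank.
    has_signal = np.any(similarities != 0, axis=1)
    return np.where(has_signal, predictions, no_prediction)

toy_similarities = np.array([
    [0.1, 0.7, 0.2],
    [0.0, 0.0, 0.0],   # e.g. an agreement with no description
    [0.4, 0.1, 0.3],
])
toy_predictions = predict_with_mask(toy_similarities)
print(toy_predictions)
```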

In [44]:
description_test_set = agreements_features.dropna(how='any',inplace=False,subset=['project_desc','sector_id'])
correct_desc_pred = description_test_set.description_topic_predictions == description_test_set.sector_id
description_accuracy = sum(correct_desc_pred) / len(correct_desc_pred)
print("Accuracy:", description_accuracy, "on", len(correct_desc_pred), "predictions")

Accuracy: 0.38037486218302097 on 907 predictions


In [45]:
name_test_set = agreements_features.dropna(how='any',inplace=False,subset=['Project Name','sector_id'])
correct_name_pred = name_test_set.project_name_topic_predictions == name_test_set.sector_id
name_accuracy = sum(correct_name_pred) / len(correct_name_pred)
print("Accuracy:", name_accuracy, "on", len(correct_name_pred), "predictions")

Accuracy: 0.42408376963350786 on 955 predictions


Interestingly, the project names do better on a larger test set than the full project descriptions, despite having far less information. Maybe we should train the model differently.

Some things to try:
- Use only the words that appear in the sector descriptions. OK, tried it: it doesn't change the performance of the project name prediction, and makes the description set worse somehow. Possibly because the LSI was able to find connections between topics and words using the project descriptions... that's probably why you would want to use all the words. So that leaves the next steps:
- Different thresholds for filtering words out. The project name prediction accuracy is pretty robust to this, which makes sense. 
- Figure out how to weight words based on the length of the sector description they belong to. Class 7 is the least frequently predicted, even though it's the most common, probably because its sector description is the shortest.
- More sophisticated models like word2vec or doc2vec that include relationships between words. We probably want some pretrained model that knows that certain words are related to terms in the sector descriptions, even if they don't appear there.

In [46]:
description_test_set.project_name_topic_predictions.value_counts()

5     137
9     115
3     102
10     89
6      88
7      75
1      70
0      61
8      59
2      56
4      55
Name: project_name_topic_predictions, dtype: int64

In [47]:
description_test_set.sector_id.value_counts()

7.0     178
9.0     118
10.0    116
0.0     104
2.0      79
5.0      70
1.0      69
4.0      63
8.0      56
3.0      47
6.0       7
Name: sector_id, dtype: int64

Interesting: the classes are unbalanced, but that is not why the project name-based classification does better than chance. The most frequently predicted class (5) is not the most common true class (7).

In [294]:
sector_descriptions_tokens[7]

['public',
 'admin',
 'central',
 'government',
 '(',
 'central',
 'agencies',
 ')',
 'law',
 'and',
 'justice',
 'other',
 'public',
 'administration',
 'sub-national',
 'government']

## A Doc2Vec model

Let's try using a more sophisticated model: Doc2Vec. This is technically a neural network, although a very shallow one, which is trained to predict words from local context (like word2vec) combined with a document context vector.

We need to get our corpus in a slightly different format to use with gensim's doc2vec.

In [48]:
doc2vec_corpus = [gensim.models.doc2vec.TaggedDocument(tokens, [i]) for i, tokens in enumerate(filter(lambda x: len(x) > 0, descriptions_tokens + project_names_tokens + sector_descriptions_tokens))]

In [49]:
doc2vec_corpus[:2]

[TaggedDocument(words=['schedule', '1', 'withdrawal', 'of', 'the', 'proceeds', 'of', 'the', 'loan', '1.', 'subject', 'to', 'the', 'provisions', 'set', 'forth', 'or', 'referred', 'to', 'in', 'this', 'schedule', ',', 'the', 'proceeds', 'of', 'the', 'loan', 'may', 'be', 'withdrawn', 'from', 'the', 'loan', 'account', 'for', 'expenditures', 'made', '(', 'or', ',', 'if', 'the', 'bank', 'shall', 'so', 'agree', ',', 'to', 'be', 'made', ')', 'in', 'respect', 'of', 'the', 'reasonable', 'cost', 'of', ':', '(', 'i', ')', 'goods', 'required', 'during', 'the', 'execution', 'of', 'the', 'program', 'and', 'to', 'be', 'financed', 'out', 'of', 'such', 'proceeds', ';', 'and', '(', 'ii', ')', 'other', 'items', 'of', 'expenditures', 'relating', 'to', 'technical', 'assistance', ',', 'all', 'as', 'listed', 'in', 'the', 'table', 'set', 'forth', 'below', ':', 'amount', 'of', 'the', 'loan', 'allocated', '%', 'of', '(', 'expressed', 'in', 'expenditures', 'category', 'dollar', 'equivalent', ')', 'to', 'be', 'fina

In [50]:
doc2vec = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=5, epochs=50)

In [51]:
doc2vec.build_vocab(doc2vec_corpus)

In [52]:
doc2vec.train(doc2vec_corpus, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

In [53]:
description_vectors = [doc2vec.infer_vector(desc) for desc in descriptions_tokens]
project_name_vectors = [doc2vec.infer_vector(name) for name in project_names_tokens]
sector_description_vectors = [doc2vec.infer_vector(desc) for desc in sector_descriptions_tokens]

In [54]:
description_distances_doc2vec = np.zeros((len(sector_description_vectors),len(description_vectors)))
for i in range(len(description_vectors)):
    for j in range(len(sector_description_vectors)):
        description_distances_doc2vec[j,i] = np.linalg.norm(description_vectors[i]-sector_description_vectors[j])
project_name_distances_doc2vec = np.zeros((len(sector_description_vectors),len(project_name_vectors)))
for i in range(len(project_name_vectors)):
    for j in range(len(sector_description_vectors)):
        project_name_distances_doc2vec[j,i] = np.linalg.norm(project_name_vectors[i]-sector_description_vectors[j])
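
The nested loops above can also be written with NumPy broadcasting. The sketch below uses random toy vectors and a hypothetical helper name, and checks that it reproduces the loop result with the same `(n_sectors, n_docs)` layout.

```python
import numpy as np

def pairwise_distances(doc_vectors, sector_vectors):
    """Euclidean distances, shape (n_sectors, n_docs) as in the loops above."""
    docs = np.asarray(doc_vectors)        # (n_docs, dim)
    sectors = np.asarray(sector_vectors)  # (n_sectors, dim)
    # Broadcast to (n_sectors, n_docs, dim), then take the norm over dim.
    diff = sectors[:, None, :] - docs[None, :, :]
    return np.linalg.norm(diff, axis=2)

rng = np.random.default_rng(0)
toy_docs = rng.normal(size=(5, 4))
toy_sectors = rng.normal(size=(3, 4))
fast = pairwise_distances(toy_docs, toy_sectors)

# Reference: the same nested loops used in the notebook.
slow = np.zeros((3, 5))
for i in range(5):
    for j in range(3):
        slow[j, i] = np.linalg.norm(toy_docs[i] - toy_sectors[j])

print(np.allclose(fast, slow))
```

The broadcast version materializes an intermediate array of size n_sectors × n_docs × dim, which is small at this scale (11 sectors, about 3,000 documents, 100 dimensions).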

In [55]:
doc2vec_description_predictions = np.argmin(description_distances_doc2vec,axis=0)
doc2vec_project_name_predictions = np.argmin(project_name_distances_doc2vec,axis=0)

In [56]:
doc2vec_description_predictions

array([3, 0, 3, ..., 8, 7, 7])

In [57]:
agreements_features['description_topic_predictions_doc2vec'] = doc2vec_description_predictions
agreements_features['project_name_topic_predictions_doc2vec'] = doc2vec_project_name_predictions

In [58]:
correct_desc_pred = agreements_features.description_topic_predictions_doc2vec == agreements_features.sector_id
description_accuracy = sum(correct_desc_pred) / len(description_test_set)
print("Description-based Accuracy:", description_accuracy, "on", len(description_test_set), "predictions")

Description-based Accuracy: 0.19845644983461963 on 907 predictions


In [59]:
correct_name_pred = agreements_features.project_name_topic_predictions_doc2vec == agreements_features.sector_id
name_accuracy = sum(correct_name_pred) / len(name_test_set)
print("Project Name-based Accuracy:", name_accuracy, "on", len(name_test_set), "predictions")

Project Name-based Accuracy: 0.06806282722513089 on 955 predictions
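
Note that the two doc2vec accuracy cells count correct predictions over every row of `agreements_features` but divide by the size of the earlier test sets, so the numerator and denominator do not cover exactly the same rows. A minimal sketch of an evaluation that restricts both to rows with an input text and a known label (the helper name and toy data are hypothetical):

```python
import numpy as np

def masked_accuracy(predictions, labels, has_input):
    """Accuracy over rows that have both an input text and a known label."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)  # NaN marks a missing label
    mask = np.asarray(has_input, dtype=bool) & ~np.isnan(labels)
    n = int(mask.sum())
    if n == 0:
        return float("nan"), 0
    return float((predictions[mask] == labels[mask]).mean()), n

toy_predictions = [3, 0, 7, 7, 2]
toy_labels = [3, np.nan, 7, 5, 2]               # second row has no known sector
toy_has_text = [True, True, True, True, False]  # last row has no description
toy_accuracy, toy_n = masked_accuracy(toy_predictions, toy_labels, toy_has_text)
print(toy_accuracy, "on", toy_n, "predictions")
```

With the notebook's data, `has_input` could be something like `agreements_features.project_desc.notna()`; whether this changes the reported numbers has not been checked here.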


In [60]:
w = 0.95 #weighting factor for joint prediction
combined_distances = w * description_distances_doc2vec + (1-w) * project_name_distances_doc2vec
combined_predictions = np.argmin(combined_distances,axis=0)

In [61]:
agreements_features['predictions_doc2vec'] = combined_predictions

In [62]:
combined_test_set = agreements_features.dropna(how='any',inplace=False,subset=['project_desc','Project Name','sector_id'])
correct_joint_pred = combined_test_set.predictions_doc2vec == combined_test_set.sector_id
joint_accuracy = sum(correct_joint_pred) / len(correct_joint_pred)
print("Combined Accuracy:", joint_accuracy, "on", len(correct_joint_pred), "predictions")


Combined Accuracy: 0.1819184123484013 on 907 predictions
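
The weight `w = 0.95` is set by hand. One way to explore it is to scan a few values and compare accuracies, sketched below on hypothetical toy distances. Picking `w` on the same rows used to report accuracy would overstate performance, so in practice this should be done on a held-out split.

```python
import numpy as np

def sweep_weights(desc_dist, name_dist, labels, weights):
    """Return {w: accuracy} for combined distances of shape (n_sectors, n_docs)."""
    labels = np.asarray(labels)
    results = {}
    for w in weights:
        combined = w * desc_dist + (1 - w) * name_dist
        # Nearest sector per document (column-wise argmin).
        results[w] = float((combined.argmin(axis=0) == labels).mean())
    return results

# Hypothetical distances for 2 sectors and 3 labeled documents.
toy_desc_dist = np.array([[0.2, 0.9, 0.6],
                          [0.8, 0.1, 0.4]])
toy_name_dist = np.array([[0.3, 0.7, 0.1],
                          [0.6, 0.2, 0.9]])
toy_labels = [0, 1, 0]
toy_results = sweep_weights(toy_desc_dist, toy_name_dist, toy_labels, [0.0, 0.5, 1.0])
print(toy_results)
```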


After fixing some bugs, the doc2vec model turns out to be worse than the LSI model: about 0.20 versus 0.38 accuracy on descriptions, and 0.07 versus 0.42 on project names. The combined prediction (0.18) does not improve on the description-only one.

Now save the features for use in supervised classification.

In [63]:
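def make_np_vector(x, length=100):
    # x is a gensim sparse vector: a list of (topic_id, value) pairs.
    # Sparse vectors may omit entries (and are empty for empty documents),
    # so place each value at its topic_id to get a fixed-length array.
    # length=100 matches num_topics in the LsiModel above.
    out = np.zeros(length)
    for topic_id, value in x:
        out[topic_id] = value
    return out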

In [64]:
agreements_features.project_description_lsi = agreements_features.project_description_lsi.apply(make_np_vector)
agreements_features.project_name_lsi = agreements_features.project_name_lsi.apply(make_np_vector)

In [65]:
agreements_features.drop(["project_description_tfidf","project_name_tfidf"],axis=1,inplace=True)

In [66]:
agreements_features.columns

Index(['id', 'date', 'country_code', 'country_name', 'named_countries',
       'Project Name', 'project_desc', 'amount_standard', 'currency_standard',
       'closing_month', 'closing_day', 'closing_year', 'loan_length in year',
       'wb_project_id', 'major_sector_name', 'sector1_name', 'sector2_name',
       'sector3_name', 'sector_id', 'project_description_lsi',
       'project_name_lsi', 'description_similarities_lsi',
       'project_name_similarities_lsi', 'description_topic_predictions',
       'project_name_topic_predictions',
       'description_topic_predictions_doc2vec',
       'project_name_topic_predictions_doc2vec', 'predictions_doc2vec'],
      dtype='object')

In [68]:
agreements_features[['id','project_name_lsi','project_description_lsi','project_name_similarities_lsi','description_similarities_lsi','wb_project_id','major_sector_name']].to_pickle("classification_features.pkl")