## Identification of topics in World Bank articles
This model was originally constructed as a capstone project for the Microsoft Professional Certificate in Artificial Intelligence on edX.

The task was to build a machine learning model that would learn to identify topics from a large selection of documents published by the World Bank. A list of 29 topics was provided, and documents could have any number of these topics attributed to them. A labelled dataset was provided for model construction and testing, and a separate unlabelled dataset was used for scoring the assignment.

The model I've constructed uses scikit-learn's term frequency-inverse document frequency (TF-IDF) vectorizer. This tool weights the frequency of each word in a document by the inverse of the proportion of documents in the entire set in which the word appears. Hence, it gives a high score to words that appear frequently in a given document but rarely elsewhere, which are presumably the words most relevant to the document's topic.
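To make the scoring concrete, here is a minimal sketch in plain Python of the smoothed TF-IDF formula that scikit-learn documents as its default: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document's scores. The toy corpus and the `tfidf` helper are illustrative assumptions, and tokenisation is reduced to whitespace splitting, so the numbers will not match the vectorizer's output on real text.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: score} dict per document (smoothed idf, L2-normalised)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scores.append({term: v / norm for term, v in raw.items()})
    return scores

# Hypothetical three-document corpus.
docs = [
    "water supply water sanitation report",
    "energy supply report",
    "gender education report",
]
scores = tfidf(docs)
# Terms of the first document, highest score first.
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

In the first document "water" ranks highest because it occurs twice there and nowhere else, while "report" ranks lowest because it occurs in every document.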

First we import the libraries and training dataset to be used.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

In [2]:
import keras
from keras.models import Sequential
from keras.layers import Dense

In [3]:
df = pd.read_csv("train_values.csv")
df.head()

Unnamed: 0,row_id,doc_text
0,0,"For more information, visit http://www.wor..."
1,1,...
2,2,...
3,3,71399\n\nProcur...
4,4,90189\n\n\n\n\nExecu...


**This quick look** at the training set using df.head() is not very informative, so let's show the first few lines of one of the documents to get a feel for what we're looking at. The full document is 19782 characters long, so this is less than a tenth of it.

In [4]:
df.doc_text[0][:1200] 

'    For more information, visit http://www.worldbank.org/prospects                                                                  98944\n\n    Taking Stock\n   ECB unveiled 1.1 trillion asset-purchase program. The European Central Bank announced it will inject about 1.1\n    trillion ($1.3 trillion) into financial markets through an asset-purchase program in a bid to counter weaker than expected\n    inflation dynamics. Beginning March this year, the ECB will buy a total of 60 billion ($69 billion) a month of public and\n    private sector securities until end-September 2016. These purchases may continue beyond September 2016 unless there\n    are clear signs of a sustained adjustment in the path of infl ation towards ECBs target of close to 2 percent. Following\n    the announcement, European stocks rallied, while the euro weakened to an 11-year low against the dollar.\n\n   World Bank released its January 2015 Commodity Markets Outlook. Prices of most commodities are expected to\n

**The code below** converts the training set into a matrix in which each row represents a document and each column a specific word or term; each cell holds the TF-IDF score of that term in that document. The settings exclude words that appear in more than half the documents (max_df=0.5), which removes most *stop words* such as "the", "and" and "not", as well as words that appear in fewer than 10 documents (min_df=10), which might include personal names or very obscure terms. The resulting matrix covers 18830 documents and a vocabulary of 37560 terms.
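The effect of the two thresholds can be sketched without scikit-learn. The `filter_vocabulary` helper and the toy corpus below are hypothetical; the rule follows the documented meaning of the parameters (an integer min_df is an absolute document count, a float max_df is a proportion of documents), with whitespace splitting standing in for the vectorizer's tokenizer.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least min_df documents and in at most a max_df share of them."""
    n_docs = len(docs)
    # Count each term once per document, however often it occurs inside it.
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# Hypothetical four-document corpus.
docs = [
    "the bank funds the road project",
    "the bank funds a water project",
    "the ministry reviews the water tariff",
    "the road tariff study",
]
# min_df=2 drops terms seen in a single document; max_df=0.5 drops "the" (4 of 4 documents).
vocab = filter_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

On this corpus only the terms that occur in exactly two of the four documents survive, which mirrors how the real settings trim both very common and very rare words.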

In [5]:
vectorizer = TfidfVectorizer(min_df=10, max_df=0.5)
vectorizer.fit(df["doc_text"])
len(vectorizer.vocabulary_)

37560

In [6]:
train_vector = vectorizer.transform(df["doc_text"])
train_vector

<18830x37560 sparse matrix of type '<class 'numpy.float64'>'
	with 10245656 stored elements in Compressed Sparse Row format>

In [7]:
input_len = train_vector.shape[1]
input_len

37560

In [8]:
X_train = train_vector.toarray()
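As a rough check on what the dense conversion costs, the arithmetic below uses only the figures reported in the matrix summary above (shape, stored elements, 8-byte float64 values). It is a back-of-the-envelope estimate of the array payload rather than a measurement, and the variable names are introduced here for illustration.

```python
n_docs, n_terms = 18830, 37560   # shape reported for train_vector
stored = 10245656                # non-zero entries reported for the sparse matrix
bytes_per_value = 8              # float64

# Size of the dense array payload and the share of cells that are non-zero.
dense_bytes = n_docs * n_terms * bytes_per_value
density = stored / (n_docs * n_terms)

print(f"dense array: {dense_bytes / 1e9:.2f} GB")
print(f"non-zero share: {density:.2%}")
```

The dense array needs several gigabytes of memory although fewer than 2% of its cells are non-zero, which is worth keeping in mind on machines with limited RAM.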

In [9]:
df = pd.read_csv("train_labels.csv")
df.head()

Unnamed: 0,row_id,information_and_communication_technologies,governance,urban_development,law_and_development,public_sector_development,agriculture,communities_and_human_settlements,health_and_nutrition_and_population,culture_and_development,...,private_sector_development,informatics,energy,social_development,water_resources,education,transport,water_supply_and_sanitation,gender,infrastructure_economics_and_finance
0,0,0,1,0,1,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,1,0,1,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,4,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**In the resulting dataframe,** we see that some documents do indeed cover multiple topics. For this reason, *binary cross-entropy* is used as the loss function in the neural network below, instead of categorical cross-entropy. Rather than a multi-class classification, in which each document would receive exactly one label, this is a multi-label task: a separate "yes" or "no" decision for the presence of each of the 29 topics in each document.
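A minimal sketch of that loss, written in plain Python rather than taken from Keras: every (document, topic) pair is scored as its own yes/no prediction and the results are averaged. The labels and probabilities are made-up values, and the clipping constant `eps` is an assumption chosen only to avoid log(0).

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over every (document, topic) pair."""
    total, count = 0.0, 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count

# Hypothetical labels and sigmoid outputs for two documents and three topics.
y_true = [[1, 0, 1], [0, 0, 1]]
good = [[0.9, 0.1, 0.8], [0.2, 0.1, 0.7]]
poor = [[0.4, 0.6, 0.3], [0.7, 0.5, 0.2]]

print(binary_cross_entropy(y_true, good))
print(binary_cross_entropy(y_true, poor))
```

Because each topic is scored independently, a document can have several probabilities close to 1 at once, which a softmax output trained with categorical cross-entropy would not allow.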

Several different neural network shapes were tested, and a single hidden layer of 900 nodes was found to give the best results. Keras conveniently reports accuracy and loss scores for each epoch of training, so it is easy to run the model, check the scores and then run it again with the optimal number of epochs. The preferred metric for this is *val_acc*, which is the accuracy calculated on the validation set passed to the fit call.
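For a sense of scale, the number of trainable parameters in that shape follows directly from figures in this notebook: 37560 input terms, 900 hidden nodes and 29 outputs. The `dense_params` helper is introduced here for illustration and assumes ordinary fully connected layers with one bias per unit.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_in * n_out + n_out

n_terms, n_hidden, n_topics = 37560, 900, 29
hidden = dense_params(n_terms, n_hidden)    # input -> hidden layer
output = dense_params(n_hidden, n_topics)   # hidden -> output layer
print(hidden, output, hidden + output)
```

Almost all of the parameters sit in the first layer, so the vocabulary size set by min_df and max_df largely determines the size of the model.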

In [10]:
y_train = np.array(df.iloc[:,1:])

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_1, X_2, y_1, y_2 = train_test_split(X_train, y_train, test_size=0.05)

In [13]:
model = Sequential()
model.add(Dense(900, activation="relu", input_shape=(input_len,)))
model.add(Dense(29, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
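# Note: X_train contains every labelled row, including the 942 rows held out in X_2,
# so the validation scores reported by this call are optimistic.
# Fitting on X_1, y_1 instead would keep X_2 unseen during training.
model.fit(X_train, y_train, validation_data=(X_2, y_2), epochs=9, batch_size=500)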

Train on 18830 samples, validate on 942 samples
Epoch 1/9
Epoch 2/9
Epoch 3/9
Epoch 4/9
Epoch 5/9
Epoch 6/9
Epoch 7/9
Epoch 8/9
Epoch 9/9


<keras.callbacks.History at 0x1248c9fa0>

**The output so far** is a series of probabilities, which we need to convert to ones and zeroes. The scoring metric used for assessing the model overall is the F1 score, which in the simplest case equals *2 × ((precision × recall) / (precision + recall))*. In this case we use a variation called the "micro" F1 score, which pools the true positives, false positives and false negatives from all classes before computing a single score. Below we test the model with a wide range of cut-off values for converting predicted probabilities into one or zero, and find which cut-off value gives the highest F1 score.
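The micro-averaged score and the cut-off search can be sketched in plain Python. The `micro_f1` and `binarise` helpers and the toy labels and probabilities are assumptions made for illustration; the formula 2TP / (2TP + FP + FN) is algebraically the same as the precision and recall form given above.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over every class, then apply the F1 formula."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binarise(probs, cutoff):
    """Turn probabilities into 0/1 labels using a strict 'greater than' cut-off."""
    return [[1 if p > cutoff else 0 for p in row] for row in probs]

# Hypothetical labels and predicted probabilities (3 documents, 3 topics).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_prob = [[0.8, 0.3, 0.4], [0.1, 0.6, 0.2], [0.7, 0.35, 0.05]]

# Score every cut-off from 0.05 to 0.95 and keep the best one.
results = {c / 100: micro_f1(y_true, binarise(y_prob, c / 100)) for c in range(5, 100, 5)}
best_cutoff = max(results, key=results.get)
print(best_cutoff, results[best_cutoff])
```

Low cut-offs let in false positives and high cut-offs lose true positives, so the score typically rises to a peak and then falls, as it does in the sweep below.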

In [14]:
df = pd.read_csv("test_values.csv")
test_vector = vectorizer.transform(df["doc_text"])
X_submit = test_vector.toarray()

In [15]:
from sklearn.metrics import f1_score
y_pred = model.predict(X_2)
for i in range(1, 65):
    y_pred_bin = np.where(y_pred > (i / 100), 1, 0)
    print(i/100, f1_score(y_2, y_pred_bin, average="micro"))

0.01 0.30200156985871274
0.02 0.3698654924905
0.03 0.4281812451852371
0.04 0.4772940808017539
0.05 0.5188420692017814
0.06 0.5565410199556542
0.07 0.5879352623835212
0.08 0.6152733285109021
0.09 0.6385685027487334
0.1 0.6569065713008493
0.11 0.6760009257116408
0.12 0.6938239159001314
0.13 0.7077378018728437
0.14 0.7184074587375584
0.15 0.729708837928369
0.16 0.7381794368041913
0.17 0.7486331510868116
0.18 0.7574028796522684
0.19 0.7637719177136546
0.2 0.7694682194471727
0.21 0.7758128921848259
0.22 0.7820049182699262
0.23 0.7879143443825168
0.24 0.7928061831153389
0.25 0.7967455175531113
0.26 0.79945180447693
0.27 0.8022215365627893
0.28 0.8053042121684867
0.29 0.8058527375707991
0.3 0.8047036389639282
0.31 0.8060995184590689
0.32 0.8063420158550396
0.33 0.8067199478062306
0.34 0.8070406316828426
0.35 0.8079734219269102
0.36 0.808695652173913
0.37 0.8071537033912602
0.38 0.8061883713022782
0.39 0.8023355658595227
0.4 0.8031822898650984
0.41 0.7995817357964448
0.42 0.7958538299367534
0.

**The model is then applied to the submission dataset,** for which the labels are not made available to the course participants. The output is converted to ones and zeroes using the optimal cut-off value found above, and then submitted as a CSV file.
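The conversion and the file layout can be sketched with the standard library alone. The `build_submission` helper, the three topic columns and the probabilities are hypothetical; the layout (a row_id column followed by one 0/1 column per topic) follows the submission format shown below.

```python
import csv
import io

def build_submission(row_ids, probs, topic_names, cutoff):
    """Return CSV text with one row per document: row_id followed by 0/1 per topic."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row_id", *topic_names])
    for row_id, row in zip(row_ids, probs):
        # Strict comparison: a probability equal to the cut-off becomes 0.
        writer.writerow([row_id, *(1 if p > cutoff else 0 for p in row)])
    return buffer.getvalue()

# Hypothetical probabilities for two documents and three of the topic columns.
topics = ["governance", "energy", "gender"]
csv_text = build_submission([0, 1], [[0.9, 0.2, 0.36], [0.1, 0.5, 0.4]], topics, cutoff=0.36)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; a real run would write to a file instead.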

In [16]:
y_submit = model.predict(X_submit)

In [17]:
# Use the cut-off value that gave the highest micro F1 score in the search above.
y_submit_bin = np.where(y_submit > 0.36, 1, 0)

In [18]:
df2 = pd.read_csv("submission_format.csv")
df2.head()

Unnamed: 0,row_id,information_and_communication_technologies,governance,urban_development,law_and_development,public_sector_development,agriculture,communities_and_human_settlements,health_and_nutrition_and_population,culture_and_development,...,private_sector_development,informatics,energy,social_development,water_resources,education,transport,water_supply_and_sanitation,gender,infrastructure_economics_and_finance
0,0,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,2,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,3,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
4,4,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [19]:
df2.shape

(18831, 30)

In [20]:
y_submit_bin2 = np.column_stack((np.array(df2.row_id).reshape(-1, 1), y_submit_bin))

In [21]:
subdf = pd.DataFrame(data=y_submit_bin2, columns=df2.columns, index=None)

In [22]:
subdf.to_csv("submission15.csv", index=False)

**The submitted CSV file** is scored automatically against the unseen labels, and multiple attempts were allowed. My best submission obtained an F1 score of 0.6611, which was in the top 6% of course participants.