# Predicting Categories from Named Entities in NOS Articles

Final Assignment for Fundamentals of Machine Learning 
by Sterre van Geest

January 2021


## Introduction
-----

I am using the [Dutch News Articles](https://www.kaggle.com/maxscheijen/dutch-news-articles) dataset that I found on Kaggle. The dataset contains all articles published by the NOS from 1 January 2010 to 1 January 2021 and was obtained by scraping the NOS website. The NOS is one of the biggest (online) news organizations in the Netherlands. At first I found it hard to give the research a direction, but I decided to investigate whether I can predict the category an article belongs to by analyzing the named entities inside the article. This may not have the most practical relevance, but it was an interesting way for me to get started with Natural Language Processing in combination with named entities, which is something I am interested in.


## The dataset
---


### Getting the data

In [3]:
import pandas as pd
import csv

**The dataset includes the following columns** (also shown below):

- **datetime**: date and time of publication of the article.
- **title**: the title of the news article.
- **content**: the content of the news article.
- **category**: the category under which the NOS filed the article.
- **url**: link to the original article.

In [9]:
df = pd.read_csv('dutch-news-articles.csv')
df.head()

Unnamed: 0,datetime,title,content,category,url
0,2010-01-01 00:49:00,Enige Litouwse kerncentrale dicht,De enige kerncentrale van Litouwen is oudjaars...,Buitenland,https://nos.nl/artikel/126231-enige-litouwse-k...
1,2010-01-01 02:08:00,Spanje eerste EU-voorzitter onder nieuw verdrag,Spanje is met ingang van vandaag voorzitter va...,Buitenland,https://nos.nl/artikel/126230-spanje-eerste-eu...
2,2010-01-01 02:09:00,Fout justitie in Blackwater-zaak,Vijf werknemers van het omstreden Amerikaanse ...,Buitenland,https://nos.nl/artikel/126233-fout-justitie-in...
3,2010-01-01 05:14:00,"Museumplein vol, minder druk in Rotterdam",Het Oud en Nieuwfeest op het Museumplein in Am...,Binnenland,https://nos.nl/artikel/126232-museumplein-vol-...
4,2010-01-01 05:30:00,Obama krijgt rapporten over aanslag,President Obama heeft de eerste rapporten gekr...,Buitenland,https://nos.nl/artikel/126236-obama-krijgt-rap...


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220132 entries, 0 to 220131
Data columns (total 5 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   datetime  220132 non-null  object
 1   title     220132 non-null  object
 2   content   220132 non-null  object
 3   category  220132 non-null  object
 4   url       220132 non-null  object
dtypes: object(5)
memory usage: 8.4+ MB


The unit of observation is an article. The dataset consists of **220132** unique articles/rows.

### Cleaning the dataset

Before I'm able to analyse the news articles, the raw data needs to be cleaned. 
The texts contain a lot of redundant whitespace and some unusual characters.

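In [None]:
# this code is part of this notebook: https://www.kaggle.com/maxscheijen/text-mining-dutch-news-articles 
import unicodedata
import re

from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

def clean_string(x):
    x = x.replace("\n", " ")              # turn newlines into spaces
    x = unicodedata.normalize("NFKD", x)  # normalize unicode, e.g. non-breaking spaces
    x = re.sub(" +", " ", x)              # collapse runs of spaces last, so new ones are merged too
    return x

df['content'] = df['content'].progress_apply(clean_string)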
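
To make the effect of the cleaning step concrete, here is a minimal sketch on an invented sentence. The function name `clean_text_sketch` and the example text are assumptions for illustration only; the sketch collapses runs of spaces as the final step, so that spaces introduced by the newline replacement and by the unicode normalization are merged as well.

```python
import re
import unicodedata


def clean_text_sketch(text):
    """Hypothetical cleaning helper: newlines, unicode normalization, then spaces."""
    text = text.replace("\n", " ")              # newlines become spaces
    text = unicodedata.normalize("NFKD", text)  # e.g. non-breaking space -> plain space
    return re.sub(" +", " ", text)              # merge runs of spaces into one


# invented example with a double space, a newline and a non-breaking space
raw = "De  enige\nkerncentrale\xa0 is dicht"
cleaned = clean_text_sketch(raw)
print(cleaned)
```

Note that NFKD also decomposes accented letters into a base letter plus a combining mark, which may matter when entity names are matched as exact strings later on.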

## Feature engineering
----

The dataset consists of the following categories:

In [14]:
df['category'].value_counts()

Buitenland          85582
Binnenland          74433
Politiek            20145
Economie            17965
Regionaal nieuws    13279
Koningshuis          3012
Opmerkelijk          2716
Cultuur & Media      2046
Tech                  954
Name: category, dtype: int64

I want to use named entities to predict what category an article belongs to. A **named entity** is a real-world object, such as a person, location, organization or product, that can be denoted with a proper name. [spaCy](https://spacy.io/) is an open-source library that helps with identifying named entities. Other libraries can also extract named entities from a text, for example [NLTK](https://www.nltk.org/book/ch07.html). I’m using spaCy for this project because it has a Dutch model.

### Step 1 - Getting Named Entities with spaCy
First I'm importing and loading spaCy:

In [15]:
import spacy
from spacy import displacy
from collections import Counter
import nl_core_news_sm
nlp = nl_core_news_sm.load()

Below I explain with one article how spaCy works. Later I apply spaCy to every row of the dataset. 

In [185]:
text = df.content.iloc[9]
doc = nlp(text)

spaCy returns different types of named entities:

In [186]:
displacy.render(doc, jupyter=True, style='ent')

**While applying spaCy to the articles, I noticed the following:**

1. Applying spaCy to all articles generates a very large number of distinct named entities.

   Therefore I decided to only use the 110 most common entities per category as columns of a sparse matrix. This results in 9 * 110 = 990 named entities.

2. Some named entities mean something different in another context and are therefore not usable. For example, as seen in the example above: the words 'eerste' (meaning: first) and 'half jaar' (meaning: half a year).

   Therefore I decided not to use the following named entity types: `CARDINAL`, `ORDINAL`, `MONEY`, `DATE`, `QUANTITY`, `TIME`, `PERCENT`.



**I wrote the following code to get the 110 most common named entities per category:**

In [None]:
import json  # used by create_json below

categories = df['category'].unique()

def export_named_entities(categories):
    for category in categories:
        df_category = df.loc[df['category'] == category]
        # 1. get all named entities per category for every article
        all_named_entities = df_category['content'].apply(
            get_all_named_entities)
       
        # 2. calculate most common named entities for every category
        sum_named_entities = count_named_entities(all_named_entities)
        
        # 3. export most common named entities per category to a .json file
        create_json(category, sum_named_entities)

# 1. get all named entities per category for every article
EXCLUDED_LABELS = {'CARDINAL', 'DATE', 'QUANTITY', 'TIME', 'ORDINAL', 'PERCENT', 'MONEY'}

def get_all_named_entities(row):
    doc = nlp(row)
    items = []
    for entity in doc.ents:
        # skip numeric and temporal entity types
        if entity.label_ not in EXCLUDED_LABELS:
            items.append(entity.text)
    return items

# 2. calculate most common named entities for every category
def count_named_entities(all_named_entities):
    counts = Counter()
    for items in all_named_entities:
        counts.update(items)
    # return the 110 most common named entities, most frequent first
    return [entity for entity, _ in counts.most_common(110)]

# 3. export most common named entities per category to a .json file
def create_json(category, sum_named_entities):
    print(category, sum_named_entities)
    result = dict([(item, idx) for idx, item in enumerate(sum_named_entities)])
    with open("./named-entities/" + category + ".json", "w") as out_file:
        json.dump(result, out_file, ensure_ascii=False)

        
# I comment this out, because this takes a 
# long time and the results are stored in 
# json files.

# export_named_entities(categories)

Results can be viewed in: https://github.com/sterrevangeest/dutch-news-articles/tree/master/named-entities
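
The filtering and counting steps above can be illustrated on toy data without running spaCy. This is only a sketch: the `(text, label)` pairs below are invented stand-ins for what `doc.ents` would yield, and `top_entities` is a hypothetical helper, not part of the project code.

```python
from collections import Counter

# entity types that are excluded in this project
EXCLUDED_LABELS = {"CARDINAL", "ORDINAL", "MONEY", "DATE", "QUANTITY", "TIME", "PERCENT"}

# invented (text, label) pairs for three articles of one category
articles = [
    [("Trump", "PERSON"), ("VS", "GPE"), ("eerste", "ORDINAL")],
    [("Trump", "PERSON"), ("half jaar", "DATE"), ("Rusland", "GPE")],
    [("VS", "GPE"), ("Trump", "PERSON")],
]


def top_entities(articles, n):
    """Return the n most common entity texts, ignoring excluded labels."""
    counts = Counter(
        text
        for entities in articles
        for text, label in entities
        if label not in EXCLUDED_LABELS
    )
    return [text for text, _ in counts.most_common(n)]


top = top_entities(articles, 2)
# same shape as the exported .json files: entity -> column index
vocab = {text: idx for idx, text in enumerate(top)}
print(vocab)
```

In the project itself `n` is 110 and the counting is done once per category.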

### Step 2 - Creating a Sparse Matrix

After getting the named entities, I needed to convert the text into a numerical form that a machine learning model can be trained on. This conversion is called vectorising (Drikvandi & Lawal, 2020). To do this, I'm creating a sparse matrix.
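
As a rough illustration of what vectorising with a fixed vocabulary means, the sketch below counts whitespace-separated tokens in plain Python. It is a simplified stand-in for a count vectorizer with a fixed vocabulary and a whitespace tokenizer, not the scikit-learn implementation; the vocabulary, the sentence and the helper `count_vector` are invented for this example.

```python
def count_vector(text, vocab):
    """Count whitespace-separated tokens that occur in a fixed vocabulary."""
    counts = [0] * len(vocab)
    for token in text.split():
        if token in vocab:
            counts[vocab[token]] += 1
    return counts


# hypothetical vocabulary: named entity -> column index
vocab = {"Amerikaanse": 0, "Trump": 1, "Den Haag": 2}

text = ("De Amerikaanse president Trump sprak met Amerikaanse "
        "militairen in Den Haag over Trump.")
vector = count_vector(text, vocab)
print(vector)
```

With whitespace tokenization, a token with attached punctuation (`Trump.`) and a multi-word entity (`Den Haag`) are not counted. As far as I can tell the same limitation applies when single tokens from `text.split()` are matched against a vocabulary, so some entities may be undercounted.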

**I wrote the following code to create a sparse matrix with one column for every named entity:**

In [486]:
from sklearn.feature_extraction.text import CountVectorizer
import json #because the named entities are saved in json files

In [None]:
# this code is taken from this article: 
# https://towardsdatascience.com/working-with-sparse-data-sets-in-pandas-and-sklearn-d26c1cfbe067
def mytokenizer(text):
    return text.split()

def create_sparse_matrix(df, vocab):
    vectorizer = CountVectorizer(
        vocabulary=vocab, tokenizer=mytokenizer, lowercase=False)

    X = vectorizer.fit_transform(df['content'])

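    # get_feature_names_out() requires scikit-learn >= 1.0;
    # older versions use get_feature_names() instead
    count_vect_df = pd.DataFrame(
        X.todense(), columns=vectorizer.get_feature_names_out())

    def convert_to_sparse_pandas(df, exclude_columns=()):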
        
        """
        Converts columns of a data frame into SparseArrays and returns the data frame with transformed columns.
        Use exclude_columns to specify columns to be excluded from transformation.
        :param df: pandas data frame
        :param exclude_columns: list
            Columns not be converted to sparse
        :return: pandas data frame
        """
        
        df = df.copy()
        exclude_columns = set(exclude_columns)

        for (columnName, columnData) in df.items():  # iteritems() was removed in pandas 2.0
            if columnName in exclude_columns:
                continue
            df[columnName] = pd.arrays.SparseArray(
                columnData.values, dtype='uint8')

        return df

    sparse_matrix_df = convert_to_sparse_pandas(count_vect_df)
    return sparse_matrix_df


# I comment this out, because this takes a 
# long time and the results are stored in 
# a dataframe (see below).

# create a new column for every named entity in the json files

#for category in categories:
#    with open('./named-entities/' + category + ".json") as json_file:
#        vocab = json.load(json_file)
        
#        sparse_matrix = create_sparse_matrix(df, vocab)
#        df = pd.concat([df, sparse_matrix], axis=1)

In [None]:
# to save memory I remove the columns 
# I don't need for the predicting. 

del df['content']
del df['datetime']
del df['title']


# and save the "new dataset" to a new .csv file:

# I comment this out, because this takes a 
# long time and the results are stored in
# the dataframe.

# df.to_csv('nos-sparse-matrix.csv')

A sparse matrix is a matrix that consists mostly of zeros. Each cell holds the number of times the named entity of that column occurs in the content of the article of that row. This is what the data frame looks like:

In [7]:
df_sparse = pd.read_csv('nos-sparse-matrix.csv')

del df_sparse['Unnamed: 0'] # deleting this extra index column, because I don't need this

df_sparse.head()

Unnamed: 0,category,url,Amerikaanse,Trump,VS,Britse,Rusland,Nederland,Russische,Franse,...,Gelderse,PVV.2,Zuid-Holland,Urk,Alkmaar,Roosendaal,Weert,Britse.7,Hengelo,A12
0,Buitenland,https://nos.nl/artikel/126231-enige-litouwse-k...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Buitenland,https://nos.nl/artikel/126230-spanje-eerste-eu...,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,Buitenland,https://nos.nl/artikel/126233-fout-justitie-in...,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Binnenland,https://nos.nl/artikel/126232-museumplein-vol-...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Buitenland,https://nos.nl/artikel/126236-obama-krijgt-rap...,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**I removed the categories** 'Cultuur & Media' (culture & media), 'Opmerkelijk' (remarkable) and 'Tech' from the training and test data because they resulted in very bad predictions. These are also the three smallest categories in the dataset.

I think this is because the named entities in these articles are so specific that they are not well represented in the sparse matrix. Because the articles in these categories probably also contain general entities, such as 'Europa' or 'Nederland', they are more likely to be assigned to other categories.

In [8]:
df_sparse = df_sparse.loc[df_sparse['category'] != 'Tech']
df_sparse = df_sparse.loc[df_sparse['category'] != 'Opmerkelijk']
df_sparse = df_sparse.loc[df_sparse['category'] != 'Cultuur & Media']

### Step 3 - Creating a sparse matrix with SciPy

In [355]:
BYTES_TO_MB_DIV = 0.000001
def print_memory_usage_of_data_frame(df):
    mem = round(df.memory_usage().sum() * BYTES_TO_MB_DIV, 3) 
    print("Memory usage is " + str(mem) + " MB")
    
print_memory_usage_of_data_frame(df_sparse)

Memory usage is 1705.036 MB


The data frame takes up a lot of memory. This is because sklearn does not handle sparse data frames as sparse data. Instead, sparse columns are converted to dense before being processed, causing the data frame size to explode (Velidou, 2019). 

**To be able to work with the sparse matrix I converted the pandas data frame into a [SciPy](https://www.scipy.org/) sparse matrix. I wrote the following code for that:**

In [9]:
from sklearn.model_selection import train_test_split
import numpy as np

from scipy.sparse import lil_matrix
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

# this code is taken from this article: 
# https://towardsdatascience.com/working-with-sparse-data-sets-in-pandas-and-sklearn-d26c1cfbe067
def data_frame_to_scipy_sparse_matrix(df):
    
    """
    Converts a sparse pandas data frame to sparse scipy csr_matrix.
    :param df: pandas data frame
    :return: csr_matrix
    """
    
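    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        ix = df[col] != 0
        # every non-zero count is stored as 1, so the matrix records the
        # presence of a named entity rather than how often it occurs
        arr[np.where(ix), i] = 1

    return arr.tocsr()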


y_sparse = df_sparse['category']
X = df_sparse[df_sparse.columns.difference(['category', 'url'])]
X_sparse = data_frame_to_scipy_sparse_matrix(X)

The new sparse matrix is significantly smaller than the original sparse dataframe:

In [368]:
def get_csr_memory_usage(matrix):
    # use the matrix that is passed in instead of the global X_sparse
    mem = (matrix.data.nbytes + matrix.indptr.nbytes + matrix.indices.nbytes) * BYTES_TO_MB_DIV
    print("Memory usage is " + str(mem) + " MB")

get_csr_memory_usage(X_sparse)

Memory usage is 26.25266 MB
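
The memory figure above adds up three arrays: `data`, `indices` and `indptr`. The following plain-Python sketch shows what these arrays hold in the compressed sparse row (CSR) layout and why only non-zero cells cost memory. The helper `to_csr` and the small matrix are invented for illustration; SciPy's own implementation is more involved.

```python
def to_csr(rows):
    """Build the three CSR arrays for a dense matrix given as a list of rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)     # the non-zero value itself
                indices.append(col)    # the column it sits in
        indptr.append(len(data))       # where the next row starts in data
    return data, indices, indptr


# invented 3 x 4 count matrix with three non-zero cells
dense = [
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

The dense form stores 12 cells, while the CSR form stores 3 values, 3 column indices and 4 row pointers. The saving grows with the share of zeros, which is very high in a matrix of named entity counts.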


## Descriptive Analysis
---

In [96]:
# total number of occurrences per named entity (sum of each column)
sum_columns = []
for (columnName, columnData) in X.items():  # iteritems() was removed in pandas 2.0
    sum_columns.append([columnData.sum(), columnName])

max_sub = max(sum_columns, key=lambda x: x[0])

print('The most found named entity is:', max_sub[1], 'and was found', max_sub[0], 'times in all the articles' )

The most found named entity is: Nederland and was found 56426 times in all the articles


The NOS also talks a lot about itself :)

In [88]:
# number of articles in which 'NOS' occurs at least once
nos_articles = (df_sparse['NOS'] > 0).sum()
print('NOS was found in', nos_articles, 'articles')

NOS was found in 10956 articles


In [89]:
# number of articles in which 'Trump' occurs at least once
trump_articles = (df_sparse['Trump'] > 0).sum()
print('Trump was found in', trump_articles, 'articles')

Trump was found in 5184 articles



## Training the model
---

I tried to fit different models on the data. First I split the data into training and test data.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_sparse, y_sparse, test_size=0.3, random_state=42)

### Logistic Regression

In [497]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(random_state=0, multi_class='ovr', solver='liblinear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("The accuracy for Logistic Regression:", acc)

The accuracy for Logistic Regression: 0.7351263116984065


### SGDClassifier

In [498]:
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("The accuracy for SGDClassifier:", acc)

The accuracy for SGDClassifier: 0.7321103769918383


### KNeighborsClassifier

In [499]:
from sklearn.neighbors import KNeighborsClassifier
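neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)  # predict with the fitted k-nearest-neighbours model
acc = accuracy_score(y_test, y_pred)
print("The accuracy for KNeighborsClassifier:", acc)

The accuracy for KNeighborsClassifier: 0.7321103769918383

Note that this output was produced with `y_pred = model.predict(X_test)`, which reuses the SGDClassifier from the previous cell. That is why the number is identical to the SGDClassifier accuracy. It does not measure the KNeighborsClassifier, whose accuracy still has to be computed with the cell above.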


### RandomForestClassifier

Training a random forest took very long, so I don't have a result for this algorithm.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# the target is a category, so a classifier is used here instead of
# RandomForestRegressor, which would treat encoded labels as ordered numbers;
# scikit-learn classifiers accept string labels, so y_train is left unchanged
rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X_train, y_train);

The accuracies of Logistic Regression and SGDClassifier are very close to each other, with Logistic Regression slightly ahead for now. I will continue with Logistic Regression, also because it is quite fast in comparison to the other algorithms.

## Evaluation
----

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(random_state=0, multi_class='ovr', solver='liblinear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("The accuracy for Logistic Regression:", acc)

The accuracy for Logistic Regression: 0.7351263116984065



This means the model classifies **73.5%** of the test articles correctly. I think this is not bad, but it still means that more than a quarter (26.5%) of the articles are not predicted correctly.
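
To put the accuracy in perspective, it can help to compare it with a majority-class baseline: a model that always predicts the largest category. The sketch below assumes the class sizes of the test set as they are listed in the support column of the classification report in this section. The baseline itself is an illustration and was not part of the original analysis.

```python
# number of test articles per category (support values from the classification report)
support = {
    "Binnenland": 22467,
    "Buitenland": 25486,
    "Economie": 5353,
    "Koningshuis": 865,
    "Politiek": 6091,
    "Regionaal nieuws": 4063,
}

total = sum(support.values())
majority_class = max(support, key=support.get)

# accuracy of always predicting the largest category
baseline_accuracy = support[majority_class] / total
print(majority_class, round(baseline_accuracy, 3))
```

Under these assumptions, always predicting 'Buitenland' would be right for roughly 40% of the test articles, so the named entity model clearly adds information beyond the class distribution.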

While building the named entity lists, I encountered some duplicates: the same named entity can be among the most common ones in several categories (see for example the columns 'PVV.2' and 'Britse.7' in the data frame above). By removing these duplicates and increasing the number of named entities, I think the model can perform better. I also think the model can be improved by combining named entities that are closely related to each other, for example 'Nederland' and 'Nederlandse'.
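
One possible way to avoid duplicate columns is to merge the per-category lists into a single vocabulary before vectorising. The following is a minimal sketch under that assumption; `merge_vocabularies` and the two short entity lists are hypothetical and only stand in for the exported `.json` files.

```python
def merge_vocabularies(per_category):
    """Merge per-category entity lists into one vocabulary without duplicates."""
    vocab = {}
    for entities in per_category.values():
        for entity in entities:
            if entity not in vocab:        # keep only the first occurrence
                vocab[entity] = len(vocab)
    return vocab


# invented top entities for two categories; 'Britse' occurs in both
per_category = {
    "Buitenland": ["Amerikaanse", "Britse", "Rusland"],
    "Politiek": ["PVV", "Britse", "Nederland"],
}
merged = merge_vocabularies(per_category)
print(merged)
```

With one shared vocabulary every named entity gets exactly one column, so the vectorising step would only need to run once instead of once per category.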

To further evaluate the model I created a classification report and a confusion matrix.

**Classification report**

In [12]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

                  precision    recall  f1-score   support

      Binnenland       0.71      0.70      0.71     22467
      Buitenland       0.75      0.92      0.83     25486
        Economie       0.68      0.39      0.49      5353
     Koningshuis       0.65      0.40      0.50       865
        Politiek       0.75      0.60      0.67      6091
Regionaal nieuws       0.82      0.47      0.60      4063

        accuracy                           0.74     64325
       macro avg       0.73      0.58      0.63     64325
    weighted avg       0.73      0.74      0.72     64325



**Precision** is the ratio of correctly predicted positive observations to the total predicted positive observations. The question that this metric answers is: *Of all articles that are labeled as category X, how many actually belong to X?* A high precision corresponds to few false positives. The model reaches an average precision of 0.73. In this context that is not bad, but for actually deploying a model like this the precision should be higher. 

**Recall** is the ratio of correctly predicted positive observations to all observations in the actual class. The question recall answers is: *Of all the articles that truly belong to category X, how many did we label as X?* The model has a high recall for the category 'Buitenland' (0.92), and 'Binnenland' is not doing too badly either (0.70). But the model is quite bad at finding the articles in the 'Economie' (0.39), 'Koningshuis' (0.40) and 'Regionaal nieuws' (0.47) categories. 

**Confusion matrix**

In [13]:
from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

conf_matrix = pd.DataFrame(cnf_matrix, index=["Domestic", "Foreign", "Economy", "Royal Family", "Politics", "Regional News" ], 
                                   columns = ["PRED: Domestic", "PRED: Foreign", "PRED: Economy", "PRED: Royal Family", "PRED: Politics", "PRED: Regional News" ]) 
conf_matrix

Unnamed: 0,PRED: Domestic,PRED: Foreign,PRED: Economy,PRED: Royal Family,PRED: Politics,PRED: Regional News
Domestic,15771,4793,509,109,898,387
Foreign,1507,23510,265,46,150,8
Economy,1690,1440,2075,5,134,9
Royal Family,236,253,7,346,20,3
Politics,1586,651,159,23,3663,9
Regional News,1482,612,24,1,22,1922


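The confusion matrix summarizes the prediction results: each row is an actual category and each column is a predicted category. For example, 4793 'Binnenland' (Domestic) articles were predicted as 'Buitenland' (Foreign). 

This matrix, and the other results, show that it is quite difficult to predict a category based on named entities alone. I personally think this is because some named entities are very general and therefore appear in different categories. 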
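
The accuracy, precision and recall reported in this section can be recomputed directly from the confusion matrix, which shows how the three metrics relate. The sketch below copies the counts from the matrix above (rows are actual categories, columns are predicted categories) and uses plain Python. The variable names are chosen for this example.

```python
labels = ["Binnenland", "Buitenland", "Economie",
          "Koningshuis", "Politiek", "Regionaal nieuws"]

# rows: actual category, columns: predicted category
cm = [
    [15771, 4793, 509, 109, 898, 387],
    [1507, 23510, 265, 46, 150, 8],
    [1690, 1440, 2075, 5, 134, 9],
    [236, 253, 7, 346, 20, 3],
    [1586, 651, 159, 23, 3663, 9],
    [1482, 612, 24, 1, 22, 1922],
]
n = len(cm)
total = sum(sum(row) for row in cm)

# accuracy: correctly predicted articles (the diagonal) over all articles
accuracy = sum(cm[i][i] for i in range(n)) / total

# recall: diagonal cell divided by its row sum (all articles truly in that category)
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(n)}

# precision: diagonal cell divided by its column sum (all articles predicted as that category)
precision = {labels[j]: cm[j][j] / sum(row[j] for row in cm) for j in range(n)}

print(round(accuracy, 4))
for label in labels:
    print(label, round(precision[label], 2), round(recall[label], 2))
```

The 'Buitenland' column collects many articles from other rows, which fits the observation that general named entities pull articles towards the largest category.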

## References
____

- Drikvandi, R., & Lawal, O. (2020). Sparse Principal Component Analysis for Natural Language Processing. Annals of Data Science, 0. https://doi.org/10.1007/s40745-020-00277-x
- Velidou, D. S. (2019, November 5). Working with sparse data sets in pandas and sklearn. Medium. https://towardsdatascience.com/working-with-sparse-data-sets-in-pandas-and-sklearn-d26c1cfbe067
