As stated in the assignment, the problem is:

> Often the urgency indication turns out to be wrong: patients that are classified as urgent, are in reality not so urgent when they see the GP. Also, the opposite happens. Can we predict better how urgent a patient will be?

> We then want to use the data from the triage to train a machine learning model to predict the label.

To predict the label (urgent or non-urgent), we can take one of three approaches:

1. supervised classification
2. unsupervised classification
3. semi-supervised classification

When the training data are labelled and the training set is large enough, **supervised classification methods** can be applied and usually give the most accurate results. At first glance, the problem can be solved with supervised methods because the data in the dataset are labelled (urgent vs. non-urgent).

However, another part of the assignment states:

> Important to know is that this labelling is imperfect.

Given this, training on the labels as they are may lead the classifier to produce wrong predictions. We can either ignore the labels completely and solve the problem with **unsupervised classification methods**, or keep only the labels that are valid, discard the rest, and treat it as a **semi-supervised classification** problem.

I therefore split this assignment into three approaches: supervised, unsupervised and semi-supervised.




First of all, I import all required libraries and read the input data from the file.

The main required libraries are:

* matplotlib and seaborn for visualization
* pandas and numpy for data structures
* nltk for natural language processing
* sklearn for machine learning
* keras for deep learning

**Note:** Based on my research, *Frog*, an integration of memory-based natural language processing (NLP) modules developed for Dutch, could be used instead of NLTK.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from collections import defaultdict
from collections import  Counter
plt.style.use('ggplot')
import re
from nltk.tokenize import word_tokenize
import gensim
import string
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
from keras.models import Sequential,Model
from keras import optimizers
from keras.layers import Embedding,LSTM,Dense,SpatialDropout1D,Input,Concatenate
from keras.initializers import Constant
from sklearn.model_selection import train_test_split
from sklearn.cluster import AgglomerativeClustering

stop=set(stopwords.words('dutch'))


In [None]:
df=pd.read_excel("/kaggle/input/pacmed/triage_example.xlsx")

**EDA on Dataset:**

The first step is Exploratory Data Analysis (EDA). We need to know more about our data before moving on to the next steps.

The dataset columns are:

In [None]:
df.columns

Checking the first rows of the dataset to get more insight into the data:

In [None]:
df.head(5)

**checking class distribution:**

It is important to know how the classes are distributed. Sometimes the class distribution is imbalanced, which requires further steps to handle.

In [None]:
x=df['Urgency indication'].value_counts()
# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.barplot(x=x.index,y=x.values)
plt.gca().set_ylabel('samples')
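
If the bar plot shows an imbalanced distribution, one common follow-up step is to weight the classes inversely to their frequency. The sketch below uses made-up label counts (not the assignment data) to show how scikit-learn derives such weights; the resulting dictionary could then be passed as `class_weight` to classifiers that support it.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 non-urgent cases and 2 urgent ones
labels = np.array(["Non-urgent"] * 8 + ["Urgent"] * 2)

classes = np.unique(labels)
# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
print(class_weights)
```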

Note: because classification is not possible with only one class, I added an artificial (augmented) row to the dataset so that the whole code can run and make predictions.

In [None]:
augmented_row={"Ingangsklacht":["griep"] ,"Triage: H":["Wat te doen?"] ,"Triage: B":["plotselinge koorts, pijntjes, zwakte of verlies van eetlust. In het bijzonder samen hoesten en koorts hebben"] , "Triage: M":["Geen"],"Triage: V": ["Geen"],"Triage: Checklist":["Kortademig: Nee; Blaarvormig: Nee; Zieke indruk: Nee; Ontsteking: Nee"],"Age": [23],"Gender": ["Female"], "Urgency indication":["Non-urgent"]}

In [None]:
df_augmented = pd.DataFrame(augmented_row, columns = list(df.columns))

In [None]:
# ignore_index gives the appended row a fresh index instead of a duplicate 0
df=pd.concat([df,df_augmented],ignore_index=True)

In [None]:
df

Plotting the class distribution again:

In [None]:
x=df['Urgency indication'].value_counts()
# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.barplot(x=x.index,y=x.values)
plt.gca().set_ylabel('samples')

**Splitting features from the target variable:**

In this assignment we assume that all columns are important and should be used as features (predictors). However, we may later drop some less useful ones (such as Triage: H).
We store the features as X_train and the target as y_train.

In [None]:
X_train=df.iloc[:, :-1]
y_train=df.iloc[:,-1]

Since the target is categorical, we first need to convert it to numerical values:

In [None]:
y_train= pd.Categorical(y_train).codes

In [None]:
y_train
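
It helps to know which integer stands for which class. `pd.Categorical` sorts the categories, so the codes follow alphabetical order of the labels. The toy series below is hypothetical and only reuses the two class names from the dataset:

```python
import pandas as pd

# Hypothetical labels, using the two class names from the dataset
labels = pd.Series(["Urgent", "Non-urgent", "Urgent"])

cat = pd.Categorical(labels)
codes = cat.codes                           # one integer code per row
mapping = dict(enumerate(cat.categories))   # code -> original label

print(list(codes))
print(mapping)
```

Keeping `mapping` around makes it possible to translate numeric predictions back into readable labels later.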

In [None]:
X_train

In this assignment, there are two types of features: **structured categorical** and **free unstructured text**. We therefore need different feature engineering techniques for each.

In [None]:
categorical_features=X_train.drop(['Triage: B'],axis=1)
text_feature=X_train['Triage: B']

**feature engineering for categorical features**

In [None]:
categorical_features

In [None]:
len(categorical_features)

The column "Triage: Checklist" packs four separate features into one string. We should split this column into four different columns:

In [None]:
Kortademig=[]
Blaarvormig=[]
indruk=[]
Ontsteking=[]
for i in range(len(categorical_features)):
    checklist_text=categorical_features.iloc[i]['Triage: Checklist']
    items=checklist_text.split(';')
    Kortademig.append(items[0].split(": ",1)[1])
    Blaarvormig.append(items[1].split(": ",1)[1])
    indruk.append(items[2].split(": ",1)[1])
    Ontsteking.append(items[3].split(": ",1)[1])
        

In [None]:
categorical_features['Kortademig']=Kortademig
categorical_features['Blaarvormig']=Blaarvormig
categorical_features['indruk']=indruk
categorical_features['Ontsteking']=Ontsteking
categorical_features=categorical_features.drop(['Triage: Checklist'],axis=1)
categorical_features
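
The positional split above assumes that every row lists the same four items in the same order. As a sketch of a more defensive variant, the hypothetical helper `parse_checklist` below reads each `key: value` pair by name, so a different order or a missing item does not shift values into the wrong column. The second example row is invented for illustration.

```python
import pandas as pd

def parse_checklist(text):
    """Turn 'Key: Value; Key: Value' into a {key: value} dict."""
    result = {}
    for item in text.split(";"):
        if ":" not in item:
            continue  # skip empty or malformed fragments
        key, value = item.split(":", 1)
        result[key.strip()] = value.strip()
    return result

checklists = pd.Series([
    "Kortademig: Nee; Blaarvormig: Nee; Zieke indruk: Nee; Ontsteking: Nee",
    "Ontsteking: Ja; Kortademig: Nee",  # hypothetical row: other order, missing items
])

# One column per checklist item; items missing from a row become NaN
checklist_df = pd.DataFrame(checklists.apply(parse_checklist).tolist())
print(checklist_df)
```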

Now, we convert all categorical values to numeric codes (except for Age, which is already numeric). If we had more data, we could do EDA on these features as well, but in this assignment we stop at this step.

In [None]:
# Replace every categorical column (all except Age) with integer category codes
categorical_columns=['Ingangsklacht','Triage: H','Triage: M','Triage: V','Gender',
                     'Kortademig','Blaarvormig','indruk','Ontsteking']
for column in categorical_columns:
    categorical_features[column] = pd.Categorical(categorical_features[column]).codes
categorical_features
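
Integer codes imply an ordering between categories that nominal columns do not really have. One alternative, shown here only as a sketch on invented values (the complaint names other than "griep" are hypothetical), is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical values for two of the nominal columns
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Ingangsklacht": ["griep", "hoofdpijn", "buikpijn"],
    "Age": [23, 54, 37],
})

# Each category becomes its own 0/1 column; numeric columns are left as they are
encoded = pd.get_dummies(toy, columns=["Gender", "Ingangsklacht"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which matters when a column has many distinct values.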

**Analysis of free text:** (NLP)

First, we do a very basic analysis to get more insight into the data. To do so, we create a list of all words and analyze these words.

In [None]:
#put all documents together, separated by a space
#(a ' ,' separator would glue a comma onto the first word of each document)
corpus=text_feature.str.cat(sep=' ')

In [None]:
corpus

In [None]:
#get the words
words=corpus.split()

In [None]:
words

Checking Stop-words:

In [None]:
dic=defaultdict(int)
for word in words:
    if word in stop:
        dic[word]+=1
        
top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:10] 

Now, we want to know which stop words occur most frequently in the corpus. Most stop words should be removed from the corpus; however, some of them may be useful and should be kept in the text.

In [None]:
x,y=zip(*top)
plt.bar(x,y)

We also need to check whether the corpus contains punctuation marks or special characters, and which ones. The reason is that most punctuation marks are useless and should be removed from the text, but some of them are important and should be kept. The following plot gives more information about the punctuation in the corpus.

In [None]:

dic=defaultdict(int)
import string
special = string.punctuation
for word in words:
    for ch in word:
        if ch in special:
            dic[ch]+=1
        

x,y=zip(*dic.items())
plt.bar(x,y)

In addition, we need to know which words occur most often in the text. These words may carry very useful information.

In [None]:
counter=Counter(words)
most=counter.most_common()
x=[]
y=[]
for word,count in most[:40]:
    if (word not in stop) :
        x.append(word)
        y.append(count)

In [None]:
sns.barplot(x=y,y=x)


**Data cleaning:**

1. making all words lower case (should be skipped or only partially done for word-embedding and BERT solutions)
2. removing all punctuation marks
3. removing special characters
4. removing all stop words
5. correcting word spellings
6. stemming and lemmatization

In this assignment, I followed steps 1 to 4 to clean the data. Steps 5 and 6 are highly language-dependent, and because of the limitations of the libraries for Dutch, I skipped these two steps.



In [None]:
text_feature=text_feature.str.lower()
text_feature

Removing punctuation:

In [None]:
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

In [None]:
text_feature=text_feature.apply(lambda x : remove_punct(x))
text_feature

Remove stop-words:

In [None]:
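# Compile the stop-word pattern once instead of on every call
stopword_pattern = re.compile(r'\b(?:' + r'|'.join(map(re.escape, stop)) + r')\b\s*')

def clean_stopwords(text):
    # replace each stop word (and the whitespace after it) with a single space
    return stopword_pattern.sub(' ', text)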

In [None]:
text_feature=text_feature.apply(lambda x : clean_stopwords(x))
text_feature

**Ngram analysis**

Words can be more meaningful when they come together. For example, "fish" carries less information than "aquatic fish" or "food fish". So, we need to extract phrases longer than one word for analysis.
In this assignment, I extracted bi-grams (phrases of length 2) and tri-grams (phrases of length 3):
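
To make the idea concrete, here is a minimal plain-Python sketch of n-gram extraction. The helper name `extract_ngrams` is hypothetical, and the example sentence reuses a few words from the augmented row.

```python
def extract_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "plotselinge koorts pijntjes zwakte"
bigrams = extract_ngrams(sentence, 2)
trigrams = extract_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

`CountVectorizer` with `ngram_range` does the same kind of windowing, plus counting and vocabulary building, across all documents at once.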


**Bi-gram**

In [None]:
def get_top_text_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
plt.figure(figsize=(10,5))
top_text_bigrams=get_top_text_bigrams(text_feature)[:10]
x,y=map(list,zip(*top_text_bigrams))
sns.barplot(x=y,y=x)

**Tri-gram**

In [None]:
def get_top_text_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
plt.figure(figsize=(10,5))
top_text_trigrams=get_top_text_trigram(text_feature)[:10]
x,y=map(list,zip(*top_text_trigrams))
sns.barplot(x=y,y=x)

**Supervised Classification:**

The first solution is to simply vectorize the text into a **bag of words** representation (word or n-gram frequencies).

In [None]:
vectorizer = CountVectorizer(ngram_range=(1,4))
text_feature_vector= vectorizer.fit_transform(text_feature)
# scikit-learn >= 1.0; older versions use get_feature_names()
print(vectorizer.get_feature_names_out())

In [None]:
text_feature_vector

In [None]:
text_feature_vector=text_feature_vector.todense()

In [None]:
#converting matrix to numpy array
text_feature_vector=np.array(text_feature_vector)

In [None]:
#converting dataframe to numpy array
categorical_features=categorical_features.values

Now, we can combine the text vectors with the categorical vectors into one feature vector per sample:

In [None]:
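final_features_vectors=[]
for i in range(len(categorical_features)):
    # join row i of the categorical features with row i of the text vectors
    # (concatenating the full matrices would give every sample the same vector)
    vector=np.concatenate((categorical_features[i], text_feature_vector[i]), axis=None)
    final_features_vectors.append(vector)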

In [None]:
final_features_vectors
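
Combining two feature blocks row by row can also be written as a single NumPy call. The sketch below uses small invented arrays rather than the real features, to show what the per-sample result should look like:

```python
import numpy as np

# Hypothetical toy inputs: 2 samples, 3 categorical codes and 4 word counts each
categorical = np.array([[0, 1, 23],
                        [1, 0, 54]])
text_counts = np.array([[1, 0, 2, 0],
                        [0, 1, 0, 1]])

# hstack joins the two matrices column-wise, so row i of the result
# is row i of `categorical` followed by row i of `text_counts`
combined = np.hstack([categorical, text_counts])
print(combined.shape)
```

A quick shape check like this (one row per sample, one column per feature) is a cheap way to catch concatenation mistakes before training.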

The feature vector is now ready to be fed into a classifier. In this assignment, I chose the SVM classifier from sklearn.

In [None]:
from sklearn.svm import SVC
clf = SVC(gamma='auto')
clf.fit(final_features_vectors, y_train)

The model is trained and ready for making predictions. In a real application, we would split the data into training and test sets and make the predictions on the test data. But in this assignment we have to use the same training data for testing:

In [None]:
predictions=clf.predict(final_features_vectors)
predictions
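
For reference, this is roughly what the train/test procedure would look like with enough data. The arrays are randomly generated stand-ins, not triage data, and the split ratio is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data: 40 samples, 5 features, two balanced classes
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.array([0, 1] * 20)
X[y == 1] += 2.0  # shift one class so there is something to learn

# Hold out 25% of the rows; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf_demo = SVC(gamma="auto").fit(X_tr, y_tr)
held_out_accuracy = clf_demo.score(X_te, y_te)
print("held-out accuracy:", held_out_accuracy)
```

Accuracy measured on the held-out rows is an estimate of how the model behaves on unseen patients, which accuracy on the training rows cannot provide.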

To evaluate the model, I calculate the accuracy:

In [None]:
def get_accuracy(predictions,realValues):
    correct=0
    incorrect=0
    for i in range(len(predictions)):
        if predictions[i]==realValues[i]:
            correct=correct+1
        else:
            incorrect=incorrect+1
    return correct/(correct+incorrect)

        

In [None]:
get_accuracy(predictions,y_train)
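
The same number can be obtained from scikit-learn, and a confusion matrix additionally shows which kind of error is made (urgent predicted as non-urgent or the other way round). The labels and predictions below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = urgent, 0 = non-urgent)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

acc = accuracy_score(y_true, y_pred)   # correct predictions / all predictions
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(acc)
print(cm)
```

For a triage setting the off-diagonal cells are not equally costly, so looking at them separately is usually more informative than accuracy alone.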

In my run, the accuracy was 50%, which is low. We have several options to improve the accuracy:

1. changing the feature selection method
2. tuning the parameters that we used during implementation
3. changing the classification method
4. a mixture of all the above
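
As one possible instance of the first option, the raw counts could be replaced by TF-IDF weights, using the `TfidfVectorizer` that is already imported. This is only a sketch on an invented mini-corpus; whether it helps on the real data is untested.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the cleaned triage texts
docs = ["plotselinge koorts hoesten", "koorts zwakte", "pijn buik"]

# TF-IDF down-weights terms that occur in many documents
tfidf = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(docs)

# Each row is L2-normalised by default
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(tfidf_matrix.shape)
print(row_norms)
```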


**Word Embedding & A simple LSTM Deep Neural Network**

In the first classification method, I used bag of words as the feature selection method. However, bag of words is only suitable for very small datasets.

One state-of-the-art way to handle this problem is to use word embeddings. Three popular word-embedding methods are Word2vec, GloVe and fastText. We can either train a model ourselves or use a pre-trained one. Each pre-trained model has a vocabulary plus a vector for each word, so a different pre-trained model is required for each language.

Because I only had two observations, I could not train a model myself, so I looked for a pre-trained model for the Dutch language. I found that fastText provides a pre-trained model for Dutch.




In [None]:
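# please un-comment these lines to download and unpack the fasttext model:
# import urllib.request, gzip, shutil
# urllib.request.urlretrieve("https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.vec.gz", "cc.nl.300.vec.gz")
# with gzip.open("cc.nl.300.vec.gz", "rb") as src, open("cc.nl.300.vec", "wb") as dst:
#     shutil.copyfileobj(src, dst)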

In [None]:
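embedding_dict={}
# the with-block closes the file, so no explicit f.close() is needed
with open('/kaggle/input/fasttext-dutch/cc.nl.300.vec','r',encoding='utf-8') as f:
    next(f)  # skip the header line ("<number of words> <dimension>")
    for line in f:
        values=line.rstrip().split(' ')
        word=values[0]
        vectors=np.asarray(values[1:],'float32')
        embedding_dict[word]=vectors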
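
The full Dutch vector file is large, so during experimentation it can be convenient to read only the first part of it. The helper below is a sketch: `load_vectors` and its `limit` parameter are hypothetical names, and the tiny temporary file only imitates the `.vec` text format (header line, then one word and its numbers per line).

```python
import os
import tempfile
import numpy as np

def load_vectors(path, limit=None):
    """Read a fastText .vec text file into a {word: vector} dict.

    `limit` (a hypothetical convenience parameter) stops after that many words.
    """
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        next(f)  # first line holds "<vocabulary size> <dimension>"
        for line in f:
            if limit is not None and len(vectors) >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

# Tiny stand-in file in the same format, for demonstration only
with tempfile.NamedTemporaryFile("w", suffix=".vec", delete=False, encoding="utf-8") as tmp:
    tmp.write("3 2\nkoorts 0.1 0.2\nhoesten 0.3 0.4\npijn 0.5 0.6\n")
demo_vectors = load_vectors(tmp.name, limit=2)
os.remove(tmp.name)
print(list(demo_vectors))
```

Words that fall outside the loaded subset simply end up without an embedding, so a limit trades coverage for memory and loading time.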

In [None]:
MAX_LEN=50
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(text_feature)
sequences=tokenizer_obj.texts_to_sequences(text_feature)

text_pad=pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')
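
To make explicit what `padding='post'` and `truncating='post'` do, here is a small NumPy re-implementation of that one configuration. The function `pad_post` is a hypothetical illustration, not a replacement for the Keras utility.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Cut each sequence after `maxlen` items and fill shorter ones with `value` at the end."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]              # 'post' truncation drops the tail
        out[row, :len(trimmed)] = trimmed   # 'post' padding leaves the fill value at the end
    return out

padded = pad_post([[1, 2, 3], [4]], maxlen=2)
print(padded)
```

Every document thus becomes a fixed-length row of word indices, which is what the embedding layer expects as input.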

In [None]:
word_index=tokenizer_obj.word_index
print('Number of unique words:',len(word_index))

In [None]:
found=0
not_found=0
num_words=len(word_index)+1
embedding_matrix=np.zeros((num_words,300))

for word,i in tqdm(word_index.items()):
    if i > num_words:
        continue
    
    emb_vec=embedding_dict.get(word)
    if emb_vec is not None:
        embedding_matrix[i]=emb_vec
        found=found+1
    else:
        not_found=not_found+1

In [None]:
print("number of words which found a vector from vocabulary is: ",found)
print("number of words which haven't found a vector from vocabulary is: ",not_found)

Now, we can use the embedding matrix as the weights that convert the input text to vectors before feeding it into an LSTM deep neural network.

In [None]:
model=Sequential()
embedding=Embedding(num_words, 300,
          weights=[embedding_matrix], input_length=MAX_LEN, trainable=False)

model.add(embedding)
model.add(SpatialDropout1D(0.2))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))


optimzer=optimizers.Adam(learning_rate=1e-5)

model.compile(loss='binary_crossentropy',optimizer=optimzer,metrics=['accuracy'])

model.summary()

In [None]:
model.fit(text_pad,y_train)

The model is trained so we can use it for prediction :

In [None]:
predictions=model.predict(text_pad)

In [None]:
def get_binary_output(val):
    if val>=0.5:
        return 1
    else:
        return 0

In [None]:
predictions=[get_binary_output(val) for val in predictions]

In [None]:
predictions

In [None]:
get_accuracy(predictions, y_train)

Since we have a structured feature list as well, I added it as a second input that is concatenated with the LSTM output before the final layer:
<img src="http://digital-thinking.de/wp-content/uploads/2018/07/combine.png" width="20%">



In [None]:
nlp_input=Input(shape=(MAX_LEN,),name='nlp_input')
categorical_input = Input((10,))


embedding=Embedding(input_dim=num_words,output_dim=300,weights=[embedding_matrix], input_length=MAX_LEN, trainable=False)(nlp_input)
drop_layer=SpatialDropout1D(0.2)(embedding)
Lstm_layer = LSTM(64, dropout=0.2, recurrent_dropout=0.2)(drop_layer)

# Concatenate the LSTM features and the categorical input
concat_layer= Concatenate()([categorical_input, Lstm_layer])
output = Dense(1, activation='sigmoid')(concat_layer)

# define a model with a list of two inputs
model = Model(inputs=[nlp_input, categorical_input], outputs=output)
model.compile(loss='binary_crossentropy',optimizer=optimzer,metrics=['accuracy'])
model.summary()

In [None]:
model.fit([text_pad,categorical_features],y_train)

In [None]:
predictions=model.predict([text_pad,categorical_features])

In [None]:
predictions

In [None]:
predictions=[get_binary_output(val) for val in predictions]

In [None]:
get_accuracy(predictions, y_train)

**Unsupervised classification** (Categorization)

In this group of methods, we do not use the labels of the documents to train a classifier.
There are many methods for clustering; in this assignment, I implemented two of them:

- Agglomerative Clustering
- Topic Modeling



**Agglomerative Clustering**

In [None]:
def get_Hierarchical_Clusters(words_vectors,NUMBER_OF_CLUSTERS):

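    # ward linkage always uses Euclidean distance, so no distance argument is needed
    # (the old `affinity` argument is not accepted by recent scikit-learn versions)
    cluster = AgglomerativeClustering(n_clusters=NUMBER_OF_CLUSTERS, linkage='ward')
    # fit_predict fits the hierarchical clustering from features, and returns cluster labels.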
    cluster_labels= cluster.fit_predict(words_vectors)
    return cluster_labels

In [None]:
clusters = get_Hierarchical_Clusters(final_features_vectors, 2)
clusters
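
Cluster ids are arbitrary (cluster 0 is not necessarily "non-urgent"), so comparing them to known labels needs a score that ignores the naming. One option is the adjusted Rand index; the sketch below applies it to invented, clearly separated points rather than to the triage features.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: two well-separated groups of three points each
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
reference = np.array([0, 0, 0, 1, 1, 1])

cluster_ids = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(points)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(reference, cluster_ids)
print(ari)
```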

**Topic Modeling**

Topic modeling is a popular and easy-to-implement unsupervised method.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer

number_words=5

def lda(data, NUMBER_OF_CLUSTERS):
    # stop words must be passed by keyword; the first positional parameter is `input`
    count_vectorizer = CountVectorizer(stop_words=list(stop))
    count_data = count_vectorizer.fit_transform(data)

    lda = LDA(n_components=NUMBER_OF_CLUSTERS)
    lda.fit(count_data)
    
    all_topics_words=[]
    # scikit-learn >= 1.0; older versions use get_feature_names()
    words=count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lda.components_):
        each_topic_words=([words[i] for i in topic.argsort()[:-number_words - 1:-1]])
        # each_topic_words=([words[i] for i  in topic.argsort()[::-1]])
        all_topics_words.append(each_topic_words)

    return lda, all_topics_words

In [None]:
NUMBER_OF_CLUSTERS=2
lda, topic_words=lda(text_feature,NUMBER_OF_CLUSTERS)
#printing topic words:
for i in range(NUMBER_OF_CLUSTERS):
    print("topic "+str(i)+": "+'[%s]' % ', '.join(map(str, topic_words[i])))

In [None]:
#clustering based on the lda model:
count_vectorizer = CountVectorizer(stop_words=list(stop))
count_data = count_vectorizer.fit_transform(text_feature)
lda.transform(count_data)

As the results show, the first document is more likely to belong to topic 1, and the second document is more likely to belong to topic 0.
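
Reading the most likely topic off the probability matrix can be automated with `argmax`. The sketch below runs on an invented mini-corpus; the variable names and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the cleaned triage texts
docs = ["koorts hoesten koorts zwakte", "pijn buik pijn misselijk", "hoesten koorts"]

counts = CountVectorizer().fit_transform(docs)
topic_model = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

doc_topic = topic_model.transform(counts)   # shape: (n_documents, n_topics)
assigned = doc_topic.argmax(axis=1)         # most probable topic per document
print(doc_topic.round(2))
print(assigned)
```

Each row of `doc_topic` sums to 1, so the values can be read as the share of each topic in a document.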

**Semi-supervised classification**

Since the labels ("Urgency indication") are not perfect, we cannot rely on them for classification. However, there is another dataset based on the medication / referral as done by the doctor. The labels in the column "Urgent in hindsight" seem to be correct. The problem is that we only have these labels if, prior to visiting the doctor, "Urgency indication" was urgent. So, we have the correct labels for patients who were referred to the doctor, but we don't have correct labels for the other group.

<img src="http://parvaneh.me/wp-content/uploads/2020/02/screenshot_20200212_170221.png">

In semi-supervised classification, the labels of the unlabelled data are inferred from the labelled ones (for example by the co-training method), and then the whole dataset can be used to train a supervised classifier.
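
scikit-learn does not ship a co-training implementation, but it does offer self-training, a related technique in which a single classifier repeatedly labels the unlabelled rows it is most confident about. The sketch below swaps in self-training on randomly generated stand-in data; it is not the assignment's dataset, and the threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Hypothetical stand-in data: two shifted Gaussian blobs, 60 samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
               rng.normal(3.0, 1.0, size=(30, 4))])
y_true = np.array([0] * 30 + [1] * 30)

# Keep only a few trusted labels; scikit-learn marks unlabelled rows with -1
y_partial = np.full_like(y_true, -1)
trusted = np.r_[0:5, 30:35]
y_partial[trusted] = y_true[trusted]

# The base classifier is passed positionally; it must provide predict_proba
self_trainer = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
self_trainer.fit(X, y_partial)

# transduction_ holds the label finally assigned to every training row (-1 if none)
n_labelled = int((self_trainer.transduction_ != -1).sum())
print(n_labelled, "rows labelled after self-training")
```

In the triage setting, the trusted rows would be the ones with a label in "Urgent in hindsight", and all other rows would be marked with -1.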

Due to the time constraint for this assignment, I have left this last part for future endeavours.