<div>
    <h1 align="center">Excercise 03 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

Before you start with this exercise:

In the last weeks exercise, we went through the NLP-steps data-exploration, data cleaning and tokenization. In the end of this exercise we want to test a simple machine learning model, which takes your preprocessed texts as input. Therefore, we suggest that, before you start, you copy your code from data cleaning inside this notebook and continue with the following tasks. **A tokenizer is not needed in this exercise**.

In [1]:
import pandas as pd
import re
import nltk

df = pd.read_csv("../DATA/mtsamples_clean.csv")

df['cleaned'] = df['transcription'].apply(lambda row: re.sub(r"[\W\d\s]", ' ', str(row)).lower())
df['tokenized'] = df['cleaned'].apply(lambda x: nltk.word_tokenize(str(x)))
df.dtypes

medical_specialty    object
transcription        object
cleaned              object
tokenized            object
dtype: object

## NLP Pipeline - Part 2 <a class="anchor" id="first"></a>

This week, we further focus on two common preprocessing steps in the NLP pipeline: stemming/lemmatization and stop word analysis.

### Stemming/Lemmatization
Stemming and lemmatization are techniques used to reduce inflectional and derivational forms of words to their base or root form, in order to simplify the analysis of text data. In this task, we will use the nltk library to perform stemming and lemmatization on the preprocessed text data.

Please perform the following steps:

* Choose a stemmer from the nltk library and apply it to the tokenized text in the dataset.
* Choose a lemmatizer from the nltk library and apply it to the tokenized text in the dataset.
* Compare the results of the stemming and lemmatization techniques on the dataset.

In [5]:
### your code ###
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')
df['stemmed'] = df['cleaned'].apply(lambda x: [stemmer.stem(i) for i in list(x)])


In [17]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['cleaned'].apply(lambda x: [lemmatizer.lemmatize(i) for i in list(x)])

In [4]:
df.head()

Unnamed: 0,medical_specialty,transcription,cleaned,tokenized,stemmed,lemmatized
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr...",subjective this year old white female pr...,"[subjective, this, year, old, white, female, p...","[subject, this, year, old, white, femal, prese...","[subjective, this, year, old, white, female, p..."
1,Bariatrics,"PAST MEDICAL HISTORY:, He has difficulty climb...",past medical history he has difficulty climb...,"[past, medical, history, he, has, difficulty, ...","[past, medic, histori, he, has, difficulti, cl...","[past, medical, history, he, ha, difficulty, c..."
2,Bariatrics,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",history of present illness i have seen abc ...,"[history, of, present, illness, i, have, seen,...","[histori, of, present, ill, i, have, seen, abc...","[history, of, present, illness, i, have, seen,..."
3,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit...",d m mode left atrial enlargement wit...,"[d, m, mode, left, atrial, enlargement, with, ...","[d, m, mode, left, atrial, enlarg, with, left,...","[d, m, mode, left, atrial, enlargement, with, ..."
4,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...,the left ventricular cavity size and wall ...,"[the, left, ventricular, cavity, size, and, wa...","[the, left, ventricular, caviti, size, and, wa...","[the, left, ventricular, cavity, size, and, wa..."


### Stop Word Analysis

Stop words are words that are commonly used in a language, but do not add any value to the meaning of a sentence. Examples of stop words in English include "the", "and", "a", "an", "in", etc. In this task, we will remove stop words from the preprocessed text data in order to improve the quality of our classification results.

Please perform the following steps:

* Use the nltk library to obtain a list of stop words in English.
* Remove the stop words from the preprocessed text data in the dataset.
* Compare the results of the classification with and without stop word removal.

In [18]:
### your code ###
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
eng_stopwords

df['stopwords_removed'] = df['lemmatized'].apply(lambda x: [word for word in x if word not in eng_stopwords])


### Test your preprocessing steps!

It is crucial to test the effectiveness of our preprocessing techniques on the accuracy of a machine learning classifier, as the quality of the data fed into the model can have a significant impact on the model's performance. If the input data is noisy, contains irrelevant information, or is not properly transformed, the accuracy of the resulting model will suffer. Therefore, it is important to **evaluate the impact of our preprocessing techniques on the classification task.**

In this exercise, we provide you with a simple baseline model for text classification using scikit-learn. Now, it's time to test the effectiveness of your techniques you learned so-far on the classification accuracy.

To do this, we recommend that you first select a subset of the dataset and preprocess it using the techniques learned in the previous sections. Then, train a machine learning classifier on the preprocessed data using the given pipeline. Finally, evaluate the accuracy of the classifier on the preprocessed data and compare it with the accuracy of the same classifier trained on the raw data.

The following train-function takes the pandas dataframe as input and trains a naive-bayes classifier on it.

Note, that the following train-function takes a pandas dataframe as input. **You should keep the column-names for notes ("transcription") and classes ("medical_specialty").**

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score


### do not change! ###
def train(df):
    X_train, X_test, y_train, y_test = train_test_split(df['stopwords_removed'], df['medical_specialty'], test_size=0.2, random_state=42)

    # Define the text classification pipeline
    text_clf = Pipeline([
        ('vect', CountVectorizer(lowercase=False)),
        ('clf', MultinomialNB())
    ])

    # Train the pipeline on the preprocessed data
    text_clf.fit(X_train, y_train)

    predicted = text_clf.predict(X_test)

    # Evaluate the accuracy of the model
    accuracy = (predicted == y_test).mean()
    f1 = f1_score(y_test, predicted, average='micro')
    print(f"Accuracy: {accuracy:.2f}")

##############################


In [20]:
# df['stopwords_removed'] = df['stopwords_removed'].apply(lambda x: ' '.join(list(x)))
# df['medical_specialty'] = df['medical_specialty'].astype(str)

train(df)

TypeError: expected string or bytes-like object

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df['stopwords_removed'], df['medical_specialty'], test_size=0.2, random_state=42)
X_train_org, X_test_org, y_train_org, y_test_org = train_test_split(df['transcription'], df['medical_specialty'], test_size=0.2, random_state=42)


In [16]:
print('stopwords_removed:', X_train[0], '\noriginial:', X_train_org[0])

stopwords_removed: u   b   j   e   c   v   e                   h                   e   r       l       w   h   e       f   e   l   e       p   r   e   e   n       w   h       c   p   l   n       f       l   l   e   r   g   e               h   e       u   e           h   v   e       l   l   e   r   g   e       w   h   e   n       h   e       l   v   e       n       e   l   e       b   u       h   e       h   n   k       h   e       r   e       w   r   e       h   e   r   e               n       h   e       p           h   e       h       r   e       c   l   r   n           n       z   r   e   c               b   h       w   r   k   e       f   r       h   r       e       b   u       h   e   n       e   e   e           l   e       e   f   f   e   c   v   e   n   e               h   e       h       u   e       l   l   e   g   r       l               h   e       u   e       h       l       u   e   r       n       h   e       b   e   g   n       u   n   g           g   n       w       w   e

In order to train and test the machine learnign model on your preprocessed dataset, the dataframe should have the same column names as in the original dataset.