<div>
    <h1 align="center">Exercise 03 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

Before you start with this exercise:

In last week's exercise, we went through the NLP steps of data exploration, data cleaning, and tokenization. At the end of this exercise, we want to test a simple machine learning model that takes your preprocessed texts as input. We therefore suggest that, before you start, you copy your data-cleaning code into this notebook and then continue with the following tasks. **A tokenizer is not needed in this exercise**.

In [68]:
import pandas as pd
import re
import nltk

df = pd.read_csv("../DATA/mtsamples_clean.csv")

df['cleaned'] = df['transcription'].apply(lambda row: re.sub(r"[\W\d\s]", ' ', str(row)).lower())
df['tokenized'] = df['cleaned'].apply(lambda x: nltk.word_tokenize(str(x)))
df.dtypes

medical_specialty    object
transcription        object
cleaned              object
tokenized            object
dtype: object
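
The cleaning step above replaces every non-word character and every digit with its own space, so runs of punctuation leave runs of spaces behind (the `\s` in the character class is already covered by `\W`). A minimal sketch of a variant that collapses each run into a single space follows. The helper name `clean_text` is hypothetical and not part of the exercise, and unlike the original pattern it also removes underscores.

```python
import re

def clean_text(text):
    """Lowercase text and replace each run of non-letter characters with one space."""
    # [\W\d_]+ matches runs of punctuation, whitespace, digits and underscores
    return re.sub(r"[\W\d_]+", " ", str(text)).strip().lower()

print(clean_text("SUBJECTIVE:,  This 23-year-old white female presents"))
# prints: subjective this year old white female presents
```

On the dataframe this could be applied as `df['transcription'].apply(clean_text)`.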

## NLP Pipeline - Part 2 <a class="anchor" id="first"></a>

This week, we focus on two further preprocessing steps that are common in the NLP pipeline: stemming/lemmatization and stop word analysis.

### Stemming/Lemmatization
Stemming and lemmatization are techniques used to reduce inflectional and derivational forms of words to their base or root form, in order to simplify the analysis of text data. In this task, we will use the nltk library to perform stemming and lemmatization on the preprocessed text data.

Please perform the following steps:

* Choose a stemmer from the nltk library and apply it to the tokenized text in the dataset.
* Choose a lemmatizer from the nltk library and apply it to the tokenized text in the dataset.
* Compare the results of the stemming and lemmatization techniques on the dataset.

In [69]:
### your code ###
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')
df['stemmed'] = df['tokenized'].apply(lambda x: [stemmer.stem(i) for i in list(x)])


In [70]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['tokenized'].apply(lambda x: [lemmatizer.lemmatize(i) for i in list(x)])
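
For the comparison step, one option is to list the tokens on which the two techniques disagree. The sketch below uses a hypothetical helper, `disagreements` (not part of nltk), that works on plain lists, so it can be applied row by row to the tokenized, stemmed and lemmatized columns. The sample values are illustrative stand-ins, not output taken from the dataset.

```python
def disagreements(tokens, stemmed, lemmatized):
    """Return (token, stem, lemma) triples for which stem and lemma differ."""
    return [
        (tok, stem, lemma)
        for tok, stem, lemma in zip(tokens, stemmed, lemmatized)
        if stem != lemma
    ]

# Illustrative values only; real stems and lemmas depend on the chosen nltk classes.
tokens = ["allergies", "presents", "effectiveness"]
stemmed = ["allergi", "present", "effect"]
lemmatized = ["allergy", "present", "effectiveness"]

for tok, stem, lemma in disagreements(tokens, stemmed, lemmatized):
    print(f"{tok}: stem={stem!r}, lemma={lemma!r}")
```

On the dataframe, the helper could be called with `df.apply(lambda r: disagreements(r['tokenized'], r['stemmed'], r['lemmatized']), axis=1)`.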

### Stop Word Analysis

Stop words are words that are used very frequently in a language but contribute little to the meaning of a sentence. Examples of stop words in English include "the", "and", "a", "an", and "in". In this task, we will remove stop words from the preprocessed text data in order to improve the quality of our classification results.

Please perform the following steps:

* Use the nltk library to obtain a list of stop words in English.
* Remove the stop words from the preprocessed text data in the dataset.
* Compare the results of the classification with and without stop word removal.

In [71]:
### your code ###
from nltk.corpus import stopwords

# A set gives constant-time membership tests instead of scanning a list for every token
eng_stopwords = set(stopwords.words('english'))

df['stopwords_removed'] = df['lemmatized'].apply(lambda x: [word for word in x if word not in eng_stopwords])
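
The order of the two steps matters. nltk's WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as "has", "was" and "does" can come out as "ha", "wa" and "doe", which no longer match the stop word list. Below is a minimal sketch of filtering before normalizing. The helper names, the tiny stop word set and the toy normalizer are hypothetical stand-ins for `stopwords.words('english')` and `lemmatizer.lemmatize`.

```python
def remove_stopwords(tokens, stop_words):
    """Drop every token that appears in stop_words."""
    stop_set = set(stop_words)  # set lookup avoids scanning a list per token
    return [tok for tok in tokens if tok not in stop_set]

def filter_then_normalize(tokens, stop_words, normalize):
    """Remove stop words first, then apply normalize to the remaining tokens."""
    return [normalize(tok) for tok in remove_stopwords(tokens, stop_words)]

def strip_plural_s(tok):
    """Toy normalizer (assumption): crudely strips one trailing 's'."""
    return tok[:-1] if tok.endswith("s") else tok

# Hypothetical stand-in for the nltk stop word list
toy_stop_words = {"has", "was", "she", "the"}
tokens = ["she", "has", "allergies", "the", "throat", "was", "red"]

filtered_first = filter_then_normalize(tokens, toy_stop_words, strip_plural_s)
normalized_first = remove_stopwords([strip_plural_s(t) for t in tokens], toy_stop_words)

print(filtered_first)    # stop words are gone before normalization
print(normalized_first)  # "ha" and "wa" escape the filter
```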


### Test your preprocessing steps!

It is crucial to measure how our preprocessing techniques affect the accuracy of a machine learning classifier, because the quality of the data fed into the model can have a significant impact on its performance. If the input data is noisy, contains irrelevant information, or is not properly transformed, the accuracy of the resulting model will suffer. Therefore, it is important to **evaluate the impact of our preprocessing techniques on the classification task.**

In this exercise, we provide you with a simple baseline model for text classification using scikit-learn. Now it is time to test how the techniques you have learned so far affect the classification accuracy.

To do this, we recommend that you first select a subset of the dataset and preprocess it using the techniques learned in the previous sections. Then, train a machine learning classifier on the preprocessed data using the given pipeline. Finally, evaluate the accuracy of the classifier on the preprocessed data and compare it with the accuracy of the same classifier trained on the raw data.

The following train function takes a pandas dataframe as input and trains a naive Bayes classifier on it. **You should keep the column names for notes ("transcription") and classes ("medical_specialty").**

In [82]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score


### do not change! ###
def train(df):
    X_train, X_test, y_train, y_test = train_test_split(df['stopwords_removed'], df['medical_specialty'], test_size=0.2, random_state=42)

    # Define the text classification pipeline
    text_clf = Pipeline([
        ('vect', CountVectorizer(lowercase=False)),
        ('clf', MultinomialNB())
    ])

    # Train the pipeline on the preprocessed data
    text_clf.fit(X_train, y_train)

    predicted = text_clf.predict(X_test)

    # Evaluate the accuracy of the model
    accuracy = (predicted == y_test).mean()
    f1 = f1_score(y_test, predicted, average='micro')
    print(f"Accuracy: {accuracy:.2f}")

##############################
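
To compare several preprocessing variants, it helps to have the scores returned rather than only printed. The following is a sketch under assumptions, not a replacement for the function above: `evaluate_texts` is a hypothetical helper that mirrors the same pipeline and split settings and returns accuracy together with micro- and macro-averaged F1. For single-label classification the micro-averaged F1 equals the accuracy, so the macro average is the score that adds information. The toy notes and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def evaluate_texts(texts, labels, test_size=0.2, random_state=42):
    """Train the baseline pipeline on texts and return its test scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state
    )
    clf = Pipeline([
        ("vect", CountVectorizer(lowercase=False)),
        ("clf", MultinomialNB()),
    ])
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "f1_micro": f1_score(y_test, predicted, average="micro"),
        "f1_macro": f1_score(y_test, predicted, average="macro"),
    }

# Made-up toy data, five notes per class
toy_texts = [
    "allergy nasal spray rhinitis", "allergic rhinitis nasal drainage",
    "nasal mucosa allergy swollen", "allergy medication nasal spray",
    "rhinitis allergy throat nasal", "chest pain heart pressure",
    "heart rhythm chest pressure", "blood pressure heart murmur",
    "chest pain heart rate", "heart valve chest pain",
]
toy_labels = ["Allergy"] * 5 + ["Cardiology"] * 5

scores = evaluate_texts(toy_texts, toy_labels)
print(scores)
```

With the dataframe from this exercise, calling `evaluate_texts(df['transcription'].astype(str), df['medical_specialty'])` and then making the same call on the preprocessed text column gives two score dictionaries to compare.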


In [84]:
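# Join each token list back into one string. Run this only once per session:
# on an already-joined string, list(x) splits the text into single characters.
df['stopwords_removed'] = df['stopwords_removed'].apply(lambda x: ' '.join(list(x)))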
df['medical_specialty'] = df['medical_specialty'].astype(str)

train(df)

ValueError: empty vocabulary; perhaps the documents only contain stop words
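
One possible cause of this error, assuming the joining cell had already been run earlier in the same session (the source does not confirm this): on a string, `list(x)` yields single characters, and CountVectorizer's default token pattern only keeps tokens of two or more word characters, so nothing is left to build a vocabulary from. The sketch below reproduces the effect with the standard library and shows a hypothetical helper, `to_text`, that is safe to apply repeatedly.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def to_text(tokens):
    """Join a token list into one string; leave an existing string untouched."""
    return tokens if isinstance(tokens, str) else " ".join(tokens)

joined_once = " ".join(["year", "old", "female"])
joined_twice = " ".join(list(joined_once))  # what a second run of the cell does

print(re.findall(TOKEN_PATTERN, joined_once))   # all three tokens survive
print(re.findall(TOKEN_PATTERN, joined_twice))  # no token has two characters
print(to_text(to_text(["year", "old", "female"])))
```

Replacing the lambda with `df['stopwords_removed'].apply(to_text)` would make the cell safe to re-run.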

In [74]:
X_train, X_test, y_train, y_test = train_test_split(df['stopwords_removed'], df['medical_specialty'], test_size=0.2, random_state=42)
X_train_org, X_test_org, y_train_org, y_test_org = train_test_split(df['transcription'], df['medical_specialty'], test_size=0.2, random_state=42)


In [78]:
print('stopwords_removed:', X_train[0], '\noriginal:', X_train_org[0])

stopwords_removed: subjective year old white female present complaint allergy used allergy lived seattle think worse past ha tried claritin zyrtec worked short time seemed lose effectiveness ha used allegra also used last summer began using two week ago doe appear working well ha used counter spray prescription nasal spray doe asthma doest require daily medication doe think flaring medication medication currently ortho tri cyclen allegra allergy ha known medicine allergy objective vitals weight wa pound blood pressure heent throat wa mildly erythematous without exudate nasal mucosa wa erythematous swollen clear drainage wa seen tm clear neck supple without adenopathy lung clear assessment allergic rhinitis plan try zyrtec instead allegra another option use loratadine doe think ha prescription coverage might cheaper sample nasonex two spray nostril given three week prescription wa written well
original: SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies. 

To train and test the machine learning model on your preprocessed dataset, the dataframe should have the same column names as the original dataset.
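
A minimal sketch of one way to do this, assuming the preprocessed tokens live in a separate column: copy the dataframe and write the joined text back under the original column name. The helper name `with_preprocessed_text` and the toy rows are hypothetical.

```python
import pandas as pd

def with_preprocessed_text(df, source_col, text_col="transcription"):
    """Return a copy of df whose text_col holds the preprocessed text."""
    out = df.copy()
    # Join token lists into strings; strings that are already joined pass through
    out[text_col] = out[source_col].apply(
        lambda value: value if isinstance(value, str) else " ".join(value)
    )
    return out

# Made-up rows for illustration
toy = pd.DataFrame({
    "medical_specialty": ["Allergy", "Cardiology"],
    "transcription": ["Presents with allergies.", "Chest pain was reported."],
    "stopwords_removed": [["present", "allergy"], ["chest", "pain", "reported"]],
})

prepared = with_preprocessed_text(toy, "stopwords_removed")
print(prepared[["medical_specialty", "transcription"]])
```

The original dataframe is left untouched, so the raw and the preprocessed version can both be passed to a training function that reads the "transcription" and "medical_specialty" columns.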