# IS597MLC: Model Training Assignment 

### Student Name:   Shrey Shah
### Net ID:  sshah023

# Instructions

* This assignment consists of four exercises. You are required to write your code in a cell with a comment "Insert your code here" included. You may add more cells if needed. 

* Please do not change exercise numbers or instruction comments. Also, do not remove or modify if any cells include image of expected outputs.  

* Please be aware that there is no one absolute solution to answer a question, i.e., tasks can have multiple correct solution methods you can choose from. 

* Once you have completed all exercises, update the file name by adding your surname and given name at the end of file name (e.g., IS597MLC_Model_Training_Assignment_Kim_Jenna.ipynb).  

* Make sure that all the codes in your updated Jupyter Notebook run properly before you submit it. If a grader encounters an error while attempting to run your codes, points will be deducted even if the code looks correct. If you are sure your files are ready to go, include them into a folder with the same naming convention. Zip the folder into one file and upload it to the UIUC Canvas assignment section.     

### Your submitted zipped file should include the following items:  
**- Updated Jupyter Notebook with your codes included**  
**- dataset file provided by the instructor**   
**- output files**   

## Data set 

The goal of this assignment is to build machine learning models to predict whether or not a given article is a randomized controlled trial. Two different machine learning algorithms (Support Vector Machine and Random Forest) will be applied to build prediction models using a data set collected from MEDLINE Corpus. Then, these models will be tested using held-out data and the results will be evaluated.  
  
The dataset was created by querying the MEDLINE database and downloading the publication record in XML format from PubMed. PubMed (https://pubmed.ncbi.nlm.nih.gov/) is a free online database that supports the search and retrieval of biomedical and life science literature. It contains more than 33 million citations and abstracts of biomedical literature. Since its launch in 1996, PubMed has been maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM), which is located at the National Institutes of Health (NIH). MEDLINE is one of the NLM literature archives in which searching is facilitated by PubMed.  


To collect data for this study, Entrez, a PubMed API provided by NCBI, was used to automatically access the database with a code script. It provides 9 E-utilities (https://dataguide.nlm.nih.gov/eutilities/utilities.html#efetch) that can be used for searching a query in the database. The biopython (https://biopython.org/), a python library specially developed for biological computation, was used to communicate with the NCBI Entrez API for retrieving the query results. Queries were restricted to collect the articles written in English and published in the year of 2019. Records were downloaded in XML files, which were then parsed in python to extract the information needed for processing. The final dataset ("pubmed_rct.txt") is a txt file with 50,006 instances and 5 attributes: pmid, rct, year, title, abstract. The column named 'rct' contains the target class which includes either 1 (RCT) or 0 (Non-RCT).

# Exercise 1 (Regular) 

## Ex 1-1. Problem Formulation   

#### Formulate your research or business question(s) you can think of when you plan to use the dataset provided. You can provide 1 or multiple questions depending on your ideas.

#### Insert your answer here:   

1) Which are the top 10 topics of interest in the biomedical and life science literature?
2) What is the % of similiarity between 2 different articles on the same topic?

## Ex 1-2. Create a 'modules.py' file

* This file should include functions to load and pre-process data. You may reuse the functions you created for previous assignments.
* The functions in this file should display proper output to keep track of each step.
* The functions are required to include a comment (using # or """) at the top to briefly describes what your code does.  
* Import modules to check if all the functions work properly.   
* You may call one or more funtions to prove that they run without an error.

In [7]:
from modules import *

In [8]:
import time
from datetime import timedelta
import pandas as pd

In [9]:
# Importing necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [10]:
# Insert your code here

#############  Input file name  #############

input_filename = "pubmed_rct.txt"
pubmed_data = pd.read_csv(input_filename, sep="\t")
    
#############  Which column to choose?  #############

"""
Column options:
title text 
abstract text 
title + abstract text
    
"""

# Using the abstract column for the processing
column_name = "title"


In [5]:
# Insert your code here

#### Create a function named "load_data()" 
#### that includes all the code you write for Exercise 1.

# Insert your code here

def load_data(filename, colname):
    """
    Read in input file and load data

    filename: csv file
    colname: column name for texts
    return: X and y dataframe
    """

    ## 1. Read in data from input file
    df = pd.read_csv(filename, sep="\t", encoding='utf-8')
    
    print("************** Loading Data ************", "\n")

    # Check number of rows and columns
    print("No of Rows: {}".format(df.shape[0]))
    print("No of Columns: {}".format(df.shape[1]))

    ## 2. Select data needed for processing
    print(f"Selecting columns needed for processing: pmid, {column_name}, rct", "\n")
    df = df[['pmid', column_name, 'rct']]
    

    ## 3. Cleaning data
    # Trim unnecessary spaces for strings
    df[colname] = df[colname].apply(lambda x: str(x))

    # 3-1. Remove null values
    df=df.dropna()

    # Check number of rows and columns
    print("No of rows (After dropping null): {}".format(df.shape[0]))
    print("No of columns: {}".format(df.shape[1]))

    # 3-2. Remove duplicates and keep first occurrence
    df.drop_duplicates(subset=['pmid'], keep='first', inplace=True)

    # Check number of rows and columns
    print("No of rows (After removing duplicates): {}".format(df.shape[0]))

    # Check the first few instances
    print("\n<Data View: First Few Instances>\n")
    print(df.head(5))
    
    # 3-3. Check label class
    print('\nClass Counts(label, row): Total')
    print(df["rct"].value_counts())
    

    ## 4. Split into X and y (target)
    X, y = df.iloc[:, :-1], df.iloc[:, -1]

    return X, y

In [6]:
def split_data(X_data, y_data):

    print("\n************** Spliting Data **************\n")

    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42, stratify=y_data)
    X_val, X_test, y_val, y_test = train_test_split(X_test,y_test, test_size=0.5, random_state=42, stratify=y_test)

    ## Check the data view of each data set

    ## Data Shape
    print("Train Data: {}".format(X_train.shape))
    print("Val Data: {}".format(X_val.shape))
    print("Test Data: {}".format(X_test.shape))

    ## Label Distribution
    print('\nClass Counts(label, row): Train')
    print(y_train.value_counts())
    print('\nClass Counts(label, row): Validation')
    print(y_val.value_counts())
    print('\nClass Counts(label, row): Test')
    print(y_test.value_counts())

    ## Display the first 3 instances of X data
    print("\nData View: X Train")
    print(X_train.head(3))
    print("\nData View: X Val")
    print(X_val.head(3))
    print("\nData View: X Test")
    print(X_test.head(3))

    ## Reset index

    print("\n************** Resetting Index **************\n")

    # Train Data
    X_train=X_train.reset_index(drop=True)
    y_train=y_train.reset_index(drop=True)

    # Validation Data
    X_val=X_val.reset_index(drop=True)
    y_val=y_val.reset_index(drop=True)

    # Test Data
    X_test=X_test.reset_index(drop=True)
    y_test=y_test.reset_index(drop=True)

    ## Check data

    ## Data Shape
    print("Train Data: {}".format(X_train.shape))
    print("Validation Data: {}".format(X_val.shape))
    print("Test Data: {}".format(X_test.shape))

    ## Label Distribution
    print('\nClass Counts(label, row): Train\n')
    print(y_train.value_counts())
    print('\nClass Counts(label, row): Val\n')
    print(y_val.value_counts())
    print('\nClass Counts(label, row): Test\n')
    print(y_test.value_counts())

    ## Display the first 3 instances of X data
    print("\nData View: X Train")
    print(X_train.head(3))
    print("\nData View: X Val")
    print(X_val.head(3))
    print("\nData View: X Test")
    print(X_test.head(3))
    
    return (X_train, X_val, X_test, y_train, y_val, y_test)


In [7]:
def preprocess_data(X_data_raw):
    """
       Preprocess data with lowercase conversion, punctuation removal, tokenization, stemming

       X_data_raw: X data in dataframe
       return: transformed dataframe

    """

    print("\n************** Pre-processed Data **************\n")
    
    X_data=X_data_raw.iloc[:, -1].astype(str)
    print(f"\nTrain Data: {X_data.shape}")
    
    ## 1. convert all characters to lowercase
    X_data = X_data.map(lambda x: x.lower())

    ## 2. remove punctuation
    X_data = X_data.str.replace('[^\w\s]', '')

    ## 3. tokenize sentence
    X_data = X_data.apply(nltk.word_tokenize)

    ## 4. remove stopwords
    stopword_list = stopwords.words("english")
    X_data = X_data.apply(lambda x: [word for word in x if word not in stopword_list])

    ## 5. stemming
    stemmer = PorterStemmer()
    X_data = X_data.apply(lambda x: [stemmer.stem(y) for y in x])

    ## 6. removing unnecessary space
    X_data = X_data.apply(lambda x: " ".join(x))

    # Check data view
    print("\nData View: X Train\n")
    print(X_data.head(3))

    return X_data

In [11]:
X_data, y_data = load_data(input_filename, column_name)

************** Loading Data ************ 

No of Rows: 50006
No of Columns: 5
Selecting columns needed for processing: pmid, title, rct 

No of rows (After dropping null): 50006
No of columns: 3
No of rows (After removing duplicates): 50006

<Data View: First Few Instances>

       pmid                                              title  rct
0  24900659  Probing the binding site of abl tyrosine kinas...    0
1  29492752  Variability in Bariatric Surgical Care Among V...    0
2  30574804  Maternal antibiotic prophylaxis affects Bifido...    0
3  29679827  Environmental impact assessment of alfalfa (Me...    0
4  30117518  Sandwiched spherical tin dioxide/graphene with...    0

Class Counts(label, row): Total
rct
0    40007
1     9999
Name: count, dtype: int64


In [12]:
X_train, X_val, X_test, y_train, y_val, y_test = split_data(X_data, y_data)


************** Spliting Data **************

Train Data: (40004, 2)
Val Data: (5001, 2)
Test Data: (5001, 2)

Class Counts(label, row): Train
rct
0    32005
1     7999
Name: count, dtype: int64

Class Counts(label, row): Validation
rct
0    4001
1    1000
Name: count, dtype: int64

Class Counts(label, row): Test
rct
0    4001
1    1000
Name: count, dtype: int64

Data View: X Train
           pmid                                              title
23663  30472200  Prenatal propofol exposure downregulates NMDA ...
40088  27385766  Protein N-terminal acetylation is required for...
44930  24423084  Let's face it: facial emotion processing is im...

Data View: X Val
           pmid                                              title
36841  31503331  Acute hemolytic transfusion reaction associate...
21713  22902894  The role of ovarian hormones in sexual reward ...
6131   29944726  In vitro cytotoxicity of superheated steam hyd...

Data View: X Test
           pmid                           

In [13]:
preprocessed_X_train = preprocess_data(X_train)


************** Pre-processed Data **************


Train Data: (40004,)

Data View: X Train

0    prenat propofol exposur downregul nmda recepto...
1    protein n-termin acetyl requir embryogenesi ar...
2    let 's face : facial emot process impair bipol...
Name: title, dtype: object


# Exercise 2 (Regular)

## Ex 2-1. Model Fitting

* Create a function named "fit_model" that conducts model fitting on input data.  
* This function should include three parameters: X, y, modelname.
* This function should contain the options to choose any of the following ML algorithms:  
  - Decision Tree
  - Logistic Regression  
  - Support Vector Machines  
  - Random Forest    
* Note that the function is required to include a comment at the top to briefly describes what your code does.

In [14]:
# Insert your code here

def fit_model(X, y, modelname):
    """
    Fits a machine learning model to input data.

    Parameters:
    X: Input features.
    y: Target variable.
    modelname: Name of the machine learning algorithm to use.
               Choose from: Decision Tree, Logistic Regression, Support Vector Machines, Random Forest.
    """
    
    # Mapping modelname to corresponding machine learning algorithm
    models = {
        'Decision_tree': DecisionTreeClassifier(),
        'Logistic_regression': LogisticRegression(),
        'Support_vector_machine': SVC(),
        'Random_forest': RandomForestClassifier()
    }
    
    # Checking if the specified modelname is valid
    if modelname not in models:
        raise ValueError("Invalid modelname. Choose from: Decision_tree, Logistic_regression, Support_vector_machine, Random_forest.")
    
    # Fitting the selected model to the data
    
    print(f"\n************** Training Model: {modelname} **************\n")
    
    model = models[modelname]
    model.fit(X, y)
    
    return model

## Ex 2-2. Performance Evaluation

* Create a function named 'evaluate_model' that produces confusion matrix.  
* This function should require two parameters that take predicted labels and actual labels.  
* Note that the function is required to include a comment at the top to briefly describes what your code does.

In [15]:
# Insert your code here

def evaluate_model(y_pred, y_true):
    """
    Computes the confusion matrix for evaluating model performance.

    Parameters:
    y_pred Predicted labels.
    y_true: Actual labels.

    Returns:
    array: Confusion matrix.
    """
    
    print("\n************** Model Evaluation **************\n")
    print("\nConfusion Matrix:\n")
    
    # Computing the confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    return cm

# Exercise 3 (Regular)

## Ex 3-1. Create a Main Function

* Create a function named "main_function" that automatically conducts the entire process from loading and transforming data to model performance evaluation. 
* For data splitting, set the random_state parameter to a specific numeber (e.g., random_state=42) so that the grader can replicate the same data when running your code.  
* Before fitting a ML model, you are required to transform textual data into numerical representations.   
* You may use one of the vectorization schema shown in class, such as TF-IDF, Bag-of-Words, etc. or try out other strategies if you want to.   
* When running this function, output corresponding to each step should be displayed to keep track of the process.  
* You may refer a sample output file ("output_title_LR.txt") provided for Exercise 4.  
* Note that the function is required to include a comment at the top to briefly describes what your code does. 


In [17]:
# Insert your code here

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

def main_function(filename, colname):
    """
    Automates the process from loading and transforming data to model performance evaluation.

    Parameters:
    filename (str): Name of the CSV file containing data.
    colname (str): Column name for textual data.

    Returns:
    None
    """
    # Load data
    X, y = load_data(filename, colname)

    # Split data
    X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y)

    # Preprocess textual data
    X_train_processed = preprocess_data(X_train)
    X_val_processed = preprocess_data(X_val)
    X_test_processed = preprocess_data(X_test)

    # Transform textual data into numerical representations using the TF-IDF vectorizer
    # Fit the vectorizer on the X_train, X_val, and X_test dataset
    vectorizer = TfidfVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train_processed)
    X_val_vectorized = vectorizer.transform(X_val_processed)
    X_test_vectorized = vectorizer.transform(X_test_processed)

    # Fit model (example using Logistic regression)
    model = fit_model(X_train_vectorized, y_train, 'Logistic_regression')
    print("\nModel fitted successfully!\n")

    # Predictions
    print("\n************** Getting predictions **************\n")
    y_pred_test = model.predict(X_test_vectorized)

    # Evaluateing model performance
    print("\n************** Evaluating performance **************\n")
    print(evaluate_model(y_pred_test, y_test))
    
    print("\nClassification Report:\n")
    print(classification_report(y_pred_test, y_test))

## Ex 3-2. Run the Main Function

* Call the main function with the following requirements:  
  - Column name: title  
  - Model type: Logistic Regression  

In [18]:
######################################################
#############  1. Set Parameter Values  ##############
######################################################

    
#############  1-1. Input file name  #############

input_filename = "pubmed_rct.txt"
    
    
#############  1-2. Which column to choose?  #############

"""
Column options:
title text 
abstract text 
title + abstract text   
"""
     
column_name = "title"         


#############  1-3. Which ML model to use?  #############
    
"""
Model options:
    
Decision_tree
Logisitic_regression
Support_vector_machine
Random_forest
"""
    
model_type = "Logisitic_regression"                           

In [19]:
%%time

if __name__== "__main__":
         
    main_function(input_filename, column_name)

************** Loading Data ************ 

No of Rows: 50006
No of Columns: 5
Selecting columns needed for processing: pmid, title, rct 

No of rows (After dropping null): 50006
No of columns: 3
No of rows (After removing duplicates): 50006

<Data View: First Few Instances>

       pmid                                              title  rct
0  24900659  Probing the binding site of abl tyrosine kinas...    0
1  29492752  Variability in Bariatric Surgical Care Among V...    0
2  30574804  Maternal antibiotic prophylaxis affects Bifido...    0
3  29679827  Environmental impact assessment of alfalfa (Me...    0
4  30117518  Sandwiched spherical tin dioxide/graphene with...    0

Class Counts(label, row): Total
rct
0    40007
1     9999
Name: count, dtype: int64

************** Spliting Data **************

Train Data: (40004, 2)
Val Data: (5001, 2)
Test Data: (5001, 2)

Class Counts(label, row): Train
rct
0    32005
1     7999
Name: count, dtype: int64

Class Counts(label, row): Validatio

## Ex 3-3. Performance Evaluation Scores

* Interpret the results from the confusion matrix generated in Ex3-2. Provide the following 4 evaluation scores for our target class (both RCT and Non-RCT):  
  - Accuracy  
  - Precision  
  - Recall  
  - F1  
* Demonstrate how these scores for each class are calculated with mathmatical equation.   
* Use values in the confusion matrix such as true positive to explain how the evaluation scores are created.
* The numbers used in the equation should match the ones in the confusion matrix generated by the main function.

#### Provide evaluation scores for each class.

1. Class 0 (Non-RCT)
- Accuracy = 0.905
- Precision = 0.979
- Recall = 0.909
- F1 = 0.942

2. Class 1 (RCT)
- Accuracy = 0.905
- Precision = 0.607
- Recall = 0.879
- F1 = 0.718

#### Show how each score is calculated with numbers in confusion matrix.

1. Class 0 (Non-RCT)  
- Accuracy = (TP + TN)/(TP + TN + FP + FN) = (3918 + 607)/(3918 + 607 + 83 + 393) = 4311/5001 = 0.905
- Precision = (TP)/(TP + FP) = 3918/(3918 + 83) = 3918/4001 = 0.979
- Recall = (TP)/(TP + FN) = 3918/(3918 + 393) = 3918/4311 = 0.909
- F1 = 2 * Precision * Recall / (Precision + Recall) = 2 * 0.979 * 0.909/(0.979 + 0.909) = 1.779/1.888 = 0.942

2. Class 1 (RCT)
- Accuracy = (TP + TN)/(TP + TN + FP + FN) = (3918 + 607)/(3918 + 607 + 83 + 393) = 4311/5001 = 0.905
- Precision = (TP)/(TP + FP) = 607/(607 + 393) = 607/1000 = 0.607
- Recall = (TP)/(TP + FN) = 607/(607 + 83) = 607/690 = 0.879
- F1 = 2 * Precision * Recall / (Precision + Recall) = 2 * 0.607 * 0.879/(0.607 + 0.879) = 1.067/1.486 = 0.718

# Exercise 4 (Challenge)

* Create a file named "modules_updated.py" that includes all the functions you created for previous exercises.  
* Revise the main_function() so that it also writes the output of each step into a separate file named like output_column_model.txt.  
* The output in the file should look somthing like the one in the sample file provided ("output_title_LR.txt").   
* After running the code, the same output is expected to be shown in both the Jupyter Notebook and an output file.

In [20]:
from modules_updated import *

In [21]:
def main_function(filename, colname):
    """
    Automates the process from loading and transforming data to model performance evaluation.

    Parameters:
    filename (str): Name of the CSV file containing data.
    colname (str): Column name for textual data.

    Returns:
    None
    """
    # Define model name
    model_type = "Logistic_regression"

    # Define log file name
    log_file = f"output_{colname}_{model_type}.txt"

    # Open log file for writing
    with open(log_file, 'w') as f:
        # Load data
        X, y = load_data_new(filename, colname, f)
        print("\n", file=f)

        # Split data
        X_train, X_val, X_test, y_train, y_val, y_test = split_data_new(X, y, f)
        print("\n", file=f)

        # Preprocess textual data
        X_train_processed = preprocess_data_new(X_train, f)
        X_val_processed = preprocess_data_new(X_val, f)
        X_test_processed = preprocess_data_new(X_test, f)

        # Transform textual data into numerical representations using the TF-IDF vectorizer
        # Fit the vectorizer on the X_train, X_val, and X_test dataset
        vectorizer = TfidfVectorizer()
        X_train_vectorized = vectorizer.fit_transform(X_train_processed)
        X_val_vectorized = vectorizer.transform(X_val_processed)
        X_test_vectorized = vectorizer.transform(X_test_processed)

        # Fit model (example using Logistic regression)
        model = fit_model_new(X_train_vectorized, y_train, model_type, f)
        print("\nModel fitted successfully!\n", file=f)
        
        # Predictions
        print("\n************** Getting predictions **************\n", file=f)
        y_pred_test = model.predict(X_test_vectorized)

        # Evaluate model performance
        print("\n************** Evaluating performance **************\n", file=f)
        evaluate_model_new(y_pred_test, y_test, f)
        print("\nClassification Report:\n", file=f)
        print(classification_report(y_pred_test, y_test), file=f)

## Run main function

In [22]:
######################################################
#############  1. Set Parameter Values  ##############
######################################################

    
#############  1-1. Input file name  #############

input_filename = "pubmed_rct.txt"
    
    
#############  1-2. Which column to choose?  #############

"""
Column options:
title text 
abstract text 
title + abstract text   
"""
     
column_name = "title"         


#############  1-3. Which ML model to use?  #############
    
"""
Model options:
    
Decision Tree
Logisitic regression
Support Vector Machines
Random Forest
"""
    
model_type = "Logisitic_regression"                                            
                                                   

In [23]:
%%time

if __name__== "__main__":

            
    main_function(input_filename, column_name)
    
        
    print("\n************** Processing Completed **************\n")
    


************** Processing Completed **************

CPU times: user 19.1 s, sys: 85.6 ms, total: 19.2 s
Wall time: 20.2 s
