# IS597MLC: Data Pre-processing Coding Assignment

### Student Name:   Shrey Shah
### Net ID:  sshah023

# Instruction

#### * This assignment consists of 4 exercises. Each exercise contains several tasks shown in bold. You are required to write your code in a cell with a comment "Insert your code here" included. You may add more cells if needed. The expected output is displayed for each exercise so that you can compare it with the one from your own code.   

#### * Note: Do not change exercise numbers or instruction comments. Also, do not remove or modify the cells that include image of expected outputs. To see all the images embeded in your notebook, you should create a folder named 'Image' in the same directory where your notebook and dataset files are located and upload the files in the Image folder provided by the instructor.  

#### * Remember that most of the tasks for each exercise can be solved in one or a few lines of code. You do not need to write a complex code to get expected results. Also, be aware that there is no one absolute solution to answer a question, i.e., tasks can have multiple correct solution methods you can choose from. You may want to refer to the codes used for class activity demonstration.

#### * Once you have completed all exercises, change the file name by adding your surname and given name at the end of file name (e.g., IS597MLC_Preprocessing_Data_Assignment_Kim_Jenna.ipynb).  

#### * You are required to use a AWS Academy Learner Lab for this assignment. Please see detailed instruction for providng an evidence document at the bottom of this notebook. 

#### * Make absolutely sure that all the codes in the updated Jupyter Notebook run properly before submission. If a grader encounters an error while attempting to run your codes, points will be deducted even if they look correct. If you are sure your files are ready to go, upload a zipped file to UIUC Canvas assignment section. Your submitted zipped file should include the following items:  
- Updated Jupyter Notebook with your codes included 
- dataset  
- Image folder containing screenshots provided by instructors 
- a screenshot of your notebook showing the AWS SageMaker URL at the top

# Set up

### Libraries

In [218]:
#### Check pre-installed Python packages
#!pip list

In [219]:
import time
from datetime import timedelta
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ec2-user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [220]:
#### Check memory usage
#!free -h

#### clear occupied memory
#import gc
#gc.collect()

#### Check memory usage again
#!free -h

### Data set

#### Use the data set provided by the instructor: "pubmed_data.txt"

# Exercise 1 (Regular)

## Ex 1-1. Load data 

In [221]:
#### Load data from dataset

# Insert your code here

in_filename = "pubmed_data.txt"
pubmed_data = pd.read_csv(in_filename, sep="\t")

In [222]:
##### Check the info of the dataframe

# Insert your code here

print("\nSummary of data\n")
pubmed_data.info()


Summary of data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   pmid           25000 non-null  int64  
 1   pubdate        24351 non-null  float64
 2   language       25000 non-null  object 
 3   title          24918 non-null  object 
 4   abstract       22618 non-null  object 
 5   otherabstract  4 non-null      object 
 6   pubtype        25000 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 1.3+ MB


### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex1-1.png" width=400, height=300, alt="Alternative text"/>

In [223]:
##### Check number of rows and columns

# Insert your code here
print("No of rows: ", pubmed_data.shape[0])
print("No of columns: ", pubmed_data.shape[1])

print("\n")
##### Check the last 3 instances (rows)

# Insert your code here
print("Data View: Last 3 Instances")
pubmed_data.tail(3)

No of rows:  25000
No of columns:  7


Data View: Last 3 Instances


Unnamed: 0,pmid,pubdate,language,title,abstract,otherabstract,pubtype
24997,31152989,2019.0,eng,Synthesis and antiproliferative activity of cu...,Culicinin D is a 10 amino acid peptaibol conta...,,Other
24998,31152988,2019.0,eng,Bioaccumulation of polycyclic aromatic hydroca...,Little data are available on polycyclic aromat...,,Other
24999,31152973,2019.0,eng,,A total of 3 men and 7 women (age range 22 to ...,,Other


### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex1-2.png" width=900, height=800, alt="Alternative text"/>

## Ex 1-2. Cleaning data

In [224]:
#### Select data needed for processing
#### Select the following columns: pmid, title, abstract, and pubtype

# Insert your code here
pubmed_data_filtered = pubmed_data[['pmid', 'title', 'abstract', 'pubtype']]
pubmed_data_filtered

Unnamed: 0,pmid,title,abstract,pubtype
0,29650023,Sustained impact of energy-dense TV and online...,Policies restricting children's exposure to un...,RCT
1,29649996,The effects of synbiotic supplementation on ho...,"To our knowledge, no reports are available ind...",RCT
2,29649669,360° virtual reality video for the acquisition...,360° virtual reality (VR) video is an exciting...,RCT
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...,Individuals receiving radiation for head and n...,RCT
4,29648860,Motivation and readiness for tobacco cessation...,Despite considerable health risks due to lower...,RCT
...,...,...,...,...
24995,31153048,Decreasing waiting time for treatment before a...,"In 2015, Norway implemented cancer patient pat...",Other
24996,31153036,Hypomethylating agents in the treatment of acu...,"The hypomethylating agents (HMAs), decitabine ...",Other
24997,31152989,Synthesis and antiproliferative activity of cu...,Culicinin D is a 10 amino acid peptaibol conta...,Other
24998,31152988,Bioaccumulation of polycyclic aromatic hydroca...,Little data are available on polycyclic aromat...,Other


In [225]:
#### Make sure that text data in selected columns are strings
##### Trim unnecessary spaces for strings in 'title' & 'abstract' columns

# Insert your code here
pubmed_data_filtered.loc[:, 'title'] = pubmed_data_filtered.loc[:, 'title'].astype(str).str.strip()
pubmed_data_filtered.loc[:, 'abstract'] = pubmed_data_filtered.loc[:, 'abstract'].astype(str).str.strip()

In [226]:
pubmed_data_filtered.head()

Unnamed: 0,pmid,title,abstract,pubtype
0,29650023,Sustained impact of energy-dense TV and online...,Policies restricting children's exposure to un...,RCT
1,29649996,The effects of synbiotic supplementation on ho...,"To our knowledge, no reports are available ind...",RCT
2,29649669,360° virtual reality video for the acquisition...,360° virtual reality (VR) video is an exciting...,RCT
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...,Individuals receiving radiation for head and n...,RCT
4,29648860,Motivation and readiness for tobacco cessation...,Despite considerable health risks due to lower...,RCT


In [227]:
pubmed_data_filtered['pubtype'].unique()

array(['RCT', 'Other'], dtype=object)

In [228]:
#### Convert label text to numeric values

# Insert your code here
pubmed_data_filtered.loc[:, 'pubtype'] = pubmed_data_filtered.loc[:, 'pubtype'].apply(lambda x: 1 if x == 'RCT' else 0)

pubmed_data_filtered.head()

Unnamed: 0,pmid,title,abstract,pubtype
0,29650023,Sustained impact of energy-dense TV and online...,Policies restricting children's exposure to un...,1
1,29649996,The effects of synbiotic supplementation on ho...,"To our knowledge, no reports are available ind...",1
2,29649669,360° virtual reality video for the acquisition...,360° virtual reality (VR) video is an exciting...,1
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...,Individuals receiving radiation for head and n...,1
4,29648860,Motivation and readiness for tobacco cessation...,Despite considerable health risks due to lower...,1


In [229]:
#### Check label distribution of target class

# Insert your code here
print("Class Counts(label, row):", "\n")
print(pubmed_data_filtered['pubtype'].value_counts(), "\n")

#### Check the first 5 instances

# Insert your code here
print("Data View: First 5 Instances")
pubmed_data_filtered.head(5)

Class Counts(label, row): 

pubtype
0    14388
1    10612
Name: count, dtype: int64 

Data View: First 5 Instances


Unnamed: 0,pmid,title,abstract,pubtype
0,29650023,Sustained impact of energy-dense TV and online...,Policies restricting children's exposure to un...,1
1,29649996,The effects of synbiotic supplementation on ho...,"To our knowledge, no reports are available ind...",1
2,29649669,360° virtual reality video for the acquisition...,360° virtual reality (VR) video is an exciting...,1
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...,Individuals receiving radiation for head and n...,1
4,29648860,Motivation and readiness for tobacco cessation...,Despite considerable health risks due to lower...,1


### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex1-3.png" width=700, height=600, alt="Alternative text"/>

## Ex 1-3. Remove duplicates if any

In [230]:
#### Remove duplicates using 'pmid' column and keep first occurrence

# Insert your code here
pubmed_data_filtered.drop_duplicates(subset=['pmid'], keep='first', inplace=True)
pubmed_data_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pubmed_data_filtered.drop_duplicates(subset=['pmid'], keep='first', inplace=True)


Unnamed: 0,pmid,title,abstract,pubtype
0,29650023,Sustained impact of energy-dense TV and online...,Policies restricting children's exposure to un...,1
1,29649996,The effects of synbiotic supplementation on ho...,"To our knowledge, no reports are available ind...",1
2,29649669,360° virtual reality video for the acquisition...,360° virtual reality (VR) video is an exciting...,1
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...,Individuals receiving radiation for head and n...,1
4,29648860,Motivation and readiness for tobacco cessation...,Despite considerable health risks due to lower...,1


In [231]:
#### Check number of rows and columns

# Insert your code here
print("No of rows (After removing duplicates): {}".format(pubmed_data_filtered.shape[0]))
print("No of columns: {}".format(pubmed_data_filtered.shape[1]))

No of rows (After removing duplicates): 25000
No of columns: 4


### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex1-5.png" width=400, height=300, alt="Alternative text"/>

## Ex 1-4. Select text column to be used for processing

In [232]:
#### Create a function named "select_text"
#### that provides the following 3 options 
#### for selecting text based on column name:
#### 'title', 'abstract', and 'mix' (title + abstract)
#### This function must have two parameters named 'df' & 'colname'

# Insert your code here
import pandas as pd

def select_text(df, colname):
    if colname == 'title':
        return pd.DataFrame({'pmid': df['pmid'], 'title': df['title'], 'pubtype': df['pubtype']})
    elif colname == 'abstract':
        return pd.DataFrame({'pmid': df['pmid'], 'abstract': df['abstract'], 'pubtype': df['pubtype']})
    elif colname == 'mix':
        return pd.DataFrame({'pmid': df['pmid'], 'mix': df['title'] + ' ' + df['abstract'], 'pubtype': df['pubtype']})
    else:
        raise ValueError("Invalid column name. Choose from 'title', 'abstract', or 'mix'.")

In [233]:
#### Call the function to extract text from 'title' column

column_name = "title"
df_selected = select_text(pubmed_data_filtered, colname=column_name)

df_selected

Unnamed: 0,pmid,title,pubtype
0,29650023,Sustained impact of energy-dense TV and online...,1
1,29649996,The effects of synbiotic supplementation on ho...,1
2,29649669,360° virtual reality video for the acquisition...,1
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...,1
4,29648860,Motivation and readiness for tobacco cessation...,1
...,...,...,...
24995,31153048,Decreasing waiting time for treatment before a...,0
24996,31153036,Hypomethylating agents in the treatment of acu...,0
24997,31152989,Synthesis and antiproliferative activity of cu...,0
24998,31152988,Bioaccumulation of polycyclic aromatic hydroca...,0


### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex1-6.png" width=450, height=350, alt="Alternative text"/>

## Ex 1-5. Split data

In [234]:
#### Split data from 'df_selected' into X_data and y_data
#### y_data must contain data from label column only
#### X_data includes data from all the other columns

# Insert your code here
X_data = df_selected.drop(columns='pubtype')
y_data = df_selected['pubtype']

In [235]:
#### Check the first 5 instance of X_data

# Insert your code here
X_data.head(5)

Unnamed: 0,pmid,title
0,29650023,Sustained impact of energy-dense TV and online...
1,29649996,The effects of synbiotic supplementation on ho...
2,29649669,360° virtual reality video for the acquisition...
3,29649082,Transcutaneous Electrical Nerve Stimulation Re...
4,29648860,Motivation and readiness for tobacco cessation...


### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex1-7.png" width=400, height=300, alt="Alternative text"/>

# Exercise 2 (Regular)


### Let's split data into three subsets.

## Ex 2-1. Train and test split

In [236]:
#### Split data into the subsets of train, validation, and test
#### with a Ratio of 8:1:1.
#### Make sure to stratify data in each set using the parameter named 'stratify'
#### which aligns with the distribution of label class.
#### Set the parameter values of random_state to 5 for reproducibility.

# Insert your code here
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=5, stratify=y_data)
X_val, X_test, y_val, y_test = train_test_split(X_test,y_test, test_size=0.5, random_state=5, stratify=y_test)

In [237]:
#### Check the data view of each data set

# Insert your code here


print("\n************** Data After Splitting **************\n")
print("Train Data: {}".format(X_train.shape))
print("Val Data: {}".format(X_val.shape))
print("Test Data: {}".format(X_test.shape))
print("\n")


print("\n\n************** Class Label Distribution **************")
print('\nClass Counts(label, row): Train')
print(y_train.value_counts())
print('\nClass Counts(label, row): Validation')
print(y_val.value_counts())
print('\nClass Counts(label, row): Test')
print(y_test.value_counts())
print("\n")



************** Data After Splitting **************

Train Data: (20000, 2)
Val Data: (2500, 2)
Test Data: (2500, 2)




************** Class Label Distribution **************

Class Counts(label, row): Train
pubtype
0    11510
1     8490
Name: count, dtype: int64

Class Counts(label, row): Validation
pubtype
0    1439
1    1061
Name: count, dtype: int64

Class Counts(label, row): Test
pubtype
0    1439
1    1061
Name: count, dtype: int64




### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex2-1.png" width=400, height=300, alt="Alternative text"/>

In [238]:
## Display the first 5 instances of X data

# Insert your code here

print("\n\n************** First 5 Instances of Data **************")

print("\nFirst 5 Instance: Train\n")
print(X_train.head())
print("\nFirst 5 Instance: Validation\n")
print(X_val.head())
print("\nFirst 5 Instance: Test\n")
print(X_test.head())



************** First 5 Instances of Data **************

First 5 Instance: Train

           pmid                                              title
7627   26566950  A pilot randomized controlled trial of telepho...
8587   26209090  Vitamin D and prostate cancer prognosis: a Men...
10087  25675661  Dissection with harmonic scalpel versus cold i...
23673  31163254    MicroRNA heterogeneity in melanoma progression.
8769   31347292  A randomized phase II trial of nab-paclitaxel ...

First 5 Instance: Validation

           pmid                                              title
17012  31202269  Quercetin: a natural compound for ovarian canc...
24250  31159469  Partial Surface Modification of Low Generation...
18205  31193061  Antitumor effects of flavopiridol, a cyclin-de...
8318   26311526  PATHOS: a phase II/III trial of risk-stratifie...
9220   25986854  Impact of axillary dissection in women with in...

First 5 Instance: Test

           pmid                                         

### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex2-2.png" width=450, height=350, alt="Alternative text"/>

## Ex 2-2. Reset index of instances in each subset

In [239]:
#### Reset index of X, y data in Train, Validation, test sets

# Insert your code here

# Train Data
X_train=X_train.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)


# Validation Data
X_val=X_val.reset_index(drop=True)
y_val=y_val.reset_index(drop=True)

# Test Data
X_test=X_test.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)

In [240]:
#### Display the first 5 instances of X data

# Insert your code here

print("\n************** Data After Index Reset **************\n")
print("\n************** First 5 Instances of Data **************")

print("\nFirst 5 Instance: Train")
print(X_train.head())
print("\nFirst 5 Instance: Validation")
print(X_val.head())
print("\nFirst 5 Instance: Test")
print(X_test.head())


************** Data After Index Reset **************


************** First 5 Instances of Data **************

First 5 Instance: Train
       pmid                                              title
0  26566950  A pilot randomized controlled trial of telepho...
1  26209090  Vitamin D and prostate cancer prognosis: a Men...
2  25675661  Dissection with harmonic scalpel versus cold i...
3  31163254    MicroRNA heterogeneity in melanoma progression.
4  31347292  A randomized phase II trial of nab-paclitaxel ...

First 5 Instance: Validation
       pmid                                              title
0  31202269  Quercetin: a natural compound for ovarian canc...
1  31159469  Partial Surface Modification of Low Generation...
2  31193061  Antitumor effects of flavopiridol, a cyclin-de...
3  26311526  PATHOS: a phase II/III trial of risk-stratifie...
4  25986854  Impact of axillary dissection in women with in...

First 5 Instance: Test
       pmid                                          

### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex2-3.png" width=450, height=350, alt="Alternative text"/>

# Exercise 3 (Regular)

### Let's pre-process data using X_train data


In [241]:
#### Make sure that text data in selected column are string
#### Use X_train data for processing

# Insert your code here

X_data=X_train.iloc[:, -1].astype(str)

In [242]:
#### 1. Convert all characters to lowercase

# Insert your code here

X_data = X_data.map(lambda x: x.lower())

In [243]:
#### 2. remove punctuation

# Insert your code here

X_data = X_data.str.replace('[^\w\s]', '', regex=True)

In [244]:
#### 3. tokenize sentence

# Insert your code here

X_data = X_data.apply(nltk.word_tokenize)

In [245]:
#### 4. remove stopwords

# Insert your code here

stopword_list = stopwords.words("english")
X_data = X_data.apply(lambda x: [word for word in x if word not in stopword_list])

In [246]:
#### 5. stemming

# Insert your code here

stemmer = PorterStemmer()
X_data = X_data.apply(lambda x: [stemmer.stem(y) for y in x])

In [247]:
#### 6. removing unnecessary space

# Insert your code here

X_data = X_data.apply(lambda x: " ".join(x))

In [248]:
#### 7. Check data view

# Insert your code here

print("\nFirst 5 Instances After Pre-processing:\n")

X_data.head(5)


First 5 Instances After Pre-processing:



0    pilot random control trial telephon intervent ...
1    vitamin prostat cancer prognosi mendelian rand...
2    dissect harmon scalpel versu cold instrument p...
3                 microrna heterogen melanoma progress
4    random phase ii trial nabpaclitaxel gemcitabin...
Name: title, dtype: object

### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex3-1.png" width=450, height=350, alt="Alternative text"/>

# Exercise 4 (Challenge)

## Ex 4-1. Create funtions

### Create 3 functions that align with each regular exercise with the following names:  
####     - load_data(filename, colname)   
####     - split_data(X_data, y_data)    
####     - preprocess_data(X_data)

In [249]:
#### Create a function named "load_data()" 
#### that includes all the code you write for Exercise 1.

# Insert your code here

def load_data(filename, colname):
    """
    Read in input file and load data

    filename: csv file
    colname: column name for texts
    return: X and y dataframe
    """

    ## 1. Read in data from input file
    df = pd.read_csv(filename, sep="\t", encoding='utf-8')

    # Check number of rows and columns
    print("No of Rows: {}".format(df.shape[0]))
    print("No of Columns: {}".format(df.shape[1]))

    # Check the first few instances
    print("\nData View: First Few Instances\n")
    print(df.head(10))


    ## 2. Select data needed for processing & convert labels
    df = df[['pmid', colname, 'pubtype']]
    df["pubtype"] = df["pubtype"].apply(lambda x: 1 if x == 'RCT' else 0)

    # Check label class
    print('\nClass Counts(label, row): Total')
    print(df["pubtype"].value_counts())
    

    ## 3. Cleaning data
    # Trim unnecessary spaces for strings
    df[colname] = df[colname].apply(lambda x: str(x))

    # Check the first few instances
    print("\nData View: First Few Instances\n")
    print(df.head(10))

    # 3-1. Remove null values
    df=df.dropna()

    # Check number of rows and columns
    print("No of rows (After dropping null values): {}".format(df.shape[0]))
    print("No of columns: {}".format(df.shape[1]))

    # 3-2. Remove duplicates and keep first occurrence
    df.drop_duplicates(subset=['pmid'], keep='first', inplace=True)

    # Check number of rows and columns
    print("No of rows (After removing duplicates): {}".format(df.shape[0]))
    print("No of columns: {}".format(df.shape[1]))

    # Check the first few instances
    print("\nData View: First Few Instances\n")
    print(df.head(10))
    

    ## 4. Split into X and y (target)
    X, y = df.iloc[:, :-1], df.iloc[:, -1]

    return X, y

In [250]:
#### Create a function named "split_data()" 
#### that contains all the code you write for Exercise 2.

# Insert your code here

def split_data(X_data, y_data):

    print("\n************** Spliting Data **************\n")

    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=5, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_test,y_test, test_size=0.5, random_state=5, stratify=y_test)

    ## Check the data view of each data set

    print("\n************** Data After Splitting **************\n")

    ## Data Shape
    print("Train Data: {}".format(X_train.shape))
    print("Val Data: {}".format(X_val.shape))
    print("Test Data: {}".format(X_test.shape))

    ## Label Distribution
    print('\nClass Counts(label, row): Train')
    print(y_train.value_counts())
    print('\nClass Counts(label, row): Validation')
    print(y_val.value_counts())
    print('\nClass Counts(label, row): Test')
    print(y_test.value_counts())

    ## Display the first 5 instances of X data
    print("\nFirst 5 Instance: Train")
    print(X_train.head())
    print("\nFirst 5 Instance: Validation")
    print(X_val.head())
    print("\nFirst 5 Instance: Test")
    print(X_test.head())

    ## Reset index

    print("\n************** Resetting Index **************\n")

    # Train Data
    X_train=X_train.reset_index(drop=True)
    y_train=y_train.reset_index(drop=True)


    # Validation Data
    X_val=X_val.reset_index(drop=True)
    y_val=y_val.reset_index(drop=True)

    # Test Data
    X_test=X_test.reset_index(drop=True)
    y_test=y_test.reset_index(drop=True)

    ## Check data

    print("\n************** Data After Resetting **************\n")

    ## Data Shape
    print("\nTrain Data: {}".format(X_train.shape))
    print("\nValidation Data: {}".format(X_val.shape))
    print("Test Data: {}".format(X_test.shape))

    ## Label Distribution
    print('\nClass Counts(label, row): Train\n')
    print(y_train.value_counts())
    print('\nClass Counts(label, row): Validation\n')
    print(y_val.value_counts())
    print('\nClass Counts(label, row): Test\n')
    print(y_test.value_counts())

    ## Display the first 5 instances of X data
    print("\nFirst 5 Instance: Train\n")
    print(X_train.head())
    print("\nFirst 5 Instance: Validation\n")
    print(X_val.head())
    print("\nFirst 5 Instance: Test\n")
    print(X_test.head())

    return (X_train, X_val, X_test, y_train, y_val, y_test)


In [251]:
#### Create a function named "preprocess_data()" 
#### that contains all the code you write for of Exercise 3.

# Insert your code here

def preprocess_data(X_data_raw):
    """
       Preprocess data with lowercase conversion, punctuation removal, tokenization, stemming

       X_data_raw: X data in dataframe
       return: transformed dataframe

    """

    X_data=X_data_raw.iloc[:, -1].astype(str)

    ## 1. convert all characters to lowercase

    X_data = X_data.map(lambda x: x.lower())


    ## 2. remove punctuation

    X_data = X_data.str.replace('[^\w\s]', '')


    ## 3. tokenize sentence

    X_data = X_data.apply(nltk.word_tokenize)


    ## 4. remove stopwords

    stopword_list = stopwords.words("english")
    X_data = X_data.apply(lambda x: [word for word in x if word not in stopword_list])


    ## 5. stemming
    stemmer = PorterStemmer()
    X_data = X_data.apply(lambda x: [stemmer.stem(y) for y in x])


    ## 6. removing unnecessary space
    X_data = X_data.apply(lambda x: " ".join(x))


    # Check data view
    print("\nData View After Pre-processing:\n")
    print(X_data.head())


    return X_data

## Ex 4-2. Call each funtion in order

In [252]:
#### Call three functions using the following filename and column name

in_filename = "pubmed_data.txt"
colname = "title"

X, y = load_data(in_filename, colname)
X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y)
X_train_processed = preprocess_data(X_train)

## Display the first 5 instances of preprocessed train data
## The output should look as the same as the one for Exercise 3.

# Insert your code here

print("\nFirst 5 Instance After Pre-processing:\n")
X_train_processed.head(5)

No of Rows: 25000
No of Columns: 7

Data View: First Few Instances

       pmid  pubdate language  \
0  29650023   2018.0      eng   
1  29649996   2018.0      eng   
2  29649669   2018.0      eng   
3  29649082      NaN      eng   
4  29648860   2018.0      eng   
5  29648580   2018.0      eng   
6  29645086   2018.0      eng   
7  29644504   2018.0      eng   
8  29644408   2018.0      eng   
9  31669564   2020.0      eng   

                                               title  \
0  Sustained impact of energy-dense TV and online...   
1  The effects of synbiotic supplementation on ho...   
2  360° virtual reality video for the acquisition...   
3  Transcutaneous Electrical Nerve Stimulation Re...   
4  Motivation and readiness for tobacco cessation...   
5  Bevacizumab plus hypofractionated radiotherapy...   
6  Erlotinib plus either pazopanib or placebo in ...   
7  Impact of smoking history on the outcomes of w...   
8  MODUL-a multicenter randomized clinical trial ...   
9  Toxic

0    pilot random control trial telephon intervent ...
1    vitamin prostat cancer prognosi : mendelian ra...
2    dissect harmon scalpel versu cold instrument p...
3               microrna heterogen melanoma progress .
4    random phase ii trial nab-paclitaxel gemcitabi...
Name: title, dtype: object

In [253]:
print("\nFirst 5 Instance After Pre-processing:\n")
X_train_processed.head(5)


First 5 Instance After Pre-processing:



0    pilot random control trial telephon intervent ...
1    vitamin prostat cancer prognosi : mendelian ra...
2    dissect harmon scalpel versu cold instrument p...
3               microrna heterogen melanoma progress .
4    random phase ii trial nab-paclitaxel gemcitabi...
Name: title, dtype: object

### The output should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex4-1.png" width=450, height=350, alt="Alternative text"/>

# Provide an evidence that you have used AWS SageMaker for this assignment

###  Create a SageMaker Notebook Instance named " IS597MLC-SP2024-Data-Preprocessing-Assignment" using 'ml.t3.medium' instance type. Please provide a screen shot (png or jpeg file) of your notebook including AWS SageMaker Notebook instace URL at the top in your image. Include the image named "IS597MLC-YourSurname-YourFirstName" in your submitted zipped file.


### The image should look like this: 

<img src="Image/DataPreprocessing-Assignment-Ex5-1.png" width=500, height=400, alt="Alternative text"/>