## Data Set from Kaggle

Biomedical texts hold a wealth of information that is crucial for advances in the medical domain.

Automating event extraction from these texts can accelerate progress and innovation in the medical field. The data used in this notebook comes from the following Kaggle dataset:

https://www.kaggle.com/datasets/nishanthsalian/genia-biomedical-event-dataset?select=GE11-LICENSE

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv('train_data.csv')

In [3]:
test = pd.read_csv('test_data.csv')

In [4]:
train.shape

(8666, 4)

In [5]:
test.shape

(3360, 4)

In [6]:
train.head()

Unnamed: 0,Sentence,TriggerWord,TriggerWordLoc,EventType
0,Down-regulation of interferon regulatory fact...,Down-regulation;expression;,1;8;,Negative_regulation;Gene_expression;
1,Although the bcr - abl translocation has been...,deregulation;,30;,Regulation;
2,Promoter methylation of CpG target sites or d...,,,
3,"Therefore , we investigated whether IRF-4 pro...",regulation;expression;,16;19;,Regulation;Gene_expression;
4,Whereas promoter mutations or structural rear...,altered;expression;influence;transcription;,14;16;30;32;,Regulation;Gene_expression;Regulation;Transcri...


In [7]:
test.head()

Unnamed: 0,Sentence,TriggerWord,TriggerWordLoc,EventType
0,Resistance to IL-10 inhibition of interferon ...,,,
1,IL-10 has been shown to block the antigen-spe...,,,
2,We found that peripheral blood CD4 + T cells ...,,,
3,The phosphorylation of signal transducer and ...,,,
4,Sera from RA patients induced signal transduc...,,,


In [8]:
## Concatenate the two data frames vertically
data = pd.concat([train,test],axis=0)

In [9]:
data.shape

(12026, 4)
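
A side effect of stacking frames with `pd.concat` is that each frame keeps its own row labels, so the combined index contains duplicates until it is reset. The sketch below illustrates this on two tiny, made-up frames (the sentences `s1` to `s3` are placeholders, not rows from the dataset) and shows `ignore_index=True` as one way to get a clean 0..n-1 index in a single step.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (placeholder sentences)
toy_train = pd.DataFrame({"Sentence": ["s1", "s2"]})
toy_test = pd.DataFrame({"Sentence": ["s3"]})

# Plain vertical concat keeps the original row labels of each frame
stacked = pd.concat([toy_train, toy_test], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

With a renumbered index, positional lookups such as `frame['Sentence'][i]` are unambiguous.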

In [10]:
data.isnull().mean()

Sentence          0.000000
TriggerWord       0.691585
TriggerWordLoc    0.691585
EventType         0.691585
dtype: float64

In [11]:
data.TriggerWord.unique()

array(['Down-regulation;expression;', 'deregulation;', nan, ...,
       'expressed;cross-linking;augmented;expression;transcription;up-regulated;activated;transcription;cross-linking;',
       'up-regulate;expression;', 'expression;levels;'], dtype=object)

In [12]:
data['TriggerWord'].value_counts().iloc[:50]

expression;                     182
expressed;                       90
binding;                         54
binds;                           36
bind;                            36
induction;                       32
overexpression;                  29
expression;expression;           28
activation;                      28
induced;                         26
express;                         24
activated;                       24
bound;                           23
regulation;                      22
expressing;                      20
cross-linking;                   18
increase;                        18
transcription;                   18
increased;                       17
regulated;                       17
production;                      17
Overexpression;                  15
induced;expression;              15
detected;                        14
Expression;                      14
interact;                        13
produced;                        13
interaction;                

In [13]:
data['EventType'].value_counts().iloc[:50]

Gene_expression;                                                428
Positive_regulation;                                            378
Binding;                                                        310
Positive_regulation;Gene_expression;                            178
Negative_regulation;                                            152
Regulation;                                                     147
Regulation;Gene_expression;                                     103
Positive_regulation;Positive_regulation;                         96
Transcription;                                                   91
Gene_expression;Positive_regulation;                             81
Gene_expression;Gene_expression;                                 64
Negative_regulation;Gene_expression;                             62
Negative_regulation;Positive_regulation;                         52
Localization;                                                    49
Positive_regulation;Transcription;              

In [14]:
data['EventType']=data['EventType'].str.split(';').str[0]
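
The split above keeps only the first event of each sentence and discards the rest. As a minimal sketch on made-up values that mimic the semicolon-terminated format seen in `train.head()`, the following contrasts that choice with keeping every event as a list. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy values in the same "A;B;" format as the EventType column
events = pd.Series(["Negative_regulation;Gene_expression;", "Regulation;", None])

# What the notebook does: keep the first event only
first_only = events.str.split(";").str[0]

# Alternative: strip the trailing ';' and keep all events as a list
all_events = events.str.rstrip(";").str.split(";")

print(first_only.tolist())
print(all_events.iloc[0])
```

Keeping the full list would turn the task into multi-label classification, which is a different modelling setup from the single-label one used here.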

In [15]:
# Assign the result back instead of using inplace=True on a column selection
data['EventType'] = data['EventType'].fillna(data['EventType'].mode()[0])

In [16]:
data['EventType'].value_counts()

Positive_regulation    9488
Gene_expression         785
Negative_regulation     521
Regulation              445
Binding                 439
Transcription           187
Localization             89
Phosphorylation          52
Protein_catabolism       20
Name: EventType, dtype: int64

In [18]:
data.isnull().sum()

Sentence             0
TriggerWord       8317
TriggerWordLoc    8317
EventType            0
dtype: int64

In [19]:
y = data['EventType']

In [20]:
# drop=True avoids adding the old, duplicated index as an extra column
data.reset_index(drop=True, inplace=True)

In [88]:
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
# Build the stop-word set once instead of once per token
stop_words = set(stopwords.words('english'))
corpus = []
for sentence in data['Sentence']:
    # Keep letters only, lowercase, then tokenize on whitespace
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower().split()
    # Lemmatize every token that is not a stop word
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [89]:
corpus

['regulation interferon regulatory factor gene expression leukemic cell due hypermethylation cpg motif promoter region',
 'although bcr abl translocation shown causative genetic aberration chronic myeloid leukemia cml mounting evidence deregulation gene transcription factor interferon regulatory factor irf also implicated pathogenesis cml',
 'promoter methylation cpg target site direct deletion insertion gene mechanism reversible permanent silencing gene expression respectively',
 'therefore investigated whether irf promoter methylation mutation may involved regulation irf expression leukemia cell',
 'whereas promoter mutation structural rearrangement could excluded cause altered irf expression hematopoietic cell irf promoter methylation status found significantly influence irf transcription',
 'first treatment irf negative lymphoid myeloid monocytic cell line methylation inhibitor aza deoxycytidine resulted time concentration dependent increase irf mrna protein level',
 'second using 
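
The cleaning step replaces every non-letter character with a space, which is why identifiers in the output above lose their digits and hyphens. The sketch below reproduces just the regex, lowercasing, and stop-word steps on a made-up sentence. It assumes a tiny hand-written stop-word set instead of NLTK's list and skips lemmatization, so it is an illustration of the effect rather than the notebook's exact pipeline.

```python
import re

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's list)
STOP_WORDS = {"of", "the", "in", "is", "to"}

def clean(sentence, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, and drop stop words (no lemmatization)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)

example = "Down-regulation of IRF-4 expression in CD4 + T cells"
cleaned = clean(example)
print(cleaned)
```

Here `IRF-4` becomes `irf` and `CD4` becomes `cd`, so distinct gene or protein names that differ only by a number collapse into one token.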

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
Tf = TfidfVectorizer()
X = Tf.fit_transform(corpus).toarray()
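
Fitting the vectorizer on the full corpus before splitting lets vocabulary and IDF statistics from the held-out rows influence the features. One common alternative, sketched here on made-up documents and labels, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so the TF-IDF step is fitted on the training portion only. The toy documents, labels, and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up documents and labels, four per class
toy_docs = [
    "gene expression increased", "protein expression detected",
    "expression of gene observed", "mrna expression level",
    "protein binding site", "factor binds promoter",
    "binding of receptor", "ligand bound receptor",
]
toy_labels = ["Gene_expression"] * 4 + ["Binding"] * 4

# Split the raw text first, keeping class proportions in both parts
docs_train, docs_test, labels_train, labels_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training documents only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs_train, labels_train)
predictions = model.predict(docs_test)
print(list(predictions))
```

The pipeline also keeps the TF-IDF matrix sparse, which avoids the memory cost of `.toarray()` on a vocabulary of several thousand terms.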

In [110]:
## Encode the target labels as integers ordered by class frequency
y.head()

0    Negative_regulation
1             Regulation
2    Positive_regulation
3             Regulation
4             Regulation
Name: EventType, dtype: object

In [111]:
y.value_counts().sort_values()

Protein_catabolism       20
Phosphorylation          52
Localization             89
Transcription           187
Binding                 439
Regulation              445
Negative_regulation     521
Gene_expression         785
Positive_regulation    9488
Name: EventType, dtype: int64

In [112]:
y.value_counts().sort_values().index

Index(['Protein_catabolism', 'Phosphorylation', 'Localization',
       'Transcription', 'Binding', 'Regulation', 'Negative_regulation',
       'Gene_expression', 'Positive_regulation'],
      dtype='object')

In [113]:
ordinal_labels = y.value_counts().sort_values().index

In [98]:
enumerate(ordinal_labels,1)

<enumerate object at 0x0000028D4F76D800>


In [114]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,1)}
ordinal_labels2

{'Protein_catabolism': 1,
 'Phosphorylation': 2,
 'Localization': 3,
 'Transcription': 4,
 'Binding': 5,
 'Regulation': 6,
 'Negative_regulation': 7,
 'Gene_expression': 8,
 'Positive_regulation': 9}

In [115]:
y = y.map(ordinal_labels2)
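
Once the labels are mapped to integers, model predictions come back as integers too. The following sketch rebuilds the same frequency-ordered mapping from the class counts printed above and adds an inverse dictionary for decoding. The names `label_to_id` and `id_to_label` are illustrative and do not appear in the notebook.

```python
# Class counts copied from the value_counts output above
counts = {
    "Positive_regulation": 9488, "Gene_expression": 785,
    "Negative_regulation": 521, "Regulation": 445, "Binding": 439,
    "Transcription": 187, "Localization": 89, "Phosphorylation": 52,
    "Protein_catabolism": 20,
}

# Rarest class gets 1, most frequent class gets 9
ordered_labels = sorted(counts, key=counts.get)
label_to_id = {label: i for i, label in enumerate(ordered_labels, 1)}

# Inverse mapping, used to turn integer predictions back into names
id_to_label = {i: label for label, i in label_to_id.items()}

decoded = [id_to_label[i] for i in [9, 8, 1]]
print(decoded)
```

The integer order carries no meaning for the classifiers used here, since they treat the values as class identifiers.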

In [116]:
X.shape, y.shape

((12026, 8992), (12026,))

In [117]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=42)

In [118]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((8418, 8992), (3608, 8992), (8418,), (3608,))

In [119]:
## Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
naive = MultinomialNB()
naive.fit(X_train,y_train)

MultinomialNB()

In [120]:
from sklearn.metrics import accuracy_score
y_pred = naive.predict(X_test)
# accuracy_score expects (y_true, y_pred); the value is the same either way
acc = accuracy_score(y_test,y_pred)
acc

0.7827050997782705
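
The accuracy above is easier to interpret next to a majority-class baseline. Filling the missing `EventType` values with the mode made `Positive_regulation` by far the largest class, so a model that always predicts it would already score close to that class's share of the data. The sketch below recomputes the share from the `value_counts` output shown earlier; the share within the 30% test split may differ slightly.

```python
# Class counts copied from the value_counts output after mode imputation
counts = {
    "Positive_regulation": 9488, "Gene_expression": 785,
    "Negative_regulation": 521, "Regulation": 445, "Binding": 439,
    "Transcription": 187, "Localization": 89, "Phosphorylation": 52,
    "Protein_catabolism": 20,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the most frequent class on the full data
majority_share = counts[majority_label] / total
print(f"{majority_label}: {majority_share:.4f} of {total} rows")
```

The share comes out at roughly 0.789, so an accuracy near 0.78 does not by itself show that the model has learned to separate the event types.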

## Multinomial Naive Bayes with Hyperparameter Tuning

In [132]:
classifier = MultinomialNB(alpha=0.1)

In [135]:
import numpy as np
previous_score=0
for alpha in np.arange(0,1,0.1):
    sub_classifier=MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train,y_train)
    y_pred = sub_classifier.predict(X_test)
    score = accuracy_score(y_test,y_pred)
    if score > previous_score:
        # Remember the best model and its score; without this update
        # every model with a positive score would overwrite the best one
        classifier = sub_classifier
        previous_score = score
    print("Alpha: {}, score : {}".format(alpha,score))



Alpha: 0.0, score : 0.7657982261640798
Alpha: 0.1, score : 0.772450110864745
Alpha: 0.2, score : 0.7768847006651884
Alpha: 0.30000000000000004, score : 0.779379157427938
Alpha: 0.4, score : 0.7796563192904656
Alpha: 0.5, score : 0.7810421286031042
Alpha: 0.6000000000000001, score : 0.7818736141906873
Alpha: 0.7000000000000001, score : 0.782150776053215
Alpha: 0.8, score : 0.7824279379157428
Alpha: 0.9, score : 0.7824279379157428
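
The loop above picks `alpha` by scoring each candidate on the test set, which makes the reported test accuracy optimistic. A common alternative is to select `alpha` by cross-validation on the training data only. The sketch below shows this with `GridSearchCV` on synthetic count data; the toy matrix, labels, and candidate list are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative features and three balanced classes
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(60, 20))
y_toy = np.arange(60) % 3

# Candidate smoothing values (alpha=0 is left out because it disables smoothing)
alphas = [0.1, 0.3, 0.5, 0.7, 1.0]

# 3-fold cross-validation on the training data selects alpha
search = GridSearchCV(MultinomialNB(), {"alpha": alphas}, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

After the search, `search.best_estimator_` is refitted on all of the training data and can be evaluated once on the held-out test set.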


## Passive Aggressive Classifier Algorithm

In [121]:
from sklearn.linear_model import PassiveAggressiveClassifier
linear_clf = PassiveAggressiveClassifier()

In [122]:
linear_clf.fit(X_train,y_train)
y_pred = linear_clf.predict(X_test)
# accuracy_score expects (y_true, y_pred); the value is the same either way
acc = accuracy_score(y_test,y_pred)
acc

0.707039911308204

## Random Forest Classifier

In [123]:
from sklearn.ensemble import RandomForestClassifier
# Named rf so the variable does not shadow Python's random module
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test,y_pred)
acc

0.7854767184035477

In [124]:
## Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

In [130]:
param_grid = {
    'n_estimators':[25,50,100,150],
    'max_depth': [3,6,9],
    'max_leaf_nodes': [3,6,9]
}

In [131]:
grid_search = GridSearchCV(RandomForestClassifier(),param_grid=param_grid)
grid_search.fit(X_train,y_train)
y_pred = grid_search.predict(X_test)
# accuracy_score expects (y_true, y_pred); the value is the same either way
acc = accuracy_score(y_test,y_pred)
acc

0.7832594235033259
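
With one class covering most of the rows, plain accuracy says little about the rarer event types. One complementary metric is macro-averaged recall (also called balanced accuracy), which averages the per-class recall so every class counts equally. The sketch below computes it by hand on made-up labels; the function name `macro_recall` and the toy labels are illustrative.

```python
from collections import defaultdict

def macro_recall(y_true, y_pred):
    """Average of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy case: a model that always predicts the majority class (label 9)
toy_true = [9] * 8 + [8] * 2
toy_pred = [9] * 10

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_macro_recall = macro_recall(toy_true, toy_pred)
print(toy_accuracy, toy_macro_recall)
```

In this toy case the accuracy is 0.8 while the macro recall is 0.5, because the minority class is never predicted.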