# Predict Drug MOAs

In this notebook, we try to predict a drug's mechanism of action (MOA) from its per-cell-line PRISM response data.

In [1]:
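# Import only the helpers this notebook uses instead of a wildcard import
from util import get_processed_data, logistic_regression_auc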

## All MOAs

First, let's try predicting all MOAs using all the data. Before fitting, we drop drugs with no known MOA, drugs with multiple MOAs, and MOAs that are present for fewer than 10 drugs.

In [2]:
df, df_info = get_processed_data(drugs_as_features=False, impute=True)

In [3]:
orig_len = len(df)
df = df.join(df_info[["moa"]], how="inner").dropna(subset=["moa"])
print("[Rows = {}] Removing {} of {} rows without a known MOA in the info data".format(
    len(df), orig_len - len(df), orig_len))

[Rows = 4325] Removing 361 of 4686 rows without a known MOA in the info data
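
The cell above removes rows in two ways at once: the inner join drops drugs that are missing from the info table, and `dropna` drops drugs whose MOA is empty. Here is a minimal sketch of that behavior on made-up data (the drug names, cell line names, and values are hypothetical, not taken from PRISM):

```python
import pandas as pd

# Toy response matrix: one row per drug, one column per cell line (made-up values).
responses = pd.DataFrame(
    {"cell_line_1": [0.1, -0.4, 0.3, 0.0], "cell_line_2": [0.2, -0.1, 0.5, -0.3]},
    index=["drug_a", "drug_b", "drug_c", "drug_d"],
)

# Toy info table: drug_c has no MOA recorded, and drug_d is absent entirely.
info = pd.DataFrame(
    {"moa": ["EGFR inhibitor", "MEK inhibitor", None]},
    index=["drug_a", "drug_b", "drug_c"],
)

# The inner join drops drug_d (not in info); dropna then drops drug_c (MOA is missing).
labeled = responses.join(info[["moa"]], how="inner").dropna(subset=["moa"])
n_removed = len(responses) - len(labeled)
print(sorted(labeled.index), n_removed)
```

Both kinds of removal are reported together in the single "Removing ... rows" count printed above.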


In [4]:
orig_len = len(df)
df = df[~df.moa.str.contains(",", regex=False)]  # multiple MOAs are comma-separated
print("[Rows = {}] For simplicity, removing {} of {} rows with multiple MOAs".format(
    len(df), orig_len - len(df), orig_len))

[Rows = 3929] For simplicity, removing 396 of 4325 rows with multiple MOAs


What do these MOAs look like?

In [5]:
df.moa.value_counts()

cyclooxygenase inhibitor                        89
adrenergic receptor antagonist                  89
adrenergic receptor agonist                     81
acetylcholine receptor antagonist               78
histamine receptor antagonist                   68
                                                ..
glycosylated protein precursor                   1
excipient                                        1
ecdysone receptor modulator                      1
PKA inhibitor                                    1
pyruvate ferredoxin oxidoreductase inhibitor     1
Name: moa, Length: 781, dtype: int64

In [6]:
orig_len = len(df)
# Count drugs per MOA in a separate Series so df never gets a temporary column
moa_counts = df.groupby("moa")["moa"].transform("count")
df = df[moa_counts >= 10]
print("[Rows = {}] Removing {} of {} rows without an MOA that is present for at least 10 drugs".format(
    len(df), orig_len - len(df), orig_len))

[Rows = 2311] Removing 1618 of 3929 rows without an MOA that is present for at least 10 drugs
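
As a small illustration of the minimum-count filter, here is a sketch on toy data. The helper name `filter_rare_labels` and the toy values are hypothetical and not part of `util`; the threshold is 2 here only so that the toy frame stays small.

```python
import pandas as pd


def filter_rare_labels(frame, column, min_count):
    """Keep only rows whose label in `column` appears at least `min_count` times."""
    label_counts = frame.groupby(column)[column].transform("count")
    return frame[label_counts >= min_count]


# Toy data: label A appears 3 times, B twice, C once.
toy = pd.DataFrame({
    "moa": ["A", "A", "A", "B", "B", "C"],
    "response": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

kept = filter_rare_labels(toy, "moa", min_count=2)
print(len(kept), sorted(kept.moa.unique()))
```

Raising the threshold keeps fewer, better-populated classes at the cost of discarding more drugs, which is the trade-off behind the 1618 rows removed above.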


In [7]:
model = logistic_regression_auc(x_values=df.drop(columns=["moa"]), y_values=df.moa)

Fitting model...
Model achieved an accuracy of 0.14, balanced accuracy of 0.13, AUC of 0.67


Interesting. The average one-vs-rest AUC (MOA A vs. not MOA A, MOA B vs. not MOA B, and so on) is reasonable,
but accuracy on the complete multiclass predictions is very low. Can we do better?
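
One way these two metrics can diverge: one-vs-rest AUC only asks whether drugs of a class are ranked above the others within that class's score column, while accuracy requires the true class to have the highest score in every row. The sketch below is a plain-Python illustration on made-up probabilities. It is not the implementation inside `logistic_regression_auc`, whose internals are not shown here, and the function names are our own.

```python
def binary_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(prob_rows, y_true, classes):
    """Average of the per-class 'class k vs. everything else' AUCs."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        labels = [y == cls for y in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)


def accuracy(prob_rows, y_true, classes):
    """Fraction of rows where the highest-probability class is the true class."""
    preds = [classes[max(range(len(classes)), key=lambda k: row[k])] for row in prob_rows]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


classes = ["A", "B", "C"]
y_true = ["A", "A", "B", "B", "C", "C"]
# Made-up predicted probabilities (columns: A, B, C). Class A always gets the highest
# score, yet within each column the true members of that class rank above the rest.
prob_rows = [
    (0.70, 0.20, 0.10),
    (0.60, 0.25, 0.15),
    (0.50, 0.40, 0.10),
    (0.55, 0.35, 0.10),
    (0.50, 0.15, 0.35),
    (0.45, 0.15, 0.40),
]

auc = macro_ovr_auc(prob_rows, y_true, classes)
acc = accuracy(prob_rows, y_true, classes)
print("macro one-vs-rest AUC = {:.2f}, accuracy = {:.2f}".format(auc, acc))
```

In this toy case every per-class ranking is perfect, but the model predicts class A for every row, so only the two true A rows are counted as correct. The real model is surely less extreme, but the same mechanism can pull the two numbers apart.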

## Non-Cancer-Related MOA

Hypothesis: many of these MOAs are not cancer specific, so cancer cell line response data may carry little signal about them,
and including them muddies our ability to predict cancer-specific MOAs. For example, can we really predict anything about a histamine receptor antagonist from cancer cell line response data?

In [8]:
# Keep the binary target in its own Series so df never gets a temporary column
is_histamine_antagonist = (df.moa == "histamine receptor antagonist").rename("is_histamine_antagonist")

In [9]:
model = logistic_regression_auc(x_values=df.drop(columns=["moa"]),
                                y_values=is_histamine_antagonist)

Fitting model...
Model achieved an accuracy of 0.94, balanced accuracy of 0.49, AUC of 0.55


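The near-chance AUC and balanced accuracy seem in line with the above hypothesis. The 0.94 accuracy is not evidence of skill: only 68 of the 2311 drugs are histamine receptor antagonists, so predicting the negative class almost every time is enough to score high on plain accuracy.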
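
To put the 0.94 accuracy in context, here is the arithmetic for a trivial classifier that always answers "not a histamine receptor antagonist". It uses the counts printed earlier in the notebook (68 histamine receptor antagonists among the 2311 remaining drugs). We assume, for illustration only, that an evaluation split has roughly the same class proportions as the full data.

```python
n_total = 2311  # rows left after the "at least 10 drugs per MOA" filter
n_pos = 68      # histamine receptor antagonists, from the value_counts output
n_neg = n_total - n_pos

# Always predicting the negative class gets every negative right and every positive wrong.
baseline_accuracy = n_neg / n_total
recall_pos = 0 / n_pos
recall_neg = n_neg / n_neg

# Balanced accuracy is the mean of the per-class recalls.
baseline_balanced_accuracy = (recall_pos + recall_neg) / 2

print("accuracy = {:.2f}, balanced accuracy = {:.2f}".format(
    baseline_accuracy, baseline_balanced_accuracy))
```

A no-skill baseline already reaches about 0.97 accuracy with a balanced accuracy of 0.5, so the fitted model's 0.94 and 0.49 are no better than always guessing the majority class.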

## Cancer-Related MOAs

How about picking a few MOAs that are clearly cancer related?

In [10]:
orig_len = len(df)
df = df[df.moa.isin({"EGFR inhibitor", "PI3K inhibitor", "MEK inhibitor", "PARP inhibitor", "CDK inhibitor"})]
print("[Rows = {}] Removing {} of {} rows that do not match these specific MOAs".format(
    len(df), orig_len - len(df), orig_len))

[Rows = 130] Removing 2181 of 2311 rows that do not match these specific MOAs


In [11]:
df.moa.value_counts()

EGFR inhibitor    34
PI3K inhibitor    33
MEK inhibitor     22
CDK inhibitor     21
PARP inhibitor    20
Name: moa, dtype: int64

In [12]:
model = logistic_regression_auc(x_values=df.drop(columns=["moa"]),
                                y_values=df.moa)

Fitting model...
Model achieved an accuracy of 0.75, balanced accuracy of 0.75, AUC of 0.90


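This also makes sense: predicting cancer-related MOAs from PRISM cell line signatures is a much more plausible task, and the scores are far higher than for the all-MOA model. Part of that gap comes simply from having 5 classes instead of many, and with only 130 drugs the estimates are noisy.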
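
To judge how far above chance these scores are, we can compute no-skill baselines from the class counts printed above. This is a back-of-the-envelope sketch, again assuming an evaluation split keeps roughly the same class proportions as the 130 drugs shown.

```python
# Class counts copied from the value_counts output above.
counts = {
    "EGFR inhibitor": 34,
    "PI3K inhibitor": 33,
    "MEK inhibitor": 22,
    "CDK inhibitor": 21,
    "PARP inhibitor": 20,
}
n_drugs = sum(counts.values())

# Always predicting the largest class is right exactly as often as that class occurs.
majority_accuracy = max(counts.values()) / n_drugs

# That same classifier has recall 1 for one class and 0 for the other four,
# so its balanced accuracy (mean per-class recall) is 1 / number of classes.
majority_balanced_accuracy = 1 / len(counts)

print("majority-class accuracy = {:.2f}, balanced accuracy = {:.2f}".format(
    majority_accuracy, majority_balanced_accuracy))
```

Against baselines of roughly 0.26 accuracy and 0.20 balanced accuracy, the model's 0.75 on both metrics is a clear improvement.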