# Using Collaborative Filtering To Repurpose Drugs for COVID-19

### Introduction

This notebook attempts to accelerate drug repurposing for COVID-19 by using a popular and relatively simple technique called collaborative filtering to search for potential drug candidates.

### What is Collaborative Filtering and How Can We Use It for Drug Repurposing?

When a website recommends a product to you, it can do so in one of two ways:

1. Using a content-based approach (you are female, married and aged 30+; this product is bought by married females aged 30+ and can therefore be recommended to you).
2. Using collaborative filtering, wherein explicit attributes of the product and the user (female, married, aged 30+ etc.) are not used; instead, latent factors are inferred from past purchases. More info here - https://en.wikipedia.org/wiki/Collaborative_filtering

We're using collaborative filtering for drug repurposing by simply replacing users with drugs and products with diseases. So if we have a large enough database of drugs and disease conditions, we could, in theory, predict the efficacy of a drug on a particular disease condition (e.g., COVID-19).
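
To make the analogy concrete, here is a minimal sketch in plain Python of the drug-by-disease table that collaborative filtering works from. The pairs are a handful of rows from the tables shown later in this notebook, and the helper name `build_interactions` is hypothetical.

```python
from collections import defaultdict

# Known (drug, disease) pairs; each one plays the role of a "user bought product" event.
known_pairs = [
    ("Lepirudin", "Heparin-induced thrombocytopenia with thrombosis"),
    ("Cetuximab", "Squamous cell carcinoma of mouth"),
    ("Cetuximab", "Squamous cell carcinoma of nose"),
    ("Remdesivir", "COVID-19"),
]

def build_interactions(pairs):
    """Map each drug to the set of diseases it is known to be used for."""
    table = defaultdict(set)
    for drug, disease in pairs:
        table[drug].add(disease)
    return dict(table)

interactions = build_interactions(known_pairs)
drugs = sorted(interactions)
diseases = sorted({disease for _, disease in known_pairs})

# Cells with no known pair are the ones the model is asked to fill in.
unknown_cells = [
    (drug, disease)
    for drug in drugs
    for disease in diseases
    if disease not in interactions[drug]
]
print(len(drugs) * len(diseases), "cells,", len(unknown_cells), "unknown")
```

Most cells in such a table are empty, and predicting a score for an empty cell such as (Cetuximab, COVID-19) is exactly the repurposing question.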

Similar approaches (made more sophisticated by including protein interactions and structural similarities in the calculations) have been tried earlier for other conditions:

1. Computational drug repositioning using collaborative filtering via multi-source fusion: https://www.sciencedirect.com/science/article/abs/pii/S0957417417303202
2. Realizing drug repositioning by adapting a recommendation system to handle the process: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2142-1

### Potential Drawbacks of Collaborative Filtering For COVID-19
1. "Cold Start Problem" - since COVID-19 is a new disease and no drug has been conclusively proven to be efficacious against it, we simply do not have enough data to make a prediction. This situation will change rapidly if some existing drug emerges as having efficacy. For the purpose of modeling here, I've simply taken some drugs that have anecdotally shown some efficacy or are currently approved for use against COVID-19.

2. Popular drugs may get recommended more often. We'll need to watch out for drugs that address many types of cancer, or steroids that are used for a variety of conditions.
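
One simple way to keep an eye on the second drawback is to count how many indications each drug has in the training table. The sketch below uses plain Python; the Prednisone rows are illustrative stand-ins for a widely used steroid rather than rows copied from the database, and `indication_counts` is a hypothetical helper.

```python
from collections import Counter

# Illustrative (drug, indication) rows standing in for the real table.
rows = [
    ("Prednisone", "Asthma"),
    ("Prednisone", "Rheumatoid Arthritis"),
    ("Prednisone", "Ulcerative Colitis"),
    ("Cetuximab", "Squamous cell carcinoma of mouth"),
    ("Cetuximab", "Squamous cell carcinoma of nose"),
    ("Lepirudin", "Heparin-induced thrombocytopenia with thrombosis"),
]

def indication_counts(pairs):
    """Count how many distinct indications each drug appears with."""
    return Counter(drug for drug, _ in set(pairs))

counts = indication_counts(rows)
# Drugs at the top of this list are the ones most likely to be over-recommended.
for drug, n in counts.most_common():
    print(f"{drug}\t{n}")
```

A drug that sits near the top of both this count and the final predictions deserves extra scepticism.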


### Citation for Drug Database Used
I've used the database available at http://apps.chiragjpgroup.org/repoDB/ for this notebook (associated paper: 'Brown AS and Patel CJ. repoDB: A New Standard for Drug Repositioning Validation. Scientific Data. 170029 (2017).').


### Code for Collaborative Filtering

First, we make some imports. I'm using a library called fastai (version 1, which provides the `CollabDataBunch` API used below), which is built on top of PyTorch. More info here - https://www.fast.ai/2018/10/02/fastai-ai/

In [4]:
from fastai.collab import *
from fastai.tabular import *

Next we take the full drug database and peek inside it. This is what we'll be using to train our model.

In [5]:
all_drugs = pd.read_csv('./full.csv')

In [6]:
all_drugs.head()

Unnamed: 0,drug_name,drug_id,ind_name,ind_id,NCT,status,phase,DetailedStatus
0,Lepirudin,DB00001,Heparin-induced thrombocytopenia with thrombosis,C0272275,,Approved,,
1,Cetuximab,DB00002,Squamous cell carcinoma of mouth,C0585362,,Approved,,
2,Cetuximab,DB00002,Squamous cell carcinoma of nose,C3163899,,Approved,,
3,Cetuximab,DB00002,Squamous cell carcinoma of pharynx,C1319317,,Approved,,
4,Cetuximab,DB00002,Laryngeal Squamous Cell Carcinoma,C0280324,,Approved,,


In [7]:
all_drugs.tail()

Unnamed: 0,drug_name,drug_id,ind_name,ind_id,NCT,status,phase,DetailedStatus
10557,Temozolomide,DB00853,CNS disorder,C0007682,NCT02661113,Withdrawn,Phase 2,Sponsor withdrew support; Study did not progre...
10558,Dasatinib,DB01254,CNS disorder,C0007682,NCT02661113,Withdrawn,Phase 2,Sponsor withdrew support; Study did not progre...
10559,Dacarbazine,DB00851,Brain Diseases,C0006111,NCT02661113,Withdrawn,Phase 2,Sponsor withdrew support; Study did not progre...
10560,Temozolomide,DB00853,Brain Diseases,C0006111,NCT02661113,Withdrawn,Phase 2,Sponsor withdrew support; Study did not progre...
10561,Dasatinib,DB01254,Brain Diseases,C0006111,NCT02661113,Withdrawn,Phase 2,Sponsor withdrew support; Study did not progre...


In [8]:
all_drugs.shape

(10562, 8)

From the full drug database, I've made a simplified version with just three columns. The last column ('rating') holds a score of 1 if a drug is used for a particular condition.

In [9]:
approved_covid = pd.read_csv('./approved_COVID.csv')

In [10]:
approved_covid.head()

Unnamed: 0,drug_name,ind_name,rating
0,Lepirudin,Heparin-induced thrombocytopenia with thrombosis,1
1,Cetuximab,Squamous cell carcinoma of mouth,1
2,Cetuximab,Squamous cell carcinoma of nose,1
3,Cetuximab,Squamous cell carcinoma of pharynx,1
4,Cetuximab,Laryngeal Squamous Cell Carcinoma,1
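
A table like this can be derived from the full database in a few lines. The sketch below shows one way to do it with the standard library's `csv` module. It assumes that rows are kept when `status` is `Approved`, which is a guess based on the file name, and it runs on a small inline sample rather than on `full.csv` itself. The function name `simplify` is hypothetical.

```python
import csv
import io

# A small inline sample with the same columns as full.csv (last field abridged).
sample = io.StringIO(
    "drug_name,drug_id,ind_name,ind_id,NCT,status,phase,DetailedStatus\n"
    "Lepirudin,DB00001,Heparin-induced thrombocytopenia with thrombosis,C0272275,,Approved,,\n"
    "Cetuximab,DB00002,Squamous cell carcinoma of mouth,C0585362,,Approved,,\n"
    "Temozolomide,DB00853,CNS disorder,C0007682,NCT02661113,Withdrawn,Phase 2,Sponsor withdrew support\n"
)

def simplify(rows):
    """Keep approved drug-indication pairs and attach a rating of 1 (assumed rule)."""
    return [
        {"drug_name": r["drug_name"], "ind_name": r["ind_name"], "rating": 1}
        for r in rows
        if r["status"] == "Approved"
    ]

simplified = simplify(csv.DictReader(sample))
for row in simplified:
    print(row)
```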


At the end of this table, I added 14 drugs that are being tried for COVID-19 in the literature as of today (May 6, 2020). Favipiravir was not available in the database that I downloaded, so I added a row associating it with influenza in addition to the row associating it with COVID-19.

In [11]:
approved_covid.tail(15)

Unnamed: 0,drug_name,ind_name,rating
6677,Remdesivir,COVID-19,1
6678,Hydroxychloroquine,COVID-19,1
6679,Azithromycin,COVID-19,1
6680,Tocilizumab,COVID-19,1
6681,Lopinavir,COVID-19,1
6682,Ritonavir,COVID-19,1
6683,Atovaquone,COVID-19,1
6684,Omeprazole,COVID-20,1
6685,Famotidine,COVID-21,1
6686,Oseltamivir,COVID-22,1
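
Because the model learns one embedding per distinct `ind_name` string, rows meant for the same disease need exactly the same label; a variant such as `COVID-20` would be treated as a separate disease. A quick check along these lines can confirm that before training. It is written in plain Python, and `label_variants` is a hypothetical helper.

```python
from collections import Counter

# A few (drug, indication) rows as they appear in the appended section of the table.
appended = [
    ("Remdesivir", "COVID-19"),
    ("Hydroxychloroquine", "COVID-19"),
    ("Omeprazole", "COVID-20"),
    ("Famotidine", "COVID-21"),
]

def label_variants(rows, prefix="COVID"):
    """Count each distinct indication label that starts with `prefix`."""
    return Counter(ind for _, ind in rows if ind.startswith(prefix))

variants = label_variants(appended)
print(variants)
if len(variants) > 1:
    print("More than one label found; the model would see these as different diseases.")
```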


Then we train our model.

In [12]:
data = CollabDataBunch.from_df(approved_covid, seed=42)

In [13]:
y_range = [0,1.01]
learn = collab_learner(data, n_factors=50, y_range=y_range)
learn.fit_one_cycle(6, 5e-3)

epoch,train_loss,valid_loss,time
0,0.238649,0.228914,00:00
1,0.166478,0.107541,00:00
2,0.075966,0.068207,00:01
3,0.032451,0.058261,00:00
4,0.01612,0.055541,00:00
5,0.011155,0.055121,00:00
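
For intuition about what is being trained: a dot-product collaborative filtering model gives each drug and each disease a vector of `n_factors` numbers plus a bias, and scores a pair by their dot product. When `y_range` is set, the raw score is squashed with a sigmoid into that range. A sigmoid never quite reaches its upper bound, which is presumably why the bound is set slightly above 1. The sketch below is a plain-Python illustration of that scoring rule as I understand it, not fastai's actual code, and the tiny 3-factor vectors are made up.

```python
import math

def score(drug_vec, disease_vec, drug_bias, disease_bias, y_range=(0.0, 1.01)):
    """Dot product plus biases, squashed by a sigmoid into y_range."""
    raw = sum(d * s for d, s in zip(drug_vec, disease_vec)) + drug_bias + disease_bias
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-raw))

# Made-up 3-factor embeddings (the notebook uses n_factors=50).
drug = [0.9, -0.2, 0.4]
disease = [1.1, 0.3, 0.8]

print(round(score(drug, disease, 0.1, 0.2), 3))
# A raw score of 0 lands in the middle of the range.
print(score([0.0], [0.0], 0.0, 0.0))
# A large raw score can exceed 1.0 only because the upper bound is 1.01.
print(score([5.0], [5.0], 0.0, 0.0) > 1.0)
```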


In [14]:
# Sanity check: compare predictions with the actual ratings on one validation batch
(drug, disease), efficacy = next(iter(data.valid_dl))
preds = learn.model(drug, disease)
print('Real\tPred\tDifference')
for p in list(zip(efficacy, preds))[:5]:
    print('{}\t{:.1f}\t{:.1f}'.format(p[0],p[1],p[1]-p[0]))

Real	Pred	Difference
1.0	0.8	-0.2
1.0	0.9	-0.1
1.0	0.5	-0.5
1.0	1.0	-0.0
1.0	1.0	0.0
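
Eyeballing five rows gives only a rough impression; a single number such as the mean absolute error is easier to compare across runs. The sketch below computes it in plain Python for the five (rounded) rows printed above. In the notebook, the same calculation could be applied to all of `efficacy` and `preds`.

```python
# (real, predicted) pairs copied from the five rows printed above.
pairs = [(1.0, 0.8), (1.0, 0.9), (1.0, 0.5), (1.0, 1.0), (1.0, 1.0)]

def mean_absolute_error(pairs):
    """Average of |predicted - real| over all pairs."""
    return sum(abs(pred - real) for real, pred in pairs) / len(pairs)

mae = mean_absolute_error(pairs)
print(round(mae, 2))
```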


Next, I created a list of all the drugs in our database with the disease set to COVID-19, so that we can ask our model to predict each drug's efficacy against COVID-19.

In [15]:
covid_candidate_drugs = pd.read_csv('./covid_candidate_drugs.csv')

In [16]:
covid_candidate_drugs.shape

(1573, 3)

In [17]:
covid_candidate_drugs.tail()

Unnamed: 0,drug_name,ind_name,rating
1568,Acetylcarnitine,COVID-19,
1569,Eslicarbazepine acetate,COVID-19,
1570,Nitrendipine,COVID-19,
1571,Succinylcholine,COVID-19,
1572,Remdesivir,COVID-19,
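
A candidate list of this shape can be generated directly from the training table instead of being assembled by hand. The sketch below shows the idea in plain Python on a few rows from the tables above; `candidate_rows` is a hypothetical helper, and `None` stands in for the empty rating column.

```python
# A few training rows: (drug_name, ind_name, rating).
training_rows = [
    ("Lepirudin", "Heparin-induced thrombocytopenia with thrombosis", 1),
    ("Cetuximab", "Squamous cell carcinoma of mouth", 1),
    ("Cetuximab", "Squamous cell carcinoma of nose", 1),
    ("Remdesivir", "COVID-19", 1),
]

def candidate_rows(rows, target="COVID-19"):
    """One row per distinct drug, paired with the target disease and no rating."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_drugs = list(dict.fromkeys(drug for drug, _, _ in rows))
    return [(drug, target, None) for drug in unique_drugs]

candidates = candidate_rows(training_rows)
for row in candidates:
    print(row)
```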


In [18]:
# Predict an efficacy score for each candidate drug paired with COVID-19
drugs_covid_predictions = pd.DataFrame(
    [covid_candidate_drugs.loc[i]['drug_name'], learn.predict(covid_candidate_drugs.loc[i])[0]]
    for i in range(covid_candidate_drugs.shape[0])
)

In [19]:
# Save the predictions so they can be reloaded without re-running the model
drugs_covid_predictions.to_csv('drugs_covid_predictions.csv')

In [20]:
drugs_covid_predictions = pd.read_csv('./drugs_covid_predictions.csv')

In [21]:
del drugs_covid_predictions['Unnamed: 0']

In [22]:
# After the CSV round trip, the column labels are the strings '0' and '1'
drugs_covid_predictions.rename(columns={'0': 'drug_name', '1': 'predicted_efficacy'}, inplace=True)

In [23]:
drugs_covid_predictions.sort_values('predicted_efficacy').head()

Unnamed: 0,drug_name,predicted_efficacy
793,Benzylpenicillin,0.311477
899,Fluoxymesterone,0.360499
1229,Methyltestosterone,0.381914
1029,Testosterone Propionate,0.384171
445,Testosterone,0.389494


### Listing Our Top 30 Predictions
Finally, we list the 30 drugs with the highest predicted efficacy. The list is sorted in ascending order, so the highest-scoring drugs appear last.

In [25]:
drugs_covid_predictions.sort_values('predicted_efficacy').tail(30)

Unnamed: 0,drug_name,predicted_efficacy
512,Amprenavir,0.806214
1395,Elvitegravir,0.807242
1333,Dolutegravir,0.808324
468,Stavudine,0.811085
988,Fosamprenavir,0.811877
64,Enfuvirtide,0.814427
158,Calcium Acetate,0.814611
1289,Rilpivirine,0.815321
673,Didanosine,0.817202
964,Darunavir,0.817858


### Evaluating Results from Our Top Predictions
It's no surprise to see the usual suspects such as ritonavir or tenofovir in our top 30, as we supplied these with a score of 1 when we trained our model. Some drugs predicted by our model that are also being investigated for COVID-19 are:

1. Nelfinavir - Multidrug treatment with nelfinavir and cepharanthine against COVID-19: https://www.biorxiv.org/content/10.1101/2020.04.14.039925v1

2. Indinavir - Mentioned in a paper by Yu-Chuan Chang et al. titled 'Potential therapeutic agents for COVID-19 based on the analysis of protease and RNA polymerase docking'. The paper claims that "Indinavir has the lowest docking score, even outperforming the results of Lopinavir and Ritonavir" - https://www.preprints.org/manuscript/202002.0242/v1/download
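
Since drugs that were supplied with a COVID-19 score of 1 will naturally rank highly, it can help to drop them before reading the ranking, so that only new candidates remain. The sketch below shows the idea in plain Python. Three of the scores echo values printed earlier, while the Ritonavir score and the `seeded` set are made up for illustration, and `novel_top_k` is a hypothetical helper.

```python
import heapq

# Predictions keyed by drug name.
predictions = {
    "Darunavir": 0.817858,
    "Didanosine": 0.817202,
    "Ritonavir": 0.95,          # made-up score for a drug that was in the training rows
    "Benzylpenicillin": 0.311477,
}
# Drugs that were given a COVID-19 rating of 1 during training (illustrative subset).
seeded = {"Remdesivir", "Ritonavir", "Lopinavir"}

def novel_top_k(preds, seeded, k):
    """Top-k drugs by predicted score, skipping drugs already seeded for the disease."""
    unseen = ((value, drug) for drug, value in preds.items() if drug not in seeded)
    return [drug for value, drug in heapq.nlargest(k, unseen)]

top_novel = novel_top_k(predictions, seeded, 2)
print(top_novel)
```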





### Citing this Notebook
Contents of this notebook are available under a Creative Commons Attribution 4.0 International License - https://creativecommons.org/licenses/by/4.0/

If you've got an idea or suggestion, please send me an email at vikramsnarayan@gmail.com