# Intelligent Systems for Bioinformatics - Group 1

This work was developed within the curricular unit Intelligent Systems for Bioinformatics of the Master's in Bioinformatics by:
- [Beatriz Santos](https://github.com/beatrizsantos1607)
- [Duarte Velho](https://github.com/duartebred)
- [Ricardo Oliveira](https://github.com/ricardofoliveira61)
- [Rita Nobrega](https://github.com/ritanobrega00)
- [Rodrigo Esperança](https://github.com/esperancaa)

This work consists of the analysis of a dataset using machine learning algorithms, with Python as the programming language.
The entire analysis is contained in a Jupyter Notebook, organized in sections (described below) that include succinct explanations of the procedures and decisions taken throughout the analysis.

For this work we selected the [GDSC1](https://tdcommons.ai/multi_pred_tasks/drugres) dataset. It contains wet lab IC50 values for 208 drugs in about 1000 cancer cell lines and can be used to build models that predict drug response, since the same compound can produce different levels of response in different patients. Our aim is to design a model that, given a drug and the genomic profile of a cell line, predicts the drug response and helps identify the best drug to treat a given patient. In this dataset, RMD normalized gene expression is used to represent the cancer cell lines and SMILES strings to represent the drugs. Y is the log normalized IC50.

## Notebook sections
### 1. Preprocessing and data exploration
- Review all the documentation available about the dataset
- Load the dataset and perform an exploratory analysis
- Prepare the dataset by generating and selecting features and treating missing values

This stage corresponds to section 1 of the Notebook, where:
- The dataset must be described according to the documentation
- The characteristics of the data must be summarized through an exploratory analysis
- The preprocessing steps must be described and the choices justified
- Graphics that represent the main characteristics of the dataset must be included

### 2. Unsupervised learning
- Use adequate visualization and dimensionality reduction techniques
- Apply clustering methods

This stage corresponds to section 2 of the Notebook, where:
- The results must be analyzed and the procedures explained
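
As an illustration of the dimensionality reduction step, the sketch below projects a feature matrix onto its first two principal components using NumPy's SVD. The random matrix is only a stand-in for the gene expression features, and the helper name `pca_project` is hypothetical; the notebook itself may rely on a different technique or library.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in: 50 "cell lines" x 10 "genes" (not real GDSC data)
X = rng.normal(size=(50, 10))
coords, explained = pca_project(X)
print(coords.shape, explained)
```

The two columns of `coords` can be plotted against each other to look for groups of similar samples before applying a clustering method.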

### 3. Machine Learning
- Compare the behavior of different machine learning models/methods by computing performance metrics
- Present the best model for the dataset

This stage corresponds to section 3 of the Notebook, where all the results must be reported and critically analyzed.
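
A minimal sketch of such a comparison, assuming a regression target and a simple hold-out split: a mean-value baseline is compared with ordinary least squares using the root mean squared error (RMSE). The data are synthetic and the two models are placeholders chosen for brevity, not the models evaluated in the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(42)
# Synthetic regression problem (not GDSC data): y depends linearly on X plus noise
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Simple hold-out split: first 150 rows for training, last 50 for testing
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Model 1: baseline that always predicts the training mean
baseline_pred = np.full_like(y_test, y_train.mean())

# Model 2: ordinary least squares with an intercept column
A_train = np.column_stack([np.ones(len(X_train)), X_train])
A_test = np.column_stack([np.ones(len(X_test)), X_test])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
ols_pred = A_test @ coef

# Collect the metric per model and pick the one with the lowest error
scores = {"baseline": rmse(y_test, baseline_pred), "ols": rmse(y_test, ols_pred)}
best = min(scores, key=scores.get)
print(scores, best)
```

Keeping the scores in one dictionary makes it easy to add further models and metrics and to report them side by side.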

### 4. Deep Learning
- Use deep learning methods, following the same approach as in stage 3

This stage corresponds to section 4 of the Notebook, where the results must be reported and critically analyzed.

All the packages used during this work are listed in the cell below.


In [1]:
from tdc.multi_pred import DrugRes
import pandas as pd
import numpy as np

## 1. Preprocessing and data exploration
The first stage of this work consists of describing the dataset that is going to be used throughout the project. This stage is crucial to understand the data, its structure, its quality, and the analyses it can support. Let's start by reviewing all the available documentation.

### 1.1 Documentation review and dataset description
Recent studies have shown that alterations in cancer genomes influence the clinical response to anticancer therapies. Nowadays, genomic changes are used as molecular biomarkers to identify the patients most likely to benefit from a treatment. However, many cancer drugs in development or already in use have not been linked to specific genomic markers that could guide their clinical use and reduce the time needed to treat a patient.

The discovery of the cancer genome as a potential source of biomarkers was only possible thanks to recent advances in high-throughput technologies, in particular DNA sequencing technologies, which allow sequencing on a previously unthinkable scale. To exploit this increased knowledge of cancer genomics, preclinical studies that link the genomic complexity of cancer with functional readouts such as drug sensitivity are required. For these studies, cancer cell lines derived from many different naturally occurring cancer types are essential to mimic the tissue type and genomic context of the cancer, and they also provide an easy system for experimental manipulation in molecular biology and drug discovery. For this reason, several studies have used cancer cell lines to link pharmacological data with genomic information. These studies helped define therapeutic biomarkers and demonstrated that pharmacogenomic profiling in cancer cell lines can serve as a biomarker discovery platform to guide the development of new cancer therapies.

The Genomics of Drug Sensitivity in Cancer database (GDSC) was designed to facilitate the study and understanding of the molecular features that influence drug response in cancer cell lines. The database holds datasets of drug sensitivity in cancer cells and links these data to detailed genomic information to facilitate the discovery of molecular biomarkers for drug response. These efforts are expected to provide, in the near future, a complete description of the genomic changes that occur in many cancer types and profound insights into the origins, evolution and progression of cancer.

To download and load the dataset, we use the `DrugRes` class from the `tdc.multi_pred` module, as imported in the first code cell. This gives convenient access to the data for an initial analysis of its structure, format and content. The code below downloads the dataset, or loads a local copy if one already exists.


In [3]:
# download the dataset and convert it to a pandas DataFrame
data = DrugRes(name = 'GDSC1')
dataframe = data.get_data()
dataframe


Found local copy...
Loading...
Done!


| | Drug_ID | Drug | Cell Line_ID | Cell Line | Y |
|---|---|---|---|---|---|
| 0 | Erlotinib | `COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...` | MC-CAR | [3.23827250519154, 2.98225419469807, 10.235490... | 2.395685 |
| 1 | Erlotinib | `COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...` | ES3 | [8.690197905033282, 3.0914731119366, 9.9924871... | 3.140923 |
| 2 | Erlotinib | `COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...` | ES5 | [8.233101127037282, 2.82468731112752, 10.01588... | 3.968757 |
| 3 | Erlotinib | `COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...` | ES7 | [8.33346622426757, 3.9667571228514302, 9.79399... | 2.692768 |
| 4 | Erlotinib | `COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...` | EW-11 | [8.39134072442845, 2.9683601858810698, 10.2606... | 2.478678 |
| ... | ... | ... | ... | ... | ... |
| 177305 | PFI-3 | `C1[C@@H]2CN([C@H]1CN2C3=CC=CC=N3)/C=C/C(=O)C4=...` | SNU-1040 | [8.65368534780164, 2.9238748715081, 10.1278774... | 5.353963 |
| 177306 | PFI-3 | `C1[C@@H]2CN([C@H]1CN2C3=CC=CC=N3)/C=C/C(=O)C4=...` | SNU-407 | [8.57966425274312, 2.77877087774424, 9.7680113... | 4.820567 |
| 177307 | PFI-3 | `C1[C@@H]2CN([C@H]1CN2C3=CC=CC=N3)/C=C/C(=O)C4=...` | SNU-61 | [8.077115751588071, 2.78132536810578, 10.03805... | 5.785978 |
| 177308 | PFI-3 | `C1[C@@H]2CN([C@H]1CN2C3=CC=CC=N3)/C=C/C(=O)C4=...` | SNU-81 | [7.7976988637889, 2.6408995410198797, 9.463400... | 5.393454 |



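The downloaded dataset contains one row per drug-cell line pair, with the columns `Drug_ID` (drug name), `Drug` (SMILES string of the drug), `Cell Line_ID` (cell line name), `Cell Line` (gene expression vector of the cell line) and `Y` (log normalized IC50).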
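
Because the `Cell Line` column stores a whole expression vector in each row, a numeric feature matrix has to be assembled before most models can use it. The sketch below shows one way to do that on a toy frame that mimics the column layout shown above; all values are made up and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the column layout of the dataset (values are made up)
toy = pd.DataFrame({
    "Drug_ID": ["DrugA", "DrugA", "DrugB"],
    "Drug": ["CCO", "CCO", "c1ccccc1"],
    "Cell Line_ID": ["CL1", "CL2", "CL1"],
    "Cell Line": [np.array([8.1, 2.9, 10.2]),
                  np.array([8.6, 3.0, 9.9]),
                  np.array([8.1, 2.9, 10.2])],
    "Y": [2.4, 3.1, 5.3],
})

# Stack the per-row expression vectors into a (n_pairs, n_genes) matrix
expression = np.vstack(toy["Cell Line"].tolist())
target = toy["Y"].to_numpy()
print(expression.shape, target.shape)
```

On the full dataset the same stacking repeats each cell line's vector once per drug, so it may be more memory-efficient to build one matrix per unique cell line and index into it.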


#### Database content

The GDSC database is based on three types of datasets, as described in the following sections.

##### Cell line drug sensitivity data

Cancer cell line drug sensitivity data are generated from ongoing high-throughput screening performed by the Cancer Genome Project at the Wellcome Trust Sanger Institute (WTSI) and the Center for Molecular Therapeutics at Massachusetts General Hospital using a collection of >1000 cell lines (7). Compounds selected for screening are anticancer therapeutics encompassing both targeted agents and cytotoxic chemotherapeutics. They are comprised of approved drugs used in the clinic, drugs undergoing clinical development and in clinical trials and tool compounds in early phase development. They cover a wide range of targets and processes implicated in cancer biology including receptor tyrosine kinase signalling, cell cycle control, DNA damage response and the cytoskeleton. Compounds are sourced from commercial vendors or provided by collaborators in academia, biotech and the pharmaceutical industry.

Cell line drug sensitivity is measured using fluorescence-based cell viability assays following 72 h of drug treatment. Dose–response curves are fitted to fluorescence signal intensities over nine drug concentrations (2-fold dilution series) to derive a multi-parameter signature of drug response. Values reported on the website include the half maximal inhibitory concentration (IC50), the slope of the dose–response curve and the area under the curve for each experiment.
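
To make the nine-point, 2-fold dilution series and the meaning of IC50 concrete, the sketch below simulates a dose-response curve and reads off the concentration at which viability drops to 50%. The top dose, the Hill-curve parameters and the use of log-linear interpolation are all assumptions made for illustration; GDSC fits a curve model to the measured fluorescence intensities instead.

```python
import numpy as np

# Nine concentrations in a 2-fold dilution series; the top dose of 10 uM is an assumption
top_conc = 10.0
conc = top_conc / 2.0 ** np.arange(9)  # 10, 5, 2.5, ... in descending order

# Simulated viability from a Hill curve with a known IC50 (illustrative values only)
true_ic50, hill_slope = 0.8, 1.2
viability = 1.0 / (1.0 + (conc / true_ic50) ** hill_slope)

# Estimate IC50 as the concentration where viability crosses 0.5, interpolating
# linearly in log-concentration. np.interp needs increasing x values, and
# viability increases as the concentration decreases, so the order is already right.
log_ic50 = np.interp(0.5, viability, np.log(conc))
ic50_est = float(np.exp(log_ic50))
print(conc, ic50_est, np.log(ic50_est))
```

The last printed value is the natural logarithm of the estimated IC50, which is the kind of log-scaled quantity used as the target `Y`.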

The current release of GDSC (release 2, July 2012) includes drug sensitivity data for 138 anticancer compounds screened across a range of 329–668 cell lines per drug (mean = 525 cell lines per drug) representing 73 169 cell line–drug interactions. This is the largest public resource available on drug sensitivity in cancer cells. Screening is ongoing and the objective is to screen these compounds, as well as additional compounds in the future, across the entire collection of >1000 cell lines. Data release occurs every 4 months and with each release, these results are updated with new data for existing drugs, as well as data for newly screened drugs.
##### Genomic datasets for cell lines

The total collection available for screening includes >1000 different cancer cell lines. These have been selected to represent the spectrum of common and rare types of adult and childhood cancers of epithelial, mesenchymal and haematopoietic origin. The cell lines have been extensively genomically characterized as part of the cancer cell line project from the Cancer Genome Project at the WTSI. The genomic datasets currently available for each cell line include information on somatic mutations in 75 cancer genes, genome wide gene copy number for amplification and deletion, targeted screening for seven gene rearrangements, markers of microsatellite instability, tissue type and transcriptional data. Using various statistical approaches as described below, genomic datasets are used together with drug sensitivity data for each cell line to identify genomic biomarkers of drug response. Genomic datasets within GDSC are obtained and updated directly from the Catalogue of Somatic Mutations in Cancer (COSMIC) database, a comprehensive freely available resource for the annotation and presentation of somatic mutations in cancer (8).
##### Analysis of genomic features of drug sensitivity

An essential component of the GDSC database is the systematic integration of large-scale genomic and drug sensitivity datasets. To identify genomic markers of drug response, we currently use two complementary analytical approaches (7). A multivariate analysis of variance (MANOVA) is used to correlate drug sensitivity (IC50 values and slope of the dose–response curve) with genomic alterations in cancer including point mutations, amplifications and deletions of common cancer genes, cancer gene rearrangements and microsatellite instability. The MANOVA identifies individual genomic features associated with drug sensitivity and for each drug–gene association reports a size effect and statistical significance of the association.

We also apply elastic net regression, a penalized linear modelling technique, to identify multiple interacting genomic features influencing each drug response. Genomic data used in the elastic net analysis include all of those used in the MANOVA and also incorporate genome-wide transcriptional profiles and tissue type. The elastic net selects which of these features are associated with drug response as measured by IC50 values across the cell line panel. For each drug, a feature list is built comprised of mutations, transcripts and tissue with an effect size assigned to each.
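
The feature-selection behavior of elastic net can be illustrated with a small from-scratch sketch. It uses coordinate descent on synthetic data without an intercept; the function names, the penalty settings and the data are assumptions for illustration, and this is not the GDSC pipeline. In practice a library implementation such as scikit-learn's `ElasticNet` would normally be used.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def elastic_net_cd(X, y, alpha=0.3, l1_ratio=0.9, n_iter=100):
    """Coordinate descent for the objective
    1/(2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||^2."""
    n, p = X.shape
    w = np.zeros(p)
    residual = y - X @ w
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual, excluding its own contribution
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha * l1_ratio) / (col_sq[j] + alpha * (1 - l1_ratio))
            # Keep the residual consistent with the updated coefficient
            residual += X[:, j] * (w[j] - new_w)
            w[j] = new_w
    return w

rng = np.random.default_rng(1)
# Synthetic data: 20 features, only the first three influence the response
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(scale=0.5, size=200)

w = elastic_net_cd(X, y)
selected = np.flatnonzero(w)
print(selected, w[selected])
```

The L1 part of the penalty sets the coefficients of uninformative features exactly to zero, which is what turns the fitted model into a feature list with an effect size per retained feature.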

A more detailed description of the different statistical analyses performed, as well as guidance on interpreting the results, can be found on the ‘Help & Documentation’ webpages under the ‘statistical analysis’ tab.

In [7]:
gene_symbols = data.get_gene_symbols()
len(gene_symbols)

Found local copy...
Loading...


17737

In [14]:
data.print_stats()

--- Dataset Statistics ---
208 unique drugs.
958 unique cell lines.
177310 drug-cell line pairs.
--------------------------
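
As a quick sanity check on these figures, the short calculation below estimates how much of the full drug by cell line grid is covered, assuming each drug-cell line pair appears at most once in the dataset.

```python
# Figures reported by print_stats() above
n_drugs = 208
n_cell_lines = 958
n_pairs = 177310

# Size of the full grid if every drug had been screened on every cell line
possible_pairs = n_drugs * n_cell_lines
coverage = n_pairs / possible_pairs
print(f"{possible_pairs} possible pairs, {coverage:.1%} measured")
```

Under that assumption roughly 89% of the possible pairs have a measured response, so about one pair in nine is absent from the data.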
