# Exploratory Data Analysis (EDA)

## 1. Introduction

This notebook explores the *Genomics of Drug Sensitivity in Cancer (GDSC)* dataset.  
The dataset contains genomic and pharmacological data describing how different cancer cell lines respond to various drug treatments.  

The goal of this analysis is to:
- Understand the dataset’s structure and integrity  
- Assess data completeness and consistency  
- Identify outliers or anomalies  
- Summarize key findings before further modeling or analysis

## 2. Assessment of Dataset Structure

This section evaluates the overall structure of the dataset, including its dimensions, variable types, and completeness. The goal is to understand the foundational layout before proceeding to deeper analysis.

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv("../data/gdsc_clean.csv")

# Dimensions
print("Dataset shape:", df.shape)

# Column names
print("\nColumn names:")
print(df.columns.tolist())

# Data types and non-null counts
print("\nDataset info:")
df.info()  # info() prints its own report and returns None, so print() would add a stray "None"

# Reapply data type casting
df = df.astype({
    "COSMIC_ID": "Int64",
    "DRUG_ID": "Int64",
    "LN_IC50": "float64",
    "AUC": "float64",
    "Z_SCORE": "float64",
    "CELL_LINE_NAME": "string",
    "DRUG_NAME": "string",
    "TCGA_DESC": "category",
    "GDSC Tissue descriptor 1": "category",
    "GDSC Tissue descriptor 2": "category",
    "Cancer Type (matching TCGA label)": "category",
    "Microsatellite instability Status (MSI)": "category",
    "Screen Medium": "category",
    "Growth Properties": "category",
    "CNA": "category",
    "Gene Expression": "category",
    "Methylation": "category",
    "TARGET": "string",
    "TARGET_PATHWAY": "category"
})
print("\nDataset info after type casting:")
df.info()  # info() prints its own report and returns None, so print() would add a stray "None"

# Statistical summary of numerical columns
print("\nSummary statistics:")
display(df.describe())

# Preview first few rows
df.head()

Dataset shape: (2468, 19)

Column names:
['COSMIC_ID', 'CELL_LINE_NAME', 'TCGA_DESC', 'DRUG_ID', 'DRUG_NAME', 'LN_IC50', 'AUC', 'Z_SCORE', 'GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2', 'Cancer Type (matching TCGA label)', 'Microsatellite instability Status (MSI)', 'Screen Medium', 'Growth Properties', 'CNA', 'Gene Expression', 'Methylation', 'TARGET', 'TARGET_PATHWAY']

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2468 entries, 0 to 2467
Data columns (total 19 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   COSMIC_ID                                2468 non-null   int64  
 1   CELL_LINE_NAME                           2468 non-null   object 
 2   TCGA_DESC                                2460 non-null   object 
 3   DRUG_ID                                  2468 non-null   int64  
 4   DRUG_NAME                                2468 non-null   object 
 5   L

| | COSMIC_ID | DRUG_ID | LN_IC50 | AUC | Z_SCORE |
|---|---:|---:|---:|---:|---:|
| count | 2468.0 | 2468.0 | 2468.0 | 2468.0 | 2468.0 |
| mean | 986198.944489 | 1003.916126 | -0.940266 | 0.819678 | 0.000576 |
| std | 216046.344962 | 0.832468 | 3.428846 | 0.149018 | 0.999388 |
| min | 683667.0 | 1003.0 | -7.530958 | 0.260719 | -2.688692 |
| 25% | 906807.75 | 1003.0 | -3.746083 | 0.733137 | -0.770615 |
| 50% | 909713.0 | 1004.0 | -1.590407 | 0.863975 | -0.11972 |
| 75% | 1240132.5 | 1005.0 | 1.742901 | 0.940245 | 0.666575 |
| max | 1789883.0 | 1005.0 | 9.229988 | 0.994588 | 3.377354 |


| | COSMIC_ID | CELL_LINE_NAME | TCGA_DESC | DRUG_ID | DRUG_NAME | LN_IC50 | AUC | Z_SCORE | GDSC Tissue descriptor 1 | GDSC Tissue descriptor 2 | Cancer Type (matching TCGA label) | Microsatellite instability Status (MSI) | Screen Medium | Growth Properties | CNA | Gene Expression | Methylation | TARGET | TARGET_PATHWAY |
|---|---|---|---|---|---|---:|---:|---:|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 683667 | PFSK-1 | MB | 1003 | Camptothecin | -1.463887 | 0.93022 | 0.433123 | nervous_system | medulloblastoma | MB | MSS/MSI-L | R | Adherent | Y | Y | Y | TOP1 | DNA replication |
| 1 | 684057 | ES5 | UNCLASSIFIED | 1003 | Camptothecin | -3.360586 | 0.791072 | -0.599569 | bone | ewings_sarcoma | | MSS/MSI-L | R | Adherent | Y | Y | Y | TOP1 | DNA replication |
| 2 | 684059 | ES7 | UNCLASSIFIED | 1003 | Camptothecin | -5.04494 | 0.59266 | -1.516647 | bone | ewings_sarcoma | | MSS/MSI-L | R | Adherent | Y | Y | Y | TOP1 | DNA replication |
| 3 | 684062 | EW-11 | UNCLASSIFIED | 1003 | Camptothecin | -3.741991 | 0.734047 | -0.807232 | bone | ewings_sarcoma | | MSS/MSI-L | R | Adherent | Y | Y | Y | TOP1 | DNA replication |
| 4 | 684072 | SK-ES-1 | UNCLASSIFIED | 1003 | Camptothecin | -5.142961 | 0.582439 | -1.570016 | bone | ewings_sarcoma | | MSS/MSI-L | R | Semi-Adherent | Y | Y | Y | TOP1 | DNA replication |


### Observations after dataset structure assessment

- The dataset contains **2468 rows and 19 columns**.
- **Key identifiers:** `COSMIC_ID` identifies each cell line and `DRUG_ID` identifies each drug. Together they are expected to identify each experimental record, although the cells above do not check this. `DRUG_ID` only spans 1003 to 1005 in the summary statistics, so the file covers at most three drugs.
- Column types after casting are now appropriate: **2 Int64**, **3 float64**, **3 string**, and **11 categorical** columns, which will help with further analysis.  
- Numeric columns (`LN_IC50`, `AUC`, `Z_SCORE`) have reasonable ranges. For example, `LN_IC50` ranges from about -7.53 to 9.23, and `AUC` ranges from 0.26 to 0.99, suggesting no obvious extreme errors.  
- Some columns have missing values: for instance, `Cancer Type (matching TCGA label)` has 1947 non-null entries out of 2468, i.e. ~21% missing. Other categorical columns such as `GDSC Tissue descriptor 1` and `GDSC Tissue descriptor 2` also have some missing values.
- Overall, the dataset structure is suitable for EDA, and the next step will be a more detailed evaluation of **data completeness and integrity**, including missing values and duplicates.
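The statement that `COSMIC_ID` and `DRUG_ID` together identify each record can be checked directly rather than assumed. The following is a minimal sketch on a toy frame (the values and the helper name `key_collisions` are illustrative assumptions, not part of the notebook); on the real data the same helper would be called with `df`.

```python
import pandas as pd

# Toy stand-in for the GDSC table; the values are illustrative only.
toy = pd.DataFrame({
    "COSMIC_ID": [683667, 683667, 684057, 684057],
    "DRUG_ID":   [1003,   1004,   1003,   1003],
    "LN_IC50":   [-1.46,  0.52,   -3.36,  -3.10],
})

def key_collisions(frame: pd.DataFrame, key: list) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    # keep=False marks all members of a duplicated group, not just the repeats.
    mask = frame.duplicated(subset=key, keep=False)
    return frame.loc[mask].sort_values(key)

collisions = key_collisions(toy, ["COSMIC_ID", "DRUG_ID"])
print(f"Rows sharing a (COSMIC_ID, DRUG_ID) pair: {len(collisions)}")
print(collisions)
```

An empty result would support treating the pair as a composite key. A non-empty result would mean repeated measurements exist and need a decision (for example, aggregation) before modeling.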

## 3. Data Completeness & Integrity Assessment

In this section, we evaluate the completeness and integrity of the dataset.  
We will check for missing values, duplicate rows, and consistency in categorical variables to ensure data quality before further analysis.

In [3]:
# Check missing values per column
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100
missing_df = pd.DataFrame({
    'missing_count': missing_counts,
    'missing_percent': missing_percent
})
display(missing_df.sort_values(by='missing_percent', ascending=False))

# Check for duplicate rows
duplicates_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicates_count)

# Examine categorical columns for value distributions
categorical_cols = df.select_dtypes(include='category').columns.tolist()
for col in categorical_cols:
    print(f"\nColumn: {col}")
    print(df[col].value_counts(dropna=False))

| | missing_count | missing_percent |
|---|---:|---:|
| Cancer Type (matching TCGA label) | 521 | 21.110211 |
| Microsatellite instability Status (MSI) | 110 | 4.45705 |
| GDSC Tissue descriptor 2 | 88 | 3.56564 |
| Methylation | 88 | 3.56564 |
| Growth Properties | 88 | 3.56564 |
| Screen Medium | 88 | 3.56564 |
| GDSC Tissue descriptor 1 | 88 | 3.56564 |
| CNA | 88 | 3.56564 |
| Gene Expression | 88 | 3.56564 |
| TCGA_DESC | 8 | 0.324149 |


Number of duplicate rows: 0

Column: TCGA_DESC
TCGA_DESC
UNCLASSIFIED    464
LUAD            164
BRCA            144
SCLC            137
COREAD          129
SKCM            116
ESCA             94
HNSC             91
PAAD             86
OV               84
GBM              83
NB               78
ALL              74
DLBC             72
STAD             70
KIRC             64
LAML             61
MESO             53
BLCA             51
MM               47
THCA             44
LIHC             43
LUSC             41
CESC             40
LGG              35
LCML             28
UCEC             27
PRAD             18
MB               12
NaN               8
CLL               6
ACC               3
OTHER             1
Name: count, dtype: int64

Column: GDSC Tissue descriptor 1
GDSC Tissue descriptor 1
lung_NSCLC           272
urogenital_system    271
leukemia             210
aero_dig_tract       187
lymphoma             167
breast               147
lung_SCLC            138
nervous_system       12

### Observations after assessment of dataset Completeness & Integrity

- `Cancer Type (matching TCGA label)` has the highest percentage of missing values (~21%); every other column in the table is below 5%.
- No fully duplicated rows were found (`df.duplicated().sum()` is 0).
- All categorical columns were inspected using `value_counts(dropna=False)`.
- Categorical columns have reasonable distributions, although `TCGA_DESC` carries 464 `UNCLASSIFIED` entries that behave like an unknown label. The NaNs may need handling in preprocessing: `Cancer Type (matching TCGA label)` (521 missing), `Microsatellite instability Status (MSI)` (110 missing), seven columns with exactly 88 missing each, and `TCGA_DESC` (8 missing).
- No obvious typos or inconsistent category names were found.
- Overall, the categories are consistent and interpretable, and the dataset is complete enough for exploratory analysis, provided the missing values above are accounted for in later steps.
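Seven columns in the missing-value table report exactly 88 missing entries. This suggests, but does not prove, that the same rows lack all of these annotations. The sketch below shows one way to test that on a toy frame (the values and the helper name `shared_missing_rows` are illustrative assumptions); on the real data it would be called with `df` and the seven column names.

```python
import pandas as pd

# Toy stand-in: rows 1 and 3 lack all three annotation columns.
toy = pd.DataFrame({
    "Screen Medium":     ["R", None, "D/F12", None, "R"],
    "Growth Properties": ["Adherent", None, "Suspension", None, "Adherent"],
    "CNA":               ["Y", None, "Y", None, "N"],
    "TCGA_DESC":         ["MB", "LUAD", None, "BRCA", "SKCM"],
})

def shared_missing_rows(frame: pd.DataFrame, cols: list) -> dict:
    """Count rows missing in every column of `cols` and rows missing in at least one."""
    na = frame[cols].isna()
    return {
        "all_missing": int(na.all(axis=1).sum()),
        "any_missing": int(na.any(axis=1).sum()),
    }

annotation_cols = ["Screen Medium", "Growth Properties", "CNA"]
pattern = shared_missing_rows(toy, annotation_cols)
print(pattern)
```

When `all_missing` equals `any_missing`, the gaps fall on the same rows, so they can be treated as one block of unannotated records. When the two counts differ, the columns are missing independently and need separate handling.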

## 4. Outlier and Anomaly Detection

In this section, we evaluate numeric features for extreme values or unusual patterns.  
Detecting outliers helps understand the data distribution and identify potential errors or interesting biological signals.

In [6]:
numeric_cols = ['LN_IC50', 'AUC', 'Z_SCORE']

# Summary statistics
display(df[numeric_cols].describe())

# Simple outlier detection using IQR
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"\nColumn: {col}")
    print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")
    print(f"Number of outliers: {len(outliers)}")

| | LN_IC50 | AUC | Z_SCORE |
|---|---:|---:|---:|
| count | 2468.0 | 2468.0 | 2468.0 |
| mean | -0.940266 | 0.819678 | 0.000576 |
| std | 3.428846 | 0.149018 | 0.999388 |
| min | -7.530958 | 0.260719 | -2.688692 |
| 25% | -3.746083 | 0.733137 | -0.770615 |
| 50% | -1.590407 | 0.863975 | -0.11972 |
| 75% | 1.742901 | 0.940245 | 0.666575 |
| max | 9.229988 | 0.994588 | 3.377354 |



Column: LN_IC50
Lower bound: -11.979558625000001, Upper bound: 9.976376375000001
Number of outliers: 0

Column: AUC
Lower bound: 0.4224742499999999, Upper bound: 1.25090825
Number of outliers: 33

Column: Z_SCORE
Lower bound: -2.92640025, Upper bound: 2.82235975
Number of outliers: 12


### Observations after Outlier and Anomaly Detection

- `LN_IC50` has no outliers according to the IQR method. Its interquartile range is wide (about 5.49), so the bounds (about -11.98 and 9.98) enclose the full observed range (-7.53 to 9.23).
- `AUC` has 33 outliers, all on the lower end (below ~0.42), because the upper bound (~1.25) exceeds the observed maximum (0.99). These may indicate extreme drug sensitivity or potential measurement anomalies.
- `Z_SCORE` has 12 outliers, all above the upper bound (~2.82). The observed minimum (-2.69) stays inside the lower bound (~-2.93), so there are no low-side outliers.
- Overall, the numeric features are generally well-behaved. Outliers are relatively few (33 and 12 of 2468 rows) and may reflect **biological variability** rather than data errors, although this analysis does not test that.
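The loop above reports only a total outlier count per column, so the direction of the outliers has to be inferred from the bounds. A small variation, sketched here on a toy series (the helper name `iqr_outlier_sides` and the values are illustrative assumptions), counts each side separately.

```python
import pandas as pd

def iqr_outlier_sides(values: pd.Series, k: float = 1.5) -> dict:
    """Apply the k * IQR rule and count outliers below and above the bounds separately."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        "lower_bound": lower,
        "upper_bound": upper,
        "n_below": int((values < lower).sum()),
        "n_above": int((values > upper).sum()),
    }

# Toy AUC-like values with a single very low reading.
toy_auc = pd.Series([0.10, 0.80, 0.82, 0.85, 0.88, 0.90, 0.92, 0.95, 0.99])
sides = iqr_outlier_sides(toy_auc)
print(sides)
```

On the real data, `iqr_outlier_sides(df["AUC"])` and `iqr_outlier_sides(df["Z_SCORE"])` would show directly which tail the flagged rows come from.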

## 5. Summary & Conclusions

- **Dataset Overview:** The dataset contains 2468 records and 19 columns, including numeric, categorical, and string features. Key identifiers are `COSMIC_ID` (cell line) and `DRUG_ID` (drug); together they are expected to identify each experimental observation, which should be confirmed before relying on them as a key.

- **Data Completeness & Integrity:**  
  - The dataset is mostly complete: apart from one column, every reported gap is below 5% of rows.
  - The column `Cancer Type (matching TCGA label)` has the highest missing fraction (~21%).
  - No fully duplicated rows were found.

- **Categorical Columns:**  
  - All categorical columns were checked with `value_counts()` and found to be consistent.  
  - Some NaNs are present but manageable in preprocessing.  

- **Numeric Columns & Biological Sense:**  
  - `LN_IC50`, `AUC`, and `Z_SCORE` values are within biologically plausible ranges.  
  - IQR-based outlier detection flags 33 low `AUC` values and 12 high `Z_SCORE` values; all `LN_IC50` values fall within the IQR bounds.

- **Overall Conclusion:**  
  - The dataset is well-structured and clean enough for exploratory analysis.  
  - Minor missing values and a small number of outliers should be addressed in downstream analysis, but the dataset is suitable for studying genomics of drug sensitivity in cancer.  
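One possible way to handle the missing categorical values noted above is to make them an explicit category instead of dropping rows. This is only a sketch of one option, not the notebook's chosen method; the label `"Unknown"`, the helper name `fill_with_unknown`, and the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in for a categorical column with gaps.
toy = pd.DataFrame({
    "Cancer Type (matching TCGA label)": pd.Categorical(["MB", None, "LUAD", None]),
})

def fill_with_unknown(col: pd.Series, label: str = "Unknown") -> pd.Series:
    """Register `label` as a category if needed, then use it for missing entries."""
    if label not in col.cat.categories:
        # fillna on a categorical only accepts values that are existing categories.
        col = col.cat.add_categories([label])
    return col.fillna(label)

filled = fill_with_unknown(toy["Cancer Type (matching TCGA label)"])
print(filled.value_counts())
```

Keeping the gaps as their own level preserves all 2468 rows and lets later steps decide whether the unlabeled records behave differently from the labeled ones.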