
# Multiclass Cancer Classification using RNA-seq and Logistic Regression

**Goal:** Predict cancer type from gene expression using multiclass logistic regression.
**Notes:** 
- Uses processed RNA-seq data.
- Prevents data leakage by fitting preprocessing only on training data.
- Evaluates with accuracy and a per-class classification report on a held-out test set; the structure is ready to extend to cross-validation.


In [None]:

# Import required libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report



## Load data
- `data.csv`: gene expression matrix (rows=samples, columns=genes)
- `labels.csv`: sample metadata with cancer `Class`


In [None]:

# Load expression data and labels (update paths as needed)
data = pd.read_csv('data.csv')
labels = pd.read_csv('labels.csv')



## Merge expression and labels
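- Merge on the sample identifier column (`Unnamed: 0`, the name pandas assigns to a header-less first column)
- Merging by ID, rather than relying on row order, keeps each sample aligned with its label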


In [None]:

# Merge on sample ID
df = data.merge(labels, on='Unnamed: 0')
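
A plain inner merge silently drops samples that appear in only one file, and it duplicates rows if an ID repeats. The sketch below shows one way to surface both problems before building `X` and `y`. The toy frames and their values are hypothetical stand-ins for `data.csv` and `labels.csv`; only the `Unnamed: 0` and `Class` column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for data.csv and labels.csv (hypothetical values).
data = pd.DataFrame({
    "Unnamed: 0": ["sample_0", "sample_1", "sample_2"],
    "gene_0": [1.2, 0.4, 3.1],
    "gene_1": [0.0, 2.2, 1.5],
})
labels = pd.DataFrame({
    "Unnamed: 0": ["sample_0", "sample_1", "sample_3"],
    "Class": ["BRCA", "LUAD", "KIRC"],
})

# An outer merge with indicator=True flags samples present in only one table.
# validate="one_to_one" raises MergeError if a sample ID is duplicated.
check = data.merge(
    labels, on="Unnamed: 0", how="outer", indicator=True, validate="one_to_one"
)
unmatched = check.loc[check["_merge"] != "both", "Unnamed: 0"].tolist()
print("Samples missing from one table:", unmatched)

# The inner merge keeps only samples with both expression values and a label.
df = data.merge(labels, on="Unnamed: 0", validate="one_to_one")
print("Merged shape:", df.shape)
```

If `unmatched` is non-empty on the real files, it is worth checking the sample IDs before continuing.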



## Define features (X) and labels (y)
- Drop sample ID and label from features


In [None]:

# Features: all gene columns
X = df.drop(columns=['Unnamed: 0', 'Class'])

# Labels: cancer type
y = df['Class']



## Train–test split
- Stratified split preserves class proportions


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
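
To see what `stratify=y` buys, the class proportions can be compared across the full set and both splits. This is a minimal sketch on made-up, imbalanced labels (the class names and counts are hypothetical), using the same split parameters as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with an imbalanced class mix (hypothetical counts).
y = pd.Series(["BRCA"] * 60 + ["LUAD"] * 30 + ["KIRC"] * 10, name="Class")
X = pd.DataFrame({"gene_0": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Side-by-side class frequencies: with stratification the columns should match.
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(proportions)
```

Without `stratify`, a rare class can end up under-represented in, or missing from, the test set purely by chance.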



## Feature scaling
- Fit scaler ONLY on training data to avoid data leakage


In [None]:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
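
The same fit-on-training-only rule applies inside cross-validation: the scaler has to be refit on each training fold. One common way to get this is to wrap the scaler and the model in a scikit-learn `Pipeline`. The sketch below assumes synthetic data from `make_classification` in place of the RNA-seq matrix; the fold count and seeds are arbitrary choices, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the expression matrix (assumption).
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# The pipeline refits StandardScaler on the training part of every fold,
# so no statistics from the validation part leak into preprocessing.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=5000, random_state=0),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean +/- std   :", scores.mean().round(3), "+/-", scores.std().round(3))
```

In this notebook the call would take `X_train` and `y_train`, leaving the test set untouched for the final evaluation.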



## Train multiclass logistic regression
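- The SAGA solver supports the multinomial loss and scales to high-dimensional data such as RNA-seq; it converges fastest when features share a similar scale, which is why scaling comes first.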


In [None]:

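# Multiclass logistic regression (default L2 penalty)
md = LogisticRegression(
    max_iter=5000,      # SAGA can need many iterations on wide data
    solver='saga',
    random_state=42     # SAGA is stochastic; fix the seed for reproducibility
)

# Fit on the scaled training data only
md.fit(X_train_scaled, y_train)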



## Prediction and evaluation


In [None]:

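# Predicted class labels for the held-out test set
y_pred = md.predict(X_test_scaled)
# Per-class probabilities (columns follow md.classes_)
y_prob = md.predict_proba(X_test_scaled)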

print("Train accuracy:", accuracy_score(y_train, md.predict(X_train_scaled)))
print("Test accuracy :", accuracy_score(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
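
Accuracy hides which cancer types get confused with each other, and the probabilities in `y_prob` are otherwise unused. As a possible extension, the sketch below adds a confusion matrix and the multiclass log loss. It runs on synthetic data (an assumption, so that it is self-contained) but reuses the variable names from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the expression matrix (assumption).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

md = LogisticRegression(max_iter=5000, solver="saga", random_state=0)
md.fit(X_train_scaled, y_train)

y_pred = md.predict(X_test_scaled)
y_prob = md.predict_proba(X_test_scaled)

# Rows are true classes, columns are predicted classes, ordered as md.classes_.
cm = confusion_matrix(y_test, y_pred, labels=md.classes_)
print("Confusion matrix:\n", cm)

# Log loss penalizes confident wrong predictions; lower is better.
ll = log_loss(y_test, y_prob, labels=md.classes_)
print("Log loss:", round(ll, 3))
```

Off-diagonal entries in the matrix point to the specific class pairs the model mixes up.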



## Interpretation
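- High test accuracy, ideally confirmed by consistent cross-validation scores, would indicate strong class separability; a large gap between train and test accuracy would instead point to overfitting.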
- Coefficients can be inspected to identify discriminative genes.
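
A rough sketch of the coefficient inspection follows. It assumes synthetic data with hypothetical gene names (`gene_0`, `gene_1`, ...) and ranks genes per class by absolute coefficient. Because the features are standardized, coefficient magnitudes are on a comparable scale across genes; the cutoff of five genes is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with hypothetical gene names (assumption).
X_raw, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=3, random_state=0
)
gene_names = [f"gene_{i}" for i in range(X_raw.shape[1])]
X = pd.DataFrame(X_raw, columns=gene_names)

# For brevity this fits on all rows; in the notebook, use the training set.
X_scaled = StandardScaler().fit_transform(X)
md = LogisticRegression(max_iter=5000, solver="saga", random_state=0)
md.fit(X_scaled, y)

# coef_ has one row per class and one column per gene.
coefs = pd.DataFrame(md.coef_, index=md.classes_, columns=gene_names)

# Top genes per class, ranked by absolute coefficient.
top_genes = {
    cls: coefs.loc[cls].abs().nlargest(5).index.tolist() for cls in coefs.index
}
for cls, genes in top_genes.items():
    print(f"Class {cls}: {genes}")
```

Large coefficients mark genes the model relies on, not necessarily biologically causal ones, and correlated genes can share or trade weight between them.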
