## *Introduction*

This notebook develops a predictive model for lung cancer risk using a Kaggle dataset with 1000 patient records and 26 columns covering demographics, lifestyle, and medical history. The target variable, Level, indicates cancer risk and is encoded with `LabelEncoder`, which assigns integer codes in alphabetical order (0=High, 1=Low, 2=Medium). We classify risk levels with SVM, GaussianNB, and AdaBoost, tuning AdaBoost with `GridSearchCV`. The dataset is clean (no missing values or duplicates). We evaluate the models using accuracy, precision, recall, F1-score, and confusion matrices, then use RandomForest feature importance to identify key predictors.





## *Data Loading and Preprocessing*

In [1]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('/kaggle/input/cancer-patients-and-air-pollution-a-new-link/cancer patient data sets.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# Mean Smoking score for each risk level (Level is still a string here)
sns.barplot(y='Level', x='Smoking', data=df)

In [None]:
# Select numeric columns (int64) before encoding
numeric_cols = df.select_dtypes(include=['int64']).columns
corr_matrix = df[numeric_cols].corr()

# Visualize correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix of Numeric Features')
plt.show()

In [None]:
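from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts class labels alphabetically before assigning codes,
# so the resulting order is exposed in encoder.classes_
encoder = LabelEncoder()
df['Level'] = encoder.fit_transform(df['Level'])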
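
Because the integer codes come from a sort rather than from the natural risk order, it can help to print the mapping before reading any classification report. The snippet below is a minimal sketch on toy labels (the `levels` list is an assumption, not the notebook's data) showing how to recover the mapping from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the same three risk categories as the 'Level' column
levels = ['Low', 'Medium', 'High', 'Low', 'High']

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(levels)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)  # {'High': 0, 'Low': 1, 'Medium': 2}

# inverse_transform recovers the original strings for reporting
decoded = [str(label) for label in demo_encoder.inverse_transform(codes)]
```

The same dictionary comprehension applied to the notebook's `encoder` would show which row of each confusion matrix corresponds to which risk level.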

In [None]:
df['Level'].value_counts()

In [None]:
# Drop identifier columns that carry no predictive information
df = df.drop(['Patient Id', 'index'], axis=1)

In [None]:
# Separate the predictors from the encoded target
X = df.drop('Level', axis=1)
y = df['Level']

## *Modeling*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
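
The `stratify=y` argument keeps the class proportions roughly equal in the training and test sets. A small sketch on hypothetical labels (`X_demo` and `y_demo` are made up for illustration) shows how this could be checked:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50% class 0, 30% class 1, 20% class 2
y_demo = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Share of each class in the two splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share)
print("test: ", test_share)
```

Running the same `np.bincount` comparison on `y_train` and `y_test` would confirm that the notebook's split preserves the Level distribution.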

In [None]:
# Initialize models
models = {
    'SVM': SVC(),
    'GaussianNB': GaussianNB(),
    'AdaBoost': AdaBoostClassifier(random_state=42)
}

# Parameter grid for AdaBoost
ada_params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1, 1.0]
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    if name == 'AdaBoost':
        # Apply GridSearch for AdaBoost
        grid = GridSearchCV(model, ada_params, cv=5, scoring='accuracy', n_jobs=-1)
        grid.fit(X_train, y_train)
        model = grid.best_estimator_
        print(f"Best AdaBoost Params: {grid.best_params_}")
        print(f"Best CV Accuracy: {grid.best_score_:.4f}")
    else:
        # Train directly for SVM and GaussianNB
        model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)  # Make predictions
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f"Test Accuracy for {name}: {accuracy:.4f}")
    print(f"Classification Report for {name}:\n{classification_report(y_test, y_pred)}")
    print(f"Confusion Matrix for {name}:\n{confusion_matrix(y_test, y_pred)}")
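
The confusion matrices printed by the loop can be read as per-class recall: in scikit-learn's layout, rows are true classes and columns are predictions, so recall for a class is its diagonal entry divided by its row sum. The sketch below uses hypothetical labels (`y_true_demo` and `y_pred_demo` are not the notebook's results) to illustrate the calculation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels for three classes
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Recall for class i = correct predictions for i / all true samples of i
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(recall_per_class)  # [1.  0.5 1. ]

# Cross-check against scikit-learn's own per-class recall
reference = recall_score(y_true_demo, y_pred_demo, average=None)
```

Here class 1 has half of its samples predicted as class 2, which is the kind of pattern a low recall value points to.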

## *Feature Importance (RandomForest)*

In [None]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': rf.feature_importances_})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print("\nRandomForest Feature Importance:")
print(importance_df)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(10))
plt.title('Top 10 Feature Importance (RandomForest)')
plt.show()
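
RandomForest's `feature_importances_` values are normalized to sum to 1, so multiplying by 100 gives each feature's percentage share. The following is a minimal sketch on synthetic data; the dataset from `make_classification` and the feature names `f0` to `f5` are placeholders, not the patient data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=42
)
cols = [f"f{i}" for i in range(X_demo.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Convert the normalized importances to percentages, largest first
importance_pct = (
    pd.Series(rf_demo.feature_importances_, index=cols)
    .mul(100)
    .sort_values(ascending=False)
    .round(2)
)
print(importance_pct)
```

These impurity-based importances describe how the forest used each feature on the training data; they are relative shares rather than effect sizes.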

In [None]:
# Plot comparison
plt.figure(figsize=(8, 6))
sns.barplot(x=list(results.keys()), y=list(results.values()))
plt.title('Model Comparison - Test Accuracy')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.show()

## *Conclusion*

SVM achieved the highest test accuracy (97.33%) and classified all risk levels well, followed by GaussianNB and AdaBoost (both 90.00%). AdaBoost, tuned with `GridSearchCV` (learning_rate=0.01, n_estimators=100), struggled with Low-risk recall. RandomForest feature importance ranked Coughing of Blood (12.49%), Passive Smoker (10.33%), and Obesity (9.33%) as the top predictors. These values show which features the model relied on most; they do not by themselves establish a causal effect on lung cancer risk. Accuracies this high suggest the dataset may be oversimplified, so the models should be validated on external data before their robustness is assumed.