<a href="https://colab.research.google.com/github/Ash100/Trainings/blob/main/Two-Days_International_Workshop_HU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Practical Data Science for Nanomaterial Safety: Analyzing ZnO, CuO, TiO2, Fe2O3, and Al2O3 Toxicity and Predictions**
### Presented by **Dr. Ashfaq Ahmad**, Department of Bioinformatics, Hazara University, Mansehra.

**Date: July 22-23, 2025**

This demonstration will be conducted on July 22, 2025, as part of the<br>
### **_2-Days International Workshop on Nanomaterials: A Journey Through Progress, Emerging Trends, and Future Challenges (NJTP-ETFC25)_**.

## **Introduction**

1. Nanoparticles (NPs) are widely used across various industries, but concerns remain about their potential **toxicity**.<br>
2. This talk **aims** to demonstrate the development of a machine learning model that predicts nanoparticle toxicity based on their physicochemical properties.<br>
3. Participants will also **observe** the role of cloud computing during the session.<br>



### **Methodology**<br>

The dataset for this study was sourced from **Kaggle**, and the analysis followed these key steps:

**Data Preprocessing:** This involved addressing missing values, encoding categorical variables, and scaling numerical features to prepare the data for modeling.

**Dataset Splitting:** The processed data was divided into training and testing sets to ensure robust model evaluation.

**Model Training:** A Random Forest Classifier was implemented to predict toxicity based on the prepared dataset.

**Performance Evaluation:** The model's effectiveness was assessed by evaluating its accuracy and identifying feature importance.

**Hyperparameter Tuning:** GridSearchCV was utilized to optimize the Random Forest Classifier's hyperparameters, enhancing its predictive performance.

**Model Deployment:** The trained model was saved and loaded to facilitate future toxicity predictions.
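
The steps above can be sketched end to end on synthetic data. This is a minimal illustration rather than the workshop's actual workflow: `make_classification` stands in for the Kaggle file, the `*_demo` names are hypothetical, and wrapping the scaler and classifier in a `Pipeline` is an assumption of this sketch, chosen so that scaling statistics are learned from the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nanoparticle table (assumption: 9 numeric features, binary label)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=42
)

# 80/20 split, stratified so both classes keep their proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling and the classifier are fitted together on the training split
demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
demo_pipeline.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, demo_pipeline.predict(X_te))
print(f"Demo test accuracy: {demo_accuracy:.3f}")
```

Tree-based models such as Random Forest do not strictly need scaled inputs; the scaler is kept here only to mirror the preprocessing step listed above.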

In [1]:
#@title Libraries Import
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

In [None]:
#@title Loading the dataset from local computer
import pandas as pd
from google.colab import files

# Upload the file manually
uploaded = files.upload()

# Extract the filename (assumes only one file is uploaded)
file_path = next(iter(uploaded))

# Load the dataset
df = pd.read_csv(file_path)

# Display basic information
print("Dataset Information:")
df.info()

# Show the first few rows
df.head()


### **About the Dataset**
**NPs:** Identifies the specific type of nanoparticle.

**coresize:** Refers to the diameter or size of the solid core of the nanoparticle.

**hydrosize:** Represents the hydrodynamic size of the nanoparticle, including any adsorbed layers in solution.

**surfcharge:** Indicates the electrical charge on the surface of the nanoparticle.

**surfarea:** Denotes the total surface area of the nanoparticle.

**Ec:** Represents the electrokinetic (zeta) potential, indicating colloidal stability.

**Exptime:** Specifies the duration of exposure to the nanoparticles.

**dosage:** Refers to the concentration or amount of nanoparticles administered.

**e:** Denotes the electrophoretic mobility, measuring particle movement in an electric field.

**NOxygen:** Likely represents the number of oxygen atoms associated with the nanoparticle or its environment.

**class:** Categorizes the outcome, likely indicating the toxicity level (e.g., toxic/non-toxic).

In [None]:
#@title Check for missing values and duplicate rows
# Count missing values per column
print("Missing Values in Each Column:")
print(df.isnull().sum())

# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())

In [None]:
#@title Remove duplicate rows
# Remove duplicate rows
df = df.drop_duplicates()

# Check the new shape of the dataset
print("New dataset shape after removing duplicates:", df.shape)

In [None]:
#@title Let's Explore data patterns/Class Distribution
# Feature Distributions (Histograms for numerical features)
df.hist(figsize=(12, 8), bins=30)
plt.suptitle('Feature Distributions', fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.96]) # Adjust layout to prevent suptitle overlap
plt.savefig('feature_distributions.png', dpi=600) # Save with 600 DPI
plt.show()

# Class Distribution (for classification tasks)
plt.figure(figsize=(6, 4))
sns.countplot(x='class', data=df, palette='viridis')
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.savefig('class_distribution.png', dpi=600) # Save with 600 DPI
plt.show()

In [None]:
#@title Individual Variable Comparison - For instance, **coresize Vs Toxicity**
import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set(style='whitegrid')
plt.figure(figsize=(8, 6))

# Boxplot only, with outlier points hidden
sns.boxplot(x='class', y='coresize', data=df, palette='pastel', showfliers=False)  # Plot 'coresize' to match the title and axis labels

plt.title('Relationship Between Core Size and Toxicity', fontsize=14)
plt.xlabel('Toxicity Class')
plt.ylabel('Core Size (nm)')
plt.tight_layout()
plt.savefig('coresize_vs_toxicity.png', dpi=600) # Save with 600 DPI
plt.show()

In [12]:
#@title Let's split core size by class to test the comparison for significance
group_nonToxic = df[df['class'] == 'nonToxic']['coresize']
group_Toxic = df[df['class'] == 'Toxic']['coresize']  # Capital 'T'


In [None]:
#@title Perform the Mann–Whitney U test
from scipy.stats import mannwhitneyu

# Mann–Whitney U test
if not group_nonToxic.empty and not group_Toxic.empty:
    stat, p_value = mannwhitneyu(group_nonToxic, group_Toxic, alternative='two-sided')
    print(f"Mann–Whitney U test statistic: {stat:.3f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < 0.05:
        print("🔬 Statistically significant difference in core size between Toxic and nonToxic groups.")
    else:
        print("ℹ️ No statistically significant difference in core size between groups.")
else:
    print("⚠️ One or both groups are empty — check filtering.")
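
A p-value indicates whether a difference is detectable, not how large it is. As a hedged sketch, the U statistic can be converted into a common-language effect size: the share of cross-group pairs in which the first group's value is larger. The two short lists below are made-up numbers for illustration, not values from the dataset, and the `demo_*` names are hypothetical.

```python
from scipy.stats import mannwhitneyu

# Made-up core sizes (nm) for two hypothetical groups
demo_a = [10.0, 12.0, 15.0]
demo_b = [20.0, 25.0, 30.0]

u_stat, p_val = mannwhitneyu(demo_a, demo_b, alternative="two-sided")

# Recent SciPy versions report U for the first sample, i.e. the number of
# (a, b) pairs with a > b (ties count as one half).
n_pairs = len(demo_a) * len(demo_b)
effect_size = u_stat / n_pairs  # 0.5 means no tendency either way

print(f"U = {u_stat}, p = {p_val:.3f}, P(a > b) = {effect_size:.2f}")
```

Here every value in `demo_a` is smaller than every value in `demo_b`, so the effect size is 0: the separation is complete even though such small groups give limited statistical power.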


In [None]:
#@title Data processing for Machine Learning Models
import pickle
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encode the categorical variable 'NPs' and 'class'
label_encoders = {}
for col in ["NPs", "class"]:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

# Save label encoders for future use
with open("label_encoders.pkl", "wb") as f:
    pickle.dump(label_encoders, f)

# Scale numerical features (leave the encoded target 'class' unscaled)
scaler = StandardScaler()
numeric_cols = [col for col in df.select_dtypes(include="number").columns if col != "class"]
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Show transformed dataset
print("Preprocessing Completed!")
print(df.head())

In [None]:
#@title Splitting data in Training and Testing set
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = df.drop(columns=["class"])  # Features
y = df["class"]  # Target (toxicity class)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the shape
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")

In [None]:
#@title Label Encoding
from sklearn.preprocessing import LabelEncoder

# Ensure target labels are categorical
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)  #Use transform to keep consistency

# Print unique labels
print("Encoded classes:", label_encoder.classes_)

In [None]:
#@title Initiate Random Forest Model
from sklearn.ensemble import RandomForestClassifier

# Toxicity is a categorical label, so a classifier is used rather than a regressor
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
import pandas as pd

# Convert y_train to a Pandas Series
y_train_series = pd.Series(y_train)

# Now you can use .head()
print(y_train_series.head())
print(y_train_series.dtype)

In [None]:
print(y_train[:5])  # This works for NumPy arrays
print(type(y_train))  # Check its type

In [None]:
#@title Now let's Train the Model
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Check training and testing accuracy
train_accuracy = rf_model.score(X_train, y_train)
test_accuracy = rf_model.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")

In [None]:
#@title Hyperparameter tuning using GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4]
}

# Initialize Grid Search
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy", n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Get best parameters
print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is already refit on the full training set (refit=True by default)
best_rf = grid_search.best_estimator_

# Evaluate again
print("Optimized Model Accuracy:", best_rf.score(X_test, y_test))
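
To get a feel for the cost of the grid above, the number of model fits can be counted before running the search. The sketch below repeats the same grid under the hypothetical name `demo_grid` and uses scikit-learn's `ParameterGrid` to enumerate the combinations.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, repeated so this sketch is self-contained
demo_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 * 3 * 3 * 3 combinations
n_folds = 3                                   # matches cv=3
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, one additional fit of the best combination on the whole training set follows the cross-validated fits.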

In [None]:
#@title Saving the trained model
# Train the best model after hyperparameter tuning
best_rf_model = RandomForestClassifier(**grid_search.best_params_, random_state=42)
best_rf_model.fit(X_train, y_train)

# Calculate training accuracy for the optimized model
train_accuracy_optimized = best_rf_model.score(X_train, y_train)
print(f"Optimized Model Training Accuracy: {train_accuracy_optimized:.4f}")

# Save the optimized model
import pickle
with open("optimized_rf_model.pkl", "wb") as f:
    pickle.dump(best_rf_model, f)

print("Optimized model saved successfully!")

In [None]:
#@title Loading the model
# Load the saved model
with open("optimized_rf_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Test with new data
sample = X_test.iloc[0:1]  # Take one sample from test set
prediction = loaded_model.predict(sample)
print("Predicted Class:", prediction)

In [None]:
#@title Evaluate Accuracy of the model
import pickle

# Load the saved model
with open("optimized_rf_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Predict on the test set
y_pred = loaded_model.predict(X_test)

# Calculate test accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")
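
Accuracy alone hides which class the errors fall in. As a small sketch, `confusion_matrix` and `classification_report` (both imported at the top of the notebook) break the result down per class. The labels below are made up for illustration; in the notebook, `y_test` and `y_pred` would take their place.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up encoded labels for illustration only
demo_true = [0, 0, 1, 1, 1]
demo_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
demo_cm = confusion_matrix(demo_true, demo_pred)
demo_acc = accuracy_score(demo_true, demo_pred)

print(demo_cm)
print(f"Accuracy: {demo_acc:.2f}")
print(classification_report(demo_true, demo_pred))
```

In this toy case three of five predictions are correct, with one error in each class.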

In [None]:
#@title ROC Curve and AUC on the test set
from sklearn.metrics import roc_curve, auc

# Check if binary classification
if len(loaded_model.classes_) == 2:
    # Get predicted probabilities
    y_proba = loaded_model.predict_proba(X_test)[:, 1]

    # ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba, pos_label=loaded_model.classes_[1])
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve - Test Set')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.tight_layout()
    plt.show()


**High Discriminatory Power:**<br> An AUC of 0.98 means the model distinguishes very well between the **two classes (e.g., "Toxic" vs. "nonToxic" nanoparticles)**.
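
One way to read an AUC value: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher predicted score. The sketch below checks this on four made-up scores, comparing a manual pair count with scikit-learn's `roc_auc_score`; the `demo_*` names are hypothetical.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores for illustration only
demo_labels = [0, 0, 1, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(demo_scores, demo_labels) if y == 1]
neg = [s for s, y in zip(demo_scores, demo_labels) if y == 0]

# Count pairs where the positive outranks the negative (ties count one half)
pairs_won = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
pairwise_auc = pairs_won / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(demo_labels, demo_scores)
print(pairwise_auc, sklearn_auc)
```

Three of the four pairs are ranked correctly, so both calculations give 0.75.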

In [None]:
#@title Feature Importance Analysis
import matplotlib.pyplot as plt
import numpy as np

# Get feature importances
feature_importances = loaded_model.feature_importances_

# Create a bar chart
features = X.columns
indices = np.argsort(feature_importances)[::-1]  # Sort by importance

plt.figure(figsize=(10, 6))
plt.title("Feature Importance in Nanoparticle Toxicity Prediction")
plt.bar(range(X.shape[1]), feature_importances[indices], align="center")
plt.xticks(range(X.shape[1]), features[indices], rotation=45, ha="right")
plt.xlabel("Feature")
plt.ylabel("Importance Score")
plt.show()
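
The impurity-based `feature_importances_` plotted above are computed from the training data and can favor features with many distinct values. A complementary check, offered here as an alternative technique rather than part of the workshop's method, is permutation importance on held-out data. The sketch uses synthetic data and hypothetical `*_pi` names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, of which 3 carry signal (assumption for illustration)
X_pi, y_pi = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0, stratify=y_pi
)

pi_model = RandomForestClassifier(n_estimators=100, random_state=0)
pi_model.fit(X_pi_tr, y_pi_tr)

# Shuffle each feature 10 times on the test split and record the drop in accuracy
pi_result = permutation_importance(
    pi_model, X_pi_te, y_pi_te, n_repeats=10, random_state=0
)

for idx in pi_result.importances_mean.argsort()[::-1]:
    mean = pi_result.importances_mean[idx]
    std = pi_result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```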

## The trained **model** can also be used to **predict** the toxicity of unseen nanoparticle data.
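
A new sample has to pass through the same encoding and scaling as the training data before it reaches the model. The sketch below shows the idea on a tiny made-up table: the values, the reduced set of three feature columns, and names such as `toy_model` and `new_np` are all assumptions for illustration, not the workshop dataset or its saved model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up training table; values are illustrative only
toy = pd.DataFrame({
    "NPs": ["ZnO", "CuO", "TiO2", "ZnO", "CuO", "TiO2"],
    "coresize": [20.0, 35.0, 50.0, 25.0, 40.0, 60.0],
    "dosage": [10.0, 50.0, 5.0, 80.0, 20.0, 2.0],
    "class": ["nonToxic", "Toxic", "nonToxic", "Toxic", "Toxic", "nonToxic"],
})
feature_cols = ["NPs", "coresize", "dosage"]

# Fit the encoders and scaler once, on the training data
np_encoder = LabelEncoder().fit(toy["NPs"])
class_encoder = LabelEncoder().fit(toy["class"])

X_toy = toy[feature_cols].copy()
X_toy["NPs"] = np_encoder.transform(X_toy["NPs"])
toy_scaler = StandardScaler().fit(X_toy)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(toy_scaler.transform(X_toy), class_encoder.transform(toy["class"]))

# A new, unseen particle goes through the same encoder and scaler
new_np = pd.DataFrame({"NPs": ["ZnO"], "coresize": [30.0], "dosage": [60.0]})
new_np["NPs"] = np_encoder.transform(new_np["NPs"])
new_scaled = toy_scaler.transform(new_np[feature_cols])

# Map the numeric prediction back to its original label
predicted_label = class_encoder.inverse_transform(toy_model.predict(new_scaled))[0]
print("Predicted class:", predicted_label)
```

For this to work with a saved model, the fitted scaler would need to be stored alongside the label encoders and the model file.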

# Thank You So Much