<a href="https://colab.research.google.com/github/Ash100/Trainings/blob/main/EDA_and_Machine_Learning_on_Water_quality_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction
My name is Dr. Ashfaq Ahmad, and I work in the field of Bioinformatics, Data science and Structural Biology. This training is prepared on a water quality dataset to teach you how well you can play with data.

You are welcome to follow me on youtube [**Bioinformatics Insights**](www.youtube.com/@Bioinformaticsinsights)

# Exploratory Data Analysis (EDA) of Water Quality Dataset
In this section, we will perform Exploratory Data Analysis (EDA) on the water quality dataset using Google Colab. We will use Python libraries such as pandas, matplotlib, and seaborn to visualize and understand the data.

**First, we need to import the necessary libraries for data manipulation and visualization.**

In [1]:
#@title Necessary Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Next, we load the dataset into a pandas DataFrame. For this example, we will create the DataFrame directly from the provided data.**

In [2]:
#@title Load your dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# Step 2: Upload the CSV file
print("Please upload your CSV file:")
uploaded = files.upload()

Please upload your CSV file:


Saving water_potability.csv to water_potability.csv


In [3]:
#@title Load the data into a pandas DataFrame
file_name = list(uploaded.keys())[0]  # Get the name of the uploaded file
df = pd.read_csv('/content/water_potability.csv')

In [None]:
#@title Display basic information about the dataset
print("Basic Information about the Dataset:")
print(df.info())
print("\nFirst few rows of the dataset:")
print(df.head())

In [6]:
#@title Data Processing - Handle missing values by filling them with the mean of the column
df.fillna(df.mean(), inplace=True)

In [None]:
#@title Save descriptive statistics to a CSV file in /content
descriptive_stats = df.describe()
descriptive_stats.to_csv('/content/descriptive_stats.csv', index=True)

print("Descriptive statistics saved to /content/descriptive_stats.csv")

**We will create various visualizations to understand the distribution and relationships within the data.**

In [None]:
#@title Visualize - Histograms for each feature
df.hist(bins=15, figsize=(15, 10), layout=(4, 3))
plt.suptitle('Histograms of Water Quality Features', fontsize=16)

# Save the figure as PNG with 600 DPI
plt.savefig('/content/water_quality_feature_histograms.png', dpi=600, bbox_inches='tight')

# Show the plots
plt.show()

In [None]:
#@title Visualize - Box Plot for each feature
plt.figure(figsize=(15, 10))
df.boxplot()
plt.title('Boxplots of Water Quality Features')

# Save the figure as PNG with 600 DPI
plt.savefig('/content/boxplots_water_quality.png', dpi=600, bbox_inches='tight')

# Show the plots
plt.show()

In [None]:
#@title Visualize - the correlation matrix
corr_matrix = df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Water Quality Features')

# Save the heatmap as a high-resolution PNG
plt.savefig('/content/correlation_heatmap.png', dpi=600, bbox_inches='tight')

# Show the plot
plt.show()

In [None]:
#@title Visualize - thePairplot with hue based on 'Potability'
pairplot = sns.pairplot(df, hue='Potability')

# Save the pairplot as a high-resolution PNG
plt.savefig('/content/water_quality_pairplot.png', dpi=600, bbox_inches='tight')

# Show the pairplot
plt.show()

In [None]:
#@title Compare two different features - Hardness vs. Conductivity
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Hardness', y='Conductivity', hue='Potability', data=df)
plt.title('Scatter Plot of Hardness vs. Tirbidity')
plt.show()

# Interpretation of the above
**Histograms**

Histograms provide a visual representation of the distribution of each feature. They help identify the range, central tendency, and spread of the data. For example, the histogram of Hardness shows a relatively normal distribution, while Solids has a more skewed distribution.

**Boxplots**

Boxplots are useful for identifying outliers and understanding the spread of the data. They show the median, quartiles, and potential outliers. For instance, the boxplot of Chloramines indicates that there are no significant outliers in this feature.

**Correlation Matrix**

The correlation matrix helps identify relationships between features. A high positive correlation (close to 1) indicates a strong positive relationship, while a high negative correlation (close to -1) indicates a strong negative relationship. For example, Hardness and Conductivity have a moderate positive correlation.

**Pairplot**

Pairplots provide a comprehensive view of pairwise relationships between features. They help identify patterns and relationships that might not be apparent from individual plots. The pairplot with hue based on Potability helps visualize how different features relate to water potability.

**Scatter Plots**

Scatter plots are useful for visualizing the relationship between two features. They help identify trends and patterns. For example, the scatter plot of Hardness vs. Conductivity shows a positive trend, indicating that higher hardness is associated with higher conductivity.

# Machine Learning on Water Quality Data

### Data Preprocessing

1. **Handle Missing Values**: Fill missing values with the mean of the respective column.
2. **Split Data**: Split the data into features (X) and target (y).
3. **Standardize Features**: Standardize the features using `StandardScaler`.

In [18]:
#@title Necessary Import
import pandas as pd
import numpy as np
from google.colab import files
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import seaborn as sns

In [19]:
#@title Handle missing values
df.fillna(df.mean(), inplace=True)

# Split the data into features (X) and target (y)
X = df.drop(columns=['Potability'])
y = df['Potability']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Train Multiple Models

1. **Logistic Regression**
2. **Random Forest**
3. **Gradient Boosting**
4. **Support Vector Machine (SVM)**
5. **K-Nearest Neighbors (KNN)**
6. **Decision Tree**

Evaluate each model using accuracy, precision, recall, and F1-score.

In [None]:
#@title Define and Machine Learning Models and Choose the Best
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split data into training and testing sets
test_size = 0.1  # Adjust as needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, random_state=42)

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier()
}

# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    results[name] = accuracy
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(f"{name} Classification Report:\n{report}\n")

##Key Visualizations:
**Confusion Matrix:**
Shows TP, TN, FP, FN.

**ROC Curve:**
Evaluates the trade-off between TPR and FPR.

**Precision-Recall Curve:**
Highlights performance for imbalanced data.

In [None]:
#@title Model Analysis
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve, average_precision_score

# Find the best model based on accuracy
best_model_name = max(results, key=lambda k: results[k])
best_model = models[best_model_name]

# Generate predictions and classification report
y_pred = best_model.predict(X_test)
print(f"\n{best_model_name} Classification Report:")
print(classification_report(y_test, y_pred))

# 1. Confusion Matrix
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.show()

# 2. ROC Curve
if hasattr(best_model, "predict_proba"):
    y_scores = best_model.predict_proba(X_test)[:, 1]
else:
    y_scores = best_model.decision_function(X_test)

fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve - {best_model_name}')
plt.legend(loc='lower right')
plt.show()

# 3. Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)

plt.figure(figsize=(8,6))
plt.plot(recall, precision, color='blue', lw=2, label=f'Precision-Recall curve (AP = {avg_precision:.2f})')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve - {best_model_name}')
plt.legend(loc='lower right')
plt.show()

# 4. Feature Importance (for tree-based models)
if isinstance(best_model, (RandomForestClassifier, GradientBoostingClassifier)):
    plt.figure(figsize=(10,8))
    feature_importance = best_model.feature_importances_
    sorted_idx = np.argsort(feature_importance)[::-1]
    plt.barh(X_train.columns[sorted_idx], feature_importance[sorted_idx], align='center')
    plt.xlabel('Feature Importance')
    plt.ylabel('Features')
    plt.title(f'Feature Importance - {best_model_name}')
    plt.show()

### Select the Best Model

Compare the accuracy of each model and select the best-performing model.

In [22]:
#@title Select the best model
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]
print(f"Best Model: {best_model_name} with Accuracy: {results[best_model_name]:.4f}")

Best Model: SVM with Accuracy: 0.6951


**Calculate and visualize the feature importance based on the best model.**

In [None]:
#@title Feature Importance
if not isinstance(X, pd.DataFrame):
    X = pd.DataFrame(X, columns=df.columns[:-1])  # Exclude the target column 'Potability'

# Check the type of the best model
if isinstance(best_model, (RandomForestClassifier, GradientBoostingClassifier)):
    # For RandomForestClassifier and GradientBoostingClassifier
    feature_importances = best_model.feature_importances_
    feature_names = X.columns
    feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
elif isinstance(best_model, LogisticRegression):
    # For LogisticRegression
    feature_importances = best_model.coef_[0]
    feature_names = X.columns
    feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
elif isinstance(best_model, (SVC, KNeighborsClassifier, DecisionTreeClassifier)):
    # For other models, use permutation importance
    from sklearn.inspection import permutation_importance
    importances = permutation_importance(best_model, X_test, y_test, n_repeats=10, random_state=42)
    feature_importances = importances.importances_mean
    feature_names = X.columns
    feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
else:
    print("Feature importance calculation is not supported for this model type.")
    feature_importance_df = None

# Plot feature importances if available
if feature_importance_df is not None:
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
    plt.title('Feature Importances')

    # Save the plot with high resolution
    plt.savefig('feature_importances.png', dpi=600)

    # Show the plot
    plt.show()

# Using the Best Model For Prediction

In [24]:
#@title Load the unknown data
print("Please upload your unknown data CSV file:")
unknown_data = files.upload()

# Load the unknown data into a pandas DataFrame
unknown_file_name = list(unknown_data.keys())[0]  # Get the name of the uploaded file
unknown_df = pd.read_csv(unknown_file_name)

# Display the first few rows of the unknown data
print("First few rows of the unknown data:")
print(unknown_df.head())

Please upload your unknown data CSV file:


Saving unknown_data_predictions.csv to unknown_data_predictions.csv
First few rows of the unknown data:
         ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0  7.666856  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.716080  129.422921  18630.057858     6.635246  343.198307    592.885359   
2  8.099124  224.236259  19909.541732     9.275884  343.198307    418.606213   
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   

   Organic_carbon  Trihalomethanes  Turbidity  
0       10.379783        86.990970   2.963135  
1       15.180013        56.329076   4.500656  
2       16.868637        66.420093   3.055934  
3       18.436524       100.341674   4.628771  
4       11.558279        31.997993   4.075075  


In [25]:
#@title Preprocess the unknown data
# Handle missing values
unknown_df.fillna(unknown_df.mean(), inplace=True)

# Ensure the unknown data has the same columns as the training data
if not set(unknown_df.columns).issubset(set(X.columns)):
    print("Unknown data columns do not match the training data columns.")
    print("Training data columns:", X.columns)
    print("Unknown data columns:", unknown_df.columns)
    raise ValueError("Columns mismatch between training and unknown data.")

# Standardize the unknown data using the same scaler
unknown_df_scaled = scaler.transform(unknown_df)

In [26]:
#@title Make predictions on the unknown data
predictions = best_model.predict(unknown_df_scaled)

# Add predictions to the unknown data DataFrame
unknown_df['Predicted_Potability'] = predictions

# Display the first few rows of the unknown data with predictions
print("First few rows of the unknown data with predictions:")
print(unknown_df.head())

First few rows of the unknown data with predictions:
         ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0  7.666856  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.716080  129.422921  18630.057858     6.635246  343.198307    592.885359   
2  8.099124  224.236259  19909.541732     9.275884  343.198307    418.606213   
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   

   Organic_carbon  Trihalomethanes  Turbidity  Predicted_Potability  
0       10.379783        86.990970   2.963135                     0  
1       15.180013        56.329076   4.500656                     0  
2       16.868637        66.420093   3.055934                     0  
3       18.436524       100.341674   4.628771                     0  
4       11.558279        31.997993   4.075075                     0  


In [28]:
#@title Save the predictions to a CSV file
output_file_name = 'unknown_data_predictions_1.csv'
unknown_df.to_csv(output_file_name, index=False)

print(f"Predictions saved to {output_file_name}")

Predictions saved to unknown_data_predictions_1.csv


## Today, you learned something new. Isn't great.

[**Data Analysis on Weather Dataset**](https://youtu.be/qm7n8Fc8T74)