# ML Analyzer: Predict & Classify Any Dataset

This notebook implements a machine learning analyzer that can perform classification, regression, and clustering tasks on datasets. The implementation includes preprocessing, model training, evaluation, and visualization.

## 1. Import Libraries

First, we import all the necessary libraries for data manipulation, visualization, and machine learning. These libraries provide the foundation for our ML Analyzer application.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display  # used by the data preview methods
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, mean_absolute_error, mean_squared_error, r2_score
)
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

## 2. ML Analyzer Class Definition

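Here we define the main `MLAnalyzer` class, which handles all of the functionality. Its constructor initializes the state used throughout the analysis. The methods defined in Sections 3 through 12 are indented because they belong to this class; to run the notebook, place them in the same cell as the class definition.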

In [None]:
class MLAnalyzer:
    def __init__(self):
        # Initialize variables
        self.df = None
        self.feature_cols = []
        self.target_col = None
        self.task = "classification"
        self.algorithm = None
        self.label_encoders = {}
        self.model = None
        self.random_data = None

## 3. Data Loading and Preview Functions

These functions handle loading the dataset from a CSV file and displaying basic information about the data. They allow us to examine the columns and preview the first few rows of the dataset.

In [None]:
    def upload_csv(self, file_path):
        try:
            self.df = pd.read_csv(file_path)
            print(f"Data loaded successfully with shape: {self.df.shape}")
            return True
        except Exception as e:
            print(f"Failed to load data: {str(e)}")
            return False
    
    def update_column_display(self):
        if self.df is not None:
            print("\nColumns in dataset:")
            for i, col in enumerate(self.df.columns):
                print(f"{i}: {col}")
    
    def update_data_preview(self):
        if self.df is not None:
            print("\nData Preview:")
            display(self.df.head())

## 4. Algorithm Selection Function

This function sets the available algorithms based on the selected task (classification, regression, or clustering) and selects the first one as the default. Any task value other than classification or regression is treated as clustering.

In [None]:
    def update_algorithm_dropdown(self):
        if self.task == "classification":
            algorithms = ["KNN", "SVM", "Decision Tree"]
            self.algorithm = algorithms[0]
        elif self.task == "regression":
            algorithms = ["Linear Regression"]
            self.algorithm = algorithms[0]
        else:  # clustering
            algorithms = ["KMeans"]
            self.algorithm = algorithms[0]
        
        print(f"Available algorithms for {self.task}: {algorithms}")
        print(f"Selected algorithm: {self.algorithm}")
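The task-to-algorithm mapping can also be written as a lookup table. The following is a minimal sketch, not part of the class: `ALGORITHMS_BY_TASK` and `available_algorithms` are hypothetical names introduced for illustration, and the algorithm lists simply mirror the ones in the method above.

```python
# Hypothetical lookup table; the entries mirror the options listed in the method above.
ALGORITHMS_BY_TASK = {
    "classification": ["KNN", "SVM", "Decision Tree"],
    "regression": ["Linear Regression"],
    "clustering": ["KMeans"],
}


def available_algorithms(task):
    """Return the algorithm names for a task, raising on an unknown task name."""
    try:
        return ALGORITHMS_BY_TASK[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None


# The first entry acts as the default, as in the method above.
default_algorithm = available_algorithms("regression")[0]
print(default_algorithm)
```

Unlike the `if`/`elif`/`else` chain, this version rejects a misspelled task name instead of silently falling back to clustering.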

## 5. Model Configuration Function

This function prepares the data for modeling by validating inputs, preprocessing the data, and setting up the feature and target columns. It ensures that all necessary conditions are met before proceeding to model training.

In [None]:
    def continue_to_model(self):
        if self.df is None:
            print("Error: Please upload a CSV file first.")
            return False
        
        if not self.target_col and self.task != "clustering":
            print("Error: Please select a target column.")
            return False
        
        # Preprocess data
        self.preprocess_data()
        
        # Update feature columns
        if self.task != "clustering":
            self.feature_cols = [col for col in self.df.columns if col != self.target_col]
        else:
            self.feature_cols = list(self.df.columns)
        
        print(f"\nFeature columns: {self.feature_cols}")
        if self.task != "clustering":
            print(f"Target column: {self.target_col}")
        
        return True

## 6. Data Preprocessing Function

This function fills missing values and encodes categorical features. Missing values in numeric columns are replaced with the column mean, and missing values in other columns with the most frequent value (the mode). Label encoding then converts categorical text values into integer codes that machine learning algorithms can use.

In [None]:
    def preprocess_data(self):
        if self.df is None:
            return

        self.label_encoders = {}

        # Fill missing values: mean for numeric columns, mode for all others
        for col in self.df.columns:
            if not self.df[col].isna().any():
                continue  # nothing to fill in this column
            if pd.api.types.is_numeric_dtype(self.df[col]):
                self.df[col] = self.df[col].fillna(self.df[col].mean())
                print(f"Filled missing values in {col} with mean")
            else:
                self.df[col] = self.df[col].fillna(self.df[col].mode()[0])
                print(f"Filled missing values in {col} with mode")

        # Label encode non-numeric (categorical) columns
        for col in self.df.columns:
            if not pd.api.types.is_numeric_dtype(self.df[col]):
                le = LabelEncoder()
                self.df[col] = le.fit_transform(self.df[col])
                self.label_encoders[col] = le
                print(f"Label encoded column {col}")
        
        print("\nPreprocessed data:")
        display(self.df.head())
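The effect of these two steps is easiest to see on a tiny table. The sketch below is standalone and uses made-up data (the `toy` DataFrame and its column names are assumptions for illustration); it applies mean imputation, mode imputation, and label encoding in the same order as the method above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data: one numeric and one text column, each with a missing entry.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "plan": ["pro", "basic", None, "basic"],
})

# Numeric column: replace NaN with the mean of the observed values, (20 + 40 + 30) / 3 = 30.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Text column: replace the missing entry with the most frequent value ("basic").
toy["plan"] = toy["plan"].fillna(toy["plan"].mode()[0])

# Label encoding assigns integer codes to the sorted unique labels: basic -> 0, pro -> 1.
encoder = LabelEncoder()
toy["plan"] = encoder.fit_transform(toy["plan"])

print(toy)
print(list(encoder.classes_))
```

Keeping the fitted `encoder` around matters: its `classes_` attribute is the only record of which integer code stands for which original label.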

## 7. Random Data Generation Function

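This function generates one random sample based on the feature columns in the dataset. Each value is drawn from the observed range (minimum to maximum) of its feature: a random integer for integer columns and a uniform random float otherwise. The sample can then be used for making predictions with the trained model.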

In [None]:
    def generate_random_data(self):
        if self.df is None or not self.feature_cols:
            print("Error: Please upload CSV and configure first.")
            return
        
        try:
            random_data = {}

            for feature in self.feature_cols:
                # Only numeric columns that exist in the DataFrame can be sampled
                if feature not in self.df.columns:
                    continue
                if not pd.api.types.is_numeric_dtype(self.df[feature]):
                    continue

                min_val = self.df[feature].min()
                max_val = self.df[feature].max()

                if pd.api.types.is_integer_dtype(self.df[feature]):
                    # randint excludes the upper bound, so add 1 to include max_val
                    value = np.random.randint(int(min_val), int(max_val) + 1)
                else:
                    value = np.random.uniform(min_val, max_val)

                random_data[feature] = value
            
            self.random_data = random_data
            
            print("\nGenerated Random Data:")
            for feature, value in random_data.items():
                print(f"{feature}: {value}")
            
        except Exception as e:
            print(f"Error generating random data: {str(e)}")
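The same sampling idea can be sketched as a standalone function with a seed for reproducibility. This is an illustrative variant, not the class method: `sample_row_within_ranges` and the `toy` DataFrame are hypothetical, and it uses NumPy's `default_rng` generator rather than the legacy `np.random` functions used above.

```python
import numpy as np
import pandas as pd


def sample_row_within_ranges(df, feature_cols, seed=None):
    """Draw one value per numeric feature, bounded by the column's observed min and max."""
    rng = np.random.default_rng(seed)
    row = {}
    for col in feature_cols:
        lo, hi = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            # rng.integers excludes the upper bound by default, so add 1 to include the max.
            row[col] = int(rng.integers(int(lo), int(hi) + 1))
        else:
            row[col] = float(rng.uniform(lo, hi))
    return row


# Made-up data for illustration.
toy = pd.DataFrame({"age": [18, 35, 60], "income": [1200.5, 3100.0, 2500.25]})
sample = sample_row_within_ranges(toy, ["age", "income"], seed=0)
print(sample)
```

Note that each feature is sampled independently, so the generated row respects the per-column ranges but not any correlation between columns.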

## 8. Model Training Function

This function serves as the main entry point for model training. It determines which specific training function to call based on the selected task (classification, regression, or clustering) and handles any errors that might occur during training.

In [None]:
    def train_model(self):
        if self.df is None:
            print("Error: Please upload CSV first.")
            return
        
        try:
            print(f"\nTraining {self.task} model with {self.algorithm} algorithm...")
            
            if self.task == "classification":
                self.train_classification_model(self.algorithm)
            elif self.task == "regression":
                self.train_regression_model()
            else:  # clustering
                self.train_clustering_model()
            
            self.generate_random_data()
            
        except Exception as e:
            print(f"Error training model: {str(e)}")
            import traceback
            traceback.print_exc()

## 9. Classification Model Training Function

This function handles the training of classification models (KNN, SVM, or Decision Tree). It splits the data into training and testing sets, trains the selected model, evaluates its performance using various metrics, and visualizes the confusion matrix.

In [None]:
    def train_classification_model(self, algorithm):

        X = self.df[self.feature_cols]
        y = self.df[self.target_col]
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        print(f"Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}")
        
        if algorithm == "KNN":
            self.model = KNeighborsClassifier(n_neighbors=3)
            print("Using KNN classifier with n_neighbors=3")
        elif algorithm == "SVM":
            self.model = svm.SVC(kernel='rbf')
            print("Using SVM classifier with rbf kernel")
        else:  # Decision Tree
            self.model = DecisionTreeClassifier(criterion="entropy", max_depth=3)
            print("Using Decision Tree classifier with entropy criterion and max_depth=3")
        
        self.model.fit(X_train, y_train)
        print("Model trained successfully")
        
        y_pred = self.model.predict(X_test)
        
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
        recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
        f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
        cm = confusion_matrix(y_test, y_pred)
        
        print(f"\nClassification Metrics:")
        print(f"Accuracy: {accuracy:.3f}")
        print(f"Precision: {precision:.3f}")
        print(f"Recall: {recall:.3f}")
        print(f"F1 Score: {f1:.3f}")
        print(f"\nConfusion Matrix:\n{cm}")
        
        plt.figure(figsize=(8, 6))
        plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
        plt.title("Confusion Matrix")
        plt.colorbar()
        
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                plt.text(j, i, str(cm[i, j]), ha="center", va="center", color="black")
        
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.tight_layout()
        plt.show()
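The metrics above use `average='weighted'`, which averages the per-class scores weighted by how many true samples each class has. The following sketch checks this on a small made-up label set by recomputing the weighted precision by hand; the arrays `y_true` and `y_pred` are illustrative and not taken from any dataset.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for three classes.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-class precision: class 0 -> 2/2, class 1 -> 2/3, class 2 -> 1/1.
per_class = precision_score(y_true, y_pred, average=None, zero_division=0)

# Support is the number of true samples per class: 3, 2, and 1.
support = np.bincount(y_true)

# Weighted average computed by hand: (3 * 1 + 2 * 2/3 + 1 * 1) / 6 = 8/9.
manual_weighted = float(np.sum(per_class * support) / support.sum())

# The same value as reported by scikit-learn.
weighted = float(precision_score(y_true, y_pred, average="weighted", zero_division=0))

print(manual_weighted, weighted)
```

Because the weights follow class frequency, a weighted score on an imbalanced dataset is dominated by the majority class.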

## 10. Regression Model Training Function

This function handles the training of regression models (Linear Regression). It splits the data, trains the model, evaluates its performance using regression metrics (MAE, RMSE, R²), and visualizes the relationship between actual and predicted values with a scatter plot.

In [None]:
    def train_regression_model(self):
        X = self.df[self.feature_cols]
        y = self.df[self.target_col]
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        print(f"Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}")
        
        self.model = LinearRegression()
        self.model.fit(X_train, y_train)
        print("Linear Regression model trained successfully")
        
        y_pred = self.model.predict(X_test)
        
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        print(f"\nRegression Metrics:")
        print(f"MAE: {mae:.3f}")
        print(f"RMSE: {rmse:.3f}")
        print(f"R² Score: {r2:.3f}")
        
        df_compare = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
        print(f"\nActual vs Predicted (first 5 rows):")
        display(df_compare.head())
        
        plt.figure(figsize=(10, 6))
        plt.scatter(y_test, y_pred, alpha=0.6)
        plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
        plt.xlabel("Actual")
        plt.ylabel("Predicted")
        plt.title("Actual vs Predicted Scatter Plot")
        plt.tight_layout()
        plt.show()
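To make the three regression metrics concrete, the sketch below recomputes MAE, RMSE, and R² with plain NumPy on four made-up values and compares them with the scikit-learn functions used above. The arrays are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true  # [-1, 0, 1, 2]

# MAE: mean of absolute errors, (1 + 0 + 1 + 2) / 4 = 1.0
mae_manual = float(np.mean(np.abs(errors)))

# RMSE: square root of the mean squared error, sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
rmse_manual = float(np.sqrt(np.mean(errors ** 2)))

# R²: 1 - SS_res / SS_tot, with SS_res = 6 and SS_tot = 20, giving 0.7
ss_res = float(np.sum(errors ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2_manual = 1.0 - ss_res / ss_tot

# Same metrics from scikit-learn.
mae = mean_absolute_error(y_true, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
r2 = r2_score(y_true, y_pred)

print(mae_manual, rmse_manual, r2_manual)
```

MAE and RMSE are in the units of the target, while R² is unitless and compares the model against always predicting the mean.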

## 11. Clustering Model Training Function

This function handles the training of clustering models (KMeans). It identifies natural groupings in the data without requiring a target variable. The model is fit on the first two numeric columns only, which also serve as the axes of the cluster plot. The function then displays the cluster centers and the number of samples per cluster.

In [None]:
    def train_clustering_model(self):
        numeric_cols = self.df.select_dtypes(include='number').columns[:2]
        if len(numeric_cols) < 2:
            print("Error: Need at least 2 numeric columns for clustering")
            return

        # The model is fit on these two columns only, so record them as the
        # feature columns; predict() must later supply exactly the same columns.
        self.feature_cols = list(numeric_cols)
        X = self.df[numeric_cols]
        print(f"Using columns {numeric_cols[0]} and {numeric_cols[1]} for clustering")
        
        kmeans = KMeans(n_clusters=3, random_state=42)
        clusters = kmeans.fit_predict(X)
        self.model = kmeans
        print("KMeans model trained successfully")
        
        print(f"\nKMeans Clustering:")
        print(f"Number of clusters: 3")
        print(f"Cluster centers:\n{kmeans.cluster_centers_}")
        
        unique, counts = np.unique(clusters, return_counts=True)
        for cluster, count in zip(unique, counts):
            print(f"Cluster {cluster}: {count} samples")
        
        plt.figure(figsize=(10, 6))
        
        for i in range(3):
            cluster_points = X[clusters == i]
            plt.scatter(cluster_points.iloc[:, 0], cluster_points.iloc[:, 1], label=f'Cluster {i}')
        
        plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
                  s=200, marker='*', c='red', label='Centroids')
        
        plt.xlabel(numeric_cols[0])
        plt.ylabel(numeric_cols[1])
        plt.title("KMeans Clustering")
        plt.legend()
        plt.tight_layout()
        plt.show()
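A fitted KMeans model can assign new points to clusters only if they have the same features, in the same order, as the training data. The sketch below illustrates this on made-up 2D points; the `points` array and the explicit `n_init=10` setting are assumptions for the example, not values taken from the method above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points forming three well-separated groups.
points = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
], dtype=float)

# n_init is set explicitly so the behavior does not depend on the library default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

# A new point must have the same two features, in the same order, as the training data.
new_point = np.array([[10.5, 10.5]])
new_label = kmeans.predict(new_point)[0]

print(labels)
print(new_label == labels[3])  # compares against the cluster of the point (10, 10)
```

The cluster numbers themselves are arbitrary, which is why the example compares the new label against the label of a known neighbor rather than against a fixed number.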

## 12. Prediction Function

This function uses the trained model to make predictions on new data. It takes the randomly generated data and passes it through the model, then displays the prediction results in a user-friendly format based on the task type.

In [None]:
    def predict(self):
        if not hasattr(self, 'model') or self.model is None:
            print("Error: Please train a model first.")
            return
        
        if not hasattr(self, 'random_data') or self.random_data is None:
            print("Error: Please generate random data first.")
            return
        
        try:
            input_df = pd.DataFrame([self.random_data])
            
            if self.task == "classification":
                prediction = self.model.predict(input_df)
                
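                pred_value = prediction[0]
                # Decode the class if the target column was label encoded;
                # otherwise report the predicted class value as-is
                encoder = self.label_encoders.get(self.target_col)
                if encoder is not None:
                    result = encoder.inverse_transform([pred_value])[0]
                else:
                    result = pred_value

                print(f"\nPrediction: {result}")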
                
            elif self.task == "regression":
                prediction = self.model.predict(input_df)
                print(f"\nPredicted value: {prediction[0]:.2f}")
                
            else:  # clustering
                cluster = self.model.predict(input_df)
                print(f"\nPredicted cluster: {cluster[0]}")
            
        except Exception as e:
            print(f"Error making prediction: {str(e)}")

## 13. Using the ML Analyzer

Now we'll demonstrate how to use the ML Analyzer class we've defined. We'll create an instance of the analyzer and walk through the complete workflow from data loading to prediction.

In [None]:
analyzer = MLAnalyzer()

## Step 1: Load Dataset

First, we need to load a dataset from a CSV file. This step reads the data into a pandas DataFrame and displays basic information about it.

In [None]:
file_path = "your_dataset.csv"  # placeholder: replace with the path to your CSV file
analyzer.upload_csv(file_path)

analyzer.update_column_display()
analyzer.update_data_preview()

## Step 2: Configure Analysis

Next, we configure the analysis by selecting the target column, task type, and algorithm. This step also preprocesses the data and prepares it for model training.

In [None]:
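analyzer.target_col = "Subscribed"  # placeholder: replace with a column name from your dataset
analyzer.task = "classification"  # one of "classification", "regression", "clustering"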
analyzer.update_algorithm_dropdown()

analyzer.continue_to_model()

## Step 3: Train Model

Now we train the model using the selected algorithm. This step will split the data, train the model, evaluate its performance, and display relevant metrics and visualizations.

In [None]:
analyzer.train_model()

## Step 4: Make Predictions

Finally, we generate random data and use the trained model to make predictions. This demonstrates how the model can be used to predict outcomes for new data points.

In [None]:
if analyzer.random_data is None:
    analyzer.generate_random_data()

analyzer.predict()