In [19]:
from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np
from sklearn.utils import check_X_y


## Custom Estimator

In [20]:
# Mixins are listed to the left of BaseEstimator, as scikit-learn recommends
class MostFrequentClassClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        self.most_frequent_ = None

    def fit(self, X, y):

        # Validate input X and target vector y
        X, y = check_X_y(X, y)

        # Ensure y is 1D
        y = np.ravel(y)

        # Manually compute the most frequent class
        unique_classes, counts = np.unique(y, return_counts=True)
        self.most_frequent_ = unique_classes[np.argmax(counts)]

        return self

    def predict(self, X):
        if self.most_frequent_ is None:
            raise ValueError("This classifier instance is not fitted yet.")
        # Predict the most frequent class for each input sample
        return np.full(shape=(X.shape[0],), fill_value=self.most_frequent_)


### Explanation of Custom Estimator Code

#### Class Definition
`MostFrequentClassClassifier` is a custom scikit-learn estimator that inherits from `BaseEstimator` and `ClassifierMixin`.
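
What the two base classes contribute can be seen in a minimal sketch. The class name `ConstantZeroSketch` and the toy data are hypothetical, introduced only for illustration; `get_params`, `clone`, and the inherited `score` are standard scikit-learn API.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ConstantZeroSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical classifier that always predicts label 0."""

    def fit(self, X, y):
        # Record the labels seen during fitting
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return np.zeros(np.asarray(X).shape[0], dtype=int)


model = ConstantZeroSketch()

# From BaseEstimator: parameter introspection and cloning
params = model.get_params()        # no constructor arguments, so an empty dict
copy_of_model = clone(model)       # a new, unfitted copy

# From ClassifierMixin: a default score() that returns mean accuracy
X_demo = np.zeros((4, 1))
y_demo = np.array([0, 0, 1, 0])
inherited_accuracy = model.fit(X_demo, y_demo).score(X_demo, y_demo)
print(params, inherited_accuracy)  # {} 0.75
```

Neither `get_params` nor `score` is written in the class body; both come from the base classes.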

#### `__init__` Method
- **Purpose:** Initializes an instance of the `MostFrequentClassClassifier` class.
- **Attributes:**
  - `self.most_frequent_`: Initially set to `None`; `fit` later stores the most frequent class here. By scikit-learn convention, a trailing underscore marks a value learned during fitting.

#### `fit` Method
- **Purpose:** Fits the model to the training data (`X`, `y`) and learns the most frequent class from `y`.
- **Steps:**
  1. **Input Validation:**
     - Uses `check_X_y(X, y)` to validate the inputs (2D `X`, 1D `y`, matching numbers of samples) and convert them to NumPy arrays.
  2. **Ensure `y` is 1D:**
     - Converts `y` to a 1-dimensional array using `np.ravel(y)`.
  3. **Compute Most Frequent Class:**
     - Utilizes `np.unique(y, return_counts=True)` to find unique classes and their counts.
     - Identifies the most frequent class by indexing into `unique_classes` using `np.argmax(counts)`.
     - Stores the most frequent class in `self.most_frequent_`.
  4. **Return Self:**
     - Returns `self` after fitting.
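
The counting step can be tried in isolation. The sketch below uses made-up label arrays (`y` and `y_tied` are illustrative, not from the notebook) and also shows what happens on a tie: `np.unique` returns sorted labels and `np.argmax` returns the first maximum, so the smallest tied label wins.

```python
import numpy as np

# Toy labels: class 1 appears three times
y = np.array([2, 0, 1, 1, 2, 1])
unique_classes, counts = np.unique(y, return_counts=True)
# unique_classes is [0 1 2] and counts is [1 3 2]
most_frequent = unique_classes[np.argmax(counts)]
print(most_frequent)  # 1

# Tie between classes 0 and 2: the first (smallest) label is chosen
y_tied = np.array([2, 2, 0, 0])
classes_tied, counts_tied = np.unique(y_tied, return_counts=True)
tie_winner = classes_tied[np.argmax(counts_tied)]
print(tie_winner)  # 0
```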

#### `predict` Method
- **Purpose:** Predicts the most frequent class for new data samples `X`.
- **Steps:**
  1. **Check Fitted Status:**
     - Raises a `ValueError` if `self.most_frequent_` is `None`, indicating the estimator has not been fitted.
  2. **Prediction:**
     - Returns an array of the most frequent class (`self.most_frequent_`) repeated for each sample in `X`, shaped `(X.shape[0],)`.
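
As an alternative to the `None` check, scikit-learn offers `check_is_fitted`, which raises `NotFittedError` when a learned attribute is missing. The sketch below is a variation, not the notebook's implementation: the class name `MajorityClassSketch` is hypothetical, and it assumes `most_frequent_` is created only inside `fit` rather than in `__init__`.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class MajorityClassSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical variant that sets most_frequent_ only in fit()."""

    def fit(self, X, y):
        classes, counts = np.unique(np.ravel(y), return_counts=True)
        self.classes_ = classes
        self.most_frequent_ = classes[np.argmax(counts)]
        return self

    def predict(self, X):
        # Raises NotFittedError if fit() has not created most_frequent_
        check_is_fitted(self, "most_frequent_")
        n_samples = np.asarray(X).shape[0]
        return np.full(shape=(n_samples,), fill_value=self.most_frequent_)


# Calling predict before fit triggers the error
try:
    MajorityClassSketch().predict(np.zeros((2, 2)))
    raised = False
except NotFittedError:
    raised = True

# After fitting, predict returns the majority label for every row
fitted = MajorityClassSketch().fit(np.zeros((3, 2)), [0, 1, 1])
fitted_predictions = fitted.predict(np.zeros((2, 2)))
print(raised, fitted_predictions)  # True [1 1]
```

`NotFittedError` subclasses `ValueError`, so code that already catches `ValueError` keeps working.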

#### Usage Note
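- **Functionality:** The estimator ignores the features in `X` and always predicts the class that was most frequent in the training labels, which makes it a simple baseline (the same idea as scikit-learn's `DummyClassifier(strategy="most_frequent")`).
- **Extends:** `BaseEstimator` supplies `get_params` and `set_params`, and `ClassifierMixin` supplies a default `score` method that returns mean accuracy, so the class works with scikit-learn's API.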
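
For comparison, scikit-learn ships a built-in baseline with the same behavior. The sketch below fits `DummyClassifier(strategy="most_frequent")` on made-up data (`X_demo` and `y_demo` are illustrative) to show that it also returns one constant label.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: the features are irrelevant, class 1 is the most frequent label
X_demo = np.zeros((6, 2))
y_demo = np.array([0, 1, 1, 2, 1, 0])

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Every new sample receives the majority label
baseline_pred = baseline.predict(np.zeros((3, 2)))
print(baseline_pred)  # [1 1 1]
```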




In [26]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Initialize and fit the custom estimator
classifier = MostFrequentClassClassifier()
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)

# Evaluate the custom estimator
print(f"Predicted class for all test instances: {predictions[0]}")


Predicted class for all test instances: 1



### Explanation of Code Snippet

#### Loading Iris Dataset
- **Purpose:** Loads the Iris dataset using `load_iris()` from `sklearn.datasets`.
- **Variables:**
  - `iris`: Contains the loaded dataset.
  - `X, y`: Assigns features (`X`) and target labels (`y`) from the loaded dataset (`iris.data`, `iris.target`).

#### Splitting the Data
- **Purpose:** Splits the dataset into training and testing sets using `train_test_split()`.
- **Variables:**
  - `X_train, X_test, y_train, y_test`: Holds the training and testing subsets of features (`X_train`, `X_test`) and target labels (`y_train`, `y_test`).
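
Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the samples. A quick check of the resulting shapes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Default test_size is 0.25; the test count is rounded up (38 of 150)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```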

#### Initializing and Fitting Custom Estimator
- **Purpose:** Creates an instance of the `MostFrequentClassClassifier` custom estimator.
- **Steps:**
  - Initializes `classifier` as an object of `MostFrequentClassClassifier()`.
  - Fits `classifier` using `X_train` and `y_train` via `classifier.fit(X_train, y_train)`.

#### Making Predictions
- **Purpose:** Uses the fitted `classifier` to predict target labels for `X_test`.
- **Steps:**
  - Predicts `predictions` using `classifier.predict(X_test)`.

#### Evaluating the Custom Estimator
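- **Purpose:** Prints the first prediction. Because the classifier predicts the same class for every sample, this value is the prediction for every instance in `X_test`. The cell does not compute an accuracy.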
- **Output:**
  - Displays the predicted class using `print(f"Predicted class for all test instances: {predictions[0]}")`.
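
To turn the printed class into an actual evaluation, the constant predictions can be compared with `y_test`. The following is a self-contained sketch that recomputes the majority class with NumPy instead of reusing the `classifier` object; `accuracy_score` is the standard scikit-learn metric.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Majority class of the training labels, computed as in fit()
classes, counts = np.unique(y_train, return_counts=True)
majority = classes[np.argmax(counts)]

# Constant predictions, as produced by predict()
predictions = np.full(shape=(X_test.shape[0],), fill_value=majority)

# Share of test samples whose true label equals the majority class
test_accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy of the majority-class baseline: {test_accuracy:.3f}")
```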

#### Summary
- **Functionality:** Demonstrates a workflow that fits a custom estimator (`MostFrequentClassClassifier`) on the Iris training split and predicts on the test split.
- **Integration:** The estimator follows scikit-learn's `fit`/`predict` interface, so it works directly with data prepared by standard functions like `train_test_split`.



In [22]:
classifier.most_frequent_

1

### Explanation of `classifier.most_frequent_`

#### Purpose
- **Objective:** `classifier.most_frequent_` is an attribute of the `MostFrequentClassClassifier` custom estimator.
- **Function:** It stores the most frequent class that was determined during the fitting process using the `fit()` method of the custom estimator.

#### Usage
- **During Fitting:** When the `fit()` method of `MostFrequentClassClassifier` is called, `classifier.most_frequent_` is computed.
  - **Computing Process:** The `fit()` method calculates the most frequent class by analyzing the target labels (`y`) provided during training.
  - **Storage:** The computed most frequent class is stored in `classifier.most_frequent_` for later use during predictions.

#### Accessing Predictions
- **During Prediction:** When `predict()` method is called on `classifier`, `classifier.most_frequent_` is used to predict the class for new data points.
  - **Predicting Process:** `predict()` method utilizes `classifier.most_frequent_` to fill the prediction array with the most frequent class determined during training.
  - **Output:** The predictions are generated based on this stored value, ensuring consistent results across all test instances.



In [23]:
from sklearn.model_selection import cross_val_score

cross_val_score(classifier, X_train, y_train)

array([0.34782609, 0.34782609, 0.31818182, 0.36363636, 0.36363636])

### Explanation of `cross_val_score(classifier, X_train, y_train)`

#### Purpose
- **Objective:** `cross_val_score` is a function from `sklearn.model_selection` used to perform cross-validation for evaluating estimator performance.
- **Parameters:** 
  - `classifier`: The estimator object (`MostFrequentClassClassifier` in this case) to be evaluated.
  - `X_train`: Training data features.
  - `y_train`: Training data labels.

#### Cross-Validation Process
- **Methodology:** 
  - **Splitting Data:** `X_train` and `y_train` are split into K folds (5 by default; stratified by class when the estimator is a classifier).
  - **Training and Evaluation:**
    - For each fold, a fresh copy of the estimator is trained on the other K-1 folds and tested on the held-out fold.
    - With no `scoring` argument, each fold is scored with the estimator's own `score` method, which is accuracy here (inherited from `ClassifierMixin`).
  - **Aggregation:** 
    - The function returns an array of scores, where each score corresponds to the accuracy of the estimator on each fold of the cross-validation process.
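
The process above can be sketched as an explicit loop. This is an illustration under assumptions, not the library's internal code: it uses `DummyClassifier(strategy="most_frequent")` as a stand-in for the custom classifier and the full Iris data rather than `X_train`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
estimator = DummyClassifier(strategy="most_frequent")

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit a fresh, unfitted copy on the training folds
    fold_model = clone(estimator).fit(X[train_idx], y[train_idx])
    # Score it on the held-out fold with the estimator's own score()
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# The library call should produce the same five numbers
library_scores = cross_val_score(estimator, X, y, cv=5)
print(np.allclose(manual_scores, library_scores))
```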

#### Output
- **Output Array:** 
  - The array `[0.34782609, 0.34782609, 0.31818182, 0.36363636, 0.36363636]` represents the accuracy scores obtained for each fold during cross-validation.
  - Each score indicates the accuracy of predictions made by `classifier` on the respective fold of `X_train` and `y_train`.

#### Interpretation
- **Evaluation:** 
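  - The scores show how `MostFrequentClassClassifier` performs across different subsets of the training data.
  - Scores of roughly 0.32 to 0.36 are what a constant prediction should achieve on three nearly balanced classes: the classifier ignores `X`, so its accuracy equals the share of the predicted class in each held-out fold. These values are a baseline that any real model should beat, not evidence of poor generalization.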
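
A quick way to see where such scores come from is to look at the class proportions in the training labels. The sketch below repeats the notebook's split and computes each class's share; the variable names `class_share` and `majority_share` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fraction of training samples in each of the three classes
class_share = np.bincount(y_train) / y_train.size

# A majority-class predictor cannot do better than this on the training labels
majority_share = class_share.max()
print(class_share, majority_share)
```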

#### Conclusion
- **Utility:** 
  - `cross_val_score` facilitates robust evaluation of estimator performance through cross-validation, offering a more reliable assessment compared to a single train-test split.
  - This estimator has no hyperparameters to tune, so here the scores serve mainly as a baseline for comparing other models on the same data (`X_train`, `y_train`).



### Using a Scoring Function

In [24]:
from sklearn.metrics import accuracy_score  # needed by score() below


class MostFrequentClassClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        self.most_frequent_ = None

    def fit(self, X, y):
        # Ensure y is 1D
        y = np.ravel(y)

        # Compute the most frequent class
        unique_classes, counts = np.unique(y, return_counts=True)
        self.most_frequent_ = unique_classes[np.argmax(counts)]
        return self

    def predict(self, X):
        if self.most_frequent_ is None:
            raise ValueError("This classifier instance is not fitted yet.")
        # Predict the most frequent class for each input sample
        return np.full(shape=(X.shape[0],), fill_value=self.most_frequent_)

    def score(self, X, y):
        """Return the mean accuracy on the given test data and labels."""
        # Ensure y is 1D
        y = np.ravel(y)

        # Generate predictions
        predictions = self.predict(X)

        # Calculate and return the accuracy
        return accuracy_score(y, predictions)


In [25]:
# Load a dataset
iris = load_iris()
X, y = iris.data, iris.target

# Simplify to a binary classification problem
is_class_0_or_1 = y < 2
X_bin = X[is_class_0_or_1]
y_bin = y[is_class_0_or_1]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin, test_size=0.2, random_state=42)

# Initialize and fit the custom classifier
classifier = MostFrequentClassClassifier()
classifier.fit(X_train, y_train)

# Evaluate the classifier using the score method
score = classifier.score(X_test, y_test)
print(f"Accuracy of the MostFrequentClassClassifier: {score}")

Accuracy of the MostFrequentClassClassifier: 0.4


### Scoring Function in Custom Estimator

#### Definition and Purpose:
The `score` method evaluates the estimator on given test data and labels. A custom estimator that inherits from `ClassifierMixin` already has a `score` method that returns mean accuracy, so the override above reproduces the default behavior explicitly. Overriding becomes useful when a different metric is needed.

#### Detailed Explanation:

**Ensuring Target Vector is 1D:**

`y = np.ravel(y)`: Converts the target vector `y` to a 1-dimensional array to ensure compatibility with other scikit-learn functions.

**Generating Predictions:**

`predictions = self.predict(X)`: Uses the `predict` method of the estimator to generate predictions for the input data `X`.

**Calculating Accuracy:**

`accuracy_score(y, predictions)`: Computes the accuracy of the predictions by comparing them with the true labels `y`. Accuracy is the ratio of the number of correct predictions to the total number of predictions.

**Returning the Accuracy:**

The `score` method returns the calculated accuracy, which is a measure of how well the classifier performs on the test data.
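
As a small worked example of the accuracy calculation, consider five made-up labels and a constant prediction of class 1 (both arrays are illustrative). Three of the five predictions are correct, so the accuracy is 3/5.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.full(shape=(5,), fill_value=1)  # constant prediction

# Accuracy by hand: correct predictions divided by total predictions
manual_accuracy = np.mean(y_true == y_pred)

# The same value from scikit-learn
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)  # 0.6 0.6
```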

#### Importance of the `score` Method:

**Performance Evaluation:**

The `score` method provides a standardized way to evaluate the performance of the custom estimator on unseen data. This is essential for assessing the generalizability of the model.

**Integration with scikit-learn Utilities:**

Utilities such as `cross_val_score` and `GridSearchCV` call the estimator's `score` method when no `scoring` argument is given, so whatever `score` returns is the value they report and compare. Passing `scoring` explicitly (for example `scoring="accuracy"`) bypasses the estimator's own `score`. `train_test_split` only partitions the data and never calls `score`.
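
How `cross_val_score` interacts with an overridden `score` can be sketched as follows. The class `MajorityF1Sketch` is hypothetical and exists only for this illustration; it sets `classes_` in `fit` because scikit-learn's built-in scorers may read that attribute from classifiers.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score


class MajorityF1Sketch(ClassifierMixin, BaseEstimator):
    """Hypothetical majority-class classifier whose score() is weighted F1."""

    def fit(self, X, y):
        classes, counts = np.unique(np.ravel(y), return_counts=True)
        self.classes_ = classes
        self.most_frequent_ = classes[np.argmax(counts)]
        return self

    def predict(self, X):
        n_samples = np.asarray(X).shape[0]
        return np.full(shape=(n_samples,), fill_value=self.most_frequent_)

    def score(self, X, y):
        # zero_division=0 silences the warning for never-predicted classes
        return f1_score(y, self.predict(X), average="weighted", zero_division=0)


X, y = load_iris(return_X_y=True)

# No scoring argument: the estimator's own score() (weighted F1) is used
default_scores = cross_val_score(MajorityF1Sketch(), X, y, cv=5)

# Explicit scoring: the overridden score() is bypassed
accuracy_scores = cross_val_score(MajorityF1Sketch(), X, y, cv=5, scoring="accuracy")
print(default_scores, accuracy_scores)
```

The two arrays differ because the first reports weighted F1 and the second reports accuracy for the same constant predictions.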

#### Inheritance from `ClassifierMixin`:

**Default Scoring Method:**

By inheriting from `ClassifierMixin`, the custom estimator inherits a default implementation of the `score` method, which calculates accuracy. This ensures that even without explicitly defining the `score` method, the custom classifier can still be evaluated using scikit-learn's built-in tools.

**Flexibility:**

While the default scoring method is accuracy, we can override the `score` method to implement a different evaluation metric, which changes what `score` (and any utility that relies on it) reports. For example, the classifier below overrides `score` to return the weighted F1 score.


In [29]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import f1_score
import numpy as np

class CustomF1ScoreClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        self.most_frequent_ = None

    def fit(self, X, y):
        # Ensure y is 1D
        y = np.ravel(y)

        # Compute the most frequent class
        unique_classes, counts = np.unique(y, return_counts=True)
        self.most_frequent_ = unique_classes[np.argmax(counts)]
        return self

    def predict(self, X):
        if self.most_frequent_ is None:
            raise ValueError("This classifier instance is not fitted yet.")
        # Predict the most frequent class for each input sample
        return np.full(shape=(X.shape[0],), fill_value=self.most_frequent_)

    def score(self, X, y):
        """Return the F1 score on the given test data and labels."""
        # Ensure y is 1D
        y = np.ravel(y)

        # Generate predictions
        predictions = self.predict(X)

        # Calculate and return the F1 score
        return f1_score(y, predictions, average='weighted')

# Load a dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Simplify to a binary classification problem
is_class_0_or_1 = y < 2
X_bin = X[is_class_0_or_1]
y_bin = y[is_class_0_or_1]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin, test_size=0.2, random_state=42)

# Initialize and fit the custom classifier
classifier = CustomF1ScoreClassifier()
classifier.fit(X_train, y_train)

# Evaluate the classifier using the custom score method
score = classifier.score(X_test, y_test)
print(f"F1 Score of the CustomF1ScoreClassifier: {score}")


F1 Score of the CustomF1ScoreClassifier: 0.22857142857142856
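
The reported value can be reproduced by hand from the outputs shown above. The arithmetic below assumes 20 test samples (20% of the 100 samples in Iris classes 0 and 1) and 8 samples of the predicted class among them (from the accuracy of 0.4 on the same split). The predicted class has precision 8/20 and recall 1; the other class is never predicted, so its F1 is 0.

```python
n_test = 20      # assumed: test_size=0.2 applied to 100 binary samples
n_majority = 8   # assumed: accuracy 0.4 on the same split means 8 correct

# F1 of the class that is always predicted
precision = n_majority / n_test   # 8 of the 20 predictions are correct
recall = 1.0                      # every true sample of this class is found
f1_predicted = 2 * precision * recall / (precision + recall)

# F1 of the class that is never predicted
f1_other = 0.0

# Weighted average: each class weighted by its share of the true labels
weight_predicted = n_majority / n_test
weighted_f1 = weight_predicted * f1_predicted + (1 - weight_predicted) * f1_other
print(round(weighted_f1, 4))  # 0.2286
```

The weighted F1 falls below the accuracy of 0.4 because the class that is never predicted contributes an F1 of 0 with a weight of 0.6.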
