### Dummy Classifier

> Setting the bar in machine learning with simple baseline models

A dummy classifier is a simple machine learning model that makes predictions using basic rules, without actually learning from the input data. It serves as a baseline for comparing the performance of more complex models. The dummy classifier helps us understand if our sophisticated models are actually learning useful patterns or just guessing.

The **dummy classifier** operates on simple strategies to make predictions. These strategies ignore the input features entirely; at most, they use the class labels seen in the training set. They follow basic rules like:

- Always predicting the most frequent class
- Randomly predicting a class based on the training set’s class distribution
- Always predicting a specific class

#### Training Steps

##### 1. Select Strategy

Choose one of the following strategies:

- **Stratified**: Makes random guesses that follow the class distribution of the training set.
- **Most Frequent**: Always picks the most common class in the training set.
- **Uniform**: Picks a class at random, with every class equally likely.
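To get a feel for how these strategies differ, the sketch below estimates the accuracy each one would achieve on average. It rests on a loud assumption: the test set has the same class proportions as the training set. The label list is illustrative (9 positives, 5 negatives), and the names `proportions` and `expected_accuracy` are hypothetical.

```python
from collections import Counter

# Illustrative training labels: 9 positives (1) and 5 negatives (0)
y_train = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

# Class proportions observed in the training labels
counts = Counter(y_train)
n = len(y_train)
proportions = {label: count / n for label, count in counts.items()}

# Expected accuracy, ASSUMING the test set shares these proportions:
# - most_frequent: correct whenever the true label is the majority class
# - stratified: correct when the random guess matches the true label, sum of p_k^2
# - uniform: each of the K classes is guessed with probability 1/K
expected_accuracy = {
    "most_frequent": max(proportions.values()),
    "stratified": sum(p ** 2 for p in proportions.values()),
    "uniform": 1 / len(proportions),
}

for name, acc in expected_accuracy.items():
    print(f"{name:>13}: {acc:.4f}")
```

With these labels, the most frequent strategy comes out at roughly 0.64, stratified at roughly 0.54, and uniform at 0.50, which is why the most frequent strategy is often the hardest accuracy baseline to beat on imbalanced data.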

##### 2. Collect Training Labels

Collect the class labels from the training dataset to determine the strategy parameters.

##### 3. Apply Strategy to Test Data

Use the chosen strategy to generate a predicted label for each sample in the test data.
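The three steps above can be sketched from scratch with NumPy. This is a minimal illustration of the idea, not scikit-learn's implementation; the function names `fit_dummy` and `predict_dummy`, the `seed` argument, and the label array are all hypothetical.

```python
import numpy as np

def fit_dummy(y_train):
    """Step 2: collect the training labels and summarize them as class priors."""
    classes, counts = np.unique(y_train, return_counts=True)
    return classes, counts / counts.sum()

def predict_dummy(classes, priors, n_samples, strategy="most_frequent", seed=0):
    """Step 3: generate predictions without looking at any features."""
    rng = np.random.default_rng(seed)
    if strategy == "most_frequent":
        # Repeat the class with the highest prior
        return np.full(n_samples, classes[np.argmax(priors)])
    if strategy == "stratified":
        # Sample classes in proportion to their training frequency
        return rng.choice(classes, size=n_samples, p=priors)
    if strategy == "uniform":
        # Sample classes with equal probability
        return rng.choice(classes, size=n_samples)
    raise ValueError(f"Unknown strategy: {strategy}")

# Illustrative training labels: 9 positives (1) and 5 negatives (0)
y_train = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
classes, priors = fit_dummy(y_train)

# Step 1 is the choice of strategy; try each one in turn
for strategy in ("most_frequent", "stratified", "uniform"):
    print(strategy, predict_dummy(classes, priors, n_samples=5, strategy=strategy))
```

Note that the predictions depend only on the number of test samples, never on their feature values.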

#### Evaluate the Model

```python
# Import libraries
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Choose a strategy for your DummyClassifier (e.g., 'most_frequent', 'stratified', etc.)
strategy = 'most_frequent'

# Make a dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# One-hot encode the 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert the 'Wind' (bool) and 'Play' (Yes/No) columns to 0 and 1
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Set feature matrix X and target vector y
X, y = df.drop(columns='Play'), df['Play']

# Split the data into training and testing sets (first half trains, second half tests)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize the DummyClassifier
dummy_clf = DummyClassifier(strategy=strategy)

# "Train" the DummyClassifier (it only records the class distribution of y_train)
dummy_clf.fit(X_train, y_train)

# Use the DummyClassifier to make predictions
y_pred = dummy_clf.predict(X_test)
print("Label     :", y_test.tolist())
print("Prediction:", y_pred.tolist())

# Evaluate the DummyClassifier's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Dummy Classifier Accuracy: {accuracy:.2%}")
```

```
Label     : [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
Prediction: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Dummy Classifier Accuracy: 64.29%
```


#### Key Parameters

While dummy classifiers are simple, they do have a few important parameters:

1. **Strategy** (`strategy`): This determines how the classifier makes predictions. Common options include:

   - **'most_frequent'**: Always predicts the most common class in the training set.
   - **'stratified'**: Draws each prediction at random according to the training set’s class distribution.
   - **'uniform'**: Draws each prediction uniformly at random from the classes seen in training.
   - **'constant'**: Always predicts a specified class.

2. **Random State** (`random_state`): If using a strategy that involves randomness (like ‘stratified’ or ‘uniform’), fixing this parameter makes the results reproducible.

3. **Constant** (`constant`): When using the ‘constant’ strategy, this parameter specifies which class to always predict.
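As a brief illustration of these parameters, the sketch below uses scikit-learn's `DummyClassifier` with the `constant` and `random_state` arguments. The feature matrix is a placeholder of zeros (the classifier ignores it), and the label array is illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Placeholder features (ignored by the classifier) and illustrative labels
X = np.zeros((14, 1))
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])

# 'constant' strategy: always predict the class passed via `constant`
# (the value must be one of the classes present in y)
constant_clf = DummyClassifier(strategy="constant", constant=0).fit(X, y)
print(constant_clf.predict(X[:5]))

# 'stratified' strategy: random, but repeatable when random_state is fixed
run_a = DummyClassifier(strategy="stratified", random_state=42).fit(X, y).predict(X)
run_b = DummyClassifier(strategy="stratified", random_state=42).fit(X, y).predict(X)
print(np.array_equal(run_a, run_b))
```

Fixing `random_state` matters mostly when a random baseline is reported next to other models, since otherwise the baseline score changes from run to run.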

#### Pros and Cons

Like any tool in machine learning, dummy classifiers have their strengths and limitations.

* **Pros**:

1. Simplicity: Easy to understand and implement.
2. Baseline Performance: Provides a minimum performance benchmark for other models.
3. Sanity Check: If a complex model scores no better than the dummy classifier on held-out data, it has not learned useful patterns, whether because of overfitting, uninformative features, or a bug in the pipeline.
4. Quick to Train and Predict: Requires minimal computational resources.

* **Cons**:

1. Limited Predictive Power: By design, it ignores the input features, so its predictions can never be better than what the class distribution alone allows.
2. No Feature Importance: It doesn’t provide insights into which features are most important for predictions.
3. Not Suitable for Complex Problems: In real-world scenarios with intricate patterns, dummy classifiers are too simplistic to be useful on their own.
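In practice, the dummy classifier is used alongside a real model rather than on its own. The sketch below shows one way such a comparison might look. The data is synthetic and hypothetical (a single feature with a simple threshold rule), and `DecisionTreeClassifier` is just a stand-in for "a more complex model".

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical data: one feature, label is 1 when the feature exceeds 70
X = np.arange(100).reshape(-1, 1)
y = (X.ravel() > 70).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the baseline and the real model on the same training data
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Score both on the same held-out data
baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_acc:.2%}")
print(f"Model accuracy:    {model_acc:.2%}")
print(f"Improvement over baseline: {model_acc - baseline_acc:+.2%}")
```

The number worth reporting is the gap between the two scores: a model accuracy that looks high in isolation may be only a small step above the baseline when one class dominates.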