# Metadata

**Python Topics** - Machine Learning using scikit-learn

# Conversation

**Human**

Do you know the iris dataset for ml? I wanted to explore scikit learn algorithms on it. Give me the code to solve it using some algo from scikit learn. I'm in an empty folder right now. linux

**LLM**

Yes, the iris dataset is a classic toy dataset, commonly used to demonstrate machine learning classification algorithms.

scikit-learn has multiple algorithms capable of "solving" it. Let's start with a simple logistic regression.

First, we need to set up an environment with the scikit-learn Python package installed.

Assuming that you are in a project root directory, let's set up a virtual environment for this task.

Run the following in the command line:

```
python -m venv venv
```

This will create a folder named venv with the virtual environment for you.

Next, activate it and install the scikit-learn package using pip:

```
source venv/bin/activate && pip install scikit-learn
```
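Once the install finishes, a quick sanity check like the one below can confirm that the package is importable from the activated environment. This is an optional step I'm suggesting, and the variable name `installed_version` is just for illustration:

```python
# Optional sanity check: run with the virtual environment activated.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If this raises `ModuleNotFoundError`, the virtual environment is most likely not active in the current shell.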

With that out of the way, here's code that trains a classifier on this dataset. Save it as `iris_tests.py`:
```
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

logreg = LogisticRegression(max_iter=200)

logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Classification accuracy using logistic regression:", accuracy)
```

All you have to do is run `python iris_tests.py`.

**Human**

I got 1.0. That was pretty good. Why 42 though?

**LLM**

Just keeping with tradition: 42 is a humorous number from Douglas Adams's book The Hitchhiker's Guide to the Galaxy, where a supercomputer gives it as the answer to "the Ultimate Question of Life, the Universe, and Everything". It has become an inside joke among programmers, who often use it as a seed for a random number generator. The specific value doesn't matter; fixing `random_state` to any integer simply makes the train/test split reproducible from run to run.
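To see what the seed actually does, here is a small sketch on toy data (the arrays are made up purely for illustration). It splits the same data twice with `random_state=42` and checks that both splits pick the same test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Splitting twice with the same seed selects the same test rows.
_, _, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(y_test_a, y_test_b)

print("Same seed gives the same test rows:", same_split)
```

Any other integer would work just as well; what matters is that the value stays fixed between runs you want to compare.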

**Human**

lol. Alright.

Back to the scikit, now that I know it works, show me multiple algos to solve it at once.

**LLM**

The package has quite a few ML algorithms that would work on this dataset. Let's try several of them.

Create a new file called `many_classifiers_on_iris.py` with the following code:

```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features by removing the mean and scaling to unit variance for the algorithms that benefit from it
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)


# Gather all the algorithms into a single list; scikit-learn estimators share a common fit/predict API, which makes this convenient
algorithms = []

# Initialize Logistic Regression with the same settings as before
logreg = LogisticRegression(max_iter=200)
algorithms.append({'algo': logreg, 'scale': True})

# Initialize the K-Nearest Neighbors classifier with n_neighbors=3
knn = KNeighborsClassifier(n_neighbors=3)
algorithms.append({'algo': knn, 'scale': True})

# Create a SVM Classifier
svm_classifier = SVC(kernel='linear', random_state=42)
algorithms.append({'algo': svm_classifier, 'scale': True})

# Creating a Decision Tree Classifier
clf = DecisionTreeClassifier()
algorithms.append({'algo': clf, 'scale': False})

# Initialize RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
algorithms.append({'algo': rf_classifier, 'scale': False})

# Initialize GradientBoostingClassifier (tree-based, so it does not need scaled features)
gb_clf = GradientBoostingClassifier(random_state=42)
algorithms.append({'algo': gb_clf, 'scale': False})

# Creating a Gaussian Naive Bayes model
gnb = GaussianNB()
algorithms.append({'algo': gnb, 'scale': False})

# Now we can loop over the algorithms list and apply each one
for algorithm in algorithms:
    current_x = X_train
    current_x_test = X_test
    
    if algorithm['scale']:
        current_x = scaled_X_train
        current_x_test = scaled_X_test
    
    algorithm['algo'].fit(current_x, y_train)
    y_pred = algorithm['algo'].predict(current_x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Classification accuracy using", type(algorithm['algo']).__name__, ":", accuracy)

```

You can now run `python many_classifiers_on_iris.py` and see the accuracy printed for each one of them.

**Human**

Got 1.0 for almost all of them. Do you know why SVM is called SVC not SVMC?

**LLM**

In SVM (Support Vector Machine), "machine" just means a system that learns from data to make predictions. When an SVM is used to sort data into categories, the task is called Support Vector Classification, hence `SVC` in scikit-learn. The suffix names the task: the same module also provides `SVR` for Support Vector Regression.
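If you want to check the naming convention yourself, a short sketch like this compares the classification and regression variants against scikit-learn's mixin base classes:

```python
from sklearn.base import ClassifierMixin, RegressorMixin
from sklearn.svm import SVC, SVR

# The "C" variant is a classifier, the "R" variant is a regressor.
svc_is_classifier = issubclass(SVC, ClassifierMixin)
svr_is_regressor = issubclass(SVR, RegressorMixin)

print("SVC is a classifier:", svc_is_classifier)
print("SVR is a regressor:", svr_is_regressor)
```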

**Human**

I'm wondering about a more difficult dataset, do you have any in mind that would worsen the results? datasets from scikit datasets

**LLM**

To find a more challenging dataset, we can move the classifiers into a separate module and expose them as a function that takes any dataset. Running it on a few datasets with the same hyperparameters will show which ones are harder.

Create a file `classifiers_set.py`:

```
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_classifiers(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Standardize features by removing the mean and scaling to unit variance for the algorithms that benefit from it
    scaler = StandardScaler()
    scaled_X_train = scaler.fit_transform(X_train)
    scaled_X_test = scaler.transform(X_test)

    # Gather all the algorithms into a single list
    algorithms = []

    # Initialize Logistic Regression with the same settings as before
    logreg = LogisticRegression(max_iter=200)
    algorithms.append({'algo': logreg, 'scale': True})

    # Initialize the K-Nearest Neighbors classifier with n_neighbors=3
    knn = KNeighborsClassifier(n_neighbors=3)
    algorithms.append({'algo': knn, 'scale': True})

    # Create a SVM Classifier
    svm_classifier = SVC(kernel='linear')
    algorithms.append({'algo': svm_classifier, 'scale': True})

    # Creating a Decision Tree Classifier
    clf = DecisionTreeClassifier()
    algorithms.append({'algo': clf, 'scale': False})

    # Initialize RandomForestClassifier
    rf_classifier = RandomForestClassifier(n_estimators=100)
    algorithms.append({'algo': rf_classifier, 'scale': False})

    # Initialize GradientBoostingClassifier (tree-based, so it does not need scaled features)
    gb_clf = GradientBoostingClassifier()
    algorithms.append({'algo': gb_clf, 'scale': False})

    # Creating a Gaussian Naive Bayes model
    gnb = GaussianNB()
    algorithms.append({'algo': gnb, 'scale': False})

    results = []

    # Loop over the algorithms list and apply each one
    for algorithm in algorithms:
        current_x = X_train
        current_x_test = X_test

        if algorithm['scale']:
            current_x = scaled_X_train
            current_x_test = scaled_X_test

        algorithm['algo'].fit(current_x, y_train)
        y_pred = algorithm['algo'].predict(current_x_test)
        accuracy = accuracy_score(y_test, y_pred)
        results.append({'classifier': type(algorithm['algo']).__name__, 'accuracy': accuracy})

    return results

```

To test multiple datasets, create a file `test_multiple_datasets.py`:

```
from sklearn.datasets import load_iris, load_digits, load_wine, load_breast_cancer
from classifiers_set import run_classifiers

datasets = [load_iris, load_digits, load_wine, load_breast_cancer]

for dataset in datasets:
    X, y = dataset(return_X_y=True)
    results = run_classifiers(X, y)
    print(f"Results for {dataset.__name__}:")
    for result in results:
        print(f"Classifier: {result['classifier']}, Accuracy: {result['accuracy']}")


```

To check the results: `python test_multiple_datasets.py`
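Before comparing accuracies, it can help to see how these datasets differ in size. The sketch below only prints the number of samples, features, and classes for each loader; the `summary` dictionary is a name I made up for this illustration:

```python
import numpy as np
from sklearn.datasets import load_iris, load_digits, load_wine, load_breast_cancer

summary = {}
for loader in [load_iris, load_digits, load_wine, load_breast_cancer]:
    X, y = loader(return_X_y=True)
    # Store (samples, features, classes) for each dataset.
    summary[loader.__name__] = (X.shape[0], X.shape[1], len(np.unique(y)))
    print(f"{loader.__name__}: samples={X.shape[0]}, "
          f"features={X.shape[1]}, classes={len(np.unique(y))}")
```

More features and more classes do not automatically make a dataset harder, but they give a rough idea of what each classifier is dealing with.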

**Human**

Huh. These basic classifiers tore through all the datasets you've tried. I think what would be interesting to do now is to get back to our iris dataset and see how the classifiers handle noise in the dataset. Add 5% disturbance for each feature.

**LLM**

Interesting result, and a great idea for exploring these classifiers further!
To add noise to the training set and run it through our classifiers, you will need to modify the `run_classifiers` function in `classifiers_set.py`.

Replace
```
def run_classifiers(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
with
```
def run_classifiers(X, y, X_test=None, y_test=None):
    if X_test is None:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    else:
        X_train, y_train = X, y
```
With this change, you can pass `run_classifiers` the modified training data together with the unmodified test data.

Now create a file `test_with_noise.py`

```
from sklearn.datasets import load_iris
from classifiers_set import run_classifiers
from sklearn.model_selection import train_test_split
import numpy as np

def add_noise(X_train, noise_level=0.05):
    """
    Add multiplicative uniform noise: each feature value is shifted by up to
    +/- noise_level of its own magnitude (0.05 means 5%).
    """
    noise = np.random.uniform(-noise_level, noise_level, X_train.shape)
    return X_train + X_train*noise

def run_experiment(noise_level):
    """
    Function to run the experiment with the classifiers.
    """
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train_noisy = add_noise(X_train, noise_level)
    results = run_classifiers(X_train_noisy, y_train, X_test, y_test)
    print(f"Results for {load_iris.__name__}:")
    for result in results:
        print(f"Classifier: {result['classifier']}, Accuracy: {result['accuracy']}")

if __name__ == "__main__":
    run_experiment(0.05)
```

Run it with `python test_with_noise.py`
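To make the effect of the noise function concrete, here is a small sketch of the same multiplicative scheme applied to two illustrative rows. It uses a seeded NumPy generator so the numbers repeat between runs; that seeding and the name `add_noise_seeded` are my assumptions for this demo, since the script above uses the unseeded `np.random.uniform`:

```python
import numpy as np

# Assumption for the demo: a seeded generator, so repeated runs give the same noise.
rng = np.random.default_rng(42)

def add_noise_seeded(X, noise_level=0.05, rng=rng):
    """Multiplicative uniform noise: each value moves by at most noise_level * |value|."""
    noise = rng.uniform(-noise_level, noise_level, X.shape)
    return X + X * noise

# Two illustrative rows with four positive measurements each.
X_demo = np.array([[5.1, 3.5, 1.4, 0.2],
                   [6.2, 2.9, 4.3, 1.3]])
X_noisy = add_noise_seeded(X_demo, 0.05)

# The relative change of every value stays within the noise level.
max_relative_change = np.max(np.abs(X_noisy - X_demo) / X_demo)
print("Largest relative change:", max_relative_change)
```

Because the noise is proportional to each value, small measurements such as petal width are disturbed less in absolute terms than large ones such as sepal length.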

**Human**

Result is still too good. Do a grid search over the noise level to find a tipping point.

**LLM**

Wonderful idea. Here's code that searches for a noise level at which at least one classifier drops below 50% accuracy.
The search steps through noise levels from 0% to 100% in 5% increments and stops at the first level that meets this condition.

First, add `return results` at the end of the `run_experiment` function in `test_with_noise.py`.

Then create a file `noise_level_search_on_iris.py`

```
from test_with_noise import run_experiment
import numpy as np


def grid_search_noise_levels():
    """
    Function to perform a grid search over the noise levels.
    """
    noise_levels = np.linspace(0, 1, 21)
    for noise_level in noise_levels:
        print('-'*60)
        print(f"Results for noise level {noise_level}:")
        results = run_experiment(noise_level)
        if any(result['accuracy'] < 0.5 for result in results):
            print('First <0.5 accuracy result')
            return

grid_search_noise_levels()

```

Run it with: `python noise_level_search_on_iris.py`
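In case the choice of 21 points looks odd: 21 evenly spaced points between 0 and 1 leave 20 gaps, which is what gives the 5% step. A quick sketch to confirm this:

```python
import numpy as np

# 21 points between 0 and 1 (inclusive) means 20 equal steps of 0.05.
noise_levels = np.linspace(0, 1, 21)
steps = np.diff(noise_levels)

print("Number of levels:", len(noise_levels))
print("All steps close to 0.05:", np.allclose(steps, 0.05))
```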

**Human**

This is too much information on the screen for me. I want you to show me how to plot this data as a line for each of the classifiers depending on the noise level.

**LLM**

To plot the results, you will need another package, matplotlib. Install it with `pip install matplotlib`.

Next, replace the contents of `noise_level_search_on_iris.py` with the following code:

```
import matplotlib.pyplot as plt
from test_with_noise import run_experiment
import numpy as np


def grid_search_noise_levels():
    """
    Function to perform a grid search over the noise levels.
    """
    noise_levels = np.linspace(0, 1, 21)
    all_results = []
    for noise_level in noise_levels:
        print('-'*60)
        print(f"Results for noise level {noise_level}:")
        results = run_experiment(noise_level)
        all_results.append((noise_level, results))
        if any(result['accuracy'] < 0.5 for result in results):
            return all_results
    return all_results

def plot_results(all_results):
    """
    Function to plot the results.
    """
    classifier_names = [result['classifier'] for result in all_results[0][1]]
    noise_levels = [result[0] for result in all_results]
    for i, classifier_name in enumerate(classifier_names):
        accuracies = [result[1][i]['accuracy'] for result in all_results]
        plt.plot(noise_levels, accuracies, label=classifier_name)
    plt.xlabel('Noise level')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.savefig('results.png')

all_results = grid_search_noise_levels()
plot_results(all_results)
```

After running `python noise_level_search_on_iris.py`, you should see a `results.png` file in the folder containing the chart.
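The plotting function relies on the shape of `all_results`: a list of `(noise_level, results)` tuples, where `results` is a list of dictionaries. If that nesting is hard to follow, the sketch below reshapes it into one accuracy list per classifier. The helper name `accuracy_series` is hypothetical, and the numbers are placeholders rather than real results:

```python
# Placeholder values, NOT real results: they only mimic the structure of all_results.
all_results_demo = [
    (0.0, [{'classifier': 'A', 'accuracy': 1.0}, {'classifier': 'B', 'accuracy': 0.9}]),
    (0.5, [{'classifier': 'A', 'accuracy': 0.8}, {'classifier': 'B', 'accuracy': 0.7}]),
]

def accuracy_series(all_results):
    """Map each classifier name to its accuracies, in the order of the noise levels."""
    series = {}
    for _noise_level, results in all_results:
        for result in results:
            series.setdefault(result['classifier'], []).append(result['accuracy'])
    return series

series_demo = accuracy_series(all_results_demo)
print(series_demo)
```

Keying by classifier name instead of list position avoids depending on the classifiers always being returned in the same order.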

**Human**

I've generated the image. Give me a cluster bar chart for this thing.
But reduce number of noise points to one per 30%. Make the bars thin as we have lots of classifiers.

**LLM**

Sure thing. Add this function to `noise_level_search_on_iris.py` after the `plot_results(all_results)` line:

```
def plot_results_cluster_bar(all_results):
    """
    Plot a clustered bar chart with one group of bars per selected noise level.
    """
    classifier_names = [result['classifier'] for result in all_results[0][1]]
    # Keep roughly one noise level per 30%. Because of floating-point error the
    # remainder may land just above 0 or just below 0.3, so accept both sides.
    selected = [result for result in all_results
                if min(result[0] % 0.3, 0.3 - result[0] % 0.3) < 0.001]
    noise_levels = [result[0] for result in selected]
    width = 0.1 / len(classifier_names)  # Thin bars: each group spans 0.1 on the x axis
    fig, ax = plt.subplots()
    for i, classifier_name in enumerate(classifier_names):
        accuracies = [result[1][i]['accuracy'] for result in selected]
        ax.bar([noise + i * width for noise in noise_levels], accuracies, width, label=classifier_name)
    ax.set_xlabel('Noise level')
    ax.set_ylabel('Accuracy')
    ax.legend()
    plt.savefig('results_cluster_bar.png')
```

Then add this line at the end of the file to create the new chart: `plot_results_cluster_bar(all_results)`.

The new image will be named `results_cluster_bar.png`.
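In the function above, each group of bars starts at its noise level and extends to the right. If you would rather have each group centered on its tick, one possible approach is to compute symmetric offsets first. This is only a sketch, and `grouped_bar_positions` is a hypothetical helper, not part of matplotlib:

```python
def grouped_bar_positions(group_centers, n_series, width):
    """Return one list of x positions per series, with each group centered on its tick."""
    # Offsets are symmetric around 0, e.g. -w, 0, +w for three series.
    offsets = [(i - (n_series - 1) / 2) * width for i in range(n_series)]
    return [[center + offset for center in group_centers] for offset in offsets]

# Seven classifiers, four noise levels, and the same thin bar width as above.
positions = grouped_bar_positions([0.0, 0.3, 0.6, 0.9], n_series=7, width=0.1 / 7)
print("Middle series sits on the ticks:", positions[3])
```

Each inner list could then be passed as the x values of one `ax.bar` call in place of `[noise + i * width for noise in noise_levels]`.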

**Human**

Thanks! What was the point of doing < 0.001 in the code above instead of == 0 though?

**LLM**

I used `< 0.001` instead of `== 0` to avoid problems with floating-point precision. Because the noise levels come from `np.linspace(0, 1, 21)`, the values intended as 0.3, 0.6, and 0.9 may not be stored exactly, since most decimal fractions have no exact binary representation. The remainder of dividing by 0.3 can therefore be a tiny nonzero number rather than exactly 0, and an `== 0` test would silently drop that noise level. The threshold ignores such small differences. One caveat: if a value lands slightly below a multiple of 0.3, the remainder comes out just under 0.3 instead of near 0, so a robust check needs to accept remainders close to 0.3 as well.
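Here is a small standalone sketch of the issue, using only the standard library. The helper `is_near_multiple` and the value `slightly_below` are hypothetical names I picked for the demonstration:

```python
import math

def is_near_multiple(value, step, tol=0.001):
    """True if value is within tol of an integer multiple of step, from above or below."""
    remainder = value % step
    return min(remainder, step - remainder) < tol

print(0.1 + 0.2 == 0.3)               # False: exact comparison is fragile
print(math.isclose(0.1 + 0.2, 0.3))   # True: tolerant comparison

# Hypothetical value just under a multiple of 0.3.
slightly_below = 0.9 - 1e-9
print(slightly_below % 0.3 < 0.001)           # False: the remainder is close to 0.3, not 0
print(is_near_multiple(slightly_below, 0.3))  # True: the two-sided check catches it
```

Another option that sidesteps float comparison entirely is to select entries by position, for example every sixth element of a 5%-step grid.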