<a href="https://colab.research.google.com/github/saifulislamdev/artificial-intelligence/blob/main/judging_flowers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [519]:
# Imports and pip installations (if needed)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

import io
import requests

In [520]:
# Sample datapoints used for classifiers throughout the code 
# In this case, just one sample datapoint
# This datapoint is similar to row 4 (0-based index) of the Iris dataset

sample_datapoints = [[5.0, 3.5, 1.5, 0.2]]

# Part 1: Load the dataset

In [521]:
# Load the dataset (load remotely, not locally)
url = 'https://raw.github.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/iris.csv'
url_content = requests.get(url).content
iris = pd.read_csv(io.StringIO(url_content.decode('utf-8')))


In [522]:
# Output the first 15 rows of the data
iris.head(15)

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


In [523]:
# Display a summary of the table information (number of datapoints, etc.)
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SepalLength  150 non-null    float64
 1   SepalWidth   150 non-null    float64
 2   PetalLength  150 non-null    float64
 3   PetalWidth   150 non-null    float64
 4   Name         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


## About the dataset
**Explain what the data is in your own words. What are your features and labels? What is the mapping of your labels to the actual classes?**

The dataset contains measurements of 150 Iris flowers: 50 samples each of Iris setosa, Iris versicolor, and Iris virginica (3 different species of Iris).

There are four features: sepal length, sepal width, petal length, and petal width. The label is the species, stored in the `Name` column as one of the strings `Iris-setosa`, `Iris-versicolor`, or `Iris-virginica`. Each label maps to exactly one class, so there are 3 classes in total.

One species (Iris setosa) is linearly separable from the other two, while the other two are not linearly separable from each other. The dataset comes from Fisher's paper and is now commonly used as a first dataset by machine learning beginners like me.
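
The label-to-class mapping can be made concrete with a small sketch. scikit-learn keeps the sorted unique labels in a fitted classifier's `classes_` attribute, and the columns of `predict_proba` follow that order. The short `labels` list below is a made-up stand-in for the `Name` column, used only for illustration:

```python
# Hypothetical stand-in for iris['Name']; the real column has 150 entries.
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

# Sorted unique labels, mirroring the order scikit-learn stores in `classes_`.
classes = sorted(set(labels))

# Map each class name to the column index it would occupy in predict_proba.
class_to_index = {name: i for i, name in enumerate(classes)}

print(classes)
print(class_to_index)
```

This is why the probability printouts later in the notebook always list Iris-setosa first and Iris-virginica last.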

# Part 2: Split the dataset into train and test

In [524]:
# Take the dataset and split it into our features (X) and label (y)

# features (X)
X = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']].values 

# label (y)
y = iris['Name'].values 

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
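
To see how many rows land in each split, the arithmetic can be sketched without scikit-learn. This assumes, as I understand `train_test_split`, that a float `test_size` is converted to a row count by rounding up and the remainder goes to training:

```python
import math

n_samples = 150   # rows in the Iris dataset
test_size = 0.1   # fraction passed to train_test_split

# Assumption: the test count is the fraction of rows, rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 135 15
```

With only 15 test rows, each misclassified sample would lower the accuracy by about 0.067.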

# Part 3: Logistic Regression

In [525]:
# i. Use sklearn to train a LogisticRegression model on the training set
lr = LogisticRegression(max_iter=1000, random_state=1)
lr.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=1)

In [526]:
# ii. For a sample datapoint, predict the probabilities for each possible class
lr_probabilities = lr.predict_proba(sample_datapoints)[0]
for i in range(len(lr_probabilities)):
    print('Probability of', lr.classes_[i], ':', lr_probabilities[i])

Probability of Iris-setosa : 0.978466027961796
Probability of Iris-versicolor : 0.021533922417967616
Probability of Iris-virginica : 4.962023630008219e-08


In [527]:
# iii. Report on the score for Logistic regression model, what does the score measure?
lr.score(X_test, y_test)

1.0

`score` returns the mean accuracy: the fraction of samples whose predicted label matches the true label. Because I scored the model on the held-out test set (10% of the 150 samples), a score of 1.0 means every test sample was classified correctly. With so few test samples, a perfect score is encouraging but is not strong evidence of how well the model generalizes.
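
As a rough illustration of what mean accuracy computes, here is a minimal plain-Python sketch. The helper `mean_accuracy` and the label lists are hypothetical and are not the notebook's test set:

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
y_true = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
y_pred = ['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa']

print(mean_accuracy(y_true, y_pred))  # 0.75
```

Three of the four hypothetical predictions match, so the accuracy is 0.75; a score of 1.0 requires every prediction to match.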

In [528]:
# iv. Extract the coefficients and intercepts for the boundary line(s)
print('Coefficients:', lr.coef_)
print('Intercepts:', lr.intercept_)

Coefficients: [[-0.44134475  0.85285733 -2.45739786 -1.00501208]
 [ 0.55059128 -0.31499028 -0.17440822 -0.93050419]
 [-0.10924652 -0.53786705  2.63180607  1.93551628]]
Intercepts: [  9.97997382   1.85201766 -11.83199148]
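
The coefficients and intercepts above are enough to reproduce the class probabilities by hand. The sketch below assumes the model uses the multinomial (softmax) formulation, which I believe is what scikit-learn fits here for three classes; the numbers are copied from the printed output and the sample datapoint:

```python
import math

# Coefficients and intercepts copied from the output above (one row per class).
coef = [
    [-0.44134475, 0.85285733, -2.45739786, -1.00501208],   # Iris-setosa
    [0.55059128, -0.31499028, -0.17440822, -0.93050419],   # Iris-versicolor
    [-0.10924652, -0.53786705, 2.63180607, 1.93551628],    # Iris-virginica
]
intercept = [9.97997382, 1.85201766, -11.83199148]
x = [5.0, 3.5, 1.5, 0.2]  # the notebook's sample datapoint

# Linear score per class: w . x + b
scores = [sum(w * v for w, v in zip(row, x)) + b
          for row, b in zip(coef, intercept)]

# Softmax, shifted by the largest score for numerical stability.
top = max(scores)
exps = [math.exp(s - top) for s in scores]
total = sum(exps)
probs = [e / total for e in exps]

print(probs)
```

Under that assumption the result should land close to the `predict_proba` output printed in step ii (roughly 0.978 for Iris-setosa).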


# Part 4: Support Vector Machine

In [529]:
# i. Use sklearn to train a Support Vector Classifier on the training set
svm = SVC(kernel='linear', probability=True, max_iter=-1, random_state=1)
svm.fit(X_train, y_train)

SVC(kernel='linear', probability=True, random_state=1)

In [530]:
# ii. For a sample datapoint, predict the probabilities for each possible class
svm_probabilities = svm.predict_proba(sample_datapoints)[0]
for i in range(len(svm_probabilities)):
    print('Probability of', svm.classes_[i], ':', svm_probabilities[i])

Probability of Iris-setosa : 0.9673247206872928
Probability of Iris-versicolor : 0.02189335691315819
Probability of Iris-virginica : 0.010781922399549079
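
An SVM does not produce probabilities natively. According to the scikit-learn documentation, `probability=True` calibrates the decision values with Platt scaling, fitted through an internal cross-validation, and extends this to multiple classes. The sketch below shows only the binary sigmoid at the heart of that idea; the function name and the parameters `A` and `B` are hypothetical, not values from this notebook:

```python
import math


def platt_probability(decision_value, a, b):
    """Map an SVM decision value to a probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical sigmoid parameters; in practice they are fitted to held-out
# decision values rather than chosen by hand.
A, B = -2.0, 0.0

for f in (-2.0, 0.0, 2.0):
    print(f, platt_probability(f, A, B))
```

Points far on the positive side of the boundary map to probabilities near 1, points on the boundary map to 0.5, and points far on the negative side map to values near 0.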


In [531]:
# iii. Report on the score for the SVM, what does the score measure?
svm.score(X_test, y_test)

1.0

As with logistic regression, `score` returns the mean accuracy on the given test data and labels, that is, the fraction of test samples classified correctly. A score of 1.0 means the SVM predicted every test label correctly.

# Part 5: Neural Network

In [532]:
# i. Use sklearn to train a Neural Network (MLP Classifier) on the training set
mlp = MLPClassifier(activation='logistic', solver='lbfgs', max_iter=1000, random_state=1)
mlp.fit(X_train, y_train)

MLPClassifier(activation='logistic', max_iter=1000, random_state=1,
              solver='lbfgs')

In [533]:
# ii. For a sample datapoint, predict the probabilities for each possible class
mlp_probabilities = mlp.predict_proba(sample_datapoints)[0]
for i in range(len(mlp_probabilities)):
    print('Probability of', mlp.classes_[i], ':', mlp_probabilities[i])

Probability of Iris-setosa : 0.9999913086802718
Probability of Iris-versicolor : 8.691319728170479e-06
Probability of Iris-virginica : 1.2059261335854125e-33


In [534]:
# iii. Report on the score for the Neural Network, what does the score measure?
mlp.score(X_test, y_test)

1.0

As with the previous models, `score` returns the mean accuracy on the given test data and labels, that is, the fraction of test samples classified correctly. A score of 1.0 means the neural network predicted every test label correctly.

In [535]:
# iv: Experiment with different options for the neural network, report on your best configuration (the highest score I was able to achieve was 0.8666)
worst_config = { 
    'score': float('inf'), # lowest score
}
best_config = {
    'score': float('-inf'), # highest score
}

# Different options/configurations
configs = [
    {
        'activation': 'relu', 
        'solver': 'adam', 
    },
    {
        'activation': 'logistic', 
        'solver': 'adam', 
    }, 
    {
        'activation': 'relu', 
        'solver': 'lbfgs', 
    },
    {
        'activation': 'logistic', 
        'solver': 'lbfgs', 
    }
]

# Experiment with different options/configurations
for config in configs:
    curr_activation = config['activation']
    curr_solver = config['solver']
    curr_mlp = MLPClassifier(activation=curr_activation, solver=curr_solver, max_iter=1000, shuffle=True, random_state=1)
    curr_mlp.fit(X_train, y_train)
    curr_mlp_score = curr_mlp.score(X_test, y_test)
    if curr_mlp_score < worst_config['score']:
        worst_config = {
            'activation': curr_activation,
            'score': curr_mlp_score,
            'solver': curr_solver
        }
    if curr_mlp_score > best_config['score']:
        best_config = {
            'activation': curr_activation,
            'score': curr_mlp_score,
            'solver': curr_solver
        }

print('Worst config:', worst_config)
print('Best config:', best_config)

Worst config: {'activation': 'relu', 'score': 1.0, 'solver': 'adam'}
Best config: {'activation': 'relu', 'score': 1.0, 'solver': 'adam'}


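Both the worst and the best score are 1.0, so all four configurations tied on the test set. The configuration reported as best is simply the first one tried:
* Activation = 'relu'
* Score = 1.0
* Solver = 'adam'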
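
Since the worst and best scores printed above are both 1.0, every configuration must have scored 1.0. A small sketch of how such ties could be made visible, using that score for each of the four configurations (the `results` structure is my own illustration, not part of the notebook):

```python
# Scores implied by the output above: worst == best == 1.0, so all are 1.0.
results = [
    ({'activation': 'relu', 'solver': 'adam'}, 1.0),
    ({'activation': 'logistic', 'solver': 'adam'}, 1.0),
    ({'activation': 'relu', 'solver': 'lbfgs'}, 1.0),
    ({'activation': 'logistic', 'solver': 'lbfgs'}, 1.0),
]

best_score = max(score for _, score in results)

# Keep every configuration that reaches the best score, not just the first.
tied = [config for config, score in results if score == best_score]

print('Best score:', best_score)
print('Configurations tied for best:', len(tied))
for config in tied:
    print(config)
```

Listing every tied configuration avoids reading too much into whichever one happened to be evaluated first.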

# Part 6: K-Nearest Neighbors

In [536]:
# i. Use sklearn to 'train' a k-Neighbors Classifier
# Note: kNN is a nonparametric model and technically doesn't require training.
# fit() essentially stores the training data in the model; see the link below for more information:
# https://stats.stackexchange.com/questions/349842/why-do-we-need-to-fit-a-k-nearest-neighbors-classifier

neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)

KNeighborsClassifier()

In [537]:
# ii. For a sample datapoint, predict the probabilities for each possible class
neigh_probabilities = neigh.predict_proba(sample_datapoints)[0]
for i in range(len(neigh_probabilities)):
    print('Probability of', neigh.classes_[i], ':', neigh_probabilities[i])

Probability of Iris-setosa : 1.0
Probability of Iris-versicolor : 0.0
Probability of Iris-virginica : 0.0


In [538]:
# iii. Report on the score for kNN, what does the score measure?
neigh.score(X_test, y_test)

1.0

As with the previous models, `score` returns the mean accuracy on the given test data and labels, that is, the fraction of test samples classified correctly. A score of 1.0 means kNN predicted every test label correctly.

# Part 7: Conclusions and takeaways

**In your own words describe the results of the notebook. Which model(s) performed the best on the dataset? Why do you think that is? Did anything surprise you about the exercise?**

All four models scored a perfect 1.0 on the test set, which means each one predicted every test label correctly. Because accuracy alone cannot separate them, I instead compare the probability each classifier assigned to each class for the same sample datapoint. The K-Neighbors Classifier returned the highest probability for Iris-setosa (1.0). The MLP Classifier (neural network) was very close behind, at about 0.99999.

It is no surprise that the kNN and MLP (neural network) classifiers performed best by this measure. After all, they are nonlinear classifiers, and the dataset is not fully linearly separable: two of the three species cannot be separated from each other by a line.

At first, I found kNN's probability of exactly 1.0 for Iris-setosa surprising. Looking into it further, it makes sense: kNN's probability for a class is the fraction of the nearest neighbors that belong to that class, so 1.0 means that all of the sample's nearest neighbors (5 by default in `sklearn`) are Iris-setosa. I was also surprised at how simple it is to work with classification models using Python's machine learning libraries, specifically `sklearn`.
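
The neighbor-based reasoning can be sketched in plain Python. This is a simplified illustration with uniform votes and Euclidean distance, not scikit-learn's implementation. The helper `knn_proba` is hypothetical, the five setosa rows come from the first rows shown in Part 1, and the two non-setosa points are made-up values:

```python
import math
from collections import Counter


def knn_proba(train_X, train_y, query, k=5):
    """Class probabilities as the fraction of the k nearest neighbors per class."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return {label: votes.get(label, 0) / k for label in sorted(set(train_y))}


train_X = [
    [5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2],
    [6.4, 3.2, 4.5, 1.5],  # hypothetical versicolor-like point
    [6.3, 3.3, 6.0, 2.5],  # hypothetical virginica-like point
]
train_y = ['Iris-setosa'] * 5 + ['Iris-versicolor', 'Iris-virginica']

print(knn_proba(train_X, train_y, [5.0, 3.5, 1.5, 0.2]))
```

Because the five closest points are all setosa, the vote is unanimous and the probability is exactly 1.0, with 0.0 for the other two classes.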