In [185]:
# Imports
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# Part 1: Load the dataset


In [186]:
# Load the dataset (load remotely, not locally)
# Output the first 15 rows of the data
d = datasets.load_iris()
data = pd.DataFrame(d.data, columns=d.feature_names)
data.head(15)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [187]:
# Display a summary of the table information (number of datapoints, etc.)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


### About the dataset
Explain what the data is in your own words. What are your features and labels? What is the mapping of your labels to the actual classes? 
* The Iris dataset contains 150 samples, 50 from each of 3 iris species. Each sample records four measurements of one flower. The features are sepal length, sepal width, petal length, and petal width, all in centimeters. The labels are the species Setosa, Versicolor, and Virginica, which ```load_iris``` encodes as the integers 0, 1, and 2 respectively.
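
The label-to-class mapping can be read directly from the dataset object. The snippet below is a minimal sketch; the names `iris`, `label_map`, and `labeled` are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# target_names is ordered so that position i is the class encoded as integer i
label_map = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_map)

# Attach a readable species column next to the four numeric features
labeled = pd.DataFrame(iris.data, columns=iris.feature_names)
labeled["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)
print(labeled["species"].value_counts())
```

Printing the value counts is a quick way to confirm that each class contributes 50 samples.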


# Part 2: Split the dataset into train and test


In [188]:
# Take the dataset and split it into our features (X) and label (y)
X = d.data
y = d.target
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, test_size=0.1)
# 135 went into train set, 15 went into test
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(135, 4) (15, 4) (135,) (15,)
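
The split above is random, so every rerun of the notebook produces a different test set and potentially different scores. A hedged sketch of a reproducible variant follows; `random_state=0` and `stratify=y` are choices made for this example, not settings used in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # number of test samples per class
```

With stratification, the 15 test samples are drawn evenly from the three classes, which avoids a test set that happens to under-represent one species.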


# Part 3: Logistic Regression

In [189]:
# i. Use sklearn to train a LogisticRegression model on the training set
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)

# ii. Predict the class label for each test datapoint (predict returns labels; per-class probabilities would come from predict_proba)
logisticRegr.predict(X_test)

array([0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 0, 1, 1])
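
The array above holds predicted class labels, not probabilities. For the per-class probabilities of a single datapoint, `predict_proba` can be used. This is a sketch on its own fixed split (`random_state=0` and `max_iter=1000` are assumptions for the example), so its numbers will not match the run shown in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

sample = X_test[:1]                 # slicing keeps the 2-D shape (1, 4)
proba = clf.predict_proba(sample)   # one row, one column per class

print(clf.classes_)                 # column order of the probabilities
print(proba.round(3))
print(clf.predict(sample))          # label with the highest probability
```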

In [190]:
# iii. Report on the score for Logistic regression model, what does the score measure?
score = logisticRegr.score(X_test, y_test)
print(score)

1.0


In [191]:
# iv. Extract the coefficients and intercepts for the boundary line(s)
print(logisticRegr.coef_, logisticRegr.intercept_)

[[-0.41299308  0.93238632 -2.43006944 -1.04809455]
 [ 0.56795794 -0.31447395 -0.23280627 -0.88790594]
 [-0.15496487 -0.61791237  2.66287572  1.9360005 ]] [  9.60720475   1.94978397 -11.55698871]
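
Each row of `coef_` and each entry of `intercept_` belongs to one class. Assuming the default multinomial formulation for a multiclass problem, the class probabilities are the softmax of the linear scores $XW^\top + b$. The sketch below checks this against `predict_proba`; it uses its own fixed split, so the coefficients differ from those printed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Linear score per class: one column per row of coef_
scores = X_test @ clf.coef_.T + clf.intercept_

# Softmax over classes (subtract the row max for numerical stability)
scores = scores - scores.max(axis=1, keepdims=True)
manual = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(np.allclose(manual, clf.predict_proba(X_test)))
```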


### Score = ```1.0```
* After fitting the Logistic Regression model on the training set, we predicted the classes of the 15 held-out test samples. ```score``` returns the mean accuracy: the fraction of test samples whose predicted class matches the true class. A score of ```1.0``` means all 15 test samples were classified correctly in this run. Because the test set is small and the split is random, the score can change between runs.
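
To make concrete what the score measures, the sketch below computes accuracy by hand and compares it with `score` and `accuracy_score`. The fixed split is an assumption for this example, so the value may differ from the one reported above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred = clf.predict(X_test)
manual_accuracy = np.mean(pred == y_test)   # fraction of correct predictions

print(manual_accuracy)
print(clf.score(X_test, y_test))            # same quantity
print(accuracy_score(y_test, pred))         # same quantity
```

With only 15 test samples, accuracy can only move in steps of 1/15 (about 0.067), so a single misclassification already lowers the score to roughly 0.93.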

# Part 4: Support Vector Machine


In [192]:
# i. Use sklearn to train a Support Vector Classifier on the training set
model = SVC()
model.fit(X_train, y_train)
# ii. Predict the class label for each test datapoint (SVC.predict returns labels, not probabilities)
model.predict(X_test)


array([0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 0, 1, 1])

In [193]:
# iii. Report on the score for the SVM, what does the score measure?
svm_score = model.score(X_test, y_test)
print(svm_score)

1.0


### Score = ```1.0```
* After fitting the Support Vector Classifier on the training set, we predicted the classes of the 15 test samples. The run shown above scored ```1.0```, meaning every test sample was classified correctly. Because ```train_test_split``` is called without a fixed ```random_state```, the split changes on every run; another run scored ```0.93``` (14 of 15 correct), which is still fairly accurate.
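
A default `SVC` does not output probabilities. Two options are sketched below under stated assumptions: `decision_function` gives per-class scores (not probabilities), and wrapping the classifier in `CalibratedClassifierCV` is one way to obtain calibrated probabilities. The 5-fold setting and the fixed split are choices made for this example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# Option 1: raw per-class decision scores from a plain SVC
svc = SVC().fit(X_train, y_train)
scores = svc.decision_function(X_test[:1])
print(scores.round(3))

# Option 2: probabilities from a calibrated wrapper (sigmoid calibration, 5-fold)
calibrated = CalibratedClassifierCV(SVC(), cv=5).fit(X_train, y_train)
proba = calibrated.predict_proba(X_test[:1])
print(proba.round(3))
```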

# Part 5: Neural Network


In [194]:
# i. Use sklearn to train a Neural Network (MLP Classifier) on the training set
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(4,2), max_iter=1000)
mlp.fit(X_train, y_train)
# ii. Predict the class label for each test datapoint (predict returns labels; per-class probabilities would come from predict_proba)
mlp.predict(X_test)

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [195]:
# iii. Report on the score for the Neural Network, what does the score measure?
mlp_score = mlp.score(X_test, y_test)
# iv: Experiment with different options for the neural network, report on your best configuration (the highest score I was able to achieve was 0.8666)
print(mlp_score)

0.2
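
The prediction array above contains only class 2, so the network collapsed to a single class in this run, which explains the low score. One commonly suggested adjustment is to standardize the inputs and use a somewhat wider hidden layer. The sketch below is an illustration under those assumptions (`hidden_layer_sizes=(10,)`, `random_state=0`, and the pipeline are choices for this example), not the configuration the notebook reports as its best.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# Scale features to zero mean and unit variance before the network sees them
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", hidden_layer_sizes=(10,),
                  max_iter=1000, random_state=0),
)
pipe.fit(X_train, y_train)

pred = pipe.predict(X_test)
mlp_acc = pipe.score(X_test, y_test)

print(np.unique(pred))   # a single value here would indicate a collapsed network
print(mlp_acc)
```

Fixing `random_state` also makes the weight initialization repeatable, which helps when comparing configurations.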


# Part 6: K-Nearest Neighbors


In [196]:
# i. Use sklearn to 'train' a k-Neighbors Classifier
# Note: KNN is a nonparametric model and technically doesn't require training.
# fit essentially stores the training data in the model; see the link below for more information
# https://stats.stackexchange.com/questions/349842/why-do-we-need-to-fit-a-k-nearest-neighbors-classifier
K = 3
knn = KNeighborsClassifier(K)
knn.fit(X_train, y_train)
# ii. Predict the class label for each test datapoint (predict returns labels; per-class probabilities would come from predict_proba)
knn.predict(X_test)

array([0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 0, 1, 1])

In [197]:
# iii. Report on the score for kNN, what does the score measure?
print(knn.score(X_test, y_test))

1.0


### Score = ```1.0```
* After fitting the K-Nearest Neighbors classifier (K = 3) on the training set, we predicted the classes of the 15 test samples. The score obtained was ```1.0```, meaning every test sample was classified correctly in this run.
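
For kNN with uniform weights, the per-class probability of a datapoint is the fraction of its K nearest neighbors that belong to each class. The sketch below checks this by looking up the neighbors directly; the fixed split and the name `votes` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

K = 3
knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)

sample = X_test[:1]
proba = knn.predict_proba(sample)

# Indices of the K closest training points, then their class labels
_, idx = knn.kneighbors(sample)
neighbor_labels = y_train[idx[0]]

# Fraction of neighbors voting for each of the 3 classes
votes = np.bincount(neighbor_labels, minlength=3) / K

print(proba[0], votes)
```

With K = 3 the probabilities can only take the values 0, 1/3, 2/3, or 1.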

# Part 7: Conclusions and takeaways
In your own words describe the results of the notebook. Which model(s) performed the best on the dataset? Why do you think that is? Did anything surprise you about the exercise?

### Findings
#### This experiment applied several machine learning models to the well-known Iris dataset. Some models performed better on this dataset than others.
* The models with good, consistent performance were Logistic Regression, SVM, and K-Nearest Neighbors. These 3 models had the highest scores of all the models used. One possible reason is dataset size: with only about 150 samples, Iris is small, and these models may be better suited to small datasets than the others.
* Another possible reason is that a model's approach is not well matched to this dataset. The Neural Network's performance was inconsistent and often low, likely because it has many configuration options (hidden layer sizes, solver, number of iterations) that must be tuned, and because its result depends on the random weight initialization. Across runs, I got scores as low as ```0.266``` and as high as ```1.0```.

* What surprised me most about this exercise was how powerful these tools are, and how important it is to learn which model to apply where. All of these models trained very quickly, and being able to predict outcomes from recorded data is impressive.
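
Since a single 15-sample test set gives a noisy estimate, one way to compare the four models more reliably is k-fold cross-validation. The sketch below is an illustration, not part of the original experiment; the 5-fold setup, the `random_state` values, and the `results` dictionary are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "MLP (4,2)": MLPClassifier(solver="lbfgs", hidden_layer_sizes=(4, 2),
                               max_iter=1000, random_state=0),
    "kNN (K=3)": KNeighborsClassifier(n_neighbors=3),
}

# Every model is evaluated on the same 5 stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv)
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how much each model's accuracy depends on the particular split.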