[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/samckoy/Assignment-4/blob/main/Assignment%20%234.ipynb)

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# Part 1: Load the Dataset

Using pandas, we load the CSV file directly from a remote URL.

In [2]:
iris = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/iris.csv")

The head( ) method returns the first rows of the DataFrame; here we ask for the first 15 rows of data.

In [3]:
iris.head(15)

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


The info( ) method provides a technical summary of the data: the number of rows, the column names, the non-null counts, and the data types.

In [4]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SepalLength  150 non-null    float64
 1   SepalWidth   150 non-null    float64
 2   PetalLength  150 non-null    float64
 3   PetalWidth   150 non-null    float64
 4   Name         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


### About the Dataset

This dataset uses the dimensions of an iris's sepals and petals to predict its species. The sepal length, sepal width, petal length, and petal width are the features, and the name of the species is the label. There are three classes for the iris: Setosa, Versicolor, and Virginica. scikit-learn sorts class labels alphabetically, which here matches the order in which the species appear in the file, so in the probability arrays in later parts Setosa is at index 0, Versicolor is at index 1, and Virginica is at index 2.
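
As a minimal sketch of how the class indices are assigned: a fitted scikit-learn classifier keeps the sorted unique labels in its `classes_` attribute, and the columns of `predict_proba` follow that order. The plain-Python snippet below only mimics that sorting; the `labels` list is a hypothetical stand-in for the `Name` column.

```python
# Hypothetical stand-in for the "Name" column; order of appearance does not matter.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sorting the unique labels mirrors how scikit-learn builds `classes_`.
sorted_classes = sorted(set(labels))
class_index = {name: i for i, name in enumerate(sorted_classes)}

print(class_index)
```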

# Part 2: Split the Dataset into Train and Test

In the case of this dataset, the dimensions of the iris (the independent variables) are used to predict the iris's species (the dependent variable). Therefore, **SepalLength**, **SepalWidth**, **PetalLength**, and **PetalWidth** are our *X values*, and **Name** is our *y value*.

In [5]:
X = iris[["SepalLength","SepalWidth","PetalLength","PetalWidth"]]
y = iris["Name"]

From here, we split this data into a training set and a test set, where the training set is 90% of the data (135 rows) and the test set is 10% of the data (15 rows). I am setting the *random_state* parameter to 10 so that the same split is produced on every run.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.1,random_state = 10,shuffle=True)
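
A small sketch to check the split sizes and the effect of *random_state*. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for the remote CSV, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the remote CSV.
X_demo, y_demo = load_iris(return_X_y=True)

split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=10, shuffle=True)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=10, shuffle=True)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(len(X_train_demo), len(X_test_demo))

# With the same random_state, both calls pick exactly the same test rows.
same_split = bool((split_a[1] == split_b[1]).all())
print(same_split)
```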

# Part 3: Logistic Regression

Now I train the model on the training set with the fit( ) method. I raised *max_iter* from its default value of 100 to 3000 so that the solver has enough iterations to converge.

In [7]:
log = LogisticRegression(max_iter=3000)

In [8]:
log.fit(X_train.values,y_train)

LogisticRegression(max_iter=3000)

I now want to see what my model would predict the classification of the Iris's species to be if the SepalLength was 7.9, SepalWidth was 3.8, PetalLength was 6.4, and PetalWidth was 2.0. 

In [9]:
log.predict_proba([[7.9,3.8,6.4,2.0]])

array([[9.55123664e-07, 2.11458405e-02, 9.78853204e-01]])

The *predict_proba* function shows that there is roughly a 0.000001 chance the iris is a setosa, a 0.02 chance the iris is a versicolor, and a 0.98 chance that the iris is a virginica. The model predicts that this iris is most likely a virginica.
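
To make the reading of that array explicit, here is a short sketch that pairs each probability with its class name and picks the largest. The class order is an assumption (the sorted label order that scikit-learn uses), and the numbers are copied from the output above.

```python
# Assumed class order: sorted labels, as stored in a fitted model's `classes_`.
class_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Probabilities copied from the predict_proba output above.
probs = [9.55123664e-07, 2.11458405e-02, 9.78853204e-01]

by_class = dict(zip(class_names, probs))
best = max(by_class, key=by_class.get)

for name, p in by_class.items():
    print(f"{name}: {p:.6f}")
print("predicted:", best)
```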

In [10]:
log_scores = cross_val_score(log,X,y,cv=5)
log_scores.mean()

0.9733333333333334

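The mean 5-fold cross-validation accuracy of our Logistic Regression model is approximately 0.973. Note that cross_val_score refits a fresh copy of the model on each fold of the full dataset (X, y) rather than scoring the model fitted above. This score shows us that the model is very good at predicting the iris's species, classifying about 97% of the samples correctly.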
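
A sketch of what sits behind that single number: `cross_val_score` with `cv=5` returns one accuracy per fold, and the value reported is their mean. This assumes the bundled `load_iris` data as a stand-in for the CSV, so the exact scores may differ slightly from the ones in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: the bundled iris data stands in for the remote CSV.
X_cv, y_cv = load_iris(return_X_y=True)

fold_scores = cross_val_score(LogisticRegression(max_iter=3000), X_cv, y_cv, cv=5)
print(fold_scores)         # one accuracy per fold
print(fold_scores.mean())  # the kind of average reported above
print(fold_scores.std())   # spread across the folds
```

Looking at the per-fold values and their spread gives a sense of how much the mean can move by chance.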

In [11]:
log.coef_

array([[-0.42048506,  0.92738651, -2.44019517, -1.0618295 ],
       [ 0.53364864, -0.2484138 , -0.20532209, -0.84135553],
       [-0.11316358, -0.67897271,  2.64551727,  1.90318503]])

In [12]:
log.intercept_

array([  9.59028325,   1.78164547, -11.37192873])

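Using the *coef_* and *intercept_* attributes, the fitted coefficients and intercepts are extracted. Each row of *coef_* corresponds to one class (setosa, versicolor, virginica) and each column to one feature.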
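
As a worked check, the probabilities from In [9] can be rebuilt by hand from these numbers. The sketch below assumes the model uses the multinomial (softmax) formulation: one linear score per class, followed by a softmax. The values are copied from the outputs above, so small rounding differences are expected.

```python
import math

# Copied from log.coef_ and log.intercept_ (rows: setosa, versicolor, virginica).
coef = [
    [-0.42048506, 0.92738651, -2.44019517, -1.0618295],
    [0.53364864, -0.2484138, -0.20532209, -0.84135553],
    [-0.11316358, -0.67897271, 2.64551727, 1.90318503],
]
intercept = [9.59028325, 1.78164547, -11.37192873]
x = [7.9, 3.8, 6.4, 2.0]  # the sample datapoint from In [9]

# One linear score per class: w . x + b
scores = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(coef, intercept)]

# Softmax, shifted by the largest score for numerical stability.
top = max(scores)
exps = [math.exp(s - top) for s in scores]
probs_from_coef = [e / sum(exps) for e in exps]

print(probs_from_coef)
```

The result agrees with the predict_proba output to several decimal places, which is consistent with the softmax assumption.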

# Part 4: Support Vector Machine

Using the *fit( )* method, I am training the Support Vector Classifier with the training set. Setting *probability=True* is what makes *predict_proba* available for this model.

In [13]:
svm = SVC(probability=True)

In [14]:
svm.fit(X_train.values,y_train)

SVC(probability=True)

Using the same sample datapoint as I did in Part 3, I want to see how the probabilities compare in predicting the classification of the Iris's species. 

In [15]:
svm.predict_proba([[7.9,3.8,6.4,2.0]])

array([[0.01613359, 0.00692785, 0.97693856]])

The predict_proba function shows that there is a 0.016 chance the iris is a setosa, a 0.007 chance the iris is a versicolor, and a 0.977 chance that the iris is a virginica. Like the model in Part 3, this model predicts that this iris is most likely a virginica.

In [16]:
svm_scores = cross_val_score(svm,X,y,cv=5)
svm_scores.mean()

0.9666666666666666

The score for the SVM was approximately 0.967. This score shows us that the SVM does a good job at predicting the classification of the iris's species based on its given measurements. Its score is very similar to the score of the Logistic Regression model.

# Part 5: Neural Network

I am now training the neural network (MLP Classifier) with the training set. 

In [17]:
nn = MLPClassifier(max_iter=3000)

In [18]:
nn.fit(X_train.values,y_train)

MLPClassifier(max_iter=3000)

Using the same sample datapoint as I did in the previous parts, I want to see how the probabilities compare in predicting the classification of the Iris's species.

In [19]:
nn.predict_proba([[7.9,3.8,6.4,2.0]])

array([[2.07202568e-05, 1.80869402e-01, 8.19109878e-01]])

The predict_proba function (at the moment I am running the program) shows that there is a 0.00002 chance the iris is a setosa, a 0.18 chance the iris is a versicolor, and a 0.82 chance that the iris is a virginica. Like the models in previous parts, this model predicts that this iris is most likely a virginica. However, its probability for virginica is about 16 percentage points lower than the previous models'. The exact values change from run to run because the network's weights are initialized randomly and no *random_state* is set.
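
The run-to-run variation comes from the random weight initialization of the network. A minimal sketch of pinning it down, assuming the bundled `load_iris` data as a stand-in for the CSV and an arbitrary seed of 0 (not a value used in this notebook):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

# Assumption: the bundled iris data stands in for the remote CSV.
X_nn, y_nn = load_iris(return_X_y=True)
sample = [[7.9, 3.8, 6.4, 2.0]]

# random_state=0 is an arbitrary, illustrative seed.
first = MLPClassifier(max_iter=3000, random_state=0).fit(X_nn, y_nn).predict_proba(sample)
second = MLPClassifier(max_iter=3000, random_state=0).fit(X_nn, y_nn).predict_proba(sample)

print(first)
print(np.allclose(first, second))  # identical seeds give identical probabilities
```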

In [25]:
nn_scores = cross_val_score(nn,X,y,cv=5)
nn_scores.mean()

0.9800000000000001

The score for the Neural Network was approximately 0.98. This score shows us that the neural network does a really good job at predicting the iris's species classification based on its given measurements. Its score is very similar to the scores of the models in the previous parts.

After experimenting with several different configurations for the neural network, I couldn't get a better score for the model than 0.98. When I got a worse score, it was because the model wasn't able to converge. It seemed that any configuration gave me the same score, as long as the model converged.
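
This kind of experiment can be sketched as a small loop. The configurations below are hypothetical examples rather than the ones tried for this assignment, the seed is arbitrary, and the bundled `load_iris` data is assumed as a stand-in for the CSV.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Assumption: the bundled iris data stands in for the remote CSV.
X_sw, y_sw = load_iris(return_X_y=True)

# Hypothetical hidden-layer layouts, chosen only for illustration.
configs = [(10,), (100,), (50, 50)]

sweep = {}
for hidden in configs:
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=3000, random_state=0)
    sweep[hidden] = cross_val_score(model, X_sw, y_sw, cv=5).mean()

for hidden, score in sweep.items():
    print(hidden, round(score, 4))
```

Fixing the seed makes the configurations comparable, since differences then come from the layout and not from the random initialization.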

# Part 6: K-Nearest Neighbors

Although the K-Nearest Neighbors method doesn't require "training", we use the *fit( )* function to store the training set in the KNeighborsClassifier. 

In [26]:
kn = KNeighborsClassifier()

In [27]:
kn.fit(X_train.values,y_train)

KNeighborsClassifier()

Using the same sample datapoint as I did in the previous parts, I want to compare the probabilities for predicting the classification of the Iris's species.

In [28]:
kn.predict_proba([[7.9,3.8,6.4,2.0]])

array([[0., 0., 1.]])

The predict_proba function shows that there is a 0.00 chance the iris is a setosa, a 0.00 chance the iris is a versicolor, and a 1.00 chance that the iris is a virginica. Like the models in previous parts, this model predicts that this iris is most likely a virginica. However, this model predicts its classification with 100% certainty. This is because KNeighborsClassifier reports the fraction of the nearest neighbors (5 by default) that belong to each class, and all of this datapoint's nearest neighbors in the training set are virginicas. With K-Nearest Neighbors's ***blend into your surroundings*** approach, the datapoint is therefore classified with utmost certainty.
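
A plain-Python sketch of the voting idea behind that output: find the k closest training points and report the share of them in each class. The training points here are hypothetical toy values (petal length, petal width), not the real dataset, and `knn_proba` is an illustrative helper, not a scikit-learn function.

```python
import math
from collections import Counter

# Toy (petal length, petal width) points; hypothetical values, not the real data.
train = [
    ((1.4, 0.2), "setosa"), ((1.5, 0.2), "setosa"), ((1.3, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"), ((4.7, 1.4), "versicolor"), ((4.0, 1.3), "versicolor"),
    ((6.0, 2.5), "virginica"), ((6.1, 2.3), "virginica"), ((5.9, 2.1), "virginica"),
    ((6.6, 2.1), "virginica"), ((6.3, 1.8), "virginica"), ((5.8, 2.2), "virginica"),
]
species = ["setosa", "versicolor", "virginica"]

def knn_proba(query, k=5):
    """Fraction of the k nearest neighbors that belong to each class."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return [votes[name] / k for name in species]

print(knn_proba((6.4, 2.0)))  # -> [0.0, 0.0, 1.0]
```

Because every one of the five closest points is a virginica, the vote is unanimous and the reported probability is exactly 1.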

In [29]:
kn_scores = cross_val_score(kn,X,y,cv=5)
kn_scores.mean()

0.9733333333333334

The score for our K-Nearest Neighbors model is approximately 0.973, which shows that this model is really good at predicting an iris's species classification. This score is very similar to the scores of the models in the previous parts.

# Part 7: Conclusions and Takeaways

According to the results, all of the models performed about the same. While experimenting with the data, I removed the *random_state* parameter at one point to see how the results would differ. After re-running each model, I found that **all** of the scores ranged from about 0.95 to 1.00. This supports the conclusion that all of the models were equally good at predicting the iris's classification.
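
One way to make this comparison more direct is to score every model on identical folds. The sketch below assumes the bundled `load_iris` data as a stand-in for the CSV, and uses shuffled folds with arbitrary seeds, which differs slightly from the plain `cv=5` used in the parts above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for the remote CSV.
X_all, y_all = load_iris(return_X_y=True)

models = {
    "logistic regression": LogisticRegression(max_iter=3000),
    "svm": SVC(),
    "neural network": MLPClassifier(max_iter=3000, random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Seeded, shuffled folds so that every model sees exactly the same splits.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_all, y_all, cv=folds)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```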

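It surprised me that the results were so close in score. However, I think this is because so little data is available for evaluation. The scores above come from 5-fold cross-validation on all 150 datapoints, so each fold is scored on only 30 datapoints, and the held-out test set from Part 2 is smaller still at 15 datapoints. With so few datapoints, one or two misclassified irises are enough to account for the gaps between the models. If there were more data to evaluate on, there might have been more differentiation between the scores for each model.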
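
A quick arithmetic sketch of how coarse these scores are. It assumes that `cv=5` produced five equal folds of 30 datapoints, in which case the mean fold accuracy is simply the number of correct predictions divided by 150. The scores are copied from the outputs above.

```python
n_samples = 150  # rows in the dataset

mean_scores = {
    "logistic regression": 0.9733333333333334,
    "svm": 0.9666666666666666,
    "neural network": 0.9800000000000001,
    "k-nearest neighbors": 0.9733333333333334,
}

# With equal folds, mean accuracy = correct / 150, so errors can be recovered.
errors = {name: n_samples - round(score * n_samples) for name, score in mean_scores.items()}
print(errors)

# Smallest accuracy change that a single flower can cause on each evaluation size.
for size in (15, 30, 150):
    print(size, round(1 / size, 4))
```

Under that assumption, the four models differ by at most two misclassified flowers out of 150.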