Python 3.7

This project exercises simple machine learning on the Iris dataset: it trains scikit-learn's DecisionTreeClassifier, visualizes the resulting tree, and then builds two classifiers of my own.

The classifiers that I create are:
* A Random Classifier, which simply selects a random label and uses that as the prediction.
* A Nearest Neighbor Classifier, which predicts the label of the closest training point as measured by Euclidean distance.
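
For reference, the Euclidean distance between two points a and b is the square root of the sum of squared differences between their features. Below is a minimal pure-Python sketch of that formula. The helper name `euclidean` and the two flower measurements are made up for illustration; the notebook itself relies on `scipy.spatial.distance.euclidean`.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two made-up flowers: (sepal length, sepal width, petal length, petal width).
flower_a = [5.0, 3.5, 1.5, 0.2]
flower_b = [6.0, 3.5, 4.5, 1.2]

# Differences are 1, 0, 3, 1, so the distance is sqrt(1 + 0 + 9 + 1) = sqrt(11).
d = euclidean(flower_a, flower_b)
print(d)
```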

addition from a previous notebook

### Creating a Random Classifier

This classifier guesses each label by picking one at random from the training labels.

In [1]:
import random

class RandomGuessClassifier:
    # A classifier needs two methods: .fit(x_train, y_train) and .predict(x_test)
    
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
        
    def predict(self, x_test):
        predictions = []      # one predicted label per test row
        for row in x_test:    # each row holds the features of one test sample (ignored here)
            label = random.choice(self.y_train)    # pick a random label from the training labels
            predictions.append(label)
        return predictions
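
A random guesser is mainly useful as a baseline. With three equally common classes, it should be right about one time in three. The sketch below checks that with a small simulation. It assumes a balanced label list of 50 samples per class (as in the full Iris dataset) and uses a hypothetical helper `random_guess_accuracy` rather than the class above, so that it runs on its own.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed labels: three classes with 50 samples each.
labels = [0] * 50 + [1] * 50 + [2] * 50

def random_guess_accuracy(train_labels, true_labels):
    """Guess each label by drawing from train_labels; return the fraction correct."""
    hits = sum(random.choice(train_labels) == truth for truth in true_labels)
    return hits / len(true_labels)

# Average over many trials to smooth out the noise of a single run.
trials = [random_guess_accuracy(labels, labels) for _ in range(200)]
mean_accuracy = sum(trials) / len(trials)
print(round(mean_accuracy, 2))  # expected to land close to 1/3
```

Any real classifier should comfortably beat this number on the same data.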
    

### Creating a Simple NearestNeighbor Classifier

Pros: 
* pretty simple to understand

Cons:
* computationally intensive, because every prediction iterates through all of the training points to find the nearest neighbor
* difficult to represent relationships between features, and to tell which features matter more

In [1]:
from scipy.spatial import distance

# a is the test point being classified, b is a point from the training data
def euc(a,b):
    return distance.euclidean(a,b)
    
class My_KNN():
    
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
        
    def predict(self, x_test):
        predictions = []     
        for row in x_test:   
            label = self.closest(row)    #feeds our "closest" function row, which is one row of features.
            predictions.append(label)
        return predictions
    
    def closest(self, row): 
        best_dist = euc(row, self.x_train[0]) #compares test point to first training point
        best_index = 0
        for i in range(1, len(self.x_train)): #compares the test point to all the other training points
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]
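
To see the nearest-neighbor search on data small enough to check by hand, here is a standalone sketch of the same idea. The function `nearest_label` and the 2-D points are made up for illustration, and the distance is computed with `math` instead of SciPy so that the block has no dependencies.

```python
import math

def nearest_label(point, train_points, train_labels):
    """Return the label of the training point closest to `point` (1-nearest-neighbor)."""
    def dist(other):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(point, other)))

    # One distance per training point, so each prediction costs
    # len(train_points) distance evaluations.
    best_index = min(range(len(train_points)), key=lambda i: dist(train_points[i]))
    return train_labels[best_index]

# Made-up 2-D training data: two clusters labeled "a" and "b".
train_points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
train_labels = ["a", "a", "b", "b"]

result_low = nearest_label((2.0, 1.5), train_points, train_labels)   # closest to (1.5, 2.0)
result_high = nearest_label((7.0, 9.0), train_points, train_labels)  # closest to (8.0, 8.0)
print(result_low, result_high)
```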

In [2]:
from sklearn import datasets
iris = datasets.load_iris()

x = iris.data
y = iris.target

In [54]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [56]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


We want to partition our data into a training set and a testing set: one to train our models, and one to check how well they predict labels for data they have not seen.

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.5)  #use half the data for testing
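
Conceptually, the split shuffles the rows and then cuts them into two parts while keeping each row paired with its label. The pure-Python sketch below illustrates that idea only. `simple_split` is a hypothetical helper, not scikit-learn's implementation, and the ten rows of data are made up.

```python
import random

def simple_split(features, labels, test_size=0.5, seed=None):
    """Shuffle row indices, then cut them into a training part and a testing part."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    # Same return order as train_test_split: x_train, x_test, y_train, y_test.
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Made-up data: 10 rows with one feature each; the label is the feature modulo 2.
rows = [[i] for i in range(10)]
targets = [i % 2 for i in range(10)]
tr_x, te_x, tr_y, te_y = simple_split(rows, targets, test_size=0.5, seed=42)
print(len(tr_x), len(te_x))
```

Because the shuffle is random, each run without a fixed seed produces a different split, which is why accuracy can vary slightly between runs of the notebook.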

In the cell below, we create a decision tree classifier to predict the labels.

In [5]:
from sklearn import tree

my_classifier = tree.DecisionTreeClassifier()    #you can use different classifiers 

In [6]:
#from sklearn.neighbors import KNeighborsClassifier

#my_classifier = KNeighborsClassifier()          #this is a different classifier 

### Using my own classifiers

In [7]:
#my_classifier = RandomGuessClassifier()

In [8]:
#my_classifier = My_KNN()

We give the classifier the training features and their corresponding labels.

In [9]:
my_classifier.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
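
The `criterion='gini'` entry in the output above names the measure the tree uses to score candidate splits: Gini impurity, which is 1 minus the sum of the squared class proportions in a node. A node holding a single class scores 0, and more evenly mixed nodes score higher. The sketch below computes it for a few made-up label lists; `gini_impurity` is a hypothetical helper for illustration, not a scikit-learn function.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

pure = gini_impurity([0, 0, 0, 0])    # a single class: 0.0
mixed = gini_impurity([0, 0, 1, 1])   # two classes, evenly split: 0.5
three_way = gini_impurity([0, 1, 2])  # three classes, evenly split: about 0.667
print(pure, mixed, three_way)
```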

Predicting the labels for all of the testing data.

In [10]:
predictions = my_classifier.predict(x_test)
print(predictions)

[1 1 1 1 1 1 0 2 1 1 2 2 1 2 2 2 0 1 2 2 0 1 0 0 1 1 2 2 2 0 2 0 1 0 0 0 2
 0 0 2 0 2 2 0 2 1 2 2 2 0 1 0 0 0 2 1 0 0 0 2 1 1 2 2 1 0 1 1 0 0 2 1 1 0
 2]


In [11]:
print(y_test)

[1 1 1 1 1 1 0 2 1 2 2 2 1 2 2 2 0 1 2 1 0 2 0 0 1 1 2 2 2 0 2 0 1 0 0 0 2
 0 0 2 0 2 2 0 2 1 2 2 2 0 1 0 0 0 2 1 0 0 0 2 2 1 2 2 1 0 1 1 0 0 2 1 1 0
 2]


In [12]:
# Print a few test samples so the tree's decisions can be checked by hand.
# Feature order: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
#for i in range(2):
#    print(str(y_test[i]) + ": " + str(x_test[i]))
    
print(str(y_test[10]) + ": " + str(x_test[10]))
print(str(y_test[41]) + ": " + str(x_test[41]))

2: [6.7 3.3 5.7 2.1]
2: [6.3 2.5 5.  1.9]


How accurate are these predictions?

In [92]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.96
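
With its default settings, `accuracy_score` returns the fraction of predictions that match the true labels. A minimal pure-Python equivalent, using a hypothetical `accuracy` helper and made-up toy labels, might look like this:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up toy labels: three of the four predictions match.
toy_true = [0, 1, 2, 2]
toy_pred = [0, 1, 1, 2]
toy_score = accuracy(toy_true, toy_pred)
print(toy_score)  # 0.75
```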


In [93]:
features = iris.feature_names

import graphviz
dot_data = tree.export_graphviz(my_classifier, out_file=None,  feature_names=features,  
    class_names=iris.target_names,  filled=True, rounded=True, special_characters=True) 
graph = graphviz.Source(dot_data)  
graph.render('dtree_render',view=True)

'dtree_render.pdf'