<a href="https://colab.research.google.com/github/thearcadio/ITI-421/blob/main/Classification_in_Python_Sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# you will just load your own data later and can skip this step
# for the sake of having an example to see I'm using an already available data set
from sklearn.datasets import load_wine

# you will need to import pandas
import pandas as pd

# Above I imported the loader function; this line actually loads the data
wine = load_wine() # load your own data set at this step

In [4]:
print(type(wine)) # as you can see, the type of this data set is Bunch
# you may find it easier to work with Bunches than DataFrames; check out this thread:
# https://stackoverflow.com/questions/59946315/how-can-we-convert-a-dataframe-in-to-bunch-data-type

<class 'sklearn.utils.Bunch'>
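
If your own data arrives as a pandas DataFrame instead of a Bunch, one possible approach is to pull the feature columns and the label column out as separate objects. The sketch below builds a DataFrame from the wine data as a stand-in for your file; the column name `target` is an assumption chosen for illustration, not something your data set will necessarily have.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Stand-in for your own data: one column per measurement plus a label column.
# "target" is a hypothetical column name; use whatever your label column is called.
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["target"] = wine.target

# Separate the measurements (X) from the class labels (y)
X = wine_df.drop(columns=["target"])
y = wine_df["target"]

print(wine_df.shape)  # rows, columns (including the label column)
print(X.shape, y.shape)
```

`X` and `y` can then be passed to the same scikit-learn functions that take `wine.data` and `wine.target`.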


In [5]:
# we are importing what we need to split our data... more below
from sklearn.model_selection import train_test_split

# we talked about this earlier in the semester: we split our data for training and testing our model
# the example below holds out 20% for testing, which is a common value; it may make
# sense for you to use more or less, I can answer questions specific to your application :)
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2)
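
Since the split is random, each run gives a slightly different training and testing set. If you want the same split every time (handy when writing up results), a minimal sketch is to pass `random_state`, and optionally `stratify` so each class keeps roughly the same share in both sets. The value 42 is an arbitrary choice, not a recommendation.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

X_train, X_test, y_train, y_test = train_test_split(
    wine.data,
    wine.target,
    test_size=0.2,         # hold out 20% for testing
    random_state=42,       # arbitrary seed so the split is repeatable
    stratify=wine.target,  # keep class proportions similar in both sets
)

print(X_train.shape, X_test.shape)
```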

In [6]:
# Another thing we discussed in class was either scaling or normalizing values to ensure
# that different measurements are appropriately weighted. This isn't so straightforward
# in SPSS, but in Python it is just a matter of a few lines of code

# We are importing the needed package
from sklearn.preprocessing import StandardScaler

# Actually creating the scaler
scaler = StandardScaler()

# We fit the scaler on the training data only, then apply that same scaling to the
# testing data, so that the values in the test set don't influence the model
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
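
To see what scaling actually does, the sketch below follows a common convention: learn the mean and standard deviation from the training rows only, then reuse them on the test rows. The `_scaled` variable names and the `random_state=0` seed are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse those same values on test rows

# After scaling, every training column has mean ~0 and standard deviation ~1
print(np.round(X_train_scaled.mean(axis=0), 2))
print(np.round(X_train_scaled.std(axis=0), 2))

# The test columns are close to 0, but not exactly, since they were not used to fit
print(np.round(X_test_scaled.mean(axis=0), 2))
```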

In [9]:
# While there are many classifiers available in the scikit-learn package, I'd say
# the most useful for you all would be the logistic regression one
# If you see others that interest you here: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
# I'd be happy to schedule office hours to walk you through them in detail

# Importing the packages we need
from sklearn.linear_model import LogisticRegression

# Creating our actual classifier with our training data
classifier = LogisticRegression().fit(X_train, y_train)
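
The scaling and classifier steps can also be bundled together so they always run in the same order. This is an optional sketch using scikit-learn's `make_pipeline`; the variable names `model` and `test_accuracy` and the seed are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0
)

# The pipeline scales the data, then fits the classifier, in one call
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# score() scales X_test with the training statistics, predicts, and returns accuracy
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The appeal of this design is that the scaler is only ever fit on whatever data is passed to `fit`, so the test set cannot leak into it by accident.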

In [10]:
# What good is a model if we can't easily assess it? To do this, we will use
# the metrics module included in scikit-learn
from sklearn import metrics

# We are now using the testing data we held out before...
# It is fed into our classifier and that classifier will make a prediction
# as to what class each row belongs to
predictions = classifier.predict(X_test)

# One way we can assess our model is a confusion matrix
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

# The values in a confusion matrix that lie along the diagonal represent the
# correct classifications... The more of the values that lie on the diagonal, the
# more rows your model classified correctly

[[11  0  0]
 [ 1 12  0]
 [ 0  0 12]]
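
To make a matrix like this easier to read, one option is to wrap it in a pandas DataFrame with labeled rows and columns. In scikit-learn's layout the rows are the actual classes and the columns are the predicted classes. The sketch below retypes the matrix printed above by hand; the "actual"/"predicted" label text is my own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

class_names = load_wine().target_names

# The confusion matrix printed above, typed in by hand
cm = np.array([[11, 0, 0],
               [1, 12, 0],
               [0, 0, 12]])

# Rows are the actual classes, columns are the predicted classes
cm_df = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in class_names],
    columns=[f"predicted {name}" for name in class_names],
)
print(cm_df)

# The diagonal holds the correct predictions
correct = np.trace(cm)
total = cm.sum()
print(f"{correct} of {total} test rows classified correctly")
```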


In [11]:
# This was my model's output
# [[11  0  0]
#  [ 1 12  0]
#  [ 0  0 12]]
# The great thing about a confusion matrix is that it is easy to interpret
# The number of classes you have determines the size of the matrix, for example
# if I were to build the classifier I mentioned in class, the one to predict
# a student's academic class at Rutgers, they'd be one of 4 classes: FR, SO, JR, SR
# Thus our matrix would be 4 x 4.
# You can put this table into your report pretty easily too, it is just
# a quick/concise way to summarize how "good" your model is
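
As a small illustration of the 4 x 4 case, here is a sketch with made-up labels for eight students. The `y_true` and `y_pred` lists are entirely hypothetical; they only show how the matrix size follows from the number of classes.

```python
from sklearn.metrics import confusion_matrix

# Fix the class order so the rows/columns read FR, SO, JR, SR
class_labels = ["FR", "SO", "JR", "SR"]

# Hypothetical data: the actual class of 8 students and a model's guesses
y_true = ["FR", "FR", "SO", "SO", "JR", "JR", "SR", "SR"]
y_pred = ["FR", "SO", "SO", "SO", "JR", "SR", "SR", "SR"]

cm = confusion_matrix(y_true, y_pred, labels=class_labels)
print(cm.shape)
print(cm)
```

Here one freshman was predicted as a sophomore and one junior as a senior, so those two counts land off the diagonal.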

In [12]:
# Other available metrics are accuracy, precision, recall, and f1
accuracy = metrics.accuracy_score(y_test, predictions) # accuracy
precision = metrics.precision_score(y_test, predictions, average = 'macro') # precision
recall = metrics.recall_score(y_test, predictions, average = 'macro') # recall
f1 = metrics.f1_score(y_test, predictions, average='macro') # f1

print("Accuracy = " + str(accuracy))
print("Precision = " + str(precision))
print("Recall = " + str(recall))
print("F1-score = " + str(f1))

Accuracy = 0.9722222222222222
Precision = 0.9722222222222222
Recall = 0.9743589743589745
F1-score = 0.9721739130434783
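
These four numbers can be reproduced by hand from the confusion matrix shown earlier, which may help when explaining them in a report. The sketch below uses only numpy; "macro" averaging means computing the metric per class and then taking the plain average. The variable names are my own.

```python
import numpy as np

# Confusion matrix from earlier: rows = actual class, columns = predicted class
cm = np.array([[11, 0, 0],
               [1, 12, 0],
               [0, 0, 12]])

true_pos = np.diag(cm)  # correct predictions per class

# Precision: of the rows predicted as a class, how many really were that class
precision_per_class = true_pos / cm.sum(axis=0)
# Recall: of the rows that really were a class, how many were found
recall_per_class = true_pos / cm.sum(axis=1)
# F1: harmonic mean of precision and recall, per class
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

accuracy = true_pos.sum() / cm.sum()
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()
macro_f1 = f1_per_class.mean()

print("Accuracy = " + str(accuracy))
print("Precision = " + str(macro_precision))
print("Recall = " + str(macro_recall))
print("F1-score = " + str(macro_f1))
```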


In [13]:
# The results I got were as follows
# Accuracy = 0.9722222222222222
# Precision = 0.9722222222222222
# Recall = 0.9743589743589745
# F1-score = 0.9721739130434783
# Because train_test_split shuffles the rows randomly, you may get different results
# A brief explanation of each can be found here: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
# Which one is most appropriate or expected is domain specific, so for your cases
# you can just report all of them...
#
# For some extra context, the article I linked goes into great detail, but again,
# the domain matters... when you're trying to avoid missing a "positive" result,
# which is especially important in the medical field, then recall, which is more
# geared to address this, may be appropriate.
#
# Again, for the scope of this project and for how minimal the code is, you can
# just report all, but I wanted to at least give you some context on each to 
# read if you want it
#
# Last, but not least, I want to reiterate that logistic regression is just one approach
# to building a classifier, and perhaps one of the easiest to implement and then
# interpret. It isn't always the automatic best choice. The
# intent is just to teach you all when you'd use classification versus regression.
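
If you do decide to report everything, scikit-learn's `classification_report` prints precision, recall, and f1 for each class in one table. The sketch below reruns the whole workflow so it is self-contained; the seed and the `stratify` argument are my own additions, so the numbers will not necessarily match the ones above.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0, stratify=wine.target
)

# Scale using the training rows, then reuse the same scaling on the test rows
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = LogisticRegression().fit(X_train, y_train)
predictions = classifier.predict(X_test)

# One table with per-class precision, recall, f1, and the macro averages
report = classification_report(y_test, predictions, target_names=wine.target_names)
print(report)
```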