GitHub Link: https://github.com/sbains2/GSBS544

Instructions
You will submit an HTML document to Canvas as your final version.

Your document should show your code chunks/cells as well as any output. Make sure that only relevant output is printed. Do not, for example, print the entire dataset in your final rendered file.

Your document should also be clearly organized, so that it is easy for a reader to find your answers to each question.

The Data
This week, we consider a dataset generated from text data.

The original dataset can be found here: https://www.kaggle.com/datasets/kingburrito666/cannabis-strains. It consists of user reviews of different strains of cannabis. Users rated their experience with the cannabis strain on a scale of 1 to 5. They also selected words from a long list to describe the Effects and the Flavor of the cannabis.

In the dataset linked above, each row is one strain of cannabis. The average rating of all testers is reported, as well as the most commonly used words for the effect and flavor.

Some data cleaning has been performed for you: The Effect and Flavor columns have been converted to dummy variables indicating if the particular word was used for the particular strain.

This cleaned data can be found at: https://www.dropbox.com/s/s2a1uoiegitupjc/cannabis_full.csv

Our goal will be to fit models that identify the Sativa types from the Indica types, and then to fit models that also distinguish the Hybrid types.

IMPORTANT: In this assignment, you do not need to consider different feature sets. Normally, this would be a good thing to try - but for this homework, simply include all the predictors for every model.

In [None]:
# Importing all necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier  # needed for the kNN model in Part Two
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from plotnine import *
import matplotlib.pyplot as plt

In [None]:
# Reading in the data
df = pd.read_csv('weed.csv')
# Dropping NaNs
weed = df.dropna()
weed.info()
weed.describe()

Part One: Binary Classification
Create a dataset that is limited only to the Sativa and Indica type cannabis strains.

This section asks you to create a final best model for each of the four new model types studied this week: LDA, QDA, SVC, and SVM. For SVM, you may limit yourself to only the polynomial kernel.

For each, you should:

Choose a metric you will use to select your model, and briefly justify your choice. (Hint: There is no specific target category here, so this should not be a metric that only prioritizes one category.)

Find the best model for predicting the Type variable. Don’t forget to tune any hyperparameters.

Report the (cross-validated!) metric.

Fit the final model.

Output a confusion matrix.

Q1: LDA
Q2: QDA
Q3: SVC
Q4: SVM
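One way to act on the metric hint above is to score the same model with several class-symmetric metrics at once. The sketch below uses synthetic data from `make_classification` as a stand-in for the Indica/Sativa subset, so the names `X_demo` and `y_demo` and the printed values are illustrative assumptions, not results from this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_validate

# Synthetic, mildly imbalanced two-class data (a stand-in, not the cannabis data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

# Metrics that do not favor one class over the other
metric_names = ["accuracy", "balanced_accuracy", "f1_macro"]
scores = cross_validate(
    LinearDiscriminantAnalysis(), X_demo, y_demo, cv=5, scoring=metric_names
)

# cross_validate returns one array per metric under the key "test_<metric>"
for name in metric_names:
    print(name, round(scores[f"test_{name}"].mean(), 3))
```

Plain accuracy treats every strain equally, while balanced accuracy and macro F1 average over classes, so they are less flattering when one type dominates the sample.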

Creating a dataset with targets Indica and Sativa

In [None]:
# Subsetting the dataset with limiting targets to Indica and Sativa
weed = weed[(weed['Type'] == 'indica') | (weed['Type'] == 'sativa')]

# Checking that there are only two types in the Type column
weed['Type'].nunique()

Basic model setup

In [None]:
# Creating target and explanatory variables
X = weed.drop(columns=['Type'])
y = weed['Type']

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

# Defining columnTransformer for preprocessing steps
ct = ColumnTransformer(
    transformers=[
    ('num', StandardScaler(), num),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

# Doing a test-train split to partition training and testing data
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=42)

For LDA

In [None]:
# Creating a pipeline for Linear Discriminant Analysis
LDApipe = Pipeline([
    ('preprocess', ct),
    # solver='lsqr' uses a least-squares solution instead of the default SVD solver;
    # chosen here to cope with the case where the number of predictors exceeds the
    # number of samples per class and the dummy features are highly correlated.
    ('model', LinearDiscriminantAnalysis(solver='lsqr'))
])

# Fitting the pipe on the training data
LDApipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(LDApipe, X, y, cv=5)))

# Generating predictions on the validation set
y_pred = LDApipe.predict(Xv)

# specifying class order for rows/columns
labels = ['indica', 'sativa']

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

For the LDA model, we find that the mean cross-validated accuracy is 0.54.
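The LDA pipeline above has no tuned hyperparameter. One candidate, offered here only as a sketch, is the `shrinkage` option that `LinearDiscriminantAnalysis` accepts with the `lsqr` solver. The data below is synthetic with many redundant columns (an assumption meant to mimic correlated dummy variables), so the selected value says nothing about the cannabis data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic data with many redundant (linearly dependent) features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=60, n_informative=8, n_redundant=30, random_state=0
)

# shrinkage regularizes the pooled covariance estimate; it requires lsqr or eigen
lda_grid = GridSearchCV(
    LinearDiscriminantAnalysis(solver="lsqr"),
    {"shrinkage": [None, "auto", 0.1, 0.5, 0.9]},
    cv=5,
    scoring="accuracy",
)
lda_grid.fit(X_demo, y_demo)

print(lda_grid.best_params_)
print(round(lda_grid.best_score_, 3))
```

In the notebook's pipeline the equivalent grid key would be `model__shrinkage`, since the estimator sits in the step named `model`.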


For QDA

In [None]:
# Creating a pipeline for Quadratic Discriminant Analysis
QDApipe = Pipeline([
    ('preprocess', ct),
    ('model', QuadraticDiscriminantAnalysis())
])

# Fitting the pipe on the training data
QDApipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(QDApipe, X, y, cv=5)))

# Generating predictions on the validation set
y_pred = QDApipe.predict(Xv)

# specifying class order for rows/columns
labels = ['indica', 'sativa']

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

cross validated accuracy: 0.34362550836791794
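QDA estimates a separate covariance matrix per class, which can be fragile when there are many predictors per class. A possible remedy, sketched here on synthetic data (an assumption, not the assignment data), is to tune the `reg_param` argument of `QuadraticDiscriminantAnalysis`.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data with relatively few rows per feature
X_demo, y_demo = make_classification(
    n_samples=160, n_features=30, n_informative=6, n_redundant=0, random_state=0
)

# reg_param regularizes each per-class covariance estimate
reg_values = [0.0, 0.01, 0.1, 0.5, 1.0]
qda_grid = GridSearchCV(
    QuadraticDiscriminantAnalysis(),
    {"reg_param": reg_values},
    cv=5,
    scoring="accuracy",
)
qda_grid.fit(X_demo, y_demo)

print(qda_grid.best_params_)
print(round(qda_grid.best_score_, 3))
```

Inside the notebook's pipeline the grid key would be `model__reg_param`.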


For SVC

In [None]:
# Creating a pipeline for Support Vector Classifier
SVCpipe = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the pipe on the training data
SVCpipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(SVCpipe, X, y, cv=5)))

# Generating predictions on the validation set
y_pred = SVCpipe.predict(Xv)

# specifying class order for rows/columns
labels = ['indica', 'sativa']

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)
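The SVC cell above uses `SVC()` with its defaults, which means an RBF kernel and no tuning. If "SVC" in the assignment is read as a linear support vector classifier (an assumption on my part), the cost parameter `C` could be tuned roughly as follows. The data is synthetic and the step names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the two-class problem
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Linear-kernel support vector classifier with scaled inputs
svc_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", SVC(kernel="linear")),
])

# Smaller C allows a wider, softer margin; larger C penalizes violations more
svc_grid = GridSearchCV(
    svc_pipe, {"model__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="accuracy"
)
svc_grid.fit(X_demo, y_demo)

print(svc_grid.best_params_)
print(round(svc_grid.best_score_, 3))
```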

For SVM

In [None]:
# Creating a pipeline for SVM
SVMpipe = Pipeline([
    ('preprocess', ct),
    ('model', SVC(kernel='poly'))
])

# Tuning key hyperparams for the polynomial kernel
svm_params = {
    'model__C': [0.1, 1, 10],
    'model__degree': [2,3,4],
    'model__gamma': ['scale', 'auto'],
    'model__coef0': [0,1]
}

svm_grid = GridSearchCV(SVMpipe, svm_params, cv=5)
svm_grid.fit(Xt, yt)

# Cross validated accuracy for the best polynomial SVM
print(svm_grid.best_score_)
print(svm_grid.best_params_)

# Generating predictions on the validation set using tuned model
best_svm = svm_grid.best_estimator_
y_pred = best_svm.predict(Xv)

# specifying class order for rows/columns
labels = ['indica', 'sativa']

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

For the SVM model, we find that the mean cross-validated accuracy is 0.858.
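The four cells above repeat the same fit, score, and confusion-matrix steps. A small helper could reduce that repetition. The function name `evaluate` and the synthetic data are hypothetical, and this sketch cross-validates on the training split only so that the validation rows stay unseen during scoring, which differs from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate(model, X_train, y_train, X_valid, y_valid, labels, cv=5):
    """Return (mean CV accuracy on the training split, labelled confusion matrix)."""
    cv_acc = float(np.mean(
        cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    ))
    model.fit(X_train, y_train)
    cm = confusion_matrix(y_valid, model.predict(X_valid), labels=labels)
    cm_df = pd.DataFrame(
        cm,
        index=[f"Actual {lbl}" for lbl in labels],
        columns=[f"Predicted {lbl}" for lbl in labels],
    )
    return cv_acc, cm_df


# Synthetic two-class data standing in for the Indica/Sativa subset
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=0)
Xt_demo, Xv_demo, yt_demo, yv_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

cv_demo, cm_demo = evaluate(
    LinearDiscriminantAnalysis(), Xt_demo, yt_demo, Xv_demo, yv_demo, labels=[0, 1]
)
print(round(cv_demo, 3))
print(cm_demo)
```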


Part Two: Natural Multiclass
Now use the full dataset, including the Hybrid strains.

Q1
Fit a decision tree, plot the final fit, and interpret the results.

In [None]:
p2 = df.dropna()

# Creating new target and explanatory variables
X = p2.drop(columns=['Type'])
y = p2['Type']

Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Creating a pipeline for DTC
tree = Pipeline([
    ('preprocess', ct),
    ('model', DecisionTreeClassifier())
])

tree_grid = {
    'model__max_depth': [3,5,7,9,11],
    'model__min_samples_split': [2,3,4],
    'model__criterion': ['gini', 'entropy']
}

grid_tree = GridSearchCV(tree, tree_grid, cv=5, scoring='accuracy')
grid_tree.fit(Xt, yt)
y_pred = grid_tree.predict(Xv)
print('tree accuracy', accuracy_score(yv, y_pred))
print('classification report', classification_report(yv, y_pred))

# plot the tuned tree

# getting tuned decision tree
best_tree = grid_tree.best_estimator_

# extracting importance scores
imp = best_tree.named_steps['model'].feature_importances_

# extracting feature names after preprocessing
feat = best_tree.named_steps['preprocess'].get_feature_names_out()

# Building a dataframe of top important feature for plotting
formatted = (
    pd.DataFrame({'feature': feat, 'importance': imp})
                .sort_values('importance', ascending=False)
                .head(10) # Limiting to the top 10 entries
)

# Plotting
(
    ggplot(formatted, aes(x='reorder(feature, importance)', y='importance'))
    + geom_col()
    + coord_flip()
    + labs(
        x='feature',
        y='importance',
        title='Decision Tree Feature Importances (top 10)'
    )
)

From the feature-importance plot of the fitted decision tree classifier, we can see that the top split drivers are 'sleepy', 'energetic', 'relaxed', 'citrus', 'uplifted', and 'rating', with 'sleepy' accounting for most of the impurity reduction (~0.78).
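The question also asks for a plot of the fitted tree itself. A minimal sketch with `plot_tree` (already imported at the top of the notebook) is shown below on synthetic three-class data; the feature and class names are placeholders, not the real columns.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic three-class data standing in for the full dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# max_depth here limits only how many levels are drawn, to keep the plot readable
fig, ax = plt.subplots(figsize=(14, 6))
annotations = plot_tree(
    tree_demo,
    feature_names=feature_names,
    class_names=["class_0", "class_1", "class_2"],
    filled=True,
    max_depth=2,
    fontsize=8,
    ax=ax,
)
fig.tight_layout()
```

For the tuned pipeline above, the tree to pass would be `best_tree.named_steps['model']`, with feature names taken from `best_tree.named_steps['preprocess'].get_feature_names_out()`.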


Q2
Repeat the analyses from Part One for LDA, QDA, and KNN.

In [None]:
# For LDA
# Creating a pipeline for Linear Discriminant Analysis
LDApipe = Pipeline([
    ('preprocess', ct),
    # solver='lsqr' uses a least-squares solution instead of the default SVD solver;
    # chosen here to cope with the case where the number of predictors exceeds the
    # number of samples per class and the dummy features are highly correlated.
    ('model', LinearDiscriminantAnalysis(solver='lsqr'))
])

# Fitting the pipe on the training data
LDApipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(LDApipe, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = LDApipe.predict(Xv)

# specifying class order for rows/columns
labels = ['indica', 'sativa', 'hybrid']

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


cross validated accuracy: 0.3094756659489708

In [None]:
# For QDA
# Creating a pipeline for Quadratic Discriminant Analysis
QDApipe = Pipeline([
    ('preprocess', ct),
    ('model', QuadraticDiscriminantAnalysis())
])

# Fitting the pipe on the training data
QDApipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(QDApipe, X, y, cv=5)))

# Generating predictions on the validation set
y_pred = QDApipe.predict(Xv)

# specifying class order for rows/columns
labels = ['indica', 'sativa', 'hybrid']

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

cross validated accuracy: 0.34362550836791794

In [None]:
# For kNN
knnpipe = Pipeline([
    ('preprocess', ct),
    ('model', KNeighborsClassifier())
])

knnpipe.fit(Xt,yt)

knn_grid = {
    'model__n_neighbors': [3,5,7,9],
    'model__weights': ['uniform', 'distance']
}

gridknn = GridSearchCV(knnpipe, knn_grid, cv=5, scoring='accuracy')
gridknn.fit(Xt, yt)
y_pred = gridknn.predict(Xv)


print('best score:', gridknn.best_score_)
print('knn best params:', gridknn.best_params_)

labels = ['indica', 'sativa', 'hybrid']

cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

cross validated accuracy: 0.5690565730565731

Q3
Were your metrics better or worse than in Part One? Why? Which categories were most likely to get mixed up, according to the confusion matrices? Why?

In Part Two, all three models lost accuracy compared to Part One because Hybrid overlaps both Indica and Sativa, which shrinks the margins between classes. LDA held up the best, while QDA decreased significantly because per-class covariance estimates become unstable with many correlated dummy variables. kNN landed in between, with Hybrid points often pulled toward nearby Indica and Sativa neighbors. Overall, most mistakes were Hybrids being predicted as Indica or Sativa, while Indica vs. Sativa stayed cleaner. With its simpler linear boundary, LDA handled the overlap better than more flexible models like kNN or QDA.
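Which categories get mixed up is easier to read from a row-normalized confusion matrix, where each row shows how the strains of one true type were distributed across predictions. The labels and predictions below are invented purely to show the mechanics; they are not model output from this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["indica", "sativa", "hybrid"]

# Hypothetical true labels and predictions, four strains per type
y_true = ["indica"] * 4 + ["sativa"] * 4 + ["hybrid"] * 4
y_hat = [
    "indica", "indica", "indica", "hybrid",   # true indica
    "sativa", "sativa", "hybrid", "hybrid",   # true sativa
    "hybrid", "hybrid", "indica", "sativa",   # true hybrid
]

# normalize="true" divides each row by the number of true cases in that class
cm_norm = confusion_matrix(y_true, y_hat, labels=labels, normalize="true")
cm_norm_df = pd.DataFrame(
    cm_norm,
    index=[f"Actual {lbl}" for lbl in labels],
    columns=[f"Predicted {lbl}" for lbl in labels],
)
print(cm_norm_df.round(2))
```

The diagonal then gives per-class recall, and the largest off-diagonal entry in each row identifies the most common confusion for that type.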

Part Three: Multiclass from Binary
Consider two models designed for binary classification: SVC and Logistic Regression.

Q1
Fit and report metrics for OvR versions of the models. That is, for each of the two model types, create three models:

Indica vs. Not Indica

In [None]:
# For SVC
base = df.dropna()
tmp = base.copy()
tmp['indica'] = (tmp['Type'] == 'indica')

X = tmp.drop(columns=['Type', 'indica'])
y = tmp['indica']

Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

# Defining columnTransformer for preprocessing steps
ct = ColumnTransformer(
    transformers=[
    ('num', StandardScaler(), num),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

SVCpipe = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the pipe on the training data
SVCpipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(SVCpipe, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = SVCpipe.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

cv score: 0.7781945267887791

In [None]:
# For Logistic Regression
logit = Pipeline([
    ('preprocess', ct),
    ('model', LogisticRegression())
])

# Fitting the pipe on the training data
logit.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(logit, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = logit.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

cv score: 0.7864085041761579

Sativa vs. Not Sativa

In [None]:
# For SVC
tmp = base.copy()
tmp['sativa'] = (tmp['Type'] == 'sativa')

X = tmp.drop(columns=['Type', 'sativa'])
y = tmp['sativa']

Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

# Defining columnTransformer for preprocessing steps
ct = ColumnTransformer(
    transformers=[
    ('num', StandardScaler(), num),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

SVCpipe = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the pipe on the training data
SVCpipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(SVCpipe, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = SVCpipe.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

CV scores: 0.8188057124431823

In [None]:
# For Logistic Regression
logit = Pipeline([
    ('preprocess', ct),
    ('model', LogisticRegression())
])

# Fitting the pipe on the training data
logit.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(logit, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = logit.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

CV scores: 0.825653987372713


Hybrid vs. Not Hybrid

In [None]:
# For SVC
tmp = base.copy()
tmp['hybrid'] = (tmp['Type'] == 'hybrid')

X = tmp.drop(columns=['Type', 'hybrid'])
y = tmp['hybrid']
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

# Defining columnTransformer for preprocessing steps
ct = ColumnTransformer(
    transformers=[
    ('num', StandardScaler(), num),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

SVCpipe = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the pipe on the training data
SVCpipe.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(SVCpipe, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = SVCpipe.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

cv scores: 0.6147834950749421

In [None]:
# For Logistic Regression
logit = Pipeline([
    ('preprocess', ct),
    ('model', LogisticRegression())
])

# Fitting the pipe on the training data
logit.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(logit, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = logit.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)

CV scores: 0.6193538656764543

Q2
Which of the six models did the best job distinguishing the target category from the rest? Which did the worst? Does this make intuitive sense?

Of the six models, the Sativa vs. rest logistic regression did the best job distinguishing the target category from the rest (CV accuracy = 0.826), followed closely by the Sativa vs. rest SVC (0.819). The worst were the Hybrid vs. rest models (SVC = 0.615, logistic regression = 0.619). Overall, this makes sense, as Sativa is the most distinct from the other two types, while Hybrid overlaps both Indica and Sativa, making it the hardest one-vs-rest separation.
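The six one-vs-rest fits above follow one pattern, so they could be produced in a loop. This is a sketch on synthetic data: the class labels are mapped onto the three type names only for readability, and the scores it prints are unrelated to the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data; numeric classes are renamed for illustration only
X_demo, y_num = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
type_demo = pd.Series(y_num).map({0: "indica", 1: "sativa", 2: "hybrid"})

models = {"SVC": SVC(), "LogisticRegression": LogisticRegression(max_iter=1000)}

ovr_scores = {}
for target in ["indica", "sativa", "hybrid"]:
    y_bin = type_demo == target  # True for the target type, False for the rest
    for name, est in models.items():
        pipe = Pipeline([("scale", StandardScaler()), ("model", est)])
        ovr_scores[(target, name)] = float(np.mean(
            cross_val_score(pipe, X_demo, y_bin, cv=5, scoring="accuracy")
        ))

# One row per target type, one column per model
print(pd.Series(ovr_scores).unstack().round(3))
```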

Q3
Fit and report metrics for OvO versions of the models. That is, for each of the two model types, create three models:

Indica vs. Sativa

In [None]:
# For Logistic Regression
ovo = df.dropna()
ovo = ovo.copy()
ovo = ovo[ovo['Type'].isin(['indica', 'sativa'])]

ovo['indica_v_sativa'] = ovo['Type'] == 'indica'

X = ovo.drop(columns=['Type', 'indica_v_sativa'])
y = ovo['indica_v_sativa']

Xt, Xv, yt, yv = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

# Creating the Logistic Regression Pipeline

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

ct = ColumnTransformer([
      ('num', StandardScaler(), num),
      ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

logit = Pipeline([
    ('preprocess', ct),
    ('model', LogisticRegression())
])

# Fitting the OvO model on the training dataset
logit.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(logit, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = logit.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


Cross validated accuracy score: 0.8539423456627617


In [None]:
# For SVC
ovo = df.dropna()
ovo = ovo.copy()
ovo = ovo[ovo['Type'].isin(['indica', 'sativa'])]

ovo['indica_v_sativa'] = ovo['Type'] == 'indica'

X = ovo.drop(columns=['Type', 'indica_v_sativa'])
y = ovo['indica_v_sativa']

Xt, Xv, yt, yv = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

# Creating the SVM Pipeline

svm = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the OvO model on the training dataset
svm.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(svm, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = svm.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


Cross validation accuracy score: 0.8408757843008206

Indica vs. Hybrid

In [None]:
# For Logistic Regression
ovo = df.dropna()
ovo = ovo.copy()
ovo = ovo[ovo['Type'].isin(['indica', 'hybrid'])]

ovo['indica_v_hybrid'] = ovo['Type'] == 'indica'

X = ovo.drop(columns=['Type', 'indica_v_hybrid'])
y = ovo['indica_v_hybrid']

Xt, Xv, yt, yv = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

# Creating the Logistic Regression Pipeline

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

ct = ColumnTransformer([
      ('num', StandardScaler(), num),
      ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

logit = Pipeline([
    ('preprocess', ct),
    ('model', LogisticRegression())
])

# Fitting the OvO model on the training dataset
logit.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(logit, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = logit.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


Cross validated accuracy score: 0.7401882101155068


In [None]:
ovo = df.dropna()
ovo = ovo.copy()
ovo = ovo[ovo['Type'].isin(['indica', 'hybrid'])]

ovo['indica_v_hybrid'] = ovo['Type'] == 'indica'

X = ovo.drop(columns=['Type', 'indica_v_hybrid'])
y = ovo['indica_v_hybrid']

Xt, Xv, yt, yv = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

# Creating the SVM Pipeline

svm = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the OvO model on the training dataset
svm.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(svm, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = svm.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


Cross validation accuracy score: 0.7317801277814496

Hybrid vs. Sativa

In [None]:
# For Logistic Regression
ovo = df.dropna()
ovo = ovo.copy()
ovo = ovo[ovo['Type'].isin(['hybrid', 'sativa'])]

ovo['hybrid_v_sativa'] = ovo['Type'] == 'hybrid'

X = ovo.drop(columns=['Type', 'hybrid_v_sativa'])
y = ovo['hybrid_v_sativa']

Xt, Xv, yt, yv = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

# Creating the Logistic Regression Pipeline

# Getting numeric and categorical variables
num = X.select_dtypes(include=['number']).columns
cat = X.select_dtypes(exclude=['number']).columns

ct = ColumnTransformer([
      ('num', StandardScaler(), num),
      ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat)
])

logit = Pipeline([
    ('preprocess', ct),
    ('model', LogisticRegression())
])

# Fitting the OvO model on the training dataset
logit.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(logit, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = logit.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


Cross validated accuracy score: 0.7506482723382513


In [None]:
ovo = df.dropna()
ovo = ovo.copy()
ovo = ovo[ovo['Type'].isin(['hybrid', 'sativa'])]

ovo['hybrid_v_sativa'] = ovo['Type'] == 'hybrid'

X = ovo.drop(columns=['Type', 'hybrid_v_sativa'])
y = ovo['hybrid_v_sativa']

Xt, Xv, yt, yv = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

# Creating the SVM Pipeline

svm = Pipeline([
    ('preprocess', ct),
    ('model', SVC())
])

# Fitting the OvO model on the training dataset
svm.fit(Xt, yt)

# Finding cross-validated accuracy
print(np.mean(cross_val_score(svm, X, y, cv=5, scoring='accuracy')))

# Generating predictions on the validation set
y_pred = svm.predict(Xv)

# specifying class order for rows/columns
labels = [False, True]

# Designing a confusion matrix
cm = confusion_matrix(yv, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=[f"Actual {lbl}" for lbl in labels], # appending row labels
                    columns=[f"Predicted {lbl}" for lbl in labels]) # appending column labels
print(cm_df)


Cross validation accuracy score: 0.7336739690447297

Q4
Which of the six models did the best job at differentiating the two groups? Which did the worst? Does this make intuitive sense?

The Indica vs. Sativa logistic regression model performed the best, with a cross-validated accuracy of ~0.85. The worst was the Indica vs. Hybrid SVM model, with a cross-validated accuracy of ~0.73. Intuitively, this makes sense: Sativa and Indica strains separate more cleanly, while Hybrids overlap both types, which makes the boundary harder to draw.
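The pairwise subsets used in Q3 can be generated with `itertools.combinations` rather than by hand. The following is a sketch on synthetic data with placeholder column names, so its scores are not comparable to the ones above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class frame with a 'Type' column, for illustration only
X_num, y_num = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
demo = pd.DataFrame(X_num, columns=[f"x{i}" for i in range(10)])
demo["Type"] = pd.Series(y_num).map({0: "indica", 1: "sativa", 2: "hybrid"})

models = {"SVC": SVC(), "LogisticRegression": LogisticRegression(max_iter=1000)}

ovo_scores = {}
for a, b in combinations(["indica", "sativa", "hybrid"], 2):
    pair = demo[demo["Type"].isin([a, b])]  # keep only the two types compared
    X_pair = pair.drop(columns=["Type"])
    y_pair = pair["Type"] == a
    for name, est in models.items():
        pipe = Pipeline([("scale", StandardScaler()), ("model", est)])
        ovo_scores[(f"{a} vs {b}", name)] = float(np.mean(
            cross_val_score(pipe, X_pair, y_pair, cv=5, scoring="accuracy")
        ))

print(pd.Series(ovo_scores).unstack().round(3))
```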


Q5
Suppose you had simply input the full data, with three classes, into the LogisticRegression function. Would this have automatically taken an “OvO” approach or an “OvR” approach?

Neither. By default (multi_class='auto' with a multinomial-capable solver such as lbfgs), LogisticRegression fits a single multinomial model rather than OvO or OvR. If you pick a solver that can't fit a multinomial model (liblinear), it falls back to one-vs-rest.
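If an explicit OvR or OvO decomposition of logistic regression is wanted, scikit-learn provides wrapper classes in `sklearn.multiclass`. The sketch below uses four synthetic classes (an assumption chosen so that the two strategies give different numbers of binary models) and simply counts the fitted sub-models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Four synthetic classes, so OvR needs 4 binary models and OvO needs 4*3/2 = 6
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

print(len(ovr.estimators_))  # 4: one binary model per class
print(len(ovo.estimators_))  # 6: one binary model per pair of classes
```

With the three cannabis types both strategies happen to need three binary models, which is why Part Three fits three models in each case.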


What about for SVC?
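For multiclass input, SVC always trains one-vs-one classifiers internally. The decision_function_shape argument (default 'ovr') only changes the shape of the decision function output, not the training strategy.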
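The effect of `decision_function_shape` can be checked directly. This sketch again uses four synthetic classes (an assumption, chosen so the two shapes differ) and compares the decision-function widths and the predictions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Four synthetic classes: 6 pairwise classifiers, 4 classes
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

svc_ovo = SVC(decision_function_shape="ovo").fit(X_demo, y_demo)
svc_ovr = SVC(decision_function_shape="ovr").fit(X_demo, y_demo)

print(svc_ovo.decision_function(X_demo[:5]).shape)  # (5, 6): one column per class pair
print(svc_ovr.decision_function(X_demo[:5]).shape)  # (5, 4): one column per class

# Training is one-vs-one in both cases, so the predicted labels agree
print((svc_ovo.predict(X_demo) == svc_ovr.predict(X_demo)).all())
```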


Note: You do not actually have to run code here - you only need to look at sklearn’s documentation to see how these functions handle multiclass input.