1.a
A Classification Decision Tree is used to categorize data into discrete classes based on input features. It’s ideal for problems like medical diagnosis (e.g., classifying patients as "healthy" or "at risk"), fraud detection (e.g., identifying fraudulent transactions), or spam filtering (e.g., labeling emails as "spam" or "not spam"). Its key strengths are simplicity, interpretability, and the ability to handle both numerical and categorical data effectively.


1.b
A Classification Decision Tree predicts categories by splitting data into branches based on decision rules, with the final leaf node representing the predicted class. In contrast, Multiple Linear Regression predicts continuous values using a linear equation: a weighted sum of the features plus an intercept, with the weights fitted to the data. Trees are for discrete classes; regression is for numeric outcomes.

2.
Accuracy is ideal for balanced scenarios, like general health screenings, where both classes are similarly common and overall correct predictions matter. Sensitivity is crucial in detecting critical conditions, like cancer, to minimize missed diagnoses (false negatives). Specificity is key for confirmatory tests, such as a follow-up test after a positive screening result, where a false positive would lead to unnecessary interventions. Precision works best in systems like fraud detection, where every flagged case is reviewed and reducing false positives saves resources. Each metric aligns with specific goals based on which type of error is more costly.
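To make the four metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute four common metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),                # share of actual positives that are caught
        "specificity": tn / (tn + fp),                # share of actual negatives that are cleared
        "precision": tp / (tp + fp),                  # share of predicted positives that are real
    }


# Hypothetical counts, not taken from any real data set
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity each look at one row of the confusion matrix (actual positives or actual negatives), while precision looks at one column (predicted positives), which is why they respond to different kinds of errors.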

In [None]:
#3
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, make_scorer
import graphviz as gv

# Load the Amazon books data, read with ISO-8859-1 encoding
url = "https://raw.githubusercontent.com/pointOfive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")

# Drop the unused size columns, then remove rows with any missing value
ab.drop(columns=['Weight_oz', 'Width', 'Height'], inplace=True)
ab.dropna(inplace=True)

# Cast columns to the types used for modelling
ab['Pub year'] = ab['Pub year'].astype(int)
ab['NumPages'] = ab['NumPages'].astype(int)
ab['Hard_or_Paper'] = ab['Hard_or_Paper'].astype('category')

print(ab.info())
print(ab.head())

#4
from sklearn.tree import DecisionTreeClassifier

ab_reduced_noNaN_train, ab_reduced_noNaN_test = train_test_split(ab, test_size=0.2, random_state=42)

print(f"Training set size: {len(ab_reduced_noNaN_train)} observations")
print(f"Testing set size: {len(ab_reduced_noNaN_test)} observations")

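# Target: True for hardcover ('H'), False for paperback
y = pd.get_dummies(ab_reduced_noNaN_train["Hard_or_Paper"])['H']
# Single predictor for the first model
X = ab_reduced_noNaN_train[['List Price']]

# Shallow tree fit on the training set only
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, y)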

tree.plot_tree(clf, feature_names=['List Price'], class_names=['Paperback', 'Hardcover'], filled=True)

In [None]:
#5
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Fit on the training split only, so the test split stays unseen
X = ab_reduced_noNaN_train[['NumPages', 'Thick', 'List Price']]
y = pd.get_dummies(ab_reduced_noNaN_train["Hard_or_Paper"])['H']

clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X, y)

tree.plot_tree(clf2, feature_names=['NumPages', 'Thick', 'List Price'], class_names=['Paperback', 'Hardcover'], filled=True)

5.
The clf2 decision tree predicts whether a book is a hardcover or paperback by splitting the data based on the features NumPages, Thick, and List Price. Starting at the root node, each internal node compares one feature against a threshold learned during training, and the sample moves to the left or right branch depending on the result. The sample follows the branches until it reaches a leaf node, which gives the predicted class (hardcover or paperback) and the class probabilities, taken from the proportions of training books that ended up in that leaf. With max_depth = 4, a sample passes through at most four such comparisons, so the tree can capture moderately complex relationships between the features and the target.
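The traversal described above can be sketched by hand. This is an illustration only: the nested dictionary `toy_tree`, its thresholds, and its leaf counts are all invented and are not the splits clf2 actually learned.

```python
# Hypothetical tree: thresholds and leaf counts are invented for illustration
toy_tree = {
    "feature": "List Price", "threshold": 17.97,
    "left": {"counts": {"Paperback": 30, "Hardcover": 2}},
    "right": {
        "feature": "Thick", "threshold": 0.95,
        "left": {"counts": {"Paperback": 8, "Hardcover": 4}},
        "right": {"counts": {"Paperback": 1, "Hardcover": 15}},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, then return the majority class and class proportions."""
    while "counts" not in node:
        # Values at or below the threshold go left, larger values go right
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    total = sum(node["counts"].values())
    probs = {label: n / total for label, n in node["counts"].items()}
    return max(probs, key=probs.get), probs


book = {"NumPages": 320, "Thick": 1.2, "List Price": 25.0}  # hypothetical book
label, probs = predict(toy_tree, book)
print(label, probs)
```

Note that this toy tree never tests NumPages: a tree only uses a feature if some split on it was chosen during training.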

6.
Using the test set, confusion matrices were created for both models. For clf (using only List Price), the sensitivity was 70%, specificity was 90.9%, and accuracy was 84.4%. For clf2 (using NumPages, Thick, and List Price), the sensitivity improved to 75%, specificity remained at 90.9%, and accuracy increased slightly to 85.9%. This indicates that clf2 performed slightly better on this test set, which is consistent with the additional features (and the deeper tree) helping it identify hardcovers.
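As a sketch of how such numbers can be obtained, the snippet below pulls the four counts out of `sklearn.metrics.confusion_matrix`. The label arrays are hypothetical stand-ins (1 = hardcover, 0 = paperback); with the real models they would be the test-set targets and the output of `clf.predict` or `clf2.predict` on the test-set features.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not the real test set (1 = hardcover, 0 = paperback)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, accuracy={accuracy:.3f}")
```

Passing `labels=[0, 1]` explicitly fixes the row and column order, which avoids mixing up the positive and negative class when unpacking the counts.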

7.
The differences between these two confusion matrices come from the features used for prediction. The first uses only List Price as the predictor, which gives the model limited information and leads to less accurate predictions. The second adds NumPages and Thick, so the model can capture more of the structure in the data and predicts better. The confusion matrices for clf and clf2 computed on the test set are the better ones to report because they measure performance on unseen data, so they reflect how well each model generalizes rather than how closely it fits the training set.

8.
The most important predictor variable for making predictions according to clf2 is List Price, with a feature importance score of approximately 0.486. This indicates that List Price contributes the most to the decision-making process of the classification tree, compared to the other predictors (NumPages and Thick).
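For reference, importance scores like this can be read from a fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data rather than the books data: the feature names `informative` and `noise` are made up, and the target depends only on the first one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the class depends only on the first column
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# One score per feature, in column order; the scores are normalized to sum to 1
importances = dict(zip(["informative", "noise"], model.feature_importances_))
print(importances)
```

Applied to clf2, the same attribute would be paired with the column names NumPages, Thick, and List Price, and the largest value identifies the most important predictor.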

9.
In linear regression, each coefficient shows how much the predicted outcome changes when one variable increases by one unit while the other variables are held fixed. In decision trees, feature importance shows which variable helps the model the most, measured by how much the splits on that variable improve the predictions across the whole tree. Unlike a regression coefficient, a feature importance has no direction or unit, and the effect of one variable depends on which other splits come before it, so a tree does not give a single, clear effect per variable.
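The "one unit increase" reading of a regression coefficient can be checked numerically. This is a minimal sketch on noiseless synthetic data; the true weights (1, 2, and -3) are assumptions made up for the example, and the fit uses NumPy least squares rather than any particular regression library.

```python
import numpy as np

# Synthetic, noiseless data: y = 1 + 2*x1 - 3*x2 (weights chosen for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Least-squares fit with an intercept column
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, b1, b2 = coef


def predict(x1, x2):
    """Prediction from the fitted linear equation."""
    return intercept + b1 * x1 + b2 * x2


# Raise x1 by one unit while holding x2 fixed: the prediction moves by b1
change = predict(1.0, 0.5) - predict(0.0, 0.5)
print(f"b1={b1:.3f}, b2={b2:.3f}, change={change:.3f}")
```

The change equals b1 no matter which value x2 is held at. A tree's feature importance offers no equivalent statement, because the same feature can push predictions in different directions in different branches.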

10.
Yep.