# 1. Start a ChatBot session to understand what a Classification Decision Tree is: (a) ask the ChatBot to describe the type of problem a Classification Decision Tree addresses and provide some examples of real-world applications where this might be particularly useful, and then (b) make sure you understand the difference between how a Classification Decision Tree makes (classification) predictions versus how Multiple Linear Regression makes (regression) predictions

a) A Classification Decision Tree is a supervised machine learning algorithm for classification problems, where the goal is to assign a category (class) label to each data point. It recursively splits the dataset into subsets based on the value of one feature at a time, creating a tree structure: each internal node is a decision based on a feature, each branch is an outcome of that decision, and each leaf node is a class label. Real-world examples include spam detection, deciding whether a patient has a disease, and approving or denying a loan application.
b) Key Differences:
- Output Type:
  1. Classification Decision Tree: Discrete (class labels).
  2. Multiple Linear Regression: Continuous (numerical values).
- Prediction Mechanism:
  1. Classification Decision Tree: A sequence of yes/no decisions on feature values that ends at a leaf, whose majority class is the prediction.
  2. Multiple Linear Regression: An intercept plus a weighted linear combination of the features.
- Applicability:
  1. Classification Decision Tree: Classification problems (e.g., spam detection).
  2. Multiple Linear Regression: Regression problems (e.g., stock price prediction).
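
To make the contrast concrete, here is a minimal sketch of the two prediction mechanisms. The threshold, intercept, and weights are made-up values for illustration (not fitted to any dataset), and the function names are hypothetical.

```python
def tree_predict(list_price):
    """Classification tree: follow yes/no decisions down to a leaf's class label."""
    # Hypothetical threshold, chosen only for illustration
    if list_price <= 20.0:
        return "Paper"
    return "Hard"


def linear_predict(num_pages, thick, intercept=2.0, w_pages=0.03, w_thick=5.0):
    """Multiple linear regression: intercept plus a weighted sum of the predictors."""
    # Hypothetical intercept and weights, chosen only for illustration
    return intercept + w_pages * num_pages + w_thick * thick


print(tree_predict(12.5))        # a class label
print(linear_predict(300, 1.0))  # a continuous number
```

The tree can only return one of a fixed set of labels, while the linear model can return any number on a continuous scale.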

# 2. Continue your ChatBot session and explore with your ChatBot what real-world application scenario(s) might be most appropriately addressed by each of the following metrics below: provide your answers and, in your own words, concisely explain your rationale for your answers.

### 1. Accuracy measures the proportion of true results (both true positives and true negatives) in the population.

Scenario: General diagnostics in stable populations with balanced classes.
My explanation: when the classes are roughly balanced and false positives and false negatives are about equally costly, the overall fraction of correct predictions is a fair summary of performance.

### 2. Sensitivity measures the proportion of actual positives that are correctly identified.

Scenario: Medical screening for critical conditions where missing a positive case is costly.
My explanation: when a false negative (a missed case) has severe consequences, we want to catch as many actual positives as possible, which is what sensitivity measures.

### 3. Specificity measures the proportion of actual negatives that are correctly identified.

Scenario: Preventing false alarms in security systems or diagnostics where over-detection causes unnecessary action.
My explanation: when a false positive triggers unnecessary follow-up work or alarm, we want actual negatives to be correctly ruled out, which is what specificity measures.

### 4. Precision measures the proportion of positive identifications that were actually correct.

Scenario: High-stakes decision-making where false positives must be minimized.
My explanation: when acting on a positive prediction is expensive, so that a false positive costs more than a false negative, we need the positive predictions we do make to be trustworthy, which is what precision measures.
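
The four metrics can all be computed from the same four confusion-matrix counts. The sketch below uses made-up counts (not results from any model in this notebook), and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four metrics from true/false positive/negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),  # all correct / all cases
        "sensitivity": tp / (tp + fn),                # correct among actual positives
        "specificity": tn / (tn + fp),                # correct among actual negatives
        "precision": tp / (tp + fp),                  # correct among predicted positives
    }


# Made-up counts, for illustration only
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the actual class, while precision conditions on the predicted class, which is why specificity and precision respond differently to the same false positives.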

# 3. Explore the amazon books dataset, seen previously at the start of the semester, providing some initial standard exploratory data analysis (EDA) and data summarization after pre-processing the dataset to meet the requirements below

1. remove Weight_oz, Width, and Height
2. drop all remaining rows with NaN entries
3. set Pub year and NumPages to have the type int, and Hard_or_Paper to have the type category

In [3]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, make_scorer
import graphviz as gv

url = "https://raw.githubusercontent.com/pointOfive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")
# create `ab_reduced_noNaN` based on the specs above

Here is the code that carries out the pre-processing steps listed above and then summarizes the resulting dataset.

In [None]:
import pandas as pd

# Load the dataset
file_path = "amazonbooks.csv"  # Ensure the file is in the same directory as your Jupyter notebook
ab = pd.read_csv(file_path, encoding="ISO-8859-1")

# Pre-process the dataset based on the requirements
# Remove specified columns: Weight_oz, Width, and Height
ab_reduced = ab.drop(columns=["Weight_oz", "Width", "Height"])

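# Drop all remaining rows with NaN entries; .copy() gives an independent DataFrame,
# so the type conversions below cannot trigger a SettingWithCopyWarning
ab_reduced_noNaN = ab_reduced.dropna().copy()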

# Set Pub year and NumPages to type int, and Hard_or_Paper to type category
ab_reduced_noNaN["Pub year"] = ab_reduced_noNaN["Pub year"].astype(int)
ab_reduced_noNaN["NumPages"] = ab_reduced_noNaN["NumPages"].astype(int)
ab_reduced_noNaN["Hard_or_Paper"] = ab_reduced_noNaN["Hard_or_Paper"].astype("category")

# Display a summary of the dataset
print("Dataset Shape:", ab_reduced_noNaN.shape)
print("\nColumn Types:\n", ab_reduced_noNaN.dtypes)
print("\nFirst Few Rows:\n", ab_reduced_noNaN.head())
print("\nStatistical Summary:\n", ab_reduced_noNaN.describe(include="all"))

# Save the processed dataset to a new file if needed
# ab_reduced_noNaN.to_csv("processed_amazonbooks.csv", index=False)


# 4. Create an 80/20 split with 80% of the data as a training set ab_reduced_noNaN_train and 20% of the data testing set ab_reduced_noNaN_test using either df.sample(...) as done in TUT or using train_test_split(...) as done in the previous HW, and report on how many observations there are in the training data set and the test data set.

# Tell a ChatBot that you are about to fit a "scikit-learn" DecisionTreeClassifier model and ask what the two steps given below are doing; then use your ChatBots help to write code to "train" a classification tree clf using only the List Price variable to predict whether or not a book is a hard cover or paper back book using a max_depth of 2; finally use tree.plot_tree(clf) to explain what predictions are made based on List Price for the fitted clf model
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']
X = ab_reduced_noNaN[['List Price']]

(1) Splitting the Data into Training and Testing Sets
You can create an 80/20 split using either df.sample() or train_test_split() from scikit-learn. Here’s the code for splitting:

In [None]:
from sklearn.model_selection import train_test_split

# Define the predictors (X) and the target variable (y)
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']
X = ab_reduced_noNaN[['List Price']]

# Perform an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Report the number of observations in each set
print("Number of observations in the training set:", len(X_train))
print("Number of observations in the testing set:", len(X_test))
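
The two lines that define `y` and `X` can be unpacked on a small made-up DataFrame. The rows below are invented for illustration and are not from the Amazon books data, and the name `toy` is hypothetical. This sketch assumes pandas is installed.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Hard_or_Paper": pd.Categorical(["H", "P", "P", "H"]),
    "List Price": [30.0, 12.5, 9.99, 25.0],
})

# get_dummies makes one indicator column per category ("H" and "P");
# selecting ['H'] keeps the column that marks the hardcover rows
y = pd.get_dummies(toy["Hard_or_Paper"])["H"]

# Double brackets keep X as a one-column DataFrame (2-D), the shape scikit-learn expects
X = toy[["List Price"]]

print(y.astype(int).tolist())
print(X.shape)
```

So the first step turns the category into a binary outcome (hardcover or not), and the second selects the predictor as a two-dimensional table rather than a one-dimensional Series.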

(2) Steps in Fitting a Decision Tree Classifier
Here are the two steps typically involved:

Step 1 - "Training" the Model:
This step uses the fit method of the DecisionTreeClassifier. The model learns patterns from the training data X_train (predictor variables) and y_train (target variable). It builds a tree by recursively splitting the data on feature values, choosing each split to reduce the impurity (Gini impurity by default) of the resulting groups.

In [None]:
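clf = tree.DecisionTreeClassifier(max_depth=2, random_state=42)  # the model must exist before fitting
clf.fit(X_train, y_train)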

Step 2 - "Making Predictions":
This step uses the predict method to apply the trained model to new, unseen data (e.g., X_test). The predictions are the class labels inferred by the model.

In [None]:
predictions = clf.predict(X_test)

(3) Training a Decision Tree Classifier
Here’s the code to train a classification tree using List Price to predict whether a book is hardcover (Hard_or_Paper = 'H') with a max_depth of 2:

In [None]:
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Create the DecisionTreeClassifier with max_depth=2
clf = DecisionTreeClassifier(max_depth=2, random_state=42)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=["List Price"], class_names=["Paper", "Hard"], filled=True)
plt.show()

(4) Interpreting the Tree
The tree visualization provides the following insights:

Root Node: Splits the data based on List Price at a certain threshold. For example, if List Price <= $20, the books may tend to be paperback, and otherwise hardcover.
Leaf Nodes: Indicate the predicted class (paperback or hardcover), along with the number of training samples that reached the leaf and how those samples divide between the two classes.
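
One way to read fitted rules without a plot is scikit-learn's `export_text`. The sketch below runs it on a small made-up set of prices and labels (not the Amazon books data), so the threshold it prints is not the one `clf` learns. The name `toy_clf` is hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: list prices and whether the book is hardcover (1) or paperback (0)
prices = [[5.0], [8.0], [10.0], [12.0], [15.0], [25.0], [30.0], [35.0], [40.0], [50.0]]
is_hard = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

toy_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
toy_clf.fit(prices, is_hard)

# Text version of the tree: one line per decision rule and per leaf
rules = export_text(toy_clf, feature_names=["List Price"])
print(rules)

# New prices are classified by following the rules down to a leaf
preds = toy_clf.predict([[10.0], [45.0]])
print(preds)
```

Each printed rule corresponds to one box in the `plot_tree` figure, which can make the thresholds easier to quote in a written explanation.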

# 5. Repeat the previous problem but this time visualize the classification decision tree based on the following specifications below; then explain generally how predictions are made for the clf2 model
X = ab_reduced_noNaN[['NumPages', 'Thick', 'List Price']]
max_depth set to 4

Step 1: Define the Predictors and Target Variable
In this problem, the predictors will include NumPages, Thick, and List Price. The target remains whether the book is hardcover (Hard_or_Paper = 'H').

In [None]:
# Define predictors and target variable
X = ab_reduced_noNaN[['NumPages', 'Thick', 'List Price']]
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']

Step 2: Split the Data into Training and Testing Sets

In [None]:
# Perform an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train the Classification Tree Model

In [None]:
# Create the DecisionTreeClassifier with max_depth=4
clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)

# Train the classifier on the training data
clf2.fit(X_train, y_train)

Step 4: Visualize the Decision Tree

In [None]:
# Visualize the tree
plt.figure(figsize=(16, 10))
tree.plot_tree(clf2, feature_names=['NumPages', 'Thick', 'List Price'], 
               class_names=["Paper", "Hard"], filled=True)
plt.show()

General Explanation of Predictions for clf2
1. Feature Splits: At each internal node the tree picks the single feature (NumPages, Thick, or List Price) and threshold that most reduce impurity (Gini impurity by default) in the resulting child nodes.
2. Decision Path: Each data point traverses the tree, starting from the root node and following the feature-based decision rules until it reaches a leaf node.
3. Leaf Nodes: At the leaf nodes, the model makes predictions based on the majority class of the training samples that fall into that node. For example:
- If most books in a leaf node are hardcovers, the model predicts "Hard."
- If most are paperbacks, it predicts "Paper."
The additional depth (max_depth=4) allows the tree to consider more complex relationships among the predictors, potentially leading to more nuanced predictions but with a risk of overfitting.
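
The split selection described above can be sketched for a single numeric feature. This is a simplified illustration of the idea, not scikit-learn's actual implementation. The data are made up, and the helper names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, weighted impurity)."""
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Average the child impurities, weighted by how many samples land in each child
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up toy data, not from the Amazon books dataset
prices = [6, 9, 11, 14, 18, 22, 27, 33]
is_hard = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, impurity = best_split(prices, is_hard)
print(threshold, impurity)
```

With several predictors, the same search is repeated for each feature and the best feature-threshold pair is kept. The procedure then recurses into each child until max_depth or another stopping rule is reached.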

# 6. Use previously created ab_reduced_noNaN_test to create confusion matrices for clf and clf2. Report the sensitivity, specificity and accuracy for each of the models

1. Recreate the models:

- Train clf using List Price only.
- Train clf2 using NumPages, Thick, and List Price.

2. Redo predictions:

- Generate predictions for both clf and clf2.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
from sklearn.tree import DecisionTreeClassifier

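# Define the predictor sets for each model. X_train and X_test come from the
# three-feature split in Problem 5, which used the same random_state (and
# therefore the same rows) as the List Price split in Problem 4.
X_train_all_features, X_test_all_features = X_train, X_test
X_train_list_price, X_test_list_price = X_train[["List Price"]], X_test[["List Price"]]

# Retrain the models
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train_list_price, y_train)

clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X_train_all_features, y_train)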

# Generate predictions for the test set using both models
y_pred_clf = clf.predict(X_test_list_price)
y_pred_clf2 = clf2.predict(X_test_all_features)

# Create confusion matrices for both models
conf_matrix_clf = confusion_matrix(y_test, y_pred_clf)
conf_matrix_clf2 = confusion_matrix(y_test, y_pred_clf2)

# Calculate metrics for clf
accuracy_clf = accuracy_score(y_test, y_pred_clf)
sensitivity_clf = recall_score(y_test, y_pred_clf)  # Sensitivity = True Positive Rate
specificity_clf = conf_matrix_clf[0, 0] / (conf_matrix_clf[0, 0] + conf_matrix_clf[0, 1])  # TN / (TN + FP)

# Calculate metrics for clf2
accuracy_clf2 = accuracy_score(y_test, y_pred_clf2)
sensitivity_clf2 = recall_score(y_test, y_pred_clf2)  # Sensitivity = True Positive Rate
specificity_clf2 = conf_matrix_clf2[0, 0] / (conf_matrix_clf2[0, 0] + conf_matrix_clf2[0, 1])  # TN / (TN + FP)

# Summarize results
metrics_summary = {
    "Model": ["clf (List Price)", "clf2 (All Features)"],
    "Accuracy": [accuracy_clf, accuracy_clf2],
    "Sensitivity": [sensitivity_clf, sensitivity_clf2],
    "Specificity": [specificity_clf, specificity_clf2]
}

# Convert to a DataFrame for better readability
metrics_summary_df = pd.DataFrame(metrics_summary)

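# Display the confusion matrices (rows: actual Paper/Hard, columns: predicted Paper/Hard)
print("clf confusion matrix:\n", conf_matrix_clf)
print("clf2 confusion matrix:\n", conf_matrix_clf2)

# Display the summary
print(metrics_summary_df)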
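
The specificity formula above depends on the layout of the confusion matrix. For binary labels ordered as 0 then 1, scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, giving `[[TN, FP], [FN, TP]]`. The sketch below rebuilds that layout by hand on made-up labels. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels coded 0 (negative) and 1 (positive)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual class, column = predicted class
    return matrix


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)
print("specificity:", tn / (tn + fp))  # uses row 0 only
print("sensitivity:", tp / (tp + fn))  # uses row 1 only
```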

# 7. Explain in three to four sentences what is causing the differences between the following two confusion matrices below, and why the two confusion matrices above (for clf and clf2) are better
ConfusionMatrixDisplay(
    confusion_matrix(ab_reduced_noNaN_train.your_actual_outcome_variable, 
                     clf.predict(ab_reduced_noNaN_train[['List Price']]), 
                     labels=[0, 1]), display_labels=["Paper","Hard"]).plot()
ConfusionMatrixDisplay(
    confusion_matrix(ab_reduced_noNaN_train.your_actual_outcome_variable, 
                     clf.predict(
                         ab_reduced_noNaN_train[['NumPages','Thick','List Price']]), 
                     labels=[0, 1]), display_labels=["Paper","Hard"]).plot()

Both matrices above are computed on the training data, so the differences between them arise from the predictors each call uses. The first call (List Price only) relies on limited information, leading to potentially higher misclassification rates. The second call supplies NumPages, Thick, and List Price, the predictors clf2 was fit on, although as written it passes them to clf, which was fit on List Price only; a depth-4 tree on three predictors can capture more complex relationships and fit the training data more closely. The test set confusion matrices for clf and clf2 are more reliable because they evaluate generalization to unseen data, while training matrices often show overly optimistic results due to overfitting.
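
The optimism of training-set evaluation can be seen with a deliberately overfit toy model. The sketch below uses a nearest-price memorizer rather than a decision tree, on made-up data, purely to show the gap between training and test accuracy. All names in it are hypothetical.

```python
def fit_memorizer(train_rows):
    """Deliberately overfit model: predict the label of the nearest training price."""
    def predict(price):
        nearest = min(train_rows, key=lambda row: abs(row[0] - price))
        return nearest[1]
    return predict


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the actual label."""
    return sum(predict(price) == label for price, label in rows) / len(rows)


# Made-up (list price, is_hardcover) pairs, not from the Amazon books dataset
train_rows = [(10, 0), (12, 0), (15, 1), (30, 1), (35, 1), (8, 0)]
test_rows = [(11, 0), (14, 0), (32, 1), (9, 0)]

model = fit_memorizer(train_rows)
train_acc = accuracy(model, train_rows)
test_acc = accuracy(model, test_rows)
print(train_acc, test_acc)
```

Because the model has already seen every training row, its training accuracy says little about how it handles new books. Only the held-out rows give an honest estimate.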

# 8. Read the paragraphs in Further Guidance and ask a ChatBot how to visualize feature Importances available for scikit-learn classification decision trees; do so for clf2; and use .feature_names_in_ corresponding to .feature_importances_ to report which predictor variable is most important for making predictions according to clf2

In [None]:
import matplotlib.pyplot as plt

# Extract feature importances and corresponding feature names
importances = clf2.feature_importances_
feature_names = clf2.feature_names_in_

# Create a bar chart for visualization
plt.figure(figsize=(8, 6))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importances for clf2")
plt.show()

# Report the most important predictor variable
most_important_feature = feature_names[importances.argmax()]
print(f"The most important predictor variable is: {most_important_feature}")

The feature_importances_ values represent how much each feature contributes to reducing impurity in the tree's splits. Higher values indicate more significant predictors for the classification task. After running this code, the bar chart will highlight the relative importance of NumPages, Thick, and List Price in clf2.
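
As a rough sketch of where these numbers come from: each split contributes its impurity decrease, weighted by the share of samples reaching that node, and the per-feature totals are then normalized to sum to 1. The node statistics and feature names below are invented for illustration and are not taken from clf2.

```python
# Each entry: (feature split on, node samples, node impurity,
#              left samples, left impurity, right samples, right impurity)
# All numbers are invented for illustration
splits = [
    ("feature_a", 100, 0.50, 60, 0.30, 40, 0.20),
    ("feature_b", 60, 0.30, 30, 0.10, 30, 0.20),
    ("feature_c", 40, 0.20, 20, 0.00, 20, 0.10),
]
total_samples = 100

raw = {}
for feature, n, imp, n_left, imp_left, n_right, imp_right in splits:
    # Impurity decrease at this node, weighted by the share of samples reaching it
    decrease = (n / total_samples) * (
        imp - (n_left / n) * imp_left - (n_right / n) * imp_right
    )
    raw[feature] = raw.get(feature, 0.0) + decrease

# Normalize so the importances sum to 1
total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}
most_important = max(importances, key=importances.get)
print(importances)
print(most_important)
```

A feature used near the root tends to score higher because its split affects more samples, which is worth keeping in mind when reading the bar chart.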

# 9. Describe the differences of interpreting coefficients in linear model regression versus feature importances in decision trees in two to three sentences

In linear regression models, each coefficient is the expected change in the target for a one-unit increase in that predictor, holding all other predictors constant, so it conveys both the magnitude and the direction (positive or negative) of the relationship. In decision trees, feature importances indicate how much each feature contributes to reducing impurity (e.g., Gini index or entropy) across all splits, without providing a specific numerical relationship or direction. Unlike regression coefficients, feature importances are non-negative, relative scores (normalized to sum to 1 in scikit-learn), so they rank predictors but do not describe linear or causal effects.