# Assignment: Trees

## Do two questions in total: "Q1+Q2" or "Q1+Q3"

`! git clone https://github.com/ds3001f25/linear_models_assignment.git`

**Q1.** Please answer the following questions in your own words.
1. Why is the Gini a good loss function for categorical target variables? 

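Gini impurity measures how mixed the classes are within a node: it is zero when every observation in the node belongs to a single class and largest when the classes are evenly represented. That makes it a clear and intuitive measure of how well a split separates the categories of the target variable, and it can be computed from the class proportions alone.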
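As a small illustration, the sketch below computes the Gini impurity $1 - \sum_k p_k^2$ for three made-up nodes. The function name `gini_impurity` and the toy labels are assumptions for this example only, not part of the assignment.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical nodes, for illustration only
pure_node = ["Living"] * 10
mixed_node = ["Living"] * 5 + ["Deceased"] * 5
skewed_node = ["Living"] * 9 + ["Deceased"] * 1

print(gini_impurity(pure_node))    # 0.0: a pure node has no impurity
print(gini_impurity(mixed_node))   # 0.5: the maximum for two classes
print(gini_impurity(skewed_node))  # about 0.18: mostly one class
```

A split is attractive when the size-weighted average impurity of the child nodes is lower than the impurity of the parent.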

2. Why do trees tend to overfit, and how can this tendency be constrained? 

Trees tend to overfit because, left unconstrained, they keep splitting until they fit the training set almost perfectly. The resulting deep, complex tree captures noise and random fluctuations in the training data rather than the underlying patterns that generalize to unseen data.
To constrain this tendency, we could truncate the tree by limiting its depth, so that it isn't making splits on very fine distinctions among a handful of observations. We could also impose a lower bound on the impurity that can appear at a terminal node (don't allow the terminal nodes to be "too pure"), or a lower bound on the number of cases at a terminal node (don't allow the terminal nodes to be "too small").
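A minimal sketch of these constraints in scikit-learn, on synthetic data rather than the assignment data. Note that `min_impurity_decrease` is used here only as the closest built-in analogue to an impurity bound: it thresholds the impurity reduction of a split, not the impurity of a terminal node.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, flip_y=0.2, random_state=0
)

# No constraints: the tree grows until every leaf is pure
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

constrained = DecisionTreeClassifier(
    max_depth=4,                  # truncate the tree
    min_samples_leaf=10,          # no "too small" terminal nodes
    min_impurity_decrease=0.001,  # skip splits that barely reduce impurity
    random_state=0,
).fit(X_demo, y_demo)

print(unconstrained.get_depth(), unconstrained.get_n_leaves())
print(constrained.get_depth(), constrained.get_n_leaves())
```

The constrained tree should come out much shallower, with far fewer terminal nodes.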

3. True or false, and explain: Trees only really perform well in situations with lots of categorical variables as features/covariates. 

False. Decision trees can handle both categorical and numerical variables effectively and they are good at capturing non-linear relationships for the continuous variables. 
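To illustrate the point about continuous variables, here is a sketch on synthetic data (a noisy sine wave, not the assignment data) comparing a straight-line fit with a shallow regression tree on a single numeric feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear relationship between one numeric feature and the target
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 300)).reshape(-1, 1)
y_wave = np.sin(x).ravel() + rng.normal(0, 0.1, 300)

# In-sample R^2 of a linear model and of a depth-4 tree
linear_r2 = LinearRegression().fit(x, y_wave).score(x, y_wave)
tree_r2 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x, y_wave).score(x, y_wave)

print(linear_r2, tree_r2)
```

The tree approximates the curve with a step function, so it can follow the wave without any categorical features at all.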

4. Why don't most versions of classification/regression tree concept allow for more than two branches after a split?

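Because any split into three or more branches can be reproduced by a sequence of binary splits, allowing only two branches loses no flexibility. Binary splits also keep the search at each node simple and avoid dividing the data into many small subsets too quickly, which would leave few observations for later splits and increase the risk of overfitting. At each node, the algorithm chooses the binary split that most reduces the impurity of the resulting subsets.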
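One standard argument is that a multiway split is never needed, since nested binary splits produce the same partition. The sketch below shows this for a hypothetical three-way split on age; the thresholds and labels are made up.

```python
def three_way_split(age):
    """A single hypothetical three-branch split on age."""
    if age < 40:
        return "under 40"
    elif age < 60:
        return "40 to 59"
    else:
        return "60 and over"

def two_binary_splits(age):
    """The same partition built from two nested binary splits."""
    if age < 40:   # first binary split
        return "under 40"
    if age < 60:   # second binary split, applied to the right branch only
        return "40 to 59"
    return "60 and over"

# Both versions assign every age to the same group
same = all(three_way_split(a) == two_binary_splits(a) for a in range(0, 100))
print(same)
```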

5. What are some heuristic ways you can examine a tree and decide whether it is probably over- or under-fitting?

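We could split the sample into training and test data and compare how the model performs on each: a tree that is far more accurate on the training data than on the test data is probably overfitting, while one that performs poorly on both is probably underfitting. We could also compute the accuracy from the confusion matrix on the test data, and look at the tree itself: a very deep tree whose terminal nodes contain only a handful of observations suggests overfitting, while a tree with only one or two splits may be underfitting.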
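A sketch of the train/test comparison on synthetic data (not the assignment data). The `min_samples_leaf` values tried are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.2, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Record (train accuracy, test accuracy) for trees of decreasing complexity
gaps = {}
for leaf in [1, 10, 50, 200]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=1).fit(X_tr, y_tr)
    gaps[leaf] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"min_samples_leaf={leaf:>3}  train={gaps[leaf][0]:.2f}  test={gaps[leaf][1]:.2f}")
```

A large gap between the two columns points toward overfitting; low values in both point toward underfitting.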

**Q2.** This is a case study about classification and regression trees.

1. Load the `Breast Cancer METABRIC.csv` dataset. How many observations and variables does it contain? Print out the first few rows of data.

In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [34]:
df = pd.read_csv("data/Breast Cancer METABRIC.csv")
print(df.shape) # 2509 observations, 34 variables
df.head()

(2509, 34)


Unnamed: 0,Patient ID,Age at Diagnosis,Type of Breast Surgery,Cancer Type,Cancer Type Detailed,Cellularity,Chemotherapy,Pam50 + Claudin-low subtype,Cohort,ER status measured by IHC,...,Overall Survival Status,PR Status,Radio Therapy,Relapse Free Status (Months),Relapse Free Status,Sex,3-Gene classifier subtype,Tumor Size,Tumor Stage,Patient's Vital Status
0,MB-0000,75.65,Mastectomy,Breast Cancer,Breast Invasive Ductal Carcinoma,,No,claudin-low,1.0,Positve,...,Living,Negative,Yes,138.65,Not Recurred,Female,ER-/HER2-,22.0,2.0,Living
1,MB-0002,43.19,Breast Conserving,Breast Cancer,Breast Invasive Ductal Carcinoma,High,No,LumA,1.0,Positve,...,Living,Positive,Yes,83.52,Not Recurred,Female,ER+/HER2- High Prolif,10.0,1.0,Living
2,MB-0005,48.87,Mastectomy,Breast Cancer,Breast Invasive Ductal Carcinoma,High,Yes,LumB,1.0,Positve,...,Deceased,Positive,No,151.28,Recurred,Female,,15.0,2.0,Died of Disease
3,MB-0006,47.68,Mastectomy,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,Yes,LumB,1.0,Positve,...,Living,Positive,Yes,162.76,Not Recurred,Female,,25.0,2.0,Living
4,MB-0008,76.97,Mastectomy,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,Yes,LumB,1.0,Positve,...,Deceased,Positive,Yes,18.55,Recurred,Female,ER+/HER2- High Prolif,40.0,2.0,Died of Disease


2.  We'll use a consistent set of feature/explanatory variables. For numeric variables, we'll include `Tumor Size`, `Lymph nodes examined positive`, `Age at Diagnosis`. For categorical variables, we'll include `Tumor Stage`, `Chemotherapy`, and `Cancer Type Detailed`. One-hot-encode the categorical variables and concatenate them with the numeric variables into a feature/covariate matrix, $X$.

In [35]:
# Dummy variables:
tumor_dummy = pd.get_dummies(df['Tumor Stage'])
chemo_dummy = pd.get_dummies(df['Chemotherapy'])
type_dummy = pd.get_dummies(df['Cancer Type Detailed'])

vars = ["Tumor Size", "Lymph nodes examined positive", "Age at Diagnosis"]
X = pd.concat([df.loc[:,vars], tumor_dummy, chemo_dummy, type_dummy],axis=1)
X.head()


Unnamed: 0,Tumor Size,Lymph nodes examined positive,Age at Diagnosis,0.0,1.0,2.0,3.0,4.0,No,Yes,Breast,Breast Angiosarcoma,Breast Invasive Ductal Carcinoma,Breast Invasive Lobular Carcinoma,Breast Invasive Mixed Mucinous Carcinoma,Breast Mixed Ductal and Lobular Carcinoma,Invasive Breast Carcinoma,Metaplastic Breast Cancer
0,22.0,10.0,75.65,False,False,True,False,False,True,False,False,False,True,False,False,False,False,False
1,10.0,0.0,43.19,False,True,False,False,False,True,False,False,False,True,False,False,False,False,False
2,15.0,1.0,48.87,False,False,True,False,False,False,True,False,False,True,False,False,False,False,False
3,25.0,3.0,47.68,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False
4,40.0,8.0,76.97,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False
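The bare dummy column names above (`0.0`, `No`, `Yes`) do not say which variable they came from. One option, sketched here on a made-up three-row frame, is to pass `prefix` to `pd.get_dummies`; the prefixes `stage` and `chemo` are arbitrary names chosen for this example.

```python
import pandas as pd

# Toy data with the same column names as two of the categorical features
toy = pd.DataFrame({
    "Tumor Stage": [2.0, 1.0, 2.0],
    "Chemotherapy": ["No", "Yes", "No"],
})

# One-hot-encode both columns, labeling each dummy with its source variable
dummies = pd.get_dummies(
    toy, columns=["Tumor Stage", "Chemotherapy"], prefix=["stage", "chemo"]
)
print(dummies.columns.tolist())
```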


3. Let's predict `Overall Survival Status` given the features/covariates $X$. There are 528 missing values, unfortunately: Either drop those rows from your data or add them as a category to predict. Constrain the minimum samples per leaf to 10. Print a dendrogram of the tree. Print a confusion matrix of the algorithm's performance. What is the accuracy? 

In [None]:
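# Keep only the rows where the target is observed:
keep = df["Overall Survival Status"].notna()
y = df.loc[keep, "Overall Survival Status"]
X = X.loc[keep] # Same rows as y, so the lengths match
X.columns = X.columns.astype(str)
# Note: the numeric features may still contain NaN; scikit-learn >= 1.3 trees accept them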

In [28]:
from sklearn.tree import DecisionTreeClassifier # Import the tree classifier
from sklearn.tree import plot_tree # Import the tree plotting function
from sklearn.model_selection import train_test_split # Train/test splitter
from sklearn.metrics import confusion_matrix, accuracy_score # Performance metrics

# Train-test split:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2,random_state=100)

# Fit decision tree (the target is categorical, so use a classifier):
cart = DecisionTreeClassifier(min_samples_leaf = 10) # Create a classifier object
cart = cart.fit(X_train, y_train) # Fit the classifier

## Make Predictions on the Test Set
y_hat = cart.predict(X_test)

# Visualize results:
plt.figure(figsize=(30,30))
var_names = list(cart.feature_names_in_) # Older scikit-learn versions require a list
plot_tree(cart,filled=True,feature_names=var_names)
plt.show()

# Confusion matrix (rows: actual, columns: predicted) and accuracy:
print(confusion_matrix(y_test, y_hat, labels=cart.classes_))
print(accuracy_score(y_test, y_hat))

4. For your model in part three, compute two statistics:
    - The **true positive rate** or **sensitivity**:
        $$
        TPR = \dfrac{TP}{TP+FN}
        $$
    - The **true negative rate** or **specificity**:
        $$
        TNR = \dfrac{TN}{TN+FP}
        $$
    Does your model tend to perform better with respect to one of these metrics?
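A sketch of how both rates can be read off a confusion matrix. The labels below are made up and stand in for `y_test` and `y_hat`; treating `Deceased` as the positive class is an assumption for this example.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels, for illustration only
y_true = ["Deceased", "Deceased", "Deceased", "Living",
          "Living", "Living", "Living", "Deceased"]
y_pred = ["Deceased", "Living", "Deceased", "Living",
          "Deceased", "Deceased", "Living", "Deceased"]

# With labels ordered [negative, positive], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(
    y_true, y_pred, labels=["Living", "Deceased"]
).ravel()

tpr = tp / (tp + fn)  # sensitivity: share of actual positives caught
tnr = tn / (tn + fp)  # specificity: share of actual negatives caught
print(tpr, tnr)       # 0.75 0.5
```

In this toy example the model is better at identifying the positive class than the negative one.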



5. Let's predict `Overall Survival (Months)` given the features/covariates $X$. Use the train/test split to pick the optimal `min_samples_leaf` value that gives the highest $R^2$ on the test set (it's about 110). What is the $R^2$? Plot the test values against the predicted values. How do you feel about this model for clinical purposes?
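A sketch of the search over `min_samples_leaf`, run on a synthetic regression problem rather than the METABRIC data, so the numbers it prints say nothing about the real answer. The grid of candidate values is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy target, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = 50 + 10 * X_demo[:, 0] + rng.normal(0, 20, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

# Fit one tree per candidate value and record its R^2 on the test set
scores = {}
for leaf in range(5, 201, 5):
    model = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X_tr, y_tr)
    scores[leaf] = model.score(X_te, y_te)

best_leaf = max(scores, key=scores.get)
print(best_leaf, scores[best_leaf])
```

Because the test set is used to choose the value, the resulting $R^2$ is an optimistic estimate; a separate validation set would be needed for an unbiased one.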

---

**Q3.** This is a case study about trees using bond rating data. This is a dataset about bond ratings for different companies, alongside a bunch of business statistics and other data. Companies often have multiple reviews at different dates. We want to predict the bond rating (AAA, AA, A, BBB, BB, B, ..., C, D). Do business fundamentals predict the company's rating?

1. Load the `./data/corporate_ratings.csv` dataset. How many observations and variables does it contain? Print out the first few rows of data.

2.  Plot a histogram of the `ratings` variable. It turns out that the gradations of AAA/AA/A and BBB/BB/B and so on make it hard to get good results with trees. Collapse all AAA/AA/A ratings into just A, and similarly for B and C.
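One possible way to collapse the grades, sketched on a made-up series: keep only the first letter of each rating. This assumes the ratings are plain strings such as `AAA` or `BB`; the real column may need cleaning first.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
ratings = pd.Series(["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"])

# AAA/AA/A -> A, BBB/BB/B -> B, CCC/CC/C -> C, D stays D
collapsed = ratings.str[0]
print(collapsed.value_counts())
```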

3. Use all of the variables **except** Rating, Date, Name, Symbol, and Rating Agency Name. To include Sector, make a dummy/one-hot-encoded representation and include it in your features/covariates. Collect the relevant variables into a data matrix $X$. 

4. Do a train/test split of the data and use a decision tree classifier to predict the bond rating. Including a min_samples_leaf constraint can raise the accuracy and speed up computation time. Print a confusion matrix and the accuracy of your model. How well do you predict the different bond ratings?
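To judge how well each rating is predicted, the overall accuracy can be broken down by class. The sketch below uses made-up labels in place of real test-set predictions and computes, for each class, the share of its actual cases that were predicted correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted collapsed ratings, for illustration only
classes = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "B", "A", "C", "B", "B"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=classes)

per_class_recall = cm.diagonal() / cm.sum(axis=1)  # correct share within each actual class
accuracy = cm.diagonal().sum() / cm.sum()          # overall share correct

for name, recall in zip(classes, per_class_recall):
    print(name, round(recall, 2))
print(accuracy)
```

A model can have a respectable overall accuracy while predicting the rarer classes poorly, which is what the per-class numbers reveal.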

5. If you include the rating agency as a feature/covariate/predictor variable, do the results change? How do you interpret this?