In [None]:
library(data.table)
library(tidyverse)
library(plotly)
library(C50) # for churn data
library(rpart) # for recursive partitioning trees
library(rpart.plot) # for plotting recursive partitioning trees
library(visNetwork) # for better plotting of recursive partitioning trees
library(caret) # for a better confusion matrix

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# PCA and Recursive Partitioning Trees with Realty Dataset

We continue with the realty dataset.

Recall that we calculated the premium_neigh variable, which is the premium of a property's unit price over the median unit price of its neighborhood.
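As a reminder of what such a variable looks like, here is a minimal sketch on toy data. It assumes the premium is the relative difference from the neighborhood median; the earlier notebook may have defined it differently. The column names neighborhood and unit_price are hypothetical and are not taken from the realty dataset.

```r
library(data.table)

# Hypothetical toy data: the column names are assumptions for illustration only
toy <- data.table(
  neighborhood = c("A", "A", "A", "B", "B", "B"),
  unit_price   = c(100, 120, 140, 200, 250, 300)
)

# Assumed definition: relative premium over the neighborhood median unit price
toy[, premium_neigh := unit_price / median(unit_price) - 1, by = neighborhood]

# Properties priced above their neighborhood median get a positive premium
toy[, premium := as.integer(premium_neigh > 0)]
toy
```

Under this definition a property priced exactly at the neighborhood median has a premium of 0 and therefore falls into the discount class.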

Now we will try to classify the properties into premium and discount classes.

Let's first import the realty dataset:

In [None]:
realty_data3 <- readRDS(sprintf("%s/rds/06_02_realty_data3.rds", datapath))

In [None]:
realty_data3

Let's add the binary variable premium, which takes the value 1 when premium_neigh is above 0, and 0 otherwise:

In [None]:
realty_data3[, premium := as.integer(premium_neigh > 0)]

Let's see the structure:

In [None]:
realty_data3 %>% str

Now, select some of the variables:

In [None]:
vars <- c("premium", "esyali", "krediye_uygunluk", "bina_yasi", "kat_sayisi", "kat", realty_data3 %>% keep(is.logical) %>% names)
vars

And assign the subset:

In [None]:
realty_data4 <- realty_data3 %>% select(all_of(vars)) %>% na.omit

In [None]:
realty_data4

Your tasks are to:

- Collect logical features into a separate data.table
- Conduct PCA using prcomp
- Select the appropriate principal components. You may use the cumulative proportion of variance or keep the components whose sd values are above 1. The squares of the sd values are the eigenvalues, which can optionally be visualized as a scree plot.
- Replace the original logical features with the selected principal components
- Partition the data set into 70% train and 30% test sets
- Train a tree on the training set using rpart
- Create a simple simulation function that tries each complexity parameter value for pruning the original tree, makes predictions, and returns the accuracy on the test set. Do not use the type = "class" argument, since the target feature is an integer (0, 1) and not a factor. The output will be a probability; taking 0.5 as the cutoff, convert it to integer 0 and 1 values as the fitted classes. Accuracy is simply the proportion of observations where the actual and fitted classes agree.
- Select the complexity parameter at which accuracy on the test set is at its maximum (this should be around 61%)
- Prune the tree at that optimal complexity parameter and visualize the pruned tree using a single method (multiple visualizations will create a large HTML file that is hard to upload and submit)
- Create the confusion matrices for train and test sets using positive = "1".
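The component-selection step in the list above can be sketched on synthetic data as follows. This is only an illustration: the logical features f1 to f6 are made up, the 80% cumulative variance threshold is an arbitrary assumption, and scale. = TRUE is a choice made here because the sd-above-1 rule is usually applied to standardized variables.

```r
set.seed(1)

# Synthetic logical features (hypothetical stand-ins for the realty flags)
n <- 200
base1 <- runif(n) < 0.5
base2 <- runif(n) < 0.3
flip <- function(x, p) xor(x, runif(length(x)) < p)  # flip each value with probability p
logi <- data.frame(
  f1 = base1, f2 = flip(base1, 0.1), f3 = flip(base1, 0.2),
  f4 = base2, f5 = flip(base2, 0.1), f6 = runif(n) < 0.5
)

# Convert logicals to 0/1 before passing them to prcomp
pca <- prcomp(as.matrix(logi) * 1, center = TRUE, scale. = TRUE)

eig     <- pca$sdev^2               # eigenvalues (squares of the sd values)
cum_var <- cumsum(eig) / sum(eig)   # cumulative proportion of variance

keep_sd  <- which(pca$sdev > 1)                 # rule 1: sd above 1
keep_cum <- seq_len(which(cum_var >= 0.8)[1])   # rule 2: assumed 80% threshold

# Scores of the selected components, ready to replace the original features
scores <- pca$x[, keep_sd, drop = FALSE]
summary(pca)
```

The two rules need not select the same number of components, so it is worth comparing keep_sd and keep_cum before deciding which columns of pca$x to keep.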

# Answer
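One possible shape for the pruning simulation is sketched below on synthetic data, not on the realty dataset. The data frame d, the helper acc_at_cp, and the choice of cp = 0 for growing the initial tree are all assumptions made for illustration; the accuracy obtained here says nothing about the value expected for the realty data.

```r
library(rpart)

set.seed(42)

# Synthetic data with an integer 0/1 target (hypothetical, not the realty data)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n))
d$y <- as.integer(d$x1 + 0.5 * d$x2 + rnorm(n) > 0)

# 70% / 30% train-test split
idx   <- sample(n, size = floor(0.7 * n))
train <- d[idx, ]
test  <- d[-idx, ]

# Integer target, so a regression ("anova") tree is fitted; cp = 0 grows a large tree
fit <- rpart(y ~ ., data = train, method = "anova", cp = 0)

# Hypothetical helper: accuracy on newdata after pruning at a given cp
acc_at_cp <- function(cp, tree, newdata, actual) {
  pruned <- prune(tree, cp = cp)
  prob   <- predict(pruned, newdata = newdata)  # node means, read as P(y = 1)
  fitted <- as.integer(prob > 0.5)              # 0.5 cutoff
  mean(fitted == actual)                        # share of matching classes
}

# Try every complexity parameter value recorded in the cp table
cps  <- fit$cptable[, "CP"]
accs <- sapply(cps, acc_at_cp, tree = fit, newdata = test, actual = test$y)

best_cp     <- cps[which.max(accs)]
pruned_best <- prune(fit, cp = best_cp)
data.frame(cp = cps, accuracy = accs)
```

Because the target is an integer, predict returns the mean of the target in each leaf, which is why no type argument is passed and the 0.5 cutoff is applied by hand.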