**30E03000 - Data Science for Business I (2022)**

# First exam - 22.02.2022

### Case: Healthcare analytics (60pt)

Assume that you are consulting for a hospital that aims to improve patient outcomes and efficiency while reducing costs. The data set that the hospital has provided contains the profiles of several patients, including their demographic details and a set of medical measurements (M1 - M9) taken during their most recent visit. The objective is to predict which patients should be called in for further clinical testing. The testing procedure is costly, and the hospital would like to avoid calling in healthy patients for an additional test.

The data set is provided in the file **healthcare.csv**. Relevant background information on the variables is provided in the file **healthcare-variables.txt**.

While completing the following steps, answer the quiz in MyCourses. Please remember that you still need to submit the completed notebook. **Add as many markdown and code cells as you need**.

***

### Submission

Answer the quiz and upload your filled-out notebook in **.ipynb format** in the exam submission box. Name your file studentNum-lastname-firstname-exam.ipynb.

## Task 1 (10pt):

Load the data set from the file "healthcare.csv". Perform exploratory data analysis to understand the data better. Check for missing (or otherwise unusual) values and variable types. Transform variables where needed (e.g., dummies). You may also drop variables that you do not want to include, but justify your decision!

To preprocess the data:
1. If you find missing values, remove every row containing a missing value
2. Drop the running identifier column
3. Convert the target variable to integer type
4. Perform one-hot encoding for the categorical variables (sex, education, employment, marital, abroad, trust, sport, smoke) using `get_dummies` from the pandas library

Please note that any mistake in the preprocessing steps will potentially cost you a lot of points in the subsequent steps.
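As an illustration of the four preprocessing steps, here is a minimal sketch on a tiny made-up table. The column names `id` and `target` and all values are hypothetical stand-ins, not the contents of healthcare.csv; check healthcare-variables.txt for the real names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for healthcare.csv. Column names and values are hypothetical.
# With the real file this would start with: df = pd.read_csv("healthcare.csv")
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "sex": ["F", "M", "F", None, "M"],
    "smoke": ["YES", "NO", "NO", "YES", "NO"],
    "income": [2100.0, 1500.0, np.nan, 1800.0, 2500.0],
    "target": [1.0, 0.0, 0.0, 1.0, 0.0],
})

print(df.isna().sum())  # number of missing values per column
print(df.dtypes)        # variable types

df = df.dropna()                                   # 1. drop rows with any missing value
df = df.drop(columns=["id"])                       # 2. drop the running identifier
df = df.astype({"target": int})                    # 3. target as integer
df = pd.get_dummies(df, columns=["sex", "smoke"])  # 4. one-hot encode categoricals

print(df.shape)
```

Printing `df.shape` after each step is a simple way to track how the row and column counts change.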

Q 1.1 How many rows do you have after the elimination of missing values?
* 1001
* 1482
* 1500
* 1082
* 1020
* none of the mentioned

Q 1.2 How many columns do you have right after dropping the identifier column and before one-hot encoding?
* 24
* 25
* 1500
* 23
* none of the mentioned

Q 1.3 How many columns do you have right after performing one-hot encoding?
* 40
* 46
* 51
* 24
* 25
* none of the mentioned



***

## Task 2 (5pt):

Split the data into training (70%) and testing (30%) data sets. Check outcome distributions on training and test datasets. Use `train_test_split` from `sklearn.model_selection`. Set the seed with `random_state=12345` when performing the split.
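A minimal sketch of the split and the distribution check, using synthetic imbalanced data from `make_classification` as a stand-in for the preprocessed exam data (the data and the name `target` are assumptions for illustration only):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed data (not the exam data).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
y = pd.Series(y, name="target")

# 70% training, 30% testing, with the seed required by the task.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345)

# normalize=True returns class shares instead of raw counts.
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)
print(train_share)
print(test_share)
```

The task does not ask for stratification, so the sketch does not pass `stratify=y`; the two parts can therefore show slightly different class shares.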

What is the proportion of the class "no need" in the training set? Choose the closest value.
* 79%
* 21%
* 78%
* 22%
* 50%

What is the proportion of the class "no need" in the test set? Choose the closest value.
* 79%
* 21%
* 78%
* 22%
* 50%

***

## Task 3 (15pt):

Train a simple decision tree model using `DecisionTreeClassifier` from `sklearn.tree` with the following parameters: `criterion="gini", random_state=100, max_depth=3, min_samples_leaf=3`.

Use the classifier to predict labels and probabilities for the test set.
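The fit, prediction, and evaluation steps might look roughly like the sketch below. It uses the tree parameters given above, but the data is a synthetic stand-in, so the printed scores say nothing about the exam data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (not the exam data).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345)

tree = DecisionTreeClassifier(criterion="gini", random_state=100,
                              max_depth=3, min_samples_leaf=3)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)              # hard class labels
y_prob = tree.predict_proba(X_test)[:, 1]  # probability of the positive class

acc = accuracy_score(y_test, y_pred)  # accuracy uses the labels
auc = roc_auc_score(y_test, y_prob)   # AUC uses the probabilities
print(f"accuracy={acc:.3f}, AUC={auc:.3f}")
```

Note that the AUC is computed from the predicted probabilities rather than from the hard labels.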

### Analyze the performance of the model. How does it perform on the test set? (5pt)

Q 3.1 What is the accuracy of the decision tree model? Choose the interval that contains the value.
* [50%, 55%]
* [55%, 60%]
* [60%, 65%]
* [65%, 70%]
* [70%, 75%]
* [75%, 80%]

Q 3.2 What is the AUC score? Choose the interval that contains the value.
* [0.50, 0.55]
* [0.55, 0.60]
* [0.60, 0.65]
* [0.65, 0.70]
* [0.70, 0.75]
* [0.75, 0.80]


### Which variables appear to be most important for predicting whether a patient needs further testing? (5pt)

Q 3.3 Choose the top-3 features based on the importance of the variable.
* bmi
* sport_NO
* employment_LONG-TERM SICK/DISABLED
* marital_MARRIED
* income
* smoke_YES
* m1
* depress
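One way to rank the variables is the fitted tree's `feature_importances_` attribute. The sketch below assumes synthetic data and hypothetical feature names `f0` to `f7`; with the exam data, the column names of the training DataFrame would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names f0..f7.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(criterion="gini", random_state=100,
                              max_depth=3, min_samples_leaf=3).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
top3 = importances.head(3)
print(top3)
```

With a depth-3 tree, at most seven features can have a nonzero importance, because the tree has at most seven split nodes.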

### Interpret the decision tree produced by the model. What insights can you get from it? (5pt)

Q 3.4 Interpret the generated decision tree. Which of the following claims are true?
* Everyone with a low score for depress is classified as "no need".
* If a person is married, there is no need for a test according to the model.
* According to the model, an unmarried person with income below 1335 needs a test regardless of other variables.
* It is possible that the model classifies a person with a low m6-value as "no need".
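To read the decision rules, the fitted tree can be printed as text with `export_text` from `sklearn.tree`. This is a sketch on synthetic data with hypothetical feature names; `plot_tree` from the same module would give a graphical view instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names f0..f7.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(criterion="gini", random_state=100,
                              max_depth=3, min_samples_leaf=3).fit(X, y)

# Each indented line is a split; each "class:" line is a leaf prediction.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# A single case can be checked by passing one row to predict().
case = X[:1]
case_label = tree.predict(case)[0]
print("predicted class for the first row:", case_label)
```

Following one path from the root to a leaf gives the full set of conditions under which the model predicts that leaf's class.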

Q 3.5 Does the following case need a test according to the model?
* depress=5
* married
* m1=-0.5
* m5=1.5
* income=2100
* m6=0.010
* the person smokes

***

## Task 4 (10pt):

Could the model be improved by using rebalancing techniques? To answer this, check how the model from Task 3 would perform on a balanced dataset. For balancing, use `RandomUnderSampler` from the package `imblearn.under_sampling`. Use the settings `random_state=1234` and `sampling_strategy='majority'` for `RandomUnderSampler`.

Q 4.1 What is the accuracy of the decision tree model with balanced data? Choose the interval that contains the value.
* [50%, 55%]
* [55%, 60%]
* [60%, 65%]
* [65%, 70%]
* [70%, 75%]
* [75%, 80%]

Q 4.2 What is the AUC score of the decision tree model with balanced data? Choose the interval that contains the value.
* [0.50, 0.55]
* [0.55, 0.60]
* [0.60, 0.65]
* [0.65, 0.70]
* [0.70, 0.75]
* [0.75, 0.80]

***

## Task 5 (25pt):

Use the data to train additional competing classification models (in addition to the decision tree models from Tasks 3 and 4). Use logistic regression with settings `LogisticRegression(solver='liblinear', max_iter=10000, penalty='l1', C=0.05, random_state=1234)`. Fit one model on the unbalanced training data and another on the balanced data.

Plot the ROC curves for all the models you have fitted thus far (4 models: two decision trees and two logistic regression models).

Q 5.1 Look at the ROC curves. Which of the models has the highest true positive rate when the false positive rate is 0.6?
* Decision tree, unbalanced
* Decision tree, balanced
* Logistic regression, unbalanced
* Logistic regression, balanced
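Besides reading the value off the plot, the true positive rate at a given false positive rate can be approximated numerically. The sketch below fits two of the four models on synthetic stand-in data and linearly interpolates each ROC curve at FPR = 0.6; the interpolation is an assumption that mirrors how a plotted curve connects its points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (not the exam data).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345)

models = {
    "Decision tree, unbalanced": DecisionTreeClassifier(
        criterion="gini", random_state=100, max_depth=3, min_samples_leaf=3),
    "Logistic regression, unbalanced": LogisticRegression(
        solver="liblinear", max_iter=10000, penalty="l1", C=0.05,
        random_state=1234),
}

tpr_at_fpr = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    # Linear interpolation between the ROC points around FPR = 0.6.
    tpr_at_fpr[name] = float(np.interp(0.6, fpr, tpr))

for name, value in tpr_at_fpr.items():
    print(f"{name}: TPR at FPR=0.6 is about {value:.3f}")
```

The `fpr` and `tpr` arrays returned by `roc_curve` are also what one would pass to a plotting call such as matplotlib's `plt.plot(fpr, tpr, label=name)` to draw each curve; the balanced variants would be added to `models` in the same way after fitting on the undersampled training data.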

### Suppose that the cost of a single test is 500, and the reward (or costs saved) due to a successfully detected case is 13000. (10pt)
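One possible way to turn these two numbers into an expected benefit per patient is sketched below. It rests on assumptions that the task text does not state: every patient predicted positive is tested at a cost of 500, a true positive additionally earns the reward of 13000, and patients who are not called (true and false negatives) contribute 0. If the course defines the cost-benefit matrix differently, the formula should be adjusted accordingly. The helper name and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

TEST_COST = 500            # cost of a single test (from the task)
DETECTION_REWARD = 13000   # reward for a successfully detected case (from the task)


def expected_benefit_per_patient(y_true, y_pred):
    """Average benefit per patient under the assumptions stated above."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    n = tn + fp + fn + tp
    # Assumption: TP earns reward minus test cost, FP only pays the test cost,
    # TN and FN contribute nothing.
    total = tp * (DETECTION_REWARD - TEST_COST) - fp * TEST_COST
    return total / n


# Hypothetical toy labels: 2 true positives, 1 false positive,
# 1 false negative, 6 true negatives.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 1])

benefit = expected_benefit_per_patient(y_true, y_pred)
print(benefit)  # (2 * 12500 - 1 * 500) / 10 = 2450.0
```

Applying the same function to the test-set predictions of each of the four models gives values that can be compared directly.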

Q 5.2 What is the expected benefit per patient for the unbalanced decision tree model (the model from Task 3)? Choose the interval that contains the value.
* 0-10
* 10-100
* 100-1000
* 1000-2000
* 2000-10000

Q 5.3 Which of the four models has the highest expected benefit?
* Decision tree, unbalanced
* Decision tree, balanced
* Logistic regression, unbalanced
* Logistic regression, balanced

Q 5.4 Which of the four models has the lowest expected benefit?
* Decision tree, unbalanced
* Decision tree, balanced
* Logistic regression, unbalanced
* Logistic regression, balanced