## Digital Technologies and Data-Driven Business
# Mandatory Assignment 3

Below you will find the tasks to be solved as part of the third mandatory assignment in Digital Technologies and Data-driven Business. Once you have solved the tasks, please save the .ipynb file (i.e., _File_ >> _Download as_ >> _Notebook (.ipynb)_) and upload the saved file to Canvas. The deadline is __November 4 at 10:00__. Mandatory assignments are either __approved__ or __not approved__. If a mandatory assignment is not approved, you will have the opportunity for a retake.

Please read the instructions carefully and pay particular attention to the following points:
1. Please provide correct Python code (i.e., code that can be executed without errors).
2. Explain the code you have written in your own words (either with markdown or comments).
3. You may work in groups, but your submission must be individual, i.e., each of you must submit a functioning .ipynb file with __your own__ solutions and explanations. Do not copy answers from others. Answers that are not your own (plagiarized) will lead to the mandatory assignment not being approved.

Good luck.

### Copenhagen Bank

Despite your great insights, RideDenmark was not able to establish itself in the Danish market and has closed down. However, your skills are now well known in Copenhagen, and a nearby bank has requested your services as part of their customer insights team. Many customers have left the bank recently, and the team wants to understand why so that they can address the issue in advance.

Emma, your new supervisor, has extracted some customer records from the system. She sent them as a file named `bank-customers.csv` and left some questions that you can find below.

__Emma:__ _We had a lot of customers leaving us recently. We would like to explore what attributes of customers contribute to this churn. Can you please help us with that?_

__Important:__ Throughout the assignment you will be manipulating the dataframe in order to prepare it for machine learning. When you start a new task, you should always continue with the updated dataframe (do not import the data again after Task 1).

## Installing and importing libraries

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} scikit-learn

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score, precision_score, confusion_matrix

# This command helps you to show all columns
pd.set_option('display.max_columns', None)

# Task 1 (1 point)

Read the file named `bank-customers.csv`.

In [None]:
# Write your answer here


# Task 2 (2 points)

__Emma:__ _The dataset contains a list of customer records. The column `Churn` indicates if the customer has left the bank. Please make a simple plot showing the distribution of this column._

In [None]:
# Write your answer here


__Emma:__ _Please also determine the distribution in absolute numbers._

In [None]:
# Write your answer here


# Task 3 (1 point)

__Emma:__ _The columns `CustomerID` and `Surname` might not give us relevant information in predicting our target. Please drop these columns._

In [None]:
# Write your answer here


# Task 4 (2 points)

__Emma:__ _Could you please indicate the statistical correlations between the numerical columns in a heatmap?_

_Hint:_ Seaborn has a built-in `heatmap()` feature.
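
As a minimal sketch of what a correlation matrix contains, the following uses a small made-up DataFrame (the name `toy` and its columns are hypothetical and not part of `bank-customers.csv`). It assumes pandas' `DataFrame.corr()`, which computes pairwise Pearson correlations by default. The resulting square matrix is the kind of input that Seaborn's `heatmap()` can draw.

```python
import pandas as pd

# Hypothetical toy data: "up" rises with "x", "down" falls with "x".
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "up": [2, 4, 6, 8],
    "down": [4, 3, 2, 1],
    "name": ["a", "b", "c", "d"],  # non-numerical column
})

# Keep only the numerical columns, then compute pairwise correlations.
corr_toy = toy.select_dtypes(include="number").corr()
print(corr_toy)
```

Values close to 1 indicate a strong positive relationship, values close to -1 a strong negative one, and values near 0 little linear relationship.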

In [None]:
# Write your answer here


__Emma:__ _Which attribute seems to have the highest positive and which the highest negative correlation with our target variable (`Churn`)?_

__Write your answer here:__


# Task 5 (1 point)

__Emma:__ _In order to use all of the remaining columns, we need to convert the categorical columns to numerical columns. Therefore, each card type has been assigned a score. Could you please edit the column `Card Type` so that the strings are replaced with the following numbers?_

DIAMOND: 5<br>
PLATINUM: 3.5<br>
GOLD: 2.5<br>
SILVER: 1

_Hint:_ Look at the pandas `map` method.
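
To illustrate how `map` behaves, here is a small sketch on a made-up Series (the names `sizes` and `size_scores` are hypothetical and unrelated to the bank data). It assumes a dictionary is passed to `Series.map`, in which case each value is looked up in the dictionary.

```python
import pandas as pd

# Hypothetical toy data with one value that is missing from the dictionary.
sizes = pd.Series(["small", "large", "medium", "small", "unknown"])
size_scores = {"small": 1, "medium": 2, "large": 3}

# Each string is replaced by its dictionary value;
# strings without a dictionary entry become NaN.
encoded = sizes.map(size_scores)
print(encoded)
```

Because unmatched values turn into NaN, it can be worth checking the result for missing values after mapping.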

In [None]:
# Write your answer here


# Task 6 (2 points)

__Emma:__ _We still have two categorical columns that we need to alter using a process called one-hot encoding. Please encode `Geography` and `Gender` using one-hot encoding._

_Hint:_ Why do we need [one-hot encoding](https://towardsdatascience.com/encoding-categorical-variables-one-hot-vs-dummy-encoding-6d5b9c46e2db) and what is it?

_Hint2:_ Take a look at the pandas documentation about [one-hot encoding](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
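
The sketch below shows what one-hot encoding produces on a tiny made-up DataFrame (the name `pets` and its columns are hypothetical). It assumes `pd.get_dummies` with the `columns` argument, and uses `dtype=int` so that the new columns contain 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numerical column.
pets = pd.DataFrame({"Animal": ["cat", "dog", "cat"], "Age": [3, 5, 2]})

# "Animal" is replaced by one 0/1 column per category;
# the numerical column "Age" is left unchanged.
pets_encoded = pd.get_dummies(pets, columns=["Animal"], dtype=int)
print(pets_encoded)
```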

In [None]:
# Write your answer here


# Task 7 (2 points)

__Emma:__ _My colleague told me that a crucial concept when working with most machine learning algorithms is to split your data into train and test data._

_He also showed me that the following code splits an existing DataFrame into a train set and a test set. In this case, 80% of the data is placed in the train set and the remaining 20% in the test set._

_However, I still do not quite understand why one would want to split the dataset into two parts (train and test). Could you give me some useful insights?_

Hint: Please describe in your own words. The lecture and readings provide useful information. Your answer should be at least 80 words.
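
To make the mechanics of such a split concrete, the following sketch applies `train_test_split` with the same arguments as the colleague's code to a made-up dataset of 20 rows (all names ending in `_toy`, `_tr`, and `_te` are hypothetical). It only shows what the function returns, not why splitting is useful.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, of which 5 (25%) belong to class 1.
toy = pd.DataFrame({"feature": range(20), "label": [0] * 15 + [1] * 5})
X_toy = toy.drop(columns="label")
y_toy = toy["label"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)

# 80% of the rows end up in the train set, 20% in the test set.
print(len(X_tr), len(X_te))
# stratify keeps the share of class 1 the same in both parts.
print(y_tr.mean(), y_te.mean())
```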

In [None]:
# code provided by Emma's colleague
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1, stratify=y)

__Write your answer here__

# Task 8 (2 points)

__Emma:__ _Thank you for the explanation. Now we need to get started with the machine learning. Could you please help split the data into a training and test dataset?_

_Hint:_ The easiest way to split our data is to use the built-in function `train_test_split` in scikit-learn. Before you make the actual split, you should create the following variables so that `train_test_split` knows what your target is and which attributes you want to use in the prediction:

__X__ - Should contain all the columns except the target.<br>
__y__ - Should only contain the target column.

The test size should be __30%__.

You should end up having four variables that are named X_train, X_test, y_train, and y_test.

In [None]:
# Write your answer here


# Task 9 (2 points)

__Emma:__ _Let's try to create our first model using the `DecisionTreeClassifier` from scikit-learn. One of our colleagues has provided us with the following code._

In [None]:
DTC = DecisionTreeClassifier(criterion = "entropy", max_depth = 3).fit(X_train, y_train)

y_pred = DTC.predict(X_test)

print(f'Accuracy: {DTC.score(X_test, y_test)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')

__Emma:__ _Could you please help me interpret the results? What do accuracy, precision, and recall mean? Was our first model a success? Why or why not?_

Write a minimum of 80 words.
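
As a small worked example of how the three metrics are calculated, the sketch below counts the outcomes by hand on made-up labels and compares them with scikit-learn's functions. The labels and all variable names are hypothetical, and class 1 is assumed to be the positive class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels: 1 = positive class, 0 = negative class.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly predicted positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives predicted as positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives predicted as negative
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly predicted negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are correct
recall = tp / (tp + fn)            # share of actual positives that were found

# The hand-computed values should match scikit-learn's results.
print(accuracy, accuracy_score(y_true_toy, y_pred_toy))
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
```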

__Write your answer here__

# Task 10 (2 points)

__Emma:__ _The code below allows us to plot the decision tree for our model. Please explain what you see. Explain how to read the contents of the nodes (i.e., boxes)._

Write a minimum of 50 words.

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(15,10))
class_names = ['Not Churn', 'Churn']
plot_tree(DTC, max_depth=3, fontsize=8, feature_names=X.columns.tolist(), filled=True, class_names=class_names)
plt.title("Decision tree")
plt.show()

__Write your answer here__

# Task 11 (2 points)

__Emma:__ _To better understand the results, we should maybe create a confusion matrix to visualize them. I have found the following code, but I am having trouble reading the visualization. Can you please help me? What do the different numbers mean?_

Write a minimum of 40 words.
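
For orientation, the following sketch shows the layout of the array returned by `confusion_matrix` on made-up binary labels (the names `actual_toy`, `predicted_toy`, and `cm_toy` are hypothetical). With labels 0 and 1, rows correspond to the actual class and columns to the predicted class, both in the order 0, 1.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 1 = positive class, 0 = negative class.
actual_toy = [0, 0, 0, 1, 1, 1, 1, 0]
predicted_toy = [0, 1, 0, 1, 0, 1, 1, 0]

cm_toy = confusion_matrix(actual_toy, predicted_toy)
print(cm_toy)

# Flattening row by row gives the four cells in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm_toy.ravel()
print(tn, fp, fn, tp)
```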

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)

ax = sns.heatmap(conf_matrix, annot=True, fmt='d', cbar=False, cmap='Blues')

ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
plt.show()

__Write your answer here__

# Task 12 (1 point)

__Emma:__ _To obtain some insights from our model, I have asked one of our colleagues to write code that creates a dataframe containing the feature importances according to the `DecisionTreeClassifier` from before. Could you please visualize this in a sorted bar chart?_

In [None]:
data = list(zip(X_train.columns.to_list(), DTC.feature_importances_))
feature_importance = pd.DataFrame(data, columns=['Feature', 'Importance'])

In [None]:
# Write your answer here


# Task 13 (2 points)

__Emma:__ _Now let's try to see if we can make our model even better. We should try tuning the `max_depth` hyperparameter of the DecisionTreeClassifier. Create a for loop that stores the accuracy, precision, and recall of models with a `max_depth` of 3, 5, 10, 15, 20. I have created a result dataframe that you can use to store the data of the different iterations._

_Hint:_ You can reuse much of the code from previous tasks.
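
One possible pattern for storing the metrics of each iteration is sketched below on synthetic data generated with `make_classification`. The data, the depths 2, 4, and 8, and all variable names (`demo_results`, `Xs_train`, and so on) are assumptions chosen for illustration and are not the bank data or the values asked for in this task. The sketch assumes that a row can be appended to a DataFrame with `.loc[len(df)]`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank data).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=1, stratify=y_syn
)

demo_results = pd.DataFrame(columns=["max_depth", "Accuracy", "Precision", "Recall"])

for depth in [2, 4, 8]:  # illustrative depths only
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    ).fit(Xs_train, ys_train)
    preds = model.predict(Xs_test)
    # Append one row per model at the next free index position.
    demo_results.loc[len(demo_results)] = [
        depth,
        model.score(Xs_test, ys_test),
        precision_score(ys_test, preds),
        recall_score(ys_test, preds),
    ]

print(demo_results)
```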

In [None]:
result_df = pd.DataFrame(columns=['max_depth', 'Accuracy', 'Precision', 'Recall'])

In [None]:
# Write your answer here


__Emma:__ _Which `max_depth` gave the highest accuracy? Explain in your own words what the hyperparameter `max_depth` does._

Hint: You could look into the documentation (**https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html**).

__Write your answer here__


# Task 14 (2 points)

__Emma:__ _Great work! We got a couple of models to run and extracted a lot of data about their performance. What is your overall impression: did we succeed in making a useful model? Are there any insights I can bring to the executive team? Did our machine learning models give us any new information that we did not obtain from the standard statistical correlations?_

_Hint_: Change the `max_depth` hyperparameter in Task 9 to the optimal value from Task 13. Afterwards, run Task 9 to Task 12 again. Write a minimum of 100 words.

__Write your answer here__