# <font color=Pink>A step-by-step guide to getting started with modelling using your own dataset</font>

This notebook takes you through all the steps you need to perform when you start working with your own dataset. However, be aware that you will most likely go back and forth between some of the steps, or repeat the whole cycle after building and evaluating your first model(s). Thus, *<font color=Orchid><b>although this notebook is a linear step-by-step guide, the process of creating a classification model is NOT linear</b></font>*. For instance, having trained and evaluated a model will likely make you want to go back and train it again on different features or with different parameters.

The steps below are partly inspired by the stages of the CRISP-DM process model for data science ('The CRoss Industry Standard Process for Data Mining').<br>

The steps are as follows:
<ol>
    <li><b><font color=Orchid>Import your dataset and libraries</font></b></li>
    <li><b><font color=Orchid>Perform exploratory data analysis of your dataset</font></b> ('Data Understanding')</li>
    <li><b><font color=Orchid>Clean your dataset</font></b> ('Data Preparation')</li>
    <li><b><font color=Orchid>Train one or more classification models</font></b> ('Modeling')</li>
    <li><b><font color=Orchid>Evaluate your model(s)</font></b> ('Evaluation')</li>
</ol>

**Remember, you can find all the code you need for each step in the relevant exercise notebooks!**

**The CRISP-DM process model:** <br>

<img src="crisp-dm.png" width="400"/>

# 1. Import dataset and libraries 

In this section, you import your dataset as well as the libraries you plan on using, e.g., pandas and sklearn.
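As a starting point, loading a dataset with pandas could look like the sketch below. The column names and the in-memory CSV text are made up for illustration; with your own data you would pass a file path (for example a hypothetical `"my_data.csv"`) to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a real file so that this cell runs on its own.
# With your own dataset: df = pd.read_csv("my_data.csv")  (hypothetical file name)
csv_text = """age,income,city,churned
34,52000,Aarhus,0
45,,Copenhagen,1
23,31000,Odense,0
52,88000,Aarhus,1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns
print(df.head())  # first rows of the dataset
```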

In [None]:
# your code goes here

# 2. Explore the dataset

In this step, you examine your dataset by performing what is called "exploratory data analysis". For instance, you likely want to know:
- which columns/features your dataset contains
- the data type of the features
- whether there are any missing values
- whether there are any outliers
- whether your dataset is balanced when it comes to your target feature
- how other features are distributed
- how your target variable correlates with the other features in the dataset, e.g., by making a correlation matrix/heatmap (this might require you to clean the dataset first, see step 3)

N.B.: You will likely be going back and forth between exploring and cleaning the dataset.

Hint: the notebooks <font color=Orchid>"BDP_Basics" and "BDP_EDA"</font> will be your friend here!
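The checks listed above can be sketched with a few pandas calls. The small DataFrame and its column names (`age`, `income`, `city`, `churned`) are invented for this example; replace them with your own data and target column.

```python
import numpy as np
import pandas as pd

# Made-up example data; the column names are hypothetical.
df = pd.DataFrame({
    "age": [34, 45, 23, 52, 41, 29, 38, 95],
    "income": [52000, np.nan, 31000, 88000, 61000, 40000, 57000, 59000],
    "city": ["Aarhus", "Copenhagen", "Odense", "Aarhus",
             "Odense", "Aarhus", "Copenhagen", "Odense"],
    "churned": [0, 1, 0, 1, 0, 0, 0, 1],
})

# Which columns are there, and what are their data types?
print(df.dtypes)

# How many missing values does each column have?
missing = df.isna().sum()
print(missing)

# Summary statistics; extreme min/max values can hint at outliers.
print(df.describe())

# Is the target balanced?
class_counts = df["churned"].value_counts()
print(class_counts)

# Correlation matrix of the numerical columns, sorted by correlation with the target.
corr = df.select_dtypes("number").corr()
print(corr["churned"].sort_values(ascending=False))
```

For a heatmap, the `corr` matrix can be passed on to a plotting library of your choice.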

In [None]:
# your code goes here

# 3. Clean the dataset

This step is based on your findings in step 2. Thus, cleaning the dataset may entail:
- handling missing values
- handling outliers
- handling nominal data (i.e., data which is not numerical)

After the initial cleaning, you will likely go back to step 2 and make a correlation matrix.

Hint: the notebooks <font color=Orchid>"BDP_Basics" and "BDP_EDA"</font> will be your friend here!
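A minimal cleaning sketch covering the three cases above is shown here. The data and column names are made up, and the choices (median imputation, clipping with the 1.5 × IQR rule of thumb, one-hot encoding) are only some of several reasonable options, not the required approach.

```python
import numpy as np
import pandas as pd

# Made-up example data; the column names are hypothetical.
df = pd.DataFrame({
    "age": [34, 45, 23, 52, 41, 29, 38, 95],
    "income": [52000, np.nan, 31000, 88000, 61000, 40000, 57000, 59000],
    "city": ["Aarhus", "Copenhagen", "Odense", "Aarhus",
             "Odense", "Aarhus", "Copenhagen", "Odense"],
    "churned": [0, 1, 0, 1, 0, 0, 0, 1],
})

# Missing values: fill with the column median (dropping the rows is another option).
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: clip values lying more than 1.5 * IQR outside the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Nominal data: one-hot encode the non-numerical column.
df = pd.get_dummies(df, columns=["city"])

print(df.head())
```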

In [None]:
# your code goes here

# 4. Modeling

Modeling can be done by following these steps. However, be aware that **<font color=Orchid>modeling is NOT a linear process, but an iterative process</font>**, i.e., you will very likely have to go back and forth between the steps below.

- 4.1 Select variables to use in your model
- 4.2 Split your data into test and train data
- 4.3 Apply data sampling (if your dataset is unbalanced)
- 4.4 Select model type
- 4.5 Select parameters for your model
- 4.6 Train your model

### 4.1 Start by selecting the variables you want to use in your model

Regardless of which model you plan on using, you need to start by:
- selecting the **<font color="Orchid">target/dependent variable ('y')</font>**
- selecting the **<font color="Orchid">features/independent variables</font>** you want to use for your prediction ('X')

The features you choose can depend on one or more of the following:
- your own hypotheses/existing literature
- findings from your data exploration
- findings and insights gained while modeling (you will likely go back and forth in your modeling process, trying out various features and parameters)
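Selecting the target and the features could look like this. The DataFrame and the names `churned`, `age` and `income` are hypothetical; use the columns that make sense for your own dataset.

```python
import pandas as pd

# Made-up, already cleaned example data; the column names are hypothetical.
df = pd.DataFrame({
    "age": [34, 45, 23, 52, 41, 29],
    "income": [52000, 57000, 31000, 88000, 61000, 40000],
    "city_Aarhus": [1, 0, 0, 1, 0, 1],
    "churned": [0, 1, 0, 1, 0, 0],
})

target = "churned"            # the dependent variable ('y')
features = ["age", "income"]  # the independent variables ('X')

y = df[target]
X = df[features]

print(X.shape, y.shape)
```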

In [None]:
# your code goes here

### 4.2 Next, split your data into test and train data

Remember - if you plan on using KNN for classification, you need to scale your data before training/fitting the model!
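One way to split and scale is sketched below, using a synthetic dataset from `make_classification` as a stand-in for your own `X` and `y`. The 25% test size, the stratified split and `StandardScaler` are assumptions made for the example. Note that the scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into training.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your own features (X) and target (y).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out 25% of the observations for testing; stratify keeps the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```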

In [None]:
# your code goes here

### 4.3 For unbalanced datasets, you can now apply data sampling techniques

If your data is not balanced (i.e., there is an unequal distribution of observations belonging to each of the classes in your target variable), you can balance the classes using over- or undersampling.

Oversampling techniques:
- random oversampling
- SMOTE oversampling (synthetic oversampling)

Undersampling techniques:
- random undersampling
- TomekLinks undersampling

Combining over- and undersampling:
- SMOTETomek
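As an illustration of the simplest technique, random oversampling, here is a sketch that uses only scikit-learn's `resample` on a synthetic, imbalanced dataset. The `imbalanced-learn` package (imported as `imblearn`) provides `RandomOverSampler`, `SMOTE`, `RandomUnderSampler`, `TomekLinks` and `SMOTETomek`; it is left out here only to keep the example free of extra dependencies. The sampling is applied to the training data only, so the test set keeps its original class distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (roughly 90% / 10%).
X, y = make_classification(
    n_samples=300, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Find the majority and the minority class in the training data.
classes, counts = np.unique(y_train, return_counts=True)
majority = classes[np.argmax(counts)]
minority = classes[np.argmin(counts)]

X_maj = X_train[y_train == majority]
X_min = X_train[y_train == minority]

# Random oversampling: draw minority rows with replacement
# until there are as many as in the majority class.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_train_bal = np.vstack([X_maj, X_min_up])
y_train_bal = np.concatenate([
    np.full(len(X_maj), majority),
    np.full(len(X_min_up), minority),
])

bal_classes, bal_counts = np.unique(y_train_bal, return_counts=True)
print("before:", dict(zip(classes, counts)))
print("after: ", dict(zip(bal_classes, bal_counts)))
```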

<br>
<img src="../week5/Data_sampling.png" width="1000"/>

In [None]:
# your code goes here

### 4.4 Select the type of model you want to try out

Now, you are ready to do some modelling!

You can do classification modeling using a *single* model or a mix of *several* models (i.e., an ensemble).

We have looked at the following *single* classification models:
- Decision trees
- KNN

We have looked at the following *ensemble models* for classification:
- Bagging (including random forests)
- Boosting
- Ensemble voting
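The model types above all have counterparts in scikit-learn. The sketch below creates one of each with default settings and fits them on synthetic data for a quick first comparison. The choice of `GradientBoostingClassifier` as the boosting example, the members of the voting ensemble, and the helper `scaled_knn` are assumptions made for illustration. KNN is wrapped in a pipeline with a scaler, since it needs scaled data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your own data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def scaled_knn():
    """Hypothetical helper: KNN preceded by scaling."""
    return make_pipeline(StandardScaler(), KNeighborsClassifier())


models = {
    # Single models
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": scaled_knn(),
    # Ensemble models
    "bagging": BaggingClassifier(random_state=42),  # bags decision trees by default
    "random_forest": RandomForestClassifier(random_state=42),
    "boosting": GradientBoostingClassifier(random_state=42),
    "voting": VotingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(random_state=42)),
            ("knn", scaled_knn()),
            ("forest", RandomForestClassifier(random_state=42)),
        ],
        voting="hard",  # majority vote between the three models
    ),
}

# Fit each model and record its accuracy on the test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```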

<br>
<img src="../week4/Bagging_Boosting.png"/>

In [None]:
# your code goes here

### 4.5 Select parameters for your model (hyperparameter tuning)

When building/training a classification model, you can provide it with various *parameters*.

For **<font color="Orchid">decision trees</font>**, relevant parameters include:
- max_depth
- min_samples_split
- min_samples_leaf

For **<font color="Orchid">KNN</font>**, we have looked at the parameter:
- k (number of neighbours)

The performance of your model depends on these parameters. To identify the optimal values of these parameters, you perform hyperparameter tuning.
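One common way to tune hyperparameters is a cross-validated grid search, sketched here for a decision tree with scikit-learn's `GridSearchCV`. The candidate values in `param_grid`, the 5 folds and the accuracy scoring are arbitrary choices for the example. For KNN, the corresponding scikit-learn parameter for k is called `n_neighbors`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your own data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Candidate values to try; these particular numbers are only examples.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

# Try every combination with 5-fold cross-validation on the training data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` holds a model refitted on the whole training set with the best parameters found.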

In [None]:
# your code goes here

### 4.6 Training your model

After having decided on:
- which model you want to try out
- whether you need to apply sampling methods
- optimal parameters

... you fit/train your model on your training data. Again, remember to scale the data if you are using KNN.
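Training could then look like the sketch below, here with KNN on synthetic data. Putting the scaler and the model in one pipeline is one convenient way to make sure the scaling is always applied; `n_neighbors=5` is a placeholder for whatever value your tuning suggested.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your own data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The pipeline scales the data and then fits KNN on the scaled values.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# The same scaling is applied automatically when predicting.
y_pred = model.predict(X_test)
print(y_pred[:10])
```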

In [None]:
# your code goes here

# 5. Evaluating your model

Having trained a model, you now want to assess its performance.

You can assess the performance using various methods and metrics:
- accuracy
- confusion matrix
- recall
- precision
- specificity
- AUC-ROC
- F1
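The metrics above are available in `sklearn.metrics`, except specificity, which can be computed from the confusion matrix. The sketch below assumes a binary target and uses a decision tree with `max_depth=3` on synthetic data purely as an example model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your own data and model.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)               # predicted classes
y_proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# For a binary target, the confusion matrix unpacks into these four counts.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)        # tp / (tp + fn)
precision = precision_score(y_test, y_pred)  # tp / (tp + fp)
specificity = tn / (tn + fp)                 # true negative rate
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)         # uses probabilities, not classes

for name, value in [
    ("accuracy", accuracy), ("recall", recall), ("precision", precision),
    ("specificity", specificity), ("F1", f1), ("AUC-ROC", auc),
]:
    print(f"{name}: {value:.3f}")
```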

**<font color=Orchid>Having evaluated your model, you will likely go back to previous stages, such as 'Data Understanding' and 'Modeling', training new models based on your findings so far.</font>**

In [None]:
# your code goes here