# Warm-up challenges for week 5

Now that we've seen how to use scikit-learn, statsmodels and LIME for modelling, it's time for you to apply this knowledge. This week has three preparatory challenges.

Each challenge has two components:
1. **Programming and interpretation**
2. **Reflection**

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give all of them a try.
2. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to give it a try).
3. These challenges are ungraded, yet they help you prepare for the graded challenges in the portfolio. If you want to be efficient, have a look at what you need to do for the upcoming graded challenges and see how to combine the work.

### Facing issues? 

We constantly monitor the issues on GitHub to help you out. Don't hesitate to log an issue there, explaining clearly what the problem is and showing the code you are using and any error message you receive.

**Important:** We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day.


### Using Markdown

1. Make sure to combine code *and* markdown to answer all questions. Mention specifically the question (and question number) and the answer in markdown, relating to the code and the output of the code. For the graded challenges, failing to do so will impact the grade, as we will not be able to see whether you answered the question.
2. For every line of code, please include a cell in Markdown explaining what the code is expected to do.

## General Scenario - valid for Challenges 1, 2 and 3

In DA5 and DA6, we will work with a simulated dataset from an online store that created a set of campaigns. This dataset will be used to predict whether someone made a purchase or not (i.e., ```purchase```) and, if so, how much they spent (i.e., ```order_euros```).

These predictions will be later used for targeting users. For example, the marketing manager wants to:
* Identify users that have a high likelihood of making a purchase so she can run targeted advertising campaigns towards these users as a way to increase their awareness of the store.
* Identify users that are likely to spend more than 100 euros on the store, and give them discounts that other users would not receive. 

These predictions will be made about the dependent variables (or targets): ```purchase``` and ```order_euros```. Your independent variables (or features) will be relevant characteristics available on (or created from) the datasets as outlined in Challenge 1.


## Challenge 1


### Programming challenge

You will find three files in this folder (starting with ```da56```):
* ```da56_orders.csv.gz``` (or ```da56_orders.pkl.gz``` for a pickle version). This file contains a list of orders made on an online store. It contains the session_id of the session in which the order was made, whether there was a purchase, and how many euros were spent in that order.
* ```da56_sessions.csv.gz``` (or ```da56_sessions.pkl.gz``` for a pickle version). This file contains a list of online sessions on the website. For each session, it contains the details of the browser that visited the site (user_agent), when the visit started, what the referral was, and whether the referral was part of a paid campaign. If a user had a profile when making the purchase, this file will also contain their user id.
* ```da56_users.jsonl```. This file contains a list of registered users on the website, their name, when they created their profile, their age, the referral that they followed when creating their profile, and whether they are part of an exclusive club of preferential clients.
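
As a starting point, the loading and inspection step could look like the minimal sketch below. To keep it runnable anywhere, it first writes two tiny stand-in files to a temporary folder; the file names `demo_orders.csv.gz` and `demo_users.jsonl` and the columns `user_id` and `age` are assumptions made for illustration, not the contents of the real files. The same `read_csv` and `read_json` calls should work on the real files, and `pd.read_pickle` can be used for the pickle versions.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Build tiny stand-in files so the sketch runs anywhere; in the challenge
# you would point the read calls at the real files in this folder.
tmp = Path(tempfile.mkdtemp())

orders_demo = pd.DataFrame(
    {
        "session_id": [1, 2, 3],
        "purchase": [1, 0, 1],
        "order_euros": [25.0, 0.0, 120.5],
    }
)
# pandas infers gzip compression from the ".gz" suffix when writing.
orders_demo.to_csv(tmp / "demo_orders.csv.gz", index=False)

# Hypothetical user records, written as one JSON object per line.
with open(tmp / "demo_users.jsonl", "w", encoding="utf-8") as handle:
    for record in [{"user_id": "u1", "age": 34}, {"user_id": "u2", "age": 51}]:
        handle.write(json.dumps(record) + "\n")

# Gzipped CSV: compression is inferred from the ".gz" suffix when reading.
orders = pd.read_csv(tmp / "demo_orders.csv.gz")

# JSON Lines: lines=True tells pandas to parse one object per line.
users = pd.read_json(tmp / "demo_users.jsonl", lines=True)

# First inspection: size, column types, and missing values per column.
for name, frame in {"orders": orders, "users": users}.items():
    print(name, frame.shape)
    print(frame.dtypes)
    print(frame.isna().sum())
```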

In this first challenge, you will **load, inspect, clean and visualize** your own dataset. You will:
1. Load and inspect each file using the steps you learned in the course
2. Select which files and columns you will use for the challenge 
3. Transform the files into the appropriate format and merge them appropriately
4. Review the dataset created through merging, and remove personally identifiable data
5. Identify at least one key independent variable and at least three control variables that should influence the dependent variables (```purchase``` and ```order_euros```), and perform any additional data cleaning steps required to create or transform these variables so they can be used for statistical testing and machine learning.
6. Provide descriptive statistics for all the relevant dependent, independent and control variables
7. Create univariate charts for each dependent, independent and control variable, and at least one chart with an interesting bivariate relationship between one independent variable and one dependent variable.
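
Steps 4 and 6 can be sketched on a toy dataset as follows. This is only an illustration under stated assumptions: apart from `purchase` and `order_euros`, every column name here (`name`, `user_id`, `paid_campaign`, `age`) is hypothetical, and which columns count as personally identifiable in your merged dataset is for you to decide and justify. The last part is a tabular preview of a bivariate relationship; step 7 still asks for charts.

```python
import pandas as pd

# Hypothetical merged dataset; all column names except purchase and
# order_euros are assumptions made for this illustration.
merged = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cem", "Dee", "Eli", "Fay"],
        "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
        "paid_campaign": [1, 0, 1, 0, 1, 0],
        "age": [23, 35, 41, 29, 52, 38],
        "purchase": [1, 0, 1, 0, 1, 1],
        "order_euros": [40.0, 0.0, 130.0, 0.0, 75.5, 20.0],
    }
)

# Step 4: drop columns that directly identify a person
# (do this after merging, since identifiers may be needed as join keys).
pii_columns = ["name", "user_id"]
analysis = merged.drop(columns=pii_columns)

# Step 6: descriptive statistics for numeric variables ...
print(analysis[["age", "order_euros"]].describe())

# ... and frequency counts for binary or categorical ones.
print(analysis["purchase"].value_counts())

# Share of rows with a purchase, split by the hypothetical IV.
purchase_rate = analysis.groupby("paid_campaign")["purchase"].mean()
print(purchase_rate)
```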

**Tips:** 
* You are free to decide which files you find interesting to work with, but for the challenge to work properly there are at least two files that **have** to be used so you can create a meaningful dataset.
* If you decide to use the ```da56_users``` dataset, you may end up with several missing values (for users that entered the website but did not have a profile, or made the purchase checking out as guest). You can decide whether to fill these missing values and keep the cases, if appropriate, or simply to work with a smaller dataset (only for registered users). Both ways are fine but, whatever you do, make sure to explain what you are doing and justify your choices.
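
The two options in the second tip can be sketched as below, assuming that sessions link to orders through `session_id` and to users through a `user_id` column. The toy values, the `age` column and the derived `registered` flag are assumptions for illustration. Filling with the median is just one possible choice and needs to be justified like any other.

```python
import pandas as pd

# Toy stand-ins; column names other than session_id, purchase and
# order_euros are assumptions for this illustration.
orders = pd.DataFrame(
    {
        "session_id": [1, 2, 3, 4],
        "purchase": [1, 0, 1, 0],
        "order_euros": [30.0, 0.0, 110.0, 0.0],
    }
)
sessions = pd.DataFrame(
    {"session_id": [1, 2, 3, 4], "user_id": ["u1", None, "u2", None]}
)
users = pd.DataFrame({"user_id": ["u1", "u2"], "age": [28, 45]})

# Orders and sessions share session_id; validate raises an error
# if the keys turn out not to be unique on both sides.
data = orders.merge(sessions, on="session_id", how="inner", validate="one_to_one")

# A left join keeps guest sessions; indicator=True adds a _merge column
# recording which rows found a matching user profile.
data = data.merge(users, on="user_id", how="left", indicator=True)
print(data["_merge"].value_counts())

# Option A: keep every case, flag registration, and fill missing ages.
option_a = data.assign(
    registered=(data["_merge"] == "both").astype(int),
    age=data["age"].fillna(data["age"].median()),
)

# Option B: work with registered users only (smaller dataset).
option_b = data[data["_merge"] == "both"].copy()

print(len(option_a), len(option_b))
```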


### Reflection

In sections 1 and 2 of their article, Tramèr et al. (2017) discuss how unfair treatment can arise from irresponsible data usage, and provide a few examples. You are now working with digital trace data from a (fictitious) online store, and will later create models that will be used to target campaigns or give discounts to consumers. Please briefly discuss how this may create a risk of unfair treatment (and indicate specifically what that unfair treatment would be, and why). Moving one step further, look at the variables in the original datasets (da56_orders, da56_sessions and da56_users). Select the one that you believe has the most potential for creating risks of unfair treatment, and explain why.

## Challenge 2

### Programming challenge

Now that you have your dataset created (challenge 1), we would like you to focus on the variable ```purchase``` (binary variable) as the DV. You need to:
1. Propose one RQ and one hypothesis for your model (i.e., how at least one IV influences the DV)
2. Create one statistical model with statsmodels that (a) tests your hypothesis and (b) checks the influence of at least two other relevant control variables
3. Use machine learning (scikit-learn) to create the same model (as in item 2), and use this model to run predictions (e.g., what is the likelihood of a purchase depending on different combinations of values for the IV and controls?)

**Optional:** Use LIME or SHAP to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.


### Reflection

Hind (2019) discusses four main groups interested in explainable AI. Select the three most relevant groups for this challenge, and explain, in the context of this challenge, why they would be interested in understanding the model being used, and what they would want to know about the model in order to be able to trust its predictions.

## Challenge 3

### Programming challenge

Now that you have your dataset created (challenge 1), we would like you to focus on the variable ```order_euros``` as the DV. You need to:
1. Propose one RQ and one hypothesis for your model (i.e., how at least one IV influences the DV)
2. Create one statistical model with statsmodels that (a) tests your hypothesis and (b) checks the influence of at least two other relevant control variables
3. Use machine learning (scikit-learn) to create the same model (as in item 2), and use this model to run predictions (e.g., what is the expected order value in euros for different combinations of values for the IV and controls?)

**Optional:** Use LIME or SHAP to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.

### Reflection

Hind (2019) briefly touches upon global and local approaches for model explanations. Some of the models and predictions you created would fit into the idea of *global* explanations, and others would fit the idea of *local* explanations. 

Using one of the three relevant groups (identified in the reflection of challenge 2) as your main stakeholder, write:
* A brief *local* explanation for your model, and refer back to steps in the data analysis that show local explanations
* A brief *global* explanation for the model, and refer back to steps in the data analysis that show global explanations

Don't forget to indicate in the reflection which stakeholder you selected.
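
To make the global versus local distinction concrete, the sketch below uses a plain linear regression on synthetic data. It is not LIME or SHAP, only a simpler stand-in: the coefficients give a global explanation (one effect per feature for all cases), while the per-feature contributions for a single case, measured against the average case, give a local one. The feature names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data; the feature names are assumptions for illustration.
rng = np.random.default_rng(0)
feature_names = ["paid_campaign", "age", "club_member"]
X = np.column_stack(
    [rng.integers(0, 2, 200), rng.normal(40, 10, 200), rng.integers(0, 2, 200)]
).astype(float)
y = 50 + 20 * X[:, 0] + 0.5 * X[:, 1] + 15 * X[:, 2] + rng.normal(0, 5, 200)

model = LinearRegression().fit(X, y)

# Global explanation: one coefficient per feature, valid for every case.
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f} euros per unit")

# Local explanation: how each feature moves ONE case's prediction
# away from the prediction for the average case.
average_case = X.mean(axis=0)
case = X[0]
contributions = model.coef_ * (case - average_case)
baseline = model.predict(average_case.reshape(1, -1))[0]
prediction = model.predict(case.reshape(1, -1))[0]

for name, value in zip(feature_names, contributions):
    print(f"{name}: {value:+.2f} euros relative to the average case")
print(f"baseline {baseline:.2f} + contributions = {prediction:.2f}")
```

For a linear model the contributions add up exactly to the difference between the case's prediction and the baseline. For non-linear models, tools such as LIME or SHAP approximate this kind of decomposition.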

### Important exception
As LIME is a framework still in development, we are not sure whether it will work on all computers and configurations. If you get an error message when running LIME, please log a GitHub issue, but do not wait for a resolution before handing in the challenge.