# Warm-up challenges for week 6

Now that we've seen how to use scikit-learn, statsmodels and LIME for modelling, it's time for you to apply this knowledge. This week has three preparatory challenges.

Each challenge has two components:
1. **Programming and interpretation**
2. **Reflection**

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give all of them a try.
2. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to give it a try).
3. These challenges are ungraded, yet they help you prepare for the graded challenges in the portfolio. If you want to be efficient, have a look at what you need to do for the upcoming graded challenges and see how to combine the work.

### Facing issues? 

We are constantly monitoring the issues on GitHub to help you out. Don't hesitate to log an issue there, clearly explaining the problem and showing the code you are using and any error message you receive.

**Important:** We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day.


### Using Markdown

1. Make sure to combine code *and* markdown to answer all questions. Mention specifically the question (and question number) and the answer in markdown, relating to the code and the output of the code. For the graded challenges, failing to do so will impact the grade, as we will not be able to see whether you answered the question.
2. For every line of code, please include a Markdown cell explaining what the code is expected to do.

## General Scenario - valid for Challenges 1, 2 and 3

In DA5 and DA6, we work with a simulated dataset from an online store that ran a set of campaigns. This dataset will be used to predict whether someone has made a purchase (i.e., ```purchase```) and, if so, how much they have spent (i.e., ```order_euros```).

Scenario:
Our webshop has launched new campaigns to increase sales (as binary, ```purchase```) and revenue (```order_euros```). For now, we are only interested in sales (i.e., the binary DV).

We are interested in two campaigns (indicated in column ```type_campaign```):
* The CPC campaign running on Facebook and Google
* The influencer campaign running on Instagram and Facebook

We want to know if each campaign led to an increase in sales compared to the other campaigns (i.e., any traffic source that is not set as CPC or influencer).

These predictions will be made about the dependent variable (or target): ```purchase```. Your independent variables (or features) will be relevant characteristics available on (or created from) the dataset.


## Challenge 1


### Programming challenge

In this challenge, we focus on modelling and evaluating machine learning models. Therefore, we have prepared a merged dataset with all variables necessary. You can find it in the GitHub folder; it is called ```da6_fulldata.csv.gz``` (for the csv file) or ```da6_fulldata.pkl.gz``` (for the pickle file). It contains information about:
* sessions. Columns ```session_id```, ```session_timestamp```, ```user_agent```, ```referral```, ```type_campaign``` tell us the details of the browser that visited the site (user_agent), when the visit started, what the referral was, and what campaign the referral came from.
* orders. Columns ```order_euros``` and ```purchase``` include information if a user made a purchase and how much they spent.
* users. Columns ```user_id```, ```reg_name```, ```registration_date```, ```initial_referrer```, ```preferential_client``` and ```age``` for registered users tell us their name, when they created their profile, their age, the referral that they followed when creating their profile, and whether they are part of an exclusive club of preferential clients.

*Note: this is a different dataset than the one you used last week. This dataset is already merged and has additional information about campaigns the webshop is running on different sites. The dataset does not yet contain all the variables you need; you will have to create them in this challenge.*

In the first challenge, you need to:
* Propose which control variables are relevant to be included in the model.
* Create the necessary features for the different campaigns based on the column ```type_campaign``` (this column indicates what kind of campaign someone has seen; you can use it to create the variables that identify users coming from the CPC or influencer campaign) and for your control variables (the ones you chose in the first step).
* Create two models for statistical testing (with **statsmodels**): one with only the campaign information (CPC and influencer), and a second model with at least one additional independent or control variable. It can be one of the variables about the session or the user. Sales, as a binary variable, should be your DV.
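
As a starting point for the feature-creation step, here is a minimal sketch of how campaign dummies could be derived from ```type_campaign```. The toy data frame and the labels `"cpc"`, `"influencer"` and `"organic"` are assumptions made for illustration; inspect `df["type_campaign"].value_counts()` on the real dataset to see which values actually occur.

```python
import pandas as pd

# Toy stand-in for the real dataset. The campaign labels are ASSUMED;
# the actual values in da6_fulldata may be spelled differently.
df = pd.DataFrame({
    "type_campaign": ["cpc", "influencer", "organic", None, "CPC"],
    "purchase": [1, 0, 0, 1, 1],
})

# Normalise the labels and treat missing values as "no campaign"
campaign = df["type_campaign"].fillna("none").str.lower()

# One dummy per campaign of interest; every other traffic source
# ends up in the reference group (both dummies equal to 0)
df["cpc"] = (campaign == "cpc").astype(int)
df["influencer"] = (campaign == "influencer").astype(int)

print(df[["type_campaign", "cpc", "influencer"]])
```

Leaving all other traffic sources as the reference group means each coefficient can later be read as the difference between that campaign and "everything else".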

*IMPORTANT: Don't forget to split your data in train / test datasets, and use only the train dataset here, as done also in the DA6 tutorial.*


### Reflection
Finn & Wadhwa (2014) argue that digital analytic practices such as the example case in this week's challenges have ethical impacts, including identifiability, inequality, a chilling effect, the objectification, exploitation and manipulation of consumers, as well as information asymmetries. For the reflection, choose one of the impacts and reflect on the extent to which a campaign evaluation and performance prediction such as the one conducted in this challenge may have such an impact. Provide specific examples to link the ethical impact to the case.

## Challenge 2

### Programming challenge

Using the LogisticRegression classifier from **sklearn**, recreate the same models that you created in statsmodels for challenge 1 using your training dataset. Make predictions for different user cases. To compare the models, don't forget to compute:
* What their confusion matrix looks like for the test dataset.
* Their precision, recall and F1-score for the test dataset.
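
A minimal sketch of this workflow is shown below, again on synthetic data so that it runs on its own. The data-generating numbers and the two "user cases" are invented for illustration. Note that sklearn's `LogisticRegression` applies L2 regularisation by default, so its coefficients will not match the statsmodels estimates exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (all numbers are made up)
rng = np.random.default_rng(42)
n = 1000
campaign = rng.choice(["cpc", "influencer", "other"], size=n)
df = pd.DataFrame({
    "cpc": (campaign == "cpc").astype(int),
    "influencer": (campaign == "influencer").astype(int),
    "age": rng.integers(18, 70, size=n),
})
log_odds = -0.5 + 1.2 * df["cpc"] + 0.6 * df["influencer"] + 0.01 * (df["age"] - 40)
df["purchase"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds.to_numpy())))

features = ["cpc", "influencer", "age"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["purchase"], test_size=0.2, random_state=42
)

# Fit on the training data only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate on the held-out test data
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: actual, columns: predicted
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))

# Hypothetical user cases: a 25-year-old from the CPC campaign
# and a 60-year-old from any other traffic source
cases = pd.DataFrame({"cpc": [1, 0], "influencer": [0, 0], "age": [25, 60]})
probs = clf.predict_proba(cases)[:, 1]  # predicted probability of a purchase
print(probs)
```

Repeating the same steps with `features = ["cpc", "influencer"]` gives the metrics for the campaign-only model, so the two can be compared side by side.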

**Optional:** Use LIME to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.


### Reflection

Based on your analysis, you have made one recommendation to the business stakeholder about the campaign. boyd and Crawford (2012) argue that it is misleading to think that we can make objective and accurate claims based on big data analytics. Looking at the campaign evaluation, reflect on
1) to what extent the conclusions and recommendations you have made are objective and accurate,
2) what the threats to objectivity and accuracy may be, and
3) how these issues could be mitigated, giving concrete suggestions.

Please motivate your response with specific issues about the case.


## Challenge 3

### Programming challenge

Using the DecisionTreeClassifier from **sklearn**, recreate the best performing model from challenge 2 (the one you indicated you'd choose in the technical summary). After the decision tree model is created, please:
* Create a chart for the decision tree,
* Compute the precision, recall and F1-score of this model for the test dataset.
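
The sketch below illustrates one way to do this with `DecisionTreeClassifier` and `plot_tree`. The synthetic data, the feature set and `max_depth=3` (chosen only to keep the chart readable) are assumptions for illustration; use the features of your own best model from challenge 2.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the real data (all numbers are made up)
rng = np.random.default_rng(42)
n = 1000
campaign = rng.choice(["cpc", "influencer", "other"], size=n)
df = pd.DataFrame({
    "cpc": (campaign == "cpc").astype(int),
    "influencer": (campaign == "influencer").astype(int),
    "age": rng.integers(18, 70, size=n),
})
log_odds = -0.5 + 1.2 * df["cpc"] + 0.6 * df["influencer"] + 0.01 * (df["age"] - 40)
df["purchase"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds.to_numpy())))

features = ["cpc", "influencer", "age"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["purchase"], test_size=0.2, random_state=42
)

# A shallow tree keeps the chart readable; the depth is an arbitrary choice
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Chart of the fitted tree (displayed automatically in a notebook cell)
fig, ax = plt.subplots(figsize=(14, 7))
nodes = plot_tree(
    tree,
    feature_names=features,
    class_names=["no purchase", "purchase"],
    filled=True,
    ax=ax,
)

# Metrics on the test data
y_pred = tree.predict(X_test)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because these metrics are computed on the same kind of test split as in challenge 2, they can be compared directly with the logistic regression results.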


### Reflection

In their paper, boyd and Crawford (2012) describe six critical questions for big data research. The questions concern the value of big data, ethical questions related to data collection and analysis, as well as the broader impact of big data analytics on science, industry and society. For the reflection, select one of the questions (except question 2, discussed in challenge 2) and apply it to the case of this week (a company collects digital trace data to analyse campaign effectiveness). Please motivate your response with specific issues about the case.

### Important exception
As LIME is a framework still in development, we are not sure whether it will work on all computers and configurations. If by any chance you get an error message when running LIME, please log a GitHub issue, but do not wait for a resolution before handing in the challenge.