# Designing Data Products


<div class="slide-title"> 
    
# Designing Data Products
    
</div>

## Why would we build a model?
      
* Exploratory analysis - understand what happened in the past
* Predictive analysis - predict what will happen
* **Predict what, for whom and for what purpose?**

<div class="alert alert-block alert-info">
<b>Note: </b> 
    You do not always need a ML model.
</div> 

In some cases a deterministic solution exists, and sometimes simple relationships already lead to a good prediction.

### Product = Customer x Business x Technology

* Usability
* Business viability
* Feasibility
      
Value = product of the three (if one is zero, the value is zero too).  




Notes: You're not building a model just for the sake of the model. <br>
The value of your product is defined by three characteristics.

## Measuring success

The first model you build should be the simplest model that could address the product needs.

**Business performance:** usually measured by a single KPI (key performance indicator)
<br><br>

**Model performance:** an offline metric that captures how well the model will fit the business need
<br><br>

<div class="alert alert-block alert-info">
<b>Note: </b> 
    The business metric is independent of the model metric. It measures the success of the product.
</div> 


In most cases there is a solution that does not use ML (e.g. Titanic: predict that all women survived). You need to justify why an ML model is necessary, since it adds complexity and cost. <br>
KPIs can be defined for anything.
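
As a minimal sketch of such a non-ML baseline, the snippet below scores the rule "every woman survived" on a handful of passenger records. The records are made up for illustration (they are not the real Titanic data), and the helper names `rule_based_prediction` and `accuracy` are hypothetical.

```python
# Toy passenger records, made up for illustration (NOT the real Titanic data).
passengers = [
    {"sex": "female", "survived": 1},
    {"sex": "female", "survived": 1},
    {"sex": "female", "survived": 0},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 1},
    {"sex": "male", "survived": 0},
    {"sex": "female", "survived": 1},
]


def rule_based_prediction(passenger):
    """Deterministic baseline: predict survival for women, death for men."""
    return 1 if passenger["sex"] == "female" else 0


def accuracy(records, predict):
    """Share of records for which the prediction matches the label."""
    hits = sum(predict(r) == r["survived"] for r in records)
    return hits / len(records)


baseline_accuracy = accuracy(passengers, rule_based_prediction)
print(f"Rule-based baseline accuracy: {baseline_accuracy:.2f}")
```

An ML model would have to beat this number by a margin that justifies its extra complexity.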


### Business performance vs. model performance

<center>  
<img src="../images/ml_project_data_products/Business_vs_Model_Performance_1.png" >
</center> 


Example: Cab app, navigation system

### Business performance vs. model performance

<center>  
<img src="../images/ml_project_data_products/Business_vs_Model_Performance_2.png">
</center> 

### Business performance vs. model performance

<center>  
<img src="../images/ml_project_data_products/Business_vs_Model_Performance_3.png">
</center> 

### Business performance vs. model performance

<center>  
<img src="../images/ml_project_data_products/Business_vs_Model_Performance_4.png">
</center> 

### Examples of measuring business performance

Business metrics:

* Click through rate (CTR) - for recommenders
* Usage - model that generates HTML from hand-drawn diagrams
* Adoption by finance team - internal revenue forecasting

<div class="alert alert-block alert-info">
<b>Note: </b> 
    The business metrics for your model might be impossible to measure before the model goes live!
</div> 


Q: What is the KPI for Zoom backgrounds? (For example, the share of people who turn on their cameras; in public meetings this is often not the case.)

### Examples of measuring model performance
 
<b>Regression: </b>  
* RMSE, RMSLE
* MAPE (mean absolute percentage error) - error as a ratio of the actual value

<b>Classification: </b>  
* Accuracy
* Precision
* Recall

<b>Custom metric: </b> based on the worst case scenarios of your product. 

<div class="alert alert-block alert-info">
<b>Note: </b> 
    If you need to present to stakeholders you need a simple metric!
</div> 

Notes: What could the “L” in RMSLE stand for? (Logarithmic.) RMSLE is best used when targets grow exponentially (population counts, average sales of a commodity over a span of years, etc.). Note that this metric penalizes an under-prediction more than an over-prediction of the same size.
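
The regression metrics above can be sketched in plain Python. This is an illustrative implementation under simple assumptions (non-negative targets for RMSLE, non-zero targets for MAPE); the function names and the single-value example are made up for the demonstration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (values must be >= 0)."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )


def mape(y_true, y_pred):
    """Mean absolute percentage error as a ratio (targets must be non-zero)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [100.0]
under = [50.0]  # under-prediction by 50
over = [150.0]  # over-prediction by 50

# RMSE and MAPE score both errors identically ...
print("RMSE ", rmse(actual, under), rmse(actual, over))
print("MAPE ", mape(actual, under), mape(actual, over))
# ... while RMSLE penalizes the under-prediction more.
print("RMSLE", rmsle(actual, under), rmsle(actual, over))
```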

### Relationship between business performance & model performance

Thinking of the business value of your model and the cost of being
wrong can help you choose the right model metric.

**Always start from the value!**
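
One way to make the cost of being wrong explicit is a cost-weighted metric. The sketch below is only an illustration: the function `expected_cost`, the two toy models and the cost values (a missed positive assumed to be ten times as expensive as a false alarm) are all hypothetical.

```python
def expected_cost(y_true, y_pred, cost_fp, cost_fn):
    """Average cost per prediction, given a cost for each kind of mistake."""
    false_positives = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return (false_positives * cost_fp + false_negatives * cost_fn) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses one positive
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # raises two false alarms

# Assumed costs: a missed positive costs 10, a false alarm costs 1.
cost_a = expected_cost(y_true, model_a, cost_fp=1, cost_fn=10)
cost_b = expected_cost(y_true, model_b, cost_fp=1, cost_fn=10)

print(f"Model A: accuracy={accuracy(y_true, model_a):.1f}, cost={cost_a:.1f}")
print(f"Model B: accuracy={accuracy(y_true, model_b):.1f}, cost={cost_b:.1f}")
```

Under these assumed costs, the model with the lower accuracy is the cheaper one, which is why the metric should follow from the business value.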


## Error Analysis

### Remember summary vs. details?

<center>  
    <img src="../images/ml_project_data_products/img_p13.gif" width=800>
</center> 


### Going beyond aggregated metrics

* Most model performance metrics we’ve seen are aggregated metrics

* They help determine whether a model has learned well from a dataset or needs improvement

* Next step: examine results and errors to understand why and how the model is failing or succeeding 
      
**Why: validation and iteration**


<div class="alert alert-block alert-info">
<b>Note: </b> 
    Performance metrics can be deceptive: on highly imbalanced datasets, a classifier can reach very high accuracy without any predictive power.
</div> 


You have to validate first; only then can you catch mistakes and make improvements.
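
A small sketch of how accuracy can mislead on imbalanced data. The labels are synthetic (1% positives, chosen for illustration), and the "classifier" simply predicts the majority class for every sample.

```python
# 1,000 synthetic samples with 1% positives (e.g. fraud cases).
y_true = [1] * 10 + [0] * 990

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks excellent even though not a single positive case is found, which only becomes visible when looking beyond the aggregated metric.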


### Types of supervised learning


<center>  
<img src="../images/ml_project_data_products/img_p15.png" width=800>
</center>

[source](https://towardsdatascience.com/regression-or-classification-linear-or-logistic-f093e8757b9c)


### Validate your model - inspect how it is performing

There are a lot of ways to do this. You want to contrast data (target and/or features) and predictions.

**Regression:** looking at residuals, for example doing EDA on residuals and inspecting the outliers

**Classification:** one can start with a confusion matrix, which breaks results down by true class and predicted class


### Confusion Matrix for classification

<div class="group">
    <div class="text">

Counts how often the model predicted correctly and how often it got confused.
        
* False Positive: false alarm / type I error
* False Negative: missed detection / type II error 

    </div>
    <div class="text">
        
<table class="confusion_matrix">
    <thead>
        <tr>
            <th>&nbsp;</th>
            <th>&nbsp;</th>
            <th colspan=2>Predicted</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>&nbsp;</th>
            <td class="cell-class">&nbsp;</td>
            <td class="cell-class">Negatives</td>
            <td class="cell-class">Positives</td>
        </tr>
        <tr>
            <th rowspan=2>Actual</th>
            <td class="cell-class">Negatives</td>
            <td><span style="color:#40B0A6">TN</span></td>
            <td><span style="color:#E1BE6A">FP</span></td>
        </tr>
        <tr>
            <td class="cell-class">Positives</td>  
            <td><span style="color:#E1BE6A">FN</span></td>
            <td><span style="color:#40B0A6">TP</span></td>
        </tr> 
    </tbody>   
</table>
    
   </div>
</div>  

**What do the misclassified examples have in common?**

Notes: Here, “positive” does not mean good; it is simply the class the model is meant to detect.
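
The four cells of the confusion matrix can be counted with a few lines of plain Python. This is a minimal sketch with made-up labels; the helper `confusion_counts` is a hypothetical name, and labels are assumed to be 0 (negative) and 1 (positive).

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN and TP for binary labels (0 = negative, 1 = positive)."""
    names = {(0, 0): "TN", (0, 1): "FP", (1, 0): "FN", (1, 1): "TP"}
    counts = Counter(names[(t, p)] for t, p in zip(y_true, y_pred))
    return {name: counts[name] for name in ("TN", "FP", "FN", "TP")}


# Made-up labels and predictions for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

cm = confusion_counts(y_true, y_pred)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])

# Indices of the misclassified examples: the starting point for error analysis.
misclassified = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

print(cm)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
print("misclassified rows:", misclassified)
```

Pulling out the misclassified rows makes it possible to inspect their features and look for what they have in common.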

### Residual analysis for regression
<div class="group">
  <div class="text"> 
      
* This is like EDA again, but on residuals (predicted - observed)
* Plot residuals and/or standardized residuals vs. predicted values
* We want our residuals to have no patterns, to be symmetrically distributed and to be centered around zero
* **IF NOT...**      
      
    </div>
    <div class="images">
        <img src="../images/ml_project_data_products/img_p18_1.png" width=600>
        
    </div>
</div>



### Residual analysis for regression
<div class="group">
  <div class="text"> 
      
* **IF NOT...** then there is room for improvement in the model.
      
What if my residuals look like this? Check out [this walkthrough](https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/)     
    </div>
    <div class="images">
        <img src="../images/ml_project_data_products/img_p18_2.png" width=600>   
    </div>
</div>
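
Before plotting, a few numbers can already hint at a pattern in the residuals. The sketch below uses made-up observed and predicted values and follows the residual definition used on these slides (predicted - observed); the helper `pearson` and the outlier threshold of 2 are illustrative assumptions.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration.
observed = [10.0, 12.0, 15.0, 20.0, 30.0, 45.0]
predicted = [11.0, 12.5, 14.0, 19.0, 27.0, 38.0]

# Residuals as defined on the slide: predicted - observed.
residuals = [p - o for p, o in zip(predicted, observed)]

mean_residual = statistics.mean(residuals)
std_residual = statistics.stdev(residuals)
standardized = [(r - mean_residual) / std_residual for r in residuals]

# Residuals should be centered around zero and unrelated to the predictions.
corr = pearson(predicted, residuals)
outliers = [i for i, z in enumerate(standardized) if abs(z) > 2]

print(f"mean residual = {mean_residual:.2f}")
print(f"correlation with predictions = {corr:.2f}")
print("outlier rows (|standardized residual| > 2):", outliers)
```

In this toy example the mean is clearly below zero and the residuals shrink as predictions grow, so the model under-predicts large values and there is room for improvement.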





## Resources

https://svpg.com/what-is-a-product/  
https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a  

Building Machine Learning Powered Applications - Emmanuel Ameisen  
https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/  
https://www.scikit-yb.org/en/latest/api/regressor/residuals.html  

Examples of EDA with error analysis  
https://www.kaggle.com/elitcohen/forest-cover-type-eda-modeling-error-analysis#Error-Analysis  
https://www.kaggle.com/pestipeti/error-analysis  
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python



## ML Project Topics

### Kickstarter Project Success
<div class="group">
  <div class="text"> 
      
Analyse and model the success factors of Kickstarter campaigns. Give
new projects an idea of what is needed for successful funding and
potentially even predict campaign success upfront.

* 221811 rows of data on campaigns
* (medium)

      
    </div>
    <div class="images: position: absolute; top: 0; right: 0;">
        <center><img src="../images/ml_project_data_products/img_p22.png" alt="Example Image" style="width: 200px; height: 200px;"></center>
        
    </div>
</div>


Notes: 3 people (medium/hard)

### Tanzania Tourism Prediction

Can you use tourism survey data and ML to predict how much money a tourist will spend when visiting Tanzania?

* Survey Data from 6476 participants
* (easy/medium)

<a href="https://zindi.africa/competitions/tanzania-tourism-prediction/data">Zindi-Tanzania-Tourism</a>


Notes: Suitable for 2 people (easy/medium)<br>
a lot of EDA

### Fraud Detection Challenge in Electricity and Gas Consumption

* Based on clients’ billing history, detect clients involved in fraudulent activities
* (medium/advanced)  

<a href="https://zindi.africa/competitions/fraud-detection-in-electricity-and-gas-consumption-challenge">Fraud Detection Challenge</a>


Notes: 2-3 people (medium)<br>
highly imbalanced <br>
lots of data; you will need to drop some of it

### Urban Air Pollution Challenge

Predict air quality levels and empower communities to plan and protect their health

* Weather data and daily observations from the Sentinel-5P satellite, which tracks various pollutants in the atmosphere
* (medium/advanced -> domain knowledge helpful)

<a href="https://zindi.africa/competitions/zindiweekendz-learning-urban-air-pollution-challenge">Air Pollution Challenge</a>

Notes: 3 people

### Flight Delay Prediction Challenge

Predict flight delays for the Tunisian aviation company Tunisair

* Data on flight delays. Can be combined with airport locations
* (medium)

<a href="https://zindi.africa/competitions/flight-delay-prediction-challenge">Flight Delay Prediction Challenge</a>


Notes: 3 people <br>
combine datasets

### Financial Inclusion in East Africa

Can you predict who in East Africa is most likely to have a bank account?

* Survey data on financial inclusion of ~33,600 participants
* (easy/medium)

<a href="https://zindi.africa/competitions/financial-inclusion-in-africa/data">Financial Inclusion in East Africa</a>


Notes: 2 people <br>
imbalanced data



### Turtle Rescue Forecast Challenge

Anticipate the number of turtles to rescue

* Lots of data cleaning
* (easy/medium)

<a href="https://zindi.africa/competitions/turtle-rescue-forecast-challenge">Turtle Rescue Forecast Challenge</a>
