In [11]:
%matplotlib inline

In [12]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy as sp
import statsmodels as sm

# Task: Correlation & Dependency

In this task you are given three datasets `corrdep1.csv`, `corrdep2.csv`, `corrdep3.csv`.

Each dataset contains samples $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots$ from a joint distribution $(X, Y)$.

For each dataset, answer the following questions:

1. Are $X$ and $Y$ correlated? Explain why/why not.

    * What statistical measure can you use to answer this question?
    * Compute this measure for your samples x and y
    * Render a plot which helps you to see the correlation between x and y (if it exists)

2. Are $X$ and $Y$ stochastically independent? Explain why/why not.

You can support your arguments with plots in the presentation.

Explain in general what correlation and stochastic (in)dependence mean, using the plots of the three datasets.
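
As a starting point for the first question, one common measure is the Pearson correlation coefficient, which pandas exposes as `Series.corr`. The sketch below is only an illustration under assumptions: the helper name `pearson_r` is hypothetical, and the data is synthetic stand-in data, not the contents of the `corrdep*.csv` files. It shows that a pair can be almost uncorrelated while $Y$ is still completely determined by $X$.

```python
import numpy as np
import pandas as pd


def pearson_r(df):
    """Pearson correlation between the columns 'x' and 'y' of df."""
    return df["x"].corr(df["y"])  # method="pearson" is the pandas default


# Synthetic stand-in data (an assumption, NOT taken from corrdep*.csv).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)

# Linear relationship with a little noise: strongly correlated.
linear = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 0.1, size=x.size)})

# y is a deterministic function of x, so X and Y are dependent,
# yet the symmetry around 0 makes the Pearson correlation close to 0.
parabola = pd.DataFrame({"x": x, "y": x ** 2})

r_linear = pearson_r(linear)
r_parabola = pearson_r(parabola)
print(f"linear: r = {r_linear:.3f}, parabola: r = {r_parabola:.3f}")
```

A scatter plot, for example `sns.scatterplot(data=df, x='x', y='y')`, is one way to make this kind of structure visible.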

In [13]:
df = pd.read_csv('data/corrdep1.csv', index_col=0)

In [14]:
df.head()

Unnamed: 0,x,y
0,-8.645908,6.103646
1,-7.194115,6.090273
2,-8.531907,6.100312
3,-6.789834,6.090637
4,-7.40593,6.029871


# Task: A/B Test

A product manager at your company has exciting news:

> "We've developed a new improved experience for our customers.
> We believe a customer with the new experience will perform better. However, the new experience costs 50 EUR per customer, compared to 40 EUR for the old experience."

To measure the performance of a customer's experience, your company uses these indicators:

1. The *revenue* from the customer after the experience
2. The *profit* from the customer after the experience, which is defined as `profit = revenue - cost`

Your job is to use your data skills to give advice on business-related questions such as:

* Is the new experience better? In what sense?
* Should we give all our customers the new experience if we want to increase our profit?
* Do you see any opportunities?

Task:
*For the presentation of this task please prepare how you would answer these questions*



### A/B Test Data

To compare the performance of both experiences, you set up an A/B test.
1000 customers are randomly assigned to one of the groups 'A' or 'B':

* Customers in group `A` get the old experience
* Customers in group `B` get the new experience

Afterwards you measure the revenue of each customer.

You can find the results of the A/B test in the file `abtest.csv`.

* Each row corresponds to a customer.
* There is a column for the group and a column for the revenue of that customer after the experience.
* The `profit` indicator is missing, however.

In [81]:
df = pd.read_csv('./data/abtest.csv', index_col=0)

In [82]:
df.head()

Unnamed: 0,group,revenue,cost
0,A,128.163311,40
1,B,139.377046,50
2,A,208.989308,40
3,A,186.709871,40
4,A,226.005842,40


### Exploratory Data Analysis

* Compute the `profit` performance indicator which is missing from the data
* Generate a plot that allows one to compare the performance of both groups
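
The first bullet can be sketched as follows. To stay self-contained, the example retypes the five rows shown by `df.head()` above instead of reading `abtest.csv`; the variable name `summary` is an illustrative choice, and the per-group aggregation is just one way to summarize the indicators.

```python
import pandas as pd

# The first five rows shown by df.head() above, retyped as a small stand-in
# for the full abtest.csv.
df = pd.DataFrame(
    {
        "group": ["A", "B", "A", "A", "A"],
        "revenue": [128.163311, 139.377046, 208.989308, 186.709871, 226.005842],
        "cost": [40, 50, 40, 40, 40],
    }
)

# profit = revenue - cost, as defined in the task description.
df["profit"] = df["revenue"] - df["cost"]

# Per-group mean and count of both indicators.
summary = df.groupby("group")[["revenue", "profit"]].agg(["mean", "count"])
print(summary)
```

For the comparison plot, something like `sns.boxplot(data=df, x='group', y='profit')` would be one option among several.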

### Hypothesis testing

Answer the question:
> Are the differences in performance between the two groups statistically significant?

* Find a statistical test that you could use to answer this question. You can assume that you don't know any population parameters. You can use [this overview](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Common_test_statistics) from Wikipedia to find a suitable test. Explain why you chose it.

* What is the hypothesis / null-hypothesis of your test?

* Apply your test to the indicator data and compute the *p*-value of your test.

  Use an existing implementation from a library; don't write it yourself. Implementations exist in the [scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html#statistical-functions)
  and [statsmodels](http://statsmodels.sourceforge.net/stable/stats.html) libraries.
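
One candidate that fits the "unknown population parameters" setting is Welch's two-sample t-test, available as `scipy.stats.ttest_ind` with `equal_var=False`. Whether it is the right choice for `abtest.csv` is part of the task. The sketch below only shows the mechanics on synthetic stand-in data (an assumption, not the real measurements), with the null hypothesis that both groups have the same mean.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in revenues (an assumption, NOT the abtest.csv data).
rng = np.random.default_rng(42)
revenue_a = rng.normal(loc=100.0, scale=20.0, size=500)
revenue_b = rng.normal(loc=110.0, scale=20.0, size=500)

# Welch's t-test: two independent samples, variances not assumed equal.
# H0: both groups have the same population mean.
result = stats.ttest_ind(revenue_a, revenue_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3g}")

alpha = 0.05  # conventional significance level, chosen here for illustration
reject_h0 = result.pvalue < alpha
```

With the real data, the same call would be applied once per indicator, for example to the `revenue` values of group `A` versus group `B`.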

# Task: Least squares constant

Let $y_1, y_2, \ldots, y_n$ be real numbers. For a number $x$ we define
the sum of squared distances
$$f(x) := \sum_{i=1}^n (x-y_i)^2$$

* Which $x$ minimizes $f(x)$ in general? Solve analytically.

* Implement $f$ as a Python function for some real data $y$ in file `leastsquares.csv`

* Find `x_min` that minimizes the Python function

* Plot `f(x)` around the optimal `x_min`
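
As a hint for the analytic part, one possible route is to set the derivative of $f$ to zero (a sketch, worth verifying by hand):
$$f'(x) = \sum_{i=1}^n 2(x - y_i) = 2\Big(nx - \sum_{i=1}^n y_i\Big) = 0 \quad\Longrightarrow\quad x = \frac{1}{n}\sum_{i=1}^n y_i$$
Since $f''(x) = 2n > 0$, this stationary point is a minimum, so the minimizer is the arithmetic mean of the $y_i$.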

In [12]:
df = pd.read_csv('data/leastsquares.csv', index_col=0)

In [13]:
df['y'].head()

0    45.631415
1    14.737389
2    92.859182
3    51.230321
4    49.903596
Name: y, dtype: float64
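
A minimal numerical sketch of the implementation and minimization steps, assuming `scipy.optimize.minimize_scalar` as the optimizer (any one-dimensional minimizer would do). To stay self-contained it uses only the five values shown by `df['y'].head()` above rather than the full file, and it compares the result with the sample mean as a cross-check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# The five values shown by df['y'].head() above, used as a small stand-in
# for the full leastsquares.csv.
y = np.array([45.631415, 14.737389, 92.859182, 51.230321, 49.903596])


def f(x, y=y):
    """Sum of squared distances between the constant x and the data y."""
    return np.sum((x - y) ** 2)


# Numerical minimization of the one-dimensional function f.
result = minimize_scalar(f)
x_min = result.x

print(f"numerical x_min = {x_min:.6f}, mean of y = {y.mean():.6f}")
```

To plot `f` around `x_min`, one option is to evaluate it on a grid such as `np.linspace(x_min - 50, x_min + 50, 200)` and pass the values to `plt.plot`.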

# Bonus: Maximum Sum Path

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23 = 3 + 7 + 4 + 9.

```
   3
  7 4 
 2 4 6
8 5 9 3
```


Find an algorithm that can find the maximum total from top to bottom in a general triangle of any size.

Don't implement anything. You only need to describe the algorithm in words or pseudocode. Also, please don't google the solution.