## Beginning of a Data Story

One day, you decided to test your luck by stepping into a venture that was entirely unfamiliar to you. Of course, you didn't venture into it empty-handed, but with a monthly sum of money \$<b>M</b> that you earned from your current business. After allocating a small portion of it to hire a data scientist to optimize your investments, your remaining budget for the venture is \$<b>A</b>. 

### Solution

After discussing with the data scientist, you agree that the evaluations and opinions of your partners and customers are an important channel for the success of the venture. You settle on a solution: survey your partners and customers about their evaluation of the venture, and make decisions based on the survey's result. Using your experience from the prior business, the data scientist designs a metric that expresses the impact of those evaluations on the venture's profit while still being inferable from the survey.

## Mathematical Description

This is an extreme case in which you have no clue about the bounds of the risks you may encounter in the venture. What this story gives you is the reasoning behind the significance level ($\alpha$) in mathematical detail. Precisely, for the above solution, $\alpha$ depends on the solution of this mathematical programming problem:

$\textrm{minimize}_{(N_\alpha, \mu_\alpha)} \quad \tau$

$\textrm{subject to} \quad N_\alpha C_0 - \kappa C_1 \Phi_{N_\alpha-1} (\kappa-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le A \quad (1)$

$\quad N_\alpha C_0 - \tau' C_1 \Phi_{N_\alpha-1} (\tau'-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le -B \quad (2)$

$\quad \kappa \lt 0 \quad (3)$

$\quad \gamma \le 0 \quad (4)$

$\quad \tau' = \tau - \gamma \quad (5)$

Here $A$ and $B$ are non-negative real numbers: $A$ is your budget and $B$ is the minimal net profit you expect from the venture. $N_\alpha$ is the number of samples in the survey. $C_0$ is the per-sample cost of the survey. $C_1$ is the impact factor of the evaluation metric on your revenue: a unit increase in the metric corresponds to an increase of \$<b>C1</b> in your revenue, and vice versa. $\Phi_{N_\alpha-1}$ is the CDF of Student's t-distribution with $N_\alpha-1$ degrees of freedom. Although $\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma} \sim t_{N_\alpha-1, \kappa}$ is the distribution of the evaluation metric over similar surveys given the real impact $\kappa$, we use the shifted distribution $t_{N_\alpha-1}+\kappa$ as an approximation of $t_{N_\alpha-1, \kappa}$ for $N_\alpha \gt 30$, the usual rule-of-thumb sample size.
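The quality of that approximation can be checked numerically. The sketch below assumes that $t_{N_\alpha-1, \kappa}$ denotes the noncentral t-distribution with noncentrality $\kappa$ (available as `scipy.stats.nct`) and compares its CDF with the CDF of the shifted central t-distribution. The function name `max_cdf_gap`, the grid, and the values of `df` and `kappa` are illustrative choices, not taken from the notebook.

```python
import numpy as np
import scipy.stats as sts

def max_cdf_gap(df, kappa, grid=None):
    """Largest absolute gap between the noncentral t CDF and the shifted t CDF."""
    if grid is None:
        grid = np.linspace(-6.0, 6.0, 241)
    exact = sts.nct.cdf(grid, df, kappa)     # assumed reading of t_{df, kappa}
    shifted = sts.t.cdf(grid - kappa, df)    # CDF of t_{df} + kappa
    return float(np.max(np.abs(exact - shifted)))

# kappa = 0.5 is an arbitrary illustrative noncentrality.
for df in (29, 299, 2999):
    print(df, max_cdf_gap(df, kappa=0.5))
```

If this reading of the notation is right, the printed gap should be small already at 29 degrees of freedom and shrink further as the degrees of freedom grow.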

### Interpretation

Equation (1), along with equation (3), keeps the expected risk always less than or equal to \$<b>A</b>, the budget. The other equations, along with the objective function, optimize the range of worthy opportunities, which is discussed later under the "Be Open for Opportunities" strategy.
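To make constraint (1) concrete, the sketch below evaluates its left-hand side over a grid of negative $\kappa$ and reports the worst case for a given threshold. The helper names `risk_lhs` and `worst_case_risk` and the grid are assumptions made for illustration. The default parameter values are the ones used in the notebook's code.

```python
import numpy as np
import scipy.stats as sts

def risk_lhs(kappa, mu, N, sigma, C0, C1):
    """Left-hand side of (1): N*C0 - kappa*C1*Phi(kappa - mu*sqrt(N)/sigma)."""
    shift = mu * np.sqrt(N) / sigma
    return N * C0 - kappa * C1 * sts.t.cdf(kappa - shift, N - 1)

def worst_case_risk(mu, N, sigma=3e-1, C0=0.5, C1=5000.0):
    """Maximum of the left-hand side over a grid of kappa < 0 (constraint (3))."""
    kappas = np.linspace(-6.0, -1e-3, 6000)
    return float(np.max(risk_lhs(kappas, mu, N, sigma, C0, C1)))

# Worst case for a threshold of mu = 0 with N = 3000 samples.
print(worst_case_risk(0.0, 3000))
# A lower threshold accepts more bad outcomes, so the worst case grows.
print(worst_case_risk(-0.05, 3000))
```

A threshold is acceptable under this reading when the worst case stays at or below the budget $A$.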

## A Sub-solution

In order to solve the programming problem, we consider a sub programming problem: 

$\textrm{minimize}_{\mu_\alpha} \quad -\mu_\alpha$

$\textrm{subject to} \quad N_\alpha C_0 - \kappa C_1 \Phi_{N_\alpha-1} (\kappa-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le A$

$\quad \kappa \lt 0$

That is, we consider the programming problem for each fixed value of $N_\alpha$ and, just for the moment, ignore the constraints that control the range of worthy opportunities.
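One way to read the code cell that follows (a sketch of the reasoning, not stated explicitly in the original): write $c = (N_\alpha C_0 - A)/C_1$, which is negative whenever the survey cost is below the budget, and let $\phi_{N_\alpha-1}$ denote the density of the t-distribution (a symbol introduced here for illustration). Dividing the constraint by $-\kappa C_1 \gt 0$ gives, for $\kappa \le c$:

$\Phi_{N_\alpha-1} (\kappa-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le \frac{c}{\kappa} \quad \Longleftrightarrow \quad \frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma} \ge g(\kappa) := \kappa - \Phi^{-1}_{N_\alpha-1} (\frac{c}{\kappa})$

For $c \lt \kappa \lt 0$ the ratio $c/\kappa$ exceeds 1 and the constraint holds automatically. If the constraint is required for every $\kappa \lt 0$, the binding case is the maximizer $\kappa^*$ of $g$, which satisfies

$g'(\kappa) = 1 + \frac{c}{\kappa^2 \, \phi_{N_\alpha-1} (\Phi^{-1}_{N_\alpha-1}(c/\kappa))} = 0$

and the resulting bound is $\mu_{max} = \frac{\sigma}{\sqrt{N_\alpha}} \, g(\kappa^*)$. Under this reading, the code applies Newton's method to $g'$ and then evaluates $g$ at the root.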

In [1]:
import numpy as np
import scipy.stats as sts

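def newton_solve_max(f, df, x0, tol, upperbound):
    # Newton iteration for a root of f, with iterates kept below `upperbound`.
    while True:
        f_crt = f(x0)
        alpha = 1.  # step size (full Newton step)
        rate = f_crt/df(x0)
        if (x0 - rate)>=upperbound:
            # The Newton step would cross the bound: move halfway to it instead.
            x0 = x0+(upperbound-x0)*.5
        else:
            x0 = x0 - rate*alpha
        if abs(f_crt)<tol:
            # f was already within tolerance before this update.
            break
        if abs(rate) < 1e-9 and abs(upperbound-x0) < tol:
            # Stalled against the bound: return a point just inside it.
            x0 = upperbound-tol
            break

    return x0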

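def solve_max(N, sigma, C0, C1, B): # requires N*C0-B < 0; the notebook passes the budget A as B
    # Find the stationary point of g(x) = x - ppf((N*C0-B)/(x*C1)) for x < (N*C0-B)/C1.
    # First lambda: g'(x). Second lambda: backward finite difference of g'(x) with step 1e-7.
    # The start value lies just below the upper bound (N*C0-B)/C1.
    k_max = newton_solve_max(lambda x: (1.+1./sts.t.pdf(sts.t.ppf((N*C0-B)/x/C1, N-1),N-1)*((N*C0-B)/x**2/C1)), 
                         lambda x: 1e7*((1./sts.t.pdf(sts.t.ppf((N*C0-B)/x/C1, N-1),N-1)*((N*C0-B)/x**2/C1))-(1./sts.t.pdf(sts.t.ppf((N*C0-B)/(x-1e-7)/C1, N-1),N-1)*((N*C0-B)/(x-1e-7)**2/C1))),
                         float(1./(1.-1e-7)*((N*C0-B)/C1)), 1e-5, float((N*C0-B)/C1))

    # Convert g(k_max) back to the scale of the evaluation metric.
    mu_max = sigma/np.sqrt(N)*(k_max-sts.t.ppf((N*C0-B)/k_max/C1, N-1))
    return mu_max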
    
sigma = 3e-1
C0 = .5
C1 = 5000.
A = 5000.
B = 1500.

mu_max = solve_max(3000, sigma, C0, C1, A)

mu_max

-0.007646242761611315

# Venture Strategies

## Don't Do Meaningless Things

_However, you will see this strategy is not always true. Surprised? :)_

So, what does it mean to "do meaningless things"? You may guess it means spending effort on nothing. That's right! Statistically, it is when your null hypothesis is **H0**: $\mu = 0$ and you want to minimize the probability of a Type I error (doing something when it actually returns nothing). The solution to the above optimization problem serves as a rule of thumb for choosing a significance level for the hypothesis test based on the survey's result. Thus, the threshold for designing the test will be greater than or equal to `mu_max`. As the threshold gets larger, the probability of a Type I error gets smaller. Consequently, if you're a die-hard fan of "don't do meaningless things" without any clue about what counts as meaningful, just "do nothing"!
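The link between the threshold and the Type I error can be sketched as follows. Assuming the test statistic follows Student's t-distribution with $N-1$ degrees of freedom under **H0**, the Type I error probability for a threshold `mu` is the upper-tail probability at `mu*sqrt(N)/sigma`. The function name `type_one_error` and the sample thresholds are hypothetical. `sigma = 0.3` is the value used in the notebook's code.

```python
import numpy as np
import scipy.stats as sts

def type_one_error(mu, N, sigma):
    """P(T >= mu*sqrt(N)/sigma) for T ~ Student's t with N-1 degrees of freedom."""
    return float(sts.t.sf(mu * np.sqrt(N) / sigma, N - 1))

sigma = 3e-1
# Illustrative thresholds: the tail probability falls as the threshold rises.
for mu in (-0.1, 0.0, 0.05, 0.1):
    print(mu, type_one_error(mu, N=30, sigma=sigma))
```

At `mu = 0` the probability is exactly one half, and it decreases toward zero as the threshold moves up, which is why an ever stricter threshold ends in "do nothing".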

Still want to do something? Let's consider the second strategy ... 

## Be Open for Opportunities

This brings us back to the original programming problem. Now you want to be as open as possible to opportunities from which you can get back at least a worthwhile amount of \$<b>B</b> on your investment. Thus, you must optimize the half-bounded interval $[\tau,\infty)$, which represents the range of worthy opportunities.

$\textrm{minimize}_{\mu_\alpha} \quad \tau$

$\textrm{subject to} \quad N_\alpha C_0 - \kappa C_1 \Phi_{N_\alpha-1} (\kappa-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le A$

$\quad N_\alpha C_0 - \tau' C_1 \Phi_{N_\alpha-1} (\tau'-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le -B$

$\quad \kappa \lt 0$

$\quad \gamma \le 0$

$\quad \tau' = \tau - \gamma$

Again, we solve the programming problem for a fixed value of $N_\alpha$. Let $\mu_{max}$ be the solution of the sub programming problem above; this programming problem can then be rewritten as:

$\textrm{minimize}_{\mu_\alpha} \quad \tau$

$\textrm{subject to} \quad N_\alpha C_0 - \tau' C_1 \Phi_{N_\alpha-1} (\tau'-\frac{\mu_\alpha \sqrt{N_\alpha}}{\sigma}) \le -B$

$\quad -\mu_\alpha \le -\mu_{max}$

$\quad \gamma \le 0$

$\quad \tau' = \tau - \gamma$
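With $\mu_\alpha$ pinned at $\mu_{max}$, the remaining problem is one-dimensional. The sketch below takes $\gamma = 0$ (so $\tau' = \tau$), which is an assumption about how $\gamma$ is treated, and finds the $\tau$ at which the profit constraint holds with equality. It uses `scipy.optimize.brentq` in place of a hand-written Newton iteration. The name `smallest_tau` and the upper end of the bracket are illustrative.

```python
import numpy as np
import scipy.stats as sts
from scipy.optimize import brentq

def smallest_tau(mu, N, sigma, C0, C1, B):
    """Root in tau of N*C0 + B - tau*C1*Phi(tau - mu*sqrt(N)/sigma) = 0."""
    shift = mu * np.sqrt(N) / sigma

    def slack(tau):
        # Positive while the profit constraint is still violated.
        return N * C0 + B - tau * C1 * sts.t.cdf(tau - shift, N - 1)

    # Since Phi < 1, tau*C1*Phi(...) stays below N*C0 + B for tau <= lo.
    lo = (N * C0 + B) / C1
    return brentq(slack, lo, 10.0)

# Parameter values and mu_max copied from the notebook's first code cell.
tau = smallest_tau(mu=-0.007646242761611315, N=3000,
                   sigma=3e-1, C0=0.5, C1=5000.0, B=1500.0)
print(tau)
```

A bracketing solver is used here because the left-hand side is monotone in $\tau$ on the bracket, so a sign change guarantees a single root without needing a derivative.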

In [2]:
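def newton_solve_min(f, df, x0, tol, lower_bound):
    # Newton iteration for a root of f, with iterates kept above `lower_bound`.
    while True:
        f_crt = f(x0)
        alpha = 1.  # step size (full Newton step)
        rate = f_crt/df(x0)
        if (x0 - rate)<=lower_bound:
            # The Newton step would cross the bound: move halfway to it instead.
            x0 = x0-(x0-lower_bound)*.5
        else:
            x0 = x0 - rate*alpha
        if abs(f_crt)<tol:
            # f was already within tolerance before this update.
            break
        if abs(rate) < 1e-9 and abs(x0-lower_bound) < tol:
            # Stalled against the bound: return a point just inside it.
            x0 = lower_bound+tol
            break

    return x0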

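def solve_min(N, sigma, C0, C1, B, mu):
    # Solve (x - ppf((N*C0-B)/(x*C1)))/sqrt(N) = mu/sigma for x > (N*C0-B)/C1.
    # The notebook calls this with -B, so N*C0-B becomes N*C0 plus the target profit.
    # First lambda: the residual. Second lambda: its analytic derivative in x.
    k_min = newton_solve_min(lambda x: 1./np.sqrt(N)*(x-sts.t.ppf((N*C0-B)/x/C1, N-1))-mu/sigma, 
                         lambda x: 1./np.sqrt(N)*(1.+1./sts.t.pdf(sts.t.ppf((N*C0-B)/x/C1, N-1),N-1)*((N*C0-B)/x**2/C1)),
                         float(1./(1.-1e-7)*((N*C0-B)/C1)), 1e-5, float((N*C0-B)/C1))

    return k_min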

k_min = solve_min(3000, sigma, C0, C1, -B, mu_max)

k_min

0.6136709697009499

# Explicit Significance Level

Solve the original programming problem. 

In [3]:
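max_N = int(A/C0)  # largest sample size the budget can pay for (not used below)

# Best solution found so far: the smallest k_min and the N and mu that achieve it.
sol_k = 1e7
sol_N = -1
sol_mu = -1

for N in range(30,300):
    # Skip sample sizes whose survey cost alone uses up the budget.
    if (N*C0-A) >= 0:
        continue

    mu_max = solve_max(N, sigma, C0, C1, A)
    k_min = solve_min(N, sigma, C0, C1, -B, mu_max)
    # Print N, mu_max, k_min and the implied significance level.
    print((N, mu_max, k_min, sts.t.sf(mu_max*np.sqrt(N)/sigma,N-1)))
    if k_min <= sol_k:
        sol_N = N
        sol_mu = mu_max
        sol_k = k_min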

(30, -0.10495794098397018, 0.3082490821408266, 0.9673821794585686)
(31, -0.10323749382221158, 0.3083102643175411, 0.9675315960902154)
(32, -0.10159843285751982, 0.3083741521797271, 0.9676706464940428)
(33, -0.10003452872067907, 0.30844048593069906, 0.9678003061904016)
(34, -0.09854019992414469, 0.30850903793993484, 0.9679214320755736)
(35, -0.0971104287135845, 0.3085796079231798, 0.9680347799081892)
(36, -0.09574068990448768, 0.30865201896268274, 0.968141018792933)
(37, -0.09442689040015749, 0.3087261142017678, 0.9682407432460155)
(38, -0.09316531754649122, 0.3088017540842714, 0.9683344833004562)
(39, -0.09195259483702979, 0.3088788140369909, 0.9684227130127254)
(40, -0.09078564376303579, 0.30895718251446436, 0.9685058576580715)
(41, -0.08966165082579833, 0.3090367593417467, 0.9685842998443479)
(42, -0.08857803890538823, 0.3091174543035679, 0.9686583847292661)
(43, -0.08753244232181887, 0.3091991859382185, 0.9687284244907401)
(44, -0.0865226850386921, 0.30928188050235805, 0.96879470217

(153, -0.04593673738875771, 0.31983012306308994, 0.9699389650486739)
(154, -0.045783326136743074, 0.3199307200587784, 0.9699340539660555)
(155, -0.04563138984572423, 0.32003133657060623, 0.9699290622004231)
(156, -0.04548090486706071, 0.32013197223000656, 0.9699239912411741)
(157, -0.04533184808006322, 0.32023262667793223, 0.969918842539427)
(158, -0.04518419687693072, 0.3203332995645504, 0.9699136175092422)
(159, -0.045037929148209734, 0.32043399054894933, 0.9699083175287969)
(160, -0.04489302326875433, 0.3205346992988559, 0.9699029439415142)
(161, -0.04474945808416665, 0.32063542549036383, 0.9698974980571515)
(162, -0.04460721289769859, 0.32073616880767236, 0.9698919811528468)
(163, -0.04446626745759634, 0.3208369289428344, 0.9698863944741278)
(164, -0.04432660194486999, 0.32093770559551404, 0.9698807392358814)
(165, -0.044188196961471735, 0.32103849847275306, 0.9698750166232909)
(166, -0.0440510335188662, 0.3211393072887462, 0.9698692277927357)
(167, -0.04391509302697813, 0.32124013

(275, -0.033899319009948504, 0.33218565902349956, 0.9689927169342404)
(276, -0.03383487494100193, 0.33228733311632075, 0.9689832449922511)
(277, -0.0337707748021046, 0.3323890110569176, 0.9689737558958705)
(278, -0.03370701550576542, 0.3324906928104106, 0.9689642497870536)
(279, -0.03364359400322132, 0.3325923783424271, 0.9689547268056915)
(280, -0.03358050728381457, 0.3326940676190926, 0.9689451870896485)
(281, -0.03351775237438228, 0.33279576060702154, 0.9689356307747998)
(282, -0.033455326338657776, 0.33289745727330877, 0.9689260579950651)
(283, -0.03339322627668367, 0.33299915758552084, 0.9689164688824446)
(284, -0.03333144932423626, 0.3331008615116877, 0.9689068635670519)
(285, -0.0332699926522611, 0.3332025690202945, 0.9688972421771482)
(286, -0.03320885346631932, 0.33330428008027374, 0.9688876048391739)
(287, -0.033148029006044744, 0.3334059946609971, 0.9688779516777818)
(288, -0.03308751654461119, 0.33350771273226787, 0.9688682828158661)
(289, -0.03302731338821006, 0.3336094342

This is the "rule of thumb" significance level for your business venture:

In [4]:
sts.t.sf(sol_mu*np.sqrt(sol_N)/sigma, sol_N-1)  # upper-tail probability at the threshold sol_mu

0.9673821794585686