## Welcome to our notebook on the beta distribution!
To begin, click the **_Copy and Edit_** button in the top right-hand corner.  

To run a cell, click on the cell and then the play button either to the left of the cell, or in the toolbar up top. The first cell must be run first, but after that cells can be run in any order.

In [None]:
# import required packages
%matplotlib inline
import math
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual, widgets

### Guess the fit, interactive display

When the following cell is run, random data will be generated creating a scatter plot for you to fit a beta curve to. Change the alpha and beta sliders slowly, giving the curve time to update, and once you've found satisfactory alpha and beta values, scroll down and click the "Done" button to reveal the true values used to generate the data.  

The mean ($\mu$) and standard deviation ($\sigma$) will also be displayed above the graph. For the beta distribution,  
    $$\mu = \int\limits_0^1 x \, \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)} \, dx = \frac{\alpha}{\alpha + \beta} \qquad \textrm{and} \qquad \sigma = \sqrt{\frac{\alpha\beta}{(\alpha + \beta)^{2}(\alpha + \beta + 1)}} $$  
where $B(\alpha, \beta)$ is the beta function, a normalizing constant which ensures a total probability of 1. **Re-run the following cell for a new random set of data**.  

Note that alpha and beta can each take any positive value, but this demonstration restricts them to the range 0-10 for the sake of simplicity.
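
As a quick sanity check on the formulas above, the closed-form mean and standard deviation can be compared against a direct numerical integration of the density. This is only a sketch: the function names `beta_moments` and `beta_moments_numeric` and the test values $\alpha = 2$, $\beta = 5$ are illustrative assumptions, not part of the notebook's cells.

```python
import math
import numpy as np

def beta_moments(alpha, beta):
    """Closed-form mean and standard deviation of a beta distribution."""
    mu = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mu, math.sqrt(var)

def beta_moments_numeric(alpha, beta, n=200_000):
    """Approximate the same moments with a midpoint-rule integral on (0, 1)."""
    dx = 1.0 / n
    x = (np.arange(n) + 0.5) * dx           # midpoints avoid the endpoints 0 and 1
    pdf = x ** (alpha - 1) * (1 - x) ** (beta - 1)
    pdf = pdf / (pdf.sum() * dx)            # numerical stand-in for 1 / B(alpha, beta)
    mu = (x * pdf).sum() * dx               # integral of x * p(x)
    var = ((x - mu) ** 2 * pdf).sum() * dx  # integral of (x - mu)^2 * p(x)
    return mu, math.sqrt(var)

# Hypothetical test values; the two results should agree to several decimals
print(beta_moments(2.0, 5.0))
print(beta_moments_numeric(2.0, 5.0))
```

The extra factor of $x$ inside the integral is what distinguishes the mean from the normalization condition, which integrates the density alone.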


In [None]:
# Generates random data with N data points. The data generated is a beta distribution plus some allowed amount of error.
def gen_data(a_in, b_in, N_in, err_in):
    
    a = a_in
    b = b_in
    N = N_in
    err = err_in
    
    dx = 1/(N+1)
    x = np.arange(dx, 1, dx)
    y = x**(a-1) * (1-x)**(b-1)
    A = 0
    for i in y:
        A = A + i * dx
    y = y/A
    for i in range(0, N):
        pos = err * np.random.rand()
        neg = err * np.random.rand()
        y[i] = y[i] + pos - neg
    return [x, y]

# Takes in data points as well as the user's best guess for alpha and beta and plots the proposed fit against the data.
def guess(alpha, beta, x_in, y_in):
    
    a = alpha
    b = beta
    x = x_in
    y = y_in
    
    # create the beta distribution
    dx = x[1] - x[0]
    f = x**(a-1) * (1-x)**(b-1)
    A = 0
    for i in f:
        A = A + i * dx
    f = f/A
    
    # calculate mean and variance as functions of alpha and beta
    mu = a/(a+b)
    var = (a*b)/(((a+b)**2)*(a+b+1))
    sigma = math.sqrt(var)
    
    plt.xlabel("x")
    plt.ylabel("p(x)")
    plt.title('Fit the data by dragging the sliders!')
    plt.xlim(0, 1)
    plt.ylim(0, 4)
    
    # Display mean and standard deviation for this proposed fit
    plt.text(0, 5.5, "mean: " + '{:.4}'.format(mu))
    plt.text(0, 5, "standard deviation: " + '{:.4}'.format(sigma))
    
    plt.scatter(x, y)
    plt.plot(x, f, 'r')

# Run this cell to generate new data
true_a = 10 * np.random.rand()
true_b = 10 * np.random.rand()

true_mu = true_a/(true_a+true_b)
true_var = (true_a*true_b)/(((true_a+true_b)**2)*(true_a+true_b+1))
true_sigma = math.sqrt(true_var)

x_data, y_data = gen_data(true_a, true_b, 100, 0.8)
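# sliders start at 0.2 rather than 0: alpha and beta must be positive, and alpha = beta = 0 would divide by zero in guess()
interact(guess, alpha=(0.2, 10, 0.2), beta=(0.2, 10, 0.2), x_in=fixed(x_data), y_in=fixed(y_data))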

#print("true alpha: ", true_a)
#print("true beta: ", true_b)
button = widgets.Button(description='Done')
out = widgets.Output()
def on_button_clicked(_):
    with out:
        print("true alpha: ", '{:.4}'.format(true_a))
        print("true beta: ", '{:.4}'.format(true_b))
        print("true mean: ", '{:.4}'.format(true_mu))
        print("true standard deviation: ", '{:.4}'.format(true_sigma))
button.on_click(on_button_clicked)
box = widgets.VBox([button, out])
box

### Batting Average (BA) data

The following cell takes in a set of data and fits a beta curve to it, listing the corresponding alpha and beta values. The data set being graphed is a histogram of all Blue Jays players' batting averages from the years 2015 through 2019. We can read from this graph that the peak occurs at a BA of around 0.230.
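
The fitting idea can be sketched on synthetic data: try every (alpha, beta) pair on a grid, normalize the candidate curve, and keep the pair with the largest weighted log-likelihood. The function name `fit_beta_grid`, the grid ranges, and the Beta(3, 8) test shape below are assumptions made for illustration; they are not the Blue Jays data or the exact grid used in this notebook.

```python
import numpy as np

def fit_beta_grid(x, weights, a_grid, b_grid):
    """Return the (alpha, beta) on the grid that maximises sum(weights * log f(x))."""
    dx = x[1] - x[0]
    best_log_like, best_a, best_b = -np.inf, None, None
    for a in a_grid:
        for b in b_grid:
            f = x ** (a - 1) * (1 - x) ** (b - 1)
            f = f / (f.sum() * dx)                  # normalise the candidate curve
            log_like = np.sum(weights * np.log(f))  # bin heights weight each log density
            if log_like > best_log_like:
                best_log_like, best_a, best_b = log_like, a, b
    return best_a, best_b

# Synthetic stand-in for a histogram: bin centres and exact Beta(3, 8) heights
x = np.linspace(0.005, 0.995, 100)
weights = x ** 2 * (1 - x) ** 7
weights = weights / (weights.sum() * (x[1] - x[0]))

a_fit, b_fit = fit_beta_grid(x, weights,
                             np.arange(1.0, 6.0, 0.5),
                             np.arange(5.0, 11.0, 0.5))
print(a_fit, b_fit)
```

Because the synthetic heights follow a beta curve exactly and the true parameters lie on the grid, the search should land on them; with real histogram data the result is only as precise as the grid step.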

In [None]:
# get data from input file
x, y_data = np.loadtxt('../input/bluejays/beta_distribution_project_batting_averages.csv', delimiter=',', usecols=(6,7), skiprows=1, unpack=True)
n = x.size
dx = x[1]-x[0]

#normalize data
A_data = 0
for i in y_data:
    A_data = A_data + i * dx
y_data = y_data / A_data

# plot data
plt.figure()
plt.xlim(0, 0.5)
plt.ylim(0, 12)
plt.bar(x, y_data, 0.01)

# initialize variables
a_curr = -1
b_curr = -1
f_curr = x
L_curr = -999_999_999
step = 0.2

# for every a and b
for a in np.arange(step, 15, step):
    for b in np.arange(step, 35, step):

        # define the beta distribution
        f = x**(a-1) * (1-x)**(b-1)
        A = 0
        for i in f:
            A = A + i * dx
        f = f / A

        # calculate the log-likelihood, weighting each bin by its normalized height
        L = 0
        for k in range(0, n):
            L = L + y_data[k] * np.log(f[k])

        # if this is the new highest likelihood, keep track of that
        if L > L_curr:
            L_curr = L
            f_curr = f
            a_curr = a
            b_curr = b
            
plt.xlabel('batting average, x')
plt.ylabel('p(x)')
plt.title('Blue Jays batting averages')
plt.plot(x, f_curr, 'r')

mu = a_curr/(a_curr+b_curr)
var = (a_curr*b_curr)/(((a_curr+b_curr)**2)*(a_curr+b_curr+1))
sigma = math.sqrt(var)

print('alpha:', '{:.4}'.format(a_curr))
print('beta:', '{:.4}'.format(b_curr))
print('mean:', '{:.4}'.format(mu))
print('standard deviation:', '{:.4}'.format(sigma))

### Earned Run Average (ERA) data

From Wikipedia: In baseball statistics, earned run average (ERA) is the average of earned runs given up by a pitcher per nine innings pitched (i.e. the traditional length of a game). It is determined by dividing the number of earned runs allowed by the number of innings pitched and multiplying by nine.
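
The definition above translates directly into a one-line calculation. The helper name `earned_run_average` and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def earned_run_average(earned_runs, innings_pitched):
    """ERA = 9 * earned runs allowed / innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9 * earned_runs / innings_pitched

# Hypothetical pitcher: 60 earned runs over 180 innings
print(earned_run_average(60, 180.0))  # 3.0
```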

The following cell takes the ERA data from 328 pitchers from 2015 to 2019 and fits it with a beta distribution.
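
Because the beta distribution lives on the interval from 0 to 1, the ERA values are divided by 7 before fitting, and the fitted mean and standard deviation are multiplied by 7 afterwards. A minimal sketch of that conversion follows; the function name `rescaled_beta_moments` and the parameter values are assumptions for illustration, not fitted results.

```python
import math

def rescaled_beta_moments(alpha, beta, scale):
    """Mean and standard deviation of scale * X, where X follows Beta(alpha, beta)."""
    mu = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    # Both the mean and the standard deviation scale linearly with the factor
    return scale * mu, scale * math.sqrt(var)

# Hypothetical parameters: a symmetric Beta(4, 4) stretched onto the 0-7 ERA range
mean_era, sd_era = rescaled_beta_moments(4.0, 4.0, 7.0)
print(mean_era, sd_era)
```

Note that the variance would scale by 49 rather than 7, which is why the standard deviation is computed first and then multiplied.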

In [None]:
# get data from input file
x_data, y_data = np.loadtxt('../input/bluejays/beta_distribution_project_ERA.csv', delimiter=',', usecols=(3,4), skiprows=1, max_rows=139, unpack=True)
x = x_data/7 # rescale to the interval [0, 1] before fitting (the maximum value here is 7, so divide by 7)
n = x.size
dx = x[1]-x[0]

#normalize data
A_data = 0
for i in y_data:
    A_data = A_data + i * dx
y_data = y_data / A_data

# plot data
plt.figure()
plt.xlim(0, 7)
plt.ylim(0, 1)
plt.bar(x_data, y_data/7, 0.05)

# initialize variables
a_curr = -1
b_curr = -1
f_curr = x
L_curr = -999_999_999
step = 0.2

# for every a and b
for a in np.arange(step, 15, step):
    for b in np.arange(step, 15, step):

        # define the beta distribution
        f = x**(a-1) * (1-x)**(b-1)
        A = 0
        for i in f:
            A = A + i * dx
        f = f / A

        # calculate the log-likelihood, weighting each bin by its normalized height
        L = 0
        for k in range(0, n):
            L = L + y_data[k] * np.log(f[k])

        # if this is the new highest likelihood, keep track of that
        if L > L_curr:
            L_curr = L
            f_curr = f
            a_curr = a
            b_curr = b
            
plt.xlabel('ERA, x')
plt.ylabel('p(x)')
plt.title('ERA data')
plt.plot(x_data, f_curr/7, 'r')

mu = (a_curr/(a_curr+b_curr)) * 7
var = (a_curr*b_curr)/(((a_curr+b_curr)**2)*(a_curr+b_curr+1))
sigma = math.sqrt(var) * 7

print('alpha:', '{:.4}'.format(a_curr))
print('beta:', '{:.4}'.format(b_curr))
print('mean:', '{:.4}'.format(mu))
print('standard deviation:', '{:.4}'.format(sigma))