# Calculating the $\beta$ of a security using linear regression

In this notebook, we calculate the $\beta$ of a security. Price data is downloaded from Yahoo Finance and log returns are computed at the chosen interval. The stock, the benchmark index, and the date range are entered by the user; the example run below uses NIO against the S&P 500 (^GSPC) with daily data for 2019.

The Beta of an asset is a measure of the sensitivity of its returns relative to a market benchmark (usually a market index). 

The formula is: $\beta = \frac{\mathrm{cov}(r_s, r_b)}{\mathrm{var}(r_b)}$

$r_s$: return of the stock

$r_b$: return of the benchmark
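
As a minimal sketch of the formula above, the snippet below computes $\beta$ from two return series with NumPy. The return values are made-up illustrative numbers, not market data, and the helper name `beta_from_returns` is hypothetical.

```python
import numpy as np


def beta_from_returns(r_s, r_b):
    """Return cov(r_s, r_b) / var(r_b) using sample (ddof=1) estimates."""
    r_s = np.asarray(r_s, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    cov_sb = np.cov(r_s, r_b, ddof=1)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
    var_b = np.var(r_b, ddof=1)
    return cov_sb / var_b


# Made-up benchmark returns; the stock is built to move twice as much.
r_b = np.array([0.010, -0.020, 0.015, 0.005, -0.010])
r_s = 2.0 * r_b + 0.001

print(beta_from_returns(r_s, r_b))
```

Because the synthetic stock return is an exact linear function of the benchmark return with slope 2, the printed value should be 2 up to floating-point rounding.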

In [1]:
import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import time

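For US stocks, enter the ticker directly, e.g. GOOG   
For Hong Kong stocks, enter the code plus the market suffix, e.g. Tencent: 0700.hk   
For mainland Chinese stocks, distinguish between the Shanghai and Shenzhen exchanges by appending .ss or .sz to the stock code  
Enter the ticker you want to look up:  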

In [2]:
price_data = pd.DataFrame()
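stock_code = input("Enter the stock ticker: ")
index_code = input("Enter the index ticker: ")
start_date = input("Enter the start date: ")
end_date = input("Enter the end date: ")
interval = input("Enter the interval (d, w or m): ")
asset_list = [stock_code, index_code]

# download the adjusted close prices for the stock and the benchmark
for asset in asset_list:
    price_data[asset] = wb.get_data_yahoo(asset, start=start_date, end=end_date, interval=interval)['Adj Close']
# log returns: ln(P_t / P_{t-1})
return_data = np.log(1.0 + price_data.pct_change())
return_data = return_data.dropna(axis=0)
return_data.head()

Enter the stock ticker: NIO
Enter the index ticker: ^GSPC
Enter the start date: 2019-01-01
Enter the end date: 2019-12-31
Enter the interval (d, w or m): d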


Date,NIO,^GSPC
2019-01-02,-0.02705,0.001268
2019-01-03,-0.024491,-0.025068
2019-01-04,0.04997,0.033759
2019-01-07,0.021774,0.006986
2019-01-08,-0.015504,0.009649
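
The expression `np.log(1.0 + price_data.pct_change())` used above yields log returns, which are the same as the first difference of log prices. A small sketch on made-up prices illustrates the equivalence; the price values and the column name `P` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up price series standing in for one column of price_data.
prices = pd.DataFrame({"P": [100.0, 102.0, 99.96, 104.958]})

# Same transformation as in the notebook cell.
log_ret = np.log(1.0 + prices.pct_change()).dropna(axis=0)

# Equivalent form: difference of log prices.
log_diff = np.log(prices).diff().dropna(axis=0)

print(log_ret)
print(np.allclose(log_ret.values, log_diff.values))
```

The first row is dropped in both cases because there is no earlier price to compare against.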


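Calculating the beta using logistic regression. Note that logistic regression is a classifier that expects binary labels, whereas $\beta$ is the slope of a linear regression of the stock's returns on the benchmark's returns, so the cells below do not produce a beta as written.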

In [3]:
train_x = return_data[index_code]
train_y = return_data[stock_code]
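
Since $\beta$ is the slope of an ordinary least-squares line fitted to the stock's returns against the benchmark's returns, a linear fit is enough to estimate it. The sketch below uses `np.polyfit` on synthetic returns and checks that the slope agrees with the covariance formula. The random data, the seed, and the true slope of 1.5 are arbitrary assumptions standing in for `train_x` and `train_y`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns (assumed values, not market data).
bench = rng.normal(0.0, 0.01, size=250)    # plays the role of train_x
noise = rng.normal(0.0, 0.005, size=250)
stock = 0.0002 + 1.5 * bench + noise       # plays the role of train_y

# Degree-1 least-squares fit: stock = slope * bench + intercept.
slope, intercept = np.polyfit(bench, stock, deg=1)

# Cross-check against beta = cov(r_s, r_b) / var(r_b).
beta_cov = np.cov(stock, bench)[0, 1] / np.var(bench, ddof=1)

print(f"beta (regression slope): {slope:.4f}")
print(f"beta (cov / var):        {beta_cov:.4f}")
```

With the notebook's data, `bench` and `stock` would presumably be replaced by `train_x.to_numpy()` and `train_y.to_numpy()`.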

In [6]:
import time

import numpy as np
import matplotlib.pyplot as plt


# calculate the sigmoid function
def sigmoid(inX):
    return 1.0 / (1.0 + np.exp(-inX))


# train a logistic regression model using one of several optimization algorithms
# input: train_x is a np.matrix, each row stands for one sample
#        train_y is a np.matrix too, each row is the corresponding label (0 or 1)
#        opts holds the step size, the maximum number of iterations and the optimizer type
def trainLogRegres(train_x, train_y, opts):
    # measure training time
    startTime = time.time()
    numSamples, numFeatures = np.shape(train_x)
    alpha = opts['alpha']
    maxIter = opts['maxIter']
    weights = np.ones((numFeatures, 1))
    # optimize through the selected gradient descent algorithm
    for k in range(maxIter):
        if opts['optimizeType'] == 'gradDescent':  # batch gradient descent
            output = sigmoid(train_x * weights)
            error = train_y - output
            weights = weights + alpha * train_x.transpose() * error
        elif opts['optimizeType'] == 'stocGradDescent':  # stochastic gradient descent
            for i in range(numSamples):
                output = sigmoid(train_x[i, :] * weights)
                error = train_y[i, 0] - output
                weights = weights + alpha * train_x[i, :].transpose() * error
        elif opts['optimizeType'] == 'smoothStocGradDescent':  # smooth stochastic gradient descent
            # visit the samples in random order to reduce cyclic fluctuations
            dataIndex = list(range(numSamples))  # a list, so that entries can be deleted
            for i in range(numSamples):
                alpha = 4.0 / (1.0 + k + i) + 0.01  # decaying step size
                randIndex = int(np.random.uniform(0, len(dataIndex)))
                sample = dataIndex[randIndex]  # index of a sample not yet used in this pass
                output = sigmoid(train_x[sample, :] * weights)
                error = train_y[sample, 0] - output
                weights = weights + alpha * train_x[sample, :].transpose() * error
                del dataIndex[randIndex]  # within one iteration, remove the sample just used
        else:
            raise ValueError('Unsupported optimization method type!')
    print('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
    return weights


# test the trained logistic regression model on a given test set
def testLogRegres(weights, test_x, test_y):
    numSamples, numFeatures = np.shape(test_x)
    matchCount = 0
    for i in range(numSamples):
        predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
        if predict == bool(test_y[i, 0]):
            matchCount += 1
    accuracy = float(matchCount) / numSamples
    return accuracy


# show the trained logistic regression model; only available for 2-D data
# (train_x must have three columns: intercept, X1, X2)
def showLogRegres(weights, train_x, train_y):
    # note: train_x and train_y are np.matrix objects
    numSamples, numFeatures = np.shape(train_x)
    if numFeatures != 3:
        print("Sorry! I cannot draw because the dimension of your data is not 2!")
        return 1
    # draw all samples
    for i in range(numSamples):
        if int(train_y[i, 0]) == 0:
            plt.plot(train_x[i, 1], train_x[i, 2], 'or')
        elif int(train_y[i, 0]) == 1:
            plt.plot(train_x[i, 1], train_x[i, 2], 'ob')

    # draw the decision boundary
    min_x = train_x[:, 1].min()
    max_x = train_x[:, 1].max()
    w = np.asarray(weights).ravel()  # convert mat to a flat array
    y_min_x = (-w[0] - w[1] * min_x) / w[2]
    y_max_x = (-w[0] - w[1] * max_x) / w[2]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()


In [7]:
print ("step 1: load data...")
test_x = train_x; test_y = train_y

step 1: load data...


In [9]:
print ("step 2: training...")
opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
optimalWeights = trainLogRegres(train_x, train_y, opts)


step 2: training...


ValueError: not enough values to unpack (expected 2, got 1)
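
The `ValueError` comes from the shape unpacking at the top of `trainLogRegres`: `train_x` is a one-dimensional pandas Series, so its shape has a single entry and cannot be unpacked into two names. The sketch below reproduces the problem and shows one possible reshaping, using a plain NumPy array with made-up values in place of the Series. Adding a leading column of ones for the intercept is an assumption, not something the notebook specifies.

```python
import numpy as np

# Made-up benchmark returns standing in for train_x (1-D, like a Series).
x = np.array([0.001, -0.025, 0.034, 0.007, 0.010])

try:
    n_samples, n_features = np.shape(x)  # shape is (5,), so only one value is available
except ValueError as err:
    print("1-D input:", err)

# One way to obtain a 2-D matrix: stack an intercept column and the feature column.
X = np.asmatrix(np.column_stack([np.ones(len(x)), x]))
n_samples, n_features = np.shape(X)
print(n_samples, n_features)
```

Even with a two-dimensional input, the labels passed to `trainLogRegres` would need to be binary (0 or 1), which continuous returns are not.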

In [11]:
print ("step 3: testing...")
accuracy = testLogRegres(optimalWeights, test_x, test_y)

step 3: testing...


NameError: name 'optimalWeights' is not defined