# Stock Signalling - BSE Sensex Index

The idea is to train a model on the following indicators, so that it provides an indicator of whether to be bullish or bearish in the market:

- Close Price
- RSI
- Stochastic RSI

The model is planned to be trained on 30 years of Sensex data ([Source of Data](https://www.bseindia.com/indices/IndexArchiveData.html)), from 01-Jan-1990 to date.

The indicators are calculated using the library [TA-LIB](https://mrjbq7.github.io/ta-lib/func_groups/momentum_indicators.html).

However, I had a tough time installing this library, but found a custom implementation of RSI and Stoch RSI that gives exactly the same results as TA-LIB.

[Custom RSI Implementation](https://gist.github.com/ultragtx/6831eb04dfe9e6ff50d0f334bdcb847d)

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Custom Implementation of MACD

In [None]:
def macd(prices):
    # Exponentially weighted moving averages over 12 and 26 days
    day12 = prices.ewm(span=12).mean()
    day26 = prices.ewm(span=26).mean()
    # MACD line: 12-day EWMA minus 26-day EWMA.
    # .values drops the date index, so the frame has a 0..n-1 index and a single column 0.
    macd_df = pd.DataFrame((day12 - day26).values)
    # Signal line: 9-day EWMA of the MACD line
    signal_df = macd_df.ewm(span=9).mean()
    return macd_df, signal_df

# Custom Implementation of RSI and Stochastic RSI


https://gist.github.com/ultragtx/6831eb04dfe9e6ff50d0f334bdcb847d



In [None]:
# https://gist.github.com/ultragtx/6831eb04dfe9e6ff50d0f334bdcb847d
def RSI2(series, period=14):
    delta = series.diff().dropna()
    ups = delta * 0
    downs = ups.copy()
    ups[delta > 0] = delta[delta > 0]
    downs[delta < 0] = -delta[delta < 0]
    ups[ups.index[period-1]] = np.mean( ups[:period] ) # seed value: average gain over the first `period` changes
    ups = ups.drop(ups.index[:(period-1)])
    downs[downs.index[period-1]] = np.mean( downs[:period] ) # seed value: average loss over the first `period` changes
    downs = downs.drop(downs.index[:(period-1)])
    rs = ups.ewm(com=period-1,min_periods=0,adjust=False,ignore_na=False).mean() / \
         downs.ewm(com=period-1,min_periods=0,adjust=False,ignore_na=False).mean() 
    return 100 - 100 / (1 + rs)

def StochRSI2(series, period=14, smoothK=3, smoothD=3):
    # Calculate RSI (reuses RSI2 above instead of repeating its body)
    rsi = RSI2(series, period)

    # Calculate StochRSI 
    stochrsi  = (rsi - rsi.rolling(period).min()) / (rsi.rolling(period).max() - rsi.rolling(period).min())
    stochrsi_K = stochrsi.rolling(smoothK).mean()
    stochrsi_D = stochrsi_K.rolling(smoothD).mean()

    return stochrsi, stochrsi_K, stochrsi_D

# Read the data

In [None]:
date_cols = ['Date']
sensex = pd.read_csv('../input/bse-sensex-index-30-yrs/BSE Sensex Daily Close Jan1990 Oct2020.csv', parse_dates=date_cols)

In [None]:
sensex.tail()

Use only the most recent n years for the calculation.

n = 5 years

In [None]:
import datetime

n_yrs = 5
date = datetime.datetime.now() - datetime.timedelta(days=n_yrs*365)
# .copy() so that columns added later do not trigger SettingWithCopyWarning
sensex = sensex[sensex['Date'] > date].copy()

close = sensex.Close

Calculate MACD, RSI and Stoch RSI

In [None]:
macd_df, signal_df = macd(close)
# type(macd_df)

In [None]:
sensex

In [None]:
rsi = RSI2(close, period=14)
rsi9 = RSI2(close, period=9)
stochrrsi = StochRSI2(close)

Add RSI, Stochastic RSI, MACD and MACD Signal to daily dataset

In [None]:
sensex['rsi'] = rsi
sensex['rsi9'] = rsi9
sensex['stochrsi'] = stochrrsi[1]
sensex['rsi_diff'] = sensex['rsi9'] - sensex['rsi']
sensex['macd'] = macd_df[0].values
sensex['signal'] = signal_df[0].values
sensex['macd_diff'] = sensex['macd'] - sensex['signal']
sensex.tail(50)

# How MACD signals the stock

If the MACD line crosses the signal line upward, it is a buy signal. If it crosses downward, it is a sell signal. Otherwise, hold.

```
    if macd[i] > signal[i] and macd[i - 1] <= signal[i - 1]:
        listLongShort.append("BUY")
    #  The other way around
    elif macd[i] < signal[i] and macd[i - 1] >= signal[i - 1]:
        listLongShort.append("SELL")
    #  Do nothing if not crossed
    else:
        listLongShort.append("HOLD")
```
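
The crossover rule above can be sketched as a small runnable function. This is only a minimal sketch: the name `macd_signals` is hypothetical, the inputs are assumed to be two equal-length sequences, and the first day is labelled `HOLD` because it has no previous day to compare with.

```python
def macd_signals(macd, signal):
    """Label each day BUY / SELL / HOLD from MACD vs. signal crossovers."""
    if len(macd) == 0:
        return []
    labels = ["HOLD"]  # day 0 has no previous day to compare with
    for i in range(1, len(macd)):
        if macd[i] > signal[i] and macd[i - 1] <= signal[i - 1]:
            labels.append("BUY")    # MACD crossed above the signal line
        elif macd[i] < signal[i] and macd[i - 1] >= signal[i - 1]:
            labels.append("SELL")   # MACD crossed below the signal line
        else:
            labels.append("HOLD")   # no crossover
    return labels

# Toy values, invented for illustration only
macd_line = [-1.0, 0.5, 1.0, -0.5]
signal_line = [0.0, 0.0, 0.0, 0.0]
print(macd_signals(macd_line, signal_line))  # ['HOLD', 'BUY', 'HOLD', 'SELL']
```

With the notebook's data, something like `macd_signals(sensex['macd'].values, sensex['signal'].values)` should work, since both columns have the same length.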

# How RSI signals the stock

An RSI at or above 70 suggests the market is overbought and may turn bearish soon.
An RSI at or below 30 suggests the market is oversold and may turn bullish soon.

```
    if rsi[i] >= 70:
        listLongShort.append("SELL")
    elif rsi[i] <= 30:
        listLongShort.append("BUY")
    else:
        listLongShort.append("HOLD")
```
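
The threshold rule can also be written without an explicit loop. The sketch below is an illustration, not part of the original notebook: `rsi_signals` is a hypothetical helper, and treating missing RSI values (the warm-up period) as `HOLD` is an assumption.

```python
import numpy as np
import pandas as pd

def rsi_signals(rsi, overbought=70, oversold=30):
    """Vectorised threshold rule; NaN RSI values fall through to HOLD."""
    rsi = pd.Series(rsi, dtype=float)
    labels = np.select(
        [rsi >= overbought, rsi <= oversold],  # conditions, checked in order
        ["SELL", "BUY"],                        # label for each condition
        default="HOLD",                         # everything else, including NaN
    )
    return pd.Series(labels, index=rsi.index)

# Toy values, invented for illustration only
print(rsi_signals([75.0, 50.0, 28.0, 70.0]).tolist())  # ['SELL', 'HOLD', 'BUY', 'SELL']
```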

# How Stochastic RSI signals

If Stoch RSI reaches 100, the market is likely to turn bearish soon.
If Stoch RSI reaches 0, the market is likely to turn bullish soon.
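
A minimal sketch of this rule follows. `StochRSI2` returns values on a 0 to 1 scale, so readings are multiplied by 100 to match the 0 to 100 scale used here. The helper name `stoch_rsi_signal` and the 80/20 cut-offs are assumptions chosen for illustration, not tuned values from this notebook.

```python
def stoch_rsi_signal(value, upper=80.0, lower=20.0):
    """Map one Stoch RSI reading (0-100 scale) to a signal.

    `upper` and `lower` are hypothetical cut-offs near the extremes.
    """
    if value >= upper:
        return "SELL"   # close to 100: overbought, may turn bearish
    if value <= lower:
        return "BUY"    # close to 0: oversold, may turn bullish
    return "HOLD"

# Toy readings on the 0-1 scale, invented for illustration only
readings = [0.95, 0.50, 0.05]
print([stoch_rsi_signal(r * 100) for r in readings])  # ['SELL', 'HOLD', 'BUY']
```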



Add the look-ahead close price and the % price change to the dataframe.
We use a look-ahead of 3 trading days.

In [None]:
look_ahead = 3
sensex['Close_ahead'] = sensex['Close'].shift(-look_ahead)
sensex['Close_pct'] = (sensex['Close_ahead'] - sensex['Close'])/sensex['Close'] * 100
sensex = sensex.dropna()
sensex
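
To make the look-ahead arithmetic concrete, here is the same calculation on a tiny toy frame (the prices are invented for illustration). `shift(-3)` pulls the close from 3 rows later onto the current row, so the last 3 rows have no look-ahead value and become NaN, which is why `dropna()` is applied above.

```python
import pandas as pd

# Toy closing prices, invented for illustration only
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 105.0, 110.0]})

look_ahead = 3
toy["Close_ahead"] = toy["Close"].shift(-look_ahead)  # close 3 rows later
toy["Close_pct"] = (toy["Close_ahead"] - toy["Close"]) / toy["Close"] * 100

print(toy)
# Row 0: (105 - 100) / 100 * 100 = 5.0 %
# Row 1: (110 - 102) / 102 * 100 = 7.84 % (approx.)
# Rows 2-4: NaN, because there is no row 3 steps ahead
```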

# Data Pre Processing

Let us normalize all the features.

In [None]:
from sklearn.preprocessing import MinMaxScaler

## Let's decide which features we want to consider

We know RSI, Stoch RSI, MACD and Signal are important in deciding the stock signals, so we use these as our features.

In [None]:
x = sensex[['rsi','stochrsi','macd','signal']]
y = sensex['Close_pct']

In [None]:
scaler = MinMaxScaler()
x = scaler.fit_transform(x)
y.values

# Exploratory Data Analysis

For analysis purposes, we will round the RSI to an integer and see how the look-ahead Close % varies across the RSI range.

In [None]:
df1 = sensex[['rsi','stochrsi','macd','signal','macd_diff','Close_pct']]
df1[500:550]

In [None]:
def get_int(f):
    return round(f)

def get_tens(f):
    tenth = round(f/10) * 10
    return tenth
def get_rsi_flag(f):
    if f < 35:
        flag = 'low'
    elif f > 65:
        flag = 'high'
    else:
        flag = 'normal'
    return flag


# df1['rsi_i'] = df1['rsi'].apply(get_int)
# df1['rsi_t'] = df1['rsi'].apply(get_tens)
# df1['srsi_f'] = df1['stochrsi']*100
# df1['srsi_i'] = df1['srsi_f'].apply(get_int)
# df1['rsi_flag'] = df1['rsi'].apply(get_rsi_flag)

# df2 = df1[['rsi_t','Close_pct']]
# df2 =df2.groupby(['rsi_t']).median()
# df2 = df2.reset_index()
# # df1[df1['rsi_i'] == 30]
# # df2[15:55]
# df1.tail()
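
The commented-out lines above round the RSI to the nearest ten and take the median look-ahead % change per bucket. A runnable sketch of that idea on toy data (values invented for illustration, not taken from the Sensex dataset) could look like this:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "rsi":       [28.0, 31.0, 52.0, 48.0, 71.0, 69.0],
    "Close_pct": [1.2,  0.8, -0.1,  0.3, -0.9, -0.5],
})

# Round RSI to the nearest ten (28 -> 30, 52 -> 50, 71 -> 70, ...)
toy["rsi_t"] = (toy["rsi"] / 10).round() * 10

# Median look-ahead % change within each RSI bucket
median_by_bucket = toy.groupby("rsi_t")["Close_pct"].median()
print(median_by_bucket)
```

The median is used rather than the mean so that a few extreme days do not dominate a bucket.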

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 6)
plt.xlim(-100,100)
plt.ylim(-10,10)
sns.scatterplot(data=df1, x='macd_diff',y='Close_pct')

The data points are spread everywhere. We cannot see any clear correlation between the MACD difference and the % change in close (n days ahead).

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 6)
plt.xlim(20,80)
plt.ylim(-5,5)
sns.scatterplot(data=df1, x='rsi',y='Close_pct')

Again, with respect to RSI, the data points are spread everywhere. No easy correlation is found.

In [None]:
import matplotlib.pyplot as plt
fig1, ax1 = plt.subplots()
date_time_obj1 = datetime.datetime.strptime('2020-01-01', '%Y-%m-%d')
date_time_obj2 = datetime.datetime.strptime('2020-10-19', '%Y-%m-%d')
plt.xlim(date_time_obj1, date_time_obj2)
fig1.set_size_inches(15, 6)
sns.scatterplot(data=sensex,x='Date',y='rsi')
sns.scatterplot(data=sensex,x='Date',y='Close_pct')

# Build the Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
import tensorflow as tf

Let us split our data into training and testing sets

In [None]:
# x_train, x_test, y_train, y_test = train_test_split(df1[['rsi','stochrsi','macd','signal','macd_diff']],df1[['Close_pct']],test_size=0.2)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

## Linear Regression (Baseline)
Build a baseline model for future comparison. Let's build a linear regression.

In [None]:
model1 = LinearRegression()
model1.fit(x_train,y_train)
model1.score(x_test,y_test)

The baseline model gives an R2 score of `0.045`, which is pretty bad: it explains under 5% of the variance in the look-ahead % change.
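
For reference, the R2 score returned by `model.score` is `1 - SS_res / SS_tot`. The sketch below computes it by hand on toy numbers (invented for illustration) to show the two reference points: always predicting the mean gives 0, and perfect predictions give 1. The helper name `r2_score_manual` is hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Toy targets, invented for illustration only
y_true = [1.0, -2.0, 0.5, 3.0]
always_mean = [sum(y_true) / len(y_true)] * len(y_true)

print(r2_score_manual(y_true, always_mean))  # 0.0: no better than predicting the mean
print(r2_score_manual(y_true, y_true))       # 1.0: perfect predictions
```

A score near 0.045 therefore means the model is only marginally better than always predicting the average change.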

## Support Vector Machine Regression
Let's try to build SVR

In [None]:
model2 = SVR(gamma=2.3)
model2.fit(x_train,y_train)
model2.score(x_test,y_test)

SVR with an `RBF kernel` and `gamma=5.3` gave an R2 score of `0.0521` (the cell above currently uses `gamma=2.3`).

In [None]:
# model2 = SVR(kernel='poly', degree=4, gamma=10 )
# model2.fit(x_train,y_train)
# model2.score(x_test,y_test)

SVR with a `POLY kernel` and `degree=4, gamma=10` gave an R2 score of `-0.0106`.

## XGBoost
Now let us try `XGBOOST`.

In [None]:
import xgboost as xgb

In [None]:
model3 = xgb.XGBRegressor(objective = 'reg:squarederror',
                          learning_rate = 0.01,
                          max_depth = 30,
                          n_estimators = 170,
                          alpha = 10
                            
                    )
model3.fit(x_train,y_train)
preds = model3.predict(x_test)
r2score = metrics.r2_score(y_test,preds)
r2score

XGBoost with the following parameters got an R2 score of `0.1549`:

```
objective = 'reg:squarederror',
learning_rate = 0.07,
max_depth = 15,
n_estimators = 170,
alpha = 9
```

In [None]:
x_train.shape

## Deep Neural Net

Let's try to build a neural net for this regression task.

In [None]:
model4 = tf.keras.models.Sequential([
    # 4 input features per sample; (None, 4) would add a spurious extra dimension
    tf.keras.layers.Dense(32, input_shape=(4,), activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear')
])

model4.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
               loss='mean_squared_error')
                
                                    

In [None]:
model4.summary()

In [None]:
%%time
history = model4.fit(x = x_train, y = y_train, 
           epochs=1200,
            verbose=0,
           validation_split = 0.2
          )

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

In [None]:
import matplotlib.pyplot as plt
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.ylim([0,8])
#   plt.xlim([1000, 1500])
  plt.xlabel('Epoch')
  plt.ylabel('Error')
  plt.legend()
  plt.grid(True)

In [None]:
plot_loss(history)

Let's find R2 score of NN model

In [None]:
preds = model4.predict(x_test)
r2score = metrics.r2_score(y_test,preds)
r2score

# Analysis of the model complexity

## Experiment 1
We achieved an R2 score of `-0.031` with the following NN model.

The model complexity is very low (49 trainable params). This simple model, with a single hidden layer of 8 nodes, is not sufficient for a better score.

Validation loss was around 5.

- Trainable parameters = 49
- R2 score = `-0.031`
- Validation loss = 5
- epochs = 1000

```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_3 (Dense)              (None, None, 8)           40        
_________________________________________________________________
dense_4 (Dense)              (None, None, 1)           9         
=================================================================
Total params: 49
Trainable params: 49
Non-trainable params: 0
_________________________________________________________________
```
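
The parameter counts in these summaries can be checked by hand: a dense layer has `inputs * units` weights plus `units` biases. The sketch below (helper names are hypothetical) reproduces the totals reported for the experiments, assuming 4 input features as in `x`.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def mlp_params(layer_sizes):
    """Total trainable parameters for sizes [inputs, hidden..., output]."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

print(mlp_params([4, 8, 1]))        # 49   (4*8 + 8 = 40, then 8*1 + 1 = 9)
print(mlp_params([4, 16, 1]))       # 97
print(mlp_params([4, 32, 1]))       # 193
print(mlp_params([4, 16, 16, 1]))   # 369
print(mlp_params([4, 32, 32, 1]))   # 1249
print(mlp_params([4, 64, 32, 1]))   # 2433
```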
---
## Experiment 2

We achieved an R2 score of `0.0335` with the following NN model.
The model complexity is increased (97 trainable params), and so is the score. Validation loss was around 4.8.

- Trainable parameters = 97
- R2 score = `0.0335`
- Validation loss = 4.8
- epochs = 1000

With the same model and epochs increased to 2000, the R2 score was `0.1655`. Validation loss was around 4.4.

- Trainable parameters = 97
- R2 score = `0.1655`
- Validation loss = 4.4
- epochs = 2000


```
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, None, 16)          80        
_________________________________________________________________
dense_6 (Dense)              (None, None, 1)           17        
=================================================================
Total params: 97
Trainable params: 97
Non-trainable params: 0
_________________________________________________________________
```
---
## Experiment 3


Still using a single hidden layer.

- Trainable parameters = 193
- R2 score = `0.14`
- Validation loss = 4.2
- epochs = 2000

Let us increase the epochs to 3000 and see if the validation loss decreases.

- Trainable parameters = 193
- R2 score = `0.2084`
- Validation loss = 3.95
- epochs = 3000

The R2 score increased markedly to `0.208`, while the validation loss reduced only slightly.

```
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_7 (Dense)              (None, None, 32)          160       
_________________________________________________________________
dense_8 (Dense)              (None, None, 1)           33        
=================================================================
Total params: 193
Trainable params: 193
Non-trainable params: 0
```
---
## Experiment 4

Now let us increase the complexity by adding one more layer to the NN.

- Trainable parameters = 369
- R2 score = `0.244`
- Validation loss = 3.3
- epochs = 3000

Let's try still more epochs and see if the loss reduces any further.

- epochs = 4000
- R2 score = `0.15`
- Validation loss = 3.3

We can observe that the loss reduced, but not significantly: the validation loss stayed around 3.3.


NN Summary
```
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_9 (Dense)              (None, None, 16)          80        
_________________________________________________________________
dense_10 (Dense)             (None, None, 16)          272       
_________________________________________________________________
dense_11 (Dense)             (None, None, 1)           17        
=================================================================
Total params: 369
Trainable params: 369
Non-trainable params: 0
_________________________________________________________________
```
---
## Experiment 5

In this experiment, let us increase the nodes in the 2 hidden layers and see the effect.

- Trainable parameters = 1249
- epochs = 4000
- Validation loss = 6
- R2 score = `0.17`

We can see that with 4000 epochs the model has overfitted.

In the training history graph, the lowest loss was around 3.3 (at epoch 1200). With epochs = 1200:

- Trainable parameters = 1249
- epochs = 1200
- Validation loss = 6
- R2 score = `0.226`

For some reason the validation loss did not reduce :-(
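
Rather than reading the best epoch off the graph, it can be located programmatically. The sketch below uses a made-up validation curve to stay self-contained; the helper name `best_epoch` is hypothetical. With the notebook's `history` object, the list to pass would be `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """Return (epoch_index, loss) for the lowest validation loss."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx, val_losses[idx]

# Made-up validation curve: improves, bottoms out, then rises again (overfitting)
val_loss = [6.0, 4.5, 3.6, 3.3, 3.5, 4.2, 6.0]
print(best_epoch(val_loss))  # (3, 3.3)
```

An alternative, not used in this notebook, is Keras's `tf.keras.callbacks.EarlyStopping` callback, which stops training when the monitored validation loss stops improving.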

NN Summary
```
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_12 (Dense)             (None, None, 32)          160       
_________________________________________________________________
dense_13 (Dense)             (None, None, 32)          1056      
_________________________________________________________________
dense_14 (Dense)             (None, None, 1)           33        
=================================================================
Total params: 1,249
Trainable params: 1,249
Non-trainable params: 0
_________________________________________________________________
```

---
## Experiment n
We achieved an R2 score of `0.134` with the following NN model.

```
Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_25 (Dense)             (None, None, 64)          320       
_________________________________________________________________
dense_26 (Dense)             (None, None, 32)          2080      
_________________________________________________________________
dense_27 (Dense)             (None, None, 1)           33        
=================================================================
Total params: 2,433
Trainable params: 2,433
Non-trainable params: 0
```
---
