## Challenge Description: TechGig 2018 Code Gladiator

Stock markets have always fascinated companies, investors and traders. At any point in time, a plethora of investors are entering or exiting stocks with different trading strategies and investment horizons. However, the underlying objective of every type of investor is the same: "Maximize the returns and minimize the losses." To achieve this objective, every investor is intrigued by one question: "Is my stock price going to rise or fall in the future?"
People have been trying to answer this question from various angles. For some investors fundamentals are important, so they rely on the fundamental strength of the stock. Others try to predict the movement of the stock based on technical analysis. Some try to predict the movement based on news about stocks and overall market sentiment, while others track the trend to predict the price of the stock.

In this competition we challenge you to identify the predictability in the market based on technical analysis of stocks.

## Goal

The problem statement for this competition is to design a decision-making framework that can be used to predict the actual return value of a stock after 30 trading days.

To elaborate, for each stock in the given list we want to predict its return after 30 trading days.
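As a minimal sketch of what "return after 30 trading days" could mean in code, the snippet below computes a simple forward return from a price series. The function name `forward_return`, the choice of simple (not log) returns, and the toy prices are assumptions for illustration; the challenge text does not specify the exact return definition.

```python
import pandas as pd

def forward_return(prices: pd.Series, horizon: int = 30) -> pd.Series:
    """Simple return realised `horizon` rows (trading days) after each row."""
    future = prices.shift(-horizon)   # price `horizon` rows ahead of each row
    return future / prices - 1.0      # NaN for the last `horizon` rows

# Toy prices (made up for illustration), using a 2-day horizon to keep it small
prices = pd.Series([100.0, 110.0, 121.0, 99.0])
returns = forward_return(prices, horizon=2)
print(returns)
```

The last `horizon` rows have no future price available, so their return is `NaN` and they would need to be dropped before training.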

In this competition we have given you 12 stocks from the Banking and IT sectors in CSV format, with the last 5 years of stock details.

The stock data above has been downloaded from Yahoo Finance, but participants are free to download any other historical data from Yahoo Finance if they are interested. File format for stock data

The following technical indicators should be generated using any open-source technical analysis package available in R or Python (such as TTR in R or TA-Lib in Python):

- Simple Moving Average (30, 40, 50 days)
- Exponential Moving Average (30, 40, 50 days)
- Aroon Oscillator (30, 40, 50 days)
- MACD signals
- Relative Strength Index (RSI)
- Bollinger Bands (30, 40, 50 days)
- Stochastic Oscillator
- Stochastic Momentum Indicator
- Chande Momentum Oscillator
- Commodity Channel Index (30, 40, 50 days)
- Chaikin Volatility indicator (30, 40, 50 days)
- Trend Detection Index (30, 40, 50 days)
- Rate of Price Change (30, 40, 50 days)
- Rate of Volume Change (30, 40, 50 days)
- Williams %R (30, 40, 50 days)

Participants are free to transform these indicators and derive further features from them, from both short-term and long-term perspectives.
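To make a few of the listed indicators concrete, here is a rough sketch that derives the simple moving average, exponential moving average and rate of price change with plain pandas for the 30, 40 and 50 day windows. The helper name `add_basic_indicators`, the output column names and the synthetic price series are assumptions for illustration. Packages such as TTR or TA-Lib may seed or smooth these indicators differently, so their values need not match this sketch exactly.

```python
import numpy as np
import pandas as pd

def add_basic_indicators(df: pd.DataFrame, windows=(30, 40, 50)) -> pd.DataFrame:
    """Append SMA, EMA and rate-of-change columns computed from 'Adj Close'."""
    out = df.copy()
    price = out["Adj Close"]
    for w in windows:
        # Simple moving average: mean of the last w prices (NaN until w rows exist)
        out[f"SMA_{w}"] = price.rolling(window=w).mean()
        # Exponential moving average with span w (recursive form, seeded by the first price)
        out[f"EMA_{w}"] = price.ewm(span=w, adjust=False).mean()
        # Rate of price change over w rows, in percent
        out[f"ROC_{w}"] = (price / price.shift(w) - 1.0) * 100.0
    return out

# Synthetic random-walk prices, only to demonstrate the call
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Adj Close": 100 + rng.normal(0, 1, 120).cumsum()})
features = add_basic_indicators(demo)
print(features.tail())
```

The rolling and shifted columns start with `NaN` values for the first `w` or so rows, so those rows would have to be dropped or imputed before fitting a model.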

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
%matplotlib inline

In [2]:
'''We have data for 12 different stocks, but we will pick only one for exploration'''
infy_dataset = pd.read_csv("complete_data_set_v1/INFY.NS.csv")

In [3]:
infy_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1232 entries, 0 to 1231
Data columns (total 7 columns):
Date         1232 non-null object
Open         1232 non-null object
High         1232 non-null object
Low          1232 non-null object
Close        1232 non-null object
Adj Close    1232 non-null object
Volume       1232 non-null object
dtypes: object(7)
memory usage: 67.5+ KB


In [4]:
'''Convert datatypes'''
infy_dataset['Date'] = pd.to_datetime(infy_dataset['Date'],format='%Y-%m-%d') # Yahoo Finance dates are YYYY-MM-DD
infy_dataset['Adj Close'] = pd.to_numeric(infy_dataset['Adj Close'],errors='coerce')
infy_dataset['Volume'] = pd.to_numeric(infy_dataset['Volume'],errors='coerce')
infy_dataset['Close'] = pd.to_numeric(infy_dataset['Close'],errors='coerce')
infy_dataset['Low'] = pd.to_numeric(infy_dataset['Low'],errors='coerce')
infy_dataset['High'] = pd.to_numeric(infy_dataset['High'],errors='coerce')
infy_dataset['Open'] = pd.to_numeric(infy_dataset['Open'],errors='coerce')

In [5]:
infy_dataset = infy_dataset.dropna(axis=0,how='any')

In [6]:
'''Create a new column holding the adjusted closing price 30 trading days ahead'''
forecast_out = 30 # predict 30 trading days into the future
infy_dataset['PriceNextMonth'] = infy_dataset['Adj Close'].shift(-forecast_out)

'''For the last 30 rows PriceNextMonth will be empty, so remove those rows'''
infy_dataset = infy_dataset[:-forecast_out] # drop the last 30 rows

In [7]:
infy_dataset.corr(numeric_only=True) # skip the non-numeric Date column (needs pandas >= 1.5)

| | Open | High | Low | Close | Adj Close | Volume | PriceNextMonth |
|---|---|---|---|---|---|---|---|
| Open | 1.0 | 0.997877 | 0.997125 | 0.99566 | 0.968154 | -0.082128 | 0.874661 |
| High | 0.997877 | 1.0 | 0.996697 | 0.997984 | 0.969741 | -0.0711 | 0.876141 |
| Low | 0.997125 | 0.996697 | 1.0 | 0.998256 | 0.970066 | -0.116717 | 0.875029 |
| Close | 0.99566 | 0.997984 | 0.998256 | 1.0 | 0.971249 | -0.100262 | 0.876217 |
| Adj Close | 0.968154 | 0.969741 | 0.970066 | 0.971249 | 1.0 | -0.08972 | 0.92866 |
| Volume | -0.082128 | -0.0711 | -0.116717 | -0.100262 | -0.08972 | 1.0 | -0.074321 |
| PriceNextMonth | 0.874661 | 0.876141 | 0.875029 | 0.876217 | 0.92866 | -0.074321 | 1.0 |


In [8]:
'''Keep only 'Adj Close' and drop the other price columns, as they are highly correlated with it'''
X = infy_dataset['Adj Close']
X = X.values.reshape(X.shape[0],1)

In [9]:
y = np.array(infy_dataset['PriceNextMonth'])

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [11]:
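def evaluate_model(model,X_train, X_test, y_train, y_test):
    '''Print hold-out R^2, 5-fold cross-validated R^2 and hold-out MAE for a fitted regressor'''
    # For scikit-learn regressors, score() returns the coefficient of determination (R^2)
    r2 = model.score(X_test, y_test)
    print("score: ", r2)

    # cross_val_score refits clones of the model on 5 folds of the training data
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print("cross_val_score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

    # Mean absolute error, in the same units as the price
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print("mean_absolute_error: ", mae)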
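To clarify what the two hold-out numbers printed by `evaluate_model` measure, the sketch below recomputes them by hand with NumPy: the score of a scikit-learn regressor is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, and the mean absolute error is the average of $|y - \hat{y}|$. The helper names `r2_score_manual` and `mae_manual` and the toy values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

def mae_manual(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Toy targets and predictions (made up for illustration)
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [102.0, 108.0, 123.0, 127.0]
r2 = r2_score_manual(y_true, y_pred)
mae = mae_manual(y_true, y_pred)
print("R^2:", r2)
print("MAE:", mae)
```

An R^2 near 1 means the predictions explain most of the variance in the target, while the MAE reports the typical error in price units, which is often easier to interpret for a single stock.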

In [12]:
'''Try out a couple of regression models to choose a base model'''
# LinearRegression
clf = LinearRegression()
clf.fit(X_train,y_train)
evaluate_model(clf,X_train, X_test, y_train, y_test)

score:  0.869568026139
cross_val_score: 0.86 (+/- 0.03)
mean_absolute_error:  49.7995534962


In [13]:
#RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
evaluate_model(model,X_train, X_test, y_train, y_test)

score:  0.864177674294
cross_val_score: 0.83 (+/- 0.03)
mean_absolute_error:  49.7220166859


In [14]:
'''We will go with RandomForestRegressor as the base model to help with feature engineering'''

'We will go with RandomForestRegressor as the base model to help with feature engineering'