### FINAL REPORT 
    Taking a look at stock information fetched with yfinance.
    Looking at the data from 2 angles: Regression and Classification.
    
    We take a stock and fetch its history from Yahoo Finance.
    Then we build a feature set of the previous x days, and the target y is either:
        A) The closing price of the subsequent day.
        B) Whether that closing price is higher or lower than today's closing price.
    Some stocks have a history of tending to go up, which makes the classification result look 'too good'.
    The class TO_THE_MOON looks into this, acting as a model that assumes a stock will always go up.
    I wanted to select a well-known target stock that wasn't biased too heavily toward going up or down.
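
As a rough illustration of the setup described above, here is a minimal sketch that builds both targets from a toy closing-price array. The helper name `make_windows`, the window size, and the toy prices are assumptions for illustration only; the notebook's own `make_X` and `make_Y` functions differ in detail.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(close, window=3):
    """Return (X, y_reg, y_cls) for a 1-D array of closing prices."""
    close = np.asarray(close, dtype=float)
    # Each row of X holds `window` consecutive closes ending at "today".
    # The last window is dropped because it has no "tomorrow".
    X = sliding_window_view(close, window)[:-1]
    today = X[:, -1]
    tomorrow = close[window:]
    y_reg = tomorrow                        # A) tomorrow's close
    y_cls = (tomorrow > today).astype(int)  # B) 1 if tomorrow closes higher
    return X, y_reg, y_cls

close = [10.0, 10.5, 10.2, 10.8, 10.7, 11.0]  # made-up prices
X, y_reg, y_cls = make_windows(close, window=3)
print(X.shape, y_reg, y_cls)
```

The same feature matrix serves both problems; only the target changes.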
    
#### REGRESSION
    We start with a data set of Microsoft (MSFT) daily stock information.
    We make an X out of the previous 30 days and y = the closing value of 'tomorrow'.
    If we throw this into a bunch of BASIC models with a few Monte Carlo splits, we get (mean train score, mean test score)...
        LinearRegression	0.9994653230315521	0.9993490407573127
        Lasso	0.9987841453242439	0.9988047837616991
        KNeighborsRegressor	0.999476371806644	0.9991202017873585
        KernelRidge	0.36916190336346727	0.3546131522438993
        Ridge	0.9994602018307008	0.9993484792927727
        SGDRegressor	0.9991561648785652	0.9991232673696759
        RandomForestRegressor	0.9999045527516056	0.9993110397786895
    ... Well, that's odd. Every model except one (KernelRidge, the fourth row) scores above 0.998.
    This is what spawned the desire to look at classification.
    These predictions are scored by how close they are to the target (R^2, the default score for scikit-learn regressors).
    And with stock data, MOST of the time the day-to-day changes aren't hugely volatile (for this stock and others I checked).
    Here's what a random sample looks like:
        [[26.120001 26.129999 26.139999 26.160000 25.990000 25.709999 26.049999
         25.969999 25.750000 26.030001 25.700001 25.469999 26.000000 25.820000
         25.870001 26.090000 25.629999 26.030001 26.160000 26.320000 26.350000
         26.190001 26.590000 26.510000 27.010000 27.160000 27.450001 27.400000
         25.510000 25.360001 26.129999]] 
         y = [25.889999]
* ALL OF THE VALUES ARE WITHIN A //TINY\\ margin of each other
* So a model can score well just by predicting a number close to the recent prices; it isn't predicting whether "tomorrow" is better than "today"
* This spawned the idea to tackle it with CLASSIFICATION
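
One way to see why the regression scores are so high: a "model" that simply predicts that tomorrow equals today already scores near 1 on R^2 for any slowly drifting series. The sketch below uses a synthetic random walk rather than the MSFT data, and the starting value, step size, and seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic "price": a random walk with small daily steps (assumed values).
prices = 50 + np.cumsum(rng.normal(0.0, 0.3, size=5000))

today, tomorrow = prices[:-1], prices[1:]

# Persistence baseline: predict that tomorrow's close equals today's close.
r2 = r2_score(tomorrow, today)

# The persistence forecast always predicts zero change, so it says nothing
# about direction. For reference, this is the share of up days in the series.
went_up = np.mean(tomorrow > today)

print(f"R^2 of persistence baseline: {r2:.4f}")
print(f"fraction of up days:         {went_up:.4f}")
```

A high R^2 here reflects how small daily moves are relative to the long-run spread of prices, not any skill at calling the next move.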

#### CLASSIFICATION   
    So the data is now X = the delta between today's close and the close of each day in the 30-day window, plus today's close itself.
    And y = whether the delta between today and tomorrow is > 0 <- BOOLEAN, so this is 0-or-1 classification.
    Right off the bat, let's run a series of classification models with a few MC splits (mean train score, mean test score):
        DIAMONDHANDS	0.5138681169272603	0.5118486633439058
        NaiveBayes	    0.5300475866757307	0.5338468509288627
        SVClassifier	0.5901880806707457	0.5811961939284096
        KNeighbors	    0.7208021753908905	0.5583824195740824
        DecisionTree	1.0	                0.6083597643860444
        RandomForest	1.0	                0.6867240598096964
        SGDClass	    0.5909585316111489	0.5804259175351156
    DIAMONDHANDS is the TO_THE_MOON model. Its score of about 0.51 shows the labels are well balanced between up and down.
    Today I learned that 51% of the time Microsoft goes up in value.
    RandomForest already improves on that with a test score of about 0.69, although its train score of 1.0 shows it is overfitting.
    Here's what a random data point looks like in the classifiers:
        [[0.000000 0.019999 -0.100000 -0.060001 -0.010000 0.029999 0.369999
          0.240000 0.400000 0.509998 0.299999 0.589998 0.269999 0.029999 0.189999
          0.189999 0.199999 0.349998 0.009998 -0.170000 -0.200001 -0.220001
          0.000000 -0.350000 -0.430000 -1.090000 -1.110001 -1.410002 0.179998
          0.679998 26.129999]] 
          y = [0.000000]
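
For comparison, scikit-learn ships a `DummyClassifier` that can play the same role as the always-up DIAMONDHANDS baseline. This is a sketch on made-up labels, not a replacement for the notebook's `TO_THE_MOON` class; the toy arrays are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 1 = "went up", 0 = "did not" (made-up data).
y_toy = np.array([1, 1, 0, 1, 0, 1, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict class 1, i.e. "the stock goes up".
always_up = DummyClassifier(strategy="constant", constant=1)
always_up.fit(X_toy, y_toy)

# Accuracy equals the fraction of up days in the scored labels.
baseline_acc = always_up.score(X_toy, y_toy)
print(baseline_acc, np.mean(y_toy == 1))
```

Any real classifier has to beat this number before its accuracy means anything.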
* So, let's use the GridSearch ideas from PS8 on that random forest.
            
        max_depth = [5,10,15,20,25] 
        max_features = [3,5,10,15,20,25]
        n_estimators = [10,25,100]
        
                                WINNER:
             rank   test score   depth  features   estimators
              1      0.705029      5       20         100
        And then 10-fold cross-validation (cv = 10):
            train: 0.72215286742883
            test: 0.6873037701301736
* I'd call that mildly successful. The small RandomForest Monte Carlo run averaged 0.6867 on the test set, and the tuned model managed 0.6873 in 10-fold cross-validation, so that looks like a consistent predictive rate rather than a lucky average. On repeated 75/25 splits, any one of which might be underfit/overfit/biased etc., the tuned model averaged just over 0.70, which is an improvement over the naive model that got 0.51.
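
To put the gap between 0.51 and roughly 0.70 in perspective, a back-of-the-envelope binomial standard error helps. This sketch assumes the test samples are independent, which overlapping 30-day windows are not, so treat it as a rough yardstick only. The sample count comes from the 8827 rows printed by the notebook and a 25% test split.

```python
import math

n_samples = 8827                       # rows reported by the notebook
n_test = math.ceil(n_samples * 0.25)   # size of a 25% test split
baseline = 0.5118                      # DIAMONDHANDS mean test accuracy
forest = 0.7039                        # tuned RandomForest mean test accuracy

# Standard error of an accuracy estimate if the true rate were the baseline.
se = math.sqrt(baseline * (1 - baseline) / n_test)
gap_in_se = (forest - baseline) / se

print(f"n_test={n_test}, standard error={se:.4f}, gap={gap_in_se:.1f} standard errors")
```

Under that independence assumption the gap is many standard errors wide, so it is unlikely to be split-to-split noise alone.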

#### Future Endeavors
* I was thinking we need more features. Or rather, better features. Examples:
        Data on an adjacent ETF could be used as predictive.
            day growth, week growth, month growth in relevant ETF
            
        day   High and low, and growth
        month high and low, and growth
        year  high and low, and growth
* The trouble is that this would shrink the data set, since dates within a year of the stock's first listing would lack a full year of history and go uncounted.
* I tried to get some fun neural net stuff up and working in time, but it became overwhelmingly fiddly.
        -An adventure for another day-       
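
The growth and high/low features listed above could be sketched with pandas rolling windows. The column names, the window lengths (5 and 21 trading days), and the synthetic series are all assumptions. The final line shows the data-loss cost mentioned above: rows without a full window behind them come out as NaN and have to be dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic daily closes standing in for a stock or ETF (made-up data).
close = pd.Series(100 + np.cumsum(rng.normal(0, 1, size=300)), name="Close")

feats = pd.DataFrame({
    "day_growth":   close.pct_change(1),     # 1-day return
    "week_growth":  close.pct_change(5),     # roughly one trading week
    "month_growth": close.pct_change(21),    # roughly one trading month
    "month_high":   close.rolling(21).max(),
    "month_low":    close.rolling(21).min(),
})

usable = feats.dropna()
print(len(feats), "rows before,", len(usable), "rows with complete features")
```

A yearly window (about 252 trading days) would discard the first year of every stock's history in the same way.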
        

In [8]:
#Attempting to look at stocks, with Yfinance and more
import yfinance as yf

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from time import perf_counter

In [59]:
#IMPORT MODELS

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
#import tensorflow as tf

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

In [10]:
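def get_X(stock, period, interval):
    # Download the price history and return it as a plain NumPy array.
    # Column positions are assumed to match Yfin_map below.
    history = yf.download(tickers=stock, period=period, interval=interval)
    return history.to_numpy()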

Yfin_map = {"Open":0, "High":1, "Low":2,"Close":3, "Adj Close":4, "Volume":5}

In [11]:
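def find_increases(df):
    # Label each day 1 if tomorrow's Close is above today's Open, else 0.
    # Note: the report text describes a comparison against today's Close.
    # The last day has no "tomorrow" and keeps its placeholder 0.
    y2 = np.zeros((len(df),))
    for i,row in enumerate(df):
        if i != len(df)-1:
            tmp = df[i+1][Yfin_map['Close']] - row[Yfin_map['Open']]
            y2[i] = tmp > 0
    return y2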


In [32]:
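def build_increases(history,jump=5):
    # Row i holds today's Close minus the Close of each of the last `jump` days
    # (j = 0 is today itself, so the first delta is always 0), then today's Close.
    X_pile= np.zeros((len(history), jump+1))
    for i,row in enumerate(history):
        # Leave blank any row without a full window behind it, and the last row.
        if i<jump or i == len(history)-1:
            continue
        X_pile[i][0:jump] = np.array([row[Yfin_map['Close']] - history[i-j][Yfin_map['Close']] for j in range(0,jump)])
        X_pile[i][jump] = row[Yfin_map['Close']]
    return X_pile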

In [13]:
#Let's Make the Y, "TOMORROW"s CLOSE
def make_Y(df):
    y = np.zeros((len(df),))
    for i,row in enumerate(df):
        if i != len(df)-1:
            y[i] = df[i+1][Yfin_map['Close']]
    return y

In [14]:
#Based on day's opening values
def make_X(history, jump=5):
    # Row i holds the Open of today and of the previous jump-1 days, then today's Close.
    X_pile= np.zeros((len(history), jump+1))
    for i,row in enumerate(history):
        # Leave blank any row without a full window behind it, and the last row.
        if i<jump or i == len(history)-1:
            continue
        X_pile[i][0:jump] = np.array([history[i-j][Yfin_map['Open']] for j in range(0,jump)])
        X_pile[i][jump] = row[Yfin_map['Close']]
    return X_pile

In [15]:
def make_Xy(stk, interim, x_meth, y_meth):
    #fetch this stock's FULL history in days (where available)
    full_set_years = get_X(stk, 'max','1d')
    X = x_meth(full_set_years, interim)
    y = y_meth(full_set_years)

    #Drop the first #interim rows (indices 0..interim-1), because they're blank,
    #and drop today (the last row) because we don't know tomorrow.
    #Slicing avoids an off-by-one: deleting indices interim..1 would keep blank row 0.
    X = X[interim:-1]
    y = y[interim:-1]

    return X, y

In [16]:
def MCtraintest(nmc,X,y,modelObj,testFrac):
    trainScore = np.zeros(nmc)
    testScore  = np.zeros(nmc)
    for i in range(nmc):
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=testFrac)
        modelObj.fit(X_train,y_train)
        trainScore[i] = modelObj.score(X_train,y_train)
        testScore[i]  = modelObj.score(X_test,y_test)
    return trainScore,testScore
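
The repeated random split in `MCtraintest` has a close built-in counterpart in scikit-learn. As a hedged sketch on synthetic data (the dataset, the seed, and the choice of `GaussianNB` are assumptions), `ShuffleSplit` combined with `cross_validate` returns the same kind of per-split train and test scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the notebook's (X2, y2).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# 10 independent random 50/50 splits, like MCtraintest(10, X, y, model, 0.5).
splits = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
scores = cross_validate(GaussianNB(), X_demo, y_demo, cv=splits,
                        return_train_score=True)

print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))
```

One difference to keep in mind: `cross_validate` clones the estimator for every split, so a bare class like `TO_THE_MOON` would first need to follow the scikit-learn estimator interface, whereas `MCtraintest` only needs `fit` and `score`.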

In [17]:
class TO_THE_MOON:
    def fit(self,x,y):
        return
    def score(self,x,y):
        ones = np.sum(y==1)
        return ones/len(y)

#DEMO OF THE PROBLEM
A = np.array([1,1,1,1,0])
B = np.zeros(A.shape)
trains,tests = MCtraintest(500,B,A,TO_THE_MOON(),0.25)
print("DiamondHandsClassifier",np.mean(trains),np.mean(tests),sep='\t')

DiamondHandsClassifier	0.8099999999999998	0.785


In [34]:
np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:f}'.format})


X ,y  = make_Xy('MSFT',30, make_X,          make_Y)
X2,y2 = make_Xy('MSFT',30, build_increases, find_increases)
print(len(X), len(X2))
print(len(y), len(y2))

r = np.random.choice(len(X),1)
print(X[r],y[r])
print(X2[r],y2[r])

print()

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
8827 8827
8827 8827
[[26.120001 26.129999 26.139999 26.160000 25.990000 25.709999 26.049999
  25.969999 25.750000 26.030001 25.700001 25.469999 26.000000 25.820000
  25.870001 26.090000 25.629999 26.030001 26.160000 26.320000 26.350000
  26.190001 26.590000 26.510000 27.010000 27.160000 27.450001 27.400000
  25.510000 25.360001 26.129999]] [25.889999]
[[0.000000 0.019999 -0.100000 -0.060001 -0.010000 0.029999 0.369999
  0.240000 0.400000 0.509998 0.299999 0.589998 0.269999 0.029999 0.189999
  0.189999 0.199999 0.349998 0.009998 -0.170000 -0.200001 -0.220001
  0.000000 -0.350000 -0.430000 -1.090000 -1.110001 -1.410002 0.179998
  0.679998 26.129999]] [0.000000]



In [50]:
tick = perf_counter()
classifiers = [TO_THE_MOON(),
               GaussianNB(), 
               SVC(),
               KNeighborsClassifier(),
               DecisionTreeClassifier(),
               RandomForestClassifier(n_jobs=3),
               SGDClassifier()]

Mod_dict = {0: "DIAMONDHANDS",1:'NaiveBayes',2:'SVClassifier',3:'KNeighbors',4:'DecisionTree',5:'RandomForest',6:"SGDClass"}
for i,mod in enumerate(classifiers):
    trains,tests = MCtraintest(10,X2,y2,mod,0.5)
    print(Mod_dict[i],np.mean(trains),np.mean(tests),sep='\t')

tock=perf_counter()
print("TIME:",tock-tick)

DIAMONDHANDS	0.5138681169272603	0.5118486633439058
NaiveBayes	0.5300475866757307	0.5338468509288627
SVClassifier	0.5901880806707457	0.5811961939284096
KNeighbors	0.7208021753908905	0.5583824195740824
DecisionTree	1.0	0.6083597643860444
RandomForest	1.0	0.6867240598096964
SGDClass	0.5909585316111489	0.5804259175351156
TIME: 35.2955022000001


In [46]:
from sklearn.utils import all_estimators
import warnings
warnings.filterwarnings('ignore')

estimators = all_estimators(type_filter='regressor')
for name, RegressorClass in estimators:
    print(name)
    #mod = RegressorClass()
    #trains,tests = MCtraintest(200,X,y,mod,0.5)
    #print(name, np.mean(trains), np.mean(tests),sep='\t')

ARDRegression
AdaBoostRegressor
BaggingRegressor
BayesianRidge
CCA
DecisionTreeRegressor
DummyRegressor
ElasticNet
ElasticNetCV
ExtraTreeRegressor
ExtraTreesRegressor
GammaRegressor
GaussianProcessRegressor
GradientBoostingRegressor
HistGradientBoostingRegressor
HuberRegressor
IsotonicRegression
KNeighborsRegressor
KernelRidge
Lars
LarsCV
Lasso
LassoCV
LassoLars
LassoLarsCV
LassoLarsIC
LinearRegression
LinearSVR
MLPRegressor
MultiOutputRegressor
MultiTaskElasticNet
MultiTaskElasticNetCV
MultiTaskLasso
MultiTaskLassoCV
NuSVR
OrthogonalMatchingPursuit
OrthogonalMatchingPursuitCV
PLSCanonical
PLSRegression
PassiveAggressiveRegressor
PoissonRegressor
RANSACRegressor
RadiusNeighborsRegressor
RandomForestRegressor
RegressorChain
Ridge
RidgeCV
SGDRegressor
SVR
StackingRegressor
TheilSenRegressor
TransformedTargetRegressor
TweedieRegressor
VotingRegressor


In [49]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import SGDRegressor


models = [LinearRegression(),
          #LogisticRegression(),
          Lasso(),
          KNeighborsRegressor(),
          KernelRidge(),
          Ridge(),
          SGDRegressor(),
          RandomForestRegressor()]

for mod in models:
    mod = Pipeline([("scaler", StandardScaler()),("model", mod)])
    trains,tests = MCtraintest(5,X,y,mod,0.5)
    print(np.mean(trains),np.mean(tests),sep='\t')


0.9994653230315521	0.9993490407573127
0.9987841453242439	0.9988047837616991
0.999476371806644	0.9991202017873585
0.36916190336346727	0.3546131522438993
0.9994602018307008	0.9993484792927727
0.9991561648785652	0.9991232673696759
0.9999045527516056	0.9993110397786895


In [56]:
def general_CV_platform(model, param_set,X,y, pNames):
    locs = ['rank_test_score','mean_test_score']
    locs.extend(pNames)
    tick = perf_counter()
    print('grid search...')
    shuffle = ShuffleSplit(n_splits=25, test_size=.5)
    gscv = GridSearchCV(model, param_grid=param_set, n_jobs=-1, refit=True, cv=shuffle, return_train_score=True)
    gscv.fit(X,y)
    
    results = pd.DataFrame(gscv.cv_results_)
    tock = perf_counter()
    print(results[locs])
    print(gscv.best_params_)
    print("Grid Searched in:", f"{tock - tick:0.4f} seconds")
    print()
    
    print('cross_val...')
    CVInfo = cross_validate(gscv.best_estimator_, X, y, cv=10, return_train_score=True,n_jobs=-1)
    print('train:', np.mean(CVInfo['train_score']))
    print('test:', np.mean(CVInfo['test_score']))
    print(np.sum(CVInfo['fit_time']), 'seconds')

In [57]:
# Untuned classifier handed to the grid search
rfc = RandomForestClassifier()
max_depth = [5,10,15,20,25] 
max_features = [3, 5,10,15,20,25]
n_estimators = [10,25,100]
parameters = {'max_depth':max_depth, 'max_features': max_features, 'n_estimators': n_estimators}
params = ['param_max_depth','param_max_features','param_n_estimators']

general_CV_platform(rfc,parameters,X2,y2, params)

grid search...
    rank_test_score  mean_test_score param_max_depth param_max_features  \
0                71         0.671128               5                  3   
1                66         0.676330               5                  3   
2                64         0.679991               5                  3   
3                40         0.688826               5                  5   
4                32         0.692143               5                  5   
..              ...              ...             ...                ...   
85               55         0.682728              25                 20   
86               34         0.691119              25                 20   
87               77         0.665990              25                 25   
88               59         0.682429              25                 25   
89               35         0.691110              25                 25   

   param_n_estimators  
0                  10  
1                  25  
2           

In [61]:
rfc = RandomForestClassifier(max_depth=5,max_features=20,n_estimators=100)
X_train, X_test, y_train, y_test = train_test_split(X2,y2,test_size=0.25)
rfc.fit(X_train,y_train)
print(rfc.score(X_train,y_train))
print(rfc.score(X_test,y_test))

0.7217522658610271
0.7050294517444495


In [63]:
trains,tests = MCtraintest(100,X2,y2,rfc,0.25)
print(np.mean(trains),np.mean(tests))

0.7233006042296074 0.7039102854553693
