# Challenge: Overfitting on Other Datasets

## Download data from `yfinance`

In [18]:
import yfinance as yf

ticker = 'CLSK'
df = yf.download(ticker)
df



Price,Adj Close,Close,High,Low,Open,Volume
Ticker,CLSK,CLSK,CLSK,CLSK,CLSK,CLSK
Date,,,,,,
2016-11-16,34.900002,34.900002,34.900002,34.900002,34.900002,40
2016-11-17,34.900002,34.900002,34.900002,34.900002,34.900002,0
2016-11-18,34.900002,34.900002,34.900002,34.900002,34.900002,0
2016-11-21,31.500000,31.500000,31.500000,31.500000,31.500000,160
2016-11-22,45.000000,45.000000,45.000000,45.000000,45.000000,10
...,...,...,...,...,...,...
2024-11-20,14.000000,14.000000,14.980000,13.320000,14.560000,49099900
2024-11-21,12.965000,12.965000,15.280000,12.600000,14.600000,62190600
2024-11-22,15.100000,15.100000,15.480000,12.860000,13.100000,49084900
2024-11-25,14.950000,14.950000,15.870000,14.510000,15.400000,42148400


## Preprocess the data

### Filter the date range

- Keep at least the last year of data; here, everything from 2020-01-01 onward

In [19]:
df = df.loc['2020-01-01':].copy()
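
The start date can be a fixed label, as in the cell above, or a cutoff computed relative to the latest observation. A minimal sketch, assuming a toy `DataFrame` with a business-day `DatetimeIndex`; the names `prices`, `recent`, `cutoff` and `last_year` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy price series with a business-day DatetimeIndex (illustrative data only)
index = pd.date_range('2019-01-01', '2021-12-31', freq='B')
prices = pd.DataFrame({'Close': range(len(index))}, index=index)

# Label-based slicing on a sorted DatetimeIndex keeps every row on or after the label
recent = prices.loc['2020-01-01':].copy()

# Alternative (assumption): a cutoff one year before the last available date
cutoff = prices.index.max() - pd.DateOffset(years=1)
last_year = prices.loc[cutoff:].copy()

print(recent.index.min(), last_year.index.min())
```

The `.copy()` call avoids modifying a view of the original frame when new columns are added later.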

### Create the target variable

#### Percentage change

- Percentage change in `Adj Close` from today to tomorrow

In [20]:
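# pct_change(-1) compares today with tomorrow: P_t / P_{t+1} - 1
df['change_tomorrow'] = df['Adj Close'].pct_change(-1)
# Flip the sign so a positive value means tomorrow's price is higher than today's
df.change_tomorrow = df.change_tomorrow * -1
# Express the change as a percentage
df.change_tomorrow = df.change_tomorrow * 100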
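
To see what this target measures, the sketch below recomputes it on five `Adj Close` values from the start of the 2020 data and compares it with the more conventional next-day return. The variable names (`adj_close`, `notebook_style`, `conventional`) are illustrative assumptions. The two versions differ only in the denominator: tomorrow's price versus today's price.

```python
import pandas as pd

# Five `Adj Close` values from the start of the 2020 data
adj_close = pd.Series(
    [5.40, 5.40, 5.17, 5.10, 5.30],
    index=pd.to_datetime(
        ['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07', '2020-01-08']
    ),
)

# Notebook formula: pct_change(-1) is P_t / P_{t+1} - 1, so after the sign flip
# the result is (P_{t+1} - P_t) / P_{t+1}, i.e. relative to tomorrow's price
notebook_style = adj_close.pct_change(-1) * -1 * 100

# Conventional next-day return: (P_{t+1} - P_t) / P_t, relative to today's price
conventional = adj_close.pct_change().shift(-1) * 100

comparison = pd.DataFrame(
    {'notebook_style': notebook_style, 'conventional': conventional}
)
print(comparison.round(4))
```

In both versions the last row is `NaN` because there is no "tomorrow" for it, which is why the rows with missing data are dropped afterwards.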

#### Remove rows with any missing data

In [21]:
df = df.dropna().copy()
df = df.droplevel('Ticker', axis=1)
df

Price,Adj Close,Close,High,Low,Open,Volume,change_tomorrow
Date,,,,,,,
2020-01-02,5.400,5.400,5.60,5.300,5.375,19600,-0.000000
2020-01-03,5.400,5.400,5.50,5.020,5.020,2900,-4.448743
2020-01-06,5.170,5.170,5.60,5.100,5.300,6300,-1.372552
2020-01-07,5.100,5.100,5.30,4.750,5.100,7700,3.773590
2020-01-08,5.300,5.300,5.40,4.930,5.000,6000,-6.000004
...,...,...,...,...,...,...,...
2024-11-19,14.120,14.120,14.40,12.835,13.260,35421400,-0.857142
2024-11-20,14.000,14.000,14.98,13.320,14.560,49099900,-7.983030
2024-11-21,12.965,12.965,15.28,12.600,14.600,62190600,14.139074
2024-11-22,15.100,15.100,15.48,12.860,13.100,49084900,-1.003348


## Machine Learning modelling

### Feature selection

1. Target: which variable do you want to predict?
2. Explanatory: which variables will you use to calculate the prediction?

In [33]:
y = df.change_tomorrow
X = df.drop(columns='change_tomorrow')

### Train test split

In [34]:
from sklearn.model_selection import train_test_split
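
The notebook leaves the split itself as an exercise. One possible approach for time-ordered data is to disable shuffling so that the test set contains only the most recent rows. The sketch below uses synthetic stand-ins for `X` and `y`; the 75/25 proportion is an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's X and y (illustrative data only)
index = pd.date_range('2020-01-01', periods=100, freq='B')
rng = np.random.default_rng(42)
X = pd.DataFrame(
    rng.normal(size=(100, 3)), index=index, columns=['Open', 'Close', 'Volume']
)
y = pd.Series(rng.normal(size=100), index=index, name='change_tomorrow')

# shuffle=False keeps chronological order:
# the first 75% of rows become train, the last 25% become test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

print(X_train.index.max(), X_test.index.min())
```

With the default `shuffle=True`, future rows would leak into the training set, which tends to make the test score look better than it would in live trading.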

### Fit the model on train set
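
A minimal sketch of this step, assuming the same `DecisionTreeRegressor` settings that the `Strategy` class uses later in the notebook. The training data here is synthetic noise and only illustrates the call sequence.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (illustrative only, not the CLSK data)
rng = np.random.default_rng(0)
X_train = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=['Open', 'Close', 'Volume']
)
y_train = pd.Series(rng.normal(size=200), name='change_tomorrow')

# A deep tree (max_depth=15) can memorise most of a small training set
model = DecisionTreeRegressor(max_depth=15, random_state=42)
model.fit(X_train, y_train)

# Predict for the last row, passed as a one-row DataFrame
forecast = model.predict(X_train.iloc[[-1]])
print(model.get_depth(), forecast)
```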

### Evaluate model

#### On test set

In [None]:
from sklearn.metrics import ???

#### On train set
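
The notebook leaves the metric open. As one possible choice, the sketch below uses `mean_squared_error` (an assumption) and compares the error on the train and test sets. The data is pure noise, so any gap between the two numbers comes from the model memorising the training rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Pure-noise features and target: there is nothing real to learn
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = rng.normal(size=400)

# Chronological-style split: first 300 rows for train, last 100 for test
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

model = DecisionTreeRegressor(max_depth=15, random_state=42).fit(X_train, y_train)

# Root mean squared error on each set
rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

print(f'RMSE train: {rmse_train:.3f} | RMSE test: {rmse_test:.3f}')
```

A train error far below the test error is the typical signature of overfitting.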

## Backtesting

In [None]:
from backtesting import Backtest, Strategy

### Create the `Strategy`

In [None]:
from sklearn.tree import DecisionTreeRegressor

class Regression(Strategy):
    limit_buy = 1
    limit_sell = -5
    
    def init(self):
        self.model = DecisionTreeRegressor(max_depth=15, random_state=42)
        self.already_bought = False
        
        ???

    def next(self):
        # Latest available row, kept as a one-row DataFrame for `predict`
        explanatory_today = self.data.df.iloc[[-1], :]
        forecast_tomorrow = self.model.predict(explanatory_today)[0]

        # Buy on a strong positive forecast when not already in the market
        if forecast_tomorrow > self.limit_buy and not self.already_bought:
            self.buy()
            self.already_bought = True
        # Sell on a strong negative forecast when holding a position
        elif forecast_tomorrow < self.limit_sell and self.already_bought:
            self.sell()
            self.already_bought = False

### Run the backtest on `test` data

In [None]:
bt = Backtest(???, Regression,
              cash=10000, commission=.002, exclusive_orders=True)

In [None]:
results = bt.run(limit_buy=1, limit_sell=-5)

df_results_test = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'Out of Sample (Test)'}, axis=1)
df_results_test

### Run the backtest on `train` data

In [None]:
bt = Backtest(???, Regression,
              cash=10000, commission=.002, exclusive_orders=True)

results = bt.run(limit_buy=1, limit_sell=-5)

df_results_train = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'In Sample (Train)'}, axis=1)
df_results_train

### Compare both backtests

- HINT: Concatenate the previous `DataFrames`
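
A sketch of the hint, assuming two one-column frames shaped like `df_results_test` and `df_results_train`. The numbers are placeholders for illustration, not real backtest results.

```python
import pandas as pd

# Placeholder summaries (hypothetical values, NOT real backtest output)
stats = ['Equity Final [$]', 'Equity Peak [$]', 'Return [%]']
df_results_test = pd.DataFrame(
    {'Out of Sample (Test)': [9500.0, 10400.0, -5.0]}, index=stats
)
df_results_train = pd.DataFrame(
    {'In Sample (Train)': [25000.0, 26000.0, 150.0]}, index=stats
)

# axis=1 places the frames side by side, aligned on the shared index
df_results = pd.concat([df_results_test, df_results_train], axis=1)
print(df_results)
```

Side-by-side columns make it easy to spot a strategy that looks profitable in sample but not out of sample.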

#### Plot both backtest reports

## How to solve the overfitting problem?

> Walk Forward Validation as a realistic approach to backtesting.
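
As a rough preview, the sketch below uses scikit-learn's `TimeSeriesSplit` to generate expanding-window folds in which every test block comes strictly after its training block. This is a stand-in for the idea, not necessarily the method the next tutorial uses, and the toy array `X` is an assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twenty time-ordered observations (illustrative data only)
X = np.arange(20).reshape(-1, 1)

# Each fold trains on all rows up to a point and tests on the block that follows
splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(X))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f'Fold {i}: train rows 0-{train_idx[-1]}, '
        f'test rows {test_idx[0]}-{test_idx[-1]}'
    )
```

Because the model is always evaluated on data it has not yet seen, the resulting scores are closer to what a live strategy would experience.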

Next tutorial → [Walk Forward Validation]()

![](<src/10_Table_Validation Methods.png>)