#### Logistic Regression in Python - Predicting if the stock market is going Up or Down
https://www.youtube.com/watch?v=X9jjyh0p7x8&t=1302s


#### What is the logistic function?

Logistic regression
Logistic regression is a statistical model (the logit model) used in machine learning classification algorithms to estimate the probability of belonging to a given class.

The classification algorithm based on logistic regression is a supervised ML method.

It relies on the logistic (sigmoid) function, which maps any real value to a value between 0 and 1.

Note. The name "logistic regression" suggests a regression algorithm, because the model passes a linear combination of the inputs (as in linear regression) through the logistic function. Here, however, it is used as a classification algorithm.

During training, the algorithm receives as input a training dataset of N examples. Each example consists of m attributes X and a label y giving the correct classification.

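The algorithm finds a weight vector W to associate with the attribute vector X of the examples, chosen so that the predicted probabilities agree with the training labels as closely as possible (in practice by maximizing the likelihood), which tends to maximize the share of correct answers (or minimize the wrong ones).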

The linear combination z of the weights W and the attributes X gives the system's response for each example in the training dataset.
In logistic regression, the linear combination z is the argument of the logistic function, which maps it to a value between 0 and 1.
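The mapping from the linear combination z to a probability can be sketched in a few lines. This is only an illustration: the weights `W` and attributes `X` below are made-up values, not estimates from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and attributes for a single example (illustrative only).
W = np.array([0.5, -1.0, 2.0])
X = np.array([1.0, 0.2, 0.1])

z = W @ X                 # linear combination of weights and attributes
probability = sigmoid(z)  # estimated probability of belonging to class 1
```

A value of z equal to 0 gives a probability of exactly 0.5; large positive z pushes the probability toward 1 and large negative z toward 0.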

The logistic function is also used as the activation function of nodes in neural networks.

https://www.eage.it/machine-learning/regressione-logistica


Goal of the exercise: predict whether the stock market will go up or down tomorrow using lagged market information.

In [3]:
from datetime import date, datetime
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import statsmodels.api as sm  # we use statsmodels, but logistic regression is also available in scikit-learn

In [33]:
start_date = '2001-01-01'
end_date = '2005-12-31'

In [34]:
data = web.get_data_yahoo('^GSPC', start_date, end_date) 

In [35]:
data

Date,High,Low,Open,Close,Volume,Adj Close
2001-01-02,1320.280029,1276.050049,1320.280029,1283.270020,1129400000,1283.270020
2001-01-03,1347.760010,1274.619995,1283.270020,1347.560059,1880700000,1347.560059
2001-01-04,1350.239990,1329.140015,1347.560059,1333.339966,2131000000,1333.339966
2001-01-05,1334.770020,1294.949951,1333.339966,1298.349976,1430800000,1298.349976
2001-01-08,1298.349976,1276.290039,1298.349976,1295.859985,1115500000,1295.859985
...,...,...,...,...,...,...
2005-12-23,1269.760010,1265.920044,1268.119995,1268.660034,1285810000,1268.660034
2005-12-27,1271.829956,1256.540039,1268.660034,1256.540039,1540470000,1256.540039
2005-12-28,1261.099976,1256.540039,1256.540039,1258.170044,1422360000,1258.170044
2005-12-29,1260.609985,1254.180054,1258.170044,1254.420044,1382540000,1254.420044


In [36]:
df = data['Adj Close'].pct_change() * 100

In [37]:
df = df.rename("Today")
df

Date
2001-01-02         NaN
2001-01-03    5.009861
2001-01-04   -1.055247
2001-01-05   -2.624236
2001-01-08   -0.191781
                ...   
2005-12-23    0.042586
2005-12-27   -0.955338
2005-12-28    0.129722
2005-12-29   -0.298052
2005-12-30   -0.488672
Name: Today, Length: 1256, dtype: float64

In [38]:
df = df.reset_index()
df

,Date,Today
0,2001-01-02,
1,2001-01-03,5.009861
2,2001-01-04,-1.055247
3,2001-01-05,-2.624236
4,2001-01-08,-0.191781
...,...,...
1251,2005-12-23,0.042586
1252,2005-12-27,-0.955338
1253,2005-12-28,0.129722
1254,2005-12-29,-0.298052


Let's create the columns with the lagged return values

In [39]:
for i in range(1,6):
    df['Lag_' + str(i)] = df['Today'].shift(i)
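To see what `shift(i)` does to the rows, here is a small sketch on a toy series. The four return values are invented for illustration and are not market data.

```python
import pandas as pd

# Toy daily returns (illustrative values only).
lagged = pd.DataFrame({"Today": [1.0, -2.0, 0.5, 3.0]})

# Same loop as above, limited to two lags to keep the output small.
for i in range(1, 3):
    lagged["Lag_" + str(i)] = lagged["Today"].shift(i)

# Rows that lack a full lag history contain NaN and are dropped.
complete = lagged.dropna()
```

Each `Lag_i` column holds the return observed i rows earlier, so every row only contains information that was available before that day's return. The first i rows have no history and become NaN.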

In [40]:
df

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5
0,2001-01-02,,,,,,
1,2001-01-03,5.009861,,,,,
2,2001-01-04,-1.055247,5.009861,,,,
3,2001-01-05,-2.624236,-1.055247,5.009861,,,
4,2001-01-08,-0.191781,-2.624236,-1.055247,5.009861,,
...,...,...,...,...,...,...,...
1251,2005-12-23,0.042586,0.422078,0.251667,-0.023815,-0.583902,-0.284828
1252,2005-12-27,-0.955338,0.042586,0.422078,0.251667,-0.023815,-0.583902
1253,2005-12-28,0.129722,-0.955338,0.042586,0.422078,0.251667,-0.023815
1254,2005-12-29,-0.298052,0.129722,-0.955338,0.042586,0.422078,0.251667


Add the previous day's volume (scaled to billions)

In [41]:
df['Volume'] = data.Volume.shift(1).values / 1_000_000_000

In [42]:
df

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume
0,2001-01-02,,,,,,,
1,2001-01-03,5.009861,,,,,,1.12940
2,2001-01-04,-1.055247,5.009861,,,,,1.88070
3,2001-01-05,-2.624236,-1.055247,5.009861,,,,2.13100
4,2001-01-08,-0.191781,-2.624236,-1.055247,5.009861,,,1.43080
...,...,...,...,...,...,...,...,...
1251,2005-12-23,0.042586,0.422078,0.251667,-0.023815,-0.583902,-0.284828,1.88850
1252,2005-12-27,-0.955338,0.042586,0.422078,0.251667,-0.023815,-0.583902,1.28581
1253,2005-12-28,0.129722,-0.955338,0.042586,0.422078,0.251667,-0.023815,1.54047
1254,2005-12-29,-0.298052,0.129722,-0.955338,0.042586,0.422078,0.251667,1.42236


In [43]:
df = df.dropna()
df

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume
6,2001-01-10,0.958639,0.381219,-0.191781,-2.624236,-1.055247,5.009861,1.19130
7,2001-01-11,1.031770,0.958639,0.381219,-0.191781,-2.624236,-1.055247,1.29650
8,2001-01-12,-0.623287,1.031770,0.958639,0.381219,-0.191781,-2.624236,1.41120
9,2001-01-16,0.614309,-0.623287,1.031770,0.958639,0.381219,-0.191781,1.27600
10,2001-01-17,0.212561,0.614309,-0.623287,1.031770,0.958639,0.381219,1.20570
...,...,...,...,...,...,...,...,...
1251,2005-12-23,0.042586,0.422078,0.251667,-0.023815,-0.583902,-0.284828,1.88850
1252,2005-12-27,-0.955338,0.042586,0.422078,0.251667,-0.023815,-0.583902,1.28581
1253,2005-12-28,0.129722,-0.955338,0.042586,0.422078,0.251667,-0.023815,1.54047
1254,2005-12-29,-0.298052,0.129722,-0.955338,0.042586,0.422078,0.251667,1.42236


Create the column with the market direction (1 = up, 0 = down)

In [44]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(Direction=(df['Today'] > 0).astype(int))


In [45]:
df.head()

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume,Direction
6,2001-01-10,0.958639,0.381219,-0.191781,-2.624236,-1.055247,5.009861,1.1913,1
7,2001-01-11,1.03177,0.958639,0.381219,-0.191781,-2.624236,-1.055247,1.2965,1
8,2001-01-12,-0.623287,1.03177,0.958639,0.381219,-0.191781,-2.624236,1.4112,0
9,2001-01-16,0.614309,-0.623287,1.03177,0.958639,0.381219,-0.191781,1.276,1
10,2001-01-17,0.212561,0.614309,-0.623287,1.03177,0.958639,0.381219,1.2057,1


Add a column holding a constant; otherwise the regression has no intercept

In [46]:
df = sm.add_constant(df)
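As a rough illustration of what the constant column does, the sketch below builds the same kind of design matrix with NumPy instead of statsmodels (an assumption made to keep the example dependency-light). The feature values are made up.

```python
import numpy as np

# Two hypothetical observations with two lagged returns each.
features = np.array([[0.38, -0.19],
                     [0.96, 0.38]])

# Prepend a column of ones: its weight plays the role of the intercept.
with_const = np.column_stack([np.ones(len(features)), features])
```

Without that column, an observation whose regressors are all zero would always get z = 0, that is, a predicted probability of exactly 0.5.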

In [47]:
df.head()

,const,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume,Direction
6,1.0,2001-01-10,0.958639,0.381219,-0.191781,-2.624236,-1.055247,5.009861,1.1913,1
7,1.0,2001-01-11,1.03177,0.958639,0.381219,-0.191781,-2.624236,-1.055247,1.2965,1
8,1.0,2001-01-12,-0.623287,1.03177,0.958639,0.381219,-0.191781,-2.624236,1.4112,0
9,1.0,2001-01-16,0.614309,-0.623287,1.03177,0.958639,0.381219,-0.191781,1.276,1
10,1.0,2001-01-17,0.212561,0.614309,-0.623287,1.03177,0.958639,0.381219,1.2057,1


In [48]:
X = df[['const', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volume']]

In [49]:
y = df.Direction

In [50]:
model = sm.Logit(y,X)
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.691035
         Iterations 4


In [51]:
result.summary()

Dep. Variable:,Direction,No. Observations:,1250.0
Model:,Logit,Df Residuals:,1243.0
Method:,MLE,Df Model:,6.0
Date:,"Tue, 14 Dec 2021",Pseudo R-squ.:,0.002155
Time:,10:47:56,Log-Likelihood:,-863.79
converged:,True,LL-Null:,-865.66
Covariance Type:,nonrobust,LLR p-value:,0.713

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1268,0.241,-0.527,0.598,-0.599,0.345
Lag_1,-0.0778,0.050,-1.550,0.121,-0.176,0.021
Lag_2,-0.0390,0.050,-0.778,0.437,-0.137,0.059
Lag_3,0.0126,0.050,0.252,0.801,-0.085,0.110
Lag_4,0.0038,0.050,0.076,0.939,-0.094,0.102
Lag_5,0.0103,0.050,0.208,0.835,-0.087,0.107
Volume,0.1338,0.158,0.845,0.398,-0.177,0.444


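None of the coefficients is statistically significant at the usual 5% level. Lag_1 comes closest, with the smallest p-value (0.121), and the LLR p-value of 0.713 indicates that the model as a whole does not improve significantly on the constant-only model.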

In [52]:
prediction = result.predict(X)
prediction

6       0.506256
7       0.483505
8       0.480380
9       0.515939
10      0.507901
          ...   
1251    0.519447
1252    0.505643
1253    0.539423
1254    0.523833
1255    0.517203
Length: 1250, dtype: float64
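As a sanity check on how these probabilities arise, the first one can be approximately reproduced by hand. The sketch below takes the coefficients as printed in the summary table (rounded to four decimals) and the regressors of row 6 (2001-01-10) as shown in the earlier table; it uses only the standard library.

```python
import math

# Coefficients as printed (rounded) in the summary table.
coef = {"const": -0.1268, "Lag_1": -0.0778, "Lag_2": -0.0390, "Lag_3": 0.0126,
        "Lag_4": 0.0038, "Lag_5": 0.0103, "Volume": 0.1338}

# Regressors of the first row kept after dropna (index 6, 2001-01-10).
row = {"const": 1.0, "Lag_1": 0.381219, "Lag_2": -0.191781, "Lag_3": -2.624236,
       "Lag_4": -1.055247, "Lag_5": 5.009861, "Volume": 1.1913}

z = sum(coef[name] * row[name] for name in coef)  # linear combination
p_up = 1.0 / (1.0 + math.exp(-z))                 # logistic function
```

The result is roughly 0.5062, close to the 0.506256 reported for row 6. The small gap comes from using the rounded coefficients.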

Let's build a "confusion matrix" that compares predicted rises and falls with the actual rises and falls

In [53]:
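def confusion_matrix(act, pred):
    # A predicted probability above 0.5 counts as a predicted "Up" day.
    predtrans = ['Up' if i > 0.5 else 'Down' for i in pred]
    # Direction is coded 1 for an up day and 0 otherwise.
    actuals = ['Up' if i > 0 else 'Down' for i in act]
    # Cross-tabulate actual (rows) against predicted (columns) labels.
    table = pd.crosstab(pd.Series(actuals),
                        pd.Series(predtrans),
                        rownames=['Actual'],
                        colnames=['Predicted'])
    return table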



To gauge the model's predictive ability, we divide the number of cases it got right (predicted = actual) by the total number of cases. The correct predictions are the ones on the diagonal.

In [54]:
confusion_matrix(y,prediction)

Actual \ Predicted,Down,Up
Down,155,448
Up,143,504


In [56]:
(155+504)/1250

0.5272

Predictive ability only slightly better than tossing a coin (0.50)
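The diagonal rule can also be computed directly from the matrix, which avoids typing the total by hand. A minimal sketch with NumPy, using the four counts from the in-sample confusion matrix (rows are actual, columns are predicted, in the order Down, Up):

```python
import numpy as np

# In-sample confusion matrix: rows = actual, columns = predicted (Down, Up).
matrix = np.array([[155, 448],
                   [143, 504]])

correct = np.trace(matrix)   # sum of the diagonal: predicted == actual
total = matrix.sum()         # all classified observations
accuracy = correct / total
```

Here the total is 1250, the number of observations used in the fit, and the accuracy works out to 0.5272.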

This estimate has the problem of being evaluated on the same full sample used to fit the model. We need to split the series into an estimation or "training" part (train) and a test part.

In [61]:
x_train = df[df.Date.dt.year < 2005][['const', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volume']]
y_train = df[df.Date.dt.year < 2005]['Direction']
x_test = df[df.Date.dt.year == 2005][['const', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volume']]
y_test = df[df.Date.dt.year == 2005]['Direction']
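The same chronological split can be wrapped in a small helper so the year masks are written only once. This is a sketch: `split_by_year` is a hypothetical helper, not part of pandas or of the original notebook, and the toy frame below is invented.

```python
import pandas as pd

def split_by_year(frame, date_col, test_year):
    """Rows before test_year go to train, rows in test_year go to test."""
    years = frame[date_col].dt.year
    return frame[years < test_year], frame[years == test_year]

# Toy data (illustrative only): two earlier rows and two rows in 2005.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2003-06-02", "2004-06-01", "2005-01-03", "2005-06-01"]),
    "Direction": [1, 0, 1, 0],
})

train, test = split_by_year(toy, "Date", 2005)
```

Splitting by date rather than at random keeps every training observation strictly earlier than every test observation, which matches how the model would be used in practice.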

In [63]:
model = sm.Logit(y_train, x_train)

In [64]:
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.691898
         Iterations 4


In [65]:
prediction = result.predict(x_test)

In [66]:
confusion_matrix(y_test, prediction)

Actual \ Predicted,Down,Up
Down,81,30
Up,103,38


In [69]:
(81+38)/len(x_test)

0.4722222222222222

Predictive ability worse than tossing a coin (0.50)

What happens if we drop all variables except Lag_1 and Lag_2?

In [71]:
x_train = df[df.Date.dt.year < 2005][['const', 'Lag_1', 'Lag_2']]
y_train = df[df.Date.dt.year < 2005]['Direction']
x_test = df[df.Date.dt.year == 2005][['const', 'Lag_1', 'Lag_2']]
y_test = df[df.Date.dt.year == 2005]['Direction']

In [72]:
model = sm.Logit(y_train, x_train)

In [73]:
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.692063
         Iterations 3


In [74]:
prediction = result.predict(x_test)

In [75]:
confusion_matrix(y_test, prediction)

Actual \ Predicted,Down,Up
Down,40,71
Up,37,104


In [76]:
(40 + 104)/len(x_test)

0.5714285714285714
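A coin toss is not the only possible yardstick. As a hedged sketch, the code below compares both test-set accuracies with a naive rule that always predicts the more frequent actual class. This baseline is an addition for illustration and is not part of the original exercise. The counts are taken from the two test-set confusion matrices above.

```python
def accuracy(matrix):
    """Share of observations on the diagonal (predicted == actual)."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total

def majority_baseline(matrix):
    """Accuracy of always predicting the most frequent actual class (rows = actual)."""
    total = sum(sum(row) for row in matrix)
    return max(sum(row) for row in matrix) / total

# Rows = actual (Down, Up), columns = predicted (Down, Up), 2005 test set.
all_regressors = [[81, 30], [103, 38]]
lag_1_and_2 = [[40, 71], [37, 104]]

acc_all = accuracy(all_regressors)
acc_lags = accuracy(lag_1_and_2)
baseline = majority_baseline(lag_1_and_2)
```

In 2005 the market rose on 141 of the 252 test days, so always predicting "Up" would be right about 56% of the time. The two-lag model's 57% should be read against that figure as well as against 0.50.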