Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

number of residuals from auto_arima is different from what I get from R #181

Closed
fissehab opened this issue Aug 14, 2019 · 15 comments
Closed

Comments

@fissehab
Copy link

fissehab commented Aug 14, 2019

I am comparing results of auto ARIMA with R (forecast package) and Python (pmdarima package). One of the issues I am getting is the length of the residuals in R and Python are different when d is not zero. For example, in the code shown below, in R, the length of the residuals is the same as the time series data but in python it is the length of the time series data minus d. Further, I also see significant changes in the order I get from R vs Python. What are the causes of the differences?
R

set.seed(250)
ts1 = arima.sim(list(order = c(1,1,0), ar = 0.7), n = 200)
set.seed(270)
ts2 = arima.sim(list(order = c(1,2,0), ar = 0.7), n = 200)
my_fit1 <- auto.arima(ts1, stepwise = FALSE)
resid1 = as.numeric(residuals(my_fit1))
my_fit2 <- auto.arima(ts2, stepwise = FALSE)
resid2 = as.numeric(residuals(my_fit2))
#to use with Python
 ts1 = data.frame(ts1 = ts1)
 write.csv(ts1, 'ts1.csv', row.names = F)
 ts2 = data.frame(ts2 = ts2)
 write.csv(ts2, 'ts2.csv', row.names = F)

Python

from pmdarima.arima import auto_arima
ts1 = pd.read_csv('ts1.csv')
ts2 = pd.read_csv('ts2.csv')
my_fit1 = auto_arima(y = ts1.ts1, 
                  start_p=1, 
                  start_q=1,
                  max_p=3, 
                  max_q=3,
                  seasonal=False,
                  d = None,  
                  error_action='ignore',  
                  suppress_warnings=True, 
                  stepwise=False)

 my_fit2 = auto_arima(y = ts2.ts2, 
                  start_p=1, 
                  start_q=1,
                  max_p=3, 
                  max_q=3,
                  seasonal=False,
                  d = None,  
                  error_action='ignore',  
                  suppress_warnings=True, 
                  stepwise=False)```
@tgsmith61591
Copy link
Member

So this seems like a duplicate of #175. We currently don't have a way of backing into residuals for differenced arrays based on the way statsmodels computes its time series models. It actually stores the differenced endogenous array internally, so you're probably getting d fewer residuals. I kind of wonder whether we could back into this if we used something like mentioned in #180... in any case, there are some improvements coming in 1.3.0 that may help but this is still an open problem.

Regarding

Further, I also see significant changes in the order I get from R vs Python. What are the causes of the differences?

Can you please provide a minimal reproducible dataset? Otherwise I can't help too much...

@fissehab
Copy link
Author

Thank you. Is there a way to get the fitted values without the residuals. I want the residuals to get the fitted values (fitted values = actual values- residuals)? If there is a way to get the fitted values directly without using the residuals, it will solve my problem. Final question, are the coefficients for the original data or for the differenced data when d>=1?

@fissehab
Copy link
Author

Regarding providing a minimal reproducible dataset, if you run the code above and see the orders they are different in R and Python. I generated data with R and saved it to use with Python.

@aaronreidsmith
Copy link
Member

@tgsmith61591 I will let you answer from an algorithm standpoint, but I have expanded the example, and attached the CSVs from @fissehab's R code to help you out.

Files

ts1.csv
ts2.csv

R

> library('forecast')
> set.seed(250)
> ts1 = arima.sim(list(order = c(1,1,0), ar = 0.7), n = 200)
> set.seed(270)
> ts2 = arima.sim(list(order = c(1,2,0), ar = 0.7), n = 200)
> my_fit1 <- auto.arima(ts1, stepwise = FALSE)
> arimaorder(my_fit1)
p d q 
1 1 0 
> resid1 = as.numeric(residuals(my_fit1))
> resid1
  [1]  0.00000000 -0.60364095  0.23826623  0.55722769 -0.23020760 -0.41220075
  [7]  0.78253425 -1.87075189  1.19027884 -1.11793094  1.44801661  0.58545044
 [13] -0.80168133  0.14150800 -1.26133487  0.69509677  0.37075834 -0.15454742
 [19] -0.07506156 -0.65475479 -0.39363677  0.09495343 -2.11172543 -1.39594151
 [25]  2.26873247  0.76211905  2.62624271  0.20756791 -0.17519514  0.15366845
 [31] -0.75394396  0.40269526  1.30275568  0.03974311 -0.86914667  0.74654955
 [37] -1.42677596 -0.11764568 -0.36827530  0.63085923 -0.35568819  0.53503752
 [43] -1.11812136 -1.20394944  0.03298737 -0.46537277  1.46181665  0.55329102
 [49]  0.65954289  1.07228447 -0.32819218 -0.60029496  0.95860825  0.26066664
 [55] -2.09685486  0.27539522  0.18676738  1.31785804  0.74285905  0.85642125
 [61]  1.94737305  1.38570623 -1.26571979 -0.25598437  0.83014839 -0.13828312
 [67] -1.25833457  0.52988298 -0.18840990 -1.62258951 -1.69961377  1.26925885
 [73] -1.23917939 -0.99038214  0.56235865  0.31754259 -1.62379398  0.70405516
 [79]  0.33092129 -0.03768896  1.52509539  0.63282506  1.15508415  0.15291803
 [85]  0.22988662 -0.29258460 -1.27369507  0.45285668 -0.11829754 -0.05201585
 [91]  2.11289020  0.04490440 -2.17809348 -0.57417372  0.51716956 -0.79458069
 [97]  0.89607635  0.09921942 -0.80588473 -0.87131186  1.74297724  0.09051064
[103]  0.61534240 -0.33864617  1.52146018 -0.83787263 -0.12031748 -0.18325604
[109] -1.27093387 -0.57869311 -1.43018033 -0.59291123  0.03099888  0.45575181
[115] -0.11544744 -0.41749196  0.51077514 -0.98883678  1.16138394  0.54893279
[121]  1.42583180 -0.61288395  0.76596722  0.83529175  0.10463794  0.69803678
[127]  0.81147072  0.16419393  0.18308735  0.63020989  0.08309077 -0.59091896
[133] -1.17471453 -2.00908007 -1.37471413  2.24625050 -2.66882416 -0.12291356
[139]  1.56299532  0.79906197  1.03739653  0.67977696  0.73662729 -0.73744140
[145]  2.11993889 -0.66426538 -0.46323626 -0.09076647 -0.87879961  0.55516104
[151]  0.20391274 -0.45673926  2.68615995  0.16879682  1.53363369  0.02861250
[157]  0.20392584 -0.23056804  1.41893053  2.46470838  0.69759568  0.32439179
[163] -1.44719100 -0.22394274  0.46879718 -0.54604309 -0.02741070 -0.77379035
[169]  1.00404773 -2.29133983 -0.14150747  0.81702877 -1.06178458  0.84844974
[175] -1.91793674 -0.99835482 -1.95073835 -0.64555812 -0.22738434  0.61966177
[181]  1.37577810 -1.83137609 -0.41625198 -0.68598856  0.56554510  0.21220398
[187]  0.65734704  0.24893157 -0.89582415 -0.24703034  0.41341682  1.05832060
[193]  0.31943008 -1.42929071 -0.55284699 -0.29899667  0.41514233  0.80769373
[199] -0.77864033 -0.63165443  1.26494252
> my_fit2 <- auto.arima(ts2, stepwise = FALSE)
> arimaorder(my_fit2)
p d q 
1 2 1 
> resid2 = as.numeric(residuals(my_fit2))
> resid2
  [1]  0.000000000  0.000000000  0.553135704  0.078884889 -0.772116810
  [6]  1.191752479 -0.896232626 -2.062970105 -0.726017862  0.063279693
 [11]  0.534729849 -0.020860066 -0.518894344 -1.407545106  1.629950431
 [16]  0.237475847  0.929761428  0.285019547  0.002373432 -1.066614459
 [21]  0.170374898  0.826748954  0.453558464  0.806905185 -0.201452123
 [26]  0.927365195 -0.315056029 -1.605456579  1.001293851 -0.583167870
 [31] -0.065537958  0.083631143 -0.943534835 -0.897012857  0.711878920
 [36] -1.667809795 -1.275387241 -0.776376336  0.156093004  0.385574839
 [41] -1.054707179  0.892523106 -0.118777249  0.002475682  0.321135527
 [46]  1.749487912  1.020931367  1.302966504 -0.334985031  0.948290240
 [51] -0.073429529  0.787138932 -0.003783066  1.471352449  0.022863341
 [56]  0.952155901 -0.656101628  0.354626629 -0.283058924  0.389940980
 [61] -1.516544803 -3.240804108 -0.534250486 -1.476116438 -1.722021722
 [66]  0.370818601  0.364762301  1.626126667 -0.713186461  0.653185814
 [71] -1.262779853 -0.386501740  1.658904794 -1.083788893  0.750618921
 [76]  1.394887238  0.807662624 -0.323083799  0.133315768  1.094482551
 [81]  0.794963951 -1.044287942 -1.467871008 -0.520638586 -0.292278329
 [86] -0.805503298 -0.682879043  1.010176905  1.536099988 -0.011595607
 [91] -1.199649546  0.281641646  1.029089762  1.361362113 -1.048213240
 [96]  0.546301033  0.561122104 -0.352473797 -0.932481643 -0.748541698
[101]  1.120561884 -0.736398291 -0.792149625  1.265344789 -0.519407130
[106]  0.452189280 -0.874618620  0.702206300  1.505288803  1.559516107
[111] -1.179830908  0.338015212  0.597624322  0.580372875 -1.107022221
[116]  0.555363872 -1.225402072 -0.783824198  0.689679399  1.590894096
[121] -1.133090782  1.412550135 -0.900777500 -0.082785372  0.859001741
[126]  0.928419242  1.134220525  0.754978551 -0.973236489  0.599800489
[131] -0.132953065 -0.276302693  0.894958563  0.525032841  1.342282351
[136] -0.462901967 -0.085348825  0.454797338 -0.306347744 -0.285569856
[141]  0.235856099  0.093424147 -1.031979489 -1.663810205  0.365205998
[146]  0.953019554  0.593837091 -0.298025980  2.177322013 -1.598080966
[151] -1.597428462  0.717544322 -0.338993484 -1.339025519 -0.685849756
[156] -0.962702048  1.418138903  0.276621679 -1.961510365  1.236341138
[161]  0.587880092  0.753530428 -0.481849596  0.904094056 -1.013752658
[166] -1.771149402  1.229417875 -0.672951165 -1.554274939 -1.143091548
[171]  0.553663878  1.577028296 -0.582703829  1.220492864 -0.769572292
[176]  0.703472326 -1.164729050  1.687470894 -0.531875777  0.096395906
[181] -1.808910934 -0.582637245 -0.523091826  0.300103475 -0.960595596
[186]  2.297750029  0.444818093 -1.610805595 -1.501848340 -0.551919250
[191]  1.145908843  1.241342836 -1.824136040  0.059812317 -1.303889251
[196]  0.387169543 -0.430681201 -0.322139411 -2.135068754  0.412077770
[201] -1.522639052 -2.161560121
> ts1 = data.frame(ts1 = ts1)
> write.csv(ts1, 'ts1.csv', row.names = F)
> ts2 = data.frame(ts2 = ts2)
> write.csv(ts2, 'ts2.csv', row.names = F)

Python

>>> import pandas as pd
>>> from pmdarima.arima import auto_arima
>>> ts1 = pd.read_csv('ts1.csv')
>>> ts2 = pd.read_csv('ts2.csv')
>>> my_fit1 = auto_arima(y=ts1.ts1, 
...                      start_p=1, 
...                      start_q=1,
...                      max_p=3, 
...                      max_q=3,
...                      seasonal=False,
...                      d=None,  
...                      error_action='ignore',  
...                      suppress_warnings=True, 
...                      stepwise=False)
>>> my_fit1.order
(2, 1, 2)
>>> my_fit1.resid()
array([-9.14538020e-01,  1.93464169e-01,  5.24371064e-01, -2.92925696e-01,
       -5.18102136e-01,  6.89138707e-01, -1.92777588e+00,  1.04641618e+00,
       -1.13247250e+00,  1.31099867e+00,  5.36093529e-01, -9.33193359e-01,
       -2.61315078e-02, -1.35341228e+00,  5.62208818e-01,  3.23055935e-01,
       -2.56341826e-01, -2.10124293e-01, -7.78631506e-01, -5.34063831e-01,
       -2.81301011e-02, -2.22997218e+00, -1.62061311e+00,  2.08168716e+00,
        6.67401742e-01,  2.36174743e+00, -2.95062490e-04, -4.78868268e-01,
       -7.95617995e-02, -8.96236584e-01,  2.68204549e-01,  1.25381087e+00,
        2.59122000e-02, -9.40936265e-01,  6.72041857e-01, -1.39607346e+00,
       -1.88237903e-01, -3.62619233e-01,  6.00114482e-01, -3.68871221e-01,
        4.44813850e-01, -1.17361653e+00, -1.34713923e+00, -1.05476073e-01,
       -5.65589374e-01,  1.28761411e+00,  4.29824178e-01,  4.51267661e-01,
        8.75801447e-01, -4.80762914e-01, -7.85236776e-01,  8.14348124e-01,
        2.22830032e-01, -2.16703941e+00,  1.12453174e-01,  1.64543106e-01,
        1.25540831e+00,  7.06172258e-01,  7.75858444e-01,  1.89455659e+00,
        1.42475300e+00, -1.20592661e+00, -2.44709101e-01,  9.87742863e-01,
        1.10515575e-01, -1.04865152e+00,  7.05049200e-01,  8.80124260e-02,
       -1.40753141e+00, -1.56989975e+00,  1.38072022e+00, -1.03004202e+00,
       -9.92185645e-01,  5.35830220e-01,  3.20254047e-01, -1.69565097e+00,
        4.99317668e-01,  2.30276822e-01, -2.02685833e-01,  1.31931584e+00,
        5.03635370e-01,  9.80913068e-01,  4.21565533e-02,  1.15157303e-01,
       -3.42533852e-01, -1.30229213e+00,  4.08596706e-01, -5.96888514e-02,
       -4.58378958e-02,  2.10906641e+00,  1.43844338e-01, -2.19172583e+00,
       -6.62125322e-01,  5.45546633e-01, -7.38654656e-01,  8.38459924e-01,
        1.05828275e-01, -8.71155353e-01, -9.89231359e-01,  1.62111435e+00,
        8.67467996e-02,  4.91681918e-01, -4.22675070e-01,  1.40624041e+00,
       -8.28313058e-01, -2.20171821e-01, -2.07097737e-01, -1.28579546e+00,
       -6.53110878e-01, -1.48003301e+00, -7.29491219e-01, -1.13993685e-01,
        2.71657248e-01, -3.45193780e-01, -7.24522597e-01,  1.65722356e-01,
       -1.29986193e+00,  7.50757296e-01,  2.46864814e-01,  1.07197921e+00,
       -9.09203788e-01,  3.88663220e-01,  5.82687708e-01, -1.13640149e-01,
        4.76265114e-01,  6.66810232e-01,  6.44259755e-02,  9.14453151e-02,
        5.88158977e-01,  1.04871011e-01, -5.66836931e-01, -1.15393751e+00,
       -1.99852351e+00, -1.41606643e+00,  2.19538125e+00, -2.60561076e+00,
       -4.15702414e-01,  1.37727224e+00,  6.40497831e-01,  7.86005328e-01,
        4.43024921e-01,  5.11265280e-01, -9.09830321e-01,  1.92113663e+00,
       -6.54437362e-01, -5.75093869e-01, -1.36377939e-01, -8.74536898e-01,
        5.20006723e-01,  2.39965496e-01, -4.62443126e-01,  2.63519378e+00,
        2.90901799e-01,  1.52137984e+00,  1.48199054e-01,  3.01543427e-01,
       -5.14341118e-02,  1.62868465e+00,  2.80421117e+00,  1.12419973e+00,
        7.11614556e-01, -9.83000574e-01,  2.31689031e-01,  1.05120074e+00,
        8.21582203e-02,  5.36102251e-01, -1.85103651e-01,  1.53671091e+00,
       -1.67855488e+00,  2.60659770e-01,  1.32153145e+00, -5.58644381e-01,
        1.19774854e+00, -1.48691621e+00, -7.53739567e-01, -1.67825080e+00,
       -4.85561825e-01, -7.72185571e-02,  6.88235934e-01,  1.39716245e+00,
       -1.83435110e+00, -6.25850082e-01, -8.15777532e-01,  3.87956568e-01,
        6.09959713e-02,  4.43642400e-01,  4.06122799e-02, -1.12948633e+00,
       -5.32393800e-01,  1.74436469e-01,  8.44546164e-01,  1.26966200e-01,
       -1.66073353e+00, -8.58371590e-01, -5.37814226e-01,  1.70941117e-01,
        5.69655669e-01, -1.02394265e+00, -9.75773478e-01,  9.34447674e-01])
>>> my_fit2 = auto_arima(y=ts2.ts2, 
...                      start_p=1, 
...                      start_q=1,
...                      max_p=3, 
...                      max_q=3,
...                      seasonal=False,
...                      d=None,  
...                      error_action='ignore',  
...                      suppress_warnings=True, 
...                      stepwise=False)
>>> my_fit2.order
(3, 2, 2)
>>> my_fit2.resid()
array([ 0.93498424,  0.11865935, -0.74539659,  1.18472186, -0.84989407,
       -2.07898243, -0.65103113,  0.01591816,  0.53588411,  0.309986  ,
       -0.35468874, -1.5608997 ,  1.61819017,  0.54450884,  0.91216129,
        0.34138055,  0.04778707, -1.06332563,  0.08631281,  0.87614061,
        0.52674929,  0.88340722, -0.14389948,  0.85723075, -0.27193337,
       -1.62392745,  1.00179547, -0.53456754, -0.1381289 ,  0.32301899,
       -0.90361638, -1.03319164,  0.8584496 , -1.47448276, -1.39528868,
       -0.63715174,  0.29869234,  0.44270045, -0.92199649,  1.05853092,
       -0.02445317, -0.12868128,  0.52885401,  1.8909318 ,  0.90287131,
        1.284364  , -0.08951854,  0.77761604, -0.27268465,  0.82464474,
        0.19805207,  1.25957484, -0.03232757,  1.08332586, -0.56617578,
        0.05424107, -0.21702669,  0.54297404, -1.55229714, -3.40037476,
       -0.36412127, -1.27776819, -1.88508363,  0.69261049,  0.79503463,
        1.49262823, -0.59685356,  0.95254585, -1.20456255, -0.7753768 ,
        1.9091864 , -0.75539544,  0.45743979,  1.48527013,  1.01771099,
       -0.36607227,  0.00704966,  1.16703819,  0.8219478 , -1.12871576,
       -1.44713619, -0.4547927 , -0.39195356, -0.7827503 , -0.3986085 ,
        1.11149303,  1.44836935,  0.12766762, -0.94883314,  0.1802624 ,
        0.83712718,  1.5077454 , -0.76929138,  0.36274145,  0.47086796,
       -0.27351047, -0.81009948, -0.82348303,  1.04339551, -0.54340087,
       -0.71608811,  1.30810891, -0.51056589,  0.45613281, -0.61290227,
        0.58790022,  1.47241659,  1.7185846 , -1.05461664,  0.16908157,
        0.59288033,  0.55928973, -1.04305812,  0.56017679, -1.2062302 ,
       -0.90193199,  0.89157858,  1.69748227, -1.20174913,  1.49020802,
       -0.61467375, -0.42063414,  0.90665987,  1.20533816,  1.07132981,
        0.6365093 , -0.84098777,  0.6182845 , -0.27519097, -0.39420656,
        1.16169109,  0.58940555,  1.0872276 , -0.29210526,  0.06531636,
        0.29699429, -0.48434988, -0.07645545,  0.40779051, -0.13670244,
       -1.05051871, -1.34366978,  0.35614509,  0.76706391,  0.79550408,
        0.09043893,  2.0313036 , -1.76704674, -1.50855526,  1.03146735,
       -0.62845911, -1.52246402, -0.08430522, -0.85495369,  0.9015557 ,
        0.65746577, -1.43956086,  0.95173401,  0.39331314,  1.05351368,
       -0.11704104,  0.54070703, -1.13058082, -1.46892611,  1.33434914,
       -0.94006593, -1.56265535, -0.57374688,  0.46179756,  1.22270341,
       -0.11840036,  1.64146156, -1.0830073 ,  0.33026865, -0.54255066,
        1.71584151, -0.93671821,  0.12367594, -1.18606485, -0.92092432,
       -0.86176041,  0.76895933, -0.55705392,  1.88423762,  0.54522791,
       -1.21461875, -1.49894922, -0.92082694,  1.24368024,  1.71859071,
       -1.86053481, -0.18252227, -0.96489466,  0.37905162, -0.50170363,
       -0.21354372, -1.83719068,  0.30092296, -1.51112236, -2.01598809])

@fissehab
Copy link
Author

From the orders @aaronreidsmith has shown above, the R results are more close to the correct values than the Python orders.

@tgsmith61591
Copy link
Member

It's because you have stepwise=False in the python code. When stepwise is False, the model fits are akin to RandomizedSearchCV in scikit-learn. Without seeding the random state, you can't guarantee deterministic results. When stepwise=True you get the exact same results (and it's much faster).

my_fit1 = pm.auto_arima(y=ts1, 
                        start_p=1, 
                        start_q=1,
                        max_p=3, 
                        max_q=3,
                        seasonal=False,
                        d=None,  
                        trace=True,
                        error_action='ignore',  
                        suppress_warnings=True, 
                        stepwise=True)

my_fit2 = pm.auto_arima(y=ts2, 
                        start_p=1, 
                        start_q=1,
                        max_p=3, 
                        max_q=3,
                        seasonal=False,
                        d=None,
                        trace=True,
                        error_action='ignore',  
                        suppress_warnings=True, 
                        stepwise=True)

Run the above and you'll get:

>>> my_fit1.order
(1, 1, 0)
>>> my_fit2.order
(1, 2, 1)

@talaowis
Copy link

talaowis commented May 9, 2020

@tgsmith61591
how can I get accurate residuals of the same size as the dataset from python?

ARIMA Python Residuals are not accurate and the problem I am also facing is that I am fitting a set of 57 row but the prediction size is 55 ?

Noting That the dataset is a daily time series

"""Here is the code""""
def evaluate_arima_model(X, arima_order,date):
# prepare training dataset
train_size = int(len(X) * 0.75);

train, test = X[0:train_size], X[train_size:];
history = [x for x in train]
predictions = list()
for t in range(len(test)):
    difference(X, interval=arima_order[1])
    model = ARIMA(history, order=arima_order,dates=date.values,freq='D')
    model_fit = model.fit(method="css");
    yhat = model_fit.forecast()[0]
    predictions.append(yhat)
    history.append(test[t])
# calculate out of sample error
predictions=[[a[0]] for a in predictions]

RMSE = sqrt(mean_squared_error(test, predictions));
MAPE=mean_absolute_percentage_error(test, predictions)

MSLE=msle(test, predictions)
RMSLE=rmsle(test, predictions)
R2=r2_score(test, predictions)
return RMSE,MAPE,MSLE,RMSLE,R2,model_fit,train, test,history

def evaluate_models(resultsdf,index,dataset,datasetAS_DF):
f= open(path+"country_order_RMSE.txt","a+")
p = d = q = range(0, 3)
bhistory=None
bestMAPE,bestMSLE,bestRMSLE,bestR2=float("inf"),float("inf"),float("inf"),float("inf")
# Generate all different combinations of p, q and q triplets
pdq = list(itertools.product(p, d, q))
dataset = dataset.astype('float32')
best_score, best_cfg,best_model = float("inf"), None, None
for order in pdq:
try:

        print("hello trying to evaluate");rmse,MAPE,MSLE,RMSLE,R2,model_fit,train, test,history = evaluate_arima_model(dataset, order,datasetAS_DF['Date']);print("rmse",rmse)
        if rmse < best_score:
            best_score, best_cfg = rmse, order
            best_model=model_fit
            bestMAPE,bestMSLE,bestRMSLE,bestR2=MAPE,MSLE,RMSLE,R2
            bhistory=history
        print('ARIMA%s MSE=%.3f' % (order,rmse))
    except:
        print("ARIMA IS NOT APPROPRIATE")
best_model.save(path+'models/'+index+'/ARIMAmodel.pkl')
print("HISTOOOOOOOORY")
print(len(bhistory))
print("RESIDUALLLLS",best_model.resid)
resid_arr=best_model.resid
fittedvalues=best_model.fittedvalues
residuals = pd.DataFrame(best_model.resid,columns=["Residuals"])
residuals.plot()
print("FITTTTTTTTTTED")
print(len(datasetAS_DF))               **############dataset size is 57**
print(len(resid_arr))                            **############residuals size is 55**

@tgsmith61591
Copy link
Member

@talaowis I don't even think you're using our package. It looks like you're fitting a stats models ARIMA directly.

@talaowis
Copy link

talaowis commented May 9, 2020

Hello @tgsmith61591 ,now I have switched to use your package put it is the same !!
the dataset size is not the same as the residuals result size
and the residuals are not accurate

def evaluate_arima_model(X, arima_order,date):
# prepare training dataset
train_size = int(len(X) * 0.75);

train, test = X[0:train_size], X[train_size:];
history = [x for x in train]
# make predictions
predictions = list()
for t in range(len(test)):
    model = ARIMA(order=arima_order)
    model_fit = model.fit(history);print("code breaks!!!")
    yhat = model_fit.predict(1)
    print("yhaaaaaaaaaaaaaaaaaaaaat",yhat)
    predictions.append(yhat)
    history.append(test[t])
# calculate out of sample error
predictions=[[a[0]] for a in predictions]

print("predictions",predictions);
RMSE = sqrt(mean_squared_error(test, predictions));
MAPE=mean_absolute_percentage_error(test, predictions)
MSLE=msle(test, predictions)
RMSLE=rmsle(test, predictions)
R2=r2_score(test, predictions)
return RMSE,MAPE,MSLE,RMSLE,R2,model_fit,train, test,history

def evaluate_models(resultsdf,index,dataset,datasetAS_DF):
f= open(path+"country_order_RMSE.txt","a+")
p = d = q = range(0, 3)
bhistory=None
bestMAPE,bestMSLE,bestRMSLE,bestR2=float("inf"),float("inf"),float("inf"),float("inf")
# Generate all different combinations of p, q and q triplets
pdq = list(itertools.product(p, d, q))
dataset = dataset.astype('float32')
best_score, best_cfg,best_model = float("inf"), None, None
for order in pdq:
try:

        rmse,MAPE,MSLE,RMSLE,R2,model_fit,train, test,history = evaluate_arima_model(dataset, order,datasetAS_DF['Date']);
        if rmse < best_score:
            best_score, best_cfg = rmse, order
            best_model=model_fit
            bestMAPE,bestMSLE,bestRMSLE,bestR2=MAPE,MSLE,RMSLE,R2
            bhistory=history
        print('ARIMA%s MSE=%.3f' % (order,rmse))
    except:
        print("ARIMA IS NOT APPROPRIATE")
# Serialize with Pickle
with open(path+'models/'+index+'/ARIMAmodel.pkl', 'wb') as pkl:
    pickle.dump(best_model, pkl)
print("HISTOOOOOOOORY")
print(len(bhistory))
print("RESIDUALLLLS",best_model.resid)
resid_arr=best_model.resid()
residuals = pd.DataFrame(resid_arr,columns=["Residuals"])
residuals.plot()
print("FITTTTTTTTTTED")
**print(len(datasetAS_DF)) #dataset size is 57
print(len(resid_arr)) residuals size is 56 **
**predictions = list(best_model.predict_in_sample()) #that is how i get the predictions**
print(len(predictions))
**#fill first resduals with 0-that's because ARIMA does not return 0 residuals
while len(datasetAS_DF)!=len(resid_arr) and len(resid_arr)<len(datasetAS_DF):
    resid_arr=np.concatenate([[0],resid_arr])
    predictions.insert(0,0)**

residuals = pd.DataFrame(resid_arr,columns=["Residuals"])
predictions = pd.DataFrame(predictions,columns=["predictions"])
datasetAS_DF=datasetAS_DF.reset_index(drop=True)
datasetAS_DF=pd.concat([datasetAS_DF,residuals],axis=1)
datasetAS_DF=pd.concat([datasetAS_DF,predictions],axis=1)
datasetAS_DF.to_csv(path+'models/'+index+'/AfterARIMA.csv')
plt.savefig(path+'models/'+index+"/residuals.png")
plt.clf()
residuals.plot(kind='kde')
plt.savefig(path+'models/'+index+"/KDE_residuals.png")
plt.clf()
print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))
#resultsdf=pd.DataFrame([{"RMSE":"","MAPE":"","MSLE":"","RMSLE":"","R2":""}
resultsdf.loc[index,"RMSE"]=rmse
resultsdf.loc[index,"MAPE"]=MAPE
resultsdf.loc[index,"MSLE"]=MSLE
resultsdf.loc[index,"RMSLE"]=RMSLE
resultsdf.loc[index,"R2"]=R2
f.write("Country   "+index+"   should have this order  "+str(best_cfg)+"   ,(RMSE)Loss Function Value= "+str(best_score)+"\n \n")
f.close()

@talaowis
Copy link

talaowis commented May 9, 2020

Here is a sample of the produced results :

Date Actual Residuals predictions
3/12/2020 0 0 0
3/13/2020 0 -0.14083 0.140829
3/14/2020 0 -0.14083 0.140829
3/15/2020 0 -0.14842 0.148417
3/16/2020 0 -0.14885 0.148849
3/17/2020 0 -0.14887 0.148873
3/18/2020 0 -0.14887 0.148874
3/19/2020 0 -0.14887 0.148874
4/30/2020 1 -0.14887 0.148874
5/1/2020 1 0.851126 0.148874
5/2/2020 3 -0.09483 1.094834
5/3/2020 3 1.854046 1.145954
5/4/2020 3 -0.04064 3.040636
5/5/2020 3 -0.14302 3.143025
5/6/2020 8 -0.14856 3.148558
5/7/2020 8 4.851143 3.148857

@talaowis
Copy link

talaowis commented May 9, 2020

the data set I have posted as an example above is not the whole one , I have just taken a part to show you the results
note that arima order is (0,1,1) the first row I have inserted zeros because the residuals size does not match the dataset size
the other bold rows show you that the residuals results are not accurate

Date Actual Residuals predictions
**3/12/2020 0 0 0**
3/13/2020 0 -0.14083 0.140829
3/14/2020 0 -0.14083 0.140829
3/15/2020 0 -0.14842 0.148417
3/16/2020 0 -0.14885 0.148849
3/17/2020 0 -0.14887 0.148873
3/18/2020 0 -0.14887 0.148874
3/19/2020 0 -0.14887 0.148874
**4/30/2020 1 -0.14887 0.148874**
5/1/2020 1 0.851126 0.148874
**5/2/2020 3 -0.09483 1.094834**
5/3/2020 3 1.854046 1.145954
5/4/2020 3 -0.04064 3.040636
5/5/2020 3 -0.14302 3.143025
**5/6/2020 8 -0.14856 3.148558**
5/7/2020 8 4.851143 3.148857

@tgsmith61591
Copy link
Member

You have differencing in your order (0, 1, 1). When you fit an ARIMA with differencing, you fit n - d samples, so you'll get n - d residuals back. That appears to be exactly what's happening, and that's perfectly expected.

@talaowis
Copy link

talaowis commented May 9, 2020

then the predicted value assumes no inverse difference ?

@tgsmith61591
Copy link
Member

When you produce in-sample predictions, the model predicts as many values as it was trained on. In this case, n - d, and the predictions start at d, in this case 1. Any residuals before that would be nan anyways, so they are not meaningful.

@talaowis
Copy link

talaowis commented May 9, 2020

ok thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants