Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP ENH: Add stats.control_charts #5506

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

shangyian
Copy link

Taking over this PR (#4191) to build control charts: #4191

@josef-pkt
Copy link
Member

there is one unrelated failure in the test run for the last python 3.6 on travis
two ci versions fail because I didn't pep-8 and lint clean my code
e,g, at the bottom of this https://travis-ci.org/statsmodels/statsmodels/jobs/495212888

Thanks for taking over.
As I briefly mentioned, I would like to merge an intial version as soon as the basic charts are ok, and leave extra options and variations or extensions of the chart to follow-up PRs.

In my experience (especially including my own work), open ended PRs with feature creep are difficult to merge, and might stall for some time, (even though those stalled and too wide reaching PRs are often useful to figure out the general pattern and design),
And given the topic and the literature, there are a very large number of extensions that we could be doing.

@codecov
Copy link

codecov bot commented Feb 26, 2019

Codecov Report

Merging #5506 into master will decrease coverage by 0.17%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5506      +/-   ##
==========================================
- Coverage      82%   81.82%   -0.18%     
==========================================
  Files         586      587       +1     
  Lines       92398    92600     +202     
  Branches    10250    10277      +27     
==========================================
  Hits        75768    75768              
- Misses      14282    14480     +198     
- Partials     2348     2352       +4
Impacted Files Coverage Δ
statsmodels/stats/control_chart.py 0% <0%> (ø)
statsmodels/sandbox/regression/treewalkerclass.py 0% <0%> (ø) ⬆️
statsmodels/stats/libqsturng/make_tbls.py 0% <0%> (ø) ⬆️
statsmodels/tsa/statespace/mlemodel.py 86.19% <0%> (ø) ⬆️
statsmodels/iolib/summary.py 54.94% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db49000...67a99e1. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 26, 2019

Codecov Report

Merging #5506 into master will increase coverage by 0.02%.
The diff coverage is 92.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5506      +/-   ##
==========================================
+ Coverage   82.05%   82.07%   +0.02%     
==========================================
  Files         586      588       +2     
  Lines       92741    93027     +286     
  Branches    10281    10307      +26     
==========================================
+ Hits        76095    76353     +258     
- Misses      14290    14312      +22     
- Partials     2356     2362       +6
Impacted Files Coverage Δ
statsmodels/stats/tests/test_control_chart.py 100% <100%> (ø)
statsmodels/stats/control_chart.py 89.75% <89.75%> (ø)
statsmodels/regression/mixed_linear_model.py 89.46% <0%> (-0.78%) ⬇️
statsmodels/stats/libqsturng/qsturng_.py 89.05% <0%> (+0.99%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a33ef0...24a7977. Read the comment docs.

@shangyian
Copy link
Author

Here's a Jupyter Notebook that shows what the various control charts look like in practice: https://gist.github.com/shangyian/041128898e7a6a010f2efe336ebfdd09

The tests were all verified using two R packages: https://cran.r-project.org/web/packages/mvdalab/ and https://cran.r-project.org/web/packages/qcc/

@coveralls
Copy link

coveralls commented Feb 26, 2019

Coverage Status

Coverage increased (+0.07%) to 84.617% when pulling 24a7977 on shangyian:control_charts into 7a33ef0 on statsmodels:master.

@shangyian
Copy link
Author

@josef-pkt let me know if the changes look okay!

I'm not sure why the AppVeyor build failed - doesn't seem related to any changes in this branch. I'll look into it.

@josef-pkt
Copy link
Member

I looked through parts of your changes earlier today.
I think it looks good overall, but there are several design decisions that I'm not sure about, and need to think more or try them out.
The main design decision that surprised me is to have the out of sample, future, data in the __init__. I can see some usecases for that, but I thought the main usecase would be to add future at a later stage.
Given that you allow that, it might be fine this way.
The main change is that additing new observations to the attached data should be optional (i.e. needs a keyword.)

The second surprise was the distinction between phase1 and phase2 limits, but then I saw that that was already in my comments on multivariate mean chart.

(So, I have to think more in order to see if the design has problems or not.)

Two github comments:
You dropped my initial commit, so we don't see what the changes compared to my version are. Although, there are so many changes and moving of code, that the changeset wouldn't be very informative anyway.

Second, don't merge statsmodels master into your feature branch, that makes the github commit history a bit messy. If you need to, then rebase on master. We usually rebase a feature branch when it falls too far behind master.

@shangyian
Copy link
Author

Yeah, I wasn't sure if it was necessary to have the in & out sample points in init, but I figured keeping some flexibility there would be good.

Another thing that came to mind - the phase 1 vs 2 structure is slightly confusing - technically we should be able to update both phase 1 and phase 2 with more data, but the current setup only allows for updating (appending) phase 2 with new data points. Do you think this is a use case that may prove necessary?

Sorry about the commit overwrite! I was trying to clean up a bunch of commits I made (almost all of them related to linting), but messed up the history there. I'll keep that in mind about rebasing vs merging.

@josef-pkt
Copy link
Member

about updating the stored data:
I have some comments about it in my initial code or issue.

  • usage without updating: this is for example needed if we want to simulate different phase 2 samples and check the behavior, e.g. time to signal, assuming all start with the same phase 1 sample or estimated parameters
  • We want to update the phase 2 sample if we do online monitoring and want to add new observations to an existing plot, or more precisely make a new plot with the an extended sample - this is what you have right now, AFAICS
  • We want to update the phase 1 sample if we want to re-estimate the parameters with a sample that has newly arrived. - this is what my update method does.
    Because we are back to phase 1, this requires that we handle out-of-control observations e.g. by only adding "clean" new observations (that are supposed to be in-control)

A variation of this will be in self-updating or continuously updating control charts, but we might not want to worry about those for now.

@josef-pkt
Copy link
Member

josef-pkt commented Feb 28, 2019

another use case that I don't know yet how we should handle it: using different statistics in phase 1 and phase 2.
My guess is that the best would be to work with two separate chart instances for phase 1 and phase 1.

example: I was reading/skimming some of the articles on robust estimators in phase 1. My impression (given that I did not confirm all the small print) is that they use the same robust estimators (median, trimmed mean, or mad, iqr for variance) also in phase 2.
However, I don't think that is necessary, i.e. we can take a robust estimator for mean and variance in phase 1, but use a standard mean, variance chart in phase 2 using the robust parameter estimates.
For standard charts all we need are the upp and low limits, so we can simulate phase 2 cases by just taking some (robust) estimates from phase 1 as given.

@shangyian
Copy link
Author

We want to update the phase 1 sample if we want to re-estimate the parameters with a sample that has newly arrived. - this is what my update method does.
Because we are back to phase 1, this requires that we handle out-of-control observations e.g. by only adding "clean" new observations (that are supposed to be in-control)

Good point. The ability to update phase 1 data is what's missing from my current implementation. And the need to remove out-of-control observations from phase 1. I'll try to add that later today.

another use case that I don't know yet how we should handle it: using different statistics in phase 1 and phase 2.

With the way that it's currently implemented, I believe we can just add a different set of statistics in _ControlChartDataHolder, similar to how there are currently different upper and lower boundaries depending on the phase. I think we can postpone the implementation of features to support online updates for control charts to a different PR.

@josef-pkt josef-pkt modified the milestones: 0.11, josef Oct 5, 2019
@josef-pkt josef-pkt added this to review/finalize in comp-stats-hypo May 10, 2020
distr,
alpha)
self.center = self.endog.mean(0)
self.std = self.endog.std(0, ddof=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to go into a method that can be overwritten by subclasses

@josef-pkt
Copy link
Member

josef-pkt commented Nov 26, 2022

skimming the code again:

code looks good overall, docstrings don't follow numpy standard and are incomplete
We need attribute chart(s), Poisson, Binomial, as subclasses to see whether we need changes for inheritance in base class and/or mean class. (#8533 new design issue)
Also, xbar/batches chart as subclass.
Attribute and xbar/batches chart will need nobs/exposure for each observation as extra argument. Do we use WLS, weighted statistics for xbar?
(I need to run more examples to see how the details like updating works.

fitting on historical data assumes that historical data is clean (no out of control observations)
I guess we can add a fit method to update the statistics.
Class is and will remain stateful and include "model" and "results" info. So, fit could update the estimate on historical data using more options for robust fit.
I did not compare details with my old PR.

bump for 0.15
I think we will be able to merge it with only smaller changes.

detail:
rename out -> outoc or to something similar, outside ? for out-of-control candidates
"out" sounds too much like out array in numpy (name also in my old notebook)
Those are just signaled as out-of-control, they are not observations verified as out-of-control

@josef-pkt josef-pkt modified the milestones: josef, 0.15 Nov 26, 2022
@josef-pkt
Copy link
Member

AFAICS, there is no method to update the historical data
update and concat_data_and_index is only used for future data

@josef-pkt josef-pkt mentioned this pull request Nov 26, 2022
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
comp-stats-hypo
review/finalize
Development

Successfully merging this pull request may close these issues.

None yet

3 participants