# COVID19 Global Forecasting (Week 3)


<!--TOC--><h2 id="tocheading">Table of Contents</h2><div id="toc"></div>

# Imports & Macros

In [None]:
%%javascript
// Build table of contents.
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
// Disable output area scrolling.
// IPython.OutputArea.auto_scroll_threshold = 1

In [None]:
!pip install hy > /dev/null

In [None]:
# Hy Magic
import IPython
def hy_eval(*args):
    import hy
    try: return hy.eval(hy.read_str("(do\n"+"".join(map(lambda s:s or "",args))+"\n)\n"),globals())
    except Exception as e: print("ERROR:", str(e)); raise e
@IPython.core.magic.register_line_cell_magic
def h(*args): return hy_eval(*args) # Prints result useful for debugging.
@IPython.core.magic.register_line_cell_magic
def hs(*args):hy_eval(*args) # Silent. Does not print result.
del h, hs

In [None]:
%%hs
(import  [useful [*]])
(require [useful [*]])

# Theory

1. Let us reason about the infection growth rate from first principles. 

2. Clearly it is not exponential, even though it appears so in the beginning.

3. Let $y$ be the number of people who are infected. Let $z$ be the total fatalities. Let $t$ be the number of days since December 31, 2019. Both $y$ and $z$ are functions of $t$. 

4. Let $N$ be the total population. This is not necessarily the total population of the region in question. It is the total number of people who will get infected. So it is the population that is susceptible to becoming infected. It follows then that $y \leq N$ and $z \leq N$.

5. So each day $y$ goes up by some amount. This looks exponential but it obviously cannot remain exponential since there is an upper bound of $N$.

6. The change in $y$ every day is proportional to two things. (1) The number of new infections is proportional to the number of people who are currently infected. If twice as many people are infected, the odds of them infecting the remaining population become twice as high. In other words, the change in $y$ is proportional to $y$. (2) The number of new infections is proportional to the number of people who are uninfected, which is $N - y$. If there are half as many uninfected people then there will be half as many new infections. Only a percentage of the uninfected get infected every day.

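7. So we can write the rate of change of $y$ as $y' = r y \left(1 - \frac{y}{N}\right)$, where $r$ is a constant of proportionality. (With $y$ expressed as a fraction of $N$, this reduces to $y' = r y (1 - y)$.)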

8. This is precisely the differential equation satisfied by the sigmoid (logistic) function. See [this StackExchange answer](https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x) for an elegant derivation.

9. In other words 

$$
y = \frac{N}{1 + e^{-r(t-t_{inf})}}
$$

This is a sigmoid or an S-curve. This is what it looks like.

<img width="500" src="https://upload.wikimedia.org/wikipedia/commons/a/ac/Logistic-curve.png">
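
To see concretely that this growth rule produces the S-curve, here is a minimal sketch that integrates $y' = r y (1 - y/N)$ step by step and compares the result with the closed-form expression. The parameter values and the helper names `logistic` and `integrate_logistic` are made up for illustration; they are not part of this notebook's models.

```python
import math

def logistic(t, N, r, t_inf):
    """Closed-form S-curve: N / (1 + exp(-r * (t - t_inf)))."""
    return N / (1.0 + math.exp(-r * (t - t_inf)))

def integrate_logistic(y0, N, r, t0, t1, dt=0.01):
    """Integrate y' = r * y * (1 - y / N) from t0 to t1 with forward Euler steps."""
    steps = round((t1 - t0) / dt)
    y = y0
    for _ in range(steps):
        y += dt * r * y * (1.0 - y / N)
    return y

# Illustrative values only (assumptions, not fitted to any data).
N, r, t_inf = 1000.0, 0.2, 87.0
t0 = 40.0
y0 = logistic(t0, N, r, t_inf)  # start the integration on the curve

for t in (60.0, 87.0, 110.0):
    numeric = integrate_logistic(y0, N, r, t0, t)
    exact = logistic(t, N, r, t_inf)
    print(f"t={t:5.1f}  euler={numeric:8.2f}  closed_form={exact:8.2f}")
```

The two columns should agree closely, with a small gap that comes from the finite Euler step size. At $t = t_{inf}$ the closed form returns exactly $N/2$.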

10. Let us understand the terms in this equation. $t_{inf}$ is the inflection point. Setting $t = t_{inf}$ gives $y = \frac{N}{2}$. In other words, half the susceptible population is infected. From that point on, the number of new infections per day starts to drop off.

11. Note that we have done no machine learning yet and we already have an equation. This equation has these unknowns: $N$, $r$, and $t_{inf}$. 

12. The basic idea of this notebook is to use curve fitting to solve for these unknowns. We will observe how well the curve describes the historical data and then extrapolate, assuming the relationship persists into the future.

13. Next, let us develop some intuition about these values. What is $r$? Intuitively, $r$ is the speed of infection. This is independent of population size. Two other things, separate from $r$, determine the absolute number of new infections per day: (1) The size of the population. In a large population like the US there will be more infections than in the population of a single state like California. (2) When the infection started. In a population where the infection started earlier the number will look worse. But this does not mean the infections are spreading faster there.

14. What $r$ tells us is how fast infections are spreading. Let us call it the *contagion coefficient*.

15. Is $r$ a constant? $r$ could be a constant over periods of time. $r$ could change based on changes in the population. For example, $r$ in California might be different before and after an enforced lockdown. $r$ could change based on how a population behaves. If people avoid going out of their houses $r$ could drop. If people start meeting again $r$ could rise.

16. For our purposes we are going to hold $r$ constant.

17. If we were to compare populations in different states or in different parts of the world, we could fruitfully ask how $r$ varies between them. $r$ could be affected by demographics. Countries with public transportation might have a higher $r$ while countries where people don't travel very often might have a lower $r$.

18. $t_{inf}$ is the inflection point. This is where half the susceptible population is infected. This is the inflection point because after this the scales begin to tip. Now each day there are fewer new infections than the previous day.

19. With this mental model in mind let us now figure out these values and build a mathematical model.
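
Before turning to the real data, here is a minimal sketch of the curve-fitting idea on synthetic data, assuming NumPy and SciPy are installed. We generate noise-free points from known values of $N$, $t_{inf}$, and $r$, then ask `scipy.optimize.curve_fit` to recover them. The parameter values and the starting guess `p0` are assumptions chosen for illustration; the function `n_sigmoid` mirrors the argument order of the `n-sigmoid` model defined later in this notebook but is a standalone Python rendering.

```python
import numpy as np
from scipy.optimize import curve_fit

def n_sigmoid(t, N, t_inf, r):
    """Scaled logistic curve: N / (1 + exp(-r * (t - t_inf)))."""
    return N / (1.0 + np.exp(-r * (t - t_inf)))

# Synthetic, noise-free "observations" generated from known parameters.
# These numbers are made up for illustration; they are not from train.csv.
true_N, true_t_inf, true_r = 1000.0, 87.0, 0.2
t = np.arange(60, 100, dtype=float)
y = n_sigmoid(t, true_N, true_t_inf, true_r)

# Rough starting guess: largest observed value, middle of the window, a slow rate.
p0 = [y.max(), t.mean(), 0.1]
popt, pcov = curve_fit(n_sigmoid, t, y, p0=p0, maxfev=20000)
N_hat, t_inf_hat, r_hat = popt

print(f"N={N_hat:.1f}  t_inf={t_inf_hat:.2f}  r={r_hat:.3f}")

# Extrapolate one week past the last observed day.
future_t = t[-1] + np.arange(1, 8)
forecast = n_sigmoid(future_t, *popt)
print(forecast.round(1))
```

With real data the fit is much less certain, especially for regions that have not yet reached the inflection point, because many different values of $N$ can explain the early part of the curve almost equally well.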


# Terminology

20. Let's define some terms.

    - $N$ is the maximum number of confirmed cases that we will have.
    - $y$ is the current number of confirmed cases.
    - $z$ is the current number of fatalities.
    - $t$ is time measured in days relative to the inflection point, so the inflection point is at $t=0$ (this is $t - t_{inf}$ in the notation of the Theory section).
    - $r$ is the rate of infection.

21. Using this terminology gives us the following equation.

$$
y = \frac{ N }{ 1 + e^{-rt} }
$$
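
As a quick check on this form, here is a short worked derivation, with $y$ in absolute counts. Differentiating with respect to $t$ gives

$$
y' = \frac{N r e^{-rt}}{\left(1 + e^{-rt}\right)^2} = r \, y \left(1 - \frac{y}{N}\right)
$$

so growth is proportional both to $y$ and to the uninfected fraction $1 - y/N$, as argued in the Theory section. Two consequences help with intuition about $r$:

$$
y'(0) = \frac{rN}{4}, \qquad y \approx N e^{rt} \ \text{ when } e^{-rt} \gg 1
$$

The first says that daily new cases peak at the inflection point, where $y = N/2$. The second says that long before the inflection point the curve looks exponential, with a doubling time of about $\ln 2 / r$ days.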




# Models & Deployment

In [None]:
%%h

; COVID19 Global Forecasting (Week 3)

; vim: filetype=lisp tw=9999 nowrap

; [Paths]

(=> cor-prefix (-> "covid19-global-forecasting-week-3" kag-comp->prefix))
(=> cor-train-csv (+ cor-prefix "train.csv"))
(=> cor-test-csv (+ cor-prefix "test.csv"))
(=> cor-submission-csv (+ cor-prefix "submission.csv"))

; [Macros & Utilities]

(import [seaborn :as sns])
(defmacro symbol-to-string [sym] `(mangle (name ~sym)))
(defmacro pd-define [df dst &rest forms]
  `(.assign ~df #** {~(symbol-to-string dst) (fn [$] ~@forms)}))
(defmacro pd-filter [df &rest forms]
    `(-> ~df (.where (fn [$] ~@forms)) (.dropna)))
(defmacro pd-define-plot [df dst &rest forms]
  `(-> ~df (pd-define ~dst ~@forms) (pd-plot [~(symbol-to-string dst)])))
(defn pd-regression [df x y]
  (setv x (-> df (get x) (.to-numpy)))
  (setv y (-> df (get y) (.to-numpy)))
  (setv line (stats.linregress x y))
  (print :sep "\n"
    f"slope={line.slope}"
    f"intercept={line.intercept}"
    f"pvalue={line.pvalue}"
    f"rvalue={line.rvalue}"
    f"stderr={line.stderr}")
  (plt.figure)
  (plt.plot x y "o" :label "original data")
  (plt.plot x (+ line.intercept (* line.slope x)) "r" :label "fitted line")
  (plt.legend)
  (plt.show)
  (plt.close)
  df)

(defmacro pd-fork [df &rest forms]
  `(do (setv $ ~df) ~@forms $))

(defn cor-csv->df [file-csv id]
  (-> file-csv
    (pd.read-csv :dtype {id object})
    (.fillna "") 
    ; Clean-up regions
    (pd-define RegionId (-> ($.Country_Region.str.cat :sep ":" $.Province_State)))
    (pd-drop ["Country_Region" "Province_State"])
    (pd-date-string-to-date "Date" "Date")
    (pd-date-to-std-day "t" "Date") 
    (.set-index "Date" :drop False)
    (pd-rename {"ConfirmedCases" "y"})
    (pd-rename {"Fatalities"     "z"})
  ))

; Use this to remove trailing colon
;=> (-> cor-train-csv (pd.read-csv) (.fillna "") (pd-assign region (-> ($.Country_Region.str.cat :sep ":" $.Province_State) (. str) (.replace ":$" ""))) (.query "region=='US:California'")

(defn kag-log [message]
  (=> timestamp (-> (datetime.now) (str)) )
  (print f"[{timestamp}] {message}"))

; [Models]

(defn sigmoid [x x0 r] 
  (-> x (- x0) (* (- r)) (np.exp) (+ 1) (np.reciprocal)))
(defn n-sigmoid [x N x0 r] 
  (-> x (- x0) (* (- r)) (np.exp) (+ 1) (np.reciprocal) (* N)))
(defn log-n-sigmoid [x N x0 r] 
  (-> x (- x0) (* (- r)) (np.exp) (+ 1) (np.reciprocal) (* N) (np.log1p)))
(defn quadratic [x a b c] 
  (-> a (* x) (+ b) (* x) (+ c)))

(import pylab)
(import [scipy.optimize [curve-fit]])

(defn const-func [c] (fn [t] c))
(defn const-model [c] (kw->obj :func (const-func c) :popt (np.array [0 0 0]) :mape 0.0))

(defn pd-curve-model [df y-col x-col f p0 &optional [plot False]]
  (=> df-clean (-> df (.query f"{y-col} > 0.0")))
  (=> y-max (-> df (get y-col) (.max)))
  (if (-> df-clean (len) (<= 4)) (return (const-model y-max)))
  (if (-> df-clean (len) (> 4)) (=> df df-clean))
  (=> x (-> df (get x-col)))
  (=> y (-> df (get y-col)))
  (=> (, popt pcov) (curve-fit f x y :p0 p0 :maxfev 20000))
  (=> func (fn [x] (f x #* popt)))
  (=> y-hat (func x))
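  ; Note: this is the mean absolute error divided by the mean of y (a scale-free error), not a per-point MAPE.
  (=> mape (-> (np.abs y) (- (np.abs y-hat)) (np.abs) (np.mean) (/ (np.abs (np.mean y)))))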
  (if plot 
    (do
      (pylab.plot x y "o" :label "data")
      (pylab.plot x y-hat :label "fit")
      (pylab.ylim (np.min y) (np.max y))
      (pylab.legend :loc "best")
      (pylab.show)))
  (kw->obj :func func :popt popt :mape mape))

; [Testing & Submission]

(=> cor-regions (-> (cor-csv->df cor-test-csv  "ForecastId") (get "RegionId") (.unique) (list)))
(=> cor-y-models {})
(=> cor-z-models {})

(defn cor-region->train-df [region]
  (-> cor-train-csv
    (cor-csv->df "Id") 
    (.query "RegionId == @region")
  ))

(defn cor-region->test-df [region]
  (-> cor-test-csv
    (cor-csv->df "ForecastId") 
    (.query "RegionId == @region")
  ))

(defn cor-train-df->model [df y-col x-col &optional [plot False]]
  (=> y (-> df (get y-col)))
  (=> y-max (-> y (.max)))
  (=> n-estimate (-> y-max (/ 2))) 
  (=> p0 [n-estimate 87 0.2])
  (=> model (pd-curve-model df y-col x-col n-sigmoid :p0 p0 :plot plot)))

(import math)
(defn is-num-bad [x] (or (-> x (math.isfinite) (not)) (-> x (> 0.5))))
(defn cor-check-bad-mape [mape] (if (is-num-bad mape) (print f"--> BAD mape={mape}")))

(defn cor-build-models [&optional [exclude-list []]]
  ; Build models per region
  (global cor-y-models cor-z-models)
  (=> executions [])
  (=> exclude-set (set exclude-list))
  (for [region cor-regions] 
    (print f"train: region={region}")
    (=> train-df (cor-region->train-df region))
    (for [series ["y" "z"]]
      (if (and (= series "z") (-> region (in exclude-set))) (continue))
      (=> model  (cor-train-df->model train-df series "t"))
      (cor-check-bad-mape model.mape)
      (if (= series "y") (-> cor-y-models (setf region model)))
      (if (= series "z") (-> cor-z-models (setf region model)))
      (executions.append 
        { "region" region 
          "series" series 
          "mape" model.mape
          "popt" model.popt 
          "N"    (get model.popt 0)
          "x0"   (get model.popt 1)
          "r"    (get model.popt 2)
          "bad_mape" (is-num-bad model.mape) })))
  executions)

; [Manual]

(defn chk-make-executions []
  (=> executions (cor-build-models)))

(defn chk-brunei-fit-1 []
  (-> "Brunei:" (cor-region->train-df) (cor-train-df->model "y" "t" :plot True)))

(defn chk-brunei-fit-2 []
  (-> "Brunei:" 
    (cor-region->train-df) 
    (pd-curve-model "y" "t" n-sigmoid :p0 [40 87 0.2] :plot True)))

; [Build Models]

(=> executions (cor-build-models))
(-> executions 
  (pd.DataFrame)
  (.sort-values :by "mape" :ascending False) (display))

; [Submission]

(defmacro pd-assign [df dst &rest forms]
  `(.assign ~df #** {~(name dst) (fn [$] ~@forms)}))

(defn cor-prepare-submission []
  (=> submission-df (pd.DataFrame))
  (for [region cor-regions]
    (print f"test: region={region}")
    (=> y-model (-> cor-y-models (get region)))
    (=> z-model (-> cor-z-models (get region)))
    (-> (cor-region->test-df region)
      (pd-assign ConfirmedCases (-> $.t (y-model.func) (np.round :decimals 1)))
      (pd-assign Fatalities     (-> $.t (z-model.func) (np.round :decimals 1)))
      (pd-save predict-df))
    ; Append predict-df to submission-df 
    (=> submission-df (pd.concat [submission-df predict-df])))
  submission-df)

(=> submission-df (cor-prepare-submission))

(pd.set-option "display.float_format" (fn [x] (% "%.2f" x)))

(-> submission-df 
  (pd-keep ["ForecastId" "ConfirmedCases" "Fatalities"] ) 
  (.to-csv "submission.csv" :index False :float-format "%.1f"))

