# COVID19 Global Forecasting (Week 2)

<h2 id="tocheading">Table of Contents</h2>
<div id="toc"></div>

In [None]:
%%javascript
/* Build table of contents. */
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
/* Disable output area scrolling. */
IPython.OutputArea.auto_scroll_threshold = 9999

# Introduction

## Mathematical Model

1. Let us reason about the growth rate from first principles. 

2. Clearly it is not exponential, even though it appears so in the beginning.

3. Let $Y$ be the number of people who are infected. Let $N$ be the total population. This is not necessarily the total population of the region in question. It is the total number of people who will get infected. So it is the population that is susceptible to becoming infected.

4. To keep the math simple, let’s get rid of absolute numbers and work with percentages. So $N=1$ and $y_{infected}$ is the number of people who are infected as a percentage of the population. So $y_{infected} \in (0,1]$. 

5. Now the confirmed cases will be slightly lower than the actual percentage that is infected. Some people might be asymptomatic or might not get tested for some reason. Let us assume that the confirmed cases $y_{confirmed}$ are a factor $k$ of $y_{infected}$. In other words, $y_{confirmed} = k \cdot y_{infected}$ where $k \in (0,1]$.

6. To keep things simple we will solve for $y_{confirmed}$ and call that $y$.

7. So each day $y$ goes up by some amount. This looks exponential but it obviously cannot remain exponential.

8. The change in $y$ every day is proportional to two things. 

9. The first thing the number of new infected is proportional to is people who are currently infected. If twice as many people are infected the odds of them infecting the remaining population become twice as high. In other words the change in $y$ is proportional to $y$.

10. The second thing the number of new infected is proportional to is the number of people who are uninfected, $N - Y$ in absolute number or $1 - y$ in percentages. If there are half as many uninfected people then there will be half as many new infected. Only a percentage of the uninfected get infected every day.

11. So we can write it in this way. The rate of change of $y$ which is $y' = r y (1 - y)$.

12. This is precisely the equation for the sigmoid function. See [this StackExchange answer](https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x) for an elegant derivation.

13. In other words 

$$
y = \frac{1}{1 + e^{-r(t-t_{inf})}}
$$

This is a sigmoid or an S-curve. This is what it looks like.

<img width="500" src="https://upload.wikimedia.org/wikipedia/commons/a/ac/Logistic-curve.png">

14. Let us understand the terms in this equation. $t_{inf}$ is the inflection point. If you set $t$ to $t_{inf}$ that produces $\frac{1}{2}$. In other words, half the population is infected. At that point the rate of infection will start to drop off.

15. One other note. It is possible the function plateaus at some point. There will not be 100% infection. Maybe 70% or 80%. So $N$ is really the maximum number of people who will get infected, not necessarily the entire population of California. We can use $N$ as 80% of the population of California.

16. Note that we have done no machine learning yet and we already have an equation. This equation has two unknowns: $r$ and $t_{inf}$. We can solve for these values from our infection data. After this we can perform two tests on it: (1) Are these numbers stable across time? (2) Can we predict out-of-sample values for infections and deaths?

17. This second test is the cross validation test for a time series. We can build the model from a subset of the data and see how well it holds up for the out-of-sample data points.

18. Next, let us develop some intuition about these values. What is $r$? Intuitively, $r$ is the speed of infection. This is independent of population size. There are two things that determine the absolute number of new infections per day: (1) The size of the population. In a large population like the US there will be more infections than in the population of a single state like California. (2) When the infection started. In a population where the infection started earlier the number will look worse. But this does not mean the infections are spreading faster there.

19. What $r$ tells us is how fast infections are spreading. Let us call it the *contagion coefficient*.

20. Is $r$ a constant? $r$ could be a constant over periods of time. $r$ could change based on changes in the population. For example, $r$ in California might be different before and after an enforced lockdown. $r$ could change based on how a population behaves. If people avoid going out of their houses $r$ could drop. If people start meeting again $r$ could rise.

21. For our purposes we are going to hold $r$ constant.

22. If we were to compare populations in different states or in different parts of the world, we could fruitfully ask how $r$ varies between them. $r$ could be affected by demographics. Countries with public transportation might have a higher $r$ while countries where people don't travel very often might have a lower $r$.

22. $t_{inf}$ is the inflection point. This is where half the susceptible population is infected. This is the inflection point because after this the scales begin to tip. Now each day there are fewer new infections than the previous day.

23. The susceptible population is less than the total population: $N_{susceptible} < N_{total}$. Let's say California's population is about 40 million. It is possible that only 80% get infected. In this case $N_{susceptible} = N_{total} * \frac{80}{100} = 40,000,000 \times 0.8 = 32,000,000$.

24. There is a quick way of calculating $N_{susceptible}$ once we hit the inflection point, i.e. once the infection rates start to drop. The susceptible population will be twice the number infected at the inflection point. $N_{susceptible} = 2 \cdot y_{t_{inf}}$.

25. So how can we solve for $r$ and $t_{inf}$? Based on the data we know $y$ for any specific date. Just take the confirmed cases and divide by $N$. We also know $y'$ for any specific date. $y'$ is the rate of change in $y$, which we can approximate by the daily difference $y_{t+1} - y_t$. Dividing by $y_t$ gives the relative growth rate $\frac{y'}{y} \approx \frac{y_{t+1}}{y_{t}} - 1$.

26. With this mental model in mind let us now figure out these values and build a mathematical model.
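As a quick numerical check of the reasoning above, here is a minimal sketch in plain Python (standard library only). The values of `r` and `t_inf` are hypothetical and chosen only for illustration. It confirms that the curve passes through $\frac{1}{2}$ at the inflection point and that a finite-difference estimate of $y'$ agrees with $r y (1 - y)$.

```python
import math

def logistic(t, r, t_inf):
    """Fraction of the susceptible population infected at time t."""
    return 1.0 / (1.0 + math.exp(-r * (t - t_inf)))

# Hypothetical parameter values, not fitted to any data.
r, t_inf = 0.2, 50.0

# At the inflection point exactly half the susceptible population is infected.
half = logistic(t_inf, r, t_inf)

# Compare a central finite difference of y with r * y * (1 - y) at t = 40.
t, h = 40.0, 1e-5
numeric = (logistic(t + h, r, t_inf) - logistic(t - h, r, t_inf)) / (2 * h)
y = logistic(t, r, t_inf)
analytic = r * y * (1 - y)

print(half, numeric, analytic)
```

The two derivative estimates should agree to many decimal places, which is what we expect if the S-curve really solves $y' = r y (1 - y)$.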


## Useful Equations and Identities

1. Let's define some terms.

    - $N$ is the maximum number of confirmed cases that we will have.
    - $Y$ is the current number of confirmed cases.
    - $Z$ is the current number of fatalities.
    - $t$ is time in days, measured relative to the inflection point at $t=0$.
    - $t=0$ is the inflection point, when the number of confirmed cases reaches $N/2$.
    - $r$ is the rate of infection.


2. Using this terminology gives us the following equation.

$$
Y = \frac{ N }{ 1 + e^{-rt} }
$$

3. Let us figure out the derivative of $Y$. The goal here is to build a model, then to fit the data to this model.

$$
Y' = \frac{ rN e^{-rt}}{\left[1 + e^{-rt}\right]^2} \\
Y' = r\left[\frac{N}{1+e^{-rt}}\right]\left[\frac{e^{-rt}}{1+e^{-rt}}\right] \\
Y' = rY\left[\frac{e^{-rt}}{1+e^{-rt}}\right] \\
Y' = rY\left[\frac{1+e^{-rt} - 1}{1+e^{-rt}}\right] \\
Y' = rY\left[\frac{1+e^{-rt}}{1+e^{-rt}} - \frac{1}{1+e^{-rt}}\right] \\
Y' = rY\left[1 - \frac{1}{1+e^{-rt}}\right] \\
Y' = rY\left[1 - \frac{1}{N} \cdot \frac{N}{1+e^{-rt}}\right] \\
Y' = rY\left[1 - \frac{1}{N} \cdot Y \right] \\
Y' = rY\left[1 - \frac{Y}{N} \right] \\
\frac{Y'}{Y} = r\left[1 - \frac{Y}{N} \right] \\
\frac{Y'}{Y} = r - \frac{r}{N} \cdot Y \\
\frac{Y'}{Y} = \frac{-r}{N} Y + r \\
$$

4. The expression on the right is linear in $Y$. It has a slope of $\frac{-r}{N}$ and a y-intercept of $r$. This particular technique of reducing the problem to solving for slope and intercept is based on the approach of [Pierre François Verhulst](https://en.wikipedia.org/wiki/Pierre_Fran%C3%A7ois_Verhulst), a Belgian Kaggler from the 1800s who analyzed population growth.

5. Instead of trying to analyze the population growth of humans as a function of the carrying capacity of an environment, we can treat humans as the environment and use the same equation to model the growth of coronavirus infections.

6. We have data points for $Y$, which is just the number of confirmed cases. Using this we can calculate $Y'$ and calculate $Y'/Y$.

7. Next we will fit this to a line, and extract its slope and y-intercept.

8. Using these values we will have estimates for $N$ and $r$. 

9. Recall that $N$ is the maximum number of confirmed cases that will ever occur, and $r$ is the rate of infection.

10. Using this model we can predict the number of confirmed cases over time.

## Calculating $r$ and $N$

Here is one way to do this.

1. We know this equation.

$$
\frac{Y'}{Y} = - \frac{r}{N} Y +r \\
$$

2. Once we have a line for $Y'/Y$ we can figure out its slope $m$ and intercept $b$. Then use those values to figure out $r$ and $N$.

$$
r = b \\
-\frac{r}{N} = m \\
N = -\frac{r}{m} \\
N = -\frac{b}{m} \\
$$
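To see the slope-and-intercept recipe at work, here is a minimal sketch in plain Python. It does not use the notebook's own helpers. The data are synthetic: `N_true` and `r_true` are made-up values, and the hand-written `fit_line` is a stand-in for a library regression routine. Because $Y'$ is approximated with daily differences, the recovered values should be close to the true ones but not exact.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic confirmed-case counts from a logistic curve with known parameters.
N_true, r_true = 1000.0, 0.2              # made-up values, used only to check the method
ts = list(range(-20, 21))                 # days relative to the inflection point
Y = [N_true / (1.0 + math.exp(-r_true * t)) for t in ts]

# Central differences approximate Y' on the interior days (one-day spacing).
Y_inner = Y[1:-1]
Y_prime_over_Y = [(Y[i + 1] - Y[i - 1]) / 2.0 / Y[i] for i in range(1, len(Y) - 1)]

# Fit Y'/Y against Y, then read r and N off the intercept and slope.
m, b = fit_line(Y_inner, Y_prime_over_Y)
r_est = b
N_est = -b / m

print(r_est, N_est)
```

On real data the points will be much noisier than this, which is why some smoothing of $Y'/Y$ is likely to be needed before fitting.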

# Programming Notes

## Building Models

Let's think about building models now.

1. Given a data frame with values for $Y$ and $t$ we can have a function that computes $Y'$ and $Y'/Y$, and from those $r$, $N$, and $t_{inflection}$. Once we have these values we have a model. A model is basically a dictionary containing these values. From the model you can generate predictions, for example for a range of dates: the model takes a data frame with a range of dates and computes predictions for those dates.

2. We can create numpy functions for all the generation. We can use that to generate values for $Y'$, $Y'/Y$. These are going to be series. On the other hand the model parameters—$r$, $N$, $t_{inflection}$— will be computed as single numbers, not series.

3. How can we integrate the code that generates the graphs with the actual code that computes the model? We want to have a single source of truth. So the model generation code should do the work (optionally) of adding data frame columns for the intermediate series that it produces. This way we have proof-of-work that can be printed out (optionally) as a graph.

4. We will present graphs for a few regions, e.g. California, Italy, etc. Then we will use the same code in production to go through all the regions (which could be countries, or states within countries), and produce predictions. These final predictions are what we will publish to the `submission.csv` file.

5. Optionally we should have a way to make predictions about any time interval. This can be useful for two purposes: (1) For out-of-sample testing. Let us discuss this scenario in more detail below. (2) For making predictions about certain regions and publishing them. For example, what does the model say about California moving forward.

6. The current model assumes a logistic function, an S-curve, for the number of cases. This is not going to match reality as people start recovering. There are some other S-curves that are operating here. (1) The S-curve of people who are recovering. (2) The S-curve of people who are dying. Both of these will subtract from the S-curve of the confirmed cases. Eventually the epidemic will subside and disappear. 

7. For now let us keep things simple and use a single function, understanding that it is simplistic and will be incorrect for the reasons listed in the previous point.

## Out-Of-Sample Testing

1. Let us discuss out-of-sample testing next. How should we do it?

2. We have a basic function that is going to build a model. Once we have the model we can generate predictions. I want all interaction with these functions to happen through data frames, so the predictions should come back in a response data frame. The alternative would be to return a series, which is possible here since we are only producing a single series. If we were producing multiple series a data frame would be more elegant than a bunch of series. In fact, a data frame is really just a bunch of series: it is a container that lets you meaningfully name the series in it.

3. So essentially we have two functions: 

    - `pd-time-series->model [df y t]`: This takes a data frame as an argument and the names of the columns containing $y$ and $t$ values. Its return value is a model.

    - `pd-model->predictions [df predict model t]`: `df` is the data frame, `predict` is the name of the new column where the predictions will be added, `model` is the model produced by the first function, and `t` is the name of the column containing the independent variable, which in our case will be time or number of days since epoch.


4. We can write generic functions that work on series to do the heavy lifting. For example, we can have an `np-deriv [y t]` function. This takes two series: $y$ and $t$. It takes care of the smoothing and of computing the lag needed for differentiation. Imagine calculating the derivative from more than two values. We could look at `k` values, a kind of window size. Then instead of finding a line between 2 points we can fit a regression line through the `k` points. In fact we might not need to smooth $Y'$ separately. The smoothing could happen as a result of the differentiation process.

5. It turns out NumPy has a function, `np.gradient`, that also does differentiation in a sophisticated way. Let us use that instead of writing our own.

6. One thing I don't like about the function `pd-time-series->model` is that it does not return a data frame. This means we cannot put it into a pipeline. If it was going to return a data frame then where would it store the model? (1) The model could be stored as constant values in the data frame. This feels like a waste of space. (2) The model could be injected into a dictionary that we pass into the function. That again feels wrong.

7. Maybe we can have it return a model and we can use `pd-fork` so we can continue processing the data frame below.

8. Also it would be nice to have the intermediate data that the model produces so we can graph that.

9. The question now becomes, what should the function that produces the model return? Should it return a data frame with the new columns added to it that represent the intermediate computations? Or should it return the model?

10. One solution is to have two functions. The first produces all the intermediate fields. The second produces the model from the intermediate fields. Here are some naming suggestions for them.

    - `pd-time-series-add-model-computations [df y_prime y t]`
    - `pd-time-series->model`


11. The second function assumes that the model computation fields are already inserted in the data frame. The second function returns a model.

12. Here is another thought. We should make these functions more specific. The names of the columns can be hard-coded in them. They can throw error messages if important inputs are not available. This is not a generic utility that could be used in other places. We are assuming a very specific model.

13. Note that we might need to run the same code for confirmed cases as well as for fatalities. So the underlying input variable that we are using could be $Y$ or it could be $Z$. This leads to the following signature.

    - `corona-pd-computations-add [df y]`
    - `corona-pd-computations->model [df y]`
    - `corona-pd-model->predictions [df predict model]`
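The windowed-derivative idea from the notes above (fit a regression line through `k` neighbouring points and use its slope as the derivative) can be sketched in plain Python as follows. The name `window_slope` and the default `k=5` are hypothetical. The notebook itself relies on NumPy for differentiation, so this only illustrates the alternative.

```python
def window_slope(ts, ys, k=5):
    """Estimate dy/dt at each point as the least-squares slope over a window of up to k points."""
    half = k // 2
    slopes = []
    for i in range(len(ts)):
        # The window is truncated at the edges of the series.
        lo, hi = max(0, i - half), min(len(ts), i + half + 1)
        wt, wy = ts[lo:hi], ys[lo:hi]
        mean_t = sum(wt) / len(wt)
        mean_y = sum(wy) / len(wy)
        sxx = sum((t - mean_t) ** 2 for t in wt)
        sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(wt, wy))
        slopes.append(sxy / sxx)
    return slopes

# On a straight line the estimated slope is the true slope at every point.
ts = list(range(10))
ys = [3.0 * t + 1.0 for t in ts]
slopes = window_slope(ts, ys, k=5)
print(slopes)
```

A larger `k` smooths more aggressively but reacts more slowly to real changes in the growth rate, so the window size is a trade-off rather than a free choice.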

## Predictions

1. So how should the computation for the prediction happen? This scheme is going to apply to the cross-validation step as well, so keep that in mind.

2. We should read `t`, apply it to the model to produce predictions and then write them in the `predict` column of the output data frame.

3. Later we can use data frame algebra to combine multiple data frames together.

4. We need utilities to generate `t` values for the prediction interval. Note these are just integers which measure the distance in days from the beginning of the epoch (January 1, 1970 for the Unix epoch). We have start and end dates, we convert those to epoch days, then we generate a series that goes from the start to the end epoch-day value.

5. The prediction function can be a simple little thing. It takes a model, which in a Lisp-like language such as Hy can be a closure or a function. So the model can be a function. It would be nice, though, to have a model that supports reflection into its values, so I am leaning towards a dictionary instead. That way we can look up values. However, it should be possible to go from that dictionary to a model function that takes in input parameters and produces output parameters. In fact it could take in a series of integers representing $t$ and produce a series of values of $Y$.

6. The only issue here is that we could end up with a more sophisticated prediction function later that takes more than just $t$ as its input. Our scheme for the function should be general enough that it works if we are passing in additional parameters such as weather events. It is possible that weather affects the expression. There might be a coefficient that feeds into the equation that corrects for model perturbations caused by other things happening in the environment.

7. So instead of operating on series the prediction model should operate on data frames. The purpose of the model function is to take in a data frame, read input values, and then produce prediction values.

8. We will have two models. One for $Y$ (confirmed cases) and one for $Z$ (fatalities). Two separate dictionaries will lead to two separate functions. In both cases the functions depend on $t$ alone. They don't depend on other values. Therefore we could pass in the same data frame into both. However, we will have different `predict` columns. This way each function could write its predictions into the appropriate prediction column.
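Here is a minimal sketch of the dictionary-to-function scheme described above, in plain Python. A list of dictionaries stands in for a data frame. The names `model_to_func` and `add_predictions` and all parameter values are hypothetical.

```python
import math

def model_to_func(model):
    """Turn a model dictionary into a function from day number t to a prediction."""
    N, r, t_infl = model["N"], model["r"], model["t_infl"]
    return lambda t: N / (1.0 + math.exp(-r * (t - t_infl)))

def add_predictions(rows, predict, model):
    """Return copies of rows with a new `predict` column computed from each row's t."""
    func = model_to_func(model)
    return [{**row, predict: func(row["t"])} for row in rows]

# Two separate model dictionaries (made-up values): confirmed cases and fatalities.
y_model = {"N": 32_000_000.0, "r": 0.2, "t_infl": 100.0}
z_model = {"N": 640_000.0, "r": 0.2, "t_infl": 110.0}

# The same input rows feed both models; each writes to its own prediction column.
rows = [{"t": t} for t in range(90, 121)]
rows = add_predictions(rows, "Y_predict", y_model)
rows = add_predictions(rows, "Z_predict", z_model)

print(rows[10])
```

Keeping the model as a dictionary means its parameters stay inspectable, while the function derived from it only needs to know which input columns to read.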

## Cross-Validation

1. Next let's think about how we can cross-validate our function on out-of-sample data.

2. We start with the given time series. We drop the initial zero values. Then we can train on 70% and test on 30%, or train on two thirds and test on one third. Basically we have a cutoff in the time series. Everything before the cutoff goes into the training set. Everything on the cutoff and after goes into the test set.

3. Then we can calculate RMSE or some other similar error metric to compare the predictions against the actual values. The actual values for the series will be stored in `Y`. The predicted values will be in `Y_predict`. So the function that calculates the cross-validation error needs to know about the names of two columns that contain the two series we are interested in: actual values and predicted values. 

4. Here is what the function can look like.

    - `pd-predictions->error [df actual predict]`
    

5. The function then calculates the RMSE or other score and returns it in a result object, which will be an error object.
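The split-then-score procedure can be sketched in plain Python as below. The series is made up, the "model" is a naive last-value predictor used only to have something to score, and the function names are hypothetical. RMSLE is included alongside RMSE because it compares values on a log scale, which suits counts that grow by orders of magnitude.

```python
import math

def chronological_split(rows, train_fraction=0.7):
    """Split a time-ordered list at a cutoff: everything before it trains, the rest tests."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x) so zeros are allowed."""
    total = sum((math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))

# Made-up cumulative case counts, one per day.
series = [1, 2, 4, 8, 16, 30, 55, 90, 130, 160]
train, test = chronological_split(series)

# Naive baseline: predict that the last training value persists.
naive = [train[-1]] * len(test)
error = {"rmse": rmse(test, naive), "rmsle": rmsle(test, naive)}
print(error)
```

The error dictionary plays the role of the error object mentioned above. Any real model should beat this naive baseline on the held-out days.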


## Refactoring to Production

1. How should we proceed with the refactoring to production? Another way to frame this question. What order should we work in?

2. Create the functions in this order.

    - Function to add intermediate computation series to data frame containing `Y` and `t`.
    - Function to take the intermediate computations and produce a model object.
    - Function that takes a model and converts it to a model function.
    - Function that takes a model, a data frame containing model inputs, and adds predictions.
    - Function that takes a data frame with inputs, predictions, and produces an error object.


3. Note when I say *object* in the discussion I mean dictionary. Dictionaries are plain objects that make all their fields public, which is great for debugging and inspection.

## Cross-Validation

1. Let's think about cross-validation.

2. To do this we need to split up the data set. Let's split it up 70/30. So we need a function that can do this split. Or we need to somehow inform the training routine to only look at the first 70% of the data. Then we can ask it to validate against the last 30%. Split the data frame into train and test data frames. Use the train data frame to do the training and to produce the model. Then use the test data frame to make predictions and compare the results.

3. For cross-validation we should also support visualization. We should be able to visualize the results. Compare the predicted vs actual values.

4. Let's do this in the main line code first for California. Then we can think about refactoring it.

## Submission

1. Let us think about the submission to Kaggle. For this we need to make predictions for all the different countries and states that are out there. 

2. To do this let us create separate models for each country and state. We can create a key for each country/state combination that is distinct. We save all the models in a dictionary keyed by this country/state combination. Then we read the test data and fill out the values. Based on the country/state we pick up the model from the dictionary. Then we pass it the date and get the prediction.
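The keyed-dictionary lookup can be sketched in plain Python like this. The two "models" are hypothetical stand-in functions and the test rows are invented. Only the `Country:State` key format follows the notebook's code.

```python
def region_key(country, state=""):
    """Distinct key per country/state pair, e.g. 'US:California' or 'Italy:'."""
    return f"{country}:{state}"

# Hypothetical stand-ins for fitted models: each maps a day number t to a prediction.
models = {
    region_key("US", "California"): lambda t: 100.0 + 10.0 * t,
    region_key("Italy"): lambda t: 500.0 + 20.0 * t,
}

# Invented test rows; the real ones would come from the competition's test file.
test_rows = [
    {"ForecastId": 1, "Country_Region": "US", "Province_State": "California", "t": 90},
    {"ForecastId": 2, "Country_Region": "Italy", "Province_State": "", "t": 90},
]

# For each test row, pick the model for its region and ask it for a prediction.
submission = []
for row in test_rows:
    model = models[region_key(row["Country_Region"], row["Province_State"])]
    submission.append({"ForecastId": row["ForecastId"], "ConfirmedCases": model(row["t"])})

print(submission)
```

Countries without a state still get a distinct key because the empty state leaves a trailing colon.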


# Code

## Hy Magic and Macros

In [None]:
!pip install hy > /dev/null

In [None]:
# Hy Magic
import IPython
def hy_eval(*args):
    import hy
    try:
        return hy.eval(hy.read_str("(do\n"+"".join(map(lambda s:s or "",args))+"\n)\n"),globals())
    except Exception as e:
        print("ERROR:", str(e))
        raise e
@IPython.core.magic.register_line_cell_magic
def h(*args): return hy_eval(*args) # Prints result useful for debugging.
@IPython.core.magic.register_line_cell_magic
def hs(*args):hy_eval(*args) # Silent. Does not print result.
del h, hs

In [None]:
%%hs
(import  [useful [*]])
(require [useful [*]])

## Note About Dates

1. Dates are going to be important for figuring out infection counts. To use dates meaningfully we need to standardize all dates. 

2. Should we use epoch dates? Or should we use a reference point that is more recent, like January 1, 2020, or December 1, 2019? The numbers are more meaningful if we use December 31, 2019 as $day_0$, because then January 1, 2020 is day 1. Let us do that.

3. I would prefer not to use `day_of_year` and would rather use `epoch_day`, with the convention that our epoch starts at the turn of 2020. That might be confusing, so it is best to avoid the term epoch. Instead let us call the number of days since December 31, 2019 *standard days*.
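A minimal sketch of the standard-day convention in plain Python, assuming day 0 is December 31, 2019 as stated above. The function names here are hypothetical and are not the helpers the notebook imports.

```python
from datetime import date, timedelta

# Day 0 of the "standard day" scale.
STD_DAY_0 = date(2019, 12, 31)

def date_to_std_day(d):
    """Number of days between December 31, 2019 and the given date."""
    return (d - STD_DAY_0).days

def std_day_to_date(n):
    """Inverse of date_to_std_day."""
    return STD_DAY_0 + timedelta(days=n)

def std_day_range(start, end):
    """Standard days from start to end, inclusive, e.g. for a prediction interval."""
    return list(range(date_to_std_day(start), date_to_std_day(end) + 1))

first = date_to_std_day(date(2020, 1, 1))
march_first = date_to_std_day(date(2020, 3, 1))
interval = std_day_range(date(2020, 3, 30), date(2020, 4, 2))
print(first, march_first, interval)
```

With this convention the day number in January 2020 equals the day of the month, which keeps the numbers easy to read.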

## Reading CSV Files

In [None]:
%%hs

; Define paths
(=> corona-prefix (-> "covid19-global-forecasting-week-2" kag-comp->prefix))
(=> corona-train-csv (+ corona-prefix "train.csv"))
(=> corona-test-csv (+ corona-prefix "test.csv"))
(=> corona-submission-csv (+ corona-prefix "submission.csv"))

; Read CSV, create distinct regions
(defn corona-csv->df [file-csv id]
  (-> file-csv
    (pd.read-csv :dtype {id object})
    (.fillna "")
    (pd-assign-> "RegionId" (-> ($.Country_Region.str.cat :sep ":" $.Province_State)))))

(defn corona-train-csv->df [] (corona-csv->df corona-train-csv "Id"))
(defn corona-test-csv->df  [] (corona-csv->df corona-test-csv  "ForecastId"))

; Get all regions
(defn corona-train->regions [] (-> (corona-train-csv->df) (.RegionId.unique) (list)))
(defn corona-test->regions  [] (-> (corona-test-csv->df)  (.RegionId.unique) (list)))

; Read CSV, prepare DF
(defn corona-train-region->df [region-id]
  (-> (corona-train-csv->df)

    ; Filter by region
    (pd-filter-> (= $.RegionId region_id))
   
    ; Standardize column names
    (pd-rename {"Date" "t" "ConfirmedCases" "Y" "Fatalities" "Z"})
    (pd-keep   ["t" "Y" "Z"])
    
    ; Standardize dates to start on Jan 1, 2020
    (pd-date-string-to-date "t" "t")
    (.set-index "t" :drop False)
    (pd-date-to-std-day "t" "t")))

(defn corona-train-country-state->df [country state]
  (corona-train-region->df (+ country ":" state)))

; Read CSV, prepare DF
(defn corona-test-region->df [region-id]
  (-> (corona-test-csv->df)

    ; Filter by region
    (pd-filter-> (= $.RegionId region_id))
   
    ; Standardize column names
    (pd-rename {"Date" "t"})

    ; Standardize dates to start on Jan 1, 2020
    (pd-date-string-to-date "t" "t")
    (.set-index "t" :drop False)
    (pd-date-to-std-day "t" "t")

    ; Keep essential columns
    (pd-keep   ["ForecastId" "t"])))


## Data Exploration

In [None]:
%%hs
(import [math [log exp]])

(pd.set-option "display.float_format" (fn [x] (% "%.2f" x)))
(-> (corona-train-country-state->df "US" "California")
 
  ; Drop zero days
  (pd-filter-> (> $.Y 0.0))
  
  ; Calculate Y_prime
  (pd-assign-> "Y_prime" (-> $.Y (np.gradient :edge-order 1 $.t)))
  
  ; Calculate Y_prime_prime
  (pd-assign-> "Y_prime_prime" (-> $.Y_prime (np.gradient :edge-order 1 $.t)))
 
  ; Calculate Y_prime/Y
  (pd-assign-> "Y_prime_over_Y" (/ $.Y_prime $.Y))

  ; Calculate log_y_prime which equals Y_prime/Y
  (pd-assign-> "log_Y" (np.log $.Y))
  (pd-assign-> "log_Y_prime"   (-> $.log_Y (np.gradient :edge-order 1 $.t)))
  (pd-assign-> "log_Y_prime_smooth" (-> $.log_Y_prime (.ewm :alpha 0.25) (.mean)))
  (pd-plot ["log_Y_prime" "log_Y_prime_smooth"])
  (pd-plot ["log_Y_prime_smooth"])
  (pd-assign-> "Y_prime_over_Y" $.log_Y_prime_smooth)

  (pd-add-regression "log_Y_m" "log_Y_b" "log_Y_pvalue" "log_Y_rvalue" "t" "log_Y")
  (pd-assign-> "r1" $.log_Y_m)
  (pd-assign-> "N1" (/ (* $.r1 $.Y $.Y) (- (* $.r1 $.Y) $.Y_prime)))

  (pd-add-regression "m" "b" "pvalue" "rvalue" "Y" "Y_prime_over_Y")
 
  (pd-assign-> "r" $.b)
  (pd-assign-> "N" (-> (- $.b) (/ $.m)))
 
 
  (pd-assign-> "t_infl"  (-> $.N (/ $.Y) (- 1) (np.log) (-) (/ $.r) (-) (+ $.t)))
  (pd-assign-> "t_infl_date"  (-> $.t_infl (ps-std-day-to-date)))
 
  ; Growth rate
  (pd-assign-> "growth" (/ $.Y ($.Y.shift)))
  (pd-assign-> "doubling" (/ (np.log 2) (- (np.log $.Y) (np.log ($.Y.shift)))))
 
  ; Model
  (pd-assign-> "Y_predict" 
    (/ (np.mean $.N) 
       (-> $.t (- (np.mean $.t_infl)) (* (np.mean $.r)) (* -1) (np.exp) (+ 1))))
  (pd-save kag-work)
)

(-> kag-work
  (pd-plot ["Y_prime_prime"])
  (pd-plot ["log_Y"])
  (pd-regression "t" "log_Y")
  (pd-plot ["N1"])
  (pd-plot ["log_Y_prime"])
  (pd-plot ["Y_prime_over_Y" "log_Y_prime"])
  (pd-plot ["Y_prime_over_Y"] :index "Y")
  (pd-regression "Y" "Y_prime_over_Y")
  (pd-plot ["growth" "doubling"])
  ;(pd-plot ["Y" "Y_prime"])  (pd-fork (pd-keep ["Y" "Y_prime"]) (display))
  (pd-describe)
  (display)
)


## Model Building

In [None]:
%%h
(import [datetime [datetime timedelta]])

(defn corona-train-df->model-log-regression [df y t]
  ; Drop zeroes
  (=> df (-> df (pd-filter-> (-> $ (get y) (> 0.0)))))
  (=> y (-> df (get y) ))
  (=> t (-> df (get t) ))
  (=> y_log       (-> y (np.log)))
  (=> line        (stats.linregress t y_log))
  (=> m           (-> line.slope (np.float64)))
  (=> b           (-> line.intercept (np.float64)))
  (=> r           (-> line.slope (np.float64)))
  ; Exponentiate first, then round the predicted count
  (=> func        (fn [t] (-> t (* m) (+ b) (np.exp) (np.round :decimals 1))))
  (locals->obj ["LinregressResult" "float64" "datetime" "function"]))

(defn corona-train-df->model-sigmoid-smoothing [df y t]
  (-> df
    (pd-filter-> (-> (get $ y) (> 0.0)))
    (pd-assign-> "y_prime" (-> (get $ y) (np.gradient :edge-order 1 $.t)))
    (pd-assign-> "y_prime_over_y" (-> $.y_prime (/ (get $ y))))
    (pd-assign-> "y_log_prime" (-> (get $ y) (np.log1p) (np.gradient :edge-order 1 $.t)))
    (pd-assign-> "y_log_prime" (-> $.y_log_prime (.ewm :alpha 0.25) (.mean)))
    (pd-save df))
  (=> y           (-> df (get y) ))
  (=> t           (-> df (get t) ))
  (=> y_prime     (-> df (get "y_prime") ))
  (=> y_log_prime (-> df (get "y_log_prime") ))
  (=> line        (stats.linregress y y_log_prime))
  (=> r           (-> line.intercept (np.float64)))
  (=> N           (-> (- r) (/ line.slope) (np.float64)))
  ; Inflection point from Y = N / (1 + exp(-r (t - t_infl))): t_infl = t + log(N/Y - 1) / r
  (=> t_infl      (-> N (/ y) (- 1) (np.log) (-) (/ r) (-) (+ t) (np-dropna) (np.mean)))
  (=> t_infl_date (-> t_infl (timedelta) (+ std-day-0)))
  (=> y_shift     (-> y (.shift)))
  (=> growth      (-> y (/ (y.shift)) (np.mean)))
  (=> doubling    (-> (np.log 2) (/ (np.log growth))))
  ; Logistic curve: N / (1 + exp(-r (t - t_infl)))
  (=> func        (fn [t] (np.round :decimals 1 (/ N (-> t (- t_infl) (* r) (* -1) (np.exp) (+ 1))))))
  (locals->obj ["LinregressResult" "float64" "datetime" "function"]))

(defn corona-fixed-func [value] (fn [t] value))

(defn np-last-2 [ser] 
  (-> ser (get (cut ser.index -2)) (tuple)))

(defn np-is-number [ser] 
  (~ (| (np.isnan ser) (np.isinf ser))))

(defn np-drop-nan-inf [ser]
  (-> ser (np-is-number) (np.extract ser)))

(defn corona-train-df->model-sigmoid [df y-col t]
  ; Drop zeroes
  (=> df (-> df (pd-filter-> (-> $ (get y-col) (> 0.0)))))
  (=> y (-> df (get y-col) ))
  (=> t (-> df (get t) ))
  (if (-> df (len) (= 0)) (return (dict->obj {"func" (corona-fixed-func 0.0)})))
  (=> y_max (-> y (.max)))
  (if (-> df (len) (< 4)) (return (dict->obj {"func" (corona-fixed-func y_max)})))
  (=> [y-last1 y-last0] (-> y np-last-2))
  (if (= y-last0 y-last1) (return (dict->obj {"func" (corona-fixed-func y-last0)})))
  (if (= y-last1 0)       (return (dict->obj {"func" (corona-fixed-func y-last0)})))
  (=> y-last-chg (-> y-last0 (- y-last1) (/ y-last1)))
  (if (< y-last-chg 0.06) (return (dict->obj {"func" (corona-fixed-func y-last0)})))
  (=> y_prime     (-> y (np.gradient :edge-order 1 t)))
  (=> y_prime_over_y (-> y_prime (/ y)))
  (=> y_log_prime (-> y (np.log1p) (np.gradient :edge-order 1 t)))
  (=> line        (stats.linregress y y_log_prime))
  (=> r           (-> line.intercept (np.float64)))
  (=> N           (-> (- r) (/ line.slope) (np.float64)))
  (if (or (< N 1) (np.isnan N)) (=> N 1))
  ; Inflection point from Y = N / (1 + exp(-r (t - t_infl))): t_infl = t + log(N/Y - 1) / r
  (=> t_infl      (-> N (/ y) (- 1) (np.log) (-) (/ r) (-) (+ t) (np-drop-nan-inf) (np.mean)))
  (if (np.isnan t_infl) (=> t_infl 100))
  (=> t_infl_date (-> t_infl (timedelta) (+ std-day-0)))
  (=> y_shift     (-> y (.shift)))
  (=> growth      (-> y (/ (y.shift)) (np.mean)))
  (=> doubling    (-> (np.log 2) (/ (np.log growth))))
  (=> func        
    (fn [t] 
      ; Logistic curve: N / (1 + exp(-r (t - t_infl)))
      (=> y_predict (np.round :decimals 1 (/ N (-> t (- t_infl) (* r) (* -1) (np.exp) (+ 1)))))
      (=> y_predict (-> y_predict (np.nan-to-num :posinf y_max :neginf 0)))
      y_predict))
  (locals->obj ["LinregressResult" "float64" "datetime" "function"]))

## Model Sanity Check

In [None]:
%%h
(=> corona-train-df->model corona-train-df->model-sigmoid)

(=> model 
  (-> (corona-train-country-state->df "China" "Anhui")
    ; Drop zero case days
    (pd-filter-> (> $.Y 0.0))
   
    ; Create model
    (corona-train-df->model "Z" "t")
  ))

(p model)
(model.func 0)

## Deployment

In [None]:
%%hs

(=> corona-y-models {})
(=> corona-z-models {})

(defn kag-log [message]
  (=> timestamp (-> (datetime.now) (str)) )
  (print f"[{timestamp}] {message}"))

(defn corona-build-models []
  ; Build models per region
  (=> train-regions (corona-train->regions))
  (for [region train-regions] 
    (=> train-df (corona-train-region->df region))
    (=> y-model (corona-train-df->model train-df "Y" "t"))
    (=> z-model (corona-train-df->model train-df "Z" "t"))
    (-> corona-y-models (setf region y-model))
    (-> corona-z-models (setf region z-model))))
(corona-build-models)

In [None]:
%%h

(defn corona-prepare-submission []
  (=> submission-df (pd.DataFrame))
  (=> test-regions (corona-test->regions))
  (for [region test-regions]
    (=> y-model (-> corona-y-models (get region)))
    (=> z-model (-> corona-z-models (get region)))
   
    (-> (corona-test-region->df region)
      (pd-assign-> "Y_predict" (-> $.t (y-model.func)))
      (pd-assign-> "Z_predict" (-> $.t (z-model.func)))
      (pd-save predict-df))
    ; Append predict-df to submission-df 
    (=> submission-df (pd.concat [submission-df predict-df])))
  submission-df)
  
(=> submission-df (corona-prepare-submission))
(-> submission-df 
  (pd-rename {"Y_predict" "ConfirmedCases" "Z_predict" "Fatalities"})
  (pd-keep ["ForecastId" "ConfirmedCases" "Fatalities"] ) 
  (.to-csv :index False "submission.csv"))

## Cross Validation

In [None]:
%%hs

; Cross-Validation
(=> corona-train-df->model corona-train-df->model-sigmoid)

; Train test split
(=> full-df (corona-train-country-state->df "China" "Fujian"))
(=> train-test-split 0.95)
(=> split-point (-> full-df (len) (* train-test-split) (int)))
(=> (, train-df test-df) (np.split full-df [split-point]))

; Build model
(=> y-model (corona-train-df->model train-df "Y" "t"))
(=> z-model (corona-train-df->model train-df "Z" "t"))
;(display [y-model z-model])

; Test model predictions
(-> test-df 
 (pd-assign-> "Y_predict" (-> $.t (y-model.func)))
 (pd-assign-> "Z_predict" (-> $.t (z-model.func)))
 (pd-save predict-df)
)
 
; Compute error
(=> y-error (-> predict-df (pd-prediction->error "Y_predict" "Y")))
(=> z-error (-> predict-df (pd-prediction->error "Z_predict" "Z")))

(defn pd-plot-predict-vs-actual [df x y-model y-actual]
  (-> df
    (.set-index x)
    (pd-plot [y-model y-actual]))
  df)

(-> predict-df 
 (pd-plot-predict-vs-actual "t" "Y_predict" "Y")
 (pd-plot-predict-vs-actual "t" "Z_predict" "Z"))

(display [y-error.rmsle z-error.rmsle])
