# COVID False Positive Rates versus Prevalence

Early COVID-19 tests had very high false positive rates. That is often manageable when tests are used alongside other indicators of infection, but it can produce spurious results when a very large number of tests are administered. In this notebook, we'll use the sensitivity and specificity values from a recently published paper, [Development and Clinical Application of A Rapid IgM-IgG Combined Antibody Test for SARS-CoV-2 Infection Diagnosis](https://pubmed.ncbi.nlm.nih.gov/32104917/), which reports:

> The overall testing sensitivity was 88.66% and specificity was 90.63%
   
Sensitivity and specificity are a bit hard to understand, but they are well explained on their [Wikipedia page](https://en.wikipedia.org/wiki/Sensitivity_and_specificity). The important part to understand is the table in the [worked example](https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Worked_example). When a test is administered, there are four possible outcomes. The test can return a positive result, which can be a true positive or a false positive, or it can return a negative result, which can be a true negative or a false negative. If you organize those possibilities by the true condition (does the patient have the virus or not):

* Patient has virus
 * True Positive ($\mathit{TP}$)
 * False Negative ($\mathit{FN}$)
* Patient does not have virus
 * True Negative ($\mathit{TN}$)
 * False Positive ($\mathit{FP}$)

In the Wikipedia table:

* The number of people who do have the virus is $\mathit{TP}+\mathit{FN}$, the true positives plus the false negatives, which are the cases that should have been reported positive but were not.
* The number of people who do not have the virus is $\mathit{TN}+\mathit{FP}$, the true negatives plus the false positives, which are the cases that should have been reported negative but were not.

The values of Sensitivity and Specificity are defined as: 

$$\begin{array}{ll}
\mathit{Sn} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} & \text{True positive outcomes divided by all positive conditions} \tag{1}\label{eq1}\\ 
\mathit{Sp} = \frac{\mathit{TN}}{\mathit{FP} + \mathit{TN}} & \text{True negative outcomes divided by all negative conditions}\\ 
\end{array}$$
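
As a quick illustration of these two definitions, the sketch below computes $\mathit{Sn}$ and $\mathit{Sp}$ from a set of made-up outcome counts. The counts and the function name `sens_spec` are hypothetical; they are not taken from the paper or from the Wikipedia example.

```python
def sens_spec(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the four outcome counts."""
    sn = tp / (tp + fn)   # share of infected patients who test positive
    sp = tn / (tn + fp)   # share of uninfected patients who test negative
    return sn, sp

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
sn_demo, sp_demo = sens_spec(tp=90, fn=10, tn=950, fp=50)
print(f"Sn={sn_demo:.2f} Sp={sp_demo:.2f}")
```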

We want to know the number of false positives ($\mathit{FP}$) given the number of positive conditions ($\mathit{TP}+\mathit{FN}$) and the total number of tests. To compute it, we need some more information about the number of people tested and how common the disease is:

* Total test population $P$, the number of people being tested, which equals $\mathit{TP}+\mathit{FP}+\mathit{FN}+\mathit{TN}$
* The prevalence $p$, the fraction of the tested population that actually has the condition, which equals $(\mathit{TP}+\mathit{FN})/P$

We can do a little math to get: 

$$\begin{array}{ll}
\mathit{TP} = Pp\,\mathit{Sn} \\
\mathit{FP} = P(1-p)(1-\mathit{Sp}) \\
\mathit{TN} = P(1-p)\,\mathit{Sp} \\
\mathit{FN} = Pp(1-\mathit{Sn}) \\
\end{array}$$
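
A sketch of that math, using only the definitions above: the number of people with the condition is $\mathit{TP}+\mathit{FN} = Pp$ and the number without it is $\mathit{TN}+\mathit{FP} = P(1-p)$. Multiplying each group by the matching rate gives the four counts:

$$\begin{array}{ll}
\mathit{TP} = \mathit{Sn}\,(\mathit{TP}+\mathit{FN}) = Pp\,\mathit{Sn} & \text{from the definition of } \mathit{Sn} \\
\mathit{FN} = (\mathit{TP}+\mathit{FN}) - \mathit{TP} = Pp(1-\mathit{Sn}) & \\
\mathit{TN} = \mathit{Sp}\,(\mathit{TN}+\mathit{FP}) = P(1-p)\,\mathit{Sp} & \text{from the definition of } \mathit{Sp} \\
\mathit{FP} = (\mathit{TN}+\mathit{FP}) - \mathit{TN} = P(1-p)(1-\mathit{Sp}) & \\
\end{array}$$

As a check, the four counts sum to $Pp + P(1-p) = P$.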

You can see examples of these equations worked out in the third line in the red and green cells of the [Worked Example](https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Worked_example) on the Sensitivity and Specificity Wikipedia page. 



One of the interesting questions when test results are reported is "What percentage of the positive results are true positives?" This is a particularly important question for the COVID-19 pandemic because there are a lot of reports that most people with the virus are asymptomatic. Are they really asymptomatic, or just false positives?

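The metric we're interested in here is the portion of positive results that are true positives. This quantity is usually called the positive predictive value (PPV), or precision; "true positive rate" more commonly refers to sensitivity, but this notebook labels the metric $\mathit{TPR}$: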

$$\mathit{TPR} = \frac{\mathit{TP} }{ \mathit{TP} +\mathit{FP}  } $$

Substituting the expressions for $\mathit{TP}$ and $\mathit{FP}$ and cancelling $P$ gives:

$$\mathit{TPR} = \frac{p\mathit{Sn} }{ p\mathit{Sn} + (1-p)(1-\mathit{Sp})  } $$

It is important to note that $\mathit{TPR}$ is not dependent on $P$, the size of the population being tested. It depends only on the quality parameters of the test, $\mathit{Sn}$ and $\mathit{Sp}$, and the prevalence, $p$. For a given test, only the prevalence will change over time. 
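
A minimal sketch of that independence, assuming the sensitivity and specificity quoted from the paper: the helper `ppv_from_counts` (a hypothetical name) builds the true and false positive counts for a population of size `P` and returns $\mathit{TP}/(\mathit{TP}+\mathit{FP})$. Changing `P` should leave the result unchanged up to floating-point rounding, while changing `p` moves it substantially. The prevalences in the loop are illustrative only.

```python
SN_PAPER = 0.8866  # sensitivity reported in the paper
SP_PAPER = 0.9063  # specificity reported in the paper

def ppv_from_counts(P, p, Sn=SN_PAPER, Sp=SP_PAPER):
    """Share of positive results that are true positives, via explicit counts."""
    TP = P * p * Sn
    FP = P * (1 - p) * (1 - Sp)
    return TP / (TP + FP)

for p in (0.001, 0.01, 0.1, 0.5):  # illustrative prevalences
    small = ppv_from_counts(1e4, p)
    large = ppv_from_counts(1e8, p)
    print(f"p={p:<6} P=1e4: {small:.4f}  P=1e8: {large:.4f}")
```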


In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Sp = .9063  # specificity reported in the paper
Sn = .8866  # sensitivity reported in the paper

def p_vs_tpr(Sp, Sn):
    """Yield (prevalence, TPR) pairs over log-spaced prevalences."""
    for p in np.power(10, np.linspace(-7, np.log10(.5), num=100)): # range from 1 per 10m to 50%
        tpr = (p*Sn) / ((p*Sn) + (1-p)*(1-Sp))
        yield (p, tpr)

df = pd.DataFrame(list(p_vs_tpr(Sp, Sn)), columns='p tpr'.split())

fig, ax = plt.subplots(figsize=(12,8))

# The figure size is set above, so it is not repeated here
df.plot(ax=ax, x='p', y='tpr')

fig.suptitle(f'Portion of Positives that Are True Vs Prevalence\nFor test with Sp={Sp} and Sn={Sn}', fontsize=20)

ax.set_xlabel('Condition Prevalence in Portion of Tested Population', fontsize=18)
ax.set_ylabel('Portion of Positive Test Results that are True Positives', fontsize=18)


#ax.set_xscale('log')
#ax.set_yscale('log')

In [None]:

# Find the row with a TPR closest to 50%
loss_min_idx = (df['tpr']-.5).abs().idxmin()
df.loc[loss_min_idx]  # idxmin returns an index label, so look it up with .loc
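
The same break-even prevalence can also be obtained in closed form, which gives a check on the grid search. Setting $\mathit{TPR} = 1/2$ in the expanded formula gives $p\,\mathit{Sn} = (1-p)(1-\mathit{Sp})$, so $p = (1-\mathit{Sp}) / (\mathit{Sn} + 1 - \mathit{Sp})$. The sketch below evaluates that expression; the function name `breakeven_prevalence` is hypothetical, and the default arguments are the values quoted from the paper.

```python
def breakeven_prevalence(Sp=0.9063, Sn=0.8866):
    """Prevalence at which half of all positive results are true positives."""
    return (1 - Sp) / (Sn + 1 - Sp)

p_half = breakeven_prevalence()

# Plug the result back into the TPR formula as a consistency check.
tpr_at_half = (p_half * 0.8866) / (p_half * 0.8866 + (1 - p_half) * (1 - 0.9063))
print(f"break-even prevalence = {p_half:.4f}, TPR there = {tpr_at_half:.4f}")
```

For this test the break-even point is a little under 10% prevalence. The grid search should land near it, limited by the spacing of the 100 log-spaced grid points.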

In [None]:


def gen_data():
    """Yield (Sp, Sn, p), where p is the grid prevalence at which TPR is closest to 50%."""
    for Sp in np.linspace(.9,1,num=20,endpoint=False):
        for Sn in np.linspace(.88,1,num=20,endpoint=False):
            curve = pd.DataFrame(list(p_vs_tpr(Sp, Sn)), columns='p tpr'.split())
            # Find the row with a TPR closest to 50%
            loss_min_idx = (curve['tpr']-.5).abs().idxmin()
            p = curve.loc[loss_min_idx].p
            yield (Sp, Sn, p)

df = pd.DataFrame(list(gen_data()), columns='Sp Sn p'.split())



In [None]:
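import seaborn as sns; sns.set()

# Rows are specificity, columns are sensitivity, cell values are the prevalence where TPR is closest to 50%
t = df.pivot_table(values='p', index='Sp', columns='Sn')

ax = sns.heatmap(t)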


Now we can calculate these values for the situation in the US when it first hit 100 cases, on March 3. 

In [None]:
# What was the date that the US first hit 100 cases?
import metapack as mp
pkg = mp.open_package('http://library.metatab.org/jhu.edu-covid19-1.zip')
df = pkg.resource('confirmed').dataframe()
df[df.location == 'US'].date_100.unique()

Assume that on March 3 the US Government was able to test the whole US population using the test described in the paper linked above, and that the actual number of cases was 10 times the reported number. So we have:

* $\mathit{Sp} = .9063$
* $\mathit{Sn} = .8866$
* $P = 329e6$
* $p = 1000/P$


In [None]:
# In machine learning, the chart of the TP, FP, TN, FN values
# is called a confusion matrix. 

def calc_cm(p, P, Sp, Sn):
    
    TP = P * p * Sn
    FP = P * (1 - p) * ( 1 - Sp )
    TN = P * ( 1 - p ) * Sp
    FN = P * p * ( 1- Sn )
    
    return ( TP, FP, TN, FN)

recorded_cases = 100
actual_cases = recorded_cases * 10  # assume cases are under-reported by 10x
P = 329e6  # the whole US population is tested
p = actual_cases / P
    
"p={} TP={} FP={} TN={} FN={}".format(p, *calc_cm(p, P, Sp, Sn))
    

So, in this case, of 1,000 infections in the whole US, the testing would have caught 886 of them, while producing 31 million false positives. 
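
To put those two numbers side by side, the sketch below recomputes them for the stated assumptions (a tested population of 329 million and 1,000 actual cases) and reports the share of positive results that would be true positives. The variable names are chosen for illustration, and the figures are a back-of-the-envelope check rather than an epidemiological estimate.

```python
SENSITIVITY = 0.8866      # from the paper quoted above
SPECIFICITY = 0.9063
us_population = 329e6     # everyone is tested in this thought experiment
assumed_cases = 1000      # 10x the 100 reported cases

prevalence = assumed_cases / us_population
true_pos = us_population * prevalence * SENSITIVITY
false_pos = us_population * (1 - prevalence) * (1 - SPECIFICITY)
share_true = true_pos / (true_pos + false_pos)

print(f"true positives  ~ {true_pos:,.0f}")
print(f"false positives ~ {false_pos:,.0f}")
print(f"share of positives that are real ~ {share_true:.6f}")
```

Under these assumptions, only about 3 in 100,000 positive results would correspond to a real infection.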

In [None]:
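import pandas as pd
import numpy as np

def gen_data(p, Sp, Sn):
    """Yield one confusion-matrix row per test count, at fixed prevalence p."""
    tests = np.linspace(1e6,250e6, num=101)
    
    for n_tests in tests:  # n_tests is the number of tests administered
        yield (n_tests,p)+tuple(int(e) for e in calc_cm(p, n_tests, Sp, Sn))
    
df = pd.DataFrame(list(gen_data(p,Sp, Sn)), columns='tests p TP FP TN FN'.split())
# Actual infections among those tested divided by detected infections (about 1/Sn)
df['p_ratio'] = (df.p*df.tests) / df.TP
df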