# Comparing confirmed coronavirus cases around a hotspot location

This work is about finding a general formula to locate hotspots of infections in the United States. A hotspot of infection is a region with many confirmed cases compared with nearby regions. The formula is based on data published by the Johns Hopkins University CSSE at
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv

This data contains the confirmed cases reported as of March 28, 2020, and was accessed on March 29, 2020.

**Applications:**

1) The formulas developed here can be used to analyze the expansion of cases between a hotspot county and the counties within 50 miles of it.

2) In addition, some of these formulas can be used to find all confirmed cases in counties near a fixed location. For example, about 1.4 million tourists attended Mardi Gras/Fat Tuesday activities in New Orleans on February 25, 2020. As a result, many confirmed cases in Louisiana were around New Orleans. (Source: https://www.newsmax.com/us/fat-tuesday-virus/2020/03/26/id/960115/)

3) Code is given to find all hotspots in the United States within a given radius and with a pre-determined ave_difference value (explained below). However, since this is computationally expensive, I have not been able to test it with the computing capacity of my laptop.


There are four functions defined in this work:

**Calculating distances from latitudes-longitudes**: The first function calculates distances (in miles) between two regions using the latitudes and longitudes given in the data. The function is adapted from the formula explained at http://edwilliams.org/avform.htm#Dist and used at https://www.nhc.noaa.gov/gccalc.shtml

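This formula is

$$distance((x_1,y_1),(x_2,y_2))=R\cdot\arccos(\sin(x_1)\sin(x_2)+\cos(x_1)\cos(x_2)\cos(y_1-y_2))$$

where $x_i$ and $y_i$ are the latitude and longitude of each region in radians and $R=3959$ miles is the radius of the Earth used in the code.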
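As a quick check of the distance formula, here is a minimal standalone sketch that applies it to raw coordinates in degrees. The function name `great_circle_miles` and the `np.clip` guard are additions for illustration and are not part of the notebook; the two coordinate pairs are the Abbeville and Acadia rows of the data preview in this notebook.

```python
import numpy as np

EARTH_RADIUS_MILES = 3959  # same Earth radius as in the notebook's code

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    x1, y1, x2, y2 = np.radians([lat1, lon1, lat2, lon2])
    cos_angle = np.sin(x1) * np.sin(x2) + np.cos(x1) * np.cos(x2) * np.cos(y1 - y2)
    # Assumption: clip to [-1, 1] because rounding can push the value just
    # past 1 for (nearly) identical points, which would make arccos return nan.
    return EARTH_RADIUS_MILES * np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Latitude/longitude of Abbeville, South Carolina and Acadia, Louisiana
abbeville = (34.223334, -82.461707)
acadia = (30.295065, -92.414197)
d = great_circle_miles(*abbeville, *acadia)
print('Abbeville to Acadia: {:.1f} miles'.format(d))
```

Two points on the equator that are 90 degrees of longitude apart should be a quarter of the circumference apart, which gives a simple sanity check for any implementation of this formula.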

**Average difference in cases between two nearby regions:** The second function compares the confirmed cases of two nearby regions, $cases1$ and $cases2$, that are a distance $dist$ apart by computing $x=dist/|cases1-cases2|$. For a fixed distance $dist$, a large difference in the number of cases gives a lower value of $x$, meaning that there is a low expansion rate between the hotspot county and the nearby county. However, a county closer to the hotspot county is expected to have a higher number of cases because of its proximity, so for the same difference in cases a closer county gets a lower value.

The ave_difference function can be used to look at the expansion around some already known hotspots, as given below.
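The arithmetic behind ave_difference can be reproduced by hand with the Cook, Lake and Kane figures reported in Application 1 of this notebook. The helper `ave_difference_from_values` below is a hypothetical variant that takes the numbers directly instead of looking counties up in the data, and returning infinity for equal case counts is an assumption made for this sketch.

```python
def ave_difference_from_values(dist, cases1, cases2):
    """Distance in miles divided by the absolute difference in confirmed cases."""
    gap = abs(cases1 - cases2)
    if gap == 0:
        # Assumption: equal case counts have no finite ratio
        return float('inf')
    return dist / gap

# Confirmed cases on March 28, 2020 and distances from Cook county (Application 1)
cook, lake, kane = 2613, 241, 90
lake_value = ave_difference_from_values(34.59591205816728, cook, lake)
kane_value = ave_difference_from_values(32.18673820814248, cook, kane)
print(round(lake_value, 6), round(kane_value, 6))  # prints 0.014585 0.012757
```

Lake and Kane are almost the same distance from Cook, so the lower value for Kane comes from its larger gap in cases (2523 against 2372).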

**Finding all counties within a 50 miles radius:** The function find50miles(county1,state1) finds all the counties within a 50-mile radius of a given county. With it, one can compute all the ave_difference values and determine the direction of expansion. Note that this function could easily be adapted to shorter or longer distances.

**Computation of average difference values for all counties within a 50 miles radius:** The function ave_diff_list(county1,state1) finds all counties within a 50-mile radius of a given (county1,state1) pair and returns a dataframe listing them, sorted from the highest ave_difference value to the lowest. Note that this number is higher when the case difference is low and/or when the two counties are farther apart. A high value might indicate a possible expansion along that direction, ignoring all the other possibilities.

To find all hotspots in the United States, one can combine the find50miles and ave_difference functions, and code for this is given at the end of this work. However, the computing capability of my laptop was not enough to test it.
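One way the cost of the hotspot search might be reduced is to compute all pairwise distances at once with NumPy broadcasting instead of looking up each pair of counties by name. The sketch below is not the notebook's method: the names `pairwise_distance_miles` and `hotspot_pairs` are hypothetical, the column names follow the renamed dataframe (County, Lat, Long, Confirmed), and the three-row table is made-up data used only to show the call.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3959  # same Earth radius as in the notebook's code

def pairwise_distance_miles(lat_deg, lon_deg):
    """Matrix of great-circle distances (miles) between all pairs of points."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    cos_angle = (np.sin(lat)[:, None] * np.sin(lat)[None, :]
                 + np.cos(lat)[:, None] * np.cos(lat)[None, :]
                 * np.cos(lon[:, None] - lon[None, :]))
    # clip guards against rounding just outside [-1, 1]
    return EARTH_RADIUS_MILES * np.arccos(np.clip(cos_angle, -1.0, 1.0))

def hotspot_pairs(df, radius=50, threshold=0.5):
    """Pairs of counties closer than `radius` miles with ave_difference above `threshold`."""
    dist = pairwise_distance_miles(df['Lat'], df['Long'])
    cases = df['Confirmed'].to_numpy(dtype=float)
    case_gap = np.abs(cases[:, None] - cases[None, :])
    # Equal case counts divide by zero and give inf, which exceeds any threshold
    with np.errstate(divide='ignore', invalid='ignore'):
        ave_diff = dist / case_gap
    # Keep the upper triangle so each unordered pair is reported once
    mask = np.triu(np.ones_like(dist, dtype=bool), k=1)
    mask &= (dist < radius) & (ave_diff > threshold)
    rows, cols = np.nonzero(mask)
    names = df['County'].tolist()
    return [(names[i], names[j]) for i, j in zip(rows, cols)]

# Made-up example data (not from the Johns Hopkins file)
toy = pd.DataFrame({
    'County': ['A', 'B', 'C'],
    'Lat': [40.0, 40.3, 45.0],
    'Long': [-90.0, -90.0, -100.0],
    'Confirmed': [100, 10, 5],
})
print(hotspot_pairs(toy, radius=50, threshold=0.2))
```

For about 3170 counties each intermediate matrix holds roughly ten million values, so this approach trades memory for speed. County names repeat across states, so a real run would need to report the state alongside each name.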

**Notes:**

1) The dataset used contains the number of cases on March 28, 2020. However, these formulas should work for other datasets that keep all the necessary variables (i.e., County, State, Confirmed, Latitude and Longitude).

2) Even though the formulas in this work are for the United States, they can be generalized to other countries as well.

3) This work will be updated regularly. So, if there are any errors or if you have suggestions/comments, please let me know immediately. I greatly appreciate it and thank you in advance.

4) This work is prepared for research and education purposes.

Last updated: March 29, 2020 by Selma Yildirim, selmayildirim@gmail.com

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re


counties=pd.read_csv('Mar28JohnsHopkins.csv',usecols=['FIPS','Admin2','Province_State','Lat','Long_','Confirmed','Deaths','Recovered','Active'])
counties=counties.rename({'Admin2':'County','Province_State':'State','Long_':'Long'},axis=1)
#US data
counties=counties[0:3170]

counties['lat_radians']=counties['Lat']*np.pi/180
counties['long_radians']=counties['Long']*np.pi/180

counties.head()

Unnamed: 0,FIPS,County,State,Lat,Long,Confirmed,Deaths,Recovered,Active,lat_radians,long_radians
0,45001.0,Abbeville,South Carolina,34.223334,-82.461707,3,0,0,0,0.59731,-1.439228
1,22001.0,Acadia,Louisiana,30.295065,-92.414197,9,1,0,0,0.528749,-1.612932
2,51001.0,Accomack,Virginia,37.767072,-75.632346,2,0,0,0,0.65916,-1.320033
3,16001.0,Ada,Idaho,43.452658,-116.241552,76,0,0,0,0.758392,-2.028798
4,19001.0,Adair,Iowa,41.330756,-94.471059,1,0,0,0,0.721358,-1.648831


# Functions

### Calculating distances from latitudes-longitudes

In [2]:
#d=acos(sin(lat1)*sin(lat2)+cos(lat1)*cos(lat2)*cos(lon1-lon2))
#http://edwilliams.org/avform.htm#Dist
#https://www.nhc.noaa.gov/gccalc.shtml

def distance(county1,state1,county2,state2):
    #.iloc[0] takes the single matching row; calling float() on a whole Series is deprecated in recent pandas
    x_1=float(counties[(counties['County']==county1) & (counties['State']==state1)]['lat_radians'].iloc[0])
    y_1=float(counties[(counties['County']==county1) & (counties['State']==state1)]['long_radians'].iloc[0])
    x_2=float(counties[(counties['County']==county2) & (counties['State']==state2)]['lat_radians'].iloc[0])
    y_2=float(counties[(counties['County']==county2) & (counties['State']==state2)]['long_radians'].iloc[0])
    earth_radius=3959 #in miles
    #clip keeps rounding errors from pushing the cosine past 1, which would make arccos return nan
    cos_angle=np.sin(x_1)*np.sin(x_2)+np.cos(x_1)*np.cos(x_2)*np.cos(y_1-y_2)
    return np.arccos(np.clip(cos_angle,-1,1))*earth_radius


### Average difference in cases between two nearby regions

In [3]:
#average difference: distance divided by the absolute difference in confirmed cases

def ave_difference(county1,state1,county2,state2):
    cases1=int(counties[(counties['County']==county1) & (counties['State']==state1)]['Confirmed'].iloc[0])
    cases2=int(counties[(counties['County']==county2) & (counties['State']==state2)]['Confirmed'].iloc[0])
    dist=distance(county1,state1,county2,state2)
    #equal case counts would divide by zero, so return infinity explicitly
    if cases1==cases2:
        return np.inf
    return dist/abs(cases1-cases2)


### Finding all counties within a 50 miles radius

In [4]:
#find all counties within 50 miles radius

def find50miles(county1,state1):
    radius_list=[]
    for i in range(0,len(counties)):
        county2=str(counties.iloc[i]['County'])
        state2=str(counties.iloc[i]['State'])
        #skip only the county itself; a county with the same name in another state stays eligible
        if (county2,state2)!=(county1,state1) and distance(county1,state1,county2,state2)<50:
            radius_list.append((county2,state2))
    return radius_list
        



### Computation of average difference values for all counties within a 50 miles radius

In [5]:
def ave_diff_list(county1,state1):
    ave_list=[]
    for i in find50miles(county1,state1):
        county2=i[0]
        state2=i[1]
        cases=int(counties[(counties['State']==state2) & (counties['County']==county2)]['Confirmed'].iloc[0])
        dist=distance(county1,state1,county2,state2)
        ave_list.append((i[0],i[1],ave_difference(county1,state1,county2,state2),cases,dist))
    df=pd.DataFrame(ave_list,columns=['county','state','ave_difference','cases','distance'])
    return df.sort_values(by='ave_difference',ascending=False)



## Application 1: Finding confirmed cases around some hotspot counties

### Confirmed cases around Cook County, Illinois

In [6]:
#check formula
print('Distance between Cook county and Lake county in Illinois is {}.'.format(distance('Cook','Illinois','Lake','Illinois')))
print('Distance between Cook county and Kane county in Illinois is {}.'.format(distance('Cook','Illinois','Kane','Illinois')))

print('Average difference(Cook,Lake) in Illinois: {}.'.format(ave_difference('Cook','Illinois','Lake','Illinois')))
print('Average difference(Cook,Kane) in Illinois: {}.'.format(ave_difference('Cook','Illinois','Kane','Illinois')))


Distance between Cook county and Lake county in Illinois is 34.59591205816728.
Distance between Cook county and Kane county in Illinois is 32.18673820814248.
Average difference(Cook,Lake) in Illinois: 0.01458512312738924.
Average difference(Cook,Kane) in Illinois: 0.01275732786688168.


So, the distance from Cook county to Lake county is close to the distance from Cook county to Kane county. On March 28, 2020, the confirmed cases were as follows:

Cook: 2613

Lake: 241

Kane: 90

So, when we compute the ave_difference formula, we should get a lower value for Kane county, showing that there are fewer cases there than in Lake county.

In [7]:
ave_diff_list('Cook','Illinois')

Unnamed: 0,county,state,ave_difference,cases,distance
2,Grundy,Illinois,0.018937,2,49.444944
0,DeKalb,Illinois,0.018846,4,49.170265
4,Kankakee,Illinois,0.018832,27,48.700109
8,McHenry,Illinois,0.018185,47,46.663971
9,Porter,Indiana,0.017959,9,46.765535
6,Lake,Illinois,0.014585,241,34.595912
7,Lake,Indiana,0.014464,68,36.811447
5,Kendall,Illinois,0.013839,11,36.009252
3,Kane,Illinois,0.012757,90,32.186738
10,Will,Illinois,0.01149,127,28.564127


The ave_difference values in this table are low since the number of cases in Cook county is a lot higher than in the counties near Cook county. This might mean that social distancing measures are working and cases are mostly contained in Cook county. However, testing location might be the reason for the high numbers in Cook county. For example, we don't know whether patients residing in DuPage county were tested and recorded in Cook county.

On another note, even though the number of cases in DuPage county is higher than the number of cases in Will county, the ave_difference value of DuPage county is lower since it is 15 miles closer to Cook county than Will county is, and a higher number of cases is expected there due to its close proximity to a hotspot county.

### Confirmed cases around Denver County, Colorado

In [8]:
find50miles('Denver','Colorado')

ave_diff_list('Denver','Colorado')

Unnamed: 0,county,state,ave_difference,cases,distance
1,Arapahoe,Colorado,0.188275,155,29.55925
7,Jefferson,Colorado,0.152423,158,23.473147
2,Boulder,Colorado,0.145672,76,34.378626
4,Clear Creek,Colorado,0.134033,3,41.416317
5,Douglas,Colorado,0.127853,79,29.789725
0,Adams,Colorado,0.122557,71,29.536139
6,Gilpin,Colorado,0.11307,0,35.277777
3,Broomfield,Colorado,0.056552,13,16.909092


Since the number of cases in Denver county was higher than the number of cases in the other counties within the 50-mile radius, the higher the ave_difference value, the more expansion to that county. For example, Arapahoe county has 155 cases (3 fewer than Jefferson county), yet its ave_difference value is higher: it is about 6 miles farther from Denver county than Jefferson county while having almost the same number of cases.

Another example is Arapahoe county and Douglas county. Even though they are about the same distance away from Denver county, the value is higher for Arapahoe county since it has more cases than Douglas county.

*In summary, this formula could be useful to evaluate the expansion of infections around a hotspot county. The differences might be due to many reasons, but these differences and similarities are worth considering.*


### Confirmed cases around New York City, New York

In [9]:
ave_diff_list('New York City','New York')

Unnamed: 0,county,state,ave_difference,cases,distance
10,Orange,New York,0.001646,1101,47.210486
3,Fairfield,Connecticut,0.001596,908,46.07241
12,Putnam,New York,0.001585,131,46.979948
17,Sussex,New Jersey,0.001532,81,45.495666
19,Westchester,New York,0.001348,7875,29.524
7,Monmouth,New Jersey,0.001279,781,37.091555
16,Somerset,New Jersey,0.00124,258,36.604678
6,Middlesex,New Jersey,0.001125,808,32.596311
8,Morris,New Jersey,0.001047,442,30.70806
15,Rockland,New York,0.000955,1896,26.614832


## Application 2: Confirmed cases around a location

In [10]:
### Some of the functions are adapted below to take a single location instead of a county

In [11]:
#latitude and longitude of a location (in degrees) should be given as a tuple.

def distanceloc(location,county2,state2):
    x_1=location[0]*np.pi/180
    y_1=location[1]*np.pi/180
    #.iloc[0] takes the single matching row; calling float() on a whole Series is deprecated in recent pandas
    x_2=float(counties[(counties['County']==county2) & (counties['State']==state2)]['lat_radians'].iloc[0])
    y_2=float(counties[(counties['County']==county2) & (counties['State']==state2)]['long_radians'].iloc[0])
    earth_radius=3959 #in miles
    return np.arccos(np.sin(x_1)*np.sin(x_2)+np.cos(x_1)*np.cos(x_2)*np.cos(y_1-y_2))*earth_radius


In [12]:
def find50milesloc(location):
    radius_list=[]
    for i in range(0,len(counties)):
        county2=str(counties.loc[i]['County'])
        state2=str(counties.loc[i]['State'])
        if distanceloc(location,county2,state2)<50:
            radius_list.append((county2,state2))
    return radius_list
        


In [13]:
def cases_list(location):
    rows=[] #named rows so the list does not shadow the function name
    for i in find50milesloc(location):
        county=i[0]
        state=i[1]
        cases=int(counties[(counties['State']==state) & (counties['County']==county)]['Confirmed'].iloc[0])
        dist=distanceloc(location,county,state)
        rows.append((county,state,cases,dist))
    df=pd.DataFrame(rows,columns=['county','state','cases','distance'])
    return df.sort_values(by='distance',ascending=True)
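The location-based functions above look every county up by name, one row at a time. As a hedged alternative, the sketch below computes the distance from one location to all rows in a single NumPy expression and takes the radius as an argument. The function name `cases_within_radius` and its `radius` parameter are hypothetical, and the sample rows are the first five rows of the data preview re-typed so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3959  # same Earth radius as in distanceloc

def cases_within_radius(df, location, radius=50):
    """Rows of df within `radius` miles of `location` (lat, long in degrees), nearest first."""
    lat0, lon0 = np.radians(location)
    lat = np.radians(df['Lat'].to_numpy(dtype=float))
    lon = np.radians(df['Long'].to_numpy(dtype=float))
    cos_angle = np.sin(lat0) * np.sin(lat) + np.cos(lat0) * np.cos(lat) * np.cos(lon0 - lon)
    # clip guards against rounding just outside [-1, 1]
    dist = EARTH_RADIUS_MILES * np.arccos(np.clip(cos_angle, -1.0, 1.0))
    out = df.assign(distance=dist)
    out = out[out['distance'] < radius]
    return out[['County', 'State', 'Confirmed', 'distance']].sort_values(by='distance')

# First five rows of the data preview
sample = pd.DataFrame({
    'County': ['Abbeville', 'Acadia', 'Accomack', 'Ada', 'Adair'],
    'State': ['South Carolina', 'Louisiana', 'Virginia', 'Idaho', 'Iowa'],
    'Lat': [34.223334, 30.295065, 37.767072, 43.452658, 41.330756],
    'Long': [-82.461707, -92.414197, -75.632346, -116.241552, -94.471059],
    'Confirmed': [3, 9, 2, 76, 1],
})
New_Orleans = (29.9537, -90.07775)
nearby = cases_within_radius(sample, New_Orleans, radius=200)
print(nearby)
```

With only these five rows, a 200-mile radius is used so that at least one county (Acadia, Louisiana) falls inside it. On the full dataframe the default of 50 miles would correspond to find50milesloc.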



### New Orleans, Louisiana 

In [14]:
New_Orleans=(29.9537, -90.07775) #Latitude, longitude value found by searching on Bing.
find50milesloc(New_Orleans)
cases_list(New_Orleans)

Unnamed: 0,county,state,cases,distance
3,Orleans,Louisiana,1298,12.085264
1,Jefferson,Louisiana,744,14.849598
6,St. Charles,Louisiana,30,16.944644
8,St. John the Baptist,Louisiana,54,26.595628
9,St. Tammany,Louisiana,134,32.312803
5,St. Bernard,Louisiana,43,33.882161
2,Lafourche,Louisiana,34,35.71413
7,St. James,Louisiana,48,43.632561
4,Plaquemines,Louisiana,19,46.463227
0,Hancock,Mississippi,9,47.636077


## Application 3: Finding all hotspots in the United States

In [15]:
#find hotspots; not tested on the full data because my laptop's computing capability is not enough
#find50miles is called once per county instead of once per pair of counties to avoid repeated work

for i in range(0,50):
    county1=str(counties.loc[i]['County'])
    state1=str(counties.loc[i]['State'])
    for county2,state2 in find50miles(county1,state1):
        if ave_difference(county1,state1,county2,state2)>0.5:
            print((county1,county2))