# Set Up


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

sns.set()

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [0]:
from google.colab import drive
import pandas as pd

drive.mount("/content/gdrive", force_remount=True)
root_path = 'gdrive/My Drive/Datathon/Data'

Mounted at /content/gdrive


## Helper Functions

In [0]:
def to_zero(count):
    """Convert a Google Trends value to int, treating '<1' as 0."""
    if count == '<1':
        return 0
    return int(count)

In [0]:
def get_month(month):
  """Map an upper-case month name to its number; return 0 for any other label."""
  switcher = {
    'JANUARY': 1,
    'FEBRUARY': 2,
    'MARCH': 3,
    'APRIL': 4,
    'MAY': 5,
    'JUNE': 6,
    'JULY': 7,
    'AUGUST': 8,
    'SEPTEMBER': 9,
    'OCTOBER': 10,
    'NOVEMBER': 11,
    'DECEMBER': 12
  }
  return switcher.get(month, 0)
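
As a rough alternative to applying `to_zero` row by row, the same cleaning could be done on a whole column at once. The sketch below assumes the raw Google Trends column holds strings such as `'<1'` and `'12'`; `trends_to_int` is a hypothetical name and is not used elsewhere in this notebook.

```python
import pandas as pd

def trends_to_int(column):
    """Hypothetical helper: map Google Trends '<1' entries to 0 and cast the rest to int."""
    # Cast to str first so columns that pandas already parsed as numbers are handled too.
    cleaned = column.astype(str).str.strip().replace('<1', '0')
    return pd.to_numeric(cleaned).astype(int)

# Made-up sample values in the same format as the Google Trends export.
raw = pd.Series(['0', '<1', '3', '100'])
searches = trends_to_int(raw)
print(searches.tolist())
```

Treating `<1` as 0 slightly understates very low search interest, which is the same simplification `to_zero` makes.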

# Objective and Rationale

*   We hypothesize that public awareness of disease outbreaks correlates negatively with the number of arrivals from China into the US.
  * We measure public awareness using two metrics: 
      * The popularity of disease-related search topics based on Google Trends
      * The number of disease-related articles from People's Daily, China's largest newspaper and the ruling Chinese Communist Party's official newspaper.
* We focused our attention on global epidemics in the 21st century, particularly the ones that China was directly affected by.
  * This narrows our focus to the 2005-06 Avian Influenza (H5N1), 2009 Swine Influenza (H1N1) and 2013 Avian Influenza (H7N9). 
  * In particular, China was the epicenter of the 2005-06 and 2013 Avian Influenza outbreaks, which makes them particularly relevant for understanding the current COVID-19 epidemic. 
  * Other epidemics in the 21st century include the 2015-16 Zika virus, the 2013-2016 Ebola virus, and the 2012 MERS coronavirus. We excluded these for the following reasons:
      * The diseases were mainly localized within their respective epicenter regions (Americas, West Africa and Middle East) and China was not affected by the outbreaks.
      * There was relatively less public awareness of the diseases.
      * The diseases occurred within the same 2013-2016 period, making year-on-year changes in arrival data an unreliable measure of each disease's impact on travel patterns.
  * The 2003 SARS outbreak would be the most relevant to the current COVID-19 epidemic, as China is the epicenter and the most heavily affected country in both cases. However, Google Trends data before 2004 is unavailable.


In [0]:
with open(f"{root_path}/Google Trends - China 2004-Present.csv", encoding = "utf-8") as f:
  google_trends = pd.read_csv(f)

fig = go.Figure()

google_trends['swine'] = google_trends['Swine influenza: (China)'].apply(to_zero)
google_trends['avian'] = google_trends['Avian influenza: (China)'].apply(to_zero)
google_trends['ebola'] = google_trends['Ebola virus disease: (China)'].apply(to_zero)
google_trends['mers'] = google_trends['Middle East respiratory syndrome: (China)'].apply(to_zero)
google_trends['zika'] = google_trends['Zika fever: (China)'].apply(to_zero)

fig.add_trace(go.Scatter(x=google_trends['Month'], y=google_trends['swine'],mode='lines',name='Swine influenza: (China)'))
fig.add_trace(go.Scatter(x=google_trends['Month'], y=google_trends['avian'],mode='lines',name='Avian influenza: (China)'))
fig.add_trace(go.Scatter(x=google_trends['Month'], y=google_trends['ebola'],mode='lines',name='Ebola: (China)'))
fig.add_trace(go.Scatter(x=google_trends['Month'], y=google_trends['mers'],mode='lines',name='MERS: (China)'))
fig.add_trace(go.Scatter(x=google_trends['Month'], y=google_trends['zika'],mode='lines',name='Zika virus: (China)'))
fig.update_xaxes(nticks=16)
fig.update_layout(title='Google Trends results for disease-related search terms, 2004-2020')
fig.show()

# Data Collection and Processing
We had three primary sources of data:
* Historical Flight Dataset (Fidelity)
* Google Trends data (https://trends.google.com/trends/)
* Archival data of People's Daily news articles from 1946-2020 (http://data.people.com.cn)

With each dataset we faced several challenges that also led to new insights.

Flight Dataset:
* For flight data, we selected the timeframes of interest that covered both the duration of the influenza outbreaks and the spikes in Google Trend searches related to the epidemics.
* For the 2009 Swine Influenza and the 2013 Avian Influenza, we found that calendar-year data was sufficient. For the 2005-06 Avian Influenza, data from September 2005 to August 2006 was more appropriate.

* Since travel patterns are seasonal, we measured arrivals as year-on-year changes to remove these seasonal fluctuations.
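
The year-on-year measure can be sketched as follows, using the first five months of the 2012 and 2013 arrival counts that appear in the 2013 Avian Flu data section of this notebook. The column names `YOY Ratio` and `YOY Change` are chosen here for illustration only.

```python
import pandas as pd

# First five monthly arrival counts for 2012 and 2013, copied from the table in this notebook.
arrivals = pd.DataFrame(
    {
        2012: [161730, 66041, 84185, 100429, 118026],
        2013: [173113, 122672, 103531, 113835, 150344],
    },
    index=pd.Index(range(1, 6), name="Month"),
)

# Ratio of this year's arrivals to the same month of the previous year.
arrivals["YOY Ratio"] = arrivals[2013] / arrivals[2012]
# Subtracting 1 gives the fractional change, so 0 means "no change from last year".
arrivals["YOY Change"] = arrivals["YOY Ratio"] - 1

print(arrivals.round(3))
```

Because each month is compared with the same month of the previous year, a recurring seasonal peak (for example around holidays) cancels out of the ratio.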

Google Trends data:
* Google Trends data is particularly useful for measuring public awareness because Topics cover related concepts across multiple languages instead of simply searching for keyword matches.

* An early issue we noticed was that other prominent recent epidemics pose difficulties for analysis, namely the 2015-16 Zika virus, the 2013-2016 Ebola virus, and the 2012 MERS coronavirus.
* These diseases overlapped in time and showed much lower search volumes than the diseases we studied. We therefore concluded that they would produce noisy data and excluded them from this study.

* We noticed that the Google Trends correlation was slightly off for the 2005-06 Avian Influenza.
* We considered several possible reasons. One is the relatively lower use of Google in the earlier years of the decade, which would make Google search data less representative of public awareness in general, and of the Chinese population's awareness in particular. Additionally, Google improved its categorization and organization of search data for Insights for Search in 2008, so Google Trends data may be more reliable after this date.

* We also acknowledge that Google is a limited proxy for public awareness in China because it is banned there. The examples below show that China search trends (in red) deviate from worldwide Google search trends (in blue), but the two are still correlated. Moreover, Google search trends correlate strongly with the number of influenza-related news articles published in China, so we consider Google Trends a valid proxy.

![Global searches vs Chinese searches](https://i.imgur.com/3q6vM5G.png)

![Global searches vs Taiwan searches](https://i.imgur.com/WEeNwoZ.png)
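
One way to put a number on the agreement between search interest and news coverage is a Pearson correlation between the two monthly series. In the minimal sketch below, the article counts are the first five months of the 2009 People's Daily counts shown later in this notebook, while the monthly search values are hypothetical placeholders, so the resulting coefficient is illustrative and not a result of this study.

```python
import pandas as pd

monthly = pd.DataFrame({
    # January-May 2009 article counts, as shown in the 2009 Swine Flu data section.
    "articles": [9, 7, 0, 25, 175],
    # Hypothetical monthly search index values, for illustration only.
    "searches": [0, 0, 0, 12, 100],
})

# Pearson correlation coefficient between the two series (Series.corr defaults to Pearson).
r = monthly["searches"].corr(monthly["articles"])
print(f"Pearson r = {r:.3f}")
```

With only a handful of points dominated by a single spike, a high coefficient should be read with caution; a full comparison would use every month in the study window.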

People's Daily news articles:

* We first identified a list of terms that commonly appeared in Chinese news articles related to influenza and scraped the web archive for relevant articles.
* However, when performing a sanity check on the data, we found a large number of articles that were not related to diseases or epidemics.
* After closer reading, we learned that search terms that were too broad (for example, 病毒, "virus") are often used metaphorically in Chinese writing, resulting in noisy data.
* We then narrowed our search terms so that our collected data would be cleaner.
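
The effect of narrowing the search terms can be sketched on a toy example. Everything in the block below is an assumption made for illustration: the article titles are invented, the `date` and `title` column names are hypothetical, and the narrow terms (猪流感, "swine flu", and 甲型H1N1流感, "influenza A H1N1") are examples and not necessarily the terms used in this study.

```python
import re
import pandas as pd

# Invented article records; the real archive's fields and contents will differ.
articles = pd.DataFrame({
    "date": pd.to_datetime(["2009-04-28", "2009-05-02", "2009-05-15", "2009-05-20"]),
    "title": [
        "甲型H1N1流感防控工作部署",  # influenza A H1N1 prevention measures
        "电脑病毒威胁网络安全",      # computer viruses threaten network security (unrelated)
        "猪流感疫情通报",            # swine flu situation report
        "全国夏粮收购启动",          # summer grain purchasing begins (unrelated)
    ],
})

def count_by_month(frame, terms):
    """Count articles per calendar month whose title contains any of the given terms."""
    pattern = "|".join(re.escape(term) for term in terms)
    hits = frame[frame["title"].str.contains(pattern, regex=True)]
    return hits.groupby(hits["date"].dt.month).size()

broad = count_by_month(articles, ["病毒"])                    # broad term: "virus"
narrow = count_by_month(articles, ["猪流感", "甲型H1N1流感"])  # narrower, disease-specific terms

print("broad:", broad.to_dict())
print("narrow:", narrow.to_dict())
```

In this toy data the broad term matches only the unrelated computer-virus article, while the narrower terms pick up the two influenza articles.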

# 2009 Swine Flu

## Data

In [0]:
with open(f"{root_path}/2009 Swine Influenza.csv", encoding = "utf-8") as f:
  swine_trends = pd.read_csv(f)

swine_trends['Searches'] = swine_trends['Swine influenza: (China)'].apply(to_zero)
swine_trends['Week'] = swine_trends.index + 1

swine_trends.head()

Unnamed: 0,Week,Swine influenza: (China),Searches
0,1,0,0
1,2,<1,0
2,3,0,0
3,4,<1,0
4,5,0,0


In [0]:
with open(f"{root_path}/2004-2010 Monthly Tourism Statistics.xlsx",'rb') as f:
  travel_2009 = pd.read_excel(f,"2009")
travel_2009['Month'] = travel_2009['MONTH/QUARTER'].apply(get_month)
travel_2009 = travel_2009[['Month','PRC & HONG KONG']].query('Month!=0')

with open(f"{root_path}/2004-2010 Monthly Tourism Statistics.xlsx",'rb') as f:
  travel_2008 = pd.read_excel(f,"2008")

travel_2008['Month'] = travel_2008['MONTH/QUARTER'].apply(get_month)
travel_2008 = travel_2008[['Month','PRC & HONG KONG']].query('Month!=0')

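# Ratio of 2009 to 2008 arrivals. .div aligns on the row index, so this assumes both sheets share the same row layout.
travel_2009['YOY Changes'] = travel_2009['PRC & HONG KONG'].div(travel_2008['PRC & HONG KONG'])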
travel_2009.head()

Unnamed: 0,Month,PRC & HONG KONG,YOY Changes
2,1,70987,1.27576
3,2,33788,0.792011
4,3,40999,0.877359
6,4,45969,0.988411
7,5,42118,0.709296


In [0]:
with open(f"{root_path}/2009_articles.csv", encoding = "utf-8") as f:
  articles09 = pd.read_csv(f)
articles09.head()

Unnamed: 0,Month,Articles
0,1,9
1,2,7
2,3,0
3,4,25
4,5,175


## Visualization

In [0]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=swine_trends['Week']/(52/12),y=swine_trends['Searches'],name="Searches",yaxis="y1"))
fig.add_trace(go.Scatter(x=travel_2009['Month'], y=travel_2009['YOY Changes']-1, name='YOY Arrivals Changes',yaxis="y2"))
fig.add_trace(go.Scatter(x=articles09['Month'],y=articles09['Articles'],name='Number of Articles',yaxis="y3"))

# Create axis objects
fig.update_layout(
    xaxis=dict(
        domain=[0.2, 0.9], title="Months"
    ),
    yaxis=dict(title=dict(text="Searches",font=dict(color="blue")),tickfont=dict(color="blue")),
    yaxis2=dict(title=dict(text="YOY Arrival Changes",font=dict(color="red")),tickfont=dict(color="red"),
        anchor="x",overlaying="y",side="right"),
    yaxis3=dict(title=dict(text="No. of Articles",font=dict(color="green")),tickfont=dict(color="green"),
                anchor="free",overlaying="y",side="left", position=0.1),)

# Update layout properties
fig.update_layout(
    title_text="2009 Swine Influenza",
    width=1000,
)

fig.show()
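
The plot above places weekly searches on the monthly axis by dividing the week number by 52/12. An alternative, sketched below under that same approximation, is to aggregate the weekly values into monthly means so that all three series share one monthly grid. The weekly values here are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly search index for one year (52 weeks), for illustration only.
weekly = pd.DataFrame({"Week": np.arange(1, 53)})
weekly["Searches"] = np.where(weekly["Week"].between(17, 22), 80, 1)

# Approximate month of each week: week * 12 / 52, rounded up, so week 1 -> 1 and week 52 -> 12.
weekly["Month"] = np.ceil(weekly["Week"] * 12 / 52).astype(int)

# Mean search index per approximate month.
monthly_searches = weekly.groupby("Month")["Searches"].mean()
print(monthly_searches)
```

Averaging smooths out short weekly spikes, so the weekly series may still be preferable when the timing of a peak matters.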

# 2013 Avian Flu

## Data

In [0]:
with open(f"{root_path}/2013 Avian Influenza.csv", encoding = "utf-8") as f:
  avian13_trends = pd.read_csv(f)
avian13_trends['Searches'] = avian13_trends['Avian influenza: (China)'].apply(to_zero)
avian13_trends['Week'] = avian13_trends.index + 1
avian13_trends.head()

Unnamed: 0,Week,Avian influenza: (China),Searches
0,1,<1,0
1,2,1,1
2,3,<1,0
3,4,<1,0
4,5,<1,0


In [0]:
with open(f"{root_path}/2012-2013 Arrival Cleaned.xlsx",'rb') as f:
  travel_1213 = pd.read_excel(f)
  
travel_1213['YOY Changes'] = travel_1213[2013].div(travel_1213[2012])
travel_1213['Month'] = travel_1213.index+1
travel_1213.head()

Unnamed: 0,2012,2013,YOY Changes,Month
0,161730,173113,1.070383,1
1,66041,122672,1.857513,2
2,84185,103531,1.229803,3
3,100429,113835,1.133487,4
4,118026,150344,1.273821,5


In [0]:
with open(f"{root_path}/2013_articles.csv", encoding = "utf-8") as f:
  articles13 = pd.read_csv(f)
articles13.head()

Unnamed: 0,Month,Articles
0,1,5
1,2,4
2,3,2
3,4,78
4,5,21


## Visualization

In [0]:
fig = go.Figure()
# Add traces

fig.add_trace(go.Scatter(x=avian13_trends['Week']/(52/12),y=avian13_trends['Searches'],name="Searches",yaxis="y1"))
fig.add_trace(go.Scatter(x=travel_1213['Month'], y=travel_1213['YOY Changes']-1, name='YOY Arrivals Changes',yaxis="y2"))
fig.add_trace(go.Scatter(x=articles13['Month'],y=articles13['Articles'],name='Number of Articles',yaxis="y3"))

# Create axis objects
fig.update_layout(
    xaxis=dict(
        domain=[0.2, 0.9], title="Months"
    ),
    yaxis=dict(title=dict(text="Searches",font=dict(color="blue")),tickfont=dict(color="blue")),
    yaxis2=dict(title=dict(text="YOY Arrival Changes",font=dict(color="red")),tickfont=dict(color="red"),
        anchor="x",overlaying="y",side="right"),
    yaxis3=dict(title=dict(text="No. of Articles",font=dict(color="green")),tickfont=dict(color="green"),
                anchor="free",overlaying="y",side="left", position=0.1),)

# Update layout properties
fig.update_layout(
    title_text="2013 Avian Influenza",
    width=1000,
)

fig.show()

# 2005-06 Avian Flu

## Data

In [0]:
with open(f"{root_path}/2005-2006 Avian Influenza.csv", encoding = "utf-8") as f:
  avian0506_trends = pd.read_csv(f)
  
avian0506_trends['Searches'] = avian0506_trends['Avian influenza: (China)'].apply(to_zero)
avian0506_trends['Week'] = avian0506_trends.index + 1

avian0506_trends.head()

Unnamed: 0,Week,Avian influenza: (China),Searches
0,1,2,2
1,2,1,1
2,3,4,4
3,4,4,4
4,5,3,3


In [0]:
with open(f"{root_path}/2004-06 Tourism Cleaned.xlsx",'rb') as f:
  travel_0406 = pd.read_excel(f)

travel_0406['YOY Changes'] = travel_0406['Count 05/06'].div(travel_0406['Count 04/05'])
travel_0406['Month'] = travel_0406.index +1
travel_0406.head()

Unnamed: 0,Month/Year 05/06,Count 05/06,Month/Year 04/05,Count 04/05,YOY Changes,Month
0,2005-09-01,36831,2004-09-01,28525,1.291183,1
1,2005-10-01,34849,2004-10-01,25572,1.36278,2
2,2005-11-01,28833,2004-11-01,24246,1.189186,3
3,2005-12-01,30119,2004-12-01,24173,1.245977,4
4,2006-01-01,41491,2005-01-01,32062,1.294086,5


In [0]:
with open(f"{root_path}/2005_articles.csv", encoding = "utf-8") as f:
  articles05 = pd.read_csv(f)
articles05.head()

Unnamed: 0,Month,Articles
0,1,6
1,2,23
2,3,112
3,4,42
4,5,35


## Visualization


In [0]:
fig = go.Figure()
# Add traces

fig.add_trace(go.Scatter(x=avian0506_trends['Week']/(52/12),y=avian0506_trends['Searches'],name="Searches",yaxis="y1"))
fig.add_trace(go.Scatter(x=travel_0406['Month'], y=travel_0406['YOY Changes']-1, name='YOY Arrivals Changes',yaxis="y2"))
fig.add_trace(go.Scatter(x=articles05['Month'],y=articles05['Articles'],name='Number of Articles',yaxis="y3"))

# Create axis objects
fig.update_layout(
    xaxis=dict(
        domain=[0.2, 0.9], title="Months"
    ),
    yaxis=dict(title=dict(text="Searches",font=dict(color="blue")),tickfont=dict(color="blue")),
    yaxis2=dict(title=dict(text="YOY Arrival Changes",font=dict(color="red")),tickfont=dict(color="red"),
        anchor="x",overlaying="y",side="right"),
    yaxis3=dict(title=dict(text="No. of Articles",font=dict(color="green")),tickfont=dict(color="green"),
                anchor="free",overlaying="y",side="left", position=0.1),)
# Update layout properties
fig.update_layout(
    title_text="2005-06 Avian Influenza",
    width=1000,
)

fig.show()

# Applications

* The volume of search queries can be used to anticipate travel patterns in the short term. An increase in public awareness is followed by a decrease in arrival numbers, and vice versa. This information can be useful to companies and governments for short-term planning and decision-making.

* In the case of investments, if public awareness of the disease is observed to fall from a peak, we can expect a rebound in tourist numbers within China in the following months, and companies can make their investment decisions accordingly.

* The link between the volume of search queries and arrival data can also help inform the government about future tourism trends. For instance, at the tail end of the spread of a disease, as public interest decreases, we can expect a rebound in tourism numbers over the next few months. Based on the expected performance of the tourism sector, the government can either provide support and stimulus to tide the industry through slower times, or plan marketing campaigns to capitalize on growing tourism numbers.

* For governments, the volume of search queries and the number of articles offer a good indicator of public interest in and concern about the disease, and they can plan their response accordingly. For example, in some cases the disease might not be a serious public health threat, yet public interest might be extremely high, and the government could decide to implement more precautionary measures to quell public fear. 
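
To explore how far ahead search interest might signal a change in arrivals, one option is to correlate searches in month t with the year-on-year arrival change in month t + lag for a few candidate lags. The sketch below uses hypothetical monthly values constructed so that arrivals dip one month after searches peak; it illustrates the calculation only and is not a finding of this study.

```python
import pandas as pd

# Hypothetical monthly series, for illustration only.
searches = pd.Series([1, 2, 5, 40, 100, 60, 20, 10, 5, 3, 2, 1])
yoy_change = pd.Series([0.08, 0.09, 0.07, 0.05, -0.05, -0.30,
                        -0.18, -0.06, 0.00, 0.03, 0.05, 0.06])

# shift(-lag) moves later arrival values up, pairing searches[t] with yoy_change[t + lag].
# Series.corr ignores the pairs left without a partner at the end of the series.
lagged = {lag: searches.corr(yoy_change.shift(-lag)) for lag in range(0, 4)}

# The most negative correlation marks the lag at which searches best anticipate a drop.
best_lag = min(lagged, key=lagged.get)

for lag, r in lagged.items():
    print(f"lag {lag} month(s): r = {r:.2f}")
print("most negative correlation at lag", best_lag)
```

With only twelve monthly points per outbreak, any lag estimated this way is rough and would need checking across several outbreaks before being used for planning.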







# Possible Extensions

* As Google is generally unavailable in China, the Google Trends data may not be the most precise measure of public awareness there.
  * The most popular search engine in China is Baidu, which has its own trend analysis tools. We were unable to access the data because a Chinese phone number is needed, but future studies can look at search trends from Chinese search engines.
* For a more comprehensive study on public awareness in China, it would be valuable to look at other news sources, and particularly social media sites such as WeChat and Sina Weibo. 
* We focused on China as it is the country at the epicenter of the current COVID-19 epidemic, and one of the countries with the highest number of arrivals into the US (3rd highest from 2004-2019).
  * Future studies can look at arrivals from other countries with high arrival numbers, such as the UK and Japan.
  * A further study could also look at how public awareness of disease outbreaks in epicenter countries and regions affects arrivals from these places. For instance, we could study how the 2012 MERS coronavirus affected arrivals from the Middle East. 

* A possible extension of this project would be to look at the number of arrivals based on the purpose of travel and the activity participation while in the US.
  * We were unable to do research in this area, as there was no sectorized monthly data available. More detailed data in this area could provide greater insights to inform investment decisions. 


