# Curaio - TechLabs Project Summer Term 2019

Importing needed packages and libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from pytrends.request import TrendReq
import pytrends
from random import randint
import pmdarima as pm
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
import datetime
import xgboost as xgb

Writing functions used during the project:

In [2]:
def get_usdata(kw_list):
    """takes a keyword list as input and returns a DataFrame
    containing the national ILI rates and the Google Trends
    data for the keywords"""
    
    # ili data
    us_ili = pd.read_csv("ili_national_level.csv", header=0)
    us_ili["date"] = pd.to_datetime(us_ili.week.astype(str) + us_ili.year.astype(str).add("-0"), format="%W%Y-%w")
    us_ili.set_index("date", inplace=True)
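    # move every date back by one week (freq 'W' = weeks ending on Sunday)
    ili = us_ili.shift(-1, freq='W')

    # Google Trends (local name chosen so it does not shadow the pytrends module)
    trend_req = TrendReq(hl='en-US', tz=360)
    trend_req.build_payload(kw_list, cat=0, timeframe="2015-01-11 2019-07-07", geo="US")
    trends = trend_req.interest_over_time()
    trends = trends.drop("isPartial", axis=1)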
    
    # merge the dataframes on the Datetimeindex
    merged = trends.join(ili)
    merged["unweighted_ili"] = merged["unweighted_ili"].interpolate(method="linear")
    merged = merged.drop(merged.tail(4).index, inplace=False)
    merged = merged.asfreq("W")
    
    return merged
 
    
def rmse(testdf, preddf):
    """takes two dataframes as input: testdf containing (true) test data 
    and preddf containing predictions made with a forecasting model;
    returns the root mean squared error as evaluation metric"""
    
    return np.sqrt(mean_squared_error(testdf, preddf))
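
As a small worked example of the metric, the sketch below recomputes the root mean squared error in plain Python, so it runs without scikit-learn. The name `rmse_plain` and the sample values are made up for illustration and are not part of the project.

```python
import math


def rmse_plain(y_true, y_pred):
    """Root mean squared error of two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("both sequences must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: two perfect predictions and one that is off by 3.
observed = [2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 7.0]
error = rmse_plain(observed, predicted)  # sqrt((0 + 0 + 9) / 3) = sqrt(3)
print(error)
```

Because the errors are squared before averaging, a single large miss dominates the result, which is worth keeping in mind when comparing forecasts around the peak of a flu season.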

## Introduction and aim of the project

The aim of the project is to predict disease rates, more precisely the rates of influenza-like illness (ILI rates). Two data sources were used: historical time series data and, as an extension, Google Trends data. According to [Google News Lab](https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8), these data are  
> "normalized Trends data. This means that when we look at search interest over time for a topic, we’re looking at that interest as a proportion of all searches on all topics on Google at that time and location".

The data is indexed to 100, where 100 is the maximum search interest for the time and location selected. The idea was that people may search Google for typical flu symptoms at the onset of a cold or flu, even before the disease fully manifests itself. The explanatory variables for the ILI rates were therefore defined as a combination of keywords describing such typical influenza symptoms.
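
To make the indexing concrete, here is a minimal sketch of how a series can be rescaled so that its maximum becomes 100. It only illustrates the idea described above and is not Google's actual procedure: the function name `index_to_100`, the raw search shares, and the rounding to whole numbers are all assumptions.

```python
def index_to_100(shares):
    """Rescale non-negative values so that the largest one becomes 100."""
    peak = max(shares)
    if peak <= 0:
        raise ValueError("at least one value must be positive")
    return [round(100 * s / peak) for s in shares]


# Hypothetical weekly shares of all searches that contain a keyword.
raw_shares = [0.002, 0.004, 0.001, 0.003]
indexed = index_to_100(raw_shares)
print(indexed)  # [50, 100, 25, 75]
```

One consequence of this scaling is that the values are relative to the selected time frame and location, so series downloaded with different settings are not directly comparable.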

## Data acquisition and engineering

Initially, we planned to carry out the project with data for Germany. However, due to the lack of freely accessible influenza data, we decided to do the project for the USA. The ILI rates were obtained with the R package [cdcfluview](https://rdrr.io/cran/cdcfluview/) and its `ilinet` function. This function retrieves data from the CDC FluView Portal, which contains, among other things, in-season and past seasons' national, regional, and state-level outpatient illness surveillance data from ILINet (Influenza-like Illness Surveillance Network).

The Google Trends data were obtained with Python's [pytrends](https://github.com/GeneralMills/pytrends), an unofficial pseudo API for extracting Google Trends data. The following symptom-related search terms were chosen as keywords:
- fever
- flu
- cough
- sore throat
- headache
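
The keywords above can be collected in the list that `get_usdata` expects. The sketch below adds a small sanity check first, because Google Trends compares at most five search terms per request. The helper `check_keywords` and the constant `MAX_TERMS_PER_REQUEST` are hypothetical names introduced here for illustration.

```python
# Google Trends compares at most five search terms in a single request.
MAX_TERMS_PER_REQUEST = 5


def check_keywords(keywords, limit=MAX_TERMS_PER_REQUEST):
    """Return the cleaned keyword list, or raise if it cannot be sent as one request."""
    cleaned = [kw.strip().lower() for kw in keywords]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("keyword list contains duplicates")
    if len(cleaned) > limit:
        raise ValueError(f"at most {limit} keywords per request, got {len(cleaned)}")
    return cleaned


kw_list = check_keywords(["fever", "flu", "cough", "sore throat", "headache"])
print(kw_list)

# With network access and ili_national_level.csv in place, the data would then
# be loaded with the function defined at the top of the notebook:
# merged = get_usdata(kw_list)
```

With exactly five keywords, the whole set fits into one request, so all five series share the same scale.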

This project uses data from the beginning of 2015 until July 2019, a period of approximately four and a half years. Since ILI rates are reported weekly, we also used weekly Google Trends data. The data set therefore amounts to around 230 data points.
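
The figure of around 230 data points can be checked with a short calculation. The sketch below counts the weekly dates in the time frame passed to Google Trends in `get_usdata` (2015-01-11 to 2019-07-07) and subtracts the four rows that the function drops at the end. It assumes that Google Trends returns exactly one row per week for this range. The helper `weekly_dates` is a made-up name.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Return every date from start to end (inclusive) in steps of seven days."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(weeks=1)
    return dates


weeks = weekly_dates(date(2015, 1, 11), date(2019, 7, 7))
n_requested = len(weeks)    # weekly points in the requested time frame
n_kept = n_requested - 4    # get_usdata drops the last four rows
print(n_requested, n_kept)  # 235 231
```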