# **COVID-19 Public Health Unit (PHU) Cases Analysis**

In [2]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
#import plotly.graph_objects as go
#import plotly.express as px

#### **Import, analyze, clean, and preprocess a “real-world” classification dataset.**

In [3]:
# Import our input dataset
df = pd.read_csv('./Resources/Ontario.csv')
df.head()

Unnamed: 0,Row_ID,Accurate_Episode_Date,Case_Reported_Date,Test_Reported_Date,Specimen_Date,Age_Group,Client_Gender,Case_AcquisitionInfo,Outcome1,Outbreak_Related,Reporting_PHU,Reporting_PHU_Address,Reporting_PHU_City,Reporting_PHU_Postal_Code,Reporting_PHU_Website,Reporting_PHU_Latitude,Reporting_PHU_Longitude
0,1,2020-03-07,2020-03-09,2020-03-11,2020-03-09,50s,MALE,Travel,Resolved,,York Region Public Health Services,17250 Yonge Street,Newmarket,L3Y 6Z1,www.york.ca/wps/portal/yorkhome/health/,44.048023,-79.480239
1,2,2020-03-02,2020-03-09,2020-03-09,2020-03-09,40s,MALE,Travel,Resolved,,Toronto Public Health,"277 Victoria Street, 5th Floor",Toronto,M5B 1W2,www.toronto.ca/community-people/health-wellnes...,43.656591,-79.379358
2,3,2020-03-06,2020-03-10,2020-03-10,2020-03-09,30s,FEMALE,Travel,Resolved,,York Region Public Health Services,17250 Yonge Street,Newmarket,L3Y 6Z1,www.york.ca/wps/portal/yorkhome/health/,44.048023,-79.480239
3,4,2020-03-02,2020-03-09,2020-03-12,2020-03-07,40s,MALE,Travel,Resolved,,Ottawa Public Health,100 Constellation Drive,Ottawa,K2G 6J8,www.ottawapublichealth.ca,45.345665,-75.763912
4,5,2020-03-03,2020-03-10,2020-03-11,2020-03-09,30s,MALE,Travel,Resolved,,Toronto Public Health,"277 Victoria Street, 5th Floor",Toronto,M5B 1W2,www.toronto.ca/community-people/health-wellnes...,43.656591,-79.379358


#### **Inspect Data**

In [4]:
# show number of columns and rows
df.shape

(36178, 17)

In [5]:
# show DF Columns
df.columns

Index(['Row_ID', 'Accurate_Episode_Date', 'Case_Reported_Date',
       'Test_Reported_Date', 'Specimen_Date', 'Age_Group', 'Client_Gender',
       'Case_AcquisitionInfo', 'Outcome1', 'Outbreak_Related', 'Reporting_PHU',
       'Reporting_PHU_Address', 'Reporting_PHU_City',
       'Reporting_PHU_Postal_Code', 'Reporting_PHU_Website',
       'Reporting_PHU_Latitude', 'Reporting_PHU_Longitude'],
      dtype='object')

In [6]:
# Return data types
df.dtypes

Row_ID                         int64
Accurate_Episode_Date         object
Case_Reported_Date            object
Test_Reported_Date            object
Specimen_Date                 object
Age_Group                     object
Client_Gender                 object
Case_AcquisitionInfo          object
Outcome1                      object
Outbreak_Related              object
Reporting_PHU                 object
Reporting_PHU_Address         object
Reporting_PHU_City            object
Reporting_PHU_Postal_Code     object
Reporting_PHU_Website         object
Reporting_PHU_Latitude       float64
Reporting_PHU_Longitude      float64
dtype: object

#### **Clean Data**

In [8]:
df2 = df[['Outcome1','Age_Group','Client_Gender','Reporting_PHU']]
df2.head()

Unnamed: 0,Outcome1,Age_Group,Client_Gender,Reporting_PHU
0,Resolved,50s,MALE,York Region Public Health Services
1,Resolved,40s,MALE,Toronto Public Health
2,Resolved,30s,FEMALE,York Region Public Health Services
3,Resolved,40s,MALE,Ottawa Public Health
4,Resolved,30s,MALE,Toronto Public Health


In [9]:
# Return columns with Null values 
for column in df2.columns:
    print(f"Column {column} has {df2[column].isnull().sum()} null values")

Column Outcome1 has 0 null values
Column Age_Group has 0 null values
Column Client_Gender has 0 null values
Column Reporting_PHU has 0 null values


In [10]:
# Return the number of duplicate rows in df2
print(f"Duplicate entries: {df2.duplicated().sum()}")

Duplicate entries: 35132


### **Aggregate data by Accurate_Episode_Date**

In [None]:
# Count the non-null values in every column for each episode date
df.groupby("Accurate_Episode_Date").count()

### **Filter data by Public Health Units (PHU)**

In [None]:
# Count cases per PHU, keeping the coordinates for plotting
data_df = (
    df.groupby(["Reporting_PHU", "Reporting_PHU_Latitude", "Reporting_PHU_Longitude"])["Row_ID"]
    .count()
    .reset_index()
    .rename(columns={"Row_ID": "Cases"})
)

data_df["OnsetWithin"] = "current day"

data_df

#### **Gather DataFrame information**
- Display DataFrame
- Display the list of numeric (`float64`) columns
- Display the number of unique values per column

In [11]:
# Display Dataframe
df2.head()

Unnamed: 0,Outcome1,Age_Group,Client_Gender,Reporting_PHU
0,Resolved,50s,MALE,York Region Public Health Services
1,Resolved,40s,MALE,Toronto Public Health
2,Resolved,30s,FEMALE,York Region Public Health Services
3,Resolved,40s,MALE,Ottawa Public Health
4,Resolved,30s,MALE,Toronto Public Health


#### **Display Variable List**

In [12]:
# Numeric (float64) column list; df2 holds only object columns, so this is empty
df2_cat = df2.dtypes[df2.dtypes == "float64"].index.tolist()
df2_cat

[]

In [13]:
# Number of unique values in each listed column
df2[df2_cat].nunique()

Series([], dtype: float64)

### **Categorical Variables using one-hot encoding** 
- One-hot encoding identifies all unique column values and splits the single categorical column into a series of columns, each containing information about a single unique categorical value
- Although one-hot encoding is a very robust solution, it can be very memory-intensive. Therefore, categorical variables with a large number of unique values might become difficult to navigate or filter once encoded.
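
As a minimal illustration of the idea (not the encoder used later in this notebook), the sketch below one-hot encodes a small, made-up gender column with `pandas.get_dummies`. The toy values and the names `toy` and `toy_encoded` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy column; the values are chosen only for illustration
toy = pd.DataFrame({"Client_Gender": ["MALE", "FEMALE", "MALE", "UNKNOWN"]})

# One new 0/1 column is created per unique category value
toy_encoded = pd.get_dummies(toy, columns=["Client_Gender"], dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded.sum(axis=1).tolist())  # every row contains exactly one 1
```

Three unique values produce three columns, which is why a column with many unique values quickly becomes wide once encoded.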

### **Bucketing or binning**
**The process of reducing the number of unique categorical values in a dataset is known as bucketing or binning.**

Bucketing data typically follows one of two approaches:
1. Collapse all of the infrequent and rare categorical values into a single “other” category.
2. Create generalized categorical values and reassign all data points to the new corresponding values.
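
This notebook applies the first approach to `Reporting_PHU` further down. The second approach can be sketched as follows, assuming we wanted to merge the `Age_Group` values into broader bands. The band names and cut points in `age_bands` are hypothetical and are not used anywhere else in the analysis.

```python
import pandas as pd

# Hypothetical sample of Age_Group values
ages = pd.Series(["<20", "20s", "30s", "40s", "50s", "60s", "70s", "80s", "90s"])

# Hypothetical, coarser bands; the cut points are an assumption
age_bands = {
    "<20": "under_40", "20s": "under_40", "30s": "under_40",
    "40s": "40_to_69", "50s": "40_to_69", "60s": "40_to_69",
    "70s": "70_plus", "80s": "70_plus", "90s": "70_plus",
}

# Reassign every data point to its generalized category
ages_binned = ages.map(age_bands)

print(ages.nunique(), "->", ages_binned.nunique())
```

Any value missing from the mapping would become `NaN` after `map`, so it is worth checking for nulls afterwards.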

In [14]:
# Reporting_PHU value counts
Reporting_PHU_counts = df2.Reporting_PHU.value_counts()
Reporting_PHU_counts

Toronto Public Health                                       13511
Peel Public Health                                           6027
York Region Public Health Services                           3082
Ottawa Public Health                                         2123
Durham Region Health Department                              1724
Windsor-Essex County Health Unit                             1675
Region of Waterloo, Public Health                            1313
Hamilton Public Health Services                               847
Halton Region Health Department                               781
Niagara Region Public Health Department                       763
Middlesex-London Health Unit                                  631
Simcoe Muskoka District Health Unit                           607
Wellington-Dufferin-Guelph Public Health                      493
Haldimand-Norfolk Health Unit                                 431
Leeds, Grenville and Lanark District Health Unit              354
Lambton Pu

In [None]:
# Visualize the value counts
Reporting_PHU_counts.plot.density()

The **Reporting_PHU** column takes many distinct values across the dataset, and it is one of the features of our model.

#### **Tasks**
According to the density plot, the most common unique values have more than 1000 instances within the dataset. Therefore, we can bucket any value that appears fewer than 1000 times in the dataset as “Other.” To do this, we’ll use a Python for loop and Pandas’ replace method.
- Determine which values to replace
- Replace in DataFrame
- Check to make sure binning was successful

In [20]:
df2_cat = df2.dtypes[df2.dtypes == "object"].index.tolist()

In [15]:
# Replace values (a cutoff of 1000 may be too large; something between 200 and 500 is worth trying)
replace_Reporting_PHU = list(Reporting_PHU_counts[Reporting_PHU_counts < 1000].index)

# Use a for loop to replace each rare value with "Other"
for i in replace_Reporting_PHU:
    df2.Reporting_PHU = df2.Reporting_PHU.replace(i, "Other")

# Check that bucketing was successful
df2.Reporting_PHU.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Toronto Public Health                 13511
Other                                  6723
Peel Public Health                     6027
York Region Public Health Services     3082
Ottawa Public Health                   2123
Durham Region Health Department        1724
Windsor-Essex County Health Unit       1675
Region of Waterloo, Public Health      1313
Name: Reporting_PHU, dtype: int64
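
The warning printed above appears because `df2` was created by selecting columns from `df`, so pandas flags the assignment as possibly writing to a temporary copy. One way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` of the subset and assign through `.loc` with a boolean mask. The toy values, the names `df_toy` and `sub`, and the cutoff of 2 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the full DataFrame
df_toy = pd.DataFrame({
    "Reporting_PHU": ["A", "A", "A", "B", "B", "C"],
    "Age_Group": ["20s", "30s", "40s", "20s", "50s", "60s"],
})

# .copy() makes the column subset an independent DataFrame
sub = df_toy[["Reporting_PHU"]].copy()

# Values that occur fewer than `threshold` times are treated as rare
threshold = 2  # assumed cutoff for this toy example
counts = sub["Reporting_PHU"].value_counts()
rare = counts[counts < threshold].index

# Replace all rare values in a single vectorized assignment
sub.loc[sub["Reporting_PHU"].isin(rare), "Reporting_PHU"] = "Other"

print(sub["Reporting_PHU"].value_counts().to_dict())
```

The mask-based assignment also removes the need for a Python loop over the rare values, and the original `df_toy` is left unchanged.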

#### **Tasks**
Now that we have reduced the number of unique values in the **Reporting_PHU** variable, we’re ready to encode our categorical variables using one-hot encoding. A straightforward way to perform **one-hot encoding** in Python is to use Scikit-learn’s **OneHotEncoder** class. To build the encoded columns, we must **create an instance** of OneHotEncoder and **“fit”** the encoder with our values. Here the encoder is fitted on every column of `df2`, so the target **Outcome1** is encoded along with the features.

In [21]:
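# Create the OneHotEncoder instance
# (scikit-learn 1.2+ renames this argument to sparse_output=False)
enc = OneHotEncoder(sparse=False)

# Fit the encoder and produce encoded DataFrame
encoded_df2 = pd.DataFrame(enc.fit_transform(df2))

# Rename encoded columns
# (scikit-learn 1.0+ provides get_feature_names_out; get_feature_names was removed in 1.2)
encoded_df2.columns = enc.get_feature_names(df2_cat)
encoded_df2.head()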

Unnamed: 0,Outcome1_Fatal,Outcome1_Not Resolved,Outcome1_Resolved,Age_Group_20s,Age_Group_30s,Age_Group_40s,Age_Group_50s,Age_Group_60s,Age_Group_70s,Age_Group_80s,...,Client_Gender_TRANSGENDER,Client_Gender_UNKNOWN,Reporting_PHU_Durham Region Health Department,Reporting_PHU_Other,Reporting_PHU_Ottawa Public Health,Reporting_PHU_Peel Public Health,"Reporting_PHU_Region of Waterloo, Public Health",Reporting_PHU_Toronto Public Health,Reporting_PHU_Windsor-Essex County Health Unit,Reporting_PHU_York Region Public Health Services
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [24]:
# Keep only cases with a final outcome (Resolved or Fatal) so the target is binary
encoded_df2 = encoded_df2[encoded_df2['Outcome1_Not Resolved'] != 1]

In [26]:
# Outcome1_Resolved alone now encodes the target (1 = resolved, 0 = fatal)
encoded_df2 = encoded_df2.drop(columns=['Outcome1_Not Resolved', 'Outcome1_Fatal'])

# We can probably drop "Client_Gender_UNKNOWN" as well.

In [27]:
encoded_df2.head()

Unnamed: 0,Outcome1_Resolved,Age_Group_20s,Age_Group_30s,Age_Group_40s,Age_Group_50s,Age_Group_60s,Age_Group_70s,Age_Group_80s,Age_Group_90s,Age_Group_<20,...,Client_Gender_TRANSGENDER,Client_Gender_UNKNOWN,Reporting_PHU_Durham Region Health Department,Reporting_PHU_Other,Reporting_PHU_Ottawa Public Health,Reporting_PHU_Peel Public Health,"Reporting_PHU_Region of Waterloo, Public Health",Reporting_PHU_Toronto Public Health,Reporting_PHU_Windsor-Essex County Health Unit,Reporting_PHU_York Region Public Health Services
0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


#### **Dataframe information**
- dataframe data type
- variable list
- column unique values

#### **Split our training and testing data**
- We need to split our training and testing data before fitting our **StandardScaler instance**. This prevents testing data from influencing the standardization function.
- To build our training and testing datasets, we need to separate two values:
    - input values (our independent variables, commonly referred to as model features or “X” in TensorFlow documentation)
    - target output (our dependent variable, commonly referred to as the target or “y” in TensorFlow documentation)

- **Use Sklearn train_test_split method to split data into training and test**
    - X_train, X_test, y_train, y_test 
- **Prepare dataset for neural network model**
    - Normalize or standardize our numerical variables

In [28]:
# Split our preprocessed data into our features and target arrays
y = encoded_df2["Outcome1_Resolved"].values
X = encoded_df2.drop(columns=["Outcome1_Resolved"]).values

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

#### **Tasks**
At last, our data is preprocessed and separated and ready for modelling. For our purposes, we will use the same framework we used for our basic neural network:
- For our input layer, we must add the number of input features equal to the number of variables in our feature DataFrame.
- In our hidden layers, our deep learning model structure will be slightly different—we’ll add two hidden layers with only a few neurons in each layer. To create the second hidden layer, we’ll add another Keras Dense class while defining our model. All of our hidden layers will use the relu activation function to identify nonlinear characteristics from the input values.
- In the output layer, we’ll use the same parameters from our basic neural network including the sigmoid activation function. The sigmoid activation function will help us predict the probability that a case is resolved.
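
To make the role of the output activation concrete, here is a minimal sketch of the sigmoid function in plain Python. The input scores and the 0.5 decision threshold are assumptions chosen for illustration; Keras applies the activation internally.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative inputs, this form avoids overflow in exp()
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical raw outputs of the final neuron before activation
for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    label = int(p >= 0.5)  # 0.5 is an assumed decision threshold
    print(f"z={z:+.1f}  p={p:.3f}  predicted class={label}")
```

Large positive scores map close to 1 and large negative scores close to 0, which is what lets the single output neuron be read as a probability.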

#### **Define Model**
- Deep neural net
- First hidden layer
- Second hidden layer
- Output layer
- Check the structure of the model

In [30]:
# Define the model - deep neural net
number_input_features = len(X_train[0])
hidden_nodes_layer1 = 8
hidden_nodes_layer2 = 5

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(
    tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu")
)

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation="relu"))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 8)                 192       
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 45        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6         
Total params: 243
Trainable params: 243
Non-trainable params: 0
_________________________________________________________________
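
The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 23 input features, a value inferred from the 192 parameters of the first layer rather than stated in the notebook, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 23 input features is inferred from the 192 parameters of the first layer
n_features = 23
layer_units = [8, 5, 1]  # hidden layer 1, hidden layer 2, output layer

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_params(n_in, units))
    n_in = units  # the next layer takes this layer's outputs as inputs

print(counts, sum(counts))  # [192, 45, 6] 243
```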


Now it is time to compile our model and define the loss and accuracy metrics. Since we want to use our model as a binary classifier, we’ll use the **binary_crossentropy loss function**, **adam optimizer**, and **accuracy metrics**, which are the same parameters we used for our basic neural network. To compile the model, add and run the following code:

#### **Compile neural network**
- Use the **adam optimizer**, a gradient descent approach that adapts its step size for each weight, which helps keep the algorithm from getting stuck on weaker classifying variables and features.
- The **loss metric** is used by machine learning algorithms to score the performance of the model through each iteration and **epoch** by measuring how far each prediction is from its true label.
- Use **binary_crossentropy**, which is specifically designed to evaluate a binary classification model.
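
As a rough illustration of what the loss measures, the following sketch computes mean binary cross-entropy in plain Python. It is not Keras' implementation; the clipping constant `eps` and the example labels and probabilities are assumptions.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predictions
y_true = [1, 1, 0, 1]
confident_and_right = [0.9, 0.8, 0.1, 0.95]
confident_and_wrong = [0.1, 0.2, 0.9, 0.05]

print(round(binary_crossentropy(y_true, confident_and_right), 4))
print(round(binary_crossentropy(y_true, confident_and_wrong), 4))
```

Confident predictions that are wrong are penalized much more heavily than confident predictions that are right, which is the behaviour the optimizer exploits during training.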

In [31]:
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])



#### **Train Model**
Training and evaluating the deep learning model is no different from training and evaluating a basic neural network. Depending on the complexity of the dataset, we may opt to increase the number of epochs to give the deep learning model more opportunities to optimize the weight coefficients. To train our model, we must add and run the following code:

In [33]:
# Train the model
fit_model = nn.fit(X_train, y_train, epochs=10)  # each epoch is one full pass through the training data

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [34]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

8627/8627 - 0s - loss: 0.1868 - acc: 0.9233
Loss: 0.1868220557805988, Accuracy: 0.9232641458511353


#### **Summary**

Looking at the performance metrics from the model, the neural network was able to correctly classify most of the points in the test data. In other words, the model correctly classified data it was not trained on about **92%** of the time. Although perfect model performance is ideal, more complex datasets and models may not be able to achieve 100% accuracy. Therefore, it is important to establish model performance thresholds before designing any machine learning model. In summary, the **Deep Learning** model predicted the outcome of **Ontario Public Health Unit** cases in the test set with about **92%** accuracy.
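
One simple way to set such a threshold, offered here as a suggestion rather than something this notebook does, is to compare the model against the accuracy of always predicting the most frequent class. The label proportions below are invented, so the printed comparison says nothing about the real data; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical test labels: 1 = resolved, 0 = fatal (proportions are made up)
y_test_toy = [1] * 9 + [0] * 1

baseline = majority_baseline_accuracy(y_test_toy)
model_accuracy_toy = 0.92  # assumed value, rounded from the evaluation output

print(f"baseline={baseline:.2f}, model={model_accuracy_toy:.2f}")
print("model beats baseline:", model_accuracy_toy > baseline)
```

Running the same helper on the real `y_test` would show how much of the 92% is explained by class imbalance alone.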

### **Plot PHU Hotspots in Ontario, Canada**

In [None]:
import plotly.express as px  # px provides scatter_mapbox, used below
from plotly.offline import plot

fig = px.scatter_mapbox(
    data_df,
    lat="Reporting_PHU_Latitude",
    lon="Reporting_PHU_Longitude",
    color="OnsetWithin",
    color_discrete_sequence=["red", "darkblue", "yellow", "white"],
    size="Cases",
    hover_name="Reporting_PHU",
    size_max=28,
    zoom=5.4,
    center=dict(lat=45, lon=-79.4),
    height=800,
    labels={"OnsetWithin": "Onset w/in Date"},
    title="Confirmed Cases per Public Health Unit",
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
plot(fig, auto_open=True)