<img 
     src="https://github.com/Kesterchia/Global-animal-diseases/blob/main/Data/kruger_wildlife__banner.jpg?raw=true" 
     alt="Drawing" 
     style="width: 600px;"/>

## About the dataset:

## Context
This dataset is downloaded from the EMPRES Global Animal Disease Information System (EMPRES-i).
The EMPRES-i system is run by the Food and Agriculture Organization of the United Nations (FAO). Its Disease Outbreak Module provides updated information on global animal disease distribution and current threats at national, regional and global level for priority animal diseases. Disease data, such as information on suspicions and confirmations of outbreaks in livestock and wildlife species, laboratory results or follow-up reports on an outbreak situation, can be stored in a standardized format and are presented through a user-friendly and customizable interface.

The dataset can be downloaded from: https://www.kaggle.com/tentotheminus9/empres-global-animal-disease-surveillance

## Content
The dataset shows the when, where and what of animal disease outbreaks from 2016 to 2017, including African swine fever, foot-and-mouth disease and avian influenza (bird flu). Numbers of cases, deaths, etc. are also included.




# Part 1: Getting a brief overview of the data

In [None]:
# Import modules 

import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import sklearn

%matplotlib inline

In [None]:
#Read in disease data as df

df = pd.read_csv("https://github.com/Kesterchia/Global-animal-diseases/blob/main/Data/Outbreak_240817.csv?raw=True")

In [None]:
df.head(15)

In [None]:
# Quickly seeing some information about the dataset

df.info()

In [None]:
#Generating profile report for df

pandas_profiling.ProfileReport(df)

## Observations:


Human-related variables such as deaths, age and affected individuals have over 90% of their values missing. This could be because it is difficult to find out whether a disease has affected the human population.

The data seems clean in other columns with geographical data and disease types.
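
The share of missing values per column can also be checked directly rather than read off the profile report. The sketch below uses a small made-up frame as a stand-in for `df`; `humansAge` appears in the dataset, while `humansDeaths` is an assumed column name used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the outbreak data; all values are made up
toy = pd.DataFrame({
    "disease": ["Bluetongue", "Rabies", "Anthrax", "Rabies"],
    "humansAge": [np.nan, 34.0, np.nan, np.nan],
    "humansDeaths": [np.nan, np.nan, np.nan, 1.0],  # assumed column name
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `df` should list the human-related columns at the top if the observation above holds.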





# Part 2: Doing some EDA to find information from the dataset

## Looking at the most common diseases:

In [None]:
#Find the 7 most common diseases (the variable keeps its original top5 name)
top5_diseases = df[['disease']].groupby(df['disease']).count().nlargest(7,'disease')
top5_diseases

In [None]:
#Plot the counts of diseases

fig = plt.figure(figsize = (20,8))

disease_plot = sns.countplot(x = df[df['disease'].isin(list(top5_diseases.index))]['disease'],
                             palette = 'bright',
                             order = top5_diseases.index)
disease_plot.tick_params(labelsize=13)

plt.title('Counts of most common diseases',fontdict = {'fontsize':25})
plt.xlabel(None)
plt.ylabel(None)
plt.show()

In [None]:
# Pie chart version of the above plot


fig = plt.figure(figsize = (20,8))
pieplot = plt.pie(x = top5_diseases['disease'],
       labels = list(top5_diseases.index))

plt.show()

### Observation: 

The 4 most common diseases seem much more prevalent than the others.

## Looking at which species are most affected by diseases worldwide:

In [None]:
# Looking for most common species affected by diseases

#Number of distinct species descriptions in the data
print(df['speciesDescription'].nunique())
top8_species = df['speciesDescription'].groupby(df['speciesDescription']).count().nlargest(8)
top8_species

In [None]:
#Plot these species
fig = plt.figure(figsize = (20,8))
species_plot = sns.countplot(x = df[df['speciesDescription'].isin(top8_species.index)]['speciesDescription'],
                             order = top8_species.index,
                             palette = 'bright')
species_plot.tick_params(labelsize = 13)
plt.xlabel(None)
plt.ylabel(None)
plt.show()

In [None]:
#Pie chart version of the above plot

plt.figure(figsize = (20,8))
pieplot = plt.pie(x = top8_species,
       labels = list(top8_species.index))
plt.show()

### Observation: 
Domestic cattle are by far the most commonly affected species.

It is also interesting that the top 7 affected species are all domestic except for wild boar, which is the second most affected species.

## Looking at the age distribution of humans affected by zoonotic diseases:

In [None]:
#Get values on age of humans affected by disease
age_info = df[df['humansAge'].notnull()][['humansAge']]

#Dropping values of age = 0
age_info_clean = age_info[age_info['humansAge'] != 0]

age_info_clean.info()

In [None]:
fig = plt.figure(figsize = (20,8))
age_plot = sns.histplot(age_info_clean['humansAge'], bins = 30, kde = True, stat = 'density')
age_plot.tick_params(labelsize = 13)

plt.xlabel('Age', fontdict = {'fontsize':15})
plt.title('Age of disease-affected humans', fontdict = {'fontsize':15})
plt.show()

### Observation: 
The age distribution looks roughly normal, with a mean age of around 60. There is also a slight spike in cases among infants at the left end of the graph.

This suggests the very young may be more susceptible to zoonotic diseases, although with over 90% of the age values missing the pattern is tentative.

## Using Folium's heatmap to see which regions are most affected by diseases

In [None]:
#Creating a list of location information (latitudes and longitudes)

locationlist = df[['latitude','longitude']].astype(float).values.tolist()

#Location list should be a list of lists:

locationlist[0:5]

In [None]:
#Importing folium and the HeatMap function

import folium
from folium.plugins import HeatMap

#Creating a map and adding the HeatMap overlay

m = folium.Map()
HeatMap(locationlist, radius = 15).add_to(m)
m

### Observations:
Outbreaks appear to occur in distinct clusters, especially in Africa.

The Americas have few recorded outbreaks compared to the rest of the world.

Europe and Asia seem to be hotspots for outbreaks.


# Conclusion


The Food and Agriculture Organization of the UN could look into better ways to gather data on how zoonotic diseases affect humans, as over 90% of the values in the human-related variables are missing.

Avian influenza is well known and is the most common disease in this dataset, but other diseases like Bluetongue and Lumpy skin disease are less well known and not far behind. It might be worthwhile to spread awareness about these diseases in affected countries.

Wild boar are, surprisingly, one of the most commonly affected species, even though most of the others are domestic.

Infants and elderly persons appear to be more vulnerable to the diseases that do affect humans.

# Next section: Can we predict the size of each outbreak?

The data includes a column 'sumCases', which describes the size of each outbreak occurrence globally. This notebook attempts to come up with a model to predict the sizes of future outbreaks.

## Step 1: Cleaning target variable

In [None]:
#We see some descriptive statistics about the target variable y:

df['sumCases'].describe()

# Std is 5821 compared to mean of 328, indicating an extreme right tail.

In [None]:
#Checking the distribution of target variable:

sns.histplot(df['sumCases'], bins = 50, kde = True, stat = 'density')

#We can see the distribution indeed has an extreme right tail
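
The right tail can also be put into numbers instead of judged by eye. The sketch below is only an illustration on synthetic log-normal data (not the real `sumCases` values): a mean far above the median and a large positive skewness both point to a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic heavy-tailed values standing in for sumCases (not the real data)
toy_cases = pd.Series(rng.lognormal(mean=3.0, sigma=2.0, size=5000))

# For a right-skewed variable the mean sits well above the median
print("mean:  ", toy_cases.mean())
print("median:", toy_cases.median())
print("skew:  ", toy_cases.skew())
```

Applied to the notebook's data, the same three calls on `df['sumCases']` would give a quick numeric check of the plot.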

In [None]:
#Checking null values in target variable
df['sumCases'].isnull().sum()

In [None]:
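#Replacing with mean or median might not be suitable because of the skewed data

#Here I attempt to replace it with random observations from the same distribution instead

#Replacing null values with random draws (with replacement) from the observed values.
#random_state is set only so that the result is reproducible
observed = df['sumCases'].dropna()
is_missing = df['sumCases'].isnull()
cleaned_y = df['sumCases'].copy()
cleaned_y[is_missing] = observed.sample(n = int(is_missing.sum()), replace = True, random_state = 0).values

#Checking for null values:
cleaned_y.isnull().sum()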
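
To see why random draws may suit skewed data better than a single fill value, the sketch below compares the two on synthetic log-normal data with about 20% of values knocked out. Everything here (the data, the seed, the variable names) is an assumption made for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed values with roughly 20% set to missing at random
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=2000))
values[rng.random(2000) < 0.2] = np.nan

observed = values.dropna()
missing = values.isnull()

# Mean imputation: every gap gets the same tail-inflated number
mean_filled = values.fillna(observed.mean())

# Sampling imputation: every gap gets a randomly drawn observed value
sample_filled = values.copy()
sample_filled[missing] = observed.sample(
    n=int(missing.sum()), replace=True, random_state=1
).values

print("observed median:      ", observed.median())
print("mean-filled median:   ", mean_filled.median())
print("sample-filled median: ", sample_filled.median())
```

Mean imputation stacks all the filled rows at one value above the median, while sampled values follow the shape of the observed data.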


In [None]:
#Replacing dataframe column with clean column:
df['sumCases'] = cleaned_y

In [None]:
#For this analysis, only outbreaks of considerable size (> 100) are considered 

#Checking the info of those cases:

morethan100 = df[df['sumCases'] > 100]
morethan100['sumCases'].describe()

#The minimum value is 402

In [None]:
#Checking the shape again:

sns.histplot(morethan100['sumCases'], bins = 50, kde = True, stat = 'density')

#Still has a very long tail, indicating some observations with huge case numbers
#We don't want this to skew later regression models, so we remove these values



In [None]:
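#Removing the top 10% of outbreaks (those at or above the 90th percentile).
#.copy() lets us add and overwrite columns later without a SettingWithCopyWarning

final_data = morethan100[morethan100['sumCases'] < morethan100['sumCases'].quantile(0.9)].copy()

#Checking the shape of this data again:
sns.histplot(final_data['sumCases'], kde = True, stat = 'density')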

#The tail now doesn't look as extreme

In [None]:
#Checking the description of the final data:

final_data['sumCases'].describe()

#The data still has 1284 observations, which is probably still enough to fit a model for this analysis


## Step 2: Cleaning explanatory variables

### Variables to be included in the model:

Species: Viral loads can differ between species, and some viruses also only target certain species.

Serotype: One species of virus or bacterium can have many strains, which may affect its behavior.

Country: The Folium map above shows that outbreaks occur in clusters within different countries.

Disease: Outbreaks will vary in behavior depending on the type of disease.

To do: explain why the remaining variables were not selected.

#### Cleaning Species data:


In [None]:
#Checking null values in species column
final_data['speciesDescription'].isnull().sum()


In [None]:
#Replace with a string value
final_data['speciesDescription'] = final_data['speciesDescription'].fillna(value = 'Not Available')

#Checking again
final_data['speciesDescription'].isnull().sum()

#### Cleaning Serotypes data:

Serotypes are chosen as an explanatory variable because one species of virus or bacterium can have many strains, which may affect its behavior.

In [None]:
#Replacing NaN serotypes with 'Not Available' tag:
final_data['serotypes'] = final_data['serotypes'].fillna('Not Available')

#Checking null values again:
final_data['serotypes'].isnull().sum()

### Checking if all our variables have no remaining null values:

In [None]:
final_data.info()

## Step 3: Implementing a model

The approach I have taken here is to separate the data into clusters using a clustering algorithm (an unsupervised learning model), and then to check whether the clusters differ in outbreak size.

### K-means (or modes) clustering

K-means clustering is an algorithm that sorts data into clusters based on distance from cluster centroids.

As K-means clustering only takes numerical variables and the variables used here are all categorical, a variant of the algorithm (K-modes clustering) is used.
Instead of using Euclidean distance to assign observations to clusters, it uses a dissimilarity measure: the number of variables on which two observations do not match.

Some documentation can be found here: https://github.com/nicodv/kmodes
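
As a rough illustration of the mismatch count described above, here is a minimal hand-written version. The function name and the three rows are hypothetical and are not taken from the kmodes library or from the dataset.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

# Hypothetical rows in the order (species, country, serotype, disease)
row_1 = ["domestic, cattle", "France", "8", "Bluetongue"]
row_2 = ["domestic, sheep", "France", "8", "Bluetongue"]
row_3 = ["wild, wild boar", "Poland", "Not Available", "African swine fever"]

print(matching_dissimilarity(row_1, row_2))  # 1: only the species differs
print(matching_dissimilarity(row_1, row_3))  # 4: every field differs
```

With only four variables, the dissimilarity between any two rows can take just five values (0 to 4), which limits how finely the clusters can separate observations.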

I choose to include serotypes, disease, species and country as variables to do the clustering: 

In [None]:
#Defining explanatory variables and target variables


X = final_data[['speciesDescription','country','serotypes','disease']]
y = final_data['sumCases']

#train_test_split is not used here because clustering is unsupervised: y plays no part in fitting the model

In [None]:
#Choosing number of clusters by minimising the total variation of all clusters

from kmodes.kmodes import KModes

total_variation = []

#Computing the total variation for clusters of k=1 to k =30
for i in range(1,31):
    km = KModes(n_clusters = i,
                max_iter = 100,
                ).fit(X)

    total_variation.append(km.cost_)

In [None]:
#Plotting the total variation against number of clusters

plt.figure(figsize = (15,7))

#x values are k = 1 to 30, matching the loop that computed total_variation
fig = plt.plot(range(1,31), total_variation)

plt.xticks(ticks = range(1,31), fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel('Number of clusters, k', fontdict = {'fontsize':15})
plt.ylabel('Variation within clusters', fontdict = {'fontsize':15})
plt.title('Variation across all clusters', fontdict = {'fontsize':15} )

plt.show()

In [None]:
#Fitting the model
km = KModes(n_clusters = 6,
                max_iter = 100,
                ).fit(X)

In [None]:
#Using the model to assign each row to a cluster
clusters = km.predict(X)

#Creating new column for cluster variable

final_data['cluster'] = clusters.astype(str)

In [None]:
#Checking what our data looks like with clusters

final_data[list(X.columns) + ['sumCases','cluster']].sample(10)

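Possible improvement: break variables into components where possible, e.g. splitting the species description into a domestic/wild part and the species itself.

The serotypes could perhaps also be split into two parts.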
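
A minimal sketch of the species split, assuming the descriptions follow a `"<status>, <species>"` layout such as `"domestic, cattle"`. That layout, the toy values and the new column names `status` and `species` are all assumptions and should be checked against the real column first.

```python
import pandas as pd

# Hypothetical values; the "<status>, <species>" layout is an assumption
toy = pd.DataFrame({
    "speciesDescription": ["domestic, cattle", "wild, wild boar", "Not Available"]
})

# Split once on the first comma into two columns
parts = toy["speciesDescription"].str.split(",", n=1, expand=True)

toy["status"] = parts[0].str.strip()
# Rows without a comma have no second part, so fall back to the placeholder
toy["species"] = parts[1].str.strip().fillna("Not Available")

print(toy)
```

The two new columns could then replace `speciesDescription` in the clustering, so that two domestic species count as a partial match rather than a full mismatch.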

## Step 4: Checking if the clusters are related to the target variable

In [None]:
plt.figure(figsize = (15,7))

sns.boxplot(x = 'cluster',
            y = 'sumCases',
            data = final_data)

plt.xlabel(xlabel = 'Cluster', fontdict = {'fontsize':15})
plt.ylabel(ylabel = 'Outbreak Size', fontdict = {'fontsize':15})

### Observation:
There doesn't seem to be any relationship between cluster and outbreak size.

### Trying the same box plot but with a Box-Cox transformation of outbreak size:

In [None]:
from scipy import stats

#Box-Cox needs strictly positive values, which holds here because every sumCases is above 100
transformed = stats.boxcox(final_data['sumCases'])[0]

#Plotting graphs to compare the transformed and original data

fig, ax =plt.subplots(1,2, figsize = (20,8))

sns.histplot(transformed, kde = True, stat = 'density', ax = ax[1])
sns.histplot(final_data['sumCases'], kde = True, stat = 'density', ax = ax[0])


ax[0].set_title('Original outbreak sizes', fontsize = 15)
ax[1].set_title('Transformed outbreak sizes', fontdict = {'fontsize':15})
ax[0].set_xlabel(None)

plt.show()
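
For reference, the Box-Cox transform maps each value x to (x^λ - 1)/λ, or to log(x) when λ = 0, and `stats.boxcox` picks λ by maximum likelihood when none is given. The sketch below checks this on synthetic positive data (not the outbreak data) and confirms the transform reduces skewness.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic, strictly positive, right-skewed values (not the real data)
x = rng.lognormal(mean=6.0, sigma=1.0, size=1000)

# boxcox returns the transformed values and the fitted lambda
transformed, lam = stats.boxcox(x)

# Recompute the transform by hand from the fitted lambda
manual = np.log(x) if lam == 0 else (x**lam - 1) / lam

print("lambda:", lam)
print("matches manual formula:", np.allclose(transformed, manual))
print("skew before:", stats.skew(x), "after:", stats.skew(transformed))
```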

In [None]:
final_data['Box-Cox sumCases'] = transformed 

In [None]:
#Number of outbreaks in each cluster
final_data.groupby('cluster').size()

In [None]:
plt.figure(figsize = (15,7))

sns.boxplot(x = 'cluster',
            y = 'Box-Cox sumCases',
            data = final_data)

plt.xlabel(xlabel = 'Cluster', fontdict = {'fontsize':15})
plt.ylabel(ylabel = 'Box-Cox transformed outbreak size', fontdict = {'fontsize':15})

plt.show()

### Observation:
There still doesn't seem to be any relationship between cluster and outbreak size.
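
The visual impression from the box plots could be backed up with a rank-based test. The sketch below uses the Kruskal-Wallis test from SciPy on synthetic data in which all six clusters share one distribution. This test is a suggestion and was not part of the original analysis; the toy frame only mimics the `cluster` and `sumCases` columns.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in: 6 clusters whose sizes all come from the same distribution
toy = pd.DataFrame({
    "cluster": rng.integers(0, 6, size=600).astype(str),
    "sumCases": rng.lognormal(mean=6.0, sigma=1.0, size=600),
})

# One array of outbreak sizes per cluster
groups = [g["sumCases"].values for _, g in toy.groupby("cluster")]

# Kruskal-Wallis compares the groups using ranks, so the long tail matters less
statistic, p_value = stats.kruskal(*groups)
print(f"H = {statistic:.2f}, p = {p_value:.3f}")
```

A large p-value on `final_data` would be consistent with the observation above. Because the test is rank-based, it gives the same result on the raw and the Box-Cox transformed sizes.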

# Ending notes to consider:

Why are some regions so empty of data? Are there really no cases in regions like Australia and New Zealand, or is data simply not being collected there?

Why are so many human-related values missing? Are they truly missing, or do the blanks stand for zeroes?

