# DATA3888 Project Report: Holiday Planner

COVID C4

**Note:**

Our final submission for the report needs to be a ZIP file containing:
- this report;
- our data/ folder, containing all data files needed by this notebook and by our Python analysis code files;
- our Python analysis code files (common.py, analytics.py, analytics_clustering.py, analytics_helper.py, analytics_helper_clustering.py); and
- main.py and mapping.py.

In [2]:
import itertools
import more_itertools
import random
import matplotlib.pyplot as plt

from statistics import mean

from analytics_clustering import *
from analytics_helper_clustering import *
from analytics import *
from analytics_helper import *
from common import *

from IPython.display import Image

In [8]:
# hides warnings - these warnings do not affect code functionality

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [9]:
random.seed(3888)

## Executive summary

## Background
A clear description of the problem, articulating the aim of this project. Provides appropriate multidisciplinary context and motivational background explained well in an appropriate language.

## Method

1. Clear description of the approach in data collection, the developed model, and the evaluation strategies from a data-science perspective. Here we refer to all types of evaluation metrics, including graphical, qualitative and quantitative metrics.

2. COVID project - interdisciplinary assessment - appropriate formulation and setting of the problem

### Question formulation

### Preparing the data for modelling

In addition to the provided live COVID dataset from OWID, we collected a diverse range of external datasets, as below. We extracted these in various ways; some via API or RSS feed, others as CSV files.

- Travel advice data from Smartraveller RSS feed
- Point of interest ratings data from Triposo API
- Country descriptions from Triposo API
- Tourism indexes dataset from Travel & Tourism Competitiveness Report
- Country photos from Google Places API

For the live COVID data, we only used data from the last 30 days, to ensure that our recommendations were up to date. For each country, we computed the median of each COVID variable over the last 30 days, as the median is a robust measure of centre for quantitative data.
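The 30-day median step can be sketched as follows. This is a minimal illustration rather than our actual pipeline: the function name `median_last_n_days`, the row layout (dictionaries with `location` and `date` keys) and all numbers are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median


def median_last_n_days(rows, variable, today, n_days=30):
    """Median of `variable` per country over the n_days ending at `today`."""
    cutoff = today - timedelta(days=n_days)
    values = defaultdict(list)
    for row in rows:
        value = row.get(variable)
        # Keep only rows inside the window that actually have a value.
        if cutoff < row["date"] <= today and value is not None:
            values[row["location"]].append(value)
    return {country: median(vals) for country, vals in values.items()}


# Toy rows with made-up numbers (not real OWID data).
today = date(2021, 5, 31)
var = "new_cases_smoothed_per_million"
rows = [
    {"location": "A", "date": date(2021, 5, 29), var: 10.0},
    {"location": "A", "date": date(2021, 5, 30), var: 12.0},
    {"location": "A", "date": date(2021, 5, 31), var: 500.0},  # outlier
    {"location": "A", "date": date(2021, 3, 1), var: 900.0},   # outside window
    {"location": "B", "date": date(2021, 5, 30), var: None},   # missing value
    {"location": "B", "date": date(2021, 5, 31), var: 3.0},
]
covid_medians = median_last_n_days(rows, var, today)
```

In this toy example the outlier of 500 for country A does not move the result, which is the robustness property that motivated using the median.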

Once all datasets had been integrated, we multiplied the features that the user was interested in by a factor of 1000. Importantly, in our UI, the user did not directly select individual features in our dataset; rather, they selected "variable groups", where each variable group corresponded to several features in our dataset. For example, the COVID variable group included the variables `new_cases_smoothed_per_million` and `new_deaths_smoothed_per_million` in our integrated dataset. Increasing the weighting of these features ensured that our recommendations placed more emphasis on them, and thus were tailored to each user's preferences.
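A minimal sketch of the weighting step is given below. Only the COVID group and the factor of 1000 come from the description above; the `nature` group, the feature name `natural_resources_score`, the function name `weight_features` and the feature values are hypothetical, and the sketch assumes the features have already been scaled to comparable ranges.

```python
WEIGHT = 1000  # weighting factor described in the text

# Mapping from variable groups to features. Only the COVID entry is taken
# from the report; the "nature" entry is a hypothetical placeholder.
VARIABLE_GROUPS = {
    "covid": [
        "new_cases_smoothed_per_million",
        "new_deaths_smoothed_per_million",
    ],
    "nature": ["natural_resources_score"],
}


def weight_features(row, selected_groups, groups=VARIABLE_GROUPS, weight=WEIGHT):
    """Multiply every feature in the selected variable groups by `weight`."""
    boosted = {feature for group in selected_groups for feature in groups[group]}
    return {
        feature: value * weight if feature in boosted else value
        for feature, value in row.items()
    }


# One country with made-up, already-scaled feature values.
country = {
    "new_cases_smoothed_per_million": 0.25,
    "new_deaths_smoothed_per_million": 0.5,
    "natural_resources_score": 0.75,
}
weighted = weight_features(country, ["covid"])
```

Here a user who selected only the COVID group gets both COVID features scaled up, while `natural_resources_score` is left unchanged.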

We then performed PCA and kept only the first and second principal components, reducing the data to two dimensions to avoid the curse of dimensionality.
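The projection onto the first two principal components can be sketched with NumPy's SVD. This is an illustrative sketch, not our analysis code: the function name `first_two_components` and the random data standing in for the integrated dataset are assumptions.

```python
import numpy as np


def first_two_components(X):
    """Project the rows of X onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value (and hence by decreasing explained variance).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


rng = np.random.default_rng(3888)
X = rng.normal(size=(50, 6))  # 50 hypothetical countries, 6 features
X[:, 0] *= 1000               # mimic one feature weighted by 1000
scores = first_two_components(X)
```

Because PCA is driven by variance, a feature multiplied by 1000 dominates the leading component, which is how the weighting step carries through to the two-dimensional representation.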

### Models

Our Holiday Planner app incorporated two models - an ensemble 10-NN model, and a size-constrained k-means clustering model.

#### Ensemble 10-NN - user specifies country of interest

In our UI, if the user selected a country of interest, we used ensemble 10-NN to determine the 10 most similar countries.

Our ensemble 10-NN model consisted of 9 10-NN models, each of which used a different distance metric. The 9 distance metrics we used are listed below:

1. Euclidean
2. Manhattan
3. Chebyshev
4. Cosine
5. Cityblock
6. Braycurtis
7. Canberra
8. Correlation
9. Minkowski

Each 10-NN model was given the country selected by the user, and returned this country's 10 nearest neighbours (10 most similar countries) as determined by its assigned distance metric.

After each 10-NN model had been run for the user-specified country, we considered the corresponding 9 sets of neighbours. We then returned the 10 neighbours which were the most common across all 9 neighbour sets, as the final recommendations to the user.
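The voting step can be sketched as below. To keep the example short it implements only three of the nine metrics, and it breaks ties alphabetically; the tie-breaking rule, the function names and the toy coordinates are all assumptions. Note also that Manhattan and cityblock are two names for the same L1 distance, so those two ensemble members would return identical neighbour sets.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


METRICS = [euclidean, manhattan, chebyshev]  # subset of the nine, for brevity


def nearest(points, target, metric, k):
    """The k countries closest to `target` under one distance metric."""
    others = [name for name in points if name != target]
    others.sort(key=lambda name: (metric(points[target], points[name]), name))
    return others[:k]


def ensemble_knn(points, target, metrics=METRICS, k=10):
    """Return the k countries that appear most often across all neighbour sets."""
    votes = Counter()
    for metric in metrics:
        votes.update(nearest(points, target, metric, k))
    # Most votes first; ties broken alphabetically (an assumption).
    ranked = sorted(votes, key=lambda name: (-votes[name], name))
    return ranked[:k]


# Toy two-dimensional coordinates, e.g. the two principal components.
points = {"A": (0, 0), "B": (1, 1), "C": (0, 1.5), "D": (5, 5), "E": (1.2, 0)}
recommended = ensemble_knn(points, "A", k=2)
```

With `k=2`, the three metrics disagree on the second neighbour of A (B or C), and the vote settles on E and B.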

#### Size-constrained k-means clustering - user does not specify country of interest

In our UI, if the user did not select a country of interest, then we used size-constrained k-means clustering across all countries in our dataset to determine the 10-12 countries most suited to their interests.

For each cluster produced by size-constrained k-means clustering (a modification of k-means clustering, explained in more detail below), we computed its average rating across all interests specified by the user. If the user did not specify any interests, we took the average rating across _all_ features in our integrated dataset. For every interest except COVID, a higher rating meant that the country performed better on that interest. For COVID, however, lower case and death numbers are preferable. Thus, for each of the two COVID variables (cases and deaths), we subtracted each country's value from the maximum value of that variable. This transformation ensured that higher values (differences from the maximum) were preferable, consistent with all other variables. We then returned the countries in the cluster with the highest average rating as our recommendations to the user.
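The COVID inversion and the choice of the best cluster can be sketched as follows. The function names, the cluster labels and the numbers are hypothetical, and the sketch assumes that all rated features are on comparable scales so that a plain average is meaningful.

```python
from statistics import mean

COVID_FEATURES = (
    "new_cases_smoothed_per_million",
    "new_deaths_smoothed_per_million",
)


def invert_covid(countries, covid_features=COVID_FEATURES):
    """Replace each COVID value with (max - value) so that higher is better."""
    maxima = {f: max(row[f] for row in countries.values()) for f in covid_features}
    return {
        name: {f: (maxima[f] - v if f in maxima else v) for f, v in row.items()}
        for name, row in countries.items()
    }


def best_cluster(clusters, countries, features):
    """Label of the cluster with the highest average rating over `features`."""
    def rating(members):
        return mean(countries[m][f] for m in members for f in features)
    return max(clusters, key=lambda label: rating(clusters[label]))


# Made-up values for four countries split into two clusters.
cases, deaths = COVID_FEATURES
countries = {
    "A": {cases: 100.0, deaths: 4.0},
    "B": {cases: 20.0, deaths: 1.0},
    "C": {cases: 60.0, deaths: 2.0},
    "D": {cases: 10.0, deaths: 0.0},
}
clusters = {0: ["A", "C"], 1: ["B", "D"]}

inverted = invert_covid(countries)
best = best_cluster(clusters, inverted, COVID_FEATURES)
```

After inversion, country D (the lowest raw case and death numbers) has the highest values, and the cluster containing B and D is selected.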

Size-constrained k-means clustering is a modification of k-means clustering, proposed by researchers at Microsoft Research and the Rensselaer Polytechnic Institute, which allows a minimum and maximum cluster size to be specified. We specified a minimum cluster size of 10 and a maximum of 12, as we felt that this number of recommendations would give the user a degree of variety without overwhelming them with too many options. Using _size-constrained_ k-means clustering, as opposed to standard k-means clustering, eliminated the possibility of unevenly sized clusters; in the worst case, standard k-means could leave only 1 country to recommend to the user (clearly not enough variety to choose from).
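The size bounds also restrict how many clusters are possible: with n countries, k clusters can all have between 10 and 12 members only if 10k ≤ n ≤ 12k. The short check below illustrates this arithmetic; it is not the clustering algorithm itself, and the function name and the country counts are hypothetical.

```python
import math


def feasible_cluster_counts(n_countries, min_size=10, max_size=12):
    """Values of k for which n_countries can be split into k clusters
    whose sizes all lie between min_size and max_size."""
    smallest_k = math.ceil(n_countries / max_size)  # need max_size * k >= n
    largest_k = n_countries // min_size             # need min_size * k <= n
    return list(range(smallest_k, largest_k + 1))


options_100 = feasible_cluster_counts(100)  # 100 hypothetical countries
options_25 = feasible_cluster_counts(25)    # 25 hypothetical countries
```

For 100 countries only k = 9 or k = 10 satisfies both bounds, while 25 countries cannot be split into clusters of 10 to 12 at all, so the number of clusters has to be chosen with the dataset size in mind.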

### Evaluation strategies

\[Insert paragraph from Serena.\]

## Results

See ClusteringKNNEvaluation.ipynb for the clustering and KNN evaluation. There is also a qualitative evaluation, which we do not believe has code attached.

#### Part A:
A clear justification of the final approach based on the proposed evaluation strategies. Ensuring multiple evaluation strategies are used.

It was difficult to evaluate the accuracy of our results due to the subjective nature of the recommender. Therefore we decided to focus on evaluating our methods. 

When evaluating our 10-nearest-neighbours ensemble, we calculated a similarity score to explore how stable our model was. We iteratively removed one distance metric at a time from the ensemble and then compared the resulting country suggestions.

*Insert table*

Overall, we found a 97% similarity between all models after the removal of each distance metric, meaning that the model is quite stable, as it does not depend on any one metric.
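One way to compute such a leave-one-metric-out similarity is sketched below. It assumes that similarity is the fraction of recommended countries shared with the full ensemble, averaged over the metrics removed; this definition, the function names and the toy neighbour lists are assumptions for illustration and may differ from the code in ClusteringKNNEvaluation.ipynb.

```python
from collections import Counter
from statistics import mean


def vote(neighbour_sets, k):
    """The k countries with the most votes (ties broken alphabetically)."""
    votes = Counter(name for neighbours in neighbour_sets for name in neighbours)
    return set(sorted(votes, key=lambda name: (-votes[name], name))[:k])


def leave_one_out_similarity(neighbours_by_metric, k):
    """Average overlap between the full ensemble and each reduced ensemble."""
    full = vote(neighbours_by_metric.values(), k)
    overlaps = []
    for left_out in neighbours_by_metric:
        reduced = vote(
            [n for metric, n in neighbours_by_metric.items() if metric != left_out],
            k,
        )
        overlaps.append(len(full & reduced) / k)
    return mean(overlaps)


# Made-up neighbour lists for one country, with k = 2 to keep it readable.
neighbours_by_metric = {
    "euclidean": ["E", "B"],
    "manhattan": ["E", "B"],
    "chebyshev": ["E", "C"],
    "cosine": ["C", "D"],
}
similarity = leave_one_out_similarity(neighbours_by_metric, k=2)
```

In this toy case, removing either `euclidean` or `manhattan` changes one of the two recommendations, so the average similarity is 0.75; a value close to 1 indicates that no single metric drives the result.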


For the size-constrained clustering, we used the silhouette score to assess the model's goodness of fit. Because of the numerous combinations of user inputs we could cluster on, we took a random sample of user inputs for the silhouette score test.

The average score of 0.3 suggests that, while our clusters are not sharply separated, there is still meaningful distance between clusters. This is likely due to the limited size of the dataset we are using. If we had more data, e.g. city-level data or information about accommodation and flights, we might be able to produce more clearly distinct clusters.
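For reference, the silhouette score ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters. The sketch below computes it from scratch with Euclidean distance on toy points; it is an illustration of the metric, not the evaluation code we ran, and the points and labels are made up.

```python
import math
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance.
    Assumes at least two distinct cluster labels."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean_distance(i, members):
        return mean(math.dist(points[i], points[j]) for j in members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_distance(i, own)  # cohesion: distance within own cluster
        b = min(                   # separation: nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return mean(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across it
```

The well-matched labelling scores close to 1 and the mismatched one scores below 0, which gives a sense of where an average of 0.3 sits on this scale.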


As a qualitative measure, we also compared our results to travel recommendations made by the likes of Lonely Planet. This let us compare our subjective recommendations with popular opinion and suggestions from travel professionals. We found that the countries we recommended were generally similar to theirs, with no outliers.

#### Part B:
A clear description of the deployment process. An engaging and clear illustration of the product (games, shiny app, learning device etc) with a discussion of concepts from multiple disciplines.

## Discussion/Conclusion
Discussion of potential shortcomings or issues associated with the development process or the product with reference to both disciplines. Identification of future work or improvement in both disciplines. Conclusion adequately summarises the project and identification of future work.

## References

## Appendix

## Student contributions

1. Eve Fernando
2. Marie Montgomery
3. Rayani Saha
4. Serena Watson
5. Stuart Toft
6. Yan Liu