# Competition Mechanics

Each competition has the following elements:

* Data - read the description. Sometimes you can use more data than they give you (but need to check the rules)
* Model - what we build. It transforms data into answers. Should produce the best possible prediction and be reproducible
* Submission - usually we just submit our predictions, they don't care about the actual model. Some (cool) competitions let you submit your code as well. There will usually be a sample submission file that we can look at
* Evaluation - a function that takes in `(predictions, right_answers)` and returns a score saying how good the predictions are. Usually we care less about the score itself than about our performance relative to other competitors.
* Leaderboard - usually shows your best score and your position.
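
To make the evaluation element concrete, here is a minimal sketch of a scoring function and a toy leaderboard. RMSE is only an assumed metric (each competition defines its own), and the names `rmse`, `submissions`, "us" and "rival" are made up for illustration.

```python
import math

def rmse(predictions, right_answers):
    """Root mean squared error; lower is better. Assumed metric, for illustration only."""
    if len(predictions) != len(right_answers):
        raise ValueError("predictions and right_answers must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, right_answers)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical hidden answers and two hypothetical submissions.
right_answers = [3.0, -0.5, 2.0, 7.0]
submissions = {
    "us": [2.5, 0.0, 2.0, 8.0],
    "rival": [3.0, -0.5, 2.0, 7.5],
}

scores = {name: rmse(preds, right_answers) for name, preds in submissions.items()}

# Leaderboard: best (lowest) score first. Our position matters more than the raw score.
leaderboard = sorted(scores, key=scores.get)

for position, name in enumerate(leaderboard, start=1):
    print(position, name, round(scores[name], 4))
```

Note that for some metrics (accuracy, AUC) higher is better, so the sort direction depends on the competition.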

# Real World ML Pipeline

It's a complicated process including:

1. Understanding the business problem
2. Problem formalization
3. Data collecting
4. Data preprocessing
5. Modelling - which model is appropriate? How to select the best model?
6. Way to evaluate the model in real life
7. Way to deploy the model
8. Monitor performance and retrain on new data
9. Periodically revise your understanding of the problem and go through the cycle again and again

# Competition ML Pipeline

All the formalization and evaluation parts are already done, and we don't have to deploy the model. So, it's just:

1. Data preprocessing
2. Modelling

BUT sometimes you need to understand the business problem to get insights or generate new features.

And sometimes you are allowed to use external data. And then data collection becomes a crucial part of the solution.

In real-world projects, the hardest part is often problem formalization and choosing the target metric; in competitions both are given to us.

Sometimes you need to be wary of how complex your model is (perhaps so that a business can deploy it in future), but this is usually not the case.

As competitors, _the only thing that matters is the target metric value_. Speed, complexity, memory consumption, all of these are irrelevant. We just care about getting the best performing model.

# Competition Philosophy

Not just about algorithms. It's about data and making things work. 

Everyone can and will tune classic approaches. We need some insight to win. These insights are usually more useful than a deftly tuned ensemble.

Sometimes there is no ML (shock horror!).

Do not limit yourself. The only thing we care about is the target metric. It's ok to use heuristics and manual data analysis. Don't be afraid of super complex solutions, advanced feature engineering, or doing HUGE calculations. Totally fine to do these things. Use everything you can to get as much juice as you can from your models.

Be creative. There are no rules. You can modify or hack an existing algorithm or even design a completely new one! Don't be afraid of reading source code or changing it.

Enjoy them!

# Families of ML Algorithms

- Linear - good for sparse, high-dimensional data.
  - Sklearn and Vowpal Wabbit are both great for linear models
- Tree-based - decision tree, random forest, GBDT. Divides space into boxes (as it creates splits along the feature axes). Great for tabular data. In almost every competition, winners use this approach. Hard to capture linear dependencies though, as that requires a lot of splits
  - Sklearn has a good implementation of RandomForest
  - Use XGBoost and LightGBM for gradient boosting due to higher speed and accuracy
- kNN - features based on kNN are often very informative. Note that squared (Euclidean) distance makes sense in low dimensions but not really in higher ones. So, you may have to use other metrics.
  - Sklearn - variety of built-in distance functions and you can use your own
- Neural Networks - produce a smooth separating curve (in contrast to decision trees).
  - Explore TensorFlow Playground to get a more intuitive understanding of how NNs work. There are loads of frameworks, including:
  - TensorFlow/Keras
  - PyTorch (the instructor prefers PyTorch but obvs I know both are fine).
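
On the kNN point above, here is a rough sketch of why Euclidean distance gets less useful as dimensionality grows: for random points, the nearest and farthest neighbours end up almost equally far away. The uniform random data and the helper name `relative_contrast` are assumptions for illustration, not anything from the course.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit cube (assumed data)
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows,
# so "nearest" carries less and less information.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(1000, n_dims), 3))
```

In sklearn, the `metric` parameter of `KNeighborsClassifier` is where you would swap in a different distance (e.g. `"manhattan"` or your own callable).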

## No Free Lunch Theorem

There is no method that outperforms all others for all tasks. 

Some models are better suited for certain tasks.

Or rather, for every method, we can construct a task for which this method will not be the best.

Each method relies on some assumptions about the data/task. So, we cannot win every competition with just a single algorithm. We need a variety of algorithms.
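
A small sketch of this idea, under made-up assumptions: two synthetic tasks on the same inputs, one with a linear target and one XOR-like. The linear model should win the first and the tree the second. The data, the task names, and the model choices are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))

# Two made-up tasks on the same inputs.
tasks = {
    "linear": 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=len(X)),
    "xor": np.sign(X[:, 0]) * np.sign(X[:, 1]),
}

results = {}
for task_name, y in tasks.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    models = {
        "linear_model": LinearRegression(),
        "tree": DecisionTreeRegressor(random_state=0),
    }
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        # .score() returns R^2 on the held-out data (higher is better).
        results[(task_name, model_name)] = model.score(X_test, y_test)

for key, r2 in results.items():
    print(key, round(r2, 3))
```

The linear model has no way to represent the XOR-like target, while the tree has to approximate the linear target with many small boxes.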



## RandomForest vs. ExtraTrees

RandomForest is built by training many decision trees, each on a different bootstrapped sample of the training data (so each tree is effectively trained on a different, but similar, dataset). At each split, a tree can only consider a random subset of the features (sqrt(num_features) is the sklearn default for classification). So each tree uses slightly different data and can only use a random subsample of the features to decide. Then we combine the results of these trees (voting or averaging) to get our final answer.

ExtraTrees is built by training many decision trees on the same training data (sklearn does not bootstrap by default here). Again, each split can only consider sqrt(num_features) features. But instead of searching for the best threshold, it draws a random threshold for each candidate feature and keeps the best of those random splits. This is in stark contrast to RandomForest, which searches for the best split it can make on the features it is working with, according to the 'criterion' it is trying to optimize, e.g. gini or entropy.
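
A minimal sklearn sketch of the two side by side. The synthetic dataset and hyperparameter values are assumptions for illustration; `bootstrap` and `max_features` are set explicitly to mirror the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=6, random_state=0
)

forests = {
    # Bootstrapped samples + best threshold per candidate feature.
    "RandomForest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
    ),
    # Full training set + random threshold per candidate feature.
    "ExtraTrees": ExtraTreesClassifier(
        n_estimators=100, max_features="sqrt", bootstrap=False, random_state=0
    ),
}

cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in forests.items()
}

for name, acc in cv_accuracy.items():
    print(name, round(acc, 3))
```

Which one scores better depends on the data, so it is worth trying both.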

GradientBoosting builds trees one at a time, where each new tree helps to correct the errors made by the trees trained before it.
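
A hand-rolled sketch of that idea for squared-error loss, where "correcting errors" means fitting each new tree to the current residuals. This is a simplified illustration, not how XGBoost or LightGBM are implemented; the data, `learning_rate`, tree depth, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)  # made-up regression task

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
train_mse = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(50):
    # For squared error, the residuals are the negative gradient of the loss.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Each new tree nudges the ensemble prediction towards the targets.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(np.mean((y - prediction) ** 2))

print(round(train_mse[0], 4), "->", round(train_mse[-1], 4))
```

The training error can only go down with each round here, which is also why boosting overfits if you keep adding trees without checking a validation set.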

# Software/Hardware Requirements

Most competitions (except image-based) can be solved on:

- High-end laptop
- 16+ GB RAM
- 4+ cores 

Quite good setup:
- Tower PC
- 32+ GB RAM
- 6+ cores

(This setup is what the course instructor uses)

You can get the old 13" Mac with up to 32GB memory, and the 16" goes up to 64GB!! The new MacBook Pro only goes up to 16GB memory atm but has 8 cores.

**Really important things**
- **RAM** - if you can keep data in memory, everything will be much easier. So the more RAM you have, the better. 64GB should be more than enough (but some prefer 128GB or even more!)
- **Cores** - the more cores you have the more (or faster) experiments you can run (sometimes even 32 is not enough...)
- **Storage** - SSD is *crucial* if you work with images or big datasets with a lot of small pieces. Especially important for training NNs on a large number of images.

So now I am seeing some massive benefits of doing NLP over CV, namely it will be cheaper for me to learn as I will not have to work with such massive datasets!

Obvs can just rent all of this from AWS, GCP, or Azure.

Note the AWS Spot Instances option. This lets you bid on unused instances, which can lower your cost significantly!

Your spot instance runs whenever your current bid exceeds the market price. Generally, it's much cheaper than other options. But there is the risk that the market price will rise above your bid and your session will be terminated.

Note that Jupyter notebooks allow you to work remotely. Working in a Jupyter notebook on AWS is exactly the same as working on your local machine; you just go through the setup differently (by SSH'ing into the instance, I think).

Very important things to note:
- [Vowpal Wabbit](https://vowpalwabbit.org/) - blazing speed, and handles really large datasets which don't fit into memory
- [Libfm](https://github.com/srendle/libfm) and [Libffm](https://github.com/ycjuan/libffm) implement different types of factorization machines and are often used for sparse data like click-through rate prediction
- [rgf](https://github.com/RGF-team/rgf) (Regularized Greedy Forest) - an alternative tree-based method which he suggests we use in ensembles

Being honest I have _no idea_ what these things do (even after reading about them on Github), nor would I have any idea how to implement them. This must be how people feel when they look at Python code lol.

The [Final Project For The Course](https://www.kaggle.com/c/competitive-data-science-final-project)