# Winning Tips on Machine Learning Competitions
These tips were shared by Marios Michailidis (a.k.a. KazAnova), Kaggle Grandmaster, in a webinar on 5th March 2016. You can access the video and slides from this [tutorial](https://www.hackerearth.com/practice/machine-learning/advanced-techniques/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3/tutorial/).

**1. What are the steps you follow for solving a ML problem? Please describe from scratch.**


Following are the steps I undertake while solving any ML problem:

1. Understand the data - After you download the data, start exploring the features. Look at the data types and check the variable classes. Create some univariate and bivariate plots to understand the nature of the variables.
2. Understand the metric to optimise - Every problem comes with its own evaluation metric. It's imperative to understand it, especially how it changes with the target variable.
3. Decide on a cross validation strategy - To avoid overfitting, make sure you set up a cross validation strategy in the early stages. A good CV strategy will help you get a reliable score on the leaderboard.
4. Start hyper parameter tuning - Once CV is in place, try improving the model's accuracy using hyper parameter tuning. It further includes the following steps:
  * Data transformations: These involve steps like scaling, removing outliers, treating null values, transforming categorical variables, selecting features, creating interactions etc.
  * Choosing algorithms and tuning their hyper parameters: Try multiple algorithms to understand how model performance changes.
  * Saving results: Make sure you save the predictions of all the models trained above. They will be useful for ensembling.
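  * Combining models: Finally, ensemble the models, possibly on multiple levels. For best results, make sure the models are not too strongly correlated with each other, since an ensemble gains the most from models that make different errors.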
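
The last two steps (saving predictions and combining models) can be sketched as follows. This is a minimal illustration using NumPy only: the three "models" are simulated as the true target plus independent noise, and the names `saved_preds` and `rmse` are hypothetical, not taken from the webinar.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: the saved validation predictions of three models, simulated
# here as the true target plus independent noise for each model.
y_true = rng.normal(size=1000)
saved_preds = {
    "model_a": y_true + rng.normal(scale=0.5, size=1000),
    "model_b": y_true + rng.normal(scale=0.5, size=1000),
    "model_c": y_true + rng.normal(scale=0.5, size=1000),
}


def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return float(np.sqrt(np.mean((y - p) ** 2)))


# Score each model on its own.
single_scores = {name: rmse(y_true, p) for name, p in saved_preds.items()}

# Simplest possible ensemble: an unweighted average of the saved predictions.
pred_matrix = np.column_stack(list(saved_preds.values()))
ensemble_pred = pred_matrix.mean(axis=1)
ensemble_score = rmse(y_true, ensemble_pred)

# Pairwise correlation between the models' errors (one row per model).
error_corr = np.corrcoef((pred_matrix - y_true[:, None]).T)

print(single_scores)
print("ensemble:", ensemble_score)
```

Inspecting `error_corr` shows how similar the models' mistakes are, which is worth knowing before combining them on further levels.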

**2. What are the model selection and data manipulation techniques you follow to solve a problem?**


Generally, I try (almost) everything for most problems. In principle for:

1. **Time series:** I use GARCH, ARCH, regression, ARIMA models etc.
2. **Image classification:** I use deep learning (convolutional nets) in Python.
3. **Sound classification:** Common neural networks.
4. **High cardinality categorical (like text data):** I use linear models, FTRL, Vowpal Wabbit, LibFFM, libFM, SVD etc.
5. For everything else, I use gradient boosting machines (like XGBoost and LightGBM) and deep learning (like Keras, Lasagne, Caffe, Cxxnet). I decide which models to keep or drop in meta modelling with feature selection techniques. Some of the feature selection techniques I use include:
  * Forward (cv or not) - Start from the null model. Add one feature at a time and check the CV accuracy. If it improves, keep the variable, else discard it.
  * Backward (cv or not) - Start from the full model and remove variables one by one. If the CV accuracy improves by removing a variable, discard it.
  * Mixed (or stepwise) - Use a mix of the above two techniques.
  * Permutations
  * Using feature importance - Use the feature importance output of random forest, GBM or XGBoost.
  * Apply some statistical logic, such as the chi-square test or ANOVA.
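
A minimal sketch of the forward selection pass described above, assuming scikit-learn and a synthetic regression dataset. The linear model, the R^2 metric and the helper name `cv_r2` are illustrative choices, not part of the original tips.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: a synthetic dataset where only 3 of the 8 features carry signal.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=10.0, random_state=0
)

# Fixed, seeded folds so every candidate feature set is scored on the same splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_r2(columns):
    """Mean cross-validated R^2 of a linear model using the given columns."""
    if not columns:
        # Null model: predicts the training mean and ignores the features.
        model, features = DummyRegressor(), X
    else:
        model, features = LinearRegression(), X[:, columns]
    return float(cross_val_score(model, features, y, cv=cv, scoring="r2").mean())


selected = []
best = cv_r2(selected)  # score of the null model
for col in range(X.shape[1]):
    score = cv_r2(selected + [col])
    if score > best:
        # The feature improves the CV score: keep it.
        selected.append(col)
        best = score
    # Otherwise the feature is discarded and never revisited.

print("selected columns:", selected, "CV R^2:", round(best, 3))
```

Backward selection follows the same pattern in reverse: start with all columns and drop one whenever removing it improves the CV score.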

Data manipulation techniques can be different for every problem:

* **Time series:** You can calculate moving averages and derivatives, and remove outliers.
* **Text:** Useful techniques are tf-idf, count vectorizers, word2vec, SVD (dimensionality reduction), stemming, spell checking, sparse matrices, likelihood encoding, one hot encoding (or dummies) and hashing.
* **Image classification:** Here you can do scaling, resizing, removing noise (smoothing), annotating etc.
* **Sounds:** Calculate Fourier transforms, MFCC (Mel frequency cepstral coefficients), low pass filters etc.
* **Everything else:** Univariate feature transformations (like log(1 + x) for numerical data), feature selection, treating null values, removing outliers, converting categorical variables to numeric.
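
As an illustration of the "everything else" transformations, here is a small NumPy sketch that treats null values, clips outliers and applies a log(1 + x) transform to one numerical feature. The data is simulated, and the median fill and the 1st/99th percentile clipping bounds are assumptions chosen for the example rather than recommendations from the webinar.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumption: a right-skewed, non-negative feature (e.g. a price or a count)
# with some missing values, simulated here for illustration.
raw = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
raw[rng.choice(raw.size, size=100, replace=False)] = np.nan

# Treat null values: fill them with the median of the observed values.
median = np.nanmedian(raw)
filled = np.where(np.isnan(raw), median, raw)

# Remove outliers: clip to the 1st and 99th percentiles.
low, high = np.percentile(filled, [1, 99])
clipped = np.clip(filled, low, high)

# Univariate transformation: log(1 + x) compresses the long right tail.
transformed = np.log1p(clipped)


def skewness(x):
    """Sample skewness (third standardized moment)."""
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)


print("skewness before:", skewness(filled))
print("skewness after: ", skewness(transformed))
```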

**3. Can you elaborate cross validation strategy?**

Cross validation means that from my main set, I RANDOMLY create 2 sets. I build (train) my algorithm with the first one (let’s call it the training set) and score the other (let’s call it the validation set). I repeat this process multiple times and always check how my model performs on the validation set with respect to the metric I want to optimise.

The process may look like:

* For 10 (you choose how many, X) times:
  * Split the set into training (50%-90% of the original data)
  * and validation (50%-10% of the original data).
  * Then fit the algorithm on the training set.
  * Score the validation set.
  * Save the result of that scoring with respect to the chosen metric.
* Calculate the average of these 10 (X) scores. That is roughly the score you can expect in real life, and it is generally a good estimate.
* Remember to use a SEED to be able to replicate these X splits.
* Other things to consider are KFold and stratified KFold. For time sensitive data, make certain you always follow the rule of having the past predict the future when testing.
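
The loop above can be sketched with scikit-learn's `ShuffleSplit`. This is a minimal example under stated assumptions: the dataset is synthetic, and logistic regression with AUC as the metric are placeholders for whatever algorithm and metric the competition calls for.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

# Assumption: a synthetic binary classification problem stands in for the main set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 10 random splits, 80% training / 20% validation, with a fixed seed
# so the same X splits can be replicated later.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

scores = []
for train_idx, valid_idx in splitter.split(X):
    # Fit the algorithm on the training set.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Score the validation set with the chosen metric and save the result.
    valid_pred = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], valid_pred))

# The average over the splits is the estimate of real-life performance.
mean_score = float(np.mean(scores))
print("AUC per split:", np.round(scores, 3))
print("mean AUC:", round(mean_score, 3))
```

`KFold` and `StratifiedKFold` from `sklearn.model_selection` can replace `ShuffleSplit` here (stratified splitters also need `y` passed to `split`). For time sensitive data, `TimeSeriesSplit` keeps every validation fold later in the row order than its training fold, assuming the rows are sorted by time.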