In this notebook, I am going to show you how to easily perform Recursive Feature Elimination (RFE for short) using CatBoost.

The beauty of RFE is that it is one of the most intuitive and straightforward feature selection methods.

And if you are someone like me who likes to create dozens and dozens of features and only later see which ones are useful, this feature selection technique will be like a godsend to you.

In short, RFE gradually removes the worst-performing features of your dataset, leaving you with only the ones that help your model's performance. (If you are interested in diving deeper into this topic, [here's a great article about it](https://machinelearningmastery.com/rfe-feature-selection-in-python/).)
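To make the idea concrete, here is a minimal sketch of the elimination loop on synthetic data. It fits ordinary least squares and uses the absolute coefficient as a stand-in importance score. This only illustrates the general RFE idea and is not how CatBoost ranks features internally. The function name `toy_rfe` and the data are made up for this example.

```python
import numpy as np

def toy_rfe(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        # Fit an ordinary least-squares model on the remaining features
        coefs, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Assumption: the absolute coefficient stands in for feature importance
        weakest = remaining[int(np.argmin(np.abs(coefs)))]
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Only features 0 and 3 drive the target; the others are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=500)

kept, dropped = toy_rfe(X, y, n_keep=2)
print("kept:", kept)
print("dropped (in elimination order):", dropped)
```

The noise features are eliminated first, and only the two informative columns survive.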

The only problem is that I haven't seen many notebooks showing how to implement this technique easily, and I had to do my fair share of testing and experimenting before I found this approach (only a handful of lines of code are needed).

Now, the dataset I am going to use in this notebook is the one I am using for the [Predict Future Sales Competition](https://www.kaggle.com/c/competitive-data-science-predict-future-sales). For the sake of simplicity, I will only concern myself with the RFE technique, so I won't talk about data preprocessing, feature engineering, and so on.
(My final model is still a work in progress and I plan to release a public notebook when I finish it, so stay tuned.)


And without further ado, let's dive right into RFE.

<h2> Importing Libraries & Datasets </h2>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


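# Show every column when displaying a DataFrame
pd.set_option('display.max_columns',None)

# Load the preprocessed dataset used throughout this notebook
df = pd.read_pickle('../input/predict-future-sales-df/predict_future_sales_df.pkl')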
df.head(5)

Now, before we continue, here's a quick breakdown of the features of this dataset. **(Feel free to skip this section if you want to)**

**Shop_id**: Refers to the unique shop ID of a given store

**Item_id**: Refers to the unique item ID of a given product.

**mo**: Refers to the unique time index of the events.

**mo_revenue**: (scaled) values of the monthly revenue of a given shop/item combination.

**Features ending in _mean**: Respective mean encoded feature.

**Features ending in _cluster:** Respective cluster category of a given feature.

**City**: City where a given shop is located.

**Shop_category_name**: The name of the category to which a given item belongs.

**Item_name_lenght**: Some discussion threads in this competition concluded that the length of the item name string contains useful information for our model. Is this true? We will soon find out.

**Features starting with rolling_**: Moving average of a particular feature.

**Features ending with lag**: Lagged features in relation to the current time index.

**Features ending with age**: Some shops and items have different launch dates. So I created this feature to find out if this information is indeed important for our model.

<h2> Separating Train Set, Validation Set, and Test Set </h2>

The first step for using RFE is to create the train set, validation set, and test set.

This is a very important step, yet many people overlook it, and the quality of their final model suffers as a result. So please take the time to study the problem at hand and decide on a validation strategy that fits it.


Since we are dealing with a time series problem in this particular competition, the validation strategy we will use is walk-forward validation. That is, we will use the current month (time index 33) as the validation set, all months preceding it as the train set, and the next month (time index 34, the month we are trying to forecast) as the prediction set.
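The same idea extends to more than one validation month if you want several folds. Below is a minimal sketch in plain Python. `walk_forward_folds` is a hypothetical helper written for this illustration, and the month range 0 to 33 is taken from the time index used in this notebook.

```python
def walk_forward_folds(months, n_folds):
    """Yield (train_months, valid_month) pairs, oldest fold first."""
    months = sorted(set(months))
    for valid_month in months[-n_folds:]:
        # Train only on months strictly before the validation month
        train_months = [m for m in months if m < valid_month]
        yield train_months, valid_month

folds = list(walk_forward_folds(range(0, 34), n_folds=3))
for train_months, valid_month in folds:
    print(f"train on months 0-{train_months[-1]}, validate on month {valid_month}")
```

The last fold (train on months 0 to 32, validate on month 33) is the single split used in the rest of this notebook.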


In [None]:
# Dropping some features to avoid data leakage/overfitting on this particular dataset
columns_to_drop = ['mo_sales','item_category_id','item_price','mo_revenue','rev_times_sales','item_id','shop_id']

# Creating the train set (all months with index values smaller than the index of the current month)
x_train = df[df['mo'] < 33].drop(columns_to_drop,axis=1)
y_train = df[df['mo'] < 33]['mo_sales']

# Creating the valid set (month time index == 33)
x_valid = df[df['mo'] == 33].drop(columns_to_drop,axis=1)
y_valid = df[df['mo'] == 33]['mo_sales']

# Creating the test set (the month we are trying to predict)
test = df[df['mo'] == 34].drop(columns_to_drop,axis=1)

<h2> Creating CatBoost's Model </h2>

And now it's finally time for us to create our model and start selecting features.

The first step is to import and instantiate CatBoostRegressor.

In [None]:
categorical_features = ['City','category_name','Shop Type','Category Cluster','Shop Cluster','City Cluster','Shop Type Cluster','Cat_name Cluster']

from catboost import CatBoostRegressor

regressor = CatBoostRegressor(eval_metric='RMSE', # Evaluation metric of this competition
                              iterations = 100, # Number of gradient boosting iterations we will use in this model
                              cat_features = categorical_features, # Categorical features of the df
                              random_state = 123
                              )

Now, unlike most other models, CatBoost lets you declare the categorical features of the dataset when creating the regressor (through the **cat_features** parameter) and then encodes them internally.

Of course, you can preprocess them on your own, but I find it unnecessary since Catboost will do it for you.
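If you would rather not hard-code the list of categorical columns, one option is to derive it from the column dtypes and compare it with the list you pass to **cat_features**. The sketch below uses a small made-up frame (the column names and values are placeholders, not the competition data) and assumes every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy frame standing in for x_train (values are made up for illustration)
toy_train = pd.DataFrame({
    "City": ["Moscow", "Kazan", "Moscow"],
    "Shop Type": ["mall", "online", "mall"],
    "item_age": [3, 12, 7],
    "sales_lag": [0.5, 1.0, 0.0],
})

# Assumption: every non-numeric column is a categorical feature
detected_cat_features = toy_train.select_dtypes(exclude="number").columns.tolist()

# Guard against typos: a declared name that is not an actual column is a mistake
declared = ["City", "Shop Type"]
missing = [name for name in declared if name not in toy_train.columns]

print("detected:", detected_cat_features)
print("declared but missing:", missing)
```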

With that out of the way, let's move on.

---

<h2> Selecting Features </h2>

Now, the next step in the RFE pipeline is to call the **select_features** method on the CatBoost regressor we have just created.

In a nutshell, this method will remove the least useful features of the dataset and inform us of how the model performs along the way. As simple as that.

Here are the most important parameters of this method.

**X** = The x_train set you have created in the previous step.

**y** = The y_train set you have created in the previous step.

**eval_set** = Short for evaluation set. The validation set used to score the performance of your model. This parameter takes a tuple of (x_valid, y_valid) as an argument.

**features_for_select**: Specify the features you want to be evaluated in the RFE. Since I want to perform it on the entire dataset, I will select features 0 to 36 (all 37 columns). But if you want to test only a few features and leave the rest of the dataset untouched, you can specify them here.

**num_features_to_select**: Specify how many of the features above you want to keep.

**steps**: Specify how many times the model is trained during the selection. Some features are eliminated after each training round, so more steps means a finer-grained (but slower) selection.

**verbose**: Specify after how many iterations of the model you want the current model's score to be printed. You can also turn it off by setting the parameter to False.

**train_final_model**: Set it to True if you want CatBoost to automatically train your model with the selected features after the RFE is finished.

**plot**: Set it to True if you want CatBoost to plot the model's score graph right after finishing the RFE.
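To avoid hard-coding the '0-36' range, the index string can be built from the number of columns. A minimal sketch follows. `build_feature_range` is a hypothetical helper (not part of CatBoost), and the column count of 37 is the one from this notebook's dataset.

```python
def build_feature_range(n_features):
    """Return a 'first-last' index range string covering every feature column."""
    if n_features < 1:
        raise ValueError("need at least one feature")
    return f"0-{n_features - 1}"

n_columns = 37  # in the notebook this would be x_train.shape[1]

# Collect the selection settings in one place so they are easy to tweak
rfe_params = {
    "features_for_select": build_feature_range(n_columns),
    "num_features_to_select": 10,
    "steps": 5,
    "train_final_model": False,
}
print(rfe_params["features_for_select"])
```

The dictionary could then be unpacked into the call, for example `regressor.select_features(X=x_train, y=y_train, eval_set=(x_valid, y_valid), **rfe_params)`.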

Now, after completing the RFE, the select_features method will return a dictionary with the following keys: selected_features; selected_features_names; eliminated_features; eliminated_features_names.

Feel free to save it in a variable if you think this information will be useful later.
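As a rough illustration of how that dictionary might be used afterwards, the sketch below works on a hand-written stand-in with the four keys listed above. The feature names and indices in it are invented for the example and are not results from this dataset.

```python
# Stand-in for the dictionary returned by select_features (values are made up)
rfe_result = {
    "selected_features": [0, 2],
    "selected_features_names": ["City", "rolling_sales"],
    "eliminated_features": [1, 3],
    "eliminated_features_names": ["name_length", "shop_age"],
}

# Hypothetical column order of the training set
all_columns = ["City", "name_length", "rolling_sales", "shop_age"]

# Keep every column that was not eliminated, preserving the original order
eliminated = set(rfe_result["eliminated_features_names"])
kept_columns = [col for col in all_columns if col not in eliminated]

print("eliminated:", sorted(eliminated))
print("kept:", kept_columns)
```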


To start, let's keep only 10 features out of the 37 and see how the model performs. After that, we will be able to optimize the model accordingly.

In [None]:
rfe_dict = regressor.select_features(X = x_train, 
                                     y = y_train, 
                                     eval_set = (x_valid,y_valid), # Walkforward validation set we have created earlier
                                     features_for_select = '0-36', # Features that will be selected on the RFE
                                     num_features_to_select = 10, # Number of features to keep from the selected
                                     steps = 5, # Number of times the model is trained during the RFE
                                     verbose = 50, # Print the model's score every 50 iterations
                                     train_final_model = False, # Do not train the final model after the RFE is finished
                                     plot = True # Plot the model's score graph after the RFE is finished
                                     )

---

In my opinion, what makes CatBoost's RFE so great is that you can visualize how the validation score changes as features get eliminated and use it to gauge how many features (and which ones specifically) you should ideally get rid of.

As we can see from the interactive graph above, the ideal number of features to remove seems to be 9 (leaving 28 of the original 37). Beyond that point, our score gets worse as we remove more features.
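Reading the best point off the graph amounts to picking the number of removed features with the lowest validation error. A small sketch of that arithmetic is below. The RMSE values are invented placeholders, chosen only so that the minimum falls at 9 removed features as described above.

```python
# Hypothetical (removed_features_count, validation RMSE) pairs read off the graph
loss_curve = [(0, 0.912), (9, 0.897), (18, 0.905), (27, 0.931)]
total_features = 37

# The best point is the one with the lowest validation RMSE
best_removed, best_rmse = min(loss_curve, key=lambda point: point[1])

# This is the value to pass as num_features_to_select when re-running the RFE
features_to_keep = total_features - best_removed
print(best_removed, features_to_keep)
```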


Now, the heavy lifting is already done, and the next logical step is to retrain the model. For that, you can simply re-do the RFE (but now changing the number of features to be selected), or you can find out (by hovering the mouse on the interactive graph) what features were removed, drop them from your dataset, and retrain the model.


For the sake of simplicity, I will simply re-do the RFE.

In [None]:
regressor.select_features(x_train,y_train,eval_set=(x_valid,y_valid),
                                     features_for_select='0-36',
                                     num_features_to_select = 28,
                                     steps = 5,
                                     verbose = False,
                                     train_final_model = True,
                                     plot = False)
y_pred = regressor.predict(test)

### Transforming predictions into a csv file following the rules of this particular competition ###
# Clip predictions to the [0, 20] range and assign the result (an in-place clip on a selected column may not modify `test`)
test['item_cnt_month'] = np.clip(y_pred, 0, 20)
sub = test['item_cnt_month'].reset_index(drop=True)
submission = pd.DataFrame({"ID": sub.index,"item_cnt_month": sub.values})
submission.to_csv('sub.csv', index=False)
print('Done!')

And now you have an easy-to-implement framework for filtering the useful features out of the dozens and dozens you have created for a particular dataset.

As you could see, doing RFE with CatBoost is extremely easy and we only had to call two methods.

Feel free to copy and paste any of the code in this notebook if you find it useful.

Hope you have learned a thing or two that you can use to improve your models.

Thanks for your time and attention!