In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

This short kernel covers just two things:
- a basic way of identifying and handling outliers
- a very simple attempt at exploring the data.

My aim is to see how the distributions of the features behave w.r.t. the target variable. There are great kernels out there that give detailed walkthroughs of modelling, but I just wanted to see if we can get the data talking.

## **Setup**

In [None]:
df = pd.read_csv("../input/30-days-of-ml/train.csv")
df_test = pd.read_csv("../input/30-days-of-ml/test.csv")

### Checking the target

I decided to work in reverse. The first step is understanding the target, starting with *outlier removal* in the target variable.

The easiest way to identify outliers is using the interquartile range (IQR), the difference between Q3 *(75th percentile)* and Q1 *(25th percentile)*. Any data point that lies more than 1.5 times the IQR above Q3 or below Q1 can be considered an outlier.

In [None]:
sns.boxplot(x='target', data=df)
_ = plt.title('Box plot of target column', fontsize=14)

A lot of points fall outside the *whiskers* of the plot, which means that there are outliers that we need to take care of.

In [None]:
# Quartiles and interquartile range of the target
percent_25 = df["target"].quantile(0.25)
percent_75 = df["target"].quantile(0.75)
iqr = percent_75 - percent_25

# Anything outside [lbound, ubound] is flagged as an outlier
lbound = percent_25 - (1.5 * iqr)
ubound = percent_75 + (1.5 * iqr)

df["outliers"] = np.where((df["target"] < lbound) | (df["target"] > ubound), "yes", "no")
outliers = df.loc[df["outliers"] == "yes", "target"].values

In [None]:
sns.scatterplot(x=range(len(outliers)), y=outliers,color="olive")

It looks like target values below roughly 6 and above roughly 10 are outliers. Out of about 300,000 rows (3 lakh), these make up around 3,000 points. There are two easy ways to handle these outliers.
1. One is to drop the rows that contain them.
2. The other is to cap them at a lower or upper bound, for example the IQR-based bounds (`lbound` and `ubound`) calculated earlier.

For this notebook, I will go with Option-1. **However, it is always better to go with Option-2**. Option-2 ensures that we do not lose the input features of the affected rows. This matters particularly in cases like churn prediction or fraud detection.
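For reference, Option-2 could look something like the sketch below. It runs on a tiny made-up series rather than the competition data, and `toy`, `lower`, `upper` and `target_clipped` are illustrative names, not part of this notebook. `Series.clip` caps anything outside the bounds to the nearest bound, so no rows are lost.

```python
import pandas as pd

# Toy target values (made up for illustration); 1.0 and 15.0 are the extremes
toy = pd.DataFrame({"target": [1.0, 7.8, 8.0, 8.1, 8.2, 8.3, 8.5, 15.0]})

# Same IQR rule as in the outlier check
q1 = toy["target"].quantile(0.25)
q3 = toy["target"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside [lower, upper] instead of dropping the rows
toy["target_clipped"] = toy["target"].clip(lower=lower, upper=upper)
print(toy)
```

All eight rows survive; only the two extreme values are pulled in to the bounds.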

### Out with the outliers.

In [None]:
new_df = df[df["outliers"] == "no"].reset_index(drop=True)
new_df.drop("outliers", axis=1, inplace=True)

In [None]:
sns.boxplot(x='target', data=new_df)
_ = plt.title('Box plot of target column after removing outliers', fontsize=14)

Looks clean!

## **Exploring the data**

Since I am working in reverse *(target -> data)*, we will try to find some relation between how the features behave with respect to the target. To do this a little differently, I created a "pseudo-target" by grouping the target into 3 bins.

These bins are based on the range of values present in the target variable: `pd.cut` with `bins=3` splits that range into three equal-width intervals.
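To make the binning concrete, here is a small sketch on made-up numbers (the `toy` series is an assumption for illustration only). It shows that `pd.cut` with an integer `bins` argument produces intervals of equal width over the observed range, which does not guarantee equal counts per bin.

```python
import pandas as pd

# Made-up values spanning 6.0 to 9.0
toy = pd.Series([6.0, 6.5, 7.0, 7.4, 8.1, 8.2, 8.3, 8.4, 9.0])

# bins=3 splits the range [min, max] into three intervals of equal width
binned = pd.cut(toy, bins=3)
print(binned.cat.categories)

# Equal width does not mean equal counts
counts = binned.value_counts(sort=False)
print(counts)
```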

### Creating the bins  

In [None]:
new_df["psuedo_target"] = pd.cut(new_df['target'], bins=3)
print("Created bins: ")
print(new_df["psuedo_target"].value_counts())

On visualization, we observe that these bins give us a good enough spread.

In [None]:
sns.histplot(new_df['target'], kde=True, bins=3)

### Checking the spread 

In [None]:
cat_features = [feat for feat in new_df.columns if
                new_df[feat].nunique() <= 15 and #arbitrarily chosen
                new_df[feat].dtype == "object"]

num_features = [feat for feat in new_df.columns if
                new_df[feat].dtype in ["int64", "float64"] and 
                feat not in ['target','psuedo_target']]

useful_features = cat_features + num_features

#### Checking the spread of categorical variables

In [None]:
fig, ax = plt.subplots(5,2, figsize=(18,10))

for indx,feature in enumerate(cat_features):
    row = indx // 2
    col = indx % 2
    sns.countplot(ax=ax[row, col],x=feature, hue='psuedo_target', data=new_df)

#### Checking the spread of numerical variables

In [None]:
fig, ax = plt.subplots(5,3, figsize=(18,10))

for indx,feature in enumerate(num_features):
    row = indx // 3
    col = indx % 3
    sns.kdeplot(ax=ax[row, col],x=feature, hue='psuedo_target', data=new_df)

Unfortunately, no patterns emerge from the data. The distributions look really similar across the different target bins. One experiment we could try based on this observation is to work with 30% of the sample data and check whether the overall result changes.
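One possible reading of that experiment is sketched below, on synthetic data rather than the competition files. The frame `toy`, its columns and the seed are all assumptions made for illustration. It draws a reproducible 30% sample with `DataFrame.sample` and compares summary statistics of the target against the full data; the same sample could then be fed to a model to compare scores.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (not the competition data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cont0": rng.random(10_000),
    "target": rng.normal(loc=8.0, scale=0.7, size=10_000),
})

# Draw a reproducible 30% random sample of the rows
sample = toy.sample(frac=0.3, random_state=0)

# Compare summary statistics of the full data and the sample side by side
comparison = pd.concat(
    [toy["target"].describe(), sample["target"].describe()],
    axis=1, keys=["full", "sample_30pct"],
)
print(comparison)
```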

## **Getting the feature importance from the model**

Training a quick model to check if it can yield any insights into important features.

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

In [None]:
df = pd.read_csv('../input/30days-folds/train_folds.csv')
df_test = pd.read_csv('../input/30-days-of-ml/test.csv')
sample_submission = pd.read_csv('../input/30-days-of-ml/sample_submission.csv')

In [None]:
useful_features = [c for c in df.columns if c not in ("id", "target", "kfold")]
object_cols = [col for col in useful_features if col.startswith("cat")]
df_test = df_test[useful_features]

In [None]:
xgb_params = {
    'random_state': 1, 
    'n_jobs': 4,
    'booster': 'gbtree',
    'n_estimators': 1000,
    'learning_rate': 0.034682894846408095,
    'reg_lambda': 1.224383455634919,
    'reg_alpha': 36.043214512614476,
    'subsample': 0.9219010649982458,
    'colsample_bytree': 0.11247495917687526,
    'max_depth': 3,
    'min_child_weight': 6,
    'tree_method':'gpu_hist',
}

In [None]:
final_predictions = []
scores = []
for fold in range(5):
    xtrain = df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()
    
    ytrain = xtrain.target
    yvalid = xvalid.target
    
    # Copy so the encoded columns can be assigned without a SettingWithCopyWarning
    xtrain = xtrain[useful_features].copy()
    xvalid = xvalid[useful_features].copy()
    
    ordinal_encoder = preprocessing.OrdinalEncoder()
    xtrain[object_cols] = ordinal_encoder.fit_transform(xtrain[object_cols])
    xvalid[object_cols] = ordinal_encoder.transform(xvalid[object_cols])
    xtest[object_cols] = ordinal_encoder.transform(xtest[object_cols])
    
    model= XGBRegressor(**xgb_params)
    model.fit(
        xtrain, ytrain,
        early_stopping_rounds=300,
        eval_set=[(xvalid, yvalid)], 
        verbose=1000
    )
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_predictions.append(test_preds)
    # Take the square root explicitly; newer scikit-learn versions no longer accept squared=False
    rmse = np.sqrt(mean_squared_error(yvalid, preds_valid))
    scores.append(rmse)
    print(fold, rmse)

print(np.mean(scores), np.std(scores))

#### Getting the feature importance

In [None]:
importance = model.feature_importances_
importance

In [None]:
# get importance (`model` is the one fitted in the last fold)
importance = model.feature_importances_
# summarize feature importance
for feature,imp in zip(xtrain.columns, importance):
    print(f'feature:{feature}, Importance score: {imp:.5f}')

In [None]:
# plot feature importance
sns.barplot(x=xtrain.columns, y=importance)
_ = plt.xticks(rotation=90)
_ = plt.title("Feature Importance Chart")

From this small experiment, it looks like there is no clear winner. Perhaps we can retrain the model after removing the features that contribute less than 0.02% to the overall result. That might give a bump to the score.
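A minimal sketch of that filtering step is shown below. The importance scores are hypothetical placeholders (in this notebook they would come from `model.feature_importances_` paired with `xtrain.columns`), and the `threshold` value is only illustrative and should be tuned to the scores actually observed.

```python
import pandas as pd

# Hypothetical importance scores; in the notebook these would come from
# model.feature_importances_ paired with xtrain.columns
importances = pd.Series(
    {"cat1": 0.150, "cont2": 0.080, "cat5": 0.015, "cont9": 0.004}
)

# Illustrative cutoff (an assumption); tune it to match the scores you see
threshold = 0.02

# Keep only the features whose importance reaches the cutoff
kept_features = importances[importances >= threshold].index.tolist()
dropped_features = importances[importances < threshold].index.tolist()

print("keep:", kept_features)
print("drop:", dropped_features)
```

The resulting `kept_features` list could then replace `useful_features` when retraining.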

## **Conclusion**

This is quite an intriguing competition since there is no direct, visible pattern that helps in predicting the target variable. But then again, it is a wonderful opportunity to play around and understand different modeling techniques like ensembling, stacking and blending, which are excellently and patiently covered by the one and only Abhishek Thakur in this playlist [here](https://www.youtube.com/watch?v=_55G24aghPY&list=PL98nY_tJQXZnP-k3qCDd1hljVSciDV9_N).