# We have a ton of features in this competition, should be fun! 

### Initial Observations 
- No missing values.
- There are 100 numerical continuous features.
- The target variable loss ranges from 0 to 42 for a total of 43 discrete values. 
- However, this is a regression problem, so it is fine to submit decimal values. ***But could we combine regression and classification?***
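
The observations above can be checked with a few pandas calls. Here is a minimal sketch on a tiny made-up frame; the helper name `quick_summary` and the toy values are illustrative assumptions, not part of the competition files.

```python
import pandas as pd

def quick_summary(df, target_col):
    """Return basic facts about a frame: missing cells, numeric feature count, target range."""
    feature_cols = [c for c in df.columns if c != target_col]
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "numeric_features": int(df[feature_cols].select_dtypes("number").shape[1]),
        "target_min": df[target_col].min(),
        "target_max": df[target_col].max(),
        "target_unique": int(df[target_col].nunique()),
    }

# Toy stand-in for train.csv (hypothetical values)
toy = pd.DataFrame({"f0": [0.1, 0.5, 0.9], "f1": [3.0, 1.0, 2.0], "loss": [0, 2, 0]})
summary = quick_summary(toy, "loss")
print(summary)
```

On the real data the same helper could be called on `train` (without the `id` column) with `'loss'` as the target.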

#### I have divided this notebook into two parts:
- EDA
- AutoML (I am just learning this)

# Importing Libraries and Data for the EDA 

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 
# matplotlib setting
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False
train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

In [None]:
print(f'Train Shape :  {train.shape}')
print(f'Test Shape :  {test.shape}')

#### Observations: 
- The train data has 101 columns (`id` + 100 features) plus 1 target column (`loss`)
- There are 250k rows in the train data and 150k rows in the test data

In [None]:
target = train['loss']
train.drop(['id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)

#### Having a look at the top 2 rows of the train and the test data

In [None]:
train.head(2)

In [None]:
test.head(2)

# Info about the train and the test data

In [None]:
train.info()

In [None]:
test.info(max_cols=10)

#### Now that we are done with the initial data exploration, let's look at the target variable distribution to understand how the target values are spread

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(17, 8))

target_cnt = train['loss'].value_counts().sort_index()

# one colour per bar, so the alternating pattern holds across all loss values
ax.bar(target_cnt.index, target_cnt, color=['#799EFF' if i%2==0 else '#CCDAFF' for i in range(len(target_cnt))],
       width=0.55, 
       edgecolor='black', 
       linewidth=0.7)

ax.margins(0.02, 0.05)

for i in range(10):
    ax.annotate(f'{target_cnt[i]/len(train)*100:.3}', xy=(i, target_cnt[i]+1000),
                   va='center', ha='center',
               )

ax.set_title('Target Distribution', weight='bold', fontsize=15)
ax.grid(axis='y', linestyle='-', alpha=0.4)

fig.tight_layout()
plt.show()

#### Observations
- There are 43 distinct loss values in total.
- The 12 most frequent loss values account for about 80% of the rows.
- Sorted by frequency, the loss values appear in increasing order, except that 2 and 1 are swapped.
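
The "top 12 values cover 80%" kind of statement can also be computed directly. This is a small sketch with a hypothetical helper `n_values_to_cover` and a toy series rather than the real `loss` column.

```python
import pandas as pd

def n_values_to_cover(series, threshold=0.8):
    """Number of most frequent values needed so their cumulative share reaches `threshold`."""
    shares = series.value_counts(normalize=True)  # sorted by frequency, descending
    cumulative = shares.cumsum()
    # small tolerance guards against floating-point round-off at exact thresholds
    return int((cumulative < threshold - 1e-12).sum()) + 1

toy_loss = pd.Series([0, 0, 0, 0, 1, 1, 2, 2, 3, 4])
n_needed = n_values_to_cover(toy_loss, 0.8)
print(n_needed)
```

On the notebook's data this would be called as `n_values_to_cover(train['loss'], 0.8)`.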

In [None]:
target_cnt_df = target_cnt.to_frame(name='count')
target_cnt_df['ratio(%)'] = target_cnt_df['count'] / target_cnt.sum() * 100
target_cnt_df.sort_values('ratio(%)', ascending=False, inplace=True)
target_cnt_df['cumulative_sum(%)'] = target_cnt_df['ratio(%)'].cumsum()
target_cnt_df.style.bar(subset=['cumulative_sum(%)'], color='#CCDAFF').background_gradient(subset=['ratio(%)'], cmap='binary')
# target_cnt_df.style.bar(subset=['ratio(%)'], color='#799EFF')

# Statistics Check
The scale of this data is really diverse, which makes me think that scaling should be done in this case. Tree-based models usually do not need scaled data, because their splits depend only on the ordering of each feature's values, but with features as diverse as these, scaling still makes them much easier to compare in the plots below.
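
To illustrate the point about tree-based models and scale, here is a small NumPy-only sketch. It finds the best single split (a "stump") on a toy feature by brute force and checks that standardizing the feature leaves the resulting partition of rows unchanged. The helper `best_stump_partition` and the synthetic data are assumptions made for illustration; real libraries use faster, histogram-based split search.

```python
import numpy as np

def best_stump_partition(x, y):
    """Return the boolean mask (x <= threshold) of the variance-minimising single split."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best_sse, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_threshold = sse, (xs[i] + xs[i - 1]) / 2
    return x <= best_threshold

rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=200)            # feature on a large scale
y = (x > 400).astype(float) + rng.normal(0, 0.1, size=200)

x_scaled = (x - x.mean()) / x.std()           # per-column standardization
same_split = np.array_equal(best_stump_partition(x, y), best_stump_partition(x_scaled, y))
print(same_split)
```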


In [None]:
train.describe()

#### Observations:
- The scales of the features are very diverse, as you can see.
- f0 and f4 show this clearly; a closer look shows that f16, f52, f60, f75 and f91 also have very high values and very high standard deviations.
- Some features seem to take only discrete (integer) values:
  - f1
  - f16
  - f27
  - f55
  - f60
  - f86


### A deeper dive into these 6 features with discrete values

In [None]:
discrete_features = []

for col in train.columns.drop('loss'):  # skip the target, which is integer-valued anyway
    if np.array_equal(train[col].values, train[col].values.astype(int)):
        discrete_features.append(col)

print(f'Total {len(discrete_features)} : ')
for dcol in discrete_features:
    print(f'{dcol} unique value : {train[dcol].nunique()}')

#### Observations:
- With 250,000 rows in total, f16 and f60 have so many distinct values that they behave like continuous features.
- The remaining f1, f27, f55 and f86 look relatively categorical.

#### Looking at f1 and f86, which have a small number of unique values: to see the relationship with the loss, we take the mean loss after a groupby.

In [None]:
f1_loss = train.groupby(['f1'])['loss'].mean().sort_values()
fig, ax = plt.subplots(1, 1, figsize=(20, 6))

ax.bar(range(len(f1_loss)), f1_loss, alpha=0.7, color='#799EFF', label='Train Dataset')
ax.set_yticks(range(0, 20, 3))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('Average of loss grouped by f1', loc='left', fontweight='bold')
ax.legend()
plt.show()

#### Observations
- The mean loss varies noticeably with the value of f1, so the loss is imbalanced across f1 values.
- For 5 values of f1, the loss is 0 for every row.
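
The values of a feature for which the loss is always 0 can be listed directly. Below is a minimal sketch on toy data; the helper `groups_with_all_zero` is hypothetical and relies on the target being non-negative, as `loss` is here.

```python
import pandas as pd

def groups_with_all_zero(df, group_col, target_col):
    """Return the values of `group_col` for which every row has target == 0."""
    max_per_group = df.groupby(group_col)[target_col].max()
    # for a non-negative target, max == 0 means every value in the group is 0
    return max_per_group[max_per_group == 0].index.tolist()

toy = pd.DataFrame({"f1": [1, 1, 2, 2, 3], "loss": [0, 0, 0, 5, 0]})
zero_groups = groups_with_all_zero(toy, "f1", "loss")
print(zero_groups)
```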

In [None]:
f86_loss = train.groupby(['f86'])['loss'].mean().sort_values()
fig, ax = plt.subplots(1, 1, figsize=(20, 6))

ax.bar(range(len(f86_loss)), f86_loss, alpha=0.7, color='#799EFF', label='Train Dataset')
ax.set_yticks(range(0, 20, 3))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('Average of loss grouped by f86', loc='left', fontweight='bold')
ax.legend()
plt.show()

# Scaling the data

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
features = [f'f{i}' for i in range(100)]
train[features] = ss.fit_transform(train[features])
test[features] = ss.transform(test[features])

### Target & Feature Relation
- As the target value increases, the mean of the standardized features moves away from zero.

In [None]:
from matplotlib.pyplot import cm
fig, ax = plt.subplots(1,1, figsize=(12, 7))
sns.heatmap(train.groupby('loss').mean().sort_index(),
            square=True, vmin=-0.5, vmax=0.5, center=0, linewidth=1,
            cmap=sns.diverging_palette(240, 220, as_cmap=True),
            cbar=False, 
           )

ax.set_title('Mean : Group by Target(Loss)',loc='left')
plt.show()

# Feature Distribution

In [None]:
fig, axes = plt.subplots(10,10,figsize=(12, 12))
axes = axes.flatten()

for idx, ax in enumerate(axes):
    sns.kdeplot(data=train, x=f'f{idx}', 
                fill=True,color = '#799EFF',
                ax=ax)
#     sns.kdeplot(data=test, x=f'f{idx}', 
#                 fill=True,color = 'grey',
#                 ax=ax)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Feature distributions (Train Dataset)', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()

In [None]:
import warnings
warnings.filterwarnings('ignore')
feature_cols = train.columns.tolist()[:100]
sns.set_style("white")
fig = plt.figure(figsize=(15, 60))
for i, col in enumerate(feature_cols):
    plt.subplot(20, 5, i + 1)
    plt.title(col, size=12, fontname='monospace')
    # pass the column as x explicitly so the box is drawn horizontally
    a = sns.boxplot(x=train[col], linewidth=2, color='#799EFF', saturation=1)
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname='monospace')
    plt.yticks([])
    for j in ['right', 'left', 'top']:
        a.spines[j].set_visible(False)
    a.spines['bottom'].set_linewidth(1.2)

fig.tight_layout(h_pad=3)
plt.show()

#### Observations:
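- The features are standardized here, but the shapes of their distributions are still interesting to compare.

- The train and test distributions appear to be almost the same (the commented-out test `kdeplot` in the KDE cell above can be used to overlay them).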
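
One way to put a number on "train and test look the same" is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The sketch below implements it with NumPy on synthetic samples; the helper `ks_statistic` is illustrative, and `scipy.stats.ks_2samp` would be the usual library choice.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(size=5000), rng.normal(size=3000))
shifted = ks_statistic(rng.normal(size=5000), rng.normal(loc=1.0, size=3000))
print(f"same distribution: {same:.3f}, shifted: {shifted:.3f}")
```

Applied per feature, for example `ks_statistic(train['f0'].values, test['f0'].values)`, values close to 0 would support the observation.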

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12 , 12))

corr = train.corr()

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in recent NumPy versions
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr, ax=ax,
        square=True, center=0, linewidth=1,
        cmap=sns.diverging_palette(240, 220, as_cmap=True),
        cbar_kws={"shrink": .82},    
        mask=mask
       ) 

ax.set_title('Correlation', loc='left', fontweight='bold')

plt.show()

#### Observations:
- Most correlations are close to 0
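
The heatmap impression can be backed by a single number: the share of feature pairs whose absolute correlation is below some threshold. This is a sketch on synthetic data; the helper `share_of_weak_correlations` and the 0.1 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def share_of_weak_correlations(df, threshold=0.1):
    """Fraction of distinct column pairs whose absolute Pearson correlation is below `threshold`."""
    corr = df.corr().to_numpy()
    upper = corr[np.triu_indices_from(corr, k=1)]  # each pair once, diagonal excluded
    return float((np.abs(upper) < threshold).mean())

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(2000, 5)), columns=[f"f{i}" for i in range(5)])
toy["f5"] = toy["f0"] * 0.9 + rng.normal(scale=0.1, size=2000)  # one strongly correlated pair
weak_share = share_of_weak_correlations(toy)
print(weak_share)
```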

# Now we do the Modelling 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
train = pd.read_csv(r'../input/tabular-playground-series-aug-2021/train.csv')
test = pd.read_csv(r'../input/tabular-playground-series-aug-2021/test.csv')
sub = pd.read_csv(r'../input/tabular-playground-series-aug-2021/sample_submission.csv')
y = train['loss']
train.drop('loss',axis=1,inplace=True)
# all remaining columns are used as features (note: `id` is still among them here)
features = list(train.columns)
# print(features)

# Min Max Scaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
train[features] = mm.fit_transform(train[features])
test[features] = mm.transform(test[features])
X = train

# LightGBM

In [None]:
def fit_lgb(trial, x_train, y_train, x_test, y_test):
    params = {
        'reg_alpha' : trial.suggest_loguniform('reg_alpha' , 0.47 , 0.5),
        'reg_lambda' : trial.suggest_loguniform('reg_lambda' , 0.32 , 0.33),
        'num_leaves' : trial.suggest_int('num_leaves' , 50 , 70),
        'learning_rate' : trial.suggest_uniform('learning_rate' , 0.03 , 0.04),
        'max_depth' : trial.suggest_int('max_depth', 30 , 40),
        'n_estimators' : trial.suggest_int('n_estimators', 100 , 6100),
        'min_child_weight' : trial.suggest_loguniform('min_child_weight', 0.015 , 0.02),
        'subsample' : trial.suggest_uniform('subsample' , 0.9 , 1.0), 
        'colsample_bytree' : trial.suggest_loguniform('colsample_bytree', 0.52 , 1),
        'min_child_samples' : trial.suggest_int('min_child_samples', 76, 80),
        'metric' : 'rmse',
        'device_type' : 'gpu',
    }
    
    
    model = LGBMRegressor(**params, random_state=2021)
    model.fit(x_train, y_train,eval_set=[(x_test,y_test)], early_stopping_rounds=150, verbose=False)
    
    y_train_pred = model.predict(x_train)
    
    y_test_pred = model.predict(x_test)
    y_train_pred = np.clip(y_train_pred, 0.1, None)
    y_test_pred = np.clip(y_test_pred, 0.1, None)
    
    log = {
        "train rmse": mean_squared_error(y_train, y_train_pred,squared=False),
        "valid rmse": mean_squared_error(y_test, y_test_pred,squared=False)
    }
    
    return model, log

In [None]:
def objective(trial):
    rmse = 0
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
    model, log = fit_lgb(trial, x_train, y_train, x_test, y_test)
    rmse += log['valid rmse']
        
    return rmse

In [None]:
lgb_params = {'reg_alpha': 0.4972562469417825, 'reg_lambda': 0.3273637203281044, 
          'num_leaves': 50, 'learning_rate': 0.032108486615557354, 
          'max_depth': 40, 'n_estimators': 4060, 
          'min_child_weight': 0.0173353329222102,
          'subsample': 0.9493343850444064, 
          'colsample_bytree': 0.5328221263825876, 'min_child_samples': 80,'device':'gpu'}
lgb_params

In [None]:
def cross_val(X, y, model, params, folds=10):

    kf = KFold(n_splits=folds, shuffle=True, random_state=2021)
    for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
        print(f"Fold: {fold}")
        x_train, y_train = X.values[train_idx], y.values[train_idx]
        x_test, y_test = X.values[test_idx], y.values[test_idx]

        alg = model(**params,random_state = 2021)
        alg.fit(x_train, y_train,
                eval_set=[(x_test, y_test)],
                early_stopping_rounds=400,
                verbose=False)
        pred = alg.predict(x_test)
        error = mean_squared_error(y_test, pred,squared = False)
        print(f" RMSE: {error}")  # squared=False returns the root of the MSE
        print("-"*50)
    
    return alg  # note: this is the model fitted on the last fold only
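
`cross_val` prints one score per fold; a common summary is the mean and spread across folds. Below is a minimal NumPy-only sketch. The helper `rmse` and the per-fold numbers are made up for illustration, and for 1-D targets it computes the same quantity as `mean_squared_error(..., squared=False)`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical per-fold targets and predictions
folds = [
    ([0, 2, 4], [1, 2, 3]),
    ([1, 1, 5], [1, 3, 5]),
]
fold_scores = [rmse(y_true, y_pred) for y_true, y_pred in folds]
print(f"RMSE per fold: {fold_scores}")
print(f"mean = {np.mean(fold_scores):.4f}, std = {np.std(fold_scores):.4f}")
```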

In [None]:
lgb_model = cross_val(X, y, LGBMRegressor, lgb_params)

# XGBoost

In [None]:
def fit_xgb(trial, x_train, y_train, x_test, y_test):
    params = {
        # tweedie_variance_power only takes effect with objective='reg:tweedie'
        'tweedie_variance_power': trial.suggest_discrete_uniform('tweedie_variance_power', 1.0, 2.0, 0.1),
        'max_depth': trial.suggest_int('max_depth', 6, 10), # Extremely prone to overfitting!
        'n_estimators': trial.suggest_int('n_estimators', 400, 4000, step=400), # Extremely prone to overfitting!
        'eta': trial.suggest_float('eta', 0.007, 0.013), # Most important parameter.
        'subsample': trial.suggest_discrete_uniform('subsample', 0.2, 0.9, 0.1),
        'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.2, 0.9, 0.1),
        'colsample_bylevel': trial.suggest_discrete_uniform('colsample_bylevel', 0.2, 0.9, 0.1),
        'min_child_weight': trial.suggest_loguniform('min_child_weight', 1e-4, 1e4), # I've had trouble with LB score until tuning this.
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-4, 1e4), # L2 regularization
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-4, 1e4), # L1 regularization
        'gamma': trial.suggest_loguniform('gamma', 1e-4, 1e4)
    } 
    
    
    model = XGBRegressor(**params,tree_method='gpu_hist', random_state=2021)
    model.fit(x_train, y_train,eval_set=[(x_test,y_test)], early_stopping_rounds=150, verbose=False)
    
    y_train_pred = model.predict(x_train)
    
    y_test_pred = model.predict(x_test)
    y_train_pred = np.clip(y_train_pred, 0.1, None)
    y_test_pred = np.clip(y_test_pred, 0.1, None)
    
    log = {
        "train rmse": mean_squared_error(y_train, y_train_pred,squared=False),
        "valid rmse": mean_squared_error(y_test, y_test_pred,squared=False)
    }
    
    return model, log

In [None]:
def objective(trial):
    rmse = 0
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
    model, log = fit_xgb(trial, x_train, y_train, x_test, y_test)
    rmse += log['valid rmse']
        
    return rmse

In [None]:
xgb_params = {'tweedie_variance_power': 2.0,
 'max_depth': 9,
 'n_estimators': 4000,
 'eta': 0.01200085275863839,
 'subsample': 0.8,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.4,
 'min_child_weight': 2.824928835841522,
 'reg_lambda': 67.43522142240646,
 'reg_alpha': 0.00012103217663028774,
 'gamma': 0.012432559904494572,'tree_method':'gpu_hist'}
xgb_params

In [None]:
xgb_model = cross_val(X, y, XGBRegressor, xgb_params)

# CatBoost

In [None]:
def fit_cat(trial, x_train, y_train, x_test, y_test):
    params = {'iterations':trial.suggest_int("iterations", 1000, 20000),
              'od_wait':trial.suggest_int('od_wait', 500, 2000),
              'task_type':"GPU",
              'eval_metric':'RMSE',
              'learning_rate' : trial.suggest_uniform('learning_rate', 0.03 , 0.04),
              'reg_lambda': trial.suggest_loguniform('reg_lambda', 0.32 , 0.33),
              'subsample': trial.suggest_uniform('subsample',0.9,1.0),
              'random_strength': trial.suggest_uniform('random_strength',10,50),
              'depth': trial.suggest_int('depth',1,15),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf',1,30),
              'leaf_estimation_iterations': trial.suggest_int('leaf_estimation_iterations',1,15),
               }
    
    
    # task_type is already set in params; passing it again would raise a TypeError
    model = CatBoostRegressor(**params, random_state=2021)
    model.fit(x_train, y_train,eval_set=[(x_test,y_test)], early_stopping_rounds=150, verbose=False)
    
    y_train_pred = model.predict(x_train)
    
    y_test_pred = model.predict(x_test)
    y_train_pred = np.clip(y_train_pred, 0.1, None)
    y_test_pred = np.clip(y_test_pred, 0.1, None)
    
    log = {
        "train rmse": mean_squared_error(y_train, y_train_pred,squared=False),
        "valid rmse": mean_squared_error(y_test, y_test_pred,squared=False)
    }
    
    return model, log

In [None]:
def objective(trial):
    rmse = 0
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
    model, log = fit_cat(trial, x_train, y_train, x_test, y_test)
    rmse += log['valid rmse']
        
    return rmse

In [None]:
cat_params = {'iterations': 1224,
 'od_wait': 1243,
 'learning_rate': 0.03632022350716054,
 'reg_lambda': 0.3257139588327784,
 'subsample': 0.9741256425198503,
 'random_strength': 41.06792107841663,
 'depth': 12,
 'min_data_in_leaf': 27,
 'leaf_estimation_iterations': 10,'task_type':'GPU'}
cat_params

In [None]:
cat_model = cross_val(X, y, CatBoostRegressor, cat_params)

In [None]:
cat = CatBoostRegressor(**cat_params)
lgb = LGBMRegressor(**lgb_params)
xgb = XGBRegressor(**xgb_params)

In [None]:
from sklearn.ensemble import VotingRegressor
folds = KFold(n_splits = 10, random_state = 2021, shuffle = True)

predictions = np.zeros(len(test))

for fold, (trn_idx, val_idx) in enumerate(folds.split(X)):
    print(f"Fold: {fold}")
    X_train, X_val = X.values[trn_idx], X.values[val_idx]
    y_train, y_val = y.values[trn_idx], y.values[val_idx]

    model = VotingRegressor(
            estimators = [
                ('lgbm', lgb),
                ('xgb', xgb)
            ],
            weights = [0.15, 0.65]
        )
   
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    error = mean_squared_error(y_val, pred,squared = False)
    print(f" RMSE: {error}")  # squared=False returns the root of the MSE
    print("-"*50)
    
    predictions += model.predict(test) / folds.n_splits
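
`VotingRegressor` combines its estimators through a weighted average of their predictions, so weights that do not sum to 1 (here 0.15 and 0.65) act only through their ratio. The sketch below uses made-up predictions and assumes the averaging behaves like `np.average`.

```python
import numpy as np

# Hypothetical predictions from two models for three rows
pred_lgbm = np.array([6.0, 7.0, 8.0])
pred_xgb = np.array([7.0, 7.0, 6.0])
weights = [0.15, 0.65]

blended = np.average(np.vstack([pred_lgbm, pred_xgb]), axis=0, weights=weights)

# Same result with weights rescaled to sum to 1
normalised = np.array(weights) / np.sum(weights)
manual = normalised[0] * pred_lgbm + normalised[1] * pred_xgb
print(blended, manual)
```

With these weights the XGBoost predictions carry 0.65 / 0.80 of the blend.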

# Submission

In [None]:
sub['loss'] = lgb_model.predict(test)
sub.to_csv('lgb.csv', index=False)

sub['loss'] = xgb_model.predict(test)
sub.to_csv('xgb.csv', index=False)

sub['loss'] = cat_model.predict(test)
sub.to_csv('cat.csv', index=False)

sub['loss'] = predictions
sub.to_csv('vote.csv', index=False)

## Credits to the notebooks that helped me make this one:
- [Notebook by Subin An](https://www.kaggle.com/subinium/tps-aug-simple-eda)
- [Notebook by BIZEN](https://www.kaggle.com/hiro5299834/tps-aug-2021-lgbm-xgb-catboost)

In [None]:
%%html
<marquee style='width: 90% ;height:70%; color: #799EFF ;'>
    <b> Do UPVOTE if you like my work, I will be adding some more content to this kernel :) </b></marquee>