## <a>Introduction</a>

In this competition we are given a classification task: predicting a binary target from the feature columns in the data. The dataset is based on the Titanic dataset, and this time the features are not anonymized.

Let's get started.

## <a>Loading Packages and Data</a>

In [None]:
import numpy as np 
import pandas as pd 
import os, gc
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import math
import lightgbm as lgb
import xgboost as xgb
import optuna

from sklearn.metrics import mean_squared_error, accuracy_score, log_loss
from sklearn.model_selection import KFold, GridSearchCV, train_test_split, StratifiedKFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings('ignore')

In [None]:
PATH = '../input/tabular-playground-series-apr-2021/'

train = pd.read_csv(PATH + 'train.csv')
test = pd.read_csv(PATH + 'test.csv')
sample = pd.read_csv(PATH + 'sample_submission.csv')

print(train.shape, test.shape)

The train and test sets are the same size. Let's take a look at the train set.


In [None]:
train.head(10)

In [None]:
test.head(10)

In [None]:
train.info()

Unlike the previous competitions in this series, this dataset has missing values.

In [None]:
test.info()

In both datasets, only the 'Age', 'Ticket', 'Fare', 'Cabin' and 'Embarked' features have missing values.
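
The per-column counts reported by `info()` can be condensed into a single table. The sketch below is a minimal illustration on a small made-up frame; the helper `missing_summary` and the `toy` data are hypothetical names introduced here, not part of the competition data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy stand-in for the competition data (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 2],
})
print(missing_summary(toy))
```

In this notebook the same helper could be called as `missing_summary(train)` and `missing_summary(test)`.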

## <a>EDA</a>

Let's first check the distribution of the target variable.


In [None]:
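fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(train['Survived'], kde=True, stat='density', ax=ax[0])
# pass the column as x= explicitly; a bare positional argument is not treated as x in newer seaborn
sns.countplot(x=train['Survived'], ax=ax[1])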

The dataset is somewhat imbalanced, but we have enough samples from both classes.
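
The degree of imbalance can also be read off as numbers rather than from the plot. This is a minimal sketch on a made-up target column; `target` is a hypothetical stand-in for `train['Survived']`.

```python
import pandas as pd

# Toy target column (hypothetical values); in the notebook this would be train['Survived'].
target = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="Survived")

# Share of each class; the shares sum to 1.
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio of the majority class to the minority class as a single imbalance figure.
imbalance_ratio = class_share.max() / class_share.min()
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```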

In [None]:
train.describe()

In [None]:
# use columns= instead of a positional axis argument, which newer pandas no longer accepts
FEATURES = train.drop(columns=['PassengerId', 'Survived']).columns
FEATURES

In [None]:
train.head(5)

Let's analyze each feature separately with respect to the target variable.

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(20, 18))
ax = ax.flatten()

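for k, i in enumerate(['Pclass', 'Sex', 'Embarked', 'Parch', 'SibSp']):
    # name the columns via data=/x=/hue= so the call works across seaborn versions
    sns.countplot(data=train, x=i, hue='Survived', ax=ax[k])

# the 3x2 grid has six panels but only five features; hide the unused one
ax[-1].set_visible(False)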

1. **Pclass**: In 1st (upper) and 2nd (middle) class, the number of passengers who survived is comparable to the number who didn't. In 3rd (lower) class, however, 75% didn't survive.

2. **Sex**: Over 80% of men didn't survive, whereas approximately 66% of women did.

3. **Embarked** (C = Cherbourg, Q = Queenstown, S = Southampton):
   Most passengers embarked from Southampton, and over 50k didn't survive.
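
Percentages like the ones above can be computed directly instead of being estimated from the bar heights. The following is a minimal sketch on a made-up frame; `survival_rate` and `toy` are hypothetical names, and the numbers it prints are not the competition's.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [3, 3, 1, 2, 1, 3],
    "Survived": [0, 0, 1, 1, 1, 0],
})


def survival_rate(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of the binary target per category, i.e. the share of survivors."""
    return df.groupby(column)["Survived"].mean()


for col in ["Sex", "Pclass"]:
    print(survival_rate(toy, col), end="\n\n")
```

Because the target is 0/1, the group mean is the survival rate, and one minus it is the share that didn't survive.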

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 6))

for k, i in enumerate(['Fare', 'Age']):
    # distplot is deprecated; histplot with hue draws one density per class and adds the legend itself
    sns.histplot(data=train, x=i, hue='Survived', kde=True, stat='density',
                 common_norm=False, ax=ax[k])

In [None]:
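for i in ['Sex', 'Embarked']:
    # 'Embarked' has missing values; give them an explicit label so the encoder sees only strings
    train[i] = train[i].fillna('Missing')
    test[i] = test[i].fillna('Missing')
    le = LabelEncoder()
    le.fit(train[i])
    train[i] = le.transform(train[i])
    test[i] = le.transform(test[i])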

train.head()

In [None]:
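# restrict to numeric columns; text columns such as 'Name' cannot be correlated
x = train.select_dtypes(include='number').corr()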
plt.figure(figsize=(10,10))
sns.heatmap(x, annot=True)

## <a>Model</a>

In [None]:
# fix the seed so the shuffled folds (and the scores below) are reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv
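
`StratifiedKFold` keeps the class proportions roughly equal across folds, which matters here because the target is somewhat imbalanced. A minimal sketch on a made-up target shows the share of positives in each validation fold; the arrays `X_toy` and `y_toy` are hypothetical, not the competition data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target (hypothetical): 60 negatives, 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_rates = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Share of positives in this validation fold.
    fold_rates.append(float(y_toy[val_idx].mean()))

print(fold_rates)
```

Each validation fold should carry about the same share of positives as the full target, so per-fold accuracy scores are comparable.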

In [None]:
FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked']

In [None]:
X = train[FEATURES]
y = train.Survived
print(X.shape, y.shape)

In [None]:
oof_df = train[['PassengerId', 'Survived']].copy()
fold_ = 1


for train_idx, val_idx in cv.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    # cv.split yields positional indices, so index y by position as well
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = lgb.LGBMClassifier()
    model.fit(X_train, y_train)

    val_preds = model.predict(X_val)
    test_preds = model.predict(test[FEATURES])

    # store the out-of-fold predictions for this fold's validation rows
    oof_df.loc[oof_df.index[val_idx], 'oof'] = val_preds
    sample[f'fold{fold_}'] = test_preds

    score = accuracy_score(y_val, val_preds)
    print(f'fold {fold_}: {score:.5f}')
    fold_ += 1

In [None]:
print(accuracy_score(oof_df.Survived, oof_df.oof))
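# majority vote across the fold columns; mode() returns a DataFrame, so take its first column
sample['Survived'] = sample.drop(columns=['PassengerId', 'Survived']).mode(axis=1)[0].astype(int)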
sample[['PassengerId', 'Survived']].to_csv('submission.csv', index=False)
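
The submission above combines the five fold models by taking the most frequent predicted label per row. A minimal sketch of that majority vote on made-up predictions follows; `fold_preds` is a hypothetical stand-in for the `fold1` to `fold5` columns of `sample`.

```python
import pandas as pd

# Toy per-fold class predictions for four test rows (hypothetical values).
fold_preds = pd.DataFrame({
    "fold1": [1, 0, 1, 0],
    "fold2": [1, 0, 0, 0],
    "fold3": [0, 0, 1, 1],
    "fold4": [1, 1, 1, 0],
    "fold5": [1, 0, 0, 0],
})

# With an odd number of binary votes there is never a tie, so the row-wise
# mode has a single column; column 0 holds the majority label.
majority = fold_preds.mode(axis=1)[0].astype(int)

# Equivalent formulation: a row is 1 when more than half of the folds say 1.
by_mean = (fold_preds.mean(axis=1) > 0.5).astype(int)

print(majority.tolist())
```

With an even number of folds, ties become possible and `mode(axis=1)` would return more than one column, so an odd fold count keeps this step simple.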

In [None]:
sns.countplot(x=sample['Survived'])