<h2 style="font-size:35px; background-color:orange; text-align:center; color:blue">TPS April'21 </h2>

**The notebook consists of the following sections:**
1. Importing libraries
2. Understanding Data
3. Visualising Missing Data
4. Exploratory Data Analysis
   - Univariate Analysis
   - Bivariate Analysis
5. Data preprocessing
6. Modeling and Prediction

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Importing libraries</h2>

In [None]:
# Data manipulation libraries
import numpy as np
import pandas as pd
import missingno as msno

# Visualization libraries
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Avoid Warnings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Understanding Data</h2>

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

In [None]:
train.info()

In [None]:
test.info()

It seems that there are missing values in both train and test data.

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Visualising Missing Data</h2>

In [None]:
msno.matrix(train, color = (.30, .20, .89), figsize=(8,8))
plt.show()

In [None]:
# One more way to visualize using seaborn heatmap
plt.figure(figsize=(10,10))
sns.heatmap(train.isnull(), center=1)
plt.show()

In [None]:
# missing data in train data
missing_percentages = (train[train.columns].isnull().sum() / train.shape[0]) * 100
missing_percentages

In [None]:
# missing data in test data
missing_percentages_test = (test[test.columns].isnull().sum() / test.shape[0]) * 100
missing_percentages_test

In [None]:
# total missing data counts in train and test data
missing_values_count = train.isnull().sum()
missing_values_count_test = test.isnull().sum()

# find the percentage of missing data in training data
total_cells = np.prod(train.shape)  # np.product is a deprecated alias that was removed in NumPy 2.0
total_missing = missing_values_count.sum()
percent_missing = (total_missing / total_cells)*100
print("The percentage of total missing data from the training dataset is :", percent_missing, "%")

# find the percentage of missing data in testing data
total_cells_test = np.prod(test.shape)  # np.product is a deprecated alias that was removed in NumPy 2.0
total_missing_test = missing_values_count_test.sum()
percent_missing_test = (total_missing_test / total_cells_test)*100
print("The percentage of total missing data from the testing dataset is :", percent_missing_test, "%")

*The columns with missing values are*
- Age
- Ticket
- Fare
- Cabin
- Embarked
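
The list above can also be derived programmatically instead of being read off the output. The following is a minimal sketch: the helper name `columns_with_missing` is hypothetical, and the toy frame uses made-up values that only stand in for `train`.

```python
import numpy as np
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values for columns that have any."""
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


# Toy frame standing in for `train` (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 2],
})

missing = columns_with_missing(toy)
print(missing)
```

Calling `columns_with_missing(train)` and `columns_with_missing(test)` would give the same kind of summary for the real data.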

In [None]:
train.describe().transpose()

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Exploratory Data Analysis</h2>

In [None]:
train.head()

In [None]:
for i in train.columns:
    print("The number of unique values in {} is {}".format(i, len(train[i].unique())))

*The categorical features are*
- Pclass
- Sex
- Embarked
- Parch
- SibSp

*The continuous features are*
- Age
- Fare

*The ones which will be dealt with manually are*
- Ticket
- Cabin
- Name

In [None]:
categorical_features = ["Pclass","Sex","Embarked","Parch","SibSp"]
continuous_features = ["Age","Fare"]
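
The split above was made by looking at the unique-value counts. A rough way to automate the same decision is to threshold on cardinality. This is only a sketch: the function name `split_by_cardinality` and the `max_levels` cut-off are assumptions, and the toy frame is made up.

```python
import pandas as pd


def split_by_cardinality(df, candidates, max_levels=10):
    """Treat columns with at most `max_levels` distinct values as categorical."""
    categorical = [c for c in candidates if df[c].nunique() <= max_levels]
    continuous = [c for c in candidates if df[c].nunique() > max_levels]
    return categorical, continuous


# Toy frame standing in for `train` (values are made up for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 2, 3] * 4,
    "Sex": ["male", "female"] * 6,
    "Age": [float(a) for a in range(20, 32)],
})

cat_cols, cont_cols = split_by_cardinality(toy, ["Pclass", "Sex", "Age"], max_levels=5)
print("categorical:", cat_cols)
print("continuous:", cont_cols)
```

Free-text columns such as Name, Ticket and Cabin would still need to be excluded from `candidates` by hand, as done in this notebook.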

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Univariate Analysis</h2>

> Understanding the target variable - Survived

In [None]:
# Checking the count & distribution of Survived
fig = plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
ax = sns.countplot(x="Survived",data=train)
plt.subplot(1,2,2)
sns.distplot(train.loc[: ,'Survived'], hist_kws={"color":"r"}, kde_kws={"color":"b", "lw":2})
plt.show()

In [None]:
print("The percentage of people who didn't survive :",((train['Survived'] == 0).sum() / train.shape[0]) * 100)
print("The percentage of people who did survive :",((train['Survived'] == 1).sum() / train.shape[0]) * 100)

> Understanding the features

**Pclass**

In [None]:
fig = plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
ax = sns.countplot(x="Pclass",data=train)
plt.subplot(1,2,2)
sns.distplot(train.loc[: ,"Pclass"], hist_kws={"color":"r"}, kde_kws={"color":"b", "lw":2})
plt.show()

In [None]:
print("The percentage of people in class 1 :",((train['Pclass'] == 1).sum() / train.shape[0]) * 100)
print("The percentage of people in class 2 :",((train['Pclass'] == 2).sum() / train.shape[0]) * 100)
print("The percentage of people in class 3 :",((train['Pclass'] == 3).sum() / train.shape[0]) * 100)

**Parch**

In [None]:
fig = plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
ax = sns.countplot(x="Parch",data=train)
plt.subplot(1,2,2)
sns.distplot(train.loc[: ,"Parch"], hist_kws={"color":"r"}, kde_kws={"color":"b", "lw":2})
plt.show()

**SibSp**

In [None]:
fig = plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
ax = sns.countplot(x="SibSp",data=train)
plt.subplot(1,2,2)
sns.distplot(train.loc[: ,"SibSp"], hist_kws={"color":"r"}, kde_kws={"color":"b", "lw":2})
plt.show()

**Sex**

In [None]:
fig = plt.figure(figsize=(6,6))
sns.set_palette(["#8072fa","orange"])
ax = sns.countplot(x="Sex",data=train)
plt.show()

In [None]:
print("The percentage of people who are male :",((train['Sex'] == "male").sum() / train.shape[0]) * 100)
print("The percentage of people who are female :",((train['Sex'] == "female").sum() / train.shape[0]) * 100)

**Embarked**

In [None]:
fig = plt.figure(figsize=(6,6))
sns.set_palette(["#8072fa","orange","Red"])
ax = sns.countplot(x="Embarked",data=train)
plt.show()

In [None]:
print("The percentage of people embarked at S :",((train['Embarked'] == "S").sum() / train.shape[0]) * 100)
print("The percentage of people embarked at C :",((train['Embarked'] == "C").sum() / train.shape[0]) * 100)
print("The percentage of people embarked at Q :",((train['Embarked'] == "Q").sum() / train.shape[0]) * 100)

In [None]:
fig = plt.figure(figsize=(8,14))
for index,col in enumerate(continuous_features):
    plt.subplot(4,1,index+1)
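    # Pass the column as x= explicitly; newer seaborn versions treat the first positional argument as `data`
    sns.boxplot(x=train.loc[:, col], color="#8072fa", linewidth=2.5)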
fig.tight_layout(pad = 2)

The Fare feature has a lot of outliers.
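
To put a number on that observation, the points a boxplot draws as outliers can be counted with the usual 1.5 × IQR rule. A minimal sketch, assuming that rule is the definition of "outlier" intended here; the helper `iqr_outlier_mask` is hypothetical and the fares are made up.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN evaluate to False, so missing values are never flagged
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up fares with one extreme value
toy_fare = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0])
mask = iqr_outlier_mask(toy_fare)
print("outliers:", int(mask.sum()), "of", len(toy_fare))
```

On the real data, `iqr_outlier_mask(train["Fare"]).mean() * 100` would give the percentage of fares flagged.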

In [None]:
fig = plt.figure(figsize=(8,14))
for index,col in enumerate(continuous_features):
    plt.subplot(4,1,index+1)
    sns.distplot(train.loc[:, col], color="orange", kde_kws={"color":"r", "lw":2})
fig.tight_layout(pad = 2)

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Bivariate Analysis</h2>

In [None]:
train.head()

In [None]:
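# numeric_only=True skips text columns (Name, Sex, ...), which raise an error in pandas >= 2.0
cr = train.corr(method='pearson', numeric_only=True)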
fig = px.imshow(cr)
fig.show()

In [None]:
fig = px.histogram(data_frame=train,
                   x="Survived",
                   y=None,
                   color='Sex',
                   width=500,
                   template="plotly_dark",
                  color_discrete_map={"male":"#8072fa","female":"orange"})
fig.show()

Clearly, more females survived than males.
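
The histogram shows counts, so it helps to look at survival rates per group as well, since the two sexes are not equally represented. A minimal sketch with made-up rows standing in for `train`:

```python
import pandas as pd

# Made-up rows standing in for `train`
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Survivor count, group size and survival rate for each sex
summary = toy.groupby("Sex")["Survived"].agg(
    survivors="sum", passengers="count", rate="mean"
)
print(summary)
```

Running the same `groupby` on `train` gives the figures behind the plot.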

In [None]:
fig = px.histogram(data_frame=train,
                   x="Embarked",
                   y=None,
                   color='Survived',
                   width=500,
                   template="plotly_dark",
                  color_discrete_map={1:"#8072fa",0:"orange"})
fig.show()

Most of the people who embarked at Southampton didn't survive. Most of the people who embarked at Cherbourg survived.
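
Statements of the form "most passengers from port X survived" are row-wise proportions, which `pd.crosstab` can compute directly. A minimal sketch on made-up rows (not the competition data):

```python
import pandas as pd

# Made-up rows standing in for `train`
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# normalize="index" makes every row (port) sum to 1
rates = pd.crosstab(toy["Embarked"], toy["Survived"], normalize="index")
print(rates)
```

A port where the `1` column exceeds 0.5 is one where most passengers survived.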

In [None]:
fig = px.histogram(data_frame=train,
                   x="Pclass",
                   y=None,
                   color='Survived',
                   width=500,
                   template="plotly_dark",
                  color_discrete_map={1:"#8072fa",0:"orange"})
fig.show()

In [None]:
fig = px.histogram(data_frame=train,
                   x="Parch",
                   y=None,
                   color='Survived',
                   width=500,
                   template="plotly_dark",
                  color_discrete_map={1:"#8072fa",0:"orange"})
fig.show()

In [None]:
fig = px.histogram(data_frame=train,
                   x="SibSp",
                   y=None,
                   color='Survived',
                   width=500,
                   template="plotly_dark",
                  color_discrete_map={1:"#8072fa",0:"orange"})
fig.show()

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Data Preprocessing</h2>

**Conclusions from the EDA**
- There are NaN values in
    1. Age
    2. Ticket
    3. Fare
    4. Cabin
    5. Embarked
- The two target classes (survived and did not survive) are almost equally represented in the train data.
- There are a lot of outliers in the Fare feature, since its distribution is highly right-skewed.
- The correlation graph doesn't show any strong correlation between features, so there is little redundancy to remove through dimensionality reduction.
- Parch counts the parents and children aboard, and SibSp counts the siblings and spouses aboard. These can be added to form a single feature for the number of relatives.
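
The correlation conclusion can be checked numerically rather than by eye. The sketch below ranks feature pairs by absolute Pearson correlation; the helper `strongest_correlations` is hypothetical and the toy columns are made up.

```python
import numpy as np
import pandas as pd


def strongest_correlations(df: pd.DataFrame, top: int = 3) -> pd.Series:
    """Return the `top` column pairs with the largest absolute Pearson correlation."""
    corr = df.corr(numeric_only=True).abs()
    # Keep each pair once: mask the diagonal and the lower triangle
    keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(keep).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)


# Made-up columns: b is an exact multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

pairs = strongest_correlations(toy)
print(pairs)
```

If the largest value returned for `train` is small, the heatmap reading holds.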

> Handling missing data

In [None]:
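# Compute fill values from the training data only and apply them to both sets.
# Assigning the result back avoids in-place fillna on a selected column, which newer pandas versions warn about.
age_mean = train['Age'].mean()
train['Age'] = train['Age'].fillna(age_mean)
test['Age'] = test['Age'].fillna(age_mean)

fare_mean = train['Fare'].mean()
train['Fare'] = train['Fare'].fillna(fare_mean)
test['Fare'] = test['Fare'].fillna(fare_mean)

embarked_mode = train['Embarked'].mode()[0]
train['Embarked'] = train['Embarked'].fillna(embarked_mode)
test['Embarked'] = test['Embarked'].fillna(embarked_mode)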
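
The same idea (learn the fill value on the training data, reuse it on the test data) can also be written with scikit-learn's `SimpleImputer`. This is an alternative sketch, not what the notebook itself uses, and the toy frames are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frames standing in for `train` and `test`
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [10.0, np.nan, 30.0, 20.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [5.0, np.nan]})

num_cols = ["Age", "Fare"]
imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training rows only
toy_train[num_cols] = imputer.fit_transform(toy_train[num_cols])
# transform reuses those training means on the test rows
toy_test[num_cols] = imputer.transform(toy_test[num_cols])

print(toy_test)
```

For a categorical column such as Embarked, `strategy="most_frequent"` plays the role of the mode.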

Ticket, Cabin and Name don't seem to influence survival, and PassengerId is only an identifier, so all four columns are dropped.

In [None]:
train.drop(['Name','Ticket','Cabin','PassengerId'], axis=1, inplace=True)
test.drop(['Name','Ticket','Cabin','PassengerId'], axis=1, inplace=True)

A log transform is applied to reduce the influence of outliers in the Fare column. Since there are a large number of outliers, removing them would lead to the loss of a large number of data points.

In [None]:
train['Fare'] = train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
test['Fare'] = test['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
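
The effect of the transform can be measured by comparing the skewness before and after. A minimal sketch on made-up fares, using the same mapping as the cell above:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed fares with one extreme value
toy_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 12.0, 15.0, 30.0, 400.0])

# Same rule as in the notebook: log for positive values, 0 otherwise
logged = toy_fare.map(lambda v: np.log(v) if v > 0 else 0)

skew_before = toy_fare.skew()
skew_after = logged.skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

A skewness closer to 0 after the transform means the extreme fares pull less on the distribution.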

In [None]:
plt.figure(figsize = (8,6))
sns.distplot(train.loc[:, 'Fare'],color='orange',kde_kws={"color":"r", "lw":2})
plt.show()

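Combining Parch and SibSp into one feature. The `+ 1` counts the passenger themselves, so `relatives` is effectively the family size on board.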

In [None]:
train["relatives"] = train["Parch"] + train["SibSp"] + 1
test["relatives"] = test["Parch"] + test["SibSp"] + 1

In [None]:
fig = px.histogram(data_frame=train,
                   x="relatives",
                   y=None,
                   color='Survived',
                   width=500,
                   template="plotly_dark",
                  color_discrete_map={1:"#8072fa",0:"orange"})
fig.show()

<h2 style="font-size:35px; background-color:#8072fa; text-align:center; color:#fac472">Modeling</h2>

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

Label Encoding the categorical features

In [None]:
object_cols = ['Sex','Embarked']
for col in object_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(train[col])
    train[col] = label_encoder.transform(train[col])
    test[col] = label_encoder.transform(test[col])
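
`LabelEncoder` assigns integer codes in the sorted order of the labels it sees, which is worth knowing when reading the encoded columns later. A small sketch showing how to recover the mapping (the sample values mirror the Embarked ports):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ports = pd.Series(["S", "C", "Q", "S"])
encoder = LabelEncoder().fit(ports)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'C': 0, 'Q': 1, 'S': 2}
```

Note that these codes impose an arbitrary ordering on the ports; for a linear model, one-hot encoding is a common alternative.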

In [None]:
train.head()

In [None]:
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','relatives']
target = train['Survived'].values

> Logistic Regression model

In [None]:
lr = LogisticRegression()
lr.fit(train[features], target)
print("Logistic Regression ROC AUC score:", roc_auc_score(target, lr.predict_proba(train[features])[:,1]))
print('Logistic Regression Accuracy score:', accuracy_score(target, lr.predict(train[features])))

> Decision Tree model

In [None]:
dt = DecisionTreeClassifier(random_state = 42)
dt.fit(train[features], target)
print('Decision Tree ROC AUC score:', roc_auc_score(target, dt.predict_proba(train[features])[:,1]))
print('Decision Tree Accuracy score:', accuracy_score(target, dt.predict(train[features])))

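Without tuning, the Decision Tree overfits the training data. Note that the scores above are computed on the same data the models were fit on, so they measure fit rather than generalisation. Logistic Regression is therefore used for the final predictions.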
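
Scores computed on the training data cannot separate memorisation from generalisation, so a hold-out split is one way to check the overfitting claim. The sketch below is an illustration on synthetic data from `make_classification`, not on the competition data; the split size and noise level are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train[features] / target, with 20% label noise
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           flip_y=0.2, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
]

scores = {}
for name, model in models:
    model.fit(X_fit, y_fit)
    fit_acc = accuracy_score(y_fit, model.predict(X_fit))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    scores[name] = (fit_acc, val_acc)
    print(f"{name}: fit accuracy {fit_acc:.3f}, validation accuracy {val_acc:.3f}")
```

A large gap between the fit and validation accuracy is the signature of overfitting. The same pattern applies to `train[features]` and `target` by replacing `X` and `y`.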

In [None]:
sample_submission['Survived'] = lr.predict(test[features])
sample_submission.to_csv('submission.csv',index=False)
sample_submission.head()

### If you like the notebook, consider giving an upvote.
These are my other notebooks:
1. https://www.kaggle.com/namanmanchanda/cat-vs-dog-classifier-10-lines-of-code-fast-ai
2. https://www.kaggle.com/namanmanchanda/star-wars-classifier
3. https://www.kaggle.com/namanmanchanda/gradient-descent-101