___________
# Summary
#### The aim of this notebook is to make a first submission and get a baseline score to improve on. To do that, I only fill the null values and correct the datatypes, which includes one-hot encoding the categorical columns. Both steps are required because most ML models need input that is numerical and free of null values.
#### There is no additional EDA, feature engineering, or model optimization, as the aim is only a first submission.

<a id='content-table'></a>
## Table of Contents
1. [Loading data](#load)
2. [Combine Train and Test data](#tag2)
3. [Filling missing values](#tag3)
4. [Remove unnecessary columns](#tag4)
5. [Change datatypes if required](#tag5)
6. [Splitting into train/test set](#tag6)
7. [Training a simple model](#tag7)
8. [Making predictions on Test set](#tag8)
9. [Making your first submission](#tag9)

In [None]:
import numpy as np 
import pandas as pd 

<a id='load'></a>
## [Step - 1 : Loading data](#content-table)

In [None]:
import pandas as pd
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

print(train.shape, test.shape, submission.shape)
print(train.columns)                             #printing the column names
print(set(train.columns)-set(test.columns))      #printing the target column

## Print first 5 rows

In [None]:
train.head()

## Check the percentage of null values in Train and Test data

In [None]:
_1 = train.isnull().sum()/len(train)*100
_2 = test.isnull().sum()/len(test)*100    # divide by the length of test, not train

df = pd.concat([_1,_2], axis = 1)
df.columns = ['train', 'test']
df

We see that the same columns have null values in both datasets, and the percentage of missing values is about the same in each.

<a id='tag2'></a>
## [Step - 2 : Combine Train and Test data](#content-table)

In [None]:
test['Survived'] = -1
all_data = pd.concat([train, test])
print(all_data.head())
all_data.tail()
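
The `-1` marker is what makes it possible to separate the two datasets again later. Here is a minimal sketch of the idea on toy data; the frames and their values are made up for illustration, and `ignore_index=True` is an optional addition that the cell above does not use.

```python
import pandas as pd

# Toy frames standing in for the real train/test data (hypothetical values)
toy_train = pd.DataFrame({"PassengerId": [0, 1, 2],
                          "Age": [22.0, None, 35.0],
                          "Survived": [1, 0, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Age": [None, 40.0]})

# Mark test rows with a sentinel target so both frames share the same columns
toy_test["Survived"] = -1
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# The sentinel lets us recover each part without relying on row order
recovered_train = toy_all[toy_all["Survived"] != -1]
recovered_test = toy_all[toy_all["Survived"] == -1].drop(columns="Survived")

print(len(recovered_train), len(recovered_test))
```

This notebook splits by position with `iloc` instead, which also works as long as the row order is not changed after the concat.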

<a id='tag3'></a>
## [Step - 3 : Filling missing values](#content-table)

### Fill 'Age' and 'Fare' values with their mean value

In [None]:
for col in ['Age', 'Fare']:
    all_data[col] = all_data[col].fillna(all_data[col].mean())
    print(all_data[col].isnull().sum())

### Fill 'Embarked' and 'Ticket' values with their mode value

In [None]:
for col in ['Embarked', 'Ticket']:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0]) 
    print(all_data[col].isnull().sum())
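
To see what the mean and mode fills do, here is a small sketch on a toy frame. The values are invented for illustration; only the `fillna` pattern is the same as in the cells above.

```python
import pandas as pd

# Toy frame with one missing number and one missing category (hypothetical values)
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Embarked": ["S", None, "S"],
})

# mean() skips missing values, so the mean of 20 and 40 (30.0) fills the gap
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# mode() returns the most frequent value(s); [0] takes the first one, here "S"
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```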

### Filling Cabin values
Here 67% of the values are missing. Hence I will replace the column with an indicator: 1 if a value is present and 0 if it is missing.

In [None]:
col = 'Cabin'
all_data[col] = all_data[col].notnull().astype(int)
print(all_data[col].isnull().sum())
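
A quick sketch of the same indicator trick on a toy Series; the cabin labels are invented for illustration.

```python
import pandas as pd

# Two known cabins and two missing ones (hypothetical values)
cabin = pd.Series(["C85", None, "E46", None], name="Cabin")

# notnull() gives True/False, astype(int) turns that into 1/0
has_cabin = cabin.notnull().astype(int)

print(has_cabin.tolist())  # [1, 0, 1, 0]
```

The original cabin labels are lost, but the model still sees whether a cabin was recorded.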

### Verify there are no null values

In [None]:
all_data.isnull().sum()/len(all_data)*100

<a id='tag4'></a>
## [Step - 4 : Remove unnecessary columns](#content-table)

We will check the percentage of unique values in each categorical column.

In [None]:
# Taking only categorical columns
cols = [col for col in all_data.columns if all_data[col].dtype == 'object']
print(cols)

for col in cols:
    print(f"{col} : {all_data[col].nunique()/len(all_data)*100}")

The `'Name'` and `'Ticket'` columns have more than 87% and 66% unique values respectively. As they are, they give the model almost no information. EDA or feature engineering might give us some insight, but we are not doing that here.

In [None]:
all_data.drop(['Name', 'Ticket'], axis = 1, inplace = True)

<a id='tag5'></a>
## [Step - 5 : Change datatypes if required](#content-table)

### Compare each column's datatype with a sample value

In [None]:
df = pd.concat([all_data.iloc[0], all_data.dtypes], axis = 1)
df.columns = ['sample', 'dtype']
df

Here we see that the datatype of each sample value matches the datatype of its column. Hence there is no need to change any column's datatype.

### One-hot-encode categorical columns

In [None]:
# Check which categorical columns are left
cols = [col for col in all_data.columns if all_data[col].dtype == 'object']
cols

In [None]:
all_data = pd.get_dummies(all_data, drop_first = True)
all_data.head()
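
To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy frame with invented values. `pd.get_dummies` only encodes the text column and leaves the numeric one alone.

```python
import pandas as pd

# One categorical column and one numeric column (hypothetical values)
toy = pd.DataFrame({"Sex": ["male", "female", "male"], "Pclass": [1, 3, 2]})

# drop_first=True drops the first category ("female"), so a single
# 'Sex_male' column is enough to encode both values
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())  # ['Pclass', 'Sex_male']
```

Dropping one level per column avoids a redundant column, since its value can always be inferred from the remaining ones.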

Now our data is ready to be fed into a model. So we will split it into train/validation/test sets and train a basic model.

<a id='tag6'></a>
## [Step - 6 : Splitting into train/test set](#content-table)

### Split into train-test set

In [None]:
n_train = len(train)
train_modified = all_data.iloc[:n_train].copy()   # .copy() makes an independent df, so the in-place drops below do not raise pandas copy warnings
test_modified = all_data.iloc[n_train:].copy()

print(len(train_modified), len(test_modified))

In [None]:
train_modified.head()

In [None]:
# Removing 'PassengerId' column
train_modified.drop('PassengerId', axis = 1, inplace = True)

In [None]:
test_modified.head()

In [None]:
# Remove 'Survived' column from test data
test_modified.drop('Survived', axis = 1,inplace = True)

### Create a train-test split on training data

In [None]:
from sklearn.model_selection import train_test_split

X = train_modified.drop('Survived', axis = 1)
y = train_modified['Survived'].copy()

x_train, x_test, y_train, y_test = train_test_split(X, y.values, test_size = 0.25, random_state = 42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

In [None]:
x_test

<a id='tag7'></a>
## [Step - 7 : Training a simple model](#content-table)

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear', random_state = 42)

classifier.fit(x_train.values, y_train)

In [None]:
y_pred = classifier.predict(x_test.values)
accuracy = (y_pred == y_test).astype(int).sum()/len(y_test)*100
print(f"Model accuracy is : {accuracy: .3f} %")
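
The accuracy above is just the share of predictions that match the labels. As a small sketch on made-up arrays, the same number can be obtained more briefly with `.mean()`; `sklearn.metrics.accuracy_score` returns the same fraction (without the factor of 100) if you prefer a library call.

```python
import numpy as np

# Toy labels and predictions (hypothetical values): 3 of 4 match
y_true = np.array([1, 0, 1, 1])
y_hat = np.array([1, 0, 0, 1])

# The formula used in the cell above
manual = (y_hat == y_true).astype(int).sum() / len(y_true) * 100

# Equivalent shortcut: the mean of a boolean array is the fraction of True values
short = (y_hat == y_true).mean() * 100

print(manual, short)  # 75.0 75.0
```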

An accuracy of 76.612% is a good starting point. From here on we can improve.

<a id='tag8'></a>
## [Step - 8 : Making predictions on Test set](#content-table)

In [None]:
# Saving 'PassengerId' of test data and deleting it
test_idx = test_modified['PassengerId'].copy()

test_modified.drop('PassengerId', axis = 1, inplace = True)

print(test_modified.shape)

In [None]:
y_pred = classifier.predict(test_modified.values)
submission.loc[:, 'Survived'] = y_pred

In [None]:
submission

<a id='tag9'></a>
## [Step - 9 : Making your first submission](#content-table)

In [None]:
submission.to_csv('submission.csv', index = False)   # index = False keeps the row index out of the file, so it has only the expected columns

In [None]:
# Recheck that the file is in the correct format
pd.read_csv("submission.csv")
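
Beyond eyeballing the file, a few assertions can catch format mistakes before uploading. This is only a sketch: `check_submission` is a hypothetical helper, not part of pandas or Kaggle, and the toy frames stand in for the real `submission` and `sample_submission.csv`.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Hypothetical helper: compare a submission against the sample file."""
    assert list(sub.columns) == list(sample.columns), "column names or order differ"
    assert len(sub) == len(sample), "row count differs"
    assert sub["PassengerId"].equals(sample["PassengerId"]), "ids differ"
    assert sub["Survived"].isin([0, 1]).all(), "predictions must be 0 or 1"

# Toy stand-ins for the sample submission and our own predictions
sample = pd.DataFrame({"PassengerId": [100, 101], "Survived": [1, 1]})
sub = sample.copy()
sub["Survived"] = [0, 1]

check_submission(sub, sample)
print("submission looks valid")
```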

**Now you have made your first submission. From here on you can do many things to improve your accuracy. You can do EDA to get better insights into your data. Further, you can also do feature engineering, hyperparameter optimization, and ensembling of models.**

_______________