# Tabular Playground Series - April 2021

The aim of this notebook is to present simple data handling, data preparation and a logistic regression. 
This notebook does not contain an EDA; plenty of really nice EDA examples already exist for this competition. 


## Library Importation
Several libraries will be used during this project; let's import all of them. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

## Function Definition
To help me look at and understand the data, I usually rely on the functions below: 
* **var_meaning** : Prints the definition of a column name on demand, based on a dictionary I'll build at the beginning of the project.
* **missing_value** : Reports which variables contain missing values and how many. 

In [None]:
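def var_meaning(definitions, var):
    # 'definitions' maps a column name to its description
    # (renamed from 'dict' to avoid shadowing the built-in type)
    if var in definitions:
        print(definitions[var])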
    else:
        print('Variable not found')


def missing_value(df, column_names):
    for col in column_names:
        try:
            n_null = df[col].isnull().sum()  # count the nulls once per column
            if n_null >= 1:
                print(col, 'contains null values. Number of null values =', n_null)
                print('')
        except KeyError:
            print(col, 'not present')
            print('')
    print('The total number of null values = ', df.isnull().sum().sum())
    print('------------------------------------')
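
As a possible alternative to the loop above, the same information can be obtained in a vectorized way. The sketch below is only an illustration: the helper name `null_summary` and the toy DataFrame are assumptions, not part of the original notebook.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.Series:
    """Return the null count of every column that has at least one null."""
    counts = df.isnull().sum()   # one count per column
    return counts[counts > 0]    # keep only the columns with missing values


# Toy data standing in for the competition files (hypothetical values)
toy = pd.DataFrame({'Age': [22.0, None, 35.0],
                    'Fare': [7.25, 8.05, None],
                    'Pclass': [3, 1, 2]})

summary = null_summary(toy)
print(summary)
print('Total number of null values =', int(summary.sum()))
```

Returning a Series instead of printing makes it easy to compare the training and test sets side by side.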

## Data Preparation
### 1. Reading the data

Let's first read the CSV files provided for the competition. 

In [None]:
# Read the training data, the test data and the sample submission
training = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')
pd.set_option('display.max_columns', None)

As explained above, I am creating a dictionary of the variable definitions. 

In [None]:
# Variable definition
variable_definition = {'PassengerId': 'Unique Passenger Identifier',
                       'Survived': 'Binary Variable | 0 = No, 1 = Yes',
                       'Pclass': 'Ticket Class',
                       'Name': 'Name of the passenger',
                       'Sex': 'Sex of the passenger',
                       'Age': 'Age of the passenger',
                       'SibSp': 'Number of siblings / spouses aboard the boat',
                       'Parch': '# of parents / children aboard the boat',
                       'Ticket': 'Ticket Number',
                       'Fare': 'Passenger fare',
                       'Cabin': 'Cabin number',
                       'Embarked': 'Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton'
                       }

Using the function defined above, I can now look up the definition of any column whenever I need it. 
Example:

In [None]:
var_meaning(variable_definition, 'Pclass')

### 2. Do we have missing values to handle? 

In [None]:
missing_value(training, list(variable_definition.keys()))
missing_value(test, list(variable_definition.keys()))

The output above shows that several variables contain null values. Let's keep them in mind so that each case is handled before fitting the logistic regression. The variables with missing values are the same in the training and test sets: 

* Age
* Ticket
* Fare
* Cabin
* Embarked

### 3. Analysis variable per variable
#### 3.1. Pclass

Let's have a look at the first variable. Again, this notebook does not contain any EDA, to keep it short. However, I would definitely recommend starting with an exhaustive EDA. I did this step myself, and some decisions in this notebook are based on that earlier EDA. 

* **Missing Value** : No missing values in either the train or the test set. 
* **Outlier** : No outlier spotted
* Only three unique values | Class 1, Class 2, Class 3

The only thing I'll do for this variable is transform it into dummy variables. Even though the variable is stored as an integer, the number itself carries no numeric meaning and should be analysed as a categorical variable.


In [None]:
training['Pclass'].unique()  # Only three unique values.
training = training.join(pd.get_dummies(training['Pclass'], prefix='Pclass', drop_first=True))

test['Pclass'].unique()  # Only three unique values.
test = test.join(pd.get_dummies(test['Pclass'], prefix='Pclass', drop_first=True))
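
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy Series (the values are made up for illustration). The first category is dropped and acts as the reference level, so three classes produce only two dummy columns.

```python
import pandas as pd

# Toy ticket classes (hypothetical values)
pclass = pd.Series([1, 2, 3, 1], name='Pclass')

# Class 1 becomes the reference level and gets no column of its own
dummies = pd.get_dummies(pclass, prefix='Pclass', drop_first=True)
print(dummies)
```

A passenger in class 1 is therefore encoded as a row where both `Pclass_2` and `Pclass_3` are zero, which avoids a redundant, perfectly collinear column in the regression.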


#### 3.2. Name
* **Missing value** : No missing values in either the train or the test set

I haven't done anything with the Name variable. I didn't identify any class (Dr, Lady, etc.). This variable will simply be dropped at the end of the data preparation step. 

#### 3.3. Sex

* **Missing Value** : No missing values in either the train or the test set. 
* Only two unique values | Male, Female

The only thing I'll do for this variable is transform it into dummy variables.

In [None]:
training['Sex'].unique()  # Only two unique values.
test['Sex'].unique()  # Only two unique values.

training = training.join(pd.get_dummies(training['Sex'], prefix='Sex', drop_first=True))
test = test.join(pd.get_dummies(test['Sex'], prefix='Sex', drop_first=True))

#### 3.4. Age

* **Missing Value** : We do have missing values for Age in both the train and test sets. 
* **Outlier** :  No outlier spotted

Based on my EDA, I noticed that the age of a passenger varies with its Pclass. I'll therefore replace each missing value with the median age of the passengers in the same Pclass. 

Note that for this analysis I concatenated the test and training sets to get a dataset that is more representative of the entire population on board. 

In [None]:
df = pd.concat([training, test])
sns.displot(df, x='Age', kind='kde', hue='Pclass', fill=True)  # fill expects a boolean, not the string 'True'
plt.show()

map_age_pclass = df[['Age', 'Pclass']].dropna().groupby('Pclass').median().to_dict()

training['Age'] = training['Age'].mask(training['Age'].isna(), training['Pclass'].map(map_age_pclass['Age']))
test['Age'] = test['Age'].mask(test['Age'].isna(), test['Pclass'].map(map_age_pclass['Age']))
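
The same group-wise imputation can also be written with `groupby(...).transform('median')`. The sketch below uses a toy DataFrame with made-up values and a hypothetical `Age_filled` column; it is an alternative formulation, not the code used for the submission.

```python
import numpy as np
import pandas as pd

# Toy data: two classes, one missing age in each (hypothetical values)
toy = pd.DataFrame({'Pclass': [1, 1, 1, 3, 3, 3],
                    'Age': [40.0, 50.0, np.nan, 20.0, np.nan, 30.0]})

# Median age of each class, broadcast back to every row of that class
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Only the missing ages are replaced; observed ages are left untouched
toy['Age_filled'] = toy['Age'].fillna(class_median)
print(toy)
```

Here the medians come from the toy frame itself, whereas the notebook computes them on the concatenated train and test sets and then maps them onto each set.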

#### 3.5. SibSp & Parch

* **Missing Value** : No missing values 
* **Outlier** :  No outlier spotted

I've decided to treat these two together, as they are both linked to the family size. Two new features will be created: 
* **Size_Family** = SibSp + Parch
* **is_alone** = IF Size_Family = 0 THEN 1 ELSE 0

In [None]:
training['Size_Family'] = training['SibSp'] + training['Parch']
test['Size_Family'] = test['SibSp'] + test['Parch']

training['is_alone'] = np.where(training['Size_Family'] == 0, 1, 0)
test['is_alone'] = np.where(test['Size_Family'] == 0, 1, 0)

#### 3.6. Ticket

Nothing has been done with the Ticket variable. This variable will be dropped before starting the modelling phase. 

#### 3.7. Fare

* **Missing Value** : We have some missing values in both the train and test sets. 
* **Outlier** :  No outlier spotted

As with Age, the EDA showed that Fare varies with Pclass. I'll therefore replace each missing value with the median fare paid by the passengers in the same Pclass. 

In [None]:
map_fare_pclass = df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()

training['Fare'] = training['Fare'].mask(training['Fare'].isna(), training['Pclass'].map(map_fare_pclass['Fare']))
test['Fare'] = test['Fare'].mask(test['Fare'].isna(), test['Pclass'].map(map_fare_pclass['Fare']))

#### 3.8. Cabin

* **Missing Value** : We have some missing values in both the training and test sets.
* **Outlier** :  No outlier spotted

Two features will be created from the Cabin variable: 
* **Has_cabin** : IF Cabin IS NULL THEN 0 ELSE 1
* **Cabin_type** : First letter of the Cabin variable. 

The Has_cabin variable solves the missing value issue, while the Cabin_type variable might indicate where on the boat the passenger was staying. 
Cabin_type will then be transformed into dummy variables. 

In [None]:
training['Has_cabin'] = np.where(training['Cabin'].isnull(), 0, 1)
test['Has_cabin'] = np.where(test['Cabin'].isnull(), 0, 1)

training['Cabin_type'] = training['Cabin'].str[0]
test['Cabin_type'] = test['Cabin'].str[0]

training = training.join(pd.get_dummies(training['Cabin_type'], prefix='Cabin_type', drop_first=True))
test = test.join(pd.get_dummies(test['Cabin_type'], prefix='Cabin_type', drop_first=True))
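
One thing to watch when encoding the training and test sets separately: if a category is present in only one of them, the two sets end up with different dummy columns and the model cannot predict on the test set. The sketch below shows one possible safeguard using `reindex`. The toy frames and the names `train_toy`, `test_toy` and `test_aligned` are assumptions for illustration; whether the issue occurs in the competition data is not checked here.

```python
import pandas as pd

# Toy data: cabin type 'C' appears in train only (hypothetical values)
train_toy = pd.DataFrame({'Cabin_type': ['A', 'B', 'C', None]})
test_toy = pd.DataFrame({'Cabin_type': ['A', 'B', None, 'B']})

train_d = pd.get_dummies(train_toy['Cabin_type'], prefix='Cabin_type', drop_first=True)
test_d = pd.get_dummies(test_toy['Cabin_type'], prefix='Cabin_type', drop_first=True)

# Give the test dummies the same columns, in the same order, as the training
# dummies; columns missing from the test set are created and filled with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(test_aligned.columns))
```

Note that `get_dummies` ignores missing values by default, so a passenger without a cabin gets zeros in every `Cabin_type_*` column.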

#### 3.9. Embarked 

* **Missing Value** : We have some missing values in both the training and test sets.
* **Outlier** :  No outlier spotted

Based on my EDA, I noticed that most of the passengers embarked at Southampton. I'll assume that the passengers with a missing value also embarked at Southampton. 
Then I create dummy variables. 

In [None]:
# Fill only the rows where Embarked itself is missing
# (testing Cabin here would overwrite known ports with 'S')
training['Embarked'] = training['Embarked'].fillna('S')
test['Embarked'] = test['Embarked'].fillna('S')

training = training.join(pd.get_dummies(training['Embarked'], prefix='Embarked', drop_first=True))
test = test.join(pd.get_dummies(test['Embarked'], prefix='Embarked', drop_first=True))
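
The idea of filling missing ports with the most frequent one can be sketched without hardcoding the value, by reading it from the data with `mode()`. The Series below is a toy example with made-up values.

```python
import numpy as np
import pandas as pd

# Toy ports of embarkation with one missing value (hypothetical values)
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'], name='Embarked')

# mode() returns the most frequent value(s); take the first one
most_common = embarked.mode()[0]

# Replace only the missing entries; known ports are left as they are
filled = embarked.fillna(most_common)
print(most_common)
print(filled.tolist())
```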

#### 3.10. Dropping the unnecessary variables

From both the training and test sets, I drop the columns that have been transformed into dummy variables and the columns that I've decided not to use. 

In [None]:
# Drop the columns that were encoded as dummies or that are not used.
columns_to_drop = ['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Cabin_type']
training = training.drop(columns=columns_to_drop)
test = test.drop(columns=columns_to_drop)

### 4. Logistic Regression Modelling

My datasets are now ready. I'll fit a really simple logistic regression to see how it performs. 

The first step is to split my datasets into x_train, y_train and x_test. 

In [None]:
y_train = training['Survived']
x_train = training.drop('Survived', axis=1)
x_train = x_train.drop('PassengerId', axis=1)
x_test = test.drop('PassengerId', axis=1)

I can now use those to fit my logistic regression and save the predictions in my submission CSV file. 

In [None]:
lr = LogisticRegression(max_iter = 500)
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test).astype(int)
print('Mean =', y_pred.mean(), ' Std =', y_pred.std())

submission['Survived'] = y_pred
submission.to_csv("submission.csv", index=False)
print('Done')
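
The submission above is only scored once it is uploaded. A local accuracy estimate can be obtained beforehand with cross-validation. The sketch below is an illustration only: it uses synthetic data from `make_classification` in place of `x_train` and `y_train`, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train (not the competition data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

model = LogisticRegression(max_iter=500)

# Fit and score the model on 5 different train/validation splits
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Fold accuracies:', np.round(scores, 3))
print('Mean accuracy =', scores.mean())
```

With the real data, passing `x_train` and `y_train` instead of `X` and `y` should give a rough idea of the accuracy to expect before submitting.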

### 5. Result & next steps

With this really simple data processing and logistic regression, I reached an accuracy of 0.795, which placed me in the top 35%. 

To improve the score : 
* Additional feature creation
* Feature Selection 
* Additional Model testing
* Model finetuning