### This notebook uses only the training dataset. It is split into train and test subsets in order to evaluate the model's predictions. The key idea is to compare the known labels of the test subset with the machine learning (ML) model's predictions, as a way to check that the model generalizes.

### Step 1: Data set import

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
df_train = pd.read_csv("../input/titanic-machine-learning-from-disaster/train.csv")
df_train.head()

In [None]:
print("############################## DESCRIPTION OF DATASET (TRAIN) ##############################")
df_train.info()  # info() prints its report directly and returns None

### Step 3: Checking age distribution

In [None]:
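plt.figure(figsize = (11,7))
# distplot is deprecated in recent seaborn releases; histplot with kde = True is its replacement
sns.histplot(data = df_train, x = "Age", kde = True)
plt.xlabel("Age_train")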

In [None]:
print("Mean of age on training dataset: ", df_train["Age"].mean())
print("Median of age on training dataset: ", df_train["Age"].median())

### Step 4: Fill missing values of age with the median

In [None]:
median_train    = df_train["Age"].median()
df_train["Age"] = df_train["Age"].fillna(median_train)     
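As a quick sanity check on the imputation above, the sketch below applies the same median fill to a small made-up frame and confirms that no missing ages remain. The name `toy` and its values are illustrative assumptions, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_train["Age"]
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

missing_before = int(toy["Age"].isna().sum())  # number of NaN values before filling
median_age = toy["Age"].median()               # median() skips NaN by default
toy["Age"] = toy["Age"].fillna(median_age)
missing_after = int(toy["Age"].isna().sum())   # number of NaN values after filling

print("Missing before:", missing_before)
print("Median used:", median_age)
print("Missing after:", missing_after)
```

Running the same two `isna().sum()` checks on `df_train["Age"]` before and after Step 4 would show how many ages were actually imputed.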

### Step 5: Checking Target proportion

In [None]:
plt.figure(figsize = (12, 7))
sns.countplot(data = df_train, x = "Survived")

One notable point about this variable is that the classes are imbalanced: more passengers did not survive than survived. This imbalance might be a problem for an ML model, since it can bias predictions toward the majority class.
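The proportion of each class can also be quantified rather than read off the plot. A minimal sketch, assuming a made-up label series (`survived` below is illustrative, not the real `Survived` column):

```python
import pandas as pd

# Hypothetical labels standing in for df_train["Survived"]
# (0 = did not survive, 1 = survived)
survived = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Survived")

counts = survived.value_counts()                      # absolute count per class
proportions = survived.value_counts(normalize=True)   # share of each class

print(counts)
print(proportions)
```

On the notebook's data, `df_train["Survived"].value_counts(normalize=True)` would give the corresponding shares.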

### Step 5: Proportion of categories in each categorical variable

In [None]:
class_var = ["Pclass", "Sex", "Embarked", "SibSp", "Parch"]

n = 1
m = 5

fig, axes = plt.subplots(n, m, figsize = (18, 7))

# One count plot per categorical variable, each on its own axis
for var, ax in zip(class_var, axes):
    sns.countplot(data = df_train, x = var, ax = ax)


### Step 6: Evaluation of possible correlations between independent variables and Target

In [None]:
n = 1
m = 5

fig, axes = plt.subplots(n, m, figsize = (18, 7))
hue = ["Sex", "Pclass", "Embarked", "SibSp", "Parch"]

# Survival counts broken down by each categorical variable
for var, ax in zip(hue, axes):
    sns.countplot(x = "Survived", hue = var, data = df_train, ax = ax)

These plots are a good way to spot possible relationships between the independent variables and the target. The first one shows that sex is an important variable for distinguishing who survived and who did not. The gap between the number of men who did not survive and those who survived is remarkable, and the same holds, in the opposite direction, for women.
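The visual comparison can be backed by a number: because `Survived` is coded 0/1, its mean within a group is that group's survival rate. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical rows standing in for df_train[["Sex", "Survived"]]
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_sex)
```

The same `groupby` call with `"Pclass"` or `"Embarked"` in place of `"Sex"` would summarize the other panels.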

In [None]:
fig, axes = plt.subplots(1, 2, figsize = (15, 7))
columns = ["Age", "Fare"]

# Distribution of each numeric variable, split by survival
for col, ax in zip(columns, axes):
    sns.boxplot(x = "Survived", y = col, data = df_train, ax = ax)

Another important conclusion is that the typical age (the median line in the boxplots) is almost the same for passengers who survived and for those who did not.

### Step 7: Evaluation of possible correlations among independent variables

In [None]:
plt.figure(figsize = (15, 7))
sns.boxplot(x = "Pclass", y = "Fare", data = df_train)

### Step 8: Machine learning model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV

#-------------------------------- Categorical into numerical variable -------------------------------#
df_train["new_Sex"] = df_train["Sex"].map({"female": 0, "male": 1})

names = ["Pclass", "new_Sex", "Fare"]

#---------------------- Select features and target ---------------------#
new_df = df_train[names]
target = df_train["Survived"]


#--------------------Split dataset into train and test--------------------------#
x_train, x_test, y_train, y_test = train_test_split(new_df, target,
                                                   test_size= 0.3, random_state= 111)

#------------------------------ Model fit ------------------------------------#
RFC = RandomForestClassifier(n_estimators = 100, max_depth = 5, criterion = "entropy",
                            random_state = 10)


model = RFC.fit(x_train, y_train)


#-------------------- Prediction with train dataset -----------------------------#
prev_train = model.predict(x_train)

#-------------------- Prediction with test dataset -----------------------------#
prev_test = model.predict(x_test)

#---------------------- Train dataset accuracy ----------------------------#
accur_train = accuracy_score(y_train, prev_train)

#---------------------- Test dataset accuracy ----------------------------#
accur_test = accuracy_score(y_test, prev_test)

#---------------------- Train dataset confusion matrix ----------------------------#
matrix_train = confusion_matrix(y_train, prev_train)

#---------------------- Test dataset confusion matrix ----------------------------#
matrix_test = confusion_matrix(y_test, prev_test)


print("Accuracy in train dataset: ", accur_train)
print("###################### Confusion Matrix (train) #########################")
print(matrix_train)

print("")
print("Accuracy in test dataset: ", accur_test)
print("###################### Confusion Matrix (test) #########################")
print(matrix_test)

The metrics on the train and test datasets are very similar. This is a good indication that the model is generalizing, but there is still room for improvement.
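One possible way to look for better settings is a cross-validated grid search; `GridSearchCV` is already imported in the cell above but not used. The sketch below is only an illustration: the data comes from `make_classification` as a synthetic stand-in for the three features, and the grid values are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the three features; this is not the Titanic data
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=111)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=111)

# Small illustrative grid (assumed values)
param_grid = {"max_depth": [3, 5], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=10),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training part only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Hold-out accuracy:", search.score(X_te, y_te))
```

Because the search only sees the training part, the hold-out accuracy stays an independent check of the selected configuration.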

### Step 9: Balancing Target variable

In [None]:
from imblearn.over_sampling import SMOTEN

In [None]:
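# Note: SMOTEN is designed for nominal features, so it treats every column
# (including the continuous Fare) as categorical
oversample = SMOTEN()
x, y = oversample.fit_resample(new_df, target)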

plt.figure(figsize = (12, 7))
sns.countplot(x = y)

### Step 10: Machine learning model with balanced training data

In [None]:
#---------------------- Use the balanced data from SMOTEN ---------------------#
new_df = x
target = y


#--------------------Split dataset into train and test--------------------------#
x_train, x_test, y_train, y_test = train_test_split(new_df, target,
                                                   test_size= 0.3, random_state= 111)

#------------------------------ Model fit ------------------------------------#
RFC = RandomForestClassifier(criterion = "entropy", 
                             random_state = 0)

model = RFC.fit(x_train, y_train)


#-------------------- Prediction with train dataset -----------------------------#
prev_train = model.predict(x_train)

#-------------------- Prediction with test dataset -----------------------------#
prev_test = model.predict(x_test)

#---------------------- Train dataset accuracy ----------------------------#
accur_train = accuracy_score(y_train, prev_train)

#---------------------- Test dataset accuracy ----------------------------#
accur_test = accuracy_score(y_test, prev_test)

#---------------------- Train dataset confusion matrix ----------------------------#
matrix_train = confusion_matrix(y_train, prev_train)

#---------------------- Test dataset confusion matrix ----------------------------#
matrix_test = confusion_matrix(y_test, prev_test)


print("Accuracy in train dataset: ", accur_train)
print("###################### Confusion Matrix (train) #########################")
print(matrix_train)

print("")
print("Accuracy in test dataset: ", accur_test)
print("###################### Confusion Matrix (test) #########################")
print(matrix_test)
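In the cells above the oversampling is applied before the train/test split, so the test subset can contain synthetic rows, and rows generated from test passengers can end up in training. A common alternative is to split first and balance only the training part. The sketch below illustrates that order on made-up data; it swaps in plain random oversampling via `sklearn.utils.resample` instead of SMOTEN to stay dependency-free, and the `data` frame and its `feature` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 70 negatives, 30 positives
rng = np.random.default_rng(0)
data = pd.DataFrame({"feature": rng.normal(size=100),
                     "Survived": [0] * 70 + [1] * 30})

# 1) Split first, so the test subset contains only original rows
train, test = train_test_split(data, test_size=0.3,
                               stratify=data["Survived"], random_state=111)

# 2) Oversample the minority class inside the training part only
majority = train[train["Survived"] == 0]
minority = train[train["Survived"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=111)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Survived"].value_counts())
print(test["Survived"].value_counts())
```

With this order the test metrics are computed on untouched data with its original class proportions, which makes them easier to compare with the unbalanced run in Step 8.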