# Aim of the Notebook

**Hello everyone! I am entering the Titanic competition with a simple Random Forest Classifier. This is my first submission for this competition.**

**I will start with some simple visualization and analysis of the dataset, using the Seaborn, Plotly, and Folium libraries.**

**Next, I will clean the data and train a simple Random Forest Classifier to make the predictions.**

**Finally, I will analyze the model's results and submit the predictions to the competition.**

**I am open to feedback and suggestions, so feel free to leave a comment or contact me.**

**Thank you and let's get started!**

# Importing the Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import folium
from folium import plugins
from folium.plugins import HeatMap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Reading the Train and Test Datasets

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

In [None]:
# Let's check the train dataset

train.head()

In [None]:
# Let's check the test dataset

test.head()

# Checking for the Missing Elements

**As the output below shows, the Age column has 177 missing values, and most of the Cabin values (687 of 891) are missing as well.**

In [None]:
# Checking for the missing elements on the dataset

count_NaN = train.isna().sum()
count_NaN
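
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of that idea on a small made-up frame; the `toy` values and the `missing` table name are assumptions for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count of missing values and percentage of rows affected, per column
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first
missing = missing[missing["missing"] > 0].sort_values("missing", ascending=False)
print(missing)
```

The same two expressions could be applied to `train` in place of `toy`.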

# Feature Visualization of the Dataset

In [None]:
# Embark towns of the Titanic Passengers

plt.figure(figsize=(15,8))
splot = sns.countplot(data=train, x='Embarked')
plt.ylabel("Number of the Passengers", fontsize=12)
plt.xlabel("Embark Towns", fontsize=12)
plt.title("Embark Towns of the Titanic Passengers", fontsize=16)

for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   xytext=(0, 9),
                   textcoords='offset points')

In [None]:
# Age Distribution of the Titanic Passengers

plt.figure(figsize=(15,8))
sns.countplot(data=train, x='Age')
plt.ylabel("Number of the Passengers", fontsize=12)
plt.xticks(rotation=90)
plt.title("Age Distribution of the Titanic Passengers", fontsize=16)

In [None]:
# Gender Distribution of the Titanic Passengers

train['Sex'] = np.where(train['Sex'] == 'male', 'Male', 'Female')
plt.figure(figsize=(15, 8))
splot = sns.countplot(data=train, x='Sex')
plt.ylabel("Number of the Passengers", fontsize=12)
plt.xlabel("Genders", fontsize=12)
plt.title("Gender Distribution of the Titanic Passengers", fontsize=16)

for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   xytext=(0, 9),
                   textcoords='offset points')

In [None]:
# Survival Distribution of the Titanic Passengers

train['Survived'] = np.where(train['Survived'] == 1, 'Survived', 'Dead')
plt.figure(figsize=(15, 8))
splot = sns.countplot(data=train, x='Survived')
plt.ylabel("Number of the Passengers", fontsize=12)
plt.title("Survival Distribution of the Titanic Passengers", fontsize=16)

for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   xytext=(0, 9),
                   textcoords='offset points')

In [None]:
# Ticket Class Distribution of the Titanic Passengers
# Map the numeric class codes to labels in one step; assigning strings into
# an integer column with .loc is rejected or warned about by recent pandas versions
train['Pclass'] = train['Pclass'].map({1: 'First Class', 2: 'Second Class', 3: 'Third Class'})
plt.figure(figsize=(15, 8))
splot = sns.countplot(data=train, x='Pclass')
plt.ylabel("Number of the Passengers", fontsize=12)
plt.title("Class Distribution of the Titanic Passengers", fontsize=16)

for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   xytext=(0, 9),
                   textcoords='offset points')


# Heatmap of the Titanic Embark Towns

**In this part, I will show the embarkation towns of the Titanic passengers on a heatmap. According to the dataset, there are 3 different embarkation locations, as you can see below.**

**Most of the passengers embarked at Southampton; the other two towns, Cherbourg and Queenstown, each account for a much smaller share of the passengers.**

In [None]:
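count_towns = train.groupby('Embarked').size().reset_index(name='count')

# Port coordinates keyed by port code. groupby sorts the codes alphabetically
# (C, Q, S), so assigning coordinates by position would attach them to the wrong town.
port_coords = {
    'S': (50.897, -1.404),        # Southampton
    'C': (49.6423, -1.62551),     # Cherbourg
    'Q': (51.84914, -8.2975265),  # Queenstown
}
count_towns['latitude_embark'] = count_towns['Embarked'].map(lambda code: port_coords[code][0])
count_towns['longitude_embark'] = count_towns['Embarked'].map(lambda code: port_coords[code][1])

m = folium.Map([49.922935, -6.068136], zoom_start=6, width='100%', height='100%')

# Each heatmap entry is [latitude, longitude, weight]
heat_data = count_towns[['latitude_embark', 'longitude_embark', 'count']].values.tolist()
HeatMap(heat_data).add_to(m)
m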

# Plotly Sunburst Visualization

**As you can see from the visualization below, the Female survival rate is highest in the First Class. Unfortunately, the Female survival rate drops as the ticket class gets lower.**

**Third Class passengers make up the majority on the Titanic, and the numbers of Second and First Class passengers are close to each other.**

**(This visualization is interactive: you can click on a Class or Sex segment for more information, including how many passengers that segment covers.)**

In [None]:
train = pd.read_csv('../input/titanic/train.csv')

train['Survived'] = np.where(train['Survived'] == 1, 'Survived', 'Dead')
# Map the numeric class codes to labels in one step (avoids writing strings into an integer column)
train['Pclass'] = train['Pclass'].map({1: 'First Class', 2: 'Second Class', 3: 'Third Class'})
train['Sex'] = np.where(train['Sex'] == 'male', 'Male', 'Female')

fig = px.sunburst(data_frame=train, # Our dataset
                  path=["Pclass", "Sex", "Survived"],  # Root, Branches, Leaves
                  color="Pclass",
                  color_discrete_map={'First Class': 'rgb(246,207,113)',
                                      'Second Class': 'rgb(248,156,116)',
                                      'Third Class': 'rgb(102,197,204)'},  # Colours (could be changed easily)
                  maxdepth=-1,
                  branchvalues='total',
                  hover_name='Pclass',  # Hover name for chosen column
                  hover_data={'Pclass': False},
                  title='Visualization of the Titanic Dataset', template='ggplot2'# Title and the template 
                  )

fig.update_traces(textinfo='label+percent parent')
fig.update_layout(font=dict(size=16))
fig.show()

# Correlation Analysis of the Dataset

**The Sunburst plot makes the relation between Survived and both Class and Sex easy to see, but let's also check the correlation graph of the dataset.**

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
train['Sex'] = np.where(train['Sex'] == 'male', 1, 0) # 1 = Male and 0 = Female for this scenario

plt.figure(figsize=(15,8))
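# Correlate only the numeric columns; text columns such as Name or Ticket make corr() fail in recent pandas
heatmap = sns.heatmap(train.select_dtypes(include='number').corr(), vmin=-1, vmax=1, annot=True)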
heatmap.set_title('Correlation Graph of the Training Dataset', fontdict={'fontsize': 24})

**As the Correlation Graph of the Training Dataset shows, Survived has a modest negative correlation with Pclass: a larger class number (Third Class) goes with lower survival. This matches the Sunburst graph.**

**Survived and Sex also have a negative correlation. This follows from my encoding, '1 = Male and 0 = Female': higher Sex values (male) go with lower Survived values, so female passengers survived at a much higher rate. This agrees with the Sunburst plot above, where the survival rate for females in the First and Second classes is very high.**

**It is also clear that Parch and Survived have almost no linear relationship.**

**We can also observe that Pclass and Fare have a negative correlation. This makes sense too: as the Fare increases, the ticket class moves toward First Class, which is encoded as the smallest Pclass value (1).**

**This analysis helps me choose the features for my Random Forest Classifier.**
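
The sign of the Survived vs. Sex correlation depends only on which sex is encoded as 1. The following is a small sketch of that point on made-up data; the `toy` frame and the variable names are assumptions for illustration, not values from the competition dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 1 = survived, 0 = died
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Two opposite encodings of the same column
male_is_one = np.where(toy["Sex"] == "male", 1, 0)
female_is_one = 1 - male_is_one

# Pearson correlation of each encoding with Survived
corr_male = np.corrcoef(male_is_one, toy["Survived"])[0, 1]
corr_female = np.corrcoef(female_is_one, toy["Survived"])[0, 1]

print("male = 1:  ", corr_male)
print("female = 1:", corr_female)
```

Both encodings carry the same information, so the two coefficients have the same magnitude and opposite signs; only the reading of "positive" and "negative" changes.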

# Data Preprocessing

**In this part, I will turn the dataset into a better input for the Random Forest Classifier.**

In [None]:
train = pd.read_csv('../input/titanic/train.csv')

# Male = 1 and Female = 0 for numerical representation
train['Sex'] = np.where(train['Sex'] == 'male', 1, 0)


# Dealing with the missing Age values:
# NaN values are replaced with the median age of the passenger's sex.
# transform('median') returns one median per row, aligned with the original index,
# so there is no need to split the frame, fill in place on copies, and re-sort.
median_age_by_sex = train.groupby('Sex')['Age'].transform('median')
train['Age'] = train['Age'].fillna(median_age_by_sex)
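
Filling each missing Age with the median age of the same sex can be checked by hand on a tiny frame. This is a minimal sketch with made-up ages; `toy` and `group_median` are hypothetical names, and `groupby(...).transform('median')` is one possible way to express the per-sex fill.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 1 = male, 0 = female
toy = pd.DataFrame({
    "Sex": [1, 1, 1, 0, 0, 0],
    "Age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# One median per row, taken from the row's own Sex group (NaNs are ignored)
group_median = toy.groupby("Sex")["Age"].transform("median")

# Fill only the missing ages; known ages stay untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Here the male median is 30 (from 20 and 40) and the female median is 20 (from 10 and 30), so those are the values that end up in the two gaps.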

In [None]:
# Let's check the new form of the dataset. As you can see, the Sex column has 1 or 0 values now.
train.head()

In [None]:
# Replacing the NaN Embarked values with 'S' because it is by far the most common port
train['Embarked'] = train['Embarked'].fillna('S')

# Encoding the Embark locations. Now S = 1, C = 2 and Q = 3
train['Embarked'] = train['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
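
The S = 1, C = 2, Q = 3 encoding gives the ports a numeric order that has no real meaning. Tree-based models usually cope with this, but one-hot encoding is a common alternative. The following is only a sketch of that alternative, not what this notebook does; the `toy` frame is a made-up example.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.75, 8.05],
    "Embarked": ["S", "C", "Q", "S"],
})

# One 0/1 indicator column per port instead of a single ordered code
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Swap the original text column for the indicator columns
encoded = pd.concat([toy.drop(columns="Embarked"), one_hot], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1, so no port is treated as "larger" than another.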

In [None]:
# S = 1, C = 2 and Q = 3
# Let's check the new form of the dataset. As seen below, the Embarked column has 1, 2 or 3 values now.
train.head()

In [None]:
# Finally, I will use this form of the dataset as below. The PassengerId, Name, Ticket and Cabin columns will be dropped. I will not use these columns.

train = train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
train.head()

# Random Forest Classifier

In [None]:
# y are the values I want to predict
y = train['Survived']
X = train.drop('Survived', axis=1)
X

In [None]:
# Train and validation split (15% of the training data is held out for validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=13)

In [None]:
# Random Forest Classifier
rf = RandomForestClassifier(max_depth=10).fit(X_train, y_train)
predictions_rf = rf.predict(X_val)

# Let's Check the Accuracy

In [None]:
acc_rf = accuracy_score(y_val, predictions_rf)
print(f'Accuracy Random Forest: {100 * acc_rf:.2f}%')

**This model reached 85.82% accuracy on the validation split when I ran it. Your result may differ because the Random Forest is trained without a fixed random seed, so each run builds slightly different trees. I think this is not bad for a first attempt.**
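
If a repeatable score is preferred, the forest can be given a fixed `random_state`, and cross-validation gives a steadier estimate than a single split. The following is a sketch on synthetic data from `make_classification`, not on the Titanic features; the seed value 13, the helper `fit_predict`, and the 5-fold setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical, 7 features)
X_toy, y_toy = make_classification(n_samples=200, n_features=7, random_state=0)

def fit_predict(seed):
    """Train a forest with a fixed seed and return its predictions."""
    model = RandomForestClassifier(max_depth=10, random_state=seed)
    return model.fit(X_toy, y_toy).predict(X_toy)

# The same seed gives the same trees and therefore the same predictions
same_seed_match = np.array_equal(fit_predict(13), fit_predict(13))
print("Identical predictions with the same seed:", same_seed_match)

# 5-fold cross-validation: one accuracy value per fold
scores = cross_val_score(
    RandomForestClassifier(max_depth=10, random_state=13), X_toy, y_toy, cv=5
)
print("Mean CV accuracy:", scores.mean())
```

With the seed fixed, any remaining difference between runs comes from the data split rather than from the model.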

# Confusion Matrix

In [None]:
plt.figure(figsize=(15, 8))
conf_mat = confusion_matrix(y_true=y_val, y_pred=predictions_rf)
sns.heatmap(conf_mat, annot=True, fmt='g')
plt.title('Confusion Matrix of the Random Forest Classifier', fontsize=14)
plt.ylabel('Real Class', fontsize=12)
plt.xlabel('Predicted Class', fontsize=12)
plt.show()
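
The four cells of the confusion matrix are enough to compute accuracy, precision, and recall by hand. This is a minimal sketch with made-up labels, where 1 stands for "survived"; `y_true_toy` and `y_pred_toy` are hypothetical and are not the notebook's validation results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = survived, 0 = died
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Rows are the real classes, columns are the predicted classes;
# for binary labels ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision and recall show whether the errors lean toward false alarms or missed survivors, which accuracy alone hides.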

**As you can see above, the model is not bad for a first try. I believe I can improve this model next time, or try different classification models to get better results.**

**Thank you if you followed me this far!**

**I am open to your feedback and suggestions, so feel free to contact me!**