### Titanic Disaster 
Exploratory Data Analysis - Fundamentals 

EDA is usually the first stage of data mining. It lets us visualize the data to understand it and to create hypotheses for further analysis.

Exploratory analysis revolves around creating a "**prelude**" to the data. Its purpose is to show what the data actually contains, without bias, and to build the understanding that guides modeling and leads to **hypotheses**.

![](https://www.eyedocshoppe.com/images/T/xctmp4IlEPS.png)

Before we get into the problem proposed by the challenge, I will summarize the basics of Exploratory Data Analysis (EDA).

![](https://img.icons8.com/ios-filled/2x/learning.png)

Main stages of EDA. Some of its main aspects:

**Data requirements:**
It is important to understand what kind of data is needed to understand our problem, so that we can collect, select and store only the data we need, with as little noise as possible.

**Data Collection:**
> The data collected must be stored in the correct format. 

**Data processing:**
> Pre-processing means preparing the dataset before its actual analysis.
The most common tasks at this stage involve loading the dataset correctly and placing it in a more consistent structure.

**Data cleaning:**
>In this phase we check for duplicates, missing values and inaccuracies in the dataset, and apply the appropriate transformations.
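
The cleaning checks described above can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row and one missing value.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cid"],
    "age": [22.0, 30.0, 30.0, np.nan],
})

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
missing_per_column = toy.isnull().sum()      # missing values in each column

# Drop the repeated row; what to do with the missing age is a separate decision.
cleaned = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(missing_per_column)
print(cleaned)
```

Dropping duplicates is usually safe, whereas missing values may be better filled than dropped, depending on how many rows they affect.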



*Problem:* 
Before trying to extract insights, it is essential to understand the problem to be solved; in our case, it is predicting the survivors of the Titanic with the fundamentals of ML.

Libraries

In [None]:
import pandas as pd
import random
import numpy as np
import seaborn as sns
from matplotlib import rcParams
import matplotlib.pyplot as plt
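import re

# Change the style? See plt.style.available for the options.
# 'seaborn-muted' was renamed 'seaborn-v0_8-muted' in matplotlib 3.6.
style = 'seaborn-v0_8-muted' if 'seaborn-v0_8-muted' in plt.style.available else 'seaborn-muted'
plt.style.use(style)

def _shuffle(items):
  # Shuffle the list in place and return its first element (a random pick).
  random.shuffle(items)
  return items[0]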

def substrings_in_string(big_string, substrings):
    for substring in substrings:
        if big_string.find(substring) != -1:
            return substring
    return np.nan



Load & Check

In [None]:
dftrain = pd.read_csv('/kaggle/input/titanic/train.csv')
dftest = pd.read_csv('/kaggle/input/titanic/test.csv')
print("test :\t",dftest.shape)
print("train:\t",dftrain.shape)
#print("temp:\t",df.shape)

**First look at the data...**

* What types do we have here?
* How many records?
* How many columns?

![](https://pt-static.z-dn.net/files/dc1/f17c85f85405d14cbf4bb2bd8bebc29f.jpg)


In [None]:
print('Train')
dftrain.info()

In [None]:
print('Test')
dftest.info()

![](https://c4l.net/wp-content/uploads/2019/08/Facts-About-Learning-Differences-Icon-hover.png)
**Statistical data:** 
In this phase we get a sense of how our data is distributed.

* Knowing our data ...
* Count of values
* Number of unique values
* Most frequent value
* Frequency of the most frequent value
* Mean, standard deviation, minimum and maximum values
* Percentiles of the data: 25%, 50% and 75% by default
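
As a minimal sketch of where these statistics come from, the pandas `describe` method reports them for numeric and for text columns. The `toy` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Fare": [7.25, 71.28, 8.05, 13.00],
})

# Numeric column: count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = toy["Fare"].describe()

# Text column: count, unique, top (most frequent value), freq (its frequency).
categorical_summary = toy["Sex"].describe()

print(numeric_summary)
print(categorical_summary)
```

Calling `describe()` on a whole DataFrame with mixed types summarizes only the numeric columns by default, which is why the text column is described separately here.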

In [None]:
describe = dftrain.describe()
describe.index = describe.index.map(str.upper)
describe

In [None]:
dftrain.head()

* Are there columns with null values?
* Which columns?
![](https://cdn.iconscout.com/icon/premium/png-128-thumb/inference-1428533-1210761.png)

In [None]:
dfnulls = dftrain.isnull().sum()
dfnulls[dfnulls.values>0]

Age, Cabin and Embarked features need to be adjusted.

*Okay, but the question that keeps coming up is:*

After all, how is the distribution of people who survived?

> **Survived**:
> * **0** = No
> * **1** = Yes


In [None]:
values = dftrain.Survived.value_counts()
# subplots(1,0) asks for zero columns and raises an error; one axis is enough here
fig,ax=plt.subplots(1,1,figsize=(10,3))
values.plot.pie(ax=ax,shadow=True,startangle=180,explode=[0,0.1],autopct='%1.2f%%')
plt.show()

How is survival distributed across Pclass?

The PClass feature is a proxy for socioeconomic status (SES)

Did the money factor influence who survived?
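
One way to look at this question: because `Survived` is coded 0/1, its mean within each class is that class's survival rate. A minimal sketch, using made-up rows rather than the real data:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic set.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = share of ones = survival rate per class.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

Comparing rates rather than raw counts avoids being misled by classes that simply carried more passengers.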

In [None]:
# number of passengers in each (Survived, Pclass) combination
dftrain.groupby(['Survived','Pclass']).size()

In [None]:
classes = ['1st','2nd','3rd']
# sort_index() orders the counts by class so they line up with the labels
died =  dftrain[dftrain.Survived==0].Pclass.value_counts().sort_index()
survived =  dftrain[dftrain.Survived==1].Pclass.value_counts().sort_index()
fig,ax = plt.subplots(1,2,figsize=(6,5))
survived.plot.pie(ax=ax[0],labels=classes,autopct='%1.1f%%',explode=[0.3,0.05,0.05],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
ax[0].grid(True)

died.plot.pie(ax=ax[1],labels=classes,autopct='%1.1f%%',explode=[0.3,0.05,0.05],shadow=True)
ax[1].set_title('Died')
ax[1].set_ylabel('')
ax[1].grid(True)

plt.show()

It is evident that only 38.38% of the passengers in the training set survived the accident, while 61.62% did not. Curiously, most of the passengers who did not survive were travelling in 3rd class.

**What did I do in the Problem definition step?**

![](https://images.twinkl.co.uk/tw1n/image/private/t_create_thumb/create/library/Parent-and-Child-Doing-math-Activity---Thank-You-2020-Cards-Home-Learning-Classic-KS1.png)

I basically read the Data Description to get some insights, and I did; later on we will see other features of our datasets.

The columns in the Data Description that drew my attention: Sex, Name, Age




**Data Analysis**

**Discovery types**
First, we will identify the type of each feature and handle it accordingly.



![](https://learningboosters.com.au/wp-content/uploads/2019/05/hobarts-best-tutoring-service-learning-boosters-are-knowledgeable-experienced.png)

**Categorical** *(also known as Qualitative)*

This type of data represents the characteristics of an object. A variable that describes categorical data is called a categorical variable, and it can take one of a limited number of values.

This type of variable is easy to understand.

There are some types of categorical variables.

**Binary**: Can take exactly two values; also called a dichotomous variable.
>Example: Sex

**Polytomous**: Can take more than two possible values.
>Example: PClass, Embarked
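
A rough way to tell the two kinds apart is to count the distinct values in a column. The helper `categorical_kind` and the `toy` frame below are hypothetical names introduced for this sketch:

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

def categorical_kind(column):
    """Label a column from its number of distinct (non-missing) values."""
    n = column.nunique()  # missing values are not counted
    if n == 2:
        return "binary"
    if n > 2:
        return "polytomous"
    return "constant"

kinds = {name: categorical_kind(toy[name]) for name in toy.columns}
print(kinds)
```

This only counts values; whether a numeric-looking column such as Pclass should be treated as categorical is still a modeling decision.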



**Note** 
PClass: A proxy for socioeconomic status (SES)
>1st = Upper
>2nd = Middle
>3rd = Lower

SibSp: The dataset defines family relations in this way ...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way ...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, so parch = 0 for them.

Embarked:
>C = Cherbourg
>Q = Queenstown
>S = Southampton

In [None]:
cats = ['Pclass','Sex','SibSp','Parch','Embarked']
dftrain[cats].head()

In [None]:
fig,ax=plt.subplots(1,len(cats),figsize=(25,2))
colors =  ['blue','green','gray','gold','yellow', 'orange', 'red','purple','indigo','violet']
survived = dftrain[dftrain.Survived==1].Survived.value_counts()
for i,c in enumerate(cats):
  v = dftrain[c].value_counts() 
  ax[i].bar(v.index,v,color =_shuffle(colors),edgecolor='black')
  ax[i].set_title(c.upper())
  ax[i].set_ylabel('')
  ax[i].grid(True)
plt.show()

**What about the passengers?**
For the training set, the outcome (also known as the ground truth) is provided for each passenger.


The model will be based on "features" such as the gender and class of the passengers.
* We can also use **[feature engineering](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)** to create new features.


#### **Name** Feature

In [None]:
dftrain['Name'].head()

![](https://images.twinkl.co.uk/tw1n/image/private/t_create_thumb/create/library/Spinner-with-Pencil-and-Paperclip---PlanIt-Y1-Addition-and-Subtraction-Home-Learning-Tasks---KS1.png)

Basic Feature Engineering with the Titanic Data


The Name column is currently not being used, but we can at least extract the title from each name. There are quite a few titles going around, so I want to reduce them to a few main ones such as Mrs, Miss, Mr and Master. To do this we need to search each name for its title; here this is done with the pandas `str.extract` method and a regular expression. The idea comes from:
https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/

Once the titles are extracted, I recombine the rarer ones into the main categories.
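
A minimal sketch of the extraction on a few made-up names (the names are illustrative, not rows from the dataset). The regular expression captures the run of letters that comes right before the first period:

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mlle. Jane",
    "Poe, Master. Tim",
    "Loe, Mme. Ann",
])

# expand=False returns a Series because there is a single capture group.
title = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Fold rarer titles into a main category (same mapping idea as in this notebook).
title = title.replace({"Mlle": "Miss", "Mme": "Miss"})
print(title.tolist())
```

Using a raw string (`r"..."`) keeps the backslash in `\.` from being treated as a Python escape sequence.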

In [None]:
#title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev','Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess','Don', 'Jonkheer']
# temporary join: tag each row with its origin so the two sets can be split again later
dftrain['fork']=1
dftest['fork']=2
dftest['Survived'] = 2  # placeholder label for the test rows
dftrain = pd.concat([dftrain, dftest], ignore_index=True, sort=False)

# the title is the word right before the first period in the name
dftrain['Title']=dftrain.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
dftrain['Title']=dftrain['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])

#dftrain['Title']=dftrain['Name'].map(lambda x: substrings_in_string(x, title_list))
#dftrain['Title']=dftrain.apply(replace_titles, axis=1)
dftrain.Title = dftrain.Title.str.strip()
dftrain.Title = dftrain.Title.str.upper()

dftrain.Title.unique()


In [None]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS) 
names = ' a '.join(dftrain.Title.values)
wordcloud2 = WordCloud(background_color='white', stopwords=stopwords,min_font_size=10).generate(names)
plt.figure(figsize = (20, 4), facecolor=None) 
plt.imshow(wordcloud2)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

In [None]:
print(len(wordcloud2.words_))
wordcloud2.words_

In [None]:
titles = dftrain.Title.value_counts()
fig,ax = plt.subplots(1,1,figsize=(6,5))
titles.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.3,0.1,0.1,0.1,0.1,.1])
ax.grid(True)
plt.show()

Chance to survive for each title

In [None]:
# use the training rows only: the test rows carry the placeholder Survived = 2
chance = dftrain[(dftrain['fork'] ==1)].groupby("Title")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(6,5))
chance.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.05]*len(chance))
ax.grid(True)
plt.show()

In [None]:
print('Survivors vs Title')
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Survived, dftrain[(dftrain['fork'] ==1)].Title)

#### **Cabin**
This is going to be very similar. The ‘Cabin’ column is not doing much: a cabin is recorded mainly for 1st class passengers, and the rest are marked ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, so we are going to extract it just like the titles.


**Turning cabin number into Deck**
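
As a simplified sketch of the idea, the deck can be read from the first character of the cabin string. This is an assumption made for illustration: the notebook's own cells use a substring search helper instead, and the cabin values below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including a missing one and a multi-cabin entry.
cabin = pd.Series(["C123", np.nan, "E46", "B96 B98"])

# Missing cabins become 'Unknown'; otherwise keep the leading deck letter.
deck = cabin.fillna("Unknown").map(lambda c: c if c == "Unknown" else c[0])
print(deck.tolist())
```

Taking the first character treats an entry with several cabins as belonging to the deck of the first one listed.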

In [None]:
dftrain['Cabin'] = dftrain['Cabin'].fillna('Unknown')
dftrain['Cabin'].head()

In [None]:
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
dftrain['Deck']=dftrain['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))

In [None]:
deck = dftrain[(dftrain['fork'] ==1)].Deck.value_counts()
fig,ax = plt.subplots(1,1,figsize=(10,3))
deck.plot.bar(ax=ax)#,shadow=True,explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1])
ax.grid(True)
plt.show()


Chance to survive for each deck

In [None]:
chance = dftrain[(dftrain['fork'] ==1)].groupby("Deck")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(6,3))
chance.plot.bar(ax=ax)
ax.grid(True)
plt.show()

In [None]:
print('Survivors vs Deck')
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Survived, dftrain[(dftrain['fork'] ==1)].Deck)

#### **Embarked**
Port of Embarkation

In [None]:
# fill the few missing ports with 'S' (Southampton)
dftrain['Embarked'] = dftrain.Embarked.fillna('S')
print('Survivors vs Embarked')
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Survived, dftrain[(dftrain['fork'] ==1)].Embarked)

In [None]:
embarked = dftrain[(dftrain['fork'] ==1)].Embarked.value_counts()
fig,ax = plt.subplots(1,1,figsize=(6,5))
embarked.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.2,0.1,0.1])
ax.grid(True)
plt.show()

Chance to survive for each port of embarkation

In [None]:
chance = dftrain[(dftrain['fork'] ==1)].groupby("Embarked")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(6,5))
chance.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.3,0.05,0.05])
ax.grid(True)
plt.show()

#### **SibSp**
Number of siblings / spouses aboard the Titanic

In [None]:
print('Missing SibSp values:', dftrain.SibSp.isna().sum())
print('Survivors vs SibSp')
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Survived, dftrain[(dftrain['fork'] ==1)].SibSp)

In [None]:
sibSp = dftrain.SibSp[(dftrain['fork'] ==1)].value_counts()
fig,ax = plt.subplots(1,1,figsize=(10,3))
sibSp.plot.bar(ax=ax)
ax.grid(True)
plt.show()

In [None]:

chance = dftrain[(dftrain['fork'] ==1)].groupby("SibSp")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(10,4))
chance.plot.bar(ax=ax,color='g')
ax.axhline(y=chance.mean(),linewidth=5,color='r' )  # horizontal line at the mean of the per-group survival rates
ax.grid(True)
plt.show()

Apparently, with 1 or 2 siblings / spouses the chance of surviving is greater.

#### **Parch**
Number of parents / children aboard the Titanic

In [None]:
print('Missing Parch values:', dftrain.Parch.isna().sum())
print('Survivors vs Parch')
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Survived, dftrain[(dftrain['fork'] ==1)].Parch)

In [None]:
chance = dftrain[(dftrain['fork'] ==1)].groupby("Parch")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(10,4))
chance.plot.bar(ax=ax,color='g')
ax.axhline(y=chance.mean(),linewidth=5,color='r' )  # horizontal line at the mean of the per-group survival rates
ax.grid(True)
plt.show()

#### **PClass**
socioeconomic status

In [None]:
print('Missing Pclass values:', dftrain.Pclass.isna().sum())
print('Survivors vs Pclass')
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Survived, dftrain[(dftrain['fork'] ==1)].Pclass)

In [None]:
pclass = dftrain[(dftrain['fork'] ==1)].Pclass.value_counts()
fig,ax = plt.subplots(1,1,figsize=(6,5))
pclass.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.2,0.1,0.1])
ax.grid(True)
plt.show()

In [None]:
chance = dftrain[(dftrain['fork'] ==1)].groupby("Pclass")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(6,5))
chance.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.2,0.05,0.05])
ax.grid(True)
plt.show()

**Numeric** *(also known as Quantitative)*

These are data with a sense of measurement involved. They can be continuous or discrete.

> * **Discrete:** Count data; they take a countable number of distinct values.
>> Example: SibSp, this feature tells us if the passenger is alone or in a family.
>>> * sibsp: The dataset defines family relations in this way ...
>>> * Sibling = brother, sister, stepbrother, stepsister
>>> * Spouse = husband, wife (mistresses and fiancés were ignored)

> * **Continuous:** Data that can take any value within a range.
>> * Example: Fare (passenger fare)


In [None]:
cats = ['Age','Fare']
fig,ax=plt.subplots(1,len(cats),figsize=(25,2))
colors =  ['blue','green','gray','gold','yellow', 'orange', 'red','purple','indigo','violet']
# parentheses are needed: & binds more tightly than ==
survived = dftrain[(dftrain.Survived==1) & (dftrain['fork'] ==1)].Survived.value_counts()
for i,c in enumerate(cats):
  v = dftrain[c].value_counts() 
  ax[i].bar(v.index,v,color =_shuffle(colors),edgecolor='black')
  ax[i].set_title(c.upper())
  ax[i].set_ylabel('')
  ax[i].grid(True)
plt.show()

#### **Fare**
Passenger fare	

In [None]:
dftrain[(dftrain['fork'] ==1)].Fare.describe()


We notice something very interesting here: 15 passengers have a fare of 0.

In [None]:
plt.figure(figsize = (9,3))
plt.hist(dftrain[(dftrain['fork'] ==1)]["Fare"], bins = 40)
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()

#### **Age**
Age in years

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
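
Following that description, infants and estimated ages can be flagged with two simple conditions. This is a sketch on made-up ages, and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical ages: an infant, two estimated ages (xx.5) and two exact ages.
age = pd.Series([0.75, 22.0, 28.5, 40.5, 35.0])

is_infant = age < 1                            # fractional ages below one year
is_estimated = (age >= 1) & (age % 1 == 0.5)   # xx.5 marks an estimated age

print(is_infant.tolist())
print(is_estimated.tolist())
```

The `age >= 1` condition keeps an infant recorded as 0.5 from being counted as an estimate.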

In [None]:
dftrain.Age.isnull().sum()

In [None]:
dftrain.Age[(dftrain['fork'] ==1)].describe()

In [None]:
plt.figure(figsize = (9,3))
plt.hist(dftrain[(dftrain['fork'] ==1)]["Age"], bins = 40)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

**Data Cleaning & Correlation between the features**

#### **Age**
Adjusting and creating age range
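
To illustrate how the age ranges are built, here is a small sketch of `pd.cut` with the same bin edges used in this notebook, applied to made-up ages. With `right=False` each bin includes its left edge and excludes its right edge:

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near bin edges.
ages = pd.Series([0.5, 15, 24.9, 25, 80])
bins = [0, 15, 25, 35, 45, 55, 65, 81]

# right=False: an age of exactly 15 goes to the 15-25 bin, not the 0-15 bin.
age_range = pd.cut(ages, bins, right=False)

# Position of each age's bin among the seven bins (0 = youngest).
bin_index = age_range.cat.codes.tolist()
print(age_range)
print(bin_index)
```

The last edge is 81 rather than 80 so that an 80-year-old still falls inside the final bin.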

In [None]:
dftrain.Age.isna().sum()

In [None]:
cols = dftrain.Title.value_counts().index
for c in  cols:
  print(len(dftrain[(dftrain.Age.isnull()) & (dftrain.Title==c)]))
  print(dftrain[(dftrain.Title==c)].Age.median(),dftrain[(dftrain.Title==c)].Age.mean())
  dftrain.loc[(dftrain.Age.isnull()) & (dftrain.Title==c),"Age"] = dftrain[(dftrain.Title==c)].Age.mean() 

In [None]:
dftrain["Age_Range"] = pd.cut(dftrain.Age, [0,15,25,35,45,55,65,81],  right=False)
dftrain["Age_Range"].head()

In [None]:
pd.crosstab(dftrain[(dftrain['fork'] ==1)].Age_Range, dftrain[(dftrain['fork'] ==1)].Survived).style.background_gradient(cmap='gray')


In [None]:
chance = dftrain[(dftrain['fork'] ==1)].groupby("Age_Range")["Survived"].mean()
fig,ax = plt.subplots(1,1,figsize=(6,5))
chance.plot.pie(ax=ax,autopct='%1.1f%%',shadow=True,explode=[0.3,0.1,0.1,0.1,0.1,0.1,0.1])
ax.grid(True)
plt.show()

It seems that children had a **better chance of survival**! Phew! :)

In [None]:
dftrain[(dftrain['fork'] ==1)]['Age_Range'].value_counts().to_frame().style.background_gradient(cmap='gray')

In [None]:
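# factorplot was renamed catplot in newer seaborn versions; kind='point' keeps the same plot type
sns.catplot(x='Age_Range',y='Survived',data=dftrain[(dftrain['fork'] ==1)],col='Pclass',kind='point')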
plt.show()

**The survival rate tends to decrease with increasing age, regardless of passenger class.**

In [None]:
dftrain.Age = dftrain.Age_Range
dftrain.drop('Age_Range', axis=1, inplace=True)
dftrain.head(1)

#### **Embarked**

In [None]:
f,ax=plt.subplots(2,2,figsize=(20,5))
# pass the column as x=...: recent seaborn versions no longer accept it positionally
sns.countplot(x='Embarked',data=dftrain[(dftrain['fork'] ==1)],ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Sex',data=dftrain[(dftrain['fork'] ==1)],ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot(x='Embarked',hue='Survived',data=dftrain[(dftrain['fork'] ==1)],ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot(x='Embarked',hue='Pclass',data=dftrain[(dftrain['fork'] ==1)],ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

#### **Correlation**

Creating new family_size column
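
A small sketch of the two derived features on made-up rows: family size as the sum of the two relative counts, and fare per person with the passenger included in the divisor. The notebook's own cells additionally relabel a family size of 0 as 1, which this sketch does not do:

```python
import pandas as pd

# Hypothetical passengers: a couple, a solo traveller and a large family.
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Fare":  [20.0, 8.0, 60.0],
})

# Relatives aboard = siblings/spouses + parents/children.
toy["Family_Size"] = toy["SibSp"] + toy["Parch"]

# The +1 counts the passenger themselves, so a solo traveller divides by 1.
toy["Fare_Per_Person"] = toy["Fare"] / (toy["Family_Size"] + 1)
print(toy)
```

Dividing by `Family_Size + 1` also avoids a division by zero for passengers travelling alone.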

In [None]:
#https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/
dftrain['Family_Size']=dftrain['SibSp']+dftrain['Parch']
dftrain.head(2)

In [None]:
dftrain.drop('Cabin', axis=1, inplace=True)

In [None]:
dftrain.loc[dftrain['Family_Size'] == 0, 'Family_Size'] = 1
#dftrain[dftrain['Family_Size']==0]['Family_Size'] =1
dftrain['Family_Size'].value_counts()

In [None]:
dftrain.drop('Name', axis=1, inplace=True)

In [None]:
dftrain['Fare_Per_Person']=dftrain['Fare']/(dftrain['Family_Size']+1)
dftrain['Fare_Per_Person'].value_counts()

In [None]:
dftrain['Fare_Person_Range']=pd.qcut(dftrain['Fare_Per_Person'],5)
dftrain['Fare_Person_Range'].value_counts()

In [None]:
dftrain.drop('Fare_Per_Person', axis=1, inplace=True)

In [None]:
dftrain['Fare_Range']=pd.qcut(dftrain['Fare'],5)
dftrain[(dftrain['fork'] ==1)]['Fare_Range'].value_counts()

In [None]:
dftrain.drop('Fare', axis=1, inplace=True)

In [None]:
dftrain["Fare"] = dftrain.Fare_Range
dftrain.drop('Fare_Range', axis=1, inplace=True)


In [None]:
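# assign the result back instead of calling replace(..., inplace=True) on a column selection
dftrain['Sex'] = dftrain['Sex'].replace(['male','female'],[0,1])
dftrain['Embarked'] = dftrain['Embarked'].replace(['S','C','Q'],[0,1,2])
# titles not listed here (e.g. OTHER) keep their string label
dftrain['Title'] = dftrain['Title'].replace(['MR', 'MRS', 'MISS', 'MASTER'],[0,1,2,3])
dftrain['Deck'] = dftrain['Deck'].replace(['Unknown', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'],[0,1,2,3,4,5,6,7,8])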

In [None]:
dftrain.drop(['Ticket','PassengerId'],axis=1,inplace=True)

In [None]:
dftrain.head(3)

In [None]:
c = dftrain[(dftrain['fork'] ==1)].copy()
c.drop('fork', axis=1, inplace=True)
# numeric_only=True skips the non-numeric columns (interval ranges, leftover string labels)
sns.heatmap(c.corr(numeric_only=True),annot=True,linewidths=2.2,annot_kws={'size':12})
fig=plt.gcf()
fig.set_size_inches(18,6)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [None]:
dftrain.Pclass = dftrain.Pclass.astype("category")
dftrain.Sex = dftrain.Sex.astype("category")
dftrain.Age = dftrain.Age.astype("category")
dftrain.SibSp = dftrain.SibSp.astype("category")
dftrain.Parch = dftrain.Parch.astype("category")
dftrain.Embarked = dftrain.Embarked.astype("category")
dftrain.Title = dftrain.Title.astype("category")
dftrain.Fare = dftrain.Fare.astype("category")
dftrain.Deck = dftrain.Deck.astype("category")
dftrain.Family_Size = dftrain.Family_Size.astype("category")

In [None]:
dftrain = pd.get_dummies(dftrain, columns= ['Family_Size'])
dftrain.head()

In [None]:
dftrain = pd.get_dummies(dftrain, columns= ['Pclass','Sex','Age','SibSp','Parch','Embarked','Title','Fare','Deck'])
dftrain.head()

In [None]:
dftrain.drop(['Fare_Person_Range'],axis=1,inplace=True)
dftrain.head()

**Modeling**

In [None]:
test = dftrain[(dftrain['fork'] ==2)].copy()
train = dftrain[(dftrain['fork'] ==1)].copy()
train.drop(['fork'],axis=1,inplace=True)
test.drop(['fork'],axis=1,inplace=True)
test.drop(['Survived'],axis=1,inplace=True)

x = train.drop(labels = "Survived", axis = 1).values
y = train.Survived.values

x.shape,test.shape,train.shape


#### **XGBoost**
If things don’t go your way in predictive modeling, use XGBoost.

The XGBoost algorithm has become the ultimate weapon of many data scientists.

It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities in data.

In [None]:
import sklearn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score,accuracy_score
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from xgboost import XGBClassifier

#xgb = XGBClassifier(eta =0.0851,max_depth=10,gamma=0.6,alpha=0.5,scale_pos_weight =5,eval_metric='rmse')
xgb = XGBClassifier(n_estimators=360, max_depth=200, learning_rate=0.1)
scaler = MinMaxScaler()
xscaled = scaler.fit_transform(x)
testscaled = scaler.transform(test.values)
xtrain, xtest, ytrain, ytest = train_test_split(xscaled,y, test_size=0.35,stratify=y)


In [None]:
xgb.fit(xtrain,ytrain)
acc_train = round(xgb.score(xtrain, ytrain)*100,2) 
acc_test = round(xgb.score(xtest,ytest)*100,2)
print("Train:\t{}% ".format(acc_train))
print("Test :\t{}% ".format(acc_test))

In [None]:
predictions = xgb.predict(xtest)

#### **Compute the precision**
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.
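
The ratio can be checked by hand on a few made-up labels. This sketch uses plain Python and hypothetical label lists rather than the model's predictions:

```python
# Hypothetical true labels and predictions for six passengers.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

# True positives: predicted 1 and actually 1.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
# False positives: predicted 1 but actually 0.
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

precision = tp / (tp + fp)
print(tp, fp, precision)
```

Note that the false negative (the fourth passenger) does not enter the formula; precision only looks at the samples predicted as positive.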

In [None]:
# scikit-learn expects the true labels first: precision_score(y_true, y_pred)
print('precision_score:\t',precision_score(ytest,predictions, average='macro'))

**Accuracy classification score.**

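Accuracy is the fraction of predictions that match the true labels. In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true. Our problem is binary, so it is simply the share of passengers classified correctly.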

In [None]:
print('accuracy_score:\t',accuracy_score(ytest,predictions))

#### Submission

In [None]:
predictions.shape, dftest.shape

In [None]:
predictions = xgb.predict(testscaled)
test_survived = pd.Series(predictions, name = "Survived").astype(int)
submit = pd.concat([dftest.PassengerId, test_survived],axis = 1)
submit.shape 
submit.Survived.value_counts().plot.pie()
submit.to_csv("titanic.csv", index = False)

*Thanks for the support, you guys are great!*

References
* @ash316 - [EDA To Prediction(DieTanic)](https://www.kaggle.com/ash316/eda-to-prediction-dietanic)
* @kanncaa1 - [DataiTeam Titanic EDA ](https://www.kaggle.com/kanncaa1/dataiteam-titanic-eda)
* [Exploratory Data Analysis](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)
* [Why EDA is necessary](https://medium.com/@srimalashish/why-eda-is-necessary-for-machine-learning-233b6e4d5083)
