We analyse the Age, Gender, and Embarked port of the passengers.

# Necessary Information
Survived: Survival (0 = No; 1 = Yes)

Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

Name : Name

Sex : Sex

Age : Age

SibSp : Number of Siblings/Spouses Aboard

Parch : Number of Parents/Children Aboard

Ticket : Ticket Number

Fare : Passenger Fare

Cabin : Cabin

Embarked : Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

# **Importing necessary libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score
from sklearn.metrics import precision_recall_curve

# The plot_* helper functions were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay

# Read Dataset

In [None]:
titanic = pd.read_csv("../input/titanic/train.csv")
titanic_test = pd.read_csv("../input/titanic/test.csv")

In [None]:
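# numeric_only=True skips text columns such as Name and Ticket, which corr() cannot handle
pd.DataFrame([titanic.corr(numeric_only=True)['Survived'].sort_values()])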

In [None]:
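# Font used for the axis labels
font = {'family': 'serif',
        'color':  'black',   # visible on the white background of the default style
        'weight': 'normal',
        'size': 11,
        }

plt.style.use('default')

fig, ax = plt.subplots()

X = titanic['Fare']
y = titanic['Survived']

ax.set_xlabel('Fare', fontdict=font)
ax.set_ylabel('Survived', fontdict=font)

# Red, semi-transparent markers so overlapping points remain visible
ax.plot(X, y, 'ro', alpha=0.5)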

In [None]:
combined = [titanic, titanic_test]

In [None]:
titanic.head()

In [None]:
titanic.describe()

In [None]:
titanic_test.head()


### As we found:
* About 38% of the passengers survived.
* The mean age was around 30 years.
* The mean fare was around 32.
* On average, a passenger travelled with about 0.5 siblings or spouses; the median is 0, so most passengers had none aboard.



### Removing unhelpful columns

In [None]:
titanic["Cabin"].value_counts()



It looks like the cabin numbers are too spread out to be used in the model, but let's look at the value counts above before making the decision.

At most 4 rows share the same cabin, so the column doesn't help much and it's better to drop it.
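
The same check can be sketched on a small made-up frame. The `toy` data and the variable names below are hypothetical and are not taken from the Titanic file; the idea is to count the missing values, count the distinct cabins, and look at the largest group before dropping the column.

```python
import pandas as pd

# Hypothetical toy data standing in for the Cabin column
toy = pd.DataFrame({
    "Cabin": ["C85", None, "C123", None, "E46", "C123", None, "G6"],
    "Fare": [71.28, 7.25, 53.10, 8.05, 51.86, 53.10, 8.46, 16.70],
})

n_missing = toy["Cabin"].isnull().sum()            # rows with no cabin recorded
n_unique = toy["Cabin"].nunique()                  # distinct cabins (NaN excluded)
largest_group = toy["Cabin"].value_counts().max()  # most rows sharing one cabin

print(n_missing, n_unique, largest_group)

# Drop the column once it is judged too sparse to be useful
toy_reduced = toy.drop(columns=["Cabin"])
```

When the number of distinct values is close to the number of non-missing rows, the column behaves almost like an identifier and carries little signal for a model.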



In [None]:
titanic["Fare"].value_counts()

In [None]:
len(titanic[titanic['Sex'].isnull()])

In [None]:
titanic["Fare"].describe()

In [None]:
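# distplot is deprecated in recent seaborn; histplot with kde=True gives a comparable view
sns.histplot(titanic["Fare"], kde=True)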

As we can see, the "Fare" distribution is heavily right-skewed, so it may be better to use its logarithm.
Clearly, there is no need to remove the "Fare" column.
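
A minimal sketch of such a log transform, assuming `np.log1p` is an acceptable choice (the notebook does not say which form of the logarithm it has in mind). The fares below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical fares, including a zero and one large outlier
fares = pd.Series([0.0, 7.25, 8.05, 26.0, 71.28, 512.33], name="Fare")

# log1p(x) = log(1 + x), so a fare of 0 maps to 0 instead of -inf
log_fares = np.log1p(fares)

# Skewness before and after the transform
print(fares.skew(), log_fares.skew())
```

`np.expm1` inverts the transform if the original scale is needed again.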

# How many unique entries does this column have?

In [None]:
titanic['Sex'].unique()

Checking which gender has more survivors

In [None]:
titanic[['Survived', 'Sex']].groupby('Sex').mean()

# Replace Sex --> male = 0, female = 1
* Numeric codes make the later code easier to write and generally faster
* Creating a list with both datasets (train and test) applies every change to both, and gives more data for the statistics used to fill NaN values

In [None]:
combined=[titanic,titanic_test]

In [None]:
rep = {"male":0,"female":1}
for x in combined:
    x['Sex']=x['Sex'].map(rep)

In [None]:
titanic.head()
# look at the Sex column
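
One caveat with `map` in a notebook: running the mapping cell a second time sends the already encoded 0/1 values through the dictionary again and turns them into NaN. The sketch below shows this on made-up data. The guard at the end is an assumption of this sketch, not part of the original notebook.

```python
import pandas as pd

rep = {"male": 0, "female": 1}
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})  # hypothetical data

toy["Sex"] = toy["Sex"].map(rep)   # first run: strings become 0/1
rerun = toy["Sex"].map(rep)        # second run: 0 and 1 are not keys, so all NaN
print(rerun.isnull().all())

# Guard: map only while the column still holds the original labels
if toy["Sex"].isin(list(rep)).all():
    toy["Sex"] = toy["Sex"].map(rep)
```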

# Working with the Embarked column

* Checking whether any NaN values exist

In [None]:
len(titanic[titanic['Embarked'].isnull()])

In [None]:
titanic[titanic['Embarked'].isnull()]

* Combining the variables these passengers have in common to predict where they embarked

**Pclass**: Refers to passenger class (1st, 2nd, 3rd)

There are three possible values for Embarked:
* S: Southampton
* C: Cherbourg
* Q: Queenstown

In [None]:
# First-class female survivors (Sex == 1 is female after the mapping), counted per port
titanic[(titanic['Pclass'] == 1) &
        (titanic['Survived'] == 1) &
        (titanic['Sex'] == 1)].groupby('Embarked').size()

In [None]:
sns.countplot(x='Pclass',hue='Embarked', data=titanic)

Here we can see that in every class most passengers embarked at "S" (Southampton).

In [None]:
titanic['Embarked']=titanic['Embarked'].fillna("S")
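
The fill value can also be derived from the data rather than typed in by hand. A minimal sketch on a made-up column (the `ports` series is hypothetical), assuming the most frequent value is the one wanted:

```python
import pandas as pd

# Hypothetical port column with two missing entries
ports = pd.Series(["S", "C", None, "S", "Q", "S", None], name="Embarked")

most_common = ports.mode()[0]       # mode() ignores missing values by default
filled = ports.fillna(most_common)  # replace the missing entries with that value

print(most_common, filled.isnull().sum())
```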

Check whether titanic['Embarked'] still has empty fields

In [None]:
len(titanic[titanic['Embarked'].isnull()])

In [None]:
rep_embarked={"S":0,"C":1,"Q":2}
for i in combined:
    i['Embarked']=i['Embarked'].map(rep_embarked)

**Let's see the distribution by gender of the people who survived**
(male = 0, female = 1)

In [None]:
survived = titanic[titanic['Survived']==1]['Sex'].value_counts()
survived

**Count how many people of each sex did not survive (male = 0, female = 1)**

In [None]:
dead = titanic[titanic['Survived']==0]['Sex'].value_counts()
dead
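
The two counts above can also be read from a single table. A minimal sketch using `pd.crosstab` on made-up data with the same encoding; the `toy` frame is hypothetical and its numbers are not Titanic results.

```python
import pandas as pd

# Hypothetical toy data using the same encoding: male = 0, female = 1
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 1, 1, 1, 0, 1],
    "Survived": [0, 0, 1, 1, 1, 0, 0, 1],
})

# Rows are Sex, columns are Survived
counts = pd.crosstab(toy["Sex"], toy["Survived"])                    # raw counts
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")  # share within each sex

print(counts)
print(rates)
```

With `normalize="index"` each row sums to 1, so the column for `Survived == 1` is the survival rate per sex.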

# Looking for null values in Age

In [None]:
len(titanic[titanic['Age'].isnull()])

In [None]:
titanic[titanic['Age'].isnull()].head()

In [None]:
combined_df = pd.concat([titanic, titanic_test], axis = 0)

In [None]:
combined_df[['Age','Pclass']].groupby('Pclass').mean()
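
One possible use of these per-class means is to fill each missing age with the mean age of the passenger's class. The notebook does not state that this is the method it goes on to use, so the sketch below is only an illustration on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing age
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 36.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean age of each class, broadcast back to every row of that class
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# Only the missing entries are replaced; known ages stay untouched
toy["Age"] = toy["Age"].fillna(class_mean)

print(toy)
```

Passing `"median"` to `transform` instead would make the fill value less sensitive to a few very old or very young passengers.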

In [None]:
combined_df.hist(figsize=(20,20))

In [None]:
data = [titanic, titanic_test]
for d in data:
    # Number of relatives aboard: siblings/spouses plus parents/children
    d['with_family'] = d['SibSp'] + d['Parch']
    # Despite its name, not_alone is 1 when the passenger travels alone and 0 otherwise
    d['not_alone'] = (d['with_family'] == 0).astype(int)

titanic['not_alone'].value_counts()

In [None]:
# factorplot was renamed to catplot; kind='point' reproduces its old default plot type
my_dataset = sns.catplot(x='with_family', y='Survived', data=titanic, kind='point', aspect=2.5)

In [None]:
sns.countplot(x='Sex',data=combined_df,hue='Pclass')

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap='inferno')
plt.show()

In [None]:
sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = titanic)

In [None]:
# FacetGrid's size argument was renamed to height
grid = sns.FacetGrid(titanic, col='Survived', row='Embarked', height=3.2, aspect=2.3)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();