# Titanic: Machine Learning from Disaster

## Importing Libraries

#### First we import the libraries the project needs; the rest of the notebook builds on them.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

 We have two datasets:
 - training set (train.csv)
 - test set (test.csv)

 The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

 The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

 We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
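
 The rule behind gender_submission.csv can be sketched in a few lines. This is only an illustration: `toy_test` is a hypothetical stand-in for test.csv with made-up rows, and `baseline` is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: only the two columns the baseline needs.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Baseline rule: every female passenger survives, every male passenger does not.
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})
print(baseline)
```

 A submission file has exactly these two columns, one row per passenger in the test set.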


In [None]:
train=pd.read_csv("../input/titanic/train.csv" )
train.head()

In [None]:
train.shape

#### So the train data has 891 rows and 12 columns
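
The order of the two numbers is easy to mix up: `shape` always reports rows first, then columns. A small sketch on a hypothetical toy frame (not the Titanic file) shows the unpacking:

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (hypothetical data, not the Titanic file).
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.93]})

n_rows, n_cols = toy.shape  # shape is always (rows, columns)
print(f"{n_rows} rows, {n_cols} columns")
```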

In [None]:
test=pd.read_csv("../input/titanic/test.csv")
test.head()

In [None]:
test.shape

#### The test data has 418 rows and 11 columns; the column it lacks is the target, Survived, which we have to predict

## Data Analysis

 **Data Dictionary**

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

 Variable Notes
 pclass: A proxy for socio-economic status (SES)
 1st = Upper
 2nd = Middle
 3rd = Lower

 age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

 sibsp: The dataset defines family relations in this way...
 Sibling = brother, sister, stepbrother, stepsister
 Spouse = husband, wife (mistresses and fiancés were ignored)

 parch: The dataset defines family relations in this way...
 Parent = mother, father
 Child = daughter, son, stepdaughter, stepson
 Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
train.columns

In [None]:
train.info()

In [None]:
train.dtypes

#### From the output above we know that three columns contain null values: Age, Cabin and Embarked. Several columns have the object datatype, so we check their unique values to make sure no null values are hiding in the form of a placeholder symbol

In [None]:
train['Name'].unique()

In [None]:
train['Sex'].unique()

In [None]:
train['Ticket'].unique()

In [None]:
train['Cabin'].unique()

In [None]:
train['Embarked'].unique()

#### From the above analysis it is clear that no column other than Age, Embarked and Cabin contains null values
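
Reading through every unique value by eye gets tedious for columns like Name or Ticket. The same check can be sketched as a small helper. The placeholder list and the name `count_placeholders` are assumptions made for this illustration, and the toy frame is made-up data:

```python
import pandas as pd

# Hypothetical placeholder tokens; adjust to whatever the data source uses.
PLACEHOLDERS = {"", "?", "-", "NA", "N/A", "null", "none"}

def count_placeholders(df: pd.DataFrame) -> pd.Series:
    """Count placeholder-like strings in each non-numeric column."""
    tokens = {p.lower() for p in PLACEHOLDERS}
    counts = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns cannot hold string placeholders
        # Ignore real nulls, then compare trimmed, lower-cased text to the tokens.
        values = df[col].dropna().astype(str).str.strip().str.lower()
        counts[col] = int(values.isin(tokens).sum())
    return pd.Series(counts, dtype="int64")

toy = pd.DataFrame({
    "Sex": ["male", "female", "?"],
    "Cabin": ["C85", None, "NA"],
    "Age": [22.0, 38.0, 26.0],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

A count of zero for every column would support the conclusion that the only missing values are the ones `isnull()` already reports.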

In [None]:
train.duplicated().sum()

#### There are no duplicate rows

In [None]:
train.describe()

#### The above code gives a statistical summary of the numeric columns of the dataset

## Data Wrangling

 - **There are four types of variables**
  - **Numerical Features**: Age, Fare, SibSp and Parch
  - **Categorical Features**: Sex, Embarked, Survived and Pclass
  - **Alphanumeric Features**: Ticket and Cabin (contain both letters and digits)
  - **Text Features**: Name

**We need to transform these features to get the input data into the form the models expect**

In [None]:
train.isnull().sum()

#### To decide how to replace these null values we first check the skewness: for a skewed column the median is a safer fill value than the mean

In [None]:
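# numeric_only=True skips text columns, which recent pandas versions refuse to skew
train.skew(numeric_only=True)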

In [None]:
# Assign the result back instead of using inplace=True on a column selection
train['Age'] = train['Age'].fillna(train['Age'].median())
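
Why the median rather than the mean? A minimal sketch on a made-up, right-skewed sample (the values below are hypothetical, not taken from the Titanic data) shows how a single large value pulls the mean away from a typical observation while the median stays put:

```python
import pandas as pd

# Hypothetical right-skewed sample with one missing value and one large outlier.
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 500.0, None])

print("skew  :", round(fares.skew(), 2))
print("mean  :", round(fares.mean(), 2))
print("median:", fares.median())

# Filling with the median keeps the imputed value close to a typical observation.
filled = fares.fillna(fares.median())
```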

In [None]:
# mode() returns a Series, so take its first element to get a scalar fill value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

In [None]:
from statistics import mode
train["Embarked"] = train["Embarked"].fillna(mode(train["Embarked"]))
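
Filling with the mode has a pitfall worth spelling out: `Series.mode()` returns a Series, not a single value, and `fillna` aligns a Series argument by index. A short sketch on hypothetical values shows the difference between passing the Series and passing a scalar:

```python
import pandas as pd

ports = pd.Series(["S", None, "C", "S", None])  # hypothetical Embarked values

# mode() returns a Series (here with the single index label 0), and fillna
# aligns a Series argument by index, so the gaps at labels 1 and 4 stay empty.
aligned_fill = ports.fillna(ports.mode())

# Taking the first element gives a scalar, which fills every missing entry.
scalar_fill = ports.fillna(ports.mode()[0])

print(aligned_fill.isna().sum(), scalar_fill.isna().sum())
```

Either `mode()[0]` or `statistics.mode` yields a scalar, so both fill all the gaps.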

In [None]:
train.isnull().sum()

## Test Data

In [None]:
test.isnull().sum()

In [None]:
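# numeric_only=True skips text columns, which recent pandas versions refuse to skew
test.skew(numeric_only=True)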

In [None]:
test['Age'] = test['Age'].fillna(test['Age'].median())

In [None]:
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

In [None]:
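from statistics import mode
# This section handles the test set, so fill test here rather than train
# (this changes nothing if test has no missing Embarked values)
test["Embarked"] = test["Embarked"].fillna(mode(test["Embarked"]))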

In [None]:
test.isnull().sum()

## Data Visualisation

In [None]:
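# Numeric vs. categorical: Age vs. Survival
plt.figure(figsize=(12,5))
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws the same histogram plus density curve
sns.histplot(train.Age[train.Survived==0],color="darkblue",kde=True,stat="density",label="0")
sns.histplot(train.Age[train.Survived==1],color="cyan",kde=True,stat="density",label="1")
plt.legend(title="Survived")
plt.show()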

In [None]:
# Numeric vs. categorical: Fare vs. Survival
plt.figure(figsize=(12,5))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(train.Fare[train.Survived==0],color="darkblue",kde=True,stat="density",label="0")
sns.histplot(train.Fare[train.Survived==1],color="cyan",kde=True,stat="density",label="1")
plt.legend(title="Survived")
plt.show()

In [None]:
# Categorical vs. categorical: Sex vs. Survived
plt.figure(figsize=(6,3))
sns.countplot(x=train.Sex)  # all passengers
plt.show()
sns.countplot(x=train.Sex[train.Survived==1])  # survivors only
plt.show()

In [None]:
# Categorical vs. categorical: SibSp vs. Survived
plt.figure(figsize=(6,3))
sns.countplot(x=train.SibSp)  # all passengers
plt.show()
sns.countplot(x=train.SibSp[train.Survived==1])  # survivors only
plt.show()

In [None]:
# Categorical vs. categorical: Parch vs. Survived
plt.figure(figsize=(6,3))
sns.countplot(x=train.Parch)  # all passengers
plt.show()
sns.countplot(x=train.Parch[train.Survived==1])  # survivors only
plt.show()

In [None]:
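# Correlate numeric columns only; text columns such as Name cannot be correlated
cor=train.corr(numeric_only=True)
# Heatmap to visualise the correlation matrix
plt.figure(figsize=(12,10))
# annot=True prints the correlation value inside each cell
sns.heatmap(cor,annot=True,cmap='coolwarm')
plt.show()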

## Feature Encoding

In [None]:
# Encode the categorical columns as integers.
# Series.map avoids the chained assignment used by df[col][mask] = value,
# which pandas warns about and which may not modify the original frame.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

train["Sex"] = train["Sex"].map(sex_codes)
test["Sex"] = test["Sex"].map(sex_codes)

train["Embarked"] = train["Embarked"].map(embarked_codes)
test["Embarked"] = test["Embarked"].map(embarked_codes)
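
The integer codes give Embarked an order (S < C < Q) that the ports do not really have, and a linear model such as logistic regression will treat that order as meaningful. One alternative, which this notebook does not use, is one-hot encoding. A minimal sketch with `pd.get_dummies` on hypothetical rows:

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})  # hypothetical rows

# One indicator column per port, so no artificial ordering is introduced.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1. For a two-valued column like Sex, the single 0/1 code is already equivalent.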

In [None]:
train.columns

In [None]:
train.dtypes

In [None]:
xtr=train[['Sex','Age','SibSp','Parch','Fare','Embarked']]
ytr=train['Survived']
xts=test[['Sex','Age','SibSp','Parch','Fare','Embarked']]

## Logistic Regression Algorithm

In [None]:
from sklearn.linear_model import LogisticRegression
logisticRegression = LogisticRegression(max_iter = 30000)
logisticRegression.fit(xtr, ytr)

In [None]:
ypred = logisticRegression.predict(xts)

In [None]:
ypred

In [None]:
output = pd.DataFrame({'PassengerId': test['PassengerId'],'Survived': ypred})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

In [None]:
output.head()

## Random Forest Algorithm

In [None]:
from sklearn.ensemble import RandomForestClassifier
model4 = RandomForestClassifier(n_estimators=50,criterion='gini',max_depth=10,min_samples_leaf=20)
model4.fit(xtr,ytr)

In [None]:
ypred = model4.predict(xts)

In [None]:
ypred

In [None]:
output2 = pd.DataFrame({'PassengerId': test['PassengerId'],'Survived': ypred})
# Save output2 (the random forest predictions), not the logistic regression output
output2.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
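
Because the test set comes without ground truth, neither submission can be scored locally. One common way to estimate accuracy before submitting is k-fold cross-validation on the training data. The sketch below is only an illustration: it uses synthetic data in place of `xtr` and `ytr`, and the `random_state` values and the 5-fold setting are choices made here, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for xtr / ytr (hypothetical data, not the Titanic file).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

models = {
    "logistic regression": LogisticRegression(max_iter=30000),
    "random forest": RandomForestClassifier(
        n_estimators=50, max_depth=10, min_samples_leaf=20, random_state=0
    ),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold is held out once for scoring.
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {cv_scores[name].mean():.3f}")
```

With the real data, `X` and `y` would be replaced by `xtr` and `ytr`; the mean fold accuracy then gives a rough basis for choosing which model's predictions to submit.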