**Import libraries**

In [6]:
import numpy as np
import pandas as pd

**Load the data**

In [None]:
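# Use the raw file URL: the GitHub "blob" URL serves an HTML page, not the CSV itself
train_data = pd.read_csv("https://raw.githubusercontent.com/dsrscientist/dataset1/master/titanic_train.csv")
# NOTE: this loads the same file as train_data; point it at the held-out test CSV if one is available
test_data = pd.read_csv("https://raw.githubusercontent.com/dsrscientist/dataset1/master/titanic_train.csv")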

print(f"train_data.shape :{train_data.shape}")
print(f"test_data.shape :{test_data.shape}")

# Data Understanding 

In [None]:
train_data.head()

The attributes have the following meaning:
* **PassengerId**: a unique identifier for each passenger
* **Survived**: the target; 0 means the passenger did not survive, 1 means the passenger survived
* **Pclass**: passenger class
* **Name**, **Sex**, **Age**: self-explanatory
* **SibSp**: number of siblings and spouses the passenger had aboard the Titanic
* **Parch**: number of children and parents the passenger had aboard the Titanic
* **Ticket**: ticket id
* **Fare**: price paid (in pounds)
* **Cabin**: passenger's cabin number
* **Embarked**: the port where the passenger boarded the Titanic

In [None]:
train_data.info()

`Age`, `Cabin`, and `Embarked` contain null values.

In [None]:
total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

About 77% of the `Cabin` values are null, so we will ignore this column.

For `Age`, we can fill the missing values with the median.  
We will also ignore `Name` and `Ticket`.
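
The missing-value handling described above can be sketched on a small made-up frame. This is only an illustration: the `toy` DataFrame and its values are hypothetical, not rows from the Titanic data, and in the notebook itself the imputation is done later inside the preprocessing pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 26.55, 13.0],
})

# isnull().mean() gives the fraction of missing values per column
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Drop the mostly-empty column and fill Age with its median
age_median = toy["Age"].median()
cleaned = toy.drop(columns=["Cabin"]).assign(Age=toy["Age"].fillna(age_median))

print(missing_ratio)
print(cleaned)
```

The median is used rather than the mean because it is less sensitive to a few very old or very young passengers.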

In [None]:
train_data.describe()

According to the `mean` row, only about 38% of the passengers in the training set **Survived**.
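
The mean of a 0/1 column equals the share of ones, which is why the `mean` row can be read as a survival rate. A minimal sketch with a hypothetical `survived` series (made-up values, not the real labels):

```python
import pandas as pd

# Hypothetical 0/1 labels: 3 survivors out of 8 passengers
survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name="Survived")

# Mean of a binary column = fraction of rows equal to 1
rate_from_mean = survived.mean()

# The same number, computed from normalized class counts
rate_from_counts = survived.value_counts(normalize=True)[1]

print(rate_from_mean, rate_from_counts)
```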


**Checking the features' values**

In [None]:
train_data["Survived"].value_counts()

In [None]:
train_data["Pclass"].value_counts()

In [None]:
train_data["Sex"].value_counts()

In [None]:
train_data["Embarked"].value_counts()

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

# Data preprocessing

Split the data into **feature** and **label**.

In [None]:
x_train = train_data.drop("Survived", axis=1)
y_train = train_data["Survived"]

x_train

In [None]:
y_train

Building a preprocessing pipeline for **numerical attributes**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the column median
    ("scaler", StandardScaler())                    # standardize to zero mean and unit variance
])

Building a preprocessing pipeline for **categorical attributes**

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

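cat_pipeline = Pipeline([
        ("ordinal_encoder", OrdinalEncoder()),
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # "sparse" was renamed to "sparse_output" in scikit-learn 1.2; use sparse=False on older versions
        ("cat_encoder", OneHotEncoder(sparse_output=False)),
    ])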

Combine **Categorical pipeline** and **Numerical pipeline** 

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])
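
To see what a combined transformer of this kind produces, here is a simplified sketch on a hypothetical four-row frame. It is not the notebook's exact pipeline: the ordinal-encoding step is left out, the column subset is an assumption, and `sparse_threshold=0` is used to force a dense array regardless of the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing Age and one missing Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder()),
])

# Each pipeline only sees the columns listed for it
preprocess = ColumnTransformer(
    [("num", num_pipe, ["Age", "Fare"]), ("cat", cat_pipe, ["Sex", "Embarked"])],
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = preprocess.fit_transform(toy)
# 2 scaled numeric columns + 2 one-hot columns for Sex + 2 for Embarked
print(transformed.shape)
```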

**Transform the data**

In [None]:
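X_train_tran = preprocess_pipeline.fit_transform(x_train)
# Only transform the test set: reuse the medians, scaling, and categories learned from the training data
test_data_tran = preprocess_pipeline.transform(test_data)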
print("X_train_tran.shape :",X_train_tran.shape)
print("test_data_tran.shape :",test_data_tran.shape)
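
A common convention is to fit preprocessing steps on the training data only and then apply the fitted steps to the test data, so that both sets are imputed and scaled with the same statistics. A minimal sketch with hypothetical one-column arrays (the values are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets with missing values
train = np.array([[20.0], [30.0], [np.nan], [40.0]])
test = np.array([[np.nan], [50.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

train_t = pipe.fit_transform(train)  # learns the median, mean, and std from train
test_t = pipe.transform(test)        # reuses those statistics, learns nothing new

learned_median = pipe.named_steps["imputer"].statistics_[0]
print(learned_median)
print(test_t.ravel())
```

The missing test value is filled with the training median and therefore lands exactly at the training mean after scaling. Calling `fit_transform` on the test set instead would recompute the statistics from the test rows and put the two sets on different scales.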

# Modeling

We start with a `RandomForestClassifier`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X_train_tran, y_train)

In [None]:
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_clf, X_train_tran, y_train, cv=10)

forest_scores.mean()

**Support vector classifier**

In [None]:
from sklearn.svm import SVC

svm_clf = SVC(gamma="auto")
svm_scores = cross_val_score(svm_clf, X_train_tran, y_train, cv=10)

svm_scores.mean()
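
`cross_val_score` splits the data into `cv` folds, trains on all but one fold, and scores on the held-out fold, returning one score per fold. The sketch below runs it on synthetic data from `make_classification` (an assumption for illustration, not the Titanic data) and wraps the scaler and the classifier in one pipeline so that scaling is refit within each fold; this is a variation on the cells above, which transform once before cross-validating.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (stand-in for the real features)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Scaler + SVC in one estimator: each fold fits the scaler on its own training part
model = make_pipeline(StandardScaler(), SVC(gamma="auto"))

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

Looking at the standard deviation alongside the mean gives a rough sense of whether a difference between two models is larger than the fold-to-fold noise.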

**Plot the scores**

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot([1]*10, svm_scores, ".")
plt.plot([2]*10, forest_scores, ".")
plt.boxplot([svm_scores, forest_scores], labels=("SVM", "Random Forest"))
plt.ylabel("Accuracy")
plt.show()

Based on the cross-validation scores above, we will use **SVC** for the final predictions.

In [None]:
svm_clf.fit(X_train_tran, y_train)

y_pred = svm_clf.predict(test_data_tran)

y_pred.shape

In [None]:
test_data.shape

# Submission

In [None]:
# Create Submission

svm_sub = pd.DataFrame({
    "PassengerId" : test_data["PassengerId"],
    "Survived" : y_pred
})

svm_sub.head()

In [None]:
svm_sub.to_csv("submission.csv", index=False)
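
Before uploading, it can be worth reading the file back and checking its layout. The sketch below does this round trip in memory with a hypothetical three-row frame (the ids and predictions are made up), and shows that `index=False` keeps the DataFrame index from being written as an extra column.

```python
import io

import pandas as pd

# Hypothetical submission rows, for illustration only
toy_sub = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_sub.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and check the columns and label values
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
print(reloaded["Survived"].unique())
```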