In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

The code above prepares the Python environment by importing the libraries we need (numpy, pandas, os).
Libraries let you call ready-made functions from your own code.
For example, **os.path.join(dirname, filename)** is a function from the os.path module that joins a directory name and a file name into a single path. The surrounding **os.walk** loop visits every directory under /kaggle/input, so together they print the path of every file in the input directory.
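To see the two functions apart, here is a minimal sketch that runs against a throwaway directory instead of /kaggle/input. The folder and file names are made up for illustration.

```python
import os
import tempfile

# Build a temporary directory tree so the example does not depend on /kaggle/input.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "titanic"))
    for name in ("train.csv", "test.csv"):
        open(os.path.join(root, "titanic", name), "w").close()

    # os.walk yields (directory, subdirectories, files) for every directory under root.
    found = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            # os.path.join only glues the parts together with the right separator.
            found.append(os.path.join(dirname, filename))

# Show the paths relative to the temporary root, in a stable order.
relative = sorted(os.path.relpath(path, root) for path in found)
print(relative)
```

os.walk does the searching, while os.path.join only builds each path string.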

To train a machine to make predictions, we first need to give it input data. This data has several columns. Some are FEATURES (X), in this case Sex, Age, Fare, etc., and one column is the TARGET (Y), in this case the column SURVIVED. The task of the machine is to find patterns among the features (X) that let it predict the outcome (Y). In other words, given examples of the X -> Y relationship, can the machine (called a model) train itself to predict future outcomes? The test dataset (test.csv) does not have the Y column; it only has the features (X). When you feed it to a model that has trained itself on the TRAIN.CSV dataset, the model should be able to predict the values of the SURVIVED column.
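The split into features (X) and target (Y) can be sketched on a tiny, invented table. The rows below are made up and only mimic a few Titanic columns.

```python
import pandas as pd

# Invented rows that mimic a few Titanic columns; not real passenger data.
toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [29.0, 35.0, 4.0, 58.0],
    "Fare": [71.28, 8.05, 21.08, 26.55],
    "Survived": [1, 0, 1, 1],
})

target = "Survived"
X_toy = toy.drop(columns=[target])  # features: everything except the target
y_toy = toy[target]                 # target: the column the model should predict

print(X_toy.columns.tolist())
print(y_toy.tolist())
```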

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

Here we use the read_csv function to read the comma-separated values in our input file train.csv.
Using the head method we display only the first 5 rows. This is just to get a feel for the data inside the comma-separated values (CSV) file. The above output is only for the TRAIN.CSV file.
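A small sketch of the same two calls on an in-memory CSV, so it runs without the Kaggle input files. The rows are invented; `head()` returns 5 rows by default and accepts a different count.

```python
import io
import pandas as pd

# Build an eight-row CSV in memory; the values are made up for illustration.
lines = ["PassengerId,Sex,Survived"]
for i in range(1, 9):
    sex = "female" if i % 2 else "male"
    lines.append(f"{i},{sex},{i % 2}")
csv_text = "\n".join(lines)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns) of the whole table
print(sample.head())   # first 5 rows by default
print(sample.head(3))  # first 3 rows
```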

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

Similarly we inspect the TEST.CSV dataset. It has all the columns of the TRAIN.CSV dataset except the target column SURVIVED (Y). The outcome for these passengers is already known, but we wish to test the model's capabilities and hence we do not give it the outcome. We only give it the features (X) and test whether it can predict Y. When you have a single labelled dataset, it is typical to split it into TRAINING and TEST portions in a ratio such as 70:30 or 60:40. There is no fixed rule, but ratios like these are common.
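When you only have one labelled dataset, a 70:30 split could be done with scikit-learn's `train_test_split`. This is a sketch on invented rows; the Titanic competition already ships separate train and test files, so the notebook itself does not need this step.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented labelled rows; not real passenger data.
labelled = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 3, 1, 2],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})
X_all = labelled[["Pclass"]]
y_all = labelled["Survived"]

# test_size=0.3 holds back 30% of the rows; random_state makes the shuffle repeatable.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_all, y_all, test_size=0.3, random_state=1
)

print(len(X_train), len(X_holdout))
```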

Next up we do some Exploratory Data Analysis (EDA). EDA is essential for us to understand the data and spot patterns ourselves before we ask the machine to do it. EDA can also help you decide which algorithm to use for the problem, and sometimes it helps you judge whether the data is valid. For example, if you found that 80% of the rows were missing the gender information, you would not train the machine on that column.
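One way to run the missing-data check described above is to compute the share of missing values per column. The table below is invented, and the 50% cut-off is an arbitrary assumption for illustration, not a rule.

```python
import pandas as pd

# Invented table where the Sex column is missing in 4 of 5 rows.
toy = pd.DataFrame({
    "Sex": ["female", None, None, None, None],
    "Pclass": [1, 3, 3, 2, 1],
})

# isna() marks missing cells as True; the mean of True/False is the missing share.
missing_share = toy.isna().mean()
print(missing_share)

# Keep only columns below an (assumed) 50% missing threshold.
usable = missing_share[missing_share < 0.5].index.tolist()
print(usable)
```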

In [None]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

EDA tells us that about 74% of the women survived the sinking (the cell prints this as a fraction, roughly 0.74, not as a percentage). Were there many Kate Winslets on board?

In [None]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

Only about one in five men (approx. 19%) survived. Does this mean the men on board were bad swimmers, or did the captain send the women and children to safety first in the lifeboats?
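The two survival-rate cells above can also be written as a single `groupby`. The sketch below uses a small invented table rather than train.csv, so its numbers are not the real rates.

```python
import pandas as pd

# Invented rows: 3 of 4 women and 1 of 5 men survive.
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

On the real data the same line would be `train_data.groupby("Sex")["Survived"].mean()`.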

After some basic exploratory analysis of the data, we come to the important step of training the algorithm on the TRAINING data. In this example we train a Random Forest Classifier on 4 features ["Pclass", "Sex", "SibSp", "Parch"] and ask it to find patterns that lead to the outcome (Y) of SURVIVED = 1 or DIED = 0. A random forest builds many decision trees, each with branches that test a feature or a combination of features, and combines their votes to decide whether a passenger was more likely to survive or die. So perhaps being a man on the Titanic made survival much less likely, or perhaps the passenger's ticket class also played a part.
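To make the idea concrete, here is a minimal sketch on invented rows. It shows how `pd.get_dummies` turns the text column Sex into numeric columns the forest can use, and how `feature_importances_` reports which features the trees relied on. The data and the choice of 25 trees are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented rows in which survival follows Sex exactly; not real passenger data.
toy = pd.DataFrame({
    "Pclass":   [1, 2, 3, 3, 1, 2, 3, 1],
    "Sex":      ["female", "female", "female", "female",
                 "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0],
})

# get_dummies leaves numeric columns alone and one-hot encodes text columns.
X_toy = pd.get_dummies(toy[["Pclass", "Sex"]])
y_toy = toy["Survived"]
print(X_toy.columns.tolist())

forest = RandomForestClassifier(n_estimators=25, random_state=1)
forest.fit(X_toy, y_toy)

# One importance score per input column; the scores sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances)

# Predictions are 0 (died) or 1 (survived), one per input row.
toy_predictions = forest.predict(X_toy)
print(toy_predictions)
```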

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Target (Y): 1 = survived, 0 = died
y = train_data["Survived"]

# Features (X); get_dummies one-hot encodes the text column "Sex"
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# 100 trees, each at most 5 levels deep; random_state makes the run repeatable
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Save one prediction per test passenger in the submission format
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

Finally, the predictions are saved as another CSV file named my_submission.csv. If you open this file, you will see two columns, PassengerId and Survived.
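A quick sanity check of the saved file might look like the sketch below. It writes a three-row stand-in to a temporary folder instead of touching the real my_submission.csv, and the ids and predictions are made up.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the real output frame; ids and predictions are invented.
toy_output = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_submission.csv")
    # index=False keeps the row index out of the file, leaving exactly two columns.
    toy_output.to_csv(path, index=False)
    check = pd.read_csv(path)

print(check.columns.tolist())
print(check["Survived"].isin([0, 1]).all())
print(check["PassengerId"].is_unique)
```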