I am a beginner in the world of machine learning and deep learning, and this is my first attempt at a Kaggle competition. I am going to approach it using the knowledge I have gained from online sources and a few courses.

Without further ado, let’s dive into it. I am going to divide my work into the following sections:
1. Import and View Data.
2. Manual Feature Selection.
3. Analyze and Preprocess Data with Data Visualization.
4. Analysis for Algorithm Selection.
5. Creating and Selecting Best Model.
6. Visualizing the Best Result.
7. File Submission.

Any comments/suggestions will be greatly appreciated.

**1. Import and View Data**

In [None]:
# Import pandas that will help us read the provided csv files into dataframes
import pandas as pd
training_set = pd.read_csv('../input/train.csv')
test_set = pd.read_csv('../input/test.csv')

# Let's view the training set
training_set.head(n = 10)

In [None]:
# Let's view the test set
test_set.head(n = 10)

Looking at the data, we can see that the **Survived** column is the label (dependent variable) and the rest of the columns are features (independent variables). Next, we will eliminate the features that are unimportant for training our model.

**2. Manual Feature Selection**

Looking at the training and test data above, we can see that a few features (columns) can be eliminated, either because they do not contribute to training the model or because their impact on the prediction is likely to be insignificant.
1. PassengerId >>> This feature (also referred to as an independent variable) is not relevant and will not give us any information about the survival of the passenger.
2. Name >>> This feature may contain some relevant data if you consider salutations. For example, take the 'Dr.' (Doctor) salutation. One can argue that doctors may be more inclined to help and save other passengers' lives while risking their own, but the provided information doesn't give enough evidence of that.
3. Ticket >>> In my opinion this variable is redundant: we might be able to extract status information from it, but we can use Pclass for that.
4. Cabin >>> This variable may be useful because, depending on the position of the cabin in the ship, a passenger may have a higher or lower probability of survival than others. However, with the information given, we do not know whether passengers were in their cabins or somewhere else at the time of impact.
5. Fare >>> If you compare the Fare and Pclass (ticket class) features, you can see that Pclass almost represents Fare, so Fare is largely redundant. For example, you will see lower fares for lower classes (represented by a higher number) and higher fares for higher classes.
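
One way to check the Fare vs. Pclass claim with numbers instead of eyeballing rows is to compare the median fare per class and the rank correlation between the two columns. This is only a sketch: the `toy` rows below are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from train.csv
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Fare':   [80.0, 52.0, 71.0, 26.0, 13.0, 21.0, 7.9, 8.1, 7.3],
})

# Median fare per ticket class: a steady drop from class 1 to class 3
# suggests the two columns carry overlapping information
median_fare = toy.groupby('Pclass')['Fare'].median()
print(median_fare)

# Correlating the ranks of both columns is equivalent to a Spearman correlation;
# a value close to -1 means fares fall as the class number rises
rank_corr = toy['Pclass'].rank().corr(toy['Fare'].rank())
print(rank_corr)
```

On the real data, the same two lines could be run against `training_set` before the Fare column is dropped.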

In [None]:
# View Fare vs Pclass columns (selected by name rather than by position)
training_set[['Pclass', 'Fare']].head(n = 10)

In [None]:
# Dropping insignificant/unnecessary columns
training_set = training_set.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], axis = 1)
test_set = test_set.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], axis = 1)

In [None]:
# Let's view the training set now
training_set.head(n = 10)

In [None]:
# Let's view the test set now
test_set.head(n = 10)

**3. Analyze and Preprocess Data with Data Visualization**

I am going to work on this section by dividing it into two sub-sections:
    1. Get statistical information about the data.
    2. Analyze the data and Preprocess the data.

In [None]:
# Get statistical information about training set
training_set.describe(include = 'all')

In [None]:
# Get statistical information about test set
test_set.describe(include = 'all')

Here, the count row gives us the number of rows with data filled in, i.e. non-NaN values. We can see that the **Age** feature is missing quite a few values (in both the training and test sets) and the **Embarked** feature is missing a couple (in the training set). So let's fill in these missing values. This process is often called **imputation**.
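
Instead of comparing counts by eye, the missing values can also be counted directly. A minimal sketch, assuming a small made-up frame `toy` in place of the real training set:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Pclass': [3, 1, 2, 3],
})

# isnull() marks every missing cell, sum() counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Columns with a count of zero need no imputation.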

* **Age**
    
    Since Age is a continuous numerical feature and it ranges from 0.17 to 76 (extracted from min and max above), we will fill its missing values with its **mean**.

In [None]:
# Fill missing ages with the mean of the Age column
training_set['Age'] = training_set['Age'].fillna(training_set['Age'].mean())
test_set['Age'] = test_set['Age'].fillna(test_set['Age'].mean())
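
A possible alternative, sketched here as an assumption rather than the approach used in this notebook, is to learn the mean from the training set only and reuse it for the test set with scikit-learn's `SimpleImputer`. The `toy_train` and `toy_test` frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows for illustration only
toy_train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0]})

# The imputer learns the mean from the training column only
imputer = SimpleImputer(strategy='mean')
toy_train['Age'] = imputer.fit_transform(toy_train[['Age']]).ravel()
# The test column is filled with the mean learned from the training column
toy_test['Age'] = imputer.transform(toy_test[['Age']]).ravel()

print(toy_train['Age'].tolist())
print(toy_test['Age'].tolist())
```

Both missing ages end up filled with the training mean, so the two sets are treated in exactly the same way.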

* **Embarked**

    Embarked is a categorical feature with possible values C = Cherbourg, Q = Queenstown, and S = Southampton. We will fill the missing values with the most frequent value, i.e. the **mode**.

In [None]:
training_set['Embarked'] = training_set['Embarked'].fillna(training_set['Embarked'].mode()[0])

In [None]:
# Now we can see that Age and Embarked features are not missing any values
print(training_set['Age'].isnull().sum())
print(test_set['Age'].isnull().sum())
print(training_set['Embarked'].isnull().sum())
print(test_set['Embarked'].isnull().sum())

Now that we have complete data, let's start analyzing it.
Let's make some manual assumptions about the correlation between each individual feature and the label.
* **Age Feature:** We can assume that younger passengers may have a higher probability of survival than older passengers, considering that younger passengers are physically stronger. Also, babies may have a higher probability of survival. So, we are going to keep this feature to train our model.

In [None]:
# Import data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# draw a bar plot - Age-Group vs. survival
plt.subplots(1, 1, figsize = (15, 5))
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
age_group = pd.cut(training_set['Age'], age_bins)
survived = training_set['Survived'].values
sns.barplot(x = age_group, y = survived)
plt.xlabel('Age-Group')
plt.ylabel('Survived')
plt.show()
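
The height of each bar in a plot like this is the mean of Survived within an age group, so the same numbers can be computed as a table. A small sketch with made-up rows (the `toy` frame is an assumption for illustration, not real passenger data):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Age':      [4, 8, 25, 28, 33, 45, 62, 70],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1],
})

# Same bins as the bar plot; each interval is open on the left and closed on the right
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
toy['AgeGroup'] = pd.cut(toy['Age'], age_bins)

# Mean of Survived per age group = survival rate of that group
# observed=True keeps only the groups that actually contain passengers
rate_by_group = toy.groupby('AgeGroup', observed=True)['Survived'].mean()
print(rate_by_group)
```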

* **Pclass (Ticket class) feature**: We can assume that passengers with higher-class tickets are more likely to survive than those in a lower class. Let's see if there is any correlation between the Pclass and Survived features.

In [None]:
# draw a bar plot - Pclass vs. survival
plt.subplots(1, 1, figsize = (10, 5))
sns.barplot(x = 'Pclass', y = 'Survived', data = training_set)
plt.xlabel('Ticket class')
plt.ylabel('Survived')
plt.show()

* **SibSp (siblings and spouses) and Parch (parents and children) features:** We can assume that having family on board decreases a passenger's probability of survival. However, the raw counts in these two features are hard to relate directly to survival probability, so we will combine them into a single feature, **HasFamily**. This process is also referred to as **Feature Engineering**.

In [None]:
import numpy as np

# Since the number of siblings/spouses (SibSp) and parents/children (Parch) tells us whether a passenger had family onboard,
# we are going to combine these columns to create a column called HasFamily
training_set['HasFamily'] = np.where(training_set['SibSp'] + training_set['Parch'] > 0, 1, 0)
test_set['HasFamily'] = np.where(test_set['SibSp'] + test_set['Parch'] > 0, 1, 0)
# Now we can drop SibSp and Parch columns
training_set = training_set.drop(['SibSp', 'Parch'], axis = 1)
test_set = test_set.drop(['SibSp', 'Parch'], axis = 1)

training_set.head(n=10)

In [None]:
test_set.head(n=10)

In [None]:
# draw a point plot - HasFamily vs. survival (catplot replaces the older factorplot)
sns.catplot(x = 'HasFamily', y = 'Survived', data = training_set, kind = 'point')
plt.xlabel('Has Family')
plt.ylabel('Survived')
plt.show()

As we can see above, passengers with family surprisingly had a higher survival rate than those with no family. This will be an important feature to train our model.
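
To back a comparison like this with numbers, the survival rate and group size can be tabulated per HasFamily value. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'SibSp':    [1, 0, 0, 2, 0, 0],
    'Parch':    [0, 0, 1, 1, 0, 0],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Same rule as above: any sibling, spouse, parent or child on board means HasFamily = 1
toy['HasFamily'] = np.where(toy['SibSp'] + toy['Parch'] > 0, 1, 0)

# mean = survival rate of the group, count = number of passengers in the group
summary = toy.groupby('HasFamily')['Survived'].agg(['mean', 'count'])
print(summary)
```

Looking at the count next to the rate helps to judge whether a difference rests on enough passengers.
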
* **Sex**: Assuming female passengers have a higher probability of survival than male passengers, let's analyze this feature with data visualization.

In [None]:
# draw a bar plot - Sex vs. survival
sns.barplot(x = 'Sex', y = 'Survived', data = training_set)
plt.xlabel('Sex')
plt.ylabel('Survived')
plt.show()

Now I am going to preprocess the categorical data into numerical form so that we can use those features in the algorithms.

In [None]:
# Let's divide the data into features and label
features_train = training_set.iloc[:, 1:].values 
labels_train = training_set.iloc[:, 0].values
features_test = test_set.iloc[:, :].values
# label for test data is not provided

# Let's use LabelEncoder to convert Sex data into numerical form
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
features_train[:, 1] = label_encoder.fit_transform(features_train[:, 1])
# Reuse the mapping learned from the training set so both sets are encoded identically
features_test[:, 1] = label_encoder.transform(features_test[:, 1])

In [None]:
# Now encode Embarked feature
label_encoder = LabelEncoder()
features_train[:, 3] = label_encoder.fit_transform(features_train[:, 3])
# Reuse the mapping learned from the training set so both sets are encoded identically
features_test[:, 3] = label_encoder.transform(features_test[:, 3])
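
Why it matters which data the encoder is fitted on: LabelEncoder numbers the categories it sees in sorted order, so fitting it separately on a set that happens to lack one category shifts the numbers. A small sketch with made-up port lists (`train_ports` and `test_ports` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration only
train_ports = np.array(['S', 'C', 'Q', 'S'])
test_ports = np.array(['S', 'Q'])  # 'C' happens to be absent here

# Encoder fitted on the training values: C -> 0, Q -> 1, S -> 2
shared = LabelEncoder().fit(train_ports)
consistent = shared.transform(test_ports)

# Encoder fitted again on the test values alone: Q -> 0, S -> 1
refit = LabelEncoder().fit_transform(test_ports)

print(consistent)
print(refit)
```

With the shared encoder, 'S' keeps the same number in both sets; with the refitted one it does not.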

Now we need to apply something called One Hot Encoding to the **Embarked** column. Label encoding turns the categorical feature into numbers such as 0, 1, 2, 3, and so on, and a model may then treat a higher encoded number as a larger quantity even though the categories have no natural order. To eliminate this issue, the feature is split into several features, one per unique value, and each resulting feature holds only binary values (0 or 1).
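
To make the difference concrete, here is a small sketch that encodes the same made-up Embarked values both ways. It uses pandas' `get_dummies` purely for illustration, which is an assumption and not the encoder used in this notebook.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Label encoding: one column of integer codes (C = 0, Q = 1, S = 2)
label_codes = toy['Embarked'].astype('category').cat.codes
print(label_codes.tolist())

# One hot encoding: one binary column per unique value
one_hot = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)
print(one_hot)
```

In the one hot version every row has exactly one 1, so no port looks "bigger" than another.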

In [None]:
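# One Hot Encode Embarked feature (column index 3)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Newer scikit-learn versions no longer accept categorical_features, so the column is selected with ColumnTransformer
# The encoded Embarked columns come first, followed by the untouched remaining columns
one_hot_encoder = ColumnTransformer([('embarked', OneHotEncoder(), [3])],
                                    remainder = 'passthrough', sparse_threshold = 0)
features_train = one_hot_encoder.fit_transform(features_train).astype(float)
# Reuse the encoder fitted on the training set for the test set
features_test = one_hot_encoder.transform(features_test).astype(float)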

Now we need to drop one of the columns created by One Hot Encoding to avoid the Dummy Variable Trap. The n columns created by One Hot Encoding are collinear: since exactly one of them is 1 in every row, any one column can be predicted exactly from the others. To break this dependency, we drop one of these columns.
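
The collinearity can be seen on a tiny example. The sketch below uses a made-up one hot matrix (an assumption for illustration) and shows that the first column can be rebuilt from the other two, which is why dropping it loses no information.

```python
import numpy as np

# Made-up one hot rows for three categories, for illustration only
one_hot = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Every row sums to 1, so the first column equals 1 minus the sum of the others
reconstructed_first = 1 - one_hot[:, 1:].sum(axis=1)
print(reconstructed_first)

# Dropping the first column keeps all the information with one column less
reduced = one_hot[:, 1:]
print(reduced.shape)
```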

In [None]:
features_train = features_train[:, 1:] # dropping column at 0th index
features_test = features_test[:, 1:] # dropping column at 0th index

**4. Analysis for algorithm selection.**

Now that all of the important features have been selected and the unnecessary ones have been dropped, let's talk about algorithm selection. Since we will be trying various models to find the one with the best accuracy, we will be dealing with various algorithms, but we need to select them based on whether the problem is a Regression problem or a Classification problem. We know that this is a Classification problem, since the prediction we are making is Survived or not, and not a continuous value like the price of a stock. So, let's pick some classification algorithms and train our model.

**5. Creating and Selecting Best Model.**

We will fit the following classification algorithms and predict the test results with each of them:
1. Logistic Regression
2. K-Nearest Neighbors
3. Support Vector Machines
4. Naive Bayes
5. Decision Trees
6. Random Forest

We scale the features once, before fitting the first model, and every algorithm below uses the scaled data. Scaling is done so that no feature dominates the others merely because of the magnitude of its values.
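
As a quick illustration of what standardization does, the sketch below scales a made-up array (an assumption for illustration) by hand and with `StandardScaler`, and both give the same result: each column ends up with mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only: two columns on very different scales
toy = np.array([[1.0, 20.0],
                [2.0, 40.0],
                [3.0, 60.0]])

# By hand: subtract the column mean, then divide by the column standard deviation
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Same computation done by scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(manual)
print(scaled)
```

After scaling, both columns cover the same range even though the raw values differ by a factor of 20.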

In [None]:
# Create a dataframe to store algorithms and their accuracies
algo_accuracy = pd.DataFrame(columns = ['Algorithm', 'Accuracy'])

# Logistic Regression
# Scaling features so that one feature doesn't dominate the others
from sklearn.preprocessing import StandardScaler
# Standardize features by removing the mean and scaling to unit variance
# i.e. subtract the column mean from each value and divide by the column standard deviation
scaler = StandardScaler()
features_train = scaler.fit_transform(features_train)
features_test = scaler.transform(features_test) # reuse the mean and standard deviation fitted on the training set

# Fit Logistic Regression to the training data
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(features_train, labels_train)

# Predict the test result and calculate accuracy on the training set (test labels are not provided)
labels_pred = log_reg.predict(features_test)
accuracy_log_reg = log_reg.score(features_train, labels_train) * 100
algo_accuracy.loc[len(algo_accuracy)] = ['Logistic Regression', accuracy_log_reg]
print(accuracy_log_reg)

In [None]:
# K-Nearest Neighbors
# No need to scale the data since it's already done above.
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
KNN.fit(features_train, labels_train)

# Predict the test result and calculate accuracy on the training set (test labels are not provided)
labels_pred = KNN.predict(features_test)
accuracy_KNN = KNN.score(features_train, labels_train) * 100
algo_accuracy.loc[len(algo_accuracy)] = ['K-Nearest Neighbors', accuracy_KNN]
print(accuracy_KNN)

In [None]:
# Support Vector Machines
# No need to scale the data since it's already done above.
from sklearn.svm import SVC
svc = SVC(kernel = 'rbf')
svc.fit(features_train, labels_train)

# Predict the test result and calculate accuracy on the training set (test labels are not provided)
labels_pred = svc.predict(features_test)
accuracy_svc = svc.score(features_train, labels_train) * 100
algo_accuracy.loc[len(algo_accuracy)] = ['Support Vector Machines', accuracy_svc]
print(accuracy_svc)

In [None]:
# Naive Bayes
# No need to scale the data since it's already done above.
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(features_train, labels_train)

# Predict the test result and calculate accuracy on the training set (test labels are not provided)
labels_pred = nb.predict(features_test)
accuracy_nb = nb.score(features_train, labels_train) * 100
algo_accuracy.loc[len(algo_accuracy)] = ['Naive Bayes', accuracy_nb]
print(accuracy_nb)

In [None]:
# Decision Trees
# No need to scale the data since it's already done above.
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(criterion = 'entropy')
decision_tree.fit(features_train, labels_train)

# Predict the test result and calculate accuracy on the training set (test labels are not provided)
labels_pred = decision_tree.predict(features_test)
accuracy_decision_tree = decision_tree.score(features_train, labels_train) * 100
algo_accuracy.loc[len(algo_accuracy)] = ['Decision Trees', accuracy_decision_tree]
print(accuracy_decision_tree)

In [None]:
# Random Forest
# No need to scale the data since it's already done above.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 20, criterion = 'entropy')
random_forest.fit(features_train, labels_train)

# Predict the test result and calculate accuracy on the training set (test labels are not provided)
labels_pred = random_forest.predict(features_test)
accuracy_random_forest = random_forest.score(features_train, labels_train) * 100
algo_accuracy.loc[len(algo_accuracy)] = ['Random Forest', accuracy_random_forest]
print(accuracy_random_forest)

Now, let's put all the algorithms in a table and examine them to decide which algorithm gives the best model for this problem.

In [None]:
algo_accuracy.head(n=10)

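Therefore, we can see that Decision Trees was the algorithm with the best accuracy. Note that this accuracy is measured on the training data, which an unpruned decision tree tends to fit very closely. So let's pick the predictions made by the Decision Tree and get the file ready for submission.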

In [None]:
# Decision Trees has the highest accuracy
decision_tree = DecisionTreeClassifier(criterion = 'entropy')
decision_tree.fit(features_train, labels_train)

# Predict the test result
decision_tree_preds = decision_tree.predict(features_test)

# Build a submission file
orig_test_set = pd.read_csv('../input/test.csv')
submission = pd.DataFrame({
        "PassengerId": orig_test_set["PassengerId"],
        "Survived": decision_tree_preds
    })
submission.to_csv('titanic_preds.csv', index=False)
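
Since the accuracies in the table are computed on the same data the models were trained on, a possible follow-up check is cross-validation, which scores each model on rows it has not seen during fitting. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the Titanic features, and 5 folds is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only; flip_y adds some label noise
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)

# Accuracy on the training data itself
tree.fit(X, y)
train_accuracy = tree.score(X, y) * 100

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_accuracy = cross_val_score(tree, X, y, cv=5).mean() * 100

print(train_accuracy)
print(cv_accuracy)
```

With the real data, `cross_val_score` could be called with `features_train` and `labels_train` for each of the six models to compare them on held-out rows.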