In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## <a id = "stepend"> Table of contents </a>
1. [Introduction](#1)
2. [Kaggle micro-courses study plan](#2)
3. [The solution of the problem](#3)
    * [Defining Libraries and Data Path](#4)
    * [Deal with Missing values](#5)
    * [Deal with Categorical Variables](#6)
    * [Defining and testing a model](#7)
    * [Submit to competition](#8)
4. [Conclusion](#9)

# <a id = "1"> 1. Introduction</a>
### What is it about?
Good day, dear Kagglers! I wrote this notebook as a kind of logical conclusion to my study of the Kaggle micro-courses. It solves the Titanic problem using a variety of techniques from the different Kaggle micro-courses for beginners. You can simply read through this notebook and find the topics that, in your opinion, are worth revisiting (I will give links to all the micro-courses used).

You can use this solution for practice; its main advantage is its simplicity. It does not use a large number of complex analysis methods or create synthetic variables (there are plenty of such notebooks already) - it aims for decent model accuracy using basic principles. Thus, you will be able to see the simple logic of data processing and model building.👍
### Who is it for?
This notebook is aimed at complete beginners, as well as at anyone who wants to brush up on some of the theory.


# <a id = "2"> 2. Kaggle micro-courses study plan</a>
### How do I get started on the Kaggle platform?
First, I would like to share my subjective view of the order in which to study the Kaggle micro-courses. In my opinion, these micro-courses are incredibly practical. You can stop anywhere, on any line of code in a particular course, and start your own exploration of the selected dataset. You can find a complete list of the micro-courses available on this platform [here](https://www.kaggle.com/learn/overview).
### <br/> Here we go:
1. [Hello, Python!](https://www.kaggle.com/learn/python)
<br/> One of the main programming languages used in Data Science is Python. You can start this course with no prior knowledge of the language. In fact, that is the level from which I started learning Python.😊
The course is really interesting, and the practical tasks are really challenging.💪
2. [Pandas](https://www.kaggle.com/learn/pandas)
<br/> While Python is the most popular programming language in the data science world, *Pandas* is the most popular Python library for data analysis. In this course, you can learn more about the functions and methods applicable to data frames. How to create a table, how to extract the necessary information from it, and much more.
3. [Data-visualization](https://www.kaggle.com/learn/data-visualization)
<br/> In this micro-course, you will learn the basics of data visualization using the *Seaborn* library. The course describes all the necessary tools in great detail, explaining literally every word in a line of code! It is also a very colorful course, because the resulting visuals look great. You can master data visualization without the first two micro-courses, but I still recommend learning the basics of the programming language first. Data visualization is very interesting and, if used wisely, can reveal hidden dependencies and tell a previously unknown story about the data.
4. [Intro-to-machine-learning](https://www.kaggle.com/learn/intro-to-machine-learning)
<br/> This course introduces the basic machine learning models that we need to predict key values. It covers classification and regression models such as the *Decision Tree Model* and *Random Forest*. Here we also get acquainted with the problems of *underfitting* and *overfitting*, and with methods for assessing a model's accuracy. Of course, this micro-course will not give you the mathematical foundations of these methods, but as I said earlier, this is pure practice.✨ You can immediately apply these models to real data!
5. [Intermediate-machine-learning](https://www.kaggle.com/learn/intermediate-machine-learning)
<br/> This course is also a must, as it raises critical issues that are encountered in real-world data. In the Titanic problem, we will encounter data gaps, as well as categorical variables that our model cannot use in their raw form.🙄 In this course, you will also learn about *pipelines*, advanced model validation techniques (*cross-validation*), the problem of *data leakage*, and explore a new model that is widely used in one form or another in Kaggle's competitions - *XGBoost* 🔥.
6. [Feature-engineering](https://www.kaggle.com/learn/feature-engineering)
<br/> I strongly recommend starting this course only after all the previous ones 😂 Surely you have seen *Feature Engineering* listed as an important part of any analysis, but what is it? In essence, it is the creation of new, synthetic variables based on the existing ones, with the aim of improving the predictive performance of the model and the interpretability of the results.

["Return to Monke"](#stepend)🙉

# <a id = "3"> 3. The solution of the problem</a>

## <a id = "4"> Defining Libraries and Data Path</a>
In this cell, we import all the libraries we need for working with data, filling in the gaps, using various models, and so on. In practice, this block grows during the analysis, as you discover that you need one method or another.

Here we read all our data and store it under the names *train_data* and *test_data*, respectively.

Then we define the prediction target *y*. We also pick the specific columns we will work with; I did not use the *Ticket* and *Cabin* features here. Although they hide interesting patterns, they require more complex processing while giving only a slight improvement to the model (you can add or remove features at your discretion😉).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

train_data = pd.read_csv("../input/titanic/train.csv") 
test_data = pd.read_csv("../input/titanic/test.csv") 

y = train_data["Survived"]
features = ["Sex","Pclass", "SibSp", "Parch", "Age", "Fare", "Embarked"] # Select the features explicitly instead of dropping columns, since there are only a few of them

X_train = train_data[features]
X_test = test_data[features]
X_train.head() # Shows the first five lines of our selected data

## <a id = "5"> Deal with Missing values</a>
So, we have defined specific features. Some of them may contain *missing values*, which we will not necessarily see in the first few rows. The first step in processing the data is to fill in these gaps! (This comes after choosing the main features, a choice based on an analysis of all the data, for example with visualization tools, so that we can first assess the scale of the work ahead.) One more interesting fact is worth noting:
>Missing values are not necessarily lost data. Sometimes a gap indicates that the feature simply does not exist or cannot be obtained, which is itself an interesting pattern. For example, in a column called *Pets of passengers* with the values *Cats*, *Dogs* and *Parrots*, it would not be reasonable to fill the gaps with *Cats*: they may simply mean that the passenger has no pets 🤷‍♀️

Therefore, in this block we will **search for missing values** (more about how these lines of code work can be found in the micro-course [Intermediate-machine-learning. Missing-values](https://www.kaggle.com/alexisbcook/missing-values)):


In [None]:
# Study missing values:

missin_v_count_by_columns_train = (X_train.isnull().sum())
missin_v_count_by_columns_test = (X_test.isnull().sum())
print("Missing in train_data:", '\n', missin_v_count_by_columns_train[missin_v_count_by_columns_train > 0],
      '\nMissing in test_data:', '\n', missin_v_count_by_columns_test[missin_v_count_by_columns_test > 0])

Now we see that there are missing values in both the *training* and the *test data*, in the column with a *categorical variable* ('Embarked') as well as in *numerical* ones. That is why, when imputing values, we must process both the *training* and the *test data*; otherwise the model will train without errors, but you will get an error when you apply it to the test data.🤔 More about how to impute values can be found in the micro-course [Intermediate-machine-learning. Missing-values](https://www.kaggle.com/alexisbcook/missing-values). Note, however, that this course introduces *SimpleImputer*, where a single imputer applies one fill rule to every column it is given. I want to use a different rule for each column, so I use *.fillna* with a dictionary that maps each column name to its fill value. Now let's **impute the missing values**:

In [None]:
# Impute them right there:
X_train = X_train.fillna(value = {'Age': X_train['Age'].mean(), 'Fare': X_train['Fare'].median(), 'Embarked': 'N'})
X_test = X_test.fillna(value = {'Age': X_test['Age'].mean(), 'Fare': X_test['Fare'].median(), 'Embarked': 'N'})

In [None]:
# Let's check it out again:

missin_v_count_by_columns_train = (X_train.isnull().sum())
missin_v_count_by_columns_test = (X_test.isnull().sum())
print("Missing in train_data:", '\n', missin_v_count_by_columns_train[missin_v_count_by_columns_train > 0],
      '\nMissing in test_data:', '\n', missin_v_count_by_columns_test[missin_v_count_by_columns_test > 0])

Now there are no missing values😎
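
One variation on the *.fillna* approach that may be worth considering: compute the fill values once, from the training data, and reuse the same dictionary for both sets, so that the test data is imputed with exactly the rule the model saw during training. The sketch below is only an illustration; `toy_train`, `toy_test` and `fill_values` are hypothetical names on made-up data, not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_test (made-up values)
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0],
    "Fare": [7.0, 8.0, 100.0],
    "Embarked": ["S", np.nan, "C"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 30.0],
    "Fare": [np.nan, 12.0],
    "Embarked": ["Q", "S"],
})

# One rule per column, all derived from the training data only
fill_values = {
    "Age": toy_train["Age"].mean(),      # mean of the known training ages
    "Fare": toy_train["Fare"].median(),  # median of the training fares
    "Embarked": "N",                     # explicit "unknown port" category
}

# The same dictionary is applied to both frames
toy_train_filled = toy_train.fillna(value=fill_values)
toy_test_filled = toy_test.fillna(value=fill_values)

print(toy_test_filled)
```

Whether training statistics or each set's own statistics work better here is a judgment call; the point of the sketch is only that a single dictionary keeps the two sets consistent.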

## <a id = "6"> Deal with Categorical Variables</a>

As we noted earlier, the 'Embarked' column contains categorical data. The 'Sex' column also contains categorical values. More about how to deal with categorical variables can be found in the micro-course [Intermediate-machine-learning. Categorical Variables](https://www.kaggle.com/alexisbcook/categorical-variables). It presents several ways of processing categorical values: a simple *Drop* (drop such columns and ignore them), *LabelEncoding* and *OneHotEncoding*. To understand the difference between them, be sure to read that course. Empirically, I found that the best result in this case comes from applying *OneHotEncoding* to the 'Embarked' column and *LabelEncoding* to 'Sex'. However, to avoid the bulkier one-hot encoder setup, I use the pandas *.get_dummies* method instead (similarly, *.factorize()* can be an alternative to *LabelEncoding*). Now let's **preprocess Categorical Variables 'Sex'**:

In [None]:
label_encoder = LabelEncoder()
for col in ['Sex']:
    X_train[col] = label_encoder.fit_transform(X_train[col])
    X_test[col] = label_encoder.transform(X_test[col])
X_train.head()

Compared to the very first output, we have replaced the 'male' and 'female' values with '1' and '0' respectively using LabelEncoder(). Now let's apply *.get_dummies*. Please note that I use this method at the very end, when all the other data is already numerical, because *.get_dummies* turns every categorical column of the DataFrame into a set of indicator columns. **Preprocess Categorical Variables 'Embarked'**:

In [None]:
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train.head()

It was easy, wasn't it?😎 BUT wait ... let's check the test data:

In [None]:
X_test.head()

😨 Have you noticed? We got a different number of columns in the test and train data... our statistical model will not like that very much.
<br/>Guess why it happened? (The answer is below.)

When we checked for missing values, the 'Embarked' column turned out to have gaps in the *training data* (its unique values were *NaN*, *'C'*, *'Q'* and *'S'*), and we filled them with *'N'*. The test data had no such gaps (only *'C'*, *'Q'* and *'S'*). That is why, after one-hot encoding, the training data received an extra 'Embarked_N' column that the test data lacks.

It's time to use our knowledge of Pandas!😂 In short, we need to create that column in the test data, filled entirely with zeros, and also put it in the right place so the column order matches:

In [None]:
X_test['Embarked_N'] = 0 # Creating a zeros column

In [None]:
cols2 = X_test.columns.tolist() # List of column names
lista = [cols2[9]] # The newly added 'Embarked_N' column (last position)
cols2 = cols2[:7] + lista + cols2[7:9] # Move it between 'Embarked_C' and 'Embarked_Q' to match X_train
X_test = X_test[cols2] # Applying the new order
X_test.head()

Great, now everything is identical!🎉
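
The hard-coded positions above work for this exact set of features, but they would need to change if you add or remove columns. As an alternative sketch (not the approach used in this notebook), pandas' `reindex` can align the test columns to the training columns in one step, creating any missing column with zeros. `toy_train` and `toy_test` below are hypothetical, made-up frames used only for illustration.

```python
import pandas as pd

# Hypothetical toy frames: 'N' (the imputed placeholder) appears only in train
toy_train = pd.DataFrame({"Sex": [1, 0, 1, 0], "Embarked": ["S", "N", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": [0, 1, 1], "Embarked": ["Q", "S", "C"]})

toy_train_enc = pd.get_dummies(toy_train)
toy_test_enc = pd.get_dummies(toy_test)

# Columns present in train but absent in test after one-hot encoding
missing = set(toy_train_enc.columns) - set(toy_test_enc.columns)
print("Missing in test:", missing)

# Align test to train: same columns, same order, absent ones filled with 0
toy_test_aligned = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)
print(toy_test_aligned.columns.tolist())
```

In the notebook itself, the equivalent call would presumably be `X_test.reindex(columns=X_train.columns, fill_value=0)`. Keep in mind that `reindex` also drops any column that exists only in the test data.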

## <a id = "7"> Defining and testing a model</a>
In this block, we will split our training data once more, into a training part and a validation part. More details about this are discussed here: [Intro to Machine Learning. Model Validation](https://www.kaggle.com/dansbecker/model-validation). In short, models can overfit, and holding out part of the data lets us detect this. We fit each model on one part of the training data and check its accuracy on the other part. In the next section, we will refit the selected model on all the training data.

In [None]:
# Split Data
s_X_train, s_X_test, s_y_train, s_y_test = train_test_split(X_train, y, train_size=0.75, test_size=0.25, random_state=0)

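# Now, let's define and test different models! You can change and test any parameters here.
#1
# Note: this call style matches the XGBoost version available when the notebook was written.
# From XGBoost 2.0 on, early_stopping_rounds is passed to the constructor instead of fit().
model_1 = XGBClassifier(n_estimators = 10000, learning_rate = 0.5, use_label_encoder=False, eval_metric = 'logloss')
model_1.fit(s_X_train, s_y_train, early_stopping_rounds = 9, eval_set = [(s_X_test, s_y_test)], verbose = False)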
#2
model_2 = RandomForestClassifier(n_estimators = 550, max_depth = 6, random_state = 1)
model_2.fit(s_X_train, s_y_train)
#3
model_3 = DecisionTreeClassifier(max_leaf_nodes = 100, random_state = 1)
model_3.fit(s_X_train, s_y_train)

val_predictions_1 = model_1.predict(s_X_test) # Model_1 prediction
val_predictions_2 = model_2.predict(s_X_test) # Model_2 prediction
val_predictions_3 = model_3.predict(s_X_test) # Model_3 prediction

# Collect the validation accuracies in a dictionary (accuracy_score expects y_true first, then y_pred)
accuracy_results = {
    'XGBClassifier': accuracy_score(s_y_test, val_predictions_1),
    'RandomForestClassifier': accuracy_score(s_y_test, val_predictions_2),
    'DecisionTreeClassifier': accuracy_score(s_y_test, val_predictions_3),
}
accuracy_results
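
A single 75/25 split can give a noisy estimate on a dataset this small. The *cross-validation* technique from the Intermediate Machine Learning micro-course averages the score over several splits instead. Below is a minimal sketch using scikit-learn's `cross_val_score`; the data is synthetic (generated with `make_classification` purely as a stand-in, which is an assumption of this example), and in the notebook you would pass `X_train` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); replace with X_train and y in the notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A smaller forest than in the notebook, just to keep the example fast
model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1)

# 5-fold cross-validation: five fits, each scored on a different held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()

print("Accuracy per fold:", scores)
print("Mean accuracy:", mean_accuracy)
```

The mean over folds is usually a steadier basis for comparing models than one split, at the cost of fitting each model several times.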

## <a id = "8"> Submit to competition</a>
Based on the validation results (validation on 25% of the training data), we can see that the XGBClassifier model has the best result; however, I submit the random forest model.🙃

In [None]:
model_2.fit(X_train, y) # Fit model_2 on all train data
predictions = model_2.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)

# <a id = "9"> 4. Conclusion</a>
I hope you enjoyed this notebook. You are free to use it for your own solution. I would be extremely grateful for any feedback, recommendations, additions and suggestions, as well as for your opinion on the presentation.😊

["Worm-hole"](#stepend)💫