# Exploratory data analysis, visualization, machine learning



In [None]:
import numpy as np
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn')
sns.set(font_scale=2.5) 
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

#ignore warnings
import warnings
warnings.filterwarnings('ignore')



%matplotlib inline

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1 # add 1 to count the passenger themselves
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1 # add 1 to count the passenger themselves

df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

df_train['Fare'] = df_train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda i: np.log(i) if i > 0 else 0)

# 3. Feature Engineering

* Let's start the feature engineering.
* First, we fill in the null values that exist in the dataset.
* We shouldn't fill them with an arbitrary number. We can use statistics of the feature that contains the nulls, or come up with other ideas to fill them.
* This deserves attention because the performance of the model can depend on how the nulls are filled.
* Feature engineering exists to feed the actual model, so everything applied to train must be applied to test in the same way. Let's not forget that.

# 3.1 Fill Null

### 3.1.1 Fill Null in Age using title

* Age has 177 null values. There are many ways to fill them; here we use the title plus statistics.
* English names carry titles such as Miss, Mr and Mrs. Each passenger's name contains one, so we'll use it.
* A pandas Series has a str accessor that exposes string methods, and its extract method applies a regular expression. This makes it easy to extract the title. We save the title in the Initial column.
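To make the extraction step concrete, here is a minimal sketch on three made-up names that follow the same "Surname, Title. Given names" layout. The toy Series is an assumption for illustration; only the regular expression comes from this notebook.

```python
import pandas as pd

# Toy names (hypothetical sample) in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# The capture group grabs the run of letters right before the first period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

The surname is skipped because it is followed by a comma, not a period, so the first match is the title.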

In [None]:
df_train['Ticket'].value_counts()

In [None]:
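# Extract the title (salutation): the letters right before the first period.
# A raw string keeps the backslash in the regex; expand=False returns a Series.
df_train['Initial'] = df_train.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
df_test['Initial'] = df_test.Name.str.extract(r'([A-Za-z]+)\.', expand=False)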

* Let's use the pandas crosstab to look at the counts of each Initial we extracted, broken down by Sex.

In [None]:
#Checking the Initials with the Sex
pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap='summer_r') 


In [None]:
# Replace the rare titles in 'Initial' with one of the common ones.
# Assigning the result back avoids relying on a chained inplace replace.
rare_titles = ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don', 'Dona']
mapped_titles = ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr', 'Mr']

df_train['Initial'] = df_train['Initial'].replace(rare_titles, mapped_titles)
df_test['Initial'] = df_test['Initial'].replace(rare_titles, mapped_titles)

In [None]:
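# numeric_only=True skips the string columns (Name, Ticket, ...), which recent pandas refuses to average
df_train.groupby('Initial').mean(numeric_only=True)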


* Miss and Mrs, the titles associated with women, have a high survival rate.


In [None]:
df_train.groupby('Initial')['Survived'].mean().plot.bar(color=['black', 'red', 'green', 'blue', 'cyan'])

* Now we fill in the nulls. There are many ways to do this. Some use statistics, and some train a separate machine learning model on the rows without nulls and use its predictions to fill the gaps. Here we use statistics.
* Here, statistics means statistics of the train data. We should always treat test as unseen and fill its nulls with statistics obtained from train.
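As a rough sketch of the "statistics from train only" idea, the example below computes the mean Age per title on a tiny made-up train frame and uses it to fill both frames. The toy data and the name age_by_title are assumptions for illustration, not the notebook's actual values.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames (hypothetical); NaN marks a missing Age.
train = pd.DataFrame({'Initial': ['Mr', 'Mr', 'Miss', 'Miss', 'Mr'],
                      'Age': [30.0, 40.0, 20.0, np.nan, np.nan]})
test = pd.DataFrame({'Initial': ['Miss', 'Mr'],
                     'Age': [np.nan, 50.0]})

# Statistics come from train only: mean Age per title (NaN is ignored).
age_by_title = train.groupby('Initial')['Age'].mean()

# Fill the gaps in BOTH frames with the train-based means.
for df in (train, test):
    df['Age'] = df['Age'].fillna(df['Initial'].map(age_by_title))

print(train['Age'].tolist())
print(test['Age'].tolist())
```

The test frame never contributes to age_by_title, so it stays unseen in the sense described above.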

In [None]:
#Combine 'Train' and 'Test' data to populate the null values at once.
df_all = pd.concat([df_train, df_test])

In [None]:
df_all.reset_index(drop=True)

* We will fill the null values using the average Age of each title group.
* When handling a pandas dataframe, indexing with a Boolean array is very convenient.
* The first line of the fill code below reads: for the rows where Age is null and Initial is Mr, set Age to 33.
* The loc + Boolean mask + column pattern for replacing values is used often, so let's get used to it.

In [None]:
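# numeric_only=True skips the string columns, which recent pandas refuses to average
df_all.groupby('Initial').mean(numeric_only=True)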

In [None]:
# A Boolean mask like this one can be passed to loc
df_train['Survived'] == 1

In [None]:
# Passing the mask to loc returns only the rows where Survived == 1
df_train.loc[df_train['Survived'] == 1]

* Here we fill in the nulls in a simple way, with one loc assignment per title. Other kernels show more varied methods.

In [None]:
# Now we fill in the null values
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mr'),'Age'] = 33
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Mrs'),'Age'] = 36
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Master'),'Age'] = 5
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Miss'),'Age'] = 22
df_train.loc[(df_train.Age.isnull())&(df_train.Initial=='Other'),'Age'] = 46

df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mr'),'Age'] = 33
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Mrs'),'Age'] = 36
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Master'),'Age'] = 5
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Miss'),'Age'] = 22
df_test.loc[(df_test.Age.isnull())&(df_test.Initial=='Other'),'Age'] = 46

In [None]:
df_train['Age'].isnull().sum()

In [None]:
df_test['Age'].isnull().sum()

### 3.1.2 Fill Null in Embarked

In [None]:
df_train['Embarked']

In [None]:
df_train['Embarked'].isnull().sum()

* Embarked has two null values. Since S is the port where the most passengers boarded, we simply fill the nulls with S.
* The fillna method of the dataframe makes this easy. With inplace=True, fillna is applied to df_train itself.
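Instead of hard-coding the most common port, it can also be looked up. This is a small sketch on a made-up Series; the data is hypothetical, and mode() ignores nulls by default.

```python
import pandas as pd

# Made-up port codes (hypothetical) with one missing value.
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

# mode() returns the most frequent value(s); take the first one.
most_common = embarked.mode()[0]

# fillna returns a new Series with the gaps replaced.
filled = embarked.fillna(most_common)
print(most_common, filled.isnull().sum())
```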

In [None]:
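# Fill the nulls with the most frequent value.
# Calling fillna on the dataframe (not on a selected column) keeps inplace=True reliable.
df_train.fillna({'Embarked': 'S'}, inplace=True)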

In [None]:
df_train['Embarked'].isnull().sum()

# 3.2 Change Age(continuous to categorical)

**Two different ways. The second one is much simpler.**

* Age is currently a continuous feature. You can build a model with it as it is, but you can also divide Age into several groups and turn it into a category. Changing a continuous feature into a categorical one may lose information, but the purpose of this tutorial is to introduce various methods, so we will proceed.
* There are many ways to do this. You can do it directly with loc, the indexing method of the dataframe, or you can pass a function to apply.
* The first method uses loc. It is used frequently, so it's good to know how it works.
* We will divide Age into 10-year bins.
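For comparison, pandas also offers cut for this kind of binning. It is a third option that this notebook does not use; the sketch below assumes the same 10-year bins, with everything from 70 upward in the last bin, and uses made-up ages.

```python
import numpy as np
import pandas as pd

# Made-up ages (hypothetical), including values on the bin edges.
ages = pd.Series([5, 10, 29.5, 30, 69, 70, 80])

# Bin edges: [-inf, 10), [10, 20), ..., [60, 70), [70, inf)
bins = [-np.inf, 10, 20, 30, 40, 50, 60, 70, np.inf]

# right=False makes each bin include its left edge and exclude its right edge;
# labels=False returns the integer bin index instead of an interval label.
age_cat = pd.cut(ages, bins=bins, right=False, labels=False)
print(age_cat.tolist())
```

Checking values that sit exactly on an edge (10, 30, 70) is a quick way to confirm the bins match the loc-based version.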

In [None]:
df_train['Age_cat'] = 0

In [None]:
df_train.head()

In [None]:
df_train.loc[df_train['Age'] <10, 'Age_cat'] = 0
df_train.loc[(10 <= df_train['Age'])&(df_train['Age']<20), 'Age_cat'] =1
df_train.loc[(20 <= df_train['Age'])&(df_train['Age']<30), 'Age_cat'] =2
df_train.loc[(30 <= df_train['Age'])&(df_train['Age']<40), 'Age_cat'] =3
df_train.loc[(40 <= df_train['Age'])&(df_train['Age']<50), 'Age_cat'] =4
df_train.loc[(50 <= df_train['Age'])&(df_train['Age']<60), 'Age_cat'] =5
df_train.loc[(60 <= df_train['Age'])&(df_train['Age']<70), 'Age_cat'] =6
df_train.loc[(70 <= df_train['Age']), 'Age_cat'] = 7

In [None]:
df_train.head()

In [None]:
# Build the masks from df_test's own Age values, not df_train's
df_test.loc[df_test['Age'] <10, 'Age_cat'] = 0
df_test.loc[(10 <= df_test['Age'])&(df_test['Age']<20), 'Age_cat'] =1
df_test.loc[(20 <= df_test['Age'])&(df_test['Age']<30), 'Age_cat'] =2
df_test.loc[(30 <= df_test['Age'])&(df_test['Age']<40), 'Age_cat'] =3
df_test.loc[(40 <= df_test['Age'])&(df_test['Age']<50), 'Age_cat'] =4
df_test.loc[(50 <= df_test['Age'])&(df_test['Age']<60), 'Age_cat'] =5
df_test.loc[(60 <= df_test['Age'])&(df_test['Age']<70), 'Age_cat'] =6
df_test.loc[(70 <= df_test['Age']), 'Age_cat'] = 7

In [None]:
df_test.head()

* The second way is to create a simple function and put it into the apply method.
* It's much easier.

In [None]:
def category_age(x):
    if x < 10:
        return 0
    elif x < 20:
        return 1
    elif x < 30:
        return 2
    elif x < 40:
        return 3
    elif x < 50:
        return 4
    elif x < 60:
        return 5
    elif x < 70:
        return 6
    else:
        return 7    
    


* If the two methods are applied well, both should produce the same results.
* To verify this, use the all() method after comparing Boolean between Series. The all() method gives True if all values are True, False if any are False.
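The difference between all() and any() matters for this check. A minimal sketch with two made-up Series that differ in one position:

```python
import pandas as pd

# Two made-up Series (hypothetical) that agree everywhere except the last value.
a = pd.Series([0, 1, 2])
b = pd.Series([0, 1, 3])

same = (a == b)      # element-wise comparison gives a Boolean Series

print(same.all())    # True only if every pair matches
print(same.any())    # True as soon as a single pair matches
```

Here any() would report True even though the Series are not identical, so all() is the one to use when verifying that two columns are the same.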

In [None]:
df_train['Age_cat_2'] = df_train['Age'].apply(category_age)

In [None]:
# Compare the two columns element by element; all() returns True only if every value is True
(df_train['Age_cat'] == df_train['Age_cat_2']).all()

* As you can see, it's True. You can choose either method.
* Now we will remove the duplicate Age_cat_2 column and the original Age column.

In [None]:
df_train.drop(['Age', 'Age_cat_2'], axis=1, inplace=True)
df_test.drop(['Age'], axis=1, inplace=True)

# 3.3 Change Initial, Embarked and Sex (string to numerical)

* Currently, Initial consists of five values: Mr, Mrs, Miss, Master, and Other. Before data expressed as categories like these is fed to a model, it has to be converted to numbers so that the computer can work with it.
* The map method does this simply.
* We decide the mapping in advance and apply it.
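One thing worth knowing about map with a dictionary: any value missing from the dictionary becomes NaN. The sketch below uses a made-up Series with a title ('Dr') deliberately left out of the mapping to show this.

```python
import pandas as pd

# Made-up titles (hypothetical); 'Dr' is deliberately missing from the mapping.
titles = pd.Series(['Mr', 'Miss', 'Dr'])

codes = titles.map({'Mr': 2, 'Miss': 1})

# Unmapped values turn into NaN, so it is worth counting them after mapping.
print(codes.isnull().sum())
```

A quick isnull().sum() after each map call catches categories that were forgotten in the dictionary.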

In [None]:
df_train['Initial'] = df_train['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})

* Embarked consists of C, Q, and S. Let's change it using map as well.
* Before we do that, let's take a quick look at how to see which values a particular column contains. You can call unique(), or use value_counts() to also see the counts.

In [None]:
df_train['Embarked'].unique()

In [None]:
df_train['Embarked'].value_counts()

* You can see that Embarked consists of three values: S, C, and Q. Now, let's use map.

In [None]:
df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

In [None]:
df_train['Embarked']

* Let's check that the nulls are gone. Selecting only the Embarked column gives a pandas Series, so you can call isnull() to get a Boolean for each value telling whether it is null. Calling any() on that returns True if there is even a single True (a single null). We get False because we replaced the nulls with S.

In [None]:
df_train['Embarked'].isnull().any()

* Sex also has two values, female and male. Let's change it using map.

In [None]:
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})

* Now let's look at the correlation between the features. The Pearson correlation between two variables takes a value in [-1, 1]: values near -1 mean a negative correlation, values near 1 mean a positive correlation, and 0 means no linear correlation. The formula is as follows, where $S_x$ and $S_y$ are the sample standard deviations.

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{S_x S_y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{S_x S_y}$$

* We have many features, so it is convenient to see them in the form of a matrix. This is called a heatmap plot, and you can draw it easily with the corr() method of the dataframe and seaborn.
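To see what the Pearson correlation measures, here is a small hand-written version in plain Python. The function name pearson and the sample lists are assumptions for illustration; in the notebook itself, corr() does this work.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors of the covariance and the standard deviations cancel,
    # so plain sums of deviations are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly increasing together
print(pearson([1, 2, 3], [6, 4, 2]))   # one goes up while the other goes down
```

The first pair gives a value of 1 and the second a value of -1 (up to floating-point rounding), the two extremes of the range.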

In [None]:
heatmap_data = df_train[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']] 

colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(heatmap_data.astype(float).corr(), linewidths=0.1, vmax=1.0,
           square=True, cmap=colormap, linecolor='white', annot=True, annot_kws={"size": 16})

del heatmap_data

* As we saw in the EDA, Sex and Pclass are somewhat correlated with Survived.
* Fare and Embarked are also more correlated than you might expect.
* What we learn from this is that no features are strongly correlated with each other.
* This means that when we train a model, there is no redundant (superfluous) feature. If two features A and B had a correlation of 1 or -1, they would effectively give us only one piece of information.
* Now, before we actually train the model, let's do the data preprocessing. We're almost there. Let's keep going!

# 3.4 One-hot encoding on Initial and Embarked

* You can feed the numerically encoded categories to the model as they are, but one-hot encoding can improve model performance.
* Numerical encoding simply means the mapping Master == 0, Miss == 1, Mr == 2, Mrs == 3, Other == 4.
* One-hot encoding represents each of these categories as a five-dimensional vector of 0s and 1s, for example Master as (1, 0, 0, 0, 0) and Other as (0, 0, 0, 0, 1).
* You could code this by hand, but get_dummies in pandas does it easily.
* There are 5 categories in total, so one-hot encoding creates 5 new columns.
* We pass prefix='Initial' so the new columns are easy to tell apart.
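A small sketch of what get_dummies produces, on a made-up frame that contains only three of the five codes. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Made-up frame (hypothetical) that happens to contain only codes 0, 2 and 4.
df = pd.DataFrame({'Initial': [0, 2, 4]})

encoded = pd.get_dummies(df, columns=['Initial'], prefix='Initial')
print(encoded.columns.tolist())
```

Note that get_dummies only creates columns for the values that actually occur in the frame it is given, which is why train and test should be checked for matching columns afterwards.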

In [None]:
df_train = pd.get_dummies(df_train, columns=['Initial'], prefix='Initial')
df_test = pd.get_dummies(df_test, columns=['Initial'], prefix='Initial')

In [None]:
df_train.head()

* As you can see, on the right you can see the one-hot encoded columns that we were trying to create.
* I'll apply it to Embarked as well. I will express it using one-hot encoding just like the initial.

In [None]:
df_train = pd.get_dummies(df_train, columns=['Embarked'], prefix='Embarked')
df_test = pd.get_dummies(df_test, columns=['Embarked'], prefix='Embarked')


* We applied one-hot encoding very easily.
* One-hot encoding is also possible with LabelEncoder + OneHotEncoder from sklearn.
* Sometimes there are more than 100 categories. One-hot encoding then produces more than 100 columns, which can make learning very hard. In that case, a different method is used.

# 3.5 Drop columns
* It's time to clean up the desk. Let's drop all the columns we don't need.

In [None]:
df_train.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_test.drop(['PassengerId', 'Name',  'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [None]:
df_train.head()

In [None]:
df_test.head()

* As you can see, if you take out the Survived Feature (target class) of the train, you can see that both train and test have the same columns.
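Rather than comparing the columns by eye, the match can be checked and, if needed, enforced in code. This is a sketch on made-up frames where test lacks one dummy column; the frames and the name feature_cols are hypothetical.

```python
import pandas as pd

# Made-up frames (hypothetical): test is missing the Initial_4 dummy column.
train = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1],
                      'Initial_0': [1, 0], 'Initial_4': [0, 1]})
test = pd.DataFrame({'Pclass': [2], 'Initial_0': [1]})

# The feature columns are every train column except the target.
feature_cols = train.columns.drop('Survived')

# reindex puts test into the same column order and fills missing dummies with 0.
test_aligned = test.reindex(columns=feature_cols, fill_value=0)
print(test_aligned.columns.tolist())
```

In this notebook both frames already line up, so a check like list(df_test.columns) == list(feature_cols) would be enough.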

# 4 Building machine learning model and prediction using the trained model

* Now that we're ready, let's use sklearn to create a machine learning model.

In [None]:
#importing all the required ML packages
from sklearn.ensemble import RandomForestClassifier 
from sklearn import metrics 
from sklearn.model_selection import train_test_split 

* Sklearn covers machine learning from beginning to end. Feature engineering, preprocessing, supervised learning algorithms, unsupervised learning algorithms, model evaluation, and pipelines are all implemented behind easy interfaces. If you want to do data analysis + machine learning, you should become familiar with this library.
* The Titanic problem is a binary classification problem because the target class (Survived) consists of 0 and 1.
* We fit the model on the train set, using every column except Survived as input, so that it learns to determine the survival of each sample (passenger).
* Then we give the model the test set, which it has not learned from, and predict the survival of each sample (passenger) in it.

# 4.1 Preparation - Split dataset into train, valid, test set
* First, separate the target label (Survived) from the data that will be used for the learning. You can do it simply by using drop.

In [None]:
X_train = df_train.drop('Survived', axis=1).values
target_label = df_train['Survived'].values
X_test = df_test.values

* Usually only train and test are mentioned, but to build a good model we set aside a separate validation (valid) set and evaluate the model on it.
* A soccer team does not go to the World Cup right after team training. It first plays evaluation matches (valid) to check its training level (learning level), and only then goes to the World Cup (test).
* train_test_split makes it easy to split a validation set off the train set.
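Here is a minimal sketch of the split on made-up arrays. The stratify argument is an addition that this notebook does not use; it keeps the class ratio the same in both parts, which can help when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data (hypothetical): 10 samples, 2 features, balanced 0/1 labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 30% of the rows go to the validation set; random_state makes the split repeatable.
X_tr, X_vld, y_tr, y_vld = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y)

print(X_tr.shape, X_vld.shape)
```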

In [None]:
X_tr, X_vld, y_tr, y_vld = train_test_split(X_train, target_label, test_size=0.3, random_state=2018)

* Sklearn supports several machine learning algorithms.
* We will use the Random Forest model in this tutorial.
* Random Forest is a decision-tree-based model: an ensemble of many different decision trees.
* Each machine learning algorithm has several parameters. The random forest classifier has n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, and so on. Depending on how these are set, the performance of the model on the same dataset changes.
* Parameter tuning requires time, experience, and an understanding of the algorithm. In the end, you have to use it a lot to build good models.
* Since this is a tutorial, let's set parameter tuning aside for now and proceed with the default settings.
* Create a model object and train it with the fit method.
* Then pass in the valid set to obtain the predicted values (whether or not each passenger survived).

# 4.2 Model generation and prediction

In [None]:
model = RandomForestClassifier()
model.fit(X_tr, y_tr)
prediction = model.predict(X_vld)

* With just three lines, you've built a model and made predictions.
* Now, let's take a look at the performance of the model.

In [None]:
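# The format string needs one placeholder per argument: the accuracy and the number of validation passengers
print('{:.2f}% accuracy on {} validation passengers'.format(100 * metrics.accuracy_score(y_vld, prediction), y_vld.shape[0]))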

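We didn't tune any parameters, but we got about 82% accuracy. The exact figure varies slightly between runs because the forest is built with random sampling and no random_state is fixed.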

# 4.3 Feature importance
* A trained model has feature importances, which show which features influenced the model the most.
* Simply put, in $10 = 4x_1 + 2x_2 + 1x_3$ we can think of $x_1$ as having a big impact on the result (10). Feature importance corresponds to the 4, 2, and 1, and since $x_1$ has the largest value (4), it has the greatest impact on this model.
* The trained model exposes its feature importances, so you can easily get those numbers.
* With a pandas Series, you can easily sort them and draw a graph.

In [None]:
from pandas import Series

feature_importance = model.feature_importances_
Series_feat_imp = Series(feature_importance, index=df_test.columns)

In [None]:
plt.figure(figsize=(8, 8))
Series_feat_imp.sort_values(ascending=True).plot.barh()
plt.xlabel('Feature importance')
plt.ylabel('Feature')
plt.show()

* In the model we obtained, Fare has the greatest influence, followed by Initial_2, Age_cat, and Pclass.
* Feature importance represents importance in the current model only. With a different model, the feature importances may come out differently.
* From this you might conclude that Fare really is an important feature, but that conclusion is ultimately tied to this model, so it should be checked further with statistics.
* With feature importance, you can perform feature selection to obtain a more accurate model, or remove features for a faster model.
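One way to cross-check a model's built-in importances is permutation importance from sklearn.inspection, which this notebook does not use. The sketch below runs it on synthetic data where, by construction, only the first column determines the label; the data and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (hypothetical): only column 0 determines the label.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column several times and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```

A feature whose shuffling barely changes the score contributes little to this particular model. For a fairer estimate, the same call can be made on a held-out validation set instead of the training data.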

# 4.4 Prediction on Test set
* Now let's give the model the test set, which it has not learned from (has not seen), and predict survival.
* This result is what we actually submit, and the score can be checked on the leaderboard.
* We read the gender_submission.csv file provided by Kaggle and prepare the submission.

In [None]:
submission = pd.read_csv('../input/titanic/gender_submission.csv')

In [None]:
submission.head()

* Now let's make a prediction for the testset and save the results as a csv file.

In [None]:
prediction = model.predict(X_test)
submission['Survived'] = prediction

In [None]:
submission.to_csv('./my_first_submission.csv', index=False)
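The submission frame can also be built directly, without reading the sample file. This sketch uses made-up IDs and predictions; the only assumption about the format is the two-column layout, PassengerId and Survived, that gender_submission.csv uses above.

```python
import numpy as np
import pandas as pd

# Made-up IDs and predictions (hypothetical), standing in for the test set.
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
prediction = np.array([0, 1, 0])

submission = pd.DataFrame({'PassengerId': passenger_ids,
                           'Survived': prediction.astype(int)})

# to_csv without a path returns the CSV text, handy for a quick look before saving.
csv_text = submission.to_csv(index=False)
print(csv_text)
```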