# Table of Contents
**1. Dataset Check**
**2. EDA**
**3. Pclass (ordinal, categorical)**
**4. Sex**
**5. Both Sex and Pclass**
**6. Age**
**7. Pclass, Sex, Age**
**8. Embarked**
**9. Family - SibSp + Parch**
**10. Fare**
**11. Cabin**
**12. Ticket**

## Import Data 

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn')
sns.set(font_scale=2.5)  # scale the fonts in all plots by a factor of 2.5

import missingno as msno  # visualizes missing (null) data in a DataFrame

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline  

## 1. Dataset Check

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_test.describe()

### Check Null data
1) Check the train data first

In [None]:
# First check the null data in the train and test files -> null values have to be filled in
for col in df_train.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (df_train[col].isnull().sum() / df_train[col].shape[0]))
    print(msg)

We can see that **'Age'**, **'Ticket'**, **'Fare'**, **'Cabin'**, and **'Embarked'** contain null data.

2) Next, check the test data

In [None]:
for col in df_test.columns:
    msg = 'column: {:>10}\t Percent of NaN value: {:.2f}%'.format(col, 100 * (df_test[col].isnull().sum() / df_test[col].shape[0]))
    print(msg)

* We can see that **'Age'**, **'Ticket'**, **'Fare'**, **'Cabin'**, and **'Embarked'** contain null data.
* The same columns have null data in both the train and test sets.
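
The per-column loop above can also be written without a loop. This is a minimal sketch on a made-up toy frame (the values and the `toy` name are assumptions, not the competition data); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Sex': ['male', 'female', 'female', 'male'],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True
null_pct = toy.isnull().mean() * 100

# Keep only the columns that actually contain nulls, largest share first
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct.round(2))
```

Applied to `df_train` or `df_test`, the same two lines should give one sorted Series instead of a block of printed strings.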

In [None]:
# `col` still holds the last column from the loop above; it has 250 null values in the train data
df_train[col].isnull().sum()

In [None]:
# To get the share of null data, divide the null count by the total number of rows
df_train[col].isnull().sum() / df_train[col].shape[0]

In [None]:
# .iloc[] selects by integer position; [:, :] takes every row and every column
# the blank gaps in the matrix are the null values
msno.matrix(df=df_train.iloc[:, :],figsize=(8,8),color=(0.8,0.5,0.2))

In [None]:
# Another way to find null data: a bar chart of the non-null share per column
msno.bar(df=df_train.iloc[:, :],figsize=(8,8),color=(0.8,0.5,0.2))

### Conclusion:
We found that the data contains null values. Next, we look at the target label and check what kind of distribution it has.

How we evaluate the model depends on how balanced the target label is, and so does the way we build the model. So we have to check the distribution of the target first.
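
Before plotting, the balance of the target can be put into numbers. The sketch below uses a made-up `survived` Series (an assumption standing in for `df_train['Survived']`) and a simple minority/majority ratio, which is just one possible way to summarize balance.

```python
import pandas as pd

# Hypothetical target column; in the notebook this would be df_train['Survived'].
survived = pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], name='Survived')

# normalize=True returns class shares instead of raw counts
class_share = survived.value_counts(normalize=True)

# Ratio of the rarest class to the most common one: 1.0 means perfectly balanced
balance_ratio = class_share.min() / class_share.max()
print(class_share)
print('minority/majority ratio: {:.2f}'.format(balance_ratio))
```

A ratio close to 1 suggests plain accuracy is a reasonable metric; a ratio far below 1 is a hint to look at metrics that account for imbalance.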

In [None]:
f, ax = plt.subplots(1,2,figsize=(18,8))  

# explode: offset each wedge from the center of the pie
# autopct: format string for the percentage label on each wedge
# ax[0], ax[1]: which of the two subplots to draw on

df_train['Survived'].value_counts().plot.pie(explode=[0,0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')  # hide the y-axis label
sns.countplot(x='Survived', data=df_train, ax=ax[1])  # count the rows for each Survived value in df_train
ax[1].set_title('Count plot - Survived')
plt.show()

# The result shows that the target is fairly balanced

In [None]:
# value_counts() returns a Series, and every Series has a .plot() method
df_train['Survived'].value_counts().plot()

## 2. EDA
EDA is about finding correlations between features. By doing this, we can gain strong insight into which features should be used. We also need to build the ability to interpret plots.

In [None]:
# The train data has 12 columns.
df_train.shape

### 2.1 Pclass (ordinal, categorical)

In [None]:
df_train[['Pclass','Survived']].groupby(['Pclass'], as_index=True).count()

In [None]:
df_train[['Pclass','Survived']].groupby(['Pclass']).sum()

In [None]:
pd.crosstab(df_train['Pclass'],df_train['Survived'], margins=True).style.background_gradient(cmap='cool')

In [None]:
# What is the survival rate for each class?
# as_index=False keeps Pclass as a regular column, so plot.bar() draws it as a second series next to Survived;
# with as_index=True only Survived would be plotted, indexed by Pclass

df_train[['Pclass','Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False).plot.bar()

In [None]:
y_position = 1.02
f, ax = plt.subplots(1,2,figsize=(18,8))
df_train['Pclass'].value_counts().plot.bar(color=['#CD7F32', '#FFDF00', '#D3D3D3'], ax=ax[0])
ax[0].set_title('Number of passengers By Pclass', y=y_position)
ax[0].set_ylabel('Count')
sns.countplot(x='Pclass', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Pclass: Survived vs Dead', y=y_position)
plt.show()

### Conclusion:
The higher the class, the higher the probability of survival.

Pclass is therefore likely to be a useful input feature when we build the model.

### 2.2 Sex

In [None]:
f, ax = plt.subplots(1,2,figsize=(18,8))
df_train[['Sex','Survived']].groupby(['Sex'],as_index=True).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot(x='Sex', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('Sex: Survived vs Dead')
plt.show()

In [None]:
df_train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
pd.crosstab(df_train['Sex'], df_train['Survived'], margins=True).style.background_gradient(cmap='summer_r')

### Conclusion:
Like Pclass, Sex is an important feature for predictive models.

### 2.3 Both Sex and Pclass
* Now, let's see how survival changes with respect to two things: Sex and Pclass.
* With seaborn's catplot (called factorplot in older versions), you can easily draw a graph with three dimensions.

In [None]:
sns.catplot(x='Pclass', y='Survived', hue='Sex', data=df_train, kind='point', height=6, aspect=1.5)

* You can see that women are more likely to survive than men in every class.
* Also, for both men and women, the higher the class, the higher the probability of survival.
* Using col instead of hue splits the same graph into one panel per class, as below.

In [None]:
# Another way to show it: one panel per class
sns.catplot(x='Sex', y='Survived', col='Pclass', data=df_train, kind='point', height=9, aspect=1)
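
The Sex and Pclass breakdown plotted above can also be read off as a table. This is a small sketch on made-up rows (the `toy` frame is an assumption, not the competition data) using `pivot_table`, where the mean of the 0/1 `Survived` column is the survival rate of each cell.

```python
import pandas as pd

# Made-up rows standing in for df_train
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female', 'male', 'female', 'male', 'female'],
    'Pclass':   [1, 3, 1, 3, 1, 3, 3, 1],
    'Survived': [1, 0, 1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate of each (Sex, Pclass) cell
rate_table = toy.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')
print(rate_table)
```

A table like this makes it easy to compare exact rates that are hard to read from a point plot.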

### 2.4 Age 

In [None]:
print('oldest passenger : {:.1f} Years'.format(df_train['Age'].max()))
print('youngest passenger : {:.1f} Years'.format(df_train['Age'].min()))
print('passenger average age : {:.1f} Years'.format(df_train['Age'].mean()))

In [None]:
# Draw the Age distribution (KDE) for survivors and non-survivors.
fig, ax = plt.subplots(1,1,figsize=(9,5))
sns.kdeplot(df_train[df_train['Survived']==1]['Age'],ax=ax)
sns.kdeplot(df_train[df_train['Survived']==0]['Age'],ax=ax)
plt.legend(['Survived == 1', 'Survived == 0'])
plt.show()

Interesting! Unlike the Titanic dataset, the 40-60 age group has the highest survival rate here!

In [None]:
# Age distribution within each class; a histogram makes it easy to see
plt.figure(figsize=(18,16))
df_train['Age'][df_train['Pclass']==1].plot(kind='hist')
df_train['Age'][df_train['Pclass']==2].plot(kind='hist')
df_train['Age'][df_train['Pclass']==3].plot(kind='hist')

plt.xlabel('Age')
plt.title('Age Distribution within classes')
plt.legend(['1st Class', '2nd Class', '3rd Class'])

* The higher the class, the greater the proportion of older people.
* Next, let's see how survival changes with age.
* We'll then widen the age range step by step and see how the survival rate changes.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(9,5))
sns.kdeplot(df_train[(df_train['Survived']==0)&(df_train['Pclass']==1)]['Age'],ax=ax)
sns.kdeplot(df_train[(df_train['Survived']==1)&(df_train['Pclass']==1)]['Age'],ax=ax)
plt.legend(['Survived == 0', 'Survived == 1'])
plt.title('1st Class')
plt.show()

In [None]:
fig, ax = plt.subplots(1,1,figsize=(9,5))
sns.kdeplot(df_train[(df_train['Survived']==0)&(df_train['Pclass']==2)]['Age'],ax=ax)
sns.kdeplot(df_train[(df_train['Survived']==1)&(df_train['Pclass']==2)]['Age'],ax=ax)
plt.legend(['Survived == 0', 'Survived == 1'])
plt.title('2nd Class')
plt.show()

In [None]:
fig, ax = plt.subplots(1,1,figsize=(9,5))
sns.kdeplot(df_train[(df_train['Survived']==0)&(df_train['Pclass']==3)]['Age'],ax=ax)
sns.kdeplot(df_train[(df_train['Survived']==1)&(df_train['Pclass']==3)]['Age'],ax=ax)
plt.legend(['Survived == 0', 'Survived == 1'])
plt.title('3rd Class')
plt.show()

Only **1st Class** shows a higher survival rate in the 40-60 age range, in contrast to 2nd and 3rd class.

Let's check out the **survival rate** also!!

In [None]:
cumulative_survival_ratio = []
# Survival rate among passengers younger than i, for each age cutoff i, to show the trend

for i in range(1, 80):
    cumulative_survival_ratio.append(df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived']))

plt.figure(figsize=(7, 7))
plt.plot(cumulative_survival_ratio)
plt.title('Survival rate change depending on range of Age', y=1.02)
plt.ylabel('Survival rate')
plt.xlabel('Range of Age(0~x)')
plt.show()

* As you can see, the survival rate is high for the youngest passengers and rises again as older passengers are included.
* This confirms that Age can be used as an important feature.
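
Each point of the cumulative curve mixes in every younger passenger, so it can help to look at separate age bands next to it. The sketch below is an illustration on made-up rows; the `toy` frame, the cutoffs, and the band edges are all assumptions. It computes the cumulative rate the same way as above and, for comparison, a per-band rate with `pd.cut`.

```python
import pandas as pd

# Made-up rows standing in for df_train
toy = pd.DataFrame({
    'Age':      [5, 12, 25, 33, 41, 48, 55, 62],
    'Survived': [1, 0, 0, 0, 1, 1, 1, 0],
})

# Cumulative rate: share of survivors among passengers younger than each cutoff
cutoffs = [20, 40, 60, 80]
cumulative_rate = {c: toy.loc[toy['Age'] < c, 'Survived'].mean() for c in cutoffs}

# Binned rate: share of survivors inside each age band [lower, upper)
bands = pd.cut(toy['Age'], bins=[0, 20, 40, 60, 80], right=False)
binned_rate = toy.groupby(bands, observed=False)['Survived'].mean()

print(cumulative_rate)
print(binned_rate)
```

In this toy data the 40-60 band stands out clearly in the binned view, while the cumulative view only shows a gradual rise.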

### 2.5 Pclass,Sex,Age
* I'd like to look at Sex, Pclass, Age, and Survived all together. An easy way to draw this is seaborn's violinplot.
* The x-axis is the category we want to split by, and the y-axis is the distribution (Age) we want to see.
* I'll draw it.

In [None]:
f, ax = plt.subplots(1,2,figsize=(18,8))
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=df_train, scale='count', split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))

sns.violinplot(x='Sex', y='Age', hue='Survived', data=df_train, scale='count', split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

* The figure on the left shows how the Age distribution varies by Pclass and by survival.
* The figure on the right shows the same for Sex.
* We can't find a clear correlation between Age and survival rate, which is different from the original Titanic dataset.
* In the figure on the right, you can clearly see that many women survived.
* You can see that they took care of women and children first.

### 2.6 Embarked 
* Embarked is the port where the passenger boarded.
* As we did above, we'll look at the survival rate by port of embarkation.

In [None]:
df_train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=True).mean().sort_values(by='Survived')

In [None]:
f, ax = plt.subplots(1,1,figsize=(7,7))
df_train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=True).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax)

* As you can see, the survival rate differs by port. C is the highest, at about twice the survival rate of S.
* I don't know how much impact it will have on the model, but I'll still use it.
* In fact, once we've built a model, we can check how important each feature turned out to be. We will look at this later, after we make the model.
* Let's split by other features and take a look.

In [None]:
f,ax=plt.subplots(2, 2, figsize=(20,15))

sns.countplot(x='Embarked', data=df_train, ax=ax[0,0])
ax[0,0].set_title('(1) No. Of Passengers Boarded')

sns.countplot(x='Embarked', hue='Sex', data=df_train, ax=ax[0,1])
ax[0,1].set_title('(2) Male-Female Split for Embarked')

sns.countplot(x='Embarked', hue='Survived', data=df_train, ax=ax[1,0])
ax[1,0].set_title('(3) Embarked vs Survived')

sns.countplot(x='Embarked', hue='Pclass', data=df_train, ax=ax[1,1])
ax[1,1].set_title('(4) Embarked vs Pclass')

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

* Figure (1) - Overall, S has the largest number of passengers.
* Figure (2) - C and Q have a higher proportion of women, while S has more men.
* Figure (3) - The survival rate for S is very low, as we saw in the previous graph. For C it is very high, which I think is because C has a higher proportion of women.
* Figure (4) - Split by class, C has a high survival rate because many of its passengers are in 1st class, while S has a low one because many are in 3rd class.
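
The count plots compare raw counts, which makes proportions hard to judge when the ports differ a lot in size. A minimal sketch, on made-up rows (the `toy` frame is an assumption), of putting the share of each sex per port next to the port's survival rate:

```python
import pandas as pd

# Made-up rows standing in for df_train
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'female', 'female', 'male'],
    'Survived': [0, 0, 1, 1, 1, 1, 0, 0],
})

# normalize='index' turns each row (port) into shares that sum to 1
sex_share = pd.crosstab(toy['Embarked'], toy['Sex'], normalize='index')

# Survival rate per port, aligned on the Embarked index and added as a column
survival_rate = toy.groupby('Embarked')['Survived'].mean()
summary = sex_share.assign(survival_rate=survival_rate)
print(summary)
```

Swapping `toy['Sex']` for a class column would give the same kind of table for the class mix per port.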

### 2.7 Family - SibSp + Parch
If you combine SibSp and Parch, it will be Family. Let's combine them into Family.

In [None]:
# Both columns are numeric counts, so we can simply add them
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1  # add 1 to include the passenger themselves
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1  # add 1 to include the passenger themselves

In [None]:
print('Maximum size of Family:', df_train['FamilySize'].max())
print('Minimum size of Family:', df_train['FamilySize'].min())

Let's take a look at the relationship between Family Size and survival.

In [None]:
f,ax=plt.subplots(1, 3, figsize=(40,10))
sns.countplot(x='FamilySize', data=df_train, ax=ax[0])
ax[0].set_title('(1) No. Of Passengers Boarded', y=1.02)

sns.countplot(x='FamilySize', hue='Survived', data=df_train, ax=ax[1])
ax[1].set_title('(2) Survived countplot depending on FamilySize',  y=1.02)

df_train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=True).mean().sort_values(by='Survived', ascending=False).plot.bar(ax=ax[2])
ax[2].set_title('(3) Survived rate depending on FamilySize',  y=1.02)

plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

* Figure (1) - You can see that the family size ranges from 1 to 18. Most passengers travel alone, followed by families of two, three, and four.
* Figure (2), (3) - Only a family size of 2 shows a higher survival rate. There seems to be little correlation between family size and survival rate; survival looks essentially unrelated to family size.
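
Since the same FamilySize feature is needed on both the train and test frames, one option is to wrap it in a small function so that each frame only ever uses its own columns. This is a sketch, not the notebook's method: `add_family_size` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd

# Made-up frames standing in for df_train and df_test
toy_train = pd.DataFrame({'SibSp': [0, 1, 1, 3], 'Parch': [0, 0, 2, 2], 'Survived': [0, 1, 1, 0]})
toy_test = pd.DataFrame({'SibSp': [0, 2], 'Parch': [1, 0]})

def add_family_size(df):
    """Hypothetical helper: return a copy with FamilySize = SibSp + Parch + 1."""
    out = df.copy()
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1  # +1 counts the passenger
    return out

toy_train = add_family_size(toy_train)
toy_test = add_family_size(toy_test)

# Survival rate for each family size in the train frame
rate_by_size = toy_train.groupby('FamilySize')['Survived'].mean()
print(rate_by_size)
```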

### 2.8 Fare
Fare is the boarding fee and a continuous feature. I'll draw a histogram.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(8,8))
g = sns.distplot(df_train['Fare'], color='b', label='skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

* As you can see, the distribution is very asymmetric (highly skewed). If we feed it to the model as is, the model may learn it poorly. A model that is too sensitive to a few outliers can give bad results on real predictions.
* To reduce the impact of outliers, we will take the log of Fare.
* Here we use a handy pandas feature. To apply the same operation (function) to every value in a particular DataFrame column, use map or apply, as below.
* We want to take the log of every value in the Fare column. Passing a simple Python lambda that applies the log to map calls it with each Fare value as its argument. It's a very useful feature, so make sure you understand it!
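
To make the map and lambda pattern concrete, here is a small sketch on a made-up `fare` Series (the values are assumptions). It applies the same guarded log element by element and, as an alternative that is not used in this notebook, shows NumPy's `np.log1p`, which computes log(1 + x) and is therefore defined at zero.

```python
import numpy as np
import pandas as pd

# Made-up fares, including a zero and one large outlier
fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.55, 71.28, 512.33], name='Fare')

# map calls the lambda once per value; zero fares stay at 0 instead of -inf
log_fare = fare.map(lambda x: np.log(x) if x > 0 else 0)

# Alternative: log(1 + x) on the whole Series at once, no special case for 0
log1p_fare = np.log1p(fare)

print('skew before: {:.2f}'.format(fare.skew()))
print('skew after map/log: {:.2f}'.format(log_fare.skew()))
```

The two transforms give different values, so whichever one is chosen has to be applied to both the train and test data.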

In [None]:
# Replace the NaN values in the test set with the mean fare.
df_test.loc[df_test.Fare.isnull(), 'Fare'] = df_test['Fare'].mean()

# Log-transform Fare; NaN fails the i > 0 test, so any remaining missing value becomes 0
df_train['Fare'] = df_train['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
df_test['Fare'] = df_test['Fare'].map(lambda i: np.log(i) if i > 0 else 0)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness : {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

### 2.9 Cabin
* This feature has so much **null data** that it is not easy to obtain information relevant to survival.
* Therefore, we will not include it in the model we are trying to build.

In [None]:
df_train.isnull().sum()

### 2.10 Ticket
* This feature also has a lot of **null data**, so it is not easy to obtain information relevant to survival.
* Therefore, we will not include it in the model we are trying to build.

In [None]:
df_train['Ticket'].value_counts()

* As you can see, ticket numbers vary widely. What features can we extract from them and link to survival?
* You should come up with your own ideas! This is where the real Kaggle race begins. ^^
* This is a tutorial, so I'll skip Ticket for now. After finishing the tutorial, try extracting information from Ticket to improve your model's performance!
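
As one possible starting point for your own ideas, the sketch below pulls a text prefix out of each ticket string. Everything here is an assumption for illustration: the ticket strings are made up, `ticket_prefix` is a hypothetical helper, and whether such a prefix carries any signal in this dataset is untested.

```python
import pandas as pd

# Made-up ticket strings; the real column may look different
tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', None, 'STON/O2. 3101282'])

def ticket_prefix(ticket):
    """Hypothetical helper: return the cleaned leading token of a ticket."""
    if pd.isnull(ticket):
        return 'MISSING'  # keep missing tickets as their own category
    first = str(ticket).split()[0]
    if first.isdigit():
        return 'NONE'  # purely numeric ticket, no prefix
    # Drop punctuation so that variants of the same prefix collapse together
    return first.replace('.', '').replace('/', '').upper()

prefixes = tickets.map(ticket_prefix)
print(prefixes.value_counts())
```

The resulting categories could then be compared against Survived in the same way as Pclass or Embarked above.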

# The End!!

## If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :)