# About this notebook

In this notebook, I reviewed the Titanic (test) dataset. 
I examined the different parts of the dataset to gain more insight into the problem.

<img src= "https://www.historic-uk.com/wp-content/uploads/2017/04/the-sinking-of-the-rms-titanic.jpg" alt ="Titanic" style='width: 300px;' class="center">

Each feature has been studied separately in related cells. I tried to include data description, data cleaning, statistical description, data distribution, correlations, and relationships in this notebook.



<h4>If you are interested in this problem and detailed analysis, you can copy this Notebook as follows</h4>

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1101107%2F8187a9b84c9dde4921900f794c6c6ff9%2FScreenshot%202020-06-28%20at%201.51.53%20AM.png?generation=1593289404499991&alt=media" alt="Copyandedit" width="300" height="300" class="center">
  

::: Updates :::
* Last update: 08.Jul.2021
* Updates: 
        * Added more text analysis methods
        * Reviewed the Ticket part

# Table of contents

* Importing Libraries
* Obtaining Data
* Overview
    * Data description
    * Data types
* Data cleaning
    * Missing values
* "Pclass"
* "Parch"
* "Ticket"
    * Ticket price
* "Siblings"
    * Outliers
* "Name"
    * Separating
    * Word cloud

# Importing Libraries


In [None]:
import os
import spacy
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from textwrap import wrap

# Obtaining Data

In [None]:
# Walk the Kaggle input directory; if it holds several files, the last one read is kept
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        data = pd.read_csv(os.path.join(dirname, filename))

# An overview of the dataset

In [None]:
data.head()

## Check Data description

Checking the basic statistical description of the dataset

In [None]:
data.describe()

## Check the data types

Printing the data file information, including size, Dtype, number of attributes, etc.

In [None]:
data.info()

Printing out the columns' names for further reviews

In [None]:
data.columns

# Data cleaning

## Check missing values

In [None]:
data.isnull().sum()

In [None]:
data['Cabin'] = data['Cabin'].fillna(0)

I set the NaN values of the "*Cabin*" feature to zero so that the later `dropna` call does not discard those rows and the experiments can continue with the current data.

In [None]:
data = data.dropna(axis=0)

In [None]:
data.isnull().sum()
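As a complement to the raw counts, the share of missing values per column can show how much data a `dropna` call would cost. The following is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `data`, and its values are invented) that repeats the fill-then-drop sequence used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are invented)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0],
    "Fare": [7.25, 8.05, np.nan, 13.0],
    "Cabin": pd.Series([np.nan, "C85", np.nan, np.nan], dtype=object),
})

# Fraction of missing values per column (mean of the boolean mask)
missing_share = toy.isnull().mean()

# Same sequence as in the notebook: mark missing cabins with 0,
# then drop every row that still has a missing value
toy["Cabin"] = toy["Cabin"].fillna(0)
cleaned = toy.dropna(axis=0)

print(missing_share)
print(cleaned.shape)
```

Filling "Cabin" first matters here: without it, every row with an unknown cabin would be dropped as well.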

# Studying the Pclass

Grouping different columns to extract detailed and summary info concerning various features.

In [None]:
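# numeric_only=True skips text columns such as Name and Ticket, which have no mean
data.groupby(data['Pclass']).mean(numeric_only=True)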

In [None]:
data.groupby(data['Pclass']).count()

In [None]:
import warnings
warnings.simplefilter('ignore')
Pclass = data.groupby(data['Pclass'])['Pclass'].count()
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x='Pclass', data=data, kind='count', aspect=1)
plt.xlabel('class')
plt.title('Classes')

From the following plot, we can observe the number of passengers of each class with regard to their sex. 

In [None]:
# catplot replaces the older factorplot
sns.catplot(x='Pclass', data=data, hue='Sex', kind='count')

**Checking the number of men and women in each class**

In [None]:
pd.crosstab(index=data['Pclass'], columns=[data['Sex']],
            margins=True).style.background_gradient(cmap='YlGn')

The third class has the highest population on the ship, and most of them (almost 66%) are men.
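A percentage like the one above can be read directly from a row-normalized crosstab. This is a small sketch on invented data (`toy` is hypothetical; the notebook would use `data`), so the numbers it prints are not the Titanic figures.

```python
import pandas as pd

# Hypothetical toy passengers; the real notebook would use `data` instead
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

# normalize='index' turns each row (class) into shares that sum to 1
shares = pd.crosstab(index=toy["Pclass"], columns=toy["Sex"], normalize="index")

# Express the shares as percentages for easier reading
print((shares * 100).round(1))
```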

# Studying the Parch

Extracting the counts of Parch

In [None]:
# Flag passengers travelling with at least one parent or child
data['parched'] = data['Parch'] > 0
parched = data[data['parched']]
parched['parched'].value_counts()

85 samples have a non-zero Parch value

In [None]:
parched.head()

In [None]:
plt.hist(data['Parch'])
plt.xlabel('parch')
plt.ylabel('counts')
plt.title('Histogram of parch')

**Exploring the sex in each class and the values of parch**

In [None]:
pd.crosstab(index=data['Parch'], columns=[data['Pclass'], data['Sex']],
            margins=True).style.background_gradient(cmap='YlGn')

The first row of the table (the lowest Parch value) has the highest counts for every sex and class. The single highest count belongs to men in the third class.

# Studying the Ticket

#### Total of passengers with regard to the port

I took advantage of the pandas pivot table, which gives a clearer dashboard-style summary.

In [None]:
pd.pivot_table(data=data, index='Sex', values='Ticket',
                    columns='Embarked', aggfunc=len, margins=True)

## Reviewing the Ticket price

In [None]:
pd.DataFrame(data.groupby('Ticket')['Fare'].mean())

## Sampling the dataset

In [None]:
sample = data.sample(frac=0.2, random_state=2)
print(sample.shape)
sample.tail()

## Histogram of the ticket's price over a random sample

In [None]:
fig, ax = plt.subplots(figsize=(17, 3))
sns.histplot(x="Fare", kde=True, data=sample)
fig.suptitle('Distribution of Ticket\'s price', fontsize=15)

In [None]:
sns.kdeplot(
   data=sample,x='Pclass',y="Fare",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)
plt.title("Conditional distributions")

****
**The fare paid, by port of embarkation and Parch**

In [None]:
pd.crosstab(index=data['Fare'], columns=[data['Embarked'], data['Parch']],
            margins=True)

# Studying the siblings of the passengers

In [None]:
data.groupby(data['SibSp'])['SibSp'].count()

In [None]:
pd.crosstab(index=data['SibSp'], columns=[data['Pclass']],
            margins=True).style.background_gradient(cmap='YlGn')

We can observe that most of the passengers had no siblings.

In [None]:
print('Data type of siblings:', data['SibSp'].dtype, '\n')

## Outliers treatment

Checking the outliers and dealing with them

In [None]:
SibSp = (data.SibSp).values
# Constant line at the mean of SibSp, one point per passenger
m = np.full(SibSp.shape[0], np.mean(SibSp))
plt.plot(SibSp, label='siblings')
plt.plot(m, linewidth=3, color='r', label='Mean')
plt.legend()
plt.title("Checking siblings outliers")

> From the above plot, we observe some out-of-range values that lie far from the mean.

So, some values lie far from the rest. How far are they from the average? To measure this, I computed the absolute z-score of each value, as follows.

In [None]:
z = np.abs(stats.zscore(SibSp))
print(z)

In [None]:
plt.plot(z, c='g', alpha=0.3)
plt.ylabel('Distance')
plt.xlabel('index')
plt.title('Distance of the SibSp value from the average')
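The absolute z-scores can also be turned into an explicit list of suspect rows by applying a cut-off. The sketch below assumes a threshold of 3, which is a common rule of thumb and not a value taken from this notebook, and it runs on invented counts rather than the real `SibSp` column.

```python
import numpy as np
from scipy import stats

# Hypothetical sibling counts with one extreme value
values = np.array([0, 0, 1, 0, 1, 0, 0, 2, 0, 1, 0, 8])

# Absolute z-score: distance from the mean in units of standard deviation
z = np.abs(stats.zscore(values))

# Assumed cut-off; 3 is a rule of thumb, adjust it to the data at hand
threshold = 3
outlier_idx = np.where(z > threshold)[0]

print(outlier_idx, values[outlier_idx])
```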

Another way to check these outliers is the box plot. Besides the quartiles, mean, and median, it shows the points that fall above or below the whiskers.

In [None]:
plt.boxplot(data['SibSp'], 0,'o',showbox=True,
            showfliers=True, showcaps=True, showmeans=True)

> Based on the IQR definition, I defined the upper and lower bounds to drop the part of the feature values which I do not want in my experiments.

In [None]:
Q1 = np.percentile(SibSp, 25, interpolation='midpoint')
# The third quartile is the 75th percentile
Q3 = np.percentile(SibSp, 75, interpolation='midpoint')
IQR = Q3 - Q1
upper = np.where(SibSp>=(Q3+1.5*IQR))
lower = np.where(SibSp<=(Q1-1.5*IQR))
newSibSp = pd.DataFrame(SibSp)
newSibSp.drop(upper[0], inplace=True)
newSibSp.drop(lower[0], inplace=True)
print(newSibSp.shape)
newSibSp.head()
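The IQR rule can be wrapped in a small reusable function. The sketch below is one possible form: `iqr_bounds` is a hypothetical helper, the sample values are invented, and it relies on NumPy's default linear interpolation rather than the midpoint setting used in the cell above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    Hypothetical helper, not part of the original notebook.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented sibling counts with one extreme value
values = np.array([0, 0, 0, 1, 1, 1, 2, 8])

lower, upper = iqr_bounds(values)

# Keep only the values that lie inside the fences
kept = values[(values >= lower) & (values <= upper)]
print(lower, upper, kept)
```

Using a boolean mask keeps the original array untouched, which makes it easy to compare the filtered and unfiltered values side by side.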

Let's look at the updated box plot. In the following plot, the outliers seen before are no longer present.

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,2,1)
plt.boxplot(newSibSp, 0,'o',showbox=True,
            showfliers=True, showcaps=True, showmeans=True)
plt.title("Modified Siblings")
plt.subplot(2,2,2)
plt.boxplot(data['SibSp'], 0,'o',showbox=True,
            showfliers=True, showcaps=True, showmeans=True)
plt.title("Siblings real values")

# Studying the Name

In [None]:
name = data.Name
name = name.values
len(np.unique(name))

**In memory of the passengers, I prefer to print all the names**

In [None]:
print("Names:", name)

## Splitting the names

In [None]:
names = data['Name'].apply(lambda x: x.split(', ')[0])

In [None]:
names

In [None]:
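# Raw string avoids the invalid escape warning for "\."; expand=False returns a Series
data['titles'] = data['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)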
pd.crosstab(data.titles,data.Pclass).T.style.background_gradient(cmap='Set1_r')

The table shows that the "Master" title appears in the third class just as it does in the first class.
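To make the two splitting steps easier to follow, here is a minimal sketch on a few illustrative names written in the "Surname, Title. Given names" layout. The `names` Series below is a hypothetical sample, not the notebook's `data['Name']` column.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Rice, Master. Albert",
])

# Title: the run of letters that directly precedes the first period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Surname: everything before the first ", "
surnames = names.str.split(", ").str[0]

print(titles.tolist(), surnames.tolist())
```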

For the purpose of **NLP**, we could convert all letters to **lower case**. I used the lower function to do that.

In [None]:
LowerCase = names.apply(lambda x: x.lower())
LowerCase

## Word cloud

In [None]:
# Join all surnames into one text for the word cloud
text = " ".join(names)

# Draw four clouds on a single 2x2 figure
plt.figure(figsize=(10,10))
for i in range(4):
    wordcloud = WordCloud(width=400, height=400, max_font_size=50, max_words=70, colormap="Dark2").generate(text)
    plt.subplot(2,2,i+1)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
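A word cloud sizes each word by how often it occurs, so a plain frequency count gives the same information in numeric form. This is a small sketch using the standard library; the `surnames` list is invented for illustration and would be replaced by the notebook's `names` values.

```python
from collections import Counter

# Hypothetical surnames; in the notebook these would come from `names`
surnames = ["Kelly", "Rice", "Kelly", "Wilkes", "Rice", "Kelly"]

# Count case-insensitively, matching the lower-casing step used for NLP
counts = Counter(s.lower() for s in surnames)

# The two most frequent surnames with their counts
top = counts.most_common(2)
print(top)
```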