# Titanic Ship Case Study
### By Siddharth Maratha (20BAI10257)
**Problem Description:** On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. This translates to a survival rate of about 32%.
- One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
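
The 32% figure follows directly from the two counts quoted above. A quick check in plain Python (the variable names are illustrative and not part of the dataset):

```python
# Counts quoted in the problem description
total_on_board = 2224
deaths = 1502

survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(survivors)               # 722
print(f"{survival_rate:.1%}")  # 32.5%
```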

The problem associated with the Titanic dataset is to predict whether a passenger survived the
disaster or not. The dataset contains various features such as passenger class, age, gender,
cabin, fare, and whether the passenger had any siblings or spouses on board. These features can
be used to build a predictive model to determine the likelihood of a passenger surviving the
disaster. The dataset offers opportunities for feature engineering, data visualization, and model
selection, making it a valuable resource for developing and testing data analysis and machine
learning skills.

Perform the tasks below to complete the assignment:
1. Download the dataset: [Dataset](https://drive.google.com/file/d/1bvKFoyULN_HJo_7YWN-FFJP8ozHv-IBI/view?usp=sharing)
2. Load the dataset.
3. Perform the visualizations below.
  - Univariate Analysis
  - Bivariate Analysis
  - Multivariate Analysis
4. Perform descriptive statistics on the dataset.
5. Handle the missing values.
6. Find and replace the outliers.
7. Check for categorical columns and perform encoding.
8. Split the data into dependent and independent variables.
9. Scale the independent variables.
10. Split the data into training and testing sets.

In [13]:
# completes tasks (1) and (2)
import pandas as pd

link_to_dataset='https://drive.google.com/file/d/190t0KiKqSdbFl-o_6r3S9Tvwo2mHzrcB/view'
url='https://drive.google.com/uc?id=' + link_to_dataset.split('/')[-2] # extracts `file id` from `drive link`
df = pd.read_csv(url)
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
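
The loading cell relies on a small string trick: in a Drive share link of the form `.../file/d/<id>/view`, the file id is the second-to-last path segment. The sketch below isolates that step. The helper name `drive_download_url` is hypothetical, and the sketch only builds the URL; it does not download anything.

```python
def drive_download_url(share_link: str) -> str:
    """Turn a '.../file/d/<id>/view' share link into a '.../uc?id=<id>' link."""
    # The file id is the second-to-last segment when splitting on '/'.
    file_id = share_link.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id

share_link = 'https://drive.google.com/file/d/190t0KiKqSdbFl-o_6r3S9Tvwo2mHzrcB/view'
direct_url = drive_download_url(share_link)
print(direct_url)
# https://drive.google.com/uc?id=190t0KiKqSdbFl-o_6r3S9Tvwo2mHzrcB
```

A trailing query string such as `?usp=sharing` does not affect the result, because it stays attached to the last segment (`view?usp=sharing`).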


In [2]:
# step 3
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# univariate
sns.countplot(data=df, x='survived')
plt.show()

# bivariate
sns.histplot(data=df, x='age', hue='survived')
plt.show()

# multivariate: correlation heatmap of selected numeric columns
# (non-numeric entries are coerced to NaN and rows containing NaN are dropped)
corr = (
      df.loc[:, ['survived', 'pclass', 'age', 'sibsp']]
        .apply(pd.to_numeric, errors='coerce')
        .dropna()
    ).corr()
sns.heatmap(corr, annot=True)
plt.show()

ModuleNotFoundError: No module named 'seaborn'
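
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration of the arithmetic, the sketch below computes it by hand for `survived` and `pclass`, using only the five rows shown by `df.head()`. The helper `pearson_r` is an illustrative name, and five rows are far too few to say anything about the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows from df.head()
survived = [0, 1, 1, 1, 0]
pclass = [3, 1, 3, 1, 3]

r = pearson_r(survived, pclass)
print(round(r, 3))  # -0.667
```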

### step 4: Descriptive stats on data

In [None]:
df.describe(include='all')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
count,891.0,891.0,891,714.0,891.0,891.0,891.0,889,891,891,891,203,889,891,891
unique,,,2,,,,,3,3,3,2,7,3,2,2
top,,,male,,,,,S,Third,man,True,C,Southampton,no,True
freq,,,577,,,,,644,491,537,537,59,644,549,537
mean,0.383838,2.308642,,29.699118,0.523008,0.381594,32.204208,,,,,,,,
std,0.486592,0.836071,,14.526497,1.102743,0.806057,49.693429,,,,,,,,
min,0.0,1.0,,0.42,0.0,0.0,0.0,,,,,,,,
25%,0.0,2.0,,20.125,0.0,0.0,7.9104,,,,,,,,
50%,0.0,3.0,,28.0,0.0,0.0,14.4542,,,,,,,,
75%,1.0,3.0,,38.0,1.0,0.0,31.0,,,,,,,,


In [None]:
# removing useless columns
df.drop(['who', 'alone', 'alive'], axis=1, inplace=True)
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,deck,embark_town
0,0,3,male,22.0,1,0,7.25,S,Third,True,,Southampton
1,1,1,female,38.0,1,0,71.2833,C,First,False,C,Cherbourg
2,1,3,female,26.0,0,0,7.925,S,Third,False,,Southampton
3,1,1,female,35.0,1,0,53.1,S,First,False,C,Southampton
4,0,3,male,35.0,0,0,8.05,S,Third,True,,Southampton


In [None]:
print(df.nunique())
for col in ['survived', 'pclass', 'sex']:
    print(df[col].unique())

survived         2
pclass           3
sex              2
age             88
sibsp            7
parch            7
fare           248
embarked         3
class            3
adult_male       2
deck             7
embark_town      3
dtype: int64
[0 1]
[3 1 2]
['male' 'female']


In [3]:
df.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
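
Raw counts are easier to judge as a share of the 891 rows reported by `df.describe()`. The sketch below redoes that arithmetic from the counts printed above; the dictionary is typed in by hand for illustration.

```python
n_rows = 891  # row count reported by df.describe()

# Non-zero missing-value counts copied from the df.isna().sum() output
missing_counts = {'age': 177, 'embarked': 2, 'deck': 688, 'embark_town': 2}

missing_pct = {
    col: round(100 * count / n_rows, 1)
    for col, count in missing_counts.items()
}

for col, pct in missing_pct.items():
    print(f"{col}: {pct}%")
```

On the DataFrame itself, `df.isna().mean() * 100` should give the same percentages without copying numbers by hand.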

In [4]:
# handle missing values
# (assigning the result back avoids the chained `inplace=True` pattern,
# which newer pandas versions warn about)
mean_age = round(df['age'].mean(), 1)
df['age'] = df['age'].fillna(mean_age)
print(mean_age)

mode_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(mode_embarked)
print(mode_embarked)

mode_embark_town = df['embark_town'].mode()[0]
# fill with the most common town name, not the port code from `embarked`
df['embark_town'] = df['embark_town'].fillna(mode_embark_town)
print(mode_embark_town)

mode_deck = df['deck'].mode()[0]
df['deck'] = df['deck'].fillna(mode_deck)
print(mode_deck)

29.7
S
Southampton
C
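
The same idea on a toy example, in plain Python: numeric gaps get the rounded mean of the known values, and categorical gaps get the most frequent value. The lists below are made up for illustration (with `None` standing in for a missing entry) and are not rows from the dataset.

```python
from statistics import mean, mode

# Made-up values; None marks a missing entry
ages = [22.0, 38.0, None, 35.0, None]
embarked = ['S', 'C', 'S', None, 'S']

# Numeric column: fill gaps with the mean of the known values
known_ages = [a for a in ages if a is not None]
mean_age = round(mean(known_ages), 1)
filled_ages = [mean_age if a is None else a for a in ages]

# Categorical column: fill gaps with the most frequent known value
mode_port = mode([e for e in embarked if e is not None])
filled_embarked = [mode_port if e is None else e for e in embarked]

print(filled_ages)      # [22.0, 38.0, 31.7, 35.0, 31.7]
print(filled_embarked)  # ['S', 'C', 'S', 'S', 'S']
```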


In [5]:
# handling outliers
def replace_outliers(df, column_name, z_thresh=3):
  ''' replaces values of a numeric column that lie more than
      z_thresh standard deviations from the median with the median '''
  median = df[column_name].median()
  std = df[column_name].std()
  outliers = (df[column_name] - median).abs() > z_thresh * std
  # .loc writes into df directly and avoids chained assignment
  df.loc[outliers, column_name] = median

# numeric_cols = list(df.select_dtypes(include=[np.number]).columns.values)
numeric_cols = ['age', 'fare']

for val in numeric_cols:
  replace_outliers(df, val)

df.describe(include='all')
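
To see the rule in isolation, here is a plain-Python sketch of the same idea: a value counts as an outlier when it lies more than `z_thresh` sample standard deviations away from the median, and it is then replaced by the median. The fare list is made up for illustration, and this list-based function is only a stand-in for the DataFrame version in the cell above.

```python
from statistics import median, stdev

def replace_outliers(values, z_thresh=3):
    """Replace values far from the median with the median itself."""
    med = median(values)
    spread = stdev(values)  # sample standard deviation, like pandas' .std()
    return [med if abs(v - med) > z_thresh * spread else v for v in values]

# Made-up fares with one extreme value at the end
fares = [7.25, 7.9, 8.05, 8.5, 10.5, 13.0, 14.45, 26.0, 31.0, 53.1, 71.28, 512.33]
cleaned = replace_outliers(fares)

print(fares[-1], '->', cleaned[-1])  # the extreme fare becomes the median
```

Note that the extreme value itself inflates the standard deviation, so on short lists this rule can fail to flag anything.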

In [6]:
# perform encoding on categorical columns
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
cat_cols = ['sex', 'embarked', 'class', 'embark_town', 'adult_male']
for col in cat_cols:
  df[col] = le.fit_transform(df[col])
df

ModuleNotFoundError: No module named 'sklearn'
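
Label encoding simply maps each distinct category to an integer. `LabelEncoder` assigns the codes in sorted order of the class labels; the sketch below imitates that in plain Python on the `sex` values from `df.head()`. The helper `label_encode` is a hypothetical stand-in and not the scikit-learn implementation.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# `sex` column of the first five rows from df.head()
sex = ['male', 'female', 'female', 'female', 'male']
codes, mapping = label_encode(sex)

print(mapping)  # {'female': 0, 'male': 1}
print(codes)    # [1, 0, 0, 0, 1]
```

The integer codes imply an ordering between categories, which is harmless for two-valued columns such as `sex` but can mislead some models on columns such as `embarked`.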

In [7]:
df['survived'] = df['survived'].astype('uint8')
df['age'] = df['age'].astype('uint32')
df['pclass'] = df['pclass'].astype('uint8')
df['sibsp'] = df['sibsp'].astype('uint32')
df['parch'] = df['parch'].astype('uint32')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22,1,0,7.25,S,Third,man,True,C,Southampton,no,False
1,1,1,female,38,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26,0,0,7.925,S,Third,woman,False,C,Southampton,yes,True
3,1,1,female,35,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35,0,0,8.05,S,Third,man,True,C,Southampton,no,True


In [8]:
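# split dataset into independent (X) and dependent (y) variables
X = df.iloc[:, 1:-1]  # every column after `survived`, excluding the last column
y = df.iloc[:, 0]     # `survived` is the first column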

In [9]:
X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive
0,3,male,22,1,0,7.25,S,Third,man,True,C,Southampton,no
1,1,female,38,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes
2,3,female,26,0,0,7.925,S,Third,woman,False,C,Southampton,yes
3,1,female,35,1,0,53.1,S,First,woman,False,C,Southampton,yes
4,3,male,35,0,0,8.05,S,Third,man,True,C,Southampton,no


In [10]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: uint8

In [11]:
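# Scale the independent variables
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# scale the feature matrix X itself: X was already split off from df,
# so scaling df at this point would leave X unchanged
X = X.copy()
X['age'] = scaler.fit_transform(X[['age']]).ravel()
X['fare'] = scaler.fit_transform(X[['fare']]).ravel()
X.head()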

ModuleNotFoundError: No module named 'sklearn'
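
Standardization rescales a column to zero mean and unit variance, $z = (x - \mu) / \sigma$, where `StandardScaler` uses the population standard deviation. A minimal plain-Python sketch of that formula, applied to the five ages from `df.head()` (the helper `standardize` is an illustrative name):

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# `age` column of the first five rows from df.head()
ages = [22, 38, 26, 35, 35]
scaled = standardize(ages)

print([round(z, 2) for z in scaled])
```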

In [12]:
# Split the data into training and testing
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)

ModuleNotFoundError: No module named 'sklearn'
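
When no `test_size` is given, `train_test_split` holds out 25% of the rows. The sketch below imitates such a shuffled split on row indices in plain Python, to show what sizes to expect for 891 rows. The helper `shuffle_split` is hypothetical, and rounding the test size up is an assumption about how scikit-learn resolves fractional sizes.

```python
import math
import random

def shuffle_split(n_samples, test_size=0.25, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(891)
print(len(train_idx), len(test_idx))  # 668 223
```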