# Name: Sanchit Kripalani<br> Batch: M1 <br>Roll No: 31145

## Problem Statement

Perform the following operations using Python on any open source dataset (e.g., data.csv)

1. Import all the required Python Libraries.

2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).

3. Load the dataset into a pandas DataFrame.

4. Data Preprocessing: Check for missing values in the data using the pandas isnull() function, and use the describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.

5. Data Formatting and Normalization: Summarize the types of variables by checking the data types.

6. Turn categorical variables into quantitative variables in python.

## Get the Data

Data Description: The Titanic dataset is one of the most popular open-source datasets. It holds information on the passengers who were on board the infamous Titanic, including each passenger's Name, Age, Sex, Ticket, Fare, etc. It also records which passengers survived the sinking of the "unsinkable" ship, making it a good dataset for practising data manipulation and for applying classification techniques, such as logistic regression, to predict survival.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Import the dataset (Titanic Survivor Dataset)
input_data = pd.read_csv('train_titanic.csv')

# read_csv already returns a pandas DataFrame; raw_data is a separate working copy
raw_data = input_data.copy()

## Initial Observations

In [3]:
# The .head() method of a pandas DataFrame displays the first five rows by default
raw_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


As we can see, the data consists of information about the passengers on board the Titanic. Each row represents one passenger and includes fields such as Name, Age and Sex.
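
The problem statement also asks for the dimensions of the data frame. A minimal sketch of that check follows; `toy` is a hypothetical three-row stand-in for `raw_data`, used so the snippet runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for raw_data (values copied from the preview above)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```

Applied to `raw_data`, the same attribute should agree with the 891 entries and 12 columns that `.info()` reports further down.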

In [4]:
# .dtypes attribute is used to return the data types of all the columns of the dataframe
raw_data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

It is observed that the data consists of integer (int64), floating-point (float64) and object types.

- int64 is used for whole-number columns, including categorical values stored as numbers (Survived, Pclass).
- float64 is used for Age and Fare, which take fractional values.
- object is used for string or mixed-type columns. Here, as seen above, Name, Sex, Ticket, Cabin and Embarked hold string values and are represented as objects.
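
To summarize the variable types programmatically, columns can be grouped by dtype. The sketch below is one way to do it with `select_dtypes`; the frame `toy` and the names `numeric_cols` and `text_cols` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for raw_data with one column of each kind seen above
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],                   # integers
    "Age": [22.0, 38.0, 26.0],             # floats
    "Sex": ["male", "female", "female"],   # strings
})

# Numeric columns (integer and floating-point)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric (here: the string column)
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

The non-numeric group is the list of candidates for categorical encoding later on.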

In [5]:
# Printing the no. of NaN values in all columns

print("Data Missing Values:\n",raw_data.isna().sum())

Data Missing Values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


Thus the Age, Cabin and Embarked columns contain null values. We will fill the null values in each column using a strategy suited to that column's data type.
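
Raw counts are easier to judge as a share of all rows. A small sketch of that calculation, run on a hypothetical frame `toy` rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isna() marks missing cells as True; the mean of a boolean column
# is the fraction of True values, so multiplying by 100 gives a percentage
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

With the counts printed above and 891 rows in total, this works out to roughly 19.9% missing for Age, 77.1% for Cabin and 0.2% for Embarked.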

In [6]:
# The .info() method prints a concise summary of the DataFrame
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


It can therefore be concluded that .info() alone provides the information given by the two individual calls above: the data type of each column and, through the non-null counts, the number of missing values in each column.

### Cabin Column Preprocessing

In [7]:
# The Cabin column has a lot of missing values.
# We will replace every value with the first letter of the cabin name.
# str(NaN) is 'nan', so missing values become 'n'.

raw_data['Cabin'] = raw_data['Cabin'].apply(lambda x: str(x)[0])

print("Cabin:", raw_data['Cabin'].unique())

Cabin: ['n' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']


Only the first letter of the cabin name is important, as can be observed from the layout of the Titanic: cabin names begin with the letter of the deck on which the cabin was located (for example, cabins starting with A were on A Deck).

All the unique cabin letters have been found. They can later be used for categorical encoding of the cabin names.
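
The `'n'` category appears only because `str()` turns a missing value into the text `'nan'`. An alternative sketch, not the approach used in this notebook, extracts the letter with the `.str` accessor and labels missing cabins explicitly. The series `cabins` and the label `"U"` (unknown) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including missing entries
cabins = pd.Series(["C85", np.nan, "E46", np.nan, "B57 B59"])

# .str[0] takes the first character and leaves missing values as NaN
deck = cabins.str[0]

# Label missing cabins explicitly; "U" is an arbitrary choice
deck = deck.fillna("U")

print(deck.tolist())
```

This keeps the missing-value category visible by name instead of relying on the first letter of `'nan'`.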

## Filling the NaN values

In [8]:
# First, let's replace NaN embarkations with the modal value of embarkation

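# mode_embarked holds the most frequent port of embarkation
mode_embarked = raw_data['Embarked'].mode()[0]

# Assign the result back instead of calling fillna(inplace=True) on a column selection
raw_data['Embarked'] = raw_data['Embarked'].fillna(value=mode_embarked)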

print("Modal place of Embarkment was: ", mode_embarked)
print("Number of missing values in the Embarked column: ", raw_data['Embarked'].isna().sum())

Modal place of Embarkment was:  S
Number of missing values in the Embarked column:  0


In [9]:
# Next, let's replace NaN Ages with the mean age in the dataset

mean_training_age = raw_data['Age'].mean()

print("Mean training data is: ", mean_training_age)

# Replace NaN values of the Age column with the mean (assigned back rather than inplace)
raw_data['Age'] = raw_data['Age'].fillna(value=mean_training_age)

print("\nNumber of missing training values in the Age column: ", raw_data['Age'].isna().sum())

Mean training data is:  29.69911764705882

Number of missing training values in the Age column:  0


In [10]:
# Final check for all columns
print("Training Data Missing Values:\n", raw_data.isna().sum())

Training Data Missing Values:
 PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


All the null values are thus filled in.
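
The two fills above can also be expressed as a single call, since `DataFrame.fillna` accepts a dictionary mapping column names to fill values. A minimal sketch on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", np.nan, "S"],
})

# One fill value per column: mean for the numeric column, mode for the categorical one
fill_values = {
    "Age": toy["Age"].mean(),
    "Embarked": toy["Embarked"].mode()[0],
}

# fillna returns a new DataFrame; toy itself is left unchanged
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in one dictionary also makes it easy to reuse the same values on a separate test set.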

## Encoding Categorical Values

In [11]:
# Used to encode categorical values
# We will be using Label Encoder for this task
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [12]:
raw_data['Sex'] = encoder.fit_transform(raw_data['Sex'])

In [13]:
raw_data['Sex'].unique()

array([1, 0])

Sex has thus been encoded into 0/1. LabelEncoder assigns codes in sorted label order, so female becomes 0 and male becomes 1.
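
The label-to-code mapping can be read back from the fitted encoder's `classes_` attribute. The sketch below uses a hypothetical series `sex` and a dedicated `sex_encoder`; the names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Sex column
sex = pd.Series(["male", "female", "female", "male"])

sex_encoder = LabelEncoder()
codes = sex_encoder.fit_transform(sex)

# classes_ lists the original labels in sorted order; a label's position is its code
mapping = {str(label): code for code, label in enumerate(sex_encoder.classes_)}

print(mapping)
print(codes.tolist())
```

Because a single `encoder` object is refitted for several columns in this notebook, its `classes_` only ever describes the last column fitted. Using one encoder per column, as sketched here, would keep each mapping available for `inverse_transform`.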

In [14]:
# Similarly, we encode the ports of embarkation
# and then the cabin letters

raw_data['Embarked'] = encoder.fit_transform(raw_data['Embarked'])
raw_data['Cabin'] = encoder.fit_transform(raw_data['Cabin'])

**Major Difference between LabelEncoder and get_dummies**

- LabelEncoder assigns an integer code to each category and replaces every entry in the column with its code. The number of columns remains the same.
- get_dummies, on the other hand, creates a one-hot encoding of the column: one indicator column per unique category, so the number of new columns equals the number of unique categorical values.
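
The difference is easiest to see side by side. In the sketch below, pandas category codes stand in for LabelEncoder (both number the categories in sorted order), and `toy` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical sample of the Embarked column
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Integer codes: still one column, categories numbered in sorted order (C=0, Q=1, S=2)
label_codes = toy["Embarked"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Integer codes imply an order between categories (C < Q < S) that may not be meaningful, which is one reason one-hot encoding is often preferred for nominal variables.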

In [15]:
# We will drop the Name and Ticket columns
X = raw_data.drop(['Name', 'Ticket'], axis=1)

# .describe() is used to get a statistical summary of the dataframe
X.describe()

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 0.647587 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 6.716049 | 1.536476 |
| std | 257.353842 | 0.486592 | 0.836071 | 0.47799 | 13.002015 | 1.102743 | 0.806057 | 49.693429 | 2.460739 | 0.791503 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 0.0 | 22.0 | 0.0 | 0.0 | 7.9104 | 8.0 | 1.0 |
| 50% | 446.0 | 0.0 | 3.0 | 1.0 | 29.699118 | 0.0 | 0.0 | 14.4542 | 8.0 | 2.0 |
| 75% | 668.5 | 1.0 | 3.0 | 1.0 | 35.0 | 1.0 | 0.0 | 31.0 | 8.0 | 2.0 |
| max | 891.0 | 1.0 | 3.0 | 1.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 8.0 | 2.0 |


### Normalization

Normalization is a scaling technique in which data points are shifted and rescaled so that they end up in the range 0 to 1.
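
Min-max scaling applies x' = (x - min) / (max - min) to each column. The sketch below computes it by hand on three Age values taken from the `describe()` table above (the minimum, the 25th percentile and the maximum); the series name `age` is an illustrative assumption.

```python
import pandas as pd

# Minimum, 25th percentile and maximum of Age from the describe() output
age = pd.Series([0.42, 22.0, 80.0])

# x' = (x - min) / (max - min)
age_scaled = (age - age.min()) / (age.max() - age.min())

print(age_scaled.round(6).tolist())
```

An age of 22 maps to about 0.271174, the same value that appears for the first passenger after scaling with MinMaxScaler below.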

In [16]:
# Import MinMaxScaler from sklearn library
from sklearn.preprocessing import MinMaxScaler

# Create a normalization object
scaler = MinMaxScaler()

# This will scale the Fare and Age columns (We are not scaling categorical columns).
normalized_data = scaler.fit_transform(X[['Fare', 'Age']])

In [17]:
# MinMaxScaler returns a NumPy array, so assign its columns back to the DataFrame
X['Fare'] = normalized_data[:, 0]
X['Age'] = normalized_data[:, 1]

In [18]:
X.head()

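| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 1 | 0.271174 | 1 | 0 | 0.014151 | 8 | 2 |
| 1 | 2 | 1 | 1 | 0 | 0.472229 | 1 | 0 | 0.139136 | 2 | 0 |
| 2 | 3 | 1 | 3 | 0 | 0.321438 | 0 | 0 | 0.015469 | 8 | 2 |
| 3 | 4 | 1 | 1 | 0 | 0.434531 | 1 | 0 | 0.103644 | 2 | 2 |
| 4 | 5 | 0 | 3 | 1 | 0.434531 | 0 | 0 | 0.015713 | 8 | 2 |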


Thus it is observed that Age and Fare have been normalized to the range 0 to 1.