# Data Loading

## Machine Learning Process

The four main steps in Machine Learning:

1. Data Visualization
2. Data Preprocessing
3. Data Modeling
4. Model Evaluation

Data Loading is the earliest stage of this process.

## Outline

1. Load datasets
2. Explore data
3. Describe data
4. Edit data

## Pandas

**Pandas** is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

- DataFrame: 2-dimensional data structure (table)
- Series: column in a DataFrame
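
As a minimal sketch of the two structures above, the snippet below builds a small DataFrame and pulls a Series out of it. The three rows are copied from the Titanic output shown later in this notebook; the variable names (`df`, `ages`, `subset`) are illustrative only.

```python
import pandas as pd

# Toy table: three rows taken from the Titanic output later in this notebook.
df = pd.DataFrame({
    "Name": ["Mr. Owen Harris Braund", "Miss. Laina Heikkinen", "Mr. William Henry Allen"],
    "Age": [22.0, 26.0, 35.0],
    "Fare": [7.25, 7.925, 8.05],
})

# Single brackets with one column name return a Series.
ages = df["Age"]

# Double brackets (a list of column names) return a DataFrame.
subset = df[["Age", "Fare"]]

print(type(df).__name__)      # DataFrame
print(type(ages).__name__)    # Series
print(type(subset).__name__)  # DataFrame
print(df.shape)               # (3, 3)
```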

Types of tabular data:
1. CSV
2. JSON
3. txt
4. Excel
5. others...
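
The formats listed above map onto different pandas readers. The sketch below is one way to try them without any files on disk: it reads CSV, JSON, and tab-delimited text from in-memory strings. The sample strings are made up for illustration (two rows borrowed from the Titanic output), and the assumption that a `.txt` file is tab-delimited may not hold for your data. Excel files are typically read with `pd.read_excel`, which needs an extra engine package such as `openpyxl`, so it is left out here.

```python
import io

import pandas as pd

# Illustrative in-memory samples; real data would normally live in files.
csv_text = "Name,Age\nMr. Owen Harris Braund,22.0\nMiss. Laina Heikkinen,26.0\n"
json_text = (
    '[{"Name": "Mr. Owen Harris Braund", "Age": 22.0},'
    ' {"Name": "Miss. Laina Heikkinen", "Age": 26.0}]'
)
txt_text = "Name\tAge\nMr. Owen Harris Braund\t22.0\nMiss. Laina Heikkinen\t26.0\n"

# CSV: comma-separated values.
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: here a list of records, one object per row.
from_json = pd.read_json(io.StringIO(json_text))

# Plain text: assumed tab-delimited, so read_csv with sep="\t" is enough.
from_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")

for frame in (from_csv, from_json, from_txt):
    print(frame.shape, list(frame.columns))
```

All three readers accept a file path in place of the `io.StringIO` object, which is how `pd.read_csv` is used on the Titanic file below.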

# Load Library

In [1]:
%pip install numpy pandas matplotlib



In [2]:
import sys
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# Check Version

In [3]:
print(f"Numpy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Matplotlib: {mpl.__version__}")

Numpy: 1.19.1
Pandas: 1.1.2
Matplotlib: 3.3.1


# Titanic Dataset

## Load Data

In [4]:
titanic = pd.read_csv("datasets/titanic.csv")

## Inspect Data

In [5]:
# first 5 rows
titanic.head()

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [6]:
# last 5 rows
titanic.tail()

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.45
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0
886,0,3,Mr. Patrick Dooley,male,32.0,0,0,7.75


In [7]:
titanic[100:110]

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
100,0,3,Mr. Pastcho Petroff,male,29.0,0,0,7.8958
101,0,1,Mr. Richard Frasar White,male,21.0,0,1,77.2875
102,0,3,Mr. Gustaf Joel Johansson,male,33.0,0,0,8.6542
103,0,3,Mr. Anders Vilhelm Gustafsson,male,37.0,2,0,7.925
104,0,3,Mr. Stoytcho Mionoff,male,28.0,0,0,7.8958
105,1,3,Miss. Anna Kristine Salkjelsvik,female,21.0,0,0,7.65
106,1,3,Mr. Albert Johan Moss,male,29.0,0,0,7.775
107,0,3,Mr. Tido Rekic,male,38.0,0,0,7.8958
108,1,3,Miss. Bertha Moran,female,28.0,1,0,24.15
109,0,1,Mr. Walter Chamberlain Porter,male,47.0,0,0,52.0
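
Slicing a DataFrame with `[start:stop]` selects rows by position. The sketch below compares that shorthand with the more explicit `iloc` and `loc` accessors. It uses a hypothetical 200-row frame (`df`) as a stand-in for `titanic`, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the titanic DataFrame: 200 rows, default 0..199 index.
df = pd.DataFrame({"value": range(200)})

# Plain slicing is positional: rows 100 up to, but not including, 110.
by_brackets = df[100:110]

# iloc is the explicit positional form, with the same exclusive end.
by_iloc = df.iloc[100:110]

# loc is label-based, and its end label is inclusive.
by_loc = df.loc[100:109]

print(len(by_brackets))             # 10
print(by_brackets.equals(by_iloc))  # True
print(by_brackets.equals(by_loc))   # True
```

The three forms agree here only because the index is the default 0, 1, 2, ... range. After filtering or sorting, labels and positions diverge, and `iloc` versus `loc` give different rows.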


## Change Columns

In [8]:
new_columns = {
    'Survived': 'SURVIVED',
    'Pclass': 'PCLASS',
    'Name': 'NAME',
    'Sex': 'SEX',
    'Age': 'AGE',
    'Siblings/Spouses Aboard': 'SIBSA',
    'Parents/Children Aboard': 'PARCA',
    'Fare': 'FARE'
}

In [9]:
titanic = titanic.rename(columns = new_columns)
titanic

,SURVIVED,PCLASS,NAME,SEX,AGE,SIBSA,PARCA,FARE
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000
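
Two details of `rename` are easy to miss: it returns a new DataFrame rather than changing the original, and by default it silently ignores keys that match no column. The sketch below illustrates both on a tiny hypothetical frame; the misspelled key `"Survivd"` is deliberate.

```python
import pandas as pd

# Tiny hypothetical frame with two of the Titanic column names.
df = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1]})

# rename returns a new DataFrame; df itself keeps its old column names.
renamed = df.rename(columns={"Survived": "SURVIVED", "Pclass": "PCLASS"})
print(list(df.columns))       # ['Survived', 'Pclass']
print(list(renamed.columns))  # ['SURVIVED', 'PCLASS']

# A key that matches no column is ignored by default, so typos go unnoticed.
unchanged = df.rename(columns={"Survivd": "SURVIVED"})
print(list(unchanged.columns))  # ['Survived', 'Pclass']

# errors="raise" turns the same typo into a KeyError.
raised = False
try:
    df.rename(columns={"Survivd": "SURVIVED"}, errors="raise")
except KeyError:
    raised = True
print(raised)  # True
```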


## Check Missing Values

In [10]:
print(f"Rows: {titanic.shape[0]}")
print(f"Columns: {titanic.shape[1]}")

Rows: 887
Columns: 8


In [11]:
titanic.isnull().any()

SURVIVED    False
PCLASS      False
NAME        False
SEX         False
AGE         False
SIBSA       False
PARCA       False
FARE        False
dtype: bool
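
`isnull().any()` only says whether a column has at least one missing value. When it reports `True`, a count per column is usually the next thing to look at. The sketch below shows one way to get it. Because the Titanic file used here has no missing values, the frame is hypothetical and has gaps inserted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps (the real dataset above has none).
df = pd.DataFrame({
    "AGE": [22.0, np.nan, 26.0, np.nan],
    "FARE": [7.25, 71.2833, np.nan, 8.05],
    "SEX": ["male", "female", "female", "male"],
})

# True/False per column: is anything missing at all?
has_missing = df.isnull().any()

# Number of missing values per column (True counts as 1 when summed).
missing_counts = df.isnull().sum()

# Fraction of missing values per column.
missing_share = df.isnull().mean()

print(missing_counts.to_dict())  # {'AGE': 2, 'FARE': 1, 'SEX': 0}
print(missing_share.to_dict())   # {'AGE': 0.5, 'FARE': 0.25, 'SEX': 0.0}
```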

## Data Types

In [12]:
titanic.dtypes

SURVIVED      int64
PCLASS        int64
NAME         object
SEX          object
AGE         float64
SIBSA         int64
PARCA         int64
FARE        float64
dtype: object

In [13]:
# convert the AGE column from float to int (astype drops any fractional part)
# assign the result back: astype returns a new Series and does not modify the column in place
titanic['AGE'] = titanic['AGE'].astype('int64')

In [14]:
titanic.dtypes

SURVIVED      int64
PCLASS        int64
NAME         object
SEX          object
AGE           int64
SIBSA         int64
PARCA         int64
FARE        float64
dtype: object
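
Converting floats to integers with `astype` has two caveats: fractional values are truncated rather than rounded, and a column containing `NaN` cannot be cast to `int64` at all. The sketch below illustrates both on small hypothetical Series. The nullable `"Int64"` dtype is shown as one possible workaround, not necessarily the right choice for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including fractional values.
ages = pd.Series([22.0, 0.92, 35.7])

# astype truncates toward zero: 0.92 becomes 0 and 35.7 becomes 35.
truncated = ages.astype("int64")
print(truncated.tolist())  # [22, 0, 35]

# Rounding first keeps the nearest whole number instead.
rounded = ages.round().astype("int64")
print(rounded.tolist())  # [22, 1, 36]

# A missing value cannot be stored in a plain int64 column.
with_gap = pd.Series([22.0, np.nan])
cast_failed = False
try:
    with_gap.astype("int64")
except ValueError:
    cast_failed = True
print(cast_failed)  # True

# The nullable "Int64" dtype (capital I) can hold integers alongside <NA>.
nullable = with_gap.astype("Int64")
print(nullable.isna().sum())  # 1
```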

## Getting Insight

What is the average `AGE` for each `PCLASS`?

In [16]:
titanic[['PCLASS', 'AGE']].groupby('PCLASS').mean()

PCLASS,AGE
1,38.782407
2,29.847826
3,25.170431
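
The same group-by pattern extends to several statistics at once through `agg`. The sketch below is a small illustration on six rows copied from the outputs above, so its numbers differ from the full-dataset means just shown; `mean_age` and `summary` are illustrative names.

```python
import pandas as pd

# Six rows copied from the head() and tail() outputs above (not the full dataset).
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "AGE": [38, 35, 27, 22, 26, 35],
})

# Selecting the column after groupby gives a Series of per-class means.
mean_age = df.groupby("PCLASS")["AGE"].mean()

# agg with a list of function names gives one column per statistic.
summary = df.groupby("PCLASS")["AGE"].agg(["mean", "median", "count"])

print(mean_age.loc[1])  # 36.5
print(summary)
```

A mean alone can hide very different group sizes, so reporting `count` next to it is a cheap sanity check.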
