## Pandas

Import the Pandas library and save it with the alias pd.

In [1]:
import pandas as pd

Pandas has two main objects: Series and DataFrame.

### Series Object
- A Series holds 1-dimensional data.
- It has no column names because it has only one column.
- It has an index.
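The properties listed above can be checked directly on a Series. This is a minimal sketch; the variable name `s` is only illustrative and is not used elsewhere in this notebook.

```python
import pandas as pd

# A small Series built from a plain Python list
s = pd.Series([0.25, 0.50, 0.75, 1])

print(s.ndim)         # number of dimensions of the Series
print(s.shape)        # a one-element tuple holding the length
print(list(s.index))  # the default integer labels
print(s.name)         # None unless a name is given explicitly
```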

In [2]:
data = [0.25, 0.50, 0.75, 1]

In [3]:
print(data)

[0.25, 0.5, 0.75, 1]


Convert the data into a series.

In [4]:
data = pd.Series(data)

In [5]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

Task 1

In [6]:
waktu = [7.18, "Selasa", "Oktober", 2023]

In [7]:
print(waktu)

[7.18, 'Selasa', 'Oktober', 2023]


In [8]:
waktu = pd.Series(waktu)

In [9]:
waktu

0       7.18
1     Selasa
2    Oktober
3       2023
dtype: object

Convert the data from series into an array.

In [10]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

Display the index. 

The index is a range, where the start point is inclusive in the range, and the end point is exclusive from the range.

In [11]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [12]:
list(range(1,10))

[1, 2, 3, 4, 5, 6, 7, 8, 9]

Access an element by its index:

In [13]:
data[2]

0.75

Task 2

In [14]:
waktu = [7.18, "Selasa", 3, "Oktober", 2023]

waktu = pd.Series(waktu)

In [15]:
# convert series into an array

waktu.values

array([7.18, 'Selasa', 3, 'Oktober', 2023], dtype=object)

In [16]:
#show index

waktu.index

RangeIndex(start=0, stop=5, step=1)

In [17]:
# slicing

waktu[1:4:2]

1     Selasa
3    Oktober
dtype: object

The implicit index is the default index. <br>
We can also define the index ourselves; an index that we specify is called an explicit index. <br>
When defining an index, the number of index labels must match the number of data points.
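As a small illustration of the length rule, the sketch below passes four values with only three labels and catches the resulting error. The names `values` and `mismatch_error` are hypothetical helpers for this example.

```python
import pandas as pd

values = [0.25, 0.50, 0.75, 1]

# Four values but only three labels: pandas refuses to build the Series
try:
    pd.Series(values, index=['a', 'b', 'c'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```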

In [18]:
data = pd.Series([0.25, 0.50, 0.75, 1], index = ['a','b','c','d'])

In [19]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [20]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [21]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Access the data using the explicit index and the implicit index:

In [22]:
# explicit index

data['a']

0.25

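This is data selection. <br>
Even though we have defined an explicit index, an integer key such as data[3] still falls back to the implicit (positional) index. Recent pandas releases deprecate this fallback, so data.iloc[3] is the safer spelling.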

In [23]:
# implicit index

data[3]

1.0
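A version-independent way to make the same two lookups is to state the intent explicitly with loc (label) and iloc (position). This is a sketch; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = pd.Series([0.25, 0.50, 0.75, 1], index=['a', 'b', 'c', 'd'])

by_label = data.loc['d']      # lookup by explicit label
by_position = data.iloc[3]    # lookup by implicit position

print(by_label, by_position)
```

Both expressions return the last element, and neither depends on how plain `[]` resolves integer keys.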

Task 3

In [24]:
waktu = pd.Series([7.18, "Selasa", 3, "Oktober", 2023], index = ["Pukul", "Hari", "Tanggal", "Bulan", "Tahun"])

In [25]:
waktu

Pukul         7.18
Hari        Selasa
Tanggal          3
Bulan      Oktober
Tahun         2023
dtype: object

In [26]:
waktu["Hari"]

'Selasa'

In [27]:
waktu[2:5]

Tanggal          3
Bulan      Oktober
Tahun         2023
dtype: object

When the explicit index is made of integers, an integer key inside [] is looked up in the explicit index (the labels), not in the implicit positions.

In [28]:
data_2 = pd.Series([0.25, 0.50, 0.75, 1], index = [2,5,3,7])

In [29]:
data_2[2]

0.25

In [30]:
# data_2[0]  # raises KeyError: 0 is not a label in the explicit index
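The commented-out lookup can be tried safely by catching the exception. The sketch below assumes the same data_2 as above; `lookup_error` and `first_by_position` are illustrative names.

```python
import pandas as pd

data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2, 5, 3, 7])

# 0 is not one of the labels 2, 5, 3, 7, so the label lookup fails
try:
    data_2[0]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

# iloc always works by position, whatever the labels are
first_by_position = data_2.iloc[0]

print(type(lookup_error).__name__, first_by_position)
```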

Task 4

In [31]:
data_3 = pd.Series([21, 22, 23, 24.5, 25, 26, 27.5, 28, 29, 30], index = [5,11,15,1,40,3,66,32,23,45])

In [32]:
data_3[3]

26.0

In [33]:
# data_3[2]  # raises KeyError: 2 is not a label in the explicit index

Slicing data with the explicit index and the implicit index

In [34]:
data = pd.Series([0.25, 0.50, 0.75, 1], index = ['a','b','c','d'])

In [35]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Show data from b to c:

In [36]:
# explicit index
data['b':'c']

b    0.50
c    0.75
dtype: float64

When we slice with the implicit index, the end point is excluded, just as in a range, so data[1:2] returns only the element at position 1. Slicing with the explicit index, as above, includes the end label.

In [37]:
# implicit index
data[1:2]

b    0.5
dtype: float64
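The two slicing behaviours can be compared side by side with loc and iloc, which make the choice of index explicit. This is a sketch; `label_slice` and `position_slice` are illustrative names.

```python
import pandas as pd

data = pd.Series([0.25, 0.50, 0.75, 1], index=['a', 'b', 'c', 'd'])

label_slice = data.loc['b':'c']   # label slice: end label is included
position_slice = data.iloc[1:2]   # position slice: end position is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```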

Task 5

In [38]:
aktivitas = pd.Series([7.18, "Selasa", 3, "Oktober", 2023, "Kelas pagi", "Data Analyst", "MyEduSolve", 6, "Pandas"], 
                      index = ["Pukul", "Hari", "Tanggal", "Bulan", "Tahun", "Aktivitas", "Pathway", "Di", "Week", "Materi"])

In [39]:
aktivitas

Pukul                7.18
Hari               Selasa
Tanggal                 3
Bulan             Oktober
Tahun                2023
Aktivitas      Kelas pagi
Pathway      Data Analyst
Di             MyEduSolve
Week                    6
Materi             Pandas
dtype: object

In [40]:
# explicit slicing
aktivitas["Tanggal":"Pathway"]

Tanggal                 3
Bulan             Oktober
Tahun                2023
Aktivitas      Kelas pagi
Pathway      Data Analyst
dtype: object

In [41]:
# implicit slicing with 2 parameters (start, stop)
aktivitas[4:9]

Tahun                2023
Aktivitas      Kelas pagi
Pathway      Data Analyst
Di             MyEduSolve
Week                    6
dtype: object

In [42]:
# implicit slicing with 3 parameters (start, stop, step)
aktivitas[2:10:3]

Tanggal               3
Aktivitas    Kelas pagi
Week                  6
dtype: object

### Loc and Iloc

The main difference between loc and iloc is how they access the data: loc uses the explicit index (labels), while iloc uses the implicit index (integer positions).

When a Series is displayed, the index shown is the explicit index.
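To see the difference on a small case, the sketch below uses a hypothetical Series `s` whose integer labels are deliberately out of order, so that label 0 and position 0 point at different elements.

```python
import pandas as pd

# Labels 2, 0, 1 are intentionally not in positional order
s = pd.Series([10, 20, 30], index=[2, 0, 1])

print(s.loc[0])    # element whose label is 0
print(s.iloc[0])   # element at position 0

# Both accessors also accept lists
print(s.loc[[2, 1]].tolist())    # labels 2 and 1
print(s.iloc[[0, 2]].tolist())   # positions 0 and 2
```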

In [43]:
data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2, 5, 3, 7])

In [44]:
data_2

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

#### Loc

In [45]:
# select by explicit index (label)
data_2.loc[3] 

0.75

In [46]:
# slice by explicit index (end label is included)
data_2.loc[2:3]

2    0.25
5    0.50
3    0.75
dtype: float64

#### Iloc

In [47]:
# select by implicit index (position)
data_2.iloc[3]

1.0

In [48]:
# slice by implicit index (end position is excluded)
data_2.iloc[2:3]

3    0.75
dtype: float64

In [49]:
dict_populasi = {'jakarta':750, 'bogor':490, 'depok':350, 'tanggerang':270, 'bekasi':670}

In [50]:
dict_populasi

{'jakarta': 750, 'bogor': 490, 'depok': 350, 'tanggerang': 270, 'bekasi': 670}

In [52]:
# dictionary to series transformation
populasi = pd.Series(dict_populasi)

In [53]:
populasi

jakarta       750
bogor         490
depok         350
tanggerang    270
bekasi        670
dtype: int64

In [55]:
populasi.loc['depok']

350

In [56]:
populasi.iloc[2]

350

In [57]:
dict_luas = {'Jakarta':737, 
                 'Bogor':325,
                 'Depok':247,
                 'Tanggerang':302,
                 'Bekasi':355}

In [58]:
luas = pd.Series(dict_luas)

In [59]:
luas

Jakarta       737
Bogor         325
Depok         247
Tanggerang    302
Bekasi        355
dtype: int64

### DataFrame

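A DataFrame is a collection of one or more Series aligned on a shared index. Note that populasi uses lowercase labels while luas uses capitalized labels, so the labels do not match and the combined DataFrame below contains NaN wherever a label is missing from one of the Series.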

In [60]:
daerah = pd.DataFrame({'pop':populasi, 'luas daerah':luas})

In [61]:
daerah

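| | pop | luas daerah |
|---|---|---|
| Bekasi | NaN | 355.0 |
| Bogor | NaN | 325.0 |
| Depok | NaN | 247.0 |
| Jakarta | NaN | 737.0 |
| Tanggerang | NaN | 302.0 |
| bekasi | 670.0 | NaN |
| bogor | 490.0 | NaN |
| depok | 350.0 | NaN |
| jakarta | 750.0 | NaN |
| tanggerang | 270.0 | NaN |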


In [62]:
daerah['luas daerah']

Bekasi        355.0
Bogor         325.0
Depok         247.0
Jakarta       737.0
Tanggerang    302.0
bekasi          NaN
bogor           NaN
depok           NaN
jakarta         NaN
tanggerang      NaN
Name: luas daerah, dtype: float64

In [63]:
daerah['luas daerah']['Jakarta']

737.0
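The NaN values in the outputs above come from the label mismatch between the two Series ('jakarta' versus 'Jakarta'). One possible way to align them, assuming both Series describe the same five cities, is to lowercase the labels of luas before building the DataFrame. The names `luas_lower` and `daerah_aligned` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

populasi = pd.Series({'jakarta': 750, 'bogor': 490, 'depok': 350,
                      'tanggerang': 270, 'bekasi': 670})
luas = pd.Series({'Jakarta': 737, 'Bogor': 325, 'Depok': 247,
                  'Tanggerang': 302, 'Bekasi': 355})

# Assumption: the labels differ only in capitalization
luas_lower = luas.copy()
luas_lower.index = luas_lower.index.str.lower()

# With matching labels, every row has a value in both columns
daerah_aligned = pd.DataFrame({'populasi': populasi, 'luas': luas_lower})
print(daerah_aligned)
```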

Accessing the pop column with attribute syntax, daerah.pop, does not return the column; it returns the following:

In [64]:
daerah.pop

<bound method DataFrame.pop of               pop  luas daerah
Bekasi        NaN        355.0
Bogor         NaN        325.0
Depok         NaN        247.0
Jakarta       NaN        737.0
Tanggerang    NaN        302.0
bekasi      670.0          NaN
bogor       490.0          NaN
depok       350.0          NaN
jakarta     750.0          NaN
tanggerang  270.0          NaN>

This happens because pop is also the name of a DataFrame method.

So it is safer to access a column with bracket syntax, daerah['pop']:

In [65]:
daerah['pop']

Bekasi          NaN
Bogor           NaN
Depok           NaN
Jakarta         NaN
Tanggerang      NaN
bekasi        670.0
bogor         490.0
depok         350.0
jakarta       750.0
tanggerang    270.0
Name: pop, dtype: float64
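The name clash can be confirmed on a small, hypothetical DataFrame (`df_demo` is not part of this notebook's data): attribute access returns the method for pop, but the column for a name that no method uses.

```python
import pandas as pd

df_demo = pd.DataFrame({'pop': [750, 490], 'luas': [737, 325]},
                       index=['jakarta', 'bogor'])

print(callable(df_demo.pop))    # True: this is the DataFrame.pop method
print(callable(df_demo.luas))   # False: this is the 'luas' column
print(df_demo['pop'].tolist())  # bracket syntax always returns the column
```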

Rebuild the DataFrame with the column pop renamed to populasi (and luas daerah shortened to luas):

In [66]:
daerah = pd.DataFrame({'populasi':populasi, 'luas':luas})

In [67]:
daerah

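| | populasi | luas |
|---|---|---|
| Bekasi | NaN | 355.0 |
| Bogor | NaN | 325.0 |
| Depok | NaN | 247.0 |
| Jakarta | NaN | 737.0 |
| Tanggerang | NaN | 302.0 |
| bekasi | 670.0 | NaN |
| bogor | 490.0 | NaN |
| depok | 350.0 | NaN |
| jakarta | 750.0 | NaN |
| tanggerang | 270.0 | NaN |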
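The notebook renames the columns by rebuilding the DataFrame. An alternative, shown here only as a sketch on a hypothetical `daerah_demo`, is DataFrame.rename, which returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

daerah_demo = pd.DataFrame({'pop': [750, 490], 'luas daerah': [737, 325]},
                           index=['jakarta', 'bogor'])

# Map old column names to new ones; the result is a new DataFrame
renamed = daerah_demo.rename(columns={'pop': 'populasi',
                                      'luas daerah': 'luas'})

print(list(renamed.columns))
print(list(daerah_demo.columns))  # the original keeps its column names
```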


In [68]:
daerah['populasi']

Bekasi          NaN
Bogor           NaN
Depok           NaN
Jakarta         NaN
Tanggerang      NaN
bekasi        670.0
bogor         490.0
depok         350.0
jakarta       750.0
tanggerang    270.0
Name: populasi, dtype: float64

In [69]:
# explicit index: the result is empty because 'Jakarta' comes after 'Depok' in this index
daerah['populasi']['Jakarta':'Depok']

Series([], Name: populasi, dtype: float64)

In [70]:
# implicit index
daerah['populasi'].iloc[0:3]

Bekasi   NaN
Bogor    NaN
Depok    NaN
Name: populasi, dtype: float64

### Dataset

In [71]:
# load dataset titanic

df = pd.read_csv('Titanic.csv')

In [72]:
# show the first 5 rows of the data

df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [73]:
# show the last 5 rows of the data

df.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [74]:
# show data info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [75]:
df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [76]:
# count the non-null values in each column

df.notnull().sum()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [77]:
# count the missing (NaN) values in each column

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
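The counts above can also be expressed as a share of all rows; for example, 177 missing Age values out of 891 rows is roughly 19.9%. The sketch below shows the idea on a small, made-up DataFrame (`df_demo`), since the Titanic file is not bundled here: the mean of a boolean mask is the fraction of True values.

```python
import pandas as pd

# Hypothetical data with two missing ages out of four rows
df_demo = pd.DataFrame({
    'Age': [22.0, float('nan'), 26.0, float('nan')],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; mean() turns that into a fraction
missing_share = df_demo.isnull().mean()
print(missing_share)
```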

In [78]:
# show the number of rows and columns

df.shape

(891, 12)

In [79]:
# show columns

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [80]:
# show index

df.index

RangeIndex(start=0, stop=891, step=1)

In [81]:
# show summary statistics for the numeric columns

df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [82]:
# show mean from Age column

df['Age'].mean()

29.69911764705882

In [83]:
# show median from Age column

df['Age'].median()

28.0

In [85]:
# show mode from Age column

df['Age'].mode()

0    24.0
Name: Age, dtype: float64
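Unlike mean() and median(), mode() returns a Series, because several values can tie for the highest count. The sketch below uses a small, made-up Series (`ages`) with a tie to show this, and takes the first mode with iloc when a single number is needed.

```python
import pandas as pd

# Hypothetical ages in which 22.0 and 24.0 both appear twice
ages = pd.Series([24.0, 22.0, 24.0, 30.0, 22.0])

modes = ages.mode()          # all tied values, in sorted order
first_mode = modes.iloc[0]   # a single scalar, if one is needed

print(modes.tolist(), first_mode)
```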

In [86]:
# show min from Age column

df['Age'].min()

0.42

In [87]:
# show max from Age column

df['Age'].max()

80.0