# Replacing values in a DataFrame

This chapter shows how to use the `replace()` function to replace one or more values using **lists** and **dictionaries**.
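
As a quick orientation, the sketch below shows the three call shapes covered in this chapter on a small toy DataFrame. The data and the variable names (`df`, `scalar`, `from_list`, `from_dict`) are invented for illustration and are not part of the chapter's datasets.

```python
import pandas as pd

# Toy data invented for illustration; not the chapter's dataset
df = pd.DataFrame({"Gender": ["MALE", "FEMALE", "MALE"],
                   "Rank": [1, 2, 3]})

# Scalar -> scalar
scalar = df["Gender"].replace("MALE", "BOY")

# List of values -> one value
from_list = df["Rank"].replace([1, 2], 0)

# Dictionary of old -> new pairs
from_dict = df["Gender"].replace({"MALE": "BOY", "FEMALE": "GIRL"})

print(scalar.tolist())     # ['BOY', 'FEMALE', 'BOY']
print(from_list.tolist())  # [0, 0, 3]
print(from_dict.tolist())  # ['BOY', 'GIRL', 'BOY']
```

Each call returns a new Series and leaves `df` unchanged unless the result is assigned back.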

## Replace scalar values using `.replace()`

### The popular name dataset

In [111]:
import pandas as pd

names = pd.read_csv("datasets/Popular_Baby_Names.csv")
names.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,SOPHIA,119,1
1,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,CHLOE,106,2
2,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMILY,93,3
3,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,OLIVIA,89,4
4,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMMA,75,5


### Replace values in pandas

In [3]:
import time

In [8]:
start_time = time.time()
# Select rows and column in a single .loc[] call to avoid chained assignment
names.loc[names['Gender'] == 'MALE', 'Gender'] = 'BOY'
print("Replace values using .loc[]: {} sec".format(time.time() - start_time))

Replace values using .loc[]: 0.005461931228637695 sec


### Replace values using `.replace()`

In [12]:
start_time = time.time()
names['Gender'].replace('MALE', 'BOY', inplace=True)
print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0015935897827148438 sec
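
A note on the `inplace=True` pattern: calling `.replace(..., inplace=True)` on a column selected with `names['Gender']` is a form of chained assignment, and the pandas Copy-on-Write documentation states that this pattern does not update the parent DataFrame once Copy-on-Write is enabled. A minimal sketch of the assign-back form, which behaves the same either way, is shown below; `names_demo` is a toy stand-in, not the real `names` data.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical data)
names_demo = pd.DataFrame({"Gender": ["MALE", "FEMALE", "MALE"]})

# Assign the result back instead of relying on inplace=True
# on a column that was selected from the DataFrame
names_demo["Gender"] = names_demo["Gender"].replace("MALE", "BOY")

print(names_demo["Gender"].tolist())  # ['BOY', 'FEMALE', 'BOY']
```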


### Practice: Replacing scalar values I

In this exercise, we will replace a list of values in the dataset using the `.replace()` method.

We will apply the function to the `poker_hands` DataFrame. Recall that in `poker_hands`, the columns R1 to R5 of each row hold the rank of each card in a player's poker hand, ranging from 1 (Ace) to 13 (King). The `Class` feature classifies each hand into a category, and the `Explanation` feature briefly describes each hand.

In [13]:
poker_hands = pd.read_csv("datasets/poker_hand.csv")
poker_hands.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5,Class
0,1,10,1,11,1,13,1,12,1,1,9
1,2,11,2,13,2,10,2,12,2,1,9
2,3,12,3,11,3,13,3,10,3,1,9
3,4,10,4,11,4,1,4,13,4,12,9
4,4,1,4,13,4,12,4,11,4,10,9


In [49]:
exp = {9: 'Royal flush', 8:'Straight flush', 1:'One pair', 0:'Nothing in hand',
       4:'Straight', 3:'Three of a kind', 2:'Two pairs', 5:'Flush', 6:'Full house',
       7:'Four of a kind'}

class_exp = poker_hands['Class'].tolist()

In [69]:
# Look up the description of each class label directly in the dictionary
exp_list = [exp[i] for i in class_exp]

In [70]:
poker_hands['Explanation'] = exp_list
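
The `Explanation` column built above could also be derived without an intermediate list by using `Series.map` with the dictionary. The sketch below uses a small toy frame (`hands_demo`) and a shortened dictionary (`exp_demo`), both invented for illustration.

```python
import pandas as pd

# Toy stand-in for poker_hands with only the Class column (hypothetical data)
hands_demo = pd.DataFrame({"Class": [9, 1, 0, 2]})

# Shortened version of the class -> description dictionary
exp_demo = {9: "Royal flush", 1: "One pair", 0: "Nothing in hand", 2: "Two pairs"}

# map() looks up every value of the column in the dictionary
hands_demo["Explanation"] = hands_demo["Class"].map(exp_demo)

print(hands_demo["Explanation"].tolist())
# ['Royal flush', 'One pair', 'Nothing in hand', 'Two pairs']
```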

In [74]:
# Replace Class 1 with -2
poker_hands['Class'].replace(1, -2, inplace=True)
# Replace Class 2 with -3
poker_hands['Class'].replace(2, -3, inplace=True)

print(poker_hands[['Class', 'Explanation']])

       Class      Explanation
0          9      Royal flush
1          9      Royal flush
2          9      Royal flush
3          9      Royal flush
4          9      Royal flush
...      ...              ...
25005      0  Nothing in hand
25006     -2         One pair
25007     -2         One pair
25008     -2         One pair
25009     -2         One pair

[25010 rows x 2 columns]


**Note**: The hands classified as pairs (*One pair* and *Two pairs*) are now the only ones with a negative class value.
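
One way to check this claim is to filter on the sign of `Class` and list the matching explanations. The sketch below repeats the two replacements on a toy frame (`hands_demo`, invented for illustration) and then inspects the negative rows.

```python
import pandas as pd

# Toy stand-in for poker_hands (hypothetical rows)
hands_demo = pd.DataFrame({
    "Class": [9, 0, 1, 2, 1],
    "Explanation": ["Royal flush", "Nothing in hand", "One pair",
                    "Two pairs", "One pair"],
})

# Same two scalar replacements as in the exercise, assigned back
hands_demo["Class"] = hands_demo["Class"].replace(1, -2)
hands_demo["Class"] = hands_demo["Class"].replace(2, -3)

# Which explanations belong to rows with a negative class?
negative = hands_demo.loc[hands_demo["Class"] < 0, "Explanation"].unique().tolist()
print(negative)  # ['One pair', 'Two pairs']
```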

### Replace scalar values II

As discussed in the video, values in a pandas DataFrame can be replaced in a very intuitive way: locate the position (row and column) in the DataFrame and assign the new value. A more pandas-idiomatic alternative is the `.replace()` function, which performs the same task.

You will use the `names` DataFrame, which includes, among other things, the most popular names in the US by year, gender, and ethnicity.

Your task is to replace all babies classified as `FEMALE` with `GIRL` using the following methods:

* intuitive scalar replacement
* the `.replace()` function
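
The two methods listed above can be sketched side by side on a toy frame. The data and the names `names_demo`, `by_loc`, and `by_replace` are assumptions made for illustration; the real exercise runs on `names`.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical data)
names_demo = pd.DataFrame({"Gender": ["FEMALE", "MALE", "FEMALE"],
                           "Count": [119, 106, 93]})

# 1) Positional assignment: boolean mask for the rows, label for the column
by_loc = names_demo.copy()
by_loc.loc[by_loc["Gender"] == "FEMALE", "Gender"] = "GIRL"

# 2) .replace() on the column, with the result assigned back
by_replace = names_demo.copy()
by_replace["Gender"] = by_replace["Gender"].replace("FEMALE", "GIRL")

print(by_loc["Gender"].tolist())      # ['GIRL', 'MALE', 'GIRL']
print(by_replace["Gender"].tolist())  # ['GIRL', 'MALE', 'GIRL']
```

Both approaches produce the same column; they differ mainly in readability and speed.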

In [81]:
start_time = time.time()

# Replace all the entries that have 'FEMALE' as a gender with 'GIRL'
names.loc[names['Gender'] == 'FEMALE', 'Gender'] = 'GIRL'

print("Time using .loc[]: {} sec".format(time.time() - start_time))

Time using .loc[]: 0.007901668548583984 sec


In [86]:
start_time = time.time()

# Replace all the entries that have 'FEMALE' as a gender with 'GIRL'
names['Gender'].replace('FEMALE', 'GIRL', inplace=True)

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.005396366119384766 sec


**Note**: In the timings above, `.replace()` was faster than `.loc[]`.
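
A single `time.time()` measurement of a few milliseconds is noisy, so the comparison is easier to trust when each variant is run several times. The sketch below uses the standard-library `timeit` module on synthetic data; the Series `genders`, its size, and the helper functions `with_loc` and `with_replace` are assumptions for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic column of 10,000 values (hypothetical data)
genders = pd.Series(["FEMALE", "MALE"] * 5000, name="Gender")


def with_loc():
    # Work on a copy so every run starts from the same data
    s = genders.copy()
    s.loc[s == "FEMALE"] = "GIRL"
    return s


def with_replace():
    # replace() already returns a new Series
    return genders.replace("FEMALE", "GIRL")


# Best of 5 repetitions, each consisting of 20 calls
loc_time = min(timeit.repeat(with_loc, number=20, repeat=5))
replace_time = min(timeit.repeat(with_replace, number=20, repeat=5))

print(f".loc[]:     {loc_time:.4f} sec for 20 runs")
print(f".replace(): {replace_time:.4f} sec for 20 runs")
```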

## Replace values using lists

### Replace multiple values with one value

In [88]:
start_time = time.time()
names.loc[(names["Ethnicity"] == 'WHITE NON HISPANIC') | (names["Ethnicity"] == 'WHITE NON HISP'), 'Ethnicity'] = 'WNH'
print("Results from the above operation calculated in %s seconds" %(time.time() - start_time))

Results from the above operation calculated in 0.006935834884643555 seconds


### Replace multiple values using .replace() I

In [89]:
start_time = time.time()
names['Ethnicity'].replace(['WHITE NON HISPANIC', 'WHITE NON HISP'], 'WNH', inplace=True)
print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0018818378448486328 sec
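
To see what the list form does, here is a minimal sketch on a toy Series (`eth`, invented for illustration): every value that appears in the list is replaced by the single target value, and values outside the list are left alone.

```python
import pandas as pd

# Toy ethnicity column (hypothetical data)
eth = pd.Series(["WHITE NON HISPANIC", "WHITE NON HISP", "HISPANIC"])

# Any value found in the list becomes 'WNH'
merged = eth.replace(["WHITE NON HISPANIC", "WHITE NON HISP"], "WNH")

print(merged.tolist())  # ['WNH', 'WNH', 'HISPANIC']
```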


### Practice: Replace multiple values I

In [98]:
start_time = time.time()

# Replace all non-Hispanic ethnicities with 'NON HISPANIC'
names.loc[(names["Ethnicity"] == 'BLACK NON HISP') |
          (names["Ethnicity"] == 'BLACK NON HISPANIC') |
          (names["Ethnicity"] == 'WHITE NON HISP') |
          (names["Ethnicity"] == 'WHITE NON HISPANIC'), 'Ethnicity'] = 'NON HISPANIC'

print("Time using .loc[]: {} sec".format(time.time() - start_time))


In [100]:
start_time = time.time()

# Replace all non-Hispanic ethnicities with 'NON HISPANIC'
names['Ethnicity'].replace(['BLACK NON HISP', 'BLACK NON HISPANIC', 'WHITE NON HISP' , 'WHITE NON HISPANIC'], 'NON HISPANIC', inplace=True)

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0048792362213134766 sec


### Replace multiple values II

In [102]:
start_time = time.time()

# Replace ethnicities as instructed
names['Ethnicity'].replace(['ASIAN AND PACI','BLACK NON HISP','WHITE NON HISP'], ['ASIAN AND PACIFIC ISLANDER','BLACK NON HISPANIC','WHITE NON HISPANIC'], inplace=True)

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.004667043685913086 sec
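
When two lists are passed, values are paired by position: the first entry of the first list is replaced by the first entry of the second list, and so on, so both lists need the same length. A minimal sketch on a toy Series (`eth`, `old`, and `new` are illustrative names):

```python
import pandas as pd

# Toy ethnicity column with truncated labels (hypothetical data)
eth = pd.Series(["ASIAN AND PACI", "BLACK NON HISP", "HISPANIC", "WHITE NON HISP"])

old = ["ASIAN AND PACI", "BLACK NON HISP", "WHITE NON HISP"]
new = ["ASIAN AND PACIFIC ISLANDER", "BLACK NON HISPANIC", "WHITE NON HISPANIC"]

# old[i] is replaced by new[i]; whole values are matched, not substrings
fixed = eth.replace(old, new)

print(fixed.tolist())
# ['ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISPANIC', 'HISPANIC', 'WHITE NON HISPANIC']
```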


## Replace values using dictionaries

### Replace single values with dictionaries

In [103]:
start_time = time.time()
names['Gender'].replace({'MALE':'BOY', 'FEMALE':'GIRL'}, inplace=True)
print("Time using .replace() with dictionary: {} sec".format(time.time() - start_time))

Time using .replace() with dictionary: 0.0037374496459960938 sec


In [105]:
start_time = time.time()
names['Gender'].replace('MALE', 'BOY', inplace=True)
names['Gender'].replace('FEMALE', 'GIRL', inplace=True)
print("Time using multiple .replace(): {} sec".format(time.time() - start_time))

Time using multiple .replace(): 0.00292205810546875 sec
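
The two cells above express the same replacement in different ways. For mappings like this one, where no new value is also an old value, a single dictionary call and a sequence of scalar calls give the same result, as this sketch on a toy Series (`g`, invented for illustration) shows.

```python
import pandas as pd

# Toy gender column (hypothetical data)
g = pd.Series(["MALE", "FEMALE", "MALE"])

# One call with a dictionary of old -> new pairs
one_call = g.replace({"MALE": "BOY", "FEMALE": "GIRL"})

# Two chained scalar calls
two_calls = g.replace("MALE", "BOY").replace("FEMALE", "GIRL")

print(one_call.tolist())   # ['BOY', 'GIRL', 'BOY']
print(two_calls.tolist())  # ['BOY', 'GIRL', 'BOY']
```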


### Replace multiple values using dictionaries

In [106]:
start_time = time.time()
names.replace({'Ethnicity': {'ASIAN AND PACI': 'ASIAN',
                             'ASIAN AND PACIFIC ISLANDER': 'ASIAN',
                             'BLACK NON HISPANIC': 'BLACK',
                             'BLACK NON HISP': 'BLACK',
                             'WHITE NON HISPANIC': 'WHITE',
                             'WHITE NON HISP': 'WHITE'}})

print("Time using .replace() with dictionary: {} sec".format(time.time() - start_time))

Time using .replace() with dictionary: 0.0063822269439697266 sec
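
Note that the call above does not pass `inplace=True`, so it returns a new DataFrame rather than modifying `names`. The sketch below illustrates this on a toy frame: the outer dictionary key names the column, the inner dictionary maps old values to new ones, and the original stays unchanged. `df`, `mapping`, and `result` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical data)
df = pd.DataFrame({"Ethnicity": ["ASIAN AND PACI", "BLACK NON HISP", "HISPANIC"],
                   "Count": [10, 20, 30]})

# Outer key = column name, inner dict = old -> new values in that column
mapping = {"Ethnicity": {"ASIAN AND PACI": "ASIAN", "BLACK NON HISP": "BLACK"}}

# Without inplace=True the result must be captured
result = df.replace(mapping)

print(df["Ethnicity"].tolist())      # ['ASIAN AND PACI', 'BLACK NON HISP', 'HISPANIC']
print(result["Ethnicity"].tolist())  # ['ASIAN', 'BLACK', 'HISPANIC']
```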


### Practice: Replace single values I

In [107]:
# Replace 'Royal flush' and 'Straight flush' with 'Flush'
poker_hands.replace({'Royal flush':'Flush', 'Straight flush':'Flush'}, inplace=True)
print(poker_hands['Explanation'].head())

0    Flush
1    Flush
2    Flush
3    Flush
4    Flush
Name: Explanation, dtype: object
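
A flat dictionary passed to `DataFrame.replace()` is applied to every column, not only to `Explanation`. If the replacement should be limited to one column, a nested dictionary can scope it. The sketch below contrasts the two on a toy frame; the `Note` column is hypothetical and exists only to show the difference.

```python
import pandas as pd

# Toy frame; 'Note' is a hypothetical second column that happens to
# contain the same text as 'Explanation'
df = pd.DataFrame({"Explanation": ["Royal flush", "One pair"],
                   "Note": ["Royal flush", "n/a"]})

# Flat dict: matching values are replaced in all columns
everywhere = df.replace({"Royal flush": "Flush"})

# Nested dict: only the 'Explanation' column is touched
scoped = df.replace({"Explanation": {"Royal flush": "Flush"}})

print(everywhere["Note"].tolist())  # ['Flush', 'n/a']
print(scoped["Note"].tolist())      # ['Royal flush', 'n/a']
```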


### Replace single values II

In [110]:
# Replace the numeric rank with a string
names['Rank'].replace({1:'FIRST', 2:'SECOND', 3:'THIRD'}, inplace=True)
names.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,SOPHIA,119,FIRST
1,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,CHLOE,106,SECOND
2,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMILY,93,THIRD
3,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,OLIVIA,89,4
4,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMMA,75,5
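
As the output shows, the `Rank` column now mixes strings and integers. The sketch below, on a toy Series (`rank`, invented for illustration), shows the resulting change of dtype, which matters if the column is later used for numeric operations.

```python
import pandas as pd

# Toy rank column (hypothetical data)
rank = pd.Series([1, 2, 3, 4, 5], name="Rank")

# Replacing some integers with strings yields a mixed-type column
labelled = rank.replace({1: "FIRST", 2: "SECOND", 3: "THIRD"})

print(labelled.tolist())  # ['FIRST', 'SECOND', 'THIRD', 4, 5]
print(labelled.dtype)     # object
```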


### Replace multiple values III

In [112]:
# Replace the rank of the first three ranked names with 'MEDAL'
names.replace({'Rank': {1:'MEDAL', 2:'MEDAL', 3:'MEDAL'}}, inplace=True)

# Replace the rank of the 4th and 5th ranked names with 'ALMOST MEDAL'
names.replace({'Rank': {4:'ALMOST MEDAL', 5:'ALMOST MEDAL'}}, inplace=True)
names.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,SOPHIA,119,MEDAL
1,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,CHLOE,106,MEDAL
2,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMILY,93,MEDAL
3,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,OLIVIA,89,ALMOST MEDAL
4,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMMA,75,ALMOST MEDAL
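
The two calls above could also be combined into one nested dictionary. One possible way to build it, sketched on a toy frame, is `dict.fromkeys`, which assigns the same label to a group of keys; `df`, `labels`, and `result` are illustrative names.

```python
import pandas as pd

# Toy rank column (hypothetical data)
df = pd.DataFrame({"Rank": [1, 2, 3, 4, 5, 6]})

# Build one old -> new mapping from two groups of ranks
labels = {**dict.fromkeys([1, 2, 3], "MEDAL"),
          **dict.fromkeys([4, 5], "ALMOST MEDAL")}

# Single call, scoped to the 'Rank' column
result = df.replace({"Rank": labels})

print(result["Rank"].tolist())
# ['MEDAL', 'MEDAL', 'MEDAL', 'ALMOST MEDAL', 'ALMOST MEDAL', 6]
```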


### Most efficient method for scalar replacement

If you want to replace a scalar value with another scalar value, which technique is the most efficient?

* Replace using dictionaries.