# 4. Basic pandas optimizations

This chapter offers a brief introduction to working efficiently with pandas DataFrames. You'll learn the options available for iterating over a DataFrame, and then how to apply functions efficiently to the data stored in one.

## Intro to pandas DataFrame iteration

### pandas recap

* See the pandas overview in [Intermediate Python for Data Science](https://github.com/tommypratama/datacamp/tree/master/Intermediate%20Python%20for%20Data%20Science)
* A library used for data analysis
* The main data structure is the DataFrame
  * Tabular data with labeled rows and columns
  * Built on top of the NumPy array structure
* Chapter objective:
  * Best practices for iterating over a pandas DataFrame
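
As a quick refresher on the points above, the following minimal sketch builds a tiny DataFrame with labeled rows and columns and then pulls out its values as a NumPy array. The name `demo_df` and its numbers are made up for illustration and are not part of the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the chapter's baseball dataset)
demo_df = pd.DataFrame(
    {"W": [81, 94, 93], "G": [162, 162, 162]},
    index=["ARI", "ATL", "BAL"],
)

# Label-based access to one cell: row label first, then column label
atl_wins = demo_df.loc["ATL", "W"]

# The underlying values can be pulled out as a NumPy array
values = demo_df.to_numpy()

print(atl_wins)
print(type(values), values.shape)
```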

### Baseball stats

In [164]:
import pandas as pd

baseball_df = pd.read_csv("datasets/baseball_stats.csv", usecols=['Team', 'League', 'Year', 'RS', 'RA', 'W', 'G', 'Playoffs'])
baseball_df.head()

  Team League  Year   RS   RA   W  Playoffs    G
0  ARI     NL  2012  734  688  81         0  162
1  ATL     NL  2012  700  600  94         1  162
2  BAL     AL  2012  712  705  93         1  162
3  BOS     AL  2012  734  806  69         0  162
4  CHC     NL  2012  613  759  61         0  162


### Calculating win percentage

In [13]:
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

In [14]:
win_perc = calc_win_perc(50, 100)
print(win_perc)

0.5


### Adding win percentage to DataFrame

In [15]:
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]
    
    wins = row['W']
    games_played = row['G']
    
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
    
baseball_df['WP'] = win_perc_list

In [16]:
baseball_df.head()

  Team League  Year   RS   RA   W  Playoffs    G    WP
0  ARI     NL  2012  734  688  81         0  162  0.50
1  ATL     NL  2012  700  600  94         1  162  0.58
2  BAL     AL  2012  712  705  93         1  162  0.57
3  BOS     AL  2012  734  806  69         0  162  0.43
4  CHC     NL  2012  613  759  61         0  162  0.38


### Iterating with .iloc

In [17]:
%%timeit
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]
    
    wins = row['W']
    games_played = row['G']
    
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
    
baseball_df['WP'] = win_perc_list

254 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Iterating with .iterrows()

In [21]:
win_perc_list = []

for i,row in baseball_df.iterrows():
    wins = row['W']
    games_played = row['G']
    
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
    
baseball_df['WP'] = win_perc_list

In [23]:
%%timeit
win_perc_list = []

for i,row in baseball_df.iterrows():
    wins = row['W']
    games_played = row['G']
    
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
    
baseball_df['WP'] = win_perc_list

159 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
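
The two loops timed above differ only in how each row is fetched, so they should produce identical results. The sketch below checks that on a small made-up DataFrame (`demo_df` is hypothetical; `calc_win_perc` is the function defined earlier, repeated so the block runs on its own).

```python
import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Made-up rows for illustration only
demo_df = pd.DataFrame({"W": [81, 94, 61], "G": [162, 162, 162]})

# Positional lookup: one .iloc call per row
wp_iloc = []
for i in range(len(demo_df)):
    row = demo_df.iloc[i]
    wp_iloc.append(calc_win_perc(row["W"], row["G"]))

# .iterrows(): pandas hands back (index, row) pairs directly
wp_iterrows = []
for i, row in demo_df.iterrows():
    wp_iterrows.append(calc_win_perc(row["W"], row["G"]))

print(wp_iloc == wp_iterrows)
```

Only the way rows are retrieved changes between the two loops; the per-row calculation is the same.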


### Practice: Iterating with .iterrows()

`.iterrows()` returns each DataFrame row as a tuple of (index, pandas Series). But what does that mean in practice? Let's explore with a few coding exercises.

A pandas DataFrame has been loaded as `pit_df`. It is described as holding statistics for the Major League Baseball team named the Pittsburgh Pirates (abbreviated `'PIT'`) from 2008 through 2012. Note, however, that the cell below builds `pit_df` from every row of `baseball_df`, so the printed output covers all teams and seasons.

In [27]:
# Create dataframe
pit_df = baseball_df[['Team', 'League', 'Year', 'RS', 'RA', 'W', 'G', 'Playoffs']]

In [32]:
# Iterate over pit_df and print each row
for i,row in pit_df.iterrows():
    print(row)

Team         ARI
League        NL
Year        2012
RS           734
RA           688
W             81
G            162
Playoffs       0
Name: 0, dtype: object
Team         ATL
League        NL
Year        2012
RS           700
RA           600
W             94
G            162
Playoffs       1
Name: 1, dtype: object
Team         BAL
League        AL
Year        2012
RS           712
RA           705
W             93
G            162
Playoffs       1
Name: 2, dtype: object
Team         BOS
League        AL
Year        2012
RS           734
RA           806
W             69
G            162
Playoffs       0
Name: 3, dtype: object
Team         CHC
League        NL
Year        2012
RS           613
RA           759
W             61
G            162
Playoffs       0
Name: 4, dtype: object
Team         CHW
League        AL
Year        2012
RS           748
RA           676
W             85
G            162
Playoffs       0
Name: 5, dtype: object
Team         CIN
League        NL
...
Team         NYM
League        NL
Year        2004
RS           684
RA           731
W             71
G            162
Playoffs       0
Name: 258, dtype: object
Team         NYY
League        AL
Year        2004
RS           897
RA           808
W            101
G            162
Playoffs       1
Name: 259, dtype: object
Team         OAK
League        AL
Year        2004
RS           793
RA           742
W             91
G            162
Playoffs       0
Name: 260, dtype: object
Team         PHI
League        NL
Year        2004
RS           840
RA           781
W             86
G            162
Playoffs       0
Name: 261, dtype: object
Team         PIT
League        NL
Year        2004
RS           680
RA           744
W             72
G            161
Playoffs       0
Name: 262, dtype: object
Team         SDP
League        NL
Year        2004
RS           768
RA           705
W             87
G            162
Playoffs       0
Name: 263, dtype: object
...
Team         FLA
League        NL
Year        1993
RS           581
RA           724
W             64
G            162
Playoffs       0
Name: 516, dtype: object
Team         HOU
League        NL
Year        1993
RS           716
RA           630
W             85
G            162
Playoffs       0
Name: 517, dtype: object
Team         KCR
League        AL
Year        1993
RS           675
RA           694
W             84
G            162
Playoffs       0
Name: 518, dtype: object
Team         LAD
League        NL
Year        1993
RS           675
RA           662
W             81
G            162
Playoffs       0
Name: 519, dtype: object
Team         MIL
League        AL
Year        1993
RS           733
RA           792
W             69
G            162
Playoffs       0
Name: 520, dtype: object
Team         MIN
League        AL
Year        1993
RS           693
RA           830
W             71
G            162
Playoffs       0
Name: 521, dtype: object
...
Team         CHC
League        NL
Year        1983
RS           701
RA           719
W             71
G            162
Playoffs       0
Name: 772, dtype: object
Team         CHW
League        AL
Year        1983
RS           800
RA           650
W             99
G            162
Playoffs       1
Name: 773, dtype: object
Team         CIN
League        NL
Year        1983
RS           623
RA           710
W             74
G            162
Playoffs       0
Name: 774, dtype: object
Team         CLE
League        AL
Year        1983
RS           704
RA           785
W             70
G            162
Playoffs       0
Name: 775, dtype: object
Team         DET
League        AL
Year        1983
RS           789
RA           679
W             92
G            162
Playoffs       0
Name: 776, dtype: object
Team         HOU
League        NL
Year        1983
RS           643
RA           646
W             85
G            162
Playoffs       0
Name: 777, dtype: object
...
Team         KCR
League        AL
Year        1971
RS           603
RA           566
W             85
G            161
Playoffs       0
Name: 1030, dtype: object
Team         LAD
League        NL
Year        1971
RS           663
RA           587
W             89
G            162
Playoffs       0
Name: 1031, dtype: object
Team         MIL
League        AL
Year        1971
RS           534
RA           609
W             69
G            161
Playoffs       0
Name: 1032, dtype: object
Team         MIN
League        AL
Year        1971
RS           654
RA           670
W             74
G            160
Playoffs       0
Name: 1033, dtype: object
Team         MON
League        NL
Year        1971
RS           622
RA           729
W             71
G            162
Playoffs       0
Name: 1034, dtype: object
Team         NYM
League        NL
Year        1971
RS           588
RA           550
W             83
G            162
Playoffs       0
Name: 1035, dtype: object
...

In [33]:
# Iterate over pit_df and print each index variable and then each row
for i,row in pit_df.iterrows():
    print(i)
    print(row)
    print(type(row))

0
Team         ARI
League        NL
Year        2012
RS           734
RA           688
W             81
G            162
Playoffs       0
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
1
Team         ATL
League        NL
Year        2012
RS           700
RA           600
W             94
G            162
Playoffs       1
Name: 1, dtype: object
<class 'pandas.core.series.Series'>
2
Team         BAL
League        AL
Year        2012
RS           712
RA           705
W             93
G            162
Playoffs       1
Name: 2, dtype: object
<class 'pandas.core.series.Series'>
3
Team         BOS
League        AL
Year        2012
RS           734
RA           806
W             69
G            162
Playoffs       0
Name: 3, dtype: object
<class 'pandas.core.series.Series'>
4
Team         CHC
League        NL
Year        2012
RS           613
RA           759
W             61
G            162
Playoffs       0
Name: 4, dtype: object
<class 'pandas.core.series.Series'>
5
...
177
Team         TEX
League        AL
Year        2007
RS           816
RA           844
W             75
G            162
Playoffs       0
Name: 177, dtype: object
<class 'pandas.core.series.Series'>
178
Team         TOR
League        AL
Year        2007
RS           753
RA           699
W             83
G            162
Playoffs       0
Name: 178, dtype: object
<class 'pandas.core.series.Series'>
179
Team         WSN
League        NL
Year        2007
RS           673
RA           783
W             73
G            162
Playoffs       0
Name: 179, dtype: object
<class 'pandas.core.series.Series'>
180
Team         ARI
League        NL
Year        2006
RS           773
RA           788
W             76
G            162
Playoffs       0
Name: 180, dtype: object
<class 'pandas.core.series.Series'>
181
Team         ATL
League        NL
Year        2006
RS           849
RA           805
W             79
G            162
...
372
Team         HOU
League        NL
Year        2000
RS           938
RA           944
W             72
G            162
Playoffs       0
Name: 372, dtype: object
<class 'pandas.core.series.Series'>
373
Team         KCR
League        AL
Year        2000
RS           879
RA           930
W             77
G            162
Playoffs       0
Name: 373, dtype: object
<class 'pandas.core.series.Series'>
374
Team         LAD
League        NL
Year        2000
RS           798
RA           729
W             86
G            162
Playoffs       0
Name: 374, dtype: object
<class 'pandas.core.series.Series'>
375
Team         MIL
League        NL
Year        2000
RS           740
RA           826
W             73
G            163
Playoffs       0
Name: 375, dtype: object
<class 'pandas.core.series.Series'>
376
Team         MIN
League        AL
Year        2000
RS           748
RA           880
W             69
G            162
...
556
Team         SFG
League        NL
Year        1992
RS           574
RA           647
W             72
G            162
Playoffs       0
Name: 556, dtype: object
<class 'pandas.core.series.Series'>
557
Team         STL
League        NL
Year        1992
RS           631
RA           604
W             83
G            162
Playoffs       0
Name: 557, dtype: object
<class 'pandas.core.series.Series'>
558
Team         TEX
League        AL
Year        1992
RS           682
RA           753
W             77
G            162
Playoffs       0
Name: 558, dtype: object
<class 'pandas.core.series.Series'>
559
Team         TOR
League        AL
Year        1992
RS           780
RA           682
W             96
G            162
Playoffs       1
Name: 559, dtype: object
<class 'pandas.core.series.Series'>
560
Team         ATL
League        NL
Year        1991
RS           749
RA           644
W             94
G            162
...
749
Team         CLE
League        AL
Year        1984
RS           761
RA           766
W             75
G            163
Playoffs       0
Name: 749, dtype: object
<class 'pandas.core.series.Series'>
750
Team         DET
League        AL
Year        1984
RS           829
RA           643
W            104
G            162
Playoffs       1
Name: 750, dtype: object
<class 'pandas.core.series.Series'>
751
Team         HOU
League        NL
Year        1984
RS           693
RA           630
W             80
G            162
Playoffs       0
Name: 751, dtype: object
<class 'pandas.core.series.Series'>
752
Team         KCR
League        AL
Year        1984
RS           673
RA           686
W             84
G            162
Playoffs       1
Name: 752, dtype: object
<class 'pandas.core.series.Series'>
753
Team         LAD
League        NL
Year        1984
RS           580
RA           600
W             79
G            162
...
945
Team         SFG
League        NL
Year        1976
RS           595
RA           686
W             74
G            162
Playoffs       0
Name: 945, dtype: object
<class 'pandas.core.series.Series'>
946
Team         STL
League        NL
Year        1976
RS           629
RA           671
W             72
G            162
Playoffs       0
Name: 946, dtype: object
<class 'pandas.core.series.Series'>
947
Team         TEX
League        AL
Year        1976
RS           616
RA           652
W             76
G            162
Playoffs       0
Name: 947, dtype: object
<class 'pandas.core.series.Series'>
948
Team         ATL
League        NL
Year        1975
RS           583
RA           739
W             67
G            161
Playoffs       0
Name: 948, dtype: object
<class 'pandas.core.series.Series'>
949
Team         BAL
League        AL
Year        1975
RS           682
RA           553
W             90
G            159
...
1136
Team         CHC
League        NL
Year        1966
RS           644
RA           809
W             59
G            162
Playoffs       0
Name: 1136, dtype: object
<class 'pandas.core.series.Series'>
1137
Team         CHW
League        AL
Year        1966
RS           574
RA           517
W             83
G            163
Playoffs       0
Name: 1137, dtype: object
<class 'pandas.core.series.Series'>
1138
Team         CIN
League        NL
Year        1966
RS           692
RA           702
W             76
G            160
Playoffs       0
Name: 1138, dtype: object
<class 'pandas.core.series.Series'>
1139
Team         CLE
League        AL
Year        1966
RS           574
RA           586
W             81
G            162
Playoffs       0
Name: 1139, dtype: object
<class 'pandas.core.series.Series'>
1140
Team         DET
League        AL
Year        1966
RS           719
RA           698
W             88
G            162
...

In [34]:
# Use one variable instead of two to store the result of .iterrows()
for row_tuple in pit_df.iterrows():
    print(row_tuple)

(0, Team         ARI
League        NL
Year        2012
RS           734
RA           688
W             81
G            162
Playoffs       0
Name: 0, dtype: object)
(1, Team         ATL
League        NL
Year        2012
RS           700
RA           600
W             94
G            162
Playoffs       1
Name: 1, dtype: object)
(2, Team         BAL
League        AL
Year        2012
RS           712
RA           705
W             93
G            162
Playoffs       1
Name: 2, dtype: object)
(3, Team         BOS
League        AL
Year        2012
RS           734
RA           806
W             69
G            162
Playoffs       0
Name: 3, dtype: object)
(4, Team         CHC
League        NL
Year        2012
RS           613
RA           759
W             61
G            162
Playoffs       0
Name: 4, dtype: object)
(5, Team         CHW
League        AL
Year        2012
RS           748
RA           676
W             85
G            162
Playoffs       0
Name: 5, dtype: object)
...
(256, Team         MIN
League        AL
Year        2004
RS           780
RA           715
W             92
G            162
Playoffs       1
Name: 256, dtype: object)
(257, Team         MON
League        NL
Year        2004
RS           635
RA           769
W             67
G            162
Playoffs       0
Name: 257, dtype: object)
(258, Team         NYM
League        NL
Year        2004
RS           684
RA           731
W             71
G            162
Playoffs       0
Name: 258, dtype: object)
(259, Team         NYY
League        AL
Year        2004
RS           897
RA           808
W            101
G            162
Playoffs       1
Name: 259, dtype: object)
(260, Team         OAK
League        AL
Year        2004
RS           793
RA           742
W             91
G            162
Playoffs       0
Name: 260, dtype: object)
(261, Team         PHI
League        NL
Year        2004
RS           840
RA           781
W             86
G            162
Playoffs       0
...
(513, Team         CLE
League        AL
Year        1993
RS           790
RA           813
W             76
G            162
Playoffs       0
Name: 513, dtype: object)
(514, Team         COL
League        NL
Year        1993
RS           758
RA           967
W             67
G            162
Playoffs       0
Name: 514, dtype: object)
(515, Team         DET
League        AL
Year        1993
RS           899
RA           837
W             85
G            162
Playoffs       0
Name: 515, dtype: object)
(516, Team         FLA
League        NL
Year        1993
RS           581
RA           724
W             64
G            162
Playoffs       0
Name: 516, dtype: object)
(517, Team         HOU
League        NL
Year        1993
RS           716
RA           630
W             85
G            162
Playoffs       0
Name: 517, dtype: object)
(518, Team         KCR
League        AL
Year        1993
RS           675
RA           694
W             84
G            162
...
(777, Team         HOU
League        NL
Year        1983
RS           643
RA           646
W             85
G            162
Playoffs       0
Name: 777, dtype: object)
(778, Team         KCR
League        AL
Year        1983
RS           696
RA           767
W             79
G            163
Playoffs       0
Name: 778, dtype: object)
(779, Team         LAD
League        NL
Year        1983
RS           654
RA           609
W             91
G            163
Playoffs       1
Name: 779, dtype: object)
(780, Team         MIL
League        AL
Year        1983
RS           764
RA           708
W             87
G            162
Playoffs       0
Name: 780, dtype: object)
(781, Team         MIN
League        AL
Year        1983
RS           709
RA           822
W             70
G            162
Playoffs       0
Name: 781, dtype: object)
(782, Team         MON
League        NL
Year        1983
RS           677
RA           646
W             82
G            163
Playoffs       0
...
(1021, Team         BAL
League        AL
Year        1971
RS           742
RA           530
W            101
G            158
Playoffs       1
Name: 1021, dtype: object)
(1022, Team         BOS
League        AL
Year        1971
RS           691
RA           667
W             85
G            162
Playoffs       0
Name: 1022, dtype: object)
(1023, Team         CAL
League        AL
Year        1971
RS           511
RA           576
W             76
G            162
Playoffs       0
Name: 1023, dtype: object)
(1024, Team         CHC
League        NL
Year        1971
RS           637
RA           648
W             83
G            162
Playoffs       0
Name: 1024, dtype: object)
(1025, Team         CHW
League        AL
Year        1971
RS           617
RA           597
W             79
G            162
Playoffs       0
Name: 1025, dtype: object)
(1026, Team         CIN
League        NL
Year        1971
RS           586
RA           581
W             79
G            162
Playoffs       0
...

In [35]:
# Print the row and type of each row
for row_tuple in pit_df.iterrows():
    print(row_tuple)
    print(type(row_tuple))

(0, Team         ARI
League        NL
Year        2012
RS           734
RA           688
W             81
G            162
Playoffs       0
Name: 0, dtype: object)
<class 'tuple'>
(1, Team         ATL
League        NL
Year        2012
RS           700
RA           600
W             94
G            162
Playoffs       1
Name: 1, dtype: object)
<class 'tuple'>
(2, Team         BAL
League        AL
Year        2012
RS           712
RA           705
W             93
G            162
Playoffs       1
Name: 2, dtype: object)
<class 'tuple'>
(3, Team         BOS
League        AL
Year        2012
RS           734
RA           806
W             69
G            162
Playoffs       0
Name: 3, dtype: object)
<class 'tuple'>
(4, Team         CHC
League        NL
Year        2012
RS           613
RA           759
W             61
G            162
Playoffs       0
Name: 4, dtype: object)
<class 'tuple'>
(5, Team         CHW
League        AL
Year        2012
RS           748
RA           676
...
(223, Team         LAA
League        AL
Year        2005
RS           761
RA           643
W             95
G            162
Playoffs       1
Name: 223, dtype: object)
<class 'tuple'>
(224, Team         LAD
League        NL
Year        2005
RS           685
RA           755
W             71
G            162
Playoffs       0
Name: 224, dtype: object)
<class 'tuple'>
(225, Team         MIL
League        NL
Year        2005
RS           726
RA           697
W             81
G            162
Playoffs       0
Name: 225, dtype: object)
<class 'tuple'>
(226, Team         MIN
League        AL
Year        2005
RS           688
RA           662
W             83
G            162
Playoffs       0
Name: 226, dtype: object)
<class 'tuple'>
(227, Team         NYM
League        NL
Year        2005
RS           722
RA           648
W             83
G            162
Playoffs       0
Name: 227, dtype: object)
<class 'tuple'>
(228, Team         NYY
...
(442, Team         PIT
League        NL
Year        1998
RS           650
RA           718
W             69
G            163
Playoffs       0
Name: 442, dtype: object)
<class 'tuple'>
(443, Team         SDP
League        NL
Year        1998
RS           749
RA           635
W             98
G            162
Playoffs       1
Name: 443, dtype: object)
<class 'tuple'>
(444, Team         SEA
League        AL
Year        1998
RS           859
RA           855
W             76
G            161
Playoffs       0
Name: 444, dtype: object)
<class 'tuple'>
(445, Team         SFG
League        NL
Year        1998
RS           845
RA           739
W             89
G            163
Playoffs       0
Name: 445, dtype: object)
<class 'tuple'>
(446, Team         STL
League        NL
Year        1998
RS           810
RA           782
W             83
G            163
Playoffs       0
Name: 446, dtype: object)
<class 'tuple'>
(447, Team         TBD
...
(668, Team         CHC
League        NL
Year        1987
RS           720
RA           801
W             76
G            161
Playoffs       0
Name: 668, dtype: object)
<class 'tuple'>
(669, Team         CHW
League        AL
Year        1987
RS           748
RA           746
W             77
G            162
Playoffs       0
Name: 669, dtype: object)
<class 'tuple'>
(670, Team         CIN
League        NL
Year        1987
RS           783
RA           752
W             84
G            162
Playoffs       0
Name: 670, dtype: object)
<class 'tuple'>
(671, Team         CLE
League        AL
Year        1987
RS           742
RA           957
W             61
G            162
Playoffs       0
Name: 671, dtype: object)
<class 'tuple'>
(672, Team         DET
League        AL
Year        1987
RS           896
RA           735
W             98
G            162
Playoffs       1
Name: 672, dtype: object)
<class 'tuple'>
(673, Team         HOU
...
(895, Team         STL
League        NL
Year        1978
RS           600
RA           657
W             69
G            162
Playoffs       0
Name: 895, dtype: object)
<class 'tuple'>
(896, Team         TEX
League        AL
Year        1978
RS           692
RA           632
W             87
G            162
Playoffs       0
Name: 896, dtype: object)
<class 'tuple'>
(897, Team         TOR
League        AL
Year        1978
RS           590
RA           775
W             59
G            161
Playoffs       0
Name: 897, dtype: object)
<class 'tuple'>
(898, Team         ATL
League        NL
Year        1977
RS           678
RA           895
W             61
G            162
Playoffs       0
Name: 898, dtype: object)
<class 'tuple'>
(899, Team         BAL
League        AL
Year        1977
RS           719
RA           653
W             97
G            161
Playoffs       0
Name: 899, dtype: object)
<class 'tuple'>
(900, Team         BOS
...
(1116, Team         CHC
League        NL
Year        1967
RS           702
RA           624
W             87
G            162
Playoffs       0
Name: 1116, dtype: object)
<class 'tuple'>
(1117, Team         CHW
League        AL
Year        1967
RS           531
RA           491
W             89
G            162
Playoffs       0
Name: 1117, dtype: object)
<class 'tuple'>
(1118, Team         CIN
League        NL
Year        1967
RS           604
RA           563
W             87
G            162
Playoffs       0
Name: 1118, dtype: object)
<class 'tuple'>
(1119, Team         CLE
League        AL
Year        1967
RS           559
RA           613
W             75
G            162
Playoffs       0
Name: 1119, dtype: object)
<class 'tuple'>
(1120, Team         DET
League        AL
Year        1967
RS           683
RA           587
W             91
G            163
Playoffs       0
Name: 1120, dtype: object)
<class 'tuple'>
(1121, Team         HOU
...

**Note**: Because `.iterrows()` returns each DataFrame row as a tuple of (index, pandas Series), you can either unpack this tuple and use the index and the row values separately (as you did with `for i,row in pit_df.iterrows()`), or keep the result of `.iterrows()` as a single tuple (as you did with `for row_tuple in pit_df.iterrows()`).

If you use `i,row`, you can access values from the row with square brackets (e.g., `row['Team']`). If you use `row_tuple`, you must first select which element of the tuple you want before reading the team name (e.g., `row_tuple[1]['Team']`).

With either approach, `.iterrows()` is still noticeably faster than `.iloc`: in the timings above, the loop took about 159 ms with `.iterrows()` versus about 254 ms with `.iloc`.
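
The two access patterns described in this note can be compared side by side. This is a minimal sketch on a made-up two-row DataFrame (`demo_df` is hypothetical), not the exercise's own data.

```python
import pandas as pd

# Made-up rows for illustration only
demo_df = pd.DataFrame({"Team": ["ARI", "ATL"], "W": [81, 94]})

# Option 1: unpack each (index, Series) pair in the for statement
teams_unpacked = []
for i, row in demo_df.iterrows():
    teams_unpacked.append(row["Team"])

# Option 2: keep the pair as one tuple; element 0 is the index, element 1 the Series
teams_from_tuple = []
for row_tuple in demo_df.iterrows():
    teams_from_tuple.append(row_tuple[1]["Team"])

print(teams_unpacked, teams_from_tuple)
```

Both loops collect the same team names; unpacking in the `for` statement simply saves the extra `[1]` lookup.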

### Practice: Run differentials with .iterrows()

You've been hired by the San Francisco Giants as an analyst. Congrats! The team's owner wants you to calculate a metric called *run differential* for each season from 2008 through 2012. It is calculated by subtracting the total number of runs a team allowed in a season from the total number of runs the team scored in that season. `'RS'` means *runs scored* and `'RA'` means *runs allowed*.

The function below calculates this metric:

In [36]:
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

In [63]:
# .copy() makes giants_df an independent DataFrame, so adding the 'RD' column
# below does not trigger pandas' SettingWithCopyWarning
giants_df = pit_df[(pit_df['Team'] == 'SFG') & (pit_df['Year'] >= 2008)].copy()

In [64]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

giants_df['RD'] = run_diffs
print(giants_df)

    Team League  Year   RS   RA   W    G  Playoffs   RD
24   SFG     NL  2012  718  649  94  162         1   69
54   SFG     NL  2011  570  578  86  162         0   -8
84   SFG     NL  2010  697  583  92  162         1  114
114  SFG     NL  2009  657  611  88  162         0   46
144  SFG     NL  2008  640  759  72  162         0 -119


**Note**: Take a look at the `giants_df` DataFrame with the new *run differential* column (`'RD'`) you created above.

The `'Playoffs'` column tells you whether a team made the playoffs in a given season: `1` means the team made the playoffs that season and `0` means it did not.

Did you notice that the Giants made the playoffs in the seasons with their highest run differentials? In fact, in both of those seasons (2010 and 2012), the San Francisco Giants not only made the playoffs but also won the World Series! Cool!
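
The run-differential exercise can be sketched end to end on a small made-up DataFrame. The team codes and numbers below are invented for illustration. The sketch takes an explicit `.copy()` of the filtered rows before adding the `'RD'` column, which is one way to avoid pandas' warning about setting values on a slice of another DataFrame.

```python
import pandas as pd

def calc_run_diff(runs_scored, runs_allowed):
    return runs_scored - runs_allowed

# Made-up numbers for illustration only (not real team statistics)
demo_df = pd.DataFrame({
    "Team": ["AAA", "BBB", "AAA"],
    "Year": [2012, 2012, 2011],
    "RS": [700, 650, 600],
    "RA": [640, 700, 610],
})

# .copy() gives an independent DataFrame rather than a slice of demo_df
subset_df = demo_df[demo_df["Team"] == "AAA"].copy()

run_diffs = []
for i, row in subset_df.iterrows():
    run_diffs.append(calc_run_diff(row["RS"], row["RA"]))

# Safe to add a column: subset_df owns its data
subset_df["RD"] = run_diffs
print(subset_df)
```

Because `subset_df` is a copy, the original `demo_df` is left unchanged and gains no `'RD'` column.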

## Another iterator method: `.itertuples()`

### Team wins data

In [67]:
team_wins_df = baseball_df[['Team', 'Year', 'W']]
team_wins_df = team_wins_df[team_wins_df['Year'] == 2012]

In [69]:
team_wins_df.head()

  Team  Year   W
0  ARI  2012  81
1  ATL  2012  94
2  BAL  2012  93
3  BOS  2012  69
4  CHC  2012  61


In [74]:
for row_tuple in team_wins_df.iterrows():
    print(row_tuple)
    print(type(row_tuple[1]))

(0, Team     ARI
Year    2012
W         81
Name: 0, dtype: object)
<class 'pandas.core.series.Series'>
(1, Team     ATL
Year    2012
W         94
Name: 1, dtype: object)
<class 'pandas.core.series.Series'>
(2, Team     BAL
Year    2012
W         93
Name: 2, dtype: object)
<class 'pandas.core.series.Series'>
(3, Team     BOS
Year    2012
W         69
Name: 3, dtype: object)
<class 'pandas.core.series.Series'>
(4, Team     CHC
Year    2012
W         61
Name: 4, dtype: object)
<class 'pandas.core.series.Series'>
(5, Team     CHW
Year    2012
W         85
Name: 5, dtype: object)
<class 'pandas.core.series.Series'>
(6, Team     CIN
Year    2012
W         97
Name: 6, dtype: object)
<class 'pandas.core.series.Series'>
(7, Team     CLE
Year    2012
W         68
Name: 7, dtype: object)
<class 'pandas.core.series.Series'>
(8, Team     COL
Year    2012
W         64
Name: 8, dtype: object)
<class 'pandas.core.series.Series'>
(9, Team     DET
Year    2012
W         88
Name: 9, dtype: object)
...

### Iterating with .itertuples()

In [77]:
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

Pandas(Index=0, Team='ARI', Year=2012, W=81)
Pandas(Index=1, Team='ATL', Year=2012, W=94)
Pandas(Index=2, Team='BAL', Year=2012, W=93)
Pandas(Index=3, Team='BOS', Year=2012, W=69)
Pandas(Index=4, Team='CHC', Year=2012, W=61)
Pandas(Index=5, Team='CHW', Year=2012, W=85)
Pandas(Index=6, Team='CIN', Year=2012, W=97)
Pandas(Index=7, Team='CLE', Year=2012, W=68)
Pandas(Index=8, Team='COL', Year=2012, W=64)
Pandas(Index=9, Team='DET', Year=2012, W=88)
Pandas(Index=10, Team='HOU', Year=2012, W=55)
Pandas(Index=11, Team='KCR', Year=2012, W=72)
Pandas(Index=12, Team='LAA', Year=2012, W=89)
Pandas(Index=13, Team='LAD', Year=2012, W=86)
Pandas(Index=14, Team='MIA', Year=2012, W=69)
Pandas(Index=15, Team='MIL', Year=2012, W=83)
Pandas(Index=16, Team='MIN', Year=2012, W=66)
Pandas(Index=17, Team='NYM', Year=2012, W=74)
Pandas(Index=18, Team='NYY', Year=2012, W=95)
Pandas(Index=19, Team='OAK', Year=2012, W=94)
Pandas(Index=20, Team='PHI', Year=2012, W=81)
Pandas(Index=21, Team='PIT', Year=2012, W=79

In [78]:
print(row_namedtuple.Index)

29


In [79]:
print(row_namedtuple.Team)

WSN


### Comparing methods

In [82]:
%%timeit
for row_tuple in team_wins_df.iterrows():
    data = row_tuple

2.63 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [83]:
%%timeit
for row_namedtuple in team_wins_df.itertuples():
    data = row_namedtuple

396 µs ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [84]:
for row_tuple in team_wins_df.iterrows():
    print(row_tuple[1]['Team'])

ARI
ATL
BAL
BOS
CHC
CHW
CIN
CLE
COL
DET
HOU
KCR
LAA
LAD
MIA
MIL
MIN
NYM
NYY
OAK
PHI
PIT
SDP
SEA
SFG
STL
TBR
TEX
TOR
WSN


In [None]:
# TypeError: tuple indices must be integers or slices, not str
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple['Team'])

In [88]:
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple.Team)

ARI
ATL
BAL
BOS
CHC
CHW
CIN
CLE
COL
DET
HOU
KCR
LAA
LAD
MIA
MIL
MIN
NYM
NYY
OAK
PHI
PIT
SDP
SEA
SFG
STL
TBR
TEX
TOR
WSN


### Practice: Iterating with `.itertuples()`

Remember, `.itertuples()` returns each DataFrame row as a special data type called a **namedtuple**. You look up the fields of a namedtuple with attribute (dot) syntax rather than with string keys. Let's practice working with namedtuples.
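As a minimal sketch of how a namedtuple behaves, the example below builds one by hand with the standard library. The `Row` type is a hypothetical stand-in for the rows that `.itertuples()` produces; the values come from the first row of the 2012 team wins data.

```python
from collections import namedtuple

# Hypothetical stand-in for one row yielded by .itertuples()
Row = namedtuple('Row', ['Index', 'Team', 'Year', 'W'])
row = Row(Index=0, Team='ARI', Year=2012, W=81)

print(row.Team)       # attribute (dot) access
print(row[1])         # positional access works too, since it is still a tuple
print(row._asdict())  # convert to a dict if string keys are really needed

try:
    row['Team']       # string keys are not supported on tuples
except TypeError as err:
    print('TypeError:', err)
```

Dot access and positional access return the same value; only string indexing fails.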

A pandas DataFrame has been loaded as `rangers_df`. It contains the stats (`'Team'`, `'League'`, `'Year'`, `'RS'`, `'RA'`, `'W'`, `'G'`, and `'Playoffs'`) for the Major League Baseball team named the Texas Rangers (abbreviated `'TEX'`).

In [91]:
# Keep every column (including the 'WP' column added earlier) for the Texas Rangers
rangers_df = baseball_df[baseball_df['Team'] == 'TEX']

In [94]:
# Loop over the DataFrame and print each row
for row in rangers_df.itertuples():
    print(row)

Pandas(Index=27, Team='TEX', League='AL', Year=2012, RS=808, RA=707, W=93, Playoffs=1, G=162, WP=0.57)
Pandas(Index=57, Team='TEX', League='AL', Year=2011, RS=855, RA=677, W=96, Playoffs=1, G=162, WP=0.59)
Pandas(Index=87, Team='TEX', League='AL', Year=2010, RS=787, RA=687, W=90, Playoffs=1, G=162, WP=0.56)
Pandas(Index=117, Team='TEX', League='AL', Year=2009, RS=784, RA=740, W=87, Playoffs=0, G=162, WP=0.54)
Pandas(Index=147, Team='TEX', League='AL', Year=2008, RS=901, RA=967, W=79, Playoffs=0, G=162, WP=0.49)
Pandas(Index=177, Team='TEX', League='AL', Year=2007, RS=816, RA=844, W=75, Playoffs=0, G=162, WP=0.46)
Pandas(Index=207, Team='TEX', League='AL', Year=2006, RS=835, RA=784, W=80, Playoffs=0, G=162, WP=0.49)
Pandas(Index=237, Team='TEX', League='AL', Year=2005, RS=865, RA=858, W=79, Playoffs=0, G=162, WP=0.49)
Pandas(Index=268, Team='TEX', League='AL', Year=2004, RS=860, RA=794, W=89, Playoffs=0, G=162, WP=0.55)
Pandas(Index=298, Team='TEX', League='AL', Year=2003, RS=826, RA=96

In [95]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
    i = row.Index
    year = row.Year
    wins = row.W
    print(i, year, wins)

27 2012 93
57 2011 96
87 2010 90
117 2009 87
147 2008 79
177 2007 75
207 2006 80
237 2005 79
268 2004 89
298 2003 71
328 2002 72
358 2001 73
388 2000 71
418 1999 95
448 1998 88
476 1997 77
504 1996 90
532 1993 86
558 1992 77
584 1991 85
610 1990 83
636 1989 83
662 1988 70
688 1987 75
714 1986 87
740 1985 62
766 1984 69
792 1983 77
818 1982 64
844 1980 76
870 1979 83
896 1978 87
922 1977 94
947 1976 76
971 1975 79
995 1974 83
1019 1973 57


In [96]:
# Loop over the DataFrame and print Index, Year and Wins (W) for playoff seasons only
for row in rangers_df.itertuples():
    i = row.Index
    year = row.Year
    wins = row.W
  
    # Check if rangers made Playoffs (1 means yes; 0 means no)
    if row.Playoffs == 1:
        print(i, year, wins)

27 2012 93
57 2011 96
87 2010 90
418 1999 95
448 1998 88
504 1996 90


**Note** : You are now familiar with `.itertuples()`. Remember that you need *dot* syntax to reference a field of a **namedtuple**.

You can create new variables from a row's dot references (as you did when storing `row.Index` in the variable `i`), or you can use the dot references directly in calculations and checks. Note that you did not have to store `row.Playoffs` in a new variable for the `if` statement; you used `row.Playoffs` directly in the condition.
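One caveat for dot references: a column name that is not a valid Python identifier cannot be used as an attribute, so `.itertuples()` renames such fields to positional names. The sketch below assumes a hypothetical column called `'Run Diff'` (not part of the dataset) to show the effect; the value 101 is the Rangers' 2012 run differential (808 - 707).

```python
import pandas as pd

# 'Run Diff' is a hypothetical column name containing a space
df = pd.DataFrame({'Team': ['TEX'], 'Run Diff': [101], 'W': [93]})

row = next(df.itertuples())
print(row._fields)  # shows which names the namedtuple actually uses

# The renamed field is reachable by its positional name or by position
run_diff_by_name = row._2
run_diff_by_position = row[2]
print(run_diff_by_name, run_diff_by_position)
```

Renaming such columns up front (for example to `Run_Diff`) keeps the dot syntax readable.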

Did you notice a pattern in the Texas Rangers' playoff appearances? There are only six appearances, and they fall into two distinct clusters (one from 2010 to 2012 and one from 1996 to 1999).

### Practice: Run differentials with `.itertuples()`

The New York Yankees have made a trade with the San Francisco Giants for your analyst contract. You are a hot commodity! Your new boss has seen your work with the Giants and now wants you to do something similar with Yankees data: calculate the *run differential* for each Yankees season from 1962 to 2012 and find the season in which they had the best one.

You remember the function you used when working with the Giants:

In [97]:
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

Let's use `.itertuples()` to loop over the `yankees_df` DataFrame.

In [106]:
# Keep every column for the New York Yankees and reset the index
# (the old index is preserved as an 'index' column)
yankees_df = baseball_df[baseball_df['Team'] == 'NYY'].reset_index()

In [114]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
print(yankees_df)

    index Team League  Year   RS   RA    W  Playoffs    G    WP   RD
0      18  NYY     AL  2012  804  668   95         1  162  0.59  136
1      48  NYY     AL  2011  867  657   97         1  162  0.60  210
2      78  NYY     AL  2010  859  693   95         1  162  0.59  166
3     108  NYY     AL  2009  915  753  103         1  162  0.64  162
4     138  NYY     AL  2008  789  727   89         0  162  0.55   62
5     168  NYY     AL  2007  968  777   94         1  162  0.58  191
6     198  NYY     AL  2006  930  767   97         1  162  0.60  163
7     228  NYY     AL  2005  886  789   95         1  162  0.59   97
8     259  NYY     AL  2004  897  808  101         1  162  0.62   89
9     289  NYY     AL  2003  877  716  101         1  163  0.62  161
10    319  NYY     AL  2002  897  697  103         1  161  0.64  200
11    349  NYY     AL  2001  804  713   95         1  161  0.59   91
12    379  NYY     AL  2000  871  814   87         1  161  0.54   57
13    409  NYY     AL  1999  900  

**Question**

In which year within your DataFrame did the New York Yankees have the highest *run differential*?

* In 1998 (with a run differential of 309)

**Note** : You used `.itertuples()` to calculate the Yankees' *run differentials*. Remember, `.itertuples()` is used much like `.iterrows()` but tends to be faster, and you must use *dot* references to look up the fields of the rows it returns.
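Speed is not the only difference between the two methods. Because `.iterrows()` packs each row into a single Series, mixed numeric columns are upcast to a common dtype, whereas `.itertuples()` keeps each value's own type. A small sketch, using two Yankees rows (`W` and `WP` for 2012 and 2011) as sample data:

```python
import numbers

import pandas as pd

df = pd.DataFrame({'W': [95, 97], 'WP': [0.59, 0.60]})

# .iterrows(): each row is a Series, so the int and float columns share one dtype
_, series_row = next(df.iterrows())
print(series_row.dtype, type(series_row['W']))

# .itertuples(): each field keeps its own type
tuple_row = next(df.itertuples())
print(type(tuple_row.W), type(tuple_row.WP))
print(isinstance(tuple_row.W, numbers.Integral))
```

This matters if later code relies on a value being an integer, for example when using it as a position or a count.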

You found that the Yankees' highest *run differential* in this DataFrame came in 1998. Did you know that they also hold the record for the highest *run differential* in an MLB season (411 in 1939, with an RS of 967 and an RA of 556)? Wow!

## pandas alternative to looping

### pandas `.apply()` method

* Takes a function and applies it to a DataFrame
  * You must specify the axis to apply it along (`0` for columns; `1` for rows)
* Can be used with anonymous functions (`lambda` functions)

Example:

In [115]:
baseball_df.apply(lambda row: calc_run_diff(row['RS'], row['RA']), axis=1)

0        46
1       100
2         7
3       -72
4      -146
       ... 
1227    -54
1228     80
1229    188
1230    110
1231   -117
Length: 1232, dtype: int64

### Run differentials with `.apply()`

In [117]:
run_diffs_apply = baseball_df.apply(lambda row: calc_run_diff(row['RS'], row['RA']), axis=1)
baseball_df['RD'] = run_diffs_apply
print(baseball_df)

     Team League  Year   RS   RA    W  Playoffs    G    WP   RD
0     ARI     NL  2012  734  688   81         0  162  0.50   46
1     ATL     NL  2012  700  600   94         1  162  0.58  100
2     BAL     AL  2012  712  705   93         1  162  0.57    7
3     BOS     AL  2012  734  806   69         0  162  0.43  -72
4     CHC     NL  2012  613  759   61         0  162  0.38 -146
...   ...    ...   ...  ...  ...  ...       ...  ...   ...  ...
1227  PHI     NL  1962  705  759   81         0  161  0.50  -54
1228  PIT     NL  1962  706  626   93         0  161  0.58   80
1229  SFG     NL  1962  878  690  103         1  165  0.62  188
1230  STL     NL  1962  774  664   84         0  163  0.52  110
1231  WSA     AL  1962  599  716   60         0  162  0.37 -117

[1232 rows x 10 columns]


### Comparing approaches

In [118]:
%%timeit
run_diffs_iterrows = []

for i,row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)
    
baseball_df['RD'] = run_diffs_iterrows

135 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [119]:
%%timeit
run_diffs_apply = baseball_df.apply(lambda row: calc_run_diff(row['RS'], row['RA']), axis=1)
baseball_df['RD'] = run_diffs_apply

29.3 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Analyzing baseball stats with `.apply()`

The Tampa Bay Rays want you to analyze their data.

They want the following metrics:
* The sum of each column in the data
* The total number of *runs scored* in their games each year (`'RS'` + `'RA'` for each year)
* The `'Playoffs'` column as text rather than the numbers `1` and `0`

The function below can be used to convert the `'Playoffs'` column to text:

In [120]:
def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No' 

Use `.apply()` to gather these metrics. A DataFrame (`rays_df`) has been loaded; it is indexed on the `'Year'` column.

In [135]:
rays_df_filter = baseball_df[baseball_df['Team'] == 'TBR']

In [136]:
rays_df = rays_df_filter[['Year','RS','RA','W','Playoffs']].set_index('Year')

In [141]:
rays_df

Year,RS,RA,W,Playoffs
2012,697,577,90,0
2011,707,614,91,1
2010,802,649,96,1
2009,803,754,84,0
2008,774,671,97,1


In [138]:
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)

RS          3783
RA          3265
W            458
Playoffs       3
dtype: int64


In [139]:
# Gather total runs scored in all games per year
total_runs_scored = rays_df[['RS', 'RA']].apply(sum, axis=1)
print(total_runs_scored)

Year
2012    1274
2011    1321
2010    1451
2009    1557
2008    1445
dtype: int64


In [140]:
# Convert numeric playoffs to text
textual_playoffs = rays_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(textual_playoffs)

Year
2012     No
2011    Yes
2010    Yes
2009     No
2008    Yes
dtype: object


**Note** : The `.apply()` method lets you apply a function to every row or every column of a DataFrame by specifying the axis.

If you have been using pandas for a while, you may have noticed that a better way to compute these statistics is the built-in pandas `.sum()` method.

You could use `rays_df.sum(axis=0)` to get the column sums and `rays_df[['RS', 'RA']].sum(axis=1)` to get the row sums.

You can also call `.apply()` directly on a Series (a single column) of a DataFrame. For example, `rays_df['Playoffs'].apply(text_playoffs)` converts the `'Playoffs'` column to text.
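The alternatives mentioned in this note can be sketched as follows. The DataFrame is rebuilt by hand from the five Rays rows shown earlier, and `text_playoffs` is written here as a compact equivalent of the function above. The last step uses `np.where`, which is not part of the original exercise; it is included only as one possible vectorized substitute.

```python
import numpy as np
import pandas as pd

# Rebuilt from the rays_df rows shown earlier
rays_df = pd.DataFrame(
    {'RS': [697, 707, 802, 803, 774],
     'RA': [577, 614, 649, 754, 671],
     'W': [90, 91, 96, 84, 97],
     'Playoffs': [0, 1, 1, 0, 1]},
    index=pd.Index([2012, 2011, 2010, 2009, 2008], name='Year'),
)

def text_playoffs(num_playoffs):
    return 'Yes' if num_playoffs == 1 else 'No'

# Built-in sums instead of .apply(sum, ...)
col_totals = rays_df.sum(axis=0)
runs_per_year = rays_df[['RS', 'RA']].sum(axis=1)

# .apply() on a single column (Series), so no lambda or axis is needed
playoffs_text = rays_df['Playoffs'].apply(text_playoffs)

# A vectorized alternative (an assumption, not from the original exercise)
playoffs_text_np = np.where(rays_df['Playoffs'].to_numpy() == 1, 'Yes', 'No')

print(col_totals)
print(runs_per_year)
print(playoffs_text.tolist(), playoffs_text_np.tolist())
```

All three results match the `.apply()` outputs shown above.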

### Settle a debate with `.apply()`

Word of your amazing analytical skills has reached the Arizona Diamondbacks. They want your help settling a debate among their managers. One manager claims that the team has made the playoffs in every year in which it had a win percentage of 0.50 or greater. Another manager says this is not true.

Let's use the function below and the `.apply()` method to see which manager is right.

In [143]:
def calc_win_perc(wins, games_played):
    
    win_perc = wins / games_played
    return np.round(win_perc, 2)

In [148]:
ari_df_filter = baseball_df[baseball_df['Team'] == 'ARI']
dbacks_df = ari_df_filter[['Team','League','Year','RS','RA','W','G','Playoffs']].reset_index()

In [151]:
# Display the first five rows of the DataFrame
dbacks_df.head()

Unnamed: 0,index,Team,League,Year,RS,RA,W,G,Playoffs
0,0,ARI,NL,2012,734,688,81,162,0
1,30,ARI,NL,2011,731,662,94,162,1
2,60,ARI,NL,2010,713,836,65,162,0
3,90,ARI,NL,2009,720,782,70,162,0
4,120,ARI,NL,2008,720,706,82,162,0


In [152]:
# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

0     0.50
1     0.58
2     0.40
3     0.43
4     0.51
5     0.56
6     0.47
7     0.48
8     0.31
9     0.52
10    0.60
11    0.57
12    0.52
13    0.62
14    0.40
dtype: float64 



In [153]:
# Append a new column to dbacks_df
dbacks_df['WP'] = win_percs
print(dbacks_df, '\n')

# Display dbacks_df where WP is greater than or equal to 0.50
print(dbacks_df[dbacks_df['WP'] >= 0.50])

    index Team League  Year   RS   RA    W    G  Playoffs    WP
0       0  ARI     NL  2012  734  688   81  162         0  0.50
1      30  ARI     NL  2011  731  662   94  162         1  0.58
2      60  ARI     NL  2010  713  836   65  162         0  0.40
3      90  ARI     NL  2009  720  782   70  162         0  0.43
4     120  ARI     NL  2008  720  706   82  162         0  0.51
5     150  ARI     NL  2007  712  732   90  162         1  0.56
6     180  ARI     NL  2006  773  788   76  162         0  0.47
7     210  ARI     NL  2005  696  856   77  162         0  0.48
8     241  ARI     NL  2004  615  899   51  162         0  0.31
9     271  ARI     NL  2003  717  685   84  162         0  0.52
10    301  ARI     NL  2002  819  674   98  162         1  0.60
11    331  ARI     NL  2001  818  677   92  162         1  0.57
12    361  ARI     NL  2000  792  754   85  162         0  0.52
13    391  ARI     NL  1999  908  676  100  162         1  0.62
14    421  ARI     NL  1998  665  812   

**Question**

Which manager is correct in their claim?

* The manager who claimed that the team has **not** made the playoffs in every year it had a win percentage of 0.50 or greater.

**Note** : Using the `.apply()` method with a `lambda` function lets you apply a function to a DataFrame without writing a `for` loop.

Unfortunately for the first manager, the second manager is right. In 2012, 2008, 2003, and 2000 the Arizona Diamondbacks had a win percentage greater than or equal to 0.50 but still did not make the playoffs.
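Instead of reading the printed table by eye, the claim can also be checked with a boolean mask. The sketch below uses only the first five Diamondbacks seasons shown above (a subset, so it finds just two of the four counterexamples), and the variable names are illustrative.

```python
import pandas as pd

# First five seasons from dbacks_df, copied from the output above
dbacks_sample = pd.DataFrame({
    'Year': [2012, 2011, 2010, 2009, 2008],
    'Playoffs': [0, 1, 0, 0, 0],
    'WP': [0.50, 0.58, 0.40, 0.43, 0.51],
})

winning_season = dbacks_sample['WP'] >= 0.50
missed_playoffs = dbacks_sample['Playoffs'] == 0

# Seasons that contradict the first manager's claim
counterexamples = dbacks_sample[winning_season & missed_playoffs]
print(counterexamples['Year'].tolist())

# The claim holds only if there are no counterexamples
claim_holds = counterexamples.empty
print(claim_holds)
```

A single counterexample is enough to settle the debate, so the check stays cheap even on the full table.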

## Optimal pandas iterating

### pandas internals

* Eliminating loops applies to pandas as well
* pandas is built on NumPy
  * Take advantage of the efficiency of NumPy arrays

In [154]:
wins_np = baseball_df['W'].values
print(type(wins_np))

<class 'numpy.ndarray'>


In [155]:
print(wins_np)

[ 81  94  93 ... 103  84  60]
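The cells above use the `.values` attribute. The pandas documentation recommends the `.to_numpy()` method for the same purpose; the following sketch, using the first five win totals as sample data, suggests that both give the same array for a plain integer column.

```python
import numpy as np
import pandas as pd

# First five values of the 'W' column
wins = pd.Series([81, 94, 93, 69, 61], name='W')

via_values = wins.values        # attribute used in this chapter
via_to_numpy = wins.to_numpy()  # method recommended by the pandas docs

print(type(via_to_numpy))
print(np.array_equal(via_values, via_to_numpy))
```

Either form can be used in the examples that follow; the rest of this chapter keeps `.values` to match the original cells.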


### Power of vectorization

* Broadcasting (vectorizing) is extremely efficient!

In [159]:
baseball_df['RS'].values - baseball_df['RA'].values

array([  46,  100,    7, ...,  188,  110, -117])
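Strictly speaking, the subtraction above is an element-wise (vectorized) operation on two arrays of equal shape, while broadcasting refers to NumPy stretching a smaller operand, such as a scalar, to match a larger one. A short sketch of both, using the first five `RS` and `RA` values and assuming a 162-game season for the scalar example:

```python
import numpy as np

rs = np.array([734, 700, 712, 734, 613])  # runs scored
ra = np.array([688, 600, 705, 806, 759])  # runs allowed

# Vectorized: both arrays have the same shape, computed element by element
run_diff = rs - ra
print(run_diff)

# Broadcasting: the scalar is applied to every element of the array
games = 162  # assumed season length for this illustration
rs_per_game = rs / games
print(np.round(rs_per_game, 2))
```

In both cases the loop runs inside NumPy's compiled code rather than in Python, which is where the speedup comes from.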

### Run differentials with arrays

In [160]:
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values
baseball_df['RD'] = run_diffs_np
print(baseball_df)

     Team League  Year   RS   RA    W  Playoffs    G    WP   RD
0     ARI     NL  2012  734  688   81         0  162  0.50   46
1     ATL     NL  2012  700  600   94         1  162  0.58  100
2     BAL     AL  2012  712  705   93         1  162  0.57    7
3     BOS     AL  2012  734  806   69         0  162  0.43  -72
4     CHC     NL  2012  613  759   61         0  162  0.38 -146
...   ...    ...   ...  ...  ...  ...       ...  ...   ...  ...
1227  PHI     NL  1962  705  759   81         0  161  0.50  -54
1228  PIT     NL  1962  706  626   93         0  161  0.58   80
1229  SFG     NL  1962  878  690  103         1  165  0.62  188
1230  STL     NL  1962  774  664   84         0  163  0.52  110
1231  WSA     AL  1962  599  716   60         0  162  0.37 -117

[1232 rows x 10 columns]


### Comparing approaches

In [161]:
%%timeit
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values

baseball_df['RD'] = run_diffs_np

149 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### Replacing .iloc with underlying arrays

Now that you have a better understanding of DataFrame internals, let's update one of your earlier analyses to take advantage of a DataFrame's underlying arrays. You will revisit the win percentage calculation that you performed row by row with the `.iloc` method:

In [162]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

win_percs_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

baseball_df['WP'] = win_percs_list

Let's update this analysis to use arrays instead of the `.iloc` method.

In [168]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df['WP'] = win_percs_np

baseball_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,Playoffs,G,WP
0,ARI,NL,2012,734,688,81,0,162,0.5
1,ATL,NL,2012,700,600,94,1,162,0.58
2,BAL,AL,2012,712,705,93,1,162,0.57
3,BOS,AL,2012,734,806,69,0,162,0.43
4,CHC,NL,2012,613,759,61,0,162,0.38


**Question**

Use `timeit` in *cell magic mode* in your IPython console to compare the runtime of the old code block that uses `.iloc` with the new code you developed using NumPy arrays.

Don't include the code that defines the `calc_win_perc()` function or any `print()` statements when timing.

You should include **eight lines of code** when timing the old code block and **two lines of code** when timing the new code.

In [169]:
%%timeit
win_percs_list = []
for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_perc(wins, games_played)
    win_percs_list.append(win_perc)
baseball_df['WP'] = win_percs_list

256 ms ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [170]:
%%timeit
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)
baseball_df['WP'] = win_percs_np

177 µs ± 8.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


**Answer** : The NumPy array approach is faster than the `.iloc` approach (about 177 µs versus 256 ms in the timings above).

Using a DataFrame's underlying arrays for calculations can really speed up your code and yield significant efficiency gains. Did you notice that the NumPy array approach is not only faster, but also takes fewer lines of code and is easier to read?
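The `%%timeit` magic is only available in IPython. Outside of it, a similar comparison can be sketched with the standard library `timeit` module. The small DataFrame below is synthetic (five win totals repeated, with an assumed 162 games each), so the absolute numbers will differ from those above and will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for baseball_df (100 rows)
df = pd.DataFrame({'W': [81, 94, 93, 69, 61] * 20, 'G': [162] * 100})

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

def with_iloc():
    # Row-by-row approach
    win_percs_list = []
    for i in range(len(df)):
        row = df.iloc[i]
        win_percs_list.append(calc_win_perc(row['W'], row['G']))
    return win_percs_list

def with_arrays():
    # Whole-column approach on the underlying arrays
    return calc_win_perc(df['W'].values, df['G'].values)

t_iloc = timeit.timeit(with_iloc, number=5)
t_arrays = timeit.timeit(with_arrays, number=5)
print(f'.iloc loop: {t_iloc:.4f} s, arrays: {t_arrays:.4f} s')
```

Both functions return the same win percentages; only the time taken differs.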

### Bringing it all together: Predict win percentage

You would like to *predict* a team's win percentage for a given season from the team's total runs scored in the season (`'RS'`) and total runs allowed in the season (`'RA'`), using the following function:

In [171]:
def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)
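This formula is commonly known as the Pythagorean expectation. As a worked check, the sketch below steps through the arithmetic for one season, using the 2012 Arizona Diamondbacks (`RS` = 734, `RA` = 688) from the first row of the data.

```python
import numpy as np

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

rs, ra = 734, 688

numerator = rs ** 2              # 734^2 = 538756
denominator = rs ** 2 + ra ** 2  # 538756 + 473344 = 1012100
print(numerator, denominator, numerator / denominator)

# The function rounds the same ratio to two decimals
print(predict_win_perc(rs, ra))
```

The ratio is roughly 0.532, which rounds to a predicted win percentage of 0.53.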

Let's compare the approaches you have learned by calculating the predicted win percentage for each season (row) in your DataFrame.

In [172]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)
baseball_df['WP_preds'] = win_perc_preds_np
print(baseball_df.head())

  Team League  Year   RS   RA   W  Playoffs    G    WP  WP_preds
0  ARI     NL  2012  734  688  81         0  162  0.50      0.53
1  ATL     NL  2012  700  600  94         1  162  0.58      0.58
2  BAL     AL  2012  712  705  93         1  162  0.57      0.50
3  BOS     AL  2012  734  806  69         0  162  0.43      0.45
4  CHC     NL  2012  613  759  61         0  162  0.38      0.39


**Question**

What is the order of the approaches from fastest to slowest?

In [173]:
%%timeit
win_perc_preds_loop = []
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

16.8 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [174]:
%timeit win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

46.2 ms ± 831 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [175]:
%timeit win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)

37.4 µs ± 340 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


**Answer**

Using NumPy arrays is the **fastest** approach, followed by the `.itertuples()` approach, and the `.apply()` approach is the **slowest**.

You practiced three different approaches to iterating over a pandas DataFrame and performing calculations. Did you notice that the `.itertuples()` approach beat the `.apply()` approach? Although both of those implementations can be useful, you should default to using a DataFrame's underlying arrays for calculations.

Take a look at your predicted win percentages (the `'WP_preds'` column) and compare them with the actual win percentages (the `'WP'` column). Not bad!
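To put a number on "not bad", one option is the mean absolute error between the two columns. This sketch uses only the five rows printed above, so it is an illustration rather than an evaluation of the whole dataset.

```python
import numpy as np

# 'WP' and 'WP_preds' for the five rows shown above (ARI, ATL, BAL, BOS, CHC)
wp = np.array([0.50, 0.58, 0.57, 0.43, 0.38])
wp_preds = np.array([0.53, 0.58, 0.50, 0.45, 0.39])

abs_errors = np.abs(wp_preds - wp)
mae = abs_errors.mean()

print(np.round(abs_errors, 2))
print(round(float(mae), 3))

# Position of the largest miss in this sample
worst = int(np.argmax(abs_errors))
print(worst)
```

On these five rows the predictions are off by under three percentage points on average, with Baltimore (index 2) the largest miss.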

You have done a great job throughout this course! You are now ready to write efficient Python and pandas code.

## What you have learned

* The definition of **efficient** and **Pythonic** code
* How to use Python's powerful built-in libraries
* The advantages of NumPy arrays
* Some handy *magic* commands for profiling code
* How to write efficient solutions with `zip()`, `itertools`, `collections`, and `set` theory
* The cost of looping and how to eliminate loops
* Best practices for iterating over pandas DataFrames