# Basic pandas optimizations

This chapter offers a brief introduction to working efficiently with pandas DataFrames. You'll learn the various options you have for iterating over a DataFrame. Then, you'll learn how to efficiently apply functions to data stored in a DataFrame.

## Intro to pandas DataFrame iteration

1. Intro to pandas DataFrame iteration
So far, we've focused on Python's built-in data structures. Now, we'll shift gears and focus on one of the most popular data analytics tools: the pandas DataFrame.

2. pandas recap
You should be familiar with pandas before continuing with this course. If not, consider refreshing your pandas knowledge with the great overview provided in the DataCamp course listed here. Just to recap, pandas is a library used for data analysis. The main construct of pandas is the DataFrame, a tabular data structure with labeled rows and columns. This chapter will focus on the best approaches for iterating over a DataFrame. Let's begin by analyzing a Major League Baseball dataset.

3. Baseball stats
We've collected team stats for each Major League Baseball team from the year 1962 to 2012, which are stored in a pandas DataFrame named baseball_df.

4. Baseball stats
The Team column is each baseball team's abbreviated name. The first team, ARI, represents the Arizona Diamondbacks.

5. Baseball stats
All other columns in this DataFrame represent specific statistics for each team in a given Year or season. We'll cover what the RS, RA, and Playoffs columns mean in later exercises. For now, we'll focus on the W column, which specifies the number of wins a team had in a season and the G column that contains the number of games a team played in a season.

6. Calculating win percentage
One popular statistic used to evaluate a team's performance for a given season is the team's win percentage. This metric is calculated by dividing a team's total wins by number of games played. We've written a simple function to perform this calculation. If a team wins 50 out of 100 games, we see that our function returns the correct result.
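
The function itself is not shown in these notes, so the following is only a minimal sketch of what `calc_win_perc` might look like. The body (plain division, no rounding) is an assumption.

```python
def calc_win_perc(wins, games_played):
    """Return the fraction of games won (assumed implementation)."""
    return wins / games_played


# A team that wins 50 of its 100 games
print(calc_win_perc(50, 100))
```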

7. Adding win percentage to DataFrame
We'd like to create a new column in our baseball_df DataFrame that stores each team's win percentage for a season. To do this, we'll need to iterate over the DataFrame's rows and apply our calc_win_perc function. First, we create an empty win_perc_list to store all the win percentages we'll calculate. Then, we write a loop that iterates over each row of the DataFrame. Notice that we are using an index variable (i) that ranges from zero up to, but not including, the number of rows in the DataFrame. We then use the dot-iloc indexer to look up each individual row by its integer position. Now, we grab each team's wins and games played by referencing the W and G columns. Next, we pass the team's wins and games played to calc_win_perc to calculate the win percentage. Finally, we append win_perc to win_perc_list and continue the loop. After the loop, we create our desired column in the DataFrame, called WP, by setting it equal to win_perc_list.
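
The loop described above might look roughly like the sketch below. The three-row `baseball_df` is a stand-in built from the Pirates rows shown later in this chapter, and the body of `calc_win_perc` is an assumption.

```python
import pandas as pd


def calc_win_perc(wins, games_played):
    # Assumed implementation: plain division, no rounding
    return wins / games_played


# Small stand-in for the full baseball_df (assumption: only three rows)
baseball_df = pd.DataFrame({
    "Team": ["PIT", "PIT", "PIT"],
    "Year": [2012, 2011, 2010],
    "W": [79, 72, 57],
    "G": [162, 162, 162],
})

win_perc_list = []

# i runs over the integer positions 0 .. len(baseball_df) - 1
for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]      # look up one row by position
    wins = row["W"]
    games_played = row["G"]
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)

# Store the results as a new column
baseball_df["WP"] = win_perc_list
print(baseball_df.head())
```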

8. Adding win percentage to DataFrame
Printing the first five rows of our DataFrame, we see that the win percentage column now appears.

9. Iterating with .iloc
Looping over the DataFrame with dot-iloc gave us our desired output, but is it efficient? When estimating the runtime, the dot-iloc approach took 183 milliseconds, which is pretty inefficient.

10. Iterating with .iterrows()
pandas comes with a few methods designed for looping over a DataFrame. The first method we'll cover is the dot-iterrows method. This is similar to the dot-iloc approach, but dot-iterrows yields each DataFrame row as an (index, pandas Series) tuple. This means each object returned from dot-iterrows contains the row's index as its first element and the row's data, as a pandas Series, as its second element. Notice that we still create the empty win_perc_list, but now we don't have to create an index variable to look up each row within the DataFrame. dot-iterrows handles the indexing for us! The remainder of the for loop stays the same to create a new win percentage column within our baseball_df DataFrame.
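
As a rough sketch, the same calculation with `.iterrows()` could be written as follows. The small `baseball_df` and the body of `calc_win_perc` are assumptions made so the snippet runs on its own.

```python
import pandas as pd


def calc_win_perc(wins, games_played):
    # Assumed implementation: plain division, no rounding
    return wins / games_played


# Small stand-in for the full baseball_df (assumption: only three rows)
baseball_df = pd.DataFrame({
    "Team": ["PIT", "PIT", "PIT"],
    "Year": [2012, 2011, 2010],
    "W": [79, 72, 57],
    "G": [162, 162, 162],
})

win_perc_list = []

# Each iteration unpacks the (index, Series) tuple; no manual index needed
for i, row in baseball_df.iterrows():
    wins = row["W"]
    games_played = row["G"]
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)

baseball_df["WP"] = win_perc_list
print(baseball_df.head())
```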

11. Iterating with .iterrows()
Using dot-iterrows takes roughly half the time dot-iloc takes to iterate over our DataFrame. We'll explore more efficient ways to loop over a DataFrame later on in the chapter. But for now, we know that using dot-iloc is not efficient and shouldn't be used to iterate over a DataFrame.
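
One way to check a comparison like this yourself is the standard-library `timeit` module. The sketch below uses a synthetic 1,000-row DataFrame (an assumption, not the course data), so the timings it prints will differ from the figures quoted above and will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic data (assumption): 1,000 rows of wins and games played
df = pd.DataFrame({
    "W": [79, 72, 57, 62, 67] * 200,
    "G": [162, 162, 162, 161, 162] * 200,
})


def with_iloc():
    out = []
    for i in range(len(df)):
        row = df.iloc[i]
        out.append(row["W"] / row["G"])
    return out


def with_iterrows():
    out = []
    for _, row in df.iterrows():
        out.append(row["W"] / row["G"])
    return out


# Total seconds for three runs of each approach
iloc_time = timeit.timeit(with_iloc, number=3)
iterrows_time = timeit.timeit(with_iterrows, number=3)
print(f".iloc loop:      {iloc_time:.4f} s")
print(f".iterrows loop:  {iterrows_time:.4f} s")
```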

12. Practice DataFrame iterating with .iterrows()
Now, let's practice iterating over a DataFrame using dot-iterrows!

### Iterating with .iterrows()

![image.png](attachment:44f37b29-fa2e-490a-98c8-7c2ead0a41c2.png)

In [None]:
  Team League  Year   RS   RA   W    G  Playoffs
0  PIT     NL  2012  651  674  79  162         0
1  PIT     NL  2011  610  712  72  162         0
2  PIT     NL  2010  587  866  57  162         0
3  PIT     NL  2009  636  768  62  161         0
4  PIT     NL  2008  735  884  67  162         0

In [None]:
# Iterate over pit_df and print each row
for ____,____ in ____.____:
    print(row)

In [None]:
# Iterate over pit_df and print each row
for i, row in pit_df.iterrows():
    print(row)

In [None]:
<script.py> output:
    Team         PIT
    League        NL
    Year        2012
    RS           651
    RA           674
    W             79
    G            162
    Playoffs       0
    Name: 0, dtype: object
    Team         PIT
    League        NL
    Year        2011
    RS           610
    RA           712
    W             72
    G            162
    Playoffs       0
    Name: 1, dtype: object
    Team         PIT
    League        NL
    Year        2010
    RS           587
    RA           866
    W             57
    G            162
    Playoffs       0
    Name: 2, dtype: object
    Team         PIT
    League        NL
    Year        2009
    RS           636
    RA           768
    W             62
    G            161
    Playoffs       0
    Name: 3, dtype: object
    Team         PIT
    League        NL
    Year        2008
    RS           735
    RA           884
    W             67
    G            162
    Playoffs       0
    Name: 4, dtype: object

![image.png](attachment:7fe968fd-d9c0-418a-b11e-5b70277ea06a.png)

In [None]:
# Iterate over pit_df and print each index variable and then each row
for i,row in pit_df.iterrows():
    ____
    print(row)
    ____

In [None]:
# Iterate over pit_df and print each index variable and then each row
for i,row in pit_df.iterrows():
    print(i)
    print(row)
    print(type(row))

In [None]:
<script.py> output:
    0
    Team         PIT
    League        NL
    Year        2012
    RS           651
    RA           674
    W             79
    G            162
    Playoffs       0
    Name: 0, dtype: object
    <class 'pandas.core.series.Series'>
    1
    Team         PIT
    League        NL
    Year        2011
    RS           610
    RA           712
    W             72
    G            162
    Playoffs       0
    Name: 1, dtype: object
    <class 'pandas.core.series.Series'>
    2
    Team         PIT
    League        NL
    Year        2010
    RS           587
    RA           866
    W             57
    G            162
    Playoffs       0
    Name: 2, dtype: object
    <class 'pandas.core.series.Series'>
    3
    Team         PIT
    League        NL
    Year        2009
    RS           636
    RA           768
    W             62
    G            161
    Playoffs       0
    Name: 3, dtype: object
    <class 'pandas.core.series.Series'>
    4
    Team         PIT
    League        NL
    Year        2008
    RS           735
    RA           884
    W             67
    G            162
    Playoffs       0
    Name: 4, dtype: object
    <class 'pandas.core.series.Series'>

![image.png](attachment:d7e6b0d3-33d4-449f-8eba-18e30e504f7e.png)

In [None]:
# Use one variable instead of two to store the result of .iterrows()
for ____ in ____.____:
    print(row_tuple)

In [None]:
# Use one variable instead of two to store the result of .iterrows()
for row_tuple in pit_df.iterrows():
    print(row_tuple)

In [None]:
<script.py> output:
    (0, Team         PIT
    League        NL
    Year        2012
    RS           651
    RA           674
    W             79
    G            162
    Playoffs       0
    Name: 0, dtype: object)
    (1, Team         PIT
    League        NL
    Year        2011
    RS           610
    RA           712
    W             72
    G            162
    Playoffs       0
    Name: 1, dtype: object)
    (2, Team         PIT
    League        NL
    Year        2010
    RS           587
    RA           866
    W             57
    G            162
    Playoffs       0
    Name: 2, dtype: object)
    (3, Team         PIT
    League        NL
    Year        2009
    RS           636
    RA           768
    W             62
    G            161
    Playoffs       0
    Name: 3, dtype: object)
    (4, Team         PIT
    League        NL
    Year        2008
    RS           735
    RA           884
    W             67
    G            162
    Playoffs       0
    Name: 4, dtype: object)

![image.png](attachment:ce81c65c-6aef-4016-a6ef-3e877ecd8ebf.png)

In [None]:
# Print the row and type of each row
for row_tuple in pit_df.iterrows():
    print(row_tuple)
    ____

In [None]:
# Print the row and type of each row
for row_tuple in pit_df.iterrows():
    print(row_tuple)
    print(type(row_tuple))

In [None]:
<script.py> output:
    (0, Team         PIT
    League        NL
    Year        2012
    RS           651
    RA           674
    W             79
    G            162
    Playoffs       0
    Name: 0, dtype: object)
    <class 'tuple'>
    (1, Team         PIT
    League        NL
    Year        2011
    RS           610
    RA           712
    W             72
    G            162
    Playoffs       0
    Name: 1, dtype: object)
    <class 'tuple'>
    (2, Team         PIT
    League        NL
    Year        2010
    RS           587
    RA           866
    W             57
    G            162
    Playoffs       0
    Name: 2, dtype: object)
    <class 'tuple'>
    (3, Team         PIT
    League        NL
    Year        2009
    RS           636
    RA           768
    W             62
    G            161
    Playoffs       0
    Name: 3, dtype: object)
    <class 'tuple'>
    (4, Team         PIT
    League        NL
    Year        2008
    RS           735
    RA           884
    W             67
    G            162
    Playoffs       0
    Name: 4, dtype: object)
    <class 'tuple'>

![image.png](attachment:ea57b5a7-ce9f-44a3-bea4-114176bb0421.png)

### Run differentials with .iterrows()

![image.png](attachment:8ff24f55-6cd3-4edc-8f33-913d1234a611.png)

In [None]:
  Team League  Year   RS   RA   W    G  Playoffs
0  SFG     NL  2012  718  649  94  162         1
1  SFG     NL  2011  570  578  86  162         0
2  SFG     NL  2010  697  583  92  162         1
3  SFG     NL  2009  657  611  88  162         0
4  SFG     NL  2008  640  759  72  162         0

In [None]:
# Create an empty list to store run differentials
____ = ____

In [None]:
# Create an empty list to store run differentials
run_diffs = []

![image.png](attachment:475c4206-2bf9-4363-ba93-07ec8a8c6971.png)

In [None]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for ____,____ in ____.____:
    runs_scored = ____[____]
    runs_allowed = ____[____]

In [None]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i, row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']

![image.png](attachment:8b21b54e-8f32-484c-bbc7-8af987473138.png)

In [None]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = ____(____, ____)

In [None]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)

![image.png](attachment:56dfb529-f243-4abb-8e76-ac6ed27f8fdf.png)

In [None]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    ____.____(____)

giants_df['RD'] = run_diffs
print(giants_df)

In [None]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

giants_df['RD'] = run_diffs
print(giants_df)

In [None]:
<script.py> output:
      Team League  Year   RS   RA   W    G  Playoffs   RD
    0  SFG     NL  2012  718  649  94  162         1   69
    1  SFG     NL  2011  570  578  86  162         0   -8
    2  SFG     NL  2010  697  583  92  162         1  114
    3  SFG     NL  2009  657  611  88  162         0   46
    4  SFG     NL  2008  640  759  72  162         0 -119

![image.png](attachment:626aa326-26bc-430c-baf7-f9ba701f6a51.png)

## Another iterator method: .itertuples()

1. Another iterator method: .itertuples()
In the previous lesson, we covered how to iterate over a pandas DataFrame row by row using the dot-iterrows method. pandas also comes with a similar iteration method called dot-itertuples that is often more efficient than dot-iterrows. Let's continue using our baseball dataset to compare these two methods.

2. Team wins data
Suppose we have a pandas DataFrame called team_wins_df that contains each team's total wins in a season.

3. Iterating with .iterrows()
If we use dot-iterrows to loop over our team_wins_df DataFrame and print each row's tuple, we see that each row's values are stored as a pandas Series. Remember, dot-iterrows returns each DataFrame row as a tuple of (index, pandas Series) pairs, so we have to access the row's values with square bracket indexing.

4. Iterating with .itertuples()
But we could use dot-itertuples to loop over our DataFrame rows instead. The dot-itertuples method returns each DataFrame row as a special data type called a namedtuple. A namedtuple is one of the specialized data types that exist within the collections module we've discussed previously. These data types behave just like a Python tuple but have fields accessible using attribute lookup. What does this mean? Notice in the output that each printed row_namedtuple has an Index attribute and one attribute for each column in our team_wins_df. That means we can access each of these attributes using dot notation. Here, we can print the last row_namedtuple's Index using row_namedtuple-dot-Index. We can print this row_namedtuple's Team with row_namedtuple-dot-Team, Year with row_namedtuple-dot-Year and so on.
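
A small sketch of the attribute lookup described above. The three-row `team_wins_df` is an assumption, built from 2012 win totals that appear elsewhere in this chapter.

```python
import pandas as pd

# Stand-in for team_wins_df (assumption: three rows only)
team_wins_df = pd.DataFrame({
    "Team": ["TEX", "NYY", "SFG"],
    "Year": [2012, 2012, 2012],
    "W": [93, 95, 94],
})

# Each row comes back as a namedtuple with an Index field plus one field per column
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

# After the loop, row_namedtuple still refers to the last row
print(row_namedtuple.Index)
print(row_namedtuple.Team)
print(row_namedtuple.Year)
print(row_namedtuple.W)
```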

5. Comparing methods
When we compare dot-iterrows to dot-itertuples, we see that there is quite a bit of improvement! The reason dot-itertuples is more efficient than dot-iterrows is due to the way each method stores its output. Since dot-iterrows returns each row's values as a pandas Series, there is a bit more overhead.
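
To see the difference in what each method hands back, a sketch like the following can help. The `team_wins_df` here is a three-row stand-in (an assumption). Building a new Series for every row is the extra work `.iterrows()` does.

```python
import pandas as pd

# Stand-in for team_wins_df (assumption: three rows only)
team_wins_df = pd.DataFrame({
    "Team": ["TEX", "NYY", "SFG"],
    "Year": [2012, 2012, 2012],
    "W": [93, 95, 94],
})

# First item from each iterator
_, first_series = next(team_wins_df.iterrows())
first_tuple = next(team_wins_df.itertuples())

# .iterrows() builds a full pandas Series for the row
print(type(first_series), first_series.dtype)

# .itertuples() builds a lightweight namedtuple instead
print(type(first_tuple))
```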

6. Attribute lookup caveat
One more quick note about the differences between these methods. When using dot-iterrows, we can use square brackets with a column name to reference a column within our team_wins_df DataFrame. Here, we are printing the Team column for each row in our DataFrame. If we use the same syntax with dot-itertuples, we get a TypeError. This is because a namedtuple, like any tuple, accepts only integer positions (or slices) inside square brackets; it can't be indexed by a column name the way a pandas Series can. When looking up a field by name within a namedtuple, we must use a dot to reference the attribute. So anytime we use dot-itertuples, we have to use a dot when referring to a column by name. If we replace our square bracket notation with a dot, we see that the Teams are correctly printed out.
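
The caveat can be reproduced with a short sketch. The `team_wins_df` below is a three-row stand-in (an assumption), and the `try`/`except` is only there so the snippet keeps running after the error.

```python
import pandas as pd

# Stand-in for team_wins_df (assumption: three rows only)
team_wins_df = pd.DataFrame({
    "Team": ["TEX", "NYY", "SFG"],
    "Year": [2012, 2012, 2012],
    "W": [93, 95, 94],
})

# With .iterrows(), the second tuple element is a Series, so a column name works
for row_tuple in team_wins_df.iterrows():
    print(row_tuple[1]["Team"])

# With .itertuples(), indexing by column name raises a TypeError
error_message = None
try:
    for row_namedtuple in team_wins_df.itertuples():
        print(row_namedtuple["Team"])
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Dot notation is the way to look up a column by name
teams = [row_namedtuple.Team for row_namedtuple in team_wins_df.itertuples()]
print(teams)
```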

7. Let's keep iterating!
Now, let's put our new skill to the test and practice efficiently looping over rows of a DataFrame using dot-itertuples.

### Iterating with .itertuples()

![image.png](attachment:6d869714-009d-4886-a6db-05f85bc4c09f.png)

In [None]:
In [1]:
rangers_df.head()
Out[1]:

  Team League  Year   RS   RA   W    G  Playoffs
0  TEX     AL  2012  808  707  93  162         1
1  TEX     AL  2011  855  677  96  162         1
2  TEX     AL  2010  787  687  90  162         1
3  TEX     AL  2009  784  740  87  162         0
4  TEX     AL  2008  901  967  79  162         0

In [None]:
# Loop over the DataFrame and print each row
for ____ in ____.____():
  print(____)

In [None]:
# Loop over the DataFrame and print each row
for row_namedtuple in rangers_df.itertuples():
  print(row_namedtuple)

In [None]:
<script.py> output:
    Pandas(Index=0, Team='TEX', League='AL', Year=2012, RS=808, RA=707, W=93, G=162, Playoffs=1)
    Pandas(Index=1, Team='TEX', League='AL', Year=2011, RS=855, RA=677, W=96, G=162, Playoffs=1)
    Pandas(Index=2, Team='TEX', League='AL', Year=2010, RS=787, RA=687, W=90, G=162, Playoffs=1)
    Pandas(Index=3, Team='TEX', League='AL', Year=2009, RS=784, RA=740, W=87, G=162, Playoffs=0)
    Pandas(Index=4, Team='TEX', League='AL', Year=2008, RS=901, RA=967, W=79, G=162, Playoffs=0)
    Pandas(Index=5, Team='TEX', League='AL', Year=2007, RS=816, RA=844, W=75, G=162, Playoffs=0)
    Pandas(Index=6, Team='TEX', League='AL', Year=2006, RS=835, RA=784, W=80, G=162, Playoffs=0)
    Pandas(Index=7, Team='TEX', League='AL', Year=2005, RS=865, RA=858, W=79, G=162, Playoffs=0)
    Pandas(Index=8, Team='TEX', League='AL', Year=2004, RS=860, RA=794, W=89, G=162, Playoffs=0)
    Pandas(Index=9, Team='TEX', League='AL', Year=2003, RS=826, RA=969, W=71, G=162, Playoffs=0)
    Pandas(Index=10, Team='TEX', League='AL', Year=2002, RS=843, RA=882, W=72, G=162, Playoffs=0)
    Pandas(Index=11, Team='TEX', League='AL', Year=2001, RS=890, RA=968, W=73, G=162, Playoffs=0)
    Pandas(Index=12, Team='TEX', League='AL', Year=2000, RS=848, RA=974, W=71, G=162, Playoffs=0)
    Pandas(Index=13, Team='TEX', League='AL', Year=1999, RS=945, RA=859, W=95, G=162, Playoffs=1)
    Pandas(Index=14, Team='TEX', League='AL', Year=1998, RS=940, RA=871, W=88, G=162, Playoffs=1)
    Pandas(Index=15, Team='TEX', League='AL', Year=1997, RS=807, RA=823, W=77, G=162, Playoffs=0)
    Pandas(Index=16, Team='TEX', League='AL', Year=1996, RS=928, RA=799, W=90, G=163, Playoffs=1)
    Pandas(Index=17, Team='TEX', League='AL', Year=1993, RS=835, RA=751, W=86, G=162, Playoffs=0)
    Pandas(Index=18, Team='TEX', League='AL', Year=1992, RS=682, RA=753, W=77, G=162, Playoffs=0)
    Pandas(Index=19, Team='TEX', League='AL', Year=1991, RS=829, RA=814, W=85, G=162, Playoffs=0)
    Pandas(Index=20, Team='TEX', League='AL', Year=1990, RS=676, RA=696, W=83, G=162, Playoffs=0)
    Pandas(Index=21, Team='TEX', League='AL', Year=1989, RS=695, RA=714, W=83, G=162, Playoffs=0)
    Pandas(Index=22, Team='TEX', League='AL', Year=1988, RS=637, RA=735, W=70, G=161, Playoffs=0)
    Pandas(Index=23, Team='TEX', League='AL', Year=1987, RS=823, RA=849, W=75, G=162, Playoffs=0)
    Pandas(Index=24, Team='TEX', League='AL', Year=1986, RS=771, RA=743, W=87, G=162, Playoffs=0)
    Pandas(Index=25, Team='TEX', League='AL', Year=1985, RS=617, RA=785, W=62, G=161, Playoffs=0)
    Pandas(Index=26, Team='TEX', League='AL', Year=1984, RS=656, RA=714, W=69, G=161, Playoffs=0)
    Pandas(Index=27, Team='TEX', League='AL', Year=1983, RS=639, RA=609, W=77, G=163, Playoffs=0)
    Pandas(Index=28, Team='TEX', League='AL', Year=1982, RS=590, RA=749, W=64, G=162, Playoffs=0)
    Pandas(Index=29, Team='TEX', League='AL', Year=1980, RS=756, RA=752, W=76, G=163, Playoffs=0)
    Pandas(Index=30, Team='TEX', League='AL', Year=1979, RS=750, RA=698, W=83, G=162, Playoffs=0)
    Pandas(Index=31, Team='TEX', League='AL', Year=1978, RS=692, RA=632, W=87, G=162, Playoffs=0)
    Pandas(Index=32, Team='TEX', League='AL', Year=1977, RS=767, RA=657, W=94, G=162, Playoffs=0)
    Pandas(Index=33, Team='TEX', League='AL', Year=1976, RS=616, RA=652, W=76, G=162, Playoffs=0)
    Pandas(Index=34, Team='TEX', League='AL', Year=1975, RS=714, RA=733, W=79, G=162, Playoffs=0)
    Pandas(Index=35, Team='TEX', League='AL', Year=1974, RS=690, RA=698, W=83, G=161, Playoffs=0)
    Pandas(Index=36, Team='TEX', League='AL', Year=1973, RS=619, RA=844, W=57, G=162, Playoffs=0)

![image.png](attachment:29b13720-766b-4f14-97a6-83cc3cc1c5ff.png)

In [None]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = ____
  year = ____
  wins = ____
  print(i, year, wins)

In [None]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  print(i, year, wins)

In [None]:
<script.py> output:
    0 2012 93
    1 2011 96
    2 2010 90
    3 2009 87
    4 2008 79
    5 2007 75
    6 2006 80
    7 2005 79
    8 2004 89
    9 2003 71
    10 2002 72
    11 2001 73
    12 2000 71
    13 1999 95
    14 1998 88
    15 1997 77
    16 1996 90
    17 1993 86
    18 1992 77
    19 1991 85
    20 1990 83
    21 1989 83
    22 1988 70
    23 1987 75
    24 1986 87
    25 1985 62
    26 1984 69
    27 1983 77
    28 1982 64
    29 1980 76
    30 1979 83
    31 1978 87
    32 1977 94
    33 1976 76
    34 1975 79
    35 1974 83
    36 1973 57

![image.png](attachment:c3e344c2-da21-446f-ac8b-5273dd8efdff.png)

In [None]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  
  # Check if rangers made Playoffs (1 means yes; 0 means no)
  if ____.____ == 1:
    print(____, ____, ____)

In [None]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  
  # Check if rangers made Playoffs (1 means yes; 0 means no)
  if row.Playoffs == 1:
    print(i, year, wins)

In [None]:
<script.py> output:
    0 2012 93
    1 2011 96
    2 2010 90
    13 1999 95
    14 1998 88
    16 1996 90

![image.png](attachment:74e23aa0-3c6a-4fbb-8d92-b83aa29d5c64.png)

### Run differentials with .itertuples()

![image.png](attachment:52f98747-2abd-41c5-bd14-091de155be36.png)

In [None]:
In [1]:
yankees_df.head()
Out[1]:

  Team League  Year   RS   RA    W    G  Playoffs
0  NYY     AL  2012  804  668   95  162         1
1  NYY     AL  2011  867  657   97  162         1
2  NYY     AL  2010  859  693   95  162         1
3  NYY     AL  2009  915  753  103  162         1
4  NYY     AL  2008  789  727   89  162         0

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for ____ in ____.____():
    
    runs_scored = ____
    runs_allowed = ____

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

![image.png](attachment:1de6f043-1a75-466f-9b4e-b7f27899a2b9.png)

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = ____(____, ____)
    
    run_diffs.append(____)

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

![image.png](attachment:d5bc6aa2-db25-4028-917a-553cb4a9b0c1.png)

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df[____] = ____
print(yankees_df)

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
print(yankees_df)

In [None]:
<script.py> output:
       Team League  Year   RS   RA    W    G  Playoffs   RD
    0   NYY     AL  2012  804  668   95  162         1  136
    1   NYY     AL  2011  867  657   97  162         1  210
    2   NYY     AL  2010  859  693   95  162         1  166
    3   NYY     AL  2009  915  753  103  162         1  162
    4   NYY     AL  2008  789  727   89  162         0   62
    5   NYY     AL  2007  968  777   94  162         1  191
    6   NYY     AL  2006  930  767   97  162         1  163
    7   NYY     AL  2005  886  789   95  162         1   97
    8   NYY     AL  2004  897  808  101  162         1   89
    9   NYY     AL  2003  877  716  101  163         1  161
    10  NYY     AL  2002  897  697  103  161         1  200
    11  NYY     AL  2001  804  713   95  161         1   91
    12  NYY     AL  2000  871  814   87  161         1   57
    13  NYY     AL  1999  900  731   98  162         1  169
    14  NYY     AL  1998  965  656  114  162         1  309
    15  NYY     AL  1997  891  688   96  162         1  203
    16  NYY     AL  1996  871  787   92  162         1   84
    17  NYY     AL  1993  821  761   88  162         0   60
    18  NYY     AL  1992  733  746   76  162         0  -13
    19  NYY     AL  1991  674  777   71  162         0 -103
    20  NYY     AL  1990  603  749   67  162         0 -146
    21  NYY     AL  1989  698  792   74  161         0  -94
    22  NYY     AL  1988  772  748   85  161         0   24
    23  NYY     AL  1987  788  758   89  162         0   30
    24  NYY     AL  1986  797  738   90  162         0   59
    25  NYY     AL  1985  839  660   97  161         0  179
    26  NYY     AL  1984  758  679   87  162         0   79
    27  NYY     AL  1983  770  703   91  162         0   67
    28  NYY     AL  1982  709  716   79  162         0   -7
    29  NYY     AL  1980  820  662  103  162         1  158
    30  NYY     AL  1979  734  672   89  160         0   62
    31  NYY     AL  1978  735  582  100  163         1  153
    32  NYY     AL  1977  831  651  100  162         1  180
    33  NYY     AL  1976  730  575   97  159         1  155
    34  NYY     AL  1975  681  588   83  160         0   93
    35  NYY     AL  1974  671  623   89  162         0   48
    36  NYY     AL  1973  641  610   80  162         0   31
    37  NYY     AL  1971  648  641   81  162         0    7
    38  NYY     AL  1970  680  612   93  163         0   68
    39  NYY     AL  1969  562  587   80  162         0  -25
    40  NYY     AL  1968  536  531   83  164         0    5
    41  NYY     AL  1967  522  621   72  163         0  -99
    42  NYY     AL  1966  611  612   70  160         0   -1
    43  NYY     AL  1965  611  604   77  162         0    7
    44  NYY     AL  1964  730  577   99  164         1  153
    45  NYY     AL  1963  714  547  104  161         1  167
    46  NYY     AL  1962  817  680   96  162         1  137

![image.png](attachment:67097ae1-3ef9-4726-a1d5-36d37345d577.png)

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
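print(yankees_df.sort_values(by=['RD'], ascending=False))

In [None]:
# The output below was captured with ascending='False' (a string, not the boolean False), so it is sorted in ascending order. With ascending=False, the 1998 row (RD 309) would come first.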

   Team League  Year   RS   RA    W    G  Playoffs   RD
20  NYY     AL  1990  603  749   67  162         0 -146
19  NYY     AL  1991  674  777   71  162         0 -103
41  NYY     AL  1967  522  621   72  163         0  -99
21  NYY     AL  1989  698  792   74  161         0  -94
39  NYY     AL  1969  562  587   80  162         0  -25
18  NYY     AL  1992  733  746   76  162         0  -13
28  NYY     AL  1982  709  716   79  162         0   -7
42  NYY     AL  1966  611  612   70  160         0   -1
40  NYY     AL  1968  536  531   83  164         0    5
43  NYY     AL  1965  611  604   77  162         0    7
37  NYY     AL  1971  648  641   81  162         0    7
22  NYY     AL  1988  772  748   85  161         0   24
23  NYY     AL  1987  788  758   89  162         0   30
36  NYY     AL  1973  641  610   80  162         0   31
35  NYY     AL  1974  671  623   89  162         0   48
12  NYY     AL  2000  871  814   87  161         1   57
24  NYY     AL  1986  797  738   90  162         0   59
17  NYY     AL  1993  821  761   88  162         0   60
30  NYY     AL  1979  734  672   89  160         0   62
4   NYY     AL  2008  789  727   89  162         0   62
27  NYY     AL  1983  770  703   91  162         0   67
38  NYY     AL  1970  680  612   93  163         0   68
26  NYY     AL  1984  758  679   87  162         0   79
16  NYY     AL  1996  871  787   92  162         1   84
8   NYY     AL  2004  897  808  101  162         1   89
11  NYY     AL  2001  804  713   95  161         1   91
34  NYY     AL  1975  681  588   83  160         0   93
7   NYY     AL  2005  886  789   95  162         1   97
0   NYY     AL  2012  804  668   95  162         1  136
46  NYY     AL  1962  817  680   96  162         1  137
44  NYY     AL  1964  730  577   99  164         1  153
31  NYY     AL  1978  735  582  100  163         1  153
33  NYY     AL  1976  730  575   97  159         1  155
29  NYY     AL  1980  820  662  103  162         1  158
9   NYY     AL  2003  877  716  101  163         1  161
3   NYY     AL  2009  915  753  103  162         1  162
6   NYY     AL  2006  930  767   97  162         1  163
2   NYY     AL  2010  859  693   95  162         1  166
45  NYY     AL  1963  714  547  104  161         1  167
13  NYY     AL  1999  900  731   98  162         1  169
25  NYY     AL  1985  839  660   97  161         0  179
32  NYY     AL  1977  831  651  100  162         1  180
5   NYY     AL  2007  968  777   94  162         1  191
10  NYY     AL  2002  897  697  103  161         1  200
15  NYY     AL  1997  891  688   96  162         1  203
1   NYY     AL  2011  867  657   97  162         1  210
14  NYY     AL  1998  965  656  114  162         1  309
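
For reference, a minimal sketch of a descending sort on the run differential, using three of the Yankees rows above as a stand-in DataFrame (the subset is an assumption). Note that `ascending` takes the boolean `False`, not the string `'False'`.

```python
import pandas as pd

# Three rows copied from the table above (assumption: subset only)
yankees_df = pd.DataFrame({
    "Year": [2011, 1998, 1990],
    "RS": [867, 965, 603],
    "RA": [657, 656, 749],
})
yankees_df["RD"] = yankees_df["RS"] - yankees_df["RA"]

# Largest run differential first
sorted_df = yankees_df.sort_values(by=["RD"], ascending=False)
print(sorted_df)
```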

1. Incorrect! Not quite. There is a year where the Yankees had a better run differential than 210.

1. Correct! 

![image.png](attachment:c0f01b59-3c58-4da9-a734-fd81bdee6485.png)

1. Incorrect! Try again. Having a run differential this high is extremely unlikely.

1. Incorrect! Not exactly. In 1985, the Yankees had a run differential of 179, not 315.

## pandas alternative to looping

1. pandas alternative to looping
We've been looping over DataFrames row-by-row with ease in the past two lessons. But remember, in order to write efficient code, we want to avoid looping when possible. In this lesson, we'll explore an alternative to using dot-iterrows and dot-itertuples to perform calculations on a DataFrame.

2. Revisit run differentials
We'll continue using our baseball dataset and revisit the calc_run_diff function we've used in the past. This function calculates a team's run differential for a given year by subtracting the team's total number of runs allowed from their total number of runs scored in a season.
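
The body of calc_run_diff isn't shown in this section, so here is a minimal sketch of what it might look like, assuming it simply subtracts runs allowed from runs scored as described above. The argument names are hypothetical.

```python
def calc_run_diff(runs_scored, runs_allowed):
    """Return a team's run differential (assumed body: scored minus allowed)."""
    return runs_scored - runs_allowed


# 2011 Yankees values from the table above: 867 runs scored, 657 runs allowed
yankees_2011_rd = calc_run_diff(867, 657)
print(yankees_2011_rd)  # 210
```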

3. Run differentials with a loop
We'd like to create a new column in our baseball_df DataFrame called RD that stores each team's run differentials over the years. In previous lessons, we did this with a for loop using either dot-iterrows or dot-itertuples. Here, we'll use dot-iterrows as an example. Notice that we are iterating over baseball_df with a for loop, passing each row's RS and RA columns into our calc_run_diff function, and then appending each row's result to our run_diffs_iterrows list. This gives us our desired output, but it is not our most efficient option.
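
The loop described above could be sketched roughly as follows. The three-row DataFrame is a small stand-in for baseball_df built from rows shown elsewhere in this chapter, and the body of calc_run_diff is an assumption based on its description.

```python
import pandas as pd

# Small stand-in for baseball_df (three 2012 rows shown elsewhere in this chapter)
baseball_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL'],
    'RS': [734, 700, 712],
    'RA': [688, 600, 705],
})


def calc_run_diff(runs_scored, runs_allowed):
    # Assumed body: runs scored minus runs allowed
    return runs_scored - runs_allowed


run_diffs_iterrows = []

# .iterrows() yields (index, row) pairs, where each row is a pandas Series
for i, row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)

# Store the collected values in a new RD column
baseball_df['RD'] = run_diffs_iterrows
print(baseball_df)
```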

4. pandas .apply() method
One alternative to using a loop to iterate over a DataFrame is to use pandas' dot-apply method. This method acts like the map function we've used in the past. It takes a function as an input and applies this function along an axis of the DataFrame. Since we are working with tabular data, we must specify the axis that we'd like our function to act on. Using a zero for the axis argument will apply our function to each column, while using a one will apply our function to each row. Just like the map function, pandas' dot-apply method can be used with anonymous functions or lambdas. Let's walk through how we'd use the dot-apply method to calculate our run differentials. First, we call dot-apply on the baseball_df DataFrame. Then, we use a lambda function to iterate over the rows of the DataFrame. Notice that our argument for lambda is row (since we are applying to each row of the DataFrame). For every row, we grab the RS and RA columns and pass them to our calc_run_diff function. Lastly, we specify our axis to tell dot-apply that we want to iterate over rows instead of columns.
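
As a rough sketch of the call just described, here is dot-apply with a lambda and axis=1. The three-row DataFrame is a stand-in for baseball_df, and the body of calc_run_diff is assumed from its description. The last lines show axis=0 for contrast, where the function receives one column at a time.

```python
import pandas as pd

# Small stand-in for baseball_df (three 2012 rows shown elsewhere in this chapter)
baseball_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL'],
    'RS': [734, 700, 712],
    'RA': [688, 600, 705],
})


def calc_run_diff(runs_scored, runs_allowed):
    # Assumed body: runs scored minus runs allowed
    return runs_scored - runs_allowed


# axis=1: the lambda receives one row at a time
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
baseball_df['RD'] = run_diffs_apply
print(baseball_df)

# axis=0: the function receives one column at a time
column_totals = baseball_df[['RS', 'RA']].apply(sum, axis=0)
print(column_totals)
```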

5. Run differentials with .apply()
When we use the dot-apply method to calculate our run differentials, we don't need to use a for loop. We can collect our run differentials directly into an object called run_diffs_apply. After creating our new column and printing the DataFrame, we see that our results are identical to the dot-iterrows approach. But, was using dot-apply more efficient?

6. Comparing approaches
When timing the dot-iterrows approach, we see that it took about 87 milliseconds to complete.

7. Comparing approaches
But, using the dot-apply method took only 30 milliseconds. A definite improvement!
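
If you'd like to reproduce a comparison like this outside a notebook, one option is the standard library's timeit module. The sketch below is only illustrative: it repeats five rows shown in this chapter to build a stand-in DataFrame, assumes the body of calc_run_diff, and uses hypothetical helper names. Your timings will differ from the 87 and 30 milliseconds quoted above, depending on your machine and pandas version.

```python
import timeit

import pandas as pd

# Stand-in for baseball_df: five rows from this chapter repeated 200 times
sample = pd.DataFrame({
    'RS': [734, 700, 712, 734, 613],
    'RA': [688, 600, 705, 806, 759],
})
baseball_df = pd.concat([sample] * 200, ignore_index=True)


def calc_run_diff(runs_scored, runs_allowed):
    # Assumed body: runs scored minus runs allowed
    return runs_scored - runs_allowed


def run_diffs_with_iterrows():
    run_diffs = []
    for _, row in baseball_df.iterrows():
        run_diffs.append(calc_run_diff(row['RS'], row['RA']))
    return run_diffs


def run_diffs_with_apply():
    return baseball_df.apply(
        lambda row: calc_run_diff(row['RS'], row['RA']), axis=1
    )


# Check that both approaches give the same values before timing them
results_match = (
    [int(x) for x in run_diffs_with_iterrows()]
    == run_diffs_with_apply().tolist()
)

# Average seconds per call over five calls each
iterrows_secs = timeit.timeit(run_diffs_with_iterrows, number=5) / 5
apply_secs = timeit.timeit(run_diffs_with_apply, number=5) / 5

print(f"Same results: {results_match}")
print(f".iterrows(): {iterrows_secs * 1000:.2f} ms per call")
print(f".apply():    {apply_secs * 1000:.2f} ms per call")
```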

8. Let's practice using pandas .apply() method!
Now, let's practice using dot-apply with some coding examples!

### Analyzing baseball stats with .apply()

![image.png](attachment:d1b169ea-c3a4-42bc-bd85-717bcf0e3f02.png)

In [None]:
       RS   RA   W  Playoffs
2012  697  577  90         0
2011  707  614  91         1
2010  802  649  96         1
2009  803  754  84         0
2008  774  671  97         1

In [None]:
# Gather sum of all columns
stat_totals = ____.____(____, axis=____)
print(stat_totals)

In [None]:
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)

In [None]:
<script.py> output:
    RS          3783
    RA          3265
    W            458
    Playoffs       3
    dtype: int64

![image.png](attachment:85645455-87fc-4d45-be7f-6609ecf5fa65.png)

In [None]:
# Gather total runs scored in all games per year
total_runs_scored = rays_df[['RS', 'RA']].____(____, axis=____)
print(total_runs_scored)

In [None]:
# Gather total runs scored in all games per year
total_runs_scored = rays_df[['RS', 'RA']].apply(sum, axis=1)
print(total_runs_scored)

![image.png](attachment:3db0337d-8f18-4fb2-b6cd-3f6e302f45dd.png)

In [None]:
# Convert numeric playoffs to text by applying text_playoffs()
textual_playoffs = rays_df.____(lambda row: ____(row['____']), axis=____)
print(textual_playoffs)

In [None]:
# Convert numeric playoffs to text by applying text_playoffs()
textual_playoffs = rays_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(textual_playoffs)

In [None]:
<script.py> output:
    2012     No
    2011    Yes
    2010    Yes
    2009     No
    2008    Yes
    dtype: object
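
The text_playoffs() function is provided by the exercise and its body isn't shown here. A minimal sketch that would reproduce the output above, assuming it maps 1 to 'Yes' and anything else to 'No', might look like this. The rays_df values are copied from the table at the start of this exercise.

```python
import pandas as pd

# rays_df rebuilt from the table shown at the start of this exercise
rays_df = pd.DataFrame(
    {
        'RS': [697, 707, 802, 803, 774],
        'RA': [577, 614, 649, 754, 671],
        'W': [90, 91, 96, 84, 97],
        'Playoffs': [0, 1, 1, 0, 1],
    },
    index=[2012, 2011, 2010, 2009, 2008],
)


def text_playoffs(num_playoffs):
    # Assumed body: 1 means the team made the playoffs
    return 'Yes' if num_playoffs == 1 else 'No'


textual_playoffs = rays_df.apply(
    lambda row: text_playoffs(row['Playoffs']), axis=1
)
print(textual_playoffs)
```

Since only one column is involved, calling rays_df['Playoffs'].apply(text_playoffs) on the column itself should give the same values without needing a lambda or an axis argument.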

![image.png](attachment:f8db85cc-9c9a-4790-aaa2-aea63a3f493f.png)

### Settle a debate with .apply()

![image.png](attachment:320907a9-9eac-4e33-a476-0b24538851d6.png)

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.____())

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

![image.png](attachment:84865666-4569-48fe-8bca-793bb0fd4c51.png)

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.____(lambda ____: ____(row[____], row[____]), axis=____)
print(win_percs, '\n')

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

In [None]:
    0     0.50
    1     0.58
    2     0.40
    3     0.43
    4     0.51
    5     0.56
    6     0.47
    7     0.48
    8     0.31
    9     0.52
    10    0.60
    11    0.57
    12    0.52
    13    0.62
    14    0.40
    dtype: float64

![image.png](attachment:344a244b-11ea-4744-b327-6616380a03b7.png)

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

# Append a new column to dbacks_df
dbacks_df[____] = ____
print(dbacks_df, '\n')

# Display dbacks_df where WP is greater than or equal to 0.50
print(dbacks_df[dbacks_df['WP'] >= 0.50])

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

# Append a new column to dbacks_df
dbacks_df['WP'] = win_percs
print(dbacks_df, '\n')

# Display dbacks_df where WP is greater than or equal to 0.50
print(dbacks_df[dbacks_df['WP'] >= 0.50])

In [None]:
<script.py> output:
      Team League  Year   RS   RA   W    G  Playoffs
    0  ARI     NL  2012  734  688  81  162         0
    1  ARI     NL  2011  731  662  94  162         1
    2  ARI     NL  2010  713  836  65  162         0
    3  ARI     NL  2009  720  782  70  162         0
    4  ARI     NL  2008  720  706  82  162         0
    0     0.50
    1     0.58
    2     0.40
    3     0.43
    4     0.51
    5     0.56
    6     0.47
    7     0.48
    8     0.31
    9     0.52
    10    0.60
    11    0.57
    12    0.52
    13    0.62
    14    0.40
    dtype: float64 
    
       Team League  Year   RS   RA    W    G  Playoffs    WP
    0   ARI     NL  2012  734  688   81  162         0  0.50
    1   ARI     NL  2011  731  662   94  162         1  0.58
    2   ARI     NL  2010  713  836   65  162         0  0.40
    3   ARI     NL  2009  720  782   70  162         0  0.43
    4   ARI     NL  2008  720  706   82  162         0  0.51
    5   ARI     NL  2007  712  732   90  162         1  0.56
    6   ARI     NL  2006  773  788   76  162         0  0.47
    7   ARI     NL  2005  696  856   77  162         0  0.48
    8   ARI     NL  2004  615  899   51  162         0  0.31
    9   ARI     NL  2003  717  685   84  162         0  0.52
    10  ARI     NL  2002  819  674   98  162         1  0.60
    11  ARI     NL  2001  818  677   92  162         1  0.57
    12  ARI     NL  2000  792  754   85  162         0  0.52
    13  ARI     NL  1999  908  676  100  162         1  0.62
    14  ARI     NL  1998  665  812   65  162         0  0.40 
    
       Team League  Year   RS   RA    W    G  Playoffs    WP
    0   ARI     NL  2012  734  688   81  162         0  0.50
    1   ARI     NL  2011  731  662   94  162         1  0.58
    4   ARI     NL  2008  720  706   82  162         0  0.51
    5   ARI     NL  2007  712  732   90  162         1  0.56
    9   ARI     NL  2003  717  685   84  162         0  0.52
    10  ARI     NL  2002  819  674   98  162         1  0.60
    11  ARI     NL  2001  818  677   92  162         1  0.57
    12  ARI     NL  2000  792  754   85  162         0  0.52
    13  ARI     NL  1999  908  676  100  162         1  0.62

![image.png](attachment:bba57c1f-4991-45a1-a8c7-3bd7ab04788a.png)

In [None]:
       Team League  Year   RS   RA    W    G  Playoffs    WP
    0   ARI     NL  2012  734  688   81  162         0  0.50
    1   ARI     NL  2011  731  662   94  162         1  0.58
    4   ARI     NL  2008  720  706   82  162         0  0.51
    5   ARI     NL  2007  712  732   90  162         1  0.56
    9   ARI     NL  2003  717  685   84  162         0  0.52
    10  ARI     NL  2002  819  674   98  162         1  0.60
    11  ARI     NL  2001  818  677   92  162         1  0.57
    12  ARI     NL  2000  792  754   85  162         0  0.52
    13  ARI     NL  1999  908  676  100  162         1  0.62

1. Incorrect! Look at the 'WP' column you created and the number listed in the 'Playoffs' column. A 1 means the team made the playoffs and a 0 means the team did not make the playoffs.

2. Correct! Nicely done! Using the .apply() method with a lambda function allows you to apply a function to a DataFrame without the need to write a for loop.

    Sadly, the second manager was correct. In the years 2012, 2008, 2003, and 2000, the Arizona Diamondbacks had a win percentage greater than or equal to 0.50, but still did not make the playoffs.

3. Incorrect! Not quite. You can see that the Diamondbacks have made the playoffs by looking at the 1's listed in the 'Playoffs' column.

## Optimal pandas iterating

1. Optimal pandas iterating
We've come a long way from our first dot-iloc approach for iterating over a DataFrame. Each approach we've discussed has really improved performance. But, these approaches focus on performing calculations for each row of our DataFrame individually. In this lesson, we'll explore some pandas internals that allow us to perform calculations even more efficiently.

2. pandas internals
As you know, we should try to stay away from loops when writing Python code - and working with pandas is no exception. In the previous lessons, we were iterating over a DataFrame row by row in order to perform a calculation. pandas is a library that is built on NumPy. This means that each pandas DataFrame we use can take advantage of the efficient characteristics of NumPy arrays that we learned in Chapter 1. Do you remember an array's broadcasting functionality? Broadcasting allows NumPy arrays to vectorize operations, so they are performed on all elements of an object at once. This allows us to efficiently perform calculations over entire arrays. Just like NumPy, pandas is designed to vectorize calculations so that they operate on entire datasets at once (not just on a row by row basis). Let's explore this concept with some examples.

3. DataFrame columns as arrays
We'll continue to use the baseball_df DataFrame we have been using throughout the chapter. Since pandas is built on top of NumPy, we can grab any DataFrame column's values as a NumPy array using the dot-values attribute. Here, we are collecting the W column's values into a NumPy array called wins_np. When we print the type of wins_np, we see that it is, in fact, a NumPy array. We can see the contents of the array by printing it and verifying that it is the same as the W column from our DataFrame.
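
Here is a small sketch of that step, using a five-row stand-in for baseball_df built from the 2012 rows shown in this chapter.

```python
import numpy as np
import pandas as pd

# Five-row stand-in for baseball_df (2012 rows shown in this chapter)
baseball_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC'],
    'W': [81, 94, 93, 69, 61],
})

# Grab the W column's underlying values as a NumPy array
wins_np = baseball_df['W'].values

print(type(wins_np))  # <class 'numpy.ndarray'>
print(wins_np)        # [81 94 93 69 61]
```

Note that the pandas documentation recommends the dot-to_numpy method over dot-values in newer code. For a simple integer column like this one, both should return the same array.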

4. Power of vectorization
The beauty of knowing that pandas is built on NumPy can be seen when taking advantage of a NumPy array's broadcasting abilities. Remember, this means we can vectorize our calculations, and perform them on entire arrays all at once! Instead of looping over a DataFrame, and treating each row independently, like we've done with dot-iterrows, dot-itertuples, and dot-apply, we can perform calculations on the underlying NumPy arrays of our baseball_df DataFrame. Here, we gather the RS and RA columns in our DataFrame as NumPy arrays, and use broadcasting to calculate run differentials all at once!

5. Run differentials with arrays
When we use NumPy arrays to perform our run differential calculations, we can see that our code becomes much more readable. Here, we can explicitly see how our run differentials are being calculated. After creating our new column and printing the DataFrame, we see that our results are identical to all other approaches. But, just how much more efficient is using NumPy arrays?
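
The array-based calculation described above might look like the following sketch. The five-row DataFrame is a stand-in for baseball_df, and run_diffs_np is the name assumed for the resulting array.

```python
import pandas as pd

# Five-row stand-in for baseball_df (2012 rows shown in this chapter)
baseball_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC'],
    'RS': [734, 700, 712, 734, 613],
    'RA': [688, 600, 705, 806, 759],
})

# Subtract the two underlying arrays element by element, with no explicit loop
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values

# Store the result in a new RD column
baseball_df['RD'] = run_diffs_np
print(baseball_df)
```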

6. Comparing approaches
When we time our NumPy arrays approach, we see that our run differential calculations take microseconds! All other approaches were reported in milliseconds. Our array approach is orders of magnitude faster than all previous approaches!
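
To get a feel for the size of this gap yourself, you could time the dot-apply and array approaches with the standard library's timeit module. This is only a sketch: the DataFrame repeats five rows from this chapter as a stand-in for baseball_df, the body of calc_run_diff is assumed, the helper names are hypothetical, and the exact numbers will vary by machine.

```python
import timeit

import pandas as pd

# Stand-in for baseball_df: five rows from this chapter repeated 200 times
sample = pd.DataFrame({
    'RS': [734, 700, 712, 734, 613],
    'RA': [688, 600, 705, 806, 759],
})
baseball_df = pd.concat([sample] * 200, ignore_index=True)


def calc_run_diff(runs_scored, runs_allowed):
    # Assumed body: runs scored minus runs allowed
    return runs_scored - runs_allowed


def run_diffs_with_apply():
    return baseball_df.apply(
        lambda row: calc_run_diff(row['RS'], row['RA']), axis=1
    )


def run_diffs_with_arrays():
    return baseball_df['RS'].values - baseball_df['RA'].values


# Check that both approaches give the same values before timing them
results_match = (
    run_diffs_with_apply().tolist() == run_diffs_with_arrays().tolist()
)

# Average seconds per call; the array version is run many more times
# because each call is so short
apply_secs = timeit.timeit(run_diffs_with_apply, number=5) / 5
arrays_secs = timeit.timeit(run_diffs_with_arrays, number=1000) / 1000

print(f"Same results: {results_match}")
print(f".apply():     {apply_secs * 1e6:.1f} microseconds per call")
print(f"NumPy arrays: {arrays_secs * 1e6:.1f} microseconds per call")
```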

7. Let's put our skills into practice!
It's clear that using a DataFrame's underlying NumPy arrays to perform calculations can help us gain some massive efficiencies. Let's practice with a few coding exercises.

### Replacing .iloc with underlying arrays

![image.png](attachment:d1892e30-117c-405f-96ab-a32cb1a45e6c.png)

In [None]:
In [1]:
baseball_df.head()
Out[1]:

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
3  BOS     AL  2012  734  806  69  162         0
4  CHC     NL  2012  613  759  61  162         0

In [None]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df[____].____, baseball_df[____].____)

In [None]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

![image.png](attachment:669534b5-bfad-4fe0-a6b0-164177d83dc3.png)

In [None]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df[____] = ____

print(baseball_df.head())

In [None]:
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df['WP'] = win_percs_np

print(baseball_df.head())

In [None]:
<script.py> output:
      Team League  Year   RS   RA   W    G  Playoffs    WP
    0  ARI     NL  2012  734  688  81  162         0  0.50
    1  ATL     NL  2012  700  600  94  162         1  0.58
    2  BAL     AL  2012  712  705  93  162         1  0.57
    3  BOS     AL  2012  734  806  69  162         0  0.43
    4  CHC     NL  2012  613  759  61  162         0  0.38
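
The calc_win_perc() function is provided by the exercise, so its body isn't shown here. For it to accept whole arrays, it needs to rely on operations that work element by element. The sketch below assumes it divides wins by games played and rounds to two decimals with np.round, which reproduces the WP values above.

```python
import numpy as np
import pandas as pd

# Five-row stand-in for baseball_df (the rows printed above)
baseball_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC'],
    'W': [81, 94, 93, 69, 61],
    'G': [162, 162, 162, 162, 162],
})


def calc_win_perc(wins, games_played):
    # Assumed body: division and np.round both work on scalars and on arrays
    return np.round(wins / games_played, 2)


# Pass whole arrays instead of one row at a time
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)
baseball_df['WP'] = win_percs_np
print(baseball_df)
```

A function that used Python's built-in round() or an if statement on its inputs would fail on arrays, which is one reason the choice of NumPy functions matters here.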

![image.png](attachment:35e612ee-9175-4605-878a-ce48a34f74f9.png)

In [None]:
%%timeit

win_percs_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

baseball_df['WP'] = win_percs_list

1.27 s +- 109 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

In [None]:
%%timeit

# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(baseball_df['W'].values, baseball_df['G'].values)

# Append a new column to baseball_df that stores all win percentages
baseball_df['WP'] = win_percs_np

883 us +- 143 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)

1. Incorrect! Try again. We've seen that using .iloc is not very efficient.

3. Incorrect! Not quite. One of these approaches is more efficient than the other.

2. Correct! Great job! You're knocking it out of the park! Using a DataFrame's underlying arrays to perform calculations can really speed up your code and yields some significant efficiency gains. Did you notice that the NumPy array approach was not just faster, but that it also used fewer lines of code and was easier to read?



### Bringing it all together: Predict win percentage

![image.png](attachment:966c4faf-f8d2-474d-b2a0-3f98a6738074.png)

In [None]:
OrderedDict([('Team', 'Abbreviated team name'),
             ('League', 'Specifies National League or American League'),
             ('Year', "Each season's year"),
             ('RS', 'Runs scored in a season'),
             ('RA', 'Runs allowed in a season'),
             ('W', 'Wins in a season'),
             ('G', 'Games played in a season'),
             ('Playoffs', '`1` if a team made the playoffs; `0` if they did not'),
             ('WP', 'True win percentage for a season')])

In [None]:
In [1]:
baseball_df.head()
Out[1]:

  Team League  Year   RS   RA   W    G  Playoffs    WP
0  ARI     NL  2012  734  688  81  162         0  0.50
1  ATL     NL  2012  700  600  94  162         1  0.58
2  BAL     AL  2012  712  705  93  162         1  0.57
3  BOS     AL  2012  734  806  69  162         0  0.43
4  CHC     NL  2012  613  759  61  162         0  0.38

In [None]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for ____ in baseball_df.____():
    runs_scored = ____.____
    runs_allowed = ____.____
    win_perc_pred = predict_win_perc(____, ____)
    win_perc_preds_loop.append(____)

In [None]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

![image.png](attachment:92a31417-c594-487d-a227-b9cb806c4db1.png)

In [None]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.____(lambda ____: predict_win_perc(row[____], row[____]), axis=____)

In [None]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

![image.png](attachment:a0e87677-66ae-4979-a2df-8e8369ff4035.png)

In [None]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(baseball_df[____].____, baseball_df[____].____)
baseball_df['WP_preds'] = win_perc_preds_np
print(baseball_df.head())

In [None]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)
baseball_df['WP_preds'] = win_perc_preds_np
print(baseball_df.head())

![image.png](attachment:f6efdf99-725c-41ef-aad2-6dfd22bf1e00.png)

In [None]:
%%timeit

win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in baseball_df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)
    
69.8 ms +- 9.05 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)

In [None]:
%%timeit

# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = baseball_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

225 ms +- 41.8 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

In [None]:
%%timeit

# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(baseball_df['RS'].values, baseball_df['RA'].values)
baseball_df['WP_preds'] = win_perc_preds_np

805 us +- 87.1 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)

1. Incorrect! Try again. You want the fastest times first. The .apply() approach should not be the fastest approach.

2. Correct! 

![image.png](attachment:b9433294-0e0a-47d5-aee7-f6c4dcd41598.png)

3. Incorrect! Not quite. The itertuples() approach is efficient, but it's not the fastest approach out of the three.

4. Incorrect! These three approaches have different runtimes, and one approach is much more efficient than the others.





## Congratulations!

1. Congratulations!
Congratulations on completing the course! Now, you have the necessary tools to start writing efficient Python code!

2. What you have learned
Over the four chapters of this course, you have learned what writing efficient code truly means, and that writing Pythonic code often yields efficient code. You've explored Python's Standard Library and practiced using built-in functions like range, enumerate, and map. You know the power of NumPy arrays and can use them for fast, efficient calculations. You're a whiz at using magic commands like %timeit and know how to profile your code with the line_profiler and memory_profiler packages. You've also applied more advanced techniques to gain efficiencies by using built-in functions like zip, built-in modules like itertools and collections, and a branch of mathematics called set theory. Finally, you explored looping patterns in Python and why they are not always the most efficient approach to solving problems. You successfully eliminated loops in your code and even learned how to efficiently iterate over pandas DataFrames.

3. Well done!
Well done! It has been an absolute pleasure working with you! Thank you for taking the course, and I hope to see you again in the future!