### Author: Ran Meng

This Jupyter notebook contains my work for the certification course "Writing Efficient Python Code", taught by Logan Thomas on [DataCamp](https://learn.datacamp.com/courses/writing-efficient-python-code).

In [38]:
import numpy as np
import pandas as pd
import line_profiler
from collections import Counter
from itertools import combinations

#### Pythonic vs Non-Pythonic:

Collect the names in the list that have six or more letters.

In [1]:
names = ['Jerry', 'Kramer', 'Elaine', 'George', 'Newman']

In [2]:
# Print the list created using the Non-Pythonic approach
i = 0
new_list= []
while i < len(names):
    if len(names[i]) >= 6:
        new_list.append(names[i])
    i += 1
print(new_list)

['Kramer', 'Elaine', 'George', 'Newman']


In [3]:
# Print the list created by looping over the contents of names
better_list = []
for name in names:
    if len(name) >= 6:
        better_list.append(name)
print(better_list)

['Kramer', 'Elaine', 'George', 'Newman']


In [4]:
# Print the list created by using list comprehension
best_list = [name for name in names if len(name) >= 6]
print(best_list)

['Kramer', 'Elaine', 'George', 'Newman']


#### Built-in practice: range()

In [5]:
# Create a range object that goes from 0 to 5
nums = range(6)
print(type(nums))

# Convert nums to a list
nums_list = list(nums)
print(nums_list)

# Create a new list of odd numbers from 1 to 11 by unpacking a range object
nums_list2 = [*range(1,12,2)]
print(nums_list2)

<class 'range'>
[0, 1, 2, 3, 4, 5]
[1, 3, 5, 7, 9, 11]
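
A range object is lazy: it stores only its start, stop, and step, and produces values on demand. The sketch below illustrates this with `sys.getsizeof`; the variable names are made up for this example, and the exact byte counts depend on the Python build, so only the comparisons are printed.

```python
import sys

# A range object's size does not grow with the number of values it represents.
small_range = range(6)
big_range = range(1_000_000)
same_size = sys.getsizeof(small_range) == sys.getsizeof(big_range)
print(same_size)

# Unpacking into a list allocates storage for every element.
big_list = [*big_range]
list_is_bigger = sys.getsizeof(big_list) > sys.getsizeof(big_range)
print(list_is_bigger)

# range objects still support len(), indexing, and membership tests.
print(len(big_range), big_range[10], 999_999 in big_range)
```

This is why unpacking or calling list() is only needed when the values must actually be stored.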


#### Built-in practice: enumerate() 

This function is useful for obtaining an indexed list. For example, suppose you had a list of people that arrived at a party you are hosting. The list is ordered by arrival (Jerry was the first to arrive, followed by Kramer, etc.):

In [6]:
# Rewrite the for loop to use enumerate
indexed_names = []
for i,name in enumerate(names):
    index_name = (i,name)
    indexed_names.append(index_name) 
print(indexed_names)

# Rewrite the above for loop using list comprehension
indexed_names_comp = [(i,name) for i,name in enumerate(names)]
print(indexed_names_comp)

# Unpack an enumerate object with a starting index of one
indexed_names_unpack = [*enumerate(names, start = 1)]
print(indexed_names_unpack)

[(0, 'Jerry'), (1, 'Kramer'), (2, 'Elaine'), (3, 'George'), (4, 'Newman')]
[(0, 'Jerry'), (1, 'Kramer'), (2, 'Elaine'), (3, 'George'), (4, 'Newman')]
[(1, 'Jerry'), (2, 'Kramer'), (3, 'Elaine'), (4, 'George'), (5, 'Newman')]


#### Built-in practice: map() 

Suppose you want to create a new list (called names_uppercase) that converts all the letters in each name to uppercase.

In [8]:
names_uppercase = []

for name in names:
    names_uppercase.append(name.upper())

print(names_uppercase)

['JERRY', 'KRAMER', 'ELAINE', 'GEORGE', 'NEWMAN']


In [9]:
# Use map to apply str.upper to each element in names
names_map  = map(str.upper, names)

# Print the type of the names_map
print(type(names_map))

# Unpack names_map into a list
names_uppercase = [*names_map]

# Print the list created above
print(names_uppercase)

<class 'map'>
['JERRY', 'KRAMER', 'ELAINE', 'GEORGE', 'NEWMAN']
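
One detail worth keeping in mind: a map object is an iterator, so it can only be consumed once. A minimal sketch, reusing the names list from above (first_pass and second_pass are names made up for this example):

```python
names = ['Jerry', 'Kramer', 'Elaine', 'George', 'Newman']

names_map = map(str.upper, names)

first_pass = [*names_map]   # consumes the iterator
second_pass = [*names_map]  # nothing is left to yield

print(first_pass)   # ['JERRY', 'KRAMER', 'ELAINE', 'GEORGE', 'NEWMAN']
print(second_pass)  # []
```

If the results are needed more than once, unpack the map object into a list first and reuse the list.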


#### Practice with NumPy arrays

Let's practice slicing NumPy arrays and using NumPy's broadcasting concept. Remember, broadcasting lets NumPy apply an operation between an array and a scalar (or between arrays of compatible shapes) to all elements at once, without an explicit loop.

In [13]:
nums = np.array([[1,2,3,4,5], [6,7,8,9,10]])

In [18]:
# Print second row of nums
print("\n", nums[1, :])

# Print all elements of nums that are greater than six
print("\n", nums[nums > 6])

# Double every element of nums
nums_dbl = nums * 2
print("\n", nums_dbl)

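# Add 1 to the third column of nums (this modifies nums in place, so re-running
# the cell keeps incrementing it; the output below reflects such repeated runs)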
nums[:,2] = nums[:,2] + 1
print("\n",nums)


 [ 6  7 10  9 10]

 [ 7 10  9 10]

 [[ 2  4 10  8 10]
 [12 14 20 18 20]]

 [[ 1  2  6  4  5]
 [ 6  7 11  9 10]]


#### Bringing it all together: Festivus!

In this exercise, you will be throwing a party—a Festivus if you will!

You have a list of guests (the names list). Each guest, for whatever reason, has decided to show up to the party in 10-minute increments. For example, Jerry shows up to Festivus 10 minutes into the party's start time, Kramer shows up 20 minutes into the party, and so on and so forth.

We want to write a few simple lines of code, using the built-ins we have covered, to welcome each of your guests and let them know how many minutes late they are to your party.

In [19]:
# Create a list of arrival times
arrival_times = [*range(10, 60, 10)]

print(arrival_times)

[10, 20, 30, 40, 50]


In [20]:
# Create a list of arrival times
arrival_times = [*range(10,60,10)]
print(arrival_times)

# Convert arrival_times to an array and update the times
arrival_times_np = np.array(arrival_times)
new_times = arrival_times_np - 3
print("\n", new_times)

# Use list comprehension and enumerate to pair guests to new times
guest_arrivals = [(names[i],time) for i,time in enumerate(new_times)]
print("\n", guest_arrivals)

[10, 20, 30, 40, 50]

 [ 7 17 27 37 47]

 [('Jerry', 7), ('Kramer', 17), ('Elaine', 27), ('George', 37), ('Newman', 47)]


In [21]:
def welcome_guest(guest_and_time):
    """
    Returns a welcome string for the guest_and_time tuple.
    
    Args:
        guest_and_time (tuple): The guest and time tuple to create
            a welcome string for.
            
    Returns:
        welcome_string (str): A string welcoming the guest to Festivus.
        'Welcome to Festivus {guest}... You're {time} min late.'
    
    """
    guest = guest_and_time[0]
    arrival_time = guest_and_time[1]
    welcome_string = "Welcome to Festivus {}... You're {} min late.".format(guest,arrival_time)
    return welcome_string

In [22]:
# Map the welcome_guest function to each (guest,time) pair
welcome_map = map(welcome_guest, guest_arrivals)
guest_welcomes = [*welcome_map]
print(*guest_welcomes, sep='\n')

Welcome to Festivus Jerry... You're 7 min late.
Welcome to Festivus Kramer... You're 17 min late.
Welcome to Festivus Elaine... You're 27 min late.
Welcome to Festivus George... You're 37 min late.
Welcome to Festivus Newman... You're 47 min late.
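
As an alternative to indexing names[i] inside enumerate, zip can pair the two sequences directly. This is a sketch of the same pairing using plain lists instead of a NumPy array; guest_arrivals_zip is a name made up for this example.

```python
names = ['Jerry', 'Kramer', 'Elaine', 'George', 'Newman']
arrival_times = [*range(10, 60, 10)]

# Same 3-minute adjustment as above, done with a list comprehension
new_times = [time - 3 for time in arrival_times]

# zip pairs elements by position and stops at the shorter input
guest_arrivals_zip = [*zip(names, new_times)]
print(guest_arrivals_zip)
```

Because zip stops at the shorter input, it avoids an IndexError if the two lists ever differ in length, at the cost of silently dropping the extra items.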


#### Using %timeit: your turn!

You'd like to create a list of integers from 0 to 50 using the range() function. However, you are unsure whether using list comprehension or unpacking the range object into a list is faster. Let's use %timeit to find the best implementation.

In [24]:
# Create a list of integers (0-50) using list comprehension
%timeit nums_list_comp = [num for num in range(51)]

1.94 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [25]:
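# Note: this unpacks the existing list nums_list_comp rather than the range object;
# use [*range(51)] for a like-for-like comparison with the cell above
%timeit nums_unpack = [*(nums_list_comp)]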

140 ns ± 5.31 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


#### Using %timeit: specifying number of runs and loops

A list of 480 superheroes has been loaded into your session (called heroes). You'd like to analyze the runtime for converting this heroes list into a set. Instead of relying on the default settings for %timeit, you'd like to use only 5 runs and 25 loops per run.

In [27]:
%timeit -r5 -n25 set(heroes)

16.4 µs ± 2.86 µs per loop (mean ± std. dev. of 5 runs, 25 loops each)
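
Outside IPython, the standard-library timeit module offers a rough equivalent. The sketch below assumes a made-up heroes list as a stand-in for the one loaded in the exercise; repeat=5 and number=25 mirror the -r5 and -n25 flags.

```python
import timeit

# Hypothetical stand-in for the 480-element heroes list from the exercise
heroes = [f"hero_{i % 400}" for i in range(480)]

# Five runs of 25 loops each, like %timeit -r5 -n25
run_totals = timeit.repeat(lambda: set(heroes), repeat=5, number=25)

# Each entry is the total time for one run, so divide by the loop count
# to get a per-loop figure
per_loop = [total / 25 for total in run_totals]
print(f"best per-loop time: {min(per_loop) * 1e6:.2f} µs")
```

Unlike %timeit, which reports a mean and standard deviation, timeit.repeat returns the raw run totals, so the summary statistic is up to you.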


#### Formal list vs literal list

[] is preferred over list()

In [28]:
# Create a list using the formal name
%timeit formal_list = list()

108 ns ± 14 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [29]:
# Create a list using the literal name
%timeit literal_list = []

28.8 ns ± 5.44 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
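
One way to see where the difference comes from is to compare the bytecode of the two expressions with the dis module. This is only a sketch: opnames is a helper made up for this example, and the exact instruction names vary between CPython versions.

```python
import dis

def opnames(source):
    """Return the bytecode instruction names for a one-expression source string."""
    code = compile(source, "<example>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]

literal_ops = opnames("[]")
formal_ops = opnames("list()")

print(literal_ops)  # contains a BUILD_LIST instruction
print(formal_ops)   # contains a name lookup followed by a call instruction
```

The literal builds the list directly, while list() first has to look up the name list and then call it, which accounts for the extra time.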


#### Formal tuple vs literal tuple

() is preferred over tuple()

In [30]:
# Create a tuple using the formal name
%timeit formal_tuple = tuple()

73.8 ns ± 4.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [31]:
# Create a tuple using the literal name
%timeit literal_tuple = ()

19.7 ns ± 1.94 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)


#### Convert a list of weights in kgs to pounds

In [40]:
%%timeit 
hero_wts_lbs = []
for wt in wts:
    hero_wts_lbs.append(wt * 2.20462)

49 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [43]:
%%timeit
wts_np = np.array(wts)
hero_wts_lbs_np = wts_np * 2.20462

12.7 µs ± 704 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### Code Profiling

Detailed stats on frequency and duration of function calls

Line-by-line analyses

In [10]:
def convert_units(heroes, heights, weights):

    new_hts = [ht * 0.39370  for ht in heights]
    new_wts = [wt * 2.20462  for wt in weights]

    hero_data = {}

    for i,hero in enumerate(heroes):
        hero_data[hero] = (new_hts[i], new_wts[i])

    return hero_data

In [8]:
%load_ext line_profiler

In [19]:
%lprun -f convert_units convert_units(heroes, hts, wts)
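
%lprun needs IPython and the line_profiler package. For function-level statistics (call counts and durations rather than line-by-line timings), the standard-library cProfile module can be used instead. A minimal sketch, with made-up sample data standing in for heroes, hts, and wts:

```python
import cProfile
import io
import pstats

def convert_units(heroes, heights, weights):
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    hero_data = {}
    for i, hero in enumerate(heroes):
        hero_data[hero] = (new_hts[i], new_wts[i])
    return hero_data

# Hypothetical sample data (not the course dataset)
heroes = ['hero_a', 'hero_b', 'hero_c']
hts = [188.0, 191.0, 183.0]
wts = [95.0, 101.0, 74.0]

# Profile a single call and keep its return value
profiler = cProfile.Profile()
hero_data = profiler.runcall(convert_units, heroes, hts, wts)

# Write the report to a string buffer, sorted by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(5)
report = buffer.getvalue()
print(report)
```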

#### Code Profiling for memory usage

In [None]:
# The function needs to be stored in a separate script, e.g. hero_funcs.py
from hero_funcs import convert_units

In [None]:
%load_ext memory_profiler

In [None]:
%mprun -f convert_units convert_units(heroes, hts, wts)
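
%mprun requires the memory_profiler package and a function imported from a file. When only overall figures are needed, the standard-library tracemalloc module is a lighter option. The sketch below uses made-up input data and reports the current and peak traced memory for one call; it does not give the line-by-line breakdown that %mprun does.

```python
import tracemalloc

def convert_units(heroes, heights, weights):
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    hero_data = {}
    for i, hero in enumerate(heroes):
        hero_data[hero] = (new_hts[i], new_wts[i])
    return hero_data

# Hypothetical inputs (not the course dataset)
heroes = [f"hero_{i}" for i in range(1000)]
hts = [180.0] * 1000
wts = [80.0] * 1000

# Trace allocations made while the function runs
tracemalloc.start()
hero_data = convert_units(heroes, hts, wts)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes / 1024:.1f} KiB, peak: {peak_bytes / 1024:.1f} KiB")
```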

#### Bringing it all together: Star Wars profiling

A list of 480 superheroes has been loaded into your session (called heroes) as well as a list of each hero's corresponding publisher (called publishers).

You'd like to filter the heroes list based on a hero's specific publisher, but are unsure which of the below functions is more efficient.

In [21]:
def get_publisher_heroes(heroes, publishers, desired_publisher):

    desired_heroes = []

    for i,pub in enumerate(publishers):
        if pub == desired_publisher:
            desired_heroes.append(heroes[i])

    return desired_heroes


def get_publisher_heroes_np(heroes, publishers, desired_publisher):

    heroes_np = np.array(heroes)
    pubs_np = np.array(publishers)

    desired_heroes = heroes_np[pubs_np == desired_publisher]

    return desired_heroes

In [25]:
%lprun -f get_publisher_heroes get_publisher_heroes(heroes, publishers, "George Lucas")

In [26]:
%lprun -f get_publisher_heroes_np get_publisher_heroes_np(heroes, publishers, "George Lucas")
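
A third variant, not part of the exercise, pairs the two lists with zip inside a list comprehension. The function name get_publisher_heroes_zip and the sample data are made up for this sketch.

```python
def get_publisher_heroes_zip(heroes, publishers, desired_publisher):
    # Walk both lists in parallel and keep heroes whose publisher matches
    return [hero for hero, pub in zip(heroes, publishers)
            if pub == desired_publisher]

# Hypothetical sample data (not the course dataset)
heroes = ['hero_a', 'hero_b', 'hero_c', 'hero_d']
publishers = ['George Lucas', 'Other', 'George Lucas', 'Other']

lucas_heroes = get_publisher_heroes_zip(heroes, publishers, 'George Lucas')
print(lucas_heroes)  # ['hero_a', 'hero_c']
```

It could be profiled with %lprun in the same way as the two functions above to see how it compares.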

#### Counting Pokémon from a sample

A sample of 500 Pokémon has been generated, and three lists from this sample have been loaded into your session:

- The names list contains the names of each Pokémon in the sample.

- The primary_types list contains the corresponding primary type of each Pokémon in the sample.

- The generations list contains the corresponding generation of each Pokémon in the sample.

You want to quickly gather a few counts from these lists to better understand the sample that was generated. Use Counter from the collections module to explore what types of Pokémon are in your sample, what generations they come from, and how many Pokémon have a name that starts with a specific letter.

In [36]:
# Collect the count of primary types
type_count = Counter(primary_types)
print(type_count, '\n')

# Collect the count of generations
gen_count = Counter(generations)
print(gen_count, '\n')

# Use list comprehension to get each Pokémon's starting letter
starting_letters = [name[0] for name in names]

# Collect the count of Pokémon for each starting_letter
starting_letters_count = Counter(starting_letters)
print(starting_letters_count)

Counter({'Water': 66, 'Normal': 64, 'Bug': 51, 'Grass': 47, 'Psychic': 31, 'Rock': 29, 'Fire': 27, 'Electric': 25, 'Ground': 23, 'Fighting': 23, 'Poison': 22, 'Steel': 18, 'Ice': 16, 'Fairy': 16, 'Dragon': 16, 'Ghost': 13, 'Dark': 13}) 

Counter({5: 122, 3: 103, 1: 99, 4: 78, 2: 51, 6: 47}) 

Counter({'S': 83, 'C': 46, 'D': 33, 'M': 32, 'L': 29, 'G': 29, 'B': 28, 'P': 23, 'A': 22, 'K': 20, 'E': 19, 'W': 19, 'T': 19, 'F': 18, 'H': 15, 'R': 14, 'N': 13, 'V': 10, 'Z': 8, 'J': 7, 'I': 4, 'O': 3, 'Y': 3, 'U': 2, 'X': 1})
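
Counter objects also provide a few conveniences beyond printing. A small sketch on a made-up list of types:

```python
from collections import Counter

# Hypothetical sample (not the 500-Pokémon sample from the exercise)
primary_types = ['Water', 'Normal', 'Water', 'Bug', 'Water', 'Normal']
type_count = Counter(primary_types)

# The n most frequent items, in descending order of count
top_two = type_count.most_common(2)
print(top_two)  # [('Water', 3), ('Normal', 2)]

# Missing keys return 0 instead of raising a KeyError
print(type_count['Dragon'])  # 0

# Total number of items counted
total = sum(type_count.values())
print(total)  # 6
```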


#### Combinations of Pokémon

Ash, a Pokémon trainer, encounters a group of five Pokémon. These Pokémon have been loaded into a list within your session (called pokemon) and printed into the console for your convenience.

Ash would like to try to catch some of these Pokémon, but his Pokédex can only store two Pokémon at a time. Let's use combinations from the itertools module to see what the possible pairs of Pokémon are that Ash could catch.

In [40]:
# Create a combination object with pairs of Pokémon
combos_obj = combinations(pokemon, 2)
print(type(combos_obj), '\n')

# Convert combos_obj to a list by unpacking
combos_2 = [*combos_obj]
print(combos_2, '\n')

# Collect all possible combinations of 4 Pokémon directly into a list
combos_4 = [*combinations(pokemon, 4)]
print(combos_4)

<class 'itertools.combinations'> 

[('Geodude', 'Cubone'), ('Geodude', 'Lickitung'), ('Geodude', 'Persian'), ('Geodude', 'Diglett'), ('Cubone', 'Lickitung'), ('Cubone', 'Persian'), ('Cubone', 'Diglett'), ('Lickitung', 'Persian'), ('Lickitung', 'Diglett'), ('Persian', 'Diglett')] 

[('Geodude', 'Cubone', 'Lickitung', 'Persian'), ('Geodude', 'Cubone', 'Lickitung', 'Diglett'), ('Geodude', 'Cubone', 'Persian', 'Diglett'), ('Geodude', 'Lickitung', 'Persian', 'Diglett'), ('Cubone', 'Lickitung', 'Persian', 'Diglett')]
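
As a sanity check, the number of combinations generated should match the binomial coefficient "n choose r". A short sketch using math.comb (available from Python 3.8) and the five Pokémon printed above:

```python
from itertools import combinations
from math import comb

pokemon = ['Geodude', 'Cubone', 'Lickitung', 'Persian', 'Diglett']

# Count the generated combinations for each group size
n_pairs = len([*combinations(pokemon, 2)])
n_fours = len([*combinations(pokemon, 4)])

print(n_pairs, comb(len(pokemon), 2))  # 10 10
print(n_fours, comb(len(pokemon), 4))  # 5 5
```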


#### Comparing Pokédexes

Two Pokémon trainers, Ash and Misty, would like to compare their individual collections of Pokémon. Let's see what Pokémon they have in common and what Pokémon Ash has that Misty does not.

In [42]:
# Convert both lists to sets
ash_set = set(ash_pokedex)
misty_set = set(misty_pokedex)

# Find the Pokémon that exist in both sets
both = ash_set.intersection(misty_set)
print(both)

# Find the Pokémon that Ash has and Misty does not have
ash_only = ash_set.difference(misty_set)
print(ash_only)

# Find the Pokémon that are in only one set (not both)
unique_to_set = ash_set.symmetric_difference(misty_set)
print(unique_to_set)

{'Psyduck', 'Squirtle'}
{'Bulbasaur', 'Wigglytuff', 'Koffing', 'Vulpix', 'Pikachu', 'Zubat', 'Spearow', 'Rattata'}
{'Bulbasaur', 'Wigglytuff', 'Vaporeon', 'Starmie', 'Magikarp', 'Koffing', 'Slowbro', 'Vulpix', 'Pikachu', 'Zubat', 'Horsea', 'Spearow', 'Tentacool', 'Rattata', 'Poliwag', 'Krabby'}


In [49]:
# Check if Psyduck is in Ash's list and Brock's set
print('Psyduck' in ash_pokedex)
print('Psyduck' in brock_pokedex_set)

# Check if Machop is in Ash's list and Brock's set
print('Machop' in ash_pokedex)
print('Machop' in brock_pokedex_set)

True
False
False
True
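
Membership tests behave the same on lists and sets, but they are implemented differently: a list is scanned element by element, while a set uses a hash lookup. The sketch below times both on made-up data; the names and sizes are assumptions for illustration, and the actual timings depend on the machine.

```python
import timeit

# Hypothetical data: 10,000 made-up names. The target is the last element,
# which is the worst case for a list scan.
pokedex_list = [f"pokemon_{i}" for i in range(10_000)]
pokedex_set = set(pokedex_list)
target = "pokemon_9999"

list_time = timeit.timeit(lambda: target in pokedex_list, number=200)
set_time = timeit.timeit(lambda: target in pokedex_set, number=200)

print(f"list: {list_time:.6f} s, set: {set_time:.6f} s for 200 lookups")
```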


#### Finding unique names: an ad hoc function vs. set(), and their performance difference

In [44]:
def find_unique_items(data):
    uniques = []

    for item in data:
        if item not in uniques:
            uniques.append(item)

    return uniques

In [46]:
# Use find_unique_items() to collect unique Pokémon names
uniq_names_func = find_unique_items(names)
print(len(uniq_names_func))

# Convert the names list to a set to collect unique Pokémon names
uniq_names_set = set(names)
print(len(uniq_names_set))

# Check that both unique collections are equivalent
print(sorted(uniq_names_func) == sorted(uniq_names_set))

369
369
True


In [47]:
%timeit find_unique_items(names)

996 µs ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [48]:
%timeit set(names)

11.2 µs ± 651 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
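
One caveat: a set does not preserve the order in which names first appear, whereas find_unique_items() does. If order matters, dict.fromkeys offers a middle ground, since dictionaries keep insertion order (guaranteed from Python 3.7). A sketch on a small made-up list:

```python
# Hypothetical sample with duplicates
names = ['Pikachu', 'Squirtle', 'Pikachu', 'Bulbasaur', 'Squirtle']

# Keys are unique and keep first-seen order; unpack them into a list
uniq_ordered = [*dict.fromkeys(names)]
print(uniq_ordered)  # ['Pikachu', 'Squirtle', 'Bulbasaur']
```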


#### Gathering Pokémon without a loop

A list containing 720 Pokémon has been loaded into your session as poke_names. Another list containing each Pokémon's corresponding generation has been loaded as poke_gens.

A for loop has been created to filter the Pokémon that belong to generation one or two, and collect the number of letters in each Pokémon's name:

In [57]:
%%timeit
gen1_gen2_name_lengths_loop = []
for name,gen in zip(poke_names, poke_gens):
    if gen < 3:
        name_length = len(name)
        poke_tuple = (name, name_length)
        gen1_gen2_name_lengths_loop.append(poke_tuple)

84.2 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [60]:
%%timeit 
gen1_gen2_pokemon = [name for name,gen in zip(poke_names, poke_gens) if gen < 3]
name_lengths_map = map(len, gen1_gen2_pokemon)
gen1_gen2_name_lengths = [*zip(gen1_gen2_pokemon, name_lengths_map)]

44.3 µs ± 3.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


#### One-time calculation loop

A list of integers that represents each Pokémon's generation has been loaded into your session called generations. You'd like to gather the counts of each generation and determine what percentage each generation accounts for out of the total count of integers.

In [61]:
# Collect the count of each generation
gen_counts = Counter(generations)

# Improve for loop by moving one calculation above the loop
total_count = len(generations)

for gen,count in gen_counts.items():
    gen_percent = round(count/total_count*100, 2)
    print('generation {}: count = {:3} percentage = {}'
          .format(gen, count, gen_percent))

generation 1: count =  99 percentage = 19.8
generation 5: count = 122 percentage = 24.4
generation 3: count = 103 percentage = 20.6
generation 6: count =  47 percentage = 9.4
generation 4: count =  78 percentage = 15.6
generation 2: count =  51 percentage = 10.2


#### Holistic conversion loop

A list of all possible Pokémon types has been loaded into your session as pokemon_types. It's been printed in the console for convenience.

You'd like to gather all the possible pairs of Pokémon types. You want to store each of these pairs in an individual list with an enumerated index as the first element of each list. This allows you to see the total number of possible pairs and provides an indexed label for each pair.

The below loop was written to accomplish this task:

In [66]:
# Collect all possible pairs using combinations()
possible_pairs = [*combinations(pokemon_types, 2)]

# Create an empty list called enumerated_tuples
enumerated_tuples = []

# Add a line to append each enumerated_pair_tuple to the empty list above
for i, pair in enumerate(possible_pairs, 1):
    enumerated_pair_tuple = (i,) + pair
    enumerated_tuples.append(enumerated_pair_tuple)

# Convert all tuples in enumerated_tuples to a list
enumerated_pairs = [*map(list, enumerated_tuples)]
print(enumerated_pairs)

[[1, 'Bug', 'Dark'], [2, 'Bug', 'Dragon'], [3, 'Bug', 'Electric'], [4, 'Bug', 'Fairy'], [5, 'Bug', 'Fighting'], [6, 'Bug', 'Fire'], [7, 'Bug', 'Flying'], [8, 'Bug', 'Ghost'], [9, 'Bug', 'Grass'], [10, 'Bug', 'Ground'], [11, 'Bug', 'Ice'], [12, 'Bug', 'Normal'], [13, 'Bug', 'Poison'], [14, 'Bug', 'Psychic'], [15, 'Bug', 'Rock'], [16, 'Bug', 'Steel'], [17, 'Bug', 'Water'], [18, 'Dark', 'Dragon'], [19, 'Dark', 'Electric'], [20, 'Dark', 'Fairy'], [21, 'Dark', 'Fighting'], [22, 'Dark', 'Fire'], [23, 'Dark', 'Flying'], [24, 'Dark', 'Ghost'], [25, 'Dark', 'Grass'], [26, 'Dark', 'Ground'], [27, 'Dark', 'Ice'], [28, 'Dark', 'Normal'], [29, 'Dark', 'Poison'], [30, 'Dark', 'Psychic'], [31, 'Dark', 'Rock'], [32, 'Dark', 'Steel'], [33, 'Dark', 'Water'], [34, 'Dragon', 'Electric'], [35, 'Dragon', 'Fairy'], [36, 'Dragon', 'Fighting'], [37, 'Dragon', 'Fire'], [38, 'Dragon', 'Flying'], [39, 'Dragon', 'Ghost'], [40, 'Dragon', 'Grass'], [41, 'Dragon', 'Ground'], [42, 'Dragon', 'Ice'], [43, 'Dragon', 'Nor
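
The loop and the later conversion to lists can also be folded into a single list comprehension. This is a sketch of that idea on a shortened, made-up list of types rather than the full pokemon_types list:

```python
from itertools import combinations

# Hypothetical shortened list of types
pokemon_types = ['Bug', 'Dark', 'Dragon']

# Build each [index, type1, type2] list directly, numbering pairs from 1
enumerated_pairs = [[i, *pair]
                    for i, pair in enumerate(combinations(pokemon_types, 2), 1)]
print(enumerated_pairs)
# [[1, 'Bug', 'Dark'], [2, 'Bug', 'Dragon'], [3, 'Dark', 'Dragon']]
```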

#### Pokémon z-scores

A list of 720 Pokémon has been loaded into your session as names. Each Pokémon's corresponding Health Points is stored in a NumPy array called hps. You want to analyze the Health Points using the z-score to see how many standard deviations each Pokémon's HP is from the mean of all HPs.

The code below was written to calculate the HP z-score for each Pokémon and gather the Pokémon with the highest HPs based on their z-scores:

In [69]:
%%timeit 
poke_zscores = []

for name,hp in zip(names, hps):
    hp_avg = hps.mean()
    hp_std = hps.std()
    z_score = (hp - hp_avg)/hp_std
    poke_zscores.append((name, hp, z_score))
    
highest_hp_pokemon = []

for name,hp,zscore in poke_zscores:
    if zscore > 2:
        highest_hp_pokemon.append((name, hp, zscore))

14 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [71]:
%%timeit

# Calculate the total HP avg and total HP standard deviation
hp_avg = hps.mean()
hp_std = hps.std()

# Use NumPy to eliminate the previous for loop
z_scores = (hps - hp_avg)/hp_std

# Combine names, hps, and z_scores
poke_zscores2 = [*zip(names, hps, z_scores)]
# print(*poke_zscores2[:3], sep='\n')

# Use list comprehension with the same logic as the highest_hp_pokemon code block
highest_hp_pokemon = [(name, hp, z) for name, hp, z in poke_zscores2 if z > 2]
# print(*highest_hp_pokemon, sep='\n')

226 µs ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
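
The final filtering step can also be done with a boolean mask instead of a list comprehension. The sketch below assumes names is converted to a NumPy array (in the exercise it is a list) and uses made-up names and HP values. Note that .std() computes the population standard deviation (ddof=0) by default.

```python
import numpy as np

# Hypothetical sample data (not the 720-Pokémon dataset)
names = np.array(['poke_a', 'poke_b', 'poke_c', 'poke_d', 'poke_e', 'poke_f'])
hps = np.array([45, 60, 80, 250, 50, 65])

# Vectorized z-scores for every HP value at once
z_scores = (hps - hps.mean()) / hps.std()

# The boolean mask selects the rows whose z-score exceeds 2
mask = z_scores > 2
highest_hp_pokemon = [*zip(names[mask], hps[mask], z_scores[mask])]
print(highest_hp_pokemon)
```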


#### Iterating with .iterrows()

In the video, we discussed that .iterrows() returns each DataFrame row as a tuple of (index, pandas Series) pairs. But, what does this mean? Let's explore with a few coding exercises.

In [72]:
df = pd.read_csv('baseball_stats.csv')

In [73]:
df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424


In [74]:
pit_df = df[df['Team'] == 'PIT']

pit_df.shape

(47, 15)

In [79]:
pit_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
21,PIT,NL,2012,651,674,79,0.304,0.395,0.243,0,,,162,0.314,0.39
51,PIT,NL,2011,610,712,72,0.309,0.368,0.244,0,,,162,0.338,0.409
81,PIT,NL,2010,587,866,57,0.304,0.373,0.242,0,,,162,0.348,0.449
111,PIT,NL,2009,636,768,62,0.318,0.387,0.252,0,,,161,0.346,0.442
141,PIT,NL,2008,735,884,67,0.32,0.403,0.258,0,,,162,0.362,0.454


In [81]:
# Iterate over pit_df and print each row
for i,row in pit_df[:3].iterrows():
    print(row, '\n')

Team              PIT
League             NL
Year             2012
RS                651
RA                674
W                  79
OBP             0.304
SLG             0.395
BA              0.243
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.314
OSLG             0.39
Name: 21, dtype: object 

Team              PIT
League             NL
Year             2011
RS                610
RA                712
W                  72
OBP             0.309
SLG             0.368
BA              0.244
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.338
OSLG            0.409
Name: 51, dtype: object 

Team              PIT
League             NL
Year             2010
RS                587
RA                866
W                  57
OBP             0.304
SLG             0.373
BA              0.242
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OO

In [83]:
for i,row in pit_df[:3].iterrows():
    print(i)
    print(row)
    print(type(row), '\n')

21
Team              PIT
League             NL
Year             2012
RS                651
RA                674
W                  79
OBP             0.304
SLG             0.395
BA              0.243
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.314
OSLG             0.39
Name: 21, dtype: object
<class 'pandas.core.series.Series'> 

51
Team              PIT
League             NL
Year             2011
RS                610
RA                712
W                  72
OBP             0.309
SLG             0.368
BA              0.244
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.338
OSLG            0.409
Name: 51, dtype: object
<class 'pandas.core.series.Series'> 

81
Team              PIT
League             NL
Year             2010
RS                587
RA                866
W                  57
OBP             0.304
SLG             0.373
BA              0.242
Playoffs 

In [85]:
# Use one variable instead of two to store the result of .iterrows()
for row_tuple in pit_df[:3].iterrows():
    print(row_tuple, '\n')

(21, Team              PIT
League             NL
Year             2012
RS                651
RA                674
W                  79
OBP             0.304
SLG             0.395
BA              0.243
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.314
OSLG             0.39
Name: 21, dtype: object) 

(51, Team              PIT
League             NL
Year             2011
RS                610
RA                712
W                  72
OBP             0.309
SLG             0.368
BA              0.244
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.338
OSLG            0.409
Name: 51, dtype: object) 

(81, Team              PIT
League             NL
Year             2010
RS                587
RA                866
W                  57
OBP             0.304
SLG             0.373
BA              0.242
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G      

In [87]:
# Print the row and type of each row
for row_tuple in pit_df[:3].iterrows():
    print(row_tuple)
    print(type(row_tuple), '\n')

(21, Team              PIT
League             NL
Year             2012
RS                651
RA                674
W                  79
OBP             0.304
SLG             0.395
BA              0.243
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.314
OSLG             0.39
Name: 21, dtype: object)
<class 'tuple'> 

(51, Team              PIT
League             NL
Year             2011
RS                610
RA                712
W                  72
OBP             0.309
SLG             0.368
BA              0.244
Playoffs            0
RankSeason        NaN
RankPlayoffs      NaN
G                 162
OOBP            0.338
OSLG            0.409
Name: 51, dtype: object)
<class 'tuple'> 

(81, Team              PIT
League             NL
Year             2010
RS                587
RA                866
W                  57
OBP             0.304
SLG             0.373
BA              0.242
Playoffs            0
RankSeason        N

#### Run differentials with .iterrows()

You've been hired by the San Francisco Giants as an analyst—congrats! The team's owner wants you to calculate a metric called the run differential for each season from the year 2008 to 2012. This metric is calculated by subtracting the total number of runs a team allowed in a season from the team's total number of runs scored in a season. 'RS' means runs scored and 'RA' means runs allowed.

In [88]:
giants_df = df[df['Team'] == 'SFG']

giants_df.shape

(47, 15)

In [89]:
giants_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
24,SFG,NL,2012,718,649,94,0.327,0.397,0.269,1,4.0,1.0,162,0.313,0.393
54,SFG,NL,2011,570,578,86,0.303,0.368,0.242,0,,,162,0.309,0.346
84,SFG,NL,2010,697,583,92,0.321,0.408,0.257,1,5.0,1.0,162,0.313,0.37
114,SFG,NL,2009,657,611,88,0.309,0.389,0.257,0,,,162,0.314,0.372
144,SFG,NL,2008,640,759,72,0.321,0.382,0.262,0,,,162,0.341,0.404


In [90]:
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

In [91]:
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

giants_df['RD'] = run_diffs
print(giants_df)

     Team League  Year   RS   RA    W    OBP    SLG     BA  Playoffs  \
24    SFG     NL  2012  718  649   94  0.327  0.397  0.269         1   
54    SFG     NL  2011  570  578   86  0.303  0.368  0.242         0   
84    SFG     NL  2010  697  583   92  0.321  0.408  0.257         1   
114   SFG     NL  2009  657  611   88  0.309  0.389  0.257         0   
144   SFG     NL  2008  640  759   72  0.321  0.382  0.262         0   
174   SFG     NL  2007  683  720   71  0.322  0.387  0.254         0   
204   SFG     NL  2006  746  790   76  0.324  0.422  0.259         0   
234   SFG     NL  2005  649  745   75  0.319  0.396  0.261         0   
265   SFG     NL  2004  850  770   91  0.357  0.438  0.270         0   
295   SFG     NL  2003  755  638  100  0.338  0.425  0.264         1   
325   SFG     NL  2002  783  616   95  0.344  0.442  0.267         1   
355   SFG     NL  2001  799  748   90  0.342  0.460  0.266         0   
385   SFG     NL  2000  925  747   97  0.362  0.472  0.278      

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
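
The warning above appears because giants_df was created by filtering df, so pandas cannot tell whether the new column should be written to a copy or to the original. One common way to avoid it is to take an explicit copy when filtering. The sketch below uses three rows copied from the outputs above as a stand-in for baseball_stats.csv, and computes the run differential with column arithmetic instead of a loop.

```python
import pandas as pd

# Three rows taken from the outputs above, as a stand-in for the CSV file
df = pd.DataFrame({
    'Team': ['SFG', 'NYY', 'SFG'],
    'Year': [2012, 2012, 2011],
    'RS': [718, 804, 570],
    'RA': [649, 668, 578],
})

# .copy() makes giants_df an independent DataFrame, so adding a column
# to it is unambiguous
giants_df = df[df['Team'] == 'SFG'].copy()

# Column arithmetic computes every row's run differential at once
giants_df['RD'] = giants_df['RS'] - giants_df['RA']
print(giants_df)
```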


In [101]:
%%timeit 
# Create an empty list to store run differentials
run_diffs = []

# Write a for loop and collect runs allowed and runs scored for each row
for i,row in giants_df.iterrows():
    runs_scored = row['RS']
    runs_allowed = row['RA']
    
    # Use the provided function to calculate run_diff for each row
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    # Append each run differential to the output list
    run_diffs.append(run_diff)

giants_df['RD'] = run_diffs
#print(giants_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


26.9 ms ± 4.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Iterating with .itertuples()

Remember, .itertuples() returns each DataFrame row as a special data type called a namedtuple. You can look up an attribute within a namedtuple with a special syntax. Let's practice working with namedtuples.

In [92]:
rangers_df = df[df['Team'] == 'TEX']

rangers_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
27,TEX,AL,2012,808,707,93,0.334,0.446,0.273,1,5.0,5.0,162,0.309,0.408
57,TEX,AL,2011,855,677,96,0.34,0.46,0.283,1,3.0,2.0,162,0.307,0.392
87,TEX,AL,2010,787,687,90,0.338,0.419,0.276,1,7.0,2.0,162,0.319,0.39
117,TEX,AL,2009,784,740,87,0.32,0.445,0.26,0,,,162,0.331,0.416
147,TEX,AL,2008,901,967,79,0.354,0.462,0.283,0,,,162,0.362,0.455


In [93]:
for row_tuple in rangers_df[:3].itertuples():
    print(row_tuple, '\n')

Pandas(Index=27, Team='TEX', League='AL', Year=2012, RS=808, RA=707, W=93, OBP=0.33399999999999996, SLG=0.446, BA=0.273, Playoffs=1, RankSeason=5.0, RankPlayoffs=5.0, G=162, OOBP=0.309, OSLG=0.408) 

Pandas(Index=57, Team='TEX', League='AL', Year=2011, RS=855, RA=677, W=96, OBP=0.34, SLG=0.46, BA=0.28300000000000003, Playoffs=1, RankSeason=3.0, RankPlayoffs=2.0, G=162, OOBP=0.307, OSLG=0.392) 

Pandas(Index=87, Team='TEX', League='AL', Year=2010, RS=787, RA=687, W=90, OBP=0.33799999999999997, SLG=0.419, BA=0.276, Playoffs=1, RankSeason=7.0, RankPlayoffs=2.0, G=162, OOBP=0.319, OSLG=0.39) 



In [94]:
for row in rangers_df[:3].itertuples():
    i = row.Index
    year = row.Year
    wins = row.W
    print(i, year, wins, '\n')

27 2012 93 

57 2011 96 

87 2010 90 
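
To see how a namedtuple behaves independently of pandas, here is a small sketch built with collections.namedtuple. The field subset is chosen for illustration; 'Pandas' is the default type name that .itertuples() uses.

```python
from collections import namedtuple

# Hypothetical three-field row type mirroring part of the output above
Row = namedtuple('Pandas', ['Index', 'Year', 'W'])
row = Row(Index=27, Year=2012, W=93)

print(row.W)          # attribute lookup by field name -> 93
print(row[2])         # positional lookup still works  -> 93
print(row._asdict())  # mapping of field name to value

# A namedtuple unpacks like any other tuple
index, year, wins = row
```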



In [96]:
# Loop over the DataFrame and print each row's Index, Year and Wins (W)
for row in rangers_df.itertuples():
    i = row.Index
    year = row.Year
    wins = row.W
  
    # Check if the Rangers made the Playoffs (1 means yes; 0 means no)
    if row.Playoffs == 1:
        print(i, year, wins, '\n')

27 2012 93 

57 2011 96 

87 2010 90 

418 1999 95 

448 1998 88 

504 1996 90 



#### Run differentials with .itertuples()

The New York Yankees have made a trade with the San Francisco Giants for your analyst contract: you're a hot commodity! Your new boss has seen your work with the Giants and now wants you to do something similar with the Yankees data. He'd like you to calculate run differentials for the Yankees from the year 1962 to the year 2012 and find which season they had the best run differential.

In [97]:
yankees_df =  df[df['Team'] == 'NYY']

yankees_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
18,NYY,AL,2012,804,668,95,0.337,0.453,0.265,1,3.0,3.0,162,0.311,0.419
48,NYY,AL,2011,867,657,97,0.343,0.444,0.263,1,2.0,4.0,162,0.322,0.399
78,NYY,AL,2010,859,693,95,0.35,0.436,0.267,1,3.0,3.0,162,0.322,0.399
108,NYY,AL,2009,915,753,103,0.362,0.478,0.283,1,1.0,1.0,162,0.327,0.408
138,NYY,AL,2008,789,727,89,0.342,0.427,0.271,0,,,162,0.329,0.405


In [100]:
%%timeit 
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA

    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)

# Append new column
yankees_df['RD'] = run_diffs
#print(yankees_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


22.1 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
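
The warning above appears because yankees_df was created by filtering df, so pandas cannot tell whether the new column is being written to an independent object or to a slice of the original. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` of the filtered rows; the same snippet then uses `idxmax()` to pick out the season with the best run differential. The toy data and the body of calc_run_diff() (runs scored minus runs allowed) are assumptions here, since the real function is defined elsewhere in the notebook:

```python
import pandas as pd

def calc_run_diff(runs_scored, runs_allowed):
    # Assumed definition: run differential = runs scored - runs allowed
    return runs_scored - runs_allowed

# Toy data: three Yankees rows copied from the table above, plus one other team
toy_df = pd.DataFrame({
    'Team': ['NYY', 'NYY', 'NYY', 'TEX'],
    'Year': [2012, 2011, 2010, 2012],
    'RS': [804, 867, 859, 808],
    'RA': [668, 657, 693, 707],
})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous
yankees_toy = toy_df[toy_df['Team'] == 'NYY'].copy()

run_diffs = [calc_run_diff(row.RS, row.RA) for row in yankees_toy.itertuples()]
yankees_toy['RD'] = run_diffs

# Row with the largest run differential
best = yankees_toy.loc[yankees_toy['RD'].idxmax()]
print(best['Year'], best['RD'])
```

This only covers the three seasons in the toy data; running the same steps on the full 1962 to 2012 data may pick a different season.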


#### Analyzing baseball stats with .apply()

The Tampa Bay Rays want you to analyze their data.

They'd like the following metrics:

- The sum of each column in the data

- The total amount of runs scored in a year ('RS' + 'RA' for each year)

- The 'Playoffs' column in text format rather than using 1's and 0's

In [102]:
def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No' 

In [107]:
rays_df = df[df['Team'] == 'TBR'][['Year', 'RS', 'RA', 'W', 'Playoffs']]

rays_df.shape

(5, 5)

In [108]:
rays_df.head()

Unnamed: 0,Year,RS,RA,W,Playoffs
26,2012,697,577,90,0
56,2011,707,614,91,1
86,2010,802,649,96,1
116,2009,803,754,84,0
146,2008,774,671,97,1


In [112]:
rays_df.set_index('Year', inplace=True)

In [114]:
rays_df.head()

Year,RS,RA,W,Playoffs
2012,697,577,90,0
2011,707,614,91,1
2010,802,649,96,1
2009,803,754,84,0
2008,774,671,97,1


In [115]:
# Gather sum of all columns
stat_totals = rays_df.apply(sum, axis=0)
print(stat_totals)

RS          3783
RA          3265
W            458
Playoffs       3
dtype: int64


In [116]:
# Gather total runs scored in all games per year
total_runs_scored = rays_df[['RS', 'RA']].apply(sum, axis=1)
print(total_runs_scored)

Year
2012    1274
2011    1321
2010    1451
2009    1557
2008    1445
dtype: int64


In [117]:
# Convert numeric playoffs to text
textual_playoffs = rays_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(textual_playoffs)

Year
2012     No
2011    Yes
2010    Yes
2009     No
2008    Yes
dtype: object
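
The same three metrics can also be computed without .apply(). The sketch below rebuilds the five Rays rows shown above by hand (`rays_toy` is a hypothetical name) and uses `DataFrame.sum()`, plain column addition, and `Series.map()`. The mapping dictionary is an illustrative stand-in for text_playoffs(), not part of the original exercise:

```python
import pandas as pd

# Hand-built copy of the five Rays rows, indexed by Year
rays_toy = pd.DataFrame(
    {'RS': [697, 707, 802, 803, 774],
     'RA': [577, 614, 649, 754, 671],
     'W': [90, 91, 96, 84, 97],
     'Playoffs': [0, 1, 1, 0, 1]},
    index=pd.Index([2012, 2011, 2010, 2009, 2008], name='Year'),
)

# Sum of each column
stat_totals = rays_toy.sum()

# RS + RA for each year, computed on whole columns
total_runs = rays_toy['RS'] + rays_toy['RA']

# 1/0 converted to 'Yes'/'No' with a lookup dictionary
textual_playoffs = rays_toy['Playoffs'].map({1: 'Yes', 0: 'No'})

print(stat_totals)
print(total_runs)
print(textual_playoffs)
```

For simple arithmetic and lookups like these, column-level operations are usually shorter than row-wise .apply() and avoid calling a Python function once per row.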


#### Settle a debate with .apply()

Word has gotten to the Arizona Diamondbacks about your awesome analytics skills. They'd like for you to help settle a debate amongst the managers. One manager claims that the team has made the playoffs every year they have had a win percentage of 0.50 or greater. Another manager says this is not true.

Let's use the function below and the .apply() method to see which manager is correct.

In [119]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

In [120]:
dbacks_df = df[df['Team'] == 'ARI']

dbacks_df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
30,ARI,NL,2011,731,662,94,0.322,0.413,0.25,1,5.0,4.0,162,0.316,0.409
60,ARI,NL,2010,713,836,65,0.325,0.416,0.25,0,,,162,0.34,0.448
90,ARI,NL,2009,720,782,70,0.324,0.418,0.253,0,,,162,0.33,0.419
120,ARI,NL,2008,720,706,82,0.327,0.415,0.251,0,,,162,0.318,0.398


In [121]:
# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

0      0.50
30     0.58
60     0.40
90     0.43
120    0.51
150    0.56
180    0.47
210    0.48
241    0.31
271    0.52
301    0.60
331    0.57
361    0.52
391    0.62
421    0.40
dtype: float64 



In [122]:
# Append a new column to dbacks_df
dbacks_df['WP'] = win_percs
print(dbacks_df, '\n')

    Team League  Year   RS   RA    W    OBP    SLG     BA  Playoffs  \
0    ARI     NL  2012  734  688   81  0.328  0.418  0.259         0   
30   ARI     NL  2011  731  662   94  0.322  0.413  0.250         1   
60   ARI     NL  2010  713  836   65  0.325  0.416  0.250         0   
90   ARI     NL  2009  720  782   70  0.324  0.418  0.253         0   
120  ARI     NL  2008  720  706   82  0.327  0.415  0.251         0   
150  ARI     NL  2007  712  732   90  0.321  0.413  0.250         1   
180  ARI     NL  2006  773  788   76  0.331  0.424  0.267         0   
210  ARI     NL  2005  696  856   77  0.332  0.421  0.256         0   
241  ARI     NL  2004  615  899   51  0.310  0.393  0.253         0   
271  ARI     NL  2003  717  685   84  0.330  0.417  0.263         0   
301  ARI     NL  2002  819  674   98  0.346  0.423  0.267         1   
331  ARI     NL  2001  818  677   92  0.341  0.442  0.267         1   
361  ARI     NL  2000  792  754   85  0.333  0.429  0.265         0   
391  A

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [123]:
print(dbacks_df[dbacks_df['WP'] >= 0.50])

    Team League  Year   RS   RA    W    OBP    SLG     BA  Playoffs  \
0    ARI     NL  2012  734  688   81  0.328  0.418  0.259         0   
30   ARI     NL  2011  731  662   94  0.322  0.413  0.250         1   
120  ARI     NL  2008  720  706   82  0.327  0.415  0.251         0   
150  ARI     NL  2007  712  732   90  0.321  0.413  0.250         1   
271  ARI     NL  2003  717  685   84  0.330  0.417  0.263         0   
301  ARI     NL  2002  819  674   98  0.346  0.423  0.267         1   
331  ARI     NL  2001  818  677   92  0.341  0.442  0.267         1   
361  ARI     NL  2000  792  754   85  0.333  0.429  0.265         0   
391  ARI     NL  1999  908  676  100  0.347  0.459  0.277         1   

     RankSeason  RankPlayoffs    G   OOBP   OSLG    WP  
0           NaN           NaN  162  0.317  0.415  0.50  
30          5.0           4.0  162  0.316  0.409  0.58  
120         NaN           NaN  162  0.318  0.398  0.51  
150         3.0           3.0  162  0.334  0.420  0.56  
271 
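
One way to turn the filtered table into a yes/no answer is to test whether every season at or above 0.50 has Playoffs equal to 1. This is a minimal sketch on the five most recent Diamondbacks rows printed above (`dbacks_toy` is a hypothetical name, and the full data contains more seasons):

```python
import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Five most recent ARI seasons, copied from the output above
dbacks_toy = pd.DataFrame({
    'Year': [2012, 2011, 2010, 2009, 2008],
    'W': [81, 94, 65, 70, 82],
    'G': [162] * 5,
    'Playoffs': [0, 1, 0, 0, 0],
})
dbacks_toy['WP'] = calc_win_perc(dbacks_toy['W'].values, dbacks_toy['G'].values)

# Seasons with a win percentage of 0.50 or greater
winning = dbacks_toy[dbacks_toy['WP'] >= 0.50]

# True only if every one of those seasons ended in a playoff appearance
always_playoffs = bool((winning['Playoffs'] == 1).all())

print(winning[['Year', 'WP', 'Playoffs']])
print(always_playoffs)
```

In these rows, 2012 and 2008 are seasons at or above 0.50 without a playoff appearance, so the first manager's claim does not hold.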

#### Replacing .iloc with underlying arrays

Now that you have a better grasp on a DataFrame's internals, let's update one of your previous analyses to leverage a DataFrame's underlying arrays. You'll revisit the win percentage calculations you performed row by row with the .iloc indexer:

In [125]:
%%timeit -r5 -n10
win_percs_list = []

for i in range(len(df)):
    row = df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_percs_list.append(win_perc)

df['WP'] = win_percs_list

320 ms ± 36.4 ms per loop (mean ± std. dev. of 5 runs, 10 loops each)


In [127]:
%%timeit -r5 -n10
# Use the W array and G array to calculate win percentages
win_percs_np = calc_win_perc(df['W'].values, df['G'].values)

# Append a new column to df that stores all win percentages
df['WP'] = win_percs_np


311 µs ± 75.8 µs per loop (mean ± std. dev. of 5 runs, 10 loops each)
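
The gap between the two timings comes from calc_win_perc() receiving whole NumPy arrays, so the division and rounding run once over all rows instead of once per row. A small sketch of that behavior on toy values (`toy_df` is a hypothetical name; newer pandas versions also offer `.to_numpy()` as an alternative to `.values`):

```python
import numpy as np
import pandas as pd

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Toy data: wins and games for three 2012 seasons
toy_df = pd.DataFrame({'W': [81, 94, 93], 'G': [162, 162, 162]})

# .values exposes each column's underlying NumPy array
w_arr = toy_df['W'].values
g_arr = toy_df['G'].values
print(type(w_arr))

# One call handles every row at once
win_percs_np = calc_win_perc(w_arr, g_arr)

# Calling the function once per element gives the same values, just more slowly
win_percs_loop = [calc_win_perc(w, g) for w, g in zip(w_arr, g_arr)]

print(win_percs_np)
print(win_percs_loop)
```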


#### Bringing it all together: Predict win percentage

You'd like to attempt to predict a team's win percentage for a given season by using the team's total runs scored in a season ('RS') and total runs allowed in a season ('RA') with the following function:

In [128]:
def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)
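
As a quick check of the formula, here is a worked example using the ARI 2012 values that appear in this notebook's data (RS = 734, RA = 688). The function body is repeated so the snippet runs on its own:

```python
import numpy as np

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

rs, ra = 734, 688

# 734**2 = 538756 and 688**2 = 473344, so the ratio is 538756 / 1012100
pred = predict_win_perc(rs, ra)
print(pred)
```

The ratio is about 0.532, which rounds to 0.53.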

Use a for loop and **.itertuples()** to predict the win percentage for each row of df with the predict_win_perc() function. Save each row's predicted win percentage as win_perc_pred and append each to the win_perc_preds_loop list.

In [129]:
%%timeit 
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

22.9 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Apply predict_win_perc() to each row of the df DataFrame using **a lambda function**. Save the predicted win percentage as win_perc_preds_apply.

In [132]:
%%timeit 
win_perc_preds_apply = df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

42 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Calculate the predicted win percentages by passing the underlying 'RS' and 'RA' **numpy arrays** from df into predict_win_perc(). Save these predictions as win_perc_preds_np

In [134]:
%%timeit 
win_perc_preds_np = predict_win_perc(df['RS'].values, df['RA'].values)
df['WP_preds'] = win_perc_preds_np

246 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
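
The array-based version is roughly two orders of magnitude faster than the two row-wise versions in the timings above, but all three approaches should return the same numbers. A minimal sketch that checks this on a few rows (`toy_df` is a hypothetical name, with RS and RA values copied from three 2012 seasons):

```python
import numpy as np
import pandas as pd

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

# Toy data: RS and RA for ARI, ATL and BAL in 2012
toy_df = pd.DataFrame({'RS': [734, 700, 712], 'RA': [688, 600, 705]})

# 1) Loop with .itertuples()
preds_loop = [predict_win_perc(row.RS, row.RA) for row in toy_df.itertuples()]

# 2) Row-wise .apply() with a lambda
preds_apply = toy_df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

# 3) Underlying NumPy arrays
preds_np = predict_win_perc(toy_df['RS'].values, toy_df['RA'].values)

print(np.allclose(preds_loop, preds_np))
print(np.allclose(preds_apply.values, preds_np))
```

A check like this is a cheap safeguard when replacing a loop with a vectorized call, since the faster version is only useful if it gives the same results.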


In [135]:
df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,WP,WP_preds
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415,0.5,0.53
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378,0.58,0.58
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403,0.57,0.5
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428,0.43,0.45
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424,0.38,0.39
