# 1. Combining objects

##### Say we want to combine the following two lists so that each HP is paired with the corresponding Pokémon name.

In [1]:
names = ['Bulbasaur', 'Charmander', 'Squirtle']
hps = [45, 39, 44]

#### a) Inefficient way

In [2]:
def combine_ineff(names, hps):
    combined = []
    for i, pokemon in enumerate(names):
        combined.append((pokemon, hps[i]))
    return combined

In [3]:
combine_ineff(names, hps)

[('Bulbasaur', 45), ('Charmander', 39), ('Squirtle', 44)]

#### b) Efficient way - using zip()

In [4]:
def combine_eff(names, hps):
    combined_zip = zip(names, hps)
    return [*combined_zip]

In [5]:
combine_eff(names, hps)

[('Bulbasaur', 45), ('Charmander', 39), ('Squirtle', 44)]

#### Speed comparison between (a) and (b)

In [6]:
%load_ext line_profiler

In [7]:
%lprun -f combine_ineff combine_ineff(names, hps)

Timer unit: 1e-09 s

Total time: 4.142e-06 s
File: <ipython-input-2-0c62f2f2295d>
Function: combine_ineff at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def combine_ineff(pokemon, hps):
     2         1        818.0    818.0     19.7      combined = []
     3         3       1855.0    618.3     44.8      for i, pokemon in enumerate(names):
     4         3       1284.0    428.0     31.0          combined.append((pokemon, hps[i]))
     5         1        185.0    185.0      4.5      return combined

In [8]:
%lprun -f combine_eff combine_eff(names, hps)

Timer unit: 1e-09 s

Total time: 2.447e-06 s
File: <ipython-input-4-ecca5421458e>
Function: combine_eff at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def combine_eff(pokemon, hps):
     2         1       1354.0   1354.0     55.3      combined_zip = zip(names, hps)
     3         1       1093.0   1093.0     44.7      return [*combined_zip]

## The collections module

<ol>
    <li>Part of Python's Standard library</li>
    <li>Specialized container datatypes</li>
    <ul>
        <li>Alternatives to general purpose dict, list, set and tuple.</li>
        <li>Notable:</li>
        <ul>
            <li>namedtuple: tuple subclasses with named fields.</li>
            <li>deque: list-like container with fast appends and pops.</li>
            <li>Counter: dict for counting hashable objects.</li>
            <li>OrderedDict: dict that retains order of entries.</li>
            <li>defaultdict: dict that calls a factory function to supply missing values.</li>
        </ul>
    </ul>
</ol>
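
As a rough illustration of the containers listed above, here is a minimal sketch. The sample data and variable names (`sample_types`, `names_by_type`, `Pokemon`, `queue`) are made up for this example and are not part of the original notebook.

```python
from collections import Counter, defaultdict, deque, namedtuple

sample_types = ['Grass', 'Fire', 'Water', 'Fire', 'Grass', 'Fire']

# Counter: tally hashable items in one pass.
type_counts = Counter(sample_types)
print(type_counts.most_common(1))

# defaultdict: list() supplies the value the first time a key is seen.
names_by_type = defaultdict(list)
for name, poke_type in [('Bulbasaur', 'Grass'), ('Charmander', 'Fire'), ('Vulpix', 'Fire')]:
    names_by_type[poke_type].append(name)
print(dict(names_by_type))

# namedtuple: a tuple subclass whose fields are also reachable by name.
Pokemon = namedtuple('Pokemon', ['name', 'hp'])
bulbasaur = Pokemon('Bulbasaur', 45)
print(bulbasaur.hp, bulbasaur[1])

# deque: fast appends and pops at both ends.
queue = deque(['Charmander', 'Squirtle'])
queue.appendleft('Bulbasaur')
first = queue.popleft()
print(first, list(queue))
```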

## The itertools module

<ol>
    <li>Part of Python's Standard library</li>
    <li>Functional tools for creating and using iterators.</li>
    <ul>
        <li>Notable:</li>
        <ul>
            <li>Infinite iterators: count, cycle, repeat.</li>
            <li>Finite iterators: accumulate, chain, zip_longest, etc.</li>
            <li>Combination generators: product, permutations, combinations.</li>
        </ul>
    </ul>
</ol>
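
To make the three groups above a little more concrete, the following sketch calls one or two functions from each. The inputs are illustrative values chosen for this example, not data from the notebook.

```python
from itertools import (accumulate, chain, count, islice,
                       permutations, product, zip_longest)

# Infinite iterator: count() never stops, so islice() takes a finite slice.
first_ids = list(islice(count(start=1), 3))

# Finite iterators.
running_hp = list(accumulate([45, 39, 44]))          # running totals
all_names = list(chain(['Bulbasaur'], ['Charmander', 'Squirtle']))
padded = list(zip_longest(['Bulbasaur', 'Charmander'], [45], fillvalue=None))

# Combination generators.
pairs = list(product(['Fire', 'Water'], repeat=2))             # ordered, repeats allowed
orderings = list(permutations(['Fire', 'Water', 'Grass'], 2))  # ordered, no repeats

print(first_ids, running_hp, all_names, padded, sep='\n')
print(len(pairs), len(orderings))
```

Unlike `zip()`, `zip_longest()` keeps going until the longest input is exhausted and fills the gaps with `fillvalue`.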

#### Suppose we want to gather all combination pairs of pokemon types possible.

In [9]:
poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']

def combos_ineff(poke_types):
    combos = []
    for x in poke_types:
        for y in poke_types:
            if x==y:
                continue
            if ((x, y) not in combos) and ((y, x) not in combos):
                combos.append((x, y))
                
    return combos

combos_ineff(poke_types)

[('Bug', 'Fire'),
 ('Bug', 'Ghost'),
 ('Bug', 'Grass'),
 ('Bug', 'Water'),
 ('Fire', 'Ghost'),
 ('Fire', 'Grass'),
 ('Fire', 'Water'),
 ('Ghost', 'Grass'),
 ('Ghost', 'Water'),
 ('Grass', 'Water')]

In [10]:
from itertools import combinations

def combos_eff(poke_types):
    combos_obj = combinations(poke_types, 2)
    return [*combos_obj]

combos_eff(poke_types)

[('Bug', 'Fire'),
 ('Bug', 'Ghost'),
 ('Bug', 'Grass'),
 ('Bug', 'Water'),
 ('Fire', 'Ghost'),
 ('Fire', 'Grass'),
 ('Fire', 'Water'),
 ('Ghost', 'Grass'),
 ('Ghost', 'Water'),
 ('Grass', 'Water')]

In [11]:
%lprun -f combos_ineff combos_ineff(poke_types)

Timer unit: 1e-09 s

Total time: 1.9025e-05 s
File: <ipython-input-9-61ee155c591e>
Function: combos_ineff at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
     3                                           def combos_ineff(poke_types):
     4         1        575.0    575.0      3.0      combos = []
     5         5       1031.0    206.2      5.4      for x in poke_types:
     6        25       4821.0    192.8     25.3          for y in poke_types:
     7        20       3254.0    162.7     17.1              if x==y:
     8         5        668.0    133.6      3.5                  continue
     9        10       5700.0    570.0     30.0              if ((x, y) not in combos) & ((y, x) not in combos):
    10        10       2815.0    281.5     14.8                  combos.append((x, y))
    11                                                           
    12         1        161.0    161.0      0.8      return combos

In [12]:
%lprun -f combos_eff combos_eff(poke_types)

Timer unit: 1e-09 s

Total time: 2.243e-06 s
File: <ipython-input-10-cfbc91d29b49>
Function: combos_eff at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
     3                                           def combos_eff(poke_types):
     4         1       1172.0   1172.0     52.3      combos_obj = combinations(poke_types, 2)
     5         1       1071.0   1071.0     47.7      return [*combos_obj]

---

# 2. Comparing objects

##### Often, we'd like to compare two objects to observe similarities and differences between their contents. For this type of comparison, it's best to leverage a branch of mathematics called set theory. Python comes with a built-in set data type, and sets come with some handy methods we can use for comparing. Another nice feature of Python sets is that they can quickly check whether a value exists among their members. We call this membership testing. Below, we show that using the in operator with a set is much faster than using it with a list or tuple.
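
The comparison methods mentioned above are not exercised elsewhere in this notebook, so here is a small sketch of the most common ones. The two name lists are hypothetical sample data.

```python
list_a = ['Bulbasaur', 'Charmander', 'Squirtle']
list_b = ['Caterpie', 'Pidgey', 'Squirtle']
set_a, set_b = set(list_a), set(list_b)

in_both = set_a.intersection(set_b)                 # members of both sets
only_in_a = set_a.difference(set_b)                 # in a but not in b
in_exactly_one = set_a.symmetric_difference(set_b)  # in one set but not both
in_either = set_a.union(set_b)                      # all unique members

print(in_both)
print(only_in_a)
print(len(in_exactly_one), len(in_either))
```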

In [13]:
import random

random_nos_500_lis = [random.randint(99, 1000) for _ in range(500)]
random_nos_500_lis.append(1000)
random_nos_500_set = set(random_nos_500_lis)
random_nos_500_tup = tuple(random_nos_500_lis)

In [14]:
tm_lis = %timeit -n50 -o 1000 in random_nos_500_lis
tm_set = %timeit -n50 -o 1000 in random_nos_500_set
tm_tup = %timeit -n50 -o 1000 in random_nos_500_tup

4.51 µs ± 43.7 ns per loop (mean ± std. dev. of 7 runs, 50 loops each)
46.2 ns ± 8.42 ns per loop (mean ± std. dev. of 7 runs, 50 loops each)
4.5 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 50 loops each)


In [15]:
times = {
    'list': tm_lis.average,
    'set': tm_set.average,
    'tuple': tm_tup.average
}

min(times, key=times.get)

'set'

In [16]:
print(f"Faster than list by {tm_lis.average/tm_set.average}x")
print(f"Faster than tuple by {tm_tup.average/tm_set.average}x")

Faster than list by 97.53870178468625x
Faster than tuple by 97.34422855497986x


---

# 3. Eliminating loops

#### Say we have a nested list of numbers and we wanted to sum each row of the list.

In [17]:
nums_list = [
    [90, 92, 75, 60],
    [25, 20, 15, 90],
    [65, 130, 60, 75],
    [11, 0, 33, 52]
]

#### Using a for loop

In [18]:
%%timeit -r5 -n100000
totals = []
for row in nums_list:
    totals.append(sum(row))

485 ns ± 1.08 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)


#### Using a list comprehension

In [19]:
%timeit -r5 -n100000 [sum(row) for row in nums_list]

440 ns ± 1.46 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)


#### Using map 

In [20]:
%timeit -r5 -n100000 [*map(sum, nums_list)]

411 ns ± 2.56 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)


#### Using NumPy

In [21]:
import numpy as np
nums_list_ar = np.array(nums_list)
nums_list_ar

array([[ 90,  92,  75,  60],
       [ 25,  20,  15,  90],
       [ 65, 130,  60,  75],
       [ 11,   0,  33,  52]])

In [22]:
%%timeit -r5 -n100000
nums_list_ar.sum(axis=1)

1.49 µs ± 10.4 ns per loop (mean ± std. dev. of 5 runs, 100000 loops each)


In [23]:
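# NumPy row sums (1.49 µs) vs. map (411 ns), using the timings shown above:
# on an input this small, NumPy's per-call overhead makes it the slower option
1.49e-6 > 411e-9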

True

# 4. Writing better loops

<ul>
    <li>Understand what is being done with each loop iteration.</li>
    <li>Move one-time calculations outside (above) the loop.</li>
    <li>Use holistic conversions outside (below) the loop.</li>
    <li>Anything that is done once should be outside the loop.</li>
</ul>

<pre>
Q) We have a list of Pokémon names and an array of each Pokémon's corresponding attack value. We'd like to print the names of each Pokémon with an attack value greater than the average of all attack values. 
</pre>

In [24]:
import numpy as np

names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']
attacks = np.array([130, 70, 50, 50, 45])

In [25]:
%%timeit
for pokemon, attack in zip(names, attacks):
    total_attack_avg = attacks.mean()
    if attack>total_attack_avg:
        pass
#         print(f"{pokemon}'s attack: {attack} > average: {total_attack_avg}!")

32.5 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


#### Improvement

In [26]:
%%timeit
total_attack_avg = attacks.mean()
for pokemon, attack in zip(names, attacks):
    if attack>total_attack_avg:
        pass

13.3 µs ± 72.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


<pre>
Q) We have three lists from our dataset of 720 Pokémon: a list of each Pokémon's name, a list corresponding to whether or not a Pokémon has a legendary status, and a list of each Pokémon's generation. We want to combine these objects so that each name, status, and generation is stored in an individual list.
</pre>

In [27]:
names = ['Pikachu', 'Squirtle', 'Articuno']
legend_status = [False, False, True]
generations = [1, 1, 1]

poke_data = []
for poke_tuple in zip(names, legend_status, generations):
    poke_list = list(poke_tuple)
    poke_data.append(poke_list)
    
print(poke_data)

[['Pikachu', False, 1], ['Squirtle', False, 1], ['Articuno', True, 1]]


#### Improvement: Not converting each tuple to a list within the loop

In [28]:
poke_data_tuple = []
for poke_tuple in zip(names, legend_status, generations):
    poke_data_tuple.append(poke_tuple)
    
print(f"Before: {poke_data_tuple}")
poke_data = [*map(list, poke_data_tuple)]
print(f"After: {poke_data}")

Before: [('Pikachu', False, 1), ('Squirtle', False, 1), ('Articuno', True, 1)]
After: [['Pikachu', False, 1], ['Squirtle', False, 1], ['Articuno', True, 1]]


## Pandas dataframe iteration

In [49]:
# Creating a sample dataframe

df_dict = {
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC', 'IND', 'TEX', 'XYZ', 'ABC'],
    'League': ['NL', 'NL', 'AL', 'AL', 'NL', 'AL', 'NL', 'AL', 'NL'],
    'Year': [2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012],
    'RS': [734, 700, 712, 734, 613, 650, 699, 611, 655],
    'RA': [688, 600, 705, 806, 759, 710, 700, 699, 799],
    # number of wins in a season
    'W': [81, 94, 93, 69, 61, 100, 50, 79, 59],
    # number of games a team played in a season
    'G': [162, 162, 162, 162, 162, 162, 162, 162, 162],
    'Playoffs': [0, 1, 1, 0, 0, 1, 0, 1, 0]
}

In [50]:
import pandas as pd

df = pd.DataFrame(df_dict)
df

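|   | Team | League | Year | RS | RA | W | G | Playoffs |
|---|------|--------|------|----|----|---|---|----------|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 162 | 0 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 162 | 1 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 162 | 1 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 162 | 0 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 162 | 0 |
| 5 | IND | AL | 2012 | 650 | 710 | 100 | 162 | 1 |
| 6 | TEX | NL | 2012 | 699 | 700 | 50 | 162 | 0 |
| 7 | XYZ | AL | 2012 | 611 | 699 | 79 | 162 | 1 |
| 8 | ABC | NL | 2012 | 655 | 799 | 59 | 162 | 0 |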


### Calculate Win percentage.

In [51]:
import numpy as np

def calc_win_per(wins, games_played):
    
    win_perc = wins/games_played
    return np.round(win_perc, 2)

In [52]:
calc_win_per(50, 100)

0.5

#### Approach - 1 

In [53]:
%%timeit
# Adding win % to the dataframe

win_percent_list = []
for i in range(len(df)):
    row = df.iloc[i]
    wins = row['W']
    games_played = row['G']
    
    win_perc = calc_win_per(wins, games_played)
    win_percent_list.append(win_perc)
    
df['WP'] = win_percent_list

899 µs ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [54]:
df

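|   | Team | League | Year | RS | RA | W | G | Playoffs | WP |
|---|------|--------|------|----|----|---|---|----------|----|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 162 | 0 | 0.5 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 162 | 1 | 0.58 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 162 | 1 | 0.57 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 162 | 0 | 0.43 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 162 | 0 | 0.38 |
| 5 | IND | AL | 2012 | 650 | 710 | 100 | 162 | 1 | 0.62 |
| 6 | TEX | NL | 2012 | 699 | 700 | 50 | 162 | 0 | 0.31 |
| 7 | XYZ | AL | 2012 | 611 | 699 | 79 | 162 | 1 | 0.49 |
| 8 | ABC | NL | 2012 | 655 | 799 | 59 | 162 | 0 | 0.36 |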


In [55]:
df.drop(columns=['WP'], inplace=True)

#### Approach - 2

In [56]:
%%timeit
win_percent_list = []
# iterrows yields each DataFrame row as an (index, Series) pair
for i, row in df.iterrows():
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_per(wins, games_played)
    win_percent_list.append(win_perc)

df['WP'] = win_percent_list

510 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [57]:
df.drop(columns=['WP'], inplace=True)

### Iterating using .itertuples()

##### The .itertuples() method returns each DataFrame row as a special data type called a namedtuple. A namedtuple is one of the specialized data types in the collections module discussed previously. It behaves just like a Python tuple but has fields accessible using attribute lookup.

In [58]:
import random

df_dict = {
    'A': [random.random() for _ in range(1000)],
    'B': [random.random() for _ in range(1000)]
}

df_r = pd.DataFrame(df_dict)

In [59]:
%%timeit

loop_count = 0
for row_tuple in df_r.iterrows():
    loop_count += 1

17.6 ms ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [60]:
%%timeit

loop_count = 0
for row_tuple in df_r.itertuples():
    loop_count += 1

562 µs ± 408 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


##### .itertuples() is more efficient than .iterrows() because of the way each method stores its output. Since .iterrows() returns each row's values as a pandas Series, there is a bit more overhead.

##### Also, comparisons like these are only meaningful when the dataset has a substantial number of rows. For smaller datasets (fewer than about 100 rows), the results may not be as expected.

One more quick note about the differences between these methods. With .iterrows(), each row is a pandas Series, so we can use square brackets to reference a column, for example row['Team']. The same syntax on a row from .itertuples() raises a TypeError, because a namedtuple accepts only integer positions (or slices) in square brackets, not column names. To look up a column in a namedtuple we use dot notation, for example row.Team. So any time we use .itertuples(), we refer to columns with a dot.
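
The difference in access syntax can be checked with a short sketch. The two-row `teams` DataFrame is a hypothetical stand-in built for this example.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['ARI', 'ATL'], 'W': [81, 94]})

# .iterrows() yields (index, Series) pairs; a Series accepts column labels in brackets.
from_iterrows = [row['Team'] for _, row in teams.iterrows()]

# .itertuples() yields namedtuples; columns are attributes, positions are integers.
from_itertuples = [row.Team for row in teams.itertuples()]
first_row = next(teams.itertuples())
print(first_row.Index, first_row[1])  # index label, then the Team value by position

# A column name in square brackets fails on a namedtuple.
try:
    first_row['Team']
    bracket_error = None
except TypeError as exc:
    bracket_error = type(exc).__name__
print(bracket_error)
```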

### Pandas alternative to looping

In [61]:
df

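|   | Team | League | Year | RS | RA | W | G | Playoffs |
|---|------|--------|------|----|----|---|---|----------|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 162 | 0 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 162 | 1 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 162 | 1 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 162 | 0 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 162 | 0 |
| 5 | IND | AL | 2012 | 650 | 710 | 100 | 162 | 1 |
| 6 | TEX | NL | 2012 | 699 | 700 | 50 | 162 | 0 |
| 7 | XYZ | AL | 2012 | 611 | 699 | 79 | 162 | 1 |
| 8 | ABC | NL | 2012 | 655 | 799 | 59 | 162 | 0 |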


### Calculate run differentials (with loop)

In [62]:
def calc_run_diff(runs_sc, runs_al):
    return runs_sc - runs_al

In [63]:
%%timeit
run_diffs_iterrows = []

for i, row in df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)
    
df['RD'] = run_diffs_iterrows

546 µs ± 4.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [64]:
df.drop(columns='RD', inplace=True)

#### Using .apply()

- Takes a function and applies it to a dataframe (like map)
- Since we are working with tabular data, we must specify an axis (axis=0 applies the function to each column, axis=1 to each row)
- Can be used with anonymous functions.
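
To see what the axis argument changes, here is a minimal sketch on a hypothetical three-row `runs` DataFrame (a cut-down copy of the RS and RA columns, created only for this example).

```python
import pandas as pd

runs = pd.DataFrame({'RS': [734, 700, 712], 'RA': [688, 600, 705]})

# axis=0: the function receives each column as a Series.
column_totals = runs.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series.
run_diffs = runs.apply(lambda row: row['RS'] - row['RA'], axis=1)

print(column_totals)
print(run_diffs)
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row.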

In [65]:
%%timeit

run_diffs_apply = df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']), axis=1
)
df['RD'] = run_diffs_apply

684 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


#### As noted above, efficiency comparisons are unreliable when the DataFrame has only a few rows, which is why .apply() is not faster here. On larger DataFrames, .apply() is generally more efficient than looping with .iterrows().

### Optimal pandas iterating
- Eliminating loops applies to using pandas as well because pandas is a library built on NumPy.
- NumPy arrays support broadcasting, which lets NumPy vectorize operations so that they are performed on all elements of an object at once. This allows us to efficiently perform calculations over entire arrays. Just like NumPy, pandas is designed to vectorize calculations so that they operate on entire datasets at once (not just on a row-by-row basis).
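
A small sketch of vectorization and broadcasting in plain NumPy follows. The arrays reuse a few values from the sample DataFrame, and `scores` is made-up data for this example.

```python
import numpy as np

runs_scored = np.array([734, 700, 712])
runs_allowed = np.array([688, 600, 705])

# Vectorized: one subtraction over whole arrays, no Python-level loop.
run_diffs = runs_scored - runs_allowed

# Broadcasting: the scalar 162 is stretched to match the array's shape.
wins = np.array([81, 94, 93])
win_pct = np.round(wins / 162, 2)

# Broadcasting across dimensions: a (2, 3) array minus a (3,) array of column means.
scores = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
centered = scores - scores.mean(axis=0)

print(run_diffs, win_pct, centered, sep='\n')
```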

In [67]:
df.values

array([['ARI', 'NL', 2012, 734, 688, 81, 162, 0, 46],
       ['ATL', 'NL', 2012, 700, 600, 94, 162, 1, 100],
       ['BAL', 'AL', 2012, 712, 705, 93, 162, 1, 7],
       ['BOS', 'AL', 2012, 734, 806, 69, 162, 0, -72],
       ['CHC', 'NL', 2012, 613, 759, 61, 162, 0, -146],
       ['IND', 'AL', 2012, 650, 710, 100, 162, 1, -60],
       ['TEX', 'NL', 2012, 699, 700, 50, 162, 0, -1],
       ['XYZ', 'AL', 2012, 611, 699, 79, 162, 1, -88],
       ['ABC', 'NL', 2012, 655, 799, 59, 162, 0, -144]], dtype=object)

#### Power of vectorization

In [70]:
df.drop(columns='RD', inplace=True)

In [71]:
%%timeit
run_diffs_np = df['RS'] - df['RA']
df['RD'] = run_diffs_np

118 µs ± 2.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


---