### List of Lists to DataFrame

In [1]:
import pandas as pd
product_price_list=[['A',12, 3,6],['B',14,6,9],['C',28,7,3],['D',40,9,12]]
df=pd.DataFrame(product_price_list,columns=['product','prices','unit item','sales change(%)'])

In [2]:
df

Unnamed: 0,product,prices,unit item,sales change(%)
0,A,12,3,6
1,B,14,6,9
2,C,28,7,3
3,D,40,9,12


### Dictionary to DataFrame

In [3]:
product_price_dicts={'A':12,'B':14,'C':28,'D':40}
df_dict=pd.DataFrame(list(product_price_dicts.items()),columns=['product','price'])
df_dict

Unnamed: 0,product,price
0,A,12
1,B,14
2,C,28
3,D,40


In [4]:
list(product_price_dicts.items())

[('A', 12), ('B', 14), ('C', 28), ('D', 40)]

This is a convenient way to see that pandas can create a DataFrame from a list of tuples.

In [5]:
import numpy as np
def split_and_stack(df, new_names):
    """Split a DataFrame into two halves and then stack them vertically, returning a new DataFrame with new_names as the column name.
    Args:
    df (DataFrame): The DataFrame to split.
    new_names (iterable of str): The column names for the new DataFrame.
    
    Returns:
    DataFrame
    """
    half=int(len(df.columns)/2)
    left=df.iloc[:, :half]
    right=df.iloc[:,half:]
    return pd.DataFrame(
    data=np.vstack([left.values, right.values]),
    columns=new_names
    )

#### Google Style docstrings
- Description of what the function does.
   - Use imperative language. For instance: "Split the data frame and...".
- Description of the arguments and their types.
   - If an argument has a default value, mark it as "optional" when describing the type.
- Returns: Description of the return value and its type.
- Finally, if your function intentionally raises any errors, add a "Raises" section.
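
As a minimal sketch of the points above, here is a small function documented in Google style. The function name `scale_prices` and its parameters are hypothetical and chosen only for illustration.

```python
def scale_prices(prices, factor=2):
    """Multiply every price by a constant factor.

    Args:
      prices (list of float): The prices to scale.
      factor (float, optional): The multiplier applied to each price.
        Defaults to 2.

    Returns:
      list of float: The scaled prices, in the original order.

    Raises:
      ValueError: If factor is negative.
    """
    if factor < 0:
        raise ValueError("factor must be non-negative")
    return [price * factor for price in prices]


print(scale_prices([12, 14, 28]))
```

The description is imperative, the argument with a default is marked "optional", and the intentional error has its own "Raises" entry.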

#### Numpydoc

In [6]:
def function(arg1,arg2=42):
    """
    Description of what the function does.
    
    Parameters
    ----------
    arg1 : expected type of arg1
      Description of arg1.
    arg2 : int, optional
      Write optional when an argument has a default value.
      Default=42
     
    Returns
    -------
    The type of the return value.
      Can include a description of the return value.
      Replace "Returns" with "Yields" if this function is a generator.
    """

In [7]:
def the_answer():
    """Return the answer to life, 
    the universe, and everything.
    
    
    Returns:
      int
    """
    return 42
print(the_answer.__doc__)

Return the answer to life, 
    the universe, and everything.
    
    
    Returns:
      int
    


Every function in Python comes with a __doc__ attribute that holds its docstring.

In [8]:
import inspect
print(inspect.getdoc(the_answer))

Return the answer to life, 
the universe, and everything.


Returns:
  int


In [9]:
new_names=['1','2']
split_and_stack(df,new_names)

Unnamed: 0,1,2
0,A,12
1,B,14
2,C,28
3,D,40
4,3,6
5,6,9
6,7,3
7,9,12


- df.iloc[:,:0]: take all rows and every column before column 0 (that is, no columns)
- df.iloc[:,0:]: take all rows, and column 0 plus every column after it (that is, all columns)
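
The column slices can be checked on a small frame. This is a sketch under assumptions: `demo` is a hypothetical two-row stand-in for the product table, and `half` mirrors the midpoint used in split_and_stack.

```python
import pandas as pd

# Hypothetical four-column frame, mirroring the product table
demo = pd.DataFrame(
    [["A", 12, 3, 6], ["B", 14, 6, 9]],
    columns=["product", "prices", "unit item", "sales change(%)"],
)

half = len(demo.columns) // 2

left = demo.iloc[:, :half]    # all rows, columns before position `half`
right = demo.iloc[:, half:]   # all rows, column `half` and everything after
empty = demo.iloc[:, :0]      # all rows, but no columns at all

print(list(left.columns))
print(list(right.columns))
print(empty.shape)
```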

In [10]:
df.iloc[:,2:]

Unnamed: 0,unit item,sales change(%)
0,3,6
1,6,9
2,7,3
3,9,12


#### DRY and "Do One Thing"
DRY ("don't repeat yourself") and the "Do One Thing" principle are good ways to ensure that your functions are well designed and easy to test.
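
A rough illustration of both principles, using hypothetical helpers (`load_scores` and `mean` are not from the notebook): each function does exactly one thing, and both are reused for every input instead of being copy-pasted.

```python
def load_scores(raw):
    """Convert a comma-separated string of numbers to a list of floats."""
    return [float(item) for item in raw.split(",")]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# The same two helpers handle every input, so the parsing and averaging
# logic lives in one place and can be tested separately.
for raw in ["1,2,3", "10,20,30,40"]:
    print(mean(load_scores(raw)))
```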

# WRITING EFFICIENT PYTHON CODE

Writing efficient Python code requires two main things:
- Minimal completion time
- Minimal memory overhead

In [12]:
# Non-Pythonic
numbers=[1,2,3]
doubled_numbers=[]
for i in range(len(numbers)):
    doubled_numbers.append(numbers[i]*2)

Pythonic code tends to be less verbose and easier to interpret.

In [13]:
doubled_numbers

[2, 4, 6]

In [15]:
# Pythonic (efficient code)
doubled_numbers=[x*2 for x in numbers]

In [16]:
len(numbers)

3

In [17]:
names=['Jerry', 'Kramer', 'Elaine', 'George', 'Newman']

In [19]:
name_list=[]
for name in names:
    if len(name)>=6:
        name_list.append(name)
print(name_list)        

['Kramer', 'Elaine', 'George', 'Newman']


In [49]:
name_list=[name for name in names if len(name)>=6]

In [50]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## Built-in Components
Built-in components are part of the Python Standard Library, which ships with every standard Python installation.
* Built-in types
  - list, tuple, set, dict, and others.
* Built-in functions
  - print(), len(), range(), enumerate(), map(), zip(), and others.
* Built-in modules
  - os, sys, itertools, collections, math, and others.

Note: We should default to using a built-in solution (if one exists) rather than developing our own.

In [53]:
list(range(4,10))

[4, 5, 6, 7, 8, 9]

In [54]:
list(range(4,400,40))

[4, 44, 84, 124, 164, 204, 244, 284, 324, 364]

In [55]:
alfabe=['a','b','c','d','e','f','g','h','i','ı']
indexed_letters=enumerate(alfabe)
indexed_letters_list=list(indexed_letters)
print(indexed_letters_list)

[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g'), (7, 'h'), (8, 'i'), (9, 'ı')]


In [56]:
alfabe=['a','b','c','d','e','f','g','h','i','ı']
indexed_letters=enumerate(alfabe,start=29)
indexed_letters_list=list(indexed_letters)
print(indexed_letters_list)

[(29, 'a'), (30, 'b'), (31, 'c'), (32, 'd'), (33, 'e'), (34, 'f'), (35, 'g'), (36, 'h'), (37, 'i'), (38, 'ı')]


In [61]:
nums=[1.23,2.345,23.44,23.4113,3.453]
# map(function, iterable)
rnd_nums=map(round,nums)
print(list(rnd_nums))

[1, 2, 23, 23, 3]


In [66]:
nums=[2,4,6,8]
new_nums=map(lambda x:x**3,nums)
print(list(new_nums))

[8, 64, 216, 512]


In [67]:
# Create a range object that goes from 0 to 5
nums = range(6)
print(type(nums))

# Convert nums to a list
nums_list = list(nums)
print(nums_list)

# Create a new list of odd numbers from 1 to 11 by unpacking a range object
nums_list2 = [*range(1,12,2)]
print(nums_list2)

<class 'range'>
[0, 1, 2, 3, 4, 5]
[1, 3, 5, 7, 9, 11]


In [68]:
nums_list2

[1, 3, 5, 7, 9, 11]

In [69]:
# Use map to apply str.upper to each element in names
names_map  = map(str.upper, names)

# Print the type of the names_map
print(type(names_map))

# Unpack names_map into a list
names_uppercase = [*names_map]

# Print the list created above
print(names_uppercase)

<class 'map'>
['JERRY', 'KRAMER', 'ELAINE', 'GEORGE', 'NEWMAN']


## NumPy
NumPy, or Numerical Python, is an invaluable Python package for data scientists.
- NumPy arrays are an alternative to Python lists that is faster and more memory efficient.
- Create one with numpy.array(list).

In [70]:
nums_list=list(range(5))

In [71]:
nums_list

[0, 1, 2, 3, 4]

In [72]:
nums_np=np.array(range(5))

In [73]:
nums_np

array([0, 1, 2, 3, 4])

- NumPy arrays are homogeneous, which means that they must contain elements of the same type.

In [74]:
nums_np_ints=np.array([1,2,3])

In [75]:
nums_np_ints

array([1, 2, 3])

In [77]:
nums_np_ints.dtype

dtype('int32')

In [78]:
nums_np_floats=np.array([1,2.5,3])

In [79]:
nums_np_floats

array([1. , 2.5, 3. ])

In [80]:
nums_np_floats.dtype

dtype('float64')

NumPy converted the integers to floats to keep the array homogeneous.
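
The upcasting can be seen directly by comparing dtypes. A small sketch (the variable names `ints` and `mixed` are illustrative; the exact integer width depends on the platform, so only the dtype kind is checked):

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # a single float upcasts the whole array

print(ints.dtype.kind)   # 'i' for a signed integer type
print(mixed.dtype)       # float64
print(mixed.tolist())    # the integers are now stored as floats
```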

* Python lists don't support broadcasting

In [82]:
nums=[1,5,7,9]
nums*2

[1, 5, 7, 9, 1, 5, 7, 9]

In [84]:
# All elements are doubled at once.
np.array(nums)*2

array([ 2, 10, 14, 18])
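
To make the contrast explicit: multiplying a list by 2 repeats the list, so element-wise work on a list needs a loop or comprehension, while an array applies the operation to every element. A short side-by-side sketch:

```python
import numpy as np

nums = [1, 5, 7, 9]

# With a plain list, each element has to be handled explicitly
doubled_list = [n * 2 for n in nums]

# With an array, the scalar is broadcast across every element
nums_np = np.array(nums)
doubled_np = nums_np * 2
squared_np = nums_np ** 2

print(doubled_list)
print(doubled_np.tolist())
print(squared_np.tolist())
```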

In [85]:
nums_2=[[2,4,5],[9,12,17]]

In [86]:
nums_2[0][1]

4

In [87]:
[row[0] for row in nums_2]

[2, 9]

In [88]:
nums_2_np=np.array(nums_2)

In [89]:
nums_2_np[0,1]

4

In [90]:
nums_2_np[:,0]

array([2, 9])

In [91]:
nums_2_np>3

array([[False,  True,  True],
       [ True,  True,  True]])

In [98]:
nums_2_np[nums_2_np>3]

array([ 4,  5,  9, 12, 17])

### %timeit

In [103]:
# Returns an array of 1,000 random numbers between 0 and 1.
%timeit rand_nums=np.random.rand(1000)

15.3 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
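
The %timeit magic only works inside IPython or Jupyter. In a plain Python script, the standard-library timeit module gives a comparable measurement. The sketch below times the loop and list-comprehension versions of doubling a list; the list size and the `number` of repetitions are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

# `number` is how many times each statement is executed per measurement
setup = "numbers = list(range(1000))"
loop_time = timeit.timeit(
    "out = []\nfor n in numbers:\n    out.append(n * 2)",
    setup=setup,
    number=1000,
)
comp_time = timeit.timeit(
    "[n * 2 for n in numbers]",
    setup=setup,
    number=1000,
)

print(f"for loop:           {loop_time:.4f} s")
print(f"list comprehension: {comp_time:.4f} s")
```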


## Efficiently combining, counting, and iterating

In [104]:
names=['Bulbasaur','Charmander','Squirtle']
hps=[45,39,44]

In [107]:
combined=[]
for i, pokemon in enumerate(names):
    combined.append((pokemon,hps[i]))
print(combined)

[('Bulbasaur', 45), ('Charmander', 39), ('Squirtle', 44)]


But Python's built-in function zip provides a more elegant solution.

In [125]:
combined_zip=zip(names,hps)

In [126]:
# A zip object can be unpacked into a list with the star operator
combined_list=[*combined_zip]

In [129]:
combined_list

[('Bulbasaur', 45), ('Charmander', 39), ('Squirtle', 44)]

### Collections Module

The collections module provides alternatives to the general-purpose dict, list, set, and tuple. A few notable datatypes are listed here:
* namedtuple: tuple subclasses with named fields.
* deque: list-like container with fast appends and pops on either end.
* Counter: dict for counting hashable objects.
* OrderedDict: dict that retains order of entries.
* defaultdict: dict that calls a factory function to supply missing values.
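
A brief sketch of three of these types in use. The `Pokemon` tuple type and the sample values are made up for illustration.

```python
from collections import defaultdict, deque, namedtuple

# namedtuple: fields are accessible by name as well as by position
Pokemon = namedtuple("Pokemon", ["name", "hp"])
bulbasaur = Pokemon("Bulbasaur", 45)
print(bulbasaur.name, bulbasaur[1])

# deque: appends and pops are fast at both ends
queue = deque(["Charmander", "Squirtle"])
queue.appendleft("Bulbasaur")
first, last = queue.popleft(), queue.pop()
print(first, last)

# defaultdict: a missing key is created by calling the factory (here, list)
by_type = defaultdict(list)
by_type["Grass"].append("Bulbasaur")
by_type["Fire"].append("Charmander")
print(dict(by_type))
```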

In [130]:
import pandas as pd
df=pd.read_csv('baseball_stats.csv')
df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424


In [137]:
df.shape

(1232, 15)

In [138]:
pok=pd.read_csv('Pokemon.csv')
pok.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [142]:
list_pok_types=pok['Type 1'].to_list()

In [146]:
type_counts={}
for types in list_pok_types:
    if types not in type_counts:
        type_counts[types]=1
    else:
        type_counts[types]+=1
print(type_counts)

{'Grass': 70, 'Fire': 52, 'Water': 112, 'Bug': 69, 'Normal': 98, 'Poison': 28, 'Electric': 44, 'Ground': 32, 'Fairy': 17, 'Fighting': 27, 'Psychic': 57, 'Rock': 44, 'Ghost': 32, 'Ice': 24, 'Dragon': 32, 'Dark': 31, 'Steel': 27, 'Flying': 4}


Using Counter is more concise and more efficient, since we don't have to write the loop ourselves.

In [148]:
from collections import Counter
# No need for a loop
type_counts=Counter(list_pok_types)
print(type_counts)

Counter({'Water': 112, 'Normal': 98, 'Grass': 70, 'Bug': 69, 'Psychic': 57, 'Fire': 52, 'Electric': 44, 'Rock': 44, 'Ground': 32, 'Ghost': 32, 'Dragon': 32, 'Dark': 31, 'Poison': 28, 'Fighting': 27, 'Steel': 27, 'Ice': 24, 'Fairy': 17, 'Flying': 4})


The printed Counter is ordered from highest to lowest count. Using Counter takes about half the time of the standard dictionary approach.
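
Two other Counter features are worth a quick look: most_common() returns the top entries as a list of (value, count) pairs, and looking up a value that was never seen returns 0 instead of raising a KeyError. The `sample_types` list below is a hypothetical stand-in for the full 'Type 1' column.

```python
from collections import Counter

# Hypothetical sample standing in for the full 'Type 1' column
sample_types = ["Grass", "Fire", "Water", "Water", "Grass", "Water"]
sample_counts = Counter(sample_types)

top_two = sample_counts.most_common(2)   # the two most frequent types
missing = sample_counts["Dragon"]        # never seen, so the count is 0

print(top_two)
print(missing)
```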

### Itertools Module
* Functional tools for creating and using iterators.
* Notable:
  - Infinite iterators: count, cycle, repeat
  - Finite iterators: accumulate, chain, zip_longest, etc.
  - Combination generators: product, permutations, combinations.
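
One small example per category, as a sketch with made-up inputs. Infinite iterators never stop on their own, so islice is used here to take only the first few items.

```python
from itertools import accumulate, chain, count, cycle, islice, product, zip_longest

# Infinite iterators, cut off with islice
first_counts = list(islice(count(10, 5), 3))      # start at 10, step by 5
first_cycle = list(islice(cycle("AB"), 5))        # A, B, A, B, A

# Finite iterators
running_totals = list(accumulate([1, 2, 3, 4]))   # 1, 1+2, 1+2+3, ...
chained = list(chain([1, 2], [3]))                # one sequence after another
padded = list(zip_longest("ab", [1]))             # shorter input padded with None

# Combinatoric generator: every pairing of the two inputs
pairs = list(product("ab", [1, 2]))

print(first_counts, first_cycle, running_totals, chained, padded, pairs, sep="\n")
```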

##### Combination with loop
Suppose we want to gather all possible pairs of Pokémon from a list of names.

In [169]:
poke_list=['Geodude', 'Cubone', 'Lickitung', 'Persian', 'Diglett']
combos=[]
for x in poke_list:
    for y in poke_list:
        if x==y:
            continue
        if ((x,y) not in combos) and ((y,x) not in combos):
            combos.append((x,y))
print(combos)

[('Geodude', 'Cubone'), ('Geodude', 'Lickitung'), ('Geodude', 'Persian'), ('Geodude', 'Diglett'), ('Cubone', 'Lickitung'), ('Cubone', 'Persian'), ('Cubone', 'Diglett'), ('Lickitung', 'Persian'), ('Lickitung', 'Diglett'), ('Persian', 'Diglett')]


In [170]:
# A more efficient way
from itertools import combinations
combos_obj=combinations(poke_list,2)
combos=[*combos_obj]
print(combos)

[('Geodude', 'Cubone'), ('Geodude', 'Lickitung'), ('Geodude', 'Persian'), ('Geodude', 'Diglett'), ('Cubone', 'Lickitung'), ('Cubone', 'Persian'), ('Cubone', 'Diglett'), ('Lickitung', 'Persian'), ('Lickitung', 'Diglett'), ('Persian', 'Diglett')]


If we compared runtimes, we'd see that using combinations is significantly faster than the nested loop.

In [171]:
combos_4 = [*combinations(poke_list, 4)]
print(combos_4)

[('Geodude', 'Cubone', 'Lickitung', 'Persian'), ('Geodude', 'Cubone', 'Lickitung', 'Diglett'), ('Geodude', 'Cubone', 'Persian', 'Diglett'), ('Geodude', 'Lickitung', 'Persian', 'Diglett'), ('Cubone', 'Lickitung', 'Persian', 'Diglett')]


In [157]:
pok

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [160]:
pok_names=pok.Name.to_list()
pok_type_1=pok['Type 1'].to_list()
pok_type_2=pok['Type 2'].to_list()

In [165]:
name_and_type=[*zip(pok_names[:5],pok_type_1[:5])]
print(name_and_type)

[('Bulbasaur', 'Grass'), ('Ivysaur', 'Grass'), ('Venusaur', 'Grass'), ('VenusaurMega Venusaur', 'Grass'), ('Charmander', 'Fire')]


In [167]:
name_and_types=[*zip(pok_names[:5],pok_type_1[:5],pok_type_2[:5])]
print(name_and_types)

[('Bulbasaur', 'Grass', 'Poison'), ('Ivysaur', 'Grass', 'Poison'), ('Venusaur', 'Grass', 'Poison'), ('VenusaurMega Venusaur', 'Grass', 'Poison'), ('Charmander', 'Fire', nan)]


In [168]:
# Collect the count of generations
gen_count = Counter(pok.Generation.to_list())
print(gen_count, '\n')

# Use list comprehension to get each Pokémon's starting letter
starting_letters = [name[0] for name in pok_names]

# Collect the count of Pokémon for each starting_letter
starting_letters_count = Counter(starting_letters)
print(starting_letters_count)

Counter({1: 166, 5: 165, 3: 160, 4: 121, 2: 106, 6: 82}) 

Counter({'S': 112, 'M': 67, 'C': 58, 'G': 58, 'P': 53, 'D': 46, 'B': 43, 'A': 42, 'T': 40, 'L': 39, 'R': 31, 'H': 31, 'K': 28, 'F': 26, 'V': 23, 'W': 23, 'E': 21, 'N': 16, 'Z': 10, 'J': 7, 'O': 6, 'I': 5, 'U': 5, 'Q': 4, 'Y': 4, 'X': 2})


### Set Theory
Python has a built-in set datatype with accompanying methods:
- intersection(): all elements that are in both sets
- difference(): all elements in one set but not the other
- symmetric_difference(): all elements in exactly one set
- union(): all elements that are in either set
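
Each of these methods also has an operator form when both operands are sets. A small sketch with two hypothetical sets:

```python
a = {"Bulbasaur", "Ivysaur", "Charmander"}
b = {"Bulbasaur", "Weedle"}

print(a & b == a.intersection(b))          # elements in both sets
print(a | b == a.union(b))                 # elements in either set
print(a - b == a.difference(b))            # elements in a but not in b
print(a ^ b == a.symmetric_difference(b))  # elements in exactly one set
print(sorted(a ^ b))
```

One difference: the methods accept any iterable as their argument, while the operators require both sides to be sets.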

In [194]:
list_a=pok_names[:5]
list_b=pok_names[15:20]+['Bulbasaur']
set_a=set(list_a)
print(set_a)
set_b=set(list_b)
print(set_b)

{'Bulbasaur', 'Venusaur', 'Ivysaur', 'VenusaurMega Venusaur', 'Charmander'}
{'Bulbasaur', 'Beedrill', 'Weedle', 'BeedrillMega Beedrill', 'Butterfree', 'Kakuna'}


In [195]:
set_a.intersection(set_b)

{'Bulbasaur'}

In [196]:
set_a.union(set_b)

{'Beedrill',
 'BeedrillMega Beedrill',
 'Bulbasaur',
 'Butterfree',
 'Charmander',
 'Ivysaur',
 'Kakuna',
 'Venusaur',
 'VenusaurMega Venusaur',
 'Weedle'}

In [197]:
set_a.symmetric_difference(set_b)

{'Beedrill',
 'BeedrillMega Beedrill',
 'Butterfree',
 'Charmander',
 'Ivysaur',
 'Kakuna',
 'Venusaur',
 'VenusaurMega Venusaur',
 'Weedle'}

Using sets is much faster than comparing the lists with loops.

In [198]:
set_a.difference(set_b)

{'Charmander', 'Ivysaur', 'Venusaur', 'VenusaurMega Venusaur'}

In [199]:
set_b.difference(set_a)

{'Beedrill', 'BeedrillMega Beedrill', 'Butterfree', 'Kakuna', 'Weedle'}

In [202]:
set_types=set(pok_type_1)
print(set_types)

{'Dark', 'Rock', 'Poison', 'Steel', 'Flying', 'Bug', 'Ice', 'Dragon', 'Water', 'Normal', 'Fire', 'Ground', 'Fairy', 'Grass', 'Fighting', 'Electric', 'Ghost', 'Psychic'}


In [203]:
print('Psyduck'  in set_a)
print('Psyduck'  in set_b)

False
False


In [205]:
pok.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [210]:
pok_stats=pok[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]

In [217]:
pok_stats_list=pok_stats[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']].values.tolist()

In [218]:
totals_map=[*map(sum, pok_stats_list)]

In [221]:
pok_stats_np=pok_stats[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']].values

In [222]:
# Specifying axis=1 means that we calculate the average across the column values of each row.
avgs_np=pok_stats_np.mean(axis=1)

It's significantly faster than using a loop.

In [228]:
total_np=pok_stats_np.sum(axis=1)

In [229]:
poke_list_np = [*zip(pok_names, total_np, avgs_np)]

In [232]:
top_3 = sorted(poke_list_np, key=lambda x: x[1], reverse=True)[:3]
print('3 strongest Pokémon:\n{}'.format(top_3))

3 strongest Pokémon:
[('MewtwoMega Mewtwo X', 780, 130.0), ('MewtwoMega Mewtwo Y', 780, 130.0), ('RayquazaMega Rayquaza', 780, 130.0)]


In this section, we'll explore how to make loops more efficient when looping is unavoidable.

In [234]:
names=['Absol','Aron','Jynx','Natu','Onix']
attacks=np.array([130,70,50,50,45])

We'd like to print out the names of each Pokemon with an attack value greater than the average of all attack values.

In [None]:
x=[name for name, attack in zip(names, attacks) if attack > attacks.mean()]

In [235]:
names_and_attack=[*zip(names,attacks)]

In [238]:
names_and_attack

[('Absol', 130), ('Aron', 70), ('Jynx', 50), ('Natu', 50), ('Onix', 45)]

In [239]:
names_and_attack[0][1]

130

In [249]:
[name[0] for name in names_and_attack if name[1]>np.mean(attacks)]

['Absol', 'Aron']

Another way is to loop over the zipped names and attacks:

In [255]:
for pokemon, attack in zip(names, attacks):
    total_attack_avg=attacks.mean()
    if attack>total_attack_avg:
        print(
            "{}'s attack: {} > average: {}!"
            .format(pokemon,attack, total_attack_avg))

Absol's attack: 130 > average: 69.0!
Aron's attack: 70 > average: 69.0!


In this loop, the total_attack_avg variable is recalculated on every iteration, even though its value never changes. If a calculation only needs to happen once, move it outside the for loop.

In [259]:
# The average is now calculated once, before the loop
total_attack_avg=attacks.mean()
for pokemon, attack in zip(names, attacks):
    if attack>total_attack_avg:
        print(
            "{}'s attack: {} > average: {}!"
            .format(pokemon,attack, total_attack_avg))

Absol's attack: 130 > average: 69.0!
Aron's attack: 70 > average: 69.0!


#### Using Holistic Conversions

In [262]:
names=pok.Name.to_list()
legend_status=pok.Legendary.to_list()
generations=pok.Generation.to_list()

In [None]:
pok_data=[]
for poke_tuple in zip(names,legend_status,generations):
    poke_list=list(poke_tuple)
    pok_data.append(poke_list)
print(pok_data)

Instead, we should collect all of our poke_tuples together, and use the map function to convert each tuple to a list.

In [None]:
poke_data_tuples=[]
for poke_tuple in zip(names,legend_status,generations):
    poke_data_tuples.append(poke_tuple)
poke_data=[*map(list,poke_data_tuples)]
print(poke_data)

The loop no longer converts tuples to lists with each iteration. Instead, we moved this tuple-to-list conversion outside (or below) the loop. That way, we convert data types all at once (or holistically) rather than converting in each iteration.
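
Taking the same idea one step further, the collecting loop can be dropped as well, because map accepts the zip object directly. This is a sketch with hypothetical three-row sample lists, not the full dataset:

```python
# Hypothetical sample data standing in for the full columns
sample_names = ["Bulbasaur", "Ivysaur", "Venusaur"]
sample_legend_status = [False, False, False]
sample_generations = [1, 1, 1]

# zip builds the tuples lazily and map converts each one to a list
sample_data = [*map(list, zip(sample_names, sample_legend_status, sample_generations))]
print(sample_data)
```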

In [267]:
pokemon_types=list(set(pok['Type 1']))

In [268]:
pokemon_types

['Dark',
 'Rock',
 'Poison',
 'Steel',
 'Flying',
 'Bug',
 'Ice',
 'Dragon',
 'Water',
 'Normal',
 'Fire',
 'Ground',
 'Fairy',
 'Grass',
 'Fighting',
 'Electric',
 'Ghost',
 'Psychic']

In [None]:
# Collect all possible pairs using combinations()
possible_pairs = [*combinations(pokemon_types, 2)]

# Create an empty list called enumerated_tuples
enumerated_tuples = []

# Append each enumerated_pair_tuple to the empty list above
for i,pair in enumerate(possible_pairs, 1):
    enumerated_pair_tuple = (i,) + pair
    enumerated_tuples.append(enumerated_pair_tuple)

# Convert all tuples in enumerated_tuples to a list
enumerated_pairs = [*map(list, enumerated_tuples)]
print(enumerated_pairs)

In [275]:
hps=pok.HP.to_list()
hps=np.array(hps)

In [276]:
# Calculate the total HP avg and total HP standard deviation
hp_avg = hps.mean()
hp_std = hps.std()

# Use NumPy to eliminate the previous for loop
z_scores = (hps - hp_avg)/hp_std

# Combine names, hps, and z_scores
poke_zscores2 = [*zip(names, hps, z_scores)]
print(*poke_zscores2[:3], sep='\n')

('Bulbasaur', 45, -0.9506262218221118)
('Ivysaur', 60, -0.3628220964103872)
('Venusaur', 80, 0.42091673747191216)


In [None]:
# Use list comprehension with the same logic as the highest_hp_pokemon code block
highest_hp_pokemon2 = [(name, hp, zscore) for name,hp,zscore in poke_zscores2 if zscore > 2]
print(*highest_hp_pokemon2, sep='\n')

## Pandas Optimization

In [279]:
df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424


### A Team's Win Percentage

In [321]:
def win_perc(wins, game_played):
    win_percentage=wins/game_played
    return np.round(win_percentage,2)
win_perc(50,100)

0.5

I'd like to create a new column storing each team's win percentage. To do this, we'll need to iterate over the DataFrame's rows and apply the win_perc function.

In [302]:
win_perc_list=[]
for i in range(len(df)):
    row=df.iloc[i]
    wins=row['W']
    games_played=row['G']
    win_perc_df=win_perc(wins,games_played)
    win_perc_list.append(win_perc_df)
df['WP']=win_perc_list

Pandas comes with a few efficient methods for looping over a DataFrame.

### .iterrows()
This is similar to using .iloc, but the .iterrows() method returns each DataFrame row as a tuple of (index, pandas Series). This means each object returned from .iterrows() contains the index of the row as the first element and the data in the row, as a pandas Series, as the second element. Now we don't have to create an index variable to look up each row within the DataFrame.

In [322]:
win_perc_list_1=[]
for i, row in df.iterrows():
    wins=row['W']
    games_played=row['G']
    
    win_perc_df=win_perc(wins,games_played)
    win_perc_list_1.append(win_perc_df)
df['WP']=win_perc_list_1

In [325]:
df.shape

(1232, 16)

In [333]:
pit_df=df[(df['Team']=='PIT')&(df['Year']>=2007)]

In [None]:
for i,row in pit_df.iterrows():
    print(row)

In [None]:
for i,row in pit_df.iterrows():
    print(i)
    print(row)
    print(type(row))

In [None]:
for row_tuple in pit_df.iterrows():
    print(row_tuple)

### .itertuples()
The .itertuples() method returns each row as a namedtuple, and it's often more efficient than .iterrows().

In [343]:
team_wins_df=df[['Team','Year','W']]
team_wins_df.head()

Unnamed: 0,Team,Year,W
0,ARI,2012,81
1,ATL,2012,94
2,BAL,2012,93
3,BOS,2012,69
4,CHC,2012,61


In [None]:
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

In [None]:
for row_tuple in team_wins_df.iterrows():
    print(row_tuple[1]['Team'])

In [None]:
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple.Team)

A namedtuple does not support looking up a field by name in square brackets the way a pandas Series does; use attribute access (row_namedtuple.Team) instead.
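
The difference can be reproduced without pandas. In the sketch below, `Row` is a hypothetical stand-in for one row yielded by team_wins_df.itertuples(), built by hand with collections.namedtuple.

```python
from collections import namedtuple

# Hypothetical stand-in for one row yielded by team_wins_df.itertuples()
Row = namedtuple("Pandas", ["Index", "Team", "Year", "W"])
row = Row(0, "ARI", 2012, 81)

print(row.Team)             # attribute access
print(row[1])               # a positional index also works
print(getattr(row, "W"))    # lookup when the field name is held in a string

try:
    row["Team"]             # a label in square brackets does not work
except TypeError as err:
    print("TypeError:", err)
```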

In [None]:
for row in team_wins_df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  print(i, year, wins)

In [None]:
for row in df.itertuples():
  i = row.Index
  year = row.Year
  wins = row.W
  
  # Check if the team made the playoffs (1 means yes; 0 means no)
  if row.Playoffs == 1:
    print(i, year, wins)

In [None]:
# .copy() makes the subset independent of df, so a column can be added to it safely
yankees_df=df[df.Team=='NYY'].copy()

In [365]:
def calc_run_diff(runs_scored, runs_allowed):

    run_diff = runs_scored - runs_allowed

    return run_diff

In [None]:
run_diffs = []

# Loop over the DataFrame and calculate each row's run differential
for row in yankees_df.itertuples():
    
    runs_scored = row.RS
    runs_allowed = row.RA
    
    run_diff = calc_run_diff(runs_scored, runs_allowed)
    
    run_diffs.append(run_diff)
yankees_df['RD'] = run_diffs
print(yankees_df)

### Pandas .apply() method
- Takes a function and applies it to a DataFrame.
  - Must specify an axis to apply (0 for columns; 1 for rows)
- Just like the map function, the pandas .apply() method can be used with anonymous functions or lambdas.

In [None]:
run_diffs_apply=df.apply(
lambda row: calc_run_diff(row['RS'], row['RA']),axis=1)
df['RD']=run_diffs_apply
print(df)

In [None]:
dbacks_df=df[df['Team']=='ARI']

In [373]:
def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc,2)

In [None]:
# Display the first five rows of the DataFrame
print(dbacks_df.head())

# Create a win percentage Series 
win_percs = dbacks_df.apply(lambda row: calc_win_perc(row['W'], row['G']), axis=1)
print(win_percs, '\n')

In [378]:
def text_playoffs(num_playoffs): 
    if num_playoffs == 1:
        return 'Yes'
    else:
        return 'No' 

In [379]:
# Convert numeric playoffs to text by applying text_playoffs()
textual_playoffs = dbacks_df.apply(lambda row: text_playoffs(row['Playoffs']), axis=1)
print(textual_playoffs)

0       No
30     Yes
60      No
90      No
120     No
150    Yes
180     No
210     No
241     No
271     No
301    Yes
331    Yes
361     No
391    Yes
421     No
dtype: object


In [380]:
dbacks_df['Textual Playoffs']=textual_playoffs

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

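The warning appears because dbacks_df was created by filtering df, so pandas cannot tell whether the assignment is meant to affect the original DataFrame. One common way to avoid it, sketched here on a hypothetical miniature of the data (`teams` and `ari` are illustrative names), is to take an explicit copy of the subset before adding the column:

```python
import pandas as pd

# Hypothetical miniature of the baseball data
teams = pd.DataFrame(
    {"Team": ["ARI", "ATL", "ARI"], "Playoffs": [0, 1, 1]}
)

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and does not touch `teams`
ari = teams[teams["Team"] == "ARI"].copy()
ari["Textual Playoffs"] = ari["Playoffs"].map({1: "Yes", 0: "No"})

print(ari)
print("Textual Playoffs" in teams.columns)
```
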

In [386]:
wins_np=df['W'].values
print(type(wins_np))

<class 'numpy.ndarray'>


In [387]:
print(wins_np)

[ 81  94  93 ... 103  84  60]


### Power of vectorization
- Broadcasting (vectorizing) is extremely efficient!

In [388]:
df['RS'].values-df['RA'].values

array([  46,  100,    7, ...,  188,  110, -117], dtype=int64)

In [389]:
df['RD']=df['RS'].values-df['RA'].values
df.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,WP,RD
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415,0.5,46
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378,0.58,100
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403,0.57,7
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428,0.43,-72
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424,0.38,-146


### Predicting Win Percentages

In [390]:
def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

In [392]:
win_perc_preds_loop = []

# Use a loop and .itertuples() to collect each row's predicted win percentage
for row in df.itertuples():
    runs_scored = row.RS
    runs_allowed = row.RA
    win_perc_pred = predict_win_perc(runs_scored, runs_allowed)
    win_perc_preds_loop.append(win_perc_pred)

In [394]:
# Apply predict_win_perc to each row of the DataFrame
win_perc_preds_apply = df.apply(lambda row: predict_win_perc(row['RS'], row['RA']), axis=1)

In [398]:
# Calculate the win percentage predictions using NumPy arrays
win_perc_preds_np = predict_win_perc(df['RS'].values, df['RA'].values)
df['WP_preds'] = win_perc_preds_np
print(df.head())

  Team League  Year   RS   RA   W    OBP    SLG     BA  Playoffs  RankSeason  \
0  ARI     NL  2012  734  688  81  0.328  0.418  0.259         0         NaN   
1  ATL     NL  2012  700  600  94  0.320  0.389  0.247         1         4.0   
2  BAL     AL  2012  712  705  93  0.311  0.417  0.247         1         5.0   
3  BOS     AL  2012  734  806  69  0.315  0.415  0.260         0         NaN   
4  CHC     NL  2012  613  759  61  0.302  0.378  0.240         0         NaN   

   RankPlayoffs    G   OOBP   OSLG    WP   RD  WP_preds  
0           NaN  162  0.317  0.415  0.50   46      0.53  
1           5.0  162  0.306  0.378  0.58  100      0.58  
2           4.0  162  0.315  0.403  0.57    7      0.50  
3           NaN  162  0.331  0.428  0.43  -72      0.45  
4           NaN  162  0.335  0.424  0.38 -146      0.39  


Using NumPy arrays was the fastest approach, followed by the .itertuples() approach; the .apply() approach was the slowest.
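
A rough way to compare the three approaches outside a notebook is sketched below. It assumes randomly generated RS and RA columns rather than the real baseball data, and the frame size and repetition count are arbitrary, so the absolute timings are only indicative.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data with the same RS/RA columns as the baseball table
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "RS": rng.integers(500, 1000, size=1000),
    "RA": rng.integers(500, 1000, size=1000),
})

def predict_win_perc(RS, RA):
    prediction = RS ** 2 / (RS ** 2 + RA ** 2)
    return np.round(prediction, 2)

def with_itertuples():
    return [predict_win_perc(row.RS, row.RA) for row in demo.itertuples()]

def with_apply():
    return demo.apply(lambda row: predict_win_perc(row["RS"], row["RA"]), axis=1)

def with_numpy():
    return predict_win_perc(demo["RS"].values, demo["RA"].values)

# Time five runs of each approach
for func in (with_itertuples, with_apply, with_numpy):
    seconds = timeit.timeit(func, number=5)
    print(f"{func.__name__}: {seconds:.4f} s for 5 runs")
```

All three functions return the same predictions; only the way the rows are traversed differs.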