## Create a generator for a pandas DataFrame

As you've seen in the video, you can easily create a generator out of a `pandas` DataFrame. Each time you iterate through it, it will yield two elements:

- the index of the respective row
- a pandas Series with all the elements of that row

You are going to create a generator over the poker dataset, imported as `poker_hands`. Then, you will print all the elements of the 2nd row, using the generator.

Instructions

1. Assign a generator over the rows of the `poker_hands` dataset to the variable `generator`.
2. Print all the elements of the 2nd row, using the generator.

In [3]:
# Import pandas
import pandas as pd

# Import dataset
poker_hands = pd.read_csv('poker_hands.csv')

In [6]:
# Create a generator over the rows
generator = poker_hands.iterrows()

# Access the elements of the 2nd row
first_element = next(generator)
second_element = next(generator)

display(first_element, second_element)

(0,
 S1                       1
 R1                      10
 S2                       1
 R2                      11
 S3                       1
 R3                      13
 S4                       1
 R4                      12
 S5                       1
 R5                       1
 Class                    9
 Explanation    Royal flush
 Name: 0, dtype: object)

(1,
 S1                       2
 R1                      11
 S2                       2
 R2                      13
 S3                       2
 R3                      10
 S4                       2
 R4                      12
 S5                       2
 R5                       1
 Class                    9
 Explanation    Royal flush
 Name: 1, dtype: object)
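
Each item produced by `.iterrows()` is an `(index, Series)` tuple, so it can be unpacked directly. The following is a minimal sketch of that pattern, assuming a small made-up DataFrame named `toy_hands` (hypothetical, not the real `poker_hands` data):

```python
import pandas as pd

# Hypothetical two-hand DataFrame standing in for poker_hands
toy_hands = pd.DataFrame({
    'S1': [1, 2],
    'R1': [10, 11],
    'S2': [1, 2],
    'R2': [11, 13],
})

generator = toy_hands.iterrows()

# Skip the first row, then unpack the second (index, Series) tuple
next(generator)
index, row = next(generator)

print(index)         # index label of the 2nd row
print(row.tolist())  # values of the 2nd row as a plain list
```

Unpacking into `index, row` avoids having to subscript the tuple later with `[0]` and `[1]`.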

## The iterrows() function for looping

You just saw how to create a generator out of a `pandas` DataFrame. You will now use this generator and see how to take advantage of that method of looping through a `pandas` DataFrame, still using the `poker_hands` dataset.

Specifically, we want the sum of the ranks of all the cards, but only for hands whose index is an odd number. The ranks are stored in columns `R1` to `R5`, which sit at the odd zero-based positions (1, 3, 5, 7, 9) of the DataFrame.

Instructions

1. Check if the hand index is an odd number.
2. If it is, calculate the sum of the rank of all the cards in that hand. It could take a little longer than usual to compute the results.

In [9]:
data_generator = poker_hands.iterrows()

for index, values in data_generator:
    # Check if index is odd
    if index % 2 != 0:
        # Sum the ranks of all the cards, selecting the rank columns by label
        # (integer keys such as values[1] are deprecated on a label-indexed Series)
        hand_sum = values[['R1', 'R2', 'R3', 'R4', 'R5']].sum()
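
The loop above overwrites `hand_sum` on every pass, so nothing is kept once it finishes. One possible way to keep the results, sketched here on a made-up DataFrame (`toy_hands`, `rank_columns` and `odd_hand_sums` are hypothetical names, not part of the exercise), is to collect them in a dictionary keyed by hand index:

```python
import pandas as pd

# Hypothetical three-hand DataFrame with two cards per hand
toy_hands = pd.DataFrame({
    'S1': [1, 2, 3], 'R1': [10, 11, 2],
    'S2': [1, 2, 4], 'R2': [11, 13, 5],
})
rank_columns = ['R1', 'R2']

odd_hand_sums = {}
for index, values in toy_hands.iterrows():
    # Keep only hands with an odd index
    if index % 2 != 0:
        # Select the rank columns by label and add them up
        odd_hand_sums[index] = int(values[rank_columns].sum())

print(odd_hand_sums)
```

Selecting by column label rather than by position keeps the loop correct even if the column order changes.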

## .apply() function on every cell

As you saw in the lesson, you can use `.apply()` to map a function to every cell of the DataFrame, regardless of the column or the row, as long as the function is valid for every value it receives.

You're going to try it out on the `poker_hands` dataset. You will use `.apply()` to square every cell of the DataFrame. The native Python way to square a number `n` is `n**2`.

Instructions

1. Define the lambda transformation for the square.
2. Apply the transformation using the `.apply()` function.

In [15]:
# Define the lambda transformation
get_square = lambda x: x**2

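# Apply the transformation to the numeric columns only: squaring the string
# column 'Explanation' raises
# TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
data_sum = poker_hands.select_dtypes(include='number').apply(get_square)
print(data_sum.head())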
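
The `TypeError` mentioned above arises because `.apply()` hands each whole column to the function, including the text column. The sketch below, built on a made-up DataFrame named `toy` (hypothetical), shows what the function receives and one possible way to limit the transformation to numeric columns:

```python
import pandas as pd

# Hypothetical DataFrame with the same mix of numeric and text columns
toy = pd.DataFrame({
    'R1': [10, 11],
    'Class': [9, 9],
    'Explanation': ['Royal flush', 'Royal flush'],
})

# .apply() passes each column to the function as a Series
received = toy.apply(lambda col: type(col).__name__)
print(received)

# Squaring works once the text column is excluded
numeric = toy.select_dtypes(include='number')
squared = numeric.apply(lambda x: x**2)
print(squared)
```

Because the lambda receives a Series, `x**2` is itself an elementwise operation, which is why the result looks like a per-cell transformation.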

## .apply() for rows iteration

`.apply()` is very useful for iterating through the rows of a DataFrame and applying a specific function to each one.

You will work on a subset of the `poker_hands` dataset, which includes only the ranks of the five cards of each hand in each row (this subset is generated for you in the script). You're going to compute the variance of the ranks within each hand (across each row) and the variance of each rank column across all hands (down each column).

Instructions

1. Define a `lambda` function to return the variance, using the `numpy` package.
2. Apply the transformation for every row.
3. Modify the script to apply the function on every rank.

In [16]:
# Import numpy
import numpy as np

# Define the lambda transformation
get_variance = lambda x: np.var(x)

# Apply the transformation
data_tr = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_variance, axis=1)
print(data_tr.head())

0    18.64
1    18.64
2    18.64
3    18.64
4    18.64
dtype: float64


In [17]:
get_variance = lambda x: np.var(x)

# Apply the transformation
data_tr = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_variance, axis=0)
print(data_tr.head())

R1    14.060473
R2    14.189523
R3    14.024270
R4    14.040552
R5    13.998851
dtype: float64
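
The only difference between the two cells above is the `axis` argument. As a minimal sketch on a made-up DataFrame (`ranks` is a hypothetical stand-in for the rank subset), `axis=1` produces one value per row and `axis=0` one value per column:

```python
import numpy as np
import pandas as pd

# Hypothetical subset: two hands, three rank columns
ranks = pd.DataFrame({'R1': [1, 2], 'R2': [3, 6], 'R3': [5, 10]})

get_variance = lambda x: np.var(x)

# axis=1: the function receives each row, so the result is indexed by hand
per_hand = ranks.apply(get_variance, axis=1)

# axis=0: the function receives each column, so the result is indexed by rank column
per_rank = ranks.apply(get_variance, axis=0)

print(per_hand)
print(per_rank)
```

The index of the result is a quick check on which axis was used: row labels for `axis=1`, column names for `axis=0`.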


## Why is vectorization in pandas so fast?

As you probably noticed in this lesson, we achieved a massive improvement using some form of vectorization.

Where does this improvement come from?

Fewer operations are required due to optimization in pandas.
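
To illustrate the idea, the sketch below computes the same result twice on a made-up DataFrame (`ranks` is hypothetical): once with a Python-level loop that performs one addition per hand, and once with a single vectorized expression that operates on whole columns.

```python
import pandas as pd

# Hypothetical subset: three hands, two rank columns
ranks = pd.DataFrame({'R1': [10, 11, 2], 'R2': [11, 13, 5]})

# Row by row: one Python-level addition per hand
loop_sums = []
for _, row in ranks.iterrows():
    loop_sums.append(row['R1'] + row['R2'])

# Vectorized: a single expression applied to entire columns
vector_sums = ranks['R1'] + ranks['R2']

print(loop_sums)
print(vector_sums.tolist())
```

Both approaches give the same numbers; the vectorized form simply replaces many interpreted steps with one call.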

## pandas vectorization in action

In this exercise, you will apply vectorization over `pandas` Series to:

- calculate the mean rank of all the cards in each hand (row)
- calculate the mean rank of each of the 5 cards across all hands (column)

You will use the `poker_hands` dataset once again and time both calculations to compare their efficiency.

Instructions

1. Calculate the mean rank in each hand.
2. Calculate the mean rank of each of the 5 cards across all hands.

In [20]:
# Import time
import time

In [29]:
# Calculate the mean rank in each hand
row_start_time = time.time()
mean_r = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=1)
print(f'Time using pandas vectorization for rows: {time.time() - row_start_time} sec')
print(mean_r.head(), '\n')

# Calculate the mean rank of each of the 5 cards across all hands
col_start_time = time.time()
mean_c = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=0)
print(f'Time using pandas vectorization for columns: {time.time() - col_start_time} sec')
print(mean_c.head())

Time using pandas vectorization for rows: 0.006748676300048828 sec
0    9.4
1    9.4
2    9.4
3    9.4
4    9.4
dtype: float64 

Time using pandas vectorization for columns: 0.003331422805786133 sec
R1    6.995242
R2    7.014194
R3    7.014154
R4    6.942463
R5    6.962735
dtype: float64
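
A single `time.time()` difference over a few milliseconds can be noisy. As one possible alternative, not the method used in this exercise, the standard-library `timeit` module can repeat a call many times and report the best run. The DataFrame `ranks` below is a made-up stand-in:

```python
import timeit

import pandas as pd

# Hypothetical subset: 13 hands, two rank columns
ranks = pd.DataFrame({'R1': range(1, 14), 'R2': range(13, 0, -1)})

# Best of 5 repeats, 100 calls each, reported as seconds per call
row_time = min(timeit.repeat(lambda: ranks.mean(axis=1), repeat=5, number=100)) / 100
col_time = min(timeit.repeat(lambda: ranks.mean(axis=0), repeat=5, number=100)) / 100

print(f'Rows:    {row_time:.6f} sec per call')
print(f'Columns: {col_time:.6f} sec per call')
```

Taking the minimum over several repeats reduces the influence of unrelated activity on the machine.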


## Best method of vectorization

So far, you have encountered two vectorization methods:

- Vectorization over `pandas` Series
- Vectorization over `NumPy` ndarrays

While these two methods outperform all the other methods, when can vectorization over `NumPy` ndarrays be used to replace vectorization over `pandas` Series?

When operations such as indexing or data type handling are not needed.
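
Index handling is one place where the two approaches differ. In the sketch below, which uses two made-up Series (`a` and `b` are hypothetical), `pandas` matches values by index label, while the underlying `NumPy` arrays are combined purely by position:

```python
import pandas as pd

# Hypothetical Series with the same labels in a different order
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

# pandas aligns on labels before adding
aligned = a + b

# NumPy ignores labels and adds element by element
positional = a.to_numpy() + b.to_numpy()

print(aligned)
print(positional)
```

When the labels carry meaning, dropping down to ndarrays changes the result, so the switch is safe only when position alone identifies each value.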

## Vectorization methods for looping a DataFrame

Now that you're familiar with vectorization in `pandas` and `NumPy`, you're going to compare their respective performances yourself.

Your task is to calculate the variance of the ranks of all the cards in each hand using vectorization over `pandas` Series, and then modify your code to use vectorization over `NumPy` ndarrays.

Instructions

1. Calculate the variance of the ranks of all the cards in each hand using vectorization with `pandas`.
2. Calculate the variance of the ranks of all the cards in each hand using vectorization with `NumPy`.

In [30]:
# Calculate the variance in each hand
start_time = time.time()
poker_var = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].var(axis=1)
print(f'Time using pandas vectorization: {time.time() - start_time} sec', '\n')
print(poker_var.head())

Time using pandas vectorization: 0.003813505172729492 sec 

0    23.3
1    23.3
2    23.3
3    23.3
4    23.3
dtype: float64


In [32]:
# Calculate the variance in each hand (ddof=1 matches the pandas default)
start_time = time.time()
poker_var = poker_hands[['R1', 'R2', 'R3', 'R4', 'R5']].values.var(axis=1, ddof=1)
print(f'Time using NumPy vectorization: {time.time() - start_time} sec', '\n')
print(poker_var[0:5])

Time using NumPy vectorization: 0.0035817623138427734 sec 

[23.3 23.3 23.3 23.3 23.3]
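
The `ddof=1` argument in the `NumPy` call matters: `pandas` computes the sample variance (`ddof=1`) by default, while `NumPy` computes the population variance (`ddof=0`). This is also why the earlier `np.var` results show 18.64 for the same hands that give 23.3 here. A minimal sketch using the ranks of the first hand in the dataset (10, 11, 13, 12, 1):

```python
import numpy as np
import pandas as pd

# Ranks of the first hand in poker_hands
hand = pd.Series([10, 11, 13, 12, 1])

sample_var = hand.var()                     # pandas default: ddof=1
population_var = hand.to_numpy().var()      # NumPy default: ddof=0
matched_var = hand.to_numpy().var(ddof=1)   # NumPy made to match pandas

print(sample_var, population_var, matched_var)
```

Setting `ddof` explicitly in both libraries is a simple way to make sure the two methods are compared on the same calculation.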
