## Iterating with a for loop

- Iterate over a list using a for loop

In [2]:
employees = ['Nick', 'Lore', 'Hugo']

for employee in employees:
    print(employee)

Nick
Lore
Hugo


In [3]:
for letter in 'DataCamp':
    print(letter)

D
a
t
a
C
a
m
p


In [4]:
for i in range(4):
    print(i)

0
1
2
3


Iterable:
- Lists
- Strings
- Dictionaries
- File connections

An iterable is an object with an associated **iter()** method

Applying **iter()** to an iterable creates an iterator


**ITERATOR =** an object with an associated **next()** method that produces the consecutive values

In [5]:
word = 'Da'
it = iter(word)

next(it)

'D'

In [6]:
next(it)

'a'

In [7]:
next(it)

StopIteration: 
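
The **StopIteration** error is the signal an iterator uses to say it has no values left. As a rough sketch (not the interpreter's actual implementation), a for loop over `word` behaves like the following while loop; the name `letters` is only here for illustration.

```python
# Roughly what `for letter in word: ...` does behind the scenes.
word = 'Da'
it = iter(word)              # ask the iterable for an iterator

letters = []
while True:
    try:
        letter = next(it)    # fetch the next value
    except StopIteration:    # raised once the iterator is exhausted
        break
    letters.append(letter)   # the "body" of the loop

print(letters)
```

This is why a for loop ends cleanly instead of showing the error seen above.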

Using **iter()**:

In [8]:
# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']

# Print each list item in flash using a for loop
for person in flash:
    print(person)

# Create an iterator for flash: superspeed
superspeed = iter(flash)

# Print each item from the iterator
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))

jay garrick
barry allen
wally west
bart allen
jay garrick
barry allen
wally west
bart allen


In [9]:
# Create an iterator for range(3): small_value
small_value = iter(range(3))

# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))

# Loop over range(3) and print the values
for num in range(3):
    print(num)

# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))


0
1
2
0
1
2
0
1
2
3
4


In [10]:
# Create a range object: values
values = range(10,21)

# Print the range object
print(values)

# Create a list of integers: values_list
values_list = list(values)

# Print values_list
print(values_list)

# Get the sum of values: values_sum
values_sum = sum(values)

# Print values_sum
print(values_sum)

range(10, 21)
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
165


## Playing with iterators

**Enumerate:** a function that takes any iterable (for example, a list) as its argument and returns a special enumerate object, which consists of pairs containing the elements of the original iterable along with their index within the iterable.

In [11]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']

e = enumerate(avengers)
print(type(e))

<class 'enumerate'>


We can use the function **list** to turn this enumerate object into a list of tuples and print it to see what it contains.

In [12]:
e_list = list(e)

print(e_list)

[(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]


The enumerate object itself is also an iterable, and we can loop over it while unpacking its elements using the clause **for index, value in enumerate()**.

In [13]:
for index, value in enumerate(avengers):
    print(index, value)

0 hawkeye
1 iron man
2 thor
3 quicksilver


In [14]:
for index, value in enumerate(avengers, start=10):
    print(index, value)

10 hawkeye
11 iron man
12 thor
13 quicksilver


**ZIP:** accepts an arbitrary number of iterables and returns an iterator of tuples.

In [21]:
names = ['Barton', 'Stark', 'Odinson', 'Maximoff']
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']

In [22]:
z = zip(avengers, names)

print(z)
print(type(z))

<zip object at 0x000001BD5E973548>
<class 'zip'>


We can turn this zip object into a list and print the list

In [23]:
z_list = list(z)

print(z_list)

[('hawkeye', 'Barton'), ('iron man', 'Stark'), ('thor', 'Odinson'), ('quicksilver', 'Maximoff')]


We could use a for loop to iterate over the zip object and print the tuples. 

In [24]:
for z1, z2 in zip(avengers, names):
    print(z1, z2)

hawkeye Barton
iron man Stark
thor Odinson
quicksilver Maximoff


We could also have used the splat operator to print all the elements

In [26]:
z = zip(avengers, names)
print(*z)

('hawkeye', 'Barton') ('iron man', 'Stark') ('thor', 'Odinson') ('quicksilver', 'Maximoff')


In [29]:
# Create a list of strings: mutants
mutants = ['charles xavier',
           'bobby drake',
           'kurt wagner',
           'max eisenhardt',
           'kitty pryde']

aliases = ['prof x',
           'iceman',
           'nightcrawler',
           'magneto',
           'shadowcat']

powers = ['telepathy',
          'thermokinesis',
          'teleportation',
          'magnetokinesis',
          'intangibility']

In [30]:
# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))

# Print the list of tuples
print(mutant_list)

# Unpack and print the tuple pairs
for index1, value1 in enumerate(mutants):
    print(index1, value1)

# Change the start index
for index2, value2 in enumerate(mutants, start=1):
    print(index2, value2)

[(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pryde')]
0 charles xavier
1 bobby drake
2 kurt wagner
3 max eisenhardt
4 kitty pryde
1 charles xavier
2 bobby drake
3 kurt wagner
4 max eisenhardt
5 kitty pryde


In [31]:
# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))

# Print the list of tuples
print(mutant_data)

# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)

# Print the zip object
print(mutant_zip)

# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1 + "-" + value2 + "-" + value3)

[('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pryde', 'shadowcat', 'intangibility')]
<zip object at 0x000001BD5E96B1C8>
charles xavier-prof x-telepathy
bobby drake-iceman-thermokinesis
kurt wagner-nightcrawler-teleportation
max eisenhardt-magneto-magnetokinesis
kitty pryde-shadowcat-intangibility


In [32]:
# Create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# Print the tuples in z1 by unpacking with *
print(*z1)

# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(* z1)

# Check if unpacked tuples are equal to the original lists
# (prints False: a tuple never compares equal to a list)
print(result1 == mutants)
print(result2 == powers)

('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pryde', 'intangibility')
False
False
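
Both comparisons print False because **zip()** produces tuples while `mutants` and `powers` are lists, and a tuple never compares equal to a list even when the elements match. The sketch below uses shortened copies of the lists to show this, and also shows that a zip object is empty once it has been consumed, which is why `z1` had to be re-created.

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']
powers = ['telepathy', 'thermokinesis', 'teleportation']

z1 = zip(mutants, powers)
result1, result2 = zip(*z1)        # 'unzip' back into two tuples

print(type(result1))               # <class 'tuple'>
print(result1 == mutants)          # False: tuple vs. list
print(list(result1) == mutants)    # True: same elements once converted
print(list(z1))                    # []: z1 was exhausted by the unpacking
```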


## Manage iterators in BIG DATA

There can be too much data to hold in memory.
 - SOLUTION = load the data in **chunks!**
   * Load a chunk of the data
   * Perform the desired operation or operations on each chunk
   * Store the result
   * Discard the chunk
   * Load the next chunk
 - To do this, use the *pandas function* **read_csv()**
 - Specify the chunk size with the **chunksize** argument
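
The load / process / store / discard cycle can be sketched with the standard library alone. This is not the pandas approach described above: it swaps in **itertools.islice** and an in-memory stand-in for the file (`fake_file`, a hypothetical name) so that the example runs without any CSV on disk.

```python
import io
from itertools import islice

# Hypothetical in-memory stand-in for a large CSV file with one column, x.
fake_file = io.StringIO("x\n1\n2\n3\n4\n5\n6\n7\n")
fake_file.readline()           # skip the header row

chunk_size = 3
total = 0

while True:
    # Load the next chunk: at most chunk_size lines
    chunk = list(islice(fake_file, chunk_size))
    if not chunk:              # nothing left to read
        break
    # Perform the operation on the chunk and store the result
    total += sum(int(line) for line in chunk)
    # The old chunk is discarded when `chunk` is rebound on the next pass

print(total)
```

Only `chunk_size` lines are held in memory at any time, which is the same idea that **chunksize** expresses in **read_csv()**.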

In [1]:
import pandas as pd

In [2]:
result = []

for chunk in pd.read_csv('data.csv', chunksize=1000):
    result.append(sum(chunk['x']))

total = sum(result)

print(total)

FileNotFoundError: [Errno 2] File b'data.csv' does not exist: b'data.csv'

In [3]:
total = 0

for chunk in pd.read_csv('data.csv', chunksize=1000):
    total += sum(chunk['x'])

print(total)

FileNotFoundError: [Errno 2] File b'data.csv' does not exist: b'data.csv'

In [4]:
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)

FileNotFoundError: [Errno 2] File b'tweets.csv' does not exist: b'tweets.csv'

In [5]:
# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

In [6]:
# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')

# Print result_counts
print(result_counts)


FileNotFoundError: [Errno 2] File b'tweets.csv' does not exist: b'tweets.csv'
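
The counting pattern inside `count_entries()` can also be written with **collections.Counter** from the standard library. The sketch below is an alternative, not the approach used above; it replaces the CSV chunks with slices of a small made-up list (`langs`) so that it runs without `tweets.csv`.

```python
from collections import Counter

# Hypothetical stand-in for the values of the 'lang' column
langs = ['en', 'et', 'en', 'und', 'en', 'et']

counts = Counter()

# Feed the counter two entries at a time, mimicking chunked reading
for start in range(0, len(langs), 2):
    chunk = langs[start:start + 2]
    counts.update(chunk)       # adds 1 for every entry in the chunk

print(dict(counts))
```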

# List Comprehensions

In [7]:
nums = [12, 8, 21, 3, 16]

new_nums = []

for num in nums:
    new_nums.append(num + 1)

print(new_nums)

[13, 9, 22, 4, 17]


In [8]:
new_nums = [num + 1 for num in nums]

In [9]:
print(new_nums)

[13, 9, 22, 4, 17]


![image.png](attachment:image.png)

List comprehension with **range()**

In [10]:
result = [num for num in range(11)]

print(result)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


Collapse for loops for building lists into a single line

Components:
 - Iterable
 - Iterator variable (represent members of iterable)
 - Output expression
 
You can also use a list comprehension in place of nested for loops.

In [21]:
pairs_l = []

for num1 in range(0,2):
    for num2 in range(6, 8):
        pairs_l.append((num1, num2))
        
print(pairs_l)

[(0, 6), (0, 7), (1, 6), (1, 7)]


How to do this with a list comprehension?

In [22]:
pairs_2 = [(num1, num2) for num1 in range(0,2) for num2 in range(6,8)]

print(pairs_2)

[(0, 6), (0, 7), (1, 6), (1, 7)]


The **tradeoff** is that we sacrifice some readability of the code.

You will have to consider if you would like to use list comprehensions in cases such as this. 


In [1]:
doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']

list_doc = [doc[0] for doc in doctor]

print(list_doc)

['h', 'c', 'c', 't', 'w']


In [2]:
# Create list comprehension: squares
squares = [i**2 for i in range(0,10)]

![image.png](attachment:image.png)

In [3]:
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]


## Advanced Comprehensions

- Conditionals in Comprehensions:

In [4]:
[num ** 2 for num in range(10) if num % 2 == 0]

[0, 4, 16, 36, 64]

Python documentation on the **% Operator**: 
 - The " *%* " (modulo) operator yields the remainder from the division of the first argument by the second. 

In [2]:
[num ** 2 if num % 2 == 0 else 0 for num in range(10)]

[0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
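
The two cells above place the condition in different spots, and the result differs accordingly. A small side-by-side sketch (variable names are illustrative):

```python
nums = range(6)

# `if` after the iterable filters: odd numbers are dropped entirely
filtered = [num ** 2 for num in nums if num % 2 == 0]

# `if ... else` in the output expression maps: every number yields a value
mapped = [num ** 2 if num % 2 == 0 else 0 for num in nums]

print(filtered)                      # [0, 4, 16]
print(mapped)                        # [0, 0, 4, 0, 16, 0]
print(len(filtered), len(mapped))    # 3 6
```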

### Dict Comprehensions 

- Create dictionaries
- Use curly braces {} instead of brackets

In [3]:
pos_neg = {num: -num for num in range(9)}

print(pos_neg)

print(type(pos_neg))

{0: 0, 1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7, 8: -8}
<class 'dict'>


- Conditionals on the output expression

![image.png](attachment:image.png)

In [6]:
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]

# Print the new list
print(new_fellowship)

['samwise', 'aragorn', 'legolas', 'boromir']


In [7]:
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member if len(member) >= 7 else "" for member in fellowship]

# Print the new list
print(new_fellowship)

['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']


In [8]:
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create dict comprehension: new_fellowship
new_fellowship = {member: len(member) for member in fellowship}

# Print the new dictionary
print(new_fellowship)

{'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}


## Generator expressions

- Recall list comprehension

In [9]:
[2 * num for num in range(10)]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [10]:
(2 * num for num in range(10))

<generator object <genexpr> at 0x000002323F37A660>

A **generator expression** is like a list comprehension except that it does not store the list in memory: it does not construct the list, but is an object we can iterate over to produce elements of the list as required.

- List comprehension - returns a list
- Generator expression - returns a generator object
- Both can be iterated over
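
One way to see the difference is to compare the size of the two objects with **sys.getsizeof()**. This is only a rough indication, since it measures the container itself and not the integers it refers to, and the exact numbers depend on the Python version.

```python
import sys

numbers_list = [2 * num for num in range(10000)]   # all values built now
numbers_gen = (2 * num for num in range(10000))    # values built on demand

print(sys.getsizeof(numbers_list))   # tens of thousands of bytes
print(sys.getsizeof(numbers_gen))    # small, independent of the range size

# Both produce the same values when iterated over
print(sum(numbers_list) == sum(numbers_gen))
```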

In [11]:
result = (num for num in range(6))

for num in result:
    print(num)

0
1
2
3
4
5


We could also pass a generator to the function list to create a list. 

In [12]:
result = (num for num in range(6))

print(list(result))

[0, 1, 2, 3, 4, 5]
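
Note that a generator can only be consumed once, which is why `result` is re-created in each cell. A short sketch:

```python
result = (num for num in range(6))

first_pass = list(result)    # consumes every value
second_pass = list(result)   # nothing left to produce

print(first_pass)            # [0, 1, 2, 3, 4, 5]
print(second_pass)           # []
```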


We can pass a generator to the function **next()** in order to iterate through its elements.

In [13]:
result = (num for num in range(6))

print(next(result))
print(next(result))
print(next(result))
print(next(result))

0
1
2
3


This is an example of something called **lazy evaluation** - the evaluation of the expression is delayed until its value is needed.
 - This helps when working with extremely large sequences, as you don't want to store the entire list in memory, which is what a list comprehension would do.
 - You want to generate elements of the sequence on the fly
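
As a sketch of generating elements on the fly, **itertools.islice** can take just the first few values from a generator over a range far too large to store. The name `huge` is illustrative.

```python
from itertools import islice

# A generator over a googol of values; nothing is computed yet
huge = (num * num for num in range(10 ** 100))

# Pull only the first five values
first_five = list(islice(huge, 5))
print(first_five)    # [0, 1, 4, 9, 16]

# The generator remembers where it stopped
print(next(huge))    # 25
```

Building the equivalent list comprehension would never finish, because it would try to create every element first.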

In [14]:
(num for num in range(10**1000000))

<generator object <genexpr> at 0x000002323F37A7C8>

In [15]:
even_nums = (num for num in range(10) if num % 2 == 0)

print(list(even_nums))

[0, 2, 4, 6, 8]


**Generator functions** are functions that, when called, produce generator objects.
 - Are written with the syntax of any other user-defined function.
 - Instead of returning values using the keyword return, they yield sequences of values using the keyword yield.

In [16]:
def num_sequence(n):
    """Generate values from 0 up to, but not including, n"""
    i = 0
    while i < n:
        yield i
        i += 1
        
result = num_sequence(5)
print(list(result))

[0, 1, 2, 3, 4]


In [17]:
print(type(result))

<class 'generator'>


In [20]:
result = num_sequence(5)
for item in result:
    print(item)

0
1
2
3
4
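
Because **yield** pauses the function and keeps its local variables until the next value is requested, a generator function can even describe a sequence with no end. The `fibonacci()` function below is a hypothetical illustration, not part of the course material.

```python
from itertools import islice

def fibonacci():
    """Yield Fibonacci numbers indefinitely."""
    a, b = 0, 1
    while True:
        yield a            # pause here and hand back the current value
        a, b = b, a + b    # resume here on the next request

fib = fibonacci()
print(next(fib))               # 0
print(next(fib))               # 1
print(list(islice(fib, 6)))    # [1, 2, 3, 5, 8, 13]
```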


A list comprehension produces a list as output; a generator expression produces a generator object.

In [21]:
# Create generator object: result
result = (num for num in range(0, 31))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


In [22]:
# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)

6
5
5
6
7


In [1]:
# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

6
5
5
6
7


## Wrap-up: comprehensions

**Re-cap: list comprehensions**

- Basic
 
 *[output expression for iterator variable in iterable]*
 
 
- Advanced
 
 *[output expression + conditional on output for iterator variable in iterable + conditional on iterable]*
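
A small sketch that uses both conditionals of the advanced form at once (the numbers are arbitrary):

```python
nums = range(10)

# Conditional on the iterable: keep only numbers greater than 3.
# Conditional on the output: square even numbers, negate odd ones.
result = [num ** 2 if num % 2 == 0 else -num for num in nums if num > 3]

print(result)    # [16, -5, 36, -7, 64, -9]
```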

In [1]:
# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']
# The extracted column in "tweet_time" here is a Series data structure!

# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time]

# Print the extracted times
print(tweet_clock_time)

NameError: name 'df' is not defined

In [2]:
# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']

# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']

# Print the extracted times
print(tweet_clock_time)

NameError: name 'df' is not defined
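
The cells above fail because `df` is not defined in this session. The same slicing can be tried on a plain list of strings. The timestamps below are made up, and their layout (`'Tue Mar 29 23:40:17 +0000 2016'`) is an assumption about what the `created_at` column contains.

```python
# Hypothetical timestamps standing in for df['created_at']
tweet_time = [
    'Tue Mar 29 23:40:17 +0000 2016',
    'Tue Mar 29 23:40:19 +0000 2016',
    'Tue Mar 29 23:41:19 +0000 2016',
]

# Characters 11 to 18 hold the clock time HH:MM:SS
tweet_clock_time = [entry[11:19] for entry in tweet_time]

# Keep only the times whose seconds (characters 17 and 18) equal '19'
ending_in_19 = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']

print(tweet_clock_time)    # ['23:40:17', '23:40:19', '23:41:19']
print(ending_in_19)        # ['23:40:19', '23:41:19']
```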

## Generators for the large data limit

- Use a generator to load a file line by line
- Works on streaming data! 
  * If new lines are being written to the file you're reading, this method will keep on reading and processing the file until there are no lines left for it to read. 
- Read and process the file until all lines are exhausted


In [1]:
def num_sequence(n):
    """Generate values from 0 up to, but not including, n."""
    i = 0
    while i < n:
        yield i
        i += 1

In [2]:
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(0, 1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

FileNotFoundError: [Errno 2] No such file or directory: 'world_dev_ind.csv'

**Lazy evaluation** is useful when you have to deal with very large datasets because it lets you generate values efficiently by *yielding* only chunks of data at a time instead of the whole thing at once.

In [3]:
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

FileNotFoundError: [Errno 2] No such file or directory: 'world_dev_ind.csv'

Note that since a **file object** is already an *iterator* that yields its lines one at a time, you don't have to explicitly create a generator for it.

It is still good to practice how to create generators.
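
A file object can be looped over directly, line by line, without a helper such as `read_large_file()`. The sketch below uses **io.StringIO** as an in-memory stand-in for the file so that it runs without `world_dev_ind.csv`; the rows and column names are made up.

```python
import io

# Hypothetical in-memory stand-in for world_dev_ind.csv
fake_file = io.StringIO(
    "CountryName,Year\n"
    "Arab World,1960\n"
    "Arab World,1961\n"
    "Caribbean small states,1960\n"
)

counts_dict = {}

next(fake_file)                  # skip the column names

# The file object yields one line at a time
for line in fake_file:
    first_col = line.split(',')[0]
    counts_dict[first_col] = counts_dict.get(first_col, 0) + 1

print(counts_dict)
```

With a real file, the same loop would sit inside a `with open(...) as file:` block.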

In [4]:
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

FileNotFoundError: [Errno 2] No such file or directory: 'world_dev_ind.csv'