<h1><center> PPOLS564: Foundations of Data Science </center></h1>
<h3><center> Lecture 6 <br><br><font color='grey'> Iterables/Iterators and Reading Files </font></center></h3>

# Iterators

We encountered the concept of iteration in the last lecture when we were introduced to the `for` loop.

- **Iteration** in a basic sense is simply taking one item at a time from a collection. We start at the beginning of the collection and move through it until we reach the end. Any time we use a loop, we go over each item in the collection.

**Empty container**

```
                ___________
                |         |
                |         |
                |         |
                |_________|
```


**Assign items to the container**

```
                ___________
                | "apple" |
                | "orange"|
                | "grapes"|
                |_________|
```


**For each iteration, we take one item out and do something with it.**
```
    
                    \
                     \
       eat("apple")   \
                       \
                        \__
                |         |
                | "orange"|
                | "grapes"|
                |_________|
```

```
    
                    \
                     \
       eat("orange")  \
                       \
                        \__
                |         |
                |         |
                | "grapes"|
                |_________|
```


**We _stop_ once the container is empty**
```
    
                    \
                     \
       eat()          \
                       \
                        \__
                |         |
                |         |
                |         |
                |_________|
```                

In [1]:
# create a list of items
actions = ["read","write","relax","talk"]

for action in actions:
    print(action)

- An **iterable** is any object that defines the `__iter__()` method.

This method returns an "**iterator**" object.

In [2]:
actions.__iter__()

<list_iterator at 0x10df390f0>

We can also use the built-in `iter()` function, which calls this method for us.

In [3]:
iter(actions)

<list_iterator at 0x10df39550>

An **iterator** object has a special method called `__next__()`, which summons each item in the collection one at a time.

In [4]:
next(iter(actions))

'read'

In [5]:
# Or just using the methods
actions.__iter__().__next__()

'read'

**So what does a `for` loop do?**

1. The `for` statement calls `iter()` on the container object (e.g. a list). 
2. The function returns an iterator object that defines the method `__next__()` which accesses elements in the container _one at a time_. 
3. When there are no more elements, `__next__()` raises a `StopIteration` exception which tells the for loop to terminate.
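
The three steps above can be sketched as a `while` loop. This is only an illustrative approximation of what the `for` statement does, not the interpreter's actual implementation, and the variable `seen` is made up for the example.

```python
actions = ["read", "write", "relax", "talk"]

# Step 1: ask the container for an iterator
iterator = iter(actions)

seen = []
while True:
    try:
        # Step 2: pull the next item out of the container
        action = next(iterator)
    except StopIteration:
        # Step 3: no items are left, so leave the loop
        break
    seen.append(action)
    print(action)
```

Written this way, it is easier to see that the loop never indexes into the list; it only keeps asking the iterator for the next item.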

In [6]:
# Step 1: turn the iterable into an iterator
iterator = iter(actions)
iterator

<list_iterator at 0x10df39780>

In [7]:
# Step 2: call __next__() to retrieve items one at a time from the container
next(iterator)

'read'

In [8]:
next(iterator)

'write'

In [9]:
next(iterator)

'relax'

In [10]:
next(iterator)

'talk'

In [11]:
# Step 3: stop once there are no more items in the container
next(iterator)

StopIteration: 
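
If we would rather not trigger the `StopIteration` error when calling `next()` by hand, the built-in `next()` accepts an optional second argument that is returned once the iterator is exhausted. A small sketch, using a shortened list for brevity:

```python
actions = ["read", "write"]
iterator = iter(actions)

first = next(iterator)
second = next(iterator)

# The iterator is now exhausted; the second argument (here None)
# is returned instead of raising StopIteration.
third = next(iterator, None)

print(first, second, third)
```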

In sum, an iterable is an object with things in it that we can pull out one at a time. An iterator is an object that tells us _how to_ pull out each item one at a time. Specifically, it tells us how to use `next()` (what's the next thing I should draw, and how should I draw it?).

> Consult the [python documentation](https://docs.python.org/3/tutorial/classes.html#iterators) for a more detailed discussion
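
To make the two methods concrete, here is a minimal sketch of a hand-written iterator. The class `Countdown` is hypothetical and exists only for illustration; it defines both `__iter__()` and `__next__()`, so a `for` loop (or `list()`) can consume it.

```python
class Countdown:
    """Hypothetical iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself when asked for an iterator
        return self

    def __next__(self):
        # Signal the end of the iteration once we reach zero
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


result = list(Countdown(3))
print(result)
```

Because this object is its own iterator, it can only be traversed once; a list, by contrast, hands out a fresh iterator every time `iter()` is called on it.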

**Keep in mind that not all objects are iterable** (i.e. have an `__iter__` method). This is one of the main distinctions between scalar types and container types.

In [12]:
x = 3
x.__iter__()

AttributeError: 'int' object has no attribute '__iter__'

In [13]:
x = [3]
x.__iter__()

<list_iterator at 0x10dfcfd68>
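
Rather than calling `__iter__()` directly, a common way to test whether an object is iterable is to pass it to `iter()` and catch the `TypeError` raised for non-iterables. The helper name `is_iterable` below is an assumption made for this sketch, not a built-in.

```python
def is_iterable(obj):
    """Return True if iter() accepts the object (hypothetical helper)."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


for obj in (3, 3.5, [3], "abc", {"a": 1}):
    print(repr(obj), is_iterable(obj))
```

Note that strings count as iterable here: iterating over a string yields its characters one at a time.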

# Reading Files

### `open()`

Now, let's open a text file (`news-story.txt`) in Python.

The built-in `open()` function opens files on our system. The function takes the following arguments:

- a file path
- a mode describing how to treat the file (e.g. read the file, write to the file, append to the file, etc.). The default is read mode (`"r"`).
- an encoding. The default is platform-dependent (often UTF-8), so it is safest to specify it explicitly.

In [16]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')

`open()` returns a special object type, `_io.TextIOWrapper`. Note that a file-like object is loosely defined in Python. Again, we see duck typing in action: if it looks like a file and behaves like a file then, heck, it's probably a file.

In [17]:
type(file)

_io.TextIOWrapper

In [18]:
print(file.read())

Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.

If confirmed, they would be the first such air strikes there in three weeks.

Earlier, US President Donald Trump warned Syria's Bashar al-Assad against launching a "reckless attack" on Idlib.


Five reasons why the battle for Idlib matters
Why is there a war in Syria?
Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.

The UN has warned of a humanitarian catastrophe if an all-out assault takes place.

The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.

He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".

Mr de Mistura also welcomed Mr Trump

In [19]:
print(file.read()) # Once we've read to the end of the file, another read() returns an empty string




----
### `close()`

Once we are done with a file, we need to close it.

In [20]:
file.close()

Opening and forgetting to close files can lead to a bunch of issues, mainly the mismanagement of computational resources on your machine.

Moreover, when writing, `close()` flushes any buffered text to disk, so it is what guarantees that everything we wrote actually ends up in the file.

___
### Methods available when reading in files

**<center>Attributes and methods in object type `TextIOWrapper`</center>**

| Attribute / Method | Description |
|:---------:|:---------:|
|**`._CHUNK_SIZE`**| Internal integer attribute (implementation detail). |
|**`._finalizing`**| Internal boolean attribute (implementation detail). |
|**`.buffer`**| The underlying binary buffer that the text wrapper reads from and writes to. |
|**`.closed`**| Boolean: `True` if the file has been closed. |
|**`.encoding`**| String: the name of the encoding used for the text. |
|**`.errors`**| String: the error-handling setting for encoding and decoding errors. |
|**`.line_buffering`**| Boolean: whether line buffering is enabled. |
|**`.mode`**| String: the mode the file was opened with. |
|**`.name`**| String: the name (path) of the file. |
|**`.readlines()`**| Return a list of lines from the stream. |
|**`.reconfigure()`**| Reconfigure the text stream with new parameters. |
|**`.write_through`**| Boolean: whether writes are passed immediately to the underlying buffer. |

In [21]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')
file.readlines() # read all lines into a list

['Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.\n',
 '\n',
 'If confirmed, they would be the first such air strikes there in three weeks.\n',
 '\n',
 'Earlier, US President Donald Trump warned Syria\'s Bashar al-Assad against launching a "reckless attack" on Idlib.\n',
 '\n',
 '\n',
 'Five reasons why the battle for Idlib matters\n',
 'Why is there a war in Syria?\n',
 'Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.\n',
 '\n',
 'The UN has warned of a humanitarian catastrophe if an all-out assault takes place.\n',
 '\n',
 'The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.\n',
 '\n',
 'He said telephone talks between Russian President Vladimir Putin and his Turkish counterp

In [22]:
# Is the file closed?
file.closed

False

-----

### File `mode`s

|Mode|Description|
|---|-------|
| r | open for reading (default) |
|w | open for writing |
|x  | open for exclusive creation, failing if the file already exists |
|a | open for writing, appending to the end of the file if it exists |
|b | binary mode |
|t | text mode (default) |

Examples:

- `mode = 'rb'` &rarr; "read binary"
- `mode = 'wt'` &rarr; "write text"
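
A short sketch of how some of these modes behave in practice. It works inside a temporary directory so that no real files are touched; the file name `mode_demo.txt` is made up for the example.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mode_demo.txt")  # hypothetical file name

    # 'wt': write text
    f = open(path, mode="wt", encoding="utf-8")
    f.write("hello\n")
    f.close()

    # 'rb': read binary, which returns bytes rather than str
    f = open(path, mode="rb")
    raw = f.read()
    f.close()

    # 'xt': exclusive creation fails because the file already exists
    try:
        f = open(path, mode="xt", encoding="utf-8")
    except FileExistsError:
        exclusive_failed = True
    else:
        f.close()
        exclusive_failed = False

print(type(raw), exclusive_failed)
```

Exclusive creation is useful when we want to be sure we never overwrite an existing file by accident, which plain `'w'` mode would do silently.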

In [23]:
f = open('news-story.txt',mode="rt",encoding='utf-8')

# Print the mode
print(f.mode)

f.close()

rt


----
### Writing files

In [24]:
f = open('text_file.txt',mode="wt",encoding='utf-8')
f.write('This is an example\n') 
f.write('Of writing a file...\n')
f.write('Neat!\n')
f.close()

> **NOTE: writes are buffered, so you _must_ `close()` (or `flush()`) the file to guarantee that your lines are written to it.**

Now, read the file back in using "read mode".

In [25]:
f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close()

This is an example
Of writing a file...
Neat!



We can even batch write using a container.

In [26]:
sent = "This is a sentence.".split()
print(sent)

['This', 'is', 'a', 'sentence.']


In [27]:
# Note here I'm opening the file in "append mode"
f = open('text_file.txt',mode="at",encoding='utf-8')
f.writelines(sent)
f.close()

In [28]:
f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close()

This is an example
Of writing a file...
Neat!
Thisisasentence.


Note that `writelines()` does not add separators or line breaks between items; `\n` is the line-break character, and we have to add it ourselves.

In [29]:
sent2 = []
for word in sent:
    new_word = word + "\n"
    sent2.append(new_word)
print(sent2)

['This\n', 'is\n', 'a\n', 'sentence.\n']


In [30]:
# Open the file, and write our new sentence list object
f = open('text_file.txt',mode="at",encoding='utf-8')
f.writelines(sent2)
f.close()

f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close()

This is an example
Of writing a file...
Neat!
Thisisasentence.This
is
a
sentence.
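
Since `writelines()` accepts any iterable of strings, the newline can also be attached on the fly instead of building a second list first. A minimal sketch, written to a temporary directory with a made-up file name:

```python
import os
import tempfile

sent = "This is a sentence.".split()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_file_demo.txt")  # hypothetical file name

    f = open(path, mode="wt", encoding="utf-8")
    # The generator expression adds "\n" to each word as it is written
    f.writelines(word + "\n" for word in sent)
    f.close()

    f = open(path, mode="rt", encoding="utf-8")
    lines = f.readlines()
    f.close()

print(lines)
```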



---
### Iterating over files
We'll note when looking at the object's attributes that there are `__iter__()` and `__next__()` methods, meaning we can iterate over the open file object line by line.

In [31]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')
for line in file:
    print(line)
file.close()

Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.



If confirmed, they would be the first such air strikes there in three weeks.



Earlier, US President Donald Trump warned Syria's Bashar al-Assad against launching a "reckless attack" on Idlib.






Five reasons why the battle for Idlib matters

Why is there a war in Syria?

Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.



The UN has warned of a humanitarian catastrophe if an all-out assault takes place.



The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.



He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".



Mr de Mistura also

In [32]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')
for line in file:
    if line == '\n':
        continue
    print(line)        
file.close()

Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.

If confirmed, they would be the first such air strikes there in three weeks.

Earlier, US President Donald Trump warned Syria's Bashar al-Assad against launching a "reckless attack" on Idlib.


Five reasons why the battle for Idlib matters

Why is there a war in Syria?

Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.

The UN has warned of a humanitarian catastrophe if an all-out assault takes place.

The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.

He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".

Mr de Mistura also welcomed Mr Tru

In [33]:
# Example: How many words are in each line?

file = open("news-story.txt",mode='rt',encoding='UTF-8')

for line in file:
    if line == '\n':
        continue
    n_words_per_line = len(line.split())
    print(n_words_per_line)
    
file.close()

21
14
16
23
8
7
30
14
22
21
18
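
The test `line == '\n'` only skips lines that consist of exactly one newline, so a line containing stray spaces would still be processed. One possible alternative is to check whether anything is left after `str.strip()`. The sketch below uses `io.StringIO` as an in-memory stand-in for an open file, and its contents are made up.

```python
import io

# In-memory stand-in for an open text file (contents are made up)
fake_file = io.StringIO("first line here\n\n   \nsecond line\n")

word_counts = []
for line in fake_file:
    # strip() removes surrounding whitespace, including the trailing "\n";
    # an empty result means the line was blank
    if not line.strip():
        continue
    word_counts.append(len(line.split()))

print(word_counts)
```

`StringIO` is itself an example of duck typing: it is not a file on disk, but it can be iterated line by line just like one.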


--------

### `with`: beyond opening and closing with context managers

As you'll note, the need to `open()` and `close()` files can get a bit redundant after a while. This issue of closing after opening to deal with resource cleanup is common enough that Python has a special protocol for it: context managers, used via the `with` code block. The file is closed automatically when the block ends, even if an error occurs inside it.

In [34]:
with open("news-story.txt",mode='rt',encoding='UTF-8') as file:
    for line in file:
        if line == '\n':
            continue
        n_words_per_line = len(line.split())
        print(n_words_per_line)

21
14
16
23
8
7
30
14
22
21
18


In [35]:
file.closed

True
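
Roughly speaking, a `with` block does the same job as wrapping the file operations in `try`/`finally`. The sketch below shows both forms side by side; it is a simplified illustration rather than the exact expansion Python performs, and the file name is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical file name

    # Manual version: finally guarantees the close
    f = open(path, mode="wt", encoding="utf-8")
    try:
        f.write("some text\n")
    finally:
        # Runs whether or not the write raised an error
        f.close()
    closed_after_finally = f.closed

    # Context-manager version: the close happens when the block ends
    with open(path, mode="rt", encoding="utf-8") as f2:
        text = f2.read()
    closed_after_with = f2.closed

print(closed_after_finally, closed_after_with, repr(text))
```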

---
## Reading Comma Separated Values (CSV)

See the [Python documentation](https://docs.python.org/3/library/csv.html) for more on the `csv` module located in the standard library.

In [36]:
import csv

Reading in .csv data 

In [37]:
with open("student_data.csv",mode='rt') as file:
    data = csv.reader(file)
    for row in data:
        print(row)

['Student', 'Grade']
['Susan', 'A']
['Sean', 'B-']
['Cody', 'A-']
['Karen', 'B+']


Writing csv data

In [38]:
# Student data as a nested list.
student_data = [["Student","Grade"],
                ["Susan","A"],
                ["Sean","B-"],
                ["Cody","A-"],
                ["Karen",'B+']]

# Write the rows with the .writerows() method
# (newline='' lets the csv module control line endings, avoiding blank rows on Windows)
with open("student_data.csv",mode='w',newline='') as file:
    csv_file = csv.writer(file)
    csv_file.writerows(student_data)

### Reading csv files as dictionaries

`csv.DictReader()` and `csv.DictWriter()` represent each row as a dictionary keyed by the column names in the header. Here our variable names operate as keys that we can easily reference.

In [39]:
with open("student_data.csv", 'r') as file:
    csv_file = csv.DictReader(file)
    for row in csv_file:
        print(row)

OrderedDict([('Student', 'Susan'), ('Grade', 'A')])
OrderedDict([('Student', 'Sean'), ('Grade', 'B-')])
OrderedDict([('Student', 'Cody'), ('Grade', 'A-')])
OrderedDict([('Student', 'Karen'), ('Grade', 'B+')])


In [40]:
with open("student_data.csv", 'r') as file:
    csv_file = csv.DictReader(file)
    for row in csv_file:
        print(f"{row['Student']} received a {row['Grade']} in the course")

Susan received a A in the course
Sean received a B- in the course
Cody received a A- in the course
Karen received a B+ in the course


Writing csv file types as dictionaries

In [41]:
with open("student_data.csv", 'w', newline='') as file:
    variable_names = ["Student","Grade"]
    csv_file = csv.DictWriter(file, fieldnames=variable_names)

    csv_file.writeheader()
    for student in student_data[1:]:
        csv_file.writerow({'Student':student[0],'Grade':student[1]})

### Dealing with different delimiters

In a csv, commas are used to separate values, but we could just as easily use something else to separate values.

In [42]:
with open("student_data.csv", 'r', newline='') as file:
    csv_file = csv.reader(file, delimiter=",")  # comma separated values

    with open("only_student_data.csv", 'w', newline='') as new_file:
        new_csv_file = csv.writer(new_file, delimiter="\t")  # tab separated values

        for row in csv_file:
            new_csv_file.writerow(row)  # write each row, now tab separated
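
To read a tab-separated file back in, the reader needs to be told about the delimiter as well. The sketch below writes a small made-up table to a temporary file and reads it twice, once with the matching delimiter and once with the default comma, to show what goes wrong when they do not match.

```python
import csv
import os
import tempfile

rows = [["Student", "Grade"], ["Susan", "A"], ["Sean", "B-"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "student_data_demo.tsv")  # hypothetical file name

    # newline='' lets the csv module handle line endings itself
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

    # Matching delimiter: each row is split into its two fields
    with open(path, mode="r", newline="", encoding="utf-8") as f:
        rows_back = list(csv.reader(f, delimiter="\t"))

    # Default delimiter (","): each line comes back as a single field
    with open(path, mode="r", newline="", encoding="utf-8") as f:
        rows_wrong = list(csv.reader(f))

print(rows_back)
print(rows_wrong)
```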

# Applied Example

Examining the relationship between one's age and one's likelihood of voting using the 2008 National Election Survey data.

In [None]:
import csv # standard library module for reading a csv
data = []
with open("nes_2018_age-voted.csv") as f:
    for row in csv.reader(f):
        data.append(row)
data[:10]        

Convert to tuple pairs.

In [None]:
dat_tup = []
for row in data[1:]:
    dat_tup.append(tuple(row))
dat_tup[20:40]    

Drop missing values.

In [None]:
dat_tup2 = []
for v, a in dat_tup:
    if v == "NA" or a=="NA":
        continue
    dat_tup2.append((int(v),int(a)))

dat_tup2[20:40]   

Analyze

In [None]:
# Average age of those who vote
age_voted = []
age_no_vote = []
for voted, age in dat_tup2:
    if voted == 1:
        age_voted.append(age)
    else:
        age_no_vote.append(age)

voted_ave_age = sum(age_voted)/len(age_voted) 
no_vote_ave_age = sum(age_no_vote)/len(age_no_vote) 
voted_ave_age = round(voted_ave_age,2)
no_vote_ave_age = round(no_vote_ave_age,2)

print(f'''
Average age of those who voted: {voted_ave_age}
Average age of those who didn't vote: {no_vote_ave_age}
''')

**Our hypothesis**: fifty years of voting theory is correct, and the relationship between voting and age is curvilinear. The older you get, the more likely you are to vote, until you're too old to vote.

In [None]:
young = []
middle_aged = []
elderly = []
for voted, age in dat_tup2:
    if age <= 35:
        young.append(voted)
    elif 35 < age <= 80:
        middle_aged.append(voted)
    else: 
        elderly.append(voted)

Calculate proportions of those who voted

In [None]:
prop_young = round(sum(young)/len(young),3)
prop_middle_aged = round(sum(middle_aged)/len(middle_aged),3)
prop_elderly = round(sum(elderly)/len(elderly),3)
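
The three calculations above repeat the same formula and would raise a `ZeroDivisionError` if any age group happened to be empty. One way to guard against that is a small helper function. This is a sketch under assumptions: the name `proportion_voted` and the sample pairs are made up, and the age cut-offs simply mirror the ones used above.

```python
def proportion_voted(votes):
    """Share of 1s in a list of 0/1 values; None if the list is empty."""
    if not votes:
        return None
    return round(sum(votes) / len(votes), 3)


# Made-up (voted, age) pairs, for illustration only
sample = [(1, 22), (0, 30), (1, 45), (1, 60), (0, 85), (1, 90)]

groups = {"young": [], "middle_aged": [], "elderly": []}
for voted, age in sample:
    if age <= 35:
        groups["young"].append(voted)
    elif age <= 80:
        groups["middle_aged"].append(voted)
    else:
        groups["elderly"].append(voted)

proportions = {name: proportion_voted(votes) for name, votes in groups.items()}
print(proportions)
```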

Generate a table to print results

In [None]:
title = "Proportion who voted by age grouping"

border = "="*len(title)

print(f'''
{border}
{title}
{border}
Age Category             prop. voted
{border.replace("=","-")}
Young (18 to 35):              {prop_young:.1%}
Middle Aged (36 to 80):        {prop_middle_aged:.1%}
Elderly (over 80):             {prop_elderly:.1%}
{border}
''')

Anything odd about these voting numbers?