* [Introduction to iterators](#Iti)
* [Iterators vs Iterables](#IvI)
* [Iterating over iterables (1)](#Ioi()
* [Iterating over iterables (2)](#Ioi()
* [Iterators as function arguments](#Iafa)
* [Playing with iterators](#Pwi)
* [Using enumerate](#Ue)
* [Using zip](#Uz)
* [Using * and zip to 'unzip'](#U*azt)
* [Using iterators to load large files into memory](#Uitllfim)
* [Processing large amounts of Twitter data](#PlaoTd)
* [Extracting information for large amounts of Twitter data](#EiflaoTd)
* [Congratulations!!](#C)
* [List comprehensions](#Lc)
* [Write a basic list comprehension](#Wablc)
* [List comprehension over iterables](#Lcoi)
* [Writing list comprehensions](#Wlc)
* [Nested list comprehensions](#Nlc)
* [Advanced comprehensions](#Ac)
* [Using conditionals in comprehensions (1)](#Ucic()
* [Using conditionals in comprehensions (2)](#Ucic()
* [Dict comprehensions](#Dc)
* [Introduction to generator expressions](#Itge)
* [List comprehensions vs generators](#Lcvg)
* [Write your own generator expressions](#Wyoge)
* [Changing the output in generator expressions](#Ctoige)
* [Build a generator](#Bag)
* [Wrapping up comprehensions and generators.](#Wucag)
* [List comprehensions for time-stamped data](#Lcftd)
* [Conditional list comprehensions for time-stamped data](#Clcftd)
* [Welcome to the case study!](#Wttcs)
* [Dictionaries for data science](#Dfds)
* [Writing a function to help you](#Wafthy)
* [Using a list comprehension](#Ualc)
* [Turning this all into a DataFrame](#TtaiaD)
* [Using Python generators for streaming data](#UPgfsd)
* [Processing data in chunks (1)](#Pdic()
* [Writing a generator to load data in chunks (2)](#Wagtldic()
* [Writing a generator to load data in chunks (3)](#Wagtldic()
* [Using pandas `read_csv` iterator for streaming data](#Uprifsd)
* [Writing an iterator to load data in chunks (1)](#Waitldic()
* [Writing an iterator to load data in chunks (2)](#Waitldic()
* [Writing an iterator to load data in chunks (3)](#Waitldic()
* [Writing an iterator to load data in chunks (4)](#Waitldic()
* [Writing an iterator to load data in chunks (5)](#Waitldic()


<p id ='Iti'><p>
### Introduction to iterators

<p id ='IvI'><p>
### Iterators vs Iterables
* Iterable
    * Examples: lists, strings, dictionaries, file connections
    * Applying `iter()` to iterables creates an iterator
* Iterator
    * Produces the next value with `next()`
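
As a minimal sketch of the distinction above (the variable names are illustrative, not part of the exercises): a list is an iterable, `iter()` returns an iterator over it, and `next()` raises `StopIteration` once the iterator has nothing left.

```python
flash = ['jay garrick', 'barry allen']   # an iterable (illustrative data)
it = iter(flash)                          # an iterator over the list

first = next(it)     # 'jay garrick'
second = next(it)    # 'barry allen'

# A third call has nothing left to return, so next() raises StopIteration.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# An iterator is its own iterator; a list is not.
print(iter(it) is it)        # True
print(iter(flash) is flash)  # False
```

A `for` loop performs the same steps behind the scenes: it calls `iter()` once, then `next()` repeatedly until `StopIteration` is raised.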
    

In [14]:
mylist = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
print(mylist)

['jay garrick', 'barry allen', 'wally west', 'bart allen']


In [15]:
myiter = iter(mylist)
print(myiter)

<list_iterator object at 0x10bcda278>


In [16]:
next(myiter)

'jay garrick'

<p id ='Ioi('><p>
### Iterating over iterables (1)

In [18]:
small_value  = iter(range(0,3))
small_value

<range_iterator at 0x10bb57900>

In [19]:
# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))


0
1
2


<p id ='Ioi('><p>
### Iterating over iterables (2)

In [20]:
# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))


0
1
2
3
4


<p id ='Iafa'><p>
### Iterators as function arguments

In [24]:
values = range(10, 20)
print(values)
print(list(values))
print(sum(values))

range(10, 20)
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
145
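
One detail worth keeping in mind when passing these objects to functions: a `range` is an iterable that can be traversed repeatedly, while an iterator created from it can be consumed only once. A small sketch (the variable names are illustrative):

```python
values = range(10, 20)

# A range is a re-usable iterable: every call to sum() starts a fresh pass.
total_first = sum(values)    # 145
total_second = sum(values)   # 145

# An iterator keeps its position, so a second pass finds nothing left.
values_iter = iter(values)
iter_first = sum(values_iter)    # 145
iter_second = sum(values_iter)   # 0

print(total_first, total_second, iter_first, iter_second)
```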


<p id ='Pwi'><p>
### Playing with iterators

<p id ='Ue'><p>
### Using enumerate

In [32]:
mutants = ['charles xavier', 
            'bobby drake', 
            'kurt wagner', 
            'max eisenhardt', 
            'kitty pryde']


Generate `(index, value)` tuples using `enumerate()`. The result is a lazy `enumerate` object; pass it to `list()` to turn it into a list.

In [33]:
mutant_list = enumerate(mutants)
mutant_list

<enumerate at 0x10bcddc60>

In [34]:
for index1, value1 in mutant_list:
    print(index1, value1)

0 charles xavier
1 bobby drake
2 kurt wagner
3 max eisenhardt
4 kitty pryde


In [35]:
for index1, value1 in enumerate(mutants, start = 1):
    print(index1, value1)

1 charles xavier
2 bobby drake
3 kurt wagner
4 max eisenhardt
5 kitty pryde


<p id ='Uz'><p>
### Using zip

In [50]:
mutants

['charles xavier',
 'bobby drake',
 'kurt wagner',
 'max eisenhardt',
 'kitty pryde']

In [51]:
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy', 'thermokinesis', 'teleportation', 'magnetokinesis', 'intangibility']

In [59]:
for value1, value2, value3 in zip(mutants, aliases, powers):
    print(value1, value2, value3)

charles xavier prof x telepathy
bobby drake iceman thermokinesis
kurt wagner nightcrawler teleportation
max eisenhardt magneto magnetokinesis
kitty pryde shadowcat intangibility


In [58]:
list(zip(mutants, aliases, powers))

[('charles xavier', 'prof x', 'telepathy'),
 ('bobby drake', 'iceman', 'thermokinesis'),
 ('kurt wagner', 'nightcrawler', 'teleportation'),
 ('max eisenhardt', 'magneto', 'magnetokinesis'),
 ('kitty pryde', 'shadowcat', 'intangibility')]

<p id ='U*azt'><p>
### Using * and zip to unzip
* The `*` operator unpacks an iterable, such as a list or a tuple, into positional arguments in a function call.
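
A minimal sketch of `*` unpacking in a function call. The helper `describe` is hypothetical and exists only for this illustration:

```python
def describe(name, power):
    """Hypothetical helper used only for this illustration."""
    return f"{name}: {power}"

pair = ('kurt wagner', 'teleportation')

# describe(*pair) is equivalent to describe(pair[0], pair[1]).
line = describe(*pair)
print(line)   # kurt wagner: teleportation

# zip(*pairs) passes each tuple as a separate argument, which transposes them.
pairs = [('charles xavier', 'telepathy'), ('bobby drake', 'thermokinesis')]
names, abilities = zip(*pairs)
print(names)       # ('charles xavier', 'bobby drake')
print(abilities)   # ('telepathy', 'thermokinesis')
```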

In [66]:
z1 = zip(mutants, powers)

In [67]:
print(*z1)

('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pryde', 'intangibility')


In [71]:
# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

In [72]:
# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)

In [74]:
result1

('charles xavier',
 'bobby drake',
 'kurt wagner',
 'max eisenhardt',
 'kitty pryde')

In [75]:
result2

('telepathy',
 'thermokinesis',
 'teleportation',
 'magnetokinesis',
 'intangibility')

<p id ='Uitllfim'><p>
### Using iterators to load large files into memory

Sometimes the data we have to process is too large to fit in a computer's memory. This is a common problem for data scientists. One solution is to process the data source chunk by chunk instead of loading it all at once.
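
One way to picture chunk-by-chunk processing without pandas is the sketch below. `read_in_chunks` is a hypothetical helper and `io.StringIO` stands in for a large file on disk; neither comes from the original exercises.

```python
import io
from itertools import islice

def read_in_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size items from any iterable of lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        yield chunk

# io.StringIO stands in for a large file on disk (illustrative data).
fake_file = io.StringIO("en\nen\nund\nen\net\n")

chunk_sizes = []
for chunk in read_in_chunks(fake_file, chunk_size=2):
    # Only the current chunk is held in memory at this point.
    chunk_sizes.append(len(chunk))

print(chunk_sizes)   # [2, 2, 1]
```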



<p id ='PlaoTd'><p>
### Processing large amounts of Twitter data: reading part of a DataFrame
* Note: this example does not actually use a large amount of data.

In [77]:
import pandas as pd

In [82]:
count_dict = {}

In [83]:
more ./data/tweets.csv

In [86]:
# Iterate over the file chunk by chunk

for chunk in pd.read_csv('./data/tweets.csv', chunksize= 10):
    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in count_dict.keys():
            count_dict[entry] += 1
        else:
            count_dict[entry] = 1


In [88]:
count_dict

{'en': 97, 'et': 1, 'und': 2}
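
The if/else counting pattern above can also be expressed with `collections.Counter` from the standard library. The sketch below uses a short illustrative list in place of `chunk['lang']`, and the two `update()` calls play the role of two chunks:

```python
from collections import Counter

# Illustrative stand-in for the values of chunk['lang'].
langs = ['en', 'en', 'und', 'en', 'et']

counts = Counter()
# update() can be called once per chunk; counts accumulate across calls.
counts.update(langs[:3])
counts.update(langs[3:])

print(dict(counts))   # {'en': 3, 'und': 1, 'et': 1}
```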

In [91]:
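# For comparison, read the whole file at once instead of in chunks.
# Note: count_dict is not reset in this cell, so these counts are added
# to whatever count_dict already holds.

df = pd.read_csv('./data/tweets.csv')
for entry in df['lang']:
    if entry in count_dict.keys():
        count_dict[entry] += 1
    else:
        count_dict[entry] = 1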




In [93]:
df.shape

(100, 31)

In [95]:
matrix = [[col for col in range(5)] for i in range(5)]
matrix

[[0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4]]

<p id ='Ac'><p>
### Advanced comprehensions

In [96]:
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']


<p id ='Ucic('><p>
### Using conditionals in comprehensions (1)

In [99]:
# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member)>=7]

# Print the new list
print(new_fellowship)


['samwise', 'aragorn', 'legolas', 'boromir']


<p id ='Ucic('><p>
### Using conditionals in comprehensions (2)

In [100]:
new_fellowship = [member if len(member)>=7 else '' for member in fellowship]
new_fellowship

['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']
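
The two forms differ in where the condition sits: an `if` after the `for` clause filters items out, while an `if`/`else` in the output expression keeps every item and changes its value. A short side-by-side sketch:

```python
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Condition on the iterable: short names are dropped.
filtered = [member for member in fellowship if len(member) >= 7]

# Condition on the output: every position is kept, short names become ''.
replaced = [member if len(member) >= 7 else '' for member in fellowship]

print(len(filtered), len(replaced))   # 4 7
```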

<p id ='Dc'><p>
### Dict comprehensions
Comprehensions are not limited to lists. Many other objects can be built with them, including dictionaries, which are pervasive in data science. In this exercise you create a dictionary using the comprehension syntax; such a comprehension is called a dict comprehension. Its output expression has the form:

`<key> : <value>`

In [102]:
fellowship

['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

In [104]:
new_fellowship = {member: len(member) for member in fellowship}
new_fellowship

{'frodo': 5,
 'samwise': 7,
 'merry': 5,
 'aragorn': 7,
 'legolas': 7,
 'boromir': 7,
 'gimli': 5}
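
A dict comprehension accepts the same optional filter clause as a list comprehension. A minimal sketch (the name `long_names` is illustrative):

```python
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# The key: value pair goes before the for clause; the if clause filters items.
long_names = {member: len(member) for member in fellowship if len(member) >= 7}

print(long_names)   # {'samwise': 7, 'aragorn': 7, 'legolas': 7, 'boromir': 7}
```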

<p id ='Itge'><p>
### Introduction to generator expressions

<p id ='Lcvg'><p>
### List comprehensions vs generators
    
**List comprehension**
![Screenshot%202019-03-30%20at%201.23.40%20PM.png](attachment:Screenshot%202019-03-30%20at%201.23.40%20PM.png)
**Generators**
![Screenshot%202019-03-30%20at%201.23.47%20PM.png](attachment:Screenshot%202019-03-30%20at%201.23.47%20PM.png)
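
As a minimal sketch of the difference these two headings refer to (variable names are illustrative): a list comprehension builds the whole list immediately, while a generator expression produces values only when they are requested and can be consumed only once.

```python
squares_list = [n * n for n in range(5)]   # built immediately, held in memory
squares_gen = (n * n for n in range(5))    # nothing is computed yet

print(squares_list)        # [0, 1, 4, 9, 16]

first = next(squares_gen)
print(first)               # 0

# The remaining values are produced only as they are consumed.
remaining = list(squares_gen)
print(remaining)           # [1, 4, 9, 16]

# A generator can be consumed only once.
leftover = list(squares_gen)
print(leftover)            # []
```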

<p id ='Wyoge'><p>
### Write your own generator expressions

In [12]:
result = (num for num in range(31))

In [13]:
type(result)

generator

In [22]:
next(result)

8

In [24]:
next(result)

10

In [25]:
for value in result:
    print(value)

11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


<p id ='Ctoige'><p>
### Changing the output in generator expressions
    
You are given a list of strings, `lannister`. Using a generator expression, create a generator object that yields the length of each string, then iterate over it to print the values.



In [26]:
lannister =  ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

In [27]:
lengths = (len(i) for i in lannister)

In [29]:
lengths

<generator object <genexpr> at 0x104f8a9e8>

In [31]:
for value in lengths:
    print(value)

<p id ='Bag'><p>
### Build a generator
Generator functions are functions that, like generator expressions, yield a series of values instead of returning a single value. A generator function is defined like a regular function, but it produces each value with the keyword `yield` instead of `return`.
    
In this exercise, you will create a generator function with a similar mechanism as the generator expression you defined in the previous exercise:





In [33]:
lannister

['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

In [34]:
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    for person in input_list:
        yield len(person)

In [43]:
for value in get_lengths(lannister):
    print(value)

6
5
5
6
7


In [45]:
def num_sequence(n):
    i = 0
    while i<n:
        yield i
        i+=1

In [50]:
result = num_sequence(10)
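
The cell above creates the generator object but does not consume it. A small sketch of how such an object behaves, repeating the `num_sequence` definition so the block is self-contained and using a smaller `n` to keep the output short:

```python
def num_sequence(n):
    """Yield the integers 0, 1, ..., n - 1 one at a time."""
    i = 0
    while i < n:
        yield i
        i += 1

result = num_sequence(3)

first = next(result)    # runs the function body up to the first yield
print(first)            # 0

rest = list(result)     # resumes after the yield and collects what is left
print(rest)             # [1, 2]
```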

<p id ='Wucag'><p>
### Wrapping up comprehensions and generators.

<p id ='Lcftd'><p>
### List comprehensions for time-stamped data

In [58]:
import pandas as pd
df = pd.read_csv('./data/tweets.csv')
df.head()

Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,filter_level,geo,id,...,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [{'screen_na...","{'media': [{'sizes': {'large': {'w': 1024, 'h'...",0,False,low,,714960401759387648,...,,,0,False,"{'retweeted': False, 'text': "".@krollbondratin...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @bpolitics: .@krollbondrating's Christopher...,1459294817758,False,"{'utc_offset': 3600, 'profile_image_url_https'..."
1,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [{'text': 'cruzsexscandal', 'indi...","{'media': [{'sizes': {'large': {'w': 500, 'h':...",0,False,low,,714960401977319424,...,,,0,False,"{'retweeted': False, 'text': '@dmartosko Cruz ...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @HeidiAlpine: @dmartosko Cruz video found.....,1459294817810,False,"{'utc_offset': None, 'profile_image_url_https'..."
2,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [], 'symbols...",,0,False,low,,714960402426236928,...,,,0,False,,"<a href=""http://www.facebook.com/twitter"" rel=...",Njihuni me Zonjën Trump !!! | Ekskluzive https...,1459294817917,False,"{'utc_offset': 7200, 'profile_image_url_https'..."
3,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [], 'symbols...",,0,False,low,,714960402367561730,...,7.149239e+17,7.149239e+17,0,False,,"<a href=""http://twitter.com/download/android"" ...",Your an idiot she shouldn't have tried to grab...,1459294817903,False,"{'utc_offset': None, 'profile_image_url_https'..."
4,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [{'screen_na...",,0,False,low,,714960402149416960,...,,,0,False,"{'retweeted': False, 'text': 'The anti-America...","<a href=""http://twitter.com/download/iphone"" r...",RT @AlanLohner: The anti-American D.C. elites ...,1459294817851,False,"{'utc_offset': -18000, 'profile_image_url_http..."


In [60]:
created_at = df['created_at']
created_at.head()

0    Tue Mar 29 23:40:17 +0000 2016
1    Tue Mar 29 23:40:17 +0000 2016
2    Tue Mar 29 23:40:17 +0000 2016
3    Tue Mar 29 23:40:17 +0000 2016
4    Tue Mar 29 23:40:17 +0000 2016
Name: created_at, dtype: object

In [64]:
tweet_time = [entry[11:19] for entry in created_at]
tweet_time[:5]

['23:40:17', '23:40:17', '23:40:17', '23:40:17', '23:40:17']

<p id ='Clcftd'><p>
### Conditional list comprehensions for time-stamped data

In [69]:
# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in created_at if entry[17:19] == '19']
tweet_clock_time[:5]

['23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19']

<p id ='Wttcs'><p>
### Welcome to the case study!

<p id ='Dfds'><p>
### Dictionaries for data science

In [70]:
feature_names = ['CountryName',
 'CountryCode',
 'IndicatorName',
 'IndicatorCode',
 'Year',
 'Value']

In [71]:
row_vals =['Arab World',
 'ARB',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'SP.ADO.TFRT',
 '1960',
 '133.56090740552298']

In [72]:
# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

In [73]:
type(zipped_lists)

zip

In [74]:
rs_dict = dict(zipped_lists)
rs_dict

{'CountryName': 'Arab World',
 'CountryCode': 'ARB',
 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'IndicatorCode': 'SP.ADO.TFRT',
 'Year': '1960',
 'Value': '133.56090740552298'}

<p id ='Wafthy'><p>
### Writing a function to help you

In [75]:
def list2dict(list1, list2):
   return dict(zip(list1, list2)) 

In [77]:
list2dict(feature_names, row_vals)

{'CountryName': 'Arab World',
 'CountryCode': 'ARB',
 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'IndicatorCode': 'SP.ADO.TFRT',
 'Year': '1960',
 'Value': '133.56090740552298'}

<p id ='Ualc'><p>
### Using a list comprehension
Turn a bunch of lists into a list of dictionaries with the help of a list comprehension.

In [81]:
from data.rowlist import row_list

In [83]:
row_list

[['Arab World',
  'ARB',
  'Adolescent fertility rate (births per 1,000 women ages 15-19)',
  'SP.ADO.TFRT',
  '1960',
  '133.56090740552298'],
 ['Arab World',
  'ARB',
  'Age dependency ratio (% of working-age population)',
  'SP.POP.DPND',
  '1960',
  '87.7976011532547'],
 ['Arab World',
  'ARB',
  'Age dependency ratio, old (% of working-age population)',
  'SP.POP.DPND.OL',
  '1960',
  '6.634579191565161'],
 ['Arab World',
  'ARB',
  'Age dependency ratio, young (% of working-age population)',
  'SP.POP.DPND.YG',
  '1960',
  '81.02332950839141'],
 ['Arab World',
  'ARB',
  'Arms exports (SIPRI trend indicator values)',
  'MS.MIL.XPRT.KD',
  '1960',
  '3000000.0'],
 ['Arab World',
  'ARB',
  'Arms imports (SIPRI trend indicator values)',
  'MS.MIL.MPRT.KD',
  '1960',
  '538000000.0'],
 ['Arab World',
  'ARB',
  'Birth rate, crude (per 1,000 people)',
  'SP.DYN.CBRT.IN',
  '1960',
  '47.697888095096395'],
 ['Arab World',
  'ARB',
  'CO2 emissions (kt)',
  'EN.ATM.CO2E.KT',
  '1960',


In [84]:
row_list[0]

['Arab World',
 'ARB',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'SP.ADO.TFRT',
 '1960',
 '133.56090740552298']

In [86]:
row_list[1]

['Arab World',
 'ARB',
 'Age dependency ratio (% of working-age population)',
 'SP.POP.DPND',
 '1960',
 '87.7976011532547']

In [87]:
list_of_dict = [list2dict(feature_names, sub_list) for sub_list in row_list]

In [90]:
list_of_dict[0]

{'CountryName': 'Arab World',
 'CountryCode': 'ARB',
 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'IndicatorCode': 'SP.ADO.TFRT',
 'Year': '1960',
 'Value': '133.56090740552298'}

In [91]:
list_of_dict[1]

{'CountryName': 'Arab World',
 'CountryCode': 'ARB',
 'IndicatorName': 'Age dependency ratio (% of working-age population)',
 'IndicatorCode': 'SP.POP.DPND',
 'Year': '1960',
 'Value': '87.7976011532547'}

<p id ='TtaiaD'><p>
### Turning this all into a DataFrame

In [95]:
list_of_dict

[{'CountryName': 'Arab World',
  'CountryCode': 'ARB',
  'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
  'IndicatorCode': 'SP.ADO.TFRT',
  'Year': '1960',
  'Value': '133.56090740552298'},
 {'CountryName': 'Arab World',
  'CountryCode': 'ARB',
  'IndicatorName': 'Age dependency ratio (% of working-age population)',
  'IndicatorCode': 'SP.POP.DPND',
  'Year': '1960',
  'Value': '87.7976011532547'},
 {'CountryName': 'Arab World',
  'CountryCode': 'ARB',
  'IndicatorName': 'Age dependency ratio, old (% of working-age population)',
  'IndicatorCode': 'SP.POP.DPND.OL',
  'Year': '1960',
  'Value': '6.634579191565161'},
 {'CountryName': 'Arab World',
  'CountryCode': 'ARB',
  'IndicatorName': 'Age dependency ratio, young (% of working-age population)',
  'IndicatorCode': 'SP.POP.DPND.YG',
  'Year': '1960',
  'Value': '81.02332950839141'},
 {'CountryName': 'Arab World',
  'CountryCode': 'ARB',
  'IndicatorName': 'Arms exports (SIPRI trend indicator values)'

In [93]:
df = pd.DataFrame(list_of_dict)

In [94]:
df.head()

Unnamed: 0,CountryCode,CountryName,IndicatorCode,IndicatorName,Value,Year
0,ARB,Arab World,SP.ADO.TFRT,"Adolescent fertility rate (births per 1,000 wo...",133.56090740552298,1960
1,ARB,Arab World,SP.POP.DPND,Age dependency ratio (% of working-age populat...,87.7976011532547,1960
2,ARB,Arab World,SP.POP.DPND.OL,"Age dependency ratio, old (% of working-age po...",6.634579191565161,1960
3,ARB,Arab World,SP.POP.DPND.YG,"Age dependency ratio, young (% of working-age ...",81.02332950839141,1960
4,ARB,Arab World,MS.MIL.XPRT.KD,Arms exports (SIPRI trend indicator values),3000000.0,1960


<p id ='UPgfsd'><p>
### Using Python generators for streaming data

<p id ='Pdic('><p>
### Processing data in chunks (1)

In [97]:
world = pd.read_csv('./data/world_ind_pop_data2.csv')
world.head()

Unnamed: 0,CountryName,CountryCode,Year,Total Population,Urban population (% of total)
0,Arab World,ARB,1960,92495900.0,31.285384
1,Caribbean small states,CSS,1960,4190810.0,31.59749
2,Central Europe and the Baltics,CEB,1960,91401580.0,44.507921
3,East Asia & Pacific (all income levels),EAS,1960,1042475000.0,22.471132
4,East Asia & Pacific (developing only),EAP,1960,896493000.0,16.917679


In [99]:
world.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName                      13374 non-null object
CountryCode                      13374 non-null object
Year                             13374 non-null int64
Total Population                 13374 non-null float64
Urban population (% of total)    13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB


In [142]:
with open('./data/world_ind_pop_data2.csv') as file:
    file.readline()
    counts_dict = {}
    
    for j in range(0, 1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1



In [143]:
counts_dict

{'Arab World': 5,
 'Caribbean small states': 5,
 'Central Europe and the Baltics': 5,
 'East Asia & Pacific (all income levels)': 5,
 'East Asia & Pacific (developing only)': 5,
 'Euro area': 5,
 'Europe & Central Asia (all income levels)': 5,
 'Europe & Central Asia (developing only)': 5,
 'European Union': 5,
 'Fragile and conflict affected situations': 5,
 'Heavily indebted poor countries (HIPC)': 5,
 'High income': 5,
 'High income: nonOECD': 5,
 'High income: OECD': 5,
 'Latin America & Caribbean (all income levels)': 5,
 'Latin America & Caribbean (developing only)': 5,
 'Least developed countries: UN classification': 5,
 'Low & middle income': 5,
 'Low income': 5,
 'Lower middle income': 5,
 'Middle East & North Africa (all income levels)': 5,
 'Middle East & North Africa (developing only)': 5,
 'Middle income': 5,
 'North America': 5,
 'OECD members': 5,
 'Other small states': 5,
 'Pacific island small states': 5,
 'Small states': 5,
 'South Asia': 5,
 'Sub-Saharan Africa (all 

<p id ='Wagtldic('><p>
### Writing a generator to load data in chunks (2)
Generators let you evaluate data lazily.
Lazy evaluation is very useful for very large datasets because values are generated efficiently, by yielding chunks of data one at a time instead of the whole thing at once.

Define a generator function `read_large_file()` that produces a generator object which yields a single line from a file each time `next()` is called on it.

In [106]:
def read_large_file(file_object):
    """A generator to read file lazily."""
    while True:
        data = file_object.readline()
        # Break if this is EOF
        if not data:
            break
        yield data

In [121]:
file =  open('./data/world_ind_pop_data2.csv')
gen_obj = read_large_file(file)
print(next(gen_obj))

CountryName,CountryCode,Year,Total Population,Urban population (% of total)



In [122]:
print(next(gen_obj))

Arab World,ARB,1960,92495902.0,31.285384211605397



In [123]:
file.close()
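
Python file objects are themselves iterators over lines, so a plain `for` loop or `next()` on the file gives the same line-at-a-time behaviour as `read_large_file()`. In the sketch below, `io.StringIO` stands in for an open file and the data is illustrative:

```python
import io

# io.StringIO stands in for an open file (illustrative data).
file_like = io.StringIO("CountryName,CountryCode\nArab World,ARB\nEuro area,EMU\n")

header = next(file_like)   # first line, read on demand

# The remaining lines are read one at a time as the comprehension iterates.
names = [line.split(',')[0] for line in file_like]

print(header.strip())   # CountryName,CountryCode
print(names)            # ['Arab World', 'Euro area']
```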

<p id ='Wagtldic('><p>
### Writing a generator to load data in chunks (3)
Now let's use your generator function to process the World Bank dataset like you did previously. You will process the file line by line, to create a dictionary of the counts of how many times each country appears in a column in the dataset. For this exercise, however, you won't process just 1000 rows of data, you'll process the entire dataset!

In [136]:
counts_dict= {}

In [137]:
read_large_file(file)

<generator object read_large_file at 0x10e4b4a98>

In [138]:
with open('./data/world_ind_pop_data2.csv') as file:
    for line in read_large_file(file):
        row = line.split(',')
        first_col = row[0]
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

In [141]:
counts_dict

{'CountryName': 1,
 'Arab World': 55,
 'Caribbean small states': 55,
 'Central Europe and the Baltics': 55,
 'East Asia & Pacific (all income levels)': 55,
 'East Asia & Pacific (developing only)': 55,
 'Euro area': 55,
 'Europe & Central Asia (all income levels)': 55,
 'Europe & Central Asia (developing only)': 55,
 'European Union': 55,
 'Fragile and conflict affected situations': 55,
 'Heavily indebted poor countries (HIPC)': 55,
 'High income': 55,
 'High income: nonOECD': 55,
 'High income: OECD': 55,
 'Latin America & Caribbean (all income levels)': 55,
 'Latin America & Caribbean (developing only)': 55,
 'Least developed countries: UN classification': 55,
 'Low & middle income': 55,
 'Low income': 55,
 'Lower middle income': 55,
 'Middle East & North Africa (all income levels)': 55,
 'Middle East & North Africa (developing only)': 55,
 'Middle income': 55,
 'North America': 55,
 'OECD members': 55,
 'Other small states': 55,
 'Pacific island small states': 55,
 'Small states': 5

<p id ='Uprifsd'><p>
### Using pandas `read_csv` iterator for streaming data

In [153]:
df_reader = pd.read_csv('./data/world_ind_pop_data2.csv', chunksize=10)


In [154]:
type(df_reader)

pandas.io.parsers.TextFileReader

In [155]:
next(df_reader)

Unnamed: 0,CountryName,CountryCode,Year,Total Population,Urban population (% of total)
0,Arab World,ARB,1960,92495900.0,31.285384
1,Caribbean small states,CSS,1960,4190810.0,31.59749
2,Central Europe and the Baltics,CEB,1960,91401580.0,44.507921
3,East Asia & Pacific (all income levels),EAS,1960,1042475000.0,22.471132
4,East Asia & Pacific (developing only),EAP,1960,896493000.0,16.917679
5,Euro area,EMU,1960,265396500.0,62.096947
6,Europe & Central Asia (all income levels),ECS,1960,667489000.0,55.378977
7,Europe & Central Asia (developing only),ECA,1960,155317400.0,38.066129
8,European Union,EUU,1960,409498500.0,61.212898
9,Fragile and conflict affected situations,FCS,1960,120354600.0,17.891972


In [156]:
next(df_reader)

Unnamed: 0,CountryName,CountryCode,Year,Total Population,Urban population (% of total)
10,Heavily indebted poor countries (HIPC),HPC,1960,162491200.0,12.236046
11,High income,HIC,1960,907597500.0,62.680332
12,High income: nonOECD,NOC,1960,186676700.0,56.107863
13,High income: OECD,OEC,1960,720920800.0,64.285435
14,Latin America & Caribbean (all income levels),LCN,1960,220564200.0,49.284688
15,Latin America & Caribbean (developing only),LAC,1960,177682200.0,44.863308
16,Least developed countries: UN classification,LDC,1960,241072800.0,9.616261
17,Low & middle income,LMY,1960,2127373000.0,21.272894
18,Low income,LIC,1960,157188400.0,11.498396
19,Lower middle income,LMC,1960,942911600.0,19.810513


In [157]:
next(df_reader)

Unnamed: 0,CountryName,CountryCode,Year,Total Population,Urban population (% of total)
20,Middle East & North Africa (all income levels),MEA,1960,105512600.0,34.951334
21,Middle East & North Africa (developing only),MNA,1960,97869420.0,33.875012
22,Middle income,MIC,1960,1970185000.0,22.053114
23,North America,NAC,1960,198624400.0,69.918403
24,OECD members,OED,1960,786648200.0,62.480915
25,Other small states,OSS,1960,6590560.0,14.337844
26,Pacific island small states,PSS,1960,861378.0,22.043762
27,Small states,SST,1960,11642750.0,21.120573
28,South Asia,SAS,1960,572036100.0,16.735545
29,Sub-Saharan Africa (all income levels),SSF,1960,228268800.0,14.631387


<p id ='Waitldic('><p>
### Writing an iterator to load data in chunks (1)
In this exercise, you will read in a file using a bigger DataFrame chunk size and then process the data from the first chunk.

In [168]:
urb_pop_reader = pd.read_csv('./data/world_ind_pop_data2.csv', chunksize= 1000, index_col='CountryCode')

In [169]:
urb_pop_reader

<pandas.io.parsers.TextFileReader at 0x10f9fe080>

In [170]:
df_urb_pop = next(urb_pop_reader)

In [171]:
df_urb_pop.head()

Unnamed: 0_level_0,CountryName,Year,Total Population,Urban population (% of total)
CountryCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ARB,Arab World,1960,92495900.0,31.285384
CSS,Caribbean small states,1960,4190810.0,31.59749
CEB,Central Europe and the Baltics,1960,91401580.0,44.507921
EAS,East Asia & Pacific (all income levels),1960,1042475000.0,22.471132
EAP,East Asia & Pacific (developing only),1960,896493000.0,16.917679


In [172]:
df_urb_pop.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, ARB to UMC
Data columns (total 4 columns):
CountryName                      1000 non-null object
Year                             1000 non-null int64
Total Population                 1000 non-null float64
Urban population (% of total)    1000 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 39.1+ KB


In [176]:
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop.loc['CEB']
df_pop_ceb.head()

Unnamed: 0_level_0,CountryName,Year,Total Population,Urban population (% of total)
CountryCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CEB,Central Europe and the Baltics,1960,91401583.0,44.507921
CEB,Central Europe and the Baltics,1961,92237118.0,45.206665
CEB,Central Europe and the Baltics,1962,93014890.0,45.866565
CEB,Central Europe and the Baltics,1963,93845749.0,46.534093
CEB,Central Europe and the Baltics,1964,94722599.0,47.208743


In [177]:
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], 
           df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

[(91401583.0, 44.5079211390026), (92237118.0, 45.206665319194), (93014890.0, 45.866564696018), (93845749.0, 46.5340927663649), (94722599.0, 47.2087429803526)]
