# Comprehensions

1. What are comprehensions?
2. List comprehensions
3. List comprehensions and files
4. Set comprehensions
5. Dict comprehensions

Notebook on GitHub is at https://github.com/reuven/pycontw-2025-comprehensions

Or go to https://github.com/reuven/ and get the latest repo

In [1]:
# let's try something simple
# Given a list of integers from 0 through 8

numbers = list(range(9))
numbers

[0, 1, 2, 3, 4, 5, 6, 7, 8]

In [4]:
# I want a list of the squares of every number in that list

# One way
output = []

for one_number in numbers:
    output.append(one_number ** 2)

output


[0, 1, 4, 9, 16, 25, 36, 49, 64]

# Unfortunately, this works!

In [5]:
# so, what does a comprehension look like:

[one_number ** 2 for one_number in numbers]

[0, 1, 4, 9, 16, 25, 36, 49, 64]

In [6]:
# there is a much better way to write comprehensions!

[one_number ** 2              # SELECT -- any Python expression can go here
 for one_number in numbers]   # FROM  -- any Python iterable can go here

[0, 1, 4, 9, 16, 25, 36, 49, 64]

In [8]:
# here's an example

mylist = [10, 20, 30]
mylist

[10, 20, 30]

In [9]:
' '.join(mylist)  # I want a string back with the elements of mylist, with ' ' between them

TypeError: sequence item 0: expected str instance, int found

In [10]:
' '.join([str(one_item)
         for one_item in mylist])

'10 20 30'

# How to think about comprehensions (and when do we use them?)

1. If we have an iterable
2. We want a new list back
3. There is a way to convert each element of the source into an element of the destination
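
Here is a minimal sketch of those three conditions in one place. The tuple `words` and its contents are made up for illustration; any iterable would do as the source.

```python
# Hypothetical sample data; any iterable can be the source
words = ('apple', 'fig', 'banana')   # 1. we have an iterable (here, a tuple)

# 2. we want a new list back
# 3. len() converts each source element into a destination element
lengths = [len(one_word)
           for one_word in words]

print(lengths)   # [5, 3, 6]
```

If any one of the three conditions is missing (for example, we only want to print things and don't need a list back), a regular `for` loop is probably the better fit.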

In [11]:
text = 'this is a bunch of words for my tutorial'

text.title()   # returns a new string, where each word is capitalized

'This Is A Bunch Of Words For My Tutorial'

In [14]:
# can I do the same thing as str.title, but using str.capitalize?

' '.join([one_word.capitalize()
for one_word in text.split()])

'This Is A Bunch Of Words For My Tutorial'

# Exercise:

1. Ask the user to enter a string containing numbers, separated by spaces. Add those numbers together (as integers), and print the result. It's OK to use the builtin `sum` function. Assume that the user will indeed only enter digits and spaces.
2. Ask the user to enter a string, and print the length of the string, not including whitespace. Don't use `str.replace` or the like.

In [17]:
# 1. Ask the user to enter a string containing numbers, separated by spaces. 
# Add those numbers together (as integers), and print the result. It's OK to use 
# the builtin `sum` function. Assume that the user will indeed only enter
# digits and spaces.

text = input('Enter numbers, separated by spaces: ').strip()

sum([int(one_item)
     for one_item in text.split()])

Enter numbers, separated by spaces:  10 20 30 40 50


150

In [20]:
# 2. Ask the user to enter a string, and print the length of the string, 
# not including whitespace. Don't use `str.replace` or the like.

text = 'this is yet another amazing example of my English'

sum([len(one_item)
 for one_item in text.split()])

41

When we run our comprehension over an iterable, it can be *any* iterable. That includes files!
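
A small, self-contained sketch of iterating over a file in a comprehension. The temporary file and its contents are assumptions made up for this example, so that it runs anywhere. It also uses `with`, which is one way to make sure the file gets closed when we're done.

```python
import os
import tempfile

# Create a small throwaway file so that the example is self-contained.
# The contents are made up for illustration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('alpha\nbeta\ngamma\n')
    filename = f.name

# Iterating over a file object gives us one line at a time,
# including the trailing newline, so we strip it off
with open(filename) as infile:
    lines = [one_line.strip()
             for one_line in infile]

os.remove(filename)   # clean up the throwaway file

print(lines)   # ['alpha', 'beta', 'gamma']
```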



In [23]:
# what if I want a list of the usernames?

[one_line.split(':')[0]
for one_line in open('/etc/passwd')]

['##\n',
 '# User Database\n',
 '# \n',
 '# Note that this file is consulted directly only when the system is running\n',
 '# in single-user mode.  At other times this information is provided by\n',
 '# Open Directory.\n',
 '#\n',
 '# See the opendirectoryd(8) man page for additional information about\n',
 '# Open Directory.\n',
 '##\n',
 'nobody',
 'root',
 'daemon',
 '_uucp',
 '_taskgated',
 '_networkd',
 '_installassistant',
 '_lp',
 '_postfix',
 '_scsd',
 '_ces',
 '_appstore',
 '_mcxalr',
 '_appleevents',
 '_geod',
 '_devdocs',
 '_sandbox',
 '_mdnsresponder',
 '_ard',
 '_www',
 '_eppc',
 '_cvs',
 '_svn',
 '_mysql',
 '_sshd',
 '_qtss',
 '_cyrus',
 '_mailman',
 '_appserver',
 '_clamav',
 '_amavisd',
 '_jabber',
 '_appowner',
 '_windowserver',
 '_spotlight',
 '_tokend',
 '_securityagent',
 '_calendar',
 '_teamsserver',
 '_update_sharing',
 '_installer',
 '_atsserver',
 '_ftp',
 '_unknown',
 '_softwareupdate',
 '_coreaudiod',
 '_screensaver',
 '_locationd',
 '_trustevaluationagent',
 '

In [24]:
[one_line.split(':')[0]               # SELECT -- expression
 for one_line in open('/etc/passwd')  # FROM -- iteration
 if not one_line.startswith('#')]     # WHERE -- condition

['nobody',
 'root',
 'daemon',
 '_uucp',
 '_taskgated',
 '_networkd',
 '_installassistant',
 '_lp',
 '_postfix',
 '_scsd',
 '_ces',
 '_appstore',
 '_mcxalr',
 '_appleevents',
 '_geod',
 '_devdocs',
 '_sandbox',
 '_mdnsresponder',
 '_ard',
 '_www',
 '_eppc',
 '_cvs',
 '_svn',
 '_mysql',
 '_sshd',
 '_qtss',
 '_cyrus',
 '_mailman',
 '_appserver',
 '_clamav',
 '_amavisd',
 '_jabber',
 '_appowner',
 '_windowserver',
 '_spotlight',
 '_tokend',
 '_securityagent',
 '_calendar',
 '_teamsserver',
 '_update_sharing',
 '_installer',
 '_atsserver',
 '_ftp',
 '_unknown',
 '_softwareupdate',
 '_coreaudiod',
 '_screensaver',
 '_locationd',
 '_trustevaluationagent',
 '_timezone',
 '_lda',
 '_cvmsroot',
 '_usbmuxd',
 '_dovecot',
 '_dpaudio',
 '_postgres',
 '_krbtgt',
 '_kadmin_admin',
 '_kadmin_changepw',
 '_devicemgr',
 '_webauthserver',
 '_netbios',
 '_warmd',
 '_dovenull',
 '_netstatistics',
 '_avbdeviced',
 '_krb_krbtgt',
 '_krb_kadmin',
 '_krb_changepw',
 '_krb_kerberos',
 '_krb_anonymous',
 '_asse
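
To see what the `if` line is doing, it can help to compare the comprehension with the equivalent `for` loop. This is only a sketch: `sample_lines` is a made-up stand-in for the lines of `/etc/passwd`, so that the example runs without the file.

```python
# Made-up lines in the style of /etc/passwd, for illustration only
sample_lines = ['# a comment\n',
                'root:*:0:0:System Administrator:/var/root:/bin/sh\n',
                'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n']

# The comprehension, with a condition
usernames = [one_line.split(':')[0]            # SELECT
             for one_line in sample_lines      # FROM
             if not one_line.startswith('#')]  # WHERE

# The equivalent for loop
output = []
for one_line in sample_lines:
    if not one_line.startswith('#'):
        output.append(one_line.split(':')[0])

print(usernames)             # ['root', 'nobody']
print(usernames == output)   # True
```

The condition runs once per element, and only the elements for which it returns a true value reach the expression on the first line.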

# Exercise: Summing numbers

1. Get the zipfile from https://files.lerner.co.il/ (the first link).
2. In that zipfile, you'll find the file `nums.txt`.
3. Each line in that file contains at most one integer, possibly with whitespace on either side.
4. Use a comprehension to sum all of the numbers in the file.

In [26]:
!unzip exercise-files.zip


Archive:  exercise-files.zip
  inflating: mini-access-log.txt     
  inflating: wcfile.txt              
  inflating: shoe-data.txt           
  inflating: nums.txt                
  inflating: linux-etc-passwd.txt    


In [29]:
[one_line
for one_line in open('nums.txt')]

['5\n',
 '\t10     \n',
 '\t20\n',
 '  \t3\n',
 '\t\t   \t20        \n',
 '\n',
 ' 25\n']

In [31]:
int('      5     ')

5

In [32]:
int('5')

5

In [34]:
int('\n')

ValueError: invalid literal for int() with base 10: '\n'

In [40]:
sum([int(one_line)
  for one_line in open('nums.txt')
  if one_line.strip().isdigit()])

83
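
One thing to keep in mind: `str.isdigit` returns `False` for a string such as `'-10'`, so the condition above would silently skip negative numbers. `nums.txt` doesn't have any, but if your data might, here is one possible sketch. The helper `is_integer` and the `sample_lines` data are hypothetical names made up for this example.

```python
def is_integer(text):
    """Return True if int() accepts text, and False otherwise."""
    try:
        int(text)
    except ValueError:
        return False
    return True

# Made-up lines in the style of nums.txt, plus a negative number and a word
sample_lines = ['5\n', '\t-10   \n', '\n', ' 25\n', 'abc\n']

total = sum([int(one_line)
             for one_line in sample_lines
             if is_integer(one_line)])

print(total)   # 20
```

This calls `int` twice on every valid line, which is wasteful but keeps the comprehension easy to read.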

In [42]:
!head shoe-data.txt

Adidas	orange	43
Nike	black	41
Adidas	black	39
New Balance	pink	41
Nike	white	44
New Balance	orange	38
Nike	pink	44
Adidas	pink	44
New Balance	orange	39
New Balance	black	43


In [43]:
# I want to create, from shoe-data.txt, a list of dicts
# every dict should contain three key-value pairs for brand, color, and size

In [47]:
def line_to_dict(one_line):
    fields = one_line.strip().split('\t')

    return {'brand': fields[0],
            'color': fields[1],
            'size': fields[2]}

[line_to_dict(one_line)
for one_line in open('shoe-data.txt')]

[{'brand': 'Adidas', 'color': 'orange', 'size': '43'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'black', 'size': '39'},
 {'brand': 'New Balance', 'color': 'pink', 'size': '41'},
 {'brand': 'Nike', 'color': 'white', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '38'},
 {'brand': 'Nike', 'color': 'pink', 'size': '44'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '39'},
 {'brand': 'New Balance', 'color': 'black', 'size': '43'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '44'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '37'},
 {'brand': 'Adidas', 'color': 'black', 'size': '38'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '41'},
 {'brand': 'Adidas', 'color': 'white', 'size': '36'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '36'},
 {'brand': 'Nike', 'color': 'pink', 'size': '41'},
 {'brand': '

In [48]:
def line_to_dict(one_line):
    brand, color, size = one_line.strip().split('\t')

    return {'brand': brand,
            'color': color,
            'size': size}

[line_to_dict(one_line)
for one_line in open('shoe-data.txt')]

[{'brand': 'Adidas', 'color': 'orange', 'size': '43'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'black', 'size': '39'},
 {'brand': 'New Balance', 'color': 'pink', 'size': '41'},
 {'brand': 'Nike', 'color': 'white', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '38'},
 {'brand': 'Nike', 'color': 'pink', 'size': '44'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '39'},
 {'brand': 'New Balance', 'color': 'black', 'size': '43'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '44'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '37'},
 {'brand': 'Adidas', 'color': 'black', 'size': '38'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '41'},
 {'brand': 'Adidas', 'color': 'white', 'size': '36'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '36'},
 {'brand': 'Nike', 'color': 'pink', 'size': '41'},
 {'brand': '
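
Notice that the sizes above are all strings, because everything we read from a file is a string. A possible variation, sketched here with two lines copied from the `head` output so that it runs without the file: build the dict with `dict(zip(...))` and turn the size into an integer. The names `keys`, `record`, and `shoes` are made up for this example.

```python
keys = ['brand', 'color', 'size']

# Two lines in the format of shoe-data.txt (tab-separated)
sample_lines = ['Adidas\torange\t43\n',
                'Nike\tblack\t41\n']

def line_to_dict(one_line):
    fields = one_line.strip().split('\t')
    record = dict(zip(keys, fields))       # pair each key with its field
    record['size'] = int(record['size'])   # sizes are numbers, not strings
    return record

shoes = [line_to_dict(one_line)
         for one_line in sample_lines]

print(shoes)
# [{'brand': 'Adidas', 'color': 'orange', 'size': 43}, {'brand': 'Nike', 'color': 'black', 'size': 41}]
```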

In [49]:
# mini-access-log.txt -- a (very) old Apache http server log

!head mini-access-log.txt

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.12 - - [30/J

In [53]:
# what if I want all of the IP addresses from this file?
# what if I want to know how often each IP address appeared?
# we can use collections.Counter!


[one_line.split()[0]
for one_line in open('mini-access-log.txt')]

['67.218.116.165',
 '66.249.71.65',
 '65.55.106.183',
 '65.55.106.183',
 '66.249.71.65',
 '66.249.71.65',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.106.131',
 '65.55.106.131',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.106.186',
 '65.55.106.186',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '74.52.245.146',
 '74.52.245.146',
 '66.249.65.43',
 '66.249.65.43',
 '66.249.65.43',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.207.25',
 '65.55.207.25',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.207.94',
 '65.55.207.94',
 '66.249.65.12',
 '65.55.207.71',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '98.242.170.241',
 '66.249.65.38',
 '66.249.65.38',
 '66.249.65.38',
 '66.249.65.38',
 '66.249.65.38',
 '

In [54]:
from collections import Counter

Counter('abcaaaabaaac')

Counter({'a': 8, 'b': 2, 'c': 2})

In [55]:
Counter([one_line.split()[0]
        for one_line in open('mini-access-log.txt')])

Counter({'66.249.65.38': 100,
         '66.249.65.12': 32,
         '89.248.172.58': 22,
         '67.195.112.35': 16,
         '66.249.71.65': 3,
         '66.249.65.43': 3,
         '65.55.207.50': 3,
         '67.218.116.165': 2,
         '65.55.106.183': 2,
         '65.55.106.131': 2,
         '65.55.106.186': 2,
         '74.52.245.146': 2,
         '65.55.207.25': 2,
         '65.55.207.94': 2,
         '65.55.207.126': 2,
         '82.34.9.20': 2,
         '65.55.106.155': 2,
         '65.55.207.77': 2,
         '65.55.215.75': 2,
         '65.55.207.71': 1,
         '98.242.170.241': 1,
         '208.80.193.28': 1})

In [57]:
# I want to show this to my boss
# who hasn't taken a math class in many years...

c = Counter([one_line.split()[0]
        for one_line in open('mini-access-log.txt')])

for key, value in c.items():
    print(f'{key}\t{value * "x"}')

67.218.116.165	xx
66.249.71.65	xxx
65.55.106.183	xx
66.249.65.12	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
65.55.106.131	xx
65.55.106.186	xx
74.52.245.146	xx
66.249.65.43	xxx
65.55.207.25	xx
65.55.207.94	xx
65.55.207.71	x
98.242.170.241	x
66.249.65.38	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
65.55.207.126	xx
82.34.9.20	xx
65.55.106.155	xx
65.55.207.77	xx
208.80.193.28	x
89.248.172.58	xxxxxxxxxxxxxxxxxxxxxx
67.195.112.35	xxxxxxxxxxxxxxxx
65.55.207.50	xxx
65.55.215.75	xx
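
If the boss would rather see the busiest addresses first, `Counter.most_common` returns the `(element, count)` pairs sorted from the highest count to the lowest. A small sketch, using made-up IP addresses in `ips` rather than the log file:

```python
from collections import Counter

# Made-up IP addresses, for illustration only
ips = ['10.0.0.1', '10.0.0.2', '10.0.0.1',
       '10.0.0.3', '10.0.0.1', '10.0.0.2']

c = Counter(ips)

# most_common() returns a list of (element, count) tuples, highest count first
for key, value in c.most_common():
    print(f'{key}\t{value * "x"}')
```

With this sample data, `10.0.0.1` is printed first with three `x` characters, then `10.0.0.2` with two, then `10.0.0.3` with one.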


In [59]:
# I want to sum the numbers that a user enters
# only counting each number once...
# (this first attempt still counts the repeats)

numbers = input('Enter numbers: ').strip()

sum([int(one_item)
    for one_item in numbers.split()])

Enter numbers:  10 20 30 10 20 30


120

In [60]:
# I can use a set!
# a set is a dict without any values (i.e., it's very immoral)

numbers = input('Enter numbers: ').strip()

sum(set([int(one_item)
    for one_item in numbers.split()]))

Enter numbers:  10 20 30 10 20 30 10 20 30


60

In [61]:
# there is an easier way -- a set comprehension!
# if we use {} instead of [], we get back a set!


numbers = input('Enter numbers: ').strip()

sum({int(one_item)
    for one_item in numbers.split()})

Enter numbers:  10 20 30 10 20 30 10 20 30


60

In [64]:
# be careful, because every element of a set needs to be hashable,
# just like dict keys

{one_line.split()
for one_line in open('/etc/passwd')}

TypeError: unhashable type: 'list'
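
One possible way around this, assuming we really do want each line's fields in the set: turn each list into a tuple, which is hashable as long as its elements are. The `sample_lines` data is made up for illustration.

```python
# Made-up lines, for illustration only
sample_lines = ['a b\n', 'c d\n', 'a b\n']

# str.split returns a list (unhashable); tuple() gives us a hashable version
unique_rows = {tuple(one_line.split())
               for one_line in sample_lines}

print(unique_rows == {('a', 'b'), ('c', 'd')})   # True
```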

# Exercise: Unique shells in linux-etc-passwd.txt

1. In the zipfile, there is a file called `linux-etc-passwd.txt`.
2. Use a set comprehension to return the unique/different shells people use on that system.
3. The login shell is the final field on each line with user info.
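
If you want a hint before trying it on the real file, here is a sketch of the general shape on made-up data. The lines in `sample_lines` are invented, and the condition assumes that comment lines start with `#` and that lines with user info contain `:`; check the actual file to see whether those assumptions hold.

```python
# Made-up lines in the style of a Linux /etc/passwd file
sample_lines = ['# a comment\n',
                '\n',
                'root:x:0:0:root:/root:/bin/bash\n',
                'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n',
                'user1:x:1000:1000:User One:/home/user1:/bin/bash\n']

shells = {one_line.strip().split(':')[-1]     # the final field is the shell
          for one_line in sample_lines
          if ':' in one_line and not one_line.startswith('#')}

print(sorted(shells))   # ['/bin/bash', '/usr/sbin/nologin']
```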