GitHub repo for this course:

https://github.com/reuven/PyCon-04April-19-comprehensions

# Agenda

1. What are comprehensions?
2. List comprehensions
3. List comprehensions and files
4. Set comprehensions
5. Dict comprehensions
6. Nested comprehensions
7. Generator expressions

In [2]:
# I have a list of integers 
# I want to create a list of those integers squared

numbers = list(range(10))
numbers

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]:
output = []

for one_number in numbers:
    output.append(one_number ** 2)
    
output    

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [4]:
# what is another way to do this?
# list comprehension

[one_number ** 2 for one_number in numbers]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# List comprehension

A comprehension is much easier to write (and understand) if we pick it apart
and write it on multiple lines.

In a comprehension, the *first* thing that runs is the loop, and the second thing is the expression.

The result of a list comprehension is a list.  We have created a new list! We can pass it as an argument to a function, or assign it to a variable.

The new list is the result of evaluating our expression on every element of the input list.

As a result, the output list will have the same number of elements as the input list.

In [5]:
[one_number ** 2             # expression -- any Python expression
 for one_number in numbers]  # iteration -- any Python loop

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# When to use a loop, and when a comprehension?

The big distinction is between getting a new value back and having side effects.

Meaning:

If you have an existing list, and you want a new list, and you can describe the mapping from the first to the second, then you should use a comprehension.

I want the list!

But if you are assigning repeatedly, or modifying things repeatedly, then use a regular `for` loop.
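
Here is a minimal sketch of that distinction, using made-up data (the names `words`, `lengths` and `counts` are only for illustration). The comprehension is there because we want the new list; the `for` loop is there because we keep updating an existing dict.

```python
words = ['abcd', 'ef', 'ghi', 'jk']

# We want a new list, and we can describe the mapping: comprehension
lengths = [len(one_word) for one_word in words]

# We are modifying something repeatedly: regular for loop
counts = {}
for one_word in words:
    counts[len(one_word)] = counts.get(len(one_word), 0) + 1

print(lengths)   # [4, 2, 3, 2]
print(counts)    # {4: 1, 2: 2, 3: 1}
```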

In [6]:
# Let's say that I have a list of strings

mylist = ['abcd', 'ef', 'ghi']

# I want to have a new string based on mylist with '*' between
# the elements

'*'.join(mylist)

'abcd*ef*ghi'

In [7]:
# what if I have a list of integers?

mylist = [10, 20, 30]

'*'.join(mylist)

TypeError: sequence item 0: expected str instance, int found

In [8]:
# we have: a list of integers
# we want: a list of strings
# we can convert one int to one string with str()

[str(one_item)
 for one_item in mylist]

['10', '20', '30']

In [9]:
'*'.join([str(one_item)
         for one_item in mylist])

'10*20*30'

In [10]:
# I have a string, and I want to capitalize the start
# of each word

s = 'This is a sample sentence for my tutorial'

s.title()

'This Is A Sample Sentence For My Tutorial'

In [11]:
# what if str.title didn't exist? Could I still do
# something like this?

# what if, for example, I were to break the string
# into individual words?

# I have: a list of strings
# I want: a list of strings whose first letters are capitalized
# I can use: str.capitalize 

[one_word.capitalize()
 for one_word in s.split()]

['This', 'Is', 'A', 'Sample', 'Sentence', 'For', 'My', 'Tutorial']

In [12]:
' '.join([one_word.capitalize()
         for one_word in s.split()])

'This Is A Sample Sentence For My Tutorial'

# Exercises:

1. Ask the user to enter a string containing numbers, separated by spaces. Add those numbers together (as integers), and print the result.  It's OK to use the builtin `sum` function.

2. Ask the user to enter a string, and print the length of the string, not counting whitespace. It's *not* OK to use `str.replace`.


In [13]:
s = input('Enter numbers, separated by whitespace: ').strip()

Enter numbers, separated by whitespace: 10 20 30 40 45


In [14]:
s

'10 20 30 40 45'

In [15]:
sum(s)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [16]:
# what I have: a list of strings, each containing digits
# what I want: a list of integers, which I can then sum
# I can transform one to the other with int

[int(one_item)
for one_item in s.split()]

[10, 20, 30, 40, 45]

In [17]:
sum([int(one_item)
    for one_item in s.split()])

145

In [18]:
sum([int(one_item)
    for one_item in s])

ValueError: invalid literal for int() with base 10: ' '

In [19]:
# find the lengths of the words (not the whitespace)
# in the user's input

s = input('Enter a sentence: ').strip()

len(s)  # how many characters in the entire sentence?

Enter a sentence: this is a test sentence


23

In [22]:
# how long is this, if we ignore the whitespace?

# if I use s.split(), I get a list of strings
# without any whitespace

# I have: a list of strings
# I want: the sum of their lengths
# I can apply: len

sum([len(one_word)
    for one_word in s.split()])

19

In [23]:
s = '    a    b    c     '

s.strip()

'a    b    c'

In [25]:
s = '10 20 30'

[int(one_item)             # expression --  SELECT
for one_item in s.split()] # iteration  --  FROM

[10, 20, 30]

In [26]:
# I have a string with words
# I want to print each word with stars around it!

s = 'this is fantastic'

[print(f'*{one_word}*')
for one_word in s.split()]


*this*
*is*
*fantastic*


[None, None, None]

The list that we get back from a comprehension contains the values that the expression returned.

`print` always returns `None`, no matter what you print.

Here, it printed to the screen but returned `None` each time, which is why the list we got back contains nothing but `None`.
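
A short sketch of the two alternatives, assuming the same string as in the cell above (the name `starred` is just for illustration). If all we want is the printing, a regular `for` loop does it without building a throwaway list; if we want the starred strings themselves, the comprehension should return them rather than print them.

```python
s = 'this is fantastic'

# Side effect only (printing): a regular for loop
for one_word in s.split():
    print(f'*{one_word}*')

# We want the values: let the expression return the string
starred = [f'*{one_word}*'
           for one_word in s.split()]
print(starred)   # ['*this*', '*is*', '*fantastic*']
```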

In [27]:
# how about iterating over files?

[one_line
 for one_line in open('linux-etc-passwd.txt')]

['# This is a comment\n',
 '# You should ignore me\n',
 'root:x:0:0:root:/root:/bin/bash\n',
 'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n',
 'bin:x:2:2:bin:/bin:/usr/sbin/nologin\n',
 'sys:x:3:3:sys:/dev:/usr/sbin/nologin\n',
 'sync:x:4:65534:sync:/bin:/bin/sync\n',
 'games:x:5:60:games:/usr/games:/usr/sbin/nologin\n',
 'man:x:6:12:man:/var/cache/man:/usr/sbin/nologin\n',
 'lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin\n',
 'mail:x:8:8:mail:/var/mail:/usr/sbin/nologin\n',
 '\n',
 '\n',
 '\n',
 'news:x:9:9:news:/var/spool/news:/usr/sbin/nologin\n',
 'uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin\n',
 'proxy:x:13:13:proxy:/bin:/usr/sbin/nologin\n',
 'www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin\n',
 'backup:x:34:34:backup:/var/backups:/usr/sbin/nologin\n',
 'list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin\n',
 'irc:x:39:39:ircd:/var/run/ircd:/usr/sbin/nologin\n',
 'gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin\n',
 '\n

In [28]:
# Can I get the usernames from this passwd file?

# each record contains fields
# fields are separated by ':'
# the first field is the username

[one_line.split(':')[0]
 for one_line in open('linux-etc-passwd.txt')]

['# This is a comment\n',
 '# You should ignore me\n',
 'root',
 'daemon',
 'bin',
 'sys',
 'sync',
 'games',
 'man',
 'lp',
 'mail',
 '\n',
 '\n',
 '\n',
 'news',
 'uucp',
 'proxy',
 'www-data',
 'backup',
 'list',
 'irc',
 'gnats',
 '\n',
 'nobody',
 'syslog',
 'messagebus',
 'landscape',
 'jci',
 'sshd',
 'user',
 'reuven',
 'postfix',
 'colord',
 'postgres',
 'dovecot',
 'dovenull',
 'postgrey',
 'debian-spamd',
 'memcache',
 'genadi',
 'shira',
 'atara',
 'shikma',
 'amotz',
 'mysql',
 'clamav',
 'amavis',
 'opendkim',
 'gitlab-redis',
 'gitlab-psql',
 'git',
 'opendmarc',
 'dkim-milter-python',
 'deploy',
 'redis']

In [29]:
[one_line.split(':')[0]       # expression -- SELECT
 for one_line in open('linux-etc-passwd.txt')  # iteration -- FROM
 if ':' in one_line  ]       # condition -- WHERE

['root',
 'daemon',
 'bin',
 'sys',
 'sync',
 'games',
 'man',
 'lp',
 'mail',
 'news',
 'uucp',
 'proxy',
 'www-data',
 'backup',
 'list',
 'irc',
 'gnats',
 'nobody',
 'syslog',
 'messagebus',
 'landscape',
 'jci',
 'sshd',
 'user',
 'reuven',
 'postfix',
 'colord',
 'postgres',
 'dovecot',
 'dovenull',
 'postgrey',
 'debian-spamd',
 'memcache',
 'genadi',
 'shira',
 'atara',
 'shikma',
 'amotz',
 'mysql',
 'clamav',
 'amavis',
 'opendkim',
 'gitlab-redis',
 'gitlab-psql',
 'git',
 'opendmarc',
 'dkim-milter-python',
 'deploy',
 'redis']
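
The cells above pass `open(...)` straight into the comprehension, which is convenient in a notebook but never explicitly closes the file. In a longer program you might prefer to wrap the comprehension in `with`. This is only a sketch: it writes a small temporary file (a hypothetical stand-in for `linux-etc-passwd.txt`, with a few of its lines) so that it can run on its own.

```python
import os
import tempfile

# Hypothetical stand-in for linux-etc-passwd.txt
sample = ('# This is a comment\n'
          'root:x:0:0:root:/root:/bin/bash\n'
          '\n'
          'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n')

with tempfile.TemporaryDirectory() as dirname:
    filename = os.path.join(dirname, 'passwd-sample.txt')
    with open(filename, 'w') as outfile:
        outfile.write(sample)

    # Same expression / iteration / condition as before, but the
    # file is closed as soon as the with block ends
    with open(filename) as infile:
        usernames = [one_line.split(':')[0]
                     for one_line in infile
                     if ':' in one_line]

print(usernames)   # ['root', 'daemon']
```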

In [30]:
!cat nums.txt

5
	10     
	20
  	3
		   	20        

 25


# Exercise: Sum numbers

Use a comprehension to read through `nums.txt`, and sum the numbers it contains.

Each line of the file contains either zero integers or one integer.  The integer might well have whitespace before or after it.

In [33]:
# I have: a file whose lines (strings) contain numbers
# I want: a list of numbers
# transform from a string to an int with int()

[int(one_line.strip())
 for one_line in open('nums.txt')]

ValueError: invalid literal for int() with base 10: ''

In [34]:
int('5')

5

In [35]:
int('    5      ')

5

In [36]:
int('    ')

ValueError: invalid literal for int() with base 10: '    '

In [37]:
int('')

ValueError: invalid literal for int() with base 10: ''

In [40]:
# add a condition

sum([int(one_line)
 for one_line in open('nums.txt')
 if one_line.strip()])   # just check that the line isn't empty

83

In [41]:
sum([int(one_line)
 for one_line in open('nums.txt')
 if one_line.strip().isdigit()])   # check that the line contains only digits

83
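
The two conditions give the same answer on `nums.txt`, but they are not equivalent. The sketch below uses a made-up list of lines (not the contents of the real file) to show the difference: `one_line.strip()` lets through any non-blank line, including ones that `int` cannot handle, while `str.isdigit` skips those lines but also skips negative numbers.

```python
# Hypothetical lines, standing in for a file like nums.txt
lines = ['5\n', '\t10     \n', '\n', 'abc\n', ' -7 \n', ' 25\n']

# Keeps every non-blank line, so int('abc') would raise ValueError
non_blank = [one_line.strip()
             for one_line in lines
             if one_line.strip()]

# Keeps only lines made up entirely of digits: skips 'abc' and '-7'
total = sum([int(one_line)
             for one_line in lines
             if one_line.strip().isdigit()])

print(non_blank)   # ['5', '10', 'abc', '-7', '25']
print(total)       # 40
```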

In [43]:
# I want to know how many vowels are in a string
# how can I use a comprehension for that?

s = 'whatever'

sum([1
 for one_character in s
 if one_character in 'aeiou'])


3

In [44]:
!head shoe-data.txt

Adidas	orange	43
Nike	black	41
Adidas	black	39
New Balance	pink	41
Nike	white	44
New Balance	orange	38
Nike	pink	44
Adidas	pink	44
New Balance	orange	39
New Balance	black	43


# Exercise: Shoe dicts

`shoe-data.txt` contains 100 lines. Each line contains three fields: Brand, color, and size. Fields are separated by tabs (`'\t'`).

Use a list comprehension to turn this file into a list of dictionaries. Each line should be turned into a dict whose keys are `brand`, `color`, and `size`. The values can remain strings; don't worry about the sizes.

I recommend that you write an external function that takes a string as input and returns a dict, then invoke that in your comprehension.

The result will be a list of dicts, along these lines:

```python
[
    {'brand':'Adidas',
     'color':'black',
     'size':'45'},
    ...

]
```

In [47]:
# a simple (not working) approach

def line_to_dict(one_line):
    return one_line.split('\t')

[line_to_dict(one_line)
 for one_line in open('shoe-data.txt')]

[['Adidas', 'orange', '43\n'],
 ['Nike', 'black', '41\n'],
 ['Adidas', 'black', '39\n'],
 ['New Balance', 'pink', '41\n'],
 ['Nike', 'white', '44\n'],
 ['New Balance', 'orange', '38\n'],
 ['Nike', 'pink', '44\n'],
 ['Adidas', 'pink', '44\n'],
 ['New Balance', 'orange', '39\n'],
 ['New Balance', 'black', '43\n'],
 ['New Balance', 'orange', '44\n'],
 ['Nike', 'black', '41\n'],
 ['Adidas', 'orange', '37\n'],
 ['Adidas', 'black', '38\n'],
 ['Adidas', 'pink', '41\n'],
 ['Adidas', 'white', '36\n'],
 ['Adidas', 'orange', '36\n'],
 ['Nike', 'pink', '41\n'],
 ['Adidas', 'pink', '35\n'],
 ['New Balance', 'orange', '37\n'],
 ['Nike', 'pink', '43\n'],
 ['Nike', 'black', '43\n'],
 ['Nike', 'black', '42\n'],
 ['Nike', 'black', '35\n'],
 ['Adidas', 'black', '41\n'],
 ['New Balance', 'pink', '40\n'],
 ['Adidas', 'white', '35\n'],
 ['New Balance', 'pink', '41\n'],
 ['New Balance', 'orange', '41\n'],
 ['Adidas', 'orange', '40\n'],
 ['New Balance', 'orange', '40\n'],
 ['New Balance', 'white', '44\n'],
 [

In [48]:
# a simple working version

def line_to_dict(one_line):
    fields = one_line.split('\t')
    
    return {'brand':fields[0],
           'color':fields[1],
           'size':fields[2]}

[line_to_dict(one_line)
 for one_line in open('shoe-data.txt')]

[{'brand': 'Adidas', 'color': 'orange', 'size': '43\n'},
 {'brand': 'Nike', 'color': 'black', 'size': '41\n'},
 {'brand': 'Adidas', 'color': 'black', 'size': '39\n'},
 {'brand': 'New Balance', 'color': 'pink', 'size': '41\n'},
 {'brand': 'Nike', 'color': 'white', 'size': '44\n'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '38\n'},
 {'brand': 'Nike', 'color': 'pink', 'size': '44\n'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '44\n'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '39\n'},
 {'brand': 'New Balance', 'color': 'black', 'size': '43\n'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '44\n'},
 {'brand': 'Nike', 'color': 'black', 'size': '41\n'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '37\n'},
 {'brand': 'Adidas', 'color': 'black', 'size': '38\n'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '41\n'},
 {'brand': 'Adidas', 'color': 'white', 'size': '36\n'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '36\n'},
 {'brand': 'Nike', 'color': '

In [50]:
# a little better, using unpacking

def line_to_dict(one_line):
    brand, color, size = one_line.strip().split('\t')
    
    return {'brand':brand,
           'color':color,
           'size':size}

[line_to_dict(one_line)
 for one_line in open('shoe-data.txt')]

[{'brand': 'Adidas', 'color': 'orange', 'size': '43'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'black', 'size': '39'},
 {'brand': 'New Balance', 'color': 'pink', 'size': '41'},
 {'brand': 'Nike', 'color': 'white', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '38'},
 {'brand': 'Nike', 'color': 'pink', 'size': '44'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '39'},
 {'brand': 'New Balance', 'color': 'black', 'size': '43'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '44'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '37'},
 {'brand': 'Adidas', 'color': 'black', 'size': '38'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '41'},
 {'brand': 'Adidas', 'color': 'white', 'size': '36'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '36'},
 {'brand': 'Nike', 'color': 'pink', 'size': '41'},
 {'brand': '
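
Another way to write `line_to_dict`, offered only as a sketch: pair the key names with the fields using `zip`, and hand the pairs to `dict`. The two sample lines below are in the same tab-separated format as `shoe-data.txt`, so the example does not need the file.

```python
# Two sample lines in the same format as shoe-data.txt
lines = ['Adidas\torange\t43\n',
         'Nike\tblack\t41\n']

def line_to_dict(one_line):
    # zip pairs each key with the field in the same position
    return dict(zip(['brand', 'color', 'size'],
                    one_line.strip().split('\t')))

shoes = [line_to_dict(one_line)
         for one_line in lines]
print(shoes)
```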

In [51]:
!ls *.txt

linux-etc-passwd.txt  mini-access-log.txt  nums.txt  shoe-data.txt  wcfile.txt


In [52]:
!head mini-access-log.txt

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.12 - -

In [54]:
# I want to retrieve all of the IP addresses from mini-access-log

[one_line.split()[0]
 for one_line in open('mini-access-log.txt')]

['67.218.116.165',
 '66.249.71.65',
 '65.55.106.183',
 '65.55.106.183',
 '66.249.71.65',
 '66.249.71.65',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.106.131',
 '65.55.106.131',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.106.186',
 '65.55.106.186',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '74.52.245.146',
 '74.52.245.146',
 '66.249.65.43',
 '66.249.65.43',
 '66.249.65.43',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.207.25',
 '65.55.207.25',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '65.55.207.94',
 '65.55.207.94',
 '66.249.65.12',
 '65.55.207.71',
 '66.249.65.12',
 '66.249.65.12',
 '66.249.65.12',
 '98.242.170.241',
 '66.249.65.38',
 '66.249.65.38',
 '66.249.65.38',
 '66.249.65.38',
 '66.249.65.38',
 '

In [55]:
# how many times did each IP address access my server?

# Counter

In [56]:
from collections import Counter

In [57]:
# the bad, wrong way to use Counter
# is as a cheap defaultdict

c = Counter()
c['a'] += 5
c['b'] += 3
c

Counter({'a': 5, 'b': 3})

In [59]:
# the real way to use Counter is to initialize it
# with an iterable

# it will count how many times each element of that
# iterable appears. Each element becomes a key,
# the number of times it appears becomes the value

c = Counter([one_line.split()[0]
        for one_line in open('mini-access-log.txt')])
c

Counter({'67.218.116.165': 2,
         '66.249.71.65': 3,
         '65.55.106.183': 2,
         '66.249.65.12': 32,
         '65.55.106.131': 2,
         '65.55.106.186': 2,
         '74.52.245.146': 2,
         '66.249.65.43': 3,
         '65.55.207.25': 2,
         '65.55.207.94': 2,
         '65.55.207.71': 1,
         '98.242.170.241': 1,
         '66.249.65.38': 100,
         '65.55.207.126': 2,
         '82.34.9.20': 2,
         '65.55.106.155': 2,
         '65.55.207.77': 2,
         '208.80.193.28': 1,
         '89.248.172.58': 22,
         '67.195.112.35': 16,
         '65.55.207.50': 3,
         '65.55.215.75': 2})

In [60]:
# Counter inherits from dict

for key, value in c.items():
    print(f'{key}: {value}')

67.218.116.165: 2
66.249.71.65: 3
65.55.106.183: 2
66.249.65.12: 32
65.55.106.131: 2
65.55.106.186: 2
74.52.245.146: 2
66.249.65.43: 3
65.55.207.25: 2
65.55.207.94: 2
65.55.207.71: 1
98.242.170.241: 1
66.249.65.38: 100
65.55.207.126: 2
82.34.9.20: 2
65.55.106.155: 2
65.55.207.77: 2
208.80.193.28: 1
89.248.172.58: 22
67.195.112.35: 16
65.55.207.50: 3
65.55.215.75: 2


In [62]:
for key, value in c.items():
    print(f'{key:18}: {value * "x"}')

67.218.116.165    : xx
66.249.71.65      : xxx
65.55.106.183     : xx
66.249.65.12      : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
65.55.106.131     : xx
65.55.106.186     : xx
74.52.245.146     : xx
66.249.65.43      : xxx
65.55.207.25      : xx
65.55.207.94      : xx
65.55.207.71      : x
98.242.170.241    : x
66.249.65.38      : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
65.55.207.126     : xx
82.34.9.20        : xx
65.55.106.155     : xx
65.55.207.77      : xx
208.80.193.28     : x
89.248.172.58     : xxxxxxxxxxxxxxxxxxxxxx
67.195.112.35     : xxxxxxxxxxxxxxxx
65.55.207.50      : xxx
65.55.215.75      : xx


In [63]:
# this is a Counter method that returns
# a list of tuples, from the most common to the least,
# based on the Counter
c.most_common()

[('66.249.65.38', 100),
 ('66.249.65.12', 32),
 ('89.248.172.58', 22),
 ('67.195.112.35', 16),
 ('66.249.71.65', 3),
 ('66.249.65.43', 3),
 ('65.55.207.50', 3),
 ('67.218.116.165', 2),
 ('65.55.106.183', 2),
 ('65.55.106.131', 2),
 ('65.55.106.186', 2),
 ('74.52.245.146', 2),
 ('65.55.207.25', 2),
 ('65.55.207.94', 2),
 ('65.55.207.126', 2),
 ('82.34.9.20', 2),
 ('65.55.106.155', 2),
 ('65.55.207.77', 2),
 ('65.55.215.75', 2),
 ('65.55.207.71', 1),
 ('98.242.170.241', 1),
 ('208.80.193.28', 1)]

In [64]:
# I can invoke most_common with an argument
# I get only that many elements in the output list

c.most_common(5)

[('66.249.65.38', 100),
 ('66.249.65.12', 32),
 ('89.248.172.58', 22),
 ('67.195.112.35', 16),
 ('66.249.71.65', 3)]
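
A small self-contained sketch of the same idea, using a handful of addresses instead of the log file (the list `addresses` is made up for illustration). It also shows one way in which `Counter` differs from a plain dict: asking for a key that was never counted gives `0` rather than a `KeyError`.

```python
from collections import Counter

# A few IP addresses, as the comprehension over the log would produce
addresses = ['66.249.65.12', '65.55.106.183', '66.249.65.12',
             '66.249.65.12', '65.55.106.183', '82.34.9.20']

c = Counter(addresses)

print(c.most_common(2))   # [('66.249.65.12', 3), ('65.55.106.183', 2)]
print(c['1.2.3.4'])       # 0 -- never seen, but no KeyError
```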

In [65]:
!ls *.txt

linux-etc-passwd.txt  mini-access-log.txt  nums.txt  shoe-data.txt  wcfile.txt


In [66]:
usernames = [one_line.split(':')[0]
            for one_line in open('linux-etc-passwd.txt')
            if ":" in one_line]

In [67]:
usernames

['root',
 'daemon',
 'bin',
 'sys',
 'sync',
 'games',
 'man',
 'lp',
 'mail',
 'news',
 'uucp',
 'proxy',
 'www-data',
 'backup',
 'list',
 'irc',
 'gnats',
 'nobody',
 'syslog',
 'messagebus',
 'landscape',
 'jci',
 'sshd',
 'user',
 'reuven',
 'postfix',
 'colord',
 'postgres',
 'dovecot',
 'dovenull',
 'postgrey',
 'debian-spamd',
 'memcache',
 'genadi',
 'shira',
 'atara',
 'shikma',
 'amotz',
 'mysql',
 'clamav',
 'amavis',
 'opendkim',
 'gitlab-redis',
 'gitlab-psql',
 'git',
 'opendmarc',
 'dkim-milter-python',
 'deploy',
 'redis']

In [68]:
# we can now search for usernames in this list using "in"

'root' in usernames

True

In [69]:
'reuven' in usernames

True

In [70]:
'asdfasfa' in usernames

False

In [71]:
# if I'm going to be searching a lot through my usernames,
# then maybe I should use a different data structure

# sets guarantee that their members are unique, searching them
# is very fast, and all elements must be hashable -- just like
# dict keys.

In [72]:
usernames = set([one_line.split(':')[0]
            for one_line in open('linux-etc-passwd.txt')
            if ":" in one_line])

In [73]:
type(usernames)

set

In [74]:
# we can use a set comprehension!

# looks almost exactly like a list comprehension
# but it uses {} instead

usernames = {one_line.split(':')[0]
            for one_line in open('linux-etc-passwd.txt')
            if ":" in one_line}

In [75]:
type(usernames)

set

In [76]:
'root' in usernames

True

# Exercise: Sum unique numbers

1. Ask the user to enter numbers, separated by whitespace
2. Print their sum, but *only count each number once*.

Example:

    Enter numbers: 10 20 30 10 20 30
    Total is 60

In [78]:
s = input('Enter numbers: ').strip()

sum({int(one_item)
    for one_item in s.split()})

Enter numbers: 10 20 30 10 20 30


60

In [79]:
!head linux-etc-passwd.txt

# This is a comment
# You should ignore me
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin


# Exercise: Which shells?

Read through `linux-etc-passwd.txt`, and find the different shells that are used on the system.

In [86]:
{
    one_line.split(':')[-1].strip()
    for one_line in open('linux-etc-passwd.txt')
    if ':' in one_line
}

{'/bin/bash',
 '/bin/false',
 '/bin/nologin',
 '/bin/sh',
 '/bin/sync',
 '/usr/sbin/nologin'}

In [89]:
# I have a string with some words

# I want to create a dict where each word is the key
# and the word length is the value

s = 'this is a bunch of words'

# we can invoke dict() on a list of tuples
# and get back a dictionary!

dict([(one_word, len(one_word))
    for one_word in s.split()])

{'this': 4, 'is': 2, 'a': 1, 'bunch': 5, 'of': 2, 'words': 5}

In [90]:
# dictionary comprehension
# we use curly braces, just as with a set comprehension

# but we have *two* expressions in the first line,
# separated by a colon

{    one_word   :   len(one_word)   # key:value expression
    for one_word in s.split()
}

{'this': 4, 'is': 2, 'a': 1, 'bunch': 5, 'of': 2, 'words': 5}
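
Just as with list and set comprehensions, a dict comprehension can take a condition. A minimal sketch, assuming the same string as above (the name `long_words` is only for illustration):

```python
s = 'this is a bunch of words'

# Same key:value expression, plus a condition
long_words = {one_word: len(one_word)
              for one_word in s.split()
              if len(one_word) > 2}
print(long_words)   # {'this': 4, 'bunch': 5, 'words': 5}
```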

In [91]:
# I'm going to quickly create a config file
# with name = value on each line

with open('myconfig.txt', 'w') as outfile:
    # outfile.__enter__()
    for index, one_character in enumerate('abcd', 1):
        outfile.write(f'{one_character}={index}\n')
    # outfile.__exit__()

In [92]:
!cat myconfig.txt


a=1
b=2
c=3
d=4


In [93]:
# I can use a dict comprehension to read this file
# into a dict!

{   one_line.split('=')[0]   : one_line.split('=')[1].strip()
    for one_line in open('myconfig.txt')
}

{'a': '1', 'b': '2', 'c': '3', 'd': '4'}

# Exercise: Usernames and shells

Use a dict comprehension to create a dict in which the keys are usernames and the values are the shells associated with those usernames in `linux-etc-passwd.txt`.

In [96]:
# start with list-tuple combo

# [(one_line.split(':')[0], one_line.split(':')[-1].strip())
# for one_line in open('linux-etc-passwd.txt')
# if ':' in one_line]


{one_line.split(':')[0]   :  one_line.split(':')[-1].strip() 
 for one_line in open('linux-etc-passwd.txt')
 if ':' in one_line }

{'root': '/bin/bash',
 'daemon': '/usr/sbin/nologin',
 'bin': '/usr/sbin/nologin',
 'sys': '/usr/sbin/nologin',
 'sync': '/bin/sync',
 'games': '/usr/sbin/nologin',
 'man': '/usr/sbin/nologin',
 'lp': '/usr/sbin/nologin',
 'mail': '/usr/sbin/nologin',
 'news': '/usr/sbin/nologin',
 'uucp': '/usr/sbin/nologin',
 'proxy': '/usr/sbin/nologin',
 'www-data': '/usr/sbin/nologin',
 'backup': '/usr/sbin/nologin',
 'list': '/usr/sbin/nologin',
 'irc': '/usr/sbin/nologin',
 'gnats': '/usr/sbin/nologin',
 'nobody': '/usr/sbin/nologin',
 'syslog': '/bin/false',
 'messagebus': '/bin/false',
 'landscape': '/bin/false',
 'jci': '/bin/bash',
 'sshd': '/usr/sbin/nologin',
 'user': '/bin/bash',
 'reuven': '/bin/bash',
 'postfix': '/bin/false',
 'colord': '/bin/false',
 'postgres': '/bin/bash',
 'dovecot': '/bin/false',
 'dovenull': '/bin/false',
 'postgrey': '/bin/false',
 'debian-spamd': '/bin/sh',
 'memcache': '/bin/false',
 'genadi': '/bin/bash',
 'shira': '/bin/bash',
 'atara': '/bin/bash',
 'shikma

In [102]:
{fields[0]   :  fields[-1].strip()
 for one_line in open('linux-etc-passwd.txt')
 if ':' in one_line and (fields := one_line.split(':'))}

{'root': '/bin/bash',
 'daemon': '/usr/sbin/nologin',
 'bin': '/usr/sbin/nologin',
 'sys': '/usr/sbin/nologin',
 'sync': '/bin/sync',
 'games': '/usr/sbin/nologin',
 'man': '/usr/sbin/nologin',
 'lp': '/usr/sbin/nologin',
 'mail': '/usr/sbin/nologin',
 'news': '/usr/sbin/nologin',
 'uucp': '/usr/sbin/nologin',
 'proxy': '/usr/sbin/nologin',
 'www-data': '/usr/sbin/nologin',
 'backup': '/usr/sbin/nologin',
 'list': '/usr/sbin/nologin',
 'irc': '/usr/sbin/nologin',
 'gnats': '/usr/sbin/nologin',
 'nobody': '/usr/sbin/nologin',
 'syslog': '/bin/false',
 'messagebus': '/bin/false',
 'landscape': '/bin/false',
 'jci': '/bin/bash',
 'sshd': '/usr/sbin/nologin',
 'user': '/bin/bash',
 'reuven': '/bin/bash',
 'postfix': '/bin/false',
 'colord': '/bin/false',
 'postgres': '/bin/bash',
 'dovecot': '/bin/false',
 'dovenull': '/bin/false',
 'postgrey': '/bin/false',
 'debian-spamd': '/bin/sh',
 'memcache': '/bin/false',
 'genadi': '/bin/bash',
 'shira': '/bin/bash',
 'atara': '/bin/bash',
 'shikma
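
The cell above uses an assignment expression (`:=`, available from Python 3.8) so that each line is split only once. Because of `and`, the split only happens for lines containing `':'`, and the resulting list is never empty, so the condition stays true. Here is a sketch of the same technique on a few sample lines from the file (the name `shells` is only for illustration):

```python
lines = ['# This is a comment\n',
         'root:x:0:0:root:/root:/bin/bash\n',
         '\n',
         'sync:x:4:65534:sync:/bin:/bin/sync\n']

# fields is assigned in the condition, then used in the expression
shells = {fields[0]: fields[-1].strip()
          for one_line in lines
          if ':' in one_line and (fields := one_line.split(':'))}
print(shells)   # {'root': '/bin/bash', 'sync': '/bin/sync'}
```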

In [103]:
# list of lists, where inner lists contain integers
mylist = [[10, 20, 25],
         [30, 35, 40, 45, 50],
         [60, 70, 80, 90, 100],
         [110, 115, 120, 130, 140, 145]]


In [104]:
mylist

[[10, 20, 25],
 [30, 35, 40, 45, 50],
 [60, 70, 80, 90, 100],
 [110, 115, 120, 130, 140, 145]]

In [105]:
# how can I sum the integers in this nested list?

# first guess: sum!  (bad guess)

sum(mylist)

TypeError: unsupported operand type(s) for +: 'int' and 'list'

In [107]:
# guess 2: use a comprehension!

[one_sublist
 for one_sublist in mylist]

[[10, 20, 25],
 [30, 35, 40, 45, 50],
 [60, 70, 80, 90, 100],
 [110, 115, 120, 130, 140, 145]]

In [112]:
# nested list comprehensions!

[one_number
 for one_sublist in mylist
 for one_number in one_sublist]

[10,
 20,
 25,
 30,
 35,
 40,
 45,
 50,
 60,
 70,
 80,
 90,
 100,
 110,
 115,
 120,
 130,
 140,
 145]

In [113]:
[one_number for one_sublist in mylist for one_number in one_sublist]

[10,
 20,
 25,
 30,
 35,
 40,
 45,
 50,
 60,
 70,
 80,
 90,
 100,
 110,
 115,
 120,
 130,
 140,
 145]

In [114]:
[one_number
 for one_sublist in mylist
 if len(one_sublist) > 3
 for one_number in one_sublist]

[30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 110, 115, 120, 130, 140, 145]

In [115]:
[one_number
 for one_sublist in mylist
 if len(one_sublist) > 3  # only long sublists
 for one_number in one_sublist
 if one_number % 2  ]   # only if it's odd

[35, 45, 115, 145]

In [116]:
numbers = [10, 20, 30, 35, 40, 50, 55, 60, 70]

[one_number
 for one_number in numbers
 if one_number > 40
 if one_number % 2]

[55]
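
One way to read a nested comprehension is to rewrite it as nested `for` loops: the `for` and `if` clauses appear in the same order, top to bottom, and the expression moves to the innermost line. A sketch with a shortened version of `mylist`:

```python
mylist = [[10, 20, 25],
          [30, 35, 40, 45, 50]]

# Comprehension: clauses read top to bottom
from_comprehension = [one_number
                      for one_sublist in mylist
                      if len(one_sublist) > 3
                      for one_number in one_sublist
                      if one_number % 2]

# The same clauses as regular loops, nested in the same order
from_loops = []
for one_sublist in mylist:
    if len(one_sublist) > 3:
        for one_number in one_sublist:
            if one_number % 2:
                from_loops.append(one_number)

print(from_comprehension)   # [35, 45]
print(from_loops)           # [35, 45]
```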

In [117]:
!ls 

'PyCon - 2023-04April-19-comprehensions.ipynb'	 movies.dat
 README.md					 myconfig.txt
 README.md~					 nums.txt
 advanced-exercise-files.zip			 shoe-data.txt
 exercise-files.zip				 taxi.csv
 linux-etc-passwd.txt				 wcfile.txt
 mini-access-log.txt


In [118]:
# download the movies.dat file from here:
# https://files.lerner.co.il/advanced-exercise-files.zip

In [119]:
!head movies.dat

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller


# Exercise: Movie genres

Goal: Find out what the 5 most popular movie genres are in `movies.dat`

Use a nested comprehension to read through the file, find the appropriate fields and lines, and then use `Counter` to find the most common genres.

If a movie has more than one genre, each should be counted once.

Hint: You'll want to hand `Counter` a list of genres from the file.
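
One possible shape for a solution, sketched on the first four lines shown by `head movies.dat` above rather than on the whole file. It assumes every line has exactly three `::`-separated fields, which the real file may not guarantee.

```python
from collections import Counter

# The first four lines of movies.dat, as shown above
lines = ["1::Toy Story (1995)::Animation|Children's|Comedy\n",
         "2::Jumanji (1995)::Adventure|Children's|Fantasy\n",
         '3::Grumpier Old Men (1995)::Comedy|Romance\n',
         '4::Waiting to Exhale (1995)::Comedy|Drama\n']

# Outer loop: lines.  Inner loop: genres within each line
genres = [one_genre
          for one_line in lines
          for one_genre in one_line.strip().split('::')[2].split('|')]

c = Counter(genres)
print(c.most_common(2))   # [('Comedy', 3), ("Children's", 2)]
```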

In [120]:
[one_line
 for one_line in open('movies.dat')]

["1::Toy Story (1995)::Animation|Children's|Comedy\n",
 "2::Jumanji (1995)::Adventure|Children's|Fantasy\n",
 '3::Grumpier Old Men (1995)::Comedy|Romance\n',
 '4::Waiting to Exhale (1995)::Comedy|Drama\n',
 '5::Father of the Bride Part II (1995)::Comedy\n',
 '6::Heat (1995)::Action|Crime|Thriller\n',
 '7::Sabrina (1995)::Comedy|Romance\n',
 "8::Tom and Huck (1995)::Adventure|Children's\n",
 '9::Sudden Death (1995)::Action\n',
 '10::GoldenEye (1995)::Action|Adventure|Thriller\n',
 '11::American President, The (1995)::Comedy|Drama|Romance\n',
 '12::Dracula: Dead and Loving It (1995)::Comedy|Horror\n',
 "13::Balto (1995)::Animation|Children's\n",
 '14::Nixon (1995)::Drama\n',
 '15::Cutthroat Island (1995)::Action|Adventure|Romance\n',
 '16::Casino (1995)::Drama|Thriller\n',
 '17::Sense and Sensibility (1995)::Drama|Romance\n',
 '18::Four Rooms (1995)::Thriller\n',
 '19::Ace Ventura: When Nature Calls (1995)::Comedy\n',
 '20::Money Train (1995)::Action\n',
 '21::Get Shorty (1995)::Action|C