# Agenda, week 5: Modules and packages

1. Review of the challenge
2. Q&A
3. What are modules?
4. Using `import` to retrieve data from modules
5. Different variations on `import`
6. How do we develop a module?
7. What happens when a module is imported?
8. Python's standard library
9. Packages and PyPI -- finding and downloading them onto your computer
10. Using `pip`
11. What's next?



In [1]:
# Challenge solution

# Write your code below...

def count_ips(filename):
    output = {}
    for one_line in open(filename):

        # turn the line into a list of strings, separating on whitespace
        # grab the first field, at index 0-- the IP address
        ip_address = one_line.split()[0]  

        # have I seen this IP address already?
        # If so, then just add 1 to its value

        if ip_address in output:     # "in" on a dict checks the keys
            output[ip_address] += 1
        else:                        # first time seeing this IP address
            output[ip_address] = 1

    return output



In [2]:
counts = count_ips('mini-access-log.txt')

In [5]:
# let's print this dict nicely!

for key, value in counts.items():
    print(f'{key}:\t{value}')

67.218.116.165:	2
66.249.71.65:	3
65.55.106.183:	2
66.249.65.12:	32
65.55.106.131:	2
65.55.106.186:	2
74.52.245.146:	2
66.249.65.43:	3
65.55.207.25:	2
65.55.207.94:	2
65.55.207.71:	1
98.242.170.241:	1
66.249.65.38:	100
65.55.207.126:	2
82.34.9.20:	2
65.55.106.155:	2
65.55.207.77:	2
208.80.193.28:	1
89.248.172.58:	22
67.195.112.35:	16
65.55.207.50:	3
65.55.215.75:	2


In [8]:
# what if your boss doesn't want to see numbers, but rather wants to see a histogram?

for key, value in counts.items():
    print(f'{key}:\t{value * "x"}')

67.218.116.165:	xx
66.249.71.65:	xxx
65.55.106.183:	xx
66.249.65.12:	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
65.55.106.131:	xx
65.55.106.186:	xx
74.52.245.146:	xx
66.249.65.43:	xxx
65.55.207.25:	xx
65.55.207.94:	xx
65.55.207.71:	x
98.242.170.241:	x
66.249.65.38:	xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
65.55.207.126:	xx
82.34.9.20:	xx
65.55.106.155:	xx
65.55.207.77:	xx
208.80.193.28:	x
89.248.172.58:	xxxxxxxxxxxxxxxxxxxxxx
67.195.112.35:	xxxxxxxxxxxxxxxx
65.55.207.50:	xxx
65.55.215.75:	xx


In [6]:
5 + 'a'

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [7]:
5 * 'a'   # yes, this will work!

'aaaaa'

# Modules and packages

We've spoken several times about "Don't repeat yourself" -- the "DRY rule."  We've seen it in two contexts so far (items 1 and 2 below), and today we add a third:

1. If we have several lines in a row that are roughly the same, we should turn those into a loop.
2. If we have several places in a program that are roughly the same, we should turn those into a function and then invoke the function in several places.
3. If we have the same code in several different programs, we can use a library, and then reference the code in that library whenever we need it.

In Python, our libraries are known as "modules and packages." A module is a single file containing Python code, and a package is a directory containing one or more modules and, possibly, other packages.  (You can think of them as files and folders.)

1. If you want to do something that others have already done many times before, the odds are good that you can use someone else's library to do it.
2. If you have implemented something (including at work) that might help others (or yourself) in future programs, then you can write a module and share it with others.

The whole idea here is that you shouldn't be re-inventing the wheel. And if you use a module, then you don't have to worry about writing or maintaining that code. It's someone else's problem!

Modules in Python do all of this -- they let us reuse code, and concentrate on the new, distinct problems we have to solve.

But modules do something else, too: they are namespaces! In other words, if I'm working on a program that uses a variable `x` and you're working on a program that uses a variable `x`, then we don't want them to collide and interfere with one another. Each module keeps its own names separate from those in every other file, such that this cannot happen (or at least, not easily).
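To make the namespace idea concrete, here is a minimal sketch using the standard library's `math` module. The global variable `pi` and its value are my own, chosen only because a name `pi` also exists inside the module.

```python
import math

pi = 3   # my own (deliberately sloppy) global variable named pi

# The module's pi lives in the math namespace, so it is untouched
# by the assignment above. The two names refer to different values.
print(pi)        # my variable
print(math.pi)   # the module's attribute
```

Assigning to `pi` in my program has no effect on `math.pi`, and vice versa.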

To use a module, you use the `import` statement.  A few things about `import`:

- It is a statement, not a function. Don't use parentheses.
- The argument you give to `import` is a word, not a string. It is the name of the module you want to load.  It is not a filename, either!
- After you use `import`, you can use the module; anything defined in the module is available as an *attribute* on the module object, after a `.` .

In [9]:
import random       # this imports the "random" module that comes with Python

# once we've done that, we can access an attribute x as random.x
# attributes can be data, functions, or even classes (data types)

random.randint(0, 100)   # here, we retrieve the "randint" function from the "random" module ... and run it!

95

# Exercise: `glob`

The `glob` module contains a function, also called `glob` (so yes, it's `glob.glob`), that takes a filename pattern, which may include the wildcards `*` and `?`, and returns a list of strings: the filenames that match the pattern.

1. Ask the user to enter a pattern. The pattern can be as simple as `'*.txt'`, which means that it'll return all of the text files in the current directory.
2. Create an empty dict, `counts`. This dict will count the file extensions (suffixes). The keys will be strings and the values will be integers, counting how often each extension exists.
3. Iterate over each filename, grabbing the extension (remember, you can use `str.split` and look at the end with an index of -1), and using it to count.
4. Iterate over the dict, and print its keys and values.

If you get an error saying "SOMETHING is not callable," that means you wrote

    SOMETHING()

with parentheses after it, and SOMETHING is neither a function nor a class.  I'm guessing that you wrote

    import glob

and you then wrote `glob()` rather than `glob.glob()`.

In [18]:
import glob

pattern = input('Enter a pattern: ').strip()

counts = {}
for one_filename in glob.glob(pattern):
    # print(f'\t{one_filename}')
    extension = one_filename.split('.')[-1]   # grab the final part of the filename, after the last .

    if extension in counts:
        counts[extension] += 1
    else:
        counts[extension] = 1

for key, value in counts.items():
    print(f'{key}:\t{value}')

Enter a pattern:  *


txt:	6
ipynb:	5
md:	1
md~:	1
zip:	1
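
The solution above treats a filename with no `.` in it (say, `README`) as if the whole name were its extension, because `split('.')` then returns a one-element list. As a hedged alternative, here is a sketch using `os.path.splitext` from the standard library, which gives back an empty extension in that case. The function name `count_extensions` and the sample filenames are made up for illustration.

```python
import os.path

def count_extensions(filenames):
    """Return a dict mapping each extension (without the dot) to its count."""
    counts = {}
    for one_filename in filenames:
        # splitext returns a (root, extension) tuple, e.g. ('notes', '.md')
        extension = os.path.splitext(one_filename)[1]

        if not extension:           # no extension at all, e.g. 'README'
            continue

        extension = extension[1:]   # drop the leading '.'
        if extension in counts:
            counts[extension] += 1
        else:
            counts[extension] = 1

    return counts

print(count_extensions(['a.txt', 'b.txt', 'notes.md', 'README']))
```

To use it with the exercise, you could pass it the list returned by `glob.glob(pattern)`.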


# What does a module contain?

- Variables
- Functions
- Classes (new data types)

All of these are available as attributes, after the module name and a `.`.  If you aren't sure what attributes a module supports, you have a few options to learn. One of the easiest is the `dir` function; if you run it on a module, you'll see all of its attributes.

Any name that starts and ends with a double underscore (known as a "dunder" in the Python world) is "magic" and shouldn't really be touched unless you know what you're doing.

Any name that starts with a single underscore is considered private/internal, and definitely shouldn't be used or relied upon, because it might change.
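
Since both dunder names and private names start with an underscore, one way to see only the names meant for public use is to filter the output of `dir`. This is a small sketch; the variable name `public_names` is my own.

```python
import glob

# keep only the names that don't start with an underscore
public_names = []
for one_name in dir(glob):
    if not one_name.startswith('_'):
        public_names.append(one_name)

print(public_names)
```

Note that the result still includes modules that `glob` itself imported (such as `os` and `re`), so not everything on the list is something `glob` defines.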

In [19]:
dir(glob)  # what does this module define?

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dir_open_flags',
 '_glob0',
 '_glob1',
 '_glob2',
 '_iglob',
 '_isdir',
 '_ishidden',
 '_isrecursive',
 '_iterdir',
 '_join',
 '_lexists',
 '_listdir',
 '_rlistdir',
 'contextlib',
 'escape',
 'fnmatch',
 'glob',
 'glob0',
 'glob1',
 'has_magic',
 'iglob',
 'itertools',
 'magic_check',
 'magic_check_bytes',
 'os',
 're',
 'stat',
 'sys']

In [20]:
glob  # print this

<module 'glob' from '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/glob.py'>

In [21]:
random

<module 'random' from '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/random.py'>

# Where is the module loaded from? 

When you say `import MODNAME`, Python looks for a file called `MODNAME.py`. We can see which file was loaded by printing the module object, as we did with `glob` and `random` above.

How does Python know which directories to search?

Basically, there's a special list of strings called `sys.path`. When you use the `import` statement, Python iterates over that list of strings, one at a time, looking for the module name you specified.  As soon as it finds the module you want, Python stops.

This means that if more than one directory has your module's filename in it, the first one wins.
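
The search described above can be sketched in a few lines. This is a simplification under stated assumptions: the real import system also handles packages, compiled extensions, zip files, and built-in modules, none of which this loop knows about. The function name `find_module_file` is made up for this example.

```python
import os
import sys

def find_module_file(modname):
    """Return the first MODNAME.py found via sys.path, or None."""
    for one_dir in sys.path:
        # an empty string in sys.path means "the current directory"
        candidate = os.path.join(one_dir or '.', modname + '.py')

        if os.path.isfile(candidate):
            return candidate     # first match wins, so stop looking

    return None

print(find_module_file('glob'))
print(find_module_file('asdfasdfsafdafa'))
```

Because the function returns as soon as it finds a match, a `glob.py` in an earlier directory would shadow the one in the standard library.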

In [22]:
import sys    
sys.path

["/Users/reuven/Courses/Current/O'Reilly-2023-07July",
 '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python311.zip',
 '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11',
 '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload',
 '',
 '/usr/local/lib/python3.11/site-packages',
 '/usr/local/Cellar/pybind11/2.11.1/libexec/lib/python3.11/site-packages',
 '/usr/local/opt/python-tk@3.11/libexec']

In [23]:
import asdfasdfsafdafa

ModuleNotFoundError: No module named 'asdfasdfsafdafa'

In [24]:
help(dir)

Help on built-in function dir in module builtins:

dir(...)
    dir([object]) -> list of strings
    
    If called without an argument, return the names in the current scope.
    Else, return an alphabetized list of names comprising (some of) the attributes
    of the given object, and of attributes reachable from it.
    If the object supplies a method named __dir__, it will be used; otherwise
    the default dir() logic is used and returns:
      for a module object: the module's attributes.
      for a class object:  its attributes, and recursively the attributes
        of its bases.
      for any other object: its attributes, its class's attributes, and
        recursively the attributes of its class's base classes.



# What happens when you import a module?

1. Python checks to see if the module was already loaded. If so, then it skips loading and just assigns the already-loaded module object (cached in the dict `sys.modules`) to the variable.
2. Otherwise, Python looks for the file you have specified in `import`, in each of the directories in `sys.path`.
3. If it finds the file somewhere, then it loads the module and assigns it to the variable you named.
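
The caching in step 1 can be observed directly. Here is a minimal sketch using `sys.modules`, the dict in which Python stores every module it has loaded; the alias `random_again` is my own.

```python
import sys
import random

# after the first import, the module object is stored in sys.modules
print('random' in sys.modules)

import random as random_again   # found in the cache, so nothing is re-run

# both names refer to the very same module object
print(random_again is random)
print(sys.modules['random'] is random)
```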

My talk: What happens when you import a module?

https://www.youtube.com/watch?v=CraNpITZwRo



# Exercise: Total file sizes

1. We have now seen the `glob` module and the `glob.glob` function.
2. There's also an `os` module, which works with your operating system and returns information about files, directories, and the like.  One of the useful functions in there is the `os.stat` function, which returns data about a file. So if you pass `os.stat` a filename, you'll get back an object containing information about when the file was created, modified, and also its size. The size is available as `st_size` on the returned object.
3. Ask the user to enter a file pattern.
4. Use `glob.glob` to find all of the files with this pattern.
5. Print each of the files and their sizes (thanks to `os.stat`)
6. Then print the total size, in bytes, of these files.
 

In [27]:
import os

os.stat('/etc/passwd').st_size

8164

In [34]:
import glob
import os

pattern = input('Enter a pattern: ').strip()
total = 0

for one_filename in glob.glob(pattern):
    size = os.stat(one_filename).st_size
    total += size
    
    print(f'{one_filename:20}:{size}')   # show the filename on a field of 20 characters

print('-' * 40)
print(total)

Enter a pattern:  *.txt


mini-access-log.txt :36562
nums.txt            :42
shoe-data.txt       :1676
linux-etc-passwd.txt:2683
wcfile.txt          :165
myfile.txt          :35
----------------------------------------
41163


# Next up

1. Variations on `import`
2. Developing a module

# When we use `import`...

`import MODNAME` means: (a) load the contents of the module, (b) based on that, create a module object, (c) assign that module object to a global variable named `MODNAME`.

Let's say that I'm going to use `random.randint` a lot in my program. Do I really want to say `random.randint` each time? No! I want to just say `randint`.  But I can't!

In [35]:
# Python finds the global variable random.
# It asks the variable random -- do you have an attribute named "randint"?
# Yes, so we get the function that randint refers to 
# Then we invoke it with ()

random.randint(0, 100)   

24

In [36]:
# what if I just say randint?

randint(0, 100)

NameError: name 'randint' is not defined

In [37]:
# sometimes, we want to be able to call it directly
# for those times, we have the "from .. import" syntax

from random import randint    # this means: load the module, don't define the variable, do define randint = random.randint

In [38]:
randint(0, 100)

64

# Things to remember about `from .. import`

1. The module variable is *not* defined.
2. The attribute you asked for is defined as a variable.
3. Note that the module, if it wasn't yet loaded, is loaded **COMPLETELY** into memory.
4. If you do this, then some other people (or even yourself, in the future) might lose track of the fact that `randint` (or whatever you have imported) comes from the `random` module. It might make the code harder to read and maintain.
5. If you're working with a module that is deeply nested inside of packages, then it's often easier to use `from .. import` than to say `a.b.c.d.e.f.g()`. Instead, you could say `from a.b.c.d.e.f import g` and then only talk about `g`.
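
Points 1 through 3 can be checked with a short experiment. This sketch uses the standard library's `statistics` module rather than `random`, on the assumption that you haven't already run `import statistics` in the same session; if you have, the last line will print `True` instead of `False`.

```python
import sys

from statistics import mean   # load the module, but define only "mean"

print(mean([10, 20, 30]))

# the whole module was loaded and cached...
print('statistics' in sys.modules)

# ...but no variable named "statistics" was created by that import
print('statistics' in globals())
```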

# What if the module name is bad/long/clashes with something else?

If I import `random`, then that name is taken by the module. What if I had previously written (foolishly) a function called `random`? Now I have a namespace collision!

We can avoid this, or other problems, by using `as` -- we can say `import MODNAME as ALIAS`. Then the module is loaded as usual, but the variable that is defined is aliased to whatever we wrote.

In [39]:
import random as r     # we'll still load the module as per usual, but the variable will be r

In [40]:
r.randint(0, 100)

99

In [41]:
# this is especially common in the data science world
# where a number of popular modules have standard aliases

import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# You can also use aliases with "from .. import"

from random import randint as ri   # now, I'm not clashing with the name "randint"

# Four variations on `import`

1. `import MODNAME`
2. `import MODNAME as ALIAS`
3. `from MODNAME import NAME`
4. `from MODNAME import NAME as ALIAS`

There is a fifth way to use `import`, which is:

5. `from MODNAME import *`

**PLEASE PLEASE PLEASE NEVER USE THIS!**

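This takes all of the public names defined in a module (those listed in its `__all__`, or, if there is no `__all__`, every name that doesn't start with an underscore) and creates global variables based on them, with their values.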

There are *MANY* problems with this:

- What if there are clashes with names in your program already?
- What if you upgrade to a newer version of the module, and it adds 10 new attributes which you didn't know about? Surprise! You now have 10 new variables that you need to avoid accidentally stepping on.
- It negates the whole idea of a namespace, of keeping things separate.
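
To get a feel for the scale of the problem without actually doing it, here is a sketch that counts the names `from math import *` would create. I'm assuming that `math` defines no `__all__`, in which case every name not starting with an underscore gets copied. The variable name `would_be_imported` is my own.

```python
import math

# collect the names that a star import would copy into our globals
would_be_imported = []
for one_name in dir(math):
    if not one_name.startswith('_'):
        would_be_imported.append(one_name)

print(len(would_be_imported))

# short, generic names like these are easy to clobber by accident
for one_name in ['e', 'pi', 'pow', 'log']:
    print(one_name, one_name in would_be_imported)
```

A variable called `e` or a function called `log` in your own program would be silently replaced by the star import.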

# Exercise: Punctuation counter

1. Ask the user to enter the name of a text file. (This will fail on a binary file!)
2. Use `string.punctuation`, a string in the `string` module, to create a dict whose keys are the punctuation characters and whose values are all 0.
3. Go through the file, one character at a time, and count how many times each punctuation character is in it.
4. Since `string.punctuation` is very long and tedious to write, use a shorter name for it.

In [42]:
from string import punctuation as punct

filename = input('Enter filename: ').strip()

counts = {}
for one_character in punct:
    counts[one_character] = 0

for one_line in open(filename):      # go through the file, one line at a time
    for one_character in one_line:   # go through the line, one character at a time
        if one_character in counts:  # are we tracking this?
            counts[one_character] += 1

for key, value in counts.items():
    print(f'{key}: {value}')

Enter filename:  /etc/passwd


!: 0
": 0
#: 12
$: 0
%: 0
&: 0
': 0
(: 1
): 1
*: 117
+: 0
,: 0
-: 16
.: 3
/: 616
:: 702
;: 0
<: 0
=: 0
>: 0
?: 0
@: 0
[: 0
\: 0
]: 0
^: 0
_: 124
`: 0
{: 0
|: 0
}: 0
~: 0


In [None]:
from string import punctuation as punct

filename = input('Enter filename: ').strip()

# I have a string, and I want to create a dict whose keys are
# the characters in this string, and whose values are all 0
counts = dict.fromkeys(punct, 0)

for one_line in open(filename):      # go through the file, one line at a time
    for one_character in one_line:   # go through the line, one character at a time
        if one_character in counts:  # are we tracking this?
            counts[one_character] += 1

for key, value in counts.items():
    print(f'{key}: {value}')

We've seen that when we say `import MODNAME`, Python looks for `MODNAME.py`, first in the current directory and then elsewhere in `sys.path`.

Can we write a module? Can it be used? Absolutely! A module is just a text file containing Python code.
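
These notes don't show the contents of `mymod.py`, so here is a sketch of what the first version of the file might look like. It is reconstructed from the outputs in the cells that follow (`x`, `y`, and `hello`), so treat it as an assumption rather than the exact file used in class.

```python
# mymod.py -- a reconstruction, not the original file

x = 100
y = [10, 20, 30]

def hello(name):
    return f'Hello, {name}, from mymod!'
```

Saving this as `mymod.py` in the same directory as the notebook should be enough for `import mymod` to find it.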


In [43]:
# Let's load the module we just created!

import mymod

In [44]:
mymod

<module 'mymod' from "/Users/reuven/Courses/Current/O'Reilly-2023-07July/mymod.py">

In [45]:
dir(mymod)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'hello',
 'x',
 'y']

In [46]:
mymod.x

100

In [47]:
mymod.y

[10, 20, 30]

In [48]:
mymod.hello('world')

'Hello, world, from mymod!'

In [49]:
dir(mymod)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'hello',
 'x',
 'y']

In [50]:
import mymod

In [51]:
dir(mymod)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'hello',
 'x',
 'y']

In [52]:
# importing mymod a second time didn't pick up the changes in the file
# we need to take advantage of the "reload" facility in Python
# where is that? in a module! the importlib module implements Python's import system,
# so we can use it (and learn from it)

from importlib import reload   # now I'll be able to reload modules

In [53]:
reload(mymod)

<module 'mymod' from "/Users/reuven/Courses/Current/O'Reilly-2023-07July/mymod.py">

In [54]:
dir(mymod)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'd',
 'hello',
 'x',
 'y']

In [55]:
mymod.d

{'a': 100, 'b': 200, 'c': 300}

In [58]:
reload(mymod)

Hello from mymod!
Goodbye from mymod!
You imported mymod.py!


<module 'mymod' from "/Users/reuven/Courses/Current/O'Reilly-2023-07July/mymod.py">
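
The cells above depend on `mymod.py` being edited between runs, which the notes can't show. Here is a self-contained sketch of the same behavior that writes a throwaway module to a temporary directory, changes it on disk, and then compares a second `import` with `reload`. The module name `demo_reload_mod` and all of its contents are made up for this example.

```python
import importlib
import pathlib
import sys
import tempfile

# write a tiny throwaway module into a temporary directory
temp_dir = tempfile.mkdtemp()
module_path = pathlib.Path(temp_dir) / 'demo_reload_mod.py'
module_path.write_text('x = 100\n')

sys.path.insert(0, temp_dir)      # let import search our temporary directory
importlib.invalidate_caches()
import demo_reload_mod

first_value = demo_reload_mod.x

# edit the file on disk: change x and add a new variable, d
module_path.write_text("x = 200\nd = {'a': 100}\n")

import demo_reload_mod            # no effect: the module is already loaded
value_after_import = demo_reload_mod.x

importlib.reload(demo_reload_mod) # re-runs the file, updating the module object
value_after_reload = demo_reload_mod.x

print(first_value, value_after_import, value_after_reload)
print(demo_reload_mod.d)
```

The second `import` leaves `x` unchanged because the module is already cached; only `reload` re-executes the file and picks up the new `x` and `d`.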

# Exercise: Writing a module

1. Create a module file (you can use the Jupyter editor, if you want) called `mathfun.py`.
2. This module will define a single function, `square`. This function will take a single number, and return the number squared.
3. From Jupyter, import the module and execute the function. Check to see that you got the right answer.
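
The notes show the module being used but not the file itself. One possible `mathfun.py`, as a minimal sketch (the parameter name `number` is my own choice):

```python
# mathfun.py -- one possible solution

def square(number):
    return number ** 2
```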


In [59]:
import mathfun

mathfun.square(5)

25

In [61]:
reload(mathfun)

<module 'mathfun' from "/Users/reuven/Courses/Current/O'Reilly-2023-07July/mathfun.py">

# Next up

1. Python's standard library
2. PyPI

# Python standard library

When you download/install Python, it comes with the language, but it also comes with a very large number of modules. These are known as the "standard library." Anyone who installs Python has the same modules.

If you only use modules in the standard library, then anyone else who has Python on their system can use your program, too, without having to install anything else.

The standard library only gets updated and distributed with Python itself -- so if a module might be updated every month, that's too fast for the Python release cycle, and it's unlikely to be in the standard library.

In [62]:
sys.path

["/Users/reuven/Courses/Current/O'Reilly-2023-07July",
 '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python311.zip',
 '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11',
 '/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/lib-dynload',
 '',
 '/usr/local/lib/python3.11/site-packages',
 '/usr/local/Cellar/pybind11/2.11.1/libexec/lib/python3.11/site-packages',
 '/usr/local/opt/python-tk@3.11/libexec']

In [63]:
# /usr/local/lib/python3.11/site-packages

import rich

In [71]:
rich.print(':thumbsup: Hello, [red on purple]world[/red on purple]')

# Exercise: Downloading, installing, and using Rich

1. Use `pip` to install (and/or upgrade) Rich from PyPI
2. Use `import` to load `rich` into your system
3. Use `rich.print` to display some combination of colorized text and emojis.

In [73]:
# you can install things from inside of Jupyter using the ! syntax to run a program in the terminal

!pip install -U rich



In [74]:
rich.print(':thumbsdown: Hi again')

In [75]:
!pip install -U requests




In [76]:
import requests 

r = requests.get('https://python.org')

In [77]:
# r is a "response object," with the result of our request

r

<Response [200]>

In [78]:
r.headers

{'Connection': 'keep-alive', 'Content-Length': '49987', 'Server': 'nginx', 'Content-Type': 'text/html; charset=utf-8', 'X-Frame-Options': 'SAMEORIGIN', 'Via': '1.1 vegur, 1.1 varnish, 1.1 varnish', 'Accept-Ranges': 'bytes', 'Date': 'Thu, 03 Aug 2023 18:47:40 GMT', 'Age': '2915', 'X-Served-By': 'cache-iad-kiad7000025-IAD, cache-mrs10575-MRS', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '157, 214', 'X-Timer': 'S1691088460.304253,VS0,VE0', 'Vary': 'Cookie', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains; preload'}

In [79]:
r.content



# Exercise: Web page sizes

1. Download and install `requests`.
2. Create a list of strings, URLs of pages you like.
3. Iterate over each URL, and use `requests.get` to retrieve the page.
4. Print the URL and the size (length) of the content you got back.
5. Which of your URLs has the largest page size?

In [80]:
my_urls = ['https://lerner.co.il',
           'https://python.org',
           'https://oreilly.com']

for one_url in my_urls:
    r = requests.get(one_url)
    print(f'{one_url}\t:{len(r.content)}')

https://lerner.co.il	:295175
https://python.org	:49987
https://oreilly.com	:81818


In [81]:
# we can use requests to consume APIs that give off JSON

r = requests.get('http://httpbin.org/json')


In [82]:
r.content

b'{\n  "slideshow": {\n    "author": "Yours Truly", \n    "date": "date of publication", \n    "slides": [\n      {\n        "title": "Wake up to WonderWidgets!", \n        "type": "all"\n      }, \n      {\n        "items": [\n          "Why <em>WonderWidgets</em> are great", \n          "Who <em>buys</em> WonderWidgets"\n        ], \n        "title": "Overview", \n        "type": "all"\n      }\n    ], \n    "title": "Sample Slide Show"\n  }\n}\n'

In [86]:
# I want to get back Python objects from this JSON data (r.content is a bytes object)
# option 1: use the "json" module in the standard library

import json
json.loads(r.content)

{'slideshow': {'author': 'Yours Truly',
  'date': 'date of publication',
  'slides': [{'title': 'Wake up to WonderWidgets!', 'type': 'all'},
   {'items': ['Why <em>WonderWidgets</em> are great',
     'Who <em>buys</em> WonderWidgets'],
    'title': 'Overview',
    'type': 'all'}],
  'title': 'Sample Slide Show'}}

In [87]:
# option 2: let requests do it for you!

r.json()  # this is a method, not data

{'slideshow': {'author': 'Yours Truly',
  'date': 'date of publication',
  'slides': [{'title': 'Wake up to WonderWidgets!', 'type': 'all'},
   {'items': ['Why <em>WonderWidgets</em> are great',
     'Who <em>buys</em> WonderWidgets'],
    'title': 'Overview',
    'type': 'all'}],
  'title': 'Sample Slide Show'}}