# Parsing a Jupyter Lab Notebook with Regular Expressions

This is a fairly involved example that assumes the knowledge covered in these notebooks of mine:
1. Core Python
2. Core Python 2
3. Regular Expressions

## Task Overview

Find all imported modules in all notebooks found in my `${HOME}` directory and below.

## Task Detail

* Skip notebooks under `${HOME}/anaconda3`
* Only check notebook cells of type 'code'
* Ignore cells that begin with %% (cell magics)
* Ignore content of comments, triple quoted strings, etc.
* Determine which modules found in all notebooks are missing from the conda environment this notebook is running in
* Create a mapping between each module and the notebooks which use them
* Find the most commonly imported modules

Jupyter Notebook Format: https://nbformat.readthedocs.io/en/latest/format_description.html  
API to read Jupyter Notebooks: https://nbformat.readthedocs.io/en/latest/api.html

For complete parsing of Python syntax use: https://greentreesnakes.readthedocs.io/en/latest/

This example uses regular expressions to parse the common use cases of Python's import syntax.

### Parse the Test Notebook Cell using Regular Expressions

Primary import syntax use cases:
1. import module
2. import module.submodule
3. from \_\_future\_\_ import module
4. from module import submodule
5. from module import not_a_module

From the syntax alone, it is not possible to know the difference between use cases 4 and 5.

However   
```import module.submodule``` is valid whereas  
```import module.not_a_module``` is not valid.

Valid module characters are all word characters and '.'

#### Find Modules on Line

Express these as a list such as:
\[module, module.submodule, module.submodule.submodule\]

Later, an import will be attempted for each of these modules.

In [1]:
import re

def parse_modules(line):
    """Finds potential modules on a given line; returns None if no import is found"""
    # group(2): module after 'from' (optional), group(3): first name after 'import'
    # [\w.]+ matches a dotted module name
    m = re.search(r'(from\s+([\w.]+)\s+)?import\s+([\w.]+)', line)
    if m:
        if m.group(2) is not None and m.group(2) != '__future__':
            module = m.group(2) + '.' + m.group(3)
        else:
            module = m.group(3)

        # expand 'a.b.c' into ['a', 'a.b', 'a.b.c']
        modules = module.split('.')
        mod_list = []
        for i in range(1, len(modules)+1):
            mod_list.append(".".join(modules[:i]))

        return mod_list

#### Line Test of parse_modules()

In [2]:
print(parse_modules('import module'))
print(parse_modules('import module.submodule'))
print(parse_modules('import module.submodule.submodule'))
print(parse_modules('from __future__ import module'))
print(parse_modules('from __future__ import module.submodule'))
print(parse_modules('from module import submodule'))
print(parse_modules('from module.submodule import subsubmodule'))
print(parse_modules('result_set = %sql select from actor where first_name = "Bob"'))

['module']
['module', 'module.submodule']
['module', 'module.submodule', 'module.submodule.submodule']
['module']
['module', 'module.submodule']
['module', 'module.submodule']
['module', 'module.submodule', 'module.submodule.subsubmodule']
None
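
One limitation worth noting: parse_modules() reports only the first module on a line such as `import os, sys`. The following is a minimal sketch of how comma-separated imports could be handled. `parse_import_list` is a hypothetical helper, not part of this notebook's code, and it only covers plain `import` lines (not `from ... import ...`).

```python
import re

# matches a plain 'import ...' statement and captures everything after the keyword
IMPORT_RE = re.compile(r'^\s*import\s+(.+)$')

def parse_import_list(line):
    """Return every module named on a plain 'import a, b as c' line (hypothetical helper)."""
    m = IMPORT_RE.match(line)
    if not m:
        return []
    names = []
    for part in m.group(1).split(','):
        # the first token is the module name; an optional 'as alias' suffix is dropped
        tokens = part.split()
        if tokens:
            names.append(tokens[0])
    return names

print(parse_import_list('import os, sys'))
print(parse_import_list('import numpy as np, pandas as pd'))
```

Each returned name could then be expanded into its dotted prefixes in the same way parse_modules() does.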


#### Cell Test of parse_modules()

##### Create Test Cell

In [3]:
# create a test notebook cell with valid and invalid import use cases
from __future__ import annotations
import sys
import numpy as np
from numpy import random
import sklearn.model_selection
from sklearn.feature_extraction import DictVectorizer

# when parsed, these should be the modules found to be referenced
parsed_modules = ['annotations',
 'nbformat',
 'numpy',
 'numpy.random',
 'sklearn',
 'sklearn.feature_extraction',
 'sklearn.feature_extraction.DictVectorizer', # later this will be shown not to be a module
 'sklearn.model_selection',
 'sys']

# from __future__ import not_real_module
print("hello world") # import not_real_module
print("import not_real_module")
s = """
import not_real_module
"""
import nbformat
t = """import not_real_module
"""
a = 'import not_real_module'
b = 'from not_real_module import not_real_class'

hello world
import not_real_module


##### Find the above notebook cell and display it

In [4]:
nb = nbformat.read('RegExParseNB.ipynb', as_version=4)
for cell_num, cell in enumerate(nb.cells):
    if '# create a test notebook cell with' in cell.source:
        break

print(f'Cell Index: {cell_num}')
print()
print(nb.cells[cell_num].source)

Cell Index: 10

# create a test notebook cell with valid and invalid import use cases
from __future__ import annotations
import sys
import numpy as np
from numpy import random
import sklearn.model_selection
from sklearn.feature_extraction import DictVectorizer

# when parsed, these should be the modules found to be referenced
parsed_modules = ['annotations',
 'nbformat',
 'numpy',
 'numpy.random',
 'sklearn',
 'sklearn.feature_extraction',
 'sklearn.feature_extraction.DictVectorizer', # later this will be shown not to be a module
 'sklearn.model_selection',
 'sys']

# from __future__ import not_real_module
print("hello world") # import not_real_module
print("import not_real_module")
s = """
import not_real_module
"""
import nbformat
t = """import not_real_module
"""
a = 'import not_real_module'
b = 'from not_real_module import not_real_class'


#### Perform Cell Test of parse_modules()

In [5]:
import re
modules = []
cell = nb.cells[cell_num]

# remove contents of triple quoted strings
source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL | re.MULTILINE)

# process each line
lines = source.splitlines()
for line in lines:
    
    # only consider text before # or first single or double quote
    line = re.split('[#\'"]',line)[0]  
    
    mods = parse_modules(line)
    if mods:
        modules.extend(mods)
        
# correct set of modules        
modules = sorted(set(modules))
modules == parsed_modules

True
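
The substitution above removes only strings delimited by three double quotes. Assuming strings delimited by three single quotes should be ignored as well, a backreference can cover both delimiters in one pattern. This is a sketch, and `strip_triple_quoted` is a hypothetical name.

```python
import re

# group 1 captures the opening delimiter; \1 requires the same delimiter to close the string
TRIPLE_QUOTED = re.compile(r'("""|\'\'\')(.*?)\1', flags=re.DOTALL)

def strip_triple_quoted(source):
    """Remove triple quoted strings (either quote style) from a block of source text."""
    return TRIPLE_QUOTED.sub('', source)

sample = "s = '''\nimport fake_one\n'''\nimport os\n"
print(strip_triple_quoted(sample))
```

Like the original substitution, this does not understand escaped quotes or triple quotes that appear inside comments.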

#### Software Engineering Note
The above shows a piece of code that works on a good test example.  The next step is to refactor this code into a function or method.

In [6]:
# initial version of method, cut and paste from above tested code as much as possible
def find_modules_from_cell(cell, modules):
    
    # remove contents of triple quoted strings
    source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL | re.MULTILINE)

    # process each line
    lines = source.splitlines()
    for line in lines:

        # only consider text before # or first single or double quote
        line = re.split('[#\'"]',line)[0]  

        mods = parse_modules(line)
        if mods:
            modules.extend(mods)

In [7]:
# try refactored code
modules = list()
cell = nb.cells[cell_num]

find_modules_from_cell(cell, modules)
modules = sorted(set(modules))
modules == parsed_modules

True

#### Find `${HOME}`

In [8]:
env = %env
home = env['HOME']
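
The `%env` magic is only available inside IPython or Jupyter. If the same code had to run as a plain Python script, the home directory could be found with the standard library instead. This is a minimal sketch of that alternative, not the approach used in this notebook.

```python
import os

# expanduser('~') uses the HOME environment variable on POSIX systems
home = os.path.expanduser('~')
print(home)
```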

#### Find Notebooks to Parse

Find all files that:
1. end in '.ipynb'
2. do not end in '-checkpoint.ipynb'
3. do not begin with `${HOME}/anaconda3`

Note that regular expressions are not used here as str.startswith() and str.endswith() are easier to read than regular expressions and execute faster.

In [9]:
import os

notebooks = []
for dirpath, dirnames, filenames in os.walk(home):
    for filename in filenames:
        fullname = os.path.join(dirpath, filename)
        if fullname.endswith('.ipynb') \
            and not fullname.endswith('-checkpoint.ipynb') \
            and not fullname.startswith(home+'/anaconda3'):
            notebooks.append(fullname)
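
The loop above still walks through every directory under anaconda3 and discards the files afterwards. As a sketch of an alternative, os.walk() lets the caller prune `dirnames` in place so that a skipped directory is never entered. `find_notebooks` and `skip_dirs` are hypothetical names, and unlike the startswith() test this version matches the directory name exactly.

```python
import os

def find_notebooks(root, skip_dirs=('anaconda3',)):
    """Collect .ipynb files under root, skipping the named top-level directories."""
    notebooks = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            # prune in place so os.walk never descends into the skipped directories
            dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for filename in filenames:
            if filename.endswith('.ipynb') and not filename.endswith('-checkpoint.ipynb'):
                notebooks.append(os.path.join(dirpath, filename))
    return notebooks
```

Calling `find_notebooks(home)` would then return a list comparable to `notebooks` above.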

In [10]:
# find all potential modules referenced in all notebooks found above
modules = []
for notebook in notebooks:
    nb = nbformat.read(notebook, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == 'code' and not cell.source.startswith('%%'):
            find_modules_from_cell(cell, modules)

#### Software Engineering Note

find_modules_from_cell is a "helper" function.  In other words, it is a function which is used locally to solve a specific problem and is not intended to be used by other software developers.

#### Try Loading Each Module

Modules could fail to load because:
1. the notebook it was found in is old and the module's location has since changed
2. the module is not available in the virtual environment from which this notebook is being run
3. the notebook was parsed incorrectly and the module is not really a module

Get a list of all the modules that will not load.

Note: use of exec() could result in potentially hackable code and shouldn't be used in production code.  exec() is fine to use for testing in a safe environment.

In [11]:
# unloadable modules are of two types
# 1 the module is not in the virtual environment this notebook is running in
# 2 the module is not really a module
modules = sorted(set(modules))
missing_modules = []
not_modules = []
loadable_modules = []
for module in modules:
    try:
        # attempt to import each module
        exec(f'import {module}')
        loadable_modules.append(module)
    except ModuleNotFoundError as err:
        if '.' in module:
            # 'from module import not_a_module' -> import module.not_a_module
            not_modules.append(module)
        else:
            missing_modules.append(module)
        continue




In [12]:
len(loadable_modules), len(missing_modules), len(not_modules)

(221, 38, 386)
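
As an alternative to exec(), the standard library's importlib.import_module() takes the module name as a string, so no code is assembled from parsed text. The sketch below applies the same classification rule as the cell above. `classify_modules` is a hypothetical helper, and importing a module still runs that module's top-level code.

```python
import importlib

def classify_modules(names):
    """Split names into loadable modules, missing modules, and dotted names that are not modules."""
    loadable, missing, not_modules = [], [], []
    for name in names:
        try:
            importlib.import_module(name)
            loadable.append(name)
        except ModuleNotFoundError:
            # a dotted name most likely came from 'from module import not_a_module'
            (not_modules if '.' in name else missing).append(name)
    return loadable, missing, not_modules

print(classify_modules(['json', 'json.decoder', 'json.JSONDecoder']))
```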

#### Software Engineering Note
For a given module, it would be helpful to know which notebooks referenced it.

It is easy to modify the above code to create a mapping of module name to notebook list, using defaultdict(list).

The only changes to the already tested function are:
1. add notebook to the argument list
2. last line: append notebook to the list of notebooks for that module

In [13]:
# modify to map module to list of notebooks
def find_modules_from_cell(cell, notebook, dd):
    
    # remove contents of triple quoted strings
    source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL | re.MULTILINE)

    # process each line
    lines = source.splitlines()
    for line in lines:

        # only consider text before # or first single or double quote
        line = re.split('[#\'"]',line)[0]  

        mods = parse_modules(line)
        if mods:
            for mod in mods:
                dd[mod].append(notebook)

#### Reparse the Notebook Cells, this time keeping a mapping from module to list of notebooks

In [14]:
from collections import defaultdict

# use new version of find_modules_from_cell
dd = defaultdict(list)
for notebook in notebooks:
    nb = nbformat.read(notebook, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == 'code' and not cell.source.startswith('%%'):  
            find_modules_from_cell(cell, notebook, dd)

#### Software Engineering Note
It is a good idea to check every step of the way.

The list of keys in the above dictionary should match the set of modules found above.  Verify this.

In [15]:
m_list1 = sorted(modules)
m_list2 = sorted(list(dd.keys()))
m_list1 == m_list2

True

In [16]:
# arbitrarily pick a missing module and show a notebook that refers to it
a_missing_module = missing_modules[1]
notebook_list = dd[a_missing_module]

# print the last 76 characters of the first notebook in notebook_list
print(f'Missing module: {a_missing_module}')
print(f'Referenced in: {notebook_list[0][-76:]}')

Missing module: PyQt4
Referenced in: /Python/ipython-in-depth-master/examples/IPython Kernel/Terminal Usage.ipynb


#### Software Engineering Note

My RegEx parsing does not handle every possible valid Python syntax. For example:  
```
import \
    os
```
is valid Python import syntax, but it is not handled by my RegEx parsing.

To handle every valid form of Python syntax, parse the source with Python's `ast` module instead:
https://greentreesnakes.readthedocs.io/en/latest/

This notebook handles all common use cases of Python's import syntax.
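
For comparison, here is a minimal sketch of the same task using the standard library's ast module. `imports_via_ast` is a hypothetical helper. It assumes the cell source is pure Python, since ast.parse() raises SyntaxError on IPython magics such as `%sql`, and it would also report `__future__` for a `from __future__ import ...` line.

```python
import ast

def imports_via_ast(source):
    """Return the sorted module names imported anywhere in a block of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # 'import a, b.c as d' -> 'a', 'b.c'
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # 'from a.b import c' -> 'a.b' (relative imports are skipped)
            found.add(node.module)
    return sorted(found)

sample = 'import \\\n    os\nimport numpy as np, sys\ns = """\nimport not_real_module\n"""\n'
print(imports_via_ast(sample))
```

Comments and string contents never reach the syntax tree as import nodes, so no separate stripping step is needed.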

#### Find the 5 Modules Referenced Most Often 

In [17]:
# to know how often each module was referenced:
# create a new dictionary that maps key to the length of its value list
# (a notebook appears in the list once per import line that mentions the module)

# dictionary comprehension
counts = {key:len(value) for (key,value) in dd.items()}

In [18]:
# create a sorted list of (key,value) tuples from the counts dictionary
sorted_by_value = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# top 5 most used
# each count is the number of import lines found, not the number of distinct notebooks
for key, value in sorted_by_value[:5]:
    print(f'{key:<24} referenced {value:>3} times')

sklearn                  referenced 925 times
numpy                    referenced 361 times
matplotlib               referenced 324 times
matplotlib.pyplot        referenced 280 times
pandas                   referenced 278 times
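
Because find_modules_from_cell() appends the notebook once per matching import line, the lists in dd can contain the same notebook several times, so len(value) counts references rather than distinct notebooks. The sketch below shows how distinct notebooks could be counted instead. It uses a small made-up mapping (`dd_sample`, with hypothetical file names) in place of dd.

```python
def count_distinct_notebooks(mapping):
    """Map each module to the number of different notebooks that reference it."""
    return {module: len(set(notebook_list)) for module, notebook_list in mapping.items()}

# made-up data standing in for dd: a.ipynb imports numpy on two separate lines
dd_sample = {
    'numpy': ['a.ipynb', 'a.ipynb', 'b.ipynb'],
    'sys': ['a.ipynb'],
}
print(count_distinct_notebooks(dd_sample))
```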
