# Parsing a Jupyter Lab Notebook with Regular Expressions

This is a fairly involved example that assumes the knowledge covered in my notebooks:
1. Core Python
2. Core Python 2
3. Regular Expressions

## Task Overview

Find all modules imported by the notebooks in my `${HOME}` directory and below.

## Task Detail

* Skip notebooks under `${HOME}/anaconda3`
* Only check notebook cells of type 'code'
* Ignore cells that begin with %% (cell magics)
* Ignore content of comments, triple quoted strings, etc.
* Determine which root level modules found in all notebooks are missing from the conda environment this notebook is running in
* Create a mapping between each module and the notebooks which use them
* Find the most commonly imported modules

Jupyter Notebook Format: https://nbformat.readthedocs.io/en/latest/format_description.html  
API to read Jupyter Notebooks: https://nbformat.readthedocs.io/en/latest/api.html

For complete parsing of Python syntax, use the `ast` module, which is documented at: https://greentreesnakes.readthedocs.io/en/latest/
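
For comparison, here is a minimal sketch of what `ast`-based extraction might look like. The function name `imports_from_source` is made up for this illustration and is not used elsewhere in this notebook. It only works on source that is valid Python, so cells containing IPython magics would need cleaning first.

```python
import ast

def imports_from_source(source):
    """Return the set of dotted names that import statements in source refer to."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a.b, c  ->  'a.b', 'c'
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # from a.b import c  ->  'a.b.c' (c may or may not be a module)
            found.update(f'{node.module}.{alias.name}' for alias in node.names)
    return found

sample = '''
import numpy as np
from sklearn.feature_extraction import DictVectorizer
s = "import not_real_module"  # import not_real_module
'''
print(sorted(imports_from_source(sample)))
```

Because the parser understands strings and comments, the two `not_real_module` mentions in the sample are ignored without any extra cleanup.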

This example uses regular expressions to parse the common use cases of Python's import syntax.

### Parse the Test Notebook Cell using Regular Expressions

Primary import syntax use cases:
1. import module
2. import module.submodule
3. from \_\_future\_\_ import module
4. from module import something
5. from module.submodule import something

To determine which modules are referenced, each import statement will be converted to a normalized representation.

It is not possible to know from the syntax whether 'something' is or is not a module.  However, an attempt can be made to import it as a module; if the import succeeds, it is a module.

RegEx note: a module name can be captured with: ```([\w.]+)``` (inside a character class, `.` needs no escape and `|` would be matched as a literal pipe)
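
A quick sketch of how such a character class behaves. Inside `[...]` the dot is already literal and `|` is a plain pipe character rather than alternation, so `[\w.]+` is enough to match a dotted name. The variable name `name` is only for illustration.

```python
import re

name = re.compile(r'[\w.]+')

# a dotted module path is matched as one token, stopping at the space
print(name.match('sklearn.model_selection as ms').group())

# '|' inside [...] is a literal character, not "or"
print(re.fullmatch(r'[\w|\.]+', 'a|b') is not None)
print(re.fullmatch(r'[\w.]+', 'a|b') is not None)
```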

#### Find Modules on a Line

Express the import statement found by the regex in the following normalized form:

```
from module.submodule import something ->
[module, module.submodule, module.submodule.something]

import module.submodule.subsubmodule ->
[module, module.submodule, module.submodule.subsubmodule]
```

Later, an attempt will be made to import each of these modules.

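No attempt is made to parse an import statement that continues across lines.  This is not recommended practice and is very rare.  Likewise, when a single statement names several comma-separated modules, only the first name is captured.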

In [1]:
import re

def parse_modules(line):
    """Return the potential modules named by an import on this line, or None."""
    # group 2: module after 'from' (optional); group 3: name after 'import'
    m = re.search(r'(from\s+([\w.]+)\s+)?import\s+([\w.]+)', line)
    if m is None:
        return None

    if m.group(2) is not None and m.group(2) != '__future__':
        module = m.group(2) + '.' + m.group(3)
    else:
        module = m.group(3)

    # expand a.b.c into ['a', 'a.b', 'a.b.c']
    parts = module.split('.')
    return ['.'.join(parts[:i]) for i in range(1, len(parts) + 1)]

#### Line Test of parse_modules()

This is effectively a unit test.  pytest or a similar framework should be used for unit testing, not a Jupyter Notebook.  However, the principles of unit testing can still be followed within a Jupyter Notebook.

In [2]:
# test parsing of import statements on a line
print(parse_modules('import module'))
print(parse_modules('import module.submodule'))
print(parse_modules('import module.submodule.submodule'))
print(parse_modules('from __future__ import module'))
print(parse_modules('from __future__ import module.submodule'))
print(parse_modules('from module import submodule'))
print(parse_modules('from module.submodule import subsubmodule'))
print(parse_modules('result_set = %sql select from actor where first_name = "Bob"'))

['module']
['module', 'module.submodule']
['module', 'module.submodule', 'module.submodule.submodule']
['module']
['module', 'module.submodule']
['module', 'module.submodule']
['module', 'module.submodule', 'module.submodule.subsubmodule']
None
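
The same checks could live in a test module that pytest collects. The sketch below is one way to do that with plain `assert` statements. The file name `test_parse_modules.py` and the test function names are hypothetical, and `parse_modules` is repeated in condensed form only so that the block runs on its own.

```python
import re

def parse_modules(line):
    """Condensed copy of the notebook's parse_modules(), repeated so this runs alone."""
    m = re.search(r'(from\s+([\w.]+)\s+)?import\s+([\w.]+)', line)
    if not m:
        return None
    module = m.group(3)
    if m.group(2) is not None and m.group(2) != '__future__':
        module = m.group(2) + '.' + module
    parts = module.split('.')
    return ['.'.join(parts[:i]) for i in range(1, len(parts) + 1)]

# pytest would collect functions named test_* from a file such as test_parse_modules.py
def test_plain_import():
    assert parse_modules('import module.submodule') == ['module', 'module.submodule']

def test_from_import():
    assert parse_modules('from module import submodule') == ['module', 'module.submodule']

def test_future_import_drops_future():
    assert parse_modules('from __future__ import module') == ['module']

def test_non_import_line():
    assert parse_modules('result_set = %sql select from actor') is None

# without pytest, the functions can simply be called one after another
for test in (test_plain_import, test_from_import,
             test_future_import_drops_future, test_non_import_line):
    test()
print('all tests passed')
```

Each test checks one behaviour, so a failure points directly at the import form that broke.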


#### Cell Test of parse_modules()

This is a slightly higher level "unit test", as it makes use of the above and additional code to parse an entire notebook cell.  The cell is created specifically for testing.

##### Create Test Cell

In [3]:
# create a test notebook cell with imports
from __future__ import annotations
import sys
import numpy as np
import sys
from numpy import random
import sklearn.model_selection
from sklearn.feature_extraction import DictVectorizer

# when parsed, these should be the modules found to be referenced
parsed_modules = {'annotations',
 'nbformat',
 'numpy',
 'numpy.random',
 'sklearn',
 'sklearn.feature_extraction',
 'sklearn.feature_extraction.DictVectorizer',
 'sklearn.model_selection',
 'sys'}

# from __future__ import not_real_module
print("hello world") # import not_real_module
print("import not_real_module")
s = """
import not_real_module
"""
import nbformat
t = """import not_real_module
"""
a = 'import not_real_module'
b = 'from not_real_module import not_real_class'

hello world
import not_real_module


##### Find the above notebook cell and display it
Note that if you change the previous cell and do not save the notebook, then the following code will read the old version of the cell.

In [4]:
nb = nbformat.read('RegExParseNB.ipynb', as_version=4)
for cell_num, cell in enumerate(nb.cells):
    if '# create a test notebook cell with imports' in cell.source:
        break

print(f'Cell Index: {cell_num}')
print()
print(nb.cells[cell_num].source)

Cell Index: 10

# create a test notebook cell with imports
from __future__ import annotations
import sys
import numpy as np
import sys
from numpy import random
import sklearn.model_selection
from sklearn.feature_extraction import DictVectorizer

# when parsed, these should be the modules found to be referenced
parsed_modules = {'annotations',
 'nbformat',
 'numpy',
 'numpy.random',
 'sklearn',
 'sklearn.feature_extraction',
 'sklearn.feature_extraction.DictVectorizer',
 'sklearn.model_selection',
 'sys'}

# from __future__ import not_real_module
print("hello world") # import not_real_module
print("import not_real_module")
s = """
import not_real_module
"""
import nbformat
t = """import not_real_module
"""
a = 'import not_real_module'
b = 'from not_real_module import not_real_class'


#### Perform Cell Test of parse_modules()

In [5]:
import re
modules = set()
cell = nb.cells[cell_num]

# remove contents of triple quoted strings
source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL | re.MULTILINE)

# process each line
lines = source.splitlines()
for line in lines:
    
    # only consider text before # or first single or double quote
    line = re.split('[#\'"]',line)[0]  
    
    mods = parse_modules(line)
    if mods:
        modules.update(mods)
        
# correct set of modules        
modules == parsed_modules

True

#### Software Engineering Note
The above shows a piece of code that works on a good test example.  The next step is to refactor this code into a function or method.

In [6]:
# initial version of the function, cut and pasted from the tested code above as much as possible
def find_modules_from_cell(cell, modules):
    
    # remove contents of triple quoted strings
    source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL | re.MULTILINE)

    # process each line
    lines = source.splitlines()
    for line in lines:

        # only consider text before # or first single or double quote
        line = re.split('[#\'"]',line)[0]  

        mods = parse_modules(line)
        if mods:
            modules.update(mods)

In [7]:
# "unit test" refactored code
modules = set()
cell = nb.cells[cell_num]

find_modules_from_cell(cell, modules)
modules == parsed_modules

True

#### Find `${HOME}`

In [8]:
env = %env
home = env['HOME']

#### Find Notebooks to Parse

Find all files that:
1. end in '.ipynb'
2. do not end in '-checkpoint.ipynb'
3. do not begin with `${HOME}/anaconda3`

Note that regular expressions are not used here as str.startswith() and str.endswith() are easier to read than regular expressions and execute faster.

In [9]:
import os

notebooks = []
for dirpath, dirnames, filenames in os.walk(home):
    for filename in filenames:
        fullname = os.path.join(dirpath, filename)
        if fullname.endswith('.ipynb') \
            and not fullname.endswith('-checkpoint.ipynb') \
            and not fullname.startswith(home+'/anaconda3'):
            notebooks.append(fullname)
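
The same filter could also be written with `pathlib`, as a rough sketch. The helper name `find_notebooks` and its `skip_dir` parameter are made up for this example, and it is demonstrated against a temporary directory rather than the real home directory.

```python
import tempfile
from pathlib import Path

def find_notebooks(root, skip_dir='anaconda3'):
    """Return sorted notebook paths under root, skipping checkpoints and root/skip_dir."""
    root = Path(root)
    skipped = root / skip_dir
    return sorted(
        str(path) for path in root.rglob('*.ipynb')
        if not path.name.endswith('-checkpoint.ipynb')
        and skipped not in path.parents
    )

# build a tiny directory tree to try it on
with tempfile.TemporaryDirectory() as tmp:
    for rel in ('a/one.ipynb',
                'a/.ipynb_checkpoints/one-checkpoint.ipynb',
                'anaconda3/pkgs/two.ipynb',
                'b/notes.txt'):
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [Path(p).relative_to(tmp).as_posix() for p in find_notebooks(tmp)]

print(found)
```

Checking `path.parents` compares whole directory names, so a sibling directory whose name merely starts with `anaconda3` would not be skipped by accident.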

In [10]:
# find all potential modules referenced in all notebooks found above
modules = set()
for notebook in notebooks:
    nb = nbformat.read(notebook, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == 'code' and not cell.source.startswith('%%'):
            find_modules_from_cell(cell, modules)

#### Software Engineering Note

find_modules_from_cell is a "helper" function.  In other words, it is a function which is used locally to solve a specific problem and is not intended to be used by other software developers.

#### Try Loading Each Module

Modules could fail to load because:
1. an old notebook references a module that has since moved
2. module is not available in the virtual environment from which this notebook is being run
3. 'from module import something' was normalized to module.something, which may not be a module
4. an error in parsing could incorrectly identify something that is not a module

Get a list of all the modules that will not load.

If the root module will not load, then this is either a parsing error or a module not in the current virtual environment.

If module.something will not load, but module will load, then module.something is not a module.

Note: calling exec() on text taken from files can execute arbitrary code, so it shouldn't be used in production code.  exec() is fine to use for testing in a safe environment.
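
One alternative that avoids building source text for `exec()` is `importlib.import_module()`, which takes the module name as a string. This is only a sketch: the helper name `can_import` is hypothetical, and importing a module still runs that module's top-level code, so it is not a sandbox.

```python
import importlib

def can_import(name):
    """Return True if name can be imported as a module, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError
        return False

print(can_import('json'))                    # standard library package
print(can_import('json.decoder'))            # real submodule
print(can_import('json.JSONDecoder'))        # a class, not a module
print(can_import('no_such_module_xyz_123'))  # not installed
```

Because the name is passed as data rather than spliced into a statement, a malformed name cannot turn into extra code.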

In [11]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

missing_root_modules = set()
loadable_modules = set()
for module in modules:
    try:
        # attempt to import each module
        exec(f'import {module}')
        loadable_modules.add(module)
    except ModuleNotFoundError:
        # only record root modules; a dotted name may simply not be a module
        if '.' not in module:
            missing_root_modules.add(module)

In [12]:
print(len(loadable_modules), len(missing_root_modules), len(modules))
missing_root_modules

221 38 645


{'PyPDF2',
 'PyQt4',
 'annotations',
 'cufflinks',
 'cv',
 'dask_ml',
 'dask_xgboost',
 'descartes',
 'dill',
 'division',
 'file1',
 'foo',
 'graphviz',
 'helpers_05_08',
 'historical_prices_and_dividends',
 'keras',
 'mglearn',
 'mod',
 'mpi4py',
 'mprun_demo',
 'nbpackage',
 'netCDF4',
 'pandas_datareader',
 'plotly',
 'print_function',
 'pydot',
 'pygame',
 'pyspark',
 'rasterio',
 'rmtkernel',
 'selenium',
 'shapely',
 'spacy',
 'splipy',
 'tensorflow',
 'testdill',
 'vincent',
 'xgboost'}

#### Software Engineering Note
For a given module, it is helpful to know the set of notebook filenames which import this module.

In computer science, this is called an inverted index.  It is inverted because instead of mapping notebooks to modules (we began with os.walk(home) to find notebooks) we are mapping modules to notebooks.

It is easy to modify the above code to create a mapping of module name to set of notebook filenames, using defaultdict(set).

The only changes to the already tested function are:
1. add notebook to the argument list
2. last lines: for each module found, add the notebook to that module's set
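
A toy illustration of the idea, with made-up notebook names and module lists standing in for the real data:

```python
from collections import defaultdict

# forward mapping: notebook -> modules it imports (hypothetical data)
forward = {
    'a.ipynb': ['numpy', 'pandas'],
    'b.ipynb': ['numpy'],
    'c.ipynb': ['pandas', 'seaborn'],
}

# inverted index: module -> set of notebooks that import it
inverted = defaultdict(set)
for notebook, mods in forward.items():
    for mod in mods:
        inverted[mod].add(notebook)

for mod in sorted(inverted):
    print(mod, sorted(inverted[mod]))
```

`defaultdict(set)` creates an empty set the first time a module is seen, so no explicit "is this key present?" check is needed.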

In [13]:
# modified to map module to set of notebooks
def find_modules_from_cell(cell, notebook, dd):
    
    # remove contents of triple quoted strings
    source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL | re.MULTILINE)

    # process each line
    lines = source.splitlines()
    for line in lines:

        # only consider text before # or first single or double quote
        line = re.split('[#\'"]',line)[0]  

        mods = parse_modules(line)
        if mods:
            for mod in mods:
                dd[mod].add(notebook)

#### Reparse the Notebook Cells, this time keeping a mapping from module to set of notebooks

In [14]:
from collections import defaultdict

# uses new version of find_modules_from_cell
dd = defaultdict(set)
for notebook in notebooks:
    nb = nbformat.read(notebook, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == 'code' and not cell.source.startswith('%%'):  
            find_modules_from_cell(cell, notebook, dd)

#### Software Engineering Note
It is a good idea to check every step of the way.

The list of keys in the above dictionary should match the set of modules found above.  Verify this.

In [15]:
# although the keys are unique, set equality requires both objects being compared to be sets
modules == set(dd.keys())

True

#### Find the 5 Modules Referenced Most Often 

In [16]:
# to know how often each module was referenced:
# create a new dictionary that maps key to number of notebooks in value set

# dictionary comprehension
counts = {key:len(value) for (key,value) in dd.items()}

In [17]:
# create a sorted list of (key,value) tuples from the counts dictionary
sorted_by_value = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# top 5 most used
for key, value in sorted_by_value[:5]:
    print(f'{key:<18} referenced in {value:>3} notebooks')

numpy              referenced in 292 notebooks
matplotlib         referenced in 231 notebooks
matplotlib.pyplot  referenced in 228 notebooks
pandas             referenced in 224 notebooks
seaborn            referenced in 144 notebooks
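
`collections.Counter` offers a shorter route to the same kind of ranking. The sketch below uses a small hypothetical stand-in named `dd_example`, not the real index built above.

```python
from collections import Counter

# hypothetical stand-in for dd: module -> set of notebooks
dd_example = {
    'numpy': {'a.ipynb', 'b.ipynb', 'c.ipynb'},
    'pandas': {'a.ipynb', 'c.ipynb'},
    'seaborn': {'c.ipynb'},
}

counts = Counter({module: len(notebooks) for module, notebooks in dd_example.items()})

# most_common(n) returns the n highest counts as (key, count) pairs
for module, count in counts.most_common(2):
    print(f'{module:<18} referenced in {count:>3} notebooks')
```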


##### Use the "inverted index" (dd) to display the location of one of the notebooks which reference the most common module

In [37]:
# most common module
module = sorted_by_value[0][0]
print(f'Module: {module}')

# one of its file locations
notebook = sorted(dd[module])[0]

# path relative to home directory
notebook.split(home)[1]

Module: numpy


'/Python/Complete-Python-3-Bootcamp-master/16-Bonus Material - Introduction to GUIs/06-Custom Widget.ipynb'