# Parsing a Jupyter Lab Notebook with Regular Expressions

This is a fairly involved example that requires knowledge of the simple examples presented in my notebooks for:
1. Core Python
2. Core Python 2
3. Regular Expressions

## Task Overview

Find all imported modules in all notebooks found in my `${HOME}` directory and below.

## Task Detail

* Skip notebooks under the `${HOME}/anaconda3` directory
* Only check notebook cells of type 'code'
* Ignore cells that use magic commands
* Ignore content of comments, triple quoted strings, etc.
* Determine which modules found in all notebooks are missing from the conda environment this notebook is running in
* Create a mapping between each module and the notebooks which use them
* Find the most commonly imported modules

Jupyter Notebook Format: https://nbformat.readthedocs.io/en/latest/format_description.html  
API to read Jupyter Notebooks: https://nbformat.readthedocs.io/en/latest/api.html

For complete parsing of Python syntax use: https://greentreesnakes.readthedocs.io/en/latest/

This example uses regular expressions to parse the common use cases of Python's import syntax.

#### Software Engineering Note

Collecting several edge cases in a quick, easy-to-run "unit test" is helpful.

Unit testing in a professional environment is performed with a tool such as pytest.

Jupyter Notebooks are not good for unit testing per se, but the principles of unit testing can still be applied within a Jupyter Notebook.

In [1]:
# create a test notebook cell with valid and invalid import use cases
# these are valid imports
from __future__ import annotations
import sys
import numpy as np
from numpy import random
import nbformat
import sklearn.model_selection
from sklearn.feature_extraction import DictVectorizer

# when parsed, these should be the modules found to be imported
valid_modules = {'annotations', 'nbformat', 'numpy', 'sklearn.feature_extraction', 'sklearn.model_selection','sys'}

# the following are not actually importing a module
# from __future__ import not_real_module
print("hello world") # import not_real_module
print("import not_real_module")
s = """
import not_real_module
"""
t = """import not_real_module
"""
a = 'import not_real_module'
b = 'from not_real_module import not_real_class'

hello world
import not_real_module


#### Find the above notebook cell and display it

In [2]:
nb = nbformat.read('RegExParseNB.ipynb', as_version=4)
for cell_num, cell in enumerate(nb.cells):
    if '# create a test notebook cell with' in cell.source:
        break

print(f'Cell Index: {cell_num}')
print()
print(nb.cells[cell_num].source)

Cell Index: 4

# create a test notebook cell with valid and invalid import use cases
# these are valid imports
from __future__ import annotations
import sys
import numpy as np
from numpy import random
import nbformat
import sklearn.model_selection
from sklearn.feature_extraction import DictVectorizer

# when parsed, these should be the modules found to be imported
valid_modules = {'annotations', 'nbformat', 'numpy', 'sklearn.feature_extraction', 'sklearn.model_selection','sys'}

# the following are not actually importing a module
# from __future__ import not_real_module
print("hello world") # import not_real_module
print("import not_real_module")
s = """
import not_real_module
"""
t = """import not_real_module
"""
a = 'import not_real_module'
b = 'from not_real_module import not_real_class'


#### Parse the Test Notebook Cell using Regular Expressions

In [3]:
import re
modules = set()
cell = nb.cells[cell_num]

# remove contents of triple quoted strings (non-greedy, so each string is removed separately)
source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL)

# process each line
lines = source.splitlines()
for line in lines:
    
    # only consider text before # or first single or double quote
    line = re.split('[#\'"]',line)[0]
    m = re.search(r'(from(?:\s+__future__\s+import)?|import)\s+([a-zA-Z0-9._]+)', line)
    if m:
        modules.add(m.group(2))
        
# correct set of modules        
modules == valid_modules                         

True

#### Software Engineering Note
The above shows a piece of code that works on a good test example.  The next step is to refactor this code into a function.

In [4]:
# initial version of the function, cut and pasted from above as much as possible
def find_modules_from_cell(cell, collection):
    
    # remove contents of triple quoted strings
    source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL)
    lines = source.splitlines()
    for line in lines:
        
        # only consider text before # or first single or double quote
        line = re.split('[#\'"]',line)[0]
        m = re.search(r'(from(?:\s+__future__\s+import)?|import)\s+([a-zA-Z0-9._]+)', line)
        if m:
            collection.add(m.group(2))

In [5]:
# try refactored code
modules = set()
cell = nb.cells[cell_num]

find_modules_from_cell(cell, modules)
modules == valid_modules

True
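
The same check could also be written as pytest-style test functions. The sketch below is an illustration only: `find_modules_in_source` and the test names are hypothetical, the function takes a plain string instead of a notebook cell, and it uses a non-greedy pattern for triple quoted strings. Saved to a file, pytest would discover and run the `test_` functions.

```python
import re

# same import pattern as used in this notebook
IMPORT_RE = re.compile(r'(from(?:\s+__future__\s+import)?|import)\s+([a-zA-Z0-9._]+)')

def find_modules_in_source(source):
    """Hypothetical helper: return the set of module names found in source text."""
    modules = set()
    # remove contents of triple quoted strings
    source = re.sub(r'"{3}.*?"{3}', '', source, flags=re.DOTALL)
    for line in source.splitlines():
        # only consider text before # or first single or double quote
        line = re.split('[#\'"]', line)[0]
        m = IMPORT_RE.search(line)
        if m:
            modules.add(m.group(2))
    return modules

def test_plain_and_from_imports():
    source = ('import sys\n'
              'import numpy as np\n'
              'from sklearn.feature_extraction import DictVectorizer\n')
    assert find_modules_in_source(source) == {'sys', 'numpy', 'sklearn.feature_extraction'}

def test_comments_and_strings_are_ignored():
    source = ('# import a\n'
              'print("import b")\n'
              's = """\nimport c\n"""\n'
              "x = 'import d'\n")
    assert find_modules_in_source(source) == set()
```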

#### Find `${HOME}`

In [6]:
env = %env
home = env['HOME']

#### Find Notebooks to Parse

Find all files that:
1. end in '.ipynb'
2. do not end in '-checkpoint.ipynb'
3. do not begin with `${HOME}/anaconda3`

Note that regular expressions are not used here as str.startswith() and str.endswith() are easier to read than regular expressions and execute faster.

In [7]:
import os

notebooks = []
for dirpath, dirnames, filenames in os.walk(home):
    for filename in filenames:
        fullname = os.path.join(dirpath, filename)
        if fullname.endswith('.ipynb') \
            and not fullname.endswith('-checkpoint.ipynb') \
            and not fullname.startswith(home+'/anaconda3'):
            notebooks.append(fullname)
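
To check the three path rules without walking the file system, they can be pulled into a small predicate and tried on sample paths. This is a sketch: `is_candidate_notebook` and the sample paths are hypothetical names used only for illustration.

```python
def is_candidate_notebook(fullname, home):
    """Return True if fullname passes the three path rules listed above."""
    return (fullname.endswith('.ipynb')
            and not fullname.endswith('-checkpoint.ipynb')
            and not fullname.startswith(home + '/anaconda3'))

# hypothetical sample paths, one per rule
sample_home = '/home/user'
samples = [
    '/home/user/work/a.ipynb',                                # kept
    '/home/user/work/.ipynb_checkpoints/a-checkpoint.ipynb',  # checkpoint copy
    '/home/user/anaconda3/pkgs/x.ipynb',                      # under anaconda3
    '/home/user/work/a.py',                                   # not a notebook
]
kept = [path for path in samples if is_candidate_notebook(path, sample_home)]
```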

#### Exclude Magic Cells

Examples of magic commands used in cells:  
```
%%sql
result_set = %sql select
%env
env = %env
```  

RegEx: ```^(%{1,2}\w+)``` with re.MULTILINE  
will match if any line begins with one or two % characters immediately followed by one or more word characters

RegEx: ```=\s+(%\w+)```  
will match 'env = %env' but not '6 %sql', which is 6 modulo the value of sql
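
A quick way to check both patterns is to run the combined expression over a few sample cell sources. In this sketch the names `MAGIC_RE` and `samples` are hypothetical, and the sample strings are made up for illustration.

```python
import re

# the two patterns above, combined with alternation
MAGIC_RE = re.compile(r'^(%{1,2}\w+)|=\s+(%\w+)', re.MULTILINE)

# sample cell source mapped to whether it should be treated as a magic cell
samples = {
    '%%sql\nselect 1': True,            # cell magic on the first line
    'result_set = %sql select': True,   # line magic assigned to a variable
    'import os\n%env': True,            # line magic on a later line
    'env = %env': True,
    'x = 6 %sql': False,                # modulo operator, not a magic
    'print("50%")': False,
}

results = {source: bool(MAGIC_RE.search(source)) for source in samples}
```

Here `results == samples` evaluates to True, so each sample is classified as described above.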

In [8]:
# find all modules referenced in all notebooks found above
modules = set()
for notebook in notebooks:
    nb = nbformat.read(notebook, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == 'code':
            # skip cells with magic
            if re.search(r'^(%{1,2}\w+)|=\s+(%\w+)', cell.source, re.MULTILINE):
                continue
            
            find_modules_from_cell(cell, modules)

#### Software Engineering Note

find_modules_from_cell is a "helper" function.  In other words, it is a function which is used locally to solve a specific problem and is not intended to be used by other software developers.

#### Try Loading Each Module

Modules could fail to load because:
1. the notebook it was found in is old and the module's location has since changed
2. the module is not available in the virtual environment from which this notebook is being run
3. the notebook was parsed incorrectly and the name is not really a module

Get a list of all the modules that will not load.

Note: exec() runs arbitrary strings as code, which is a security risk, so it shouldn't be used in production code.  exec() is fine to use for testing in a safe environment.

In [9]:
missing_modules = set()
for module in modules:
    try:
        # attempt to import each module
        exec(f'import {module}')
    except ModuleNotFoundError as err:
        # capture module name from exception string
        err_str = str(err)
        match = re.search(r"\s'(.*)'", err_str)
        missing_modules.add(match.group(1))
        continue
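
As an alternative to exec(), the standard library's `importlib.util.find_spec` can report whether a module can be located. The sketch below is not the approach used in this notebook, and `is_importable` is a hypothetical helper name. Note that for dotted names `find_spec` still imports the parent package, so some code may run.

```python
import importlib.util

def is_importable(module_name):
    """Return True if module_name can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # raised when a parent package is missing or the name has no usable spec
        return False

# 'not_real_module_xyz' is a made-up name assumed not to be installed
checks = {name: is_importable(name)
          for name in ['json', 'json.decoder', 'not_real_module_xyz',
                       'not_real_module_xyz.sub']}
```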



In [10]:
# convert the set of missing modules to a sorted list
missing_list = sorted(missing_modules)
missing_list[:3]

['IPython.nbconvert.transformers', 'PyPDF2', 'PyQt4']

#### Software Engineering Note
For a given module, it would be helpful to know which notebooks referenced it.

It is easy to modify the above code to create a mapping of module name to notebook list, using defaultdict(list).

The only changes to the already tested function are:
1. add notebook to the argument list
2. last line: append notebook to the list of notebooks for that module
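
For readers new to defaultdict(list), here is a minimal sketch of the module-to-notebooks mapping. The variable `dd_example` and the notebook names are made up for illustration.

```python
from collections import defaultdict

# a missing key is created with an empty list on first access
dd_example = defaultdict(list)
dd_example['numpy'].append('a.ipynb')
dd_example['numpy'].append('b.ipynb')
dd_example['pandas'].append('a.ipynb')

as_plain_dict = dict(dd_example)
# as_plain_dict is {'numpy': ['a.ipynb', 'b.ipynb'], 'pandas': ['a.ipynb']}
```

No check for an existing key is needed before appending, which is why only the last line of the function has to change.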

In [11]:
# the only change is to add the notebook to the argument list and the last line of code which 
# appends the notebook to the (value) list for the given module (dictionary key)
def find_modules_from_cell(cell, notebook, dd):
    
    # remove contents of triple quoted strings
    source = re.sub(r'\"{3}(.*?)\"{3}', '', cell.source, flags = re.DOTALL)
    lines = source.splitlines()
    for line in lines:
        
        # only consider text before # or first single or double quote
        line = re.split('[#\'"]',line)[0]
        m = re.search(r'(from(?:\s+__future__\s+import)?|import)\s+([a-zA-Z0-9._]+)', line)        
        if m:
            dd[m.group(2)].append(notebook)

#### Reparse the Notebook Cells, this time keeping a mapping from module to list of notebooks

In [12]:
from collections import defaultdict

# use new version of find_modules_from_cell
dd = defaultdict(list)
for notebook in notebooks:
    nb = nbformat.read(notebook, as_version=4)
    for cell in nb.cells:
        # only consider code cells
        if cell.cell_type == 'code':
            # skip cells with magic
            if re.search(r'^(%{1,2}\w+)|=\s+(%\w+)', cell.source, re.MULTILINE):
                continue
            
            find_modules_from_cell(cell, notebook, dd)

#### Software Engineering Note
It is a good idea to check every step of the way.

The list of keys in the above dictionary should match the set of modules found above.  Verify this.

In [13]:
m_list1 = sorted(modules)
m_list2 = sorted(list(dd.keys()))
m_list1 == m_list2

True

In [14]:
# try an example of finding a notebook that used a missing module
a_missing_module = missing_list[2]
notebook_list = dd[a_missing_module]

# print the last 76 characters of the first notebook in notebook_list
print(f'Missing module: {a_missing_module}')
print(f'Referenced in: {notebook_list[0][-76:]}')

Missing module: PyQt4
Referenced in: /Python/ipython-in-depth-master/examples/IPython Kernel/Terminal Usage.ipynb


#### Software Engineering Note

My RegEx parsing does not handle every possible valid Python syntax. For example:  
import \  
    os

is valid Python import syntax, but is not handled by my RegEx parsing.

To handle all valid Python syntax, you should parse the source into an Abstract Syntax Tree (AST):
https://greentreesnakes.readthedocs.io/en/latest/

This example demonstrates how regular expressions can handle the most common forms of Python's import syntax.
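
For comparison, here is a minimal sketch of the AST approach using the standard library's `ast` module. The function name `imported_modules` and the sample source are hypothetical. It behaves differently from the RegEx version in two ways: `from __future__ import annotations` would be reported as `__future__`, and `ast.parse` raises SyntaxError on cells containing magic commands, so those cells would still have to be skipped.

```python
import ast

def imported_modules(source):
    """Return the set of module names imported by Python source text."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # handles 'import a, b' and 'import a as b'
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # relative imports such as 'from . import x' have no module name
            modules.add(node.module)
    return modules

sample = r'''import \
    os
import sys, re
from numpy import random
s = """
import not_real_module
"""
# import also_not_real
'''
found = imported_modules(sample)
# found is {'os', 'sys', 're', 'numpy'}
```

Comments and string contents never appear as import nodes, so no pre-processing of the source is needed.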

#### Find the 5 Modules Most Often Imported

In [15]:
# to know how often each module was referenced:
# create a new dictionary that maps key to number of notebooks in value list

# dictionary comprehension
counts = {key:len(value) for (key,value) in dd.items()}

In [16]:
# create a sorted list of (key,value) pairs from the counts dictionary
sorted_by_value = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# top 5 most used
for key, value in sorted_by_value[:5]:
    print(f'{key:<24} referenced in {value:>3} notebooks')

numpy                    referenced in 185 notebooks
pandas                   referenced in 159 notebooks
sklearn.metrics          referenced in 134 notebooks
sklearn.model_selection  referenced in 133 notebooks
IPython.display          referenced in  97 notebooks
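
One caveat: find_modules_from_cell appends the notebook once per matching import line, so a notebook that imports the same module on several lines appears in that module's list more than once, and the counts above can exceed the number of distinct notebooks. A sketch of counting distinct notebooks with collections.Counter follows. The data and the names `dd_example` and `distinct_counts` are made up for illustration.

```python
from collections import Counter, defaultdict

# made-up mapping: 'a.ipynb' imports numpy in two different cells
dd_example = defaultdict(list)
dd_example['numpy'].extend(['a.ipynb', 'a.ipynb', 'b.ipynb'])
dd_example['pandas'].append('a.ipynb')

# length of the raw list versus number of distinct notebooks
raw_counts = {module: len(paths) for module, paths in dd_example.items()}
distinct_counts = Counter({module: len(set(paths))
                           for module, paths in dd_example.items()})

top = distinct_counts.most_common(2)
# top is [('numpy', 2), ('pandas', 1)] while raw_counts['numpy'] is 3
```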
