# UnpackAI Library Development Plan
> What library should we be?
This proposal, is more of a branding proposal, targeting people who's going to play with AI, from various back grounds.
* That means, we're going to talk about how people view this library, how they think of ```pip install -Uqq unpackai``` like if I have dandroff recengtly and my mind just jump right into the headshoulders.
* For ML, currently, the **jump** is about the following, this is not a throught marketing research, just quick examples from a deep learning practitioner:
    * Try free structure quickly, do experiments: pytorch
    * Goes to production, run model on edge devices, Tensorflow
    * Play with GPU accelerated tensor calculation: Jax
    * Play with tf but in simpler layer sense: Keras
    * Transformer in clean code: Huggingface
    * Visualize things with interactive features: Plotly
    * Deploy model prototype: streamlit
* Surely you think I fail to mention ```fastai```, this is where the **branding goes wrong**, fastai library is bounded tightly with the education. It's considered a good creation along side its famous course, after the education. Its product feature has many limitation: docs too brief, not supporting multi-device training, very limited numbers of callbacks went beyond Jeremy H's own teaching.
* Most important of all, ```fastai``` isn't enjoyable to use, **it's just packing many things mentioned in the course**.

## What we shouldn't be
I know the course is life changing for me and I feel very grateful. But let's not be their library.

### The pipeline wrapping plan
It all started from a notebook, quite like a template notebook we have for the course. A notebook that achieves the data processing, model building, interpretation for a specific DL task.

Then came the packaging part, we wrap **dozens of lines of codes**, which scares our kind students, into simple functions, or class.

The wrapped functions are simple to use, to look at, it was executed in 1 line mostly. So friendly to our innocent students.

This is what a python library is about, right? Wrap things into functions which can be further wraped into even less lines.

It's nothing wrong about this approach at first. Some DL task, if need be, can be shrank into **less than 10 lines of codes.**
* The 1st line load the data, 
* the 2nd line set how to transform data, 
* the 3rd line build/load the model, 
* the 4th line trained model.
* the 5th line interpret the model in various ways

Well the above do look like a decent **structure** to start with, then we pave out the tasks, different contributors take different tasks, can be developed in parallel, and we can have the agile/crum/kanban fun to track our progress!

Even if we do this, we could build a useful product, no less.

#### Bad side about pipeline wrapping plan
So so many libraries are doing the same, from awesome people even. They usually end up to the following:
* It's a mess of functions, among them many good functions but a mess. It ends up a branding disaster. (**There is no way to answer: what can you library do, in a slogan**)
* A model zoo for a specific domain.
* Wraping things up means less and less involvement from the user. The user will spend very little time play with the functions, and each function usually achieve very specific task. Actually I do believe there is a equilibrium like:
$\large{UserPlayHours = a * Task Transferability}$

## Alternative approach

The salvation plan is somehow simpler at how we perceive the library:
* A library that allows you experiment AI/DL for various tasks

**BUT!!!**
* Many module with in the pipeline should be dropdown-list/checkbox **Choosable**.
* The **level of detail** we let them to play and choose, is the **level of the difficulty** we want them to enjoy

### What is level of detail ?
Level of details is the level of fuss we want user to focus on, this is the exact part fastai library got **WRONG**, which will explain most of our struggle so far:
* It offers smooth/ easy pipelines, for newbies and business people even.
* Any amount of reconfigure, is usually way too complicated for such audience
* There is a **GAP** between the 2 points above, hence no room for playing

#### Keras Example 
I started my AI journey with Keras, and I love keras by that time, because:
* Keras plays with **layers**(eg. Linear, Convolution), its most strenth is at astracting details beneath this level, and let users play with layers. 
* I spent lots of time, having fun playing with layers
* Aside from the things I have to redesign layer, I can deploy almost all kinds of models mentioned in any DL paper (𝑈𝑠𝑒𝑟𝑃𝑙𝑎𝑦𝐻𝑜𝑢𝑟𝑠=𝑎∗𝑇𝑎𝑠𝑘𝑇𝑟𝑎𝑛𝑠𝑓𝑒𝑟𝑎𝑏𝑖𝑙𝑖𝑡𝑦)

#### Pytorch lightning example
Well I moved on to the career team. I have to deal with layer level, I have to deal with different data/forward pipeline. PL is a good library because:
* It allows me play with the things I mentioned, but save my energy on things like looping, logging, multidevice training detail etc.
* If you see a training notebook built by PL, you'll see very little lines around training template.
* You'll find about a lots of lines on the specifications you intend to be different.

>The branding image of the examples are simple:
* Keras: play TensorFlow in a concept of layers
* Pytorch-Lightning: writting less template code

#### Unpackai Example
For our lib, I intend for them to focus on, exactly the same range of things we want people to learn:
* choose the columns they intend to use, in what way
* choose the data transformations
* choose the loss, the model structure to use (not keras.layer, not nn.module)
* hit run

## Demo of such example

In [1]:
from ipywidgets import interact, interact_manual
from forgebox.imports import *
from forgebox.category import Category
from tqdm.notebook import tqdm

# for the purpose of easier developing
# I'm using pytorch-lightning here
# This is a questionable, tough and revokable dicision
import pytorch_lightning as pl

In [2]:
HOME = Path(os.environ['HOME'])
# PROJECT = Path("./project")
# PROJECT = Path("./project/image_regression")
# PROJECT = Path("./project/rotten1")
PROJECT = Path("./project/rotten_text")

Let's skip data download here, I mean it's download, we're not going to reinvent brilliant stuff around download

In [3]:
# BEAR_DATASET = HOME/"Downloads"/"bear_dataset"
BEAR_DATASET = Path("/GCI/data/bear_dataset")
ROTTEN_TOMATOES = Path("/GCI/data/rttmt")

### Step 1 Everything starts with dataframe

For fastai, everything starts from list, an **ItemList** to be specific. **ImageList** and **TextList** is [**ItemList**](https://fastai1.fast.ai/tutorial.itemlist.html) with some slight enhanced feature.```[🧂, 🏓, 🍷, 🐻]```

For the clarity of education, or for simplecity as ultimate form of beauty, we use [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) as starting point, ItemList in table format. In this way, every dataset has the same starting point, even the tabular data. 

In [4]:
def df_creator_image_folder(path: Path)-> pd.DataFrame:
    """
    Create a dataframe ,
    Which list all the image path under a system folder
    """
    path = Path(path)
    files = []
    formats = ["jpg","jpeg","png"]
    for fmt in formats:
        files.extend(path.rglob(f"*.{fmt.lower()}"))
        files.extend(path.rglob(f"*.{fmt.upper()}"))
    return pd.DataFrame({"path":files}).sample(frac=1.).reset_index(drop=True)

## Helpers

### Phase/ Configuration

In [5]:
from typing import List, Dict, Callable, Any, Tuple
from torchvision import transforms as tfm
from PIL import Image
from forgebox.html import DOM
from ipywidgets import VBox, HBox, HTML

from ipywidgets import (
    Text, Textarea, IntSlider, FloatSlider, SelectMultiple, Dropdown, Checkbox,
    Layout, Button
)
from typing import List, Dict, Any

In [6]:
class Phase:
    """
    A configuration management mechanism
    """
    is_phase = True
    def __init__(self, **kwargs):
        self.config = dict()
        self.config.update(kwargs)
        
    def __setitem__(self, k, v):
        self.config[k] = v
    
    def __getitem__(self, k):
        return self.config[k]
    
    def __contains__(self, k):
        return k in self.config
    
    def __call__(self):
        return self.get_data(self.config)
    
    def get_data(self, raw):
        """
        Reconstruct back to dict or list or value format
        """
        if hasattr(raw,"is_phase"):
            return raw.get_data(raw.config)
        if type(raw) == list:
            raw = list(self.get_data(i) for i in raw)
            return raw
        if type(raw) == dict:
            for k, v in raw.items():
                raw[k] = self.get_data(v)
            return raw
        return raw

    def __str__(self):
        return json.dumps(self(), indent=2)
    
    def __repr__(self,):
        return f"Phase:{self}"
    

# class EnrichPhase(Phase):
#     def __init__(self, *steps):
#         super().__init__()
#         self.config['steps'] = []
#         for step in steps:
#             checked = self.check_step(step)
#             if checked:
#                 self.config['steps'].append(checked)
                
#     def new_step(self, process, dst:str, src: str=None):
#         self.config['steps'].append({
#             "process":process,
#             "src":src,
#             "dst":dst
#         })
    
#     def check_step(self,step):
#         return step
    
def save_phase():
    global phase
    global PROJECT
    PROJECT = Path(PROJECT)
    PROJECT.mkdir(exist_ok=True, parents=True)
    with open(PROJECT/"phase.json", "w") as f:
        f.write(str(phase))
        
def load_phase():
    global phase
    global PROJECT
    PROJECT = Path(PROJECT)
    PROJECT.mkdir(exist_ok=True, parents=True)
    if (PROJECT/"phase.json").exists():
        with open(PROJECT/"phase.json", "r") as f:
            phase = Phase(**json.loads(f.read()))
            print(phase)
    else:
        phase = Phase()

### Loading config

In [7]:
load_phase()

### Widget Helpers

#### More or less
And editable list within jupyter notebook

In [8]:
class MoreOrLess(VBox):
    """
    Interactive list
    You can add item to the list
    Each added item has a remove button to remove such item
    """
    def __init__(self, data_list: List[Any] = []):
        super().__init__([])
        for data in data_list:
            self+data

    def create_line(self, data):
        children = list(self.children)
        children.append(self.new_line(data))
        self.children = children

    @staticmethod
    def data_to_dom(data):
        return HTML(json.dumps(data))

    def new_line(self, data) -> HBox:
        del_btn = Button(description="Remove", icon="trash")
        del_btn.button_style = 'danger'
        hbox = HBox([del_btn, self.data_to_dom(data)])
        hbox.data = data

        def remove_hbox():
            children = list(self.children)
            for i, c in enumerate(children):
                if id(c) == id(hbox):
                    children.remove(c)
            self.children = children
        del_btn.click = remove_hbox
        return hbox

    def __add__(self, data):
        self.create_line(data)
        return self

    def get_data(self):
        """
        Return the data of this list
        """
        return list(x.data for x in self.children)

In [9]:
mol = MoreOrLess()

In [10]:
mol+{"hello":"Hello"}
mol+{"may I":"yes"}

MoreOrLess(children=(HBox(children=(Button(button_style='danger', description='Remove', icon='trash', style=Bu…

### typings for interactives

Typing for interactive details
```self()``` will create widgets automatically

In [11]:
class InteractiveTyping:
    """
    Typing for interactive details
    self.__call__() will create widgets directly
    """
    name = "anything"
    is_typing = True

    def solid(self, default) -> None:
        """
        Reset default value
        """
        if default is not None:
            self.default = default


class INT(InteractiveTyping):
    def __init__(self, min_: int = 0, max_: int = 10, step: int = 1, default: int = None):
        self.max_ = max_
        self.min_ = min_
        self.step = step
        self.default = default if default is not None else 1

    def __repr__(self):
        return f"int[{self.min_}-{self.max_}, :{self.step}]={self.default}"

    def __call__(self, default: int = None):
        self.solid(default)
        return IntSlider(
            value=self.default,
            min=self.min_,
            max=self.max_,
            step=self.step,
        )


class BOOL(InteractiveTyping):
    def __init__(self, name:str="", default: bool = True,):
        self.default = default
        self.name = name

    def __repr__(self):
        return f"bool={self.default}"

    def __call__(self, default: bool = None) -> Checkbox:
        self.solid(default)
        return Checkbox(value=self.default, description=self.name)


class FLOAT(InteractiveTyping):
    def __init__(self, min_: int = -1., max_: int = 1., step: int = .01, default: int = None):
        self.max_ = max_
        self.min_ = min_
        self.step = step
        self.default = default if default is not None else 0.01

    def __repr__(self):
        return f"float[{self.min_}-{self.max_}, :{self.step}]={self.default}"

    def __call__(self, default: int = None):
        self.solid(default)
        return FloatSlider(
            value=self.default,
            min=self.min_,
            max=self.max_,
            step=self.step,
        )


class STR(InteractiveTyping):
    """
    String object
    will create text or textarea
    """

    def __init__(self, default: str = None, use_area: bool = False):
        """
        use_area: do we use Textarea, if False,we use Text
        """
        self.default = "" if default is None else default
        self.use_area = use_area

    def __repr__(self):
        return f"str='{self.default}'"

    def __call__(self, default: str = None):
        self.solid(default)
        if self.use_area:
            return Textarea(value=self.default, layout=Layout(width="80%"))
        return Text(value=self.default)


class LIST(InteractiveTyping):
    """
    dropdown list type or multiselection type
    """

    def __init__(self, options: List[Any] = [], default: Any = None, multi: bool = False):
        """
        if multi: default should be iterable
        else: default should be one of the option
        """
        self.options = options
        self.default = default
        self.multi = multi

    def __repr__(self):
        if self.multi:
            size = f"[0-{self.default}]/{len(self.options)}"
        else:
            size = f"1/{len(self.options)}"
        return f"list,{size}"

    def __call__(self, default: Any = None):
        self.solid(default)
        if self.multi:
            inter = SelectMultiple(options=self.options)
        else:
            inter = Dropdown(options=self.options)

        if self.default is not None:
            # if multi: default should be iterable
            # else: default should be one of the option
            inter.value = self.default
        return inter

### Enhanced Interactive

The original ```interact_manual``` isn't powerful enough for this situation, so the following is a more flexible way to decorate an interactive function

In [12]:
class InteractiveAnnotations:
    """
    Build interactive based on the info of function's ```__annotations__```
    """

    def __init__(
        self, func: Callable,
        icon: str = "rocket",
        description: str = 'Run',
        button_style='primary'
    ):
        self.func = func
        self.icon = icon
        self.button_style = button_style
        self.description = description
        self.build_vbox(func)

    @classmethod
    def on(
        cls,
        callback: Callable,
        icon: str = 'rocket',
        description: str = 'Run',
        button_style: str = 'primary'
    ) -> Callable:
        """
        Use this class as a decorator
        @InteractiveAnnotation.on(callback)
        def target_func(a:STR(), b:INT()=1):
            ...
        """
        def decorator(func: Callable):
            obj = cls(
                func,
                icon=icon,
                description=description,
                button_style=button_style
            )
            display(obj.vbox)
            obj.register_callback(callback=callback)
            return func
        return decorator

    def build_vbox(self, func: Callable):
        row_list = []
        self.fields = dict()
        for k, v in func.__annotations__.items():
            if hasattr(v, "is_typing") == False:
                continue
            widget = v()
            widget.description = k
            row_list.append(widget)
            self.fields.update({k: widget})

        # final button
        self.final_btn = Button(
            description=self.description,
            icon=self.icon,
        )
        self.final_btn.button_style = self.button_style
        row_list.append(self.final_btn)

        # create interactive
        self.vbox = VBox(row_list)
        return self.vbox

    def register_callback(
        self,
        callback: Callable
    ) -> None:
        def run_callback():
            kwargs = self()
            callback(kwargs)
        self.final_btn.click = run_callback

    def __call__(self) -> Dict[str, Any]:
        """
        extract interactive data values
        """
        rt = dict()
        for k, widget in self.fields.items():
            rt.update({k: widget.get_interact_value()})
        return rt

#### Test callback & decorator

In [13]:
def print_stuff(kwargs):
    print(kwargs)

@InteractiveAnnotations.on(print_stuff, "flask", "test", button_style="warning")
def some_func(e, a:STR(), b:INT()=2, d=3):
    print(1)

VBox(children=(Text(value='', description='a'), IntSlider(value=1, description='b', max=10), Button(button_sty…

In [14]:
STR('RGB')()

Text(value='RGB')

### Intercept interactive

In [15]:
def print_kwargs(kwargs):
    print(kwargs)
    return kwargs


def reconfig_manual_interact(
    widget,
    description: str = "Create",
    button_style: str = "primary",
    icon: str = "plus"
) -> Button:
    """
    reconfigure the button of interactive features
    """
    btn = None
    for w in widget.children:
        if type(w) == Button:
            btn = w
            break
    btn.description = description
    btn.button_style = button_style
    btn.icon = icon
    return btn


def interact_intercept(
    func:Callable,
    result_cb: Callable = print_kwargs
):
    """
    Initialize a class with interactive features
    """
    annotations = func.__annotations__
    defaults = func.__defaults__
    kwargs = dict()
    if defaults is not None:
        for (k, typing), default in zip(annotations.items(), defaults):
            kwargs.update({k: typing(default)})
    obj = dict()

    def fillin_init(**kwargs):
        obj.update({
            "kwargs": kwargs,
        })
    f = interact_manual(fillin_init, **kwargs)

    btn = reconfig_manual_interact(f.widget)

    if btn is not None:
        original = btn.click

        def new_click_event():
            original()
            return result_cb(obj['kwargs'])
        btn.click = new_click_event

    return obj, f

def init_interact(cls, result_cb: Callable = print_kwargs):
    return interact_intercept(cls.__init__, result_cb=result_cb)

## Enrich columns (feature transformation, label extraction)
After this step, there will only be **MORE** column ➕

### Enrich Classes

In [16]:
class Enrich:
    """
    Enrich Base Class
    Some default attributes
    - is_enrich = True
    - typing = None # output typing
    - multi_cols = False # use multi-column as input
    - prefer = None
    - lazy = False  # shall we execute enrichment only through the iteration
    - src = None # source column
    """
    is_enrich = True
    typing = None # output typing
    multi_cols = False # use multi-column as input
    prefer = None
    lazy = False  # shall we execute enrichment only through the iteration
    src = None # source column

    def __init__(self): pass

    def __call__(self, row):
        return row
    
    def rowing(self, row):
        if self.multi_cols:
            return self(row)
        else:
            return self(row[self.src])


class EnrichImage(Enrich):
    """
    Create Image column from image path column
    """
    prefer = "QuantifyImage"
    typing = Image
    lazy = True
    

    def __init__(
        self, convert: STR("RGB") = "RGB",
        size: LIST(options=[28, 128, 224, 256, 512], default=224) = 224,
    ):
        self.convert = convert
        self.size = size

    def __repr__(self):
        return f"[Image:{self.size}]"

    def __call__(self, x):
        img = Image.open(x).convert(self.convert)
        img = img.resize((self.size, self.size))
        return img


class ParentAsLabel(Enrich):
    typing = str
    prefer = "QuantifyCategory"
    def __call__(self, path: Path,) -> str:
        """
        Use parent folder name as label
        """
        return Path(path).parent.name
    
ENRICHMENTS = dict(
    EnrichImage=EnrichImage,
    ParentAsLabel=ParentAsLabel,
)

In [17]:
obj,f = init_interact(EnrichImage)

interactive(children=(Text(value='RGB', description='convert'), Dropdown(description='size', index=2, options=…

#### Interact creation

In [22]:
# base_df = df_creator_image_folder(BEAR_DATASET)

# the rotten tomatoes dataset, we are not using every line
base_df = pd.read_csv(ROTTEN_TOMATOES/'critic_reviews.csv', nrows=200000)
base_df = base_df[~base_df['review_score'].isna()].reset_index(drop=True)
base_df = base_df[~base_df['review_content'].isna()].reset_index(drop=True)
base_df = base_df[base_df['review_score'].apply(lambda x: "/" in x)].reset_index(drop=True)

base_df['review_score'] = base_df['review_score'].apply(eval)

base_df

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,0.700,2010-02-09,Whether audiences will get behind The Lightnin...
1,m/0814255,Nick Schager,False,Slant Magazine,Rotten,0.250,2010-02-10,Harry Potter knockoffs don't come more transpa...
2,m/0814255,Bill Goodykoontz,True,Arizona Republic,Fresh,0.700,2010-02-10,"Percy Jackson isn't a great movie, but it's a ..."
3,m/0814255,Jim Schembri,True,The Age (Australia),Fresh,0.600,2010-02-10,"Crammed with dragons, set-destroying fights an..."
4,m/0814255,Mark Adams,False,Daily Mirror (UK),Fresh,0.800,2010-02-10,"This action-packed fantasy adventure, based on..."
...,...,...,...,...,...,...,...,...
109463,m/bottle_shock,Phil Villarreal,False,Arizona Daily Star,Rotten,0.500,2008-08-29,"It might have worked better as a documentary, ..."
109464,m/bottle_shock,Todd Gilchrist,False,IGN Movies,Rotten,0.400,2008-08-29,Bottle Shock feels more like an excuse to exer...
109465,m/bottle_shock,Austin Kennedy,False,Sin Magazine,Rotten,0.625,2008-09-02,"I was slightly involved towards the end, but t..."
109466,m/bottle_shock,Sean P. Means,False,Salt Lake Tribune,Rotten,0.500,2008-09-05,"Flat, musty and with a hint of flopsweat."


### Set Enrich 🎸

In [23]:
def set_enrich(df):
    global phase
    DOM(f"{len(df)} rows of data, example table", "h3")()
    display(df.sample(5))
    display(HTML("<hr>"))

    def setting_col():
        enrich_data_list = phase['enrich'] if 'enrich' in phase else []
        enrich_box = MoreOrLess(enrich_data_list)
        display(enrich_box)

        
        def set_enrich_(src=["[all_columns]", ]+list(df.columns)):
            DOM(f"Setting up column enrich: {src}", "h4")()
            if src == "[all_columns]":
                display(df.head(3))
            else:
                display(df[[src, ]].head(3))

            def choose_enrich(dst="", enrich=ENRICHMENTS):
                DOM(f"Source: {src}, Destination: {dst}, for {enrich.__name__}", "h4")(
                )
                DOM(f"{enrich.__doc__}", "quote")()

                def result_callback(kwargs):
                    extra = {"src": src, "dst": dst,
                                "kwargs": kwargs, "enrich": enrich.__name__}
                    enrich_box+extra
                    phase['enrich'] = enrich_box.get_data()
                obj, decoed_func = init_interact(enrich, result_callback)
            choose_enrich_widget = interact_manual(choose_enrich).widget
            reconfig_manual_interact(
                choose_enrich_widget,
                description="Choose", button_style='warning')
        set_enrich_widget = interact_manual(set_enrich_).widget
        reconfig_manual_interact(set_enrich_widget, button_style='warning')
    setting_col()

In [24]:
set_enrich(base_df)

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
15348,m/1028564-guardian,Matt Brunson,False,Creative Loafing,Rotten,0.5,2016-01-23,"As far as flying nannies go, Jenny Seagrove's ..."
22793,m/1105979-brothers,Steve Rhodes,False,Internet Reviews,Rotten,0.5,2001-03-21,It isn't much of a movie.
78302,m/arizona_2018,Jared Mobarak,False,Jaredmobarak.com,Fresh,0.6,2018-08-28,"It's stupid, exploits the housing crisis as fo..."
59161,m/accepted,Chris Cabin,False,Filmcritic.com,Rotten,0.4,2006-08-28,expect an uneasy feeling of recycling.
32873,m/1178913-1178913-you_kill_me,Rob Thomas,False,"Capital Times (Madison, WI)",Fresh,0.75,2007-07-13,When Frank stands up at a meeting and confesse...


HTML(value='<hr>')

MoreOrLess()

interactive(children=(Dropdown(description='src', options=('[all_columns]', 'rotten_tomatoes_link', 'critic_na…

In [25]:
phase

Phase:{}

### Execute enrichment
> apply the enrichment settings to the dataframe

In [26]:
def execute_enrich(
    df: pd.DataFrame, phase:Phase
):
    if 'enrich' not in phase:
        return df
    for en_conf in tqdm(phase["enrich"], leave=False):
        enrich_name = en_conf['enrich']
        enrich_cls = ENRICHMENTS[enrich_name]
        kwargs = en_conf['kwargs']
        src = en_conf['src']
        dst = en_conf['dst']
        # The class with lazy loading, will only 
        # call the class only if necessary
        if enrich_cls.lazy:
            obj = enrich_cls(**kwargs)
            obj.src = src
            df[dst] = obj
        # The class without lazy loading
        # create the column now
        else:
            obj = enrich_cls(**kwargs)
            if src=="[all_columns]":
                df[dst] = df.apply(obj, axis=1)
            else:
                df[dst] = df[src].apply(obj)
    return df

see? more columns

In [27]:
execute_enrich(base_df, phase)

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,0.700,2010-02-09,Whether audiences will get behind The Lightnin...
1,m/0814255,Nick Schager,False,Slant Magazine,Rotten,0.250,2010-02-10,Harry Potter knockoffs don't come more transpa...
2,m/0814255,Bill Goodykoontz,True,Arizona Republic,Fresh,0.700,2010-02-10,"Percy Jackson isn't a great movie, but it's a ..."
3,m/0814255,Jim Schembri,True,The Age (Australia),Fresh,0.600,2010-02-10,"Crammed with dragons, set-destroying fights an..."
4,m/0814255,Mark Adams,False,Daily Mirror (UK),Fresh,0.800,2010-02-10,"This action-packed fantasy adventure, based on..."
...,...,...,...,...,...,...,...,...
109463,m/bottle_shock,Phil Villarreal,False,Arizona Daily Star,Rotten,0.500,2008-08-29,"It might have worked better as a documentary, ..."
109464,m/bottle_shock,Todd Gilchrist,False,IGN Movies,Rotten,0.400,2008-08-29,Bottle Shock feels more like an excuse to exer...
109465,m/bottle_shock,Austin Kennedy,False,Sin Magazine,Rotten,0.625,2008-09-02,"I was slightly involved towards the end, but t..."
109466,m/bottle_shock,Sean P. Means,False,Salt Lake Tribune,Rotten,0.500,2008-09-05,"Flat, musty and with a hint of flopsweat."


In [28]:
import random
def noise():
    return random.random()*.1

In [None]:
base_df["grizzly_score"] = base_df['bear_kind'].apply(
    lambda x: .9 +noise()  if x=='grizly' else .1+noise())

## Quantify: Choose columns as X and Y, put them into number

### Size classes

In [30]:
class SIZE_DIMENSION:
    pass

class BATCH_SIZE(SIZE_DIMENSION):
    def __repr__(self): return f"BATCH_SIZE"

class SEQUENCE_SIZE(SIZE_DIMENSION):
    pass

class IMAGE_SIZE(SIZE_DIMENSION):
    pass

### Quantify classes

In [40]:
class Quantify:
    is_quantify = True
    """
    # From all things to number
    The AI model does not understand anything, say, picture, text
    Unless you transform it to integer and float tensors

    Quantify and its subclass controls the
        numericalization / collation of the data pipeline
    The base class of quantify does: NOTHING
    """

    def __init__(self,):
        pass

    def __call__(self, list_of_items):
        return list(list_of_items)

    def adapt(self, column):
        """
        A function to let the data processing
        adapt to the data column
        """
        pass

    def __hash__(self,):
        if hasattr(self, "name"):
            return self.name
        else:
            return self.__class__.__name__


class QuantifyImage(Quantify):
    """
    Transform PIL.Image to tensor
    """

    def __init__(
        self,
        mean_: LIST(["imagenet", "0.5 x 3"]) = "imagenet",
        std_: LIST(["imagenet", "0.5 x 3"]) = "imagenet",
    ):
        if type(mean_) == str:
            if mean_ == "imagenet":
                mean_ = [0.485, 0.456, 0.406]
            elif mean_ == "0.5 x 3":
                mean_ = [.5, .5, .5]
            else:
                raise ValueError(
                    f"Mean configuration: {mean_} not valid")

        if type(std_) == str:
            if std_ == "imagenet":
                std_ = [0.229, 0.224, 0.225]
            elif std_ == "0.5 x 3":
                std_ = [.5, .5, .5]
            else:
                raise ValueError(
                    f"Standard Variation configuration: {std_} not valid")

        self.transform = tfm.Compose([
            tfm.ToTensor(),
            tfm.Normalize(mean=mean_, std=std_),
        ])

        self.shape = (BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE)

    def __repr__(self):
        return f"Quantify Image to tensors:{self.transform}"

    def __call__(self, list_of_image):
        return torch.stack(list(
            self.transform(img) for img in list_of_image))


class QuantifyText(Quantify):
    def __init__(
        self,
        pretrained: STR(default="bert-base-cased") = "bert-base-cased",
        max_length: INT(default=512, min_=12, max_=1024, step=4) = 512,
        padding: LIST(options=[
            "do_not_pad",
            "max_length",
            "longest"], default="max_length") = "max_length",
        return_token_type_ids: BOOL(name="Token Type IDs", default=True) = True,
        return_attention_mask: BOOL(name="Attention Mask", default=True) = True,
        return_offsets_mapping: BOOL(name="Offset Mapping", default=False) = False,
    ):
        self.pretrained = pretrained
        self.max_length = max_length
        self.padding = padding
        self.return_token_type_ids = return_token_type_ids
        self.return_attention_mask = return_attention_mask
        self.return_offsets_mapping = return_offsets_mapping
        self.truncation = True
        self.return_tensors = 'pt'
        self.shape = (BATCH_SIZE, SEQUENCE_SIZE)

    def adapt(self, column):
        """
        Initialize tokenizer
        """
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.pretrained, use_fast=True)

    def __call__(self, list_of_text: List[str]):
        list_of_text = list(list_of_text)
        return self.tokenizer(
            list_of_text,
            padding=self.padding,
            max_length=self.max_length,
            truncation=self.truncation,
            return_token_type_ids=self.return_token_type_ids,
            return_attention_mask=self.return_attention_mask,
            return_tensors=self.return_tensors,
            return_offsets_mapping=self.return_offsets_mapping,
        )


class QuantifyCategory(Quantify):
    """
    Transform single categorical data to index numbers in pytorch tensors
    """

    def __init__(
        self,
        min_frequency: INT(min_=1, max_=20, default=1) = 1,
    ):
        self.min_frequency = min_frequency

    def adapt(self, column):
        # category statistics
        value_counts = pd.DataFrame(column.value_counts())

        # if minimun freq is 1
        # very category occured should be accounted for
        # hence no missing token padding is required
        if self.min_frequency < 2:
            self.category = Category(
                arr=np.array(value_counts.index),
                pad_mst=False)

        # we need missing token
        # for category's frequency < self.min_frequency
        else:
            categories = np.array(
                list(value_counts.index[
                    value_counts.values.reshape(-1) > self.min_frequency]))
            self.category = Category(arr=categories, pad_mst=True)

        self.shape = (BATCH_SIZE, len(self.category))

    def __repr__(self):
        return f"Quantify Category:{self.category}"

    def __call__(self, list_of_strings):
        return torch.LongTensor(self.category.c2i[np.array(list_of_strings)])


class QuantifyNum(Quantify):
    """
    Quantify contineous data, like float numbers
    The only process is normalization on the entire population
    """
    shape = (BATCH_SIZE, 1)

    def adapt(self, column):
        self.mean_ = column.mean()
        self.std_ = column.std()

    def __call__(self, list_of_num):
        return (torch.FloatTensor(list_of_num)[None, :]-self.mean_)/self.std_

    def backward(self, x):
        return x*self.std_+self.mean_


class QuantifyMultiCategory(Quantify):
    """
    Turn Multi-categorical data to n_hot encoding numbers in pytorch tensors
    """

    def __init__(self, col_name: str):
        self.col_name = col_name


QUANTIFY = dict(
    Quantify=Quantify,
    QuantifyNum=QuantifyNum,
    QuantifyImage=QuantifyImage,
    QuantifyCategory=QuantifyCategory,
    QuantifyMultiCategory=QuantifyMultiCategory,
    QuantifyText=QuantifyText,
)

#### Pytorch Dataset

In [41]:
class TaiChiDataset(Dataset):
    """
    A pytorch dataset working under our core engine
    The dataset class should on be defined here once
    """
    def __init__(self, df, columns: List[Any] = None):
        self.df = df
        self.columns = list(df.columns) if columns is None else columns

    def __len__(self):
        return len(self.df)

    def shuffle(self):
        self.df = self.df.sample(frac=1.).reset_index(drop=True)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        row = dict(self.df.loc[idx])
        rt = dict()
        for col in self.columns:
            v = row[col]
            if hasattr(v, "is_enrich"):
                rt[col] = v.rowing(row)
            else:
                rt[col] = v
        return rt
    
    def split(
        self,
        valid_ratio:FLOAT(min_=0.01, max_=0.5, default=.1, step=0.01)=.1
    ) -> Tuple[Any]:
        """
        Split dataset to train, validation
        """
        cls = self.__class__
        slicing = (np.random.rand(len(self)) < valid_ratio)
        return (
            cls(self.df[~slicing].reset_index(drop=True), self.columns),
            cls(self.df[slicing].reset_index(drop=True), self.columns)
        )

    def dataloader(
        self,
        batch_size: LIST(options=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512], default=32) = 32,
        shuffle: LIST(options=[True, False], default=False) = False,
        num_workers: LIST(options=[0, 2, 4, 8, 16], default=0) =0,
    ):
        """
        Create dataloader from dataset
        """
        return DataLoader(
            self,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=num_workers)

In [42]:
from forgebox.html import list_group_kv

In [43]:
ds = TaiChiDataset(base_df)
ds[1]

{'rotten_tomatoes_link': 'm/0814255',
 'critic_name': 'Nick Schager',
 'top_critic': False,
 'publisher_name': 'Slant Magazine',
 'review_type': 'Rotten',
 'review_score': 0.25,
 'review_date': '2010-02-10',
 'review_content': "Harry Potter knockoffs don't come more transparent and slapdash than this wannabe-franchise jumpstarter directed by Chris Columbus."}

### Choose XY 🎸

In [44]:
def choose_xy(df):
    global phase
    DOM(f"{len(df)} rows of data, example table", "h3")()
    display(df.sample(5))
    display(HTML("<hr>"))
    DOM("Please Choose Column", "h3")()
    DOM("The AI model will try to guess the Y with the input X", "div", {"style":"color:#666699"})()
    
    task = 'quantify'
    # enrich by columns
    if "enrich" in phase:
        by_destination = dict((en['dst'], en) for en in phase['enrich'])
    else:
        by_destination = dict()
    
    data_list = phase[task] if task in phase else []
    mol_box = MoreOrLess(data_list)
    display(mol_box)

    @interact_manual
    def set_quantify_(src=list(df.columns), use_for = ["As X", "As Y"]):
        DOM(f"Quantify Column: {src} {use_for}", "h4")()
        display(df[[src, ]].head(3))
        
        quantify_dropdown = Dropdown(options=list(QUANTIFY.keys()))
        
        # check the hint from last step
        prefer = None
        if src in by_destination:
            col_config = by_destination[src]
            cls = ENRICHMENTS[col_config['enrich']]

            # In case the enrich layer has the preference
            if hasattr(cls, "prefer"):
                prefer = cls.prefer
                
                # set default value to drop down value,
                # if the the previous hint suggest so
                quantify_dropdown.value = prefer
                DOM(f"Prefered quantifying:\t{cls.prefer}", "h4")()
            if hasattr(cls, "typing"):
                DOM(f"Output data type:\t{cls.typing}", "h4")()
        
        @interact_manual
        def choose_quantify(quantify = quantify_dropdown):
            cls = QUANTIFY[quantify]
            def result_callback(kwargs):
                extra = {"src": src, "x":(use_for=="As X"),
                        "kwargs": kwargs, "quantify": cls.__name__}
                mol_box+extra
                phase['quantify'] = mol_box.get_data()
                
            obj, decoded = init_interact(cls, result_callback)

In [36]:
choose_xy(base_df)

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
5092,m/10009462-g_force,Sean P. Means,False,Salt Lake Tribune,Rotten,0.5,2009-08-09,Guided by the principle that action movies hav...
101556,m/black_panther_2018,Nigel Andrews,True,Financial Times,Fresh,0.8,2018-02-07,"Crashingly enjoyable, frequently exciting and ..."
12576,m/1012225-les_miserables,Tim Brayton,False,Antagony & Ecstasy,Fresh,0.8,2013-01-06,While there's no such thing as a timeless '30s...
284,m/10000_bc,Linda Cook,False,"Quad City Times (Davenport, IA)",Rotten,0.5,2008-03-09,"On a Neanderthal level, ""10,000 B.C."" works."
1894,m/10007902-delirious,S. Jhoanna Robledo,False,Common Sense Media,Rotten,0.4,2007-10-22,Mature paparazzi drama isn't quite in focus.


HTML(value='<hr>')

MoreOrLess()

interactive(children=(Dropdown(description='src', options=('rotten_tomatoes_link', 'critic_name', 'top_critic'…

#### Check configuration after setting

In [37]:
phase

Phase:{
  "quantify": [
    {
      "src": "review_content",
      "x": true,
      "kwargs": {
        "pretrained": "bert-base-cased",
        "max_length": 512,
        "padding": "max_length",
        "return_token_type_ids": true,
        "return_attention_mask": true,
        "return_offsets_mapping": false
      },
      "quantify": "QuantifyText"
    },
    {
      "src": "rotten_tomatoes_link",
      "x": true,
      "kwargs": {
        "min_frequency": 1
      },
      "quantify": "QuantifyCategory"
    },
    {
      "src": "review_score",
      "x": false,
      "kwargs": {},
      "quantify": "QuantifyNum"
    }
  ]
}

In [45]:
def execute_quantify(
    df: pd.DataFrame, phase:Phase
):
    # existance check
    if 'quantify' not in phase:
        raise KeyError(f"No quantify stepset")
    
    qdict = dict()
    for i, qconf in tqdm(enumerate(phase['quantify']), leave = False):
        qname = qconf['quantify']
        kwargs = qconf['kwargs']
        src = qconf['src']
        x = qconf['x']
        
        cls = QUANTIFY[qname]
        qobj = cls(**kwargs)
        qobj.src = src
        qobj.is_x = x
        qobj.adapt(df[src])
        qdict.update({src:qobj})
    return qdict

In [46]:
qdict = execute_quantify(base_df, phase)

0it [00:00, ?it/s]

In [47]:
qdict

{'review_content': <__main__.QuantifyText at 0x7f936baaed50>,
 'rotten_tomatoes_link': Quantify Category:Category Manager with 3869,
 'review_score': <__main__.QuantifyNum at 0x7f92b898fa10>}

### Create Dataloader

#### Collating function

In [48]:
class TaiChiCollate:
    """
    Universal all power full collate function
    1 for all collation
    """
    def __init__(self, quantify_dict):
        self.quantify_dict = quantify_dict
        
    def make_df(self, batch):
        return pd.DataFrame(list(batch))
        
    def __len__(self):
        return len(self.quantify_dict)
        
    def __call__(self, batch) -> Dict[str, torch.Tensor]:
        """
        This call will execute the __call__(a_list_of_items)
        from Quantify objects column by column
        """
        batch_df = self.make_df(batch)
        rt = dict()
        for src,qobj in self.quantify_dict.items():
            rt.update({
                src:qobj(list(batch_df[src]))
            })
        return rt

#### Data module

This part handles:
* Spliting
* To dataloader

In [49]:
class TaiChiDataModule(pl.LightningDataModule):
    def __init__(self, dataset: TaiChiDataset, quantify_dict: Dict[str, Quantify]):
        super().__init__()
        self.dataset = dataset
        self.quantify_dict = quantify_dict
        
        self.collate = TaiChiCollate(quantify_dict)
        
    def configure(
        self,
        valid_ratio:FLOAT(min_=0.01, max_=0.5, default=.1, step=0.01)=.1,
        batch_size: LIST(options=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512], default=32) = 32,
        shuffle: LIST(options=[True, False], default=False) = True,
        num_workers: LIST(options=[0, 2, 4, 8, 16], default=0) =0,
    ):  
        self.train_ds, self.val_ds = self.dataset.split(valid_ratio)
        self.batch_size=batch_size
        self.shuffle=shuffle
        self.num_workers=num_workers
        
    def train_dataloader(self):
        self.train_dl = self.train_ds.dataloader(
            batch_size=self.batch_size,
            shuffle=self.shuffle,
            num_workers=self.num_workers)
        self.train_dl.collate_fn = self.collate
        return self.train_dl
    
    def val_dataloader(self):
        self.val_dl = self.val_ds.dataloader(
            batch_size=self.batch_size,
            shuffle=self.shuffle,
            num_workers=self.num_workers)
        self.val_dl.collate_fn = self.collate
        return self.val_dl

In [50]:
def execute_datamodule(df, qdict, phase):
    ds = TaiChiDataset(df)
    datamodule = TaiChiDataModule(ds, qdict)
    
    def configure_setting(kwargs):
        global phase
        datamodule.configure(**kwargs)
        phase['batch_level'] = kwargs
    interact_intercept(datamodule.configure, configure_setting)
    return datamodule

In [51]:
datamodule = execute_datamodule(base_df, qdict, phase)

interactive(children=(FloatSlider(value=0.1, description='valid_ratio', max=0.5, min=0.01, step=0.01), Dropdow…

In [83]:
data = next(iter(datamodule.train_dataloader()))

In [84]:
phase

Phase:{
  "quantify": [
    {
      "src": "review_content",
      "x": true,
      "kwargs": {
        "pretrained": "bert-base-cased",
        "max_length": 512,
        "padding": "max_length",
        "return_token_type_ids": true,
        "return_attention_mask": true,
        "return_offsets_mapping": false
      },
      "quantify": "QuantifyText"
    },
    {
      "src": "rotten_tomatoes_link",
      "x": true,
      "kwargs": {
        "min_frequency": 1
      },
      "quantify": "QuantifyCategory"
    },
    {
      "src": "review_score",
      "x": false,
      "kwargs": {},
      "quantify": "QuantifyNum"
    }
  ],
  "batch_level": {
    "valid_ratio": 0.1,
    "batch_size": 16,
    "shuffle": true,
    "num_workers": 0
  }
}

In [85]:
data

{'review_content': {'input_ids': tensor([[  101,   138,  9189,  ...,     0,     0,     0],
         [  101,  1247,   112,  ...,     0,     0,     0],
         [  101, 15295,  1105,  ...,     0,     0,     0],
         ...,
         [  101,   119,   119,  ...,     0,     0,     0],
         [  101,  9326, 12788,  ...,     0,     0,     0],
         [  101,  1109,  1842,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])},
 'rotten_tomatoes_link': tensor([ 109,  475,  154,  495, 1958,  372,  967,  341,  399, 1430,  704, 1396,
      

###  Choose your model, loss

In [55]:
from torchvision.models import (
    resnet18, resnet34, resnet50, resnet101, resnet152,
    resnext101_32x8d, resnext50_32x4d)

## Models

### Entry modules

In [56]:
RESNET_OPTIONS = {"resnet18": resnet18,
                  "resnet34": resnet34,
                  "resnet50": resnet50,
                  "resnet101": resnet101,
                  "resnet152": resnet152,
                  "resnext101_32x8d": resnext101_32x8d,
                  "resnext50_32x4d": resnext50_32x4d}

In [57]:
class MidJoint1d(nn.Module):
    def __init__(self, keys):
        super().__init__()
        self.keys = keys
    
    def forward(self, data):
        tensors = list(data[key] for key in self.keys)
        return torch.cat(tensors,dim=1)

In [79]:
class EntryModel(nn.Module):
    is_entry = True
    
    @classmethod
    def from_quantify(cls, ):
        raise ImportError(
            f"Please define class function 'from_quantify' for {cls.__name__}"
        )
    
class Empty(EntryModel):
    def __init__(self):
        super().__init__()
        self.out_features=1
    
    def forward(self, x):
        return x
    
    @classmethod
    def from_quantify(cls,
        quantify):
        return cls()

class ImageConvEncoder(EntryModel):
    def __init__(self, model):
        super().__init__()
        self.name = "cnn"
        self.output_shape = (BATCH_SIZE, model.fc.in_features)
        self.out_features = model.fc.in_features
        model.fc = Empty()
        self.model = model

    def forward(self, data):
        return self.model(data)

    def __repr__(self):
        return f"""ComputerVisionEncoder: {self.name}
        Outputs shape:{self.output_shape}"""

    @classmethod
    def from_quantify(
        cls,
        quantify,
        name: LIST(options=list(
            RESNET_OPTIONS.keys()), default="resnet18"),
    ):
        model = RESNET_OPTIONS[name](pretrained=True, progress=True,)
        obj = cls(model)
        obj.name = name
        return obj


class CategoryEncoder(EntryModel):
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.model = nn.Embedding(
            num_embeddings,
            embedding_dim)
        
    def forward(self, idx):
        return self.model(idx)
    
    @classmethod
    def from_quantify(
        cls,
        quantify,
        embedding_dim: LIST(
            options=[4, 8, 16, 32, 64, 128, 256, 512], default=128) = 128):
        num_embeddings = len(quantify.category)
        obj = CategoryEncoder(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_dim,
        )
        obj.out_features = embedding_dim
        return obj


class TransformerEncoder(EntryModel):
    """
    A model part to encode sequnce data in to vectors
    """

    def __init__(self, model, encoder_mode: BOOL(default=True) = True,):
        super().__init__()
        self.model = model
        self.encoder_mode = encoder_mode

    def forward(self, kwargs):
        outputs = self.model(**kwargs)
        if self.encoder_mode:
            # output vector
            if "pooler_output" in outputs:
                return outputs.pooler_output
            else:
                return (
                    outputs.last_hidden_state*kwargs['attention_mask'][:,:,None]
                ).mean(1)
        return outputs

    @classmethod
    def from_quantify(
        cls,
        quantify,
        name: STR(default="bert-base-uncased") = 'bert-base-uncased',
        encoder_mode: BOOL(default=True) = True,
    ):
        from transformers import AutoModel
        model = AutoModel.from_pretrained(name)
        obj = cls(model)
        obj.name = name
        obj.encoder_mode = encoder_mode
        if encoder_mode:
            obj.out_features= model.config.hidden_size
        return obj

In [80]:
entry = TransformerEncoder.from_quantify(0)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [81]:
with torch.no_grad():
    vector = entry(data['review_content'])

In [82]:
vector.shape

torch.Size([32, 768])

In [84]:
# entry = ImageConvEncoder.from_quantify(0,name="resnet18")

# with torch.no_grad():
#     vectors = entry(data['image'])

### Exit modules

In [86]:
def accuracy(y_, y):
    return (y_.argmax(-1) == y).float().mean()

In [87]:
class ExitModel(nn.Module):
    metric_funcs = dict()

    def loss_step(self, x, y):
        y_ = self(x)
        loss = self.crit(y_, y)
        metrics = dict()
        for k, func in self.metric_funcs.items():
            metrics.update({k:func(y_, y)})
        return dict(loss=loss, y_=y_, **metrics)
    
class CategoryTop(ExitModel):
    prefer = "CrossEntropyLoss"
    input_dim = 2

    def __init__(self, in_features, out_features):
        super().__init__()
        self.top = nn.Linear(
            in_features=in_features, out_features=out_features)
        self.activation = nn.Softmax(dim=-1)
        self.crit = nn.CrossEntropyLoss()
        self.metric_funcs.update({"acc":accuracy})

    def forward(self, x):
        return self.top(x)
    
    @classmethod
    def from_quantify(cls, quantify, entry_part):
        out_features = len(quantify.category)
        in_features = entry_part.out_features
        return cls(
            in_features=in_features,
            out_features=out_features,
        )

class MultiCategoryTop(ExitModel):
    prefer = "BCEWithLogitsLoss"
    input_dim = 2

    def __init__(self, in_features, out_features):
        super().__init__()
        self.top = nn.Linear(
            in_features=in_features, out_features=out_features)
        self.activation = nn.Sigmoid()
        self.crit = nn.BCEWithLogitsLoss()

    def forward(self, x):
        return self.top(x)
    
    def loss_step(self, x, y):
        y_ = self(x)
        loss = self.crit(y_, y)
        return {"loss": loss, "y_": self.activation(y_)}
    
    @classmethod
    def from_quantify(cls, quantify, entry_part):
        out_features = len(quantify.category)
        in_features = entry_part.out_features
        return cls(
            in_features=in_features,
            out_features=out_features,
        )


class RegressionTop(ExitModel):
    prefer = "MSELoss"
    input_dim = 2

    def __init__(self, in_features, out_features):
        super().__init__()
        self.top = nn.Linear(
            in_features=in_features, out_features=out_features)
        self.crit = nn.MSELoss()

    def forward(self, x):
        return self.top(x)
    
    @classmethod
    def from_quantify(cls, quantify, entry_part):
        out_features = 1
        in_features = entry_part.out_features
        return cls(
            in_features=in_features,
            out_features=out_features,
        )

### EntireModel

In [88]:
# mapping quantify to the following entry or exit model
QUANTIFY_2_ENTRY_MAP = dict({
    QuantifyImage:[
        ImageConvEncoder,
    ],
    QuantifyCategory:[
        CategoryEncoder,
    ],
    QuantifyText:[
        TransformerEncoder,
    ],
    QuantifyNum:[
        Empty,
    ],
})
QUANTIFY_2_EXIT_MAP = dict({
    QuantifyCategory:[
        CategoryTop,
    ],
    QuantifyNum:[
        RegressionTop,
    ],
})

# all entry and exit model
ENTRY_ALL = dict(
    ImageConvEncoder=ImageConvEncoder,
    CategoryEncoder=CategoryEncoder,
    TransformerEncoder=TransformerEncoder,
    Empty=Empty,
)
EXIT_ALL = dict(
    CategoryTop=CategoryTop,
    RegressionTop=RegressionTop,
)

In [89]:
def choose_models(quantify, cls_options, model_conf:str='x_models'):
    def config_model(ModelClass=cls_options):
        def starting_cls(kwargs):
            global phase
            if model_conf in phase:
                models = phase[model_conf]
            else:
                models = dict()
            models[quantify.src] = dict(
            model_name=ModelClass.__name__,
            src=quantify.src,
            kwargs=kwargs,
            )
            phase[model_conf] = models
        ia = InteractiveAnnotations(
            ModelClass.from_quantify,
            description="Okay",
            icon='rocket',
            button_style='success')
            
        ia.register_callback(starting_cls)
        display(ia.vbox)
    inter = interact_manual(config_model)
    reconfig_manual_interact(
        inter.widget,
        description="Yes!", icon="cube", button_style='info')
    return inter
    

In [90]:
def set_model(quantify_dict: Dict[str, Quantify]):
    global phase
    x_models = dict()
    y_models = dict()
    for src, quantify in quantify_dict.items():
        if quantify.is_x:
            entry_cls_options = dict(
                (q.__name__, q)
                for q in QUANTIFY_2_ENTRY_MAP.get(quantify.__class__))

            if entry_cls_options is None:
                print(f"We do not support {quantify.__class__} as X data")
                continue
            display(HTML(f"""
            <h3 class='text-primary'>Choose Model For X Column:
            <strong>{src}</strong></h3>"""))
            choose_models(quantify, entry_cls_options, "x_models")
    for src, quantify in quantify_dict.items():
        if quantify.is_x == False:
            exit_cls_options = dict(
                (q.__name__, q)
                for q in QUANTIFY_2_EXIT_MAP.get(quantify.__class__))
            if entry_cls_options is None:
                print(f"We do not support {quantify.__class__} as Y data")
            display(HTML(f"""
            <h3 class='text-danger'>Choose Model For Y Column:
            <strong>{src}</strong></h3>"""))
            choose_models(quantify, exit_cls_options, "y_models")

In [91]:
set_model(qdict)

HTML(value="\n            <h3 class='text-primary'>Choose Model For X Column:\n            <strong>review_cont…

interactive(children=(Dropdown(description='ModelClass', options={'TransformerEncoder': <class '__main__.Trans…

HTML(value="\n            <h3 class='text-primary'>Choose Model For X Column:\n            <strong>rotten_toma…

interactive(children=(Dropdown(description='ModelClass', options={'CategoryEncoder': <class '__main__.Category…

HTML(value="\n            <h3 class='text-danger'>Choose Model For Y Column:\n            <strong>review_score…

interactive(children=(Dropdown(description='ModelClass', options={'RegressionTop': <class '__main__.Regression…

In [100]:
phase

Phase:{
  "quantify": [
    {
      "src": "review_content",
      "x": true,
      "kwargs": {
        "pretrained": "bert-base-cased",
        "max_length": 512,
        "padding": "max_length",
        "return_token_type_ids": true,
        "return_attention_mask": true,
        "return_offsets_mapping": false
      },
      "quantify": "QuantifyText"
    },
    {
      "src": "rotten_tomatoes_link",
      "x": true,
      "kwargs": {
        "min_frequency": 1
      },
      "quantify": "QuantifyCategory"
    },
    {
      "src": "review_score",
      "x": false,
      "kwargs": {},
      "quantify": "QuantifyNum"
    }
  ],
  "batch_level": {
    "valid_ratio": 0.1,
    "batch_size": 16,
    "shuffle": true,
    "num_workers": 0
  },
  "x_models": {
    "rotten_tomatoes_link": {
      "model_name": "CategoryEncoder",
      "src": "rotten_tomatoes_link",
      "kwargs": {
        "embedding_dim": 64
      }
    },
    "review_content": {
      "model_name": "TransformerEncoder",
 

In [101]:
class EntryDict(nn.Module):
    """
    Create entry parts for different columns
    """
    def __init__(
        self,
        phase: Phase,
        qdict: Dict[str, EntryModel]
    ):
        super().__init__()
        model_dict = nn.ModuleDict()
        for src, model_cfg in phase['x_models'].items():
            quantify = qdict[src]
            
            # find column class
            model_cls = ENTRY_ALL[model_cfg['model_name']]
            # the kwargs to start the column model object
            model_kwargs = model_cfg['kwargs']
            # the model object
            model = model_cls.from_quantify(quantify, **model_kwargs)
            
            # add the model by column name
            model_dict[src] = model
        
        # calculate the output size for dimention 1 (after concatenation)
        self.out_features = sum(
            list(model.out_features for src, model in model_dict.items()))
        self.model_dict = model_dict

    def forward(self, inputs):
        outputs = []
        for src, model in self.model_dict.items():
            # input data for column
            src_input = inputs[src]
            
            # forward pass for column_model(column_data)
            outputs.append(model(src_input))
        # concat the results
        return torch.cat(outputs, dim=1)

In [102]:
entry_dict = EntryDict(phase, qdict)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [95]:
with torch.no_grad():
    y_ = entry_dict(data)

In [103]:
y_.shape

torch.Size([16, 832])

In [104]:
class AssembledModel(pl.LightningModule):
    def __init__(
        self,
        phase: Phase,
        qdict: Dict[str, EntryModel],
        entry_lr: LIST(options=[1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6], default=1e-4)=1e-4,
        exit_lr: LIST(options=[1e-1, 1e-2, 1e-3, 1e-4, ], default=1e-3)=1e-3,
    ):
        super().__init__()
        self.entry_lr = entry_lr
        self.exit_lr = exit_lr
        self.entry_dict = EntryDict(phase, qdict)
        exit_cfg = list(phase['y_models'].values())[0]
        
        self.exit_src = exit_cfg['src']
        self.exit_kwargs = exit_cfg['kwargs']
        exit_cls = EXIT_ALL[exit_cfg['model_name']]
        
        exit_quantify = qdict[self.exit_src]
        
        self.exit_part = exit_cls.from_quantify(
            exit_quantify,self.entry_dict, **self.exit_kwargs)
    
    def forward(self, inputs):
        vec = self.entry_dict(inputs)
        return self.exit_part(vec)
    
    def loss_step(self, inputs):
        vec = self.entry_dict(inputs)
        return self.exit_part.loss_step(vec, inputs[self.exit_src])
    
    def training_step(self, batch, batch_idx):
        rt = self.loss_step(batch)
        for k, v in rt.items():
            if v.numel()==1:
                self.log(f"trn_{k}", v)
        return rt['loss']

    def validation_step(self, batch, batch_idx):
        rt = self.loss_step(batch)
        for k, v in rt.items():
            if v.numel()==1:
                self.log(f"val_{k}", v)
        return rt['loss']
    
    def configure_optimizers(self,):
        param_groups = [
            {"params":self.entry_dict.parameters(), "lr":self.entry_lr},
            {"params":self.exit_part.parameters(), "lr":self.exit_lr},
        ]
        return torch.optim.Adam(param_groups)
    

In [105]:
def create_model(phase, qdict):
    if "y_models" in phase:
        y_models = phase["y_models"]
        if len(y_models)>1:
            raise ValueError("Multiple targets are not supported by now")
        else:
            return AssembledModel(phase, qdict)
    else:
        raise ValueError("phase must contain 'y_models' configuration for now")

In [106]:
final_model = create_model(phase, qdict)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [107]:
with torch.no_grad():
    final_model.loss_step(data)

  return F.mse_loss(input, target, reduction=self.reduction)


### Training

In [108]:
save_phase()

In [109]:
def make_slug_name(phase):
    xs = '-'.join(list(q['src'] for q in phase['quantify'] if q["x"]))
    ys = '-'.join(list(q['src'] for q in phase['quantify'] if q["x"]==False))
    return '_'.join([xs,'to',ys])

Create a vivid **name** for the task

In [110]:
TASK_SLUG = make_slug_name(phase)
TASK_SLUG

'review_content-rotten_tomatoes_link_to_review_score'

In [111]:
from forgebox.thunder.callbacks import DataFrameMetricsCallback

In [115]:
def set_trainer(
    project:STR(default="default",) = "default",
    tensorboard: BOOL(default=True)=True,
    show_metric: BOOL(default=True)=True,
    max_epochs: INT(min_=1, max_=200, default=5)=5,
    use_gpu:BOOL(default=True)=True,
):
    global phase
    if project=='default':
        global PROJECT
        if str(PROJECT)!="None":
            project = PROJECT
        else:
            project = "./project"
    project = Path(project)
    TASK_SLUG = make_slug_name(phase)
    csv_logger = pl.loggers.CSVLogger(project/"csv_log", name = TASK_SLUG, )
    loggers = [
        csv_logger
    ]
    if tensorboard:
        loggers.append(
            pl.loggers.TensorBoardLogger(save_dir=project/'tensorboard', name=TASK_SLUG)
        )
    rt = dict(
        max_epochs = max_epochs,
        logger =loggers)
    callbacks = []
    if show_metric:
        callbacks.append(
            DataFrameMetricsCallback())
        
    rt.update({"callbacks":callbacks})
    
    if use_gpu:
        rt.update(dict(gpus=1))
#     rt.update(dict(
#         auto_select_gpus=True,
#     ))
    return rt

def run_training(final_model, datamodule):
    def set_trainer_callback(kwargs):
        trainer_kwargs = set_trainer(**kwargs)
        trainer = pl.Trainer(**trainer_kwargs)
        trainer.fit(final_model, datamodule=datamodule)
        return trainer
    return set_trainer_callback

In [116]:
interact_intercept(set_trainer, run_training(final_model, datamodule))

interactive(children=(Text(value='default', description='project'), Checkbox(value=True, description='tensorbo…

({}, <function __main__.interact_intercept.<locals>.fillin_init(**kwargs)>)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]


Output()


  | Name       | Type          | Params
---------------------------------------------
0 | entry_dict | EntryDict     | 109 M 
1 | exit_part  | RegressionTop | 833   
---------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
438.923   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f'Your {mode}_dataloader has `shuffle=True`, it is best practice to turn'
  f'The dataloader, {name}, does not have many workers which may be a bottleneck.'
  return F.mse_loss(input, target, reduction=self.reduction)
  f'The dataloader, {name}, does not have many workers which may be a bottleneck.'


Training: 0it [00:00, ?it/s]

  return F.mse_loss(input, target, reduction=self.reduction)
  rank_zero_warn('Detected KeyboardInterrupt, attempting graceful shutdown...')
