# Preprocessing pipeline

Although Kaggle has already collected and organized the data, the JSON files need some work before they become a trainable dataset.

This will be organized as a five-step process (P1 to P5):


P1 ==>
-  We will reduce the number of JSON files we process by keeping 1 file in every 1000 (the `index % 1000` filter in the P1 code). This produces a more manageable dataset while we are only testing the waters with our model. The full data can be reprocessed if the model proves to be good.
-  The information in the CSVs will be brought in as columns of the pandas DataFrame.

P2 ==>
- Tokens will be created from the Markdown and code cells, using a slightly different process for each.
- Features will be added (length of text, number of lines of code vs. comments, dummy variables to replace left values, which should be encoded differently when they appear as right values).

P3 ==>
-  We will split the dataset in two, one part per cell type, since it would be easier to build one model for code cells only and one for Markdown cells only. Their outputs could then be used to train the full model.

P4 ==>
- Replace the order of cells with comparison relations `{ <, =, > }`, creating n(n-1)/2 pairwise combinations for a notebook with n cells.

P5 ==>
- DataFrames/CSVs will be concatenated.
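
Step P4 can be sketched on toy data as follows. This is only an illustration: the helper name `pairwise_relations` and the four-cell example are hypothetical and not part of the notebook. With distinct positions only `<` and `>` occur; `=` would appear only if two cells shared a position.

```python
from itertools import combinations

def pairwise_relations(cell_ids, true_order):
    """Return (cell_a, cell_b, relation) for every unordered pair of cells.

    relation is '<' if cell_a comes before cell_b, '>' if after, '=' if tied.
    """
    rank = dict(zip(cell_ids, true_order))
    pairs = []
    for a, b in combinations(cell_ids, 2):
        if rank[a] < rank[b]:
            rel = "<"
        elif rank[a] > rank[b]:
            rel = ">"
        else:
            rel = "="
        pairs.append((a, b, rel))
    return pairs

# Hypothetical toy notebook with four cells and their true positions
cells = ["a1", "b2", "c3", "d4"]
order = [2, 0, 3, 1]
relations = pairwise_relations(cells, order)
print(len(relations))  # n(n-1)/2 = 6 pairs for n = 4
```

The number of pairs grows quadratically with the number of cells, which is one more reason to time this step on the reduced dataset first.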


Long-running steps must be timed, and the timings recorded, so that re-running strategies can be optimized.
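
The cells in this notebook use the `%%time` magic, which prints timings but does not store them. One possible way to also record them is a small context manager, sketched here. The names `timed` and `timings` are hypothetical helpers, not something the notebook defines.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical registry: step name -> elapsed seconds

@contextmanager
def timed(step):
    """Record the wall-clock duration of the enclosed block under `step`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start

# Stand-in workload; in the notebook this would wrap a step such as P1 or P2
with timed("P1_demo"):
    total = sum(range(100_000))

print(timings)
```

The recorded dictionary could then be written next to the output CSVs, so that a later run can decide which steps are worth re-running.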




In [70]:
import pandas as pd

df1 = pd.read_json("data/train/00001756c60be8.json")

In [48]:
train_order = pd.read_csv("data/train_orders.csv")
# cell_order is a space-separated string of cell ids; split it into a list
train_order.cell_order = train_order.cell_order.str.split()
train_order.sample(5)

Unnamed: 0,id,cell_order
101544,ba8ec0edee076c,"[b60583e5, f410fa1b, 710020c5, 667d777e, b3f52..."
121325,df2df37e4c6910,"[5c08e7c6, 7d256d36, 338b13e6, d31ca04d, 09d9d..."
16508,1e18a15c453f3e,"[248c3be9, fb743e4c, 29a3b7c6, 76d33b56, 05423..."
98435,b4d7d0c828d1c1,"[fe995b22, 591cd7c0, df5f3d91, 0b4de330, 2b461..."
10369,1305ae8cd924b0,"[e0afd782, ff4b7e44, 75ae29d3, cd435475, a089f..."


## Step P1: Skim dataset and add order information

In [131]:
%%time
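import os

# Keep only 1 notebook in 1000 to get a manageable sample.
os.makedirs("data/train_P1", exist_ok=True)  # make sure the output folder exists
for index, row in train_order.iterrows():
    if index % 1000 == 0:
        # map cell id -> position in the correct order
        true_order = {h: i for i, h in enumerate(row["cell_order"])}

        # load the notebook and attach each cell's true position
        df = pd.read_json(f"data/train/{row['id']}.json")
        df["true_order"] = df.index.map(true_order)
        #df["true_order_code"] = df[df.cell_type=="code"].rank()
        df.to_csv(f"data/train_P1/{row['id']}.csv")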
        

CPU times: user 4.61 s, sys: 186 ms, total: 4.79 s
Wall time: 4.91 s
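
To make the `df.index.map(true_order)` step in P1 concrete, here is a toy version. It assumes, as the P1 code implies, that each notebook DataFrame is indexed by cell id; the cell ids and contents are made up for illustration.

```python
import pandas as pd

# Hypothetical notebook: rows are indexed by cell id, in shuffled order
df_demo = pd.DataFrame(
    {
        "cell_type": ["code", "markdown", "code"],
        "source": ["import pandas as pd", "# Title", "df.head()"],
    },
    index=["c1", "m1", "c2"],
)
cell_order = ["m1", "c1", "c2"]  # correct order, in the style of train_orders.csv

# cell id -> position in the correct order
true_order = {cell_id: pos for pos, cell_id in enumerate(cell_order)}
df_demo["true_order"] = df_demo.index.map(true_order)

print(df_demo.sort_values("true_order"))
```

Sorting by `true_order` restores the original notebook layout, which is the target the later steps try to predict.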


## Pipeline decorator

Write a decorator that encapsulates opening each file in a folder, applying a transformation to it, and saving the result as CSV.

In [180]:
import os
from functools import wraps

# this configures and returns the decorator
def pipeline(indir, outdir=None):

    # the decorator only receives the function and nothing else
    def pipelinedecor(funct):

        # this wrapper runs the function on every CSV file found in indir
        @wraps(funct)
        def pipelineinner(df, *args, **kwargs):
            df_ = df  # df is just a dummy argument; the real input is read from disk

            for f in os.listdir(indir):
                path = os.path.join(indir, f)
                stem, ext = os.path.splitext(f)

                # skip sub-directories and anything that is not a CSV file
                if os.path.isfile(path) and ext == ".csv":
                    df = pd.read_csv(path)
                    # main function
                    df = funct(df, *args, **kwargs)

                    if outdir:
                        df.to_csv(os.path.join(outdir, f"{stem}.csv"))

            return df_  # return the dummy variable unchanged
        return pipelineinner
    return pipelinedecor
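
The decorator-factory pattern can be tried end to end on a temporary folder. The sketch below repeats a compact variant of `pipeline` so that it runs on its own; unlike the notebook version it drops the dummy argument and writes without the index column. The step `add_length` and the file `nb1.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

def pipeline(indir, outdir=None):
    """Compact variant of the decorator above, repeated so the example is self-contained."""
    def decorator(funct):
        def inner(*args, **kwargs):
            for name in sorted(os.listdir(indir)):
                if name.endswith(".csv"):
                    df = pd.read_csv(os.path.join(indir, name))
                    df = funct(df, *args, **kwargs)
                    if outdir:
                        df.to_csv(os.path.join(outdir, name), index=False)
        return inner
    return decorator

with tempfile.TemporaryDirectory() as tmp:
    indir = os.path.join(tmp, "in")
    outdir = os.path.join(tmp, "out")
    os.makedirs(indir)
    os.makedirs(outdir)

    # one tiny input file standing in for a notebook CSV
    pd.DataFrame({"source": ["a b", "c"]}).to_csv(
        os.path.join(indir, "nb1.csv"), index=False
    )

    @pipeline(indir, outdir)
    def add_length(df):
        # hypothetical step: add the character count of each cell
        df["n_chars"] = df["source"].str.len()
        return df

    add_length()  # processes every CSV in indir and writes to outdir
    result = pd.read_csv(os.path.join(outdir, "nb1.csv"))

print(result)
```

Each pipeline step then only has to describe the per-file transformation, while file discovery and saving stay in one place.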
    

## Step P2: NLP preprocessing

In [90]:
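import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# build the stopword set once instead of reloading the list for every word
STOPWORDS = set(stopwords.words("english"))

def preprocessing_unit(sourcetext: str):
    # lowercase and replace everything except ASCII letters and digits with spaces
    # (this also removes non-English text such as Cyrillic)
    words = re.sub(r"[^a-zA-Z0-9]", " ", sourcetext.lower())
    words = nltk.word_tokenize(words)
    # drop stopwords and purely numeric tokens
    words = [w for w in words if w not in STOPWORDS and not w.isdigit()]
    return words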

In [None]:
import ast
def ast_unit(sourcetext):
    #source = "import pandas as pd \nfrom imblearn.over_sampling import RandomOverSampler, SMOTE \nimport matplotlib; x = train_df.drop(columns = \"target\", random_seed=seed)"
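    root = ast.parse(sourcetext)

    # all use of variables
    all_vars = {node.id for node in ast.walk(root) if isinstance(node, ast.Name)}

    # left values (plain names only; tuple, attribute and subscript targets are skipped)
    assignment = {n.id for node in ast.walk(root) if isinstance(node, ast.Assign)
                       for n in node.targets if isinstance(n, ast.Name)}
    # imported names, using the alias when there is one
    all_imports = {n.asname or n.name for node in ast.walk(root)
                       if isinstance(node, (ast.Import, ast.ImportFrom))
                       for n in node.names}
    #print(all_vars)
    #print(assignment)
    # return the three sets so that callers can use them
    return all_vars, assignment, all_imports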
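
The plan for P2 mentions treating left values differently from right values. One possible way to tell them apart, sketched here on a made-up snippet, is the `ctx` attribute of `ast.Name` nodes: `ast.Store` marks a name being assigned and `ast.Load` a name being read. The variable names `left` and `right` are illustrative only.

```python
import ast

snippet = "x = load()\ny = x + 1\nx = y * 2\n"

left, right = set(), set()
for node in ast.walk(ast.parse(snippet)):
    if isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            left.add(node.id)    # name appears on the left of an assignment
        elif isinstance(node.ctx, ast.Load):
            right.add(node.id)   # name is read (right value, call, ...)

print(sorted(left))
print(sorted(right))
```

A name can land in both sets, as `x` and `y` do here, so any dummy-variable encoding would need to be applied per occurrence and not per name.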

In [94]:
testphrase = df1.iloc[-2]["source"]
print(testphrase, " ==> ", preprocessing_unit(testphrase))
testphrase = df.iloc[-2]["source"]
print(testphrase, " ==> ", preprocessing_unit(testphrase))

Инициализация класса Data  ==>  ['data']
So some people are just not paying. Let's not worry about them.  ==>  ['people', 'paying', 'let', 'worry']


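This applies the preprocessing unit above to every CSV in `data/train_P1/` and writes the results, with an added `tokens` column, to `data/train_P2/`.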

In [179]:
%%time
@pipeline("data/train_P1/","data/train_P2/")
def P2(df):
    df["tokens"] = df["source"].apply(preprocessing_unit)
    return df

P2("running") # input is just dummy variable

CPU times: user 20.6 s, sys: 2.05 s, total: 22.7 s
Wall time: 23.9 s


'running'
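
The P2 plan also lists simple features such as text length and the number of code lines vs. comment lines. The P2 cell above does not compute them yet; a minimal sketch is shown here. The function `line_features` and its feature names are hypothetical, and a line counts as a comment only when it starts with `#`, so inline comments are counted as code.

```python
def line_features(source):
    """Count characters, code lines and full-line comments in one cell."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    n_comment = sum(ln.startswith("#") for ln in lines)
    return {
        "n_chars": len(source),
        "n_code_lines": len(lines) - n_comment,
        "n_comment_lines": n_comment,
    }

# Made-up code cell: one comment line, one blank line, two code lines
cell = "# load data\nimport pandas as pd\n\ndf = pd.read_csv('x.csv')  # inline\n"
features = line_features(cell)
print(features)
```

Inside a pipeline step this could be applied with `df["source"].apply(line_features)`, restricted to code cells, since `#` at the start of a line means a heading in Markdown cells.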

## References

Berkeley Stat 157: Word2vec demonstration

https://nbviewer.org/url/courses.d2l.ai/berkeley-stat-157/slides/4_18/word2vec-gluon.ipynb

## Old code snippets

In [None]:

## this configures and returns the decorator
#def pipeline(df,indir, outdir, skim=1, intype = "csv", outtype = "csv"):
#
#    # the decorator only receives the function and nothing else
#    def pipelinedecor(funct):
#
#        ## this will take the function and 
#        def pipelineinner(df, *args, **kwargs):
#            for index, row in train_order.iterrows():
#                if index%skim==0:
#
#                    if intype == "csv":
#                        df = pd.read_csv(indir+ f"{row['id']}.csv")
#                    elif intype == "json":
#                        df = pd.read_json(indir+ f"{row['id']}.json")
#
#                    # main function
#                    df = funct(df, *args, **kwargs)
#
#                    if outtype == "csv":
#                        df.to_csv(outdir+ f"{row['id']}.csv")
#                    elif outtype == "json":
#                        df.to_json(outdir+ f"{row['id']}.json")
#
#            return 
#        return pipelineinner  
#    return pipelinedecor
#    

In [None]:
#@pipeline(df, "data/train/","data/train_P1/",skim=1000, intype="json")
#def P1(df):
#    df["true_order"] = df.index.map(true_order)
#    df["true_order_by_type"] = df.groupby("cell_type")["true_order"].rank()
#    return df

In [None]:
#%%time
#P1(df) # 6 secs/ 2MB for 1 in 1000 skim rate