# Refactoring code exercise

We believe that refactoring existing code shows a developer's knowledge and way of thinking better than writing something from scratch.

Therefore, this notebook provides code that should be refactored so that its final output is similar to the one shown here.

Throughout the notebook, we state some explicit rules on how the code should be refactored.

**About the task** We provide a *csv* file produced by an optical character recognition (OCR) engine. Each line corresponds to a word in an image containing text. It holds the pixel coordinates of the box that encloses the text (*left,right,top,bottom*), the text within that box (*Word_text*),
and two id fields that together uniquely identify a file (*temp_id,doc_id*). Finally, the field *label* classifies the word into one of a predefined set of labels. This label is the class that a machine learning algorithm would eventually be trained to predict. **We are not asking you to implement a machine learning algorithm**. The task at hand is about the feature engineering side of machine learning. We value **computational efficiency** and **clean code** above all, so please keep that in mind throughout the exercise.

There is only one general rule we ask for:

**The code generated must be compatible with the Pipeline module of scikit-learn.** This means that the transformations applied to the *csv* file, read as a *pd.DataFrame*, must be coded so that running the pipeline's fit_transform() method gives the expected result.

As a final question, we would like you to explain why we require the code to be compatible with scikit-learn.
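As a rough illustration of what this rule implies, the sketch below wraps one transformation in a class that exposes `fit` and `transform` and runs it through a `Pipeline`. The class name `YMeanAdder` and the sample values are assumptions made for this example only; they are not part of the exercise code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class YMeanAdder(BaseEstimator, TransformerMixin):
    """Add a y_mean column: the midpoint of the top and bottom box edges.

    Hypothetical example class, not taken from the exercise code.
    """

    def fit(self, X, y=None):
        # Nothing is learned from the data; returning self satisfies the
        # contract that Pipeline expects from every step.
        return self

    def transform(self, X):
        X = X.copy()  # do not mutate the caller's DataFrame
        X["y_mean"] = (X["top"] + X["bottom"]) / 2
        return X


# Made-up rows that mimic the columns described above.
sample = pd.DataFrame({"Word_text": ["kg", "25"], "top": [10, 40], "bottom": [20, 60]})

pipeline = Pipeline([("y_mean", YMeanAdder())])
features = pipeline.fit_transform(sample)
```

Further steps can be appended to the list passed to `Pipeline`, so each feature can live in its own class while a single `fit_transform` call produces the final DataFrame.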

Although not explicitly required, **the use of DOCSTRING descriptions** will be highly appreciated.

We are aware that, because of problems in the OCR output, some of the required feature engineering tasks may not admit a general solution that works perfectly in every single case. So we are not asking for perfect feature engineering, but for the most general solution possible.

**Do not hesitate to reach out to us if you have doubts about any of the instructions.**


#### Small tip

To make things easier, create your own conda environment and install the following packages:
```
conda create -n camelot-interview python=3.6
conda activate camelot-interview 
pip install pandas
pip install scikit-learn
pip install numpy
pip install quantulum3
pip install tqdm
```

In [1]:
# Importing the libraries
import pandas as pd    
import glob
import sklearn
import numpy as np
from quantulum3 import parser
from tqdm import tqdm


## 1. Reading the data.

You can just copy and paste this section of the code. 

**BONUS TASK** a more effective way of reading the data would be appreciated.

In [2]:
def load_test_data(path):
    files = glob.glob(path+"template7_*.csv")
    li = []
    for filename in files[:]:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)

    df_test = pd.concat(li, axis=0, ignore_index=True)
    df_test.drop(df_test.columns[0], axis=1, inplace=True)
    
    return df_test
df_test = load_test_data("training_data_new/predictions_labels/")
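One possible direction for the bonus task, offered as a sketch rather than a reference solution: let `read_csv` discard the unnamed leading column at read time and fail loudly when no file matches. The function name `load_csv_folder` and its `pattern` parameter are assumptions introduced here.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder, pattern="template7_*.csv"):
    """Read every CSV in `folder` that matches `pattern` into one DataFrame.

    Hypothetical helper; the default pattern mirrors the one used above.
    """
    paths = sorted(Path(folder).glob(pattern))  # sorted for a reproducible row order
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {str(folder)!r}")
    # index_col=0 turns the first (unnamed) column into the index, and
    # ignore_index=True then replaces it, so no separate drop() is needed.
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

If only some columns are needed downstream, passing `usecols` to `read_csv` would reduce memory further; whether that applies here depends on which columns the features use.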



Below you can see the list of labels present in the dataset.

In [3]:
df_test["label"].unique()

array(['none', 'date', 'un_label', 'un_value', 'customer_name',
       'customer_street', 'customer_zip_city', 'customer_country',
       'packing_type_label', 'packing_type_value', 'material_id',
       'volume_label', 'volume_value', 'volume_unit', 'volume_unc',
       'weight_label', 'weight_value', 'height_unit', 'height_label',
       'diameter_label', 'height_value', 'diameter_value', 'height_unc',
       'diameter_unc', 'diameter_unit', 'wt_top_label', 'wt_body_label',
       'wt_bottom_label', 'delivery_label', 'delivery_value',
       'color_label', 'color_value', 'customer_id_label', 'customer_id',
       'weight_unit', 'width_label', 'length_value', 'width_value',
       'length_unit', 'width_unit', 'length_unc', 'Width_unc',
       'weight_unc', 'wt_top_unit', 'wt_top_value', 'wt_body_value',
       'wt_bottom_value', 'wt_body_unit', 'wt_top_unc', 'wt_body_unc',
       'wt_bottom_unc', 'wt_bottom_unit', 'length_label'], dtype=object)

## 2. Feature engineering


**MAIN TASK**: Put each function into its own class and group all of them in one main feature engineering class, which you will use to generate the DataFrame expected as the result.

This is the main part of the task. In this implementation, as you can see, a single function called *feature_engineering()* defines several other functions internally, which is bad practice, and is then called once at the end.

As stated above, we would like each of these functions to live in its own class, ideally compatible with sklearn.Pipeline, as you will see in part 3. Below is a list of the functions and transformations that we need:

* **is_date** - Generates a column called is_date that is 1 when the text matches a date and 0 otherwise
* **Extra step**: OCR engines quite commonly confuse l (lowercase L) with 1 when they appear together. This is critical when trying to identify numbers and units, as is the case in the context of this exercise.

**BONUS TASK** Please write a function, again compatible with the sklearn pipeline, that replaces 1 with l when 1 appears alone in the text field.
* **is_number** - Generates a boolean column that is 1 when the *Word_text* field is a number
* **is_percent** - Boolean column stating whether a percentage value is present in *Word_text*
* **unit_type** - The goal of this exercise is to find, on a material specification sheet, the values of different measurements and dimensions together with their units, in order to automate data processing from scanned documents to DB systems. To make life easier for the algorithm, we thought that indicating the kind of unit in a row of text might help. Here, we generate a column called unit_type that states, when a unit is present in the text box, whether it is a unit of volume, mass or length.

**BONUS TASK** *Consider improving the implementation below by using a more robust alternative, for example a package like quantulum3, which is designed to detect units and magnitudes. If quantulum3 is used, take a special look at the lines containing mass units. Is there something that is not working and could be made to work by tweaking something in the quantulum3 package?* **IMPORTANT NOTE** Do not use the column label to detect whether a unit is present in a dataset row. This information would not exist in the production environment.
* **is_UN** - UN numbers are approval numbers for dangerous materials. They have easily recognizable patterns. This function creates a boolean column that states whether a UN value is present.
* **y_mean** - (Not in the code below) Adds a column that holds the mean of the bottom and top columns for every line.
* **row_unit_type** - If a unit is present in the same line of text (consider a line to be the rows with similar y_mean values), then assign its **unit_type** to all the elements in that line.

**BONUS TASK** Note that the row_unit_type process can be massively improved compared to the implementation shown here.
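The text-based flags in the list above (is_date, is_number, is_percent, is_UN) can be computed column-wise instead of with a Python-level `apply` per row. The following is a minimal sketch under some assumptions: the helper name `add_text_flags` is invented here, the patterns are copied from the code below (with the UN group made non-capturing), and `pd.to_numeric` stands in for the NumPy float conversion, so edge cases such as the literal text "nan" may be treated differently.

```python
import pandas as pd

DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"
PERCENT_PATTERN = r"\d+%"
UN_PATTERN = r"UN(?:[a-zA-Z0-9.]+/)+[a-zA-Z0-9]+"


def add_text_flags(df, text_col="Word_text"):
    """Return a copy of df with boolean flag columns derived from text_col.

    Hypothetical helper; its body could equally sit in a transform() method.
    """
    out = df.copy()
    # Missing values become empty strings so every flag is a plain bool.
    text = out[text_col].fillna("").astype(str).str.strip()
    out["is_date"] = text.str.match(DATE_PATTERN)
    # Values that cannot be parsed as a number are coerced to NaN.
    out["is_number"] = pd.to_numeric(text, errors="coerce").notna()
    out["is_percent"] = text.str.match(PERCENT_PATTERN)
    out["is_UN"] = text.str.match(UN_PATTERN)
    return out


# Made-up example words, one per flag plus a plain unit.
sample = pd.DataFrame({"Word_text": ["2020-01-31", "12.5", "50%", "UN1A1/X1.2/250", "kg"]})
flags = add_text_flags(sample)
```

Like `re.match`, `Series.str.match` anchors only at the start of the string, so trailing characters after a date or percentage are still accepted, as in the original functions.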


In [4]:
# Define our own units: we can be sure they exist, and we can assign them categories
# Pre-processing needed for unit_type
units = ["m","mm","cm","µm"]
units.extend(["l","ml","m3"])
units.extend(["g","gr", "kg"])
units

unit_to_category = dict() #to check what units belongs to what category
                          #0 would be none
unit_to_category["m"] = 1 #1 would be length
unit_to_category["mm"] = 1
unit_to_category["cm"] = 1
unit_to_category["µm"] = 1
#inches
unit_to_category["l"] = 2 #2 would be volume
unit_to_category["ml"] = 2 
unit_to_category["m3"] = 2
#gallons
unit_to_category["g"] = 3 #3 would be weight
unit_to_category["gr"] = 3 
unit_to_category["kg"] = 3

import re
unit_to_cat=dict()
unit_to_cat["volume"] = 1
unit_to_cat["mass"] = 2
unit_to_cat["length"] = 3


In [5]:

def feature_engineering(df):
    
    def is_date(string):
        try:
            return 1 if re.match(r"\d{4}-\d{2}-\d{2}",string) is not None else 0
        except TypeError:
            print(string)
    df["is_date"] = df.Word_text.apply(is_date).astype("boolean")
    
    def is_number(string):#try to convert to numpy float
        try:
            np.array(string,dtype="float")
            return 1
        except ValueError:
            return 0
    df["is_number"] = df.Word_text.apply(is_number).astype("boolean")
    
    #def is_number_with_character(string):
    #    return 1 if re.match(r"\d+\w{0,2}",string) is not None else 0
    #df["is_number_with_character"] = df.Word_text.apply(is_number_with_character).astype("boolean")
    
    def is_number_with_percent(string):
        return 1 if re.match(r"\d+%",string) is not None else 0
    df["is_number_with_percent"] = df.Word_text.apply(is_number_with_percent).astype("boolean")
    
    def is_a_unit(string): #if unit type != 0 then it is already a unit
        return 1 if string.lower() in units else 0
    df["is_a_unit"] = df.Word_text.apply(is_a_unit).astype("boolean")
    
    def unit_type(string):
        if string.lower() in units:
            return unit_to_category[string.lower()]
        return 0
    df["unit_type"] = df.Word_text.apply(unit_type).astype("int8")
    
        
    
    UN_REGEX = r"UN([a-zA-Z0-9.]+/)+[a-zA-Z0-9]+"
    def is_UN(string):
        return 1 if re.match(UN_REGEX, string.strip()) is not None else 0
    df["is_UN"] = df.Word_text.apply(is_UN).astype("boolean")
    
    # Below is the function that transforms a "1" contained in a character string into an "l"
    
    def one_replace_l(string):
        for item in df["Word_text"]:
            if "1" in item:
                if sum(c.isdigit() for c in item) == 1:
                    item = item.replace("1", "l")
            
    return df
df_test = feature_engineering(df_test)

df_test["y_mean"] = (df_test["top"] + df_test["bottom"]) / 2
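Note that `one_replace_l` in the cell above is defined but never called, and reassigning `item` inside its loop would not change the DataFrame anyway. A pure-Python sketch of the same rule follows; it assumes, as that code does, that "1 appears alone" means the string contains exactly one digit and that digit is 1. The function name `replace_lone_one` is hypothetical.

```python
def replace_lone_one(text):
    """Replace "1" with "l" when it is the only digit in text.

    Assumed rule, taken from the digit count used in one_replace_l.
    """
    digits = [char for char in text if char.isdigit()]
    if digits == ["1"]:
        return text.replace("1", "l")
    # Strings with no digit, or with several digits, are left untouched.
    return text


examples = ["m1", "1", "10", "1.5", "kg"]
cleaned = [replace_lone_one(word) for word in examples]
```

Applied to the data, this could be written as `df["Word_text"].map(replace_lone_one)` inside a transformer's `transform` method. Under this rule a genuine quantity of 1 also becomes "l", which is one of the cases where a fully general solution may not exist.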

In [None]:
df_test["row_unit_type"] = 0
y_delta = 10

for template_id in tqdm(df_test.temp_id.unique(), desc="finding and labeling lines..."):
    template_df = df_test[df_test.temp_id == template_id]
    for doc_id in template_df.doc_id.unique():
        doc_df = template_df[template_df.doc_id==doc_id]
        for i in range(len(doc_df)):
            rows = doc_df[(abs(doc_df.top-doc_df.iloc[i].top) < y_delta) &\
                          (abs(doc_df.bottom-doc_df.iloc[i].bottom) < y_delta)]
            if len(rows.unit_type.unique()) > 1:
                for j in rows.index:
                    df_test.loc[j,"row_unit_type"] = np.delete(rows.unit_type.unique(),0)[0]


finding and labeling lines...:   0%|          | 0/1 [00:00<?, ?it/s]
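The nested loops above filter the DataFrame once per row, so the cost grows roughly quadratically with document size. One alternative, sketched here under stated assumptions, sorts the words of each document by y_mean, starts a new line whenever the vertical gap reaches `y_delta`, and propagates the unit type with a grouped transform. It is not equivalent to the loop: it groups on y_mean rather than on top and bottom separately, it takes the largest category code when a line holds several unit types, and it assumes a unique index. The function name `add_row_unit_type` is invented for this example.

```python
import pandas as pd


def add_row_unit_type(df, y_delta=10):
    """Return a copy of df with y_mean and row_unit_type columns added.

    Hypothetical helper; expects temp_id, doc_id, top, bottom and unit_type.
    """
    keys = ["temp_id", "doc_id"]
    out = df.copy()
    out["y_mean"] = (out["top"] + out["bottom"]) / 2
    ordered = out.sort_values(keys + ["y_mean"])
    # Gap to the previous word of the same document (NaN for its first word).
    gap = ordered.groupby(keys)["y_mean"].diff()
    # A line starts at the first word of a document or after a large gap.
    starts_line = gap.isna() | (gap >= y_delta)
    line_id = starts_line.cumsum()
    # Every word inherits the largest unit_type found on its line (0 if none).
    row_type = ordered["unit_type"].groupby(line_id).transform("max")
    out["row_unit_type"] = row_type  # assignment aligns on the index
    return out


# Made-up boxes: two words on one line, one word further down, one in another document.
sample = pd.DataFrame({
    "temp_id": [7, 7, 7, 7],
    "doc_id": ["a", "a", "a", "b"],
    "top": [100, 102, 200, 100],
    "bottom": [120, 121, 220, 120],
    "unit_type": [0, 2, 0, 0],
})
result = add_row_unit_type(sample)
```

Because consecutive words are chained, a long run of slightly drifting boxes can merge into one line; whether that matters depends on how skewed the scanned pages are.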

## 3. Example of the dataframe expected as return

In [None]:
df_test

# Adapting the data frame to the pipeline

**MAIN TASK** In this part of the code, you will make sure that the feature engineering class is compatible with the
scikit-learn Pipeline module. Scikit-learn is one of the most popular libraries for implementing ML algorithms in Python. It is widely used because its built-in features also let the user model and manage data in the pre-processing stages.

An example implementation can be found here https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

**BONUS TASK** A lot of optimization can be done in preprocessing to make the machine learning task easier. One example is one-hot encoding, which converts categorical variables to a boolean format. Other methods such as rescaling, normalization and standardization can also be beneficial but must be used with care. Explain which extra methods you would use on this dataset to make the machine learning task easier, or, even better, code them in an sklearn-compatible class.

In [None]:
from sklearn.preprocessing import OneHotEncoder


In [None]:
oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(df_test[["Word_text", "unit_type", "label", 'temp_id', 
                                             'doc_id', 'is_date', 'is_number', 'is_number_with_percent',
                                              'is_a_unit','unit_type', 'is_UN', 'y_mean', 'row_unit_type']])
pd.DataFrame(oe_results.toarray()).head()
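The cell above one-hot encodes every listed column, including free text (Word_text), a continuous value (y_mean) and the target (label), which yields one output column per distinct value. A more selective setup, sketched here as one option among several, uses `ColumnTransformer` to encode only the low-cardinality category codes and to standardize the numeric column. The column grouping and the sample values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["unit_type", "row_unit_type"]  # small sets of category codes
NUMERIC = ["y_mean"]                          # continuous position on the page

encoder = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" avoids errors on codes unseen during fit.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("scale", StandardScaler(), NUMERIC),
    ],
    remainder="drop",    # columns not listed above are left out
    sparse_threshold=0,  # always return a dense array
)

# Made-up rows with the engineered columns.
sample = pd.DataFrame({
    "unit_type": [0, 1, 2],
    "row_unit_type": [0, 1, 1],
    "y_mean": [10.0, 20.0, 30.0],
})
encoded = encoder.fit_transform(sample)
```

Since `ColumnTransformer` is itself a transformer, it can be placed as the last step of the feature engineering pipeline, after the steps that create these columns.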