# Define a SemanticLabeler class with the following methods:
* **predict(column,filename)**: given a column (column id + file name), predict the semantic type of that column, using a pre-trained semantic labeling model
* **add_labeled_column(column,filename,attribute)**: given a labeled column (column id + file name + semantic label), add this labeled column to the "database" of labeled columns. If this column already exists in the "database", return an error.
* **update_labeled_column(column,filename,attribute)**: given a labeled column, find it in the "database" of labeled columns, and update its attribute with the new one. If there is nothing to update, return an error.
* **validate_labels()**: validate the labels database by checking whether each column's data can be extracted, and update the status of the columns accordingly
* **update_training_examples()**: generate labeled examples from new labeled columns, and add them to the previous training set of labeled examples. This method updates the training set, in preparation for re-training the semantic classifier with train().
* **train()**: train a semantic labeling model on the current (updated) training set of examples
* **reset()**: reset the training set of examples, and the trained model. This makes the SemanticLabeler instance forget all it has learned.
* **status()**: return the current status of the semantic labeler: its directory (where the training sets and the trained model are stored), description of the trained model (including the random seed used), hyperparameters.

# Column status in the "database" of columns
The "knowledge base" of column labels, 'self.labels', is a dataframe with the following fields: 'col' (column id, assumed to be unique), 'file' (id of the file where the column is found), 'attribute' (semantic type of the column), and 'status'.
The 'status' field is used for internal purposes, and can have one of the following values:
* 'PROCESSED': this column has been processed, i.e., training examples have already been generated from it, and added to the training set of examples.
* 'NEW': this column has been added but has not yet been processed
* 'UPDATED': this column might have been processed before, but has been updated since. It needs to be re-processed by calling update_training_examples().
* 'NO_DATA': this column has not been found in the specified file, so its data is not available.
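
The status values above can be read as a small state machine. The sketch below is illustrative only: the `TRANSITIONS` table, the event names, and `next_status` are hypothetical and do not exist in SemanticLabeler, which changes statuses inside its own methods.

```python
# Hypothetical event names; SemanticLabeler itself has no such table.
TRANSITIONS = {
    ("NEW", "data_missing"): "NO_DATA",            # validate_labels() cannot extract the column
    ("UPDATED", "data_missing"): "NO_DATA",
    ("NEW", "examples_generated"): "PROCESSED",    # update_training_examples() succeeded
    ("UPDATED", "examples_generated"): "PROCESSED",
}

def next_status(status, event):
    """Return the status a labeled column moves to after `event`."""
    if event == "label_updated":
        # update_labeled_column() marks any existing row as UPDATED
        return "UPDATED"
    # Any (status, event) pair not listed leaves the status unchanged
    return TRANSITIONS.get((status, event), status)

print(next_status("NEW", "examples_generated"))
print(next_status("PROCESSED", "label_updated"))
print(next_status("NO_DATA", "examples_generated"))
```

Note that in this reading a 'NO_DATA' column is never picked up again unless its label is updated.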

# Training examples
The training examples used for the semantic classifier are stored in the 'examples.pkl.gz' file in the local directory of the SemanticLabeler instance, and are updated by the update_training_examples() method. Once the training set has been updated, the semantic classifier must be re-trained with train() (this is expensive, so it should be done after all the column labels have been added/updated). If train() is not called after updating the training examples, the previous classifier will be used when making predictions with predict().
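
As a minimal sketch of the storage format, the round trip through a gzip-compressed pickle could look as follows. The toy `examples` list and the temporary directory are assumptions made for illustration; the class itself stores a pandas dataframe under its model directory.

```python
import gzip
import os
import pickle
import tempfile

def save_examples(examples, path):
    # Write the examples as a gzip-compressed pickle
    with gzip.open(path, "wb") as f:
        pickle.dump(examples, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_examples(path):
    # Read the examples back from the gzip-compressed pickle
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-in for the examples dataframe
examples = [{"col": "Amount", "label": 1, "features": [0.3, 0.04]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "examples.pkl.gz")
    save_examples(examples, path)
    restored = load_examples(path)

print(restored == examples)
```

Using `with` blocks guarantees the file is closed even if pickling fails.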

# To do:
* **[DONE]** Implement handling new semantic types, on top of the previously known ones (open world)
* **[DONE]** Check for conflicting semantic types of the same columns
* **[DONE]** Finish the method for updating training examples.
* **[DONE]** implement train()
* implement reset()
* implement status()

In [1]:
import os, csv
import gzip, pickle
import string
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier

In [2]:
LOCATION = '../feature_engineering/repo/council_spending_data/data/'   # council spending data

In [3]:
import json

class Column(object):
    '''
    Data type to be passed to SemanticLabeler.predict
    '''
    def __init__(self, column, file):
        self.column = column
        self.file = file

class LabeledColumn(object):
    '''
    Data type to be passed to SemanticLabeler.add_labeled_column
    '''
    def __init__(self, column, file, label):
        self.column = column
        self.file = file
        self.label = label
        
class RequestHandler:
    def __init__(self):
        # model_dir is the directory used internally by SemanticLabeler
        self.sl = SemanticLabeler(model_dir="SemanticLabeler")
        
    def send_to_semantic_labeler(self, request):
        '''
        request: JSON string, e.g. 
        '[{"col": 1234, "filename": "/some/other/path", "label": "Address"},
          {"col": 9894, "filename": "/some/path", "label": "Name"}]'
        '''
        reqs = json.loads(request)
        for req in reqs:
            self.sl.add_labeled_column(req["col"], req["filename"], req["label"])
        
    def update_semantic_labeler(self): 
        # generate examples from the new/updated labels first; train() does nothing otherwise
        self.sl.update_training_examples()
        self.sl.train()

In [4]:
class SemanticLabeler(object):
    '''
    The task of SemanticLabeler class is to implement the methods for training and prediction of semantic data types, 
    given a list of data columns
    '''
#     ------ some internal functions: ------------------------------------------------------------------------
    def load_model(self, file_name):  #load the hyperparameters and the pre-trained model 
        save_file = gzip.open(file_name, 'rb')
        hp = pickle.load(save_file, encoding='bytes')      #load the hyperparameters
        model = pickle.load(save_file, encoding='bytes')   #load the model
        labels_dict = pickle.load(save_file, encoding='bytes') #load the targets_dict - the dict translating integer labels to attributes
        save_file.close()
        return (hp, model, labels_dict)
    
    def save_model(self,model,hp,labels_dict,file_name):  #save the hyperparameters and the model 
        save_file = gzip.open(file_name, 'wb')
        pickle.dump(hp, save_file, -1)      #save the hyperparameters
        pickle.dump(model, save_file, -1)   #save the model
        pickle.dump(labels_dict, save_file, -1)  # save the targets_dict - the dict translating integer labels to attributes
        save_file.close()
        
    def load_examples(self):
        examples_file = gzip.open(self.examples_path, 'rb')
        examples = pickle.load(examples_file,encoding='bytes')
        examples_file.close()
        return examples
    
    def save_examples(self, examples):
        examples_file = gzip.open(self.examples_path, 'wb')
        pickle.dump(examples,examples_file,-1)
        examples_file.close()
    
    def extract_column(self, col, file):
        df = pd.read_csv(file)
        if col in df:
            return df[col].values.tolist()
        else:
            print("ERROR: column",col,"is not found in file",file)
            return None
        
    def char_freq(self, text, freq=True, lowercase=True, entropy=True):  # extract character counts/frequencies in text (text is a list of strings)
        text = str(text)   # convert text (possibly an array of records) to its string representation
        all_chars = string.printable

        if lowercase: 
            text = text.lower()
        char_dic = {}
        if text != '':
            for x in all_chars:
                char_dic[x] = float(text.count(x))/len(text) if freq else text.count(x)
        else:
            for x in all_chars:
                char_dic[x] = 0

        # add the entropy of text, -Sum(p(i) log2(p(i)), i=1,...,n), where p(i) is the i-th character frequency, 
        # and n is the number of possible characters in the "alphabet", i.e., the number of possible states

        entr = 0
        max_entr = -np.log2(1./len(all_chars))   # maximum entropy, in bits per character, for a text made of all_chars
        if entropy:
            for x in string.printable:
                if freq:
                    p_x = char_dic[x]
                else:
                    p_x = float(char_dic[x])/len(text)
                if p_x > 0:
                    entr += - p_x * np.log2(p_x)   # entropy of text, in bits per character

        char_dic['_entropy_'] = entr/max_entr   # return the normalized entropy of text

        return char_dic
    
    def generate_examples(self,col,col_data,attribute=None,verbose=False):  #generate a matrix of examples from records, with number of records per example and number of examples per column defined in hp 
        if verbose: 
            print ('records_per_example = %i' %self.hp['records_per_example'])
            print ('examples_per_bucket = %i' %self.hp['examples_per_bucket'])
        if self.hp['entropy']:
            all_chars = list(string.printable) + ['_entropy_']
        else:
            all_chars = list(string.printable)
        # define the number of examples to generate:
        n_examples = self.hp['examples_per_bucket']

        examples = pd.DataFrame(index=range(n_examples), columns=['col']+['attribute']+['label']+all_chars)
        examples['col'] = col    # col is the column name
        if attribute is None: 
            examples['attribute'] = np.nan    # the attribute is not known
            examples['label'] = np.nan    # the label is not known
        else:
            examples['attribute'] = attribute
            examples['label'] = self.attributes_dict[attribute]  # int label corresponding to attribute
            
        for row in range(self.hp['examples_per_bucket']):
            if len(col_data)>0:
                sample = np.random.choice(col_data, size=self.hp['records_per_example'], replace=True)
            else:
                sample = ''   # empty string
            freq_vec = self.char_freq(sample, freq=self.hp['freq'], lowercase=self.hp['lowercase'], entropy=self.hp['entropy'])
            for char in all_chars:
                examples.at[row, char] = freq_vec[char]

        return examples
    
#     ------ METHODS: --------------------------------------------------------------------------------

    def __init__(self,model_dir,seed=None,hp=None):
        
        if hp is not None: 
            self.hp = hp
        else:
            self.hp = {}
        self.seed = seed   # random seed to use for both example generation and random forest initialization
        self.path = os.path.join(os.getcwd(),model_dir)
        self.model_path = os.path.join(self.path, "rf_model.pkl.gz")
        self.labels_path = os.path.join(self.path, "labels.csv")
        self.examples_path = os.path.join(self.path, "examples.pkl.gz")
        self.updated_training_set = False
        if os.path.exists(self.path):
            print("The directory",self.path,"already exists.")
            print("Loading semantic classifier and hyperparameters from",self.model_path)
        else:
            print("Creating",self.path)
            os.makedirs(self.path)
            
        if os.path.exists(self.model_path):
                self.hp, self.classifier, self.labels_dict = self.load_model(self.model_path)  #load the model, hp parameters, and the labels_dict (dictionary of targets)
                print('\nHyperparameters overridden with the loaded hyperparameters:')
                print(self.hp)
        else:
            print("The model file",self.model_path,"is not found. Starting afresh.")
            self.classifier = None
            self.labels_dict = {}
            self.labels_dict[0] = 'UNKNOWN'   # this attribute should always be in the labels_dict
        self.attributes_dict = {v: k for k, v in self.labels_dict.items()}  # mapping from semantic types to int labels
        print('\nLoaded semantic classifier:\n', self.classifier,'\n')
        attributes = list(self.attributes_dict.keys())  # list of all known semantic types (attributes)

        print("\nChecking whether the labels file exists...", end='')
        if os.path.exists(self.labels_path):
            print("YES")
        else:
            print("NO. \ncreating a new labels file...")
            with open(self.labels_path, 'w', newline='') as f:   # close the file so the header is flushed before it is read back
                csv.writer(f).writerow(["col", "file", "attribute", "status"])   # create a header

        try:
            self.labels = pd.read_csv(self.labels_path, header=0)  # read the labels file
        except ValueError:
            #print('Error:', ValueError)
            self.labels = pd.DataFrame(columns=["col", "file", "attribute", "status"])  # create an empty dataframe

        print("\nChecking whether the training examples file exists...", end='')
        if os.path.exists(self.examples_path):
            print("YES")
        else:
            if self.classifier is not None:
                print("NO. \nYou still will be able to predict semantic types using this model, but you won't be able to update the model with new examples.")
            else:
                print("NO.")

    def predict(self, col, file, verbose=False):
        """
        Given col and file, predict the col's semantic type, using self.classifier
        """

        # Extract the column data:
        col_data = self.extract_column(col, file)
        if col_data is None:
            return None
        # Generate unlabeled examples from col_data:
        examples = self.generate_examples(col,col_data)
        if len(examples.index)==0:
            print ('predict: no examples to label, stopping.')
            return

        # predict soft labels (labels' posteriors) of examples:
        x = examples.iloc[:,3:].values.astype(float)  # feature vectors of the examples
        if isinstance(self.classifier, RandomForestClassifier):
            y_score = self.classifier.predict_proba(x)
        else:
            print('predict: this classifier type',self.classifier.__class__ ,'is not yet implemented, stopping.')
            return 

        y_score_mean = np.mean(y_score, axis=0)  # take a mean of each column in y_score - i.e., mean soft label of all examples
        attribute = self.labels_dict[self.classifier.classes_[np.argmax(y_score_mean)]]  # semantic type with the highest posterior wins
        if verbose: 
            print("column",col,": semantic type is",attribute)

        return attribute
    
    def add_labeled_column(self, col, file, attribute, verbose=False):
        '''
        given a labeled column (column id + file name + semantic label),
        check for conflicts with previous labels, and if there are no conflicts,
        add this labeled column to the "database" of labeled columns.
        '''
        
        # add the new attribute to the attributes dictionary:
        if attribute not in self.attributes_dict.keys():  
            attr_label = max(self.attributes_dict.values())+1
            self.attributes_dict[attribute] = attr_label
            self.labels_dict[attr_label] = attribute
            
        # Check whether [col, file] already exists in self.labels:
        duplicate_labels = self.labels.loc[(self.labels['col'] == col) & (self.labels['file'] == file)]
        if duplicate_labels.shape[0]>0:
            # print('The label for',col,', ',file,'already exists; skipping.')
            return (False, 'duplicate label')    
        else:
            # this is a new label, add it to self.labels
            new_label = pd.DataFrame([[col, file, attribute, 'NEW']], columns=["col", "file", "attribute", "status"])
            self.labels = pd.concat([self.labels, new_label], ignore_index=True)
            return (True, 'label added')
        
        
    def update_labeled_column(self, col, file, attribute, verbose=False):
        '''
        given a labeled column (column id + file name + semantic label),
        find the existing entry for [col, file] in the "database" of labeled columns,
        and replace its attribute with the new one, marking the entry as 'UPDATED'.
        '''
        
        # add the new attribute to the attributes dictionary:
        if attribute not in self.attributes_dict.keys():  
            attr_label = max(self.attributes_dict.values())+1
            self.attributes_dict[attribute] = attr_label
            self.labels_dict[attr_label] = attribute
        
        # Check whether [col, file] already exists in self.labels:
        updated_label_index = (self.labels['col'] == col) & (self.labels['file'] == file)
        if np.sum(updated_label_index)>0:
            self.labels.loc[updated_label_index] = [[col, file, attribute, 'UPDATED']]
            return (True, 'label updated')    
        else:
            # this is a new label, nothing to update in self.labels
            return (False, 'no label to update')
        
    
    def validate_labels(self):
        '''
        Validate the labels file, by checking whether the column data can be extracted
        '''
        for index,row in self.labels.iterrows():
            col = row['col']
            file = row['file']
            attribute = row['attribute']
            status = row['status']
            
            if status == 'NEW' or status == 'UPDATED':
                col_data = self.extract_column(col, file)
                if col_data is None:
                    self.labels.loc[index,'status'] = 'NO_DATA'
                    
    
    def update_labels_file(self):
        '''
        Save the updated self.labels to the "database"
        '''
        self.labels.to_csv(self.labels_path, index=False)
        
    def update_training_examples(self):
        '''
        Update the training set of labeled examples, by generating examples from the newly labeled columns, 
        and appending them to the existing training set of labeled examples.
        This updates the training set, in preparation for re-training the semantic classifier with train().
        '''
        
        self.updated_training_set = False
        # import the training examples:
        if os.path.exists(self.examples_path):
            examples = self.load_examples()
        else:
            examples = None
    
        for index,row in self.labels.iterrows():
            col = row['col']
            file = row['file']
            attribute = row['attribute']
            status = row['status']
            
            if status == 'NEW' or status == 'UPDATED':
                col_data = self.extract_column(col, file)
                if col_data is not None:
                    new_examples = self.generate_examples(col,col_data,attribute)
                else:
                    new_examples = None
            
            if status == 'UPDATED' and (new_examples is not None):   # update the examples previously generated from this column with new examples
                if examples is not None:
                    if examples[examples['col'] == col].shape[0]>0:  # if can find examples to update, update them by replacing with the new examples
                        i2=0
                        for i1, row in examples[examples['col'] == col].iterrows():
                            examples.iloc[i1] = new_examples.iloc[i2]   # replace the previous examples for this col with the new examples generated from this col
                            i2 += 1
                    else:  # otherwise, just add the examples
                        examples = pd.concat([examples, new_examples], ignore_index=True)   # add new_examples to examples
                else:
                    examples = new_examples
                self.labels.loc[index,'status'] = 'PROCESSED'
                self.updated_training_set = True
                
            elif status == 'NEW' and (new_examples is not None):
                if examples is not None:
                    examples = pd.concat([examples, new_examples], ignore_index=True)   # add new_examples to examples
                else:
                    examples = new_examples        
                self.labels.loc[index,'status'] = 'PROCESSED'
                self.updated_training_set = True
                
            
        if (self.updated_training_set) and (examples is not None): 
            self.update_labels_file()
            self.save_examples(examples)
        
        return self.updated_training_set
    
    def train(self):
        '''
        Train the semantic classifier, if the training examples have been updated (self.updated_training_set is True)
        '''
        def shuffle_examples(df):
            return df.iloc[np.random.permutation(len(df))]

        self.updated_classifier = False
        if self.updated_training_set:   # if the training set has been updated, re-train the classifier
            # import the training examples:
            if os.path.exists(self.examples_path):
                examples_train = self.load_examples()
                examples_train = shuffle_examples(examples_train)
                
                train_x = examples_train.iloc[:,3:].values.astype(float)   # the feature vector starts at column index 3; columns 0-2 hold the column id, attribute, and label
                train_y = examples_train.iloc[:,2].values.astype(int)      # column 2 contains the integer labels of the examples
                
                # create the classifier:
                rf = RandomForestClassifier(n_estimators=500, random_state=self.seed, n_jobs=4)
                # train the rf:
                rf.fit(train_x, train_y)
                self.save_model(model=rf,hp=self.hp,labels_dict=self.labels_dict,file_name=os.path.join(self.path,'rf_model.pkl.gz'))
                self.classifier = rf
                
                self.updated_classifier = True
                self.updated_training_set = False
            else:
                print("train: training examples file",self.examples_path,"not found, nothing to train on.")
                self.updated_classifier = False

        else:
            print("train: no new examples found, keeping the previous classifier.")
            self.updated_classifier = False
            
        return self.updated_classifier

### Specify hyperparameters:

In [5]:
hp = {}
hp['examples_per_bucket'] = 10
hp['records_per_example'] = 100
hp['freq'] = True       # True for character frequencies, False for character counts
hp['lowercase'] = True  # Should the records be lower-cased when generating examples?
hp['entropy'] = True   # include text entropy in the feature vector, or not?
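
To make the 'freq', 'lowercase', and 'entropy' settings concrete, here is a simplified sketch of the kind of feature vector they control: per-character frequencies over `string.printable`, plus the entropy normalized by its maximum. The function name `char_features` is hypothetical, and unlike char_freq() in the class it takes a plain string rather than a sample of records.

```python
import math
import string

def char_features(text, lowercase=True):
    """Character frequencies over string.printable plus normalized entropy."""
    if lowercase:
        text = text.lower()
    alphabet = string.printable
    n = len(text)
    # Frequency of each printable character (all zeros for empty text)
    freqs = {c: (text.count(c) / n if n else 0.0) for c in alphabet}
    # Shannon entropy in bits per character
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    # Maximum entropy for a uniform distribution over the alphabet
    max_entropy = math.log2(len(alphabet))
    freqs["_entropy_"] = entropy / max_entropy
    return freqs

features = char_features("AAbb")
print(features["a"], features["b"], features["A"])
print(round(features["_entropy_"], 4))
```

With lowercasing enabled, uppercase characters always get a frequency of zero, so the feature vector carries no case information.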

### Initialize a SemanticLabeler instance:

In [6]:
sl = SemanticLabeler('SL_test', hp=hp)

Creating /Users/ytyshetskiy/Projects/Data_integration/code/open-data/SL_test
The model file /Users/ytyshetskiy/Projects/Data_integration/code/open-data/SL_test/rf_model.pkl.gz is not found. Starting afresh.

Loaded semantic classifier:
 None 


Checking whether the labels file exists...NO. 
creating a new labels file...

Checking whether the training examples file exists...NO.


#### The labels database of sl is stored in sl.labels. When starting from scratch, it's empty:

In [7]:
sl.labels

Unnamed: 0,col,file,attribute,status


### Add some labeled columns to sl.labels: 
(note that the updated sl.labels is not saved to 'labels.csv' at this step)

In [8]:
sl.add_labeled_column(col="Amount",file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv'),attribute='expense_amount_paid')
sl.add_labeled_column(col="Payment Date",file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv'),attribute='expense_payment_date')
sl.add_labeled_column(col="Payment Date 1",file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv'),attribute='expense_payment_date')

(True, 'label added')
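
The bookkeeping behind add_labeled_column() can be sketched with plain dictionaries: an unseen attribute receives the next free integer label (the "open world" case), and a repeated [col, file] pair is rejected. The names `register_attribute`, `add_label`, and `labeled` are hypothetical stand-ins for the class internals, and the file name is a toy value.

```python
labels_dict = {0: "UNKNOWN"}        # int label -> attribute
attributes_dict = {"UNKNOWN": 0}    # attribute -> int label
labeled = {}                        # (col, file) -> attribute; stand-in for sl.labels

def register_attribute(attribute):
    """Assign the next free integer label to a previously unseen attribute."""
    if attribute not in attributes_dict:
        new_id = max(attributes_dict.values()) + 1
        attributes_dict[attribute] = new_id
        labels_dict[new_id] = attribute
    return attributes_dict[attribute]

def add_label(col, file, attribute):
    """Add a labeled column unless (col, file) is already labeled."""
    register_attribute(attribute)   # registered before the duplicate check, as in the class
    if (col, file) in labeled:
        return (False, "duplicate label")
    labeled[(col, file)] = attribute
    return (True, "label added")

print(add_label("Amount", "payments.csv", "expense_amount_paid"))
print(add_label("Amount", "payments.csv", "expense_amount_paid"))
print(attributes_dict)
```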

Validate the labels database:

In [9]:
sl.validate_labels()

ERROR: column Payment Date 1 is not found in file ../feature_engineering/repo/council_spending_data/data/schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv


#### The labels database is now:
Note that the 'NO_DATA' status indicates that the column is not found, and will not be used

In [10]:
sl.labels

Unnamed: 0,col,file,attribute,status
0,Amount,../feature_engineering/repo/council_spending_d...,expense_amount_paid,NEW
1,Payment Date,../feature_engineering/repo/council_spending_d...,expense_payment_date,NEW
2,Payment Date 1,../feature_engineering/repo/council_spending_d...,expense_payment_date,NO_DATA
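
The validation step amounts to checking whether each pending column name appears in the header of its file. A minimal sketch using only the standard library is shown below; the in-memory `FILES` dictionary, `column_exists`, and `validate` are assumptions for illustration (the class reads real CSV files with pandas).

```python
import csv
import io

def column_exists(csv_file, col):
    """Return True if `col` is in the header row of an open CSV file."""
    header = next(csv.reader(csv_file), [])
    return col in header

def validate(labels, open_file):
    """Mark pending labels whose column cannot be found as NO_DATA."""
    for row in labels:
        if row["status"] in ("NEW", "UPDATED"):
            with open_file(row["file"]) as f:
                if not column_exists(f, row["col"]):
                    row["status"] = "NO_DATA"
    return labels

# Toy in-memory file standing in for a CSV on disk
FILES = {"payments.csv": "Amount,Payment Date\n12.5,2013-08-01\n"}
labels = [
    {"col": "Amount", "file": "payments.csv", "status": "NEW"},
    {"col": "Payment Date 1", "file": "payments.csv", "status": "NEW"},
]
validate(labels, lambda name: io.StringIO(FILES[name]))
print([row["status"] for row in labels])
```

Reading only the header keeps the check cheap for large files.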


### Update the training examples database:

In [11]:
sl.update_training_examples()

True

Inspect the updated examples database (for inspection purposes only):

In [12]:
sl.load_examples()

Unnamed: 0,col,attribute,label,0,1,2,3,4,5,6,...,|,},~,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,_entropy_
0,Amount,expense_amount_paid,1,0.3153623,0.04,0.04869565,0.06666667,0.03594203,0.03072464,0.02782609,...,0,0,0,0.1733333,0,0.01391304,0,0,0,0.4957273
1,Amount,expense_amount_paid,1,0.05258386,0.0444243,0.04986401,0.05077063,0.04623753,0.0444243,0.04351768,...,0,0,0,0.444243,0,0.01359927,0,0,0,0.443287
2,Amount,expense_amount_paid,1,0.04970179,0.07455268,0.06262425,0.04572565,0.06163022,0.04771372,0.06262425,...,0,0,0,0.3697813,0,0.0139165,0,0,0,0.4739101
3,Amount,expense_amount_paid,1,0.05189621,0.06586826,0.06187625,0.04790419,0.04391218,0.03493014,0.0499002,...,0,0,0,0.3962076,0,0.01397206,0,0,0,0.4634969
4,Amount,expense_amount_paid,1,0.3107246,0.04347826,0.05217391,0.05681159,0.03594203,0.03014493,0.02318841,...,0,0,0,0.1733333,0,0.01391304,0,0,0,0.4992672
5,Amount,expense_amount_paid,1,0.0498008,0.06374502,0.06374502,0.05776892,0.05179283,0.06474104,0.04482072,...,0,0,0,0.3705179,0,0.01394422,0,0,0,0.4747476
6,Amount,expense_amount_paid,1,0.04422383,0.06407942,0.04873646,0.05144404,0.03429603,0.04602888,0.05234657,...,0,0,0,0.4359206,0,0.01353791,0,0,0,0.4462331
7,Amount,expense_amount_paid,1,0.04623753,0.06618314,0.07071623,0.04261106,0.04805077,0.0444243,0.03082502,...,0,0,0,0.4533092,0,0.01269266,0,0,0,0.4341895
8,Amount,expense_amount_paid,1,0.06893107,0.06793207,0.05194805,0.04895105,0.04495504,0.05894106,0.03896104,...,0,0,0,0.3756244,0,0.01398601,0,0,0,0.4715174
9,Amount,expense_amount_paid,1,0.3298551,0.03826087,0.05971014,0.05855072,0.03246377,0.02318841,0.02376812,...,0,0,0,0.1733333,0,0.01391304,0,0,0,0.4880688


### Train the classifier sl.classifier, using the updated examples as a training set:
This trains the classifier (RF classifier for now), and saves it, along with the parameters, to 'rf_model.pkl.gz' file.

In [13]:
sl.train()

True

### Predict the semantic attribute of a column, using the trained classifier:

In [14]:
col="Amount"
file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv')
col_data = sl.extract_column(col, file)
sl.predict(col,file, verbose=False)

'expense_amount_paid'
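
predict() labels a whole column by averaging the class posteriors of all examples generated from it and picking the class with the highest mean. The sketch below reproduces that aggregation on made-up numbers; `vote`, the scores, and the class list are illustrative assumptions, not output of the trained model.

```python
def vote(y_score, classes, labels_dict):
    """Average per-example posteriors and return the winning attribute."""
    n = len(y_score)
    # Mean posterior of each class over all examples of the column
    mean_score = [sum(row[j] for row in y_score) / n for j in range(len(classes))]
    # Index of the class with the highest mean posterior
    best = max(range(len(classes)), key=mean_score.__getitem__)
    return labels_dict[classes[best]], mean_score

# Made-up posteriors: 3 examples (rows) x 2 classes (columns)
y_score = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
classes = [1, 2]   # integer labels in the order the classifier reports them
labels_dict = {0: "UNKNOWN", 1: "expense_amount_paid", 2: "expense_payment_date"}

attribute, mean_score = vote(y_score, classes, labels_dict)
print(attribute)
```

Averaging soft labels lets a few confidently classified examples outweigh a borderline one, which a simple majority vote would not.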

### Update a label of a column: 
(can be used to change the attribute of a column to any other attribute including 'UNKNOWN')

In [15]:
# sl.update_labeled_column(col="Amount",file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv'),attribute='expense_amount_paid_NEW')
sl.update_labeled_column(col="Payment Date",file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv'),attribute='UNKNOWN')

(True, 'label updated')

Inspect the labels database after a label has been updated:

In [16]:
sl.labels

Unnamed: 0,col,file,attribute,status
0,Amount,../feature_engineering/repo/council_spending_d...,expense_amount_paid,PROCESSED
1,Payment Date,../feature_engineering/repo/council_spending_d...,UNKNOWN,UPDATED
2,Payment Date 1,../feature_engineering/repo/council_spending_d...,expense_payment_date,NO_DATA


Update the training examples, and inspect the examples database:

In [17]:
sl.update_training_examples()
sl.load_examples()

Unnamed: 0,col,attribute,label,0,1,2,3,4,5,6,...,|,},~,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,_entropy_
0,Amount,expense_amount_paid,1,0.3153623,0.04,0.04869565,0.06666667,0.03594203,0.03072464,0.02782609,...,0,0,0,0.1733333,0,0.01391304,0,0,0,0.4957273
1,Amount,expense_amount_paid,1,0.05258386,0.0444243,0.04986401,0.05077063,0.04623753,0.0444243,0.04351768,...,0,0,0,0.444243,0,0.01359927,0,0,0,0.443287
2,Amount,expense_amount_paid,1,0.04970179,0.07455268,0.06262425,0.04572565,0.06163022,0.04771372,0.06262425,...,0,0,0,0.3697813,0,0.0139165,0,0,0,0.4739101
3,Amount,expense_amount_paid,1,0.05189621,0.06586826,0.06187625,0.04790419,0.04391218,0.03493014,0.0499002,...,0,0,0,0.3962076,0,0.01397206,0,0,0,0.4634969
4,Amount,expense_amount_paid,1,0.3107246,0.04347826,0.05217391,0.05681159,0.03594203,0.03014493,0.02318841,...,0,0,0,0.1733333,0,0.01391304,0,0,0,0.4992672
5,Amount,expense_amount_paid,1,0.0498008,0.06374502,0.06374502,0.05776892,0.05179283,0.06474104,0.04482072,...,0,0,0,0.3705179,0,0.01394422,0,0,0,0.4747476
6,Amount,expense_amount_paid,1,0.04422383,0.06407942,0.04873646,0.05144404,0.03429603,0.04602888,0.05234657,...,0,0,0,0.4359206,0,0.01353791,0,0,0,0.4462331
7,Amount,expense_amount_paid,1,0.04623753,0.06618314,0.07071623,0.04261106,0.04805077,0.0444243,0.03082502,...,0,0,0,0.4533092,0,0.01269266,0,0,0,0.4341895
8,Amount,expense_amount_paid,1,0.06893107,0.06793207,0.05194805,0.04895105,0.04495504,0.05894106,0.03896104,...,0,0,0,0.3756244,0,0.01398601,0,0,0,0.4715174
9,Amount,expense_amount_paid,1,0.3298551,0.03826087,0.05971014,0.05855072,0.03246377,0.02318841,0.02376812,...,0,0,0,0.1733333,0,0.01391304,0,0,0,0.4880688
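
For an 'UPDATED' column, update_training_examples() overwrites the examples previously generated from that column and appends only if none exist. A simplified sketch of this replace-or-append step on plain lists follows; `replace_examples` and the toy records are hypothetical, and the class operates on a dataframe instead.

```python
def replace_examples(examples, new_examples, col):
    """Overwrite existing examples of `col` in place, or append if there are none."""
    positions = [i for i, e in enumerate(examples) if e["col"] == col]
    if not positions:
        return examples + new_examples
    result = list(examples)
    # Pair old positions with new examples one-to-one
    for pos, new in zip(positions, new_examples):
        result[pos] = new
    return result

old = [
    {"col": "Amount", "label": 1},
    {"col": "Payment Date", "label": 2},
    {"col": "Payment Date", "label": 2},
]
new = [{"col": "Payment Date", "label": 0}, {"col": "Payment Date", "label": 0}]

updated = replace_examples(old, new, "Payment Date")
print([e["label"] for e in updated])
```

This relies on the number of new examples matching the number of old ones, which holds as long as 'examples_per_bucket' is unchanged between runs.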


### Re-train the classifier:

In [18]:
sl.train()

True

Note what happens when re-training again on the same training set:

In [19]:
sl.train()

train: no new examples found, keeping the previous classifier.


False
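
The behaviour above follows from a simple "dirty flag": train() only runs when update_training_examples() has reported changes, and it clears the flag afterwards. A minimal sketch of the pattern, with a hypothetical `Retrainer` class and a counter standing in for the expensive fit:

```python
class Retrainer:
    """Toy model of the updated_training_set flag in SemanticLabeler."""

    def __init__(self):
        self.updated_training_set = False
        self.fit_count = 0   # stand-in for the expensive model fit

    def update_training_examples(self, new_examples):
        # The flag is reset first, then set only if something was added
        self.updated_training_set = False
        if new_examples:
            self.updated_training_set = True
        return self.updated_training_set

    def train(self):
        if not self.updated_training_set:
            return False            # nothing new: keep the previous classifier
        self.fit_count += 1
        self.updated_training_set = False
        return True

r = Retrainer()
r.update_training_examples(["example"])
print(r.train(), r.train())
```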

### Predict a column's semantic attribute, using the updated classifier:

In [20]:
# col="Amount"
col="Payment Date"
file=os.path.join(LOCATION,'schema_12__council__royal-borough-of-greenwich__august_2013_payments_over_and_pound500_csv.csv')
sl.predict(col,file, verbose=False)

'UNKNOWN'