# Automated Outlier Detection ML Model
The following code was created for use by Lumere data analysts, who need to quickly clean clients' purchase order data by identifying outliers and excluding them from the analysis dataset. This script uses a random forest model to classify each purchase order as an outlier or not.

NOTE: All queries and table names have been blinded to protect Lumere's data. To run this script, you will need to supply your own dataset and table names.

### Set up environment 
The analysts need to pull the relevant data from Lumere's database. This script uses the psycopg2 library to query PostgreSQL. The `getpass` function lets the user type their database password at a prompt, so it is never saved in the file.

In [None]:
import numpy as np 
import pandas as pd
import getpass
import psycopg2

# Prompt for the database password without echoing it or storing it in the file
PASSWORD = getpass.getpass('Enter your database password: ')

### Run the model
#### Set Up: 
First, the user needs to update the parameters, including username, user ID, and the relevant data categories. Next, the script defines a function that takes the output from `fetchall` and inserts it into a pandas dataframe.
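
As a rough illustration of the helper described above, the sketch below builds a DataFrame from a DB-API cursor by reading the column names from `cur.description`. It uses an in-memory SQLite database purely as a stand-in for PostgreSQL (an assumption, so the example runs without a server). The function name `fetch_frame`, the table `purchase_orders`, and its columns are hypothetical.

```python
import sqlite3

import pandas as pd


def fetch_frame(cur):
    """Return the rows of an executed DB-API cursor as a pandas DataFrame."""
    rows = cur.fetchall()
    # The first element of each description entry is the column name
    colnames = [desc[0] for desc in cur.description]
    return pd.DataFrame(rows, columns=colnames)


# Stand-in database: SQLite instead of PostgreSQL, with a hypothetical table
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE purchase_orders (line_id INTEGER, total_cost REAL)')
cur.executemany('INSERT INTO purchase_orders VALUES (?, ?)',
                [(1, 24.0), (2, 36.5)])

cur.execute('SELECT line_id, total_cost FROM purchase_orders ORDER BY line_id')
df = fetch_frame(cur)
print(df)
```

Passing the row tuples straight to `pd.DataFrame` with `columns=` avoids coercing every column to one common NumPy dtype, which can happen when the rows are first wrapped in `np.array`.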

#### Data: 
The analysts will use client purchase order data, which consists of individual purchase orders. Each purchase order has a price, units, unit of measure, eaches per unit of measure, total cost, catalog number, product, and type. The price is calculated by dividing the total cost by (units × eaches per unit of measure). Think of a unit of measure as a box of markers and eaches per unit of measure as the number of markers in each box. Each catalog number can have several valid eaches per unit of measure values, and this field is often blank or incorrect. This script identifies such lines, tries to correct the eaches per unit of measure field, and recalculates the price.
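
To make the price formula concrete, here is a minimal sketch using the box-of-markers example. The function name and the dollar amounts are made up for illustration. It shows how a wrong eaches per unit of measure value inflates the price of a single item.

```python
def price_per_each(total_cost, units, eaches_per_uom):
    """Price of a single item: total cost / (units x eaches per unit of measure)."""
    each_count = units * eaches_per_uom
    if each_count == 0:
        raise ValueError('units and eaches_per_uom must be non-zero')
    return total_cost / each_count


# 3 boxes of markers, 12 markers per box, 72.00 total
correct = price_per_each(total_cost=72.0, units=3, eaches_per_uom=12)

# Same order with eaches per unit of measure wrongly entered as 1
wrong = price_per_each(total_cost=72.0, units=3, eaches_per_uom=1)

print(correct, wrong)  # 2.0 24.0
```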

From this data, the model computes the following features: 
1. The relative difference between the purchase order price and the average price for the purchase order's catalog number, computed as (price − average) / average 
2. The relative difference between the purchase order price and the average price for the purchase order's type, computed the same way 
3. The count of unique purchase orders by catalog number and price 
4. The count of unique purchase orders by type and price 
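
In this script the features arrive precomputed from the database query, so the following is only a sketch of how they could be derived in pandas. The sample orders are invented, the column names mirror those in the query, and counting rows that share a price within a group is an assumption about how the counts are defined.

```python
import pandas as pd

# Hypothetical purchase orders; column names follow the query in this script
orders = pd.DataFrame({
    'catalog_number': ['A1', 'A1', 'A1', 'B2'],
    'type_id':        [10, 10, 10, 10],
    'price_per_each': [2.0, 2.0, 8.0, 4.0],
})

# Average price per catalog number and per type, aligned to each row
catalog_avg = orders.groupby('catalog_number')['price_per_each'].transform('mean')
type_avg = orders.groupby('type_id')['price_per_each'].transform('mean')

# Features 1 and 2: relative difference from each average
orders['catalog_price_diff'] = (orders['price_per_each'] - catalog_avg) / catalog_avg
orders['type_price_diff'] = (orders['price_per_each'] - type_avg) / type_avg

# Features 3 and 4: number of orders sharing the same price within the group
orders['catalog_price_count'] = (
    orders.groupby(['catalog_number', 'price_per_each'])['price_per_each']
    .transform('size'))
orders['type_price_count'] = (
    orders.groupby(['type_id', 'price_per_each'])['price_per_each']
    .transform('size'))

print(orders)
```

A price that sits far from its group average and is shared by few other orders, like the 8.0 row here, is the kind of line these features are meant to surface.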

The target variable for this model is `exclude_from_benchmarking`, a boolean that is True when the purchase order is an outlier. 

#### Model:
For each category ID, the script does the following: 
1. Fits a random forest model on the training and validation data (any data where `verified`=True). 
2. Generates performance metrics (precision and recall) from the validation split. 
3. If recall ≥ 0.8, runs the model over the unverified data. 
4. For each purchase order the model marks as an outlier, recalculates the price using each additional eaches per unit of measure value available for that order and reruns the model on the line, to see whether a different eaches per unit of measure would make the purchase order no longer an outlier. 
5. Inserts all predictions and eaches per unit of measure updates into the database for an analyst to verify. 
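
Step 4 is the least obvious part of the loop, so here is a simplified sketch of the idea. It is not the script's implementation: `toy_is_outlier` is a hypothetical stand-in for the fitted random forest, the function names and numbers are invented, and the sketch stops at the first candidate that works, whereas the script evaluates every candidate column.

```python
def first_non_outlier_eaches(total_cost, units, candidate_eaches, is_outlier):
    """Return (eaches, price) for the first candidate that is not an outlier.

    Returns None when every candidate still looks like an outlier.
    """
    for eaches in candidate_eaches:
        price = total_cost / (units * eaches)
        if not is_outlier(price):
            return eaches, price
    return None


# Stand-in for the fitted model: flag prices far from a typical price of 2.00.
# This rule is hypothetical; the script uses the random forest's prediction.
def toy_is_outlier(price, typical_price=2.0):
    return abs(price - typical_price) / typical_price > 0.5


# Order flagged as an outlier with eaches = 1 (24.00 per each)
fix = first_non_outlier_eaches(total_cost=72.0, units=3,
                               candidate_eaches=[1, 6, 12],
                               is_outlier=toy_is_outlier)
print(fix)  # (12, 2.0)
```

When a candidate succeeds, the line's prediction flips to not-an-outlier and the candidate is recorded as the proposed eaches per unit of measure update for an analyst to confirm.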

In [None]:
# Update parameters 
USERNAME = 'astaines'
USER_ID = 12345
_CATEGORY_ID_ = [238]

# Define a function that takes the output from executing the SQL query and puts it into a pandas dataframe 
def fetch(cur):
    df = pd.DataFrame(np.array(cur.fetchall()))
    colnames = [desc[0] for desc in cur.description]
    df.columns = colnames
    return df

# Connect to the database
con = psycopg2.connect(dbname='dbname',
                       host='host.com',
                       user=USERNAME,
                       password=PASSWORD,  # entered via getpass above
                       port='port')
cur = con.cursor()
print('Connected to database')

# Loop over all categories in the _CATEGORY_ID_ list
for CATEGORY_ID in _CATEGORY_ID_:

    # Get the most recent model_id + 1 from the database to use as model_id 
    cur.execute('''SELECT max(model_id)+1 as model_id FROM data table;''')
    model_id = fetch(cur)
    model_id = model_id.loc[0,'model_id']

    # Pull the IDs of the category's groups
    cur.execute('''SELECT id FROM data table 
    WHERE product_category_id IN ({0}) AND analytics_available = 2;'''.format(CATEGORY_ID))
    groups = fetch(cur)

    # Pull data from pg-warehouse 
    # The model uses the following features - price per each, the difference between price per each and the average price 
    # by catalog number and type, and the count of each unique price per each by catalog and type. 
    # The first section of this query pulls only data that has been verified, which will be used for the training and 
    # validation datasets. The second section pulls the rest of the data, which we will generate predictions for. 
    # This query also pulls in exclude_from_benchmarking, which is the variable we will be predicting, and the line_id.
    print('Pulling data for category ' + str(CATEGORY_ID) + '...')
    cur.execute('''SELECT
      'training' AS data_set,
      type_id, 
      product_id, 
      catalog_number, 
      price_per_each,
      eaches_per_uom,
      units, 
      total_cost,
      NULL AS allowable_e_per_uom, 
      line_id,
      exclude_from_benchmarking,
      catalog_avg_price, 
      type_avg_price,
      coalesce((price_per_each - catalog_avg_price) / catalog_avg_price, -1) AS catalog_price_diff,
      coalesce((price_per_each - type_avg_price) / type_avg_price, -1) AS type_price_diff,
      coalesce(catalog_price_count, 0) AS catalog_price_count,
      coalesce(type_price_count, 0)AS type_price_count
    FROM data table
    WHERE category_id IN ({0})
        AND verified=True
    UNION
    SELECT
      'testing' AS data_set,
      type_id, 
      product_id, 
      catalog_number, 
      price_per_each,
      eaches_per_uom,
      units, 
      total_cost,
      NULL AS allowable_e_per_uom, 
      line_id,
      exclude_from_benchmarking,
      catalog_avg_price, 
      type_avg_price,
      coalesce((price_per_each - catalog_avg_price) / catalog_avg_price, -1) AS catalog_price_diff,
      coalesce((price_per_each - type_avg_price) / type_avg_price, -1) AS type_price_diff,
      coalesce(catalog_price_count, 0) AS catalog_price_count,
      coalesce(type_price_count, 0)AS type_price_count
    FROM data table
    WHERE category_id IN ({0})
        AND verified=False;'''.format(CATEGORY_ID))
    data = fetch(cur)
    print('Training and testing data pulled from database')

    # Set the index to the line_id to be able to trace predictions back to actual data 
    data = data.set_index('line_id')

    # Split into verified data (training/validation) and unverified data (to be predicted)
    training_data = data.loc[data['data_set']=='training',]
    testing_data = data.loc[data['data_set']=='testing',]

    from sklearn.preprocessing import LabelEncoder
    class_le = LabelEncoder()

    y = class_le.fit_transform(training_data['exclude_from_benchmarking'].values)
    X = training_data[['price_per_each','catalog_price_diff', 'type_price_diff', 'catalog_price_count', 'type_price_count']]

    # train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

    y_predict = class_le.fit_transform(testing_data['exclude_from_benchmarking'].values)
    X_predict = testing_data[['price_per_each','catalog_price_diff', 'type_price_diff', 'catalog_price_count', 'type_price_count']]

    # Fit the model using a Random Forest Classifier
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(criterion = 'entropy', n_estimators=1000, class_weight = 'balanced')
    forest.fit(X_train, y_train)
    print('Model fitted to training data')

    # Generate performance metrics
    y_pred = forest.predict(X_test)

    from sklearn.metrics import confusion_matrix
    con_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)

    if len(con_mat) == 1: 
        precision = 1
        recall = 1
    else: 
        precision = con_mat[1,1]/(con_mat[0,1]+con_mat[1,1])
        recall = con_mat[1,1]/(con_mat[1,0]+con_mat[1,1])

    print('Recall: ' + str(recall))

    # Write output to database tables
    # Generate the outliers_groups insert statement
    # Build one "(model_id, category_id, group_id)" tuple per group, comma-separated
    insert_groups = ','.join(
        '({0}, {1}, {2})'.format(model_id, CATEGORY_ID, group_id)
        for group_id in groups['id'])

    # Insert the model's performance metrics into outliers_model 
    cur.execute('''INSERT INTO data table (model_id, run_date, ran_by_id, precision, recall) VALUES
      ({0},current_timestamp, {1}, {2}, {3})'''.format(model_id, USER_ID, precision, recall))

    # Insert the model category and groups into outliers_groups 
    cur.execute('''INSERT INTO data table (model_id, category_id, group_id) VALUES '''+ str(insert_groups))

    con.commit()
    print('Model performance metrics and groups successfully inserted into database')

    if len(con_mat) == 1:
        # If the testing dataset did not have any outliers then the len(con_mat) will = 1
        print('No outliers in training or validation data - model cannot accurately predict outliers')
    elif recall >= 0.8: 
        # We will only insert predictions into outliers_predictions for models that have a recall >= 0.8 
        # Generate predictions
        y_pred = forest.predict(X_predict)
        probabilities = forest.predict_proba(X_predict)

        # Copy so the new columns are not written back into X_predict
        predictions = X_predict.copy()
        predictions['prediction'] = y_pred
        predictions['probability'] = probabilities[:,1]

        # Merge predictions with testing_data in order to get the line_id for each prediction
        predictions = pd.merge(testing_data[['group_type_id', 'product_id', 'catalog_stripped', 'exclude_from_benchmarking', 'eaches_per_uom', 'allowable_e_per_uom', 'units', 'extended_cost', 'catalog_avg_price', 'type_avg_price']], predictions, left_index=True, right_index=True)
        predictions = predictions.reset_index(drop=False)

        predictions.loc[predictions['prediction']==0, 'prediction'] = False
        predictions.loc[predictions['prediction']==1, 'prediction'] = True

        # Iterate over the predicted outliers to see if we can fix any eaches per uom problems 
        outliers = predictions.loc[predictions['prediction']==True].reset_index(drop=True)
        outliers.loc[outliers['allowable_e_per_uom'].isnull(),'allowable_e_per_uom'] = 1
        
        for i in range(len(outliers)):
            if outliers['allowable_e_per_uom'][i]==1: 
                outliers['allowable_e_per_uom'][i]=[1]
            for j in range(len(outliers['allowable_e_per_uom'][i])): 
                col_name = 'allowable_e_per_uom_' + str(j)
                outliers.loc[outliers.index==i,col_name] = outliers['allowable_e_per_uom'][i][j]

        allowable_e_per_uom_cols = outliers.filter(like='allowable_e_per_uom_').columns

        for i in range(len(allowable_e_per_uom_cols)):
            allowable_e_per_uom_col = 'allowable_e_per_uom_' + str(i)
            price_per_each_col = 'price_per_each_' + str(i)
            catalog_price_diff_col = 'catalog_price_diff_' + str(i)
            type_price_diff_col = 'type_price_diff_' + str(i)
            catalog_price_count_col = 'catalog_price_count_' + str(i)
            type_price_count_col = 'type_price_count_' + str(i)
            prediction_col = 'prediction_' + str(i)
            probability_col = 'probability_' + str(i)

            outliers[price_per_each_col] = outliers['extended_cost'].astype(float)/(outliers[allowable_e_per_uom_col].astype(float)*outliers['units'].astype(float))
            outliers.loc[outliers[price_per_each_col].isnull(), price_per_each_col] = 0 

            outliers[catalog_price_diff_col] = (outliers[price_per_each_col].astype(float) - outliers['catalog_avg_price'].astype(float))/outliers['catalog_avg_price'].astype(float)
            outliers[type_price_diff_col] = (outliers[price_per_each_col].astype(float) - outliers['type_avg_price'].astype(float))/outliers['type_avg_price'].astype(float)
            outliers.loc[outliers[catalog_price_diff_col].isnull(), catalog_price_diff_col] = -1
            outliers.loc[outliers[type_price_diff_col].isnull(), type_price_diff_col] = -1

            outliers = pd.merge(outliers, training_data[['product_id', 'catalog_stripped', 'price_per_each', 'catalog_price_count']].drop_duplicates(), how='left', left_on=['product_id', 'catalog_stripped', price_per_each_col], right_on=['product_id', 'catalog_stripped', 'price_per_each'])
            outliers = pd.merge(outliers, training_data[['group_type_id', 'price_per_each', 'type_price_count']].drop_duplicates(), how='left', left_on=['group_type_id', price_per_each_col], right_on=['group_type_id', 'price_per_each'])
            outliers = outliers.rename(columns = {'catalog_price_count_x': 'catalog_price_count'})
            outliers = outliers.rename(columns = {'type_price_count_x': 'type_price_count'})
            outliers = outliers.rename(columns = {'catalog_price_count_y': catalog_price_count_col})
            outliers = outliers.rename(columns = {'type_price_count_y': type_price_count_col})
            outliers = outliers.drop(['price_per_each', 'price_per_each_y'], axis=1)
            outliers = outliers.rename(columns = {'price_per_each_x': 'price_per_each'})

            outliers.loc[outliers[catalog_price_count_col].isnull(), catalog_price_count_col] = 0
            outliers.loc[outliers[type_price_count_col].isnull(), type_price_count_col] = 0

            X_predict_e = outliers[[price_per_each_col, catalog_price_diff_col, type_price_diff_col, catalog_price_count_col, type_price_count_col]]
            y_pred_e = forest.predict(X_predict_e)
            probabilities_e = forest.predict_proba(X_predict_e[0:])
            outliers[prediction_col] = y_pred_e
            outliers.loc[outliers[prediction_col]==0, prediction_col] = False
            outliers.loc[outliers[prediction_col]==1, prediction_col] = True
            outliers[probability_col] = probabilities_e[:,1]

            # Select columns with a list; a tuple would be read as a single MultiIndex key
            predictions = pd.merge(predictions, outliers.loc[outliers[prediction_col]==False, ['line_id', allowable_e_per_uom_col, probability_col]], how='left', on='line_id')
            predictions.loc[~predictions[allowable_e_per_uom_col].isnull(), 'prediction'] = False
            predictions.loc[~predictions[allowable_e_per_uom_col].isnull(), 'update_e_per_uom'] = predictions[allowable_e_per_uom_col]
            predictions.loc[~predictions[probability_col].isnull(), 'probability'] = predictions[probability_col]
            predictions = predictions.drop([probability_col, allowable_e_per_uom_col], axis=1)

        predictions.loc[predictions['update_e_per_uom'].isnull(), 'update_e_per_uom'] = 'NULL'
        
        # Generate the insert statement for the predictions as a string 
        # Build one value tuple per row; every field, including update_e_per_uom,
        # is read from the same row
        insert_predictions = ','.join(
            '({0}, {1}, {2}, {3}, {4}, {5})'.format(
                model_id, row['line_id'], row['exclude_from_benchmarking'],
                row['prediction'], row['probability'], row['update_e_per_uom'])
            for _, row in predictions.iterrows())

        # Insert predictions into outliers_predictions 
        cur.execute('''INSERT INTO data table (model_id, line_id, original_value, prediction, probability, update_e_per_uom) VALUES '''+ str(insert_predictions))

        con.commit()

        print('Model predictions successfully inserted into database')
    else: 
        print('Model did not meet requirements - recall < 0.8')

    print('Category ' + str(CATEGORY_ID) + ' run complete')
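
The loop above derives precision and recall by dividing confusion-matrix cells directly, which breaks down when the model predicts no outliers at all. A guarded version might look like the sketch below. The function name is hypothetical, and returning 0.0 for an undefined ratio is an assumption, not something the script does.

```python
def precision_recall(con_mat):
    """Precision and recall for the positive (outlier) class.

    con_mat is a 2x2 confusion matrix laid out as scikit-learn does:
    rows are true labels, columns are predicted labels.
    """
    false_pos = con_mat[0][1]
    false_neg = con_mat[1][0]
    true_pos = con_mat[1][1]

    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg

    # Assumption: report 0.0 rather than dividing by zero
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# 8 true outliers, 7 of them caught, plus 2 false alarms
precision, recall = precision_recall([[90, 2], [1, 7]])
print(precision, recall)

# Only models with recall >= 0.8 go on to score the unverified data
meets_threshold = recall >= 0.8
```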