# Pattern mining tutorial

Welcome to the tutorial on pattern mining! 

This tutorial explains the most important features of the data-patterns package.

The data-patterns package works with Pandas DataFrames.

In [None]:
import pandas as pd
import numpy as np
import data_patterns
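# Execute the encoding definitions shipped with the package so that the
# encoding functions are defined in this namespace.
for item in data_patterns.encodings_definitions:
    exec(data_patterns.encodings_definitions[item])
# Collect the encoding functions in a dictionary keyed by their name.
encodings = {}
for item in data_patterns.encodings_definitions.keys():
    encodings[item] = locals()[item]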

Let's construct a simple dataframe to do some pattern mining.

In [None]:
col = ['Name', 'Type', 'Assets', 'TV-life', 'TV-nonlife', 'Own funds', 'Diversification','Excess']
insurers = [['Insurer  1', 'life insurer',     1000,  800,    0,  200,   12,  200], 
            ['Insurer  2', 'non-life insurer',   40,    0,   32,    8,    9,    8], 
            ['Insurer  3', 'non-life insurer',  800,    0,  700,  100,   -1,  100],
            ['Insurer  4', 'life insurer',       25,   18,    0,    7,    8,    7], 
            ['Insurer  5', 'non-life insurer', 2100,    0, 2200,  200,   12,  200], 
            ['Insurer  6', 'life insurer',      907,  887,    0,   20,    7,   20],
            ['Insurer  7', 'life insurer',     7123,    0, 6800,  323,    5,  323],
            ['Insurer  8', 'life insurer',     6100, 5920,    0,  180,   14,  180],
            ['Insurer  9', 'non-life insurer', 9011,    0, 8800,  211,   19,  211],
            ['Insurer 10', 'non-life insurer', 1034,    0,  901,  133,    1,  134]]
df = pd.DataFrame(columns = col, data = insurers)
df.set_index('Name', inplace = True)
df

Can we find the errors in this report?


Let's first define our miner:

In [None]:
miner = data_patterns.PatternMiner(df)

### Patterns with equal values

There are two ways of finding a pattern: with an expression or with a code structure (a pattern definition). Let's start with the latter:

In [None]:
parameters = {'min_confidence': 0.5, 'min_support': 2, 'decimal': 8}
p2 = {'name'      : 'equal values',
      'pattern'   : '=',
      'parameters': parameters}
miner.find(p2)

Now, let's find patterns with equal columns with an expression.

In [None]:
parameters = {'min_confidence': 0.5,'min_support'   : 2, 'decimal': 8}
p2 = {'name'      : 'equal values', 
      'expression'   : '{.*}={.*}',
      'parameters': parameters}
miner.find(p2)

When using the equal pattern you can set the accuracy with which values are compared. For this you can use the 'decimal' parameter.

In [None]:
parameters = {'min_confidence': 0.5, 'min_support': 2, 'decimal': -1}

If we now run the miner with the alternative parameters:

In [None]:
p2_alt = {'name'      : 'equal values', 
          'expression'   : '{.*}={.*}',
          'parameters': parameters}
miner.find(p2_alt)

We can also do it with a pattern instead of an expression:

In [None]:
p2_alt2 = {'name'      : 'equal values', 
          'pattern'   : '=',
          'parameters': parameters}
miner.find(p2_alt2)
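
To get a feeling for what a negative 'decimal' could mean, here is a minimal plain-Python sketch. It does not call data_patterns. It assumes that 'decimal' behaves like the number of decimals used in rounding, so that -1 compares values rounded to the nearest ten; the package may implement the tolerance differently. The helper `roughly_equal` is hypothetical.

```python
def roughly_equal(a, b, decimal):
    """Hypothetical helper: compare two numbers after rounding to `decimal` places.

    Assumption: a negative `decimal` rounds to tens, hundreds, and so on.
    """
    return round(a, decimal) == round(b, decimal)


# 'Own funds' and 'Excess' of Insurer 10 in the DataFrame above.
own_funds, excess = 133, 134

strict = roughly_equal(own_funds, excess, 8)    # compared at 8 decimals
loose = roughly_equal(own_funds, excess, -1)    # compared after rounding to tens

print(strict, loose)
```

Under this assumption the two values differ at 8 decimals but count as equal once both are rounded to tens.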

### Patterns with a constant value

To find patterns you need to construct a PatternMiner-object and input a pattern definition. Then you can use the find-function. The result is a Pandas DataFrame with the patterns that were found.

First of all, let's find patterns for whether values are positive or negative.

In [None]:
p1 = {'name'      : 'positive values',
      'expression': '{.*}>=0',
      'parameters': {'min_confidence': 0.5,
                     'min_support'   : 2}}
miner.find(p1)

So we have six patterns (one for each numeric column), with one exception: the column 'Diversification' contains one negative value.
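
The counts behind such a pattern can be reproduced by hand. The sketch below is plain Python and does not call data_patterns. It assumes that support is the number of rows that satisfy the pattern and that confidence is support divided by support plus exceptions; the package's exact definitions may differ.

```python
# 'Diversification' values copied from the DataFrame above.
diversification = {
    'Insurer  1': 12, 'Insurer  2': 9, 'Insurer  3': -1, 'Insurer  4': 8,
    'Insurer  5': 12, 'Insurer  6': 7, 'Insurer  7': 5, 'Insurer  8': 14,
    'Insurer  9': 19, 'Insurer 10': 1,
}

# Rows that satisfy the pattern and rows that violate it.
confirmations = [name for name, value in diversification.items() if value >= 0]
exceptions = [name for name, value in diversification.items() if value < 0]

support = len(confirmations)
# Assumption: confidence = support / (support + number of exceptions).
confidence = support / (support + len(exceptions))

print(support, exceptions, confidence)
```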

We can also do it with a pattern instead of an expression:

In [None]:
p1_alt = {'name'      : 'equal values', 
          'pattern'   : '>=',
          'value'     : 0,
          'parameters': parameters}
miner.find(p1_alt)

### Sum-patterns

To find sum patterns you can use the pattern definition below. With the sum pattern, you can choose to ignore columns where the value is 0 by setting the parameter 'nonzero' to True.

In [None]:
p3 = {'name'   : 'sum pattern',
      'pattern': 'sum',
      'parameters': {"min_confidence": 0.5,
                     "min_support"   : 1}}
miner.find(p3)

With an expression this would look like this:

In [None]:
parameters = {'min_confidence': 0.5,'min_support'   : 2}
p2 = {'name'      : 'equal values', 
      'expression'   : '{.*} + {.*}={.*}',
      'parameters': parameters}
miner.find(p2)
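
As a cross-check, one candidate sum relation can be verified by hand. The sketch below is plain Python and does not call data_patterns. The relation Assets = TV-life + TV-nonlife + Own funds is an assumption chosen for illustration; it is not necessarily one of the patterns the miner reports.

```python
# (name, assets, tv_life, tv_nonlife, own_funds), copied from the DataFrame above.
rows = [
    ('Insurer  1', 1000,  800,    0, 200),
    ('Insurer  2',   40,    0,   32,   8),
    ('Insurer  3',  800,    0,  700, 100),
    ('Insurer  4',   25,   18,    0,   7),
    ('Insurer  5', 2100,    0, 2200, 200),
    ('Insurer  6',  907,  887,    0,  20),
    ('Insurer  7', 7123,    0, 6800, 323),
    ('Insurer  8', 6100, 5920,    0, 180),
    ('Insurer  9', 9011,    0, 8800, 211),
    ('Insurer 10', 1034,    0,  901, 133),
]

# Keep only the rows where the sum of the parts differs from the reported assets.
deviations = {}
for name, assets, tv_life, tv_nonlife, own_funds in rows:
    difference = assets - (tv_life + tv_nonlife + own_funds)
    if difference != 0:
        deviations[name] = difference

print(deviations)
```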

### Conditional patterns


With the conditional pattern you can find conditional statements between columns, such as IF TV-life = 0 THEN TV-nonlife > 0:

In [None]:
p2 = {'name'      : 'Condition',
      'pattern'   : '-->',
      'P_columns' : ['TV-life'],
      'P_values'  : [0],
      'Q_columns' : ['TV-nonlife'],
      'Q_values'  : [0],
      'parameters': {"min_confidence": 0.5, "min_support": 1, 'Q_operators': ['>']}}
miner.find(p2)

You can define the values, operators and logics. Values default to None, in which case the miner tries every possible value. Operators are passed in the parameters, as shown above, and default to '=' when none are given. Logics are the operators between columns, such as '&' and '|' (AND, OR). They are also passed in the parameters, as 'Q_logics' or 'P_logics', can only be used when P or Q contains more than one column, and default to '&' when none are given.

An easier approach is to write the conditional statement as text. See the section "Instructions for an expression" for more information.

In [None]:
p2 = {'name'      : 'equal values',
      'expression': 'IF {"TV-life"} = 0 THEN {"TV-nonlife"} > 0',
      'parameters': {"min_confidence": 0.5,
                     "min_support"   : 2}}
miner.find(p2)
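
The statement IF TV-life = 0 THEN TV-nonlife > 0 can also be counted by hand. The sketch below is plain Python and does not call data_patterns. It assumes that support counts the rows where both the condition (P) and the consequence (Q) hold, that exceptions are the rows where P holds but Q does not, and that rows where P does not hold are ignored.

```python
# (name, tv_life, tv_nonlife), copied from the DataFrame above.
rows = [
    ('Insurer  1',  800,    0), ('Insurer  2',    0,   32),
    ('Insurer  3',    0,  700), ('Insurer  4',   18,    0),
    ('Insurer  5',    0, 2200), ('Insurer  6',  887,    0),
    ('Insurer  7',    0, 6800), ('Insurer  8', 5920,    0),
    ('Insurer  9',    0, 8800), ('Insurer 10',    0,  901),
]

# P: TV-life = 0, Q: TV-nonlife > 0.
support = [name for name, tv_life, tv_nonlife in rows
           if tv_life == 0 and tv_nonlife > 0]
exceptions = [name for name, tv_life, tv_nonlife in rows
              if tv_life == 0 and not tv_nonlife > 0]

# Assumption: confidence = support / (support + exceptions).
confidence = len(support) / (len(support) + len(exceptions))

print(len(support), len(exceptions), confidence)
```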

### Percentile patterns

You can also find percentile patterns for certain columns by adding the percentile value to the parameters. The result is a lower and an upper boundary; the values within these boundaries form the support of the pattern.

In [None]:
parameters = {'min_confidence': 0.3, 'min_support': 1, 'percentile': 90}
p5 = {'name'      : 'type pattern',
      'pattern'   : 'percentile',
      'columns'   : ['TV-nonlife', 'Own funds'],
      'parameters': parameters}
miner.find(p5)
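
The sketch below shows one way such boundaries could be computed, using only the standard library and not data_patterns. It assumes that 'percentile': 90 means a centred band that leaves 5% of the values on each side, and it uses linear interpolation between data points. Both choices are assumptions; the package may compute the boundaries differently.

```python
from statistics import quantiles

# 'Own funds' values copied from the DataFrame above.
own_funds = [200, 8, 100, 7, 200, 20, 323, 180, 211, 133]
percentile = 90

# Assumption: the band is centred, so 90 leaves 5% of the values on each side.
tail = (100 - percentile) // 2

# 99 cut points: cuts[0] is the 1st percentile, cuts[98] the 99th.
cuts = quantiles(own_funds, n=100, method='inclusive')
lower, upper = cuts[tail - 1], cuts[100 - tail - 1]

inside = [v for v in own_funds if lower <= v <= upper]
outside = [v for v in own_funds if not lower <= v <= upper]

print(lower, upper, len(inside), outside)
```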

### Patterns in whether cells are reported or not

Suppose we expect an association between the type of insurer and which columns are reported. For this, we can define a pattern that encodes the value columns with the 'reported' encoding and pass it to the miner.

In [None]:
p4 = {'name'     : 'type pattern',
      'pattern'  : '-->',
      'P_columns': ['Type'],
      'Q_columns': ['Assets', 'TV-life', 'TV-nonlife', 'Own funds'],
      'encode'   : {'Assets'    : 'reported',
                    'TV-life'   : 'reported',
                    'TV-nonlife': 'reported',
                    'Own funds' : 'reported'}}
miner.find(p4)

With an expression this would be:

In [None]:
p4 = {'name'      : 'type pattern',
      'expression': 'IF {.*Ty.*} = [@] THEN {.*As.*} = [@] & {.*TV-n.*} = [@] & {.*TV-l.*} = [@] & {.*O.*} = [@]',
      'P_columns' : ['Type'],
      'Q_columns' : ['Assets', 'TV-life', 'TV-nonlife', 'Own funds'],
      'encode'    : {'Assets'    : 'reported',
                     'TV-life'   : 'reported',
                     'TV-nonlife': 'reported',
                     'Own funds' : 'reported'}}
miner.find(p4)
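
To see what such a pattern captures, the encoding can be imitated by hand. The sketch below is plain Python and does not call data_patterns. The function `reported` is a hypothetical stand-in that assumes a value counts as 'not reported' when it is zero or missing; the encoding shipped with the package may be defined differently.

```python
from collections import Counter


def reported(value):
    """Hypothetical stand-in: zero or missing counts as 'not reported'."""
    return 'not reported' if value is None or value == 0 else 'reported'


# (name, type, tv_life, tv_nonlife), copied from the DataFrame above.
rows = [
    ('Insurer  1', 'life insurer',      800,    0),
    ('Insurer  2', 'non-life insurer',    0,   32),
    ('Insurer  3', 'non-life insurer',    0,  700),
    ('Insurer  4', 'life insurer',       18,    0),
    ('Insurer  5', 'non-life insurer',    0, 2200),
    ('Insurer  6', 'life insurer',      887,    0),
    ('Insurer  7', 'life insurer',        0, 6800),
    ('Insurer  8', 'life insurer',     5920,    0),
    ('Insurer  9', 'non-life insurer',    0, 8800),
    ('Insurer 10', 'non-life insurer',    0,  901),
]

# Count how often each combination of type and encoded values occurs.
combinations = Counter(
    (kind, reported(tv_life), reported(tv_nonlife))
    for _, kind, tv_life, tv_nonlife in rows
)

for combination, count in combinations.items():
    print(combination, count)
```

The combination that occurs only once points at a row that deviates from the other insurers of the same type.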

### Instructions for an expression

Expressions can be written as follows:

1. Put the expression in a pattern definition like the ones above
2. Columns are given with '{}', example: '{Assets} > 0'
3. If you want to match several columns you can use a regular expression such as '{.*}' (this will match all columns), example: '{.*TV.*} > 0' (will match TV-life and TV-nonlife)
4. Conditional statements are written with IF and THEN, combined with & and | (and/or), example: 'IF ({.*TV-life.*} = 0) THEN ({.*TV-nonlife.*} = 8800) & ({.*As.*} > 0)'. Note: AND is only used when you also want the reverse of this statement, such as 'IF ({.*TV-life.*} = 0) THEN ({.*TV-nonlife.*} = 8800) & ({.*As.*} > 0) AND IF ({.*TV-life.*} = 0) THEN ~({.*TV-nonlife.*} = 8800) & ({.*As.*} > 0)'
5. Use [@] if you do not have a specific value, example: 'IF ({.*Ty.*} = [@]) THEN ({.*As.*} = [@])'
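
The column matching in the rules above can be tried out with Python's `re` module. This sketch does not call data_patterns. It assumes that the text between the braces is a regular expression that has to match the whole column name; the helper `match_columns` is hypothetical and the package may match names differently.

```python
import re

# Column names of the DataFrame above ('Name' is the index).
columns = ['Type', 'Assets', 'TV-life', 'TV-nonlife',
           'Own funds', 'Diversification', 'Excess']


def match_columns(pattern, columns):
    """Hypothetical helper: column names that fully match the regular expression."""
    return [column for column in columns if re.fullmatch(pattern, column)]


print(match_columns('.*TV.*', columns))   # the two TV columns
print(match_columns('.*O.*', columns))    # only 'Own funds' contains a capital O
print(match_columns('Assets', columns))   # a literal name matches itself
print(len(match_columns('.*', columns)))  # '.*' matches every column
```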

### Combining patterns 

You can run the miner with a list of pattern definitions.

In [None]:
df_patterns = miner.find([p1, p2, p3, p4])

In [None]:
df_patterns

### Getting different codings of patterns

Now that we have the patterns we can transform them to different codings.

The Pandas code of the exceptions of the 7th pattern (index 6) is:

In [None]:
pattern_text = df_patterns.loc[6, 'pandas']
print(pattern_text)

You can evaluate the Pandas code directly with the eval-function inside Python.

In [None]:
eval(pattern_text['X and Y'], globals(), {'df': df})

### Analyzing results

If you want to know the results of the patterns per insurer then you can use the analyze-function.

In [None]:
df_results = miner.analyze()

df_results is a proper Pandas DataFrame, so you can do the usual operations with it. For example, you can select all exceptions to the patterns:

In [None]:
df_results[df_results['result_type']==False]

### Export to and import from Excel

You can export the DataFrame with the patterns with the to_excel-function. This produces an Excel file in a human-readable format.

In [None]:
type(df_patterns)

In [None]:
df_patterns.to_excel("patterns.xlsx", sheet_name='Patterns')

And you can read the Excel with the patterns into the PatternMiner-object in the following way.

In [None]:
p = data_patterns.PatternMiner(df_patterns = pd.read_excel("patterns.xlsx"))

Subsequently, the statistics in df_patterns can be updated with statistics from the data by evaluating the pandas expressions:

In [None]:
p.update_statistics(df)
p.df_patterns

In [None]:
df_results = miner.analyze()
df_results

### Evaluating other metrics

By default the support, exceptions and confidence are evaluated. If desired, you can add other metrics to the output by using the parameters dictionary.

In [None]:
miner = data_patterns.PatternMiner(df)
parameters = {'metrics': ['conviction', 'lift']}
p4 = {'name'     : 'test',
      'expression' : 'IF {.*Ty.*} = [@] THEN {.*As.*} = [@] & {.*TV-n.*} = [@] & {.*TV-l.*} = [@] & {.*O.*} = [@]',
      'encode'   : {'Assets'    : 'reported',
                    'TV-life'   : 'reported',
                    'TV-nonlife': 'reported',
                    'Own funds' : 'reported'},
      'parameters': parameters}
miner.find(p4)

The following metrics are currently available:

* added value
* conviction
* casual confidence
* casual support
* lift
* relative support
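
For orientation, the sketch below computes two of these metrics from raw counts, using the textbook definitions from association rules mining. It does not call data_patterns, and the package's own definitions may differ in detail. The example rule, IF Type = 'life insurer' THEN TV-nonlife = 0, is chosen for illustration only.

```python
# Counts taken from the DataFrame above.
n = 10      # all insurers
n_p = 5     # P holds: Type = 'life insurer'
n_q = 4     # Q holds: TV-nonlife = 0 (Insurers 1, 4, 6 and 8)
n_pq = 4    # P and Q both hold

confidence = n_pq / n_p          # share of P rows where Q also holds
relative_support_q = n_q / n     # share of all rows where Q holds

# Textbook definitions (assumed, not taken from the package):
# lift > 1 means Q is more frequent among the P rows than overall.
lift = confidence / relative_support_q
# conviction compares how often the rule would fail if P and Q were
# independent with how often it actually fails.
conviction = (1 - relative_support_q) / (1 - confidence)

print(round(confidence, 2), round(lift, 2), round(conviction, 2))
```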


## Background

Our approach to pattern mining is somewhat different from traditional association rules mining. Association rules work on a set of items (binary attributes). In the original definition, the items in the set are not linked to column names. However, often we want to find associations between the values of specific columns in a dataset. The pattern mining applied here finds patterns between the values of different columns in a dataset while using the basic measures of association rules mining like support and confidence.
