# Pattern mining tutorial

Welcome to the tutorial on pattern mining! 

This tutorial explains the most important features of the data-patterns package.

The data-patterns package works with pandas DataFrames.

In [1]:
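import pandas as pd
import numpy as np
import data_patterns

# Execute the source of each encoding definition so the functions exist in this namespace
for item in data_patterns.encodings_definitions:
    exec(data_patterns.encodings_definitions[item])
# Collect the functions that were just defined, keyed by encoding name
encodings = {}
for item in data_patterns.encodings_definitions:
    encodings[item] = locals()[item]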

Let's construct a simple dataframe to do some pattern mining.

In [2]:
col = ['Name', 'Type', 'Assets', 'TV-life', 'TV-nonlife', 'Own funds', 'Diversification','Excess']
insurers = [['Insurer  1', 'life insurer',     1000,  800,    0,  200,   12,  200], 
            ['Insurer  2', 'non-life insurer',   40,    0,   32,    8,    9,    8], 
            ['Insurer  3', 'non-life insurer',  800,    0,  700,  100,   -1,  100],
            ['Insurer  4', 'life insurer',       25,   18,    0,    7,    8,    7], 
            ['Insurer  5', 'non-life insurer', 2100,    0, 2200,  200,   12,  200], 
            ['Insurer  6', 'life insurer',      907,  887,    0,   20,    7,   20],
            ['Insurer  7', 'life insurer',     7123,    0, 6800,  323,    5,  323],
            ['Insurer  8', 'life insurer',     6100, 5920,    0,  180,   14,  180],
            ['Insurer  9', 'non-life insurer', 9011,    0, 8800,  211,   19,  211],
            ['Insurer 10', 'non-life insurer', 1034,    0,  901,  133,    1,  134]]
df = pd.DataFrame(columns = col, data = insurers)
df.set_index('Name', inplace = True)
df

Name,Type,Assets,TV-life,TV-nonlife,Own funds,Diversification,Excess
Insurer 1,life insurer,1000,800,0,200,12,200
Insurer 2,non-life insurer,40,0,32,8,9,8
Insurer 3,non-life insurer,800,0,700,100,-1,100
Insurer 4,life insurer,25,18,0,7,8,7
Insurer 5,non-life insurer,2100,0,2200,200,12,200
Insurer 6,life insurer,907,887,0,20,7,20
Insurer 7,life insurer,7123,0,6800,323,5,323
Insurer 8,life insurer,6100,5920,0,180,14,180
Insurer 9,non-life insurer,9011,0,8800,211,19,211
Insurer 10,non-life insurer,1034,0,901,133,1,134


Can we find the errors in this report?


Let's first define our miner:

In [3]:
miner = data_patterns.PatternMiner(df)


### Patterns with equal values

There are two ways of finding a pattern: with an expression, or with a code structure (a dictionary that names the pattern type). Let's start with the latter:

In [4]:
parameters = {'min_confidence': 0.5, 'min_support': 2, 'decimal': 8}
p2 = {'name'      : 'equal values',
      'pattern'   : '=',
      'parameters': parameters}
miner.find(p2)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"({""Own funds""} = {""Excess""})",9,1,0.9,not defined,{},"df[(abs((df[""Own funds""]-(df[""Excess""])))<1.5e...","df[(abs((df[""Own funds""]-(df[""Excess""])))>=1.5...",,,
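
The support, exceptions and confidence columns can be checked by hand. The following is a minimal plain-Python sketch, not the package's implementation. It assumes that confidence is support divided by the sum of support and exceptions, which agrees with the 9, 1 and 0.9 shown above.

```python
# (Own funds, Excess) per insurer, copied from the DataFrame above
rows = [(200, 200), (8, 8), (100, 100), (7, 7), (200, 200),
        (20, 20), (323, 323), (180, 180), (211, 211), (133, 134)]

# Rows where both columns agree confirm the pattern
support = sum(1 for own_funds, excess in rows if own_funds == excess)
exceptions = len(rows) - support

# Assumption: confidence = support / (support + exceptions)
confidence = support / (support + exceptions)

print(support, exceptions, confidence)
```

The single exception is Insurer 10, where Own funds is 133 and Excess is 134.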


Now let's find the same equal-columns pattern with an expression.

In [5]:
parameters = {'min_confidence': 0.5,'min_support'   : 2, 'decimal': 8}
p2 = {'name'      : 'equal values', 
      'expression'   : '{.*}={.*}',
      'parameters': parameters}
miner.find(p2)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"({""Own funds""} = {""Excess""})",9,1,0.9,not defined,{},"df[(abs((df[""Own funds""]-(df[""Excess""])))<1.5e...","df[(abs((df[""Own funds""]-(df[""Excess""])))>=1.5...",,,


When using the equal pattern you can set how precisely two values must agree with the 'decimal' parameter. The examples above use 'decimal': 8; the cell below sets it to -1, which loosens the comparison.

In [6]:
parameters = {'min_confidence': 0.5, 'min_support': 2, 'decimal': -1}

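If we now run the miner with these alternative parameters, the difference between Own funds and Excess for Insurer 10 (133 versus 134) no longer counts as an exception: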

In [7]:
p2_alt = {'name'      : 'equal values', 
          'expression'   : '{.*}={.*}',
          'parameters': parameters}
miner.find(p2_alt)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"({""Own funds""} = {""Excess""})",10,0,1.0,not defined,{},"df[(abs((df[""Own funds""]-(df[""Excess""])))<1.5e...","df[(abs((df[""Own funds""]-(df[""Excess""])))>=1.5...",,,
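
To see why the exception disappears, here is a small sketch of a tolerance-based comparison. The tolerance formula `1.5 * 10**(-decimal)` is an assumption inferred from the truncated `1.5e...` in the generated pandas code; the helper names `equal_within` and `count_exceptions` are hypothetical and not part of the package.

```python
# (Own funds, Excess) per insurer, copied from the DataFrame above
pairs = [(200, 200), (8, 8), (100, 100), (7, 7), (200, 200),
         (20, 20), (323, 323), (180, 180), (211, 211), (133, 134)]

def equal_within(a, b, decimal):
    # Assumed tolerance: 1.5 * 10**(-decimal)
    return abs(a - b) < 1.5 * 10 ** (-decimal)

def count_exceptions(decimal):
    # Number of rows whose two values differ by at least the tolerance
    return sum(1 for a, b in pairs if not equal_within(a, b, decimal))

print(count_exceptions(8))   # strict comparison
print(count_exceptions(-1))  # loose comparison
```

Under this assumption, 'decimal': 8 flags the difference of 1 for Insurer 10, while 'decimal': -1 accepts any difference below 15.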


We can also do this with a pattern instead of an expression:

In [8]:
p2_alt2 = {'name'      : 'equal values', 
          'pattern'   : '=',
          'parameters': parameters}
miner.find(p2_alt2)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"({""Own funds""} = {""Excess""})",10,0,1.0,not defined,{},"df[(abs((df[""Own funds""]-(df[""Excess""])))<1.5e...","df[(abs((df[""Own funds""]-(df[""Excess""])))>=1.5...",,,


### Patterns with a constant value

To find patterns you construct a PatternMiner object, pass a pattern definition to its find function, and get back a pandas DataFrame with the patterns that were found.

First of all, let's find patterns for whether values are positive or negative.

In [9]:
p1 = {'name'      : 'positive values',
      'expression': '{.*}>=0',
      'parameters': {'min_confidence': 0.5,
                     'min_support'   : 2}}
miner.find(p1)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,positive values,0,"({""Assets""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""Assets""]>=0.0))]","df[~((df[""Assets""]>=0.0))]",,,
1,positive values,0,"({""TV-life""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""TV-life""]>=0.0))]","df[~((df[""TV-life""]>=0.0))]",,,
2,positive values,0,"({""TV-nonlife""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""TV-nonlife""]>=0.0))]","df[~((df[""TV-nonlife""]>=0.0))]",,,
3,positive values,0,"({""Own funds""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""Own funds""]>=0.0))]","df[~((df[""Own funds""]>=0.0))]",,,
4,positive values,0,"({""Diversification""} >= 0.0)",9,1,0.9,not defined,{},"df[((df[""Diversification""]>=0.0))]","df[~((df[""Diversification""]>=0.0))]",,,
5,positive values,0,"({""Excess""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""Excess""]>=0.0))]","df[~((df[""Excess""]>=0.0))]",,,


So we have six patterns, one for each numeric column. Only one of them has an exception: the column 'Diversification' contains one negative value.

We can also do this with a pattern instead of an expression:

In [10]:
p1_alt = {'name'      : 'equal values', 
          'pattern'   : '>=',
          'value'     : 0,
          'parameters': parameters}
miner.find(p1_alt)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"({""Assets""} >= 0)",10,0,1.0,not defined,{},"df[((df[""Assets""]>=0))&(df[""Assets""]!=0)]","df[~((df[""Assets""]>=0))&(df[""Assets""]!=0)]",,,
1,equal values,0,"({""TV-life""} >= 0)",10,0,1.0,not defined,{},"df[((df[""TV-life""]>=0))&(df[""TV-life""]!=0)]","df[~((df[""TV-life""]>=0))&(df[""TV-life""]!=0)]",,,
2,equal values,0,"({""TV-nonlife""} >= 0)",10,0,1.0,not defined,{},"df[((df[""TV-nonlife""]>=0))&(df[""TV-nonlife""]!=0)]","df[~((df[""TV-nonlife""]>=0))&(df[""TV-nonlife""]!...",,,
3,equal values,0,"({""Own funds""} >= 0)",10,0,1.0,not defined,{},"df[((df[""Own funds""]>=0))&(df[""Own funds""]!=0)]","df[~((df[""Own funds""]>=0))&(df[""Own funds""]!=0)]",,,
4,equal values,0,"({""Diversification""} >= 0)",9,1,0.9,not defined,{},"df[((df[""Diversification""]>=0))&(df[""Diversifi...","df[~((df[""Diversification""]>=0))&(df[""Diversif...",,,
5,equal values,0,"({""Excess""} >= 0)",10,0,1.0,not defined,{},"df[((df[""Excess""]>=0))&(df[""Excess""]!=0)]","df[~((df[""Excess""]>=0))&(df[""Excess""]!=0)]",,,


### Sum-patterns

To find sum patterns you can use the pattern definition below. With the sum pattern you can choose to ignore columns where the value is 0 by setting the parameter 'nonzero' to True.

In [11]:
p3 = {'name'   : 'sum pattern',
      'pattern': 'sum',
      'parameters': {"min_confidence": 0.5,
                     "min_support"   : 1}}
miner.find(p3)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,sum pattern,0,"({""TV-life""} + {""Own funds""} = {""Assets""})",4,0,1.0,not defined,{},"df[(abs((df[""TV-life""] + df[""Own funds""]-(df[""...","df[(abs((df[""TV-life""] + df[""Own funds""]-(df[""...",,,
1,sum pattern,0,"({""TV-life""} + {""Excess""} = {""Assets""})",4,0,1.0,not defined,{},"df[(abs((df[""TV-life""] + df[""Excess""]-(df[""Ass...","df[(abs((df[""TV-life""] + df[""Excess""]-(df[""Ass...",,,
2,sum pattern,0,"({""TV-nonlife""} + {""Own funds""} = {""Assets""})",5,1,0.8333,not defined,{},"df[(abs((df[""TV-nonlife""] + df[""Own funds""]-(d...","df[(abs((df[""TV-nonlife""] + df[""Own funds""]-(d...",,,
3,sum pattern,0,"({""TV-nonlife""} + {""Excess""} = {""Assets""})",5,1,0.8333,not defined,{},"df[(abs((df[""TV-nonlife""] + df[""Excess""]-(df[""...","df[(abs((df[""TV-nonlife""] + df[""Excess""]-(df[""...",,,
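
The counts for the third pattern can be reproduced by hand. This is a plain-Python sketch rather than the package's own logic; it assumes that rows where the TV-nonlife term is 0 are left out of the count, which is consistent with support plus exceptions adding up to 6 instead of 10.

```python
# (TV-nonlife, Own funds, Assets) per insurer, copied from the DataFrame above
rows = {
    "Insurer 1": (0, 200, 1000),
    "Insurer 2": (32, 8, 40),
    "Insurer 3": (700, 100, 800),
    "Insurer 4": (0, 7, 25),
    "Insurer 5": (2200, 200, 2100),
    "Insurer 6": (0, 20, 907),
    "Insurer 7": (6800, 323, 7123),
    "Insurer 8": (0, 180, 6100),
    "Insurer 9": (8800, 211, 9011),
    "Insurer 10": (901, 133, 1034),
}

# Assumption: only rows with a nonzero TV-nonlife value are considered
considered = {name: v for name, v in rows.items() if v[0] != 0}

exceptions = [name for name, (tv, own, assets) in considered.items()
              if tv + own != assets]
support = len(considered) - len(exceptions)
confidence = round(support / len(considered), 4)

print(support, exceptions, confidence)
```

Insurer 5 is the exception: 2200 + 200 does not equal the reported Assets of 2100.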


With an expression this would look like this:

In [12]:
parameters = {'min_confidence': 0.5,'min_support'   : 2}
p2 = {'name'      : 'equal values', 
      'expression'   : '{.*} + {.*}={.*}',
      'parameters': parameters}
miner.find(p2)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"({""TV-life""} + {""Own funds""} = {""Assets""})",4,0,1.0,not defined,{},"df[(abs((df[""TV-life""] + df[""Own funds""]-(df[""...","df[(abs((df[""TV-life""] + df[""Own funds""]-(df[""...",,,
1,equal values,0,"({""TV-life""} + {""Excess""} = {""Assets""})",4,0,1.0,not defined,{},"df[(abs((df[""TV-life""] + df[""Excess""]-(df[""Ass...","df[(abs((df[""TV-life""] + df[""Excess""]-(df[""Ass...",,,
2,equal values,0,"({""TV-nonlife""} + {""Own funds""} = {""Assets""})",5,1,0.8333,not defined,{},"df[(abs((df[""TV-nonlife""] + df[""Own funds""]-(d...","df[(abs((df[""TV-nonlife""] + df[""Own funds""]-(d...",,,
3,equal values,0,"({""TV-nonlife""} + {""Excess""} = {""Assets""})",5,1,0.8333,not defined,{},"df[(abs((df[""TV-nonlife""] + df[""Excess""]-(df[""...","df[(abs((df[""TV-nonlife""] + df[""Excess""]-(df[""...",,,


### Conditional patterns


With the conditional pattern you can find conditional statements between columns, such as IF TV-life = 0 THEN TV-nonlife > 0:

In [13]:
p2 = {'name'      : 'Condition',
      'pattern'   : '-->',
      'P_columns' : ['TV-life'],
      'P_values'  : [0],
      'Q_columns' : ['TV-nonlife'],
      'Q_values'  : [0],
      'parameters': {'min_confidence': 0.5,
                     'min_support'   : 1,
                     'Q_operators'   : ['>']}}
miner.find(p2)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,Condition,0,"IF ({""TV-life""} = 0) THEN ({""TV-nonlife""} > 0)",6,0,1.0,not defined,{},"df[((df[""TV-life""]==0)) & ((df[""TV-nonlife""]>0))]","df[((df[""TV-life""]==0)) & ~((df[""TV-nonlife""]>...",,,


You can define the values, operators and logics. The values default to None, in which case the miner tries every possible value. The operators are passed in the parameters, as shown above ('Q_operators'), and default to '=' when none are given. Logics are the operators between columns, '&' (AND) and '|' (OR). They are also passed in the parameters, as 'Q_logics' or 'P_logics', and can only be used when P or Q contains more than one column. The default is '&'.

An easier approach is to write the conditional statement as text. See the section "Instructions for an expression" for more information.
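
How values, operators and logics combine can be sketched in plain Python. The helper `holds` and the lookup tables `OPS` and `LOGICS` are hypothetical names for illustration; the sketch only mirrors the behaviour described above and is not the package's implementation.

```python
import operator

# Comparison operators and the logics that join several conditions
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le}
LOGICS = {"&": all, "|": any}

def holds(row, columns, operators, values, logic="&"):
    # Evaluate one condition per column, then join them with '&' or '|'
    checks = [OPS[op](row[col], val)
              for col, op, val in zip(columns, operators, values)]
    return LOGICS[logic](checks)

# (TV-life, TV-nonlife) per insurer, copied from the DataFrame above
data = [(800, 0), (0, 32), (0, 700), (18, 0), (0, 2200),
        (887, 0), (0, 6800), (5920, 0), (0, 8800), (0, 901)]
rows = [{"TV-life": a, "TV-nonlife": b} for a, b in data]

# IF TV-life = 0 THEN TV-nonlife > 0
p_rows = [r for r in rows if holds(r, ["TV-life"], ["="], [0])]
support = sum(1 for r in p_rows if holds(r, ["TV-nonlife"], [">"], [0]))
exceptions = len(p_rows) - support

print(support, exceptions)
```

Only rows that satisfy the IF part are counted, which is why support plus exceptions is 6 rather than 10.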

In [14]:
p2 = {'name'      : 'equal values',
      'expression': 'IF {"TV-life"} = 0 THEN {"TV-nonlife"} > 0',
      'parameters': {'min_confidence': 0.5,
                     'min_support'   : 2}}
miner.find(p2)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,equal values,0,"IF {""TV-life""} = 0 THEN {""TV-nonlife""} > 0",6,0,1.0,not defined,{},"df[(df[""TV-life""]==0) & (df[""TV-nonlife""]>0)]","df[(df[""TV-life""]==0) & ~(df[""TV-nonlife""]>0)]",,,


### Percentile patterns

You can also mine percentile patterns for selected columns by adding a 'percentile' value to the parameters. The result is a lower and an upper bound per column; rows whose value lies within the bounds count as support and the remaining rows as exceptions.

In [15]:
parameters = {'min_confidence': 0.3,'min_support'   : 1, 'percentile' : 90}
p5 = {'name'      : 'type pattern',
        'pattern' : 'percentile',
        'columns' : [ 'TV-nonlife', 'Own funds'],
      'parameters':parameters}
miner.find(p5)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,type pattern,0,"({""TV-nonlife""} >= 0.0) & ({""TV-nonlife""} <= 7...",9,1,0.9,not defined,{},"df[(((df[""TV-nonlife""]>=0.0)) & ((df[""TV-nonli...","df[~(((df[""TV-nonlife""]>=0.0)) & ((df[""TV-nonl...",,,
1,type pattern,0,"({""Own funds""} >= 7.45) & ({""Own funds""} <= 27...",8,2,0.8,not defined,{},"df[(((df[""Own funds""]>=7.45)) & ((df[""Own fund...","df[~(((df[""Own funds""]>=7.45)) & ((df[""Own fun...",,,
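
The lower bound of 7.45 for Own funds can be reproduced with a small sketch. It assumes that a percentile of 90 means the central 90% of the values, bounded by the 5th and 95th percentiles with linear interpolation. This interpretation is inferred from the output above and is not confirmed by the package documentation.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between sorted values."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (pos - lower) * (data[upper] - data[lower])

# Own funds per insurer, copied from the DataFrame above
own_funds = [200, 8, 100, 7, 200, 20, 323, 180, 211, 133]

p = 90
tail = (100 - p) / 2                       # assumed: 5% cut off on each side
low = percentile(own_funds, tail)          # about 7.45
high = percentile(own_funds, 100 - tail)   # about 272.6

inside = [v for v in own_funds if low <= v <= high]
print(low, high, len(inside), len(own_funds) - len(inside))
```

Under these assumptions the values 7 and 323 fall outside the bounds, giving a support of 8 and 2 exceptions, as in the second row above.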


### Patterns in whether cells are reported or not

Suppose we expect the type of insurer to determine which cells are reported. For this we can define a conditional pattern with an 'encode' entry, which maps the values in the listed columns to 'reported' or 'not reported' before the miner looks for patterns.

In [16]:
p4 = {'name'     : 'type pattern',
      'pattern'  : '-->',
      'P_columns': ['Type'],
      'Q_columns': ['Assets', 'TV-life', 'TV-nonlife', 'Own funds'],
      'encode'   : {'Assets'    : 'reported',
                    'TV-life'   : 'reported',
                    'TV-nonlife': 'reported',
                    'Own funds' : 'reported'}}
miner.find(p4)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,type pattern,0,"IF ({""Type""} = ""life insurer"") THEN ({""Assets""...",4,1,0.8,not defined,"{'Own funds': 'reported', 'TV-nonlife': 'repor...","df[((df[""Type""]==""life insurer"")) & ((((((repo...","df[((df[""Type""]==""life insurer"")) & ~((((((rep...",,,
1,type pattern,0,"IF ({""Type""} = ""non-life insurer"") THEN ({""Ass...",5,0,1.0,not defined,"{'Own funds': 'reported', 'TV-nonlife': 'repor...","df[((df[""Type""]==""non-life insurer"")) & ((((((...","df[((df[""Type""]==""non-life insurer"")) & ~(((((...",,,
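
The idea behind the encoding can be sketched in plain Python. The function `reported` below is a hypothetical stand-in for the package's 'reported' encoding and assumes that a cell counts as reported when its value is nonzero. The sketch keeps only the most frequent combination per type, which is a simplification of what the miner does.

```python
from collections import Counter, defaultdict

def reported(value):
    # Assumption: a nonzero value means the cell was reported
    return "reported" if value != 0 else "not reported"

# (Type, Assets, TV-life, TV-nonlife, Own funds), copied from the DataFrame above
rows = [
    ("life insurer", 1000, 800, 0, 200),
    ("non-life insurer", 40, 0, 32, 8),
    ("non-life insurer", 800, 0, 700, 100),
    ("life insurer", 25, 18, 0, 7),
    ("non-life insurer", 2100, 0, 2200, 200),
    ("life insurer", 907, 887, 0, 20),
    ("life insurer", 7123, 0, 6800, 323),
    ("life insurer", 6100, 5920, 0, 180),
    ("non-life insurer", 9011, 0, 8800, 211),
    ("non-life insurer", 1034, 0, 901, 133),
]

# Count each encoded combination of Q columns per insurer type
combos = defaultdict(Counter)
for insurer_type, *values in rows:
    combos[insurer_type][tuple(reported(v) for v in values)] += 1

# Keep the most frequent combination per type, with support and exceptions
results = {}
for insurer_type, counter in combos.items():
    combo, support = counter.most_common(1)[0]
    results[insurer_type] = (combo, support, sum(counter.values()) - support)

for insurer_type, result in results.items():
    print(insurer_type, result)
```

The one exception among the life insurers is Insurer 7, which reports TV-nonlife instead of TV-life.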


With an expression this would be:

In [17]:
p4 = {'name'      : 'type pattern',
      'expression': 'IF {.*Ty.*} = [@] THEN {.*As.*} = [@] & {.*TV-n.*} = [@] & {.*TV-l.*} = [@] & {.*O.*} = [@]',
      'P_columns' : ['Type'],
      'Q_columns' : ['Assets', 'TV-life', 'TV-nonlife', 'Own funds'],
      'encode'    : {'Assets'    : 'reported',
                     'TV-life'   : 'reported',
                     'TV-nonlife': 'reported',
                     'Own funds' : 'reported'}}
miner.find(p4)

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,type pattern,0,"IF {""Type""} = ""life insurer"" THEN {""Assets""} =...",4,1,0.8,not defined,"{'Own funds': 'reported', 'TV-nonlife': 'repor...","df[(df[""Type""]==""life insurer"") & (((((reporte...","df[(df[""Type""]==""life insurer"") & ~(((((report...",,,
1,type pattern,0,"IF {""Type""} = ""non-life insurer"" THEN {""Assets...",5,0,1.0,not defined,"{'Own funds': 'reported', 'TV-nonlife': 'repor...","df[(df[""Type""]==""non-life insurer"") & (((((rep...","df[(df[""Type""]==""non-life insurer"") & ~(((((re...",,,


### Instructions for an expression

Expressions can be written as follows:

1. Put the expression in a pattern definition like the ones above, under the key 'expression'
2. Columns are given with '{}', example: '{Assets} > 0'
3. If you want to match several columns you can use a regular expression such as '{.*}' (this will match all columns), example: '{.*TV.*} > 0' (will match TV-life and TV-nonlife)
4. Conditional statements go with IF, THEN together with & and | (and/or), example: 'IF ({.*TV-life.*} = 0) THEN ({.*TV-nonlife.*} = 8800) & ({.*As.*} > 0)' Note: AND is only used when you want to add the reverse of this statement, such as 'IF ({.*TV-life.*} = 0) THEN ({.*TV-nonlife.*} = 8800) & ({.*As.*} > 0) AND IF ({.*TV-life.*} = 0) THEN ~({.*TV-nonlife.*} = 8800) & ({.*As.*} > 0)'
5. Use [@] if you do not have a specific value, example: 'IF ({.*Ty.*} = [@]) THEN ({.*As.*} = [@])'
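
The column matching in step 3 can be sketched with Python's `re` module. This assumes that the text between the braces is treated as a regular expression that has to match the whole column name; the helper `match_columns` is hypothetical and the package may match differently.

```python
import re

# Column names of the example DataFrame (the index 'Name' is not a column)
columns = ['Type', 'Assets', 'TV-life', 'TV-nonlife',
           'Own funds', 'Diversification', 'Excess']

def match_columns(pattern, columns):
    # Assumption: the pattern must match the complete column name
    return [c for c in columns if re.fullmatch(pattern, c)]

print(match_columns('.*', columns))      # every column
print(match_columns('.*TV.*', columns))  # ['TV-life', 'TV-nonlife']
print(match_columns('.*O.*', columns))   # ['Own funds']
```

Short fragments such as '.*As.*' or '.*O.*' work in the examples above because each matches exactly one column name in this DataFrame.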

### Combining patterns 

You can run the miner with a list of pattern definitions.

In [18]:
df_patterns = miner.find([p1, p2, p3, p4])

In [19]:
df_patterns

index,pattern_id,cluster,pattern_def,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex,Error message
0,positive values,0,"({""Assets""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""Assets""]>=0.0))]","df[~((df[""Assets""]>=0.0))]",,,
1,positive values,0,"({""TV-life""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""TV-life""]>=0.0))]","df[~((df[""TV-life""]>=0.0))]",,,
2,positive values,0,"({""TV-nonlife""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""TV-nonlife""]>=0.0))]","df[~((df[""TV-nonlife""]>=0.0))]",,,
3,positive values,0,"({""Own funds""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""Own funds""]>=0.0))]","df[~((df[""Own funds""]>=0.0))]",,,
4,positive values,0,"({""Diversification""} >= 0.0)",9,1,0.9,not defined,{},"df[((df[""Diversification""]>=0.0))]","df[~((df[""Diversification""]>=0.0))]",,,
5,positive values,0,"({""Excess""} >= 0.0)",10,0,1.0,not defined,{},"df[((df[""Excess""]>=0.0))]","df[~((df[""Excess""]>=0.0))]",,,
6,equal values,0,"IF {""TV-life""} = 0 THEN {""TV-nonlife""} > 0",6,0,1.0,not defined,{},"df[(df[""TV-life""]==0) & (df[""TV-nonlife""]>0)]","df[(df[""TV-life""]==0) & ~(df[""TV-nonlife""]>0)]",,,
7,sum pattern,0,"({""TV-life""} + {""Own funds""} = {""Assets""})",4,0,1.0,not defined,{},"df[(abs((df[""TV-life""] + df[""Own funds""]-(df[""...","df[(abs((df[""TV-life""] + df[""Own funds""]-(df[""...",,,
8,sum pattern,0,"({""TV-life""} + {""Excess""} = {""Assets""})",4,0,1.0,not defined,{},"df[(abs((df[""TV-life""] + df[""Excess""]-(df[""Ass...","df[(abs((df[""TV-life""] + df[""Excess""]-(df[""Ass...",,,
9,sum pattern,0,"({""TV-nonlife""} + {""Own funds""} = {""Assets""})",5,1,0.8333,not defined,{},"df[(abs((df[""TV-nonlife""] + df[""Own funds""]-(d...","df[(abs((df[""TV-nonlife""] + df[""Own funds""]-(d...",,,


### Getting different codings of patterns

Now that we have the patterns we can transform them to different codings.

The pandas code for the exceptions of the seventh pattern (index 6) is:

In [20]:
pattern_text = df_patterns.loc[6, 'pandas ex']
print(pattern_text)

df[(df["TV-life"]==0) & ~(df["TV-nonlife"]>0)]


You can evaluate the pandas code directly with Python's built-in eval function. The result here is an empty DataFrame, because this pattern has no exceptions.

In [21]:
eval(pattern_text, globals(), {'df': df})

Name,Type,Assets,TV-life,TV-nonlife,Own funds,Diversification,Excess


The XBRL validation code for the confirmations of this pattern is stored in the 'xbrl co' column (empty here):

In [22]:
df_patterns.loc[6, 'xbrl co']

''

### Analyzing results

If you want to know the results of the patterns per insurer, you can use the analyze function.

In [23]:
df_results = miner.analyze()

df_results is a regular pandas DataFrame, so you can filter and query it as usual. For example, to select all exceptions to the patterns:

In [24]:
df_results[df_results['result_type']==False]

index,result_type,pattern_id,cluster,support,exceptions,confidence,pattern_def,P values,Q values
Insurer 3,False,positive values,0,9,1,0.9,"({""Diversification""} >= 0.0)",-1,
Insurer 5,False,sum pattern,0,5,1,0.8333,"({""TV-nonlife""} + {""Own funds""} = {""Assets""})","[2200, 200]",2100
Insurer 5,False,sum pattern,0,5,1,0.8333,"({""TV-nonlife""} + {""Excess""} = {""Assets""})","[2200, 200]",2100
Insurer 7,False,type pattern,0,4,1,0.8,"IF {""Type""} = ""life insurer"" THEN {""Assets""} =...",life insurer,"[7123, 6800, 0, 323]"


### Export to and import from Excel

You can export the DataFrame with the patterns using the to_excel function. This produces an Excel file in a human-readable format.

In [25]:
type(df_patterns)

data_patterns.transform.PatternDataFrame

In [26]:
df_patterns.to_excel("patterns.xlsx", sheet_name='Patterns')

You can read the Excel file with the patterns back into a PatternMiner object in the following way.

In [27]:
p = data_patterns.PatternMiner(df_patterns = data_patterns.read_excel("patterns.xlsx"))

({"Assets"} >= 0.0)
({"TV-life"} >= 0.0)
({"TV-nonlife"} >= 0.0)
({"Own funds"} >= 0.0)
({"Diversification"} >= 0.0)
({"Excess"} >= 0.0)
IF {"TV-life"} = 0 THEN {"TV-nonlife"} > 0
({"TV-life"} + {"Own funds"} = {"Assets"})
({"TV-life"} + {"Excess"} = {"Assets"})
({"TV-nonlife"} + {"Own funds"} = {"Assets"})
({"TV-nonlife"} + {"Excess"} = {"Assets"})
IF {"Type"} = "life insurer" THEN {"Assets"} = "reported" & {"TV-nonlife"} = "not reported" & {"TV-life"} = "reported" & {"Own funds"} = "reported"
IF {"Type"} = "non-life insurer" THEN {"Assets"} = "reported" & {"TV-nonlife"} = "reported" & {"TV-life"} = "not reported" & {"Own funds"} = "reported"


ValueError: not enough values to unpack (expected 8, got 7)

Subsequently, the statistics in df_patterns can be updated with statistics from the data by evaluating the pandas expressions:

In [None]:
df_patterns = p.update_statistics(df)
df_patterns

## Background

Our approach to pattern mining is somewhat different from traditional association rule mining. Association rules work on a set of items (binary attributes). In the original definition, the items in the set are not linked to column names. Often, however, we want to find associations between the values of specific columns in a dataset. The pattern mining applied here finds patterns between the values of different columns in a dataset, while using the basic measures of association rule mining such as support and confidence.
