# Pattern mining tutorial

Welcome to the tutorial on pattern mining! 

This tutorial explains the most important features of the data-patterns package.

The data-patterns package works with pandas DataFrames.

In [1]:
import pandas as pd
import numpy as np
import data_patterns
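# encodings_definitions maps each encoding name to the source code of its function;
# executing that source defines the function in this namespace.
for item in data_patterns.encodings_definitions:
    exec(data_patterns.encodings_definitions[item])
# Collect the functions defined above in a dictionary, keyed by encoding name.
encodings = {}
for item in data_patterns.encodings_definitions.keys():
    encodings[item]= locals()[item]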

Let's construct a simple dataframe to do some pattern mining.

In [2]:
col = ['Name', 'Type', 'Assets', 'TV-life', 'TV-nonlife', 'Own funds', 'Diversification','Excess']
insurers = [['Insurer  1', 'life insurer',     1000,  800,    0,  200,   12,  200], 
            ['Insurer  2', 'non-life insurer',   40,    0,   32,    8,    9,    8], 
            ['Insurer  3', 'non-life insurer',  800,    0,  700,  100,   -1,  100],
            ['Insurer  4', 'life insurer',       25,   18,    0,    7,    8,    7], 
            ['Insurer  5', 'non-life insurer', 2100,    0, 2200,  200,   12,  200], 
            ['Insurer  6', 'life insurer',      907,  887,    0,   20,    7,   20],
            ['Insurer  7', 'life insurer',     7123,    0, 6800,  323,    5,  323],
            ['Insurer  8', 'life insurer',     6100, 5920,    0,  180,   14,  180],
            ['Insurer  9', 'non-life insurer', 9011,    0, 8800,  211,   19,  211],
            ['Insurer 10', 'non-life insurer', 1034,    0,  901,  133,    1,  134]]
df = pd.DataFrame(columns = col, data = insurers)
df.set_index('Name', inplace = True)
df

Name,Type,Assets,TV-life,TV-nonlife,Own funds,Diversification,Excess
Insurer 1,life insurer,1000,800,0,200,12,200
Insurer 2,non-life insurer,40,0,32,8,9,8
Insurer 3,non-life insurer,800,0,700,100,-1,100
Insurer 4,life insurer,25,18,0,7,8,7
Insurer 5,non-life insurer,2100,0,2200,200,12,200
Insurer 6,life insurer,907,887,0,20,7,20
Insurer 7,life insurer,7123,0,6800,323,5,323
Insurer 8,life insurer,6100,5920,0,180,14,180
Insurer 9,non-life insurer,9011,0,8800,211,19,211
Insurer 10,non-life insurer,1034,0,901,133,1,134


Can we find the errors in this report?


### Patterns with equal values

Let's start by finding pairs of columns that contain equal values.

In [3]:
parameters = {'min_confidence': 0.5,'min_support'   : 2}
p2 = {'name'      : 'equal values', 
      'pattern'   : '=',
      'parameters': parameters}
miner = data_patterns.PatternMiner(p2)
miner.find(df)

index,pattern_id,cluster,P columns,relation type,Q columns,P,relation,Q,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex
0,equal values,0,[Own funds],=,[Excess],,,,9,1,0.9,not defined,{},"df[(df[""Own funds""]) - df[""Excess""] < 1.5e-8]","df[(df[""Own funds""]) - df[""Excess""] >= 1.5e-8]","""Own funds""= ""Excess""","""Own funds""= ""Excess"""
1,equal values,0,[Excess],=,[Own funds],,,,9,1,0.9,not defined,{},"df[(df[""Excess""]) - df[""Own funds""] < 1.5e-8]","df[(df[""Excess""]) - df[""Own funds""] >= 1.5e-8]","""Excess""= ""Own funds""","""Excess""= ""Own funds"""


When using the equal-pattern, you can use the decimal-parameter to set how closely two values must agree to count as equal.

In [6]:
parameters = {'min_confidence': 0.5, 'min_support': 2, 'decimal': -1}

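If we now run the miner with the alternative parameters, the difference of 1 between 'Own funds' and 'Excess' for Insurer 10 falls within the accuracy, so the pattern no longer has exceptions.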

In [7]:
p2_alt = {'name'      : 'equal values', 
          'pattern'   : '=',
          'parameters': parameters}
miner = data_patterns.PatternMiner(p2_alt)
miner.find(df)

index,pattern_id,cluster,P columns,relation type,Q columns,P,relation,Q,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex
0,equal values,0,[Own funds],=,[Excess],,,,10,0,1.0,not defined,{},"df[(df[""Own funds""]) - df[""Excess""] < 1.5e1]","df[(df[""Own funds""]) - df[""Excess""] >= 1.5e1]","""Own funds""= ""Excess""","""Own funds""= ""Excess"""
1,equal values,0,[Excess],=,[Own funds],,,,10,0,1.0,not defined,{},"df[(df[""Excess""]) - df[""Own funds""] < 1.5e1]","df[(df[""Excess""]) - df[""Own funds""] >= 1.5e1]","""Excess""= ""Own funds""","""Excess""= ""Own funds"""
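
The effect of the decimal-parameter can be reproduced with plain pandas. The sketch below is not the package's implementation. It assumes that two values count as equal when their absolute difference is below 1.5 * 10**(-decimal), which is what the thresholds 1.5e-8 and 1.5e1 in the generated code above suggest (with a default of decimal = 8). The helper name count_equal is hypothetical.

```python
import pandas as pd

# Only the two columns involved in the equal-values pattern.
data = pd.DataFrame(
    {"Own funds": [200, 8, 100, 7, 200, 20, 323, 180, 211, 133],
     "Excess":    [200, 8, 100, 7, 200, 20, 323, 180, 211, 134]},
    index=[f"Insurer {i}" for i in range(1, 11)])

def count_equal(frame, left, right, decimal=8):
    """Return (support, exceptions) for the pattern left = right.

    Assumption: the tolerance is 1.5 * 10**(-decimal), inferred from the
    thresholds in the miner output, not taken from the package source.
    """
    tolerance = 1.5 * 10 ** (-decimal)
    is_equal = (frame[left] - frame[right]).abs() < tolerance
    return int(is_equal.sum()), int((~is_equal).sum())

strict = count_equal(data, "Own funds", "Excess")             # default accuracy
loose = count_equal(data, "Own funds", "Excess", decimal=-1)  # tolerance of 15
print(strict, loose)
```

With the default accuracy Insurer 10 is counted as an exception; with decimal = -1 all ten rows confirm the pattern, as in the two tables above.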


### Patterns with a constant value

To find patterns you need to construct a PatternMiner-object and input a pattern definition. Then you can use the find-function. The result is a Pandas DataFrame with the patterns that were found.

Let's find patterns that compare each column with a constant value, here whether the values are greater than or equal to zero.

In [8]:
p1 = {'name'      : 'positive values', 
      'pattern'   : '>=',
      'value'     : 0,
      'parameters': {'min_confidence': 0.5,
                     'min_support'   : 2}}
miner = data_patterns.PatternMiner(p1)
miner.find(df)

index,pattern_id,cluster,P columns,relation type,Q columns,P,relation,Q,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex
0,positive values,0,[Assets],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""Assets""])>= 0]","df[(df[""Assets""])< 0]","""Assets"">= ""0""","""Assets"">= ""0"""
1,positive values,0,[TV-life],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""TV-life""])>= 0]","df[(df[""TV-life""])< 0]","""TV-life"">= ""0""","""TV-life"">= ""0"""
2,positive values,0,[TV-nonlife],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""TV-nonlife""])>= 0]","df[(df[""TV-nonlife""])< 0]","""TV-nonlife"">= ""0""","""TV-nonlife"">= ""0"""
3,positive values,0,[Own funds],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""Own funds""])>= 0]","df[(df[""Own funds""])< 0]","""Own funds"">= ""0""","""Own funds"">= ""0"""
4,positive values,0,[Diversification],>=,0,,,,9,1,0.9,not defined,{},"df[(df[""Diversification""])>= 0]","df[(df[""Diversification""])< 0]","""Diversification"">= ""0""","""Diversification"">= ""0"""
5,positive values,0,[Excess],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""Excess""])>= 0]","df[(df[""Excess""])< 0]","""Excess"">= ""0""","""Excess"">= ""0"""


So we have six patterns, one for each numeric column. Only one of them has an exception: the column 'Diversification' contains one negative value.
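
To see where the numbers in the table come from, the counts can be recomputed by hand. This is a minimal sketch with plain pandas on two of the columns. The helper pattern_statistics is a hypothetical name, and the sketch assumes that confidence is support divided by the number of rows to which the pattern applies, which matches the values shown above.

```python
import pandas as pd

data = pd.DataFrame(
    {"Assets":          [1000, 40, 800, 25, 2100, 907, 7123, 6100, 9011, 1034],
     "Diversification": [12, 9, -1, 8, 12, 7, 5, 14, 19, 1]},
    index=[f"Insurer {i}" for i in range(1, 11)])

def pattern_statistics(confirms):
    """Return (support, exceptions, confidence) for a boolean Series."""
    support = int(confirms.sum())
    exceptions = int((~confirms).sum())
    return support, exceptions, support / (support + exceptions)

# Evaluate the pattern "column >= 0" for every column.
stats = {column: pattern_statistics(data[column] >= 0) for column in data.columns}

# The rows that violate the pattern for 'Diversification'.
negative_rows = list(data.index[data["Diversification"] < 0])
print(stats)
print(negative_rows)
```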

### Sum-patterns

To find sum patterns, you can use

In [9]:
p3 = {'name'   : 'sum pattern',
      'pattern': 'sum',
      'parameters': {"min_confidence": 0.5,
                     "min_support"   : 1}}
miner = data_patterns.PatternMiner(p3)
miner.find(df)

index,pattern_id,cluster,P columns,relation type,Q columns,P,relation,Q,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex
0,sum pattern,0,"[TV-life, Own funds]",sum,[Assets],,,,4,0,1.0,not defined,{},"df[(df[""TV-life""]+ df[""Own funds""]) - df[""Asse...","df[(df[""TV-life""]+ df[""Own funds""]) - df[""Asse...","""TV-life""sum ""Assets""","""TV-life""sum ""Assets"""
1,sum pattern,0,"[TV-life, Excess]",sum,[Assets],,,,4,0,1.0,not defined,{},"df[(df[""TV-life""]+ df[""Excess""]) - df[""Assets""...","df[(df[""TV-life""]+ df[""Excess""]) - df[""Assets""...","""TV-life""sum ""Assets""","""TV-life""sum ""Assets"""
2,sum pattern,0,"[TV-nonlife, Own funds]",sum,[Assets],,,,5,1,0.8333,not defined,{},"df[(df[""TV-nonlife""]+ df[""Own funds""]) - df[""A...","df[(df[""TV-nonlife""]+ df[""Own funds""]) - df[""A...","""TV-nonlife""sum ""Assets""","""TV-nonlife""sum ""Assets"""
3,sum pattern,0,"[TV-nonlife, Excess]",sum,[Assets],,,,5,1,0.8333,not defined,{},"df[(df[""TV-nonlife""]+ df[""Excess""]) - df[""Asse...","df[(df[""TV-nonlife""]+ df[""Excess""]) - df[""Asse...","""TV-nonlife""sum ""Assets""","""TV-nonlife""sum ""Assets"""
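
The support and exception counts of a sum pattern can be checked by hand. The sketch below uses plain pandas and is not the package's implementation. It assumes that a row only counts for a sum pattern when all summands are non-zero, which would explain why the patterns with TV-life have a support of 4 instead of 10. The helper name sum_pattern_statistics is hypothetical.

```python
import pandas as pd

data = pd.DataFrame(
    {"Assets":     [1000, 40, 800, 25, 2100, 907, 7123, 6100, 9011, 1034],
     "TV-life":    [800, 0, 0, 18, 0, 887, 0, 5920, 0, 0],
     "TV-nonlife": [0, 32, 700, 0, 2200, 0, 6800, 0, 8800, 901],
     "Own funds":  [200, 8, 100, 7, 200, 20, 323, 180, 211, 133]},
    index=[f"Insurer {i}" for i in range(1, 11)])

def sum_pattern_statistics(frame, p_columns, q_column):
    """Return (support, exceptions, confidence) for sum(p_columns) = q_column."""
    # Assumption: only rows where every summand is non-zero are considered.
    relevant = frame[(frame[p_columns] != 0).all(axis=1)]
    confirms = relevant[p_columns].sum(axis=1) == relevant[q_column]
    support = int(confirms.sum())
    exceptions = int((~confirms).sum())
    return support, exceptions, round(support / (support + exceptions), 4)

life = sum_pattern_statistics(data, ["TV-life", "Own funds"], "Assets")
nonlife = sum_pattern_statistics(data, ["TV-nonlife", "Own funds"], "Assets")
print(life, nonlife)
```

The single exception for the non-life pattern is Insurer 5, where 2200 + 200 does not equal the reported assets of 2100.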


### Patterns in whether cells are reported or not

Suppose we expect an association between the type of insurer and which columns are reported. For this, we can define a pattern with P columns and Q columns, encode the Q columns as 'reported' or 'not reported', and initialize a PatternMiner-object with this definition.

In [10]:
p4 = {'name'     : 'type pattern',
      'P_columns': ['Type'],
      'Q_columns': ['Assets', 'TV-life', 'TV-nonlife', 'Own funds'],
      'encode'   : {'Assets'    : 'reported',
                    'TV-life'   : 'reported',
                    'TV-nonlife': 'reported',
                    'Own funds' : 'reported'}}
p = data_patterns.PatternMiner(p4)
p.find(df)

index,pattern_id,cluster,P columns,relation type,Q columns,P,relation,Q,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex
0,type pattern,0,[Type],-->,"[Assets, Own funds, TV-life, TV-nonlife]",[life insurer],-->,"[reported, reported, reported, not reported]",4,1,0.8,not defined,"{'Assets': 'reported', 'TV-life': 'reported', ...","df[(df[""Type""]==""life insurer"") & ((data_patte...","df[(df[""Type""]==""life insurer"") & ~((data_patt...","IF (({Type} = ""life insurer"")) THEN (""Assets"" ...","IF (({Type} = ""life insurer"")) THEN (""Assets"" ..."
1,type pattern,0,[Type],-->,"[Assets, Own funds, TV-life, TV-nonlife]",[non-life insurer],-->,"[reported, reported, not reported, reported]",5,0,1.0,not defined,"{'Assets': 'reported', 'TV-life': 'reported', ...","df[(df[""Type""]==""non-life insurer"") & ((data_p...","df[(df[""Type""]==""non-life insurer"") & ~((data_...","IF (({Type} = ""non-life insurer"")) THEN (""Asse...","IF (({Type} = ""non-life insurer"")) THEN (""Asse..."
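
The idea behind this pattern can be sketched with plain pandas. The code below is an illustration, not the package's algorithm. It assumes that a cell counts as 'reported' when it is non-zero and not missing, and it simply takes the most frequent combination of encoded values per insurer type as the pattern. The function reported defined here is a stand-in for the package's encoding of the same name.

```python
from collections import Counter

import pandas as pd

data = pd.DataFrame(
    {"Type": ["life insurer", "non-life insurer", "non-life insurer",
              "life insurer", "non-life insurer", "life insurer",
              "life insurer", "life insurer", "non-life insurer",
              "non-life insurer"],
     "Assets":     [1000, 40, 800, 25, 2100, 907, 7123, 6100, 9011, 1034],
     "TV-life":    [800, 0, 0, 18, 0, 887, 0, 5920, 0, 0],
     "TV-nonlife": [0, 32, 700, 0, 2200, 0, 6800, 0, 8800, 901],
     "Own funds":  [200, 8, 100, 7, 200, 20, 323, 180, 211, 133]},
    index=[f"Insurer {i}" for i in range(1, 11)])
q_columns = ["Assets", "TV-life", "TV-nonlife", "Own funds"]

def reported(series):
    # Assumption: non-zero and not missing means the cell was reported.
    return series.fillna(0).ne(0).map({True: "reported", False: "not reported"})

# Encode every Q column, then count the combinations per insurer type.
encoded = data[q_columns].apply(reported)

results = {}
for insurer_type, group in encoded.groupby(data["Type"]):
    combos = Counter(group.itertuples(index=False, name=None))
    pattern, support = combos.most_common(1)[0]
    exceptions = len(group) - support
    results[insurer_type] = (pattern, support, exceptions, support / len(group))

for insurer_type, result in results.items():
    print(insurer_type, result)
```

The one exception among the life insurers is Insurer 7, which reports TV-nonlife instead of TV-life.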


### Combining patterns 

You can run the miner with a list of pattern definitions.

In [11]:
miner = data_patterns.PatternMiner([p1, p2, p3, p4])
df_patterns = miner.find(df)

In [12]:
df_patterns

index,pattern_id,cluster,P columns,relation type,Q columns,P,relation,Q,support,exceptions,confidence,pattern status,encodings,pandas co,pandas ex,xbrl co,xbrl ex
0,positive values,0,[Assets],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""Assets""])>= 0]","df[(df[""Assets""])< 0]","""Assets"">= ""0""","""Assets"">= ""0"""
1,positive values,0,[TV-life],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""TV-life""])>= 0]","df[(df[""TV-life""])< 0]","""TV-life"">= ""0""","""TV-life"">= ""0"""
2,positive values,0,[TV-nonlife],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""TV-nonlife""])>= 0]","df[(df[""TV-nonlife""])< 0]","""TV-nonlife"">= ""0""","""TV-nonlife"">= ""0"""
3,positive values,0,[Own funds],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""Own funds""])>= 0]","df[(df[""Own funds""])< 0]","""Own funds"">= ""0""","""Own funds"">= ""0"""
4,positive values,0,[Diversification],>=,0,,,,9,1,0.9,not defined,{},"df[(df[""Diversification""])>= 0]","df[(df[""Diversification""])< 0]","""Diversification"">= ""0""","""Diversification"">= ""0"""
5,positive values,0,[Excess],>=,0,,,,10,0,1.0,not defined,{},"df[(df[""Excess""])>= 0]","df[(df[""Excess""])< 0]","""Excess"">= ""0""","""Excess"">= ""0"""
6,equal values,0,[Own funds],=,[Excess],,,,9,1,0.9,not defined,{},"df[(df[""Own funds""]) - df[""Excess""] < 1.5e-8]","df[(df[""Own funds""]) - df[""Excess""] >= 1.5e-8]","""Own funds""= ""Excess""","""Own funds""= ""Excess"""
7,equal values,0,[Excess],=,[Own funds],,,,9,1,0.9,not defined,{},"df[(df[""Excess""]) - df[""Own funds""] < 1.5e-8]","df[(df[""Excess""]) - df[""Own funds""] >= 1.5e-8]","""Excess""= ""Own funds""","""Excess""= ""Own funds"""
8,sum pattern,0,"[TV-life, Own funds]",sum,[Assets],,,,4,0,1.0,not defined,{},"df[(df[""TV-life""]+ df[""Own funds""]) - df[""Asse...","df[(df[""TV-life""]+ df[""Own funds""]) - df[""Asse...","""TV-life""sum ""Assets""","""TV-life""sum ""Assets"""
9,sum pattern,0,"[TV-life, Excess]",sum,[Assets],,,,4,0,1.0,not defined,{},"df[(df[""TV-life""]+ df[""Excess""]) - df[""Assets""...","df[(df[""TV-life""]+ df[""Excess""]) - df[""Assets""...","""TV-life""sum ""Assets""","""TV-life""sum ""Assets"""


### Getting different codings of patterns

Now that we have the patterns, we can transform them to different codings.

The pandas code for the confirmations of the pattern with index 12 is

In [13]:
pattern_text = df_patterns.loc[12, 'pandas co']
print(pattern_text)

df[(df["Type"]=="life insurer") & ((data_patterns.reported(df["Assets"])=="reported") & (data_patterns.reported(df["Own funds"])=="reported") & (data_patterns.reported(df["TV-life"])=="reported") & (data_patterns.reported(df["TV-nonlife"])=="not reported"))]


You can evaluate the Pandas code directly with the eval-function inside Python.

In [14]:
eval(pattern_text, globals(), {'df': df})

Name,Type,Assets,TV-life,TV-nonlife,Own funds,Diversification,Excess
Insurer 1,life insurer,1000,800,0,200,12,200
Insurer 4,life insurer,25,18,0,7,8,7
Insurer 6,life insurer,907,887,0,20,7,20
Insurer 8,life insurer,6100,5920,0,180,14,180
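
The same approach works for the exception code in the 'pandas ex' column. The sketch below evaluates the exception string of the 'positive values' pattern for Diversification, copied from the output shown earlier, on a small DataFrame that holds only that column. Keep in mind that eval executes arbitrary Python code, so it should only be used on pattern strings from a source you trust.

```python
import pandas as pd

data = pd.DataFrame(
    {"Diversification": [12, 9, -1, 8, 12, 7, 5, 14, 19, 1]},
    index=[f"Insurer {i}" for i in range(1, 11)])

# Exception code of the 'positive values' pattern for Diversification.
exception_text = 'df[(df["Diversification"])< 0]'

# The string refers to a variable named df, so bind our DataFrame to that name.
exception_rows = eval(exception_text, {}, {"df": data})
print(exception_rows)
```

Patterns that use an encoding, such as the type pattern above, also refer to data_patterns in their code, so for those the package has to be available in the namespace passed to eval.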


The code for the XBRL-validation of the confirmation of this pattern is

In [15]:
df_patterns.loc[12, 'xbrl co']

'IF (({Type} = "life insurer")) THEN ("Assets" = "reported") and ("Own funds" = "reported") and ("TV-life" = "reported") and ("TV-nonlife" = "not reported")'

### Analyzing results

If you want to know the results of the patterns per insurer then you can use the analyze-function.

In [16]:
df_results = miner.analyze(df)

df_results is a regular pandas DataFrame, so you can use all the usual DataFrame operations on it. For example, you can select all exceptions to the patterns.

In [17]:
df_results[df_results['result_type']==False]

index,result_type,pattern_id,cluster,support,exceptions,confidence,P columns,relation type,Q columns,P,relation,Q,P values,Q values
Insurer 3,False,positive values,0,9,1,0.9,[Diversification],>=,0,,,,[-1],0
Insurer 5,False,sum pattern,0,9,1,0.9,"[TV-nonlife, Own funds]",sum,[Assets],,,,"[2200, 200]",[2100]
Insurer 5,False,sum pattern,0,9,1,0.9,"[TV-nonlife, Excess]",sum,[Assets],,,,"[2200, 200]",[2100]
Insurer 7,False,type pattern,0,4,1,0.8,[Type],-->,"[Assets, Own funds, TV-life, TV-nonlife]",[life insurer],-->,"[reported, reported, reported, not reported]",[life insurer],"[7123, 323, 0, 6800]"
Insurer 10,False,equal values,0,9,1,0.9,[Excess],=,[Own funds],,,,[134],[133]


### Export to and import from Excel

You can export the DataFrame with the patterns with the to_excel-function. This produces an Excel file in a human-readable format.

In [None]:
df_patterns.to_excel(filename = "patterns.xlsx")

You can read the Excel file with the patterns into a PatternMiner-object in the following way.

In [None]:
p = data_patterns.PatternMiner(df_patterns = data_patterns.read_excel(filename = "patterns.xlsx"))

In [None]:
df_patterns = p.update_statistics(df)
df_patterns

## Background

Our approach to pattern mining is somewhat different from traditional association rule mining. Association rules work on a set of items (binary attributes). In the original definition, the items in the set are not linked to column names. However, we often want to find associations between the values of specific columns in a dataset. The pattern mining applied here finds patterns between the values of different columns in a dataset, while using the basic measures of association rule mining, such as support and confidence.
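
The two measures can be written down in a few lines. The sketch below is an illustration with hypothetical function names. It assumes that confidence is the support divided by the sum of support and exceptions, which is consistent with the tables in this tutorial, and that a pattern is kept when it reaches both thresholds. Whether the package compares with >= or > is an assumption here.

```python
def confidence(support, exceptions):
    """Fraction of the applicable rows that confirm a pattern."""
    total = support + exceptions
    if total == 0:
        raise ValueError("the pattern applies to no rows")
    return round(support / total, 4)

def passes_thresholds(support, exceptions, min_support=2, min_confidence=0.5):
    # Assumption: both thresholds are inclusive lower bounds.
    return support >= min_support and confidence(support, exceptions) >= min_confidence

# Values taken from the pattern tables earlier in this tutorial.
print(confidence(9, 1))         # equal values and positive values patterns
print(confidence(5, 1))         # sum pattern for non-life insurers
print(passes_thresholds(4, 1))  # type pattern for life insurers
```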
