In [None]:



Parkinsons Telemonitoring Data Set  

Abstract: Oxford Parkinson's Disease Telemonitoring Dataset

============================================================

Data Set Characteristics:  Multivariate
Attribute Characteristics:  Integer, Real
Associated Tasks:  Regression
Number of Instances:  5875
Number of Attributes:  26
Area:  Life
Date Donated:  2009-10-29

============================================================

SOURCE:

The dataset was created by Athanasios Tsanas (tsanasthanasis '@' gmail.com) 
and Max Little (littlem '@' physics.ox.ac.uk) of the University of Oxford, in 
collaboration with 10 medical centers in the US and Intel Corporation who 
developed the telemonitoring device to record the speech signals. The 
original study used a range of linear and nonlinear regression methods to 
predict the clinician's Parkinson's disease symptom score on the UPDRS scale.


============================================================

DATA SET INFORMATION:

This dataset is composed of a range of biomedical voice measurements from 42 
people with early-stage Parkinson's disease recruited to a six-month trial of 
a telemonitoring device for remote symptom progression monitoring. The 
recordings were automatically captured in the patients' homes.

Columns in the table contain subject number, subject age, subject gender, 
time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 
16 biomedical voice measures. Each row corresponds to one of 5,875 voice 
recordings from these individuals. The main aim of the data is to predict the 
motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 
voice measures.

The data is in ASCII CSV format. The rows of the CSV file contain an instance 
corresponding to one voice recording. There are around 200 recordings per 
patient; the subject number of the patient is identified in the first column. 
For further information or to pass on comments, please contact Athanasios 
Tsanas (tsanasthanasis '@' gmail.com) or Max Little (littlem '@' 
physics.ox.ac.uk).

Further details are contained in the following reference -- if you use this 
dataset, please cite:
Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Lorraine O. Ramig (2009),
'Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests',
IEEE Transactions on Biomedical Engineering (to appear).

Further details about the biomedical voice measures can be found in:
Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009),
'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease',
IEEE Transactions on Biomedical Engineering, 56(4):1015-1022 

 
===========================================================

ATTRIBUTE INFORMATION:

subject# - Integer that uniquely identifies each subject
age - Subject age
sex - Subject gender '0' - male, '1' - female
test_time - Time since recruitment into the trial. The integer part is the 
number of days since recruitment.
motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated
total_UPDRS - Clinician's total UPDRS score, linearly interpolated
Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP - Several measures of 
variation in fundamental frequency
Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA - 
Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation 


===========================================================

RELEVANT PAPERS:

Little MA, McSharry PE, Hunter EJ, Ramig LO (2009),
'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease',
IEEE Transactions on Biomedical Engineering, 56(4):1015-1022

Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM.
'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection',
BioMedical Engineering OnLine 2007, 6:23 (26 June 2007) 

===========================================================

CITATION REQUEST:

If you use this dataset, please cite the following paper:
A Tsanas, MA Little, PE McSharry, LO Ramig (2009)
'Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests',
IEEE Transactions on Biomedical Engineering (to appear). 





In [135]:
import pandas as pd
import numpy as np
import scipy.spatial.distance  # explicit submodule import so scipy.spatial.distance.minkowski resolves

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules


In [3]:



dataFileA=pd.read_csv("parkinsons_updrs.data") #found file
#dataFileB=pd.read_csv("parkinsons_updrs.names") 

print(dataFileA)





      subject#  age  sex  test_time  motor_UPDRS  total_UPDRS  Jitter(%)  \
0            1   72    0     5.6431       28.199       34.398    0.00662   
1            1   72    0    12.6660       28.447       34.894    0.00300   
2            1   72    0    19.6810       28.695       35.389    0.00481   
3            1   72    0    25.6470       28.905       35.810    0.00528   
4            1   72    0    33.6420       29.187       36.375    0.00335   
...        ...  ...  ...        ...          ...          ...        ...   
5870        42   61    0   142.7900       22.485       33.485    0.00406   
5871        42   61    0   149.8400       21.988       32.988    0.00297   
5872        42   61    0   156.8200       21.495       32.495    0.00349   
5873        42   61    0   163.7300       21.007       32.007    0.00281   
5874        42   61    0   170.7300       20.513       31.513    0.00282   

      Jitter(Abs)  Jitter:RAP  Jitter:PPQ5  ...  Shimmer(dB)  Shimmer:APQ3  \
0        

In [82]:
dataFileA_np = np.array(dataFileA)

#x_norm = (x-np.min(x))/(np.max(x)-np.min(x))

#print(dataFileA_np)

truth = dataFileA_np[:,4:5]

print(dataFileA_np[:,4:])

dataFileA_np = dataFileA_np[:,4:]

normalizedA = (dataFileA_np-np.min(dataFileA_np))/(np.max(dataFileA_np)-np.min(dataFileA_np))
print(normalizedA)

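#note that normalizedA isn't meaningful for the first 3 columns
normalizedA_data = normalizedA[:,3:]
# Caution: the next slice keeps columns 0-11 of normalizedA (shape[1]-3 == 12),
# so motor_UPDRS and total_UPDRS (columns 0 and 1) are still part of normalizedA_data.
normalizedA_data = normalizedA[:,:normalizedA_data.shape[1]-3]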

print(normalizedA_data)

[[2.8199e+01 3.4398e+01 6.6200e-03 ... 4.1888e-01 5.4842e-01 1.6006e-01]
 [2.8447e+01 3.4894e+01 3.0000e-03 ... 4.3493e-01 5.6477e-01 1.0810e-01]
 [2.8695e+01 3.5389e+01 4.8100e-03 ... 4.6222e-01 5.4405e-01 2.1014e-01]
 ...
 [2.1495e+01 3.2495e+01 3.4900e-03 ... 4.7792e-01 5.7888e-01 1.4157e-01]
 [2.1007e+01 3.2007e+01 2.8100e-03 ... 5.6865e-01 5.6327e-01 1.4204e-01]
 [2.0513e+01 3.1513e+01 2.8200e-03 ... 5.8608e-01 5.7077e-01 1.5336e-01]]
[[5.12783658e-01 6.25509150e-01 1.20340236e-04 ... 7.61706734e-03
  9.97268280e-03 2.91056438e-03]
 [5.17293405e-01 6.34528644e-01 5.45124768e-05 ... 7.90892799e-03
  1.02699988e-02 1.96569964e-03]
 [5.21803152e-01 6.43529953e-01 8.74263565e-05 ... 8.40518201e-03
  9.89321669e-03 3.82124234e-03]
 ...
 [3.90875011e-01 5.90904115e-01 6.34228641e-05 ... 8.69067809e-03
  1.05265816e-02 2.57433364e-03]
 [3.82000993e-01 5.82030096e-01 5.10574286e-05 ... 1.03405545e-02
  1.02427221e-02 2.58288034e-03]
 [3.73017868e-01 5.73046971e-01 5.12392733e-05 ... 1.065
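
The cell above scales the whole array with a single global minimum and maximum, which is why the voice measures print as values around 1e-4 next to the UPDRS columns. If the commented formula `x_norm = (x-np.min(x))/(np.max(x)-np.min(x))` was meant to apply to each feature separately, a per-column version might look like the sketch below. This is an assumption about intent, not the notebook's method; `min_max_per_column` and the toy values are made up for illustration.

```python
import numpy as np

def min_max_per_column(values):
    """Scale each column of a 2-D array to [0, 1] independently (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    span = values.max(axis=0) - col_min
    # Constant columns have span 0; use 1 so they map to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (values - col_min) / span

# Toy data (made up): a UPDRS-like column next to a jitter-like column.
toy = np.array([[20.0, 0.003],
                [30.0, 0.005],
                [40.0, 0.007]])

# Global scaling, as in the cell above: the small column collapses toward 0.
global_scaled = (toy - toy.min()) / (toy.max() - toy.min())
# Per-column scaling: each column spans the full [0, 1] range.
column_scaled = min_max_per_column(toy)

print(global_scaled)
print(column_scaled)
```

With per-column scaling, the fixed bin edges used later (0.2, 0.4, ...) would split every feature across its own range rather than placing all small-valued features in the lowest bin.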

In [12]:

print(normalizedA_data.shape)

origin = np.zeros(normalizedA_data.shape)
#print(squeeze(normalizedA_data[:,:1].ndim))
#print(origin[:,:1].ndim)
#print(normalizedA_data[:,:1])
#print(origin[:,:1])
#print(normalizedA_data[:,:1].shape)


oneRow = np.asarray(normalizedA_data[:,:1], order='c').squeeze()

print(oneRow.shape)


#print(origin)
print(origin.shape)

# Signature: scipy.spatial.distance.minkowski(u, v, p=2, w=None)

left = np.asarray(normalizedA_data[:,:1], order='c').squeeze()
right = np.asarray(origin[:,:1], order='c').squeeze()

print(scipy.spatial.distance.minkowski(left,right,p=16))
#scipy.spatial.distance


(5875, 12)
(5875,)
(5875, 12)
0.9707154851110325
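
The call above measures the Minkowski distance of order p=16 between the first column and a zero vector. As a reminder of what the order does, here is a small pure-Python sketch of the same formula, (sum |u_i - v_i|^p)^(1/p). The helper name `minkowski_distance` and the toy vectors are illustrative only.

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

u = [3.0, 4.0]
v = [0.0, 0.0]

d1 = minkowski_distance(u, v, 1)    # Manhattan distance
d2 = minkowski_distance(u, v, 2)    # Euclidean distance
d16 = minkowski_distance(u, v, 16)  # dominated by the largest component

print(d1, d2, d16)
```

As p grows, the result approaches the largest absolute component difference (the Chebyshev distance), so with p=16 the value is driven mostly by the largest entries of the column.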


In [72]:
# Clustering: interested in seeing how people use clustering

#WAYS OF ENCODING THE DATA
#
# Equal size categories
# Low (<=0.2), Medium-Low (>0.2, <=0.4), Medium (>0.4, <=0.6), Medium-High (>0.6, <=0.8), High (>0.8)
#
# Normalized distribution based
# Low (<=0.024), Medium-Low (>0.024, <=0.1583), Medium (>0.1583, <=0.8409), Medium-High (>0.8409, <=0.9806), High (>0.9806)
#
# Extreme magnitudes
# Outlier (<=0.024 or >=0.976), Different (<=0.1583 or >=0.8417), Central (everything else)

#34.13 ... 68.26

#med (31.73%)
#interims (95.45 - 68.27) / 2 = 13.59 each side
#outliers ( 99.73 - 95.45 ) / 2 = 2.14 each side

dataAsEqualSize = [''] * normalizedA_data.shape[0]
print(len(dataAsEqualSize))

#Equal size categories
#dataAsEqualSize = normalizedA_data
#dataAsEqualSize = np.chararray((normalizedA_data.shape[0], normalizedA_data.shape[1]))
#print(dataAsEqualSize)
# Low (<=0.2), Medium-Low (>0.2, <=0.4), Medium (>0.4, <=0.6), Medium-High (>0.6, <=0.8), High (>0.8)

for iter in range(0, normalizedA_data.shape[0]): 
    dataAsEqualSize[iter] = [''] * normalizedA_data.shape[1]
    #print(dataAsEqualSize[iter])
    for innerIter in range(0, normalizedA_data.shape[1]):
        #print(str(iter) + " ....... " +  str(innerIter))
        
        if normalizedA_data[iter][innerIter] <= 0.2 :
            #print("HERE")
            #print(str(iter) + " ....... " +  str(innerIter))
            #print(len(dataAsEqualSize))

            dataAsEqualSize[iter][innerIter] = "LOW" + str(innerIter)
        elif normalizedA_data[iter][innerIter] > 0.2 and normalizedA_data[iter][innerIter] <= 0.4:
            dataAsEqualSize[iter][innerIter] = "MEDLOW" + str(innerIter)
        elif normalizedA_data[iter][innerIter] > 0.4 and normalizedA_data[iter][innerIter] <= 0.6:
            dataAsEqualSize[iter][innerIter] = "MED" + str(innerIter)
        elif normalizedA_data[iter][innerIter] > 0.6 and normalizedA_data[iter][innerIter] <= 0.8:
            dataAsEqualSize[iter][innerIter] = "MEDHIGH" + str(innerIter)
        elif normalizedA_data[iter][innerIter] >= 0.8 :
            dataAsEqualSize[iter][innerIter] = "HIGH" + str(innerIter)

#print(dataAsEqualSize)

5875
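
The nested loops above can also be written with `np.digitize`. The sketch below assumes the same four edges and the same `<=` comparisons as the loop; `encode_equal_width`, `EDGES`, `LABELS`, and the toy array are illustrative names, not part of the notebook.

```python
import numpy as np

LABELS = ["LOW", "MEDLOW", "MED", "MEDHIGH", "HIGH"]
EDGES = [0.2, 0.4, 0.6, 0.8]

def encode_equal_width(data):
    """Label each value by its bin and append the column index, e.g. 'MED3'."""
    data = np.asarray(data, dtype=float)
    # right=True makes the bins (-inf, 0.2], (0.2, 0.4], ..., (0.8, inf),
    # which matches the <= tests in the loop version.
    bins = np.digitize(data, EDGES, right=True)
    return [[f"{LABELS[b]}{col}" for col, b in enumerate(row)] for row in bins]

# Toy values (made up), already scaled to [0, 1].
toy = [[0.05, 0.20, 0.81],
       [0.40, 0.41, 1.00]]

encoded = encode_equal_width(toy)
print(encoded)
```

Swapping `EDGES` for `[0.024, 0.1583, 0.8409, 0.9806]` would give the normal-distribution-based categories under the same assumptions.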


In [73]:
#Normalized distribution size categories
# Low (<=0.024), Medium-Low (>0.024, <=0.1583), Medium (>0.1583, <=0.8409), Medium-High (>0.8409, <=0.9806), High (>0.9806)
normalDistribution = [''] * normalizedA_data.shape[0]

for iter in range(0, normalizedA_data.shape[0]): 
    normalDistribution[iter] = [''] * normalizedA_data.shape[1]
    for innerIter in range(0, normalizedA_data.shape[1]):
        if normalizedA_data[iter][innerIter] <= 0.024 :
            normalDistribution[iter][innerIter] = "LOW" + str(innerIter)
        elif normalizedA_data[iter][innerIter] > 0.024 and normalizedA_data[iter][innerIter] <= 0.1583:
            normalDistribution[iter][innerIter] = "MEDLOW" + str(innerIter)
        elif normalizedA_data[iter][innerIter] > 0.1583 and normalizedA_data[iter][innerIter] <= 0.8409:
            normalDistribution[iter][innerIter] = "MED" + str(innerIter)
        elif normalizedA_data[iter][innerIter] > 0.8409 and normalizedA_data[iter][innerIter] <= 0.9806:
            normalDistribution[iter][innerIter] = "MEDHIGH" + str(innerIter)
        elif normalizedA_data[iter][innerIter] >= 0.9806 :
            normalDistribution[iter][innerIter] = "HIGH" + str(innerIter)




#print(normalDistribution)
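
The cut points 0.024, 0.1583, 0.8409, and 0.9806 sit close to the cumulative probabilities of a standard normal distribution at -2, -1, +1, and +2 standard deviations (roughly 0.023, 0.159, 0.841, and 0.977). Assuming that is where they come from, the reference values can be computed with the standard library as sketched below; `normal_cdf` is an illustrative helper. Note that these are percentiles of a normal distribution, so using them as thresholds on min-max scaled values only lines up with the intended proportions if the scaled values happen to be spread that way.

```python
import math

def normal_cdf(z):
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Cumulative probabilities at -2, -1, +1 and +2 standard deviations.
cut_points = {z: normal_cdf(z) for z in (-2, -1, 1, 2)}

for z, p in cut_points.items():
    print(f"z = {z:+d}: cumulative probability = {p:.4f}")

# Share of the distribution in each tail beyond two standard deviations.
tail_beyond_2_sigma = normal_cdf(-2)
print(f"each tail beyond 2 sigma: {tail_beyond_2_sigma:.4f}")
```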


In [75]:
#Extreme magnitude categories
# Outlier (<=0.024 or >=0.976), Different (<=0.1583 or >=0.8417), Central (everything else)
extremeMagnitude = [''] * normalizedA_data.shape[0]

for iter in range(0, normalizedA_data.shape[0]): 
    extremeMagnitude[iter] = [''] * normalizedA_data.shape[1]
    for innerIter in range(0, normalizedA_data.shape[1]):        
        if normalizedA_data[iter][innerIter] <= 0.024 or normalizedA_data[iter][innerIter] >= 0.976:
            extremeMagnitude[iter][innerIter] = "EXTR" + str(innerIter)
        elif normalizedA_data[iter][innerIter] <= 0.1583 or normalizedA_data[iter][innerIter] >= 0.8417:
            extremeMagnitude[iter][innerIter] = "DIFF" + str(innerIter)
        else:
            extremeMagnitude[iter][innerIter] = "MED" + str(innerIter)
            

#print(extremeMagnitude)

In [115]:
#print(dataAsEqualSize)


In [97]:
#print(normalDistribution)


In [95]:
#print(extremeMagnitude)


In [111]:
#NOW ENCODE THE GROUND TRUTH THE SAME WAY AS THE VARIOUS DATA ENCODINGS

# Ground truth: motor_UPDRS and total_UPDRS (columns 4 and 5 of the raw table),
# min-max scaled together using the minimum and maximum over both columns.
rawTruth = np.array(dataFileA)[:,4:6]

normalizedTruth = (rawTruth-np.min(rawTruth))/(np.max(rawTruth)-np.min(rawTruth))
print(normalizedTruth)



[[0.46364978 0.5877432 ]
 [0.46861431 0.59767227]
 [0.47357885 0.60758133]
 ...
 [0.32944711 0.54964838]
 [0.31967819 0.53987945]
 [0.30978915 0.52999041]]


In [118]:
normalizedTruthAsEqualSize = [''] * normalizedTruth.shape[0]

for iter in range(0, normalizedTruth.shape[0]): 
    normalizedTruthAsEqualSize[iter] = [''] * normalizedTruth.shape[1]

    for innerIter in range(0, normalizedTruth.shape[1]):
        
        if normalizedTruth[iter][innerIter] <= 0.2 :

            normalizedTruthAsEqualSize[iter][innerIter] = "LOWT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] > 0.2 and normalizedTruth[iter][innerIter] <= 0.4:
            normalizedTruthAsEqualSize[iter][innerIter] = "MEDLOWT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] > 0.4 and normalizedTruth[iter][innerIter] <= 0.6:
            normalizedTruthAsEqualSize[iter][innerIter] = "MEDT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] > 0.6 and normalizedTruth[iter][innerIter] <= 0.8:
            normalizedTruthAsEqualSize[iter][innerIter] = "MEDHIGHT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] >= 0.8 :
            normalizedTruthAsEqualSize[iter][innerIter] = "HIGHT" + str(innerIter)

print(normalizedTruthAsEqualSize)

[['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDHIGHT0', 'HIGHT1'], ['MEDHIGHT0', 'HIGHT1'], ['MEDHIGHT0', 'HIGHT1'], ['MEDHIGHT0', 'HIGHT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['M

In [120]:
normalizedTruthAsNormalDistribution = [''] * normalizedTruth.shape[0]

for iter in range(0, normalizedTruth.shape[0]): 
    normalizedTruthAsNormalDistribution[iter] = [''] * normalizedTruth.shape[1]

    for innerIter in range(0, normalizedTruth.shape[1]):
        
        if normalizedTruth[iter][innerIter] <= 0.024 :

            normalizedTruthAsNormalDistribution[iter][innerIter] = "LOWT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] > 0.024 and normalizedTruth[iter][innerIter] <= 0.1583:
            normalizedTruthAsNormalDistribution[iter][innerIter] = "MEDLOWT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] > 0.1583 and normalizedTruth[iter][innerIter] <= 0.8409:
            normalizedTruthAsNormalDistribution[iter][innerIter] = "MEDT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] > 0.8409 and normalizedTruth[iter][innerIter] <= 0.9806:
            normalizedTruthAsNormalDistribution[iter][innerIter] = "MEDHIGHT" + str(innerIter)
        elif normalizedTruth[iter][innerIter] >= 0.9806 :
            normalizedTruthAsNormalDistribution[iter][innerIter] = "HIGHT" + str(innerIter)

print(normalizedTruthAsNormalDistribution)



[['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0', 'MEDHIGHT1'], ['MEDT0

In [125]:
#Extreme magnitude categories
# Outlier (<=0.024 or >=0.976), Different (<=0.1583 or >=0.8417), Central (everything else)
normalizedTruthAsExtremeMagnitude = [''] * normalizedTruth.shape[0]

for iter in range(0, normalizedTruth.shape[0]): 
    normalizedTruthAsExtremeMagnitude[iter] = [''] * normalizedTruth.shape[1]
    for innerIter in range(0, normalizedTruth.shape[1]):        
        if normalizedTruth[iter][innerIter] <= 0.024 or normalizedTruth[iter][innerIter] >= 0.976:
            normalizedTruthAsExtremeMagnitude[iter][innerIter] = "EXTR" + str(innerIter)
        elif normalizedTruth[iter][innerIter] <= 0.1583 or normalizedTruth[iter][innerIter] >= 0.8417:
            normalizedTruthAsExtremeMagnitude[iter][innerIter] = "DIFF" + str(innerIter)
        else:
            normalizedTruthAsExtremeMagnitude[iter][innerIter] = "MED" + str(innerIter)

print(normalizedTruthAsExtremeMagnitude)


[['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'DIFF1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'DIFF1'], ['MED0', 'DIFF1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0', 'MED1'], ['MED0

In [159]:
# SO NOW WE HAVE TWO SETS OF DATA: ONE MEASURED BY MACHINES AND ONE MEASURED BY HUMANS
# EACH SET OF DATA IS ENCODED IN THREE WAYS:
#
# Equal size categories
# Low (<=0.2), Medium-Low (>0.2, <=0.4), Medium (>0.4, <=0.6), Medium-High (>0.6, <=0.8), High (>0.8)
#
# Normalized distribution based
# Low (<=0.024), Medium-Low (>0.024, <=0.1583), Medium (>0.1583, <=0.8409), Medium-High (>0.8409, <=0.9806), High (>0.9806)
#
# Extreme magnitudes
# Outlier (<=0.024 or >=0.976), Different (<=0.1583 or >=0.8417), Central (everything else)
#

#TRUTH (desired future purchase in the transaction model):
#normalizedTruthAsEqualSize,    normalizedTruthAsNormalDistribution,   normalizedTruthAsExtremeMagnitude

#MACHINES OBSERVED (items purchased in the transaction model):
#dataAsEqualSize,    normalDistribution,      extremeMagnitude




In [160]:


#ADDITIONAL DATA POINTS THAT CAN BE USED
#Different frequency calculations (fpgrowth, fpmax, apriori)
#min_support 


dataset = np.concatenate((normalizedTruthAsEqualSize,dataAsEqualSize), axis=1)
#print(dataset)

#EXAMPLES TAKEN FROM 
#https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Without use_colnames=True, the itemsets are reported as column indices of df.
frequent_itemsets = fpgrowth(df, min_support=0.6)
### alternatively:
#frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
#frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)

print(frequent_itemsets)

print("HERE")

#association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

#rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules)


      support                            itemsets
0         1.0                                (13)
1         1.0                                (12)
2         1.0                                (11)
3         1.0                                (10)
4         1.0                                 (9)
...       ...                                 ...
1018      1.0      (4, 5, 6, 7, 8, 9, 11, 12, 13)
1019      1.0      (4, 5, 6, 7, 8, 9, 10, 12, 13)
1020      1.0      (4, 5, 6, 7, 8, 9, 10, 11, 13)
1021      1.0      (4, 5, 6, 7, 8, 9, 10, 11, 12)
1022      1.0  (4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

[1023 rows x 2 columns]
HERE
      antecedents                      consequents  antecedent support  \
0            (12)                             (13)                 1.0   
1            (13)                             (12)                 1.0   
2            (11)                             (13)                 1.0   
3            (13)                             (11)                 1.0   
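
In the output above every itemset has support 1.0, and the 1023 rows equal 2^10 - 1, which is consistent with ten items appearing in every single transaction. Rules built from such items are uninformative, because their confidence is 1.0 no matter what sits on the other side. To make the two measures concrete, here is a small pure-Python sketch of support and confidence on made-up transactions. The helper names and the toy item labels are illustrative and do not use mlxtend.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transactions (made up): one truth label plus two encoded measurements per row.
toy = [
    ["MEDT0", "LOW2", "LOW3"],
    ["MEDT0", "LOW2", "MED3"],
    ["HIGHT0", "LOW2", "MED3"],
    ["MEDT0", "LOW2", "LOW3"],
]

s_always = support(toy, ["LOW2"])                     # present in every row
s_pair = support(toy, ["MEDT0", "LOW3"])              # present in two of four rows
c_specific = confidence(toy, ["LOW3"], ["MEDT0"])     # LOW3 always comes with MEDT0
c_always = confidence(toy, ["LOW2"], ["MEDT0"])       # just the base rate of MEDT0

print(s_always, s_pair, c_specific, c_always)
```

A rule whose antecedent occurs in every transaction, like `LOW2` here, has a confidence equal to the base rate of the consequent, so it says nothing about how the measurement relates to the truth label.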
