# Mining association rules

## Dataset

We‘ll use "201707-citibike-tripdata.csv.zip" (after preprocessed in HW0)

## Schema

- Every station’s information
    - id, name, lat, lng
- Every stations’ flow data
    - id, time, in-flow, out-flow

### Import packages

In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.plotly as py
import os
import time
from plotly.graph_objs import *
from sklearn.metrics.pairwise import euclidean_distances
%matplotlib inline

### Read csv to dataframe
use pandas to read data

In [75]:
# preprocessed dataset
df = pd.read_csv('./201707-citibike-tripdata-preprocessed.csv')
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,364,2017-07-01 00:00:00,2017-07-01 00:06:05,539,Metropolitan Ave & Bedford Ave,40.715348,-73.960241,3107,Bedford Ave & Nassau Ave,40.723117,-73.952123,14744,Subscriber,1986.0,1
1,2142,2017-07-01 00:00:03,2017-07-01 00:35:46,293,Lafayette St & E 8 St,40.730207,-73.991026,3425,2 Ave & E 104 St,40.78921,-73.943708,19587,Subscriber,1981.0,1
2,328,2017-07-01 00:00:08,2017-07-01 00:05:37,3242,Schermerhorn St & Court St,40.691029,-73.991834,3397,Court St & Nelson St,40.676395,-73.998699,27937,Subscriber,1984.0,2
3,2530,2017-07-01 00:00:11,2017-07-01 00:42:22,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,398,Atlantic Ave & Furman St,40.691652,-73.999979,26066,Subscriber,1985.0,1
4,2534,2017-07-01 00:00:15,2017-07-01 00:42:29,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,398,Atlantic Ave & Furman St,40.691652,-73.999979,29408,Subscriber,1982.0,2


In [76]:
# every station's information
station_info = pd.read_csv('./station_info.csv')
station_info.head()

Unnamed: 0,station id,station name,station latitude,station logitude
0,539,Metropolitan Ave & Bedford Ave,40.715348,-73.960241
1,293,Lafayette St & E 8 St,40.730207,-73.991026
2,3242,Schermerhorn St & Court St,40.691029,-73.991834
3,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198
4,361,Allen St & Hester St,40.716059,-73.991908


In [77]:
# every station's in-flow data
station_in_flow = pd.read_csv('./in_flow.csv')
station_in_flow.head()

Unnamed: 0,72,79,82,83,116,119,120,127,128,143,...,2003,2005,2006,2008,2009,2010,2012,2021,2022,2023
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,3.0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,...,1.0,0.0,0.0,2.0,0.0,1.0,0.0,2.0,0.0,1.0
2,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [78]:
# every station's out-flow data
station_out_flow = pd.read_csv('./out_flow.csv')
station_out_flow.head()

Unnamed: 0,72,79,82,83,116,119,120,127,128,143,...,2003,2005,2006,2008,2009,2010,2012,2021,2022,2023
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3.0,0.0,...,0.0,0.0,1.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,1.0,0.0,2.0,0.0,0.0,2.0,0.0,2.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,0.0,2.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


## Algorithm

[Apriori](https://github.com/asaini/Apriori)

It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

The code attempts to implement the following paper

> Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/8eed75c18217fe2f9b15f266c40b369ce038164d)

[FP-Growth](https://github.com/enaeseth/python-fp-growth)

This module provides a pure Python implementation of the FP-growth algorithm for finding frequent itemsets. FP-growth exploits an (often-valid) assumption that many transactions will have items in common to build a prefix tree. If the assumption holds true, this tree produces a compact representation of the actual transactions and is used to generate itemsets much faster than Apriori can.

![](http://doi.ieeecomputersociety.org/cms/Computer.org/dl/trans/td/2011/09/figures/ttd2011091497x4.gif)


### Algorithm api

There are floating point inaccuracy in support and confidence, so the outcone of Apriori and FP Growth could be different

**Apriori and FP Growth both calculate frequency item sets and the rules are the same, so we output rules only in Apriori**

In [79]:
def Apriori(df, support, confidence):
    print "---------- Apriori ----------"
    df.to_csv('tx.csv', index = False)
    s = time.time()
    print os.popen('python {}/Apriori/apriori.py -f {} -s {} -c {}'.format(os.getcwd(), 'tx.csv', support, confidence)).read()
    t = time.time()
    print 'total run {} sec'.format(t - s)
    print '\n\n'
    
def FP_Growth(df, support):
    print "---------- FP Growth ----------"
    df.to_csv('tx.csv', index = False)
    support = int(len(df.index) * (support))
    s = time.time()
    print os.popen('python -m {}/python-fp-growth/fp_growth -s {} {}'.format(os.getcwd(), support, 'tx.csv')).read()
    t = time.time()
    print 'total run {} sec'.format(t - s)

## Transaction

### First transaction

- in-flow and out-flow for station_id = 519

Distanguish in_flow and out_flow by negtive station_out_flow because (1, 0) and (0, 1) would be the same when apply algorithm

#### Use "divided 10" as discretization method

In [80]:
tx = pd.DataFrame([station_in_flow['519'] / 10, -station_out_flow['519'] / 10]).astype(int)

tx = tx.T
tx.columns = ['in flow', 'out flow']
tx.head()

Unnamed: 0,in flow,out flow
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0


In [81]:
tx.describe()

Unnamed: 0,in flow,out flow
count,1488.0,1488.0
mean,0.642473,-0.674731
std,1.307419,1.41957
min,0.0,-11.0
25%,0.0,-1.0
50%,0.0,0.0
75%,1.0,0.0
max,9.0,0.0


#### Test the support and confidence

I think the confidence should be higher 0.5 because it will show in half frequent item sets. It makes sense!

- support 0.1 is too high, so there is only one item in each frequency item sets

In [82]:
Apriori(tx, 0.1, 0.5)
FP_Growth(tx, 0.1)

---------- Apriori ----------
item: ('1',) , 0.156
item: ('-1',) , 0.156
item: ('0',) , 0.770

------------------------ RULES:

total run 0.0651249885559 sec



---------- FP Growth ----------
['-1'] 233
['0'] 2079
['1'] 232

total run 0.0578179359436 sec


- support 0.01 looks like more rules

And the rules tell me the station 519 is high transportation because the in and out flow are high and balance

In [83]:
Apriori(tx, 0.01, 0.5)
FP_Growth(tx, 0.01)

---------- Apriori ----------
item: ('3', '-2') , 0.010
item: ('6',) , 0.011
item: ('2', '-1') , 0.011
item: ('-6',) , 0.011
item: ('-4', '4') , 0.011
item: ('1', '-2') , 0.013
item: ('5',) , 0.014
item: ('4',) , 0.018
item: ('-4',) , 0.021
item: ('3', '-3') , 0.023
item: ('2', '-2') , 0.029
item: ('-3',) , 0.038
item: ('3',) , 0.041
item: ('-2',) , 0.054
item: ('2',) , 0.054
item: ('1', '0') , 0.068
item: ('0', '-1') , 0.071
item: ('1', '-1') , 0.074
item: ('1',) , 0.156
item: ('-1',) , 0.156
item: ('0',) , 0.770

------------------------ RULES:
Rule: ('-4',) ==> ('4',) , 0.531
Rule: ('2',) ==> ('-2',) , 0.537
Rule: ('-2',) ==> ('2',) , 0.537
Rule: ('3',) ==> ('-3',) , 0.557
Rule: ('-3',) ==> ('3',) , 0.607
Rule: ('4',) ==> ('-4',) , 0.630

total run 0.0633361339569 sec



---------- FP Growth ----------
['-1'] 233
['-1', '1'] 110
['-1', '2'] 16
['-2'] 80
['-2', '3'] 15
['-3'] 56
['-4'] 32
['-4', '4'] 17
['-6'] 17
['0'] 2079
['0', '-1'] 106
['0', '1'] 101
['1'] 232
['1', '-2'] 20
['2'

#### Use "divided 20" as discretization method

In [84]:
tx = pd.DataFrame([station_in_flow['519'] / 20, -station_out_flow['519'] / 20]).astype(int)

tx = tx.T
tx.columns = ['in flow', 'out flow']
tx.head()

Unnamed: 0,in flow,out flow
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0


In [85]:
tx.describe()

Unnamed: 0,in flow,out flow
count,1488.0,1488.0
mean,0.212366,-0.230511
std,0.583852,0.64776
min,0.0,-5.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,4.0,0.0


#### Test the support and confidence

I think the confidence should be higher 0.5 because it will show in half frequent item sets. It makes sense!

- test support 0.02 and the rules are almost the same as the "divided 10" discretization method

In [86]:
Apriori(tx, 0.02, 0.5)
FP_Growth(tx, 0.02)

---------- Apriori ----------
item: ('-3',) , 0.020
item: ('2', '-2') , 0.021
item: ('-2',) , 0.030
item: ('2',) , 0.032
item: ('1', '-1') , 0.071
item: ('-1',) , 0.091
item: ('1',) , 0.095
item: ('0',) , 0.869

------------------------ RULES:
Rule: ('2',) ==> ('-2',) , 0.646
Rule: ('-2',) ==> ('2',) , 0.705
Rule: ('1',) ==> ('-1',) , 0.752
Rule: ('-1',) ==> ('1',) , 0.779

total run 0.0844600200653 sec



---------- FP Growth ----------
['-1'] 136
['-2'] 44
['-3'] 30
['0'] 2544
['1'] 141
['1', '-1'] 106
['2'] 48
['2', '-2'] 31

total run 0.0605289936066 sec


#### Top 3 rules

- ('2',) ==> ('-2',)
- ('3',) ==> ('-3',)
- ('4',) ==> ('-4',)

sort by the confidence

#### Observation

According to the result above, we adjust support to 0.01 and the rules generated. But the support is too low, it is meaningless
We can join more features so that the mining rules will be more meaningful and the about the algorithm runtime are a little bit weird. I think it's implement could be complex, so the overhead take too long time when dataset is not big enough

### Second transaction

- the station latitude, longitude and the sum of in_flow and out_flow

Distanguish the sum of in_flow and out_flow by negtive it because the rules would make confuse when apply algorithm

In [87]:
pos = station_info.sort_values(by = 'station id')
pos = pos[['station id', 'station latitude', 'station logitude']].reset_index(level = 0, drop = True)
station_flow = pd.DataFrame(station_in_flow.sum() + station_out_flow.sum())
station_flow.columns = ['flow']
pos['flow'] = -station_flow.values
pos.head()

Unnamed: 0,station id,station latitude,station logitude,flow
0,72,40.767272,-73.993929,-8040.0
1,79,40.719116,-74.006667,-5862.0
2,82,40.711174,-74.000165,-2656.0
3,83,40.683826,-73.976323,-3571.0
4,116,40.741776,-74.001497,-8090.0


In [88]:
pos.describe()

Unnamed: 0,station id,station latitude,station logitude,flow
count,634.0,634.0,634.0,634.0
mean,1938.20347,40.727318,-73.979194,-5472.955836
std,1432.441147,0.0363,0.021919,4576.306856
min,72.0,40.6554,-74.066921,-29318.0
25%,396.25,40.696093,-73.994128,-8030.5
50%,3059.5,40.724055,-73.981324,-4074.0
75%,3300.75,40.754918,-73.961923,-2041.25
max,3478.0,40.880921,-73.896602,-1.0


#### Discretization method

- Latitude 1 degree is about 110 km
- The average distance of stations is about 5.438 km by hw0

Discretize the latitude by unconditionally rounding to the first decimal place

- Longitude 1 degree is about 85 km in latitude 40 degree
- The average distance of stations is about 5.438 km by hw0

Discretize the latitude by unconditionally rounding to the first decimal place

Mean and std are about thousands, so I discretize flow by divided 1000

In [89]:
tx = pd.DataFrame(pos['flow'] / 1000).astype(int)
tx['station latitude'] = pos['station latitude'].round(1)
tx['station longitude'] = pos['station logitude'].round(1)
tx.head()

Unnamed: 0,flow,station latitude,station longitude
0,-8,40.8,-74.0
1,-5,40.7,-74.0
2,-2,40.7,-74.0
3,-3,40.7,-74.0
4,-8,40.7,-74.0


#### Test the support and confidence

I think the confidence should be higher 0.5 because it will show in half frequent item sets. It makes sense!

- When support is 0.1, the size of frequent item sets is two at most

let's try 0.09 and see what's happen

In [90]:
Apriori(tx, 0.1, 0.5)
FP_Growth(tx, 0.1)

---------- Apriori ----------
item: ('40.7', '-2') , 0.102
item: ('-1', '40.7') , 0.113
item: ('-3',) , 0.115
item: ('-73.9',) , 0.117
item: ('-74.0', '-1') , 0.121
item: ('-74.0', '-2') , 0.124
item: ('-2',) , 0.140
item: ('-1',) , 0.153
item: ('40.8', '-74.0') , 0.252
item: ('40.8',) , 0.288
item: ('-74.0', '40.7') , 0.627
item: ('40.7',) , 0.709
item: ('-74.0',) , 0.879

------------------------ RULES:
Rule: ('-74.0',) ==> ('40.7',) , 0.713
Rule: ('-2',) ==> ('40.7',) , 0.730
Rule: ('-1',) ==> ('40.7',) , 0.742
Rule: ('-1',) ==> ('-74.0',) , 0.794
Rule: ('40.8',) ==> ('-74.0',) , 0.874
Rule: ('40.7',) ==> ('-74.0',) , 0.884
Rule: ('-2',) ==> ('-74.0',) , 0.888

total run 0.0574560165405 sec



---------- FP Growth ----------
['-1'] 97
['-2'] 89
['-3'] 73
['-73.9'] 74
['-74.0'] 558
['-74.0', '-1'] 77
['-74.0', '-2'] 79
['-74.0', '40.7'] 398
['-74.0', '40.8'] 160
['40.7'] 450
['40.7', '-1'] 72
['40.7', '-2'] 65
['40.8'] 183

total run 0.064404964447 sec


- Some rules look like more meaningful and we can find the place seeing what special there

In [91]:
Apriori(tx, 0.09, 0.5)
FP_Growth(tx, 0.09)

---------- Apriori ----------
item: ('-74.0', '40.7', '-2') , 0.096
item: ('40.7', '-2') , 0.102
item: ('-1', '40.7') , 0.113
item: ('-3',) , 0.115
item: ('-73.9',) , 0.117
item: ('-74.0', '-1') , 0.121
item: ('-74.0', '-2') , 0.124
item: ('-2',) , 0.140
item: ('-1',) , 0.153
item: ('40.8', '-74.0') , 0.252
item: ('40.8',) , 0.288
item: ('-74.0', '40.7') , 0.627
item: ('40.7',) , 0.709
item: ('-74.0',) , 0.879

------------------------ RULES:
Rule: ('-2',) ==> ('-74.0', '40.7') , 0.685
Rule: ('-74.0',) ==> ('40.7',) , 0.713
Rule: ('-2',) ==> ('40.7',) , 0.730
Rule: ('-1',) ==> ('40.7',) , 0.742
Rule: ('-74.0', '-2') ==> ('40.7',) , 0.772
Rule: ('-1',) ==> ('-74.0',) , 0.794
Rule: ('40.8',) ==> ('-74.0',) , 0.874
Rule: ('40.7',) ==> ('-74.0',) , 0.884
Rule: ('-2',) ==> ('-74.0',) , 0.888
Rule: ('40.7', '-2') ==> ('-74.0',) , 0.938

total run 0.0606632232666 sec



---------- FP Growth ----------
['-1'] 97
['-2'] 89
['-3'] 73
['-4'] 57
['-73.9'] 74
['-74.0'] 558
['-74.0', '-1'] 77
['-74.

- Latitude 1 degree is about 110 km
- The average distance of stations is about 5.438 km by hw0

Discretize the latitude by unconditionally rounding to the first decimal place

- Longitude 1 degree is about 85 km in latitude 40 degree
- The average distance of stations is about 5.438 km by hw0

Discretize the latitude by unconditionally rounding to the first decimal place

This time I discretize flow by standardizing the value and rounding to the first decimal place, and compare the difference with above result

In [92]:
tx = pd.DataFrame(((pos['flow'] - pos['flow'].mean()) / pos['flow'].std()).round(1))
tx['station latitude'] = pos['station latitude'].round(1)
tx['station longitude'] = pos['station logitude'].round(1)
tx.head()

Unnamed: 0,flow,station latitude,station longitude
0,-0.6,40.8,-74.0
1,-0.1,40.7,-74.0
2,0.6,40.7,-74.0
3,0.4,40.7,-74.0
4,-0.6,40.7,-74.0


#### Test the support and confidence

I think the confidence should be higher 0.5 because it will show in half frequent item sets. It makes sense!

- When support is 0.1, the size of frequent item sets is two at most

let's try 0.09 and see what's happen

In [93]:
Apriori(tx, 0.1, 0.5)
FP_Growth(tx, 0.1)

---------- Apriori ----------
item: ('-73.9',) , 0.117
item: ('40.8', '-74.0') , 0.252
item: ('40.8',) , 0.288
item: ('-74.0', '40.7') , 0.627
item: ('40.7',) , 0.709
item: ('-74.0',) , 0.879

------------------------ RULES:
Rule: ('-74.0',) ==> ('40.7',) , 0.713
Rule: ('40.8',) ==> ('-74.0',) , 0.874
Rule: ('40.7',) ==> ('-74.0',) , 0.884

total run 0.0640709400177 sec



---------- FP Growth ----------
['-73.9'] 74
['-74.0'] 558
['-74.0', '40.7'] 398
['-74.0', '40.8'] 160
['40.7'] 450
['40.8'] 183

total run 0.0431699752808 sec


- Some rules look like telling me where the most stations are

In [94]:
Apriori(tx, 0.09, 0.5)
FP_Growth(tx, 0.09)

---------- Apriori ----------
item: ('-73.9',) , 0.117
item: ('40.8', '-74.0') , 0.252
item: ('40.8',) , 0.288
item: ('-74.0', '40.7') , 0.627
item: ('40.7',) , 0.709
item: ('-74.0',) , 0.879

------------------------ RULES:
Rule: ('-74.0',) ==> ('40.7',) , 0.713
Rule: ('40.8',) ==> ('-74.0',) , 0.874
Rule: ('40.7',) ==> ('-74.0',) , 0.884

total run 0.0615749359131 sec



---------- FP Growth ----------
['-73.9'] 74
['-74.0'] 558
['-74.0', '40.7'] 398
['-74.0', '40.8'] 160
['40.7'] 450
['40.8'] 183

total run 0.0456840991974 sec


#### Top 3 rules

- ('40.7',) ==> ('-74.0',)
- ('40.8',) ==> ('-74.0',)
- ('-2',) ==> ('-74.0', '40.7')

sort by the confidence

#### Observation

Top two rules are about the stations position, so I draw the stations in that area. And we can see there are very dense

![](https://i.imgur.com/VIe58Ce.png)

The following area is ('-74.0', '40.7') and the flow is about 2000. The usage of stations near by long bridge is few because people will drive instead of riding a bike. Therefore we can move the bikes there to somewhere needed a lot

![](https://i.imgur.com/FDR09bH.png)

### Third transaction

- user's birth year, gender for each stations

Distanguish the station id by negtive it because the rules would make confuse when apply algorithm

#### Use "divided 10" as discretization method

In [95]:
man = df[['gender']]
man['birth year'] = pd.DataFrame(df[['birth year']] / 10).astype(int)
man['station id'] = -pd.DataFrame(df['start station id']).values
man.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Unnamed: 0,gender,birth year,station id
0,1,198,-539
1,1,198,-293
2,2,198,-3242
3,1,198,-2002
4,2,198,-2002


#### Test the support and confidence

I think the confidence should be higher 0.5 because it will show in half frequent item sets. It makes sense!

- we can find the men born in 1970 to 1990 are more often to ride citi bikes

In [96]:
Apriori(man, 0.1, 0.5)
FP_Growth(man, 0.1)

---------- Apriori ----------
item: ('196',) , 0.118
item: ('1', '199') , 0.122
item: ('0', '198') , 0.133
item: ('0',) , 0.135
item: ('1', '197') , 0.136
item: ('197',) , 0.176
item: ('199',) , 0.177
item: ('2',) , 0.231
item: ('1', '198') , 0.241
item: ('198',) , 0.468
item: ('1',) , 0.633

------------------------ RULES:
Rule: ('198',) ==> ('1',) , 0.515
Rule: ('199',) ==> ('1',) , 0.687
Rule: ('197',) ==> ('1',) , 0.773
Rule: ('0',) ==> ('198',) , 0.984

total run 102.790286064 sec



---------- FP Growth ----------
['0'] 235049
['1'] 1099013
['1', '197'] 235503
['1', '198'] 418124
['1', '199'] 211349
['196'] 205527
['197'] 304706
['198'] 811809
['198', '0'] 231266
['199'] 307691
['2'] 400956

total run 6.94738388062 sec


#### Use children, young and old people as discretization method

let x = birth year

- children
  - x > 2002
- young
  - 1952 <= x <= 2002
- old
  - x < 1952

In [97]:
# type 2 => children, 1 => young, 2 => old
ty = (df['birth year'] > 2002).astype(int)
ty += (df['birth year'] >= 1952).astype(int)
man = df[['gender']]
man['type'] = pd.DataFrame(man)
man['station id'] = -pd.DataFrame(df['start station id']).values
man.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Unnamed: 0,gender,type,station id
0,1,1,-539
1,1,1,-293
2,2,2,-3242
3,1,1,-2002
4,2,2,-2002


#### Test the support and confidence

I think the confidence should be higher 0.5 because it will show in half frequent item sets. It makes sense!

- with this discretization method, there are no rules QQ

Because the gender and type are too dispersed, the support will be very low

In [98]:
Apriori(man, 0.1, 0.5)
FP_Growth(man, 0.1)

---------- Apriori ----------
item: ('0',) , 0.135
item: ('2',) , 0.231
item: ('1',) , 0.633

------------------------ RULES:

total run 87.5975542068 sec



---------- FP Growth ----------
['0'] 470098
['1'] 2198026
['2'] 801912

total run 6.19200801849 sec


#### Top 3 rules

- ('197',) ==> ('1',)
- ('199',) ==> ('1',)
- ('198',) ==> ('1',)

sort by the confidence

#### Observation

The men born in 1970 to 1990 are more often to ride citi bike

In this transaction, we can find the difference between Apriori and FP Growth in execution time