New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Is it possible to build a funnel and get all events between steps as well? #3413

Open
artfabrique opened this Issue Oct 18, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@artfabrique
Copy link

artfabrique commented Oct 18, 2018

I have a challenging task and i'm not sure that it's even possible with clickhouse: Find not only the funnel steps occurrences but all events between the funnel steps as well.
Here is the scheme I need to achieve

clickhouse algo - page 1

How the events are stored initially - does not matter (can be changed)
We have around 2.7kkk events/month
Now I have a query that runs around 3 minutes on a small sample of 75kk
Maybe someone can advise a direction to research? (I'll provide current query a bit later)

I've looked through this already: #2096

@artfabrique artfabrique changed the title [Question] Is it possible to buld funnel and get all events between steps? [Question] Is it possible to build a funnel and get all events between steps as well? Oct 18, 2018

@sundy-li

This comment has been minimized.

Copy link
Contributor

sundy-li commented Oct 18, 2018

This is a more complex funnel demand than windowFunnel , but I am looking forward to see your current query in solving this problem.

@blinkov blinkov added the question label Oct 18, 2018

@blinkov

This comment has been minimized.

Copy link
Contributor

blinkov commented Oct 18, 2018

@artfabrique can you show this 3-minutes-long query?

@blinkov blinkov added the need info label Oct 18, 2018

@panyushkins

This comment has been minimized.

Copy link

panyushkins commented Oct 18, 2018

@artfabrique can you show this 3-minutes-long query?

Hello!

SELECT

count() as trials,
uniq(uid) as users,  
n_step, 
step

FROM (
WITH

    groupArray(_interface) as event_names,
    groupArray(toInt32(date)) as event_dates,
    arrayJoin(arrayFilter((i, x) -> x = '""" + first_step + """', arrayEnumerate(event_names), event_names)) AS start_index_journey,
    arraySlice(event_names, start_index_journey+1) as from_start_step_1,
    indexOf(from_start_step_1, '""" + first_step + """') as end_index,
    
    arraySlice(from_start_step_1, 1, if(end_index != 0, end_index-1, toInt32(length(from_start_step_1)))) as journey,
    if(end_index != 0, arraySlice(event_dates, start_index_journey+1, end_index-1), arraySlice(event_dates, start_index_journey+1)) as dates_journey,
    
    indexOf(journey, '""" + second_step + """') as start_index_step_2,
    if(start_index_step_2 != 0, dates_journey[start_index_step_2]-event_dates[start_index_journey], -1) as step_1_2_time,
    if(step_1_2_time < """+ max_time + """, journey, emptyArrayString()) as step1_checked,
    if(start_index_step_2 != 0, arraySlice(step1_checked, 1, start_index_step_2-1), emptyArrayString()) as step_1,
    if(start_index_step_2 != 0, arraySlice(step1_checked, start_index_step_2+1), emptyArrayString()) as from_start_step_2,
    
    indexOf(from_start_step_2, '""" + third_step + """') as start_index_step_3,
    if(start_index_step_3 != 0, dates_journey[start_index_step_2 + start_index_step_3] - dates_journey[start_index_step_2], -1) as step_2_3_time,
    if(step_2_3_time < """+ max_time + """, from_start_step_2, emptyArrayString()) as step2_checked,
    if(start_index_step_3 != 0, arraySlice(step2_checked, 1, start_index_step_3-1), emptyArrayString()) as step_2
    
SELECT uid,
    
    
    [step_1, step_2] as steps,
    arrayJoin(arrayEnumerate(steps)) as n_step,
    steps[n_step] as step_array,
    arrayJoin(step_array) as step

          
  from (
      select uid, 
          _interface, 
          date 
      from test_events
      where date > '2018-10-10 23:30:00' and date < '2018-10-10 23:59:59'
      order by date)
  group by uid)
  
GROUP BY n_step, step

Result: https://service.crazypanda.ru/v/monosnap/2018-10-18-17-10-20-0kx4n.png
Select * from test_events: https://service.crazypanda.ru/v/monosnap/2018-10-18-17-11-34-emaii.png

@artfabrique

This comment has been minimized.

Copy link
Author

artfabrique commented Oct 19, 2018

@blinkov above is the query that we are using.

@blinkov

This comment has been minimized.

Copy link
Contributor

blinkov commented Oct 19, 2018

@artfabrique it looks roughly as I expected, but frankly speaking it's too complicated to debug via public comments without access to real ClickHouse cluster. Does it look CPU bound? If it does, what do you see in profiler? What kind of hardware do you use for this test?

@panyushkins

This comment has been minimized.

Copy link

panyushkins commented Oct 19, 2018

@artfabrique it looks roughly as I expected, but frankly speaking it's too complicated to debug via public comments without access to real ClickHouse cluster. Does it look CPU bound? If it does, what do you see in profiler? What kind of hardware do you use for this test?

I will answer at Monday, when I got this information from our developers.

Also, we tried another methodic: firstly we finding all indexes and after making only one slice per one hard step with soft steps (step between hard step). I use this query:

SELECT uid,
    
    groupArray(_interface) as event_names,
    groupArray(toInt32(date)) as event_dates,
    arrayJoin(arrayFilter((i, x) -> x = '""" + first_step + """', arrayEnumerate(event_names), event_names)) AS start_index_journey,
    arrayFilter((i, x) -> x = '""" + "game" + """' and i > start_index_journey, arrayEnumerate(event_names), event_names)[1] as i_step1,
    arrayFilter((i, x) -> x = '""" + "select_another" + """' and i > i_step1, arrayEnumerate(event_names), event_names)[1] as i_step_2,
    arraySlice(event_names, start_index_journey+1) as from_start_step_1,
    indexOf(from_start_step_1, '""" + first_step + """') as end_index,
    
    arraySlice(from_start_step_1, 1, if(end_index != 0, end_index-1, toInt32(length(from_start_step_1)))) as journey,
    if(end_index != 0, arraySlice(event_dates, start_index_journey+1, end_index-1), arraySlice(event_dates, start_index_journey+1)) as dates_journey
    
          
  from test_events
  group by uid

But! I was suprised, but this query is much more complicated and using much more RAM.

@artfabrique

This comment has been minimized.

Copy link
Author

artfabrique commented Oct 19, 2018

it's too complicated to debug via public comments without access to real ClickHouse cluster

Yeah, I know :( I thought maybe there are some fundamental mistakes it the query and maybe somebody will have an idea of some radically different approach. For example, pre-generate something when we collect data etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment