In [2]:
%run "../../common.ipynb"

# Click Fox Analysis 

### Details of data set

A **session** is simply a single user’s interaction, perhaps on a web site. 
A **path**, displayed in the last column, is a series of events in the order visited by the user; e.g. path “V->V” is a user visiting event “V” twice in a row. These paths are only part of the user’s entire experience, and we may observe more than one path per session.
You’ll see that the paths are ranked by volume, and that you also have the number of times each path was observed and the number of sessions in which it was observed.
 
For this exercise, we’re presenting a simple problem statement: If event “Q” represents an outcome of interest (perhaps a customer complaint or a repeat agent call), what can you tell us about the paths or events leading up to Q?
 
It’s a broad question with no one right answer. Get creative, have fun with it and be ready to talk about your strategies with the team. If you have any questions, feel free to email me or give me a ring on my cell (xxx-xxx-xxxx).
 
After you’ve had a chance to review and get your arms around the exercise, let’s schedule a time for you to come back by the office, meet the team, and present your results.  

## My Analysis

### Assumptions

I am assuming that the "Sessions" column is the total number of sessions in which a "Path" (shown in the last column) occurred. Thus the path in the first row, "V->V", occurred in 347 sessions.

A session may contain multiple paths, so a session may take the following form.

Session ID: XXX, with paths {V->V, A->J, U->X->E->K->C->H}, which correspond to rows 0, 9, and 14 of the data set (zero-based index).
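
Because one session may contain several paths, the per-path "Sessions" counts can overlap and cannot simply be added up. The sketch below is one way to bracket the number of sessions containing an event: the largest single count is a lower bound and the sum is an upper bound. It uses only the ten rows displayed in the table output of this notebook, so the bounds cover those rows only; the names `ROWS` and `session_bounds` are illustrative and not part of the original notebook.

```python
# (path, sessions, path observations) for the ten displayed rows only.
ROWS = [
    ("V->V", 347, 612),
    ("L->S->H->I->Y->G->R", 148, 157),
    ("N->R->I->B->J->P", 79, 82),
    ("Z->D->D->H->O->T->N->D->F", 71, 81),
    ("Y->E", 70, 71),
    ("L->X->A->P->R", 51, 66),
    ("G->C->V->D->Q", 46, 52),
    ("M->G->N->P->I", 25, 41),
    ("T->W->Q->D->M->Q", 24, 37),
    ("A->J", 22, 35),
]
TOTAL_SESSIONS = 1038  # "Total Sessions" column


def session_bounds(rows, event="Q"):
    """Bracket the number of sessions containing `event` (hypothetical helper)."""
    hits = [(s, p) for path, s, p in rows if event in path.split("->")]
    sessions = [s for s, _ in hits]
    return {
        "paths_with_event": len(hits),
        # Every session of the biggest path contains the event.
        "min_sessions": max(sessions, default=0),
        # Sessions may be shared between paths, so the sum can overcount.
        "max_sessions": sum(sessions),
        "path_observations": sum(p for _, p in hits),
    }


bounds = session_bounds(ROWS)
print(bounds)
print("share of sessions: between",
      round(bounds["min_sessions"] / TOTAL_SESSIONS, 3), "and",
      round(bounds["max_sessions"] / TOTAL_SESSIONS, 3))
```

Narrowing this range to an exact figure would need the session-level parent data, which the summarized file does not contain.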


### Analysis

Many interesting patterns can be extracted from this data. I will restrict my focus to a single event of interest. Let's select event "Q" (as suggested); since the choice of event is arbitrary, the commentary that follows applies to any event.

The following questions are metrics I would seek. Even though I use "Q" below, the same questions apply to any other event.

<pre>
1. In how many total sessions does "Q" appear?

2. In what percentage of all sessions does "Q" appear?

3. Is the metric we see significant?
  * Assess the significance level. For example, if "Q" were a bug that appears in fewer than, say, 3% of sessions, is it worth pursuing? What is the significance criterion for prioritizing this work?

4. In this data set it is not possible to count the number of paths per session; however, knowing that the data is summarized, I would be interested in how many paths exist in a session when "Q" appears.

5. Is "Q" ever the start of a path? Count the times "Q" starts a path.
6. Is "Q" always the end of a path? Count the times "Q" ends a path.
7. Is "Q" an intermediate node in a path? (The concept can be extended depending on the pattern: what is the position of "Q" in the path?)
8. Which events lead to "Q"?
9. Which events follow "Q"?
10*. Is there any correlation between "Q" and a particular type of user (or user session)? (Again, this information is not apparent in the current summarized data set; it would need to be extracted from the parent data set.)

11. Is there an event that starts a path and leads to "Q"?
12. If we count the events in each path (regardless of order), is there a correlation between those event counts and "Q" appearing in that path? This is similar to the Naive Bayes algorithm, with two states: "Q" appears or it does not.

13. Probability that an event leads to event "Q". In other words, P(Q|event=x) = P(Q|E)

         P(E|Q) P(Q)
P(Q|E) = -----------
            P(E)

If we are only interested in predicting Q, then the numerator alone is sufficient (since the denominator P(E) is the same for both outcomes, "Q" and "not Q"). If, on the other hand, we want the exact probability to gauge the strength of the relationship, the denominator can be calculated as well.

14. I also see an application of Markov chains, with states "A"-"Z" and transition probabilities, to compute the probability of "Q" occurring.
15. I can see an application of an HMM (hidden Markov model): similar to a Markov chain, except with two hidden states to dynamically update the model.


</pre>
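
Questions 5 through 9 can be answered directly from the path strings. The following is a minimal sketch, assuming every one of the 20 distinct paths counts once (observation counts are ignored here); `PATHS` and `position_summary` are illustrative names, not part of the original notebook.

```python
from collections import Counter

# The 20 distinct path strings from the data set.
PATHS = """
V->V
L->S->H->I->Y->G->R
N->R->I->B->J->P
Z->D->D->H->O->T->N->D->F
Y->E
L->X->A->P->R
G->C->V->D->Q
M->G->N->P->I
T->W->Q->D->M->Q
A->J
I->T->N->R->F->Q->I->Q
K->Q
J->G->V->L
Z->K->V->L->F->L
U->X->E->K->C->H
S->C->T->G->W->U
U->R->G->V->W->F
F->U->H->Q->C->Z->U->G->K->V
I->O->K->L->D->X->V->M
V->G->I->V
""".split()


def position_summary(paths, event="Q"):
    """Count where `event` sits in each path and which events surround it."""
    summary = {"containing": 0, "starts": 0, "ends": 0, "interior": 0}
    before, after = Counter(), Counter()
    for path in paths:
        events = path.split("->")
        if event not in events:
            continue
        summary["containing"] += 1
        summary["starts"] += events[0] == event
        summary["ends"] += events[-1] == event
        last = len(events) - 1
        for i, name in enumerate(events):
            if name != event:
                continue
            if 0 < i < last:
                summary["interior"] += 1  # occurrence strictly inside the path
            if i > 0:
                before[events[i - 1]] += 1  # immediate predecessor
            if i < last:
                after[events[i + 1]] += 1   # immediate successor
    return summary, before, after


summary, before, after = position_summary(PATHS)
print(summary)
print("events directly before Q:", sorted(before))
print("events directly after Q:", sorted(after))
```

Passing a different `event` argument gives the same breakdown for any other symbol, in line with the point that these questions are not specific to "Q".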


In [6]:
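fileName = "Alphabet_Paths.csv"

# LoadDataSet and displayDFs are helpers, presumably provided by common.ipynb (loaded via %run above)
df = LoadDataSet(fileName)
displayDFs([df], maxrows=23)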

Summary statistics:
,Rank (int64),Sessions (int64),Total Sessions (int64),% of Total Sessions (float64),Paths (int64),% of Total Paths (float64),Path (object)
count,20.000,20.000,20.000,20.000,20.000,20.000,20
unique,-,-,-,-,-,-,20
top,-,-,-,-,-,-,U->X->E->K->C->H
freq,-,-,-,-,-,-,1
mean,10.500,51.900,1038.000,0.050,70.050,0.050,-
std,5.916,77.298,0.000,0.074,132.527,0.095,-
min,1.000,11.000,1038.000,0.011,11.000,0.008,-
25%,5.750,15.500,1038.000,0.015,16.500,0.012,-
50%,10.500,21.500,1038.000,0.021,30.500,0.022,-
75%,15.250,55.750,1038.000,0.054,67.250,0.048,-
max,20.000,347.000,1038.000,0.334,612.000,0.437,-

Data rows:
,Rank (int64),Sessions (int64),Total Sessions (int64),% of Total Sessions (float64),Paths (int64),% of Total Paths (float64),Path (object)
0,1,347,1038,0.334,612,0.437,V->V
1,2,148,1038,0.143,157,0.112,L->S->H->I->Y->G->R
2,3,79,1038,0.076,82,0.059,N->R->I->B->J->P
3,4,71,1038,0.068,81,0.058,Z->D->D->H->O->T->N->D->F
4,5,70,1038,0.067,71,0.051,Y->E
5,6,51,1038,0.049,66,0.047,L->X->A->P->R
6,7,46,1038,0.044,52,0.037,G->C->V->D->Q
7,8,25,1038,0.024,41,0.029,M->G->N->P->I
8,9,24,1038,0.023,37,0.026,T->W->Q->D->M->Q
9,10,22,1038,0.021,35,0.025,A->J
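
With per-path observation counts available, the Bayes terms from question 13 can be estimated by weighting each path by its "Paths" count. The sketch below makes two assumptions that are not in the original analysis: "E" is read as "event x appears anywhere in the path" (order is ignored), and only the ten displayed rows are used, so the numbers are illustrative rather than estimates for the full data set. `ROWS` and `bayes_terms` are hypothetical names.

```python
# (path, observations): the "Path" and "Paths" columns of the ten displayed rows.
ROWS = [
    ("V->V", 612),
    ("L->S->H->I->Y->G->R", 157),
    ("N->R->I->B->J->P", 82),
    ("Z->D->D->H->O->T->N->D->F", 81),
    ("Y->E", 71),
    ("L->X->A->P->R", 66),
    ("G->C->V->D->Q", 52),
    ("M->G->N->P->I", 41),
    ("T->W->Q->D->M->Q", 37),
    ("A->J", 35),
]


def bayes_terms(rows, evidence, target="Q"):
    """Return P(target), P(evidence), P(evidence|target), P(target|evidence).

    Assumes both `evidence` and `target` occur in at least one row.
    """
    total = sum(n for _, n in rows)

    def weight(*needed):
        # Observations of paths that contain every event in `needed`.
        return sum(n for path, n in rows
                   if set(needed) <= set(path.split("->")))

    p_target = weight(target) / total
    p_evidence = weight(evidence) / total
    p_evidence_given_target = weight(evidence, target) / weight(target)
    # Bayes' rule: P(Q|E) = P(E|Q) P(Q) / P(E)
    posterior = p_evidence_given_target * p_target / p_evidence
    return p_target, p_evidence, p_evidence_given_target, posterior


p_q, p_d, p_d_given_q, p_q_given_d = bayes_terms(ROWS, evidence="D")
print(f"P(Q)={p_q:.3f}  P(D)={p_d:.3f}  "
      f"P(D|Q)={p_d_given_q:.3f}  P(Q|D)={p_q_given_d:.3f}")
```

The posterior agrees with the direct ratio of joint observations to observations containing "D", which is a quick sanity check on the formula.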


In [14]:
d='''
V->V
L->S->H->I->Y->G->R
N->R->I->B->J->P
Z->D->D->H->O->T->N->D->F
Y->E
L->X->A->P->R
G->C->V->D->Q
M->G->N->P->I
T->W->Q->D->M->Q
A->J
I->T->N->R->F->Q->I->Q
K->Q
J->G->V->L
Z->K->V->L->F->L
U->X->E->K->C->H
S->C->T->G->W->U
U->R->G->V->W->F
F->U->H->Q->C->Z->U->G->K->V
I->O->K->L->D->X->V->M
V->G->I->V
'''

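# Strip the leading/trailing newlines first; otherwise the split also yields an empty-string "symbol"
syms = set(d.strip().replace("\n", "->").split("->"))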
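
The Markov chain idea from question 14 can be sketched from the same path strings. This is a minimal first-order estimate, assuming each of the 20 distinct paths is weighted equally; weighting transitions by the "Paths" observation counts would likely be more faithful, but this sketch does not use them. `PATHS` and `transition_probs` are illustrative names, not part of the original notebook.

```python
from collections import Counter, defaultdict

# The 20 distinct path strings from the data set.
PATHS = """
V->V
L->S->H->I->Y->G->R
N->R->I->B->J->P
Z->D->D->H->O->T->N->D->F
Y->E
L->X->A->P->R
G->C->V->D->Q
M->G->N->P->I
T->W->Q->D->M->Q
A->J
I->T->N->R->F->Q->I->Q
K->Q
J->G->V->L
Z->K->V->L->F->L
U->X->E->K->C->H
S->C->T->G->W->U
U->R->G->V->W->F
F->U->H->Q->C->Z->U->G->K->V
I->O->K->L->D->X->V->M
V->G->I->V
""".split()


def transition_probs(paths):
    """Estimate P(next event | current event) from consecutive event pairs."""
    counts = defaultdict(Counter)
    for path in paths:
        events = path.split("->")
        for current, nxt in zip(events, events[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


probs = transition_probs(PATHS)

# States with a non-zero one-step probability of moving to Q.
into_q = sorted(
    ((cur, row["Q"]) for cur, row in probs.items() if "Q" in row),
    key=lambda item: -item[1],
)
for cur, p in into_q:
    print(f"P(Q | {cur}) = {p:.3f}")
```

Each row of `probs` sums to 1 over the observed successors of that state; events that only ever end a path have no row.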