## Tests

In [2]:
# import related libraries
import pandas as pd
import numpy as np
import copy

#-----------------------------------
import warnings
warnings.filterwarnings("ignore")

### Frequent patterns

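Let's rank events by the support (relative frequency) of every itemset of events that includes "Subscription Premium Cancel". Note that Apriori works on unordered sets of events per user, so the order of events inside an itemset carries no meaning.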

In [31]:
data_2_copy = copy.deepcopy(data)

In [32]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori


# Convert the user_id and event_name columns into categorical variables
data_2_copy['userid'] = data_2_copy['userid'].astype('category')
data_2_copy['event_name'] = data_2_copy['event_name'].astype('category')


# Encode the categorical variables as integers
data_2_copy['userid_encoded'] = data_2_copy['userid'].cat.codes
mapping = dict(list(zip(data_2_copy['event_name'].cat.codes, data_2_copy['event_name'])))
data_2_copy['event_name_encoded'] = data_2_copy['event_name'].cat.codes

# Group the data by user_id and create a list of event_name_encoded for each user
grouped = data_2_copy.groupby('userid_encoded')['event_name_encoded'].apply(list).reset_index()

# Convert the grouped data into a one-hot encoded format
te = TransactionEncoder()
te_ary = te.fit(grouped['event_name_encoded']).transform(grouped['event_name_encoded'])
columns_mapping = te.columns_mapping_
# Convert the one-hot encoded data into a pandas DataFrame
data_2_copy_hot = pd.DataFrame(te_ary, columns=te.columns_)

# Find the frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(data_2_copy_hot, min_support=10e-9, use_colnames=True)

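# Decode values
# Step 1: columns_mapping maps each column name (an event code) to its column index;
# this equals the code itself as long as every code from 0 upward occurs in the data
frequent_itemsets['itemsets'] = frequent_itemsets['itemsets'].apply(lambda x: [columns_mapping[code] for code in x])
# Step 2: map the integer event codes back to event names
frequent_itemsets['itemsets'] = frequent_itemsets['itemsets'].apply(lambda x: [mapping[code] for code in x])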

# Filter the frequent itemsets to only include those that contain the "Subscription Premium Cancel" event
subscription_cancel = 'Subscription Premium Cancel'
subscription_cancel_proritize = frequent_itemsets[frequent_itemsets['itemsets'].apply(lambda x: subscription_cancel in x)]
# Sort the frequent itemsets by support in descending order
subscription_cancel_proritize = subscription_cancel_proritize.sort_values(by='support', ascending=False)

# Show the resulting frequent itemsets
subscription_cancel_proritize

,support,itemsets
19,0.070904,[Subscription Premium Cancel]
112,0.064407,"[Subscription Premium Cancel, Add Payment Meth..."
239,0.062994,"[Subscription Premium, Subscription Premium Ca..."
966,0.062994,"[Subscription Premium, Subscription Premium Ca..."
962,0.062147,"[Sign Up Success, Subscription Premium Cancel,..."
...,...,...
119615,0.000282,"[Add Payment Method Success, Add Vehicle Succe..."
119616,0.000282,"[Add Payment Method Success, Add Vehicle Succe..."
119619,0.000282,"[Add Payment Method Success, Add Vehicle Succe..."
119625,0.000282,"[Add Payment Method Success, Add Vehicle Succe..."


Let's look at the 10 itemsets with the highest support

In [33]:
list(subscription_cancel_proritize.itemsets[:10])

[['Subscription Premium Cancel'],
 ['Subscription Premium Cancel', 'Add Payment Method Success'],
 ['Subscription Premium', 'Subscription Premium Cancel'],
 ['Subscription Premium',
  'Subscription Premium Cancel',
  'Add Payment Method Success'],
 ['Sign Up Success',
  'Subscription Premium Cancel',
  'Add Payment Method Success'],
 ['Sign Up Success', 'Subscription Premium Cancel'],
 ['Chat Conversation Started', 'Subscription Premium Cancel'],
 ['Sign Up Success', 'Subscription Premium', 'Subscription Premium Cancel'],
 ['Sign Up Success',
  'Subscription Premium',
  'Subscription Premium Cancel',
  'Add Payment Method Success'],
 ['Subscription Premium Cancel', 'Add Vehicle Success']]

As expected, Subscription Premium appears in many of the top itemsets. Let's iterate over the itemsets and, for each event, sum the support of every itemset that contains it

In [34]:
all_event_names = data.event_name.unique()
dict_of_support_by_events = {}
subs_dict = subscription_cancel_proritize.T.to_dict()
for event in all_event_names:
    dict_of_support_by_events[event] = 0
    for key in subs_dict:
        if event in subs_dict[key]['itemsets']:
            dict_of_support_by_events[event] += subs_dict[key]['support']

In [35]:
sorted(dict_of_support_by_events.items(), key=lambda x: x[1], reverse=True)

[('Subscription Premium Cancel', 92.2943502822235),
 ('Add Payment Method Success', 46.042937853121316),
 ('Subscription Premium', 45.84406779662421),
 ('Sign Up Success', 45.712994350296476),
 ('Add Vehicle Success', 45.40112994351698),
 ('Chat Conversation Started', 44.72740112995767),
 ('Chat Conversation Opened', 44.23785310735874),
 ('Account History Transaction Details', 41.76836158194793),
 ('Wallet Opened', 40.41751412432015),
 ('Calculator View', 34.67570621466897),
 ('Order', 29.776553672268488),
 ('Email Confirmation Success', 24.108474576227405),
 ('Add Vehicle Break', 20.822598870027107),
 ('Sign Out', 16.79999999999569),
 ('Transaction Refund', 15.304519774011851),
 ('Calculator Used', 13.776271186441251),
 ('Subscription Premium Renew', 12.294067796610651),
 ('Account Setup Profile Skip', 11.073446327684064),
 ('Add Payment Method Failed', 10.612429378531484),
 ('Account Setup Skip', 8.867796610169892),
 ('Sign Up Error', 1.1570621468926945),
 ('Add Vehicle Failed', 0.00

Keep in mind that these sums also reflect how frequent each event is overall, so we cannot judge correlation from them. Still, we can note that the summed support of Add Payment Method Success, Subscription Premium, Sign Up Success, Add Vehicle Success, Chat Conversation Started and Chat Conversation Opened is about half that of Subscription Premium Cancel itself (roughly 44 to 46 versus 92.3). Subscription Premium Renew reaches only 12.3, which is about 1 in 8.
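The comparison above can be made explicit by dividing each event's summed support by the summed support of the cancel event. The following is a minimal sketch on made-up itemsets; `toy_itemsets` and the helper name `support_share` are illustrative assumptions and not part of the notebook's data or code.

```python
from collections import defaultdict

def support_share(itemsets, target):
    """For every event that co-occurs with `target`, sum the support of the
    itemsets containing both, and divide by the target's own summed support."""
    totals = defaultdict(float)
    for items, support in itemsets:
        if target not in items:
            continue  # only itemsets that contain the target event are counted
        for event in items:
            totals[event] += support
    target_total = totals[target]
    if target_total == 0:
        return {}  # the target never appears in any itemset
    return {event: total / target_total for event, total in totals.items()}

# Hypothetical (itemset, support) pairs, for illustration only
toy_itemsets = [
    ({"Cancel"}, 0.5),
    ({"Cancel", "Renew"}, 0.25),
    ({"Cancel", "Refund"}, 0.25),
    ({"Renew"}, 0.5),  # ignored: does not contain the target
]

shares = support_share(toy_itemsets, "Cancel")
```

A share close to 0.5 would correspond to the "about half" reading above, but, like the raw sums, it still depends on how common each event is overall.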

### Bigram analysis

In [130]:
data.groupby('userid').agg('count').event_name.sort_values()

userid
8c9766660bf84148c2eab4fcb4d6f8ef      1
6d08ed5afa8f5e6917c009b58b9c3110      1
d5545663ff7f451293699ebb05333fc5      1
d538dc6b56a5319edade6d4a36c14717      1
d5348bfb9816c15891dd13981a05777b      1
                                   ... 
31c1e16c643a0b79de9e269ee421a3bb    178
9b19c0cc4a33ffdfe693957c48656c66    185
95e7c959d7d39be0a965e9e315906fed    215
dd3c7c5c898a4e6a4ed78d6e2c526bed    218
627f50253b42607513a1c93bb68201ad    320
Name: event_name, Length: 3540, dtype: int64

We can see that we have 3540 unique users. The number of events per user ranges from 1 to 320

In [131]:
subscription_cancel = 'Subscription Premium Cancel'
subscription_cancel_users = data[data['event_name'].apply(lambda x: subscription_cancel in x)]
subscription_cancel_users.groupby('userid').agg('count').event_name.sort_values()

userid
03e0c91e1163e8b80e74e586a3e666d6    1
a3a6f7a1f0a696ba44ab1d2017d30ba6    1
a4933d8ad8fe82f84a8edc1cd10b7863    1
a528cce04679dc6316925041fda7db8e    1
a529249c937ee94029cde7d9de5493b7    1
                                   ..
2567193ff2ad4910e8b44dd1fc853195    2
5393129fdf773191f7b3de51b35bb15d    2
c952231e6ea12e2bdd6f735453422008    2
5a90e9973c256c19c17b03bdda1c1fc4    2
375f680cd1242eb7bf9263dd5ba1fe6d    2
Name: event_name, Length: 251, dtype: int64

We also have 251 unique users who cancelled the premium subscription; 11 of them did it twice

In [132]:
bigram_data = copy.deepcopy(data)
bigram_data['event_name'] = bigram_data.apply(lambda x: str(x['event_name']).replace(" ", ""), axis=1)
bigram_data

,userid,user_state,event_name,event_attributes,event_created_date,event_platform,device_manufacture,device_model,month
0,c95c777785faec8dd910d019d7278ebe,CA,AddVehicleSuccess,"{""Make"":""Dodge"",""Model"":""Caravan"",""Color"":""Whi...",2022-01-16 17:03:04,android,samsung,SM-N9,1
1,c95c777785faec8dd910d019d7278ebe,CA,AddVehicleBreak,{},2022-01-16 17:07:47,android,samsung,SM-N9,1
2,f344be2d9a042b7444f3cc5279e38ef1,FL,CalculatorView,{},2022-01-16 17:16:25,android,samsung,SM-G9,1
3,c95c777785faec8dd910d019d7278ebe,CA,AddPaymentMethodSuccess,"{""Payment Method"":""Credit"",""Tokenized Pay"":""""}",2022-01-16 17:24:22,android,samsung,SM-N9,1
4,e331ed81422d8fba55520a43a872e701,IL,SignUpSuccess,"{""Method"":""Apple""}",2022-01-16 17:34:51,ios,Apple,iPhone12,1
...,...,...,...,...,...,...,...,...,...
23352,679eba26c4e75e0afb178360becfa21b,CA,AddPaymentMethodSuccess,"{""Payment Method"":""Credit"",""Tokenized Pay"":"""",...",2022-04-16 20:49:24,android,Google,Pixel 3,4
23353,679eba26c4e75e0afb178360becfa21b,CA,AccountSetupProfileSkip,"{""Screen"":""Address""}",2022-04-16 20:50:05,android,Google,Pixel 3,4
23354,679eba26c4e75e0afb178360becfa21b,CA,AccountSetupProfileSkip,"{""Screen"":""Phone Number""}",2022-04-16 20:50:10,android,Google,Pixel 3,4
23355,679eba26c4e75e0afb178360becfa21b,CA,ChatConversationOpened,"{""From"":""Dashboard"",""Transaction type"":""""}",2022-04-16 20:50:31,android,Google,Pixel 3,4


In [133]:
def values_agg(series):
    return str(series.values)

bigram_data = bigram_data.groupby(['userid']).agg({'event_name' :values_agg})
bigram_data

userid,event_name
0006869712ec9841dc36234bce245203,['AddPaymentMethodSuccess' 'SubscriptionPremiu...
000a59897372c5e3c147b15685fefc65,['SignUpSuccess']
001244c572f1a681553bc045a378cacf,['SignUpSuccess']
0032cb66b99f6baef57ec2aa04a9277f,['SignUpSuccess']
003f57fe2631ade57a86f6a2b96bb20c,['SignUpSuccess' 'AddVehicleSuccess' 'AccountS...
...,...
ff9fd3437958123842f3ab75d22fc13f,['SignUpSuccess']
ffa1aa12dd53aee84976cb6c525bb17b,['SignUpSuccess' 'EmailConfirmationSuccess' 'A...
ffbbc97af52745060a9dff4eb9917f75,['SignUpSuccess']
ffc566d97935423b6d7a3f9ba211a2b4,['SignUpSuccess']


In [134]:
from sklearn.feature_extraction.text import CountVectorizer
event_names = bigram_data['event_name'].tolist()
vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r'\b\w+\b', min_df=1)

# Transform the list of event names into bigrams
bigrams = vectorizer.fit_transform(event_names)

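# Get the count matrix (one row per user, one column per bigram)
features = bigrams.toarray()
# Note: vocabulary_ maps each bigram to its column index in `features`,
# not to its number of occurrences (those are the column sums of `features`)
feature_names = vectorizer.vocabulary_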

In [41]:
from collections import defaultdict

result = {k: v for k, v in feature_names.items() if 'subscriptionpremiumcancel' in k}
result = sorted(result.items(), key=lambda x: x[1], reverse=True)
pair_count = defaultdict(int)
for event_pair, count in result:
    pair_count[tuple(sorted(event_pair.split()))] += count
pair_count

defaultdict(int,
            {('subscriptionpremiumcancel', 'walletopened'): 557,
             ('subscriptionpremiumcancel', 'transactionrefund'): 541,
             ('subscriptionpremiumcancel', 'subscriptionpremiumrenew'): 271,
             ('subscriptionpremiumcancel', 'subscriptionpremiumcancel'): 261,
             ('signout', 'subscriptionpremiumcancel'): 474,
             ('order', 'subscriptionpremiumcancel'): 455,
             ('chatconversationstarted', 'subscriptionpremiumcancel'): 423,
             ('chatconversationopened', 'subscriptionpremiumcancel'): 406,
             ('calculatorview', 'subscriptionpremiumcancel'): 389,
             ('addvehiclesuccess', 'subscriptionpremiumcancel'): 255,
             ('addvehiclebreak', 'subscriptionpremiumcancel'): 254,
             ('addpaymentmethodsuccess', 'subscriptionpremiumcancel'): 322,
             ('addpaymentmethodfailed', 'subscriptionpremiumcancel'): 252,
             ('accounthistorytransactiondetails',
              'sub

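As we can see, the pair with the largest value is ("Wallet Opened", "Subscription Premium Cancel") at 557, followed by ("Transaction Refund", "Subscription Premium Cancel") at 541. Note that `vectorizer.vocabulary_` maps each bigram to its column index in the count matrix rather than to its number of occurrences, so these values should be verified against the column sums of `features` before they are read as frequencies. Next we need to normalize this data by how often each event occurs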
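For comparison, adjacent-event pairs can also be counted directly, without a vectorizer. The sketch below uses `collections.Counter` over each user's ordered event list and keeps the unordered-pair convention of the `tuple(sorted(...))` key used above. The `toy_sequences` data and the function name `count_bigrams_with` are assumptions made for illustration.

```python
from collections import Counter

def count_bigrams_with(sequences, target):
    """Count adjacent event pairs that include `target`, per user sequence.
    Pairs are stored as sorted tuples, so (A, B) and (B, A) share one key."""
    counts = Counter()
    for events in sequences.values():
        # zip(events, events[1:]) yields each event with its direct successor
        for first, second in zip(events, events[1:]):
            if target in (first, second):
                counts[tuple(sorted((first, second)))] += 1
    return counts

# Hypothetical per-user event sequences, already ordered by time
toy_sequences = {
    "u1": ["Renew", "Refund", "Cancel", "SignOut"],
    "u2": ["Wallet", "Cancel", "Refund"],
    "u3": ["SignUp"],  # a single event produces no bigram
}

pair_counts = count_bigrams_with(toy_sequences, "Cancel")
```

Here every value is an actual number of occurrences, so the largest entry really is the most frequent pair.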

In [42]:
event_count = data.groupby('event_name').agg('count').month.T.to_dict()
event_count

{'Account History Transaction Details': 1607,
 'Account Setup Profile Skip': 498,
 'Account Setup Skip': 222,
 'Add Payment Method Failed': 334,
 'Add Payment Method Success': 1038,
 'Add Vehicle Break': 486,
 'Add Vehicle Failed': 21,
 'Add Vehicle Success': 1923,
 'Calculator Used': 120,
 'Calculator View': 620,
 'Chat Conversation Opened': 1485,
 'Chat Conversation Started': 1201,
 'Email Confirmation Success': 832,
 'Order': 4845,
 'Reset Password Set': 1,
 'Sign Out': 595,
 'Sign Up Error': 26,
 'Sign Up Success': 3329,
 'Subscription Premium': 711,
 'Subscription Premium Cancel': 262,
 'Subscription Premium Renew': 310,
 'Transaction Refund': 102,
 'Wallet Opened': 1471}

In [43]:
normalized_result = {}
for key in event_count:
    for event_pair, count in pair_count.items():
        if key.lower().replace(" ", "") in event_pair:
            normalized_result[key] = count/event_count[key]
normalized_result = sorted(normalized_result.items(), key=lambda x: x[1], reverse=True)
normalized_result

[('Transaction Refund', 5.303921568627451),
 ('Subscription Premium Cancel', 1.0),
 ('Subscription Premium Renew', 0.8741935483870967),
 ('Sign Out', 0.7966386554621848),
 ('Add Payment Method Failed', 0.7544910179640718),
 ('Calculator View', 0.6274193548387097),
 ('Add Vehicle Break', 0.522633744855967),
 ('Wallet Opened', 0.3786539768864718),
 ('Chat Conversation Started', 0.3522064945878435),
 ('Add Payment Method Success', 0.31021194605009633),
 ('Chat Conversation Opened', 0.2734006734006734),
 ('Account History Transaction Details', 0.16303671437461106),
 ('Add Vehicle Success', 0.13260530421216848),
 ('Order', 0.09391124871001032)]

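As we can see, Subscription Premium Cancel has the highest ratios with Transaction Refund and Subscription Premium Renew. This is plausible: first the app automatically withdraws money from the user, then the transaction is refunded, and the user cancels premium and signs out. The ratio for Transaction Refund (5.3) is not a valid proportion, so these values are better read as a ranking than as probabilities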
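The normalization step divides each pair count by the total number of occurrences of the partner event. A small sketch with hypothetical counts follows; `toy_pair_counts`, `toy_event_counts` and `normalize_pairs` are illustrative names, and unlike the loop above this version assigns exactly one ratio per pair instead of overwriting entries.

```python
def normalize_pairs(pair_counts, event_counts, target):
    """Divide the count of each (target, partner) pair by the partner's
    total number of occurrences and return the ratios in descending order."""
    ratios = {}
    for pair, count in pair_counts.items():
        if target not in pair:
            continue
        # The partner is the other member of the pair
        # (the target itself when the pair is a self-pair)
        partner = pair[0] if pair[1] == target else pair[1]
        ratios[partner] = count / event_counts[partner]
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))

# Hypothetical counts, for illustration only
toy_pair_counts = {
    ("Cancel", "Refund"): 2,
    ("Cancel", "Wallet"): 1,
    ("Cancel", "Cancel"): 1,
}
toy_event_counts = {"Refund": 4, "Wallet": 10, "Cancel": 4}

ratios = normalize_pairs(toy_pair_counts, toy_event_counts, "Cancel")
```

A rare event such as a refund can rank first even with a small raw count, which is the effect the normalization is meant to expose.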

### Markov chains

Let's implement a class for a Markov chain

In [61]:
class MarkovChain:
    """First-order Markov chain over a fixed list of states."""
    
    def __init__(self, transition_matrix, states):
        # transition_matrix[i, j] is the probability of moving from states[i] to states[j]
        self.transition_matrix = transition_matrix
        self.states = states
        self.state_to_index = {state: index for index, state in enumerate(states)}
    
    def get_next_state(self, current_state):
        # Sample the next state from the row of the current state
        current_index = self.state_to_index[current_state]
        next_index = np.random.choice(len(self.states), p=self.transition_matrix[current_index, :])
        return self.states[next_index]

In [59]:
events = data["event_name"].unique()
event_index = {event: index for index, event in enumerate(events)}
transition_matrix = np.zeros((len(events), len(events)))

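# Count transitions between consecutive rows of the table.
# Note: rows are paired regardless of userid, so a counted transition
# may join the events of two different users.
for i in range(data.shape[0] - 1):
    current_event = data.iloc[i]["event_name"]
    next_event = data.iloc[i + 1]["event_name"]
    current_event_index = event_index[current_event]
    next_event_index = event_index[next_event]
    transition_matrix[current_event_index][next_event_index] += 1

# Normalize each row so that it sums to 1 (a row without outgoing transitions would give NaN)
transition_probs = transition_matrix / np.sum(transition_matrix, axis=1, keepdims=True)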

In [62]:
markov_chain = MarkovChain(transition_probs, events)
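
If transitions should only be counted inside one user's own history, the counting can be done per user. This is a minimal sketch under the assumption that each user's events are already ordered by time; `toy_sequences` and `transition_probabilities` are hypothetical names, and plain dictionaries stand in for the NumPy matrix used above.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order transition probabilities from per-user sequences.
    Transitions are only counted between consecutive events of the same user."""
    counts = defaultdict(Counter)
    for events in sequences.values():
        for current_event, next_event in zip(events, events[1:]):
            counts[current_event][next_event] += 1
    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs

# Hypothetical per-user event sequences, already ordered by time
toy_sequences = {
    "u1": ["Renew", "Refund", "Cancel"],
    "u2": ["SignUp", "Renew", "Cancel"],
}

toy_probs = transition_probabilities(toy_sequences)
```

In this toy data "Cancel" has no outgoing transition, because the last event of "u1" is never linked to the first event of "u2".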

#### Note: your results may differ, because each next state is sampled at random

#### Test 1

In [63]:
current_state = "Subscription Premium Cancel"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Subscription Premium Cancel
Next state: Transaction Refund


In [65]:
current_state = "Transaction Refund"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Transaction Refund
Next state: Chat Conversation Opened


In [66]:
current_state = "Chat Conversation Opened"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Chat Conversation Opened
Next state: Chat Conversation Started


#### Test 2

In [77]:
current_state = "Subscription Premium Cancel"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Subscription Premium Cancel
Next state: Sign Up Success


In [78]:
current_state = "Sign Up Success"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Sign Up Success
Next state: Wallet Opened


In [79]:
current_state = "Wallet Opened"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Wallet Opened
Next state: Chat Conversation Opened


In [80]:
current_state = "Chat Conversation Opened"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Chat Conversation Opened
Next state: Calculator View


#### Test 3

In [76]:
current_state = "Subscription Premium Cancel"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Subscription Premium Cancel
Next state: Chat Conversation Opened


In [81]:
current_state = "Chat Conversation Opened"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Chat Conversation Opened
Next state: Wallet Opened


In [82]:
current_state = "Wallet Opened"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Wallet Opened
Next state: Chat Conversation Opened


#### Test 4

In [83]:
current_state = "Subscription Premium Cancel"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Subscription Premium Cancel
Next state: Chat Conversation Started


In [85]:
current_state = "Chat Conversation Started"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Chat Conversation Started
Next state: Wallet Opened


In [86]:
current_state = "Wallet Opened"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Wallet Opened
Next state: Account History Transaction Details


#### Test 5

In [96]:
current_state = "Subscription Premium Cancel"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Subscription Premium Cancel
Next state: Transaction Refund


In [97]:
current_state = "Transaction Refund"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Transaction Refund
Next state: Sign Up Success


In [98]:
current_state = "Sign Up Success"
print("Current state:", current_state)
print("Next state:", markov_chain.get_next_state(current_state))

Current state: Sign Up Success
Next state: Add Vehicle Success


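As in the previous analyses, the events that most often appear around Subscription Premium Cancel in these sampled runs are Transaction Refund, the Chat Conversation events and Wallet Opened. Since every step is a single random draw, this is an indication rather than a measured correlation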
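
Because each call to `get_next_state` draws one random sample, a handful of runs gives only a rough picture. One alternative, sketched here with a hypothetical transition table (`toy_probs`) and helper names that are not part of the notebook, is to read the most likely successors directly from the transition probabilities and to fix a seed whenever a sampled walk is needed.

```python
import random

def most_likely_next(probs, state, top_n=3):
    """Return the `top_n` most probable successors of `state`."""
    ranked = sorted(probs.get(state, {}).items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

def sample_walk(probs, start, steps, seed=0):
    """Sample a walk of at most `steps` transitions; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(steps):
        next_probs = probs.get(walk[-1])
        if not next_probs:
            break  # the current state has no recorded successor
        states, weights = zip(*next_probs.items())
        walk.append(rng.choices(states, weights=weights, k=1)[0])
    return walk

# Hypothetical transition probabilities, for illustration only
toy_probs = {
    "Cancel": {"Refund": 0.6, "Chat": 0.3, "Wallet": 0.1},
    "Refund": {"Chat": 1.0},
    "Chat": {"Wallet": 1.0},
}

top = most_likely_next(toy_probs, "Cancel", top_n=2)
walk = sample_walk(toy_probs, "Refund", steps=5)
```

Ranking the probabilities gives the same answer on every run, while the seeded walk keeps the sampling approach but makes each test reproducible.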

### Interim findings

Based on the frequent pattern, bigram and Markov chain analyses, we can see a common pattern among users who cancel the Premium Subscription:
1. Subscription Premium Renew
2. Chat Conversation Opened
3. Chat Conversation Started
4. Wallet Opened
5. Transaction Refund
6. Subscription Premium Cancel

So here are 2 business ideas to reduce the number of users who cancel the premium subscription:
* In the chat conversation with a manager, offer the user some privileges, so that they have a motivation to stay
* Warn the user a few days before money is withdrawn, with messages like "Glad you're with us", or offer promo codes once every few months

[See Notebook Parameters_analyses]