## Note that this demo only processes two raw .log files: one from before the cut-off date and one from after it.

The raw data contains all users' activities from March 01 to May 12, 2017. The data set is too large for a PC to read all the .log files into memory at once. Fortunately, not all of the activity information is needed to determine whether a user has churned. Specifically, for churn and user-activity analysis, only the user_id, device_type (iphone or android) and date/time of each activity log need to be saved.

### Cut-off date = April 21st.

In this demo, the first file (03/01) is from before the cut-off date, while the second file (05/01) is from after it. In the actual analysis there are three weeks before the cut-off and three weeks after. The cut-off might need changing later, depending on the model's performance.
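As a minimal sketch of how the cut-off could be applied to file names, assuming every log name starts with an eight-digit YYYYMMDD date as in 20170301_play.log. The helper name `is_before_cutoff` is hypothetical, and treating the cut-off day itself as "after" is an assumption, not something the data description specifies.

```python
from datetime import date, datetime

# Cut-off date stated above; it may need changing later.
CUTOFF = date(2017, 4, 21)


def is_before_cutoff(filename, cutoff=CUTOFF):
    """Return True if the log's date (first 8 characters) precedes the cut-off.

    Assumption: the cut-off day itself counts as 'after'.
    """
    log_date = datetime.strptime(filename[:8], '%Y%m%d').date()
    return log_date < cutoff


print(is_before_cutoff('20170301_play.log'))    # True
print(is_before_cutoff('20170501_1_play.log'))  # False
```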

### Operations on the play log files:

1. Open the *play.log files one by one.

2. Read a line, and save the user id, device type and log date.

3. Do this for all lines in all files. Write the saved info into a file called 'play_lite.log'.
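The three steps above could be sketched as a single streaming function. This is only an illustration under stated assumptions: the name `reduce_play_logs` is hypothetical, the first two tab-separated fields are taken to be the user id and device type, and the log date is taken from the first eight characters of the file name.

```python
import os


def reduce_play_logs(paths, out_path):
    """Keep the first two tab-separated fields of every line, plus the file date.

    Assumptions: fields 0 and 1 are user id and device type, and each file
    name starts with the log date as YYYYMMDD.
    Returns the number of lines written.
    """
    total = 0
    with open(out_path, 'w') as out:
        for path in paths:
            log_date = os.path.basename(path)[:8]
            with open(path, 'r') as f:
                # Iterating over the file reads one line at a time,
                # so a large log is never held in memory as a whole.
                for line in f:
                    fields = line.rstrip('\n').split('\t')[:2]
                    fields.append(log_date)
                    out.write('\t'.join(fields) + '\n')
                    total += 1
    return total
```

Iterating over the file object instead of calling `readlines()` keeps memory use roughly constant, which matters once all the logs are processed rather than just two.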

In [1]:
import glob


In [71]:
filepath = '/Users/Xiaoxi/Desktop/BitTiger/Capstone/data/play/*play.log'
# glob does not guarantee any order; sort so the earlier date always comes first
files = sorted(glob.glob(filepath))
len(files)

2

In [72]:
files

['/Users/Xiaoxi/Desktop/BitTiger/Capstone/data/play/20170301_play.log',
 '/Users/Xiaoxi/Desktop/BitTiger/Capstone/data/play/20170501_1_play.log']

In [73]:
# count the lines in each log file to check the total size of the logs
log_lengths = []

for the_file in files:
    with open(the_file, 'r') as f:
        # iterate over the file so it is not loaded into memory all at once
        log_lengths.append(sum(1 for _ in f))
    
log_lengths

[3422257, 992200]

In [104]:
with open(files[0],'r') as f:
    content = f.readlines()
len(content)

3422257

In [105]:
first_line = content[0]
first_line


'264715\n'

In [107]:
first_line_fields = content[0].strip('\n').split('\t')
first_line_fields

['264715']

In [89]:
reduced_fields = first_line_fields[:2]
reduced_fields

['264715']

In [90]:
filename = f.name.split('/')[-1]

reduced_fields.append(filename)

reduced_fields

['264715', '20170301_play.log']

In [91]:
new_first_line = '\t'.join(reduced_fields)+'\n'
new_first_line

'264715\t20170301_play.log\n'

### Write the new reduced lines into a new file

    Using the f.write() method

In [114]:
output = open('/Users/Xiaoxi/Desktop/BitTiger/Capstone/data/output/all.log','w+')

In [115]:
import time

for the_file in files:
    # time.time() gives wall-clock seconds; time.clock() was removed in Python 3.8
    current_time = time.time()

    with open(the_file, 'r') as f:
        lines = f.readlines()
        print('processing file: %s' % f.name.split('/')[-1])
        for line in lines:
            # keep the first two fields, then append the source file name
            fields_to_keep = line.strip('\n').split('\t')[:2]
            fields_to_keep.append(f.name.split('/')[-1])
            output.write('\t'.join(fields_to_keep)+'\n')
    print('...costs %.2f seconds' % (time.time()-current_time))

processing file: 20170301_play.log
...costs 16.43 seconds
processing file: 20170501_1_play.log
...costs 4.33 seconds


In [116]:
output.close()


In [117]:
with open('/Users/Xiaoxi/Desktop/BitTiger/Capstone/data/output/all.log','r') as output:
    lines = output.readlines()
len(lines)

4414457

In [118]:
sum(log_lengths)

4414457

In [119]:
lines[0]

'264715\t20170301_play.log\n'

### Save the user_ids into sets

  Overwrite the all.log file just created, since the procedure below builds the user_id sets and writes all.log at the same time.

In [120]:
import time

list_of_sets = []
# for each day's data, collect the active users' user_ids into a set.


In [121]:
with open('/Users/Xiaoxi/Desktop/BitTiger/Capstone/data/output/all.log','w+') as output:
    for the_file in files:
        # time.time() gives wall-clock seconds; time.clock() was removed in Python 3.8
        current_time = time.time()

        with open(the_file, 'r') as f:
            print('processing file: %s' % f.name.split('/')[-1])
            lines = f.readlines()
            # the built-in set replaces the deprecated sets.Set
            list_of_sets.append(set([line.split('\t')[0] for line in lines]))

            for line in lines:
                fields = line.strip('\n').split('\t')
                # append the log date (first 8 characters of the file name)
                fields.append(f.name.split('/')[-1][:8])
                output.write('\t'.join(fields)+'\n')
        print('...costs %.2f seconds' % (time.time()-current_time))


processing file: 20170301_play.log
...costs 20.28 seconds
processing file: 20170501_1_play.log
...costs 5.16 seconds


In [123]:
[len(each_set) for each_set in list_of_sets]


[175191, 25489]
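With more than two files, the per-day sets would have to be merged into one "before" group and one "after" group. A minimal sketch of that step, assuming the sets are keyed by a YYYYMMDD date string; the function name `split_active_users` and the toy data are hypothetical, and the cut-off day itself is assumed to count as "after".

```python
def split_active_users(sets_by_date, cutoff='20170421'):
    """Union per-day user_id sets into 'before' and 'after' groups.

    sets_by_date maps a YYYYMMDD string to that day's set of user_ids.
    Assumption: the cut-off day itself counts as 'after'.
    """
    active_before, active_after = set(), set()
    for log_date, users in sets_by_date.items():
        if log_date < cutoff:  # YYYYMMDD strings sort chronologically
            active_before |= users
        else:
            active_after |= users
    return active_before, active_after


# Toy data, not taken from the real logs.
toy = {
    '20170301': {'a', 'b'},
    '20170302': {'b', 'c'},
    '20170501': {'c', 'd'},
}
before, after = split_active_users(toy)
print(sorted(before - after))  # ['a', 'b']
```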

### Churn labeling and file saving

    Save the user_id of churns into a new file.

In [124]:
active_before, active_after = list_of_sets[0],list_of_sets[1]


In [125]:
# set method: s.intersection(t) returns the intersection of s and t
loyal_users = active_before.intersection(active_after)
len(loyal_users)

0

In [126]:
# set method: s.difference(t) returns a new set with elements in s but not in t
churn = active_before.difference(active_after)
len(churn)

175191

In [127]:
new_users = active_after.difference(active_before)
len(new_users)

25489
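The saving step named in this section's heading could look like the sketch below. It is an illustration only: the function name `save_user_ids` and the output file name in the usage comment are hypothetical, and the one-id-per-line layout is an assumption.

```python
def save_user_ids(user_ids, out_path):
    """Write one user_id per line and return the number of ids written.

    Ids are stripped of surrounding whitespace (a line with a single field
    keeps its trailing newline when split on tabs) and sorted so that the
    output file is reproducible.
    """
    cleaned = sorted({user_id.strip() for user_id in user_ids})
    with open(out_path, 'w') as out:
        for user_id in cleaned:
            out.write(user_id + '\n')
    return len(cleaned)


# Hypothetical usage with the set computed above:
# save_user_ids(churn, 'churn_user_id.log')
```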