> Beautiful is better than ugly.  
Explicit is better than implicit.  
Simple is better than complex.  
Complex is better than complicated.  
Flat is better than nested.  

In [1067]:
# Load the ruleset
rules = pd.read_csv(config.inputs['rules']['fullpath'])

In [1068]:

"""
For now, keep only the rows that -
1. Have a text_match entry, and
2. Have no text_exclude entry
"""
match_rules = rules[rules['text_match'].notna() & rules['text_exclude'].isna()]

# Initialize the regex builder
builder = RulesRegexBuilder(GroupedRegexConcatenation('|'))

"""
Requirement
Add the requirements to the regex builder
1. S (start): if the pattern in the text_match
column is a prefix of the description string,
a match is found for the corresponding service
(both ID and name are included in the table)
2. A (anywhere): similar to above, except
that the pattern doesn’t have to be at the
beginning of the description
3. R (regular expression): use text_match
as a regular expression
"""

builder.appendDecorator(RegexDecorator('M', '(', ')'))
builder.appendDecorator(RegexDecorator('S', '^(', ').*$'))
builder.appendDecorator(RegexDecorator('A', '^.*(', ').*$'))

"""
Finally, we build the dictionary of regexes.
It looks like {regex_string_to_match: service_id} for
easier mapping with the data later,
e.g. {'^.*(STARZ|Starz|STARZ|STARZ).*$': 8, ...}
"""

regexes_dict = builder.build(match_rules)
output(random.choice(list(regexes_dict)))

('^(Lim Commercials 3mo|CBS - Fake Cancel|CBS All Access|Commercial Free 1 '
 'Week|CBS - Fake New|CBS - Fake Cancel|Cancellation Confirmation CBS|CBS - '
 'Fake Cancel Passive Churn).*$')
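
The builder classes used above (`RulesRegexBuilder`, `GroupedRegexConcatenation`, `RegexDecorator`) are not shown in this notebook. The following is only a minimal sketch of the same idea, not the actual implementation: group the patterns per service and match type, join them with `|`, and wrap the result in the prefix and suffix registered for that match type. The function `build_regexes`, the column name `match_type`, and the toy rows are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical stand-in for the rules table. 'match_type' is an assumed column name.
toy_rules = pd.DataFrame({
    'service_id': [8, 8, 39, 39],
    'match_type': ['A', 'A', 'S', 'S'],
    'text_match': ['STARZ', 'Starz', 'CBS All Access', 'CBS - Fake New'],
})

# (prefix, suffix) per match type, mirroring the decorators registered above.
WRAPPERS = {
    'S': ('^(', ').*$'),
    'A': ('^.*(', ').*$'),
}

def build_regexes(rules):
    """Return {regex_string: service_id}, one regex per (service_id, match_type)."""
    regexes = {}
    for (service_id, match_type), group in rules.groupby(['service_id', 'match_type']):
        prefix, suffix = WRAPPERS[match_type]
        # Escape each literal so characters such as '+' or '(' are matched verbatim.
        alternation = '|'.join(re.escape(p) for p in group['text_match'])
        regexes[prefix + alternation + suffix] = int(service_id)
    return regexes

toy_regexes = build_regexes(toy_rules)

def find_services(description):
    """Return the service ids whose regex matches the description."""
    return [sid for pattern, sid in toy_regexes.items() if re.match(pattern, description)]
```

Escaping the literals is a design choice of this sketch; the regex printed above suggests the notebook's builder joins the patterns as they are.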


In [1069]:
"""
Build the dictionary for mapping signals -
The mapping dictionary of signal keywords
can be configured in ROOT_DIR/config.py
"""

signals = {'^.*(' + ('|'.join(v)) + ').*$' : k for k, v in config.inputs['signals'].items()}

output(signals)

{'^.*(cancelled|cancel).*$': 'cancellation',
 '^.*(coming|back|signup|signing|joining|welcome).*$': 'signup'}
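
As a rough illustration of how these patterns behave on a single description, here is a small sketch using only the standard `re` module. The keyword lists are inferred from the printed output above (the real ones live in `config.py`), and `classify` is a hypothetical helper, not part of the notebook. Note that the patterns are case-sensitive unless a flag is passed.

```python
import re

# Keyword lists inferred from the printed output above; the real ones live in config.py.
signal_keywords = {
    'cancellation': ['cancelled', 'cancel'],
    'signup': ['coming', 'back', 'signup', 'signing', 'joining', 'welcome'],
}

# Same construction as in the cell above: one "anywhere" regex per signal label.
signal_patterns = {
    '^.*(' + '|'.join(words) + ').*$': label
    for label, words in signal_keywords.items()
}

def classify(description, ignore_case=False):
    """Return the first matching signal label, or None when nothing matches."""
    flags = re.IGNORECASE if ignore_case else 0
    for pattern, label in signal_patterns.items():
        if re.match(pattern, description, flags):
            return label
    return None
```

With these keywords, a description such as `'Welcome to Hulu'` is only recognised as a signup when `ignore_case=True`, which may be worth considering if descriptions vary in capitalisation.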


In [1070]:
"""
Build the dictionary for mapping service 
names to service ids
"""

# services = rules[rules['service_id'].isin([1, 3, 8, 12, 39])][['service_id', 'service_name']]
# services.drop_duplicates(subset = ['service_id'], inplace = True)
# services.set_index('service_id', inplace=True)
# output(services)

'\nBuild the dictionary for mapping service \nnames to service ids\n'

In [1071]:
# Load the dataset
data = pd.read_csv(config.inputs['data']['fullpath'])

In [1072]:
"""
Requirement
There are 3 types of statuses:
1. N (new): this transaction is new
2. U (update): this transaction is 
updated; discard the old one
3. D (delete): remove this transaction
"""

# Keep only the most recent entry for each item_id
data = data.sort_values('last_updated').drop_duplicates('item_id', keep='last')

# Remove transactions whose most recent status is 'D', as per requirements
data = data[data['status'] != 'D']
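
The order of the two steps above matters when an item's most recent record is a delete. The sketch below compares both orders on made-up rows; the column names follow the notebook, the values are hypothetical.

```python
import pandas as pd

# Toy transactions: item 1 is updated, item 3 is created and later deleted.
toy = pd.DataFrame({
    'item_id': [1, 1, 2, 3, 3],
    'status': ['N', 'U', 'N', 'N', 'D'],
    'last_updated': ['2020-01-01', '2020-01-05', '2020-01-02',
                     '2020-01-03', '2020-01-04'],
})

# Option A: keep the latest record per item, then drop deleted items.
latest = toy.sort_values('last_updated').drop_duplicates('item_id', keep='last')
dedupe_then_filter = latest[latest['status'] != 'D']

# Option B: drop 'D' rows first, then keep the latest remaining record per item.
filter_then_dedupe = (
    toy[toy['status'] != 'D']
    .sort_values('last_updated')
    .drop_duplicates('item_id', keep='last')
)
```

Option A removes item 3 entirely, while option B keeps its older `N` record because the `D` row is gone before deduplication runs.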

In [1073]:
# Copy the description column
data['service_id'] = data['description']

# Replace the description in the new service_id
# column with the matching service_id from
# regexes_dict that we created above
data['service_id'] = data['service_id'].replace(regexes_dict, regex=True)
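
To make the effect of the regex replacement concrete, here is a small sketch on made-up descriptions. The two regexes and their ids are illustrative only. A description that matches a pattern is replaced by the integer id, and a description that matches nothing keeps its original text.

```python
import pandas as pd

# Hypothetical descriptions; dtype=object so the column can hold both ints and strings.
descriptions = pd.Series(
    ['STARZ monthly plan', 'Welcome to Netflix', 'Gym membership'],
    dtype=object,
)

# Illustrative {regex: service_id} mapping in the same shape as regexes_dict.
toy_regexes = {
    '^.*(STARZ|Starz).*$': 8,
    '^.*(Netflix|NETFLIX).*$': 3,
}

# Each matching description is replaced as a whole by its service id.
mapped = descriptions.replace(toy_regexes, regex=True)
```

Because unmatched rows still contain free text, a later filter on the known service ids is what removes them.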

In [1074]:
"""
Requirement
To simplify the project, you only need to 
consider the following services, although
the data may include many others.
● Netflix -> 3
● Hulu -> 1
● CBS All Access -> 39
● Starz -> 8
● Showtime -> 12
"""

data.drop(data[~data['service_id'].isin([1, 3, 8, 12, 39])].index, inplace=True)

In [1075]:
# Add the service names corresponding 
# to the service id
data['service_name'] = data['service_id'].map(rules.set_index('service_id')['service_name'].to_dict())
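
The filter and the name lookup can be sketched together on toy data. The id-to-name pairs come from the requirement listed above; the `SERVICES` dictionary and the toy rows are assumptions standing in for the lookup built from `rules`.

```python
import pandas as pd

# Service ids and names taken from the requirement above.
SERVICES = {3: 'Netflix', 1: 'Hulu', 39: 'CBS All Access', 8: 'Starz', 12: 'Showtime'}

# Toy column after the regex step: ids where a rule matched, raw text otherwise.
toy = pd.DataFrame({
    'service_id': pd.Series([3, 'Gym membership', 39, 8], dtype=object),
})

# Keep only rows that resolved to one of the five services of interest.
kept = toy[toy['service_id'].isin(list(SERVICES))].copy()

# Look up the display name for each remaining id.
kept['service_name'] = kept['service_id'].map(SERVICES)
```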

In [1076]:
"""
Requirement
You still need to figure out what kind 
of action (signup versus cancellation) 
the remaining transactions are about.
If there are trial signups and cancellations,
please make your own judgment as to how 
to treat them.
"""
data['signal_type'] = data['description'].replace(signals, regex=True)
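
One consequence of the replacement above is that descriptions matching no keyword keep their original text in `signal_type`. A possible follow-up, shown here only as a sketch on made-up rows, is to blank out anything that is not a known label so that unclassified rows are easy to spot.

```python
import pandas as pd

# Same shape as the signals dictionary printed earlier in the notebook.
signals = {
    '^.*(cancelled|cancel).*$': 'cancellation',
    '^.*(coming|back|signup|signing|joining|welcome).*$': 'signup',
}

# Hypothetical descriptions for illustration.
toy = pd.DataFrame({
    'description': pd.Series(
        ['your plan was cancelled', 'welcome to Hulu', 'Monthly receipt'],
        dtype=object,
    ),
})

# Matching descriptions become a label; the rest keep their text.
toy['signal_type'] = toy['description'].replace(signals, regex=True)

# Keep only known labels; everything else becomes missing (NaN).
known_labels = list(signals.values())
toy['signal_type'] = toy['signal_type'].where(toy['signal_type'].isin(known_labels))
```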

In [1077]:
# Save the processed dataset locally
save_file(config.outputs['local']['fullpath'], data.to_csv())

**⭣ This is the local link to the processed file: ⭣**

In [1078]:
# Output Local URL
output(config.outputs['local']['fullpath'])

'/home/jupyter/data/processed_data.csv'


In [1079]:
"""
Save a copy of the processed dataset
to Google Cloud Storage. Use IAM roles
for authentication.

Requirement
Output data is accessible from a common
cloud storage (e.g. AWS S3 or Google Storage)
"""
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(config.outputs['cloud']['bucket_name'])
blob = bucket.blob(config.outputs['local']['filename'])
blob.upload_from_filename(config.outputs['local']['fullpath'])


**⭣ This is the Cloud URL to the processed file: ⭣**

In [1080]:
# Output Cloud URL
output(blob.public_url)

'https://storage.googleapis.com/antenna-task/processed_data.csv'
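
The URL printed above follows the `https://storage.googleapis.com/<bucket>/<object>` pattern. The sketch below rebuilds it with the standard library only, assuming the default storage endpoint; `expected_public_url` is a hypothetical helper, not part of the client library.

```python
from urllib.parse import quote

def expected_public_url(bucket_name, blob_name,
                        base_url='https://storage.googleapis.com'):
    """Build the URL of an object, percent-encoding the object name."""
    return f"{base_url}/{bucket_name}/{quote(blob_name, safe='/~')}"

# Bucket and object names taken from the output above.
url = expected_public_url('antenna-task', 'processed_data.csv')
```

Having this URL does not by itself make the object readable; that depends on the access settings of the bucket and object.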


In [1081]:
# print(data.to_string())