## Assignment ML/AI

Candidate: Karthik Kumar Billa <br>
Email: guildbilla@gmail.com

**Data**

Below is sample coupon data for a retailer. Each column (a tokenized value) is extracted from OfferDetails:
1. Product - a single value or an array
2. FaceValue - the savings on the offer, generally a number
3. UOM - unit of measure; the measuring unit for the product

In [1]:
import pandas as pd
import re
import nltk
import numpy as np
from textblob import TextBlob
# nltk.download('punkt')
# nltk.download('brown')
# nltk.download('averaged_perceptron_tagger')

In [2]:
df = pd.read_csv('coupons_ner.csv', names = ['OfferDetails'])
print(df.shape)
print('-'*50)
print(df.head(10))

(886, 1)
--------------------------------------------------
                                        OfferDetails
0  Save $2.00 ONE Downy Liquid Fabric Conditioner...
1  Save $2.00 ONE Tide PODS OR Tide Power PODS (e...
2  Save $2.00 ONE Tide Laundry Detergent (exclude...
3  SAVE $1.00 ON TWO when you buy TWO BOXES (8.9 ...
4  $3.00 OFF when you purchase any THREE (3) Pepp...
5  SAVE $1.11 when you buy any ONE (1) Familly Si...
6  SAVE $1.00 ON TWO when you buy TWO PACKAGES an...
7  Save $1.00 on any TWO (2) Sargento® Natural Ch...
8  $0.65 OFF On Any ONE (1) Oikos Greek Yogurt Cu...
9  $2.00 OFF ONE (1) SMALL bag of Eight O'Clock® ...


### Exercise 1

1. Write a function that takes OfferDetails as input and returns FaceValue as output
2. Run the function against all rows in the attached data and determine the function's accuracy (the higher, the better)

*Hint*: For Exercise 1, a regex function should do (ensure that all edge cases are handled)

In [3]:
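def FaceVal(x):
    '''
    Sum every whitespace-separated token of x that contains '$' or '¢' and
    return the total as a string such as '$2.50'.
    Returns '' when there is no such token, or only a single token without digits.
    '''
    l = []
    # Treat commas as separators, then split on whitespace
    s = x.replace(',',' ').split()
    for i in s:
        if '$' in i or '¢' in i:
            if '¢' in i:
                # Text before the first '¢' is taken as cents and converted to dollars
                n = i.split('¢')[0]
                if n!= '':
                    i = str(float(n)/100)
                else: i = '0'
            # Replace runs of non-alphanumeric characters, then rejoin the pieces with '.'
            i = '.'.join(re.sub('[^A-Za-z0-9]+',' ',i).split())
            l.append(i)
    if len(l) == 0 or l==['']:
        return ''
    elif len(l)>=1:
        temp = 0
        for k in l:
            if k !='':
                temp+= float((k.split('$')[-1]))
        # Pad totals like '2.5' to two decimals; longer strings such as '50.0' are returned unchanged
        if len(str(temp))==3:
            return '$'+str(temp)+'0'
        else:
            return '$'+str(temp)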

In [4]:
# Test cases
c = ['','s', 's50','$$', '¢¢', '50$', '¢¢50', '$2.0','$2 50¢','$2 50 ¢','$2 50 50¢)))', '$2 50 50¢))) $$$$05,.m 50 150¢']
for i in c:
    print(i,':',FaceVal(i))
    print('-'*50)

 : 
--------------------------------------------------
s : 
--------------------------------------------------
s50 : 
--------------------------------------------------
$$ : 
--------------------------------------------------
¢¢ : $0.00
--------------------------------------------------
50$ : $50.0
--------------------------------------------------
¢¢50 : $0.00
--------------------------------------------------
$2.0 : $2.00
--------------------------------------------------
$2 50¢ : $2.50
--------------------------------------------------
$2 50 ¢ : $2.00
--------------------------------------------------
$2 50 50¢))) : $2.50
--------------------------------------------------
$2 50 50¢))) $$$$05,.m 50 150¢ : $9.00
--------------------------------------------------


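### Comment: These test cases suggest that the FaceVal function handles the main edge cases when extracting Face Value from each text. Totals of $10 or more are printed with one decimal place (for example, '50$' gives '$50.0').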
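
The hint for Exercise 1 mentions a regex function. As a point of comparison, here is a minimal sketch of a purely regex-based extractor. It is not the method used above: the names `MONEY_RE` and `face_value_regex` are hypothetical, and it returns only the first amount it finds, whereas `FaceVal` sums every `$` and `¢` token.

```python
import re

# Hypothetical pattern: a dollar amount ("$2", "$2.00", "$1,000.50")
# or a cents amount written with the cent sign ("50¢").
MONEY_RE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d{1,2})?)|(\d+(?:\.\d+)?)\s*¢")

def face_value_regex(text):
    """Return the first monetary amount in `text` as '$D.CC', or '' if none is found."""
    match = MONEY_RE.search(text)
    if match is None:
        return ""
    dollars, cents = match.groups()
    if dollars is not None:
        # Drop thousands separators before converting
        amount = float(dollars.replace(",", ""))
    else:
        # Cents are converted to dollars
        amount = float(cents) / 100
    # Always format with two decimals, whatever the magnitude
    return f"${amount:.2f}"
```

Taking the first match is a design choice that assumes the face value is stated first in the offer; whether that holds for all rows would need checking against labeled data.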

In [5]:
df['FaceValue'] = df['OfferDetails'].map(FaceVal)
print(df.head())

                                        OfferDetails FaceValue
0  Save $2.00 ONE Downy Liquid Fabric Conditioner...     $2.00
1  Save $2.00 ONE Tide PODS OR Tide Power PODS (e...     $2.00
2  Save $2.00 ONE Tide Laundry Detergent (exclude...     $2.00
3  SAVE $1.00 ON TWO when you buy TWO BOXES (8.9 ...     $1.00
4  $3.00 OFF when you purchase any THREE (3) Pepp...     $3.00
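
Step 2 of the exercise asks for the function's accuracy. One way to measure it, sketched here under the assumption that a hand-labeled list of expected face values is available, is exact-match accuracy. The helper `exact_match_accuracy` is hypothetical, and the labels below are illustrative only: they were read by hand from the five offers printed above and say nothing about the full data set.

```python
def exact_match_accuracy(predicted, expected):
    """Fraction of positions where the predicted string equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(predicted)

# Illustrative labels only, read by hand from the five offers shown above.
expected = ["$2.00", "$2.00", "$2.00", "$1.00", "$3.00"]
# FaceValue column as printed by df.head()
predicted = ["$2.00", "$2.00", "$2.00", "$1.00", "$3.00"]
print(exact_match_accuracy(predicted, expected))
```

With a labeled column in the DataFrame (say a hypothetical `df['Label']`), the same helper could be called as `exact_match_accuracy(df['FaceValue'].tolist(), df['Label'].tolist())`.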


### Exercise 2
1. Write a function that takes OfferDetails as input and returns Product as output (Product can be a single value or an array)
2. Run the function against all rows in the attached data and determine the function's accuracy (the higher, the better)

*Hint*: For Exercise 2, build a corpus of products (built manually) and go from there. Mention the order of complexity (in any measure: Big O, memory, CPU, etc.). For Exercise 2, our expectation is that you implement an NER model.

In [6]:
# https://textblob.readthedocs.io/en/dev/
def prod(txt):
    '''
    Return the noun phrases that TextBlob finds in txt, capitalized.
    TextBlob takes O(N) run time and O(vocab) space.
    '''
    txt = txt.lower()
    txt = re.sub('packs','', txt)
    txt = re.sub('trial','', txt)
    txt = re.sub('travel','', txt)
    txt = re.sub('size','', txt)
    txt = re.sub('save','', txt)
    txt = re.sub('excludes','', txt)  
    blob = TextBlob(txt)
    return [i.capitalize() for i in blob.noun_phrases]
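
`prod` removes its filler terms as plain substrings, so a term like 'size' would also be cut out of a longer word such as 'sized'. A minimal sketch of a whole-word variant follows; `FILLER_RE` and `strip_fillers` are hypothetical names, and the extra whitespace normalization is an assumption that may change what TextBlob sees downstream.

```python
import re

# The same filler terms that prod() strips, matched as whole words only.
FILLER_RE = re.compile(r"\b(?:packs|trial|travel|size|save|excludes)\b")

def strip_fillers(text):
    """Lowercase `text` and remove the filler terms without touching longer words."""
    cleaned = FILLER_RE.sub("", text.lower())
    # Collapse the gaps left behind by the removed words
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `strip_fillers("Save on any family sized packs")` keeps 'sized' intact, while a plain substring removal of 'size' would leave a stray 'd'.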

In [7]:
%%time
df['Products'] = df['OfferDetails'].map(prod)

Wall time: 9.9 s


In [8]:
for i in range(0,100,7):
    print('Sample Text:')
    print(df['OfferDetails'][i])
    print('\nProducts:')
    print(df['Products'][i])
    print('='*50)

Sample Text:
Save $2.00 ONE Downy Liquid Fabric Conditioner 72 ld or larger (includes Downy Odor Protect 48 oz or larger OR Downy Wrinkle Guard 40 oz or larger OR Downy Nature Blends 67 oz or larger) OR Bounce/Downy Sheets 130 ct or larger (includes Bounce/Downy Wrinkle Guard 80 ct or larger) OR In Wash Scent Boosters 8.6 oz or larger (includes Downy Unstopables, Fresh Protect, Odor Protect, and Infusions) (excludes Downy Libre Enjuague, Gain Fireworks, and trial/travel size).

Products:
['Downy liquid fabric conditioner', 'Downy odor', 'Downy wrinkle guard', 'Downy nature blends', 'Bounce/downy sheets', 'Bounce/downy wrinkle guard', 'Scent boosters', 'Downy unstopables', 'Downy libre enjuague', 'Gain fireworks']
Sample Text:
Save $1.00 on any TWO (2) Sargento® Natural Cheese Slices

Products:
['Natural cheese slices']
Sample Text:
$1.00 OFF on any THREE (3) noosa® yoghurts

Products:
['Noosa® yoghurts']
Sample Text:
SAVE $1.25 on any ONE (1) crunchy Crunchmaster® Crackers or Snacks

P

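### Comment: These sample results suggest that the `prod` function extracts a usable list of Products from each text, although it can drop brand names (for example 'Sargento®') and also returns products that the offer excludes (for example 'Gain fireworks').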
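
The hint for Exercise 2 suggests starting from a manually built corpus of products. A minimal sketch of that idea is a corpus lookup. This is neither an NER model nor the TextBlob approach used above; `PRODUCT_CORPUS` is a hypothetical, hand-picked list and `products_from_corpus` is a hypothetical helper.

```python
import re

# Hypothetical hand-built corpus; a real one would be curated from the coupon data.
PRODUCT_CORPUS = [
    "tide pods",
    "tide laundry detergent",
    "natural cheese slices",
    "oikos greek yogurt",
]

# Longest phrases first, so the alternation prefers the longer of two overlapping entries.
_CORPUS_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(PRODUCT_CORPUS, key=len, reverse=True))
    + r")\b"
)

def products_from_corpus(text):
    """Return corpus products found in `text`, in order of first appearance, without duplicates."""
    found = []
    for match in _CORPUS_RE.finditer(text.lower()):
        phrase = match.group(0)
        if phrase not in found:
            found.append(phrase)
    # Capitalize to mirror the output style of prod()
    return [p.capitalize() for p in found]
```

On complexity: Python's `re` engine tries the alternatives at each position, so the run time grows roughly with the text length times the corpus size, and the memory with the corpus size. Products missing from the corpus are never returned, which is the main trade-off against the noun-phrase approach.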