## NOTES
- `coupons_brand_manuf_retailer.csv` is the original source file and is not included in the repo.
- You can start execution from the FASTTEXT block.
- To do so, first extract `coupons_fast.txt` from `coupons_fast.tar.gz`.

In [4]:
import pandas as pd
df = pd.read_csv("coupons_brand_manuf_retailer.csv")
df["SINGLE_LINE"] = df["labels"] + "  " + df["Offerdetails"]
df

Unnamed: 0,labels,Offerdetails,SINGLE_LINE
0,__label__1800 __label__ProximoSpirits __label_...,$3.00 OFF | 1800 and Gran Centenario $3.00 OFF...,__label__1800 __label__ProximoSpirits __label_...
1,__label__1800 __label__ProximoSpirits __label_...,$4.00 OFF | 1800 Ultimate Margarita® $4.00 OFF...,__label__1800 __label__ProximoSpirits __label_...
2,__label__1924 __label__DelicatoFamilyVineyards...,"$4.00 REBATE | 1924, Gnarly Head, Three Finger...",__label__1924 __label__DelicatoFamilyVineyards...
3,__label__7UP __label__DRPEPPERSNAPPLEGROUP __l...,Save$1.00 on THREE (3) 12-pack cans of any fla...,__label__7UP __label__DRPEPPERSNAPPLEGROUP __l...
4,__label__9Elements __label__TheProcter&GambleC...,Save $1.00 on 9 Elements Multi-Purpose or Bath...,__label__9Elements __label__TheProcter&GambleC...
...,...,...,...
220015,__label__Zyrtec __label__Johnson&JohnsonInc. _...,Save $10.00 on Adult ZYRTEC® product when you ...,__label__Zyrtec __label__Johnson&JohnsonInc. _...
220016,__label__ZzzQuil __label__TheProcter&GambleCom...,SAVE $0.50 ONE ZzzQuil OR PURE Zzzs Product (e...,__label__ZzzQuil __label__TheProcter&GambleCom...
220017,__label__ZzzQuil __label__TheProcter&GambleCom...,Save $0.50 | ZzzQuil ONE ZzzQuil OR PURE Zzzs ...,__label__ZzzQuil __label__TheProcter&GambleCom...
220018,__label__ZzzQuil __label__TheProcter&GambleCom...,Save $0.50 ONE ZzzQuil OR PURE Zzzs Product (e...,__label__ZzzQuil __label__TheProcter&GambleCom...


In [6]:
# header=False keeps the column name "SINGLE_LINE" out of the fastText input file
df["SINGLE_LINE"].to_csv("coupons_fast.txt", index=False, header=False)
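
fastText reads one example per line, while `to_csv` applies CSV quoting to fields that contain commas or quotes, so a line can end up starting with `"__label__...` instead of `__label__...`. A minimal sketch of writing the lines directly instead; the helper names `to_fasttext_line` and `write_fasttext_file` and the sample row are hypothetical and not part of the original notebook:

```python
def to_fasttext_line(labels, text):
    """Join labels and text into a single line, collapsing all whitespace
    (including embedded newlines) into single spaces."""
    return " ".join(f"{labels} {text}".split())


def write_fasttext_file(rows, path):
    """Write (labels, text) pairs to `path`, one example per line, no CSV quoting."""
    with open(path, "w", encoding="utf-8") as handle:
        for labels, text in rows:
            handle.write(to_fasttext_line(labels, text) + "\n")


# Hypothetical row with a comma, quotes and a newline in the offer text
sample_labels = "__label__BrandA __label__RetailerA"
sample_text = 'Save $1.00, on "any"\nONE item'
sample_line = to_fasttext_line(sample_labels, sample_text)
```

With the dataframe above, this could be called as `write_fasttext_file(zip(df["labels"], df["Offerdetails"]), "coupons_fast.txt")`.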

## FASTTEXT
- `head -n 200000 coupons_fast.txt > coupons.train`
- `tail -n 20026 coupons_fast.txt > coupons.valid`
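
The dataframe preview above suggests the rows are ordered by brand (from `1800` to `ZzzQuil`), so a head/tail split may leave the validation file with brands the model never saw during training. A minimal sketch of a shuffled split, assuming the whole file fits in memory; `shuffled_split`, the 10% default and the seed are illustrative choices, not the notebook's method:

```python
import random


def shuffled_split(lines, valid_fraction=0.1, seed=0):
    """Shuffle a copy of `lines` and return (train, valid) lists."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy stand-in for the lines of coupons_fast.txt
lines = [f"__label__brand{i} offer text {i}" for i in range(10)]
train, valid = shuffled_split(lines, valid_fraction=0.2, seed=42)
```

The two lists can then be written to `coupons.train` and `coupons.valid`, one line each.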

In [10]:
import fasttext
model = fasttext.train_supervised(input='coupons.train')
model.save_model("coupons.bin")

Read 3M words
Number of words:  32021
Number of labels: 9986
Progress: 100.0% words/sec/thread:    5456 lr:  0.000000 avg.loss:  5.377494 ETA:   0h 0m 0s


## Performance
- If your feature benefits users most by producing fewer false positives, consider optimizing for precision.
- If your feature benefits users most by producing fewer false negatives, consider optimizing for recall.

`Precision interpretation`: In the coupons case, if the model says a coupon belongs to a retailer and we make a trip there only to find it does not, that time is lost. So we need as few `false positives` as possible.

`Recall interpretation`: Let's say k=5 and Publix appears in 4th position (and in reality the coupon belongs to Publix). We might still ignore Publix that far down the list, so in practice this is a `false negative`.

Precision and recall are generally in a tug-of-war. **In the coupons case, we need to optimize for precision MORE than recall.**
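
To make the two metrics concrete, here is a minimal sketch of micro-averaged precision and recall at k, which is how we read the numbers returned by `model.test` below: precision divides the correct predictions by the number of predictions made, and recall divides them by the number of true labels. The function name and the toy labels are hypothetical.

```python
def precision_recall_at_k(predictions, gold, k):
    """predictions: one ranked list of labels per example.
    gold: one set of true labels per example.
    Returns (precision@k, recall@k), micro-averaged over all examples."""
    hits = 0      # predicted labels that are actually correct
    n_pred = 0    # total labels predicted (at most k per example)
    n_gold = 0    # total true labels
    for ranked, truth in zip(predictions, gold):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in truth)
        n_pred += len(top_k)
        n_gold += len(truth)
    return hits / n_pred, hits / n_gold


# Toy data: two examples with ranked predictions and their true label sets
toy_predictions = [["a", "b", "c"], ["d", "e", "f"]]
toy_gold = [{"a", "c"}, {"e"}]

p1, r1 = precision_recall_at_k(toy_predictions, toy_gold, k=1)
p3, r3 = precision_recall_at_k(toy_predictions, toy_gold, k=3)
```

Each coupon line appears to carry several labels, so with a single prediction per coupon (k=1) recall cannot reach 1.0 even if every prediction is correct. Raising k trades precision for recall.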

In [16]:
import fasttext

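# Reuse the trained model if it is still in memory; otherwise load it from disk
if 'model' not in locals():
    model = fasttext.load_model("coupons.bin")
# Returns (number of examples, precision, recall) at k=1
model.test("coupons.valid")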

(20026, 0.8766603415559773, 0.2937406930245788)

In [17]:
import fasttext

if 'model' not in locals():
    model = fasttext.load_model("coupons.bin")
# Returns (number of examples, precision, recall) at k=5
model.test("coupons.valid", k=5)

(20026, 0.4267552182163188, 0.7149597604028979)

In [3]:
import fasttext

if 'model' not in locals():
    model = fasttext.load_model("coupons.bin")
# Precision, Recall @k=3
model.test("coupons.valid", k=3)



(20026, 0.6362562002729785, 0.639566985125571)

In [4]:
import fasttext

if 'model' not in locals():
    model = fasttext.load_model("coupons.bin")
# Precision, Recall @k=2
model.test("coupons.valid", k=2)

(20026, 0.7399131129531609, 0.49584218716013856)
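
One way to pick k from the four results above is an F-beta score, which folds precision and recall into a single number; beta below 1 weights precision more heavily, in line with the preference stated earlier. This is a sketch only: beta=0.5 is an illustrative assumption, and the helper names are hypothetical.

```python
# (precision, recall) on coupons.valid for each k, copied from the cells above
results = {
    1: (0.8766603415559773, 0.2937406930245788),
    2: (0.7399131129531609, 0.49584218716013856),
    3: (0.6362562002729785, 0.639566985125571),
    5: (0.4267552182163188, 0.7149597604028979),
}


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


def best_k(results, beta=1.0):
    """Return the k whose (precision, recall) pair has the highest F-beta."""
    return max(results, key=lambda k: f_beta(*results[k], beta=beta))


for k, (p, r) in sorted(results.items()):
    print(f"k={k}  P={p:.3f}  R={r:.3f}  F0.5={f_beta(p, r, beta=0.5):.3f}")
```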

In [15]:
model.predict("Save $1.50 on any ONE (1) Rice Select® Product",k=5)

(('__label__Publix',
  '__label__StopAndShop',
  '__label__FamilyDollar',
  '__label__GlaxosmithklineConsumerHealthcare,Lp.',
  '__label__Biotene'),
 array([0.3831206 , 0.06342689, 0.02612786, 0.02133347, 0.01969311]))
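
Since precision matters most here, one option is to act only on predictions whose probability clears a cut-off. A minimal sketch that post-processes the output above; the 0.3 cut-off and the name `confident_labels` are assumptions for illustration, and the probabilities are copied into a plain list so the example runs without NumPy.

```python
# Labels and probabilities copied from the predict() output above
labels = (
    "__label__Publix",
    "__label__StopAndShop",
    "__label__FamilyDollar",
    "__label__GlaxosmithklineConsumerHealthcare,Lp.",
    "__label__Biotene",
)
probs = [0.3831206, 0.06342689, 0.02612786, 0.02133347, 0.01969311]


def confident_labels(labels, probs, min_prob=0.3, prefix="__label__"):
    """Strip the fastText label prefix and keep predictions with prob >= min_prob."""
    kept = []
    for label, prob in zip(labels, probs):
        if prob >= min_prob:
            name = label[len(prefix):] if label.startswith(prefix) else label
            kept.append((name, float(prob)))
    return kept


kept = confident_labels(labels, probs)
```

`model.predict` also accepts a `threshold` argument that applies a similar cut-off inside fastText itself.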

## Using bigrams

In [2]:
import fasttext
model_bigram = fasttext.train_supervised(input="coupons.train", lr=1.0, epoch=25, wordNgrams=2, thread=3)
model_bigram.save_model("coupons_bigram.bin")

Read 3M words
Number of words:  32021
Number of labels: 9986
Progress: 100.0% words/sec/thread:    5431 lr:  0.000000 avg.loss:  2.842814 ETA:   0h 0m 0s


In [4]:
model_bigram.predict("Save $1.50 on any ONE (1) Rice Select® Product",k=5)

(('__label__Publix',
  '__label__FamilyDollar',
  '__label__StopAndShop',
  '__label__FoodLion',
  '__label__UNILEVER'),
 array([0.83732271, 0.06979061, 0.03113816, 0.00591142, 0.00197584]))
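
A single prediction does not show whether the bigram model is better overall. A sketch of a helper that collects precision and recall at several values of k, so both models can be compared on `coupons.valid`; `evaluate_at_ks` is a hypothetical name, and it only assumes the model exposes `test(path, k=...)` returning `(n, precision, recall)` as in the cells above.

```python
def evaluate_at_ks(model, path, ks=(1, 2, 3, 5)):
    """Run model.test(path, k=k) for each k and collect the results in a dict."""
    results = {}
    for k in ks:
        n, precision, recall = model.test(path, k=k)
        results[k] = {"n": n, "precision": precision, "recall": recall}
    return results
```

With the models trained above, this could be called as `evaluate_at_ks(model, "coupons.valid")` and `evaluate_at_ks(model_bigram, "coupons.valid")`.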