In [38]:
import pandas as pd
import transformers
import shap
import numpy as np
import gzip
import json
pd.set_option('display.max_colwidth', None)

#### Load the Amazon product reviews dataset "Arts_Crafts_and_Sewing_5"
##### The dataset was obtained from https://jmcauley.ucsd.edu/data/amazon_v2/
##### The SHAP usage examples follow those at https://shap.readthedocs.io/

In [5]:
# Parse the gzipped JSON Lines file (one JSON object per line)
data = []
with gzip.open('Arts_Crafts_and_Sewing_5.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
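
The parsing cell assumes the file holds one JSON object per line, compressed with gzip. A minimal sketch of that layout, using an in-memory buffer instead of the real file; the two sample records are invented for illustration and do not come from the dataset.

```python
import gzip
import io
import json

# Hypothetical records, made up for this sketch (not taken from the dataset).
sample = [
    {"overall": 5.0, "reviewText": "Great scissors."},
    {"overall": 2.0, "reviewText": "Broke after a week."},
]

# Write the records as gzipped JSON Lines into an in-memory buffer.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read them back line by line, skipping any blank lines.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
```

Each line is parsed independently, so a malformed record can be located by its line number without loading the whole file first.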

In [7]:
# Load the records into a pandas DataFrame, keep the rating and text columns, and drop duplicates
df = pd.DataFrame.from_dict(data)
df = df[['overall', 'reviewText']]
df = df.drop_duplicates()

In [39]:
df.sample(5)

Unnamed: 0,overall,reviewText
74515,5.0,These are so fun and cute. I love using them in card making and scrapbooking.
178163,4.0,I got this product damaged just the very tip of it was broken but it didn't effect my measurements. I absolutely love this ruler and also has very good grips so my fabric isn't sliding all over the place.
219006,5.0,Very sweet and looks great on my tree.
386637,2.0,"UPDATE:\nAfter a few days of use and watching videos online. This machine is more of a self contained computer free cutter. The touch screen lets you do a bit of very simple tasks. So for example, if you wanted to do vinyl or stickers, being its scan based...this basically means that you design and print out your projects separately and then scan the work in and it will cut things out. One of the issues with it this way is that if colors are light, the scanner may not pick them up and it won't cut that portion. You would need to put a dark outline for the scanner to pick it up.\n\nFor simple projects this would work but if you need more control and precision, I found the SNC more difficult to use.\n\nORIGINAL REVIEW:\n\nHorrible Setup, Terrible Web Based program and YOU CAN'T PRINT YOUR DESIGN to a printer!!!\n\nOn to making my first Scan N Cut... there's no USB cable supplied and if you want the machine to work wirelessly you have to shell out another $50 for a code to unlock this feature. After an hour I finally had to call tech support. Couldn't register the machine either. The C.S. person told me to use a USB thumb drive for everything! They don't want you to use a USB i guess hence the reason they don't give you one because it doesn't work that way.\n\nI spent an hour making nice fancy shapes with colors and words for those planner stickers. You can't print your design to a printer from the ""web based"" program. So how do you get your work to print out to cut? Couldn't find it in the instruction manual and clicked every button and option on the web application. So downloaded something from the internet, printed it and then set it to scan. Well, it only scanned maybe half of the shapes and the scan was so jagged and poor. 
I tried every type of scenario to get the scan to come out right and nothing worked.\n\nSo...i figured let me just see how it cuts with half the shapes made....well according to the blade settings for sticker paper i'm supposed to use a 4 setting but thought that was a bit much so I started with a 2....nothing cut. I then used the recommended 4 and what a mess!!! it cut the shapes out rather then just the top layer of the paper. I attached photos.\n\nI own the Silhouette Cameo and it lets you do everything on the program very easily. Its not a web based program and it has full features unlike the Brother. What a disaster, waste of an afternoon and the mess that followed. I mush prefer the Silhouette or Cricut machines!"
244358,4.0,"Nice looking, just a little too long for my craft project."


## Sentiment Analysis

In [17]:
# Truncate each review to its first 500 characters so it fits the model's input limit.
# Keep only the first 100 reviews.
texto_acortado = [v[:500] for v in df["reviewText"][:100]]

In [14]:
# Define the sentiment classifier; return_all_scores=True returns a score for every label
classifier = transformers.pipeline('sentiment-analysis', return_all_scores=True)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [36]:
# Classify the first two reviews
classifier(texto_acortado[:2])

[[{'label': 'NEGATIVE', 'score': 0.000545727729331702},
  {'label': 'POSITIVE', 'score': 0.9994543194770813}],
 [{'label': 'NEGATIVE', 'score': 0.0003571246052160859},
  {'label': 'POSITIVE', 'score': 0.9996428489685059}]]

In [40]:
df[:2]

Unnamed: 0,overall,reviewText
0,4.0,Contains some interesting stitches.
1,5.0,"I'm a fairly experienced knitter of the one-color or color block intarsia vein, rather than a Fair Isle maestro, and what I loved best about this stitch guide is the multitude of reversible stitch patterns offered and shown reverse and obverse. If you knit and love to accumulate guides, stitch dictionaries, pattern books and design-your-own project books, this is a great resource. I find I'm always adapting knitting patterns slightly or significantly to swap out cables, add interesting borders so I can knit edges and body at the same time, what have you.\n\nThis gives you enough classic stitches to satisfy but its strength is in fresh twists on the usual or entirely new (at least to me) options in textured, lace, cables&cross stitches, slip st., and novelties. As others note, the stitches are arranged from simplest to most challenging in each section-- also a great help when deciding how much sweat and tears I'm wiling to expend.\n\nThis also does not frustrate me in the ways too many other books and guides do.\nHere, Leapman uses the symbols common in knitting magazines and most books on knitting you've seen. Yay!\nOne of my peeves with some designers and guidebooks is the use of their own symbology for charting various stitches. Alice Starmore leaps to mind as a woman living in her own private Idaho filled with her own Runic symbols. I have to translate her every chart's cable squiggles into the symbols I'm more familiar with just to get the thing going. Gorgeous end results, but geez.\n\nThis is concise but full of options, so I FIND something nifty quickly and easily.\nWhile I do love browsing Barbara Walker's stitch books, sometimes I'm just looking for a simple option and don't want to spend all day combing through umpteen volumes, each of which has its own section for lace, k/p, cables, slip st.s, panels, etc. 
I wish Walker would compile all her lace, all cables, all k/p, all color work, all panels, etc into huge sections in just ONE encyclopedia.\n\nUntil that happens, this is my new go-to for fast, interesting ideas to enhance my knitting.\n\nThis is not a great resource for traditional knitting stitch patterns, such as gansey (guernsey), it's perhaps best for 'modern' classic knitting."
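
The first two reviews are rated 4 and 5 stars, and the classifier scores both as overwhelmingly POSITIVE. A minimal sketch of checking that agreement programmatically from the nested score lists. The scores are copied from the classifier output above (rounded), and the rule that 4 stars or more counts as positive is an assumption made for illustration, not something defined by the dataset or the model.

```python
# Classifier output for the first two reviews, copied from above and rounded.
predictions = [
    [{"label": "NEGATIVE", "score": 0.0005457}, {"label": "POSITIVE", "score": 0.9994543}],
    [{"label": "NEGATIVE", "score": 0.0003571}, {"label": "POSITIVE", "score": 0.9996428}],
]
ratings = [4.0, 5.0]  # 'overall' values of the same two reviews

def top_label(scores):
    """Return the label with the highest score for one review."""
    return max(scores, key=lambda item: item["score"])["label"]

def rating_to_label(stars, threshold=4.0):
    """Hypothetical mapping: ratings at or above the threshold count as POSITIVE."""
    return "POSITIVE" if stars >= threshold else "NEGATIVE"

predicted = [top_label(scores) for scores in predictions]
agreement = [p == rating_to_label(r) for p, r in zip(predicted, ratings)]
```

With a different threshold, such as treating 3-star reviews as positive, the agreement on a larger sample would change, so the threshold should be stated wherever such a comparison is reported.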


## Explanation with SHAP

In [34]:
# Wrap the pipeline in a SHAP explainer and explain the first two reviews
explainer = shap.Explainer(classifier)
shap_values = explainer(texto_acortado[:2])

  0%|          | 0/498 [00:00<?, ?it/s]

Partition explainer: 3it [00:38, 38.68s/it]                                     


In [35]:
# Plot each token's contribution to the POSITIVE class score
shap.plots.text(shap_values[:,:,"POSITIVE"])

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
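
The text plot attributes the POSITIVE score to individual tokens. Those attributions are grounded in Shapley values, which the Partition explainer approximates rather than computing by enumerating every token subset. The sketch below computes exact Shapley values for a three-token toy example. The word weights and the `toy_score` function are hypothetical stand-ins and have nothing to do with the DistilBERT model used above.

```python
from itertools import combinations
from math import factorial

# Hypothetical word weights for a toy "positive sentiment" score (not the real model).
WEIGHTS = {"love": 2.0, "broken": -1.5, "ruler": 0.0}

def toy_score(tokens):
    """Toy value function: sum of word weights plus one interaction term."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    # 'love' and 'ruler' together earn a bonus, so the function is not purely additive.
    if "love" in tokens and "ruler" in tokens:
        score += 1.0
    return score

def shapley_values(tokens, value_fn):
    """Exact Shapley values by enumerating every subset of the other tokens."""
    n = len(tokens)
    values = {}
    for i, token in enumerate(tokens):
        others = tokens[:i] + tokens[i + 1:]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that the token joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_token = value_fn(list(subset) + [token])
                without_token = value_fn(list(subset))
                total += weight * (with_token - without_token)
        values[token] = total
    return values

tokens = ["love", "broken", "ruler"]
phi = shapley_values(tokens, toy_score)
```

The values in `phi` add up to the difference between the score of the full token list and the score of the empty list, which is the additivity property that lets the plot show contributions stacking from the base value to the model output. Exact enumeration needs a number of model calls that grows exponentially with the token count, which is why it is only practical for toy inputs like this one.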
