# Amazon Q/A data

src: http://jmcauley.ucsd.edu/data/amazon/qa/

## Dataset contains the following features:

    asin - ID of the product, e.g. B000050B6Z
    questionType - type of question. Could be 'yes/no' or 'open-ended'
    answerType - type of answer. Could be 'Y', 'N', or '?' (if the polarity of the answer could not be predicted). Only present for yes/no questions.
    answerTime - raw answer timestamp
    unixTime - answer timestamp converted to unix time
    question - question text
    answer - answer text
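As an illustration of how `answerTime` and `unixTime` relate, here is a minimal sketch that converts a raw timestamp string to Unix time. The helper name `answer_time_to_unix` is hypothetical, and interpreting the date as midnight UTC is an assumption; the `unixTime` values shipped with the dataset may have been computed in a different time zone, so small offsets from this result are to be expected.

```python
from datetime import datetime, timezone

def answer_time_to_unix(answer_time: str) -> int:
    """Parse a string such as 'Mar 8, 2014' into seconds since the epoch.

    Assumption: the date is taken as midnight UTC. The dataset's own
    unixTime column may use another time zone.
    """
    parsed = datetime.strptime(answer_time, "%b %d, %Y")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

print(answer_time_to_unix("Mar 8, 2014"))
```

A helper like this could be one way to fill in rows where `unixTime` is missing, as long as the time zone caveat is acceptable for the analysis.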

# Load downloaded Amazon data for analysis

I downloaded the appliances, cell phones & accessories, electronics, office products, software, and tools & home improvement categories. These seemed to me the most likely to generalize to other domains.

In [1]:
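import ast
import gzip
import os

import pandas as pd

def parse(path):
  # each line of the gzipped file is a Python dict literal (not strict JSON);
  # ast.literal_eval parses literals only, so unlike eval it cannot run code
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for l in g:
      yield ast.literal_eval(l)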

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

dfs = []
for root, _, files in os.walk('./data'):
    for f in files:
        dfs.append(getDF(os.path.join(root, f)))

df_all = pd.concat(dfs)
df_all['question_len'] = df_all.question.apply(len)

# split into open-ended and close-ended (yes/no) questions
open_ended = df_all[df_all['questionType']=='open-ended'].reset_index(drop=True)
close_ended = df_all[df_all['questionType']=='yes/no'].reset_index(drop=True)

In [2]:
df_all

Unnamed: 0,questionType,asin,answerTime,unixTime,question,answerType,answer,question_len
0,yes/no,1466736038,"Mar 8, 2014",1.394266e+09,Is there a SIM card in it?,Y,Yes. The Galaxy SIII accommodates a micro SIM ...,26
1,open-ended,1466736038,"Aug 4, 2014",1.407136e+09,Why hasnt it upgraded to latest Android OS 4.4...,,"My S3 was able to upgrade to 4.4.2 last week, ...",99
2,yes/no,1466736038,"Jan 29, 2015",1.422518e+09,"Is this phone new, with 1 year manufacture war...",?,It is new but I was not able to get it activat...,52
3,yes/no,1466736038,"Nov 30, 2014",1.417334e+09,can in it be used abroad with a different carr...,Y,Yes,50
4,yes/no,1466736038,"Nov 24, 2014",1.416816e+09,Is this phone brand new and NOT a mini?,?,The phone we received was exactly as described...,39
...,...,...,...,...,...,...,...,...
314258,yes/no,BT008UKTMW,"Feb 20, 2015",1.424419e+09,Is the space from bottom of desktop to tray ad...,N,No,81
314259,yes/no,BT008UKTMW,"Oct 13, 2014",1.413184e+09,can the mouse extension be mounted on the LEFT...,Y,"yes, you can put it on which ever side you want",51
314260,yes/no,BT008UKTMW,"Feb 26, 2014",1.393402e+09,does it come with all the hardware,Y,"It's been a while since I bought this, but I'm...",34
314261,open-ended,BT008UKTMW,"Nov 8, 2013",1.383898e+09,how wide is it? I need a 19 inch length tray f...,,We just measured the tray and it is 21 inches ...,63


## Data quality check

In [3]:
# unixTime isn't needed for this analysis, and answerType only exists for yes/no
# questions (its NA count matches the number of open-ended questions), so these NAs are fine.
df_all.isna().sum()

questionType         0
asin                 0
answerTime           0
unixTime         17577
question             0
answerType      272006
answer               0
question_len         0
dtype: int64
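One way to confirm that the missing `answerType` values come only from open-ended questions is to count the missing values per question type. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; the real check would run on df_all.
toy = pd.DataFrame({
    "questionType": ["yes/no", "open-ended", "yes/no", "open-ended"],
    "answerType": ["Y", None, "?", None],
})

# Number of missing answerType values within each question type.
missing_by_type = toy["answerType"].isna().groupby(toy["questionType"]).sum()
print(missing_by_type)

# True when answerType is missing for open-ended rows and only for those rows.
only_open_ended_missing = bool(
    (toy["answerType"].isna() == (toy["questionType"] == "open-ended")).all()
)
print(only_open_ended_missing)
```

If the same check on `df_all` reported missing values for yes/no rows, that would point to a real data quality problem rather than an expected gap.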

In [4]:
# classes are roughly balanced (about 52% yes/no vs. 48% open-ended)
df_all.questionType.value_counts()

yes/no        292465
open-ended    272006
Name: questionType, dtype: int64
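To put a number on the balance, the counts above can be turned into shares. This is a small sketch that hardcodes the two reported counts; on the real data, `df_all.questionType.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"yes/no": 292465, "open-ended": 272006})

# Share of each class in the full dataset.
shares = counts / counts.sum()
print(shares.round(3))
```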

Checking question length. They're relatively short, but some seem a bit too short!

In [5]:
df_all.question.apply(len).describe()

count    564471.000000
mean         72.184810
std          42.718118
min           1.000000
25%          38.000000
50%          60.000000
75%         102.000000
max         293.000000
Name: question, dtype: float64

For comparison, I'm choosing two arbitrary length cutoffs: 10 characters for close-ended questions and 15 characters for open-ended ones.

In [6]:
close_ended[close_ended['question_len']<=10].question.unique()

array(['can I text', 'is it 4g', 'is it good', 'Is it 4G?', 'Is it GSM',
       'is it 3g?', 'is it a 4g', 'have wifi?', 'have flash',
       'is this 4g', 'is gsm', 'Is it 3g ?', 'Is GSM?', 'is it 4G?',
       'is it GSM?', 'is it real', 'have gps', 'is it 3g', 'is this 3g',
       'is android', 'is gsm?', 'is it free', 'is unloock', 'has wifi?',
       'Has wi fi?', 'has flash?', 'is it ios7', 'IS WHITE?', 'is cdma?',
       'IS IT NEW?', 'is GSM?', 'is FHD?', 'is it 3G?', 'is GSM??',
       'is it soft', 'is unlock?', 'is it slim', 'is it flat',
       'Is it Y1?', 'is it led', 'Is it m33?', 'Is it UL', 'Is it LED?',
       'Is it tang', 'Is it 7.5v', 'is it 110?', 'Is it 110V',
       'is it loud', 'Do I Dare?', 'Is it fun?', 'is it 44mm',
       'can i scan', 'can it fax', 'Can it fax', 'can u fax',
       'is duplex?', 'IS NTSC?', 'is it 5m', 'is it PCI?', 'Is it loud',
       'is pair?', 'can I edit', 'Is it 3D?', 'Is it fast', 'is this v4',
       'is it hd', 'Is It PTZ?', 'Is 

I don't like the short open-ended questions (see the output of the next cell), since they would hinder the generalizability of the model (e.g. classifying 'iphone' as an open-ended question, which would force me to handle inputs that are not questions at all). Thus I will filter them out during preprocessing.

On the other hand, the close-ended questions (our negative class) look fine even when they're very short. Hence I'll keep them as is.

In [7]:
open_ended[open_ended['question_len']<=15].question.unique()[:20]

array(['iphone', 'motorola?', 'Size', 'how big is it?', 'parts', 'Music',
       'w260g', 'WAT color ?', 'htc one', 'Note 3', 'iphone 5 fit?',
       'real or fake?', 'music', 'HOW LONG IS IT', 'caller id', 'ipad',
       'a2dp', 'USA Version?', 'gps', 'How long is it?'], dtype=object)
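The filtering described above could be sketched as follows. The helper `drop_short_open_ended` and the constant `MIN_OPEN_ENDED_LEN` are hypothetical names, and the cutoff of 15 characters is simply the arbitrary value used in the cell above, not a tuned threshold. The long question in the toy frame is made up for illustration.

```python
import pandas as pd

MIN_OPEN_ENDED_LEN = 15  # assumed cutoff, taken from the exploration above

def drop_short_open_ended(df, min_len=MIN_OPEN_ENDED_LEN):
    """Drop open-ended questions of at most min_len characters.

    Close-ended (yes/no) questions are kept regardless of length.
    """
    is_short_open = (df["questionType"] == "open-ended") & (df["question_len"] <= min_len)
    return df[~is_short_open].reset_index(drop=True)

# Toy data: three short questions from the outputs above plus one made-up long one.
toy = pd.DataFrame({
    "questionType": ["open-ended", "yes/no", "open-ended", "open-ended"],
    "question": [
        "iphone",
        "is it 4g",
        "how big is it?",
        "How long does the battery last on a full charge?",
    ],
})
toy["question_len"] = toy["question"].str.len()

filtered = drop_short_open_ended(toy)
print(filtered["question"].tolist())
```

Applied to the real data, this would be `drop_short_open_ended(df_all)`; note that under this cutoff a legitimate short question such as 'how big is it?' is removed along with fragments like 'iphone'.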