## Project Proposal: Amazon Reviews

### Problem Statement

People increasingly use social media to share their views on products they have purchased, yet companies lack a structured way of extracting this data and analysing it for sentiment. Most companies still rely on reviews sent to them directly or posted on their own pages to understand the impact of their products, which tends to give a small sample. Many labelled datasets are now available that can be used to train an in-house sentiment analysis tool for unlabelled data. In this project, we will create such a tool for a client selling electronic products, using the Amazon electronics reviews dataset. The client will be able to use the tool to analyse the sentiment of unlabelled text about their products and so make better-informed product decisions.

### Data Extraction

The project is based on the set of reviews provided by Amazon through their S3 service. More information on the dataset can be found at https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt. The project aims to use this labelled dataset to develop a tool that can perform sentiment analysis on other, unlabelled text, e.g. from Twitter and other social media.

In [1]:
import os.path
import boto3
import pandas as pd
import logging
from dotenv import load_dotenv
from botocore.exceptions import ClientError

The dataset can be downloaded to a local file using the [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) module; the function `download_s3_file` does this. Here, I have downloaded a very small sample dataset as an example. The file is tab-separated, although it is saved locally with a `.csv` extension.

In [2]:
load_dotenv()
SECRET_KEY = os.getenv("SECRET_KEY")
ACCESS_KEY = os.getenv("ACCESS_KEY")

In [3]:
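def download_s3_file(access_key, secret_key, bucket, key, output_folder_name, output_file_name):
    """Download `key` from `bucket` to output_folder_name + output_file_name."""
    s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    # build the destination once so the log message matches the file actually written
    output_path = output_folder_name + output_file_name

    with open(output_path, 'wb') as write_file:
        try:
            s3.download_fileobj(bucket, key, write_file)
            logging.info('File downloaded successfully to {}'.format(output_path))
        except ClientError:
            # ClientError covers bad credentials as well as a missing bucket or key
            logging.error("Download failed (check credentials, bucket and key)", exc_info=True)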
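
One side effect of opening the output file before the download starts is that a failed download can leave an empty or partial file behind. The sketch below shows one possible way to avoid that: write to a temporary file and move it into place only on success. The helper name `download_atomically` and its `client` parameter are hypothetical and not part of this project; `client` is assumed to be any object with a `download_fileobj(bucket, key, fileobj)` method, such as a `boto3` S3 client.

```python
import os
import tempfile


def download_atomically(client, bucket, key, dest_path):
    """Download bucket/key to dest_path, leaving no partial file on failure.

    `client` is assumed to expose download_fileobj(bucket, key, fileobj),
    as a boto3 S3 client does. This helper is illustrative only.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Write to a temporary file in the destination folder first.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            client.download_fileobj(bucket, key, tmp_file)
        # Only a completed download is moved into place.
        os.replace(tmp_path, dest_path)
    except Exception:
        # Remove the partial file, then let the caller see the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

Because the client is passed in rather than created inside the function, the same helper can be exercised with a stub object in tests, without touching S3.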

In [4]:
# define bucket and key name to identify location where file is stored
bucket = "amazon-reviews-pds"
key = "tsv/sample_us.tsv"
# define output destination for sample data file
folder_name = os.path.abspath('..') + '/data/external'
file_name = '/sample_data.csv'

In [5]:
download_s3_file(ACCESS_KEY, SECRET_KEY, bucket, key, folder_name, file_name)
sample_data_df = pd.read_csv(folder_name + file_name, sep='\t')
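
The `sep='\t'` argument matters because the file is tab-separated even though it is saved with a `.csv` extension. As a rough illustration of the same parsing step using only the standard library, the sketch below reads two rows taken from the sample shown in this notebook, reduced to four columns. The helper name `read_reviews` is hypothetical, and treating quote characters literally (`csv.QUOTE_NONE`) is an assumption about how free-text review fields are best handled, not something the dataset documentation is quoted as requiring.

```python
import csv
import io

# Two rows from the sample in this notebook, reduced to four columns.
SAMPLE_TSV = (
    "review_id\tproduct_category\tstar_rating\treview_body\n"
    "RDIJS7QYB6XNR\tToys\t5\tExcellent!!!\n"
    "R1UE3RPRGCOLD\tToys\t2\tCards are not as big as pictured.\n"
)


def read_reviews(handle):
    """Parse tab-separated reviews into a list of dicts (illustrative helper)."""
    # Assumption: quote characters inside review text are ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Star ratings arrive as text, so convert them to integers.
        row["star_rating"] = int(row["star_rating"])
        rows.append(row)
    return rows


rows = read_reviews(io.StringIO(SAMPLE_TSV))
print(rows[1]["star_rating"], rows[1]["review_body"])
```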

In [6]:
sample_data_df.head()

| | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | 18778586 | RDIJS7QYB6XNR | B00EDBY7X8 | 122952789 | Monopoly Junior Board Game | Toys | 5 | 0 | 0 | N | Y | Five Stars | Excellent!!! | 2015-08-31 |
| 1 | US | 24769659 | R36ED1U38IELG8 | B00D7JFOPC | 952062646 | 56 Pieces of Wooden Train Track Compatible wit... | Toys | 5 | 0 | 0 | N | Y | Good quality track at excellent price | Great quality wooden track (better than some o... | 2015-08-31 |
| 2 | US | 44331596 | R1UE3RPRGCOLD | B002LHA74O | 818126353 | Super Jumbo Playing Cards by S&S Worldwide | Toys | 2 | 1 | 1 | N | Y | Two Stars | Cards are not as big as pictured. | 2015-08-31 |
| 3 | US | 23310293 | R298788GS6I901 | B00ARPLCGY | 261944918 | Barbie Doll and Fashions Barbie Gift Set | Toys | 5 | 0 | 0 | N | Y | my daughter loved it and i liked the price and... | my daughter loved it and i liked the price and... | 2015-08-31 |
| 4 | US | 38745832 | RNX4EXOBBPN5 | B00UZOPOFW | 717410439 | Emazing Lights eLite Flow Glow Sticks - Spinni... | Toys | 1 | 1 | 1 | N | Y | DONT BUY THESE! | Do not buy these! They break very fast I spun ... | 2015-08-31 |


In [8]:
sample_data_df.columns

Index(['marketplace', 'customer_id', 'review_id', 'product_id',
       'product_parent', 'product_title', 'product_category', 'star_rating',
       'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
       'review_headline', 'review_body', 'review_date'],
      dtype='object')

The most important columns are `review_body` and `star_rating`. The star rating will be our target, and we will train our machine learning models on the review body. We will use NLP techniques to tokenize the text, then train different types of neural networks on the tokenized data. We will probably also build our base models on top of existing models. The deliverables will be the code posted on GitHub, along with a report and a presentation summarizing the process and outcome.
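
To make the tokenization step concrete, here is a minimal sketch of turning review bodies into integer sequences paired with their star ratings. It is not the project's chosen pipeline: the regex tokenizer, the `<pad>` and `<unk>` entries, and the helper names `tokenize`, `build_vocabulary` and `encode` are all assumptions made for illustration, and a real implementation would more likely use an established NLP library. The two reviews are taken from the sample above.

```python
import re
from collections import Counter

# Assumption: lowercase alphanumeric tokens, keeping simple contractions.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Split review text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(token_lists, min_count=1):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ids 0 and 1 are reserved for padding and unknown tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace tokens with their ids, using <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


# (review_body, star_rating) pairs from the sample data.
reviews = [
    ("Excellent!!!", 5),
    ("Cards are not as big as pictured.", 2),
]
token_lists = [tokenize(body) for body, _ in reviews]
vocab = build_vocabulary(token_lists)
features = [encode(tokens, vocab) for tokens in token_lists]
targets = [stars for _, stars in reviews]
```

Here `features` holds one list of token ids per review and `targets` holds the matching star ratings, which is the general shape of input a neural network classifier would be trained on.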