# Data Analysis on Amazon Movie & TV Reviews [3830ICT Assignment 1]

# 1. Data Preparation & Preprocessing

## 1.1. Read the Dataset

Import the relevant modules to be used later on:

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import gzip

The dataset is in .json.gz format (gzip-compressed JavaScript Object Notation, one record per line).

Read the dataset:

In [None]:
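def parse(path):
  # Stream the gzipped file and yield one review (a dict) per line.
  # eval() executes each line as Python, so only use it on a trusted file.
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield eval(l)

def getDF(path):
  # Collect the parsed reviews, keyed by row number, and build a DataFrame.
  df = {}
  for i, d in enumerate(parse(path)):
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Movies_and_TV_5.json.gz')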

This dataset contains product reviews and metadata from Amazon, specific to the Movies & TV category.
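As an alternative to `eval`, the sketch below reads a gzipped file line by line with `json.loads`. It assumes the file is strict JSON with one object per line, which has not been verified for this dataset. The helper name `parse_json_lines` and the tiny sample file are hypothetical and exist only so the example runs without the real data.

```python
import gzip
import json
import os
import tempfile

def parse_json_lines(path):
    """Yield one dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Write a two-record sample file (made-up values) to a temporary directory.
sample_records = [
    {"reviewerID": "A1", "asin": "0001", "overall": 5.0},
    {"reviewerID": "A2", "asin": "0002", "overall": 2.0},
]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_reviews.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + '\n')

# Read the sample back; each line becomes one dict.
records = list(parse_json_lines(sample_path))
print(len(records), records[0]['overall'])
```

Unlike `eval`, `json.loads` never executes the file's contents as code, but it will raise an error on lines that are not valid JSON.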

## 1.2. View the Dataset (before cleaning)

The following methods allow us to describe and view the dataset before cleaning.

View the head of the dataset:

In [None]:
df.head(10)

View the tail of the dataset:

In [None]:
df.tail(10)

View the shape (number of rows, columns) of the dataset:

In [None]:
df.shape

View the column names (attributes) of the dataset:

In [None]:
df.columns

The .info() method provides important information about the dataset, such as the number of rows, number of columns, number of non-missing values in each column, and the data type stored in each column:

In [None]:
df.info()

The .describe() method calculates summary statistics of the data:

In [None]:
df.describe()

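Visualise the number of reviews for each rating, from bad (low) to good (high):

In [None]:
# countplot suits discrete ratings; distplot is deprecated in recent seaborn versions
sns.countplot(x='overall', data=df)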
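The same information can be read off numerically. The sketch below counts how many reviews fall on each rating and what share of the total each rating has. It uses a small made-up `sample` frame rather than the real `df`, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up ratings standing in for df['overall'].
sample = pd.DataFrame({'overall': [5, 4, 5, 1, 3, 5, 2, 4]})

# Number of reviews per rating, ordered from lowest to highest rating.
rating_counts = sample['overall'].value_counts().sort_index()

# Share of all reviews that each rating accounts for.
rating_share = (rating_counts / rating_counts.sum()).round(3)

print(rating_counts)
print(rating_share)
```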

## 1.3. Cleaning 

### 1.3.1. Formatting

Convert data to a standard format for easy manipulation.

Convert the reviewTime column to datetime type:

In [None]:
df['reviewTime_convert'] = pd.to_datetime(df.reviewTime)

Check if it worked:

In [None]:
df[['reviewTime','reviewTime_convert']].head(5)

The converted date is displayed in ISO 8601 format, i.e. YYYY-MM-DD.

The unixReviewTime column could be converted to datetime type in the same way. However, it will be dropped later because it is redundant, so it is not converted here.
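For reference, a minimal sketch of how the redundancy could be checked: convert both columns and compare them. The two sample rows are made up, and the `'%m %d, %Y'` format string is an assumption about how reviewTime is written (for example `01 1, 2014`).

```python
import pandas as pd

# Made-up rows; unixReviewTime is seconds since 1970-01-01 UTC.
sample = pd.DataFrame({
    'reviewTime': ['01 1, 2014', '07 23, 2014'],
    'unixReviewTime': [1388534400, 1406073600],
})

# Parse the text date with an explicit (assumed) format.
from_text = pd.to_datetime(sample['reviewTime'], format='%m %d, %Y')

# Interpret the integer column as seconds since the Unix epoch.
from_unix = pd.to_datetime(sample['unixReviewTime'], unit='s')

# True when both columns describe the same date on every row.
same_date = (from_text == from_unix).all()
print(same_date)
```

If `same_date` were `False` on the real data, the two columns would not be interchangeable and dropping one would lose information.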

Change overall to integer type:

In [None]:
df['overall'] = df['overall'].astype(int)

Check if it worked:

In [None]:
df['overall'].head(5)
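Casting to `int` silently truncates any fractional part, so it can be worth confirming first that every rating is a whole number. The sketch below shows one way to do that on a made-up `sample` frame. Whether the real column contains only whole numbers is not checked here.

```python
import pandas as pd

# Made-up float ratings standing in for df['overall'].
sample = pd.DataFrame({'overall': [5.0, 4.0, 1.0, 3.0]})

# True only if no rating has a fractional part.
is_whole = (sample['overall'] % 1 == 0).all()

# Cast only when nothing would be truncated.
if is_whole:
    sample['overall'] = sample['overall'].astype(int)

print(sample['overall'].dtype)
```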

### 1.3.2. Missing Data

Check for missing data:

In [None]:
df.isnull().sum()

Fill the missing data with the string 'Unknown':

In [None]:
df.fillna('Unknown', inplace=True)

Check if it worked:

In [None]:
df.isnull().sum()
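Filling the whole DataFrame puts the string 'Unknown' into any column that has gaps, including numeric ones. A more targeted variant fills only the text column that is missing values. The sketch below assumes that column is reviewerName, and uses a made-up `sample` frame.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing reviewer name.
sample = pd.DataFrame({
    'reviewerID': ['A1', 'A2', 'A3'],
    'reviewerName': ['Alice', np.nan, 'Carol'],
    'overall': [5, 3, 4],
})

# Missing values per column before filling.
missing_before = sample.isnull().sum()

# Fill only the name column; the other columns keep their types.
sample['reviewerName'] = sample['reviewerName'].fillna('Unknown')

# Missing values per column after filling.
missing_after = sample.isnull().sum()
print(missing_before['reviewerName'], missing_after['reviewerName'])
```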

### 1.3.3. Invalid Data

Check for invalid data.

Check that all values in overall are between 1 and 5 (Amazon ratings run from 1 to 5 stars):

In [None]:
df.loc[(df['overall']<1) | (df['overall']>5)]

Check that all dates in reviewTime are between May 1996 - July 2014:

In [None]:
df.loc[(df['reviewTime_convert']<'1996-05-01') | (df['reviewTime_convert']>'2014-07-31')]

There is no invalid data.
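Both range checks can also be written with `Series.between`, which is inclusive at both ends by default. The sketch below uses a made-up `sample` frame that deliberately contains invalid rows, and assumes ratings run from 1 to 5.

```python
import pandas as pd

# Made-up rows: index 1 and 3 have bad ratings, index 2 has a bad date.
sample = pd.DataFrame({
    'overall': [5, 0, 3, 6],
    'reviewTime_convert': pd.to_datetime(
        ['2005-03-01', '2010-06-15', '1995-12-31', '2014-07-01']),
})

# True where the rating lies in the assumed 1 to 5 range.
valid_rating = sample['overall'].between(1, 5)

# True where the date lies in the May 1996 to July 2014 window.
valid_date = sample['reviewTime_convert'].between(
    pd.Timestamp('1996-05-01'), pd.Timestamp('2014-07-31'))

# Rows failing at least one check.
invalid_rows = sample[~(valid_rating & valid_date)]
print(invalid_rows)
```

An empty `invalid_rows` on the real data would correspond to the "no invalid data" conclusion.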

### 1.3.4. Irrelevant Data 

Remove irrelevant data.

Remove the unixReviewTime column, as it carries the same information as the reviewTime attribute in this dataset:

In [None]:
del df['unixReviewTime']

### 1.3.5. Inconsistent Data

Two different reviewers should not have the same ID.

In [None]:
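# Collect the distinct names recorded for each reviewer ID
group = df.groupby('reviewerID')['reviewerName'].unique()
# Show IDs that are associated with more than one name
group[group.apply(lambda x: len(x)>1)].head(10)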
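An equivalent check can be sketched with `nunique`, which counts distinct names per ID directly and avoids the `apply` call. The `sample` frame is made up, and the column names reviewerID and reviewerName are assumed to match the dataset.

```python
import pandas as pd

# Made-up rows: ID 'A2' appears under two different names.
sample = pd.DataFrame({
    'reviewerID':   ['A1', 'A1', 'A2', 'A2', 'A3'],
    'reviewerName': ['Alice', 'Alice', 'Bob', 'Robert', 'Carol'],
})

# Number of distinct names recorded for each reviewer ID.
names_per_id = sample.groupby('reviewerID')['reviewerName'].nunique()

# IDs associated with more than one name.
conflicting_ids = names_per_id[names_per_id > 1]
print(conflicting_ids)
```

An ID with several names is not necessarily two different people, since a reviewer may have changed their display name. Note also that a name filled in earlier as 'Unknown' counts as a distinct name.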

## 1.4. Feature Reduction