# Amazon Reviews - Preparing and Cleaning the Dataset

In this notebook we prepare the "amazon_reviews" dataset, which will be used to train sentiment analysis models.

We perform the following tasks:

1. Remove non-text records, because our models accept only text as input.
2. Remove non-English records. Most records are in English, and we don't want other languages to influence training; those can be modeled separately.
3. Remove any null values from the target feature.
4. Drop all columns that we won't use.
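
The four steps above can be sketched as a single function. This is only an illustration on a toy DataFrame: `clean_reviews`, `looks_english`, and the toy rows are hypothetical names and data, and the language check is passed in as a stand-in predicate rather than the langdetect-based check used later in this notebook.

```python
import pandas as pd

def clean_reviews(df, is_english):
    """Apply the four cleaning steps; `is_english` is a caller-supplied predicate."""
    # 1. Keep only rows whose review text is a string (missing values are not).
    df = df[df["content"].apply(lambda x: isinstance(x, str))]
    # 2. Keep only rows the predicate classifies as English.
    df = df[df["content"].apply(is_english)]
    # 3. Drop rows with a missing target value.
    df = df.dropna(subset=["score"])
    # 4. Keep only the text and the target.
    return df[["content", "score"]]

toy = pd.DataFrame({
    "content": ["Great app", None, "Muy buena aplicación", "Too buggy"],
    "score": [5, 3, 4, None],
    "thumbsUpCount": [0, 1, 0, 2],
})

def looks_english(text):
    # Stand-in rule for this demo only; not a real language detector.
    return "ó" not in text

cleaned = clean_reviews(toy, looks_english)
print(cleaned)
```

Passing the predicate as an argument keeps the sketch runnable without extra packages and makes it easy to swap in a real detector.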

## Installing packages

In [1]:
# langdetect will be used to detect which records are in English.
!pip install pandas langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=d7ae370add6c9d459e42ebaae0244efae776a043c25f41a1ad9d577ed1d13e2a
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


## Loading the dataset

In [2]:
import pandas as pd

dataset = pd.read_csv('amazon_reviews.csv')

#Display an overview of the dataset before making any changes.
original_dataset_shape = dataset.shape[0]
print(f"The shape of the dataset is: {dataset.shape}")
print("Below are the first few records:")
dataset.head()


The shape of the dataset is: (55544, 8)
Below are the first few records:


| | reviewId | userName | content | score | thumbsUpCount | reviewCreatedVersion | at | appVersion |
|---|---|---|---|---|---|---|---|---|
| 0 | 6acacf1c-e17b-457a-98b8-b52cb187b02d | Wealth Enterprise | Absolutely good 👍 | 5 | 0 | 28.13.6.100 | 2024-07-11 20:23:04 | 28.13.6.100 |
| 1 | afe971fc-312c-42f6-938d-b4b05daddbf4 | Arlo Lee | Fantastic | 5 | 0 | 28.12.0.100 | 2024-07-11 20:09:37 | 28.12.0.100 |
| 2 | 073e29a9-4f66-4421-a43f-c565b076b0ac | Jessie Bridges | The latest update doesn't even show the overal... | 1 | 1 | 28.13.6.100 | 2024-07-11 19:59:41 | 28.13.6.100 |
| 3 | a5cd443b-bbcd-45f4-8979-cda6cac8a265 | A Google user | Monitors phone calls, emails, texts for purcha... | 1 | 0 | 28.13.6.100 | 2024-07-11 19:49:36 | 28.13.6.100 |
| 4 | c8d95f7e-818e-4e7b-b95a-75c0aa89b31e | Pepto Bismol | Amazon used to be good but now it's buggy, it'... | 2 | 0 | 28.13.6.100 | 2024-07-11 19:38:59 | 28.13.6.100 |


In [3]:
# Get an overview of the data types; we are mainly interested in "content" and "score".
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55544 entries, 0 to 55543
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   reviewId              55544 non-null  object
 1   userName              55539 non-null  object
 2   content               55542 non-null  object
 3   score                 55544 non-null  int64 
 4   thumbsUpCount         55544 non-null  int64 
 5   reviewCreatedVersion  49995 non-null  object
 6   at                    55544 non-null  object
 7   appVersion            49995 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.4+ MB


## Remove non-text records

In [4]:
dataset_strings = dataset[dataset['content'].apply(lambda x: isinstance(x, str))]

#Print the number of records dropped.
dataset_strings_shape = dataset_strings.shape[0]
print(f"{original_dataset_shape - dataset_strings_shape} non-text records were dropped.")

2 non-text records were dropped.
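
The two dropped records match the two missing `content` values reported by `dataset.info()` (55544 entries, 55542 non-null). A minimal sketch on a hypothetical toy frame shows why the `isinstance` filter catches them: pandas stores a missing value in a text column as a float NaN, which is not a `str`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "content": ["Fantastic", np.nan, "Absolutely good"],
    "score": [5, 4, 5],
})

# Missing text is stored as float NaN, so it fails the str check.
is_text = toy["content"].apply(lambda x: isinstance(x, str))
print(toy.loc[~is_text])  # the rows that would be dropped

# For a column holding only strings and NaN, notna() selects the same rows.
same_rows = bool((is_text == toy["content"].notna()).all())
print(same_rows)
```

The `isinstance` check is the stricter of the two: it would also remove numbers or other non-string objects, which `notna()` would keep.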


## Remove non-English records

In [5]:
from langdetect import detect, LangDetectException

# Note: langdetect is non-deterministic on short or ambiguous text. For
# reproducible counts, set DetectorFactory.seed before the first call to detect().

# Return True when langdetect classifies the text as English.
def is_english(text):
    try:
        return detect(text) == 'en'
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty or digits only).
        return False

# Apply the function to the text column (we only keep records that are in English)
dataset_english = dataset_strings[dataset_strings['content'].apply(is_english)]

#Print the number of records dropped.
dataset_english_shape = dataset_english.shape[0]
print(f"{dataset_strings_shape - dataset_english_shape} non-English records were dropped.")

1289 non-English records were dropped.
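
Language detection on very short text can be unreliable. The preview of the final dataset later in this notebook skips index 1 ("Fantastic"), which suggests the detector rejected that one-word review. One way to audit this is to keep the rejected rows instead of discarding them. The sketch below assumes a hypothetical helper `split_by_predicate` and uses a stand-in predicate so that it runs without langdetect.

```python
import pandas as pd

def split_by_predicate(df, column, predicate):
    """Return (kept, rejected) so the rejected rows can be inspected."""
    mask = df[column].apply(predicate)
    return df[mask], df[~mask]

toy = pd.DataFrame({
    "content": ["Fantastic", "Me gusta mucho", "Works well"],
    "score": [5, 5, 4],
})

def stand_in_is_english(text):
    # Hypothetical stand-in for is_english, used only to keep this example self-contained.
    return text != "Me gusta mucho"

kept, rejected = split_by_predicate(toy, "content", stand_in_is_english)
print(rejected)
```

In the notebook, the same helper could be called with `dataset_strings`, `'content'`, and `is_english` to review a sample of the rejected records before committing to the filter.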


## Remove null ratings

In [6]:
# dropna returns a new DataFrame, so the result must be assigned.
dataset_english = dataset_english.dropna(subset=['score'])

#Print the number of records dropped.
dataset_dropna_shape = dataset_english.shape[0]
print(f"{dataset_english_shape - dataset_dropna_shape} records were dropped because of null ratings.")

0 records were dropped because of null ratings.
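
Note that `dropna` does not modify the DataFrame it is called on unless `inplace=True` is passed; it returns a new one. A count taken after the call is only meaningful if the result is assigned. The toy frame below is hypothetical and shows the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "content": ["Good", "Bad", "Okay"],
    "score": [5, np.nan, 3],
})

# Result discarded: toy still has all three rows.
toy.dropna(subset=["score"])
rows_after_discarded_call = len(toy)

# Result assigned: the row with the missing score is gone.
toy_clean = toy.dropna(subset=["score"])
rows_after_assignment = len(toy_clean)

print(rows_after_discarded_call, rows_after_assignment)
```

In this dataset the count is zero either way, because `dataset.info()` reports 55544 non-null `score` values out of 55544 entries.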


## Drop all features except "content" and "score"

In [10]:
# Drop the columns we won't use for training; we only need "content" and "score".
dataset_final = dataset_english[['content', 'score']]

print(f"The shape of the dataset is: {dataset_final.shape}")
print("Below are the first few records:")
dataset_final.head()

The shape of the dataset is: (54253, 2)
Below are the first few records:


| | content | score |
|---|---|---|
| 0 | Absolutely good 👍 | 5 |
| 2 | The latest update doesn't even show the overal... | 1 |
| 3 | Monitors phone calls, emails, texts for purcha... | 1 |
| 4 | Amazon used to be good but now it's buggy, it'... | 2 |
| 5 | Communication to what went wrong with my packa... | 1 |


In [8]:
#Save the dataset
dataset_final.to_csv('amazon_reviews_cleaned_dataset.csv', index=False)
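
When the saved file is loaded again for training, it may be worth checking that no review text turns back into a missing value. By default, `pd.read_csv` treats certain literal strings such as "NA" as missing. The sketch below uses a hypothetical toy frame and an in-memory buffer instead of the real file; whether the actual dataset contains such reviews is not known from this notebook.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "content": ["Absolutely good 👍", "NA", "Too buggy"],
    "score": [5, 3, 2],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing treats the literal text "NA" as a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False leaves such strings untouched.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["content"].isna().sum(), literal_read["content"].isna().sum())
```

If the first count is non-zero for the real file, reading it with `keep_default_na=False` is one option for preserving those reviews as text.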