# Sampling from Amazon Reviews'23 Dataset

The downloaded Amazon Reviews'23 dataset is very large: it contains 571 million reviews from 34 different product categories. <br>
To run experiments faster, with lower resource requirements (GPU, memory, disk space, etc.) and lower cost, the following actions were taken:<br>
* Downsample the raw data carefully, keeping enough samples for a rating- and category-agnostic sentiment analysis,<br>
* Apply lightweight feature engineering: instead of storing images, extract a has_image feature and drop the images; derive the product category from the review file names, so the product meta files do not need to be processed, <br> 
* Build the customized model with transfer learning, by fine-tuning foundation LLMs that were pre-trained on very large data,<br>
* Use a cloud environment to benefit from flexible and free resources as much as possible

## Steps Followed for Sampling and Relevant Tech Stack
Several steps follow the dataset download. Processing all files together would require huge resources, so each file is processed separately and its output is saved as csv. The steps are:<br>
1. Use PySpark to read each per-category review file into a Spark DataFrame, benefiting from Spark's parallel processing capabilities<br>
2. Generate two new features ('category', 'has_image') and drop one feature column ('images', which contains image URLs) <br>
3. Take the required number of samples from each rating of each product category and save them as temporary csv files.<br>
4. Open each csv folder with Pandas to read its single partition file, then remove the rows with more columns than expected (for example, a record is expected to have 11 columns but has more) and the rows with unexpected value types (for example, a boolean column that contains a number in some rows). <br>
5. Merge the cleaned data of all categories into a single clean data file and save it as a parquet file for future steps. <br>
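
Step 4 relies on the way Spark writes csv output: the path given to the writer becomes a directory, and the rows land in a `part-*` file inside it. The following is a minimal sketch of locating that file with the standard library only. The helper name `find_part_file` and the temporary directory layout are hypothetical; they only mimic the folder names that appear later in this notebook.

```python
import tempfile
from pathlib import Path


def find_part_file(export_dir, extension=".csv"):
    """Return the first Spark part file inside export_dir, or None if absent."""
    candidates = sorted(Path(export_dir).glob(f"part-*{extension}"))
    return candidates[0] if candidates else None


# Build a fake Spark export: a directory named like a file, holding one part
# file plus the _SUCCESS marker that Spark typically writes next to it.
root = Path(tempfile.mkdtemp())
export_dir = root / "All_Beauty.csv"
export_dir.mkdir()
(export_dir / "part-00000-demo-c000.csv").write_text("rating,text\n5.0,great\n")
(export_dir / "_SUCCESS").write_text("")

part_file = find_part_file(export_dir)
print(part_file.name)
```

Matching on the `part-*` prefix ignores marker files, which is the same effect the notebook gets by filtering on the `.csv` extension.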

In [68]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.functions import col, isnull, size
from itertools import count
from functools import reduce
from datetime import datetime

import os
import pandas as pd

## Settings of Sampling Process
Several settings control the sampling steps above:
* **'target_sample_count':** The total number of samples to get from the large raw dataset **(Step 3)**<br>
* **'input_data_directory_path':** The location of downloaded large raw dataset to read files with Spark Session **(Step 1)** <br>
* **'input_data_file_extension':** The file format of downloaded large raw dataset files **(Step 1)** <br>
* **'output_sampled_data_directory_path':** The location to save sampled temporary data files by using Pyspark **(Step 3)** <br>
* **'output_sampled_data_file_extension':** The file format of the temporary sample files **(Step 3)** <br>
* **'clean_sampled_data_directory_path':** The location to save the final unified clean data sample file for each product category and rating **(Step 5)**<br>
* **'clean_sampled_data_file_extension':** The file format of the final unified clean data sample file **(Step 5)** <br>

In [69]:
target_sample_count = 1000000

In [71]:
input_data_directory_path = "D:\\Datasets\\amazon-product-review-2023\\"
input_data_file_extension = ".jsonl.gz"

output_sampled_data_directory_path = "C:\\amazon-sampled-dataset\\"
output_sampled_data_file_extension = ".csv"

clean_sampled_data_directory_path = "C:\\amazon-sampled-dataset\\cleaned-data\\"
clean_sampled_data_file_extension = ".parquet"

In [None]:
# Create a Spark Session to read files into Spark DataFrames
spark = SparkSession.builder.appName("Amazon Product Review Sampling App")\
        .config("spark.memory.offHeap.enabled", "true")\
        .config("spark.memory.offHeap.size", "10g")\
        .config("spark.dynamicAllocation.enabled", "true")\
        .config("spark.shuffle.service.enabled", "true")\
        .config("spark.executor.memory", "3g")\
        .getOrCreate()

In [73]:
def find_files_with_extension(directory_path, extension):
    """Find all raw data files in specified format and return as a list

    Args:
        directory_path (string): The directory path to search for raw data files
        extension (string): The file extension to search for (e.g., ".txt", ".csv")

    Returns:
        list: The list of file items
    """
    try:
        all_items = os.listdir(directory_path)
        files = [item for item in all_items if os.path.isfile(os.path.join(directory_path, item)) and item.endswith(extension)]
        return files
    except FileNotFoundError:
        print(f"Error: The directory '{directory_path}' does not exist.")
        return []

# The file names are used as the product category of each review, which avoids processing the product meta files

files = find_files_with_extension(input_data_directory_path, input_data_file_extension)
product_categories = [file_name.split(".")[0] for file_name in files]
print(product_categories)


Error: The directory 'D:\Datasets\amazon-product-review-2023\' does not exist.
[]


In [10]:
# The total number of samples for each rating from each product category is calculated

product_category_count = len(product_categories)
rating_groups_count = 5

# Note: Below // 1 floors the quotient like math.floor(), but the result stays a float (e.g. 5882.0)
samples_per_group = (target_sample_count / (product_category_count * rating_groups_count)) // 1
print(samples_per_group)

5882.0
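
The quota above is the target count divided evenly over every (category, rating) group. As a worked example, the sketch below repeats the arithmetic with the 34 categories and 5 ratings used in this notebook, and compares `// 1` with plain integer floor division. The variable names `groups`, `quota_float` and `quota_int` are illustrative only.

```python
target_sample_count = 1_000_000
product_category_count = 34   # category files listed in this notebook
rating_groups_count = 5       # ratings 1 through 5

groups = product_category_count * rating_groups_count
quota_float = (target_sample_count / groups) // 1   # floors, stays a float
quota_int = target_sample_count // groups           # floors, returns an int

print(groups, quota_float, quota_int)  # 170 5882.0 5882
print(quota_int * groups)              # 999940
```

Because of the flooring, the quotas add up to slightly less than the one million target.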


In [84]:
"""For each product category, perform the steps below
    1. Read the reviews json file for that category
    2. Add a new column 'category' with the category name (extracted from file name)
    3. Add a new column 'has_image' which is true if the review has images
    4. For each rating (1 to 5) do
        a. Calculate the fraction of samples for that rating
        b. If the fraction is greater than 1, then set it to 1.0
        c. Sample the reviews for that rating with replacement=false and the calculated fraction
    5. Reduce the list of sampled dataframes for each rating into a single dataframe
    6. Write the sampled dataframe back to file system in csv format with header and overwrite mode
"""
for category_name in product_categories:
    print(f"Category: {category_name} started to be processed..")
    df_category_reviews = spark.read.json(input_data_directory_path + category_name + input_data_file_extension)
    df_category_reviews = df_category_reviews.withColumn("category", lit(category_name))
    df_category_reviews = df_category_reviews.withColumn("has_image", size(col("images")) > 0).drop("images")

    sampled_df_list_by_rating = []

    for rating in range(1, 6):
        print(f"Rating: {rating}")
        # Filter once and reuse the result for both counting and sampling
        df_rating_reviews = df_category_reviews.filter(col("rating") == rating)
        rating_review_count = df_rating_reviews.count()

        # Skip ratings without any review in this category (avoids a division by zero)
        if rating_review_count == 0:
            continue

        # For some product categories we do not have enough number of samples for certain ratings, so we take all of them
        fraction = min(1.0, samples_per_group / rating_review_count)

        sampled_df_by_rating = df_rating_reviews.sample(withReplacement=False, fraction=fraction, seed=61)
        sampled_df_list_by_rating.append(sampled_df_by_rating)

    # functools.reduce folds the list pairwise, unioning the per-rating DataFrames into a single DataFrame
    df_sampled_category_reviews  = reduce(lambda df1, df2: df1.union(df2), sampled_df_list_by_rating)

    # Write the sampled reviews back to the file system in csv format
    # coalesce(1) merges the data into a single partition, so a single part file is written instead of several
    df_sampled_category_reviews.coalesce(1).write.option("header", "true").mode("overwrite").csv(output_sampled_data_directory_path + category_name + output_sampled_data_file_extension)



Category: All_Beauty started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Amazon_Fashion started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Appliances started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Arts_Crafts_and_Sewing started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Automotive started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Baby_Products started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Beauty_and_Personal_Care started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Books started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: CDs_and_Vinyl started to be processed..
Rating: 1
Rating: 2
Rating: 3
Rating: 4
Rating: 5
Category: Cell_Phones_and_Accessories started to be processed..
Rating: 1
Rating: 2
Rating: 3
Ra
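
Spark's `sample(withReplacement=False, fraction=...)` keeps each row independently with probability `fraction`, so the number of returned rows is only approximately `fraction * count`. This is likely why the per-category row counts reported later in this notebook are close to, but not exactly, 5 × 5882. The sketch below imitates the idea in plain Python; it does not use Spark's random generator, and the helper names and group sizes are hypothetical.

```python
import random


def capped_fraction(quota, group_size):
    """Sampling fraction for one rating group, capped at 1.0 for small groups."""
    if group_size == 0:
        return 0.0
    return min(1.0, quota / group_size)


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


quota = 5882
large_group = range(100_000)   # hypothetical group with plenty of reviews
small_group = range(3_000)     # hypothetical group smaller than the quota

large_sample = bernoulli_sample(large_group, capped_fraction(quota, len(large_group)), seed=61)
small_sample = bernoulli_sample(small_group, capped_fraction(quota, len(small_group)), seed=61)

print(len(large_sample))  # close to 5882, but usually not exactly 5882
print(len(small_sample))  # 3000: the fraction is capped at 1.0, so every row is kept
```

If exact group sizes mattered, a different technique (for example, shuffling and taking the first N rows) would be needed; the approximate counts are usually acceptable for building a training sample.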

In [85]:
# Since the smaller sampled files are now on the file system, Spark is no longer needed, so the Spark session is terminated
spark.stop()

In [74]:
def find_folders_endwith_extension(directory_path, extension):
    """
    Find all folders in the given directory that end with the specified extension.

    Args:
        directory_path (str): The path to the directory.
        extension (str): The extension to search for (e.g., ".txt").

    Returns:
        list: A list of folder names that end with the specified extension.
    """
    try:
        all_items = os.listdir(directory_path)
        folders = [item for item in all_items if os.path.isdir(os.path.join(directory_path, item)) and item.endswith(extension)]
        return folders
    except FileNotFoundError:
        print(f"Error: The directory '{directory_path}' does not exist.")
        return []

folders = find_folders_endwith_extension(output_sampled_data_directory_path, output_sampled_data_file_extension)
product_categories = [folder_name.split(".")[0] for folder_name in folders]
print(product_categories)

['All_Beauty', 'Amazon_Fashion', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Baby_Products', 'Beauty_and_Personal_Care', 'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food', 'Handmade_Products', 'Health_and_Household', 'Health_and_Personal_Care', 'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Software', 'Sports_and_Outdoors', 'Subscription_Boxes', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Unknown', 'Video_Games']


In [75]:
def read_sampled_data_of_category(directory_path, extension, category_name):
    """Find the Spark export file of a product category inside its directory and load it

    Args:
        directory_path (string): Path to the directory containing the product category folders
        extension (string): The extension of the files to be read (e.g., '.csv', '.json', '.parquet')
        category_name (string): The name of the product category folder

    Returns:
        pandas.DataFrame: The data frame that contains the data in temporary spark export file
    """
    category_directory = directory_path + category_name + extension
    try:
        files = find_files_with_extension(category_directory, extension)
        if len(files) == 0:
            return None

        part_file_path = os.path.join(category_directory, files[0])
        print(part_file_path)

        # on_bad_lines='skip' drops the rows with too many fields (such as expected 8 columns but provided 12 columns)
        return pd.read_csv(part_file_path, sep=',', dtype=str, on_bad_lines='skip')
    except FileNotFoundError:
        print(f"Error: The file '{part_file_path}' does not exist.")
        return None

def write_sampled_data_to_file(df, file_path, extension):
    """Writes the data frame into file system

    Args:
        df (pandas.DataFrame): The data frame to write to file
        file_path (string): The path of the data file to write
        extension (string): The extension of the data file
    """
    if extension == '.csv':
        df.to_csv(file_path, index=False)
    elif extension == '.json':
        df.to_json(file_path, orient='records')
    elif extension == '.parquet':
        df.to_parquet(file_path)
    else:
        print("Invalid file extension. Please provide a valid extension (.csv, .json, or .parquet).")
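
One detail of `on_bad_lines='skip'` is worth keeping in mind: it drops rows that have too many fields, while rows with too few fields are kept and padded with missing values. The sketch below shows this on a small in-memory csv. The data is made up for illustration, and the option requires pandas 1.3 or newer.

```python
import io

import pandas as pd

# Hypothetical three-column csv: one well-formed row, one row with an extra
# field, and one row with a missing field.
raw_csv = "rating,title,text\n5.0,Great,Works well\n1.0,Bad,Broke,extra\n3.0,Okay\n"

df = pd.read_csv(io.StringIO(raw_csv), sep=",", dtype=str, on_bad_lines="skip")

print(len(df))                   # 2: the four-field row was skipped
print(df["text"].isna().sum())   # 1: the short row was kept with a missing 'text'
```

Short rows therefore reach the type-based cleaning step, where they are removed only if the missing value falls in a column with a strict type check.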

In [76]:
def clean_data(df, column_types):
    """
    Cleans the input DataFrame based on the provided column types.

    Args:
        df (pandas.DataFrame): The input DataFrame to be cleaned.
        column_types (dict): A dictionary mapping column names to their expected data types.

    Returns:
        pandas.DataFrame: The cleaned DataFrame with correct data types.
    """
    import pandas as pd
    
    # Function to check if a value matches the expected type
    def is_correct_type(value, expected_type):
        if expected_type == bool:
            return value.lower() in ['true', 'false', '0', '1']
        elif expected_type == int:
            return value.isdigit()
        elif expected_type == float:
            try:
                float(value)
                return True
            except ValueError:
                return False
        elif expected_type == str:
            return True  # Assuming all values can be strings
        else:
            return False

    # Identify rows with type mismatches
    mask = pd.Series(True, index=df.index)
    for column, expected_type in column_types.items():
        if column in df.columns:
            mask &= df[column].apply(lambda x: is_correct_type(str(x), expected_type))

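    # Keep only the rows that match all type expectations
    # .copy() makes df_clean an independent DataFrame, so the assignments below do not raise SettingWithCopyWarning
    df_clean = df[mask].copy()

    # Convert columns to their proper types
    for column, expected_type in column_types.items():
        if column in df_clean.columns:
            if expected_type == bool:
                # Lowercase first so the mapping agrees with the case-insensitive check in is_correct_type
                df_clean[column] = df_clean[column].str.lower().map({'true': True, 'false': False, '0': False, '1': True})
            elif expected_type in [int, float]:
                df_clean[column] = df_clean[column].astype(expected_type)

    return df_clean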


column_types = {
    'asin': str,
    'helpful_vote': int,
    'parent_asin': str,
    'rating': float,
    'text': str,
    'timestamp': str,
    'title': str,
    'user_id': str,
    'verified_purchase': bool,
    'category': str,
    'has_image': bool
}
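
The string-based checks in `is_correct_type` have a few edge cases that affect which rows survive the cleaning. The sketch below re-implements the int and float checks as standalone functions (the names `looks_like_int` and `looks_like_float` are hypothetical) and runs them on some typical values.

```python
def looks_like_int(value):
    """Mirror of the int check: digits only, so signs and decimals are rejected."""
    return value.isdigit()


def looks_like_float(value):
    """Mirror of the float check: anything float() can parse is accepted."""
    try:
        float(value)
        return True
    except ValueError:
        return False


print(looks_like_int("12"))     # True
print(looks_like_int("-3"))     # False: the minus sign is not a digit
print(looks_like_int("4.0"))    # False: the decimal point is not a digit
print(looks_like_float("4.0"))  # True
print(looks_like_float("nan"))  # True: the string form of a missing value parses as a float
print(looks_like_int("nan"))    # False: rows with a missing int value are dropped
```

In practice this means a row with a missing `rating` passes the float check and stays in the data as NaN, while a row with a missing `helpful_vote` is removed.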


In [77]:

""" This cell performs the following operations:
1. Reads the sampled data of each category
2. Cleans the data by removing rows with unexpected value types
3. Prints the info about the removed rows
4. Appends the cleaned data to a list
5. Concatenates all the cleaned data into a single dataframe
6. Writes the concatenated data to a timestamped file
"""
cleaned_df_list = []
for product_category in product_categories:
    df_sample_data = read_sampled_data_of_category(output_sampled_data_directory_path, output_sampled_data_file_extension, product_category)

    # Skip categories whose sampled file could not be found
    if df_sample_data is None:
        continue

    cleaned_df = clean_data(df_sample_data, column_types)

    # Print info about removed rows
    print(f"{product_category} Original row count: {df_sample_data.shape[0]}")
    print(f"{product_category} Cleaned row count: {cleaned_df.shape[0]}")
    print(f"{product_category} Rows removed: {df_sample_data.shape[0] - cleaned_df.shape[0]}")

    cleaned_df_list.append(cleaned_df)

cleaned_all_data = pd.concat(cleaned_df_list, ignore_index=True)

# Make sure the target directory exists before writing the final file
os.makedirs(clean_sampled_data_directory_path, exist_ok=True)

current_datetime = datetime.now().strftime("%Y%m%d_%H%M%S")
write_sampled_data_to_file(cleaned_all_data, clean_sampled_data_directory_path + "sample_amazon_product_review_data_" + current_datetime + clean_sampled_data_file_extension, clean_sampled_data_file_extension)

C:\amazon-sampled-dataset\All_Beauty.csv\part-00000-dd5552d7-8773-4759-bb70-89bb389750fa-c000.csv
All_Beauty Original row count: 29703
All_Beauty Cleaned row count: 29703
All_Beauty Rows removed: 0
C:\amazon-sampled-dataset\Amazon_Fashion.csv\part-00000-c3922f1a-043f-4aa0-85a4-5f3125e4dbd4-c000.csv


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean[column] = df_clean[column].astype(expected_type)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean[column] = df_clean[column].astype(expected_type)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean[column] = df_clean[column].map({'true': True, 'false': False, '0': False, '1': T

Amazon_Fashion Original row count: 29180
Amazon_Fashion Cleaned row count: 29179
Amazon_Fashion Rows removed: 1
C:\amazon-sampled-dataset\Appliances.csv\part-00000-6500a0d7-bfac-4dee-97c8-ef8c23a80658-c000.csv
Appliances Original row count: 29070
Appliances Cleaned row count: 29070
Appliances Rows removed: 0
C:\amazon-sampled-dataset\Arts_Crafts_and_Sewing.csv\part-00000-6a01b8b4-7417-4e8a-b247-5f8d2c8c9676-c000.csv



Arts_Crafts_and_Sewing Original row count: 29244
Arts_Crafts_and_Sewing Cleaned row count: 29243
Arts_Crafts_and_Sewing Rows removed: 1
C:\amazon-sampled-dataset\Automotive.csv\part-00000-5d254ddc-cf40-4922-a915-30bcdbe9d6e3-c000.csv
Automotive Original row count: 29277
Automotive Cleaned row count: 29277
Automotive Rows removed: 0
C:\amazon-sampled-dataset\Baby_Products.csv\part-00000-a262cc79-3fc1-4301-b092-9667c828b488-c000.csv
Baby_Products Original row count: 29187
Baby_Products Cleaned row count: 29187
Baby_Products Rows removed: 0
C:\amazon-sampled-dataset\Beauty_and_Personal_Care.csv\part-00000-9ca9abf2-4c2f-4381-9374-7b4dec75d6f2-c000.csv



Beauty_and_Personal_Care Original row count: 29468
Beauty_and_Personal_Care Cleaned row count: 29467
Beauty_and_Personal_Care Rows removed: 1
C:\amazon-sampled-dataset\Books.csv\part-00000-31f5ded9-5095-4d28-973c-f7cbabcf3923-c000.csv
Books Original row count: 27139
Books Cleaned row count: 27139
Books Rows removed: 0
C:\amazon-sampled-dataset\CDs_and_Vinyl.csv\part-00000-2870c220-e0d7-4892-9122-9058a7d24b1e-c000.csv
CDs_and_Vinyl Original row count: 25254
CDs_and_Vinyl Cleaned row count: 25254
CDs_and_Vinyl Rows removed: 0
C:\amazon-sampled-dataset\Cell_Phones_and_Accessories.csv\part-00000-db6e3bd2-386c-492f-83b2-671fc894854a-c000.csv
Cell_Phones_and_Accessories Original row count: 29452
Cell_Phones_and_Accessories Cleaned row count: 29452
Cell_Phones_and_Accessories Rows removed: 0
C:\amazon-sampled-dataset\Clothing_Shoes_and_Jewelry.csv\part-00000-863d6c3f-f7ec-4f56-912a-12af08ac496b-c000.csv
Clothing_Shoes_and_Jewelry Original row count: 28875
Clothing_Shoes_and_Jewelry Cleaned ro


Musical_Instruments Original row count: 28805
Musical_Instruments Cleaned row count: 28804
Musical_Instruments Rows removed: 1
C:\amazon-sampled-dataset\Office_Products.csv\part-00000-d5e87346-cf34-4f2f-b195-cd102e81c7cd-c000.csv
Office_Products Original row count: 28982
Office_Products Cleaned row count: 28982
Office_Products Rows removed: 0
C:\amazon-sampled-dataset\Patio_Lawn_and_Garden.csv\part-00000-07e693b9-b6ab-4942-a01f-d84a59355fed-c000.csv



Patio_Lawn_and_Garden Original row count: 29286
Patio_Lawn_and_Garden Cleaned row count: 29285
Patio_Lawn_and_Garden Rows removed: 1
C:\amazon-sampled-dataset\Pet_Supplies.csv\part-00000-c1a4d709-db54-44db-8361-df3f194ced5f-c000.csv
Pet_Supplies Original row count: 29474
Pet_Supplies Cleaned row count: 29474
Pet_Supplies Rows removed: 0
C:\amazon-sampled-dataset\Software.csv\part-00000-8ae68f3e-8eca-4ae2-83b5-e30c1e6b01a7-c000.csv
Software Original row count: 29192
Software Cleaned row count: 29192
Software Rows removed: 0
C:\amazon-sampled-dataset\Sports_and_Outdoors.csv\part-00000-b6f4d328-e0e2-4591-9384-76b18707574d-c000.csv
Sports_and_Outdoors Original row count: 29146
Sports_and_Outdoors Cleaned row count: 29146
Sports_and_Outdoors Rows removed: 0
C:\amazon-sampled-dataset\Subscription_Boxes.csv\part-00000-388df76a-c9f7-4aa3-9a7d-36896b051bcc-c000.csv
Subscription_Boxes Original row count: 12930
Subscription_Boxes Cleaned row count: 12930
Subscription_Boxes Rows removed: 0
C:\amaz


Tools_and_Home_Improvement Original row count: 29148
Tools_and_Home_Improvement Cleaned row count: 29147
Tools_and_Home_Improvement Rows removed: 1
C:\amazon-sampled-dataset\Toys_and_Games.csv\part-00000-8f98471a-b38c-48e6-83ce-e4b49abae4bc-c000.csv
Toys_and_Games Original row count: 29315
Toys_and_Games Cleaned row count: 29315
Toys_and_Games Rows removed: 0
C:\amazon-sampled-dataset\Unknown.csv\part-00000-f76284ac-b082-40c2-bf14-9b9fb55d3350-c000.csv
Unknown Original row count: 28801
Unknown Cleaned row count: 28801
Unknown Rows removed: 0
C:\amazon-sampled-dataset\Video_Games.csv\part-00000-eeb0a89c-bba0-4aa2-903c-002c2c3678a3-c000.csv
Video_Games Original row count: 28431
Video_Games Cleaned row count: 28431
Video_Games Rows removed: 0
