## A. Identification of Candidate Datasets

1. Wordbank: An open database of children's vocabulary development
- Course: Text mining, embeddings
- Beyond course: Natural Language
- Dataset size and structure: There are over 7 million entries. Each row includes information about the child as well as the word that they are trying to learn.
- Data types: Mainly text. There are a few attributes such as child age that are numeric.
- Target variables: Child attributes when a word is learned.
- Licensing and usage constraints: CC BY 4.0. If used in a derivative work, credit must be given. Otherwise, the dataset is free to use.
2. YouTube Video Social Graph
- Course: Graph Mining, PageRank, Community Detection
- Beyond course: Link Prediction
- Dataset size and structure: About 60 zip files of different crawls. Each file typically has 60,000-100,000 videos. Each entry has information about the current video and 20 videoIds that show up in the current video's recommended videos.
- Data types: Mainly text. There are a few attributes such as views that are numeric.
- Target variables: The videoIds in a given video's recommended videos section.
- Licensing and usage constraints: None specified.
3. Amazon Sales Dataset
- Course: Frequent Itemsets, Association Rules, Text mining
- Beyond course: Sentiment-Augmented Rules
- Dataset size and structure: There are about 1500 entries. Each entry gives product information, the buyer, and the review that the buyer wrote.
- Data types: A split between numeric and text.
- Target variables: Product rating given a userId and a productId.
- Licensing and usage constraints: CC BY-NC-SA 4.0. Users must credit the creator and indicate any changes when redistributing the dataset. Use must be non-commercial, and derivative works must be shared under the same license.

## Resources.

On my honor, I declare the following resources:
1. Collaborators:
None

2. Web Sources:
- https://wordbank.stanford.edu/data/
- https://netsg.cs.sfu.ca/youtubedata/
- https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset/data

3. AI Tools:
None

## B. Comparative Analysis of Datasets

1. Wordbank: An open database of children's vocabulary development
- Supported data mining tasks: Text mining (course), embeddings (course), natural language processing (external)
- Data quality issues: Some entries are sounds, such as "baa baa", whose spelling variations could be counted as different words. However, the dataset appears well maintained, so data quality is unlikely to be a major issue.
- Algorithmic feasibility: Because of the size of the dataset, some algorithms may not work on the full dataset. A sample will likely need to be taken. This may be fine if we want to sample a certain word or certain child for our analysis.
- Bias considerations: Socioeconomic factors about the children are not included in the database, so we do not know whether all demographics and economic statuses are well represented. Children whose parents are more involved in their learning process are more likely to be included in this dataset, because parents who are too busy or do not care will not seek out research like this.
- Ethical considerations: If we predict baselines for how old children should be before learning certain words, and our data is based on children with more opportunities due to their economic or social status, then we will falsely claim that most children are behind in their educational development. This could cause children to believe they are less capable of learning than their peers.

2. YouTube Video Social Graph
- Supported data mining tasks: Graph Mining (course), PageRank (course), Community Detection (course), Link Prediction (external)
- Data quality issues: Because the data was collected over such a long period of time, YouTubers could have changed their channel or video names during collection. Videos and channels could also have been deleted during that time. Comments and ratings are very commonly fabricated on YouTube as well. The YouTuber chooses the video category themselves, and sometimes they may choose one that represents their channel as a whole rather than the video itself.
- Algorithmic feasibility: It may be difficult to create a graph of millions of nodes. Because there are so many files, the graph may not be fully connected. Choosing a few crawls and analyzing them separately may be the best option.
- Ethical considerations: The recommended age or potential objectionable content is not a feature of this dataset so harmful content may be shown to minors.
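
The connectivity concern raised under algorithmic feasibility can be checked cheaply before running PageRank or community detection. The sketch below is only an illustration: it assumes each crawl record has already been parsed into a video ID plus its list of related video IDs, it treats edges as undirected, and the toy IDs and helper names (`build_undirected_graph`, `connected_components`) are hypothetical rather than part of the dataset's tooling.

```python
from collections import defaultdict, deque

def build_undirected_graph(records):
    """Build an adjacency map from (video_id, related_ids) pairs."""
    graph = defaultdict(set)
    for video_id, related_ids in records:
        graph[video_id]  # make sure videos without related IDs still appear
        for other in related_ids:
            graph[video_id].add(other)
            graph[other].add(video_id)
    return dict(graph)

def connected_components(graph):
    """Return a list of components, each a set of node IDs (breadth-first search)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Toy records standing in for one parsed crawl (hypothetical IDs).
records = [
    ("v1", ["v2", "v3"]),
    ("v2", ["v1"]),
    ("v4", ["v5"]),
    ("v6", []),
]
graph = build_undirected_graph(records)
components = connected_components(graph)
component_sizes = sorted(len(c) for c in components)
print(component_sizes)
```

If one crawl yields a single large component plus many tiny ones, analyzing the large component on its own may be a reasonable way to keep the graph manageable.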

3. Amazon Sales Dataset
- Supported data mining tasks: Frequent Itemsets (course), Association Rules (course), Text mining (course), Sentiment-Augmented Rules (external)
- Data quality issues: There are null values in some of the columns. Many of the columns are unorganized text. 
- Algorithmic feasibility: Text mining on such large paragraphs may be difficult. However, the dataset is smaller than the other two candidates, so there will likely not be any large issues.
- Bias considerations: I do not know where this data came from, so it likely does not represent Amazon shoppers as a whole. Only a small number of Amazon users actually leave reviews, so the behavior of reviewers likely does not resemble that of the true population.
- Ethical considerations: If a manufacturer inflates their product with fake reviews, then we may recommend faulty products to our users.

## Resources.

On my honor, I declare the following resources:
1. Collaborators:
None

2. Web Sources:
- https://wordbank.stanford.edu/data/
- https://netsg.cs.sfu.ca/youtubedata/
- https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset/data

3. AI Tools:
None

## C. Dataset Selection

I have selected the Amazon Sales Dataset for my project.

Reasons:
- Supports both bucket-item analysis and text mining.
- Many possibilities for analysis.
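
To make the bucket-item idea concrete, here is a minimal sketch of how support and confidence for one item pair could be computed. It assumes baskets are formed as the set of products each user reviewed, which is only one possible definition; the basket contents and the `pair_support` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set holds the products reviewed by one user.
baskets = [
    {"cable", "charger"},
    {"cable", "charger", "case"},
    {"cable", "case"},
    {"charger"},
]

def pair_support(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = pair_support(baskets)

n_baskets = len(baskets)
# Support of {cable, charger}: fraction of baskets containing both items.
support = pair_counts[("cable", "charger")] / n_baskets
# Confidence of cable -> charger: of the baskets with cable, the share that also has charger.
confidence = pair_counts[("cable", "charger")] / item_counts["cable"]

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

If most baskets in the real data turn out to contain a single product, pair counts like these would be near zero, which would itself be a useful finding.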

Trade-offs:
- Need to handle null values.
- Text blocks may be difficult to handle.
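
As a rough illustration of the null-value trade-off, the sketch below counts missing values per column and drops incomplete rows. The toy frame and its values are made up for the example. Dropping rows is only one option, and imputing values may be preferable depending on the analysis.

```python
import pandas as pd

# Toy frame standing in for the dataset (values are illustrative only).
toy = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "rating": [4.2, None, 3.9],
    "rating_count": [120, 45, None],
})

# Count missing values per column to see where the nulls are.
null_counts = toy.isna().sum()
print(null_counts)

# One option: keep only rows that have both fields needed for a rating analysis.
cleaned = toy.dropna(subset=["rating", "rating_count"])
print(cleaned)
```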

## Resources.

On my honor, I declare the following resources:
1. Collaborators:
None

2. Web Sources:
- https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset/data

3. AI Tools:
None

## D. Exploratory Data Analysis

In [2]:
```python
import kagglehub

# Download latest version
path = kagglehub.dataset_download("karkavelrajaj/amazon-sales-dataset")

print("Path to dataset files:", path)
```

Path to dataset files: C:\Users\zacha\.cache\kagglehub\datasets\karkavelrajaj\amazon-sales-dataset\versions\1


In [6]:
```python
import os
import glob
import pandas as pd
from IPython.display import display  # implicit in notebooks; explicit so the cell also runs as a script

# KaggleHub gives a *directory* path; find the CSV inside it
csv_files = glob.glob(os.path.join(path, "*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in directory: {path}")

csv_path = csv_files[0]
print("Using CSV file:", csv_path)

# Load dataset
df = pd.read_csv(csv_path)

# Clean and convert numeric columns
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

def _clean_price(series):
    """Strip the rupee sign and thousands separators from a price column."""
    return (series.astype(str)
                  .str.replace('₹', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .str.strip())

df['discounted_price'] = pd.to_numeric(_clean_price(df['discounted_price']), errors='coerce').astype('Float64')
df['actual_price'] = pd.to_numeric(_clean_price(df['actual_price']), errors='coerce').astype('Float64')

df['discount_percentage'] = pd.to_numeric(
    df['discount_percentage'].astype(str).str.replace('%', '', regex=False).str.strip(),
    errors='coerce',
).astype('Int64')

df['rating_count'] = pd.to_numeric(
    df['rating_count'].astype(str).str.replace(',', '', regex=False).str.strip(),
    errors='coerce',
).astype('Int64')

# Basic info
df.info()
display(df.head())  # first five rows

# Unique counts
# Note: each user_id value is a comma-separated list of reviewer IDs (see the
# sample rows), so nunique() counts distinct lists, not individual users.
n_users = df['user_id'].nunique()
n_products = df['product_id'].nunique()
print(f"\nUnique user_ids: {n_users}")
print(f"Unique product_ids: {n_products}")

# Average number of user_ids per product_id
avg_users_per_product = (
    df.groupby('product_id')['user_id'].nunique().mean()
)

# Average number of product_ids per user_id
avg_products_per_user = (
    df.groupby('user_id')['product_id'].nunique().mean()
)

print(f"\nAverage number of user_ids per product_id: {avg_users_per_product:.3f}")
print(f"Average number of product_ids per user_id: {avg_products_per_user:.3f}")

# Average rating per category
avg_rating_per_category = df.groupby('category')['rating'].mean().sort_values(ascending=False)

print("\nAverage rating per category:")
display(avg_rating_per_category)
```

```
Using CSV file: C:\Users\zacha\.cache\kagglehub\datasets\karkavelrajaj\amazon-sales-dataset\versions\1\amazon.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   product_id           1465 non-null   object
 1   product_name         1465 non-null   object
 2   category             1465 non-null   object
 3   discounted_price     1465 non-null   Float64
 4   actual_price         1465 non-null   Float64
 5   discount_percentage  1465 non-null   Int64
 6   rating               1464 non-null   float64
 7   rating_count         1463 non-null   Int64
 8   about_product        1465 non-null   object
 9   user_id              1465 non-null   object
 10  user_name            1465 non-null   object
 11  review_id            1465 non-null   object
 12  review_title         1465 non-null   object
 13  review_content       146
```

```
Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,399.0,1099.0,64,4.2,24269,High Compatibility : Compatible With iPhone 12...,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jasp...","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,199.0,349.0,43,4.0,43994,"Compatible with all Type C enabled devices, be...","AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar,Sagar Viswanathan,Asp,Plac...","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Ambrane-Unbreakable-Char...
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,199.0,1899.0,90,3.9,7928,【 Fast Charger& Data Sync】-With built-in safet...,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu,viswanath,sai niharka,saqib mal...","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Sounce-iPhone-Charging-C...
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,Computers&Accessories|Accessories&Peripherals|...,329.0,699.0,53,4.2,94363,The boAt Deuce USB 300 2 in 1 cable is compati...,"AEWAZDZZJLQUYVOVGBEUKSLXHQ5A,AG5HTSFRRE6NL3M5S...","Omkar dhale,JD,HEMALATHA,Ajwadh a.,amar singh ...","R3EEUZKKK9J36I,R3HJVYCLYOY554,REDECAZ7AMPQC,R1...","Good product,Good one,Nice,Really nice product...","Good product,long wire,Charges good,Nice,I bou...",https://m.media-amazon.com/images/I/41V5FtEWPk...,https://www.amazon.in/Deuce-300-Resistant-Tang...
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,Computers&Accessories|Accessories&Peripherals|...,154.0,399.0,61,4.2,16905,[CHARGE & SYNC FUNCTION]- This cable comes wit...,"AE3Q6KSUK5P75D5HFYHCRAOLODSA,AFUGIFH5ZAFXRDSZH...","rahuls6099,Swasat Borah,Ajay Wadke,Pranali,RVK...","R1BP4L2HH9TFUP,R16PVJEXKV6QZS,R2UPDB81N66T4P,R...","As good as original,Decent,Good one for second...","Bought this instead of original apple, does th...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Portronics-Konnect-POR-1...
```



```
Unique user_ids: 1194
Unique product_ids: 1351

Average number of user_ids per product_id: 1.007
Average number of product_ids per user_id: 1.140

Average rating per category:

category
Computers&Accessories|Tablets                                                                                    4.6
Computers&Accessories|NetworkingDevices|NetworkAdapters|PowerLANAdapters                                         4.5
Electronics|Cameras&Photography|Accessories|Film                                                                 4.5
Computers&Accessories|Components|Memory                                                                          4.5
Electronics|HomeAudio|MediaStreamingDevices|StreamingClients                                                     4.5
                                                                                                                ...
Computers&Accessories|Accessories&Peripherals|Audio&VideoAccessories|PCMicrophones                               3.6
Electronics|HomeTheater,TV&Video|Accessories|3DGlasses                                                           3.5
Computers&Accessories|Accessories&Peripherals|Audio&Vid
```

My EDA consists of a table of the columns and data types and a small sample of entries from the dataset. I counted the unique users and products to get an idea of what kind of bucket-item analysis could be done. I then printed the average number of users per product and of products per user, and both came out close to one. Note that each user_id value in the sample rows is a comma-separated list of reviewer IDs, so these counts treat each list as a single user. Finally, I printed the average rating per category. The number of ratings per category is closer to 7, and it could be increased by splitting the category strings, since each string contains multiple categories.
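
The sample rows show that each `user_id` value holds several reviewer IDs joined by commas, so per-user counts depend on whether those lists are split first. The sketch below shows one way to split them into one row per (product, reviewer) pair before counting. It runs on a made-up toy frame, not the real dataset, so the numbers it prints say nothing about the actual data.

```python
import pandas as pd

# Toy rows mimicking the comma-separated user_id layout (IDs are hypothetical).
toy = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "user_id": ["U1,U2,U3", "U2,U4", "U1"],
})

# Split each list and expand to one row per (product, reviewer) pair.
pairs = (
    toy.assign(user_id=toy["user_id"].str.split(","))
       .explode("user_id")
       .reset_index(drop=True)
)

n_users = pairs["user_id"].nunique()
products_per_user = pairs.groupby("user_id")["product_id"].nunique()
avg_products_per_user = products_per_user.mean()

print(f"Unique individual users: {n_users}")
print(f"Average products per user: {avg_products_per_user:.3f}")
```

Running the same steps on the real `df` would show whether individual reviewers repeat across products more often than the list-level counts suggest.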

## Resources.

On my honor, I declare the following resources:
1. Collaborators:
None

2. Web Sources:
- https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset/data

3. AI Tools:
GitHub Copilot Agent (GPT-5.1): I prompted for code for the types of functions that I wanted performed on the dataset. I also prompted for specific type conversions so that the functions could be performed successfully.

## E. Initial Insights and Direction

Observation: The dataset rarely repeats customers and products.

Hypothesis: The dataset is meant more for analysis of reviews.

Observation: The category attribute often includes multiple categories within the single string.

Hypothesis: The category attribute must be split into its component categories before any per-category analysis can be interpreted properly.
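
A minimal sketch of that split, assuming the pipe character separates category levels as it does in the sample rows. The toy ratings and the `mean_rating_by_level` helper are hypothetical and only show the grouping logic.

```python
from collections import defaultdict

# Toy (category, rating) rows; ratings are made up for illustration.
rows = [
    ("Computers&Accessories|Tablets", 4.0),
    ("Computers&Accessories|Components|Memory", 5.0),
    ("Electronics|HomeAudio|MediaStreamingDevices", 3.0),
    ("Electronics|Cameras&Photography|Accessories", 4.0),
]

def mean_rating_by_level(rows, level=0):
    """Average rating grouped by the category component at position `level`."""
    totals = defaultdict(lambda: [0.0, 0])  # name -> [rating sum, row count]
    for category, rating in rows:
        parts = category.split("|")
        if level < len(parts):  # skip rows whose category is not that deep
            entry = totals[parts[level]]
            entry[0] += rating
            entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

top_level = mean_rating_by_level(rows, level=0)
print(top_level)
```

Grouping at the top level pools many more rows per group than grouping on the full string, at the cost of mixing quite different product types.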

Potential RQs:
- Do similar reviews correlate with similar ratings?
- Can we predict the ranking of items within a category based on words used in the reviews?
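
The first research question could be approached, as a first pass, by comparing review-text similarity with the gap in ratings for each pair of entries. The sketch below uses bag-of-words cosine similarity, which is a simple stand-in and not a method chosen for the project; the review texts, ratings, and helper names are all hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical (review text, rating) pairs.
reviews = [
    ("fast charging good cable", 4.5),
    ("good cable fast charging", 4.4),
    ("stopped working after a week", 2.0),
]

vectors = [bag_of_words(text) for text, _ in reviews]

# For every pair of reviews, record text similarity and absolute rating gap.
pairs = []
for i in range(len(reviews)):
    for j in range(i + 1, len(reviews)):
        similarity = cosine_similarity(vectors[i], vectors[j])
        rating_gap = abs(reviews[i][1] - reviews[j][1])
        pairs.append((i, j, similarity, rating_gap))

for i, j, similarity, rating_gap in pairs:
    print(f"reviews {i} and {j}: similarity={similarity:.2f}, rating gap={rating_gap:.2f}")
```

On the real data, a negative correlation between similarity and rating gap would support the idea that similar reviews go with similar ratings. The sample rows suggest each entry combines several reviews for one product, so the comparison would be between products rather than individual reviewers.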

## Resources.

On my honor, I declare the following resources:
1. Collaborators:
None

2. Web Sources:
- https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset/data

3. AI Tools:
None