# Data Preprocessing Pipeline
## Anime Recommendation System - Kaggle Dataset

---

## Step 1: Setup and Import Libraries

**Objective:** Import all necessary Python libraries for data manipulation, analysis, and visualization.

**Libraries Used:**
- pandas - Data manipulation and CSV handling
- numpy - Numerical operations
- matplotlib & seaborn - Data visualization
- pathlib - File path management
- warnings - Suppress warning messages

**Expected Output:** No errors, all libraries loaded successfully

In [1]:
# Import required libraries
import pandas as pd              # Data manipulation
import numpy as np               # Numerical operations
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns           # Advanced visualizations
from pathlib import Path        # File path handling
import warnings                 # Warning control

# Configure settings
warnings.filterwarnings('ignore')  # Suppress warnings for cleaner output
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows

# Verify imports
print("✓ All libraries imported successfully!")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")

✓ All libraries imported successfully!
✓ Pandas version: 2.3.3
✓ NumPy version: 2.3.5


---

## Step 2: Load Raw Data

**Objective:** Load the raw anime and rating datasets from CSV files.

**Input Files:**
- ../data/raw/anime.csv - Contains anime information (7 columns)
- ../data/raw/rating.csv - Contains user ratings (3 columns)

**Expected Output:** 
- Two dataframes loaded successfully
- Display of dataset shapes (rows x columns)
- No loading errors

**Note:** Using encoding='utf-8' for anime.csv to handle special characters in anime names (Japanese characters, symbols, etc.)
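
The rating file is far larger than the anime file, so load options matter more there. A minimal sketch, assuming the three rating columns fit in small integer types (an assumption to check against the real file, not something the notebook does), of how explicit `dtype` arguments to `pd.read_csv` could reduce memory. The inline CSV text is toy data standing in for rating.csv.

```python
import io
import pandas as pd

# Toy stand-in for rating.csv (hypothetical rows, same three columns)
csv_text = "user_id,anime_id,rating\n1,20,-1\n1,24,-1\n3,20,8\n"

# Assumption: IDs fit in int32 and ratings (-1 to 10) fit in int8
ratings = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"user_id": "int32", "anime_id": "int32", "rating": "int8"},
)

print(ratings.dtypes)
# Bytes used by the column data only (index excluded)
print(ratings.memory_usage(index=False).sum(), "bytes")
```

With the default int64 columns the same three rows would take 8 bytes per value, so the saving grows linearly with the row count.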

In [2]:
# Load raw datasets
print("=" * 60)
print("LOADING RAW DATA")
print("=" * 60)

# Load anime dataset with UTF-8 encoding for special characters
anime_raw = pd.read_csv('../data/raw/anime.csv', encoding='utf-8')

# Load rating dataset
rating_raw = pd.read_csv('../data/raw/rating.csv')

# Display shapes
print(f"\n✓ Anime dataset loaded: {anime_raw.shape[0]:,} rows × {anime_raw.shape[1]} columns")
print(f"✓ Rating dataset loaded: {rating_raw.shape[0]:,} rows × {rating_raw.shape[1]} columns")

# Quick preview of data
print("\n--- Anime Data Preview ---")
print(anime_raw.head(3))

print("\n--- Rating Data Preview ---")
print(rating_raw.head(3))

LOADING RAW DATA

✓ Anime dataset loaded: 12,294 rows × 7 columns
✓ Rating dataset loaded: 7,813,737 rows × 3 columns

--- Anime Data Preview ---
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   

                                               genre   type episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie        1    9.37   
1  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV       64    9.26   
2  Action, Comedy, Historical, Parody, Samurai, S...     TV       51    9.25   

   members  
0   200630  
1   793665  
2   114262  

--- Rating Data Preview ---
   user_id  anime_id  rating
0        1        20      -1
1        1        24      -1
2        1        79      -1


---

## Step 3: Initial Data Exploration

**Objective:** Understand the structure, data types, and quality of raw data before cleaning.

**Analysis Performed:**
- Column names and data types
- Missing values count per column
- Basic statistical summary (min, max, mean, etc.)
- Special case: Count of -1 ratings (watched but not rated)

**Purpose:** Identify data quality issues that need to be addressed in cleaning phase.

**Expected Findings:**
- Missing values in anime data (genre, type, episodes, rating)
- -1 ratings in rating data (need to be removed)
- Data type validation
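
Because -1 is a sentinel rather than a real score, summary statistics computed on the raw rating column are pulled downward by it. A minimal sketch on toy data (the frame and variable names are illustrative assumptions) of measuring the sentinel's share and masking it before taking a mean:

```python
import pandas as pd

# Toy ratings: two unrated entries (-1) and two real scores
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 3, 3],
    "anime_id": [20, 24, 20, 154],
    "rating": [-1, -1, 8, 6],
})

# Share of rows carrying the sentinel value
unrated_share = (toy_ratings["rating"] == -1).mean()

# Replace the sentinel with NaN so it is skipped by mean(), describe(), etc.
rated_only = toy_ratings["rating"].where(toy_ratings["rating"] != -1)

raw_mean = toy_ratings["rating"].mean()
masked_mean = rated_only.mean()
print(unrated_share, raw_mean, masked_mean)
```

The gap between the raw and masked means is the reason the `describe()` output for the rating data should be read with the -1 share in mind.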

In [3]:
# Explore raw data structure and quality
print("\n" + "=" * 60)
print("INITIAL DATA EXPLORATION")
print("=" * 60)

# --- ANIME DATASET EXPLORATION ---
print("\n--- ANIME DATASET ---")
print(f"\nColumns: {anime_raw.columns.tolist()}")
print(f"\nData Types:\n{anime_raw.dtypes}")
print(f"\nMissing Values:\n{anime_raw.isnull().sum()}")
print(f"\nBasic Statistics:\n{anime_raw.describe()}")

# --- RATING DATASET EXPLORATION ---
print("\n--- RATING DATASET ---")
print(f"\nColumns: {rating_raw.columns.tolist()}")
print(f"\nData Types:\n{rating_raw.dtypes}")
print(f"\nMissing Values:\n{rating_raw.isnull().sum()}")
print(f"\nBasic Statistics:\n{rating_raw.describe()}")

# Check for -1 ratings (special case: watched but not rated)
unrated_count = (rating_raw['rating'] == -1).sum()
unrated_percent = (unrated_count / len(rating_raw)) * 100
print(f"\n⚠ Ratings with -1 (watched but not rated): {unrated_count:,} ({unrated_percent:.2f}%)")

# Summary
print(f"\n--- SUMMARY ---")
print(f"Anime with missing ratings: {anime_raw['rating'].isnull().sum():,}")
print(f"Anime with missing genres: {anime_raw['genre'].isnull().sum():,}")
print(f"Total missing values in rating data: {rating_raw.isnull().sum().sum():,}")


INITIAL DATA EXPLORATION

--- ANIME DATASET ---

Columns: ['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

Data Types:
anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object

Missing Values:
anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Basic Statistics:
           anime_id        rating       members
count  12294.000000  12064.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.026746  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.880000  2.250000e+02
50%    10260.500000      6.570000  1.550000e+03
75%    24794.500000      7.180000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06

--- RATING DATASET ---

Columns: ['user_id', 'anime_id', 'rating']

Data Types:
user_id     int64
anime_id   

---

## Step 4: Clean Anime Dataset (Complete)

**Objective:** Perform all anime dataset cleaning operations in one comprehensive step.

**Cleaning Operations:**
1. Remove duplicate rows
2. Remove rows with missing names
3. Fill missing genres with 'Unknown'
4. Fill missing types with 'Unknown'
5. Handle missing/unknown episodes (convert to 0)
6. Remove rows with missing ratings (critical field)
7. Remove invalid ratings (outside 0-10 range)
8. Handle missing members and convert to integer
9. Remove anime with <10 members (too obscure)
10. Clean whitespace from text columns
11. Reset index

**Rationale:** All cleaning operations combined for efficiency and clarity.

**Expected Result:** Fully cleaned anime dataset ready for use
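
Operation 5 relies on numeric coercion, and a count taken with `isnull()` before that coercion only sees true NaN cells. A minimal sketch, assuming the placeholder text in the `episodes` column is the string 'Unknown' (an assumption about the raw file), of counting the entries that coercion actually turns into missing values:

```python
import pandas as pd

# Toy episodes column stored as text, as in an object-dtype column
toy = pd.DataFrame({"episodes": ["1", "64", "Unknown", "51", "Unknown"]})

# Before coercion: placeholder strings are not NaN, so this is 0
nan_before = toy["episodes"].isnull().sum()

# Coerce to numbers; anything unparseable becomes NaN
episodes_num = pd.to_numeric(toy["episodes"], errors="coerce")

# After coercion: this counts the placeholder entries as well
unparseable = episodes_num.isnull().sum()

# Fill with 0 and store as integers, mirroring operation 5
toy["episodes"] = episodes_num.fillna(0).astype(int)
print(nan_before, unparseable, toy["episodes"].tolist())
```

Counting after `pd.to_numeric` is what makes the reported figure reflect both missing and unparseable values.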

In [4]:
# Complete anime dataset cleaning
print("\n" + "=" * 60)
print("CLEANING ANIME DATASET (COMPLETE)")
print("=" * 60)

# Create a copy to preserve original data
anime_cleaned = anime_raw.copy()
print(f"\nStarting with: {len(anime_cleaned):,} rows")

# 1. Remove duplicates
duplicates_count = anime_cleaned.duplicated().sum()
anime_cleaned = anime_cleaned.drop_duplicates()
print(f"\n1. Duplicates removed: {duplicates_count:,}")

# 2. Remove rows with missing names
missing_names = anime_cleaned['name'].isnull().sum()
if missing_names > 0:
    anime_cleaned = anime_cleaned.dropna(subset=['name'])
    print(f"2. Rows with missing names removed: {missing_names:,}")
else:
    print(f"2. No missing names found ✓")

# 3. Fill missing genres
missing_genre = anime_cleaned['genre'].isnull().sum()
anime_cleaned['genre'] = anime_cleaned['genre'].fillna('Unknown')
print(f"3. Missing genres filled with 'Unknown': {missing_genre:,}")

# 4. Fill missing types
missing_type = anime_cleaned['type'].isnull().sum()
anime_cleaned['type'] = anime_cleaned['type'].fillna('Unknown')
print(f"4. Missing types filled with 'Unknown': {missing_type:,}")

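# 5. Handle missing/unknown episodes
# Note: this count is taken before the numeric conversion, so it covers only true NaN cells;
# non-numeric placeholder strings that to_numeric coerces to NaN below are not included in it.
missing_episodes = anime_cleaned['episodes'].isnull().sum()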
anime_cleaned['episodes'] = pd.to_numeric(anime_cleaned['episodes'], errors='coerce')
anime_cleaned['episodes'] = anime_cleaned['episodes'].fillna(0)
anime_cleaned['episodes'] = anime_cleaned['episodes'].astype(int)
print(f"5. Missing/Unknown episodes filled with 0: {missing_episodes:,}")

# 6. Remove rows with missing ratings (CRITICAL)
missing_ratings = anime_cleaned['rating'].isnull().sum()
anime_cleaned = anime_cleaned.dropna(subset=['rating'])
print(f"6. Rows with missing ratings removed: {missing_ratings:,}")

# 7. Remove invalid ratings
invalid_low = (anime_cleaned['rating'] < 0).sum()
invalid_high = (anime_cleaned['rating'] > 10).sum()
anime_cleaned = anime_cleaned[(anime_cleaned['rating'] >= 0) & (anime_cleaned['rating'] <= 10)]
print(f"7. Invalid ratings removed: {invalid_low + invalid_high:,}")

# 8. Handle missing members
missing_members = anime_cleaned['members'].isnull().sum()
anime_cleaned['members'] = anime_cleaned['members'].fillna(0)
anime_cleaned['members'] = anime_cleaned['members'].astype(int)
print(f"8. Missing members filled with 0: {missing_members:,}")

# 9. Remove anime with very few members
min_members_threshold = 10
low_members = (anime_cleaned['members'] < min_members_threshold).sum()
anime_cleaned = anime_cleaned[anime_cleaned['members'] >= min_members_threshold]
print(f"9. Anime with <{min_members_threshold} members removed: {low_members:,}")

# 10. Clean whitespace from text columns
anime_cleaned['name'] = anime_cleaned['name'].str.strip()
anime_cleaned['genre'] = anime_cleaned['genre'].str.strip()
anime_cleaned['type'] = anime_cleaned['type'].str.strip()
print(f"10. Whitespace cleaned from text columns ✓")

# 11. Reset index
anime_cleaned = anime_cleaned.reset_index(drop=True)
print(f"11. Index reset ✓")

# Final summary
rows_removed = len(anime_raw) - len(anime_cleaned)
print(f"\n--- ANIME DATASET CLEANING COMPLETE ---")
print(f"Original: {len(anime_raw):,} rows")
print(f"Cleaned: {len(anime_cleaned):,} rows")
print(f"Removed: {rows_removed:,} rows ({(rows_removed/len(anime_raw)*100):.2f}%)")
print(f"\nMissing values: {anime_cleaned.isnull().sum().sum()}")
print(f"Duplicates: {anime_cleaned.duplicated().sum()}")

# Display sample
print(f"\n--- Sample of Cleaned Data ---")
print(anime_cleaned.head(3))


CLEANING ANIME DATASET (COMPLETE)

Starting with: 12,294 rows

1. Duplicates removed: 0
2. No missing names found ✓
3. Missing genres filled with 'Unknown': 62
4. Missing types filled with 'Unknown': 25
5. Missing/Unknown episodes filled with 0: 0
6. Rows with missing ratings removed: 230
7. Invalid ratings removed: 0
8. Missing members filled with 0: 0
9. Anime with <10 members removed: 0
10. Whitespace cleaned from text columns ✓
11. Index reset ✓

--- ANIME DATASET CLEANING COMPLETE ---
Original: 12,294 rows
Cleaned: 12,064 rows
Removed: 230 rows (1.87%)

Missing values: 0
Duplicates: 0

--- Sample of Cleaned Data ---
   anime_id                              name  \
0     32281                    Kimi no Na wa.   
1      5114  Fullmetal Alchemist: Brotherhood   
2     28977                          Gintama°   

                                               genre   type  episodes  rating  \
0               Drama, Romance, School, Supernatural  Movie         1    9.37   
1  Action, 

---

## Step 5: Clean Rating Dataset (Complete)

**Objective:** Perform all rating dataset cleaning operations in one comprehensive step.

**Cleaning Operations:**
1. Remove duplicate ratings
2. Remove rows with any missing values
3. Remove -1 ratings (watched but not rated - unusable for training)
4. Remove invalid ratings (outside 1-10 range)
5. Remove ratings for anime that don't exist in cleaned anime dataset
6. Remove users with very few ratings (<5 ratings)
7. Ensure correct data types for all columns
8. Reset index for clean structure

**Rationale:** All cleaning operations combined for efficiency and clarity. Ensures data consistency between anime and rating datasets.

**Expected Result:** Fully cleaned rating dataset aligned with anime dataset and ready for use
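
Operation 6 keeps only users with enough ratings. A minimal sketch of the same filter written with `groupby(...).transform("size")`, an alternative to the `value_counts()` plus `isin()` approach in the cell below; the toy frame and the threshold of 3 are illustrative assumptions:

```python
import pandas as pd

# Toy ratings: user 1 has 2 ratings, user 2 has 3, user 3 has 1
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "anime_id": [10, 11, 10, 11, 12, 10],
    "rating": [8, 7, 9, 6, 5, 10],
})

min_ratings_per_user = 3  # hypothetical threshold for the toy data

# Per-row count of how many ratings that row's user has
ratings_by_user = toy.groupby("user_id")["rating"].transform("size")

# Keep rows whose user meets the threshold
kept = toy[ratings_by_user >= min_ratings_per_user].reset_index(drop=True)
print(kept)
```

Either form should be applied after the -1 entries are removed, so that the threshold counts only usable ratings.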

In [6]:
# Complete rating dataset cleaning
print("\n" + "=" * 60)
print("CLEANING RATING DATASET (COMPLETE)")
print("=" * 60)

# Create a copy to preserve original
rating_cleaned = rating_raw.copy()
print(f"\nStarting with: {len(rating_cleaned):,} rows")

# 1. Remove duplicate ratings
duplicates_count = rating_cleaned.duplicated().sum()
rating_cleaned = rating_cleaned.drop_duplicates()
print(f"\n1. Duplicates removed: {duplicates_count:,}")
print(f"   Remaining rows: {len(rating_cleaned):,}")

# 2. Remove rows with missing values
# Count rows (not individual cells) so the figure matches what dropna() removes
missing_before = rating_cleaned.isnull().any(axis=1).sum()
rating_cleaned = rating_cleaned.dropna()
print(f"\n2. Rows with missing values removed: {missing_before:,}")
print(f"   Remaining rows: {len(rating_cleaned):,}")

# 3. Remove -1 ratings (watched but didn't rate)
unrated = (rating_cleaned['rating'] == -1).sum()
rating_cleaned = rating_cleaned[rating_cleaned['rating'] != -1]
print(f"\n3. Unrated entries (-1) removed: {unrated:,}")
print(f"   Remaining rows: {len(rating_cleaned):,}")

# 4. Remove invalid ratings (outside 1-10 range)
invalid_low = (rating_cleaned['rating'] < 1).sum()
invalid_high = (rating_cleaned['rating'] > 10).sum()
rating_cleaned = rating_cleaned[(rating_cleaned['rating'] >= 1) & (rating_cleaned['rating'] <= 10)]
print(f"\n4. Invalid ratings removed: {invalid_low + invalid_high:,}")
print(f"   Remaining rows: {len(rating_cleaned):,}")

# 5. Remove ratings for anime not in cleaned anime dataset
valid_anime_ids = set(anime_cleaned['anime_id'])
before_filter = len(rating_cleaned)
rating_cleaned = rating_cleaned[rating_cleaned['anime_id'].isin(valid_anime_ids)]
removed = before_filter - len(rating_cleaned)
print(f"\n5. Ratings for non-existent anime removed: {removed:,}")
print(f"   Remaining rows: {len(rating_cleaned):,}")

# 6. Remove users with very few ratings (reduce noise)
min_ratings_per_user = 5
user_rating_counts = rating_cleaned['user_id'].value_counts()
valid_users = user_rating_counts[user_rating_counts >= min_ratings_per_user].index
before_user_filter = len(rating_cleaned)
rating_cleaned = rating_cleaned[rating_cleaned['user_id'].isin(valid_users)]
removed_users = before_user_filter - len(rating_cleaned)
print(f"\n6. Ratings from users with <{min_ratings_per_user} ratings removed: {removed_users:,}")
print(f"   Remaining rows: {len(rating_cleaned):,}")

# 7. Ensure correct data types
rating_cleaned['user_id'] = rating_cleaned['user_id'].astype(int)
rating_cleaned['anime_id'] = rating_cleaned['anime_id'].astype(int)
rating_cleaned['rating'] = rating_cleaned['rating'].astype(int)
print(f"\n7. Data types corrected ✓")

# 8. Reset index
rating_cleaned = rating_cleaned.reset_index(drop=True)
print(f"8. Index reset ✓")

# Final summary
rows_removed = len(rating_raw) - len(rating_cleaned)
print(f"\n--- RATING DATASET CLEANING COMPLETE ---")
print(f"Original: {len(rating_raw):,} rows")
print(f"Cleaned: {len(rating_cleaned):,} rows")
print(f"Removed: {rows_removed:,} rows ({(rows_removed/len(rating_raw)*100):.2f}%)")
print(f"\nMissing values: {rating_cleaned.isnull().sum().sum()}")
print(f"Duplicates: {rating_cleaned.duplicated().sum()}")
print(f"Unique users: {rating_cleaned['user_id'].nunique():,}")
print(f"Unique anime: {rating_cleaned['anime_id'].nunique():,}")

# Display sample
print(f"\n--- Sample of Cleaned Data ---")
print(rating_cleaned.head(3))


CLEANING RATING DATASET (COMPLETE)

Starting with: 7,813,737 rows

1. Duplicates removed: 1
   Remaining rows: 7,813,736

2. Rows with missing values removed: 0
   Remaining rows: 7,813,736

3. Unrated entries (-1) removed: 1,476,496
   Remaining rows: 6,337,240

4. Invalid ratings removed: 0
   Remaining rows: 6,337,240

5. Ratings for non-existent anime removed: 7
   Remaining rows: 6,337,233

6. Ratings from users with <5 ratings removed: 18,810
   Remaining rows: 6,318,423

7. Data types corrected ✓
8. Index reset ✓

--- RATING DATASET CLEANING COMPLETE ---
Original: 7,813,737 rows
Cleaned: 6,318,423 rows
Removed: 1,495,314 rows (19.14%)

Missing values: 0
Duplicates: 0
Unique users: 60,970
Unique anime: 9,924

--- Sample of Cleaned Data ---
   user_id  anime_id  rating
0        3        20       8
1        3       154       6
2        3       170       9


---

## Step 6: Data Quality Validation and Statistics

**Objective:** Verify data quality and generate comprehensive statistics for both datasets.

**Validation Checks:**
1. Missing values count (should be 0)
2. Duplicate count (should be 0)
3. Data ranges and distributions
4. Dataset relationships (user counts, anime counts)
5. Data sparsity calculation

**Purpose:** Ensure data is ready for algorithm implementation and understand dataset characteristics.

**Expected Output:** Complete data quality report with statistics
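
Sparsity is the share of the user × anime matrix that has no rating: 1 − ratings / (users × anime). A minimal sketch of that formula as a helper (the function name is a hypothetical one, not part of the notebook), applied to the counts reported for the cleaned rating data: 6,318,423 ratings, 60,970 users and 9,924 anime.

```python
def rating_sparsity(n_ratings, n_users, n_items):
    """Return the fraction of user-item pairs that have no observed rating."""
    if n_users <= 0 or n_items <= 0:
        raise ValueError("n_users and n_items must be positive")
    return 1 - n_ratings / (n_users * n_items)


# Counts taken from the cleaned rating dataset summary
sparsity = rating_sparsity(6_318_423, 60_970, 9_924)
print(f"Data sparsity: {sparsity * 100:.4f}%")
```

The denominator here uses only anime that received at least one rating; using all anime in the cleaned anime table instead would give a slightly higher sparsity.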

In [7]:
# Comprehensive data quality validation
print("\n" + "=" * 60)
print("DATA QUALITY VALIDATION & STATISTICS")
print("=" * 60)

# --- ANIME DATASET QUALITY ---
print("\n--- ANIME DATASET QUALITY ---")
print(f"Total anime: {len(anime_cleaned):,}")
print(f"Missing values: {anime_cleaned.isnull().sum().sum()}")
print(f"Duplicates: {anime_cleaned.duplicated().sum()}")
print(f"Unique anime IDs: {anime_cleaned['anime_id'].nunique():,}")
print(f"Rating range: {anime_cleaned['rating'].min():.2f} - {anime_cleaned['rating'].max():.2f}")
print(f"Average rating: {anime_cleaned['rating'].mean():.2f}")
print(f"Members range: {anime_cleaned['members'].min():,} - {anime_cleaned['members'].max():,}")

print("\nAnime types distribution:")
print(anime_cleaned['type'].value_counts())

# --- RATING DATASET QUALITY ---
print("\n--- RATING DATASET QUALITY ---")
print(f"Total ratings: {len(rating_cleaned):,}")
print(f"Missing values: {rating_cleaned.isnull().sum().sum()}")
print(f"Duplicates: {rating_cleaned.duplicated().sum()}")
print(f"Unique users: {rating_cleaned['user_id'].nunique():,}")
print(f"Unique anime rated: {rating_cleaned['anime_id'].nunique():,}")
print(f"Rating range: {rating_cleaned['rating'].min()} - {rating_cleaned['rating'].max()}")
print(f"Average user rating: {rating_cleaned['rating'].mean():.2f}")
print(f"Median user rating: {rating_cleaned['rating'].median():.1f}")

# --- RELATIONSHIP STATISTICS ---
print("\n--- DATASET RELATIONSHIPS ---")
ratings_per_user = len(rating_cleaned) / rating_cleaned['user_id'].nunique()
ratings_per_anime = len(rating_cleaned) / rating_cleaned['anime_id'].nunique()
print(f"Average ratings per user: {ratings_per_user:.2f}")
print(f"Average ratings per anime: {ratings_per_anime:.2f}")

# Data sparsity calculation (important for recommendation systems)
total_possible_ratings = rating_cleaned['user_id'].nunique() * rating_cleaned['anime_id'].nunique()
sparsity = 1 - (len(rating_cleaned) / total_possible_ratings)
print(f"\nData sparsity: {sparsity * 100:.4f}%")
print(f"(Higher = more sparse; typical for recommendation systems)")

# Rating distribution
print("\n--- RATING DISTRIBUTION ---")
print(rating_cleaned['rating'].value_counts().sort_index())

print("\n" + "=" * 60)
print("✓ DATA QUALITY VALIDATION COMPLETE")
print("=" * 60)


DATA QUALITY VALIDATION & STATISTICS

--- ANIME DATASET QUALITY ---
Total anime: 12,064
Missing values: 0
Duplicates: 0
Unique anime IDs: 12,064
Rating range: 1.67 - 10.00
Average rating: 6.47
Members range: 12 - 1,013,917

Anime types distribution:
type
TV         3671
OVA        3285
Movie      2297
Special    1671
ONA         652
Music       488
Name: count, dtype: int64

--- RATING DATASET QUALITY ---
Total ratings: 6,318,423
Missing values: 0
Duplicates: 0
Unique users: 60,970
Unique anime rated: 9,924
Rating range: 1 - 10
Average user rating: 7.81
Median user rating: 8.0

--- DATASET RELATIONSHIPS ---
Average ratings per user: 103.63
Average ratings per anime: 636.68

Data sparsity: 98.9557%
(Higher = more sparse; typical for recommendation systems)

--- RATING DISTRIBUTION ---
rating
1       16575
2       23104
3       41407
4      104210
5      282553
6      637174
7     1373584
8     1642673
9     1249433
10     947710
Name: count, dtype: int64

✓ DATA QUALITY VALIDATION COMPLETE

---

## Step 7: Save Cleaned CSV Files

**Objective:** Save the cleaned datasets to CSV files.

**Output Files:**
- ../data/processed/anime_cleaned.csv - Cleaned anime dataset
- ../data/processed/rating_cleaned.csv - Cleaned rating dataset
- ../data/processed/combined_data.csv - Merged anime + rating dataset

**Purpose:** Preserve cleaned data for future use and create combined dataset for algorithms
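
Both tables have a `rating` column, so a merge on `anime_id` renames them with pandas' default suffixes (`rating_x` for the user score, `rating_y` for the anime average). A minimal sketch on toy data, offered as an optional variation rather than what the notebook does, of naming the suffixes explicitly and asking pandas to check that each `anime_id` appears once on the anime side:

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (hypothetical values)
ratings = pd.DataFrame({
    "user_id": [3, 3, 5],
    "anime_id": [20, 154, 20],
    "rating": [8, 6, 9],
})
anime = pd.DataFrame({
    "anime_id": [20, 154],
    "name": ["Title A", "Title B"],
    "rating": [7.8, 8.4],
})

combined = ratings.merge(
    anime,
    on="anime_id",
    how="inner",
    suffixes=("_user", "_anime"),   # readable names instead of _x / _y
    validate="many_to_one",         # raises if anime_id repeats in `anime`
)
print(combined.columns.tolist())
```

Code that later looks for `rating_x` (as the ARFF step does) would need the matching name if these suffixes were adopted.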

In [8]:
# Save cleaned CSV files
print("\n" + "=" * 60)
print("SAVING CLEANED CSV FILES")
print("=" * 60)

# Create processed directory if it doesn't exist
Path('../data/processed').mkdir(parents=True, exist_ok=True)
print("\n✓ Ensured 'processed' folder exists")

# Save cleaned anime dataset
anime_cleaned.to_csv('../data/processed/anime_cleaned.csv', index=False, encoding='utf-8')
print(f"✓ Anime dataset saved: ../data/processed/anime_cleaned.csv")
print(f"  Size: {len(anime_cleaned):,} rows × {anime_cleaned.shape[1]} columns")

# Save cleaned rating dataset
rating_cleaned.to_csv('../data/processed/rating_cleaned.csv', index=False)
print(f"✓ Rating dataset saved: ../data/processed/rating_cleaned.csv")
print(f"  Size: {len(rating_cleaned):,} rows × {rating_cleaned.shape[1]} columns")

# Create and save combined dataset
print(f"\n--- Creating Combined Dataset ---")
combined_data = rating_cleaned.merge(
    anime_cleaned, 
    on='anime_id', 
    how='inner'
)
print(f"Combined dataset shape: {combined_data.shape}")

combined_data.to_csv('../data/processed/combined_data.csv', index=False, encoding='utf-8')
print(f"✓ Combined dataset saved: ../data/processed/combined_data.csv")
print(f"  Size: {len(combined_data):,} rows × {combined_data.shape[1]} columns")

# Calculate file sizes in MB (Path is already imported in Step 1)
processed_dir = Path('../data/processed')
anime_file_size = (processed_dir / 'anime_cleaned.csv').stat().st_size / (1024 * 1024)
rating_file_size = (processed_dir / 'rating_cleaned.csv').stat().st_size / (1024 * 1024)
combined_file_size = (processed_dir / 'combined_data.csv').stat().st_size / (1024 * 1024)

print(f"\n--- FILE INFORMATION ---")
print(f"Anime CSV size: {anime_file_size:.2f} MB")
print(f"Rating CSV size: {rating_file_size:.2f} MB")
print(f"Combined CSV size: {combined_file_size:.2f} MB")
print(f"Total size: {anime_file_size + rating_file_size + combined_file_size:.2f} MB")

print("\n✓ All CSV files saved successfully!")


SAVING CLEANED CSV FILES

✓ Ensured 'processed' folder exists
✓ Anime dataset saved: ../data/processed/anime_cleaned.csv
  Size: 12,064 rows × 7 columns
✓ Rating dataset saved: ../data/processed/rating_cleaned.csv
  Size: 6,318,423 rows × 3 columns

--- Creating Combined Dataset ---
Combined dataset shape: (6318423, 9)
✓ Combined dataset saved: ../data/processed/combined_data.csv
  Size: 6,318,423 rows × 9 columns

--- FILE INFORMATION ---
Anime CSV size: 0.87 MB
Rating CSV size: 84.93 MB
Combined CSV size: 588.85 MB
Total size: 674.65 MB

✓ All CSV files saved successfully!


---

## Step 8: Generate ARFF Files for WEKA (Classification & Association)

**Objective:** Convert cleaned datasets to ARFF format optimized for WEKA classification and association algorithms.

**Output Files:**

**For Classification (with class attribute):**
1. anime_classification.arff - Anime data with rating as class
2. rating_classification.arff - Rating data with rating as class
3. combined_classification.arff - Combined data with rating as class ⭐

**For Association Rules (discretized attributes):**
4. anime_association.arff - Anime data in nominal format
5. rating_association.arff - Rating data in nominal format
6. combined_association.arff - Combined data in nominal format ⭐

**Key Features:**
- Classification files: numeric/nominal attributes with 'rating' as class
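- Association files: all attributes converted to nominal, as Apriori requires (WEKA's FPGrowth additionally expects binary attributes, so these files may need a further binarization step before it will run)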
- Proper discretization for continuous values
- WEKA-compatible format
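
The discretization mentioned above can be sketched with `pd.cut`. This minimal example uses toy values with the same bin edges as the cell below; it also shows that an episode count of 0, which the anime cleaning step uses for unknown counts, lands in the 'Short' band.

```python
import pandas as pd

# Ratings 1-10 grouped into three bands: (0-4], (4-7], (7-10]
user_ratings = pd.Series([1, 4, 5, 7, 8, 10])
rating_bands = pd.cut(
    user_ratings,
    bins=[0, 4, 7, 10],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Episodes grouped into (-1, 12], (12, 26], (26, inf)
episode_counts = pd.Series([0, 12, 13, 500])
episode_bands = pd.cut(
    episode_counts,
    bins=[-1, 12, 26, float("inf")],
    labels=["Short", "Medium", "Long"],
)

print(rating_bands.tolist())
print(episode_bands.tolist())
```

Each bin is closed on the right, so a boundary value such as 4 or 12 falls into the lower band.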

**Purpose:** Enable classification (J48, Bayes, OneR) and association (Apriori, FPGrowth) in WEKA
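
For orientation, an ARFF file is a header of `@ATTRIBUTE` declarations followed by comma-separated rows under `@DATA`, with `?` marking a missing value. A minimal sketch, assuming a tiny frame with one nominal and one numeric column and labels that contain no quote characters (the helper name `to_arff_text` is hypothetical and much simpler than the functions in the cell below):

```python
import io
import pandas as pd


def to_arff_text(df, relation):
    """Return a minimal ARFF document for numeric and nominal columns."""
    out = io.StringIO()
    out.write(f"@RELATION {relation}\n\n")
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            out.write(f"@ATTRIBUTE {col} numeric\n")
        else:
            # Nominal attribute: list every distinct non-missing label
            labels = ",".join(f"'{v}'" for v in sorted(df[col].dropna().unique()))
            out.write(f"@ATTRIBUTE {col} {{{labels}}}\n")
    out.write("\n@DATA\n")
    for row in df.itertuples(index=False):
        cells = []
        for col, val in zip(df.columns, row):
            if pd.isna(val):
                cells.append("?")          # ARFF marker for a missing value
            elif pd.api.types.is_numeric_dtype(df[col]):
                cells.append(str(val))
            else:
                cells.append(f"'{val}'")   # assumes no quotes inside labels
        out.write(",".join(cells) + "\n")
    return out.getvalue()


toy = pd.DataFrame({"type": ["TV", "Movie", None], "episodes": [64, 1, 12]})
arff_text = to_arff_text(toy, "toy_anime")
print(arff_text)
```

Every value in a nominal column must appear in that column's declared label set, which is why the header and the data rows have to be generated from the same frame.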

In [9]:
# Generate ARFF files for WEKA (Classification & Association)
print("\n" + "=" * 60)
print("GENERATING ARFF FILES FOR WEKA")
print("=" * 60)

def create_classification_arff(dataframe, filename, relation_name, class_attribute='rating'):
    """
    Create ARFF file for WEKA classification algorithms
    The class_attribute will be placed last and marked as class
    """
    with open(filename, 'w', encoding='utf-8') as f:
        # Write relation name
        f.write(f"@RELATION {relation_name}\n\n")
        
        # Reorder columns to put class attribute last
        cols = [col for col in dataframe.columns if col != class_attribute]
        if class_attribute in dataframe.columns:
            cols.append(class_attribute)
        df_reordered = dataframe[cols]
        
        # Write attributes
        for col in df_reordered.columns:
            dtype = df_reordered[col].dtype
            
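            if col == class_attribute:
                unique_vals = sorted(v for v in df_reordered[col].unique() if pd.notna(v))
                if all(float(v).is_integer() for v in unique_vals):
                    # Integer-valued class (user ratings 1-10): declare as nominal
                    vals = ','.join(str(int(v)) for v in unique_vals)
                    f.write(f"@ATTRIBUTE {col} {{{vals}}}\n")
                else:
                    # Fractional class (e.g. anime average rating): int() would truncate
                    # the labels and no longer match the data rows, so declare it numeric
                    f.write(f"@ATTRIBUTE {col} numeric\n")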
            elif dtype == 'object':
                unique_vals = df_reordered[col].unique()
                if len(unique_vals) > 100:
                    f.write(f"@ATTRIBUTE {col} string\n")
                else:
                    vals = ','.join([f"'{str(v)}'" for v in unique_vals if pd.notna(v)])
                    f.write(f"@ATTRIBUTE {col} {{{vals}}}\n")
            elif dtype in ['int64', 'int32', 'float64', 'float32']:
                f.write(f"@ATTRIBUTE {col} numeric\n")
            else:
                f.write(f"@ATTRIBUTE {col} string\n")
        
        # Write data
        f.write("\n@DATA\n")
        for idx, row in df_reordered.iterrows():
            row_data = []
            for col in df_reordered.columns:
                val = row[col]
                if pd.isna(val):
                    row_data.append('?')
                elif df_reordered[col].dtype == 'object':
                    val_str = str(val).replace("'", "\\'").replace('"', '\\"')
                    # Always quote string values (both original branches did the same)
                    row_data.append(f"'{val_str}'")
                else:
                    row_data.append(str(val))
            f.write(','.join(row_data) + '\n')
    
    print(f"✓ Created: {filename}")

def create_association_arff(dataframe, filename, relation_name):
    """
    Create ARFF file for WEKA association algorithms (Apriori, FPGrowth)
    All attributes must be nominal (categorical)
    """
    with open(filename, 'w', encoding='utf-8') as f:
        # Write relation name
        f.write(f"@RELATION {relation_name}\n\n")
        
        # Discretize numeric columns
        df_nominal = dataframe.copy()
        
        for col in df_nominal.columns:
            dtype = df_nominal[col].dtype
            
            if dtype in ['int64', 'int32', 'float64', 'float32']:
                # Discretize numeric attributes
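                if col in ('rating', 'rating_x', 'rating_y'):
                    # Rating 1-10: group into Low(1-4), Medium(5-7), High(8-10)
                    # rating_x / rating_y are the names the merged dataset gives the two rating columns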
                    df_nominal[col] = pd.cut(df_nominal[col], 
                                            bins=[0, 4, 7, 10], 
                                            labels=['Low', 'Medium', 'High'],
                                            include_lowest=True)
                elif col == 'episodes':
                    # Episodes: Short(0-12, including the 0 used for unknown counts), Medium(13-26), Long(27+)
                    df_nominal[col] = pd.cut(df_nominal[col], 
                                            bins=[-1, 12, 26, float('inf')], 
                                            labels=['Short', 'Medium', 'Long'])
                elif col == 'members':
                    # Members: Low(<1000), Medium(1000-10000), High(>10000)
                    df_nominal[col] = pd.cut(df_nominal[col], 
                                            bins=[-1, 1000, 10000, float('inf')], 
                                            labels=['Low', 'Medium', 'High'])
                elif col == 'anime_id' or col == 'user_id':
                    # Keep IDs as strings
                    df_nominal[col] = df_nominal[col].astype(str)
                else:
                    # Generic discretization for other numeric columns
                    df_nominal[col] = pd.cut(df_nominal[col], 
                                            bins=5, 
                                            labels=['VeryLow', 'Low', 'Medium', 'High', 'VeryHigh'])
        
        # Write attributes (all nominal)
        for col in df_nominal.columns:
            unique_vals = df_nominal[col].unique()
            # Remove NaN from unique values
            unique_vals = [v for v in unique_vals if pd.notna(v)]
            
            if len(unique_vals) > 0:
                if df_nominal[col].dtype == 'object' or df_nominal[col].dtype.name == 'category':
                    vals = ','.join([f"'{str(v)}'" for v in unique_vals])
                    f.write(f"@ATTRIBUTE {col} {{{vals}}}\n")
                else:
                    vals = ','.join([str(v) for v in unique_vals])
                    f.write(f"@ATTRIBUTE {col} {{{vals}}}\n")
        
        # Write data
        f.write("\n@DATA\n")
        for idx, row in df_nominal.iterrows():
            row_data = []
            for col in df_nominal.columns:
                val = row[col]
                if pd.isna(val):
                    row_data.append('?')
                else:
                    val_str = str(val).replace("'", "\\'").replace('"', '\\"')
                    # Always quote nominal values (both original branches did the same)
                    row_data.append(f"'{val_str}'")
            f.write(','.join(row_data) + '\n')
    
    print(f"✓ Created: {filename}")

# ====================================================
# PART 1: CLASSIFICATION ARFF FILES
# ====================================================
print("\n" + "=" * 60)
print("PART 1: CREATING CLASSIFICATION ARFF FILES")
print("=" * 60)

# 1. Anime Classification ARFF
print("\n--- Anime Classification ARFF ---")
anime_class_path = '../data/processed/anime_classification.arff'
create_classification_arff(anime_cleaned, anime_class_path, 'anime_classification', 'rating')

# 2. Rating Classification ARFF (sample if too large)
print("\n--- Rating Classification ARFF ---")
rating_sample = rating_cleaned
if len(rating_cleaned) > 100000:
    print(f"⚠ Rating dataset is large ({len(rating_cleaned):,} rows)")
    print(f"  Sampling 100,000 rows for WEKA performance")
    rating_sample = rating_cleaned.sample(n=100000, random_state=42)

rating_class_path = '../data/processed/rating_classification.arff'
create_classification_arff(rating_sample, rating_class_path, 'rating_classification', 'rating')

# 3. Combined Classification ARFF (MOST IMPORTANT)
print("\n--- Combined Classification ARFF ---")
combined_sample = combined_data
if len(combined_data) > 100000:
    print(f"⚠ Combined dataset is large ({len(combined_data):,} rows)")
    print(f"  Sampling 100,000 rows for WEKA performance")
    combined_sample = combined_data.sample(n=100000, random_state=42)

# For combined data, use user rating as class (rename to user_rating for clarity)
combined_for_class = combined_sample.copy()
if 'rating_x' in combined_for_class.columns:
    combined_for_class = combined_for_class.rename(columns={'rating_x': 'user_rating', 'rating_y': 'anime_rating'})
    class_attr = 'user_rating'
else:
    class_attr = 'rating'

combined_class_path = '../data/processed/combined_classification.arff'
create_classification_arff(combined_for_class, combined_class_path, 'combined_classification', class_attr)

# ====================================================
# PART 2: ASSOCIATION ARFF FILES
# ====================================================
print("\n" + "=" * 60)
print("PART 2: CREATING ASSOCIATION ARFF FILES")
print("=" * 60)

# 4. Anime Association ARFF
print("\n--- Anime Association ARFF ---")
anime_assoc_path = '../data/processed/anime_association.arff'
create_association_arff(anime_cleaned, anime_assoc_path, 'anime_association')

# 5. Rating Association ARFF
print("\n--- Rating Association ARFF ---")
rating_assoc_path = '../data/processed/rating_association.arff'
create_association_arff(rating_sample, rating_assoc_path, 'rating_association')

# 6. Combined Association ARFF (MOST IMPORTANT)
print("\n--- Combined Association ARFF ---")
combined_assoc_path = '../data/processed/combined_association.arff'
create_association_arff(combined_sample, combined_assoc_path, 'combined_association')

# ====================================================
# SUMMARY
# ====================================================
print("\n" + "=" * 60)
print("ARFF FILE GENERATION COMPLETE")
print("=" * 60)

print("\n--- CLASSIFICATION ARFF FILES (For J48, Bayes, OneR, etc.) ---")
print(f"1. {anime_class_path}")
print(f"   Rows: {len(anime_cleaned):,} | Class: rating")
print(f"\n2. {rating_class_path}")
print(f"   Rows: {len(rating_sample):,} | Class: rating")
print(f"\n3. {combined_class_path} ⭐ (RECOMMENDED)")
print(f"   Rows: {len(combined_sample):,} | Class: {class_attr}")

print("\n--- ASSOCIATION ARFF FILES (For Apriori, FPGrowth) ---")
print(f"4. {anime_assoc_path}")
print(f"   Rows: {len(anime_cleaned):,} | All attributes: nominal")
print(f"\n5. {rating_assoc_path}")
print(f"   Rows: {len(rating_sample):,} | All attributes: nominal")
print(f"\n6. {combined_assoc_path} ⭐ (RECOMMENDED)")
print(f"   Rows: {len(combined_sample):,} | All attributes: nominal")

print("\n--- DISCRETIZATION RULES (for Association files) ---")
print("• Rating: Low(1-4), Medium(5-7), High(8-10)")
print("• Episodes: Short(1-12), Medium(13-26), Long(27+)")
print("• Members: Low(<1000), Medium(1000-10000), High(>10000)")
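
# Illustrative sketch only (not called by this cell): one way the rules printed
# above could be expressed with pd.cut. discretize_sketch and the column names
# 'rating', 'episodes', 'members' are assumptions; the binning actually applied
# is whatever create_association_arff implements. How fractional ratings are
# binned (here 4.5 -> Medium, 7.5 -> High) is also an assumption.
import pandas as pd  # already imported in Step 1; repeated so the sketch stands alone

def discretize_sketch(df):
    """Return a new DataFrame with the three numeric columns binned into labels."""
    out = pd.DataFrame(index=df.index)
    # Right-closed bins: (0, 4] Low, (4, 7] Medium, (7, 10] High
    out['rating'] = pd.cut(df['rating'], bins=[0, 4, 7, 10],
                           labels=['Low', 'Medium', 'High'])
    # (0, 12] Short, (12, 26] Medium, (26, inf) Long
    out['episodes'] = pd.cut(df['episodes'], bins=[0, 12, 26, float('inf')],
                             labels=['Short', 'Medium', 'Long'])
    # Integer counts: up to 999 Low, 1000-10000 Medium, above 10000 High
    out['members'] = pd.cut(df['members'], bins=[-1, 999, 10000, float('inf')],
                            labels=['Low', 'Medium', 'High'])
    return out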

print("\n--- HOW TO USE IN WEKA ---")
print("\nFOR CLASSIFICATION:")
print("  1. Open WEKA Explorer")
print("  2. Load: combined_classification.arff")
print("  3. Go to 'Classify' tab")
print("  4. Choose classifier: J48, NaiveBayes, OneR, etc.")
print("  5. Test options: Cross-validation (10 folds)")
print("  6. Click 'Start' to run")

print("\nFOR ASSOCIATION RULES:")
print("  1. Open WEKA Explorer")
print("  2. Load: combined_association.arff")
print("  3. Go to 'Associate' tab")
print("  4. Choose: Apriori or FPGrowth")
print("  5. Set parameters (minSupport, minConfidence)")
print("  6. Click 'Start' to mine rules")

print("\n✓ All ARFF files ready for WEKA!")
print("✓ 6 files generated (3 classification + 3 association)")


GENERATING ARFF FILES FOR WEKA

PART 1: CREATING CLASSIFICATION ARFF FILES

--- Anime Classification ARFF ---
✓ Created: ../data/processed/anime_classification.arff

--- Rating Classification ARFF ---
⚠ Rating dataset is large (6,318,423 rows)
  Sampling 100,000 rows for WEKA performance
✓ Created: ../data/processed/rating_classification.arff

--- Combined Classification ARFF ---
⚠ Combined dataset is large (6,318,423 rows)
  Sampling 100,000 rows for WEKA performance
✓ Created: ../data/processed/combined_classification.arff

PART 2: CREATING ASSOCIATION ARFF FILES

--- Anime Association ARFF ---
✓ Created: ../data/processed/anime_association.arff

--- Rating Association ARFF ---
✓ Created: ../data/processed/rating_association.arff

--- Combined Association ARFF ---
✓ Created: ../data/processed/combined_association.arff

ARFF FILE GENERATION COMPLETE

--- CLASSIFICATION ARFF FILES (For J48, Bayes, OneR, etc.) ---
1. ../data/processed/anime_classification.arff
   Rows: 12,064 | Class: 