# Product Recommendation Challenge - Data Structure Analysis

## Overview
This notebook analyzes the structure of the data files used in the Product Recommendation Challenge and identifies potential features that can be extracted for building recommendation systems.

## Dataset Files Summary
- **train.csv**: 2,543,147 user-item interactions
- **test.csv**: 412,462 users requiring predictions  
- **item_metadata.csv**: 3,819,722 product records with detailed metadata
- **sample_submission.csv**: Expected output format (412,462 predictions)
- **id_mappings.json**: Mapping between original and encoded IDs


## 1. Training Data (train.csv)

**Structure:**
- **user_id**: Encoded user identifier (integer)
- **item_id**: Encoded item identifier (integer) 
- **rating**: User's rating for the item (float, e.g., 5.0, 4.0)
- **timestamp**: Unix timestamp of the interaction (milliseconds)

**Sample:**
```
user_id,item_id,rating,timestamp
0,7314,5.0,1353612262000
0,15493,5.0,1370653034000
0,18817,4.0,1373668644000
```

**Potential Features:**
- **User behavior patterns**: Rating distributions, average ratings per user
- **Temporal features**: Time of day, day of week, seasonality patterns
- **User engagement**: Number of ratings per user, rating frequency
- **Item popularity**: Number of ratings per item, average item ratings
- **Rating patterns**: Explicit feedback (ratings 1-5), rating variance
- **Sequential patterns**: User's rating timeline, rating evolution
- **Recency**: Time since last interaction, recent activity patterns
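
A minimal sketch of how several of these features could be derived with pandas, using only the three sample rows shown above. The column names `ts`, `hour`, `day_of_week`, and `days_since_last` are illustrative choices, not part of the dataset.

```python
import io
import pandas as pd

# The three sample rows from above; replace the StringIO with "train.csv" in practice.
sample = io.StringIO(
    "user_id,item_id,rating,timestamp\n"
    "0,7314,5.0,1353612262000\n"
    "0,15493,5.0,1370653034000\n"
    "0,18817,4.0,1373668644000\n"
)
train = pd.read_csv(sample)

# Timestamps are in milliseconds, so unit="ms" is needed for a correct conversion.
train["ts"] = pd.to_datetime(train["timestamp"], unit="ms", utc=True)
train["hour"] = train["ts"].dt.hour
train["day_of_week"] = train["ts"].dt.dayofweek  # Monday = 0

# Per-user engagement and rating behaviour.
user_stats = train.groupby("user_id").agg(
    n_ratings=("rating", "count"),
    mean_rating=("rating", "mean"),
    rating_std=("rating", "std"),
    last_ts=("ts", "max"),
)

# Per-item popularity.
item_stats = train.groupby("item_id").agg(
    n_ratings=("rating", "count"),
    mean_rating=("rating", "mean"),
)

# Recency measured against the latest interaction in the data (an assumed reference point).
latest = train["ts"].max()
user_stats["days_since_last"] = (latest - user_stats["last_ts"]).dt.days
```

On the full file the same aggregations apply unchanged; only the reference point for recency may need rethinking if the test period starts later than the last training interaction.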


## 2. Test Data (test.csv)

**Structure:**
- **user_id**: Encoded user identifier (integer)
- **predictions**: Placeholder column (contains 0)

**Sample:**
```
user_id,predictions
0,0
1,0
3,0
```

**Purpose:**
- Lists the 412,462 users for whom top-10 item recommendations must be submitted
- User IDs are not contiguous (e.g., user 2 is absent from the sample above), so the test set covers only a subset of the encoded users

**Potential Analysis:**
- **User coverage**: Which users from training data appear in test set
- **Cold start problem**: New users not seen in training data
- **User activity levels**: Compare test users' historical activity in training data
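
The coverage and cold-start checks above can be sketched with plain set operations. The function name `coverage_report` and the toy training IDs are hypothetical; only the test IDs 0, 1, 3 come from the sample.

```python
def coverage_report(train_user_ids, test_user_ids):
    """Summarise how many test users have history in the training data."""
    train_users = set(train_user_ids)
    test_users = set(test_user_ids)
    warm = test_users & train_users   # test users with training history
    cold = test_users - train_users   # test users never seen in training
    return {
        "n_test_users": len(test_users),
        "n_warm": len(warm),
        "n_cold": len(cold),
        "cold_fraction": len(cold) / len(test_users) if test_users else 0.0,
        "cold_users": sorted(cold),
    }


# Toy training IDs (hypothetical) against the test sample shown above.
report = coverage_report(train_user_ids=[0, 0, 0, 1], test_user_ids=[0, 1, 3])
print(report)
```

A non-zero `cold_fraction` on the real files would indicate that a fallback such as popularity-based recommendations is needed for some test users.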


## 3. Item Metadata (item_metadata.csv)

**Structure:** (15 columns total)
1. **parent_asin**: Amazon Standard Identification Number (string)
2. **main_category**: Primary product category (e.g., "All Beauty")
3. **title**: Product title/name (string)
4. **average_rating**: Average rating of the product (float)
5. **rating_number**: Number of ratings received (stored as float, e.g., 10.0)
6. **price**: Product price (missing values appear as the string `None`)
7. **store**: Store/brand name
8. **features**: Product features (list/array format)
9. **description**: Product description (list/array format)
10. **images**: Image data structure
11. **categories**: Detailed category hierarchy (list)
12. **image_count**: Number of product images (integer)
13. **has_images**: Boolean indicator for image availability
14. **image_urls**: URLs to product images
15. **category**: Simplified category (may duplicate main_category)

**Sample:**
```
parent_asin,main_category,title,average_rating,rating_number,price,store,features,description,...
B01CUPMQZE,All Beauty,"Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",4.8,10.0,None,Howard Products,[],...
```


**Potential Features from Item Metadata:**

### Content-Based Features:
- **Category features**: One-hot encoding of main_category, hierarchical category analysis
- **Price features**: Price bins, price relative to category average, price availability
- **Rating features**: Average rating, rating count, popularity metrics
- **Text features**: TF-IDF from titles and descriptions, text embeddings
- **Brand features**: Store/brand popularity, brand-category associations
- **Visual features**: Image availability, image count as proxy for marketing investment

### Advanced Features:
- **Content similarity**: Item-to-item similarity based on text/categories
- **Price positioning**: Expensive/cheap relative to similar items
- **Category popularity**: Category trends and seasonal patterns
- **Quality indicators**: Rating count as quality signal, rating-price correlation
- **Feature density**: Number of filled metadata fields as quality indicator
- **Text complexity**: Title/description length, feature list completeness
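
Before any of these features can be computed, the raw string fields need parsing. The sketch below assumes that list-like columns are serialized as Python list literals (the sample only shows `[]`, so this should be verified) and that a missing price is the string `None`. The helper names are hypothetical.

```python
import ast
import math


def parse_price(raw):
    """Return the price as a float, or NaN when it is missing or unparseable."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return math.nan


def parse_list_field(raw):
    """Parse a stringified list such as "[]" or "['a', 'b']"; fall back to []."""
    if not isinstance(raw, str):
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


def item_features(row):
    """Derive a few simple content features from one metadata row (a dict)."""
    price = parse_price(row.get("price"))
    return {
        "price": price,
        "price_missing": math.isnan(price),
        "n_features": len(parse_list_field(row.get("features"))),
        "title_len": len(row.get("title") or ""),
    }


# The sample row shown above, reduced to the fields used here.
row = {
    "parent_asin": "B01CUPMQZE",
    "title": "Howard LC0008 Leather Conditioner, 8-Ounce (4-Pack)",
    "price": "None",
    "features": "[]",
}
features = item_features(row)
```

Keeping `price_missing` as its own feature preserves the "price availability" signal even if the price itself is later imputed.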


## 4. Sample Submission (sample_submission.csv)

**Structure:**
- **user_id**: Encoded user identifier (integer)
- **predictions**: Space-separated list of 10 recommended item IDs

**Sample:**
```
user_id,predictions
0,50727 25161 70745 64522 3476 5270 67819 59047 9548 20616
1,72042 48322 70607 51973 13888 45212 47281 16753 65859 33584
3,9089 12890 75010 54531 32877 16323 61681 47577 72231 49359
```

**Format Requirements:**
- Exactly 10 item recommendations per user
- Items must be space-separated in a single string
- Order matters (first item is the top recommendation)
- All item IDs must exist in the dataset
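
These requirements can be enforced when building each row. A minimal sketch, where `format_prediction` and `valid_items` are hypothetical names and the set of valid IDs is assumed to come from the item mapping or training data:

```python
def format_prediction(item_ids, valid_items, k=10):
    """Validate one user's ranked items and return the space-separated string."""
    items = list(item_ids)
    if len(items) != k:
        raise ValueError(f"expected exactly {k} items, got {len(items)}")
    unknown = [item for item in items if item not in valid_items]
    if unknown:
        raise ValueError(f"unknown item IDs: {unknown}")
    # Order is preserved: the first item is the top recommendation.
    return " ".join(str(item) for item in items)


# First row of the sample submission above.
ranked = [50727, 25161, 70745, 64522, 3476, 5270, 67819, 59047, 9548, 20616]
valid_items = set(ranked)  # stand-in for the full set of known item IDs
line = format_prediction(ranked, valid_items)
print(f"0,{line}")
```

Whether duplicate items within one list are allowed is not stated in the files, so the sketch does not check for them.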

**Evaluation Considerations:**
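- Likely evaluated with a ranking metric such as MAP@10, NDCG@10, or Precision@10 (the exact metric is not stated in the data files)
- MAP and NDCG reward placing relevant items earlier in the list, so the ordering affects the score
- Recommended items must be valid IDs that exist in the dataset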
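
Since the exact competition metric is not confirmed, the following is only a sketch of two common candidates with binary relevance. Definitions of AP@k vary between sources; this version normalizes by `min(len(relevant), k)`, which is an assumption, and it expects a recommendation list without duplicates.

```python
import math


def average_precision_at_k(recommended, relevant, k=10):
    """AP@k with binary relevance, normalized by min(|relevant|, k)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)


def ndcg_at_k(recommended, relevant, k=10):
    """NDCG@k with binary relevance and a log2 position discount."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Toy example: items 1 and 3 are relevant, recommended at ranks 1 and 3.
ap = average_precision_at_k([1, 2, 3], {1, 3}, k=3)
ndcg = ndcg_at_k([1, 2, 3], {1, 3}, k=3)
```

Averaging `average_precision_at_k` over all users gives MAP@k.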


## 5. ID Mappings (id_mappings.json)

**Structure:**
- **user_mapping**: Dictionary mapping original user IDs to encoded integers
- **item_mapping**: Dictionary that most likely maps original item ASINs to encoded integers (to be verified against `parent_asin`)

**Sample:**
```json
{
  "user_mapping": {
    "AE22236AFRRSMQIKGG7TPTB75QEA": 0,
    "AE2224FSUK5AV5R2USYXINUNTW7Q": 1,
    "AE2226PENZTTCDKFGRTUCUX2NU2Q": 2,
    ...
  },
  "item_mapping": {
    "B01CUPMQZE": 0,
    "B01EXAMPLE": 1,
    ...
  }
}
```

**Purpose:**
- Links encoded IDs used in train/test data to original Amazon identifiers
- Essential for mapping back to item metadata using parent_asin
- Enables understanding of real-world product identifiers
- Required to validate that recommended items exist in metadata
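
A sketch of the lookup path from an encoded item ID back to its metadata. It assumes the keys of `item_mapping` match `parent_asin`, which should be verified on the real files. The helper names and the reduced metadata dict are illustrative.

```python
import json

# Entries taken from the sample above; load the real file with json.load(open(...)).
raw = """{
  "user_mapping": {
    "AE22236AFRRSMQIKGG7TPTB75QEA": 0,
    "AE2224FSUK5AV5R2USYXINUNTW7Q": 1,
    "AE2226PENZTTCDKFGRTUCUX2NU2Q": 2
  },
  "item_mapping": {"B01CUPMQZE": 0}
}"""
mappings = json.loads(raw)


def invert_mapping(mapping):
    """Turn original -> encoded into encoded -> original, checking it is one-to-one."""
    inverse = {encoded: original for original, encoded in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not one-to-one")
    return inverse


item_id_to_asin = invert_mapping(mappings["item_mapping"])

# Metadata keyed by parent_asin (two fields from the sample row, for illustration).
metadata_by_asin = {
    "B01CUPMQZE": {"main_category": "All Beauty", "store": "Howard Products"},
}


def lookup_metadata(item_id):
    """Return the metadata dict for an encoded item ID, or None if unavailable."""
    asin = item_id_to_asin.get(item_id)
    return metadata_by_asin.get(asin)


print(lookup_metadata(0))
```

Counting how many encoded IDs return `None` here is a quick way to measure metadata coverage.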


## 6. Feature Engineering Strategy

### Collaborative Filtering Features:
1. **User-Item Matrix**: Sparse matrix of ratings for matrix factorization
2. **User Similarities**: Cosine similarity, Pearson correlation between users
3. **Item Similarities**: Item-to-item collaborative filtering based on user ratings
4. **Matrix Factorization**: SVD, NMF, ALS to discover latent factors
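
As an illustration of the item-similarity idea, here is a small pure-Python sketch of item-to-item cosine similarity over a sparse dict-of-dicts representation. The ratings are toy values, not taken from the dataset; at full scale a sparse matrix library would likely be the more practical choice.

```python
import math
from collections import defaultdict

# Toy (user_id, item_id, rating) triples; hypothetical values for illustration.
ratings = [(0, 10, 5.0), (0, 20, 4.0), (1, 10, 3.0), (1, 30, 2.0)]

# Sparse item vectors: item_id -> {user_id: rating}. Unrated cells are simply absent.
item_vectors = defaultdict(dict)
for user_id, item_id, rating in ratings:
    item_vectors[item_id][user_id] = rating


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse rating vectors."""
    shared = vec_a.keys() & vec_b.keys()
    if not shared:
        return 0.0
    dot = sum(vec_a[user] * vec_b[user] for user in shared)
    norm_a = math.sqrt(sum(value * value for value in vec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


sim_10_20 = cosine_similarity(item_vectors[10], item_vectors[20])
sim_20_30 = cosine_similarity(item_vectors[20], item_vectors[30])
```

Items 20 and 30 share no raters, so their similarity is zero. This is the typical situation in a very sparse matrix and one reason latent-factor methods are often preferred.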

### Content-Based Features:
1. **Item Profiles**: TF-IDF vectors from titles, descriptions, categories
2. **Category Embeddings**: Learned representations of product categories  
3. **Price Features**: Normalized prices, price bins, price-category relationships
4. **Brand Features**: Brand popularity, brand-user affinity scores
5. **Quality Indicators**: Average ratings, review counts, image availability

### Hybrid Features:
1. **User Profiles**: Aggregate user preferences from historical interactions
2. **Temporal Patterns**: Seasonal trends, day-of-week effects, recency weights
3. **Cross-Features**: User-category preferences, user-brand affinities
4. **Cold Start Solutions**: Content-based recommendations for new users/items
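
One way to combine recency weights with user-category preferences is an exponentially decayed profile. This is a sketch only: the 180-day half-life, the function names, and the category labels are all assumptions chosen for illustration.

```python
MS_PER_DAY = 86_400_000  # timestamps in train.csv are in milliseconds


def recency_weight(timestamp_ms, reference_ms, half_life_days=180.0):
    """Weight that halves every `half_life_days` before the reference time."""
    age_days = max(reference_ms - timestamp_ms, 0) / MS_PER_DAY
    return 0.5 ** (age_days / half_life_days)


def weighted_category_profile(interactions, item_category, reference_ms,
                              half_life_days=180.0):
    """Normalized, recency-weighted share of rating mass per category.

    interactions: iterable of (item_id, rating, timestamp_ms) for one user.
    item_category: dict mapping item_id to a category label.
    """
    totals = {}
    for item_id, rating, timestamp_ms in interactions:
        category = item_category.get(item_id)
        if category is None:
            continue  # item without metadata
        weight = recency_weight(timestamp_ms, reference_ms, half_life_days)
        totals[category] = totals.get(category, 0.0) + weight * rating
    norm = sum(totals.values())
    return {cat: value / norm for cat, value in totals.items()} if norm else {}


reference = 1373668644000  # latest timestamp in the training sample
profile = weighted_category_profile(
    interactions=[(1, 4.0, reference), (2, 4.0, reference - 180 * MS_PER_DAY)],
    item_category={1: "A", 2: "B"},  # hypothetical categories
    reference_ms=reference,
)
```

With equal ratings, the interaction from one half-life ago contributes half as much, so the profile is two thirds category A and one third category B.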


## 7. Data Quality Considerations

### Potential Issues:
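1. **Sparsity**: With about 2.5M interactions spread over at least several hundred thousand users (test.csv alone lists 412,462), the user-item matrix will be very sparse; the exact user and item counts should be confirmed from the data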
2. **Cold Start**: New users and items with no historical data
3. **Missing Values**: The price field contains the string "None" for missing values, and metadata completeness varies across items
4. **Imbalanced Data**: Some users/items may have many more interactions than others
5. **Temporal Drift**: User preferences may change over time
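
Sparsity and imbalance can be quantified directly from the (user, item) pairs. A minimal sketch with a hypothetical helper name and toy pairs:

```python
from collections import Counter


def sparsity_report(pairs):
    """Density of the user-item matrix and the heaviest user/item counts."""
    unique_pairs = set(pairs)
    per_user = Counter(user for user, _ in unique_pairs)
    per_item = Counter(item for _, item in unique_pairs)
    n_cells = len(per_user) * len(per_item)
    return {
        "n_users": len(per_user),
        "n_items": len(per_item),
        # Fraction of user-item cells that contain an interaction.
        "density": len(unique_pairs) / n_cells if n_cells else 0.0,
        "max_per_user": max(per_user.values(), default=0),
        "max_per_item": max(per_item.values(), default=0),
    }


# Toy pairs (hypothetical): two users, two items, three interactions.
report = sparsity_report([(0, 1), (0, 2), (1, 1)])
print(report)
```

On the real data the density will be far below the 0.75 of this toy case. Comparing `max_per_user` with the median count per user gives a first impression of the imbalance.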

### Data Integration Challenges:
1. **ID Mapping**: Must correctly map encoded IDs to original ASINs for metadata lookup
2. **Consistency**: Ensure all recommended items exist in both training data and metadata
3. **Scale**: Large datasets require efficient processing and storage strategies
4. **Feature Alignment**: Matching user-item interactions with item metadata

### Recommended Preprocessing:
1. **Data Validation**: Verify ID mappings and data consistency
2. **Missing Value Handling**: Strategies for missing prices and metadata
3. **Outlier Detection**: Identify and handle unusual rating patterns
4. **Data Splitting**: Proper train/validation splits for model evaluation
5. **Feature Scaling**: Normalize numerical features for consistent model training


## 8. Modeling Approach Recommendations

### Phase 1: Baseline Models
1. **Popularity-Based**: Recommend most popular items globally or by category
2. **User-Based CF**: Simple user-user collaborative filtering
3. **Item-Based CF**: Item-item collaborative filtering using rating similarities
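
The global popularity baseline is simple enough to sketch in a few lines. Excluding items the user has already rated is an assumption here, not a stated rule of the challenge; the function name and toy interactions are hypothetical.

```python
from collections import Counter, defaultdict


def popularity_recommender(interactions, k=10):
    """Build a recommend(user_id) function from (user_id, item_id) pairs."""
    counts = Counter(item for _, item in interactions)
    seen = defaultdict(set)
    for user, item in interactions:
        seen[user].add(item)
    # Most-rated first; ties broken by item ID so the output is deterministic.
    ranked = [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

    def recommend(user_id):
        already_rated = seen.get(user_id, set())
        return [item for item in ranked if item not in already_rated][:k]

    return recommend


# Toy interactions: item 1 is rated three times, item 2 twice, item 3 once.
interactions = [(0, 1), (1, 1), (2, 1), (0, 2), (1, 2), (2, 3)]
recommend = popularity_recommender(interactions, k=2)
print(recommend(0), recommend(2), recommend(99))
```

Unknown users such as 99 receive the plain top-k list, which makes this baseline a natural cold-start fallback. With a tiny catalog the list can be shorter than `k`; on the real data that should not occur.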

### Phase 2: Advanced Collaborative Filtering
1. **Matrix Factorization**: SVD, NMF with regularization
2. **Deep Learning**: Neural collaborative filtering, autoencoders
3. **Factorization Machines**: Capture feature interactions efficiently

### Phase 3: Hybrid Systems
1. **Content-Collaborative Hybrid**: Combine CF with item metadata
2. **Ensemble Methods**: Weight multiple model predictions
3. **Deep Hybrid Models**: Neural networks incorporating both CF and content features

### Evaluation Strategy:
1. **Offline Evaluation**: Split training data chronologically
2. **Metrics**: MAP@10, NDCG@10, Precision@10, Recall@10
3. **Cross-Validation**: User-based or time-based splits
4. **A/B Testing Framework**: For production deployment
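
One simple form of chronological splitting is a per-user leave-last-out holdout: each user's most recent interaction goes to validation and the rest stays in training. The sketch below uses the three sample rows from train.csv plus one hypothetical single-interaction user; a global time cutoff would be an equally valid alternative.

```python
from collections import defaultdict


def leave_last_out(interactions):
    """Split (user_id, item_id, rating, timestamp_ms) rows per user by time.

    Users with a single interaction are kept entirely in the training part.
    """
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    train_rows, valid_rows = [], []
    for rows in by_user.values():
        rows.sort(key=lambda row: row[3])  # oldest first
        if len(rows) < 2:
            train_rows.extend(rows)
        else:
            train_rows.extend(rows[:-1])
            valid_rows.append(rows[-1])
    return train_rows, valid_rows


rows = [
    (0, 7314, 5.0, 1353612262000),
    (0, 15493, 5.0, 1370653034000),
    (0, 18817, 4.0, 1373668644000),
    (5, 42, 3.0, 1360000000000),  # hypothetical user with one interaction
]
train_rows, valid_rows = leave_last_out(rows)
```

The held-out items can then be scored against each model's top-10 list with the ranking metrics listed above.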

## Next Steps
1. Load and explore the data in detail (`eda.ipynb`)
2. Implement data preprocessing pipeline
3. Build baseline recommendation models
4. Evaluate and iterate on advanced approaches
