# Data Collection for Consumer Security Product Analysis

This notebook exercises the data collection infrastructure and gathers an initial sample of reviews from four sources: the Play Store, Reddit, Amazon, and the App Store.

## Objectives
1. Test all data collection scrapers
2. Collect sample data from multiple sources
3. Validate data quality and structure
4. Prepare data for analysis in the next modules

In [1]:
# Import required libraries (standard library first, then third-party)
import json
import os
import sys
from datetime import datetime

import numpy as np
import pandas as pd

# Add src to path for imports
sys.path.append('../src')

from data_collection import DataCollectionManager, test_all_scrapers

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## 1. Test All Scrapers

In [2]:
# Test all scrapers with a sample product
print("🧪 Testing all scrapers...")
test_results = test_all_scrapers()

# Display results
for source, result in test_results.items():
    status = result.get('status', 'unknown')
    if status == 'success':
        print(f"✅ {source}: {result.get('reviews_collected', 0)} reviews collected")
    else:
        print(f"❌ {source}: {result.get('error', 'Unknown error')}")

🧪 Testing all scrapers...
✅ playstore: 3 reviews collected
✅ reddit: 5 reviews collected
✅ amazon: 2 reviews collected
✅ appstore: 3 reviews collected
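
The loop above only prints each scraper's status. For an automated check, the same dictionary can be reduced to the set of failing sources. The sketch below assumes `test_all_scrapers()` returns a dict shaped like the one iterated above (keys `status`, `reviews_collected`, `error`). The helper name `failed_sources`, the "zero reviews" rule, and the sample input are hypothetical and not part of the `data_collection` module.

```python
def failed_sources(test_results):
    """Return {source: reason} for every scraper that did not succeed."""
    failures = {}
    for source, result in test_results.items():
        if result.get('status', 'unknown') != 'success':
            failures[source] = result.get('error', 'Unknown error')
        elif result.get('reviews_collected', 0) == 0:
            # Hypothetical extra rule: a "success" with no reviews is suspicious.
            failures[source] = 'No reviews collected'
    return failures


# Illustrative input only; this is not actual scraper output.
example_results = {
    'playstore': {'status': 'success', 'reviews_collected': 3},
    'reddit': {'status': 'error', 'error': 'rate limited'},
    'amazon': {'status': 'success', 'reviews_collected': 0},
}
print(failed_sources(example_results))
```

An empty dict from `failed_sources(test_results)` would mean every source returned at least one review.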


## 2. Initialize Data Collection Manager

In [3]:
# Initialize the data collection manager
manager = DataCollectionManager()

# Show configuration
print("📋 Configuration:")
print(f"Target companies: {manager.config['data_collection']['target_companies']}")
print(f"Max reviews per product: {manager.config['data_collection']['max_reviews_per_product']}")
print(f"Available scrapers: {list(manager.scrapers.keys())}")

📋 Configuration:
Target companies: ['McAfee', 'Norton', 'Kaspersky', 'Bitdefender', 'Avast', 'AVG', 'Malwarebytes', 'ESET', 'Trend Micro', 'Windows Defender']
Max reviews per product: 100
Available scrapers: ['playstore', 'reddit', 'amazon', 'appstore']
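
Before starting a collection run, it can help to check the requested companies and sources against the configuration printed above. This is a minimal sketch: `validate_request` is a hypothetical helper, and it takes plain lists so that it makes no assumptions about the `DataCollectionManager` internals beyond the two values shown in the output.

```python
def validate_request(companies, sources, target_companies, available_scrapers):
    """Return the requested companies and sources that the configuration does not know."""
    unknown_companies = [c for c in companies if c not in target_companies]
    unknown_sources = [s for s in sources if s not in available_scrapers]
    return unknown_companies, unknown_sources


# Values copied from the configuration output above.
target_companies = ['McAfee', 'Norton', 'Kaspersky', 'Bitdefender', 'Avast',
                    'AVG', 'Malwarebytes', 'ESET', 'Trend Micro', 'Windows Defender']
available_scrapers = ['playstore', 'reddit', 'amazon', 'appstore']

unknown_companies, unknown_sources = validate_request(
    companies=['McAfee', 'Norton', 'Avast'],
    sources=['playstore', 'reddit', 'amazon', 'appstore'],
    target_companies=target_companies,
    available_scrapers=available_scrapers,
)
print(f"Unknown companies: {unknown_companies}")
print(f"Unknown sources: {unknown_sources}")
```

In the notebook itself, the last two arguments could presumably be read from `manager.config['data_collection']['target_companies']` and `manager.scrapers.keys()`.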


## 3. Collect Sample Data

In [4]:
# Collect data for a subset of companies (faster for testing)
sample_companies = ['McAfee', 'Norton', 'Avast']
max_reviews_per_source = 25

print(f"🎯 Collecting data for: {sample_companies}")
print(f"📊 Max reviews per source: {max_reviews_per_source}")
print("⏱️ This may take several minutes...")

# Collect comprehensive data
results = manager.collect_comprehensive_data(
    companies=sample_companies,
    max_reviews_per_source=max_reviews_per_source,
    sources=['playstore', 'reddit', 'amazon', 'appstore']
)

print("\n📊 DATA COLLECTION COMPLETED")
print(f"✅ Total reviews collected: {results['total_reviews']}")
print("✅ Data files saved to data/raw/")
print("✅ Ready for analysis in Module 2")

🎯 Collecting data for: ['McAfee', 'Norton', 'Avast']
📊 Max reviews per source: 25
⏱️ This may take several minutes...

📊 DATA COLLECTION COMPLETED
✅ Total reviews collected: 62
✅ Data files saved to data/raw/
✅ Ready for analysis in Module 2
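
The "Data files saved" message above is a fixed string, so it does not confirm that anything was written. One way to verify is to list the output directory by modification time. The sketch below uses only the standard library. `list_recent_files` is a hypothetical helper, and the `data/raw` path is taken from the message above; depending on where the notebook runs, the directory may need a different relative path. File names and formats are not assumed.

```python
from datetime import datetime
from pathlib import Path


def list_recent_files(directory, limit=10):
    """Return (name, size in bytes, modified time) for the newest files in a directory."""
    path = Path(directory)
    if not path.is_dir():
        return []  # Nothing to report if the directory does not exist.
    files = [p for p in path.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [
        (p.name, p.stat().st_size, datetime.fromtimestamp(p.stat().st_mtime))
        for p in files[:limit]
    ]


for name, size, modified in list_recent_files('data/raw'):
    print(f"{modified:%Y-%m-%d %H:%M:%S}  {size:>10} bytes  {name}")
```

An empty listing right after a run would suggest the files went elsewhere or were not written.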


## 4. Data Quality Summary

In [5]:
# Show detailed summary
print("📈 COLLECTION SUMMARY")
print("=" * 20)
print("\n📊 Reviews by Source:")
for source, count in results['by_source'].items():
    print(f"• {source}: {count} reviews")

print("\n🏢 Reviews by Company:")
for company, count in results['by_company'].items():
    print(f"• {company}: {count} reviews")

print("\n✅ Data Quality: All sources operational")
print("✅ Files saved with timestamps for Module 2 processing")

📈 COLLECTION SUMMARY
====================

📊 Reviews by Source:
• playstore: 9 reviews
• reddit: 47 reviews
• amazon: 6 reviews
• appstore: 9 reviews

🏢 Reviews by Company:
• McAfee: 19 reviews
• Norton: 20 reviews
• Avast: 23 reviews

✅ Data Quality: All sources operational
✅ Files saved with timestamps for Module 2 processing
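
The "Data Quality" line in the summary cell is a fixed string, not the result of a check. A simple structural check is to compare each breakdown against the reported total. The sketch below assumes `results` carries the keys used above (`total_reviews`, `by_source`, `by_company`); `check_totals` is a hypothetical helper, and the sample dict copies the numbers printed in the outputs above.

```python
def check_totals(results):
    """Compare the sum of each breakdown with the reported total."""
    total = results['total_reviews']
    report = {}
    for key in ('by_source', 'by_company'):
        breakdown_sum = sum(results[key].values())
        report[key] = {
            'sum': breakdown_sum,
            'matches_total': breakdown_sum == total,
        }
    return report


# Values copied from the notebook outputs above.
printed_results = {
    'total_reviews': 62,
    'by_source': {'playstore': 9, 'reddit': 47, 'amazon': 6, 'appstore': 9},
    'by_company': {'McAfee': 19, 'Norton': 20, 'Avast': 23},
}

for key, info in check_totals(printed_results).items():
    flag = "OK" if info['matches_total'] else "MISMATCH"
    print(f"{key}: sum={info['sum']} ({flag})")
```

With the printed numbers, the per-company counts sum to 62 and match the total, while the per-source counts sum to 71, so this check would flag the per-source breakdown for a closer look before Module 2.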
