# Task 1: Data Collection and Analysis

This notebook demonstrates the data collection and exploratory data analysis workflow.

In [None]:
# Import required modules
import sys
sys.path.append('../')

from customer_analytics.utils import PlayStoreScraper, ReviewPreprocessor
from customer_analytics.analysis import EDA
from customer_analytics.visualisation import Plotter
import pandas as pd

## Step 1: Data Collection

Scrape reviews from the Google Play Store for Ethiopian banking apps.

In [None]:
# Initialize scraper
scraper = PlayStoreScraper()

# Scrape reviews from all banks
df_raw = scraper.scrape_all_banks()

# Display sample
print(f"\nCollected {len(df_raw)} reviews")
df_raw.head()
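
The internals of `PlayStoreScraper.scrape_all_banks()` are not shown in this notebook. As a rough sketch of what a multi-app collection loop could look like, the example below takes a fetch function as a parameter so it runs offline. The names `collect_reviews`, `fake_fetch`, the `review_text` column, and the app IDs are hypothetical; the `content` and `score` keys follow the review dicts returned by the `google-play-scraper` package, which is only an assumption about what the scraper uses.

```python
from typing import Callable, Dict, List

# Hypothetical type: takes an app ID, returns a list of review dicts
# with "content" and "score" keys.
FetchFn = Callable[[str], List[dict]]


def collect_reviews(app_ids: Dict[str, str], fetch: FetchFn) -> List[dict]:
    """Fetch reviews for each bank and tag every row with its bank name."""
    rows = []
    for bank_name, app_id in app_ids.items():
        try:
            reviews = fetch(app_id)
        except Exception as exc:
            # One failing app should not abort the whole collection run.
            print(f"Skipping {bank_name}: {exc}")
            continue
        for review in reviews:
            rows.append({
                "bank_name": bank_name,
                "review_text": review.get("content", ""),
                "rating": review.get("score"),
            })
    return rows


def fake_fetch(app_id: str) -> List[dict]:
    """Stand-in for a real Play Store call so the sketch runs offline."""
    if app_id == "com.example.broken":
        raise RuntimeError("app not found")
    return [
        {"content": "Works well", "score": 5},
        {"content": "Crashes often", "score": 1},
    ]


# Placeholder app IDs, not real banking apps.
demo_rows = collect_reviews(
    {"Bank A": "com.example.banka", "Bank B": "com.example.broken"},
    fake_fetch,
)
print(f"Collected {len(demo_rows)} reviews")
```

Passing the fetch function in keeps the loop testable without network access; a real run would swap `fake_fetch` for a function that calls the Play Store.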

## Step 2: Data Preprocessing

Clean and preprocess the scraped data.

In [None]:
# Initialize preprocessor
preprocessor = ReviewPreprocessor()

# Run preprocessing pipeline
success = preprocessor.process()

if success:
    print("\n✓ Preprocessing completed successfully!")
    df_processed = preprocessor.df
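else:
    # Stop here: the cells below depend on df_processed being defined
    raise RuntimeError("✗ Preprocessing failed; df_processed was not created.")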
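
The steps inside `ReviewPreprocessor.process()` are not shown here. A minimal sketch of typical review cleaning with pandas might look like the following, assuming columns named `bank_name`, `review_text`, and `rating` (only `bank_name`, `rating`, and `text_length` appear in this notebook; `review_text` and the function name `clean_reviews` are hypothetical).

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Drop empty, unrated, and duplicate reviews, then add text_length."""
    out = df.copy()
    # Treat missing text as empty and trim surrounding whitespace.
    out["review_text"] = out["review_text"].fillna("").str.strip()
    out = out[out["review_text"] != ""]
    # A review without a rating cannot be used in the rating plots.
    out = out.dropna(subset=["rating"])
    # The same text posted twice for the same bank counts once.
    out = out.drop_duplicates(subset=["bank_name", "review_text"]).copy()
    out["rating"] = out["rating"].astype(int)
    out["text_length"] = out["review_text"].str.len()
    return out.reset_index(drop=True)


# Toy data: one duplicate, one whitespace-only review, one missing rating.
raw = pd.DataFrame({
    "bank_name": ["Bank A", "Bank A", "Bank B", "Bank B"],
    "review_text": ["Great app", "Great app", "   ", "Slow login"],
    "rating": [5, 5, 3, None],
})
cleaned = clean_reviews(raw)
print(cleaned)
```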

## Step 3: Exploratory Data Analysis

Analyze the processed data to understand patterns and distributions.

In [None]:
# Initialize EDA
eda = EDA(df_processed)

# Generate summary report
print(eda.summary_report())

In [None]:
# Get basic statistics
stats = eda.get_basic_stats()
print("\nBasic Statistics:")
for key, value in stats.items():
    if key not in ['columns', 'data_types', 'missing_values']:
        print(f"{key}: {value}")

In [None]:
# Get top words
top_words = eda.get_top_words(n=20)
print("\nTop 20 Most Common Words:")
for word, count in top_words.items():
    print(f"  {word}: {count}")
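
For readers without access to the `EDA` class, word frequencies of this kind can be approximated with the standard library. The sketch below is not the implementation of `eda.get_top_words`; the function name `top_words`, the tokenization rule, and the tiny stop-word set are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Deliberately tiny, hypothetical stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "is", "and", "to", "it", "this"}


def top_words(texts, n=20):
    """Return the n most frequent non-stop-words as a word -> count dict."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return dict(counts.most_common(n))


sample = ["The app is slow", "Slow app and it crashes", "Good app"]
print(top_words(sample, n=2))  # {'app': 3, 'slow': 2}
```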

## Step 4: Data Visualization

Create visualizations to better understand the data.

In [None]:
# Initialize plotter
plotter = Plotter()

# Plot rating distribution
plotter.plot_histogram(
    df_processed, 
    'rating', 
    title='Rating Distribution',
    bins=5
)
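
Because star ratings are discrete values from 1 to 5, it can help to check the exact per-rating counts alongside the histogram. The sketch below uses plain pandas on made-up ratings; it is an assumption about how one might cross-check the plot and is not part of the `Plotter` class.

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.Series([5, 5, 4, 1, 3, 5], name="rating")

# Count each star value and make sure all five ratings appear,
# even those with no reviews.
rating_counts = ratings.value_counts().reindex(range(1, 6), fill_value=0)
print(rating_counts)
```

Reindexing over `range(1, 6)` keeps empty ratings visible as zeros, which a raw `value_counts()` would silently omit.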

In [None]:
# Plot reviews by bank
bank_counts = df_processed['bank_name'].value_counts().reset_index()
bank_counts.columns = ['bank_name', 'count']

plotter.plot_bar(
    bank_counts,
    'bank_name',
    'count',
    title='Reviews per Bank'
)

In [None]:
# Plot text length distribution
plotter.plot_histogram(
    df_processed,
    'text_length',
    title='Review Text Length Distribution',
    bins=30
)

## Summary

Task 1 Complete! We have:
1. ✓ Collected reviews from the Google Play Store
2. ✓ Preprocessed and cleaned the data
3. ✓ Performed exploratory data analysis
4. ✓ Created visualizations

The processed data is now ready for sentiment analysis and modeling.