# 📥 Data Ingestion Pipeline

### 🌐 Multi-Source Review Collection System

We present a custom-built data ingestion pipeline developed to scrape and aggregate user reviews from **Google Maps**, **Google Play Store**, and the **Apple App Store**. The goal is to collect rich, multi-platform feedback data for analysis of customer sentiment, service quality, and app performance.

#### ✅ What We Did:
- **Developed custom web scrapers** for each platform to extract relevant review data, including user ratings, comments, and timestamps.
- **Handled dynamic content loading** and anti-scraping measures to ensure reliable data retrieval.
- **Built a modular and reusable ingestion pipeline** using **Mage AI** to orchestrate the scraping workflows, schedule runs, and manage dependencies.
- **Standardized and cleaned the collected data** for consistency across platforms.
- **Exported the final datasets** for use in downstream tasks such as sentiment analysis, dashboarding, or machine learning.

This pipeline automates the recurring collection of customer feedback from multiple channels, providing up-to-date input for the downstream tasks listed above.
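The standardization step is not shown in this notebook. As a minimal sketch of what it could look like, the helper below maps each platform's columns onto one shared schema. The Google Maps column names match the sample output further down; the Play Store and App Store column names, the `STANDARD_COLUMNS` list, and the `standardize` function itself are assumptions made for illustration.

```python
import pandas as pd

# Shared target schema (illustrative choice, not taken from the pipeline).
STANDARD_COLUMNS = ["platform", "company", "reviewer", "rating", "date", "text"]

# Hypothetical per-platform renames; adjust to what each scraper really returns.
COLUMN_MAPS = {
    "google_maps": {},  # already uses reviewer / rating / date / text
    "google_play": {"userName": "reviewer", "score": "rating",
                    "at": "date", "content": "text"},
    "app_store": {"userName": "reviewer", "review": "text"},
}


def standardize(df: pd.DataFrame, platform: str) -> pd.DataFrame:
    """Rename platform-specific columns and return the shared schema."""
    out = df.rename(columns=COLUMN_MAPS[platform]).copy()
    out["platform"] = platform
    # Add any missing standard column as empty so the frames line up on concat.
    for col in STANDARD_COLUMNS:
        if col not in out.columns:
            out[col] = pd.NA
    # Ratings may arrive as strings; coerce them to numbers.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce")
    return out[STANDARD_COLUMNS]
```

With frames in this shape, `pd.concat([...], ignore_index=True)` yields one table that can be filtered by `platform` or `company`.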


#### 🔍 Data Sources

| Platform        | Collection Method          | Key Metrics                  | Update Frequency |
|-----------------|----------------------------|------------------------------|------------------|
| **Google Maps** | Custom Python Scraper       | Rating, Location Feedback    | Daily            |
| **App Store**   | Official API Integration    | Version-Specific Reviews     | Weekly           |
| **Play Store**  | Play Store Scraper          | Device-Specific Issues       | Weekly           |

---


In [21]:
import pandas as pd
import sys
import os


# Make the Data_Ingestion package importable from this notebook
data_ingestion_path = os.path.abspath(os.path.join('..', 'Data_Ingestion'))
sys.path.append(data_ingestion_path)


from Scrapers.config import JSONManager
from Scrapers.app_store import App_store
from Scrapers.google_maps import scrape_google_maps
from Scrapers.google_play import play

In [22]:
def load_app_store_data():
    """Load App Store reviews for every company into one DataFrame."""
    all_data = []
    # Company name -> App Store app ID
    app_ids = {
        'bolt': "675033630",
        'faras': "1616854301",
        'little': "1130691846",
        'uber': "368677368",
    }

    for company, app_id in app_ids.items():
        # Use a separate name so the ID mapping is not shadowed inside the loop
        reviews = App_store(app_id)
        if reviews is not None:
            reviews['company'] = company
            all_data.append(reviews)

    return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()


In [30]:
def load_google_maps_data():
    """Load Google Maps reviews for every company into one DataFrame."""
    all_data = []
    # Company name -> [search query, last known review count]
    sources = {
        'faras': ['Faras Kenya', 120],
        'uber': ['Uber office nairobi', 400],
        'little': ['Little Kenya', 800],
        'bolt': ['Bolt Interactive', 50],
    }

    for company, (name, last) in sources.items():
        reviews, new_count = scrape_google_maps(name, last)
        print(f"Google Maps: {company}, New reviews: {new_count}")

        if reviews is not None:
            reviews['company'] = company
            all_data.append(reviews)

    return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

In [24]:
def load_play_store_data():
    """Load Google Play Store reviews for every company into one DataFrame."""
    all_data = []
    # Company name -> [app package name, last known review count]
    sources = {
        'faras': ["com.faras.rider", 940],
        'uber': ["com.ubercab", 20660],
        'little': ["com.craftsilicon.littlecabrider", 2576],
        'bolt': ["ee.mtakso.client", 30600],
    }

    for company, (name, last) in sources.items():
        print(f"Processing Play Store: {company}")

        # Not the first run for these apps, so pass first_time=False
        reviews, new_count = play(name, last, first_time=False)

        if reviews is not None and not reviews.empty:
            reviews['company'] = company
            all_data.append(reviews)

    return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()
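The three loaders above share one pattern: loop over companies, fetch a DataFrame, tag it with the company name, and concatenate. A possible refactor, shown only as a sketch, pulls that pattern into a single helper. `collect_reviews` and its `fetch` callback are hypothetical names, and the sketch assumes each scraper returns either a DataFrame or `None`, as the checks in the loaders suggest.

```python
import pandas as pd


def collect_reviews(sources, fetch):
    """Call fetch(company, source) for each entry and concatenate the results.

    `fetch` must return a DataFrame, or None when nothing was retrieved.
    """
    frames = []
    for company, source in sources.items():
        reviews = fetch(company, source)
        if reviews is None or reviews.empty:
            continue  # skip companies that returned no data
        # assign() returns a copy, so the scraper's own frame is left untouched
        frames.append(reviews.assign(company=company))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

For the App Store loader this would be called as `collect_reviews(app_ids, lambda company, app_id: App_store(app_id))`, where `app_ids` maps company names to app IDs; the other two loaders would unpack the name and count inside their own callbacks.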

In [25]:
# run the app store scraper
app_store_df = load_app_store_data()

In [27]:
# run for play store
play_store_df = load_play_store_data()

Processing Play Store: faras
950
Fetching New 10 Review
Processing Play Store: uber
20680
Fetching New 20 Review
Processing Play Store: little
2587
Fetching New 11 Review
Processing Play Store: bolt
30623
Fetching New 23 Review
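The log above is consistent with an incremental scheme in which the number of new reviews is the current total minus the count stored from the previous run (for example, 950 - 940 = 10 for faras). The internals of `play` are not shown here, so the following is only a sketch of that arithmetic; `count_new_reviews` is a hypothetical helper, and clamping at zero is an added assumption to cover totals that shrink when reviews are deleted.

```python
def count_new_reviews(total_now: int, last_seen: int) -> int:
    """Number of reviews published since the previous run (never negative)."""
    return max(total_now - last_seen, 0)


# (current total, stored count) pairs taken from the log and the loader config.
runs = {
    "faras": (950, 940),
    "uber": (20680, 20660),
    "little": (2587, 2576),
    "bolt": (30623, 30600),
}

new_counts = {
    company: count_new_reviews(total, last)
    for company, (total, last) in runs.items()
}
print(new_counts)
# {'faras': 10, 'uber': 20, 'little': 11, 'bolt': 23}
```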



In [31]:
# scrape for google maps
google_maps_df = load_google_maps_data()

🌍 Loading Google Maps...
🔍 Searching for Faras Kenya...
⏳ Waiting for business page...
📝 Opening reviews section...
Total Review 139
🔄 Sorting by newest reviews...
⬇️ Scroll 1/2 - Reviews loaded: 20
⬇️ Scroll 2/2 - Reviews loaded: 30
🔍 Expanding 1 reviews with 'More' buttons
✅ Final review count: 30
Google Maps: faras, New reviews: 139
🌍 Loading Google Maps...
🔍 Searching for Uber office nairobi...
⏳ Waiting for business page...
📝 Opening reviews section...
Total Review 410
🔄 Sorting by newest reviews...
⬇️ Scroll 1/2 - Reviews loaded: 10
⬇️ Scroll 2/2 - Reviews loaded: 20
🔍 Expanding 3 reviews with 'More' buttons
✅ Final review count: 20
Google Maps: uber, New reviews: 410
🌍 Loading Google Maps...
🔍 Searching for Little Kenya...
⏳ Waiting for business page...
📝 Opening reviews section...
Total Review 827
🔄 Sorting by newest reviews...
⬇️ Scroll 1/3 - Reviews loaded: 20
⬇️ Scroll 2/3 - Reviews loaded: 30
⬇️ Scroll 3/3 - Reviews loaded: 40
🔍 Expanding 1 reviews with 'More' buttons
✅ Fin

In [35]:
google_maps_df.sample(3)

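|     | reviewer         | rating | date         | text                                              | response | company |
|-----|------------------|--------|--------------|---------------------------------------------------|----------|---------|
| 104 | paul moturi      | 4      | 8 months ago | Attendant not interactive behaves More of bein... |          | bolt    |
| 102 | Amos Mwangi      | 1      | 7 months ago | Useless app shakhaola                             |          | bolt    |
| 52  | Onesmus Nyangena | 5      | 3 weeks ago  |                                                   |          | little  |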
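The `date` column in the sample above holds relative strings such as "8 months ago", which are hard to compare with the dated reviews from the app stores. One way to handle this, sketched below, is to convert them to approximate timestamps relative to the scrape time. `parse_relative_date` is a hypothetical helper, not part of the pipeline, and it assumes the English "N units ago" format seen in the sample; anything else is returned as `NaT`.

```python
import re

import pandas as pd

# Map the singular unit in the text to the matching DateOffset keyword.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days",
          "week": "weeks", "month": "months", "year": "years"}
_PATTERN = re.compile(r"^(a|an|\d+)\s+(minute|hour|day|week|month|year)s?\s+ago$")


def parse_relative_date(text, now):
    """Convert strings such as '3 weeks ago' to an approximate Timestamp."""
    match = _PATTERN.match(str(text).strip().lower())
    if match is None:
        return pd.NaT  # unknown format: leave the value missing
    amount = 1 if match.group(1) in ("a", "an") else int(match.group(1))
    offset = pd.DateOffset(**{_UNITS[match.group(2)]: amount})
    return pd.Timestamp(now) - offset
```

Applied as `google_maps_df["date"].apply(parse_relative_date, now=scrape_time)`, where `scrape_time` is the moment the scraper ran. The result is only as precise as the original string, so "8 months ago" resolves to a single day within a month-wide window.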


In [37]:
# Save the raw datasets to CSV
google_maps_df.to_csv("../Data/Raw/google_maps.csv",index=False)
play_store_df.to_csv("../Data/Raw/google_play.csv",index=False)
app_store_df.to_csv("../Data/Raw/app_store.csv",index=False)
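The `to_csv` calls above fail if `../Data/Raw` does not exist yet, for example on a fresh clone. A small wrapper, shown here as a sketch with the hypothetical name `save_raw`, creates the parent directory before writing.

```python
from pathlib import Path

import pandas as pd


def save_raw(df: pd.DataFrame, path: str) -> Path:
    """Write df to CSV, creating the parent directory if it is missing."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)
    return target
```

Usage would mirror the cell above, for instance `save_raw(google_maps_df, "../Data/Raw/google_maps.csv")`.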