# 🏦 Ethiopian Banking App Reviews: Scraping and Preprocessing

This notebook handles:
- Scraping Google Play Store reviews for three major Ethiopian banks.
- Preprocessing the data: cleaning, formatting, and exporting.

---

## 1. 📌 Introduction

In this notebook, we collect user reviews for:
- Commercial Bank of Ethiopia (CBE)
- Bank of Abyssinia (BOA)
- Dashen Bank

These reviews will be used to analyze customer sentiment, uncover themes, and identify user satisfaction drivers and pain points.

---

## 2. 🕸️ Scraping Google Play Store Reviews

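We use the `google-play-scraper` Python library to gather reviews from the Google Play Store. Each review is stored with the following fields (the bank name and source are added by this notebook, not returned by the scraper):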
- Text content
- Rating (1–5 stars)
- Review date
- Bank name
- Source

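We target at least 400 reviews per app. Because `reviews_all` retrieves every available review, the actual count per app can be much higher.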
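If a capped sample is enough, the library's `reviews` function accepts a `count` argument. The sketch below is one possible alternative, not the approach this notebook uses: `to_row` and `fetch_latest` are hypothetical helper names, and the sample record is made up to show the mapping offline.

```python
from datetime import datetime


def to_row(raw, bank, source="Google Play"):
    """Map one raw scraper record to the row layout used in this notebook."""
    return {
        "review": raw["content"],
        "rating": raw["score"],
        "date": raw["at"].strftime("%Y-%m-%d"),
        "bank": bank,
        "source": source,
    }


def fetch_latest(package, bank, count=400):
    """Fetch up to `count` of the newest reviews for one app (needs network access)."""
    # Imported lazily so the pure helper above works without the library installed.
    from google_play_scraper import Sort, reviews

    result, _continuation_token = reviews(
        package,
        lang="en",
        country="us",
        sort=Sort.NEWEST,
        count=count,
    )
    return [to_row(r, bank) for r in result]


# Offline illustration with a made-up record (not real review data)
example = {"content": "Works fine", "score": 4, "at": datetime(2025, 6, 7, 10, 30)}
example_row = to_row(example, "CBE")
```

Calling `fetch_latest("com.combanketh.mobilebanking", "CBE")` would return up to 400 rows in the same shape as the full scrape.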

In [2]:
from google_play_scraper import reviews_all
import pandas as pd

apps = {
    "CBE": "com.combanketh.mobilebanking",
    "BOA": "com.boa.boaMobileBanking",
    "Dashen": "com.dashen.dashensuperapp"
}

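all_reviews = []

for bank, package in apps.items():
    print(f"Scraping reviews for {bank}...")
    # reviews_all pages through every available review for the app
    reviews = reviews_all(
        package,
        lang='en',
        country='us',
        sleep_milliseconds=0,  # no pause between page requests
    )
    # Keep only the fields needed downstream, tagged with bank and source
    for r in reviews:
        all_reviews.append({
            "review": r['content'],
            "rating": r['score'],
            "date": r['at'].strftime('%Y-%m-%d'),  # datetime -> YYYY-MM-DD string
            "bank": bank,
            "source": "Google Play"
        })

# Combine all banks into one table and save the raw data
df = pd.DataFrame(all_reviews)
df.to_csv("../data/bank_reviews_raw.csv", index=False)
print("Raw review data saved to ../data/bank_reviews_raw.csv")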

Scraping reviews for CBE...
Scraping reviews for BOA...
Scraping reviews for Dashen...
Raw review data saved to ../data/bank_reviews_raw.csv
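A quick way to check the 400-review target is to count rows per bank in a list shaped like `all_reviews`. This is a minimal sketch using only the standard library: `banks_below_target` is a hypothetical helper, and the sample rows are placeholders rather than scraped data.

```python
from collections import Counter


def banks_below_target(rows, banks, target=400):
    """Return {bank: count} for every expected bank with fewer than `target` rows."""
    counts = Counter(row["bank"] for row in rows)
    # Counter returns 0 for banks that have no rows at all
    return {bank: counts[bank] for bank in banks if counts[bank] < target}


# Placeholder rows, with a small target so the example stays readable
sample_rows = [{"bank": "CBE"}] * 3 + [{"bank": "BOA"}]
shortfall = banks_below_target(sample_rows, ["CBE", "BOA", "Dashen"], target=2)
print(shortfall)  # {'BOA': 1, 'Dashen': 0}
```

In the notebook, `banks_below_target(all_reviews, apps.keys())` should return an empty dict when every app meets the target.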


## 3. 🧹 Preprocessing: Cleaning the Data

This step includes:
- Removing duplicates
- Dropping missing values
- Normalizing date formats
- Retaining only relevant columns

In [8]:
df = pd.read_csv("../data/bank_reviews_raw.csv")

# Drop duplicates
df.drop_duplicates(subset=["review", "date", "bank"], inplace=True)

# Drop missing
df.dropna(subset=["review", "rating", "date", "bank"], inplace=True)

# Normalize date format
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')

# Select and reorder columns
df = df[["review", "rating", "date", "bank", "source"]]

df.to_csv("../data/bank_reviews_clean.csv", index=False)
print("Cleaned data saved to ../data/bank_reviews_clean.csv")


Cleaned data saved to ../data/bank_reviews_clean.csv
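To see what each cleaning step does, the same operations can be run on a small in-memory frame. The rows below are invented for illustration (assumed values, not scraped reviews): one exact duplicate, one missing review text, and dates that still carry a time component.

```python
import pandas as pd

# Toy data: row 1 duplicates row 0, row 2 has no review text
toy = pd.DataFrame({
    "review": ["Great app", "Great app", None, "Slow login"],
    "rating": [5, 5, 3, 2],
    "date": [
        "2025-06-08 09:15:00",
        "2025-06-08 09:15:00",
        "2025-06-07 18:00:00",
        "2025-06-07 12:30:00",
    ],
    "bank": ["CBE", "CBE", "BOA", "Dashen"],
    "source": ["Google Play"] * 4,
})

cleaned = (
    toy.drop_duplicates(subset=["review", "date", "bank"])   # drops row 1
       .dropna(subset=["review", "rating", "date", "bank"])  # drops row 2
       .copy()
)

# Normalize timestamps to plain YYYY-MM-DD strings
cleaned["date"] = pd.to_datetime(cleaned["date"]).dt.strftime("%Y-%m-%d")

# Keep the columns in the order used for the exported file
cleaned = cleaned[["review", "rating", "date", "bank", "source"]]
print(cleaned)
```

Two rows survive: the first "Great app" review and the "Slow login" review, both with date-only strings.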


## 4. 📦 Export and Save Cleaned Data

The cleaned dataset is saved to:

```bash
../data/bank_reviews_clean.csv
```
This dataset will be used in the next phase: sentiment and theme analysis.

## 5. 🔍 Data Preview & Quick Summary

Preview the cleaned dataset to confirm structure and completeness.

In [None]:
df.head()


|   | review | rating | date | bank | source |
|---|--------|--------|------|------|--------|
| 0 | 20 years | 5 | 2025-06-08 | CBE | Google Play |
| 1 | A great app. It's like carrying a bank in your... | 4 | 2025-06-07 | CBE | Google Play |
| 2 | More than garrantty bank EBC. | 4 | 2025-06-07 | CBE | Google Play |
| 3 | really am happy to this app it is Siple to use... | 5 | 2025-06-07 | CBE | Google Play |
| 4 | I liked this app. But the User interface is ve... | 2 | 2025-06-07 | CBE | Google Play |


In [None]:
print("Total Reviews:", df.shape[0])
print("Missing values per column:")
print(df.isnull().sum())


Total Reviews: 8843
Missing values per column:
review    0
rating    0
date      0
bank      0
source    0
dtype: int64
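A per-bank breakdown is a natural next check, since the total alone does not show how reviews are spread across the three apps. The sketch below uses a hypothetical four-row frame in place of the cleaned `df`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Stand-in for the cleaned DataFrame (invented values, not real results)
sample = pd.DataFrame({
    "bank": ["CBE", "CBE", "BOA", "Dashen"],
    "rating": [5, 4, 2, 3],
})

# Review count and mean rating per bank, largest group first
summary = (
    sample.groupby("bank")["rating"]
          .agg(reviews="count", mean_rating="mean")
          .sort_values("reviews", ascending=False)
)
print(summary)
```

Replacing `sample` with `df` gives the same summary for the full cleaned dataset.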
