Build a POC data application that goes end to end, from data source to final application. Process:
data collection
-> data processing
-> DB modeling
-> data storage
-> ETL
-> data analysis / ML
-> data visualization
This project focuses on: 1) database modeling / schema design (driven by business understanding and use cases), 2) data processing, 3) analysis that extracts business insights, and 4) framework design logic (why this database, why this schema, why this BI tool, and so on).
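To make the schema-design focus concrete, below is a minimal, hypothetical sketch of two core tables in SQLAlchemy; the project's actual DDL lives in the Alembic migrations under ddl/versions, so the names and types here are assumptions based on the public Yelp dataset.

```python
# Illustrative schema sketch only; the real DDL is in ddl/versions.
from sqlalchemy import Column, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Business(Base):
    __tablename__ = "business"
    business_id = Column(String(22), primary_key=True)  # Yelp ids are short fixed-length strings
    name = Column(String(255))
    city = Column(String(64))
    stars = Column(Float)
    review_count = Column(Integer)

class Review(Base):
    __tablename__ = "review"
    review_id = Column(String(22), primary_key=True)
    business_id = Column(String(22), ForeignKey("business.business_id"))  # one business, many reviews
    user_id = Column(String(22))
    stars = Column(Float)
    text = Column(Text)
```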
- Tech : Python 3, PySpark, MySQL / AWS RDS, S3, Redash, Alembic, Docker
- Data processing : transform_all_json_2_csv.sh
- DB modeling : Alembic DDL
- Data storage : all_csv_2_mysql.sh
- ETL : Spark ETL
- Analysis / data visualization : Redash dashboard (see the query sketch below)
- Presentation : YelpReview_DS_demo
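As a taste of the analysis layer, here is a sketch of the kind of query a Redash chart could run against the yelp database; the table and column names are assumptions based on the Yelp dataset, and the connection URL is a placeholder (in Redash the SQL would be pasted in directly).

```python
# Test a dashboard-style query locally before wiring it into Redash (illustrative).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost:3306/yelp")  # placeholder URL

QUERY = """
SELECT city,
       COUNT(*)   AS business_count,
       AVG(stars) AS avg_stars
FROM business
GROUP BY city
ORDER BY business_count DESC
LIMIT 10;
"""
print(pd.read_sql(QUERY, engine))  # top 10 cities by number of businesses
```

File structure: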
├── README.md
├── alembic.ini : configuration for Alembic DB version control
├── config : configuration for the database, RDS, S3...
├── data : directory for the saved Yelp dataset
├── db : SQL for the Redash dashboard and analysis
├── ddl : Alembic database migrations (ddl/versions)
├── doc : documentation files
├── etl : main ETL scripts
├── redash : Dockerfile for the Redash env (BI tool)
├── requirements.txt : needed Python libraries
├── script : scripts that run data preprocessing
├── spark : Dockerfile that builds the Spark env
└── superset : Dockerfile for the Superset env (BI tool)
Prerequisites
- Clone the repo:
  git clone https://github.com/yennanliu/YelpReviews.git
- Download the Kaggle Yelp dataset and save it under the data directory
- Install and launch a local MySQL server, then create a database yelp (for development)
- Set up an AWS RDS MySQL database (for production, optional)
- Modify the MySQL DB config with yours
- Modify the RDS MySQL DB config with yours (optional)
- Modify the DB connection (e.g. sqlalchemy.url = <your_mysql_url>) in alembic.ini with yours
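Before running migrations, the sqlalchemy.url value can be sanity-checked outside Alembic. A minimal sketch, assuming the pymysql driver and placeholder credentials:

```python
# Verify the MySQL URL that goes into alembic.ini (sqlalchemy.url).
from sqlalchemy import create_engine, text

url = "mysql+pymysql://user:password@localhost:3306/yelp"  # placeholder credentials
engine = create_engine(url)
with engine.connect() as conn:
    # Prints the server version if the URL, credentials, and database are valid.
    print(conn.execute(text("SELECT VERSION()")).scalar())
```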
Quick start
# STEP 0) install libraries
$ cd ~/YelpReviews && pip install -r requirements.txt
# STEP 1) db migration
$ alembic init --template generic ddl && alembic upgrade head # downgrade : $ alembic downgrade -1
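For reference, a migration file under ddl/versions looks roughly like the sketch below; the table definition is illustrative, not the repo's actual schema.

```python
# Sketch of an Alembic migration (revision ids and table are placeholders).
from alembic import op
import sqlalchemy as sa

revision = "0001_create_business"
down_revision = None

def upgrade():
    op.create_table(
        "business",
        sa.Column("business_id", sa.String(22), primary_key=True),
        sa.Column("name", sa.String(255)),
        sa.Column("city", sa.String(64)),
        sa.Column("stars", sa.Float),
    )

def downgrade():
    # This is what `alembic downgrade -1` would undo.
    op.drop_table("business")
```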
# STEP 2) data preprocess
$ bash script/transform_all_json_2_csv.sh # json to csv
# csv -> mysql
$ bash script/all_csv_2_mysql.sh
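The two scripts above wrap the preprocessing. A rough Python equivalent of what they presumably do (dataset file name, serialization details, and credentials are assumptions):

```python
# Assumed behavior of the preprocessing scripts:
# 1) flatten Yelp's JSON-lines files to CSV, 2) bulk-load the CSV into MySQL.
import pandas as pd
from sqlalchemy import create_engine

# 1) JSON lines -> CSV (each Yelp dataset file is one JSON object per line)
df = pd.read_json("data/yelp_academic_dataset_business.json", lines=True)
df.to_csv("data/business.csv", index=False)

# 2) CSV -> MySQL (placeholder credentials)
engine = create_engine("mysql+pymysql://user:password@localhost:3306/yelp")
pd.read_csv("data/business.csv").to_sql(
    "business", engine, if_exists="append", index=False, chunksize=10_000
)
```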
# STEP 3) spark etl
$ docker build spark/. -t spark_env
$ bash etl/run_etl_digest_business.sh
$ bash etl/etl_user_friend_count.sh
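A minimal PySpark sketch of the kind of "digest business" job these scripts drive; the input path, aggregation, and output location are assumptions, and the real logic lives in etl/.

```python
# Illustrative "digest business" job (the actual ETL is in etl/).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_digest_business").getOrCreate()

business = spark.read.json("data/yelp_academic_dataset_business.json")

# Aggregate business volume and average rating per city, busiest cities first.
digest = (
    business.groupBy("city")
    .agg(
        F.count("*").alias("business_count"),
        F.avg("stars").alias("avg_stars"),
    )
    .orderBy(F.desc("business_count"))
)

digest.write.mode("overwrite").csv("data/digest_business", header=True)
spark.stop()
```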
Development
dev
TODO
- Add tests
- Dockerize the application end to end (so all functionality can run offline)
- Tune the Spark code to raise I/O efficiency
Ref
- Yelp dataset
- Superset connected to S3-transformed data via Athena
- Alembic MySQL migration
- Redash Docker
- ML : Yelp review star prediction
- Yelp dataset DB model design