This tool reads raw NASA Kennedy Space Center HTTP web server log files, parses and transforms them into structured records, executes three analytical queries using your choice of four execution engines, stores aggregated results in MySQL, and displays a formatted report with execution metadata.
Four interchangeable execution pipelines:
| # | Pipeline | Technology Used |
|---|---|---|
| 1 | MapReduce | Hadoop Streaming with Python mapper/reducer |
| 2 | Pig | Apache Pig Latin scripts with Python UDF |
| 3 | MongoDB | pymongo with MongoDB aggregation pipelines |
| 4 | Hive | HiveQL via beeline with RegEx SerDe |
All pipelines process the same dataset with identical parsing rules, cleaning rules, and query logic — guaranteeing a fair comparison.
| Software | Minimum Version | Notes |
|---|---|---|
| Python | 3.10+ | System Python or virtual env |
| Hadoop | 3.x | HDFS + YARN must be running |
| Apache Pig | 0.17+ | pig command must be in PATH |
| Apache Hive | 3.x | HiveServer2 + beeline in PATH |
| MongoDB | 6.x | mongod must be running |
| MySQL | 8.x | mysqld must be running |
# ── System packages ──────────────────────────────────────────────────────────
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk wget tar ssh rsync mysql-server
# ── Start MySQL ──────────────────────────────────────────────────────────────
sudo service mysql start
# Set root password (e.g. 'root') and create the database:
sudo mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root'; FLUSH PRIVILEGES;"
# ── Hadoop 3.3.6 ─────────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzf hadoop-3.3.6.tar.gz && sudo rm hadoop-3.3.6.tar.gz
# ── Apache Pig 0.17 ──────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz
sudo tar -xzf pig-0.17.0.tar.gz && sudo rm pig-0.17.0.tar.gz
# ── Apache Hive 3.1.3 ────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
sudo tar -xzf apache-hive-3.1.3-bin.tar.gz && sudo rm apache-hive-3.1.3-bin.tar.gz
# Fix Guava version mismatch:
sudo rm /opt/apache-hive-3.1.3-bin/lib/guava-19.0.jar
sudo cp /opt/hadoop-3.3.6/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/apache-hive-3.1.3-bin/lib/
# ── MongoDB ──────────────────────────────────────────────────────────────────
# Follow official MongoDB Community installation for Ubuntu:
# https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/
wget -qO - https://www.mongodb.org/static/pgp/server-7.0.asc | sudo apt-key add -
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update && sudo apt-get install -y mongodb-org
sudo systemctl start mongod || sudo mongod --fork --logpath /var/log/mongod.log --dbpath /var/lib/mongodb
# ── Python packages ──────────────────────────────────────────────────────────
pip install mysql-connector-python pymongo tabulateAdd these to your ~/.bashrc (or ~/.zshrc on macOS):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop-3.3.6
export PIG_HOME=/opt/pig-0.17.0
export HIVE_HOME=/opt/apache-hive-3.1.3-bin
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH=$PATH:$PIG_HOME/bin
export PATH=$PATH:$HIVE_HOME/bin
export MYSQL_PASSWORD=root # or your MySQL root passwordThen: source ~/.bashrc
# Enable passwordless SSH for Hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa 2>/dev/null || true
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
# Configure Hadoop for pseudo-distributed mode
cat > $HADOOP_HOME/etc/hadoop/core-site.xml << 'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << 'EOF'
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
EOF
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Format HDFS and start services
hdfs namenode -format -force
$HADOOP_HOME/sbin/start-dfs.sh
# Initialize Hive metastore
cd /tmp && schematool -dbType derby -initSchemabrew install hadoop pig hive
brew tap mongodb/brew && brew install mongodb-community
brew install mysql
brew services start mongodb-community
brew services start mysql
pip install mysql-connector-python pymongo tabulate# Create the data directory inside the project
mkdir -p nosql_etl_project/data
cd nosql_etl_project/data
# Download the NASA log files
curl -O https://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
curl -O https://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
# Decompress (the ONLY allowed preprocessing step)
gunzip NASA_access_log_Jul95.gz
gunzip NASA_access_log_Aug95.gz
# Verify
ls -lh NASA_access_log_Jul95 NASA_access_log_Aug95
⚠️ Do not edit, filter, convert, or modify the raw log files. Decompression is the only permitted manual step.
Edit config.py to set your credentials:
MYSQL_PASSWORD = "yourpassword" # ← set your MySQL root passwordAll other defaults work for a standard local installation.
cd nosql_etl_project
mysql -u root -p < db/schema.sqlThis creates the nasa_etl database and all 5 required tables.
The tool also calls init_schema() automatically at startup.
cd nosql_etl_project
python main.pyYou will see:
=============================================
NASA Log ETL & Analytics Tool
=============================================
Select execution pipeline:
1. MapReduce (Hadoop Streaming)
2. Apache Pig
3. MongoDB
4. Apache Hive
Enter choice (1-4): 3
Enter batch size [default 10000]: 10000
# MapReduce pipeline
python main.py --pipeline mapreduce --batch-size 10000
# Apache Pig pipeline
python main.py --pipeline pig --batch-size 10000
# MongoDB pipeline (with automatic report after run)
python main.py --pipeline mongodb --batch-size 10000 --report
# Apache Hive pipeline
python main.py --pipeline hive --batch-size 10000# Show report for the most recent run
python main.py --report
# Show report for a specific run
python main.py --report --run-id run_20250426_143200_mongodb
# List all past runs
python main.py --list-runs
# Compare two pipeline runs side-by-side
python main.py --compare run_20250426_143200_mapreduce run_20250426_150000_pignosql_etl_project/
├── main.py ← Unified CLI entry point
├── config.py ← All configuration (credentials, paths, batch size)
├── data/ ← Place NASA log files here
├── parser/log_parser.py ← Shared parsing logic (single source of truth)
├── pipelines/
│ ├── mapreduce_pipeline.py
│ ├── pig_pipeline.py
│ ├── mongodb_pipeline.py
│ └── hive_pipeline.py
├── mapreduce/ ← Hadoop Streaming mapper/reducer scripts
├── pig/ ← Pig Latin scripts + Python UDF
├── hive/ ← HiveQL DDL and query scripts
├── mongodb/ ← MongoDB aggregation modules
├── db/schema.sql ← MySQL schema
├── db/loader.py ← MySQL bulk loader
├── reporting/report.py ← Report display module
├── utils/ ← Timer, BatchManager, run_id_generator
└── logs/ ← Malformed record logs per run
| Query | Description | Key Output Columns |
|---|---|---|
| Q1 | Daily Traffic Summary | log_date, status_code, request_count, total_bytes |
| Q2 | Top 20 Requested Resources | resource_path, request_count, total_bytes, distinct_host_count |
| Q3 | Hourly Error Analysis (400–599) | log_date, log_hour, error_request_count, total_request_count, error_rate, distinct_error_hosts |
Set your HADOOP_HOME environment variable or:
export HADOOP_STREAMING_JAR=/path/to/hadoop-streaming-*.jarCheck that MySQL is running and your password in config.py is correct:
brew services start mysql
mysql -u root -pbrew services start mongodb-community
mongosh --eval "db.adminCommand('ping')"Ensure $HIVE_HOME/bin is in your PATH:
export PATH=$PATH:/opt/homebrew/opt/hive/bin
source ~/.zshrcexport PATH=$PATH:/opt/homebrew/opt/pig/bin
source ~/.zshrc$HADOOP_HOME/sbin/start-dfs.sh
hdfs dfsadmin -report- Batch unit: number of input log records (lines), not bytes.
- Batch IDs: start at 1, increment sequentially.
- Final batch: always counted as valid even if shorter than batch_size.
- avg_batch_size:
total_records / num_batches - Both log files (July + August) are concatenated into one continuous stream.
- Same batch_size must be used across all pipelines for fair comparison.
Hive processes entire HDFS files in a single MapReduce/Tez job. The BatchManager is used in the Hive pipeline only for pre-scanning (counting records and malformed lines). All results are written with a single batch_id=1. This design decision is documented in the compact report.
| Included in runtime | Excluded from runtime |
|---|---|
| All parsing time | Dataset download time |
| All batch processing | Software installation time |
| All query execution | Report rendering time |
| All MySQL write time | MySQL schema creation |
| Initial HDFS file setup |
════════════════════════════════════════════════════════════════════════════════
NASA Log ETL & Analytics Tool — Run Report
════════════════════════════════════════════════════════════════════════════════
Pipeline : MONGODB
Run ID : run_20250426_143200_mongodb
Started At : 2025-04-26 14:32:00
Runtime : 45.230 seconds
Batch Size : 10,000
Total Records : 3,461,612
Malformed : 1,243
Num Batches : 347
Avg Batch Size : 9,973.2
────────────────────────────────────────────────────────────────────────────────
QUERY 1: Daily Traffic Summary
────────────────────────────────────────────────────────────────────────────────
╭────────────┬─────────────┬───────────────┬─────────────╮
│ log_date │ status_code │ request_count │ total_bytes │
├────────────┼─────────────┼───────────────┼─────────────┤
│ 1995-07-01 │ 200 │ 152,341 │ 982,043,210 │
│ ... │ ... │ ... │ ... │
╰────────────┴─────────────┴───────────────┴─────────────╯
cd nosql_etl_project
git init
git add .
git commit -m "Initial: complete multi-pipeline ETL tool"DAS 839 — NoSQL Systems | End Semester Project | NASA HTTP Log Analytics