NASA Log ETL & Analytics Tool

DAS 839 — NoSQL Systems End Semester Project

Multi-Pipeline ETL and Reporting Framework for Web Server Log Analytics

Overview

This tool reads raw NASA Kennedy Space Center HTTP web server log files, parses and transforms them into structured records, executes three analytical queries using your choice of four execution engines, stores aggregated results in MySQL, and displays a formatted report with execution metadata.

Four interchangeable execution pipelines:

#	Pipeline	Technology Used
1	MapReduce	Hadoop Streaming with Python mapper/reducer
2	Pig	Apache Pig Latin scripts with Python UDF
3	MongoDB	pymongo with MongoDB aggregation pipelines
4	Hive	HiveQL via beeline with RegEx SerDe

All pipelines process the same dataset with identical parsing rules, cleaning rules, and query logic — guaranteeing a fair comparison.

Prerequisites

Software	Minimum Version	Notes
Python	3.10+	System Python or virtual env
Hadoop	3.x	HDFS + YARN must be running
Apache Pig	0.17+	`pig` command must be in PATH
Apache Hive	3.x	HiveServer2 + beeline in PATH
MongoDB	6.x	`mongod` must be running
MySQL	8.x	`mysqld` must be running

Linux / Ubuntu / WSL Installation

# ── System packages ──────────────────────────────────────────────────────────
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk wget tar ssh rsync mysql-server

# ── Start MySQL ──────────────────────────────────────────────────────────────
sudo service mysql start
# Set root password (e.g. 'root') and create the database:
sudo mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root'; FLUSH PRIVILEGES;"

# ── Hadoop 3.3.6 ─────────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzf hadoop-3.3.6.tar.gz && sudo rm hadoop-3.3.6.tar.gz

# ── Apache Pig 0.17 ──────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz
sudo tar -xzf pig-0.17.0.tar.gz && sudo rm pig-0.17.0.tar.gz

# ── Apache Hive 3.1.3 ────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
sudo tar -xzf apache-hive-3.1.3-bin.tar.gz && sudo rm apache-hive-3.1.3-bin.tar.gz
# Fix Guava version mismatch:
sudo rm /opt/apache-hive-3.1.3-bin/lib/guava-19.0.jar
sudo cp /opt/hadoop-3.3.6/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/apache-hive-3.1.3-bin/lib/

# ── MongoDB ──────────────────────────────────────────────────────────────────
# Follow official MongoDB Community installation for Ubuntu:
# https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/
wget -qO - https://www.mongodb.org/static/pgp/server-7.0.asc | sudo apt-key add -
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update && sudo apt-get install -y mongodb-org
sudo systemctl start mongod || sudo mongod --fork --logpath /var/log/mongod.log --dbpath /var/lib/mongodb

# ── Python packages ──────────────────────────────────────────────────────────
pip install mysql-connector-python pymongo tabulate

Environment Variables

Add these to your ~/.bashrc (or ~/.zshrc on macOS):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop-3.3.6
export PIG_HOME=/opt/pig-0.17.0
export HIVE_HOME=/opt/apache-hive-3.1.3-bin
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH=$PATH:$PIG_HOME/bin
export PATH=$PATH:$HIVE_HOME/bin
export MYSQL_PASSWORD=root    # or your MySQL root password

Then: source ~/.bashrc

Hadoop Configuration (Pseudo-Distributed Mode)

# Enable passwordless SSH for Hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa 2>/dev/null || true
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Configure Hadoop for pseudo-distributed mode
cat > $HADOOP_HOME/etc/hadoop/core-site.xml << 'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << 'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Format HDFS and start services
hdfs namenode -format -force
$HADOOP_HOME/sbin/start-dfs.sh

# Initialize Hive metastore
cd /tmp && schematool -dbType derby -initSchema

macOS Alternative (Homebrew)

brew install hadoop pig hive
brew tap mongodb/brew && brew install mongodb-community
brew install mysql
brew services start mongodb-community
brew services start mysql
pip install mysql-connector-python pymongo tabulate

Dataset Download & Preparation

# Create the data directory inside the project
mkdir -p nosql_etl_project/data
cd nosql_etl_project/data

# Download the NASA log files
curl -O https://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
curl -O https://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz

# Decompress (the ONLY allowed preprocessing step)
gunzip NASA_access_log_Jul95.gz
gunzip NASA_access_log_Aug95.gz

# Verify
ls -lh NASA_access_log_Jul95 NASA_access_log_Aug95

⚠️ Do not edit, filter, convert, or modify the raw log files. Decompression is the only permitted manual step.

Configuration

Edit config.py to set your credentials:

MYSQL_PASSWORD = "yourpassword"   # ← set your MySQL root password

All other defaults work for a standard local installation.

MySQL Schema Setup

cd nosql_etl_project
mysql -u root -p < db/schema.sql

This creates the nasa_etl database and all 5 required tables. The tool also calls init_schema() automatically at startup.

Running the Tool

Interactive Mode (recommended for demo)

cd nosql_etl_project
python main.py

You will see:

=============================================
    NASA Log ETL & Analytics Tool
=============================================

Select execution pipeline:
  1. MapReduce  (Hadoop Streaming)
  2. Apache Pig
  3. MongoDB
  4. Apache Hive

Enter choice (1-4): 3
Enter batch size [default 10000]: 10000

Non-Interactive Mode

# MapReduce pipeline
python main.py --pipeline mapreduce --batch-size 10000

# Apache Pig pipeline
python main.py --pipeline pig --batch-size 10000

# MongoDB pipeline (with automatic report after run)
python main.py --pipeline mongodb --batch-size 10000 --report

# Apache Hive pipeline
python main.py --pipeline hive --batch-size 10000

Viewing Reports

# Show report for the most recent run
python main.py --report

# Show report for a specific run
python main.py --report --run-id run_20250426_143200_mongodb

# List all past runs
python main.py --list-runs

# Compare two pipeline runs side-by-side
python main.py --compare run_20250426_143200_mapreduce run_20250426_150000_pig

Directory Structure

nosql_etl_project/
├── main.py                     ← Unified CLI entry point
├── config.py                   ← All configuration (credentials, paths, batch size)
├── data/                       ← Place NASA log files here
├── parser/log_parser.py        ← Shared parsing logic (single source of truth)
├── pipelines/
│   ├── mapreduce_pipeline.py
│   ├── pig_pipeline.py
│   ├── mongodb_pipeline.py
│   └── hive_pipeline.py
├── mapreduce/                  ← Hadoop Streaming mapper/reducer scripts
├── pig/                        ← Pig Latin scripts + Python UDF
├── hive/                       ← HiveQL DDL and query scripts
├── mongodb/                    ← MongoDB aggregation modules
├── db/schema.sql               ← MySQL schema
├── db/loader.py                ← MySQL bulk loader
├── reporting/report.py         ← Report display module
├── utils/                      ← Timer, BatchManager, run_id_generator
└── logs/                       ← Malformed record logs per run

Three Analytical Queries

Query	Description	Key Output Columns
Q1	Daily Traffic Summary	log_date, status_code, request_count, total_bytes
Q2	Top 20 Requested Resources	resource_path, request_count, total_bytes, distinct_host_count
Q3	Hourly Error Analysis (400–599)	log_date, log_hour, error_request_count, total_request_count, error_rate, distinct_error_hosts

Common Errors & Solutions

`FileNotFoundError: Cannot locate hadoop-streaming JAR`

Set your HADOOP_HOME environment variable or:

export HADOOP_STREAMING_JAR=/path/to/hadoop-streaming-*.jar

`MySQL connection failed`

Check that MySQL is running and your password in config.py is correct:

brew services start mysql
mysql -u root -p

`MongoDB connection failed`

brew services start mongodb-community
mongosh --eval "db.adminCommand('ping')"

`beeline: command not found`

Ensure $HIVE_HOME/bin is in your PATH:

export PATH=$PATH:/opt/homebrew/opt/hive/bin
source ~/.zshrc

`pig: command not found`

export PATH=$PATH:/opt/homebrew/opt/pig/bin
source ~/.zshrc

`HDFS not running`

$HADOOP_HOME/sbin/start-dfs.sh
hdfs dfsadmin -report

Batching Behaviour

Batch unit: number of input log records (lines), not bytes.
Batch IDs: start at 1, increment sequentially.
Final batch: always counted as valid even if shorter than batch_size.
avg_batch_size: total_records / num_batches
Both log files (July + August) are concatenated into one continuous stream.
Same batch_size must be used across all pipelines for fair comparison.

Hive Batching Note

Hive processes entire HDFS files in a single MapReduce/Tez job. The BatchManager is used in the Hive pipeline only for pre-scanning (counting records and malformed lines). All results are written with a single batch_id=1. This design decision is documented in the compact report.

Runtime Measurement

Included in runtime	Excluded from runtime
All parsing time	Dataset download time
All batch processing	Software installation time
All query execution	Report rendering time
All MySQL write time	MySQL schema creation
	Initial HDFS file setup

Sample Output

════════════════════════════════════════════════════════════════════════════════
  NASA Log ETL & Analytics Tool — Run Report
════════════════════════════════════════════════════════════════════════════════
  Pipeline       : MONGODB
  Run ID         : run_20250426_143200_mongodb
  Started At     : 2025-04-26 14:32:00
  Runtime        : 45.230 seconds
  Batch Size     : 10,000
  Total Records  : 3,461,612
  Malformed      : 1,243
  Num Batches    : 347
  Avg Batch Size : 9,973.2
────────────────────────────────────────────────────────────────────────────────
  QUERY 1: Daily Traffic Summary
────────────────────────────────────────────────────────────────────────────────
╭────────────┬─────────────┬───────────────┬─────────────╮
│ log_date   │ status_code │ request_count │ total_bytes │
├────────────┼─────────────┼───────────────┼─────────────┤
│ 1995-07-01 │         200 │       152,341 │ 982,043,210 │
│ ...        │         ... │           ... │         ... │
╰────────────┴─────────────┴───────────────┴─────────────╯

Git Repository

cd nosql_etl_project
git init
git add .
git commit -m "Initial: complete multi-pipeline ETL tool"

DAS 839 — NoSQL Systems | End Semester Project | NASA HTTP Log Analytics

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
db		db
hive		hive
logs		logs
mapreduce		mapreduce
mongodb		mongodb
parser		parser
pig		pig
pipelines		pipelines
reporting		reporting
utils		utils
.gitignore		.gitignore
Dockerfile.hive		Dockerfile.hive
README.md		README.md
config.py		config.py
experiment.py		experiment.py
main.py		main.py
start_hive_docker.sh		start_hive_docker.sh

Folders and files

Latest commit

History

Repository files navigation

NASA Log ETL & Analytics Tool

DAS 839 — NoSQL Systems End Semester Project

Multi-Pipeline ETL and Reporting Framework for Web Server Log Analytics

Overview

Prerequisites

Linux / Ubuntu / WSL Installation

Environment Variables

Hadoop Configuration (Pseudo-Distributed Mode)

macOS Alternative (Homebrew)

Dataset Download & Preparation

Configuration

MySQL Schema Setup

Running the Tool

Interactive Mode (recommended for demo)

Non-Interactive Mode

Viewing Reports

Directory Structure

Three Analytical Queries

Common Errors & Solutions

FileNotFoundError: Cannot locate hadoop-streaming JAR

MySQL connection failed

MongoDB connection failed

beeline: command not found

pig: command not found

HDFS not running

Batching Behaviour

Hive Batching Note

Runtime Measurement

Sample Output

Git Repository

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`FileNotFoundError: Cannot locate hadoop-streaming JAR`

`MySQL connection failed`

`MongoDB connection failed`

`beeline: command not found`

`pig: command not found`

`HDFS not running`

Packages