Skip to content

sidGoswami725/NoSQL-Course-Project

Repository files navigation

NASA Log ETL & Analytics Tool

DAS 839 — NoSQL Systems End Semester Project

Multi-Pipeline ETL and Reporting Framework for Web Server Log Analytics


Overview

This tool reads raw NASA Kennedy Space Center HTTP web server log files, parses and transforms them into structured records, executes three analytical queries using your choice of four execution engines, stores aggregated results in MySQL, and displays a formatted report with execution metadata.

Four interchangeable execution pipelines:

# Pipeline Technology Used
1 MapReduce Hadoop Streaming with Python mapper/reducer
2 Pig Apache Pig Latin scripts with Python UDF
3 MongoDB pymongo with MongoDB aggregation pipelines
4 Hive HiveQL via beeline with RegEx SerDe

All pipelines process the same dataset with identical parsing rules, cleaning rules, and query logic — guaranteeing a fair comparison.


Prerequisites

Software Minimum Version Notes
Python 3.10+ System Python or virtual env
Hadoop 3.x HDFS + YARN must be running
Apache Pig 0.17+ pig command must be in PATH
Apache Hive 3.x HiveServer2 + beeline in PATH
MongoDB 6.x mongod must be running
MySQL 8.x mysqld must be running

Linux / Ubuntu / WSL Installation

# ── System packages ──────────────────────────────────────────────────────────
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk wget tar ssh rsync mysql-server

# ── Start MySQL ──────────────────────────────────────────────────────────────
sudo service mysql start
# Set root password (e.g. 'root') and create the database:
sudo mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root'; FLUSH PRIVILEGES;"

# ── Hadoop 3.3.6 ─────────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzf hadoop-3.3.6.tar.gz && sudo rm hadoop-3.3.6.tar.gz

# ── Apache Pig 0.17 ──────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz
sudo tar -xzf pig-0.17.0.tar.gz && sudo rm pig-0.17.0.tar.gz

# ── Apache Hive 3.1.3 ────────────────────────────────────────────────────────
cd /opt
sudo wget -q https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
sudo tar -xzf apache-hive-3.1.3-bin.tar.gz && sudo rm apache-hive-3.1.3-bin.tar.gz
# Fix Guava version mismatch:
sudo rm /opt/apache-hive-3.1.3-bin/lib/guava-19.0.jar
sudo cp /opt/hadoop-3.3.6/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/apache-hive-3.1.3-bin/lib/

# ── MongoDB ──────────────────────────────────────────────────────────────────
# Follow official MongoDB Community installation for Ubuntu:
# https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/
wget -qO - https://www.mongodb.org/static/pgp/server-7.0.asc | sudo apt-key add -
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update && sudo apt-get install -y mongodb-org
sudo systemctl start mongod || sudo mongod --fork --logpath /var/log/mongod.log --dbpath /var/lib/mongodb

# ── Python packages ──────────────────────────────────────────────────────────
pip install mysql-connector-python pymongo tabulate

Environment Variables

Add these to your ~/.bashrc (or ~/.zshrc on macOS):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop-3.3.6
export PIG_HOME=/opt/pig-0.17.0
export HIVE_HOME=/opt/apache-hive-3.1.3-bin
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH=$PATH:$PIG_HOME/bin
export PATH=$PATH:$HIVE_HOME/bin
export MYSQL_PASSWORD=root    # or your MySQL root password

Then: source ~/.bashrc

Hadoop Configuration (Pseudo-Distributed Mode)

# Enable passwordless SSH for Hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa 2>/dev/null || true
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Configure Hadoop for pseudo-distributed mode
cat > $HADOOP_HOME/etc/hadoop/core-site.xml << 'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << 'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Format HDFS and start services
hdfs namenode -format -force
$HADOOP_HOME/sbin/start-dfs.sh

# Initialize Hive metastore
cd /tmp && schematool -dbType derby -initSchema

macOS Alternative (Homebrew)

brew install hadoop pig hive
brew tap mongodb/brew && brew install mongodb-community
brew install mysql
brew services start mongodb-community
brew services start mysql
pip install mysql-connector-python pymongo tabulate

Dataset Download & Preparation

# Create the data directory inside the project
mkdir -p nosql_etl_project/data
cd nosql_etl_project/data

# Download the NASA log files
curl -O https://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
curl -O https://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz

# Decompress (the ONLY allowed preprocessing step)
gunzip NASA_access_log_Jul95.gz
gunzip NASA_access_log_Aug95.gz

# Verify
ls -lh NASA_access_log_Jul95 NASA_access_log_Aug95

⚠️ Do not edit, filter, convert, or modify the raw log files. Decompression is the only permitted manual step.


Configuration

Edit config.py to set your credentials:

MYSQL_PASSWORD = "yourpassword"   # ← set your MySQL root password

All other defaults work for a standard local installation.


MySQL Schema Setup

cd nosql_etl_project
mysql -u root -p < db/schema.sql

This creates the nasa_etl database and all 5 required tables. The tool also calls init_schema() automatically at startup.


Running the Tool

Interactive Mode (recommended for demo)

cd nosql_etl_project
python main.py

You will see:

=============================================
    NASA Log ETL & Analytics Tool
=============================================

Select execution pipeline:
  1. MapReduce  (Hadoop Streaming)
  2. Apache Pig
  3. MongoDB
  4. Apache Hive

Enter choice (1-4): 3
Enter batch size [default 10000]: 10000

Non-Interactive Mode

# MapReduce pipeline
python main.py --pipeline mapreduce --batch-size 10000

# Apache Pig pipeline
python main.py --pipeline pig --batch-size 10000

# MongoDB pipeline (with automatic report after run)
python main.py --pipeline mongodb --batch-size 10000 --report

# Apache Hive pipeline
python main.py --pipeline hive --batch-size 10000

Viewing Reports

# Show report for the most recent run
python main.py --report

# Show report for a specific run
python main.py --report --run-id run_20250426_143200_mongodb

# List all past runs
python main.py --list-runs

# Compare two pipeline runs side-by-side
python main.py --compare run_20250426_143200_mapreduce run_20250426_150000_pig

Directory Structure

nosql_etl_project/
├── main.py                     ← Unified CLI entry point
├── config.py                   ← All configuration (credentials, paths, batch size)
├── data/                       ← Place NASA log files here
├── parser/log_parser.py        ← Shared parsing logic (single source of truth)
├── pipelines/
│   ├── mapreduce_pipeline.py
│   ├── pig_pipeline.py
│   ├── mongodb_pipeline.py
│   └── hive_pipeline.py
├── mapreduce/                  ← Hadoop Streaming mapper/reducer scripts
├── pig/                        ← Pig Latin scripts + Python UDF
├── hive/                       ← HiveQL DDL and query scripts
├── mongodb/                    ← MongoDB aggregation modules
├── db/schema.sql               ← MySQL schema
├── db/loader.py                ← MySQL bulk loader
├── reporting/report.py         ← Report display module
├── utils/                      ← Timer, BatchManager, run_id_generator
└── logs/                       ← Malformed record logs per run

Three Analytical Queries

Query Description Key Output Columns
Q1 Daily Traffic Summary log_date, status_code, request_count, total_bytes
Q2 Top 20 Requested Resources resource_path, request_count, total_bytes, distinct_host_count
Q3 Hourly Error Analysis (400–599) log_date, log_hour, error_request_count, total_request_count, error_rate, distinct_error_hosts

Common Errors & Solutions

FileNotFoundError: Cannot locate hadoop-streaming JAR

Set your HADOOP_HOME environment variable or:

export HADOOP_STREAMING_JAR=/path/to/hadoop-streaming-*.jar

MySQL connection failed

Check that MySQL is running and your password in config.py is correct:

brew services start mysql
mysql -u root -p

MongoDB connection failed

brew services start mongodb-community
mongosh --eval "db.adminCommand('ping')"

beeline: command not found

Ensure $HIVE_HOME/bin is in your PATH:

export PATH=$PATH:/opt/homebrew/opt/hive/bin
source ~/.zshrc

pig: command not found

export PATH=$PATH:/opt/homebrew/opt/pig/bin
source ~/.zshrc

HDFS not running

$HADOOP_HOME/sbin/start-dfs.sh
hdfs dfsadmin -report

Batching Behaviour

  • Batch unit: number of input log records (lines), not bytes.
  • Batch IDs: start at 1, increment sequentially.
  • Final batch: always counted as valid even if shorter than batch_size.
  • avg_batch_size: total_records / num_batches
  • Both log files (July + August) are concatenated into one continuous stream.
  • Same batch_size must be used across all pipelines for fair comparison.

Hive Batching Note

Hive processes entire HDFS files in a single MapReduce/Tez job. The BatchManager is used in the Hive pipeline only for pre-scanning (counting records and malformed lines). All results are written with a single batch_id=1. This design decision is documented in the compact report.


Runtime Measurement

Included in runtime Excluded from runtime
All parsing time Dataset download time
All batch processing Software installation time
All query execution Report rendering time
All MySQL write time MySQL schema creation
Initial HDFS file setup

Sample Output

════════════════════════════════════════════════════════════════════════════════
  NASA Log ETL & Analytics Tool — Run Report
════════════════════════════════════════════════════════════════════════════════
  Pipeline       : MONGODB
  Run ID         : run_20250426_143200_mongodb
  Started At     : 2025-04-26 14:32:00
  Runtime        : 45.230 seconds
  Batch Size     : 10,000
  Total Records  : 3,461,612
  Malformed      : 1,243
  Num Batches    : 347
  Avg Batch Size : 9,973.2
────────────────────────────────────────────────────────────────────────────────
  QUERY 1: Daily Traffic Summary
────────────────────────────────────────────────────────────────────────────────
╭────────────┬─────────────┬───────────────┬─────────────╮
│ log_date   │ status_code │ request_count │ total_bytes │
├────────────┼─────────────┼───────────────┼─────────────┤
│ 1995-07-01 │         200 │       152,341 │ 982,043,210 │
│ ...        │         ... │           ... │         ... │
╰────────────┴─────────────┴───────────────┴─────────────╯

Git Repository

cd nosql_etl_project
git init
git add .
git commit -m "Initial: complete multi-pipeline ETL tool"

DAS 839 — NoSQL Systems | End Semester Project | NASA HTTP Log Analytics

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors