# 🛠️ Apache Hadoop 3.3.6 Installation Guide (Ubuntu 20.04)

> ✅ Single-node (pseudo-distributed) Hadoop setup with **Java 8**, covering installation, configuration, and **real-world troubleshooting** (like Java 17 errors).

---

## 📋 System Requirements

| Component          | Requirement                                     |
|--------------------|-------------------------------------------------|
| OS                 | Ubuntu 20.04 (tested)                           |
| RAM                | 8 GB minimum                                    |
| Disk Space         | At least 10–20 GB (HDFS, logs, temp)            |
| Java Version       | ✅ Java 8 (**recommended**)                      |
| Hadoop Version     | Apache Hadoop 3.3.6                             |
| SSH                | Used by start/stop scripts to launch daemons    |

---

## 📦 Step 1: Install Dependencies

```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y ssh rsync curl wget openjdk-8-jdk
```

✅ **Why?**
- Hadoop runs on the JVM, so a JDK is required
- The `start-dfs.sh` and `start-yarn.sh` scripts use SSH to launch the daemons on each host, even when that host is localhost
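
Because the rest of this guide assumes Java 8, it can help to confirm which major version `java` resolves to after installing. The following is a minimal sketch: `java_major` is a hypothetical helper name, and it assumes the usual `version "..."` banner printed by `java -version`.

```bash
# java_major: print the major Java version found in a `java -version` banner.
# (Hypothetical helper for this guide, not part of Hadoop or the JDK.)
java_major() {
  local banner="$1" ver
  # Pull out the quoted version string, e.g. 1.8.0_392 or 17.0.9
  ver=$(printf '%s\n' "$banner" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) ver=${ver#1.}; echo "${ver%%.*}" ;;   # legacy scheme: 1.8.x -> 8
    *)   echo "${ver%%[.+-]*}" ;;              # modern scheme: 17.0.9 -> 17
  esac
}

# Example (java -version writes to stderr, hence 2>&1):
#   java_major "$(java -version 2>&1)"
```

If this prints something other than `8` and several JDKs are installed, `sudo update-alternatives --config java` lets you pick the default on Ubuntu.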

---

## 📁 Step 2: Download and Set Up Hadoop

```bash
HADOOP_VERSION=3.3.6
cd ~
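# If this URL returns 404, older releases are kept under https://archive.apache.org/dist/hadoop/common/
wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz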
tar -xzf hadoop-$HADOOP_VERSION.tar.gz
mv hadoop-$HADOOP_VERSION hadoop
```
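
It is worth checking the downloaded tarball against its published SHA-512 digest, ideally before extracting it. The sketch below assumes GNU `sha512sum` (standard on Ubuntu) and that you have copied the expected digest from the `.sha512` file published alongside the tarball. `verify_sha512` is a hypothetical helper name.

```bash
# verify_sha512 FILE EXPECTED_HEX
# Succeeds (exit 0) only if FILE's SHA-512 digest equals EXPECTED_HEX.
# (Hypothetical helper; the expected digest must come from a trusted source.)
verify_sha512() {
  local file="$1" expected="$2" actual
  actual=$(sha512sum "$file" | awk '{ print $1 }')
  # sha512sum prints lowercase hex, so normalise the expected value too
  expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
  [ "$actual" = "$expected" ]
}

# Example:
#   verify_sha512 ~/hadoop-3.3.6.tar.gz "<digest from the .sha512 file>" && echo "checksum OK"
```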

---

## ⚙️ Step 3: Set Environment Variables

Add this to `~/.bashrc`:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Then reload:

```bash
source ~/.bashrc
```
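
A quick sanity check after reloading can catch a mistyped path before it surfaces as a confusing daemon error. This is a sketch only: `check_hadoop_env` is a hypothetical helper, and it assumes the standard layout where `bin/java` and `bin/hadoop` sit under the two home directories.

```bash
# check_hadoop_env JAVA_HOME HADOOP_HOME
# Reports any missing executable and returns non-zero if one is missing.
# (Hypothetical helper, not part of Hadoop.)
check_hadoop_env() {
  local java_home="$1" hadoop_home="$2" status=0
  [ -x "$java_home/bin/java" ] ||
    { echo "no executable at $java_home/bin/java"; status=1; }
  [ -x "$hadoop_home/bin/hadoop" ] ||
    { echo "no executable at $hadoop_home/bin/hadoop"; status=1; }
  return "$status"
}

# Example:
#   check_hadoop_env "$JAVA_HOME" "$HADOOP_HOME" && hadoop version
```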

---

## 🧠 Step 4: Configure Hadoop

### Edit `hadoop-env.sh`

```bash
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```

Set:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

### Edit `core-site.xml`

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
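  <!-- Replace /home/neosoft with your own home directory; it must match the directories created in Step 5 -->
  <property>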
    <name>hadoop.tmp.dir</name>
    <value>/home/neosoft/hadoop_data/tmp</value>
  </property>
</configuration>
```

### Edit `hdfs-site.xml`

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
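  <!-- Replace /home/neosoft with your own home directory; it must match the directories created in Step 5 -->
  <property>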
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/neosoft/hadoop_data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/neosoft/hadoop_data/dfs/datanode</value>
  </property>
</configuration>
```

### Edit `mapred-site.xml`

```bash
# Hadoop 3.x already ships mapred-site.xml; the .template copy step only applied to Hadoop 2.x
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
```

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

### Edit `yarn-site.xml`

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```
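
After editing four XML files by hand, it is easy to leave a typo in a property name or value. The sketch below reads a value back out of a site file so you can eyeball it. `site_prop` is a hypothetical helper, and it only handles the simple layout used in this guide (each `<name>` followed by its `<value>` on a single line). It is not a real XML parser. Once Hadoop is on your `PATH`, `hdfs getconf -confKey fs.defaultFS` reports what Hadoop itself resolves.

```bash
# site_prop FILE KEY
# Prints the <value> that follows <name>KEY</name> in a Hadoop *-site.xml file.
# (Hypothetical helper; assumes one <name>/<value> element per line.)
site_prop() {
  local file="$1" key="$2"
  awk -v key="$key" '
    index($0, "<name>" key "</name>") { found = 1 }
    found && match($0, /<value>.*<\/value>/) {
      # Skip the 7-char <value> tag and drop the 8-char </value> tag
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}

# Example:
#   site_prop "$HADOOP_CONF_DIR/core-site.xml" fs.defaultFS
```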

---

## 📂 Step 5: Create Local Hadoop Directories

```bash
mkdir -p ~/hadoop_data/tmp
mkdir -p ~/hadoop_data/dfs/namenode
mkdir -p ~/hadoop_data/dfs/datanode
```

---

## 🔐 Step 6: Set Up Passwordless SSH

```bash
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
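ssh localhost   # should log in without a password (accept the host key on first use)
exit            # leave the nested SSH session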
```
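
Re-running the `cat ... >> authorized_keys` line appends the same key again each time. A small idempotent variant is sketched below. `authorize_key` is a hypothetical helper name, and the file paths in the example are the ones used in this step.

```bash
# authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Appends the public key only if that exact line is not already present.
# (Hypothetical helper, not part of OpenSSH.)
authorize_key() {
  local pub_file="$1" auth_file="$2"
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # -q quiet, -x whole-line match, -F fixed string (no regex)
  grep -qxF -- "$(cat "$pub_file")" "$auth_file" || cat "$pub_file" >> "$auth_file"
  chmod 600 "$auth_file"
}

# Example:
#   [ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   # BatchMode disables password prompts, so this fails fast if key login is broken
#   # (run a plain `ssh localhost` once first to accept the host key):
#   ssh -o BatchMode=yes localhost true && echo "passwordless SSH works"
```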

---

## 🧹 Step 7: Format the NameNode

```bash
# Run this once only: reformatting wipes the existing HDFS metadata
hdfs namenode -format
```
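
To avoid reformatting by accident when repeating these steps, you can guard the command. The sketch below assumes that a formatted NameNode directory contains a `current/VERSION` file, and that the directory is the `dfs.namenode.name.dir` configured in Step 4. `namenode_formatted` is a hypothetical helper name.

```bash
# namenode_formatted NAME_DIR
# Succeeds if NAME_DIR looks like an already-formatted NameNode directory.
# (Hypothetical helper; relies on the current/VERSION file written by -format.)
namenode_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Example:
#   if namenode_formatted ~/hadoop_data/dfs/namenode; then
#     echo "NameNode already formatted; skipping"
#   else
#     hdfs namenode -format
#   fi
```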

---

## 🚀 Step 8: Start Hadoop Daemons

```bash
start-dfs.sh
start-yarn.sh
```

---

## ✅ Step 9: Verify Services

```bash
jps
```

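The output should list these five daemons (each line is prefixed with a process ID, and `Jps` itself also appears):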
```
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
```

UI Access:
- NameNode: [http://localhost:9870](http://localhost:9870)
- YARN: [http://localhost:8088](http://localhost:8088)
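
Checking the `jps` list by eye gets tedious after a few restarts. A sketch of an automated check follows. `missing_daemons` is a hypothetical helper, and it assumes the `PID Name` line format that `jps` prints by default.

```bash
# missing_daemons JPS_OUTPUT
# Prints the name of every expected daemon that is absent from the jps output.
# (Hypothetical helper; prints nothing when all five daemons are running.)
missing_daemons() {
  local jps_output="$1" name
  for name in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Exact match on the second column, so SecondaryNameNode does not count as NameNode
    printf '%s\n' "$jps_output" |
      awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }' ||
      echo "$name"
  done
}

# Example:
#   missing_daemons "$(jps)"
```

If a daemon is reported missing, its log file under `$HADOOP_HOME/logs` is usually the first place to look.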

---

## 📁 Step 10: Upload & Read Files from HDFS

```bash
echo "Hello Hadoop" > ~/test.txt
hdfs dfs -mkdir /input
hdfs dfs -put ~/test.txt /input/
hdfs dfs -cat /input/test.txt
```
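
The commands above fail on a second run because `/input` and the file already exist. A re-runnable variant that also confirms the bytes read back match the local file is sketched below. `hdfs_roundtrip` is a hypothetical helper name. It relies on the `-p` flag of `hdfs dfs -mkdir` and the `-f` (overwrite) flag of `hdfs dfs -put`.

```bash
# hdfs_roundtrip LOCAL_FILE HDFS_DIR
# Uploads LOCAL_FILE into HDFS_DIR and succeeds only if reading it back
# yields identical content. (Hypothetical helper, not part of Hadoop.)
hdfs_roundtrip() {
  local local_file="$1" hdfs_dir="$2" name
  name=$(basename "$local_file")
  hdfs dfs -mkdir -p "$hdfs_dir" &&
    hdfs dfs -put -f "$local_file" "$hdfs_dir/" &&
    hdfs dfs -cat "$hdfs_dir/$name" | cmp -s - "$local_file"
}

# Example:
#   hdfs_roundtrip ~/test.txt /input && echo "HDFS round trip OK"
```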

---

## 🛠️ Common Issues & Fixes

### ❌ **Problem:** ResourceManager Fails with Java 17

```text
Caused by: java.lang.reflect.InaccessibleObjectException:
module java.base does not "opens java.lang" to unnamed module
```

#### 🧠 Root Cause:
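- Java 17 strongly encapsulates JDK internals, and the `--illegal-access` escape hatch from earlier releases no longer works.
- Hadoop's dependencies (Google Guice, Jetty) use reflection on those internals, which is blocked unless the affected packages are opened with `--add-opens`.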

#### ✅ Fix:
Use **Java 8 instead**:

```bash
sudo apt install openjdk-8-jdk -y

# Set this line in $HADOOP_HOME/etc/hadoop/hadoop-env.sh (and in ~/.bashrc); running it in the shell alone is not enough
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Restart services
stop-yarn.sh
stop-dfs.sh
start-dfs.sh
start-yarn.sh
```

---

### ⚠️ Optional (Advanced): Using Java 17 with `--add-opens` (not stable)

In `hadoop-env.sh`:

```bash
export HADOOP_OPTS="${HADOOP_OPTS:-} --add-opens java.base/java.lang=ALL-UNNAMED"
```

> Not guaranteed: Hadoop may still crash in other libraries (Guice, Jetty, Jackson, etc.) that need further packages opened.

---

### ❌ Other Issues

| Symptom                                         | Likely Cause                       | Fix                                      |
|--------------------------------------------------|------------------------------------|-------------------------------------------|
| `ssh: connect to host localhost port 22: Connection refused` | SSH not installed or running      | `sudo apt install openssh-server`        |
| ResourceManager UI not opening                  | Java version conflict              | Switch to Java 8                         |
| File not found in HDFS                          | Incorrect path                     | `hdfs dfs -ls /` to verify paths         |
| `Permission denied` error in file operations    | Wrong ownership or directory perms | Run as the user that owns the path, or adjust it with `chown`/`chmod` (`hdfs dfs -chown`/`-chmod` for HDFS paths) |

---

## ✅ You're All Set!

You now have a single-node (pseudo-distributed) Hadoop 3.3.6 environment running on Ubuntu 20.04 with:

- Java 8 compatibility
- HDFS + YARN operational
- UI dashboards available
- CLI file operations working

🎉 Happy Hadooping!

