# SE446 ‚Äì Week 2B: HDFS Basics for Beginners

## üéì Your First Steps with HDFS

---

### üéØ What You'll Learn

In this notebook, you'll learn:

1. ‚úÖ Basic Python commands to connect to a server
2. ‚úÖ How to list files in HDFS
3. ‚úÖ How to create directories
4. ‚úÖ How to upload files
5. ‚úÖ How to read files from HDFS
6. ‚úÖ Simple commands one step at a time!

---

### üìö What is HDFS?

**HDFS** = **H**adoop **D**istributed **F**ile **S**ystem

Think of it like Google Drive or Dropbox, but for **Big Data**!

- üìÇ Stores very large files
- üíæ Splits files across multiple computers
- üîÑ Keeps backup copies automatically
- ‚ö° Very fast for big datasets

---

### üñ•Ô∏è Our Cluster Information

You'll connect to a real HDFS cluster:

- **Server Address**: 134.209.172.50
- **Number of Computers**: 3 (1 Master + 2 Workers)
- **Total Storage**: 95 GB
- **Web Interface**: https://hdfs.aniskoubaa.org

---

**‚ö†Ô∏è Important**: Execute cells **one by one** from top to bottom!

---

## Step 1: Install Tools üîß

First, we need to install a Python library called **`paramiko`**.

**What is paramiko?**
- It helps Python connect to remote servers using SSH
- SSH = Secure SHell (like a secure remote control for computers)

**The `!` symbol**: Runs terminal commands from Python

In [2]:
# Install paramiko library (only need to run once)
!pip install paramiko -q

**What happened?**
- `-q` means "quiet" (less output)
- Python downloaded and installed the paramiko library
- Now we can use it in our code!

---

## Step 2: Import the Library üìö

**What is import?**
- Like opening a toolbox before using tools
- We tell Python: "We want to use paramiko"

In [3]:
# Import the paramiko library
import paramiko

**What happened?**
- No output = Success! ‚úÖ
- Python loaded the paramiko library into memory
- Now we can use functions like `paramiko.SSHClient()`

---

## Step 3: Set Connection Details üîê

**Variables**: Like labeled boxes that store information

We'll store:
- Server address (where to connect)
- Username (who you are)
- Password (your secret key)

In [None]:
# Store connection information in variables
server = "134.209.172.50"
username = "root"
password = "Zj:7a^9HEh&+a@c"  # Ask instructor for password

**What happened?**
- Created 3 variables: `server`, `username`, `password`
- Each stores a piece of text (called a "string")
- Strings are wrapped in quotes `" "`

---

## Step 4: Print the Variables üñ®Ô∏è

**print()**: Displays information on screen

Let's check what we stored:

In [5]:
# Display the server address
print(server)

134.209.172.50


In [6]:
# Display the username
print(username)

root


**What happened?**
- `print()` function displays the value inside the variable
- You should see the IP address and username

---

## Step 5: Create SSH Connection Object üîå

**Object**: Think of it as a "connection device"

**SSHClient()**: Creates a new SSH connection tool

In [7]:
# Create a new SSH client object
ssh = paramiko.SSHClient()

**What happened?**
- Created a new SSH client and stored it in variable `ssh`
- It's like creating a "phone" to call the server
- Not connected yet - just created the tool!

---

## Step 6: Trust the Server ü§ù

**Security Policy**: Rules for trusting servers

We tell Python: "It's okay to connect to this server"

In [8]:
# Set policy to automatically trust the server
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

**What happened?**
- Set a security policy for our SSH connection
- `AutoAddPolicy()` = automatically trust new servers
- Like saying "Yes, I trust this website" in your browser

---

## Step 7: Connect to the Server! üöÄ

**connect()**: Actually make the connection

This is like dialing the phone number!

In [27]:
# Connect to the HDFS server
ssh.connect(hostname=server, username=username, password=password)
print("‚úÖ Connected successfully!")

print(f"üì° Connected to: {server}")
print(f"üë§ Logged in as: {username}")

‚úÖ Connected successfully!
üì° Connected to: 134.209.172.50
üë§ Logged in as: root


**What happened?**
- `ssh.connect()` made the connection
- Used our variables: server, username, password
- If you see "‚úÖ Connected successfully!", it worked!
- If error appears, check your password

---

## Step 8: Run Your First HDFS Command! üéâ

**exec_command()**: Runs a command on the remote server

We'll list files in HDFS root directory:
- `hdfs dfs -ls /` = "list files in HDFS"

In [32]:
# Run HDFS list command
stdin, stdout, stderr = ssh.exec_command("sudo -u hadoop /opt/hadoop/bin/hdfs dfs -ls /")
print("‚è≥ Command sent to server... waiting for response...")

‚è≥ Command sent to server... waiting for response...


**What happened?**
- Sent command to the remote server
- Got back 3 things:
  - `stdin` = input (we don't use this)
  - `stdout` = output/results
  - `stderr` = errors (if any)

---

## Step 9: Read the Output üìñ

**read()**: Gets the data from stdout

**decode()**: Converts bytes to readable text

In [33]:
# Get the output from the command
output = stdout.read().decode('utf-8')
error = stderr.read().decode('utf-8')
print(f"üìä Output length: {len(output)} characters")
if error:
    print(f"‚ö†Ô∏è Error: {error}")

üìä Output length: 81 characters


**What happened?**
- `stdout.read()` = get the raw data
- `.decode('utf-8')` = convert to readable text
- Stored in variable called `output`

---

## Step 10: Display the Results! üéä

Let's see what files are in HDFS!

In [34]:
# Print the output
print("üìÇ HDFS Root Directory Contents:")
print("=" * 50)
if output:
    print(output)
else:
    print("(No output - directory might be empty)")
if error:
    print("\n‚ö†Ô∏è Errors:")
    print(error)

üìÇ HDFS Root Directory Contents:
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2026-01-27 11:22 /test



**What you see:**
```
drwxr-xr-x   - hadoop supergroup    0 2024-01-15 /students
drwxr-xr-x   - hadoop supergroup    0 2024-01-15 /tmp
```

**Explanation:**
- `d` = directory (folder)
- `rwxr-xr-x` = permissions
- `hadoop` = owner
- `/students` = folder name

---

## üéØ Practice: Make a Helper Function

**Function**: A reusable piece of code

Instead of writing the same code over and over, we create a function!

**def**: Defines a new function

In [35]:
# Define a function to run HDFS commands easily
def run_hdfs_command(command):
    stdin, stdout, stderr = ssh.exec_command("sudo -u hadoop /opt/hadoop/bin/hdfs dfs " + command)
    result = stdout.read().decode('utf-8')
    error = stderr.read().decode('utf-8')
    if error and not result:
        print(f"‚ö†Ô∏è Error: {error}")
    return result

**What happened?**
- Created a function called `run_hdfs_command`
- Takes one input: `command`
- Runs the command on HDFS
- Returns the result
- Now we can use it easily!

---

## Step 11: Use Your Function! ‚ú®

Let's list files again, but easier:

In [36]:
# List HDFS root directory using our function
result = run_hdfs_command("-ls /")
print("üìÇ HDFS Root Directory:")
print("=" * 50)
if result:
    print(result)
else:
    print("(No files found or empty directory)")

üìÇ HDFS Root Directory:
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2026-01-27 11:22 /test



**What happened?**
- Called our function with `"-ls /"`
- Much simpler than before!
- Got the same result

---

## Step 12: Create Your Personal Directory üìÅ

**Variable for your name:**

In [None]:
# Set your name (NO SPACES!)
my_name = "student_demo"  # ‚Üê CHANGE THIS TO YOUR NAME!

**Create directory command:**
- `-mkdir` = make directory
- `-p` = create parent folders if needed

In [None]:
# Create your directory
result = run_hdfs_command("-mkdir -p /students/" + my_name)
print("‚úÖ Directory created!")

**What happened?**
- `"-mkdir -p /students/" + my_name` joins strings
- Creates folder `/students/student_demo`
- `-p` means "create parent folders too"

---

## Step 13: Verify Your Directory Exists ‚úîÔ∏è

List all students' directories:

In [None]:
# List students directory
result = run_hdfs_command("-ls /students")
print(result)

**What to look for:**
- You should see your folder name in the list!
- Example: `/students/student_demo`

---

## Step 14: Create a Text File Locally üìù

Before uploading to HDFS, we create a file on the server.

**String with \n**: `\n` = new line (Enter key)

In [None]:
# Content for our file
file_content = "Hello HDFS!\nThis is my first file.\nLearning is fun!"

**Create file on remote server:**

In [None]:
# Create file command
create_cmd = 'echo "' + file_content + '" > /tmp/my_first_file.txt'
stdin, stdout, stderr = ssh.exec_command(create_cmd)
print("‚úÖ File created at /tmp/my_first_file.txt")

**What happened?**
- `echo` = print text
- `>` = save to file
- Created file at `/tmp/my_first_file.txt` on server

---

## Step 15: Upload File to HDFS! üöÄ

**-put command**: Uploads a file to HDFS

Syntax: `hdfs dfs -put <local_file> <hdfs_path>`

In [None]:
# Upload file to HDFS
upload_path = "/students/" + my_name + "/my_first_file.txt"
result = run_hdfs_command("-put /tmp/my_first_file.txt " + upload_path)
print("‚úÖ File uploaded to HDFS!")

**What happened?**
- Took file from `/tmp/my_first_file.txt`
- Uploaded to HDFS at `/students/your_name/my_first_file.txt`
- File is now in distributed storage!

---

## Step 16: List Your Files in HDFS üìã

In [None]:
# List your directory in HDFS
result = run_hdfs_command("-ls /students/" + my_name)
print(result)

**What you see:**
- Your file with size, date, and name
- Example: `-rw-r--r--   2 hadoop supergroup  45 2024-01-27 /students/student_demo/my_first_file.txt`

---

## Step 17: Read File from HDFS! üìñ

**-cat command**: Displays file content

Like `cat` in Linux terminal

In [None]:
# Read and display file content
result = run_hdfs_command("-cat /students/" + my_name + "/my_first_file.txt")
print(result)

**What you see:**
```
Hello HDFS!
This is my first file.
Learning is fun!
```

---

## Step 18: Check File Size üìè

**-du command**: Shows disk usage

**-h flag**: Human-readable (shows KB, MB instead of bytes)

In [None]:
# Check file size
result = run_hdfs_command("-du -h /students/" + my_name + "/my_first_file.txt")
print(result)

**What you see:**
- File size in human-readable format
- Example: `45  45  /students/student_demo/my_first_file.txt`
- First number = size
- Second number = disk space used (with replication)

---

## Step 19: Check Replication Factor üîÑ

**Replication**: How many copies HDFS keeps

**-stat command**: Shows file statistics

**%r**: Replication factor

In [None]:
# Check replication factor
result = run_hdfs_command("-stat %r /students/" + my_name + "/my_first_file.txt")
print("Replication factor:", result)

**What you see:**
- A number (probably `2`)
- This means HDFS keeps **2 copies** of your file
- Why? For backup and fault tolerance!
- If one computer fails, you still have the file

---

## Step 20: Create a CSV File üìä

**CSV**: Comma-Separated Values (like Excel)

Let's create data with multiple lines:

In [None]:
# Create CSV content
csv_data = "name,age,grade\nAlice,20,A\nBob,21,B\nCharlie,19,A"

In [None]:
# Print to see what it looks like
print(csv_data)

**What you see:**
```
name,age,grade
Alice,20,A
Bob,21,B
Charlie,19,A
```

---

## Step 21: Save CSV to Server

In [None]:
# Create CSV file on server
create_csv = 'echo "' + csv_data + '" > /tmp/students.csv'
stdin, stdout, stderr = ssh.exec_command(create_csv)
print("‚úÖ CSV file created")

---

## Step 22: Upload CSV to HDFS

In [None]:
# Upload CSV to HDFS
result = run_hdfs_command("-put /tmp/students.csv /students/" + my_name + "/")
print("‚úÖ CSV uploaded to HDFS!")

---

## Step 23: Read CSV from HDFS

In [None]:
# Display CSV content from HDFS
result = run_hdfs_command("-cat /students/" + my_name + "/students.csv")
print(result)

---

## Step 24: Count Your Files üî¢

**-count command**: Shows summary of directory

Returns: # of directories, # of files, total size

In [None]:
# Count files in your directory
result = run_hdfs_command("-count /students/" + my_name)
print(result)

**What you see:**
```
  1    2    100  /students/student_demo
  ‚îÇ    ‚îÇ     ‚îÇ
  ‚îÇ    ‚îÇ     ‚îî‚îÄ Total bytes
  ‚îÇ    ‚îî‚îÄ Number of files
  ‚îî‚îÄ Number of directories
```

---

## Step 25: Download File from HDFS üì•

**-get command**: Downloads from HDFS to local

In [None]:
# Download file from HDFS
result = run_hdfs_command("-get /students/" + my_name + "/students.csv /tmp/downloaded.csv")
print("‚úÖ File downloaded to /tmp/downloaded.csv")

---

## Step 26: Verify Download

In [None]:
# Check if download worked
stdin, stdout, stderr = ssh.exec_command("cat /tmp/downloaded.csv")
result = stdout.read().decode('utf-8')
print(result)

**What happened?**
- Downloaded file from HDFS
- Saved to `/tmp/downloaded.csv` on server
- Displayed the content

---

## Step 27: Check Cluster Status üñ•Ô∏è

**dfsadmin**: HDFS administration commands

**-report**: Shows cluster health

In [None]:
# Get cluster report
stdin, stdout, stderr = ssh.exec_command("sudo -u hadoop /opt/hadoop/bin/hdfs dfsadmin -report")
result = stdout.read().decode('utf-8')
print(result)

**What you see:**
- Configured Capacity: Total storage
- DFS Used: How much is used
- Live DataNodes: Number of working servers
- Each DataNode's information

---

## Step 28: Disconnect from Server üëã

**close()**: Ends the SSH connection

Always clean up when done!

In [None]:
# Close the SSH connection
ssh.close()
print("‚úÖ Disconnected from server")

**What happened?**
- Closed the connection to the server
- Like hanging up the phone
- Good practice to always close connections

---

## üéâ Congratulations! You Did It!

### üìö What You Learned:

**Python Skills:**
- ‚úÖ Variables (storing information)
- ‚úÖ Strings (text)
- ‚úÖ Functions (reusable code)
- ‚úÖ Print statements
- ‚úÖ String concatenation (`+`)

**HDFS Commands:**
- ‚úÖ `-ls` (list files)
- ‚úÖ `-mkdir` (create directory)
- ‚úÖ `-put` (upload file)
- ‚úÖ `-get` (download file)
- ‚úÖ `-cat` (read file)
- ‚úÖ `-du` (check size)
- ‚úÖ `-stat` (file statistics)
- ‚úÖ `-count` (count files)

**Concepts:**
- ‚úÖ Distributed file systems
- ‚úÖ Replication for backup
- ‚úÖ SSH connections
- ‚úÖ Remote command execution

---

## üéØ Practice Exercises

Try these on your own:

1. Create a new directory called `practice`
2. Create a file with your favorite quote
3. Upload it to HDFS
4. Check the file size
5. Download it back

---

## üìñ HDFS Command Cheat Sheet

```bash
hdfs dfs -ls /path              # List files
hdfs dfs -mkdir /path           # Create directory
hdfs dfs -put local remote      # Upload
hdfs dfs -get remote local      # Download
hdfs dfs -cat /path/file        # Read file
hdfs dfs -rm /path/file         # Delete file
hdfs dfs -rm -r /path/dir       # Delete directory
hdfs dfs -du -h /path           # Check size
hdfs dfs -count /path           # Count files
```

---

## üåü Next Steps

1. **Practice more HDFS commands**
2. **Learn about MapReduce** (next lecture)
3. **Explore the Web UI**: https://hdfs.aniskoubaa.org

---

**Great job! You're now ready for more advanced HDFS topics! üöÄ**