1. Write a Python program to read a Hadoop configuration file and display the core components of Hadoop.
2. Implement a Python function that calculates the total file size in a Hadoop Distributed File System (HDFS) directory.
3. Create a Python program that extracts and displays the top N most frequent words from a large text file using the MapReduce approach.
4. Write a Python script that checks the health status of the NameNode and DataNodes in a Hadoop cluster using Hadoop's REST API.
5. Develop a Python program that lists all the files and directories in a specific HDFS path.
6. Implement a Python program that analyzes the storage utilization of DataNodes in a Hadoop cluster and identifies the nodes with the highest and lowest storage capacities.
7. Create a Python script that interacts with YARN's ResourceManager API to submit a Hadoop job, monitor its progress, and retrieve the final output.
8. Create a Python script that interacts with YARN's ResourceManager API to submit a Hadoop job, set resource requirements, and track resource usage during job execution.
9. Write a Python program that compares the performance of a MapReduce job with different input split sizes, showcasing the impact on overall job execution time.

Python program to read a Hadoop configuration file and display the core components of Hadoop:


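import xml.etree.ElementTree as ET

def read_hadoop_config(file_path):
    # Hadoop configuration files such as hdfs-site.xml are XML documents made
    # of <property><name>...</name><value>...</value></property> entries.
    root = ET.parse(file_path).getroot()

    core_components = []

    for prop in root.iter('property'):
        name = (prop.findtext('name') or '').strip()
        # Keep the properties that configure the NameNode and the DataNodes.
        if name.startswith(('dfs.namenode', 'dfs.datanode')):
            core_components.append(name)

    return core_components

# Example usage
config_file_path = 'hdfs-site.xml'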
components = read_hadoop_config(config_file_path)
print("Core components of Hadoop:")
for component in components:
    print(component)


Python function to calculate the total file size in a Hadoop Distributed File System (HDFS) directory:

import subprocess

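def get_directory_size(directory_path):
    # Pass the arguments as a list so the path is never interpreted by a shell.
    command = ['hdfs', 'dfs', '-du', '-s', directory_path]
    output = subprocess.check_output(command).decode('utf-8')
    # The first column of `hdfs dfs -du -s` is the size in bytes.
    size = int(output.split()[0])
    return size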

# Example usage
directory_path = '/user/myuser/data'
total_size = get_directory_size(directory_path)
print(f"Total size of {directory_path}: {total_size} bytes")


Python program to extract and display the top N most frequent words from a large text file using the MapReduce approach:

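import subprocess

def get_top_words(file_path, n):
    # Run the WordCount MapReduce job; check=True raises if the job fails.
    command = ['hadoop', 'jar', 'wordcount.jar', 'WordCount', file_path, 'output']
    subprocess.run(command, check=True)

    # The job writes its results to HDFS, not to the local file system, so
    # read every reducer output file (part-r-00000, part-r-00001, ...).
    output = subprocess.check_output(
        ['hdfs', 'dfs', '-cat', 'output/part-r-*']).decode('utf-8')

    # Each output line has the form "<word>\t<count>".
    word_counts = [line.split('\t') for line in output.splitlines() if line.strip()]
    word_counts.sort(key=lambda x: int(x[1]), reverse=True)
    top_words = [word for word, count in word_counts[:n]]

    return top_words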

# Example usage
file_path = '/user/myuser/large_file.txt'
top_n = 10
top_words = get_top_words(file_path, top_n)
print(f"Top {top_n} frequent words:")
for word in top_words:
    print(word)
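
The script above depends on a prebuilt wordcount.jar and a running cluster. To illustrate the MapReduce approach itself, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases. The function names (map_phase, shuffle_phase, reduce_phase, top_n_words) and the sample data are made up for illustration and are not part of any Hadoop API; a real job would distribute these phases across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key (the word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the values of each key to get the word's total count.
    return {word: sum(counts) for word, counts in grouped.items()}

def top_n_words(lines, n):
    counts = reduce_phase(shuffle_phase(map_phase(lines)))
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]

# Example usage with a small in-memory sample
sample = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(top_n_words(sample, 2))  # [('the', 3), ('dog', 2)]
```

Because the reduce step only adds up values per key, the same three functions map onto a Hadoop Streaming mapper and reducer with little change.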


Python script to check the health status of the NameNode and DataNodes in a Hadoop cluster using Hadoop's REST API:

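import json
import requests

def check_cluster_health():
    # Port 50070 is the Hadoop 2.x NameNode web default (Hadoop 3.x uses 9870).
    status_url = 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'
    info_url = 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'

    # NameNodeStatus reports the NameNode state, for example "active" or "standby".
    namenode_health = requests.get(status_url).json()['beans'][0]['State']

    # NameNodeInfo lists the DataNodes the NameNode considers live or dead;
    # both attributes are JSON-encoded strings keyed by DataNode name.
    info = requests.get(info_url).json()['beans'][0]
    datanode_health = {name: 'live' for name in json.loads(info['LiveNodes'])}
    datanode_health.update({name: 'dead' for name in json.loads(info['DeadNodes'])})

    return namenode_health, datanode_health

# Example usage
nn_health, dn_health = check_cluster_health()
print(f"NameNode health: {nn_health}")
for node, state in dn_health.items():
    print(f"DataNode {node}: {state}")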


Python program to list all the files and directories in a specific HDFS path:

import subprocess

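def list_hdfs_path(path):
    # Pass the arguments as a list so the path is never interpreted by a shell.
    output = subprocess.check_output(['hdfs', 'dfs', '-ls', path]).decode('utf-8')

    files = []
    directories = []

    for line in output.splitlines():
        # Skip blank lines and the "Found N items" header, which is only
        # printed when the path is a non-empty directory.
        if not line.strip() or line.startswith('Found '):
            continue

        # Each entry has 8 columns: permissions, replication, owner, group,
        # size, date, time, and path. Limiting the split keeps paths that
        # contain spaces intact.
        line_parts = line.split(None, 7)
        file_type = line_parts[0][0]
        file_name = line_parts[-1]

        if file_type == 'd':
            directories.append(file_name)
        else:
            files.append(file_name)

    return files, directories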

# Example usage
hdfs_path = '/user/myuser/data'
files, directories = list_hdfs_path(hdfs_path)

print("Files:")
for file in files:
    print(file)

print("\nDirectories:")
for directory in directories:
    print(directory)


Python program to analyze the storage utilization of DataNodes in a Hadoop cluster and identify the nodes with the highest and lowest storage capacities:

import json
import requests

def analyze_datanode_storage():
    # A single DataNode only knows about its own volumes, so ask the NameNode:
    # the LiveNodes attribute of its NameNodeInfo bean is a JSON-encoded string
    # mapping each live DataNode to its storage statistics.
    # Port 50070 is the Hadoop 2.x default (Hadoop 3.x uses 9870).
    namenode_url = 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'

    response = requests.get(namenode_url).json()
    datanodes = json.loads(response['beans'][0]['LiveNodes'])

    storage_utilization = {}

    for node_name, stats in datanodes.items():
        used = stats['usedSpace']
        capacity = stats['capacity']
        # Skip nodes that report no capacity to avoid dividing by zero.
        if capacity > 0:
            storage_utilization[node_name] = (used / capacity) * 100

    max_utilization_node = max(storage_utilization, key=storage_utilization.get)
    min_utilization_node = min(storage_utilization, key=storage_utilization.get)

    return max_utilization_node, min_utilization_node

# Example usage
max_node, min_node = analyze_datanode_storage()
print(f"DataNode with highest storage utilization: {max_node}")
print(f"DataNode with lowest storage utilization: {min_node}")


Python script to interact with YARN's ResourceManager API to submit a Hadoop job, monitor its progress, and retrieve the final output: