## Preparations

### Devices
Raspberry Pi 3 Model B `*` 1  
8GB Micro SD card `*` 1

### Software
Latest version of [RASPBIAN STRETCH LITE](https://www.raspberrypi.org/downloads/raspbian/)   
SD burning tool [Etcher](https://etcher.io/)  
Apache Hadoop [hadoop-2.8.1](http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz)  
Apache Hive [hive-2.3.1](http://www-eu.apache.org/dist/hive/hive-2.3.1/apache-hive-2.3.1-bin.tar.gz)  
Apache Spark [spark-2.2.0](http://www-eu.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz)  

## Bootstrapping the Pis

1. Burning the RASPBIAN STRETCH LITE image onto the Micro SD card with Etcher.  
2. Writing an empty ssh file and a Wi-Fi configuration file into the boot directory.  
>```shell
touch /Volumes/boot/ssh
```
```shell
vi /Volumes/boot/wpa_supplicant.conf  
```
```shell
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
        ssid="<network name>"
        psk="<password>"
}
```

3. Plugging the power cable into the Raspberry Pi, then finding its IP address in the router's ARP table. The Raspberry Pi's MAC address starts with "B8-27-EB", a prefix registered to the Raspberry Pi Foundation.
>```
B8-27-EB-5B-C7-9C	192.168.1.108
```
4. SSHing to the Raspberry Pi. By default the user is pi and the password is raspberry.  
>```shell
ssh pi@192.168.1.108
```
5. Adding public key to the Raspberry Pi.
>```shell
#local client
ssh-keygen
cat /Users/user/.ssh/id_rsa.pub
#on Raspberry Pi, add the public key
vi .ssh/authorized_keys
```
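
For step 3, picking the Raspberry Pi out of a long ARP table can be scripted. The following is a minimal sketch, assuming each line of the table has the form `<MAC> <IP>` as in the sample line above; real router exports vary, and `find_pi_ips` is a hypothetical helper name.

```shell
# Print the IP address of every ARP entry whose MAC address starts with the
# B8-27-EB prefix (dash or colon separated, any letter case).
# Assumption: each input line is "<MAC> <IP>", separated by whitespace.
find_pi_ips() {
    awk 'toupper($1) ~ /^B8[-:]27[-:]EB[-:]/ { print $2 }'
}

# Demo with two made-up entries; only the first has the Raspberry Pi prefix.
printf 'B8-27-EB-5B-C7-9C\t192.168.1.108\n00-11-22-33-44-55\t192.168.1.1\n' | find_pi_ips
```
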

### Note
By default the username to SSH in is pi and the password is raspberry. When you SSH in you'll be given the following message:
>```
SSH is enabled and the default password for the 'pi' user has not been changed.
This is a security risk - please login as the 'pi' user and type 'passwd' to set a new password.
```

This will cause issues with rsync where you'll get errors like the following:  
>```
protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(178) [sender=3.1.2]
```

So if you don't change the default password you'll need to remove the warning script to stop that message disturbing rsync.  
>```shell
sudo rm /etc/profile.d/sshpwd.sh
```
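
The rsync error above appears when the remote shell prints anything for a non-interactive command. A rough way to test for that is to run a command that should be silent and see whether any output comes back. The sketch below assumes key-based login already works; `produces_no_output` is a hypothetical helper, demonstrated here with local commands only.

```shell
# Succeeds (exit status 0) only if the given command writes no visible text
# to standard output. Standard error is ignored because rsync's protocol
# stream travels over standard output.
produces_no_output() {
    out=$("$@" 2>/dev/null)
    [ -z "$out" ]
}

# Local demo: `true` is silent, `echo` is not.
produces_no_output true && echo "clean"
produces_no_output echo "banner text" || echo "not clean"
```

Against a Pi, the same check would be `produces_no_output ssh pi@192.168.1.108 /bin/true`, run for whichever account rsync connects as.
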

## Installing HDFS, Hive & Spark

1. Copying ExpressVPN to the Raspberry Pi with scp and installing it:
>```shell
#on local client
scp expressvpn_1.2.0_armhf.deb pi@192.168.1.108:./
```
```shell
#on Raspberry Pi
sudo dpkg -i ./expressvpn_1.2.0_armhf.deb
#activate device
expressvpn activate
#connect
expressvpn connect hk1
```
2. Installing prerequisite packages:
This is the list of prerequisite packages I installed. iotop and nethogs are for telemetry and are optional. mysql-server is only used on the master server, so you'll save yourself some memory and CPU cycles by not installing it on the slave servers. If the storage devices attached via the USB ports are formatted with the exFAT file system, which Raspbian doesn't support out of the box, exfat-fuse and exfat-utils are needed in order to interact with them.
>```shell
sudo apt-get update
sudo apt-get install \
    exfat-fuse \
    exfat-utils \
    iotop \
    mysql-server \
    nethogs \
    oracle-java8-jdk
```
3. Creating a user for Hive in MySQL / MariaDB.
>```shell
sudo su
mysql -uroot
```
```SQL
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
exit
```
4. Hadoop needs to refer to other nodes in the cluster by hostname so I'll add them to the hosts file on all devices.
>```shell
sudo vi /etc/hosts
```
```
192.168.1.108 r1
```
5. To ease memory pressure I'll expand the 100 MB SWAP file to 2,000 MB by changing the CONF_SWAPSIZE setting in /etc/dphys-swapfile on all devices as well.
>```shell
sudo vi /etc/dphys-swapfile
```
```
CONF_SWAPSIZE=2000
```
Restarting each of the devices so they pick up the swap file change:
>```shell
sudo reboot
```
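
With several devices to change, the swap size edit can be scripted instead of done by hand in vi. This is a minimal sketch, not a tested provisioning script: `set_swapsize` is a hypothetical helper, and the demo works on a scratch file rather than the real /etc/dphys-swapfile.

```shell
# Set CONF_SWAPSIZE in a dphys-swapfile style config file.
# Usage: set_swapsize <file> <size in MB>
set_swapsize() {
    file=$1
    size=$2
    tmp=$(mktemp) || return 1
    if grep -q '^CONF_SWAPSIZE=' "$file"; then
        # Replace the existing setting.
        sed "s/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=$size/" "$file" > "$tmp"
    else
        # No active setting yet: append one.
        { cat "$file"; echo "CONF_SWAPSIZE=$size"; } > "$tmp"
    fi
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch copy.
demo=$(mktemp)
echo 'CONF_SWAPSIZE=100' > "$demo"
set_swapsize "$demo" 2000
cat "$demo"
rm -f "$demo"
```

On a device the function would have to run as root, since /etc/dphys-swapfile is not writable by the pi user.
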
6. By default Hadoop uses the root account to SSH onto each of the nodes in the cluster. I'll create SSH keys to make sure this is a password-less process. On r1 I'll generate a new key pair and add it to the authorized keys of r1's root account.
>```shell
sudo su
ssh-keygen
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
```
**I don't have an SSH password for the root account on slave servers so I'll copy it to the authorized_keys file for the pi user on those devices.**
>```shell
ssh-copy-id pi@r2
ssh-copy-id pi@r3
exit
```
**Then on slave servers I'll bootstrap the .ssh folder for the root accounts on those machines and copy the authorized_keys file from the pi user's .ssh folder so the root user can accept it as well.**
>```shell
sudo su
ssh-keygen
cp /home/pi/.ssh/authorized_keys \
   /root/.ssh/authorized_keys
exit
```
7. There are settings that will be used by HDFS, Hive and Spark and by both the root and the pi user accounts. To centralise these settings I've stored them in /etc/profile and created a symbolic link from /root/.bashrc to this file as well. That way all users will have the same settings and they can be centrally managed on each device.
>```shell
sudo vi /etc/profile
```
```shell
#add below:
export HADOOP_HOME=/opt/hadoop
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:/opt/hive/bin:/opt/spark/bin:/opt/pig/bin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
export SPARK_MASTER_HOST=r1
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre
export ZOO_HOME=/opt/zookeeper
export KAFKA_HOME=/opt/kafka
export PIG_HOME=/opt/pig
export HBASE_HOME=/opt/hbase
export PIG_CLASSPATH=/opt/hadoop/etc/hadoop
```
```shell
sudo ln -sf /etc/profile /root/.bashrc
source /etc/profile
```
On r1 I'll create the folders used by the various Hadoop tools used in this benchmark.
>```shell
sudo mkdir -p /opt/{hadoop,hdfs/{datanode,namenode},hive,spark,kafka}
```
**On all servers, the USB-attached storage is represented by /dev/sda1. I'll mount it to /mnt/usb.**
>```shell
sudo mkdir -p /mnt/usb
sudo mount /dev/sda1 /mnt/usb
```
**Creating the application folders and the two data node folders HDFS will use for heterogeneous storage.**
>```shell
sudo mkdir -p /opt/{hadoop,hdfs/datanode,spark,kafka} /mnt/usb/hdfs/datanode
```
8. SCPing the software packages from the client to the Raspberry Pi master server
>```shell
#on the client
scp spark-2.2.0-bin-hadoop2.7.tgz pi@192.168.1.108:./
scp hadoop-2.8.1.tar.gz pi@192.168.1.108:./
scp apache-hive-2.3.1-bin.tar.gz pi@192.168.1.108:./
```
Or downloading them directly onto the Raspberry Pi:
>```shell
DIST=http://www-eu.apache.org/dist
# Keep the archives' original names so the tar commands that follow can find them.
wget -c $DIST/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
wget -c $DIST/hive/hive-2.3.1/apache-hive-2.3.1-bin.tar.gz
wget -c $DIST/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
```
Hadoop is 405 MB in size when compressed, Hive is 221 MB and Spark is 194 MB. Hadoop expands to 2.1 GB but 1.9 GB of that is documentation so I'll exclude the docs from the extraction.
>```shell
sudo tar xvf hadoop-2.8.1.tar.gz \
    --directory=/opt/hadoop \
    --exclude=hadoop-2.8.1/share/doc \
    --strip 1
```
Hive is 172 MB decompressed but 102 MB of that is unit tests so I'll exclude those from extraction.
>```shell
sudo tar xvf apache-hive-2.3.1-bin.tar.gz \
    --directory=/opt/hive \
    --exclude=apache-hive-2.3.1-bin/ql/src/test \
    --strip 1
```
The following will extract Spark to its installation folder.
>```shell
sudo tar xzvf spark-2.2.0-bin-hadoop2.7.tgz \
  --directory=/opt/spark \
  --strip 1
```
**I'll specify the master and slaves for the HDFS cluster. r1 will serve as both a master and a slave so that all the Raspberry Pis will be busy when processing workloads.**
>```shell
sudo vi /opt/hadoop/etc/hadoop/master
```
```
r1
```
```shell
sudo vi /opt/hadoop/etc/hadoop/slaves
```
```
r1
r2
r3
```
**I'll then create two files with configuration overrides needed for this HDFS cluster. I'll be setting a default replication factor of 3 for all the files stored on HDFS so that they're copied onto each machine in full. There are multiple storage folders used on r2 and r3 and to avoid filling the Micro SD card used by both HDFS and the OS I've set a limit of 3 GB that must be available before HDFS writes any blocks to a partition.**
>```shell
sudo vi /opt/hadoop/etc/hadoop/core-site.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://r1:9000/</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://r1:9000/</value>
    </property>
</configuration>
```
```shell
sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hdfs/datanode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hdfs/namenode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>r1:50070</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.datanode.du.reserved</name>
        <value>3221225472</value>
    </property>
</configuration>
```
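
The value 3221225472 used for dfs.datanode.du.reserved is 3 GiB expressed in bytes. A small sketch of the conversion, assuming the shell does 64-bit arithmetic; `gib_to_bytes` is a hypothetical helper name.

```shell
# Convert a whole number of GiB to bytes (1 GiB = 1024 * 1024 * 1024 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 3   # prints 3221225472
```
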
9. **Syncing Hadoop's binaries and configuration onto the slave servers.**
>```shell
for SERVER in r2 r3
do
    sudo rsync --archive \
               --one-file-system \
               --partial \
               --progress \
               --compress \
               /opt/hadoop/ $SERVER:/opt/hadoop/
done
```
On r2 and r3 I'll adjust the HDFS configuration to include both storage folders.
>```shell
sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
```
```xml
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/usb/hdfs/datanode,/opt/hdfs/datanode</value>
    <final>true</final>
</property>
```
10. At this point I'll need to load an interactive root shell in order to run three commands.
>```shell
sudo su
```
The first command will format the HDFS name node.
>```shell
hdfs namenode -format
```
The next will launch HDFS across the whole cluster. This command will SSH as the root user into each device.
>```shell
start-dfs.sh
#stop-dfs.sh
```
The third command sets permissive access for the pi user on HDFS.
>```shell
hdfs dfs -chown pi /
```
11. Once that's all done I can check the capacity available across the cluster. The first line of output is the aggregate of each of the devices. The remaining lines are the amount of capacity on each respective device.
>```shell
hdfs dfsadmin -report | grep 'Configured Capacity'
```
```
Configured Capacity: 314337058816 (292.75 GB)
Configured Capacity: 125850886144 (117.21 GB)
Configured Capacity: 94243086336 (87.77 GB)
Configured Capacity: 94243086336 (87.77 GB)
```
```shell
exit
```
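
The aggregate on the first line should equal the sum of the per-device lines. The sketch below checks that from the report text; `capacity_totals` is a hypothetical helper, and it assumes the first `Configured Capacity` line is the aggregate, as described above.

```shell
# Read `hdfs dfsadmin -report` output on stdin and print two numbers:
# the aggregate capacity (first matching line) and the sum of the per-node
# capacities (remaining matching lines). The two should be equal.
capacity_totals() {
    awk -F'[: ]+' '/^Configured Capacity/ {
        if (seen++ == 0) total = $3 + 0
        else             sum += $3
    }
    END { printf "%.0f %.0f\n", total, sum }'
}

# Demo using the figures from the report above.
capacity_totals <<'EOF'
Configured Capacity: 314337058816 (292.75 GB)
Configured Capacity: 125850886144 (117.21 GB)
Configured Capacity: 94243086336 (87.77 GB)
Configured Capacity: 94243086336 (87.77 GB)
EOF
```

On the cluster this would be run as `hdfs dfsadmin -report | capacity_totals`.
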
12. The following will configure Hive to use MySQL / MariaDB to store its metadata. This only needs to happen on the master server.
>```shell
sudo vi /opt/hive/conf/hive-site.xml
```
```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.fixedDatastore</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateTables</name>
        <value>True</value>
    </property>
</configuration>
```
Downloading the MySQL / MariaDB connector for Hive to use.
>```shell
sudo wget -c http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar \
    -P /opt/hive/lib/
```
Initialising the schema and launching the Hive Metastore.
>```shell
sudo su
schematool -dbType mysql -initSchema
```
```shell
hive --service metastore &
```
Spark will need to know of Hive's configuration settings so I'll link the configuration file into Spark's configuration folder.
>```shell
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark/conf/hive-site.xml
```
Spark will also need to use the same MySQL / MariaDB connector.
>```shell
sudo ln -s /opt/hive/lib/mysql-connector-java-5.1.28.jar \
             /opt/spark/jars/mysql-connector-java-5.1.28.jar
```
13. When you launch pyspark, spark-submit or spark-sql the Spark libraries from the master node are copied onto HDFS and shared amongst the worker nodes. Reading 200 MB off of the Micro SD card every time one of these applications launches adds a lot of delay so I'll package up these libraries, upload them to HDFS and in the Spark configuration I'll make sure the cached jar of libraries is used instead.
>```shell
jar cv0f ~/spark-libs.jar -C /opt/spark/jars/ .
hdfs dfs -mkdir /spark-libs
hdfs dfs -put ~/spark-libs.jar /spark-libs/
sudo vi /opt/spark/conf/spark-defaults.conf
```
```
spark.master spark://r1:7077
spark.yarn.preserve.staging.files true
spark.yarn.archive hdfs:///spark-libs/spark-libs.jar
```
I found a 650 MB memory limit on the various Spark components allowed everything to work without complaining.
>```shell
sudo vi /opt/spark/conf/spark-env.sh
```
```
SPARK_EXECUTOR_MEMORY=650m
SPARK_DRIVER_MEMORY=650m
SPARK_WORKER_MEMORY=650m
SPARK_DAEMON_MEMORY=650m
```
**Spark jobs will run on all Raspberry Pis.**
>```shell
sudo vi /opt/spark/conf/slaves
```
```
r1
r2
r3
```
**With that done I'll distribute Spark and its configuration to the other nodes.**
>```shell
for SERVER in r2 r3
do
    sudo rsync --archive \
               --one-file-system \
               --partial \
               --progress \
               --compress \
               --exclude /logs \
               /opt/spark/ $SERVER:/opt/spark/
done
```
14. To save memory I didn't launch Spark until after I had populated all the data onto HDFS, but it makes sense to mention the launch commands here. They are as follows:
>```shell
sudo /opt/spark/sbin/start-master.sh
#sudo /opt/spark/sbin/stop-master.sh
sudo /opt/spark/sbin/start-slaves.sh
#sudo /opt/spark/sbin/stop-slaves.sh
```

### Issue
Launching spark-sql fails:
>```shell
spark-sql \
    --master spark://r1:7077 \
    --num-executors 1
```
*Failed to connect to master r1:7077*  
Checking the port status:
```shell
ssh -v -p 7077 r1
```
*OpenSSH_7.4p1 Raspbian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017  
debug1: Reading configuration data /etc/ssh/ssh_config  
debug1: /etc/ssh/ssh_config line 19: Applying options for **  
debug1: Connecting to r1 [192.168.1.108] port 7077.  
debug1: connect to address 192.168.1.108 port 7077: Connection refused  
ssh: connect to host r1 port 7077: Connection refused*  
Checking which addresses are listening:
```shell
netstat -pln
```
127.0.0.1:7077 is open, but 192.168.1.108:7077 is not.  
To fix this I redid the slaves step, this time listing r2 and r3 after r1. My guess at the mechanism: if the master server is the only slave server listed, Spark only listens on 127.0.0.1:7077 (local scope). If the configuration file also lists remote slave servers, then Spark listens on 192.168.1.108:7077 (cluster scope).
```shell
sudo vi /opt/spark/conf/slaves
r1
r2
r3
```
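
Scanning the full netstat listing by eye is error-prone, so the check above can be narrowed to a single port. A minimal sketch, assuming the usual Linux `netstat -ln` column layout (local address in column 4, state in column 6); `listening_on` is a hypothetical helper and the demo line is hand-written, not captured from the cluster.

```shell
# Read `netstat -ln` style output on stdin and print the local addresses
# that are in LISTEN state on the given TCP port.
# Usage: netstat -ln | listening_on <port>
listening_on() {
    awk -v port="$1" '$1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") { print $4 }'
}

# Demo: a master bound to loopback only.
printf 'tcp  0  0 127.0.0.1:7077  0.0.0.0:*  LISTEN  -\n' | listening_on 7077
```

If the only address printed for port 7077 is 127.0.0.1:7077, workers on other machines cannot reach the master.
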

### Issue: Hive metastore schema mismatch after the first restart
*Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Hive Schema version 2.3.0 does not match metastore's schema version 1.2.0 Metastore is not upgraded or corrupt*

The following steps fix this issue:
```shell
cd /opt/hive/scripts/metastore/upgrade/mysql

mysql --verbose

use metastore;
source upgrade-1.2.0-to-2.0.0.mysql.sql
source upgrade-2.0.0-to-2.1.0.mysql.sql
source upgrade-2.1.0-to-2.2.0.mysql.sql
source upgrade-2.2.0-to-2.3.0.mysql.sql
exit
#schematool -dbType mysql -upgradeSchemaFrom 1.2.0
```
Then launching the metastore again:
>`hive --service metastore &`
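
The upgrade scripts have to be applied in version order, one hop at a time. The sketch below generates the `source` lines for a chain of versions, following the `upgrade-<from>-to-<to>.mysql.sql` naming seen above; `upgrade_chain` is a hypothetical helper, and the lines it prints are meant to be pasted into the mysql prompt after `use metastore;`.

```shell
# Print one `source upgrade-<from>-to-<to>.mysql.sql` line for each pair of
# consecutive versions given as arguments.
upgrade_chain() {
    prev=$1
    shift
    for next in "$@"; do
        echo "source upgrade-$prev-to-$next.mysql.sql"
        prev=$next
    done
}

upgrade_chain 1.2.0 2.0.0 2.1.0 2.2.0 2.3.0
```
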

## Installing ZooKeeper
>```shell
#on client:  
scp zookeeper-3.4.11.tar.gz pi@192.168.1.108:./  
#on master server:
sudo mkdir -p /opt/zookeeper
sudo tar xvf zookeeper-3.4.11.tar.gz \
--directory=/opt/zookeeper \
--strip 1
```

>Configure ZooKeeper:
```shell
sudo vi $ZOO_HOME/conf/zoo.cfg
```
```
# The number of milliseconds of each tick
tickTime=5000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.
dataDir=/home/pi/zookeeper
# the port at which the clients will connect
clientPort=2181
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
initLimit=5
syncLimit=2
server.1=r1:2888:3888
#server.2=r2:2888:3888
#server.3=r3:2888:3888
# Uncomment server.2 and server.3 when running on three Raspberry Pis
```

>Start zookeeper by: 
```shell
sudo /opt/zookeeper/bin/zkServer.sh start
```
```shell
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
```
#### Issue
>Checking the status:
```shell
sudo /opt/zookeeper/bin/zkServer.sh status
```
```
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
```
The error log shows the cause:  
`less /home/pi/zookeeper.out`
>>WARN  [main:QuorumPeer$QuorumServer@190] - Failed to resolve address: r3  
>>java.net.UnknownHostException: r3: unknown error

>After commenting out r2 and r3 in the conf/zoo.cfg file and restarting, ZooKeeper started successfully.
```shell
netstat -nlp | grep 2181
```
```
tcp6       0      0 :::2181                 :::*                    LISTEN      -  
```
```shell
sudo /opt/zookeeper/bin/zkServer.sh status
```
```
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Mode: standalone
```
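
The `UnknownHostException` above came from a server entry whose hostname could not be resolved. Before enabling more `server.N` lines, it can help to check that every hostname they name has an entry in the hosts file. This is a minimal sketch; `zk_unresolved_hosts` is a hypothetical helper that only inspects the two files and does not perform real DNS lookups.

```shell
# Print each hostname from the uncommented server.N lines of a zoo.cfg
# file that has no entry in the given hosts file.
# Usage: zk_unresolved_hosts <zoo.cfg> <hosts file>
zk_unresolved_hosts() {
    cfg=$1
    hosts=$2
    sed -n 's/^server\.[0-9][0-9]*=\([^:]*\):.*/\1/p' "$cfg" |
    while read -r host; do
        awk -v h="$host" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$hosts" || echo "$host"
    done
}

# Demo with scratch files: r3 is configured but missing from the hosts file.
cfg=$(mktemp)
hosts=$(mktemp)
printf 'server.1=r1:2888:3888\nserver.3=r3:2888:3888\n' > "$cfg"
printf '192.168.1.108 r1\n' > "$hosts"
zk_unresolved_hosts "$cfg" "$hosts"
rm -f "$cfg" "$hosts"
```

On the master this would be called as `zk_unresolved_hosts /opt/zookeeper/conf/zoo.cfg /etc/hosts`; no output means every configured server has a hosts entry.
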

## Installing Kafka
>```shell
#on client:  
scp kafka_2.12-1.0.0.tgz pi@192.168.1.108:./  
#on master server:
sudo mkdir -p /opt/kafka
sudo tar xzvf kafka_2.12-1.0.0.tgz \
--directory=/opt/kafka \
--strip 1
```
>Configure config/server.properties
```shell
sudo vi /opt/kafka/config/server.properties
# Keep comments on their own lines: in a properties file, text after the
# value on the same line becomes part of the value.
# The id of the broker. This must be set to a unique integer for each broker.
# Increment this for each node (broker.id=1 on the 2nd node, etc.).
broker.id=0
#...
# A comma separated list of directories under which to store log files.
# Change this to wherever you want the log to be.
log.dirs=/opt/kafka/log
#...
# The minimum age of a log file to be eligible for deletion.
# Change this based on your needs.
log.retention.hours=1
#...
#Zookeeper connection string (see zookeeper docs for details).
#This is a comma separated host:port pairs, each corresponding to a zk
#server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
#You can also append an optional chroot string to the urls to specify the
#root directory for all kafka znodes.
zookeeper.connect=r1:2181
#zookeeper.connect=r1:2181,r2:2181,r3:2181
```
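
Since every broker needs its own broker.id, one option is to derive it from the hostname. The sketch below is only an illustration and assumes the nodes are named r1, r2, r3 and so on, without leading zeros; `broker_id_for_host` is a hypothetical helper, not part of Kafka.

```shell
# Map a hostname of the form r<N> to broker id N-1 (r1 -> 0, r2 -> 1, ...).
# Fails with a message on stderr if what follows the "r" is not a number.
broker_id_for_host() {
    n=${1#r}
    case $n in
        ''|*[!0-9]*) echo "unexpected hostname: $1" >&2; return 1 ;;
    esac
    echo $(( n - 1 ))
}

broker_id_for_host r1   # prints 0
broker_id_for_host r2   # prints 1
```

On a node, `broker_id_for_host "$(hostname)"` would give the value to put in server.properties, provided the hostname follows that naming.
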
>Config bin/kafka-server-start.sh
```shell
sudo vi /opt/kafka/bin/kafka-server-start.sh
```
Add the following after all comments:
```
export JMX_PORT=${JMX_PORT:-9999}
```

>On a Raspberry Pi with little memory, the JVM will have trouble starting with the default heap settings (usually 1G). Add the following:
```shell
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M" 
```

>Configure bin/kafka-run-class.sh  
On Raspberry Pi OS, Java runs as -client instead of -server. However, the default Kafka settings run Java as -server. To change that:
```shell
sudo vi /opt/kafka/bin/kafka-run-class.sh
```
Find KAFKA_JVM_PERFORMANCE_OPTS and change to:
```
KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true"
###if 2.8 kafka:
#KAFKA_JVM_PERFORMANCE_OPTS="-client -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava.awt.headless=true"
```

>Start Kafka on all nodes
```shell
sudo /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties &
```

### Create a topic
>Let's create a topic named "test" with a single partition and only one replica:
```shell
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper r1:2181 --replication-factor 1 --partitions 1 --topic test
#We can now see that topic if we run the list topic command
$KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper r1:2181
test
```

### Issue: 
>Error while executing topic command : Replication factor: 1 larger than available brokers: 0.
>Checking the broker ids in the ZooKeeper client:
```shell
$ZOO_HOME/bin/zkCli.sh
ls /brokers/ids
[]
```
>The broker is not connected to ZooKeeper.  
>Stopping the Kafka server and restarting it resolved this failure. It was probably caused by a keyboard mistake during the previous run.

### Send some messages
>Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.
Run the producer and then type a few messages into the console to send to the server.
```shell
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>message2
>message3
>message4
```

### Start a consumer
>Kafka also has a command line consumer that will dump out messages to standard output.
```shell
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
message2
message3
message4
```

## Installing HBase
>```shell
#on client:
scp hbase-1.3.1-bin.tar.gz pi@192.168.1.108:./
#on master server:
sudo mkdir -p /opt/hbase
sudo tar xvf hbase-1.3.1-bin.tar.gz \
--directory=/opt/hbase \
--strip 1
```
### Installing HBase in Pseudo-Distributed Mode
Let us now check how HBase is installed in pseudo-distributed mode.

>CONFIGURING HBASE
Before proceeding with HBase, configure Hadoop and HDFS on your local system or on a remote system and make sure they are running. Stop HBase if it is running.
>Edit hbase-site.xml file to add the following properties.
```shell
sudo vi /opt/hbase/conf/hbase-site.xml
```
```xml
<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>
```
>This sets the mode in which HBase should run. In the same file, set hbase.rootdir to the address of your HDFS instance, using the hdfs:// URI syntax.
```xml
<property>
   <name>hbase.rootdir</name>
   <value>hdfs://r1:9000/hbase</value>
</property>
```
Starting HBase
After configuration is over, browse to HBase home folder and start HBase using the following command.
```shell
sudo su
/opt/hbase/bin/start-hbase.sh
```
Note: Before starting HBase, make sure Hadoop is running.

>Checking the HBase Directory in HDFS
HBase creates its directory in HDFS. To see the created directory, browse to Hadoop bin and type the following command.
```shell
hdfs dfs -ls /hbase
```
```
Found 7 items
drwxr-xr-x   - root supergroup          0 2017-11-12 05:23 /hbase/.tmp
drwxr-xr-x   - root supergroup          0 2017-11-12 05:24 /hbase/MasterProcWALs
drwxr-xr-x   - root supergroup          0 2017-11-12 05:23 /hbase/WALs
drwxr-xr-x   - root supergroup          0 2017-11-12 05:23 /hbase/data
-rw-r--r--   3 root supergroup         42 2017-11-12 05:23 /hbase/hbase.id
-rw-r--r--   3 root supergroup          7 2017-11-12 05:23 /hbase/hbase.version
drwxr-xr-x   - root supergroup          0 2017-11-12 05:23 /hbase/oldWALs
```
##### Starting HBaseShell
>After installing HBase successfully, you can start the HBase Shell. The steps to do so are given below. Open a terminal and log in as the super user.

>Start Hadoop File System
Browse through Hadoop home sbin folder and start Hadoop file system as shown below.
```shell
$HADOOP_HOME/sbin/start-all.sh
```
Start HBase
Browse through the HBase root directory bin folder and start HBase.
```shell
$HBASE_HOME/bin/start-hbase.sh
```
Connect to your running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install. In this example, some usage and version information that is printed when you start HBase Shell has been omitted. The HBase Shell prompt ends with a > character.
```shell
$HBASE_HOME/bin/hbase shell
```
Create a table and populate it with data.

>You can use the HBase Shell to create a table, populate it with data, scan and get values from it, using the same procedure as in [shell exercises](https://hbase.apache.org/book.html#shell_exercises).
##### Start HBase Master Server
>The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which makes 10 total HMasters, counting the primary. To start a backup HMaster, use local-master-backup.sh. For each backup master you want to start, add a parameter representing the port offset for that master. Each HMaster uses three ports (16010, 16020, and 16030 by default). The port offset is added to these ports, so using an offset of 2, the backup HMaster would use ports 16012, 16022, and 16032. The commented-out command below would start 3 backup servers using ports 16012/16022/16032, 16013/16023/16033, and 16015/16025/16035; the active command starts a single backup server with offset 2.
```shell
#$HBASE_HOME/bin/local-master-backup.sh start 2 3 5
$HBASE_HOME/bin/local-master-backup.sh start 2
```
To kill a backup master without killing the entire cluster, you need to find its process ID (PID). The PID is stored in a file with a name like /tmp/hbase-USER-X-master.pid. The file contains only the PID. You can use the kill -9 command to kill that PID. The following command will kill the master with port offset 2, but leave the cluster running:
```shell
cat /tmp/hbase-root-2-master.pid |xargs kill -9
```
##### Start and stop additional RegionServers

>The HRegionServer manages the data in its StoreFiles as directed by the HMaster. Generally, one HRegionServer runs per node in the cluster. Running multiple HRegionServers on the same system can be useful for testing in pseudo-distributed mode. The local-regionservers.sh command allows you to run multiple RegionServers. It works in a similar way to the local-master-backup.sh command, in that each parameter you provide represents the port offset for an instance. Each RegionServer requires two ports, and the default ports are 16020 and 16030. However, the base ports for additional RegionServers are not the default ports since the default ports are used by the HMaster, which is also a RegionServer since HBase version 1.0.0. The base ports are 16200 and 16300 instead. You can run 99 additional RegionServers that are not a HMaster or backup HMaster, on a server. The following command starts four additional RegionServers, running on sequential ports starting at 16202/16302 (base ports 16200/16300 plus 2).
```shell
#$HBASE_HOME/bin/local-regionservers.sh start 2 3 4 5
$HBASE_HOME/bin/local-regionservers.sh start 2
```
To stop a RegionServer manually, use the local-regionservers.sh command with the stop parameter and the offset of the server to stop.
```shell
$HBASE_HOME/bin/local-regionservers.sh stop 2
```
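
The port arithmetic described in the two sections above can be written down directly. A small sketch using the base ports quoted there (16010/16020/16030 for HMasters, 16200/16300 for additional RegionServers); the function names are hypothetical.

```shell
# Ports used by a backup HMaster started with the given port offset.
hbase_backup_master_ports() {
    echo "$((16010 + $1)) $((16020 + $1)) $((16030 + $1))"
}

# Ports used by an additional RegionServer started with the given offset.
hbase_regionserver_ports() {
    echo "$((16200 + $1)) $((16300 + $1))"
}

hbase_backup_master_ports 2   # prints 16012 16022 16032
hbase_regionserver_ports 2    # prints 16202 16302
```
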
##### HBase Web Interface
>To access the web interface of HBase, type the following url in the browser.
[http://192.168.1.108:16010](http://192.168.1.108:16010)

## Installing Pig
>```shell
#on client:
scp pig-0.17.0.tar.gz pi@192.168.1.108:./
#on master server:
sudo mkdir -p /opt/pig
sudo tar xvf pig-0.17.0.tar.gz \
--directory=/opt/pig \
--strip 1
```
>In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
```shell
pig -h properties
```
**Mapreduce Mode** - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).  
**Spark Mode** - To run Pig in Spark mode, you need access to a Spark, Yarn or Mesos cluster and HDFS installation. Specify Spark mode using the -x flag (-x spark). In Spark execution mode, it is necessary to set env::SPARK_MASTER to an appropriate value (local - local mode, yarn-client - yarn-client mode, mesos://host:port - Spark on Mesos, or spark://host:port - Spark cluster). For more information refer to the Spark documentation on Master URLs; yarn-cluster mode is currently not supported. Pig scripts run on Spark can take advantage of the dynamic allocation feature. The feature can be enabled by simply enabling spark.dynamicAllocation.enabled. Refer to the Spark configuration for additional configuration details. In general all properties in the Pig script prefixed with spark. are copied to the Spark Application Configuration. Please note that the Yarn auxiliary service needs to be enabled on Spark for this to work. See the Spark documentation for additional details.

### restart steps
>```shell
sudo su
start-dfs.sh
hive --service metastore &
/opt/spark/sbin/start-master.sh
/opt/spark/sbin/start-slaves.sh
/opt/zookeeper/bin/zkServer.sh start
/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties &
```

### stop steps
>```shell
sudo su
$HBASE_HOME/bin/stop-hbase.sh
$HADOOP_HOME/sbin/stop-dfs.sh
```