This repository contains the commands and links to code that support a YouTube series on How to Stream Social Media to Hadoop. Each section corresponds to a video in the series, and assumes a working knowledge of the technical components.
http://www.youtube.com/datadansandler
This is part 2 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Here we deploy the AWS EC2 RHEL 6.3 64-bit M1.small instance.
- Red Hat 6.3; M1 Large is recommended, but for this exercise we will use M1 Small
a. Set up security
b. AWS Security Group - allow inbound ports HTTP (80 and 8080), HTTPS (443), 7180 (Cloudera Manager), 11000 (Oozie), 50030 (MR Job Tracker) and 50070 (NameNode)
c. Get the Access tokens
i. If you have an EC2 account, an admin should have sent these to you
1) Do not Lose, Do not Share
ii. Add environment variables as you normally would
iii. AWS_ACCESS_KEY
iv. AWS_SECRET_KEY
c. Set up a Key Pair (aka .PEM file)
i. Do not lose, do not share
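The access-key environment variables above can be added to your shell profile. A minimal sketch (the key values below are placeholders for illustration, not real credentials):

```shell
# Append to ~/.bashrc (or your shell profile) and re-login.
# Values are placeholders - never share or commit your real keys.
export AWS_ACCESS_KEY="AKIAEXAMPLEKEY"
export AWS_SECRET_KEY="example-secret-key"
echo "AWS_ACCESS_KEY=$AWS_ACCESS_KEY"
```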
A. Get the external IP address and external DNS name
B. SSH to environment
A. ssh root@ec2-<YOUR_IP_ADDR>.compute-1.amazonaws.com -i <your_PEM_FILE>.pem
B. Firewall config
i. system-config-firewall-tui
ii. Enable SSH (22), HTTP (80 and 8080), HTTPS (443), 7180 (Cloudera Manager), 11000 (Oozie), 50030 (MR Job Tracker) and 50070
a. Disable SELinux
i. If SELinux is enabled, the installer will fail with: "SELinux is enabled. It must be disabled to install and use this product."
ii. vi /etc/selinux/config
1) SELINUX=disabled
2) reboot
3) sestatus
4) getenforce
b. Only if there are port issues
i. iptables -L -n
ii. service iptables restart
This is part 3 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. In this part, we install Cloudera in the Amazon Cloud.
A. mkdir /tmp/cdh_download; cd /tmp/cdh_download
B. download using: curl -o cloudera-manager-installer.bin http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
C. chmod u+rx cloudera-manager-installer.bin
D. ./cloudera-manager-installer.bin
E. Test that services are running
a. Log in to Cloudera Manager at http://<aws-host>:7180
a) Check the services
b) Also check ports 50030 and 50070
b. Check the Job Tracker
F. Install Maven
- wget http://apache.mirrors.lucidnetworks.net/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
- tar xvf apache-maven-3.0.4-bin.tar.gz
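After extracting, Maven's bin directory needs to be on the PATH. A sketch, assuming the tarball was extracted under /opt (an assumption; use whatever directory you extracted to):

```shell
# Assumes Maven was extracted to /opt/apache-maven-3.0.4 - adjust to your path.
export M2_HOME=/opt/apache-maven-3.0.4
export PATH="$M2_HOME/bin:$PATH"
# mvn -version should now report Apache Maven 3.0.4
echo "$PATH" | grep -o 'apache-maven-3.0.4' | head -1
```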
This is part 4 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Here we register an app on Twitter.
No special instructions other than those posted in the video.
This is part 5 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Here we configure Flume to stream Twitter to HDFS.
- Download the Twitter project from GitHub
a) mkdir /tmp/twitter-install
b) wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
c) unzip master.zip
- cd /tmp/twitter-install/cdh-twitter-example-master/flume-sources
- wget http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
- mv flume-sources-1.0-SNAPSHOT.jar /usr/lib/hadoop
- sudo cp /etc/flume-ng/conf/flume-env.sh.template /etc/flume-ng/conf/flume-env.sh
- vi /etc/flume-ng/conf/flume-env.sh and add the jar above to FLUME_CLASSPATH
- Create /etc/default/flume-ng-agent
a) See https://github.com/DataDanSandler/sentiment_analysis/tree/master/src/flume
- Create /etc/flume-ng/conf/flume.conf
a) See https://github.com/DataDanSandler/sentiment_analysis/tree/master/src/flume
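The flume.conf referenced above follows the cdh-twitter-example layout. A minimal sketch - the consumer key/secret and token values are placeholders you must replace with your own app credentials, the host name is your cluster's internal name, and the tuning values are illustrative:

```
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your_consumer_key>
TwitterAgent.sources.Twitter.consumerSecret = <your_consumer_secret>
TwitterAgent.sources.Twitter.accessToken = <your_access_token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your_access_token_secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://ip-<IP_ADDR>.ec2.internal:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
```

Note the time-escaped HDFS path - it is what allows the hourly Oozie partitioning used later in the series.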
This is part 6 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Here we install MySQL and configure Hive.
- cd /usr/lib/hadoop
- wget http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
- Create the Hive directory hierarchy
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
sudo -u hdfs hadoop fs -chown -R hive:hive /user/hive
sudo -u hdfs hadoop fs -chmod 777 /user/hive
sudo -u hdfs hadoop fs -chmod 777 /user/hive/warehouse
- Permissions: add the flume, hdfs and oozie users to the hive group
sudo usermod -a -G hive flume
sudo usermod -a -G hive hdfs
sudo usermod -a -G hive oozie
- Install the MySQL JDBC driver in /usr/lib/hive/lib
curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.22.tar.gz/from/http://mysql.he.net/' | tar xz
sudo cp mysql-connector-java-5.1.22/mysql-connector-java-5.1.22-bin.jar /usr/lib/hive/lib/
- Set up the Hive metastore in MySQL (see https://ccp.cloudera.com/display/CDH4DOC/Hive+Installation#HiveInstallation-ConfiguringtheHiveMetastore)
mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.9.0.mysql.sql;
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY '<your_password>';
...
mysql> GRANT ALL ON metastore.* TO 'hive'@'localhost' IDENTIFIED BY '<your_password>';
MAKE SURE the GRANT is executed as the MySQL root user, at the server level, not just inside the metastore database
mysql> FLUSH PRIVILEGES;
mysql> quit;
- Test connectivity
sudo -u hdfs hive -e 'show tables;'
- Create the tweets table (the nested STRUCT/ARRAY types below follow the cdh-twitter-example DDL)
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
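After creating the metastore database, hive-site.xml must point Hive at it. A sketch of the relevant properties (on a CDH4 install this file lives in /etc/hive/conf/; the host, user and password values are placeholders matching the MySQL setup above):

```
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>your_password</value>
</property>
```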
This is part 7 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Here we will set up and configure the Oozie workflow.
- cp /usr/lib/hive/lib/mysql*.jar /usr/lib/oozie/libext
- Configure Oozie to use MySQL (see https://ccp.cloudera.com/display/CDH4DOC/Oozie+Installation#OozieInstallation-ConfiguringOozietoUseMySQL)
a) Configure MySQL for Oozie via the Cloudera Manager service configuration page (http://<aws-host>:7180)
- mkdir /home/oozie_lib
- mkdir /home/oozie_lib/oozie-workflows/lib
- cp hive-serdes/target/hive-serdes-1.0-SNAPSHOT.jar oozie-workflows/lib
- cp /usr/lib/hive/lib/mysql*.jar oozie-workflows/lib
- sudo cp /etc/hive/conf/hive-site.xml oozie-workflows
- sudo chown hdfs:hdfs oozie-workflows/hive-site.xml
- Install the Oozie ShareLib
- Grab the following files and put them into oozie-workflows
a) wget https://raw.github.com/cloudera/cdh-twitter-example/master/oozie-workflows/add_partition.q
b) wget https://raw.github.com/cloudera/cdh-twitter-example/master/oozie-workflows/coord-app.xml
c) wget https://raw.github.com/cloudera/cdh-twitter-example/master/oozie-workflows/hive-action.xml
d) wget https://raw.github.com/cloudera/cdh-twitter-example/master/oozie-workflows/job.properties
i) Edit job.properties (see c:\ec2\vm_install\job.properties)
- sudo -u hdfs hadoop fs -mkdir /user/hdfs
- sudo -u hdfs hadoop fs -put oozie-workflows /user/hdfs/oozie-workflows
- sudo -u hdfs hadoop fs -chown oozie:oozie /user/oozie
- Get the HTTPSource JARs
a) cd /usr/lib/flume-ng/lib
b) wget http://repo1.maven.org/maven2/org/apache/flume/flume-ng-core/1.3.0/flume-ng-core-1.3.0.jar
c) wget http://repo1.maven.org/maven2/org/apache/flume/flume-ng-sdk/1.3.0/flume-ng-sdk-1.3.0.jar
- Start the Flume agent
sudo -u hdfs hadoop fs -mkdir /user/flume/tweets
sudo -u hdfs hadoop fs -mkdir /user/flume/facebook
sudo -u hdfs hadoop fs -chown -R flume:flume /user/flume
sudo -u hdfs hadoop fs -chmod -R 777 /user/flume
a) /etc/init.d/flume-ng-agent restart
i) Make sure it started cleanly
ii) Look at /var/log/flume-ng
1) If you see ConnectionRefused, check flume.conf, oozie-site.xml, hive-site.xml, coord-app.xml and job.properties; make sure the internal host name (ip-<IP_ADDR>.ec2.internal) is used, not localhost or hadoop1
- Start Oozie
sudo /sbin/service oozie start
- Start the Oozie workflow
a) Alias start_oozie_wkf
- Check the logs to make sure it started OK
a) /var/log/hdfs
- Check HDFS to see if tweets are streaming
- Query Hive to get counts of tweets
- Notice that we are an hour behind. Why the wait?
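For reference, the workflow is submitted with the standard Oozie CLI against the job.properties downloaded above. A sketch of both - the host names and dates are placeholders, and the property names follow the cdh-twitter-example layout, so verify them against your own file:

```
# Submit the coordinator (roughly what an alias like start_oozie_wkf wraps):
#   oozie job -oozie http://localhost:11000/oozie -config oozie-workflows/job.properties -run

# job.properties (sketch)
nameNode=hdfs://ip-<IP_ADDR>.ec2.internal:8020
jobTracker=ip-<IP_ADDR>.ec2.internal:8021
workflowRoot=${nameNode}/user/hdfs/oozie-workflows
jobStart=2013-03-01T01:00Z
jobEnd=2013-12-31T00:00Z
initialDataset=2013-03-01T00:00Z
tzOffset=-5
oozie.use.system.libpath=true
oozie.coord.application.path=${workflowRoot}/coord-app.xml
```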
This is part 8 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Now we remove the wait condition from the Oozie workflow and avoid contention with the Flume temp files.
See my blog for the steps
http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html
This is part 9 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. Here we set up ODBC connections and run basic queries.
Please use the video to setup the connection.
The queries can be found in the cloudera site.
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
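One of the queries from that post, for example, ranks users by total retweets. A sketch, assuming the tweets table created in part 6 (verify the exact query against the Cloudera post):

```
SELECT t.retweeted_screen_name,
       SUM(retweets) AS total_retweets,
       COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
```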
This is part 10 of a series that provides a guide for setting up CDH4 Hadoop in the Amazon cloud, and streaming social media to the environment. In the final installment, we show how to stream Facebook to Flume, how to perform sentiment analysis, and some final thoughts.
See my GitHub repository for the Facebook code and instructions.