This repository has been archived by the owner on Feb 1, 2021. It is now read-only.


#Determine Twitter trending topics with Apache Storm on HDInsight

Learn how to use Trident to create a Storm topology that determines trending topics (hash tags) on Twitter.

Trident is a high-level abstraction that provides tools such as joins, aggregations, grouping, functions, and filters. Additionally, Trident adds primitives for doing stateful, incremental processing. This example demonstrates how you can build a topology using a custom spout, function, and several built-in functions provided by Trident.

[AZURE.NOTE] This example is heavily based on the Trident Storm example by Juan Alonso.

##Requirements

* A Java developer environment with the Java JDK and Maven (the project is built with `mvn compile`)
* Git, to clone the example project
* A Twitter account, to register an application and obtain API credentials

##Download the project

Use the following command to clone the project locally:

	git clone https://github.com/Blackmist/TwitterTrending

##Topology

The topology for this example is as follows:

(Diagram: the topology. Tweets flow from the spout, through HashtagExtractor, into the grouped count, and the top hash tags are logged by Debug.)

[AZURE.NOTE] This is a simplified view of the topology. Multiple instances of the components will be distributed across the nodes in the cluster.

The Trident code that implements the topology is as follows:

	topology.newStream("spout", spout)
	    .each(new Fields("tweet"), new HashtagExtractor(), new Fields("hashtag"))
	    .groupBy(new Fields("hashtag"))
	    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
	    .newValuesStream()
	    .applyAssembly(new FirstN(10, "count"))
	    .each(new Fields("hashtag", "count"), new Debug());

This code does the following:

  1. Creates a new stream from the spout. The spout retrieves tweets from Twitter, and filters them for specific keywords (love, music, and coffee in this example).

  2. Uses HashtagExtractor, a custom function, to extract hash tags from each tweet and emit them to the stream.

  3. The stream is grouped by hash tag, and passed to an aggregator. This aggregator creates a count of how many times each hash tag has occurred. This data is persisted in memory. Finally, a new stream is emitted that contains the hash tag and the count.

  4. Because we are only interested in the most popular hash tags for a given batch of tweets, the FirstN assembly is applied to return only the top 10 values, based on the count field.
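The grouped count and top-N selection that steps 3 and 4 describe can be illustrated in plain Java. This is a sketch for illustration only; the class and method names are hypothetical, and the real topology performs this incrementally with `persistentAggregate` and the `FirstN` assembly rather than over a finished batch:

```java
import java.util.*;
import java.util.stream.*;

// Illustration only: counts hash tags in a batch and keeps the top N,
// the same result the topology produces with groupBy, persistentAggregate,
// and FirstN.
public class TopHashtags {
    public static List<Map.Entry<String, Long>> topN(List<String> hashtags, int n) {
        // Group identical tags and count occurrences (the groupBy + Count step).
        Map<String, Long> counts = hashtags.stream()
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        // Keep only the n highest counts (the FirstN step).
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("GRAMMYs", "rock", "GRAMMYs", "punk");
        System.out.println(topN(batch, 2)); // GRAMMYs (count 2) is first
    }
}
```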

[AZURE.NOTE] Other than the spout and HashtagExtractor, we are using built-in Trident functionality.

For information about built-in operations, see Package storm.trident.operation.builtin.

Other Trident-state implementations are available if you need durable persistence instead of the in-memory state used by MemoryMapState.

###The spout

The spout, TwitterSpout, uses Twitter4j to retrieve tweets from Twitter. A filter is created (love, music, and coffee in this example), and the incoming tweets (status) that match the filter are stored in a linked blocking queue. (For more information, see Class LinkedBlockingQueue.) Finally, items are pulled off the queue and emitted to the topology.
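The queue handoff inside the spout can be sketched without the Twitter4j dependency. This is a simplified stand-in, not the actual TwitterSpout code: in the real spout, `onStatus` is a Twitter4j listener callback receiving `Status` objects, and `nextTuple` is the Storm spout method that emits to the topology.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Simplified stand-in for the handoff inside TwitterSpout: a listener
// thread offers incoming tweets to a LinkedBlockingQueue, and the spout
// side polls them off for emission.
public class QueueHandoffSketch {
    // Bounded queue so a flood of tweets cannot exhaust memory.
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);

    // In the real spout, this is the Twitter4j status-listener callback.
    public void onStatus(String tweetText) {
        queue.offer(tweetText); // drops the tweet if the queue is full
    }

    // In the real spout, this is where the tuple is emitted to the topology.
    // Returns null when no tweet is waiting, mirroring a non-blocking nextTuple().
    public String nextTuple() {
        return queue.poll();
    }
}
```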

###The HashtagExtractor

To extract hash tags, getHashtagEntities is used to retrieve all hash tags that are contained in the tweet. These are then emitted to the stream.
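As a rough illustration of what the extractor produces, the following stand-in pulls hash tags out of raw text with a regular expression. The real HashtagExtractor instead calls Twitter4j's `getHashtagEntities` on the tweet's `Status` object; this class and its regex are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for HashtagExtractor: finds hash tags in plain text.
// The real function uses Twitter4j's getHashtagEntities() instead of a regex.
public class HashtagSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    public static List<String> extract(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(tweet);
        while (m.find()) {
            tags.add(m.group(1)); // emit the tag without the leading '#'
        }
        return tags;
    }
}
```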

##Enable Twitter

Use the following steps to register a new Twitter application and obtain the consumer and access token information needed to read from Twitter:

  1. Go to Twitter Apps and click the Create new app button. When filling in the form, leave the Callback URL field empty.

  2. When the app is created, click the Keys and Access Tokens tab.

  3. Copy the Consumer Key and Consumer Secret information.

  4. At the bottom of the page, select Create my access token if no tokens exist. When the tokens have been created, copy the Access Token and Access Token Secret information.

  5. In the TwitterSpoutTopology project you previously cloned, open the resources/twitter4j.properties file, add the information you gathered in the previous steps, and then save the file.
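After you add your values, the file should look similar to the following. These are the standard Twitter4j OAuth property names; the placeholder values shown here must be replaced with the keys and tokens you copied:

```properties
oauth.consumerKey=YourConsumerKey
oauth.consumerSecret=YourConsumerSecret
oauth.accessToken=YourAccessToken
oauth.accessTokenSecret=YourAccessTokenSecret
```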

##Build the topology

Use the following code to build the project:

	cd [directoryname]
	mvn compile

##Test the topology

Use the following command to test the topology locally:

	mvn compile exec:java -Dstorm.topology=com.microsoft.example.TwitterTrendingTopology

After the topology starts, you should see debug information that contains the hash tags and counts emitted by the topology. The output should appear similar to the following:

	DEBUG: [Quicktellervalentine, 7]
	DEBUG: [GRAMMYs, 7]
	DEBUG: [AskSam, 7]
	DEBUG: [poppunk, 1]
	DEBUG: [rock, 1]
	DEBUG: [punkrock, 1]
	DEBUG: [band, 1]
	DEBUG: [punk, 1]
	DEBUG: [indonesiapunkrock, 1]

##Next steps

Now that you have tested the topology locally, learn how to deploy it: Deploy and manage Apache Storm topologies on HDInsight.

You may also be interested in other Apache Storm topics and in more Storm examples for HDInsight.