# Publishing the Contents of a URL as Messages to Kafka

[Apache Kafka](https://kafka.apache.org/) is an open-source distributed event streaming platform used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

This notebook walks through publishing the contents of a URL to Kafka.

**Step 1:** Download the contents of the URL into a JSON file using the following command.

curl is a command-line tool for getting or sending data using URL syntax over any of its supported protocols. Some of the supported protocols are HTTP, HTTPS, FTP, IMAP, POP3, SCP, SFTP, SMTP, TFTP, TELNET, LDAP, and FILE.

We add the options:

* -L (valid for HTTP and HTTPS) to make curl repeat the request at the new location if the server reports that the requested page has moved (indicated by a Location: header and a 3XX response code). If used together with -i, --include or -I, --head, headers from all requested pages are shown. When authentication is used, curl sends its credentials only to the initial host, so if a redirect takes curl to a different host, that host cannot intercept the user and password. You can limit the number of redirects to follow with the --max-redirs option.

* -o assessment-attempts-nested.json to write the output to this file instead of stdout

Finally, we provide https://goo.gl/ME6hjp, the URL we want to retrieve data from.

In [None]:
curl -L -o assessment-attempts-nested.json https://goo.gl/ME6hjp

In [None]:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 9096k  100 9096k    0     0  14.6M      0 --:--:-- --:--:-- --:--:-- 14.6M
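The download step can be made a little more defensive. The following is a minimal sketch, not the command used in this notebook: it keeps -L and -o, adds the --max-redirs option mentioned above, and also adds --fail and -sS, which are not part of the original command. The file names sample.json and download.json are hypothetical, and a local file:// URL stands in for the real one so the sketch runs offline.

```shell
# Sketch only: a local file:// URL stands in for the real download so the
# example runs offline. Pass the real URL as the first argument instead.
workdir=$(mktemp -d)
printf '[{"id": 1}, {"id": 2}]\n' > "$workdir/sample.json"
url="${1:-file://$workdir/sample.json}"

# -L follows redirects and --max-redirs caps how many are followed.
# --fail makes curl exit non-zero on an HTTP error instead of saving the error page.
# -sS hides the progress meter but still prints error messages.
if curl -sS -L --max-redirs 5 --fail -o "$workdir/download.json" "$url"; then
  status="downloaded"
else
  status="failed"
fi
echo "$status"
```

Checking the exit status of curl before moving on avoids feeding an empty or partial file to the later steps.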

**Step 2:** Use cat and jq to print the JSON file we downloaded in Step 1 with one array element per line.

The cat command concatenates files and prints them on the standard output.

We provide the name(s) of the file(s) we want to concatenate. With no FILE, or when FILE is -, cat reads standard input.

The | (pipe) makes the standard output of command 1 (the command before the |) the standard input of command 2 (the command after the |). So the output of cat acts as the input to jq '.[]'
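As a small illustration of cat and the pipe, the sketch below uses two hypothetical files, a.txt and b.txt, created in a temporary directory purely for the example.

```shell
workdir=$(mktemp -d)
printf 'first\n' > "$workdir/a.txt"
printf 'second\n' > "$workdir/b.txt"

# cat writes the named files to standard output in the order given.
joined=$(cat "$workdir/a.txt" "$workdir/b.txt")

# With - as a file name, cat reads standard input, supplied here by the pipe.
piped=$(printf 'third\n' | cat "$workdir/a.txt" -)

printf '%s\n' "$joined"
printf '%s\n' "$piped"
```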

jq is a lightweight and flexible command-line JSON processor. It lets you slice, filter, map, and transform structured data with ease. The filter '.[]' unrolls the top-level array and emits each element separately, and the -c option produces compact output, printing each JSON value on a single line instead of pretty-printing it across several lines (color is controlled by a different option, the uppercase -C). So jq '.[]' -c writes each element of the JSON array on its own line.

In [None]:
cat assessment-attempts-nested.json | jq '.[]' -c
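To see the effect of '.[]' and -c on something small, here is a sketch with a hypothetical two-element array standing in for assessment-attempts-nested.json. The field names are made up for the example.

```shell
# Hypothetical sample data, not taken from the real file.
sample='[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]'

# '.[]' emits each array element; -c prints each one compactly on a single line.
compact=$(printf '%s\n' "$sample" | jq -c '.[]')
printf '%s\n' "$compact"
# Prints:
# {"id":1,"tags":["a","b"]}
# {"id":2,"tags":[]}
```

Nested arrays and objects stay inside their element's line, so the number of output lines equals the number of elements in the top-level array.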

**Step 3:** Count how many lines (and therefore how many messages we will publish to Kafka) result from our command in Step 2.

Adding | wc -l passes the standard output of the command in Step 2, the compact lines jq extracted from the JSON array, to the next command as input. The next command is wc -l, which prints the newline count because the -l option tells wc to count newlines.

In [None]:
cat assessment-attempts-nested.json  | jq '.[]' -c | wc -l

In [None]:
3280
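Because wc -l counts newline characters rather than lines as such, the count matches the number of messages only when every line ends with a newline, which is the case for jq output. A minimal sketch with made-up input shows the difference:

```shell
# Three newline-terminated lines: wc -l reports 3.
with_newline=$(printf 'a\nb\nc\n' | wc -l | tr -d ' ')

# The last line has no trailing newline, so only 2 newlines are counted.
without_newline=$(printf 'a\nb\nc' | wc -l | tr -d ' ')

echo "$with_newline $without_newline"   # prints: 3 2
```

The tr -d ' ' step strips the padding that some implementations of wc put in front of the number.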

**Step 4:** Here we take the output from Step 2 and publish it to the Kafka topic 'assessment-attempts'.

Here we use docker-compose, assuming the Kafka service is launched using Docker.

docker-compose exec runs a command in a running container of the service whose name is provided.

Here, we use it to run a command in the mids container (defined in the docker-compose file, which is described in Section 1 of the report).

The command we run is bash -c "cat assessment-attempts-nested.json | jq '.[]' -c | kafkacat -P -b kafka:29092 -t assessment-attempts && echo 'Produced 3280 messages.'" 

* bash launches a shell in the container
* -c tells bash to read commands from the string that follows
* The string first uses cat to write the contents of the file assessment-attempts-nested.json to standard output.
* It then passes that standard output as standard input to the next command: jq '.[]' -c, which parses the JSON and writes each element of the array on its own line.
* The standard output of that is then passed as standard input to the next command: kafkacat -P -b kafka:29092 -t assessment-attempts
    * kafkacat -P launches the utility in producer mode. In this mode, kafkacat reads messages from standard input (stdin), one message per line by default.
    * -b kafka:29092 specifies the Kafka broker as host:port, here the host kafka and the port 29092, both of which are configured in the docker-compose.yml
    * -t assessment-attempts specifies the name of the topic we want to publish to
* && runs the command that follows it only if the pipeline before it completed successfully
* echo 'Produced 3280 messages.' prints a message if publishing to Kafka succeeded. We know the count of 3280 from Step 3.
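One detail of && is worth seeing in isolation: it looks at the exit status of the whole pipeline, which by default is the status of the last command only. The sketch below is an illustration under assumptions, not the notebook's command: cat > /dev/null is a stand-in for kafkacat so that it runs without a broker, and false is a stand-in for an earlier stage that fails. The set -o pipefail line is an addition that the original command does not use.

```shell
# The success message is printed after the stand-in pipeline completes.
out=$(bash -c "printf 'a\nb\n' | cat > /dev/null && echo 'Produced 2 messages.'")

# By default only the last command's status counts, so a failure in the
# first stage is masked and the message is still printed.
masked=$(bash -c "false | cat > /dev/null && echo 'reported success'")

# With pipefail, a failure anywhere in the pipeline makes && skip the echo.
strict=$(bash -c "set -o pipefail; false | cat > /dev/null && echo 'reported success'") || true

echo "out=$out"
echo "masked=$masked"
echo "strict=$strict"
```

In the real command this means the message reflects whether kafkacat exited successfully, not whether cat or jq did, unless pipefail is set inside the quoted string.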

In [None]:
docker-compose exec mids bash -c "cat assessment-attempts-nested.json | jq '.[]' -c | kafkacat -P -b kafka:29092 -t assessment-attempts && echo 'Produced 3280 messages.'"

In [None]:
Produced 3280 messages.
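The count in the message above is hard-coded from Step 3. As a sketch of one alternative, the count can be computed and substituted into the message. Everything here is hypothetical scaffolding: sample.json is a made-up three-element file, and the publish function is a stand-in for kafkacat -P -b kafka:29092 -t assessment-attempts so the example runs without Docker or a broker.

```shell
workdir=$(mktemp -d)
printf '[{"id": 1}, {"id": 2}, {"id": 3}]\n' > "$workdir/sample.json"

# Hypothetical stand-in for the kafkacat producer: it just discards its input.
publish() { cat > /dev/null; }

# Count the messages first, as in Step 3.
count=$(jq -c '.[]' "$workdir/sample.json" | wc -l | tr -d ' ')

# Publish, then report the computed count only if the pipeline succeeded.
message=$(jq -c '.[]' "$workdir/sample.json" | publish && echo "Produced $count messages.")
echo "$message"
```

With the real command the same idea applies inside the bash -c string, keeping in mind that a variable inside double quotes is expanded by the outer shell before the string reaches the container.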