# Ingesting realtime tweets using Apache Kafka, Tweepy and Python

### Purpose:
- main data source for the lambda architecture pipeline
- uses the Twitter Streaming API to simulate new events arriving every minute
- a Kafka producer sends the tweets as records to the Kafka broker

### Contents: 
- [Twitter setup](#1)
- [Defining the Kafka producer](#2)
- [Producing and sending records to the Kafka Broker](#3)
- [Deployment](#4)

### Required libraries

In [2]:
import time
from kafka import KafkaProducer

A helper function that converts a tweet's creation time (GMT) to the local time of our system

In [2]:
from datetime import datetime, timedelta

def normalize_timestamp(created_at):
    # The parameter is named created_at so it does not shadow the imported time module
    mytime = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
    mytime += timedelta(hours=1)   # tweets are timestamped in GMT, while the local system is at GMT+1
    return mytime.strftime("%Y-%m-%d %H:%M:%S")
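
Adding a fixed hour works only while the local offset stays at +1. As a minimal sketch, assuming the input string is in GMT/UTC and a fixed offset is what is wanted, the same conversion can be written with timezone-aware datetimes. The name `normalize_timestamp_tz` and the `offset_hours` parameter are hypothetical and not part of the original notebook.

```python
from datetime import datetime, timedelta, timezone

def normalize_timestamp_tz(created_at, offset_hours=1):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string from UTC to a fixed local offset."""
    fmt = "%Y-%m-%d %H:%M:%S"
    # Assumption: the input carries no zone information and is in UTC (GMT)
    utc_time = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    # Assumption: the local zone is a fixed offset, +1 hour by default
    local_zone = timezone(timedelta(hours=offset_hours))
    return utc_time.astimezone(local_zone).strftime(fmt)

print(normalize_timestamp_tz("2021-01-01 23:30:00"))
```

For a zone with daylight-saving rules, `zoneinfo.ZoneInfo` (Python 3.9+) could replace the fixed offset.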

<a id="2"></a>
### Defining the Kafka producer
- specify the Kafka Broker
- specify the topic name
- optional: specify partitioning strategy

In [3]:
producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic_name = 'test-amazon-crawler'
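
The optional settings mentioned above can be collected in one place. The following is a sketch, not the notebook's actual configuration: `build_producer_config` is a hypothetical helper that returns keyword arguments for `KafkaProducer`, and the choice of JSON values and `acks="all"` is an assumption. When a record is sent with a key, the default partitioner hashes that key, so records sharing a key should land in the same partition.

```python
import json

def build_producer_config(broker="localhost:9092"):
    """Return keyword arguments for kafka.KafkaProducer (hypothetical helper)."""
    return {
        "bootstrap_servers": broker,
        # Serialize keys to bytes; keep None so keyless records stay keyless
        "key_serializer": lambda key: None if key is None else str(key).encode("utf-8"),
        # Serialize values as UTF-8 encoded JSON
        "value_serializer": lambda value: json.dumps(value).encode("utf-8"),
        # Assumption: wait for all in-sync replicas to acknowledge each record
        "acks": "all",
    }

config = build_producer_config()
print(config["value_serializer"]({"id": 1}))

# With a broker running, the config could be used like this:
# from kafka import KafkaProducer
# producer = KafkaProducer(**config)
# producer.send("test-amazon-crawler", key=1, value={"id": 1})
```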

<a id="3"></a>
### Producing and sending records to the Kafka Broker
- querying the Twitter API Object
- extracting relevant information from the response
- formatting and sending the data to the proper topic on the Kafka Broker
- resulting tweets have the following attributes:
    - id 
    - created_at
    - followers_count
    - location
    - favorite_count
    - retweet_count
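
The extraction step listed above can be sketched as follows. This assumes the tweet is available as a dict shaped like a Twitter API v1.1 status, where `followers_count` and `location` sit under the `user` object. The function names and the sample payload are made up for illustration.

```python
import json

def extract_tweet_fields(status):
    """Pick the attributes listed above from a tweet given as a dict."""
    # Assumption: follower count and location are nested under "user"
    user = status.get("user", {})
    return {
        "id": status.get("id"),
        "created_at": status.get("created_at"),
        "followers_count": user.get("followers_count"),
        "location": user.get("location"),
        "favorite_count": status.get("favorite_count"),
        "retweet_count": status.get("retweet_count"),
    }

def to_record(status):
    """Serialize the extracted fields as UTF-8 JSON bytes, ready for Kafka."""
    return json.dumps(extract_tweet_fields(status)).encode("utf-8")

# Made-up sample payload, for illustration only
sample = {
    "id": 1,
    "created_at": "2021-01-01 12:00:00",
    "text": "hello",
    "user": {"followers_count": 42, "location": "Berlin"},
    "favorite_count": 3,
    "retweet_count": 5,
}
print(to_record(sample))
```

Missing fields come back as `None` rather than raising, which keeps one incomplete tweet from stopping the stream.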

In [8]:
def get_data():
    # Read the scraped records from file, one JSON document per line
    with open("amazon-scraper/search_results_output.jsonl", "r") as file:
        data = file.readlines()
    # Send each line to the topic as UTF-8 encoded bytes
    for line in data:
        producer.send(topic_name, line.encode("utf-8"))
    # Block until all buffered records have been sent
    producer.flush()
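
`get_data()` loads the whole file into memory and sends every line as-is, including blank or malformed ones. A possible alternative, sketched under the assumption that each line should hold one JSON document, reads the file lazily and skips lines that do not parse. `iter_jsonl_records` is a hypothetical helper name.

```python
import json

def iter_jsonl_records(path):
    """Yield each valid JSON line of a JSONL file as UTF-8 bytes."""
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                json.loads(line)  # validate only; the original text is sent
            except json.JSONDecodeError:
                print(f"skipping malformed line {line_number}")
                continue
            yield line.encode("utf-8")

# Possible use with the producer defined earlier:
# for record in iter_jsonl_records("amazon-scraper/search_results_output.jsonl"):
#     producer.send(topic_name, record)
# producer.flush()
```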

In [7]:
line="test1"
producer.send(topic_name, str.encode(line))
producer.flush()

In [9]:
get_data()

In [10]:
line="test2"
producer.send(topic_name, str.encode(line))
producer.flush()

<a id="4"></a>
### Deployment 
- run `get_data()` periodically, sleeping a fixed number of seconds between runs

In [None]:
def periodic_work(interval):
    while True:
        get_data()
        # interval is the number of seconds to wait; time.sleep accepts floats as well as integers
        time.sleep(interval)


In [None]:
periodic_work(60 * 0.1)  # 60 * 0.1 = 6, so data is sent every 6 seconds
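
The loop above never returns, and the time spent inside `get_data()` is added on top of each sleep. One way to address both, shown as a sketch with a hypothetical `run_periodically` helper and an optional `max_runs` limit, is to schedule each run relative to the previous start time using a monotonic clock.

```python
import time

def run_periodically(task, interval, max_runs=None):
    """Call task() every `interval` seconds, measured from start to start."""
    runs = 0
    next_start = time.monotonic()
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        # Schedule from the previous start so the task's runtime adds no drift
        next_start += interval
        # If the task overran the interval, continue without sleeping
        time.sleep(max(0.0, next_start - time.monotonic()))
    return runs

# Demo with a stand-in task; in the notebook this would be get_data
run_periodically(lambda: print("tick"), interval=0.1, max_runs=3)
```

Leaving `max_runs` as `None` keeps the original run-forever behavior.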