# PySpark Structured Streaming

## Word frequency count example
In this example, we use PySpark Structured Streaming to count word frequencies in a stream of text. Netcat generates the text stream; PySpark reads it and counts how often each word occurs.

**Step 1**
To run Netcat on Windows, we will use the Windows Subsystem for Linux (WSL) with an Ubuntu distribution installed. If you are using macOS or Linux, you can run Netcat directly on your system without WSL.

To launch Netcat on Windows, open the Ubuntu application from the Start menu and run the following command in the terminal:

```bash
nc -lk 9999
```

This command starts Netcat listening on port 9999 (`-l` listens for incoming connections; `-k` keeps listening after a client disconnects). Once the PySpark application from Step 2 has connected, type text in the terminal and press Enter to send each line to it.

**Step 2**
Next, run the PySpark streaming application. We use the `spark-submit` command to run the `stream_wc.py` script, which reads the stream of text from Netcat, counts the frequency of each word, and prints the results to the console. Open a Command Prompt, navigate to the directory containing `stream_wc.py`, and run the following command:

```bash
spark-submit stream_wc.py
```
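
The contents of `stream_wc.py` are not reproduced here. As a rough guide, a minimal sketch of what such a script could look like is shown below, assuming Netcat is listening on `localhost:9999` as in Step 1. The function name `run_word_count` and the application name `StreamWordCount` are illustrative choices, not taken from the actual script.

```python
def run_word_count(host="localhost", port=9999):
    """Count words arriving on a TCP socket and print the counts to the console."""
    # PySpark is imported inside the function so this sketch can be loaded
    # without a Spark installation; a real script would usually import at the top.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    # "StreamWordCount" is a hypothetical application name.
    spark = SparkSession.builder.appName("StreamWordCount").getOrCreate()

    # Each line sent by Netcat becomes one row with a single `value` column.
    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )

    # Split each line on whitespace and emit one row per non-empty word.
    words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
    words = words.filter(words.word != "")

    # Count occurrences of each word across all input received so far.
    counts = words.groupBy("word").count()

    # "complete" output mode reprints the full table of counts on every trigger.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
```

Adding a call to `run_word_count()` at the bottom of the file would make the sketch runnable with `spark-submit`. The real `stream_wc.py` may be organized differently.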

## GDELT streaming example
GDELT (Global Database of Events, Language, and Tone) is a global event database that extracts events from news articles around the world. The dataset is updated every 15 minutes and is available in CSV format. It contains detailed information such as event dates, actors involved, locations, and more. The latest dataset can be accessed via the URL:  
http://data.gdeltproject.org/gdeltv2/lastupdate.txt

There are two scripts in this example:

- **gdelt-streaming.py:**  
  This PySpark application continuously ingests streaming CSV files from the GDELT dataset. It enriches the data by joining it with a country mapping file, computes aggregations such as the average Goldstein scale (a score from -10 to +10 assigned to each event) and the event count per country, and prints real-time results to the console, specifically the top 10 most positive and the top 10 most negative countries.

- **gdelt_update.py:**  
  This Python script periodically checks the GDELT last update URL to retrieve the latest CSV export (packaged as a ZIP file). It downloads and extracts the file into the designated `input_files` directory if it is not already present, ensuring that the most current data is available for processing. The script repeats this check every five minutes.
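
To make the aggregation in `gdelt-streaming.py` concrete, here is a plain-Python illustration of the same idea (it is not the Spark code itself): group events by country, compute the average Goldstein scale and the event count, then rank the countries. The function name `rank_countries`, the `(country, goldstein_scale)` input layout, and the sample values are made up for illustration.

```python
from collections import defaultdict


def rank_countries(events, top_n=10):
    """Return (most_positive, most_negative) rankings.

    `events` is an iterable of (country, goldstein_scale) pairs.
    Each ranking is a list of (country, avg_goldstein, event_count)
    tuples with at most `top_n` entries.
    """
    totals = defaultdict(lambda: [0.0, 0])  # country -> [sum, count]
    for country, goldstein in events:
        totals[country][0] += goldstein
        totals[country][1] += 1

    stats = [(c, s / n, n) for c, (s, n) in totals.items()]
    most_positive = sorted(stats, key=lambda row: row[1], reverse=True)[:top_n]
    most_negative = sorted(stats, key=lambda row: row[1])[:top_n]
    return most_positive, most_negative


# Made-up sample events, not real GDELT data.
sample = [("France", 4.0), ("France", 2.0), ("Brazil", -5.0), ("Japan", 1.0)]
positive, negative = rank_countries(sample, top_n=2)
print(positive)  # [('France', 3.0, 2), ('Japan', 1.0, 1)]
print(negative)  # [('Brazil', -5.0, 1), ('Japan', 1.0, 1)]
```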

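The check-and-download loop described for `gdelt_update.py` can be sketched with the standard library alone. This is an approximation under stated assumptions, not the actual script: it assumes each line of `lastupdate.txt` ends with a file URL, that the events export is the one ending in `.export.CSV.zip`, and that the ZIP contains a CSV named like the ZIP without its `.zip` suffix. The function names are hypothetical.

```python
import io
import os
import time
import urllib.request
import zipfile

LAST_UPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"
INPUT_DIR = "input_files"


def find_export_url(listing):
    """Return the URL of the events export ZIP named in the listing text.

    Assumes the URL is the last whitespace-separated field on its line.
    """
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[-1].endswith(".export.CSV.zip"):
            return fields[-1]
    return None


def download_if_missing(zip_url, dest_dir=INPUT_DIR):
    """Download and extract the ZIP unless its CSV is already in dest_dir.

    Returns True if a download happened, False if the file was present.
    """
    csv_name = os.path.basename(zip_url)[: -len(".zip")]
    if os.path.exists(os.path.join(dest_dir, csv_name)):
        return False
    os.makedirs(dest_dir, exist_ok=True)
    with urllib.request.urlopen(zip_url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall(dest_dir)
    return True


def poll(interval_seconds=300):
    """Check for a new export every interval_seconds (five minutes by default)."""
    while True:
        with urllib.request.urlopen(LAST_UPDATE_URL) as response:
            listing = response.read().decode("utf-8")
        zip_url = find_export_url(listing)
        if zip_url is not None:
            download_if_missing(zip_url)
        time.sleep(interval_seconds)
```

Calling `poll()` starts the loop and runs until interrupted. Checking for the extracted CSV before downloading is what makes repeated checks cheap when no new export has been published.
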
**How to Run on Windows:**

Open two Command Prompt windows so the scripts can run concurrently. In the first window, start `gdelt_update.py` to download the latest GDELT data and leave it running. In the second window, start `gdelt-streaming.py` to process the files as they arrive.
