<a href="https://colab.research.google.com/github/UdayLab/geoanalytics/blob/main/tests/patternMining/frequentPatternMiningTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this exercise, we learn how to find useful patterns hidden in raster data.

# Step 1: Install the Necessary Libraries

In [None]:
# Installing necessary modules
!apt update
!apt install -y nco cdo gdal-bin
!which ncrename
!which cdo
!which gdal_translate

In [None]:
!pip install -U geoanalytics

# Step 2: Creating a synthetic TIFF file containing 25 pixels (5 x 5) and 10 bands

## Step 2.1: Generating the synthetic tiff file

In [None]:
import numpy as np
import rasterio
from rasterio.enums import Compression
from rasterio.transform import from_origin
from rasterio.crs import CRS
import csv

# Image parameters
width, height = 5, 5
bands = 10
pixel_size = 1.0  # 1 degree per pixel
top_left_lon = 0.0
top_left_lat = 5.0  # Top-left starting coordinate

# Define transform and CRS
transform = from_origin(top_left_lon, top_left_lat, pixel_size, pixel_size)
crs = CRS.from_epsg(4326)

# Generate random image data (bands, height, width)
data = np.random.randint(1, 101, size=(bands, height, width), dtype=np.uint8)

# Save as GeoTIFF
output_tif = 'random_10band_5x5_wgs84.tif'
with rasterio.open(
    output_tif,
    'w',
    driver='GTiff',
    width=width,
    height=height,
    count=bands,
    dtype='uint8',
    crs=crs,
    transform=transform,
    compress=Compression.none
) as dst:
    for i in range(bands):
        dst.write(data[i], i + 1)

# Prepare CSV data and print output
csv_file = 'pixel_data.csv'
headers = ['lon', 'lat'] + [f'band{b+1}' for b in range(bands)]

rows = []
rows.append(headers)

# Print and store rows: lon, lat, band1...band10
print("\n" + "\t".join(headers))
for y in range(height):
    for x in range(width):
        lon, lat = transform * (x + 0.5, y + 0.5)  # pixel center
        values = [data[b, y, x] for b in range(bands)]
        row = [round(lon, 2), round(lat, 2)] + values
        rows.append(row)
        print("\t".join(map(str, row)))

# Save to CSV
with open(csv_file, mode='w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

print(f"\n✅ TIFF saved to: {output_tif}")
print(f"✅ CSV saved to: {csv_file}")


## Step 2.2: Viewing the tiff file

In [None]:
from geoanalytics.visualization import TiffViewer

viewer = TiffViewer.TiffViewer(inputFile='random_10band_5x5_wgs84.tif')

viewer.run(cmap='jet', title='TIFF Image')

# Step 3: Hypothetical Assumption about this Dataset


* Let us assume the raster data represents an imaginary surface of a celestial body (or object).
* Each band represents a wavelength or a point in time.

**For simplicity, let us assume that the above raster data represents sea surface temperatures (SSTs) on planet Earth.**

# Step 4: Problem Statement

Useful patterns of various types are hidden in this raster data. Such patterns can help users gain a competitive advantage and support socioeconomic development. In this exercise, we try to extract this information in the form of frequently occurring patterns.

**Since we are treating our data as sea surface temperatures, we try to identify the areas where high temperatures are frequently observed together.**

# Step 5: Identifying the locations where high SSTs were observed **simultaneously** (Frequent Pattern Discovery)

## Step 5.1: Creation of Binary Matrix (or transactional database)

* Let us treat any value greater than or equal to 50 as a high SST. Any other value is a low SST, and we are not interested in those recordings.

* Let us binarize the matrix such that
   * band value = 1 means the corresponding pixel recorded a high temperature in that band
   * band value = 0 means the corresponding pixel recorded a low temperature in that band
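
The thresholding rule above can be sketched with plain NumPy. This is only an illustration of the idea, not the internals of `RasterDF2DB`. The small 3-band, 2 x 2 array and the names `raster`, `threshold`, and `binary` are hypothetical and are not part of the tutorial's data.

```python
import numpy as np

# Hypothetical 3-band, 2x2 raster used only for illustration
# (shape: bands, height, width). It is not the tutorial's random data.
raster = np.array([
    [[10, 50], [75, 49]],    # band 1
    [[60, 20], [50, 99]],    # band 2
    [[5, 51], [30, 100]],    # band 3
], dtype=np.uint8)

threshold = 50  # values >= 50 are treated as "high SST"

# The element-wise comparison yields booleans; cast them to 0/1.
binary = (raster >= threshold).astype(np.uint8)
print(binary)
```

Note that the comparison is inclusive, so a value of exactly 50 maps to 1, matching the "greater than or equal to 50" rule.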

In [None]:
import pandas as pd

#Reading the CSV data
df = pd.read_csv('pixel_data.csv', sep=',')
df

In [None]:
from geoanalytics.conversion import RasterDF2DB

# Pass the dataframe as input dataframe
obj = RasterDF2DB.RasterDF2DB(dataframe=df)

#Preparing the binary transactional database
obj.prepareTransactionalDataframe()

#Creation of a transactional database
obj.convertToTransactionalDB(DBname='transactionalDB.csv', condition='>=', thresholdValue=50)

## Step 5.2: Print the transactional database and check out its contents

In [None]:
!cat transactionalDB.csv

Meaning of the created dataset:
* Each line represents a band. Our raster file contains 10 bands, so we have 10 lines (or transactions).
* Each line lists the pixels whose recorded value satisfied the user-specified condition.

**In this tutorial, each line represents the pixels that have recorded high SSTs at their respective bands.**
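
To make the "one line per band" layout concrete, here is a minimal pure-Python sketch of how such transactions could be assembled from pixel rows. It is an assumption-laden illustration: the three-pixel table, the `(lon,lat)` item labels, and the tab separator are hypothetical, and the exact item format written by `convertToTransactionalDB` may differ.

```python
# Hypothetical miniature of pixel_data.csv: 3 pixels, 2 bands.
rows = [
    {"lon": 0.5, "lat": 4.5, "band1": 80, "band2": 10},
    {"lon": 1.5, "lat": 4.5, "band1": 20, "band2": 90},
    {"lon": 2.5, "lat": 4.5, "band1": 55, "band2": 50},
]
threshold = 50
band_names = ["band1", "band2"]

# One transaction per band: the pixels whose value in that band meets the threshold.
transactions = []
for band in band_names:
    items = [f"({r['lon']},{r['lat']})" for r in rows if r[band] >= threshold]
    transactions.append(items)

# Print one transaction per line (tab-separated items, an assumed layout).
for items in transactions:
    print("\t".join(items))
```

In this toy example the pixel at (2.5,4.5) appears in both transactions because it is "hot" in both bands.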

In [None]:
# For verification, print the first two data lines of pixel_data.csv and transactionalDB.csv
!head -3 pixel_data.csv # contains header row
!head -2 transactionalDB.csv

## Step 5.3: Studying the statistics of the transactional database to choose an appropriate hyper-parameter (minimum support)

In [None]:
from geoanalytics.patternMining import FrequentPatternMining as pm

alg = pm.FrequentPatternMining(inputFile='transactionalDB.csv')
alg.showDBstats()

## Step 5.4: Mining Frequent Patterns

Let minSup be 4. It means we are finding the sets of pixels where high SSTs were simultaneously observed at least 4 times (out of 10 times/bands) in our dataset.

In other words, we are finding the sets of pixels that have simultaneously observed high SSTs at least 40% of the time.
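
The meaning of minimum support can be illustrated with a brute-force sketch that counts, for every candidate set of items, how many transactions contain it. This is not the algorithm used by `FrequentPatternMining`; it is a naive enumeration that is only practical for tiny inputs. The transactions, the item names `A`, `B`, `C`, and the threshold of 3 are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions: one set of "hot" pixels per band.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
min_support = 3  # a pattern must occur in at least 3 of the 5 transactions


def support(pattern, transactions):
    """Count the transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


# Enumerate every non-empty item combination and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        count = support(set(candidate), transactions)
        if count >= min_support:
            frequent[candidate] = count

for pattern, count in frequent.items():
    print(pattern, count, f"{100 * count / len(transactions):.0f}%")
```

Here every single item and every pair is frequent, while the full set `("A", "B", "C")` occurs in only 2 transactions and is therefore discarded.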

In [None]:
alg.run(minSupport=4)

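In our run, we identified 93 sets of locations that simultaneously recorded high SSTs. Because the synthetic data is generated randomly without a fixed seed, the number you obtain will likely differ.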

## Step 5.5: Saving the generated frequent patterns

In [None]:
alg.save(outputFile='FrequentPatterns.txt')

## Step 5.6: Printing the generated frequent patterns

In [None]:
!tail -10 FrequentPatterns.txt

**Format - pattern:frequency**

* The first frequent pattern indicates that the pixels (or points) (3.5,1.5) and (0.5,4.5) simultaneously recorded high SSTs 60% of the time in the dataset.
* Similar statements can be made for the remaining frequent patterns.
* Long patterns are often interesting to users, so you can identify them and study them separately.
* Frequent patterns containing a single pixel/point may be ignored.
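
The points above can be sketched as a small post-processing step that drops single-point patterns, sorts the rest by length, and converts the frequency to a percentage. The sample lines below are hypothetical, and the layout (tab-separated items followed by `:frequency`) is an assumption based on the "pattern:frequency" note; check it against your own `FrequentPatterns.txt` before reusing this.

```python
# Hypothetical lines in the assumed "pattern:frequency" layout (items tab-separated).
lines = [
    "(3.5,1.5)\t(0.5,4.5):6",
    "(2.5,2.5):7",
    "(0.5,0.5)\t(1.5,1.5)\t(2.5,2.5):4",
]
num_transactions = 10  # the tutorial's raster has 10 bands

# Split each line into its items and its frequency.
patterns = []
for line in lines:
    pattern_text, _, count_text = line.rpartition(":")
    items = pattern_text.split("\t")
    patterns.append((items, int(count_text)))

# Ignore single-point patterns and list the longest patterns first.
long_patterns = [(items, count) for items, count in patterns if len(items) >= 2]
long_patterns.sort(key=lambda p: len(p[0]), reverse=True)

for items, count in long_patterns:
    print(len(items), items, f"{100 * count / num_transactions:.0f}%")
```

With 10 transactions, a frequency of 6 corresponds to 60% of the time.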

## Step 5.7: Visualizing Long Patterns

In [None]:
from PAMI.extras.graph import visualizePatterns as fig

obj = fig.visualizePatterns('FrequentPatterns.txt',10)
obj.visualize(width=1000,height=900)