<a href="https://colab.research.google.com/github/soonieboi/sparkstuff/blob/main/Colab_MinIO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 MinIO + Colab Integration Demo

This project shows how to:

1. Run a local MinIO object storage server on your machine.
2. Expose it to the internet using Cloudflare Tunnel.
3. Generate presigned URLs with the MinIO client (`mc`).
4. Access and analyze the data in Google Colab using Pandas and PySpark.


## 📦 Set Up MinIO Locally

1. Install the MinIO server.
2. Run MinIO, storing data under `~/minio-data` (for example, `minio server ~/minio-data`).

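## 🌐 Expose MinIO with Cloudflare Tunnel

1. Run the tunnel against the local MinIO port (for example, `cloudflared tunnel --url http://localhost:9000`, assuming MinIO's default API port 9000).
2. Copy the public `trycloudflare.com` URL that the tunnel prints.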

## 📂 Upload Files to MinIO

## 🔑 Generate Presigned URL

Register the tunnel URL as an `mc` alias, then share the object. For example:

    mc alias set tunnel https://opinion-joe-res-pump.trycloudflare.com minioadmin minioadmin
    mc share download tunnel/test/BEAD_Rebu_TripData.csv
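
A presigned URL carries its own signing time and lifetime in the query string, so you can check when it will expire before pasting it into Colab. The following is a minimal sketch using only the Python standard library. The function name `presigned_expiry` and the shortened `example.invalid` URL are illustrative assumptions; the two parameter names (`X-Amz-Date`, `X-Amz-Expires`) are the ones visible in the presigned URL used later in this notebook.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url: str) -> datetime:
    """Return the UTC time at which a SigV4 presigned URL stops working."""
    params = parse_qs(urlsplit(url).query)
    # X-Amz-Date is the signing time, e.g. 20250823T062555Z (always UTC).
    signed_at = datetime.strptime(params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity window in seconds.
    lifetime = timedelta(seconds=int(params["X-Amz-Expires"][0]))
    return signed_at + lifetime


# Hypothetical, shortened URL: only the two parameters the function reads are kept.
example = (
    "https://example.invalid/test/file.csv"
    "?X-Amz-Date=20250823T062555Z&X-Amz-Expires=604800"
)
print(presigned_expiry(example).isoformat())  # 2025-08-30T06:25:55+00:00
```

Here `604800` seconds is exactly 7 days, which matches the default lifetime of `mc share download`.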

## 📊 Load Data in Colab

The cells below install PySpark, configure Spark for MinIO, and load the CSV through the presigned URL.


## ✅ Notes

- `mc share download` creates a presigned URL that is valid for 7 days by default.
- If you see encoding errors (`UnicodeDecodeError`), try `encoding="latin1"` or `encoding="ISO-8859-1"`.
- For bulk file access, use `mc cp --recursive` to upload or download entire buckets.
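
The encoding note above can be turned into a small check that runs before the CSV is parsed. This is a sketch, not part of the original notebook: `pick_encoding` is a hypothetical helper, and the sample bytes are made up for illustration. It tries each candidate encoding in order and returns the first one that decodes the raw bytes without error.

```python
def pick_encoding(raw: bytes, candidates=("utf-8", "latin1")) -> str:
    """Return the first candidate encoding that decodes `raw` cleanly."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return name
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: the byte 0xE9 ("é" in latin1) is not valid UTF-8 on its own.
sample = "café,1\n".encode("latin1")
print(pick_encoding(sample))  # latin1
```

The result can be passed to `pd.read_csv(..., encoding=...)`. Keep `latin1` last in the candidate list: it maps every byte to a character, so it never raises and would mask a file that is really UTF-8.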




In [11]:
!pip install pyspark==3.5.0




In [1]:
from pyspark.sql import SparkSession

ENDPOINT = "https://opinion-joe-res-pump.trycloudflare.com"
ACCESS_KEY = "minioadmin"
SECRET_KEY = "minioadmin"
BUCKET = "test"
KEY = "BEAD_Rebu_TripData.csv"

spark = (
    SparkSession.builder
    .appName("Colab-MinIO")
    # hadoop-aws should match the Hadoop version bundled with PySpark
    # (3.3.4 for pyspark 3.5.0); aws-java-sdk-bundle 1.12.262 pairs with it.
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262"
    )
    .getOrCreate()
)

# Point the S3A connector at MinIO instead of AWS S3.
hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", ENDPOINT)
hconf.set("fs.s3a.connection.ssl.enabled", "true")
# MinIO addresses buckets as URL paths (endpoint/bucket/key), not subdomains.
hconf.set("fs.s3a.path.style.access", "true")
hconf.set("fs.s3a.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hconf.set("fs.s3a.access.key", ACCESS_KEY)
hconf.set("fs.s3a.secret.key", SECRET_KEY)

print("Spark configured for MinIO at:", ENDPOINT)

Spark configured for MinIO at: https://opinion-joe-res-pump.trycloudflare.com
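
The cell above configures S3A, but the notebook then loads the file through a presigned URL rather than through Spark. As a hedged sketch of how the configured session could read the object directly, the helpers below build the `s3a://` URI from a bucket and key and pass it to Spark's CSV reader. Both function names are hypothetical, and `read_csv_via_s3a` is only defined here, not called, because it needs a live MinIO endpoint and the S3A jars on the classpath.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build the s3a:// URI that the Hadoop S3A connector expects."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_csv_via_s3a(spark, bucket: str, key: str):
    """Read a CSV object through S3A using an already configured SparkSession."""
    return (
        spark.read
        .option("header", "true")          # first row holds column names
        .option("inferSchema", "true")     # let Spark guess column types
        .option("encoding", "ISO-8859-1")  # same fallback as the Pandas cell
        .csv(s3a_uri(bucket, key))
    )


print(s3a_uri("test", "BEAD_Rebu_TripData.csv"))  # s3a://test/BEAD_Rebu_TripData.csv
```

With the variables from the cell above, the call would look like `read_csv_via_s3a(spark, BUCKET, KEY)`.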


In [4]:
import pandas as pd
from pyspark.sql import SparkSession

In [8]:
import pandas as pd

url = "https://opinion-joe-res-pump.trycloudflare.com/test/BEAD_Rebu_TripData.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20250823%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250823T062555Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=7810a963aeecc4e939c4a9aaccbc0e492d14861673829ced074674ba03c0dcda"

# Try with a more permissive encoding
df_pd = pd.read_csv(url, encoding="latin1")   # or encoding="ISO-8859-1"
print("✅ File loaded into Pandas with latin1 encoding")
print(df_pd.head())

✅ File loaded into Pandas with latin1 encoding
   Sno      Date  Day  Hour of Day Trip Start Time HHMM Pickup District  \
0    1  1-Jan-24  Mon            0                00:01       Kew Drive   
1    2  1-Jan-24  Mon            0                00:01          Marina   
2    3  1-Jan-24  Mon            0                00:02          Bishan   
3    4  1-Jan-24  Mon            0                00:05     Suntec City   
4    5  1-Jan-24  Mon            0                00:06       Chinatown   

   DropOff District  Distance Travelled  Trip Duration in Seconds  \
0          Clementi                25.1                      3001   
1     Clementi Park                12.8                      1671   
2       Hume Avenue                19.8                      3332   
3  Upper East Coast                16.8                      2904   
4           Geylang                17.6                      2413   

  Trip End Time Taxi Number Taxi Type  Taxi Capacity  Number Of Passengers  \
0        

In [9]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Colab-MinIO").getOrCreate()
df_spark = spark.createDataFrame(df_pd)
df_spark.printSchema()
df_spark.show(5, truncate=False)

root
 |-- Sno: long (nullable = true)
 |-- Date: string (nullable = true)
 |-- Day: string (nullable = true)
 |-- Hour of Day: long (nullable = true)
 |-- Trip Start Time HHMM: string (nullable = true)
 |-- Pickup District: string (nullable = true)
 |--  DropOff District: string (nullable = true)
 |-- Distance Travelled: double (nullable = true)
 |-- Trip Duration in Seconds: long (nullable = true)
 |-- Trip End Time: string (nullable = true)
 |-- Taxi Number: string (nullable = true)
 |-- Taxi Type: string (nullable = true)
 |-- Taxi Capacity: long (nullable = true)
 |-- Number Of Passengers: long (nullable = true)
 |-- Trip Fare: double (nullable = true)
 |-- Passenger ID: long (nullable = true)
 |-- Passenger Name: string (nullable = true)

+---+--------+---+-----------+--------------------+---------------+-----------------+------------------+------------------------+-------------+-----------+---------+-------------+--------------------+---------+------------+--------------+
|Sno|Da
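
The schema output above prints ` DropOff District` with a leading space, which suggests the CSV header carries stray whitespace. Column names like that are easy to mistype in later `select` or `filter` calls. The sketch below shows one way to normalize them; `clean_column_names` is a hypothetical helper, and the short column list is a subset taken from the schema for illustration.

```python
def clean_column_names(names):
    """Strip surrounding whitespace and collapse inner runs of spaces."""
    return [" ".join(str(name).split()) for name in names]


# Subset of the column names shown in the schema, including the padded one.
columns = ["Pickup District", " DropOff District", "Trip Fare"]
print(clean_column_names(columns))
# ['Pickup District', 'DropOff District', 'Trip Fare']
```

Applied to the notebook's data, this would be `df_pd.columns = clean_column_names(df_pd.columns)` before calling `spark.createDataFrame(df_pd)`.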

In [10]:
df_spark.describe()  # returns a DataFrame of summary statistics; append .show() to print the values

DataFrame[summary: string, Sno: string, Date: string, Day: string, Hour of Day: string, Trip Start Time HHMM: string, Pickup District: string,  DropOff District: string, Distance Travelled: string, Trip Duration in Seconds: string, Trip End Time: string, Taxi Number: string, Taxi Type: string, Taxi Capacity: string, Number Of Passengers: string, Trip Fare: string, Passenger ID: string, Passenger Name: string]