<a href="https://colab.research.google.com/github/ravipatil33/llama-stack/blob/example-notebook/DEMO_Data_Types_in_AI_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Specialized Data Formats for AI/ML**

* Parquet
* HDF5
* TFRecord
* Image Formats (JPG, PNG)

Parquet: Columnar storage format optimized for big data analytics.

HDF5: Hierarchical format for large datasets with support for complex structures.

TFRecord: TensorFlow’s format for large-scale datasets, particularly in deep learning.

Image Formats: Optimized storage for image data (e.g., JPG, PNG).

Demo: PyTorch native format, used to hold image tensors and display one of them.



[1] **Parquet Format**

Usage:
- Efficient for big data applications, supports columnar storage, used in distributed systems like Apache Spark.

Advantages:
- High compression, faster column-wise operations, good for numeric data.

In [None]:
import pandas as pd
import pyarrow.parquet as pq

# Create a DataFrame
df = pd.DataFrame({
    'Name': ['Alpha', 'Beta', 'Charlie', 'Delta'],
    'Age': [25, 30, 35, 32],
    'Skill Level': [85.5, 90.2, 88.7, 92]
})

# Save DataFrame to Parquet
df.to_parquet('data.parquet')

# Read the Parquet file
df_parquet = pd.read_parquet('data.parquet')
print(df_parquet)


This is a single example.

However, Parquet can be used to store many different kinds of data, for example:

- Weather data
- Titanic passenger list
- Flight data


In [None]:
# Connect Colab to Google Drive
import os
from google.colab import drive

drive.mount('/content/drive')


In [None]:
# Upload a file from the local machine to the Colab session

import os
from google.colab import files

uploaded = files.upload()

# List files in the current working directory
os.listdir()


In [None]:
# Read content of the files : Parquet

import pandas as pd
import pyarrow.parquet as pq


# Read the Parquet file
#df_parquet = pd.read_parquet('MT cars.parquet')
df_parquet = pd.read_parquet('Flights 1m.parquet')
#df_parquet = pd.read_parquet('Weather.parquet')
#df_parquet = pd.read_parquet('Titanic.parquet')

# Print entire content of data file.
#print(df_parquet)

# Print specific row in parquet data file
print(df_parquet.iloc[0])

# Get details of parquet file
#pfile = pq.read_table('Titanic.parquet')

# Print Schema of the data file
#print("Column names: {}".format(pfile.column_names))
#print("Schema: {}".format(pfile.schema))




[2] **HDF5 Format**

Usage:
- Efficiently handles large datasets and complex hierarchies.
- Used extensively in scientific computing and deep learning.

Advantages:

- Supports storage of large multi-dimensional arrays and hierarchical data.


In [None]:
import h5py
import numpy as np

# Create an HDF5 file
with h5py.File('data.h5', 'w') as f:
    # Create a dataset
    data = f.create_dataset('dataset_1', (100,), dtype='i')
    data[...] = np.arange(100)  # Fill with data

# Read the HDF5 file
with h5py.File('data.h5', 'r') as f:
    print(f['dataset_1'][:])
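The example above stores a single flat dataset. The hierarchical side of HDF5 could be sketched as below, using groups (which behave like folders) and attributes (small pieces of metadata). The group names, dataset names, and the `learning_rate` attribute are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'hierarchy.h5')

with h5py.File(path, 'w') as f:
    # Intermediate groups in the path are created automatically.
    run = f.create_group('experiments/run_1')
    run.create_dataset('images', data=np.zeros((4, 28, 28), dtype='uint8'))
    run.create_dataset('labels', data=np.array([0, 1, 2, 3]))
    # Attributes attach metadata to a group or dataset.
    run.attrs['learning_rate'] = 0.01

with h5py.File(path, 'r') as f:
    # Collect the name of every group and dataset in the file.
    names = []
    f.visit(names.append)
    print(names)

    # Datasets are addressed with a path, much like files in folders.
    images_shape = f['experiments/run_1/images'].shape
    learning_rate = f['experiments/run_1'].attrs['learning_rate']
    print(images_shape, learning_rate)
```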



In [None]:
# Create arrays using HDF5 Format

import h5py
import numpy as np

arr1 = np.random.randn(10000)
arr2 = np.random.randn(10000)

with h5py.File('complex_read.hdf5', 'w') as f:
    f.create_dataset('array_1', data=arr1)
    f.create_dataset('array_2', data=arr2)

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']

    data = []

    for i in range(len(d1)):
        if d1[i] > 0:
            data.append(d2[i])


print('The length of data with a for loop: {}'.format(len(data)))

# Print entire array
#print(data)

# Print specific entry
print(data[5])
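The element-by-element loop above reads from the file on every iteration. As a possible alternative, assuming both arrays fit in memory, the datasets can be read once into NumPy arrays and filtered with a boolean mask. The sketch below is self-contained, so it writes its own temporary file rather than reusing `complex_read.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np

# A seeded generator keeps the sketch reproducible.
rng = np.random.default_rng(0)
arr1 = rng.standard_normal(10000)
arr2 = rng.standard_normal(10000)

path = os.path.join(tempfile.mkdtemp(), 'vectorized_read.hdf5')

with h5py.File(path, 'w') as f:
    f.create_dataset('array_1', data=arr1)
    f.create_dataset('array_2', data=arr2)

with h5py.File(path, 'r') as f:
    # Slicing with [:] reads the whole dataset into a NumPy array.
    d1 = f['array_1'][:]
    d2 = f['array_2'][:]

# Keep the entries of d2 where the matching entry of d1 is positive.
selected = d2[d1 > 0]
print('The length of data with a boolean mask: {}'.format(len(selected)))
```

For datasets too large for memory, the loop or a chunked read may still be the better fit.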


[3] **TFRecord Format**

Usage:
- TensorFlow's preferred format for efficient storage of large datasets.

Advantages:
- Supports sequential reading of data for large-scale deep learning tasks.


In [None]:
import tensorflow as tf

# Define a function to create an example
def create_example():
    feature = {
        'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'Ravindra'])),
        'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[30])),
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

# Write to TFRecord
with tf.io.TFRecordWriter('data.tfrecord') as writer:
    writer.write(create_example())

# Read TFRecord
raw_dataset = tf.data.TFRecordDataset('data.tfrecord')

# Print the record
for record in raw_dataset:
    print(tf.train.Example.FromString(record.numpy()))
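Printing the raw `Example` proto is useful for inspection, but a training pipeline usually needs tensors. One common way to get them, sketched here under the assumption that every record has the same two features, is to declare a feature description and map `tf.io.parse_single_example` over the dataset. The file name and the value `b'Alpha'` are placeholders.

```python
import os
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), 'people.tfrecord')

# Write one record with a bytes feature and an int64 feature.
example = tf.train.Example(features=tf.train.Features(feature={
    'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'Alpha'])),
    'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[30])),
}))
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Describe the expected type and shape of each feature.
feature_description = {
    'name': tf.io.FixedLenFeature([], tf.string),
    'age': tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    # Turn one serialized Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_description)

dataset = tf.data.TFRecordDataset(path).map(parse)

records = [
    (r['name'].numpy().decode('utf-8'), int(r['age'].numpy()))
    for r in dataset
]
print(records)
```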


In [None]:
# Digit recognizer in Python using TensorFlow/Keras
# (MNIST is loaded here as in-memory NumPy arrays, not from a TFRecord file)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Normalize pixel values to be between 0 and 1
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Reshape the input data to be a 4D tensor
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# Define the model
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=1)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('\nTest accuracy:', test_acc)

# Make predictions
predictions = model.predict(x_test)

# Print some predictions and their corresponding labels
for i in range(2):
    predicted_label = np.argmax(predictions[i])
    true_label = y_test[i]
    print('Predicted label:', predicted_label)
    print('True label:', true_label)
    plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
    plt.show()


[4] **Image Formats (JPG, PNG)**

Usage:
- Image storage and processing in computer vision tasks.

Advantages:
- Different compression levels, widespread compatibility.

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

# Open an image file
img = Image.open('image.jpeg')

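# convert() changes the image mode:
# 'L' gives 8-bit grayscale, '1' gives 1-bit black and white.
img1 = img.convert('L')
img2 = img.convert('1')

# Use a gray colormap; otherwise matplotlib applies its default colormap to single-channel data.
plt.imshow(img1, cmap='gray')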
plt.axis('off')
plt.show()
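The "different compression levels" point can be illustrated with a rough sketch that encodes the same image as PNG (lossless) and as JPEG at two quality settings. It uses a synthetic noise image so that it runs without `image.jpeg`. Noise compresses poorly, so the byte counts are not representative of real photographs; the helper `encoded` is a hypothetical name introduced for this example.

```python
import io

import numpy as np
from PIL import Image

# Synthetic 64x64 RGB image filled with random pixel values.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
img = Image.fromarray(pixels)

def encoded(image, fmt, **options):
    """Encode the image in memory and return the resulting bytes."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt, **options)
    return buffer.getvalue()

png_bytes = encoded(img, 'PNG')
jpeg_high = encoded(img, 'JPEG', quality=90)
jpeg_low = encoded(img, 'JPEG', quality=30)

# Decode again to check whether the pixel values survived.
png_roundtrip = np.array(Image.open(io.BytesIO(png_bytes)))
jpeg_roundtrip = np.array(Image.open(io.BytesIO(jpeg_low)))

print('PNG bytes:', len(png_bytes))
print('JPEG bytes (quality 90):', len(jpeg_high))
print('JPEG bytes (quality 30):', len(jpeg_low))
print('PNG is lossless:', np.array_equal(png_roundtrip, pixels))
print('JPEG is lossless:', np.array_equal(jpeg_roundtrip, pixels))
```

For training data where exact pixel values matter, such as segmentation masks, a lossless format like PNG is usually the safer choice.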

In [None]:
# Convert the image into its numeric (array) representation

from PIL import Image
import numpy as np

# Open the image
img = Image.open('image.jpeg')

# Convert the image to a NumPy array
image_array = np.array(img)

# Print the complete array of the image
#print(image_array)

# Access a specific pixel: row 2, column 4 (for an RGB image this is an array of channel values)
print(image_array[2, 4])

**PyTorch Native Format**


- The PyTorch native format is flexible, easy to use, and optimized for saving models during development, training, and deployment within PyTorch-based ecosystems.

Use cases:

- Saving and loading model weights (state_dict)
- Saving and loading the entire model (architecture + weights)
- Checkpointing during training
- Model inference and deployment
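The first use case, saving and loading weights through a `state_dict`, could look roughly like this. The tiny `nn.Linear` model and the file name `weights.pt` are stand-ins chosen for the sketch; a real model would be saved the same way.

```python
import os
import tempfile

import torch
from torch import nn

# A deliberately tiny model, used only to have some weights to save.
model = nn.Linear(4, 2)

path = os.path.join(tempfile.mkdtemp(), 'weights.pt')

# Save only the parameters, not the Python class definition.
torch.save(model.state_dict(), path)

# To load, build the same architecture first, then restore the weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()

# Both models should now produce the same output for the same input.
x = torch.ones(1, 4)
same = torch.equal(model(x), restored(x))
print('Outputs match:', same)
```

Saving the `state_dict` rather than the whole model object keeps the file independent of how the class is defined or where it lives in the code base.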


In [None]:
# Demo: building image tensors for use with the PyTorch native format

import torch
import numpy as np
import matplotlib.pyplot as plt

# Generate random data
data = np.random.rand(100, 3, 224, 224)  # Example: 100 images, 3 channels, 224x224 resolution
labels = np.random.randint(0, 10, size=(100,))  # Example: 100 labels between 0 and 9

# Convert to PyTorch tensors
data_tensor = torch.from_numpy(data).float()
labels_tensor = torch.from_numpy(labels).long()

# Print sample data
print("Sample Data (First image, first channel):")

print(data_tensor[0, 0, :, :])
print("Sample Label (First image):")
print(labels_tensor[0])

plt.imshow(data_tensor[0, 0, :, :])
plt.axis('off')
plt.show()
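The demo above builds the tensors but does not write them to disk. A minimal sketch of the storage step, assuming a small random batch and a hypothetical file name `dataset.pt`, saves the tensors with `torch.save` and reads them back with `torch.load`.

```python
import os
import tempfile

import torch

# Small random batch: 8 images, 3 channels, 32x32 pixels, with labels 0 to 9.
data_tensor = torch.rand(8, 3, 32, 32)
labels_tensor = torch.randint(0, 10, (8,))

path = os.path.join(tempfile.mkdtemp(), 'dataset.pt')

# A dict of tensors is serialized to a single file.
torch.save({'data': data_tensor, 'labels': labels_tensor}, path)

# Load the file back and pick out the first image and its label.
loaded = torch.load(path)
first_image = loaded['data'][0]
first_label = loaded['labels'][0]

# Reorder from (channels, height, width) to (height, width, channels),
# the layout that plt.imshow expects for a color image.
image_hwc = first_image.permute(1, 2, 0)
print(image_hwc.shape, int(first_label))
```

The resulting `image_hwc` tensor can be passed to `plt.imshow(image_hwc.numpy())` to display the stored image.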