<a href="https://colab.research.google.com/github/schumbar/SJSU_CMPE255/blob/main/assignment_04/ApacheBeam/Part_C_ApacheBeam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 04: Apache Beam Data Engineering Assignment
### Part C: Apache Beam Features
By Shawn Chumbar
  
Please note that I have used ChatGPT to assist me with this assignment.


Tasks:
1. Composite transform
2. Pipeline IO
3. Triggers
4. Windowing
5. ParDo

Sources:
1. [About Beam ML](https://beam.apache.org/documentation/ml/about-ml/)
2. [Get started with AI/ML pipelines](https://beam.apache.org/documentation/ml/overview/)
3. [Use RunInference with Sklearn](https://beam.apache.org/documentation/transforms/python/elementwise/runinference-sklearn/)
4. [Apache Beam Tutorial](https://www.macrometa.com/event-stream-processing/apache-beam-tutorial)
5. [Intro to Apache Beam - Python](https://colab.research.google.com/drive/1qrqbpRpfMtwosjcZQ3_qAWvBCXtzs-8D?usp=sharing)

Dataset Link:
[Healthcare Insurance](https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance)

In [5]:
!pip install apache-beam



In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import pandas as pd
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

#### Loading Dataset

In [8]:
# Load the dataset into a pandas DataFrame for inspection
file_path = '/content/drive/MyDrive/SJSU/CMPE_255/assignment_04/datasets/insurance.csv'
data = pd.read_csv(file_path)
data.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


#### Exploring Data

In [9]:
data.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

The dataset has seven columns:
* age
* sex
* bmi
* children
* smoker
* region
* charges
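
To make the column types concrete, here is a minimal sketch that reads two rows copied from the `data.head()` output and converts each field. The names `SAMPLE_CSV`, `COLUMN_TYPES`, and `typed_rows` are illustrative and not part of the notebook, and the type chosen for each column is an assumption based on the values shown in the preview.

```python
import csv
import io

# Two rows copied from the data.head() output above, with the CSV header line.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
"""

# Assumed type for each column, based on the values shown in the preview.
COLUMN_TYPES = {
    'age': int,
    'sex': str,
    'bmi': float,
    'children': int,
    'smoker': str,
    'region': str,
    'charges': float,
}

def typed_rows(text):
    """Yield one dict per CSV row with each field converted to its assumed type."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {name: COLUMN_TYPES[name](value) for name, value in row.items()}

rows = list(typed_rows(SAMPLE_CSV))
print(rows[0])
```

Using `csv.DictReader` rather than splitting on commas also copes with quoted fields, although this dataset does not appear to contain any.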




In [10]:


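# Parse one CSV row into a dictionary with typed fields.
# Assumes the column order age, sex, bmi, children, smoker, region, charges and no quoted commas.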
def parse_csv(line):
    fields = line.split(',')
    return {
        'age': int(fields[0]),
        'sex': fields[1],
        'bmi': float(fields[2]),
        'children': int(fields[3]),
        'smoker': fields[4],
        'region': fields[5],
        'charges': float(fields[6])
    }

# Format a (key, value) pair as one CSV line.
# Used with beam.MapTuple, which unpacks each tuple into the two arguments.
def to_csv_string(key, value):
    return f"{key},{value}"


# Composite transform: average charge per value of a grouping column (e.g., smoker status)
class CalculateAverageChargeByGroup(beam.PTransform):
    def __init__(self, group_key):
        super().__init__()
        self.group_key = group_key

    def expand(self, pcoll):
        # Bind the column name locally so the lambda does not capture `self`
        group_key = self.group_key
        return (
            pcoll
            | 'Extract Key Value' >> beam.Map(lambda elem: (elem[group_key], elem['charges']))
            # Mean.PerKey combines incrementally, so it never calls len() on grouped values,
            # which are not guaranteed to be a list on every runner
            | 'Calculate Average' >> beam.combiners.Mean.PerKey()
        )

# Define the pipeline
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    csv_lines = (
        p
        | 'Read from CSV' >> beam.io.ReadFromText(file_path, skip_header_lines=1)
        | 'Parse CSV to Dict' >> beam.Map(parse_csv)
    )

    # Calculate average charge by 'smoker' status as an example
    average_charge_by_smoker = (
        csv_lines
        | 'Average Charge by Smoker' >> CalculateAverageChargeByGroup('smoker')
    )
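    # Convert the results to CSV lines and write them to a single file.
    # Without num_shards=1 and shard_name_template='', WriteToText appends a
    # shard suffix such as -00000-of-00001 after the file name.
    (
        average_charge_by_smoker
        | 'Format as CSV' >> beam.MapTuple(to_csv_string)
        | 'Write to File' >> beam.io.WriteToText(
            '/content/drive/MyDrive/SJSU/CMPE_255/assignment_04/datasets/output',
            file_name_suffix='.csv',
            num_shards=1,
            shard_name_template='')
    )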
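
As a sanity check on the pipeline output, the per-group average can be recomputed in plain Python on a small sample. The sketch below mirrors what the composite transform computes, using only the five rows shown in the `data.head()` output. The names `sample_rows` and `average_charge_by_group` are illustrative and not part of the notebook. Because it covers only five rows, its numbers will not match the averages over the full dataset.

```python
from collections import defaultdict

# The five rows shown in the data.head() output, reduced to the two fields
# the composite transform uses, with the same keys that parse_csv produces.
sample_rows = [
    {'smoker': 'yes', 'charges': 16884.924},
    {'smoker': 'no', 'charges': 1725.5523},
    {'smoker': 'no', 'charges': 4449.462},
    {'smoker': 'no', 'charges': 21984.47061},
    {'smoker': 'no', 'charges': 3866.8552},
]

def average_charge_by_group(rows, group_key):
    """Return {group value: mean charge} for the given grouping column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row['charges']
        counts[row[group_key]] += 1
    return {key: totals[key] / counts[key] for key in totals}

expected = average_charge_by_group(sample_rows, 'smoker')
# Print in the same key,value layout that to_csv_string writes.
for key, value in sorted(expected.items()):
    print(f"{key},{value}")
```

Running the same function over rows parsed from the full CSV gives reference values to compare against the file written by the pipeline.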




