# Getting Started with PyFlink with Soumil Shah

# Step 1: Install Libraries and Packages

In [3]:
! pip install Faker

Collecting Faker
  Obtaining dependency information for Faker from https://files.pythonhosted.org/packages/73/51/cbc859707aa0fc0ad3819ffb3bdaeee28d10d5ef30150ed9d16691ac3795/Faker-19.6.1-py3-none-any.whl.metadata
  Using cached Faker-19.6.1-py3-none-any.whl.metadata (15 kB)
Using cached Faker-19.6.1-py3-none-any.whl (1.7 MB)
Installing collected packages: Faker
Successfully installed Faker-19.6.1


In [8]:
! pip show apache-flink

Name: apache-flink
Version: 1.17.1
Summary: Apache Flink Python API
Home-page: https://flink.apache.org
Author: Apache Software Foundation
Author-email: dev@flink.apache.org
License: https://www.apache.org/licenses/LICENSE-2.0
Location: /Users/soumilnitinshah/anaconda3/envs/my-new-environment/lib/python3.8/site-packages
Requires: apache-beam, apache-flink-libraries, avro-python3, cloudpickle, fastavro, httplib2, numpy, pandas, pemja, protobuf, py4j, pyarrow, python-dateutil, pytz, requests
Required-by: 


In [1]:
! java -version

openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)


# Step 2: Basics

# PyFlink Offers 
* DataStream API
* Table API 

# Table API 
* Apache Flink offers a Table API as a unified, relational API for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, batch data sets and produce the same results. The Table API in Flink is commonly used to ease the definition of data analytics, data pipelining, and ETL applications.

## Table Environment


-> streaming TableEnvironment
```
from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)
```


-> batch TableEnvironment
```
env_settings = EnvironmentSettings.in_batch_mode()
table_env = TableEnvironment.create(env_settings)
```


#### Creating a Table from a List of Tuples

In [6]:
from pyflink.table import EnvironmentSettings, TableEnvironment
from faker import Faker

# Create a batch TableEnvironment
env_settings = EnvironmentSettings.in_batch_mode()
table_env = TableEnvironment.create(env_settings)

# Initialize Faker
fake = Faker()

# Generate fake data and convert it into a PyFlink table with column names
data = [(fake.name(), fake.city(), fake.state()) for _ in range(10)]  # Generate 10 rows of fake data

# Define column names
column_names = ["name", "city", "state"]

# Create a PyFlink table with column names
table = table_env.from_elements(data, schema=column_names)

# Print the table
table.execute().print()


+--------------------------------+--------------------------------+--------------------------------+
|                           name |                           city |                          state |
+--------------------------------+--------------------------------+--------------------------------+
|                 Anthony Chavez |                     New Pamela |                          Maine |
|                Scott Blanchard |                   New Ianhaven |                      Louisiana |
|                  Jerry Jackson |            Port Katherinemouth |                         Kansas |
|             Jessica Cunningham |                       Maryview |                       Kentucky |
|               Nicholas Morales |                     Conwayland |                  Massachusetts |
|                 Kimberly Lynch |                     Port Jared |                        Alabama |
|                 William Jordan |                    Randallbury |                        

# Creating a Temporary View


In [7]:
table_env.create_temporary_view('source_table', table)

table_env.execute_sql("SELECT * FROM source_table").print()

+--------------------------------+--------------------------------+--------------------------------+
|                           name |                           city |                          state |
+--------------------------------+--------------------------------+--------------------------------+
|                 Anthony Chavez |                     New Pamela |                          Maine |
|                Scott Blanchard |                   New Ianhaven |                      Louisiana |
|                  Jerry Jackson |            Port Katherinemouth |                         Kansas |
|             Jessica Cunningham |                       Maryview |                       Kentucky |
|               Nicholas Morales |                     Conwayland |                  Massachusetts |
|                 Kimberly Lynch |                     Port Jared |                        Alabama |
|                 William Jordan |                    Randallbury |                        

### Selecting Columns

In [10]:
from pyflink.table.expressions import col

table.select(col("name"), col("city")).execute().print()

+--------------------------------+--------------------------------+
|                           name |                           city |
+--------------------------------+--------------------------------+
|                 Anthony Chavez |                     New Pamela |
|                Scott Blanchard |                   New Ianhaven |
|                  Jerry Jackson |            Port Katherinemouth |
|             Jessica Cunningham |                       Maryview |
|               Nicholas Morales |                     Conwayland |
|                 Kimberly Lynch |                     Port Jared |
|                 William Jordan |                    Randallbury |
|                  William Hicks |                   Hubbardmouth |
|                    Marie Moore |                      South Tim |
|               Christina Thomas |                   South Joseph |
+--------------------------------+--------------------------------+
10 rows in set


### Filtering Data

In [11]:
from pyflink.table.expressions import col

table \
    .select(col("name"), col("city"), col("state")) \
    .where(col("state") == 'Vermont') \
    .execute().print()

+--------------------------------+--------------------------------+--------------------------------+
|                           name |                           city |                          state |
+--------------------------------+--------------------------------+--------------------------------+
|                 William Jordan |                    Randallbury |                        Vermont |
+--------------------------------+--------------------------------+--------------------------------+
1 row in set


### Group By

In [12]:
table \
    .group_by(col("state")) \
    .select(col("state").alias("state"), col("name").count.alias("count")) \
    .execute().print()

+--------------------------------+----------------------+
|                          state |                count |
+--------------------------------+----------------------+
|                         Kansas |                    1 |
|                        Alabama |                    1 |
|                         Hawaii |                    2 |
|                          Maine |                    1 |
|                  Massachusetts |                    1 |
|                       Kentucky |                    1 |
|                      Louisiana |                    1 |
|                        Arizona |                    1 |
|                        Vermont |                    1 |
+--------------------------------+----------------------+
9 rows in set


# Creating a Sink

In [13]:
table_env.execute_sql("""
    CREATE TABLE print_sink (
        name STRING, 
        city STRING,
        state STRING
    ) WITH (
        'connector' = 'print'
    )
""")

table_env.execute_sql("""
    INSERT INTO print_sink
        SELECT * FROM source_table
""").wait()



+I[Anthony Chavez, New Pamela, Maine]
+I[Scott Blanchard, New Ianhaven, Louisiana]
+I[Jerry Jackson, Port Katherinemouth, Kansas]
+I[Jessica Cunningham, Maryview, Kentucky]
+I[Nicholas Morales, Conwayland, Massachusetts]
+I[Kimberly Lynch, Port Jared, Alabama]
+I[William Jordan, Randallbury, Vermont]
+I[William Hicks, Hubbardmouth, Hawaii]
+I[Marie Moore, South Tim, Arizona]
+I[Christina Thomas, South Joseph, Hawaii]


# Collect Results to Client 

In [76]:
table_result = table_env.execute_sql("SELECT * FROM source_table")

# collect() returns a closeable iterator; the with block releases it when done
with table_result.collect() as results:
    for result in results:
        print(result)


<Row('Deborah Elliott', 'Boydhaven', 'Washington')>
<Row('Nathaniel Lee', 'Christinemouth', 'Wyoming')>
<Row('James White', 'Garzabury', 'Georgia')>
<Row('Timothy Ortiz', 'New Phillip', 'Washington')>
<Row('Amanda Flores', 'Jesseborough', 'California')>
<Row('Christopher Hawkins', 'New Jonathan', 'Alaska')>
<Row('Kathy Jones', 'Lake Tammy', 'Alabama')>
<Row('Chad Woodward', 'Grantmouth', 'Utah')>
<Row('Allison Smith', 'South Tylermouth', 'Maryland')>
<Row('Thomas Larson', 'Santiagostad', 'Georgia')>


# Convert Pandas DataFrame to PyFlink Table and Vice Versa

In [78]:
pandas_df = table.to_pandas()
pandas_df

                  name              city       state
0      Deborah Elliott         Boydhaven  Washington
1        Nathaniel Lee    Christinemouth     Wyoming
2          James White         Garzabury     Georgia
3        Timothy Ortiz       New Phillip  Washington
4        Amanda Flores      Jesseborough  California
5  Christopher Hawkins      New Jonathan      Alaska
6          Kathy Jones        Lake Tammy     Alabama
7        Chad Woodward        Grantmouth        Utah
8        Allison Smith  South Tylermouth    Maryland
9        Thomas Larson      Santiagostad     Georgia


In [81]:
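# Create a PyFlink Table from a Pandas DataFrame.
# Assumption: t_env is a streaming TableEnvironment (the `op` column in the output suggests streaming mode).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
table_temp = t_env.from_pandas(pandas_df)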
table_temp.execute().print()

+----+--------------------------------+--------------------------------+--------------------------------+
| op |                           name |                           city |                          state |
+----+--------------------------------+--------------------------------+--------------------------------+
| +I |                Deborah Elliott |                      Boydhaven |                     Washington |
| +I |                  Nathaniel Lee |                 Christinemouth |                        Wyoming |
| +I |                    James White |                      Garzabury |                        Georgia |
| +I |                  Timothy Ortiz |                    New Phillip |                     Washington |
| +I |                  Amanda Flores |                   Jesseborough |                     California |
| +I |            Christopher Hawkins |                   New Jonathan |                         Alaska |
| +I |                    Kathy Jones |       

# UDF

In [82]:
table.execute().print()

+--------------------------------+--------------------------------+--------------------------------+
|                           name |                           city |                          state |
+--------------------------------+--------------------------------+--------------------------------+
|                Deborah Elliott |                      Boydhaven |                     Washington |
|                  Nathaniel Lee |                 Christinemouth |                        Wyoming |
|                    James White |                      Garzabury |                        Georgia |
|                  Timothy Ortiz |                    New Phillip |                     Washington |
|                  Amanda Flores |                   Jesseborough |                     California |
|            Christopher Hawkins |                   New Jonathan |                         Alaska |
|                    Kathy Jones |                     Lake Tammy |                        

In [103]:
import uuid

from pyflink.table.udf import udf
from pyflink.table.expressions import col, call


def generate_guid():
    """Return a random UUID4 as a string."""
    return str(uuid.uuid4())


# Wrap the Python function as a scalar UDF; every call returns a new value,
# so the function is declared non-deterministic
myhash = udf(generate_guid, result_type='STRING', deterministic=False)

result_table = table.select(col("city"), col("name"), call(myhash).alias("guid"))

result_table.execute().print()

+--------------------------------+--------------------------------+--------------------------------+
|                           city |                           name |                           guid |
+--------------------------------+--------------------------------+--------------------------------+
|                      Boydhaven |                Deborah Elliott | 5982f808-fdf8-45ae-a25d-f10... |
|                 Christinemouth |                  Nathaniel Lee | 9f5b03e3-56aa-4b2f-ab71-160... |
|                      Garzabury |                    James White | 6b511392-71a8-4634-91ef-813... |
|                    New Phillip |                  Timothy Ortiz | 9d4236a5-dde4-46c0-83ea-e77... |
|                   Jesseborough |                  Amanda Flores | ed3b09e7-2c33-44ca-b190-857... |
|                   New Jonathan |            Christopher Hawkins | 7ea8e9c7-b86e-4660-93cf-5f9... |
|                     Lake Tammy |                    Kathy Jones | ba85202f-8d7b-4857-b789

# References
* https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/table/intro_to_table_api/