# Today's topics

- Who am I?
- Go over the prerequisites in the README file
- Why python?
    - Small demos

## Who am I?

For those that don't know me...

- I am the Lead QA for the DPT team
    - and I lead a team of one...me :)
- I'm a language nerd that's programmed in something like 15+ languages
    - java, c, c++, rust, python, javascript, typescript, haskell, clojure, perl to name the biggies
- The first half of my career was in the hardware industry, including some linux kernel driver code
- Spent about 60% of my career in QA, and 40% as a dev

## Why python?

Python is one of the most in-demand languages on the market right now, as shown by Stack Overflow surveys.

- The basics are easy to learn
- Python is king of data analytics on local machines 
- It has a modern type system with literal types, unions, and None handling
- Python is the lingua franca of Machine Learning and AI in general

### The basics are simple

Python is relatively simple and thus can be adopted quickly.  This is a big advantage over rust, which has a reputation for being hard to learn.

- Being simple makes python great for prototyping code.  
- Development is much faster than in most languages, which is ideal for ad-hoc/exploratory testing and experiments

In [ ]:
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import UUID, uuid4

from pydantic import BaseModel

# Without type annotations (Please don't do this!!)
def to_datetime_bad(ymd):
    return datetime.strptime(ymd, "%Y_%m_%d")

# With type annotations, ymd is a string parameter, and the function returns a datetime object
def to_datetime(ymd: str) -> datetime:
    """Returns a year_month_day str format as a datetime"""  # <- a docstring
    return datetime.strptime(ymd, "%Y_%m_%d")  # python is not expression oriented, so you must use the return keyword

print(to_datetime("2023_8_27"))



In [17]:
# A dataclass is like a Java POJO and automatically generates methods like __eq__ and __hash__
@dataclass
class EventDetails:
    uuid: UUID
    created: datetime
    s3_path: str

# While dataclass is quick and easy, it's not always the best choice for JSON (de)serialization
uid = uuid4()
now = datetime.now(tz=timezone.utc)
details1 = EventDetails(uuid=uid, created=now, s3_path="s3://foo")
details2 = EventDetails(uuid=uid, created=now, s3_path="s3://foo")
print(details1)
print(f"Does details1 have the same value as details2? {details1 == details2}")
print(f"Is details1 the same object as details2? {details1 is details2}")
print(f"Address of details1 = {id(details1)}, details2 = {id(details2)}")

# For serde, sometimes it's better to use pydantic, as it offers validation, and more JSON serializable types
class EventDetails2(BaseModel):
    uuid: UUID
    created: datetime
    s3_path: str

details = EventDetails2(uuid=uuid4(), created=datetime.now(tz=timezone.utc), s3_path="s3://foo")
print(details.json(indent=2))  # pydantic v1 API; in pydantic v2 use details.model_dump_json(indent=2)

EventDetails(uuid=UUID('1819574a-089e-4fe8-92a4-51eec7b9b5a2'), created=datetime.datetime(2023, 9, 11, 14, 46, 49, 837058, tzinfo=datetime.timezone.utc), s3_path='s3://foo')
Does details1 have the same value as details2? True
Is details1 the same object as details2? False
Address of details1 = 5650290064, details2 = 5645234320
{
  "uuid": "947c5694-d064-4bb7-82c6-0ebdb17db40d",
  "created": "2023-09-11T14:46:49.852535+00:00",
  "s3_path": "s3://foo"
}
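
The comment above about dataclasses and JSON can be shown with the standard library alone. This is a minimal sketch that reuses the `EventDetails` shape from the cell above; the fixed UUID and timestamp are made-up values so the run is reproducible.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from uuid import UUID


@dataclass
class EventDetails:
    uuid: UUID
    created: datetime
    s3_path: str


# Fixed, made-up values (not from the original cell) so the run is reproducible
details = EventDetails(
    uuid=UUID("12345678-1234-5678-1234-567812345678"),
    created=datetime(2023, 9, 11, 14, 46, 49, tzinfo=timezone.utc),
    s3_path="s3://foo",
)

# asdict() gives a plain dict, but json.dumps() can't encode UUID or datetime on its own
try:
    json.dumps(asdict(details))
    plain_dumps_failed = False
except TypeError as err:
    plain_dumps_failed = True
    print(f"json.dumps failed: {err}")

# One workaround: fall back to str() for anything json doesn't know how to encode
as_json = json.dumps(asdict(details), default=str, indent=2)
print(as_json)
```

The `default=str` fallback only covers serialization. Reading the JSON back gives plain strings with no validation, which is the gap a library like pydantic fills.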


### Python is king of local data processing

Data scientists rely heavily on python frameworks like pandas, polars and duckdb to analyze Small Data (data that fits on one machine).

- Python is also making good strides to make data processing work on data that can't fit on a local machine 
    - [polars](https://www.pola.rs/) can now lazily load data that is larger than memory.
    - Frameworks like [fugue](https://fugue-tutorials.readthedocs.io/index.html) can distribute pandas/polars workloads across a distributed cluster much like spark now.
- Python is also starting to challenge Java in Big Data _processing_.
    - Java rules the roost with spark and flink, but this is starting to change.
    - Even most spark tutorials use pyspark rather than scala to show how it works.
    - OpenAI uses the [ray framework](https://docs.ray.io/en/latest/) to train GPT LLMs
        - because batch inference on spark was too slow
        - partially due to a [better shuffle algorithm](https://www.anyscale.com/blog/ray-breaks-the-usd1-tb-barrier-as-the-worlds-most-cost-efficient-sorting)
        - and because it avoids marshalling data to accelerator libraries like pytorch or numpy

> **Data _Processing_ vs. _Compute_**
> 
> There's not really a standard definition of this, but this is what I mean by _compute_ vs _processing_.
> Data _processing_ is where you do some kind of querying or "cleaning" of data to make it more structured or
> usable. Data _compute_ is where calculations are performed on data, usually converting the data into a numerical form
> first.  Data processing is typically IO bound, but can sometimes be compute bound (e.g. calculating new values for data
> in a column).  Conversely, data compute is usually CPU bound but can be IO bound (e.g. shuffling data to worker nodes).
> Java has fared well at data _processing_, but python (with accelerators) is king of data _compute_.
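
To make the distinction concrete, here is a small standard-library sketch. The records and the function names `process` and `compute` are made up for illustration; real workloads would use pandas, polars or numpy rather than plain loops.

```python
import json
import math

# Made-up newline-delimited JSON records, standing in for raw test-run data
raw_lines = [
    '{"test": "login", "result": "pass", "duration_ms": 120}',
    '{"test": "checkout", "result": "error", "duration_ms": 950}',
    "not valid json",
    '{"test": "search", "result": "pass", "duration_ms": null}',
    '{"test": "logout", "result": "pass", "duration_ms": 80}',
]


def process(lines: list[str]) -> list[dict]:
    """Data processing: parse, drop malformed or incomplete rows, keep structured records."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip rows that are not valid JSON
        if record.get("duration_ms") is None:
            continue  # skip rows with a missing measurement
        records.append(record)
    return records


def compute(records: list[dict]) -> tuple[float, float]:
    """Data compute: convert to numbers, then calculate mean and (population) standard deviation."""
    durations = [float(r["duration_ms"]) for r in records]
    mean = sum(durations) / len(durations)
    variance = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, math.sqrt(variance)


clean = process(raw_lines)
mean, stddev = compute(clean)
print(f"{len(clean)} clean records, mean={mean:.1f} ms, stddev={stddev:.1f} ms")
```

The `process` step is mostly parsing and filtering, while `compute` is pure arithmetic on numbers. The second kind of work is what accelerator libraries speed up.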

In [ ]:
import duckdb as dd
from datetime import datetime

conn = dd.connect()

start = datetime.now()
relation = conn.read_json("example_data/test-runs2.ndjson")
second = conn.sql("""SELECT tests.* FROM relation""")
print(second)

result = conn.sql("""SELECT test_dt_endpoint.* FROM second
WHERE test_dt_endpoint.result = 'error'
AND test_dt_endpoint.exception LIKE '%NullPointerException%'""")
end = datetime.now()
print(f"Took {end - start}")

# import polars as pl 
# pl.Config.set_tbl_cols(10)
# pl.Config.set_fmt_str_lengths(50)
# pl.Config.set_tbl_width_chars(300)
print(result.pl())

relation.to_parquet("example_data/test-runs.parquet")  # type: ignore



### Modern type system

Python now has a sophisticated type system that is more powerful than Java's (even as of Java 20+).

- The type checker is optional and only runs from the IDE or through a tool like [mypy](https://mypy-lang.org/) or pyright
- It can take care of None handling for you (unlike Java where null is a member of all types)
    - in the python type system, type `T` is distinct from type `T | None`
- The type system can be both _nominal_ (like Java) or _structural_ (stricter duck typing) through [Protocols](https://peps.python.org/pep-0544/)
- _Unions_ cover many of the same use cases as _sealed_ classes and interfaces in Java 17+ (previewed in Java 15)
- Python 3.12 via [PEP-695](https://peps.python.org/pep-0695/) will auto-detect Generic Variance (no need to manually specify it)
- The type system can have _Literal_ values which are more convenient than _Enum_ types
    - Literals let you have some features of _dependent type_ systems

> **Not your father's null**
>
> In many older languages, null is a value that is a member of all types.  It can be returned by any function,
> which means you must always check for its presence.  More modern type systems eschew null altogether, and use
> a Functional Programming concept like Haskell's Maybe, or Rust's Option.  Java's Optional is a bastardized attempt.
> In python, the type system with strict checks can catch many (but not all) None-related errors.

In [None]:
# The IDE (with pylance) tells us that we declared the return type as `int`, but there is a return path
# that can return None.  So we should fix this.  Note that this will still run.  Type annotations are ignored at
# runtime.  So we should always fix these issues in the IDE or with a mypy lint check
from datetime import datetime
from pathlib import Path
from typing import Literal


def foo(x: int) -> int:
    if x < 10:
        return None
    else:
        return x * 2

ans = foo(3)
print(ans)

# Example of a function returning a Union type
# We can declare that a function can take a Union of types or return a Union
def date_or_str(base_dir: str | Path) -> str | Path:
    dt = datetime.now().strftime('%Y_%m_%d') 
    if isinstance(base_dir, str):
        base_dir = Path(base_dir)
        return f"{base_dir}/{dt}"
    else:
        return base_dir / Path(dt)
    
print(date_or_str("/tmp"))
print(date_or_str(Path("/tmp")))

# Example of a function that can only take the Literal values of 1 or 2.  Any other int will be an error
def only_1_or_2(data: Literal[1, 2]):
    match data:
        case 1:
            return "one"
        case 2:
            return "two"
        
only_1_or_2(3)
# can I trick it?
arg = 3
only_1_or_2(arg)
# Hmm, what if I fake the type?
arg: Literal[1] = 3
# What about some other expression that evaluates to 1 or 2?
arg2 = 1 + 1
only_1_or_2(arg2)
arg3 = 0 + 1
only_1_or_2(arg3)

def i_return_a_1():
    return 1

only_1_or_2(i_return_a_1() + 1)

### De facto language for ML

Like it or not, python is the de facto language for Machine Learning.

- all the most popular ML toolkits (pytorch, tensorflow, Hugging Face, scikit-learn, numpy, etc.) are used from python
- nearly all the examples you will see will be in python
- you will need to collaborate with others who only know/use python

> The future is AI
>
> Either AI will become a tool we have to learn as end-users, or it may even eventually replace us.  If the former,
> knowing how it works under the hood will help you use it better.  If the latter, you will have better job security.
> Either way, understanding how AI works is crucial to your career.

### Mojo (a superset of python)

A secret reason I want to get people to use python is as a gateway drug to the mojo language.  

- Mojo is an upcoming language that promises rust and C performance
    - and even faster with hardware accelerators (i.e. GPU/TPU/NPU).
- It is currently in a very alpha stage and has serious limitations
    - For example, it doesn't even have a dict type in its standard library
    - The standard library has no File IO or any kind of networking
    
But it shows a lot of promise because:

- It should be faster than C(++) or rust for parallelizable apps
- It has no garbage collector, making it ideal for performance or latency sensitive apps
    - less risk of OOM
    - fewer GC stutters
- The syntax is very similar to PEP-695 typed python with compile time Parameterized Expressions
    - The Generic types are named (unlike Java or Rust), meaning you can treat them as variables in an expression
- You can (eventually) run python code as-is
- It can generate stand-alone binaries
- you can gradually port python code to mojo over time (no "Rewrite it in Rust" project needed)
- since it will be a superset of python (eventually), it also gets the entire python ecosystem for free 

These are huge advantages over rust, for example.