# Why python?

Python is one of the most in-demand languages on the market, as shown by Stack Overflow surveys.  This in turn is driven to a large degree by the rise of Big Data and Machine Learning.

- The basics are easy to learn
- Python is king of data analytics on local machines, and is starting to challenge Java in Big Data
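- It has a modern (optional) type system
- It is the de facto language for ML
- It is a gateway to mojo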


### The basics are simple

Python is relatively simple and thus can be adopted quickly.  This is a big advantage over rust, which has a reputation for being hard to learn.  Being simple makes python great for prototyping code.  Development is much faster than in most languages, which is ideal for ad-hoc/exploratory testing and experiments.

In [None]:
# Simple function, class and dataclass

from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import UUID, uuid4

from pydantic import BaseModel

# ymd is a string parameter, and the function returns a datetime object
def to_datetime(ymd: str) -> datetime:
    """Returns a year_month_day str format as a datetime"""  # <- a docstring
    return datetime.strptime(ymd, "%Y_%m_%d")  # python is not expression oriented, so you must use the return keyword

print(to_datetime("2023_8_27"))

# A simple data type
@dataclass
class EventDetails:
    uuid: UUID
    created: datetime
    s3_path: str

# While dataclass is quick and easy, it's not always the best choice for JSON (de)serialization
details = EventDetails(uuid=uuid4(), created=datetime.now(tz=timezone.utc), s3_path="s3://foo")
print(details)

# For serde, sometimes it's better to use pydantic, as it offers validation, and more JSON serializable types
class EventDetails2(BaseModel):
    uuid: UUID
    created: datetime
    s3_path: str

details = EventDetails2(uuid=uuid4(), created=datetime.now(tz=timezone.utc), s3_path="s3://foo")
# pydantic v2 spelling; on pydantic v1 use details.json(indent=2) instead
print(details.model_dump_json(indent=2))
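
To make the remark about dataclasses and JSON concrete, here is a minimal standard-library sketch. The names `EventRecord` and `to_jsonable` are hypothetical and only mirror `EventDetails` above. It shows that `json.dumps` cannot encode `UUID` or `datetime` values without help, and that decoding gives back plain strings with no validation.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from uuid import UUID


@dataclass
class EventRecord:  # hypothetical name, same fields as EventDetails
    uuid: UUID
    created: datetime
    s3_path: str


def to_jsonable(value: object) -> str:
    """Fallback used by json.dumps for types it cannot encode natively."""
    if isinstance(value, UUID):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


record = EventRecord(
    uuid=UUID("12345678-1234-5678-1234-567812345678"),
    created=datetime(2023, 8, 27, tzinfo=timezone.utc),
    s3_path="s3://foo",
)

# Without a fallback, json.dumps raises TypeError on the UUID and datetime fields
try:
    json.dumps(asdict(record))
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

# With the fallback, encoding works, but we had to write the conversion ourselves
encoded = json.dumps(asdict(record), default=to_jsonable, indent=2)
print(encoded)

# Decoding returns plain strings; nothing converts them back or validates them
decoded = json.loads(encoded)
print(type(decoded["created"]).__name__)
```

This hand-written conversion in both directions is the work that pydantic takes over.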

### Python is king of local data processing

Data scientists heavily use python frameworks like pandas, polars and duckdb for data analytics (polars can now lazily
load data larger than memory).  Python is also starting to challenge Java in Big Data _processing_.  Java rules
the roost with spark and flink, but this is starting to change.  Even most spark tutorials use pyspark rather than scala
to show how it works.

> **Data _Processing_ vs. _Compute_**
> 
> Data _processing_ is where you do some kind of querying on data, or "cleaning" it to make it more structured or
> usable.  Data _compute_ is where calculations are performed on data, usually converting the data into a numerical form
> first.  Data processing is typically IO bound, but can sometimes be compute bound (e.g. calculating new values for data
> in a column).  Conversely, data compute is usually CPU bound but can be IO bound (e.g. shuffling data to worker nodes).
> Java has fared well at data _processing_, but python (with accelerators) is king of data _compute_.

Python is also making good strides in processing data that can't fit on a local machine (a limitation
of pandas).  Frameworks like [fugue](https://fugue-tutorials.readthedocs.io/index.html) can now distribute pandas/polars
workloads across a cluster, much like spark.

> For example, OpenAI uses the ray framework to train its GPT LLMs, reportedly because batch inference on spark was too slow.

In [None]:
# TODO: Create some NDJSON data and some parquet files, and show how we can do queries
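
A minimal, dependency-free sketch of the kind of query this section is about. It builds a few hypothetical NDJSON records (the `service`, `status` and `latency_ms` fields are invented for illustration) and queries them with SQL. The standard library's `sqlite3` stands in for duckdb here purely so the example runs without extra installs; pandas, polars or duckdb would normally read the file and run the query directly.

```python
import json
import sqlite3

# Hypothetical sample events, one JSON object per line (NDJSON)
ndjson = "\n".join(
    json.dumps(row)
    for row in [
        {"service": "api", "status": 200, "latency_ms": 12.5},
        {"service": "api", "status": 500, "latency_ms": 80.0},
        {"service": "worker", "status": 200, "latency_ms": 40.0},
        {"service": "api", "status": 200, "latency_ms": 17.5},
    ]
)

# Each line parses independently, which is what makes NDJSON easy to stream
rows = [json.loads(line) for line in ndjson.splitlines() if line.strip()]

# Load the records into an in-memory table (stand-in for a dataframe or duckdb relation)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (service TEXT, status INTEGER, latency_ms REAL)")
conn.executemany("INSERT INTO events VALUES (:service, :status, :latency_ms)", rows)

# Count and average latency of successful requests per service
query = """
    SELECT service, COUNT(*) AS n, AVG(latency_ms) AS avg_latency_ms
    FROM events
    WHERE status = 200
    GROUP BY service
    ORDER BY service
"""
summary = conn.execute(query).fetchall()
conn.close()
print(summary)
```

In the terms used above, parsing and filtering the records is data _processing_; the aggregation is a small piece of data _compute_.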

In [None]:
# TODO: Create a simple example using the ray framework

### Modern type system

Although type checking is optional and only runs from the IDE or through a linter like mypy,
python's type system is in some respects more powerful than Java 19's.  For example, it can take care of None handling
for you.  In Java, every reference type is effectively a union of the values of that type plus null (the type of null
behaves like a _bottom_ type for reference types, in programming language theory terms).  In the python type system,
type `T` is distinct from type `T | None`.

> **Not your father's null**
>
> In many older languages, null is a value that is a member of all (reference) types.  It can be returned by any function,
> which means you must always check for its presence.  More modern type systems eschew null altogether, and use
> a Functional Programming concept like Haskell's Maybe or Rust's Option.  Java's Optional is a bastardized attempt.
> In python, the type system with strict checks can catch many (but not all) None-related errors.

In [None]:
# The IDE (with pylance) tells us that we declared the return type as `int`, but there is a return path
# that can return None.  So we should fix this.  Note that this will still run: type annotations are ignored at
# runtime.  So we should always fix these issues in the IDE or with a mypy lint check
from datetime import datetime  # used by date_or_str below
from pathlib import Path


def foo(x: int) -> int:
    if x < 10:
        return None
    else:
        return x * 2

ans = foo(3)
print(ans)

# Example of a function returning a Union type
# We can declare that a function can take a Union of types or return a Union
def date_or_str(base_dir: str | Path) -> str | Path:
    dt = datetime.now().strftime('%Y_%m_%d') 
    if isinstance(base_dir, str):
        base_dir = Path(base_dir)
        return f"{base_dir}/{dt}"
    else:
        return base_dir / Path(dt)
    
print(date_or_str("/tmp"))
print(date_or_str(Path("/tmp")))

### De facto language for ML

Python is the de facto language of all the most popular ML toolkits (pytorch, tensorflow, hugging face, scikit-learn,
numpy, etc).  While python has its warts (all languages do), you pretty much have to know python to do any ML in a
collaborative effort, because almost everyone else's code will be using some python framework.

> **The future is AI**
>
> Either AI will become a tool we have to learn as end-users, or it may even eventually replace us.  If the former,
> knowing how it works under the hood will help you use it better.  If the latter, you will have better job security.
> Either way, understanding how AI works is crucial to your career.

### Mojo (a superset of python)

A secret reason I want to get people to use python is as a gateway drug to the mojo language.  Mojo is an upcoming
language that promises rust and C performance, and even faster with hardware accelerators (i.e. GPU/TPU/NPU).  It is
currently in a very alpha stage and has serious limitations.  For example, at the time of writing it doesn't even have
a list or dict type in its standard library.  But it shows a lot of promise because:

- It should be faster than C or rust for parallelizable apps
- It has no garbage collector, making it ideal for performance or latency sensitive apps
- The syntax is very similar to PEP-695 typed python
- You can run python code as-is
- You can code in python (though you will only get a little performance boost)

What this means is that you can gradually port python code to mojo over time, and get huge performance and
infrastructure benefits.  Since it aims to be a superset of python, it also gets the entire python ecosystem for free.
These are huge advantages over rust, for example.