## 0. Python Introduction

Authors : Kefia Ali

<p align="center">
  <a>
    <img src="./figures/logo-hi-paris-retina.png" alt="Logo" width="280" height="180">
  </a>

  <h3 align="center">Data Science Bootcamp</h3>
</p>

# Introduction to Python

- General presentation of the language
    - Philosophy
    - Evolutions & Successes
- Practical exploration of the language through a concrete example
    - Computing the GCD of two numbers: discovering the basics of the language through different implementations

## Introduction to Python

- Very powerful and intuitive programming language
    - Interpreted: no separate compilation step (unlike C or Java)
    - Simple, high-level syntax (blocks are delimited by indentation)
    - Strong and dynamic typing: variables have no declared type (the type is carried by the value)
    - Duck typing: we rely on the interface a value exposes rather than on its type
    - Very powerful native data structures (lists, dictionaries, sets, iterators, etc.)
    - Pay attention to the difference between mutable and immutable types

In [None]:
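class Magic:
    # Defining __len__ is enough for len() to accept the object (duck typing)
    def __len__(self):
        return 1


l = (1, 2, 3)
print(len(l))  # a tuple has a length
m = Magic()
print(len(m))  # so does any object that implements __len__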

In [None]:
s = "123"
print(s[1])
s[1] = 4  # raises TypeError: strings are immutable

In [None]:
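def f1(s):
    s += "3"  # str is immutable: only the local name is rebound


def f2(l):
    l += [3]  # list is mutable: the caller's list is extended in place


s = "12"
f1(s)
print(s)  # still "12"

l = [1, 2]
f2(l)
print(l)  # now [1, 2, 3]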
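
The "strong and dynamic typing" bullet can be illustrated in the same spirit. This is a minimal sketch; the names `x` and `result` are only for illustration.

```python
# Dynamic typing: a name can be rebound to values of different types.
x = 1
x = "one"

# Strong typing: values are not silently converted between unrelated types.
try:
    result = "1" + 1
except TypeError as exc:
    result = f"TypeError: {exc}"

print(type(x).__name__)
print(result)
```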

- Very rich functionality in the standard library
    - text handling (`str.upper`, `str.lower`, `re`, `difflib`, `unicodedata`)
    - numerical manipulation (`random`, `decimal`, `math.sqrt`, trigonometry in `math`)
    - file system, processes, threading, network, etc. (a web server in 2 lines)
    - data persistence (`pickle`, `json`, `csv`, `sqlite3`, `zlib`)
    - OS interface (`os`, `io`, `logging`)
    - UI library (`tkinter` as the standard)
    - Module / package management (very important)

- Perpetual evolution
    - Very active community supported by companies / universities (Google, Dropbox, etc.)
    - A big shift with Python 3 (`py3k`, first released in December 2008): a deliberately incompatible version to correct design errors
    - About 10 years to migrate all community projects / frameworks (Python 2 reached end of life in January 2020)
    - Async approach built into the language (`asyncio`, `async` / `await`) to work around the concurrency limitations of the `GIL` for I/O-bound code
    - Recent communication on work in progress by Guido van Rossum on Python performance (Microsoft)

- Widely used in many fields
    - Computation / Data / Stats : `Numpy`, `Pandas`, `Scikit`, `TF`, `Torch`
    - Web : `Flask`, `Django`, `FastAPI`
    - Admin / Automation / Cloud : `Ansible`, `awscli`, `azure-cli`

- Good introduction : https://learnxinyminutes.com/docs/python/
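
As a small taste of the standard library mentioned above, here is a hedged sketch using only `json` and `re`. The `record` dictionary and the sample text are made up for the example.

```python
import json
import re

record = {"name": "Ada", "scores": [12, 9]}  # hypothetical data

# Serialize to a JSON string and read it back
payload = json.dumps(record)
restored = json.loads(payload)

# Extract every integer from a piece of text with a regular expression
numbers = [int(n) for n in re.findall(r"\d+", "gcd(12, 9) = 3")]

print(restored == record)
print(numbers)
```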

## Practical exploration (gcd)

- gcd : greatest common divisor of two numbers
- Explanation :
    - gcd(a, b) with a = n * b + m and 0 <= m < b (m is the remainder of the division of a by b)
    - if m != 0
        - gcd(a, b) divides a and b, so it divides m = a - n * b
        - conversely, if d divides m and b then d divides a = n * b + m
        - => gcd(a, b) = gcd(b, m)
    - if m == 0 => gcd(a, b) = b
- Algorithm:
    - while a >= b, a = a - b
    - if a == 0 => res = b
    - otherwise while b >= a, b = b - a
      - if b == 0 => res = a
      - otherwise while ...
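
Before writing the algorithm by hand, the identity gcd(a, b) = gcd(b, m) can be checked numerically. The sketch below uses `%` and `math.gcd`, which are only introduced further down in this notebook, purely as a checker. The sample pairs are arbitrary.

```python
import math

# Check gcd(a, b) == gcd(b, a % b) on a few arbitrary pairs
pairs = [(12, 9), (13, 12), (14, 7), (1071, 462)]
identity_holds = all(math.gcd(a, b) == math.gcd(b, a % b) for a, b in pairs)

# Trace the successive divisions a = n * b + m for one pair
a, b = 1071, 462
steps = []
while b != 0:
    steps.append((a, b, a % b))
    a, b = b, a % b

print(identity_holds)
for x, y, m in steps:
    print(f"{x} = {x // y} * {y} + {m}")
print(a)  # last non-zero remainder, i.e. the gcd
```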

In [None]:
# Naive approach (using a language to express an algorithm)

a, b = 12, 9
res = None

while True:
    a = a - b  # we assume that a >= b
    if a == 0:
        res = b
        break
    elif a < b:
        a, b = b, a

print(res)

### Take away
- Easy and simple syntax (we assign as done on paper)
- `if`, `else`, `elif`
- `while`, `break`, `continue`
- `print` is your best friend
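
The cell above does not use `continue`. Here is a minimal sketch combining it with `break` and `elif` (the sum of odd numbers is just an arbitrary task):

```python
total = 0
n = 0
while True:
    n += 1
    if n > 10:
        break        # leave the loop entirely
    elif n % 2 == 0:
        continue     # skip the rest of this iteration
    else:
        total += n   # only odd numbers reach this line

print(total)
```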

In [None]:
# To reuse a piece of logic, you need building blocks

def gcd1(a, b):
    # Assumes a and b are strictly positive integers (b == 0 would loop forever)
    a, b = max(a, b), min(a, b)
    while True:
        a = a - b
        if a == 0:
            return b
        elif a < b:
            a, b = b, a

In [None]:
gcd1(12, 9)

In [None]:
gcd1(13, 12)

### Take away
- Functions : an important building block to reuse / factorize logic
- Classes are the level above (reuse a whole concept)
- Modules (`.py` files) and packages (hierarchies of modules) group functions and classes so they can be shared / published (PyPI)
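
To make "classes are the level above" concrete, here is a hedged sketch of a hypothetical `Ratio` class (not part of the original notebook) that reuses a gcd function to keep a fraction in lowest terms:

```python
import math


class Ratio:
    """Hypothetical example: a fraction kept in lowest terms."""

    def __init__(self, num, den):
        d = math.gcd(num, den)  # reduce once, at construction
        self.num, self.den = num // d, den // d

    def __mul__(self, other):
        return Ratio(self.num * other.num, self.den * other.den)

    def __repr__(self):
        return f"Ratio({self.num}, {self.den})"


r = Ratio(12, 9) * Ratio(3, 8)
print(r)
```

The gcd logic lives in one place (the constructor), so every operation that builds a `Ratio` benefits from it.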

In [None]:
# Improvement : we avoid `while True` in real code (a mistake in the exit condition means an infinite loop)
# A function can call itself (recursion)

def _gcd2(s, b):
    while b >= s:
        b -= s
    if b == 0:
        return s
    else:
        return _gcd2(b, s)


def gcd2(a, b):
    return _gcd2(min(a, b), max(a, b))

In [None]:
gcd2(12, 13)

In [None]:
gcd2(14, 14)

In [None]:
# More readable / better use of the language

def gcd3(a, b):
    s, b = min(a, b), max(a, b)
    while s != 0:
        # the remainder replaces the repeated subtractions; tuple assignment swaps in one step
        s, b = b % s, s
    return b

In [None]:
gcd3(12, 13)

In [None]:
gcd3(14, 7)

### Take away
- Python is a very rich language (operators, built-in functions, multiple assignment)
- Code readability is an important quality criterion (we spend more time maintaining code than writing it)
- A style standard (PEP 8) and tools exist to check and format the code

In [None]:
import math


math.gcd(14, 7)

### Take away
- Before starting a development, check whether a library already does it
- As a Software Engineer, Data Engineer or Data Scientist, our job is more about finding the best combination of existing components than about creating new things

In [None]:
import random


N = 100000

random.seed()
A = [random.randint(1, 1000000000) for i in range(N)]
B = [random.randint(1, 1000000000) for i in range(N)]

In [None]:
import timeit


def bench(f):
    # Time a single pass of f over all (A[i], B[i]) pairs
    def ff():
        r = [f(a, b) for a, b in zip(A, B)]
        print(f"gcd({A[0]}, {B[0]}) = {r[0]}")
        return r
    t = timeit.timeit(ff, number=1)
    print(f"exec time : {t:.2f}s")

### Take away
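- String formatting (f-strings)
- List comprehensions (the same syntax exists for dicts and sets)
- Nested functions (a function defined inside another one)
- Built-ins and the standard library are very powerful (`min`, `max`, `zip`, `timeit`)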
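
The three language features listed above can be seen side by side in a short sketch. The helper `make_scaler` is a made-up name for the example.

```python
def make_scaler(factor):
    # Nested function: `scale` remembers `factor` from the enclosing call
    def scale(x):
        return x * factor
    return scale


double = make_scaler(2)

squares = [n * n for n in range(5)]        # list comprehension
square_of = {n: n * n for n in range(5)}   # dict comprehension
label = f"pi is about {3.14159:.2f}"       # f-string with a format spec

print(squares, square_of[3], double(21), label)
```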

In [None]:
bench(gcd1)

In [None]:
bench(gcd2)

In [None]:
bench(gcd3)

In [None]:
bench(math.gcd)

### Take away
- The closer you stay to the standard library, the better the performance (`math.gcd` is implemented in C)
- In data engineering we must stay vigilant, because performance quickly becomes critical

In [None]:
import random


N = 10000000

random.seed()
A = [random.randint(1, 1000000000) for i in range(N)]
B = [random.randint(1, 1000000000) for i in range(N)]

In [None]:
def log(a, b, r):
    print(f"gcd({a}, {b}) = {r}")


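def naive():
    # List comprehension: all N gcds are computed and stored before the loop starts
    R = [math.gcd(a, b) for a, b in zip(A, B)]
    for a, b, r in zip(A, B, R):
        log(a, b, r)
        break


def lazy():
    # Generator expression: each gcd is computed only when the loop asks for it
    R = (math.gcd(a, b) for a, b in zip(A, B))
    for a, b, r in zip(A, B, R):
        log(a, b, r)
        break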

In [None]:
print(f"naive exec time : {timeit.timeit(naive, number=1):.2f}s")
print()
print(f"lazy exec time : {timeit.timeit(lazy, number=1):.2f}s")

### Take away
- There are "lazy" data structures that avoid intermediate storage
- The idea is to create iterators, and chains of iterators, that walk over existing data on demand
- Very common pattern in Python (iterating over database records, aggregating CSV files larger than memory)
- Compatible with built-ins and the standard library (`sorted`, `zip`, `enumerate`, etc.)
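
Chaining iterators can be sketched with a generator function and `itertools.islice`. The `squares` generator is a hypothetical example; nothing is computed until a consumer pulls values.

```python
import itertools


def squares():
    # Generator function: yields values one at a time, forever
    n = 0
    while True:
        yield n * n
        n += 1


# Chain lazy steps on an infinite source; islice pulls only what it needs
even_squares = (s for s in squares() if s % 2 == 0)
first_five = list(itertools.islice(even_squares, 5))

# sum() consumes the generator item by item, without building a list
total = sum(x for x in range(1_000_000) if x % 3 == 0)

print(first_five)
print(total)
```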

In [None]:
import numpy as np


AA = np.array(A)
BA = np.array(B)


def fast():
    R = np.gcd(AA, BA)
    log(AA[0], BA[0], R[0])

In [None]:
print(f"fast exec time : {timeit.timeit(fast, number=1):.2f}s")

### Take away
- Python's basic data structures are not optimized for numerical computation
- The `Numpy` package is the basis for most numerical computation libraries in Python
    - Contiguous, uniformly typed block data structure
    - Flexibility to shape the data to your needs
    - Rich library of operations to manipulate these arrays
    - Utilities to build arrays (timeseries, random, range, etc.)
    - Utilities to load from files and serialize
- Nice introduction : http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54
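
The array-building and serialization utilities listed above can be sketched as follows. An in-memory buffer stands in for a file, which is an assumption made only to keep the example self-contained.

```python
import io

import numpy as np

grid = np.linspace(0.0, 1.0, 5)        # 5 evenly spaced values from 0 to 1
table = np.arange(6).reshape((2, 3))   # 0..5 laid out as 2 rows x 3 columns

# Serialize to an in-memory buffer (np.save also accepts a file path)
buffer = io.BytesIO()
np.save(buffer, table)
buffer.seek(0)
restored = np.load(buffer)

print(grid)
print(restored.dtype, restored.shape)
print(np.array_equal(table, restored))
```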

In [None]:
a = np.random.random((3, 2))
print(a)
print(a.dtype)
print(a.shape)

In [None]:
a[1:, 1:].shape

In [None]:
b = np.random.random((2, 2, 2, 2))
print(b)
print(b.dtype)
print(b.shape)
print(b[0, 0, 0, :])

In [None]:
a.reshape((2, 3))

In [None]:
b = np.random.random((3, 2))
print(b)

In [None]:
print(a+b)

In [None]:
print(a*b)

In [None]:
print(a/b)

In [None]:
a.dot(b.reshape(2, 3))

In [None]:
a = np.array((1, 2, 3))

In [None]:
a.dtype

In [None]:
a[3] = 1  # No resizing: raises IndexError (careful)

In [None]:
a[0] = 0
print(a)

In [None]:
a[0] = 5.5  # Silent cast to the array's integer dtype: stores 5 (careful)
print(a)

In [None]:
def double1(a):
    return a*2  # vectorized: the loop runs inside Numpy


def double2(a):
    return np.array([v*2 for v in a])  # Python-level loop, then conversion back to an array


print(double1(np.array((1, 2, 3))))
print(double2(np.array((1, 2, 3))))

In [None]:
from functools import partial


a = np.arange(1000000)

print(timeit.timeit(partial(double1, a), number=10))
print(timeit.timeit(partial(double2, a), number=10))

### Take away
- Numpy is a real Swiss army knife for numerical processing while keeping the flexibility of Python
- Be careful to stay inside Numpy's internal structures (go through the Numpy API rather than Python loops)
- Use of `partial` (creates a new callable with some arguments already filled in)
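
A minimal sketch of `functools.partial` outside of a benchmark. The `power` function is a made-up example.

```python
from functools import partial


def power(base, exponent):
    return base ** exponent


square = partial(power, exponent=2)   # exponent is fixed, base is still free
cube = partial(power, exponent=3)

print(square(7), cube(2))
print(square.func is power, square.keywords)
```

This is why `partial(double1, a)` can be handed to `timeit.timeit`, which expects a callable taking no arguments.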

In [None]:
import numpy as np
import matplotlib.pyplot as plt


x1 = np.arange(0.0, 2.0, 0.1)
y1 = np.sin(2 * np.pi * x1)

x2 = np.arange(0.0, 2.0, 0.01)
y2 = np.sin(2 * np.pi * x2)

fig, ax = plt.subplots()

ax.plot(x1, y1, color="red")
ax.plot(x2, y2, color="blue")
ax.grid()

plt.show()

### Take away
- To visualize numbers, there are very powerful libraries (such as `matplotlib`) that plot `numpy` arrays directly

### Pandas
Pandas is an open-source tool for data manipulation, analysis and cleaning. It is well suited to many kinds of data and handles many named columns of different types, such as:

- Tabular data with heterogeneously-typed columns
- Ordered and unordered time series data
- Arbitrary matrix data with row & column labels
- Unlabelled data
- Any other form of observational or statistical data sets

Nice introduction : https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
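
To make "heterogeneously-typed columns" concrete, here is a small sketch on a hand-made table. The rows are hypothetical and are not taken from the medals dataset used below.

```python
import pandas as pd

# Hypothetical rows, only to illustrate named columns of different types
medals = pd.DataFrame({
    "Year": [2002, 2002, 2006, 2006],
    "NOC": ["FRA", "USA", "FRA", "USA"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold"],
})

print(medals.dtypes)

# Boolean filtering followed by a per-group count
gold_per_country = medals[medals.Medal == "Gold"].groupby("NOC").size()
print(gold_per_country)
```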

In [None]:
import pandas as pd


# read all medals of Winter Olympics between 1924 and 2006
df = pd.read_csv('http://winterolympicsmedals.com/medals.csv')

In [None]:
df

In [None]:
df.dtypes

Pandas is built on top of NumPy

In [None]:
df.index

In [None]:
df.values

Python Pandas Operations

Using pandas, you can perform many operations on series and data frames: handling missing data, group by, etc. Some common data manipulation operations are demonstrated below.

In [None]:
# filtering
df[(df.NOC == 'FRA') & (df.Year == 2002)]

In [None]:
# aggregations
df.groupby(["Year"]).agg({"Event": "nunique", "NOC": "nunique", "Medal": "count"})

In [None]:
# number of medals won by each country, per medal type
df.pivot_table(index='NOC', columns='Medal', values='Event', aggfunc='count')

In [None]:
(df
     .groupby(["NOC", "Medal"])
     .agg({"Event": "count"})
     .reset_index(level=[1])
     .pivot(columns="Medal")
     .fillna(0)
)

### Numpy or Pandas ?

The main differences between Numpy and Pandas lie in their data structures, memory consumption, and usage.

- Numpy mainly works with numerical data, whereas Pandas works with tabular data.
- The data structures in Pandas are Series (1 dimension) and DataFrames (2 dimensions); the older 3-dimensional Panel has been removed from recent versions. Numpy arrays can have up to n dimensions.
- Numpy consumes less memory than Pandas.
- As a commonly quoted rule of thumb, Pandas performs better on data with 500K rows or more, whereas Numpy performs better on 50K rows or less.
- Pandas is more widely used in industry than Numpy.
- Good read : http://gouthamanbalaraman.com/blog/numpy-vs-pandas-comparison.html

In [None]:
# 1st version

df_filter = df.loc[df.NOC.isin(['AUT', 'FRA', 'CHN', 'USA', 'FIN']), :].reset_index(drop=True)
df_pivot = df_filter.pivot_table(index='Year', columns='NOC', values='Medal', aggfunc='count')
df_pivot = df_pivot.fillna(0)
df_pivot.plot()

#### To improve code readability, we use Pandas method chaining.
Method chaining is a programming style in which several method calls are invoked one after another, each call performing an action and returning an object on which the next call is made
- To dive deeper : https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69

In [None]:
# 2nd version

(
    df
    .loc[df.NOC.isin(['AUT', 'FRA', 'CHN', 'USA', 'FIN'])]
    .pivot_table(index='Year', columns='NOC', values='Medal', aggfunc='count')
    .fillna(0)
    .plot()
)