# What is a Generator and why use it?

A generator function is a function that behaves like an iterator: it uses yield to hand back one value at a time.
Generators are very handy when dealing with huge datasets under limited memory.
A generator does not compute its items when it is created.
It computes each item only when you ask for it, for example with next() or a for loop.

In [3]:
def csv_reader(file_name):
    # The with block closes the file once the generator is exhausted.
    with open(file_name, "r") as csv_file:
        for row in csv_file:
            yield row
!wget -O data.csv https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
reader = csv_reader("data.csv")
print(next(reader))
print(next(reader))

--2022-01-07 14:54:15--  https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
Resolving www.stats.govt.nz (www.stats.govt.nz)... 45.60.11.104
Connecting to www.stats.govt.nz (www.stats.govt.nz)|45.60.11.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5881081 (5.6M) [text/csv]
Saving to: ‘data.csv’


2022-01-07 14:54:22 (1.17 MB/s) - ‘data.csv’ saved [5881081/5881081]

Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Variable_name,Variable_category,Value,Industry_code_ANZSIC06

2020,Level 1,99999,All industries,Dollars (millions),H01,Total income,Financial performance,"733,258","ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)"
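
To make the laziness concrete, here is a minimal sketch. The `squares` generator and the `computed` list are hypothetical names, not part of the original notebook; the list only records when each item is actually computed.

```python
computed = []  # records which items have actually been computed

def squares(n):
    """Hypothetical generator: yields i*i for i in range(n), one at a time."""
    for i in range(n):
        computed.append(i)  # side effect shows when the work happens
        yield i * i

gen = squares(1_000_000)       # nothing is computed yet
items_before = list(computed)  # still empty

first = next(gen)              # computes only the first item
second = next(gen)             # computes only the second item
print(items_before, first, second, computed)
```

Only two of the million items are ever computed, which is why a generator can walk through a file or dataset that would not fit in memory as a list.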



# What is a Decorator and why use it?

Decorators are a powerful tool for wrapping a function or class in order to extend or modify its behavior without permanently changing the original.

In [None]:
def getName():
    return "Daniel"

def addTextToName(function):
    def wrapper():
        return "This is your name: {}".format(function())
    return wrapper

namePrinter = addTextToName(getName)

def wrapName(func):
    def wrapper():
        print("Your name is: {}".format(func()))
        print("Nice to meet you.")
    return wrapper

@wrapName
def getNameWrapped():
    return "Daniel"

print(getName())
print("=====")
print(namePrinter())
print("=====")
getNameWrapped()

Daniel
=====
This is your name: Daniel
=====
Your name is: Daniel
Nice to meet you.
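
The wrappers above only work for functions that take no arguments. A more general sketch, assuming a hypothetical `shout` decorator, forwards arbitrary arguments and uses `functools.wraps` so that the decorated function keeps its own name and docstring:

```python
import functools

def shout(func):
    """Hypothetical decorator: upper-cases the string returned by func."""
    @functools.wraps(func)  # copy __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name, punctuation="!"):
    """Return a greeting."""
    return "hello {}{}".format(name, punctuation)

print(greet("Daniel"))   # HELLO DANIEL!
print(greet.__name__)    # greet
```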


# What is list/dict comprehension and why use it?

A list/dict comprehension is an easy and useful way to build a list or dict with a one-line for loop.
It is usually faster than an equivalent for loop that calls append, because the loop body runs without a method lookup and call on every iteration.
It is also cleaner since it takes fewer lines of code.

In [None]:
salaries = {'Anne': 50000, 'Bert': 60000, 'Carl': 70000, 'Dom': 80000}
raisedSalaries = {"key_"+k:round(v*1.13,0) for (k,v) in salaries.items()}
print(raisedSalaries)

{'key_Anne': 56500.0, 'key_Bert': 67800.0, 'key_Carl': 79100.0, 'key_Dom': 90400.0}


In [None]:
something = [1,2,3]
a = [i for i in something if i<2]
a

[1]
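
As a rough illustration, the sketch below builds the same list with an explicit loop and with a comprehension and times both. The function names are made up for this example, and the timings depend on the machine and Python version, so no numbers are claimed.

```python
import timeit

numbers = range(1000)

def with_loop():
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n * n)  # method lookup and call on every iteration
    return result

def with_comprehension():
    return [n * n for n in numbers if n % 2 == 0]

# Both approaches produce exactly the same list.
same_result = with_loop() == with_comprehension()

print("same result:  ", same_result)
print("loop:         ", timeit.timeit(with_loop, number=1000))
print("comprehension:", timeit.timeit(with_comprehension, number=1000))
```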

# You are multithreading a list of parallel tasks using a thread pool. What is t.join() used for? What is the purpose of using t.join() rather than skipping it. This program seems to work with or without the t.join(). Why do we still include it?

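t.join() blocks the calling thread (here, the main thread) until thread t has finished its task.
This program seems to work without it because Queue.get() already blocks until each worker has put its result, so the total is still correct.
Without the join, however, "ALL WORK DONE" would be printed while the workers are still running; calling join makes the synchronization explicit and guarantees that everything after it sees finished threads.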

In [None]:
from threading import Thread
from queue import Queue
import time

def worker(args,q):
    time.sleep(1)
    print("done {}".format(args))
    q.put(1)
    return

workerList=[]
for i in range(3):
    q = Queue()
    t = Thread(target=worker,args=(i,q))
    t.start()
    workerList.append([q,t])

for i,workerPair in enumerate(workerList):
    workerPair[1].join()
    
print("ALL WORK DONE")

total=0
for i,workerPair in enumerate(workerList):
    total+=workerPair[0].get()
    
print("TOTAL={}".format(total))

done 1done 2
done 0

ALL WORK DONE
TOTAL=3
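
The need for join becomes visible once the results are not read through a blocking queue. In this minimal sketch (the `slow_worker` function and `results` list are hypothetical), the workers write into a plain shared list, so only join guarantees that the list is complete:

```python
import threading
import time

results = []  # shared list; list.append is safe to call from several threads

def slow_worker(value):
    time.sleep(0.05)        # simulate work
    results.append(value)

threads = [threading.Thread(target=slow_worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

count_before_join = len(results)   # usually 0: the workers are still sleeping

for t in threads:
    t.join()                       # wait here until each worker has finished

count_after_join = len(results)    # always 3
print(count_before_join, count_after_join)
```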


# What is df.apply(myFunction, axis=1)?

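df.apply(myFunction, axis=1) calls myFunction once for each row of the dataframe, passing the row as a Series. axis=1 means the function is applied along the column axis, that is, row by row; the default axis=0 applies it to each column instead.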

In [None]:
import pandas as pd

def reverseName(row):
    text=row["name"]
    return "".join(reversed(list(text)))

df = pd.DataFrame(data={'name': ["Alice", "Bob"]})
df["reversed"]=df.apply(reverseName,axis=1)
display(df.head())

    name reversed
0  Alice    ecilA
1    Bob      boB
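
The difference between the two axes is easiest to see side by side. This is a small sketch with a made-up `scores` dataframe:

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 60], "art": [70, 80]}, index=["Alice", "Bob"])

# axis=1: the function receives one row at a time (a Series indexed by column name).
row_totals = scores.apply(lambda row: row["math"] + row["art"], axis=1)

# axis=0 (the default): the function receives one column at a time.
column_maxima = scores.apply(lambda col: col.max(), axis=0)

print(row_totals)      # Alice -> 160, Bob -> 140
print(column_maxima)   # math -> 90, art -> 80
```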


# What operation is df1.merge(df2, left_on='lkey', right_on='rkey') doing? What would this be called in SQL?

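df1.merge(df2, left_on='lkey', right_on='rkey') combines the two dataframes by matching rows where df1['lkey'] equals df2['rkey'].
In SQL this is called a JOIN; with the default how='inner' it is an INNER JOIN.
df1 is the left dataframe.
df2 is the right dataframe.
left_on and right_on are used when the key column has a different name in each dataframe; when both share the same column name, on= is enough.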

In [None]:
df1 = pd.DataFrame({'name': ['Brian', 'Bill', 'Frank'],
                    'demerits': [2, 3, 5]})
df2 = pd.DataFrame({'name': ['Brian', 'Bill', 'Frank'],
                    'convictions': [6, 7, 8]})
df1.merge(df2, on='name')

    name  demerits  convictions
0  Brian         2            6
1   Bill         3            7
2  Frank         5            8
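
The cell above joins on a shared column name. For the left_on/right_on form asked about in the question, a minimal sketch could look like this; the `left` and `right` frames are invented for illustration, and "Frank" and "Zed" deliberately have no match:

```python
import pandas as pd

left = pd.DataFrame({"lkey": ["Brian", "Bill", "Frank"], "demerits": [2, 3, 5]})
right = pd.DataFrame({"rkey": ["Brian", "Bill", "Zed"], "convictions": [6, 7, 9]})

# Roughly equivalent SQL (how="inner" is the default):
#   SELECT * FROM left INNER JOIN right ON left.lkey = right.rkey
merged = left.merge(right, left_on="lkey", right_on="rkey")
print(merged)
```

Both key columns are kept in the result, and only the rows whose keys appear on both sides survive the inner join.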


# What is faster, pd.concat([df1,df2,df3]) or a loop of df.append()? Explain your reasoning.

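pd.concat is faster, reportedly up to 50 times in some benchmarks. It combines all the dataframes in a single operation.
df.append is slower and more expensive since it copies all the accumulated data into a new dataframe on every iteration.
A loop of n appends therefore copies O(n^2) rows in total. DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0, so the first cell below needs an older pandas.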

In [None]:
%%time
import random
import pandas as pd
numRows=10000

df = pd.DataFrame(columns=["age","gender"])
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    df=df.append(df2)
df.head()

Wall time: 8.55 s


   age gender
0  110      M
0   75      M
0  116      F
0    9      M
0    4      F


In [None]:
%%time
import random
numRows=10000
resultArr = []
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    resultArr.append(df2)

df=pd.concat(resultArr)
df.head()

Wall time: 3.52 s


   age gender
0  109      F
0   62      M
0  113      M
0   43      M
0   50      M
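
Both outputs show the index 0 repeated, because every one-row dataframe brings its own index. A small sketch of the concat approach with `ignore_index=True`, using three hypothetical rows in place of the 10,000 random ones, renumbers the result:

```python
import pandas as pd

# Three hypothetical one-row frames, standing in for the 10,000 built above.
frames = [pd.DataFrame({"age": [age], "gender": [gender]})
          for age, gender in [(30, "F"), (41, "M"), (25, "F")]]

# A single concat call combines everything; the index is renumbered 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```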


# What is the purpose of tools such as Flask and Django?

Flask and Django are frameworks for building web applications.
Flask is a lightweight microframework, well suited to smaller applications and services.
Django is a full-stack framework suited to more complex applications. It follows an MVC-style pattern (which Django calls model-template-view) and ships with built-in components such as an ORM, an admin interface, and authentication. It is also cross-platform.

There are several other web frameworks, but Flask and Django are among the most popular ones.

Nginx is web server software that is mostly used as a reverse proxy: it sits behind the firewall and handles incoming requests by routing them to the right backend.
Nginx has an event-driven architecture in which a single worker thread handles many requests. It is newer than Apache and is very stable once configured.

Apache is an HTTP web server that receives requests and sends back responses.
It is process-driven and dedicates a process or thread to each connection.
Apache can handle dynamic content by embedding an interpreter such as PHP in each worker.

Gunicorn is a pure-Python WSGI HTTP server that runs a Python application concurrently by starting multiple worker processes.

Waitress is another pure-Python WSGI server that has gained attention lately. It offers some functionality that Gunicorn does not, such as HTTP request buffering.

uWSGI is a full-featured application server whose configuration is more complex than Gunicorn's; it should be used only when its extra features are needed.

# Compare the purposes of 1,2, and 3:

1) Flask/Django/Others

2) apache2/nginx

3) gunicorn/other WSGI
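
One way to see how the three layers fit together is to look at the interface between them. The sketch below uses only the standard library: `application` is a bare WSGI callable of the kind a Flask or Django project exposes (layer 1), and the hand-built `environ` and `start_response` stand in for what a WSGI server such as gunicorn (layer 3) would do after a web server such as nginx (layer 2) forwards it an HTTP request. All names here are illustrative.

```python
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A bare WSGI app: the kind of callable that Flask or Django expose."""
    body = "Hello from {}".format(environ["PATH_INFO"]).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

# A WSGI server normally builds `environ` from the HTTP request and calls
# the app once per request. Here that call is faked by hand.
environ = {}
setup_testing_defaults(environ)
environ["PATH_INFO"] = "/ping"

captured = {}

def start_response(status, headers):
    captured["status"] = status

response_body = b"".join(application(environ, start_response))
print(captured["status"], response_body)
```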

# You are writing a program that scrapes text from a long list of websites. How would you apply parallelism to speed up the scraping task?

from concurrent.futures import ThreadPoolExecutor

This class has a method called map. Passing the text-scraping function and the list of URLs to map lets each worker thread handle one URL at a time, which speeds up the process because scraping is I/O-bound and the threads overlap their network waits.
Note that the maximum number of workers is set with the max_workers argument of ThreadPoolExecutor, and the best value differs from system to system.
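
A minimal sketch of this approach follows. `scrape_text` is a hypothetical stand-in that sleeps instead of doing real network I/O, the URLs are placeholders, and the worker count of 4 is an arbitrary choice:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def scrape_text(url):
    """Hypothetical stand-in for a real scraper: sleeps instead of fetching."""
    time.sleep(0.05)
    return "text from {}".format(url)

urls = ["https://example.com/page{}".format(i) for i in range(8)]

# max_workers is set on the executor, not on map().
with ThreadPoolExecutor(max_workers=4) as executor:
    pages = list(executor.map(scrape_text, urls))  # results come back in input order

print(pages[0])
```

With four workers the eight simulated downloads overlap, so the whole batch takes roughly two sleep periods instead of eight.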

# You are writing a python3 program. When should you use Docker and when should you use VENV?

The main difference is the level of isolation.

VENV isolates the Python packages of a project from the system-wide installation, so each project can pin its own dependency versions. It is useful for local development and simple applications.

Docker packages the application together with its operating-system environment (system libraries, interpreter, and configuration) and is a better option for more complex projects that need to be shipped and run elsewhere.
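
The isolation a venv provides is at the level of the Python installation paths, which can be checked from inside the interpreter. A small sketch, with `in_virtualenv` as a hypothetical helper name:

```python
import sys

def in_virtualenv():
    """True when the interpreter runs inside a venv (Python 3.3+)."""
    # Inside a venv, sys.prefix points at the venv directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix

print("venv active:", in_virtualenv())
print("packages install under:", sys.prefix)
```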

# What is requirements.txt used for?

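It is a file that lists all the dependencies needed to run a Python project, often with pinned versions, so they can be installed in one step with pip install -r requirements.txt. It is good practice to keep it in the root directory of the project.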
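
To give an idea of what such a file contains, this sketch prints `name==version` lines for the packages installed in the current environment, similar in spirit to what `pip freeze` produces. The `freeze` helper is a hypothetical name, and the output depends entirely on the environment it runs in.

```python
from importlib import metadata

def freeze():
    """Return 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with broken metadata
            lines.add("{}=={}".format(name, version))
    return sorted(lines, key=str.lower)

for line in freeze():
    print(line)
```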

# Why not use `git add .`

`git add .` stages every new or modified file in the current directory and its subdirectories.
Some of those files, such as credentials, build artifacts, or large data files, should not be tracked or kept in the repo.

# Compare matrix multiplication using the GPU, x86 CPU, and x86 vector coprocessor such as AVX2/SSE3. Why do these different kinds of hardware all exist in our personal computers?

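AVX2 and SSE3 are SIMD (Single Instruction, Multiple Data) extensions of the x86 CPU: one instruction operates on several values at once.
AVX2 has 16 registers of 256 bits each in 64-bit mode. Its instructions can also take 3 operands.

SSE3 has 8 registers of 128 bits each in 32-bit mode (16 in 64-bit mode). Its instructions take 2 operands.

A plain 32-bit x86 CPU has 8 general-purpose registers of 32 bits each.


GPU: a GPU relies on massively parallel computation, and it has far more cores than a CPU.
These kinds of hardware coexist because they serve different workloads: the CPU runs general-purpose, branch-heavy code, the vector units speed up small data-parallel loops inside that code, and the GPU offers the highest throughput for large, regular computations such as matrix multiplication.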


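Matrix multiplication (approximate figures that depend heavily on the hardware and the matrix size):

GPU vs AVX2 ----> GPU is about 10 times faster.
AVX2 vs SSE3 ---> AVX2 is up to 3 times faster (from a research comparison).
GPU vs non-AVX2 code ---> GPU is 50 to 100 times faster.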
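
For reference, this is what the scalar baseline looks like: a naive pure-Python sketch (the `matmul` helper is written for this example) that performs one multiply-add at a time. SIMD units process several iterations of the inner loop per instruction, and a GPU spreads the independent output cells over many threads.

```python
def matmul(a, b):
    """Naive matrix product of two lists-of-lists, one multiply-add at a time."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Each (i, j) cell depends only on row i of a and column j of b,
            # so all cells could be computed independently and in parallel.
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
    return result

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```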

# Compare and contrast Ubuntu and RHEL.

Ubuntu: 
- OS: Linux
- ENV: Desktop, Server
- Use Case: General use, server
- Level: Good for people new to Linux

Red Hat:
- OS: Linux
- ENV: Desktop, Server, Cloud
- Use Case: Business, Commercial
- Level: Suited to intermediate users, such as Ubuntu users who want to start using Linux commercially.