# Python Programming

**What is Python?** source: [python.org](https://www.python.org/doc/essays/blurb/)

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.

**What is Python capable of?**

Short answer: almost anything. Python is commonly used for web development and prototyping, with frameworks such as [Flask](http://flask.palletsprojects.com/en/1.1.x/) and [Django](https://www.djangoproject.com/). Because of its simplicity, it is also widely used in data science, both in industry and in academia.


In this lab, we mainly cover the following Python programming topics:

<ul>
    <li>data types</li>
    <li>built-in data structures</li>
    <li>I/O in Python</li>
    <li>loops in Python</li>
    <li>functions and classes</li>
</ul>

## Data Types

In [None]:
# integer
a = 1
print('I am an integer:', a)

In [None]:
# float -> decimal numbers
a = 1.0
print('I am a float:', a)

In [None]:
# string -> a sequence of characters
a = "GOTO Chicago Fall 2019"
print('I am a string:', a)

In [None]:
# boolean -> True or False
a = True
print('I am a boolean:', a)

**Data Type Conversion**

In [None]:
# integer to float
a = 1
print('I was an integer:', a, ', but now I am a float:', float(a))

In [None]:
# float to integer
a = 1.0
print('I was a float:', a, ', but now I am an integer:', int(a))

In [None]:
# string to integer/float
# float() parses the string as a number; eval() would execute arbitrary code, so avoid it here
a = '1.0'
print('I was a string: "'+ a +'", but now I am a number:', float(a))

In [None]:
# integer/float to string
a = 1.0
print('I was a number:', a, ', but now I am a string:', str(a))

Additional material for reading if interested: [source](https://www.geeksforgeeks.org/type-conversion-python/)
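
The built-in `int()` and `float()` parse numeric strings directly, without evaluating arbitrary code. The sketch below shows one way to combine them; the helper name `to_number` is made up for illustration and is not part of Python.

```python
def to_number(text):
    """Convert a numeric string to int when possible, otherwise to float."""
    try:
        return int(text)      # works for '1', raises ValueError for '1.0'
    except ValueError:
        return float(text)    # works for '1.0', raises ValueError for 'abc'

whole = to_number('1')
decimal = to_number('1.0')
print(whole, type(whole).__name__)
print(decimal, type(decimal).__name__)
```

A string that is not a number at all, such as `'abc'`, still raises `ValueError`, which is usually what you want when the input is bad.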

## Data Structures

In [None]:
# list -> mutable
a = [1, 2, 3, 'a', 'b', 'c']
print('I am a list:', a)

In [None]:
# tuple -> immutable
a = (1, 2, 3, 'a', 'b', 'c')
print('I am a tuple:', a)

In [None]:
# dictionary
a = {'1': 'a', 2: 'b', 'c': 3}
print('I am a dictionary:', a)

**List**

A list is mutable, meaning that after a list is defined, you can alter the values stored in it. A **tuple** is immutable, meaning that once it's defined, you can't change its values. Let's focus on the list for now.
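
Before moving on, here is a quick sketch of that difference. The variable names `demo_list` and `demo_tuple` are only for illustration.

```python
demo_list = [1, 2, 3]
demo_tuple = (1, 2, 3)

demo_list[0] = 100            # allowed: a list can be changed in place
print(demo_list)

try:
    demo_tuple[0] = 100       # not allowed: a tuple cannot be changed
except TypeError as error:
    tuple_error = error       # keep the exception so we can inspect it
    print('TypeError:', error)
```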

In [None]:
a = [1, 2, 3]

In [None]:
# select an item
a[0]

In [None]:
# select a range of items
a[0:2]

In [None]:
# add an item
# method 1: append to the end of the list
a.append(4)
a

In [None]:
# method 2: insert into a specific position
a.insert(2, 10)
a

In [None]:
# remove an item
# method 1: pop the last item in the list
a.pop()

In [None]:
a

In [None]:
# method 2: remove a specific position
a.pop(2)

In [None]:
a

In [None]:
# method 3: remove a specific value
a.remove(3)

In [None]:
a

**Dictionary**

In [None]:
# initiate an empty dictionary
a = dict()
a

In [None]:
# add an item
# dictionary[key_name]= value_name
a['my_key'] = 'my_value'
a['my_key_2'] = 'my_value_2'
a

In [None]:
# select a value based on key
a['my_key']

In [None]:
# remove an item
# method 1: pop the item added last
a.popitem()

In [None]:
a

In [None]:
# method 2: pop the key
a.pop('my_key')

In [None]:
a

**Substring**

In [None]:
a = 'I am a string.'

In [None]:
# select a sub string out of the main string
a[:5]
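
String slicing follows the same `start:stop` rules as list slicing. A few more slices on the same sentence, as a small sketch (the variable names are illustrative):

```python
text = 'I am a string.'

first_five = text[:5]       # from the start up to, but not including, index 5
word = text[7:13]           # characters at positions 7 through 12
last_seven = text[-7:]      # negative indices count from the end
backwards = text[::-1]      # a step of -1 walks the string in reverse

print(repr(first_five), repr(word), repr(last_seven), repr(backwards))
```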

## Input/Output (I/O) in Python

Imagine you're reading a book. What steps does it take to read it? There are always three:

<ol>
    <li>Open the book</li>
    <li>Read the book</li>
    <li>Close the book</li>
</ol>

Similarly, in Python, to read a text file from your local machine, you need to open the file, read the content, and close the file. Here is how to do it.

**Step-by-Step Version**

In [None]:
# open the file
path = './data/dummy.txt'
file = open(path, 'r')

`r` denotes read only mode.

In [None]:
# read the content
content = file.read()

In [None]:
print(content)

In [None]:
# close the file
file.close()

Now you can work with the content stored in the `content` variable.

In [None]:
content

In [None]:
# split() is a built-in string method that splits a string on the separator
# specified; if nothing is specified, it splits on runs of whitespace by default
content.split('\n')

**Question:** What is the output data structure of the split method?

**"All-in-One" or The Clean Version**

In [None]:
# the "with" statement closes the file for you, even if an error occurs inside the block
with open('./data/dummy.txt', 'r') as file:
    content2 = file.read()

In [None]:
content2

Though you may have heard about tools like `pandas` and `NumPy` that take care of I/O for you, knowing how to correctly read files using native Python is also important because data come in different formats.
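
As one example of handling a file in native Python, a file object can be looped over line by line, which avoids loading a large file into memory all at once. The sketch below writes its own throwaway file with `tempfile` so it does not depend on `./data/dummy.txt`; the file contents are made up.

```python
import os
import tempfile

# Create a small temporary file to read from (contents are made up)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    tmp_path = tmp.name

lines = []
with open(tmp_path, 'r') as file:
    for line in file:                      # a file object yields one line at a time
        lines.append(line.rstrip('\n'))    # drop the trailing newline

os.remove(tmp_path)                        # clean up the temporary file
print(lines)
```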

## Loops in Python

Given that you are already a developer, this section just shows how to write `for` and `while` loops in Python.


**For Loop**

In [None]:
my_list = [1, 2, 3, '4', '5', '7', True]

In [None]:
for item in my_list:                        # loop through the list
    if isinstance(item, str):               # if the item is a string
        print(item, 'is a string.')
    else:
        print(item, 'is not a string.')

**While Loop**

In [None]:
i = 0
while i < len(my_list):
    item = my_list[i]
    if isinstance(item, str):               # if the item is a string
        print(item, 'is a string.')
    else:
        print(item, 'is not a string.')
    i += 1

**Break and Continue**

`break`: exits the enclosing loop immediately; execution continues with the code after the loop

`continue`: skips the rest of the code in the current iteration and moves on to the next iteration

In [None]:
for item in my_list:
    if not isinstance(item, str):
        continue
        print('this is not a string.. this line is not even executed at all')
    else:
        print(item)

**Q&A:** What happens if we swap out `continue` with `break`?
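
One way to check your answer is to run both versions side by side. The helper `collect_strings` and its `stop_at_first_non_string` flag are hypothetical names used only for this sketch.

```python
items = [1, 2, 3, '4', '5', '7', True]

def collect_strings(values, stop_at_first_non_string):
    """Collect the strings in values, using break or continue on non-strings."""
    found = []
    for value in values:
        if not isinstance(value, str):
            if stop_at_first_non_string:
                break        # leave the loop entirely
            continue         # skip to the next iteration
        found.append(value)
    return found

with_continue = collect_strings(items, stop_at_first_non_string=False)
with_break = collect_strings(items, stop_at_first_non_string=True)
print('continue:', with_continue)
print('break:   ', with_break)
```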

## Function and Class

A `function` is a set of Python statements that takes inputs, does something with them, and returns the results. Note: this is the typical definition; not every function needs to take inputs or return output.

A `class` is a code template for creating objects, and it typically contains one or more functions (called methods). If you have experience in Java, it is much like a Java class.

In [None]:
# function example
def my_function(a_input):                        # define a function
    output = str(a_input) + ' is the input'     # do something to the input
    return output                               # return output

In [None]:
my_function(5)

In [None]:
# class example
class my_class(object):
    """i am a doc string :) """
    
    def __init__(self, input_str=None):
        """I am the first function to be called when a new object is initiated"""
        self.input_str = str(input_str)
        self.output_str = None
    
    def do_something(self):
        """I am a separate method that does something in this class
        Note that I don't have to return anything here
        """
        self.output_str = self.input_str + ' is the input'
    
    def give_me_output(self):
        return self.output_str

In [None]:
new_instance = my_class(5)     # at this step, __init__ is called

In [None]:
new_instance.do_something()    # here, do_something() is called

In [None]:
new_instance.give_me_output()  # here, give_me_output() is called

Finally, if you care about code style, which you should, feel free to check out the PEP 8 style guide for Python code [here](https://www.python.org/dev/peps/pep-0008/).

**Exercise**

<span style='color: red;'>If time permits</span>, try building a calculator in Python that supports the following operations.
<ul>
    <li>Addition</li>
    <li>Subtraction</li>
    <li>Multiplication</li>
    <li>Division</li>
</ul>

In [None]:
class MyCalculator(object):
    """insert your doc string here"""
    
    def __init__(self):
        pass
    
    def add(self):
        pass
    
    def subtract(self):
        pass
    
    def multiply(self):
        pass
    
    def divide(self):
        pass

# Statistics

Whew, I'm sure the Python section bored you a little. Let's learn something more fun.

<img src='https://media1.giphy.com/media/1P1qpwEW1YGdxeOQLq/giphy.gif' alt='Let the fun begin'>

Let's first use Python's `random` module to generate 1,000,000 random numbers.

In [None]:
import random
random.seed(1234)     
# by setting a seed, you can ensure every time you run this block of code, the output is the same

dataset = [random.random() for _ in range(1_000_000)]    
# underscore means that variable is not important to store

import statistics # let's import this for later

In [None]:
len(dataset)

In [None]:
dataset[:5]

**Arithmetic Mean a.k.a. Average**

In [None]:
def average(arr):
    return sum(arr) / len(arr)

In [None]:
mean = average(dataset)

In [None]:
mean

In [None]:
# validate
statistics.mean(dataset)

**Median**

In [None]:
def median(arr):
    arr = sorted(arr)
    median_index = len(arr) // 2   # integer division gives the middle index
    if len(arr) % 2 == 0:      # if the list has even number of elements
        median = average([arr[median_index], arr[median_index-1]])
    else:
        median = arr[median_index]
    return median

In [None]:
median(dataset)

In [None]:
# validate
statistics.median(dataset)

**Variance**

In [None]:
def variance(arr):
    mean = average(arr)      # note that I am using the average() function here
    variance = sum(
                map(lambda i: (i-mean)**2, arr)
                  ) / (len(arr) - 1)
    return variance

`map` applies a function (its first argument) to every element of the collection passed as its second argument.

`lambda` is Python's way to define a small anonymous function inline; see more [here](https://realpython.com/python-lambda/)
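
To make `map` and `lambda` concrete, the sketch below computes squared deviations on a tiny made-up list in two ways: with `map` and `lambda`, and with an equivalent list comprehension. Both produce the same values.

```python
values = [1, 2, 3, 4]
mean_value = sum(values) / len(values)

# map + lambda: apply the anonymous function to every element
with_map = list(map(lambda v: (v - mean_value) ** 2, values))

# list comprehension: the same computation written inline
with_comprehension = [(v - mean_value) ** 2 for v in values]

print(with_map)
print(with_comprehension)

# Sample variance: sum of squared deviations divided by n - 1
sample_variance = sum(with_map) / (len(values) - 1)
print(sample_variance)
```

Which form to use is largely a matter of taste; comprehensions are often considered easier to read for short expressions like this one.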

In [None]:
variance(dataset)

In [None]:
# validate
statistics.variance(dataset)

**Standard Deviation**

In [None]:
def stdv(arr):
    var = variance(arr)
    standard_deviation = var ** 0.5
    return standard_deviation

In [None]:
stdv(dataset)

In [None]:
# validate
statistics.stdev(dataset)

Ok... I got the math now, but what do all of these mean???????

<img src='https://media.giphy.com/media/1oJLpejP9jEvWQlZj4/giphy.gif' alt='confusion'>

<img src='https://miro.medium.com/max/24000/1*IdGgdrY_n_9_YfkaCh-dag.png'>

A normal distribution is a bell-shaped distribution in which the `mean` equals the `median`. In addition, approximately

<ul>
    <li><strong>68%</strong> of the data fall within +/-1 standard deviation of the mean</li>
    <li><strong>95%</strong> of the data fall within +/-2 standard deviations of the mean</li>
    <li><strong>99.7%</strong> of the data fall within +/-3 standard deviations of the mean</li>
</ul>
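
These percentages can be checked empirically. The sketch below draws samples from a standard normal distribution with `random.gauss` (the sample size, seed, and helper name `share_within` are arbitrary choices for illustration) and counts how many fall within 1, 2, and 3 standard deviations of the mean.

```python
import random
import statistics

random.seed(1234)
samples = [random.gauss(0, 1) for _ in range(100_000)]   # normally distributed data

mu = statistics.mean(samples)
sigma = statistics.stdev(samples)

def share_within(data, center, spread, k):
    """Fraction of data points within k * spread of center."""
    inside = sum(1 for value in data if abs(value - center) <= k * spread)
    return inside / len(data)

shares = {k: share_within(samples, mu, sigma, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f'within {k} standard deviation(s): {share:.3f}')
```

The printed shares should land close to the percentages listed above, with small differences due to sampling.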

To measure how far a value lies from the sample mean, in units of standard deviations, we calculate a `z-score`, which we do not cover in depth today. Read more about `z-score` [here](https://www.investopedia.com/terms/z/zscore.asp).
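
For the curious, a z-score is the distance from the mean divided by the standard deviation. A minimal sketch on a small made-up list of scores (the data and the helper name `z_score` are illustrative only):

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up data
mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)        # sample standard deviation (n - 1)

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_scores = [z_score(s, mean_score, sd_score) for s in scores]
for s, z in zip(scores, z_scores):
    print(s, round(z, 2))
```

A value equal to the mean gets a z-score of 0, values above the mean get positive z-scores, and values below it get negative ones.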

**Q&A**: Given we know how far a data point is away from the mean, what information can we interpret from it?

## Let's Make It A Bit More Complicated

We've been focusing on one variable. Let's now look into the descriptive statistics of two variables.

Assuming we have two variables, `X` and `Y`, we can calculate the same descriptive statistics above for both of them such that

<table width='100%'>
    <tr>
        <td> Average </td>
        <td>
            \begin{equation}
                \bar{X} = \frac{\displaystyle\sum_{i=1}^n x_i}{n}
            \end{equation}
        </td>
        <td>
            \begin{equation}
                \bar{Y} = \frac{\displaystyle\sum_{i=1}^n y_i}{n}
            \end{equation}
        </td>
    </tr>
    <tr>
        <td> Variance </td>
        <td>
            \begin{equation}
                Var(X) = \sigma_x^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{X})^2} {n-1}
            \end{equation}
        </td>
        <td>
            \begin{equation}
                Var(Y) = \sigma_y^2 = \frac{\displaystyle\sum_{i=1}^{n}(y_i - \bar{Y})^2} {n-1}
            \end{equation}
        </td>
    </tr>
    <tr>
        <td> Standard Deviation </td>
        <td>
            \begin{equation}
                Stdev(X) = \sqrt{Var(X)}
            \end{equation}
        </td>
        <td>
            \begin{equation}
                Stdev(Y) = \sqrt{Var(Y)}
            \end{equation}
        </td>
    </tr>
</table>

**Covariance**

Covariance is a measure of how much two random variables vary together.

\begin{equation}
    cov_{x,y}=\frac{\displaystyle\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{N-1}
\end{equation}

If the covariance is positive, the two variables tend to move in the same direction; if it is negative, they tend to move in opposite directions. However, covariance has no upper or lower bound and depends on the units of the variables, so its magnitude alone cannot tell us how strong the relationship is.
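
The covariance formula above can be sketched in plain Python. The function name `covariance` and the tiny data set are made up for illustration.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(xs) != len(ys):
        raise ValueError('xs and ys must have the same length')
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]          # made-up data
scores = [2, 4, 6, 8, 10]        # rises with hours

cov_positive = covariance(hours, scores)
cov_negative = covariance(hours, scores[::-1])   # reversed: falls as hours rise
cov_with_self = covariance(hours, hours)         # same as the variance of hours

print(cov_positive, cov_negative, cov_with_self)
```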

**Correlation**

Correlation coefficients are used to measure how strong the relationship is between two variables. There are several types of correlation coefficients and Pearson's Correlation is commonly used in describing linear relations.

\begin{equation}
    cor_{x,y} = \frac{cov_{x, y}}{\sigma_x\sigma_y}
\end{equation}
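
To see what dividing by the standard deviations buys us, here is a minimal pure-Python sketch of the formula above. The function name `pearson` and the data are made up for illustration. Rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
def pearson(xs, ys):
    """Pearson correlation: sample covariance divided by both sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = (sum((x - mean_x) ** 2 for x in xs) / (n - 1)) ** 0.5
    sd_y = (sum((y - mean_y) ** 2 for y in ys) / (n - 1)) ** 0.5
    return cov / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]                      # made-up data
scores = [2, 4, 6, 8, 10]
scores_scaled = [s * 100 for s in scores]    # same data in different units

r_original = pearson(hours, scores)
r_scaled = pearson(hours, scores_scaled)
r_reversed = pearson(hours, scores[::-1])

print(r_original, r_scaled, r_reversed)
```

Because the result always lies between -1 and 1, correlations from different pairs of variables can be compared directly.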

In [None]:
# Let's generate two random variables with the same size
import numpy as np
# numpy (Numerical Python) is a very popular data science tool (https://numpy.org/)

x = np.random.random(1_000_000)
y = np.random.random(1_000_000)

In [None]:
len(x), len(y)

In [None]:
x[:5]

In [None]:
# Let's calculate covariance
np.cov(x, y)

The output above is known as the `variance-covariance matrix`: the diagonal holds the variance of each variable, and the off-diagonal positions hold the covariance of each pair of variables, such that

<table width='30%'>
    <tr>
        <td></td>
        <td>X</td>
        <td>Y</td>
    </tr>
    <tr>
        <td>X</td>
        <td>Var(X)</td>
        <td>Cov(X, Y)</td>
    </tr>
    <tr>
        <td>Y</td>
        <td>Cov(X, Y)</td>
        <td>Var(Y)</td>
    </tr>
</table>

**Note:** Cov(X, X) is equal to Var(X)

In [None]:
# Let's calculate correlation
np.corrcoef(x, y)

The output above is known as the `correlation matrix`: the diagonal holds the correlation of each variable with itself, and the off-diagonal positions hold the correlation of each pair of variables, such that

<table width='30%'>
    <tr>
        <td></td>
        <td>X</td>
        <td>Y</td>
    </tr>
    <tr>
        <td>X</td>
        <td>Corr(X, X)</td>
        <td>Corr(X, Y)</td>
    </tr>
    <tr>
        <td>Y</td>
        <td>Corr(X, Y)</td>
        <td>Corr(Y, Y)</td>
    </tr>
</table>

**Note:** Corr(X, X) is 1 -> A variable is perfectly positively correlated with itself.

# Recap

In this session, we've covered

<ol>
    <li>Basic Python programming</li>
    <li>Statistics for a single variable</li>
    <li>Descriptive statistics for two variables</li>
</ol>

You should be able to
<ol>
    <li>Complete the Calculator exercise</li>
    <li>Explain what descriptive statistics are appropriate for a single variable</li>
    <li>Interpret the implications of descriptive statistics</li>
    <li>Explain the difference between covariance and correlation</li>
    <li>Interpret Pearson's Correlation Coefficient</li>
</ol>