# SSC Data Science and Analytics Workshop 2021

## The Data Scientist’s Workflow: EDA and Statistical Modeling with Python in Jupyter Notebooks

### Introduction to Python (David)

Welcome! Over the next 40 minutes, I'll introduce some of the fundamentals of the **Python programming language**.

Note before we begin: you can follow along with a copy of this Jupyter notebook here: ...

First, a bit about the Python language:

- General purpose, multi-paradigm programming language
- Beginner-friendly syntax, a common first programming language
- Robust ecosystem of libraries spanning all application domains, including statistics, data science, and machine learning
- Supports extensions written in C when efficiency is key

## Simple Data Types and Operations (aka Python as an extended calculator)

Python has three numeric data types: `int`, `float`, and `complex`. We'll focus on just `int` and `float` today.

In [1]:
3 + 4

7

In [2]:
1.5 * 2.7

4.050000000000001

In [3]:
10 / 3

3.3333333333333335

In [4]:
10 // 3

3

In [5]:
2 ** 100

1267650600228229401496703205376
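
The difference between `/` and `//` is easy to miss, so here is a minimal sketch of how they relate. It also uses the remainder operator `%` and the built-in `type` function, neither of which appears in the cells above; the variable names are just for illustration.

```python
# True division always produces a float, even when the result is a whole number.
quotient = 10 / 5
print(quotient, type(quotient))      # 2.0 <class 'float'>

# Floor division of two ints produces an int; % gives the remainder.
floor_quotient = 10 // 3
remainder = 10 % 3
print(floor_quotient, remainder)     # 3 1

# The two results recombine into the original number.
print(floor_quotient * 3 + remainder == 10)   # True
```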

The boolean type `bool` has two values, `True` and `False`.

Comparison operators return booleans:

In [6]:
3 > 0

True

In [7]:
3 == 5

False

In [8]:
3 != 5

True

Python has three boolean operators: `not`, `and`, and `or`.

In [9]:
3 > 0 and 2 > 0

True

In [10]:
3 < 0 or 2 < 0

False
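
The cells above show `and` and `or` but not `not`. Here is a small sketch of the third operator; the variable names are made up for this example.

```python
is_positive = 3 > 0
print(not is_positive)        # False

# `not` applies to whatever follows it, so parentheses change the meaning.
negated_whole = not (False or True)   # negates the result of the `or`
negated_first = not False or True     # read as (not False) or True
print(negated_whole)          # False
print(negated_first)          # True
```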

Strings (aka `str`) are sequences of characters. Python has some pretty neat operations on strings.

In [11]:
"Python" + "is" + "cool"

'Pythoniscool'

In [12]:
"Python" in "the Python programming language"

True

In [13]:
s = "Python is cool"
s[1]

'y'

In [14]:
s[0:6]

'Python'
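
Indexing and slicing have a few more forms that are handy in practice. This sketch reuses the same string `s` and shows negative indices and slices with an omitted start or end.

```python
s = "Python is cool"

last_char = s[-1]     # negative indices count from the end
first_word = s[:6]    # omitting the start means "from the beginning"
last_word = s[-4:]    # omitting the end means "to the end"

print(last_char)      # l
print(first_word)     # Python
print(last_word)      # cool
print(len(s))         # 14
```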

Python strings support Unicode by default.

In [15]:
"Allô 你好"

'Allô 你好'

Python supports **string interpolation**, which lets you embed code expressions within strings.

We'll demo one modern approach: f-strings.

In [16]:
n = 10

f'2 to the power of {n} is {2**n}.'

'2 to the power of 10 is 1024.'
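
f-strings also accept an optional *format specifier* after a colon, which controls how a value is displayed. A brief sketch with a few common specifiers (the variable names are arbitrary):

```python
n = 20
ratio = 10 / 3

rounded = f'{ratio:.2f}'   # two digits after the decimal point
grouped = f'{2**n:,}'      # comma as a thousands separator
padded = f'{n:>5}'         # right-aligned in a field of width 5

print(rounded)   # 3.33
print(grouped)   # 1,048,576
print(padded)    #    20
```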

One **warning** about using Jupyter notebooks: variables are shared across cells, so you have to be careful to execute cells in order.

In [17]:
x = 100
2 ** x

1267650600228229401496703205376

In [18]:
y = x + 1
2 ** y

2535301200456458802993406410752

In [19]:
z = y + 1
2 * z

204

Tips to avoid problems caused by out-of-order execution:

- Always run cells in top-down order
- Use the "`In [XX]`" numbers to track the history
- When in doubt, re-run all cells (`Ctrl + A` and then `Ctrl + Enter`)

## Functions, methods, and libraries

A **function** is a ...

Here are some examples of Python's built-in functions.

In [20]:
abs(-1.5)

1.5

In [21]:
max(1, 3)

3

In [22]:
max(1, 2, 3, 4, 5)

5

In [23]:
len("Hello, world!")

13

**Tip**: you can use `help` to get info about a function.

In [24]:
help(abs)

Help on built-in function abs in module builtins:

abs(x, /)
    Return the absolute value of the argument.



![python_builtin_functions.png](python_builtin_functions.png)

Python has many **libraries** that define additional functions. We can access these functions by **importing** them from libraries.

In [25]:
import math  # Import the entire math library

math.sqrt(100)  # Call the sqrt function from the math library

10.0

In [26]:
from math import sqrt  # Import a specific function

sqrt(100)  # Now, call the function without the "math." prefix

10.0
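
Both import styles give you the same function object; they differ only in the name you use to reach it. A third style, importing a library under a shorter alias, is also common. In this sketch the alias `mt` is an arbitrary choice for illustration.

```python
import math
from math import sqrt
import math as mt   # import the library under a shorter alias (name chosen here)

# All three names refer to the very same function.
same_function = (math.sqrt is sqrt) and (mt.sqrt is sqrt)
print(same_function)   # True

print(mt.sqrt(100))    # 10.0
```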

You can use the `dir` and `help` functions to learn more about a library.

In [27]:
dir(math)

['__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'acos',
 'acosh',
 'asin',
 'asinh',
 'atan',
 'atan2',
 'atanh',
 'ceil',
 'comb',
 'copysign',
 'cos',
 'cosh',
 'degrees',
 'dist',
 'e',
 'erf',
 'erfc',
 'exp',
 'expm1',
 'fabs',
 'factorial',
 'floor',
 'fmod',
 'frexp',
 'fsum',
 'gamma',
 'gcd',
 'hypot',
 'inf',
 'isclose',
 'isfinite',
 'isinf',
 'isnan',
 'isqrt',
 'lcm',
 'ldexp',
 'lgamma',
 'log',
 'log10',
 'log1p',
 'log2',
 'modf',
 'nan',
 'nextafter',
 'perm',
 'pi',
 'pow',
 'prod',
 'radians',
 'remainder',
 'sin',
 'sinh',
 'sqrt',
 'tan',
 'tanh',
 'tau',
 'trunc',
 'ulp']

In [28]:
help(math)

Help on built-in module math:

NAME
    math

DESCRIPTION
    This module provides access to the mathematical functions
    defined by the C standard.

FUNCTIONS
    acos(x, /)
        Return the arc cosine (measured in radians) of x.
        
        The result is between 0 and pi.
    
    acosh(x, /)
        Return the inverse hyperbolic cosine of x.
    
    asin(x, /)
        Return the arc sine (measured in radians) of x.
        
        The result is between -pi/2 and pi/2.
    
    asinh(x, /)
        Return the inverse hyperbolic sine of x.
    
    atan(x, /)
        Return the arc tangent (measured in radians) of x.
        
        The result is between -pi/2 and pi/2.
    
    atan2(y, x, /)
        Return the arc tangent (measured in radians) of y/x.
        
        Unlike atan(y/x), the signs of both x and y are considered.
    
    atanh(x, /)
        Return the inverse hyperbolic tangent of x.
    
    ceil(x, /)
        Return the ceiling of x as an Integral.
      

In Python, data types (aka *classes*) can define functions that operate on values of that class. A function that is defined as a part of a class is called a **method**.

Here are some examples of string methods.

In [29]:
my_str = 'python is a cool language'

In [30]:
my_str.upper()

'PYTHON IS A COOL LANGUAGE'

In [31]:
my_str.count('o')

3

In [32]:
my_str.split()

['python', 'is', 'a', 'cool', 'language']

**Tip**: You can call `dir` and `help` on data types like `str` too.
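
As a rough sketch of that tip, the code below lists the methods of `str` whose names don't start with an underscore (the underscore-prefixed ones are mostly for Python's internal use), then asks for help on one of them. The variable name `public_methods` is just for this example.

```python
public_methods = []

for name in dir(str):
    if not name.startswith('_'):   # skip internal names like __add__
        public_methods.append(name)

print(len(public_methods))
print(public_methods)

help(str.upper)   # help also works on a single method
```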

## Collection data types (lists and dictionaries)

A **list** is an ordered sequence of values. In Python, lists can contain elements of different types, although usually they won't.

In [33]:
my_list = [10, 20, 30, 40]

In [34]:
my_list[0]  # List indexing starts at 0!

10

In [35]:
len(my_list)

4

In [36]:
30 in my_list

True

In [37]:
print(my_list + [7, 8, 9])  # List concatenation
print(my_list)              # Original list is unchanged

[10, 20, 30, 40, 7, 8, 9]
[10, 20, 30, 40]


In [38]:
my_list.append('Python')
my_list.append('R')
print(my_list)

[10, 20, 30, 40, 'Python', 'R']
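
The difference between `+` and `append` is worth spelling out: `+` builds a new list, while `append` changes the existing list in place. A small sketch with a made-up list, which also shows assignment to an index:

```python
scores = [10, 20, 30, 40]

longer = scores + [50]   # builds a new list; scores is untouched
print(longer)            # [10, 20, 30, 40, 50]
print(scores)            # [10, 20, 30, 40]

scores.append(50)        # modifies scores itself
print(scores)            # [10, 20, 30, 40, 50]

scores[0] = 99           # elements can be reassigned by index
print(scores)            # [99, 20, 30, 40, 50]
```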


In [39]:
help(list)

Help on class list in module builtins:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self))

A **dictionary** is a map (aka lookup table) from keys to corresponding values.

In [40]:
my_dict = {'a': 10, 'b': 20, 'c': 30}

In [41]:
my_dict['a']  # Key lookup

10

In [42]:
'a' in my_dict

True

In [43]:
my_dict['a'] = 40  # Key assignment
print(my_dict)

my_dict['d'] = 60  # Can assign to new keys
print(my_dict)

{'a': 40, 'b': 20, 'c': 30}
{'a': 40, 'b': 20, 'c': 30, 'd': 60}
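
Looking up a key that isn't in the dictionary with square brackets raises a `KeyError`. One way around this is the `get` method, sketched below with a made-up dictionary; `get` returns `None` (or a default you supply) when the key is missing.

```python
stock = {'a': 10, 'b': 20}

found = stock.get('a')          # key exists: returns its value
missing = stock.get('z')        # key missing: returns None
fallback = stock.get('z', 0)    # key missing: returns the given default

print(found)      # 10
print(missing)    # None
print(fallback)   # 0
print('z' in stock)   # False: get does not add the key
```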


## Control flow statements

If statements use three keywords: `if`, `elif`, `else`.

In [44]:
x = 10

if x > 50:
    print('x is greater than 50')
elif x > 5:
    print('x is greater than 5')
elif x > 0:
    print('x is greater than 0')
else:
    print('x is <= 0')

x is greater than 5


**Warning**: Python is infamous for its syntactically significant whitespace. Code inside branches of an if statement **must** be indented.

For loops can iterate over a range of numbers or a collection.

In [45]:
for i in range(0, 10):
    print(i)

0
1
2
3
4
5
6
7
8
9


In [46]:
my_list = [10, 20, 30, 40]

for number in my_list:
    print(number)

10
20
30
40


In [47]:
my_dict = {'a': 10, 'b': 20, 'c': 40}

for key in my_dict:
    print(key)
    print(my_dict[key])

a
10
b
20
c
40
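
When you need both the key and the value, the dictionary's `items` method lets you loop over the two together. A short sketch using the same dictionary (the `pairs` list is only there to collect the results):

```python
my_dict = {'a': 10, 'b': 20, 'c': 40}

pairs = []
for key, value in my_dict.items():   # each item is a (key, value) pair
    print(key, value)
    pairs.append((key, value))
```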


## A Larger Example; Defining Functions

Now, let's put together everything we've learned so far to build a small simulation.

1. Simulate the rolling of two fair six-sided dice, n = 1000 times. Compute the sum of the dice in each trial.
2. Report the frequencies of the sums.

In [48]:
import random

trials = []  # This variable stores all of the sums

for i in range(0, 1000):
    d1 = random.randint(1, 6)
    d2 = random.randint(1, 6)

    trials.append(d1 + d2)

print(trials)

[8, 9, 9, 8, 7, 8, 11, 7, 3, 10, 10, 9, 5, 7, 3, 7, 7, 6, 7, 5, 5, 9, 10, 3, 6, 8, 8, 9, 7, 9, 6, 6, 4, 12, 7, 9, 6, 9, 9, 5, 10, 11, 5, 4, 9, 9, 7, 9, 9, 8, 8, 8, 10, 7, 8, 2, 7, 8, 7, 7, 7, 4, 3, 10, 8, 6, 9, 8, 4, 7, 2, 9, 9, 3, 5, 7, 3, 8, 6, 7, 9, 6, 9, 8, 7, 6, 11, 7, 6, 5, 6, 5, 7, 9, 5, 4, 6, 9, 5, 9, 10, 11, 2, 10, 8, 7, 8, 3, 8, 10, 8, 6, 7, 7, 3, 10, 6, 7, 8, 8, 2, 12, 7, 8, 7, 6, 2, 10, 11, 6, 12, 11, 7, 2, 6, 7, 8, 7, 9, 9, 7, 9, 10, 8, 6, 6, 2, 6, 7, 10, 5, 2, 7, 9, 7, 8, 12, 7, 10, 5, 11, 12, 10, 8, 3, 8, 5, 5, 8, 9, 2, 6, 2, 6, 5, 8, 10, 6, 7, 11, 4, 4, 9, 8, 8, 8, 7, 10, 9, 10, 5, 3, 8, 5, 8, 7, 3, 6, 7, 11, 6, 6, 6, 8, 7, 9, 6, 5, 9, 3, 6, 8, 11, 8, 7, 8, 9, 7, 4, 10, 7, 4, 6, 8, 7, 6, 7, 6, 3, 7, 8, 5, 5, 5, 6, 7, 7, 7, 6, 7, 7, 8, 8, 5, 9, 3, 6, 9, 3, 9, 6, 10, 4, 7, 8, 3, 8, 8, 8, 5, 4, 8, 3, 4, 11, 6, 5, 9, 7, 9, 5, 10, 6, 11, 5, 8, 11, 9, 8, 9, 10, 6, 7, 9, 7, 8, 6, 10, 7, 9, 9, 7, 7, 5, 7, 6, 10, 10, 10, 2, 7, 7, 6, 8, 3, 4, 7, 4, 10, 6, 7, 4, 3, 10, 10, 10, 8, 

In [49]:
counts = {}  # This variable stores counts of each possible sum

for i in range(2, 13):
    counts[i] = 0  # Each count starts at 0

for trial in trials:
    counts[trial] = counts[trial] + 1  # Increment counts[trial] by 1

print(counts)

{2: 34, 3: 52, 4: 70, 5: 98, 6: 158, 7: 167, 8: 141, 9: 113, 10: 87, 11: 52, 12: 28}
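
The counting loop is a common enough pattern that Python's standard library has a shortcut for it: `collections.Counter`. The sketch below repeats the simulation and counts the sums that way. The call to `random.seed` uses an arbitrary seed added here for reproducibility; it is not part of the cells above. Note that a `Counter` only has entries for sums that actually occurred.

```python
import random
from collections import Counter

random.seed(2021)   # arbitrary seed, chosen for this sketch only

trials = []
for i in range(0, 1000):
    d1 = random.randint(1, 6)
    d2 = random.randint(1, 6)
    trials.append(d1 + d2)

counts = Counter(trials)        # counts how often each sum appears
print(sorted(counts.items()))   # (sum, count) pairs in order of the sum
```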


What if we wanted to parameterize our simulation by the number of trials `n`?

We can define our own function to do this.

In [50]:
# A simple function definition
def add(x, y):
    return x + y

In [51]:
add(10, 4)

14

In [52]:
def two_dice_sum(n):
    """Simulate rolling two six-sided dice n times. Return a frequency count of their sum."""
    trials = []  # This variable stores all of the sums

    for i in range(0, n):
        d1 = random.randint(1, 6)
        d2 = random.randint(1, 6)

        trials.append(d1 + d2)

    counts = {}  # This variable stores counts of each possible sum

    for i in range(2, 13):
        counts[i] = 0  # Each count starts at 0

    for trial in trials:
        counts[trial] = counts[trial] + 1  # Increment counts[trial] by 1

    return counts

In [53]:
two_dice_sum(1000)

{2: 26,
 3: 54,
 4: 88,
 5: 113,
 6: 142,
 7: 161,
 8: 151,
 9: 108,
 10: 79,
 11: 47,
 12: 31}

In [54]:
two_dice_sum(1000000)

{2: 27869,
 3: 55353,
 4: 83205,
 5: 110980,
 6: 138915,
 7: 166363,
 8: 139127,
 9: 111392,
 10: 83238,
 11: 55760,
 12: 27798}
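
To see how close the simulation gets to theory, we can turn the counts into proportions and compare them with the exact probabilities. For two fair dice, the number of ways to roll a sum $s$ is $6 - |s - 7|$ out of 36. This sketch copies the counts from the one-million-trial run above; the variable names are chosen for this example.

```python
# Counts copied from the two_dice_sum(1000000) output above.
observed = {2: 27869, 3: 55353, 4: 83205, 5: 110980, 6: 138915, 7: 166363,
            8: 139127, 9: 111392, 10: 83238, 11: 55760, 12: 27798}

n = sum(observed.values())

for total in observed:
    proportion = observed[total] / n
    expected = (6 - abs(total - 7)) / 36   # exact probability for fair dice
    print(f'{total:2d}: observed {proportion:.4f}, expected {expected:.4f}')
```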

## NumPy

So far, we've been working in pure Python. But for the rest of the workshop, we're going to rely on a set of third-party libraries that are the standard for statistical and data science Python applications.

**NumPy** is a library that forms the basis of high-performance scientific computing in Python.

- Efficient storage of, and operations on, $n$-dimensional arrays
- Written in C, but offers a high-level Python API

![numpy.png](numpy.png)

The fundamental data type in NumPy is the `array` (aka `ndarray`).

In [55]:
import numpy as np

np.array([1, 2, 3, 4, 5])

array([1, 2, 3, 4, 5])

In [56]:
np.random.rand(100)

array([0.71379929, 0.99552184, 0.99526784, 0.17380469, 0.24039985,
       0.41856841, 0.40874089, 0.68924653, 0.78263335, 0.88405177,
       0.87138009, 0.71683134, 0.93896173, 0.4015585 , 0.14638154,
       0.9872575 , 0.56909674, 0.63198703, 0.71823128, 0.12809703,
       0.35064142, 0.38495738, 0.64980314, 0.33857352, 0.5242218 ,
       0.2420199 , 0.71151075, 0.58012767, 0.7305419 , 0.56347474,
       0.09825253, 0.08814823, 0.01406803, 0.01634735, 0.53406845,
       0.66828775, 0.92701634, 0.48039796, 0.71328651, 0.5059838 ,
       0.47459075, 0.59552729, 0.3163192 , 0.11858963, 0.76475215,
       0.62279785, 0.51830213, 0.57137313, 0.64313929, 0.93138714,
       0.81506978, 0.3133717 , 0.14491597, 0.12319854, 0.47783078,
       0.5695755 , 0.90908795, 0.17535929, 0.58771067, 0.86262653,
       0.6826697 , 0.10662259, 0.5831299 , 0.14975792, 0.68687673,
       0.18011699, 0.19469796, 0.04643973, 0.16026473, 0.02760805,
       0.89509652, 0.99765679, 0.40856894, 0.43422933, 0.65139

In [57]:
m = np.arange(100).reshape(20, 5)
m

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44],
       [45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54],
       [55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64],
       [65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74],
       [75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84],
       [85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94],
       [95, 96, 97, 98, 99]])

In [58]:
print(m.size)
print(m.shape)
print(m.ndim)
print(m.dtype)

100
(20, 5)
2
int32


You can use *indexing* and *slicing* to access elements and subarrays.

In [59]:
m[0]

array([0, 1, 2, 3, 4])

In [60]:
m[0, 0]

0

In [61]:
m[0:20, 0]  # or just m[:, 0]

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
       85, 90, 95])

In [62]:
m[13:15, 1:3]

array([[66, 67],
       [71, 72]])
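
A few more slicing forms, sketched on the same array `m`: negative indices, omitted bounds, and a step. These follow the same rules as slicing on strings and lists, applied separately to each dimension.

```python
import numpy as np

m = np.arange(100).reshape(20, 5)

last_row = m[-1]          # negative indices count from the end
last_column = m[:, -1]    # all rows, last column
top_left = m[:3, :2]      # first 3 rows, first 2 columns
every_fifth = m[::5, 0]   # every fifth row of column 0

print(last_row)           # [95 96 97 98 99]
print(last_column.shape)  # (20,)
print(top_left)
print(every_fifth)        # [ 0 25 50 75]
```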

One of the key features of NumPy is that most operations on arrays are *elementwise*, meaning they are applied to each element of the array.

In [63]:
numbers = np.arange(1, 20)
print(numbers)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


In [64]:
print(numbers + 100)
print(numbers * 3)
print(1 / (numbers + 1))
print(numbers ** 2)

[101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
 119]
[ 3  6  9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57]
[0.5        0.33333333 0.25       0.2        0.16666667 0.14285714
 0.125      0.11111111 0.1        0.09090909 0.08333333 0.07692308
 0.07142857 0.06666667 0.0625     0.05882353 0.05555556 0.05263158
 0.05      ]
[  1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324
 361]


These operators can also perform elementwise operations on two arrays.

In [65]:
numbers2 = numbers + 100
print(numbers2)

print(numbers + numbers2)
print(numbers * numbers2)
print(numbers < numbers2)

[101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
 119]
[102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
 138]
[ 101  204  309  416  525  636  749  864  981 1100 1221 1344 1469 1596
 1725 1856 1989 2124 2261]
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True]
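
This is a real difference from Python lists, where `+` and `*` mean concatenation and repetition. A side-by-side sketch (the names are made up for this example):

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array(a_list)

list_sum = a_list + a_list        # lists: concatenation
array_sum = an_array + an_array   # arrays: elementwise addition
print(list_sum)     # [1, 2, 3, 1, 2, 3]
print(array_sum)    # [2 4 6]

list_times = a_list * 2           # lists: repetition
array_times = an_array * 2        # arrays: elementwise multiplication
print(list_times)   # [1, 2, 3, 1, 2, 3]
print(array_times)  # [2 4 6]
```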


NumPy defines over 60 [**universal functions**](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs), which is its name for functions that operate elementwise on arrays.

In [66]:
print(np.sin(numbers))
print(np.sqrt(numbers))
print(np.mod(numbers2, numbers))

[ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427 -0.2794155
  0.6569866   0.98935825  0.41211849 -0.54402111 -0.99999021 -0.53657292
  0.42016704  0.99060736  0.65028784 -0.28790332 -0.96139749 -0.75098725
  0.14987721]
[1.         1.41421356 1.73205081 2.         2.23606798 2.44948974
 2.64575131 2.82842712 3.         3.16227766 3.31662479 3.46410162
 3.60555128 3.74165739 3.87298335 4.         4.12310563 4.24264069
 4.35889894]
[ 0  0  1  0  0  4  2  4  1  0  1  4  9  2 10  4 15 10  5]
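
To tie this back to the dice example, here is one possible NumPy version of the simulation. It is a sketch, not a replacement for the pure-Python function: the name `two_dice_sum_np` is made up here, and it relies on `np.random.randint`, whose upper bound is *exclusive* (unlike `random.randint`). All rolls are generated at once and added elementwise, with no loop over trials.

```python
import numpy as np

def two_dice_sum_np(n):
    """Simulate rolling two six-sided dice n times using NumPy arrays."""
    d1 = np.random.randint(1, 7, size=n)   # n rolls in 1..6 (7 is excluded)
    d2 = np.random.randint(1, 7, size=n)
    sums = d1 + d2                         # elementwise addition

    counts = {}
    for total in range(2, 13):
        # (sums == total) is an array of booleans; summing it counts the Trues.
        counts[total] = int(np.sum(sums == total))

    return counts

result = two_dice_sum_np(1000)
print(result)
```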


## References

- [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
- [NumPy quickstart](https://numpy.org/doc/stable/user/quickstart.html)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)