# SRM 641 Week 2

# Functions and NumPy Arrays


# Learning Objectives

By the end of this week, you should be able to:

- Use and create functions
- Analyze NumPy arrays
- Create NumPy arrays
- Modify NumPy arrays
- Index/slice NumPy arrays


# Functions

A function is a Python object that you can "call" to perform an action or compute and return another object. Functions are the main way to organize and reuse code in Python. They are useful when you have to repeat the same or very similar code more than once.

Python has several useful built-in functions to help you work with different objects and/or your environment. Here is a small sample of them:

- `print(obj)` to print (display) the object
- `type(obj)` to determine the type of an object
- `len(container)` to determine how many items are in a container
- `callable(obj)` to determine if an object is callable
- `sorted(container)` to return a new list from a container, with the items sorted
- `sum(container)` to compute the sum of a container of numbers
- `min(container)` to determine the smallest item in a container
- `max(container)` to determine the largest item in a container
- `abs(number)` to determine the absolute value of a number
- `repr(obj)` to return a string representation of an object
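
As a quick sketch of several of these built-ins in action, the cell below applies them to a small list. The list `values` and the variable names are made up for illustration only.

```python
# A made-up list of numbers, used only for illustration
values = [3, -7, 10, 2]

n_items = len(values)       # how many items are in the list
total = sum(values)         # sum of all items
smallest = min(values)      # smallest item
largest = max(values)       # largest item
ordered = sorted(values)    # a new sorted list; `values` itself is unchanged
magnitude = abs(smallest)   # absolute value of a number

print(n_items, total, smallest, largest)
print(ordered)
print(magnitude)

# Functions are callable; a list is not
print(callable(len), callable(values))
```

Note that `sorted()` returns a new list, so the original order of `values` is still available afterwards.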

So far we've seen or worked with the built-in functions `print()`, `len()`, `type()`, `str()`, `int()`, `float()`, etc.

The complete list of built-in functions can be found here: https://docs.python.org/3/library/functions.html


# Commonly-used built-in functions include max, min, and round.

- Use `max` to find the largest value of one or more values.
- Use `min` to find the smallest.
- Both work on character strings as well as numbers.
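- “Larger” and “smaller” compare characters in the order 0-9, then A-Z, then a-z, so digits come before uppercase letters and uppercase letters come before lowercase ones.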


In [1]:
print(max(1, 2, 3))

3


In [2]:
print(min('a', 'A', '0'))

0


## Functions may have default values for some arguments

- `round` will round off a floating-point number.
- By default, rounds to zero decimal places.


In [3]:
round(3.712)

4

In [4]:
# We can specify the number of decimal places we want

round(3.712, 1)

3.7

## Functions attached to objects are called methods

Functions take another form that will be common when we work with pandas: methods. Methods have parentheses like functions, but come after the object they belong to. To access an attribute of an object, use a dot (.) after the object, then specify the attribute (i.e. `obj.attribute`).

When an attribute of an object is a callable, that attribute is called a method. It is the same as a function, only this function is bound to a particular object.

When an attribute of an object is not a callable, that attribute is called a property. It is just a piece of data about the object, that is itself another object.
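
The distinction between the two kinds of attribute can be checked with the built-in `callable()`. This is a minimal sketch; the variables `text` and `number` are made up for illustration.

```python
text = 'hello'       # a string object
number = 3 + 4j      # a complex number object

# `upper` is callable, so it is a method of the string
print(callable(text.upper))     # True

# `real` is not callable; it is just data stored on the object
print(callable(number.real))    # False

shouted = text.upper()     # call a method with parentheses
real_part = number.real    # read a non-callable attribute without parentheses
print(shouted, real_part)
```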

Some methods on string objects:
- `capitalize()` to return a capitalized version of the string (only first char uppercase)
- `upper()` to return an uppercase version of the string (all chars uppercase)
- `lower()` to return a lowercase version of the string (all chars lowercase)
- `count(substring)` to return the number of occurrences of the substring in the string
- `startswith(substring)` to determine if the string starts with the substring
- `endswith(substring)` to determine if the string ends with the substring
- `replace(old, new)` to return a copy of the string with occurrences of "old" replaced by "new"

In [5]:
# Assign a string to a variable
my_string = 'this is my sTriNg'

In [6]:
# Return a capitalized version of the string

my_string.capitalize()

'This is my string'

In [7]:
# Return an uppercase version of the string

my_string.upper()

'THIS IS MY STRING'
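
The other string methods listed earlier work the same way. Here is a short sketch that reuses the same example string; the result variable names are chosen only for illustration.

```python
my_string = 'this is my sTriNg'

lowered = my_string.lower()                  # all characters lowercase
n_is = my_string.count('is')                 # 'is' occurs in 'this' and in 'is'
starts = my_string.startswith('this')        # does it begin with 'this'?
ends = my_string.endswith('ing')             # comparison is case-sensitive
replaced = my_string.replace('my', 'your')   # returns a modified copy

print(lowered)
print(n_is, starts, ends)
print(replaced)

# String methods return new strings; the original is unchanged
print(my_string)
```

Because the string ends in `'iNg'` rather than `'ing'`, `endswith('ing')` returns `False`.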

In [8]:
# Use the built-in function help to get help for a function

help(min)

Help on built-in function min in module builtins:

min(...)
    min(iterable, *[, default=obj, key=func]) -> value
    min(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its smallest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the smallest argument.
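
The help text mentions two keyword-only arguments, `key` and `default`. The sketch below shows one way they might be used; the list `words` is made up for illustration.

```python
words = ['pear', 'fig', 'banana']   # made-up example data

# Without `key`, strings are compared alphabetically
alphabetical = min(words)

# With key=len, items are compared by their length instead
shortest = min(words, key=len)

# `default` is returned when the iterable is empty
fallback = min([], default=0)

print(alphabetical, shortest, fallback)
```

Without `default`, calling `min()` on an empty container raises a `ValueError`.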



# User-defined functions

We'll now learn to write our own user-defined functions. Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.


In [9]:
# Functions are declared with the `def` keyword.
# A function contains a block of code with an optional use of the return keyword.
# An argument is a value passed into a function.

# The first line is a def statement, which defines a function.
# The code in the block that follows the def statement is the body of the function.
# This code is executed when the function is called, not when the function is first defined.

def name_of_function(arg):
    output = arg  # placeholder: replace with the real computation
    return output

In [10]:
def name_of_function(arg1,arg2):
    '''
    This is where the function's Document String (docstring) goes.
    When you call help() on your function it will be printed out.
    '''
    # Do stuff here
    # Return desired result
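
As a small sketch of a docstring in use, the hypothetical function `add_numbers` below stores its documentation in the `__doc__` attribute, which is the text that `help()` displays.

```python
def add_numbers(arg1, arg2):
    '''Return the sum of arg1 and arg2.'''
    return arg1 + arg2

# The docstring is stored on the function object itself
doc_text = add_numbers.__doc__
print(doc_text)

# The function still works as usual
result = add_numbers(2, 3)
print(result)
```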

In [11]:
# simple function

def say_hello():
    print('hello')

In [12]:
# Call the function

say_hello()

hello


Accepting parameters (arguments): 

In [13]:
# We can write functions with one input argument and one return value

def square(x):
    x_squared = x * x
    return x_squared

So far we've only used `print()` to display results. To save a result for later use, a function needs the `return` keyword: `return` hands the result back to the caller, where it can be stored in a variable or used in another expression. The `print()` function simply displays the output to you, but doesn't save it for future use.

In [14]:
# Call the function

square(10)

100
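
The difference between printing and returning can be seen by storing the result of each kind of function. The two function names below are hypothetical and exist only for this comparison.

```python
def square_print(x):
    # Displays the value, but the function itself returns None
    print(x * x)

def square_return(x):
    # Hands the value back to the caller
    return x * x

shown = square_print(4)    # prints 16
saved = square_return(4)   # prints nothing

print(shown)        # None: nothing was returned
print(saved + 1)    # 17: the returned value can be reused
```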

In [15]:
# Another function that converts Fahrenheit to Celsius

def fahr_to_celsius(temp):
    # Assign the converted value to a variable
    converted = ((temp - 32) * (5/9))
    # Return the value of the new variable
    return converted


The function definition opens with the keyword def followed by the name of the function (fahr_to_celsius) and a parenthesized list of parameter names (temp). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the return value.

In [16]:
# run the function

fahr_to_celsius(32)

0.0

In [17]:
# using two functions

print('freezing point of water:', fahr_to_celsius(32), 'C')

freezing point of water: 0.0 C
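
User-defined functions can also have default argument values, like `round`, and can take or return more than one value. The sketch below illustrates both ideas; `fahr_to_celsius_rounded` and `min_and_max` are hypothetical names introduced only for this example.

```python
def fahr_to_celsius_rounded(temp, ndigits=1):
    # `ndigits` has a default value, so callers may omit it
    return round((temp - 32) * (5 / 9), ndigits)

def min_and_max(values):
    # Two values are returned together as a tuple
    return min(values), max(values)

body_temp = fahr_to_celsius_rounded(98.6)     # uses the default ndigits=1
precise = fahr_to_celsius_rounded(100, 3)     # overrides the default
lowest, highest = min_and_max([3, 9, 1])      # unpack the two results

print(body_temp, precise)
print(lowest, highest)
```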


## NumPy Basics

NumPy (Numerical Python) is one of the most important foundational packages for numerical computing in Python.

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array computing semantics, like pandas, much more effectively.

See https://numpy.org/doc/stable/ for more info about NumPy.

In general, you should use NumPy when your data is composed of matrices or arrays.

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.
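
To make "operations on whole blocks of data" concrete, here is a minimal sketch comparing a plain list with an array. The `prices` data is made up for illustration.

```python
import numpy as np

prices = [1.5, 2.0, 3.25]   # made-up example data

# With a list, `* 2` repeats the list instead of doubling each value
repeated = prices * 2

# Doubling each value in a list needs a loop or a list comprehension
doubled_list = [p * 2 for p in prices]

# With an array, the same scalar-style syntax applies to every element
doubled_array = np.array(prices) * 2

print(repeated)
print(doubled_list)
print(doubled_array)
```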

### Install NumPy on Anaconda

If you installed the Anaconda distribution of Python, NumPy comes pre-installed and no further installation steps are necessary.

If NumPy is not already installed (for example, with a version of Python from python.org or one that came with your operating system), you can install it with conda from the Anaconda Prompt or with pip from a terminal.

Install NumPy with the Anaconda Prompt

To install NumPy, open the Anaconda Prompt and type:

`conda install numpy`

Type y for yes when prompted.

Install NumPy with pip

To install NumPy with pip, bring up a terminal window and type:

`$ pip install numpy`

In [18]:
# First import the library numpy as np
# Before you can use the functions in a module, you must import the module with an import statement.
# In code, an import statement consists of the following:

# The import keyword
# The name of the module
# Optionally, more module names, as long as they are separated by commas
# Optionally, the as keyword followed by a shorter alias for the module (here, np)

import numpy as np

In [19]:
# print the version number of numpy

print(np.__version__)

1.24.3


### List vs NumPy Arrays

NumPy is used to construct homogeneous arrays and perform mathematical operations on arrays. A NumPy array is different from a Python list. The data types stored in a Python list can all be different.

In [20]:
# Create a list

my_list = [ 1, -0.038, 'school', True]

The list above contains four different data types: `1` is an integer, `-0.038` is a float, `'school'` is a string, and `True` is a boolean.

In [21]:
# Check the data type of each item

my_list = [1, -0.038, 'school', True]
for item in my_list:
    print(type(item))

<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>


The values stored in a NumPy array must all share the same data type.

In [22]:
# Create an array 

data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

In [23]:
# View the data

data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.

In [24]:
# Checking the shape of data

data.shape   #2 rows, 3 columns

(2, 3)

In [25]:
# check dimension

data.ndim

2

In [26]:
# Check the data type

data.dtype

dtype('float64')
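
One consequence of the single-data-type rule is that NumPy converts mixed inputs to a common type when an array is created. The following sketch illustrates this with made-up values; the exact integer type reported for `ints` depends on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])             # all integers
mixed = np.array([1, 2.5, 3])          # integers and a float
with_text = np.array([1, 'school'])    # a number and a string

print(ints.dtype)        # an integer type (size depends on the platform)
print(mixed.dtype)       # float64: the integers were converted to floats
print(with_text.dtype)   # a Unicode string type: the number became text
print(with_text[0])      # the first element is now the string '1'
```

This is why the array `data` above has dtype `float64` even though some of its entries were written as integers.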

In [27]:
# Arithmetic with NumPy arrays

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [28]:
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [29]:
# Multiply the array by itself (element by element)

arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])
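
Other arithmetic and comparison operators are applied element by element in the same way. A brief sketch using the same `arr` values:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

difference = arr - arr    # element-wise subtraction gives all zeros
scaled = arr * 10         # a scalar is applied to every element
squared = arr ** 2        # same result as arr * arr
bigger = arr > 3          # comparisons produce a boolean array

print(difference)
print(scaled)
print(squared)
print(bigger)
```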


# Analyzing Data using NumPy

We are going to analyze some data using NumPy. The CSV file contains the number of inflammation flare-ups per day for the 60 patients in the initial clinical trial, with the trial lasting 40 days. Each row corresponds to a patient, and each column corresponds to a day in the trial. Once a patient has their first inflammation flare-up they take the medication and wait a few weeks for it to take effect and reduce flare-ups.

Download the data from Canvas.

In [30]:
np.loadtxt(fname='Data/inflammation-01.csv', delimiter=',')

array([[0., 0., 1., ..., 3., 0., 0.],
       [0., 1., 2., ..., 1., 0., 1.],
       [0., 1., 1., ..., 2., 1., 1.],
       ...,
       [0., 1., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 2., 0.],
       [0., 0., 1., ..., 1., 1., 0.]])

The expression `np.loadtxt(...)` is a function call that asks Python to run the function `loadtxt`, which belongs to the NumPy library. Dot notation is used everywhere in Python: `object.attribute` gives you the value of that attribute, and `object.method()` calls a method that belongs to the object.

We passed two arguments to `np.loadtxt`: the name of the file we want to read and the delimiter that separates values on a line. These both need to be character strings (or strings for short), so we put them in quotes.

Since we haven’t told it to do anything else with the function’s output, the notebook displays it. In this case, the output is the data we just loaded. By default, only a few rows and columns are shown (with ... to omit elements when displaying big arrays). Note that, to save space when displaying NumPy arrays, Python does not show us trailing zeros, so 1.0 becomes 1..

Our call to `np.loadtxt` read our file but didn’t save the data in memory. To do that, we need to assign the array to a variable.
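
To experiment with `np.loadtxt` without the course file, the sketch below reads comma-separated text from an in-memory buffer instead. The three rows in `csv_text` are made up for illustration and are not the real inflammation data.

```python
import io
import numpy as np

# Made-up stand-in for a small CSV file: 3 rows, 4 comma-separated columns
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,3"

# np.loadtxt accepts a file-like object as well as a file name
small = np.loadtxt(io.StringIO(csv_text), delimiter=',')

print(small)
print(small.shape)   # (3, 4)
print(small.dtype)   # float64
```

Assigning the result to `small` keeps the array available for later cells.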

In [31]:
# Load the file and save it 

data = np.loadtxt(fname = 'Data/inflammation-01.csv', delimiter = ',')

In [32]:
# If we want to check that it's loaded, we can print it

print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]


In [33]:
data

array([[0., 0., 1., ..., 3., 0., 0.],
       [0., 1., 2., ..., 1., 0., 1.],
       [0., 1., 1., ..., 2., 1., 1.],
       ...,
       [0., 1., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 2., 0.],
       [0., 0., 1., ..., 1., 1., 0.]])

In [34]:
# we can check the data type

type(data)

numpy.ndarray

The data is stored in an N-dimensional array, the functionality for which is provided by the NumPy library.

A NumPy array contains one or more elements of the same type. The `type` function will only tell you that a variable is a NumPy array but won’t tell you the type of thing inside the array. To find out the type of the data contained in the NumPy array, we can check its `dtype` attribute.

In [35]:
print(data.dtype)

float64


This tells us that the NumPy array’s elements are floating-point numbers.

In [36]:
# We can also check the array's shape

data.shape

(60, 40)

The output tells us that the `data` array contains 60 rows and 40 columns.

### Slicing or Selecting Data

Multiple values stored within an array can be accessed simultaneously with array slicing. To pull out a section or slice of an array, the colon operator `:` is used inside the square brackets of the index. The general form is: `<slice> = <array>[start:stop]`

Where `<slice>` is the slice or section of the array object `<array>`. The index of the slice is specified in `[start:stop]`. Remember Python counting starts at 0 and ends at n-1. The index `[0:2]` pulls the first two values out of an array. The index `[1:3]` pulls the second and third values out of an array.

On either side of the colon, a blank stands for "default".

- `[:4]` corresponds to `[start=default:stop=4]`
- `[1:]` corresponds to `[start=1:stop=default]`

Therefore, the slicing operation `[:4]` pulls out the first through fourth values in an array. The slicing operation `[1:]` pulls out the second through the last values in an array.
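
The four slices described above can be tried on a small one-dimensional array. This is a minimal sketch; the array `letters` is made up for illustration.

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd', 'e', 'f'])   # made-up example data

first_two = letters[0:2]      # indices 0 and 1
second_third = letters[1:3]   # indices 1 and 2
first_four = letters[:4]      # start defaults to 0
from_second = letters[1:]     # stop defaults to the end of the array

print(first_two)
print(second_third)
print(first_four)
print(from_second)
```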


Our inflammation data has two dimensions, so we will need to use two indices to refer to one specific value: 
<br>

- `<slice> = <array>[start_row:end_row, start_col:end_col]`

In [37]:
# if we want to print the first value 

data[0, 0]  # first row, first column

0.0

In [38]:
print('first value in data:', data[0, 0])

first value in data: 0.0


In [39]:
# Check the middle value

print('middle value in data:', data[29, 19])

middle value in data: 16.0


The code `data[29, 19]` accesses the element at row 30, column 20. Python starts counting at 0 unlike R or other languages that start counting at 1.

An index like `[29, 19]` selects a single element of an array, but we can select whole sections as well.

In [40]:
# For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
    
data[0:4, 0:10]

array([[0., 0., 1., 3., 1., 2., 4., 7., 8., 3.],
       [0., 1., 2., 1., 2., 1., 3., 2., 2., 6.],
       [0., 1., 1., 3., 3., 2., 6., 2., 5., 9.],
       [0., 0., 2., 0., 4., 2., 2., 1., 6., 7.]])

The slice 0:4 means, “Start at index 0 and go up to, but not including, index 4”.

In [41]:
# slice/select rows 5 to 9 and columns 0 to 9 (the stop index 10 is not included)

data[5:10, 0:10]

array([[0., 0., 1., 2., 2., 4., 2., 1., 6., 4.],
       [0., 0., 2., 2., 4., 2., 2., 5., 5., 8.],
       [0., 0., 1., 2., 3., 1., 2., 3., 5., 3.],
       [0., 0., 0., 3., 1., 5., 6., 5., 5., 8.],
       [0., 1., 1., 2., 1., 3., 5., 3., 5., 8.]])
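
The default start and stop values also work in two dimensions. The sketch below uses a small made-up array, `grid`, rather than the inflammation data so that the results are easy to check by eye.

```python
import numpy as np

# A made-up 3x4 array:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

top_left = grid[:2, :2]       # first two rows, first two columns
last_col = grid[:, 3]         # every row, column at index 3
bottom_right = grid[1:, 2:]   # rows from index 1 on, columns from index 2 on

print(top_left)
print(last_col)
print(bottom_right)
```

A lone `:` selects everything along that axis, which is how a whole row or column is extracted.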

In [None]:
# Exercise 1

# complete the code. Insert the correct syntax for printing the number 50 from the array
arr = np.array([10, 20, 30, 40, 50, 60, 70])

print(arr[_____  ])

In [None]:
# Exercise 2

# Insert the correct syntax for printing the number 50 from the array.

arr = np.array([[10, 20, 30, 40], [50, 60, 70, 80]])

print(arr[_____])

## Analyzing data

NumPy has several useful functions that take an array as input to perform operations on its values. If we want to find the average inflammation for all patients on all days, for example, we can ask NumPy to compute data’s mean value:

In [43]:
# check the mean

np.mean(data)

6.14875

In [44]:
# check the max

np.max(data)

20.0

In [45]:
# Check the min

np.min(data)

0.0

In [46]:
# Check the standard deviation

np.std(data)

4.613833197118566

When analyzing data we often want to look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to create a new subset array of the data we want, then ask it to do the calculation.

In [47]:
# Create a subset of only patient 0

patient_0 = data[0, :] # 0 on the first axis (rows), everything on the second (columns)

In [48]:
patient_0

array([ 0.,  0.,  1.,  3.,  1.,  2.,  4.,  7.,  8.,  3.,  3.,  3., 10.,
        5.,  7.,  4.,  7.,  7., 12., 18.,  6., 13., 11., 11.,  7.,  7.,
        4.,  6.,  8.,  8.,  4.,  4.,  5.,  7.,  3.,  4.,  2.,  3.,  0.,
        0.])

In [49]:
# maximum inflammation for patient 0

max(patient_0)

18.0

In [50]:
# you don't have to save the subset; you can select a row directly and compute its maximum (here, the row at index 2)

np.max(data[2, :])

19.0


To calculate statistics on two-dimensional arrays, you can use the axis argument in the same functions (e.g. np.max) to specify which axis you would like to summarize:

- `axis=0`: the vertical axis (down the rows), giving one summary value per column
- `axis=1`: the horizontal axis (across the columns), giving one summary value per row


What if we need the maximum inflammation for each patient over all days or the average for each day? We want to perform the operation across an axis:

In [51]:
# Calculate the average across axis 0 (across rows)
# By using np.mean(array, axis=0), you are requesting the avg value from each column across all rows of data. 


np.mean(data, axis=0)

array([ 0.        ,  0.45      ,  1.11666667,  1.75      ,  2.43333333,
        3.15      ,  3.8       ,  3.88333333,  5.23333333,  5.51666667,
        5.95      ,  5.9       ,  8.35      ,  7.73333333,  8.36666667,
        9.5       ,  9.58333333, 10.63333333, 11.56666667, 12.35      ,
       13.25      , 11.96666667, 11.03333333, 10.16666667, 10.        ,
        8.66666667,  9.15      ,  7.25      ,  7.33333333,  6.58333333,
        6.06666667,  5.95      ,  5.11666667,  3.6       ,  3.3       ,
        3.56666667,  2.48333333,  1.5       ,  1.13333333,  0.56666667])

In [52]:
# We can check the shape

np.mean(data, axis=0).shape # 40 values, one per column (day)

(40,)

The expression `(40,)` tells us we have a one-dimensional array with 40 elements, one average for each day.

In [53]:
# We can average across axis 1 (columns) to get the average inflammation per patient across all days.

np.mean(data, axis=1)

array([5.45 , 5.425, 6.1  , 5.9  , 5.55 , 6.225, 5.975, 6.65 , 6.625,
       6.525, 6.775, 5.8  , 6.225, 5.75 , 5.225, 6.3  , 6.55 , 5.7  ,
       5.85 , 6.55 , 5.775, 5.825, 6.175, 6.1  , 5.8  , 6.425, 6.05 ,
       6.025, 6.175, 6.55 , 6.175, 6.35 , 6.725, 6.125, 7.075, 5.725,
       5.925, 6.15 , 6.075, 5.75 , 5.975, 5.725, 6.3  , 5.9  , 6.75 ,
       5.925, 7.225, 6.15 , 5.95 , 6.275, 5.7  , 6.1  , 6.825, 5.975,
       6.725, 5.7  , 6.25 , 6.4  , 7.05 , 5.9  ])

In [54]:
# Check the shape

np.mean(data, axis = 1).shape

(60,)
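
The effect of the `axis` argument is easier to verify on an array small enough to average by hand. This sketch uses a made-up 2x3 array named `scores`.

```python
import numpy as np

# Made-up data: 2 rows, 3 columns
scores = np.array([[1., 2., 3.],
                   [4., 5., 6.]])

col_means = np.mean(scores, axis=0)   # one value per column: [2.5, 3.5, 4.5]
row_means = np.mean(scores, axis=1)   # one value per row: [2., 5.]

print(col_means, col_means.shape)
print(row_means, row_means.shape)
```

The axis that is named in the call is the one that disappears from the shape of the result: `axis=0` turns a `(2, 3)` array into `(3,)`, and `axis=1` turns it into `(2,)`.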

## Key points:
    
- Import a library into a program using `import libraryname`.
- Use the `numpy` library to work with arrays in Python.
- The expression `array.shape` gives the shape of an array.
- Use `array[x, y]` to select a single element from a 2D array.
- Array indices start at 0, not 1.
- Use `low:high` to specify a slice that includes the indices from `low` to `high-1`.
- Use `# some kind of explanation` to add comments to programs.
- Use `np.mean(array)`, `np.max(array)`, and `np.min(array)` to calculate simple statistics.
- Use `np.mean(array, axis=0)` or `np.mean(array, axis=1)` to calculate statistics across the specified axis.

References:
For more about functions and NumPy, see:
- Python for Data Analysis by Wes McKinney
- Automate the Boring Stuff with Python by Al Sweigart, Ch. 3
- https://problemsolvingwithpython.com/05-NumPy-and-Arrays/05.00-Introduction/
