# DASC 512 - Data in Python: A Primer

This notebook presents the basics of working with data in Python - a fundamental skill of this course!

## Variables
This is how you save values to use or modify later.

In [1]:
# Python does not require you to declare a variable before using it. Simply assign a value using '='.
x = 5
print(x)

5


In [2]:
# To modify a variable, you have to assign to it again.
x = 5
x + 5  # This computes 10 but does not store it, so x is unchanged
print(x)
x = x + 5
print(x)

5
10


In [3]:
# Variable names can contain letters, digits, and underscores. They cannot start with a digit,
# include symbols like + or #, or be a reserved keyword such as 'for'.
# So make them descriptive! You can typically auto-complete with <tab>
This_is_a_really_long_variable_name_1234567890_ = 125
print(This_is_a_really_long_variable_name_1234567890_)

125


## Data Elements

### Integers
Numerical data taking whole number values

In [4]:
# Addition
3 + 3

6

In [5]:
# Multiplication
2 * 2

4

In [6]:
# Division - Note that result is no longer an integer
6 / 3

2.0

In [7]:
# Conversion to Integer
int(6 / 3)

2

A note on rounding in Python: the built-in round() function uses round-half-to-even (banker's rounding), which is also the default rounding mode of the IEEE 754 standard. On a tie, rather than always rounding up, it rounds to the nearest even integer.

In [6]:
# Round to Integer
print(round(1.5))
print(round(2.5))
print(round(3.5))
print(round(4.5))

2
2
4
4
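
If you need the schoolbook rule where ties always round away from zero, one option is the standard-library `decimal` module. The following is a minimal sketch; the helper name `round_half_up` is made up for this example and is not a Python built-in.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value):
    # Go through str() so that 2.5 becomes exactly Decimal('2.5')
    exact = Decimal(str(value))
    # Quantize to a whole number, sending ties away from zero
    return int(exact.quantize(Decimal('1'), rounding=ROUND_HALF_UP))

print(round(2.5))          # built-in: the tie goes to the even neighbor
print(round_half_up(2.5))  # decimal: the tie goes away from zero
```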


In [8]:
# Exponentiation
2 ** 10

1024

In [9]:
# Modulo (i.e., remainder)
print(4 % 2)
print(5 % 2)
print(6 % 2)

0
1
0


In [10]:
# Parentheses (remember PEMDAS!)
print(5 + 4 * 6 + 7)
print((5 + 4) * (6 + 7))

36
117


In [11]:
# Factorial
import math
math.factorial(5)

120

### Floats
Numerical data on the real number line

In [12]:
# Division resulted in a float before
6 / 3

2.0

In [13]:
# Be aware, you may lose precision using floats
300 - 1/3 - 1/3 - 1/3

299.00000000000006
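
Because of this rounding error, comparing floats with `==` can give surprising results. A common workaround, sketched here, is `math.isclose()`, which compares within a small relative tolerance.

```python
import math

total = 300 - 1/3 - 1/3 - 1/3

# Exact equality fails because of the tiny rounding error
print(total == 299)

# math.isclose() treats values within a small tolerance as equal
print(math.isclose(total, 299))
```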

In [14]:
# If needed, can convert an integer to float
float(3)

3.0

In [15]:
# Or you can just add a decimal
3.

3.0

In [16]:
# Note that some functions require integer inputs
try:
    print(math.factorial(5.5))
except Exception as ex:
    print(f'Error: {ex}.')

Error: factorial() only accepts integral values.


### Booleans
Simple variables taking the values of True or False

In [17]:
True

True

In [18]:
False

False

In [19]:
# Checking equality of two Booleans
print(True == True)
print(False == True)

True
False


In [20]:
# Standard logic functions
print(True and False)
print(True or False)
print(not False)
print(not True)

False
True
True
False


In [21]:
# Comparison operators return Booleans
print(1 < 2)
print(1 > 2)
print(1 >= 1)
print(1 == 1)
print(1 != 1)

True
False
True
True
False


## Iterables
These data types can be passed to anything that expects an iterable, such as a for loop.
The ordered ones (tuples, strings, lists, and ranges) are zero-indexed, meaning the first value is at index 0.

### Tuples
Tuples are immutable iterables. That means you cannot modify them.

In [22]:
# A tuple is defined using parentheses and commas.
(1,2,3)

(1, 2, 3)

In [23]:
# Again, they are immutable.
x = (1,2,3)
print(x)
try: 
    x[0] = 4
except Exception as ex:
    print(f'Error: {ex}.')

(1, 2, 3)
Error: 'tuple' object does not support item assignment.


In [24]:
# You can count the number of instances of a value in a tuple.
(2, 2, 3, 3, 4, 4, 4).count(4)

3

In [25]:
# Addition concatenates two tuples
(1,2,3)+(2,3,4)

(1, 2, 3, 2, 3, 4)

In [26]:
# And multiplication repeats a tuple
(1,2,3) * 2

(1, 2, 3, 1, 2, 3)

In [27]:
# You can use "tuple unpacking" to assign multiple variables at once.
(x,y)=(3,5)
print(x)
print(y)

3
5
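
Tuple unpacking shows up in a few other handy places. As a small illustration (the variable names are arbitrary), it lets you swap two values without a temporary variable and split pairs apart inside a loop.

```python
first, second = 3, 5

# Swap the two values in one line
first, second = second, first
print(first, second)

# Unpack each pair as the loop visits it
pairs = [(1, 'one'), (2, 'two')]
for number, name in pairs:
    print(number, name)
```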


### Strings
Strings are similar to tuples of characters. They are special because they are so often used for text.

In [28]:
'You can define strings with single quotes'

'You can define strings with single quotes'

In [29]:
"You can also use double quotes"

'You can also use double quotes'

In [30]:
"If you want to include 'single quotes', use 'double quotes'. Also useful if you need to type it's for 'it is'."

"If you want to include 'single quotes', use 'double quotes'. Also useful if you need to type it's for 'it is'."

In [31]:
'If you want to include "double quotes", use "single quotes".'

'If you want to include "double quotes", use "single quotes".'

In [32]:
# Concatenating strings
'this' + 'that'

'thisthat'

In [33]:
# Multiplying strings
'this' * 3

'thisthisthis'

In [34]:
# Converting integers to Strings
str(800)

'800'

In [35]:
# Can also convert floats to strings
print(str(2.5))
print(str(math.pi))

2.5
3.141592653589793


In [36]:
# Strings can be "sliced". 
# The first letter is index 0. 
# A colon means "to", and leaving one side open means "to that end".
# Slices include the start index but exclude the stop index. Note that [0:3] includes indices 0, 1, and 2 (not 3).
print('sliced'[:3])
print('sliced'[0:3])
print('sliced'[3:])

sli
sli
ced


In [37]:
# Using negative index goes the other way. This is handy for finding the last n letters.
print('sliced'[-3:])
print('sliced'[-3:-1])
print('sliced'[-2:])

ced
ce
ed
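
Slices also accept an optional third value, the step, written as [start:stop:step]. A short sketch:

```python
word = 'sliced'

every_other = word[::2]     # indices 0, 2, and 4
reversed_word = word[::-1]  # a step of -1 walks backward through the string

print(every_other)
print(reversed_word)
```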


In [38]:
# In the same way, we can call just one index for a single character
'sliced'[4]

'e'

In [39]:
# But they're still immutable
x = 'Failure'
try: 
    x[0] = '4'
except Exception as ex:
    print(f'Error: {ex}.')

Error: 'str' object does not support item assignment.


In [40]:
# There are many ways to format strings, which we'll explore later.
# My go-to method is f-strings, which insert variables (optionally with a format specification) inside curly braces.
x = math.pi
print(f'Pi is an irrational number, but it is approximately {x}.')
print(f'Pi is an irrational number, but it is approximately {x:0.48f}.')
print(f'Pi is an irrational number, but it is approximately {x:0.0f}.')

Pi is an irrational number, but it is approximately 3.141592653589793.
Pi is an irrational number, but it is approximately 3.141592653589793115997963468544185161590576171875.
Pi is an irrational number, but it is approximately 3.


### Lists
Lists are very similar to tuples, but they are mutable.

In [41]:
# Create a list using square brackets
[1,2,3]

[1, 2, 3]

In [42]:
# The .pop() function removes the last value from the list and returns it
lst = [1,2,3]
x = lst.pop()
print(x)
print(lst)

3
[1, 2]


In [43]:
# The .append() function adds a value to the end of a list. Note that iterables can contain multiple types of data.
lst = [1,2,3]
lst.append('this')
print(lst)

[1, 2, 3, 'this']


In [44]:
# Like other iterables, lists are by default zero-indexed.
lst = [1,2,3]
print(lst[0])
print(lst[1])
print(lst[2])

1
2
3


In [45]:
# You can also make nested lists and reach inner values with repeated indexing
lst = [[1,2],[4,5,[6]],['ex','eye','zee'],3]
print(lst[1])
print(lst[1][2])
print(lst[1][2][0])
# You can do the same to dig into an embedded iterable of another type
print(lst[2][1][1:])

[4, 5, [6]]
[6]
6
ye


In [46]:
# Like tuples, lists can be added (concatenated) and multiplied
print([1,2,3] + [4,5,6])
print([1,2,3] * 5)

[1, 2, 3, 4, 5, 6]
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]


In [47]:
# Unlike tuples, lists can be modified
lst = [[1,2],[4,5,[6]],['ex','eye','zee'],3]
lst[2] = ['x','y','z']
print(lst)

[[1, 2], [4, 5, [6]], ['x', 'y', 'z'], 3]


### Sets
Sets are unordered iterables. They only store unique values.

In [48]:
# Create a set using curly brackets
{1,2,3}

{1, 2, 3}

In [49]:
# Adding repeat values does nothing
{1,2,2,2,2,2,3,3,3,3,3,1,5}

{1, 2, 3, 5}

In [50]:
# The order things are added also does not matter
{5,4,2,2,2,2,3,3,1,-1}

{-1, 1, 2, 3, 4, 5}

In [51]:
# Sets have no positions, so they do not support item assignment (they can still be changed with .add() and .remove())
st = {1,2,3}
try:
    st[2]=5
except Exception as ex:
    print(f'Error: {ex}')

Error: 'set' object does not support item assignment


In [52]:
# Because they are unordered, sets cannot be sliced either
try:
    print(st[2])
except Exception as ex:
    print(f'Error: {ex}')

Error: 'set' object is not subscriptable
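
Although sets cannot be indexed, they can still be changed through their own methods, and they are well suited to membership tests and set algebra. A minimal sketch (the names `st` and `other` are arbitrary):

```python
st = {1, 2, 3}
st.add(5)      # add a new value
st.add(2)      # adding an existing value changes nothing
st.discard(1)  # remove a value if it is present
print(st)

print(2 in st)  # membership test

other = {3, 4, 5}
union = st | other         # values in either set
intersection = st & other  # values in both sets
print(union)
print(intersection)
```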


### Dictionaries
Dictionaries are similar to sets in that their keys are unique, but each key maps to a value. They store key:value pairs.

In [53]:
# They still use curly braces but now also include colons.
d = {'key1':'value1','key2':'value2'}

In [54]:
# The .keys() function returns a view of the dictionary's keys.
d.keys()

dict_keys(['key1', 'key2'])

In [55]:
# The .values() function returns a view of the values.
d.values()

dict_values(['value1', 'value2'])

In [56]:
# The .items() function includes all key:value pairs
d.items()

dict_items([('key1', 'value1'), ('key2', 'value2')])

In [57]:
# The value of a dictionary lies in using a key to return a value.
d['key1']

'value1'

In [58]:
# Dictionaries are mutable.
d['key1'] = 'new value'
d.items()

dict_items([('key1', 'new value'), ('key2', 'value2')])

In [59]:
# Like sets, they won't store more than one value per key. Note that later items overwrite values.
d = {'key1':'value1','key2':'value2','key1':'value3'}
d.items()

dict_items([('key1', 'value3'), ('key2', 'value2')])
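
Two everyday dictionary patterns are worth seeing early. In this sketch, .get() supplies a fallback when a key is missing (indexing with square brackets would raise a KeyError instead), and .items() pairs naturally with tuple unpacking in a loop. The key 'key3' and the fallback text are made up for illustration.

```python
d = {'key1': 'value1', 'key2': 'value2'}

# .get() returns the fallback instead of raising an error
missing = d.get('key3', 'not found')
print(missing)

# 'in' checks whether a key exists
print('key1' in d)

# Loop over key:value pairs
for key, value in d.items():
    print(key, value)
```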

### Range
A range object is particularly useful as an iterable. We will use them often.

In [60]:
# The range() function generates a range object. Let's look at its help documentation.
help(range)

Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object
 |  
 |  Return an object that produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |

In [61]:
# Note that we won't see the values in the range unless we use it as an iterable.
print(range(4))

range(0, 4)


In [62]:
# We'll do for loops in a moment, but I'll use it here to demonstrate the range values.
# Here is a zero-indexed range (the default). Note that 4 is not included.
for i in range(4):
    print(i)

0
1
2
3


In [63]:
# Here we specify a lower and upper limit. The lower limit is included, but not the upper limit.
for i in range(3,10):
    print(i)

3
4
5
6
7
8
9


In [64]:
# Here we specify a step size of 2, so we get all odd numbers between 3 (inclusive) and 10 (exclusive).
for i in range(3,10,2):
    print(i)

3
5
7
9
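
To see the values of a range without writing a loop, you can convert it to a list. The sketch below also shows that a negative step counts downward.

```python
evens = list(range(0, 10, 2))
countdown = list(range(10, 0, -3))

print(evens)
print(countdown)
```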


## Numpy
The numpy package includes some additional data types that we'll use regularly.

In [65]:
import numpy as np

### Arrays
Arrays are vectors. They look like lists, but math operators work differently with them.

In [66]:
# To create an array, you usually pass numpy a list.
arr = np.array([1,2,3])
print(arr)

[1 2 3]


Note that there are no commas in an array.

In [67]:
# Unlike lists, arrays add and multiply element by element.
arr1 = np.array([1,2,3])
arr2 = 2 * arr1
print(arr2)
arr3 = arr1 + arr2
print(arr3)
arr4 = arr1 * arr2
print(arr4)

[2 4 6]
[3 6 9]
[ 2  8 18]
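
To make the contrast with lists concrete, here is a short side-by-side sketch. It also shows np.dot(), which sums the element-wise products.

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array(lst)

print(lst * 2)    # list: repetition
print(arr * 2)    # array: element-wise multiplication
print(lst + lst)  # list: concatenation
print(arr + arr)  # array: element-wise addition

# Dot product: 1*2 + 2*4 + 3*6
dot = np.dot(arr, 2 * arr)
print(dot)
```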


In [68]:
# Arrays can also be created using other functions, such as starting with zeros or ones.
print(np.zeros(10))
print(np.ones(10))

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [69]:
# Arrays can also have two or more dimensions. A two-dimensional array is a matrix.
# The np.eye() function creates an identity matrix (zeros everywhere except ones on the diagonal).
print(np.eye(5))

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


In [70]:
# The numpy .arange() function creates arrays using range-like inputs. It allows for non-integer values.
print(np.arange(1,10))
print(np.arange(0,1,0.1))

[1 2 3 4 5 6 7 8 9]
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]


In [71]:
# The numpy .linspace() function is similar but you specify the number of equally spaced values instead of the step size. 
# It also includes the right-most bound.
print(np.linspace(1,5,21))

[1.  1.2 1.4 1.6 1.8 2.  2.2 2.4 2.6 2.8 3.  3.2 3.4 3.6 3.8 4.  4.2 4.4
 4.6 4.8 5. ]


In [72]:
# We will also use the np.random module extensively to draw random numbers.
print(np.random.randint(0,10,20))
print(np.random.rand(20))

[7 2 7 8 8 7 2 2 2 7 8 2 2 0 7 3 0 8 6 2]
[0.5252803  0.97819978 0.72900968 0.9966266  0.24487934 0.11368977
 0.7316917  0.18022027 0.15628807 0.96626413 0.49859965 0.51608012
 0.42527345 0.56533122 0.13952853 0.34772276 0.3807468  0.76848378
 0.88526469 0.87025647]
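
Random draws change on every run. When you need reproducible results, one approach is NumPy's seeded Generator interface, sketched below. The seed 512 is an arbitrary choice for this example.

```python
import numpy as np

# Two generators built from the same seed produce the same draws
rng_a = np.random.default_rng(512)
rng_b = np.random.default_rng(512)

draws_a = rng_a.integers(0, 10, size=20)  # integers from 0 to 9
draws_b = rng_b.integers(0, 10, size=20)

print(draws_a)
print(np.array_equal(draws_a, draws_b))

# Uniform floats in [0, 1)
uniform = rng_a.random(5)
print(uniform)
```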


In [73]:
# Two or more dimensional arrays can be indexed by tuple. For 2-d, it is (row, column).
# Note the .reshape() function can turn a 1-d array into 2-d (or more)
mat = np.arange(1,21).reshape(4,5)
print(mat)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]


In [74]:
print(mat[2,3])

14


In [75]:
# We can also use slicing to define part of the matrix.
print(mat[2:4,3:])

[[14 15]
 [19 20]]
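
A bare colon selects everything along one dimension, and a Boolean condition can pick out values directly. A brief sketch using the same matrix:

```python
import numpy as np

mat = np.arange(1, 21).reshape(4, 5)

first_column = mat[:, 0]      # every row, column 0
last_row = mat[-1, :]         # last row, every column
large_values = mat[mat > 15]  # Boolean mask, returns a 1-d array

print(first_column)
print(last_row)
print(large_values)
```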


In [76]:
# We will also use many of the handy built-in functions with arrays.
# We'll do all this in context later.
arr = np.arange(1,21)
print(arr.mean())
print(arr.sum())
print(arr.var())
print(arr.std())
print(arr.max())
print(arr.min())

10.5
210
33.25
5.766281297335398
20
1


## Pandas
The pandas package is commonly called the "Excel of Python." This gives us data matrices to manipulate.

In [77]:
import pandas as pd

### Series
Pandas Series are similar to one-dimensional arrays with indexed rows (cases).

In [78]:
# Series can come from lists
ser = pd.Series([1,3,5,7])
print(ser)

0    1
1    3
2    5
3    7
dtype: int64


In [79]:
# Or from arrays
ser = pd.Series(np.arange(1,6))
print(ser)

0    1
1    2
2    3
3    4
4    5
dtype: int32


In [80]:
# Or from dictionaries
ser = pd.Series({'zero':0, 'one':1, 'two':2})
print(ser)

zero    0
one     1
two     2
dtype: int64


In [81]:
# You can specify the indices separately
ser = pd.Series(np.arange(1,6), index=['a','b','c','d','e'])
print(ser)

a    1
b    2
c    3
d    4
e    5
dtype: int32


In [82]:
# You can then call rows by index.
print(ser['c'])

3


In [83]:
# With a label index like this one, that is equivalent to using the Series.loc[] indexer
print(ser.loc['c'])

3


In [84]:
# If you want the zero-indexed numerical row, you can use Series.iloc[] instead
print(ser.iloc[2])

3


In [85]:
# Pandas Series have a lot of similar functions to numpy arrays.
# Some are slightly different, which we'll look at later.
print(ser.mean())
print(ser.std())
print(ser.max())

3.0
1.5811388300841898
5
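
One difference that is easy to miss is the standard deviation. By default, NumPy divides by n (ddof=0, the population formula), while pandas divides by n - 1 (ddof=1, the sample formula). The sketch below shows both defaults and how to make either library match the other.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 6)
ser = pd.Series(values)

print(values.std())        # NumPy default, ddof=0
print(ser.std())           # pandas default, ddof=1

# Pass ddof explicitly to get the other definition
print(values.std(ddof=1))  # matches the pandas default
print(ser.std(ddof=0))     # matches the NumPy default
```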


### DataFrame
A DataFrame is a data table, much like an Excel worksheet in Python - this is where we'll mostly live.

In [86]:
# DataFrames can be created from nested lists
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9


In [87]:
# Can also be created from arrays
df = pd.DataFrame(np.random.randint(0,10,50).reshape(5,10))
df

   0  1  2  3  4  5  6  7  8  9
0  8  7  5  6  7  5  8  5  0  7
1  3  1  0  2  4  7  3  5  4  2
2  5  7  6  3  0  2  6  5  8  6
3  3  8  6  6  1  8  7  5  0  5
4  2  7  0  3  3  2  5  0  7  9


In [88]:
# Most usefully, they can be created from dictionaries, where each key becomes a column name.
lst1 = [1,2,3,4,5]
lst2 = [2.3,4.5,6.7,8.9,9.0]
df = pd.DataFrame({'Arthur':lst1, 'Patsy':lst2})
df

   Arthur  Patsy
0       1    2.3
1       2    4.5
2       3    6.7
3       4    8.9
4       5    9.0


In [89]:
# If you call a single column, you get a Series
df['Arthur']

0    1
1    2
2    3
3    4
4    5
Name: Arthur, dtype: int64

In [90]:
# You can create a Boolean Series using comparison operators
df['Arthur'] < 3

0     True
1     True
2    False
3    False
4    False
Name: Arthur, dtype: bool

In [91]:
# Then pass that Boolean Series in square brackets as a filter
df[df['Arthur'] < 3]

   Arthur  Patsy
0       1    2.3
1       2    4.5


In [92]:
# Successive square brackets will filter more or call columns.
df[df['Arthur'] < 3]['Patsy']

0    2.3
1    4.5
Name: Patsy, dtype: float64
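
An alternative to chaining square brackets is the DataFrame.loc[] indexer, which takes the row filter and the column name together. The sketch below rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Arthur': [1, 2, 3, 4, 5],
                   'Patsy': [2.3, 4.5, 6.7, 8.9, 9.0]})

# Row filter first, column label second
subset = df.loc[df['Arthur'] < 3, 'Patsy']
print(subset)

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = df.loc[(df['Arthur'] > 1) & (df['Patsy'] < 8), :]
print(both)
```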

In [93]:
# You can also add new columns.
df['Sir Robin'] = df['Arthur'] + df['Patsy']
df

   Arthur  Patsy  Sir Robin
0       1    2.3        3.3
1       2    4.5        6.5
2       3    6.7        9.7
3       4    8.9       12.9
4       5    9.0       14.0


In [94]:
# Unless told otherwise, most DataFrame methods return a modified copy and leave the original unchanged
df.drop(4)

   Arthur  Patsy  Sir Robin
0       1    2.3        3.3
1       2    4.5        6.5
2       3    6.7        9.7
3       4    8.9       12.9


In [95]:
df

   Arthur  Patsy  Sir Robin
0       1    2.3        3.3
1       2    4.5        6.5
2       3    6.7        9.7
3       4    8.9       12.9
4       5    9.0       14.0


In [96]:
# But many methods have an optional inplace argument that modifies the DataFrame directly
df.drop(4,inplace=True)
df

   Arthur  Patsy  Sir Robin
0       1    2.3        3.3
1       2    4.5        6.5
2       3    6.7        9.7
3       4    8.9       12.9


In [97]:
# I used the .drop() function above to remove a row. If you specify axis=1, you can remove a column.
df.drop('Sir Robin',axis=1)

   Arthur  Patsy
0       1    2.3
1       2    4.5
2       3    6.7
3       4    8.9


In [98]:
# The DataFrame.info() function tells you some information about the values.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Arthur     4 non-null      int64  
 1   Patsy      4 non-null      float64
 2   Sir Robin  4 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 128.0 bytes


We'll dive into **much** more detail later on all the things we can do with a DataFrame.

# Functions
You've probably done this before, but let's look at some types of functions.

## Conditional execution
Using if, elif, and else, we can do a lot.

In [99]:
if 1 < 2:
    print('Yes!!!')

Yes!!!


In [100]:
if 1 > 2:
    print('Yes!!!')

In [101]:
# Elif means else if. The earlier ifs and elifs must evaluate to False for these to trigger.
x = 2
if x < 2:
    print('Yes!!!')
elif x < 3:
    print('Maybe...')

Maybe...


In [102]:
# Else executes if all other ifs and elifs are False.
x = 5
if x < 2:
    print('Yes!!!')
elif x < 3:
    print('Maybe...')
elif x < 4:
    print('Try again later.')
else:
    print('Nope.')

Nope.


## Loops
These make use of those iterables to do various tasks.

### For loops
Executes the contained code once for each value in the passed iterable.

In [103]:
lst = [1,2,3,4,5,9]
for item in lst:
    print(item)

1
2
3
4
5
9


In [104]:
# Can be combined with a range to just execute code some number of times.
for a in range(5):
    print("I'll print this 5 times.")

I'll print this 5 times.
I'll print this 5 times.
I'll print this 5 times.
I'll print this 5 times.
I'll print this 5 times.


In [105]:
# A shortened for loop, called a list comprehension, can be used to construct a list.
lst = [(item ** 2) for item in range(5)]
print(lst)

[0, 1, 4, 9, 16]
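
This compact form can also filter values with a trailing if, and the same idea works for dictionaries. A short sketch (the variable names are arbitrary):

```python
# Keep only the even numbers, then square them
even_squares = [n ** 2 for n in range(10) if n % 2 == 0]
print(even_squares)

# The same pattern with curly braces and a colon builds a dictionary
square_lookup = {n: n ** 2 for n in range(4)}
print(square_lookup)
```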


### While loops
Be careful with these. They'll execute until the condition evaluates to False.
Beware the infinite loop!

In [106]:
i = 1
while i < 6:
    print(i)
    i = i + 1

1
2
3
4
5
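
One way to guard against an infinite loop is to count iterations and break out once a limit is reached. In this sketch, `max_steps` is a hypothetical safety limit chosen for illustration.

```python
n = 100
steps = 0
max_steps = 1000  # hypothetical safety limit

# Halve n until it reaches 1, counting the steps
while n > 1:
    n = n // 2
    steps = steps + 1
    if steps >= max_steps:
        print('Stopping early: step limit reached.')
        break

print(steps)
```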


## Custom Functions
You can save functions and use them repeatedly. You can even save them in a .py file and import them so you can keep your perfected code to use in multiple assignments, or to have a library of useful things for later.

In [107]:
def my_func_here(a, b='default'):
    '''
    This is the function's Docstring.
    When someone calls help(my_func_here), this is what they'll see.
    You should put solid information in here for your future self and/or customer.
    
    Example:
    This function prints the two given values.
    Inputs:
        a - Any value. This will be printed first.
        b - Any value. This will be printed second. If no second value is given, it will assume the string 'default'.
    '''
    print(a)
    print(b)

In [108]:
my_func_here(3)

3
default


In [109]:
my_func_here(3,4)

3
4


In [110]:
my_func_here(b=4,a=3)

3
4
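
The function above only prints. More often you will want a function to hand a result back with return so it can be stored or reused. A minimal sketch, with the hypothetical name `scale_values`:

```python
def scale_values(values, factor=2):
    '''
    Return a new list with each value multiplied by factor.
    Inputs:
        values - A list of numbers.
        factor - The multiplier. Defaults to 2.
    '''
    return [value * factor for value in values]

doubled = scale_values([1, 2, 3])
tripled = scale_values([1, 2, 3], factor=3)
print(doubled)
print(tripled)
```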


In [111]:
from my_func import my_func_py

In [112]:
help(my_func_py)

Help on function my_func_py in module my_func:

my_func_py(a, b='default')
    This is the function's Docstring.
    When someone calls help(my_func), this is what they'll see.
    You should put solid information in here for your future self and/or customer.
    
    Example:
    This function prints the two given values.
    Inputs:
        a - Any value. This will be printed first.
        b - Any value. This will be printed second. If no second value is given, it will assume the string 'default'.



### Lambda expressions
Lambda expressions are shortened versions of functions that may be useful for simple expressions.

In [113]:
lamb_func = lambda xx: xx + 5
lamb_func(7)

12

In [115]:
# You can always choose not to use lambda expressions, so take them or leave them.
def alt_func(a):
    return a + 5
alt_func(7)

12
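
Where lambda expressions tend to earn their keep is as short, throwaway arguments to other functions. For example, the built-in sorted() accepts a key function that decides what to sort by. A small sketch with made-up data:

```python
words = ['pear', 'fig', 'banana']

# Sort by word length rather than alphabetically
by_length = sorted(words, key=lambda word: len(word))
print(by_length)

# Sort pairs by their second element
pairs = [(1, 'b'), (3, 'a'), (2, 'c')]
by_second = sorted(pairs, key=lambda pair: pair[1])
print(by_second)
```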

# That's all for now!
We will see so, so much more of this later, but at least you now know the basics.