## Data Types


### University of Virginia
### DS 5100: Programming for Data Science
### Last Updated: April 2, 2021
---  


### PREREQUISITES
- variables

### SOURCES 
- python documentation on built-in data types, operations, methods  
https://docs.python.org/3/library/stdtypes.html  


- python data types  
https://www.geeksforgeeks.org/python-data-types/  


- mutable vs immutable data types  
https://towardsdatascience.com/immutable-vs-mutable-data-types-in-python-e8a9a6fcfbdc

### OBJECTIVES
- Present the essential Python data types and demonstrate some functionality 
- Demonstrate the string method `format()` for embedding values into strings

NOTE: 
1. See sources for more details. As the course progresses, things will go deeper.  
2. We jump ahead a bit, showing functionality involving `if-statements` and `for-loops`,  
both covered in more detail later.


### CONCEPTS

- zero-based indexing
- boolean
- integer
- float
- string
- list
- tuple
- set
- dictionary (dict)
- enumerate
- range
- structuring strings containing data types
- data type conversions


---

### Introduction to Python Built-in Data Types

This notebook introduces essential Python built-in data types, their operations, and methods.  
Please refer to the sources above for more details.  
As the course progresses, the data types will be further detailed.  

Everything in Python is an object, or class instance (to be discussed later).

**Zero-based indexing**  
Python uses zero-based indexing, which means that for a sequence `mylist`:

`mylist[0]` references the first element  
`mylist[1]` references the second element, etc.

For any sequence `mylist` of length *N* (such as a list, string, or tuple):  
`mylist[:n]` will return the first *n* elements, from index *0* to *n-1*  
`mylist[-n:]` will return the last *n* elements, from index *N-n* to *N-1*
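
The slicing rules above can be checked with a short sketch. The list `letters` and the value of `n` are made up for illustration.

```python
# Hypothetical five-element list, used only for illustration (N = 5)
letters = ['a', 'b', 'c', 'd', 'e']
n = 2

first_n = letters[:n]   # elements at indices 0 .. n-1
last_n = letters[-n:]   # elements at indices N-n .. N-1

print(first_n)  # ['a', 'b']
print(last_n)   # ['d', 'e']
```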

Walter Coleman txx3ej 6/22/2021

## Boolean

A boolean (`bool`) takes one of two built-in values: `True` or `False`.

An `if` statement evaluates its condition as a boolean and runs the indented block only when the condition is `True`.  
The cell below checks whether `cache` is `True`.

In [2]:
cache = True

if cache:
    print('data will be cached')

data will be cached


In [3]:
print(type(cache))

<class 'bool'>


In [4]:
isinstance.__doc__

'Return whether an object is an instance of a class or of a subclass thereof.\n\nA tuple, as in ``isinstance(x, (A, B, ...))``, may be given as the target to\ncheck against. This is equivalent to ``isinstance(x, A) or isinstance(x, B)\nor ...`` etc.'

In [5]:
# check if `cache` is a `bool`  

isinstance(cache, bool)

True

Compound conditions can be built with the boolean operators `and`, `or`, and `not`.

In [6]:
cache = True
oome = False

if cache or oome:
    print('condition met!')

condition met!


`and` expressions short-circuit: if an earlier condition is `False`, the remaining conditions are not evaluated.  

In [7]:
if oome and cache:
    print('condition met!')

In this case, since `oome` is `False`, the check on `cache` never happens and nothing is printed.
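
One way to see short-circuiting directly is to record which conditions were actually evaluated. This is only a sketch: `check` and `calls` are hypothetical names introduced for the demonstration.

```python
calls = []  # records which checks were actually evaluated

def check(name, value):
    """Hypothetical helper: note that it was called, then return value."""
    calls.append(name)
    return value

# The first operand is False, so `and` never evaluates the second one
if check('oome', False) and check('cache', True):
    print('condition met!')

print(calls)  # ['oome']: the check on 'cache' never ran
```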

## Numeric Values

Numeric values can be `int`, `float`, or `complex`.

## Integer

Whole numbers: positive, negative, or zero. The cell below uses `//` (floor division) and `%` (remainder).

In [8]:
epoch   = 21
divisor = 10
print('quotient:', epoch // divisor)
print('remainder:', epoch % divisor)

quotient: 2
remainder: 1


In [9]:
print(type(epoch))

<class 'int'>


In [10]:
isinstance(epoch, int)

True

## Float

A real number stored in floating-point representation. A float literal is written with a decimal point.

In [11]:
f1_score = 0.95

In [12]:
print(type(f1_score))

<class 'float'>


In [13]:
isinstance(f1_score, float)

True
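
Because floats are stored in binary floating point, some decimal values are only approximated. A minimal sketch of the consequence, using `math.isclose` from the standard library for the comparison:

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False: the stored sum is not exactly 0.3
print(math.isclose(total, 0.3))  # True: equal within a small tolerance
```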

## String

A string (`str`) is an immutable sequence of Unicode characters (code points).

String literals are defined with single, double, or triple quotes.

In [1]:
status = 'success'

In [15]:
print(type(status))

<class 'str'>


In [16]:
isinstance(status, str)

True

Subsetting into a string

In [17]:
# extract first 2 characters
status[:2]

'su'

In [18]:
# extract last 2 characters
status[-2:]

'ss'

In [2]:
status.startswith('a')

False

In [3]:
status.endswith('s')

True

It is NOT possible to reassign elements of a string. Python strings are **immutable**.

In [19]:
status = 'success'
status[0] = 't'

TypeError: 'str' object does not support item assignment
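
Since a string cannot be changed in place, the usual approach is to build a new string from the old one. A small sketch (the variable names `capitalized` and `replaced` are illustrative):

```python
status = 'success'

# Build new strings instead of modifying the original in place
capitalized = status[0].upper() + status[1:]  # slice and concatenate
replaced = status.replace('s', 'z')           # replace() returns a new string

print(capitalized)  # Success
print(replaced)     # zuccezz
print(status)       # success (the original is unchanged)
```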

In [1]:
str1 = 'C:/Users/Documents/'
str2 = 'ds5110_summer2021_participation_summary'

In [2]:
str1_2 = str1 + str2
print(str1_2)

C:/Users/Documents/ds5110_summer2021_participation_summary


In [3]:
import os

In [4]:
os.path.join(str1,str2)

'C:/Users/Documents/ds5110_summer2021_participation_summary'

### TRY FOR YOURSELF (UNGRADED EXERCISES)

1) Define a string and print:
- the first three characters of the string
- the last three characters of the string

In [20]:
mystr = 'mississippi'

In [21]:
# mystr = 'python'
print(mystr[:3])
print(mystr[-3:])

mis
ppi


## List

A list is an ordered, mutable sequence of items. It can contain mixed types.

In [22]:
mixed = ['a', 5 ,3.2]
print(mixed)

['a', 5, 3.2]


list of floats

In [4]:
trans = [3.14, 2.71]

In [3]:
print(type(trans))

In [None]:
trans[1]

In [None]:
# index 2 is out of range for a two-element list: raises IndexError

trans[2]

A float is not subscriptable, so this raises a `TypeError`:

In [5]:
trans[1][1]

But you can convert the float to a string and index into that:

In [5]:
str(trans[1])[1]

'.'

list of strings

In [23]:
financial_derivatives = ['futures','swaps','options']

In [24]:
print(type(financial_derivatives))

<class 'list'>


In [25]:
isinstance(financial_derivatives, list)

True

elements are strings

In [26]:
isinstance(financial_derivatives[0], str)

True

indexing into the list

In [27]:
print('first     :',financial_derivatives[0])
print('first two :',financial_derivatives[:2])
print('last      :',financial_derivatives[-1])
print('last two  :',financial_derivatives[-2:])

first     : futures
first two : ['futures', 'swaps']
last      : options
last two  : ['swaps', 'options']


index into first element, extracting first three characters from string:

In [28]:
financial_derivatives[0][:3]

'fut'

loop over the derivatives, printing them with their string lengths

In [29]:
for deriv in financial_derivatives:
    print(deriv, len(deriv))

futures 7
swaps 5
options 7


`enumerate` the list, which yields each index together with its value

In [30]:
enumerate.__doc__

'Return an enumerate object.\n\n  iterable\n    an object supporting iteration\n\nThe enumerate object yields pairs containing a count (from start, which\ndefaults to zero) and a value yielded by the iterable argument.\n\nenumerate is useful for obtaining an indexed list:\n    (0, seq[0]), (1, seq[1]), (2, seq[2]), ...'

In [31]:
for ix, deriv in enumerate(financial_derivatives):
    print('index:', ix, ', derivative:', deriv)

index: 0 , derivative: futures
index: 1 , derivative: swaps
index: 2 , derivative: options


In [32]:
# just print the pairs

for ix, deriv in enumerate(financial_derivatives):
    print(ix, deriv)

0 futures
1 swaps
2 options


In [6]:
variables = ['x1','x2','x3']
response = ['y']
variables + response

['x1', 'x2', 'x3', 'y']
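
Unlike strings, lists are mutable. The sketch below contrasts `+`, which builds a new list, with item assignment and `append()`, which change a list in place. The name `combined` is introduced here for illustration.

```python
variables = ['x1', 'x2', 'x3']
response = ['y']

combined = variables + response  # + builds a new list; the inputs are unchanged
variables[0] = 'x0'              # item assignment works because lists are mutable
variables.append('x4')           # append() adds one item to the end, in place

print(combined)   # ['x1', 'x2', 'x3', 'y']
print(variables)  # ['x0', 'x2', 'x3', 'x4']
```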

### TRY FOR YOURSELF (UNGRADED EXERCISES)

2) Assign values to a list, and print the second element from the list

In [2]:
lis = [1,2,3,4,5]

In [3]:
lis[1]

2

In [None]:
mylist = ['first','second','third']
print(mylist[1])

## Tuple

A tuple can contain any number of elements of any data type.  
It is created with comma-separated values, with or without parentheses.  

In [9]:
# define a tuple of mixed types

grab_bag = (4, 'EHR', ['year','in','review'])

In [7]:
grab_bag_no_parens = 4, 'EHR', ['year','in','review']
grab_bag_no_parens

(4, 'EHR', ['year', 'in', 'review'])

In [10]:
print(type(grab_bag))

<class 'tuple'>


In [11]:
isinstance(grab_bag, tuple)

True

In [12]:
# show the first element

grab_bag[0]

4

In [13]:
# show the last element

grab_bag[-1]

['year', 'in', 'review']

Tuples, like strings, are immutable.  
What happens when we try to reassign an element of `grab_bag`?

In [14]:
grab_bag[0] = 5

TypeError: 'tuple' object does not support item assignment
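
Two related points are worth a quick sketch: a mutable object stored inside a tuple can still be modified, and a "changed" tuple is obtained by building a new one. The name `new_bag` is introduced here for illustration.

```python
grab_bag = (4, 'EHR', ['year', 'in', 'review'])

# The tuple itself cannot be changed, but the list stored inside it can
grab_bag[-1].append('2021')
print(grab_bag)  # (4, 'EHR', ['year', 'in', 'review', '2021'])

# To "change" an element, build a new tuple from pieces of the old one
new_bag = (5,) + grab_bag[1:]
print(new_bag)   # (5, 'EHR', ['year', 'in', 'review', '2021'])
```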

Define a tuple of strings

In [15]:
coffees = ('brazilian','costa_rican','colombian')

if `coffees` is a tuple, print the first element

In [16]:
if isinstance(coffees, tuple):
    print(coffees[0])

brazilian


In [12]:
for coffee in coffees:
    print(coffee)

brazilian
costa_rican
colombian


In [17]:
for ix, coffee in enumerate(coffees):
    print(ix, coffee)

0 brazilian
1 costa_rican
2 colombian


### TRY FOR YOURSELF (UNGRADED EXERCISES)

3) Assign values to a tuple  
Enumerate the tuple, printing each index and value

In [18]:
teams = ('Raiders','Seahawks','49ers')

In [19]:
for ix, team in enumerate(teams):
    print(ix, team)

0 Raiders
1 Seahawks
2 49ers


In [14]:
flags = (1, 0, 1, 1)

for ix, flag in enumerate(flags):
    print(ix, flag)

0 1
1 0
2 1
3 1


## Set

A `set` is an unordered collection of unique objects

In [20]:
peanuts = {'snoopy','snoopy','woodstock'}

In [21]:
print(type(peanuts))

<class 'set'>


In [22]:
peanuts

{'snoopy', 'woodstock'}

Note that the duplicate `'snoopy'` was removed: a set keeps only unique values.

Since sets are unordered, they don't have an index. This will break:

In [18]:
peanuts[0]

TypeError: 'set' object is not subscriptable

In [19]:
for peanut in peanuts:
    print(peanut)

snoopy
woodstock


In [23]:
set1= {'R'}

In [24]:
set2 = {'SQL','Python'}

In [25]:
set1.union(set2)

{'Python', 'R', 'SQL'}

In [26]:
set1 + set2

TypeError: unsupported operand type(s) for +: 'set' and 'set'
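
Although `+` is not defined for sets, the set operators `|` (union), `&` (intersection), and `-` (difference) are. A brief sketch; `set3` is a hypothetical extra set added for the demonstration.

```python
set1 = {'R'}
set2 = {'SQL', 'Python'}
set3 = {'Python', 'Julia'}  # hypothetical extra set for illustration

print(set1 | set2 == set1.union(set2))  # True: | is the union operator
print(set2 & set3)                      # {'Python'}: items in both sets
print(set2 - set3)                      # {'SQL'}: items in set2 but not in set3
```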

**Check if a value is in the set using `in`**

In [20]:
'snoopy' in peanuts

True

### TRY FOR YOURSELF (UNGRADED EXERCISES)

4) Assign a value to a string, and assign values to a set

Check if the string is in the set

In [21]:
word = "my_word"
sets = {'Lady','Michgan','Jesus','Jones','Abington','Lazer','Lady'}
word in sets

False

In [None]:
val = 'ERROR'
levels = {'WARN','ERROR','CRITICAL'}

val in levels

## Dict

A `dictionary` or `dict` is a collection of key-value pairs in which each key is unique. Since Python 3.7, a dict preserves the order in which keys were inserted.

They follow this format:

{'key1' : 'value1', 'key2' : 'value2', ...}

In [22]:
airports = {'LAGUARDIA':'LGA','LOGAN':'BOS','NANTUCKET':'ACK','CHARLOTTESVILLE':'CHO'}

In [23]:
print(type(airports))

<class 'dict'>


In [24]:
airports

{'LAGUARDIA': 'LGA',
 'LOGAN': 'BOS',
 'NANTUCKET': 'ACK',
 'CHARLOTTESVILLE': 'CHO'}

can index by key

In [25]:
airports['LOGAN']

'BOS'

Indexing by position fails, since the dict treats the index as a key:

In [26]:
airports[0]

KeyError: 0

Attempting to access a nonexistent key also raises a `KeyError`:

In [27]:
airports['KENNEDY']

KeyError: 'KENNEDY'

It is safer to use `get()`, which returns `None` for a missing key (so nothing is displayed):

In [28]:
airports.get('DULLES')
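
`get()` also accepts a second argument to return in place of `None`, and `in` tests whether a key exists. A small sketch, using a separate hypothetical dict `codes` so the `airports` dict is left untouched:

```python
# Hypothetical two-entry dict, kept separate from `airports`
codes = {'LAGUARDIA': 'LGA', 'LOGAN': 'BOS'}

missing = codes.get('DULLES')               # None, no error raised
fallback = codes.get('DULLES', 'unknown')   # second argument is the default

print(missing is None)     # True
print(fallback)            # unknown
print('LOGAN' in codes)    # True: `in` checks the keys
```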

can assign like this:

In [29]:
airports['DULLES'] = 'IAD'

In [30]:
airports

{'LAGUARDIA': 'LGA',
 'LOGAN': 'BOS',
 'NANTUCKET': 'ACK',
 'CHARLOTTESVILLE': 'CHO',
 'DULLES': 'IAD'}

extract keys

In [31]:
airports.keys()

dict_keys(['LAGUARDIA', 'LOGAN', 'NANTUCKET', 'CHARLOTTESVILLE', 'DULLES'])

loop over the airports, printing the keys

In [32]:
for key in airports.keys():
    print(key)

LAGUARDIA
LOGAN
NANTUCKET
CHARLOTTESVILLE
DULLES


extract values

In [33]:
airports.values()

dict_values(['LGA', 'BOS', 'ACK', 'CHO', 'IAD'])

save the values as a list

In [34]:
airport_codes = list(airports.values())

print(airport_codes)

['LGA', 'BOS', 'ACK', 'CHO', 'IAD']


extract keys and values using `items()`

In [35]:
airports.items()

dict_items([('LAGUARDIA', 'LGA'), ('LOGAN', 'BOS'), ('NANTUCKET', 'ACK'), ('CHARLOTTESVILLE', 'CHO'), ('DULLES', 'IAD')])

loop over the airports, printing the keys and values

In [36]:
for key, val in airports.items():
    print(key, '-', val)

LAGUARDIA - LGA
LOGAN - BOS
NANTUCKET - ACK
CHARLOTTESVILLE - CHO
DULLES - IAD


In [27]:
dict_w_tuple = {'first_key': ('Catcher','in','the','Rye')}

In [28]:
print(dict_w_tuple)
dict_w_tuple[0]

{'first_key': ('Catcher', 'in', 'the', 'Rye')}


KeyError: 0

In [30]:
type(dict_w_tuple['first_key'])

tuple
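
To reach an element of the tuple, index by key first and then by position. A short sketch (the name `title_words` is introduced for illustration):

```python
dict_w_tuple = {'first_key': ('Catcher', 'in', 'the', 'Rye')}

title_words = dict_w_tuple['first_key']  # look up the tuple by key first
print(title_words[0])                    # Catcher
print(dict_w_tuple['first_key'][-1])     # Rye
print(' '.join(title_words))             # Catcher in the Rye
```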

### TRY FOR YOURSELF (UNGRADED EXERCISES)

5) Create a dictionary containing at least three key-value pairs  

Show how to index into the dict with one of the keys to extract the corresponding value using `get()`

Show how to store the keys in a list.

In [31]:
# Task: given data1, data2, dedupe and put in a list
data1 = {'R', 'Julia', 'Julia', 'SQL'}
data2 = {'R', 'Python'}

In [36]:
# element order may vary, since sets are unordered
list(data1.union(data2))

['Julia', 'R', 'Python', 'SQL']

In [37]:
states = {'Alaska':'Juneau', 'Oregon':'Salem', 'Arizona':'Phoenix'}

In [38]:
states.get('Alaska')

'Juneau'

In [39]:
name_age = {'greg':15, 'annabel':22, 'joaquin':19}

print('name_age[joaquin]=', name_age.get('joaquin'))


names = list(name_age.keys())

print('names:', names)

name_age[joaquin]= 19
names: ['greg', 'annabel', 'joaquin']


## Range

A range is a sequence of integers, from `start` to `stop` by `step`  
The `start` point is zero by default.  
The `step` is one by default.  
The `stop` point is NOT included.  

Ranges can be assigned to a variable.

In [40]:
for x in range(5):
    print(x)

0
1
2
3
4


In [41]:
help(range)

Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object
 |  
 |  Return an object that produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |

In [42]:
for x in range(1, 20, 3):
    print(x)

1
4
7
10
13
16
19


In [43]:
rng = range(5)

In [44]:
print(type(rng))

<class 'range'>


In [45]:
for rn in rng:
    print(rn)

0
1
2
3
4


another range:

In [46]:
rangy = range(1, 11, 2)
for rn in rangy:
    print(rn)

1
3
5
7
9
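
A range can also be converted to a list, measured, indexed, and tested for membership. The sketch below uses a hypothetical range named `evens`, plus a negative `step` to count down.

```python
evens = range(0, 10, 2)

print(list(evens))            # [0, 2, 4, 6, 8]
print(len(evens))             # 5
print(10 in evens)            # False: the stop value is excluded
print(evens[1])               # 2: ranges support indexing
print(list(range(5, 0, -1)))  # [5, 4, 3, 2, 1]: a negative step counts down
```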


### TRY FOR YOURSELF (UNGRADED EXERCISES)

6) More often, ranges may be used without assignment.  
What will this print?

In [47]:
for rn in range(-5, 5, 2):
    print(rn)

-5
-3
-1
1
3


## `format()`  

Variable values can be embedded in strings using the string method `format()`.  
Place `{}` placeholders in the string; they are filled in order from left to right by the arguments of `.format(var1, var2, ...)`.

In [6]:
epoch = 20
loss = 1.55

print('Epoch: {}, loss: {}'.format(epoch, loss))

Epoch: 20, loss: 1.55


This breaks: the string has three `{}` placeholders, but only two values are passed.

In [7]:
print('Epoch: {}, loop: {}, loss: {}'.format(epoch, loss))

IndexError: Replacement index 2 out of range for positional args tuple
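
Placeholders can also refer to arguments by position or by name, and can carry a format specification such as `:.3f`. A brief sketch; the keyword names `e` and `l` are arbitrary choices made for this example.

```python
epoch = 20
loss = 1.55

by_position = 'loss: {1}, epoch: {0}'.format(epoch, loss)  # explicit indexes
by_name = 'Epoch: {e}, loss: {l}'.format(e=epoch, l=loss)  # keyword names
fixed = 'loss: {:.3f}'.format(loss)                        # three decimal places

print(by_position)  # loss: 1.55, epoch: 20
print(by_name)      # Epoch: 20, loss: 1.55
print(fixed)        # loss: 1.550
```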

---

### TRY FOR YOURSELF (UNGRADED EXERCISES)

7) Use `format()` to print a string containing each variable name followed by its value, using comma separators between the variables (e.g., epoch 1, mode TRAIN, ...)

In [47]:
epoch = 1
mode = "TRAIN"
loss = 0.46

print("YOUR STRING HERE")

YOUR STRING HERE


In [48]:
epoch = 1
mode = "TRAIN"
loss = 0.46

print("epoch {}, mode {}, loss {}".format(epoch, mode, loss))

epoch 1, mode TRAIN, loss 0.46


8) Define a list containing one each of: integer, float, bool, string

Print the length of the list using `len()`

In [49]:
mylist = [4, 2.2, False, 'fancy']

print(len(mylist))

4


Write a `for-loop` to iterate over the list, printing the val and `type()` of each element.

In [50]:
for val in mylist:
    print(val, type(val))

4 <class 'int'>
2.2 <class 'float'>
False <class 'bool'>
fancy <class 'str'>


## Data Type Conversions

There are many functions for converting between data types.

`float -> int`: truncates the fractional part (rounds toward zero)

In [50]:
val = 3.8
print('value: {}, type: {}'.format(val, type(val)))

val_int = int(val)
print('value: {}, type: {}'.format(val_int, type(val_int)))

value: 3.8, type: <class 'float'>
value: 3, type: <class 'int'>
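
Truncation is not the same as rounding or flooring, which matters for negative numbers. A small comparison, using `math.floor` from the standard library:

```python
import math

print(int(3.8))          # 3
print(int(-3.8))         # -3: truncation moves toward zero
print(math.floor(-3.8))  # -4: floor moves toward negative infinity
print(round(3.8))        # 4: round goes to the nearest integer
```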


`string -> float`

In [51]:
val = '3.8'
print('value: {}, type: {}'.format(val, type(val)))

val_float = float(val)
print('value: {}, type: {}'.format(val_float, type(val_float)))

value: 3.8, type: <class 'str'>
value: 3.8, type: <class 'float'>


**Converting a string that contains a decimal directly to an integer will fail:**

In [52]:
val = '3.8'
print('value: {}, type: {}'.format(val, type(val)))

val_int = int(val)
print('value: {}, type: {}'.format(val_int, type(val_int)))

value: 3.8, type: <class 'str'>


ValueError: invalid literal for int() with base 10: '3.8'
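
One possible workaround is to convert to `float` first and then to `int`. When the input may be invalid, the `ValueError` can be caught with `try`/`except` (a construct not otherwise covered in this notebook). The names `text`, `as_int`, and `message` are illustrative.

```python
text = '3.8'

# Convert in two steps: string -> float -> int (truncates to 3)
as_int = int(float(text))
print(as_int)  # 3

# Catch the error instead of letting the program stop
try:
    int(text)
    message = 'converted'
except ValueError as err:
    message = 'conversion failed'
    print(message, '-', err)
```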

`list -> dictionary` using `fromkeys()`. The list items become the keys, and the values default to `None`.

In [53]:
month = ["jan", "feb", "mar"]

month_dict = dict.fromkeys(month)
month_dict

{'jan': None, 'feb': None, 'mar': None}
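
`fromkeys()` also accepts a second argument that sets the same starting value for every key. When each key needs its own value, a dict comprehension is one option. This is a sketch; `counts` and `month_number` are hypothetical names.

```python
month = ["jan", "feb", "mar"]

# Same starting value for every key
counts = dict.fromkeys(month, 0)
print(counts)  # {'jan': 0, 'feb': 0, 'mar': 0}

# A different value per key: number the months starting at 1
month_number = {name: ix for ix, name in enumerate(month, start=1)}
print(month_number)  # {'jan': 1, 'feb': 2, 'mar': 3}
```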

Advice: many other sensible conversions are supported; the documentation can help.

---