## Python Basics

In Jupyter notebooks (`.ipynb`), the browser sends your Python code to a machine in the cloud, which executes it in a Python 3 interpreter and sends the results back.

This is why you can see an output when running cells.

### Python Functions

In [1]:
def add_numbers(x, y):
    return x + y

add_numbers(1, 2)

3

In [2]:
def add_numbers(x, y, z=None):
    if z is None:
        return x + y
    else:
        return x + y + z
    
print(add_numbers(1, 2))
print(add_numbers(1, 2, 3))

3
6


### Types and Sequences

In [3]:
type('This is a string')

str

In [4]:
type(None)

NoneType

In [5]:
type(1)

int

In [6]:
type(1.0)

float

In [7]:
type(add_numbers)

function

In [8]:
# Tuples
x = (1, 'a', 2, 'b')
type(x)

tuple

In [9]:
# Lists
x = [1, 'a', 2, 'b']
type(x)

list

In [10]:
for item in x:
    print(item)

1
a
2
b


In [11]:
i = 0
while i != len(x):
    print(x[i])
    i += 1

1
a
2
b


In [12]:
1 in [1, 2, 3]

True

In [13]:
# Slicing
x = 'This is a string'
print(x[0])
print(x[0:1])
print(x[0:2])

T
T
Th


In [14]:
x[-1]

'g'

In [15]:
x[-4:-2]

'ri'

In [16]:
x[:3]

'Thi'

In [17]:
x[3:]

's is a string'

### Text Analysis

In [19]:
# Text analysis
firstname = 'Christopher'
lastname = 'Brooks'

print(firstname + ' ' + lastname)
print(firstname * 3)
print('Chris' in firstname)

Christopher Brooks
ChristopherChristopherChristopher
True


In [20]:
firstname = 'Christopher Arthur Hansen Brook'.split(' ')[0]
lastname = 'Christopher Arthur Hansen Brook'.split(' ')[-1]
print(firstname)
print(lastname)

Christopher
Brook


### Dictionaries

In [21]:
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}
x['Christopher Brooks']

'brooksch@umich.edu'

In [22]:
x['Kevyn Collins-Thompson'] = None
x['Kevyn Collins-Thompson']

In [23]:
for name in x:
    print(x[name])

brooksch@umich.edu
billg@microsoft.com
None


In [24]:
for email in x.values():
    print(email)

brooksch@umich.edu
billg@microsoft.com
None


In [25]:
for name, email in x.items():
    print(name)
    print(email)

Christopher Brooks
brooksch@umich.edu
Bill Gates
billg@microsoft.com
Kevyn Collins-Thompson
None


In [26]:
# Unpacking
x = ('Christopher', 'Brooks', 'brooksch@umich.edu')
fname, lname, email = x
fname

'Christopher'

In [27]:
lname

'Brooks'

In [28]:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, email = x

ValueError: too many values to unpack (expected 3)
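
When a tuple holds more items than there are target names, one option is extended unpacking: a starred name collects the leftover items in a list. A minimal sketch reusing the tuple above (the names `record` and `rest` are chosen here for illustration):

```python
# Extended unpacking: the starred target collects any leftover items in a list
record = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, *rest = record

print(fname)  # Christopher
print(lname)  # Brooks
print(rest)   # ['brooksch@umich.edu', 'Ann Arbor']
```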

### Strings

In Python 3, strings are **Unicode** based. In early computing, each character of a string was limited to one of a small set of values: *ASCII* defines 128 characters (upper and lower case Latin letters, digits, punctuation and control codes), and 8-bit extensions raised the limit to 256.

The world doesn't run on Latin characters alone: there's a need to support non-English languages as well as characters that rarely appear in words but are common elsewhere, such as mathematical operators. Unicode, together with its encodings known as the *Unicode Transformation Formats* (UTF, for example UTF-8), is an attempt to solve this. It can represent over a million different characters, including symbols and emojis.
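
The difference between characters and their encoded bytes can be sketched as follows. The sample text is an assumption chosen for illustration; it is written with escape sequences so the result does not depend on how the source file itself is saved.

```python
# Each element of a Python 3 str is a Unicode code point
text = 'na\u00efve caf\u00e9'       # "naive cafe" with two accented letters
print(len(text))                    # 10 code points

# Encoding to UTF-8 produces bytes; each accented letter here takes two bytes
encoded = text.encode('utf-8')
print(len(encoded))                 # 12 bytes

# ord() gives the code point of a character, chr() goes the other way
print(ord('\u00e9'), chr(233) == '\u00e9')

# Decoding the bytes recovers the original string
print(encoded.decode('utf-8') == text)
```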

In [29]:
print('Chris' + 2)

TypeError: can only concatenate str (not "int") to str

In [31]:
print('Chris' + str(2))

Chris2


In [34]:
sales_record = {'price': 3.24,
                'num_items': 4,
                'person': 'Chris'}
sales_statement = f"{sales_record['person']} bought {sales_record['num_items']} items at a total price of {sales_record['num_items']*sales_record['price']}"

print(sales_statement)

Chris bought 4 items at a total price of 12.96


### Reading and Writing CSV files

We'll use the `csv` module to read CSV files and set the floating point precision for printing to $2$ using the IPython magic `%precision`.
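
Before loading the real dataset, here is a minimal sketch of how `csv.DictReader` behaves. The in-memory sample and its three columns are assumptions for illustration, standing in for a file on disk.

```python
import csv
import io

# Hypothetical in-memory CSV data standing in for a file on disk
sample = io.StringIO(
    'manufacturer,model,hwy\n'
    'audi,a4,29\n'
    'audi,a4,31\n'
)

# DictReader uses the first row as keys and yields one dict per data row
rows = list(csv.DictReader(sample))

print(rows[0]['manufacturer'])   # audi
print(rows[0]['hwy'])            # values are read as strings: '29'
print(float(rows[1]['hwy']))     # convert explicitly before doing arithmetic
```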

In [52]:
import csv
import os

%precision 2

path = 'G:/Mi unidad/GitHub/course-applied-data-science/1-intro-ds/resources/week-1/datasets/'

try:
    os.chdir(path)
    print(f'Current working directory {path}')
except FileNotFoundError:
    print(f'Directory: {path} does not exist')
except NotADirectoryError:
    print(f'{path} is not a directory')
except PermissionError:
    print(f'You do not have permissions to change to {path}')

Current working directory G:/Mi unidad/GitHub/course-applied-data-science/1-intro-ds/resources/week-1/datasets/


In [53]:
with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))

mpg

[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '3',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '2',
  'year': '2008',
  'cyl': '4',
  'trans': 'manual(m6)',
  'drv': 'f',
  'cty': '20',
  'hwy': '31',
  'fl': 'p',
  'class': 'compact'},
 {'': '4',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '2',
  'year': '2008',
  'cyl': '4',
  'trans': 'auto(av)',
  'drv': 'f',
  'cty': '21',
  'hwy': '30',
  'fl': 'p',
  'class': 'compact'},
 {'': '5',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '2.8',
  'year': '1999',
  'cyl': '6',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '16',
  'hwy': '26',
 

In [54]:
mpg[:3]

[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '3',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '2',
  'year': '2008',
  'cyl': '4',
  'trans': 'manual(m6)',
  'drv': 'f',
  'cty': '20',
  'hwy': '31',
  'fl': 'p',
  'class': 'compact'}]

In [55]:
len(mpg)

234

In [56]:
mpg[0].keys()

dict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

In [58]:
mpg[0]

{'': '1',
 'manufacturer': 'audi',
 'model': 'a4',
 'displ': '1.8',
 'year': '1999',
 'cyl': '4',
 'trans': 'auto(l5)',
 'drv': 'f',
 'cty': '18',
 'hwy': '29',
 'fl': 'p',
 'class': 'compact'}

In [59]:
# Get manufacturers for all mpg values
manufacturers = set(data['manufacturer'] for data in mpg)
manufacturers

{'audi',
 'chevrolet',
 'dodge',
 'ford',
 'honda',
 'hyundai',
 'jeep',
 'land rover',
 'lincoln',
 'mercury',
 'nissan',
 'pontiac',
 'subaru',
 'toyota',
 'volkswagen'}

In [60]:
# Get vehicle classes for all mpg values
veh_classes = set(data['class'] for data in mpg)
veh_classes

{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

In [61]:
# Get number of cylinders 
cylinders = set(data['cyl'] for data in mpg)
cylinders

{'4', '5', '6', '8'}

In [66]:
# Find average highway MPG (Miles per Gallon)
hwy_classes = []

for v_class in veh_classes: # Iterate over all vehicle classes
    sum_mpg = 0
    v_class_count = 0
    for data in mpg:
        if data['class'] == v_class:
            sum_mpg += float(data['hwy']) # 'hwy' is str type -> You need to convert it into float
            v_class_count += 1
    hwy_classes.append((v_class, sum_mpg / v_class_count)) # ('vehicle class', 'average hwy mpg')

hwy_classes.sort(key=lambda x: x[1])
hwy_classes

[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]
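
The nested loop above rescans the whole dataset once per vehicle class. As an alternative sketch, the same averages can be accumulated in a single pass with `collections.defaultdict`. The four sample rows are hypothetical and only mimic the shape of the `DictReader` output.

```python
from collections import defaultdict

# Hypothetical rows shaped like the dicts returned by csv.DictReader
rows = [
    {'class': 'compact', 'hwy': '29'},
    {'class': 'compact', 'hwy': '31'},
    {'class': 'suv', 'hwy': '20'},
    {'class': 'suv', 'hwy': '18'},
]

# Keep a running [sum, count] per class in one pass over the data
totals = defaultdict(lambda: [0.0, 0])
for row in rows:
    totals[row['class']][0] += float(row['hwy'])
    totals[row['class']][1] += 1

# Turn the totals into (class, average) pairs, sorted by average
averages = sorted(
    ((cls, total / count) for cls, (total, count) in totals.items()),
    key=lambda pair: pair[1],
)
print(averages)   # [('suv', 19.0), ('compact', 30.0)]
```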

### Dates and Times

One of the most common legacy methods for storing the date and time in online transaction systems is based on *the offset from the epoch: January 1, 1970*. It's not uncommon to see systems storing the date of a transaction in seconds or milliseconds since this date. $\rightarrow$ You need to convert these values before you can make sense of the data.
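
As a minimal sketch of such a conversion, assume a transaction timestamp stored in milliseconds (the value below is made up for illustration). Passing an explicit time zone keeps the result independent of the machine's local settings.

```python
import datetime as dt

# Hypothetical transaction timestamp stored as milliseconds since the epoch
timestamp_ms = 1_682_885_901_460

# fromtimestamp() expects seconds, so divide by 1000 first
moment = dt.datetime.fromtimestamp(timestamp_ms / 1000, tz=dt.timezone.utc)
print(moment.isoformat())

# A timestamp of 0 is the epoch itself
epoch = dt.datetime.fromtimestamp(0, tz=dt.timezone.utc)
print(epoch)   # 1970-01-01 00:00:00+00:00
```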

In [67]:
import datetime as dt
import time as tm

In [68]:
tm.time()

1682885901.46

In [69]:
date_now = dt.datetime.fromtimestamp(tm.time())
date_now

datetime.datetime(2023, 4, 30, 21, 18, 48, 42863)

In [70]:
date_now.year, date_now.month, date_now.day, date_now.hour, date_now.minute, date_now.second

(2023, 4, 30, 21, 18, 48)

In [71]:
# Create time deltas
delta = dt.timedelta(days=100)
delta

datetime.timedelta(days=100)

In [72]:
today = dt.date.today()

In [73]:
today - delta

datetime.date(2023, 1, 20)

In [75]:
today > today - delta

True

### Objects and `map()` function

In [76]:
# A class object in Python
class Person:
    department = 'School of Information'

    def set_name(self, new_name):
        self.name = new_name

    def set_location(self, new_location):
        self.location = new_location

In [77]:
person = Person()
person.set_name('Christopher Brooks')
person.set_location('Ann Arbor, MI, USA')
print(f'{person.name} lives in {person.location} and works in the department {person.department}')

Christopher Brooks lives in Ann Arbor, MI, USA and works in the department School of Information


The `map(function, iterable, ...)` function is an example of a functional programming feature in Python. It returns an iterator that applies *function* to every item of *iterable*, yielding the results. If additional *iterable* arguments are passed, *function* must take that many arguments and is applied to the items from all iterables in parallel.

In [78]:
# map() function
store1_prices = [10.00, 11.00, 12.34, 2.34]
store2_prices = [9.00, 11.10, 12.34, 2.01]
cheapest = map(min, store1_prices, store2_prices)
cheapest # Lazy evaluation


<map at 0x2635e5378b0>

In Python, the `map()` function returns a `map` object. It doesn't actually run the function `min` on the two lists until you ask for a value. This is known as lazy evaluation, and it's commonly used when dealing with big data. It keeps memory use low, because results are produced one at a time, even when the computation itself is complex.

Maps are iterable, just like lists and tuples, so we can use a `for` loop to look at all of the values in the map.

This passing around of functions and the data structures they should be applied to is a hallmark of functional programming, and it's very common in data analysis and cleaning.
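
The behaviour described above can be sketched by iterating over the map object from the previous cell. Note that a map object can only be consumed once. The names `first_pass` and `second_pass` are introduced here for illustration.

```python
store1_prices = [10.00, 11.00, 12.34, 2.34]
store2_prices = [9.00, 11.10, 12.34, 2.01]

# map() builds an iterator; min() has not been called yet at this point
cheapest = map(min, store1_prices, store2_prices)

# Each step of the loop computes one result on demand
first_pass = []
for price in cheapest:
    first_pass.append(price)
print(first_pass)    # [9.0, 11.0, 12.34, 2.01]

# The iterator is now exhausted, so a second pass yields nothing
second_pass = list(cheapest)
print(second_pass)   # []
```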

In [79]:
# Example: Create a function to split the title and last name of some lecturers from the course
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson', 'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']

def split_title_and_name(person):
    title = person.split()[0]
    lastname = person.split()[-1]
    return f'{title} {lastname}'

In [80]:
list(map(split_title_and_name, people))

['Dr. Brooks', 'Dr. Collins-Thompson', 'Dr. Vydiswaran', 'Dr. Romero']

### Lambda and List Comprehensions

`lambda` functions are Python's way of creating anonymous functions. These are the same as other functions, but they have no name. The intent is that they are simple or short-lived, so it's easier to write the function out in one line than to go to the trouble of creating a named function.

`lambda` syntax:

`lambda` + list of arguments + : + single expression
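
A minimal sketch of this syntax in use follows. A common place for a `lambda` is as a short key function passed to `sorted()`. The names `last_name` and `by_last_name` are chosen here for illustration.

```python
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']

# lambda <arguments> : <single expression>
last_name = lambda person: person.split()[-1]
print(last_name(people[0]))   # Brooks

# Used inline as the sort key, without ever giving the function a name
by_last_name = sorted(people, key=lambda person: person.split()[-1])
print(by_last_name)
```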

In [81]:
my_function = lambda a, b, c : a + b
my_function(1, 2, 3)

3

In [82]:
# An example using lambda
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson', 'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']

def split_title_and_name(person):
    return person.split()[0] + ' ' + person.split()[-1]

In [83]:
# Option 1
for person in people:
    print(split_title_and_name(person) == (lambda x: x.split()[0] + ' ' + x.split()[-1])(person))

True
True
True
True


In [86]:
# Option 2
list(map(split_title_and_name, people)) == list(map(lambda person: person.split()[0] + ' ' + person.split()[-1], people))

True

*List comprehensions* are a way to create collections or sequences using a more abbreviated syntax.

In [87]:
my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)

my_list

[0,
 2,
 4,
 6,
 8,
 10,
 12,
 14,
 16,
 18,
 20,
 22,
 24,
 26,
 28,
 30,
 32,
 34,
 36,
 38,
 40,
 42,
 44,
 46,
 48,
 50,
 52,
 54,
 56,
 58,
 60,
 62,
 64,
 66,
 68,
 70,
 72,
 74,
 76,
 78,
 80,
 82,
 84,
 86,
 88,
 90,
 92,
 94,
 96,
 98,
 100,
 102,
 104,
 106,
 108,
 110,
 112,
 114,
 116,
 118,
 120,
 122,
 124,
 126,
 128,
 130,
 132,
 134,
 136,
 138,
 140,
 142,
 144,
 146,
 148,
 150,
 152,
 154,
 156,
 158,
 160,
 162,
 164,
 166,
 168,
 170,
 172,
 174,
 176,
 178,
 180,
 182,
 184,
 186,
 188,
 190,
 192,
 194,
 196,
 198,
 200,
 202,
 204,
 206,
 208,
 210,
 212,
 214,
 216,
 218,
 220,
 222,
 224,
 226,
 228,
 230,
 232,
 234,
 236,
 238,
 240,
 242,
 244,
 246,
 248,
 250,
 252,
 254,
 256,
 258,
 260,
 262,
 264,
 266,
 268,
 270,
 272,
 274,
 276,
 278,
 280,
 282,
 284,
 286,
 288,
 290,
 292,
 294,
 296,
 298,
 300,
 302,
 304,
 306,
 308,
 310,
 312,
 314,
 316,
 318,
 320,
 322,
 324,
 326,
 328,
 330,
 332,
 334,
 336,
 338,
 340,
 342,
 344,
 346,
 348,
 350,

In [88]:
# With list comprehension
my_list = [number for number in range(0, 1000) if number % 2 == 0]
my_list

[0,
 2,
 4,
 6,
 8,
 10,
 12,
 14,
 16,
 18,
 20,
 22,
 24,
 26,
 28,
 30,
 32,
 34,
 36,
 38,
 40,
 42,
 44,
 46,
 48,
 50,
 52,
 54,
 56,
 58,
 60,
 62,
 64,
 66,
 68,
 70,
 72,
 74,
 76,
 78,
 80,
 82,
 84,
 86,
 88,
 90,
 92,
 94,
 96,
 98,
 100,
 102,
 104,
 106,
 108,
 110,
 112,
 114,
 116,
 118,
 120,
 122,
 124,
 126,
 128,
 130,
 132,
 134,
 136,
 138,
 140,
 142,
 144,
 146,
 148,
 150,
 152,
 154,
 156,
 158,
 160,
 162,
 164,
 166,
 168,
 170,
 172,
 174,
 176,
 178,
 180,
 182,
 184,
 186,
 188,
 190,
 192,
 194,
 196,
 198,
 200,
 202,
 204,
 206,
 208,
 210,
 212,
 214,
 216,
 218,
 220,
 222,
 224,
 226,
 228,
 230,
 232,
 234,
 236,
 238,
 240,
 242,
 244,
 246,
 248,
 250,
 252,
 254,
 256,
 258,
 260,
 262,
 264,
 266,
 268,
 270,
 272,
 274,
 276,
 278,
 280,
 282,
 284,
 286,
 288,
 290,
 292,
 294,
 296,
 298,
 300,
 302,
 304,
 306,
 308,
 310,
 312,
 314,
 316,
 318,
 320,
 322,
 324,
 326,
 328,
 330,
 332,
 334,
 336,
 338,
 340,
 342,
 344,
 346,
 348,
 350,

In [89]:
# Convert this function into a list comprehension
def times_tables():
    lst = []
    for i in range(10):
        for j in range(10):
            lst.append(i*j)
    return lst

times_tables() == [i * j for i in range(10) for j in range(10)]

True

A harder example:

Many organizations have user ids which are constrained in some way. Imagine you work at an internet service provider and the user ids are all two letters followed by two numbers (e.g. aa49). Your task at such an organization might be to hold a record on the billing activity for each possible user. 

Write an initialization line as a single list comprehension which creates a list of all possible user ids. Assume the letters are all lower case.

In [92]:
lowercase = 'abcdefghijklmnopqrstuvwxyz'
digits = '0123456789'

answer = [a+b+c+d for a in lowercase for b in lowercase for c in digits for d in digits]
answer[:50] # Display first 50 ids

['aa00',
 'aa01',
 'aa02',
 'aa03',
 'aa04',
 'aa05',
 'aa06',
 'aa07',
 'aa08',
 'aa09',
 'aa10',
 'aa11',
 'aa12',
 'aa13',
 'aa14',
 'aa15',
 'aa16',
 'aa17',
 'aa18',
 'aa19',
 'aa20',
 'aa21',
 'aa22',
 'aa23',
 'aa24',
 'aa25',
 'aa26',
 'aa27',
 'aa28',
 'aa29',
 'aa30',
 'aa31',
 'aa32',
 'aa33',
 'aa34',
 'aa35',
 'aa36',
 'aa37',
 'aa38',
 'aa39',
 'aa40',
 'aa41',
 'aa42',
 'aa43',
 'aa44',
 'aa45',
 'aa46',
 'aa47',
 'aa48',
 'aa49']
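
As an alternative sketch of the same exercise, `itertools.product` from the standard library generates the combinations without chaining four `for` clauses, and the character sets can come from the `string` module. The name `user_ids` is introduced here for illustration.

```python
from itertools import product
from string import ascii_lowercase, digits

# product() yields every (letter, letter, digit, digit) combination in order
user_ids = [''.join(parts)
            for parts in product(ascii_lowercase, ascii_lowercase, digits, digits)]

print(len(user_ids))    # 26 * 26 * 10 * 10 = 67600
print(user_ids[:3])     # ['aa00', 'aa01', 'aa02']
print(user_ids[-1])     # zz99
```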