# Review: Lists!

In [1]:
x = [5, 10, 15, 20, 25, 30]

In [2]:
x[3]

20

In [3]:
# Note that we don't HAVE to have a variable next to our index. 
# The following code will also work:

[2, 4, 6, 8, 10][4]

10

**Basically, we need to have something that evaluates to a list to the left of our index.**
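A minimal sketch of that idea: each expression to the left of the brackets is evaluated to a list first, and only then is the index applied. The variable names here are illustrative and not part of the notes.

```python
# Each expression to the left of the brackets evaluates to a list first,
# and only then is the index applied.
smallest = sorted([17, -4, 8])[0]     # sorted() returns a list
fourth = list(range(10, 20))[3]       # list() returns a list
letter = list("emu")[-1]              # a list of characters

print(smallest)   # -4
print(fourth)     # 13
print(letter)     # u
```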

In [4]:
# Error:
x[90]

IndexError: list index out of range

## Common functions 


+ `type()`
+ `len()`
+ `max()`
+ `min()`
+ `sum()`
+ `sorted()`
+ `range()`
+ `list()`

In [5]:
# Use type() to troubleshoot errors!

type(x)

list

In [6]:
# how many items are inside your list

len(x)

6

In [7]:
# A list with a single element:
len([10])

1

In [8]:
type([10])

list

In [9]:
# len() function always returns an integer! 

len([])

0

In [10]:
[]

[]

In [11]:
type([])

list

In [12]:
max(x)

30

In [13]:
min(x)

5

In [14]:
sum(x)

105

In [15]:
# returns a list sorted in numerical or alphabetical order

sorted(x)

[5, 10, 15, 20, 25, 30]

In [16]:
sorted([17, -4, 1004, 3, 15, 8.3])

[-4, 3, 8.3, 15, 17, 1004]

In [17]:
sorted(["badger", "crocodile", "aardvark", "zebra", "emu"])

['aardvark', 'badger', 'crocodile', 'emu', 'zebra']

In [18]:
range(10)

range(0, 10)

In [19]:
# list() converts your range object into a list.
# On its own, range() produces its numbers lazily, one at a time (for example, in a for loop).

list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [20]:
# You can also use list() to break up a string into its component characters

list("this is a test")

['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't']

## Indices

In [21]:
x[-1]

30

In [22]:
x[-2]

25

In [24]:
# our usual
x[3]

20

In [28]:
x[-0]

5

In [29]:
x[-90]

IndexError: list index out of range
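One way to read negative indices, sketched with the same `x`: index `-k` points at the same item as index `len(x) - k`. This is also why `x[-0]` is the first item rather than the last, since `-0` is just `0`.

```python
x = [5, 10, 15, 20, 25, 30]

# A negative index -k refers to the same item as len(x) - k.
for k in range(1, len(x) + 1):
    assert x[-k] == x[len(x) - k]

# -0 is just 0, so x[-0] is the FIRST item, not the last.
print(x[-0])            # 5
print(x[-1])            # 30
print(x[len(x) - 1])    # 30
```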

In [27]:
n = 1 + 2

# You can put any expression in the index position, as long as it evaluates to an integer.
# To get the element at index n:

x[n]

20

In [26]:
x[2*2]

25

## List Slices

In [30]:
# From index 1, up to (but not including) index 4 

x[1:4]

[10, 15, 20]

In [32]:
# You can also use variables that evaluate to integers in this syntax as well. 

n = 2
x[n:n+3]

[15, 20, 25]

In [33]:
# Note that you have to put the indexes in order, smallest number to the highest number

x[-3:-1]

[20, 25]

In [34]:
x[4:2]

[]

In [35]:
# Note that this DOES NOT give you an index error
# it just gives you all that it can of the list

x[3:9000]

[20, 25, 30]

In [36]:
# Important difference between list indexes and list slices!
# A list slice is always a list,
# but a list index returns whatever type the item at that position has.

type(x[1:4])

list
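A short sketch contrasting the two behaviors noted above, using the same `x`: slices are clipped to the list and always come back as lists, while a single index gives back the item itself and fails when it is out of range.

```python
x = [5, 10, 15, 20, 25, 30]

# A slice is clipped to the ends of the list, so it never raises IndexError.
print(x[3:9000])        # [20, 25, 30]
print(x[9000:])         # []

# A slice is always a list, even when it holds a single item.
print(type(x[1:2]))     # <class 'list'>
print(type(x[1]))       # <class 'int'>

# A single out-of-range index does raise.
try:
    x[9000]
except IndexError as err:
    print("IndexError:", err)   # IndexError: list index out of range
```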

In [37]:
for item in x[2:5]:
    print(item)

15
20
25


In [38]:
# Equivalent to the expression below
# First 4 elements of the list

x[0:4]

[5, 10, 15, 20]

In [39]:
# Equivalent to the expression above

x[:4]

[5, 10, 15, 20]

In [40]:
# Everything from index 4 (the 5th element) up to the end of the list

x[4:]

[25, 30]

In [41]:
# Last three items in the list 
x[-3:]

[20, 25, 30]

In [42]:
# Everything but the last element

x[:-1] 

[5, 10, 15, 20, 25]

In [43]:
# Everything but the first element

x[1:]

[10, 15, 20, 25, 30]

# List comprehensions

+ You want to perform some transformation on the entire list in order to get another list
+ OR, you want to filter some list and end up with a filtered list
+ Start with a list and end with another list! 



In [44]:
x

[5, 10, 15, 20, 25, 30]

In [52]:
source = [3, -1, 4, -2, 5, -3, 6]
dest = []

for item in source: # loop through each item in our list named "source"
    if item > 0: # FILTERS: if that item is greater than zero, then:
        dest.append(item * item) # TRANSFORMS: square the item, then add it to our "dest" list

dest

[9, 16, 25, 36]

In [54]:
ark = ["aardvark", "badger", "crocodile", "dingo", "emu", "flamingo"]
zoo = []

for item in ark:
    
    # we didn't transform anything, but we filtered our list! 
    if len(item) <= 6:
        zoo.append(item) 
        
zoo

['badger', 'dingo', 'emu']

This is a very common computational task! So common that we want a shorter way to write it. Python has a bit of syntax called a **list comprehension** that does exactly this.

In [56]:
[item * item for item in source if item > 0]

[9, 16, 25, 36]

In [61]:
[item for item in ark if len(item) <= 6]

['badger', 'dingo', 'emu']

**LIST COMPREHENSION STRUCTURE**
    
    [PREDICATE_EXPRESSION **for** TEMPORARY_VARIABLE **in** SOURCE_LIST **if** MEMBERSHIP_EXPRESSION]
    
+ The **membership expression** is the condition an item from your SOURCE list must satisfy to be included in your new list (this is the filter)
+ The **predicate expression** is the transformation applied to each included item; its result is what ends up in your new list
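To see how the pieces line up, here is a small sketch that writes the same filter-and-transform twice, once as a loop and once as a comprehension. The doubling transformation and the names `doubled` and `all_doubled` are just for illustration.

```python
source = [3, -1, 4, -2, 5, -3, 6]

# Loop form: filter first, then transform, then append.
dest = []
for item in source:
    if item > 0:                  # the condition after `if`
        dest.append(item * 2)     # the expression before `for`

# Comprehension form: same pieces, rearranged onto one line.
doubled = [item * 2 for item in source if item > 0]

# The `if` part is optional; without it every item is kept.
all_doubled = [item * 2 for item in source]

print(dest)          # [6, 8, 10, 12]
print(doubled)       # [6, 8, 10, 12]
print(all_doubled)   # [6, -2, 8, -4, 10, -6, 12]
```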

In [62]:
# what i want is: [2, 7, 12, 17, 22, 27]
    
stuff = []
for item in x:
    stuff.append(item - 3)
    
stuff

[2, 7, 12, 17, 22, 27]

In [63]:
[item - 3 for item in x]

# 'item - 3' is our predicate expression

[2, 7, 12, 17, 22, 27]

In [64]:
source = [3, -1, 4, -2, 5, -3, 6]

In [65]:
# modulo operator: % 

60 % 5

0

In [66]:
60 % 7

4

In [67]:
# Check if a number is even
60 % 2

0

In [68]:
5 % 2

1

In [69]:
[item * item for item in range(10) if item % 2 == 0]

[0, 4, 16, 36, 64]

In [70]:
dest = []
for i in range(10):
    if i % 2 == 0:
        dest.append(i*i)
        
dest        

[0, 4, 16, 36, 64]

In [71]:
# In Python, any integer other than 0 counts as True in a condition.
# `item % 2` is 1 (truthy) for odd numbers and 0 (falsy) for even ones, so this keeps the odd items.

[item * item for item in range(10) if item % 2]

[1, 9, 25, 49, 81]
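A quick sketch of that truthiness rule, using `bool()` to show how a value is treated inside an `if`. Writing the comparison out explicitly selects the same items and is arguably easier to read.

```python
# bool() shows how Python treats a value when it appears in an `if`.
print(bool(0))    # False
print(bool(1))    # True
print(bool(-7))   # True

# The implicit test and the explicit comparison select the same items.
implicit = [n for n in range(10) if n % 2]
explicit = [n for n in range(10) if n % 2 != 0]
print(implicit)              # [1, 3, 5, 7, 9]
print(implicit == explicit)  # True
```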

### A slightly more practical example:

In [72]:
rawdata = "2,3,5,7,11,13,17,19,23"

In [73]:
sum(rawdata)

# sum() starts from 0 and tries 0 + '2', which fails: it needs numbers, not the characters of one long string!

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [74]:
list(rawdata)

# Yikes, this returns one item per character

['2',
 ',',
 '3',
 ',',
 '5',
 ',',
 '7',
 ',',
 '1',
 '1',
 ',',
 '1',
 '3',
 ',',
 '1',
 '7',
 ',',
 '1',
 '9',
 ',',
 '2',
 '3']

In [76]:
values = rawdata.split(',')

values

['2', '3', '5', '7', '11', '13', '17', '19', '23']

In [77]:
sum(values)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [78]:
srcstr = "1237"
type(srcstr)

str

In [79]:
# int() : put a string into it, get an integer out of it 

converted = int(srcstr)
type(converted)

int

In [80]:
int(values)

# int() expects strings, not a list! 

TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

In [82]:
values

['2', '3', '5', '7', '11', '13', '17', '19', '23']

In [81]:
# Converts all of our strings into integers:

[int(value) for value in values]

[2, 3, 5, 7, 11, 13, 17, 19, 23]

In [85]:
# Note that [int(value) for value in values] evaluates to a list. 
# You may want to save your list comprehension to a variable

num_values = [int(value) for value in values]
num_values

[2, 3, 5, 7, 11, 13, 17, 19, 23]

In [86]:
sum(num_values)

100

In [87]:
# More concise way of doing the last two cells

sum([int(value) for value in values])

100
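The split, convert, and sum steps can be bundled into one small function. This is only a sketch: the name `sum_csv_numbers` is made up, and it assumes the fields are non-negative whole numbers. Anything else (stray words, empty fields, negative numbers) is skipped by the `.isdigit()` filter, a string check covered further down in these notes.

```python
def sum_csv_numbers(text):
    """Sum the whole-number fields in a comma-separated string."""
    # Split on commas and trim stray spaces around each field.
    fields = [field.strip() for field in text.split(",")]
    # Keep only fields made entirely of digits, then convert and add them up.
    return sum([int(field) for field in fields if field.isdigit()])

print(sum_csv_numbers("2,3,5,7,11,13,17,19,23"))  # 100
print(sum_csv_numbers("2, 3, oops, 5"))           # 10
```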

In [88]:
stuff = "hello there how are you"

In [89]:
stuff.split(" ")

['hello', 'there', 'how', 'are', 'you']

In [90]:
# 'e' is almost never the delimiter.. but look! 

stuff.split("e")

['h', 'llo th', 'r', ' how ar', ' you']

In [91]:
# Cool concrete poem allison just made

for i in stuff.split("e"):
    print(i)

h
llo th
r
 how ar
 you


# String operations

### the `in` operator

In [92]:
# `in` is an operator, like < or >: the whole expression evaluates to True or False

"foo" in "buffoon" 

True

In [93]:
"foo" in "reginald"

False

In [94]:
ark

['aardvark', 'badger', 'crocodile', 'dingo', 'emu', 'flamingo']

In [95]:
# The `in` in the second part of our list expression is an OPERATOR
[animal for animal in ark if "a" in animal]

['aardvark', 'badger', 'flamingo']

In [96]:
# Similar to:

for animal in ark:
    if "a" in animal:
        print(animal)

aardvark
badger
flamingo


In [97]:
# Same as:
dest = []

for animal in ark:
    if "a" in animal:
        dest.append(animal)
        
dest

['aardvark', 'badger', 'flamingo']

### `STRING.startswith()` method

In [98]:
check = "foodie"

In [99]:
check.startswith("foo")

True

In [100]:
check.startswith("f")

True

In [101]:
check.startswith("blah")

False

In [102]:
# Case-sensitive!

check.startswith("Foo")

False

In [103]:
# Returns True only if every character in the string is a digit

check.isdigit()

False

In [107]:
number_str = "12345"

In [105]:
number_str.isdigit()

True

In [108]:
# False, if there are ANY other characters

number_str = "12,345"
number_str.isdigit()

False

In [109]:
# .islower(), .isupper()

check.islower()

True

In [110]:
check.isupper()

False

In [111]:
yelling = "I LIKE THINGS AND THEY ARE GOOD"

In [112]:
yelling.isupper()

True

**Built-in Python string methods:** [check 'em out](https://docs.python.org/2/library/stdtypes.html#string-methods). 

## Finding substrings

In [113]:
src = "Now is the winter of our discontent"

In [114]:
# Returns the starting index of our source string where the substring is found

src.find("win")

11

In [115]:
# Returns -1 if the substring is not found

src.find("lose")

-1

In [120]:
# We might have expected 0 to represent "FALSE", but 0 is a valid index position. 

src.find("N")

0

In [116]:
# Parsing our string! 
# Everything from the location of our substring to the end of the string

location = src.find("win")
src[location:] # remember that location is an integer; it's like you're saying: src[11:]

'winter of our discontent'

In [122]:
location = src.find("o")

if location != -1:
    print(src[location:location+2])

ow


In [123]:
location = src.find("z")

if location != -1:
    print(src[location:location+2])
    
# Nothing prints because there is no 'z' in our string
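The `-1` check matters because `-1` is also a perfectly valid position (the last character), so Python will not complain if you slice with it. A sketch that wraps the check in a helper; the name `two_chars_from` is invented for this example.

```python
src = "Now is the winter of our discontent"

def two_chars_from(text, sub):
    """Return the two characters starting where `sub` begins, or None if absent."""
    location = text.find(sub)
    if location == -1:      # -1 means "not found", NOT "last character"
        return None
    return text[location:location + 2]

print(two_chars_from(src, "o"))   # ow
print(two_chars_from(src, "z"))   # None

# Without the check, -1 is silently used as a slice position:
print(repr(src[-1:1]))            # ''
```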

In [125]:
# Without the -1 check, -1 is used as a slice position: src[-1:1] is an empty string
location = src.find("z")
print(location)
print(src[location:location+2])

-1



In [126]:
src

'Now is the winter of our discontent'

In [127]:
# Number of instances that the substring 'is' occurs in our string

src.count('is')

2

In [129]:
for vowel in ['a', 'e', 'i', 'o', 'u']:
    print(vowel, src.count(vowel))

a 0
e 3
i 3
o 4
u 1


In [130]:
src2 = "Someone tell me where the poetry is."
src3 = "Is this really all the poetry you have?"

# Note that .count(' is ') would return 0 on both src2 and src3 (if we were trying to isolate the word "is")

In [131]:
# This does NOT work how we would want it to:
# the `or` chain evaluates to its first non-empty string, "is", so this is just src.count("is")
src.count("is" or " is " or " is" or "is ")

2

In [133]:
# Instead, 

my_is_patterns = ['is', ' is ', 'is ', ' is, ', 'Is']
    
count = 0
for item in my_is_patterns:
    print(item, src.count(item))
    count += src.count(item)
    
# But note that this double counts some!

is 2
 is  1
is  1
 is,  0
Is 0
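One way to count the word "is" without double counting is to split the text into words first and compare whole words. This is a sketch under two assumptions: words are separated by whitespace, and stripping ASCII punctuation from each end of a word is good enough. The helper name `count_word` is made up for this example.

```python
import string

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # Split on whitespace, then trim punctuation and lowercase each token.
    cleaned = [token.strip(string.punctuation).lower() for token in text.split()]
    return cleaned.count(word.lower())

src = "Now is the winter of our discontent"
src2 = "Someone tell me where the poetry is."
src3 = "Is this really all the poetry you have?"

print(count_word(src, "is"))    # 1
print(count_word(src2, "is"))   # 1
print(count_word(src3, "is"))   # 1
```

Here `list.count()` compares whole items, so "this" and "discontent" no longer match.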


### Quick review: string indices and slices

In [117]:
message = "bungalow"
message[0]

'b'

In [118]:
message[-1]

'w'

In [119]:
message[3:-2]

'gal'

## String transformation

Review of strings: 

+ checks: returns boolean 
+ finds/count: returns integer
+ transformation: returns strings

In [134]:
comment = "ARGUMENTATION! DISAGREEMENT! STRIFE!"

In [135]:
comment.lower()

'argumentation! disagreement! strife!'

In [136]:
comment

'ARGUMENTATION! DISAGREEMENT! STRIFE!'

In [137]:
message = 'e.e. cummings is not happy about this'

In [138]:
message.upper()

'E.E. CUMMINGS IS NOT HAPPY ABOUT THIS'

In [139]:
str1 = "dog"
str2 = "Dog"

In [140]:
str1 == str2

False

In [141]:
# If you apply .lower() on both of our strings, we can treat them the same
str1.lower() == str2.lower()

True

In [142]:
movie = "dr. strangelove, or, how I learned to love the bomb"

In [143]:
movie.title()

'Dr. Strangelove, Or, How I Learned To Love The Bomb'
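`.title()` worked well there, but it capitalizes the first letter after every non-letter character, apostrophes included. Below is a sketch of that pitfall and one standard-library alternative, `string.capwords()`. The example title is made up for illustration.

```python
import string

title = "what's eating gilbert grape"   # hypothetical example title

# str.title() starts a new "word" after any non-letter, including an apostrophe.
print(title.title())            # What'S Eating Gilbert Grape

# string.capwords() splits on whitespace only, then capitalizes each piece.
print(string.capwords(title))   # What's Eating Gilbert Grape
```

Note that `string.capwords()` also lowercases the rest of each word and collapses runs of whitespace, so it is not a drop-in replacement in every case.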

In [144]:
movie2 = "rosemary's baby"
movie2.title()

# Oh no.... they capitalized the S. 

"Rosemary'S Baby"

In [145]:
rawtext = "    weird extra spaces before and after    "
rawtext.strip()

'weird extra spaces before and after'

In [148]:
line = "hello there this is a line of text\n"
print(line)

hello there this is a line of text



In [149]:
# Gets rid of the new line character '\n'

print(line.strip())

hello there this is a line of text
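`.strip()` has two one-sided relatives, `.lstrip()` and `.rstrip()`, and all three accept an optional set of characters to remove. A small sketch, using `repr()` so the leftover whitespace is visible; the sample strings are illustrative.

```python
line = "   hello there\n"

print(repr(line.strip()))    # 'hello there'
print(repr(line.rstrip()))   # '   hello there'
print(repr(line.lstrip()))   # 'hello there\n'

# Passing characters strips only those characters from the ends.
print("...wait...".strip("."))   # wait
```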


In [150]:
song = "I got rhythm, I got music, I got my man, who could ask for anything more"

In [151]:
song.replace("I got", "I used to have")

'I used to have rhythm, I used to have music, I used to have my man, who could ask for anything more'

In [153]:
# Original string is unchanged 

song

'I got rhythm, I got music, I got my man, who could ask for anything more'

In [154]:
rawdata = "Get data that<br>looks like this<br>because it was<br>too much trouble"

In [155]:
print(rawdata)

Get data that<br>looks like this<br>because it was<br>too much trouble


In [157]:
print(rawdata.replace("<br>", "\n"))

Get data that
looks like this
because it was
too much trouble


In [158]:
step1 = song.replace("I got", "I used to have")
print(step1)

step2 = step1.replace("more", "less")
print(step2)

I used to have rhythm, I used to have music, I used to have my man, who could ask for anything more
I used to have rhythm, I used to have music, I used to have my man, who could ask for anything less


In [159]:
# Chaining methods! 

song.replace("I got", "someday I will have").replace("anything more", "a future more bright")

'someday I will have rhythm, someday I will have music, someday I will have my man, who could ask for a future more bright'

In [161]:
# More chaining! 
song.replace("I got", "someday I will have").replace("anything more", "a future more bright").upper() + " STOP"

'SOMEDAY I WILL HAVE RHYTHM, SOMEDAY I WILL HAVE MUSIC, SOMEDAY I WILL HAVE MY MAN, WHO COULD ASK FOR A FUTURE MORE BRIGHT STOP'
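Chaining works because each string method returns a new string, and the next method is called on that result. A sketch with a shortened version of the lyric, which also shows that the order of the chain matters:

```python
song = "I got rhythm, I got music"

# Step by step, saving each intermediate string...
step1 = song.replace("I got", "I used to have")
step2 = step1.upper()

# ...or chained: each method is called on the string returned by the previous one.
chained = song.replace("I got", "I used to have").upper()
print(chained == step2)   # True

# Order matters: after .upper() there is no lowercase "I got" left to replace.
reordered = song.upper().replace("I got", "I used to have")
print(reordered)          # I GOT RHYTHM, I GOT MUSIC
```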

In [166]:
telegram = song.replace("I got", "someday I will have").replace(
    "anything more", "a future more bright").replace(
    ',',' STOP').upper() + " STOP"

In [167]:
print("original:", song)
print("much improved:", telegram)

original: I got rhythm, I got music, I got my man, who could ask for anything more
much improved: SOMEDAY I WILL HAVE RHYTHM STOP SOMEDAY I WILL HAVE MUSIC STOP SOMEDAY I WILL HAVE MY MAN STOP WHO COULD ASK FOR A FUTURE MORE BRIGHT STOP


## REGULAR EXPRESSIONS: A difficult problem

Earlier, we tried to find the word "is", but ran into some trouble: we had to create a list of all the ways that "is" might appear in our string. The solution to this problem lies in **regular expressions**.

In [170]:
input_str = "Yes, my zip code is 12345. I heard that Gary's zip code is 23456. But 212 is not a zip code."

In [171]:
print(input_str)

Yes, my zip code is 12345. I heard that Gary's zip code is 23456. But 212 is not a zip code.


### Our task: we want a list of strings that represent all the zipcodes in this text

In [173]:
# How we might solve this problem without regular expressions

current = ""
zips = []

for ch in input_str:
    if ch.isdigit():
        current += ch # if it is a digit, then it saves the character to our variable
    else: 
        current = "" # if it isn't a digit, then it resets our variable to an empty string
    if len(current) == 5: # if we've saved a series of 5 numbers in 'current', then add 'current' to our list
        zips.append(current)
        current = ""
        
zips

['12345', '23456']

In [174]:
# Issue with this code: 

input_str = "Yes, my zip code is 12345. I heard that Gary's zip code is 23456. But 1234523456 is not a zip code."

current = ""
zips = []

for ch in input_str:
    if ch.isdigit():
        current += ch
    else: 
        current = ""
    if len(current) == 5:
        zips.append(current)
        current = ""
        
zips

# 1234523456 is counted as two zip codes, but it's not a zip code at all

['12345', '23456', '12345', '23456']
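The loop can be repaired by waiting until a run of digits has ended before deciding whether to keep it. This is one possible fix, not the only one; the function name `find_zips` and the trailing-space trick are choices made for this sketch.

```python
def find_zips(text):
    """Return every run of exactly five consecutive digits in `text`."""
    zips = []
    current = ""
    for ch in text + " ":          # trailing space flushes a run at the very end
        if ch.isdigit():
            current += ch          # keep extending the current run of digits
        else:
            if len(current) == 5:  # the run just ended: keep it only if it is 5 long
                zips.append(current)
            current = ""
    return zips

input_str = ("Yes, my zip code is 12345. I heard that Gary's zip code is 23456. "
             "But 1234523456 is not a zip code.")
print(find_zips(input_str))   # ['12345', '23456']
```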

Regular expressions are a "sub-language" supported by many programming languages. Python has a built-in library for them called `re`.

In [175]:
import re
zips = re.findall(r"\d{5}", input_str)

# \d : any digit
# {5} : exactly 5 of the preceding item in a row
# Note: this pattern still picks up both halves of 1234523456

zips

['12345', '23456', '12345', '23456']

Basically, you're writing short little templates of the kind of text you'd like to find. 
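The pattern above still matches both halves of `1234523456`. One common way to tighten the template is to add `\b`, which matches a word boundary, on each side. This is a sketch rather than a complete zip code matcher: `\b` treats letters and underscores as part of a word too, so five digits glued to a letter would not match.

```python
import re

input_str = ("Yes, my zip code is 12345. I heard that Gary's zip code is 23456. "
             "But 1234523456 is not a zip code.")

# \b matches a word boundary, so the five digits cannot be part of a longer run.
zips = re.findall(r"\b\d{5}\b", input_str)
print(zips)   # ['12345', '23456']
```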

More regular expressions to come on Thursday! 