# Welcome to Session 5 - Searching for Data in Data Structures and Matching Patterns in Data

Testing for the presence of data within a data object is an important skill. This is accomplished in a few key ways:
1) Test to see if a match exists within a simple object like a string.
2) Test to see if a match exists within a data structure like a list or dictionary.
3) Test to see if complex patterns exist within data, using Regular Expressions.

## The 'in' Operator

### Test for a Match in a String

In [None]:
def testMembership(mydata,teststring):
    if teststring in mydata:
        print('found it!')
    else:
        print('didn\'t find it') # Note: If I want to use an apostrophe in my text, I must "escape" it with \.


mytext = 'The most beautiful inshore fish is the Red drum.'

testMembership(mytext,'drum')
testMembership(mytext,'bass')


### Test for a Match in a List

In [None]:
# Same function as above

def testMembership(mydata,teststring):
    if teststring in mydata:
        print('found it!')
    else:
        print('didn\'t find it')
        
mylist = ['The most beautiful inshore fish is the Red drum.','Sheepshead is cool looking though!','I\'ll always be glad to see a Black seabass though.']

testMembership(mylist,'drum')
testMembership(mylist,'bass')

#### Question 1

What happened?

In [None]:
# Try this

def testMembership(mydata,teststring):
    if teststring in mydata:
        print('found it!')
    else:
        print('didn\'t find it')
        
mylist = ['The most beautiful inshore fish is the Red drum.','Sheepshead is cool looking though!','I\'ll always be glad to see a Black seabass though.','drum','bass']

testMembership(mylist,'drum')
testMembership(mylist,'bass')

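When testing for membership in a data structure, the 'in' operator looks for the test string as a whole member (element) of that structure.
1) The members of the list we used are themselves strings, so 'drum' is only found once it is added to the list as its own element.
2) If we want to search within those strings, we need to loop over the list and test each string in turn.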
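
The difference can be seen by comparing a membership test on a list with a membership test on one of its string elements. This is a minimal sketch; `sample_list` and the result variables are hypothetical names used only for illustration.

```python
# Hypothetical sample data for illustration only.
sample_list = ['Red drum', 'Black seabass']

# On a list, 'in' compares the test string against each whole element.
whole_element_match = 'Red drum' in sample_list   # an element equals 'Red drum'
partial_in_list = 'drum' in sample_list           # no element equals 'drum'

# On a string, 'in' checks for a substring.
partial_in_string = 'drum' in sample_list[0]      # 'drum' is inside 'Red drum'

print(whole_element_match, partial_in_list, partial_in_string)
```

The second test is the one that surprised us in Question 1: the list contains a string that contains 'drum', but no element that is exactly 'drum'.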

#### Activity 1

Using the code below, introduce a 'for' loop within the testMembership() function to iterate over the list and check for the test string in each list item.

Include a "found it"/"didn't find it" output message, but print it only once for the whole list, not once for every list item.
* Hint: Create a message variable with a default message and only change it if a match is found.

In [None]:
# Tackle Activity 1 here

def testMembership(mydata,teststring):

    
    
    
        
mylist = ['The most beautiful inshore fish is the Red drum.','Sheepshead is cool looking though!','I\'ll always be glad to see a Black seabass out there.']

testMembership(mylist,'drum')
testMembership(mylist,'bass')

### Test for a Match in a Dictionary

#### Activity 2

You already know how to iterate over a dictionary. Use that knowledge to complete the testMembership() function to check if the test string is found in the dictionary values.

In [None]:
# Tackle Activity 2 here

def testMembership(mydata,teststring):
    
    
    
        
mydict = {'line1':'The most beautiful inshore fish is the Red drum.','line2':'Sheepshead is cool looking though!','line3':'I\'ll always be glad to see a Black seabass though.'}

testMembership(mydict,'drum')
testMembership(mydict,'bass')

### Pattern Matching with Regular Expressions

Regular expressions, also known as regex, are powerful tools for matching multiple patterns in a body of text or data.

Python's regex module, re, must be imported before use.

In [None]:
import re

#### Search for a match using the re.search() function

In [None]:
mystring = 'XCGQGDBHFWDDDFHFHFGDDD'

found = re.search(r'DDD', mystring) # Search for matches of 'DDD' in mystring
print(found)
print(found.group()) # The group() method of the match object returns the text of the first match.

found = re.search(r'ZZZ', mystring) # Search for matches of 'ZZZ' in mystring
print(found)

When the FIRST match is found, a match object is returned.

When NO match is found, None is returned. None is a special value whose data type is NoneType.
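
Because re.search() can return either a match object or None, it helps to check the result before calling methods on it. The sketch below reuses the letters from mystring above; the names `sample`, `hit`, `miss`, and `error_name` are hypothetical and chosen only for this illustration.

```python
import re

sample = 'XCGQGDBHFWDDDFHFHFGDDD'  # same letters as mystring above

hit = re.search(r'DDD', sample)    # pattern present: returns a match object
miss = re.search(r'ZZZ', sample)   # pattern absent: returns None

# Matched text plus its 0-based start and end positions.
print(hit.group(), hit.start(), hit.end())
print(miss is None)

# Calling .group() on None fails, so test the result before using it.
try:
    miss.group()
    error_name = None
except AttributeError as err:
    error_name = type(err).__name__
print(error_name)
```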

Parameters of re functions

re.search(r'DDD', mystring)

r = optional, but highly recommended, raw string prefix. It stops Python from treating a backslash ('\') as the start of an escape sequence, so the pattern reaches the regex engine unchanged.
'DDD' = the regular expression pattern string specifying what must be matched. It is case-sensitive.
mystring = any variable that represents the string in which the search for the match is conducted.
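
The effect of the r prefix and of case sensitivity can be sketched as follows. This example assumes two things not covered above: the regex token `\b` (a word boundary) and the `re.IGNORECASE` flag, both part of Python's re module. The variable names are hypothetical.

```python
import re

# Without r, Python converts '\b' into a single backspace character.
plain_len = len('\b')    # one character (backspace)
raw_len = len(r'\b')     # two characters: a backslash and the letter b
print(plain_len, raw_len)

text = 'Red drum'  # hypothetical sample text

# With r, the regex engine receives \b and treats it as a word boundary.
with_raw = re.search(r'\bdrum', text)
# Without r, the pattern starts with a backspace character, which is not in the text.
without_raw = re.search('\bdrum', text)
print(with_raw is not None, without_raw is None)

# Patterns are case-sensitive unless a flag such as re.IGNORECASE is passed.
case_sensitive = re.search(r'red', text)
case_insensitive = re.search(r'red', text, re.IGNORECASE)
print(case_sensitive, case_insensitive.group())
```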

In [None]:
# Use re.search to test for a match object.

if re.search(r'DDD', mystring): # This is like saying "if this returns a match object..."
    print('found it')
else:
    print('didn\'t find it')

#### Get all matches using the re.findall() function

In [None]:
mystring = 'XCGQGDBHFWDDDFHFHFGDDD'

found = re.findall(r'DDD', mystring) # Search for matches of 'DDD' in mystring
print(found)

found = re.findall(r'ZZZ', mystring) # Search for matches of 'ZZZ' in mystring
print(found)

re.findall() always returns a list. If there are matches, they will be in the list. An empty list means no matches were found.

We can count the number of matches using Python's built-in len() function.

In [None]:
mystring = 'XCGQGDBHFWDDDFHFHFGDDD'

found = re.findall(r'DDD', mystring) # Search for matches of 'DDD' in mystring
print(found)
print(len(found))

found = re.findall(r'ZZZ', mystring) # Search for matches of 'ZZZ' in mystring
print(found)
print(len(found))
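
One detail worth keeping in mind when counting: re.findall() returns non-overlapping matches, scanning the string from left to right. The short sketch below uses a hypothetical string, `overlap_sample`, to show the effect.

```python
import re

overlap_sample = 'DDDD'  # hypothetical sample for illustration only

# 'DD' could start at positions 0, 1, or 2, but once a match is found
# the search resumes after it, so only two matches are reported.
pairs = re.findall(r'DD', overlap_sample)
print(pairs, len(pairs))
```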

#### Activity 3

Find and print all the matches for 'manage' in the SCDNR Marine Resources Division purpose statement.

In [None]:
# Tackle Activity 3 here

purpose = '''The Division of Marine Resources is responsible for the management and conservation of the state's marine and estuarine resources.
The division conducts monitoring and research on the state’s marine resources and makes recommendations for the management of those resources.
The division is headquartered at the Marine Resources Center on Charleston Harbor.
Fishery managers open and close marine fishing seasons, recommend size and catch limits for fish, track trends in abundance of marine species and review coastal development activities.
Through the use of permits and permit conditions the managers control the harvest of fish, shrimp, crabs and shellfish.
Marine Resources Division staff actively work with regional authorities such as the Atlantic States Marine Fisheries Commission and the South Atlantic Fishery Management Council
to ensure that marine fisheries are effectively and sustainable managed throughout their range.'''





#### Question 2

What happened? Is it what you expected?

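The regex takes things very literally. It matches every occurrence of exactly what we ask it to match, so each match is the text 'manage' itself, even where it is part of a longer word such as 'management'. Matching is also case-sensitive, so 'Management' is not matched.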

But what about when there are different variants possible, like manage, managed, manager, and management?

In [None]:
found = re.findall(r'manage[rdments]{0,4}', purpose)
print(found)
print(len(found))

Let's take a closer look at this regex:

manage[rdments]{0,4}

1) This statement could be verbalized as "find the text 'manage' followed by 0 to 4 characters in succession, each being one of the letters r, d, m, e, n, t, or s."
2) [rdments] is a character class: it matches any one of the listed letters. It is case-sensitive.
3) {0,4} specifies how many times the preceding component may occur: a minimum of 0 (making it optional) and a maximum of 4.
   * Several other characters can be used to specify the number of occurrences of a regex component:
       * \* (0 or more) e.g. manage[rdments]*
       * \+ (1 or more) e.g. manage[rdments]+
       * \? (0 or 1) e.g. manage[rdments]?
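
The quantifiers listed above can be compared side by side. This is a minimal sketch on a hypothetical sample string (not the purpose statement), with variable names chosen only for illustration.

```python
import re

# Hypothetical sample text for illustration only.
sample = 'manage managed manager managers management'

up_to_four = re.findall(r'manage[rdments]{0,4}', sample)
zero_or_more = re.findall(r'manage[rdments]*', sample)
one_or_more = re.findall(r'manage[rdments]+', sample)   # bare 'manage' no longer matches
zero_or_one = re.findall(r'manage[rdments]?', sample)   # at most one extra letter

print(up_to_four)
print(zero_or_more)
print(one_or_more)
print(zero_or_one)

# The character class does not know about real words: any mix of the
# listed letters is accepted, including nonsense endings.
nonsense = re.findall(r'manage[rdments]{0,4}', 'managettt')
print(nonsense)
```

Note that the `?` version cuts 'managers' down to 'manager' and 'management' down to 'managem', because it allows only one letter after 'manage'.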

#### Multiple Match Conditions to Account for Several Patterns

In standard genetic code, codons consist of a combination of three letters from A, C, G, and U.

Suppose we have a sequence and we want to identify all codons that have a first base of (begin with) A, a third base of (end with) C, and have either C, G, or U (but not A) as the second base (middle item).

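The possible matches are ACC, AGC, or AUC.

Here's a regex that matches all three forms:
r'A[^A]C'

Note that [^A] means "any single character except A", so the pattern is broader than the three codons listed. In the DNA sequence below, which uses T in place of U, it also matches ATC.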

In [None]:
import re

sequence = 'TGATCGTAGCTAGCTAGCTACGTATATATATAGCGCGTA'
matches = re.finditer(r'A[^A]C', sequence)   # looks for a sequence starting with "A" with the next base not being an "A" and the third and final base as a "C" 

# Note: we're introducing re.finditer() here. It returns an iterator of match objects.
# Each match object gives us the matched text via its group() method and the 0-based start position via its start() method.

for n in matches:
    print(f'Found: {n.group()} at start site: {n.start() + 1}')

# To print the values of variables or expressions within text, we put f before the string to make it a formatted string literal (f-string).
# The values of the expressions inside the curly braces are inserted into the string.
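
To see how a negated class such as [^A] differs from an explicit class, the sketch below compares the two on a hypothetical RNA-style string, `rna`, that deliberately includes a non-base character (X). The explicit class `[CGU]` is an alternative offered for illustration, not the pattern used above.

```python
import re

# Hypothetical sequence for illustration only; 'X' is not a real base.
rna = 'AUCAXCAGCAAC'

# [^A] accepts any middle character except A, including the stray X.
any_but_a = [m.group() for m in re.finditer(r'A[^A]C', rna)]

# [CGU] accepts only the three bases we actually want in the middle.
# start() is 0-based, so 1 is added to report a 1-based position.
only_cgu = [(m.group(), m.start() + 1) for m in re.finditer(r'A[CGU]C', rna)]

print(any_but_a)
print(only_cgu)
```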

[Pythex is a great tool for testing a regex and seeing what it does and doesn't match visually](https://pythex.org/)
* Paste your body of text in the "Your test string" box
* Design your regex in the "Your regular expression" box

Pythex also includes a handy cheat sheet (expand at the bottom of the screen) to help with designing a regex.

#### Activity 4

Use the Pythex website to create a regex to match the SC vessel registrations in the following text:

"The research was conducted by vessels SC-2243-GZ, SC-9629-BB, and SC-1035-DA in Saint Helena Sound, while Charleston Harbor was surveyed by the vessels SC-7478-KS and SC-1151-GG."


Note that vessel registrations all:
* begin with 'SC'
* then a hyphen
* then four digits
* then a hyphen
* then two uppercase letters

e.g. SC-2345-AF
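
As a warm-up, here is a sketch of how a description like this turns into a regex, using a different, made-up identifier format (the letters 'TAG', a hyphen, three digits, a hyphen, one uppercase letter). It assumes two regex components not introduced above, both listed on the Pythex cheat sheet: `\d` for a digit and `{3}` for "exactly three times". The sample text and names are hypothetical.

```python
import re

# Hypothetical text and identifier format for illustration only.
sample_text = 'Tags TAG-042-B and TAG-117-K were recovered; tag-9-z was unreadable.'

# TAG      literal letters (case-sensitive)
# -        literal hyphen
# \d{3}    exactly three digits
# [A-Z]    one uppercase letter
tags = re.findall(r'TAG-\d{3}-[A-Z]', sample_text)
print(tags)
```

The lowercase, short 'tag-9-z' is skipped because it fails the case, digit-count, and uppercase-letter requirements.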

In [None]:
# Tackle Activity 4 here
# Use your regex in the following code:

text = "The research was conducted by vessels SC-2243-GZ, SC-9629-BB, and SC-1035-DA in Saint Helena Sound, while Charleston Harbor was surveyed by the vessels SC-7478-KS and SC-1151-GG."
found = re.findall(r'[your regex goes here]', text) # Replace the placeholder inside the quotes with your regex

print(found)


### Quiz

[hyperlink to a quiz here. Perhaps we can use Google Forms quizzes with multi-choice questions to help solidify learning]

### Challenge

Challenge description [challenge is to consolidate and practice content learned during this session]

In [None]:
#Tackle the challenge here




### Resources

* [Pythex is a great tool for testing a regex and seeing what it does and doesn't match visually](https://pythex.org/)
* [Python documentation - Regular expression operations](https://docs.python.org/3/library/re.html)
* [W3Schools - Python RegEx](https://www.w3schools.com/python/python_regex.asp)