# Welcome to Session 5 - Searching for Data in Data Structures and Matching Patterns in Data

Testing for the presence of data within a data object is an important skill. This is accomplished in a few key ways:
1) Test to see if a match exists within a simple object like a string, integer, etc.
2) Test to see if a match exists within a data structure like a list or dictionary.
3) Test to see if complex patterns exist within data, using Regular Expressions.

## The 'In' Operator

We already looked at some simple operators used to test conditions (==, !=, <, >). These are usually used in conditional statements (if...elif...else).

The 'in' operator tests for the presence of:
* a substring in a string, e.g. "is 'drum' in 'Red drum'?"
* membership (presence) of something in a data structure, e.g. "is 'drum' in the list ['red drum','black drum','black seabass']?"
and it returns True or False.
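
As a quick illustration of both uses, here is a minimal sketch. The `species` list and the variable names are made up for this example and are not part of the session data. Note that the comparison is case-sensitive, and that `not in` reverses the test.

```python
# Substring test: is one string found inside another?
name = 'Red drum'
has_drum = 'drum' in name            # True
has_capital_drum = 'Drum' in name    # False: the comparison is case-sensitive

# Membership test: is a value one of the items in a list?
species = ['red drum', 'black drum', 'black seabass']  # hypothetical example list
has_black_drum = 'black drum' in species      # True: it is an item of the list
missing_flounder = 'flounder' not in species  # True: 'not in' reverses the test

print(has_drum, has_capital_drum, has_black_drum, missing_flounder)
```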

### Test for a Match in a String

In [None]:
def testMembership(dna,codon):
    if codon in dna:
        print('found it!')
    else:
        print('didn\'t find it') # Note: If I want to use an apostrophe in my text, I must "escape" it with \.


dna = 'TTTGTGATACCGAATCTCTGTCTCCCGCTAGTCAACCTTGATACGCTCTT'

testMembership(dna,'AAA')
testMembership(dna,'ACG')


### Test for a Match in a list

In [None]:
def testMembership(dnalist,codon):
    if codon in dnalist:
        print('found it!')
    else:
        print('didn\'t find it')

dnalist = ['TTTGTGATACCGAATCTCTGTCTCCCGCTAGTCAACCTTGATACGCTCTT',
          'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA',
          'CCCCCCCGTTAGTATGGCAGCCCGTAACGCCCGGGGCACATCCGTTCACA']


testMembership(dnalist,'AAA')
testMembership(dnalist,'ACG')

#### Question 1

What happened?

In [None]:
# Try this

def testMembership(dnalist,codon):
    if codon in dnalist:
        print('found it!')
    else:
        print('didn\'t find it')

dnalist = ['TTTGTGATACCGAATCTCTGTCTCCCGCTAGTCAACCTTGATACGCTCTT',
           'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA',
           'CCCCCCCGTTAGTATGGCAGCCCGTAACGCCCGGGGCACATCCGTTCACA',
           'AAA',
           'ACG']

testMembership(dnalist,'AAA')
testMembership(dnalist,'ACG')

When testing for membership in a data structure, the 'in' operator is looking for the test string AS a member.
1) The members of the list we used are themselves strings
2) If we want to search within those strings, we need to create a loop to do so.
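
The difference can be seen without a loop by testing a single list item directly. This is a small sketch using a made-up `species` list rather than the DNA data, so it does not solve the activity that follows.

```python
species = ['red drum', 'black drum', 'black seabass']  # hypothetical example list

whole_list = 'drum' in species       # False: no list item is exactly 'drum'
exact_item = 'red drum' in species   # True: 'red drum' is an item of the list
first_item = 'drum' in species[0]    # True: 'drum' is a substring of 'red drum'

print(whole_list, exact_item, first_item)
```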

#### Activity 1

Using the code below, introduce a 'for' loop within the testMembership() function to iterate over the list and check for the ACG codon in each list item.

Include a "found it"/"didn't find it" output message, but only print it once for the whole list; not once for every list item.
* Hint: Create a message variable with a default message and only change it if a match is found.

When you're done with Activity 1, please use the [Miro Board](https://miro.com/app/board/uXjVNCUJ0JI=/) to indicate completion in the area for this session and this activity.

In [None]:
# Tackle Activity 1 here

dnalist = ['TTTGTGATACCGAATCTCTGTCTCCCGCTAGTCAACCTTGATACGCTCTT',
           'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA',
           'CCCCCCCGTTAGTATGGCAGCCCGTAACGCCCGGGGCACATCCGTTCACA']

def testMembership(dnalist,codon):




### Test for a Match in a Dictionary

The 'if...in' technique when applied to a dictionary will by default search for exact matches within the dictionary keys only. To search for exact matches in the dictionary values, specify [dictionaryname].values().

To search for partial matches (substrings) within the keys or values of the dictionary, iterate over the dictionary keys or values and test each one.
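
The exact-match behaviour can be sketched as follows. The `stations` dictionary is a made-up example, not session data, and the sketch deliberately stops short of iterating so that Activity 2 remains an exercise.

```python
# Hypothetical dictionary of station codes and names
stations = {'CH01': 'Charleston Harbor', 'SH02': 'Saint Helena Sound'}

key_match = 'CH01' in stations                          # True: keys are searched by default
value_as_key = 'Charleston Harbor' in stations          # False: values are not searched by default
value_match = 'Charleston Harbor' in stations.values()  # True: exact match against a value
partial_value = 'Harbor' in stations.values()           # False: only exact matches count

print(key_match, value_as_key, value_match, partial_value)
```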

#### Activity 2

You already know how to iterate over a dictionary. Use that knowledge to complete the testMembership() function to check if the test string is found in the dictionary values.

In [None]:
# Tackle Activity 2 here

dnadict = {'seq1':'TTTGTGATACCGAATCTCTGTCTCCCGCTAGTCAACCTTGATACGCTCTT',
           'seq2':'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA',
           'seq3':'CCCCCCCGTTAGTATGGCAGCCCGTAACGCCCGGGGCACATCCGTTCACA'}

#def testMembership(dnadict,codon):





### Pattern Matching with Regular Expressions

Regular Expressions, AKA Regex, are powerful tools for matching multiple patterns in a body of text or data.

The Python Regex module must be imported.

In [None]:
import re

#### Search for a match using the re.search() function

Note: Regular expression searching is case-sensitive. We must account for upper and lower cases if they are likely to be present.

In [None]:
dna = 'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA'

found = re.search(r'GTT', dna) # Search for matches of 'GTT' in dna
print(found)
print(found.group()) # found.group() returns the text of the first match.

found = re.search(r'ZZZ', dna) # Search for matches of 'ZZZ' in dna
print(found)

When the FIRST match is found, a match object is returned.

When NO match is found, None is returned. None is a special value whose data type is NoneType.

Parameters of re functions

re.search(r'DDD', mystring)

r = optional, but highly recommended raw string designator. This prevents interpretation of characters like '\' that have special meaning in Python.
'DDD' = The regular expression pattern string specifying what must be matched. It is case-sensitive.
mystring = any variable that represents the string in which the search for the match is conducted.
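
To see what the raw string designator changes, the sketch below compares a plain string with a raw string. The `\d` token (any digit) is standard regex syntax used here only for illustration; it is not otherwise covered in this session, and the sample text is made up.

```python
import re

plain = '\n'   # Python interprets this as ONE character: a newline
raw = r'\n'    # the raw string keeps TWO characters: a backslash and the letter n
print(len(plain), len(raw))

# In a regex, \d means "any digit". The raw string passes the backslash
# through to the re module untouched.
found = re.search(r'\d+', 'Station 42')
print(found.group())
```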

#### With re.search() we cannot use match object methods if a match was not found!

In [None]:
# Use re.search to test for a match object.

dna = 'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA'

found = re.search(r'ttt', dna) # Search for matches of 'ttt' in dna. The search is case-sensitive, so nothing is found.
print(found)
print(found.group()) # This raises an AttributeError because found is None, and None has no group() method.

In [None]:
# Use re.search combined with an if...else statement to test for a match object.

if re.search(r'TTT', dna): # This is like saying "if this returns a match object..."
    print('found it')
else:
    print('didn\'t find it')
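
If upper and lower case are both likely to be present, two common options are a character set for each position or the `re.IGNORECASE` flag from the standard `re` module. The sketch below is one possible way to combine either option with the if...else check; the variable names are illustrative.

```python
import re

dna = 'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA'

# Option 1: a character set that accepts either case at each position
found_set = re.search(r'[Tt][Tt][Tt]', dna)

# Option 2: the re.IGNORECASE flag makes the whole pattern case-insensitive
found_flag = re.search(r'ttt', dna, re.IGNORECASE)

for found in (found_set, found_flag):
    if found:                  # only call .group() when a match object exists
        print(found.group())
    else:
        print('no match')
```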

#### Get *all* matches using the re.findall() function

In [None]:
dna = 'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA'

found = re.findall(r'TTT', dna) # Search for matches of 'TTT' in dna
print(found)

found = re.findall(r'ZZZ', dna) # Search for matches of 'ZZZ' in dna
print(found)

re.findall() always returns a list. If there are matches, they will be in the list. An empty list means no matches were found.

We can count the number of matches using Python's length - len() - function.

In [None]:
dna = 'CTTACGATCTAATCTGTTTCCCTTAAACGTTAACGGTTGTTTTGTGCTTA'

found = re.findall(r'TTT', dna) # Search for matches of 'TTT' in dna
print(found)
print(len(found))

found = re.findall(r'ZZZ', dna) # Search for matches of 'ZZZ' in dna
print(found)
print(len(found))

#### Activity 3

Find and print all the matches for 'manage' in the SCDNR Marine Resources Division purpose statement.

In [None]:
# Tackle Activity 3 here

purpose = '''The Division of Marine Resources is responsible for the management and conservation of the state's marine and estuarine resources.
The division conducts monitoring and research on the state’s marine resources and makes recommendations for the management of those resources.
The division is headquartered at the Marine Resources Center on Charleston Harbor.
Fishery managers open and close marine fishing seasons, recommend size and catch limits for fish, track trends in abundance of marine species and review coastal development activities.
Through the use of permits and permit conditions the managers control the harvest of fish, shrimp, crabs and shellfish.
Marine Resources Division staff actively work with regional authorities such as the Atlantic States Marine Fisheries Commission and the South Atlantic Fishery Management Council
to ensure that marine fisheries are effectively and sustainable managed throughout their range.'''





#### Question 2

What happened? Is it what you expected?

The regex takes things very literally. It matches every occurrence of what we ask it to match.

But what about when there are different variants possible, like manage, managed, manager, and management?

In [None]:
found = re.findall(r'manage[rdments]{0,4}', purpose)
print(found)
print(len(found))

Let's take a closer look at this regex:

manage[rdments]{0,4}

1) This statement could be verbalized as "find the word 'manage' followed by any of the letters r, d, m, e, n, t, or s occurring 0 to 4 times in succession."
2) [rdments] is the set of letters allowed after 'manage' (optional in this case). It is case-sensitive.
3) {0,4} specifies the range of number of occurrences. Minimum of 0 (making it optional), maximum of 4.
   * There are several other characters that can be used to specify the number of occurrences of a regex component:
       * \* (0 or more) e.g. manage[rdments]*
       * \+ (1 or more) e.g. manage[rdments]+
       * \? (0 or 1) e.g. manage[rdments]?
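
The effect of each quantifier can be compared side by side. The sketch below uses a made-up test string and applies each quantifier to the final 'T' of the pattern 'GAT'.

```python
import re

text = 'GA GAT GATT GATTT'  # made-up test string

star = re.findall(r'GAT*', text)        # 'GA' followed by 0 or more 'T'
plus = re.findall(r'GAT+', text)        # 'GA' followed by 1 or more 'T'
optional = re.findall(r'GAT?', text)    # 'GA' followed by 0 or 1 'T'
counted = re.findall(r'GAT{2,3}', text) # 'GA' followed by 2 to 3 'T'

print(star)      # ['GA', 'GAT', 'GATT', 'GATTT']
print(plus)      # ['GAT', 'GATT', 'GATTT']
print(optional)  # ['GA', 'GAT', 'GAT', 'GAT']
print(counted)   # ['GATT', 'GATTT']
```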

We should remain aware that a permissive regex like this one can also match misspelled words when there are typos in the text.
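
To illustrate, the sketch below runs the same regex on a made-up string containing two deliberate misspellings. The stricter pattern is only one possible alternative; it relies on syntax not covered in this session (`|` for alternatives, `(?:...)` for grouping, and `\b` for a word boundary), so treat it as an assumption to explore rather than a recommended solution.

```python
import re

# Made-up text: 'managemnt' and 'managesss' are deliberate typos
text = 'manage managed managemnt managesss management'

# The character set accepts any mix of the listed letters, typos included
loose = re.findall(r'manage[rdments]{0,4}', text)
print(loose)   # ['manage', 'managed', 'managemnt', 'managesss', 'management']

# One possible stricter pattern: list the allowed endings explicitly and
# require a word boundary (\b) before and after the word
strict = re.findall(r'\bmanage(?:ment|rs|r|d|s)?\b', text)
print(strict)  # ['manage', 'managed', 'management']
```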

#### Multiple Match Conditions to Account for Several Patterns

In standard DNA genetic code, codons consist of a combination of three letters from A, C, G, and T.

Suppose we have a sequence and we want to identify all codons that have a first base of (begin with) A, a third base of (end with) C, and have either C, G, or T (but not A) as the second base (middle item).

The possible matches are ACC, AGC, or ATC

Here's a regex that matches all three forms:
r'A[^A]C'

In [None]:
import re

sequence = 'TGATCGTAGCTAGCTAGCTACGTATATATATAGCGCGTA'
matches = re.finditer(r'A[^A]C', sequence)   # looks for a sequence starting with "A" with the next base not being an "A" and the third and final base as a "C"
# Note: we're introducing re.finditer() here. It returns an iterator of match objects.
# Each match object gives us the matched text via its group() method and the 0-based start position via its start() method.

for n in matches:
    print(f'Found: {n.group()} at start site: {n.start() + 1}')

# To print the values of variables or objects within text, we put f before the string to make it a formatted string literal (f-string).
# The values of the variables or expressions we place inside the curly braces are inserted into the string.
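
Match objects returned by re.finditer() offer a few related methods. The sketch below uses a short made-up sequence to show group(), start(), end(), and span(); positions are 0-based, which is why 1 is added to report a conventional 1-based start site.

```python
import re

sequence = 'AAGTTCCGTT'  # short made-up sequence

for m in re.finditer(r'GTT', sequence):
    # start() is the 0-based index of the first matched character;
    # end() is the index just past the last matched character
    print(f'{m.group()} start={m.start()} end={m.end()} 1-based start={m.start() + 1}')

# span() returns (start, end) as a tuple
spans = [m.span() for m in re.finditer(r'GTT', sequence)]
print(spans)  # [(2, 5), (7, 10)]
```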

Here's a walkthrough of the regex r'A[^A]C':

A - Matches must begin with 'A'

[^A] - Matches must then be followed by any single character BUT an 'A' ('^' as the first character inside square brackets means NOT)

C - Matches must then be followed by a C

But... this regex could also match errant letters, rogue numbers, or invasive characters in the 2nd base.

[^A] means *anything but an A*. So a Z would be matched.
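
A quick check on a made-up string shows how permissive [^A] is: letters, digits, and punctuation are all accepted in the middle position.

```python
import re

# Made-up test string with invalid "bases" in the middle position
test = 'AZC A1C AAC A-C'

found = re.findall(r'A[^A]C', test)
print(found)  # ['AZC', 'A1C', 'A-C'] (only 'AAC' is rejected)
```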

In this case, with only three possible options for the match, it would be prudent to be more explicit and specify that *one of C, G, or T must be matched*.

**How would we modify the regex to do that?**

#### Activity 4

In [None]:
#Tackle Activity 4 here. Replace the part of the regex [^A] with your solution.
#If correct, the result will be the same as the printout from the previous example.

sequence = 'TGATCGTAGCTAGCTAGCTACGTATATATATAGCGCGTA'
matches = re.finditer(r'A[^A]C', sequence)   # looks for a sequence starting with "A" with the next base not being an "A" and the third and final base as a "C"


for n in matches:
    print(f'Found: {n.group()} at start site: {n.start() + 1}')


[Pythex is a great tool for testing a regex and seeing what it does and doesn't match visually](https://pythex.org/)
* Paste your body of text in the "Your test string" box
* Design your regex in the "Your regular expression" box

Pythex also includes a handy cheat sheet (expand at the bottom of the screen) to help with designing a regex.

### Activity 5

Use the Pythex website to create a regex to match the SC vessel registrations only, in the following text:

"The survey was conducted by vessels SC-2243-GZ, SC-9629-BB, and SC-1035-DA in Saint Helena Sound, while Charleston Harbor was surveyed by the vessels SC-7478-KS and SC-1151-GG. Assistance was also received from GA-2231-DD and FL-6632-FY."

Note that SC vessel registrations all:
* begin with 'SC'
* then a hyphen
* then any **four** digits - You can use [0-9] to represent any single digit
* then a hyphen
* then any **two** capital letters - You can use [A-Z] to represent any single capital letter

e.g. SC-2345-AF

In [None]:
# Tackle Activity 5 here
# Use your regex in the following code:

text = '''The research was conducted by vessels SC-2243-GZ, SC-9629-BB, and SC-1035-DA in Saint Helena Sound,
while Charleston Harbor was surveyed by the vessels SC-7478-KS and SC-1151-GG.
Assistance was also received from GA-2231-DD and FL-6632-FY.'''

found = re.findall(r'[your regex goes here]', text)

print(found)


### Summative Assessment Quiz

The purpose of summative assessment quizzes is twofold:

1) The process of recall helps to transfer information from short term to longer term memory.
2) The quizzes help us evaluate the effectiveness of our training sessions.

Take [Summative Assessment Quiz 5](https://cofc.libwizard.com/f/intro-python-5) to test your knowledge about this session.

### Resources

* [Pythex is a great tool for testing a regex and seeing what it does and doesn't match visually](https://pythex.org/)
* [Python documentation - Regular expression operations](https://docs.python.org/3/library/re.html)
* [W3Schools - Python RegEx](https://www.w3schools.com/python/python_regex.asp)