## Syntax of Regular Expressions in Python

All of the functions associated with regular expressions in Python are contained in the `re` module, so the very first thing we have to do is import it.

In [1]:
import re

The most useful function in the re module is the search function. Its syntax is:

```python
re.search(pattern, string)
```

where `pattern` is the regular expression and `string` is the string that we want to search for a match. The search function returns a "match object" if it finds a match of the pattern anywhere in the string, and `None` otherwise. Because a match object is always truthy and `None` is falsy, the result can be used in an if statement to determine whether a match has been found.

```python
myPattern = re.search(pattern, string)
```
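
As a minimal sketch of that if-statement usage (the helper name `has_digits` and the second test string are made up for illustration):

```python
import re

def has_digits(text):
    """Hypothetical helper: True if text contains at least one digit."""
    match = re.search(r"\d+", text)  # a match object, or None if nothing matches
    if match:
        return True
    return False

print(has_digits("Biology = 5Xfun!"))   # the string contains the digit 5
print(has_digits("no numbers here"))    # no digit anywhere in the string
```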

In [2]:
s="Biology = 5Xfun!"

In [3]:
re.search(r"\S+\s\S+\s\S+", s)

<re.Match object; span=(0, 16), match='Biology = 5Xfun!'>

In [4]:
re.search(r"\w\W\w\W", s)

In [5]:
re.search(r"\d+", s)

<re.Match object; span=(10, 11), match='5'>

In [7]:
re.search(r"\D+", s)

<re.Match object; span=(0, 10), match='Biology = '>

In [8]:
re.search(r"\S+", s)

<re.Match object; span=(0, 7), match='Biology'>
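
The `span` and `match` values shown in these outputs can also be read from the match object in code. A small sketch, reusing the string `s` from above (the variable names `matched_text`, `start` and `end` are only illustrative):

```python
import re

s = "Biology = 5Xfun!"
match = re.search(r"\d+", s)

matched_text = match.group()   # the text that matched
start, end = match.span()      # same values as match.start() and match.end()

print(matched_text, start, end)
print(s[start:end] == matched_text)  # slicing the original string with the span recovers the match
```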

Some more practical examples related to biology:

In [10]:
row = "Sox2  chr3  34548926  34551382  +"

re.search(r"\w\s{2}chr\w{,2}\s{2}\d+\s{2}\d+\s{2}\s{2}[\+|\-]", row)  # no match: one \s{2} too many before the strand

In [12]:
re.search(r"\w\s{2}chr\w{,2}\s{2}(\d+\s{2}){2}[\+|\-]", row)

<re.Match object; span=(3, 33), match='2  chr3  34548926  34551382  +'>

In [13]:
re.search(r"\w+\s{2}chr\w{,2}\s{2}(\d+\s{2}){2}[\+|\-]", row)

<re.Match object; span=(0, 33), match='Sox2  chr3  34548926  34551382  +'>

In [15]:
# the simplest answer, which also minimizes the potential for unexpected matches
re.search(r"\S+\s+chr\S+\s+\d+\s+\d+\s+[+-]", row)

<re.Match object; span=(0, 33), match='Sox2  chr3  34548926  34551382  +'>
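
One detail in the earlier attempts is worth a quick check: inside square brackets, `|` is an ordinary character rather than "or", so `[\+|\-]` accepts three characters (`+`, `|` and `-`). The sketch below compares it with `[+-]` on made-up single-character test strings:

```python
import re

loose = r"[\+|\-]"   # character class containing +, | and -
strict = r"[+-]"     # character class containing only + and -

for strand in ["+", "-", "|"]:
    # prints the test character, then whether each pattern matches it
    print(strand, bool(re.search(loose, strand)), bool(re.search(strict, strand)))
```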

This match object also contains the part or parts of the string that match the pattern. You may recall that parentheses are used to specify groups for regular expressions to match. As it turns out, these groups are also contained within the match object. For example, if we have an expression like this:

```
"(\S+)\s(MIT_\S+)\s\S+"
```

that is used for the search function, then we can use

```python
myPattern.group(1)
```

to get the first group (what is matched by `"\S+"`), and

```python
myPattern.group(2)
```

to get the second group (in this case, a word starting with `"MIT_"`).  

If we want to get the entire string matched by the pattern, we can use

```python
myPattern.group(0)
```

Now let's see how this looks in the `re.search` command.

```python
myPattern = re.search(r"(\S+)\s(MIT_\S+)\s\S+", string)
if myPattern:
    print (myPattern.group(1)+myPattern.group(2))
```
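
To try this out, here is a runnable sketch with a made-up sample line (the contents of `string` are hypothetical and not taken from a real data file):

```python
import re

string = "gene42 MIT_marker7 chr3"  # made-up sample line
myPattern = re.search(r"(\S+)\s(MIT_\S+)\s\S+", string)

if myPattern:
    print(myPattern.group(0))   # whole match
    print(myPattern.group(1))   # first group
    print(myPattern.group(2))   # second group
    print(myPattern.groups())   # all groups as a tuple
```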

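Note that the pattern is written with an `r` in front of the opening quote. This makes it a raw string, in which Python leaves backslashes untouched and passes them on to the `re` module. The prefix is not strictly required, but without it Python interprets some backslash sequences itself (for example, `"\b"` becomes a backspace character instead of the regular expression word boundary), and recent Python versions warn about sequences such as `"\S"` in ordinary strings. It is therefore good practice to always write patterns as raw strings.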
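
A small sketch of what the `r` prefix changes, using `\b` (word boundary), which does not appear in the examples above; the test sentence is made up:

```python
import re

print(len("\b"), len(r"\b"))  # 1 character (backspace) versus 2 characters (backslash, b)

text = "a cat sat"
print(re.search(r"\bcat\b", text))  # raw string: \b reaches the regex engine as a word boundary
print(re.search("\bcat\b", text))   # plain string: the pattern contains backspace characters
```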

### What do these lines do?

Working with regular expressions can be tricky because it can be hard to determine precisely what a line of code does.

With that in mind, let's review the syntax of the previous Python code and regular expressions. Given the following line of text (a string) from a data file, where the string is formatted like so:

_ID# GeneName chromosome DNA-Strand Expr1 Expr2 Expr3_, where Expr# is data from experiment #.

Identify what each of the following lines of Python code does.

In [4]:
line = "1684730100 Ter3A chr8 - 641.57 12.03 113.7"

In [5]:
re.search(r"\d+\s(\S+)\s\S+\s[+-]\s(\d+\.?\d*)\s(\d+\.?\d*)\s(\d+\.?\d*)", line)

<re.Match object; span=(0, 42), match='1684730100 Ter3A chr8 - 641.57 12.03 113.7'>

In [7]:
pattern1 = re.search(r"\d+\s\S+\s\S+\s[+-]\s\S+\s\S+\s\S+", line)

In [8]:
pattern1

<re.Match object; span=(0, 42), match='1684730100 Ter3A chr8 - 641.57 12.03 113.7'>

Both lines of code contain regular expressions that match the whole string `line`, so both `re.search` calls return a match. The first regular expression contains parentheses that specify groups to save separately (the gene name and the three expression values). The second contains no parentheses, so no groups are saved, but it stores the match object in a variable called `pattern1`.

In [9]:
pattern2 = re.search(r"\d+\s\S+\s\S+\s[+-]\s\d+\s\d+\s\d+", line)

In [11]:
print (pattern2.group(1))

AttributeError: 'NoneType' object has no attribute 'group'

The regular expression does not match any portion of the line, because `\d+` does not account for the decimal points in the expression values. `re.search` therefore returns `None`, and calling `.group(1)` on `None` raises the `AttributeError` shown above.
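
One way to avoid this kind of error is to test the result before calling `.group()`. A minimal sketch with the same line and pattern (the variable `result` and the "no match" message are illustrative choices):

```python
import re

line = "1684730100 Ter3A chr8 - 641.57 12.03 113.7"
pattern2 = re.search(r"\d+\s\S+\s\S+\s[+-]\s\d+\s\d+\s\d+", line)

if pattern2 is None:
    result = "no match"          # re.search found nothing, so there is no match object
else:
    result = pattern2.group(0)   # safe: pattern2 is a real match object here

print(result)
```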

In [12]:
pattern2 = re.search(r"\d+\s(\S+)\s\S+\s[+-]\s(\d+\.?\d*)\s(\d+\.?\d*)\s(\d+\.?\d*)", line)
print (pattern2.group(1))

Ter3A


The regular expression contains parentheses defining groups to be saved separately in the match object. The first parenthesized group captures the gene name, `Ter3A`. Entering `pattern2.group(0)` would still return the whole match.
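
Groups are always returned as strings, so numeric data usually needs converting before further use. A sketch under that assumption, with the same line and pattern (the names `gene` and `expression` are illustrative):

```python
import re

line = "1684730100 Ter3A chr8 - 641.57 12.03 113.7"
pattern2 = re.search(r"\d+\s(\S+)\s\S+\s[+-]\s(\d+\.?\d*)\s(\d+\.?\d*)\s(\d+\.?\d*)", line)

if pattern2:
    gene = pattern2.group(1)
    # groups 2 to 4 hold the three expression values as text; convert them to floats
    expression = [float(pattern2.group(i)) for i in (2, 3, 4)]
    print(gene, expression)
```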

### Writing Regular Expression Commands

Recall the question "Writing Regular Expressions", in which we wrote an expression to match a data file of genomic coordinates for each gene formatted like so:

```
Sox2  chr3  34548926  34551382  +
```

Now put this expression into the re.search function, and create a match object "myPattern", which is fed lines of the file in a string called "line". Just match the whole line: do not specify any groups.

In [6]:
line = "Sox2  chr3  34548926  34551382  +"

myPattern = re.search(
    r"\S+\s+chr\S+\s+\d+\s+\d+\s+[+-]", # beware chr is not necessarily followed by digits
    line
)

In [7]:
myPattern

<re.Match object; span=(0, 33), match='Sox2  chr3  34548926  34551382  +'>

Given a data file that lists departments by name with their average number of students, like so:
```
The Biology department averages 32 students/class
```
write the command that will match the whole line and specify in groups only the names of the various departments and the numbers of students for further use in the object "myPattern". Assume that each line of the file is supplied to the command by way of a string named "line". Your goal should be to create groups of whatever the department name might be and whatever the student number might be for each line of the hypothetical file.

In [12]:
line = "The Biology department averages 32 students/class"
pattern = r"The (\S+) department averages (\d+) students\/class"

# r"The\s(\S+)\sdepartment\saverages\s(\S+)\sstudents\/class" # is another solution

myPattern = re.search(
    pattern,
    line
)

In [18]:
myPattern

<re.Match object; span=(0, 49), match='The Biology department averages 32 students/class>

In [19]:
print(myPattern.group(1))
print(myPattern.group(2))

Biology
32
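
Applied to several lines, the same pattern can collect the groups into a dictionary. The sketch below assumes a list of lines standing in for the file; the Chemistry line and the name `averages` are made up for illustration:

```python
import re

lines = [
    "The Biology department averages 32 students/class",
    "The Chemistry department averages 27 students/class",  # made-up line
]

averages = {}
for line in lines:
    myPattern = re.search(r"The (\S+) department averages (\d+) students/class", line)
    if myPattern:
        # group 1 is the department name, group 2 the number of students (as text)
        averages[myPattern.group(1)] = int(myPattern.group(2))

print(averages)
```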


Define a function named chromosome_regex that takes input line and uses the search command to match a line of a file like the one shown into a match object called myPattern:
```
Sox2  chr3  34548926  34551382  +
```

In [24]:
def chromosome_regex(line: str = "") -> str:
    """
    Takes input line and uses the search command to match a line of a file like the one shown into a match object called myPattern:
    ```
    Sox2  chr3  34548926  34551382  +
    ```
    Returns the matched string from the match object
    """

    import re

    myPattern = re.search(r"\S+\s+(chr\S+)\s+\d+\s+\d+\s+[+-]", line)
    if myPattern is None: # re.search returns None when the line does not match
        raise ValueError(f"line does not match the expected format: {line!r}")
    return myPattern.group() # with group(1) we get the chromosome ID


In [25]:
chromosome_regex('Sox2  chr3  34548926  34551382  +')

'Sox2  chr3  34548926  34551382  +'
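
As a variation, the parenthesized group can be returned instead of the whole match. This is only a sketch: the function name `chromosome_id`, the `None` return for non-matching lines, and the second and third test lines are assumptions made for illustration.

```python
import re

def chromosome_id(line):
    """Hypothetical variant: return the chromosome ID, or None if the line does not match."""
    myPattern = re.search(r"\S+\s+(chr\S+)\s+\d+\s+\d+\s+[+-]", line)
    return myPattern.group(1) if myPattern else None

print(chromosome_id("Sox2  chr3  34548926  34551382  +"))
print(chromosome_id("GeneA  chrX  100  200  -"))   # made-up line with a non-numeric chromosome
print(chromosome_id("not a coordinate line"))      # does not match the pattern
```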

### Writing Regular Expressions: Application

Given what you went through so far, you should now be able to write scripts that search strings, specify relevant groups using regular expressions, and return just the data you are looking for. Let's give it a try. Note: This question is set to unlimited attempts.

Write a function named 'RNAseqParser' that:

* Takes an input string of data.
* Creates an output string with the repeating format "x\ty\n...", where
    * **x** is the gene name
    * **\t** is a tab
    * **y** is gene x's expression value, found in the 5th column of the original data file
    * **\n** is a newline

    You could accomplish this by iterating over the lines of the data with the `for line in X:` loop syntax (where X is your input data) and storing the match information in a match object.
* Returns your properly formatted output string.

Remember to import any modules you might need, and keep in mind that some genes have expression values of zero. In this embedded Python grader, paste your final code to be graded after developing and debugging it in your local Spyder (or other) IDE. Click Submit to have your code graded.

In [117]:
example_line = "ENSMUSG00000000708\tKat2b\t9.379815\t0.37079784\t1.1033436\t5.6754346"
# \d*\.\d+ matches a floating-point number such as 9.379815 (note that it would not match a bare 0)
patt = re.search(r"\S+\t(\S+)\t\d*\.\d+\t\d*\.\d+\t(\d*\.\d+)\t\d*\.\d+", example_line)
print(patt)
print(patt.group(0))
print(patt.group(1))
print(patt.group(2))


<re.Match object; span=(0, 64), match='ENSMUSG00000000708\tKat2b\t9.379815\t0.37079784\t>
ENSMUSG00000000708	Kat2b	9.379815	0.37079784	1.1033436	5.6754346
Kat2b
1.1033436
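
Since the exercise warns that some genes have expression values of zero, it is worth checking how the number pattern behaves on such values. The sketch below compares `\d*\.\d+` with a more tolerant alternative, `\d*\.?\d+`; the test values other than 1.1033436 are made up, and `re.fullmatch` is used so that each value must match in its entirety.

```python
import re

strict = r"\d*\.\d+"     # requires a decimal point
tolerant = r"\d*\.?\d+"  # decimal point is optional

for value in ["1.1033436", "0", "0.0", "12"]:
    # prints the value, then whether each pattern matches the whole value
    print(value, bool(re.fullmatch(strict, value)), bool(re.fullmatch(tolerant, value)))
```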


In [168]:
def RNAseqParser(filename:str="") -> str:
    import re

    output = "" # will store the final output
    
    with open(filename, 'r') as data:

        data_string = data.readline().strip("'").lstrip("\\n") # a string is encoded in the example file
        
        for line in data_string.split('\\n'):
            if line.startswith("ensGene"): # we skip the first row
                continue
            
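            pattern_search = re.search(
                # \\t matches the literal backslash followed by "t" stored in the file; \d*\.?\d+ also accepts values such as 0
                r"\S+\\t(\S+)\\t\d*\.?\d+\\t\d*\.?\d+\\t(\d*\.?\d+)\\t\d*\.?\d+",
                line
            )
            
            if pattern_search: # skip lines that do not match (e.g. blank lines)
                output += (pattern_search.group(1) + '\t' + pattern_search.group(2) + '\n')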
    return output

In [169]:
print(RNAseqParser('./input_string.txt'))

Tcfe3	7.205497
Kat2b	1.1033436
Snrpn	13.403415
Rmnd5b	14.050683
Fbxo9	6.499769
Def8	15.014166
Ell2	3.5680292
Ifrd1	15.508437
Akr1b3	1.2716209
Ubl3	9.046656
Mov10	6.25411
Pdcd2l	15.635618
Clpp	32.20393
Mrpl2	50.002293
Pnkp	6.193148
Relb	1.7450844
Klf4	4.1997404
Ciao1	9.724962
Rad23a	15.284632
Atp6v1f	24.671564
Arhgef18	5.004999
Polr2e	36.53202



In [316]:
def RNAseqParser(data:str="") -> str:
    import re

    output = "" # will store the final output

    for line in data.split('\n')[1:]: # first line is blank
        if "ensGene" in line: # we skip the first row
            continue
        
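        pattern_search = re.search(
            r"\S+\t(\S+)\t\d*\.?\d+\t\d*\.?\d+\t(\d*\.?\d+)\t\d*\.?\d+", # \d*\.?\d+ also accepts values such as 0
            line
        )

        if pattern_search: # skip lines that do not match (e.g. blank lines)
            output += (pattern_search.group(1) + '\t' + pattern_search.group(2) + '\n')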
    return output

In [317]:
RNAseqParser('\nensGene\tgeneSymb\tESC.RPKM\tMES.RPKM\tCP.RPKM\tCM.RPKM\nENSMUSG00000000134\tTcfe3\t14.92599\t6.080252\t7.205497\t5.5972915\nENSMUSG00000000708\tKat2b\t9.379815\t0.37079784\t1.1033436\t5.6754346\nENSMUSG00000000948\tSnrpn\t40.668293\t14.529371\t13.403415\t23.01873\nENSMUSG00000001054\tRmnd5b\t43.369095\t7.0136724\t14.050683\t11.829396\nENSMUSG00000001366\tFbxo9\t7.6720843\t6.9369035\t6.499769\t6.778531\nENSMUSG00000001482\tDef8\t24.153797\t15.451096\t15.014166\t13.819534\nENSMUSG00000001542\tEll2\t8.156232\t3.5004125\t3.5680292\t2.2641196\nENSMUSG00000001627\tIfrd1\t28.733929\t16.701181\t15.508437\t12.778727\nENSMUSG00000001642\tAkr1b3\t4.319858\t1.9163351\t1.2716209\t0.82428175\nENSMUSG00000001687\tUbl3\t28.78591\t9.088697\t9.046656\t20.373514\nENSMUSG00000002227\tMov10\t29.740297\t3.2102342\t6.25411\t9.091757\nENSMUSG00000002635\tPdcd2l\t30.69546\t18.50777\t15.635618\t15.247209\nENSMUSG00000002660\tClpp\t93.85232\t51.403442\t32.20393\t33.370808\nENSMUSG00000002767\tMrpl2\t86.59501\t61.894024\t50.002293\t51.35253\nENSMUSG00000002963\tPnkp\t8.918158\t5.5222096\t6.193148\t6.496989\nENSMUSG00000002983\tRelb\t7.0391517\t1.501116\t1.7450844\t2.5017977\nENSMUSG00000003032\tKlf4\t41.70846\t7.747598\t4.1997404\t6.5344357\nENSMUSG00000003662\tCiao1\t15.639003\t11.429388\t9.724962\t11.069197\nENSMUSG00000003813\tRad23a\t30.253717\t16.276289\t15.284632\t21.372665\nENSMUSG00000004285\tAtp6v1f\t30.517672\t23.897362\t24.671564\t25.907063\nENSMUSG00000004568\tArhgef18\t13.561201\t6.151879\t5.004999\t6.8743706\nENSMUSG00000004667\tPolr2e\t91.243706\t51.02243\t36.53202\t33.37132')

'Tcfe3\t7.205497\nKat2b\t1.1033436\nSnrpn\t13.403415\nRmnd5b\t14.050683\nFbxo9\t6.499769\nDef8\t15.014166\nEll2\t3.5680292\nIfrd1\t15.508437\nAkr1b3\t1.2716209\nUbl3\t9.046656\nMov10\t6.25411\nPdcd2l\t15.635618\nClpp\t32.20393\nMrpl2\t50.002293\nPnkp\t6.193148\nRelb\t1.7450844\nKlf4\t4.1997404\nCiao1\t9.724962\nRad23a\t15.284632\nAtp6v1f\t24.671564\nArhgef18\t5.004999\nPolr2e\t36.53202\n'

In [318]:
# another correct answer

import re
      
def RNAseqParser(input_string): 
    lines = input_string.split('\n') #converts input_string into a list of strings separated by \n (named "lines" to avoid shadowing the built-in list)
    
    output_string = ''  #creates empty string object  
    for string in lines: #for loop iterates over each string in the list 
        #searches for a match to the regular expression; \. is a literal dot, and \d+\.?\d* also accepts values such as 0
        pattern = re.search(r'(ENSMUSG\d+)\s+(\w+)\s+(\d+\.?\d*)\s+(\d+\.?\d*)\s+(\d+\.?\d*)\s+(\d+\.?\d*)', string)
        if pattern: #if a match to the regular expression is found...
            #...add the gene name (group 2), a tab (\t), the CP.RPKM value (group 5) and a line break (\n) to the output string
            output_string += pattern.group(2) + '\t' + pattern.group(5) + '\n' 
    return output_string

In [319]:
# a last one

import re

def RNAseqParser(text):
    ans = ''
    data = text.split('\n')
    for line in data:
        genePattern = re.search(r'(ENS\S+)\s(\S+)\s(\d+[.]*\d*)\s(\d+[.]*\d*)\s(\d+[.]*\d*)\s(\d+[.]*\d*)', line)
        if genePattern:
            ans += genePattern.group(2) + '\t' + genePattern.group(5) + '\n'
    return ans
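
Whichever version you use, it helps to test it on a small input that includes a zero expression value, as the exercise warns. The sketch below is self-contained: `parse_rnaseq` is a hypothetical compact variant written for this check, and the `GeneZ` row is made up.

```python
import re

def parse_rnaseq(data):
    """Hypothetical compact variant: gene name and 5th-column value for each data line."""
    output = ""
    for line in data.split("\n"):
        # group 1: gene name (2nd column); group 2: expression value (5th column)
        match = re.search(r"\S+\t(\S+)\t\S+\t\S+\t(\d+\.?\d*)\t\S+", line)
        if match:  # the blank first line and the header line do not match and are skipped
            output += match.group(1) + "\t" + match.group(2) + "\n"
    return output

sample = (
    "\nensGene\tgeneSymb\tESC.RPKM\tMES.RPKM\tCP.RPKM\tCM.RPKM\n"
    "ENSMUSG00000000708\tKat2b\t9.379815\t0.37079784\t1.1033436\t5.6754346\n"
    "ENSMUSG_FAKE\tGeneZ\t0\t0\t0\t0"  # made-up row with zero expression
)

print(repr(parse_rnaseq(sample)))
```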