# Multiline Regex Patterns

We've covered the entire set of regex metacharacters but have applied our patterns only to relatively short, single-line movie titles. This chapter covers how to build regexes that match patterns in strings spanning multiple lines. Specifically, a **multiline** string is one with at least one newline character in it. We begin by reading in a sample of the newsgroups dataset (from the [scikit-learn machine learning library][1]). It's read in as a two-column DataFrame containing the category of the message and the text.

[1]: https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset

In [None]:
import re
import pandas as pd
pd.set_option('display.max_colwidth', 100)

def find_pattern(s, pattern, **kwargs):
    filt = s.str.contains(pattern, **kwargs)
    return s[filt]

news = pd.read_csv('../data/newsgroups.csv')
news.head()

The `text` column is assigned to its own variable name as a Series, and the first 350 characters of the first message are output.

In [None]:
text = news['text']
text.iloc[0][:350]

Use the `print` function to display the message as it would be viewed online. Notice that the newline characters, `\n`, get displayed as actual new lines.

In [None]:
print(text.iloc[0][:350])
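The difference between the two displays can be reproduced on a tiny string. This is a minimal sketch that uses a made-up two-line message (`msg` is a hypothetical stand-in for `text.iloc[0]`, not data from the newsgroups file).

```python
# Hypothetical two-line message standing in for text.iloc[0]
msg = "From: someone@example.com\nSubject: test"

# repr() is what a notebook echoes for a bare string:
# the newline appears as the two characters backslash and n
as_repr = repr(msg)

# splitlines() gives the pieces that print() would place on separate lines
lines = msg.splitlines()

print(as_repr)
print(msg)
```

Any string for which `'\n' in msg` is true counts as multiline in the sense used in this chapter.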

## Anchors with the multiline flag

By default, the start and end anchors (`^`/`$`) treat the very first and very last characters of the string as the start and end positions, regardless of the number of newlines it contains. There is only one start and one end position. If we attempt to extract the organization name by capturing the text following `'Organization: '` as done below, it will fail because `^` is anchored to the start of the first line, and the vast majority of the messages begin with `'From'`.

In [None]:
text.str.extract(r'^Organization: (.*)$').head(3)

To make the start and end metacharacters anchor to the start and end of each line, use the multiline flag (`re.M` or `re.MULTILINE`). This anchors `'Organization: '` to the start of any line. Note that the `.` special character does NOT match newline characters, so `(.*)` stops at the end of the line and the dollar sign end anchor is unnecessary.

In [None]:
text.str.extract(r'^Organization: (.*)$', flags=re.M).head(3)
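The effect of the flag is easier to see on a single string with the plain `re` module. The sketch below assumes a made-up message (`msg` is hypothetical and not taken from the newsgroups data) and compares the same patterns with and without `re.M`.

```python
import re

# Made-up message; not taken from the newsgroups data
msg = (
    "From: jane@example.com\n"
    "Subject: Test post\n"
    "Organization: Example University\n"
    "Lines: 2\n"
    "\n"
    "Hello there.\n"
)

pattern = r'^Organization: (.*)$'

# Without the flag, ^ can only match at the very start of the string
default_match = re.search(pattern, msg)

# With re.M, ^ also matches immediately after every newline
multiline_match = re.search(pattern, msg, flags=re.M)

print(default_match)              # None
print(multiline_match.group(1))   # Example University

# The word at the start of the string vs. the word at the start of each line
first_only = re.findall(r'^\w+', msg)
each_line = re.findall(r'^\w+', msg, flags=re.M)
print(first_only)   # ['From']
print(each_line)    # ['From', 'Subject', 'Organization', 'Lines', 'Hello']
```

The empty line contributes nothing to `each_line` because `\w+` needs at least one word character.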

## Matching every character with `.`

The `.` special character matches every character except newline characters. The pattern `(.*)` captures the entire first line only.

In [None]:
text.str.extract(r'(.*)').head(3)

To change the behavior of `.` so that it matches every single character, including newlines, use the dotall flag (`re.S` or `re.DOTALL`). Here we capture the entire header information of the message, which typically ends with `'Lines: '` followed by digits.

In [None]:
text.str.extract(r'^(.*?Lines: \d+)', flags=re.S).head(3)

As an alternative to using the dotall flag, you can place two complementary character classes within square brackets, such as `[\s\S]+`, which matches every whitespace or non-whitespace character. Since the two classes are complements of one another, the pattern matches every single character.
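As a quick check of that equivalence, the sketch below runs the header pattern three ways on a made-up message (`msg` is hypothetical): with the default `.`, with the dotall flag, and with `[\s\S]` in place of `.`.

```python
import re

# Made-up message; not taken from the newsgroups data
msg = "From: jane@example.com\nSubject: Test post\nLines: 2\n\nHello there.\n"

# Default: . cannot cross a newline, so the header is never reached
dot_default = re.search(r'^(.*?Lines: \d+)', msg)

# Dotall flag: . now matches newlines as well
dot_all = re.search(r'^(.*?Lines: \d+)', msg, flags=re.S)

# No flag: [\s\S] matches any character, including newlines
char_set = re.search(r'^([\s\S]*?Lines: \d+)', msg)

print(dot_default)                              # None
print(dot_all.group(1) == char_set.group(1))    # True
```

The character-class form can be handy when only part of a pattern should cross line boundaries, since the flag changes the meaning of every `.` in the pattern.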

## Using multiple flags

It's possible to use multiple flags simultaneously by separating them with the pipe symbol. For example, the multiline and dotall flags can be combined with `re.M | re.S`. Technically, this computes the bitwise OR of the two underlying integers. Below, we attempt to find the entire sentence (possibly spanning multiple lines) that starts on a line beginning with `'They'`.

In [None]:
s = text.str.extract(r'^(They .*?\.)', flags=re.M | re.S).dropna()[0]
s.head()

Printing out the first sentence found displays the multiple lines.

In [None]:
print(s.loc[61])
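The bitwise OR can be inspected directly. The following sketch, which assumes a made-up three-line string (`msg` is hypothetical), prints the integer values behind the flags and also shows the inline form `(?ms)`, a standard `re` feature that sets the same two flags from inside the pattern.

```python
import re

# Each flag is backed by an integer; | combines them bit by bit
combined = re.M | re.S
print(int(re.M), int(re.S), int(combined))   # 8 16 24

# Made-up text; not taken from the newsgroups data
msg = "Intro line.\nThey came early\nand left late. More text.\n"

# Flags passed as an argument
with_arg = re.search(r'^(They .*?\.)', msg, flags=combined).group(1)

# The same flags set inline at the start of the pattern
inline = re.search(r'(?ms)^(They .*?\.)', msg).group(1)

print(with_arg == inline)   # True
print(with_arg)
```

Both searches return the sentence that starts at `'They'` and runs across the line break to the first period.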

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Find the number of messages that have at least one line that begins with `'The'` and ends with a period.</span>

### Exercise 2

<span style="color:green; font-size:16px">Find the messages that do not start with the word `'From'`.</span>

### Exercise 3

<span style="color:green; font-size:16px">Extract the first word from the messages that do not start with the word `'From'` and then count the occurrences of each of these words.</span>

### Exercise 4

<span style="color:green; font-size:16px">Extract the header of each message and assign it to a variable named `header`. The header consists of the lines at the top that begin with `From`, `Subject`, `Organization`, etc. Take a look at a few of the individual messages to understand what pattern would work best. Make sure `header` is a Series of strings.</span>

### Exercise 5

<span style="color:green; font-size:16px">Use the `extractall` method to extract two groups from the header. For all lines that do not begin with a space, extract the first word up to but not including the colon. That is the first group. Extract the remaining characters to the right of the colon as the second group. Name the groups `'property'` and `'value'`.</span>

### Exercise 6

<span style="color:green; font-size:16px">Attempt to extract all emails from each message. If you are up for a serious challenge, use the exact specifications for [valid email addresses](https://en.wikipedia.org/wiki/Email_address). The solution presented finds the most common types of emails.</span>

### Exercise 7

<span style="color:green; font-size:16px">From each email found, extract the characters after the last dot (usually com, edu, org, etc.), make them lowercase, and count the occurrences of each.</span>

### Exercise 8

<span style="color:green; font-size:16px">Extract the number of lines of each message as the integer following the word `'Lines:'` in the header. Find the average number of lines per message.</span>

### Exercise 9

<span style="color:green; font-size:16px">Extract all 10-digit phone numbers.</span>

### Exercise 10

<span style="color:green; font-size:16px">In this exercise you'll find the most common words in each category. Run the first two cells below to place the category in the index and then extract the body from the message as a Series. Extract all the words between 5 and 12 characters in length. Make them all lowercase and remove the words in the list `remove_words`. Finally, count the occurrences of each word by category, returning the top 10 per category.</span>

In [None]:
textc = news.set_index('category')['text']
textc.head(3)

In [None]:
body = textc.str.extract(r'\n\n(.*)', flags=re.S)[0]
body.head(3)

In [None]:
remove_words = ['would', 'article', 'there', 'about', 'their', 'other', 'should', 'could',
                'those', 'these', 'which', 'where', 'writes', 'anyone', 'someone', 'because']

### Exercise 11

<span style="color:green; font-size:16px">Find all messages that have `'From'`, `'Subject'`, `'Organization'`, and `'Lines'` as parts of the header.</span>