# Lecture 12

## Working With Text

In this lecture we're going to talk about pattern matching in strings using regular expressions. Regular
expressions, or regexes, are written in a condensed formatting language. In general, you can think of a
regular expression as a pattern which you give to a regex processor with some source data. The processor then
parses that source data using that pattern, and returns chunks of text back to a data scientist or
programmer for further manipulation. There are three main reasons you would want to do this: to check
whether a pattern exists within some source data, to get all instances of a complex pattern from some source
data, or to clean your source data using a pattern, generally through string splitting. Regexes are not
trivial, but they are a foundational technique for data cleaning in data science applications, and a solid
understanding of regexes will help you quickly and efficiently manipulate text data for further data science
applications.

Now, you could teach a whole course on regular expressions alone, especially if you wanted to demystify how
the regex parsing engine works and efficient mechanisms for parsing text. In this lecture I want to give you
a basic understanding of how regexes work - enough knowledge that, with a little directed sleuthing, you'll be
able to make sense of the regex patterns you see others use, and you can build up your practical knowledge of
how to use regexes to improve your data cleaning. By the end of this lecture, you will understand the basics
of regular expressions, how to define patterns for matching, how to apply these patterns to strings, and how
to use the results of those patterns in data processing.

In [1]:
text1 = "The impossible could not have happened, therefore the impossible must be possible in spite of appearances."

len(text1) # The length of text1

106

In [2]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

16

In [3]:
text2

['The',
 'impossible',
 'could',
 'not',
 'have',
 'happened,',
 'therefore',
 'the',
 'impossible',
 'must',
 'be',
 'possible',
 'in',
 'spite',
 'of',
 'appearances.']

<br>
List comprehension allows us to find specific words:

In [4]:
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2

['impossible',
 'could',
 'have',
 'happened,',
 'therefore',
 'impossible',
 'must',
 'possible',
 'spite',
 'appearances.']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['The']

In [6]:
[w for w in text2 if w.endswith('e')] # Words in text2 that end in 'e'

['The',
 'impossible',
 'have',
 'therefore',
 'the',
 'impossible',
 'be',
 'possible',
 'spite']

<br>
We can find unique words using `set()`.

In [9]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [10]:
len(set(text4))

5

In [11]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [12]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [13]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### Processing free-text

In [15]:
text5 = '"The impossible could not have happened, therefore the impossible must be possible in spite of appearances" \
#impossible @ https://www.goodreads.com/author/quotes/123715.Agatha_Christie'
text6 = text5.split(' ')

text6

['"The',
 'impossible',
 'could',
 'not',
 'have',
 'happened,',
 'therefore',
 'the',
 'impossible',
 'must',
 'be',
 'possible',
 'in',
 'spite',
 'of',
 'appearances"',
 '#impossible',
 '@',
 'https://www.goodreads.com/author/quotes/123715.Agatha_Christie']

Some word comparison functions
  * s.startswith(t)
  * s.endswith(t)
  * t in s
  * s.isupper(); s.islower(); s.istitle()
  * s.isalpha(); s.isdigit(); s.isalnum()
  
The full list of Python string methods can be found, for example, here: https://docs.python.org/2.5/lib/string-methods.html

<br>
Finding hashtags:

In [16]:
[w for w in text6 if w.startswith('#')]

['#impossible']

<br>
Finding callouts:

In [17]:
[w for w in text6 if w.startswith('@')]

['@']

In [18]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` will match all words that contain an `'@'` followed by at least one of:
* a capital letter (`'A-Z'`)
* a lowercase letter (`'a-z'`)
* a number (`'0-9'`)
* or an underscore (`'_'`)

In [None]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
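
Since the cell above is shown without its output, here is a minimal self-contained sketch of the same filter. It rebuilds `text7` from the earlier cell and writes the pattern as a raw string; the variable name `callouts` is introduced here purely for illustration.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')
text8 = text7.split(' ')

# Keep only the words that contain '@' followed by at least one
# letter, digit, or underscore; the bare '@' token is filtered out.
callouts = [w for w in text8 if re.search(r'@[A-Za-z0-9_]+', w)]
print(callouts)  # ['@UN', '@UN_Women']
```

Unlike the `startswith('@')` approach, the lone `'@'` is no longer reported as a callout.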

First we'll import the `re` module, which is Python's standard library module for regular expressions.

In [19]:
import re

There are several main processing functions in `re` that you might use. The first, `match()`, checks for a match
at the beginning of the string and returns a match object, or `None` if nothing matched. Similarly, `search()`
checks for a match anywhere in the string and returns a match object or `None`. Either result can be used
directly as the condition of an `if` statement.

Let's create some text for an example.

In [20]:
text = "This is a good day."

# Now, let's see if it's a good day or not:
if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")

Wonderful!


In [21]:
if re.match('day', text): # checks for a match at the beginning of the string
    print("Wonderful!")
else:
    print("Alas :(")


Alas :(


In [22]:
if re.match('This',text): 
    print("Wonderful!")
else:
    print("Alas :(")

Wonderful!


In [27]:
re.match('day',text)
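
The cell above displays nothing because `re.match()` returned `None`: the word "day" is not at the start of the string. A small sketch of the difference between the two functions follows; the variable names `at_start` and `anywhere` are illustrative only.

```python
import re

text = "This is a good day."

at_start = re.match("day", text)    # None: "day" is not at the start of the string
anywhere = re.search("day", text)   # a match object: "day" occurs later in the string

print(at_start)         # None
print(anywhere.span())  # (15, 18)
```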

In addition to checking conditions, we can segment a string. The work that regex does here is called
**tokenizing**, where the string is separated into substrings based on patterns. Tokenizing is a core activity in natural language processing, which we won't talk much about here but which you will study in the future.

The ``findall()`` and ``split()`` functions will parse the string for us and return chunks. Let's try an example.

In [28]:
text = "Mary works hard. Mary gets good grades. Our student Mary is successful."

# This is a bit of a fabricated example, but let's split this on all instances of Mary
re.split("Mary", text)

['', ' works hard. ', ' gets good grades. Our student ', ' is successful.']

You'll notice that split has returned an empty string, followed by a number of statements about Mary, all as
elements of a list. If we wanted to count how many times we have talked about Mary, we could use `findall()`.

In [29]:
re.findall("Mary", text)

['Mary', 'Mary', 'Mary']

Ok, so we've seen that `.search()` looks for some pattern and returns a match object (or `None`) that we can
test in a condition, that `.split()` will use a pattern for creating a list of substrings, and that
`.findall()` will look for a pattern and pull out all occurrences.
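
As a rough sketch of tokenizing with these functions: the earlier `split(' ')` approach left punctuation attached to words such as `'happened,'`. Assuming we only care about runs of word characters, a pattern like `\w+` gives cleaner tokens (the `\w` shorthand and the `+` quantifier are covered later in this lecture). The sentence is a shortened version of the earlier quote, and the names `sentence`, `tokens` and `pieces` are illustrative.

```python
import re

sentence = "The impossible could not have happened, therefore it must be possible."

# findall returns every non-overlapping run of word characters,
# so commas and full stops are dropped from the tokens.
tokens = re.findall(r"\w+", sentence)
print(tokens)

# split on runs of spaces, commas and full stops; the final '.' leaves
# an empty string at the end of the list, much like the 'Mary' example.
pieces = re.split(r"[ ,.]+", sentence)
print(pieces)
```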

Now that we know how the Python regex API works, let's talk about more complex patterns. The regex
specification standard defines a markup language to describe patterns in text. Let's start with anchors.
Anchors specify the start and/or the end of the string that you are trying to match. The caret character ``^``
means start and the dollar sign character ``$`` means end. If you put ``^`` before a string, it means that the text
the regex processor retrieves must start with the string you specify. For the end, you put the ``$``
character after the string, which means that the text the regex processor retrieves must end with the string you specify.

Here's an example.

In [30]:
text = "Mary works diligently. Mary gets good grades. Our student Mary is successful."

# Let's see if this begins with Mary
re.search("^Mary",text)

<re.Match object; span=(0, 4), match='Mary'>

Notice that `re.search()` actually returned a new object, an `re.Match` object. An `re.Match` object
always has a boolean value of `True`, as something was found, so you can always evaluate it in an `if` statement
as we did earlier. The rendering of the match object also tells you what text was matched, in this case
the word 'Mary', and the location of the match, as the span.
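
The example above only exercises the `^` anchor. A minimal sketch of the `$` anchor follows, using a shortened, hypothetical version of the sentence; the names `ends_ok` and `ends_mary` are illustrative.

```python
import re

text = "Mary works diligently. Our student Mary is successful."

# '$' anchors the pattern to the end of the string; the full stop is escaped
# because an unescaped '.' would match any character.
ends_ok = re.search(r"successful\.$", text)
ends_mary = re.search(r"Mary$", text)   # None: the string does not end with "Mary"

print(ends_ok.group())  # successful.
print(ends_mary)        # None
```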

# Patterns and Character Classes

Let's talk more about patterns and start with character classes. Let's create a string of a single learner's
grades over a semester in one course across all of their assignments.

In [31]:
grades="ACAAAABCBCBAA"

# If we want to answer the question "How many B's were in the grade list?" we would just use B
re.findall("B",grades)

['B', 'B', 'B']

If we wanted to count the number of A's or B's in the list, we can't use "AB" since this is used to match
all A's followed immediately by a B. 

In [32]:
re.findall("AB",grades)

['AB']

Instead, we put the characters A and B inside square brackets

In [33]:
re.findall("[AB]",grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

This is called the set operator. You can also include a range of characters, which are ordered
alphanumerically. For instance, if we want to refer to all lower case letters we could use `[a-z]`. Let's build
a simple regex to parse out all instances where this student received an A followed by a B or a C.

In [34]:
re.findall("[A][B-C]",grades)

['AC', 'AB']

Notice how the `[AB]` pattern describes a set of possible characters which could be either (A OR B), while the
`[A][B-C]` pattern denotes two sets of characters which must be matched back to back. You can also write
this pattern using the pipe operator, which means OR.

In [35]:
re.findall("AB|AC",grades)

['AC', 'AB']

We can use the caret with the set operator to negate our results. For instance, if we want to parse out only
the grades which were not A's

In [36]:
re.findall("[^A]",grades)

['C', 'B', 'C', 'B', 'C', 'B']

Note this carefully - the caret was previously used as an anchor matching the beginning of a string, but
inside the set operator the caret means negation instead, and the other special characters we will be talking
about also lose their usual meaning. This can be a bit confusing. What do you think the result of this would be?

In [37]:
re.findall("^[^A]",grades)

[]

In [38]:
re.findall("[A]$",grades)

['A']

In [39]:
re.findall("^A|[AB]$",grades)

['A', 'A']

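The first result is an empty list, because the regex `^[^A]` says that we want to match any value at the
beginning of the string which is not an A. Our string though starts with an A, so there is no match found.
The other two patterns combine anchors with sets: `[A]$` matches an A at the very end of the string, and
`^A|[AB]$` matches either an A at the start or an A or B at the end. And remember, when you are
using the set operator you are doing character-based matching. So you are matching individual characters in
an OR fashion.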

# Quantifiers

Ok, so we've talked about anchors and matching to the beginning and end of patterns. And we've talked about
characters and using sets with the `[]` notation. We've also talked about character negation, and how the pipe
`|` character allows us to express `or` operations. Let's move on to quantifiers.

Quantifiers specify the number of times a pattern must be repeated in order to match. The most basic
quantifier is expressed as `e{m,n}`, where `e` is the expression or character we are matching, `m` is the minimum
number of times you want it to be matched, and `n` is the maximum number of times the item could be matched.

Let's use these grades as an example. How many times has this student been on a back-to-back A's streak?

In [40]:
re.findall("A{2,10}",grades) # we'll use 2 as our min, but ten as our max

['AAAA', 'AA']

So we see that there were two streaks, one where the student had four A's, and one where they had only two
A's.

We might try to do this using single values and just repeating the pattern.

In [41]:
re.findall("A{1,1}A{1,1}",grades)

['AA', 'AA', 'AA']

As you can see, this is different from the first example. The first pattern is looking for any run
of two A's up to ten A's in a row. So it sees four A's as a single streak. The second pattern is looking for
exactly two A's back to back, so it sees two A's followed immediately by two more A's. We say that the regex
processor begins at the start of the string and consumes characters which match patterns as it goes.

It's important to note that the regex quantifier syntax does not allow you to deviate from the `{m,n}`
pattern. In particular, if you have an extra space in between the braces you'll get an empty result. Written
without spaces, the quantifier behaves as expected:

In [43]:
re.findall("A{2,2}",grades)

['AA', 'AA', 'AA']
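
For comparison, here is a short sketch of what happens when a space slips into the braces. As far as I know, Python's `re` module then stops treating the braces as a quantifier and reads them as literal text, so nothing in the grades string matches. The names `with_space` and `without_space` are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# With a space after the comma, {2, 2} is not recognised as a quantifier;
# the pattern looks for the literal text "A{2, 2}", which is not in the string.
with_space = re.findall("A{2, 2}", grades)
without_space = re.findall("A{2,2}", grades)

print(with_space)     # []
print(without_space)  # ['AA', 'AA', 'AA']
```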

And as we have already seen, if we don't include a quantifier then the default is {1,1}

In [44]:
re.findall("AA",grades)

['AA', 'AA', 'AA']

And if you just have one number in the braces, it's considered to be both m and n

In [45]:
re.findall("A{2}",grades)

['AA', 'AA', 'AA']

Using this, we could find a decreasing trend in a student's grades

In [46]:
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['AAAABC']

Now, that's a bit of a hack, because we included a maximum that was just arbitrarily large. There are three
other quantifiers that are used as shorthand: an asterisk `*` to match 0 or more times, a question mark `?` to
match 0 or 1 times, and a plus sign `+` to match one or more times. Let's look at a more complex example,
and load some data scraped from Wikipedia (https://en.wikipedia.org/wiki/Python_(programming_language)).

In [47]:
with open("Python.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

'History[edit]\n\nPython was conceived in the late 1980s[34] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC programming language, which was inspired by SETL),[35] capable of exception handling and interfacing with the Amoeba operating system.[8] Its implementation began in December 1989.[36] Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python\'s Benevolent Dictator For Life, a title the Python community bestowed upon him to reflect his long-term commitment as the project\'s chief decision-maker.[37] He now shares his leadership as a member of a five-person steering council.[38][39][40] In January 2019, active Python core developers elected Brett Cannon, Nick Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member "Steering Council" to lead the project.[41] Guido van Rossum has since then w

Scanning through this document, one of the things we notice is that the headers all have the text [edit] behind them, followed by a newline character. So if we wanted to get a list of all of the headers in this
article we could do so using `re.findall`.

In [48]:
re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)  # raw string, so \[ reaches the regex engine untouched

['History[edit]',
 'features[edit]',
 'semantics[edit]',
 'Indentation[edit]',
 'flow[edit]',
 'Expressions[edit]']

Ok, that didn't quite work. It got all of the headers, but only the last word of each header, and it really
was quite clunky. Let's iteratively improve this. First, we can use `\w` to match any letter, digit, or
underscore.

In [49]:
re.findall(r"[\w]{1,100}\[edit\]", wiki)

['History[edit]',
 'features[edit]',
 'semantics[edit]',
 'Indentation[edit]',
 'flow[edit]',
 'Expressions[edit]']

This is something new. `\w` is a metacharacter, and indicates a special pattern of any letter, digit, or underscore. There
are actually a number of different metacharacters listed in the documentation. For instance, `\s` matches any
whitespace character.

Next, there are three other quantifiers we can use which shorten up the curly brace syntax. We can use an
asterisk `*` to match 0 or more times, so let's try that.

In [50]:
re.findall(r"[\w]*\[edit\]", wiki)

['History[edit]',
 'features[edit]',
 'semantics[edit]',
 'Indentation[edit]',
 'flow[edit]',
 'Expressions[edit]']
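
The other two shorthand quantifiers, `+` and `?`, are easiest to see on the earlier grades string. The following is a small sketch; the variable names `trend`, `a_then_optional_b` and `b_then_cs` are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# '+' means one or more: the decreasing trend without arbitrary upper bounds
trend = re.findall("A+B+C+", grades)
print(trend)              # ['AAAABC']

# '?' means zero or one: every A, optionally followed by a single B
a_then_optional_b = re.findall("AB?", grades)
print(a_then_optional_b)  # ['A', 'A', 'A', 'A', 'AB', 'A', 'A']

# '*' means zero or more: every B, followed by any number of C's
b_then_cs = re.findall("BC*", grades)
print(b_then_cs)          # ['BC', 'BC', 'B']
```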

Now that we have shortened the regex, let's improve it a little bit. We can add in spaces using the space
character.

In [51]:
re.findall(r"[\w ]*\[edit\]", wiki)

['History[edit]',
 'Design philosophy and features[edit]',
 'Syntax and semantics[edit]',
 'Indentation[edit]',
 'Statements and control flow[edit]',
 'Expressions[edit]']

Ok, so this gets us the list of section titles in the wikipedia page! You can now create a list of titles by
iterating through this and applying another regex

In [52]:
for title in re.findall(r"[\w ]*\[edit\]", wiki):
    # Now we will take that intermediate result and split on the square bracket [, just taking the first result
    print(re.split(r"[\[]", title)[0])

History
Design philosophy and features
Syntax and semantics
Indentation
Statements and control flow
Expressions


# Groups

Ok, this works, but it's a bit of a pain. To this point we have been talking about a regex as a single
pattern which is matched. But you can actually match different patterns, called groups, at the same time,
and then refer to the groups you want. To group patterns together you use parentheses, which is actually
pretty natural. Let's rewrite our `findall` using groups.

In [53]:
re.findall(r"([\w ]*)(\[edit\])", wiki)

[('History', '[edit]'),
 ('Design philosophy and features', '[edit]'),
 ('Syntax and semantics', '[edit]'),
 ('Indentation', '[edit]'),
 ('Statements and control flow', '[edit]'),
 ('Expressions', '[edit]')]

Nice - we see that the Python `re` module breaks out the result by group. We can actually refer to groups by
number as well with the match objects that are returned. But how do we get back a list of match objects?
Thus far we've seen that ``findall()`` returns strings, and ``search()`` and ``match()`` return individual ``Match`` objects. To get a match object for every occurrence, we use the function ``finditer()``.

In [54]:
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.groups())

('History', '[edit]')
('Design philosophy and features', '[edit]')
('Syntax and semantics', '[edit]')
('Indentation', '[edit]')
('Statements and control flow', '[edit]')
('Expressions', '[edit]')


We see here that the ``groups()`` method returns a tuple of the groups. We can get an individual group using
`group(number)`, where `group(0)` is the whole match, and each other number is the portion of the match we are
interested in. Called with no argument, `group()` defaults to `group(0)` and returns the whole match; to get only the title we would use `group(1)`.

In [57]:
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.group()) # by default number=0, i.e. the whole match

History[edit]
Design philosophy and features[edit]
Syntax and semantics[edit]
Indentation[edit]
Statements and control flow[edit]
Expressions[edit]
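
To pull out only the title rather than the whole match, `group(1)` can be used instead. Here is a minimal sketch on a tiny, hypothetical stand-in for the wiki text (the `sample` string and the `titles` name are made up for illustration, since `Python.txt` is not included here).

```python
import re

# A tiny hypothetical stand-in for the wiki text loaded earlier
sample = ("History[edit]\n\nPython was conceived in the late 1980s.\n\n"
          "Design philosophy and features[edit]\n")

# group(1) is the first parenthesised group, i.e. the title without "[edit]"
titles = [m.group(1) for m in re.finditer(r"([\w ]*)(\[edit\])", sample)]
print(titles)  # ['History', 'Design philosophy and features']
```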


One more piece of regex is labeling or naming groups. In the previous example I showed you how you can use the position of the group. But giving them a label and looking at the results as a dictionary is pretty useful. For that we use the syntax `(?P<name>)`, where the parenthesis starts the group, the `?P` indicates that this is an extension to basic regexes, and `<name>` is the dictionary key we want to use wrapped in `<>`.

In [58]:
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

History
Design philosophy and features
Syntax and semantics
Indentation
Statements and control flow
Expressions


In [59]:
# Let's see the whole dictionary for each match
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    print(item.groupdict())

{'title': 'History', 'edit_link': '[edit]'}
{'title': 'Design philosophy and features', 'edit_link': '[edit]'}
{'title': 'Syntax and semantics', 'edit_link': '[edit]'}
{'title': 'Indentation', 'edit_link': '[edit]'}
{'title': 'Statements and control flow', 'edit_link': '[edit]'}
{'title': 'Expressions', 'edit_link': '[edit]'}


Note that after the loop finishes, `item` still refers to the last match, so we can print out its whole
dictionary too and see that the `[edit]` string is still in there. Here's the dictionary kept for the last match.

In [60]:
print(item.groupdict())

{'title': 'Expressions', 'edit_link': '[edit]'}


Ok, we have seen how we can match individual character patterns with `[]`, how we can group matches together
using `()`, and how we can use quantifiers such as `*`, `?`, or `{m,n}` to describe patterns. We have also seen `\w`, which stands for any word character. There are a number of shorthands which are used with regexes for different kinds of characters, including:
* `.` for any single character which is not a newline
* `\d` for any digit
* `\s` for any whitespace character, like spaces and tabs

There are more, and a full list can be found in the Python documentation for regexes - https://docs.python.org/3/library/re.html
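
A brief sketch of these three shorthands in action follows. The input strings and the variable names `numbers`, `words` and `three_letter` are made up for illustration.

```python
import re

# '\d' matches a digit, so '\d+' pulls out whole numbers
numbers = re.findall(r"\d+", "Released on 16 October 2000")
print(numbers)       # ['16', '2000']

# '\s' matches whitespace, so splitting on '\s+' handles spaces and tabs alike
words = re.split(r"\s+", "spaces  and\ttabs")
print(words)         # ['spaces', 'and', 'tabs']

# '.' matches any single character except a newline, so "b\nt" is skipped
three_letter = re.findall(r"b.t", "bat bit b\nt but")
print(three_letter)  # ['bat', 'bit', 'but']
```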

# Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data.
Pandas builds on this and provides a comprehensive set of *vectorized string operations* that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data.
In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

## Introducing Pandas String Operations

We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

In [61]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

array([ 4,  6, 10, 14, 22, 26])

This *vectorization* of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.
For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:

In [62]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
# data.capitalize() would raise an AttributeError, because lists have no such method
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values.
For example:

In [63]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the ``str`` attribute of Pandas Series and Index objects containing strings.
So, for example, suppose we create a Pandas Series with this data:

In [64]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping over any missing values:

In [65]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.

## Tables of Pandas String Methods

If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties.
The examples in this section use the following series of names:

In [67]:
detectives = pd.Series(['James Bond', 'Johnny English', 'Sherlock Holmes',
                   'Hercule Poirot', 'Jane Marple', 'Jules Maigret'])

### Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [68]:
detectives.str.lower()

0         james bond
1     johnny english
2    sherlock holmes
3     hercule poirot
4        jane marple
5      jules maigret
dtype: object

But some others return numbers:

In [69]:
detectives.str.len()

0    10
1    14
2    15
3    14
4    11
5    13
dtype: int64

Or Boolean values:

In [71]:
detectives.str.startswith('S')

0    False
1    False
2     True
3    False
4    False
5    False
dtype: bool

Still others return lists or other compound values for each element:

In [72]:
detectives.str.split()

0         [James, Bond]
1     [Johnny, English]
2    [Sherlock, Holmes]
3     [Hercule, Poirot]
4        [Jane, Marple]
5      [Jules, Maigret]
dtype: object

### Methods using regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

With these, you can do a wide range of interesting operations.
For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

In [74]:
detectives.str.extract('([A-Za-z]+)') # or r'(\w+)'
# you can also pass expand=False to get a Series back instead of a DataFrame

|   | 0 |
|---|---|
| 0 | James |
| 1 | Johnny |
| 2 | Sherlock |
| 3 | Hercule |
| 4 | Jane |
| 5 | Jules |


In [75]:
detectives.str.extract(r'(\w+)( \w+)')

|   | 0 | 1 |
|---|---|---|
| 0 | James | Bond |
| 1 | Johnny | English |
| 2 | Sherlock | Holmes |
| 3 | Hercule | Poirot |
| 4 | Jane | Marple |
| 5 | Jules | Maigret |


In [76]:
detectives.str.extract(r'(?P<name>\w*)(?P<last_name> \w+)')

|   | name | last_name |
|---|------|-----------|
| 0 | James | Bond |
| 1 | Johnny | English |
| 2 | Sherlock | Holmes |
| 3 | Hercule | Poirot |
| 4 | Jane | Marple |
| 5 | Jules | Maigret |


Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:

In [77]:
detectives.str.findall(r'^[^AEIOU].*[^aeiou]$')

0         [James Bond]
1     [Johnny English]
2    [Sherlock Holmes]
3     [Hercule Poirot]
4                   []
5      [Jules Maigret]
dtype: object

The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data.
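
As a small sketch of two more methods from the table above, `contains()` and `replace()`, applied to the same series of names: the variable names `starts_with_j` and `underscored` are illustrative, and `regex=True` is passed explicitly to `replace()` on the assumption that the installed pandas version supports that parameter.

```python
import pandas as pd

detectives = pd.Series(['James Bond', 'Johnny English', 'Sherlock Holmes',
                        'Hercule Poirot', 'Jane Marple', 'Jules Maigret'])

# contains() applies re.search to each element and returns booleans
starts_with_j = detectives.str.contains(r'^J')

# replace() substitutes every match; regex=True treats the pattern as a regex
underscored = detectives.str.replace(r'\s+', '_', regex=True)

print(starts_with_j.tolist())  # [True, True, False, False, True, True]
print(underscored.tolist())
```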

### Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | Extract dummy variables as a DataFrame |

#### Vectorized item access and slicing

The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access from each string.
For example, we can get a slice of the first three characters of each string using ``str.slice(0, 3)``.
Note that this behavior is also available through Python's normal indexing syntax; for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:

In [78]:
detectives.str.slice(0, 3)

0    Jam
1    Joh
2    She
3    Her
4    Jan
5    Jul
dtype: object

In [79]:
detectives.str.slice(0, 3) == detectives.str[0:3]

0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

Indexing via ``df.str.get(i)`` and ``df.str[i]`` is likewise similar.
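
A minimal sketch of that equivalence, using a shortened version of the names series; the variable names `first_via_get` and `first_via_index` are illustrative.

```python
import pandas as pd

detectives = pd.Series(['James Bond', 'Johnny English', 'Sherlock Holmes'])

first_via_get = detectives.str.get(0)   # first character of each name
first_via_index = detectives.str[0]     # same result with indexing syntax

print(first_via_get.tolist())                 # ['J', 'J', 'S']
print(first_via_get.equals(first_via_index))  # True
```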

These ``get()`` and ``slice()`` methods also let you access elements of arrays returned by ``split()``.
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:

In [80]:
detectives.str.split().str.get(-1)

0       Bond
1    English
2     Holmes
3     Poirot
4     Marple
5    Maigret
dtype: object

#### Indicator variables

Another method that requires a bit of extra explanation is the ``get_dummies()`` method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains some categorical information, such as A="speaks French," B="born in the United Kingdom," C="likes cheese," D="likes Vesper Martini":

In [81]:
full_detectives = pd.DataFrame({'name': detectives,
                           'info': ['B|C|D', 'B|D', 'B|C',
                                    'A|C', 'B|C', 'B|C|D']})
full_detectives

|   | name | info |
|---|------|------|
| 0 | James Bond | B\|C\|D |
| 1 | Johnny English | B\|D |
| 2 | Sherlock Holmes | B\|C |
| 3 | Hercule Poirot | A\|C |
| 4 | Jane Marple | B\|C |
| 5 | Jules Maigret | B\|C\|D |


The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:

In [82]:
full_detectives['info'].str.get_dummies('|')

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 1 | 0 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 |
| 5 | 0 | 1 | 1 | 1 |


With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.

We won't dive further into these methods here, but I encourage you to read through ["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation.

This lecture has been an overview of regular expressions, and really, we've just scratched the surface of what 
you can do. 
But, there are lots of great examples and reference guides on the web, including the python
documentation for regex, and with these in hand you should be able to write concise and readable code which
performs well too. Having basic regex literacy is a core skill for applied data scientists. 

In [83]:
titanic = pd.read_csv(r'D:\Света\Фреймворки пайтон\dataframe\titanic.csv') # local path on the author's machine; a raw string avoids escaping backslashes

In [84]:
titanic['PClass'] = titanic['PClass'].str[0]
titanic['PClass'] = pd.to_numeric(titanic['PClass'],errors = 'coerce')
titanic

|   | PassengerID | Name | PClass | Age | Sex | Survived | SexCode |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Allen, Miss Elisabeth Walton | 1.0 | 29.00 | female | 1 | 1 |
| 1 | 2 | Allison, Miss Helen Loraine | 1.0 | 2.00 | female | 0 | 1 |
| 2 | 3 | Allison, Mr Hudson Joshua Creighton | 1.0 | 30.00 | male | 0 | 0 |
| 3 | 4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1.0 | 25.00 | female | 0 | 1 |
| 4 | 5 | Allison, Master Hudson Trevor | 1.0 | 0.92 | male | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1308 | 1309 | Zakarian, Mr Artun | 3.0 | 27.00 | male | 0 | 0 |
| 1309 | 1310 | Zakarian, Mr Maprieder | 3.0 | 26.00 | male | 0 | 0 |
| 1310 | 1311 | Zenni, Mr Philip | 3.0 | 22.00 | male | 0 | 0 |
| 1311 | 1312 | Lievens, Mr Rene | 3.0 | 24.00 | male | 0 | 0 |


In [86]:
titanic_fem = titanic[titanic['SexCode']==1]

In [89]:
names = titanic_fem['Name'].str.extract(r'(?P<last_name>\w*), (?P<title>\w*) (?P<name>[\w ]*)')
names

|   | last_name | title | name |
|---|-----------|-------|------|
| 0 | Allen | Miss | Elisabeth Walton |
| 1 | Allison | Miss | Helen Loraine |
| 3 | Allison | Mrs | Hudson JC |
| 6 | Andrews | Miss | Kornelia Theodosia |
| 8 | Appleton | Mrs | Edward Dale |
| ... | ... | ... | ... |
| 1283 | Vestrom | Miss | Hulda Amanda Adolfina |
| 1293 | Wilkes | Mrs | Ellen |
| 1304 | Yasbeck | Mrs | Antoni |
| 1306 | Zabour | Miss | Hileni |


In [90]:
names['title'].unique()


array(['Miss', 'Mrs', 'Madame', 'Lady', 'Dr', 'the', 'Ms', nan, 'Mlle',
       'Hilda', 'Jenny'], dtype=object)

In [93]:
names.groupby(['title']).count()

| title | last_name | name |
|-------|-----------|------|
| Dr | 1 | 1 |
| Hilda | 1 | 1 |
| Jenny | 1 | 1 |
| Lady | 1 | 1 |
| Madame | 1 | 1 |
| Miss | 236 | 236 |
| Mlle | 1 | 1 |
| Mrs | 200 | 200 |
| Ms | 13 | 13 |
| the | 1 | 1 |


In [95]:
names[names['title'] == 'the']

|   | last_name | title | name |
|---|-----------|-------|------|
| 214 | Rothes | the | Countess of |
