# Solutions

1. [Introduction to Regular Expressions](#1.-Introduction-to-Regular-Expressions)
1. [Quantifiers](#2.-Quantifiers)
1. [Or Conditions and Character Classes](#3.-Or-Conditions-and-Character-Classes)
1. [Grouping and Capturing](#4.-Grouping-and-Capturing)
1. [Multiline Regex Patterns](#5.-Multiline-Regex-Patterns)
1. [Project - Feature Engineering on the Titanic](#Project---Feature-Engineering-on-the-Titanic)

In [1]:
import re
import pandas as pd
import numpy as np

## 1. Introduction to Regular Expressions

In [2]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']
def find_pattern(s, pattern, **kwargs):
    filt = s.str.contains(pattern, **kwargs)
    return s[filt]

### Exercise 1

<span style="color:green; font-size:16px">Find all movies that have 2 consecutive z's in them.</span>

In [3]:
find_pattern(title, r'zz')

416                All That Jazz
907         The Dukes of Hazzard
1041                   Bedazzled
2234                   Paparazzi
2524                    Hot Fuzz
2593    The Lizzie McGuire Movie
3215       Into the Grizzly Maze
3535                Mystic Pizza
4399              Blue Like Jazz
Name: title, dtype: object

### Exercise 2

<span style="color:green; font-size:16px">Find all movies that begin with 9.</span>

In [4]:
find_pattern(title, r'^9')

1651                       9
2416                9½ Weeks
3705    90 Minutes in Heaven
Name: title, dtype: object

### Exercise 3

<span style="color:green; font-size:16px">Find all movies that have a `b` as their third character.</span>

In [5]:
find_pattern(title, r'^..b').head()

22                Robin Hood
228                  RoboCop
286           Public Enemies
448                   Robots
494    Babe: Pig in the City
Name: title, dtype: object

### Exercise 4

<span style="color:green; font-size:16px">Find all movies with a fourth-to-last character of `M` and a last character of `e`.</span>

In [6]:
find_pattern(title, r'M..e$')

704            The Green Mile
1167                   8 Mile
1616                Like Mike
2122           Moonlight Mile
2486             How She Move
2653                 The Muse
2913    Max Keeble's Big Move
3215    Into the Grizzly Maze
3406               Magic Mike
3696                     Made
4805        The World Is Mine
Name: title, dtype: object

### Exercise 5

<span style="color:green; font-size:16px">Use a regular expression to find movies that are exactly 6 characters in length.</span>

In [7]:
find_pattern(title, r'^......$').head()

0      Avatar
41     Cars 2
58     WALL·E
125    Frozen
168    Sahara
Name: title, dtype: object

### Exercise 6

<span style="color:green; font-size:16px">Complete exercise 5 using a different string-only Series method that does not require a regex.</span>

In [8]:
filt = title.str.len() == 6
title[filt].head(10)

0      Avatar
41     Cars 2
58     WALL·E
125    Frozen
168    Sahara
292    Eraser
298    Eragon
368    Pixels
426    Jumper
428    Zodiac
Name: title, dtype: object

### Exercise 7

<span style="color:green; font-size:16px">Find all movies containing the letter `'q'` ignoring case.</span>

In [9]:
find_pattern(title, r'q', flags=re.I).head()

12                           Quantum of Solace
73                               Suicide Squad
590    Alvin and the Chipmunks: The Squeakquel
762                             Gangster Squad
839                              The Equalizer
Name: title, dtype: object

## 2. Quantifiers

In [10]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

### Exercise 1

<span style="color:green; font-size:16px">Find all movies that have `'z'` as their 15th character.</span>

In [11]:
find_pattern(title, r'^.{14}z')

2484      American Dreamz
2625    Ramona and Beezus
Name: title, dtype: object

### Exercise 2

<span style="color:green; font-size:16px">Find all movies that have the word `'Boy'` or `'Boys'` in them followed by a space.</span>

In [12]:
find_pattern(title, r'Boys? ')

188                        Bad Boys II
1864         Jimmy Neutron: Boy Genius
2528                    Boys and Girls
2859    The Boy in the Striped Pajamas
2907              The Boys from Brazil
3823                 The Boy Next Door
4166                    Boys Don't Cry
4575      All the Boys Love Mandy Lane
Name: title, dtype: object

### Exercise 3

<span style="color:green; font-size:16px">Find all movies that have between 40 and 43 characters in them. Can you verify the results with another `str` accessor method?</span>

In [13]:
m40_43 = find_pattern(title, r'^.{40,43}$')
m40_43.head()

1        Pirates of the Caribbean: At World's End
4      Star Wars: Episode VII - The Force Awakens
13     Pirates of the Caribbean: Dead Man's Chest
16       The Chronicles of Narnia: Prince Caspian
18    Pirates of the Caribbean: On Stranger Tides
Name: title, dtype: object

In [14]:
m40_43.str.len().head()

1     40
4     42
13    42
16    40
18    43
Name: title, dtype: int64

In [15]:
m40_43.str.len().value_counts()

40    15
41    12
42     9
43     9
Name: title, dtype: int64

### Exercise 4

<span style="color:green; font-size:16px">Find all movies that begin with 'The' and end in 'Movie'.</span>

In [16]:
find_pattern(title, r'^The.*Movie$')

319                                     The Peanuts Movie
561                                 The Angry Birds Movie
569                                    The Simpsons Movie
759                                        The Lego Movie
1586                      The SpongeBob SquarePants Movie
1734                                    The Rugrats Movie
1895                           The Wild Thornberrys Movie
2162                                     The Tigger Movie
2593                             The Lizzie McGuire Movie
2645    The Pirates Who Don't Do Anything: A VeggieTal...
3296                                     The Muppet Movie
4597                             The Kentucky Fried Movie
Name: title, dtype: object

### Exercise 5

<span style="color:green; font-size:16px">Find all movies that begin with 'The' and end in 'Movie' and have no more than 10 characters between these two words.</span>

In [17]:
find_pattern(title, r'^The.{,10}Movie$')

319      The Peanuts Movie
569     The Simpsons Movie
759         The Lego Movie
1734     The Rugrats Movie
2162      The Tigger Movie
3296      The Muppet Movie
Name: title, dtype: object

### Exercise 6

<span style="color:green; font-size:16px">Find all movies that begin with 'The' and end in 'Movie' and have at least 30 characters between these two words.</span>

In [18]:
find_pattern(title, r'^The.{30,}Movie$')

2645    The Pirates Who Don't Do Anything: A VeggieTal...
Name: title, dtype: object

### Exercise 7

<span style="color:green; font-size:16px">Find all movies that begin with capital `G` followed by at least one `o`, followed by a `d`.</span>

In [19]:
find_pattern(title, r'^Go+d')

98             Godzilla Resurgence
162                  Gods of Egypt
874              Gods and Generals
1688                       Godsend
1886                    Goodfellas
1906               Good Luck Chuck
2462                     Good Boy!
2551                          Good
2735                    Good Deeds
2787         Good Morning, Vietnam
3037             Good Will Hunting
3197               Good Intentions
3412    Good Night, and Good Luck.
3509               Good Bye Lenin!
3558               Goddess of Love
3919             Gods and Monsters
4438              God's Not Dead 2
4440                 Godzilla 2000
4649                     Good Kill
4793                     Good Dick
Name: title, dtype: object

### Exercise 8

<span style="color:green; font-size:16px">Find all movies that have either `Free` or `Fee` in them.</span>

In [20]:
find_pattern(title, r'Fr?ee')

174                      Happy Feet 2
266             Live Free or Die Hard
391                        Happy Feet
856                        Free Birds
1001              Free State of Jones
1507    Mandela: Long Walk to Freedom
1712             The Color of Freedom
1727                      Cry Freedom
2114                  Freedom Writers
2719                          Freedom
3181                       Free Style
3463                         Freeheld
4027                          Freeway
4226                     Freeze Frame
4459             20 Feet from Stardom
Name: title, dtype: object

### Exercise 9

<span style="color:green; font-size:16px">Find all movies that begin with any five characters followed by a space, followed by a `'t'`, ignoring case.</span>

In [21]:
find_pattern(title, r'^.{5} t', flags=re.I).head()

106    Alice Through the Looking Glass
107                    Shrek the Third
127               Thor: The Dark World
299          Where the Wild Things Are
384               K-19: The Widowmaker
Name: title, dtype: object

## 3. Or Conditions and Character Classes

In [22]:
import pandas as pd
title = pd.read_csv('../data/movie.csv')['title']
title.head()

0                                        Avatar
1      Pirates of the Caribbean: At World's End
2                                       Spectre
3                         The Dark Knight Rises
4    Star Wars: Episode VII - The Force Awakens
Name: title, dtype: object

### Exercise 1

<span style="color:green; font-size:16px">Find all movies that either start with `'C'` or end with `'c'`.</span>

In [23]:
find_pattern(title, r'^C|c$').head()

26                                 Titanic
27              Captain America: Civil War
41                                  Cars 2
86     Captain America: The Winter Soldier
118      Charlie and the Chocolate Factory
Name: title, dtype: object

### Exercise 2

<span style="color:green; font-size:16px">Find all movies that are 6 or fewer characters in length or between 30 and 33 characters in length.</span>

In [24]:
find_pattern(title, r'^.{,6}$|^.{30,33}$').head()

0                              Avatar
37    Transformers: Age of Extinction
41                             Cars 2
53     Transformers: Dark of the Moon
56                              Brave
Name: title, dtype: object

### Exercise 3

<span style="color:green; font-size:16px">Find all movies that have the word `'and'` followed by the word `'the'`.</span>

In [25]:
find_pattern(title, r'\band the\b').head()

9                 Harry Potter and the Half-Blood Prince
54     Indiana Jones and the Kingdom of the Crystal S...
64     The Chronicles of Narnia: The Lion, the Witch ...
81                           Snow White and the Huntsman
100                             The Fast and the Furious
Name: title, dtype: object

### Exercise 4

<span style="color:green; font-size:16px">Find all movies that have the word `'and'` followed by two words and then followed by the word `'the'`. In this exercise, words are defined as 1 or more consecutive "word" characters.</span>

In [26]:
find_pattern(title, r'\band \w+ \w+ the\b')

4400    Down and Out with the Dolls
Name: title, dtype: object

### Exercise 5

<span style="color:green; font-size:16px">Find all movies that begin with `'The'` followed by the next word that begins with digits.</span>

In [27]:
find_pattern(title, r'^The [0-9]')

212                                      The 13th Warrior
429                                           The 6th Day
1354                                         The 5th Wave
1817                               The 40-Year-Old Virgin
1958                                               The 33
3567                                      The 5th Quarter
4373    The 41-Year-Old Virgin Who Knocked Up Sarah Ma...
Name: title, dtype: object

### Exercise 6

<span style="color:green; font-size:16px">Find all movies that have three consecutive capital letters in them.</span>

In [28]:
find_pattern(title, r'[A-Z]{3}').head(10)

4        Star Wars: Episode VII - The Force Awakens
40                                     TRON: Legacy
58                                           WALL·E
140                         Mission: Impossible III
177                                         The BFG
233    Star Wars: Episode III - Revenge of the Sith
340                               Jurassic Park III
440                                           RED 2
589                         AVP: Alien vs. Predator
725                                             RED
Name: title, dtype: object

### Exercise 7

<span style="color:green; font-size:16px">Find all movies that begin and end with a capital letter.</span>

In [29]:
find_pattern(title, r'^[A-Z].*[A-Z]$').head()

46                 World War Z
58                      WALL·E
140    Mission: Impossible III
151            Men in Black II
177                    The BFG
Name: title, dtype: object

### Exercise 8

<span style="color:green; font-size:16px">Find all the movies that have a digit followed by a comma followed by a digit.</span>

In [30]:
find_pattern(title, r'[0-9],[0-9]')

276                                10,000 B.C.
3266    Ultramarines: A Warhammer 40,000 Movie
3641              20,000 Leagues Under the Sea
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Exercise 9

<span style="color:green; font-size:16px">Find all the movies that have either an ampersand or a question mark in them.</span>

In [31]:
find_pattern(title, r'[&?]').head(15)

129                                      Angels & Demons
145                                Mr. Peabody & Sherman
214                                       Batman & Robin
252                                     Mr. & Mrs. Smith
278                                       Town & Country
347    Percy Jackson & the Olympians: The Lightning T...
425             Cats & Dogs: The Revenge of Kitty Galore
437                                        Lilo & Stitch
512                 The Adventures of Rocky & Bullwinkle
703                                          Marley & Me
715                                          Cats & Dogs
740                                      Starsky & Hutch
747                                  Up Close & Personal
813                      Did You Hear About the Morgans?
850                                         Tango & Cash
Name: title, dtype: object

### Exercise 10

<span style="color:green; font-size:16px">Which movie has the most ampersands, question marks, and periods in it?</span>

In [32]:
count = title.str.count(r'[&.?]')
count.head()

0    0
1    0
2    0
3    0
4    0
Name: title, dtype: int64

In [33]:
idx = count.idxmax()
idx

542

In [34]:
title.loc[idx]

'The Man from U.N.C.L.E.'

### Exercise 11

<span style="color:green; font-size:16px">Find all the movies with exactly three words with each word no more than 6 characters in length. For this exercise, a word is defined as consecutive non-space characters followed by exactly one space (or end of string).</span>

In [35]:
find_pattern(title, r'^\S{,6} \S{,6} \S{,6}$').head()

14    The Lone Ranger
15       Man of Steel
32         Iron Man 3
43        Toy Story 3
46        World War Z
Name: title, dtype: object

### Exercise 12

<span style="color:green; font-size:16px">Find all movies that have four consecutive non-word characters.</span>

In [36]:
find_pattern(title, r'\W{4}')

252                      Mr. & Mrs. Smith
3703    Jekyll and Hyde... Together Again
3828         What the #$*! Do We (K)now!?
4498                  The Helix... Loaded
Name: title, dtype: object

### Exercise 13

<span style="color:green; font-size:16px">Find all movies that have at least one word that ends in `'ats'`.</span>

In [37]:
find_pattern(title, r'ats\b')

425     Cats & Dogs: The Revenge of Kitty Galore
715                                  Cats & Dogs
1552                            Cats Don't Dance
1593                 Rugrats in Paris: The Movie
1607                                 Hope Floats
1734                           The Rugrats Movie
1896                             Rugrats Go Wild
2013                  The Men Who Stare at Goats
2084                     Josie and the Pussycats
3001                                     Tomcats
3497                                        Bats
3513                                    Mallrats
Name: title, dtype: object

### Exercise 14

<span style="color:green; font-size:16px">Find all the movies containing, but not ending in `'tes'`.</span>

In [38]:
find_pattern(title, r'tes\B')

1140                The Sweetest Thing
1941     The Greatest Game Ever Played
1976        The World's Fastest Indian
2238      The Greatest Story Ever Told
2547                The White Countess
2708        Bathory: Countess of Blood
3169              World's Greatest Dad
3569                      The Greatest
3817        The Greatest Show on Earth
4271      The Greatest Movie Ever Sold
4845    The Past is a Grotesque Animal
Name: title, dtype: object
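The difference between `\B` and `\b` in the pattern above can be checked on a few isolated words. This is a minimal sketch using the standard `re` module; the word list is illustrative and is not read from the movie data.

```python
import re

# Illustrative words only (an assumption, not the dataset).
words = ['Greatest', 'Pirates', 'Countess']

# \B succeeds where there is NO word boundary, so 'tes' must be
# followed by another word character.
inside = [w for w in words if re.search(r'tes\B', w)]

# \b succeeds AT a word boundary, so 'tes' must end the word.
at_end = [w for w in words if re.search(r'tes\b', w)]

print(inside)
print(at_end)
```

Swapping `\B` for `\b` flips the result, which is why `tes\B` keeps titles such as `'The Greatest'` and drops words that merely end in `'tes'`.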

### Exercise 15

<span style="color:green; font-size:16px">Find all movies containing a word that is at least 7 lowercase letters in length.</span>

In [39]:
find_pattern(title, r'\b[a-z]{7,}\b')

2310    Les couloirs du temps: Les visiteurs II
2824              Del 1 - Män som hatar kvinnor
3455                              Les visiteurs
3584                        L'auberge espagnole
3761                            Camping sauvage
3925                          La otra conquista
Name: title, dtype: object

### Exercise 16

<span style="color:green; font-size:16px">Find all movies that have a word that contains, but does not start with `'Z'`.</span>

In [40]:
find_pattern(title, r'\BZ')

2127    eXistenZ
Name: title, dtype: object

### Exercise 17

<span style="color:green; font-size:16px">Find all movies containing a word starting with `'W'` and ending with `'w'`.</span>

In [41]:
find_pattern(title, r'\bW\w*w\b')

1229                Secret Window
2833    The Widow of Saint-Pierre
Name: title, dtype: object

### Exercise 18

<span style="color:green; font-size:16px">Count the total number of digit characters between 7 and 9 in all of the movies.</span>

In [42]:
title.str.count(r'[7-9]').sum()

53

### Exercise 19

<span style="color:green; font-size:16px">Use the `count` method to count the number of words in each title. Use consecutive word characters as the definition of a word.</span>

In [43]:
title.str.count(r'\b\w+\b').head()

0    1
1    8
2    1
3    4
4    7
Name: title, dtype: int64

## 4. Grouping and Capturing

### Exercise 1

<span style="color:green; font-size:16px">Find all movies that begin with the exact words `'I'`, `'An'`, or `'How'`, are followed by at least 20 characters, and end in lowercase `'n'` through `'s'`. Make sure there is no warning.</span>

In [44]:
find_pattern(title, r'^(?:I|An|How)\b.{19,}[n-s]$')

93                       How to Train Your Dragon
215                How the Grinch Stole Christmas
929                  How to Lose a Guy in 10 Days
2009        I Still Know What You Did Last Summer
2406                      I Love You, Beth Cooper
2457              I Know What You Did Last Summer
2847                    I Love You Phillip Morris
3177    An Alan Smithee Film: Burn Hollywood Burn
Name: title, dtype: object

### Exercise 2

<span style="color:green; font-size:16px">Find all movies that have a lowercase consonant followed by a lowercase vowel followed immediately by a repeat of those same two letters. For example, `'Banana'` repeats `'na'` twice in succession.</span>

In [45]:
find_pattern(title, r'([bcdfghjklmnpqrstvwxyz][aeiou])\1')

2294    Princess Mononoke
2559               Ararat
2969             Madadayo
3196            Partition
4231              Bananas
Name: title, dtype: object

### Exercise 3

<span style="color:green; font-size:16px">For all movies that begin with 'The' and are followed by the next word that begins with a digit, extract just the digits part of the word.</span>

In [46]:
title.str.extract(r'^The (\d+)').dropna()

| | 0 |
|---|---|
| 212 | 13 |
| 429 | 6 |
| 1354 | 5 |
| 1817 | 40 |
| 1958 | 33 |
| 3567 | 5 |
| 4373 | 41 |


### Exercise 4

<span style="color:green; font-size:16px">Find all movies that have two separate numbers in them. An example would be, '7 days and 7 nights'.</span>

In [47]:
find_pattern(title, r'\d+\D+\d+')

276                                10,000 B.C.
289                 The Taking of Pelham 1 2 3
509                           2 Fast 2 Furious
1043                              3:10 to Yuma
1610                            13 Going on 30
1617        Naked Gun 33 1/3: The Final Insult
2466                     40 Days and 40 Nights
2646                                     U2 3D
3266    Ultramarines: A Warhammer 40,000 Movie
3308                                     50/50
3516                           Fahrenheit 9/11
3576                                     11:14
3641              20,000 Leagues Under the Sea
3934                                      2:13
4210                   24 7: Twenty Four Seven
4376                    Friday the 13th Part 2
4532              4 Months, 3 Weeks and 2 Days
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Exercise 5

<span style="color:green; font-size:16px">Find all movies that have six or more non-vowel and non-space characters in a row.</span>

In [48]:
find_pattern(title, r'[^aeiouAEIOU\s]{6,}')

276                                10,000 B.C.
542                    The Man from U.N.C.L.E.
1935                          Punch-Drunk Love
2392                                  Catch-22
2480                         Brooklyn's Finest
2507                   When Harry Met Sally...
2912        Tales from the Crypt: Demon Knight
3266    Ultramarines: A Warhammer 40,000 Movie
3641              20,000 Leagues Under the Sea
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Exercise 6

<span style="color:green; font-size:16px">Extract the very next non-space character after 't' or 'T' for each movie, convert to lowercase, and then return the count of each character.</span>

In [49]:
# extract returns a DataFrame by default
title.str.extract(r'[Tt](\S)').dropna().head()

| | 0 |
|---|---|
| 0 | a |
| 1 | e |
| 2 | r |
| 3 | h |
| 4 | a |


In [50]:
# Select first and only column as a Series with `[0]`.
title.str.extract(r'[Tt](\S)')[0].str.lower().value_counts().head(10)

h    1486
e     275
o     198
i     179
a     164
r     150
t      81
y      69
u      53
s      49
Name: 0, dtype: int64

In [51]:
# there is an `expand` parameter that you can set to False to return a Series
title.str.extract(r'[Tt](\S)', expand=False).head()

0    a
1    e
2    r
3    h
4    a
Name: title, dtype: object

### Exercise 7

<span style="color:green; font-size:16px">Extract all the words that begin with 'T' or 't' and end in 'e', then find their frequency, converting to lowercase first.</span>

In [52]:
title.str.extractall(r'\b([tT]\w*e)\b')[0].str.lower().value_counts().head(10)

the        1555
time         26
tale         12
true         10
three         8
teenage       7
take          6
there         6
trouble       4
trade         4
Name: 0, dtype: int64

### Exercise 8

<span style="color:green; font-size:16px">Find all movies containing a 6, 7, and 8 character word.</span>

In [53]:
find_pattern(title, r'^(?=.*\b\w{6}\b)(?=.*\b\w{7}\b)(?=.*\b\w{8}\b)')

10                     Batman v Superman: Dawn of Justice
101                   The Curious Case of Benjamin Button
193              Harry Potter and the Prisoner of Azkaban
209                    Sherlock Holmes: A Game of Shadows
223                       Charlie's Angels: Full Throttle
1239                         Aliens vs. Predator: Requiem
1736                              Florence Foster Jenkins
1775                How to Lose Friends & Alienate People
2075                       Deuce Bigalow: European Gigolo
2113                              Silver Linings Playbook
2365    Borat: Cultural Learnings of America for Make ...
2711          Dungeons & Dragons: Wrath of the Dragon God
2894                              Johnson Family Vacation
3283     The Haunting in Connecticut 2: Ghosts of Georgia
4110    Marilyn Hotchkiss' Ballroom Dancing and Charm ...
4373    The 41-Year-Old Virgin Who Knocked Up Sarah Ma...
Name: title, dtype: object
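Each lookahead in the pattern above is anchored at the start of the title and consumes no characters, so the 6-, 7-, and 8-character words may appear in any order. The sketch below illustrates this with the standard `re` module; the second and third sample strings are made up for the demonstration.

```python
import re

# Each (?=...) starts at position 0 and only asserts that a word of the
# given length exists somewhere in the string; nothing is consumed.
pattern = re.compile(r'^(?=.*\b\w{6}\b)(?=.*\b\w{7}\b)(?=.*\b\w{8}\b)')

samples = [
    'Silver Linings Playbook',   # 6, 7, 8 characters, in that order
    'Playbook Linings Silver',   # hypothetical: same words reversed
    'Silver Linings',            # hypothetical: no 8-character word
]

matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

A single pattern without lookaheads would have to enumerate every ordering of the three word lengths.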

### Exercise 9

<span style="color:green; font-size:16px">Find all movies that have a `'q'` that is not followed by a `'u'`.</span>

In [54]:
find_pattern(title, r'q(?!u)', flags=re.I)

1366                               John Q
3294           Valley of the Wolves: Iraq
4403                                    Q
4566    Iraq for Sale: The War Profiteers
Name: title, dtype: object

### Exercise 10

<span style="color:green; font-size:16px">Find all movies containing an `'s'` that is not preceded by a vowel or a single quote mark. The `'s'` cannot be the first or last letter of the word. Ignore case.</span>

In [55]:
find_pattern(title, r"^.*(?<![aeiou'])\Bs\B", flags=re.I).head()

29                         Jurassic World
35                    Monsters University
36    Transformers: Revenge of the Fallen
37        Transformers: Age of Extinction
50                       The Great Gatsby
Name: title, dtype: object

### Exercise 11

<span style="color:green; font-size:16px">Find all movies where the first four characters of the first word repeat at least 10 characters after the fourth character somewhere else in the title. Use a named group.</span>

In [56]:
find_pattern(title, r'^(?P<first_four>\w{4}).{10,}(?P=first_four)')

1690    Hoodwinked Too! Hood vs. Evil
3412       Good Night, and Good Luck.
Name: title, dtype: object

### Exercise 12

<span style="color:green; font-size:16px">Find all movies that repeat one character three or more times in a row.</span>

In [57]:
find_pattern(title, r'(.)\1\1').head(10)

140                          Mission: Impossible III
233     Star Wars: Episode III - Revenge of the Sith
276                                      10,000 B.C.
340                                Jurassic Park III
474                                    Fantasia 2000
697                          3000 Miles to Graceland
810                                        Rambo III
886                          The Godfather: Part III
972                            Beverly Hills Cop III
1191                     Back to the Future Part III
Name: title, dtype: object

### Exercise 13

<span style="color:green; font-size:16px">Find all movies where a word of at least four characters in length repeats but has a different starting letter. For example, `'Hocus Pocus'` and `'Kill Bill'` both work.</span>

In [58]:
find_pattern(title, r'\b(\w)(\w{3,})\b.*\b(?!\1)\w\2\b')

846                       Kill Bill: Vol. 1
849                       Kill Bill: Vol. 2
1616                              Like Mike
1650    What's the Worst That Could Happen?
1754                            Hocus Pocus
1831                            Pain & Gain
4461                    Walking and Talking
4602        They Will Have to Kill Us First
Name: title, dtype: object
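The pattern above is dense, so it may help to read it piece by piece. The sketch below rewrites the same pattern with `re.VERBOSE` so that each part carries a comment, and tries it on three titles from the dataset using the standard `re` module rather than pandas.

```python
import re

rhyme = re.compile(r'''
    \b(\w)        # group 1: first letter of a word
    (\w{3,})\b    # group 2: rest of that word (3 or more characters)
    .*            # anything in between
    \b(?!\1)\w    # a later word starting with a different letter
    \2\b          # ...that ends exactly like group 2
''', re.VERBOSE)

samples = ['Hocus Pocus', 'Kill Bill: Vol. 1', 'Good Night, and Good Luck.']
results = [bool(rhyme.search(s)) for s in samples]
print(results)
```

The last title fails because its only repeated word, `'Good'`, starts with the same letter both times, which the negative lookahead `(?!\1)` rejects.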

### Exercise 14

<span style="color:green; font-size:16px">Extract all characters from the first digit through the last. Output the first 10 non-missing rows.</span>

In [59]:
title.str.extract(r'(\d.*\d)').dropna().head(10)

| | 0 |
|---|---|
| 60 | 2012 |
| 85 | 47 |
| 212 | 13 |
| 258 | 300 |
| 268 | 80 |
| 276 | 10,000 |
| 289 | 1 2 3 |
| 384 | 19 |
| 400 | 102 |
| 474 | 2000 |


### Exercise 15

<span style="color:green; font-size:16px">Extract all characters from the first digit through the second digit. Output the first 10 non-missing rows.</span>

In [60]:
title.str.extract(r'(\d.*?\d)').dropna().head(10)

| | 0 |
|---|---|
| 60 | 20 |
| 85 | 47 |
| 212 | 13 |
| 258 | 30 |
| 268 | 80 |
| 276 | 10 |
| 289 | 1 2 |
| 384 | 19 |
| 400 | 10 |
| 474 | 20 |
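Exercises 14 and 15 differ only in the quantifier: `.*` is greedy and `.*?` is lazy. A minimal sketch with the standard `re` module, applied to one title from the dataset:

```python
import re

title_text = 'The Taking of Pelham 1 2 3'

# Greedy: .* takes as much as possible, then backs off to the LAST digit.
greedy = re.search(r'\d.*\d', title_text).group()

# Lazy: .*? takes as little as possible, stopping at the NEXT digit.
lazy = re.search(r'\d.*?\d', title_text).group()

print(repr(greedy))
print(repr(lazy))
```

This matches the row for index 289 in the two outputs above: `1 2 3` for the greedy pattern and `1 2` for the lazy one.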


### Exercise 16

<span style="color:green; font-size:16px">Extract all characters from the first word that ends in `'e'` through the next word that ends in `'e'`. Ignore case.</span>

In [61]:
title.str.extract(r'(\b\w*?e\b.*?e\b)', flags=re.I).dropna()

| | 0 |
|---|---|
| 4 | Episode VII - The |
| 9 | the Half-Blood Prince |
| 14 | The Lone |
| 16 | The Chronicles of Narnia: Prince |
| 20 | The Hobbit: The |
| ... | ... |
| 4863 | The Naked Ape |
| 4868 | The Image |
| 4876 | Hide, Die |
| 4900 | The Circle |


## 5. Multiline Regex Patterns

In [62]:
news = pd.read_csv('../data/newsgroups.csv')
news.head()

| | category | text |
|---|---|---|
| 0 | sci.med | From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje... |
| 1 | talk.politics.guns | From: ndallen@r-node.hub.org (Nigel Allen)\nSu... |
| 2 | misc.forsale | From: mark@ardsley.business.uwo.ca (Mark Bramw... |
| 3 | misc.forsale | From: zmed16@trc.amoco.com (Michael)\nSubject:... |
| 4 | talk.politics.guns | From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n... |


In [63]:
text = news['text']

### Exercise 1

<span style="color:green; font-size:16px">Find the number of messages that have at least one line that begins with `'The'` and ends with a period.</span>

In [64]:
find_pattern(text, r'^The.*\.$', flags=re.M).size

15

### Exercise 2

<span style="color:green; font-size:16px">Find the messages that do not start with the word `'From'`.</span>

In [65]:
find_pattern(text, r'^(?!From)')

16     Subject: help for school\nFrom: mcrandall@eagl...
23     Subject: Re: PHILS, NL EAST NOT SO WEAK\nFrom:...
66     Organization: University of Illinois at Chicag...
82     Organization: University of Illinois at Chicag...
96     Subject: Life on Mars???\nFrom: schiewer@pa881...
162    Subject: Re: FORSALE: Men Without Hats- Folk o...
176    Nntp-Posting-Host: surt.ifi.uio.no\nFrom: Thom...
180    Subject: After-Market Cruise Controls: Specifi...
195    Subject: CDs for sale [update]\nFrom: koutd@hi...
213    Organization: University of Illinois at Chicag...
214    Subject: Space FAQ 01/15 - Introduction\nFrom:...
226    Organization: University of Illinois at Chicag...
227    Organization: Ryerson Polytechnical Institute\...
253    Subject: Re: Can't Breathe -- Update\nFrom: RG...
306    Subject: EXPERTS on PENICILLIN...LOOK!\nFrom: ...
317    Subject: Re: Non-lethal alternatives to handgu...
389    Organization: University of Illinois at Chicag...
399    Subject: STARGARDTS DISE

### Exercise 3

<span style="color:green; font-size:16px">Extract the first word from the messages that do not start with the word `'From'` and then count the occurrences of each of these words.</span>

In [66]:
text.str.extract(r'^(?!From)(\w+)').value_counts()

Subject         11
Organization     6
Nntp             1
dtype: int64

### Exercise 4

<span style="color:green; font-size:16px">Extract the header of each message and assign it to a variable named `header`. The header consists of the lines at the top that begin with `From`, `Subject`, `Organization`, etc. Take a look at a few of the individual messages to understand what pattern would work best. Make sure `header` is a Series of strings.</span>

In [67]:
# assume that two newlines separate the body and header
header = text.str.extract(r'^(.*?)\n\n', flags=re.S, expand=False)
header.head()

0    From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje...
1    From: ndallen@r-node.hub.org (Nigel Allen)\nSu...
2    From: mark@ardsley.business.uwo.ca (Mark Bramw...
3    From: zmed16@trc.amoco.com (Michael)\nSubject:...
4    From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n...
Name: text, dtype: object
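The header pattern above depends on two details: the `re.S` flag, which lets the dot cross newlines, and the lazy quantifier, which stops at the first blank line. The sketch below shows both with the standard `re` module on a made-up message (the `message` string is hypothetical, not taken from the newsgroups data).

```python
import re

# Hypothetical message: header lines, a blank line, then a body that
# itself contains a blank line.
message = (
    'From: someone@example.com\n'
    'Subject: test\n'
    'Lines: 2\n'
    '\n'
    'First body line.\n'
    '\n'
    'Second body line.'
)

# Without re.S the dot cannot cross a newline, so the group can never
# reach the blank line and nothing matches.
no_dotall = re.search(r'^(.*?)\n\n', message)

# With re.S the dot matches newlines too; the lazy .*? stops at the
# first blank line instead of the last one.
with_dotall = re.search(r'^(.*?)\n\n', message, flags=re.S)
header_text = with_dotall.group(1)

print(no_dotall)
print(header_text)
```

With a greedy `.*` the group would run to the last blank line and pull part of the body into the header.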

Print one of the headers as an example:

In [68]:
print(header.iloc[100])

From: c23reg@kocrsv01.delcoelect.com (Ron Gaskins)
Subject: Re: Dumbest automotive concepts of all tim
Originator: c23reg@koptsw21
Keywords: Dimmer switch location (repost)
Organization: Delco Electronics Corp.
Lines: 22


### Exercise 5

<span style="color:green; font-size:16px">Use the `extractall` method to extract two groups from the header. For all lines that do not begin with a space, extract the first word up to but not including the colon. That is the first group. Extract the remaining characters to the right of the colon as the second group. Name the groups `'property'` and `'value'`.</span>

In [69]:
pattern = r'^(?P<property>\S+): (?P<value>.*)'
df = header.str.extractall(pattern, flags=re.M)
df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,property,value
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,From,nyeda@cnsvax.uwec.edu (David Nye)
0,1,Subject,Re: Post Polio Syndrome Information Needed Ple...
0,2,Organization,University of Wisconsin Eau Claire
0,3,Lines,21
1,0,From,ndallen@r-node.hub.org (Nigel Allen)
1,1,Subject,"WACO: Clinton press conference, part 1"
1,2,Organization,R-node Public Access Unix - 1 416 249 5366
1,3,Lines,282
2,0,From,mark@ardsley.business.uwo.ca (Mark Bramwell)
2,1,Subject,Re: Cellular Phone (Portable) for sale


The second group occasionally spans multiple lines. When that happens, each continuation line begins with whitespace. See message header 176 as an example: both `'In-Reply-To'` and `'Organization'` span multiple lines.

In [70]:
print(header.loc[176])

Nntp-Posting-Host: surt.ifi.uio.no
From: Thomas Parsli <thomasp@ifi.uio.no>
Subject: Re: Rewording the Second Amendment (ideas)
In-Reply-To: arc@cco.caltech.edu (Aaron Ray Clements)'s message of 21 Apr
        1993 12:34:51 GMT
Organization: Dept. of Informatics, University of Oslo, Norway
        <1993Apr20.083057.16899@ousrvr.oulu.fi>
        <viking.735378520@ponderous.cc.iastate.edu>
        <1993Apr21.091130.17788@ousrvr.oulu.fi>
        <1r3f1bINN3n6@gap.caltech.edu>
Lines: 24
Originator: thomasp@surt.ifi.uio.no


Our original pattern did not extract these extra lines.

In [71]:
df.loc[176]

Unnamed: 0_level_0,property,value
match,Unnamed: 1_level_1,Unnamed: 2_level_1
0,Nntp-Posting-Host,surt.ifi.uio.no
1,From,Thomas Parsli <thomasp@ifi.uio.no>
2,Subject,Re: Rewording the Second Amendment (ideas)
3,In-Reply-To,arc@cco.caltech.edu (Aaron Ray Clements)'s mes...
4,Organization,"Dept. of Informatics, University of Oslo, Norway"
5,Lines,24
6,Originator,thomasp@surt.ifi.uio.no


To handle this edge case, we add the dotall flag so that the second group can continue to match across newlines. The positive lookahead `(?=$)` requires the lazy match to stop at the end of a line, and the negative lookahead `(?!\n\s)` rejects any line ending that is followed by whitespace, which is what marks a continuation line.

In [72]:
pattern = r'^(?P<property>\S+): (?P<value>.*?)(?=$)(?!\n\s)'
df = header.str.extractall(pattern, flags=re.M | re.S)
df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,property,value
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,From,nyeda@cnsvax.uwec.edu (David Nye)
0,1,Subject,Re: Post Polio Syndrome Information Needed Ple...
0,2,Organization,University of Wisconsin Eau Claire
0,3,Lines,21
1,0,From,ndallen@r-node.hub.org (Nigel Allen)
1,1,Subject,"WACO: Clinton press conference, part 1"
1,2,Organization,R-node Public Access Unix - 1 416 249 5366
1,3,Lines,282
2,0,From,mark@ardsley.business.uwo.ca (Mark Bramwell)
2,1,Subject,Re: Cellular Phone (Portable) for sale
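
Since the first ten rows shown above contain no continuation lines, the effect of the lookaheads is easier to see on a small made-up header. The following is a sketch under that assumption; the string and the names `toy` and `out` are hypothetical:

```python
import re
import pandas as pd

# Made-up header in which In-Reply-To is folded onto a second, indented line
toy = pd.Series(['From: a@example.com\nIn-Reply-To: first part\n        second part\nLines: 24'])

pattern = r'^(?P<property>\S+): (?P<value>.*?)(?=$)(?!\n\s)'
out = toy.str.extractall(pattern, flags=re.M | re.S)

# The In-Reply-To value keeps its continuation line; the other values stay on one line
print(out)
```

The lazy `.*?` first tries to stop at the end of `first part`, but the negative lookahead sees a newline followed by a space and forces the match to extend to the end of the next line.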


### Exercise 6

<span style="color:green; font-size:16px">Attempt to extract all emails from each message. If you are up for a serious challenge, use the exact specifications for [valid email addresses](https://en.wikipedia.org/wiki/Email_address). The solution presented finds the most common types of emails.</span>

In [73]:
emails = text.str.extractall(r'(\w[\w.-]*?\w@\w[\w.-]*\w)(?<![-._])')[0].str.lower()
emails.head()

   match
0  0               nyeda@cnsvax.uwec.edu
   1                 keith@actrix.gen.nz
   2               nyeda@cnsvax.uwec.edu
1  0              ndallen@r-node.hub.org
2  0        mark@ardsley.business.uwo.ca
Name: 0, dtype: object

### Exercise 7

<span style="color:green; font-size:16px">From each email found, extract the characters after the last dot of each email (usually com, edu, org, etc...), make them lowercase, and count the occurrences of each.</span>

In [74]:
emails.str.extract(r'.*\.(.*)$')[0].str.lower().value_counts().head(10)

edu       676
com       360
gov        69
ca         49
org        29
bitnet     13
net        13
uucp       12
au         11
uk         10
Name: 0, dtype: int64

### Exercise 8

<span style="color:green; font-size:16px">Extract the number of lines of each message as the integer following the word `'Lines:'` in the header. Find the average number of lines per message.</span>

In [75]:
lines = text.str.extract(r'Lines: (\d+)', expand=False).astype('float64')
lines.head()

0     21.0
1    282.0
2     19.0
3     12.0
4     62.0
Name: text, dtype: float64

In [76]:
lines.mean()

36.94486215538847

### Exercise 9

<span style="color:green; font-size:16px">Extract all 10-digit phone numbers.</span>

In [77]:
# we allow up to two non-digits to separate the first and second set of numbers
# and up to one non-digit between the last two sets of numbers
s = text.str.extractall(r'\b(\d{3}\D{1,2}\d{3}\D?\d{4})\b').dropna()[0]
s.head()

    match
1   0         416 249 5366
    1         202-456-2100
2   0        519) 661-3714
5   0        303) 493-6674
14  0         408 241-9760
Name: 0, dtype: object

We can go further and replace each run of one or two non-digit separators with a single hyphen.

In [78]:
s.replace(r'\D{1,2}', '-', regex=True).head()

    match
1   0        416-249-5366
    1        202-456-2100
2   0        519-661-3714
5   0        303-493-6674
14  0        408-241-9760
Name: 0, dtype: object
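
To see which strings the phone pattern accepts and rejects, here is a small sketch on made-up text. The numbers are placeholders and the names `toy`, `phones`, and `normalized` are hypothetical:

```python
import pandas as pd

# Placeholder text: two numbers in one string, one with parentheses, and one 11-digit run
toy = pd.Series(['call 555 123 4567 or 555-987-6543',
                 'fax: (555) 222-3333',
                 'id 12345678901'])

# Same pattern as above: 3 digits, 1-2 non-digits, 3 digits, an optional non-digit, 4 digits
phones = toy.str.extractall(r'\b(\d{3}\D{1,2}\d{3}\D?\d{4})\b')[0]

# Collapse every run of one or two non-digits into a single hyphen
normalized = phones.replace(r'\D{1,2}', '-', regex=True)
print(normalized.tolist())
```

The 11-digit run produces no match because the pattern requires at least one non-digit after the first three digits.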

### Exercise 10

<span style="color:green; font-size:16px">In this exercise you'll find the most common words in each category. Run the first two cells below to place the category in the index and then extract the body from the message as a Series. Extract all the words between 5 and 12 characters in length. Make them all lowercase and remove the words in the list `remove_words`. Finally, count the occurrences of each word by category and return the top 10 per category.</span>

In [79]:
textc = news.set_index('category')['text']
textc.head(3)

category
sci.med               From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje...
talk.politics.guns    From: ndallen@r-node.hub.org (Nigel Allen)\nSu...
misc.forsale          From: mark@ardsley.business.uwo.ca (Mark Bramw...
Name: text, dtype: object

In [80]:
body = textc.str.extract(r'\n\n(.*)', flags=re.S, expand=False)
body.head(3)

category
sci.med               [reply to keith@actrix.gen.nz (Keith Stewart)]...
talk.politics.guns    Here is a press release from the White House.\...
misc.forsale          >\n>I hope you realize that for a cellular pho...
Name: text, dtype: object

In [81]:
remove_words = ['would', 'article', 'there', 'about', 'their', 'other', 'should', 'could',
                'those', 'these', 'which', 'where', 'writes', 'anyone', 'someone', 'because']

In [82]:
words = body.str.extractall(r'\b(\w{5,12})\b')[0].str.lower()
words = words[~words.isin(remove_words)]
words.head(10)

category  match
sci.med   0               reply
          1               keith
          2              actrix
          3               keith
          4             stewart
          5              become
          6          interested
          7             through
          8        acquaintance
          9               polio
Name: 0, dtype: object

In [83]:
word_count = words.groupby('category').value_counts()
top_word_count = word_count.groupby('category').head(10)
top_word_count

category            0         
misc.forsale        offer          34
                    condition      22
                    asking         20
                    drive          19
                    please         18
                    shipping       17
                    windows        16
                    price          15
                    excellent      14
                    interested     14
rec.autos           think          30
                    drive          23
                    right          22
                    better         19
                    little         18
                    speed          18
                    people         17
                    engine         15
                    power          15
                    without        15
rec.sport.baseball  alomar         39
                    think          37
                    better         33
                    games          28
                    pitcher        25
                   

### Exercise 11

<span style="color:green; font-size:16px">Find all messages that have `'From'`, `'Subject'`, `'Organization'`, and `'Lines'` as parts of the header.</span>

In [84]:
pattern = r'^(?=.*^From)(?=.*^Subject)(?=.*^Organization)(?=.*^Lines)'
find_pattern(header, pattern, flags=re.M | re.S).head()

0    From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje...
1    From: ndallen@r-node.hub.org (Nigel Allen)\nSu...
2    From: mark@ardsley.business.uwo.ca (Mark Bramw...
3    From: zmed16@trc.amoco.com (Michael)\nSubject:...
4    From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n...
Name: text, dtype: object
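
The pattern stacks four lookaheads at one position, and each one independently requires that some line begins with the given word, so the order of the fields does not matter. A minimal sketch on made-up headers (the strings and the names `toy` and `has_all` are hypothetical):

```python
import re
import pandas as pd

toy = pd.Series([
    'From: a\nSubject: b\nOrganization: c\nLines: 3',   # all four fields present
    'From: a\nSubject: b\nLines: 3',                    # Organization is missing
    'Lines: 1\nOrganization: c\nSubject: b\nFrom: a',   # all four, in a different order
])

pattern = r'^(?=.*^From)(?=.*^Subject)(?=.*^Organization)(?=.*^Lines)'

# re.S lets .* cross newlines; re.M lets the inner ^ match at the start of any line
has_all = toy.str.contains(pattern, flags=re.M | re.S)
print(has_all.tolist())
```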

## Project - Feature Engineering on the Titanic

In [85]:
titanic = pd.read_csv('../data/titanic.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Exercise 1
<span  style="color:green; font-size:16px">Extract the first character of the `Ticket` column and save it as a new column `ticket_first`. Find the total number of survivors, the total number of passengers, and the percentage of those who survived **by this column**. Next find the total survival rate for the entire dataset. Does this new column help predict who survived?</span>

In [86]:
titanic['ticket_first'] = titanic.Ticket.str[0]
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,ticket_first
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,A
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,P
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,S


In [87]:
titanic.groupby('ticket_first').agg({'Survived': ['mean', 'sum', 'size']})

Unnamed: 0_level_0,Survived,Survived,Survived
Unnamed: 0_level_1,mean,sum,size
ticket_first,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,0.630137,92,146
2,0.464481,85,183
3,0.239203,72,301
4,0.2,2,10
5,0.0,0,3
6,0.166667,1,6
7,0.111111,1,9
8,0.0,0,2
9,1.0,1,1
A,0.068966,2,29


In [88]:
# overall survival rate
titanic['Survived'].mean()

0.3838383838383838

It does look like **`ticket_first`** has predictive power. 63% of passengers whose tickets begin with '1' survived, versus 24% for '3'. Only 2 out of 29 people with tickets beginning with 'A' survived.

### Exercise 2

<span style="color:green; font-size:16px">If you did Exercise 1 correctly, you should see that only 7% of the people with tickets that began with 'A' survived. Find the survival rate for all those 'A' tickets by `Sex`.</span>

In [89]:
filt = titanic['ticket_first'] == 'A'
ticket_A = titanic[filt]
ticket_A.groupby('Sex').agg({'Survived': ['mean', 'size']})

Unnamed: 0_level_0,Survived,Survived
Unnamed: 0_level_1,mean,size
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2
female,0.0,2
male,0.074074,27


### Exercise 3

<span style="color:green; font-size:16px">Find the survival rate by the last letter of the ticket. Is there any predictive power here?</span>

In [90]:
titanic['ticket_last'] = titanic['Ticket'].str[-1]
titanic.groupby('ticket_last').agg({'Survived': ['mean', 'sum', 'size']})

Unnamed: 0_level_0,Survived,Survived,Survived
Unnamed: 0_level_1,mean,sum,size
ticket_last,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,0.395062,32,81
1,0.43,43,100
2,0.364706,31,85
3,0.339286,38,112
4,0.297297,22,74
5,0.392405,31,79
6,0.419355,39,93
7,0.355556,32,90
8,0.453488,39,86
9,0.390805,34,87


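There is little predictive power here. Among the last digits shown, survival rates all fall between roughly 30% and 45%, close to the overall rate of 38%.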

### Exercise 4

<span style="color:green; font-size:16px">Find the length of each passenger's name and assign it to the `name_len` column. What are the minimum and maximum name lengths?</span>

In [91]:
titanic['name_len'] = titanic['Name'].str.len()
titanic['name_len'].agg(['min', 'max'])

min    12
max    82
Name: name_len, dtype: int64

### Exercise 5

<span style="color:green; font-size:16px">Pass the `name_len` column to the `pd.cut` function. Also, pass a list of equal-sized cut points to the `bins` parameter. Assign the resulting Series to the `name_len_cat` column. Find the frequency count of each bin in this column.</span>

In [92]:
titanic['name_len_cat'] = pd.cut(titanic['name_len'], bins=[0, 20, 40, 60, 80, 100])
titanic['name_len_cat'].head()

0    (20, 40]
1    (40, 60]
2    (20, 40]
3    (40, 60]
4    (20, 40]
Name: name_len_cat, dtype: category
Categories (5, interval[int64, right]): [(0, 20] < (20, 40] < (40, 60] < (60, 80] < (80, 100]]

In [93]:
titanic['name_len_cat'].value_counts()

(20, 40]     558
(0, 20]      243
(40, 60]      86
(60, 80]       3
(80, 100]      1
Name: name_len_cat, dtype: int64
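
By default `pd.cut` builds intervals that are closed on the right, which is why the labels read `(0, 20]`. A short sketch with a few probe values (12 and 82 are the minimum and maximum found in Exercise 4; 20 and 21 are chosen here only to test a bin edge, and the names `lengths` and `cats` are hypothetical):

```python
import pandas as pd

# 12 and 82 are the observed min and max; 20 and 21 sit on either side of a bin edge
lengths = pd.Series([12, 20, 21, 82])
bins = [0, 20, 40, 60, 80, 100]

cats = pd.cut(lengths, bins=bins)

# Integer codes give the position of each value's bin: 20 lands in the first bin, 21 in the second
print(cats.cat.codes.tolist())
```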

### Exercise 6
<span  style="color:green; font-size:16px">Is name length a good predictor of survival?</span>

In [94]:
titanic.groupby('name_len_cat').agg({'Survived': ['mean', 'size']})

Unnamed: 0_level_0,Survived,Survived
Unnamed: 0_level_1,mean,size
name_len_cat,Unnamed: 1_level_2,Unnamed: 2_level_2
"(0, 20]",0.230453,243
"(20, 40]",0.383513,558
"(40, 60]",0.790698,86
"(60, 80]",1.0,3
"(80, 100]",1.0,1


Yes, the longer the name, the higher the survival rate.

### Exercise 7
<span  style="color:green; font-size:16px">Why do you think people with longer names had a better chance at survival?</span>

Let's output the 10 shortest and the 10 longest names.

In [95]:
names = titanic.sort_values(by='name_len')['Name']
names.head(10)

826       Lam, Mr. Len
692       Lam, Mr. Ali
74       Bing, Mr. Lee
169      Ling, Mr. Lee
509     Lang, Mr. Fang
832     Saad, Mr. Amin
210     Ali, Mr. Ahmed
694    Weir, Col. John
108    Rekic, Mr. Tido
838    Chip, Mr. Chang
Name: Name, dtype: object

In [96]:
# Names exceed pandas display settings.
# change them with pd.options.display.max_colwidth 
# or just print out values
names.tail(10).values

array(['Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)',
       'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
       'Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)',
       'Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)',
       'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)',
       'Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)',
       'Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)',
       'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
       'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
       'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'],
      dtype=object)

In [97]:
# temporarily set options in a context manager
with pd.option_context('display.max_colwidth', 100):
    print(names.tail(10))

18                                Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)
759                              Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)
319                              Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)
41                               Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
25                              Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)
610                             Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)
670                         Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)
556                     Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")
427                   Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")
307    Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)
Name: Name, dtype: object


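It looks like all of the people with the shortest names are men, while all of those with the longest names are women. Many of the longest names belong to married women whose own name is appended in parentheses, so name length is largely acting as a proxy for sex.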

### Exercise 8
<span  style="color:green; font-size:16px">Using the titanic dataset, do your best to extract the title from a person's name. Examples of title are 'Mr.', 'Dr.', 'Miss', etc... Save this to a column called `title`. Find the frequency count of the titles.</span>

In [98]:
titanic['title'] = titanic['Name'].str.extract(r'(\w+[.])')
titanic['title'].value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
Countess.      1
Capt.          1
Ms.            1
Sir.           1
Lady.          1
Mme.           1
Don.           1
Jonkheer.      1
Name: title, dtype: int64

### Exercise 9
<span  style="color:green; font-size:16px">Does the title have good predictive value of survival?</span>

In [99]:
titanic.groupby('title').agg({'Survived':['mean', 'size']})

Unnamed: 0_level_0,Survived,Survived
Unnamed: 0_level_1,mean,size
title,Unnamed: 1_level_2,Unnamed: 2_level_2
Capt.,0.0,1
Col.,0.5,2
Countess.,1.0,1
Don.,0.0,1
Dr.,0.428571,7
Jonkheer.,0.0,1
Lady.,1.0,1
Major.,0.5,2
Master.,0.575,40
Miss.,0.697802,182


### Exercise 10
<span  style="color:green; font-size:16px">Create a pivot table of survival by title and sex. Use two aggregation functions, mean and size</span>

In [100]:
titanic.pivot_table(index='title', columns='Sex', 
                    values='Survived', aggfunc=['mean', 'size'])

Unnamed: 0_level_0,mean,mean,size,size
Sex,female,male,female,male
title,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Capt.,,0.0,,1.0
Col.,,0.5,,2.0
Countess.,1.0,,1.0,
Don.,,0.0,,1.0
Dr.,1.0,0.333333,1.0,6.0
Jonkheer.,,0.0,,1.0
Lady.,1.0,,1.0,
Major.,,0.5,,2.0
Master.,,0.575,,40.0
Miss.,0.697802,,182.0,


### Exercise 11
<span  style="color:green; font-size:16px">Attempt to extract the first name of each passenger into the column `first_name`. Are there males and females with the same first name?</span>

Most first names can be found as the word immediately following the title.

In [101]:
pattern = r'\w+[.] (\w+)'
titanic['first_name'] = titanic['Name'].str.extract(pattern)

To be more precise, we can do this:

In [102]:
pattern = r'\w+[.][a-z (]+([A-Z]\w+)'
titanic['first_name'] = titanic['Name'].str.extract(pattern)
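
To see where the two patterns differ, they can be compared on three names that appear in the outputs above. This is only a sketch; the names `names`, `simple`, and `precise` are introduced here for illustration:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
])

# First pattern: the word right after the title and a single space
simple = names.str.extract(r'\w+[.] (\w+)', expand=False)

# Second pattern: skip lowercase letters, spaces and '(' until a capitalized word
precise = names.str.extract(r'\w+[.][a-z (]+([A-Z]\w+)', expand=False)

print(pd.DataFrame({'simple': simple, 'precise': precise}))
```

The first pattern returns `of` for the Countess and finds nothing for Lady Duff Gordon, because an opening parenthesis follows her title. The second pattern skips past both.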

In [103]:
first_name_ct = titanic.groupby('first_name').agg({'Sex': 'nunique'})
first_name_ct.head()

Unnamed: 0_level_0,Sex
first_name,Unnamed: 1_level_1
Abraham,1
Achille,1
Ada,1
Adele,1
Adola,1


In [104]:
filt = first_name_ct['Sex'] == 2
first_name_ct[filt].head(10)

Unnamed: 0_level_0,Sex
first_name,Unnamed: 1_level_1
Albert,2
Alexander,2
Amin,2
Anders,2
Antoni,2
Benjamin,2
Carl,2
Charles,2
Dickinson,2
Edgar,2


It looks like some female first names actually appear in parentheses after the husband's name, so the pattern captures the husband's first name instead.

In [105]:
filt = titanic['first_name'] == 'Albert'
titanic.loc[filt, 'Name']

64                                 Stewart, Mr. Albert A
107                               Moss, Mr. Albert Johan
323    Caldwell, Mrs. Albert Francis (Sylvia Mae Harb...
690                              Dick, Mr. Albert Adrian
781            Dick, Mrs. Albert Adrian (Vera Gillespie)
817                                   Mallet, Mr. Albert
833                               Augustsson, Mr. Albert
Name: Name, dtype: object
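
One possible refinement, offered only as a sketch and not as part of the original solution, is to prefer a capitalized word that directly follows an opening parenthesis and fall back to the word after the title otherwise. The names `after_title`, `in_parens`, and `first_name` are hypothetical:

```python
import pandas as pd

names = pd.Series([
    'Dick, Mr. Albert Adrian',
    'Dick, Mrs. Albert Adrian (Vera Gillespie)',
    'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
])

# Word after the title, as in the solution above
after_title = names.str.extract(r'\w+[.][a-z (]+([A-Z]\w+)', expand=False)

# Capitalized word immediately after '(' ; missing when there is no such parenthesis
in_parens = names.str.extract(r'\(([A-Z]\w+)', expand=False)

# Prefer the parenthesized name, otherwise keep the word after the title
first_name = in_parens.fillna(after_title)
print(first_name.tolist())
```

The third name falls back to `Kate` because its parenthesis is followed by a quotation mark rather than a capital letter. Other layouts in the full dataset may need further handling.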

### Exercise 12
<span  style="color:green; font-size:16px">The past several exercises have been an exercise in **feature engineering**. Several new features (columns) have been created from existing columns. Come up with your own feature and test it out on survival.</span>

Get first letter of cabin. Use 'Missing' if not present.

In [106]:
titanic['cabin_first'] = titanic.Cabin.str[0].fillna('Missing')

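Just having a cabin recorded is highly predictive: passengers with a missing cabin have a much lower survival rate than nearly every lettered group.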

In [107]:
titanic.groupby('cabin_first').agg({'Survived': ['size', 'mean']})

Unnamed: 0_level_0,Survived,Survived
Unnamed: 0_level_1,size,mean
cabin_first,Unnamed: 1_level_2,Unnamed: 2_level_2
A,15,0.466667
B,47,0.744681
C,59,0.59322
D,33,0.757576
E,32,0.75
F,13,0.615385
G,4,0.5
Missing,687,0.299854
T,1,0.0
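
The same idea can be reduced to a single boolean feature. The sketch below rebuilds only the `Survived` and `Cabin` values of the first five rows displayed at the top of the project, so its rates illustrate the mechanics and say nothing statistically. The column name `has_cabin` and the names `toy` and `rates` are hypothetical:

```python
import numpy as np
import pandas as pd

# Survived and Cabin for the first five passengers shown in titanic.head()
toy = pd.DataFrame({'Survived': [0, 1, 1, 1, 0],
                    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan]})

# True when a cabin is recorded, False when it is missing
toy['has_cabin'] = toy['Cabin'].notna()

# Group sizes and survival rates; rows are ordered False, then True
rates = toy.groupby('has_cabin')['Survived'].agg(['size', 'mean'])
print(rates)
```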
