# Solutions

1. [Introduction to Regular Expressions](#1.-Introduction-to-Regular-Expressions)
1. [Quantifiers](#2.-Quantifiers)
1. [Or Conditions](#3.-Or-Conditions)
1. [Character Sets and Grouping](#4.-Character-Sets-and-Grouping)
1. [Project - Explore Newsgroups with Regexes](#Project---Explore-Newsgroups-with-Regexes)
1. [Project - Feature Engineering on the Titanic](#Project---Feature-Engineering-on-the-Titanic)

In [1]:
import pandas as pd
import numpy as np

## 1. Introduction to Regular Expressions

In [2]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

### Exercise 1
<span  style="color:green; font-size:16px">Find all movies that have 2 consecutive z's in them.</span>

In [3]:
filt = title.str.contains('zz')
title[filt]

416                All That Jazz
907         The Dukes of Hazzard
1041                   Bedazzled
2234                   Paparazzi
2524                    Hot Fuzz
2593    The Lizzie McGuire Movie
3215       Into the Grizzly Maze
3535                Mystic Pizza
4399              Blue Like Jazz
Name: title, dtype: object

### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that begin with 9.</span>

In [4]:
filt = title.str.contains('^9')
title[filt]

1651                       9
2416                9½ Weeks
3705    90 Minutes in Heaven
Name: title, dtype: object

### Exercise 3
<span  style="color:green; font-size:16px">Find all movies that have a `b` as their third character.</span>

In [5]:
filt = title.str.contains('^..b')
title[filt].head()

22                Robin Hood
228                  RoboCop
286           Public Enemies
448                   Robots
494    Babe: Pig in the City
Name: title, dtype: object

### Exercise 4
<span  style="color:green; font-size:16px">Find all movies with a fourth-to-last character of `M` and a last character of `e`.</span>

In [6]:
filt = title.str.contains('M..e$')
title[filt].head()

704     The Green Mile
1167            8 Mile
1616         Like Mike
2122    Moonlight Mile
2486      How She Move
Name: title, dtype: object

### Exercise 5
<span  style="color:green; font-size:16px">Could you use a regular expression to find a movie that was exactly 6 characters in length?</span>

In [7]:
filt = title.str.contains('^......$')
title[filt].head(10)

0      Avatar
41     Cars 2
58     WALL·E
125    Frozen
168    Sahara
292    Eraser
298    Eragon
368    Pixels
426    Jumper
428    Zodiac
Name: title, dtype: object

### Exercise 6
<span  style="color:green; font-size:16px">What is a more natural way to complete Exercise 5 without a regex?</span>

In [8]:
filt = title.str.len() == 6
title[filt].head(10)

0      Avatar
41     Cars 2
58     WALL·E
125    Frozen
168    Sahara
292    Eraser
298    Eragon
368    Pixels
426    Jumper
428    Zodiac
Name: title, dtype: object

## 2. Quantifiers

Read in the movie dataset first.

In [9]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

### Exercise 1
<span  style="color:green; font-size:16px">Find all movies that have a 'z' as their 15th character.</span>

In [10]:
pattern = '^.{14}z'
filt = title.str.contains(pattern)
title[filt]

2484      American Dreamz
2625    Ramona and Beezus
Name: title, dtype: object

### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that have the word 'Friend' or 'Friends' in them.</span>

In [11]:
pattern = 'Friends?'
filt = title.str.contains(pattern)
title[filt]

1055                     My Best Friend's Wedding
1413                        Friends with Benefits
1775        How to Lose Friends & Alienate People
2216                        My Best Friend's Girl
3116    Seeking a Friend for the End of the World
3495                           Friends with Money
4184                          We Are Your Friends
4279                        Dysfunctional Friends
4670                               Mutual Friends
Name: title, dtype: object

### Exercise 3
<span  style="color:green; font-size:16px">Find all movies that have between 40 and 43 characters in them. Can you verify the results with another `str` accessor method?</span>

In [12]:
pattern = '^.{40,43}$'
filt = title.str.contains(pattern)
m40_43 = title[filt]
m40_43.head()

1        Pirates of the Caribbean: At World's End
4      Star Wars: Episode VII - The Force Awakens
13     Pirates of the Caribbean: Dead Man's Chest
16       The Chronicles of Narnia: Prince Caspian
18    Pirates of the Caribbean: On Stranger Tides
Name: title, dtype: object

In [13]:
m40_43.str.len().head()

1     40
4     42
13    42
16    40
18    43
Name: title, dtype: int64

In [14]:
m40_43.str.len().value_counts()

40    15
41    12
43     9
42     9
Name: title, dtype: int64

### Exercise 4
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' and end in 'Movie'</span>

In [15]:
pattern = '^The.*Movie$'
filt = title.str.contains(pattern)
title[filt]

319                                     The Peanuts Movie
561                                 The Angry Birds Movie
569                                    The Simpsons Movie
759                                        The Lego Movie
1586                      The SpongeBob SquarePants Movie
1734                                    The Rugrats Movie
1895                           The Wild Thornberrys Movie
2162                                     The Tigger Movie
2593                             The Lizzie McGuire Movie
2645    The Pirates Who Don't Do Anything: A VeggieTal...
3296                                     The Muppet Movie
4597                             The Kentucky Fried Movie
Name: title, dtype: object

### Exercise 5
<span  style="color:green; font-size:16px">Create your own Series and make a regular expression that uses the `+` metacharacter. Is this character necessary?</span>
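
The exercise leaves the data open, so the following is only one possible answer, written as a minimal sketch. The Series `s` is made up for illustration and does not come from the movie data. It compares `+` with two other ways of writing "one or more".

```python
import pandas as pd

# Hypothetical sample data, made up for this illustration
s = pd.Series(['ac', 'abc', 'abbbc', 'xyz'])

# `+` means "one or more of the preceding character"
plus = s.str.contains(r'ab+c')

# The same condition written without `+`
braces = s.str.contains(r'ab{1,}c')  # explicit "1 or more" quantifier
star = s.str.contains(r'abb*c')      # one b, then zero or more b's

print(plus.tolist())
print(plus.equals(braces), plus.equals(star))
```

All three patterns select the same rows, so `+` is not strictly necessary. It is a convenient shorthand for `{1,}`.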

## 3. Or Conditions

In [16]:
import pandas as pd
title = pd.read_csv('../data/movie.csv')['title']
title.head()

0                                        Avatar
1      Pirates of the Caribbean: At World's End
2                                       Spectre
3                         The Dark Knight Rises
4    Star Wars: Episode VII - The Force Awakens
Name: title, dtype: object

### Exercise 1
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' followed by the next word that begins with digits.</span>

In [17]:
pattern = '^The [0-9]'
filt = title.str.contains(pattern)
title[filt]

212                                      The 13th Warrior
429                                           The 6th Day
1354                                         The 5th Wave
1817                               The 40-Year-Old Virgin
1958                                               The 33
3567                                      The 5th Quarter
4373    The 41-Year-Old Virgin Who Knocked Up Sarah Ma...
Name: title, dtype: object

### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that have three consecutive capital letters in them.</span>

In [18]:
pattern = '[A-Z]{3}'
filt = title.str.contains(pattern)
title[filt].head()

4      Star Wars: Episode VII - The Force Awakens
40                                   TRON: Legacy
58                                         WALL·E
140                       Mission: Impossible III
177                                       The BFG
Name: title, dtype: object

### Exercise 3
<span  style="color:green; font-size:16px">Find all movies that begin and end with a capital letter.</span>

In [19]:
pattern = '^[A-Z].*[A-Z]$'
filt = title.str.contains(pattern)
title[filt].head()

46                 World War Z
58                      WALL·E
140    Mission: Impossible III
151            Men in Black II
177                    The BFG
Name: title, dtype: object

### Exercise 4
<span  style="color:green; font-size:16px">Find all the movies that have a digit followed by a comma followed by a digit.</span>

In [20]:
pattern = r'[0-9],[0-9]'
filt = title.str.contains(pattern)
title[filt].head()

276                                10,000 B.C.
3266    Ultramarines: A Warhammer 40,000 Movie
3641              20,000 Leagues Under the Sea
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Exercise 5
<span  style="color:green; font-size:16px">Find all the movies that have either an ampersand or a question mark in them.</span>

In [21]:
pattern = '[&?]'
filt = title.str.contains(pattern)
title[filt].head()

129          Angels & Demons
145    Mr. Peabody & Sherman
214           Batman & Robin
252         Mr. & Mrs. Smith
278           Town & Country
Name: title, dtype: object

### Exercise 6
<span  style="color:green; font-size:16px">Which movie has the most ampersands, question marks, and periods in it?</span>

In [22]:
pattern = '[&.?]'
count = title.str.count(pattern)
count.head()

0    0
1    0
2    0
3    0
4    0
Name: title, dtype: int64

In [23]:
filt = count == count.max()
title[filt]

542    The Man from U.N.C.L.E.
Name: title, dtype: object

In [24]:
count.max()

5

## 4. Character Sets and Grouping

### Exercise 1
<span  style="color:green; font-size:16px">For all movies that begin with 'The' and are followed by the next word that begins with a digit, extract just the digits part of this word.</span>

In [25]:
pattern = r'^The (\d+)'
title.str.extract(pattern).dropna()

| | 0 |
|---|---|
| 212 | 13 |
| 429 | 6 |
| 1354 | 5 |
| 1817 | 40 |
| 1958 | 33 |
| 3567 | 5 |
| 4373 | 41 |


### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that have two separate numbers in them. An example would be, '7 days and 7 nights'.</span>

In [26]:
pattern = r'\d+\D+\d+'
filt = title.str.contains(pattern)
title[filt]

276                                10,000 B.C.
289                 The Taking of Pelham 1 2 3
509                           2 Fast 2 Furious
1043                              3:10 to Yuma
1610                            13 Going on 30
1617        Naked Gun 33 1/3: The Final Insult
2466                     40 Days and 40 Nights
2646                                     U2 3D
3266    Ultramarines: A Warhammer 40,000 Movie
3308                                     50/50
3516                           Fahrenheit 9/11
3576                                     11:14
3641              20,000 Leagues Under the Sea
3934                                      2:13
4210                   24 7: Twenty Four Seven
4376                    Friday the 13th Part 2
4532              4 Months, 3 Weeks and 2 Days
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Exercise 3
<span  style="color:green; font-size:16px">Find all the movies that have 6 or more non-vowel and non-space characters in a row.</span>

In [27]:
pattern = r'[^aeiouAEIOU ]{6,}'
filt = title.str.contains(pattern)
title[filt]

276                                10,000 B.C.
542                    The Man from U.N.C.L.E.
1935                          Punch-Drunk Love
2392                                  Catch-22
2480                         Brooklyn's Finest
2507                   When Harry Met Sally...
2912        Tales from the Crypt: Demon Knight
3266    Ultramarines: A Warhammer 40,000 Movie
3641              20,000 Leagues Under the Sea
4775             The Beast from 20,000 Fathoms
Name: title, dtype: object

### Exercise 4
<span  style="color:green; font-size:16px">Extract the very next character after 't' or 'T' for each movie.</span>

In [28]:
pattern = r'[Tt](.)'
title.str.extract(pattern).head()

| | 0 |
|---|---|
| 0 | a |
| 1 | e |
| 2 | r |
| 3 | h |
| 4 | a |


### Exercise 5
<span  style="color:green; font-size:16px">What is the most common character after 't' or 'T'?</span>

In [29]:
pattern = r'[Tt](.)'
letters = title.str.extract(pattern)
letters.head()

| | 0 |
|---|---|
| 0 | a |
| 1 | e |
| 2 | r |
| 3 | h |
| 4 | a |


This is a DataFrame. The column name is the integer 0. Let's select it as a Series.

In [30]:
letter_series = letters[0]
letter_series.head()

0    a
1    e
2    r
3    h
4    a
Name: 0, dtype: object

In [31]:
letter_series.value_counts().head()

h    1431
      311
e     266
o     183
i     169
Name: 0, dtype: int64

Minor detail here: there is an `expand` parameter that you can set to `False` to return a Series.

In [32]:
pattern = r'[Tt](.)'
letters = title.str.extract(pattern, expand=False)
letters.head()

0    a
1    e
2    r
3    h
4    a
Name: title, dtype: object

In [33]:
letters.value_counts().head()

h    1431
      311
e     266
o     183
i     169
Name: title, dtype: int64

The above only extracts the character after the first appearance of 't' or 'T' in each title. Use the `extractall` string method to get the character after every 't' or 'T'.

In [34]:
pattern = r'[Tt](.)'
letters = title.str.extractall(pattern)
letters.head()

| | match | 0 |
|---|---|---|
| 0 | 0 | a |
| 1 | 0 | e |
| 1 | 1 | h |
| 1 | 2 | |
| 2 | 0 | r |


In [35]:
letters[0].value_counts().head()

h    1942
      620
e     480
o     354
i     343
Name: 0, dtype: int64
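
The difference between the two methods can be sketched with Python's built-in `re` module, on the assumption that `str.extract` behaves roughly like `re.search` on each value (first match only) and `str.extractall` roughly like `re.findall` (every match). The example also shows one caveat: matches cannot overlap, so in a word with a double 't' the second 't' is consumed as the captured character. The lookahead variant is an optional alternative and not part of the original solution.

```python
import re

pattern = r'[Tt](.)'
text = "Pirates of the Caribbean: At World's End"

# First match only, comparable to str.extract
first = re.search(pattern, text).group(1)

# Every match, comparable to str.extractall
every = re.findall(pattern, text)

# Matches do not overlap: in 'Otto' the second 't' is used up as the
# captured character, so the 'o' that follows it is never counted
consumed = re.findall(r'[Tt](.)', 'Otto')

# A lookahead captures the next character without consuming it
lookahead = re.findall(r'[Tt](?=(.))', 'Otto')

print(first, every, consumed, lookahead)
```

The values in `every` line up with the three `extractall` rows shown above for this title: 'e', 'h', and a space.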

### Exercise 6
<span style="color:green; font-size:16px">Extract all the words that begin with 'T' or 't' and end in 'e', then find their frequency. Research the word boundary special character.</span>

In [36]:
pattern = r'\b([tT]\w*e)\b'
letters = title.str.extractall(pattern)
letters[0].str.lower().value_counts()

the              1555
time               26
tale               12
true               10
three               8
teenage             7
there               6
take                6
trade               4
trouble             4
twice               2
terrace             2
treasure            2
torture             1
transcendence       1
thr3e               1
thunderdome         1
tadpole             1
terrible            1
trance              1
timeline            1
throttle            1
temple              1
tease               1
tide                1
tombstone           1
tree                1
triangle            1
trapeze             1
twelve              1
turtle              1
tootsie             1
tae                 1
triple              1
turbulence          1
tape                1
torque              1
Name: 0, dtype: int64
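
To see what the word boundary `\b` contributes, here is a small sketch using the built-in `re` module. The sentence is made up for illustration. Without the boundaries, the pattern also picks up fragments from inside longer words.

```python
import re

# Hypothetical sentence, made up for this illustration
text = 'The tree ate Traveler'

# \b anchors the match to the start and end of a whole word
with_boundary = re.findall(r'\b([tT]\w*e)\b', text)

# Without \b, pieces of words also match
without_boundary = re.findall(r'[tT]\w*e', text)

print(with_boundary)
print(without_boundary)
```

With the boundaries only 'The' and 'tree' are returned. Without them, 'te' (from 'ate') and 'Travele' (from 'Traveler') are returned as well.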

## Project - Explore Newsgroups with Regexes

In [37]:
news = pd.read_csv('../data/newsgroups.csv')
news.head()

| | category | text |
|---|---|---|
| 0 | sci.med | From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje... |
| 1 | talk.politics.guns | From: ndallen@r-node.hub.org (Nigel Allen)\nSu... |
| 2 | misc.forsale | From: mark@ardsley.business.uwo.ca (Mark Bramw... |
| 3 | misc.forsale | From: zmed16@trc.amoco.com (Michael)\nSubject:... |
| 4 | talk.politics.guns | From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n... |


### Extracting emails
It appears that every post has a header line that begins with 'From:' and contains the sender's email address. The following pattern captures the address as a run of characters with an at symbol in it. The part before the at symbol may not contain spaces, parentheses, or less-than signs. The part after it may not contain spaces, opening parentheses, greater-than signs, or line breaks.

In [38]:
pattern = r'\bFrom:.*?([^ ()<]+@[^ (>\n]+)'
emails = news['text'].str.extract(pattern)
emails.head()

| | 0 |
|---|---|
| 0 | nyeda@cnsvax.uwec.edu |
| 1 | ndallen@r-node.hub.org |
| 2 | mark@ardsley.business.uwo.ca |
| 3 | zmed16@trc.amoco.com |
| 4 | fcrary@ucsu.Colorado.EDU |
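
As a quick check of how the pattern behaves, the sketch below applies it with the built-in `re` module to two header layouts. The first line is taken from the data shown above. The second, with the address in angle brackets, is a hypothetical layout added for illustration.

```python
import re

pattern = r'\bFrom:.*?([^ ()<]+@[^ (>\n]+)'

headers = [
    # Layout seen in the data: address first, name in parentheses
    'From: nyeda@cnsvax.uwec.edu (David Nye)\nSubject: Re: Post Polio',
    # Hypothetical layout: name first, address in angle brackets
    'From: David Nye <nyeda@cnsvax.uwec.edu>\nSubject: Re: Post Polio',
]

# The capture group holds only the address, not the surrounding text
emails = [re.search(pattern, h).group(1) for h in headers]
print(emails)
```

The lazy `.*?` skips over the name, and the excluded characters stop the capture at the space, parenthesis, or angle bracket that surrounds the address.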


### Extracting the header
It appears that the header begins at the start of the post and continues until it hits an empty line. The following matches all characters (including line breaks) up until two line breaks in a row. This should represent the header. The character set `[\s\S]` matches any character, including line breaks, which the dot special character does not match by default.

The `*?` quantifier is a **non-greedy** match: it consumes as few characters as possible, so the capture stops at the first pair of consecutive line breaks. If the question mark were absent, the match would be **greedy** and would extend to the last pair of consecutive line breaks in the post.

In [39]:
headers = news['text'].str.extract(r'([\s\S]*?)\n\n')
headers.head()

| | 0 |
|---|---|
| 0 | From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje... |
| 1 | From: ndallen@r-node.hub.org (Nigel Allen)\nSu... |
| 2 | From: mark@ardsley.business.uwo.ca (Mark Bramw... |
| 3 | From: zmed16@trc.amoco.com (Michael)\nSubject:... |
| 4 | From: fcrary@ucsu.Colorado.EDU (Frank Crary)\n... |
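
The effect of the question mark can be illustrated with the built-in `re` module. The post below is a short made-up string with a two-line header and two body paragraphs, so it contains two places where line breaks occur back to back.

```python
import re

# Hypothetical post: header, blank line, then two body paragraphs
post = ('From: someone@example.com\nSubject: test\n\n'
        'first paragraph\n\nsecond paragraph')

# Non-greedy: stops at the first pair of line breaks
lazy = re.search(r'([\s\S]*?)\n\n', post).group(1)

# Greedy: runs on to the last pair of line breaks
greedy = re.search(r'([\s\S]*)\n\n', post).group(1)

print(repr(lazy))
print(repr(greedy))
```

Only the non-greedy version isolates the header. The greedy version also swallows the first body paragraph.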


### Example header

In [40]:
print(headers.loc[100, 0])

From: c23reg@kocrsv01.delcoelect.com (Ron Gaskins)
Subject: Re: Dumbest automotive concepts of all tim
Originator: c23reg@koptsw21
Keywords: Dimmer switch location (repost)
Organization: Delco Electronics Corp.
Lines: 22


### Finding posts with quotes
The assumption here is that the line begins with a greater than symbol.

In [41]:
filt = news['text'].str.contains(r'\n>')
posts_with_quotes = news.loc[filt, 'text']
print(posts_with_quotes.values[0])

From: nyeda@cnsvax.uwec.edu (David Nye)
Subject: Re: Post Polio Syndrome Information Needed Please !!!
Organization: University of Wisconsin Eau Claire
Lines: 21

[reply to keith@actrix.gen.nz (Keith Stewart)]
 
>My wife has become interested through an acquaintance in Post-Polio
>Syndrome This apparently is not recognised in New Zealand and different
>symptons ( eg chest complaints) are treated separately. Does anone have
>any information on it
 
It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.
 
Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons

### Counting words per category
We first put the category into the index and extract just the body of the posts (this excludes the header). This returns a DataFrame with a single column with name 0. We select this column in the second line.

In [42]:
body = news.set_index('category')['text'].str.extract(r'[\s\S]*?\n\n([\s\S]+)')
body_series = body[0]
body_series.head()

category
sci.med               [reply to keith@actrix.gen.nz (Keith Stewart)]...
talk.politics.guns    Here is a press release from the White House.\...
misc.forsale          >\n>I hope you realize that for a cellular pho...
misc.forsale          \nI have an Alesis HR-16 drum machine for sale...
talk.politics.guns    In article <C4tsHu.Ew6@magpie.linknet.com> man...
Name: 0, dtype: object

### Extract each individual non-quote line
We then use `extractall` to capture a pattern for each individual line. The assumption we make is that the line must begin with a word character.

In [43]:
body_lines = body_series.str.extractall(r'[\n]+(\w.*)')
body_lines.head(20)

| category | match | 0 |
|---|---|---|
| sci.med | 0 | It would help if you (and anyone else asking f... |
| sci.med | 1 | some subject) could ask specific questions, as... |
| sci.med | 2 | in a textbook chapter covering all aspects of ... |
| sci.med | 3 | looking for a comprehensive review, ask your l... |
| sci.med | 4 | Most are happy to help with a request of this ... |
| sci.med | 5 | Briefly, this is a condition in which patients... |
| sci.med | 6 | residual weakness from childhood polio notice ... |
| sci.med | 7 | weakness as they get older. One theory is tha... |
| sci.med | 8 | neurons have to work harder and so die sooner. |
| sci.med | 9 | David Nye (nyeda@cnsvax.uwec.edu). Midelfort ... |
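
The behaviour of this line pattern can be sketched on a small made-up body using the built-in `re` module. Quoted lines and lines that start with a space are skipped. One consequence that seems worth noting is that the pattern requires a preceding line break, so the very first line of a body is never captured.

```python
import re

# Hypothetical post body, made up for this illustration
body = 'Intro line\n>quoted text\nreply text\n \nmore reply'

# Capture every line that follows a line break and starts with a word character
lines = re.findall(r'[\n]+(\w.*)', body)
print(lines)
```

Here 'Intro line' is dropped because nothing precedes it, '>quoted text' is dropped because it starts with a greater-than sign, and the two reply lines are kept.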


### Split into individual words
We then split on any non-word character and use `expand=True` to put each word in its own column.

In [44]:
split_words = body_lines[0].str.split(r'\W+', expand=True)
split_words.head()

| category | match | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sci.med | 0 | It | would | help | if | you | and | anyone | else | asking | for | ... | | | | | | | | | | |
| sci.med | 1 | some | subject | could | ask | specific | questions | as | no | one | is | ... | | | | | | | | | | |
| sci.med | 2 | in | a | textbook | chapter | covering | all | aspects | of | the | subject | ... | | | | | | | | | | |
| sci.med | 3 | looking | for | a | comprehensive | review | ask | your | local | hospital | librarian | ... | | | | | | | | | | |
| sci.med | 4 | Most | are | happy | to | help | with | a | request | of | this | ... | | | | | | | | | | |


### Stack words into a single column
Use the `stack` method to put all the words in a single column. This moves the column names into the innermost level of the index. We also convert every word to lowercase.

In [45]:
stacked_words = split_words.stack().str.lower()
stacked_words.head(30)

category  match    
sci.med   0      0              it
                 1           would
                 2            help
                 3              if
                 4             you
                 5             and
                 6          anyone
                 7            else
                 8          asking
                 9             for
                 10        medical
                 11    information
                 12             on
          1      0            some
                 1         subject
                 2           could
                 3             ask
                 4        specific
                 5       questions
                 6              as
                 7              no
                 8             one
                 9              is
                 10         likely
                 11             to
                 12           type
          2      0              in
                 1               a


### Remove words less than 7 characters in length
These shorter words won't give us as much information about the topic as the longer ones.

In [46]:
long_word = stacked_words[stacked_words.str.len() >= 7]
long_word.head(20)

category  match    
sci.med   0      10          medical
                 11      information
          1      1           subject
                 4          specific
                 5         questions
          2      2          textbook
                 3           chapter
                 4          covering
                 6           aspects
                 9           subject
          3      0           looking
                 3     comprehensive
                 8          hospital
                 9         librarian
          4      7           request
          5      0           briefly
                 4         condition
                 7          patients
                 10      significant
          6      0          residual
dtype: object

### Groupby category and count the unique values
You can group by an index level and then count the values for each group.

In [47]:
category_counts = long_word.groupby('category').value_counts().reset_index()
category_counts.columns = ['category', 'word', 'count']
category_counts.head(10)

| | category | word | count |
|---|---|---|---|
| 0 | misc.forsale | condition | 17 |
| 1 | misc.forsale | excellent | 11 |
| 2 | misc.forsale | interested | 9 |
| 3 | misc.forsale | shipping | 9 |
| 4 | misc.forsale | windows | 9 |
| 5 | misc.forsale | printer | 8 |
| 6 | misc.forsale | publish | 8 |
| 7 | misc.forsale | contact | 7 |
| 8 | misc.forsale | software | 7 |
| 9 | misc.forsale | compatible | 6 |
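
The grouping step can be tried on a much smaller scale. The sketch below uses made-up words and category labels, not the newsgroup data. It relies on the grouped `value_counts` returning counts sorted from highest to lowest within each group, which is what makes taking the first rows per group afterwards meaningful.

```python
import pandas as pd

# Hypothetical words, indexed by a made-up category label
words = pd.Series(
    ['engine', 'engine', 'engine', 'brakes', 'brakes', 'tires',
     'orbit', 'orbit', 'launch'],
    index=pd.Index(['autos'] * 6 + ['space'] * 3, name='category'),
)

# Count each word within its category
counts = words.groupby(level='category').value_counts()

# Keep the two most frequent words of each category
top2 = counts.groupby(level='category').head(2)
print(top2)
```

The same idea scales to the full data, where the first 10 rows per category are kept instead of 2.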


### Select top 10 words per category

In [48]:
top10_words = category_counts.groupby('category').head(10)
top10_words.head(20)

| | category | word | count |
|---|---|---|---|
| 0 | misc.forsale | condition | 17 |
| 1 | misc.forsale | excellent | 11 |
| 2 | misc.forsale | interested | 9 |
| 3 | misc.forsale | shipping | 9 |
| 4 | misc.forsale | windows | 9 |
| 5 | misc.forsale | printer | 8 |
| 6 | misc.forsale | publish | 8 |
| 7 | misc.forsale | contact | 7 |
| 8 | misc.forsale | software | 7 |
| 9 | misc.forsale | compatible | 6 |


### Fix the index
The index values are the old row positions from `category_counts`. They no longer carry any meaning, so let's drop them.

In [49]:
top10_words = top10_words.reset_index(drop=True)
top10_words.head(20)

| | category | word | count |
|---|---|---|---|
| 0 | misc.forsale | condition | 17 |
| 1 | misc.forsale | excellent | 11 |
| 2 | misc.forsale | interested | 9 |
| 3 | misc.forsale | shipping | 9 |
| 4 | misc.forsale | windows | 9 |
| 5 | misc.forsale | printer | 8 |
| 6 | misc.forsale | publish | 8 |
| 7 | misc.forsale | contact | 7 |
| 8 | misc.forsale | software | 7 |
| 9 | misc.forsale | compatible | 6 |


### Get unique categories for querying

In [50]:
top10_words['category'].unique()

array(['misc.forsale', 'rec.autos', 'rec.sport.baseball', 'sci.med',
       'sci.space', 'talk.politics.guns'], dtype=object)

### Choose a couple of categories

In [51]:
filt = top10_words['category'] == 'sci.space'
top10_words[filt]

| | category | word | count |
|---|---|---|---|
| 40 | sci.space | telescope | 27 |
| 41 | sci.space | satellite | 25 |
| 42 | sci.space | national | 24 |
| 43 | sci.space | shuttle | 22 |
| 44 | sci.space | vehicle | 18 |
| 45 | sci.space | observatory | 16 |
| 46 | sci.space | because | 15 |
| 47 | sci.space | international | 15 |
| 48 | sci.space | spacecraft | 14 |
| 49 | sci.space | astronomical | 13 |


In [52]:
filt = top10_words['category'] == 'talk.politics.guns'
top10_words[filt]

| | category | word | count |
|---|---|---|---|
| 50 | talk.politics.guns | because | 27 |
| 51 | talk.politics.guns | federal | 26 |
| 52 | talk.politics.guns | believe | 24 |
| 53 | talk.politics.guns | against | 23 |
| 54 | talk.politics.guns | weapons | 23 |
| 55 | talk.politics.guns | without | 23 |
| 56 | talk.politics.guns | defense | 22 |
| 57 | talk.politics.guns | firearms | 22 |
| 58 | talk.politics.guns | control | 21 |
| 59 | talk.politics.guns | government | 19 |


## Project - Feature Engineering on the Titanic

### Exercise 1
<span  style="color:green; font-size:16px">Extract the first character of the `Ticket` column and save it as a new column `ticket_first`. Find the total number of survivors, the total number of passengers, and the percentage of those who survived **by this column**. Next find the total survival rate for the entire dataset. Does this new column help predict who survived?</span>

In [53]:
titanic = pd.read_csv('../data/titanic.csv')
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [54]:
titanic['ticket_first'] = titanic.Ticket.str[0]
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | ticket_first |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | A |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | P |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 3 |


In [55]:
ticket_first_survival = titanic.groupby('ticket_first').agg({'Survived': ['mean', 'sum', 'size']})
ticket_first_survival

| ticket_first | (Survived, mean) | (Survived, sum) | (Survived, size) |
|---|---|---|---|
| 1 | 0.630137 | 92 | 146 |
| 2 | 0.464481 | 85 | 183 |
| 3 | 0.239203 | 72 | 301 |
| 4 | 0.2 | 2 | 10 |
| 5 | 0.0 | 0 | 3 |
| 6 | 0.166667 | 1 | 6 |
| 7 | 0.111111 | 1 | 9 |
| 8 | 0.0 | 0 | 2 |
| 9 | 1.0 | 1 | 1 |
| A | 0.068966 | 2 | 29 |


In [56]:
# overall survival rate
titanic['Survived'].mean()

0.3838383838383838

It does look like **`ticket_first`** has predictive power. 63% of passengers with tickets beginning with '1' survived, versus 24% for '3'. Only 2 out of 29 passengers with tickets beginning with 'A' survived.

### Exercise 2
<span  style="color:green; font-size:16px">If you did Exercise 1 correctly, you should see that only 7% of the people with tickets that began with 'A' survived. Find the survival rate for all those 'A' tickets by `Sex`.</span>

In [57]:
filt = titanic['ticket_first'] == 'A'
ticket_A = titanic[filt]
ticket_A.groupby('Sex').agg({'Survived': ['mean', 'size']})

| Sex | (Survived, mean) | (Survived, size) |
|---|---|---|
| female | 0.0 | 2 |
| male | 0.074074 | 27 |


### Exercise 3
<span  style="color:green; font-size:16px">Find the survival rate by the last character of the ticket. Is there any predictive power here?</span>

In [58]:
titanic['ticket_last'] = titanic['Ticket'].str[-1]
ticket_last_survival = titanic.groupby('ticket_last').agg({'Survived': ['mean', 'sum', 'size']})
ticket_last_survival

| ticket_last | (Survived, mean) | (Survived, sum) | (Survived, size) |
|---|---|---|---|
| 0 | 0.395062 | 32 | 81 |
| 1 | 0.43 | 43 | 100 |
| 2 | 0.364706 | 31 | 85 |
| 3 | 0.339286 | 38 | 112 |
| 4 | 0.297297 | 22 | 74 |
| 5 | 0.392405 | 31 | 79 |
| 6 | 0.419355 | 39 | 93 |
| 7 | 0.355556 | 32 | 90 |
| 8 | 0.453488 | 39 | 86 |
| 9 | 0.390805 | 34 | 87 |


There is no obvious predictive power here. The survival rates shown are all roughly equal, ranging from about 30% to 45%.

### Exercise 4
<span  style="color:green; font-size:16px">Find the length of each passenger's name and assign it to the `name_len` column. What are the minimum and maximum name lengths?</span>

In [59]:
titanic['name_len'] = titanic['Name'].str.len()
titanic['name_len'].min()

12

In [60]:
titanic['name_len'].max()

82

### Exercise 5
<span  style="color:green; font-size:16px">Pass the `name_len` column to the `pd.cut` function. Also, pass a list of equal-sized cut points to the `bins` parameter. Assign the resulting Series to the `name_len_cat` column. Find the frequency count of each bin in this column.</span>

In [61]:
titanic['name_len_cat'] = pd.cut(titanic['name_len'], bins=[0, 20, 40, 60, 80, 100])
titanic['name_len_cat'].head()

0    (20, 40]
1    (40, 60]
2    (20, 40]
3    (40, 60]
4    (20, 40]
Name: name_len_cat, dtype: category
Categories (5, interval[int64]): [(0, 20] < (20, 40] < (40, 60] < (60, 80] < (80, 100]]

In [62]:
titanic['name_len_cat'].value_counts()

(20, 40]     558
(0, 20]      243
(40, 60]      86
(60, 80]       3
(80, 100]      1
Name: name_len_cat, dtype: int64

### Exercise 6
<span  style="color:green; font-size:16px">Is name length a good predictor of survival?</span>

In [63]:
titanic.groupby('name_len_cat').agg({'Survived': ['mean', 'size']})

| name_len_cat | (Survived, mean) | (Survived, size) |
|---|---|---|
| (0, 20] | 0.230453 | 243 |
| (20, 40] | 0.383513 | 558 |
| (40, 60] | 0.790698 | 86 |
| (60, 80] | 1.0 | 3 |
| (80, 100] | 1.0 | 1 |


Yes, the longer the name, the higher the survival rate.

### Exercise 7
<span  style="color:green; font-size:16px">Why do you think people with longer names had a better chance at survival?</span>

Let's output the 10 shortest and 10 longest names.

In [64]:
names = titanic.sort_values(by='name_len')['Name']

In [65]:
names.head(10)

826       Lam, Mr. Len
692       Lam, Mr. Ali
74       Bing, Mr. Lee
169      Ling, Mr. Lee
509     Lang, Mr. Fang
832     Saad, Mr. Amin
210     Ali, Mr. Ahmed
694    Weir, Col. John
108    Rekic, Mr. Tido
838    Chip, Mr. Chang
Name: Name, dtype: object

In [66]:
# Names exceed pandas display settings.
# change them with pd.options.display.max_colwidth 
# or just print out values
names.tail(10).values

array(['Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)',
       'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
       'Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)',
       'Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)',
       'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)',
       'Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)',
       'Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)',
       'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
       'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
       'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'],
      dtype=object)

In [67]:
# temporarily set options in a context manager
with pd.option_context('display.max_colwidth', 100):
    print(names.tail(10))

18                                Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)
759                              Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)
319                              Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)
41                               Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)
25                              Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)
610                             Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)
670                         Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)
556                     Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")
427                   Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")
307    Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)
Name: Name, dtype: object


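The 10 shortest names all belong to men, while the 10 longest all belong to women, most of them married women whose own names appear in parentheses after their husband's name. Name length is therefore largely a proxy for sex, and women survived at a much higher rate than men.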

### Exercise 8
<span  style="color:green; font-size:16px">Using the titanic dataset, do your best to extract the title from a person's name. Examples of titles are 'Mr.', 'Dr.', 'Miss.', etc. Save this to a column called `title`. Find the frequency count of the titles.</span>

In [68]:
titanic['title'] = titanic['Name'].str.extract(r'(\w+[.])')
titanic['title'].value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
Don.           1
Mme.           1
Ms.            1
Countess.      1
Capt.          1
Sir.           1
Lady.          1
Jonkheer.      1
Name: title, dtype: int64
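
To see what the pattern does on individual strings, here is a minimal sketch that uses Python's built-in `re` module instead of pandas. `str.extract` returns the capture groups of the first match in each string, which `re.search` mirrors here. The helper name `extract_title` is hypothetical, and the sample names are copied from the outputs above.

```python
import re

# Same pattern as in the solution: one or more word characters followed by a literal dot
TITLE_PATTERN = re.compile(r'(\w+[.])')

def extract_title(name):
    """Return the first 'word.' token in name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

samples = [
    'Lam, Mr. Len',
    'Weir, Col. John',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
]
titles = [extract_title(name) for name in samples]
print(titles)
```

Because the pattern takes the first word that ends in a period, it relies on the title being the first abbreviation in the name, which holds for the sample names shown here.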

### Exercise 9
<span  style="color:green; font-size:16px">Does the title have good predictive value of survival?</span>

In [69]:
titanic.groupby('title').agg({'Survived':['mean', 'size']})

| title | Survived (mean) | Survived (size) |
|---|---|---|
| Capt. | 0.0 | 1 |
| Col. | 0.5 | 2 |
| Countess. | 1.0 | 1 |
| Don. | 0.0 | 1 |
| Dr. | 0.428571 | 7 |
| Jonkheer. | 0.0 | 1 |
| Lady. | 1.0 | 1 |
| Major. | 0.5 | 2 |
| Master. | 0.575 | 40 |
| Miss. | 0.697802 | 182 |


### Exercise 10
<span  style="color:green; font-size:16px">Create a pivot table of survival by title and sex. Use two aggregation functions, mean and size.</span>

In [70]:
titanic.pivot_table(index='title', columns='Sex', 
                    values='Survived', aggfunc=['mean', 'size'])

| title | mean (female) | mean (male) | size (female) | size (male) |
|---|---|---|---|---|
| Capt. | NaN | 0.0 | NaN | 1.0 |
| Col. | NaN | 0.5 | NaN | 2.0 |
| Countess. | 1.0 | NaN | 1.0 | NaN |
| Don. | NaN | 0.0 | NaN | 1.0 |
| Dr. | 1.0 | 0.333333 | 1.0 | 6.0 |
| Jonkheer. | NaN | 0.0 | NaN | 1.0 |
| Lady. | 1.0 | NaN | 1.0 | NaN |
| Major. | NaN | 0.5 | NaN | 2.0 |
| Master. | NaN | 0.575 | NaN | 40.0 |
| Miss. | 0.697802 | NaN | 182.0 | NaN |


### Exercise 11
<span  style="color:green; font-size:16px">Attempt to extract the first name of each passenger into the column `first_name`. Are there males and females with the same first name?</span>

Most first names can be found as the word immediately following the title:

In [71]:
pattern = r'\w+[.] (\w+)'
titanic['first_name'] = titanic['Name'].str.extract(pattern)

To be more precise, we can do this:

In [72]:
pattern = r'\w+[.][a-z (]+([A-Z]\w+)'
titanic['first_name'] = titanic['Name'].str.extract(pattern)
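
The difference between the two patterns is easiest to see on a few names. The sketch below uses the standard `re` module as a stand-in for `str.extract`, and the helper `first_match` is a hypothetical name. The second pattern allows lowercase letters, spaces, and an opening parenthesis between the title and the first capitalized word, so it copes with names where the title is not directly followed by the first name.

```python
import re

# The two patterns from the cells above
simple = re.compile(r'\w+[.] (\w+)')
precise = re.compile(r'\w+[.][a-z (]+([A-Z]\w+)')

def first_match(pattern, name):
    """Return capture group 1 of the first match, or None (hypothetical helper)."""
    match = pattern.search(name)
    return match.group(1) if match else None

names = [
    'Lam, Mr. Len',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
]

# Map each name to (result of simple pattern, result of precise pattern)
comparison = {n: (first_match(simple, n), first_match(precise, n)) for n in names}
for name, (s, p) in comparison.items():
    print(f'{s!s:>6} | {p!s:>8} | {name}')
```

For the Countess of Rothes, the simple pattern captures the word `of`, and for Lady Duff Gordon it finds nothing at all, while the more precise pattern returns a capitalized first name in both cases.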

In [73]:
first_name_ct = titanic.groupby('first_name').agg({'Sex': 'nunique'})
first_name_ct.head()

| first_name | Sex |
|---|---|
| Abraham | 1 |
| Achille | 1 |
| Ada | 1 |
| Adele | 1 |
| Adola | 1 |


In [74]:
filt = first_name_ct['Sex'] == 2
first_name_ct[filt].head(10)

| first_name | Sex |
|---|---|
| Albert | 2 |
| Alexander | 2 |
| Amin | 2 |
| Anders | 2 |
| Antoni | 2 |
| Benjamin | 2 |
| Carl | 2 |
| Charles | 2 |
| Dickinson | 2 |
| Edgar | 2 |


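Most of these are not genuinely shared names. Many married women are listed under their husband's name, with their own first name in parentheses afterwards, so the pattern picks up the husband's first name for them.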

In [75]:
filt = titanic['first_name'] == 'Albert'
titanic.loc[filt, 'Name']

64                                 Stewart, Mr. Albert A
107                               Moss, Mr. Albert Johan
323    Caldwell, Mrs. Albert Francis (Sylvia Mae Harb...
690                              Dick, Mr. Albert Adrian
781            Dick, Mrs. Albert Adrian (Vera Gillespie)
817                                   Mallet, Mr. Albert
833                               Augustsson, Mr. Albert
Name: Name, dtype: object
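
One possible refinement, sketched below with the standard `re` module, is to prefer the first capitalized word inside parentheses when there is one and to fall back to the word after the title otherwise. The helper name `own_first_name` and the fallback rule are assumptions made for illustration and are not part of the original solution.

```python
import re

# First capitalized word directly after an opening parenthesis
IN_PARENS = re.compile(r'\(([A-Z]\w+)')
# The more precise pattern from the solution above
AFTER_TITLE = re.compile(r'\w+[.][a-z (]+([A-Z]\w+)')

def own_first_name(name):
    """Prefer the parenthesized first name; otherwise use the word after the title."""
    match = IN_PARENS.search(name) or AFTER_TITLE.search(name)
    return match.group(1) if match else None

samples = [
    'Dick, Mr. Albert Adrian',
    'Dick, Mrs. Albert Adrian (Vera Gillespie)',
    'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
]
first_names = [own_first_name(name) for name in samples]
print(first_names)
```

Applied to the whole column, this could be used as `titanic['Name'].apply(own_first_name)`. The rule assumes that a parenthesized name is always the passenger's own name, which may not hold for every row, so the results are worth spot-checking.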

### Exercise 12
<span  style="color:green; font-size:16px">The past several exercises have been an exercise in **feature engineering**. Several new features (columns) have been created from existing columns. Come up with your own feature and test it out on survival.</span>

Take the first letter of the cabin, using 'Missing' when no cabin is recorded.

In [76]:
titanic['cabin_first'] = titanic.Cabin.str[0].fillna('Missing')

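Simply having a cabin recorded is highly predictive: passengers with no recorded cabin survived about 30% of the time, well below the rate for most cabin letters.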

In [77]:
titanic.groupby('cabin_first').agg({'Survived': ['size', 'mean']})

| cabin_first | Survived (size) | Survived (mean) |
|---|---|---|
| A | 15 | 0.466667 |
| B | 47 | 0.744681 |
| C | 59 | 0.59322 |
| D | 33 | 0.757576 |
| E | 32 | 0.75 |
| F | 13 | 0.615385 |
| G | 4 | 0.5 |
| Missing | 687 | 0.299854 |
| T | 1 | 0.0 |
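
To put a single number on the cabin effect, the per-letter rows can be pooled into one "has a cabin" group. The sketch below does this arithmetic in plain Python from the sizes and means in the table above. The variable names are illustrative, and survivor counts are recovered by rounding `size * mean`, which assumes the printed means are accurate to the digits shown.

```python
# (size, mean survival rate) per cabin letter, copied from the table above
cabin_stats = {
    'A': (15, 0.466667),
    'B': (47, 0.744681),
    'C': (59, 0.59322),
    'D': (33, 0.757576),
    'E': (32, 0.75),
    'F': (13, 0.615385),
    'G': (4, 0.5),
    'T': (1, 0.0),
}
missing_size, missing_mean = 687, 0.299854

# Pool all passengers with a recorded cabin into one group
known_total = sum(size for size, _ in cabin_stats.values())
known_survivors = sum(round(size * mean) for size, mean in cabin_stats.values())
known_rate = known_survivors / known_total

# Survivors among passengers with no recorded cabin
missing_survivors = round(missing_size * missing_mean)

print(f'cabin recorded: {known_survivors}/{known_total} = {known_rate:.3f}')
print(f'cabin missing:  {missing_survivors}/{missing_size} = {missing_mean:.3f}')
```

Pooled this way, about two thirds of the passengers with a recorded cabin survived, compared with about 30% of those without one. A binary feature such as `titanic['Cabin'].notna()` would capture this split directly.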
