# Regular expressions

Regular expressions help find and extract patterns in text data.


<b>Identifiers:</b>
<p>
<ul>
<li><i>\d</i>: any digit
<li><i>\s</i>: any whitespace character (space, tab, newline)
<li><i>\t</i>: tab
<li><i>\w</i>: any word character (letter, digit, or underscore)
<li><i>.</i>: any character except a newline
<li><i>\b</i>: word boundary
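<li><i>\$</i>: end of the string (end of each line with <i>re.MULTILINE</i>)
<li><i>^</i>: beginning of the string (beginning of each line with <i>re.MULTILINE</i>)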
</ul>
</p>
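
As a rough illustration of the identifiers listed above, the sketch below applies each one to a made-up sample string (the string and variable names are assumptions for the example, not part of the tweet data).

```python
import re

# Made-up sample text (not from the tweet file)
sample = "Order 66 shipped\tat 9am\nTotal 42"

digits = re.findall(r'\d', sample)        # single digits
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscores
spaces = re.findall(r'\s', sample)        # spaces, the tab, and the newline
lines = re.findall(r'.+', sample)         # . stops at the newline
whole_at = re.findall(r'\bat\b', sample)  # "at" as a whole word only

# ^ and $ anchor at every line only when re.MULTILINE is set
firsts = re.findall(r'^\w+', sample, flags=re.MULTILINE)
lasts = re.findall(r'\w+$', sample, flags=re.MULTILINE)

for name, found in [('digits', digits), ('words', words), ('spaces', spaces),
                    ('lines', lines), ('whole_at', whole_at),
                    ('firsts', firsts), ('lasts', lasts)]:
    print(name, found)
```

Note that `\w+` keeps "9am" together because digits count as word characters.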


<b>Modifiers:</b>
<p>
<ul>
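<li><i>*</i>: 0 or more repetitions of the preceding item
<li><i>+</i>: 1 or more repetitions
<li><i>?</i>: 0 or 1 repetition (the item is optional)
<li><i>{1,3}</i>: 1 to 3 repetitions
<li><i>|</i>: alternation, either the pattern on the left or the one on the right
<li>\[\]: character set, matches any one of the listed characters or ranges (e.g. a-z)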
</ul>
</p>
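
The modifiers above can be compared side by side on a small made-up string. This is only a sketch; the test strings are assumptions chosen so that each modifier selects a different subset.

```python
import re

# Made-up test string: "a" followed by 0 to 4 copies of "b"
text = "a ab abb abbb abbbb"

star = re.findall(r'\bab*\b', text)        # a then 0 or more b
plus = re.findall(r'\bab+\b', text)        # a then 1 or more b
optional = re.findall(r'\bab?\b', text)    # a then 0 or 1 b
bounded = re.findall(r'\bab{1,3}\b', text) # a then 1 to 3 b

either = re.findall(r'cat|dog', "cat dog bird")        # alternation
char_set = re.findall(r'\b[a-c]+\b', "cab dab abc")    # words made only of a, b, c

print(star)
print(plus)
print(optional)
print(bounded)
print(either)
print(char_set)
```

The `\b` on each side forces the whole word to fit the pattern, which is why `ab?` does not report the "ab" at the start of "abb".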


In [1]:
f = open('Gillette.txt', encoding='utf-8')

In [2]:
tweets = f.readlines()

In [3]:
f.close()
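
The open / readlines / close sequence above can also be written with a `with` block, which closes the file automatically even if reading fails. The sketch below uses a small temporary file with made-up content as a stand-in for Gillette.txt so that it runs on its own; the names `demo_tweets` and `path` are assumptions for the example.

```python
import os
import tempfile

# Hypothetical stand-in for Gillette.txt: two made-up lines
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("first tweet\nsecond tweet\n")
    path = tmp.name

# The file is closed as soon as the with block ends
with open(path, encoding='utf-8') as f:
    demo_tweets = f.readlines()

os.remove(path)  # clean up the temporary file
print(len(demo_tweets))
print(f.closed)
```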

In [5]:
len(tweets)

9349

In [8]:
t1 = tweets[10]

## Examples on a tweet

In [9]:
t1

'RT @ "Don\'t be a bully, don\'t beat people up for fun, don\'t belittle others, don\'t force others to deal with your unbridled sexual advances. Be a good dude and everyone wins." -@gillette  "I HATE GILLETTE NOW BECAUSE I LIKE DOING THOSE THINGS!" - Toxic Masculinity\n'

### Example 1
Does this tweet contain the string "don't"?

In [10]:
'don\'t' in t1

True

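The `in` operator only says whether the string occurs. To find every match and its position, compile a pattern and iterate over the matches: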

In [11]:
import re

In [12]:
p = re.compile(r'don\'t')

In [13]:
matches = p.finditer(t1)

In [15]:
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(24, 29), match="don't">
<_sre.SRE_Match object; span=(54, 59), match="don't">
<_sre.SRE_Match object; span=(77, 82), match="don't">


In [22]:
t1

'RT @ "Don\'t be a bully, don\'t beat people up for fun, don\'t belittle others, don\'t force others to deal with your unbridled sexual advances. Be a good dude and everyone wins." -@gillette  "I HATE GILLETTE NOW BECAUSE I LIKE DOING THOSE THINGS!" - Toxic Masculinity\n'

To match both "Don't" and "don't", put the two possible first letters in a character set:

In [21]:
p = re.compile(r'[dD]on\'t')
matches = p.finditer(t1)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(6, 11), match="Don't">
<_sre.SRE_Match object; span=(24, 29), match="don't">
<_sre.SRE_Match object; span=(54, 59), match="don't">
<_sre.SRE_Match object; span=(77, 82), match="don't">
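
A character set such as `[dD]` covers only the first letter. If any capitalization should match, the `re.IGNORECASE` flag is an alternative. The sketch below compares the two on a made-up sentence (not from the tweet file) and also shows how to read the matched text and its start position from each match object.

```python
import re

# Made-up sentence with three capitalizations of "don't"
text = "Don't stop. I don't know. DON'T!"

first_letter = re.compile(r"[dD]on't")
any_case = re.compile(r"don't", flags=re.IGNORECASE)

set_hits = [m.group() for m in first_letter.finditer(text)]
flag_hits = [m.group() for m in any_case.finditer(text)]
starts = [m.start() for m in any_case.finditer(text)]

print(set_hits)   # misses the fully uppercase form
print(flag_hits)  # includes all three forms
print(starts)
```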


### Example 2
Does this tweet contain a word whose letters are all uppercase?

In [29]:
p = re.compile(r'\b[A-Z]+\b')

In [30]:
t1

'RT @ "Don\'t be a bully, don\'t beat people up for fun, don\'t belittle others, don\'t force others to deal with your unbridled sexual advances. Be a good dude and everyone wins." -@gillette  "I HATE GILLETTE NOW BECAUSE I LIKE DOING THOSE THINGS!" - Toxic Masculinity\n'

In [31]:
matches = p.finditer(t1)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 2), match='RT'>
<_sre.SRE_Match object; span=(189, 190), match='I'>
<_sre.SRE_Match object; span=(191, 195), match='HATE'>
<_sre.SRE_Match object; span=(196, 204), match='GILLETTE'>
<_sre.SRE_Match object; span=(205, 208), match='NOW'>
<_sre.SRE_Match object; span=(209, 216), match='BECAUSE'>
<_sre.SRE_Match object; span=(217, 218), match='I'>
<_sre.SRE_Match object; span=(219, 223), match='LIKE'>
<_sre.SRE_Match object; span=(224, 229), match='DOING'>
<_sre.SRE_Match object; span=(230, 235), match='THOSE'>
<_sre.SRE_Match object; span=(236, 242), match='THINGS'>
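
To see what the `\b` anchors contribute, the sketch below runs the pattern with and without them on a short made-up sentence (an assumption for illustration, not one of the tweets).

```python
import re

# Made-up sentence mixing a capitalized word with all-uppercase words
text = "Don't SHOUT at ME"

without_boundaries = re.findall(r'[A-Z]+', text)
with_boundaries = re.findall(r'\b[A-Z]+\b', text)

print(without_boundaries)  # also picks up the capital D of "Don't"
print(with_boundaries)     # only whole uppercase words
```

Without the boundaries, any run of capitals counts, including the first letter of an ordinary capitalized word.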


This pattern finds every uppercase word of at least one character. The `+` accepts single letters such as "I", and the `\b` on each side requires the run of uppercase letters to be delimited by word boundaries, so the "D" in "Don't" is not reported.


### Example 3
Does this tweet contain a hashtag, that is, a # immediately followed by at least one letter?

In [36]:
t1

'RT @ "Don\'t be a bully, don\'t beat people up for fun, don\'t belittle others, don\'t force others to deal with your unbridled sexual advances. Be a good dude and everyone wins." -@gillette  "I HATE GILLETTE NOW BECAUSE I LIKE DOING THOSE THINGS!" - Toxic Masculinity\n'

In [32]:
p = re.compile(r'#[a-zA-Z]+')

In [33]:
matches = p.finditer(t1)

In [35]:
for match in matches:
    print(match)
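
The loop above prints nothing because this tweet contains no `#`. As a sketch of what a hit looks like, the example below runs the same pattern on two made-up strings (assumptions for illustration) and uses `search`, which returns `None` when nothing matches.

```python
import re

p = re.compile(r'#[a-zA-Z]+')

# Made-up examples: one with hashtags, one without
with_tag = "Loved the #GilletteAd but not #2019 trends"
without_tag = "No tags here"

tags = p.findall(with_tag)
has_tag = p.search(without_tag) is not None

print(tags)     # "#2019" is skipped: a digit, not a letter, follows the #
print(has_tag)
```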

## Examples on all tweets

### Example 1
Find the tweets where "men" is followed, later in the tweet, by "women" (case-insensitive)

In [39]:
p = re.compile(r'\bmen\b.*\bwomen\b', flags=re.IGNORECASE)

In [40]:
for i in range(len(tweets)):
    tweet = tweets[i]
    match = p.search(tweet)
    if match is None:
        continue
    print(str(i) + ': ' + tweet)

27: RT @ The ad called for men to treat women and eachother with BASIC kindness and humanity.... and people (men) are angry over it. Take that how you please. https://t.co/BNeqNByEDp

30: Gilette: hey this ad is a bit disconnected from reality but we're just saying to not be a piece of shit basically, nothing very harmful  Some Men: HURR DURR GILLETTE IS TRASH, MY MASCULINITY !!  Some Women: HAHA MEN ARE TRASH THANKS GILLETTE !!

48: RT @tangeliaee: men: women are so sensitive about everything gillette: be a good person men: https://t.co/I8LGYhxUSE

51: RT @tangeliaee: men: women are so sensitive about everything gillette: be a good person men: https://t.co/I8LGYhxUSE

59: RT @ The ad called for men to treat women and eachother with BASIC kindness and humanity.... and people (men) are angry over it. Take that how you please. https://t.co/BNeqNByEDp

102: RT @RealJamesWoods: Is this the @Gillette standard for how men should teach their sons to view women? https://t.co/VvRy6Lq5Oy

106: R

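To require that "women" follows "men" within at most 6 characters, bound the gap with .{0,6}. The loop below also skips retweets (lines starting with RT), so only original tweets are printed.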

In [44]:
p = re.compile(r'\bmen\b.{0,6}\bwomen\b', flags=re.IGNORECASE)

In [45]:
# compile the retweet check once, outside the loop
p_RT = re.compile(r'^RT')

for i, tweet in enumerate(tweets):
    # skip retweets: lines that start with RT
    if p_RT.search(tweet) is not None:
        continue

    # skip tweets without "men ... women" within 6 characters
    if p.search(tweet) is None:
        continue
    print(str(i) + ': ' + tweet)

589: Just discussed the #GilletteAd with my sons. Explained that neither men nor women are inherently bad and that society has a role in shaping how we act and that Gillette was shining a light on male violence that has been perpetuated by society and pop culture.

1941: Brent Bozell: Gillette's Sexist Sermonizing to Men and Women https://t.co/o9yU2VzvR8 via @cnsnews

4740: @simpalmer @fleccas @Gillette First place.  Men and women are up in arms over this stupid Gillette commercial.   IF THE SHOE DOESNT FIT, DONT TRY TO PUT IT ON.   I raised my son to be a strong man, commercials didn't have an influence. Stop giving them power.   Give your woman the passcode.

5500: Gillette's Sexist Sermonizing to Men and Women https://t.co/3nbyvuQrFc

5564: @wdunlap @Gillette Where are the divisions being created? The add addresses men and women social structure toxic for decades WW|| soldiers worked together with lgbt, female and all members of society, as a whole we defeated the germans

6068: @pa
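
The difference between the unbounded gap `.*` and the bounded gap `.{0,6}` is easier to see on a few short strings. The sketch below uses made-up examples (assumptions, loosely modeled on the tweets above) and records whether each pattern matches.

```python
import re

loose = re.compile(r'\bmen\b.*\bwomen\b', flags=re.IGNORECASE)
tight = re.compile(r'\bmen\b.{0,6}\bwomen\b', flags=re.IGNORECASE)

# Made-up examples
examples = [
    "Men and women agree",                        # gap of 5 characters
    "men should teach their sons to view women",  # long gap
    "women and men",                              # wrong order
]

results = [(bool(loose.search(s)), bool(tight.search(s))) for s in examples]

for s, (is_loose, is_tight) in zip(examples, results):
    print(s, '| loose:', is_loose, '| tight:', is_tight)
```

The gap counts every character between the two words, spaces included, so " and " uses 5 of the 6 allowed. The last example also shows that `\bmen\b` does not match the "men" inside "women", because there is no word boundary before the "m".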

### Example 2