### Author: Ran Meng

This Jupyter notebook contains my work for the certification course "Regular Expressions in Python", taught by Maria Eugenia Inzaugarat on [DataCamp](https://learn.datacamp.com/courses/regular-expressions-in-python).

In [1]:
import pandas as pd
import re
from datetime import datetime
from string import Template

Your first project is to build a model for predicting if a movie will get a positive or negative review.
You need to start exploring your dataset. In order to create a function that will scan each movie review, you want to know how many characters every string has and print the result together with a statement that indicates what the number refers to. To test if your function works correctly, you are going to start by analyzing only one example.

In [2]:
movie = 'fox and kelley soon become bitter rivals because the\
new fox books store is opening up right across the block from the small business .'

In [3]:
# Find characters in movie variable
length_string = len(movie)

# Convert to string
to_string = str(length_string)

# Predefined variable
statement = "Number of characters in this review:"

# Concatenate strings and print result
print(statement + " " + to_string)

Number of characters in this review: 134
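
The steps above can be folded into a small helper so that every review is scanned the same way. This is a minimal sketch: the function name `describe_length` and the sample string are illustrative assumptions, not part of the course material.

```python
def describe_length(review, statement="Number of characters in this review:"):
    """Return the statement followed by the number of characters in `review`."""
    return statement + " " + str(len(review))

# Hypothetical sample review used only for illustration
sample_review = "fox and kelley soon become bitter rivals"
print(describe_length(sample_review))
# Number of characters in this review: 40
```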


#### Artificial reviews

While checking out the movie reviews in your dataset, you realize that some of them show an atypical pattern. Since you should only include true reviews in your analysis, you decide to extract the suspicious ones that follow this pattern. You want to see if it is possible to artificially create reviews by using the first and last part of one example review and changing a keyword in the middle.

In [4]:
movie1 = 'the most significant tension of _election_ is the potential relationship between a teacher and his student .'

movie2 = 'the most significant tension of _rushmore_ is the potential relationship between a teacher and his student .'

In [5]:
# Select the first 32 characters of movie1
first_part = movie1[:32]

# Select from 43rd character to the end of movie1
last_part = movie1[42:]

# Select from 33rd to the 42nd character
middle_part = movie2[32:42]

# Print concatenation and movie2 variable
print(first_part+ middle_part+last_part) 
print(movie2)

the most significant tension of _rushmore_ is the potential relationship between a teacher and his student .
the most significant tension of _rushmore_ is the potential relationship between a teacher and his student .


#### Palindromes

In [6]:
movie_palindrome = "oh my God! desserts I stressed was an ugly movie"

In [7]:
# Get the word
movie_title = movie_palindrome[11:30]

# Obtain the palindrome
palindrome = movie_title[::-1]  # A step of -1 reverses the string

# Print the word if it's a palindrome
if movie_title == palindrome:
    print(movie_title)

desserts I stressed


#### Normalizing reviews

It's time to extract some important words present in your movie review dataset. First, you need to normalize them and then count their frequency. Normalization involves converting all the words to lowercase, removing special characters, and extracting the root of a word so that its variants are counted as one.

In [8]:
movie = '$I supposed that coming from MTV Films I should expect no less$'

In [9]:
# Convert to lowercase and print the result
movie_lower = movie.lower()
print(movie_lower, '\n')

# Remove the leading and trailing $ signs and print the result
movie_no_sign = movie_lower.strip("$")
print(movie_no_sign, '\n')

# Split the string into substrings and print the result
movie_split = movie_no_sign.split()
print(movie_split, '\n')

# Select root word and print the result
word_root = movie_split[1][:-1]
print(word_root, '\n')

$i supposed that coming from mtv films i should expect no less$ 

i supposed that coming from mtv films i should expect no less 

['i', 'supposed', 'that', 'coming', 'from', 'mtv', 'films', 'i', 'should', 'expect', 'no', 'less'] 

suppose 



In [10]:
movie = 'the film,however,is all good<\\i>'

In [11]:
# Remove the tag characters at the end and print results
movie_tag = movie.rstrip("<\\i>")  # Escape the backslash; "\i" is not a valid escape sequence
print(movie_tag, '\n')

# Split the string using commas and print results
movie_no_comma = movie_tag.split(",")
print(movie_no_comma, '\n')

# Join back together and print results
movie_join = " ".join(movie_no_comma)
print(movie_join)

the film,however,is all good 

['the film', 'however', 'is all good'] 

the film however is all good
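
One caveat worth illustrating: `rstrip` treats its argument as a set of characters to remove, not as a literal suffix. The sketch below shows where that matters; the string `tricky` and the helper `drop_suffix` are hypothetical names introduced here for illustration.

```python
review = "the film,however,is all good<\\i>"
tricky = "i love sushi<\\i>"  # hypothetical review whose last word ends in 'i'

# rstrip removes any trailing run of the characters <, \, i and >
print(review.rstrip("<\\i>"))   # the film,however,is all good
print(tricky.rstrip("<\\i>"))   # i love sush  (the final 'i' of sushi is lost)

def drop_suffix(text, suffix):
    """Remove `suffix` from the end of `text` only if it is really there."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(drop_suffix(tricky, "<\\i>"))  # i love sushi
```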


#### Split lines or split the line?

You are about to leave work when a colleague asks you to use your string manipulation skills to help him. You need to read strings from a file so that strings on different lines are stored as separate elements. He also wants you to break the strings into pieces wherever they contain commas.

In [12]:
file = 'mtv films election, a high school comedy, is a current example\nfrom there, \
director steven spielberg wastes no time, taking us into the water on a midnight swim'

In [13]:
# Split string at line boundaries
file_split = file.splitlines()

# Print file_split
print(file_split, '\n')

# Complete for-loop to split by commas
for substring in file_split:
    substring_split = substring.split(sep=',')
    print(substring_split, '\n')

['mtv films election, a high school comedy, is a current example', 'from there, director steven spielberg wastes no time, taking us into the water on a midnight swim'] 

['mtv films election', ' a high school comedy', ' is a current example'] 

['from there', ' director steven spielberg wastes no time', ' taking us into the water on a midnight swim'] 
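
The pieces above keep the space that followed each comma. As a possible refinement, both splits can be combined and the whitespace trimmed in one pass. This is a sketch only; `split_lines_and_commas` and the sample text are assumed names and data.

```python
def split_lines_and_commas(text):
    """Split text into lines, then each line into comma-separated, trimmed pieces."""
    return [[piece.strip() for piece in line.split(",")]
            for line in text.splitlines()]

sample_text = "first, second\nthird, fourth , fifth"
print(split_lines_and_commas(sample_text))
# [['first', 'second'], ['third', 'fourth', 'fifth']]
```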



#### Finding a substring

It's a new day at work and you need to continue cleaning your dataset for the movie prediction project. While exploring the dataset, you notice a strange pattern: there are some repeated, consecutive words occurring between the character at position 37 and the character at position 41. You decide to write a function to find out which movie reviews show this peculiarity, remembering that the ending position you specify is not inclusive. If you detect the word, you also want to change the string by replacing it with only one instance of the word.

In [14]:
short_movies = pd.read_csv("short_movies.csv")

In [15]:
short_movies.head()

Unnamed: 0,id,tag,html,sent id,text,target
0,0,cv000,29590,0,films adapted from comic books have had plenty...,pos
1,0,cv000,29590,1,"for starters , it was created by alan moore ( ...",pos
2,0,cv000,29590,2,to say moore and campbell thoroughly researche...,pos
3,0,cv000,29590,3,"the book ( or "" graphic novel , "" if you will ...",pos
4,0,cv000,29590,4,"in other words , don't dismiss this film becau...",pos


In [16]:
movies = short_movies.iloc[200:203]['text']

print(movies)
print(type(movies))

200    it's clear that he's passionate about his beli...
201    I believe you I always said that the actor act...
202    it's astonishing how frightening the actor act...
Name: text, dtype: object
<class 'pandas.core.series.Series'>


In [17]:
movies = movies.tolist()
print(movies)

["it's clear that he's passionate about his beliefs , and that he's not just a punk looking for an excuse to beat people up .", 'I believe you I always said that the actor actor actor is amazing in every movie he has played', "it's astonishing how frightening the actor actor norton looks with a shaved head and a swastika on his chest."]


In [18]:
for movie in movies:
    # Find if actor occurs between positions 37 and 41 (end index 42 is exclusive)
    if movie.find("actor", 37, 42) == -1:
        print("Word not found")
    # Count occurrences and replace two by one
    elif movie.count("actor") == 2:
        print(movie.replace("actor actor", "actor"))
    else:
        # Replace three occurrences by one
        print(movie.replace("actor actor actor", "actor"))

Word not found
I believe you I always said that the actor is amazing in every movie he has played
it's astonishing how frightening the actor norton looks with a shaved head and a swastika on his chest.
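
The loop above handles exactly two or three repetitions. A slightly more general sketch, assuming a hypothetical helper named `collapse_repeats`, keeps collapsing doubled words until a single instance remains.

```python
def collapse_repeats(review, word, start, end):
    """Collapse consecutive repeats of `word` if it occurs in review[start:end]."""
    if review.find(word, start, end) == -1:
        return review  # word not in the expected window: leave the review alone
    doubled = word + " " + word
    while doubled in review:
        review = review.replace(doubled, word)
    return review

example = "I believe you I always said that the actor actor actor is amazing"
print(collapse_repeats(example, "actor", 37, 42))
# I believe you I always said that the actor is amazing
```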


#### Where's the word?

Before finishing cleaning your dataset, you want to check if a specific word occurs in the reviews. You noticed earlier a specific pattern in the strings. Now, you want to create a function to check if a word is present between the characters at index 12 and index 50, remembering that the ending position is exclusive, and print out the lowest index where this word occurs. There are two methods to handle this situation. You want to see which one works best.

In [19]:
movies = pd.Series.tolist(short_movies.iloc[137:139]['text'])

print(movies)

["heck , jackie doesn't even have enough money for a haircut , looks like , much less a personal hairstylist .", "in condor , chan plays the same character he's always played , himself , a mixture of bruce lee and tim allen , a master of both kung-fu and slapstick-fu ."]


In [20]:
for movie in movies:
    # Find the first occurrence of word
    print(movie.find('money', 12, 51))

39
-1


In [21]:
for movie in movies:
    try:
        # Find the first occurrence of word
        print(movie.index('money', 12, 51))
    except ValueError:
        print("substring not found")

39
substring not found
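
Both methods search the same window and differ only in how they report a miss: `.find()` returns -1, while `.index()` raises `ValueError`. The sketch below wraps `.find()` in a hypothetical helper named `locate` and shows why the end index has to be one past the last character of the word.

```python
def locate(text, word, start=0, end=None):
    """Return the lowest index of `word` in text[start:end], or None if absent."""
    position = text.find(word, start, end)
    return None if position == -1 else position

review = "heck , jackie doesn't even have enough money for a haircut"

print(locate(review, "money", 12, 51))  # 39
print(locate(review, "money", 12, 43))  # None: 'money' ends at index 43, which is excluded
print(locate(review, "money", 12, 44))  # 39
```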


#### Put it in order!

Your company is analyzing the best way to provide users with different online courses. Your job is to scrape Wikipedia pages searching for tools used in Data Science subfields. You'll store the tool and field name in a database. After a text analysis, you realize that the information is provided in a specific position of the text but sometimes the field name is given first and the tool after that, while in other cases it's the other way around.

You decide to use positional formatting to handle these situations because it provides a way to reorder placeholders.

In [22]:
wiki = pd.read_csv('wikipedia.csv')

In [23]:
wiki.head()

Unnamed: 0.1,Unnamed: 0,tool,description
0,0,Natural Language Toolkit,suite of libraries and programs for symbolic a...
1,1,TextBlob,Python library for processing textual data. It...
2,2,Gensim,Gensim is a robust open-source vector space mo...
3,3,artificial intelligence,"In computer science, artificial intelligence (..."


In [24]:
ai = wiki.iloc[3, 2]
print(ai)

In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.


In [25]:
# Assign the substrings to the variables
first_pos = ai[3:19].lower()
second_pos = ai[21:44].lower()
print(first_pos, '\n')
print(second_pos)

computer science 

artificial intelligence


In [26]:
my_list = []
# Define string with placeholders 
my_list.append("The tool {} is used in {}")

# Define string with rearranged placeholders
my_list.append("The tool {1} is used in {0}")

# Use format to print strings
for my_string in my_list:
    print(my_string.format(first_pos, second_pos))

The tool computer science is used in artificial intelligence
The tool artificial intelligence is used in computer science
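
For comparison, the same sentence can be produced in three ways: by argument order, by index, or by name. This is a small sketch; the variable names are chosen here for illustration.

```python
field = "computer science"
tool = "artificial intelligence"

# Empty placeholders are filled in the order the arguments are passed
by_position = "The tool {} is used in {}".format(tool, field)

# Index numbers let the template reorder the arguments
by_index = "The tool {1} is used in {0}".format(field, tool)

# Named placeholders make the order of the arguments irrelevant
by_name = "The tool {tool} is used in {field}".format(field=field, tool=tool)

print(by_position)
# The tool artificial intelligence is used in computer science
```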


In [27]:
courses = ['artificial intelligence', 'neural networks']

# Create a dictionary
plan = {
    'field': courses[0],
    'tool': courses[1]
}

In [28]:
# Complete the placeholders accessing elements of field and tool keys
my_message = "If you are interested in {data[field]}, you can take the course related to {data[tool]}"

# Use dictionary to replace placeholders
print(my_message.format(data = plan))

If you are interested in artificial intelligence, you can take the course related to neural networks


#### What day is today?

It's lunch time and you are talking with some of your colleagues. They comment that they feel that every morning someone should send them a reminder of what day it is so they can check in the calendar what their assignments are for that day.

You want to help out and decide to write a small script that takes the date and time of the day so that every morning, a message is sent to your colleagues. You can use the module datetime along with named placeholders to achieve your goal.

The date should be expressed as Month day, year, e.g. April 16, 2019 and the time as hh:mm, e.g. 16:30.

In [29]:
# %d(day), %B (month name), %m (month number), %Y(year), %H (hour) and %M(minutes)

# Assign date to get_date
get_date = datetime.now()

# Add named placeholders with format specifiers
message = "Good morning. Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

# Format date
print(message.format(today = get_date))

Good morning. Today is July 14, 2020. It's 23:06 ... time to work!
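
Because `datetime.now()` changes on every run, a fixed date makes the format specifiers easier to check. A minimal sketch using the example date from the task description; note that `%B` prints the month name for the current locale, which is assumed to be English here.

```python
from datetime import datetime

# Fixed date and time taken from the task description (April 16, 2019 at 16:30)
fixed_date = datetime(2019, 4, 16, 16, 30)

message = "Today is {today:%B %d, %Y}. It's {today:%H:%M}."
print(message.format(today=fixed_date))
# Today is April 16, 2019. It's 16:30.
```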


#### Literally formatting / f-strings

You remember that you've created a website that displayed data science facts but it was too slow. You think that it could be due to the string formatting you used. Because f-strings are very fast and easy to use, you decide to rewrite that project.

In [30]:
field1 = 'sexiest job'
field2 = 'data is produced daily'
field3 = 'Individuals'

fact1 = 21
fact2 = 2500000000000000000
fact3 = 72.41415415151
fact4 = 1.09

In [31]:
# Complete the f-string
print(f"Data science is considered '{field1}' in the {fact1}st century")

Data science is considered 'sexiest job' in the 21st century


In [32]:
# Complete the f-string
print(f"About {fact2:e} of {field2} in the world")

About 2.500000e+18 of data is produced daily in the world


In [33]:
# Complete the f-string
print(f"{field3} create around {fact3:.2f}% of the data but only {fact4:.1f}% is analyzed")

Individuals create around 72.41% of the data but only 1.1% is analyzed


#### Make this function

Wow! You are excited to see how fast and easy f-strings worked. So you plan to rewrite some more of your old code.

Now you know that f-strings allow you to evaluate expressions where they appear and include function and method calls. You decide to use them in a project where you analyze 120 tweets to check if they include links to different news. In that way, you expect the code to be cleaner and more readable.

In [34]:
n1 = 120
n2 = 7
s1 = 'httpswww.datacamp.com'

list_links = ['www.news.com',
 'www.google.com',
 'www.yahoo.com',
 'www.bbc.com',
 'www.msn.com',
 'www.facebook.com',
 'www.news.google.com']

In [35]:
print(f"{n1} tweets were downloaded in {n2} minutes indicating a speed of {n1/n2:.1f} tweets per min")

120 tweets were downloaded in 7 minutes indicating a speed of 17.1 tweets per min


In [36]:
# Replace the substring https by an empty string
print(f"{s1.replace('https', '')}")

www.datacamp.com


In [37]:
# Divide the length of list by 120 rounded to two decimals
print(f"Only {(len(list_links)*100/120):.2f}% of the posts contain links")

Only 5.83% of the posts contain links


Lastly, you want to rewrite an old real estate prediction project. At the time, you obtained historical information about house prices and used it to make a prediction on future values.

The date was in the datetime format: datetime.datetime(1990, 3, 17) but to print it out, you format it as 3-17-1990. You also remember that you defined a dictionary for each neighborhood. Now, you believe that you can handle both types of data better with f-strings.

In [38]:
east = {'date': datetime(2007, 4, 20, 0, 0), 'price': 1232443}
west = {'date': datetime(2006, 5, 26, 0, 0), 'price': 1432673}

In [39]:
# Access values of date and price in east dictionary
print(f"The price for a house in the east neighborhood was ${east['price']} in {east['date']:%m-%d-%Y}")

The price for a house in the east neighborhood was $1232443 in 04-20-2007


In [40]:
# Access values of date and price in west dictionary
print(f"The price for a house in the west neighborhood was ${west['price']} in {west['date']:%m-%d-%Y}.")

The price for a house in the west neighborhood was $1432673 in 05-26-2006.
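
Format specifiers also work on dictionary values inside an f-string. As an optional refinement, the `,` specifier adds a thousands separator to the price. This is a sketch; the variable `line` is introduced here for illustration.

```python
from datetime import datetime

east = {'date': datetime(2007, 4, 20), 'price': 1232443}

# ',' groups the digits of the price; %m-%d-%Y formats the datetime value
line = f"The price was ${east['price']:,} on {east['date']:%m-%d-%Y}"
print(line)
# The price was $1,232,443 on 04-20-2007
```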


#### Template Method

Once again, you scraped Wikipedia pages. This time, you searched for the description of useful tools used for text mining. Your first task is to prepare a report about different tools you found. You want to format the information contained in the dataset to be printed out as: The (tool) is a (description).

In this case, template strings are the best solution to interpolate data generated by external sources into an already created template.

For this example, the variables tool1, tool2 and tool3 contain three article titles. Each variable description1, description2 and description3 contains the corresponding article description.

In [41]:
tool1 = 'Natural Language Toolkit'
tool2 = 'TextBlob'
tool3 = 'Gensim'

description1 = 'suite of libraries and programs for symbolic and statistical natural language processing (NLP) \
for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the \
Department of Computer and Information Science at the University of Pennsylvania.'

description2 = 'Python library for processing textual data. It provides a simple API for diving into common natural \
language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, \
translation, and more.'

description3 = 'robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, \
SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data \
streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that \
only target batch and in-memory processing.'

In [42]:
# Create a template
wikipedia = Template("$tool is a $description")

# Substitute variables in template
print(wikipedia.substitute(tool=tool1, description=description1), '\n')
print(wikipedia.substitute(tool=tool2, description=description2), '\n')
print(wikipedia.substitute(tool=tool3, description=description3))

Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. 

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. 

Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.


In [43]:
tools = ['Natural Language Toolkit', '20', 'month']

# Select variables
our_tool = tools[0]
our_fee = tools[1]
our_pay = tools[2]

# Create template
course = Template("We are offering a 3-month beginner course on $tool just for $$$fee ${pay}ly") # Use $$ to escape the $ sign

# Substitute identifiers with three variables
print(course.substitute(tool=our_tool, fee=our_fee, pay=our_pay))

We are offering a 3-month beginner course on Natural Language Toolkit just for $20 monthly


In [44]:
answers = {'answer1': 'I really like the app. But there are some features that can be improved'}

# Complete template string using identifiers
the_answers = Template("Check your answer 1: $answer1, and your answer 2: $answer2")

# Use substitute to replace identifiers
try:
    print(the_answers.substitute(answers))
except KeyError:
    print("Missing information")

Missing information


In [45]:
# Use safe_substitute to replace identifiers
# It leaves missing identifiers in place instead of raising KeyError
print(the_answers.safe_substitute(answers))

Check your answer 1: I really like the app. But there are some features that can be improved, and your answer 2: $answer2
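
The difference between the two methods can be condensed into one short sketch. The template text and the dictionary below are simplified stand-ins for the ones used above.

```python
from string import Template

report = Template("Answer 1: $answer1, answer 2: $answer2")
answers = {"answer1": "ok"}  # answer2 is deliberately missing

# substitute raises KeyError naming the first missing identifier
try:
    report.substitute(answers)
except KeyError as error:
    missing_key = error.args[0]
    print("Missing:", missing_key)

# safe_substitute leaves the unknown placeholder untouched
partial = report.safe_substitute(answers)
print(partial)
# Answer 1: ok, answer 2: $answer2
```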


### Regular Expressions

\d: digit

\w: word character

\W: non-word character

\s: whitespace
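
A quick way to see what each of these classes matches is to run them over one short string. The sample string below is made up for illustration.

```python
import re

sample = "Room 42, floor_3!"

print(re.findall(r"\d", sample))   # ['4', '2', '3']
print(re.findall(r"\w+", sample))  # ['Room', '42', 'floor_3']  (underscore counts as a word character)
print(re.findall(r"\W", sample))   # [' ', ',', ' ', '!']
print(re.findall(r"\s", sample))   # [' ', ' ']
```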

In [46]:
short_tweets = pd.read_csv('short_tweets.csv')

In [47]:
short_tweets.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467821085,Mon Apr 06 22:22:26 PDT 2009,NO_QUERY,crzy_cdn_bulas,our duck and chicken are taking wayyy too long...
1,0,1467821338,Mon Apr 06 22:22:30 PDT 2009,NO_QUERY,justnetgirl,Put vacation photos online (They were so cute)...
2,0,1467821455,Mon Apr 06 22:22:32 PDT 2009,NO_QUERY,CiaraRenee,I need a hug
3,0,1467821715,Mon Apr 06 22:22:37 PDT 2009,NO_QUERY,deelau,"@andywana Not sure what they are, only that th..."
4,0,1467822384,Mon Apr 06 22:22:47 PDT 2009,NO_QUERY,Lindsey0920,@oanhLove I hate when that happens...


In [48]:
sentiment_analysis = short_tweets.iloc[105, 5]

print(sentiment_analysis)

@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%


In [49]:
# Write the regex
regex = r"@robot\d\W"

# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']


In [50]:
sentiment_analysis = short_tweets.iloc[653, 5]

print(sentiment_analysis)

Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7


In [51]:
# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

# Write a regex to obtain number of likes
print(re.findall(r"likes:\s\d", sentiment_analysis))

# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

['User_mentions:2']
['likes: 9']
['number of retweets: 7']


#### Match and split

Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. You print some of these tweets to understand which pattern you need to match.

You notice that the sentences are always separated by a special character, followed by a number, the word break, and after that, another special character, e.g. &4break!. The words are always separated by a special character, the word new, and a normal random character, e.g. #newH.

In [52]:
sentiment_analysis = short_tweets.iloc[763, 5]

print(sentiment_analysis)

He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready


In [53]:
# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"

# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)
print(sentiment_sub)

He#newHis%newTin love with$newPscrappy.  He is&newYmissing him@newLalready


In [54]:
# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

sentiment_final = re.sub(regex_words, " ", sentiment_sub)
print(sentiment_final)

He is in love with scrappy.  He is missing him already
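
Replacing the sentence separator with a space leaves a double space behind. One alternative, sketched here with an assumed helper name `clean_sentences`, is to split on the sentence separator first and then clean each sentence.

```python
import re

raw_tweet = "He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready"

def clean_sentences(text):
    """Split on the sentence separator, then turn word separators into spaces."""
    pieces = re.split(r"\W\dbreak\W", text)
    return [re.sub(r"\Wnew\w", " ", piece).strip() for piece in pieces]

print(clean_sentences(raw_tweet))
# ['He is in love with scrappy.', 'He is missing him already']
```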


#### Repetitions

Back to your Twitter sentiment analysis project! There are several types of strings that increase your sentiment analysis complexity. But these strings do not provide any useful sentiment. Among them, we can have links and user mentions.

In order to clean the tweets, you want to extract some examples first. You know that most of the time links start with http and do not contain any whitespace, e.g. https://www.datacamp.com. User mentions start with @ and can have letters and numbers only, e.g. @johnsmith3.

You write down some helpful quantifiers to help you: \* zero or more times, \+ once or more, \? zero or once.
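
The three quantifiers can be compared side by side on a toy string (the string is made up for illustration):

```python
import re

toy = "a ab abbb"

print(re.findall(r"ab*", toy))  # ['a', 'ab', 'abbb']  zero or more b
print(re.findall(r"ab+", toy))  # ['ab', 'abbb']       at least one b
print(re.findall(r"ab?", toy))  # ['a', 'ab', 'ab']    at most one b
```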

In [55]:
sentiment_analysis = pd.Series.tolist(short_tweets.iloc[545:548, 5])

print(sentiment_analysis)

['Boredd. Colddd @blueKnight39 Internet keeps stuffing up. Save me! https://www.tellyourstory.com', "I had a horrible nightmare last night @anitaLopez98 @MyredHat31 which affected my sleep, now I'm really tired", 'im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york']


In [56]:
for tweet in sentiment_analysis:
    # Write regex to match http and https links and print out result
    print(re.findall(r"https?://\S+", tweet))
    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))
    print('\n')

['https://www.tellyourstory.com']
['@blueKnight39']


[]
['@anitaLopez98', '@MyredHat31']


['https://radio.foxnews.com']
['@YourBestCompany', '@foxRadio']




#### Some time ago

You are interested in knowing when the tweets were posted. After reading a little bit more, you learn that dates are provided in different ways. You decide to extract the dates using .findall() so you can normalize them afterwards to make them all look the same.

You realize that the dates are always presented in one of the following ways:

27 minutes ago

4 hours ago

23rd june 2018

1st september 2019 17:25

In [57]:
sentiment_analysis = pd.Series.tolist(short_tweets.iloc[232:235, 5])

print(sentiment_analysis)

['I would like to apologize for the repeated Video Games Live related tweets. 32 minutes ago', '@zaydia but i cant figure out how to get there / back / pay for a hotel 1st May 2019', 'FML: So much for seniority, bc of technological ineptness 23rd June 2018 17:54']


In [58]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\s\w+\s\w{3}", date))  # {1,2}: one or two repetitions; {3}: exactly three

['32 minutes ago']
[]
[]


In [59]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w{2}\s\w+\s\d{4}", date))

[]
['1st May 2019']
['23rd June 2018']


In [60]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w{2}\s\w+\s\d{4}\s\d{1,2}:\d{2}", date))

[]
[]
['23rd June 2018 17:54']
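
The three patterns can also be merged into one with alternation and an optional group, so a single pass handles every date style. This is a sketch under the assumption that relative dates only use the words minutes or hours, as in the examples listed above; `DATE_PATTERN` and the shortened sample tweets are names and data chosen here.

```python
import re

DATE_PATTERN = re.compile(
    r"\d{1,2}\s(?:minutes|hours)\sago"                # 27 minutes ago, 4 hours ago
    r"|\d{1,2}\w{2}\s\w+\s\d{4}(?:\s\d{1,2}:\d{2})?"  # 23rd june 2018, optionally followed by 17:25
)

samples = [
    "related tweets. 32 minutes ago",
    "pay for a hotel 1st May 2019",
    "technological ineptness 23rd June 2018 17:54",
]

found = [DATE_PATTERN.findall(text) for text in samples]
print(found)
# [['32 minutes ago'], ['1st May 2019'], ['23rd June 2018 17:54']]
```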


#### Getting tokens

Your next step is to tokenize the text of your tweets. Tokenization is the process of breaking a string into lexical units or, in simpler terms, words. But first, you need to remove hashtags so they do not cloud your process. You realize that hashtags start with a # symbol and contain letters and numbers but never whitespace. After that, you plan to split the text at whitespace matches to get the tokens.

In [61]:
sentiment_analysis = short_tweets.iloc[365, 5]
print(sentiment_analysis)

ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever


In [62]:
# Write a regex matching the hashtag pattern
regex = r"#\w+"

# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)
print(no_hashtag)

# Get tokens by splitting text
print(re.split(r"\s+", no_hashtag)) # Split at every match of one or more consecutive whitespace

ITS NOT ENOUGH TO SAY THAT IMISS U    
['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']
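
The last token above is an empty string because the text ends in whitespace once the hashtags are removed. A minimal sketch of one way around this: `str.split()` with no arguments ignores leading and trailing whitespace.

```python
import re

tweet = "ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever"
no_hashtag = re.sub(r"#\w+", "", tweet)

# re.split keeps an empty token for the trailing whitespace
with_empty = re.split(r"\s+", no_hashtag)

# str.split() without arguments drops it
tokens = no_hashtag.split()
print(tokens)
# ['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U']
```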


### Regex metacharacters

You are not satisfied with your tweets dataset cleaning. There are still extra strings that do not provide any sentiment. Among them are strings that refer to text file names.

You also find a way to detect them:

- They appear at the start of the string.
- They always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u).
- They always finish with the txt ending.

You are not sure if you should remove them directly. So you write a script to find and store them in a separate dataset.

You write down some metacharacters to help you: ^ anchor to beginning, . any character.

In [63]:
sentiment_analysis = pd.Series.tolist(short_tweets.iloc[780:782, 5])

print(sentiment_analysis)

['AIshadowhunters.txt aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company', "ouMYTAXES.txt I am worried that I won't get my $900 even though I paid tax last year"]


In [64]:
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}\w*\.txt"  # [...] matches any single character listed inside

for text in sentiment_analysis:
    # Find all matches of the regex
    print(re.findall(regex, text))
    
    # Replace all matches with empty string
    print(re.sub(regex, "", text))

['AIshadowhunters.txt']
 aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company
['ouMYTAXES.txt']
 I am worried that I won't get my $900 even though I paid tax last year


A colleague has asked for your help! When a user signs up on the company website, they must provide a valid email address.
The company puts some rules in place to verify that the given email address is valid:

- The first part can contain:
    * Upper A-Z and lowercase letters a-z
    * Numbers
    * Characters: !, #, %, &, *, $, .
    
    
- Must have @

- Domain:
    * Can contain any word characters
    * But only .com ending is allowed
    
The project consists of writing a script that checks if the email address follows the correct pattern. Your colleague gave you a list of email addresses as examples to test.

In [65]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

# Write a regex to match a valid email address:
# letters, numbers and the allowed symbols, then @, then a word-character domain ending in .com
regex = r"[a-zA-Z0-9!#%&*$.]+@\w+\.com"

for example in emails:
    # Match the regex against the whole string
    if re.fullmatch(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid


The second part of the website project is to write a script that validates the password entered by the user. The company also puts some rules in place to verify valid passwords:

- It can contain lowercase a-z and uppercase letters A-Z
- It can contain numbers
- It can contain the symbols: *, #, $, %, !, &, .
- It must be at least 8 characters long but not more than 20

Your colleague also gave you a list of passwords as examples to test.

In [66]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']

# Write a regex to match a valid password
regex = r"[a-zA-Z0-9*#$%!&.]{8,20}"

for example in passwords:
    # Match the whole string so that the 8 to 20 character limit is enforced
    if re.fullmatch(regex, example):
        # Complete the format method to print out the result
        print("The password {pass_example} is a valid password".format(pass_example=example))
    else:
        print("The password {pass_example} is invalid".format(pass_example=example))

The password Apple34!rose is a valid password
The password My87hou#4$ is a valid password
The password abc123 is invalid
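
The upper length limit is only enforced when the pattern has to cover the whole string. The sketch below contrasts `re.search`, which accepts any matching substring, with `re.fullmatch`; `PASSWORD`, `is_valid_password` and the 25-character test string are assumptions made for illustration.

```python
import re

PASSWORD = re.compile(r"[a-zA-Z0-9*#$%!&.]{8,20}")

def is_valid_password(candidate):
    """Return True only if the entire candidate satisfies the password rules."""
    return PASSWORD.fullmatch(candidate) is not None

too_long = "a" * 25  # hypothetical password that breaks the 20-character limit

print(bool(PASSWORD.search(too_long)))  # True: a 20-character substring matches
print(is_valid_password(too_long))      # False: the whole string is too long
print(is_valid_password("Apple34!rose"))  # True
```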


#### Greedy vs non-greedy Match

You need to keep working and cleaning your tweets dataset. You realize that there are some HTML tags present. You need to remove them but keep the inside content as they are useful for analysis.

Let's take a look at this sentence containing an HTML tag:

I want to see that <strong> amazing show </strong> again!

You know that to get an HTML tag you need to match anything that sits inside angle brackets < >. But the biggest problem is that the closing tag has the same structure. If you match too much, you will end up removing key information. So you need to decide whether to use a greedy or a lazy quantifier.

In [67]:
string = 'I want to see that <strong>amazing show</strong> again!'

# Write a regex to eliminate tags
string_notags = re.sub(r"<.+?>", "", string)  # The lazy +? stops at the first >, so each tag is removed on its own

# Print out the result
print(string_notags)

I want to see that amazing show again!
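
To see what is at stake, the greedy version of the same pattern can be run on the same sentence. It matches from the first < to the last > and removes the content along with the tags. A short sketch (the variable names are chosen here):

```python
import re

sentence = 'I want to see that <strong>amazing show</strong> again!'

greedy_result = re.sub(r"<.+>", "", sentence)
lazy_result = re.sub(r"<.+?>", "", sentence)

print(repr(greedy_result))  # 'I want to see that  again!'
print(repr(lazy_result))    # 'I want to see that amazing show again!'
```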


Next, you see that numbers still appear in the text of the tweets. So, you decide to find all of them.

Let's imagine that you want to extract the number contained in the sentence I was born on April 24th. A lazy quantifier will make the regex return 2 and 4, because it matches as few characters as possible. However, a greedy quantifier will return the entire 24, because it matches as many characters as possible.

In [68]:
sentiment_analysis = short_tweets.iloc[91, 5]
print(sentiment_analysis)

Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left 


In [69]:
# Write a lazy regex expression 
numbers_found_lazy = re.findall(r"\d+?", sentiment_analysis)

# Print out the result
print(numbers_found_lazy)

['5', '3', '6', '1', '2']


In [70]:
# Write a greedy regex expression 
numbers_found_greedy = re.findall(r"\d+", sentiment_analysis)

# Print out the result
print(numbers_found_greedy)

['536', '12']


You have done some cleaning in your dataset but you are worried that there are sentences encased in parentheses that may cloud your analysis.

Again, a greedy or a lazy quantifier may lead to different results.

For example, if you want to extract a word starting with a and ending with e in the string I like apple pie, you may think that applying the greedy regex a.+e will return apple. However, your match will be apple pie. A way to overcome this is to make it lazy by using ? which will return apple.
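
The apple pie example can be tried directly. This is just a sketch on the sentence from the paragraph above:

```python
import re

text = "I like apple pie"

# Greedy: .+ runs to the last e in the string
greedy_match = re.findall(r"a.+e", text)

# Lazy: .+? stops at the first e that completes the match
lazy_match = re.findall(r"a.+?e", text)

print(greedy_match)  # ['apple pie']
print(lazy_match)    # ['apple']
```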

In [71]:
sentiment_analysis = short_tweets.iloc[1, 5]
print(sentiment_analysis)

Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). 


In [72]:
# Write a greedy regex expression to match 
sentences_found_greedy = re.findall(r"\(.+\)", sentiment_analysis)

# Print out the result
print(sentences_found_greedy)

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]


In [73]:
# Write a lazy regex expression
sentences_found_lazy = re.findall(r"\(.+?\)", sentiment_analysis)

# Print out the results
print(sentences_found_lazy)

['(They were so cute)', "(I'm crying)"]


#### Grouping 

You are still working on your Twitter sentiment analysis. You are now looking at some things that caught your attention. You noticed that some tweets contain email addresses. Now, you are curious to find out which is the most common name.

You want to extract the first part of the email. E.g. if you have the email marysmith90@gmail.com, you are only interested in marysmith90.
You need to match the entire expression. So you make sure to extract only names present in emails. Also, you are only interested in names containing upper (e.g. A,B, Z) or lowercase letters (e.g. a, d, z) and numbers.

In [74]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']

In [75]:
# Write a regex that matches email
regex_email = r"\s([A-Za-z0-9]+)@\S+" # The parentheses capture only the name before @

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']
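
To get from the extracted names to the most common one, something like the sketch below could work. The tweets here are made up for illustration, and lowercasing the names before counting is my own assumption about what should count as the same name.

```python
import re
from collections import Counter

# Hypothetical tweets (not from the course data)
example_tweets = [
    "Write to marysmith90@gmail.com today",
    "Ping MarySmith90@hotmail.com or joe7@msn.net",
]

regex_email = r"\s([A-Za-z0-9]+)@\S+"

# Collect every captured name, lowercased so that case differences are ignored
names = [
    name.lower()
    for tweet in example_tweets
    for name in re.findall(regex_email, tweet)
]

name_counts = Counter(names)
print(name_counts.most_common(1))  # [('marysmith90', 2)]
```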


#### Flying home
Your boss assigned you to a small project. They are performing an analysis of the travels people made to attend business meetings. You are given a dataset with only the email subjects for each of the people traveling.

You learn that the text follows a pattern. Here is an example:

Here you have your boarding pass LA4214 AER-CDB 06NOV.

You need to extract the information about the flight:

- The two letters indicate the airline (e.g. LA).
- The four digits are the flight number (e.g. 4214).
- The three letters correspond to the departure (e.g. AER).
- The next three letters are the destination (e.g. CDB).
- The last part is the date of the flight (e.g. 06NOV).

All letters are always uppercase.

In [76]:
flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

In [77]:
regex = r"([A-Z]{2})([0-9]{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, flight)
print(flight_matches, '\n')

#Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

[('IB', '3723', 'AMS', 'MAD', '06OCT')] 

Airline: IB Flight number: 3723
Departure: AMS Destination: MAD
Date: 06OCT
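
Indexing into `flight_matches[0][2]` works, but it is easy to mix up the positions. As an alternative sketch (not part of the exercise), Python's named groups `(?P<name>...)` let you read each piece by name. The group names below are my own choice.

```python
import re

flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

# Same pattern as above, with a name attached to each capturing group
regex_named = (
    r"(?P<airline>[A-Z]{2})(?P<number>[0-9]{4})\s"
    r"(?P<departure>[A-Z]{3})-(?P<destination>[A-Z]{3})\s"
    r"(?P<date>\d{2}[A-Z]{3})"
)

match = re.search(regex_named, flight)

# groupdict() returns a dictionary mapping group names to matched text
flight_info = match.groupdict() if match else {}
print(flight_info)
```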


#### Alternation and non-capturing groups

You are still working on the Twitter sentiment analysis project. First, you want to identify positive tweets about movies and concerts.

You plan to find all the sentences that contain the words love, like, or enjoy and capture that word. You will limit the tweets by focusing on those that contain the words movie or concert by keeping the word in another group. You will also save the movie or concert name.

For example, if you have the sentence: I love the movie Avengers. You match and capture love. You need to match and capture movie. Afterwards, you match and capture anything until the dot.


In [78]:
sentiment_analysis = ['I totally love the concert The Book of Souls World Tour. It kinda amazing!',
 'I enjoy the movie Wreck-It Ralph. I watched with my boyfriend.',
 "I still like the movie Wish Upon a Star. Too bad Disney doesn't show it anymore."]

In [79]:
# The lazy (.+?) stops at the first period, so the match does not run into the second sentence
regex_positive = r"(love|like|enjoy).+(movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    positive_matches = re.findall(regex_positive, tweet)
    
    # Complete format to print out the results
    print("Positive comments found {}".format(positive_matches))

Positive comments found [('love', 'concert', 'The Book of Souls World Tour')]
Positive comments found [('enjoy', 'movie', 'Wreck-It Ralph')]
Positive comments found [('like', 'movie', 'Wish Upon a Star')]


In [80]:
sentiment_analysis = ['That was horrible! I really dislike the movie The cabin and the ant. So boring.',
 "I disapprove the movie Honest with you. It's full of cliches.",
 'I dislike very much the concert After twelve Tour. The sound was horrible.']

After finding positive tweets, you want to do it for negative tweets. Your plan now is to find sentences that contain the words hate, dislike or disapprove. You will again save the movie or concert name. You will get the tweet containing the words movie or concert but this time, you don't plan to save the word.

For example, if you have the sentence: I dislike the movie Avengers a lot.. You match and capture dislike. **You will match but not capture the word movie**. Afterwards, you match and capture anything until the dot.
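
To see what the `?:` changes, here is a small sketch that runs the same pattern on the example sentence twice, once with a capturing group around movie|concert and once with a non-capturing group:

```python
import re

sentence = "I dislike the movie Avengers a lot."

# Capturing group: the word movie or concert shows up in the result
with_capture = re.findall(
    r"(hate|dislike|disapprove).+?(movie|concert)\s(.+?)\.", sentence
)

# Non-capturing group (?:...): the word must still be there, but is not returned
without_capture = re.findall(
    r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\.", sentence
)

print(with_capture)     # [('dislike', 'movie', 'Avengers a lot')]
print(without_capture)  # [('dislike', 'Avengers a lot')]
```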

In [81]:
# Write a regex that matches sentences with the optional words
# (?:...) is a non-capturing group: it must match, but it is not returned in the result
regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    negative_matches = re.findall(regex_negative, tweet)
    
    # Complete format to print out the results
    print("Negative comments found {}".format(negative_matches))

Negative comments found [('dislike', 'The cabin and the ant')]
Negative comments found [('disapprove', 'Honest with you')]
Negative comments found [('dislike', 'After twelve Tour')]


#### Backreferences

You now need to work on another small project you have been delaying. Your company gave you some PDF files of signed contracts. The goal of the project is to create a database with the information you parse from them. Three of these columns should correspond to the day, month, and year when the contract was signed.
The dates appear as Signed on 05/24/2016 (05 indicating the month, 24 the day). You decide to use capturing groups to extract this information. Also, you would like to retrieve that information so you can store it separately in different variables.

You decide to do a proof of concept.

In [82]:
contract =  'Provider will invoice Client for Services performed within 30 days of performance.  \
Client will pay Provider as set forth in each Statement of Work within 30 days of receipt and acceptance\
of such invoice. It is understood that payments to Provider for services rendered shall be made in full as \
agreed, without any deductions for taxes of any kind whatsoever, in conformity with Provider’s status as an \
independent contractor. Signed on 03/25/2001.'

In [83]:
# Write regex and scan contract to capture the dates described
regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
dates = re.search(regex_dates, contract)
print(dates, '\n')

# Assign to each key the corresponding match
signature = {
    "day": dates.group(2),
    "month": dates.group(1),
    "year": dates.group(3)
}

# Complete the format method to print-out
print("Our first contract is dated back to {data[year]}. Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

<_sre.SRE_Match object; span=(427, 447), match='Signed on 03/25/2001'> 

Our first contract is dated back to 2001. Particularly, the day 25 of the month 03.
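
If the goal is a database column, it may be handy to turn the captured text into a real date object. The sketch below uses `datetime.strptime` with the month/day/year format described above. Capturing the whole date in one group is my own variation, and the short string stands in for the full contract.

```python
import re
from datetime import datetime

# Shortened stand-in for the contract text
contract_snippet = "independent contractor. Signed on 03/25/2001."

# Capture the whole date in a single group
match = re.search(r"Signed\son\s(\d{2}/\d{2}/\d{4})", contract_snippet)

if match:
    # %m/%d/%Y: month first, then day, then four-digit year
    signed_date = datetime.strptime(match.group(1), "%m/%d/%Y")
    print(signed_date.year, signed_date.month, signed_date.day)  # 2001 3 25
```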


In the meantime, you are working on one of your other projects. The company is going to develop a new product. It will help developers automatically check the code they are writing. You need to write a short script for checking that every HTML tag that is open has its proper closure.

You have an example of a string containing HTML tags:

<title>The Data Science Company</title>

You learn that an opening HTML tag is always at the beginning of the string. It appears inside <>. A closing tag also appears inside <>, but it is preceded by /.

You also remember that capturing groups can be referenced using numbers, e.g. \4 refers to the fourth group.

In [84]:
html_tags = ['<body>Welcome to our course! It would be an awesome experience</body>',
 '<article>To be a data scientist, you need to have knowledge in statistics and mathematics</article>',
 '<nav>About me Links Contact me!']

In [85]:
for string in html_tags:
    # Complete the regex and find if it matches a closed HTML tags
    match_tag =  re.match(r"<(\w+)>.*?</(\w+)>", string)
 
    if match_tag:
        # If it matches print the first group capture: one-based index!!
        print("Your tag {} is closed".format(match_tag.group(1))) 
    else:
        # If it doesn't match capture only the tag 
        notmatch_tag = re.match(r"<(\w+)>", string)
        # Print the first group capture
        print("Close your {} tag!".format(notmatch_tag.group(1)))

Your tag body is closed
Your tag article is closed
Close your nav tag!


In [86]:
# A backreference such as \1 matches the same text that capturing group 1 matched

for string in html_tags:
    # Complete the regex and find if it matches a closed HTML tags
    match_tag =  re.match(r"<(\w+)>.*?</\1>", string)
 
    if match_tag:
        # If it matches print the first group capture: one-based index!!
        print("Your tag {} is closed".format(match_tag.group(1))) 
    else:
        # If it doesn't match capture only the tag 
        notmatch_tag = re.match(r"<(\w+)>", string)
        # Print the first group capture
        print("Close your {} tag!".format(notmatch_tag.group(1)))

Your tag body is closed
Your tag article is closed
Close your nav tag!
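
Both versions give the same output on this list, so it is worth checking where they differ. A small sketch with a made-up string whose closing tag does not match the opening tag:

```python
import re

# Hypothetical string: opened with body, closed with article
mismatched = "<body>Welcome to our course!</article>"

# Two independent groups: any closing tag is accepted
loose = re.match(r"<(\w+)>.*?</(\w+)>", mismatched)

# Backreference \1: the closing tag must repeat the opening tag's name
strict = re.match(r"<(\w+)>.*?</\1>", mismatched)

print(bool(loose))   # True
print(bool(strict))  # False
```

Only the backreference version notices that the tag was closed with the wrong name.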


#### Repeated Characters

Back to your sentiment analysis! Your next task is to replace elongated words that appear in the tweets. We define an elongated word as a word in which a character is repeated two or more times in a row, e.g. "Awesoooome".

Replacing those words is very important since a classifier will treat them as a different term from the source words lowering their frequency.

To find them, you will use capturing groups and reference them back using numbers, e.g. \4.

If you want to find a match for Awesoooome, you first need to capture Awes. Then you match o and reference the same character back, and finally match me.

In [87]:
sentiment_analysis = ['@marykatherine_q i know! I heard it this morning and wondered the same thing. Moscooooooow is so behind the times',
 'Staying at a friends house...neighborrrrrrrs are so loud-having a party',
 'Just woke up an already have read some e-mail']

In [88]:
# Complete the regex to match an elongated word (word with 2 or more consecutive identical letters)
# (\w) captures one character and \1 requires the same character again, e.g. the "oo" in Moscooooooow
regex_elongated = r"\w+(\w)\1\w*"

for tweet in sentiment_analysis:
	# Find if there is a match in each tweet 
	match_elongated = re.search(regex_elongated, tweet)
    
	if match_elongated:
		# Assign the captured group zero 
		elongated_word = match_elongated.group(0)
        
		# Complete the format method to print the word
		print("Elongated word found: {word}".format(word=elongated_word))
	else:
		print("No elongated word found") 

Elongated word found: Moscooooooow
Elongated word found: neighborrrrrrrs
No elongated word found
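
The cell above only finds the elongated words. For the replacing step, one possible sketch is to collapse any run of three or more identical characters into a single one with `re.sub`. The threshold of three is my own assumption, chosen so that ordinary double letters (as in "good") are left alone. The tweets are shortened versions of the ones above.

```python
import re

# Shortened versions of the example tweets
elongated_tweets = [
    "Moscooooooow is so behind the times",
    "neighborrrrrrrs are so loud",
]

# (\w) captures a character, \1{2,} requires it at least two more times
regex_run = r"(\w)\1{2,}"

# Replace each run with a single copy of the captured character
normalized = [re.sub(regex_run, r"\1", tweet) for tweet in elongated_tweets]

print(normalized)  # ['Moscow is so behind the times', 'neighbors are so loud']
```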


#### Lookaround

Positive lookahead (?=) makes sure that the first part of the expression is followed by the lookahead expression. Positive lookbehind (?<=) returns all matches that are preceded by the specified pattern.

Positive look-ahead
- Non-capturing group
- Checks that the first part of the expression is followed by the lookahead expression
- Returns only the first part of the expression

In [89]:
sentiment_analysis = short_tweets.iloc[47,5]

print(sentiment_analysis)

You need excellent python skills to be a data scientist. Must be! Excellent python


In [90]:
# Positive lookahead: match a word only if " python" comes right after it (to its right)
look_ahead = re.findall(r"\w+(?=\spython)", sentiment_analysis)

# Print out
print(look_ahead)

['excellent', 'Excellent']


In [91]:
# Positive lookbehind: match a word only if "python " comes right before it (to its left)
look_behind = re.findall(r"(?<=[Pp]ython\s)\w+", sentiment_analysis)

# Print out
print(look_behind)

['skills']


Negative look-ahead
- Non-capturing group
- Checks that the first part of the expression is **not** followed by the lookahead expression
- Returns only the first part of the expression
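
A minimal sketch of the negative versions, on a short made-up phrase. The `\b` word boundaries are my own addition: without them the regex could backtrack and return part of a word (such as "goo") instead of skipping the word.

```python
import re

phrase = "good python skills"

# Negative lookahead: whole words NOT followed by " python"
not_followed = re.findall(r"\b\w+\b(?!\spython)", phrase)

# Negative lookbehind: whole words NOT preceded by "python "
not_preceded = re.findall(r"(?<!python\s)\b\w+\b", phrase)

print(not_followed)  # ['python', 'skills']
print(not_preceded)  # ['good', 'python']
```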

Filtering phone numbers

Now, you need to write a script for a cell-phone searcher. It should scan a list of phone numbers and return those that meet certain characteristics.

The phone numbers in the list have the structure:

- Optional area code: 3 numbers
- Prefix: 4 numbers
- Line number: 6 numbers
- Optional extension: 2 numbers

E.g. 654-8764-439434-01.

You decide to use .findall() together with the negative lookahead (?!) and negative lookbehind (?<!), which check the surrounding text without capturing it.

In [92]:
cellphones = ['4564-646464-01', '345-5785-544245', '6476-579052-01']

In [93]:
for phone in cellphones:
	# Get all phone numbers not preceded by an area code (negative lookbehind checks the text to the left)
	number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d*", phone)
	print(number)

['4564-646464-01']
[]
['6476-579052-01']


In [94]:
for phone in cellphones:
	# Get all phone numbers not followed by the optional extension (negative lookahead checks the text to the right)
	number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
	print(number)

[]
['345-5785-544245']
[]
