# Regular Expressions and Basics of Text Analysis

This week, we will be learning Regular Expressions (regex) and how we can use them in our research.

In [None]:
# Regular expressions in Python are available through the built-in "re" module.

import re

In [None]:
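# The re module offers several functions. The two you will use most are shown here.
pattern = 'Data'
text = 'Data Science'

# re.search scans the whole text and returns a match object (or None if nothing matches).
re.search(pattern, text)

# re.match also returns a match object (or None), but it only matches at the start of the text.
re.match(pattern, text)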
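
The difference between the two functions is easiest to see side by side. The following is a minimal sketch; the variable names are only illustrative.

```python
import re

text = "Data Science"

# re.search scans the whole string for the first match.
found_anywhere = re.search(r"Science", text)

# re.match only succeeds if the pattern matches at position 0.
found_at_start = re.match(r"Science", text)
found_data = re.match(r"Data", text)

print(found_anywhere)   # a match object
print(found_at_start)   # None, because "Science" is not at the start
print(found_data)       # a match object
```

If you only care whether the pattern occurs somewhere in the text, `re.search` is usually the one you want.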

In [None]:
# Most often you will search for a pattern anywhere in the text.
pattern = 'Science'
text = 'Data Science'

if re.search(pattern, text):
    print("Match! I found", pattern)
else:
    print("Not a match!")

In [None]:
re.search(pattern, text)

In [None]:
# On their own, re.search and re.match return a match object, which can be
# hard to read. Call .group() on that object to get the matched text.

re.search(r'.', 'Cookie').group()
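
A match object carries more than the matched text. The short sketch below shows the methods you are most likely to need; the pattern `o+` is just an example chosen for this illustration.

```python
import re

m = re.search(r"o+", "Cookie")

if m is not None:
    print(m.group())           # the matched text
    print(m.start(), m.end())  # where the match begins and ends
    print(m.span())            # the same two positions as a tuple
```

The end position is exclusive, so `"Cookie"[m.start():m.end()]` gives back the matched text.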

Basic Patterns - https://developers.google.com/edu/python/regular-expressions

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

    a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
    . (a period) -- matches any single character except newline '\n'
    \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
    \b -- boundary between word and non-word
    \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
    \t, \n, \r -- tab, newline, return
    \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
    ^ = start, $ = end -- match the start or end of the string
    \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can try putting a backslash in front of it, \@. If it is not a valid escape sequence, like \c, your Python program will halt with an error.
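
To see several of the patterns above in action, here is a small sketch. The sample sentence and variable names are made up for illustration.

```python
import re

sample = "Room 42 opens at 9.30 on Monday"

words = re.findall(r"\w+", sample)                  # runs of word characters
digits = re.findall(r"\d+", sample)                 # runs of digits
starts_with_room = bool(re.search(r"^Room", sample))
ends_with_day = bool(re.search(r"day$", sample))
literal_dot = re.findall(r"\d\.\d", sample)         # \. matches a real period
whole_word_on = re.findall(r"\bon\b", sample)       # "on" as a separate word
any_on = re.findall(r"on", sample)                  # also finds "on" inside "Monday"

print(words)
print(digits)
print(starts_with_room, ends_with_day)
print(literal_dot)
print(len(whole_word_on), len(any_on))
```

Note how `\b` makes the difference between finding the word "on" and finding the letters "on" anywhere.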

In [None]:
# Let's have a look at some of these patterns and how they might be useful to us.
# https://www.datacamp.com/tutorial/python-regular-expression-tutorial
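print(re.search(r'^Eat', "Eat cake!").group())

# "Eat" is not at the start here, so re.search returns None.
# Calling .group() on None would raise an AttributeError, so we print the result itself.
print(re.search(r'^Eat', "Let's Eat cake!"))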

In [None]:
# You can also look for the end of a string.
print(re.search(r'cake$', "Cake! Let's eat cake").group())

# This string ends with "home!", so there is no match and the result is None.
print(re.search(r'cake$', "Let's get some cake on our way home!"))
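
Because `re.search` returns `None` when nothing matches, calling `.group()` directly on the result can fail with an `AttributeError`. One way to guard against this is a small helper. `first_match` is a hypothetical name introduced here, not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"^Eat", "Eat cake!"))
print(first_match(r"^Eat", "Let's Eat cake!"))
print(first_match(r"cake$", "Cake! Let's eat cake"))
print(first_match(r"cake$", "Let's get some cake on our way home!"))
```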

In [None]:
# Mini Assignment: Write a regex pattern that matches the beginning and the end of your first name.

# Your code here...
first_name = re.search(r'^M.+a$', "Mustafa").group()
last_name = re.search(r'türk$', 'Öztürk').group()

In [None]:
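# The most likely application of regular expressions is looking for text systematically. For this, you will use quantifiers and brackets.
# 'O*' means "zero or more O". Here it matches the empty string at position 0, because the text starts with "F".
re.search('O*', 'FOObar')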

In [None]:
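# Two digits separated by a whitespace character. There is no whitespace in 'foo123bar', so the result is None.
print(re.search(r'[0-9]\s[0-9]', 'foo123bar'))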

In [None]:
# A single hexadecimal digit: 0-9, a-f, or A-F.
re.search('[0-9a-fA-F]', '--- a0 ---')

In [None]:
# A ^ at the start of a bracket negates it: [^a-z] matches the first character that is NOT a lowercase letter, here the digit "1".
re.search('[^a-z]', '12345foo')

In [None]:
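# Backslashes are the most used and the most easily forgotten part of regex.
# Inside brackets, \& is simply a literal "&", so this class matches a, b, &, c, or d.
# None of these characters occur in 'foo[1]', so the result is None.
print(re.search(r'[ab\&cd]', 'foo[1]'))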
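
The bracket examples can be compared in one place. This is a small sketch; the last pattern, which escapes a closing bracket inside the class, is an addition for illustration.

```python
import re

# Any one hexadecimal digit.
hex_digit = re.search(r"[0-9a-fA-F]", "--- a0 ---")

# ^ inside brackets negates the class: anything except a lowercase letter.
non_lower = re.search(r"[^a-z]", "12345foo")

# Without ^ the same class finds the first lowercase letter instead.
first_letter = re.search(r"[a-z]", "12345foo")

# A backslash lets a closing bracket be part of the class itself.
closing_bracket = re.search(r"[ab\]cd]", "foo[1]")

print(hex_digit.group(), non_lower.group(), first_letter.group(), closing_bracket.group())
```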

In [None]:
# The following are the important patterns that you will use.
# 1- Dot (.)
# The dot is a wildcard that matches any single character except a newline. If you look for cake with "c.ke", you will also match "coke".
print(re.search('foo.bar', 'fooxbar'))   # match: "x" fills the dot
print(re.search('foo.bar', 'foobar'))    # None: there is nothing between "foo" and "bar"
print(re.search('foo.bar', 'foo\nbar'))  # None: the dot does not match a newline

In [None]:
# An example from https://www.geeksforgeeks.org/regular-expression-python-examples/
s = 'geeks.forgeeks'

# without using \
match = re.search(r'.', s)
print(match)

# using \
match = re.search(r's\.f', s)
print(match)

In [None]:
# Case Study from: https://www.datacamp.com/tutorial/python-regular-expression-tutorial
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def do_something(url):
    # Send an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discard the metadata at the beginning of the book
    start = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK THE IDIOT \*\*\*", raw).end()
    # Discard everything from Part 2 of the book onwards
    stop = re.search(r"II\.", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

book = do_something(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)
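
To follow what the two functions do without downloading the book, the same steps can be tried on a tiny string. This is only a sketch: the `raw` text and the shortened start marker are made up and stand in for the real Gutenberg file.

```python
import re

def preprocess(sentence):
    # Replace every run of characters that is not a letter, digit, or period
    # with a single space, then lowercase the result.
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

# Made-up stand-in for the downloaded book text.
raw = "license text *** START *** Part one text. II. Part two text."

start = re.search(r"\*\*\* START \*\*\*", raw).end()  # index just after the marker
stop = re.search(r"II\.", raw).start()                 # index where "II." begins
excerpt = raw[start:stop]

print(repr(excerpt))
print(preprocess("Hello, World! It's 9 a.m."))
```

Note that `.end()` and `.start()` return positions in the string, which is why they can be used directly for slicing.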

In [None]:
requests.get('https://www.gutenberg.org/files/2638/2638-0.txt').text

In [None]:
# How many times has the word "the" been used in the book?
# \b marks a word boundary, so "there" or "other" are not counted.
len(re.findall(r'\bthe\b', processed_book))
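
When counting a word, it matters whether the pattern is allowed to match inside longer words. The sketch below compares both counts on a made-up sentence.

```python
import re

sample = "the other day there was a theory about the weather"

# Plain substring: also hits "other", "there", "theory", and "weather".
substring_hits = re.findall(r"the", sample)

# With word boundaries: only the standalone word "the".
word_hits = re.findall(r"\bthe\b", sample)

print(len(substring_hits), len(word_hits))
```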

In [None]:
# Convert all the standalone "i" to "I" in the book, but not the "i" inside a word. (You need to use \s)
processed_book = re.sub(r'\si\s', " I ", processed_book)
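
The `\s` approach has two blind spots worth knowing: it misses an "i" at the very start or end of the text, and it misses the second of two neighbouring "i"s, because the first match has already consumed the space between them. A possible alternative, shown here only as a sketch on a made-up sentence, is the word boundary `\b`, which does not consume any characters.

```python
import re

sample = "i said i i am"

with_whitespace = re.sub(r"\si\s", " I ", sample)
with_boundary = re.sub(r"\bi\b", "I", sample)

print(with_whitespace)
print(with_boundary)
```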

In [None]:
# Estimate how many times anyone was quoted in the corpus by counting closing quotation marks (”).
len(re.findall('”', book))

In [None]:
# Assignment 1: Write a regular expression that would find you an e-mail address. This is not intended to make you a scammer.

# Your code here...

# pattern = ...

In [None]:
# Let us work on a real-life example that I had to do a couple weeks ago.

import pandas as pd

df = pd.read_excel('https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/GermanyKreiseToClean.xlsx')

In [None]:
# The task is to remove every leading "1." or "9." etc. as well as anything in parentheses. Without coding, could you come up with a plan for the computer on what to do?

df_test = df.iloc[:,0].apply(lambda x: re.sub(r'^\d+\.\s*|\s*\(.+\)$', '', str(x)))
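
The pattern has two alternatives joined by `|`: a leading number with a period, and a trailing group in parentheses. Here is a sketch of how it behaves on a few strings. The sample names are made up for illustration and are not taken from the dataset.

```python
import re

# Alternative 1: digits, a period, and optional spaces at the start.
# Alternative 2: optional spaces and a parenthesised group at the end.
pattern = r'^\d+\.\s*|\s*\(.+\)$'

samples = ["1. Berlin (Stadt)", "12. Hamburg", "München (kreisfreie Stadt)"]
cleaned = [re.sub(pattern, '', s) for s in samples]

print(cleaned)
```

Because `re.sub` replaces every match, a string that has both a leading number and a trailing parenthesis gets both removed in one call.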

# Short Introduction to NLTK and Tokenization

NLTK is a powerful library in Python for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It's widely used for prototyping and building research systems.

Tokenization in natural language processing (NLP) is the process of splitting text into smaller units, called tokens. These tokens can be words, sentences, or even parts of words. It's a fundamental step because it helps in simplifying the text data, making it easier to analyze or process for tasks like sentiment analysis, topic modeling, or information extraction.
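
To get a feeling for what tokenization does, here is a deliberately naive sketch that uses only regular expressions. It is not how NLTK works internally, and it will fail on abbreviations such as "Dr." that NLTK's tokenizers are designed to handle.

```python
import re

text = "Tokenization splits text. It is a first step!"

# Sentence tokens: split on whitespace that follows ., !, or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```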

The following code blocks were taken with modifications from DataCamp.

In [None]:
# Import necessary modules
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import requests

In [None]:
# Import the first scene from Monty Python's Holy Grail.
scene_one = requests.get(r"https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/MontyPython.txt").text

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

In [None]:
# How would you use regex here to parse the text when the soldier speaks? Give me a regex example.
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

# Your code/answer here...