# Strings, Lists, and Regular Expressions
Python provides two basic ways to hold text for processing: strings and lists of strings.
We will use string methods such as lower(), split(), and strip(), list methods such as append(), and Python's re (regular expression) library.
### Strings:
Strings are delimited with single quotes ('), double quotes ("), or triple quotes (''' or """).

In [1]:
text = 'New York, New York (So Good They Named It Twice)'
print(text)

New York, New York (So Good They Named It Twice)


In [2]:
# Multiline strings are denoted using triple-quotes 
multiline = """
one
two 
three
"""
print(multiline)


one
two 
three



String functions:

In [3]:
print(text.lower())

new york, new york (so good they named it twice)


In [4]:
text.upper()

'NEW YORK, NEW YORK (SO GOOD THEY NAMED IT TWICE)'

In [5]:
text.title()

'New York, New York (So Good They Named It Twice)'

In [6]:
text.capitalize()

'New york, new york (so good they named it twice)'

In [7]:
text.swapcase()

'nEW yORK, nEW yORK (sO gOOD tHEY nAMED iT tWICE)'

In [8]:
duck = "    If it looks like a duck,     swims like a duck, and     quacks like a duck, then it is    probably a duck.  "

In [9]:
duck.strip()

'If it looks like a duck,     swims like a duck, and     quacks like a duck, then it is    probably a duck.'

In [10]:
duck.rstrip()

'    If it looks like a duck,     swims like a duck, and     quacks like a duck, then it is    probably a duck.'

In [11]:
duck.lstrip()

'If it looks like a duck,     swims like a duck, and     quacks like a duck, then it is    probably a duck.  '

In [12]:
# To locate first occurrence of a character string: str.find() or str.index()
duck.find("duck")

23

In [13]:
# rfind() or rindex() works from end of string
duck.rfind("duck")

105

In [14]:
# Returns -1 if search string is not found
duck.find("fox")

-1
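
The difference between str.find() and str.index() only shows up when the substring is missing. A minimal sketch, using a shortened sentence as a stand-in for the `duck` string above:

```python
sentence = "If it looks like a duck, it is probably a duck."

# find() signals a miss with -1, so the result can be tested directly
position = sentence.find("fox")
print(position)

# index() raises ValueError on a miss, so it needs a try/except
try:
    sentence.index("fox")
    found = True
except ValueError:
    found = False
print(found)
```

find() is convenient for a quick check; index() is the safer choice when a missing substring should stop the program rather than silently yield -1.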

In [15]:
# Checking if a substring occurs at the beginning (or end) of a string
duck.startswith("duck")

False

In [16]:
duck.strip().endswith("duck.")

True

In [17]:
duck.replace("duck", "chicken")

'    If it looks like a chicken,     swims like a chicken, and     quacks like a chicken, then it is    probably a chicken.  '

In [18]:
duck = "    If it looks like a duck,     swims like a duck, and     quacks like a duck, then it is    probably a duck.  "
duck.strip().replace("o", "--")

'If it l----ks like a duck,     swims like a duck, and     quacks like a duck, then it is    pr--bably a duck.'

In [19]:
# Splitting a string based on a substring location
duck.partition("duck")

('    If it looks like a ',
 'duck',
 ',     swims like a duck, and     quacks like a duck, then it is    probably a duck.  ')

In [20]:
# Splitting a string into a list of words
duck.split()

['If',
 'it',
 'looks',
 'like',
 'a',
 'duck,',
 'swims',
 'like',
 'a',
 'duck,',
 'and',
 'quacks',
 'like',
 'a',
 'duck,',
 'then',
 'it',
 'is',
 'probably',
 'a',
 'duck.']

Note: split() removes whitespace, but does not remove punctuation.

In [21]:
# Splitting a multi-line document based on newline characters
multiline.splitlines()

['', 'one', 'two ', 'three']

In [22]:
# Joining words in a list back to a string
words = duck.split()
"--".join(words)

'If--it--looks--like--a--duck,--swims--like--a--duck,--and--quacks--like--a--duck,--then--it--is--probably--a--duck.'

"--" in the above example is the separator string, and the list words is the iterable being joined; use a space (" ") as the separator to rebuild a normal sentence.

In [23]:
# Joining with newline to a multi-line string
"\n".join(words)

'If\nit\nlooks\nlike\na\nduck,\nswims\nlike\na\nduck,\nand\nquacks\nlike\na\nduck,\nthen\nit\nis\nprobably\na\nduck.'

In [24]:
print("\n".join(words))

If
it
looks
like
a
duck,
swims
like
a
duck,
and
quacks
like
a
duck,
then
it
is
probably
a
duck.
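
As a further illustration of join(), here is a small sketch with a space as the separator, plus the conversion needed when the list holds non-string items. The short lists are made up for the example:

```python
words = ["If", "it", "looks", "like", "a", "duck"]

# The separator string comes first; the list of words is the argument
sentence = " ".join(words)
print(sentence)

# join() accepts only strings, so convert other types first
numbers = [1, 2, 3]
joined_numbers = ", ".join(str(n) for n in numbers)
print(joined_numbers)
```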



### Lists: 
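A list is another way to store text, usually one word per element. Lists are ordered and keep duplicates; the "bag of words" model used later ignores that order and looks only at which words occur.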

In [25]:
words = duck.split()
print(words)

['If', 'it', 'looks', 'like', 'a', 'duck,', 'swims', 'like', 'a', 'duck,', 'and', 'quacks', 'like', 'a', 'duck,', 'then', 'it', 'is', 'probably', 'a', 'duck.']
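
A bag-of-words representation can be sketched with collections.Counter, which discards word order and keeps only counts. The short `words` list here is a truncated stand-in for the full list above:

```python
from collections import Counter

words = ["If", "it", "looks", "like", "a", "duck,", "swims", "like", "a", "duck,"]

# A list keeps order and duplicates; a Counter keeps only word counts
bag = Counter(words)
print(bag["like"])
print(bag["swims"])
```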


In [26]:
words[0]

'If'

In [27]:
words[5] + words[3]

'duck,like'

In [28]:
words[0:2]

['If', 'it']

In [29]:
words[-1]

'duck.'

In [30]:
positive_words = ['awesome', 'good', 'nice', 'super', 'like']

In [31]:
positive_words.append('delightful')
print(positive_words)

['awesome', 'good', 'nice', 'super', 'like', 'delightful']


In [32]:
negative_words = ['awful','lame','horrible','bad']
print(negative_words)

['awful', 'lame', 'horrible', 'bad']


In [33]:
# Combining lists
emotional_words = negative_words + positive_words
print(emotional_words)

['awful', 'lame', 'horrible', 'bad', 'awesome', 'good', 'nice', 'super', 'like', 'delightful']


### Tuples & Dictionaries:

In [34]:
# Tuple: immutable lists, indicated by parentheses
row = (1, 3, "fish")
print(row)

(1, 3, 'fish')


In [35]:
# Dictionary: key:value pairs
book = {'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957}
print(book)

{'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957}
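
To make the two comments above concrete, the following sketch shows what "immutable" means for a tuple and how dictionary values are looked up by key. It reuses the same `row` and `book` values:

```python
row = (1, 3, "fish")
book = {"title": "Cat in the Hat", "author": "Dr. Seuss", "Year": 1957}

# Tuples support indexing like lists, but item assignment raises TypeError
try:
    row[2] = "cat"
    changed = True
except TypeError:
    changed = False
print(row[2], changed)

# Dictionary values are looked up by key; keys are case-sensitive
print(book["author"])
print(book.get("year", "no such key"))  # 'Year' exists, 'year' does not
```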


### Looping through lists:

In [36]:
# Splitting a string into a list of words (tokens) is called TOKENIZATION

tweet1 = 'We have some delightful new food in the cafeteria. Awesome!!!'
words1 = tweet1.split()
print(words1)

['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']


In [37]:
len(tweet1)

61

In [38]:
len(words1)

10

In [39]:
for word in words1:
    print(word)

We
have
some
delightful
new
food
in
the
cafeteria.
Awesome!!!


In [40]:
for word in words1:
    if word in positive_words:
        print(word)

delightful


"Awesome!!!" was not detected because the in operator requires an exact match: the token differs from 'awesome' in case and carries trailing punctuation.

In [41]:
tweet2 = "Food is lame today. I don't like it at all"
words2 = tweet2.split()

for word in words2:
    if word in positive_words:
        print('+')

+


In [42]:
for word in words2:
    if word in positive_words:
        print ('+')
    elif word in negative_words:
        print ('-')

-
+


In the above example, lame (-) occurs first, followed by like (+)

In [43]:
for word in words2:
    if word in positive_words:
        print (word + ' is a positive word.')
    elif word in negative_words:
        print (word + " is a negative word.")

lame is a negative word.
like is a positive word.


### Formatting strings

In [44]:
pi = 3.14159
print(pi, str(pi))

3.14159 3.14159


In [45]:
print('pi = ' + str(pi))   # str() is required; 'pi = ' + pi raises a TypeError

pi = 3.14159

In [46]:
for number in [1, 2, 4, 'like', 8]:
    sentence = str(number) + ' is a number.'
    print(sentence)

1 is a number.
2 is a number.
4 is a number.
like is a number.
8 is a number.


In [47]:
string = "This article is written in {}"
print (string.format("Python")) 
print("Python")

This article is written in Python
Python


In [49]:
age = 21
print("Hello, I am {} years old!".format(age))

Hello, I am 21 years old!


In [50]:
# multiple placeholders
major = 'business analytics'
print("I'm {} years old, and a {} major.".format(age, major))

I'm 21 years old, and a business analytics major.


In [51]:
# positional arguments
print("I'm {1} years old, and a {0} major.".format(major, '21'))

I'm 21 years old, and a business analytics major.


In [52]:
''' Specifying types: 
s – strings
d – decimal integers
f – floating point numbers
c – character
b – binary
o – octal
x – hexadecimal with lowercase letters after 9
X – hexadecimal with uppercase letters after 9
e – exponent notation
'''

print("The temperature today is {0:f} degrees outside!".format(35.567))

The temperature today is 35.567000 degrees outside!


In [53]:
print("The temperature today is {0:.0f} degrees outside !".format(35.567))

The temperature today is 36 degrees outside !


In [54]:
print ("The average grade in the {0} class was {1:.2f}%".format("text analytics", 78.234876)) 

The average grade in the text analytics class was 78.23%


In [55]:
print("The {0} of 100 is {1:b}".format("binary", 100))

The binary of 100 is 1100100


In [56]:
# Adjusting for spacing
print("It is {0:5} degrees outside !".format(40))

It is    40 degrees outside !


In [57]:
for i in range (3, 9): 
    print("{:6d} {:6d} {:6d} {:6d}".format(i, i**2, i**3, i**4))

     3      9     27     81
     4     16     64    256
     5     25    125    625
     6     36    216   1296
     7     49    343   2401
     8     64    512   4096
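
The same format specifiers also work in f-strings (Python 3.6 and later), which place the values directly inside the braces. This is offered as an alternative sketch, not as part of the original cells; the variable `grade` is introduced here for illustration:

```python
age = 21
major = "business analytics"
grade = 78.234876

# Expressions go directly inside the braces of an f-string
line1 = f"I'm {age} years old, and a {major} major."
# Format specifiers follow a colon, as with str.format()
line2 = f"The average grade was {grade:.2f}%"
line3 = f"{5:6d}{5**2:6d}"
print(line1)
print(line2)
print(line3)
```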


### Text Cleaning:

In [58]:
tweet1 = "We have some delightful new food in the cafeteria. Awesome!!!"
print (tweet1.lower())

we have some delightful new food in the cafeteria. awesome!!!


In [59]:
words1 = tweet1.lower().split()
print (words1)

['we', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'awesome!!!']


lower() works on strings, not on lists, so call it before split().

In [60]:
print('Awesome!!!'.strip('!'))

Awesome


Removing all punctuation

In [61]:
from string import punctuation

In [62]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [63]:
punctuation = punctuation + '\n'

In [64]:
"Awesome!!!".strip(punctuation)

'Awesome'

In [65]:
"Awesome!!! party".replace("!", "")

'Awesome party'

In [66]:
print(tweet1)
tweet1 = tweet1.lower().strip(punctuation)
print(tweet1)

We have some delightful new food in the cafeteria. Awesome!!!
we have some delightful new food in the cafeteria. awesome


strip() removes characters only from the two ends of a string, so the period in the middle was not removed. We have to remove it separately.

In [67]:
tweet1 = tweet1.replace(".", "")
print(tweet1)

we have some delightful new food in the cafeteria awesome


In [68]:
words1 = tweet1.split()
print(words1)

['we', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria', 'awesome']
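
Removing each punctuation mark with its own replace() call does not scale well. One common alternative, sketched here under the assumption that every character in string.punctuation should be deleted, is str.translate():

```python
from string import punctuation

tweet = "We have some delightful new food in the cafeteria. Awesome!!!"

# Build a table that maps every punctuation character to None (delete)
table = str.maketrans("", "", punctuation)
cleaned = tweet.lower().translate(table)
words = cleaned.split()
print(words)
```

Note that this also deletes apostrophes, so "don't" would become "dont"; whether that is acceptable depends on the word lists being matched.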


### Sentiment Analysis: 

In [69]:
tweet1 = 'We have some delightful new food in the cafeteria. Awesome!!!'
words1 = tweet1.lower().strip(punctuation).replace(".", "").split()
words1

['we',
 'have',
 'some',
 'delightful',
 'new',
 'food',
 'in',
 'the',
 'cafeteria',
 'awesome']

In [70]:
positive_words

['awesome', 'good', 'nice', 'super', 'like', 'delightful']

In [71]:
negative_words

['awful', 'lame', 'horrible', 'bad']

In [72]:
positive = 0
negative = 0
for word in words1:
    if word in positive_words:
        positive += 1
    elif word in negative_words:
        negative += 1

print (tweet1 + "\nPositive word count: " + str(positive) 
       + "\nNegative word count: " + str(negative))

We have some delightful new food in the cafeteria. Awesome!!!
Positive word count: 2
Negative word count: 0


In [73]:
sentiment = (positive - negative)/(positive + negative)
print("Sentiment score: " + str(sentiment))

Sentiment score: 1.0
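
The sentiment formula divides by zero when a text contains no positive or negative words. A minimal sketch of a guarded version follows; the function name `sentiment_score` and the choice to return 0.0 for such texts are assumptions made for illustration:

```python
def sentiment_score(words, positive_words, negative_words):
    """Return (pos - neg) / (pos + neg), or 0.0 when no word matches."""
    positive = sum(1 for word in words if word in positive_words)
    negative = sum(1 for word in words if word in negative_words)
    total = positive + negative
    if total == 0:
        return 0.0  # assumption: treat "no emotional words" as neutral
    return (positive - negative) / total

positive_words = ['awesome', 'good', 'nice', 'super', 'like', 'delightful']
negative_words = ['awful', 'lame', 'horrible', 'bad']

score1 = sentiment_score(['delightful', 'food', 'awesome'], positive_words, negative_words)
score2 = sentiment_score(['the', 'cafeteria', 'is', 'open'], positive_words, negative_words)
print(score1, score2)
```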


In [74]:
tweet2 = "Food is lame today. I don't like it at all"
words2 = tweet2.lower().strip(punctuation).replace(".", "").split()
print(tweet2)

positive = 0
negative = 0
for word in words2:
    if word in positive_words:
        positive += 1
    elif word in negative_words:
        negative += 1
sentiment = (positive - negative)/(positive + negative)

print ("Positive word count: " + str(positive))
print ("Negative word count: " + str(negative))
print("Sentiment score: " + str(sentiment))

Food is lame today. I don't like it at all
Positive word count: 1
Negative word count: 1
Sentiment score: 0.0


Note: The above code ignored "don't" because it is not in the "bag of words". It is very important to have a comprehensive bag of words if you want to do a word-based sentiment analysis. It is also important to consider bigrams (e.g., "don't like") to get a proper semantic interpretation.
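
One crude way to account for bigrams such as "don't like" is to flip the polarity of a sentiment word that directly follows a negation. The sketch below assumes a small, hand-picked `negations` list and an already cleaned, lower-cased tweet; it is an illustration of the idea, not a complete negation handler:

```python
positive_words = ['awesome', 'good', 'nice', 'super', 'like', 'delightful']
negative_words = ['awful', 'lame', 'horrible', 'bad']
negations = ["don't", "not", "never", "no"]  # illustrative, not exhaustive

words = "food is lame today i don't like it at all".split()

positive = 0
negative = 0
for i, word in enumerate(words):
    # A sentiment word directly preceded by a negation has its polarity flipped
    negated = i > 0 and words[i - 1] in negations
    if word in positive_words:
        if negated:
            negative += 1
        else:
            positive += 1
    elif word in negative_words:
        if negated:
            positive += 1
        else:
            negative += 1

print(positive, negative)
```

With this rule, "lame" and "don't like" both count as negative, so the tweet no longer scores as neutral.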

This was a very crude and incomplete sentiment analysis. The adequacy of the analysis depends on the completeness of the lexicon of positive and negative words and on the algorithm's ability to detect negation. LIWC is a widely used commercial dictionary for this kind of analysis, but it is not free. A free and fairly extensive alternative is Hu & Liu's (2004) Sentiment Lexicon, which you can download from: https://github.com/woodrad/Twitter-Sentiment-Mining/tree/master/Hu%20and%20Liu%20Sentiment%20Lexicon

### Using Regular Expressions
Regular expressions allow wildcard matching and many complex string operations (. matches any single character; + means one or more repetitions of the preceding element).<br>
Reference: Jeffrey E. F. Friedl, Mastering Regular Expressions (3rd Edition)

In [75]:
import re
regex = re.compile(r'\s+')

\s is a special sequence that matches any whitespace character (space, tab, newline, etc.), and "+" is a quantifier meaning one or more of the item preceding it. Thus, the regular expression matches any substring consisting of one or more whitespace characters.

In [76]:
fox = "The quick    brown   fox jumped over a lazy dog."
regex.split(fox)

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog.']

In [77]:
# regex.search() method is similar to str.index() or str.find(); 
# locates substring position. re.compile() specifies search string.
regex = re.compile("dog")
match = regex.search(fox)
match.start()

44

In [78]:
# regex.sub() method operates much like str.replace()
regex.sub('bear', fox)

'The quick    brown   fox jumped over a lazy bear.'
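
Where re.sub() goes beyond str.replace() is in replacing anything that matches a pattern, not just one fixed substring. A short sketch using the module-level re.sub() function on the same sentence:

```python
import re

fox = "The quick    brown   fox jumped over a lazy dog."

# Replace every run of whitespace with a single space
collapsed = re.sub(r'\s+', ' ', fox)
print(collapsed)

# Alternation (|) replaces either animal in one pass
swapped = re.sub(r'fox|dog', 'bear', fox)
print(swapped)
```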

In [79]:
# regex.findall() returns a list of all matches (str.find() returns only the first position)
regex = re.compile('ion')
regex.findall('Great Expectations')

['ion']
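
findall() returns only the matched text. When positions are needed as well, finditer() yields match objects instead. A small sketch with a different, made-up pattern (a vowel followed by "t") so that more than one match occurs:

```python
import re

title = 'Great Expectations'

# findall() returns the matched text only
matches = re.findall(r'[aeiou]t', title)

# finditer() yields match objects, which also carry positions
positions = [(m.group(), m.start()) for m in re.finditer(r'[aeiou]t', title)]
print(matches)
print(positions)
```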

Certain characters have special meanings in regular expressions:
. ^ $ * + ? { } [ ] \ | ( )

In [80]:
# The r prefix in r'\$' marks a raw string, so backslashes reach the regex engine unchanged
# In a normal string, \ starts an escape sequence, e.g. "\t" is a tab
print('a\tb\tc')
regex = re.compile(r'\$')
regex.findall("the cost is $20")

a	b	c


['$']

In [81]:
# \w matches one word character (letter, digit, or underscore); \s matches one whitespace character
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

In [82]:
# Square brackets match any one character from the enclosed set
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

In [83]:
# A range can be specified with a dash:
# [a-z]: any lower-case letter; [1-3]: any digit between 1 and 3
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

['G2', 'H6']

In [84]:
# To match 3 alphanumeric characters in sequence, you can use "\w\w\w" or 
# indicate repetitions using curly braces with a number
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

In [85]:
# The + character will match one or more repetitions of what precedes it
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

In [86]:
# Matching e-mail addresses
text = "To contact Guido, e-mail guido@python.org or his older address guido@google.com."
email = re.compile(r'\w+@\w+\.[a-z]{3}')
email.findall(text)

['guido@python.org', 'guido@google.com']

In [87]:
# The above technique may not work for more complex e-mail addresses

email = re.compile(r'[\w\.-]+@[\w\.-]+')
email.findall('barack.obama@whitehouse.gov')
# Matches one or more (+) word characters (\w), periods (\.), or hyphens (-),
# followed by @, followed by one or more of the same characters

['barack.obama@whitehouse.gov']

In [88]:
# Parentheses indicate groups to extract
email = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')
text = "To contact Guido, e-mail guido@python.org or his older address guido@google.com."
email.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]
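
Groups can also be given names, which makes the extracted parts easier to read than positional tuples. The sketch below uses the same pattern with the (?P<name>...) syntax; the group names `user`, `domain`, and `suffix` are chosen here for illustration:

```python
import re

email = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
text = "To contact Guido, e-mail guido@python.org or his older address guido@google.com."

# groupdict() returns each named group's text keyed by its name
records = [m.groupdict() for m in email.finditer(text)]
print(records)
```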

### Comparing Different Types of Tokenization

In [89]:
string = "If it LOOKS like a duck, SWIMS like a duck, and QUACKS like a duck, then it is probably a duck."
print(string.lower().split())                      # Retains adjacent punctuations

['if', 'it', 'looks', 'like', 'a', 'duck,', 'swims', 'like', 'a', 'duck,', 'and', 'quacks', 'like', 'a', 'duck,', 'then', 'it', 'is', 'probably', 'a', 'duck.']


In [90]:
import re
print(re.split(r'\s', string))                     # Can use any regex pattern as separator

['If', 'it', 'LOOKS', 'like', 'a', 'duck,', 'SWIMS', 'like', 'a', 'duck,', 'and', 'QUACKS', 'like', 'a', 'duck,', 'then', 'it', 'is', 'probably', 'a', 'duck.']


In [93]:
# import nltk
# nltk.download('punkt')

from nltk.tokenize import word_tokenize
print(word_tokenize(string))                       # Separates punctuations into independent strings

['If', 'it', 'LOOKS', 'like', 'a', 'duck', ',', 'SWIMS', 'like', 'a', 'duck', ',', 'and', 'QUACKS', 'like', 'a', 'duck', ',', 'then', 'it', 'is', 'probably', 'a', 'duck', '.']


In [94]:
from nltk.tokenize import sent_tokenize
string = 'Hi, Sam. How are you today? I like your shoes.'
print(sent_tokenize(string))  

['Hi, Sam.', 'How are you today?', 'I like your shoes.']


In [95]:
import re 
print(re.split(r"[.!?]", string))

['Hi, Sam', ' How are you today', ' I like your shoes', '']
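
The plain re.split() above drops the sentence-ending punctuation and leaves an empty string at the end. A possible refinement, sketched here with a lookbehind assertion, splits on the whitespace that follows the punctuation instead:

```python
import re

string = 'Hi, Sam. How are you today? I like your shoes.'

# Split on whitespace that follows ., ! or ?; the lookbehind (?<=...)
# checks for the punctuation without consuming it
sentences = re.split(r'(?<=[.!?])\s+', string)
print(sentences)
```

This still treats every period as a sentence boundary, so abbreviations such as "Dr." would be split incorrectly; a trained tokenizer like sent_tokenize is meant to handle such cases better.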


In [96]:
tweets=["This is the best #nlp exercise I've found online! #python",
 "#NLP is super fun! <3 #learning",
 "Thanks @pythonworkshop :) #nlp #python"]

In [97]:
from nltk.tokenize import regexp_tokenize
regexp_tokenize(tweets[0], r"#\w+")

['#nlp', '#python']

In [98]:
regexp_tokenize(tweets[-1], r"[#@]\w+")

['@pythonworkshop', '#nlp', '#python']

In [99]:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
all_tokens = [tokenizer.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', "I've", 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@pythonworkshop', ':)', '#nlp', '#python']]


In [100]:
tweet = "Best Pizza in town! 🍕 Want to take a Uber? 🚕"

In [101]:
print(regexp_tokenize(tweet, r"[A-Z]\w+"))

['Best', 'Pizza', 'Want', 'Uber']


In [102]:
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(tweet, emoji))

['🍕', '🚕']
