## Text Processing in Python 
#### A Series of 5 Tutorials

Organized by Facts Lab (http://factslab.io/)

Instructor - Siddharth (Sid) Vashishtha (svashis3@cs.rochester.edu)


## To-do before class:

 - You can access today's notebook here: https://bit.ly/2kDLgv9

Please fill your details here if you haven't already:

 - RSVP page: https://bit.ly/2kDL6Uz

 - Today's Attendance: https://forms.gle/JM2FHsVhw8zrYmh8A


### Tutorial 1 (Date: 9 Sep, 2019)
Topics
- Basic datatypes and operations in Python
- If-else statements
- loops 


### Tutorial 2 (Date: 11 Sep, 2019)
Topics

- List comprehensions
- Defining Functions
- Regular expressions (regex) 

Optional reading for next tutorial:
Tokenization: https://web.stanford.edu/~jurafsky/slp3/2.pdf

### Tutorial 3 (Date: 13 Sep, 2019)
Topics
- Review tokenization issues
- Tokenization, Sentence separation (corenlp, spacy)
- Lemmatization (corenlp, spacy)
- Reading corpuses
- Plotting unigram counts 
- Zipfian distribution   

### Tutorial 4 (Date: 16 Sep, 2019)
Topics
- Google ngram viewer
- Constituency Parsers
- Dependency Parsers
- Tree searching

### Tutorial 5 (Semantic Resources)



# Tutorial 1

We will cover these tutorials in Google Colab : https://colab.research.google.com

Google Colab allows a simple way to access Python Jupyter Notebooks with a lot of packages already installed.

To run Google Colab, all you need is a Google account. It's a free service.

## 1. Basic datatypes and operations

### Numbers

Python supports three types of numbers - int, float, and complex

In [0]:
2

In [0]:
type(2)  #type function returns the datatype of the input argument

int

In [0]:
type(2.0)

float
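
The cells above show `int` and `float`. A minimal sketch of the third number type, `complex`, follows. The variable name `z` is only for illustration.

```python
# A complex literal uses the suffix j for the imaginary part
z = 3 + 4j

print(type(z))         # the complex type
print(z.real, z.imag)  # real and imaginary parts are stored as floats
print(abs(z))          # magnitude of the complex number
```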

In [0]:
2*4

8

In [0]:
2 + 5 - 6 * 4     # follows BODMAS/PEMDAS operator precedence

-17

In [0]:
2**4   #exponents

16

In [0]:
9%5  #remainder 

4

In [0]:
3//2  #floor (integer) division

1

In [0]:
abs(-42)  #absolute function

42

### Strings

In [0]:
type('abc')

str

#### Printing stuff

In [0]:
print("Hello, world")
print('Hello, world')

Hello, world
Hello, world


How do we handle quotations inside text?

In [0]:
print("This is Sid's tutorial")

This is Sid's tutorial


In [0]:
print('Sid said, "Text processing is cool." ')

Sid said, "Text processing is cool." 


In [0]:
print('''  Sid said, "It doesn't matter which major you have, anyone can master text processing."  ''' )

Concatenating strings

In [0]:
print('abc' + 'def')   ## concatenation

abcdef


In [0]:
print("abc"*7)

abcabcabcabcabcabcabc


In [0]:
height = 179
name = 'Ram'

print("The instructor's {cab} " + "name is {} and he is {} cms tall.".format(name, height) + "{red}")

The instructor's {cab} name is Ram and he is 179 cms tall.{red}
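
The `{cab}` and `{red}` placeholders survive in the output above because `.format()` is applied only to the string literal it is called on, before the `+` concatenation happens. The sketch below isolates that behaviour and shows two alternatives (named placeholders and f-strings, the latter assuming Python 3.6 or newer). The names `middle`, `named` and `fstring` are only for illustration.

```python
height = 179
name = 'Ram'

# .format() fills placeholders only in the string it is called on
middle = "name is {} and he is {} cms tall.".format(name, height)
print(middle)

# Named placeholders make the template easier to read
named = "name is {n} and he is {h} cms tall.".format(n=name, h=height)

# f-strings (Python 3.6+) embed the variables directly in the literal
fstring = f"name is {name} and he is {height} cms tall."

print(middle == named == fstring)
```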


#### Booleans

Comparison expressions in Python evaluate to one of two values, `True` or `False`.

In [0]:
2 < 3

True

In [0]:
"True" == True

False

In [0]:
34 > 42

False

In [0]:
59 <= 59

True

In [0]:
99 != 100  #not equals

True

In [0]:
2 == 2

True

In [0]:
type(True)

bool

In [0]:
print(False)

False


In [0]:
"string1" == "string1"

True

In [0]:
"string1" == "string1s"

False

### Collection data types

- Lists
- Tuples
- Sets
- Dictionaries

Each collection data type can take multiple types of objects. 

A generic note for Python: Indexing starts at 0 for any data type that is indexable

### Lists
A collection which is:
 - ordered and changeable
 - written with square brackets


In [0]:
alst = [1,2,3]
blst = ['abc', 'xyz', 'string']
clst = ['foo', 456, 'bar', ['pqr', 2**4]]

In [0]:
blst[2]

'string'

In [0]:
print(alst[0])
print(blst[2])
print(clst[3])

1
string
['pqr', 16]


#### Lists are changeable i.e. mutable. You can edit/delete elements of a list and add elements to it as well.

In [0]:
blst[2] = "newString"
print(blst)

['abc', 'xyz', 'newString']


In [0]:
blst.append("anotherString")
print(blst)

['abc', 'xyz', 'newString', 'anotherString']


Removing elements from a list

In [0]:
popped = blst.pop()     ## popping the last element: O(1) time complexity
print(popped)
print(blst)

anotherString
['abc', 'xyz', 'newString']


In [0]:
blst.pop()

'newString'

In [0]:
blst

['abc', 'xyz']

In [0]:
popped = blst.pop(0)     ## popping by index: O(n) time complexity
print(blst) 

['xyz']


Inserting an element in the middle of a list

In [0]:
lstd = [1,2, 3, 4, 5]

In [0]:
lstf = lstd[:2] + [42] + lstd[2:]

In [0]:
lstf

[1, 2, 42, 3, 4, 5]
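
The slicing approach above builds a new list. As a hedged alternative, the built-in `list.insert(index, value)` method changes a list in place. The name `lstg` is only for illustration.

```python
lstd = [1, 2, 3, 4, 5]

# Slicing builds a new list and leaves lstd unchanged
lstf = lstd[:2] + [42] + lstd[2:]

# list.insert(index, value) modifies the list it is called on
lstg = list(lstd)      # work on a copy so that lstd stays untouched
lstg.insert(2, 42)

print(lstf == lstg)
print(lstd)
```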

Accessing all elements except the last n

In [0]:
lstf[0:-2]

[1, 2, 42, 3]
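
A small sketch of how negative indices work in slices: they count from the end of the list. The names `n`, `all_but_last_n`, `last_n` and `last` are only for illustration.

```python
lstf = [1, 2, 42, 3, 4, 5]
n = 2

all_but_last_n = lstf[:-n]   # everything except the last n elements
last_n = lstf[-n:]           # only the last n elements
last = lstf[-1]              # index -1 is the final element

print(all_but_last_n)
print(last_n)
print(last)
```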

### Tuples

A tuple is a collection which is:
 - ordered and unchangeable
 - written with round brackets

In [0]:
atupl = (1,2,3)
btupl =  ('abc', 'xyz', 'string')
ctupl = ('foo', 456, 'bar', ['pqr', 2**4])

In [0]:
print(atupl[0])
print(btupl[2])
print(ctupl[3])

1
string
['pqr', 16]


In [0]:
atupl[2] = "jsdhfkjs"

TypeError: 'tuple' object does not support item assignment

#### Tuples are unchangeable, i.e. immutable. You cannot edit or delete the elements of a tuple, nor add elements to it.

In [0]:
# btupl[2] = "newString"
# print(btupl)

In [0]:
#btupl.append("anotherString")

#### Exercise: (Importance of mutable and immutable objects)

In [0]:
dlst = list([7,5,4])
dlst

[7, 5, 4]

In [0]:
dlst[:]    #the colon operator creates a new list object

[7, 5, 4]

In [0]:
dlst = [42,12,79]
elst = dlst
elst[1] = 7

# # What do you think would be the output of the below code?
print(dlst == elst)
print(dlst)
print(elst)

True
[42, 7, 79]
[42, 7, 79]


In [0]:
anum = 89
bnum = anum
bnum = 3

# # What do you think would be the output of the below code?
print(bnum == anum)
print(anum)
print(bnum)

## Why? `bnum = 3` binds the name bnum to a new object. Integers are immutable, so anum cannot be changed through bnum.

False
89
3
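
To make the exercise concrete, here is a minimal sketch that contrasts a second name for the same list with a copy made by slicing. The `is` operator checks whether two names refer to the same object. The names `alias` and `clone` are only for illustration.

```python
dlst = [42, 12, 79]

alias = dlst      # a second name for the same list object
clone = dlst[:]   # a new list object with the same elements

alias[1] = 7      # editing through alias also changes dlst

print(dlst)
print(clone)
print(alias is dlst, clone is dlst)
```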


### Sets

A set is a collection which is:
- unordered and unindexed
- represented with curly brackets

In [0]:
aset = {1,2,3}
bset = {7,7,6,6,3,3,3,4,9,1,1}

print(aset)
print(bset)

{1, 2, 3}
{1, 3, 4, 6, 7, 9}


In [0]:
len(aset)

3

In [0]:
print(len(aset))

In [0]:
345 in aset   #set membership

False

In [0]:
print(4 in aset)

False


In [0]:
aset.intersection(bset)

{1, 3}

In [0]:
aset.union(bset)

{1, 2, 3, 4, 6, 7, 9}
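
A short sketch of a few more set operations, including the operator forms of intersection and union shown above. Using a set to remove duplicates from a list is a common idiom. The variable names are only for illustration.

```python
aset = {1, 2, 3}
bset = {7, 7, 6, 6, 3, 3, 3, 4, 9, 1, 1}

only_in_a = aset.difference(bset)   # elements of aset that are not in bset
shared = aset & bset                # operator form of intersection
combined = aset | bset              # operator form of union

# Removing duplicates from a list, then sorting the result
unique_sorted = sorted(set([7, 7, 6, 6, 3]))

print(only_in_a, shared, combined, unique_sorted)
```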

### Dictionaries

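 A dict is a collection that has the following properties:
 - indexed by keys and editable (insertion-ordered since Python 3.7; treated as unordered in earlier versions)
 - represented with curly brackets
 - has (key, value) pairs
 
Other notes:
 - has constant-time lookup on average
 - keys must be hashable objects (immutable built-in types such as strings and numbers qualify)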
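
A minimal sketch of the restriction on keys: a tuple of strings works as a key, while a list raises a `TypeError` because lists are mutable. The dict `heights` and the flag `list_key_allowed` are only for illustration.

```python
heights = {}

# A tuple of strings is immutable and hashable, so it can be a key
heights[('pikachu', 'electric')] = 40.6

# A list is mutable, so using it as a key raises a TypeError
try:
    heights[['pikachu', 'electric']] = 40.6
    list_key_allowed = True
except TypeError as err:
    list_key_allowed = False
    print("TypeError:", err)

print(heights)
```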

In [0]:
#Storing heights of different living species (mostly Pokemon)

adict = {'sid': 179, 
         'pikachu': 40.6, 
         'bulbasaur': 71.1,
         'dragonite':221}

In [0]:
adict

{'bulbasaur': 71.1, 'dragonite': 221, 'pikachu': 40.6, 'sid': 179}

In [0]:
adict['sid'] = 2736483276

In [0]:
adict['sid']

2736483276

Adding to a dict is straightforward

In [0]:
adict['charmender'] = 61


In [0]:
print(adict)

{'sid': 2736483276, 'pikachu': 40.6, 'bulbasaur': 71.1, 'dragonite': 221, 'charmender': 61}


The keys, values, and (key, value) pairs of a dictionary can be accessed with these methods:
- `.keys()`
- `.values()`
- `.items()`

In [0]:
print(adict.keys())

dict_keys(['sid', 'pikachu', 'bulbasaur', 'dragonite', 'charmender'])


In [0]:
print(adict.values())

dict_values([2736483276, 40.6, 71.1, 221, 61])


In [0]:
adict.items()

dict_items([('sid', 2736483276), ('pikachu', 40.6), ('bulbasaur', 71.1), ('dragonite', 221), ('charmender', 61)])

## 2. If, else and Loops

Python requires indentation for scoping in your code.
In some other languages, curly brackets are used for this purpose

In [0]:
if 21 < 3:
  print("You are correct! :) ")
  print("khfskdjhf")
else:
  print("You are wrong! :( ")
  print("abcs")

You are wrong! :( 
abcs


In [0]:
# Without indentation, running this cell raises an IndentationError
if 21 < 3:
print("You are correct! :) ")
else:
print("You are wrong! :( ")

In [0]:
a = 3487683674
b = 200

if a < b:
  print("a is less than b")
if a > b:
  print("a is greater than b")

a is greater than b


When Python encounters a series of `if` statements, it evaluates each of them in turn, independently of the others

In [0]:
a = 100
b = 200

if a < b:
  print("a is less than b")
if a > b:
  print("a is greater than b")
if a==b:
  print("a is equal to b")

`elif` is a way of saying: if the previous conditions were not true, try this condition

In [0]:
a = 100
b = 200

if a < b:
  print("a is less than b")
elif a > b:
  print("a is greater than b")
elif a==b:
  print("a is equal to b")


All objects in Python have a truth value.

The following objects have a `False` value
- 0
- 0.0 (zero as a float)
- an empty sequence: '', (), []
- an empty mapping: {}
- set()
- `None`

All other objects in Python have the truth value `True` by default
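
The truth-value rules above can be checked with the built-in `bool()` function. This is only a sketch; the two example lists are illustrative and not exhaustive.

```python
falsy_examples = [0, 0.0, '', (), [], {}, set(), None]
truthy_examples = [5, 'False', [0], {4}, (None,)]

for obj in falsy_examples + truthy_examples:
    print(repr(obj), bool(obj))

# A non-empty string is truthy even if its text is 'False',
# and a non-empty list is truthy even if it only contains 0
all_falsy = not any(bool(obj) for obj in falsy_examples)
all_truthy = all(bool(obj) for obj in truthy_examples)
print(all_falsy, all_truthy)
```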

In [0]:
a = 5
if a:
  print("if condition is true")
else:
  print("else condition is true")

if condition is true


In [0]:
not True

False

In [0]:
b = []
if b:
  print("if condition is true")
else:
  print("else condition is true")

else condition is true


In [0]:
if {4}:
  print("if condition is true")
else:
  print("else condition is true")

if condition is true


#### Loops

A `for` loop is used for iterating over a sequence.

`for` doesn't require an indexing variable to be set beforehand

In [0]:
dlst = [42,12,79]

for item in dlst:
  
  print(item)
  print(item*2)

42
84
12
24
79
158


In [0]:
string = "apple"
for ch in string:
  print(ch)

a
p
p
l
e


In [0]:
range(0, 10, 1)   #start, stop, increment

In [0]:
for i in range(1,10,2):
  print(i)

1
3
5
7
9


In [0]:
## Default start = 0 and default increment = 1
for i in range(0,10,1):
  print(i)

0
1
2
3
4
5
6
7
8
9
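
Because of those defaults, `range(10)` produces the same numbers as `range(0, 10, 1)`. A `range` object is lazy, so the sketch below wraps it in `list()` to see its contents. The negative step is an extra illustration that is not covered in the cells above.

```python
full = list(range(0, 10, 1))
short = list(range(10))          # start defaults to 0, increment to 1
print(full == short)

# A negative increment counts down; the stop value is still excluded
countdown = list(range(10, 0, -2))
print(countdown)
```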


In [0]:
adict = {'sid': 179, 
         'pikachu': 40.6, 
         'bulbasaur': 71.1,
         'dragonite':221}

for key in adict:
  print(" {}'s height is:  {}".format(key, adict[key]))

 sid's height is:  179
 pikachu's height is:  40.6
 bulbasaur's height is:  71.1
 dragonite's height is:  221
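
The loop above iterates over the keys and looks up each value. As a hedged alternative, `.items()` yields (key, value) pairs that can be unpacked directly in the `for` statement. The names `name`, `height` and `lines` are only for illustration.

```python
adict = {'sid': 179,
         'pikachu': 40.6,
         'bulbasaur': 71.1,
         'dragonite': 221}

lines = []
for name, height in adict.items():   # each item is a (key, value) tuple
    lines.append("{}'s height is: {}".format(name, height))

print("\n".join(lines))
```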


In [0]:
lst = [1, 2.3, 4, 5.5, 6.6, 7, 8]


for var in lst:
  if isinstance(var, int):
    print(var)

1
4
7
8


A break statement can be used to break the for loop before it has looped through all the items

In [0]:
for var_name in "harry_potter":
  if var_name=='p':
    break
  else:
    print(var_name)
    
#  ########## this is a comment


'''
This line is commented
This line is also commented
This line is also commented
'''

print("Final value of var_name: {}".format(var_name))

h
a
r
r
y
_
Final value of var_name: p


A `while` loop runs as long as its condition is true

In [0]:
i = 10
while i > 0:
  print(i)
  i = i-1
  
print("Final value of i: {}".format(i))

10
9
8
7
6
5
4
3
2
1
Final value of i: 0


In [0]:
i = 11
while i > 0:
  print(i)
  
  if i%2==0:## even number (remainder when divided by 2 == 0)
    break
  
  i = i-1

11
10


# Tutorial 2 (Date: 11 Sep, 2019)

Today's Attendance: https://forms.gle/qCd7Muju8oFinkBj7

Topics today:

- List comprehensions
- Defining Functions
- Regular expressions (regex) 

## 3. List comprehensions

List comprehensions are a concise way of creating lists

Consider the below for loop for creating a list

In [0]:
alst = []
for i in range(15):
  alst.append(i)
  
print(alst)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


We can create the same list alst as follows using a list comprehension

In [0]:
blst = [i for i in range(15)]
print(blst)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


In [0]:
alst = []

for i in range(10):
  for j in range(5):
    alst.append(i*j)
    
print(alst)

[0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 2, 4, 6, 8, 0, 3, 6, 9, 12, 0, 4, 8, 12, 16, 0, 5, 10, 15, 20, 0, 6, 12, 18, 24, 0, 7, 14, 21, 28, 0, 8, 16, 24, 32, 0, 9, 18, 27, 36]


In [0]:
blst = [i*j for i in range(10) for j in range(5)]
blst == alst

True

In [0]:
## creating tuples inside a list
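
The cell above is left empty in the notebook. One possible sketch of creating tuples inside a list with a comprehension follows. The names `pairs` and `even_sum_pairs`, and the trailing `if` filter, are illustrative assumptions rather than the original cell's content.

```python
# Each element of the list is a (number, square) tuple
pairs = [(i, i**2) for i in range(5)]
print(pairs)

# Two nested loops plus a condition: keep (i, j) only if i + j is even
even_sum_pairs = [(i, j) for i in range(3) for j in range(3) if (i + j) % 2 == 0]
print(even_sum_pairs)
```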

## 4. Functions

- A function is a block of code which runs only when it is called.
- You can pass parameters (some data) into a function
- A function can return any data as its result
- The return command exits the function whenever it is encountered inside a function

In [0]:
def multiplication_table(n):
  '''
  Input: takes an integer input
  
  Output: returns the multiplication table of the input integer up to 10 (as a list)
  '''
  ans_lst = []
  
  for i in range(1,11):  # multipliers 1 through 10
    ans_lst.append(i*n)
    
  return ans_lst #this will be returned as the output of the function
    

Once a function is defined, you can call the function as follows:

In [0]:
multiplication_table(5.345)

[5.345,
 10.69,
 16.035,
 21.38,
 26.724999999999998,
 32.07,
 37.415,
 42.76,
 48.105,
 53.449999999999996]

In [0]:
# Exercise:
# create a function 'is_prime' that takes a number as its input and prints if it is a prime number

In [0]:
is_prime(439)
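
One possible solution to the exercise, as a hedged sketch. It uses trial division up to the square root of the input. Returning a boolean in addition to printing is an assumption that goes beyond the exercise statement. The sketch also shows `return` exiting the function as soon as it is reached.

```python
def is_prime(n):
  '''
  Input: an integer

  Output: prints whether n is prime, and returns True or False
  '''
  if n < 2:
    print("{} is not a prime number".format(n))
    return False

  i = 2
  while i * i <= n:      # divisors only need checking up to sqrt(n)
    if n % i == 0:
      print("{} is not a prime number".format(n))
      return False       # return exits the function immediately
    i = i + 1

  print("{} is a prime number".format(n))
  return True


is_prime(439)
```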

## 5. Regular Expressions

A regular expression is a sequence of characters that defines a search pattern in strings.

A resource for practising or checking regular expressions - https://www.regexpal.com/

**Cheat Sheet from regexpal.com**:

**Character classes**

- `.` any character except newline
- `\w` `\d` `\s` word, digit, whitespace
- `\W` `\D` `\S` not word, digit, whitespace
- `[abc]` any of a, b, or c
- `[^abc]` not a, b, or c
- `[a-g]` character between a & g

**Anchors**

- `^abc$` start / end of the string
- `\b` word boundary

**Escaped characters**

- `\.` `\*` `\\` escaped special characters
- `\t` `\n` `\r` tab, linefeed, carriage return
- `\u00A9` unicode escaped ©

**Groups & Lookaround**

- `(abc)` capture group
- `\1` backreference to group #1
- `(?:abc)` non-capturing group
- `(?=abc)` positive lookahead
- `(?!abc)` negative lookahead

**Quantifiers & Alternation**

- `a*` `a+` `a?` 0 or more, 1 or more, 0 or 1
- `a{5}` `a{2,}` exactly five, two or more
- `a{1,3}` between one & three
- `a+?` `a{2,}?` match as few as possible
- `ab|cd` match ab or cd

In [0]:
import re #importing the Python regex library

In [0]:
example_string = '''
Harry Potter's Hogwarts student ID was GRYF10011. 
Ron's ID was just next to Harry's -- GRYF10012.
Draco Malfoy's ID was SLY7777. 
Fun fact - Harry Potter was 17 when he killed Voldemort. Voldemort was 72 years old when he died. 
Dumbledore was probably 150 when he died, who knows, doesn't matter anyway.
What is this sentence doing here? Nobody knows!!!
Random math equation just for the purpose of a regex example: 72=9*8 and 9+8=17
Another random math eq: x2-5x+6=0
Last equation: 2=2
Bye!
'''

In [0]:
print(example_string)



Harry Potter's Hogwarts student ID was GRYF10011. 
Ron's ID was just next to Harry's -- GRYF10012.
Draco Malfoy's ID was SLY7777. 
Fun fact - Harry Potter was 17 when he killed Voldemort. Voldemort was 72 years old when he died. 
Dumbledore was probably 150 when he died, who knows, doesn't matter anyway.
What is this sentence doing here? Nobody knows!!!
Random math equation just for the purpose of a regex example: 72=9*8 and 9+8=17
Another random math eq: x2-5x+6=0
Last equation: 2=2
Bye!




In [0]:
ages = re.findall(r'\d', example_string)  # each digit is matched separately
print(ages)

['1', '0', '0', '1', '1', '1', '0', '0', '1', '2', '7', '7', '7', '7', '1', '7', '7', '2', '1', '5', '0', '7', '2', '9', '8', '9', '8', '1', '7', '2', '5', '6', '0', '2', '2']


In [0]:
ages = re.findall(r' \d+ ', example_string)
print(ages)

[' 17 ', ' 72 ', ' 150 ']


In [0]:
ages = re.findall(r' (\d{1,3}) ', example_string)
print(ages)

['17', '72', '150']


Words, alphanumerics

#### Tokenization based on whitespaces

In [0]:
words = re.findall(r'\S+', example_string)  
print(words)

['Harry', "Potter's", 'Hogwarts', 'student', 'ID', 'was', 'GRYF10011.', "Ron's", 'ID', 'was', 'just', 'next', 'to', "Harry's", '--', 'GRYF10012.', 'Draco', "Malfoy's", 'ID', 'was', 'SLY7777.', 'Fun', 'fact', '-', 'Harry', 'Potter', 'was', '17', 'when', 'he', 'killed', 'Voldemort.', 'Voldemort', 'was', '72', 'years', 'old', 'when', 'he', 'died.', 'Dumbledore', 'was', 'probably', '150', 'when', 'he', 'died,', 'who', 'knows,', "doesn't", 'matter', 'anyway.', 'What', 'is', 'this', 'sentence', 'doing', 'here?', 'Nobody', 'knows!!!', 'Random', 'math', 'equation', 'just', 'for', 'the', 'purpose', 'of', 'a', 'regex', 'example:', '72=9*8', 'and', '9+8=17', 'Another', 'random', 'math', 'eq:', 'x2-5x+6=0', 'Last', 'equation:', '2=2', 'Bye!']


In some languages, tokenization is not as simple as separation by a whitespace. 

Consider this string from Chinese.

In [0]:
mandarin_string1 = "如果你吃面条的话，那我要回家"  # “If you eat noodles then I will go home.”
mandarin_string2 = "苹果是我最喜欢的水果"         # “Apples are my favourite fruit”.

In [0]:
words = re.findall(r'\w+', mandarin_string1)
print(words)

['如果你吃面条的话', '那我要回家']


In [0]:
words = re.findall(r'\w+', mandarin_string2)
print(words)

['苹果是我最喜欢的水果']


#### Exercise:

Using regex:
1. Find all the student IDs in the example_string
2. Find all the math equations in the example_string
3. Find all names in the example_string


In [0]:
## Finding all the IDS in the string
re.findall(r'[A-Z]+[0-9]+', example_string)

['GRYF10011', 'GRYF10012', 'SLY7777']

In [0]:
## Finding all the equations in the string
re.findall(r'\S+=\S+', example_string)

['72=9*8', '9+8=17', 'x2-5x+6=0', '2=2']

In [0]:
## Finding all the names
re.findall(r'([A-Z][A-Za-z]+)[\s\']', example_string)

['Harry',
 'Potter',
 'Hogwarts',
 'ID',
 'Ron',
 'ID',
 'Harry',
 'Draco',
 'Malfoy',
 'ID',
 'Fun',
 'Harry',
 'Potter',
 'Voldemort',
 'Dumbledore',
 'What',
 'Nobody',
 'Random',
 'Another',
 'Last']

Lookarounds:

Lookarounds do not consume any characters of the string. They only check whether a pattern matches (or fails to match) next to the current position.

 - Positive lookahead `(?=...)`

 - Negative lookahead `(?!...)`

 - Positive lookbehind `(?<=...)`

 - Negative lookbehind `(?<!...)`


Note: a non-capturing group `(?:...)` is different. It consumes characters; it just does not save them as a group.

Examples:



In [0]:
string = "abababcdeab"  # sample string used in the lookaround examples
re.findall(r'b(a)b', string)

['a']

#### Exercise: Find all the 'a's in the string which are both preceded and followed by a 'b'

In [0]:
re.findall(r'(?<=b)a(?=b)', string)


['a', 'a']

#### Capture the entire look around - (ex: capture all 'bab's)

In [0]:
string = "abababcdeab"

In [0]:
re.findall(r'(?=(bab))', string)

['bab', 'bab']
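
The cells above use positive lookahead and positive lookbehind. A minimal sketch of the negative variants on the same string follows. It uses `re.finditer` to report match positions, which is a choice made here for illustration. The variable names are hypothetical.

```python
import re

string = "abababcdeab"

# Negative lookahead: positions of every 'b' that is NOT followed by 'a'
b_not_before_a = [m.start() for m in re.finditer(r'b(?!a)', string)]

# Negative lookbehind: positions of every 'a' that is NOT preceded by 'b'
a_not_after_b = [m.start() for m in re.finditer(r'(?<!b)a', string)]

# Positive lookbehind: positions of every 'a' that is preceded by 'e'
a_after_e = [m.start() for m in re.finditer(r'(?<=e)a', string)]

print(b_not_before_a, a_not_after_b, a_after_e)
```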

## Search and Replace

In [0]:
def replace(s):
  '''
  Input: a string
  
  Output: input string with some replacements 
  
  '''
  s = re.sub(r'', r'', s)  # fill in: pattern, replacement
  s = re.sub(r'', r'', s)  # fill in: pattern, replacement
  return s


### group matching example
### `?` matching as few as possible example (greedy or non-greedy)
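
The two headings above have no cells in the notebook. The sketch below is one possible illustration of each. The sample strings and variable names are assumptions made for this example only.

```python
import re

# Group matching: with two capture groups, findall returns one tuple per match
ids = re.findall(r'([A-Z]+)(\d+)', "GRYF10011 and SLY7777")
print(ids)

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, up to the last '>'
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? grabs as little as possible, up to the next '>'
lazy = re.findall(r'<.+?>', text)

print(greedy)
print(lazy)
```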

In [0]:
string = "abababcdeab"

In [0]:
re.sub(r'a', r'E', string)  ## replace every match of the regex pattern with another string

'EbEbEbcdeEb'
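
Two further features of `re.sub`, shown as a hedged sketch: the replacement can refer back to capture groups with `\1`, `\2`, and the optional `count` argument limits how many matches are replaced. The variable names are only for illustration.

```python
import re

string = "abababcdeab"

# \1 and \2 in the replacement refer to the first and second capture groups
swapped = re.sub(r'(a)(b)', r'\2\1', string)

# count=2 replaces only the first two matches
first_two = re.sub(r'a', r'E', string, count=2)

print(swapped)
print(first_two)
```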

#### Exercise:
1. Replace all letters of a student ID by "XXXX" in the example_string.

2. Replace all student IDs in the example_string by "YYYY"
 

3. Create a function that replaces the negative/positive contractions in a text with their full forms.

Example Input for Q3:  
"I don't want to go there. idk, i'm afraid man."
     
Example Output for Q3:  
"I do not want to go there. I do not know, I am afraid man."

## P.S. - Crossword lovers! Just found out yesterday that there is a regex crossword:
https://regexcrossword.com