## Text Processing in Python - A Series of 5 Tutorials

Organized by Facts Lab (http://factslab.io/)

Instructor - Siddharth (Sid) Vashishtha (svashis3@ur.rochester.edu)

### Tutorial 1 (Date: 9 Sep, 2019)
Topics
- Basic datatypes and operations in Python
- If-else statements, loops 
- List comprehensions
- Defining Functions
- String manipulation
- Regular expressions (regex)

### Tutorial 2 (Date: 11 Sep, 2019)
Topics
- Tokenization, Lemmatization
- Text Corpora
- N-gram Language Models

### Tutorial 3 (Date: 13 Sep, 2019)
Topics
- 

# Tutorial 1

We will cover these tutorials in Google Colab: https://colab.research.google.com

Google Colab provides a simple way to run Python in Jupyter Notebooks, with many packages already installed.

To use Google Colab, all you need is a Google account. It's a free service.

## 1. Basic datatypes and operations

### Numbers

In [0]:
2

2

In [0]:
type(2)  #type function returns the datatype of the input argument

int

In [0]:
type(2.0)

float

In [0]:
2*4

8

In [0]:
2 + 5 - 6 * 4     # follows BODMAS/PEMDAS

-17

In [0]:
2**4   #exponents

16

In [0]:
9%5  #remainder 

4

In [0]:
abs(-42)  #absolute function

42
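
The precedence rule used above can be checked directly, along with the difference between the two division operators. This is a small sketch; the variable names are made up for illustration.

```python
# Multiplication binds tighter than addition and subtraction,
# so 6 * 4 is evaluated before the rest of the expression.
without_parens = 2 + 5 - 6 * 4      # -17
with_parens = (2 + 5 - 6) * 4       # 4

# "/" always returns a float; "//" is floor division.
true_div = 9 / 5                    # 1.8
floor_div = 9 // 5                  # 1

print(without_parens, with_parens)
print(true_div, floor_div)
```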

### Strings

In [0]:
'abc'

'abc'

#### Printing stuff

In [0]:
print("Hello, world")
print('Hello, world')

Hello, world
Hello, world


How do we handle quotations inside text?

In [0]:
print("This is Sid's tutorial")

This is Sid's tutorial


In [0]:
print('Sid said, "Text processing is cool." ')

Sid said, "Text processing is cool." 


In [0]:
print('''  Sid said, "It doesn't matter which major you have, anyone can master text processing."  ''' )

  Sid said, "It doesn't matter which major you have, anyone can master text processing."  


Concatenating strings

In [106]:
print('abc' + 'def')   ## concatenation

abcdef


In [107]:
print("abc"*3)

abcabcabc


In [110]:
height = 179
name = 'Sid'

print("The instructor's name is {} and he is {} cms tall.".format(name, height))

The instructor's name is Sid and he is 179 cms tall.
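
In Python 3.6 and later, the same kind of substitution can also be written with an f-string, which places the variable names inside the braces. A minimal sketch, assuming the notebook runs on Python 3.6+ (the sentence itself is only an illustration):

```python
height = 179
name = 'Sid'

# str.format fills the {} placeholders in order.
with_format = "{} is {} units tall.".format(name, height)

# An f-string evaluates the expressions inside the braces directly.
with_fstring = f"{name} is {height} units tall."

print(with_fstring)
print(with_format == with_fstring)   # True
```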


#### Booleans

In [91]:
True

True

In [85]:
False

False

In [86]:
2 == 2

True

In [89]:
"string1" == "string1"

True

In [90]:
"string1" == "string2"

False

### Lists

 - an array which can take multiple objects
 - objects need not be homogeneous
 - indexing starts at 0

In [0]:
alst = [1,2,3]
blst = ['abc', 'xyz', 'string']
clst = ['foo', 456, 'bar', ['pqr', 2**4]]

In [73]:
print(alst[0])
print(blst[2])
print(clst[3])

1
string
['pqr', 16]


#### Lists are mutable. You can edit/delete elements of a list and add elements to it as well.

In [74]:
blst[2] = "newString"
print(blst)

['abc', 'xyz', 'newString']


In [75]:
blst.append("anotherString")
print(blst)

['abc', 'xyz', 'newString', 'anotherString']


Removing elements from a list

In [78]:
popped = blst.pop()     ## popping the last element: O(1) time complexity
print(popped)
print(blst)

anotherString
['abc', 'xyz', 'newString']


In [77]:
popped = blst.pop(0)     ## popping by index: O(n) time complexity
print(blst) 

['xyz', 'newString']


### Tuples

In [0]:
atupl = (1,2,3)
btupl =  ('abc', 'xyz', 'string')
ctupl = ('foo', 456, 'bar', ['pqr', 2**4])

In [48]:
print(atupl[0])
print(btupl[2])
print(ctupl[3])

1
string
['pqr', 16]


Tuples are immutable. You cannot edit or delete the elements of a tuple, nor can you add elements to it.

In [0]:
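# btupl[2] = "newString"   # raises TypeError: tuples do not support item assignment
# print(btupl)

In [0]:
# btupl.append("anotherString")   # raises AttributeError: tuples have no append method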
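
To see the immutability of tuples without stopping the notebook on an error, the failing assignment can be wrapped in `try`/`except`. This is a sketch; `new_tupl` is a name introduced here only for illustration.

```python
btupl = ('abc', 'xyz', 'string')

try:
    btupl[2] = "newString"        # item assignment is not supported
except TypeError as err:
    print("TypeError:", err)

# Tuples do not have an append method at all.
print(hasattr(btupl, 'append'))   # False

# To "change" a tuple, build a new tuple instead.
new_tupl = btupl[:2] + ("newString",)
print(new_tupl)                   # ('abc', 'xyz', 'newString')
```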

#### Exercise: (Importance of mutable and immutable objects)

In [0]:
dlst = [42,12,79]
elst = dlst
elst[1] = 7

# # What do you think would be the output of the below code?
# print(dlst == elst)
# print(dlst)
# print(elst)

In [0]:
anum = 89
bnum = anum
bnum = 3

# # What do you think would be the output of the below code?
# print(bnum == anum)
# print(anum)
# print(bnum)
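
Once you have made your predictions, the sketch below may help explain them: assigning a list to a second name does not copy it, whereas `.copy()` creates an independent list. The names `original`, `alias` and `independent` are hypothetical and not part of the exercise.

```python
original = [42, 12, 79]
alias = original                 # a second name for the same list object
independent = original.copy()    # a new list with the same elements

alias[1] = 7                     # modifies the one shared list

print(alias is original)         # True: same object
print(independent is original)   # False: separate object
print(original)                  # [42, 7, 79]
print(independent)               # [42, 12, 79]
```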

### Sets

A set is an unordered, unindexed collection of unique elements; duplicates are dropped automatically.

In [114]:
aset = {1,2,3}
bset = {7,7,6,6,3,3,3,4,9,1,1}

print(aset)
print(bset)

{1, 2, 3}
{1, 3, 4, 6, 7, 9}


In [108]:
print(len(aset))

3


In [118]:
print(3 in aset)  #set membership

True


In [117]:
print(4 in aset)

False


In [119]:
aset.intersection(bset)

{1, 3}

In [120]:
aset.union(bset)

{1, 2, 3, 4, 6, 7, 9}
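
In text processing, sets are handy for building a vocabulary, since duplicates disappear automatically. The sketch below uses a made-up word list and also shows the operator forms of the set methods used above.

```python
words = ['the', 'cat', 'saw', 'the', 'other', 'cat']   # hypothetical tokens
vocab = set(words)

print(len(words), len(vocab))   # 6 4
print(sorted(vocab))            # ['cat', 'other', 'saw', 'the']

aset = {1, 2, 3}
bset = {7, 7, 6, 6, 3, 3, 3, 4, 9, 1, 1}

print(aset & bset)   # same as aset.intersection(bset)
print(aset | bset)   # same as aset.union(bset)
print(aset - bset)   # elements of aset that are not in bset
```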

### Dictionaries

 - A dict is a collection that is indexed by key and mutable (since Python 3.7 it also preserves insertion order)
 - Average-case constant-time lookup
 - Stores (key, value) pairs
 - Keys have to be immutable (more precisely, hashable)
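
The restriction on keys can be tried out directly. In this sketch, `coords` is a hypothetical dict used only for illustration: a tuple of numbers works as a key, while a list raises a `TypeError`.

```python
coords = {}
coords[(0, 1)] = 'tuple keys work'     # a tuple of immutable items is hashable
coords['name'] = 'so do strings'

try:
    coords[[0, 1]] = 'lists do not'    # lists are mutable, hence unhashable
except TypeError as err:
    print("TypeError:", err)

# .get returns a default instead of raising KeyError for a missing key.
print(coords.get('missing', 'default value'))
print((0, 1) in coords)                # True: membership tests look at keys
```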

In [0]:
adict = {'sid': 179, 
         'pikachu': 40.6, 
         'bulbasaur': 71.1,
         'dragonite':221}

In [124]:
adict['sid']

179

In [0]:
adict['sid'] = 2736483276

In [126]:
adict['sid']

2736483276

Adding to a dict is very straightforward

In [0]:
adict['charmander'] = 61


In [129]:
print(adict)

{'sid': 2736483276, 'pikachu': 40.6, 'bulbasaur': 71.1, 'dragonite': 221, 'charmander': 61}


## 2. If, else and Loops
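
A minimal sketch of the constructs named in this heading: an `if`/`elif`/`else` chain inside a `for` loop, followed by a `while` loop. The list `heights`, the thresholds and the labels are made up for illustration.

```python
heights = [179, 40.6, 71.1, 221]   # hypothetical values

labels = []
for h in heights:
    # Exactly one branch runs for each value.
    if h > 200:
        label = 'very tall'
    elif h > 100:
        label = 'tall'
    else:
        label = 'short'
    labels.append(label)
    print(h, label)

# A while loop repeats until its condition becomes False.
countdown = 3
while countdown > 0:
    print(countdown)
    countdown -= 1
```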

## 3. List comprehensions
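
A list comprehension builds a list in a single expression instead of an explicit loop with `append`. The sketch below shows both forms side by side; the list `numbers` is an assumption made for the example.

```python
numbers = [1, 2, 3, 4, 5, 6]   # hypothetical input

# Explicit loop
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# Equivalent list comprehension
squares = [n ** 2 for n in numbers]

# A trailing "if" filters the elements that are used
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

print(squares)        # [1, 4, 9, 16, 25, 36]
print(even_squares)   # [4, 16, 36]
```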

## 4. Functions
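
A function is defined with `def`, can take positional and keyword arguments, and hands back a result with `return`. The function `describe` below is a hypothetical example, not part of the original notebook.

```python
def describe(name, height, unit='cm'):
    '''
    Input: a name, a height, and an optional unit (defaults to 'cm')

    Output: a one-sentence description as a string
    '''
    return "{} is {} {} tall.".format(name, height, unit)

print(describe('Sid', 179))                     # uses the default unit
print(describe('Pikachu', 16, unit='inches'))   # overrides the default
```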

## 5. Regular Expressions

A regular expression is a sequence of characters that defines a search pattern in strings.

A resource for practising or checking regular expressions - https://www.regexpal.com/

**Cheat Sheet from regexpal.com**:

**Character classes** \\
.	any character except newline \\
\w \d \s word, digit, whitespace \\
\W \D \S not word, digit, whitespace \\
[abc]	any of a, b, or c \\
[^abc]	not a, b, or c \\
[a-g]	character between a & g \\

**Anchors**

^abc$	start / end of the string \\
\b	word boundary \\

**Escaped characters**  \\
\. \* \\	escaped special characters \\
\t \n \r	tab, linefeed, carriage return \\
\u00A9	unicode escaped © \\

**Groups & Lookaround**

(abc)	capture group \\
\1	backreference to group #1 \\
(?:abc)	non-capturing group \\
(?=abc)	positive lookahead \\
(?!abc)	negative lookahead  \\

**Quantifiers & Alternation** \\
a* a+ a?	0 or more, 1 or more, 0 or 1 \\
a{5} a{2,}	exactly five, two or more \\
a{1,3}	between one & three \\
a+? a{2,}?	match as few as possible \\
ab|cd	match ab or cd \\
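
A few of the cheat-sheet entries can be tried out with `re.findall`. This is a small sketch on made-up strings (`text` and `html` are assumptions for the example), covering word boundaries, character classes, capture groups, and greedy versus lazy quantifiers.

```python
import re

text = "cat hat concatenate 2019-09-09"

# \b word boundary: match 'cat' only as a whole word
print(re.findall(r'\bcat\b', text))                   # ['cat']

# [ch] character class: 'c' or 'h', followed by 'at'
print(re.findall(r'\b[ch]at\b', text))                # ['cat', 'hat']

# Capture groups: findall returns one tuple per match
print(re.findall(r'(\d{4})-(\d{2})-(\d{2})', text))   # [('2019', '09', '09')]

html = '<b>bold</b>'

# Greedy .+ grabs as much as possible; lazy .+? as little as possible
print(re.findall(r'<.+>', html))                      # ['<b>bold</b>']
print(re.findall(r'<.+?>', html))                     # ['<b>', '</b>']
```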

In [0]:
import re #importing the Python regex library

In [0]:
example_string = '''
Harry Potter's Hogwarts student ID was GRYF10011. Ron's ID was just next to Harry's -- GRYF10012.
Draco Malfoy's ID was SLY7777. 
Fun fact - Harry Potter was 17 when he killed Voldemort. Voldemort was 72 years old when he died. 
Dumbledore was probably 150 when he died, who knows.
Random math equation just for the purpose of a regex example: 72=9*8 and 9+8=17
Another random math eq: x2-5x+6=0
Bye!
'''

In [0]:
print(example_string)


Harry Potter's Hogwarts student ID was GRYF10011. Ron's ID was just next to Harry's -- GRYF10012.
Draco Malfoy's ID was SLY7777. 
Fun fact - Harry Potter was 17 when he killed Voldemort. Voldemort was 72 years old when he died. 
Dumbledore was probably 150 when he died, who knows.
Random math equation just for the purpose of a regex example: 72=9*8 and 9+8=17
Another random math eq: x2-5x+6=0
Bye!



In [0]:
ages = re.findall(r'\d', example_string)
print(ages)

['1', '0', '0', '1', '1', '1', '0', '0', '1', '2', '7', '7', '7', '7', '1', '7', '7', '2', '1', '5', '0', '7', '2', '9', '8', '9', '8', '1', '7', '2', '5', '6', '0']


In [0]:
ages = re.findall(r'\d{1,3}', example_string)
print(ages)

['100', '11', '100', '12', '777', '7', '17', '72', '150', '72', '9', '8', '9', '8', '17', '2', '5', '6', '0']


In [0]:
ages = re.findall(r' \d{1,3} ', example_string)
print(ages)

[' 17 ', ' 72 ', ' 150 ']


Words, alphanumerics

In [0]:
words = re.findall(r'\w+', example_string)
print(words)

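['Harry', 'Potter', 's', 'Hogwarts', 'student', 'ID', 'was', 'GRYF10011', 'Ron', 's', 'ID', 'was', 'just', 'next', 'to', 'Harry', 's', 'GRYF10012', 'Draco', 'Malfoy', 's', 'ID', 'was', 'SLY7777', 'Fun', 'fact', 'Harry', 'Potter', 'was', '17', 'when', 'he', 'killed', 'Voldemort', 'Voldemort', 'was', '72', 'years', 'old', 'when', 'he', 'died', 'Dumbledore', 'was', 'probably', '150', 'when', 'he', 'died', 'who', 'knows', 'Random', 'math', 'equation', 'just', 'for', 'the', 'purpose', 'of', 'a', 'regex', 'example', '72', '9', '8', 'and', '9', '8', '17', 'Another', 'random', 'math', 'eq', 'x2', '5x', '6', '0', 'Bye']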


#### Exercise:

Using regex:
1. Find all the student IDs in the example_string
2. Find all the math equations in the example_string
3. Find all names in the example_string


## Search and Replace

In [0]:
def replace(s):
  '''
  Input: a string
  
  Output: input string with some replacements 
  
  '''
  
  # re.sub takes (pattern, replacement, string); fill in the empty placeholders
  s = re.sub(r'', '', s)
  s = re.sub(r'', '', s)
  
  
  return s
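
For reference, `re.sub(pattern, repl, string)` returns a new string in which every match of `pattern` is replaced by `repl`. The sketch below uses made-up inputs (`messy`, `date`) that are unrelated to the exercises, and also shows a backreference in the replacement.

```python
import re

# Collapse any run of whitespace into a single space.
messy = "Text   processing \t is   cool."
clean = re.sub(r'\s+', ' ', messy)
print(clean)      # Text processing is cool.

# \1, \2, \3 in the replacement refer to the captured groups.
date = "2019-09-11"
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)
print(swapped)    # 11/09/2019
```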

#### Exercise:
1. Replace all student IDs in the example_string by "XXXXXXX"

2. Create a function that replaces the negative/positive contractions in a text with their full forms.

Eg:  Input:  "I don't want to go there. idk, i'm afraid man."
     
     Output: "I do not want to go there. I do not know, I am afraid man."