# Why is Text so important?

Data continues to grow exponentially
- An estimated 2.5 exabytes (2.5 million TB) are generated each day
- Projected to reach 40 zettabytes (40 billion TB) by 2020, 50 times the 2010 volume

Approximately 80% of all data is estimated to be unstructured, text-rich data

So, what can be done with text?
- Parse text
- Find / identify / extract relevant information from text
- Classify text documents
- Search for relevant text documents
- Sentiment analysis
- Topic modeling
- ...



# Handling Text in Python

### 1. Primitive constructs in Text
- Sentences / input strings
- Words or Tokens
- Characters
- Document, larger files


In [5]:
# how many characters does the string contain?
text1 = 'Ethics are built right into the ideals and objectives of the United Nations'
len(text1)

75

In [6]:
# what if I want to know how many words
text2 = text1.split(' ')
len(text2)

13

### 2. Finding Specific words

1) Long words : Words that are more than 3 letters long

2) Capitalized words

3) Words that end with s

In [9]:
# Long words
print([w for w in text2 if len(w) > 3])

# Cap words
print([w for w in text2 if w.istitle()])

# ends with 's'
print([w for w in text2 if w.endswith('s')])

['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']
['Ethics', 'United', 'Nations']
['Ethics', 'ideals', 'objectives', 'Nations']


### 3. Finding Unique words : using set()

In [10]:
text3 = "To be or not to be"
text4 = text3.split(' ')
len(text4)

6

In [14]:
# returns unique words, but "To" and "to" are still counted as different words
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [16]:
# so lowercase every word before building the set
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### 4. Word Comparison functions

- s.startswith(t)
- s.endswith(t)

- t in s

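- s.isupper(); s.islower(); s.istitle()
    - isupper : all cased characters are uppercase
    - islower : all cased characters are lowercase
    - istitle : each word starts with an uppercase letter, followed only by lowercase letters

- s.isalpha(); s.isdigit(); s.isalnum()
    - isalpha : all characters are alphabetic
    - isdigit : all characters are digits
    - isalnum : all characters are letters or digits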
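
As a quick check of the predicates listed above, the following sketch applies them to a handful of tokens. The sample tokens are made up for illustration and do not come from the notebook's data.

```python
# Hypothetical sample tokens, chosen only to exercise each predicate.
samples = ['UN', 'rights', 'Nations', '1948', 'Article30', 'e-mail']

# Map each token to (isupper, islower, istitle, isalpha, isdigit, isalnum).
checks = {
    s: (s.isupper(), s.islower(), s.istitle(),
        s.isalpha(), s.isdigit(), s.isalnum())
    for s in samples
}
for s, flags in checks.items():
    print(s, flags)

# Prefix, suffix, and substring tests
word = 'Declaration'
prefix_ok = word.startswith('Decl')
suffix_ok = word.endswith('tion')
contains_ok = 'clara' in word
print(prefix_ok, suffix_ok, contains_ok)
```

Note that `'UN'.istitle()` is False because a title-cased word may have only one leading capital, and that `'e-mail'` fails `isalpha()` and `isalnum()` because of the hyphen.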

### 5. String Operations

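- s.lower(); s.upper(); s.title()
- s.split(t)
- s.splitlines()  : split the string on newline / end-of-line characters
- s.join(t) : the opposite of split; joins the strings in t, using s as the separator
- s.strip(); s.rstrip()  : strip removes all whitespace from **both ends** of the string, rstrip removes whitespace only from **the end**
- s.find(t); s.rfind(t)  : search for string t from the start and from the end, respectively
- s.replace(u,v) : replace every occurrence of u with v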
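
A minimal sketch of the operations above that the later cells do not exercise (`title`, `splitlines`, and the strip variants). The sample string is an assumption made for illustration; `lstrip` is not in the list above but is the standard counterpart of `rstrip`.

```python
# Hypothetical sample string with padding and an embedded newline.
s = '  universal declaration\nof human rights  '

titled = s.title()          # first letter of every word capitalized
lines = s.splitlines()      # split on line boundaries
both = s.strip()            # whitespace removed from both ends
left = s.lstrip()           # leading whitespace only
right = s.rstrip()          # trailing whitespace only
joined = '-'.join(['a', 'b', 'c'])   # separator.join(list_of_strings)

for value in (titled, lines, both, left, right, joined):
    print(repr(value))
```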

In [19]:
# split
text5 = 'ouagadougou'
text6 = text5.split('ou')
text6

['', 'agad', 'g', '']

In [20]:
# join
'ou'.join(text6)

'ouagadougou'

In [22]:
# all the chars in string
print(list(text5))
print( [c for c in text5 ])  # same as above

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']
['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']


In [34]:
# Cleaning Example
text8 = '   A quick brown fox jumped over the lazy dog. '
print(text8.split(' '))
text9 = text8.strip()   # remove leading and trailing whitespace
print(text9)

['', '', '', 'A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.', '']
A quick brown fox jumped over the lazy dog.
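
The empty strings in the output above come from splitting on a single space. One possible alternative, sketched here, is to call `split()` with no argument, which splits on any run of whitespace and drops the empty strings. The punctuation set passed to `strip` is an assumption, not something the notebook prescribes.

```python
text8 = '   A quick brown fox jumped over the lazy dog. '

# split() with no argument collapses runs of whitespace and ignores padding.
tokens = text8.split()
print(tokens)

# Optionally trim common punctuation from both ends of each token.
cleaned = [w.strip('.,;:!?') for w in tokens]
print(cleaned)
```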


In [39]:
# Changing Text
text9
print(text9.find('o'))  # where it finds the first 'o'
print(text9.rfind('o')) # searches from the end, so it finds the last 'o'
text9.replace('o', 'O') # replacement

10
40


'A quick brOwn fOx jumped Over the lazy dOg.'

### 6. File Operations
1) Reading files line by line
    
    f = open('UNDHR.txt', 'r')
    f.readline()   # reads one line, including its trailing \n
    
2) Reading the full file

    f.seek(0)   # move the read position back to the start of the file
    text12 = f.read()
    len(text12)
    text13 = text12.splitlines()   # split on line boundaries
    
3) File Operations
    - f = open(filename, mode)     
    - f.readline(); f.read(); f.read(n) : read one line, the whole file, or n characters
    - for line in f : doSomething(line)
    - f.seek(n)  : reset the reading position
    - f.write(message) : write into file (write mode)
    - f.close() : close the file handle
    - f.closed  : check whether the file is closed
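
The operations above can be sketched end to end as follows. The sketch writes a small temporary file so it runs without `UNDHR.txt`; the file name and contents are hypothetical. It also uses a `with` block, which is not shown in these notes but closes the file automatically.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

with open(path, 'w') as f:      # write mode creates or truncates the file
    f.write('first line\nsecond line\n')

with open(path, 'r') as f:
    first = f.readline()        # one line, including the trailing '\n'
    f.seek(0)                   # back to the start of the file
    lines = [line.rstrip('\n') for line in f]   # iterate line by line
    f.seek(0)
    head = f.read(5)            # first 5 characters only

print(repr(first), lines, head, f.closed)
```

After the `with` block ends, `f.closed` is True, so no explicit `f.close()` call is needed.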
    

In [43]:
# Some issues here
f = open('UNDHR.txt', 'r')
text14 = f.readline()
text14 # but we don't want \n at the end of the sentence

'Universal Declaration of Human Rights \n'

In [45]:
# How do you remove the last newline character?
text14.rstrip()  # removes all trailing whitespace, so \r and \r\n endings are handled too

'Universal Declaration of Human Rights'
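
Because `rstrip()` with no argument removes every trailing whitespace character, it also drops the space before the newline in the line above. If only the line ending should go, one option is to pass the characters to strip explicitly. The sample line below is an assumption that mimics a Windows-style ending.

```python
# Hypothetical line with a trailing space and a \r\n ending.
line = 'Universal Declaration of Human Rights \r\n'

all_trailing = line.rstrip()          # drops the space, \r, and \n
ending_only = line.rstrip('\r\n')     # drops only \r and \n characters

print(repr(all_trailing))
print(repr(ending_only))
```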