In [1]:
# Modules needed: string, collections
import string
import collections

# Strings

In this brief notebook we cover a few additional features of Python strings that you may find useful and conclude with an example of preparing data for text analytics.

### Library Dependencies
This notebook needs the collections and string modules, both of which are part of the Python Standard Library, so nothing has to be installed separately.

## Splitting strings, joining strings and replacing substrings

There are multiple ways to split a string into its individual characters, but the simplest is to pass the string as an argument to the list function.

In [2]:
str1 = 'abcdefgh'
list1 = list(str1)
list1

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

The elements of a list can be combined into a string using join. Note that we apply the join method to the string that serves as the "glue" and pass the list as an argument.

In [3]:
join_str = ' + '
str2 = join_str.join(list1)
str2

'a + b + c + d + e + f + g + h'
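One detail worth keeping in mind is that join expects every element of the list to be a string; passing numbers raises a TypeError. The sketch below shows one way to handle that by converting each element first. The names numbers and joined are illustrative only and do not appear elsewhere in this notebook.

```python
# join only accepts strings, so convert each number before joining.
numbers = [1, 2, 3]

# A generator expression converts the items one at a time.
joined = ', '.join(str(n) for n in numbers)

print(joined)
```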

We can split a string on a specified substring, but note that the substring itself will not be included in the results. By default, splitting will be done on whitespace.

In [4]:
split_str = ' + '
list2 = str2.split(split_str)
list2

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
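The default whitespace split behaves slightly differently from splitting on an explicit space, which can be surprising. The following is a small sketch of that difference; the variable names spaced, default_split and explicit_split are made up for this illustration.

```python
# A string with two spaces between 'a' and 'b'.
spaced = 'a  b c'

# With no argument, a run of whitespace counts as a single separator.
default_split = spaced.split()

# With an explicit separator, every occurrence splits the string,
# so two adjacent spaces produce an empty string between them.
explicit_split = spaced.split(' ')

print(default_split)
print(explicit_split)
```

For text that may contain irregular spacing, the no-argument form is usually the more convenient choice.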

The replace method replaces all occurrences of a substring within a string and returns the result as a new string.

In [5]:
str2.replace('+', 'XXX')

'a XXX b XXX c XXX d XXX e XXX f XXX g XXX h'
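Because strings are immutable, replace leaves the original string untouched, and an optional third argument limits how many occurrences are replaced. A minimal sketch, re-creating str2 so that the cell runs on its own (replaced_all and replaced_two are illustrative names):

```python
str2 = 'a + b + c + d + e + f + g + h'

# Replace every occurrence.
replaced_all = str2.replace('+', 'XXX')

# Replace only the first two occurrences.
replaced_two = str2.replace('+', 'XXX', 2)

print(replaced_two)
print(str2)  # the original string is unchanged
```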

## String constants

The Python string module contains some very useful string constants.

In [6]:
# import string

In [7]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [8]:
string.ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

In [9]:
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [10]:
string.digits

'0123456789'
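As one possible use of these constants, the sketch below checks whether a string consists only of ASCII letters. The helper only_ascii_letters is a hypothetical function written for this illustration, not part of the string module.

```python
import string

def only_ascii_letters(s):
    """Return True if s is non-empty and every character is an ASCII letter."""
    return bool(s) and all(ch in string.ascii_letters for ch in s)

print(only_ascii_letters('Python'))
print(only_ascii_letters('Python3'))
```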

## Example - breaking text into words

While these string constants are mostly a convenience to save you from having to type out lists of letters or digits, the punctuation constant is particularly useful when processing data for text analytics.

In [11]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [12]:
with open('moby.txt', 'rt', encoding='latin1') as file:  
    text = file.read()

In [13]:
print(text)

Hello, world! This is an example of text with punctuation.


In [14]:
for c in string.punctuation:
    text = text.replace(c, ' ')

text = text.lower()

In [15]:
print(text)

hello  world  this is an example of text with punctuation 
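The loop above calls replace once per punctuation character. An alternative, shown here only as a sketch, is to build a translation table with str.maketrans and apply it in a single pass with translate. The sample sentence is copied from the output above so that the cell runs without moby.txt; the names sample, table and cleaned are illustrative.

```python
import string

sample = 'Hello, world! This is an example of text with punctuation.'

# Map every punctuation character to a space.
# Both arguments to str.maketrans must have the same length.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Apply the table in one pass, then lowercase the result.
cleaned = sample.translate(table).lower()

print(cleaned.split())
```

For a short text either approach is fine; the translation table mainly helps when the text is large, since the string is scanned once rather than once per punctuation character.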


In [16]:
word_list = text.split()

In [17]:
# import collections
collections.Counter(word_list)

Counter({'hello': 1,
         'world': 1,
         'this': 1,
         'is': 1,
         'an': 1,
         'example': 1,
         'of': 1,
         'text': 1,
         'with': 1,
         'punctuation': 1})
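Every word in the sample text occurs exactly once, so the counts above are not very informative. With longer text, the most_common method of Counter lists the most frequent words first. A small sketch using a made-up sentence (not taken from moby.txt):

```python
import collections

# A made-up sentence with repeated words.
words = 'the cat and the dog and the bird'.split()

counts = collections.Counter(words)

# The two most frequent words, as (word, count) pairs.
top_two = counts.most_common(2)
print(top_two)

# Words that never occurred have a count of zero rather than raising KeyError.
print(counts['fish'])
```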