# Playing with Text in Python

Python's built-in string methods and standard library cover most everyday text manipulation. In this notebook, we will cover some of the most commonly used string methods, and we will also introduce regular expressions.

## Join() 
The join() method takes all items in an iterable and joins them into one string. It is called on the string to use as the separator, and every item in the iterable must itself be a string. Because a string is an iterable of characters, passing a string to join() places the separator between each of its characters.

In [1]:
myList1 = ("John", "Peter", "Vicky")  # a tuple; join() accepts any iterable of strings
x = "#".join(myList1)
x

'John#Peter#Vicky'

In [2]:
x = " ".join(myList1)
x

'John Peter Vicky'

In [3]:
balloon = "Sammy has a balloon."
" ".join(balloon)

'S a m m y   h a s   a   b a l l o o n .'
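One detail worth checking by hand: join() only accepts strings as items. The sketch below uses made-up values (the `numbers` list is not part of the examples above) to show the TypeError raised for integers, and one common workaround of converting each item with str() first.

```python
numbers = [1, 2, 3]  # illustrative values, not from the examples above

# join() raises TypeError when an item is not a string
try:
    ", ".join(numbers)
except TypeError as err:
    error_message = str(err)
    print("join failed:", error_message)

# Convert every item to a string first, then join
joined = ", ".join(str(n) for n in numbers)
print(joined)  # 1, 2, 3
```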

## Split()

The split() method splits a string into a list of substrings. You can specify the separator; if you omit it, the string is split on any run of whitespace.

In [5]:
txt = "welcome to the jungle"
x = txt.split()
x

['welcome', 'to', 'the', 'jungle']

In [7]:
txt = "apple#banana#cherry#orange"
x = txt.split("#")
x

['apple', 'banana', 'cherry', 'orange']
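The default behaviour and an explicit separator differ in how they treat repeated separators. The sketch below uses illustrative strings to contrast the two, and also shows the optional maxsplit argument, which limits how many splits are made.

```python
messy = "  welcome   to the  jungle "  # illustrative string with extra spaces

# No argument: runs of whitespace count as one separator, and
# leading/trailing whitespace is ignored
default_split = messy.split()
print(default_split)  # ['welcome', 'to', 'the', 'jungle']

# Explicit " ": every single space is a separator, so empty strings appear
space_split = "a  b".split(" ")
print(space_split)  # ['a', '', 'b']

# maxsplit limits the number of splits, counted from the left
limited = "apple#banana#cherry#orange".split("#", 1)
print(limited)  # ['apple', 'banana#cherry#orange']
```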

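Using the split and join methods together can be very powerful: split breaks the string apart at a separator, and join puts the pieces back together with a different one (or with none).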

In [9]:
# Remove the junk $ characters (the spaces around each $ are kept)
junk_str = 'this $ is a $ proof of concept'
result = ''.join(junk_str.split('$'))
result

'this  is a  proof of concept'

Other useful string methods to consider: replace, count, upper. The full list is documented here: https://docs.python.org/2.5/lib/string-methods.html
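As a quick illustration of those three methods, here is a minimal sketch that reuses the sentence from the split() example. Each method returns a new value and leaves the original string unchanged.

```python
phrase = "welcome to the jungle"

# replace(old, new) returns a new string with every occurrence swapped
swapped = phrase.replace("jungle", "city")
print(swapped)  # welcome to the city

# count(sub) returns the number of non-overlapping occurrences
e_count = phrase.count("e")
print(e_count)  # 4

# upper() returns an uppercase copy
shouted = phrase.upper()
print(shouted)  # WELCOME TO THE JUNGLE
```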

### Challenges
1) From the string 'Sammy has a balloon.' remove all the 'a' characters to return -->  'Smmy hs  blloon.' (two spaces remain where the standalone 'a' was). Do this using the replace function, and then using split+join.


In [12]:
str1 = "Sammy has a balloon."
...

2) Using join+split, add a dash in between each word in 'this is a string' to return --> 'this-is-a-string'

In [None]:
str2 = 'this is a string'
...

3) Return True if the given string contains an appearance of "xyz" where the xyz is not directly preceded by a period (.). So "xxyz" counts but "x.xyz" does not.


xyz_there('abcxyz') → True

xyz_there('abc.xyz') → False

xyz_there('xyz.abc') → True


In [None]:
def xyz_there(input_txt):
    ...
    return True

4) Given a string, return a string where for every char in the original, there are two chars.

double_char('The') → 'TThhee'

double_char('AAbb') → 'AAAAbbbb'

double_char('Hi-There') → 'HHii--TThheerree'

In [None]:
def double_char(text):  # 'text' avoids shadowing the built-in str
    pass

# Regular Expressions

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
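To make the last point concrete, here is a minimal sketch of modifying and splitting a string with the re module, using re.sub() and re.split(). The input strings are made up for illustration, and the pattern syntax used here (character classes such as \d and \s) is part of the regex rules this notebook covers.

```python
import re

text = "apple, banana;cherry  orange"  # illustrative input

# re.split: treat any run of commas, semicolons, or whitespace as one separator
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['apple', 'banana', 'cherry', 'orange']

# re.sub: replace every run of digits with '#'
masked = re.sub(r"\d+", "#", "room 101, floor 3")
print(masked)  # room #, floor #
```

Unlike str.split(), which takes one fixed separator, re.split() accepts a pattern, so several different separators can be handled in a single call.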



In [14]:
# See https://docs.python.org/3/howto/regex.html for the full reference
import re

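Regex rules. The class equivalences below are exact for bytes patterns or when the re.ASCII flag is set; for ordinary str patterns, \d, \s and \w (and their negations) also take Unicode digits, whitespace, and word characters into account.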

\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].
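The rules above can be tried out on a short string. This is a small sketch with a made-up sample; it uses re.findall(), which returns every match of a pattern as a list.

```python
import re

sample = "Room 42, floor_3!"  # illustrative string

# \d matches one digit at a time
digits = re.findall(r"\d", sample)
print(digits)  # ['4', '2', '3']

# \W matches anything that is not a letter, digit, or underscore
non_word = re.findall(r"\W", sample)
print(non_word)  # [' ', ',', ' ', '!']

# \w+ matches runs of word characters; note the underscore is included
words = re.findall(r"\w+", sample)
print(words)  # ['Room', '42', 'floor_3']

# \s matches whitespace; here there are two spaces
whitespace_count = len(re.findall(r"\s", sample))
print(whitespace_count)  # 2
```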

## re.match() and re.search()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string. Both return a match object on success and None otherwise.

In [22]:
line = "Cats are smarter than dogs"
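# 'dogs' is not at the start of the line, so match() returns None
matchObj = re.match(r'dogs', line)
print(matchObj)
# 'Cats' is at the start of the line, so match() returns a match object
matchObj = re.match(r'Cats', line)
print(matchObj)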

None
<re.Match object; span=(0, 4), match='Cats'>


In [24]:
line = "Cats are smarter than dogs"
matchObj = re.search(r'dogs', line)
matchObj

<re.Match object; span=(22, 26), match='dogs'>

In [25]:
line[22:26]

'dogs'
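Because both functions return None when nothing matches, it is usual to test the result before using it. The sketch below reuses the same sentence and shows the group() and span() methods of a match object; the variable names are only illustrative.

```python
import re

line = "Cats are smarter than dogs"

# search() returns a match object, or None when nothing matches
found = re.search(r"smarter", line)
if found:
    print(found.group())  # smarter
    print(found.span())   # (9, 16)

# match() only succeeds at position 0, so the same pattern gives None here
at_start = re.match(r"smarter", line)
print(at_start)  # None
```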

## re.findall()

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

In [26]:
string = """Hello my Number is 123456789 and  
             my friend's number is 987654321"""
    
# A sample regular expression to find digits.
regex = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged
    
match = re.findall(regex, string)
print(match)

['123456789', '987654321']
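When the pattern contains capturing groups, findall() returns the groups rather than the whole match. The sketch below uses made-up names and numbers to show this, and also shows re.finditer(), which yields match objects instead of strings when positions are needed.

```python
import re

text = "Alice: 123456789, Bob: 987654321"  # illustrative names and numbers

# With two groups, findall() returns a list of tuples, one per match
pairs = re.findall(r"(\w+): (\d+)", text)
print(pairs)  # [('Alice', '123456789'), ('Bob', '987654321')]

# finditer() yields match objects, which also expose positions
starts = [m.start() for m in re.finditer(r"\d+", text)]
print(starts)  # [7, 23]
```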


## Challenges

1) You are given a string S. It consists of alphanumeric characters, spaces and symbols (+, -).
Your task is to find all the substrings of S that contain 2 or more vowels.
Also, these substrings must lie between two consonants and should contain vowels only.

E.g. rabcdeefgyYhFjkIoomnpOeorteeeeet --> ['ee', 'Ioo', 'Oeo', 'eeeee']

In [None]:
S = 'rabcdeefgyYhFjkIoomnpOeorteeeeet'
...

2) Write a Python program that matches a word containing 'z', not at the start or end of the word.

In [None]:
def text_match(text):
    patterns = ...
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

In [None]:
print(text_match("The quick brown fox jumps over the lazy dog."))
print(text_match("Python Exercises."))

3) Write a Python program to remove leading zeros from an IP address.
"216.08.094.196" -->>  "216.8.94.196"

In [None]:
ip = "216.08.094.196"
string = re.sub(....)
print(string)