## String Manipulation

We'll define a basic piece of text and transform it into its main components; then, we'll reconstruct it. As an example, a report needs to be transformed into a new format to be sent via email.

The input format we'll use in this example will be this:

    AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
    HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
    WITH THE OBJECTIVES FOR THE YEAR. THE MAIN DRIVER OF THE SALES HAS BEEN
    THE NEW PACKAGE DESIGNED UNDER THE SUPERVISION OF OUR MARKETING DEPARTMENT.
    OUR EXPENSES HAS BEEN CONTAINED, INCREASING ONLY BY 0.7%, THOUGH THE BOARD
    CONSIDERS IT NEEDS TO BE FURTHER REDUCED. THE EVALUATION IS SATISFACTORY
    AND THE FORECAST FOR THE NEXT QUARTER IS OPTIMISTIC. THE BOARD EXPECTS
    AN INCREASE IN PROFIT OF AT LEAST 2 MILLION DOLLARS.


We need to redact the text to eliminate any references to numbers. It needs to be properly formatted by adding a new line after each period, justified with 80 characters, and transformed into ASCII for compatibility reasons. The text will be stored in the INPUT_TEXT variable in the interpreter.


#### How to do it

 

In [1]:
# After entering the text, split it into individual words:
INPUT_TEXT = """
    AFTER THE CLOSE OF THE SECOND QUARTER, OUR COMPANY, CASTAÑACORP
    HAS ACHIEVED A GROWTH IN THE REVENUE OF 7.47%. THIS IS IN LINE
    WITH THE OBJECTIVES FOR THE YEAR. THE MAIN DRIVER OF THE SALES HAS BEEN
    THE NEW PACKAGE DESIGNED UNDER THE SUPERVISION OF OUR MARKETING DEPARTMENT.
    OUR EXPENSES HAS BEEN CONTAINED, INCREASING ONLY BY 0.7%, THOUGH THE BOARD
    CONSIDERS IT NEEDS TO BE FURTHER REDUCED. THE EVALUATION IS SATISFACTORY
    AND THE FORECAST FOR THE NEXT QUARTER IS OPTIMISTIC. THE BOARD EXPECTS
    AN INCREASE IN PROFIT OF AT LEAST 2 MILLION DOLLARS.
"""

words = INPUT_TEXT.split()

# Replace any numerical digits with an 'X' character:
redacted = ["".join("X" if char.isdigit() else char for char in word) for word in words]

# Transform the text into pure ASCII (note that the name of the company contains the letter ñ, which is not ASCII):
ascii_text = [word.encode('ascii', errors='replace').decode('ascii') for word in redacted]

# Add a newline after each period, then group the words into 80-character lines:
newlines = [word + '\n' if word.endswith('.') else word for word in ascii_text]
LINE_SIZE = 80
lines = []
line = ''
for word in newlines:
    if line.endswith('\n') or (len(line) + len(word) + 1) > LINE_SIZE:
        lines.append(line)
        line = ''
    line = line + ' ' + word
# Keep the last line, which the loop leaves pending:
lines.append(line)

# Format all of the lines as titles and join them as a single piece of text:
lines = [line.title() for line in lines]
result = '\n'.join(lines)

# Print the result
print(result)

 After The Close Of The Second Quarter, Our Company, Casta?Acorp Has Achieved A
 Growth In The Revenue Of X.Xx%.

 This Is In Line With The Objectives For The Year.

 The Main Driver Of The Sales Has Been The New Package Designed Under The
 Supervision Of Our Marketing Department.

 Our Expenses Has Been Contained, Increasing Only By X.X%, Though The Board
 Considers It Needs To Be Further Reduced.

 The Evaluation Is Satisfactory And The Forecast For The Next Quarter Is
 Optimistic.

 The Board Expects An Increase In Profit Of At Least X Million Dollars.
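
The same steps can also be sketched with the standard library's `re` and `textwrap` modules. This is only an alternative illustration, not the recipe's method: the `redact_and_wrap` name is hypothetical, and it skips the title-casing step.

```python
import re
import textwrap


def redact_and_wrap(text, width=80):
    """Redact digits, force ASCII, and wrap each sentence to `width` columns."""
    # Collapse all whitespace so the original line breaks do not matter.
    flat = " ".join(text.split())
    # Replace every digit with 'X'.
    redacted = re.sub(r"\d", "X", flat)
    # Characters outside ASCII (such as ñ) become '?'.
    ascii_text = redacted.encode("ascii", errors="replace").decode("ascii")
    # Split after each period that is followed by whitespace.
    sentences = re.split(r"(?<=\.)\s+", ascii_text)
    # Wrap each sentence on its own and separate sentences with a blank line.
    return "\n\n".join(textwrap.fill(s, width=width) for s in sentences)


print(redact_and_wrap("REVENUE GREW 7.47%. CASTAÑACORP EXPECTS 2 MILLION."))
```

Because the split only happens on a period followed by whitespace, the decimal point inside a redacted figure such as `X.XX%` does not start a new sentence.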



### Getting ready

Imagine that we need to parse information stored in sales logs. We'll use a sales log with the following structure:

```
[<Timestamp in iso format>] - SALE - PRODUCT: <product id> - PRICE: $<price of the sale>
```

Note that the price has a leading zero. All prices will have two digits for the dollars and two for the cents.


<div align='center' width ='250 px' background='gray'>
<p>The ISO 8601 standard defines standard ways of representing the time and date. It's widely used in the computing world and can be parsed and generated by virtually any computer language.</p>
</div>
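
To make the format concrete, here is a minimal sketch that produces a line in this structure. The `format_sale` helper is hypothetical (it is not part of the recipe); it relies on `datetime.isoformat()` for the ISO 8601 timestamp and on the `05.2f` format specification for the zero-padded price.

```python
from datetime import datetime
from decimal import Decimal


def format_sale(timestamp, product_id, price):
    """Build a sales log line; `price` is padded to two dollar digits."""
    # isoformat() yields e.g. 2018-05-05T11:07:12.267897
    # (the microseconds are omitted when they are zero).
    return (f"[{timestamp.isoformat()}] - SALE - "
            f"PRODUCT: {product_id} - PRICE: ${price:05.2f}")


print(format_sale(datetime(2018, 5, 5, 11, 7, 12, 267897), 1345, Decimal("9.99")))
```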

In [2]:
# In the Python interpreter, make the following imports. Remember to activate your virtualenv, as described in the Creating a virtual environment recipe:
import delorean
from decimal import Decimal

# Enter the log to parse:
log = "[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99"

# Split the log into its parts, which are divided by - (note the space before and after the dash). We ignore the SALE part as it doesn't add any relevant information:
divide_it = log.split(" - ")
timestamp_string, _, product_string, price_string = divide_it

# Parse the timestamp into a datetime object:
timestamp = delorean.parse(timestamp_string.strip("[]"))

# Parse the product_id into an integer:
product_id = int(product_string.split(":")[-1])

# Parse the price into a Decimal type:
price = Decimal(price_string.split("$")[-1])

# Now you have all of the values in native Python format:
print("Timestamp:", timestamp, "\nId:", product_id, "\nPrice:", price)

Timestamp: Delorean(datetime=datetime.datetime(2018, 5, 5, 11, 7, 12, 267897), timezone='UTC') 
Id: 1345 
Price: 9.99
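
If installing delorean is not an option, the same split-and-convert approach can be sketched with only the standard library. This is an assumption-laden variant, not the recipe's code: `parse_sale` is a hypothetical name, and `datetime.fromisoformat` (Python 3.7+) returns a naive datetime with no time zone attached, unlike the Delorean object above.

```python
from datetime import datetime
from decimal import Decimal


def parse_sale(log):
    """Split a sales log line and convert each part to a native type."""
    # The SALE field carries no information, so it is discarded.
    timestamp_string, _, product_string, price_string = log.split(" - ")
    timestamp = datetime.fromisoformat(timestamp_string.strip("[]"))
    product_id = int(product_string.split(":")[-1])
    price = Decimal(price_string.split("$")[-1])
    return timestamp, product_id, price


print(parse_sale("[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99"))
```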


### There's more...

These log elements can be combined together into a single object, helping to parse and aggregate them. For example, we could define a class in Python code in the following way:

In [3]:
class PriceLog(object):
    def __init__(self, timestamp, product_id, price):
        self.timestamp = timestamp
        self.product_id = product_id
        self.price = price

    def __repr__(self):
        return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                                self.product_id,
                                                self.price)

    @classmethod
    def parse(cls, text_log):
        """Parse from a text log with the format
        [<timestamp>] - SALE - PRODUCT: <product_id> - PRICE: $<price>
        to a PriceLog object

        Args:
            text_log (str): a single log line in the format above
        """

        divide_it = text_log.split(' - ')
        tmp_string, _, product_string, price_string = divide_it
        timestamp = delorean.parse(tmp_string.strip("[]"))
        product_id = int(product_string.split(':')[-1])
        price = Decimal(price_string.split('$')[-1])
        return cls(timestamp=timestamp, product_id=product_id, price=price)


# So, the parsing can be done as follows:
log = '[2018-05-05T12:58:59.998903] - SALE - PRODUCT: 897 - PRICE: $17.99'
print(PriceLog.parse(log))

<PriceLog (Delorean(datetime=datetime.datetime(2018, 5, 5, 12, 58, 59, 998903), timezone='UTC'), 897, 17.99)>
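
As an illustration of the aggregation such objects make possible, the following sketch totals prices per product. The `Sale` dataclass is a hypothetical, dependency-free stand-in for `PriceLog` (it omits the timestamp), and `total_by_product` is likewise an invented helper.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Sale:
    """Hypothetical stand-in for PriceLog, without the timestamp."""
    product_id: int
    price: Decimal


def total_by_product(sales):
    """Sum the prices of all sales, grouped by product id."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for sale in sales:
        totals[sale.product_id] += sale.price
    return dict(totals)


sales = [Sale(1345, Decimal("9.99")), Sale(897, Decimal("17.99")),
         Sale(1345, Decimal("9.99"))]
print(total_by_product(sales))
```

Keeping the prices as `Decimal` means the totals are exact, which would not be guaranteed with floats.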


### Using a third-party tool: parse

A more advanced option is to use regular expressions, as we'll see in the next recipe. But there's a great third-party module for Python called *parse*, which allows us to reverse format strings. It's a powerful, easy-to-use tool that greatly improves the readability of code.

##### How to do it...

In [5]:
# Import the parse function
from parse import parse

# Define the log to parse, in the same format as in the Extracting data from structured strings recipe:
LOG = '[2018-05-06T12:58:00.714611] - SALE - PRODUCT: 1345 - PRICE: $09.99'

# Analyze it and describe it as you would do when trying to print it, like this:
FORMAT = '[{date}] - SALE - PRODUCT: {product} - PRICE: ${price}'


# Run parse and check the results:
results = parse(FORMAT, LOG)
print(results)
print(results['date'])
print(results['product'])
print(results['price'])

# Note that the results are all strings. Define the types to be parsed:
FORMAT = '[{date:ti}] - SALE - PRODUCT: {product:d} - PRICE: ${price:05.2f}'

# Parse once again:
result = parse(FORMAT, LOG)
print(result)
print(result["date"])
print(result["product"])
print(result["price"])

# Define a custom type for the price to avoid issues with the float type:

from decimal import Decimal


def price(string):
    return Decimal(string)

FORMAT = '[{date:ti}] - SALE - PRODUCT: {product:d} - PRICE: ${price:price}'

parse(FORMAT, LOG, {'price': price})

<Result () {'date': '2018-05-06T12:58:00.714611', 'product': '1345', 'price': '09.99'}>
2018-05-06T12:58:00.714611
1345
09.99
<Result () {'date': datetime.datetime(2018, 5, 6, 12, 58, 0, 714611), 'product': 1345, 'price': 9.99}>
2018-05-06 12:58:00.714611
1345
9.99


<Result () {'date': datetime.datetime(2018, 5, 6, 12, 58, 0, 714611), 'product': 1345, 'price': Decimal('9.99')}>

### There's more...

The timestamp can also be translated into a delorean object for consistency. Also, delorean objects carry over time zone information. Adding the same structure as in the previous recipe gives the following object, which is capable of parsing logs:

In [9]:
import parse
from decimal import Decimal
import delorean

class PriceLog(object):
    def __init__(self, timestamp, product_id, price):
        self.timestamp = timestamp
        self.product_id = product_id
        self.price = price

    def __repr__(self):
        return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                                self.product_id,
                                                self.price)

    @classmethod
    def parse(cls, text_log):
        '''
        Parse from a text log with the format
        [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
        to a PriceLog object
        '''
        def price(string):
            return Decimal(string)
        def isodate(string):
            return delorean.parse(string)
        
        FORMAT = ('[{timestamp:isodate}] - SALE - PRODUCT: {product:d} - PRICE: ${price:price}')
        formats = {'price': price, 'isodate': isodate}
        result = parse.parse(FORMAT, text_log, formats)
        
        return cls(timestamp=result['timestamp'],
                    product_id=result['product'],
                    price=result['price']
                )

# So, parsing it returns similar results:
log = '[2018-05-06T14:58:59.051545] - SALE - PRODUCT: 827 - PRICE: $22.25'

PriceLog.parse(log)

<PriceLog (Delorean(datetime=datetime.datetime(2018, 6, 5, 14, 58, 59, 51545), timezone='UTC'), 827, 22.25)>

### REGEX

##### Getting ready

The Python module that deals with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.

Some characters are special and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.

The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's no match, re.search returns None. If there is, it returns a special Match Object:

In [11]:
import re

re.search(r'LOG', 'LOGS')

re.search(r'LOG', 'NOT A MATCH')
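
Because `re.search` may return `None`, it is usually worth checking the result before calling methods on it. The sketch below wraps that check in a hypothetical `first_match` helper and tries a few of the special characters mentioned above (`\d` for any digit, `\s` for any whitespace, `.` for any character, `$` for the end of the string).

```python
import re


def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, text)
    return match.group() if match else None


print(first_match(r'LOG', 'LOGS'))         # literal match
print(first_match(r'LOG', 'NOT A MATCH'))  # no match, so None
print(first_match(r'\d', 'order 66'))      # first digit
print(first_match(r'L.G', 'LOGS'))         # '.' stands for any character
print(first_match(r'S$', 'LOGS'))          # 'S' only at the end of the string
```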

### How to do it


In [24]:
# Import the re module:
import re

# Then, match a pattern that is not at the start of the string:
print(re.search(r"LOG", "SOME LOGS"))

# Match a pattern that is only at the start of the string. Note the ^ character:
print(re.search(r"^LOG", "LOGS"))
print(re.search(r"^LOG", "SOME LOGS"))

# Match 'thing' only at the start of a word (things still matches), but not inside something or anything. Note the \b at the start of the second pattern:
STRING = "something in the things she shows me"
match = re.search(r"thing", STRING)
print(
    "First:",
    STRING[: match.start()],
    "\nSecond:",
    STRING[match.start() : match.end()],
    "\nEnd:",
    STRING[match.end() :],
)

match = re.search(r"\bthing", STRING)
print(
    "First:",
    STRING[: match.start()],
    "\nSecond:",
    STRING[match.start() : match.end()],
    "\nEnd:",
    STRING[match.end() :],
)

# Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:
print(re.search(r'[0123456789-]+', 'the phone number is 1234-567-890'))

print(re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group())

# Match an email address naively:
re.search(r'\S+@\S+', 'my email is email.123@test.com').group()

<re.Match object; span=(5, 8), match='LOG'>
<re.Match object; span=(0, 3), match='LOG'>
None
First: some 
Second: thing 
End:  in the things she shows me
First: something in the  
Second: thing 
End: s she shows me
<re.Match object; span=(20, 32), match='1234-567-890'>
1234-567-890


'email.123@test.com'

Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:

In [25]:
match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
[int(n) for n in match.group().split('-')]

[1234, 567, 890]
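
The character class used here is deliberately loose: it also matches a lone dash, so a hyphenated word earlier in the text would be found first. A slightly stricter variant, shown purely as a sketch (the `STRICT` pattern and the `first` helper are illustrative assumptions, not part of the recipe), requires digits on both sides of every dash.

```python
import re

LOOSE = re.compile(r'[0123456789-]+')   # any run of digits and dashes
STRICT = re.compile(r'\d+(?:-\d+)*')    # digit groups joined by single dashes


def first(pattern, text):
    """Return the first match of a compiled pattern, or None."""
    match = pattern.search(text)
    return match.group() if match else None


text = 'a well-known number: 555-1234'
print(first(LOOSE, text))    # picks up the dash in "well-known"
print(first(STRICT, text))   # skips it and finds the number
print([int(n) for n in first(STRICT, text).split('-')])
```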

### Going deeper into regular expressions

In this recipe, we'll learn more about how to deal with regular expressions. After introducing the basics, we will dig a little deeper into pattern elements, introduce groups as a better way to retrieve and parse strings, learn how to search for multiple occurrences of the same string, and deal with longer texts.

## How to do it...

In [35]:
import re

# Match a phone pattern as part of a group (in parentheses). Note the use of \d as a special character for any digit:
match = re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
print(match.group())
print(match.group(1))

# Compile a pattern and capture a case-insensitive pattern with a yes|no option:
pattern = re.compile(r'The answer to question (\w+) is (yes|no)', re.IGNORECASE)
print(pattern.search("Naturally, the answer to question 3b is YES"))

print(pattern.search('Naturally, the answer to question 3b is YES').groups())

# Match all the occurrences of cities and state abbreviations in the text. Note that they are separated by a single character,
# and the name of the city always starts with an uppercase letter. Only four states are matched for simplicity:

PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
TEXT = 'the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
matches = list(PATTERN.finditer(TEXT))

print(matches[0].groups())

the phone number is 1234-567-890
1234-567-890
<re.Match object; span=(11, 43), match='the answer to question 3b is YES'>
('3b', 'YES')
('Odessa', 'TX')
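
Groups can also be given names with the `(?P<name>...)` syntax, which makes the extracted fields easier to read than positional indexes. As a sketch, the sales log format from the earlier recipe could be parsed this way; the `LOG_PATTERN` expression and the `parse_with_regex` helper are assumptions written for this example.

```python
import re
from decimal import Decimal

# Named groups for each field of:
# [<timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] - SALE - '
    r'PRODUCT: (?P<product>\d+) - PRICE: \$(?P<price>\d+\.\d{2})'
)


def parse_with_regex(log):
    """Return (timestamp string, product id, price), or None if no match."""
    match = LOG_PATTERN.search(log)
    if match is None:
        return None
    fields = match.groupdict()
    return fields['timestamp'], int(fields['product']), Decimal(fields['price'])


print(parse_with_regex('[2018-05-06T12:58:00.714611] - SALE - PRODUCT: 1345 - PRICE: $09.99'))
```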

Patterns can be compiled as well. This saves some time if the pattern needs to be matched over and over. To use it that way, compile the pattern with re.compile and then use the resulting object to perform searches, as in the examples above. Some extra flags can be added, such as re.IGNORECASE to make the pattern case insensitive.

In [40]:
PATTERN = re.compile(r'([A-Z][\w\s]+).(TX|OR|OH|MI)')
TEXT = """the jackalopes are the team of Odessa,TX while the knights 
are native of Corvallis OR and the mud hens come from Toledo.OH; the 
whitecaps have their base in Grand Rapids,MI"""

print(list(PATTERN.finditer(TEXT))[1])

print(re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test, Escanaba MI'))

print(re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test with Escanaba MI'))

PATTERN.search(TEXT), PATTERN.findall(TEXT)

<re.Match object; span=(74, 123), match='Corvallis OR and the mud hens come from Toledo.OH>
<re.Match object; span=(16, 27), match='Escanaba MI'>
<re.Match object; span=(0, 31), match='This is a test with Escanaba MI'>


(<re.Match object; span=(31, 40), match='Odessa,TX'>,
 [('Odessa', 'TX'),
  ('Corvallis OR and the mud hens come from Toledo', 'OH'),
  ('Grand Rapids', 'MI')])
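
The second tuple in the output above swallows two cities because `+` is greedy: it consumes as many characters as it can before the separator. Appending `?` makes the quantifier lazy, so it stops at the first state abbreviation. A small side-by-side sketch (the `GREEDY`, `LAZY`, and `SAMPLE` names are chosen for this example):

```python
import re

SAMPLE = 'Corvallis OR and the mud hens come from Toledo.OH'

GREEDY = re.compile(r'([A-Z][\w\s]+).(TX|OR|OH|MI)')   # longest possible city
LAZY = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')    # shortest possible city

print(GREEDY.findall(SAMPLE))
print(LAZY.findall(SAMPLE))
```

With the greedy pattern, `findall` returns a single pair whose city runs from Corvallis to Toledo; with the lazy one, it returns a separate pair for each city.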

## Adding command-line arguments

Many tasks are best structured as a command-line interface that accepts parameters to change the way it works, for example, scraping a web page from a URL supplied at run time instead of a fixed one. Python includes the powerful argparse module in the standard library to create rich command-line argument parsing with minimal effort.

### Getting ready

The basic use of argparse in a script can be shown in three steps:

1. Define the arguments that your script is going to accept, generating a new parser.
2. Call the defined parser, returning an object with all of the resulting arguments.
3. Use the arguments to call the entry point of your script, which will apply the defined behavior.

Try to use the following general structure for your scripts:

```python 3
IMPORTS
def main(main parameters):
    DO THINGS

if __name__ == '__main__':
    DEFINE ARGUMENT PARSER
    PARSE ARGS
    VALIDATE OR MANIPULATE ARGS, IF NEEDED
    main(arguments)
```

The main function makes it easy to know what the entry point for the code is. The section under the if statement is only executed if the file is called directly, but not if it's imported. We'll follow this for all the steps.

### How to do it...

1. Create a script that will accept a single integer as a positional argument and print a hash symbol that number of times. The recipe_cli_step1.py script is as follows. Note that we are following the structure presented previously, and the `main` function only does the printing:

In [42]:
import argparse

def main(number):
    print('#' * number)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    args = parser.parse_args()
    
    main(args.number)

usage: ipykernel_launcher.py [-h] number
ipykernel_launcher.py: error: the following arguments are required: number


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [43]:
# Add an optional -c argument that sets the character to print (# by default):
import argparse

def main(character, number):
    print(character * number)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    parser.add_argument('-c', type=str, help='character to print', default="#")
    args = parser.parse_args()
    main(args.c, args.number)

usage: ipykernel_launcher.py [-h] [-c C] number
ipykernel_launcher.py: error: the following arguments are required: number


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [45]:
# Add a flag that changes the behavior when present. The `recipe_cli_step3.py` script is as follows

import argparse

def main(character, number):
    print(character * number)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('number', type=int, help='A number')
    parser.add_argument('-c', type=str, help="character to print", default="#")
    parser.add_argument('-U', action='store_true', default=False, dest='uppercase', help='Uppercase the character')
    args = parser.parse_args()
    
    if args.uppercase:
        args.c = args.c.upper()
    main(args.c, args.number)

usage: ipykernel_launcher.py [-h] [-c C] [-U] number
ipykernel_launcher.py: error: the following arguments are required: number


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
