<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**`str` Objects**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## String Creation

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
"This is a string. With a number such as 1,234."

In [None]:
'This is another string. With a number such as 1,234.'

In [None]:
'word'

In [None]:
'a'

In [None]:
type('a')

In [None]:
type('a longer string')

In [None]:
"""This is a 
multiline string
object.
"""

In [None]:
"""
This is a 
multiline string
object.
"""

In [None]:
"""
This is a 
multiline string
object."""

In [None]:
"""This is a 
multiline string
object."""

In [None]:
'''
Another multiline
string object.
'''

In [None]:
s = 'NLP Basics'

In [None]:
s

In [None]:
type(s)

In [None]:
1000 * 2

In [None]:
type(1000)

In [None]:
str(1000)

In [None]:
str(2.345)

In [None]:
float('2.456')

## Printing Strings 

In [None]:
print(s)

In [None]:
print('Hello NLP enthusiasts.')

In [None]:
# print?

In [None]:
for i in range(5):
    print(s)

In [None]:
for i in range(5):
    print(s, end='\n')

In [None]:
for i in range(5):
    print(s, end=' | ')

In [None]:
for i in range(5):
    print(s, i, end=' | ')

## Basic String Operations

In [None]:
a = 10
b = 5

In [None]:
a + b

In [None]:
a * b

In [None]:
a / b

In [None]:
s = 'The first string.'

In [None]:
t = 'The second string.'

In [None]:
s + t

In [None]:
s + ' ' + t

In [None]:
' '

In [None]:
''

In [None]:
s + s

In [None]:
2 * s

In [None]:
# s / 2  # does not work: raises TypeError

In [None]:
print(s + s)

In [None]:
type(s + s)

## String Indexing and Slicing

In [None]:
s

In [None]:
s[0]

In [None]:
s[1]

In [None]:
s[-1]

In [None]:
s[-2]

In [None]:
s[1:5]

In [None]:
s[5:1]

In [None]:
s[3:]

In [None]:
s[3:100]

In [None]:
s[3:-1]

In [None]:
s[:5]

In [None]:
s[0:5]

In [None]:
s[2:10:1]  # adding step size

In [None]:
s[2:10:2]

In [None]:
s[:]

In [None]:
s[::2]

In [None]:
s[::-1]

## Immutability

In [None]:
l = [1, 4, 7, 9]

In [None]:
l

In [None]:
l[0]

In [None]:
l[-1]

In [None]:
l[-1] = 100

In [None]:
l  # list objects are mutable

In [None]:
s[-1]

In [None]:
# s[-1] = '!'  # str objects are immutable

In [None]:
tu = tuple(l)

In [None]:
tu

In [None]:
tu[-1]

In [None]:
# tu[-1] = 25  # tuple objects are immutable
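
The cells above show that item assignment works for `list` objects but not for `str` or `tuple` objects. As a minimal sketch of what this means in practice, a "changed" string is always a new object built from the old one. The names `original`, `changed`, and `replaced` are illustrative assumptions, not part of the notebook.

```python
original = 'The first string.'

# Item assignment fails because str objects are immutable.
try:
    original[-1] = '!'
except TypeError as err:
    print('TypeError:', err)

# Build new strings instead: by slicing plus concatenation ...
changed = original[:-1] + '!'
# ... or by a method that returns a new object.
replaced = original.replace('first', '1st')

print(changed)   # The first string!
print(replaced)  # The 1st string.
print(original)  # The first string. (unchanged)
```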

## String Methods

Python `str` objects come with many built-in methods for manipulating and analyzing text. Because strings are immutable, methods that transform a string return a new object rather than changing the original. Here is a concise overview of some commonly used methods, with code examples:

1. **`.upper()` and `.lower()`**: Converts a string to uppercase or lowercase.
   ```python
   s = "Hello World"
   print(s.upper())  # Output: "HELLO WORLD"
   print(s.lower())  # Output: "hello world"
   ```

2. **`.strip()`**: Removes leading and trailing whitespace (or other specified characters).
   ```python
   s = "   Hello World   "
   print(s.strip())  # Output: "Hello World"
   ```

3. **`.split()`**: Splits the string into a list of substrings based on a separator (default is whitespace).
   ```python
   s = "Hello, World, Python"
   print(s.split(','))  # Output: ['Hello', ' World', ' Python']
   ```

4. **`.join()`**: Joins elements of a sequence into a string, separated by the string used to call this method.
   ```python
   words = ['Hello', 'World']
   print(' '.join(words))  # Output: "Hello World"
   ```

5. **`.replace()`**: Replaces occurrences of a specified substring with another substring.
   ```python
   s = "Hello World"
   print(s.replace("World", "Python"))  # Output: "Hello Python"
   ```

6. **`.find()` and `.rfind()`**: Returns the lowest (or highest for `rfind`) index at which the substring is found, or `-1` if it is not found.
   ```python
   s = "Hello World"
   print(s.find("World"))  # Output: 6
   print(s.rfind("o"))     # Output: 7
   ```

7. **`.startswith()` and `.endswith()`**: Checks if the string starts or ends with the specified substring.
   ```python
   s = "Hello World"
   print(s.startswith("Hello"))  # Output: True
   print(s.endswith("Python"))   # Output: False
   ```

8. **`.isalpha()`, `.isdigit()`, `.isspace()`**: Check whether the string is non-empty and consists only of alphabetic characters, digits, or whitespace, respectively.
   ```python
   print("Hello".isalpha())  # Output: True
   print("123".isdigit())    # Output: True
   print("   ".isspace())    # Output: True
   ```

9. **`.format()`**: Inserts values into a string with placeholders.
   ```python
   print("Name: {}, Age: {}".format("Alice", 30))  # Output: "Name: Alice, Age: 30"
   ```

10. **`.count()`**: Counts occurrences of a substring in the string.
    ```python
    s = "Hello World"
    print(s.count("o"))  # Output: 2
    ```

These methods are integral to string handling in Python, providing a rich set of tools for string manipulation without the need for external libraries.
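
As a rough sketch of how several of these methods combine into a very simple text-normalization step, the example below trims, lowercases, strips a few punctuation characters, and splits a sentence into tokens. The sample text and the hand-picked punctuation set are assumptions made for illustration; real NLP pipelines typically use more careful tokenization.

```python
# Hypothetical sample text, chosen only for illustration.
text = "  Hello World, hello Python!  "

# Normalize: remove outer whitespace and convert to lowercase.
cleaned = text.strip().lower()

# Remove a small, hand-picked set of punctuation characters.
for ch in ",.!?":
    cleaned = cleaned.replace(ch, "")

# Split on whitespace to obtain simple tokens.
tokens = cleaned.split()
print(tokens)  # ['hello', 'world', 'hello', 'python']

# Re-join the tokens into a normalized string.
normalized = " ".join(tokens)
print(normalized)  # hello world hello python

# str.count() counts substring occurrences in the normalized string.
print(normalized.count("hello"))  # 2
```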

In [None]:
t

In [None]:
t.split()

In [None]:
v = 'Word  another one   !'

In [None]:
v.split()

In [None]:
csv = 'AAPL, 123.4, 20000'

In [None]:
csv.split()

In [None]:
csv = 'AAPL,123.4,20000'

In [None]:
csv.split(',')

## String Formatting

String formatting in Python allows you to interpolate variables or expressions into strings. Here are several methods for formatting strings, each with a concise explanation and example:

1. **Old-Style String Formatting (% Operator)**
   - Explanation: Uses the `%` operator with format specifiers like `%s`, `%d`, etc.
   - Example:
     ```python
     name = "Alice"
     age = 30
     formatted_string = "Name: %s, Age: %d" % (name, age)
     # Result: "Name: Alice, Age: 30"
     ```

2. **`str.format()` Method**
   - Explanation: Utilizes curly braces `{}` as placeholders for variables, which are passed to the `format` method.
   - Example:
     ```python
     name = "Bob"
     age = 25
     formatted_string = "Name: {}, Age: {}".format(name, age)
     # Result: "Name: Bob, Age: 25"
     ```

3. **Formatted String Literals (f-Strings)**
   - Explanation: Introduced in Python 3.6, f-strings use an `f` prefix and curly braces containing expressions that are evaluated at runtime and replaced by their formatted values.
   - Example:
     ```python
     name = "Charlie"
     age = 40
     formatted_string = f"Name: {name}, Age: {age}"
     # Result: "Name: Charlie, Age: 40"
     ```

Each of these methods has its use cases. The `%` operator and `str.format()` are widely used and also work in older Python versions. f-strings offer a more readable and concise syntax and are generally preferred in Python 3.6 and later.
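
All three approaches also accept format specifiers, for example to control the number of decimal places. The following is a minimal sketch; the values for `ticker`, `price`, and `volume` are assumptions chosen for illustration.

```python
# Hypothetical example values.
ticker = "AAPL"
price = 123.4
volume = 20000

# Old-style formatting: %.2f fixes two decimal places.
old_style = "%s: %.2f (%d)" % (ticker, price, volume)

# str.format(): the specifier follows a colon inside the braces;
# "," adds a thousands separator.
with_format = "{}: {:.2f} ({:,})".format(ticker, price, volume)

# f-string: same specifier syntax, with the expression inline.
f_string = f"{ticker}: {price:.2f} ({volume:,})"

print(old_style)    # AAPL: 123.40 (20000)
print(with_format)  # AAPL: 123.40 (20,000)
print(f_string)     # AAPL: 123.40 (20,000)
```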

In [None]:
s = 'Python'

In [None]:
'The best programming language is %s.' % s

In [None]:
'The best programming language is {}.'.format(s)

In [None]:
f'The best programming language is {s}.'

## Escape Characters

In Python, escape characters are used in strings to represent characters that are otherwise difficult or impossible to express directly. Here are some common escape characters:

1. **`\n`**: Newline. Starts a new line.
   - Example: `"Hello\nWorld"` results in two lines: "Hello" and "World".

2. **`\t`**: Horizontal Tab. Adds a tab space.
   - Example: `"Hello\tWorld"` places a tab space between "Hello" and "World".

3. **`\r`**: Carriage Return. Moves the cursor to the beginning of the line without advancing to the next line.
   - Example: when printed to a terminal, `"Hello\rWorld"` typically displays just "World", because "World" overwrites "Hello"; the string itself still contains all 11 characters.

4. **`\b`**: Backspace. Erases one character (backwards).
   - Example: when printed to a terminal, `"Hello\bWorld"` typically displays "HellWorld"; the backspace character itself remains part of the string.

5. **`\f`**: Form Feed. Advances the paper feed in a printer.
   - Example: `"Hello\fWorld"` is rarely used in modern contexts.

6. **`\\`**: Backslash. To represent a literal backslash.
   - Example: `"C:\\Users"` represents the string "C:\Users".

7. **`\'`**: Single Quote. Used inside single-quoted strings.
   - Example: `'It\'s a nice day'` is interpreted as "It's a nice day".

8. **`\"`**: Double Quote. Used inside double-quoted strings.
   - Example: `"He said, \"Hello\""` is interpreted as `He said, "Hello"`.

9. **`\a`**: Bell/Alert. Causes the computer to make a sound.
   - Example: `"Hello\a"` might cause the computer to beep.

10. **`\v`**: Vertical Tab. Moves the cursor down a line without returning to the beginning of the line.
    - Example: `"Hello\vWorld"` is similar to `\n` but is less commonly used.

11. **`\xhh`**: Character with hex value hh. Represents a character with the specified hexadecimal value.
    - Example: `"\x48\x65\x6c\x6c\x6f"` is equivalent to `"Hello"`.

12. **`\uhhhh`** and **`\Uhhhhhhhh`**: Unicode characters. Represent a character based on its Unicode code point.
    - Example: `"\u0048ello"` for "Hello" using the Unicode code point for 'H'.

These escape sequences allow for more control over the text representation within a string, especially in scenarios where direct representation is not possible or convenient.
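
One point that is easy to miss: an escape sequence is written with two or more characters in the source code but represents a single character in the string. The sketch below uses `len()` and `repr()` to make this visible; the sample strings are assumptions chosen for illustration.

```python
s = "Hello\tWorld\n"

# Each escape sequence counts as one character: 5 + 1 + 5 + 1.
print(len(s))  # 12

# repr() shows the escape sequences instead of interpreting them.
print(repr(s))  # 'Hello\tWorld\n'

# print() interprets them: a tab between the words, then a newline.
print(s, end="")

# Hex and Unicode escapes are just another way to write a character.
same = "\x48\u0065llo" == "Hello"
print(same)  # True

# A carriage return is stored like any other character.
print(len("Hello\rWorld"))  # 11
```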

In [None]:
for i in range(5):
    print(s, i, end='\n')

In [None]:
import time

In [None]:
for i in range(50):
    print(s, i, end='\r')
    time.sleep(0.1)

In [None]:
'Here a few words.'

In [None]:
'Here\ta\tfew\twords.'

In [None]:
print('Here\ta\tfew\twords.')

## Raw Strings

Raw strings in Python are string literals that treat backslashes (`\`) as literal characters instead of interpreting them as the start of an escape sequence. This is particularly useful for strings that contain many backslashes, since it avoids having to escape each one.

### Characteristics of Raw Strings:

1. **Prefix with 'r' or 'R'**: Raw strings are created by prefixing the string literal with `r` or `R`.

2. **No Escape Character Processing**: Backslashes in a raw string are treated as literal characters, so escape codes like `\n` (newline), `\t` (tab), etc., are not processed.

3. **Common Use Cases**: They are often used for regular expressions and Windows file paths (including network paths), where backslashes occur frequently.

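4. **Not Completely Raw**: A backslash still affects how the literal is parsed in one case: a backslash immediately before the delimiting quote character prevents that quote from ending the string, while the backslash itself stays in the result. As a consequence, a raw string literal cannot end in a single backslash (or any odd number of backslashes), because the final backslash would escape the closing quote.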

### Usage Example:

1. **Escape Characters**:
   A raw string that contains escape sequences is printed literally, i.e. without interpreting them:

   ```python
   s = r"Hello\nWorld!\tWow!"
   print(s)  # Output: Hello\nWorld!\tWow!
   ```

2. **File Paths**:
   File paths on Windows often contain backslashes. Using raw strings avoids the need to double each backslash.
  
   ```python
   path = r"C:\Users\Name\Folder\File.txt"
   print(path)  # Output: C:\Users\Name\Folder\File.txt
   ```

In summary, raw strings are convenient whenever a string contains many backslashes and should keep its literal value, without the Python interpreter treating the backslashes as the start of escape sequences.
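
To illustrate the regular-expression use case, here is a minimal sketch that compares a raw and a regular string literal. The sample text and variable names are assumptions chosen for illustration. In a regular string, `\b` is the backspace character, whereas the `re` module expects the two characters `\` and `b` to mean "word boundary".

```python
import re

# A raw literal keeps the backslash, a regular one interprets it.
normal = "a\tb"   # three characters: 'a', tab, 'b'
raw = r"a\tb"     # four characters: 'a', '\', 't', 'b'
print(len(normal), len(raw))  # 3 4
print(raw == "a\\tb")         # True

# Hypothetical sample text.
text = "NLP is fun; NLP works."

# Raw pattern: \b reaches the regex engine as a word boundary.
with_raw = re.findall(r"\bNLP\b", text)
# Regular pattern: \b becomes a backspace character, so nothing matches.
without_raw = re.findall("\bNLP\b", text)
print(with_raw)     # ['NLP', 'NLP']
print(without_raw)  # []

# A raw literal cannot end in a single backslash; append it separately.
folder = r"C:\Users\Name" + "\\"
print(folder)  # C:\Users\Name\
```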

In [None]:
s = 'Here\ta\tfew\twords.'
s

In [None]:
print(s)

In [None]:
r = r'Here\ta\tfew\twords.'

In [None]:
type(r)

In [None]:
print(r)

In [None]:
path = r"C:\Users\Name\Folder\File.txt"
path

In [None]:
# path = "C:\Users\Name\Folder\File.txt"  # SyntaxError: "\U" starts a Unicode escape
# path
