# Introduction to Textual Analysis - Basic String Functions

A vast amount of data is stored as text (e.g., 10-Ks, news articles, press releases, conference calls, and analyst reports), and this type of data lends itself to many research questions.

Working with textual data is no easy task because language is multi-faceted. There are many ways to convey similar information. For example, to communicate that the firm had a good quarter, the manager might say any of the following:

    "We had a great quarter."

    "Our quarter was fantastic."

    "We are proud of our results this quarter."

    "Our results were excellent this quarter."

In this module, we'll introduce some basic string functions as a first step toward the core natural language processing skills needed to analyze textual data.

#### Basic String Functions

We'll illustrate using the following text string.

In [1]:
text = 'Hello World!! '

***Replace Text***

Replace text in a string using the **replace** function. It returns a new string with every occurrence replaced; the original string is unchanged.

In [2]:
replaced_text = text.replace('Hello','Hi')
replaced_text

'Hi World!! '
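Two details of **replace** are easy to miss: it replaces every occurrence unless a maximum count is given, and it never modifies the original string. The following is a minimal sketch; the `sample` string and variable names are illustrative and not part of the module's examples.

```python
# Illustrative string (an assumption, not taken from the examples above)
sample = 'good quarter, good results, good outlook'

# By default, every occurrence is replaced
all_replaced = sample.replace('good', 'great')

# An optional third argument caps the number of replacements
first_only = sample.replace('good', 'great', 1)

print(all_replaced)
print(first_only)
print(sample)  # strings are immutable, so the original is unchanged
```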

***Remove Leading and Trailing Spaces***

Remove leading and trailing whitespace (spaces, tabs, and newlines) in a string using the **strip** function.

In [3]:
stripped_text1 = text.strip()
stripped_text1

'Hello World!!'

***Remove Specific Leading and Trailing Characters***

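Remove specific leading and trailing characters from a string by passing them to the **strip** function. The argument is treated as a set of characters: any of them is removed from both ends until a character outside the set is reached.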

In [4]:
stripped_text2 = stripped_text1.strip('!')
stripped_text2

'Hello World'
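As a hedged illustration of how the **strip** family behaves, the sketch below also uses `lstrip` and `rstrip`, which trim only one side. The `raw` string and variable names are illustrative assumptions.

```python
# Illustrative string with whitespace on both ends
raw = '  Hello World!!  '

left_trimmed = raw.lstrip()    # trims the left side only
right_trimmed = raw.rstrip()   # trims the right side only

# The argument is a set of characters, so ' !' strips spaces and '!' in any order
both_trimmed = raw.strip(' !')

# Characters in the middle of the string are never removed
interior_kept = '!!Hello! World!!'.strip('!')

print(repr(left_trimmed))
print(repr(right_trimmed))
print(repr(both_trimmed))
print(repr(interior_kept))
```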

***Split String***

Split a string into a list on a separator (e.g., ' ') using the **split** function. Note that the trailing space in `text` produces an empty string as the last element.

In [5]:
words = text.split(' ')
words

['Hello', 'World!!', '']
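Empty strings like the one above can be avoided by calling **split** with no argument, which splits on any run of whitespace and drops empty results. A minimal sketch, using an illustrative `messy` string that is not part of the module's examples:

```python
# Illustrative string with repeated spaces and a trailing newline
messy = 'Hello   World!! \n'

# Splitting on a single space keeps empty strings and the newline
by_space = messy.split(' ')

# Splitting with no argument treats any whitespace run as one separator
by_whitespace = messy.split()

# An optional maxsplit argument limits the number of splits
first_split = 'a,b,c'.split(',', 1)

print(by_space)
print(by_whitespace)
print(first_split)
```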

***Convert to Upper Case***

Convert string to all upper case using the **upper** function.

In [6]:
text_upper = text.upper()
text_upper

'HELLO WORLD!! '

***Convert to Lower Case***

Convert string to all lower case using the **lower** function.

In [7]:
text_lower = text.lower()
text_lower

'hello world!! '
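A common reason to change case in textual analysis is to compare or search text without regard to capitalization. The sketch below assumes two illustrative headlines that are not part of the module's examples.

```python
# Illustrative headlines that differ only in capitalization
headline_a = 'Earnings BEAT Expectations'
headline_b = 'earnings beat expectations'

# A direct comparison is case sensitive
same_as_written = headline_a == headline_b

# Lowercasing both sides makes the comparison case insensitive
same_ignoring_case = headline_a.lower() == headline_b.lower()

# The same idea applies to searching for a word
mentions_beat = 'beat' in headline_a.lower()

print(same_as_written, same_ignoring_case, mentions_beat)
```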

***Count Occurrences***

Count occurrences of a substring in a string using the **count** function. The match is case sensitive.

In [8]:
text = 'Hello World Hello!'
count_hello = text.count('Hello')
count_hello

2
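Because **count** matches raw substrings, the result depends on case and can include matches inside longer words. A minimal sketch with illustrative strings (assumptions, not taken from the examples above):

```python
# Illustrative string with mixed capitalization
remarks = 'Hello World hello!'

case_sensitive = remarks.count('Hello')            # only the capitalized match
case_insensitive = remarks.lower().count('hello')  # lowercase first to catch both

# Substrings inside longer words are counted too
substring_hits = 'growth in growing markets'.count('grow')

# Matches do not overlap
non_overlapping = 'aaaa'.count('aa')

print(case_sensitive, case_insensitive, substring_hits, non_overlapping)
```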

***Join Strings***

Join a list of strings into a single string using the **join** function, which is called on the separator.

In [9]:
words = ['Hello','World','!']
# Use ' '.join(words) to join with a space instead
joined_text = '~'.join(words)
joined_text

'Hello~World~!'
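**join** expects every element to be a string, so lists that mix in numbers need converting first. The sketch below also shows a common split-then-join idiom for collapsing extra whitespace. The `tokens` list and variable names are illustrative assumptions.

```python
# Illustrative list containing a non-string element
tokens = ['Revenue', 'grew', 12, 'percent']

# Joining directly fails because 12 is an int
try:
    ' '.join(tokens)
    join_failed = False
except TypeError:
    join_failed = True

# Convert each element to a string before joining
joined = ' '.join(str(token) for token in tokens)

# Splitting on whitespace and rejoining collapses repeated spaces
normalized = ' '.join('Hello    World  !'.split())

print(join_failed)
print(joined)
print(normalized)
```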

***Search Strings***

Search a string for a specified value and return the index of its first occurrence using the **find** function. Indexing starts at 0, and **find** returns -1 if the value is not found.

In [10]:
text = 'Here is a string. This string is an example string.'
loc_string = text.find('string') 
loc_string

10
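**find** also accepts an optional start index, which makes it possible to locate every occurrence rather than just the first. The following is one possible sketch of that idea; the `passage` and `positions` names are illustrative.

```python
passage = 'Here is a string. This string is an example string.'

# Collect the starting index of every occurrence of 'string'
positions = []
start = passage.find('string')
while start != -1:
    positions.append(start)
    # Resume the search just past the previous match
    start = passage.find('string', start + 1)

# A value that is absent yields -1 rather than an error
missing = passage.find('number')

print(positions)
print(missing)
```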

#### Boolean String Functions

We can test whether a string has certain properties using the following functions:

    islower() - returns True if all cased characters in a string are lower case (and there is at least one)
    isupper() - returns True if all cased characters in a string are upper case (and there is at least one)
    isnumeric() - returns True if all characters in a string are numeric
    isalpha() - returns True if all characters in a string are alphabetic (letters)
    isalnum() - returns True if all characters in a string are alphanumeric (letters or numbers)
    startswith() - returns True if a string starts with a specified value
    endswith() - returns True if a string ends with a specified value

Spaces and punctuation count as characters, so a string such as 'hello world' is neither all alphabetic nor all alphanumeric.

In [11]:
text = 'hello world'
print('The text is all lower case    : '+str(text.islower()))
print('The text is all upper case    : '+str(text.isupper()))
print('The text is all numeric       : '+str(text.isnumeric()))
print('The text is all alphabetic    : '+str(text.isalpha()))
print('The text is all alpha-numeric : '+str(text.isalnum()))
print('The text starts with "Hello"  : '+str(text.startswith('Hello')))
print('The text ends with "!"        : '+str(text.endswith('!')))

The text is all lower case    : True
The text is all upper case    : False
The text is all numeric       : False
The text is all alphabetic    : False
The text is all alpha-numeric : False
The text starts with "Hello"  : False
The text ends with "!"        : False
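A few edge cases help explain results like the ones above. The sketch below is illustrative only; the variable names are assumptions and the strings are not part of the module's examples, apart from 'hello world'.

```python
# A space is not a letter, so removing it changes the isalpha() result
with_space = 'hello world'.isalpha()
without_space = 'hello world'.replace(' ', '').isalpha()

# islower() ignores digits and punctuation and checks only cased characters
lower_with_digits = 'hello world 2024!'.islower()

# A decimal point is not a numeric character
whole_number = '2024'.isnumeric()
decimal_number = '20.24'.isnumeric()

# Letters and digits together are alphanumeric
course_code = 'MFIN290'.isalnum()

# An empty string fails these tests
empty_alpha = ''.isalpha()

# startswith() accepts a tuple of candidate prefixes
any_greeting = 'hello world'.startswith(('hi', 'hello'))

print(with_space, without_space, lower_with_digits)
print(whole_number, decimal_number, course_code)
print(empty_alpha, any_greeting)
```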


#### Exercise

You are given the following text string and string list:

In [12]:
text = 'Hello MFIN290 students. I hope you are enjoying Python! Python is #1!'
string_list = ['Hello','Python','Coder']

Do the following in order. Each solution below works on the `text` string as modified by the previous steps:

1. Obtain a list of all words in the `text` string.
2. Replace '#1' with 'number one' in the `text` string.
3. Convert all characters in the `text` string to lower case.
4. Test whether the `text` string is all lower case.
5. Count the number of 'p' characters in the `text` string.
6. Test whether the `text` string ends with a period (i.e., '.').
7. Identify the first index location of the word 'hope' within the `text` string.
8. Join the `string_list` using a space (i.e., ' ').

#### Solution for # 1

In [13]:
words = text.split(' ')
words

['Hello',
 'MFIN290',
 'students.',
 'I',
 'hope',
 'you',
 'are',
 'enjoying',
 'Python!',
 'Python',
 'is',
 '#1!']
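Several of the words above still carry punctuation (e.g., 'students.' and 'Python!'). This goes beyond what the exercise asks, but one possible way to clean them is to strip punctuation from each word using the standard library's `string.punctuation`. The `sentence` and `clean_words` names are illustrative.

```python
import string

# Same sentence as the exercise, stored under a separate illustrative name
sentence = 'Hello MFIN290 students. I hope you are enjoying Python! Python is #1!'

# Split on whitespace, then strip punctuation from both ends of each word
clean_words = [word.strip(string.punctuation) for word in sentence.split()]

print(clean_words)
```

Note that '#' is also punctuation, so '#1!' is reduced to '1'.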

#### Solution for # 2

In [14]:
text = text.replace('#1','number one')
text

'Hello MFIN290 students. I hope you are enjoying Python! Python is number one!'

#### Solution for # 3

In [15]:
text = text.lower()
text

'hello mfin290 students. i hope you are enjoying python! python is number one!'

#### Solution for # 4

In [16]:
text.islower()

True

#### Solution for # 5

In [17]:
pcount = text.count('p')
pcount

3

#### Solution for # 6

In [18]:
text.endswith('.')

False

#### Solution for # 7

In [19]:
hope_loc = text.find('hope')
hope_loc

26

#### Solution for # 8

In [20]:
' '.join(string_list)

'Hello Python Coder'