# 1. String Manipulation Methods (READ-AND-PLAY)

In [0]:
# Run this code
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image

In [0]:
# Run this code
string = '    pen,pineapple,apple, pen   '
print(string)

[`split()`](https://docs.python.org/3/library/stdtypes.html#str.split) method 

- This method splits a string into a list of substrings.
- The `sep` parameter sets the separator to use when splitting the string; if it is omitted, the string is split on runs of whitespace.
- We can limit how many splits are done by setting the `maxsplit` parameter.
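
The three points above can be sketched on two small strings. The strings `words` and `csv_line` are made up for this illustration only; they are not used elsewhere in the notebook.

```python
# Hypothetical sample strings, used only for this illustration
words = '  pen   pineapple apple  '
csv_line = 'pen,pineapple,,apple'

# No separator: split on runs of whitespace, ignoring leading/trailing whitespace
by_whitespace = words.split()

# Explicit separator: every ',' is a boundary, so empty strings can appear
by_comma = csv_line.split(',')

# maxsplit=1 performs at most one split and leaves the rest of the string intact
first_and_rest = csv_line.split(',', maxsplit=1)

print(by_whitespace)   # ['pen', 'pineapple', 'apple']
print(by_comma)        # ['pen', 'pineapple', '', 'apple']
print(first_and_rest)  # ['pen', 'pineapple,,apple']
```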

In [0]:
# Split the string using separator ','
string.split(',')

In [0]:
# Run this code
string_2 = 'summer#autumn   #spring  # winter'
print(string_2)

In [0]:
# Split the string using separator '#'
string_2.split('#')

In [0]:
# Split string_2 and set the maxsplit parameter to 2 (this should return a list with 3 elements)
x = string_2.split('#', 2)
print(x)

[`strip()`](https://docs.python.org/3/library/stdtypes.html#str.strip) method

- It removes whitespace at the beginning and at the end of the string.
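
As a small sketch, `strip()` has two siblings, `lstrip()` and `rstrip()`, and all three accept an optional set of characters to remove instead of whitespace. The string `padded` is a made-up example.

```python
padded = '   --avocado--   '   # hypothetical example string

both_sides = padded.strip()     # whitespace removed on both ends
left_only = padded.lstrip()     # whitespace removed on the left only
right_only = padded.rstrip()    # whitespace removed on the right only
no_dashes = padded.strip(' -')  # spaces AND dashes removed on both ends

print(repr(both_sides))   # '--avocado--'
print(repr(left_only))    # '--avocado--   '
print(repr(right_only))   # '   --avocado--'
print(repr(no_dashes))    # 'avocado'
```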

In [0]:
# Remove leading and trailing whitespace from our_string
our_string = '     There is a lot of space at the beginning and at the end of this sentence, let`s remove it.       '
our_result = our_string.strip()
print(our_result)

[`join()`](https://docs.python.org/3/library/stdtypes.html#str.join) method

- This method takes all items in an iterable and joins them into one string.

In [0]:
# Run this code
my_list = ['Please', 'join', 'these', 'items.']
'_'.join(my_list)

In [0]:
# Run this code
my_tuple = ('We','are', 'joining', 'again.')
'-'.join(my_tuple)

In the case of a dictionary, [`join()`](https://docs.python.org/3/library/stdtypes.html#str.join) tries to join keys of the dictionary, not values.

In [0]:
# Run this code
my_dictionary = {'Key_1':'1',
                 'Key_2':'2'}
'#'.join(my_dictionary)
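
If the values are what we want instead, one possible approach is to pass `my_dictionary.values()` to `join()`. The sketch below also shows that every item must be a string, so numbers need to be converted first. The list `numbers` is a made-up example.

```python
my_dictionary = {'Key_1': '1', 'Key_2': '2'}

# Iterating over a dictionary yields its keys
keys_joined = '#'.join(my_dictionary)

# .values() gives access to the values instead
values_joined = '#'.join(my_dictionary.values())

# join() only accepts strings, so convert other types first
numbers = [1, 2, 3]  # hypothetical example data
numbers_joined = ', '.join(str(n) for n in numbers)

print(keys_joined)     # Key_1#Key_2
print(values_joined)   # 1#2
print(numbers_joined)  # 1, 2, 3
```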

[`index()`](https://docs.python.org/3/library/stdtypes.html#str.index) method

- It returns the position of the first character in a substring if the substring is found in the string.
- It raises a `ValueError` if nothing is found.
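
Because of the `ValueError`, a search with `index()` is usually wrapped in `try`/`except` when the substring might be missing. The helper `safe_index` below is a hypothetical name chosen for this sketch; it is not part of Python.

```python
sentence = 'That is my string'

def safe_index(text, sub):
    """Hypothetical helper: like str.index(), but returns None instead of raising."""
    try:
        return text.index(sub)
    except ValueError:
        return None

print(safe_index(sentence, 'my'))   # 8
print(safe_index(sentence, 'xyz'))  # None
```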

In [0]:
# Run this code
string_3 = 'That is my string'

In [0]:
# Find the position of 'm' using `index()`
string_3.index('m')

[`replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) method

- This method replaces occurrences of a substring with another string.
- It is commonly used to remove characters by passing an empty string.
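
Two details are worth a quick sketch: `replace()` matches substrings anywhere (not only whole words), and an optional third argument limits how many occurrences are replaced. The string `text` is a made-up example.

```python
text = 'is this his list?'   # hypothetical example string

# Every occurrence of 'is' is replaced, even inside other words
all_replaced = text.replace('is', 'IS')

# The optional count argument limits the number of replacements
first_only = text.replace('is', 'IS', 1)

print(all_replaced)  # IS thIS hIS lISt?
print(first_only)    # IS this his list?
print(text)          # is this his list?  (the original string is unchanged)
```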

In [0]:
# Replacing string in string_3
string_3.replace('is','was')

In [0]:
# Run this code
string_4 = 'Why is here a semicolon; ?'

In [0]:
# Replacing character
string_4.replace(';','')

In [0]:
# Run this code
string_5 = 'Banana, avocado, pineapple, artichoke'

In [0]:
# TASK 1 >>>> Use .replace() method to replace 'a' with 'A' in string_5 and store it in variable result_1

[`upper()`](https://docs.python.org/3/library/stdtypes.html#str.upper) method

- This method converts all lowercase characters in a string into uppercase characters and returns it.

[`lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower) method
- This method converts all uppercase characters in a string into lowercase characters and returns it.

In [0]:
# Run this code
string_to_upper = "Make this uppercase"
print(string_to_upper.upper())

In [0]:
# Run this code
string_to_lower = 'THIS SHOULD BE ALL LOWERCASE'
print(string_to_lower.lower())

[`find()`](https://docs.python.org/3/library/stdtypes.html#str.find) method
- This method is similar to [`index()`](https://docs.python.org/3/library/stdtypes.html#str.index).
- If the substring is found, this method returns the index of the first occurrence of the substring.
- If the substring is not found, -1 is returned.
- This function is **case sensitive**.
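
One common way to work around the case sensitivity is to lowercase the string before searching. This is only a sketch of that idea; when the position is not needed, the `in` operator is a simpler check.

```python
quote = 'Data Science is cool'

exact = quote.find('Data Science')                  # case matches
wrong_case = quote.find('data science')             # case differs, so not found
ignoring_case = quote.lower().find('data science')  # search in a lowercased copy

# When only a yes/no answer is needed, the 'in' operator is enough
contains = 'Science' in quote

print(exact, wrong_case, ignoring_case, contains)  # 0 -1 0 True
```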

In [0]:
# Run this code
quote = "Data Science is cool"

print("The quote is: " + quote)

# first occurrence of 'Data Science'
result = quote.find('Data Science')
print("Substring 'Data Science':", result)

# what happens when the case of the substring does not match
result = quote.find('data science')
print("Substring 'data science':", result)

In [0]:
# find returns -1 if substring not found
result = quote.find('RBI')
print("Substring 'RBI':", result)

# How to use find()
if (quote.find('is') != -1):
    print("Substring is found")
else:
    print("Substring is not found")

If you go to the [documentation of `find()`](https://python-reference.readthedocs.io/en/latest/docs/str/find.html), you will see that [`find()`](https://docs.python.org/3/library/stdtypes.html#str.find) can accept three parameters. One is required, and the other two are optional.

The general syntax looks like this:
````
string.find(sub, start, end)
````

|Parameter|Required?|Description|Default|
|---------|-----|------------- |-----|
|sub|Required|The substring that you are searching for|(no default)|
|start|Optional|Position at which the search starts|0 (the beginning of the string)|
|end|Optional|Position at which the search ends|The end of the string|
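
The `start` parameter also makes it possible to collect every occurrence of a substring by restarting the search just after the previous hit. The loop below is one possible sketch of this idea, not a built-in feature of `find()`.

```python
quote = 'Data Science is so cool, I love Data Science!'

positions = []
start = 0
while True:
    pos = quote.find('Data', start)  # search from 'start' to the end of the string
    if pos == -1:                    # no further occurrence
        break
    positions.append(pos)
    start = pos + 1                  # continue just after the last hit

print(positions)  # [0, 32]
```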

In [0]:
# Run this code
quote = "Data Science is so cool, I love Data Science!"

print("The new quote is: " + quote)

# Where in the text is the first occurrence of the substring "Data" when you only want to search between position 10 and 40?
result = quote.find("Data",10,40)

print("Substring 'Data' from position 10 to 40: ", result)

# 2. Project: Cleaning Column Names

In [0]:
# Import Pandas library
import pandas as pd
data = pd.read_csv('../../Data/avocado.csv')

If we take a look at the column names, we notice that they need some cleaning, such as removing the whitespaces. Some systems and data pipelines can have issues with these.

In [0]:
# Run this code
data_2015 = data[data['year'] == 2015].copy()  # copy, so the in-place rename below does not act on a slice of data
data_2015.columns

Let's use a lambda function and three of the methods which we have just learned - strip, lower and replace.

In [0]:
# Run this code
data_2015.rename(columns = lambda x: x.strip().lower().replace(' ','_'), inplace = True)
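
To see what the lambda does to a single name, the same three methods can be chained on a few column names. The names in `raw_columns` are made up for this sketch and are not necessarily those in the avocado file.

```python
# Hypothetical column names, for illustration only
raw_columns = [' Total Volume ', 'AveragePrice', 'Small Bags']

# strip() trims the ends, lower() lowercases, replace() swaps inner spaces for '_'
cleaned = [name.strip().lower().replace(' ', '_') for name in raw_columns]

print(cleaned)  # ['total_volume', 'averageprice', 'small_bags']
```

A name without any spaces, such as the second one, is only lowercased, which is why some names may still need manual attention.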

In [0]:
# Run this code
data_2015.head()

One column name is still ugly. Writing a dedicated function for a single column would not be worth the effort, so we fix it manually with a dictionary.
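
As a minimal sketch of the dictionary approach, here is `rename()` on a toy DataFrame. The frame `toy` and its column names are hypothetical; columns that do not appear in the dictionary are left unchanged.

```python
import pandas as pd

# Toy frame with made-up column names (not the avocado data)
toy = pd.DataFrame({'totalvolume': [1, 2], 'region': ['A', 'B']})

# Keys are the current names, values are the new names
toy = toy.rename(columns={'totalvolume': 'total_volume'})

print(list(toy.columns))  # ['total_volume', 'region']
```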

In [0]:
# BONUS TASK - Hints: use .rename() method and specify columns through dictionary, i.e. 'column_name_to_clean':'new_column_name'
#                   specify inplace = True

# 3. Cleaning Text Column (READ-ONLY)

Imagine we have 2 possible categories of avocado (A and B) stored in the same row for the same day, separated by '/'. 
This would be a problem if we wanted to explore and visualize the data by avocado category. 

We can use the [`str.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) method to resolve this issue in a few steps.

In [0]:
# Run this code - don't worry about what it does for now

data_avo = {'day':'Monday', 'category':'A/B', 'type':'organic'}
monday_data = pd.DataFrame(data_avo, range(10))

Let's now examine the specially altered dataset that we created. You will notice that the 'category' column contains the symbols A and B. These represent avocado categories, which means that in **every row we have stored 2 observations**. That is not good, and we need to split each row into 2 separate rows.

In [0]:
# Run this code
monday_data

First, we use the [`split()`](https://docs.python.org/3/library/stdtypes.html#str.split) method to create a list of two objects from the original element in the column.

In [0]:
# Firstly, split the 'category' column with separator '/'

monday_data['category'] = monday_data['category'].str.split('/')
monday_data

The next steps are:

- call the [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) function on `monday_data` with a lambda function `lambda x:` and `axis = 1`, so that it runs once per row; the lambda turns the list in 'category' into a `pd.Series`, which gives each category its own column
- after the [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) part, add [`stack()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html) to stack these columns into a single Series with one entry per category

In [0]:
# Run this code

series_2 = monday_data.apply(lambda x: pd.Series(x['category']), axis = 1).stack()

As you can see below, **categories are now separated into new rows**: each of the 10 Monday observations now has one entry for A and one for B. However, there is also a new index level (one value per category) that we don't need anymore.

In [0]:
# Run this code
series_2

We can remove this index using [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html): 
- use `drop = True`
- set `level = 1`

In [0]:
# Run this code
series_2 = monday_data.apply(lambda x: pd.Series(x['category']), axis = 1).stack().reset_index(level = 1, drop = True)

- give the Series (it will be a new column) a name 'avocado_category'

In [0]:
# Run this code
series_2.name = 'avocado_category' 

- drop the column 'category' from `monday_data` (this is the column that contains A/B), set `axis = 1`
- join `series_2` where we have separated categories

In [0]:
# Run this code
new_data = monday_data.drop('category', axis = 1).join(series_2)

In [0]:
# Run this code
new_data

If the procedure above seems lengthy to you, you are right. There is a shortcut:
the [``.explode()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) 
method transforms each element of a list into its own row.

In [0]:
# run this code
data_avo = {"day": "Monday", "category": "A/B", "type": "organic"}
monday_data = pd.DataFrame(data_avo, range(10))

monday_data

In [0]:
monday_data["category"] = monday_data["category"].str.split("/")
monday_data.explode("category")
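
One detail to keep in mind: `explode()` repeats the original index label for every new row. If a fresh 0..n-1 index is preferred, `reset_index(drop = True)` can be chained afterwards. The sketch below uses a small made-up frame called `toy`.

```python
import pandas as pd

# Toy frame, for illustration only
toy = pd.DataFrame({'day': ['Monday', 'Tuesday'], 'category': ['A/B', 'A']})
toy['category'] = toy['category'].str.split('/')

exploded = toy.explode('category')
print(list(exploded.index))  # [0, 0, 1]  (row 0 was split into two rows)

# Optional: replace the repeated labels with a fresh running index
tidy = exploded.reset_index(drop=True)
print(list(tidy.index))         # [0, 1, 2]
print(list(tidy['category']))   # ['A', 'B', 'A']
```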

# 4. Project: Cleaning Text Column

In [0]:
# Run the code
import numpy as np
data_1 = pd.read_csv('../../Data/movie_metadata.csv')
movie_data = data_1.iloc[:,np.r_[1:3, 8:13]].copy()  # copy, so later column assignments do not act on a slice of data_1

In [0]:
# Display first 5 rows of movie_data and look at the genres column
movie_data.head()

Now we split the movie genres in the same way; the only difference is the separator '|'.

In [0]:
# TASK 2.1 >>>> Split the 'genres' column with separator '|'
#               Assign the result to the 'genres' column


In [0]:
# Create a new Series for genres using a lambda function and apply it to movie_data

series_genres = movie_data.apply(lambda x: pd.Series(x['genres']), axis = 1).stack().reset_index(level = 1,drop = True)

In [0]:
# Print the new Series

print(series_genres)

In [0]:
# Give the Series (new column) the name 'genres'
series_genres.name = 'genres'

Now let's practice both ways of expanding the list items into separate rows.

In [0]:
# create copies to practice both ways
movie_join = movie_data.copy()
movie_explode = movie_data.copy()

In [0]:
# TASK 2.2 >>>> Drop the old column 'genres' from movie_join on axis = 1
#               Join to the new Series 'series_genres'. 
#               Assign it to movie_join.



In [0]:
# Run this code
movie_join.head()

In [0]:
# TASK 2.3 >>>> Use the explode method on movie_explode for the 'genres' column.
#               Assign it to movie_explode.


In [0]:
# Run this code
movie_explode.head()

# 5. Regular expressions

- provide a flexible way to search or match string patterns in text
- a single expression, commonly called a **regex**, is a string formed according to the regular expression language
- using built-in module `re` we can apply regular expressions to strings

Run the following cell to see an example of a regular expression for validating an email address \\(^{1}\\).

In [0]:
# Run this code
Image('../../Images/regex.PNG')

In [0]:
# Import re module
import re

Regex Methods

There is a set of methods that allows us to search a string for a match such as:

- [`findall`](https://docs.python.org/3/library/re.html#re.findall): returns a list that contains all matches
- [`match`](https://docs.python.org/3/library/re.html#re.match): if zero or more characters at the beginning of the string match the regular expression, returns a corresponding match object
- [`search`](https://docs.python.org/3/library/re.html#re.search): scans through the string looking for the first location where the regular expression produces a match and returns a corresponding match object
- [`split`](https://docs.python.org/3/library/re.html#re.split): breaks the string into pieces at each occurrence of the pattern
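
The four functions listed above can be compared side by side on one short string. The string `animals` and the pattern are made up for this sketch.

```python
import re

animals = 'cat bat rat'      # hypothetical example string
pattern = r'[a-z]at'         # any lowercase letter followed by 'at'

all_matches = re.findall(pattern, animals)    # every match, as a list of strings
at_start = re.match(pattern, animals)         # match object: 'cat' is at the beginning
not_at_start = re.match(r'bat', animals)      # None: 'bat' is not at the beginning
anywhere = re.search(r'bat', animals)         # match object: first 'bat' anywhere
pieces = re.split(r'\s', animals)             # cut at every whitespace character

print(all_matches)       # ['cat', 'bat', 'rat']
print(at_start.group())  # cat
print(not_at_start)      # None
print(anywhere.span())   # (4, 7)
print(pieces)            # ['cat', 'bat', 'rat']
```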

In [0]:
# Split string called 'sentence' by whitespaces 
sentence = 'This  sentence contains     whitespace'

To split this string we need to call [`re.split()`](https://docs.python.org/3/library/re.html#re.split). 

Within this method we specify the regex `\s+`, which describes one or more whitespace characters, and the string to split (in our case `sentence`).

Behind the scenes, the regex is first compiled, and then the [`split`](https://docs.python.org/3/library/re.html#re.split) function is applied to the string we passed in.

In [0]:
# Run this code
re.split(r'\s+', sentence)  # raw string, so that '\s' is passed to the regex engine unchanged

With [`re.compile()`](https://docs.python.org/3/library/re.html#re.compile) we can compile a regular expression pattern into a pattern object, which can then be used for pattern matching:
- this approach is recommended if you intend to apply the same expression to many strings
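
A minimal sketch of this reuse: the pattern is compiled once and then applied to several strings. The list `lines` is made-up example data.

```python
import re

# Compile once ...
whitespace = re.compile(r'\s+')

# ... then reuse the same pattern object for many strings
lines = ['a  b', 'c\td', ' e f ']   # hypothetical example data
tokens = [whitespace.split(line.strip()) for line in lines]

print(tokens)  # [['a', 'b'], ['c', 'd'], ['e', 'f']]
```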

In [0]:
# Run this code
our_regex = re.compile(r'\s+')

In [0]:
# Split string 'sentence' using regex object 'our_regex'
our_regex.split(sentence)

In [0]:
# Get the list of all substrings that match the regex using the findall() method
our_regex.findall(sentence)

In [0]:
# Create a regex object that matches the character 'e'
another_regex = re.compile('e')

In [0]:
# Run the code
sentence_2 = 'Learning RegEx is fun'

In [0]:
# Return the list that contains all matches in the string 'sentence_2'
another_regex.findall(sentence_2)

As you can see, the regex object performed case-sensitive matching and matched lowercase letters only. 

We can also define a case insensitive regex object during the pattern compile using `flags = re.IGNORECASE`

In [0]:
# Create a regex object that is not case sensitive using re.IGNORECASE
regex_insensitive = re.compile('e', flags = re.IGNORECASE)

In [0]:
# Run this code
regex_insensitive.findall(sentence_2)

In [0]:
text = 'Regex, Regex pattern, Expressions'

# Create a regex object with the match pattern 's'
pattern = re.compile('s')

In [0]:
# Check for a match anywhere in the string using .search()

pattern.search(text)

As you can see, [`search`](https://docs.python.org/3/library/re.html#re.search) returns a match object for the first match only; it shows the start and end position of the match together with the matched text.

In [0]:
# Check for a match only at the beginning of the string using .match()

pattern.match(text)
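
The objects returned by the two cells above can be inspected further. The sketch below repeats the same text and pattern and uses the match object methods `group()`, `start()`, `end()` and `span()`. Since `match()` returns `None` when the string does not start with the pattern, the result should be checked before it is used.

```python
import re

text = 'Regex, Regex pattern, Expressions'
pattern = re.compile('s')

found = pattern.search(text)        # first 's' anywhere in the string
print(found.group())                # s
print(found.start(), found.end())   # 27 28
print(found.span())                 # (27, 28)

at_start = pattern.match(text)      # the string starts with 'R', not 's'
if at_start is None:
    print('No match at the beginning of the string')
```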

In [0]:
# Run this line of code

email = 'Email addresses of our two new employees are first.example@gmail.com and second_example@gmail.com'

In [0]:
# Write a regex to match email addresses

email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

In [0]:
# Create a regex object that matches email addresses and make it case-insensitive

rege = re.compile(email_pattern, flags = re.IGNORECASE)

In [0]:
# Get list of email addresses from 'email' string

rege.findall(email)

In [0]:
# Search for the position of the first email address in the string 'email'

rege.search(email)

In [0]:
text = "The average price of the avocados was $1.35 last year, hopefully, this year the price don't exceed $1.50 for a piece!"

In [0]:
# TASK 3 >>>> Search the web for a regex pattern that matches decimal numbers and assign it to the variable decimal_number

In [0]:
# Regex object that matches decimal numbers - won't work if TASK 3 is not completed

pattern_dec = re.compile(decimal_number)

In [0]:
# Run this code - won't work if TASK 3 is not completed

pattern_dec.findall(text)

You can find many Regular Expressions Cheat Sheets on the web, like [this one](https://cheatography.com/mutanclan/cheat-sheets/python-regular-expression-regex/).

**Hint**

If we want to find a pattern (decimal numbers, for example) within the strings of a Series, we can also use the pandas method [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html), which reports for each element whether the pattern was found. For more information, check the pandas documentation linked above.
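
A minimal sketch of this hint, using a made-up Series called `notes` and the simple pattern `\d` (any digit) rather than a full decimal-number pattern:

```python
import pandas as pd

# Hypothetical example data
notes = pd.Series(['price 1.35', 'no numbers here', 'about 2 pieces'])

# True where the string contains at least one digit
has_digit = notes.str.contains(r'\d', regex=True)
print(list(has_digit))          # [True, False, True]

# The Boolean result can be used to filter the Series
print(list(notes[has_digit]))   # ['price 1.35', 'about 2 pieces']
```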

# Appendix

Data Source 1: https://www.kaggle.com/neuromusic/avocado-prices

License: Database: Open Database, Contents: © Original Authors


Data Source 2: https://www.kaggle.com/orgesleka/imdbmovies

License: CC0: Public Domain

# References

\\(^{1}\\) BreatheCode. 2017. Regex Tutorial. [ONLINE] Available at: https://content.breatheco.de/en/lesson/regex-tutorial-regular-expression-examples. [Accessed 14 September 2020].

pandas. pandas.Series.str.contains. [ONLINE] Available at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html. [Accessed 14 September 2020].

Material adapted for RBI internal purposes with full permissions from original authors.