# Lab 3. Strings and Text Data

### The ability to process textual information is a core competence of any data scientist or business analyst. Humans communicate mostly through spoken and written language (e.g. messages posted on social media, reviews on e-commerce platforms).

### In this lab we will learn the following:

1. Defining string variables
2. String methods
3. String formatting
4. Regular expressions

### 0. Let's import the required libraries

In [1]:
import re

In [2]:
import pandas as pd

In [3]:
import warnings
warnings.filterwarnings("ignore")

## 1. Basic text manipulation with Python

#### In Python, text is stored in string variables.
#### A string is simply a sequence of characters enclosed in a matching pair of single or double quotes. For example:

In [4]:
word = 'banana'
sentence = 'We are in the year 2021'

In [5]:
type(word)

str

In [6]:
type(sentence)

str
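
As a small sketch of the quoting rule described above, single and double quotes produce the same kind of string. The variable names `single`, `double` and `apostrophe` are made up for this illustration.

```python
# Single and double quotes create the same kind of object.
single = 'banana'
double = "banana"
print(single == double)   # True

# Using the other quote style avoids escaping a quote inside the text.
apostrophe = "it's a banana"
print(apostrophe)

# len() returns the number of characters in the string.
print(len(single))        # 6
```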

### 1.1. Subsetting and Slicing Strings

In [7]:
word[0]

'b'

In [8]:
sentence[6]

' '

In [9]:
word[0:3]

'ban'

In [10]:
word[-1]

'a'

In [11]:
word[-4:]

'nana'

In [12]:
sentence[0:5]

'We ar'

In [13]:
sentence[10:]

'the year 2021'

In [14]:
sentence[14:18]

'year'
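
The cells above rely on a few slicing rules that are easy to miss: a slice `s[start:stop]` includes `start` and excludes `stop`, an omitted bound defaults to the beginning or end of the string, and an optional third value sets the step. A minimal sketch using the same `word`:

```python
word = 'banana'

# start is included, stop is excluded
print(word[0:3])    # 'ban'
# an omitted start defaults to 0
print(word[:3])     # 'ban'
# an omitted stop defaults to the end of the string
print(word[3:])     # 'ana'
# a step of 2 takes every second character
print(word[::2])    # 'bnn'
# a step of -1 walks the string backwards
print(word[::-1])   # 'ananab'
```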

### 1.2. String methods with Python

In [15]:
sentence='this is a really long string of words'

In [16]:
type(sentence)

str

In [17]:
# We capitalize the sentence
sentence.capitalize()

'This is a really long string of words'

In [18]:
# count the number of "a" in the sentence
sentence.count('a')

2

In [19]:
# count the number of "i" in the sentence
sentence.count('i')

3

In [20]:
sentence.upper()

'THIS IS A REALLY LONG STRING OF WORDS'

In [21]:
sentence.startswith('this')

True

In [22]:
sentence.replace('long', 'short')

'this is a really short string of words'
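
Note that the methods above return a new string and leave `sentence` unchanged, because Python strings are immutable. The sketch below shows this and, as an addition not covered in the cells above, the related `split()` and `join()` methods; the names `shorter` and `tokens` are chosen only for this example.

```python
sentence = 'this is a really long string of words'

# replace() returns a new string; the original is left untouched.
shorter = sentence.replace('long', 'short')
print(sentence)
print(shorter)

# split() breaks a string on whitespace into a list of words.
tokens = sentence.split()
print(tokens[:3])             # ['this', 'is', 'a']

# join() glues a list of strings back together with a separator.
print('-'.join(tokens[:3]))   # 'this-is-a'
```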

In [23]:
anotherSentence='This is yet another sentence'

In [24]:
sentence+anotherSentence

'this is a really long string of wordsThis is yet another sentence'

In [25]:
sentence+'. '+anotherSentence

'this is a really long string of words. This is yet another sentence'
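
The lab objectives list string formatting, which is often a cleaner alternative to joining pieces with `+`. The following is a minimal sketch, not the only way to do it; the variables `name`, `year` and `price` are invented for the illustration.

```python
name = 'banana'
year = 2021
price = 1.5

# f-string: expressions inside braces are evaluated and inserted.
message = f'We bought a {name} in {year}'
print(message)

# A format specifier after the colon controls the number of decimals.
print(f'Price: {price:.2f}')   # Price: 1.50

# str.format() fills positional placeholders in the same way.
print('{} costs {:.2f}'.format(name, price))
```

Unlike `+`, formatting converts numbers to text automatically, so `year` does not need to be wrapped in `str()`.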

## 2. Advanced text manipulation with RegEx

#### When the base Python string methods that search for patterns aren’t enough, you can resort to regular expressions. The extremely powerful regular expressions provide a (nontrivial) way to find and match patterns in strings. The downside is that after you finish writing a complex regular expression, it becomes difficult to figure out what the pattern does by looking at it. That is, the syntax is difficult to read.
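
To make the contrast concrete, the sketch below compares a plain string test with a regular expression. The example text and the date pattern are assumptions made up for this illustration.

```python
import re

text = 'Order 66 shipped on 2021-03-15'

# A string method can only look for a fixed piece of text.
print('2021' in text)    # True

# A regular expression describes a shape instead:
# four digits, a dash, two digits, a dash, two digits.
date_pattern = r'\d{4}-\d{2}-\d{2}'
match = re.search(date_pattern, text)
print(match.group())     # '2021-03-15'
```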

#### For many data tasks, such as matching a telephone number or validating an address field, it is often easier to Google the type of pattern you are trying to match and adapt what someone has already written. You can use https://regex101.com/ to test and debug regular expressions; it is a useful reference for trying patterns against test strings. It even has a Python mode, so you can copy a pattern from the site directly into your own Python code.
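
As a sketch of the kind of validation mentioned above, the code below checks strings against one assumed phone format (three digits, three digits, four digits, separated by dashes). The format and the test values are hypothetical; real phone numbers vary widely by country.

```python
import re

# Hypothetical format for illustration only, e.g. 555-123-4567.
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

candidates = ['555-123-4567', '5551234567', '555-123-456a']

# fullmatch() succeeds only if the whole string fits the pattern.
results = [phone_pattern.fullmatch(c) is not None for c in candidates]

for candidate, is_valid in zip(candidates, results):
    print(candidate, is_valid)
```

`re.fullmatch` is used rather than `re.search` so that extra characters before or after the number cause the check to fail.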



In [26]:
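## This regex pattern matches one or more consecutive digits.
## The r prefix makes it a raw string, so Python does not treat \d as a string escape.
pattern_digits = r'\d+'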

In [27]:
names="13 Jodie Whittaker, John Hurn 12, Peter Capaldi 11, Matt Smith 10, David Tennant 9"

In [28]:
names

'13 Jodie Whittaker, John Hurn 12, Peter Capaldi 11, Matt Smith 10, David Tennant 9'

In [29]:
numbers = re.findall(pattern=pattern_digits, string=names)

In [30]:
numbers

['13', '12', '11', '10', '9']
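
The `+` in the pattern matters: without it, each digit is returned separately. The sketch below shows the difference on a shortened copy of the string, and also converts the matches to integers, since `re.findall` always returns strings. The names `single_digits`, `digit_runs` and `as_integers` are chosen for this example.

```python
import re

names = "13 Jodie Whittaker, John Hurn 12"

# \d matches exactly one digit at a time.
single_digits = re.findall(r'\d', names)
print(single_digits)    # ['1', '3', '1', '2']

# \d+ matches a run of one or more digits.
digit_runs = re.findall(r'\d+', names)
print(digit_runs)       # ['13', '12']

# findall returns strings; convert them if numbers are needed.
as_integers = [int(n) for n in digit_runs]
print(as_integers)      # [13, 12]
```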

In [31]:
## This regex pattern matches runs of one or more non-digit characters
## (letters, spaces and punctuation), so the results include commas and spaces
pattern_words = r'\D+'

In [32]:
words = re.findall(pattern=pattern_words, string=names)

In [33]:
words

[' Jodie Whittaker, John Hurn ',
 ', Peter Capaldi ',
 ', Matt Smith ',
 ', David Tennant ']

In [34]:
## This regex pattern matches any whitespace-delimited word that contains the letter u
pattern_u=r'\S*u+\S*'

In [35]:
words_containing_u = re.findall(pattern=pattern_u, string=names)

In [36]:
words_containing_u

['Hurn']

In [37]:
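## This regex pattern matches any word that contains "Jo"; in this string both matches also start with it
pattern_Jo=r'\S*Jo+\S*'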

In [38]:
words_starting_with_Jo = re.findall(pattern=pattern_Jo, string=names)

In [39]:
words_starting_with_Jo

['Jodie', 'John']
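
The pattern above finds words that contain "Jo" anywhere, which happens to coincide with "starts with Jo" for this string. A minimal sketch of the difference, using a word boundary `\b` to anchor the match at the start of a word; the name "MaryJo Smith" is made up for the illustration.

```python
import re

# Hypothetical example string; "MaryJo" contains "Jo" but does not start with it.
example = "Jodie Whittaker, John Hurn, MaryJo Smith"

# Matches any word containing "Jo".
containing_jo = re.findall(r'\S*Jo+\S*', example)
print(containing_jo)      # ['Jodie', 'John', 'MaryJo']

# \b requires a word boundary right before "Jo".
starting_with_jo = re.findall(r'\bJo\S*', example)
print(starting_with_jo)   # ['Jodie', 'John']
```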

## 3. Basic text manipulation with Pandas
### Pandas offers basic string manipulation capabilities

In [40]:
hello_words= pd.Series(['hola', 'hello', 'nǐ hǎo','hallo','bonjour'], dtype="string")

In [41]:
hello_words

0       hola
1      hello
2     nǐ hǎo
3      hallo
4    bonjour
dtype: string

In [42]:
## str.count returns how many times "a" occurs in each string (a number, not True/False)
contains_a=hello_words.str.count("a")

In [43]:
contains_a

0    1
1    0
2    0
3    1
4    0
dtype: Int64

In [44]:
contains_l=hello_words.str.count("l")

In [45]:
contains_l

0    1
1    2
2    0
3    2
4    0
dtype: Int64

In [46]:
boolean_hallo=hello_words.str.match("hallo")

In [47]:
boolean_hallo

0    False
1    False
2    False
3     True
4    False
dtype: boolean
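
`str.match` only requires the pattern to match at the start of each string, which is easy to confuse with an exact comparison. The sketch below contrasts it with `str.fullmatch` and `str.contains` on a made-up Series (`greetings` is not part of the lab data, and `str.fullmatch` assumes pandas 1.1 or later).

```python
import pandas as pd

greetings = pd.Series(['hola', 'hello', 'hallo', 'hallo there', 'say hallo'])

# match() anchors the pattern at the start of each string.
starts = greetings.str.match('hallo').tolist()
print(starts)     # [False, False, True, True, False]

# fullmatch() requires the entire string to match.
exact = greetings.str.fullmatch('hallo').tolist()
print(exact)      # [False, False, True, False, False]

# contains() accepts a match anywhere in the string.
anywhere = greetings.str.contains('hallo').tolist()
print(anywhere)   # [False, False, True, True, True]
```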

In [48]:
hello_words.str.upper()

0       HOLA
1      HELLO
2     NǏ HǍO
3      HALLO
4    BONJOUR
dtype: string

In [49]:
hello_words.str.replace('h', 'j')

0       jola
1      jello
2     nǐ jǎo
3      jallo
4    bonjour
dtype: string

In [50]:
hello_words.str[0:3]

0    hol
1    hel
2    nǐ 
3    hal
4    bon
dtype: string

In [51]:
hello_words.str[0:4]

0    hola
1    hell
2    nǐ h
3    hall
4    bonj
dtype: string

## 4. Advanced text manipulation with Pandas
### More involved textual analysis oftentimes requires RegEx

### 4.1. Finding text in Pandas DataFrames using regex

In [52]:
numbers=pd.Series(['1', '2', '3a', '3b', '03c', '4dx','text'])

In [53]:
numbers

0       1
1       2
2      3a
3      3b
4     03c
5     4dx
6    text
dtype: object

In [54]:
## numbers is a pandas series object
type(numbers)

pandas.core.series.Series

In [55]:
## This regex pattern looks for a digit immediately followed by a lowercase letter, anywhere in the string
pattern = r'[0-9][a-z]'

In [56]:
numbers.str.contains(pattern)

0    False
1    False
2     True
3     True
4     True
5     True
6    False
dtype: bool
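
Because `str.contains` looks for the pattern anywhere in the string, `'4dx'` is reported as a match even though it ends in two letters. If the intent were "digits followed by exactly one letter", the pattern could be anchored with `^` and `$`. This is a sketch under that assumed intent; `strict_pattern` is a name introduced here.

```python
import pandas as pd

numbers = pd.Series(['1', '2', '3a', '3b', '03c', '4dx', 'text'])

# ^ anchors the start, $ anchors the end:
# one or more digits, then exactly one lowercase letter.
strict_pattern = r'^[0-9]+[a-z]$'
strict_mask = numbers.str.contains(strict_pattern)
print(strict_mask.tolist())
# [False, False, True, True, True, False, False]
```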

In [57]:
## True where the string contains at least one digit
numbers.str.contains(r'\d', regex=True)

0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

In [58]:
## True where the string contains at least one non-digit character
numbers.str.contains(r'\D', regex=True)

0    False
1    False
2     True
3     True
4     True
5     True
6     True
dtype: bool

In [59]:
long_strings=pd.Series(['This is a really long string','13 Jodie Whittaker', 'John Hurn 12','David Lopez 34','this is a sentence'])

In [60]:
long_strings

0    This is a really long string
1              13 Jodie Whittaker
2                    John Hurn 12
3                  David Lopez 34
4              this is a sentence
dtype: object

In [61]:
long_strings.str.contains(r'\d', regex=True)

0    False
1     True
2     True
3     True
4    False
dtype: bool

### 4.2 Filtering text in Pandas DataFrames using RegEx

#### This builds on the previous step: this time we store the boolean result as a filter and use it to select the matching rows.

In [62]:
long_strings=pd.Series(['This is a really long string','13 Jodie Whittaker', 'John Hurn 12','David Lopez 34','this is a sentence with no digits in it'])

In [63]:
long_strings

0               This is a really long string
1                         13 Jodie Whittaker
2                               John Hurn 12
3                             David Lopez 34
4    this is a sentence with no digits in it
dtype: object

In [64]:
digit_filter=long_strings.str.contains(r'\d', regex=True)

In [65]:
long_strings[digit_filter]

1    13 Jodie Whittaker
2          John Hurn 12
3        David Lopez 34
dtype: object
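
Two variations on this filter are often useful: inverting it with `~` to keep the rows that do not match, and passing `na=False` so that missing values count as non-matches. The sketch below uses a shortened Series with one missing value added for illustration.

```python
import pandas as pd

# Shortened example data; the None entry is added to show missing-value handling.
long_strings = pd.Series(['13 Jodie Whittaker', 'John Hurn 12', None, 'this is a sentence'])

# na=False treats missing values as "no match", so the mask is purely True/False.
digit_filter = long_strings.str.contains(r'\d', regex=True, na=False)
print(digit_filter.tolist())   # [True, True, False, False]

# Rows that contain a digit.
with_digits = long_strings[digit_filter]
print(with_digits.tolist())    # ['13 Jodie Whittaker', 'John Hurn 12']

# ~ inverts the mask: rows without a digit (including the missing value).
without_digits = long_strings[~digit_filter]
print(len(without_digits))     # 2
```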

In [87]:
twitter_users=pd.Series(['this is a user called @dlpopez','@foouser is from London','@obama is the former president of the USA','@merkel is stepping down','regular user','anotheruser is not interested','@trump is considering his options'])

In [88]:
twitter_users

0               this is a user called @dlpopez
1                      @foouser is from London
2    @obama is the former president of the USA
3                     @merkel is stepping down
4                                 regular user
5                anotheruser is not interested
6            @trump is considering his options
dtype: object

#### 4.2.1 Filter definition

In [89]:

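## No capture group is needed here: str.contains only tests for a match,
## and pandas warns when the pattern contains groups
twitter_filter=twitter_users.str.contains(r'@[A-Za-z0-9_]+', regex=True)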

#### 4.2.2 Filter application

In [90]:
twitter_users[twitter_filter]

0               this is a user called @dlpopez
1                      @foouser is from London
2    @obama is the former president of the USA
3                     @merkel is stepping down
6            @trump is considering his options
dtype: object

### 4.3 Extracting text in Pandas DataFrames using RegEx

In [97]:
pd.options.display.max_colwidth = 100
twitterConversations=pd.Series(['Hi there, this is a twitter conversation about #climatechange','I totally agree, climate change is an important issue','the President @obama is in #nigeria','We must praise what @merkel has done for #Germany','the stockmarket is down due to #covid19','this is a twitter conversation involving #hashtag1 and #hashtag2'])

In [98]:
twitterConversations

0       Hi there, this is a twitter conversation about #climatechange
1               I totally agree, climate change is an important issue
2                                 the President @obama is in #nigeria
3                   We must praise what @merkel has done for #Germany
4                             the stockmarket is down due to #covid19
5    this is a twitter conversation involving #hashtag1 and #hashtag2
dtype: object

In [99]:
twitterConversations.str.extract(r'#([A-Za-z0-9_]+)', expand=False)

0    climatechange
1              NaN
2          nigeria
3          Germany
4          covid19
5         hashtag1
dtype: object
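
Note that `str.extract` keeps only the first match per row, which is why the second hashtag in the last conversation is missing. One way to capture every match is `str.extractall`, sketched below on a shortened, made-up Series (`tweets` is not part of the lab data).

```python
import pandas as pd

tweets = pd.Series(['about #climatechange', 'no tags here', 'both #hashtag1 and #hashtag2'])

# extract() keeps only the first match in each string.
first_tag = tweets.str.extract(r'#([A-Za-z0-9_]+)', expand=False)
print(first_tag.tolist())

# extractall() returns one row per match, indexed by
# (original row, match number within that row).
all_tags = tweets.str.extractall(r'#([A-Za-z0-9_]+)')
print(all_tags)
```

Rows without any match, such as the second one here, simply do not appear in the `extractall` result.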

In [100]:
twitterConversations.str.findall(r'#.*?(?=\s|$)')

0          [#climatechange]
1                        []
2                [#nigeria]
3                [#Germany]
4                [#covid19]
5    [#hashtag1, #hashtag2]
dtype: object
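
A common next step is to count how often each hashtag appears. The sketch below is one possible approach, assuming a simpler `#\w+` pattern and a small made-up Series; `explode()` turns each list element into its own row so that `value_counts()` can tally them.

```python
import re
import pandas as pd

tweets = pd.Series(['about #climatechange', 'no tags here', '#hashtag1 and #climatechange'])

# findall gives a list per row; explode flattens the lists into rows.
# Rows with an empty list become missing values, which dropna removes.
tags = tweets.str.findall(r'#\w+').explode().dropna()
print(tags.tolist())   # ['#climatechange', '#hashtag1', '#climatechange']

tag_counts = tags.value_counts()
print(tag_counts)

# The two patterns differ when punctuation follows a hashtag:
# the lookahead version stops only at whitespace, \w+ stops at the comma.
print(re.findall(r'#.*?(?=\s|$)', 'I like #python, really'))   # ['#python,']
print(re.findall(r'#\w+', 'I like #python, really'))           # ['#python']
```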