### Strings 

Strings in Python are immutable.
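A minimal sketch of what immutability means in practice (the variable names are illustrative only): item assignment fails, and string methods return a new string instead of changing the original.

```python
text = 'python'

# Item assignment is not allowed on a str object.
try:
    text[0] = 'P'
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# String methods return a new string; the original stays unchanged.
capitalized = text.capitalize()
print(text, capitalized)
```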

### Whitespace Characters
- " " - space
- "\t" - tab
- "\n" - new line

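### String Methods

- **str.upper( )** - converts all characters to uppercase
- **str.lower( )** - converts all characters to lowercase
- **str.capitalize( )** - uppercases the first character and lowercases the rest
- **str.title( )** - uppercases the first letter of every word and lowercases the rest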


- **str.format( )**  - string formatting
- **%s %d %f** - printf-style formatting with the % operator
- **f'{ }{ }'** - f-strings (formatted string literals)
 

- **str.index( )** - returns the index of the first occurrence of a substring (searching from the left); raises ValueError if it is not found
- **str.rindex( )**  - same as index( ), but returns the index of the last occurrence
- **str.find( )**  - same as index( ), but returns -1 instead of raising an error when the substring is not found
- **str.rfind( )**  - same as find( ), but returns the index of the last occurrence
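The practical difference between the two families shows up when the substring is missing. A short sketch with an illustrative sentence:

```python
text = 'I like doing what I like'

first = text.find('like')     # index of the first occurrence
last = text.rfind('like')     # index of the last occurrence
missing = text.find('hate')   # find() reports "not found" as -1
print(first, last, missing)

# index() raises ValueError in the same situation.
try:
    text.index('hate')
    index_raised = False
except ValueError:
    index_raised = True
print('index() raised ValueError:', index_raised)
```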


- **len( str )** - returns string length
- **str.split( delimiter )**  - splits a string into a list on the provided delimiter
- **delimiter.join( list_of_strings )**  - joins the strings into one string, placing the delimiter between them
- **str.replace( old, new )**  - replaces every occurrence of old with new
- **str.strip( chars )** - removes the given characters from both ends of the string (whitespace by default)


- **isdigit( )** - checks if a string consists only of digits
- **isalpha( )** - checks if a string consists only of letters
- **isalnum( )** - checks if a string consists only of letters and/or digits
- **islower( )** - checks if all cased characters in a string are lowercase
- **isupper( )** - checks if all cased characters in a string are uppercase
- **istitle( )** - checks if a string is title-cased
- **isspace( )** - checks if a string consists only of whitespace characters
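A few edge cases of these checks are easy to get wrong, so here is a small sketch. The sample strings are made up for illustration. Note that every check returns False for an empty string.

```python
samples = ['2024', '12.5', 'abc123', 'abc 123', ' \t\n', '']

for s in samples:
    # repr() makes whitespace and the empty string visible in the output
    print(repr(s), s.isdigit(), s.isalpha(), s.isalnum(), s.isspace())

decimal_is_digit = '12.5'.isdigit()      # False: '.' is not a digit
spaced_is_alnum = 'abc 123'.isalnum()    # False: the space is neither letter nor digit
empty_is_alpha = ''.isalpha()            # False: empty strings fail every check
```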


- **endswith( suffix )** - checks if a string ends with the given suffix
- **startswith( prefix )** - checks if a string starts with the given prefix

### Tricks
- find only unique characters: ''.join(set(str)) (the order of the characters is not preserved)
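Because a set has no defined order, the result of the set-based trick can come back in any order. If the order matters, two common alternatives are sketched below; the sample word is arbitrary.

```python
word = 'mississippi'

# dict keys keep insertion order, so this keeps the first occurrence of each character.
unique_in_order = ''.join(dict.fromkeys(word))

# Sorting the set gives a deterministic, alphabetical result.
unique_sorted = ''.join(sorted(set(word)))

print(unique_in_order, unique_sorted)
```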

### Regular Expressions
Regular expressions make life much easier when preprocessing strings (searching, extraction, substitution).

They are used for searching and substitution in strings. First, we need to import the **re** module.

In [29]:
import re

### Main Methods
- **re.match( pattern, string )** - looks for the pattern at the beginning of a string. Returns a **match object** if the pattern is found, otherwise None


- **match_obj.group()** - returns the matched text
- **match_obj.start()** - returns the start index of the match
- **match_obj.end()** - returns the end index of the match (exclusive)

- **re.search( pattern, string )** - scans the whole string for the pattern, but returns only the first match. Returns a **match object** (or None if nothing is found)
- **re.findall( pattern, string )** - scans the whole string and returns all non-overlapping matches. **Returns a list**
- **re.split( pattern, string )** - splits a string on every match of the pattern. **maxsplit** limits the number of splits
- **re.sub( pattern, repl, string )** - replaces every match of the pattern in the string with repl
- **re.compile( pattern )** - creates a regular expression object that can be reused for searching

re_res - the result returned by any of the re methods above
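The difference between matching at the start, anywhere, and over the whole string can be sketched as follows. `re.fullmatch` is not in the list above; it is a standard `re` function added here for comparison. Since these functions return None when nothing matches, the result should be checked before calling `.group()`.

```python
import re

text = 'Python is not tough'

at_start = re.match(r'is', text)        # None: 'is' is not at the beginning
anywhere = re.search(r'is', text)       # match object for the first 'is'
whole = re.fullmatch(r'[\w ]+', text)   # matches only if the entire string fits

for name, res in [('match', at_start), ('search', anywhere), ('fullmatch', whole)]:
    if res is None:
        print(name, '-> no match')
    else:
        print(name, '->', repr(res.group()), res.start(), res.end())
```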

### Special Symbols for Regular Expressions
Resources: https://regex101.com/ (playground); https://www.debuggex.com/#cheatsheet (debugger)
![reg_ex_cheet.png](attachment:reg_ex_cheet.png)
![reg_exp.png](attachment:reg_exp.png)
![reg_exp_2.png](attachment:reg_exp_2.png)
![look_around.png](attachment:look_around.png)

### re.match( )

In [322]:
res = re.match(r'CV','CV stands for Cross Validation') # the r prefix marks the pattern as a raw string
print('Found Pattern: ', res.group())
print('Start index: ', res.start())
print('End index: ', res.end())

Found Pattern:  CV
Start index:  0
End index:  2


### re.search( )

In [49]:
res = re.search(r'like','I like doing what I like')
res.group() # only the first match is returned

'like'

### re.findall( )

In [52]:
res = re.findall(r'like','I like doing what I like')
res

['like', 'like']
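`re.findall` returns only the matched text. When the positions are needed as well, `re.finditer` (a standard `re` function not used elsewhere in this notebook) yields a match object for each occurrence. A short sketch:

```python
import re

text = 'I like doing what I like'

# Each item is (matched text, start index, end index).
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'like', text)]
print(positions)
```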

### re.split( )

In [55]:
res = re.split(r'o', 'Python is not tough',maxsplit=1)
res

['Pyth', 'n is not tough']

### re.sub( )

In [63]:
res = re.sub(r't','$','K2 is the highest mountain in Pakistan')
res

'K2 is $he highes$ moun$ain in Pakis$an'
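Two further options of `re.sub` are often useful: the `count` argument limits the number of replacements, and backreferences such as `\1` reuse captured groups in the replacement. A sketch, with a date string made up for illustration:

```python
import re

sentence = 'K2 is the highest mountain in Pakistan'

# Replace only the first 't'.
first_only = re.sub(r't', '$', sentence, count=1)

# Reorder the captured groups: DD-MM-YYYY becomes YYYY-MM-DD.
reordered = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\2-\1', '12-05-2007')

print(first_only)
print(reordered)
```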

### re.compile( )

In [64]:
pattern = re.compile('K2')
sentence = 'K2 is the highest mountain in Pakistan'
pattern.findall(sentence)

['K2']
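A compiled pattern pays off when the same expression is applied to many strings, and `re.compile` also accepts flags such as `re.IGNORECASE`. A minimal sketch with made-up sentences:

```python
import re

# Compile once, reuse for every sentence; the flag makes the search case-insensitive.
pattern = re.compile(r'python', flags=re.IGNORECASE)

sentences = ['Python is not tough', 'I like PYTHON', 'K2 is a mountain']
mentions = [bool(pattern.search(s)) for s in sentences]
print(mentions)
```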

### Regular Expressions Exercises

In [78]:
# Extract first word of a string
sentence = 'AV is largest Analytics community of India'
res = re.findall(r'^\w+',sentence)
print('First word of the sentence: ',res)
# Extract last word of a string
res = re.findall(r'\w+$',sentence)
print('Last word of a sentence: ',res)

First word of the sentence:  ['AV']
Last word of a sentence:  ['India']


In [118]:
# Return first two symbols of every word in the sentence
res = re.findall(r'\b\w\w',sentence)
print(res)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']


In [137]:
# Return domain names from emails
mails = 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz'

res_1 = re.findall(r'@\w+\.(\w+)',mails)  # the dot is escaped so that it matches a literal '.'
res_2 = re.findall(r'@\w+\.\w+',mails)
print('First Option: ',res_1)
print('Second Option: ',res_2)

First Option:  ['com', 'in', 'com', 'biz']
Second Option:  ['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
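In a pattern, an unescaped `.` matches any character, so a literal dot in a domain is safer written as `\.`. The sketch below uses a deliberately malformed address (made up for illustration) to show the difference:

```python
import re

broken = 'someone@gmail-com'

# '.' matches any character, so the '-' is accepted as the separator.
loose = re.findall(r'@\w+.\w+', broken)

# '\.' matches only a literal dot, so the malformed address is rejected.
strict = re.findall(r'@\w+\.\w+', broken)

print(loose, strict)
```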


In [176]:
# Extract Date
dates = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'
res_1 = re.findall(r'\d{2}-\d{2}-\d{4}', dates)
res_2 = re.findall(r'\d{2}-(\d{2})-\d{4}', dates)
res_3 = re.findall(r'\d{2}-\d{2}-(\d{4})', dates)
print(f'Full Date: {res_1[0]}\nOnly Month: {res_2[0]}\nOnly Year: {res_3[0]}')

Full Date: 12-05-2007
Only Month: 05
Only Year: 2007
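Instead of running one pattern per field, named groups can pull all parts out in a single pass. This is a sketch under an assumption: the group names `day`, `month` and `year` are my own labels, and the example above only establishes that the middle field is the month.

```python
import re

dates = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# (?P<name>...) gives each captured group a name.
date_re = re.compile(r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})')

parsed = [m.groupdict() for m in date_re.finditer(dates)]
for item in parsed:
    print(item)
```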


In [197]:
# Extract words that start with a vowel and words that start with a consonant
text = 'AV is largest Analytics community of India'
res_1 = re.findall(r'\b[aeiouAEIOU]\w+', text)
res_2 = re.findall(r'\b[^aeiouAEIOU ]\w+', text)
print(f'Starts with vowels: {res_1}\nStarts with consonant: {res_2}')

Starts with vowels: ['AV', 'is', 'Analytics', 'of', 'India']
Starts with consonant: ['largest', 'community']


In [285]:
# Check phone numbers
phone_numbers = ['+7-(982)-444-12-23', '+7-(902)-837-10-21', '+49-178-2954524','+39-339-5696712','2342344534534']

def get_phone_number_info(phone):
    # Literal '+', '(' and ')' are escaped; a class like [+7]{2} would also accept '77' or '++'
    if re.fullmatch(r'\+7-\(9\d{2}\)-\d{3}-\d{2}-\d{2}',phone):
        print('From Russia')
        print('Phone number is valid')
    elif re.fullmatch(r'\+49-\d{3}-\d{7}',phone):
        print('From Germany')
        print('Phone number is valid')
    elif re.fullmatch(r'\+39-\d{3}-\d{7}',phone):
        print('From Italy')
        print('Phone number is valid')
    else:
        print('Unknown phone')

for phone in phone_numbers:
    get_phone_number_info(phone)

From Russia
Phone number is valid
From Russia
Phone number is valid
From Germany
Phone number is valid
From Italy
Phone number is valid
Unknown phone


In [295]:
# Split a string with several delimiters
line = 'asdf fjdk;afed,fjek,asdf,foo' # str.split() cannot be used here as it accepts only a single separator

clean_line = re.split(r'[\s;,]',line)
print(clean_line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
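When delimiters can appear next to each other (for example `'; '`), a single-character class produces empty strings in the result. Adding `+` treats a run of delimiters as one. A sketch with a made-up line:

```python
import re

messy_line = 'asdf fjdk; afed, fjek'

# One split per delimiter character: adjacent delimiters leave empty strings.
with_empties = re.split(r'[\s;,]', messy_line)

# '+' collapses a run of delimiters into a single split point.
collapsed = re.split(r'[\s;,]+', messy_line)

print(with_empties)
print(collapsed)
```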


In [315]:
# Extract info from HTML file
html_text = '1NoahEmma2LiamOlivia3MasonSophia4JacobIsabella5WilliamAva6EthanMia7MichaelEmily'
res_1 = re.findall(r'([A-Z][A-Za-z]+)([A-Z][A-Za-z]+)',html_text)
res_2 = re.findall(r'\d',html_text)
print('Only names: ',res_1[:5])
print('Only numbers: ', res_2)

Only names:  [('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava')]
Only numbers:  ['1', '2', '3', '4', '5', '6', '7']


### String Cleaning 

In [412]:
text = 'Mary45456456said****Vlad !!!was @ coming back home. Vlad knows the way home`````'
res = re.split(r'[\d*\s!`@]',text)
clean_txt = ' '.join(res)
clean_txt = clean_txt.split()
clean_txt = ' '.join(clean_txt)
print(f'Original Dirty Message: {text}\nClean Message: {clean_txt}')

Original Dirty Message: Mary45456456said****Vlad !!!was @ coming back home. Vlad knows the way home`````
Clean Message: Mary said Vlad was coming back home. Vlad knows the way home
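The same cleaning can be sketched in a single step with `re.sub`: every run of unwanted characters is replaced with one space, and `strip()` removes the space left at the end. This is an alternative to the split/join approach above and produces the same result for this input.

```python
import re

text = 'Mary45456456said****Vlad !!!was @ coming back home. Vlad knows the way home`````'

# Replace each run of digits, '*', whitespace, '!', backticks and '@' with a single space.
clean_txt = re.sub(r'[\d*\s!`@]+', ' ', text).strip()
print(clean_txt)
```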


### Upper(), Lower(), Title() , Capitalize()

In [1]:
name = 'machine learning course'
print(name.upper())
print(name.lower())
print(name.title())
print(name.capitalize())

MACHINE LEARNING COURSE
machine learning course
Machine Learning Course
Machine learning course


### String Concatenation

In [2]:
first_name = 'vladislav'
last_name = 'raskoshinskii'
full_name = first_name + ' ' + last_name
message = full_name.title() + ' is learning Python'
print(message)

Vladislav Raskoshinskii is learning Python


### Whitespace Characters

In [3]:
print("Hello world")
print("\tHello world")
print("Hello\nworld")

Hello world
	Hello world
Hello
world


### Strip

In [4]:
name = ' vladislav '

print('-' + name.lstrip() + '-')
print('-' + name.rstrip() + '-')
print('-' + name.strip() + '-')

-vladislav -
- vladislav-
-vladislav-


### Split

In [7]:
message = 'I would like to climb on mountain Everest as I like climbing'
print(message.split(' '))

['I', 'would', 'like', 'to', 'climb', 'on', 'mountain', 'Everest', 'as', 'I', 'like', 'climbing']


### Index and Find

In [10]:
print('Index: ', message.index('like'))
print('Rindex: ', message.rindex('like'))
print('Find: ', message.find('like'))
print('RFind: ', message.rfind('like'))

Index:  8
Rindex:  47
Find:  8
RFind:  47


### Join

In [12]:
full_sentence = ' '.join(['Everything', 'what', 'I', 'want', 'will', 'happen'])
print(full_sentence)

Everything what I want will happen
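`join` accepts only strings, so other values have to be converted first. It is also the inverse of `split`. A small sketch with illustrative data:

```python
numbers = [1, 2, 3]

# Convert each number to str before joining; joining ints directly raises TypeError.
joined = ', '.join(str(n) for n in numbers)

# split() followed by ' '.join() rebuilds a single-spaced sentence.
words = 'Everything what I want will happen'.split()
round_trip = ' '.join(words)

print(joined)
print(round_trip)
```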


### Replace and Strip

In [14]:
print('Replace: ', message.replace('I','We'))
print('Strip: ', ('*' + message + '&').strip('*&'))  # strip() removes the given characters from both ends

Replace:  We would like to climb on mountain Everest as We like climbing
Strip:  I would like to climb on mountain Everest as I like climbing


### Format, % and f-strings

In [21]:
data = {1:('Vlad',23,'Programmer'),
        2:('Max',23,'Sportsman'),
        3:('Dasha',22,'Mathematician')}
for (name,age,job) in data.values():
    print("{0}\nAge: {1}\nJob: {2}".format(name,age,job))

name = 'Vlad'
age = 24
grade = 1.0
print('Name: %s\nAge: %d\nGrade: %f' %(name,age,grade))  # %d for integers, %f for floats
print('\n')
print(f'Name: {name}\nAge: {age}\nGrade: {grade}')

Vlad
Age: 23
Job: Programmer
Max
Age: 23
Job: Sportsman
Dasha
Age: 22
Job: Mathematician
Name: Vlad
Age: 24
Grade: 1.000000


Name: Vlad
Age: 24
Grade: 1.0
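All three formatting styles accept a format specification that controls precision, width and padding. A short sketch (the values are illustrative):

```python
name, age, grade = 'Vlad', 24, 1.0

two_decimals = f'{grade:.2f}'          # fixed number of decimal places
one_decimal = '%.1f' % grade           # the same idea with the % operator
right_aligned = '{:>8}|'.format(name)  # pad on the left to a width of 8
zero_padded = f'{age:05d}'             # pad an integer with zeros to a width of 5

print(two_decimals, one_decimal, right_aligned, zero_padded)
```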


### isdigit( ), isalpha( ), isalnum( ) and so on....

In [22]:
print(message +"\n"+str(message.isdigit()))
print(message.isalpha())
print(message.isalnum())
print(message.islower())
print(message.isupper())
print(message.istitle())
print(message.isspace())

I would like to climb on mountain Everest as I like climbing
False
False
False
False
False
False
False


### Startswith( ) and Endswith( )

In [27]:
print('Starts with!' if message.startswith('I like') else "Doesn't start with!")
print('Ends with!' if message.endswith('climbing') else 'nope')

Doesn't start with!
Ends with!
