# _Python for Scientific Data Analysis_


#  Basic Python

## Section 6: Attributes; Finding, Replacing, Sorting, and Searching

Objects in Python typically have both attributes (other Python objects stored “inside”
the object) and methods (functions associated with an object that can have access to
the object’s internal data). Both of them are accessed via the syntax
obj.attribute_name.  
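
As a minimal illustration of the difference, a complex number carries both kinds: ``real`` and ``imag`` are attributes (stored data), while ``conjugate`` is a method and has to be called with parentheses. The variable name ``z`` is chosen only for this sketch.

```python
z = 3 + 4j

# Attributes: stored data, accessed without parentheses
print(z.real)   # 3.0
print(z.imag)   # 4.0

# Method: a function bound to the object, called with parentheses
print(z.conjugate())   # (3-4j)

# Without the parentheses you get the method object itself, not its result
print(callable(z.conjugate))   # True
```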

Say, for example, that you have a string variable called ``a`` with the value ``'foo'``:

In [2]:
a='foo'

To get all the attribute names type ``a.`` and then press ``tab`` twice on your keyboard.

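Here's a list of some of them that may appear if you do this from the Python prompt. The exact list depends on your Python version: for example, ``decode`` is listed for Python 2 strings but not for Python 3 strings.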

```
>>>a. #press tab twice
a.capitalize a.format a.isupper    a.rindex     a.strip
a.center     a.index   a.join      a.rjust      a.swapcase
a.count      a.isalnum a.ljust     a.rpartition a.title
a.decode     a.isalpha a.lower     a.rsplit     a.translate
a.encode     a.isdigit a.lstrip    a.rstrip     a.upper
a.endswith   a.islower a.partition a.split      a.zfill
a.expandtabs a.isspace a.replace   a.splitlines
a.find       a.istitle a.rfind     a.startswith

```

If you do this from this notebook, you get a pull-down menu showing the same attributes.

In [4]:
#type a. and then tab twice

#uncomment below
#a.
#a.
a.capitalize()

'Foo'

In [None]:
#another example ...

a=3.141569

In [None]:
#a. #pressing tab twice in a Jupyter notebook pulls up a list of possible attributes

Most of these are self-explanatory; others you can just look up.  But below we detail a few of them.
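
If tab completion is not available, one possible way to look things up is with the built-in ``dir`` function and the docstring attached to each method. This is only a sketch; the name ``public_names`` is made up for illustration, and the exact list you get depends on your Python version.

```python
a = 'foo'

# dir() returns every attribute name; filter out the names that start with an underscore
public_names = [name for name in dir(a) if not name.startswith('_')]
print(public_names[:5])

# Each method carries a docstring; help(a.capitalize) prints the same text in full
print(a.capitalize.__doc__)
```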

### Finding and Replacing

Python strings have several methods that let you find and replace substrings. First, ``find`` returns the position of the first match of a substring within a string. A simple example:

In [7]:
txt="All good things can wait a bit longer"

In [None]:
#txt. #type txt. and press tab to list the available string methods

In [8]:
x=txt.find("good")

y=txt.find("things")

print(x,y)

4 9


```
>>> txt="All good things can wait a bit longer"
>>> x=txt.find("good")
>>> x #4
>>> y=txt.find("things")
>>> y #9
```

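Here, the return value ``4`` is the index of the first character of the match, counting from zero. If the substring is not present, ``find`` returns ``-1``. The full call for ``find`` is ``string.find(value, start, end)``, where ``start`` and ``end`` optionally restrict the search to part of the string. Here are some examples: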

In [9]:

txt="All good things can wait a bit longer, good"
x=txt.find("good",0,len(txt))
y=txt.find("good",13,len(txt))
print(x)
print(y)


4
39


In [10]:
txt="All good things can be good if you wait a bit longer, good"
x=txt.find("good",11,len(txt))
print(x)
#23
x=txt.find("good",41,len(txt))
print(x)
#54


23
54
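
The ``start`` argument can also be used to step through a string and collect every occurrence rather than just the first. The helper ``find_all`` below is hypothetical (it is not a built-in), and it is only a minimal sketch that counts non-overlapping matches.

```python
def find_all(text, sub):
    """Return the starting index of every non-overlapping occurrence of sub in text."""
    positions = []
    start = 0
    while True:
        idx = text.find(sub, start)
        if idx == -1:               # find signals "no match" with -1
            break
        positions.append(idx)
        start = idx + max(len(sub), 1)   # resume the search just past this match
    return positions

txt = "All good things can be good if you wait a bit longer, good"
print(find_all(txt, "good"))   # [4, 23, 54]
print(txt.find("bad"))         # -1
```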


The ``replace`` method replaces every occurrence of a substring with another substring and returns the result as a new string.  E.g. 

In [11]:

c='one duck walks into a bar'
print(c.replace('duck','cat'))
#'one cat walks into a bar'



one cat walks into a bar
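
Two details are worth seeing in action: ``replace`` accepts an optional third argument that limits the number of replacements, and it never modifies the original string. The longer sentence and the names ``d`` and ``e`` below are made up for this sketch.

```python
c = 'one duck walks into a bar with another duck'

d = c.replace('duck', 'cat')      # replaces every occurrence
e = c.replace('duck', 'cat', 1)   # third argument: replace at most one occurrence

print(d)   # one cat walks into a bar with another cat
print(e)   # one cat walks into a bar with another duck
print(c)   # the original string is unchanged
```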


### Sorting

Python lists also provide methods for reordering their elements, including ``reverse`` and ``sort``.

The ``reverse`` method reverses the order of a list in place.  E.g. if ``c=[4,5,6]``, then after calling ``c.reverse()`` the list ``c`` is ``[6,5,4]`` (the call itself returns ``None``).   To sort the list in increasing order, use ``c.sort()``; for decreasing order, use ``c.sort(reverse=True)``.  
  

In [22]:
c=[4,5,6]
print(c)
c.reverse()
print(c)

#sorting
c.sort()
print(c)
c.sort(reverse=True)
print(c)

[4, 5, 6]
[6, 5, 4]
[4, 5, 6]
[6, 5, 4]
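
Because ``reverse`` and ``sort`` work in place, a common stumbling block is assigning their return value. The sketch below contrasts them with the built-in ``sorted``, which returns a new list, and shows the optional ``key`` argument. The list ``offsets`` is invented for illustration.

```python
c = [4, 6, 5]

result = c.reverse()      # reverses c in place ...
print(result)             # ... and returns None
print(c)                  # [5, 6, 4]

d = sorted(c)             # sorted() builds a new list and leaves c alone
print(d, c)               # [4, 5, 6] [5, 6, 4]

# The key argument sorts by a computed value, here the absolute value
offsets = [-3.0, 0.5, -1.2, 2.0]
print(sorted(offsets, key=abs))   # [0.5, -1.2, 2.0, -3.0]
```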


  
Python also has a useful standard-library module called ``bisect``, which must be imported before use.   It implements binary search and insertion for a list that is already sorted.  E.g. ``bisect.bisect`` returns the index at which an element should be inserted to keep the list sorted (after any existing entries equal to it), while ``bisect.insort`` actually inserts the element at that location.   E.g. 
  

In [27]:
import bisect
c=[1, 2, 2, 2, 3, 4, 7]
print('two',bisect.bisect(c, 2))
print('three',bisect.bisect(c, 3))
print('five',bisect.bisect(c, 5))
bisect.insort(c,6)
print(c)

print('')
c2=[1, 2, 2.25, 2.5, 3, 4, 7]
print('two',bisect.bisect(c2, 2)) #compare with the bisect command on c: 
#there, the insertion point falls after the last of the repeated 2s
print('three',bisect.bisect(c2, 3))
print('five',bisect.bisect(c2, 5))
bisect.insort(c2,6)
print(c2)

two 4
three 5
five 6
[1, 2, 2, 2, 3, 4, 6, 7]

two 2
three 5
five 6
[1, 2, 2.25, 2.5, 3, 4, 6, 7]
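
``bisect.bisect`` is an alias for ``bisect.bisect_right``; its counterpart ``bisect.bisect_left`` places the insertion point before any equal entries. The sketch below compares the two and then uses the same idea to assign values to bins. The ``edges`` and ``values`` lists are hypothetical data made up for this example.

```python
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

# bisect_right: insertion point falls after existing entries equal to the value
print(bisect.bisect_right(c, 2))   # 4
# bisect_left: insertion point falls before them
print(bisect.bisect_left(c, 2))    # 1

# Hypothetical use: assign measurements to bins defined by sorted lower edges
edges = [0.0, 1.0, 2.0, 5.0]
values = [0.3, 1.0, 4.9, 1.7]
bins = [bisect.bisect_right(edges, v) - 1 for v in values]
print(bins)   # [0, 1, 2, 1]
```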


### Searching

Python can match regular expressions (text patterns) with the module ``re``.   You have to ``import re`` first to use it.  

The formal documentation includes an exhaustive list of the functions in ``re``: https://docs.python.org/3/library/re.html . Below are a few that are particularly useful:

``re.search`` looks for the first match of a pattern in a string and returns a match object (or ``None`` if there is no match):

In [30]:
import re
startrek3string='Help,Jim.  Why did you leave me on Genesis? Help me, Spock.'
key_word='Genesis'
a=re.search(key_word,startrek3string)
print(a)
print(a.start()) #gives the starting index position of the match
 #   35
print(a.end()) #gives the end index position
 #   42


<re.Match object; span=(35, 42), match='Genesis'>
35
42
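
When the pattern is not found, ``re.search`` returns ``None``, so calling ``.start()`` on the result would fail. A minimal sketch of guarding against that, where the second keyword ``'Vulcan'`` is chosen only because it does not occur in the string:

```python
import re

startrek3string = 'Help,Jim.  Why did you leave me on Genesis? Help me, Spock.'

for key_word in ['Genesis', 'Vulcan']:
    m = re.search(key_word, startrek3string)
    if m is None:
        # calling m.start() here would raise an AttributeError
        print(key_word, 'not found')
    else:
        print(key_word, 'found at', m.span())
```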


``re.split`` splits a string at every match of a pattern:

In [46]:
startrek3string='Help,Jim. Why did you leave me on Genesis? Help me, Spock.'
a= re.split(r'\s',startrek3string) 
#here, the string is split at every whitespace character; the r prefix makes this a raw string, so Python does not treat \s as an (invalid) string escape
print(a)
#['Help,Jim.', 'Why', 'did', 'you', 'leave', 'me', 'on', 'Genesis?', 'Help', 'me,', 'Spock.']


#b=re.split('',startrek3string) #splits on every letter/space element
b=re.split(' ',startrek3string) #splits on every single space character; same result as a here, but unlike \s it does not match tabs or newlines
print(b)

c=re.split('leave',startrek3string) #splits on 'leave' ... returns list elements before 'leave' and after 'leave'
print(c)

d=re.split('o',startrek3string)
print(d)

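#we can control the number of splits with maxsplit (passed by keyword; positional use is deprecated as of Python 3.13)
e=re.split('o',startrek3string,maxsplit=1)
print(e)
e2=re.split('o',startrek3string,maxsplit=2)
print(e2)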

['Help,Jim.', 'Why', 'did', 'you', 'leave', 'me', 'on', 'Genesis?', 'Help', 'me,', 'Spock.']
['Help,Jim.', 'Why', 'did', 'you', 'leave', 'me', 'on', 'Genesis?', 'Help', 'me,', 'Spock.']
['Help,Jim. Why did you ', ' me on Genesis? Help me, Spock.']
['Help,Jim. Why did y', 'u leave me ', 'n Genesis? Help me, Sp', 'ck.']
['Help,Jim. Why did y', 'u leave me on Genesis? Help me, Spock.']
['Help,Jim. Why did y', 'u leave me ', 'n Genesis? Help me, Spock.']
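
Since the first argument is a pattern rather than a fixed string, one call can split on several delimiters at once. The sketch below uses a character set followed by ``+`` to split on any run of commas, periods, question marks, or whitespace; the name ``words`` is chosen only for this example.

```python
import re

startrek3string = 'Help,Jim. Why did you leave me on Genesis? Help me, Spock.'

# Split on any run of commas, periods, question marks, or whitespace
pieces = re.split(r'[,.?\s]+', startrek3string)
print(pieces)

# The delimiter at the very end leaves an empty string as the last element; drop it
words = [w for w in pieces if w]
print(words)
```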


``re.sub`` replaces every match of a pattern with another string. Here's a simple example:

In [21]:
mst3kstring='watch out for snakes'
a=re.sub('snakes','Torgo',mst3kstring)
b=re.sub('snakes','McLeod!!!',mst3kstring)
print(a)
print(b)

mst3kstring2="David Ryder's Name is Slab Bulkhead"
c=re.sub('Slab Bulkhead','Big McLarge Huge',mst3kstring2)
print(c)

watch out for Torgo
watch out for McLeod!!!
David Ryder's Name is Big McLarge Huge


or this example where we replace spaces by 999:

In [49]:
startrek3string='help,Jim. Why did you leave me on Genesis? help me, Spock.'
#a=re.sub(r'\s','999',startrek3string) this works too
a=re.sub(' ','999',startrek3string)
print(a)

#'help,Jim.999Why999did999you999leave999me999on999Genesis?999help999me,999Spock.'

a=re.sub('o','999',startrek3string)
print(a)

help,Jim.999Why999did999you999leave999me999on999Genesis?999help999me,999Spock.
help,Jim. Why did y999u leave me 999n Genesis? help me, Sp999ck.


In [50]:
b=re.sub('Genesis','Pluto',startrek3string)
print(b)
#'help,Jim. Why did you leave me on Pluto? help me, Spock.'

help,Jim. Why did you leave me on Pluto? help me, Spock.


You can control the number of replacements with the ``count`` keyword:

In [51]:

startrek3string='help,Jim. Why did you leave me on Genesis? help me, Spock.  Will you help?'
b=re.sub('help','Waffles',startrek3string,count=1)
b
#'Waffles,Jim. Why did you leave me on Genesis? help me, Spock.  Will you help?'



'Waffles,Jim. Why did you leave me on Genesis? help me, Spock.  Will you help?'
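
Matching is case-sensitive by default. As a hedged sketch of two related options, the ``flags=re.IGNORECASE`` argument makes the pattern match regardless of case, and ``re.subn`` returns the number of replacements along with the new string. Note that the string below starts with a capitalized ``Help``, unlike the cell above, and the variable names are made up for illustration.

```python
import re

startrek3string = 'Help,Jim. Why did you leave me on Genesis? help me, Spock.  Will you help?'

# Case-sensitive: 'Help' is skipped, so the first lowercase 'help' is replaced
first = re.sub('help', 'Waffles', startrek3string, count=1)
print(first)

# flags=re.IGNORECASE lets the pattern match 'Help' as well
first_two = re.sub('help', 'Waffles', startrek3string, count=2, flags=re.IGNORECASE)
print(first_two)

# re.subn also reports how many replacements were made
everything, n = re.subn('help', 'Waffles', startrek3string, flags=re.IGNORECASE)
print(n)   # 3
```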

``re.search`` returns a match object. Its ``span`` method gives the start and end indices of the first match, while its ``group`` method gives the matched text itself.

```
>>>txt= "The needs of the many outweigh the needs of the few"
>>>x=re.search("outweigh",txt)
>>>x.span()
   (22,30)
```   

In [66]:
txt= "The Needs of the many outweigh the Nags of the few"
y=re.search(r"\bN\w+",txt) #first word starting with N: \b marks a word boundary, \w+ matches the word characters that follow
print(y.group())
#'Needs'
y=re.search(r"\bNa\w+",txt) #first word starting with Na
print(y.group())
#'Nags'

y=re.findall(r"\bN\w+",txt)
print(y)

Needs
Nags
['Needs', 'Nags']


Instead of ``re.search``, you can also use ``re.findall``, which returns a list of all matching substrings rather than a match object for the first one.  E.g. 

In [72]:
txt= "The needs of the many outweigh the needs of the few"
x=re.search(r"outweigh",txt)
print(x)
#<re.Match object; span=(22, 30), match='outweigh'>
x=re.findall(r"outweigh",txt)
print(x)
#['outweigh']


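y=re.findall(r"\bN\w+",txt)
print(y) #empty list: matching is case-sensitive and no word starts with a capital N

y=re.findall(r"\bn\w+",txt)
print(y) #two instances: the words that start with n

y=re.findall(r"n\w",txt)
print(y) #three instances: every n plus exactly one following word character

y=re.findall(r"n\w+",txt)
print(y) #three instances: every n plus all following word characters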

<re.Match object; span=(22, 30), match='outweigh'>
['outweigh']
[]
['needs', 'needs']
['ne', 'ny', 'ne']
['needs', 'ny', 'needs']
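
If you need both the matched text and its position for every match, one option is ``re.finditer``, which yields a match object per match. A minimal sketch; the list name ``found`` is only for illustration:

```python
import re

txt = "The needs of the many outweigh the needs of the few"

# finditer yields one match object per match, so group() and span() are both available
found = [(m.group(), m.span()) for m in re.finditer(r"\bn\w+", txt)]
for word, position in found:
    print(word, position)
# needs (4, 9)
# needs (35, 40)
```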


You can use brackets -- ``[ ]`` -- and backslashes -- ``\`` -- and other metacharacters, special sequences, and sets to do complex string searches.  Here are some of them:


_Metacharacters_

Examples of metacharacters can be found here ``https://www.w3schools.com/python/gloss_python_regex_metacharacters.asp``

```
e.g. given a string called "hello world" ...

Char  Description                                          Example
[]    A set of characters                                  "[a-m]"
\     Special sequence (or escape of a special character)  "\d"
.     Any character (except a newline)                     "he..o"
^     Starts with                                          "^hello"
$     Ends with                                            "planet$"
*     Zero or more occurrences                             "he.*o"    #"he", then 0 or more of any character, then "o"
+     One or more occurrences                              "he.+o"    #"he", then 1 or more of any character, then "o"
?     Zero or one occurrence                               "he.?o"    #"he", then 0 or 1 of any character, then "o"
{}    Exactly the specified number of occurrences          "he.{2}o"  #"he", then exactly 2 of any character, then "o"
|     Either or                                            "falls|stays"
()    Capture and group

```

In [116]:
txt="hello world"

print('1')
x=re.findall("[a-m]",txt)
print(x)

print('2')
x=re.findall("l",txt)
print(x)
#x=re.findall("[l]",txt) #same thing
#print(x) 

#x=re.findall("l,w",txt) #returns an empty list
#print(x)

print('3')
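x=re.findall("[l,w]",txt) #inside a set the comma is a literal character, so this matches l, w, or a comma; [lw] would be enough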
print(x)

print('4')
x=re.findall("he.*o",txt)
print(x)

print('5')
x=re.findall("he.+o",txt)
print(x)

print('6')
x=re.findall("he.?o",txt)
print(x)

print('7')
x=re.findall("hell.?o",txt)
print(x)

print('8')
x=re.findall("hell.{3}o",txt)
print(x)

#x=re.findall("hell.{4}o",txt) #empty return
#print(x) 

print('9')
txt2="hello planet"
x=re.findall("he.*o",txt2)
print(x)

1
['h', 'e', 'l', 'l', 'l', 'd']
2
['l', 'l', 'l']
3
['l', 'l', 'w', 'l']
4
['hello wo']
5
['hello wo']
6
[]
7
['hello']
8
['hello wo']
9
['hello']
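
The cell above does not exercise ``()``, ``|``, ``^``, or ``$``. A short sketch of each follows, under the assumption that the second test string (``"hello world, hello planet"``) is made up for illustration:

```python
import re

txt = "hello world"

# Parentheses capture pieces of the match; group(n) retrieves the n-th piece
m = re.search(r"(h\w+) (w\w+)", txt)
print(m.group(0))   # hello world
print(m.group(1))   # hello
print(m.group(2))   # world

# | matches either alternative
either = re.findall(r"world|planet", "hello world, hello planet")
print(either)       # ['world', 'planet']

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^hello", txt)), bool(re.search(r"planet$", txt)))   # True False
```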


_Special Sequences_

Examples of special sequences can be found here ``https://www.w3schools.com/python/gloss_python_regex_sequences.asp``

```

\A  Matches only at the beginning of the string
\b  Matches at a word boundary, i.e. at the beginning or at the end of a word
    (write the pattern as a raw string, r"...", so that Python does not read \b as a backspace)
\B  Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word
\d  Matches any digit (0-9)
\D  Matches any character that is NOT a digit
\s  Matches any whitespace character
\S  Matches any character that is NOT whitespace
\w  Matches any word character (letters, digits, and the underscore _ character)
\W  Matches any character that is NOT a word character
\Z  Matches only at the end of the string

```

_Sets_

```

[arn] 	Returns a match where one of the specified characters (a, r, or n) is present 	
[a-n] 	Returns a match for any lower case character, alphabetically between a and n 	
[^arn] 	Returns a match for any character EXCEPT a, r, and n 	
[0123] 	Returns a match where any of the specified digits (0, 1, 2, or 3) are present 	
[0-9] 	Returns a match for any digit between 0 and 9 	
[0-5][0-9] 	Returns a match for any two-digit number from 00 to 59 	
[a-zA-Z] 	Returns a match for any character alphabetically between a and z, lower case OR upper case 	
[+] 	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string

```
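
A few of the sets above in action. This is only a sketch: the test string, with a time appended to it, and the variable names are made up for illustration.

```python
import re

txt = "The rain in Spain at 08:45"

arn = re.findall(r"[arn]", txt)                # every a, r, or n
print(arn)           # ['r', 'a', 'n', 'n', 'a', 'n', 'a']

not_letters = re.findall(r"[^a-zA-Z\s]", txt)  # anything that is not a letter or whitespace
print(not_letters)   # ['0', '8', ':', '4', '5']

two_digit = re.findall(r"[0-5][0-9]", txt)     # two-digit numbers from 00 to 59
print(two_digit)     # ['08', '45']
```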

In [84]:
txt = "The rain in Spain"

#Check if the string starts with "The":
x = re.findall(r"\AThe", txt) #the r prefix makes this a raw string, so the backslash reaches re unchanged
print(x)

#\b matches at a word boundary
#e.g. check whether "rain" is present at the BEGINNING of a word:
x=re.findall(r"\brain",txt)
print(x)

#\B is the opposite of \b: here, "ain" must NOT be at the beginning of a word
x=re.findall(r"\Bain",txt)
print(x) #two matches: the "ain" in "rain" and in "Spain"

#placed after the characters, \B requires that "ain" is NOT at the end of a word
x=re.findall(r"ain\B",txt)
print(x) #no match: both "rain" and "Spain" end in "ain"

txt2="aint no mountain high enough"
x2=re.findall(r"ain\B",txt2)
print(x2) #one match: the "ain" in "aint"

['The']
['rain']
['ain', 'ain']
[]
['ain']


In [96]:
#another set of examples
eightiessong = "Jenny's number is 8675309"
x = re.findall(r"\d", eightiessong) #every digit
print(x)

x = re.findall(r"\D", eightiessong) #every character that is not a digit
print(x)

x=re.findall(r"\s",eightiessong) #every whitespace character
print(x)

x=re.findall(r"\S",eightiessong) #every character that is not whitespace
print(x)

x=re.findall(r"\w",eightiessong) #every word character
print(x)

x=re.findall(r"\W",eightiessong) #every character that is NOT a word character
print(x)

x=re.findall(r"number\Z",eightiessong) #matches only if "number" is at the end of the string
print(x) #it's not at the END of the string, so Python returns an empty list


['8', '6', '7', '5', '3', '0', '9']
['J', 'e', 'n', 'n', 'y', "'", 's', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ']
[' ', ' ', ' ']
['J', 'e', 'n', 'n', 'y', "'", 's', 'n', 'u', 'm', 'b', 'e', 'r', 'i', 's', '8', '6', '7', '5', '3', '0', '9']
['J', 'e', 'n', 'n', 'y', 's', 'n', 'u', 'm', 'b', 'e', 'r', 'i', 's', '8', '6', '7', '5', '3', '0', '9']
["'", ' ', ' ', ' ']
[]
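
Putting several of these pieces together, a common task in data analysis is pulling numbers out of a line of text. The sketch below is one possible approach, not a general-purpose number parser: the ``header`` string is a hypothetical line from a data file, and the pattern handles only plain integers and decimals with an optional sign. The ``(?:...)`` form groups without capturing, which keeps ``re.findall`` returning the whole match instead of just the group.

```python
import re

# Hypothetical header line from a data file
header = "T_eff = 5772 K, log g = 4.44, [Fe/H] = -0.01"

# Optional sign, one or more digits, optional decimal part
pattern = r"[-+]?\d+(?:\.\d+)?"
numbers = [float(s) for s in re.findall(pattern, header)]
print(numbers)   # [5772.0, 4.44, -0.01]
```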
