<h1><center> PPOLS564: Foundations of Data Science </center></h1>
<h3><center> Lecture 4 <br><br><font color='grey'> Manipulating Data Structures </font></center></h3>

# Manipulating Mutable Objects

## Local (In-Scope) Functionality

In [1]:
country_list = ["Russia","Latvia","United States","Nigeria","Mexico","India","Costa Rica"]
country_list

['Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica']

### `len()`

`len()` provides us with the length of an object (for example, the number of items in a list or the number of characters in a string).

In [2]:
print(len(country_list))
print(len(country_list[1]))

7
6


### `.index()`

Isolating the **location of a specific value**.

In [3]:
country_list.index('Nigeria')

3

In [4]:
country_list[country_list.index('Nigeria')]

'Nigeria'

Membership in a list

In [5]:
'Russia' in country_list

True
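
`.index()` raises a `ValueError` when the value is not present, so a membership check makes a natural guard. A minimal sketch of that pattern; the helper name `find_position` is hypothetical and not part of the lecture:

```python
country_list = ["Russia", "Latvia", "United States", "Nigeria",
                "Mexico", "India", "Costa Rica"]

def find_position(items, value):
    """Hypothetical helper: index of value in items, or None if absent."""
    # Check membership first, because .index() raises ValueError on a miss
    if value in items:
        return items.index(value)
    return None

print(find_position(country_list, "Nigeria"))  # 3
print(find_position(country_list, "Canada"))   # None
```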

### Updating and altering values

To add values to a collection, we have seen methods such as `__add__`, `.append()`, `.extend()`, and `.update()`, depending on the collection type.

Recall that not all methods actually update the object.

In [6]:
print(id(country_list)) # print object id

print(country_list + ['Canada']) # add canada to the list

print(id(country_list)) # object id remains consistent
print(country_list) # list wasn't updated

4400613064
['Russia', 'Latvia', 'United States', 'Nigeria', 'Mexico', 'India', 'Costa Rica', 'Canada']
4400613064
['Russia', 'Latvia', 'United States', 'Nigeria', 'Mexico', 'India', 'Costa Rica']


To update the list itself, we need the in-place addition offered by the `__iadd__` method (the `+=` operator).

In [7]:
country_list += ['Canada']
country_list

['Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada']

There is also an in-place repetition operation (`__imul__`)

In [8]:
country_list *= 3
country_list

['Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada',
 'Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada',
 'Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada']

In-place operations make for more efficient code because the existing elements do not have to be copied into a new list. Note also that when we concatenate with `+` we create a **new object**; an in-place extension retains the original object id.

In [9]:
x = [1,2,3]
print(id(x))

x1 = x + [4]
print(id(x1))

x += [4]
print(id(x))

print(x1)
print(x)

4401331464
4401464520
4401331464
[1, 2, 3, 4]
[1, 2, 3, 4]
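
The difference matters most when two names refer to the same list. The sketch below (variable names are illustrative only) shows that `+=` changes the shared object, while `+` rebinds one name to a new list and leaves the other untouched:

```python
a = [1, 2, 3]
alias = a            # a second name bound to the same list object

a += [4]             # in-place: the shared object is mutated
print(alias)         # [1, 2, 3, 4]
print(alias is a)    # True

a = a + [5]          # concatenation: 'a' is rebound to a brand-new list
print(a)             # [1, 2, 3, 4, 5]
print(alias)         # [1, 2, 3, 4]
print(alias is a)    # False
```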


### Slicing 

Often we want a range of values from a container. We can accomplish this by slicing.

Rule of thumb:
- `:`
- `<start here>:<to the value before here>`


```python
x = [1, 2, 3, 4, 5, 6]
x[1:4]
```

is

```python
 0  1  2  3  4  5
[1, 2, 3, 4, 5, 6]
    ^  ^  ^ 
```

In [10]:
country_list = ["Russia","Latvia","United States","Nigeria","Mexico","India","Costa Rica"]
country_list[1:5]

['Latvia', 'United States', 'Nigeria', 'Mexico']

When we leave a value open, we are saying: take me all the way to the beginning or the end.

In [11]:
country_list[:4]

['Russia', 'Latvia', 'United States', 'Nigeria']

In [12]:
country_list[5:]

['India', 'Costa Rica']

The slicing operator by itself (`[:]`) makes a shallow copy of the object.

In [13]:
cc = country_list[:]
cc is country_list

False

And every slice creates a new object with its own id.

In [14]:
print(id(country_list))
print(id(country_list[:3]))
print(id(country_list[3:]))

4401464904
4401329480
4401465608
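
Slices also accept negative indices and an optional step (`<start>:<stop>:<step>`), which are standard Python but were not covered above. A short sketch, which also confirms that editing a slice copy leaves the original list alone:

```python
country_list = ["Russia", "Latvia", "United States", "Nigeria",
                "Mexico", "India", "Costa Rica"]

print(country_list[-2:])   # last two items: ['India', 'Costa Rica']
print(country_list[::2])   # every second item: ['Russia', 'United States', 'Mexico', 'Costa Rica']

# A full slice is a separate list, so changing it does not touch the original
cc = country_list[:]
cc[0] = "Canada"
print(cc[0])               # Canada
print(country_list[0])     # Russia
```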


### Deleting Values

- `del` keyword (deletes by index location)
- `.remove()` method (deletes the first occurrence of a value)

In [15]:
del country_list[1]
country_list

['Russia', 'United States', 'Nigeria', 'Mexico', 'India', 'Costa Rica']

In [16]:
country_list.remove("Nigeria")
country_list

['Russia', 'United States', 'Mexico', 'India', 'Costa Rica']

### Popping elements out of a container

Elements can be used and removed from a collection in a single step with `.pop()`. This is useful when you have a fixed list of items to work through and want to perform a similar operation on each.

In [17]:
country_list.pop()

'Costa Rica'

In [18]:
country_list

['Russia', 'United States', 'Mexico', 'India']

We can also pop an item out by its index location.

In [19]:
country_list.pop(2)

'Mexico'

In [20]:
country_list

['Russia', 'United States', 'India']
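
One common way to work through a list with `.pop()` is a `while` loop that runs until the list is empty. This is only a sketch; the names `queue` and `processed` are illustrative:

```python
queue = ["Russia", "United States", "Mexico", "India"]
processed = []

# A non-empty list is truthy, so the loop stops once every item is popped
while queue:
    country = queue.pop()              # removes and returns the last item
    processed.append(country.upper())  # do something with the item

print(processed)  # ['INDIA', 'MEXICO', 'UNITED STATES', 'RUSSIA']
print(queue)      # []
```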

### Counting Values

In [21]:
country_list = ["Russia","Latvia","United States","Russia","Mexico",
                "India","Papua New Guinea","Latvia","Russia"]
print(country_list.count("Russia"))
print(country_list.count("Latvia"))

3
2


### Sorting Values

In [22]:
country_list.sort()
country_list

['India',
 'Latvia',
 'Latvia',
 'Mexico',
 'Papua New Guinea',
 'Russia',
 'Russia',
 'Russia',
 'United States']

In [23]:
country_list.reverse()
country_list

['United States',
 'Russia',
 'Russia',
 'Russia',
 'Papua New Guinea',
 'Mexico',
 'Latvia',
 'Latvia',
 'India']

There is also the built-in `sorted()` function, which returns a new sorted list and which we'll explore in greater detail when discussing iteration.

In [24]:
sorted(country_list)

['India',
 'Latvia',
 'Latvia',
 'Mexico',
 'Papua New Guinea',
 'Russia',
 'Russia',
 'Russia',
 'United States']

In [25]:
# Can sort by some defined function
sorted(country_list,key=len,reverse=True)

['Papua New Guinea',
 'United States',
 'Russia',
 'Russia',
 'Russia',
 'Mexico',
 'Latvia',
 'Latvia',
 'India']
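
A frequent stumbling block is that `.sort()` works in place and returns `None`, whereas `sorted()` returns a new list and leaves its input alone. A small sketch of the contrast, using an illustrative list and a `lambda` as the sort key:

```python
countries = ["Russia", "Latvia", "United States", "Mexico", "India"]

result = countries.sort()   # sorts in place; the return value is None
print(result)               # None
print(countries)            # ['India', 'Latvia', 'Mexico', 'Russia', 'United States']

# sorted() builds a new list; here the key is the last letter of each name
by_last_letter = sorted(countries, key=lambda name: name[-1])
print(by_last_letter)       # ['India', 'Latvia', 'Russia', 'Mexico', 'United States']
print(countries)            # unchanged: ['India', 'Latvia', 'Mexico', 'Russia', 'United States']
```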

### Getting help()

Recall that we can consult an object's functionality using the built-in `help()` function.

In [26]:
help([].sort)

Help on built-in function sort:

sort(...) method of builtins.list instance
    L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*



In [27]:
# Jupyter notebooks also provide their own help shortcut (the ? prefix)
?list()

# Methods to Keep in Mind

Above we looked at some specific methods that exist within an object's scope. Let's list the main methods for each collection type covered thus far and probe the differences.

**<center>Methods in object type `list`</center>**

| Method  | Description |
|:---------:|---------|
|**`.append()`**| L.append(object) -> None -- append object to end|
|**`.clear()`**| L.clear() -> None -- remove all items from L|
|**`.copy()`**| L.copy() -> list -- a shallow copy of L|
|**`.count()`**| L.count(value) -> integer -- return number of occurrences of value|
|**`.extend()`**| L.extend(iterable) -> None -- extend list by appending elements from the iterable|
|**`.index()`**| L.index(value, [start, [stop]]) -> integer -- return first index of value. Raises ValueError if the value is not present.|
|**`.insert()`**| L.insert(index, object) -- insert object before index|
|**`.pop()`**| L.pop([index]) -> item -- remove and return item at index (default last). Raises IndexError if list is empty or index is out of range.|
|**`.remove()`**| L.remove(value) -> None -- remove first occurrence of value. Raises ValueError if the value is not present.|
|**`.reverse()`**| L.reverse() -- reverse *IN PLACE*|
|**`.sort()`**| L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*|
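
A listing like the one above can be produced programmatically with the built-in `dir()`. This is a sketch of one possible approach, not necessarily how the tables in these notes were built; the helper name `public_methods` is hypothetical:

```python
def public_methods(obj_type):
    """Hypothetical helper: names of the public callable attributes of a type."""
    return [name for name in dir(obj_type)
            if not name.startswith("_") and callable(getattr(obj_type, name))]

print(public_methods(list))   # the exact listing can vary by Python version
print(public_methods(tuple))  # ['count', 'index']
```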

**<center>Methods in object type `set`</center>**

| Method  | Description |
|:---------:|:---------:|
|**`.add()`**| Add an element to a set.|
|**`.clear()`**| Remove all elements from this set.|
|**`.copy()`**| Return a shallow copy of a set.|
|**`.difference()`**| Return the difference of two or more sets as a new set.|
|**`.difference_update()`**| Remove all elements of another set from this set.|
|**`.discard()`**| Remove an element from a set if it is a member.|
|**`.intersection()`**| Return the intersection of two sets as a new set.|
|**`.intersection_update()`**| Update a set with the intersection of itself and another.|
|**`.isdisjoint()`**| Return True if two sets have a null intersection.|
|**`.issubset()`**| Report whether another set contains this set.|
|**`.issuperset()`**| Report whether this set contains another set.|
|**`.pop()`**| Remove and return an arbitrary set element. Raises KeyError if the set is empty.|
|**`.remove()`**| Remove an element from a set; it must be a member.|
|**`.symmetric_difference()`**| Return the symmetric difference of two sets as a new set.|
|**`.symmetric_difference_update()`**| Update a set with the symmetric difference of itself and another.|
|**`.union()`**| Return the union of sets as a new set.|
|**`.update()`**| Update a set with the union of itself and others.|
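
A brief sketch of how a few of these behave, using two made-up sets (`group_a` and `group_b` are illustrative names). Note the split between methods that return a new set and those that update in place, and between `.discard()` and `.remove()`:

```python
group_a = {"Latvia", "Russia", "France"}
group_b = {"India", "Russia", "Japan"}

# These return new sets and leave both inputs unchanged
print(group_a.intersection(group_b))        # {'Russia'}
print(sorted(group_a.difference(group_b)))  # ['France', 'Latvia'] (sorted for a stable display)
print(len(group_a.union(group_b)))          # 5

# .discard() is silent when the element is missing; .remove() raises KeyError
group_a.discard("Canada")
try:
    group_a.remove("Canada")
except KeyError:
    print("remove() raised KeyError")

# .update() changes the set in place
group_a.update({"Spain"})
print(sorted(group_a))                      # ['France', 'Latvia', 'Russia', 'Spain']
```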



**<center>Methods in object type `dict`</center>**

| Method  | Description |
|:---------:|:---------:|
|**`.clear()`**| D.clear() -> None.  Remove all items from D.|
|**`.copy()`**| D.copy() -> a shallow copy of D|
|**`.fromkeys()`**| Returns a new dict with keys from iterable and values equal to value.|
|**`.get()`**| D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.|
|**`.items()`**| D.items() -> a set-like object providing a view on D's items|
|**`.keys()`**| D.keys() -> a set-like object providing a view on D's keys|
|**`.pop()`**| D.pop(k[,d]) -> v, remove specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised|
|**`.popitem()`**| D.popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.|
|**`.setdefault()`**| D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D|
|**`.update()`**| D.update([E, ]**F) -> None.  Update D from dict/iterable E and F. If E is present and has a .keys() method, then does:  for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does:  for k, v in E: D[k] = v In either case, this is followed by: for k in F:  D[k] = F[k]|
|**`.values()`**| D.values() -> an object providing a view on D's values|
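
The methods that take a default (`.get()`, `.setdefault()`, `.pop()`) are handy for avoiding a `KeyError`. A minimal sketch with made-up dictionaries (`capitals` and `counts` are illustrative names):

```python
capitals = {"Latvia": "Riga", "Mexico": "Mexico City"}

print(capitals.get("Latvia"))            # Riga
print(capitals.get("India"))             # None (no KeyError)
print(capitals.get("India", "unknown"))  # unknown

# .setdefault() inserts a starting value only when the key is missing
counts = {}
for country in ["Russia", "Latvia", "Russia"]:
    counts.setdefault(country, 0)
    counts[country] += 1
print(counts)                            # {'Russia': 2, 'Latvia': 1}

# .update() merges in place; .pop() removes a key and returns its value
capitals.update({"India": "New Delhi"})
removed = capitals.pop("Mexico")
print(removed)                           # Mexico City
print(capitals.pop("Canada", None))      # None, because a default was supplied
print(capitals)                          # {'Latvia': 'Riga', 'India': 'New Delhi'}
```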

**<center>Methods in object type `tuple`</center>**

| Method  | Description |
|:---------:|:---------:|
|**`.count()`**| T.count(value) -> integer -- return number of occurrences of value|
|**`.index()`**| T.index(value, [start, [stop]]) -> integer -- return first index of value. Raises ValueError if the value is not present.|

# String methods

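`str` types are really useful because so many common text-manipulation methods, covering tasks we might otherwise handle with regular expressions (regex), are baked into the string object when created.

Note that some operators, such as addition (`+`) and multiplication (`*`), take on new functionality here: they concatenate and repeat strings.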

**<center>Methods in object type `str`</center>**

| Method  | Description |
|:---------:|:---------:|
|**`.capitalize()`**| S.capitalize() -> str|
|**`.casefold()`**| S.casefold() -> str|
|**`.center()`**| S.center(width[, fillchar]) -> str|
|**`.count()`**| S.count(sub[, start[, end]]) -> int|
|**`.encode()`**| S.encode(encoding='utf-8', errors='strict') -> bytes|
|**`.endswith()`**| S.endswith(suffix[, start[, end]]) -> bool|
|**`.expandtabs()`**| S.expandtabs(tabsize=8) -> str|
|**`.find()`**| S.find(sub[, start[, end]]) -> int|
|**`.format()`**| S.format(*args, **kwargs) -> str|
|**`.format_map()`**| S.format_map(mapping) -> str|
|**`.index()`**| S.index(sub[, start[, end]]) -> int|
|**`.isalnum()`**| S.isalnum() -> bool|
|**`.isalpha()`**| S.isalpha() -> bool|
|**`.isdecimal()`**| S.isdecimal() -> bool|
|**`.isdigit()`**| S.isdigit() -> bool|
|**`.isidentifier()`**| S.isidentifier() -> bool|
|**`.islower()`**| S.islower() -> bool|
|**`.isnumeric()`**| S.isnumeric() -> bool|
|**`.isprintable()`**| S.isprintable() -> bool|
|**`.isspace()`**| S.isspace() -> bool|
|**`.istitle()`**| S.istitle() -> bool|
|**`.isupper()`**| S.isupper() -> bool|
|**`.join()`**| S.join(iterable) -> str|
|**`.ljust()`**| S.ljust(width[, fillchar]) -> str|
|**`.lower()`**| S.lower() -> str|
|**`.lstrip()`**| S.lstrip([chars]) -> str|
|**`.maketrans()`**| Return a translation table usable for str.translate().|
|**`.partition()`**| S.partition(sep) -> (head, sep, tail)|
|**`.replace()`**| S.replace(old, new[, count]) -> str|
|**`.rfind()`**| S.rfind(sub[, start[, end]]) -> int|
|**`.rindex()`**| S.rindex(sub[, start[, end]]) -> int|
|**`.rjust()`**| S.rjust(width[, fillchar]) -> str|
|**`.rpartition()`**| S.rpartition(sep) -> (head, sep, tail)|
|**`.rsplit()`**| S.rsplit(sep=None, maxsplit=-1) -> list of strings|
|**`.rstrip()`**| S.rstrip([chars]) -> str|
|**`.split()`**| S.split(sep=None, maxsplit=-1) -> list of strings|
|**`.splitlines()`**| S.splitlines([keepends]) -> list of strings|
|**`.startswith()`**| S.startswith(prefix[, start[, end]]) -> bool|
|**`.strip()`**| S.strip([chars]) -> str|
|**`.swapcase()`**| S.swapcase() -> str|
|**`.title()`**| S.title() -> str|
|**`.translate()`**| S.translate(table) -> str|
|**`.upper()`**| S.upper() -> str|
|**`.zfill()`**| S.zfill(width) -> str|

In [28]:
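my_str = "Georgetown University" # assumed example value; the original definition is missing from these notes

print(my_str.lower()) # convert to lower

print(my_str.upper()) # convert to upper

print(my_str.isupper()) # boolean determination

georgetown university
GEORGETOWN UNIVERSITY
False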

In [None]:
my_str.replace("George","My ")

In [None]:
sent = "This is important to remember."
sent.split() # break a string into a list

In [None]:
seq_str = "A A A A B B B B C C C C A A C A B A C"
seq_str.count("A") # count the number of times a certain pattern occurs

In [None]:
ind = sent.find("i")
print(ind)
print(sent[ind])

In [None]:
# In concert, we can do some useful manipulations of text

sent_ws = "      THIS is a Sentence &95#with problems"

sent_ws = sent_ws.strip() # Strip white space
print(sent_ws)

sent_ws = sent_ws.replace("&95#","") # strip problem values by leveraging the pattern
print(sent_ws)

sent_ws = sent_ws.lower() # convert to lower case
print(sent_ws)

sent_ws = sent_ws.capitalize() # capitalize the first letter
print(sent_ws)

sent_ws = sent_ws + "."
print(sent_ws)
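
Because each string method returns a new string, the same steps can be chained and wrapped in a reusable function. This is a sketch only; the function name `clean_sentence` and its `junk` parameter are hypothetical:

```python
def clean_sentence(text, junk="&95#"):
    """Hypothetical helper that applies the cleaning steps above in one chain."""
    cleaned = (
        text.strip()            # drop leading and trailing white space
            .replace(junk, "")  # remove the known problem pattern
            .lower()            # convert to lower case
            .capitalize()       # capitalize the first letter
    )
    return cleaned + "."

print(clean_sentence("      THIS is a Sentence &95#with problems"))
# This is a sentence with problems.
```

Strings are immutable, so none of these calls alter the original text; each one hands a fresh string to the next step in the chain.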

## Formatting Data into Strings

Often, we need to combine data and strings, either to report results or progress, or compose more versatile text objects. Python makes it easy to integrate data with strings. 

### `"" % ()`

In [29]:
x = 4
y = "dog"
"This is a string with a number (%s) and a word (%s)" %(x,y)

'This is a string with a number (4) and a word (dog)'

In [30]:
"This is a string with a number (%d) and a word (%d)" %(x,y)

TypeError: %d format: a number is required, not str

In [None]:
"This is a string with a number (%d) and a word (%s)" %(x,y)

In [None]:
"This is a string with a number (%.3f) and a word (%s)" %(x,y)

### `.format()`

In [31]:
# Integer positions
"This is a string with a number ({0}) and a word ({1})".format('4','dog')

'This is a string with a number (4) and a word (dog)'

In [32]:
# Named Fields
'This is a {a} in a {b}'.format(a='dog',b='house')

'This is a dog in a house'

In [33]:
ps = [1.0,2.2,3]
'This is a field: {ps[2]} and {ps[1]}. '.format(ps=ps)

'This is a field: 3 and 2.2. '

### f-strings

f-strings emerge from a desire to make string formatting more readable. The above two methods are fine, but they can be difficult to read when the statements become involved. To this end, f-strings (available from Python 3.6) provide an easy syntax in which expressions are evaluated directly inside the string using `{}`.

In [34]:
f'This is a field: {ps[2]} and {ps[1]}'

'This is a field: 3 and 2.2'

In [35]:
f"Progress: { round((44/76)*100,2) }%"

'Progress: 57.89%'
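
f-strings also accept a format specification after a colon, which can replace manual rounding. A short sketch; the variable names and the `population` figure are made up for illustration:

```python
share = 44 / 76
population = 1352617328   # made-up figure for illustration
name = "dog"

print(f"Progress: {share:.2%}")      # Progress: 57.89%
print(f"Population: {population:,}") # Population: 1,352,617,328
print(f"Pi-ish: {3.14159:.3f}")      # Pi-ish: 3.142
print(f"[{name:>8}]")                # [     dog]  (right-aligned in 8 characters)
```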

### String Encoding

Note that the default string encoding in Python 3 is UTF-8.

In [36]:
word = "éôü"
word

'éôü'

In [37]:
en_word = word.encode('UTF-8')
en_word

b'\xc3\xa9\xc3\xb4\xc3\xbc'

In [38]:
en_word.decode('UTF-8')

'éôü'
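
Encoding matters because one character is not always one byte. The sketch below compares the character count with the byte count and shows what happens when an encoding cannot represent a character (the variable names are illustrative):

```python
word = "éôü"
encoded = word.encode("utf-8")

print(len(word))     # 3 characters
print(len(encoded))  # 6 bytes, since each accented letter takes two bytes in UTF-8

# Latin-1 stores each of these letters in a single byte
print(word.encode("latin-1"))                  # b'\xe9\xf4\xfc'

# ASCII cannot represent them at all
try:
    word.encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot encode these characters")

# errors='replace' substitutes a placeholder instead of raising
print(word.encode("ascii", errors="replace"))  # b'???'
```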

## Dates

Finally, let's briefly explore working with dates in Python.

In [39]:
from datetime import datetime
now = datetime.now()
now

datetime.datetime(2018, 9, 19, 7, 41, 2, 396927)

**<center>Methods and attributes in object type `datetime`</center>**

| Method  | Description |
|:---------:|:---------:|
|**`.astimezone()`**| tz -> convert to local time in new timezone tz|
|**`.combine()`**| date, time -> datetime with same date and time fields|
|**`.ctime()`**| Return ctime() style string.|
|**`.date()`**| Return date object with same year, month and day.|
|**`.day`**| Attribute (int): the day of the month, between 1 and the number of days in the given month.|
|**`.dst()`**| Return self.tzinfo.dst(self).|
|**`.fold`**| Attribute (int, 0 or 1): disambiguates wall times that occur twice, such as when clocks are set back.|
|**`.fromisoformat()`**| string -> datetime from datetime.isoformat() output|
|**`.fromordinal()`**| int -> date corresponding to a proleptic Gregorian ordinal.|
|**`.fromtimestamp()`**| timestamp[, tz] -> tz's local time from POSIX timestamp.|
|**`.hour`**| Attribute (int): the hour, in range(24).|
|**`.isocalendar()`**| Return a 3-tuple containing ISO year, week number, and weekday.|
|**`.isoformat()`**| [sep] -> string in ISO 8601 format, YYYY-MM-DDT[HH[:MM[:SS[.mmm[uuu]]]]][+HH:MM]. sep is used to separate the year from the time, and defaults to 'T'. timespec specifies what components of the time to include (allowed values are 'auto', 'hours', 'minutes', 'seconds', 'milliseconds', and 'microseconds').|
|**`.isoweekday()`**| Return the day of the week represented by the date. Monday == 1 ... Sunday == 7|
|**`.max`**| Class attribute: the latest representable datetime, datetime(9999, 12, 31, 23, 59, 59, 999999).|
|**`.microsecond`**| Attribute (int): the microsecond, in range(1000000).|
|**`.min`**| Class attribute: the earliest representable datetime, datetime(1, 1, 1, 0, 0).|
|**`.minute`**| Attribute (int): the minute, in range(60).|
|**`.month`**| Attribute (int): the month, between 1 and 12 inclusive.|
|**`.now()`**| Returns new datetime object representing current time local to tz.|
|**`.replace()`**| Return datetime with new specified fields.|
|**`.resolution`**| Class attribute: the smallest possible difference between non-equal datetime objects, timedelta(microseconds=1).|
|**`.second`**| Attribute (int): the second, in range(60).|
|**`.strftime()`**| format -> strftime() style string.|
|**`.strptime()`**| string, format -> new datetime parsed from a string (like time.strptime()).|
|**`.time()`**| Return time object with same time but with tzinfo=None.|
|**`.timestamp()`**| Return POSIX timestamp as float.|
|**`.timetuple()`**| Return time tuple, compatible with time.localtime().|
|**`.timetz()`**| Return time object with same time and tzinfo.|
|**`.today()`**| Current date or datetime:  same as self.__class__.fromtimestamp(time.time()).|
|**`.toordinal()`**| Return proleptic Gregorian ordinal.  January 1 of year 1 is day 1.|
|**`.tzname()`**| Return self.tzinfo.tzname(self).|
|**`.utcfromtimestamp()`**| Construct a naive UTC datetime from a POSIX timestamp.|
|**`.utcnow()`**| Return a new datetime representing UTC day and time.|
|**`.utcoffset()`**| Return self.tzinfo.utcoffset(self).|
|**`.utctimetuple()`**| Return UTC time tuple, compatible with time.localtime().|
|**`.weekday()`**| Return the day of the week represented by the date. Monday == 0 ... Sunday == 6|
|**`.year`**| Attribute (int): the year, between MINYEAR and MAXYEAR inclusive.|

In [40]:
now.month 

9

In [41]:
now.year

2018

We can format a datetime as a string with `.strftime()`

In [42]:
now.strftime('%Y %m %d')

'2018 09 19'

In [43]:
now.strftime('%Y-%m-%d %H:%M:%S')   

'2018-09-19 07:41:02'

Generating dates

In [44]:
past = datetime(year=2008,month=4,day=20)
past

datetime.datetime(2008, 4, 20, 0, 0)

Comparing dates

In [45]:
diff = now - past
diff.days

3804

In [46]:
now + datetime.timedelta(days=1)

AttributeError: type object 'datetime.datetime' has no attribute 'timedelta'

This fails because we imported the `datetime` class rather than the `datetime` module; `timedelta` lives in the module, so it needs its own import.

In [None]:
from datetime import timedelta
now + timedelta(hours=5)
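
Subtracting two datetimes yields a `timedelta`, and adding a `timedelta` to a datetime yields a new datetime. The sketch below uses a fixed timestamp (matching the `now` value printed earlier, without microseconds) in place of `datetime.now()` so the results are reproducible; the name `start` is illustrative:

```python
from datetime import datetime, timedelta

start = datetime(2018, 9, 19, 7, 41, 2)  # fixed stand-in for datetime.now()
past = datetime(2008, 4, 20)

diff = start - past
print(type(diff).__name__)          # timedelta
print(diff.days)                    # 3804

print(start + timedelta(days=1))    # 2018-09-20 07:41:02
print(start - timedelta(weeks=2))   # 2018-09-05 07:41:02
print(past < start)                 # True: datetimes compare chronologically
```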

We can also convert raw date strings into datetime objects. The key is to specify the format in which the date string is written.

In [47]:
new_date = datetime.strptime('Jun 1 2005', '%b %d %Y')
new_date

datetime.datetime(2005, 6, 1, 0, 0)
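
Parsing and formatting are often combined to standardize a batch of dates. A sketch with made-up input strings; note that `%b` matches month abbreviations in the current locale, so an English locale is assumed here:

```python
from datetime import datetime

raw_dates = ["Jun 1 2005", "Dec 25 2010", "Feb 29 2012"]  # made-up inputs

# Parse each string with the format it is written in, then write it back out as ISO dates
parsed = [datetime.strptime(d, "%b %d %Y") for d in raw_dates]
iso_dates = [d.strftime("%Y-%m-%d") for d in parsed]
print(iso_dates)  # ['2005-06-01', '2010-12-25', '2012-02-29']

# A string that does not match the format raises ValueError
try:
    datetime.strptime("2005-06-01", "%b %d %Y")
except ValueError as err:
    print("Could not parse:", err)
```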