# Sequences and files in Python

Python provides a number of data structures for organizing information. The main ones are lists and dictionaries, which we will use frequently.

In this unit we focus on understanding how lists work, along with other data types that resemble lists, such as strings.

We will also introduce working with files, which lets us read and write data so that it persists on our storage systems.


# Built-In Data Structures

We have seen Python's simple types: ``int``, ``float``, ``complex``, ``bool`` and so on.
Python also has several built-in compound types, which act as containers for other types.
These compound types are:

| Type Name | Example                   | Description                                               |
|-----------|---------------------------|-----------------------------------------------------------|
| ``list``  | ``[1, 2, 3]``             | Ordered collection                                        |
| ``tuple`` | ``(1, 2, 3)``             | Immutable ordered collection                              |
| ``dict``  | ``{'a':1, 'b':2, 'c':3}`` | (key, value) mapping (insertion-ordered since Python 3.7) |
| ``set``   | ``{1, 2, 3}``             | Unordered collection of unique values                     |

As you can see, round, square, and curly brackets have distinct meanings when it comes to the type of collection produced.
We'll take a quick tour of these data structures here.
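
As a quick illustration of the table above, the following sketch builds one object of each compound type and prints its type name and length. The variable names are purely illustrative.

```python
# One object of each built-in compound type (names are illustrative)
primes_list = [2, 3, 5]          # list: ordered and mutable
primes_tuple = (2, 3, 5)         # tuple: ordered and immutable
ages = {'ana': 31, 'bob': 27}    # dict: maps keys to values
unique = {1, 2, 2, 3}            # set: the duplicate 2 is discarded

for obj in (primes_list, primes_tuple, ages, unique):
    print(type(obj).__name__, len(obj))
```

Note that the set ends up with three elements, because sets keep only unique values.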

## Lists
Lists are the basic *ordered* and *mutable* data collection type in Python.
They can be defined with comma-separated values between square brackets; for example, here is a list of the first several prime numbers:

In [2]:
L = [2, 3, 5, 7]

Lists have a number of useful properties and methods available to them.
Here we'll take a quick look at some of the more common and useful ones:

In [3]:
# Length of a list
len(L)

4

In [4]:
# Append a value to the end
L.append(11)
L

[2, 3, 5, 7, 11]

In [5]:
# Addition concatenates lists
L + [13, 17, 19]

[2, 3, 5, 7, 11, 13, 17, 19]

In [6]:
# sort() method sorts in-place
L = [2, 5, 1, 6, 3, 4]
L.sort()
L

[1, 2, 3, 4, 5, 6]

In addition, there are many more built-in list methods; they are well-covered in Python's [online documentation](https://docs.python.org/3/tutorial/datastructures.html).

While we've been demonstrating lists containing values of a single type, one of the powerful features of Python's compound objects is that they can contain objects of *any* type, or even a mix of types. For example:

In [7]:
L = [1, 'two', 3.14, [0, 3, 5]]

This flexibility is a consequence of Python's dynamic type system.
Creating such a mixed sequence in a statically-typed language like C can be much more of a headache!
We see that lists can even contain other lists as elements.
Such type flexibility is an essential piece of what makes Python code relatively quick and easy to write.

So far we've been considering manipulations of lists as a whole; another essential piece is the accessing of individual elements.
This is done in Python via *indexing* and *slicing*, which we'll explore next.

### List indexing and slicing
Python provides access to elements in compound types through *indexing* for single elements, and *slicing* for multiple elements.
As we'll see, both are indicated by a square-bracket syntax.
Suppose we return to our list of the first several primes:

In [8]:
L = [2, 3, 5, 7, 11]

Python uses *zero-based* indexing, so we can access the first and second elements using the following syntax:

In [9]:
L[0]

2

In [10]:
L[1]

3

Elements at the end of the list can be accessed with negative numbers, starting from -1:

In [11]:
L[-1]

11

In [12]:
L[-2]

7

You can visualize this indexing scheme this way:

![List Indexing Figure](./fig/list-indexing.png)

Here values in the list are represented by large numbers in the squares; list indices are represented by small numbers above and below.
In this case, ``L[2]`` returns ``5``, because that is the value at index ``2``.

Where *indexing* is a means of fetching a single value from the list, *slicing* is a means of accessing multiple values in sub-lists.
It uses a colon to indicate the start point (inclusive) and end point (non-inclusive) of the sub-list.
For example, to get the first three elements of the list, we can write:

In [13]:
L[0:3]

[2, 3, 5]

Notice where ``0`` and ``3`` lie in the preceding diagram, and how the slice takes just the values between the indices.
If we leave out the first index, ``0`` is assumed, so we can equivalently write:

In [14]:
L[:3]

[2, 3, 5]

Similarly, if we leave out the last index, it defaults to the length of the list.
Thus, the last three elements can be accessed as follows:

In [15]:
L[-3:]

[5, 7, 11]

Finally, it is possible to specify a third integer that represents the step size; for example, to select every second element of the list, we can write:

In [16]:
L[::2]  # equivalent to L[0:len(L):2]

[2, 5, 11]

A particularly useful version of this is to specify a negative step, which will reverse the list:

In [17]:
L[::-1]

[11, 7, 5, 3, 2]

Both indexing and slicing can be used to set elements as well as access them.
The syntax is as you would expect:

In [18]:
L[0] = 100
print(L)

[100, 3, 5, 7, 11]


In [19]:
L[1:3] = [55, 56]
print(L)

[100, 55, 56, 7, 11]
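
Slice assignment does not have to preserve the length of the list. The sketch below, using a fresh list of primes, shows a slice being replaced by a shorter list and an empty slice being used to insert elements.

```python
L = [2, 3, 5, 7, 11]

# Replace two elements with a single one: the list shrinks
L[1:3] = [0]
print(L)  # [2, 0, 7, 11]

# Assigning to an empty slice inserts without removing anything
L[1:1] = [8, 9]
print(L)  # [2, 8, 9, 0, 7, 11]
```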


A very similar slicing syntax is also used in many data science-oriented packages, including NumPy and Pandas (mentioned in the introduction).

Now that we have seen Python lists and how to access elements in ordered compound types, let's take a look at the other three standard compound data types mentioned earlier.

## Checking if a certain value is present in a list

In [1]:
languages = ['Java', 'C++', 'Go', 'Python', 'JavaScript']

# if 'Python' in languages:
#     print('Python is there!')

# Avoid naming the variable `bool`: that would shadow the built-in type
is_present = 'Python' in languages
is_present


True

In [2]:
if 6 not in [1, 2, 3, 7]:
    print('number 6 is not present')

number 6 is not present


## Lists are mutable

In [24]:
original = [1, 2, 3]
modified = original
modified[0] = 99
print('original: {}, modified: {}'.format(original, modified))

original: [99, 2, 3], modified: [99, 2, 3]


Assigning a list to a second name does not copy it: both names refer to the same object. You can get around this by creating a new `list`:

In [25]:
original = [1, 2, 3]
modified = list(original)  # Note list() 
# Alternatively, you can use copy method
# modified = original.copy()
modified[0] = 99
print('original: {}, modified: {}'.format(original, modified))

original: [1, 2, 3], modified: [99, 2, 3]
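
Both `list(original)` and `original.copy()` make a *shallow* copy: the outer list is new, but any nested lists are still shared. The following sketch illustrates the difference using `copy.deepcopy` from the standard library; the nested data is made up for the example.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = list(original)         # new outer list, same inner lists
deep = copy.deepcopy(original)   # inner lists are copied as well

shallow[0][0] = 99               # mutates an inner list shared with original

print(original)  # [[99, 2], [3, 4]]
print(deep)      # [[1, 2], [3, 4]]
```

For flat lists of immutable values such as numbers or strings, a shallow copy is enough.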


## `list.append()`

In [26]:
my_list = [1]
my_list.append('ham')
print(my_list)

[1, 'ham']


## `list.remove()`

In [27]:
my_list = ['Python', 'is', 'sometimes', 'fun']
my_list.remove('sometimes')
print(my_list)

# If you are not sure that the value is in the list, better to check first:
if 'Java' in my_list:
    my_list.remove('Java')
else:
    print('Java is not part of this story.')

['Python', 'is', 'fun']
Java is not part of this story.
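
Two details of `remove()` are worth keeping in mind: it deletes only the *first* matching element, and it raises a `ValueError` when the value is absent. A minimal sketch, with an illustrative list:

```python
words = ['spam', 'eggs', 'spam']

# Only the first 'spam' is removed
words.remove('spam')
print(words)  # ['eggs', 'spam']

# Removing a missing value raises ValueError, which can be caught
try:
    words.remove('ham')
    removed = True
except ValueError:
    removed = False
print(removed)  # False
```

Catching the exception is an alternative to checking with `in` first.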


## `list.sort()`

In [5]:
numbers = [8, 1, 6, 5, 10]
numbers.sort()
print('numbers: {}'.format(numbers))

numbers.sort(reverse=True)
print('numbers reversed: {}'.format(numbers))

words = ['this', 'is', 'a', 'list', 'of', 'words']
words.sort()
print('words:', words)

numbers: [1, 5, 6, 8, 10]
numbers reversed: [10, 8, 6, 5, 1]
words: ['a', 'is', 'list', 'of', 'this', 'words']
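
Both `list.sort()` and `sorted()` also accept a `key` function that is applied to each element before comparing. The sketch below sorts words by length and names without regard to case; the sample data is illustrative.

```python
words = ['this', 'is', 'a', 'list', 'of', 'words']

# Sort by length; words of equal length keep their original order
words.sort(key=len)
print(words)  # ['a', 'is', 'of', 'this', 'list', 'words']

names = ['bob', 'alice', 'Carol']
default_order = sorted(names)                  # uppercase sorts before lowercase
ignoring_case = sorted(names, key=str.lower)   # compare lower-cased versions
print(default_order)   # ['Carol', 'alice', 'bob']
print(ignoring_case)   # ['alice', 'bob', 'Carol']
```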


## `sorted(list)`
While `list.sort()` sorts the list in-place, `sorted(list)` returns a new list and leaves the original untouched:

In [29]:
numbers = [8, 1, 6, 5, 10]
sorted_numbers = sorted(numbers)
print('numbers: {}, sorted: {}'.format(numbers, sorted_numbers))

numbers: [8, 1, 6, 5, 10], sorted: [1, 5, 6, 8, 10]


## `list.extend()`

In [30]:
first_list = ['beef', 'ham']
second_list = ['potatoes', 1, 3]
first_list.extend(second_list)
print('first: {}, second: {}'.format(first_list, second_list))

first: ['beef', 'ham', 'potatoes', 1, 3], second: ['potatoes', 1, 3]


Alternatively, you can extend a list with the `+=` operator, or build a new list with `+`:

In [31]:
first = [1, 2, 3]
second = [4, 5]
first += second  # extends first in place, like first.extend(second)
print('first: {}'.format(first))

# If you need a new list
summed = first + second
print('summed: {}'.format(summed))

first: [1, 2, 3, 4, 5]
summed: [1, 2, 3, 4, 5, 4, 5]
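
For lists, `+=` and `+` are not fully interchangeable: `+=` modifies the existing list object, while `+` builds a new one. The difference becomes visible when a second name refers to the same list, as in this small sketch (the name `alias` is illustrative):

```python
first = [1, 2, 3]
alias = first          # a second name for the same list object

first += [4]           # in place: the change is visible through alias
print(alias)           # [1, 2, 3, 4]

first = first + [5]    # new list; `first` is rebound, alias is not
print(first)           # [1, 2, 3, 4, 5]
print(alias)           # [1, 2, 3, 4]
```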


## `list.reverse()`

In [32]:
my_list = ['a', 'b', 'ham']
my_list.reverse()
print(my_list)

['ham', 'b', 'a']
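
Like `sort()`, `reverse()` works in place. When the original order must be kept, a reversed copy can be built instead, either with the built-in `reversed()` or with a negative-step slice. A minimal sketch:

```python
my_list = ['a', 'b', 'ham']

rev_copy = list(reversed(my_list))  # reversed() returns an iterator
sliced = my_list[::-1]              # slicing returns a new list

print(rev_copy)  # ['ham', 'b', 'a']
print(sliced)    # ['ham', 'b', 'a']
print(my_list)   # ['a', 'b', 'ham'] (unchanged)
```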


## Tuples
Tuples are in many ways similar to lists, but they are defined with parentheses rather than square brackets:

In [33]:
t = (1, 2, 3)

They can also be defined without any brackets at all:

In [34]:
t = 1, 2, 3
print(t)

(1, 2, 3)


Like the lists discussed before, tuples have a length, and individual elements can be extracted using square-bracket indexing:

In [35]:
len(t)

3

In [36]:
t[0]

1

The main distinguishing feature of tuples is that they are *immutable*: this means that once they are created, their size and contents cannot be changed:

In [37]:
t[1] = 4

TypeError: 'tuple' object does not support item assignment

In [38]:
t.append(4)

AttributeError: 'tuple' object has no attribute 'append'

Tuples are often used in a Python program; a particularly common case is in functions that have multiple return values.
For example, the ``as_integer_ratio()`` method of floating-point objects returns a numerator and a denominator; this dual return value comes in the form of a tuple:

In [39]:
x = 0.125
x.as_integer_ratio()

(1, 8)

These multiple return values can be individually assigned as follows:

In [40]:
numerator, denominator = x.as_integer_ratio()
print(numerator / denominator)

0.125
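
The same pattern applies to functions you write yourself. The sketch below defines a hypothetical helper, `min_and_max`, that returns two values as a tuple and unpacks them at the call site.

```python
def min_and_max(values):
    """Return the smallest and largest item as a 2-tuple (hypothetical helper)."""
    return min(values), max(values)


result = min_and_max([8, 1, 6, 5, 10])
print(result)  # (1, 10)

# Unpack the tuple into two names
lowest, highest = min_and_max([8, 1, 6, 5, 10])
print(lowest, highest)  # 1 10
```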


The indexing and slicing logic covered earlier for lists works for tuples as well, along with a host of other methods.
Refer to the online [Python documentation](https://docs.python.org/3/tutorial/datastructures.html) for a more complete list of these.

# String Type
Strings in Python are created with single or double quotes:

In [41]:
message = "what do you like?"
response = 'spam'

Python has many extremely useful string functions and methods; here are a few of them:

In [42]:
# length of string
len(response)

4

In [43]:
# Make upper-case. See also str.lower()
response.upper()

'SPAM'

In [44]:
# Capitalize. See also str.title()
message.capitalize()

'What do you like?'

In [45]:
# concatenation with +
message + response

'what do you like?spam'

In [46]:
# multiplication is multiple concatenation
5 * response

'spamspamspamspamspam'

In [47]:
# Access individual characters (zero-based indexing)
message[0]

'w'

# String Manipulation and Regular Expressions

One place where the Python language really shines is in the manipulation of strings.
This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of *regular expressions*.
Such string manipulation patterns come up often in the context of data science work, and are one big perk of Python in this context.

Strings in Python can be defined using either single or double quotations (they are functionally equivalent):

In [48]:
x = 'a string'
y = "a string"
x == y

True

In addition, it is possible to define multi-line strings using a triple-quote syntax:

In [49]:
multiline = """
one
two
three
"""

With this, let's take a quick tour of some of Python's string manipulation tools.

## Simple String Manipulation in Python

For basic manipulation of strings, Python's built-in string methods can be extremely convenient.
If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing.
We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.

### Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string.
Here we'll look at the ``upper()``, ``lower()``, ``capitalize()``, ``title()``, and ``swapcase()`` methods, using the following messy string as an example:

In [50]:
fox = "tHe qUICk bROWn fOx."

To convert the entire string into upper-case or lower-case, you can use the ``upper()`` or ``lower()`` methods respectively:

In [51]:
fox.upper()

'THE QUICK BROWN FOX.'

In [52]:
fox.lower()

'the quick brown fox.'

A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence.
This can be done with the ``title()`` and ``capitalize()`` methods:

In [53]:
fox.title()

'The Quick Brown Fox.'

In [54]:
fox.capitalize()

'The quick brown fox.'

The cases can be swapped using the ``swapcase()`` method:

In [55]:
fox.swapcase()

'ThE QuicK BrowN FoX.'

### Formatting strings: Adding and removing spaces

Another common need is to remove spaces (or other characters) from the beginning or end of the string.
The basic method of removing characters is the ``strip()`` method, which strips whitespace from the beginning and end of the line:

In [56]:
line = '         this is the content         '
line.strip()

'this is the content'

To remove just space to the right or left, use ``rstrip()`` or ``lstrip()`` respectively:

In [57]:
line.rstrip()

'         this is the content'

In [58]:
line.lstrip()

'this is the content         '

To remove characters other than spaces, you can pass the desired character to the ``strip()`` method:

In [59]:
num = "000000000000435"
num.strip('0')

'435'
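
The argument to ``strip()`` is treated as a *set* of characters to remove, and only the ends of the string are affected. The following sketch, with made-up input strings, shows both points:

```python
code = '0043050'

both_ends = code.strip('0')    # zeros inside the string are kept
left_only = code.lstrip('0')   # only leading zeros are removed
print(both_ends)  # 4305
print(left_only)  # 43050

# Several characters can be given; any combination of them is stripped
title = '--==title==--'.strip('-=')
print(title)  # title
```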

The opposite of this operation, adding spaces or other characters, can be accomplished using the ``center()``, ``ljust()``, and ``rjust()`` methods.

For example, we can use the ``center()`` method to center a given string within a given number of spaces:

In [60]:
line = "this is the content"
line.center(30)

'     this is the content      '

Similarly, ``ljust()`` and ``rjust()`` will left-justify or right-justify the string within spaces of a given length:

In [61]:
line.ljust(30)

'this is the content           '

In [62]:
line.rjust(30)

'           this is the content'

All these methods additionally accept any character which will be used to fill the space.
For example:

In [63]:
'435'.rjust(10, '0')

'0000000435'

Because zero-filling is such a common need, Python also provides ``zfill()``, which is a special method to left-pad a string with zeros:

In [64]:
'435'.zfill(10)

'0000000435'

### Finding and replacing substrings

If you want to find occurrences of a certain character in a string, the ``find()``/``rfind()``, ``index()``/``rindex()``, and ``replace()`` methods are the best built-in methods.

``find()`` and ``index()`` are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

In [65]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

16

In [66]:
line.index('fox')

16

The only difference between ``find()`` and ``index()`` is their behavior when the search string is not found; ``find()`` returns ``-1``, while ``index()`` raises a ``ValueError``:

In [67]:
line.find('bear')

-1

In [68]:
line.index('bear')

ValueError: substring not found
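
In practice, the choice between the two methods determines how the "not found" case is handled. The sketch below shows one way to handle each; it also uses the ``in`` operator, which is enough when only a yes/no answer is needed.

```python
line = 'the quick brown fox jumped over a lazy dog'

# find() returns 0 for a match at the very start, and 0 is falsy,
# so compare against -1 explicitly rather than writing `if line.find(...)`
starts_at = line.find('the')
found_the = starts_at != -1

# index() signals a missing substring with an exception
try:
    bear_at = line.index('bear')
except ValueError:
    bear_at = None

has_fox = 'fox' in line
print(starts_at, found_the, bear_at, has_fox)  # 0 True None True
```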

The related ``rfind()`` and ``rindex()`` work similarly, except they search for the first occurrence from the end rather than the beginning of the string:

In [69]:
line.rfind('a')

35

For the special case of checking for a substring at the beginning or end of a string, Python provides the ``startswith()`` and ``endswith()`` methods:

In [70]:
line.endswith('dog')

True

In [71]:
line.startswith('fox')

False

To go one step further and replace a given substring with a new string, you can use the ``replace()`` method.
Here, let's replace ``'brown'`` with ``'red'``:

In [72]:
line.replace('brown', 'red')

'the quick red fox jumped over a lazy dog'

The ``replace()`` method returns a new string, and will replace all occurrences of the input:

In [73]:
line.replace('o', '--')

'the quick br--wn f--x jumped --ver a lazy d--g'

For a more flexible approach to this ``replace()`` functionality, see the discussion of regular expressions in [Flexible Pattern Matching with Regular Expressions](#Flexible-Pattern-Matching-with-Regular-Expressions).

### Splitting and partitioning strings

If you would like to find a substring *and then* split the string based on its location, the ``partition()`` and/or ``split()`` methods are what you're looking for.
Both will return a sequence of substrings.

The ``partition()`` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [74]:
line.partition('fox')

('the quick brown ', 'fox', ' jumped over a lazy dog')

The ``rpartition()`` method is similar, but searches from the right of the string.

The ``split()`` method is perhaps more useful; it finds *all* instances of the split-point and returns the substrings in between.
The default is to split on any whitespace, returning a list of the individual words in a string:

In [75]:
line.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
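
``split()`` also accepts an explicit separator and a maximum number of splits. The sketch below uses illustrative strings to show these options, and how an explicit ``' '`` separator differs from the default whitespace splitting:

```python
fields = 'name,age,city'.split(',')
print(fields)  # ['name', 'age', 'city']

# The default treats runs of whitespace as one separator...
default_split = 'a  b'.split()
# ...while an explicit ' ' produces an empty string between two spaces
explicit_split = 'a  b'.split(' ')
print(default_split)   # ['a', 'b']
print(explicit_split)  # ['a', '', 'b']

# A second argument limits the number of splits
pair = 'key=value=more'.split('=', 1)
print(pair)  # ['key', 'value=more']
```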

A related method is ``splitlines()``, which splits on newline characters.
Let's do this with a Haiku, popularly attributed to the 17th-century poet Matsuo Bashō:

In [76]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

Note that if you would like to undo a ``split()``, you can use the ``join()`` method, which returns a string built from a splitpoint and an iterable:

In [77]:
'--'.join(['1', '2', '3'])

'1--2--3'

A common pattern is to use the special character ``"\n"`` (newline) to join together lines that have been previously split, and recover the input:

In [78]:
print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya


## Format Strings

In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats.
Another use of string methods is to manipulate string *representations* of values of other types.
Of course, string representations can always be found using the ``str()`` function; for example:

In [79]:
pi = 3.14159
str(pi)

'3.14159'

For more complicated formats, you might be tempted to use string arithmetic as outlined in [Basic Python Semantics: Operators](04-Semantics-Operators.ipynb):

In [80]:
"The value of pi is " + str(pi)

'The value of pi is 3.14159'

A more flexible way to do this is to use *format strings*, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted.
Here is a basic example:

In [81]:
"The value of pi is {}".format(pi)

'The value of pi is 3.14159'

Inside the ``{}`` marker you can also include information on exactly *what* you would like to appear there.
If you include a number, it will refer to the index of the argument to insert:

In [82]:
"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

'First letter: A. Last letter: Z.'

If you include a string, it will refer to the key of any keyword argument:

In [83]:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')

'First letter: A. Last letter: Z.'

Finally, for numerical inputs, you can include format codes which control how the value is converted to a string.
For example, to print a number as a floating point with three digits after the decimal point, you can use the following:

In [84]:
"pi = {0:.3f}".format(pi)

'pi = 3.142'

As before, here the "``0``" refers to the index of the value to be inserted.
The "``:``" marks that format codes will follow.
The "``.3f``" encodes the desired precision: three digits beyond the decimal point, floating-point format.

This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available.
For more information on the syntax of these format strings, see the [Format Specification](https://docs.python.org/3/library/string.html#formatspec) section of Python's online documentation.
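
As a small sample of what the format specification allows, the following sketch shows alignment within a fixed width, a thousands separator, percentages, and zero-padding. The values are arbitrary.

```python
right = '{:>8}'.format('right')      # right-align in 8 characters
left = '{:<8}'.format('left')        # left-align in 8 characters
middle = '{:^8}'.format('mids')      # center in 8 characters
print(repr(right), repr(left), repr(middle))

thousands = '{:,}'.format(1234567)   # comma as thousands separator
percent = '{:.1%}'.format(0.256)     # percentage with one decimal
padded = '{:05d}'.format(42)         # integer zero-padded to width 5
print(thousands, percent, padded)    # 1,234,567 25.6% 00042
```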

# [File I/O](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)
Reading and writing files.

## Working with paths

In [85]:
import os

current_file = os.path.realpath('file_io.ipynb')  
print('current file: {}'.format(current_file))
# Note: in .py files you can get the path of current file by __file__

current_dir = os.path.dirname(current_file)  
print('current directory: {}'.format(current_dir))
# Note: in .py files you can get the dir of current file by os.path.dirname(__file__)

data_dir = os.path.join(os.path.dirname(current_dir), 'data')
print('data directory: {}'.format(data_dir))

current file: /Users/ddl/DS-Bootcamp/Pre-Semana1/NOTEBOOKS/file_io.ipynb
current directory: /Users/ddl/DS-Bootcamp/Pre-Semana1/NOTEBOOKS
data directory: /Users/ddl/DS-Bootcamp/Pre-Semana1/data


### Checking if path exists

In [86]:
print('exists: {}'.format(os.path.exists(data_dir)))
print('is file: {}'.format(os.path.isfile(data_dir)))
print('is directory: {}'.format(os.path.isdir(data_dir)))

exists: False
is file: False
is directory: False


## Reading files

In [87]:
file_path = os.path.join(data_dir, 'simple_file.txt')

with open(file_path, 'r') as simple_file:
    for line in simple_file:
        print(line.strip())

FileNotFoundError: [Errno 2] No such file or directory: '/Users/ddl/DS-Bootcamp/Pre-Semana1/data/simple_file.txt'

The [`with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) statement is for obtaining a [context manager](https://docs.python.org/3/reference/datamodel.html#with-statement-context-managers) that will be used as an execution context for the commands inside the `with`. Context managers guarantee that certain operations are done when exiting the context. 

In this case, the context manager guarantees that `simple_file.close()` is called implicitly when exiting the context. This makes developers' lives easier: you don't have to remember to explicitly close the file you opened, nor worry about an exception occurring while the file is open. An unclosed file may be a source of resource leaks. Thus, always prefer the `with open()` structure for file I/O.

For comparison, here is the same example as above without `with`:

In [88]:
file_path = os.path.join(data_dir, 'simple_file.txt')

# THIS IS NOT THE PREFERRED WAY
simple_file = open(file_path, 'r')
for line in simple_file:
    print(line.strip())
simple_file.close()  # This has to be called explicitly 

FileNotFoundError: [Errno 2] No such file or directory: '/Users/ddl/DS-Bootcamp/Pre-Semana1/data/simple_file.txt'
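
The version without `with` has a further weakness: if an exception is raised inside the loop, `close()` is never reached. A manual version needs `try`/`finally` to behave like the `with` statement. The sketch below compares the two; it writes a throwaway file in a temporary directory (via the standard `tempfile` module) so that it runs regardless of whether the data directory exists.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small file to read back (illustrative content)
    file_path = os.path.join(tmp_dir, 'simple_file.txt')
    with open(file_path, 'w') as f:
        f.write('first\nsecond\n')

    # Manual version: finally guarantees close() even if reading fails
    simple_file = open(file_path, 'r')
    try:
        manual_lines = [line.strip() for line in simple_file]
    finally:
        simple_file.close()

    # Equivalent version with a context manager
    with open(file_path, 'r') as managed_file:
        managed_lines = [line.strip() for line in managed_file]

print(manual_lines == managed_lines)            # True
print(simple_file.closed, managed_file.closed)  # True True
```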

## Writing files

In [89]:
new_file_path = os.path.join(data_dir, 'new_file.txt')

with open(new_file_path, 'w') as my_file:
    my_file.write('This is my first file that I wrote with Python.')

FileNotFoundError: [Errno 2] No such file or directory: '/Users/ddl/DS-Bootcamp/Pre-Semana1/data/new_file.txt'

Now check that there is a `new_file.txt` in the data directory. After that, you can delete the file with:

In [90]:
if os.path.exists(new_file_path):  # make sure it's there
    os.remove(new_file_path)
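
Opening a file in `'w'` mode creates the file if needed, but not its parent directory; a missing directory is one possible cause of a `FileNotFoundError` when writing. The sketch below is a self-contained round trip under that assumption: it creates a `data` directory inside a temporary location with `os.makedirs`, writes a file, and reads it back. The directory and file contents are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    data_dir = os.path.join(base_dir, 'data')
    os.makedirs(data_dir, exist_ok=True)  # create the directory first

    file_path = os.path.join(data_dir, 'new_file.txt')

    # Write two lines; write() does not add newlines on its own
    with open(file_path, 'w') as my_file:
        my_file.write('first line\n')
        my_file.write('second line\n')

    # Read the lines back, stripping the trailing newline of each
    with open(file_path, 'r') as my_file:
        lines = [line.strip() for line in my_file]

print(lines)  # ['first line', 'second line']
```

The temporary directory and everything in it are removed automatically when the outer `with` block ends.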