# Introduction to Python - Lecture 07 (29Oct 2018)

### Agenda for today:
+ Working with files and filesystem:
    - basics of file handling in python
+ Introduction to Numpy
+ Introduction to Pandas
+ Introduction to Seaborn

# Data Persistence

+ Files
    + <font color='blue'>**\*.txt**</font>, \*.xml, \*.json
    + \*.csv, \*.tab, \*.xlsx (covered later, with pandas)
+ Databases (covered later in the course), for when you want to capture relationships between data / entities

# Built-in *<font color='blue'>file</font>* object

+ Basic format:
```python
fh = open('<filename>', '<mode>')   # Creates a file object fh
```
+ *filename* can be _**absolute**_ or _**relative**_
+ *mode*: {'r', 'w', 'a'}; Default='r'
+ 'r': open file for reading if exists, else **<font color='blue'>FileNotFoundError</font>**
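+ 'w': open a new file for writing; **overwrites** the file if it already exists
+ 'a': open a file for appending; new data is added at the end, so existing content is not overwritten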

```python
fh = open('data/data.txt')
type(fh)
dir(fh)
```

+ If the file does not exist, `open` raises a **FileNotFoundError** with a traceback
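One way to deal with a missing file is to catch the exception instead of letting the traceback stop the program. The sketch below is only an illustration: the file name `no_such_file.txt` and the variable `outcome` are made up for the example, and a temporary directory is used so that the path is guaranteed not to exist.

```python
import os
import tempfile

# A freshly created temporary directory is empty, so this path cannot exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    missing_path = os.path.join(tmp_dir, 'no_such_file.txt')   # hypothetical name
    try:
        fh = open(missing_path, 'r')
    except FileNotFoundError as err:
        # err.filename holds the path that could not be opened
        outcome = 'missing'
        print('Could not open:', err.filename)
    else:
        # only runs when open() succeeded
        outcome = 'opened'
        fh.close()

print(outcome)
```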

**Notes**:
+ *fh* is not the file itself, but a handle/reference to it. Use it to do desired operations (read/write).
<br />  
![alt text](filehandle.svg)
<br />
+ <font color='blue'>Some additional mode options: 'rb', 'wb' for reading and writing binary files; '+' to open the file for both reading and writing.</font>
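A minimal sketch of the additional modes, assuming scratch files in a temporary directory (the names `sample.bin` and `sample.txt` are placeholders, not files from the lecture):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    bin_path = os.path.join(tmp_dir, 'sample.bin')    # hypothetical file name
    text_path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical file name

    # 'wb': write raw bytes (note the b'' literal, not a str)
    with open(bin_path, 'wb') as fh:
        fh.write(b'\x00\x01\x02abc')

    # 'rb': read the bytes back unchanged
    with open(bin_path, 'rb') as fh:
        raw = fh.read()

    # 'w+': create/truncate the file and allow reading as well as writing
    with open(text_path, 'w+') as fh:
        fh.write('hello\n')
        fh.seek(0)               # move back to the start before reading
        round_trip = fh.read()

print(type(raw), raw)
print(repr(round_trip))
```

In binary mode the data comes back as a `bytes` object rather than a string.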

# Reading from Files in "text mode"
+ File content is always read in as strings  
<br />
+ Here are the **most common approaches**:

    - **Read all data at once as a string**

    ```python
    import pprint
    fh = open('data/data.txt', 'r')
    data = fh.read()              # to read in all data as one big string
    print(type(data), '\n\n')
    print(data)
    print('\n', data.split('\n'))   # split the big string on the newline character (\n);
                                    # in text mode, Windows '\r\n' endings are read as '\n'
    ```

+ **File pointers and the *<font color='blue'>seek</font>* operation**

```python
data_read_again = fh.read()        # can't read more without resetting the read pointer
print("length of data_read_again is: ", len(data_read_again))
fh.seek(0, 0)          # reset the pointer to beginning of the file: fh.seek(offset, from_what)
                       # https://docs.python.org/3/tutorial/inputoutput.html
data_now = fh.read()
print(data_now)
```
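The pointer behaviour can be checked on a small scratch file. This is a sketch under the assumption that a throwaway file (`demo.txt`, a made-up name) is acceptable; `fh.tell()` reports the current position of the pointer.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')   # hypothetical scratch file
    with open(path, 'w') as fh:
        fh.write('abc\ndef\n')

    with open(path, 'r') as fh:
        start = fh.tell()      # position before anything is read
        first = fh.read()      # reads everything; pointer is now at the end
        second = fh.read()     # nothing left to read, so this is ''
        fh.seek(0)             # rewind to the beginning
        third = fh.read()      # the full content again

print(start, repr(first), repr(second), repr(third))
```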

+ **Read individual lines as strings**
```python
fh.seek(0, 0)
data = fh.readlines()       # returns list of strings
print(type(data), "\n\n")   # check out the \n newline character at the end of each line
print(data)         # (Windows '\r\n' endings are also read as '\n' in text mode)
```

+ **Iterate over large files**
```python
fh.seek(0, 0)
for line in fh:           # 'fh' is iterable; use in iteration context for efficient
    print(len(line), line)  # reading of large files
fh.close()                # close file; good practice (esp. when writing files)
```

+ **Context manager**
```python
with open('data/data.txt', 'r') as fh:  # Context-manager; automatically closes the file
    for line in fh:
        print(line)
print('\nFile closed? : ', fh.closed)
```

# Writing to Files in "text mode"
+ Like reading, writing is also done as strings
<br />
+ Here are the **most common approaches**:

```python
fh = open('data/fresh.txt', 'w')           # Open a new file in write mode
fh.write('This is the 1st line\n')    # Write a line; 
                                      # Note that newline chars must be explicitly added
fh.close()
```



+ **Flushing buffers**

```python
fh = open('data/fresh.txt', 'a')           # Open the earlier file in 'append' mode
                                      #    to avoid overwriting
fh.write('This is the 2nd line\n')    # Write a line; 
                                      # Note that newline chars must be explicitly added
fh.flush()                            # Clears the buffer
```

+ **Write multiple lines at once**

```python
fh.writelines(['This is the 3rd line\n', 'This is the 4th line\n'])   # Note the newline
fh.flush()
```

+ **Write iteratively**

```python
more_lines = ['5th line', '6th line', '7th line']

for line in more_lines:       # iteration context
    fh.write(line + '\n')
    fh.flush()                # Don't need to add flush after every write: inefficient; let python handle it
fh.close()
```
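The writing steps above can also be combined with the context manager from the reading section, which closes (and therefore flushes) the file automatically. A minimal sketch, assuming a scratch file in a temporary directory so that nothing in `data/` is touched:

```python
import os
import tempfile

lines_to_write = ['1st line', '2nd line', '3rd line']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fresh.txt')   # hypothetical scratch file

    # 'w' creates the file; leaving the with-block flushes and closes it
    with open(path, 'w') as fh:
        for line in lines_to_write:
            fh.write(line + '\n')

    # 'a' appends without overwriting the existing content
    with open(path, 'a') as fh:
        fh.writelines(['4th line\n', '5th line\n'])

    # read back, stripping the trailing newline from each line
    with open(path, 'r') as fh:
        lines_read = [line.rstrip('\n') for line in fh]

print(lines_read)
```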



### There are specialized modules to work with specific file formats:
1. csv (comma-separated values)
2. xlrd (excel documents): **pip install** xlrd OR **conda install** xlrd
3. json (hierarchical text-based format, similar to python dictionaries)
4. yaml (another hierarchical text-based format like json): **pip install** pyyaml OR **conda install** pyyaml
5. pandas (works with csv, tsv, xls and many more formats): **pip install pandas** OR **conda install pandas**
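As a small taste of two of these modules, here is a sketch that round-trips some made-up data through the standard-library `csv` and `json` modules. The file names and the sample values are assumptions for illustration only.

```python
import csv
import json
import os
import tempfile

rows = [['name', 'score'], ['alice', '90'], ['bob', '85']]   # made-up sample data
record = {'course': 'python', 'lecture': 7}                  # made-up sample data

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'scores.csv')
    json_path = os.path.join(tmp_dir, 'record.json')

    # newline='' is recommended by the csv documentation when opening csv files
    with open(csv_path, 'w', newline='') as fh:
        csv.writer(fh).writerows(rows)
    with open(csv_path, 'r', newline='') as fh:
        rows_read = list(csv.reader(fh))   # every field comes back as a string

    with open(json_path, 'w') as fh:
        json.dump(record, fh)
    with open(json_path, 'r') as fh:
        record_read = json.load(fh)        # numbers keep their type in JSON

print(rows_read)
print(record_read)
```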

## os.path module
```python
import os
dir(os.path)  # Note that path is another module that os module imports. 
              # When we import os, path becomes available as a module variable within os namespace.
              # --> module / namespace hierarchy
```
+ **path parsing:**
    - os.path.split(<path_str>): splits a path into a (directory, file name) pair
    - os.path.splitext(<path_str>): splits a path into a (root, extension) pair
```python
# Ex.
print(os.path.split('/Users/groveh01/Documents/my_data.txt'))
print(os.path.splitext('my_data.txt'))
```
+ **path building:**
    - os.path.join(<path_components>)
```python
# Ex.
print(os.path.join('/Users', 'groveh01', 'Documents', 'Teaching'))
```
+ **common tests:**
    - os.path.<test>, where test = {isdir(), isfile(), exists(), ...}
```python
# Ex.
print(os.path.isdir('data/data.txt'))
print(os.path.isfile('data/data.txt'))
print(os.path.exists('data'))
print(os.path.exists('data.txt'))
print(os.path.exists('/Users/groveh01'))
```
+ **listing contents of a dir:**
    - os.listdir
```python
# Ex.
import pprint
pprint.pprint(os.listdir('/Users/groveh01'))
pprint.pprint(os.listdir('.'))
pprint.pprint(os.listdir('..'))
```
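The paths in the examples above only exist on the lecturer's machine. The sketch below runs the same `os.path` functions against a temporary directory instead, so the results are the same everywhere; the directory and file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # build a small directory tree: <tmp_dir>/data/my_data.txt
    data_dir = os.path.join(tmp_dir, 'data')
    os.mkdir(data_dir)
    file_path = os.path.join(data_dir, 'my_data.txt')
    with open(file_path, 'w') as fh:
        fh.write('hello\n')

    head, tail = os.path.split(file_path)    # (directory, file name)
    root, ext = os.path.splitext(tail)       # (name without extension, extension)
    checks = (
        os.path.isdir(data_dir),
        os.path.isfile(file_path),
        os.path.exists(os.path.join(data_dir, 'missing.txt')),
    )
    contents = os.listdir(data_dir)

print(tail, root, ext)
print(checks)
print(contents)
```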

## Numpy

+ A package for scientific computing
+ Provides arrays: a more powerful, multi-dimensional alternative to lists for numeric data
+ Most operations run in compiled code, so they are much faster than equivalent Python loops
+ Great for linear algebra, statistical analysis

Numpy is not part of Python's standard library and needs to be installed.

This can be done using the conda command if you are using Anaconda:

```bash
conda install numpy
```

Alternatively, this can be done using Python's package manager, pip:

```bash
pip install numpy
```

Once numpy is installed, it needs to be imported before its functionality can be used:

```python
import numpy as np
```


### Lists recap

Lists are created using '[]'

```python
l1 = [1, 2, 3, 4]
```

Lists can contain mixed types

```python
l2 = [1, 'a', {'abs': abs}, (1, 2)]
```


Lists can be joined using the + operator which creates a new list

```python
l3 = l1 + l2
```

Lists can be extended using the .extend() method which happens in place

```python
l1.extend(l2)
```

The range() function can be used to initialize numeric lists

```python
base = list(range(0, 101, 2))
```

To perform calculations on the elements of a list, a for loop (or a list comprehension) is required

```python
base_squared = []
for x in base:
    base_squared.append(x**2)
```
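For comparison, here is a sketch of the same calculation written three ways: the explicit loop from above, a list comprehension, and a numpy array. The variable names are chosen for this example only.

```python
import numpy as np

base = list(range(0, 101, 2))

# 1. explicit for loop, as above
base_squared_loop = []
for x in base:
    base_squared_loop.append(x**2)

# 2. list comprehension: the same loop in a single expression
base_squared_comp = [x**2 for x in base]

# 3. numpy: the operation is applied to every element at once
base_squared_np = np.array(base) ** 2

print(base_squared_comp[:5])
print(base_squared_np[:5])
```

All three give the same values; the numpy version avoids the Python-level loop.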

### Creating Arrays with Numpy

There are a number of ways of creating numpy arrays.

##### Converting a regular python list into a numpy array:

```python
var = np.array(< list >)
```

This is useful when the original list needs to be constructed from a file. Numpy arrays have a fixed size, and appending to one (for example with np.append) creates a new copy of the array each time. For this reason it is often easier to build a normal python list first and then convert it to a numpy array.

```python
characters = []
for i in range(32, 100):
    characters.append(chr(i))
print(characters)
np_char = np.array(characters)
print(np_char)
```

##### Initializing arrays using numpy

1. Creating an array with *n* zeros

```python
lst = np.zeros(25)
lst_2d = np.zeros((5, 5))
```

2. Creating an array with *n* ones

```python
lst = np.ones(25)
lst_2d = np.ones((5, 5))
```

3. Using a range: arange(start, end, increment); the end value is not included

```python
lst = np.arange(10, 20, 2)
```

4. Linspace is similar to arange, but takes the number of elements instead of an increment and includes the end value: linspace(start, end, number_of_elements)

```python
lst = np.linspace(10, 20, 2)
```

5. Filling an array with random numbers drawn uniformly from [0, 1)

```python
lst = np.random.rand(5)
lst_2d = np.random.rand(5, 5)   # rand takes the dimensions as separate arguments, not as a tuple
```
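The difference between `arange` and `linspace` is easiest to see side by side. A short sketch (the variable names are made up for this example):

```python
import numpy as np

stepped = np.arange(10, 20, 2)      # step of 2; the end value 20 is excluded
spaced = np.linspace(10, 20, 5)     # 5 evenly spaced values; the end value 20 is included
rand_2d = np.random.rand(5, 5)      # 5 x 5 array of values in [0, 1)

print(stepped)         # integers: 10, 12, 14, 16, 18
print(spaced)          # floats: 10, 12.5, 15, 17.5, 20
print(rand_2d.shape)   # (5, 5)
```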

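##### Mathematical operations

Numpy simplifies numerical work such as linear algebra in Python. Note that the arithmetic operators shown here (+, *, /) act element by element, so they do not always match the corresponding matrix operations; for a true matrix product, use the @ operator or np.dot.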

1. Addition
    1. Adding a scalar to a matrix is not defined in standard linear algebra; in numpy, the scalar is 
    added to each element of the matrix
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} + C
        $
        $ =
            \begin{bmatrix} 
            a_{0,0} + C & a_{0,1} + C & \cdots & a_{0,n} + C \\
            a_{1,0} + C & a_{1,1} + C & \cdots & a_{1,n} + C \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} + C & a_{m,1} + C & \cdots & a_{m,n} + C \\
            \end{bmatrix}
        $

    ```python
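    n, m, C = 3, 4, 10               # example values: 3 rows, 4 columns, scalar 10
    lst_2d = np.ones((n, m)) + C     # every element becomes 1 + 10 = 11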
    ```

1. 
    2. Adding two numpy arrays adds them element by element; the arrays need the same shape (or shapes that numpy can broadcast to a common shape)
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} + 
            \begin{bmatrix} 
            b_{0,0} & b_{0,1} & \cdots & b_{0,n} \\
            b_{1,0} & b_{1,1} & \cdots & b_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            b_{m,0} & b_{m,1} & \cdots & b_{m,n} \\
            \end{bmatrix} =
            \begin{bmatrix} 
            a_{0,0} + b_{0,0} & a_{0,1} + b_{0,1} & \cdots & a_{0,n} + b_{0,n} \\
            a_{1,0} + b_{1,0} & a_{1,1} + b_{1,1} & \cdots & a_{1,n} + b_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} + b_{m,0} & a_{m,1} + b_{m,1} & \cdots & a_{m,n} + b_{m,n} \\
            \end{bmatrix}
        $

    ```python
    lst_2d = np.ones((n, m)) + np.ones((n, m))
    ```
    

2. Multiplication
    1. By a scalar - each element is multiplied by the scalar
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} * C
        $
        $ =
            \begin{bmatrix} 
            a_{0,0} * C & a_{0,1} * C & \cdots & a_{0,n} * C \\
            a_{1,0} * C & a_{1,1} * C & \cdots & a_{1,n} * C \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} * C & a_{m,1} * C & \cdots & a_{m,n} * C \\
            \end{bmatrix}
        $

    ```python
    lst_2d = np.ones((n, m)) * C
    ```

2.    
    2. Multiplying two equally sized matrices results in element-wise multiplication
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} * 
            \begin{bmatrix} 
            b_{0,0} & b_{0,1} & \cdots & b_{0,n} \\
            b_{1,0} & b_{1,1} & \cdots & b_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            b_{m,0} & b_{m,1} & \cdots & b_{m,n} \\
            \end{bmatrix} =
            \begin{bmatrix} 
            a_{0,0} * b_{0,0} & a_{0,1} * b_{0,1} & \cdots & a_{0,n} * b_{0,n} \\
            a_{1,0} * b_{1,0} & a_{1,1} * b_{1,1} & \cdots & a_{1,n} * b_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} * b_{m,0} & a_{m,1} * b_{m,1} & \cdots & a_{m,n} * b_{m,n} \\
            \end{bmatrix}
        $

    ```python
    lst_2d = np.ones((n, m)) * np.ones((n, m))
    ```
    

2. 
    3. Multiplying a matrix by a vector whose length equals the number of columns results in each row being multiplied element-wise by the vector
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} * 
            \begin{bmatrix} 
            b_{0} & b_{1} & \cdots & b_{n}
            \end{bmatrix} =
            \begin{bmatrix} 
            a_{0,0} * b_{0} & a_{0,1} * b_{1} & \cdots & a_{0,n} * b_{n} \\
            a_{1,0} * b_{0} & a_{1,1} * b_{1} & \cdots & a_{1,n} * b_{n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} * b_{0} & a_{m,1} * b_{1} & \cdots & a_{m,n} * b_{n} \\
            \end{bmatrix}
        $

    ```python
    lst_2d = np.ones((6, 5)) * np.array([1, 2, 3, 2, 1])
    ```

3. Division follows the same rules as multiplication

    ```python
    lst_2d = np.ones((6, 5)) / np.array([1, 2, 3, 2, 1])
    ```
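The rules above can be tried on a small array with concrete numbers. This is a sketch with made-up values; `a`, `b` and the result names are not from the lecture.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
b = np.array([10, 20, 30])         # shape (3,): one value per column of a

plus_scalar = a + 10               # scalar is added to every element
elementwise = a * a                # same shape: element-wise product
row_scaled = a * b                 # each row of a is multiplied by b
divided = a / b                    # division follows the same rules

try:
    a * np.array([1, 2])           # length 2 does not match the 3 columns
except ValueError as err:
    mismatch_error = str(err)      # numpy reports that the shapes cannot be broadcast

print(plus_scalar)
print(elementwise)
print(row_scaled)
print(divided)
print(mismatch_error)
```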


#### Accessing elements/rows/columns


##### Constructing a 2d array for demonstration purposes

```python
lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))
```

This is a 2D array with two rows of 50 elements each: the first row contains the odd numbers from 1 to 99, and the second row contains the even numbers from 100 down to 2.

##### Accessing individual values

1. Using standard list indexing
```python
# lst_2d[row_index][column_index]
lst_2d[1][2]
```

2. Numpy has more advanced indexing which allows both values to be specified together
```python
# lst_2d[row_index, column_index]
lst_2d[1, 2]
```

##### Retrieving a row

```python
# lst_2d[row_index, :]
lst_2d[0, :] # return an array of all the elements in the first row
lst_2d[1, :] # return an array of all the elements in the second row
```

##### Retrieving a column

```python
# lst_2d[:, column_index]
lst_2d[:, 5] # return an array of all the elements in the column with index 5 (the sixth column)
lst_2d[:, 20] # return an array of all the elements in the column with index 20 (the twenty-first column)
```

For example, `lst_2d[:, 5]` returns `array([11, 90])`: the elements at index 5 of each row.
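The indexing forms above can be checked together in one short sketch. It rebuilds the demonstration array; the result names and the extra slice `1:4` are additions for illustration.

```python
import numpy as np

lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))

shape = lst_2d.shape              # (rows, columns)
single = lst_2d[1, 2]             # second row, third element
same = lst_2d[1][2]               # list-style indexing gives the same value
row_start = lst_2d[0, :3]         # first three elements of the first row
column = lst_2d[:, 5]             # column with index 5
block = lst_2d[:, 1:4]            # columns with index 1, 2 and 3 from both rows

print(shape)       # (2, 50)
print(single)      # 96
print(row_start)   # [1 3 5]
print(column)      # [11 90]
print(block)
```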