## Interacting with the OS and filesystem

The `os` module in Python provides many functions for interacting with the OS and the filesystem. Let's import it and try out some examples.

In [1]:
import os

We can check the present working directory using the `os.getcwd` function.

In [2]:
os.getcwd()

'C:\\Users\\vrsha\\Data analysis'

To get the list of files in a directory, use `os.listdir`. You pass an absolute or relative path of a directory as the argument to the function.

In [4]:
help(os.listdir)

Help on built-in function listdir in module nt:

listdir(path=None)
    Return a list containing the names of the files in the directory.

    path can be specified as either str, bytes, or a path-like object.  If path is bytes,
      the filenames returned will also be bytes; in all other circumstances
      the filenames returned will be str.
    If path is None, uses the path='.'.
    On some platforms, path may also be specified as an open file descriptor;\
      the file descriptor must refer to a directory.
      If this functionality is unavailable, using it raises NotImplementedError.

    The list is in arbitrary order.  It does not include the special
    entries '.' and '..' even if they are present in the directory.



In [8]:
os.listdir('.')

['.git',
 '.ipynb_checkpoints',
 '1 First steps with python.ipynb',
 '2 Variables and datatypes.ipynb',
 '3 Branching loops and conditionals.ipynb',
 '4 Functions and scope.ipynb',
 '5 Working with OS & Files.ipynb',
 'data']

You can create a new directory using `os.makedirs`. Let's create a new directory called `data`, where we'll later download some files.

In [11]:
os.listdir('C:/')

['$AV_ASW',
 '$Recycle.Bin',
 '.GamingRoot',
 'avast! sandbox',
 'Config.Msi',
 'data',
 'Documents and Settings',
 'Drivers',
 'DumpStack.log.tmp',
 'inetpub',
 'logUploaderSettings.ini',
 'logUploaderSettings_temp.ini',
 'OneDriveTemp',
 'pagefile.sys',
 'PerfLogs',
 'Program Files',
 'Program Files (x86)',
 'ProgramData',
 'Python',
 'Recovery',
 'swapfile.sys',
 'System Volume Information',
 'Users',
 'Windows',
 'XboxGames']

You can create a new directory using `os.makedirs`. Let's create a new directory called `data`, where we'll later download some files.

In [12]:
os.makedirs('/data', exist_ok = True)

Can you figure out what the argument `exist_ok` does? Try using the `help` function or [read the documentation](https://docs.python.org/3/library/os.html#os.makedirs).

Let's verify that the directory was created and is currently empty.

In [14]:
'data' in os.listdir('.')

True

In [16]:
os.listdir('./data')

['.ipynb_checkpoints', 'links.csv', 'movies.csv', 'ratings.csv', 'tags.csv']

Let us download some files into the `data` directory using the `urllib` module.

In [17]:
url1 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans1.txt'
url2 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans2.txt'
url3 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans3.txt'

In [18]:
from urllib.request import urlretrieve

In [19]:
urlretrieve(url1, './data/loans1.txt')

('./data/loans1.txt', <http.client.HTTPMessage at 0x1657bf63290>)

In [20]:
urlretrieve(url2, './data/loans1.txt')

('./data/loans1.txt', <http.client.HTTPMessage at 0x1657bf61340>)

In [21]:
urlretrieve(url3, './data/loans3.txt')

('./data/loans3.txt', <http.client.HTTPMessage at 0x1657bf63950>)

In [22]:
os.listdir('./data')

['.ipynb_checkpoints',
 'links.csv',
 'loans1.txt',
 'loans3.txt',
 'movies.csv',
 'ratings.csv',
 'tags.csv']

You can also use the [`requests`](https://docs.python-requests.org/en/master/) library to dowload URLs, although you'll need to [write some additional code](https://stackoverflow.com/questions/44699682/how-to-save-a-file-downloaded-from-requests-to-another-directory) to save the contents of the page to a file.

## Reading from a file 

To read the contents of a file, we first need to open the file using the built-in `open` function. The `open` function returns a file object and provides several methods for interacting with the file's contents.

In [24]:
file1 = open('./data/loans1.txt', mode='r')

The `open` function also accepts a `mode` argument to specifies how we can interact with the file. The following options are supported:

```
    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated)
    ========= ===============================================================
```

To view the contents of the file, we can use the `read` method of the file object.

In [25]:
file1_contents = file1.read()

In [27]:
print(file1_contents)

amount,duration,rate,down_payment
828400,120,0.11,100000
4633400,240,0.06,
42900,90,0.08,8900
983000,16,0.14,
15230,48,0.07,4300


The file contains information about loans. It is a set of comma-separated values (CSV). 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)

The first line of the file is the header, indicating what each of the numbers on the remaining lines represents. Each of the remaining lines provides information about a loan. Thus, the second line `10000,36,0.08,20000` represents a loan with:

* an *amount* of `$10000`, 
* *duration* of `36` months, 
* *rate of interest* of `8%` per annum, and 
* a down payment of `$20000`

The CSV is a standard file format used for sharing data for analysis and visualization. Over the course of this tutorial, we will read the data from these CSV files, process it, and write the results back to files. Before we continue, let's close the file using the `close` method (otherwise, Python will continue to hold the entire file in the RAM)

In [28]:
file1.close()

Once a file is closed, you can no longer read from it.

In [29]:
file1.read()

ValueError: I/O operation on closed file.

In [43]:
with open('./data/loans3.txt') as file3:
    file3_lines = file3.readlines()

In [44]:
file3_lines


['amount,duration,rate,down_payment\n',
 '45230,48,0.07,4300\n',
 '883000,16,0.14,\n',
 '100000,12,0.1,\n',
 '728400,120,0.12,100000\n',
 '3637400,240,0.06,\n',
 '82900,90,0.07,8900\n',
 '316000,16,0.13,\n',
 '15230,48,0.08,4300\n',
 '991360,99,0.08,\n',
 '323000,27,0.09,4720010000,36,0.08,20000\n',
 '528400,120,0.11,100000\n',
 '8633400,240,0.06,\n',
 '12900,90,0.08,8900']

Once the statements within the with block are executed, the .close method on file2 is automatically invoked. Let's verify this by trying to read from the file object again.

## Processing data from files

Before performing any operations on the data stored in a file, we need to convert the file's contents from one large string into Python data types. For the file `loans1.txt` containing information about loans in a CSV format, we can do the following:

* Read the file line by line
* Parse the first line to get a list of the column names or headers
* Split each remaining line and convert each value into a float
* Create a dictionary for each loan using the headers as keys
* Create a list of dictionaries to keep track of all the loans

Since we will perform the same operations for multiple files, it would be useful to define a function `read_csv`. We'll also define some helper functions to build up the functionality step by step. 

Let's start by defining a function `parse_header` that takes a line as input and returns a list of column headers.

In [45]:
def parse_headers (header_line):
    return header_line.strip().split(',')
    

The `strip` method removes any extra spaces and the newline character `\n`. The `split` method breaks a string into a list using the given separator (`,` in this case).

In [47]:
file3_lines[0]

'amount,duration,rate,down_payment\n'

In [48]:
headers = parse_headers(file3_lines[0])

In [49]:
headers

['amount', 'duration', 'rate', 'down_payment']

Next, let's define a function `parse_values` that takes a line containing some data and returns a list of floating-point numbers.

In [52]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        values.append(float(item))
    return values

In [53]:
file3_lines[1]

'45230,48,0.07,4300\n'

The values were parsed and converted to floating point numbers, as expected. Let's try it for another line from the file, which does not contain a value for the down payment.

In [54]:
parse_values(file3_lines[1])

[45230.0, 48.0, 0.07, 4300.0]

In [55]:
file3_lines[2]

'883000,16,0.14,\n'

In [56]:
parse_values(file3_lines[2])

ValueError: could not convert string to float: ''

The code above leads to a `ValueError` because the empty string `''` cannot be converted to a float. We can enhance the `parse_values` function to handle this *edge case*. We will also handle the case where the value is not a float.

## Using Pandas to Read and Write CSVs

There are some limitations to the `read_csv` and `write_csv` functions we've defined above:

* The `read_csv` function fails to create a proper dictionary if any of the values in the CSV files contains commas
* The `write_csv` function fails to create a proper CSV if any of the values to be written contains commas

When a value in a CSV file contains a comma (`,`), the value is generally placed within double quotes. Double quotes (`"`) in values are converted into two double quotes (`""`). Here's an example:

```
title,description
Fast & Furious,"A movie, a race, a franchise"
The Dark Knight,"Gotham, the ""Batman"", and the Joker"
Memento,A guy forgets everything every 15 minutes

```

Let's try it out.

In [65]:
import pandas

In [66]:
movies_url = "https://gist.githubusercontent.com/aakashns/afee0a407d44bbc02321993548021af9/raw/6d7473f0ac4c54aca65fc4b06ed831b8a4840190/movies.csv"

In [67]:
urlretrieve(movies_url, 'data/movies.csv')

('data/movies.csv', <http.client.HTTPMessage at 0x16516017d70>)

In [74]:
movies = pd.read_csv('data/movies.csv')

In [75]:
movies

Unnamed: 0,title,description
0,Fast & Furious,"A movie, a race, a franchise"
1,The Dark Knight,"Gotham, the ""Batman"", and the Joker"
2,Memento,A guy forgets everything every 15 minutes


A dataframe can be converted into a list of dictionaries using the `to_dict` method.

In [76]:
movies_dict = movies.to_dict()

In [77]:
movies_dict

{'title': {0: 'Fast & Furious', 1: 'The Dark Knight', 2: 'Memento'},
 'description': {0: 'A movie, a race, a franchise',
  1: 'Gotham, the "Batman", and the Joker',
  2: 'A guy forgets everything every 15 minutes'}}

In [80]:
write_csv(movies, 'movies2.csv')

NameError: name 'write_csv' is not defined

In [85]:
write_csv(movies, 'movies2.csv')

NameError: name 'write_csv' is not defined