## Using Python to interact with the filesystem

### A note on filesystems

Interacting with the filesystem is a crucial need that any programming language must have.

Python has a module dedicated to this, the `os` module.

But before digging into the module, it's important to understand what a filesystem is and how to navigate it. 

Your computer has some sort of storage, right? Usually an HDD (Hard Disk Drive) or SSD (Solid State Drive). All disks contain a file system and information about where disk data is stored, and how it may be accessed by a user or application. A file system typically manages operations, such as storage management, file naming, directories/folders, metadata, access rules and privileges.

Every operating system has it's own filesystem: Windows uses NTFS (New Technology File System), Mac uses APFS (Apple File System) and Linux can use several depending on the distribution, the most used being Ext4 (Extended File System, version 4).

For the purpose of this lesson, the most important takeaway is that they all use a tree architecture to organize directories and files, like the image below:

<img src="structure.png" alt="FS Structure" title="" width="650" />

The **/** symbol at the top is usually called *root*. Every directory below the root is usually called a *branch*, and you think of files as *leaves* in that branch.

As with real trees, on filesystems you can have branches and leaves connected to other branches, but can't have branches connected to leaves, so files can't be parents of directories.

It depends on the operating system, but you can traverse the filesystem by either using the / symbol (linux and mac) or the \ symbol on windows. Ex: **/home/jeancome/texmf/tex**

This is usually called a *path*.

### Absolute and relative paths

Imagine that your filesystem is the same as the image above, and that you want to go to the *jeancome* directory, starting from the root. You would do this:

`cd /home/jeancome`

Now imagine that from there you wanted to go to the *jeancomeson* directory. You could do this in two ways:

1. Using an absolute path: `cd /home/jeancomeson`
2. Using a relative path: `cd ../jeancomeson` &leftarrow; *we went up one branch an then entered another directory*

### Advices

* The better you know your filesystem, the easiest will be for you to navigate it
* Avoid spaces in filenames. Replace them with <b>_</b> or <b>-</b> (this alone will prevent a lot of problems down the road)
* Avoid placing your code on system folders like *Program Files*
* Normalize filenames. Keep all names in lowercase, for instance

### The `OS` module

This builtin python module comes packed with many methods to help you interact and navigate with a filesystem. Let's take a look at some of the most used ones:

#### Navigation

In [20]:
# import the module
import os

# get the current working directory
os.chdir('/workspace/python/python_course/lessons/os')
os.getcwd()

'/workspace/python/python_course/lessons/os'

In [21]:
# change to the previous directory (using absolute path)
os.chdir('/workspace/python/python_course/lessons')
os.getcwd()

'/workspace/python/python_course/lessons'

In [22]:
# change to the previous directory (using relative path)
# first let's go to the starting point, changing to the following directory
os.chdir('os')
os.getcwd()

'/workspace/python/python_course/lessons/os'

In [23]:
# and now change to the previous directory
os.chdir('..')
os.getcwd()

'/workspace/python/python_course/lessons'

#### Listing

In [24]:
# list the contents of a specific directory
os.listdir('/workspace/python/python_course/lessons/os')

['filesystem.ipynb',
 'structure.png',
 'my_file.txt',
 'my_file.txt.bak',
 'matches_dataset',
 'filesystem.html',
 '.ipynb_checkpoints']

In [25]:
# call oslistdir() without arguments to list the contents of the current directory
os.chdir('..')
print(f'Contents of the {os.getcwd()} directory:')
os.listdir()

Contents of the /workspace/python/python_course directory:


['import_modules.ipynb',
 'lists&tuples&sets.ipynb',
 '.~Course.pptx',
 'oop-classes.ipynb',
 'csv.ipynb',
 'exercise_csv',
 'debug_example.py',
 'exercise_calculator',
 'json.ipynb',
 'my_module.py',
 'functions.ipynb',
 'conditionals.ipynb',
 'exercise_json',
 'strings.ipynb',
 'variablescopes.ipynb',
 'loops.ipynb',
 'exercise_console',
 'decorators.ipynb',
 'comprehensions.ipynb',
 'data_science',
 'myFirstProgram.py',
 'databases.ipynb',
 'Course.pptx',
 'dictionaries.ipynb',
 'materials',
 'functions.log',
 'insightsPage-example.json',
 'variables.ipynb',
 'new_example.csv',
 'scripts',
 'relational.sql',
 'virtualenv.ipynb',
 'generators_mem_profile.py',
 'datatypes_recap.ipynb',
 'example.csv',
 'exercise_job_interview',
 '__pycache__',
 'integers&floats.ipynb',
 'data-analysis-and-feature-extraction-with-python.ipynb',
 'exercise_quizz',
 '.ipynb_checkpoints',
 'lessons',
 'noSQL.json',
 'slides',
 'exceptions.ipynb']

In [26]:
# what if you only want directories?
subfolders = []
for f in os.scandir(os.getcwd()):
    if f.is_dir():
        # look at some properties that you can get from a file/directory
        subfolders.append((f.name, f.path))
subfolders

# same but using a list comprehension
#subfolders = [(f.name, f.path) for f in os.scandir(os.getcwd()) if f.is_dir()]
#subfolders

[('exercise_csv', '/workspace/python/python_course/exercise_csv'),
 ('exercise_calculator',
  '/workspace/python/python_course/exercise_calculator'),
 ('exercise_json', '/workspace/python/python_course/exercise_json'),
 ('exercise_console', '/workspace/python/python_course/exercise_console'),
 ('data_science', '/workspace/python/python_course/data_science'),
 ('materials', '/workspace/python/python_course/materials'),
 ('scripts', '/workspace/python/python_course/scripts'),
 ('exercise_job_interview',
  '/workspace/python/python_course/exercise_job_interview'),
 ('__pycache__', '/workspace/python/python_course/__pycache__'),
 ('exercise_quizz', '/workspace/python/python_course/exercise_quizz'),
 ('.ipynb_checkpoints', '/workspace/python/python_course/.ipynb_checkpoints'),
 ('lessons', '/workspace/python/python_course/lessons'),
 ('slides', '/workspace/python/python_course/slides')]

In [27]:
# what if you only want files?
files_in_dir = []
for f in os.scandir(os.getcwd()):
    if f.is_file():
        # look at some properties that you can get from a file/directory
        files_in_dir.append((f.name, f.path))
files_in_dir

# same but using a list comprehension
#files_in_dir = [(f.name, f.path) for f in os.scandir(os.getcwd()) if f.is_file()]
#files_in_dir

[('import_modules.ipynb',
  '/workspace/python/python_course/import_modules.ipynb'),
 ('lists&tuples&sets.ipynb',
  '/workspace/python/python_course/lists&tuples&sets.ipynb'),
 ('.~Course.pptx', '/workspace/python/python_course/.~Course.pptx'),
 ('oop-classes.ipynb', '/workspace/python/python_course/oop-classes.ipynb'),
 ('csv.ipynb', '/workspace/python/python_course/csv.ipynb'),
 ('debug_example.py', '/workspace/python/python_course/debug_example.py'),
 ('json.ipynb', '/workspace/python/python_course/json.ipynb'),
 ('my_module.py', '/workspace/python/python_course/my_module.py'),
 ('functions.ipynb', '/workspace/python/python_course/functions.ipynb'),
 ('conditionals.ipynb', '/workspace/python/python_course/conditionals.ipynb'),
 ('strings.ipynb', '/workspace/python/python_course/strings.ipynb'),
 ('variablescopes.ipynb',
  '/workspace/python/python_course/variablescopes.ipynb'),
 ('loops.ipynb', '/workspace/python/python_course/loops.ipynb'),
 ('decorators.ipynb', '/workspace/python/

In [28]:
# what if you want to recursively scan the whole directory?
# the walk() method accepts a directory and will scan the entire contents of that directory.
# it return a tuple with 3 values: the possible paths inside the directory, the directory names and the filenames
x = os.walk(os.getcwd())
x

<generator object walk at 0x7fe4e04a3550>

In [29]:
# as you can see, x is a generator, meaning it doesn't hold any data, it will generate it when we ask for it
# (imagine asking python to walk through a whole hard drive, it would take a lot of time!)
# so we have to iterate the generator to get data from it:
for x in os.walk(os.getcwd()):
    print(x)

# or using a list comprehension    
#[x for x in os.walk(os.getcwd())]

('/workspace/python/python_course', ['exercise_csv', 'exercise_calculator', 'exercise_json', 'exercise_console', 'data_science', 'materials', 'scripts', 'exercise_job_interview', '__pycache__', 'exercise_quizz', '.ipynb_checkpoints', 'lessons', 'slides'], ['import_modules.ipynb', 'lists&tuples&sets.ipynb', '.~Course.pptx', 'oop-classes.ipynb', 'csv.ipynb', 'debug_example.py', 'json.ipynb', 'my_module.py', 'functions.ipynb', 'conditionals.ipynb', 'strings.ipynb', 'variablescopes.ipynb', 'loops.ipynb', 'decorators.ipynb', 'comprehensions.ipynb', 'myFirstProgram.py', 'databases.ipynb', 'Course.pptx', 'dictionaries.ipynb', 'functions.log', 'insightsPage-example.json', 'variables.ipynb', 'new_example.csv', 'relational.sql', 'virtualenv.ipynb', 'generators_mem_profile.py', 'datatypes_recap.ipynb', 'example.csv', 'integers&floats.ipynb', 'data-analysis-and-feature-extraction-with-python.ipynb', 'noSQL.json', 'exceptions.ipynb'])
('/workspace/python/python_course/exercise_csv', ['.ipynb_checkp

In [30]:
# or in a more understandable way:
for x in os.walk(os.getcwd()):
    print(f'The directory: {x[0]} has these folders: {x[1]} and these files {x[2]}')

The directory: /workspace/python/python_course has these folders: ['exercise_csv', 'exercise_calculator', 'exercise_json', 'exercise_console', 'data_science', 'materials', 'scripts', 'exercise_job_interview', '__pycache__', 'exercise_quizz', '.ipynb_checkpoints', 'lessons', 'slides'] and these files ['import_modules.ipynb', 'lists&tuples&sets.ipynb', '.~Course.pptx', 'oop-classes.ipynb', 'csv.ipynb', 'debug_example.py', 'json.ipynb', 'my_module.py', 'functions.ipynb', 'conditionals.ipynb', 'strings.ipynb', 'variablescopes.ipynb', 'loops.ipynb', 'decorators.ipynb', 'comprehensions.ipynb', 'myFirstProgram.py', 'databases.ipynb', 'Course.pptx', 'dictionaries.ipynb', 'functions.log', 'insightsPage-example.json', 'variables.ipynb', 'new_example.csv', 'relational.sql', 'virtualenv.ipynb', 'generators_mem_profile.py', 'datatypes_recap.ipynb', 'example.csv', 'integers&floats.ipynb', 'data-analysis-and-feature-extraction-with-python.ipynb', 'noSQL.json', 'exceptions.ipynb']
The directory: /work

#### Managing directories and files

In [31]:
# creating new directories
# the mkdir() method also follows the same absolute/relative path logic.
# if you specify the full path the directory will be created in that location:
os.mkdir('/workspace/python/python_course/lessons/os/my_new_folder')
print(os.listdir('/workspace/python/python_course/lessons/os'))
print()

# if you don't, the directory will be created in the current path, which now is:
#os.mkdir('my_new_folder')
print(os.getcwd())
print()

# if you want to create multiple nested directories at once, just use the makedirs() method:
os.makedirs('/workspace/python/python_course/lessons/os/another_new_folder/and_another/and_another')
print(os.listdir('/workspace/python/python_course/lessons/os'))

['my_new_folder', 'filesystem.ipynb', 'structure.png', 'my_file.txt', 'my_file.txt.bak', 'matches_dataset', 'filesystem.html', '.ipynb_checkpoints']

/workspace/python/python_course

['my_new_folder', 'filesystem.ipynb', 'structure.png', 'my_file.txt', 'another_new_folder', 'my_file.txt.bak', 'matches_dataset', 'filesystem.html', '.ipynb_checkpoints']


In [32]:
# it's usually a good practice to check if a directory/file exists before trying to create it.
# you can check for it's existence like this:
if not os.path.exists('/workspace/python/python_course/lessons/os/my_new_folder'):
    os.mkdir('/workspace/python/python_course/lessons/os/my_new_folder')
else:
    print('Directory already exists.')

Directory already exists.


In [33]:
# renaming directories or files
# the rename() method also follows the same absolute/relative path logic.
# if you specify the full path the directory/file will be renamed in that location:
os.rename('/workspace/python/python_course/lessons/os/my_new_folder','/workspace/python/python_course/lessons/os/my_new_folder_renamed')
print(os.listdir('/workspace/python/python_course/lessons/os'))
print()

# if you don't, python will search for the directory/file in the current path and rename it, which now is:
#os.rename('my_new_folder', 'my_new_folder_renamed')
print(os.getcwd())

['filesystem.ipynb', 'structure.png', 'my_file.txt', 'another_new_folder', 'my_file.txt.bak', 'matches_dataset', 'filesystem.html', 'my_new_folder_renamed', '.ipynb_checkpoints']

/workspace/python/python_course


In [34]:
# removing empty directories
# the rmdir() method also follows the same absolute/relative path logic.
# if you specify the full path the directory will be removed from that location:
os.rmdir('/workspace/python/python_course/lessons/os/my_new_folder_renamed')
print(os.listdir('/workspace/python/python_course/lessons/os'))
print()

# if you don't, python will remove the directory from the current path, which now is:
#os.rmdir('my_new_folder_renamed')
print(os.getcwd())

['filesystem.ipynb', 'structure.png', 'my_file.txt', 'another_new_folder', 'my_file.txt.bak', 'matches_dataset', 'filesystem.html', '.ipynb_checkpoints']

/workspace/python/python_course


> **Note**: Python will throw an error in you try to remove the current working directory. You must change to another directory and then call the rmdir() method.

In [None]:
# removing directories and their contents
# if you wish to do this, you must use the rmtree() method from another module called shutil
import shutil
shutil.rmtree('/workspace/python/python_course/lessons/os/another_new_folder')
print(os.listdir('/workspace/python/python_course/lessons/os'))

In [None]:
# removing files
# the rmdir() method also follows the same absolute/relative path logic.
# if you specify the full path the file will be removed from that location:
os.remove('/workspace/python/python_course/lessons/os/my_file.txt')
print(os.listdir('/workspace/python/python_course/lessons/os'))
print()

# if you don't, python will remove the file from the current path, which now is:
#os.remove('my_file.txt')
print(os.getcwd())

#### Move & Copy operations, and accessing the machine's shell

There are many ways of performing these operations in python. There's the `os` module, the `shutil` module we saw previously or the `subprocess` module, but let's keep the focus on the `os` module.

There's a very powerful method in the `os` module called `system()` that gives you access to the shell/console/terminal of the operating system you're in.

So not only you can move or copy files, you can do whatever you want!

The `system()` method expects one argument only: the command to execute, meaning that you have to know the available commands in your operating system.

So if you're on Windows and want to copy a file, you do `copy my_dir/my_file my_other_dir/my_file`, while on linux or mac you do `cp my_dir/my_file my_other_dir/my_file`.

In [None]:
# here's an example on linux:
os.system('cp /workspace/python/python_course/lessons/os/my_file.txt.bak /workspace/python/python_course/lessons/os/my_file.txt')

This method has an underlying problem however. The output is always about the execution of the command, not the result/output of that execution, so we don't actually know if the file was copied or not.

Enter the `subprocess` module. If we use this module instead:

In [None]:
import subprocess

output = subprocess.check_output('hostname', shell=True)
print(f'Subprocess: {output}')
print(f'Subprocess(pretty): {output.decode("utf-8")}')

output_2 = os.system('hostname')
print(f'OS: {output_2}')

### Further reading

Don't forget that you can always inspect the whole module by calling the `help()` function or visit <a href="https://docs.python.org/3.7/library/os" target="blank">https://docs.python.org/3.7/library/os</a>

In [None]:
help(os)

Or get help on a specific method:

In [None]:
help(os.listdir)

### Exercises

In [None]:
# choose a random folder in your filesystem (preferably a complicated one) and find a way to get all the 
# directories and files in each subdirectory.

# Hint: the best method to achieve this is the os.walk()



In [4]:
def list_files(path=os.getcwd()):
    indent_size = 4
    for root, dirs, files in os.walk(path):
        level = root.replace(path, '').count(os.sep)
        indent = ' ' * indent_size * (level)
        print(f'{indent}{os.path.basename(root)}/')
        subindent = ' ' * indent_size * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

os.chdir('/workspace/python/python_course')
list_files()

os/
    filesystem.ipynb
    structure.png
    my_file.txt
    my_file.txt.bak
    matches_dataset/
        2000-2019.zip/
            2000-2019.csv
            .ipynb_checkpoints/
        1900-1999.zip/
            1950-2000.csv
            1900-1950.csv
        final/
            final_data.csv
            .ipynb_checkpoints/
                final_data-checkpoint.csv
        1872-1899.zip/
            xix_century.csv
    .ipynb_checkpoints/
        filesystem-checkpoint.ipynb
        my_file-checkpoint.txt
        my_file.txt-checkpoint.bak


In [None]:
# see the 'matches_dataset' directory, that contains multiple csv files. 
# The goal is to unite all the csv's into one. Ready?


In [2]:
import os
import pandas as pd

def merge_datasets(path=os.getcwd()):
    # create a list to hold the data from each file
    all_data = []

    # iteratively loop over all the directories
    for root, dirs, files in os.walk(path):
        # if directory has files
        if files:
            # for each file
            for file in files:
                # if file is a csv
                if(file.lower().endswith('.csv')):
                    # read the csv and create a dataframe using Pandas, and append it to our list
                    all_data.append(pd.read_csv(root + '/' + file, index_col = None))          
    # merge the dataframes into a single dataframe
    merge_data = pd.concat(all_data[1:], sort=False)
    
    # create a new subdirectory of path if it doesn't exist
    if not os.path.exists(path + '/final'):
        os.mkdir(path + '/final')
    # navigate to the directory
    os.chdir(path + '/final')
    # save a new csv with merge_data
    merge_data.to_csv('final_data.csv', index = None, header = True)

merge_datasets('/workspace/python/python_course/lessons/os/matches_dataset')
data = pd.read_csv('final_data.csv', index_col = None)
data.head(10)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1951-02-06,France,Yugoslavia,2,1,Friendly,Paris,France,False
1,1951-02-18,Spain,Switzerland,6,3,Friendly,Madrid,Spain,False
2,1951-02-24,Zimbabwe,Zambia,2,1,Friendly,Salisbury,Southern Rhodesia,False
3,1951-02-25,Costa Rica,Nicaragua,8,1,CCCF Championship,Panama City,Panama,True
4,1951-02-27,Panama,Costa Rica,2,0,CCCF Championship,Panama City,Panama,False
5,1951-02-28,Panama,Nicaragua,4,0,CCCF Championship,Panama City,Panama,False
6,1951-03-02,Panama,Costa Rica,1,1,CCCF Championship,Panama City,Panama,False
7,1951-03-03,Nicaragua,Costa Rica,1,4,CCCF Championship,Panama City,Panama,True
8,1951-03-04,Panama,Nicaragua,6,2,CCCF Championship,Panama City,Panama,False
9,1951-03-07,Northern Ireland,Wales,1,2,British Championship,Belfast,Northern Ireland,False
