# Introduction - Python for File Management

**Goals:**

*   Be able to navigate around the file system
*   Identify specific files
*   Request file metadata
*   Open a CSV file and write metadata to it





**Python Libraries:**

[os - Operating System](https://docs.python.org/3/library/os.html)


*   os.walk
*   os.listdir
*   [os.path](https://docs.python.org/3/library/os.path.html)



[time - Time Access and Conversion](https://docs.python.org/3/library/time.html?highlight=time#module-time)


[csv - CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

[pandas - Data Analysis](https://pandas.pydata.org/)







## OS Library

The OS library is a Python module that lets you interface with your computer's operating system. It provides a variety of functions and submodules for accessing information about files and manipulating them.

For this workshop, we will use the OS library to navigate Google Colab's file system, find specific files, and generate metadata from those files.

### **Goal:** Identifying and Changing Current Directory

In [None]:
#import os library
import os

#Set the current working directory as directory
directory = os.getcwd()

print(directory)

/content


In [None]:
#Change working directory to desired directory
os.chdir("/content/sample_data")

directory = os.getcwd()

print(directory)

/content/sample_data
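
os.chdir raises an error when the target path does not exist, so it can help to check the path first. The sketch below assumes a hypothetical helper named `change_directory` (not part of the os library) and uses a temporary folder so it runs anywhere, not only in Colab.

```python
import os
import tempfile

def change_directory(path):
    """Change the working directory if path exists; return True on success."""
    if os.path.isdir(path):
        os.chdir(path)
        return True
    print("Not a directory: " + path)
    return False

# Remember where we started so we can return afterwards
original = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    changed = change_directory(tmp)
    print(os.getcwd())

    # This path does not exist, so the working directory stays the same
    missing = change_directory(os.path.join(tmp, "does_not_exist"))

    # Go back before the temporary folder is deleted
    os.chdir(original)
```

Saving the starting directory and returning to it keeps later cells from depending on where an earlier cell left off.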


### os.walk

os.walk is a function in the OS library that generates the names of the directories and files in a directory tree by "walking" it from the top down or, optionally, from the bottom up. It is used in a for loop and yields a 3-tuple of (dirpath, dirnames, filenames) for each directory it visits.



**dirpath:** the path of the directory currently being visited, returned as a string

**dirnames:** the names of the subdirectories in dirpath, returned as a list

**filenames:** the names of the non-directory files in dirpath, returned as a list

In [None]:
#Write for loop to walk the /content directory
for root, dirs, files in os.walk("/content", topdown=False):

  print("Directory: " + root)

  #Notice that dirs is a list, so it has to be converted to a string before it can be joined to text
  print("Subdirectories: " + (str(dirs)))

  print(files)

Directory: /content/.config/logs/2024.08.13
Subdirectories: []
['13.25.04.560730.log', '13.25.45.679285.log', '13.25.44.351490.log', '13.25.28.914591.log', '13.26.01.890863.log', '13.26.01.085715.log']
Directory: /content/.config/logs
Subdirectories: ['2024.08.13']
[]
Directory: /content/.config/configurations
Subdirectories: []
['config_default']
Directory: /content/.config
Subdirectories: ['logs', 'configurations']
['active_config', 'config_sentinel', '.last_survey_prompt.yaml', '.last_opt_in_prompt.yaml', 'gce', 'hidden_gcloud_config_universe_descriptor_data_cache_configs.db', '.last_update_check.json', 'default_configs.db']
Directory: /content/sample_data/.ipynb_checkpoints
Subdirectories: []
[]
Directory: /content/sample_data
Subdirectories: ['.ipynb_checkpoints']
['anscombe.json', 'README.md', 'mnist_test.CSV', 'california_housing_train.csv', 'mnist_train_small.csv', 'california_housing_test.csv']
Directory: /content
Subdirectories: ['.config', 'sample_data']
[]
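
When os.walk runs with topdown=True, the dirs list can be edited in place to stop the walk from entering certain subdirectories. The sketch below skips hidden folders such as .config. The folder layout is a made-up example built in a temporary directory so the code runs outside Colab.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Build a small example tree: one hidden folder and one visible folder
    os.makedirs(os.path.join(top, ".config", "logs"))
    os.makedirs(os.path.join(top, "sample_data"))
    open(os.path.join(top, "sample_data", "README.md"), "w").close()

    visited = []
    for root, dirs, files in os.walk(top, topdown=True):
        # Editing dirs in place tells os.walk which subdirectories to skip
        dirs[:] = [d for d in dirs if not d.startswith(".")]

        visited.append(os.path.relpath(root, top))
        print("Directory: " + root)
        print("Files: " + str(files))
```

Pruning only works when topdown=True, because with topdown=False the subdirectories have already been visited by the time their parent is reported.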


### Challenge

Change your current working directory to the "web" directory in "datalab" and walk it.

Print your current directory and each subdirectory.


In [None]:
#Start your code here
os.chdir("/datalab/web")

directory = os.getcwd()

for root, dirs, files in os.walk(directory, topdown=False):

  print(directory)

  print("Subdirectories: " + str(dirs))


/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: ['futures']
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: ['expat']
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: ['sax', 'dom', 'parsers', 'etree']
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: ['dummy']
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subdirectories: []
/datalab/web
Subd
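
In the loop above, `print(directory)` shows the same starting path on every iteration because `directory` never changes inside the loop. A possible variation, sketched here on a made-up temporary tree rather than /datalab/web, prints `root` instead and joins it with each subdirectory name to get full paths.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Example tree with two nested subdirectories
    os.makedirs(os.path.join(top, "xml", "dom"))
    os.makedirs(os.path.join(top, "xml", "sax"))

    walked = []
    for root, dirs, files in os.walk(top, topdown=False):
        # root is the directory being visited on this iteration
        walked.append(root)
        print("Directory: " + root)

        for name in dirs:
            # Join root and name to get the full path of each subdirectory
            print("Subdirectory: " + os.path.join(root, name))
```

With topdown=False the starting directory is reported last, after all of its subdirectories.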

### **Goal:** List files and choose specific file types

In [None]:
#Change current directory back to sample_data
os.chdir("/content/sample_data")

directory = os.getcwd()

#Write a for loop using os.listdir and print each file in directory
for file in os.listdir(directory):
  print(file)

anscombe.json
README.md
mnist_test.CSV
.ipynb_checkpoints
california_housing_train.csv
mnist_train_small.csv
california_housing_test.csv


In [None]:
#Let's say we only want to see the CSV files

#Add an IF statement to the loop to check file extension
for file in os.listdir(directory):

  #Checks if filename ends with .csv
  if file.endswith(".csv"):

    print(file)

california_housing_train.csv
mnist_train_small.csv
california_housing_test.csv


In [None]:
for file in os.listdir(directory):

  #Add .lower() so the check also matches uppercase extensions such as .CSV
  if file.lower().endswith(".csv"):

    print(file)

mnist_test.CSV
california_housing_train.csv
mnist_train_small.csv
california_housing_test.csv
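
Another way to check file types is os.path.splitext, which splits a name into its base and its extension. The sketch below assumes a hypothetical helper named `has_extension` and reuses file names from the listing above, so it does not need the files to exist.

```python
import os

def has_extension(filename, extensions):
    """Return True if the lowercased extension of filename is in extensions."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in extensions

names = ["anscombe.json", "README.md", "mnist_test.CSV", ".ipynb_checkpoints"]

# Keep only the names that end in .csv or .json, ignoring case
matches = [name for name in names if has_extension(name, {".csv", ".json"})]
print(matches)
```

This makes it easy to match several extensions at once. Note that splitext treats a leading dot as part of the name, so .ipynb_checkpoints has no extension.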


### **Goal:** Generate File Metadata

os.path: This module is used to inspect and manipulate pathnames.

In [None]:
#We'll start with a for loop to begin our process
for file in os.listdir(directory):

  #We're going to continue with this IF statement to limit the files
  if file.lower().endswith(".csv"):

    #basename is used to return the base name of the pathname
    filename = os.path.basename(file)
    print("Filename: " + filename)

    #abspath is used to return a normalized absolutized version of the pathname
    filepath = os.path.abspath(file)
    print("File path: " + filepath)

Filename: mnist_test.CSV
File path: /content/sample_data/mnist_test.CSV
Filename: california_housing_train.csv
File path: /content/sample_data/california_housing_train.csv
Filename: mnist_train_small.csv
File path: /content/sample_data/mnist_train_small.csv
Filename: california_housing_test.csv
File path: /content/sample_data/california_housing_test.csv
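
os.path.abspath resolves a bare filename against the current working directory, so the loop above gives correct paths only because we changed into sample_data first. A sketch of an alternative that does not depend on the working directory is to join the folder and the filename with os.path.join. The temporary folder and the name example.csv are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create one empty example file to list
    open(os.path.join(folder, "example.csv"), "w").close()

    full_paths = []
    exists = []
    for file in os.listdir(folder):
        # Join the folder and the name so the path does not depend on
        # the current working directory
        full_path = os.path.join(folder, file)
        full_paths.append(full_path)
        exists.append(os.path.exists(full_path))

        print("File path: " + full_path)
```

The same joined path can be passed to os.path.getsize and the other os.path functions used in this section.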


In [None]:
#Let's say that we want to get the size of the files

for file in os.listdir(directory):

  if file.lower().endswith(".csv"):

    filename = os.path.basename(file)
    print("Filename: " + filename)

    #getsize is used to return the size of the file in bytes
    filesize = os.path.getsize(file)

    #Note that filesize needs to be converted to a string
    print("File size: " + str(filesize) + " bytes")

Filename: mnist_test.CSV
File size: 18289443 bytes
Filename: california_housing_train.csv
File size: 1706430 bytes
Filename: mnist_train_small.csv
File size: 36523880 bytes
Filename: california_housing_test.csv
File size: 301141 bytes
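
Large byte counts are hard to read at a glance. The sketch below assumes a hypothetical helper named `format_size` that converts bytes to larger units, treating 1 KB as 1024 bytes. The sample values are two of the sizes printed above.

```python
def format_size(num_bytes):
    """Convert a byte count to a short string using 1024-based units."""
    if num_bytes < 1024:
        return str(num_bytes) + " bytes"

    size = num_bytes / 1024
    for unit in ["KB", "MB", "GB"]:
        # Stop at the first unit that keeps the number below 1024
        if size < 1024 or unit == "GB":
            return "{:.1f} {}".format(size, unit)
        size /= 1024

print(format_size(301141))    # 294.1 KB
print(format_size(18289443))  # 17.4 MB
```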


In [None]:
#Lastly, let's get when the file was created and when it was last modified

for file in os.listdir(directory):

  if file.lower().endswith(".csv"):

    filename = os.path.basename(file)

    #getctime returns the creation time on Windows; on Unix systems such as Colab,
    #it returns the time of the last metadata change
    creation = os.path.getctime(file)

    #getmtime returns when the file was last modified
    modified = os.path.getmtime(file)

    print("Filename: " + filename)

    print("Created: " + str(creation))

    print("Last Modified: " + str(modified))

Filename: mnist_test.CSV
Created: 1723739856.2626855
Last Modified: 1723555582.0
Filename: california_housing_train.csv
Created: 1723559364.6925225
Last Modified: 1723555579.0
Filename: mnist_train_small.csv
Created: 1723559365.1175556
Last Modified: 1723555581.0
Filename: california_housing_test.csv
Created: 1723559364.6795213
Last Modified: 1723555579.0


os.path.getctime and os.path.getmtime return a floating point number giving the number of seconds since the Unix epoch, which is January 1, 1970, 00:00:00 (UTC).

To make this easier to read, we can use the **time** module to convert those seconds. time.ctime converts a time expressed in seconds since the epoch to a string representing local time.

In [None]:
#import time library
import time

for file in os.listdir(directory):

  if file.lower().endswith(".csv"):

    filename = os.path.basename(file)

    creation = os.path.getctime(file)
    #use ctime to convert the creation time value and assign it to creation_time
    creation_time = time.ctime(creation)

    modified = os.path.getmtime(file)
    #use ctime to convert the modified time value and assign it to modified_time
    modified_time = time.ctime(modified)

    print("Filename: " + filename)

    print("Created: " + creation_time)

    print("Last Modified: " + modified_time)


Filename: mnist_test.CSV
Created: Thu Aug 15 16:37:36 2024
Last Modified: Tue Aug 13 13:26:22 2024
Filename: california_housing_train.csv
Created: Tue Aug 13 14:29:24 2024
Last Modified: Tue Aug 13 13:26:19 2024
Filename: mnist_train_small.csv
Created: Tue Aug 13 14:29:25 2024
Last Modified: Tue Aug 13 13:26:21 2024
Filename: california_housing_test.csv
Created: Tue Aug 13 14:29:24 2024
Last Modified: Tue Aug 13 13:26:19 2024
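
time.ctime always uses the same layout. If a different layout is needed, for example one that sorts well in a spreadsheet, time.strftime can format the same timestamp. The sketch below assumes a hypothetical helper named `format_timestamp`. The sample value is one of the modified times printed earlier.

```python
import time

def format_timestamp(seconds):
    """Format seconds since the epoch as a local 'YYYY-MM-DD HH:MM:SS' string."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(seconds))

modified = 1723555582.0
print(format_timestamp(modified))

# Zero seconds is the epoch itself, shown here in UTC
epoch = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(0))
print(epoch)
```

time.localtime uses the time zone of the machine running the code, so the first printed value can differ between computers.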


## Challenge

Write a for loop to find the .json file in the directory and print its file path and when it was last accessed.

os.path.getatime(path) is used to return when a path was last accessed.

In [None]:
#Start your code here

for file in os.listdir(directory):

  if file.lower().endswith(".json"):

    filename = os.path.basename(file)

    path = os.path.abspath(file)


    accessed = os.path.getatime(file)

    accessed_time = time.ctime(accessed)

    print("Filename: " + filename)

    print("Path: " + path)

    print("Last Accessed: " + accessed_time)

Filename: anscombe.json
Path: /content/sample_data/anscombe.json
Last Accessed: Sat Jan  1 08:00:00 2000
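
The size and all three timestamps can also be read with a single call to os.stat, which returns them as attributes of one result. This is a sketch on a temporary file that stands in for anscombe.json.

```python
import os
import tempfile
import time

# Create a small temporary .json file to inspect
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as handle:
    handle.write(b"{}")
    temp_path = handle.name

# One call returns size and timestamps together
info = os.stat(temp_path)
size = info.st_size
same_mtime = info.st_mtime == os.path.getmtime(temp_path)

print("Size: " + str(size) + " bytes")
print("Last Modified: " + time.ctime(info.st_mtime))
print("Last Accessed: " + time.ctime(info.st_atime))

# Clean up the temporary file
os.remove(temp_path)
```

The st_size, st_mtime, and st_atime attributes hold the same values that os.path.getsize, os.path.getmtime, and os.path.getatime return.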


## CSV Library

The CSV library enables you to read and write data in CSV format. We will use this library to create a new CSV file and write the metadata that we generated to it.

To read the metadata CSV file back, we will borrow a couple of lines of code from the pandas library.

In [None]:
#import csv library
import csv

#import pandas and assign it as pd
import pandas as pd

#file_metadata.csv will be the file that we create in the current directory
# 'w' opens the file for writing; newline='' lets the csv module control line endings
with open('file_metadata.csv', 'w', newline='') as csvfile:

  #csv.writer converts rows of data into delimited strings written to the given file
  csvwriter = csv.writer(csvfile)

  #Create a list of strings that represent the categories we collected
  header = ['Filename', 'Path', 'Test']

  #Write the header row to the CSV file
  csvwriter.writerow(header)

test = pd.read_csv('file_metadata.csv')

test


Unnamed: 0,Filename,Path,Test
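
The csv library can also read the file back without pandas. The sketch below uses csv.DictWriter and csv.DictReader, which label each value with its column name. The single example row and the temporary folder are placeholders.

```python
import csv
import os
import tempfile

fieldnames = ["Filename", "Path"]

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "file_metadata.csv")

    # newline='' is recommended by the csv documentation when opening files
    with open(csv_path, "w", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow({"Filename": "README.md",
                         "Path": "/content/sample_data/README.md"})

    # Read the rows back as dictionaries keyed by column name
    with open(csv_path, newline="") as csvfile:
        records = list(csv.DictReader(csvfile))

print(records)
```

Writing rows as dictionaries means the order of the values no longer has to match the header by hand.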


In [None]:
#import csv library
import csv

#import pandas and assign it as pd
import pandas as pd

#file_metadata.csv will be the file that we create in the current directory
# 'w' opens the file for writing; newline='' lets the csv module control line endings
with open('file_metadata.csv', 'w', newline='') as csvfile:

  #csv.writer converts rows of data into delimited strings written to the given file
  csvwriter = csv.writer(csvfile)

  #Create a list of strings that represent the categories we collected
  header = ['Filename', 'Path', 'Size', 'Created', 'Modified', 'Accessed']

  #Write the header row to the CSV file
  csvwriter.writerow(header)

  #Write a for loop to process each item in the directory
  for file in os.listdir(directory):

    #Find the filename
    filename = os.path.basename(file)

    #Find the absolute path
    path = os.path.abspath(file)

    #Find the file size in bytes
    filesize = os.path.getsize(file)

    #Find the ctime and convert it to a readable string
    creation = os.path.getctime(file)
    creation_time = time.ctime(creation)

    #Find the last modified time and convert it to a readable string
    modified = os.path.getmtime(file)
    modified_time = time.ctime(modified)

    #Find the last accessed time and convert it to a readable string
    accessed = os.path.getatime(file)
    accessed_time = time.ctime(accessed)

    #Write the values as one row, in the same order as the header
    csvwriter.writerow([filename, path, filesize, creation_time, modified_time, accessed_time])


test = pd.read_csv('file_metadata.csv')

test


Unnamed: 0,Filename,Path,Size,Created,Modified,Accessed
0,anscombe.json,/content/sample_data/anscombe.json,1697,Tue Aug 13 14:29:24 2024,Sat Jan 1 08:00:00 2000,Sat Jan 1 08:00:00 2000
1,README.md,/content/sample_data/README.md,930,Tue Aug 13 14:29:24 2024,Sat Jan 1 08:00:00 2000,Sat Jan 1 08:00:00 2000
2,mnist_test.CSV,/content/sample_data/mnist_test.CSV,18289443,Thu Aug 15 16:37:36 2024,Tue Aug 13 13:26:22 2024,Thu Aug 15 16:37:44 2024
3,.ipynb_checkpoints,/content/sample_data/.ipynb_checkpoints,4096,Thu Aug 15 16:37:36 2024,Thu Aug 15 16:37:36 2024,Thu Aug 15 20:29:40 2024
4,file_metadata.csv,/content/sample_data/file_metadata.csv,0,Thu Aug 15 20:55:16 2024,Thu Aug 15 20:55:16 2024,Thu Aug 15 20:50:29 2024
5,california_housing_train.csv,/content/sample_data/california_housing_train.csv,1706430,Tue Aug 13 14:29:24 2024,Tue Aug 13 13:26:19 2024,Tue Aug 13 13:26:19 2024
6,mnist_train_small.csv,/content/sample_data/mnist_train_small.csv,36523880,Tue Aug 13 14:29:25 2024,Tue Aug 13 13:26:21 2024,Tue Aug 13 13:26:21 2024
7,california_housing_test.csv,/content/sample_data/california_housing_test.csv,301141,Tue Aug 13 14:29:24 2024,Tue Aug 13 13:26:19 2024,Tue Aug 13 13:26:19 2024
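
The table above includes the .ipynb_checkpoints folder and file_metadata.csv itself, which shows a size of 0 because it was still being written. A sketch of one way to leave those out is to wrap the steps in a function that skips anything that is not a regular file and skips the report. The function name `write_metadata` and the example folder contents are assumptions for illustration.

```python
import csv
import os
import tempfile
import time

def write_metadata(directory, output_path):
    """Write one CSV row of metadata per regular file in directory."""
    count = 0
    report = os.path.abspath(output_path)

    with open(output_path, "w", newline="") as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(["Filename", "Path", "Size", "Modified"])

        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)

            # Skip subdirectories and the report file itself
            if not os.path.isfile(path) or os.path.abspath(path) == report:
                continue

            modified_time = time.ctime(os.path.getmtime(path))
            csvwriter.writerow([name, path, os.path.getsize(path), modified_time])
            count += 1

    return count

with tempfile.TemporaryDirectory() as folder:
    # Example contents: one subdirectory and one small file
    os.mkdir(os.path.join(folder, ".ipynb_checkpoints"))
    with open(os.path.join(folder, "README.md"), "w") as f:
        f.write("hello")

    report_path = os.path.join(folder, "file_metadata.csv")
    rows_written = write_metadata(folder, report_path)

    with open(report_path, newline="") as csvfile:
        rows = list(csv.reader(csvfile))

print(rows_written)
```

Putting the steps in a function makes it possible to reuse them on any folder by passing a different directory and output path.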


## Challenge

Edit the above script to also record the size of each file and when it was created, last modified, and last accessed.

In [None]:
#Start your code here