#### In this notebook: we extract all the csv.gz files from the .tar archives in the tar_files folder and save them in the csv_files folder.
https://chat.openai.com/share/0f635485-122d-41d3-8d32-7416801992b2

In [28]:
import sys 
import os 
import pandas as pd
import importlib
import tarfile
from collections import Counter
import gzip
import shutil

current_dir = os.getcwd()
src_path = os.path.join(current_dir,"src") 
dataLoadingDirectory = os.path.join(current_dir,"data","raw",
                                    "flash_crash_DJIA","tar_files")
dataSavingDirectory  = os.path.join(current_dir,"data","raw",
                                    "flash_crash_DJIA","csv_files")

#### Checking .tar file names 

In [3]:
## Listing the .tar file names 
print(os.listdir(dataLoadingDirectory))
## Storing the .tar files paths 
tar_files = [os.path.join(dataLoadingDirectory,tar_file) for tar_file in 
             os.listdir(dataLoadingDirectory) ]

['WMT.N-2010.tar', 'AMGN.OQ-2010.tar', 'RTX.N-2010.tar', 'IBM.N-2010.tar', 'UTX.N-2010.tar', 'NKE.N-2010.tar', 'VZ.N-2010.tar', 'KO.N-2010.tar', 'XOM.N-2010.tar', 'GS.N-2010.tar', 'JPM.N-2010.tar', 'AXP.N-2010.tar', 'MRK.N-2010.tar', 'WBA.OQ-2010.tar', 'CAT.N-2010.tar', 'DOW.N-2010.tar', 'V.N-2010.tar', 'CVX.N-2010.tar', 'PFE.N-2010.tar', 'JNJ.N-2010.tar', 'MMM.N-2010.tar', 'TRV.N-2010.tar', 'CSCO.OQ-2010.tar', 'PG.N-2010.tar', 'HD.N-2010.tar', 'BA.N-2010.tar', 'MSFT.OQ-2010.tar', 'UNH.N-2010.tar', 'MCD.N-2010.tar', 'INTC.OQ-2010.tar', 'AAPL.OQ-2010.tar']


#### Inspecting the type of files stored in each tar file 

In [6]:
# Function to list file types in a tar archive
def list_file_types_in_tar(tar_path):
    file_types = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                _, ext = os.path.splitext(member.name)
                file_types.append(ext.lower())
    return file_types

In [12]:
# Count the file extensions found across all the .tar archives
file_types_counter = Counter()
for tar_file in tar_files:
    # tar_file is the full path to one .tar archive
    file_types = list_file_types_in_tar(tar_file)
    file_types_counter.update(file_types)
print(f"The Different types of files given are : {file_types_counter}")

The Different types of files given are : Counter({'.gz': 16182})


#### Discussion: 
All the files stored in the .tar files are .gz files (only the last extension is counted, so csv.gz files appear as .gz).
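
To confirm that the archives hold csv.gz files rather than some other gzip content, the inner extension can be kept as well. The helper below is a minimal sketch and not part of the original notebook; the name `compound_extension` and the set of compression suffixes are assumptions made for illustration.

```python
import os

# Hypothetical helper: keep the inner extension of compressed files,
# so "x.csv.gz" is reported as ".csv.gz" instead of ".gz".
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}  # assumed set, extend as needed


def compound_extension(name):
    base = os.path.basename(name).lower()
    root, ext = os.path.splitext(base)
    if ext in COMPRESSION_SUFFIXES:
        # Look one level deeper for the format hidden under the compression
        _, inner = os.path.splitext(root)
        return inner + ext
    return ext
```

Using it in place of `os.path.splitext` inside `list_file_types_in_tar` should make the counter distinguish `.csv.gz` from any other `.gz` member.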

#### Next step: for each tar file, create a folder with its name and extract the files into it. All these folders are saved in the general csv_files folder.

In [18]:
for tar_file in tar_files:
    # Extract the name of the .tar file without extension to use as directory name
    dir_name = os.path.splitext(os.path.basename(tar_file))[0]
    dir_path = os.path.join(dataSavingDirectory, dir_name)

    # Create a directory for extracted files
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

    # Open the .tar file
    with tarfile.open(tar_file, "r") as tar:
        # Extract each file directly, ignoring the internal directory structure
        for member in tar.getmembers():
            if member.isfile():
                member.name = os.path.basename(member.name)  # Remove the internal directory structure
                tar.extract(member, dir_path)
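
Dropping the internal directory structure means that two members with the same base name would silently overwrite each other. The variant below is a sketch of the same flattening idea with an explicit collision check; `extract_flat` is a hypothetical name, and it copies the member bytes itself instead of calling `tar.extract`.

```python
import os
import shutil
import tarfile


def extract_flat(tar_path, dest_dir):
    """Write every regular file of the archive into dest_dir by base name.

    Hypothetical helper: raises if two members share a base name
    instead of letting the second one overwrite the first.
    """
    os.makedirs(dest_dir, exist_ok=True)
    seen = set()
    written = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            name = os.path.basename(member.name)
            if name in seen:
                raise ValueError(f"duplicate base name in archive: {name}")
            seen.add(name)
            target = os.path.join(dest_dir, name)
            # Stream the member content to disk without using its stored path
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(name)
    return written
```

Because the target path is built only from the base name, a member whose stored path points outside the destination folder cannot escape it.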


#### Example of reading a csv.gz file


In [21]:
directory = os.listdir(dataSavingDirectory)[0]
directory = os.path.join(dataSavingDirectory,directory)
print(directory)

/Users/ilyesbenayed/Desktop/Big data/data/raw/flash_crash_DJIA/csv_files/GS.N-2010


In [24]:
### Getting the file names : 
print(os.listdir(directory)[:5])

['2010-08-26-GS.N-bbo.csv.gz', '2010-01-27-GS.N-trade.csv.gz', '2010-07-06-GS.N-trade.csv.gz', '2010-11-25-GS.N-bbo.csv.gz', '2010-07-16-GS.N-bbo.csv.gz']


#### Discussion: It turns out that we have trade and bbo files, so the next step is to save them separately, each in its corresponding folder.

### Making trade and bbo folders for each stock:

In [36]:
directories = [os.path.join(dataSavingDirectory,
                            directory) for directory in
               os.listdir(dataSavingDirectory)]
trade_directories = [os.path.join(directory,"trade") for directory in directories]
bbo_directories = [os.path.join(directory,"bbo") for directory in directories]
### Creating these directories if they do not exist yet : 
for trade_dir in trade_directories:
    if not os.path.exists(trade_dir):
        os.makedirs(trade_dir)
for bbo_dir in bbo_directories:
    if not os.path.exists(bbo_dir):
        os.makedirs(bbo_dir)

In [40]:
### We will iterate over each directory : 
for directory in directories:
    trade_dir  = os.path.join(directory,"trade")
    bbo_dir    = os.path.join(directory,"bbo")
    for filename in os.listdir(directory):
        if filename.endswith(".csv.gz"):
            if 'trade' in filename:
                # Move trade files to the trade directory
                shutil.move(os.path.join(directory, filename), os.path.join(trade_dir, filename))
            elif 'bbo' in filename:
                # Move bbo files to the bbo directory
                shutil.move(os.path.join(directory, filename), os.path.join(bbo_dir, filename))
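
The substring test works for the file names seen here, but it would misfile any name that happens to contain "trade" or "bbo" elsewhere. A stricter alternative is to parse the whole name. The sketch below assumes every file follows the pattern seen in the listing above (date, ticker, kind, then `.csv.gz`); `parse_quote_filename` is a hypothetical helper.

```python
import re

# Assumed naming pattern: YYYY-MM-DD-<ticker>-<trade|bbo>.csv.gz
_FILENAME_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2})-(.+)-(trade|bbo)\.csv\.gz$"
)


def parse_quote_filename(filename):
    """Return the date, ticker and kind of a file name, or None if it does not match."""
    match = _FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    date, ticker, kind = match.groups()
    return {"date": date, "ticker": ticker, "kind": kind}
```

The `kind` field can then select the destination folder, and files that return `None` can be reported instead of being skipped silently.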


### Next step: 
Exploring the csv.gz files (trade and bbo): 

In [52]:
## Example of trade file : 
trade_dir = trade_directories[0]
trade_file = os.listdir(trade_dir)[0]
# Using gzip.open to decompress the file and read it with pandas
with gzip.open(os.path.join(trade_dir,trade_file), 'rt') as file:
    df = pd.read_csv(file)
print(df.head(4))

/Users/ilyesbenayed/Desktop/Big data/data/raw/flash_crash_DJIA/csv_files/GS.N-2010/trade
         xltime  trade-price  trade-volume trade-stringflag  \
0  40205.604499       150.75        122600          auction   
1  40205.604500       150.80           100    uncategorized   
2  40205.604500       150.80           100    uncategorized   
3  40205.604500       150.80           100    uncategorized   

                                       trade-rawflag  
0  [CTS_QUAL       ]O                            ...  
1  [CTS_QUAL       ]                             ...  
2  [CTS_QUAL       ]                             ...  
3  [CTS_QUAL       ]                             ...  


In [53]:
## Example of bbo file : 
bbo_dir = bbo_directories[0]
bbo_file = os.listdir(bbo_dir)[0]
# Using gzip.open to decompress the file and read it with pandas
with gzip.open(os.path.join(bbo_dir,bbo_file), 'rt') as file:
    df = pd.read_csv(file)
print(df.head(4))

         xltime  bid-price  bid-volume  ask-price  ask-volume
0  40416.562628     144.70           1     144.93           1
1  40416.562628     144.76           1     144.93           1
2  40416.562628     144.76           2     144.93           1
3  40416.562628     144.76           2     144.93           2
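
The `xltime` values in both files look like Excel serial dates, that is, days counted from 1899-12-30: 40205 falls on 2010-01-27 and 40416 on 2010-08-26, which matches the dates in the file names. Under that assumption, which the data provider's documentation should confirm, the column can be converted to timestamps as sketched below. The helper name is hypothetical and the time zone of the result is not established here.

```python
import pandas as pd


def xltime_to_datetime(xltime):
    """Convert Excel-style serial days to pandas timestamps.

    Assumption: xltime counts days (with a fractional part for the
    time of day) from the Excel origin 1899-12-30.
    """
    return pd.to_datetime(xltime, unit="D", origin="1899-12-30")
```

For example, `df["timestamp"] = xltime_to_datetime(df["xltime"])` would add a readable time column to either DataFrame.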


#### Last step: Regrouping csv files by month : 

In [58]:
file_example = os.listdir(trade_directories[0])[0]

In [61]:
part_date = file_example.split('-')[:2]
"_".join(part_date)

'2010_01'
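
Splitting on "-" takes the first two pieces whatever they contain, so an unexpected file name would produce an odd folder name. A slightly more defensive sketch parses the leading date first; `month_folder` is a hypothetical helper and assumes the name starts with `YYYY-MM-DD`.

```python
from datetime import datetime


def month_folder(filename):
    """Return the 'YYYY_MM' folder name for a file whose name starts with YYYY-MM-DD.

    Raises ValueError when the first ten characters are not a valid date.
    """
    day = datetime.strptime(filename[:10], "%Y-%m-%d")
    return day.strftime("%Y_%m")
```

Calling `month_folder("2010-01-27-GS.N-trade.csv.gz")` gives the same `'2010_01'` as the split above, while a name without a leading date raises an error instead of creating a stray folder.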

In [70]:
def createFoldersGroupingYnMonthfiles(directories):
    """Move each file into a year_month subfolder (e.g. 2010_01) of its directory."""
    for directory in directories:
        for filename in os.listdir(directory):
            file_path = os.path.join(directory, filename)
            # Skip subfolders (including month folders created on a previous run)
            if not os.path.isfile(file_path):
                continue
            # File names start with YYYY-MM-DD, so the first two parts are year and month
            date_folder = "_".join(filename.split("-")[:2])
            date_dir = os.path.join(directory, date_folder)
            os.makedirs(date_dir, exist_ok=True)
            # Move the file into its month folder
            shutil.move(file_path, os.path.join(date_dir, filename))


In [71]:
createFoldersGroupingYnMonthfiles(trade_directories)


In [72]:
createFoldersGroupingYnMonthfiles(bbo_directories)

## Conclusion of this Notebook: 
In this notebook we started from the .tar files located in the tar_files folder. We first explored the contents of the different tar files and found that they all contain .gz files. For each tar file we created a corresponding folder inside the csv_files folder. After looking at the names of the .gz files, we noticed that there were trade and bbo (best bid and offer) files, so as a second step we made, for each stock, a trade directory grouping the csv.gz trade files and a bbo directory grouping the csv.gz bbo files. We also made an initial exploration of one trade file and one bbo file. As a final step, we regrouped the files by year and month. The final structure is:    
csv_files directory --> one directory per stock --> trade and bbo directories --> year_month directories --> csv.gz files 
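
With this layout, all the files of one stock, kind and month sit in a single folder, so a month can be loaded in one call. The function below is a minimal sketch, not part of the original notebook; `load_month` and its parameters are hypothetical names, and it relies on pandas inferring gzip compression from the `.gz` extension.

```python
from pathlib import Path

import pandas as pd


def load_month(csv_root, stock_dir, kind, month):
    """Concatenate every csv.gz file of one stock, kind and month.

    Hypothetical helper. Example arguments:
    stock_dir="GS.N-2010", kind="trade", month="2010_01".
    """
    folder = Path(csv_root) / stock_dir / kind / month
    # File names start with the date, so sorting by name sorts by day
    paths = sorted(folder.glob("*.csv.gz"))
    if not paths:
        raise FileNotFoundError(f"no csv.gz files under {folder}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

For instance, `load_month(dataSavingDirectory, "GS.N-2010", "trade", "2010_01")` should return the January 2010 trades of that stock as one DataFrame.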