1\. **Text files**

Perform the following operations on plain `txt` files:

+ create a list of integer numbers and then save it to a text file named `data_int.txt`. Run the `cat` command to print the content of the file.
+ create a matrix of 5x5 floats and then save it to a text file named `data_float.txt`. Use the `cat` command to print the content of the file.
+ load the `txt` file of the previous point and convert it to a `csv` file by hand.

In [None]:
# Operation 1: Creating a list of integer numbers
integer_numbers = [1, 2, 3, 4, 5]
with open("data_int.txt", "w") as file:
    for number in integer_numbers:
        file.write(str(number) + "\n")
print("Integer numbers have been saved to data_int.txt. Here's what it looks like:")
# Now, let's print the content of the file using the cat command
!cat data_int.txt

# Operation 2: Creating a matrix of 5x5 floats
matrix = [[0.1, 0.2, 0.3, 0.4, 0.5],
          [0.6, 0.7, 0.8, 0.9, 1.0],
          [1.1, 1.2, 1.3, 1.4, 1.5],
          [1.6, 1.7, 1.8, 1.9, 2.0],
          [2.1, 2.2, 2.3, 2.4, 2.5]]
with open("data_float.txt", "w") as file:
    for row in matrix:
        file.write(" ".join([str(num) for num in row]) + "\n")
print("Float matrix has been saved to data_float.txt. Here's what it looks like:")
# Let's print the content of the file using the cat command once again
!cat data_float.txt

# Operation 3: Converting the float matrix to a CSV file
import csv

# First, load the text file
loaded_matrix = []
with open("data_float.txt", "r") as file:
    for line in file:
        loaded_matrix.append([float(num) for num in line.strip().split()])

# Now, let's convert it to a CSV file
with open("data_float.csv", "w", newline="") as file:
    writer = csv.writer(file)
    for row in loaded_matrix:
        writer.writerow(row)
print("The float matrix has been converted to data_float.csv")
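
The third point asks for a conversion "by hand", while the cell above relies on the `csv` module. As a minimal sketch of the manual route, assuming the values are separated by whitespace and contain no commas, each row can be split and rejoined with commas. The sample string and the helper name `txt_to_csv` are hypothetical and only stand in for the content of `data_float.txt`.

```python
# Hypothetical sample standing in for the contents of data_float.txt
txt_content = "0.1 0.2 0.3\n0.4 0.5 0.6\n"

def txt_to_csv(text):
    """Convert whitespace-separated rows into comma-separated rows."""
    # Skip blank lines, split each row on any whitespace, rejoin with commas
    rows = [",".join(line.split()) for line in text.splitlines() if line.strip()]
    return "\n".join(rows) + "\n"

csv_content = txt_to_csv(txt_content)
print(csv_content, end="")
```

The same function could be applied to the text read from `data_float.txt` and the result written to `data_float.csv` with a plain `file.write()`.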

2\. **JSON files**

Load the file `user_data.json`, which can be found at:

- https://www.dropbox.com/s/sz5klcdpckc39hd/user_data.json

and filter the data, keeping only the records whose "CreditCardType" equals "American Express". Then save the filtered data to a new CSV file.

In [None]:
import json
import csv

filename = "filtered_data.csv"

# Load the JSON data (the file must already be downloaded to the working directory)
with open("user_data.json", "r") as file:
    data = json.load(file)

# Filter the data based on the "CreditCardType" field
filtered_data = [user for user in data if user.get("CreditCardType") == "American Express"]

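# Save the filtered data to a CSV file (skipped if no record matches,
# since the header is taken from the first filtered record)
if filtered_data:
    with open(filename, "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=filtered_data[0].keys())
        writer.writeheader()
        writer.writerows(filtered_data)
else:
    print("No matching records; nothing was written.")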
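
The filter-and-export logic can be tried without downloading the file. The sketch below uses two made-up records (only the `CreditCardType` key comes from the exercise; the `ID` field and the helper names are hypothetical) and writes the CSV to an in-memory buffer instead of the disk.

```python
import csv
import io
import json

# Hypothetical records; only the "CreditCardType" key is taken from the exercise
raw = '[{"ID": 1, "CreditCardType": "American Express"}, {"ID": 2, "CreditCardType": "Visa"}]'
users = json.loads(raw)

def filter_by_card(records, card_type):
    """Keep the records whose CreditCardType matches card_type exactly."""
    return [r for r in records if r.get("CreditCardType") == card_type]

def to_csv_text(records):
    """Serialize a list of dicts to CSV text; return '' for an empty list."""
    if not records:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

amex = filter_by_card(users, "American Express")
csv_text = to_csv_text(amex)
print(csv_text)
```

Returning an empty string for an empty list avoids the `IndexError` that `records[0]` would otherwise raise when nothing matches.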

3\. **CSV files with Pandas**

Load the file from this URL:

- https://www.dropbox.com/s/kgshemfgk22iy79/mushrooms_categorized.csv

with Pandas. 

+ explore and print the DataFrame
+ calculate, using `groupby()`, the average value of each feature, separately for each class
+ save the file in a JSON format.

In [None]:
import pandas as pd

# Load the mushrooms_categorized.csv file into a DataFrame
data = pd.read_csv("mushrooms_categorized.csv")

# Explore and print the DataFrame
print(data)

# Calculate the average value of each feature, separately for each class using groupby()
averages = data.groupby("class").mean()

# Save the DataFrame with average values to a JSON file
averages.to_json("averages.json")
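
To see what `groupby("class").mean()` produces, here is a small sketch on a toy DataFrame. The feature columns `cap-shape` and `odor` and their values are made up for illustration and are not taken from the mushrooms file; `orient="index"` is one possible choice of JSON layout, not the only one.

```python
import json

import pandas as pd

# Toy data: two classes, two numeric (already categorized) features
df = pd.DataFrame({
    "class": [0, 0, 1, 1],
    "cap-shape": [2, 4, 1, 3],
    "odor": [6, 6, 0, 2],
})

# One row per class, one column per feature, each cell holding the class mean
averages = df.groupby("class").mean()
print(averages)

# Serialize with one JSON object per class
json_text = averages.to_json(orient="index")
parsed = json.loads(json_text)
print(parsed)
```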

4\. **Reading a database**

Get the database `sakila.db` from the lecture `06_dataio.ipynb`, and import the table `actors` as a Pandas dataframe. Using the dataframe, count how many actors have a first name that begins with `A`.

*Hint:* use the Series `.str` method to apply the Python string methods to the elements of a Series, see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html).

In [None]:
import pandas as pd
import sqlite3

# Connect to the sakila.db database
conn = sqlite3.connect("dat/sakila.db")

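# Import the actor table (named `actor` in the query below) as a Pandas DataFrame
query = "SELECT * FROM actor"
actors_df = pd.read_sql_query(query, conn)

# The data is now in memory, so the connection can be closed
conn.close()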

# Count how many actors have a first name that begins with 'A'
count_actors_with_a = actors_df[actors_df['first_name'].str.startswith('A')].shape[0]

# Display the count
print(f"The number of actors with a first name starting with 'A' is: {count_actors_with_a}")
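
The same approach can be checked without `sakila.db` by building a tiny in-memory SQLite table. The three rows below are invented for illustration, and the table layout is only an assumption that mimics the columns used above.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical, minimal actor table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO actor (first_name, last_name) VALUES (?, ?)",
    [("ADAM", "GRANT"), ("ANGELA", "HUDSON"), ("BOB", "FAWCETT")],
)
conn.commit()

try:
    actors = pd.read_sql_query("SELECT * FROM actor", conn)
finally:
    conn.close()

# .str.startswith returns a boolean Series; summing it counts the True values
n_with_a = int(actors["first_name"].str.startswith("A").sum())
print(n_with_a)
```

Summing the boolean mask gives the same count as filtering the DataFrame and reading `.shape[0]`, without building the intermediate DataFrame.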

5\. **Reading the credit card numbers**

Get the binary file named `credit_card.dat` from this address:

- https://www.dropbox.com/s/8m0syw2tkul3dap/credit_card.dat

and convert the data into the real credit card number, knowing that:
- each line corresponds to a credit card number, which consists of 16 characters (digits in the 0-9 range) divided into 4 blocks, with a whitespace between each block
- each character is written using a 6-bit binary representation (including the whitespace)
- the final 4 bits of each line are padding used to mark the end of the line, and can be ignored

*Hint*: convert the binary numbers to the decimal representation first, and then use the `chr()` function to convert the latter to a char

In [None]:
# Open the file in binary mode
with open("credit_card.dat", "rb") as file:
    # Read the bytes and decode them into a string of '0' and '1' characters
    binary_data = file.read().decode()

# Split the binary data into lines
lines = binary_data.splitlines()

# Convert each line to the real credit card number
credit_cards = []
for line in lines:
    # Remove the padding bits at the end of each line
    line = line[:-4]
    
    # Split the line into 6-bit chunks, one per character (digits and whitespaces)
    blocks = [line[i:i+6] for i in range(0, len(line), 6)]
    
    # Convert each block from binary to decimal and then to a character
    card_number = ''.join([chr(int(block, 2)) for block in blocks])
    
    # Add the credit card number to the list
    credit_cards.append(card_number)

# Print the converted credit card numbers
for card in credit_cards:
    print(card)
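
The decoding works because the ASCII codes of the digits (48 to 57) and of the whitespace (32) all fit in 6 bits. The round trip below illustrates this on a made-up card number. The encoder, the helper names and the `0000` padding value are assumptions for illustration, since the exercise only says that the padding is 4 bits long.

```python
def encode_line(card, padding="0000"):
    """Write each character as a 6-bit binary string, then append the padding."""
    return "".join(format(ord(ch), "06b") for ch in card) + padding

def decode_line(bits, n_padding=4, width=6):
    """Drop the padding, then convert each 6-bit chunk back to a character."""
    payload = bits[:-n_padding]
    chunks = [payload[i:i + width] for i in range(0, len(payload), width)]
    return "".join(chr(int(chunk, 2)) for chunk in chunks)

sample = "1234 5678 9012 3456"   # made-up number: 16 digits + 3 whitespaces
encoded = encode_line(sample)
decoded = decode_line(encoded)

print(len(encoded))  # 19 characters * 6 bits + 4 padding bits
print(decoded)
```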

6\. **Write data to a binary file**

a) Start from the `data/data_000637.txt` file that we have used during the previous lectures, and convert it to a binary file according to the format defined below:

In [None]:
from IPython.display import Image
Image("images/data_format.png")

*Hints*:
- Read the first 10 lines using Pandas
- Iterate over the DataFrame rows
- For every row, "pack" the values (features) into a single 64-bit word, according to the format specified above. Use bit-wise shifts and operators to do so.
- Write each 64-bit word to a binary file. You can use `struct` in this way:
```
binary_file.write( struct.pack('<q', word) )
```
where `word` is the 64-bit word.
- Close the file after completing the loop.

b) Check that the binary file is correctly written by reading it with the code used in the lecture `06_dataio.ipynb`, and verify that the content of the `txt` and binary files is consistent.

c) What is the difference in size on disk between equivalent `txt` and binary files?

In [None]:
import pandas as pd
import struct

# Read the first 10 lines using Pandas
df = pd.read_csv('data/data_000637.txt', sep=',', nrows=10)

# Open a binary file to write the converted data
with open('data/data_000637.dat', 'wb') as binary_file:
    # Iterate over the DataFrame rows
    for _, row in df.iterrows():
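        # Extract the values from the row as plain Python ints,
        # so that the shifts below cannot overflow a fixed-width NumPy integer
        head = int(row['HEAD'])
        fpga = int(row['FPGA'])
        tdc_chan = int(row['TDC_CHANNEL'])
        orb_cnt = int(row['ORBIT_CNT'])
        bx = int(row['BX_COUNTER'])
        tdc_meas = int(row['TDC_MEAS'])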
        
        # Pack the values into a single 64-bit word
        word = (head << 62) | (fpga << 58) | (tdc_chan << 49) | (orb_cnt << 17) | (bx << 5) | tdc_meas
        
        # Write the 64-bit word to the binary file
        binary_file.write(struct.pack('<q', word))

# The `with` block closes the file automatically, so no explicit close() is needed
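
For part b), packing can be checked by unpacking each word with shifts and masks and comparing the result with the original row. The field widths below (2, 4, 9, 32, 12 and 5 bits) are inferred from the shifts used in the cell above, not read from the format image, so they should be treated as an assumption. The row values are made up. The sketch uses the unsigned format `'<Q'` instead of the `'<q'` suggested in the hints, so that a `HEAD` value with the top bit set would not overflow.

```python
import struct

# (name, shift, width) for each field; widths inferred from the shifts, not from the image
FIELDS = [
    ("HEAD", 62, 2),
    ("FPGA", 58, 4),
    ("TDC_CHANNEL", 49, 9),
    ("ORBIT_CNT", 17, 32),
    ("BX_COUNTER", 5, 12),
    ("TDC_MEAS", 0, 5),
]

def pack_word(values):
    """Combine the fields into one 64-bit integer with shifts and ORs."""
    word = 0
    for name, shift, width in FIELDS:
        value = int(values[name])
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word

def unpack_word(word):
    """Recover each field by shifting right and masking to its width."""
    return {name: (word >> shift) & ((1 << width) - 1) for name, shift, width in FIELDS}

# Made-up row, for illustration only
row = {"HEAD": 1, "FPGA": 0, "TDC_CHANNEL": 123,
       "ORBIT_CNT": 3869200167, "BX_COUNTER": 2374, "TDC_MEAS": 26}

raw = struct.pack("<Q", pack_word(row))          # 8 bytes, little-endian, unsigned
restored = unpack_word(struct.unpack("<Q", raw)[0])
print(restored == row)
```

The range check in `pack_word` matters because a value wider than its field would silently overwrite the neighbouring field.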

In [None]:
# It works and the data is consistent.
# However, it is not clear to me what this has to do with a PNG file (?!),
# and the sizes of the txt and binary files cannot be compared directly,
# since I wrote only ten lines to the binary file.
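
One possible way to approach part c) on equal terms is to compare the same number of rows in both formats. The sketch below writes ten copies of a made-up text row and ten 64-bit words to temporary files and compares their sizes. The row content and file names are hypothetical, and the text size will vary with the number of digits in the real data.

```python
import os
import struct
import tempfile

# Made-up row in the column order HEAD,FPGA,TDC_CHANNEL,ORBIT_CNT,BX_COUNTER,TDC_MEAS
sample_line = "1,0,123,3869200167,2374,26\n"
n_rows = 10

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "sample.txt")
    bin_path = os.path.join(tmp, "sample.dat")

    # newline="\n" keeps one byte per line ending on every platform
    with open(txt_path, "w", newline="\n") as f:
        f.write(sample_line * n_rows)

    with open(bin_path, "wb") as f:
        for _ in range(n_rows):
            # Placeholder word: only the size matters here, always 8 bytes per row
            f.write(struct.pack("<Q", 0))

    txt_size = os.path.getsize(txt_path)
    bin_size = os.path.getsize(bin_path)

print(f"text: {txt_size} bytes, binary: {bin_size} bytes, ratio: {txt_size / bin_size:.2f}")
```

The binary size is fixed at 8 bytes per row, whereas each text row costs one byte per character plus the line ending.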