1\. **Text files**

Perform the following operations on plain `txt` files:

+ create a list of integer numbers and then save it to a text file named `data_int.txt`. Run the `cat` command to print the content of the file.
+ create a matrix of 5x5 floats and then save it to a text file named `data_float.txt`. Use the `cat` command to print the content of the file.
+ load the `txt` file of the previous point and convert it to a `csv` file by hand.

In [1]:
import numpy as np

file_name = "data_int.txt"
lista = np.arange(10)
with open(file_name, 'w') as file:  # open the file in write mode
    file.write(str(lista))
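
The second and third points of the exercise can be sketched as follows. This is a minimal example, not the only way to do it: the matrix values are arbitrary, and the output name `data_float.csv` is an assumption (the exercise only names `data_float.txt`).

```python
import numpy as np

# Hypothetical 5x5 float matrix; the values are arbitrary.
matrix = np.arange(25, dtype=float).reshape(5, 5) / 2

# np.savetxt writes one row per line, separated by spaces by default.
np.savetxt("data_float.txt", matrix, fmt="%.2f")

# Convert to CSV "by hand": split each line on whitespace and re-join with commas.
with open("data_float.txt") as txt_file, open("data_float.csv", "w") as csv_file:
    for line in txt_file:
        csv_file.write(",".join(line.split()) + "\n")

# Read the CSV back to inspect the result.
with open("data_float.csv") as csv_file:
    csv_lines = csv_file.read().splitlines()
print(csv_lines[0])
```

In a notebook, `!cat data_float.txt` prints the content of the intermediate text file.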

2\. **JSON files**

Load the file `user_data.json`, which can be found at:

- https://www.dropbox.com/s/sz5klcdpckc39hd/user_data.json

and filter the data by "CreditCardType", keeping the entries where it equals "American Express". Then save the filtered data to a new CSV file.

In [2]:
import json
import pandas as pd

# Load the JSON records and build a DataFrame from them
with open('user_data.json') as json_file:
    data = json.load(json_file)

df = pd.DataFrame(data)

# Keep only the rows whose card type is American Express
df = df[df['CreditCardType'] == "American Express"]
df.to_csv('American Express.csv', index=False)
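
The boolean-mask filter can be tried without downloading the file. The sketch below uses a few made-up records; the `ID` field is hypothetical, and only `CreditCardType` is taken from the exercise.

```python
import json
import pandas as pd

# Hypothetical JSON text standing in for user_data.json.
json_text = """[
    {"ID": 1, "CreditCardType": "American Express"},
    {"ID": 2, "CreditCardType": "VISA"},
    {"ID": 3, "CreditCardType": "American Express"}
]"""

sample = pd.DataFrame(json.loads(json_text))

# The comparison yields a boolean Series; indexing with it keeps the True rows.
amex = sample[sample["CreditCardType"] == "American Express"]

# Without a path argument, to_csv returns the CSV text instead of writing a file.
csv_text = amex.to_csv(index=False)
print(csv_text)
```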

3\. **CSV files with Pandas**

Load the file from this url:

- https://www.dropbox.com/s/kgshemfgk22iy79/mushrooms_categorized.csv

with Pandas. 

+ explore and print the DataFrame
+ calculate, using `groupby()`, the average value of each feature, separately for each class
+ save the file in a JSON format.

In [3]:
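file = "mushrooms_categorized.csv"
data = pd.read_csv(file)
print("Type:", type(data))

# Average of every feature, computed separately for each class
classe0 = data.groupby('class').mean()

# to_json already returns a JSON string, so write it to disk directly
# (passing it through json.dump would encode it a second time)
js = classe0.to_json(orient="index", indent=6)
with open("mushrooms_categorized.json", "w") as save_file:
    save_file.write(js)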


Type: <class 'pandas.core.frame.DataFrame'>
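
To see what `groupby('class').mean()` and `to_json(orient="index")` produce, here is a small sketch on a toy DataFrame. The feature names and values are invented for illustration and are not taken from the mushroom dataset.

```python
import json
import pandas as pd

# Toy data: two classes, two numeric features (hypothetical names and values).
toy = pd.DataFrame({
    "class": [0, 0, 1, 1],
    "cap-shape": [2, 4, 1, 3],
    "odor": [6, 6, 0, 2],
})

# One row per class, one column per feature, each cell holding the class mean.
means = toy.groupby("class").mean()

# orient="index" maps each class label to a {feature: mean} dictionary.
json_text = means.to_json(orient="index")

# Parsing the text back gives nested dictionaries; JSON keys are always strings.
parsed = json.loads(json_text)
print(parsed)
```

Since `to_json` returns text that is already valid JSON, it can be written to a file as it is.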


4\. **Reading a database**

Get the database `sakila.db` from the lecture `06_dataio.ipynb`, and import the table `actor` as a Pandas DataFrame. Using the DataFrame, count how many actors have a first name that begins with `A`.

*Hint:* use the Series `.str` method to apply the Python string methods to the elements of a Series, see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html).

In [13]:
import sqlite3

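conn = sqlite3.connect('sakila.db')

# Load the whole actor table into a DataFrame, then release the connection
query = "SELECT * FROM actor;"
actors_df = pd.read_sql_query(query, conn)
conn.close()

# Boolean mask of first names starting with 'A'; missing names count as False
count_a_actors = actors_df[actors_df['first_name'].str.startswith('A', na=False)].shape[0]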

print(f"The number of actors with a first name starting with 'A' is: {count_a_actors}")


The number of actors with a first name starting with 'A' is: 13
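
The same pattern can be checked on a tiny in-memory database. This is only a sketch: the table below is a made-up stand-in with two columns, not the real `sakila.db` schema.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical, reduced "actor" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO actor (first_name) VALUES (?)",
    [("ADAM",), ("BOB",), ("ANGELA",), (None,)],
)
conn.commit()

toy_actors = pd.read_sql_query("SELECT * FROM actor;", conn)
conn.close()

# .str.startswith works element-wise; na=False turns missing names into False.
starts_with_a = toy_actors["first_name"].str.startswith("A", na=False)
n_a = int(starts_with_a.sum())
print(n_a)
```

Summing the boolean mask is equivalent to filtering the DataFrame and taking `shape[0]`.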


5\. **Reading the credit card numbers**

Get the binary file named `credit_card.dat` from this address:

- https://www.dropbox.com/s/8m0syw2tkul3dap/credit_card.dat

and convert the data into the real credit card number, knowing that:
- each line corresponds to a credit card number, which consists of 16 characters (digits in the 0-9 range) divided into 4 blocks, with a whitespace between consecutive blocks
- each character is written using a 6 bit binary representation (including the whitespace)
- the final 4 bits of each line are a padding used to determine the end of the line, and can be ignored

*Hint*: convert the binary numbers to the decimal representation first, and then use the `chr()` function to convert the latter to a char

In [None]:
with open('credit_card.dat', 'rb') as file:
    bin_data = file.read()

# Split the binary file into lines
lines = bin_data.split(b'\n')

carta = []
for line in lines:
    if len(line) <= 4:  # skip empty or padding-only lines
        continue
    line = line[:-4]  # drop the last 4 characters (end-of-line padding)
    blocks = [line[i:i+6] for i in range(0, len(line), 6)]  # split the line into 6-bit blocks
    decimals = [int(block, 2) for block in blocks]
    characters = [chr(decimal) for decimal in decimals]
    result = ''.join(characters)
    carta.append(result)

print('\n'.join(carta))
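
The decoding rule can be verified with a round trip on a made-up number. In this sketch `encode_line` and `decode_line` are hypothetical helpers, the sample number is invented, and the padding is assumed to be `0000` (the exercise only says the last 4 bits can be ignored).

```python
def encode_line(text: str) -> str:
    """Write each character as 6 bits, then append 4 padding bits (assumed zeros)."""
    return "".join(format(ord(ch), "06b") for ch in text) + "0000"


def decode_line(bits: str) -> str:
    """Drop the 4 padding bits, then turn every 6-bit chunk back into a character."""
    payload = bits[:-4]
    chunks = [payload[i:i + 6] for i in range(0, len(payload), 6)]
    return "".join(chr(int(chunk, 2)) for chunk in chunks)


# Invented card number: 16 digits plus 3 spaces, 19 characters in total.
sample_number = "1234 5678 9012 3456"
encoded = encode_line(sample_number)
decoded = decode_line(encoded)
print(len(encoded), decoded)
```

Six bits are enough because the digits (codes 48 to 57) and the space (code 32) are all below 64.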


6\. **Write data to a binary file**

a) Start from the `data/data_000637.txt` file that we have used during the previous lectures, and convert it to a binary file according to the format defined below:

In [None]:
#from IPython.display import Image
#Image("images/data_format.png")

*Hints*:
- Read the first 10 lines using Pandas
- Iterate over the DataFrame rows
- For every row, "pack" the values (features) into a single 64-bit word, according to the format specified above. Use bit-wise shifts and operators to do so.
- Write each 64-bit word to a binary file. You can use `struct` in this way:
```
binary_file.write( struct.pack('<q', word) )
```
where `word` is the 64-bit word.
- Close the file after completing the loop.
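
The packing step can be sketched with two small helpers. The field widths and offsets below are an assumption, inferred from the shifts and masks used when reading the file back (2 + 4 + 9 + 32 + 12 + 5 = 64 bits); `pack_word`, `unpack_word` and the sample row are hypothetical. The sketch uses the unsigned format `'<Q'` instead of `'<q'`, so that words with the top bit set are also accepted.

```python
import struct

# (name, width in bits, offset of the lowest bit): assumed layout, see the lead-in.
FIELDS = [
    ("HEAD", 2, 62),
    ("FPGA", 4, 58),
    ("CHANNEL", 9, 49),
    ("ORBIT_CNT", 32, 17),
    ("BX_CNT", 12, 5),
    ("TDC_MEAS", 5, 0),
]


def pack_word(values: dict) -> int:
    """Mask each field to its width and OR it into place."""
    word = 0
    for name, width, shift in FIELDS:
        mask = (1 << width) - 1
        word |= (int(values[name]) & mask) << shift
    return word


def unpack_word(word: int) -> dict:
    """Shift each field down to bit 0 and mask away the rest."""
    return {name: (word >> shift) & ((1 << width) - 1) for name, width, shift in FIELDS}


# Invented row used only to check the round trip.
row = {"HEAD": 1, "FPGA": 0, "CHANNEL": 123, "ORBIT_CNT": 3869200167, "BX_CNT": 2374, "TDC_MEAS": 26}
raw = struct.pack("<Q", pack_word(row))
restored = unpack_word(struct.unpack("<Q", raw)[0])
print(len(raw), restored)
```

Converting every value with `int()` before shifting avoids fixed-width integer overflow when the inputs are NumPy integers.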

b) Check that the binary file is correctly written by reading it with the code used in the lecture `06_dataio.ipynb`, and verify that the content of the `txt` and binary files is consistent.

c) What is the difference of the size on disk between equivalent `txt` and binary files?

In [None]:

import struct

datatxt = pd.read_csv('data_000637.txt', nrows=10)
newfile = "data_000637_new.dat"

# Pack the six fields of each row into one 64-bit word and write it to the file
with open(newfile, 'wb') as binary_file:
    for i in range(len(datatxt)):
        # Convert to plain Python ints so that the shifts cannot overflow
        head, fpga, tdc_chan, orb_cnt, bx, tdc_meas = (int(v) for v in datatxt.iloc[i, :6])
        word = (head & 0x3) << 62
        word = word | ((fpga & 0xF) << 58)
        word = word | ((tdc_chan & 0x1FF) << 49)
        word = word | ((orb_cnt & 0xFFFFFFFF) << 17)
        word = word | ((bx & 0xFFF) << 5)
        word = word | (tdc_meas & 0x1F)
        binary_file.write(struct.pack('<q', word))


# Read the file back and print the data to check the correspondence
# with the actual values stored in the txt file
columns = ['HEAD', 'FPGA', 'CHANNEL', 'ORBIT_CNT', 'BX_CNT', 'TDC_MEAS']
rows = []
with open(newfile, 'rb') as file:
    file_content = file.read()
word_size = 8  # size of the word in bytes
print('\t'.join(columns))
for i in range(0, len(file_content), word_size):
    word = struct.unpack('<q', file_content[i : i + word_size])[0]  # get an 8-byte word
    head     = (word >> 62) & 0x3
    fpga     = (word >> 58) & 0xF
    tdc_chan = (word >> 49) & 0x1FF
    orb_cnt  = (word >> 17) & 0xFFFFFFFF
    bx       = (word >> 5 ) & 0xFFF
    tdc_meas = (word >> 0 ) & 0x1F
    print('{0}\t{1}\t{2}\t{3}\t{4}\t{5}'.format(head, fpga, tdc_chan, orb_cnt, bx, tdc_meas))
    rows.append({'HEAD' : head, 'FPGA' : fpga, 'CHANNEL' : tdc_chan, 'ORBIT_CNT' : orb_cnt, 'BX_CNT' : bx, 'TDC_MEAS' : tdc_meas})

# Compare the decoded values with the rows read from the txt file
df = pd.DataFrame(rows, columns=columns)
print("Consistent with the txt file:", (df.to_numpy() == datatxt.to_numpy()).all())
        
# On average the txt file takes about 26 bytes per row, while the packed binary dat file
# takes 8 bytes per row, so each row occupies about 18 bytes more in the txt file than in the dat file.
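
The size difference can also be measured directly with `os.path.getsize`. The sketch below writes two invented rows to temporary files rather than using the real data file; the rows, the file names and the bit layout are assumptions made for illustration.

```python
import os
import struct
import tempfile

# Invented rows in the order HEAD, FPGA, CHANNEL, ORBIT_CNT, BX_CNT, TDC_MEAS.
rows = [
    (1, 0, 123, 3869200167, 2374, 26),
    (1, 0, 124, 3869200167, 2374, 27),
]

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "sample.txt")
    bin_path = os.path.join(tmp, "sample.dat")

    # Text version: comma-separated values, one row per line.
    with open(txt_path, "w", newline="\n") as txt_file:
        for r in rows:
            txt_file.write(",".join(str(v) for v in r) + "\n")

    # Binary version: one packed 64-bit word per row (assumed layout).
    with open(bin_path, "wb") as bin_file:
        for head, fpga, chan, orbit, bx, tdc in rows:
            word = (head << 62) | (fpga << 58) | (chan << 49) | (orbit << 17) | (bx << 5) | tdc
            bin_file.write(struct.pack("<Q", word))

    txt_size = os.path.getsize(txt_path)
    bin_size = os.path.getsize(bin_path)

print(f"txt: {txt_size} bytes, binary: {bin_size} bytes")
```

The binary size is always 8 bytes per row, while the text size depends on how many digits each value needs, plus separators and the newline.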