1\. **Text files**

Perform the following operations on plain `txt` files:

+ create a list of integrer numbers and then save it to a text file named `data_int.txt`. Run the `cat` command to print the content of the file.
+ create a matrix of 5x5 floats and then save it to a text file named `data_float.txt`. Use the `cat` command to print the content of the file.
+ load the `txt` file of the previous point and convert it to a `csv` file by hand.

In [1]:
import numpy as np
import pandas as pd
import json
import csv
import sqlite3 as sql
import struct
import os

array = np.random.randint(0, 1001, size=77)
string_w = " ".join(map(str, array))

with open("data_int.txt", "w") as file:
    file.write(string_w)
print("List of integers from file")
with open("data_int.txt", "r") as file:
    data = file.read()
print(data) 

float_matrix = np.random.rand(5, 5)
np.savetxt("data_float.txt", float_matrix)

print("Matrix 5x5 float")
with open("data_float.txt", "r") as file:
    data = file.read()
print(data) 

csv_file = pd.read_csv("data_float.txt")
csv_file.to_csv("data_float.csv", sep=" ", index=False)
print("Data from csv file")
with open("data_float.csv", "r") as file:
    data = file.read()
print(data)

List of integers from file
538 646 278 340 990 521 203 530 416 602 747 749 900 480 70 157 359 238 216 221 603 286 492 672 286 187 555 924 860 969 459 36 573 327 499 978 313 187 971 745 167 147 385 577 435 819 523 392 700 241 142 861 447 857 526 439 427 527 529 895 86 291 863 400 625 944 688 461 341 740 757 996 415 679 4 258 670
Matrix 5x5 float
4.859352266881202809e-01 8.381297207630219459e-01 7.953500529441548395e-02 6.763594487149449108e-01 9.449999797452778294e-01
3.318476355446806636e-01 8.091782439418153672e-01 4.708077634974406900e-01 3.831079092390106133e-01 6.633727829178771573e-01
6.227581559026729163e-01 2.691018070311934629e-02 1.068518071272656078e-01 8.883278533774658925e-01 7.050518348378825850e-01
9.195777953246092595e-01 8.711865612588625130e-01 3.981853128390490104e-01 4.289619891396749818e-01 6.868061415126800462e-01
3.206098149925568075e-01 9.494734504432392752e-01 2.031411683076739250e-01 4.511046550179784731e-01 1.277154071353678866e-01

Data from csv file
"4.85935

2\. **JSON files**

Load the file `user_data.json`, which can be found at:

- https://www.dropbox.com/s/sz5klcdpckc39hd/user_data.json

and filter the data by the "CreditCardType" when it equals to "American Express". Than save the data to a new CSV file.

In [2]:
with open("user_data.json") as json_file:
    data = json.load(json_file)

data_filter = [item for item in data if item["CreditCardType"] == "American Express"]

with open("ae_users.csv", "w", newline="") as csv_file:
    csv_writer = csv.DictWriter(csv_file, fieldnames=data[0].keys(), delimiter="\t")
    csv_writer.writeheader()
    csv_writer.writerows(data_filter)

with open("ae_users.csv", "r") as csv_file:
    print(csv_file.read())

ID	JobTitle	EmailAddress	FirstNameLastName	CreditCard	CreditCardType
2	Investment  Advisor	Clint_Thorpe5003@bulaffy.com	Clint Thorpe	7083-8766-0251-2345	American Express
12	Retail Trainee	Phillip_Carpenter9505@famism.biz	Phillip Carpenter	3657-0088-0820-5247	American Express
28	Project Manager	Russel_Graves1378@extex.org	Russel Graves	6718-4818-8011-6024	American Express
39	Stockbroker	Leanne_Newton1268@typill.biz	Leanne Newton	5438-0816-4166-4847	American Express
57	Budget Analyst	Tony_Giles1960@iatim.tech	Tony Giles	8130-3425-7573-7745	American Express
62	CNC Operator	Owen_Allcott5125@bauros.biz	Owen Allcott	4156-0107-7210-2630	American Express
68	Project Manager	Liam_Lynn3280@kideod.biz	Liam Lynn	7152-3247-6053-2233	American Express
74	Dentist	Regina_Woodcock5820@yahoo.com	Regina Woodcock	0208-1753-3870-8002	American Express
81	HR Specialist	Carter_Wallace9614@atink.com	Carter Wallace	4256-7201-6717-4322	American Express
92	Staffing Consultant	Maia_Stark2797@jiman.org	Maia Stark	385

3\. **CSV files with Pandas**

Load the file from this url:

- https://www.dropbox.com/s/kgshemfgk22iy79/mushrooms_categorized.csv

with Pandas. 

+ explore and print the DataFrame
+ calculate, using `groupby()`, the average value of each feature, separately for each class
+ save the file in a JSON format.

In [3]:
data = pd.read_csv('mushrooms_categorized.csv')
print("Exploring DataFrame:")
print(data) 

average_value = data.groupby('class').mean()
print("\nThe average value of each feature, separately for each class")
print(average_value)

average_value.to_json('mushrooms_average_values.json')

Exploring DataFrame:
      class  cap-shape  cap-surface  cap-color  bruises  odor  \
0         1          5            2          4        1     6   
1         0          5            2          9        1     0   
2         0          0            2          8        1     3   
3         1          5            3          8        1     6   
4         0          5            2          3        0     5   
...     ...        ...          ...        ...      ...   ...   
8119      0          3            2          4        0     5   
8120      0          5            2          4        0     5   
8121      0          2            2          4        0     5   
8122      1          3            3          4        0     8   
8123      0          5            2          4        0     5   

      gill-attachment  gill-spacing  gill-size  gill-color  ...  \
0                   1             0          1           4  ...   
1                   1             0          0           4  ... 

4\. **Reading a database**

Get the database `sakila.db` from the lecture `06_dataio.ipynb`, and import the table `actors` as a Pandas dataframe. Using the dataframe, count how many actors have a first name that begins with `A`.

*Hint:* use the Series `.str` method to apply the Python string methods to the elements of a Series, see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html).

In [4]:
conn = sql.connect('sakila.db')
query = "SELECT * FROM actor"
actors_data = pd.read_sql_query(query, conn)

count_actors_with_A = actors_data[actors_data['first_name'].str.startswith('A')].shape[0]

print("The amount of actors have a first name that begins with A:", count_actors_with_A)

The amount of actors have a first name that begins with A: 13


5\. **Reading the credit card numbers**

Get the binary file named `credit_card.dat` from this address:

- https://www.dropbox.com/s/8m0syw2tkul3dap/credit_card.dat

and convert the data into the real credit card number, knowing that:
- each line corresponds to a credit card number, which consists of 16 characters (which are numbers in the 0-9 range) divided in 4 blocks, with a whitespace between each block
- each character is written using a 6 bit binary representation (including the whitespace)
- the final 4 bits of each line are a padding used to determine the end of the line, and can be ignored

*Hint*: convert the binary numbers to the decimal representation first, and then use the `chr()` function to convert the latter to a char

In [5]:
cards= open('credit_card.dat')
binary_data = cards.read().split('\n')
n = 6
for item in binary_data:
    card_number = ""
    for index in range(0, len(item), n):
        binary = item[index: index + n]
        decimal = int(binary, 2)
        card_number += chr(decimal)
    print(card_number)

7648 5673 3775 2271

3257 8247 3354 2266

2722 0001 4011 6652

0661 3063 3742 3150

0432 1608 1462 4742

5827 2027 8785 7303

5774 8528 2087 1117

8140 1210 6352 2845

5764 1133 7301 7100

6456 1737 4126 6726

1228 8631 7382 0000

7051 0160 5374 3166

0618 3587 1630 6376

1545 5454 7444 5636

6735 3116 3202 6834

7287 5011 1547 8413

7033 2607 3328 4200

2568 5244 1874 5024

1684 2253 7570 7118

0672 2576 0575 6631

6332 8353 8787 1340

1813 3361 1175 4211

2477 6450 8840 2368

5512 3505 2563 1326

3083 7882 0621 0025

4521 5148 8045 0334

7563 3654 8713 5787

8324 2664 0476 5561

0565 2504 7168 3510

5107 5507 1767 0738

2462 1821 2448 1443

2788 0638 6861 6554

5851 5873 5474 0547

0670 1004 4013 2655

5874 5506 3048 0806

2805 5401 8462 1260

5083 8406 6310 1862

1076 1445 3013 2266

8440 4804 4844 5277

4758 6141 0686 1387

7586 0675 0315 2568

2544 1258 7432 5165

3474 5023 4434 5626

1410 0270 0434 5086

7315 4446 1104 4215

0224 7742 8300 0266

0170 2700 3145 0640

2006 2437 805

6\. **Write data to a binary file**

a) Start from the `data/data_000637.txt` file that we have used during the previous lectures, and convert it to a binary file according to the format defined below:

In [None]:
from IPython.display import Image
Image("images/data_format.png")

*Hints*:
- Read the first 10 lines using Pandas
- Iterate over the DataFrame rows
- For every row, "pack" the values (features) into a single 64-bit word, according to the format specified above. Use bit-wise shifts and operators to do so.
- Write each 64-bit word to a binary file. You can use `struct` in this way:
```
binary_file.write( struct.pack('<q', word) )
```
where `word` is the 64-bit word.
- Close the file after completing the loop.

b) Check that the binary file is correctly written by reading it with the code used in the lecture `06_dataio.ipynb`, and verify that the content of the `txt` and binary files is consistent.

c) What is the difference of the size on disk between equivalent `txt` and binary files?

In [6]:
datatxt = pd.read_csv('data_000637.txt', nrows=10)
binary_file = open("data_000637.dat", 'wb')
for i in range(len(datatxt)):
    word = ((datatxt.iloc[i, 0] & 0x3) << 62 |
            (datatxt.iloc[i, 1] & 0xF) << 58 |
            (datatxt.iloc[i, 2] & 0x1FF) << 49 |
            (datatxt.iloc[i, 3] & 0xFFFFFFFF) << 17 |
            (datatxt.iloc[i, 4] & 0xFFF) << 5 |
            (datatxt.iloc[i, 5] & 0x1F))
    binary_file.write(struct.pack('<q', word))
binary_file.close()
print("Print data from txt file\n", datatxt)

data = {}

with open('data_000637.dat', 'rb') as file:
    file_content = file.read()
    word_counter = 0
    word_size = 8 
    for i in range(0, len(file_content), word_size):
        word_counter += 1
        if word_counter > 10: break
        word = struct.unpack('<q', file_content[i : i + word_size])[0] # get an 8-byte word
        head     = (word >> 62) & 0x3
        fpga     = (word >> 58) & 0xF
        tdc_chan = (word >> 49) & 0x1FF
        orb_cnt  = (word >> 17) & 0xFFFFFFFF
        bx       = (word >> 5 ) & 0xFFF
        tdc_meas = (word >> 0 ) & 0x1F
        #if i == 0: print ('{0}\t{1}\t{2}\t{3}\t{4}\t{5}'.format('HEAD', 'FPGA', 'CHANNEL', 'ORBIT_CNT', 'BX_CNT', 'TDC_MEAS'))
        #print('{0}\t{1}\t{2}\t{3}\t{4}\t{5}'.format(head, fpga, tdc_chan, orb_cnt, bx, tdc_meas))
        entry = {'HEAD' : head, 'FPGA' : fpga, 'CHANNEL' : tdc_chan, 'ORBIT_CNT' : orb_cnt, 'BX_CNT' : bx, 'TDC_MEAS' : tdc_meas}
        #df = df.append(entry, ignore_index=True)
        data[word_counter] = entry
        
df = pd.DataFrame(data).T

print("Data from dat file\n", df)


text_size = os.path.getsize('data_000637.txt')
binary_size = os.path.getsize('data_000637.dat')
print("size on disk text file: ", text_size)
print("size on disk binary file: ", binary_size)

Print data from txt file
    HEAD  FPGA  TDC_CHANNEL   ORBIT_CNT  BX_COUNTER  TDC_MEAS
0     1     0          123  3869200167        2374        26
1     1     0          124  3869200167        2374        27
2     1     0           63  3869200167        2553        28
3     1     0           64  3869200167        2558        19
4     1     0           64  3869200167        2760        25
5     1     0           63  3869200167        2762         4
6     1     0           61  3869200167        2772        14
7     1     0          139  3869200167        2776         0
8     1     0           62  3869200167        2774        21
9     1     0           60  3869200167        2788         7
Data from dat file
     HEAD  FPGA  CHANNEL   ORBIT_CNT  BX_CNT  TDC_MEAS
1      1     0      123  3869200167    2374        26
2      1     0      124  3869200167    2374        27
3      1     0       63  3869200167    2553        28
4      1     0       64  3869200167    2558        19
5      1     