1\. **Text files**

Perform the following operations on plain `txt` files:

+ create a list of integer numbers and then save it to a text file named `data_int.txt`. Run the `cat` command to print the content of the file.
+ create a matrix of 5x5 floats and then save it to a text file named `data_float.txt`. Use the `cat` command to print the content of the file.
+ load the `txt` file of the previous point and convert it to a `csv` file by hand.

In [2]:
import numpy as np
import pandas as pd
import sqlite3 as sql
import json

FILE1 = "data/data_int.txt"
FILE2 = "data/data_float.txt"
FILE3 = "data/data_000637.txt"
FILE4 = "data/user_data.json"
FILE5 = "data/mushrooms_categorized.csv"
FILE6 = "data/sakila.db"
FILE7 = "data/credit_card.dat"

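def intData(iNumber):
    # Write the integers 0..iNumber-1, one per line
    with open(FILE1,"w") as file:
        for value in range(iNumber):
            file.write(str(value)+"\n")

def intFloat(iSize):
    # Build an iSize x iSize matrix of random floats in [0, 1)
    matrix = np.random.rand(iSize,iSize)
    # np.savetxt writes plain space-separated values, one row per line;
    # str(matrix) would also write the brackets of the array representation,
    # which np.loadtxt cannot parse
    np.savetxt(FILE2,matrix)

intData(100)
intFloat(5)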
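
The third point (converting the `txt` file to `csv` by hand) is not covered by the cell above. A minimal sketch, assuming the matrix was saved as whitespace-separated values with one row per line (for example with `np.savetxt`); the helper name `txt_to_csv` and the temporary paths are hypothetical:

```python
import os
import tempfile

import numpy as np

def txt_to_csv(txt_path, csv_path):
    """Rewrite a whitespace-separated text file as comma-separated values."""
    with open(txt_path) as src, open(csv_path, "w") as dst:
        for line in src:
            fields = line.split()          # split on any run of whitespace
            if fields:                     # skip empty lines
                dst.write(",".join(fields) + "\n")

# Usage on a temporary 5x5 matrix (paths are illustrative only)
tmp_dir = tempfile.mkdtemp()
txt_file = os.path.join(tmp_dir, "data_float.txt")
csv_file = os.path.join(tmp_dir, "data_float.csv")

matrix = np.random.rand(5, 5)
np.savetxt(txt_file, matrix)
txt_to_csv(txt_file, csv_file)

# Reload the CSV to check that the values survived the conversion
reloaded = np.loadtxt(csv_file, delimiter=",")
```

Doing the conversion line by line keeps it "by hand": no CSV writer is involved, only `split` and `join`.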
        

2\. **JSON files**

Load the file `user_data.json`, which can be found at:

- https://www.dropbox.com/s/sz5klcdpckc39hd/user_data.json

and filter the data, keeping only the entries whose "CreditCardType" equals "American Express". Then save the data to a new CSV file.

In [3]:
def AmericanExpressIntoCSV(iFilename):
    fileNameCSV = "data/american_express_user_data.csv"

    # Use a context manager so the JSON file is closed after loading
    with open(iFilename) as jsonFile:
        jsonContentFile = json.load(jsonFile)
    # Keep only the records whose card type is American Express
    filteredContent = [element for element in jsonContentFile if element["CreditCardType"]=="American Express"]
    dataFrame = pd.DataFrame(filteredContent)
    dataFrame.to_csv(fileNameCSV,index=False)
    
AmericanExpressIntoCSV(FILE4)
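
The same filter can also be written with a Pandas boolean mask instead of a list comprehension. A small sketch on made-up records: only the `CreditCardType` key is taken from the exercise text, while the `ID` field and all values are assumptions used for illustration.

```python
import io

import pandas as pd

# Hypothetical records mimicking the structure of user_data.json
records = [
    {"ID": 1, "CreditCardType": "American Express"},
    {"ID": 2, "CreditCardType": "Visa"},
    {"ID": 3, "CreditCardType": "American Express"},
]

frame = pd.DataFrame(records)
# Boolean mask: True where the card type matches
mask = frame["CreditCardType"] == "American Express"
filtered = frame[mask]

# Write to an in-memory buffer instead of a file, to inspect the CSV text
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```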


3\. **CSV files with Pandas**

Load the file from this url:

- https://www.dropbox.com/s/kgshemfgk22iy79/mushrooms_categorized.csv

with Pandas. 

+ explore and print the DataFrame
+ calculate, using `groupby()`, the average value of each feature, separately for each class
+ save the file in a JSON format.

In [4]:
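def AverageAndJSON(iFilename):
    fileNameJSON = "data/average_mushrooms_categorized.json"

    dataFrame = pd.read_csv(iFilename)
    print(dataFrame)

    # Average of every feature, computed separately for each class
    filteredContent = dataFrame.groupby("class").mean()
    print(filteredContent)

    # reset_index() turns the "class" index back into a column; otherwise
    # orient="records" would drop it and the rows could not be told apart
    with open(fileNameJSON,"w") as JSONFile:
        JSONFile.write(filteredContent.reset_index().to_json(orient="records"))
        
AverageAndJSON(FILE5)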


      class  cap-shape  cap-surface  cap-color  bruises  odor  \
0         1          5            2          4        1     6   
1         0          5            2          9        1     0   
2         0          0            2          8        1     3   
3         1          5            3          8        1     6   
4         0          5            2          3        0     5   
...     ...        ...          ...        ...      ...   ...   
8119      0          3            2          4        0     5   
8120      0          5            2          4        0     5   
8121      0          2            2          4        0     5   
8122      1          3            3          4        0     8   
8123      0          5            2          4        0     5   

      gill-attachment  gill-spacing  gill-size  gill-color  ...  \
0                   1             0          1           4  ...   
1                   1             0          0           4  ...   
2                 
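
To illustrate what `groupby()` followed by `mean()` computes, here is a small sketch on a made-up table. The values are invented for illustration and are not taken from the mushroom dataset; only the column names `class`, `cap-shape` and `odor` appear in the output above.

```python
import json

import pandas as pd

# Toy table: two rows per class
toy = pd.DataFrame({
    "class":     [0, 0, 1, 1],
    "cap-shape": [2, 4, 5, 5],
    "odor":      [0, 2, 6, 8],
})

# One row per class, one column per feature, each cell a per-class average
means = toy.groupby("class").mean()

# orient="index" keeps the class label as the key of each JSON entry
as_json = means.to_json(orient="index")
parsed = json.loads(as_json)
```

After `groupby("class")` the class label becomes the index of the result, which is why the choice of `orient` matters when exporting to JSON.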

4\. **Reading a database**

Get the database `sakila.db` from the lecture `06_dataio.ipynb`, and import the table `actor` as a Pandas DataFrame. Using the DataFrame, count how many actors have a first name that begins with `A`.

*Hint:* use the Series `.str` method to apply the Python string methods to the elements of a Series, see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html).

In [5]:
def countStartLetterName(iDataFrame,iLetter):
    # Column 1 holds the first names (the DataFrame has no column labels)
    count = iDataFrame[iDataFrame[1].str.startswith(iLetter)][1].count()
    print(count,"names starting with letter",iLetter)

def readingDataBase(iFilename):
    connection = sql.connect(iFilename)
    cursor = connection.cursor()
    query = "select * from actor"
    # fetchall() returns plain tuples, so the columns are numbered 0..3
    results = cursor.execute(query).fetchall()
    dataFrame = pd.DataFrame(results)
    print(dataFrame)
    countStartLetterName(dataFrame,"A")
    cursor.close()
    connection.close()

readingDataBase(FILE6)


       0         1             2                    3
0      1  PENELOPE       GUINESS  2019-02-16 18:17:33
1      2      NICK      WAHLBERG  2019-02-16 18:17:33
2      3        ED         CHASE  2019-02-16 18:17:33
3      4  JENNIFER         DAVIS  2019-02-16 18:17:33
4      5    JOHNNY  LOLLOBRIGIDA  2019-02-16 18:17:33
..   ...       ...           ...                  ...
195  196      BELA        WALKEN  2019-02-16 18:17:33
196  197     REESE          WEST  2019-02-16 18:17:33
197  198      MARY        KEITEL  2019-02-16 18:17:33
198  199     JULIA       FAWCETT  2019-02-16 18:17:33
199  200     THORA        TEMPLE  2019-02-16 18:17:33

[200 rows x 4 columns]
13 names starting with letter A
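
The DataFrame above has numeric column labels because it is built from raw tuples. As an alternative, `pd.read_sql_query` keeps the column names of the table. A minimal sketch on an in-memory stand-in for the database: the column names `actor_id`, `first_name`, `last_name` and the three rows are assumptions made for illustration, not read from `sakila.db`.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for sakila.db with a few illustrative rows
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE actor (actor_id INTEGER, first_name TEXT, last_name TEXT)"
)
connection.executemany(
    "INSERT INTO actor VALUES (?, ?, ?)",
    [(1, "ADAM", "GRANT"), (2, "NICK", "WAHLBERG"), (3, "ANGELA", "HUDSON")],
)

# read_sql_query returns a DataFrame whose columns are named after the table's
actors = pd.read_sql_query("SELECT * FROM actor", connection)
connection.close()

# .str.startswith gives a boolean Series; summing it counts the True values
n_with_a = int(actors["first_name"].str.startswith("A").sum())
```

Selecting the column by name (`actors["first_name"]`) makes the intent clearer than a positional index such as `dataFrame[1]`.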


5\. **Reading the credit card numbers**

Get the binary file named `credit_card.dat` from this address:

- https://www.dropbox.com/s/8m0syw2tkul3dap/credit_card.dat

and convert the data into the real credit card number, knowing that:
- each line corresponds to a credit card number, which consists of 16 characters (which are numbers in the 0-9 range) divided in 4 blocks, with a whitespace between each block
- each character is written using a 6 bit binary representation (including the whitespace)
- the final 4 bits of each line are a padding used to determine the end of the line, and can be ignored

*Hint*: convert the binary numbers to the decimal representation first, and then use the `chr()` function to convert the latter to a char

In [6]:
def slicingLine(iLine,iStep,iNChars=19):
    # A card number is 16 digits + 3 whitespaces = 19 characters of iStep bits;
    # the slice drops the trailing padding bits and the newline
    payload = iLine.strip()[:iNChars*iStep]
    binLineBlocks = [payload[i:i+iStep] for i in range(0,len(payload),iStep)]
    cardNumber = ""
    for block in binLineBlocks:
        # "100000" is 32 and chr(32) is " ", so no special case is needed
        cardNumber+=chr(int(block,2))
    print(cardNumber)

def convertCreditCardNumber(iFilename):
    # The file stores the bits as "0"/"1" characters, so it is read as text
    with open(iFilename,"r") as binFile:
        content = binFile.readlines()
    for line in content:
        slicingLine(line,6)
        
convertCreditCardNumber(FILE7)


7648 5673 3775 2271
3257 8247 3354 2266
2722 0001 4011 6652
0661 3063 3742 3150
0432 1608 1462 4742
5827 2027 8785 7303
5774 8528 2087 1117
8140 1210 6352 2845
5764 1133 7301 7100
6456 1737 4126 6726
1228 8631 7382 0000
7051 0160 5374 3166
0618 3587 1630 6376
1545 5454 7444 5636
6735 3116 3202 6834
7287 5011 1547 8413
7033 2607 3328 4200
2568 5244 1874 5024
1684 2253 7570 7118
0672 2576 0575 6631
6332 8353 8787 1340
1813 3361 1175 4211
2477 6450 8840 2368
5512 3505 2563 1326
3083 7882 0621 0025
4521 5148 8045 0334
7563 3654 8713 5787
8324 2664 0476 5561
0565 2504 7168 3510
5107 5507 1767 0738
2462 1821 2448 1443
2788 0638 6861 6554
5851 5873 5474 0547
0670 1004 4013 2655
5874 5506 3048 0806
2805 5401 8462 1260
5083 8406 6310 1862
1076 1445 3013 2266
8440 4804 4844 5277
4758 6141 0686 1387
7586 0675 0315 2568
2544 1258 7432
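
As a check of the encoding described in this exercise, the sketch below encodes a made-up card number into the 6-bit format and decodes it back. The function names `encode_card` and `decode_card`, the sample number, and the choice of `"0000"` as padding are assumptions for illustration; the exercise only states that the last 4 bits can be ignored.

```python
def encode_card(number, bits=6, padding="0000"):
    """Encode each character as a fixed-width binary string, then pad."""
    return "".join(format(ord(ch), f"0{bits}b") for ch in number) + padding

def decode_card(line, bits=6, n_chars=19):
    """Inverse of encode_card: read n_chars blocks of `bits` bits each."""
    payload = line.strip()[:n_chars * bits]   # drop padding and newline
    blocks = [payload[i:i + bits] for i in range(0, len(payload), bits)]
    # binary string -> integer -> character
    return "".join(chr(int(block, 2)) for block in blocks)

sample = "1234 5678 9012 3456"        # made-up number, 16 digits + 3 spaces
encoded = encode_card(sample) + "\n"  # one line as it would appear in a file
decoded = decode_card(encoded)
```

Six bits are enough here because the digits `0`-`9` have character codes 48-57 and the whitespace has code 32, all below 64.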

6\. **Write data to a binary file**

a) Start from the `data/data_000637.txt` file that we have used during the previous lectures, and convert it to a binary file according to the format defined below:

In [None]:
from IPython.display import Image
Image("images/data_format.png")

*Hints*:
- Read the first 10 lines using Pandas
- Iterate over the DataFrame rows
- For every row, "pack" the values (features) into a single 64-bit word, according to the format specified above. Use bit-wise shifts and operators to do so.
- Write each 64-bit word to a binary file. You can use `struct` in this way:
```
binary_file.write( struct.pack('<q', word) )
```
where `word` is the 64-bit word.
- Close the file after completing the loop.
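
The packing step in the hints can be sketched as follows. The field names and bit widths below are assumptions chosen so that the widths sum to 64; the real layout is the one shown in `images/data_format.png`, which is not reproduced here, and the values in `row` are illustrative only.

```python
import struct

# Hypothetical layout (name, width in bits), most significant field first.
# Replace it with the layout given in images/data_format.png.
FIELDS = [
    ("HEAD", 3), ("FPGA", 3), ("TDC_CHANNEL", 9),
    ("ORBIT_CNT", 32), ("BX_COUNTER", 12), ("TDC_MEAS", 5),
]  # widths sum to 64

def pack_row(row):
    """Pack a dict of field values into one 64-bit integer."""
    word = 0
    for name, width in FIELDS:
        value = int(row[name])
        if value < 0 or value >= (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value   # make room, then append the field
    return word

def unpack_word(word):
    """Inverse of pack_row: extract the fields, least significant first."""
    row = {}
    for name, width in reversed(FIELDS):
        row[name] = word & ((1 << width) - 1)   # mask the lowest bits
        word >>= width
    return row

# Illustrative values for one row
row = {"HEAD": 1, "FPGA": 0, "TDC_CHANNEL": 123,
       "ORBIT_CNT": 3869200167, "BX_COUNTER": 2374, "TDC_MEAS": 26}
word = pack_row(row)

# '<q' is a signed little-endian 64-bit integer, so the word must stay
# below 2**63 (true here because the top bit is 0); '<Q' is the unsigned variant
raw = struct.pack("<q", word)
(restored,) = struct.unpack("<q", raw)
```

Unpacking the word again, as in `unpack_word(restored)`, gives a direct consistency check between the binary word and the original row.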

b) Check that the binary file is correctly written by reading it with the code used in the lecture `06_dataio.ipynb`, and verify that the content of the `txt` and binary files is consistent.

c) What is the difference in size on disk between the equivalent `txt` and binary files?
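
One way to measure the two sizes is `os.path.getsize`. A minimal sketch on temporary files with made-up 64-bit values; the file names and the values are assumptions, and the actual ratio for `data_000637.txt` depends on its content.

```python
import os
import struct
import tempfile

words = [123456789012345 + i for i in range(10)]   # made-up 64-bit values

tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "sample.txt")
bin_path = os.path.join(tmp_dir, "sample.dat")

# Text version: one decimal number per line
with open(txt_path, "w", newline="\n") as txt_file:
    for word in words:
        txt_file.write(f"{word}\n")

# Binary version: exactly 8 bytes per value
with open(bin_path, "wb") as bin_file:
    for word in words:
        bin_file.write(struct.pack("<q", word))

txt_size = os.path.getsize(txt_path)
bin_size = os.path.getsize(bin_path)
print(f"txt: {txt_size} bytes, binary: {bin_size} bytes")
```

The binary file always takes 8 bytes per word, while the text file takes one byte per digit plus separators, so its size grows with the number of digits written.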