### <div id="py"> Working with different file formats </div>



- JSON (JavaScript Object Notation)
- CSV (Comma-Separated Values)
- Excel
- Avro


### Data comes in various forms

<img src="images/data_gen.jpeg">

As a data person, you will deal with various types of data, so it's important to learn how to handle these file formats.



## Working with JSON files

***
Since its inception, JSON has quickly become the de facto standard for information exchange. 

Chances are you’re here because you need to transport some data from here to there. Perhaps you’re gathering information through an API or storing your data in a document database. 

One way or another, you’re up to your neck in JSON, and you’ve got to Python your way out.








## A (Very) Brief History of JSON

JSON stands for JavaScript Object Notation. It was inspired by the subset of the JavaScript programming language that deals with object literal syntax.

Ultimately, the community at large adopted JSON because it’s easy for both humans and machines to create and understand.

### Look, it’s JSON!
```
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {
            "firstName": "Alice",
            "age": 6
        },
        {
            "firstName": "Bob",
            "age": 8
        }
    ]
}
```

### Does this look similar to something?

YES! Python **dictionary!**
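To make the resemblance concrete, here is a minimal sketch that parses the sample document above with `json.loads()` and inspects the resulting Python types. The variable names `raw` and `person` are only illustrative.

```python
import json

# The sample document from above, held in a Python string.
raw = '''
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {"firstName": "Alice", "age": 6},
        {"firstName": "Bob", "age": 8}
    ]
}
'''

# Parse the JSON text into Python objects.
person = json.loads(raw)

print(type(person).__name__)               # JSON object -> dict
print(type(person["hobbies"]).__name__)    # JSON array  -> list
print(person["children"][1]["firstName"])  # nested access works like any dict/list
```

JSON objects become dictionaries, arrays become lists, and numbers and strings map to their Python counterparts.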

### Writing JSON files

In [1]:
import json

In [2]:
data = {
    "president": {
        "name": "Trump",
        "species": "USA"
    }
}

In [3]:
with open("data_file.json", "w") as write_file:
    json.dump(data, write_file)

Note that `dump()` takes two positional arguments:
1. the data object to be serialized, and
2. the file-like object to which the serialized text will be written.
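When no file is involved, the string-based counterparts `json.dumps()` and `json.loads()` can be used instead. The sketch below reuses the same small dictionary; the `indent` and `sort_keys` arguments are optional and only affect how the text is laid out.

```python
import json

data = {"president": {"name": "Trump", "species": "USA"}}

# dumps() returns the JSON text as a str instead of writing it to a file.
text = json.dumps(data, indent=4, sort_keys=True)
print(text)

# loads() is the inverse: it parses a str back into Python objects.
restored = json.loads(text)
print(restored == data)
```

A round trip through `dumps()` and `loads()` gives back an equal dictionary, which is a quick way to check that the data is JSON-serializable.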

### Reading JSON files

In [4]:
with open("data_file.json", "r") as read_file:
    data = json.load(read_file)

In [5]:
type(data)

dict

In [6]:
data

{'president': {'name': 'Trump', 'species': 'USA'}}

### You can also read JSON into a pandas DataFrame

In [7]:
import pandas as pd
from io import StringIO

jsonStr = '''{"Index0":{"Courses": "Azure Data Factory","Discount": "1200"},
           "Index1":{"Courses": "AWS Glue","Discount": "1500"},
           "Index2":{"Courses": "Spark","Discount": "1800"}
          }'''

# Convert JSON to a DataFrame using read_json().
# Recent pandas versions deprecate passing a literal JSON string,
# so the string is wrapped in a file-like StringIO object.
df2 = pd.read_json(StringIO(jsonStr), orient='index')
print(df2)

                   Courses  Discount
Index0  Azure Data Factory      1200
Index1            AWS Glue      1500
Index2               Spark      1800




### Convert Dict To DF

In [8]:
data['president']

{'name': 'Trump', 'species': 'USA'}

In [9]:
import pandas as pd

df3 = pd.DataFrame.from_dict(data, orient ='index')

In [10]:
df3

            name  species
president  Trump      USA
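Nested structures, such as the list of children in the earlier JSON sample, do not fit `from_dict()` directly. One option is `pd.json_normalize()`, sketched below under the assumption that one row per child is wanted. The `child_` prefix and the name `children_df` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# A trimmed copy of the earlier JSON sample, already parsed into a dict.
person = {
    "firstName": "Jane",
    "lastName": "Doe",
    "children": [
        {"firstName": "Alice", "age": 6},
        {"firstName": "Bob", "age": 8},
    ],
}

# One row per entry in "children"; the parent's lastName is repeated on each row.
# record_prefix keeps the child columns distinct from the parent columns.
children_df = pd.json_normalize(
    person,
    record_path="children",
    meta=["lastName"],
    record_prefix="child_",
)
print(children_df)
```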


## Working with CSV files

A CSV file (Comma Separated Values file) is a type of plain text file that uses specific structuring to arrange tabular data. 

It’s a plain text file that has data separated by commas!

```
column 1 name,column 2 name, column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3
...
```
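Because the layout is this simple, Python's built-in `csv` module can read it without any extra packages. The sketch below parses a small, made-up sample held in a string; the column names and values are hypothetical.

```python
import csv
import io

# Hypothetical sample data in the layout shown above.
raw = """name,department,salary
Alice,Engineering,85000
Bob,Sales,62000
"""

# DictReader uses the first row as field names.
# Every value is read as a string, so numbers must be converted explicitly.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row["name"], int(row["salary"]))
```

For larger or messier files, `pandas.read_csv()` is usually more convenient because it infers column types.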

In [11]:
df = pd.read_csv('data/hrdata.csv', index_col='Name')

In [12]:
df.head()

                Hire Date   Salary  Sick Days remaining
Name
Graham Chapman   03/15/14  50000.0                   10
John Cleese      06/01/15  65000.0                    8
Eric Idle        05/12/14  45000.0                   10
Terry Jones      11/01/13  70000.0                    3
Terry Gilliam    08/12/14  48000.0                    7


In [13]:
df = pd.read_csv('data/hrdata.csv', index_col='Name', parse_dates=['Hire Date'])



In [14]:
df

                 Hire Date   Salary  Sick Days remaining
Name
Graham Chapman  2014-03-15  50000.0                   10
John Cleese     2015-06-01  65000.0                    8
Eric Idle       2014-05-12  45000.0                   10
Terry Jones     2013-11-01  70000.0                    3
Terry Gilliam   2014-08-12  48000.0                    7
Michael Palin   2013-05-23  66000.0                    8
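When the date layout is known in advance, it can be safer to state it explicitly rather than let pandas infer it. The sketch below uses a small in-memory stand-in for `data/hrdata.csv` (an assumption, so that it runs without the file) and converts the column with `pd.to_datetime()` and an explicit `format`.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/hrdata.csv with the same MM/DD/YY date layout.
raw = """Name,Hire Date,Salary
Graham Chapman,03/15/14,50000.0
John Cleese,06/01/15,65000.0
"""

df_dates = pd.read_csv(io.StringIO(raw), index_col="Name")

# Convert explicitly so the format is never guessed.
df_dates["Hire Date"] = pd.to_datetime(df_dates["Hire Date"], format="%m/%d/%y")

print(df_dates.dtypes)
```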


In [16]:
df.to_csv('data/hrdata_modified.csv')

## Working with Excel Files

Excel spreadsheets are one of those things you might have to deal with at some point, either because your boss loves them or because marketing needs them, so it pays to learn how to work with them.

Many companies still prefer Excel files for data storage and analysis, so as a data expert you should know how to handle these files programmatically!

To work with Excel files in Python, we can use the `openpyxl` package.

In [17]:
pip install openpyxl


### Basics of Excel

<img src="images/excel.png" width=550>

In [18]:
from openpyxl import Workbook

workbook = Workbook()
sheet = workbook.active

sheet["A1"] = "hello"
sheet["B1"] = "world!"

workbook.save(filename="hello_world.xlsx")

In [19]:
# Reading an Excel file
from openpyxl import load_workbook

workbook = load_workbook(filename="data/sample-xlsx-file.xlsx")
workbook.sheetnames  # names of all sheets in the workbook

In [20]:
sheet = workbook.active

In [21]:
sheet

<Worksheet "Employee">

In [22]:
sheet.title

'Employee'

In [23]:
sheet["A1"]

<Cell 'Employee'.A1>

In [25]:
sheet["A3"].value

'John Doe'

In [26]:
sheet.cell(row=10, column=6)

<Cell 'Employee'.F10>

In [27]:
sheet.cell(row=3, column=3).value

datetime.datetime(1965, 1, 13, 0, 0)

In [28]:
sheet["A1:C2"]

((<Cell 'Employee'.A1>, <Cell 'Employee'.B1>, <Cell 'Employee'.C1>),
 (<Cell 'Employee'.A2>, <Cell 'Employee'.B2>, <Cell 'Employee'.C2>))

In [29]:
for row in sheet.iter_rows(values_only=True):
    print(row)

('Name', 'Email', 'Date Of Birth', 'Salary', 'Department', None)
('Rajeev Singh', 'rajeev@example.com', datetime.datetime(1992, 7, 21, 0, 0), 1500000.0, 'Software Engineering', None)
('John Doe', 'john@example.com', datetime.datetime(1965, 1, 13, 0, 0), 1300000.0, 'Sales', None)
('Jack Sparrow', 'jack@example.com', datetime.datetime(1986, 12, 19, 0, 0), 1000000.0, 'HR', None)
('Steven Cook', 'steven@example.com', datetime.datetime(1994, 5, 4, 0, 0), 1200000.0, 'Marketing', None)
(None, None, None, None, None, None)
(None, None, None, None, None, None)
(None, None, None, None, None, None)
(None, None, None, None, None, None)
(None, None, None, None, None, None)
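The empty rows at the end of the output above are common in real spreadsheets. A minimal sketch of one way to deal with them follows: treat the first row as a header, skip rows that contain only `None`, and build one dictionary per remaining row. It uses a small in-memory workbook with made-up data so that it runs without the sample file.

```python
from openpyxl import Workbook

# Build a small in-memory workbook (hypothetical data) instead of reading a file.
wb = Workbook()
ws = wb.active
ws.append(["Name", "Department"])
ws.append(["John Doe", "Sales"])
ws.append([None, None])            # an empty row, as in the output above
ws.append(["Jack Sparrow", "HR"])

# The first row is the header; the rest are data rows.
header, *body = ws.iter_rows(values_only=True)

# Keep only rows that have at least one non-empty cell.
records = [
    dict(zip(header, row))
    for row in body
    if any(value is not None for value in row)
]
print(records)
```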


### You can read an Excel file as a DataFrame using pandas

In [30]:
excel_df = pd.read_excel('data/sample-xlsx-file.xlsx')

In [31]:
excel_df

           Name               Email Date Of Birth   Salary            Department
0  Rajeev Singh  rajeev@example.com    1992-07-21  1500000  Software Engineering
1      John Doe    john@example.com    1965-01-13  1300000                 Sales
2  Jack Sparrow    jack@example.com    1986-12-19  1000000                    HR
3   Steven Cook  steven@example.com    1994-05-04  1200000             Marketing


In [32]:
excel_df.to_excel('data/sample-xlsx-file-modified.xlsx')

## Working with AVRO

Apache Avro is a data serialization format. We can store data as `.avro` files on disk. 

Avro files are often used with Spark, but the two are independent: Spark does not require Avro, and Avro does not require Spark.

Avro is a row-based format that is well suited to evolving data schemas. One benefit of using Avro is that the schema and metadata travel with the data.

If you have an .avro file, you have the schema of the data as well. 

The Apache Avro Specification provides easy-to-read yet detailed information.

In [33]:
pip install avro-python3


In [34]:
# Python 3 with the `avro-python3` package available
import copy
import json
import avro.schema  # imported explicitly because avro.schema.Parse is used below
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

In [35]:

# Note that we combined namespace and name to get "full name"
schema = {
    'name': 'avro.example.User',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'}
    ]
}

# Parse the schema so we can use it to write the data
schema_parsed = avro.schema.Parse(json.dumps(schema))

In [36]:
schema_parsed

<avro.schema.RecordSchema at 0x1f6f3917750>

In [37]:

# Write data to an avro file
with open('users.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), schema_parsed)
    writer.append({'name': 'Pierre-Simon Laplace', 'age': 77})
    writer.append({'name': 'John von Neumann', 'age': 53})
    writer.close()

In [38]:

# Read data from an avro file
with open('users.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    metadata = copy.deepcopy(reader.meta)
    schema_from_file = json.loads(metadata['avro.schema'])
    users = [user for user in reader]
    reader.close()

print(f'Schema that we specified:\n {schema}')
print(f'Schema that we parsed:\n {schema_parsed}')
print(f'Schema from users.avro file:\n {schema_from_file}')
print(f'Users:\n {users}')

Schema that we specified:
 {'name': 'avro.example.User', 'type': 'record', 'fields': [{'name': 'name', 'type': 'string'}, {'name': 'age', 'type': 'int'}]}
Schema that we parsed:
 {"type": "record", "name": "User", "namespace": "avro.example", "fields": [{"type": "string", "name": "name"}, {"type": "int", "name": "age"}]}
Schema from users.avro file:
 {'type': 'record', 'name': 'User', 'namespace': 'avro.example', 'fields': [{'type': 'string', 'name': 'name'}, {'type': 'int', 'name': 'age'}]}
Users:
 [{'name': 'Pierre-Simon Laplace', 'age': 77}, {'name': 'John von Neumann', 'age': 53}]


### Reading Avro Using Pandas

Avro format simply requires a schema and a list of records. We don’t need a dataframe to handle Avro files. 

However, we can write a `pandas` dataframe into an Avro file or read an Avro file into a `pandas` dataframe. 

To begin with, we can always represent a dataframe as a list of records and vice versa.
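The round trip between a list of records and a dataframe can be sketched with pandas alone, before Avro enters the picture. `from_records()` builds the dataframe and `to_dict(orient="records")` turns it back into a list of dictionaries.

```python
import pandas as pd

# A list of records: one dictionary per row.
users = [
    {"name": "Pierre-Simon Laplace", "age": 77},
    {"name": "John von Neumann", "age": 53},
]

# Records -> dataframe
users_df = pd.DataFrame.from_records(users)

# Dataframe -> records
records = users_df.to_dict(orient="records")

print(records == users)
```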

In [39]:
pip install pandavro


In [16]:
import copy
import json
import pandas as pd
import pandavro as pdx  # pandavro provides to_avro() and from_avro()
from avro.datafile import DataFileReader
from avro.io import DatumReader

In [17]:
# Data to be saved
users = [{'name': 'Pierre-Simon Laplace', 'age': 77},
         {'name': 'John von Neumann', 'age': 53}]
users_df = pd.DataFrame.from_records(users)
print(users_df)

                   name  age
0  Pierre-Simon Laplace   77
1      John von Neumann   53


In [None]:
# Save DataFrame to AVRO
pdx.to_avro('data/users_test.avro', users_df)

In [None]:
# Read the data back
users_df_redux = pdx.from_avro('data/users_test.avro')
print(type(users_df_redux))
# <class 'pandas.core.frame.DataFrame'>


In [60]:
# Check the schema for "users.avro"
with open('users.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    metadata = copy.deepcopy(reader.meta)
    schema_from_file = json.loads(metadata['avro.schema'])
    reader.close()
print(schema_from_file)

{'type': 'record', 'name': 'User', 'namespace': 'avro.example', 'fields': [{'type': 'string', 'name': 'name'}, {'type': 'int', 'name': 'age'}]}
