# Lesson 2: Working with different file formats

Objective:
- Reading data from CSV in Python
- Reading data from JSON in Python
- Reading data from XLSX in Python
- Reading data from XML in Python
- Reading data from Binary in Python

### Reading Data from CSV in Python

We already know how to do this using the pd.read_csv() command with options such as header. By default the first row of the file is used as the header, so if the file has no header row, pass header=None and then set the column names through the df.columns attribute. We can select data using the column name or with df.loc[] and df.iloc[], where loc is label based and iloc is integer-position based.
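
As a minimal sketch of the header handling and slicing described above, the snippet below reads a small header-less CSV from an in-memory string. The data and the column names `a`, `b`, `c` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical header-less CSV data, kept in memory so the example is self-contained
csv_text = "1,2,3\n4,5,6\n7,8,9\n"

# header=None stops pandas from using the first data row as column names
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Assign column names afterwards through the columns attribute
df.columns = ["a", "b", "c"]

by_label = df.loc[0, "b"]      # label based: row label 0, column "b"
by_position = df.iloc[0, 1]    # position based: first row, second column
column_a = df["a"]             # select a whole column by name
```

Here `by_label` and `by_position` point at the same cell, because the default row labels happen to match the row positions.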

Below we construct a dataframe and use the transform method with a lambda function to apply the same operation to every value in the dataframe.

In [3]:
import pandas as pd
import numpy as np

In [None]:
df=pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df

In [None]:
df = df.transform(func = lambda x : x + 10)
df

In [None]:
result = df.transform(func = ['sqrt']) # Takes the square root of every value in the dataframe.
result

### Reading Data from JSON in Python

In [7]:
import json

In [8]:
person = {
    'first_name' : 'Mark',
    'last_name' : 'abc',
    'age' : 27,
    'address': {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    }
}

In [9]:
with open('person.json', 'w') as f:  # writing JSON object
    json.dump(person, f)

In [None]:
with open('person.json', 'r') as openfile:  # reading the JSON object back
    json_object = json.load(openfile)  # parsed into a Python dict
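
Once the JSON has been parsed into a dictionary, it is often useful to turn it into a dataframe. The following is a rough sketch using pd.json_normalize(), which is not used elsewhere in this lesson. The JSON string is a shortened copy of the person record above.

```python
import json
import pandas as pd

# Shortened copy of the person record, written as JSON text
json_text = (
    '{"first_name": "Mark", "last_name": "abc", "age": 27, '
    '"address": {"city": "New York", "state": "NY"}}'
)

record = json.loads(json_text)      # JSON text -> Python dict
flat = pd.json_normalize(record)    # nested keys become dotted column names

city = flat.loc[0, "address.city"]  # value taken from the nested address object
```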
  

### Reading Data from XLSX in Python

This is done with the pd.read_excel() function. In the simplest case you only need to pass the file name; by default the first sheet is read.
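
A minimal sketch of an Excel round trip, assuming the optional openpyxl package is installed (the code skips the round trip if it is not). The file name `sample.xlsx` and sheet name `data` are hypothetical.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_back = None

# Reading and writing .xlsx files relies on the optional openpyxl package
if importlib.util.find_spec("openpyxl") is not None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.xlsx")   # hypothetical file name
        df.to_excel(path, sheet_name="data", index=False)
        # sheet_name picks the worksheet; usecols restricts the columns read
        df_back = pd.read_excel(path, sheet_name="data", usecols=["a", "b"])
```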

### Reading Data from XML in Python

In [None]:
import xml.etree.ElementTree as ET # Writing XML file

# create the file structure
employee = ET.Element('employee')
details = ET.SubElement(employee, 'details')
first = ET.SubElement(details, 'firstname')
second = ET.SubElement(details, 'lastname')
third = ET.SubElement(details, 'age')
first.text = 'Shiv'
second.text = 'Mishra'
third.text = '23'

# create a new XML file with the results
mydata1 = ET.ElementTree(employee)
with open("new_sample.xml", "wb") as files:
    mydata1.write(files)

In [15]:
tree = ET.parse("new_sample.xml") # Reading data
root = tree.getroot()

In [None]:
columns = ["firstname", "lastname", "age"] # Column names for the dataframe
rows = [] # One list of values per XML record

In [None]:
for node in root: # Collect the data from each <details> element of the XML file.
    # Extract the text of each child element
    firstname = node.find("firstname").text
    lastname = node.find("lastname").text
    age = node.find("age").text
    rows.append([firstname, lastname, age])

# Build the dataframe once from all collected rows
dataframe = pd.DataFrame(rows, columns=columns)
dataframe

In [None]:
df=pd.read_xml("new_sample.xml") # Alternatively, pd.read_xml() can read the file directly
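
By default pd.read_xml() uses the lxml package as its parser. As a sketch, assuming a pandas version that provides read_xml, the call below uses parser="etree" to rely on the standard library instead. The XML string mirrors the layout of new_sample.xml and is held in memory so the example is self-contained.

```python
import io
import pandas as pd

# Same layout as new_sample.xml, held in memory for a self-contained example
xml_text = (
    "<employee><details>"
    "<firstname>Shiv</firstname><lastname>Mishra</lastname><age>23</age>"
    "</details></employee>"
)

# parser="etree" uses the standard library parser, so lxml is not required;
# each child of the root element (<details>) becomes one row
df_xml = pd.read_xml(io.StringIO(xml_text), parser="etree")
```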

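When it comes to saving data, you can use df.to_csv(), df.to_json(), df.to_excel(), df.to_hdf() or df.to_sql(), depending on the format you want, with options such as index (where supported) to control whether the row index is written.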
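
A short sketch of two of these writers, using a made-up dataframe. Nothing is written to disk here: with no path given, to_csv() and to_json() return the output as a string.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With no path given, to_csv returns the CSV text; index=False omits the row index
csv_text = df.to_csv(index=False)

# orient="records" writes one JSON object per row
json_text = df.to_json(orient="records")

# Reading the CSV text back gives the same table
df_back = pd.read_csv(io.StringIO(csv_text))
```

Passing a file name as the first argument, for example df.to_csv("out.csv", index=False), writes to that file instead; `out.csv` is only an illustrative name.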

### Reading Data from Binary in Python

In [30]:
from PIL import Image 

In [None]:
import urllib.request # Get an Image
urllib.request.urlretrieve("https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg", "dog.jpg")

('dog.jpg', <http.client.HTTPMessage at 0x2436a8a18d0>)

In [None]:
img = Image.open('./dog.jpg','r') # Open an image and show it
img.show()
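
Image.open() decodes the image for us, but underneath the file is just a sequence of bytes. The sketch below shows the more general pattern of reading and writing a binary file with the built-in open() in "rb" and "wb" modes. The bytes and the file name `sample.bin` are made up for illustration.

```python
import os
import tempfile

# A few made-up bytes standing in for the contents of a binary file
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.bin")   # hypothetical file name

    with open(path, "wb") as f:    # "wb": write bytes, no text encoding applied
        f.write(payload)

    with open(path, "rb") as f:    # "rb": read the raw bytes back
        data = f.read()

first_byte = data[0]    # indexing a bytes object yields an integer
```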

### Gaining descriptive statistics of a dataset in Python

Use methods like df.head(), df.info() and df.describe() to get an overview of a dataset. Use missing_data = df.isnull() to find missing data and df.dtypes to check the data types.
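
As a quick illustration of these methods, the sketch below applies them to a small made-up dataframe with one missing value in each column.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
df = pd.DataFrame({
    "age": [27, 23, np.nan, 31],
    "city": ["New York", None, "Boston", "Chicago"],
})

first_rows = df.head(2)                   # first two rows
summary = df.describe()                   # count, mean, std, min, quartiles, max (numeric columns)
missing_data = df.isnull()                # True where a value is missing
missing_per_column = missing_data.sum()   # number of missing values per column
column_types = df.dtypes                  # data type of each column

df.info()                                 # prints index, columns, non-null counts and memory usage
```

Summing the boolean result of df.isnull() is a common way to turn the True/False table into a count of missing values per column.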