
In [0]:
import pandas as pd
import numpy as np

# Reading and Writing Data in Text Format

In [0]:
# loading online dataset
df = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex1.csv')
df

In [0]:
pd.read_table('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex1.csv', sep=',')

A file will not always have a header row. For such a file, you can either let pandas assign default column names with header=None or supply the names yourself:

In [0]:
pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex2.csv', header=None)

In [0]:
pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
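To see the difference between the two options without relying on the remote file, here is a minimal sketch. The inline string `raw` and the variable names `auto` and `named` are assumptions made up for illustration; they only stand in for a headerless CSV file.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file without a header row
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas assigns integer column labels 0..4
auto = pd.read_csv(io.StringIO(raw), header=None)

# names=...: you supply the labels; the first line is treated as data
named = pd.read_csv(io.StringIO(raw), names=['a', 'b', 'c', 'd', 'message'])

print(list(auto.columns))
print(list(named.columns))
```

In both cases all three lines are read as data rows; only the column labels differ.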

In [0]:
names=['a', 'b', 'c', 'd', 'message']

In [0]:
# Use the message column as the index of the returned DataFrame

pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex2.csv', names=names, index_col='message')

To form a hierarchical index from multiple columns, pass a list of column numbers or names to index_col:

In [0]:
parsed = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/csv_mindex.csv', index_col=['key1', 'key2'])
parsed
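A hierarchical index pays off when you select by its levels. The sketch below assumes a small inline table (`raw`) with made-up values, not the contents of csv_mindex.csv, and shows selection by the outer level and by a full (key1, key2) pair.

```python
import io
import pandas as pd

# Hypothetical inline data with two key columns and two value columns
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,5,6\n"
    "two,b,7,8\n"
)

parsed_demo = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])

# All rows under the outer label 'one' (the result is indexed by key2 only)
outer = parsed_demo.loc['one']
print(outer)

# A single cell addressed by the full (key1, key2) tuple and a column name
cell = parsed_demo.loc[('two', 'b'), 'value2']
print(cell)
```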

In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields.

In [0]:
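# Read a text file whose fields are separated by a variable amount of whitespace.
# The raw string avoids an invalid-escape warning for the regular expression.
result = pd.read_table('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex3.txt', sep=r'\s+')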
result
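Because the separator is a regular expression, one pattern can cover mixed spaces and tabs. This is a minimal sketch on a made-up inline table (`raw` is an assumption, not the contents of ex3.txt).

```python
import io
import pandas as pd

# Hypothetical data: fields separated by runs of spaces and, in one place, a tab
raw = "name   A     B\naaa    1.5\t-2.0\nbbb   -0.25   3.0\n"

# r'\s+' matches one or more whitespace characters of any kind
ws = pd.read_csv(io.StringIO(raw), sep=r'\s+')
print(ws)
```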

In [0]:
# Skip the first, third, and fourth lines of the file (zero-based positions 0, 2, 3)
pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex4.csv', skiprows=[0, 2, 3])
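The line positions passed to skiprows refer to lines of the raw file, counted from zero, before the header is identified. A small sketch with an assumed inline file (`raw`) that has comment lines around the header:

```python
import io
import pandas as pd

# Hypothetical file: lines 0, 2 and 3 are comments, line 1 is the header
raw = (
    "# a comment line\n"
    "a,b,c,message\n"
    "# another comment\n"
    "# and one more\n"
    "1,2,3,hello\n"
    "4,5,6,world\n"
)

skipped = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])
print(skipped)
```

After skipping, the first remaining line becomes the header and the last two lines become data rows.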

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as **NA** and **NULL**.

In [0]:
result = pd.read_csv(r'https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex5.csv')
result

In [0]:
pd.isnull(result)

The na_values option accepts a list or set of strings to treat as missing values, in addition to the default sentinels:

In [0]:
result = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex5.csv', na_values=['NULL'])
result

In [0]:
# Different NA sentinels can be specified for each column in a dict

sentinels = {'message': ['foo', 'NA'], 'something':['two']}

In [0]:
pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex5.csv', na_values=sentinels)
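A per-column dict only affects the columns it names. The sketch below applies the same `sentinels` mapping to a made-up inline table (`raw` is an assumption) so the effect is easy to check.

```python
import io
import pandas as pd

# Hypothetical data in which 'foo' and 'two' should count as missing
raw = (
    "something,a,message\n"
    "one,1,foo\n"
    "two,2,world\n"
    "three,3,foo\n"
)

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
masked = pd.read_csv(io.StringIO(raw), na_values=sentinels)

# 'foo' is missing only in 'message'; 'two' is missing only in 'something'
print(masked.isnull())
```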

## Reading Text Files in Pieces

When processing very large files, or when figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of the file or iterate through it in smaller chunks. Before we look at a large file, we make the pandas display settings more compact:

In [0]:
pd.options.display.max_rows = 10

In [0]:
result = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex6.csv')
result

In [0]:
# nrows: read only the first few rows of the file

pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex6.csv', nrows=5)

To read a file in pieces, specify a **chunksize** as a number of rows

The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize.

In [0]:
chunker = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex6.csv', chunksize=1000)
chunker

In [0]:
chunker = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex6.csv', chunksize=1000)
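# Start from an empty float Series; the explicit dtype avoids a warning on empty input
tot = pd.Series([], dtype='float64')
for piece in chunker:
    # Add this chunk's counts to the running total, treating absent keys as 0
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)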

In [0]:
tot[:10]
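The same aggregation pattern can be tried on a tiny in-memory table, which makes the chunk boundaries visible. The data (`raw`), the chunk size of 4, and the names `counts` and `n_chunks` are assumptions chosen for this sketch.

```python
import io
import pandas as pd

# Hypothetical data: 10 rows whose 'key' column cycles through A, B, C
raw = "key,value\n" + "".join(f"{k},{i}\n" for i, k in enumerate("ABCABCABCA"))

counts = pd.Series(dtype='float64')
n_chunks = 0

# chunksize=4 yields DataFrames of 4, 4 and 2 rows
for piece in pd.read_csv(io.StringIO(raw), chunksize=4):
    n_chunks += 1
    counts = counts.add(piece['key'].value_counts(), fill_value=0)

counts = counts.sort_values(ascending=False)
print(n_chunks)
print(counts)
```

Only one chunk is held in memory at a time, so the approach scales to files that do not fit in memory as a whole.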

## Writing Data to Text Format

In [0]:
data = pd.read_csv('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/ex5.csv')
data

Using DataFrame’s **to_csv** method, we can write the data out to a comma-separated file:

In [0]:
# Call the method with a target path; 'out.csv' is a placeholder file name
data.to_csv('out.csv')

In [0]:
import sys

data.to_csv(sys.stdout, sep='|')

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value

In [0]:
data.to_csv(sys.stdout, na_rep='NULL')

In [0]:
# With no other options specified, both the row and column labels 
# are written. Both of these can be disabled

data.to_csv(sys.stdout, index=False, header=False)

In [0]:
# write only a subset of the columns, and in an order of your choosing

data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
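When no path or buffer is given, to_csv returns the CSV text as a string, which is convenient for checking the effect of each option. The DataFrame `frame` below is made up for illustration; it is not the ex5.csv data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in a numeric and in a text column
frame = pd.DataFrame({
    'a': [1, 2],
    'b': [np.nan, 4.0],
    'message': ['x', np.nan],
})

# Default: missing values become empty fields
default_text = frame.to_csv(index=False)

# na_rep replaces every missing value with the given string
null_text = frame.to_csv(index=False, na_rep='NULL')

# columns selects and orders the columns that are written
subset_text = frame.to_csv(index=False, columns=['message', 'a'])

print(default_text)
print(null_text)
print(subset_text)
```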

## JSON Data

In [0]:
obj = '''
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
 {"name": "Katie", "age": 38,
 "pets": ["Sixes", "Stache", "Cisco"]}]
}
'''

In [0]:
import json

In [0]:
result = json.loads(obj)
result

In [0]:
# json.dumps, on the other hand, converts a Python object back to JSON:

asjson = json.dumps(result)

In [0]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age', 'pets'])
siblings
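As a compact check of the steps above, the sketch below parses a shortened, assumed variant of the JSON string (`text`), confirms that JSON null becomes Python None, round-trips the object through json.dumps, and builds a DataFrame from the list of dicts.

```python
import json
import pandas as pd

# Hypothetical, shortened JSON document
text = '''
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]}
'''

record = json.loads(text)

# JSON null maps to Python None
print(record['pet'] is None)

# dumps followed by loads gives back an equal Python object
round_trip = json.loads(json.dumps(record))

# A list of dicts converts directly to a DataFrame, one row per dict
sib = pd.DataFrame(record['siblings'], columns=['name', 'age'])
print(sib)
```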

The **pandas.read_json** function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.

In [0]:
data = pd.read_json('https://raw.githubusercontent.com/rafasyafiq/pyda-online/master/Data/example.json')
data

In [0]:
# pandas -> json

print(data.to_json())
print(data.to_json(orient='records'))
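The two calls produce different layouts: the default is keyed by column and then by index label, while orient='records' gives one object per row. A sketch on a small made-up frame (`small` is an assumption), including a read back through read_json:

```python
import io
import json
import pandas as pd

# Hypothetical two-row frame
small = pd.DataFrame({'a': [1, 4], 'b': [2, 5]})

columns_json = small.to_json()                  # {column -> {index -> value}}
records_json = small.to_json(orient='records')  # [{column -> value}, ...]

print(columns_json)
print(records_json)

# Wrap the string in StringIO so read_json treats it as file-like input
restored = pd.read_json(io.StringIO(records_json), orient='records')
print(restored)
```

The records layout drops the index labels, so it suits data whose index carries no information.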