## Video 1.5 Accessing Raw Data

In this video, we'll discuss some common file formats and how to access their data from Python.

### Opening and reading files

In [3]:
%cat some_file.txt

ERROR:root:Line magic function `%cat` not found.


In [4]:
fname = 'some_file.txt'

f = open(fname, 'r')
content = f.read()
f.close()

print(content)

This is some file
It has a few line
This is the last line



In [6]:
fname = 'some_file.txt'
with open(fname, 'r') as f:
    content = f.read()

print(content)

This is some file
It has a few line
This is the last line



In [7]:
fname = 'some_file.txt'
with open(fname, 'r') as f:
    content = f.readlines()

print(content)

['This is some file\n', 'It has a few line\n', 'This is the last line\n']


In [8]:
fname = 'some_file.txt'
with open(fname, 'r') as f:
    for line in f:
        print(line)

This is some file

It has a few line

This is the last line



In [9]:
fname = 'some_file.txt'
with open(fname, 'r') as f:
    for i, line in enumerate(f):
        print("Line {}: {}".format(i, line.strip()))

Line 0: This is some file
Line 1: It has a few line
Line 2: This is the last line
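The cells above keep or strip the trailing newline one line at a time. As a minimal sketch of an alternative, `str.splitlines()` returns all lines with the newline characters already removed. The temporary directory and the sample text here are assumptions made so the snippet runs on its own; in the notebook you would open `some_file.txt` directly.

```python
import os
import tempfile

# Assumed sample text, mirroring the output printed in the cells above.
sample = "This is some file\nIt has a few line\nThis is the last line\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "some_file.txt")

    # Write the sample so the example does not depend on an existing file.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # read() returns one string; splitlines() drops the trailing '\n' of each line.
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

Passing `encoding` explicitly is optional, but it avoids depending on the platform default when a file contains non-ASCII text.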


### CSV files

Comma Separated Values

This format is very common for importing and exporting data from spreadsheets and databases.

In [10]:
%cat data.csv

ERROR:root:Line magic function `%cat` not found.


In [11]:
import csv

fname = 'data.csv'

with open(fname, 'r') as f:
    data_reader = csv.reader(f, delimiter=',')
    headers = next(data_reader)
    print("Headers = {}".format(headers))
    for line in data_reader:
        print(line)

Headers = ['NAME', 'AGE', 'LANGUAGE']
['Alice', '30', 'English']
['Bob', '25', 'Spanish']
['Charlie', '35', 'French']


In [12]:
fname = 'data_no_header.csv'

with open(fname, 'r') as f:
    data_reader = csv.reader(f, delimiter=',')
    for line in data_reader:
        print(line)

['Alice', '30', 'English']
['Bob', '25', 'Spanish']
['Charlie', '35', 'French']


In [13]:
fname = 'data.csv'

with open(fname, 'r') as f:
    data_reader = csv.reader(f, delimiter=',')
    headers = next(data_reader)
    data = []
    for line in data_reader:
        item = {headers[i]: value for i, value in enumerate(line)}
        data.append(item)

data

[{'AGE': '30', 'LANGUAGE': 'English', 'NAME': 'Alice'},
 {'AGE': '25', 'LANGUAGE': 'Spanish', 'NAME': 'Bob'},
 {'AGE': '35', 'LANGUAGE': 'French', 'NAME': 'Charlie'}]
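The header-to-value mapping built by hand above is also available in the standard library as `csv.DictReader`. The following is a sketch under the assumption that `data.csv` contains the rows printed earlier; it reads from an in-memory string so that it runs without the file. Note that every value comes back as a string, so numeric columns need an explicit conversion.

```python
import csv
import io

# Assumed file content, reconstructed from the output printed above.
csv_text = (
    "NAME,AGE,LANGUAGE\n"
    "Alice,30,English\n"
    "Bob,25,Spanish\n"
    "Charlie,35,French\n"
)

# DictReader uses the first row as keys for each following row.
with io.StringIO(csv_text) as f:
    records = [dict(row) for row in csv.DictReader(f)]

# csv yields strings only; convert AGE to an integer explicitly.
for record in records:
    record["AGE"] = int(record["AGE"])

print(records[0])
```

With a real file, replace `io.StringIO(csv_text)` with `open(fname, 'r', newline='')`, which is the form the `csv` module documentation recommends.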

### JSON

JavaScript Object Notation

JSON is well suited to data serialization and to communication between services.

In [14]:
%cat movie.json

ERROR:root:Line magic function `%cat` not found.


In [15]:
import json

fname = 'movie.json'
with open(fname, 'r') as f:
    content = f.read()
    movie = json.loads(content)

movie

{'actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter'],
 'title': 'Fight Club',
 'watched': True,
 'year': 1999}

In [16]:
type(movie)

dict

In [17]:
import json

fname = 'movie.json'
with open(fname, 'r') as f:
    movie_alt = json.load(f)

In [18]:
movie == movie_alt

True

In [19]:
print(json.dumps(movie, indent=4))

{
    "title": "Fight Club",
    "watched": true,
    "year": 1999,
    "actors": [
        "Brad Pitt",
        "Edward Norton",
        "Helena Bonham Carter"
    ]
}
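To illustrate the serialization side, here is a small round-trip sketch: a dictionary is dumped to a JSON string and parsed back. The dictionary literal is an assumption copied from the output shown above, so the snippet does not need `movie.json`.

```python
import json

# Assumed data, copied from the movie dictionary printed above.
movie = {
    "title": "Fight Club",
    "watched": True,
    "year": 1999,
    "actors": ["Brad Pitt", "Edward Norton", "Helena Bonham Carter"],
}

# Serialize to text; Python's True becomes JSON's true.
text = json.dumps(movie, indent=4, sort_keys=True)

# Parse the text back into Python objects.
restored = json.loads(text)

print(restored == movie)
```

`sort_keys=True` is optional; it gives a stable key order, which can make saved files easier to compare.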


In [20]:
%cat movies-90s.jsonl

ERROR:root:Line magic function `%cat` not found.


In [21]:
import json

fname = 'movies-90s.jsonl'

with open(fname, 'r') as f:
    for line in f:
        try:
            movie = json.loads(line)
            print(movie['title'])
        except (json.JSONDecodeError, KeyError):
            # Skip lines that are not valid JSON or have no 'title' key
            continue


Fight Club
Goodfellas
Forrest Gump
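Silently skipping bad lines can hide data problems. The sketch below parses a JSON Lines string and records which line numbers failed. The input text, including the deliberately malformed third line, is hypothetical and only stands in for `movies-90s.jsonl`.

```python
import io
import json

# Hypothetical JSON Lines content: one JSON object per line, plus one bad line.
raw = (
    '{"title": "Fight Club"}\n'
    '{"title": "Goodfellas"}\n'
    'not valid json\n'
    '{"title": "Forrest Gump"}\n'
)

titles = []
bad_lines = []

for line_number, line in enumerate(io.StringIO(raw), start=1):
    try:
        titles.append(json.loads(line)["title"])
    except json.JSONDecodeError:
        # Remember where parsing failed instead of discarding the error.
        bad_lines.append(line_number)

print(titles)
print(bad_lines)
```

Catching `json.JSONDecodeError` rather than using a bare `except` lets unrelated errors, such as a keyboard interrupt, surface normally.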


### Pickles: Python object serialization

In [22]:
with open('movie.json', 'r') as f:
    content = f.read()
    data = json.loads(content)

data

{'actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter'],
 'title': 'Fight Club',
 'watched': True,
 'year': 1999}

In [23]:
type(data)

dict

In [24]:
import pickle

with open('data.pickle', 'wb') as f:
    pickle.dump(data, f)

In [25]:
%cat data.pickle

ERROR:root:Line magic function `%cat` not found.


In [26]:
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

data

{'actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter'],
 'title': 'Fight Club',
 'watched': True,
 'year': 1999}

In [None]:
type(data)
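As a complement to the file-based cells above, this sketch does the same round trip in memory with `pickle.dumps` and `pickle.loads`. The `genres` and `position` entries are hypothetical additions, included only to show that pickle preserves Python types such as `set` and `tuple`, which JSON cannot represent directly.

```python
import pickle

# Movie fields copied from above; 'genres' and 'position' are hypothetical.
record = {
    "title": "Fight Club",
    "year": 1999,
    "genres": {"drama"},   # a set
    "position": (1, 2),    # a tuple
}

# dumps() returns bytes, which is why the files above use 'wb' and 'rb'.
blob = pickle.dumps(record)

# loads() rebuilds an equal object with the original types.
restored = pickle.loads(blob)

print(type(blob))
print(restored == record)
```

Only unpickle data from a source you trust: loading a pickle can execute arbitrary code, so it is not a safe format for exchanging data with unknown parties.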