## Create Simulated Tabular Datasets

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['integer'] = np.random.randint(low=1, high=10, size=100)
df['datetime'] = pd.Series(pd.date_range('1/1/2015', periods=100, freq='S'))
# Note: `high` is exclusive, so low=0, high=1 always yields 0; use high=2 for a binary column
df['category'] = np.random.randint(low=0, high=1, size=100)

df.head()

Unnamed: 0,integer,datetime,category
0,6,2015-01-01 00:00:00,0
1,8,2015-01-01 00:00:01,0
2,3,2015-01-01 00:00:02,0
3,2,2015-01-01 00:00:03,0
4,8,2015-01-01 00:00:04,0


In [3]:
df.to_csv("data/data.csv", index=False)
df.to_excel("data/data.xlsx", index=False)
df.to_html("data/data.html", index=False)
df.to_json("data/data.json")
df.to_hdf("data/data.h5", key="datetime")

## Loading a CSV File

In [4]:
# Load library
import pandas as pd

# Create path
path = './data/data.csv'

# Load dataset
dataframe = pd.read_csv(path)

# View first two rows
dataframe.head(2)

Unnamed: 0,integer,datetime,category
0,6,2015-01-01 00:00:00,0
1,8,2015-01-01 00:00:01,0


There are two things to note about loading CSV files. First, it is often useful to take a quick look at the contents of the file before loading. It can be very helpful to see how a dataset is structured beforehand and what parameters we need to set to load in the file. Second, `read_csv` has over 30 parameters and therefore the documentation can be daunting. Fortunately, those parameters are mostly there to allow it to handle a wide variety of CSV formats.
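As a minimal sketch of taking that quick look, the snippet below prints the first two lines of a file before handing it to pandas. The sample file and its contents are hypothetical and written to a temporary location only so the example is self-contained; in practice you would point `path` at your own file.

```python
import itertools
import os
import tempfile

# Hypothetical sample file, created here only to keep the sketch self-contained
sample = (
    "integer,datetime,category\n"
    "6,2015-01-01 00:00:00,0\n"
    "8,2015-01-01 00:00:01,0\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Read only the first two lines instead of loading the whole file
with open(path, "r", encoding="utf-8") as f:
    first_lines = [line.rstrip("\n") for line in itertools.islice(f, 2)]

for line in first_lines:
    print(line)

# Clean up the temporary file
os.remove(path)
```

Seeing the raw lines makes it easy to spot the delimiter and whether a header row is present.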

CSV files get their name from the fact that the values are literally separated by commas (e.g., one row might be `2,"2015-01-01 00:00:00",0`); however, it is common for "CSV" files to use other delimiters, such as tabs (such files are termed "TSVs"). pandas' `sep` parameter allows us to define the delimiter used in the file. Although it is not always the case, a common convention in CSV files is that the first line defines the column headers (e.g., `integer,datetime,category` in our solution). The `header` parameter allows us to specify if or where a header row exists. If a header row does not exist, we set `header=None`.

The `read_csv` function returns a pandas DataFrame: a common and useful object for working with tabular data.
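The `sep` and `header` parameters discussed above can be sketched as follows. The tab-separated text is made-up sample data held in memory, and the `names` and `parse_dates` arguments are additions for illustration rather than part of the original solution.

```python
import io

import pandas as pd

# Hypothetical tab-separated data with no header row
tsv_text = (
    "6\t2015-01-01 00:00:00\t0\n"
    "8\t2015-01-01 00:00:01\t0\n"
)

# sep="\t" sets the delimiter; header=None says the first line is data, not column names
dataframe = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    header=None,
    names=["integer", "datetime", "category"],  # supply column names ourselves
    parse_dates=["datetime"],                   # convert the second column to datetimes
)

print(dataframe)
```

Without `header=None`, pandas would treat the first data row as column names and silently drop it from the data.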

## Loading an Excel File

In [8]:
# Load library
import pandas as pd

# Create path
path = './data/data.xlsx'

# Load dataset
dataframe = pd.read_excel(path, sheet_name=0, header=0)

# View first two rows
dataframe.head(2)

Unnamed: 0,integer,datetime,category
0,6,2015-01-01 00:00:00,0
1,8,2015-01-01 00:00:01,0


This solution is similar to our solution for reading CSV files. The main difference is the additional parameter, `sheet_name`, that specifies which sheet in the Excel file we wish to load. `sheet_name` can accept both strings containing the name of the sheet and integers pointing to sheet positions (zero-indexed). If we need to load multiple sheets, we include them as a list. For example, `sheet_name=[0, 1, 2, "Monthly Sales"]` will return a dictionary of pandas DataFrames containing the first, second, and third sheets and the sheet named `Monthly Sales`.
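A minimal sketch of loading several sheets at once is shown below. It assumes the `openpyxl` engine is installed, and the workbook, sheet names, and values are hypothetical, built in memory only so the example is self-contained.

```python
import io

import pandas as pd

# Build a small two-sheet workbook in memory (hypothetical data)
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"integer": [6, 8]}).to_excel(writer, sheet_name="First", index=False)
    pd.DataFrame({"sales": [100, 200]}).to_excel(writer, sheet_name="Monthly Sales", index=False)
buffer.seek(0)

# Passing a list returns a dictionary keyed by the entries of that list
sheets = pd.read_excel(buffer, sheet_name=[0, "Monthly Sales"])

print(sheets[0])
print(sheets["Monthly Sales"])
```

Note that the dictionary keys mirror what was requested: the integer `0` for the sheet selected by position and the string `"Monthly Sales"` for the sheet selected by name.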

## Loading a JSON File

In [9]:
# Load library
import pandas as pd

# Create path
path = './data/data.json'

# Load dataset
dataframe = pd.read_json(path, orient='columns')

# View first two rows
dataframe.head(2)

Unnamed: 0,integer,datetime,category
0,6,2015-01-01 00:00:00,0
1,8,2015-01-01 00:00:01,0


Importing JSON files into pandas is similar to the last few recipes we have seen. The key difference is the `orient` parameter, which indicates to pandas how the JSON file is structured. However, it might take some experimenting to figure out which argument (`split`, `records`, `index`, `columns`, or `values`) is the right one. Another helpful tool pandas offers is `json_normalize`, which can help convert semistructured JSON data into a pandas DataFrame.
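To make the `orient` options more concrete, here is a small sketch that reads the same two rows from two differently structured JSON strings, then flattens a nested record with `pd.json_normalize`. All of the JSON content is hypothetical sample data.

```python
import io

import pandas as pd

# The same table in two layouts (hypothetical data)
records_json = '[{"integer": 6, "category": 0}, {"integer": 8, "category": 0}]'
columns_json = '{"integer": {"0": 6, "1": 8}, "category": {"0": 0, "1": 0}}'

# orient="records": a list of row objects
from_records = pd.read_json(io.StringIO(records_json), orient="records")

# orient="columns": an object mapping each column name to {index: value}
from_columns = pd.read_json(io.StringIO(columns_json), orient="columns")

# Semistructured data: nested objects become dotted column names
nested = [
    {"id": 1, "user": {"name": "a", "city": "x"}},
    {"id": 2, "user": {"name": "b", "city": "y"}},
]
flat = pd.json_normalize(nested)

print(from_records)
print(from_columns)
print(flat)
```

If the chosen `orient` does not match the file's layout, pandas typically raises an error or returns an oddly shaped DataFrame, which is a useful signal to try another option.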

## Loading a Parquet File

In [11]:
# Load library
import pandas as pd

# Create path
path = './data/data.parquet'

# Load dataset
dataframe = pd.read_parquet(path)

# View first two rows
dataframe.head(2)

Unnamed: 0,integer,datetime,category
0,5,2015-01-01 00:00:00,0
1,5,2015-01-01 00:00:01,0


Parquet is a popular data storage format in the large data space. It is often used with big data tools such as Hadoop and Spark. Companies operating at a large scale are highly likely to use an efficient data storage format such as Parquet, so it is valuable to know how to read it into a DataFrame and manipulate it.

## Loading an Avro File

In [13]:
# Load library
import pandavro as pdx

# Create path
path = './data/twitter.avro'

# Load data
dataframe = pdx.read_avro(path)

# View the first two rows
dataframe.head(2)

Unnamed: 0,username,tweet,timestamp
0,miguno,"Rock: Nerf paper, scissors is fine.",1366150681
1,BlizzardCS,Works as intended. Terran is IMBA.,1366154481


Apache Avro is an open source, binary data format that relies on schemas for the data structure. At the time of writing it is not as common as Parquet. However, large binary data formats such as Avro, Thrift, and Protocol Buffers are growing in popularity because of their efficiency. If you work with large data systems, you're likely to run into one of these formats in the near future.

## Loading Unstructured Data

In [14]:
# No imports are needed: open() is a Python built-in

# path to load the text file from
txt_path = "./data/text.txt"

# Read in the file
with open(txt_path, 'r') as f:
    text = f.read()

# Print the content
print(text)

Hello there!


While structured data can easily be read in from CSV, JSON, or various databases, unstructured data can be more challenging and may require custom processing down the line. Sometimes, it’s helpful to open and read in files using Python’s built-in `open` function. This allows us to open a file and then read its content.
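As one possible illustration of such custom processing, the sketch below reads a text file line by line and keeps only the non-empty lines. The file and its contents are hypothetical, written to a temporary location so the example runs on its own.

```python
import os
import tempfile

# Hypothetical text file, created here only to keep the sketch self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Hello there!\n\nSecond line of text.\n")
    txt_path = tmp.name

# Iterate over the file object to process one line at a time,
# stripping whitespace and skipping blank lines
with open(txt_path, "r", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

print(lines)

# Clean up the temporary file
os.remove(txt_path)
```

Iterating over the file object avoids holding the entire file in memory, which can matter for large text files.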