# Minimalist Guide to Getting Data in Any Format

Reading data with Python is a key skill for data scientists. This notebook walks through how to read data in a variety of commonly encountered formats so that you can start your analysis quickly.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# CSV files

It is straightforward to read CSV files with the read_csv() function in pandas. However, a few small tricks make the data easier to read and analyze. We will walk through:
- Parsing dates
- Reading selected columns
- Setting an index column

We use [Netflix Movies and TV Shows](https://www.kaggle.com/shivamb/netflix-shows) dataset for illustration.

In [None]:
# read_csv without arguments
nf = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
nf.head()

In [None]:
# Parse dates with "parse_dates" argument
nf = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv', parse_dates=['date_added'])
nf.head()

In [None]:
# Read selected columns using "usecols" argument
list_cols = ['show_id', 'type', 'title', 'country', 'date_added', 'rating', 'duration']
nf = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv', parse_dates=['date_added'], usecols = list_cols)
nf.head()

In [None]:
# Set a column as index by "index_col" argument - useful for making time series
nf = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv', parse_dates=['date_added'], 
                 usecols = list_cols, index_col = 'show_id')
nf.head()

There are many arguments one can use to read csv files more effectively. Please see [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details.
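The three arguments used above can also be tried without the Kaggle dataset. The following is a minimal sketch on a small in-memory CSV; the rows and the `csv_text` variable are made up for illustration and are not taken from the Netflix data.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk
csv_text = """show_id,title,date_added,rating
s1,Example One,2020-01-15,PG
s2,Example Two,2019-11-01,TV-MA
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["show_id", "title", "date_added"],  # skip the "rating" column
    parse_dates=["date_added"],                  # convert text to datetime values
    index_col="show_id",                         # use show_id as the row index
)

print(df.dtypes)
print(df.index.tolist())
```

Because `usecols` drops columns while the file is being parsed, it can also reduce memory use on wide files.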

# Image files

In image-related machine learning tasks, we need to inspect the images in our dataset before building models. In this section, we walk through folders to collect image file paths and show sample images with their labels. We use the [Flowers Recognition](https://www.kaggle.com/alxmamaev/flowers-recognition) dataset as an example.

In [None]:
data_dir = '/kaggle/input/flowers-recognition/flowers/flowers'

# Dictionary of all images paths
flowers = {'sunflower':[], 'tulip':[], 'daisy':[], 'rose': [], 'dandelion': []}

In [None]:
# Populate the dictionary with image paths
for flower in flowers:
    for dirname, _, filenames in os.walk(os.path.join(data_dir, flower)):
        for filename in filenames:
            # Join with dirname so files in nested subfolders get the correct path
            flowers[flower].append(os.path.join(dirname, filename))
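The folder walk above can also be sketched with pathlib, discovering the class folders instead of listing them by hand. This is only an illustration: `collect_image_paths`, the extension list, and the temporary folder tree are assumptions made for the example, not part of the Flowers Recognition dataset.

```python
import tempfile
from pathlib import Path


def collect_image_paths(root, extensions=(".jpg", ".jpeg", ".png")):
    """Map each subfolder name under root to a sorted list of image paths."""
    paths_by_label = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        paths_by_label[class_dir.name] = sorted(
            str(p) for p in class_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in extensions
        )
    return paths_by_label


# Build a throwaway folder tree with empty placeholder files to try it out
with tempfile.TemporaryDirectory() as tmp:
    layout = {"rose": ["a.jpg", "b.png"], "tulip": ["c.jpg", "notes.txt"]}
    for label, names in layout.items():
        (Path(tmp) / label).mkdir()
        for name in names:
            (Path(tmp) / label / name).touch()
    found = collect_image_paths(tmp)
    counts = {label: len(paths) for label, paths in found.items()}

print(counts)
```

Filtering by extension keeps stray files such as notes or metadata out of the image lists.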

In [None]:
# Showing 2 random images from each of the categories
from tensorflow.keras.preprocessing import image

for flower in list(flowers.keys()):
    plt.figure(figsize=(8, 5))
    flower_choice = np.random.choice(len(flowers[flower]), 2, replace=False) # Choose two distinct images at random
    plt.subplot(1, 2, 1)
    img_path1 = flowers[flower][flower_choice[0]]
    img = image.load_img(img_path1)
    plt.imshow(img)
    plt.title(flower)
    plt.subplot(1, 2, 2)
    img_path2 = flowers[flower][flower_choice[1]]
    img = image.load_img(img_path2)
    plt.imshow(img)
    plt.title(flower)

plt.show()

# Text files

Natural language processing problems often involve loading directories of text files. This section shows how to read the files and put their contents into a pandas dataframe. We use the dataset [The Works of Charles Dickens](https://www.kaggle.com/fuzzyfroghunter/dickens) as an example.

In [None]:
# The dataset contains a metadata table

path = '../input/dickens/dickens'
meta = pd.read_csv(os.path.join(path, 'metadata.tsv'), delimiter='\t')
meta.head()

In [None]:
# Read the first 500 characters of one txt file
with open(os.path.join(path, '924-0.txt'), 'r') as file1:
    print(file1.read(500))

In [None]:
files = {} # Create a dictionary of file names and contents
for f in os.listdir(path):
    if f.endswith('.txt'): 
        with open(os.path.join(path, f), "r") as file:
            files[f] = file.read()

In [None]:
# Convert to dataframe and merge with the metadata table
contents = pd.DataFrame.from_dict(files, orient='index', columns=['content']).reset_index()
dickens = meta.merge(contents, left_on='Path', right_on='index')
dickens.head()
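The same idea can be wrapped in a small reusable function. The sketch below is an assumption-laden illustration: `read_text_dir` is a hypothetical helper, the files are created in a temporary folder, and the UTF-8 encoding is a guess that may not hold for every corpus.

```python
import os
import tempfile

import pandas as pd


def read_text_dir(directory, encoding="utf-8"):
    """Read every .txt file in directory into a dataframe of (file, content)."""
    rows = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name), "r", encoding=encoding) as fh:
                rows.append({"file": name, "content": fh.read()})
    return pd.DataFrame(rows, columns=["file", "content"])


# Create a few placeholder files, including one that should be skipped
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "first"), ("b.txt", "second"), ("skip.csv", "x,y")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(text)
    texts = read_text_dir(tmp)

print(texts)
```

Passing the encoding explicitly avoids depending on the platform default, which differs between systems.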

# BigQuery SQL

BigQuery is a cloud data service by Google that lets us use SQL to retrieve data from databases hosted on Google Cloud. More information can be found in this [Kaggle tutorial](https://www.kaggle.com/rtatman/sql-scavenger-hunt-handbook) and the [documentation of BigQuery](https://cloud.google.com/bigquery/docs). In this section we will walk through:
- Getting datasets
- Inspecting table lists and schemas
- Loading a table into a dataframe
- Reading data through a SQL query
- Estimating data size before loading

Many public datasets are stored in BigQuery and are accessible to the public. See [this page](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&_ga=2.5846323.181324053.1598680127-1821649181.1596854139&pli=1) for details. In this section we use the Hacker News dataset as an example.

In [None]:
from google.cloud import bigquery

client = bigquery.Client()

# Get dataset
hacker_ref = client.dataset('hacker_news', project='bigquery-public-data')
hacker = client.get_dataset(hacker_ref)

# List all table names
tables = list(client.list_tables(hacker))
for table in tables:  
    print(table.table_id)

In [None]:
# Get a table
table_ref = hacker.table('full')
table = client.get_table(table_ref)

In [None]:
# Before loading the table, we can get some attributes
print ('No. of rows: ' + str(table.num_rows))
print ('Size (MB): ' + str(int(table.num_bytes / (1024 * 1024))))
print ('Columns:')
print (list(c.name for c in table.schema))
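As a small aside on the size figure: when dividing by powers of 1024, one megabyte (strictly, one mebibyte) is 1024 * 1024 = 1,048,576 bytes. The sketch below shows the conversion on its own; `bytes_to_mib` and `example_bytes` are hypothetical names, and the byte count is made up rather than read from BigQuery.

```python
def bytes_to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


# Hypothetical byte count; in the notebook this would be table.num_bytes
example_bytes = 5 * 1024 * 1024 * 1024
print("Size (MiB): {:.1f}".format(bytes_to_mib(example_bytes)))
```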

In [None]:
# Get first 100 rows
table_df = client.list_rows(table, max_results=100).to_dataframe()
table_df.head()

In [None]:
# Write SQL query to get data
# Get top 10 authors by number of articles
query = """
SELECT author, count(id) as stories
FROM `bigquery-public-data.hacker_news.stories`
GROUP BY author
ORDER BY count(id) DESC
"""

query_job = client.query(query)
iterator = query_job.result()
rows = list(iterator)

# Transform the rows into a nice pandas dataframe
top_authors = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows[0].keys()))

# Look at the top 10 authors
top_authors.head(10)

Reference: [BigQuery API reference](https://googleapis.dev/python/bigquery/latest/reference.html)

# Parquet files

Parquet is a columnar file format suited to storing large datasets, such as image data. The read_parquet() function in pandas reads a parquet file into a dataframe. We use the dataset for [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19/data) as an example.

In [None]:
img = pd.read_parquet('../input/bengaliai-cv19/train_image_data_0.parquet')
img.shape

In [None]:
img.head()

The documentation says: 
> Each parquet file contains tens of thousands of 137x236 grayscale images. The images have been provided in the parquet format for I/O and space efficiency. Each row in the parquet files contains an image_id column, and the flattened image.

To show the image, we need to reshape the vector in each row into a two-dimensional array.
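The reshape step can be checked on synthetic data before touching the real file. In this sketch the random pixel values, the `Sample_0`/`Sample_1` ids, and the variable names are assumptions; only the 137x236 image size and the id-plus-flattened-pixels layout come from the dataset description.

```python
import numpy as np
import pandas as pd

HEIGHT, WIDTH = 137, 236  # image size stated in the dataset description

# Hypothetical stand-in for the parquet table: an id column plus flattened pixels
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(2, HEIGHT * WIDTH), dtype=np.uint8)
table = pd.DataFrame(pixels, columns=[str(i) for i in range(HEIGHT * WIDTH)])
table.insert(0, "image_id", ["Sample_0", "Sample_1"])

# Drop the id column, then fold each row of 32,332 values
# back into 137 rows of 236 pixels (row-major order)
images = table.iloc[:, 1:].to_numpy().reshape(-1, HEIGHT, WIDTH)

print(images.shape)
```

With row-major order, the pixel at row r and column c of an image comes from position r * 236 + c of its flattened vector.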

In [None]:
img2 = img.iloc[:,1:].values.reshape((-1,137,236,1))

row=3; col=4;
plt.figure(figsize=(20,(row/col)*12))
for x in range(row*col):
    plt.subplot(row,col,x+1)
    plt.imshow(img2[x,:,:,0])
plt.show()

Sometimes we may want to resize the images for later analysis. We can use the cv2 package (OpenCV) for this.

In [None]:
import cv2

DIM = 64

img3 = np.zeros((img2.shape[0],DIM,DIM,1),dtype='float32')
for j in range(img2.shape[0]):
    img3[j,:,:,0] = cv2.resize(img2[j,],(DIM,DIM),interpolation = cv2.INTER_AREA)

row=3; col=4;
plt.figure(figsize=(20,(row/col)*12))
for x in range(row*col):
    plt.subplot(row,col,x+1)
    plt.imshow(img3[x,:,:,0])
plt.show()

In [None]:
# Free up memory
del img

# JSON files

This section demonstrates how to use the json package to read JSON data into a dataframe. The file used here stores one JSON record per line, so we parse it line by line. We use the [arXiv dataset](https://www.kaggle.com/Cornell-University/arxiv) as an example.

In [None]:
import json

data  = []
with open("/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f: 
        data.append(json.loads(line))

print("No. of records: {}".format(len(data)))

In [None]:
# See first item
data[0]

In [None]:
# Convert to dataframe - to limit memory use we only load the first 1000 items
df = pd.DataFrame(data[:1000])
df.head()

In [None]:
# Free up some memory
del data
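When memory is tight, one option is to stop reading after the first N lines rather than parsing the whole file and slicing afterwards. The following is a minimal sketch of that idea; `read_json_lines` is a hypothetical helper and the three records are made up, not taken from the arXiv file.

```python
import io
import itertools
import json

import pandas as pd


def read_json_lines(handle, limit=None):
    """Parse at most `limit` records from a file with one JSON object per line."""
    lines = itertools.islice(handle, limit)  # limit=None reads every line
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical records standing in for the arXiv metadata file
sample = io.StringIO(
    '{"id": "0001", "title": "First"}\n'
    '{"id": "0002", "title": "Second"}\n'
    '{"id": "0003", "title": "Third"}\n'
)

records = read_json_lines(sample, limit=2)
df_small = pd.DataFrame(records)
print(df_small)
```

The same function should work with a real file handle from `open(...)`, since file objects also yield one line at a time.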

# XML files

The last section covers loading XML files with the xml package from the Python standard library. We use the [COVID-19 Clinical Trials dataset](https://www.kaggle.com/parulpandey/covid19-clinical-trials-dataset) as an example.

In [None]:
from xml.etree import ElementTree

path = '../input/covid19-clinical-trials-dataset/COVID-19 CLinical trials studies/'

files = os.listdir(path)
print('Total number of studies: ',len(files))

In [None]:
# Inspect the first element
file_path = os.path.join(path, files[0])
tree = ElementTree.parse(file_path)
root = tree.getroot()
print (root.tag, root.attrib)

In [None]:
# Look at all the children nodes
for child in root:
    print (child.tag, child.attrib)

In [None]:
records = [] # One dictionary per study

for file in files[:10]: # Get the first 10 studies
    file_path = os.path.join(path, file)
    tree = ElementTree.parse(file_path)
    root = tree.getroot()
    trial = {} # Initialize dictionary
    
    # read tags using root.find() method
    trial['nct_id'] = root.find('id_info').find('nct_id').text
    trial['brief_title'] = root.find('brief_title').text
    trial['overall_status'] = root.find('overall_status').text
    
    records.append(trial)

# Build the dataframe once instead of concatenating inside the loop
df = pd.DataFrame(records)

In [None]:
df

Reference: [The ElementTree XML API](https://docs.python.org/3/library/xml.etree.elementtree.html)
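Chained `root.find(...).text` calls raise an AttributeError when a tag is missing, because `find` returns None. A more forgiving variant uses `findtext`, which returns None for absent tags. The sketch below assumes hypothetical, heavily shortened records; the `clinical_study` root tag and the field values are invented for illustration.

```python
from xml.etree import ElementTree

import pandas as pd

# Hypothetical shortened study records; the second has no overall_status tag
xml_docs = [
    "<clinical_study><id_info><nct_id>NCT001</nct_id></id_info>"
    "<brief_title>Study A</brief_title>"
    "<overall_status>Recruiting</overall_status></clinical_study>",
    "<clinical_study><id_info><nct_id>NCT002</nct_id></id_info>"
    "<brief_title>Study B</brief_title></clinical_study>",
]


def parse_trial(xml_text):
    """Extract selected fields; findtext returns None when a tag is absent."""
    root = ElementTree.fromstring(xml_text)
    return {
        "nct_id": root.findtext("id_info/nct_id"),
        "brief_title": root.findtext("brief_title"),
        "overall_status": root.findtext("overall_status"),
    }


trials = pd.DataFrame([parse_trial(doc) for doc in xml_docs])
print(trials)
```

The path `id_info/nct_id` lets `findtext` reach a nested tag in one call instead of chaining two `find` calls.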

That's it for now. The best way to practice reading data is to find as many datasets as interest you, then load and inspect them. You will encounter different issues and requirements in different datasets, and this notebook will be updated to illustrate more useful skills. Go find your favorite datasets to practice!