# CSV vs Parquet

A comma-separated values (CSV) file is a plain text file that stores tabular data, one record per line, with the fields in each record separated by commas. CSV files are often used to exchange data between applications.

Parquet is an open source, column-oriented file format. It works well with complex data in large volumes and is known for both its efficient data compression and its support for a wide variety of encoding types.

Columnar storage such as Apache Parquet is designed to be more efficient than row-based files such as CSV. When querying columnar storage, you can skip over irrelevant data very quickly. As a result, aggregation queries take less time than they do on row-oriented storage.
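
To make the row-versus-column difference concrete, here is a minimal sketch in plain Python. It is only a toy model of the two layouts, not how Parquet is actually implemented; the three records reuse values from the shipment table shown later in this notebook, and the short key names (`state`, `city`, `volume`) are made up for the example.

```python
# Toy data modelled on the shipment table used in this notebook.
rows = [
    {"state": "NV", "city": "Las Vegas", "volume": 1511},
    {"state": "TX", "city": "Houston", "volume": 1447},
    {"state": "CA", "city": "Los Angeles", "volume": 1260},
]

# Row-oriented layout (CSV-like): every record is stored whole, so summing
# one field still walks through every full record.
row_total = sum(record["volume"] for record in rows)

# Column-oriented layout (Parquet-like): each field is stored together,
# so an aggregation reads one list and never touches the others.
columns = {key: [record[key] for record in rows] for key in rows[0]}
column_total = sum(columns["volume"])

print(row_total, column_total)
```

Both totals are the same; the point is that the columnar version only needs the `volume` list, which is the property that lets a columnar reader skip unrelated columns.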

# Implementation of Parquet

In [7]:
import os

In [1]:
import pyarrow  # Parquet engine that pandas uses in the cells below
import json

In [2]:
import pandas as pd
# df = pd.read_csv('example.csv')
# df.to_parquet('output.parquet')

# CSV to Parquet

In [3]:
df=pd.read_csv('C:\\Users\\Sanika\\Downloads\\Total Shipment Volume.csv')

In [4]:
df

Unnamed: 0,POU Country,POU State,POU City,Shipment Volume
0,US,NV,Las Vegas,1511
1,US,TX,Houston,1447
2,US,CA,Los Angeles,1260
3,US,TX,San Antonio,1260
4,US,FL,Miami,1224
...,...,...,...,...
6649,US,WY,Basin,1
6650,US,WY,Kemmerer,1
6651,US,WY,Lusk,1
6652,US,WY,Mountain View,1


In [5]:
df.to_parquet('output.parquet')

In [9]:
os.getcwd()

'C:\\Users\\Sanika'

In [10]:
pd.read_parquet('output.parquet', engine='pyarrow')

Unnamed: 0,POU Country,POU State,POU City,Shipment Volume
0,US,NV,Las Vegas,1511
1,US,TX,Houston,1447
2,US,CA,Los Angeles,1260
3,US,TX,San Antonio,1260
4,US,FL,Miami,1224
...,...,...,...,...
6649,US,WY,Basin,1
6650,US,WY,Kemmerer,1
6651,US,WY,Lusk,1
6652,US,WY,Mountain View,1


# Excel to Parquet

In [12]:
data=pd.read_excel('C:\\Users\\Sanika\\Downloads\\Data.xlsx')
data

Unnamed: 0,Order ID,Order Date,Unit Cost,Price,Order Qty,Cost of Sales,Sales,Profit,Channel,Promotion Name,Product Name,Manufacturer,Product Sub Category,Product Category,Region,City,Country
0,7077,2017-09-13,77,304,9,685,2715,2030,Store,European Spring Promotion,Contoso SLR Camera M143 Grey,"Contoso, Ltd",Digital SLR Cameras,Cameras and camcorders,Europe,Moscow,Russia
1,117,2016-08-20,8,13,4,30,51,21,Store,European Spring Promotion,Contoso 512MB MP3 Player E51 Blue,"Contoso, Ltd",MP4&MP3,Audio,Europe,Moscow,Russia
2,7018,2016-07-08,11,160,9,92,1396,1305,Store,European Spring Promotion,Contoso DVD 9-Inch Player Portable M300 Silver,"Contoso, Ltd",Movie DVD,"Music, Movies and Audio Books",Europe,Moscow,Russia
3,140,2018-08-11,1,26,18,11,463,453,Store,North America Spring Promotion,NT Bluetooth Stereo Headphones E52 Pink,Northwind Traders,Bluetooth Headphones,Audio,North America,Bellevue,United States
4,491,2017-07-15,109,304,9,977,2615,1638,Online,Asian Spring Promotion,Contoso SLR Camera M143 Grey,"Contoso, Ltd",Digital SLR Cameras,Cameras and camcorders,Asia,Beijing,China
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,22348,2016-10-03,17,25,12,195,300,106,Store,Asian Summer Promotion,Contoso USB Cable M250 Blue,"Contoso, Ltd",Cameras & Camcorders Accessories,Cameras and camcorders,Asia,Osaka,Japan
14996,22349,2016-10-20,56,999,10,554,9990,9437,Reseller,No Discount,WWI Projector 720p DLP56 White,Wide World Importers,Projectors & Screens,Computers,Europe,Paris,France
14997,22350,2017-05-20,364,588,13,4728,7409,2682,Store,North America Holiday Promotion,"A. Datum SLR Camera 35"" X358 Gold",A. Datum Corporation,Digital SLR Cameras,Cameras and camcorders,North America,Ithaca,United States
14998,22351,2017-11-05,114,279,12,1359,3348,1990,Online,Asian Summer Promotion,Adventure Works LCD20W M240 Black,Adventure Works,Monitors,Computers,Asia,Beijing,China


In [13]:
data.to_parquet('Dataoutput.parquet')

In [14]:
pd.read_parquet('Dataoutput.parquet', engine='pyarrow')

Unnamed: 0,Order ID,Order Date,Unit Cost,Price,Order Qty,Cost of Sales,Sales,Profit,Channel,Promotion Name,Product Name,Manufacturer,Product Sub Category,Product Category,Region,City,Country
0,7077,2017-09-13,77,304,9,685,2715,2030,Store,European Spring Promotion,Contoso SLR Camera M143 Grey,"Contoso, Ltd",Digital SLR Cameras,Cameras and camcorders,Europe,Moscow,Russia
1,117,2016-08-20,8,13,4,30,51,21,Store,European Spring Promotion,Contoso 512MB MP3 Player E51 Blue,"Contoso, Ltd",MP4&MP3,Audio,Europe,Moscow,Russia
2,7018,2016-07-08,11,160,9,92,1396,1305,Store,European Spring Promotion,Contoso DVD 9-Inch Player Portable M300 Silver,"Contoso, Ltd",Movie DVD,"Music, Movies and Audio Books",Europe,Moscow,Russia
3,140,2018-08-11,1,26,18,11,463,453,Store,North America Spring Promotion,NT Bluetooth Stereo Headphones E52 Pink,Northwind Traders,Bluetooth Headphones,Audio,North America,Bellevue,United States
4,491,2017-07-15,109,304,9,977,2615,1638,Online,Asian Spring Promotion,Contoso SLR Camera M143 Grey,"Contoso, Ltd",Digital SLR Cameras,Cameras and camcorders,Asia,Beijing,China
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,22348,2016-10-03,17,25,12,195,300,106,Store,Asian Summer Promotion,Contoso USB Cable M250 Blue,"Contoso, Ltd",Cameras & Camcorders Accessories,Cameras and camcorders,Asia,Osaka,Japan
14996,22349,2016-10-20,56,999,10,554,9990,9437,Reseller,No Discount,WWI Projector 720p DLP56 White,Wide World Importers,Projectors & Screens,Computers,Europe,Paris,France
14997,22350,2017-05-20,364,588,13,4728,7409,2682,Store,North America Holiday Promotion,"A. Datum SLR Camera 35"" X358 Gold",A. Datum Corporation,Digital SLR Cameras,Cameras and camcorders,North America,Ithaca,United States
14998,22351,2017-11-05,114,279,12,1359,3348,1990,Online,Asian Summer Promotion,Adventure Works LCD20W M240 Black,Adventure Works,Monitors,Computers,Asia,Beijing,China


# What is gzip? Does it have any alternatives?

gzip (GNU Gzip) is a software application used for file compression and decompression. It is based on the DEFLATE algorithm, which combines LZ77 and Huffman coding. There are more than 10 alternatives to gzip for a variety of platforms, including Windows, Linux, Mac, BSD and Android. A commonly recommended alternative is 7-Zip, which is both free and open source. Others include WinRAR (paid), 7-Zip ZS (free, open source), DAR (free, open source) and Explzh for Windows (free for personal use).
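
Inside Python, the standard library ships a `gzip` module along with two other codecs, `bz2` and `lzma`. The sketch below compresses the same bytes with all three and checks that each round-trips. The payload is a made-up, highly repetitive sample, so the sizes it prints say little about real files.

```python
import bz2
import gzip
import lzma

# Repetitive sample text, loosely shaped like CSV rows (made up for the demo).
payload = b"US,TX,Houston,1447\n" * 1000

# Compress the same bytes with each standard-library codec.
compressed = {
    "gzip": gzip.compress(payload),
    "bz2": bz2.compress(payload),
    "lzma": lzma.compress(payload),
}

for name, blob in compressed.items():
    print(f"{name}: {len(payload)} -> {len(blob)} bytes")

# Decompressing must give back exactly the original bytes.
restored = gzip.decompress(compressed["gzip"])
```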

# Dask

Dask is a Python library for parallel computing.
Its parallel computing features let computations scale quickly and efficiently.
It provides an easy way to handle large datasets in Python with minimal extra effort beyond the regular pandas workflow.
In other words, Dask lets us scale out to clusters to handle big data, or scale down to a single computer to handle large data by using the full power of its CPU/GPU, all from ordinary Python code.
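
The general idea of splitting data into partitions, working on each partition separately, and combining the partial results can be sketched with the standard library alone. This is not Dask's API: `partition` and `parallel_sum` are hypothetical helpers written for this illustration, and threads are used only to keep the example short (in CPython they do not give real CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, size=1000, workers=4):
    """Sum each partition in a worker, then combine the partial sums."""
    parts = partition(values, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, parts))
    return sum(partials)


# 6654 matches the row count of the shipment file used in this notebook.
total = parallel_sum(list(range(6654)))
print(total)
```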

In [59]:
import dask.dataframe as dd

In [58]:
import dask
import datetime

In [60]:
# assume_missing=True reads integer columns as floats so that missing values can be represented
data = dd.read_csv("C:\\Users\\Sanika\\Downloads\\Total Shipment Volume.csv", assume_missing=True)
data.compute()

Unnamed: 0,POU Country,POU State,POU City,Shipment Volume
0,US,NV,Las Vegas,1511.0
1,US,TX,Houston,1447.0
2,US,CA,Los Angeles,1260.0
3,US,TX,San Antonio,1260.0
4,US,FL,Miami,1224.0
...,...,...,...,...
6649,US,WY,Basin,1.0
6650,US,WY,Kemmerer,1.0
6651,US,WY,Lusk,1.0
6652,US,WY,Mountain View,1.0


# Loading data in chunks

In [61]:
df1=pd.read_csv("C:\\Users\\Sanika\\Downloads\\Total Shipment Volume.csv")

In [62]:
df1

Unnamed: 0,POU Country,POU State,POU City,Shipment Volume
0,US,NV,Las Vegas,1511
1,US,TX,Houston,1447
2,US,CA,Los Angeles,1260
3,US,TX,San Antonio,1260
4,US,FL,Miami,1224
...,...,...,...,...
6649,US,WY,Basin,1
6650,US,WY,Kemmerer,1
6651,US,WY,Lusk,1
6652,US,WY,Mountain View,1


In [63]:
len(df1)

6654

In [64]:
df2= pd.read_csv("C:\\Users\\Sanika\\Downloads\\Total Shipment Volume.csv", chunksize=1000)

In [65]:
df2

<pandas.io.parsers.TextFileReader at 0x27cadfd0f10>

In [66]:
total_len = 0
for chunk in df2:
    # Do some preprocessing to reduce the memory size of each chunk
    total_len += len(chunk)
print(total_len)

6654
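
The loop above only counts rows. As a sketch of the "preprocessing" the comment mentions, each chunk can be reduced to a small summary before it is kept, so the full file never has to sit in memory at once. The example uses a small in-memory CSV in place of the real file (the rows are the first five shown above) and a chunk size of 2 purely so that one state is split across chunks.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for the shipment file.
csv_text = """POU State,POU City,Shipment Volume
NV,Las Vegas,1511
TX,Houston,1447
CA,Los Angeles,1260
TX,San Antonio,1260
FL,Miami,1224
"""

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Reduce each chunk to per-state subtotals before keeping it.
    partials.append(chunk.groupby("POU State")["Shipment Volume"].sum())

# Combine the subtotals: a state split across chunks is added together.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals.to_dict())
```

Only the per-chunk subtotals are retained, which is what keeps memory use low compared with concatenating every chunk back into one DataFrame.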


In [68]:
tp = pd.read_csv('C:\\Users\\Sanika\\Downloads\\Total Shipment Volume.csv', iterator=True, chunksize=1000)  # gives TextFileReader
df = pd.concat(tp, ignore_index=True)

In [69]:
len(df)

6654