# <center>Big Data for Engineers &ndash; Solutions</center>
## <center>Spring 2021 &ndash; Week 6 &ndash; ETH Zurich</center>
## <center>Data Models</center>


## 1. XML Data Models &ndash; Information Sets


The XML "Information Set" provides an abstract representation of an XML document; it can be thought of as a set of rules on how one would draw an XML document on a whiteboard.

An XML document has an information set if it is well-formed and satisfies the namespace constraints. There is no requirement for an XML document to be valid in order to have an information set. An information set can contain up to eleven different types of information items, e.g., the document information item (always present), element information items, attribute information items, etc.

Draw the Information Set trees for the following XML documents. You can confine your trees to only have the following types of information items: *document information item, elements, character information items, and attributes.*

#### Document 1

```xml
<Burger>
    <Bun>
        <Pickles/>
        <Cheese origin="Switzerland" />
        <Patty/>
    </Bun>
</Burger>
```

#### Solution 
![](https://cloud.inf.ethz.ch/s/RWAzQyamYPzFeDe/download)
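
As a rough cross-check of the drawing, the tree can also be printed with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the helper names `infoset_lines` and `infoset_tree` are made up for illustration, whitespace-only character data is skipped, and text that follows a child element is ignored (Document 1 has none).

```python
import xml.etree.ElementTree as ET

DOC = """<Burger>
    <Bun>
        <Pickles/>
        <Cheese origin="Switzerland" />
        <Patty/>
    </Bun>
</Burger>"""

def infoset_lines(element, depth=1):
    """Return indented lines for an element item, its attributes, and its text."""
    pad = "  " * depth
    lines = [f"{pad}element: {element.tag}"]
    for name, value in element.attrib.items():
        lines.append(f"{pad}  attribute: {name}={value!r}")
    # Only report character data that is not pure whitespace.
    if element.text and element.text.strip():
        lines.append(f"{pad}  characters: {element.text.strip()!r}")
    for child in element:
        lines.extend(infoset_lines(child, depth + 1))
    return lines

def infoset_tree(xml_text):
    """The document information item is always the root of the tree."""
    root = ET.fromstring(xml_text)
    return ["document"] + infoset_lines(root)

tree = infoset_tree(DOC)
print("\n".join(tree))
```

The printed tree has one document item, five element items, and a single attribute item hanging off `Cheese`.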

#### Document 2
```xml
<catalog>
   <!-- A list of books -->
   <book id='bk101'>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date version='hard' version2='soft'>2000-10-01</publish_date>
   </book>
</catalog>
```

#### Solution

![](https://cloud.inf.ethz.ch/s/zj2rTSsMxA5DNip/download)



#### Document 3

```xml
<eth date="11.11.2006">
   <date>16.11.2017</date>
   <president since="2020">Prof. Dr. Joël Mesot</president>
   <rector>Prof. Dr. Sarah M. Springman</rector>
</eth>
```


#### Solution


![](https://cloud.inf.ethz.ch/s/sEBWAHR2Lne4ZGG/download)

## 2. XML Schemas

In this task we will explore XML Schemas in detail. An XML Schema describes the structure of an XML document.

The purpose of an XML Schema is to define the legal building blocks of an XML document:
* the elements and attributes that can appear in a document
* the number of (and order of) child elements
* data types for elements and attributes
* default and fixed values for elements and attributes

When you open an XML Schema in oXygen, you can switch to its graphical representation, by choosing the "Design" mode at the bottom of the document pane; "Text" mode shows the XML Schema as an XML document.


To test XML validation, you can either use oXygen (recommended) or an online validator [like this one](https://www.freeformatter.com/xml-validator-xsd.html).


### 2.1
Match the following XML documents to XML Schemas that will validate them. **First match them manually, then validate with oXygen**.


#### Document 1
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd"/>
```

#### Document 2
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    <health/>
    <friends/>
    <family/>
</happiness>
```

#### Document 3
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    3.141562
</happiness>
```

#### Document 4
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    <health value="100"/>
    <friends/>
    <family/>
</happiness>
```

#### Document 5
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    <health/>
    <friends/>
    <family/>
    But perhaps everybody defines it differently...
</happiness>
```

______


#### Schema 1
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="health"/>
                <xs:element name="friends"/>
                <xs:element name="family"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Schema 2
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType mixed="true">
            <xs:sequence>
                <xs:element name="health"/>
                <xs:element name="friends"/>
                <xs:element name="family"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Schema 3
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness" type="xs:decimal"/>
</xs:schema>
```

#### Schema 4
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType>
            <xs:sequence/>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Schema 5
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="health">
                    <xs:complexType>
                        <xs:attribute name="value" type="xs:integer" use="required"/>
                    </xs:complexType>
                </xs:element>
                <xs:element name="friends"/>
                <xs:element name="family"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

### Solution 
*  Document 1 – Schema 4
*  Document 2 – Schema 1 and Schema 2
*  Document 3 – Schema 3
*  Document 4 – Schema 1, Schema 2, and Schema 5
*  Document 5 – Schema 2
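
Schema 1 and Schema 2 differ only in `mixed="true"`, which is why Document 5 matches Schema 2 alone: it places character data next to child elements. The sketch below is not XSD validation; it only detects that kind of mixed content with the standard library. The helper name `has_mixed_content` is hypothetical, and the `xsi` attributes are left out of the sample strings for brevity.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(xml_text):
    """True if the root has child elements and non-whitespace character data."""
    root = ET.fromstring(xml_text)
    children = list(root)
    # Character data can sit before the first child (text) or after any child (tail).
    chunks = [root.text or ""] + [child.tail or "" for child in children]
    has_text = any(chunk.strip() for chunk in chunks)
    return bool(children) and has_text

doc2 = "<happiness><health/><friends/><family/></happiness>"
doc5 = """<happiness>
    <health/>
    <friends/>
    <family/>
    But perhaps everybody defines it differently...
</happiness>"""

print(has_mixed_content(doc2))  # False
print(has_mixed_content(doc5))  # True
```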

### 2.2


The following XML document is a hypothetical XML representation of a message thread in a slack-like message group. _Disclaimer: The events depicted in this document are fictitious. Any similarity to actual events or persons, living or dead, is purely coincidental._

Provide an XML Schema which will validate this document:

```xml
<?xml version="1.0" encoding="UTF-8"?>

<thread channel="IT Help">
    <message>
        <author>@jdean</author>
        <timestamp>2022-03-12T02:56:23.2</timestamp>
        <body>
            Hi, I tried running docker-compose up on this weeks exercise but 
            then somebody immediately stole my laptop.
        </body>
    </message>
    <replies>
        <message>
            <author>@sghemawat</author>
            <timestamp>2022-03-25T08:32:10.8</timestamp>
            <body>Could you please post your docker logs?</body>
        </message>
        <message>
            <author>@jdean</author>
            <timestamp>2022-03-25T08:36:16.8</timestamp>
            <body>Hi, I fixed it by restarting my docker container.</body>
        </message>
    </replies>
</thread>
```


### Solution
An example XML schema that would validate the above document is:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:element name="thread">
        <xs:complexType>
            <xs:sequence minOccurs="1" maxOccurs="1">
                <xs:element name="message">
                    <xs:complexType>
                        <xs:sequence minOccurs="1" maxOccurs="1">
                            <xs:element name="author" type="xs:string"/>
                            <xs:element name="timestamp" type="xs:dateTime"/>
                            <xs:element name="body" type="xs:string"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
                <xs:element name="replies">
                    <xs:complexType>
                        <xs:sequence minOccurs="0" maxOccurs="unbounded">
                            <xs:element name="message">
                                <xs:complexType>
                                    <xs:sequence minOccurs="1" maxOccurs="1">
                                        <xs:element name="author" type="xs:string"/>
                                        <xs:element name="timestamp" type="xs:dateTime"/>
                                        <xs:element name="body" type="xs:string"/>
                                    </xs:sequence>
                                </xs:complexType>
                            </xs:element>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
            <xs:attribute name="channel" type="xs:string"/>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

Also, the root element of the document needs to be changed as follows to point to the schema:
```xml
<thread xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation="Messages.xsd" channel="IT Help">
```
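
If no XSD validator is at hand, the structural rules encoded in the schema can be approximated by hand. The following sketch uses only the Python standard library and is not a substitute for real validation: `message_ok` and `thread_ok` are hypothetical helpers, the sample is an abbreviated copy of the thread, and the timestamp check covers only the fractional-seconds shape used in the document (`xs:dateTime` accepts more forms).

```python
import xml.etree.ElementTree as ET
from datetime import datetime

THREAD = """<thread channel="IT Help">
    <message>
        <author>@jdean</author>
        <timestamp>2022-03-12T02:56:23.2</timestamp>
        <body>Hi, I tried running docker-compose up.</body>
    </message>
    <replies>
        <message>
            <author>@sghemawat</author>
            <timestamp>2022-03-25T08:32:10.8</timestamp>
            <body>Could you please post your docker logs?</body>
        </message>
    </replies>
</thread>"""

def message_ok(message):
    """Check the child order and that the timestamp parses as a date-time."""
    if [child.tag for child in message] != ["author", "timestamp", "body"]:
        return False
    try:
        datetime.strptime(message.find("timestamp").text, "%Y-%m-%dT%H:%M:%S.%f")
    except (ValueError, TypeError):
        return False
    return True

def thread_ok(xml_text):
    """One message followed by replies, which holds zero or more messages."""
    root = ET.fromstring(xml_text)
    if root.tag != "thread" or [child.tag for child in root] != ["message", "replies"]:
        return False
    replies = root.find("replies")
    return message_ok(root.find("message")) and all(
        child.tag == "message" and message_ok(child) for child in replies
    )

print(thread_ok(THREAD))  # True
```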

**Bonus:** If you prefer not to repeat yourself, you can also declare named custom types in an XML Schema and use them in your element declarations. For example, the following schema would also validate the document:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <!-- Declare Custom Types -->
    <xs:complexType name="message">
        <xs:sequence minOccurs="1" maxOccurs="1">
            <xs:element name="author" type="xs:string"/>
            <xs:element name="timestamp" type="xs:dateTime"/>
            <xs:element name="body" type="xs:string"/>
        </xs:sequence>
    </xs:complexType>
    
    <!-- Schema -->
    <xs:element name="thread">
        <xs:complexType>
            <xs:sequence minOccurs="1" maxOccurs="1">
                <xs:element name="message" type="message"/>
                <xs:element name="replies">
                    <xs:complexType>
                        <xs:sequence minOccurs="0" maxOccurs="unbounded">
                            <xs:element name="message" type="message"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
            <xs:attribute name="channel" type="xs:string"/>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

## 3. JSON Schemas

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. It is used to:
* Describe your existing data format(s).
* Provide clear human- and machine-readable documentation.
* Validate data, i.e., automated testing, ensuring quality of client submitted data.

### 3.1 
Provide a JSON Schema which will validate the following document.

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21
}
```

### Solution
A possible schema for the above document:

```json
{
  "$id": "https://example.com/person.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string",
      "description": "The person's first name."
    },
    "lastName": {
      "type": "string",
      "description": "The person's last name."
    },
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}
```

### 3.2
Provide a JSON Schema which will validate the following document.
The JSON Schema has to check for the following properties:


*   The price of a product has to be strictly positive.
*   Tags describe the product and are necessary for a proper product description. Each product needs at least one tag, and each tag should be unique.
*   The "productId", "productName" and the "price" should always be contained in a valid JSON document.



```json
  {
    "productId": 1,
    "productName": "An ice sculpture",
    "price": 12.50,
    "tags": [ "cold", "ice" ],
    "dimensions": {
      "length": 7.0,
      "width": 12.0,
      "height": 9.5
    }
  }
```

### Solution 
A possible schema for the above document:

```json
{
  "$schema":"http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/product.schema.json",
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "properties": {
    "productId": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "productName": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "description": "The price of the product",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "tags": {
      "description": "Tags for the product",
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "dimensions": {
      "type": "object",
      "properties": {
        "length": {
          "type": "number"
        },
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        }
      },
      "required": [ "length", "width", "height" ]
    }
  },
  "required": [ "productId", "productName", "price" ]
}
```
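
To make the three constraints concrete, here is a hand-written check in plain Python that mirrors `exclusiveMinimum`, `minItems`, `uniqueItems`, and the top-level `required` list. It is a sketch rather than a JSON Schema validator: `check_product` and `is_number` are hypothetical helpers, and the nested `dimensions` object is not checked.

```python
import json

REQUIRED_KEYS = ("productId", "productName", "price")

def is_number(value):
    """JSON numbers map to int/float in Python; bool is excluded on purpose."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def check_product(doc):
    """Return human-readable constraint violations (empty list if none)."""
    errors = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in doc]
    # "exclusiveMinimum": 0 means the price must be strictly positive.
    if "price" in doc and not (is_number(doc["price"]) and doc["price"] > 0):
        errors.append("price must be a number strictly greater than 0")
    # "tags" is optional in the schema, but constrained when present.
    if "tags" in doc:
        tags = doc["tags"]
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            errors.append("tags must be an array of strings")
        elif len(tags) < 1:
            errors.append("tags needs at least one item")
        elif len(set(tags)) != len(tags):
            errors.append("tags must be unique")
    return errors

product = json.loads("""
{
  "productId": 1,
  "productName": "An ice sculpture",
  "price": 12.50,
  "tags": [ "cold", "ice" ],
  "dimensions": { "length": 7.0, "width": 12.0, "height": 9.5 }
}
""")

print(check_product(product))                  # []
print(check_product({**product, "price": 0}))  # one violation: price is not > 0
```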

## 4. JSound

[JSound](http://www.jsound-spec.org/) is a vocabulary that allows you to validate JSON documents. It employs a very simple and intuitive JSON-like syntax.

### 4.1 
Repeat the exercise in 3.1, but now instead produce a JSound schema that will validate the following document:

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21
}
```

### Solution
The following JSound schema is a possible solution to the above question:

```json
{
  "firstName": "string",
  "lastName": "string",
  "age": "integer"
}
```


### 4.2 
Build a valid JSON document based on the following JSound schema. 

```json
{
  "id": "integer",
  "who": [{
    "name": "string",
    "type": "string",
    "preferred": "boolean"
  }],
  "year_of_birth": "integer",
  "living": "boolean"
}
```

### Solution
```json
{
  "id": 100,
  "who": [{
    "name": "Albert",
    "type": "first",
    "preferred": true
  },
  {
    "name": "Einstein",
    "type": "last",
    "preferred": false
  }],
  "year_of_birth": 1879,
  "living": false
}
```
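
A candidate document can be sanity-checked against the schema with a few lines of Python. This is a toy sketch covering only the three atomic types used here, not a JSound implementation. It assumes that fields are optional and that undeclared keys are rejected, which is my reading of the compact syntax; `conforms` is a hypothetical helper.

```python
SCHEMA = {
    "id": "integer",
    "who": [{"name": "string", "type": "string", "preferred": "boolean"}],
    "year_of_birth": "integer",
    "living": "boolean",
}

def conforms(value, schema):
    """Recursively compare a parsed JSON value with a compact-style schema."""
    if isinstance(schema, dict):
        # Assumption: keys may be absent, but undeclared keys are not allowed.
        return (isinstance(value, dict)
                and set(value) <= set(schema)
                and all(conforms(value[key], schema[key]) for key in value))
    if isinstance(schema, list):
        return isinstance(value, list) and all(conforms(item, schema[0]) for item in value)
    if schema == "string":
        return isinstance(value, str)
    if schema == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if schema == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"type not covered by this sketch: {schema}")

candidate = {
    "id": 100,
    "who": [{"name": "Albert", "type": "first", "preferred": True}],
    "year_of_birth": 1879,
    "living": False,
}

print(conforms(candidate, SCHEMA))                               # True
print(conforms({**candidate, "year_of_birth": "1879"}, SCHEMA))  # False: string, not integer
print(conforms({**candidate, "nickname": "Al"}, SCHEMA))         # False: undeclared key
```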

## 5. Creating Parquet documents using RumbleDB and JSound
In this exercise, we will write a JSONiq query that validates a document against a JSound schema to create a user-defined type, which we can then query! Then, we will output it in the [Parquet](https://parquet.apache.org/) file format. 

The Docker container for this week already has RumbleDB pre-installed, which we will treat as a black box to serve our needs. If you want, you could alternatively install it with [Homebrew](https://github.com/RumbleDB/homebrew-rumble) or any of the methods mentioned in the [documentation](https://rumble.readthedocs.io/en/latest/Getting%20started/).

We can run RumbleDB directly through the shell:

In [2]:
!rumbledb run -q '1+1'

2


Alternatively, we can also define some cell magic to allow us to more easily run multi-line JSONiq queries through the shell:

In [5]:
import json
import time
import os
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def rumble(line, cell=None):
    if cell is None:
        data = line
    else:
        data = cell
        
    start = time.time()                                                         
    resp = os.system(f'rumbledb run -q \'{data.strip()}\'')            
    end = time.time()                                                              
    print("Took: %s ms" % (end - start))

In [4]:
%%rumble
1+1

2
Took: 3.0729994773864746 ms
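
Two details of such a helper are worth keeping in mind: `time.time()` returns seconds, so a true millisecond figure needs the difference scaled by 1000, and a query containing a single quote would break the shell string. The variant below is a sketch under those observations; `build_command`, `run_timed`, and the `runner` parameter are hypothetical names, and only the `rumbledb run -q` invocation is taken from the cells above.

```python
import subprocess
import time

def build_command(query):
    """Return the argument list for the `rumbledb run -q` invocation used above."""
    # Passing a list (no shell) avoids quoting problems with ' inside queries.
    return ["rumbledb", "run", "-q", query.strip()]

def run_timed(query, runner=subprocess.run):
    """Run the query and return (runner result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = runner(build_command(query))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

print(build_command("  1+1\n"))  # ['rumbledb', 'run', '-q', '1+1']
```

Calling `run_timed("1+1")` would launch RumbleDB itself, so it only works where the `rumbledb` executable is on the path; the `runner` parameter exists so the timing logic can be tried without it.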


### 5.1 The Great Language Game
The [Great Language Game](http://greatlanguagegame.com/) is a game in which you are given a voice clip to listen to and asked to identify the language in which the person was speaking. It is a multiple-choice question: you make your choice out of several alternatives.

The following JSON document presents a user's attempt at answering a single question in the game: it contains the identifier of the voice clip, the choices presented to the player, and the player's response. Provide a JSound Schema which will validate this document.

```json
{
  "guess": "Norwegian",
  "target": "Norwegian", 
  "country": "AU",
  "choices": [ "Maori", "Mandarin", "Norwegian", "Tongan" ], 
  "sample": "48f9c924e0d98c959d8a6f1862b3ce9a",
  "date": "2013-08-19"
}
```

You can refer to the documentation for [a list of types in JSound](http://www.jsound-spec.org/publish/en-US/JSound/2.0/html-single/JSound/index.html#idm126).


### Solution

The following JSound schema is a possible solution to the above question:

```json
{
    "guess": "string",
    "target": "string",
    "country": "string",
    "choices": [ "string" ],
    "sample": "hexBinary",
    "date": "date"
}
```
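
The choice of `hexBinary` and `date` over plain `string` can be motivated by checking that the sample values really have those shapes. The sketch below does this with the Python standard library only; `parse_hex_binary` is a hypothetical helper, and Python's parsing rules are an approximation of the JSound lexical rules, not a replacement for them.

```python
from datetime import date

attempt = {
    "sample": "48f9c924e0d98c959d8a6f1862b3ce9a",
    "date": "2013-08-19",
}

def parse_hex_binary(text):
    """Decode pairs of hex digits to bytes; raises ValueError on malformed input.

    Note: bytes.fromhex also tolerates spaces, which hexBinary does not.
    """
    return bytes.fromhex(text)

sample = parse_hex_binary(attempt["sample"])
day = date.fromisoformat(attempt["date"])

print(len(sample))           # 16
print(sample.hex().upper())  # 48F9C924E0D98C959D8A6F1862B3CE9A
print(day.month)             # 8
```

Once the values are typed, questions such as "which month was this attempt made in?" no longer require string manipulation.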

### 5.2 Validating our first documents
Now let's validate the above document with the schema we just wrote:

In [5]:
%%rumble
declare type local:attempt as {
    // Replace this object with your JSound schema!
    // Remember to keep the semi-colon :)
};

validate type local:attempt* {
  {
    "guess": "Norwegian",
    "target": "Norwegian", 
    "country": "AU",
    "choices": [ "Maori", "Mandarin", "Norwegian", "Tongan" ], 
    "sample": "48f9c924e0d98c959d8a6f1862b3ce9a",
    "date": "2013-08-19"
  }
}

Took: 2.7993102073669434 ms


⚠️  ️There was an error on line 2 in file:/home/jovyan/work/:

    // Replace this object with your JSound schema!
     ^

Code: [XPST0003]
Message: Parser failed. /
Metadata: file:/home/jovyan/work/:LINE:2:COLUMN:5:
This code can also be looked up in the documentation and specifications for more information.



Let's work with a larger version of the Great Language Game (Confusion) dataset, stored in the JSON format. The file should already be in the local directory as `confusion-100000.json`.

We can query this file directly on the local disk using Rumble:

In [6]:
%%rumble
count(json-file("confusion-100000.json"))



Now, let's use our schema to validate the first 10 attempts from the dataset:

In [None]:
%%rumble
declare type local:attempt as {
    // Replace this object with your JSound schema!
    // Remember to keep the semi-colon :)
};

validate type local:attempt* {
    json-file("confusion-100000.json")[position() <= 10]
}

Although it seems like nothing is happening, our query is returning _typed_ objects instead of JSON strings. This means that we can interact with values like dates in more meaningful ways:

In [None]:
%%rumble
declare type local:attempt as {
    // Replace this object with your JSound schema!
    // Remember to keep the semi-colon :)
};

for $i in validate type local:attempt* {
    json-file("confusion-100000.json")[position() <= 10]
}
let $date := $i.date
return month-from-date($date)


### Solution

In [7]:
%%rumble
declare type local:attempt as {
    "guess": "string",
    "target": "string",
    "country": "string",
    "choices": [ "string" ],
    "sample": "hexBinary",
    "date": "date"
};

validate type local:attempt* {
  {
    "guess": "Norwegian",
    "target": "Norwegian", 
    "country": "AU",
    "choices": [ "Maori", "Mandarin", "Norwegian", "Tongan" ], 
    "sample": "48f9c924e0d98c959d8a6f1862b3ce9a",
    "date": "2013-08-19"
  }
}

{ "guess" : "Norwegian", "target" : "Norwegian", "country" : "AU", "choices" : [ "Maori", "Mandarin", "Norwegian", "Tongan" ], "sample" : "48F9C924E0D98C959D8A6F1862B3CE9A", "date" : "2013-08-19" }
Took: 2.9887449741363525 ms


In [8]:
%%rumble
declare type local:attempt as {
    "guess": "string",
    "target": "string",
    "country": "string",
    "choices": [ "string" ],
    "sample": "hexBinary",
    "date": "date"
};

validate type local:attempt* {
    json-file("confusion-100000.json", 10)[position() <= 10]
}

{ "choices" : [ "Maori", "Mandarin", "Norwegian", "Tongan" ], "country" : "AU", "date" : "2013-08-19", "guess" : "Norwegian", "sample" : "48F9C924E0D98C959D8A6F1862B3CE9A", "target" : "Norwegian" }
{ "choices" : [ "Danish", "Dinka", "Khmer", "Lao" ], "country" : "AU", "date" : "2013-08-19", "guess" : "Dinka", "sample" : "AF5E8F27CEF9E689A070B8814DCC02C3", "target" : "Dinka" }
{ "choices" : [ "German", "Hungarian", "Samoan", "Turkish" ], "country" : "AU", "date" : "2013-08-19", "guess" : "Turkish", "sample" : "509C36EB58DBCE009CCF93F375358D53", "target" : "Samoan" }
{ "choices" : [ "Danish", "Korean", "Latvian", "Somali" ], "country" : "AU", "date" : "2013-08-19", "guess" : "Latvian", "sample" : "A505AB771AE7C32744AD31B3051B8EE9", "target" : "Somali" }
{ "choices" : [ "Bangla", "Dinka", "Italian", "Japanese" ], "country" : "AU", "date" : "2013-08-19", "guess" : "Japanese", "sample" : "3569611136EA04BAB18A0CD605CED358", "target" : "Japanese" }
{ "choices" : [ "Hindi", "Lao", "Maltese", "

In [9]:
%%rumble
declare type local:attempt as {
    "guess": "string",
    "target": "string",
    "country": "string",
    "choices": [ "string" ],
    "sample": "hexBinary",
    "date": "date"
};

for $i in validate type local:attempt* {
  json-file("confusion-100000.json")[position() <= 10]
}
let $date := $i.date
return month-from-date($date)

8
8
8
8
8
8
8
8
8
8
Took: 13.538018703460693 ms


### 5.3 From JSON to Parquet
Now, let's try validating our entire dataset and writing it out as a Parquet file! We'll need the shell for this, so let's first create a JSONiq file containing our query. Modify and then copy the following query into a new file called `query.jq`.

**Hint:** Date types in JSound can have timezones, but Parquet does not support dates with timezones at the moment. We should instead validate dates as strings for this exercise.

```json
declare type local:attempt as {
    // Replace this object with your JSound schema!
    // Remember to keep the semi-colon :)
};

validate type local:attempt* {
    json-file("confusion-100000.json", 10)
}
```

Then, we can run the query through the shell, specifying the output format as Parquet:

In [1]:
!rumbledb run query.jq -o result.out -f parquet -P 1

[INFO] Validation against local:attempt compatible with data frames.
[INFO] Validation against local:attempt compatible with data frames.
[INFO] Validation against local:attempt compatible with data frames.
[INFO] Writing to format parquet


Then let's change the file name to be more representative of its contents:

In [2]:
!cp `find result.out/part-00000*` greatlanguagegame.parquet

Now, if we compare the sizes of the JSON Lines and Parquet files...

In [3]:
!ls -lh

total 17M
-rw-rw-r-- 1 jovyan jovyan  24K Apr 11 20:39 2022_Exercises_exercise06_Data_Models.ipynb
-rw-rw-r-- 1 jovyan jovyan  41K Apr 12 09:05 2022_Exercises_exercise06_Data_Models_Solution.ipynb
-rw-rw-r-- 1 jovyan jovyan  16M Apr  7 16:39 confusion-100000.json
drwxrwxr-x 3 jovyan jovyan 4.0K Apr  7 16:39 docker
-rw-rw-r-- 1 jovyan jovyan  367 Apr  7 16:39 docker-compose.yml
-rw-rw-r-- 1 jovyan jovyan  144 Apr 12 07:53 document.jschema
-rw-rw-r-- 1 jovyan jovyan   17 Apr 12 07:48 document.json
-rw-r--r-- 1 jovyan jovyan 656K Apr 12 09:06 greatlanguagegame.parquet
-rw-rw-r-- 1 jovyan jovyan  788 Apr 12 07:17 message.xml
-rw-rw-r-- 1 jovyan jovyan  990 Apr 12 07:22 message.xsd
-rw-rw-r-- 1 jovyan jovyan  266 Apr 12 07:38 movie.xml
-rw-rw-r-- 1 jovyan jovyan  575 Apr 12 07:37 movie.xsd
-rw-rw-r-- 1 jovyan jovyan  261 Apr 12 09:02 query.jq
drwxrwxr-x 2 jovyan jovyan 4.0K Apr 12 09:06 result.out


...we can see how much smaller the Parquet file is: 656K compared to 16M for `confusion-100000.json`, roughly 25 times smaller. Parquet is a column-oriented binary storage format with efficient data compression schemes.

Intuitively, reducing the size of a file while maintaining fixed throughput means that we can also scan the file much faster! Keep this in mind next time you need to work with a huge JSON dataset!
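
To get a feeling for why a column-oriented layout compresses so well, consider the toy sketch below. It pivots a few records into columns and run-length encodes each column. Parquet's actual encodings are more elaborate; the helper names `to_columns` and `run_length_encode` are made up for illustration, and the three records are trimmed versions of attempts from the dataset.

```python
from itertools import groupby

rows = [
    {"country": "AU", "date": "2013-08-19", "guess": "Norwegian"},
    {"country": "AU", "date": "2013-08-19", "guess": "Dinka"},
    {"country": "AU", "date": "2013-08-19", "guess": "Turkish"},
]

def to_columns(records):
    """Pivot a list of row objects into one list of values per field."""
    return {key: [record[key] for record in records] for key in records[0]}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)
encoded = {name: run_length_encode(values) for name, values in columns.items()}

print(encoded["country"])  # [('AU', 3)]
print(encoded["guess"])    # [('Norwegian', 1), ('Dinka', 1), ('Turkish', 1)]
```

Columns with many repeated neighbouring values, such as `country` and `date` here, shrink to a handful of pairs, while the field names themselves are stored once instead of once per record.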

In [6]:
%%rumble
count(parquet-file("greatlanguagegame.parquet"))

100000
Took: 11.231056690216064 ms


### 5.4 Bigger data for a bigger benefit (optional)

The dataset we used was just a subset of the full Great Language Game dataset. With a larger dataset, we should observe a much more noticeable reduction in scan time for Parquet compared to JSON!

We can download this dataset here:

In [None]:
!wget -O- http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2 | tar -jxv

Then, let's repeat several steps from above with the new dataset:

In [None]:
%%rumble
count(json-file("confusion-2014-03-02/confusion-2014-03-02.json"))

This includes creating a new query file that points to the larger dataset:

```json
declare type local:my-type as {
    // Replace this object with your JSound schema!
    // Remember to keep the semi-colon :)
};

validate type local:my-type* {
    json-file("confusion-2014-03-02/confusion-2014-03-02.json", 10)
}
```

The file should be named `query-large.jq`.

In [None]:
!rumbledb run query-large.jq -o result-large.out -f parquet -P 1

In [None]:
!cp `find result-large.out/part-00000*` greatlanguagegame-large.parquet

In [None]:
%%rumble
count(parquet-file("greatlanguagegame-large.parquet"))