Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Do NOT add any cells to the notebook!

Do not forget to submit both the notebook AND the files in the data/ subfolder according to the CoC!

Make sure you fill in any place that says `YOUR CODE HERE` or _YOUR ANSWER HERE_ , as well as your name and group below:

In [1]:
NAME = "Silvia Christine Schlenz"
STUDENTID = "12115811"
GROUPID = "Assig 2 + 5 9";

# Assignment 2 (Group)
When carrying out a data science project, screening and selecting appropriate data sources for the tasks at hand comes first. This assignment is about accessing and characterising potential data sources in teams of three. The teams have been randomly assigned. BEWARE! In Assignment 5, you will be asked to answer the question you formulate under **Project ideas** in Step 0. Make sure that combining the two datasets makes sense from an analytical perspective!

-----
## Step 0 (2 points)

Find two data sets online (from one or several sources) that would be interesting to combine. The data sets should fulfill the following requirements:

* The two data sets must have different file formats (CSV, XML, or JSON). Please choose
  - one CSV file (dataset1)
  - and one JSON or XML file (dataset2)

* The two datasets should not be two variations of each other (i.e. simply the same dataset for two different regions or timeframes or from the same source just in two different formats)
* Workable data-set sizes: The selected or extracted data sets should each have at least 1000 and at most 10000 entries. If a data set is larger, use an excerpt from the original data set. Justify the extraction criteria in detail in the markdown cell below and
  1) add the code used for the extraction in the code cell
  2) make the extracted dataset available at a downloadable URL (for instance in a GitHub repository; [here](https://raw.githubusercontent.com/AxelPolleres/simple_dataset_sharing_repo/main/test.csv) is an example)
  3) name the new `resourceURL` in the data citation.
* You may start from (but you are not limited to) the resource collections hinted at [in the Unit 2 slides](https://datascience.ai.wu.ac.at/ws21/dataprocessing1/unit2.html#slide-53).

* Important: The use of datasets from kaggle.com and other curated collections (as highlighted to you in Unit 2) of datasets with accompanying tutorials on processing and analysis is discouraged. You are required to use primary data sources. See the policy on kaggle.com & friends at this assignment's submission site at MyLearn.

* Please adhere to the CoC. It is advisable to do so from the start, while you are working on the assignments.


[Data citations](http://blogs.nature.com/scientificdata/2016/07/14/data-citations-at-scientific-data/) must contain the following details:
- creator: provider organisation / author(s) of the data set, e.g. "Zentralanstalt für Meteorologie und Geodynamik (ZAMG)"
- catalogName: name of the data repository and/or the Open Data portal used, e.g. "Open Data Österreich"
- catalogURL: URL of the repository / portal, e.g. "https://www.data.gv.at/"
- datasetID: (specific to the data repository), e.g. "https://www.data.gv.at/katalog/dataset/zamg_meteorologischemessdatenderzamg"
- resourceURL: a URL where the CSV, XML or JSON file can be downloaded, e.g. "https://www.football-data.co.uk/new/JPN.csv"
- pubYear: dataset publication year, i.e. the year since which it has been published, e.g. "2012"
- lastAccessed: when you last accessed the dataset (i.e. the datetime at which you obtained a copy of the data set), in ISO format, e.g. "2021-03-08T13:55:00"
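
A timestamp in the ISO format shown for `lastAccessed` can be produced with the standard library. The following is only a small sketch; the helper name `iso_timestamp` is hypothetical and not part of the assignment template.

```python
from datetime import datetime


def iso_timestamp(moment=None):
    """Return `moment` (default: now, local time) as 'YYYY-MM-DDTHH:MM:SS'."""
    if moment is None:
        moment = datetime.now()
    # Drop microseconds so the string matches the second-level precision of the example.
    return moment.replace(microsecond=0).isoformat()


# Fixed input, so the result is reproducible:
print(iso_timestamp(datetime(2021, 3, 8, 13, 55, 0)))  # 2021-03-08T13:55:00
```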

Store the data citation in a dictionary for each of the datasets:

In [88]:
import pandas as pd
import os
import csv
import json
from pprint import pprint


# Build the local paths to the two downloaded files in the data/ subfolder.
wd = os.getcwd()
data_path_csv = os.path.join(wd, "data", "School_Attendance_by_Student_Group_and_District__2021-2022.csv")
data_path_json = os.path.join(wd, "data", "a2aq-rsek.json")

# Load the CSV file as a dictionary of columns (column name -> pandas Series).
cd = pd.read_csv(data_path_csv)
dataset1 = cd.to_dict("series")

# Load the JSON file (a list of records) and key each record by its position.
with open(data_path_json, "r") as json_file:
    jd = json.load(json_file)
    dataset2 = dict(enumerate(jd))

# NOTE: the template's `raise NotImplementedError()` is commented out so that the cell can run.
# The citation dictionaries below reuse the names dataset1 and dataset2 and therefore
# overwrite the data loaded above.
#raise NotImplementedError()

#uncomment and fill out:
dataset1= {
    "creator" : "State of Connecticut" ,
    "catalogName" : "Connecticut Open Data" ,
    "catalogURL" : "https://data.ct.gov/" ,
    "datasetID" : "https://data.ct.gov/api/views/t4hx-jd4c/rows.csv" ,
    "resourceURL" : "https://raw.githubusercontent.com/silv741/WU_dp-1/main/School_Attendance_by_Student_Group_and_District__2021-2022.csv"  ,
    "pubYear" : "2022"  ,
    "lastAccessed" : "2023-10-21T09:30:00"  ,
}

dataset2= {
    "creator" : "State of Connecticut" ,
    "catalogName" : "Connecticut Open Data" ,
    "catalogURL" : "https://data.ct.gov/" ,
    "datasetID" : "https://data.ct.gov/resource/a2aq-rsek.json" ,
    "resourceURL" : "https://raw.githubusercontent.com/silv741/WU_dp-1/main/a2aq-rsek.json"  ,
    "pubYear" : "2021"  ,
    "lastAccessed" : "2023-10-21T09:35:00"  ,
}


NotImplementedError: 

In [89]:
from nose.tools import assert_equal, assert_in, assert_true
import traceback
import sys
import os

assert_equal(type(dataset1), dict)
assert_equal(type(dataset2), dict)


Use the following structure for your answer below:

**Data set 1**

*(Describe the source and the general content of the dataset and why you chose it)*

**Data set 2**

*(Describe the source and the general content of the dataset and why you chose it)*

**Project ideas**

*(Describe in your own words, which kind of tasks could be addressed by combining the selected data sets, esp. how the two data sets fit together and what complementary information they contain; **Formulate a question that could be potentially answered by combining data from both datasets;** how could the data sets be combined exactly? 250 words max. BEWARE! In Assignment 5, you will be asked to provide answers to those questions. Make sure that combining the two datasets makes sense from an analytical perspective!)*

YOUR ANSWER HERE

------
## Step 1 - File Access (3 points)

Write a Python function `accessData` that takes the dataset dictionary created in Step 0 as input and returns an extended dictionary including the following additions:

* Write code that accesses the dataset from its `resourceURL` using the Python `requests` package and that:
 * detects whether it is an XML, CSV or JSON file by
     * checking whether the download URL **ends** with the suffix "xml", "json" or "csv"
     * checking whether the "Content-Type" HTTP header field hints at the format, i.e. whether the substring XML, JSON or CSV appears in the "Content-Type" header in either upper- or lowercase.
 * detects the file size (converted to KB) of each data set, clearly documenting your actions (e.g. through commented code).

The result of the code below should extend your dictionaries `dataset1` and `dataset2` with two keys named 
* `"detectedFormat"` (which has one of the following values: `"XML"`, `"JSON"`, `"CSV"`, or `"unknown"`, if nothing could be detected from checking the suffix or HTTP header, or if the information in both was inconsistent)
* and `"filesizeKB"`, which contains the file size in KB, computed from the number of bytes given in the header information (convert according to decimal SI prefixes, i.e. 1 KB = 1000 bytes). If there is no such header information, return 0.
* If the detected format is `"unknown"`, the expected file size to be returned is also 0.
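
The detection rules above can be sketched as pure functions that take the URL and the header values as plain strings, so they can be tried without any network access. This is a sketch under stated assumptions, not a reference solution: the helper names (`format_from_suffix`, `format_from_content_type`, `detect_format`, `filesize_kb`) are hypothetical, and "inconsistent" is interpreted as "suffix and header each name a format, but not the same one".

```python
from typing import Optional

FORMATS = ("XML", "JSON", "CSV")


def format_from_suffix(url: str) -> Optional[str]:
    """Return the format named by the end of the URL, or None."""
    lowered = url.lower()
    for fmt in FORMATS:
        if lowered.endswith(fmt.lower()):
            return fmt
    return None


def format_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the format named in a Content-Type value, or None."""
    if not content_type:
        return None
    upper = content_type.upper()
    hits = [fmt for fmt in FORMATS if fmt in upper]
    # Assumption: a header that names several formats is treated as no hint at all.
    return hits[0] if len(hits) == 1 else None


def detect_format(url: str, content_type: Optional[str]) -> str:
    """Combine both hints; disagreement or no hint at all yields 'unknown'."""
    by_suffix = format_from_suffix(url)
    by_header = format_from_content_type(content_type)
    if by_suffix and by_header:
        return by_suffix if by_suffix == by_header else "unknown"
    return by_suffix or by_header or "unknown"


def filesize_kb(content_length: Optional[str], detected_format: str) -> float:
    """Convert a Content-Length value (bytes) to KB with 1 KB = 1000 bytes."""
    if detected_format == "unknown" or content_length is None:
        return 0
    try:
        return int(content_length) / 1000
    except ValueError:
        return 0


print(detect_format("https://example.org/data.json", "text/plain"))  # JSON
print(filesize_kb("2500", "CSV"))  # 2.5
```

With `requests`, the two header values would typically be read from the response via `response.headers.get("Content-Type")` and `response.headers.get("Content-Length")`. Some servers omit `Content-Length`, which is the case the `None` branch covers.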

In [4]:
# YOUR CODE HERE 
import requests

def accessData(datadict):
    # YOUR CODE HERE
    raise NotImplementedError()
    return datadict

In [5]:
from nose.tools import assert_equal, assert_in, assert_true
dataset1= accessData(dataset1)
dataset2= accessData(dataset2)
assert_in(dataset1["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_in(dataset2["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_true(isinstance(dataset1["filesizeKB"], (int, float)))
assert_true(isinstance(dataset2["filesizeKB"], (int, float)))

NotImplementedError: 

In [None]:
# There are tests hiding here, please do not delete this cell...

In [None]:
# There are tests hiding here, please do not delete this cell...

In [None]:
# There are tests hiding here, please do not delete this cell...

In [None]:
# There are tests hiding here, please do not delete this cell...

Please explain your findings, using the following structure for your answer below (in "other remarks" you can explain, for instance, why you think your code did not detect the correct format, if needed)

**Data set 1**

*(format, size, other remarks)*


**Data set 2**

*(format, size, other remarks)*


YOUR ANSWER HERE

-----
## Step 2 - Format Validation (5 points)

Establish that the two data files obtained are well-formed according to the detected data format (CSV, JSON, or XML). That is, the syntax used is valid according to accepted syntax definitions. Are there any violations of well-formedness?


Proceed as follows for each data file in turn, according to the "suspected" data format from Step 1:

  1. Use an _online validator_ for CSV, XML, and JSON, respectively, to confirm whether the files you downloaded in Step 1 are well-formed for the respective file format, document your findings and modify the file as described: 

   a. **Case 1**: no well-formedness errors were detected: 
    * Describe in general terms at least 3 well-formedness checks that your data set, depending on its "suspected" format, should fulfill (against the background knowledge of Unit 2);
    * Store a local copy of the file called `data_notebook-[notebook-nr.]_[name].[file extension]` in the `data/` subfolder
    * Create another local copy of your data file called `data_notebook-[notebook-nr.]_[name]-invalid.[file extension]` and introduce a selected well-formedness violation (one occurrence) therein;
    * document that the online validator you used finds the error you introduced

   b. **Case 2**: well-formedness errors occurred:
    * Document the occurrences by printing out the error message and describe the types of well-formedness violation that were reported to you.
    * Store a local copy called `data_notebook-[notebook-nr.]_[name]-invalid.[file extension]` in the `data/` subfolder
    * Create another local copy of your data file called `data_notebook-[notebook-nr.]_[name].[file extension]`, in which you manually fix the well-formedness violations.
    
**Please note that the datasets in the `data/` subfolder are for documentation only. Do not access those for subsequent steps!**
    

  2. Write a Python function `parseFile(datadict, format)` that accesses the dataset from its `resourceURL`. The dataset should then be checked with the parser that corresponds to the parameter `format`:
     * CSV: Returns `True` if a consistent delimiter out of `",",";","\t"` can be detected, such that each row has the same (> 1) number of elements, otherwise `False`.
     * JSON: Returns `True` if the file can be parsed with the `json` package, catching any parsing exceptions.
     * XML: Returns `True` if the file can be parsed with the `xmltodict` package, catching any parsing exceptions.
     * Returns `False` if any other format is supplied by the parameter.
     
Even if you do not have an XML or JSON file in your datasets, make sure to implement all checks (CSV, JSON and XML)!
     
In order to handle parsing exceptions and errors raised by the packages used, you can [catch exceptions](https://docs.python.org/3/tutorial/errors.html), so that the program does not simply fail when the file is not parseable in the format specified in `format`.
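
The CSV and JSON checks described above can be sketched on text that has already been downloaded, which keeps the example free of network access. This is a sketch, not a reference solution: the helper names `has_consistent_delimiter` and `is_valid_json` are hypothetical, and skipping completely empty lines is an assumption. An XML check would follow the same try/except pattern around the `xmltodict` parser.

```python
import csv
import io
import json


def has_consistent_delimiter(text, delimiters=(",", ";", "\t")):
    """True if some delimiter splits every row into the same number (> 1) of cells."""
    for delimiter in delimiters:
        try:
            reader = csv.reader(io.StringIO(text), delimiter=delimiter)
            # Assumption: blank lines (parsed as empty rows) are ignored.
            widths = {len(row) for row in reader if row}
        except csv.Error:
            continue
        if len(widths) == 1 and widths.pop() > 1:
            return True
    return False


def is_valid_json(text):
    """True if the text parses as JSON, False on any parsing error."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(has_consistent_delimiter("a;b\n1;2\n"))   # True
print(has_consistent_delimiter("a,b\n1,2,3\n"))  # False
print(is_valid_json('{"a": [1, 2}'))             # False
```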

Use the following structure for your answer in the cell below to document **Step 2.1**:

***Data set 1***

*(validator used, validation results, describe the modification to fix the file or to create an invalid version of it)*

***Data set 2***

*(validator used, validation results, describe the modification to fix the file or to create an invalid version of it)*


YOUR ANSWER HERE

In [None]:
import requests
import csv
import json
import xmltodict

def parseFile(datadict, format):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal([parseFile(dataset1, "XML"),
    parseFile(dataset1, "JSON"),
    parseFile(dataset1, "CSV"),
    parseFile(dataset2, "XML"),
    parseFile(dataset2, "JSON"),
    parseFile(dataset2, "CSV")].count(True), 2)

In [None]:
# There are tests hiding here, please do not delete this cell...

-----
## Step 3 - Content analysis (5 points)

Similar to the Python function `parseFile(datadict,format)` above, now create a new Python function `describeFile(datadict)` that analyses the given file according to the respective format detected in Step 1 and returns a dictionary containing the following information:

* for CSV files: number of columns, number of rows, and the index pair (row,column) of the longest string occurring in the dataset, with rows and columns assumed to be numbered from 0 to n. If there are no strings in the CSV file return `0`. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfColumns": ... ,
      "numberOfRows": ... ,
      "longestStringIndex": ... }
    ```
    
Example output for a CSV file with the longest string in the 9th row and 2nd column:

    ```
    { "numberOfColumns": 4,
      "numberOfRows": 10,
      "longestStringIndex": (8,1) }
    ```
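
The CSV description can be sketched on already-downloaded text as follows. This is a sketch under several assumptions that the task text leaves open: the helper names are hypothetical, the keys are assumed to be `numberOfColumns`, `numberOfRows` and `longestStringIndex`, the header row is counted as row 0, a "string" is any non-empty cell that `float()` cannot parse, and the first occurrence wins on ties.

```python
import csv
import io


def is_number(cell):
    """True if the cell text can be read as a number by float()."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def describe_csv_text(text, delimiter=","):
    """Count columns and rows and locate the longest non-numeric cell."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    longest_index = 0  # stays 0 if the file contains no strings
    longest_length = 0
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell and not is_number(cell) and len(cell) > longest_length:
                longest_length = len(cell)
                longest_index = (r, c)
    return {
        "numberOfColumns": len(rows[0]) if rows else 0,
        "numberOfRows": len(rows),
        "longestStringIndex": longest_index,
    }


print(describe_csv_text("id,name\n1,Al\n2,Barbara\n"))
# {'numberOfColumns': 2, 'numberOfRows': 3, 'longestStringIndex': (2, 1)}
```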


* for JSON files: number of different attribute names, nesting depth, length of the longest list appearing in an attribute value. That is, the resulting dictionary should have the following form:
    ```
    { "numberOfAttributes": ... ,
      "nestingDepth": ... ,
      "longestListLength": ... }
    ```
    
Here the longestListLength should be set to 0 if no list appears. Nesting depth is defined as follows:

* A flat JSON object with only atomic attribute values has depth 1.
* A JSON attribute with another object as value (or another object as member of a list value!) increases the depth by 1,
* and so on.


* for XML files: number of different element and attribute names (i.e. the sum of both), nesting depth, and the maximum numeric value occurring as a leaf node or attribute value in the dataset (at any nesting depth). If there is no numeric value in the XML file, return `0`. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfElementsAttributes": ... ,
      "nestingDepth": ... ,
      "maxNumericValue": ... }
     ```
Hint: You should be able to use the same function for both JSON and XML files by first converting the JSON or XML to a dictionary (using the `json` or `xmltodict` libraries).
  
For files that cannot be parsed in the respective format, the function should simply return an empty dictionary (`{}`).
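
The JSON measures described above lend themselves to a recursive walk over the parsed structure. The following is a minimal sketch, not a reference solution. The helper names are hypothetical, and it assumes that an atomic value has depth 0, that lists do not add depth by themselves (so a top-level list of flat objects has depth 1), and that every list, including a top-level one, counts towards the longest list length. The same walk should apply to a dictionary obtained from XML, provided the parser returns nested dictionaries and lists.

```python
def attribute_names(value):
    """Collect the distinct attribute (key) names at any nesting level."""
    names = set()
    if isinstance(value, dict):
        for key, child in value.items():
            names.add(key)
            names |= attribute_names(child)
    elif isinstance(value, list):
        for child in value:
            names |= attribute_names(child)
    return names


def nesting_depth(value):
    """Depth 1 for a flat object; each nested object adds 1."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((nesting_depth(v) for v in value), default=0)
    return 0


def longest_list_length(value):
    """Length of the longest list found anywhere, 0 if there is none."""
    if isinstance(value, dict):
        return max((longest_list_length(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max([len(value)] + [longest_list_length(v) for v in value])
    return 0


example = {"a": 1, "b": {"c": [1, 2, 3], "a": 2}, "d": [{"e": 1}]}
print(len(attribute_names(example)))  # 5
print(nesting_depth(example))         # 2
print(longest_list_length(example))   # 3
```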

In [None]:
import codecs

def describeFile(datadict):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal(len(describeFile(dataset1)), 3)
assert_equal(len(describeFile(dataset2)), 3)

In [None]:
# There are tests hiding here, please do not delete this cell...

Use the following structure for your answer below:

**Data set 1**

*(number and types of items etc.)*


**Data set 2**

*(number and types of items etc.)*

YOUR ANSWER HERE