# Spark Network Services GmbH
### Data Warehouse Challenge 2021: Junior Data Engineer

##### The content of this document is confidential and shall not be shared with anyone other than the candidate and Spark employees

We are super excited to get to know you. To assess your skillset, we have an ELT/ETL challenge for
you. The assessment of Python (HTTP Request, JSON handler, database integration), SQL (DML/DDL)
and git is compulsory.

You are required to deliver your git project with instructions on how to reproduce it.

Background context:
One of our Product Owners came to us asking to create a new pipeline, and his team is looking
forward to understanding this data. They don’t have any documentation apart from these 2 endpoints:
https://619ca0ea68ebaa001753c9b0.mockapi.io/evaluation/dataengineer/jr/v1/users
https://619ca0ea68ebaa001753c9b0.mockapi.io/evaluation/dataengineer/jr/v1/messages

##### The requirement for you is:

1) Collect all the data just once from these endpoints and create/populate the tables (user,
subscription and message).

2) Product Owners do intend to produce metrics based on date, age, city, country, email
domain, gender, smoking condition, income, subscriptions and messages. It is your
responsibility to propose how to model the tables, columns and relationships. PII
handling should be considered, so that no sensitive data can be accessed by the final
users.

3) The Product Owner asked you to provide the queries for some scenarios. Please
add a file sql_test.sql in your project with the queries that answer the
questions below:

- How many total messages are being sent every day?

- Are there any users that did not receive any message?

- How many active subscriptions do we have today?

- What is the average ticket price (sum of subscription amounts / count of subscriptions), broken down by year/month (format YYYY-MM)?

##### Privacy Requirements:
We need to be GDPR compliant, so we are very concerned about data privacy. It is important that
no sensitive user information is exposed.
Imported fields must be privacy-protected in the following way:

- Remove all PII.

- Where anonymization is required, the type of anonymization is up to you.

- Emails: it is mandatory to discard the username part on import and to keep only the
domain (e.g. mickey.mouse@disney.com => disney.com).

- Other fields: it is up to you to decide whether they are relevant to the task.

- Do not import the chat messages; this is extremely sensitive information.
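
The rules above could be applied record by record before anything is written to the database. The following is a minimal sketch, assuming the raw user records have the shape returned by the users endpoint. The names `email_domain` and `anonymize_user` and the output column names are hypothetical, and the choice of fields to keep is one possible reading of the requirements rather than a prescribed one.

```python
def email_domain(email):
    """Return the lower-cased domain of an e-mail address, or None if it has no '@'."""
    _, sep, domain = (email or "").rpartition("@")
    return domain.lower() if sep and domain else None


def anonymize_user(record):
    """Build a new dict from an allow-list of fields (assumed non-identifying)."""
    profile = record.get("profile") or {}
    return {
        "user_id": record["id"],
        "created_at": record.get("createdAt"),
        "city": record.get("city"),
        "country": record.get("country"),
        "email_domain": email_domain(record.get("email")),
        "gender": profile.get("gender"),
        "is_smoking": profile.get("isSmoking"),
        "income": profile.get("income"),
    }


# Example record trimmed from the users endpoint (fake data).
raw = {
    "id": "1",
    "createdAt": "2021-11-23T16:10:33.614Z",
    "firstName": "Levi",
    "lastName": "Bins",
    "city": "Pembroke Pines",
    "country": "United States",
    "email": "Kianna_Nicolas@hotmail.com",
    "profile": {"gender": "male", "isSmoking": True, "income": "3709.61"},
}
clean = anonymize_user(raw)
print(clean["email_domain"])  # hotmail.com
```

Building the output from an allow-list means that a field added upstream is dropped by default instead of being imported by accident. The birth date is left out of this sketch; deriving an age from it is a separate decision.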

##### Important to know:
For this test, any data shown here doesn’t represent the official data model from Spark Networks
or expose any real data from our customers. All the data are fake and generated by
https://mockapi.io/.

#### Thank you very much for investing your time to solve this challenge!

Starting the Project

## Data Extraction and Visualization

In [13]:
import requests

try:
    responseUser = requests.get('https://619ca0ea68ebaa001753c9b0.mockapi.io/evaluation/dataengineer/jr/v1/users')
    # Raise requests.exceptions.HTTPError on 4xx/5xx responses; requests.get alone does not
    responseUser.raise_for_status()
    jsonRespUser = responseUser.json()
    print(f'Entire JSON response of the Users call: \n{jsonRespUser}')

except requests.exceptions.HTTPError as httpError:
    print(f'HTTP error occurred: {httpError}.')
except Exception as err:
    print(f'Other error occurred: {err}.')

Entire JSON response of the Users call: 
[{'createdAt': '2021-11-23T16:10:33.614Z', 'updatedAt': '2021-11-24T13:34:15.404Z', 'firstName': 'Levi', 'lastName': 'Bins', 'address': 'Ernestine Shore', 'city': 'Pembroke Pines', 'country': 'United States', 'zipCode': '05734', 'email': 'Kianna_Nicolas@hotmail.com', 'birthDate': '2020-12-16T02:41:21.036Z', 'profile': {'gender': 'male', 'isSmoking': True, 'profession': 'Central Configuration Planner', 'income': '3709.61'}, 'subscription': [{'createdAt': '2021-11-24T16:58:46.581Z', 'startDate': '2021-11-24T05:12:49.301Z', 'endDate': '2022-09-15T06:05:59.630Z', 'status': 'Active', 'amount': '43.18'}], 'id': '1'}, {'createdAt': '2021-11-23T02:40:10.964Z', 'updatedAt': '2021-11-24T16:04:42.393Z', 'firstName': 'Aric', 'lastName': 'Shields', 'address': 'Sporer Field', 'city': 'Meganemouth', 'country': 'Namibia', 'zipCode': '59236', 'email': 'Madelynn.Ruecker27@gmail.com', 'birthDate': '2021-10-08T12:12:49.168Z', 'profile': {'gender': 'male', 'isSmokin

In [12]:
try:
    responseMessages = requests.get('https://619ca0ea68ebaa001753c9b0.mockapi.io/evaluation/dataengineer/jr/v1/messages')
    # Raise requests.exceptions.HTTPError on 4xx/5xx responses; requests.get alone does not
    responseMessages.raise_for_status()
    jsonRespMessages = responseMessages.json()
    print(f'Entire JSON response from the Messages call: \n{jsonRespMessages}')

except requests.exceptions.HTTPError as httpError:
    print(f'HTTP error occurred: {httpError}.')
except Exception as err:
    print(f'Other error occurred: {err}.')

Entire JSON response from the Messages call: 
[{'createdAt': '2021-11-25T12:18:57.208Z', 'message': 'Tempora aspernatur quaerat cumque necessitatibus.', 'receiverId': '2', 'id': '1', 'senderId': '1'}, {'createdAt': '2021-11-25T15:26:33.436Z', 'message': 'Harum beatae explicabo.', 'receiverId': '3', 'id': '2', 'senderId': '2'}, {'createdAt': '2021-11-25T21:55:29.995Z', 'message': 'Dolor amet et molestiae quaerat rerum minus iste enim odio.', 'receiverId': '1', 'id': '3', 'senderId': '3'}, {'createdAt': '2021-11-26T03:09:45.900Z', 'message': 'Corrupti et eos omnis eveniet vitae pariatur error quo deserunt.', 'receiverId': '1', 'id': '4', 'senderId': '4'}, {'createdAt': '2021-11-26T09:15:42.912Z', 'message': 'Quo distinctio sint blanditiis.', 'receiverId': '1', 'id': '5', 'senderId': '5'}, {'createdAt': '2021-11-27T06:42:02.172Z', 'message': 'Eligendi cum impedit.', 'receiverId': '2', 'id': '6', 'senderId': '6'}, {'createdAt': '2021-11-27T06:38:39.424Z', 'message': 'Omnis itaque vel archi

In [25]:
# Use pandas for better visualization
import pandas as pd
# First test: flatten the users JSON (nested profile fields become profile.* columns)
df_users = pd.json_normalize(jsonRespUser)
df_users

|  | createdAt | updatedAt | firstName | lastName | address | city | country | zipCode | email | birthDate | subscription | id | profile.gender | profile.isSmoking | profile.profession | profile.income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-11-23T16:10:33.614Z | 2021-11-24T13:34:15.404Z | Levi | Bins | Ernestine Shore | Pembroke Pines | United States | 05734 | Kianna_Nicolas@hotmail.com | 2020-12-16T02:41:21.036Z | [{'createdAt': '2021-11-24T16:58:46.581Z', 'st... | 1 | male | True | Central Configuration Planner | 3709.61 |
| 1 | 2021-11-23T02:40:10.964Z | 2021-11-24T16:04:42.393Z | Aric | Shields | Sporer Field | Meganemouth | Namibia | 59236 | Madelynn.Ruecker27@gmail.com | 2021-10-08T12:12:49.168Z | [{'createdAt': '2021-11-24T14:36:18.895Z', 'st... | 2 | male | True | Corporate Tactics Strategist | 1504.25 |
| 2 | 2021-11-23T06:26:53.843Z | 2021-11-24T06:37:01.117Z | Norene | Lockman | Ida Villages | Port Cary | Bulgaria | 83202-2695 | Mia_Kling33@gmail.com | 2021-01-13T10:11:14.643Z | [{'createdAt': '2021-11-22T23:41:32.927Z', 'st... | 3 | female | False | Senior Quality Manager | 3256.41 |
| 3 | 2021-11-23T03:27:56.458Z | 2021-11-24T17:00:21.524Z | Dariana | Bradtke | Mallie Mission | South Christophe | United States | 70183 | Chance_Mertz@gmail.com | 2021-09-22T20:53:42.528Z | [{'createdAt': '2021-11-23T05:23:29.452Z', 'st... | 4 | male | True | Internal Division Agent | 758.89 |
| 4 | 2021-11-23T14:57:27.793Z | 2021-11-24T05:23:38.587Z | Gabriella | Rohan | Heaney Cove | East Darrionhaven | Iceland | 98347 | Arely_Terry11@gmail.com | 2021-07-23T02:31:51.232Z | [{'createdAt': '2021-11-24T14:40:24.257Z', 'st... | 5 | female | False | Central Paradigm Agent | 2658.19 |
| 5 | 2021-11-22T19:14:23.721Z | 2021-11-24T02:56:07.833Z | Wilhelm | Barrows | Stoltenberg Ranch |  | United States |  | Kathryn47@yahoo.com | 2021-09-29T06:05:18.013Z | [] | 6 |  | True | Customer Security Producer |  |
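
Age is one of the requested metrics, while the full `birthDate` is more precise than a metric needs. Below is a minimal sketch of deriving a whole-year age at import time, assuming every timestamp follows the `YYYY-MM-DDTHH:MM:SS.fffZ` form seen in the output above. The function name `age_in_years` and the explicit `today` parameter are hypothetical choices for illustration.

```python
from datetime import date, datetime


def age_in_years(birth_date_iso, today):
    """Whole years between an ISO-8601 'Z' timestamp and a reference date."""
    born = datetime.strptime(birth_date_iso, "%Y-%m-%dT%H:%M:%S.%fZ").date()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


# birthDate of the first user in the sample, evaluated on two reference dates.
print(age_in_years("2020-12-16T02:41:21.036Z", date(2021, 11, 25)))  # 0
print(age_in_years("2020-12-16T02:41:21.036Z", date(2021, 12, 16)))  # 1
```

Passing `today` explicitly keeps the function deterministic and easy to test. In the rows shown, all birth dates fall in 2020 or 2021, so the derived ages come out as 0 or 1, which is consistent with the data being generated.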


In [26]:
df_subscriptions = pd.json_normalize(jsonRespUser, meta=['lastName', 'id'], record_path=['subscription'])
df_subscriptions

|  | createdAt | startDate | endDate | status | amount | lastName | id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-11-24T16:58:46.581Z | 2021-11-24T05:12:49.301Z | 2022-09-15T06:05:59.630Z | Active | 43.18 | Bins | 1 |
| 1 | 2021-11-24T14:36:18.895Z | 2021-11-24T12:57:48.724Z | 2022-07-13T09:14:04.001Z | Active | 23.78 | Shields | 2 |
| 2 | 2021-11-22T23:41:32.927Z | 2021-11-23T14:42:04.416Z | 2022-07-26T17:06:45.413Z | Rejected | 64.75 | Lockman | 3 |
| 3 | 2021-11-23T18:57:20.540Z | 2021-11-24T18:04:41.908Z | 2022-03-03T09:47:26.916Z | Active | 88.6 | Lockman | 3 |
| 4 | 2021-11-23T05:23:29.452Z | 2021-11-23T09:24:30.685Z | 2022-03-25T10:14:15.548Z | Rejected | 15.98 | Bradtke | 4 |
| 5 | 2021-11-24T02:07:08.482Z | 2021-11-24T12:47:33.246Z | 2021-12-10T20:22:36.132Z | Inactive | 3.51 | Bradtke | 4 |
| 6 | 2021-11-24T14:40:24.257Z | 2021-11-24T11:22:33.265Z | 2022-11-23T12:41:28.319Z | Active | 89.71 | Rohan | 5 |
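
The call above carries `lastName` into the subscriptions table as metadata, which counts as PII under the privacy requirements. Below is a minimal sketch of the same flattening in plain Python that links each subscription to its user by `id` only. The function name `flatten_subscriptions` and the snake_case column names are hypothetical, and parsing `amount` as `Decimal` is an assumption about how prices should be stored.

```python
from decimal import Decimal


def flatten_subscriptions(users):
    """Return one row per subscription, keyed to its user by id only."""
    rows = []
    for user in users:
        # Users without subscriptions carry an empty list and yield no rows.
        for sub in user.get("subscription") or []:
            rows.append({
                "user_id": user["id"],
                "created_at": sub["createdAt"],
                "start_date": sub["startDate"],
                "end_date": sub["endDate"],
                "status": sub["status"],
                "amount": Decimal(sub["amount"]),
            })
    return rows


# Two records trimmed from the users response (fake data).
sample_users = [
    {"id": "1", "lastName": "Bins", "subscription": [{
        "createdAt": "2021-11-24T16:58:46.581Z",
        "startDate": "2021-11-24T05:12:49.301Z",
        "endDate": "2022-09-15T06:05:59.630Z",
        "status": "Active",
        "amount": "43.18",
    }]},
    {"id": "6", "lastName": "Barrows", "subscription": []},
]
subscription_rows = flatten_subscriptions(sample_users)
print(len(subscription_rows))  # 1
```

With pandas, a similar result might be obtained by passing only `meta=['id']` to `pd.json_normalize`, so that the name never reaches the subscriptions frame.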


In [28]:
pd.json_normalize(jsonRespMessages)

|  | createdAt | message | receiverId | id | senderId |
| --- | --- | --- | --- | --- | --- |
| 0 | 2021-11-25T12:18:57.208Z | Tempora aspernatur quaerat cumque necessitatibus. | 2 | 1 | 1 |
| 1 | 2021-11-25T15:26:33.436Z | Harum beatae explicabo. | 3 | 2 | 2 |
| 2 | 2021-11-25T21:55:29.995Z | Dolor amet et molestiae quaerat rerum minus is... | 1 | 3 | 3 |
| 3 | 2021-11-26T03:09:45.900Z | Corrupti et eos omnis eveniet vitae pariatur e... | 1 | 4 | 4 |
| 4 | 2021-11-26T09:15:42.912Z | Quo distinctio sint blanditiis. | 1 | 5 | 5 |
| 5 | 2021-11-27T06:42:02.172Z | Eligendi cum impedit. | 2 | 6 | 6 |
| 6 | 2021-11-27T06:38:39.424Z | Omnis itaque vel architecto incidunt eaque. | 5 | 7 | 1 |
| 7 | 2021-11-27T07:16:37.817Z | Impedit eum distinctio minus nihil nisi. | 6 | 8 | 2 |
| 8 | 2021-11-27T12:38:30.049Z | Voluptatem occaecati consequatur pariatur quid... | 1 | 9 | 3 |
| 9 | 2021-11-27T21:11:57.106Z | Aut ut possimus fugiat atque. | 3 | 10 | 1 |
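
Once the `message` text is dropped, as the privacy requirements demand, the remaining columns are enough to answer the first two questions. Below is a minimal sketch using Python's built-in `sqlite3` with an in-memory database. The table names `users` and `messages`, the snake_case column names, and the assumption that user ids run from 1 to 6 are illustrative only; the rows are the ten sample rows shown above without the message text. Days are taken from the UTC timestamps as stored.

```python
import sqlite3

# (id, created_at, sender_id, receiver_id) for the ten sample rows, text omitted.
sample_messages = [
    ("1", "2021-11-25T12:18:57.208Z", "1", "2"),
    ("2", "2021-11-25T15:26:33.436Z", "2", "3"),
    ("3", "2021-11-25T21:55:29.995Z", "3", "1"),
    ("4", "2021-11-26T03:09:45.900Z", "4", "1"),
    ("5", "2021-11-26T09:15:42.912Z", "5", "1"),
    ("6", "2021-11-27T06:42:02.172Z", "6", "2"),
    ("7", "2021-11-27T06:38:39.424Z", "1", "5"),
    ("8", "2021-11-27T07:16:37.817Z", "2", "6"),
    ("9", "2021-11-27T12:38:30.049Z", "3", "1"),
    ("10", "2021-11-27T21:11:57.106Z", "1", "3"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY);
    CREATE TABLE messages (
        id TEXT PRIMARY KEY,
        created_at TEXT NOT NULL,
        sender_id TEXT REFERENCES users(id),
        receiver_id TEXT REFERENCES users(id)
    );
""")
conn.executemany("INSERT INTO users VALUES (?)", [(str(i),) for i in range(1, 7)])
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", sample_messages)

# Messages per day: the first 10 characters of the ISO timestamp are the date.
per_day = conn.execute("""
    SELECT substr(created_at, 1, 10) AS day, COUNT(*) AS total
    FROM messages
    GROUP BY day
    ORDER BY day
""").fetchall()

# Users with no received message: left join and keep the unmatched users.
no_received = conn.execute("""
    SELECT u.id
    FROM users AS u
    LEFT JOIN messages AS m ON m.receiver_id = u.id
    WHERE m.id IS NULL
    ORDER BY u.id
""").fetchall()

print(per_day)      # [('2021-11-25', 3), ('2021-11-26', 2), ('2021-11-27', 5)]
print(no_received)  # [('4',)]
```

The same two statements could be adapted for `sql_test.sql`, with the date extraction rewritten for whichever database the project actually targets.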
