# Project #4

## Kristjan Lõhmus, Rimmo Rõõm

Description: \
We want to build a system to manage the internship locations for the Professional Training Center. To
this end, we held a discussion with the management committee to gather their
requirements. The outcomes of this meeting are:
* A city is located in a specific region (or state, for federal countries such as the USA), country
and continent. A city can therefore be identified by this triplet (e.g. Tartu linn, Estonia, Europe).
According to the committee, a city is associated with a single region, a single country and a
single continent. These three pieces of data are mandatory for each city in the database.
* An organization (for example a large company, a university or a research center) is structured into
services (which may be called department, division, laboratory, etc.). A
service is characterized by an address, which is the city in which the service is located.
* Student supervisors (service employees) are characterized by their name, their contact
details and a list of keywords that define their sector of activity, and they are linked to the service in
which they work.

## Task 1. Modelling

### 1. CDM model
Here is a high-level CDM, suitable for presenting to C-level management: \
![High-level CDM model](./img/high_level_cdm_model.jpg)

And here is the low-level one, intended for a more technical audience: \
![Low-level CDM model](./img/low_level_cdm_model.jpg)


### 2. RDM

The RDM closely follows the low-level CDM, with foreign keys added. In addition, Service gets an extra `Address` field, which references the city in which the service is located.

### 3. Document Data model

#### Continents Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the continent)

#### Countries Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the country)
  - `continent_id`: ObjectID (reference to the continent document)

#### Regions Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the region)
  - `country_id`: ObjectID (reference to the country document)

#### Cities Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the city)
  - `region_id`: ObjectID (reference to the region document)

#### Organizations Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the organization)

#### Services Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the service)
  - `address`: Object (address details, such as city_id)
    - `city_id`: ObjectID (reference to the city document)
  - `organization_id`: ObjectID (reference to the organization document)

#### Supervisors Collection:
- Key-Value Pairs:
  - `_id`: ObjectID (automatically generated unique identifier)
  - `name`: String (name of the supervisor)
  - `contact`: Object (contact details)
    - `email`: String (email address)
    - `phone`: String (phone number)
  - `keywords`: Array of Strings (keywords defining supervisor's expertise)
  - `service_id`: ObjectID (reference to the service document)
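
To make the reference structure concrete, here is a minimal sketch of sample documents along the path from a supervisor up to a continent, using values from the sample data set inserted later. The placeholder string IDs and the `sample_*` and `location_triplet` names are illustrative assumptions; in MongoDB the `_id` fields would be ObjectIDs.

```python
# Placeholder string IDs stand in for ObjectIDs (an assumption, so that the
# sketch runs without a MongoDB driver). Each dict maps _id -> document.
sample_continents = {"cont1": {"_id": "cont1", "name": "Europe"}}
sample_countries = {
    "ctry1": {"_id": "ctry1", "name": "France", "continent_id": "cont1"}
}
sample_regions = {"reg1": {"_id": "reg1", "name": "Paris", "country_id": "ctry1"}}
sample_cities = {"city1": {"_id": "city1", "name": "Paris", "region_id": "reg1"}}

sample_service = {
    "_id": "svc1",
    "name": "Department of Computer Science",
    "address": {"city_id": "city1"},
    "organization_id": "org1",
}
sample_supervisor = {
    "_id": "sup1",
    "name": "John Doe",
    "contact": {"email": "john@example.com", "phone": "+123456789"},
    "keywords": ["Software Development"],
    "service_id": "svc1",
}


def location_triplet(city_id):
    """Follow the references city -> region -> country -> continent."""
    region = sample_regions[sample_cities[city_id]["region_id"]]
    country = sample_countries[region["country_id"]]
    continent = sample_continents[country["continent_id"]]
    return region["name"], country["name"], continent["name"]


triplet = location_triplet(sample_service["address"]["city_id"])
print(triplet)
```

Because every level is stored by reference, resolving the (region, country, continent) triplet of a service takes one hop per collection, which is what the `$lookup` chain in the querying part has to reproduce.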

## Task 2. Implementation

### 1. Implement each structure on a native data engine

#### PostgreSQL

In [20]:
import psycopg2
import pandas.io.sql as sqlio
import pandas as pd
import warnings
import time
warnings.filterwarnings('ignore')

In [3]:
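# Connect to the local PostgreSQL server with the default "postgres" account
conn = psycopg2.connect(
    host='localhost',
    password="postgres",
    user="postgres",
    port=5432,
)
conn.autocommit = True  # commit every statement immediately
cursor = conn.cursor()
# IF NOT EXISTS keeps this cell re-runnable
cursor.execute('CREATE SCHEMA IF NOT EXISTS training_centre;')
cursor.execute('SET search_path = "training_centre";')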

In [4]:
cursor.execute('''
CREATE TABLE Continent (
    ContinentID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL
);
''')

In [5]:
cursor.execute('''
CREATE TABLE Country (
    CountryID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL,
    ContinentID INT NOT NULL,
    FOREIGN KEY (ContinentID) REFERENCES Continent(ContinentID)
);
''')

In [6]:
cursor.execute('''
CREATE TABLE Region (
    RegionID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL,
    CountryID INT NOT NULL,
    FOREIGN KEY (CountryID) REFERENCES Country(CountryID)
);
''')

In [7]:
cursor.execute('''
CREATE TABLE City (
    CityID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL,
    RegionID INT NOT NULL,
    FOREIGN KEY (RegionID) REFERENCES Region(RegionID)
);
''')

In [8]:
cursor.execute('''
CREATE TABLE Organization (
    OrganizationID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL
);
''')

In [9]:
cursor.execute('''
CREATE TABLE Service (
    ServiceID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL,
    Address INT NOT NULL,
    OrganizationID INT NOT NULL,
    FOREIGN KEY (Address) REFERENCES City(CityID),
    FOREIGN KEY (OrganizationID) REFERENCES Organization(OrganizationID)
);
''')

In [10]:
cursor.execute('''
CREATE TABLE Supervisor (
    SupervisorID SERIAL PRIMARY KEY,
    Name VARCHAR(255) NOT NULL,
    ContactDetails TEXT NOT NULL,
    Keywords TEXT NOT NULL,
    ServiceID INT NOT NULL,
    FOREIGN KEY (ServiceID) REFERENCES Service(ServiceID)
);
''')

#### MongoDB

In [11]:
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['training_centre']

# Drop existing collections if they exist
db.continents.drop()
db.countries.drop()
db.regions.drop()
db.cities.drop()
db.organizations.drop()
db.services.drop()
db.supervisors.drop()

# Create collections
continents = db['continents']
countries = db['countries']
regions = db['regions']
cities = db['cities']
organizations = db['organizations']
services = db['services']
supervisors = db['supervisors']

### 2. Implement the uniqueness constraint on the fields continent name and organization name in both systems.

#### PostgreSQL

In [11]:
cursor.execute('''
ALTER TABLE Continent
ADD CONSTRAINT unique_continent_name UNIQUE (Name);
''')

In [12]:
cursor.execute('''
ALTER TABLE Organization
ADD CONSTRAINT unique_organization_name UNIQUE (Name);
''')

#### MongoDB

In [12]:
from pymongo import errors

# Create unique indexes
try:
    continents.create_index([("name", 1)], unique=True)
    organizations.create_index([("name", 1)], unique=True)
    print("Unique indexes created successfully.")
except errors.DuplicateKeyError as e:
    print(f"Error creating unique index: {e}")

Unique indexes created successfully.


### 3. Populate your database with at least the following cardinalities [10 organizations, 5 Services/per organization (randomly assigned to different continents)]

#### PostgreSQL

In [13]:
cursor.execute('''
INSERT INTO Continent (Name) VALUES ('Europe'), ('Asia'), ('Africa'), ('North America'), ('South America');
''')

In [14]:
cursor.execute('''
INSERT INTO Country (Name, ContinentID) VALUES 
('France', 1), 
('Germany', 1),
('China', 2), 
('India', 2), 
('Kenya', 3), 
('South Africa', 3), 
('USA', 4), 
('Canada', 4), 
('Brazil', 5), 
('Argentina', 5);
''')

In [15]:
cursor.execute('''
INSERT INTO Region (Name, CountryID) VALUES 
('Paris', 1), 
('Bavaria', 2), 
('Guangdong', 3), 
('Maharashtra', 4), 
('Nairobi', 5), 
('Western Cape', 6), 
('California', 7), 
('Ontario', 8), 
('São Paulo', 9), 
('Buenos Aires', 10);
''')

In [16]:
cursor.execute('''
INSERT INTO City (Name, RegionID) VALUES 
('Paris', 1), 
('Munich', 2), 
('Guangzhou', 3), 
('Mumbai', 4), 
('Nairobi', 5), 
('Cape Town', 6), 
('San Francisco', 7), 
('Toronto', 8), 
('São Paulo', 9), 
('Buenos Aires', 10);
''')

In [17]:
cursor.execute('''
INSERT INTO Organization (Name) VALUES 
('Harvard University'), 
('MIT'), 
('Stanford University'), 
('Oxford University'), 
('Cambridge University'), 
('Tsinghua University'), 
('Peking University'), 
('ETH Zurich'), 
('University of Tokyo'), 
('Max Planck Institute');
''')

In [18]:
cursor.execute('''
INSERT INTO Service (Name, Address, OrganizationID) VALUES 
('Department of Computer Science', 1, 1), 
('Department of Mathematics', 2, 1), 
('Department of Physics', 3, 1), 
('Department of Biology', 4, 1), 
('Department of Chemistry', 5, 1), 
('Division of Engineering', 6, 2), 
('Division of Humanities', 7, 2), 
('Division of Social Sciences', 8, 2), 
('Division of Natural Sciences', 9, 2), 
('Division of Arts', 10, 2), 
('Institute of Technology', 1, 3), 
('Institute of Medicine', 2, 3), 
('Institute of Law', 3, 3), 
('Institute of Business', 4, 3), 
('Institute of Education', 5, 3), 
('Faculty of Science', 6, 4), 
('Faculty of Engineering', 7, 4), 
('Faculty of Arts', 8, 4), 
('Faculty of Law', 9, 4), 
('Faculty of Medicine', 10, 4), 
('School of Engineering', 1, 5), 
('School of Business', 2, 5), 
('School of Arts', 3, 5), 
('School of Education', 4, 5), 
('School of Law', 5, 5), 
('Research Lab A', 6, 6), 
('Research Lab B', 7, 6), 
('Research Lab C', 8, 6), 
('Research Lab D', 9, 6), 
('Research Lab E', 10, 6), 
('Center for Advanced Studies', 1, 7), 
('Center for Basic Sciences', 2, 7), 
('Center for Applied Sciences', 3, 7), 
('Center for Theoretical Physics', 4, 7), 
('Center for Molecular Biology', 5, 7), 
('Institute of Advanced Research', 6, 8), 
('Institute of Fundamental Research', 7, 8), 
('Institute of Applied Research', 8, 8), 
('Institute of Social Research', 9, 8), 
('Institute of Economic Research', 10, 8), 
('Laboratory of Physics', 1, 9), 
('Laboratory of Chemistry', 2, 9), 
('Laboratory of Biology', 3, 9), 
('Laboratory of Computer Science', 4, 9), 
('Laboratory of Environmental Science', 5, 9), 
('School of Humanities', 6, 10), 
('School of Social Sciences', 7, 10), 
('School of Natural Sciences', 8, 10), 
('School of Engineering', 9, 10), 
('School of Health Sciences', 10, 10);
''')

In [19]:
cursor.execute('''
INSERT INTO Supervisor (Name, ContactDetails, Keywords, ServiceID) VALUES 
('John Doe', 'john@example.com', 'Software Development', 1), 
('Jane Smith', 'jane@example.com', 'Data Science', 2), 
('Jim Brown', 'jim@example.com', 'Networking', 3), 
('Jill White', 'jill@example.com', 'AI Research', 4), 
('Jack Black', 'jack@example.com', 'Cybersecurity', 5);
''')

#### MongoDB

In [13]:
import random

# Insert sample data for continents
continent_ids = {
    "Europe": continents.insert_one({"name": "Europe"}).inserted_id,
    "Asia": continents.insert_one({"name": "Asia"}).inserted_id,
    "Africa": continents.insert_one({"name": "Africa"}).inserted_id,
    "North America": continents.insert_one({"name": "North America"}).inserted_id,
    "South America": continents.insert_one({"name": "South America"}).inserted_id,
}

# Insert sample data for countries
country_data = [
    ("France", "Europe"), ("Germany", "Europe"),
    ("China", "Asia"), ("India", "Asia"),
    ("Kenya", "Africa"), ("South Africa", "Africa"),
    ("USA", "North America"), ("Canada", "North America"),
    ("Brazil", "South America"), ("Argentina", "South America")
]

country_ids = {}
for country, continent in country_data:
    country_ids[country] = countries.insert_one({
        "name": country,
        "continent_id": continent_ids[continent]
    }).inserted_id

# Insert sample data for regions
region_data = [
    ("Paris", "France"), ("Bavaria", "Germany"),
    ("Guangdong", "China"), ("Maharashtra", "India"),
    ("Nairobi", "Kenya"), ("Western Cape", "South Africa"),
    ("California", "USA"), ("Ontario", "Canada"),
    ("São Paulo", "Brazil"), ("Buenos Aires", "Argentina")
]

region_ids = {}
for region, country in region_data:
    region_ids[region] = regions.insert_one({
        "name": region,
        "country_id": country_ids[country]
    }).inserted_id

# Insert sample data for cities
city_data = [
    ("Paris", "Paris"), ("Munich", "Bavaria"),
    ("Guangzhou", "Guangdong"), ("Mumbai", "Maharashtra"),
    ("Nairobi", "Nairobi"), ("Cape Town", "Western Cape"),
    ("San Francisco", "California"), ("Toronto", "Ontario"),
    ("São Paulo", "São Paulo"), ("Buenos Aires", "Buenos Aires")
]

city_ids = {}
for city, region in city_data:
    city_ids[city] = cities.insert_one({
        "name": city,
        "region_id": region_ids[region]
    }).inserted_id

# Insert sample data for organizations
organization_names = [
    "Harvard University", "MIT", "Stanford University",
    "Oxford University", "Cambridge University",
    "Tsinghua University", "Peking University",
    "ETH Zurich", "University of Tokyo", "Max Planck Institute"
]

organization_ids = []
for org in organization_names:
    organization_ids.append(organizations.insert_one({"name": org}).inserted_id)

# Insert sample data for services and randomly assign to cities
service_names = [
    "Department of Computer Science", "Department of Mathematics", "Department of Physics",
    "Department of Biology", "Department of Chemistry", "Division of Engineering",
    "Division of Humanities", "Division of Social Sciences", "Division of Natural Sciences",
    "Division of Arts", "Institute of Technology", "Institute of Medicine",
    "Institute of Law", "Institute of Business", "Institute of Education",
    "Faculty of Science", "Faculty of Engineering", "Faculty of Arts",
    "Faculty of Law", "Faculty of Medicine", "School of Engineering",
    "School of Business", "School of Arts", "School of Education",
    "School of Law", "Research Lab A", "Research Lab B",
    "Research Lab C", "Research Lab D", "Research Lab E",
    "Center for Advanced Studies", "Center for Basic Sciences",
    "Center for Applied Sciences", "Center for Theoretical Physics",
    "Center for Molecular Biology", "Institute of Advanced Research",
    "Institute of Fundamental Research", "Institute of Applied Research",
    "Institute of Social Research", "Institute of Economic Research",
    "Laboratory of Physics", "Laboratory of Chemistry",
    "Laboratory of Biology", "Laboratory of Computer Science",
    "Laboratory of Environmental Science", "School of Humanities",
    "School of Social Sciences", "School of Natural Sciences",
    "School of Health Sciences"
]

for org_id in organization_ids:
    for _ in range(5):  # Insert 5 services per organization
        service_name = random.choice(service_names)
        city_id = random.choice(list(city_ids.values()))
        services.insert_one({
            "name": service_name,
            "address": {
                "city_id": city_id
            },
            "organization_id": org_id
        })

# Function to find service ID or handle missing service
def get_service_id(service_name):
    service = services.find_one({"name": service_name})
    if service:
        return service["_id"]
    else:
        print(f"Service '{service_name}' not found.")
        return None

# Insert sample data for supervisors
supervisor_data = [
    ("John Doe", "john@example.com", "+123456789", ["Software Development"], "Department of Computer Science"),
    ("Jane Smith", "jane@example.com", "+987654321", ["Data Science"], "Department of Mathematics"),
    ("Jim Brown", "jim@example.com", "+1122334455", ["Networking"], "Department of Physics"),
    ("Jill White", "jill@example.com", "+5566778899", ["AI Research"], "Department of Biology"),
    ("Jack Black", "jack@example.com", "+9988776655", ["Cybersecurity"], "Department of Chemistry")
]

for name, email, phone, keywords, service_name in supervisor_data:
    service_id = get_service_id(service_name)
    if service_id:
        supervisors.insert_one({
            "name": name,
            "contact": {
                "email": email,
                "phone": phone
            },
            "keywords": keywords,
            "service_id": service_id
        })
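
Because `random.choice` draws with replacement and no seed is set, an organization can receive the same service name twice, a supervisor's service may be missing (the lookup then prints "not found"), and every run produces a different data set. A minimal sketch of a reproducible alternative follows. The seed value, the helper name `plan_services` and the shortened demo inputs are illustrative assumptions, not part of the notebook.

```python
import random


def plan_services(org_ids, service_names, city_ids, per_org=5, seed=42):
    """Return (org_id, service_name, city_id) tuples, reproducibly.

    Names are drawn without replacement within each organization, so an
    organization never gets the same service name twice.
    """
    rng = random.Random(seed)  # local RNG; leaves the global state untouched
    plan = []
    for org_id in org_ids:
        for name in rng.sample(service_names, per_org):
            plan.append((org_id, name, rng.choice(city_ids)))
    return plan


# Hypothetical stand-ins for organization_ids, service_names and city_ids.
demo_plan = plan_services(
    org_ids=["org1", "org2"],
    service_names=[f"Service {i}" for i in range(8)],
    city_ids=["Paris", "Munich", "Mumbai"],
)
print(len(demo_plan))  # 2 organizations x 5 services = 10
```

Each tuple could then be passed to `services.insert_one(...)` in the same shape as in the loop above.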

## Task 3. Querying

### 1. Display the name of the organizations and respectively the number of services located on the European continent.

#### PostgreSQL

In [21]:
start_time = time.time()
result = sqlio.read_sql_query("""
SELECT 
    O.Name AS OrganizationName, 
    COUNT(S.ServiceID) AS NumberOfServices
FROM 
    Organization O
JOIN 
    Service S ON O.OrganizationID = S.OrganizationID
JOIN 
    City C ON S.Address = C.CityID
JOIN 
    Region R ON C.RegionID = R.RegionID
JOIN 
    Country CO ON R.CountryID = CO.CountryID
JOIN 
    Continent CON ON CO.ContinentID = CON.ContinentID
WHERE 
    CON.Name = 'Europe'
GROUP BY 
    O.OrganizationID, O.Name;

""",conn)
end_time = time.time()
result.head()

|   | organizationname | numberofservices |
|---|---|---|
| 0 | Harvard University | 2 |
| 1 | Stanford University | 2 |
| 2 | Cambridge University | 2 |
| 3 | Peking University | 2 |
| 4 | University of Tokyo | 2 |


#### MongoDB

In [15]:
import time

# Perform the aggregation query
pipeline = [
    {
        '$lookup': {
            'from': 'services',
            'localField': '_id',
            'foreignField': 'organization_id',
            'as': 'services'
        }
    },
    {
        '$lookup': {
            'from': 'cities',
            'localField': 'services.address.city_id',
            'foreignField': '_id',
            'as': 'cities'
        }
    },
    {
        '$lookup': {
            'from': 'regions',
            'localField': 'cities.region_id',
            'foreignField': '_id',
            'as': 'regions'
        }
    },
    {
        '$lookup': {
            'from': 'countries',
            'localField': 'regions.country_id',
            'foreignField': '_id',
            'as': 'countries'
        }
    },
    {
        '$lookup': {
            'from': 'continents',
            'localField': 'countries.continent_id',
            'foreignField': '_id',
            'as': 'continents'
        }
    },
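    # Caution: the arrays above were looked up independently, so unwinding
    # all of them multiplies rows (services x cities x regions x countries
    # x continents) instead of yielding one row per service.
    {
        '$unwind': '$services'
    },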
    {
        '$unwind': '$cities'
    },
    {
        '$unwind': '$regions'
    },
    {
        '$unwind': '$countries'
    },
    {
        '$unwind': '$continents'
    },
    {
        '$match': {
            'continents.name': 'Europe'
        }
    },
    {
        '$group': {
            '_id': {
                'organization_id': '$_id',
                'organization_name': '$name'
            },
            'number_of_services': {'$sum': 1}
        }
    },
    {
        '$project': {
            '_id': 0,
            'organization_name': '$_id.organization_name',
            'number_of_services': 1
        }
    }
]

start_time2 = time.time()
result = list(db.organizations.aggregate(pipeline))
end_time2 = time.time()

for doc in result:
    print(doc)

{'number_of_services': 625, 'organization_name': 'Stanford University'}
{'number_of_services': 320, 'organization_name': 'Harvard University'}
{'number_of_services': 625, 'organization_name': 'Tsinghua University'}
{'number_of_services': 320, 'organization_name': 'Oxford University'}
{'number_of_services': 135, 'organization_name': 'ETH Zurich'}
{'number_of_services': 625, 'organization_name': 'MIT'}
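
These figures cannot be service counts, since every organization was given exactly five services. They fit the pattern 5 x k^3 (625 = 5 x 5^3, 320 = 5 x 4^3, 135 = 5 x 3^3), which is consistent with the five `$unwind` stages multiplying the 5 services by k distinct cities, regions and countries before `$match` keeps the European rows. A sketch of a pipeline that unwinds directly after each `$lookup`, so that each row corresponds to one service, is given below. The helper names and the singular `as` field names are assumptions for illustration, and the sketch has not been run against the data set above.

```python
def lookup_one(collection, local_field, alias):
    """A $lookup followed directly by an $unwind of its result."""
    return [
        {"$lookup": {"from": collection, "localField": local_field,
                     "foreignField": "_id", "as": alias}},
        {"$unwind": f"${alias}"},
    ]


def europe_service_counts_pipeline(continent_name="Europe"):
    """Aggregation stages meant to run on the organizations collection."""
    return [
        # One row per (organization, service) pair.
        {"$lookup": {"from": "services", "localField": "_id",
                     "foreignField": "organization_id", "as": "service"}},
        {"$unwind": "$service"},
        # Each service references one city, region, country and continent.
        *lookup_one("cities", "service.address.city_id", "city"),
        *lookup_one("regions", "city.region_id", "region"),
        *lookup_one("countries", "region.country_id", "country"),
        *lookup_one("continents", "country.continent_id", "continent"),
        {"$match": {"continent.name": continent_name}},
        {"$group": {"_id": "$_id",
                    "organization_name": {"$first": "$name"},
                    "number_of_services": {"$sum": 1}}},
        {"$project": {"_id": 0, "organization_name": 1,
                      "number_of_services": 1}},
    ]


pipeline_fixed = europe_service_counts_pipeline()
print(len(pipeline_fixed))  # 13 stages
```

It would be executed as `list(db.organizations.aggregate(pipeline_fixed))`. With this ordering no organization should report more than five services.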


### 2. Analyze and compare the execution performance on the two systems.

#### PostgreSQL

In [22]:
print(f'PostgreSQL executed in {round(end_time - start_time, 2)} seconds')

PostgreSQL executed in 0.22 seconds


#### MongoDB

In [16]:
print(f"Mongodb executed in {end_time2 - start_time2:.2f} seconds")

Mongodb executed in 0.02 seconds
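
Both figures come from a single run each, and the PostgreSQL measurement also includes the time pandas needs to build the DataFrame, so they give only a rough indication. The MongoDB services were also assigned randomly, so the two systems do not hold identical data. A minimal sketch of a more robust timing harness follows, assuming each query is wrapped in a zero-argument function. The helper `time_query` and its parameters are hypothetical.

```python
import statistics
import time


def time_query(run, repeats=20, warmup=3):
    """Time a zero-argument callable; return summary statistics in seconds."""
    for _ in range(warmup):  # warm caches and connections, not measured
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        run()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Stand-in workload so the sketch runs without a database.
stats = time_query(lambda: sum(range(10_000)))
print(f"median: {stats['median']:.6f} s")
```

In the notebook, `run` could be for instance `lambda: list(db.organizations.aggregate(pipeline))` for MongoDB and a function that executes the SQL text through `cursor` and fetches all rows for PostgreSQL. Comparing medians over several runs is less sensitive to one-off effects such as connection setup or cold caches.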
