# Data Engineering & EDA Workshop

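This notebook walks through connecting to a cloud-hosted PostgreSQL database (Neon), loading synthetic data into it, and then inspecting, feature engineering, scaling, and visualizing that data with pandas.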

In [None]:
!pip install psycopg2-binary faker sqlalchemy scikit-learn seaborn

## 1. Data Collection
We use a Neon PostgreSQL database and synthetic data generated with Faker.

In [None]:

import psycopg2
import pandas as pd
from faker import Faker
import random
from datetime import date


In [None]:

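import os
# Placeholder URL; set the NEON_DB_URL environment variable to keep real credentials out of the notebook.
NEON_DB_URL = os.environ.get("NEON_DB_URL", "postgresql://username:password@host/dbname?sslmode=require")
conn = psycopg2.connect(NEON_DB_URL)
cur = conn.cursor()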
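
The connection string packs the user, password, host, database name, and SSL mode into one URL. As a minimal sketch of how those parts map out, the snippet below splits the placeholder URL with the standard library. The helper name `describe_db_url` and the `NEON_DB_URL` environment variable are illustrative assumptions, not part of psycopg2 or Neon.

```python
import os
from urllib.parse import parse_qs, urlsplit

# Placeholder copied from the notebook; a real URL would come from the environment.
DEFAULT_URL = "postgresql://username:password@host/dbname?sslmode=require"


def describe_db_url(url: str) -> dict:
    """Split a PostgreSQL URL into its parts, masking the password (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "user": parts.username,
        "password": "***" if parts.password else None,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
        "sslmode": query.get("sslmode", [None])[0],
    }


# Fall back to the placeholder when the (assumed) environment variable is unset.
db_url = os.environ.get("NEON_DB_URL", DEFAULT_URL)
print(describe_db_url(db_url))
```

Masking the password before printing keeps credentials out of saved notebook output.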


In [None]:

cur.execute("""
CREATE TABLE IF NOT EXISTS employees (
    employee_id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    position VARCHAR(50),
    start_date DATE,
    salary INTEGER
);
""")
conn.commit()


In [None]:

fake = Faker()
positions = [
    "Software Engineer", "Data Scientist", "DevOps Engineer",
    "Cloud Architect", "Cybersecurity Analyst", "AI Engineer", "Backend Developer"
]

# Build 50 synthetic employee rows; re-running this cell appends another 50 to the table.
employees = []
for _ in range(50):
    employees.append((
        fake.name(),
        random.choice(positions),
        fake.date_between(start_date='-9y', end_date='today'),
        random.randint(60000, 200000)
    ))

cur.executemany("""
INSERT INTO employees (name, position, start_date, salary)
VALUES (%s, %s, %s, %s);
""", employees)
conn.commit()
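
The generate-then-bulk-insert pattern can be tried without database credentials. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database instead of Neon, a seeded `random.Random` instead of Faker, and a hypothetical helper `make_employees`. Faker offers `Faker.seed()` for the same reproducibility purpose.

```python
import random
import sqlite3
from datetime import date, timedelta

POSITIONS = ["Software Engineer", "Data Scientist", "DevOps Engineer"]


def make_employees(n: int, seed: int) -> list:
    """Generate n synthetic rows; the same seed always yields the same rows."""
    rng = random.Random(seed)
    base = date(2020, 1, 1)  # arbitrary fixed anchor so results are repeatable
    rows = []
    for i in range(n):
        rows.append((
            f"Employee {i}",  # stand-in for fake.name()
            rng.choice(POSITIONS),
            (base + timedelta(days=rng.randint(0, 365))).isoformat(),
            rng.randint(60000, 200000),
        ))
    return rows


conn_demo = sqlite3.connect(":memory:")  # SQLite stand-in, not Neon
conn_demo.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, "
    "name TEXT, position TEXT, start_date TEXT, salary INTEGER)"
)
rows = make_employees(5, seed=42)
# sqlite3 uses ? placeholders where psycopg2 uses %s.
conn_demo.executemany(
    "INSERT INTO employees (name, position, start_date, salary) VALUES (?, ?, ?, ?)",
    rows,
)
conn_demo.commit()
row_count = conn_demo.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(row_count)
```

Passing the rows as parameters, rather than formatting them into the SQL string, lets the driver handle quoting in both SQLite and PostgreSQL.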


In [None]:

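# pandas may warn that a raw psycopg2 connection is not officially tested;
# the query still runs, and a SQLAlchemy engine would avoid the warning.
df = pd.read_sql("SELECT * FROM employees;", conn)
df.head()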


## 2. Data Cleaning
Inspect the column types, count missing values per column, and review summary statistics before changing anything.

In [None]:

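df.info()                 # prints column dtypes and non-null counts
print(df.isnull().sum())  # missing values per column; print it, since a cell renders only its last expression
df.describe()             # summary statistics, rendered as the cell output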
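
The synthetic table is unlikely to contain gaps, so the checks above mostly confirm that nothing is wrong. To show what they catch, here is a small sketch on a hand-made frame with one missing value and one duplicated row. The frame and its values are invented for illustration.

```python
import pandas as pd

# Hand-made sample with one missing position and one exact duplicate row.
sample = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "position": ["Data Scientist", "AI Engineer", "AI Engineer", None],
    "salary": [90000, 120000, 120000, 75000],
})

missing_per_column = sample.isnull().sum()        # count of nulls in each column
duplicate_rows = int(sample.duplicated().sum())   # rows identical to an earlier row

# Drop exact duplicates, then rows that lack a position.
cleaned = (
    sample.drop_duplicates()
    .dropna(subset=["position"])
    .reset_index(drop=True)
)

print(missing_per_column)
print(duplicate_rows, len(cleaned))
```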


## 3. Feature Engineering
Extracting start year and years of service.

In [None]:

df['start_year'] = pd.to_datetime(df['start_date']).dt.year
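# Use the current year instead of a hard-coded 2025 so the value stays correct over time.
df['years_of_service'] = date.today().year - df['start_year']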
df.head()
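
Subtracting calendar years ignores whether the anniversary has already passed, so someone who started in November counts a full extra year in March. If completed years matter, one possible refinement is sketched below with the standard library. The function name `years_of_service` is a hypothetical helper, not something the notebook defines.

```python
from datetime import date


def years_of_service(start: date, today: date) -> int:
    """Return the number of completed years between start and today."""
    years = today.year - start.year
    # Subtract one if this year's anniversary has not happened yet.
    if (today.month, today.day) < (start.month, start.day):
        years -= 1
    return years


print(years_of_service(date(2020, 11, 15), date(2025, 3, 1)))
print(years_of_service(date(2020, 1, 10), date(2025, 3, 1)))
```

On a DataFrame, the same function could be applied row by row to the `start_date` column.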


## 4. Scaling
Standardize `salary` to zero mean and unit variance with `StandardScaler`.

In [None]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform returns a 2-D array of shape (n, 1); flatten it into a single column.
df['salary_scaled'] = scaler.fit_transform(df[['salary']]).ravel()
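
Standard scaling replaces each value with its z-score, (x - mean) / std, where the standard deviation is the population one (`ddof=0`). The sketch below reproduces that arithmetic with the standard library on five made-up salaries. The function name `standardize` is illustrative.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mu = fmean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


salaries = [60000, 80000, 100000, 120000, 140000]
scaled = standardize(salaries)
print([round(z, 3) for z in scaled])
```

After scaling, the values have mean 0 and standard deviation 1, which puts salary on the same footing as other standardized features.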


## 5. Visualization 1
A grouped bar chart of the average salary for each position, with one bar per start year.

In [None]:

import matplotlib.pyplot as plt
grouped = df.groupby(['position', 'start_year'])['salary'].mean().unstack()
grouped.plot(kind='bar', figsize=(14,6))
plt.title("Average Salary by Position and Start Year")
plt.tight_layout()
plt.show()
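
The `groupby(...).mean().unstack()` step turns the long table into a grid with one row per position and one column per start year, which is the shape the bar plot expects. A small sketch on invented data shows the result, including the `NaN` that appears when a position has no hires in a given year.

```python
import pandas as pd

# Invented rows: no AI Engineer started in 2021.
toy = pd.DataFrame({
    "position": ["AI Engineer", "AI Engineer", "Data Scientist", "Data Scientist"],
    "start_year": [2020, 2020, 2020, 2021],
    "salary": [100000, 120000, 90000, 95000],
})

# Rows become positions, columns become start years, cells hold the mean salary.
grouped_toy = toy.groupby(["position", "start_year"])["salary"].mean().unstack()
print(grouped_toy)
```

In the chart, a `NaN` cell simply shows up as a missing bar for that position and year.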


## 6. Advanced Visualization
Create a `departments` table, assign each employee to a random department, and join the two tables to build a salary heatmap.

In [None]:

cur.execute("""
CREATE TABLE IF NOT EXISTS departments (
    department_id SERIAL PRIMARY KEY,
    department_name VARCHAR(50),
    location VARCHAR(50)
);
""")
conn.commit()


In [None]:

departments = [
    ("Engineering", "Toronto"),
    ("Data", "Vancouver"),
    ("Security", "Montreal"),
    ("Cloud", "Calgary")
]

cur.executemany("""
INSERT INTO departments (department_name, location)
VALUES (%s, %s);
""", departments)
conn.commit()


In [None]:

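cur.execute("ALTER TABLE employees ADD COLUMN IF NOT EXISTS department_id INTEGER;")
# floor(random() * 4 + 1) yields an integer from 1 to 4, the ids a freshly created departments table assigns.
cur.execute("UPDATE employees SET department_id = floor(random() * 4 + 1);")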
conn.commit()
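
The expression `floor(random() * 4 + 1)` works because `random()` returns a value in [0, 1), so the product plus one falls in [1, 5) and the floor is always 1, 2, 3, or 4. The sketch below checks the same arithmetic in Python; the fixed seed is only there to make the run repeatable.

```python
import math
import random
from collections import Counter

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ids = [math.floor(rng.random() * 4 + 1) for _ in range(10_000)]
counts = Counter(ids)

# Each id from 1 to 4 should appear roughly a quarter of the time.
print(sorted(counts.items()))
```

If the departments insert cell is run more than once, the table gains ids above 4 that this expression never assigns, so those duplicate departments stay empty.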


In [None]:

query = """
SELECT e.*, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.department_id;
"""
df_joined = pd.read_sql(query, conn)
df_joined.head()
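
A plain `JOIN` is an inner join, so any employee whose `department_id` is NULL or matches no department disappears from the result. The sketch below illustrates the difference from a `LEFT JOIN`, using an in-memory SQLite database as a stand-in for Neon and a few invented rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # SQLite stand-in, not Neon
db.executescript("""
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT);
CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Data');
INSERT INTO employees VALUES (1, 'Ana', 1), (2, 'Ben', 2), (3, 'Cy', NULL);
""")

# Inner join: only employees with a matching department survive.
inner_rows = db.execute(
    "SELECT e.name, d.department_name FROM employees e "
    "JOIN departments d ON e.department_id = d.department_id "
    "ORDER BY e.employee_id"
).fetchall()

# Left join: every employee is kept; unmatched ones get NULL (None in Python).
left_rows = db.execute(
    "SELECT e.name, d.department_name FROM employees e "
    "LEFT JOIN departments d ON e.department_id = d.department_id "
    "ORDER BY e.employee_id"
).fetchall()

print(inner_rows)
print(left_rows)
```

Comparing the row count of the joined frame with the row count of `employees` is a quick way to spot rows lost to the inner join.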


In [None]:

import seaborn as sns
pivot = df_joined.pivot_table(values='salary', index='department_name', columns='position', aggfunc='mean')
plt.figure(figsize=(14,6))
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="coolwarm")
plt.title("Average Salary by Department and Position")
plt.show()


## 7. Conclusions
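This notebook covered an end-to-end workflow: creating and populating tables in a Neon PostgreSQL database, loading them into pandas, engineering and scaling features, and visualizing average salaries by position, start year, and department. Because the data is synthetic and randomly generated, the charts illustrate the workflow rather than real salary patterns.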