# **`Lab3 - Data Engineering & EDA with Python, SQL, and Pandas`**

##### **Name** - Manu Mathew
##### **CourseID** - PROG8245
##### **Course** - Machine Learning Programming
##### **Student ID** - 8990691

---

**Install required packages**

In [26]:
%pip install psycopg2-binary pandas faker sqlalchemy scikit-learn

Note: you may need to restart the kernel to use updated packages.


`1 -- `
- Create a free SQL Database
- Create a table named employees with the following columns:
    - employee_id (integer, primary key)
    - name (string)
    - position (string, IT-related job titles)
    - start_date (date, between 2015 and 2024)
    - salary (integer, $60,000–$200,000)



I set up a Postgres database and created the `employees` table with the SQL query below:

```SQL
CREATE TABLE employees (
  employee_id SERIAL PRIMARY KEY,
  name VARCHAR(50),
  position VARCHAR(50),
  start_date DATE,
  salary INTEGER
);
```

**Adding the imports**

In [27]:
import random
from faker import Faker
from datetime import date
import pandas as pd
# import psycopg2
from sqlalchemy import create_engine
from sklearn.preprocessing import MinMaxScaler

`2 --`
- Generate & Populate Data
    - Generate at least 50 synthetic records using Python and the Faker library.
    - Insert the data into your cloud database.

In [28]:
# Initializes faker object from the Faker library
fake = Faker()
# Position list that can be assigned randomly to employees
positions = ['Software Engineer', 'Data Analyst', 'DevOps Engineer', 'ML Engineer', 'QA Engineer','Backend Developer', 'Frontend Developer', 'Cloud Architect', 'SysAdmin', 'Data Scientist']
# Generate 100 records (the assignment asks for at least 50)
for i in range(100):
    # Escape single quotes for SQL by doubling them (e.g. O'Brien becomes O''Brien)
    name = fake.name().replace("'", "''")
    # Select a random position for the employee
    position = random.choice(positions)
    # Select a random start date from 2015-01-01 through 2024-06-01
    start_date = fake.date_between(start_date=date(2015,1,1), end_date=date(2024,6,1))
    # Select an integer salary between 60000 and 200000 (inclusive)
    salary = random.randint(60000, 200000)
    # Print one SQL INSERT statement per record
    print(f"INSERT INTO employees (name, position, start_date, salary) VALUES('{name}', '{position}', '{start_date}', {salary});")

INSERT INTO employees (name, position, start_date, salary) VALUES('Dennis Ramos', 'Frontend Developer', '2019-04-07', 165782);
INSERT INTO employees (name, position, start_date, salary) VALUES('Jordan Cruz', 'SysAdmin', '2020-09-09', 174274);
INSERT INTO employees (name, position, start_date, salary) VALUES('Dustin Ray PhD', 'QA Engineer', '2018-08-10', 169549);
INSERT INTO employees (name, position, start_date, salary) VALUES('Jessica Gross', 'QA Engineer', '2016-09-24', 177066);
INSERT INTO employees (name, position, start_date, salary) VALUES('Linda Potts', 'SysAdmin', '2019-06-07', 151760);
INSERT INTO employees (name, position, start_date, salary) VALUES('Elizabeth Quinn', 'Frontend Developer', '2018-05-17', 167983);
INSERT INTO employees (name, position, start_date, salary) VALUES('Yesenia Hale', 'Data Analyst', '2018-03-28', 174993);
INSERT INTO employees (name, position, start_date, salary) VALUES('Amy Higgins', 'Data Scientist', '2016-01-31', 193258);
INSERT INTO employees (na
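
Printing INSERT statements and pasting them into the editor works for a one-off load, but it depends on escaping quotes by hand. A minimal sketch of the alternative, parameterized inserts, is shown below. It is an illustration under assumptions, not the method used in this lab: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for Postgres so it runs without a cloud connection, the helper names `random_start_date` and `make_rows` are hypothetical, and the name `Miles O'Brien` is an invented example containing a quote. With psycopg2 the placeholder would be `%s` instead of `?`.

```python
import random
import sqlite3
from datetime import date, timedelta

# Shortened position list, for illustration only
POSITIONS = ["Software Engineer", "Data Analyst", "DevOps Engineer"]


def random_start_date(rng, first=date(2015, 1, 1), last=date(2024, 12, 31)):
    """Return a random date between first and last, inclusive (hypothetical helper)."""
    return first + timedelta(days=rng.randint(0, (last - first).days))


def make_rows(names, rng):
    """Build (name, position, start_date, salary) tuples for the given names."""
    return [
        (
            name,
            rng.choice(POSITIONS),
            random_start_date(rng).isoformat(),
            rng.randint(60000, 200000),
        )
        for name in names
    ]


# In-memory SQLite database standing in for the cloud Postgres database
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT, position TEXT, start_date TEXT, salary INTEGER)"""
)

rows = make_rows(["Dennis Ramos", "Miles O'Brien"], random.Random(0))

# The driver binds each value to a placeholder, so no manual quote escaping is needed
conn.executemany(
    "INSERT INTO employees (name, position, start_date, salary) VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

stored_names = [r[0] for r in conn.execute("SELECT name FROM employees ORDER BY employee_id")]
print(stored_names)
```

Because the values travel separately from the SQL text, a name containing an apostrophe is stored unchanged.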

`3--`
- Connect and Load Data
- Using Python, psycopg2, and Pandas, connect to your cloud database.
- Query the entire employee table and load the data into a Pandas DataFrame.
- Display the first few rows using df.head().

In [29]:
# Connection string
conn_str = "postgresql://neondb_owner:npg_Ppd3S2nUcWfx@ep-steep-rain-a8s0cnp2-pooler.eastus2.azure.neon.tech/neondb?sslmode=require"
# Create SQLAlchemy engine
engine = create_engine(conn_str)
# Query entire employee table and load the data into the dataframe
df = pd.read_sql_query("SELECT * FROM employees;", engine)
# Display the first 50 records
print(df.head(50))
# Dispose of the engine to release its pooled connections
engine.dispose()

    employee_id                   name            position  start_date  salary
0             1      Zachary Rogers MD            SysAdmin  2024-01-31  164268
1             2      Francis Castaneda        Data Analyst  2023-04-09   92677
2             3            Oscar Davis      Data Scientist  2019-02-12  114123
3             4            David Brown   Software Engineer  2021-05-11  183617
4             5         Corey Campbell            SysAdmin  2023-07-04   64790
5             6  Dr. Kyle Martinez DDS         ML Engineer  2017-01-11   86908
6             7        Kristin Russell         ML Engineer  2016-12-07   68990
7             8         Patricia Gates   Backend Developer  2018-04-02  140714
8             9          Kelly Carlson            SysAdmin  2019-12-18  107035
9            10        Samantha Hardin     DevOps Engineer  2016-11-04  198378
10           11           Brian Newman         ML Engineer  2019-09-07   94029
11           12      Mikayla Dickerson         ML En
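
The cell above keeps the connection string, including the password, directly in the notebook. One common alternative is to read it from an environment variable so the credential never appears in the shared file. The sketch below is only an illustration: the variable name `DATABASE_URL`, the helper `get_connection_string`, and the placeholder URL are assumptions, not part of this lab.

```python
import os


def get_connection_string(env=None, var_name="DATABASE_URL"):
    """Return the connection string stored in an environment variable.

    `var_name` is a hypothetical name; use whatever name the variable is exported under.
    """
    env = os.environ if env is None else env
    conn_str = env.get(var_name)
    if not conn_str:
        raise RuntimeError(f"Set the {var_name} environment variable before connecting.")
    return conn_str


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"DATABASE_URL": "postgresql://user:password@host/dbname?sslmode=require"}
conn_str = get_connection_string(demo_env)
print(conn_str.split("@")[-1])  # print only the host part, not the credentials
```

The returned string can then be passed to `create_engine` exactly as in the cell above.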

`4--`
**Explain each EDA step**

**Data Collection**

**`Database Setup and data collection`**
- Go to https://neon.tech/
- Sign up with your GitHub or Google account
- After logging in, click `Create a project`
- Set the project name.
- Choose any region and click `CreateProject`
- On the project dashboard page, click the `connect` button at the top right.
- Copy the connection string.
- Select the `SQLEditor` option, then copy in and run the SQL query below to create the employees table:
  ```SQL
  CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    name VARCHAR(50),
    position VARCHAR(50),
    start_date DATE,
    salary INTEGER
  );
  ```
- Generate 100 synthetic records using Python and the Faker library. The script prints 100 INSERT queries; copy them.
- Go back to the project dashboard page, select `SQLEditor`, and execute the 100 INSERT queries to insert the data into the cloud database.

**Data Cleaning**
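- I checked for missing values and found no null or empty values in any column. This was verified with `df.info()` and `df.isnull().sum()`.
- `df.info()` also shows the column data types: `employee_id` and `salary` are `int64`, and `name` and `position` are `object` (strings). `start_date` is loaded as `object` rather than a datetime type, so it is converted when it is used in the transformation step.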

In [30]:
# Column types and null counts
df.info()

# Check for the missing values
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   employee_id  200 non-null    int64 
 1   name         200 non-null    object
 2   position     200 non-null    object
 3   start_date   200 non-null    object
 4   salary       200 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 7.9+ KB


employee_id    0
name           0
position       0
start_date     0
salary         0
dtype: int64
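
Beyond null counts, a few extra checks can confirm that the data matches the table requirements. The sketch below is a hedged illustration on a three-row sample copied from the output above, not on the full table: it converts `start_date` to a datetime type, checks that dates and salaries fall in the required ranges, and counts duplicate IDs. The variable names `sample`, `date_ok`, `salary_ok`, and `duplicate_ids` are introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the df.head() output, used as a small stand-in for df
sample = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "start_date": ["2024-01-31", "2023-04-09", "2019-02-12"],
    "salary": [164268, 92677, 114123],
})

# Convert start_date from object to datetime64 so date comparisons work
sample["start_date"] = pd.to_datetime(sample["start_date"])

# Range checks based on the assignment's requirements
date_ok = sample["start_date"].between(pd.Timestamp("2015-01-01"), pd.Timestamp("2024-12-31"))
salary_ok = sample["salary"].between(60_000, 200_000)

# The primary key should never repeat
duplicate_ids = int(sample["employee_id"].duplicated().sum())

print(sample.dtypes)
print(date_ok.all(), salary_ok.all(), duplicate_ids)
```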

**Data Transformation**
- As part of data transformation, I created a new column `years_of_service`, calculated as the difference between the current year and the year of `start_date`.

In [31]:
df['years_of_service'] = date.today().year - pd.DatetimeIndex(df['start_date']).year
df.head()

   employee_id               name           position  start_date  salary  years_of_service
0            1  Zachary Rogers MD           SysAdmin  2024-01-31  164268                 1
1            2  Francis Castaneda       Data Analyst  2023-04-09   92677                 2
2            3        Oscar Davis     Data Scientist  2019-02-12  114123                 6
3            4        David Brown  Software Engineer  2021-05-11  183617                 4
4            5     Corey Campbell           SysAdmin  2023-07-04   64790                 2
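
Subtracting calendar years counts a year as soon as the year number changes, even if the work anniversary has not been reached yet. If completed years are wanted instead, one possible approach is sketched below. The helper `completed_years` and the fixed reference date are assumptions made for illustration; the notebook itself uses `date.today()`.

```python
from datetime import date


def completed_years(start, today):
    """Return the number of full years between start and today (hypothetical helper)."""
    years = today.year - start.year
    # Subtract one if this year's anniversary has not happened yet
    if (today.month, today.day) < (start.month, start.day):
        years -= 1
    return years


# Fixed reference date so the example is reproducible (an assumption, not the lab's date)
reference = date(2025, 1, 15)
start = date(2024, 1, 31)

year_difference = reference.year - start.year   # the calculation used in the cell above
full_years = completed_years(start, reference)  # anniversary-aware alternative

print(year_difference, full_years)
```

For this pair of dates the year difference is 1 while the number of completed years is 0, because the first anniversary falls on 2025-01-31.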


**Feature Engineering**
- For feature engineering, I created two **new columns** that make the data easier to visualize:
- **Normalized Salary** - Salaries are large numbers such as $80,000 or $150,000, which can be hard to compare on a common scale, so I used *Min-Max Scaling* to map all salaries to a range between **0 and 1**. This makes them easier to use in visualizations.
- **Seniority Level** -
  - "Junior" if they have worked less than 3 years
  - "Mid-Level" if they have worked between 3 and 6 years
  - "Senior" if they have worked more than 6 years

  `To implement the Seniority Level, I used a lambda function`.


In [32]:
# Normalize the salary column using Min-Max Scaling
minmax = MinMaxScaler()
df['normalized_salary'] = minmax.fit_transform(df[['salary']])

# Create a seniority level based on years of service
df['seniority'] = df['years_of_service'].apply(
    lambda x: 'Junior' if x < 3 else 'Mid-Level' if x < 7 else 'Senior'
)

# Show new features
df[['employee_id', 'name', 'position', 'normalized_salary', 'years_of_service', 'seniority']].head()


   employee_id               name           position  normalized_salary  years_of_service  seniority
0            1  Zachary Rogers MD           SysAdmin           0.753144                 1     Junior
1            2  Francis Castaneda       Data Analyst           0.235037                 2     Junior
2            3        Oscar Davis     Data Scientist           0.390243                 6  Mid-Level
3            4        David Brown  Software Engineer           0.893174                 4  Mid-Level
4            5     Corey Campbell           SysAdmin           0.033218                 2     Junior
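
To make the two features more concrete, the sketch below computes Min-Max Scaling by hand, as `(salary - min) / (max - min)`, and builds the seniority bands with `pd.cut` as an alternative to the lambda. It is an illustration under assumptions: it uses only the five rows shown above, so the minimum and maximum come from that sample and the normalized values differ from the ones in the full table.

```python
import pandas as pd

# Five rows copied from the outputs above, used as a small stand-in for df
sample = pd.DataFrame({
    "salary": [164268, 92677, 114123, 183617, 64790],
    "years_of_service": [1, 2, 6, 4, 2],
})

# Min-Max Scaling by hand: the smallest salary maps to 0, the largest to 1
lowest, highest = sample["salary"].min(), sample["salary"].max()
sample["normalized_salary"] = (sample["salary"] - lowest) / (highest - lowest)

# Seniority bands: [0, 3) Junior, [3, 7) Mid-Level, [7, inf) Senior
sample["seniority"] = pd.cut(
    sample["years_of_service"],
    bins=[-float("inf"), 3, 7, float("inf")],
    right=False,  # left-closed bins, matching the x < 3 and x < 7 tests in the lambda
    labels=["Junior", "Mid-Level", "Senior"],
)

print(sample)
```

`pd.cut` returns a categorical column, which keeps the three levels in a defined order for grouping and plotting.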
