1️⃣ Load and Inspect

Load the dataset (job_postings_dirty.csv) using Pandas.

Display the first 10 rows.

Show the number of rows and columns.

Check for missing values per column.

In [12]:
import pandas as pd
job_postings = pd.read_csv("../DATA/job_postings_dirty.csv")
job_postings.head()


Unnamed: 0,job_id,job_title,job_location,search_location,job_posted_date,county,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills
0,2189.0,Software Engineer,Atlanta,atlanta,2025-08-16,Japan,Monthly,110870.58,,MediCare,"Project Management, Teamwork"
1,1379.0,Sales Executive,Boston,boston,2024-07-10,UK,Monthly,,86.14,NextGen Solutions,"Communication, Creativity, Project Management"
2,430.0,Data Analyst,Dallas,dallas,2025-05-19,Germany,Hourly,74008.55,53.35,DataBridge,"Leadership, Power BI, Project Management, SQL,..."
3,3207.0,Accountant,Miami,miami,2024-11-23,USA,Annual,62981.46,,SmartWare,"Leadership, Teamwork, Python, SQL"
4,449.0,Marketing Specialist,New York,new york,2025-04-26,Brazil,Hourly,51715.78,46.11,BrightPath,"Excel, Communication, Leadership, Power BI"


In [15]:
job_postings.shape

(5300, 11)

In [17]:
job_postings.isnull().sum()

job_id               94
job_title             0
job_location        174
search_location     174
job_posted_date       0
county                0
salary_rate        1552
salary_year_avg     817
salary_hour_avg     800
company_name          0
job_skills          255
dtype: int64
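The task list asks for the first 10 rows, while `head()` with no argument returns only 5. A minimal sketch of the three inspection steps, run on a tiny made-up CSV (the rows below are illustrative stand-ins, not the real `job_postings_dirty.csv`):

```python
import io

import pandas as pd

# Made-up stand-in for job_postings_dirty.csv (not the real data)
csv_text = """job_id,job_title,job_location,salary_year_avg
1,Data Analyst,Dallas,74008.55
2,Accountant,,62981.46
,Teacher,Miami,
"""
sample = pd.read_csv(io.StringIO(csv_text))

first_rows = sample.head(10)             # pass 10 explicitly; the default is 5
n_rows, n_cols = sample.shape            # (rows, columns)
missing_per_column = sample.isnull().sum()

print(first_rows)
print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
```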

2️⃣ Handle Missing Values

Which columns have the most missing data?

Drop rows where job_title or company_name is missing.

Fill missing salary_year_avg with the column’s mean.

In [18]:
job_postings.isnull().sum().sort_values(ascending=False)

salary_rate        1552
salary_year_avg     817
salary_hour_avg     800
job_skills          255
search_location     174
job_location        174
job_id               94
job_title             0
job_posted_date       0
county                0
company_name          0
dtype: int64
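Raw counts are easier to compare as a share of all rows: `salary_rate` is missing in 1552 of 5300 rows, roughly 29%. A small sketch of that calculation, using a made-up frame (the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the real dataset
toy = pd.DataFrame({
    "salary_rate": ["Hourly", None, None, "Annual"],
    "salary_year_avg": [50000.0, np.nan, 62000.0, 71000.0],
    "company_name": ["A", "B", "C", "D"],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = (
    toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
)
print(missing_share)
```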

In [19]:
job_postings.dropna(subset=["job_title", "company_name"], inplace=True)
job_postings.head()


Unnamed: 0,job_id,job_title,job_location,search_location,job_posted_date,county,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills
0,2189.0,Software Engineer,Atlanta,atlanta,2025-08-16,Japan,Monthly,110870.58,,MediCare,"Project Management, Teamwork"
1,1379.0,Sales Executive,Boston,boston,2024-07-10,UK,Monthly,,86.14,NextGen Solutions,"Communication, Creativity, Project Management"
2,430.0,Data Analyst,Dallas,dallas,2025-05-19,Germany,Hourly,74008.55,53.35,DataBridge,"Leadership, Power BI, Project Management, SQL,..."
3,3207.0,Accountant,Miami,miami,2024-11-23,USA,Annual,62981.46,,SmartWare,"Leadership, Teamwork, Python, SQL"
4,449.0,Marketing Specialist,New York,new york,2025-04-26,Brazil,Hourly,51715.78,46.11,BrightPath,"Excel, Communication, Leadership, Power BI"


In [20]:
job_postings['salary_year_avg'] = job_postings['salary_year_avg'].fillna(job_postings['salary_year_avg'].mean())
job_postings.head()

Unnamed: 0,job_id,job_title,job_location,search_location,job_posted_date,county,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills
0,2189.0,Software Engineer,Atlanta,atlanta,2025-08-16,Japan,Monthly,110870.58,,MediCare,"Project Management, Teamwork"
1,1379.0,Sales Executive,Boston,boston,2024-07-10,UK,Monthly,74920.608519,86.14,NextGen Solutions,"Communication, Creativity, Project Management"
2,430.0,Data Analyst,Dallas,dallas,2025-05-19,Germany,Hourly,74008.55,53.35,DataBridge,"Leadership, Power BI, Project Management, SQL,..."
3,3207.0,Accountant,Miami,miami,2024-11-23,USA,Annual,62981.46,,SmartWare,"Leadership, Teamwork, Python, SQL"
4,449.0,Marketing Specialist,New York,new york,2025-04-26,Brazil,Hourly,51715.78,46.11,BrightPath,"Excel, Communication, Leadership, Power BI"
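Filling with the global mean gives every gap the same value (74920.608519 in the output above), regardless of job title. One possible alternative, sketched here on made-up data, is to fill each gap with the mean of its own `job_title` group. Whether that suits this dataset is an assumption, not something the notebook establishes:

```python
import numpy as np
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Teacher", "Teacher", "Teacher"],
    "salary_year_avg": [70000.0, np.nan, 40000.0, 50000.0, np.nan],
})

# Global mean: every missing row receives the same number
global_fill = toy["salary_year_avg"].fillna(toy["salary_year_avg"].mean())

# Per-title mean: transform("mean") returns a value aligned to each row
group_mean = toy.groupby("job_title")["salary_year_avg"].transform("mean")
group_fill = toy["salary_year_avg"].fillna(group_mean)

print(pd.DataFrame({"global": global_fill, "by_title": group_fill}))
```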


3️⃣ Remove Duplicates

Find how many duplicate job entries exist (same job_title, company_name, job_location).

Remove them.

In [21]:
job_postings.drop_duplicates(subset=["job_title", "company_name", "job_location"], inplace=True)
job_postings.shape


(1403, 11)
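The cell above removes duplicates but never reports how many there were; the shape drops from 5300 to 1403 rows. A minimal sketch of counting them first with `duplicated()`, on a made-up frame:

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Teacher", "Data Analyst"],
    "company_name": ["DataBridge", "DataBridge", "SmartWare", "DataBridge"],
    "job_location": ["Dallas", "Dallas", "Miami", "Boston"],
})
key_cols = ["job_title", "company_name", "job_location"]

# duplicated() marks every repeat after the first occurrence of a key
n_duplicates = toy.duplicated(subset=key_cols).sum()
deduped = toy.drop_duplicates(subset=key_cols)

print(f"{n_duplicates} duplicate rows, {len(deduped)} rows kept")
```

Rows that share a title and company but differ in location (the last row here) are not treated as duplicates.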

4️⃣ Fix Text Inconsistencies

Standardize all text columns to title case (e.g. software engineer → Software Engineer).

Fix typos in job_title (e.g. “Softwar Enginner” → “Software Engineer”).

In [22]:
job_postings.job_title = job_postings.job_title.str.title()
job_postings.job_title = job_postings.job_title.replace("SoftWar Engineer", "Software Engineer")

In [23]:
job_postings.job_title

0          Software Engineer
1            Sales Executive
2               Data Analyst
3                 Accountant
4       Marketing Specialist
                ...         
5236            Data Analyst
5252                 Teacher
5261         Sales Executive
5276             Data Anylst
5277       Software Engineer
Name: job_title, Length: 1403, dtype: object
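The output above still contains misspellings such as "Data Anylst". `Series.replace` with a plain string matches whole cell values exactly, and "SoftWar Engineer" cannot occur after `str.title()`, which lowercases every letter after the first in each word. A sketch of one way to handle this with a mapping; the typo list is limited to variants visible in this notebook's outputs, and the sample series is made up:

```python
import pandas as pd

# Made-up sample containing the misspellings seen in the outputs
titles = pd.Series(
    ["software engineer", "Softwar Enginner", "Accuntant", "Data Anylst", "Teacher"]
)

# Misspelled title -> intended title
typo_map = {
    "Softwar Enginner": "Software Engineer",
    "Accuntant": "Accountant",
    "Data Anylst": "Data Analyst",
}

# Normalise whitespace and case first, then map the known typos
cleaned = titles.str.strip().str.title().replace(typo_map)
print(cleaned.tolist())
```

Note that `str.title()` also turns acronyms into mixed case (for example "HR Manager" becomes "Hr Manager"), so acronym-heavy columns may need their own mapping.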

5️⃣ Clean Salary Data

Convert salary_year_avg and salary_hour_avg to numeric.

Replace impossible salary values (e.g. negative or >1,000,000) with NaN.

In [24]:
import numpy as np
job_postings.salary_year_avg = pd.to_numeric(job_postings.salary_year_avg, errors='coerce')
job_postings.salary_hour_avg = pd.to_numeric(job_postings.salary_hour_avg, errors='coerce')

# Set impossible values (negative or above 1,000,000) to NaN
for col in ['salary_year_avg', 'salary_hour_avg']:
    job_postings.loc[~job_postings[col].between(0, 1_000_000), col] = np.nan

In [25]:
job_postings.salary_year_avg 

0       110870.580000
1        74920.608519
2        74008.550000
3        62981.460000
4        51715.780000
            ...      
5236     80383.660000
5252     84306.670000
5261     74920.608519
5276     89309.190000
5277     89531.970000
Name: salary_year_avg, Length: 1403, dtype: float64

In [26]:
job_postings.salary_hour_avg 

0         NaN
1       86.14
2       53.35
3         NaN
4       46.11
        ...  
5236      NaN
5252    85.60
5261    16.19
5276    10.22
5277    19.82
Name: salary_hour_avg, Length: 1403, dtype: float64

In [27]:
job_postings.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1403 entries, 0 to 5277
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   job_id           1371 non-null   float64
 1   job_title        1403 non-null   object 
 2   job_location     1316 non-null   object 
 3   search_location  1316 non-null   object 
 4   job_posted_date  1403 non-null   object 
 5   county           1403 non-null   object 
 6   salary_rate      1012 non-null   object 
 7   salary_year_avg  1403 non-null   float64
 8   salary_hour_avg  1168 non-null   float64
 9   company_name     1403 non-null   object 
 10  job_skills       1344 non-null   object 
dtypes: float64(3), object(8)
memory usage: 131.5+ KB


6️⃣ Descriptive Insights

What is the average yearly salary across all jobs?

Which 3 companies offer the highest average salary?

How many unique job titles are there?

In [28]:
job_avg_yearly_salary = job_postings.groupby('job_title', as_index=False)['salary_year_avg'].mean().round(2)
job_avg_yearly_salary.sort_values(by='salary_year_avg', ascending=False, inplace=True)
job_avg_yearly_salary.head()

Unnamed: 0,job_title,salary_year_avg
0,Accountant,80434.76
1,Accuntant,79637.62
5,Hr Manager,78868.3
10,Softwar Enginner,77615.69
8,Project Manager,75845.32


In [35]:
company_with_highest_salary = job_postings.groupby('company_name', as_index=False)['salary_year_avg'].mean()
company_with_highest_salary.sort_values(by='salary_year_avg', ascending=False, inplace=True)
company_with_highest_salary.head(3)

Unnamed: 0,company_name,salary_year_avg
5,HealthLink,78972.688507
3,FinEddge,77242.800161
8,NextGen Solutions,77199.284519
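The cells above group by job title and by company, but the first and third questions (overall average salary, number of unique titles) need no grouping. A minimal sketch of all three answers on a made-up frame:

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "job_title": ["Data Analyst", "Teacher", "Accountant", "Data Analyst", "Teacher"],
    "company_name": ["A", "B", "C", "D", "A"],
    "salary_year_avg": [70000.0, 40000.0, 65000.0, 80000.0, 50000.0],
})

# Average yearly salary across all jobs
overall_avg = toy["salary_year_avg"].mean()

# Three companies with the highest average salary
top3_companies = toy.groupby("company_name")["salary_year_avg"].mean().nlargest(3)

# Number of distinct job titles (typos count as separate titles until fixed)
n_unique_titles = toy["job_title"].nunique()

print(overall_avg, top3_companies.index.tolist(), n_unique_titles)
```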


In [37]:
job_postings.isna()

Unnamed: 0,job_id,job_title,job_location,search_location,job_posted_date,county,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills
0,False,False,False,False,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
5236,False,False,True,True,False,False,False,False,True,False,False
5252,False,False,True,True,False,False,False,False,False,False,False
5261,False,False,False,False,False,False,False,False,False,False,False
5276,False,False,False,False,False,False,True,False,False,False,True


In [34]:
import pandas as pd
job_post = pd.read_csv("../DATA/job_postings_flats.csv")

In [35]:
job_post.shape

(10000, 12)

In [36]:
job_post.head()

Unnamed: 0,job_id,job_title,job_title_short,job_location,search_location,job_posted_date,county,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills
0,1,Mechanical Engineer,Mechanical,Seattle,"Seattle, USA",2025-09-26,Kings County,Hourly,80685.82,38.79,NextGen Corp,"Customer Service, Communication"
1,2,Graphic Designer,Graphic,Austin,"Austin, USA",2025-04-26,Cook County,Hourly,63075.25,30.32,GreenWorks,"Python, SQL, Excel"
2,3,Customer Support Representative,Customer,Houston,"Houston, USA",2024-11-18,Los Angeles County,Hourly,43600.14,20.96,GreenWorks,"JavaScript, React, Node.js"
3,4,Project Manager,Project,Houston,"Houston, USA",2024-11-27,Los Angeles County,Yearly,133826.77,64.34,DataNest,"Photoshop, Illustrator"
4,5,Mechanical Engineer,Mechanical,San Francisco,"San Francisco, USA",2024-11-05,King County,Hourly,85309.76,41.01,BrightPath,"Accounting, QuickBooks, Excel"


In [37]:
job_post.job_title = job_post.job_title.replace('Software Engineer', 'Data Engineer')

In [38]:
job_post.job_title_short = job_post.job_title_short.replace('Software', 'Engineer')

In [39]:
job_post.job_title = job_post.job_title.replace('Mechanical Engineer', 'Data Science')
job_post.job_title = job_post.job_title.replace('Financial Analyst', 'Financial Data Analyst')

In [40]:
job_post.job_title_short = job_post.job_title_short.replace('Mechanical', 'Science')

In [41]:
job_post.head()

Unnamed: 0,job_id,job_title,job_title_short,job_location,search_location,job_posted_date,county,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills
0,1,Data Science,Science,Seattle,"Seattle, USA",2025-09-26,Kings County,Hourly,80685.82,38.79,NextGen Corp,"Customer Service, Communication"
1,2,Graphic Designer,Graphic,Austin,"Austin, USA",2025-04-26,Cook County,Hourly,63075.25,30.32,GreenWorks,"Python, SQL, Excel"
2,3,Customer Support Representative,Customer,Houston,"Houston, USA",2024-11-18,Los Angeles County,Hourly,43600.14,20.96,GreenWorks,"JavaScript, React, Node.js"
3,4,Project Manager,Project,Houston,"Houston, USA",2024-11-27,Los Angeles County,Yearly,133826.77,64.34,DataNest,"Photoshop, Illustrator"
4,5,Data Science,Science,San Francisco,"San Francisco, USA",2024-11-05,King County,Hourly,85309.76,41.01,BrightPath,"Accounting, QuickBooks, Excel"
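The four renaming cells above can be collapsed into one dictionary per column, since `Series.replace` accepts a mapping of old to new values. A sketch on a made-up frame, using the same replacement pairs as the cells above:

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "job_title": ["Mechanical Engineer", "Software Engineer",
                  "Financial Analyst", "Project Manager"],
    "job_title_short": ["Mechanical", "Software", "Financial", "Project"],
})

title_map = {
    "Software Engineer": "Data Engineer",
    "Mechanical Engineer": "Data Science",
    "Financial Analyst": "Financial Data Analyst",
}
short_map = {"Software": "Engineer", "Mechanical": "Science"}

# Values missing from the mapping are left unchanged
toy["job_title"] = toy["job_title"].replace(title_map)
toy["job_title_short"] = toy["job_title_short"].replace(short_map)
print(toy)
```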


In [43]:
job_post.to_csv("../DATA/job_postings_flat.csv", index=False)  # skip the index so no "Unnamed: 0" column appears on reload

In [1]:
import os
import mysql.connector
import pandas as pd

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",                            # your MySQL username
    password=os.environ["MYSQL_PASSWORD"],  # read from an environment variable (name is an assumption), not hardcoded
    database="mychallenge"                  # name of the database
)

# Read data into pandas
query = "SELECT * FROM customers"
df = pd.read_sql(query, conn)

# Close connection
conn.close()

# Display result
print(df.head())


   customer_id customer_name country
0            1         Alice     USA
1            2           Bob  Canada
2            3       Charlie      UK




In [2]:
df.to_csv("customers.csv", index=False)

In [7]:
import os
import mysql.connector
import pandas as pd

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",                            # your MySQL username
    password=os.environ["MYSQL_PASSWORD"],  # read from an environment variable (name is an assumption), not hardcoded
    database="mychallenge"                  # name of the database
)

# Read data into pandas
query = "SELECT * FROM products"
d = pd.read_sql(query, conn)

# Close connection
conn.close()

# Display result
print(d.head())

   product_id product_name   price
0           1       Laptop  1200.0
1           2        Phone   800.0
2           3   Headphones   150.0
3           4       Tablet   500.0




In [8]:
d.to_csv("products.csv", index=False)

In [9]:
import os
import mysql.connector
import pandas as pd

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",                            # your MySQL username
    password=os.environ["MYSQL_PASSWORD"],  # read from an environment variable (name is an assumption), not hardcoded
    database="mychallenge"                  # name of the database
)

# Read data into pandas
query = "SELECT * FROM transactions"
df = pd.read_sql(query, conn)

# Close connection
conn.close()

# Display result
print(df.head())

   id  customer_id  product_id  quantity transaction_date
0   1            1           1         1       2025-09-01
1   2            2           2         1       2025-09-02
2   3            3           3         2       2025-09-02
3   4            1           4         1       2025-09-03
4   5            2           3         1       2025-09-04




In [10]:
df.to_csv("transactions.csv", index=False)
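The three export cells above repeat the same connect, query, and write steps for each table. A rough sketch of folding them into one loop follows. The helper name `export_tables` is hypothetical, and an in-memory SQLite database stands in for the MySQL server so the example runs anywhere; with MySQL, the connection object from the cells above would be passed instead.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def export_tables(conn, tables, out_dir):
    """Write each table to <out_dir>/<table>.csv and return the file paths."""
    out_dir = Path(out_dir)
    written = []
    for table in tables:
        # Table names come from a fixed list; never interpolate user input here
        frame = pd.read_sql(f"SELECT * FROM {table}", conn)
        path = out_dir / f"{table}.csv"
        frame.to_csv(path, index=False)
        written.append(path)
    return written


# Stand-in database (assumption) so the sketch runs without a MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT, country TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'USA'), (2, 'Bob', 'Canada')")
conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 'Laptop', 1200.0)")
conn.commit()

out_dir = tempfile.mkdtemp()
paths = export_tables(conn, ["customers", "products"], out_dir)
conn.close()
print([p.name for p in paths])
```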