
In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mohithsairamreddy/salary-data")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/mohithsairamreddy/salary-data/versions/4


In [2]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
files = os.listdir(path)
csv_file = [f for f in files if f.endswith('.csv')][0]  # Finds the first CSV file
csv_path = os.path.join(path, csv_file)
df = pd.read_csv(csv_path)
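
The cell above takes whichever CSV `os.listdir` happens to return first. `os.listdir` makes no guarantee about ordering, and the `[0]` lookup raises a bare `IndexError` if the folder contains no CSV. A minimal sketch of a more defensive lookup follows; the helper name `find_first_csv` is hypothetical and not part of the original notebook.

```python
import glob
import os


def find_first_csv(directory):
    """Return the alphabetically first CSV path in `directory` (hypothetical helper)."""
    # glob.escape protects against special characters in the directory name;
    # sorted() makes the choice deterministic across runs and platforms.
    pattern = os.path.join(glob.escape(directory), "*.csv")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No CSV file found in {directory!r}")
    return matches[0]
```

With this helper, the loading step could be written as `df = pd.read_csv(find_first_csv(path))`.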

# I. Data Exploration & Preprocessing

## A. Data Exploration

Using the code below, we gathered the following information:


*   6704 rows, 6 columns
*   Columns: \[Age, Gender, Education Level, Job Title, Years of Experience, Salary]
*   2 rows are completely empty
*   Number of missing values per column (between 2 and 5)
*   Summary statistics for each column
*   Value counts for the categorical columns 'Gender', 'Education Level', and 'Job Title'


In [4]:
# Inspect the DataFrame `df` loaded in the previous cell
# 1️⃣ Check the first few rows
print("📌 First 5 Rows:")
print(df.head(), "\n")

# 2️⃣ Check the number of rows and columns
print("📌 Shape of the dataset (rows, columns):", df.shape, "\n")

# 3️⃣ Get column names
print("📌 Column Names:")
print(df.columns, "\n")

# 4️⃣ Get basic info about dataset (data types, non-null counts, memory usage)
print("📌 Dataset Information:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print()

📌 First 5 Rows:
    Age  Gender Education Level          Job Title  Years of Experience  \
0  32.0    Male      Bachelor's  Software Engineer                  5.0   
1  28.0  Female        Master's       Data Analyst                  3.0   
2  45.0    Male             PhD     Senior Manager                 15.0   
3  36.0  Female      Bachelor's    Sales Associate                  7.0   
4  52.0    Male        Master's           Director                 20.0   

     Salary  
0   90000.0  
1   65000.0  
2  150000.0  
3   60000.0  
4  200000.0   

📌 Shape of the dataset (rows, columns): (6704, 6) 

📌 Column Names:
Index(['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience',
       'Salary'],
      dtype='object') 

📌 Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6704 entries, 0 to 6703
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  670

In [5]:
# 5️⃣ Check for missing values
print("📌 Missing Values Per Column:")
print(df.isnull().sum(), "\n")

# 6️⃣ Count completely empty rows
print("📌 Number of Fully Empty Rows:", df.isnull().all(axis=1).sum(), "\n")

# 7️⃣ Summary statistics for numerical columns
print("📌 Summary Statistics for Numerical Columns:")
print(df.describe(), "\n")

# 8️⃣ Summary statistics for categorical columns
print("📌 Summary Statistics for Categorical Columns:")
print(df.describe(include=['object']), "\n")

# 9️⃣ Check value counts in categorical columns
categorical_cols = ['Gender', 'Education Level', 'Job Title']
for col in categorical_cols:
    if col in df.columns:
        print(f"📌 Unique values in '{col}':")
        print(df[col].value_counts(), "\n")

📌 Missing Values Per Column:
Age                    2
Gender                 2
Education Level        3
Job Title              2
Years of Experience    3
Salary                 5
dtype: int64 

📌 Number of Fully Empty Rows: 2 

📌 Summary Statistics for Numerical Columns:
               Age  Years of Experience         Salary
count  6702.000000          6701.000000    6699.000000
mean     33.620859             8.094687  115326.964771
std       7.614633             6.059003   52786.183911
min      21.000000             0.000000     350.000000
25%      28.000000             3.000000   70000.000000
50%      32.000000             7.000000  115000.000000
75%      38.000000            12.000000  160000.000000
max      62.000000            34.000000  250000.000000 

📌 Summary Statistics for Categorical Columns:
       Gender    Education Level          Job Title
count    6702               6701               6702
unique      3                  7                193
top      Male  Bachelor's Deg
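
The per-column counts above do not say how many rows are affected, because a single row can be missing several values. A sketch of a row-level count on a small made-up frame is shown below; the `toy` data is illustrative only and not taken from the salary dataset.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0, np.nan],
    "Salary": [90000.0, np.nan, np.nan, 60000.0],
})

missing = toy.isnull()
rows_fully_empty = int(missing.all(axis=1).sum())  # every column is NaN
rows_any_missing = int(missing.any(axis=1).sum())  # at least one column is NaN

print("Fully empty rows:", rows_fully_empty)
print("Rows with any missing value:", rows_any_missing)
```

Applying the same two expressions to `df` would give the number of fully empty rows and the number of rows with at least one missing value.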

## B. Data Cleaning

Based on the findings from the data exploration, we clean the data as follows:

*   Remove rows where all columns are empty.
*   Remove rows where at least one column has missing data.
*   Reset the index so that it is contiguous again after the rows are dropped.


In [6]:
print("Original Dataset Shape:", df.shape)
# Drop fully empty rows (where all columns are NaN)
df = df.dropna(how='all')

# Drop rows where at least one column has missing values
df = df.dropna(how='any')

# Reset index after dropping rows
df = df.reset_index(drop=True)

# Display new shape of the dataset
print("Updated Dataset Shape:", df.shape)


Original Dataset Shape: (6704, 6)
Updated Dataset Shape: (6698, 6)
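
The shapes above imply that 6704 - 6698 = 6 rows had at least one missing value. Because a fully empty row also has at least one missing value, `dropna(how='any')` alone removes everything that `dropna(how='all')` would, so the first call is harmless but redundant. A quick check on made-up data follows; `toy` is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0],
    "Salary": [90000.0, np.nan, np.nan],
})

# Two-step cleaning, mirroring the cell above
two_step = toy.dropna(how="all").dropna(how="any").reset_index(drop=True)

# Single-step alternative
one_step = toy.dropna(how="any").reset_index(drop=True)

# Compare the two results
print("Same result:", two_step.equals(one_step))
```

Keeping both calls can still be reasonable if the intent is to document the two kinds of missing data separately.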
