In [None]:
# Your Task
# Download and import the Data Science Job Salary dataset.
# Normalize the ‘salary’ column using Min-Max normalization which scales all salary values between 0 and 1.
# Implement dimensionality reduction like Principal Component Analysis (PCA) or t-SNE to reduce the number of features (columns) in the dataset.
# Group the dataset by the ‘experience_level’ column and calculate the average and median salary for each experience level (e.g., Junior, Mid-level, Senior).
# Hint:
# Normalization matters when features have very different ranges. Salary data, for example, can span a wide range (e.g., from $20,000 to $200,000).
# Min-Max normalization maps each value x to (x - min) / (max - min), so the smallest salary becomes 0 and the largest becomes 1.
# This helps when the data feeds machine learning models, because some algorithms (like k-nearest neighbors or neural networks) perform better when features are on a comparable scale. Scaling keeps a large-valued feature such as salary from dominating the learning process.

# Dimensionality reduction simplifies complex datasets by reducing the number of variables under consideration. This makes the data more manageable and helps avoid the curse of dimensionality, where data becomes sparse as the number of features grows and models need far more samples to generalize.
# PCA, for instance, keeps the directions that carry the most variance in the dataset while discarding low-variance directions, which often correspond to noise and redundancy.
# It can also speed up model training and make it possible to visualize the data in two or three dimensions.

# Aggregating data helps in understanding trends within subgroups of the dataset.
# Calculating the average and median salary for each experience level shows how compensation is distributed across job levels. The median is less sensitive to outliers than the mean, so reporting both shows whether a few very high or very low salaries skew a group. This kind of aggregation can help answer business questions like “How does salary evolve with experience?” or “What is the salary distribution for senior-level roles?”

In [12]:
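import pandas as pd

# Path to the downloaded Data Science Job Salary CSV file
dataset_path = "datascience_salaries.csv"

dss = pd.read_csv(dataset_path)

# Work on a copy so the raw data stays untouched
df = dss.copy()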

In [13]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df['salary_normalized'] = scaler.fit_transform(df[['salary']])

df[['salary', 'salary_normalized']].head()


   salary  salary_normalized
0  149000           0.601010
1  120000           0.454545
2   68000           0.191919
3  120000           0.454545
4  149000           0.601010
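
As a sanity check on the scaler, the Min-Max formula can also be applied by hand. The sketch below uses a few made-up salaries (not rows from the dataset) and compares the manual result with `MinMaxScaler`. The names `salaries`, `manual` and `scaled` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salaries for illustration, not taken from the dataset
salaries = pd.DataFrame({"salary": [20000, 65000, 110000, 200000]})

# Min-Max formula applied by hand: (x - min) / (max - min)
col = salaries["salary"]
manual = (col - col.min()) / (col.max() - col.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(salaries[["salary"]]).ravel()

print(manual.tolist())              # [0.0, 0.25, 0.5, 1.0]
print(np.allclose(manual, scaled))  # True
```

Both approaches give the same values. The scaler is mainly useful when the fitted minimum and maximum need to be reused on new data.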


In [14]:
from sklearn.decomposition import PCA

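# One-hot encode the categorical columns; drop_first avoids redundant dummy columns
df_encoded = pd.get_dummies(df, drop_first=True)

# Exclude both salary columns so the components describe only the other features
X = df_encoded.drop(columns=['salary', 'salary_normalized'], errors='ignore')

# Note: X is not standardized, so the column with the largest variance
# dominates PC1 (likely why the first ratio printed below is close to 1)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)

df_pca = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

print("Explained Variance Ratio:", pca.explained_variance_ratio_)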


Explained Variance Ratio: [9.99995550e-01 9.28184157e-07]
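
A first component that explains nearly all the variance usually means one unscaled column is much larger in magnitude than the rest. The sketch below shows the effect on synthetic data (the columns `row_id`, `is_remote` and `is_senior` are made up for illustration and are not assumed to exist in the dataset) and how standardizing with `StandardScaler` before PCA changes the picture.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an encoded feature matrix (hypothetical columns)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "row_id": np.arange(200),              # large-scale numeric column
    "is_remote": rng.integers(0, 2, 200),  # 0/1 dummy column
    "is_senior": rng.integers(0, 2, 200),  # 0/1 dummy column
})

# PCA on raw values: the large-scale column absorbs almost all the variance
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("Raw:   ", raw_ratio)
print("Scaled:", scaled_ratio)
```

On the notebook's data, the same idea would mean passing `StandardScaler().fit_transform(X)` to `pca.fit_transform` instead of `X`. Whether that is appropriate depends on which columns `X` actually contains.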


In [15]:
df.groupby('experience_level')['salary'].mean()

experience_level        salary
Entry             36111.111111
Executive         76076.923077
Mid               51786.885246
Senior            75088.033012
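
The task also asks for the median salary per experience level, which the cell above does not compute. A minimal sketch of getting both statistics in one call with `agg`, using made-up rows (the `demo` frame is hypothetical, not the dataset):

```python
import pandas as pd

# Hypothetical rows for illustration, not taken from the dataset
demo = pd.DataFrame({
    "experience_level": ["Entry", "Entry", "Mid", "Mid", "Mid", "Senior", "Senior"],
    "salary": [30000, 40000, 50000, 55000, 90000, 70000, 80000],
})

# One row per experience level, with mean and median side by side
summary = demo.groupby("experience_level")["salary"].agg(["mean", "median"])
print(summary)
```

Applied to the notebook's data, the equivalent call would be `df.groupby('experience_level')['salary'].agg(['mean', 'median'])`. In the made-up `Mid` group the single 90000 salary pulls the mean (65000) above the median (55000), which is the kind of skew the two statistics reveal together.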
