## 📘 Notebook Summary

### 🎯 **Objective**
To guide learners through a complete data analysis workflow using Python in Google Colab, from importing a dataset to generating insights through visualization and exploratory data analysis (EDA).

### 🧭 **Purpose**
- Understand how to clean and prepare data for analysis.
- Explore relationships between features like salary, gender, education, and experience.
- Visualize trends and patterns using appropriate charts.
- Generate actionable insights for decision-making.

### 🔄 **Process Overview**
1. Setup & Import
2. Data Cleaning
3. Data Overview
4. Exploratory Data Analysis (EDA)
5. Visualization & Insight Generation
6. Summary & Next Steps

> 💡 **Tip for Learners**: You can ask Gemini or Copilot to help you write code snippets, explain concepts, or suggest next steps. Try prompts like:
> - “Generate a boxplot comparing salary by gender.”
> - “Explain how IQR helps detect outliers.”
> - “What’s the next step after cleaning missing values?”

---

## 🧪 1. Setup & Import


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from google.colab import files

# Upload CSV file
uploaded = files.upload()
df = pd.read_csv(next(iter(uploaded)))


---
## 🧹 2. Data Cleaning


In [None]:
# Rename columns for consistency
df.rename(columns={'Title': 'Job Title', 'Education level': 'Education Level'}, inplace=True)

# Identify column types
numerical_cols = ['Age', 'Years of Experience', 'Salary']
categorical_cols = ['Gender', 'Education Level']

# Display basic info
print("Shape:", df.shape)
print("Numerical Columns:", numerical_cols)
print("Categorical Columns:", categorical_cols)
df.info()

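The cell above renames columns and lists the column types, but it does not yet deal with missing values, duplicates, or inconsistent text. The following is a minimal sketch of one possible cleaning pass, not the notebook's prescribed method. The `clean_salary_data` helper and the small `sample` frame are hypothetical, and dropping incomplete rows is an assumption; imputation may suit your dataset better.

```python
import pandas as pd

def clean_salary_data(df, numerical_cols, categorical_cols):
    """Return a cleaned copy: trimmed text, numeric types, no duplicates or missing rows."""
    out = df.copy()
    # Strip stray whitespace from text columns so 'Male ' and 'Male' match
    for col in categorical_cols:
        out[col] = out[col].str.strip()
    # Coerce numeric columns; entries that cannot be parsed become NaN
    for col in numerical_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Drop exact duplicate rows, then rows missing any analysed column
    out = out.drop_duplicates()
    out = out.dropna(subset=numerical_cols + categorical_cols)
    return out.reset_index(drop=True)

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Age": [25, 32, 32, 41, None],
    "Years of Experience": [2, 8, 8, "n/a", 20],
    "Salary": [50000, 85000, 85000, 120000, 150000],
    "Gender": ["Female", "Male ", "Male ", "Female", "Male"],
    "Education Level": ["Bachelor's", "Master's", "Master's", "PhD", None],
})
cleaned = clean_salary_data(
    sample,
    numerical_cols=["Age", "Years of Experience", "Salary"],
    categorical_cols=["Gender", "Education Level"],
)
print(cleaned)
```

To apply the same idea to the uploaded data, you could pass `df` together with the `numerical_cols` and `categorical_cols` lists defined above.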

---

## 📊 3. Data Overview


In [None]:
# Overview of dataset
print(df.shape)
display(df.describe())  # a bare df.describe() mid-cell shows nothing; display() renders it
print(df.dtypes)
display(df.head())
print("Column Names:", df.columns.tolist())
print("Missing Values:\n", df.isnull().sum())

---

## 🔍 4. Exploratory Data Analysis (EDA)

### 🎨 Salary Distribution

In [None]:
sns.histplot(df['Salary'], kde=True)
plt.title("Salary Distribution")
plt.show()

### 👥 Salary by Gender


In [None]:
sns.boxplot(x='Gender', y='Salary', data=df)
plt.title("Salary by Gender")
plt.show()

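A boxplot is easier to interpret alongside the numbers behind it. As a rough sketch, the per-group counts, medians, and means can be tabulated with `groupby` and `agg`. The `salary_summary` helper and the `sample` values below are hypothetical and only illustrate the call pattern.

```python
import pandas as pd

def salary_summary(df, group_col, value_col="Salary"):
    """Count, median, and mean of value_col for each level of group_col."""
    return (
        df.groupby(group_col)[value_col]
        .agg(["count", "median", "mean"])
        .sort_values("median", ascending=False)
    )

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Salary": [60000, 55000, 80000, 75000, 70000],
})
summary = salary_summary(sample, "Gender")
print(summary)
```

The same helper should work for other grouping columns, for example `salary_summary(df, 'Education Level')`.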
### 🎓 Salary by Education Level

In [None]:
sns.violinplot(x='Education Level', y='Salary', data=df)
plt.title("Salary by Education Level")
plt.show()

sns.boxplot(x='Education Level', y='Salary', data=df)
plt.title("Salary Distribution by Education Level")
plt.xticks(rotation=45)
plt.show()

### 📈 Salary vs. Age


In [None]:
sns.scatterplot(x='Age', y='Salary', data=df)
plt.title("Salary vs. Age")
plt.show()

### 📉 Salary vs. Experience

In [None]:
sns.lineplot(x='Years of Experience', y='Salary', data=df)
plt.title("Salary vs. Experience")
plt.show()

### 🧑‍💼 Salary by Job Title

In [None]:
df.groupby('Job Title')['Salary'].mean().sort_values().plot(kind='barh')
plt.title("Average Salary by Job Title")
plt.xlabel("Salary")
plt.show()

### 🔗 Correlation Heatmap


In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Feature Correlation")
plt.show()

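The heatmap shows every pairwise correlation at once. When the question is specifically which numeric features track salary, a ranked list can be quicker to read. The sketch below assumes Pearson correlation (the pandas default); the `correlations_with` helper and the `sample` frame are hypothetical. Non-numeric columns such as `Gender` are skipped by `numeric_only=True`, so they do not appear in the result.

```python
import pandas as pd

def correlations_with(df, target="Salary"):
    """Pearson correlation of each numeric column with target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Years of Experience": [1, 5, 9, 15, 20],
    "Salary": [40000, 60000, 80000, 100000, 120000],
    "Gender": ["F", "M", "F", "M", "F"],
})
ranked = correlations_with(sample)
print(ranked)
```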
---

## 🧠 Insight Summary

- **Experience and education** are positively associated with salary.
- **Gender-based salary gaps** may exist; further statistical testing is recommended.
- **Job title** appears to be a strong predictor of salary.
- **Outliers** appear in the salary data; they may represent executives or data errors.

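The outlier point above can be checked numerically with the interquartile range (IQR) rule mentioned in the learner tip: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged. This is a minimal sketch under that common convention. The `iqr_outliers` helper, the multiplier `k`, and the `salaries` values are hypothetical; flagged values still need a manual look before being treated as errors.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical salaries, for illustration only
salaries = pd.Series(
    [48000, 52000, 55000, 58000, 61000, 64000, 250000], name="Salary"
)
mask = iqr_outliers(salaries)
print(salaries[mask])  # rows flagged as potential outliers
```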
---

## 🚀 Next Steps

- Apply statistical tests (e.g., t-test, ANOVA) to validate observed differences.
- Build predictive models (e.g., linear regression) to forecast salary.
- Create interactive dashboards with Plotly or Streamlit.
- Export cleaned dataset for further use.

In [None]:

df.to_csv("cleaned_dataset.csv", index=False)

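As a starting point for the modeling step listed above, a one-feature linear regression can be fitted with `numpy.polyfit`. This is only a sketch: the `fit_salary_line` helper and the `sample` data are hypothetical, the sample is deliberately a perfect line so the result is easy to verify, and the R² reported here is in-sample. A real model would use more features and a held-out test set.

```python
import numpy as np
import pandas as pd

def fit_salary_line(df, x_col="Years of Experience", y_col="Salary"):
    """Least-squares line y = slope * x + intercept, plus in-sample R^2."""
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Years of Experience": [0, 2, 4, 6, 8],
    "Salary": [40000, 50000, 60000, 70000, 80000],
})
slope, intercept, r2 = fit_salary_line(sample)
print(f"slope={slope:.0f}, intercept={intercept:.0f}, R^2={r2:.3f}")
```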
---