# Employee Dataset Analysis with PySpark
## Comprehensive Analysis: EDA | SQL Analytics | AI/ML

This notebook demonstrates a complete data analysis pipeline using PySpark on the Employee Dataset.

### Contents:
1. **Setup and Data Loading**
2. **Exploratory Data Analysis (EDA)**
3. **SQL Analytics**
4. **Machine Learning (AI/ML)**
5. **Insights & Recommendations**

## 1. Setup and Data Loading

In [None]:
# Import required libraries
from employee_data_analysis import EmployeeDataAnalysis
import warnings

# Silence all Python warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
# Initialize the analysis object
analysis = EmployeeDataAnalysis(data_path="Employee_Complete_Dataset.csv")

In [None]:
# Load the dataset
df = analysis.load_data()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Run comprehensive EDA
analysis.exploratory_data_analysis()

### Custom EDA Queries
You can also run custom queries on the data:

In [None]:
# Example: count employees aged 25 to 35 (inclusive)
from pyspark.sql.functions import col

df.filter(col("Employee_age").between(25, 35)).count()

In [None]:
# Example: show 10 employees with a performance rating of 5
# (assumed to be the highest rating; rows are not sorted)
df.filter(col("performance_rating") == 5).select(
    "Employee_name", "Department", "Role", "Current_Salary", "years_experience"
).show(10, truncate=False)

## 3. SQL Analytics

In [None]:
# Run SQL analytics
analysis.sql_analytics()

### Custom SQL Queries
You can write your own SQL queries:

In [None]:
# Register dataframe as SQL table
df.createOrReplaceTempView("employees")

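# Custom query example: average salary by department and role for employees
# with more than 10 years of experience, top 10 by average salary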
custom_query = """
SELECT Department, Role, AVG(Current_Salary) as Avg_Salary
FROM employees
WHERE years_experience > 10
GROUP BY Department, Role
ORDER BY Avg_Salary DESC
LIMIT 10
"""

analysis.spark.sql(custom_query).show(truncate=False)
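
The query above hard-codes the experience threshold and the row limit. If you expect to vary them, one option is a small helper that builds the query text. This is only a sketch: `build_salary_query` is a hypothetical name, not part of `EmployeeDataAnalysis`, and the table and column names are the ones used in the cell above. Both parameters are converted with `int()` so that no arbitrary text ends up in the SQL string.

```python
def build_salary_query(min_years: int = 10, limit: int = 10) -> str:
    """Return the average-salary query for a given experience threshold.

    Hypothetical helper for illustration; it only formats text and does not touch Spark.
    """
    # int() raises on non-numeric text, so only an integer literal is interpolated.
    min_years = int(min_years)
    limit = int(limit)
    if min_years < 0 or limit <= 0:
        raise ValueError("min_years must be >= 0 and limit must be > 0")
    return f"""
SELECT Department, Role, AVG(Current_Salary) AS Avg_Salary
FROM employees
WHERE years_experience > {min_years}
GROUP BY Department, Role
ORDER BY Avg_Salary DESC
LIMIT {limit}
"""


query = build_salary_query(min_years=5, limit=3)
print(query)
```

In the notebook, the resulting string would be passed to `analysis.spark.sql(query).show(truncate=False)`, exactly as in the cell above.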

## 4. Machine Learning (AI/ML)

In [None]:
# Run machine learning models
analysis.machine_learning()

## 5. Insights & Recommendations

In [None]:
# Generate insights report
analysis.generate_insights_report()

## Complete Analysis Pipeline
Alternatively, you can run the complete analysis in one go:

In [None]:
# Uncomment to run complete analysis
# analysis.run_complete_analysis()

## Cleanup

In [None]:
# Stop Spark session when done
analysis.stop()
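
If a step fails partway through, the `analysis.stop()` cell may never run and the Spark session stays open. One way to guard against that is to run the steps inside `try`/`finally`. The sketch below is an assumption about how you might drive the object yourself: `run_steps` and `DemoAnalysis` are hypothetical names, and `DemoAnalysis` is only a stand-in so the example runs without Spark. The method names are the ones called earlier in this notebook; how `run_complete_analysis()` is implemented is not shown here.

```python
STEPS = (
    "load_data",
    "exploratory_data_analysis",
    "sql_analytics",
    "machine_learning",
    "generate_insights_report",
)


def run_steps(analysis, steps=STEPS):
    """Call each named method in order and always stop the session afterwards."""
    completed = []
    try:
        for name in steps:
            getattr(analysis, name)()
            completed.append(name)
    finally:
        # Runs on success and on failure, so the Spark session is released either way.
        analysis.stop()
    return completed


class DemoAnalysis:
    """Stand-in that records calls; replace with the real analysis object."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method name resolves to a function that just records the call.
        return lambda: self.calls.append(name)


demo = DemoAnalysis()
completed = run_steps(demo)
print(completed)
print(demo.calls[-1])
```

Return values, such as the DataFrame from `load_data()`, are discarded in this sketch; keep calling the methods directly if you need them.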

---
## Summary

This notebook demonstrated:
- ✅ Loading and exploring employee data with PySpark
- ✅ Comprehensive EDA with statistical analysis
- ✅ SQL-based analytics for business insights
- ✅ Machine Learning models for predictions
- ✅ Feature importance and model evaluation

### Next Steps:
1. Experiment with different ML algorithms
2. Create visualizations using matplotlib/seaborn
3. Deploy models for real-time predictions
4. Integrate with a Cassandra database for scalability