Project 3: HR Analytics - Predict Employee Attrition

1. Exploratory Data Analysis (EDA)

Tools: Python (Pandas, Seaborn)

Steps:
Load & clean data: Check for null values, outliers, and data types.

Univariate Analysis:

Plot attrition rate by department

Attrition vs. salary bands (Low/Medium/High)

Promotion history vs. attrition

Bivariate Analysis:

Correlation heatmap (Seaborn)

Crosstabs for categorical features (e.g., promotion_last_5years, salary) and for binned numeric features (e.g., satisfaction_level)

Sample Code:
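
A minimal sketch of the EDA steps above, assuming the data is already in a pandas DataFrame. The tiny inline dataset and the column names Department and left (1 = employee left) are hypothetical placeholders; salary, promotion_last_5years, and satisfaction_level are the features named above.

```python
import pandas as pd

# Hypothetical sample; in practice this would be loaded from the HR dataset file
df = pd.DataFrame({
    "Department": ["sales", "sales", "hr", "hr", "it", "it"],
    "salary": ["low", "low", "medium", "high", "low", "medium"],
    "promotion_last_5years": [0, 0, 1, 0, 0, 1],
    "satisfaction_level": [0.30, 0.45, 0.80, 0.70, None, 0.65],
    "left": [1, 1, 0, 0, 1, 0],  # assumed target column: 1 = employee left
})

# Load & clean: inspect null counts and data types
null_counts = df.isnull().sum()
dtypes = df.dtypes

# Attrition rate by department (the mean of a 0/1 column is the rate)
attrition_by_dept = df.groupby("Department")["left"].mean()

# Crosstab of salary band vs. attrition, as row proportions
salary_vs_attrition = pd.crosstab(df["salary"], df["left"], normalize="index")

# Correlation matrix of the numeric columns (the input for a heatmap)
corr = df.select_dtypes("number").corr()
```

With Seaborn installed, sns.heatmap(corr, annot=True) would draw the correlation heatmap from the last step.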

2. Classification Model

Tools: Python (scikit-learn)

3. SHAP Value Analysis

Tools: Python (SHAP)

Steps:
Use SHAP to explain individual predictions

Visualize feature importance

4. Power BI Dashboard

Tools: Power BI

Suggested Visuals:
Department-wise attrition rates

Attrition by salary and promotion

Overall KPIs (attrition rate %, avg satisfaction)

Top 5 features influencing attrition (from SHAP or model feature importance)

Data Preparation:
Export processed data and SHAP feature importances to CSV from Python

Load into Power BI for dashboard creation

5. Deliverables
Power BI Dashboard
Save as .pbix file

Include interactive filters (department, salary level)

Model Accuracy Report
Include:

Model used (Logistic Regression / Decision Tree)

Accuracy, Precision, Recall

Confusion matrix (as table or heatmap)
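
The metrics in this report can all be derived from the four cells of a binary confusion matrix. A minimal sketch, assuming the positive class means "employee left"; the function name and the counts are made-up illustration values, not results from this project.

```python
def classification_report_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Illustrative counts only
report = classification_report_from_counts(tp=40, fp=10, fn=20, tn=130)
```

Reporting precision and recall alongside accuracy matters here because attrition data is usually imbalanced, so accuracy alone can look good while most leavers are missed.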

PDF: Attrition Prevention Suggestions
Structure:

Executive Summary

Key Findings

E.g., "High attrition in Sales and Low Salary Band"

Recommendations

Increase engagement programs for high-risk departments

Review compensation structure

Provide growth opportunities (promotions/training)

Data-Driven Support

Include SHAP feature impact plots



TOP 50 INTERVIEW QUESTIONS FOR DATA ANALYST

1. What are the key differences between inner join and outer join in SQL?

Ans:Inner Join vs Outer Join in SQL
Inner Join: Returns only matching rows between tables.

Outer Join: Returns all records from one or both tables, filling in NULLs where there's no match (Left, Right, or Full).
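
The difference is easiest to see on two small tables. A minimal sketch using Python's built-in sqlite3 module; the employees and departments tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ben', 20), (3, 'Chen', NULL);
    INSERT INTO departments VALUES (10, 'Sales'), (30, 'HR');
""")

# INNER JOIN: only employees whose dept_id matches a department
inner_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.id
""").fetchall()

# LEFT OUTER JOIN: every employee, with NULL (None) where nothing matches
left_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.id
""").fetchall()
conn.close()
```

Here inner_rows holds only Asha, while left_rows keeps all three employees and fills dept_name with None for Ben and Chen.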

2. How do you handle missing data in a dataset?

Ans:Handling Missing Data
Drop rows/columns with missing values

Impute using mean, median, mode, or model-based approaches

Use indicators for missingness if valuable
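
The three options above can be sketched in pandas as follows. The DataFrame and the column names (age, city, age_was_missing) are hypothetical; which option is appropriate depends on how much data is missing and why.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Option 1: drop any row that has a missing value
dropped = df.dropna()

# Options 2 and 3: keep every row, flag missingness, then impute
imputed = df.copy()
imputed["age_was_missing"] = imputed["age"].isna().astype(int)       # indicator
imputed["age"] = imputed["age"].fillna(imputed["age"].median())      # numeric: median
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])  # categorical: mode
```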

3. What is the difference between variance and standard deviation?

Ans:Variance vs Standard Deviation
Variance: The average of the squared deviations from the mean; expressed in squared units.

Standard Deviation: The square root of the variance; expressed in the same units as the data.
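
A short worked example with Python's statistics module shows the relationship. The data values are arbitrary, and the population versions (pvariance, pstdev) are used here as an assumption; the sample versions divide by n - 1 instead of n.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
variance = statistics.pvariance(data)  # mean of squared deviations (squared units)
std_dev = statistics.pstdev(data)      # square root of the variance (original units)
```

For this data the mean is 5, the variance is 4, and the standard deviation is 2.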

4. Explain the concept of normalization in databases.

Ans:Normalization in Databases
Organizing data into related tables to reduce redundancy and avoid update, insert, and delete anomalies, using normal forms (1NF, 2NF, 3NF, etc.).



5. What is the role of a primary key in a relational database?

Ans:Primary Key Role
Uniquely identifies each record in a table.

Enforces entity integrity.

6. How would you detect outliers in a dataset?

Ans:Detecting Outliers
Statistical methods: Z-score, IQR

Visualization: Box plots, scatter plots

ML-based: Isolation Forest, DBSCAN
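
The two statistical methods can be sketched with the standard library alone. The function names, the 1.5 x IQR fence, the z-score threshold of 3, and the sample values are conventional or hypothetical choices, not fixed rules.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical salaries (in thousands) with one extreme value
salaries = [42, 45, 47, 48, 50, 51, 52, 53, 55, 58, 60, 250]
iqr_flags = iqr_outliers(salaries)
z_flags = zscore_outliers(salaries)
```

The IQR rule is generally the more robust of the two, because an extreme value inflates the mean and standard deviation that the z-score itself depends on.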



7. What is data wrangling and why is it important?

Ans:Data Wrangling
The process of cleaning and transforming raw data into usable formats.

Essential for analysis accuracy.

8. Describe a situation where you used data to solve a business problem.

Ans:Example of Solving a Business Problem
Reduced churn by 15% using a classification model that identified high-risk users based on engagement data.

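9. What is the difference between a clustered and non-clustered index?

Ans:Clustered vs Non-Clustered Index
Clustered: Determines the physical order of the data rows in the table, so a table can have only one.

Non-clustered: A separate structure that stores the indexed values with pointers to the data rows; a table can have several.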

10. Explain the difference between supervised and unsupervised learning.

Ans:Supervised vs Unsupervised Learning
Supervised: Labeled data (e.g., classification, regression)

Unsupervised: No labels (e.g., clustering, PCA)

11. What is the purpose of the GROUP BY clause in SQL?

Ans:GROUP BY in SQL
Groups rows that share the same values in one or more columns so that aggregate functions run once per group (e.g., SELECT region, SUM(sales) ... GROUP BY region).
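
A runnable illustration using Python's built-in sqlite3 module. The orders table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, sales REAL);
    INSERT INTO orders VALUES
        ('North', 100), ('North', 150), ('South', 200), ('South', 50), ('East', 75);
""")

# One output row per region, with each aggregate computed within its group
totals = conn.execute("""
    SELECT region, SUM(sales) AS total_sales, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()
conn.close()
```

Five input rows collapse to three output rows, one for each distinct region.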

12. How do you handle duplicate data entries in a dataset?

Ans:Handling Duplicate Data
Use drop_duplicates() in pandas or DISTINCT in SQL, or de-duplicate on key fields when only some columns define a unique record.
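
A minimal pandas sketch of both approaches. The customer_id and email columns are hypothetical, and treating customer_id as the key field is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@y.com"],
})

# Count exact duplicate rows before removing anything
n_exact_dupes = int(df.duplicated().sum())

# Remove rows that are identical across all columns
exact_deduped = df.drop_duplicates()

# De-dupe on a key field, keeping the first record per customer_id
key_deduped = df.drop_duplicates(subset=["customer_id"], keep="first")
```

Note that customer 3 survives exact de-duplication twice because the two emails differ; only the key-based pass reduces the data to one row per customer.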

13. What is a pivot table and how have you used it?

Ans:Pivot Table
A table summarizing data (e.g., totals, averages) based on rows and columns.

Used in Excel or Python (pandas.pivot_table())
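
A small sketch of pandas.pivot_table() on hypothetical sales data, summing revenue with regions as rows and products as columns.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 150, 200, 50, 75],
})

# Rows = region, columns = product, cells = total revenue
summary = pd.pivot_table(
    sales,
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,  # show 0 instead of NaN for empty combinations
)
```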



14. Explain the differences between a bar chart and a histogram.

Ans:Bar Chart vs Histogram
Bar Chart: For categorical data.

Histogram: For continuous data distribution.

15. How do you optimize a slow SQL query?

Ans:Optimize a Slow SQL Query
Use indexes, avoid SELECT *, optimize joins, analyze query plan.



16. What are the common KPIs used in business analysis?

Ans:Common KPIs
Revenue, conversion rate, churn, NPS, average order value, retention rate.

17. What is A/B testing and how is it used in data analysis?

Ans:A/B Testing
A controlled experiment that randomly assigns users to two versions (A vs B) and compares a chosen metric to measure the impact of the change.

18. How do you ensure data accuracy and integrity in a project?

Ans:Ensuring Data Accuracy
Validation rules, audits, data quality checks, version control.

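19. What is a correlation matrix and how do you interpret it?

Ans:Correlation Matrix
Shows the pairwise correlation coefficient (usually Pearson) for every pair of variables. Values close to ±1 indicate strong linear relationships; values near 0 indicate little or no linear relationship.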
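
A tiny pandas illustration of the two extremes. The columns and values are hypothetical and deliberately constructed to be perfectly linear.

```python
import pandas as pd

df = pd.DataFrame({
    "hours_worked": [1, 2, 3, 4, 5],
    "output_units": [2, 4, 6, 8, 10],  # rises in lockstep with hours
    "errors": [10, 8, 6, 4, 2],        # falls in lockstep with hours
})

# Symmetric matrix with 1.0 on the diagonal
corr_matrix = df.corr(method="pearson")
```

Here hours_worked vs output_units is +1 and hours_worked vs errors is -1; real data almost always falls somewhere in between.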

20. What is the difference between correlation and causation?

Ans:Correlation vs Causation
Correlation: Two variables tend to move together, which does not by itself mean that one causes the other.

Causation: One variable directly affects another.

21. Describe a data project where you used Python.

Ans:Python Project Example
Built an attrition prediction model using pandas, sklearn, SHAP, and Power BI.

22. What libraries do you use for data analysis in Python?

Ans:Python Libraries
Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, SHAP, Statsmodels

23. Explain the use of Pandas groupby() function.

Ans:groupby() in Pandas
Groups data based on keys and performs aggregation.
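
A minimal sketch of grouping and aggregating with named aggregations. The DataFrame and its columns (department, salary, left) are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "sales", "hr", "hr", "it"],
    "salary": [40, 60, 50, 70, 90],
    "left": [1, 0, 0, 0, 1],
})

# Split rows by department, then compute several aggregates per group
summary = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    attrition_rate=("left", "mean"),
    headcount=("left", "size"),
)
```

The result has one row per department, indexed by the grouping key.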

24. How do you deal with imbalanced datasets?

Ans:Dealing with Imbalanced Datasets
Resampling (SMOTE, undersampling), class weights, anomaly detection techniques.
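
Class weighting is the simplest of these to show without extra libraries. The sketch below weights each class by n_samples / (n_classes * class_count), which to my understanding is the heuristic behind scikit-learn's class_weight="balanced" option; the function name and the 90/10 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 90 employees stayed (0), 10 left (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

The minority class receives a weight of 5.0 and the majority class about 0.56, so each misclassified leaver costs the model roughly nine times more.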

25. What are the steps of a typical data analysis pipeline?

Ans:Data Analysis Pipeline
Define the objective (problem definition)

Collect data

Clean/wrangle data

Exploratory Data Analysis (EDA)

Modeling or statistical analysis (if needed)

Validation and testing

Visualization and reporting

Recommendations, and deployment or communication of results


26. What is the purpose of data visualization?

Ans:To present data in a graphical or pictorial format that makes it easier to identify trends, outliers, patterns, and insights, enabling more informed decision-making.

27. Explain the difference between ETL and ELT.

Ans:ETL (Extract, Transform, Load): Data is transformed before being loaded into the data warehouse.

ELT (Extract, Load, Transform): Data is loaded first and transformed inside the data warehouse, leveraging its compute power.



28. What is the difference between OLAP and OLTP systems?

Ans:OLAP (Online Analytical Processing): Used for complex queries and data analysis (e.g., dashboards, reports).

OLTP (Online Transaction Processing): Handles daily transactional data (e.g., banking systems, e-commerce orders).

29. How do you decide which chart to use for a dataset?

Ans:Depends on the goal:

Trend: Line chart

Comparison: Bar chart

Distribution: Histogram or box plot

Relationships: Scatter plot

Composition: Pie chart or stacked bar

30. What is time series analysis and where have you used it?

Ans:Analysis of data points collected over time to identify trends and seasonality, or to forecast future values.
Example: Forecasting sales or analyzing website traffic over months.

31. Describe your experience with Tableau or Power BI.

Ans:(Example answer): Created dashboards, used calculated fields, connected to SQL data sources, designed interactive reports with filters and drill-downs.



32. What are dimensions and measures in Tableau?

Ans:Dimensions: Qualitative fields (e.g., country, product name)

Measures: Quantitative fields used for calculations (e.g., sales, profit)

33. How do you track data quality over time?

Ans:Use metrics like completeness, accuracy, consistency, and timeliness. Implement automated checks, dashboards, and alerting systems.

34. What is multicollinearity and why is it a problem?

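Ans:Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. It inflates the variance of the coefficient estimates, which makes them unstable and makes it hard to judge the importance of individual predictors.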

35. How would you analyze user behavior on a website?

Ans:Use tools like Google Analytics or logs to track clicks, time on page, conversion paths, and heatmaps. Segment users and analyze funnel drop-offs.

36. What are your favorite Python functions for data analysis?

Ans:In pandas: groupby(), pivot_table(), describe(), value_counts(), apply(), and merge(). Beyond individual functions, matplotlib/seaborn for visualization and scikit-learn for modeling.

37. What is data cleaning and how do you perform it?

Ans:Removing inaccuracies or inconsistencies. Includes handling missing data, correcting formats, removing duplicates, and standardizing values.

38. What does the term 'data storytelling' mean to you?

Ans:Combining data, visuals, and narrative to explain insights in a compelling way that drives action and understanding, especially for non-technical audiences.

39. How do you handle large datasets efficiently?

Ans:Use chunking, indexing, vectorized operations, out-of-core or distributed libraries like Dask or PySpark, and optimized queries in SQL.

40. What are lag and lead functions in SQL?

Ans:LAG(): Accesses a previous row’s value

LEAD(): Accesses a following row’s value
Useful in time-series or sequential data comparisons.
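
A month-over-month comparison is a typical use. The sketch below runs on Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the monthly_sales table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (month TEXT, sales INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2024-01', 100), ('2024-02', 120), ('2024-03', 90);
""")

# LAG looks one row back, LEAD one row ahead, in the ORDER BY sequence
rows = conn.execute("""
    SELECT month,
           sales,
           LAG(sales)  OVER (ORDER BY month) AS prev_sales,
           LEAD(sales) OVER (ORDER BY month) AS next_sales,
           sales - LAG(sales) OVER (ORDER BY month) AS change_vs_prev
    FROM monthly_sales
    ORDER BY month
""").fetchall()
conn.close()
```

The first row has no previous value and the last row has no following value, so those cells come back as NULL (None in Python).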

41. What is a hypothesis test and when would you use it?

Ans:A statistical method to test an assumption (null hypothesis) about a population. Used to determine significance, e.g., A/B testing for web design changes.
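
For the A/B testing case, one common choice is a two-proportion z-test on conversion rates. This is a sketch under that assumption, using only the standard library; the function name, the visitor and conversion counts, and the 0.05 significance level are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both groups share one pooled rate
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 10% vs 15% conversion on 1,000 visitors each
z_stat, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
reject_null = p_value < 0.05
```

A p-value below the chosen significance level means the observed difference would be unlikely if the two versions truly performed the same.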

42. How do you explain complex data insights to non-technical stakeholders?

Ans:Use plain language, visual aids, focus on the “so what” and implications, relate insights to business goals.



43. What is the difference between a heatmap and a scatter plot?

Ans:Heatmap: Shows values with color intensity in a matrix format.

Scatter plot: Displays relationships between two numeric variables with points on a grid.

44. How do you validate a machine learning model?

Ans:Hold out data with a train/test split or cross-validation, then evaluate with metrics suited to the problem type: confusion matrix and ROC-AUC for classification, RMSE for regression.

45. Describe a challenging dataset you worked on.

Ans:(Example): A customer churn dataset with imbalanced classes and many missing values. I used oversampling (SMOTE), imputation, and feature engineering to improve model performance.

46. What is the role of feature engineering in data analysis?

Ans:Creating new input variables that improve model accuracy by capturing hidden relationships or patterns (e.g., extracting date parts or ratios).



47. What is the difference between a data analyst and a data scientist?

Ans:Data Analyst: Focuses on descriptive analytics, reporting, visualization.

Data Scientist: Builds predictive models, performs statistical inference, machine learning.

48. How do you prioritize tasks when working on multiple data projects?

Ans:Use frameworks like Eisenhower matrix or agile boards, prioritize by business impact, deadlines, and stakeholder urgency.

49. What steps do you take before starting a data analysis project?

Ans:Understand the problem

Define objectives and KPIs

Gather requirements

Identify data sources

Perform initial data exploration

50. Describe a situation where your analysis had a measurable business impact.

Ans:(Example): Built a churn prediction model that helped the retention team target high-risk users, reducing churn by 15% in three months.

