Question 1: What is the difference between AI, ML, DL, and Data Science? Provide a
brief explanation of each.


Answer-

Artificial Intelligence (AI)

AI is the broad field focused on building machines that can perform tasks that normally require human intelligence, such as reasoning, planning, problem-solving, vision, or language understanding.

- Scope: Broadest

- Techniques: Rule-based systems, search algorithms, logic systems, ML, robotics

- Applications: Chatbots, self-driving cars, game AI, recommendation systems

Machine Learning (ML)

ML is a subset of AI that enables systems to learn patterns from data without explicit programming.

- Scope: Subset of AI

- Techniques: Supervised, unsupervised, reinforcement learning

- Applications: Spam detection, fraud detection, stock prediction, face recognition

Deep Learning (DL)

DL is a subset of ML that uses neural networks with many layers. It typically performs best when large amounts of training data are available.

- Scope: Subset of ML

- Techniques: CNNs, RNNs, LSTMs, Transformers

- Applications: Computer vision, speech recognition, NLP, autonomous driving

Data Science

Data Science deals with extracting insights from data using statistics, ML, visualization, and domain knowledge.

- Scope: End-to-end data analysis

- Techniques: Data cleaning, feature engineering, ML models, dashboards, visualization

- Applications: Business analytics, marketing insights, forecasting, BI

Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent
them?


Answer-

Overfitting

The model learns noise instead of the underlying patterns → it performs well on training data but poorly on test data.

- Detection:
✓ High training accuracy, low test accuracy
✓ Large gap between training and validation loss

- Prevention:

- Cross-validation

- Regularization (L1/L2, dropout)

- Reduce complexity / pruning

- More training data

- Early stopping

Underfitting

The model is too simple → it performs poorly on both training and test data.

- Detection:
✓ Low training accuracy

- Prevention:

- Add more features

- Reduce regularization

- Use complex models (RF, XGBoost, NN)

Bias-Variance Tradeoff

- High bias → Underfitting

- High variance → Overfitting
- Goal: Balance both for optimal performance.
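
The gap between training and test error described above can be seen on a small experiment. The sketch below is only an illustration on synthetic data (a noisy sine curve, chosen here as an assumption): it fits polynomials of increasing degree with NumPy and prints both errors. A degree that is too low usually shows high error on both sets (underfitting), while a very high degree drives training error down without a matching improvement on the test set (overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a sine curve plus Gaussian noise (illustrative only).
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Alternate points between the training and test sets.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Comparing the two columns for each degree is the same check as comparing training and validation loss for any other model.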

Question 3: How would you handle missing values in a dataset? Explain at least three
methods with examples.

Answer-
- 1. Deletion

Listwise deletion: Remove rows with missing values.

Example:
If 3 out of 1000 rows have missing Age → delete them.
Use only when the share of missing rows is small.


- 2. Imputation (Mean / Median / Mode)

Example:
Missing Age → replace with:

Mean if distribution is normal

Median if distribution is skewed

Mode for categorical

- 3. Predictive Modeling (Advanced)

Use ML models to predict missing values.

Example: Use RandomForestRegressor to predict missing Salary using Age, Experience, Gender etc.
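
The three methods can be sketched with pandas on a toy table. All values below are made up for illustration, and the predictive step uses a simple straight-line fit of Salary on Experience as a stand-in for the RandomForestRegressor mentioned above, purely to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are made up for illustration).
df = pd.DataFrame({
    "Age": [25, 32, np.nan, 41, 29, 55],
    "Experience": [2, 8, 5, 15, 4, 30],
    "Salary": [30000, 52000, 40000, np.nan, 36000, 120000],
    "City": ["Delhi", "Mumbai", None, "Delhi", "Delhi", "Pune"],
})

# 1. Listwise deletion: keep only fully observed rows.
dropped = df.dropna()

# 2. Simple imputation: median for numeric Age, mode for categorical City.
imputed = df.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["City"] = imputed["City"].fillna(imputed["City"].mode()[0])

# 3. Predictive imputation: fit Salary ~ Experience on the observed rows,
#    then predict the missing Salary from that line.
known = imputed["Salary"].notna()
slope, intercept = np.polyfit(
    imputed.loc[known, "Experience"], imputed.loc[known, "Salary"], 1
)
imputed.loc[~known, "Salary"] = slope * imputed.loc[~known, "Experience"] + intercept

print(dropped.shape)
print(imputed)
```

Deletion shrinks the table, while both imputation steps keep every row and only fill the gaps.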

Question 4: What is an imbalanced dataset? Describe two techniques to handle it
(theoretical + practical).

Answer-

A dataset is imbalanced when one class has far more samples than the other.

Example:

- 95% Non-Fraud

- 5% Fraud

Model becomes biased toward majority class.
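
A quick way to see why this bias matters is to score a "model" that always predicts the majority class. The sketch below uses the 95/5 split from the example above; the always-predict-0 classifier is a hypothetical baseline, not a trained model.

```python
# 95 non-fraud (0) and 5 fraud (1) labels, matching the 95/5 split above.
y_true = [0] * 95 + [1] * 5

# A trivial baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
# prints: accuracy=0.95, fraud recall=0.00
```

Accuracy looks high even though no fraud case is caught, which is why recall or similar per-class metrics are more informative on imbalanced data.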

Technique 1: Random Oversampling / Undersampling

- Oversampling: Duplicate minority class

- Undersampling: Remove majority samples

- Practical: from imblearn.over_sampling import RandomOverSampler

Technique 2: SMOTE (Synthetic Minority Over-sampling Technique)

Creates synthetic (not duplicate) data points.

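- Practical (the cell below runs RandomOverSampler from Technique 1; SMOTE from imblearn.over_sampling is applied through the same fit_resample call):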

In [5]:
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
import pandas as pd

# Synthetic binary dataset with roughly a 90/10 class split
X, y = make_classification(weights=[0.9, 0.1], random_state=0)
print("Before:", pd.Series(y).value_counts())

# Duplicate minority-class rows until both classes are the same size
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print("After:", pd.Series(y_res).value_counts())


Before: 0    90
1    10
Name: count, dtype: int64
After: 0    90
1    90
Name: count, dtype: int64
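
To show how SMOTE's synthetic points differ from duplicated rows, here is a simplified NumPy sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. The function name smote_sketch, its parameters, and the toy points are assumptions made for this illustration; it is not imblearn's implementation.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # random minority samples
    pick = rng.integers(0, k, size=n_new)           # one neighbour per sample
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[partner] - X_min[base])


# Toy minority-class points (made up for illustration).
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)
```

Because every new point lies between two existing minority samples, the class is enlarged without exact copies.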


Question 5: Why is feature scaling important in ML? Compare Min-Max scaling and
Standardization.

Answer-

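Feature scaling matters for models that rely on distances (KNN, K-Means, SVM) or are trained with gradient descent (Logistic Regression, NN), because features with larger numeric ranges would otherwise dominate the result.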

Min-Max Scaling (Normalization)

Formula:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$


Range: 0 to 1

Good for: Neural networks, KNN, K-Means

Standardization (Z-score scaling)

Formula:

$$x' = \frac{x - \mu}{\sigma}$$


Result: mean 0, standard deviation 1 (values are not confined to a fixed range)

Good for: Linear regression, SVM, Logistic regression
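
Both formulas can be applied directly with NumPy. The sketch below uses a small made-up array with one large value, to show how each method treats it; in practice scikit-learn's MinMaxScaler and StandardScaler perform the same transformations.

```python
import numpy as np

# Made-up feature values with one large value.
x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Min-Max scaling: x' = (x - min) / (max - min), mapped onto [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: x' = (x - mean) / std, giving mean 0 and std 1.
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std.mean(), x_std.std())
```

With Min-Max scaling the large value pins the top of the range and squeezes the other values toward 0, which is one reason standardization is often preferred when outliers are present.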

Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer
one over the other?

Answer-

- Label Encoding

Convert categories to numbers:
Red → 0, Blue → 1, Green → 2

Use when: Ordinal data
(Small < Medium < Large)

- One-Hot Encoding

Creates binary columns:
Red → [1,0,0]

Use when: Nominal data (no order)
Examples: Gender, City, Color
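
A minimal pandas sketch of both encodings, using a made-up table with one ordinal column (Size) and one nominal column (Color). The explicit mapping dictionary is an assumption of this example; it is used so that the integers follow Small < Medium < Large.

```python
import pandas as pd

df = pd.DataFrame({
    "Size": ["Small", "Large", "Medium", "Small"],
    "Color": ["Red", "Blue", "Green", "Red"],
})

# Ordinal feature: map categories to integers that respect Small < Medium < Large.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["Size_encoded"] = df["Size"].map(size_order)

# Nominal feature: one binary column per category, with no implied order.
color_onehot = pd.get_dummies(df["Color"], prefix="Color", dtype=int)

encoded = pd.concat([df[["Size_encoded"]], color_onehot], axis=1)
print(encoded)
```

Note that scikit-learn's LabelEncoder assigns integers in sorted (alphabetical) order, so an explicit mapping like the one above is the safer choice when the order of the categories carries meaning.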

Question 7: Google Play Store Dataset
a). Analyze the relationship between app categories and ratings. Which categories have the
highest/lowest average ratings, and what could be the possible reasons?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)

Answer-

In [None]:
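import pandas as pd

# Load the Play Store data (file assumed to be in the working directory)
df = pd.read_csv("googleplaystore.csv")

# Average rating per category, highest first (mean() skips missing ratings)
category_rating = df.groupby("Category")["Rating"].mean().sort_values(ascending=False)
category_rating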


Summary Answer

Highest-rated categories: Books & Reference, Education, Events
→ Possible reasons: users value learning apps; fewer bugs

Lowest-rated categories: Dating, Entertainment, Games
→ Possible reasons: more crashes, ads, performance issues

Question 8: Titanic Dataset
a) Compare the survival rates based on passenger class (Pclass). Which class had the highest
survival rate, and why do you think that happened?
b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and
adults (Age ≥ 18). Did children have a better chance of survival?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)


Answer-
A) Survival rate by Pclass

In [None]:
import pandas as pd

df = pd.read_csv("titanic.csv")
# Survived is 0/1, so the mean per class is the survival rate
df.groupby("Pclass")["Survived"].mean()


Answer:

- 1st Class = highest survival rate
Reason:

- Closer to lifeboats

- Wealthier passengers prioritized

- Better cabin location

B) Children vs Adults

In [None]:
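# Exclude passengers with unknown Age; NaN < 18 is False, so they would be counted as adults
known_age = df.dropna(subset=["Age"]).copy()
known_age["Group"] = known_age["Age"].apply(lambda x: "Child" if x < 18 else "Adult")
known_age.groupby("Group")["Survived"].mean()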


Answer:

- Children had higher survival rates
Reason: "Women and children first" policy.

Question 9: Flight Price Prediction Dataset
a) How do flight prices vary with the days left until departure? Identify any exponential price
surges and recommend the best booking window.
b) Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are
consistently cheaper/premium, and why?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)

Answer-

A) Prices vs Days Left

In [None]:
df = pd.read_csv("flight_price.csv")
# Average ticket price for each number of days left before departure
df.groupby("days_left")["price"].mean()


Answer:

- Prices increase exponentially as departure date approaches

- Cheapest window: 25–40 days before travel

- Sharp price surge in last 5–10 days

B) Airline Comparison (Delhi → Mumbai)

In [None]:
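# Keep only Delhi to Mumbai flights (column names as used here; adjust if the CSV differs)
route = df[(df["source"]=="Delhi") & (df["destination"]=="Mumbai")]
# Average price per airline, cheapest first
route.groupby("airline")["price"].mean().sort_values()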


Answer:

- Cheapest airlines: SpiceJet, Indigo

- Premium airlines: Air India, Vistara
Reasons: brand, service quality, demand, flight timing.

Question 10: HR Analytics Dataset
a). What factors most strongly correlate with employee attrition? Use visualizations to show key
drivers (e.g., satisfaction, overtime, salary).
b). Are employees with more projects more likely to leave?
Dataset: hr_analytics

Answer-
A) Factors Correlated with Attrition

Most important factors:

- Low job satisfaction

- High overtime

- Low salary (MonthlyIncome)

- Long distance to work

- Excessive projects

- Poor work-life balance

📈 Visualization Code

In [None]:
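import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("hr_analytics.csv")
# Correlate numeric columns only; text columns would make corr() fail in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=False)
plt.show()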


B) Do employees with more projects leave?

In [None]:
# Assumes Attrition is encoded as 0/1, so the mean is the attrition rate per project count
df.groupby("NumProjects")["Attrition"].mean()


Answer:

Yes. Employees with a very high number of projects show higher attrition, likely because of overwork, stress, and burnout.