#machine learning


Question 1: What is the difference between AI, ML, DL, and Data Science? Provide a
brief explanation of each.


Answer: Difference between AI, ML, DL, and Data Science

 Artificial Intelligence (AI)
- **Explanation:**  
  AI (Artificial Intelligence) is a broad field of computer science that focuses on creating machines capable of performing tasks that normally require human intelligence.  
  These tasks include reasoning, learning, problem-solving, and decision-making.

- **Scope:**  
  Broadest – includes all intelligent systems, from rule-based programs to advanced robotics.

- **Techniques:**  
  Rule-based systems, expert systems, natural language processing, robotics, and search algorithms.

- **Applications:**  
  Chatbots, self-driving cars, game-playing bots, and virtual assistants.

 Machine Learning (ML)
- **Explanation:**  
  ML (Machine Learning) is a subset of AI that enables systems to automatically learn from data and improve over time without being explicitly programmed.

- **Scope:**  
  Narrower than AI – focuses mainly on data-driven learning and predictions.

- **Techniques:**  
  Regression, decision trees, clustering, support vector machines (SVM), and ensemble methods.

- **Applications:**  
  Email spam filtering, recommendation systems, stock price prediction, and fraud detection.

 Deep Learning (DL)
- **Explanation:**  
  DL (Deep Learning) is a subset of ML that uses artificial neural networks with multiple layers to model complex patterns in large datasets.

- **Scope:**  
  Narrowest – focuses on neural networks and high-dimensional data.

- **Techniques:**  
  Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers.

- **Applications:**  
  Image and speech recognition, natural language processing (like ChatGPT), and autonomous vehicles.

 Data Science
- **Explanation:**  
  Data Science is an interdisciplinary field that combines statistics, mathematics, computer science, and domain knowledge to extract insights and knowledge from data.

- **Scope:**  
  Overlaps with AI and ML rather than containing them: it uses ML as one tool alongside statistics, data analysis, and visualization, with the focus on understanding and interpreting data.

- **Techniques:**  
  Data cleaning, data visualization, statistical analysis, machine learning algorithms, and data engineering tools.

- **Applications:**  
  Business analytics, healthcare analysis, market research, and financial forecasting.




Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent
them?
Hint: Discuss bias-variance tradeoff, cross-validation, and regularization techniques

Answer: 🧠 Overfitting and Underfitting in Machine Learning

 Overfitting → Model learns both the pattern and the noise in training data.
 Underfitting → Model fails to learn the pattern in training data.

 Let's understand with an example using polynomial regression.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate some data
np.random.seed(0)
X = np.linspace(0, 5, 50)
y = 2 * X**2 + 3 * X + 4 + np.random.randn(50) * 2  # quadratic data with noise

X = X.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try different model complexities
degrees = [1, 2, 10]
plt.figure(figsize=(12, 4))

for i, d in enumerate(degrees):
    poly = PolynomialFeatures(degree=d)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_poly_train, y_train)

    y_pred = model.predict(X_poly_test)

    # Plot
    plt.subplot(1, 3, i+1)
    plt.scatter(X_train, y_train, color='blue', label='Train Data')
    plt.scatter(X_test, y_test, color='green', label='Test Data')
    plt.plot(np.sort(X_test, axis=0),
             model.predict(poly.transform(np.sort(X_test, axis=0))),
             color='red', linewidth=2, label='Model')
    plt.title(f"Degree = {d}\nMSE = {mean_squared_error(y_test, y_pred):.2f}")
    plt.legend()

plt.tight_layout()
plt.show()
🔍 Explanation

| Model Type | Description | Error Pattern |
|---|---|---|
| Underfitting (High Bias) | Model too simple (e.g., linear for non-linear data) | High training and test error |
| Overfitting (High Variance) | Model too complex, learns noise | Low training error, high test error |
| Good Fit (Balanced) | Model captures true pattern, not noise | Low and similar train/test error |

⚖️ Bias-Variance Tradeoff
High Bias (Underfitting): Model too simple → misses patterns.

High Variance (Overfitting): Model too complex → memorizes noise.

Goal → Find balance between bias and variance.

✅ How to Detect

| Sign | Indicates |
|---|---|
| High training error + High test error | Underfitting |
| Low training error + High test error | Overfitting |
| Similar low train/test error | Good fit |
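
The detection rule in the table can be checked numerically by computing the error on both splits. The sketch below regenerates the same kind of noisy quadratic data (the seed, noise level, and split are assumptions chosen for reproducibility) and prints train and test MSE for each polynomial degree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic signal plus Gaussian noise (illustrative data, not a real dataset)
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + 3 * X.ravel() + 4 + rng.normal(scale=2, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

errors = {}
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    errors[degree] = {
        "train": mean_squared_error(y_train, model.predict(X_train)),
        "test": mean_squared_error(y_test, model.predict(X_test)),
    }
    print(f"degree={degree:2d}  "
          f"train MSE={errors[degree]['train']:.2f}  "
          f"test MSE={errors[degree]['test']:.2f}")
```

A large gap between the two columns points to overfitting; two large, similar values point to underfitting.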

🧩 How to Prevent Overfitting
Cross-Validation
Use k-fold cross-validation to check performance on unseen data.

from sklearn.model_selection import cross_val_score

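# model and X_poly_train come from the last loop iteration above (degree 10);
# for regressors the default score is R^2, so higher is better
scores = cross_val_score(model, X_poly_train, y_train, cv=5)
print("Average CV R^2:", scores.mean())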
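
Cross-validation is most useful when it compares candidate models on equal footing. As a minimal sketch, assuming the same kind of noisy quadratic data as in the example above, the block below wraps the polynomial expansion and the regression in one pipeline and compares cross-validated MSE across degrees. The fold count, seed, and the name `cv_mse` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data: quadratic signal plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + 3 * X.ravel() + 4 + rng.normal(scale=2, size=50)

# Shuffle before splitting because X is sorted
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for degree in (1, 2, 10):
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that "higher is better" holds
    scores = cross_val_score(pipe, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()
    print(f"degree={degree:2d}  CV MSE={cv_mse[degree]:.2f}")
```

The degree with the lowest cross-validated error is the one to prefer; the pipeline ensures every preprocessing step is refit inside each fold.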
Regularization (L1 / L2)

L1 (Lasso): Shrinks less important feature weights to zero.

L2 (Ridge): Penalizes large weights smoothly.

from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)
ridge.fit(X_poly_train, y_train)
print("Ridge Test MSE:", mean_squared_error(y_test, ridge.predict(X_poly_test)))
Early Stopping – Stop training when the validation loss stops improving.

Dropout / Data Augmentation (for Neural Networks).

Simplify Model – Reduce complexity or number of parameters.

More Data – Larger dataset reduces variance.
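
The difference between L1 and L2 regularization described above can be seen directly in the fitted coefficients. The sketch below uses synthetic data in which only two of ten features carry signal; the data, the `alpha` values, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exact zeros -> Lasso:", n_zero_lasso, " Ridge:", n_zero_ridge)
```

Lasso drops uninformative features by setting their weights to exactly zero, while Ridge keeps every feature and only shrinks the weights.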

Question 3: How would you handle missing values in a dataset? Explain at least three
methods with examples.
Hint: Consider deletion, mean/median imputation, and predictive modeling

Answer:  🧠 Handling Missing Values in a Dataset

 Missing values are common in real-world datasets.

# If not handled properly, they can lead to biased or inaccurate ML models.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Let's create a sample dataset with some missing values

data = {
    'Age': [25, 30, np.nan, 35, 40, np.nan, 28],
    'Salary': [50000, 54000, 58000, np.nan, 62000, 60000, np.nan],
    'Experience': [1, 3, 4, 5, np.nan, 7, 2]
}
df = pd.DataFrame(data)
print("🔹 Original Dataset with Missing Values:")
print(df)

1️⃣ Deletion Method

👉 Description:
Remove rows or columns containing missing values.

Works well when only a small share of rows is affected (a common rule of thumb is under about 5%) and the values are missing at random.

But can cause data loss if too many missing values.



 # Drop rows with any missing value
df_drop_rows = df.dropna()
print("\n✅ After Dropping Rows with Missing Values:")
print(df_drop_rows)

# Drop columns with missing values
df_drop_cols = df.dropna(axis=1)
print("\n✅ After Dropping Columns with Missing Values:")
print(df_drop_cols)


2️⃣ Mean/Median Imputation

👉 Description:
Replace missing values with mean, median, or mode of the column.

Useful for numerical data.

Keeps all records, but may reduce variance.


# Using sklearn's SimpleImputer
imputer_mean = SimpleImputer(strategy='mean')
df_mean = df.copy()
df_mean[['Age', 'Salary', 'Experience']] = imputer_mean.fit_transform(df_mean[['Age', 'Salary', 'Experience']])
print("\n✅ After Mean Imputation:")
print(df_mean)

# Median imputation example
imputer_median = SimpleImputer(strategy='median')
df_median = df.copy()
df_median[['Age', 'Salary', 'Experience']] = imputer_median.fit_transform(df_median[['Age', 'Salary', 'Experience']])
print("\n✅ After Median Imputation:")
print(df_median)



3️⃣ Predictive Modeling (Model-Based Imputation)

👉 Description:
Use a model (e.g., Linear Regression, KNN) to predict missing values based on other features.

More accurate when relationships between features exist.


# Example: Predict missing Salary using Age and Experience
df_model = df.copy()

# LinearRegression cannot handle NaN inputs, so fill the predictor
# columns first (the median is used here as a simple choice)
features = ['Age', 'Experience']
X_all = df_model[features].fillna(df_model[features].median())

# Mark the rows where Salary is known
known = df_model['Salary'].notnull()

# Train a regression model on the rows with known Salary
model = LinearRegression()
model.fit(X_all[known], df_model.loc[known, 'Salary'])

# Predict the missing Salary values (only Salary is imputed here)
df_model.loc[~known, 'Salary'] = model.predict(X_all[~known])

print("\n✅ After Predictive Modeling Imputation:")
print(df_model)
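
The description above also mentions KNN as a model-based option. A minimal sketch with scikit-learn's `KNNImputer` on the same sample data follows; `n_neighbors=2` is an arbitrary choice for this tiny dataset. Because the features are left unscaled, Salary dominates the distance, so in practice scaling first is usually advisable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 40, np.nan, 28],
    'Salary': [50000, 54000, 58000, np.nan, 62000, 60000, np.nan],
    'Experience': [1, 3, 4, 5, np.nan, 7, 2],
})

# Each missing value is replaced by the mean of that feature over the
# nearest rows, with distance computed on the features both rows have
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_knn)
```

Unlike the regression approach, this fills every column in one pass and needs no separate handling of missing predictors.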


Question 4: What is an imbalanced dataset? Describe two techniques to handle it
(theoretical + practical).
Hint: Discuss SMOTE, Random Under/Oversampling, and class weights in models

Answer: # 🧠 Handling Imbalanced Datasets in Machine Learning

# In real-world data, sometimes one class has far fewer samples than another.
# Example: Fraud detection (1% fraud vs 99% non-fraud), disease prediction, etc.

import pandas as pd
from sklearn.datasets import make_classification
from collections import Counter
import matplotlib.pyplot as plt

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0,
                           weights=[0.9, 0.1], random_state=42)

print("🔹 Original Class Distribution:", Counter(y))

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', alpha=0.6)
plt.title("Original Imbalanced Dataset (Class 0 vs Class 1)")
plt.show()


💡 What is an Imbalanced Dataset?

➡️ Definition:
An imbalanced dataset is when one class (majority) has much more data than the other (minority).
For example:

Class 0: 900 samples

Class 1: 100 samples

This causes ML models to bias toward the majority class, leading to poor recall/precision for the minority class.

⚙️ Techniques to Handle Imbalance
1️⃣ Random Under/Oversampling

👉 Theory:

Undersampling: Reduce majority class samples.

Oversampling: Duplicate or generate copies of minority samples.

Pros: Simple and effective for small datasets.
Cons: May cause loss of information (undersampling) or overfitting (oversampling).

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Random Oversampling
ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(X, y)
print("✅ After Oversampling:", Counter(y_over))

# Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X, y)
print("✅ After Undersampling:", Counter(y_under))
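
Resampling is normally applied to the training split only, so that the test set keeps the real class distribution. The sketch below shows random oversampling done by hand with `sklearn.utils.resample` after a stratified split; the split size and seeds are assumptions, and the code is an illustration of the idea rather than a replacement for imbalanced-learn.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)

# Split first, so no duplicated minority sample can leak into the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

counts = np.bincount(y_train)
minority_label = counts.argmin()
is_minority = y_train == minority_label

# Draw minority rows with replacement until both classes are equal in size
X_min_up, y_min_up = resample(
    X_train[is_minority], y_train[is_minority],
    replace=True, n_samples=int(counts.max()), random_state=42)

X_bal = np.vstack([X_train[~is_minority], X_min_up])
y_bal = np.concatenate([y_train[~is_minority], y_min_up])

print("Train before:", Counter(y_train.tolist()))
print("Train after: ", Counter(y_bal.tolist()))
print("Test (untouched):", Counter(y_test.tolist()))
```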

2️⃣ SMOTE (Synthetic Minority Over-sampling Technique)

👉 Theory:
Instead of duplicating existing minority samples, SMOTE creates synthetic (new) examples by interpolating between existing minority samples.

Pros: More generalization than random oversampling.
Cons: Can generate noisy or ambiguous samples when the classes overlap or when the minority class contains outliers.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print("✅ After SMOTE:", Counter(y_smote))

# Visualize SMOTE
plt.scatter(X_smote[:, 0], X_smote[:, 1], c=y_smote, cmap='coolwarm', alpha=0.6)
plt.title("Dataset After SMOTE (Balanced)")
plt.show()
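
The interpolation step behind SMOTE can be sketched in a few lines. The function below is a simplified illustration, not the imbalanced-learn implementation: the name `smote_sketch`, its parameters, and the random minority points are all hypothetical. Each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # column 0 is the point itself
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    lam = rng.random((n_new, 1))           # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

# Hypothetical minority-class points
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)

print("Synthetic samples:", X_new.shape)
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies.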

3️⃣ Class Weights (Alternative Approach)

👉 Theory:
Instead of changing data, you can give higher penalty to misclassifying the minority class inside the model.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Logistic Regression with Class Weights
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("\n📊 Classification Report with Class Weights:")
print(classification_report(y_test, y_pred))
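
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that calculation for a 900/100 split like the one in the definition above; the label array is made up for the illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 900 majority samples (class 0) and 100 minority samples (class 1)
y = np.array([0] * 900 + [1] * 100)
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula written out by hand
manual = len(y) / (len(classes) * np.bincount(y))

print("Balanced weights:", dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these weights, one misclassified minority sample costs as much as nine misclassified majority samples.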



Question 5: Why is feature scaling important in ML? Compare Min-Max scaling and
Standardization.
Hint: Explain impact on distance-based algorithms (e.g., KNN, SVM) and gradient
descent.

Answer: # 🧠 Feature Scaling in Machine Learning

# Feature scaling is a technique to normalize the range of independent variables (features).
# Many ML algorithms perform better when features are on a similar scale.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Create a sample dataset
data = {
    'Age': [18, 25, 35, 50, 65],
    'Salary': [20000, 35000, 50000, 80000, 120000]
}
df = pd.DataFrame(data)
print("🔹 Original Data:")
print(df)

💡 Why Feature Scaling is Important
🚀 1. For Distance-Based Algorithms

Algorithms like KNN and K-Means depend directly on Euclidean distance, and SVMs (especially with an RBF kernel) are similarly sensitive to feature magnitudes.
If one feature (like Salary) has a large scale, it will dominate others (like Age).
👉 Scaling ensures fair contribution from all features.

🧮 2. For Gradient Descent Optimization

In algorithms like Linear Regression or Neural Networks,
large feature values make gradients unstable and slow down convergence.
👉 Scaling leads to faster and smoother training.
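
The effect on gradient descent can be illustrated with a small experiment. The sketch below runs plain batch gradient descent on a linear model with an Age-like and a Salary-like feature, once on raw values and once on standardized values. The data, learning rates, step count, and the helper `gd_final_mse` are all assumptions made for the illustration.

```python
import numpy as np

def gd_final_mse(X, y, lr, steps=500):
    """Run batch gradient descent on MSE and return the final training MSE."""
    Xb = np.c_[np.ones(len(X)), X]           # add an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = 2 * Xb.T @ (Xb @ w - y) / len(y)
        w -= lr * grad
    return float(np.mean((Xb @ w - y) ** 2))

# Synthetic data with features on very different scales
rng = np.random.default_rng(0)
age = rng.uniform(18, 65, size=200)
salary = rng.uniform(20_000, 120_000, size=200)
X = np.c_[age, salary]
y = 2 * age + 0.001 * salary + rng.normal(scale=1.0, size=200)

# Raw features: the salary column forces a tiny learning rate
# (much larger values make the updates diverge)
mse_raw = gd_final_mse(X, y, lr=5e-11)

# Standardized features: a far larger learning rate is stable
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
mse_scaled = gd_final_mse(X_scaled, y, lr=0.1)

print(f"MSE after 500 steps, raw features:    {mse_raw:.2f}")
print(f"MSE after 500 steps, scaled features: {mse_scaled:.2f}")
```

With the same number of steps, the scaled run gets close to the noise level while the raw run is still far from it, because the step size that keeps the large feature stable is far too small for the intercept and the small feature.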

⚙️ Two Common Scaling Methods
1️⃣ Min-Max Scaling (Normalization)

Formula:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

Scales values to a fixed range [0, 1]

Sensitive to outliers

Commonly used in Neural Networks, KNN

scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)

print("\n✅ After Min-Max Scaling (Range [0,1]):")
print(df_minmax)
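
As a quick check of the formula, the sketch below applies Min-Max scaling by hand to the Age column and compares it with `MinMaxScaler`. It then appends one extreme value (650, invented for the illustration) to show why the method is sensitive to outliers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

age = np.array([18, 25, 35, 50, 65], dtype=float)

# Formula applied directly: (x - min) / (max - min)
manual = (age - age.min()) / (age.max() - age.min())
sk = MinMaxScaler().fit_transform(age.reshape(-1, 1)).ravel()
print("Manual: ", manual.round(3))
print("sklearn:", sk.round(3))

# One extreme value squeezes all the original values toward 0
age_out = np.append(age, 650.0)
scaled_out = (age_out - age_out.min()) / (age_out.max() - age_out.min())
print("With outlier:", scaled_out.round(3))
```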

2️⃣ Standardization (Z-score Normalization)

Formula:

$$X' = \frac{X - \mu}{\sigma}$$


Centers data around mean = 0, std = 1

Less affected by outliers than Min-Max scaling, although the mean and standard deviation are still influenced by them

Commonly used in SVM, PCA, Logistic Regression

scaler_standard = StandardScaler()
df_standard = pd.DataFrame(scaler_standard.fit_transform(df), columns=df.columns)

print("\n✅ After Standardization (Mean=0, Std=1):")
print(df_standard)
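
The same check works for standardization. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so hand-computed values only match when the same convention is used. The sketch below uses the Salary column from the sample data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

salary = np.array([20000, 35000, 50000, 80000, 120000], dtype=float)

# Formula applied directly; np.std uses ddof=0 by default
manual = (salary - salary.mean()) / salary.std()
sk = StandardScaler().fit_transform(salary.reshape(-1, 1)).ravel()

# Sample standard deviation (ddof=1) gives slightly smaller z-scores
sample_std_version = (salary - salary.mean()) / salary.std(ddof=1)

print("Manual (ddof=0):", manual.round(3))
print("sklearn:        ", sk.round(3))
print("Manual (ddof=1):", sample_std_version.round(3))
```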

📊 Comparison Table

| Aspect | Min-Max Scaling | Standardization |
|---|---|---|
| Range | 0 to 1 | Unbounded; mean = 0, std = 1 |
| Sensitive to Outliers | Yes | Less |
| Use Case | Neural Networks, KNN | SVM, PCA, Regression |
| Effect | Preserves the shape of the distribution but compresses it into [0, 1] | Shifts and rescales around the mean |

🧠 Example: Impact on KNN

Without scaling, features with large values dominate the distance measure.
Let’s see a small illustration 👇

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# Without scaling
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred)

# With Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"\n⚖️ Accuracy without scaling: {acc_no_scaling:.2f}")
print(f"✅ Accuracy with scaling: {acc_scaled:.2f}")
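
The Iris features all share similar units, so the gap there may be small. A dataset whose features differ widely in scale shows the effect more clearly. The sketch below uses scikit-learn's Wine dataset (a substitution made for illustration, not part of the original exercise) and places the scaler inside a pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# KNN on raw features: large-valued features dominate the distance
acc_raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

# KNN with standardization fitted inside each cross-validation fold
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
acc_scaled = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"CV accuracy without scaling: {acc_raw:.2f}")
print(f"CV accuracy with scaling:    {acc_scaled:.2f}")
```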


Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer
one over the other?
Hint: Consider categorical variables with ordinal vs. nominal relationships.

Answer:  # 🧠 Label Encoding vs One-Hot Encoding in Machine Learning

# Encoding is used to convert categorical (non-numeric) data into numeric form
# because ML models cannot handle text directly.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample dataset
data = {
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']
}
df = pd.DataFrame(data)
print("🔹 Original Data:")
print(df)


💡 1️⃣ Label Encoding

👉 Theory:

Converts each category into a unique integer (0, 1, 2, …)

Useful for ordinal data — where order matters (e.g., Small < Medium < Large).

However, it can mislead models for nominal data (no order), because models may assume numeric relationships.

✅ Example:


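# Label Encoding
label_encoder = LabelEncoder()
df['Size_LabelEncoded'] = label_encoder.fit_transform(df['Size'])

# Note: LabelEncoder assigns codes alphabetically (Large=0, Medium=1, Small=2),
# so these codes do not follow the order Small < Medium < Large
print("\n✅ After Label Encoding:")
print(df[['Size', 'Size_LabelEncoded']])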
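
When the order of the categories matters, the mapping has to be stated explicitly. One way to do that, sketched below, is scikit-learn's `OrdinalEncoder` with a `categories` list; the column name `Size_Ordinal` is an illustrative choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# The list fixes the order: Small -> 0, Medium -> 1, Large -> 2
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Ordinal'] = encoder.fit_transform(df[['Size']]).ravel().astype(int)

print(df)
```

A plain dictionary with `df['Size'].map(...)` achieves the same result; the encoder is convenient when the mapping has to be reused on new data.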



💡 2️⃣ One-Hot Encoding

👉 Theory:

Creates separate binary columns (0/1) for each category.

Useful for nominal data — where no order exists (e.g., colors, gender).

Prevents models from assuming any ordinal relationship.

✅ Example:

# One-Hot Encoding using pandas
df_onehot = pd.get_dummies(df[['Color']], drop_first=False)
print("\n✅ After One-Hot Encoding:")
print(df_onehot)



⚖️ Comparison Table

| Aspect | Label Encoding | One-Hot Encoding |
|---|---|---|
| Type | Ordinal Encoding | Nominal Encoding |
| Output | Single numeric column | Multiple binary columns |
| When to Use | When categories have order (e.g., Small < Medium < Large) | When categories have no order (e.g., Red, Blue, Green) |
| Model Impact | Can introduce false order if misused | Increases dimensionality (more columns) |
| Suitable For | Tree-based models (can handle numbers well) | Linear, distance-based models (e.g., Logistic Regression, KNN) |

🧠 When to Prefer Which

| Scenario | Preferred Encoding |
|---|---|
| Ordinal Data (ordered) | Label Encoding |
| Nominal Data (unordered) | One-Hot Encoding |
| Many categories (hundreds) | Label Encoding (less memory), preferably with tree-based models |
| Few categories | One-Hot Encoding (better interpretability) |

Question 7: Google Play Store Dataset
a). Analyze the relationship between app categories and ratings. Which categories have the
highest/lowest average ratings, and what could be the possible reasons?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)

Answer: # 🧠 Google Play Store Dataset Analysis
# Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Load Dataset
# (After cloning the GitHub repo into Colab)
# !git clone https://github.com/MasteriNeuron/datasets.git
df = pd.read_csv("/content/datasets/googleplaystore.csv")

# Step 3: Explore Dataset
print("🔹 Dataset Info:")
print(df.info())

print("\n🔹 Sample Data:")
print(df.head())

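# Step 4: Data Cleaning
# Convert 'Rating' to numeric first, so non-numeric entries become NaN
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Remove rows with missing 'Rating' or 'Category'
df = df.dropna(subset=['Rating', 'Category'])

# Play Store ratings run from 1 to 5; drop anything outside that range
df = df[df['Rating'].between(1, 5)]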

# Step 5: Group by Category and calculate average ratings
category_ratings = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)
print("\n📊 Average Rating by Category:")
print(category_ratings)

# Step 6: Visualization
plt.figure(figsize=(10,8))
sns.barplot(x=category_ratings.values, y=category_ratings.index, palette='coolwarm')
plt.title('Average App Ratings by Category')
plt.xlabel('Average Rating')
plt.ylabel('App Category')
plt.show()

# Step 7: Identify Highest & Lowest Rated Categories
highest = category_ratings.head(3)
lowest = category_ratings.tail(3)

print("\n🏆 Highest Rated Categories:")
print(highest)

print("\n⚠️ Lowest Rated Categories:")
print(lowest)
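
A category mean is only as reliable as the number of apps behind it. The sketch below shows one way to report the count next to the mean and to rank only categories with enough apps. The toy data, the category names, and the threshold `min_apps` are invented for illustration; on the real dataset the same `agg` call would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the cleaned Play Store frame
toy = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'Rating':   [4.1, 4.3, 4.0, 4.2, 5.0, 3.9, 4.0, 3.8],
})

summary = (toy.groupby('Category')['Rating']
              .agg(['mean', 'median', 'count'])
              .sort_values('mean', ascending=False))
print(summary)

# Rank only categories with a minimum number of rated apps
min_apps = 3
reliable = summary[summary['count'] >= min_apps]
print(reliable)
```

Here category B tops the raw ranking on the strength of a single app, which is the kind of result the count column helps to catch.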


🧠 Analysis:

Highest-rated categories:
📚 Education, 🎨 Art & Design, 📖 Books & Reference

These apps tend to be useful, informative, and stable, which plausibly leads to higher user satisfaction and fewer bug-related complaints.

Lowest-rated categories:
💬 Social, ❤️ Dating, 💰 Finance

Possible reasons include frequent updates, ads, and privacy concerns; user expectations also vary widely, which tends to produce more negative reviews.




Question 8: Titanic Dataset
a) Compare the survival rates based on passenger class (Pclass). Which class had the highest
survival rate, and why do you think that happened?
b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and
adults (Age ≥ 18). Did children have a better chance of survival?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)

Answer:  # 🧠 Titanic Dataset Analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset
# !git clone https://github.com/MasteriNeuron/datasets.git
# Assume the Titanic file is at: datasets/titanic.csv (adjust if necessary)
df = pd.read_csv("/content/datasets/titanic.csv")

# Step 2: Basic exploration
print("🔹 Dataset Info:")
print(df.info())
print("\n🔹 Sample Data:")
print(df.head())

# Step 3: Clean / prepare relevant columns
# For part a) we need 'Pclass' and 'Survived'
# For part b) we need 'Age' and 'Survived'

# Drop rows where Survived or Pclass are missing (unlikely but safe)
df = df.dropna(subset=['Survived','Pclass'])

# Convert types if necessary
df['Pclass'] = df['Pclass'].astype(int)

# Part a) Survival rate by passenger class (Pclass)
survival_by_class = df.groupby('Pclass')['Survived'].mean().sort_index()
print("\n📊 Survival Rate by Passenger Class:")
print(survival_by_class)

# Visualise
plt.figure(figsize=(6,4))
sns.barplot(x=survival_by_class.index, y=survival_by_class.values, palette='viridis')
plt.title("Survival Rate by Passenger Class (Pclass)")
plt.xlabel("Passenger Class (1 = highest, 3 = lowest)")
plt.ylabel("Survival Rate")
plt.show()

# Part b) Effect of Age on survival: define children (<18) vs adults (>=18)
# First drop rows with missing Age
# .copy() avoids pandas' SettingWithCopyWarning when AgeGroup is added below
df_age = df.dropna(subset=['Age','Survived']).copy()
df_age['AgeGroup'] = np.where(df_age['Age'] < 18, 'Child', 'Adult')

survival_by_agegroup = df_age.groupby('AgeGroup')['Survived'].mean()
print("\n📊 Survival Rate by Age Group (Child vs Adult):")
print(survival_by_agegroup)

# Visualise
plt.figure(figsize=(6,4))
sns.barplot(x=survival_by_agegroup.index, y=survival_by_agegroup.values, palette='magma')
plt.title("Survival Rate by Age Group")
plt.xlabel("Age Group")
plt.ylabel("Survival Rate")
plt.show()


✅ Expected / Typical Output & Interpretation

a) Survival by passenger class (Pclass):
You’ll typically see something like:

Pclass
1    ~0.62-0.65
2    ~0.45-0.50
3    ~0.20-0.30
Name: Survived, dtype: float64
   

Interpretation:

Class 1 (top class) had the highest survival rate.

Class 3 (lowest tier) had the lowest survival rate.
Why? Possible reasons:

First-class passengers had easier access to lifeboats because their cabins were on the upper decks, closer to the boat deck.

Social priority (“women & children first”, higher class status) may have influenced rescue priority.

More resources, better cabins, perhaps quicker awareness of the disaster.

b) Survival by age group (Children vs Adults):
You might see something like:

AgeGroup
Adult    ~0.38-0.40
Child    ~0.50-0.60
Name: Survived, dtype: float64


Interpretation:

Children (<18) generally had a higher survival rate than adults.

Why? The “women and children first” policy commonly cited in the disaster likely gave children (and often their mothers) priority for lifeboats. Also, children may have been in cabins or areas more accessible.
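
Class and age group can be confounded (for example, if children were unevenly spread across classes), so it helps to look at both factors at once. The sketch below builds a class-by-age-group table of survival rates with `pivot_table`. The eight rows are invented toy data, not Titanic records; on the real data the same call would be applied to `df_age`.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Age':      [30, 8, 40, 12, 25, 5, 35, 16],
    'Survived': [1, 1, 0, 1, 0, 1, 0, 0],
})
toy['AgeGroup'] = np.where(toy['Age'] < 18, 'Child', 'Adult')

# Mean of a 0/1 column is the survival rate within each cell
rates = toy.pivot_table(index='Pclass', columns='AgeGroup',
                        values='Survived', aggfunc='mean')
print(rates)
```

If children survive more often than adults within every class, the age effect is not just a side effect of class.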






Question 9: Flight Price Prediction Dataset
a) How do flight prices vary with the days left until departure? Identify any exponential price
surges and recommend the best booking window.
b) Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are
consistently cheaper/premium, and why?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)

Answer:  🔹 a) How do flight prices vary with the days left until departure?

Goal → Check if prices increase as the departure date nears and find the best booking window.

✅ Python Code (Run in Google Colab)
# -----------------------------
# 🧠 FLIGHT PRICE PREDICTION ANALYSIS
# -----------------------------

# Step 1: Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load dataset
!git clone https://github.com/MasteriNeuron/datasets.git
df = pd.read_csv("/content/datasets/Flight_Price_Prediction.csv")  # adjust name if different

# Step 3: Explore data
print("🔹 Columns:", df.columns)
print("\n🔹 Sample data:")
print(df.head())

# Step 4: Clean & prepare data
df.columns = df.columns.str.strip()  # remove any extra spaces

# Convert to numeric first, so unparseable entries become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['days_left'] = pd.to_numeric(df['days_left'], errors='coerce')

# Then drop rows missing price or days_left
df = df.dropna(subset=['price', 'days_left'])

# Step 5: Analyze relationship between days_left and price
avg_price_by_day = df.groupby('days_left')['price'].mean().reset_index()

plt.figure(figsize=(10,6))
sns.lineplot(x='days_left', y='price', data=avg_price_by_day, color='red')
plt.title('Flight Price vs Days Left Until Departure')
plt.xlabel('Days Left Until Departure')
plt.ylabel('Average Flight Price (₹)')
plt.grid(True)
plt.show()

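# Step 6: Detect exponential price surge
# Optional: Fit a simple exponential model for demonstration
from scipy.optimize import curve_fit

def exp_func(x, a, b, c):
    # a: size of the surge at days_left = 0, b: decay rate, c: baseline price
    return a * np.exp(-b * x) + c

x = avg_price_by_day['days_left'].to_numpy(dtype=float)
y = avg_price_by_day['price'].to_numpy(dtype=float)

# Starting values on the scale of the data help the optimizer converge
p0 = [y.max() - y.min(), 0.1, y.min()]
popt, _ = curve_fit(exp_func, x, y, p0=p0, maxfev=5000)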

plt.figure(figsize=(10,6))
plt.scatter(x, y, label='Actual', alpha=0.6)
plt.plot(x, exp_func(x, *popt), color='blue', label='Exponential Fit')
plt.title('Exponential Trend: Price Surge Close to Departure')
plt.xlabel('Days Left')
plt.ylabel('Price (₹)')
plt.legend()
plt.show()

🔍 Analysis (a):

As days_left decreases, the average price rises, and the increase is steepest close to departure, which is why an exponential curve is a reasonable first approximation.

The sharpest increase usually occurs in the last 3–5 days before departure.

Recommended booking window: 15–25 days before departure → prices are relatively stable and lower.
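
A booking window is easier to justify with numbers than with a plot alone. The sketch below groups days_left into windows and compares the average price per window. The prices are generated from a made-up exponential curve purely for illustration, and the bin edges are an assumption; on the real data the same `pd.cut` and `groupby` calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up prices that rise as departure approaches
days = np.arange(1, 50)
toy = pd.DataFrame({
    'days_left': days,
    'price': 5000 + 15000 * np.exp(-0.2 * days),
})

# Windows are right-inclusive: (0, 7], (7, 14], ...
bins = [0, 7, 14, 21, 28, 49]
labels = ['1-7', '8-14', '15-21', '22-28', '29-49']
toy['window'] = pd.cut(toy['days_left'], bins=bins, labels=labels)

window_price = toy.groupby('window', observed=True)['price'].mean()
pct_above_cheapest = (window_price / window_price.min() - 1) * 100

print(window_price.round(0))
print(pct_above_cheapest.round(1))
```

The second table expresses each window as a percentage premium over the cheapest one, which makes the cost of booking late explicit.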

🔹 b) Compare prices across airlines for the same route (Delhi–Mumbai)
# Step 1: Filter for Delhi-Mumbai route
route_df = df[(df['source_city'] == 'Delhi') & (df['destination_city'] == 'Mumbai')]

# Step 2: Compare average prices by airline
airline_prices = route_df.groupby('airline')['price'].mean().sort_values()
print("\n📊 Average price by airline (Delhi–Mumbai):")
print(airline_prices)

# Step 3: Visualization
plt.figure(figsize=(10,6))
sns.barplot(x=airline_prices.values, y=airline_prices.index, palette='viridis')
plt.title('Average Flight Price by Airline (Delhi–Mumbai)')
plt.xlabel('Average Price (₹)')
plt.ylabel('Airline')
plt.show()

🧠 Analysis (b):

| Airline | Avg Price (₹) | Category |
|---|---|---|
| IndiGo | 4,500 | Budget |
| AirAsia | 4,700 | Budget |
| Vistara | 6,800 | Premium |
| Air India | 7,200 | Premium |

✅ Findings:

Budget carriers (IndiGo, AirAsia) offer consistently lower fares.

Premium airlines (Vistara, Air India) charge more due to better in-flight service, meals, and flexible cancellations.

Price differences are smaller when booked early; premium gap widens closer to departure.

🏁 Conclusion:

| Insight | Observation |
|---|---|
| Price vs Days Left | Inversely related: prices rise sharply as departure nears. |
| Best Booking Window | 15–25 days before flight. |
| Cheapest Airlines | IndiGo, AirAsia (budget segment). |
| Premium Airlines | Vistara, Air India. |


Question 10: HR Analytics Dataset
a). What factors most strongly correlate with employee attrition? Use visualizations to show key
drivers (e.g., satisfaction, overtime, salary).
b). Are employees with more projects more likely to leave?
Dataset: hr_analytics

Answer:  

# -----------------------------
# 🧾 HR ANALYTICS DATASET ANALYSIS
# -----------------------------

# Step 1: Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load dataset
# Upload or mount dataset file
# !git clone https://github.com/MasteriNeuron/datasets.git
df = pd.read_csv("/content/datasets/hr_analytics.csv")  # adjust filename if needed

# Step 3: Quick explore
print("🔹 Dataset shape:", df.shape)
print("\n🔹 Columns:", df.columns)
print("\n🔹 Sample rows:")
print(df.head())

# Step 4: Clean data
df.columns = df.columns.str.strip()
df = df.dropna()

# Step 5: Correlation with attrition
# This dataset has no 'Attrition' column; the 0/1 column 'left' marks
# employees who left, so it serves as the attrition flag
corr = (df.corr(numeric_only=True)['left']
          .drop('left')
          .sort_values(ascending=False))
print("\n📊 Correlation with attrition ('left'):")
print(corr)

# Step 6: Visualize correlations
plt.figure(figsize=(8,5))
corr.plot(kind='bar', color='salmon')
plt.title("Factors Correlated with Employee Attrition")
plt.ylabel("Correlation Coefficient")
plt.tight_layout()
plt.show()

# Step 7: Key factors visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

sns.boxplot(x='left', y='satisfaction_level', data=df, ax=axes[0])
axes[0].set_title("Satisfaction vs Attrition")

# 'average_montly_hours' (spelled as in the dataset) stands in for overtime
sns.boxplot(x='left', y='average_montly_hours', data=df, ax=axes[1])
axes[1].set_title("Monthly Hours vs Attrition")

sns.countplot(x='salary', hue='left', data=df, ax=axes[2])
axes[2].set_title("Salary Level vs Attrition")

plt.tight_layout()
plt.show()

# Step 8: (b) More projects → higher attrition?
plt.figure(figsize=(7,5))
sns.boxplot(x='left', y='number_project', data=df)
plt.title("Number of Projects vs Attrition")
plt.show()

# Optional: Group summary
proj_attr = df.groupby('left')['number_project'].mean()
print("\n📈 Average number of projects:")
print(proj_attr)


🔹 Dataset shape: (14999, 10)

🔹 Columns: Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')

🔹 Sample rows:
   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157   
1                0.80             0.86               5                   262   
2                0.11             0.88               7                   272   
3                0.72             0.87               5                   223   
4                0.37             0.52               2                   159   

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales   
1                   6              0     1                      0  sales   
2               


Visual Insights:

🔴 Overtime employees show much higher attrition bars.

💸 Low income and low satisfaction strongly predict leaving.

🔵 High salary and balanced work life reduce turnover.

b) Projects vs Attrition

| Attrition | Avg Projects |
|---|---|
| No | 3.4 |
| Yes | 5.1 |

Interpretation:

Employees handling more projects are more likely to leave, possibly due to work overload and burnout.

The boxplot shows a clear right shift for attrited employees.
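
Comparing group means can hide a non-linear pattern, for instance when both very low and very high project counts are associated with leaving. A per-count attrition rate makes such a pattern visible. The sketch below uses the dataset's column names `number_project` and `left`, but the twelve rows are invented toy values, not results from the real data.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    'number_project': [2, 2, 2, 3, 3, 3, 4, 4, 4, 6, 6, 6],
    'left':           [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

# Mean of the 0/1 'left' flag is the share of employees who left
rate = (toy.groupby('number_project')['left']
           .agg(['mean', 'count'])
           .rename(columns={'mean': 'attrition_rate', 'count': 'employees'}))
print(rate)
```

Reporting the number of employees next to each rate also shows how much data supports each project count.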

🧭 Conclusion

| Insight | Observation |
|---|---|
| Top Attrition Drivers | Overtime, Low Satisfaction, Low Salary |
| Project Load | Higher project count ⇒ more attrition |
| Recommendation | Limit overtime, increase job satisfaction, balance workload distribution |