# Question 1: What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each. (Hint: Compare their scope, techniques, and applications.)
Answer:  Differences between AI, ML, DL, and data science

## Scope:
- AI: Broad field aiming to build systems that perform tasks requiring human-like intelligence (reasoning, perception, decision-making).
- ML: Subfield of AI focused on learning patterns from data to make predictions/decisions with minimal explicit programming.
- DL: Subfield of ML using multi-layer neural networks to learn hierarchical representations, especially in high-dimensional data.
- Data science: Interdisciplinary field combining statistics, ML, programming, and domain knowledge to extract insights and support decisions.
## Techniques:
- AI: Planning, search (A*), logic, knowledge graphs, expert systems, ML methods.
- ML: Supervised/unsupervised learning (regression, decision trees, SVM, clustering), model evaluation.
- DL: CNNs, RNNs/LSTMs, Transformers, autoencoders; training via backpropagation with gradient-based optimizers.
- Data science: Data wrangling, EDA, statistical inference, ML modeling, visualization, experimentation.
## Applications:
- AI: Robotics, autonomous agents, dialog systems, game-playing.
- ML: Credit scoring, recommendation systems, demand forecasting.
- DL: Image recognition, speech/NLP, generative models.
- Data science: Business analytics, A/B testing, dashboards, operational decision support.





# Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent them? Hint: Discuss bias-variance tradeoff, cross-validation, and regularization techniques. 
Answer:  
Overfitting and underfitting in ML
## Definitions:
- Overfitting: Model captures noise and idiosyncrasies of the training data, leading to low training error but high test error.
- Underfitting: Model is too simple to capture underlying patterns, leading to high error on both training and test data.
## Bias-variance tradeoff:
- High variance (overfitting): Complex models sensitive to training fluctuations.
- High bias (underfitting): Simple models that miss important structure.
- Goal is to minimize total error by balancing bias and variance.
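
For squared-error loss, the tradeoff above can be written explicitly. The decomposition below is the standard textbook form, assuming data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$; the symbols $f$, $\hat{f}$, and $\sigma^2$ are notation introduced here for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why reducing one at the expense of the other can still lower total error.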
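## Detection:
- Train vs. validation curves: A large gap between low training error and high validation error signals overfitting; high error on both sets indicates underfitting.
- Learning curves: If validation error keeps falling as training data is added, the model has a variance problem (overfitting); if both errors plateau at a high level, it has a bias problem (underfitting).
- Cross-validation: Consistent scores across folds suggest a stable model; high variability across folds, or training scores far above the fold scores, points to overfitting risk.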
## Prevention:
- Regularization: L1/L2 penalties, dropout (DL), data augmentation, early stopping.
- Model complexity control: Pruning trees, limiting polynomial degree, simplifying architectures.
- Cross-validation & hyperparameter tuning: Grid/random/Bayesian search with k-fold CV.
- More/better data: Increase sample size, improve data quality, feature engineering.
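
The train-versus-validation comparison described above can be sketched with a toy k-nearest-neighbour regressor, where `k` plays the role of model complexity. This is a minimal illustration in plain Python, not a recommended workflow: the sine-shaped data, the noise level, and the helper names (`make_data`, `knn_predict`, `mse`) are all assumptions made for the example.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of y = sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(y - knn_predict(x, train_x, train_y, k)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
train_x, train_y = make_data(60, rng)
val_x, val_y = make_data(60, rng)

results = {}
for k in (1, 7, 60):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    val_err = mse(val_x, val_y, train_x, train_y, k)
    results[k] = (train_err, val_err)
    print(f"k={k:>2}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

With `k=1` the model memorizes the training set (zero training error, nonzero validation error), which is the overfitting signature. With `k=60` every prediction is the global training mean, so error is high on both sets, which is the underfitting signature. An intermediate `k` typically sits between the two.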




# Question 3: How would you handle missing values in a dataset? Explain at least three methods with examples. Hint: Consider deletion, mean/median imputation, and predictive modeling.
Answer: Handling missing values in a dataset

## Deletion:
- Listwise deletion: Remove rows with missing values; suitable when missingness is rare and MCAR (missing completely at random).
- Example: If only 1% of rows in a large dataset are missing Age, dropping them costs little information and, under MCAR, introduces negligible bias.
- Risk: Loss of data; biased results if missingness relates to the outcome or to other features.
## Simple imputation:
- Mean/median (numeric): Median for skewed distributions; mean for roughly symmetric.
- Mode (categorical): Replace with most frequent category.
- Example: Replace missing income with median income; replace missing City with mode “Mumbai.”
- Risk: Underestimates variability and can bias relationships.
## Predictive modeling:
- Impute via models: Train a model (e.g., regression, k-NN, iterative imputer) to predict missing values from other features.
- Example: Predict missing Age using Fare, Pclass, Sex in Titanic via iterative imputer.
- Pros/cons: Often more accurate than simple imputation, but adds modeling complexity and uncertainty; fit the imputer on the training split only to avoid leakage.
## Best practices:
- Diagnose mechanism: MCAR vs. MAR vs. MNAR.
- Flag imputed values: Add indicator features to capture missingness patterns.
- Evaluate impact: Compare models with/without imputation strategies.
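
The three approaches, plus the missingness flag, can be sketched on a handful of made-up records. This is a minimal illustration in plain Python: the records are hypothetical, `predict_age` is a stand-in 1-nearest-neighbour "model" rather than a real iterative imputer, and in a real pipeline the statistics would be computed on the training split only.

```python
from statistics import median, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 22,   "income": 30000,  "city": "Mumbai"},
    {"age": None, "income": 42000,  "city": "Delhi"},
    {"age": 35,   "income": None,   "city": "Mumbai"},
    {"age": 58,   "income": 250000, "city": None},
    {"age": 41,   "income": 51000,  "city": "Pune"},
]

# 1) Listwise deletion: keep only fully observed rows.
complete = [r for r in rows if all(v is not None for v in r.values())]

# 2) Simple imputation: median for the skewed numeric column, mode for the categorical one.
income_median = median(r["income"] for r in rows if r["income"] is not None)
city_mode = mode(r["city"] for r in rows if r["city"] is not None)

# 3) Model-based imputation, here a 1-nearest-neighbour stand-in:
#    copy Age from the donor row whose income is closest.
donors = [r for r in rows if r["age"] is not None and r["income"] is not None]

def predict_age(income):
    return min(donors, key=lambda d: abs(d["income"] - income))["age"]

imputed = []
for r in rows:
    filled = dict(r)
    filled["age_was_missing"] = r["age"] is None  # indicator keeps the missingness pattern
    if r["income"] is None:
        filled["income"] = income_median
    if r["city"] is None:
        filled["city"] = city_mode
    if r["age"] is None:
        filled["age"] = predict_age(filled["income"])
    imputed.append(filled)

print(len(complete), income_median, city_mode, imputed[1]["age"])
```

Deletion keeps only two of the five rows here, which shows how quickly it discards data when missingness is spread across columns.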





# Question 4: What is an imbalanced dataset? Describe two techniques to handle it (theoretical + practical). Hint: Discuss SMOTE, Random Under/Oversampling, and class weights in models.
Answer: Imbalanced datasets and handling techniques
## Definition:
- Class distribution is skewed (e.g., 95% negatives, 5% positives), causing models to favor the majority class and misleading metrics like accuracy.
## Theoretical techniques:
- Class weights: Penalize misclassification of minority class more during training (supported in many algorithms like logistic regression, tree-based models, SVM, neural nets).
- SMOTE: Synthetic Minority Over-sampling Technique creates synthetic minority samples by interpolating between nearest neighbors.
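
Both ideas reduce to short formulas, sketched below in plain Python. The weighting rule `n_samples / (n_classes * class_count)` is the common "balanced" heuristic; `smote_like` is a simplified, hypothetical illustration of the interpolation step only, not the full SMOTE algorithm or any library's implementation, and the sample points are made up.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic

labels = [0] * 95 + [1] * 5          # the 95% / 5% split from the definition
weights = balanced_class_weights(labels)

minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 2.9)]
new_points = smote_like(minority_points, n_new=4)

print(weights)
print(new_points)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.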
## Practical techniques:
- Random oversampling/undersampling: Duplicate minority samples or remove majority samples to balance classes; quick and effective but risks overfitting (oversampling) or losing information (undersampling).
- Pipeline considerations: Apply resampling on the training split only; use stratified CV; evaluate with precision-recall, AUROC, F1 instead of accuracy.
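
The "resample the training split only" rule can be sketched as follows. This is a minimal plain-Python illustration on synthetic (features, label) pairs; `stratified_split` and `random_oversample` are hypothetical helpers written for this example, not library functions.

```python
import random
from collections import Counter

def stratified_split(data, test_fraction, rng):
    """Split (features, label) pairs so each class keeps its proportion in both parts."""
    by_class = {}
    for item in data:
        by_class.setdefault(item[1], []).append(item)
    train, test = [], []
    for items in by_class.values():
        items = items[:]                      # copy so the caller's data is not shuffled
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

def random_oversample(train, rng):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    by_class = {}
    for item in train:
        by_class.setdefault(item[1], []).append(item)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced

rng = random.Random(42)
data = [((i,), 0) for i in range(95)] + [((i,), 1) for i in range(95, 100)]

train, test = stratified_split(data, test_fraction=0.2, rng=rng)
train_balanced = random_oversample(train, rng)   # the test split is left untouched

print("train:", Counter(label for _, label in train))
print("train (oversampled):", Counter(label for _, label in train_balanced))
print("test:", Counter(label for _, label in test))
```

Splitting before resampling keeps duplicated minority rows out of the test set, so the evaluation still reflects the original class distribution.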



# Question 5: Why is feature scaling important in ML? Compare Min-Max scaling and Standardization. Hint: Explain impact on distance-based algorithms (e.g., KNN, SVM) and gradient descent. 
Answer:-
## Importance:
- Distance-based algorithms (KNN, k-means), margin-based (SVM with RBF), and gradient descent optimization are sensitive to feature scales; unscaled features can dominate distances/margins and slow or destabilize optimization.
- Tree-based methods (Random Forest, XGBoost) are largely scale-invariant.
## Min-Max scaling (normalization):
- Transform: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, mapping values to the range [0, 1] (or another specified range).
- Pros: Preserves the shape of the original distribution and relative distances; bounded outputs are useful for algorithms that assume limited ranges and for image pixels.
- Cons: Sensitive to outliers; min/max can shift with new data; does not center the data.
## Standardization (z-score):
- Transform: $x' = \frac{x - \mu}{\sigma}$, giving mean 0 and variance 1.
- Pros: More robust to outliers than min-max; suits models assuming Gaussianity; improves convergence in linear/logistic regression and neural nets.
- Cons: Unbounded; interpretation less intuitive; still affected by extreme outliers (use robust scalers if needed).
## When to use:
- Min-Max: KNN, distance-based methods where bounded scales help; neural nets with bounded activations; data without heavy outliers.
- Standardization: SVM, linear models, PCA, logistic regression, deep learning; mixed-scale features; presence of outliers.
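
The two transforms can be compared directly on a small column with one outlier. This is a minimal plain-Python sketch: the income values are made up, and the population standard deviation (`pstdev`) is an assumed choice for the $\sigma$ in the z-score formula.

```python
from statistics import mean, pstdev

def min_max_scale(values, low=0.0, high=1.0):
    """x' = low + (x - min) * (high - low) / (max - min)"""
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]

def standardize(values):
    """x' = (x - mean) / std, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 30_000, 40_000, 50_000, 1_000_000]  # last value is an outlier

scaled = min_max_scale(incomes)
z = standardize(incomes)

print([round(v, 4) for v in scaled])
print([round(v, 4) for v in z])
```

In the min-max output the outlier maps to 1 and the four ordinary values are squeezed into a narrow band near 0, which is the outlier sensitivity noted above. The z-scores are unbounded but centred on 0 with unit variance.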






# Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer one over the other? Hint: Consider categorical variables with ordinal vs. nominal relationships.
Answer:-

## Label encoding vs. one-hot encoding
### Label encoding:
- What: Map categories to integers (A→0, B→1, C→2).
- Use when: Ordinal categorical variables where order matters (e.g., size: Small < Medium < Large).
- Risk: Creates false ordinal relationships for nominal variables; tree models can sometimes tolerate it, but linear/distance-based models may misinterpret magnitude.
### One-hot encoding:
- What: Create binary indicator columns per category.
- Use when: Nominal variables with no inherent order (e.g., city names, colors).
- Trade-offs: Increases dimensionality; can be sparse; necessary for models that interpret values as distances or weights.
### Guidance:
- Prefer label encoding for ordinal features and one-hot for nominal. For high-cardinality nominal features, consider target encoding, hashing, or embeddings (with careful validation to avoid leakage).
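
The difference is easiest to see on the two examples used above (sizes and cities). This is a minimal plain-Python sketch; `label_encode` and `one_hot_encode` are hypothetical helpers written for illustration, and the explicit `order` argument is an assumption that makes the ordinal mapping deliberate rather than alphabetical.

```python
def label_encode(values, order):
    """Map each category to its position in an explicit, meaningful order."""
    mapping = {cat: i for i, cat in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Return (sorted categories, one binary indicator row per value)."""
    categories = sorted(set(values))
    return categories, [[int(v == cat) for cat in categories] for v in values]

# Ordinal feature: the integer codes preserve Small < Medium < Large.
sizes = ["Small", "Large", "Medium", "Small"]
size_codes = label_encode(sizes, order=["Small", "Medium", "Large"])
print(size_codes)      # [0, 2, 1, 0]

# Nominal feature: one column per city, no implied order.
cities = ["Mumbai", "Delhi", "Pune", "Delhi"]
city_columns, city_rows = one_hot_encode(cities)
print(city_columns)    # ['Delhi', 'Mumbai', 'Pune']
print(city_rows)       # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

Label-encoding the cities instead would assign integers such as Delhi=0, Mumbai=1, Pune=2, which a linear or distance-based model would read as Pune being "twice as far" from Delhi as Mumbai is.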


In [1]:
# Question 7:  Google Play Store Dataset a). Analyze the relationship between app categories and ratings. Which categories have the highest/lowest average ratings, and what could be the possible reasons? Dataset: https://github.com/MasteriNeuron/datasets.git (Include your Python code and output in the code box below.) 

# Answer: 

import pandas as pd

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/googleplaystore.csv"
df = pd.read_csv(url)

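# Basic cleaning: coerce Rating to numeric, then drop rows without a rating or category
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating', 'Category'])
# Note: one malformed row (Category "1.9", Rating 19.0) survives this cleaning and
# tops the output below; df[df['Rating'].between(1, 5)] would exclude it.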

# Aggregate average ratings by category
cat_ratings = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)

# Top and bottom 10 categories by average rating
top10 = cat_ratings.head(10)
bottom10 = cat_ratings.tail(10)

print("Top 10 categories by average rating:")
print(top10.to_string())
print("\nBottom 10 categories by average rating:")
print(bottom10.to_string())

# Optional: count apps per category (to contextualize averages)
counts = df['Category'].value_counts()

print("\nApp counts per category (top 10):")
print(counts.head(10).to_string())

Top 10 categories by average rating:
Category
1.9                    19.000000
EVENTS                  4.435556
EDUCATION               4.389032
ART_AND_DESIGN          4.358065
BOOKS_AND_REFERENCE     4.346067
PERSONALIZATION         4.335987
PARENTING               4.300000
GAME                    4.286326
BEAUTY                  4.278571
HEALTH_AND_FITNESS      4.277104

Bottom 10 categories by average rating:
Category
NEWS_AND_MAGAZINES     4.132189
FINANCE                4.131889
ENTERTAINMENT          4.126174
BUSINESS               4.121452
TRAVEL_AND_LOCAL       4.109292
LIFESTYLE              4.094904
VIDEO_PLAYERS          4.063750
MAPS_AND_NAVIGATION    4.051613
TOOLS                  4.047411
DATING                 3.970769

App counts per category (top 10):
Category
FAMILY             1747
GAME               1097
TOOLS               734
PRODUCTIVITY        351
MEDICAL             350
COMMUNICATION       328
FINANCE             323
SPORTS              319
PHOTOGRAPHY       

In [2]:
# Question 8: Titanic Dataset 
#a) Compare the survival rates based on passenger class (Pclass). Which class had the highest survival rate, and why do you think that happened? 
#b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and adults (Age ≥ 18). Did children have a better chance of survival? 
#Dataset: https://github.com/MasteriNeuron/datasets.git (Include your Python code and output in the code box below.) 

# Answer: 

import pandas as pd

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/titanic.csv"
df = pd.read_csv(url)

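# Keep only the columns needed; copy so that adding AgeGroup does not modify a view
df = df[['Survived','Pclass','Age','Sex']].copy()

# Survival rates by Pclass
pclass_survival = df.groupby('Pclass')['Survived'].mean().sort_index()

# Age groups via pd.cut. With the default right=True the bins are (-inf, 18] and (18, inf),
# so 18-year-olds are counted as children; pass right=False for a strict Age < 18 split.
# Rows with missing Age get no AgeGroup and are excluded from the age-based rates.
df['AgeGroup'] = pd.cut(df['Age'], bins=[-float('inf'), 18, float('inf')], labels=['Child','Adult'])
age_survival = df.groupby('AgeGroup', observed=False)['Survived'].mean()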

print("Survival rate by Pclass:")
print((pclass_survival*100).round(2).astype(str) + "%")

print("\nSurvival rate by AgeGroup:")
print((age_survival*100).round(2).astype(str) + "%")

# Additional: cross-tab by class and age group (observed=False keeps all AgeGroup categories)
pivot = df.pivot_table(index='Pclass', columns='AgeGroup', values='Survived', aggfunc='mean', observed=False)
print("\nSurvival rate by Pclass and AgeGroup (%):")
print((pivot*100).round(2))

Survival rate by Pclass:
Pclass
1    62.96%
2    47.28%
3    24.24%
Name: Survived, dtype: object

Survival rate by AgeGroup:
AgeGroup
Child    50.36%
Adult    38.26%
Name: Survived, dtype: object

Survival rate by Pclass and AgeGroup (%):
AgeGroup  Child  Adult
Pclass                
1         87.50  63.53
2         79.31  41.67
3         35.11  19.92




In [5]:
#Question 9: Flight Price Prediction Dataset 
#a) How do flight prices vary with the days left until departure? Identify any exponential price surges and recommend the best booking window. 
#b)Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are consistently cheaper/premium, and why? 
#Dataset: https://github.com/MasteriNeuron/datasets.git (Include your Python code and output in the code box below.)

#Answer:  

import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/flight_price.csv"
df = pd.read_csv(url)

# Expected columns: 'price', 'airline', 'source_city', 'destination_city', 'days_left'
# Build 'route' as "source-destination" if the dataset does not already provide it
if 'route' not in df.columns and {'source_city','destination_city'}.issubset(df.columns):
    df['route'] = df['source_city'].str.strip() + "-" + df['destination_city'].str.strip()

# Clean: coerce to numeric, then drop rows where price or days_left is missing or unparseable
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['days_left'] = pd.to_numeric(df['days_left'], errors='coerce')
df = df.dropna(subset=['price','days_left'])

# a) Price vs days_left
days_price = df.groupby('days_left')['price'].mean().sort_index()

# Relative change between consecutive days_left values. days_left is sorted ascending,
# so a positive value at day d means the average price d days out is higher than d-1 days out.
rel_change = days_price.pct_change().fillna(0)
# Example heuristic: flag days where the relative change exceeds 15%
surge_days = rel_change[rel_change > 0.15].index.tolist()

print("Average price by days left (first 15):")
print(days_price.head(15).round(2).to_string())

print("\nDetected surge days (heuristic):", surge_days)

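# Recommend booking window: smooth with a 7-day rolling mean and take its minimum.
# rolling() is trailing by default, so the reported day is the last day of the cheapest
# window; pass center=True to report the window centre instead.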
rolling = days_price.rolling(7, min_periods=3).mean()
min_window_day = rolling.idxmin()
print("Suggested lowest-price window center day_left:", int(min_window_day))

# b) Airline comparison for Delhi-Mumbai
route_mask = df['route'].str.lower() == 'delhi-mumbai'
dm = df[route_mask]
airline_avg = dm.groupby('airline')['price'].mean().sort_values()
print("\nDelhi-Mumbai average prices by airline:")
print(airline_avg.round(2).to_string())

# Classify airlines as cheaper/premium relative to route average
route_avg = dm['price'].mean()
labels = (airline_avg / route_avg).apply(lambda x: 'Cheaper' if x < 0.95 else ('Premium' if x > 1.05 else 'Average'))
print("\nRelative classification:")
print(labels.to_string())

Average price by days left (first 15):
days_left
1     21591.87
2     30211.30
3     28976.08
4     25730.91
5     26679.77
6     24856.49
7     25588.37
8     24895.88
9     25726.25
10    25572.82
11    22990.66
12    22505.80
13    22498.89
14    22678.00
15    21952.54

Detected surge days (heuristic): [2]
Suggested lowest-price window center day_left: 49

Delhi-Mumbai average prices by airline:
airline
AirAsia       3981.19
Indigo        4473.74
SpiceJet      4628.25
GO_FIRST      5762.21
Air_India    23695.92
Vistara      26630.29

Relative classification:
airline
AirAsia      Cheaper
Indigo       Cheaper
SpiceJet     Cheaper
GO_FIRST     Cheaper
Air_India    Premium
Vistara      Premium
