# Foundations of Machine Learning and EDA – Assignment Answers

## Question 1: AI vs ML vs DL vs Data Science
**AI**: The broad field of building systems that perform tasks normally requiring human intelligence, using rule‑based, statistical, or learning‑based approaches.

**ML**: Subset of AI using algorithms that learn patterns from data (supervised, unsupervised, reinforcement learning).

**DL**: Subset of ML using multi‑layered neural networks to learn representations for complex tasks such as computer vision and natural language processing (NLP).

**Data Science**: End‑to‑end discipline covering data collection, cleaning, analysis, modeling, and communication.

## Question 2: Overfitting and Underfitting
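**Overfitting**: The model fits noise in the training data, giving low training error but high test error. Detect it by a large gap between training and validation error (for example under cross‑validation). Reduce it with regularization, dropout, early stopping, or a simpler model.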

**Underfitting**: The model is too simple to capture the underlying pattern, giving high error on both the training and test sets. Fix it by adding informative features, increasing model complexity, or reducing regularization.

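Related concept: **Bias‑variance tradeoff**. Underfitting corresponds to high bias and overfitting to high variance; the aim is a model complexity that balances the two.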
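The train/validation gap can be illustrated with a minimal sketch. The data here is synthetic (noisy samples of a sine curve, an assumption made purely for illustration), and polynomial degree stands in for model complexity. Training error can only go down as the degree grows, while validation error typically stops improving or gets worse once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): noisy samples of a sine curve on [-1, 1].
x = np.sort(rng.uniform(-1.0, 1.0, 30))
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, x.size)

# Hold out every third point for validation.
val_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"validation MSE={results[degree][1]:.3f}")
```

A low degree with high error on both sets suggests underfitting; a high degree with low training error but noticeably higher validation error suggests overfitting.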

## Question 3: Handling Missing Values
**Deletion**: Drop rows that contain missing values when only a small fraction of rows is affected, or drop a column when most of its values are missing.

**Imputation**: Fill with the mean or median (numerical; the median is more robust to outliers) or the mode (categorical).

**Predictive Modeling**: Predict missing values from the other features, for example with a regression model or a KNN imputer.
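Deletion and simple imputation can be sketched in pandas as follows. The toy DataFrame and its column names (`age`, `city`) are hypothetical and only there to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Deletion: keep only rows with no missing values.
dropped = df.dropna()

# Imputation: median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(dropped)
print(imputed)
```

Note how deletion discards three of the five rows here, which is why it is usually reserved for cases where few rows are affected.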

## Question 4: Imbalanced Dataset
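Imbalance occurs when one class has far more samples than the others, so a model can reach high accuracy while largely ignoring the minority class.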

**Random Oversampling/Undersampling**: Duplicate minority‑class samples or remove majority‑class samples until the classes are closer to balanced.

**SMOTE**: Generates synthetic minority examples by interpolating between a minority sample and its nearest minority‑class neighbours.

**Class Weights**: Give minority‑class errors a larger weight in the loss so that misclassifying them is penalized more heavily.
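Random oversampling and class weights can be sketched with pandas alone. The toy data is hypothetical, and the weight formula `n_samples / (n_classes * class_count)` is one common "balanced" heuristic rather than the only option. SMOTE is not shown because it needs a dedicated library.

```python
import pandas as pd

# Hypothetical toy data: 8 majority (0) and 2 minority (1) samples.
df = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})

counts = df["label"].value_counts()
majority_n = int(counts.max())

# Random oversampling: top up each smaller class with duplicates drawn with replacement.
parts = []
for _, grp in df.groupby("label"):
    extra = majority_n - len(grp)
    if extra > 0:
        grp = pd.concat([grp, grp.sample(n=extra, replace=True, random_state=0)])
    parts.append(grp)
oversampled = pd.concat(parts, ignore_index=True)

# "Balanced" class weights: n_samples / (n_classes * class_count).
class_weights = (len(df) / (len(counts) * counts)).to_dict()

print(oversampled["label"].value_counts().to_dict())
print(class_weights)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.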

## Question 5: Feature Scaling
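Important for distance‑based models (e.g. KNN, k‑means, SVM) and for models trained with gradient descent, where features on larger scales would otherwise dominate.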

**Min‑Max scaling**: Rescales each feature to [0, 1] via (x - min) / (max - min); sensitive to outliers.

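**Standardization**: Rescales each feature to mean 0 and standard deviation 1 via (x - mean) / SD; it does not bound the range and is a common default for many ML models.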
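Both transformations are short enough to write out directly in NumPy. The feature values below are hypothetical and include one deliberate outlier to show its effect on min‑max scaling.

```python
import numpy as np

# Hypothetical feature column with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / SD, result has mean 0 and SD 1.
x_std = (x - x.mean()) / x.std()

print(x_minmax.round(3))
print(x_std.round(3))
```

With the outlier present, min‑max scaling squeezes the four ordinary values into roughly the bottom 3% of the range. In practice the scaling parameters would be computed on the training set and reused on the test set.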

## Question 6: Label Encoding vs One‑Hot Encoding
**Label Encoding**: Maps each category to an integer; suitable for ordinal features, where the integer order is meaningful (e.g. low < medium < high).

**One‑Hot Encoding**: Represents each category as a binary indicator column; suitable for nominal categories with no inherent order.
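A minimal pandas sketch of the two encodings, using hypothetical `size` (ordinal) and `colour` (nominal) columns:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],      # ordinal
    "colour": ["red", "green", "blue", "green"],   # nominal
})

# Label (ordinal) encoding with an explicit category order.
size_order = ["low", "medium", "high"]
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour")

print(df)
print(one_hot)
```

Stating the category order explicitly matters: without it the integer codes follow alphabetical order, which would place "high" before "low".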

## Question 7: Google Play Store Dataset — Category vs Ratings

```python
import pandas as pd

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/master/googleplaystore.csv"
df = pd.read_csv(url)

# Coerce non-numeric ratings to NaN, then average the rating per category.
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
category_avg = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)

# Five highest-rated and five lowest-rated categories.
category_avg.head(), category_avg.tail()
```

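```text
(Category
 1.9                    19.000000
 EVENTS                  4.435556
 EDUCATION               4.389032
 ART_AND_DESIGN          4.358065
 BOOKS_AND_REFERENCE     4.346067
 Name: Rating, dtype: float64,
 Category
 LIFESTYLE              4.094904
 VIDEO_PLAYERS          4.063750
 MAPS_AND_NAVIGATION    4.051613
 TOOLS                  4.047411
 DATING                 3.970769
 Name: Rating, dtype: float64)
```

The top entry, `1.9` with a mean rating of 19.0, is not a real category: ratings range from 1 to 5, so it almost certainly comes from a malformed row whose columns are shifted, and it should be excluded. Among the valid categories, EVENTS has the highest average rating (about 4.44) and DATING the lowest (about 3.97).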
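One way to keep out‑of‑range ratings out of the averages is to filter on the valid 1 to 5 range before grouping. The sketch below uses a hypothetical four‑row miniature of the table instead of downloading the real file.

```python
import pandas as pd

# Hypothetical miniature of the Play Store table, including one shifted row.
df = pd.DataFrame({
    "Category": ["EVENTS", "EVENTS", "DATING", "1.9"],
    "Rating": ["4.5", "4.3", "3.9", "19"],
})

df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")

# Keep only ratings inside the valid 1-5 range; NaN ratings are dropped as well.
valid = df[df["Rating"].between(1, 5)]

category_avg = (
    valid.groupby("Category")["Rating"].mean().sort_values(ascending=False)
)
print(category_avg)
```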

## Question 8: Titanic Dataset

```python
import pandas as pd

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/master/titanic.csv"
titanic = pd.read_csv(url)

# Survival rate by passenger class
class_survival = titanic.groupby('Pclass')['Survived'].mean()

# Children (under 18) vs adults.
# Caveat: a missing Age is NaN, and NaN < 18 is False, so those rows are labelled 'Adult'.
titanic['Group'] = titanic['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')
group_survival = titanic.groupby('Group')['Survived'].mean()

class_survival, group_survival
```

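```text
(Pclass
 1    0.629630
 2    0.472826
 3    0.242363
 Name: Survived, dtype: float64,
 Group
 Adult    0.361183
 Child    0.539823
 Name: Survived, dtype: float64)
```

Survival falls steadily with class: about 63% in first class, 47% in second, and 24% in third. Children (under 18) survived at about 54% versus 36% for the Adult group, which here also contains passengers whose age is missing.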
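One way to keep passengers of unknown age out of both groups is `pd.cut`, which leaves missing values unbinned. This is a sketch on a hypothetical seven‑row frame, not a rerun of the analysis, so the rates it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two passengers have no recorded age.
titanic = pd.DataFrame({
    "Age": [5.0, 30.0, np.nan, 16.0, 45.0, np.nan, 22.0],
    "Survived": [1, 0, 1, 0, 1, 0, 0],
})

# pd.cut leaves missing ages as NaN instead of assigning them to a bin.
titanic["Group"] = pd.cut(
    titanic["Age"],
    bins=[0, 18, np.inf],
    right=False,                 # [0, 18) -> Child, [18, inf) -> Adult
    labels=["Child", "Adult"],
)

# groupby drops rows whose key is NaN by default.
group_survival = titanic.groupby("Group", observed=True)["Survived"].mean()
print(group_survival)
```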

## Question 9: Flight Price Prediction

```python
import pandas as pd

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/master/flight_price.csv"
fp = pd.read_csv(url)

# Coerce both columns to numeric; unparseable values become NaN.
fp['days_left'] = pd.to_numeric(fp['days_left'], errors='coerce')
fp['price'] = pd.to_numeric(fp['price'], errors='coerce')

# Mean price by days left before departure, and by airline and route.
days_price = fp.groupby('days_left')['price'].mean()
airline_price = fp.groupby(['airline', 'source_city', 'destination_city'])['price'].mean()

days_price.head(), airline_price.head()
```

```text
(days_left
 1    21591.867151
 2    30211.299801
 3    28976.083569
 4    25730.905653
 5    26679.773368
 Name: price, dtype: float64,
 airline  source_city  destination_city
 AirAsia  Bangalore    Chennai             2073.043478
                       Delhi               4807.092426
                       Hyderabad           2931.494792
                       Kolkata             4443.468160
                       Mumbai              3342.385350
 Name: price, dtype: float64)
```

## Question 10: HR Analytics Dataset

```python
import pandas as pd

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/master/hr_analytics.csv"
hr = pd.read_csv(url)

# Check the column names
print(hr.columns)

# 'left' is taken to be the attrition flag; correlate every numeric column with it.
corr = hr.corr(numeric_only=True)['left'].sort_values(ascending=False)
corr.head()
```

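```text
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')
```

```text
left                    1.000000
time_spend_company      0.144822
average_montly_hours    0.071287
number_project          0.023787
last_evaluation         0.006567
Name: left, dtype: float64
```

Because the values are sorted in descending order, `head()` shows only the largest positive correlations. Any strongly negative correlations sit at the other end of the series, so `corr.tail()` should be inspected as well before drawing conclusions about attrition.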
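To rank features by strength of association regardless of sign, the series can be sorted by absolute value. The sketch below uses a small hypothetical frame that borrows three column names from the dataset; its numbers are made up and say nothing about the real data.

```python
import pandas as pd

# Hypothetical values, chosen so one feature is strongly negatively correlated.
hr = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "number_project": [3, 4, 5, 3, 4, 5],
    "left": [0, 0, 1, 1, 0, 1],
})

# Correlation of each numeric feature with the target, excluding the target itself.
corr = hr.corr(numeric_only=True)["left"].drop("left")

# Sort by magnitude so strong negative correlations are not pushed to the bottom.
ranked = corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Dropping `left` from its own correlation series also removes the trivial 1.0 entry from the ranking.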
