# DA-AG-007 — Foundations of Machine Learning and EDA
Submitted by - Mohd Khalil





In [None]:

# --- Injected cell: use local CSVs ---
googleplay_local = '/mnt/data/googleplaystore.csv'
titanic_local = '/mnt/data/titanic.csv'
flight_local = '/mnt/data/flight_price.csv'
hr_local = '/mnt/data/hr_analytics.csv'


url = googleplay_local
titanic_url = titanic_local
flight_url = flight_local
hr_url = hr_local

print('Using local datasets:')
print(url)
print(titanic_url)
print(flight_url)
print(hr_url)


Using local datasets:
/mnt/data/googleplaystore.csv
/mnt/data/titanic.csv
/mnt/data/flight_price.csv
/mnt/data/hr_analytics.csv


### Question 1: Difference between AI, ML, DL, and Data Science

**AI (Artificial Intelligence)** — broad field focused on creating systems that perform tasks that normally require human intelligence. Scope: reasoning, planning, language, perception. Techniques: search algorithms, knowledge representation, rule-based systems, ML. Applications: autonomous vehicles, chatbots, recommendation systems.

**ML (Machine Learning)** — a subset of AI where systems learn patterns from data. Scope: supervised, unsupervised, reinforcement learning. Techniques: linear/logistic regression, decision trees, SVM, ensemble methods. Applications: classification, regression, clustering, anomaly detection.

**DL (Deep Learning)** — a subset of ML that uses neural networks with many layers (deep networks). Scope: representation learning from raw data. Techniques: CNNs, RNNs, Transformers. Applications: image recognition, NLP, speech, generative models.

**Data Science** — interdisciplinary field combining domain knowledge, statistics, and computing to extract insights from data. Scope: data cleaning, EDA, modeling, visualization, deployment. Techniques: statistics, ML, data engineering, storytelling. Applications: business intelligence, product analytics, scientific research.

Key differences:
- Scope: AI (broadest) → ML (subset) → DL (subset of ML). Data Science overlaps with all and focuses on the data-to-insight pipeline.
- Techniques: AI includes symbolic approaches; ML focuses on learning algorithms; DL uses deep neural nets; Data Science includes statistics and engineering.
- Applications: AI for general automation; ML/DL for predictive tasks; Data Science for analysis and decision support.


### Question 2: Overfitting and Underfitting

**Overfitting**: A model learns noise and idiosyncrasies of training data, performing well on training but poorly on unseen data. Symptoms: large gap between training and validation error.

**Underfitting**: A model is too simple to capture underlying patterns, showing poor performance on both training and validation data. Symptoms: high training error.

**Bias-Variance Tradeoff**:
- High bias → underfitting (model too simple).
- High variance → overfitting (model too complex).
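
For squared-error loss, the tradeoff is usually written as the standard decomposition of expected test error at a point $x$. The symbols are introduced here for illustration only: $f$ is the true function, $\hat{f}$ the fitted model (random through the training sample), and $\sigma^2$ the irreducible noise variance.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \sigma^2
$$

Simpler models tend to raise the first term and lower the second; more flexible models tend to do the opposite.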

**Detection**:
- Plot learning curves (training vs validation error as function of training set size or model complexity).
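- Cross-validation scores: a training score much better than the mean validation score indicates overfitting; large variance across folds suggests an unstable, high-variance model.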

**Prevention / Remedies**:
- Regularization (L1/L2) to penalize large weights.
- Use simpler models or reduce model complexity to combat overfitting.
- Increase training data or use data augmentation.
- Early stopping during training when validation loss stops improving.
- Cross-validation (k-fold) to get robust estimate of generalization.
- Ensembling (bagging) to reduce variance.
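
The gap between training and validation error described above can be reproduced on synthetic data. The following is a minimal sketch, assuming noisy samples of a sine curve and polynomial fits of increasing degree; the data, the degrees, and the helper names `make_data` and `mse` are illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training set only
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = {
        "train": mse(coeffs, x_train, y_train),
        "val": mse(coeffs, x_val, y_val),
    }
    print(f"degree={degree}: train MSE={results[degree]['train']:.4f}, "
          f"val MSE={results[degree]['val']:.4f}")
```

A degree-1 fit should show high error on both sets (underfitting). Training error can only go down as the degree grows, so a validation error that stops improving or rises at high degree is the overfitting signal.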


### Question 3: Handling Missing Values (3 methods)

1. **Deletion (listwise or pairwise)**
   - Drop rows (`df.dropna()`) or columns with missing values.
   - Pros: simple, preserves observed values.
   - Cons: loses data; biased if missingness is not completely at random.

2. **Simple imputation (mean/median/mode)**
   - Replace numeric NaNs with column mean/median; categorical with mode.
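   - Example: `df['age'] = df['age'].fillna(df['age'].median())` (assigning the result back is more reliable than `inplace=True` on a selected column, which recent pandas versions warn about).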
   - Pros: easy and fast.
   - Cons: underestimates variance and may introduce bias.

3. **Predictive modeling / advanced imputation**
   - Use a model (KNNImputer, IterativeImputer / MICE) to predict missing values from other features.
   - Example: scikit-learn `IterativeImputer` or `KNNImputer`.
   - Pros: often more accurate; preserves relationships between features.
   - Cons: more complex and computationally intensive.

Other considerations:
- Always analyze missingness mechanism (MCAR, MAR, MNAR).
- Consider adding a missingness indicator column (`df['age_missing']=df['age'].isnull().astype(int)`).
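
As a minimal sketch of the deletion and simple-imputation approaches plus the missingness indicator, the snippet below uses a small made-up frame. The column names `age` and `fare` and all values are assumptions for illustration. Model-based imputers such as `KNNImputer` are left out to keep the example dependent on pandas only.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; the values are made up.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, 8.05, np.nan, 13.00],
})

# 1. Listwise deletion: keep only rows with no missing values.
dropped = df.dropna()

# 2. Median imputation, recording a missingness indicator before filling.
imputed = df.copy()
for col in ["age", "fare"]:
    imputed[f"{col}_missing"] = imputed[col].isna().astype(int)
    imputed[col] = imputed[col].fillna(imputed[col].median())

print(dropped)
print(imputed)
```

Note how deletion keeps only two of the five rows here, while imputation keeps all five at the cost of repeating the median value.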


### Question 4: Imbalanced Dataset

An imbalanced dataset is one where class labels are not represented equally (e.g., fraud detection where positive cases are rare). Problems: classifiers can be biased towards majority class; evaluation metrics like accuracy become misleading.

Techniques:
- **SMOTE (Synthetic Minority Over-sampling Technique)**: generates synthetic examples of the minority class by interpolating between minority neighbors. Practical: `imblearn.over_sampling.SMOTE()`.
- **Random undersampling / oversampling**: Remove samples from majority (undersample) or duplicate minority samples (oversample). Practical: `RandomUnderSampler`, `RandomOverSampler` in `imblearn`.
- **Class weights**: Set `class_weight='balanced'` in models like `LogisticRegression`, `RandomForestClassifier` to penalize mistakes on minority class more.
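
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_like`, the choice of `k`, and the minority points are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in 2-D (made-up values).
minority = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [2.5, 2.5], [3.0, 1.8],
])

def smote_like(points, n_new, k=2):
    """Create n_new synthetic points, each on the segment between a random
    minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(points[i] + lam * (points[j] - points[i]))
    return np.array(synthetic)

new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies rather than duplicating rows exactly.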

When to use which:
- If you have plenty of majority data: undersample.
- If small dataset: SMOTE or class weights to avoid information loss.
- Always evaluate with metrics suited to imbalance: precision-recall, F1, ROC AUC, PR AUC, confusion matrix, and use cross-validation.
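
To see why accuracy misleads on imbalanced data, and what `class_weight='balanced'` computes, consider the sketch below. It uses synthetic labels (90 negatives, 10 positives) and a trivial predictor that always outputs the majority class. The weight formula `n_samples / (n_classes * count_per_class)` is the one scikit-learn documents for the `'balanced'` mode.

```python
import numpy as np

# Synthetic labels: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_pred == y_true))
true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
recall = true_pos / int(np.sum(y_true == 1))

# 'balanced' class weights: n_samples / (n_classes * count_per_class)
counts = np.bincount(y_true)
class_weights = len(y_true) / (len(counts) * counts)

print("accuracy:", accuracy)
print("recall on positives:", recall)
print("balanced class weights:", class_weights)
```

The predictor scores 90% accuracy while finding none of the positives, which is why recall, F1, or PR AUC are the more informative metrics here. The balanced weights make each minority-class mistake count nine times as much as a majority-class one in this example.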


### Question 5: Feature Scaling — Min-Max vs Standardization

Feature scaling is important because many ML algorithms (KNN, SVM, K-means, gradient-based models) use distances or assume features are on comparable scales. Without scaling, features with larger magnitudes dominate.

**Min-Max scaling (Normalization)**
- Transforms features to a fixed range, usually [0, 1]: `X_scaled = (X - X.min()) / (X.max() - X.min())`.
- Preserves the shape of the original distribution (it is a linear rescaling).
- Sensitive to outliers: a single extreme value stretches the range and compresses the remaining values into a narrow band.

**Standardization (Z-score)**
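- Centers features to mean 0 with standard deviation 1: `X_scaled = (X - X.mean()) / X.std()`.
- Does not bound values. Outliers still shift the mean and standard deviation, but the result is usually less distorted than with Min-Max, whose range is set entirely by the extremes.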
- Preferred for algorithms that assume zero-centered data (e.g., PCA, linear models).

Which to use:
- Use **Min-Max** when you need features within a bounded interval (e.g., neural network inputs when activations are sensitive).
- Use **Standardization** for algorithms like SVM, logistic regression, PCA, or when moderate outliers exist (with heavy outliers, scaling by median and interquartile range is more robust).
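
Both formulas can be checked on a small made-up feature that contains one outlier. This is a sketch with NumPy only; in a real pipeline the minimum, maximum, mean, and standard deviation would be estimated on the training split and reused on validation and test data (scikit-learn's `MinMaxScaler` and `StandardScaler` follow that fit/transform pattern).

```python
import numpy as np

# Made-up feature values; 100.0 plays the role of an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-Max scaling to [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization; NumPy's std() uses ddof=0 (population standard deviation),
# whereas pandas' Series.std() defaults to ddof=1.
z_score = (x - x.mean()) / x.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
```

With Min-Max, the four ordinary values end up squeezed into a small fraction of the [0, 1] interval because the outlier defines the maximum.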


### Question 6: Label Encoding vs One-Hot Encoding

**Label Encoding**
- Converts categories to integer labels (0..k-1).
- Use when categories are ordinal (e.g., `low < medium < high`).
- Danger: some models may infer a spurious ordinal relationship when applied to nominal categories.

**One-Hot Encoding**
- Creates binary columns for each category using `pd.get_dummies()` or `OneHotEncoder`.
- Use for nominal categorical variables with no ordinal relationship.
- Can create high-dimensional feature spaces for categories with many unique values (use hashing or embeddings in that case).

When to prefer which:
- Ordinal categories → Label Encoding.
- Nominal categories with few unique values → One-Hot Encoding.
- High-cardinality nominal features → Target encoding, frequency encoding, or embeddings.
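
A minimal pandas sketch of both encodings follows; the columns `size` and `color` and their values are hypothetical. The ordinal order is stated explicitly because encoders that sort labels alphabetically (scikit-learn's `LabelEncoder` does) would place `high` before `low`.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],      # ordinal
    "color": ["red", "green", "blue", "green"],    # nominal
})

# Ordinal: give the order explicitly so low < medium < high maps to 0 < 1 < 2.
size_order = ["low", "medium", "high"]
df["size_encoded"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# Nominal: one binary column per category.
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

encoded = pd.concat([df[["size_encoded"]], color_dummies], axis=1)
print(encoded)
```

For linear models, passing `drop_first=True` to `pd.get_dummies` removes one redundant column and avoids perfect collinearity among the dummies.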


## Question 7: Google Play Store Dataset

### (a) Analyze relationship between app categories and ratings

Below is a runnable analysis. Run the cell to read `googleplaystore.csv` from the MasteriNeuron repo and calculate average ratings per category.


In [2]:
# Q7: Google Play Store — Category vs Rating analysis
import pandas as pd
url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/googleplaystore.csv'
df = pd.read_csv(url)
print('Loaded googleplaystore.csv — shape:', df.shape)

# Basic cleaning: ensure Rating is numeric
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
cat_rating = df.groupby('Category')['Rating'].agg(['count','mean','median','std']).sort_values('mean', ascending=False)
display(cat_rating.head(10))

# Example insight lines (replace with actual output after running):
# - Categories with highest average ratings might include 'ART_AND_DESIGN', 'BOOKS_AND_REFERENCE' etc.
# - Categories with lowest average ratings often include apps where utility is critical or many low-quality apps exist.


URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

## Question 8: Titanic Dataset

### (a) Survival rates by passenger class (Pclass)
### (b) Survival by age group (children <18 vs adults ≥18)

Run the following cell to load `titanic.csv` and compute the requested statistics.


In [None]:
# Q8: Titanic analysis
import pandas as pd
titanic_url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/titanic.csv'
t = pd.read_csv(titanic_url)
print('Loaded titanic.csv — shape:', t.shape)

# Ensure column names include 'Survived' and 'Pclass' and 'Age'
t['Age'] = pd.to_numeric(t['Age'], errors='coerce')
survival_by_class = t.groupby('Pclass')['Survived'].mean().sort_values(ascending=False)
display(survival_by_class)

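# Children vs adults
# Rows with missing Age are excluded: NaN < 18 evaluates to False, which would silently count them as adults.
known_age = t.dropna(subset=['Age']).copy()
known_age['is_child'] = known_age['Age'] < 18
survival_by_agegroup = known_age.groupby('is_child')['Survived'].mean()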
display(survival_by_agegroup)

# Example conclusions (after running):
# - Historically, first-class passengers had the highest survival rate.
# - Children often had a higher survival rate than adults due to 'women and children first' evacuation practices.


## Question 9: Flight Price Prediction Dataset

### (a) Price vs days left until departure — identify surges and recommend booking window.
### (b) Compare prices across airlines for a route (e.g., Delhi-Mumbai).
Run the cell below to analyze `flight_price.csv`.


In [None]:
# Q9: Flight price analysis
import pandas as pd
import numpy as np
flight_url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/flight_price.csv'
fp = pd.read_csv(flight_url)
print('Loaded flight_price.csv — shape:', fp.shape)

# Example preprocessing: convert date columns or days_left if present
if 'days_left' in fp.columns:
    days_price = fp.groupby('days_left')['price'].agg(['count','mean','median']).reset_index()
    display(days_price.head(20))
else:
    print('Column days_left not present — inspect dataset for date fields and compute days until departure first.')

# Compare airlines for a route (example: Delhi-Mumbai)
route = fp[(fp['source']=='Delhi') & (fp['destination']=='Mumbai')] if {'source','destination'}.issubset(fp.columns) else pd.DataFrame()
if not route.empty:
    airline_cmp = route.groupby('airline')['price'].agg(['count','mean','median']).sort_values('mean')
    display(airline_cmp)
else:
    print('Route columns not found or no matching route — inspect columns:', fp.columns.tolist())

# Example recommendation (after plotting mean price vs days_left):
# - Often prices are lower when booking 30-60 days in advance; sharp surges commonly occur within the last 7-14 days before departure.


## Question 10: HR Analytics Dataset

### (a) Factors correlated with attrition
### (b) Relationship: number of projects vs attrition
Run the cell below to analyze `hr_analytics.csv`.


In [None]:
# Q10: HR Analytics
import numpy as np  # needed for select_dtypes(include=[np.number]) below
import pandas as pd
hr_url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/hr_analytics.csv'
hr = pd.read_csv(hr_url)
print('Loaded hr_analytics.csv — shape:', hr.shape)

# Example columns often present: 'Attrition' or 'left', 'satisfaction_level', 'number_project', 'average_montly_hours','time_spend_company','salary','Work_accident','promotion_last_5years'
display(hr.head())

# Convert attrition to binary if needed
if 'left' in hr.columns:
    hr['attrition'] = hr['left']
elif 'Attrition' in hr.columns:
    hr['attrition'] = hr['Attrition'].map({ 'Yes':1, 'No':0 }) if hr['Attrition'].dtype == object else hr['Attrition']
else:
    print('No obvious attrition column found; please inspect column names:', hr.columns.tolist())

# Correlation with attrition
if 'attrition' in hr.columns:
    numeric_cols = hr.select_dtypes(include=[np.number]).columns.tolist()
    corr = hr[numeric_cols].corr()['attrition'].sort_values(ascending=False)
    display(corr)
    
    # Relationship between number of projects and attrition
    if 'number_project' in hr.columns:
        proj = hr.groupby('number_project')['attrition'].mean().reset_index().sort_values('number_project')
        display(proj)
else:
    print('Cannot compute correlations without attrition column.')

# Example findings (after running):
# - Low satisfaction_level, high average_montly_hours (column name as spelled in the dataset), and overtime often correlate with higher attrition.
# - There may or may not be a monotonic relationship between number_project and attrition (sometimes too many projects -> burnout -> higher attrition).


----
### Notes & next steps
- The notebook contains **runnable** code cells that read data from `https://raw.githubusercontent.com/MasteriNeuron/datasets/main/`.
- **I couldn't execute the cells here** because the execution environment for this notebook builder does not have internet access. To obtain real outputs, run the notebook cells in a Jupyter environment with internet access (e.g., your laptop, Google Colab, or Binder).


In [None]:

# Q7 plot: Average Rating by Category (top and bottom 10)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(url)
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
cat_rating = df.groupby('Category')['Rating'].agg(['count','mean']).sort_values('mean', ascending=False)
top10 = cat_rating.head(10)
bottom10 = cat_rating.tail(10)

plt.figure(figsize=(10,5))
top10['mean'].plot(kind='bar')
plt.title('Top 10 Categories by Average Rating')
plt.ylabel('Average Rating')
plt.tight_layout()
plt.show()


In [None]:

# Q8 plots: Survival by Pclass and by age group
import pandas as pd
import matplotlib.pyplot as plt
t = pd.read_csv(titanic_url)
t['Age'] = pd.to_numeric(t['Age'], errors='coerce')
survival_by_class = t.groupby('Pclass')['Survived'].mean().sort_index()
plt.figure(figsize=(6,4))
survival_by_class.plot(kind='bar')
plt.title('Survival Rate by Pclass')
plt.ylabel('Survival Rate')
plt.tight_layout()
plt.show()

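# Exclude rows with missing Age; NaN < 18 is False, so they would otherwise be plotted as adults
known_age = t.dropna(subset=['Age']).copy()
known_age['is_child'] = known_age['Age'] < 18
survival_by_age = known_age.groupby('is_child')['Survived'].mean()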
plt.figure(figsize=(4,4))
survival_by_age.plot(kind='bar')
plt.title('Survival: Children (<18) vs Adults (>=18)')
plt.ylabel('Survival Rate')
plt.xticks([0,1], ['Adults','Children'])
plt.tight_layout()
plt.show()


In [None]:

# Q9 plots: Price vs days_left and airline comparison for a route (if available)
import pandas as pd, matplotlib.pyplot as plt
fp = pd.read_csv(flight_url)
# ensure lowercase column names for flexibility
cols = [c.lower() for c in fp.columns]
fp.columns = cols
if 'days_left' in fp.columns and 'price' in fp.columns:
    days_price = fp.groupby('days_left')['price'].mean().reset_index()
    plt.figure(figsize=(8,4))
    plt.plot(days_price['days_left'], days_price['price'], marker='o')
    plt.gca().invert_xaxis()  # often days_left decreases towards departure; invert for readability
    plt.title('Average Price vs Days Left to Departure')
    plt.xlabel('Days left')
    plt.ylabel('Average Price')
    plt.tight_layout()
    plt.show()
else:
    print('No days_left or price columns found. Found columns:', fp.columns.tolist())

# Airline comparison for route: try common column names
if {'source','destination','airline','price'}.issubset(fp.columns):
    route = fp[(fp['source'].str.lower()=='delhi') & (fp['destination'].str.lower()=='mumbai')]
    if not route.empty:
        airline_cmp = route.groupby('airline')['price'].mean().sort_values()
        plt.figure(figsize=(8,4))
        airline_cmp.plot(kind='bar')
        plt.title('Average Price by Airline (Delhi -> Mumbai)')
        plt.ylabel('Average Price')
        plt.tight_layout()
        plt.show()
    else:
        print('No Delhi-Mumbai rows found in dataset.')
else:
    print('Route comparison columns missing; available columns:', fp.columns.tolist())


In [None]:

# Q10 plots: Correlation of numeric features with attrition and number_project vs attrition
import pandas as pd, matplotlib.pyplot as plt
hr = pd.read_csv(hr_url)
# normalize column names
hr.columns = [c.strip() for c in hr.columns]
# identify attrition column
if 'left' in hr.columns:
    hr['attrition'] = hr['left']
elif 'Attrition' in hr.columns:
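    # Map Yes/No strings to 1/0; leave an already-numeric column unchanged
    if hr['Attrition'].dtype == object:
        hr['attrition'] = hr['Attrition'].map({'Yes': 1, 'No': 0})
    else:
        hr['attrition'] = hr['Attrition']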
elif 'attrition' in hr.columns:
    pass
else:
    # Try common alternatives
    for alt in ['left','target','resigned']:
        if alt in hr.columns:
            hr['attrition'] = hr[alt]
            break

if 'attrition' in hr.columns:
    numeric_cols = hr.select_dtypes(include=['number']).columns.tolist()
    if 'attrition' in numeric_cols:
        numeric_cols.remove('attrition')
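    corr = hr[numeric_cols + ['attrition']].corr()['attrition']
    # Drop the target itself and any source column it was copied from (correlation 1.0),
    # otherwise they would be selected as "top features"
    corr = corr.drop(['attrition', 'left', 'Attrition'], errors='ignore').abs().sort_values(ascending=False)
    display(corr.head(10))
    # plot top 3 correlated numeric features vs attrition
    top_feats = corr.index[:3].tolist()
    for feat in top_feats:
        plt.figure(figsize=(6,4))
        # scatter plot; no jitter is applied, so the low alpha is what reveals point density
        plt.scatter(hr[feat], hr['attrition'], alpha=0.3)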
        plt.xlabel(feat)
        plt.ylabel('Attrition')
        plt.title(f'{feat} vs Attrition')
        plt.tight_layout()
        plt.show()
else:
    print('Attrition-like column not found. Columns:', hr.columns.tolist())

# number_project vs attrition mean plot
if 'number_project' in hr.columns and 'attrition' in hr.columns:
    proj = hr.groupby('number_project')['attrition'].mean().reset_index()
    plt.figure(figsize=(6,4))
    plt.plot(proj['number_project'], proj['attrition'], marker='o')
    plt.xlabel('Number of Projects')
    plt.ylabel('Attrition Rate (mean)')
    plt.title('Attrition Rate by Number of Projects')
    plt.tight_layout()
    plt.show()
