<a href="https://colab.research.google.com/github/zuhayerror3i8/AI-ML-Expert-With-Phitron-Batch-01/blob/main/001%20Machine%20Learning/005_Module_04_Assignment_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: MZ AnjumHaQ Heemel**   

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Modules 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min-max scaling, nominal vs ordinal, one-hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries other than `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [None]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [None]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()


### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
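
The column check above can be sketched as follows. The helper `missing_columns` and the toy frame `demo` are hypothetical names used only for illustration; in this notebook you would pass `df` instead.

```python
import pandas as pd

# Columns the assignment expects (taken from the list above).
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Hypothetical one-row frame that lacks `city_type` on purpose.
demo = pd.DataFrame({
    "age": [25], "monthly_income": [30000], "daily_screen_time_min": [180],
    "daily_app_opens": [40], "true_label": [1], "pred_label": [0],
    "satisfaction_level": ["Low"],
})
print(missing_columns(demo))  # ['city_type']
```

An empty list means every expected column is present.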



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [None]:
# Q1.1: Choose your numeric column here [Answer already provided]
num_col = "daily_screen_time_min"

df[num_col].describe()


> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

>  The dataset contains 100 observations with no missing values for daily screen time.

>  The spread is wide: the minimum is about 121 minutes below the mean and the maximum is about 117 minutes above it, so neither extreme sits unusually far from the mean compared with the other.

>  The 75th percentile is nearly twice the 25th percentile, which confirms substantial spread in the middle half of the data. Because the minimum and maximum lie at similar distances from the mean, these statistics alone do not point to strong skew.
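
Skew is usually easier to judge by comparing the mean with the median than by reading percentile ratios. The sketch below uses made-up screen-time values (not the assignment data) to show how one large value pulls the mean above the median.

```python
import pandas as pd

# Hypothetical screen-time values in minutes, not taken from the assignment data.
screen_time = pd.Series([60, 90, 120, 150, 480])

mean_value = screen_time.mean()      # 180.0, pulled upward by the 480 value
median_value = screen_time.median()  # 120.0, unaffected by how large the top value is
skew_value = screen_time.skew()      # positive when the right tail is longer

print("mean:", mean_value)
print("median:", median_value)
print("right-skewed:", skew_value > 0)
```

When the mean and median are close, as a roughly symmetric min and max around the mean would suggest, there is little evidence of skew.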



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [None]:
# Q2.1: Compute proportion of positive class [Answer already provided]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)


> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  The dataset comprises 100 samples, with 52 instances belonging to the positive class.

>  This indicates a relatively balanced dataset with a slight majority (52%) of positive cases, suggesting minimal class imbalance that would require special handling.
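
As a cross-check on the manual count, pandas can report class proportions directly. This is a small sketch on hypothetical labels; the Series `labels` is an illustrative stand-in for `df["true_label"]`.

```python
import pandas as pd

# Hypothetical binary labels: three positives and two negatives.
labels = pd.Series([1, 0, 1, 1, 0])

# normalize=True returns fractions instead of raw counts.
proportions = labels.value_counts(normalize=True)

# For 0/1 labels, the mean equals the proportion of positives.
positive_share = labels.mean()

print(proportions)
print("Positive share:", positive_share)
```

Both routes should agree with `positive_count / total_count`.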



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [None]:
# Q3.1: Manually compute TP, TN, FP, FN [Answer already provided]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

In [None]:
# Q3.2: Compute accuracy, precision, recall [Answer already provided]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)


> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  The model performs weakly, with an accuracy of 55%, precision of 57%, and recall of 53%.

>  Since 52% of the samples are positive, always predicting the positive class would already reach about 52% accuracy, so 55% is only marginally better than a trivial baseline.

>  A precision of 57% means roughly 43% of the predicted positives are false alarms, and a recall of 53% means the model misses nearly half of the actual positives.

>  The model is therefore neither careful when it predicts positive nor thorough in catching positives, and it needs substantial refinement.
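
The four confusion-matrix counts can also be read from a cross-tabulation, which is a handy check on the manual boolean masks. The sketch below uses a hypothetical eight-row frame `toy`; it is not the assignment data.

```python
import pandas as pd

# Hypothetical labels and predictions for eight samples.
toy = pd.DataFrame({
    "true_label": [1, 1, 1, 1, 0, 0, 0, 0],
    "pred_label": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Rows are actual classes, columns are predicted classes.
matrix = pd.crosstab(toy["true_label"], toy["pred_label"])

tp = matrix.loc[1, 1]  # actual 1, predicted 1
fn = matrix.loc[1, 0]  # actual 1, predicted 0
fp = matrix.loc[0, 1]  # actual 0, predicted 1
tn = matrix.loc[0, 0]  # actual 0, predicted 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(matrix)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
```

In this toy case precision (2 of 3 predicted positives are correct) is higher than recall (2 of 4 actual positives are found), which illustrates a model that is more careful than thorough.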



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [None]:
# Q4.1: Choose the numeric column [2 marks]
# Note: `num_col` now holds the column itself (a Series), not its name as in Q1
num_col = df["monthly_income"]

num_col

In [None]:
# Q4.2: Standardization with z-score [10 marks]
mean = num_col.mean()
std = num_col.std()  # sample standard deviation (pandas default, ddof=1)
df["z-score"] = (num_col - mean) / std

df[["monthly_income", "z-score"]]

In [None]:
df["z-score"].min()

In [None]:
df["z-score"].max()

In [None]:
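# Q4.3: Min max scaling implementation [10 marks]
col_min = num_col.min()
col_max = num_col.max()
# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
scaled = (num_col - col_min) / (col_max - col_min)
df["min-max-scaling"] = scaled.round(2)  # rounded to 2 decimals for readability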

df[["monthly_income", "min-max-scaling"]]

In [None]:
df["min-max-scaling"].min()

In [None]:
df["min-max-scaling"].max()


> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

>  The standardized column contains both negative and positive values centered on 0, while the min-max scaled column is bounded between 0 and 1.

>  The standardized values range from -2.09 to 2.40, so every value lies within about three standard deviations of the mean, whereas min-max scaling maps the smallest income to exactly 0 and the largest to exactly 1.

>  Both methods are linear rescalings, so neither changes the shape of the distribution. Standardization leaves the range unbounded, while min-max scaling depends entirely on the minimum and maximum, so a single extreme value compresses all other values into a narrow part of the 0 to 1 range.
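
To make the two ranges concrete, here is a minimal sketch on five hypothetical income values (not the assignment data). It applies the same formulas used in Q4.2 and Q4.3.

```python
import pandas as pd

# Hypothetical incomes, evenly spaced for easy checking by hand.
income = pd.Series([20000.0, 30000.0, 40000.0, 50000.0, 60000.0])

# z-score: subtract the mean, divide by the sample standard deviation (ddof=1).
z = (income - income.mean()) / income.std()

# min-max: shift by the minimum, divide by the range.
mm = (income - income.min()) / (income.max() - income.min())

print(z.round(2).tolist())  # [-1.26, -0.63, 0.0, 0.63, 1.26]
print(mm.tolist())          # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The z-scores are symmetric around 0 with no fixed bounds, while the min-max values always start at 0 and end at 1.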



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [None]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
d_city = pd.get_dummies(df["city_type"], prefix="city", dtype=int)

d_city

In [None]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df = pd.concat([df, d_city], axis=1)

df

In [None]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
order = {"Low": 1, "Medium": 2, "High": 3}
# Write to a new column so the original labels are kept and the cell can be re-run safely
df["satisfaction_level_ord"] = df["satisfaction_level"].map(order).astype(int)

df


> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  One-hot encoding is appropriate for `city_type` because city categories (Urban, Suburban, Rural) are nominal variables with no inherent ordering or hierarchical relationship among them.

>  Ordinal encoding is suitable for `satisfaction_level` because it represents an ordered categorical variable with a clear ranking system where Low < Medium < High, and the numerical encoding preserves this meaningful ordinality.
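
An alternative to a hand-written mapping is an ordered categorical, which records the ranking explicitly. This is a sketch on hypothetical values; the names `levels`, `ordered`, and `codes` are illustrative only.

```python
import pandas as pd

# Hypothetical satisfaction labels.
levels = pd.Series(["Low", "High", "Medium", "Low"])

# Declare the ranking Low < Medium < High explicitly.
ordered = pd.Categorical(
    levels, categories=["Low", "Medium", "High"], ordered=True
)

# Category codes start at 0, so add 1 to get the 1..3 scale.
codes = pd.Series(ordered.codes) + 1

print(codes.tolist())  # [1, 3, 2, 1]
```

A label outside the declared categories receives code -1 (0 after the shift), which makes unexpected values easy to spot.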



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std` (the standardized `monthly_income`)  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
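# Q6.1: Build 2D vectors for first two users [Answer already provided]
# Note: this uses raw `monthly_income`; swap in the standardized column
# ("z-score" from Q4.2) to compare users on scaled features.
vec_cols = ["monthly_income", "daily_app_opens"]

# Convert to float arrays so the distance computations get a numeric dtype
v1 = df.loc[0, vec_cols].to_numpy(dtype=float)
v2 = df.loc[1, vec_cols].to_numpy(dtype=float)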

print("v1:", v1)
print("v2:", v2)

In [None]:
# Q6.2: Euclidean distance computation [10 marks]
eu = np.linalg.norm(v1 - v2)

print(eu)

In [None]:
# Q6.3: Manhattan distance computation [10 marks]
ma = np.linalg.norm(v1 - v2, ord=1)

print(ma)


> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  In this analysis, the Manhattan distance is larger than the Euclidean distance.

>  Manhattan distance is the sum of the absolute differences, while Euclidean distance is the square root of the sum of the squared differences.

>  Squaring the Manhattan distance gives the sum of squared differences plus non-negative cross terms, so the Manhattan distance is always at least as large as the Euclidean distance. The two are equal only when the vectors differ in at most one coordinate.
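
The relationship between the two distances can be checked by writing both formulas out. The vectors below are hypothetical and chosen so the arithmetic is easy to verify by hand.

```python
import numpy as np

# Hypothetical 2D points that differ in both coordinates.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
diff = a - b

euclidean = np.sqrt(np.sum(diff ** 2))  # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(diff))        # 3 + 4 = 7.0

# When only one coordinate differs, the two distances coincide.
c = np.array([3.0, 0.0])
euclidean_axis = np.sqrt(np.sum((a - c) ** 2))  # 3.0
manhattan_axis = np.sum(np.abs(a - c))          # 3.0

print(euclidean, manhattan)
print(euclidean_axis, manhattan_axis)
```

These values match what `np.linalg.norm(diff)` and `np.linalg.norm(diff, ord=1)` return for the same vectors.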



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  From Modules 1 and 2, I applied model evaluation metrics including accuracy, precision, and recall to assess classification performance systematically.

>  Module 3 concepts included standardization, min-max scaling, categorical encoding techniques, and distance calculations (Euclidean and Manhattan).

>  The evaluation metrics from earlier modules provided quantitative insights into model performance and dataset balance, enabling informed assessment of predictive quality.

>  Descriptive statistics revealed underlying data patterns and distributions, informing preprocessing decisions.

>  Scaling and encoding techniques from Module 3 transformed raw data into formats suitable for machine learning algorithms, bridging the gap between data exploration and model development.

>  Distance metrics demonstrated practical applications of transformed features, measuring similarity between observations and connecting preprocessing choices to their analytical consequences.



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***Read the Assignment Module text instructions in full; they explain how to submit this assignment.***
