
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name: Subrata Saha**   
**Date: 16th November, 2025**  

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any libraries other than `pandas` and `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [None]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [None]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
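
The column check in step 2 can be sketched as follows. This is a minimal example, assuming the column names listed above; the helper name `missing_columns` and the one-row toy frame are hypothetical stand-ins for the real `df`.

```python
import pandas as pd

# Columns the assignment text says the dataset should contain.
REQUIRED_COLUMNS = [
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for the real `df` (one row copied from the preview above).
toy = pd.DataFrame({
    "user_id": [1], "age": [43], "monthly_income": [3734.19],
    "daily_screen_time_min": [109], "daily_app_opens": [48],
    "true_label": [0], "pred_label": [0],
    "satisfaction_level": ["Medium"], "city_type": ["Suburban"],
})

print("Missing columns:", missing_columns(toy))
print("Rows, columns:", toy.shape)
```

An empty list means every expected column is present. On the real data you would call `missing_columns(df)` instead.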



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [None]:
# Q1.1: Choose your numeric column here [We have already written this answer]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [5 marks]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

>  The mean is 181.89 minutes. The minimum is 60 minutes, about one third of the mean, and the maximum is 299 minutes. The maximum (about 117 minutes above the mean) is slightly closer to the mean than the minimum (about 122 minutes below it). Some users spend much less time on screen than average, but both extremes lie within roughly two standard deviations of the mean, so there are no extreme outliers.

> The standard deviation is about 68.89 minutes, which is fairly large relative to the mean (about 38% of it), so daily screen time varies considerably between individuals.
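
As a rough check on the outlier question, the sketch below uses only the summary values printed by `describe()` above. It measures how many standard deviations each extreme lies from the mean and applies the 1.5 * IQR fence. The IQR fence is a common convention assumed here for illustration, not something the assignment asks for.

```python
# Summary values copied from the describe() output above.
mean, std = 181.89, 68.886951
col_min, col_max = 60.0, 299.0
q1, q3 = 122.0, 243.75

# Distance of each extreme from the mean, in standard deviations.
z_min = (col_min - mean) / std
z_max = (col_max - mean) / std

# 1.5 * IQR fences; values outside them would be flagged as possible outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"z-score of min: {z_min:.2f}, z-score of max: {z_max:.2f}")
print(f"IQR fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print("Any value outside the fences:", col_min < lower_fence or col_max > upper_fence)
```

Both the minimum and the maximum fall inside the fences, which is consistent with a spread-out column that has no extreme values.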



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [None]:
# Q2.1: Compute proportion of positive class [We have already written this answer]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  Of the 100 samples, 52 are positive and 48 are negative, so neither class is much more common than the other. For machine learning purposes, this dataset can be considered fairly balanced.
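
The same proportions can be read off with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the reported counts (52 ones, 48 zeros) rather than loading the real data. The `majority_baseline` name is introduced here only for illustration.

```python
import pandas as pd

# Rebuild a label column with the counts reported above: 52 ones and 48 zeros.
labels = pd.Series([1] * 52 + [0] * 48, name="true_label")

# Share of each class in the column.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the more common class.
majority_baseline = class_share.max()
print("Majority-class baseline accuracy:", majority_baseline)
```

The baseline is a useful yardstick for later: a model has to beat it to show that it has learned anything beyond the class frequencies.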


### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [None]:
# Q3.1: Manually compute TP, TN, FP, FN [We have already written this answer]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [None]:
# Q3.2: Compute accuracy, precision, recall [We have already written this answer]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  Accuracy is low at only 55%, which means the model is only slightly better than random guessing on this nearly balanced dataset.

>  Precision is about 57%: of the 49 positive predictions the model made, 28 are actually positive and the other 21 are false positives. The model is therefore not very careful when it predicts a positive outcome.

>  Recall is about 54%: of the 52 actual positive instances, the model catches only 28 and misses 24. So this model is not very good at catching positive outcomes either.
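
The counts and metrics above can be cross-checked with `pd.crosstab`. This is a sketch under one assumption: the label columns are rebuilt from the reported counts (TP=28, TN=27, FP=21, FN=24) instead of being read from the real `df`.

```python
import pandas as pd

# Rebuild label columns that reproduce the counts reported above:
# TP=28, TN=27, FP=21, FN=24.
true_labels = [1] * 28 + [0] * 27 + [0] * 21 + [1] * 24
pred_labels = [1] * 28 + [0] * 27 + [1] * 21 + [0] * 24
toy = pd.DataFrame({"true_label": true_labels, "pred_label": pred_labels})

# Rows are actual labels, columns are predicted labels.
matrix = pd.crosstab(toy["true_label"], toy["pred_label"])
print(matrix)

tn, fp = int(matrix.loc[0, 0]), int(matrix.loc[0, 1])
fn, tp = int(matrix.loc[1, 0]), int(matrix.loc[1, 1])

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many are caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Reading the matrix by row gives recall (the actual-positive row), while reading it by column gives precision (the predicted-positive column).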


---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling
Use one numeric column, `monthly_income`.


In [None]:
# Q4.1: Choose the numeric column [2 marks]
num_col = df["monthly_income"]

In [None]:
# Q4.2: Standardization with z-score [10 marks]
mean_income = num_col.mean()
std_income = num_col.std()

df['monthly_income_zscore'] = (num_col - mean_income) / std_income
# print(df["monthly_income_zscore"])

In [None]:
# Q4.3: Min max scaling implementation [10 marks]
min_income = num_col.min()
max_income = num_col.max()

df['monthly_income_minmax'] = (num_col - min_income) / (max_income - min_income)
# print(df["monthly_income_minmax"])

In [None]:
# print(df["monthly_income"].mean())
# print(df["monthly_income_zscore"].std())

In [None]:
# # df.head()
# df[["monthly_income", "monthly_income_zscore", "monthly_income_minmax"]].head()

In [None]:
values = {
    'monthly_income': {
        'Min': df['monthly_income'].min(),
        'Max': df['monthly_income'].max(),
        'Mean': df['monthly_income'].mean(),
        'Std': df['monthly_income'].std()
    },
    'monthly_income_zscore': {
        'Min': df['monthly_income_zscore'].min(),
        'Max': df['monthly_income_zscore'].max(),
        'Mean': df['monthly_income_zscore'].mean(),
        'Std': df['monthly_income_zscore'].std()
    },
    'monthly_income_minmax': {
        'Min': df['monthly_income_minmax'].min(),
        'Max': df['monthly_income_minmax'].max(),
        'Mean': df['monthly_income_minmax'].mean(),
        'Std': df['monthly_income_minmax'].std()
    }
}

values_df = pd.DataFrame(values)
print(values_df)

      monthly_income  monthly_income_zscore  monthly_income_minmax
Min      1000.000000          -2.099647e+00               0.000000
Max      5049.400000           2.409081e+00               1.000000
Mean     2885.745000          -8.826273e-16               0.465685
Std       898.124693           1.000000e+00               0.221792



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

>  Standardization (z-score) transforms the column so that it has a mean of 0 and a standard deviation of 1. Values above the mean become positive and values below the mean become negative.

>  Min max scaling transforms the column into a fixed range from 0 to 1, so no value is negative. The larger the original value, the closer the scaled value is to 1; the smaller the original value, the closer it is to 0.

>  In this case, the z-score column ranges roughly from -2.10 to 2.41, with negative and positive numbers depending on whether the income is below or above the mean, while the min max column runs exactly from 0 to 1 with a mean of about 0.47.
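
The two formulas can be compared side by side on a tiny example. The incomes below are hypothetical and chosen only to keep the arithmetic easy to follow. The sketch also shows one detail worth knowing: pandas `.std()` uses the sample formula (`ddof=1`), while `np.std` defaults to the population formula (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, used only to illustrate the two formulas.
income = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])

# Standardization: subtract the mean, divide by the standard deviation.
# pandas .std() uses the sample formula (ddof=1).
zscore = (income - income.mean()) / income.std()

# Min max scaling: map the smallest value to 0 and the largest to 1.
minmax = (income - income.min()) / (income.max() - income.min())

print("z-score mean/std:", round(zscore.mean(), 6), round(zscore.std(), 6))
print("min max range:", minmax.min(), minmax.max())

# NumPy's default (ddof=0) gives a slightly smaller standard deviation.
print("pandas std:", income.std(), "numpy std:", np.std(income.to_numpy()))
```

If the z-scores were computed with NumPy's default instead, the standardized column would have a pandas `.std()` slightly above 1 rather than exactly 1.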


### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [None]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]
d_city = pd.get_dummies(df['city_type'], prefix='city', dtype=int)

In [None]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df_encoded = pd.concat([df, d_city], axis=1)

# Optional: drop the original column
# df_encoded = df_encoded.drop("city_type", axis=1)

# df_encoded.head()

In [None]:
# df['satisfaction_level'].unique()

In [None]:
# df['city_type'].unique()

In [None]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
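order = {"Low": 1, "Medium": 2, "High": 3}

# Option 1: add a new column
# df["satisfaction_encoded"] = df["satisfaction_level"].map(order).astype(int)

# Option 2 (used here): replace the old column.
# Note: re-running this cell after the column is already encoded maps every
# value to NaN, and astype(int) then raises an error.
df["satisfaction_level"] = df["satisfaction_level"].map(order).astype(int)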
# df


> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

> `city_type` has 3 categories (Suburban, Urban, Rural) with no natural order. One hot encoding suits this column because it creates a separate 0/1 column for each category and so does not impose any ranking on them.

> `satisfaction_level` has 3 categories (Low, Medium, High) with a natural order (Low < Medium < High). Ordinal encoding is suitable here because it preserves that order or ranking as increasing integers.
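
The difference between the two encodings is easiest to see on a few rows. The toy frame below is hypothetical. It reuses the category names from this assignment and adds a check for labels that are missing from the mapping, since `map` silently turns those into `NaN`.

```python
import pandas as pd

# Toy rows with the category names used in this assignment.
toy = pd.DataFrame({
    "city_type": ["Urban", "Rural", "Suburban", "Urban"],
    "satisfaction_level": ["Low", "High", "Medium", "High"],
})

# One hot encoding: one 0/1 column per city type, no order implied.
city_dummies = pd.get_dummies(toy["city_type"], prefix="city", dtype=int)

# Ordinal encoding: integers that keep Low < Medium < High.
order = {"Low": 1, "Medium": 2, "High": 3}
satisfaction_encoded = toy["satisfaction_level"].map(order)

# Labels missing from `order` would become NaN, so check before casting to int.
assert satisfaction_encoded.notna().all(), "unmapped satisfaction level found"
toy["satisfaction_encoded"] = satisfaction_encoded.astype(int)

print(pd.concat([toy, city_dummies], axis=1))
```

Each row has exactly one `1` across the `city_*` columns, so no city is treated as larger than another, whereas the satisfaction integers can be compared directly.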



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std` (the standardized income from Q4.2, stored in this notebook as `monthly_income_zscore`)  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [None]:
# Q6.1: Build 2D vectors for first two users [We have already written this answer]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [None]:
# Q6.2: Euclidean distance computation [5 marks]

# print(type(v1))
# print(v1.shape)


eq_d = np.linalg.norm(v1-v2)
print(eq_d)


1140.7370424422975


In [None]:
# Q6.3: Manhattan distance computation [5 marks]
manhattan_d = np.linalg.norm(v1 - v2, ord=1)
print(manhattan_d)


1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

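>  Here, the Euclidean distance (about 1140.74) is smaller than the Manhattan distance (1181). Manhattan distance adds up the absolute differences of the elements, while Euclidean distance takes the square root of the sum of the squared differences. Squaring the Manhattan sum gives the sum of squares plus non-negative cross terms, so the Manhattan distance is always at least as large as the Euclidean distance. The two are equal only when the vectors differ in at most one element.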
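
Both formulas can be written out by hand for the two vectors printed in Q6.1. This sketch uses the raw values shown in that output. The `income_share` quantity is added here only to illustrate how much one unscaled feature can dominate the distance.

```python
import numpy as np

# The two user vectors printed in Q6.1: (monthly_income, daily_app_opens).
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

# Absolute difference per feature.
diff = np.abs(v1 - v2)

manhattan = diff.sum()                  # sum of absolute differences
euclidean = np.sqrt((diff ** 2).sum())  # square root of the summed squares

print("Per-feature differences:", diff)
print("Manhattan:", manhattan)
print("Euclidean:", euclidean)

# Share of the squared Euclidean distance that comes from income alone.
income_share = diff[0] ** 2 / (diff ** 2).sum()
print(f"Income share of squared distance: {income_share:.4f}")
```

Because income is measured in much larger units than app opens, almost all of the distance comes from the income difference. Using scaled features, as the introduction to Part C suggests, would let both features contribute on a comparable footing.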


---
## Final Reflection [10 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  From Module 1, I learned about count, mean, min, max, and standard deviation. From Module 2, I learned about the confusion matrix, how to calculate precision, recall, and accuracy, and why they matter.

>  From Module 3, I learned how to use scaling and encoding and when to use each depending on the data.

> Combining these ideas helps in understanding a dataset more deeply. Descriptive statistics such as the mean, max, min, and standard deviation describe the basic distribution and point to possible outliers in the data. Accuracy, precision, and recall from the confusion matrix show how well a model performs and whether it makes more false positives or false negatives. Scaling and encoding prepare the data properly so that models can learn meaningful patterns, which can improve model accuracy.


## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Following the instruction video, first use Save a copy in Drive on the Colab file we provided. Then write your code in Google Colab, set the file's access to 'Anyone with the link' and 'View', and submit the file's shareable link.
