
# Week 01 Assignment  
## Data Quality, Evaluation, Scaling, and Encoding

**Student name:**  MD.Rifat Islam Rizvi  

This is a small assignment that connects topics from Modules 1, 2, and 3.  
You must complete it in this Colab notebook.

You will need to use concepts that appeared in the videos:
- Module 1 and 2: basic descriptive statistics, proportions, confusion matrix, accuracy, precision, recall
- Module 3: standardization, min max scaling, nominal vs ordinal, one hot encoding, ordinal encoding, Euclidean and Manhattan distance

Please do not use any extra libraries beyond `pandas`, `numpy`.



---
## 0. Setup and Dataset

We will use a dataset that should have the columns listed below:

- `user_id`  
- `age`  
- `monthly_income` (numeric)  
- `daily_screen_time_min` (numeric)  
- `daily_app_opens` (numeric)  
- `true_label` and `pred_label` for a binary classification task (0 or 1)  
- `satisfaction_level` (for example: `Low`, `Medium`, `High`)  
- `city_type` (for example: `Urban`, `Suburban`, `Rural`)


In [18]:
# Cell 1: Imports
import pandas as pd
import numpy as np

In [19]:
# Cell 2: Load the dataset (Already done for you)
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1OmDDCh4MD1TtvAemnwVDyz5zwCIXJ220")

# Show first few rows
df.head()

Unnamed: 0,user_id,age,monthly_income,daily_screen_time_min,daily_app_opens,true_label,pred_label,satisfaction_level,city_type
0,1,43,3734.19,109,48,0,0,Medium,Suburban
1,2,49,2594.19,194,7,0,0,Low,Urban
2,3,19,3550.47,146,36,1,0,High,Rural
3,4,19,3821.18,287,14,1,0,High,Suburban
4,5,63,1750.84,66,46,0,0,Medium,Suburban



### 0.1 Check your dataset

1. Confirm that the dataset loaded correctly.  
2. Check that you have at least these columns:  
   - numeric: `age`, `monthly_income`, `daily_screen_time_min`, `daily_app_opens`  
   - labels: `true_label`, `pred_label`  
   - categorical: `satisfaction_level`, `city_type`  
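
One way to make this check explicit is sketched below. The small `sample` frame is built from the five rows printed by `df.head()` above and only stands in for the full `df`; the names `sample`, `required`, and `missing` are illustrative and not part of the assignment template.

```python
import pandas as pd

# Stand-in for the real df: the five rows printed by df.head() above.
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [43, 49, 19, 19, 63],
    "monthly_income": [3734.19, 2594.19, 3550.47, 3821.18, 1750.84],
    "daily_screen_time_min": [109, 194, 146, 287, 66],
    "daily_app_opens": [48, 7, 36, 14, 46],
    "true_label": [0, 0, 1, 1, 0],
    "pred_label": [0, 0, 0, 0, 0],
    "satisfaction_level": ["Medium", "Low", "High", "High", "Medium"],
    "city_type": ["Suburban", "Urban", "Rural", "Suburban", "Suburban"],
})

# Columns the assignment expects (numeric, labels, categorical).
required = {
    "age", "monthly_income", "daily_screen_time_min", "daily_app_opens",
    "true_label", "pred_label", "satisfaction_level", "city_type",
}

missing = required - set(sample.columns)
print("Missing columns:", sorted(missing))

# Labels should contain only 0 and 1 for a binary task.
labels_ok = bool(sample[["true_label", "pred_label"]].isin([0, 1]).all().all())
print("Labels are binary:", labels_ok)
```

In the notebook itself, the same two checks could be run with `df` in place of `sample`.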



---
## Part A - Module 1 and 2 Review

In this part you will do simple descriptive statistics and basic classification evaluation.



### Q1. Descriptive statistics on a numeric feature

Choose one numeric column, for example `daily_screen_time_min`.


In [20]:
# Q1.1: Choose your numeric column here [This answer is already written for you]
num_col = "daily_screen_time_min"

df[num_col].describe()

Unnamed: 0,daily_screen_time_min
count,100.0
mean,181.89
std,68.886951
min,60.0
25%,122.0
50%,178.0
75%,243.75
max,299.0



> **Q1.2 Short answer: [Marks: 05]**  
> Look at the count, mean, min, max, and standard deviation for your chosen column.  
> In 2 to 3 sentences, comment on what you see.  
> For example, does the max look very far from the mean, or does it look quite close?

Write your answer here:

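>  Count: 100, mean: 181.89, min: 60, max: 299, standard deviation: 68.89.

>  There is considerable variation in daily screen time among individuals. On average, people spend about 181.89 minutes on their screens every day.

>  The max (299) is quite far from the mean: about 117 minutes, or roughly 1.7 standard deviations, above it.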
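
To put a number on "far from the mean", the gap can be expressed in standard deviations. The sketch below simply reuses the summary values printed by `describe()` above; the variable names are illustrative.

```python
# Summary statistics copied from the describe() output above.
mean, std = 181.89, 68.886951
col_min, col_max = 60.0, 299.0

# Distance of each extreme from the mean, in units of standard deviation.
z_max = (col_max - mean) / std
z_min = (col_min - mean) / std

print(f"max is {z_max:.2f} standard deviations above the mean")
print(f"min is {abs(z_min):.2f} standard deviations below the mean")
```

Both extremes sit at a similar distance from the mean, which suggests the column has no single extreme outlier on either side.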



### Q2. Proportion of positive class

Use the `true_label` column, where 1 means "positive" and 0 means "negative".


In [21]:
# Q2.1: Compute proportion of positive class [This answer is already written for you]
label_col = "true_label"

positive_count = (df[label_col] == 1).sum()
total_count = df.shape[0]
positive_proportion = positive_count / total_count

print("Positive count:", positive_count)
print("Total samples:", total_count)
print("Proportion of positive class:", positive_proportion)

Positive count: 52
Total samples: 100
Proportion of positive class: 0.52



> **Q2.2 Short answer: [5 marks]**  
> In 1 to 2 sentences, explain what this proportion tells you about your dataset.  
> For example, is the dataset balanced between 0 and 1, or is one class much more common?

Write your answer here:

>  
>  The proportion of the positive class is 0.52, meaning slightly more than half of the samples belong to class 1.

>  The dataset is quite balanced: both classes appear in almost equal proportions, and neither class is significantly more common than the other.



### Q3. Confusion matrix and basic metrics

For this question, use:
- `true_label` as the actual label  
- `pred_label` as the model prediction


In [22]:
# Q3.1: Manually compute TP, TN, FP, FN [This answer is already written for you]
true_col = "true_label"
pred_col = "pred_label"

tp = ((df[true_col] == 1) & (df[pred_col] == 1)).sum()
tn = ((df[true_col] == 0) & (df[pred_col] == 0)).sum()
fp = ((df[true_col] == 0) & (df[pred_col] == 1)).sum()
fn = ((df[true_col] == 1) & (df[pred_col] == 0)).sum()

print("TP:", tp)
print("TN:", tn)
print("FP:", fp)
print("FN:", fn)

TP: 28
TN: 27
FP: 21
FN: 24


In [23]:
# Q3.2: Compute accuracy, precision, recall [This answer is already written for you]
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.55
Precision: 0.5714285714285714
Recall: 0.5384615384615384



> **Q3.3 Short answer: [10 marks]**  
> In 3 to 4 sentences, briefly comment on the model using these three metrics.  
> For example, is the model catching most positives (high recall) or being careful when it predicts positive (high precision)?

Write your answer here:

>  
>  The model has an accuracy of 0.55, meaning it predicts correctly only a little over half the time. Its precision is 0.57, so when it predicts a positive result, it is right just a bit more than half the time. The recall is 0.54, which shows that it misses many real positive cases (24 of 52). Overall, the model’s performance is not very strong and needs improvement.

>  With a recall of 0.54, the model catches only about half of the actual positive cases and misses the rest. The precision of 0.57 is only slightly higher, so the model is also not very careful when predicting positive. Overall, it is neither strongly focused on identifying positives nor making very accurate positive predictions.
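
As a cross-check on the numbers discussed above, the three metrics can be recomputed directly from the four confusion-matrix counts. This is a minimal sketch; the helper name `classification_metrics` is hypothetical and not part of the assignment template.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0     # of actual positives, how many are caught
    return accuracy, precision, recall


# Counts printed by the Q3.1 cell above.
accuracy, precision, recall = classification_metrics(tp=28, tn=27, fp=21, fn=24)
print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```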



---
## Part B - Module 3: Scaling and Encoding

Now we will pick a few features and apply scaling and encoding.



### Q4. Standardization and Min max scaling

Use one numeric column, `monthly_income`.


In [24]:
# Q4.1: Choose the numeric column [2 marks]
num_col = "monthly_income"
df[num_col]


Unnamed: 0,monthly_income
0,3734.19
1,2594.19
2,3550.47
3,3821.18
4,1750.84
...,...
95,3975.90
96,3764.77
97,2336.44
98,3007.53


In [25]:
# Q4.2: Standardization with z-score [10 marks]
mn = df[num_col].mean()
std = df[num_col].std()  # pandas default: sample standard deviation (ddof=1)
z = (df[num_col] - mn) / std
print(z)

0     0.944685
1    -0.324626
2     0.740126
3     1.041542
4    -1.263639
        ...   
95    1.213813
96    0.978734
97   -0.611613
98    0.135599
99   -0.768084
Name: monthly_income, Length: 100, dtype: float64


In [26]:
# Q4.3: Min max scaling implementation [10 marks]
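# Use col_min / col_max so the built-in min() and max() are not shadowed
col_min = df[num_col].min()
col_max = df[num_col].max()
mms = (df[num_col] - col_min) / (col_max - col_min)
print(mms)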

0     0.675209
1     0.393685
2     0.629839
3     0.696691
4     0.185420
        ...   
95    0.734899
96    0.682760
97    0.330034
98    0.495760
99    0.295330
Name: monthly_income, Length: 100, dtype: float64



> **Q4.4 Short answer: [3 marks]**  
> Compare the standardized and min max scaled columns in 2 to 3 sentences.  
> Mention what kind of range each one uses and how the numbers look.

Write your answer here:

>  
>  The standardized (z-score) column centers the data around 0 with a standard deviation of 1, so values can be positive or negative depending on whether they are above or below the mean. In contrast, the min-max scaled column rescales all values to fit between 0 and 1, keeping the relative distances but removing negative numbers.

>  The standardized z-score column has a range that depends on the data and can include both negative and positive numbers, with most values centered around 0. The min-max scaled column has a fixed range from 0 to 1.
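
The properties described above can be verified numerically. The sketch below uses only the five `monthly_income` values shown in the `df.head()` output, so the scaled values differ from the notebook's results, which use all 100 rows; the properties being checked are the same.

```python
import pandas as pd

# Five income values from the df.head() output; a stand-in for the full column.
income = pd.Series([3734.19, 2594.19, 3550.47, 3821.18, 1750.84], name="monthly_income")

# Standardization: subtract the mean, divide by the sample standard deviation.
z = (income - income.mean()) / income.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1.
mms = (income - income.min()) / (income.max() - income.min())

print("z-score mean and std:", round(z.mean(), 6), round(z.std(), 6))
print("z-score range:       ", round(z.min(), 3), "to", round(z.max(), 3))
print("min-max range:       ", mms.min(), "to", mms.max())
```

The z-score range is data dependent, while the min-max range is always exactly 0 to 1 for the data it was fitted on.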



### Q5. One hot and ordinal encoding

We will use:
- `city_type` as a nominal feature  
- `satisfaction_level` as an ordinal feature with order `Low` < `Medium` < `High`  


In [27]:
# Q5.1: One hot encoding for city_type using pandas [10 marks]

d_city = pd.get_dummies(df["city_type"], prefix="city_type").astype(int)

print(d_city.head())

   city_type_Rural  city_type_Suburban  city_type_Urban
0                0                   1                0
1                0                   0                1
2                1                   0                0
3                0                   1                0
4                0                   1                0


In [28]:
# Q5.2: Attach one hot encoded columns to df [5 marks]
df_encoded = pd.concat([df, d_city], axis=1)
print(df_encoded.head())

   user_id  age  monthly_income  daily_screen_time_min  daily_app_opens  \
0        1   43         3734.19                    109               48   
1        2   49         2594.19                    194                7   
2        3   19         3550.47                    146               36   
3        4   19         3821.18                    287               14   
4        5   63         1750.84                     66               46   

   true_label  pred_label satisfaction_level city_type  city_type_Rural  \
0           0           0             Medium  Suburban                0   
1           0           0                Low     Urban                0   
2           1           0               High     Rural                1   
3           1           0               High  Suburban                0   
4           0           0             Medium  Suburban                0   

   city_type_Suburban  city_type_Urban  
0                   1                0  
1               

In [36]:
# Q5.3: Ordinal encoding for satisfaction_level [10 marks]
stsfctn_ord={"Low": 0,"Medium": 1,"High": 2}
df["satisfaction_level_encoded"] = df["satisfaction_level"].map(stsfctn_ord)
print(df[["satisfaction_level","satisfaction_level_encoded"]].head())

  satisfaction_level  satisfaction_level_encoded
0             Medium                           1
1                Low                           0
2               High                           2
3               High                           2
4             Medium                           1



> **Q5.4 Short answer: [5 marks]**  
> In 2 to 3 sentences, explain why one hot encoding is suitable for `city_type`  
> and why ordinal encoding is suitable for `satisfaction_level`.

Write your answer here:

>  
>  One-hot encoding is suitable for `city_type` because it is a categorical feature with no inherent order. By converting each category into a separate binary column, the model can treat each city type independently without assuming any ranking.


>  Ordinal encoding is suitable for `satisfaction_level` because it is a categorical feature with a clear order (Low < Medium < High). By mapping the categories to numbers such as 0, 1, 2, the model can use the relative ranking between levels while preserving the meaning of “higher” or “lower” satisfaction.
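
The contrast between the two encodings can be sketched side by side. The example below uses the five rows shown earlier and, as an alternative to the dictionary `map` used in Q5.3, declares the order with an ordered `pd.Categorical`; this is one possible approach, not the one the assignment requires.

```python
import pandas as pd

# Values taken from the first five rows of the dataset shown earlier.
levels = pd.Series(["Medium", "Low", "High", "High", "Medium"])
cities = pd.Series(["Suburban", "Urban", "Rural", "Suburban", "Suburban"])

# Ordinal: state the order explicitly, then read off the integer codes.
ordered = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)
codes = list(ordered.codes)
print("Ordinal codes:", codes)

# Nominal: one binary column per category, so no ranking is implied.
one_hot = pd.get_dummies(cities, prefix="city_type").astype(int)
print(one_hot)
```

Each one-hot row contains exactly one 1, while the ordinal codes preserve Low < Medium < High.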



---
## Part C - Module 3: Distances between users

For this small part we will work with vectors based on scaled numeric features.



### Q6. Euclidean and Manhattan distance

Build 2D vectors for user 0 and user 1 using:
- `income_std`  
- `daily_app_opens` (or its min max scaled version if you prefer)


In [30]:
# Q6.1: Build 2D vectors for first two users [This answer is already written for you]
vec_cols = ["monthly_income", "daily_app_opens"]

v1 = df.loc[0, vec_cols].values
v2 = df.loc[1, vec_cols].values

print("v1:", v1)
print("v2:", v2)

v1: [np.float64(3734.19) np.int64(48)]
v2: [np.float64(2594.19) np.int64(7)]


In [37]:
# Q6.2: Euclidean distance computation [10 marks]
e_d=np.linalg.norm(v1-v2)
print(e_d)

1140.7370424422975


In [38]:
# Q6.3: Manhattan distance computation [10 marks]
m_d=np.linalg.norm(v1-v2,ord=1)
print(m_d)

1181.0



> **Q6.4 Short answer: [5 marks]**  
> Which one is larger in your result, Euclidean or Manhattan distance  
> and why does that usually happen based on their formulas?

Write your answer here:

>  
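>  Manhattan distance is larger.

>  Manhattan distance sums the absolute differences, while Euclidean distance squares the differences, sums them, and then takes the square root. For differences a and b, (|a| + |b|)² = a² + b² + 2|a||b| ≥ a² + b², so Manhattan distance is always at least as large as Euclidean distance; the two are equal only when at most one coordinate differs.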
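
The two formulas can also be written out by hand to see where the gap comes from. This sketch hard-codes the `v1` and `v2` values printed by the Q6.1 cell and should reproduce the results of the `np.linalg.norm` calls above.

```python
import numpy as np

# Vectors printed by the Q6.1 cell: [monthly_income, daily_app_opens].
v1 = np.array([3734.19, 48.0])
v2 = np.array([2594.19, 7.0])

diff = np.abs(v1 - v2)                  # per-feature absolute differences
manhattan = diff.sum()                  # |a| + |b|
euclidean = np.sqrt((diff ** 2).sum())  # sqrt(a^2 + b^2)

print("Absolute differences:", diff)
print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
```

The income difference (about 1140) is far larger than the app-opens difference (41), so both distances are driven almost entirely by income here; this is likely why the question suggests using scaled features.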



---
## Final Reflection [5 marks]

> In 4 to 6 sentences, describe how the three modules connect in this assignment.  
> Mention:
> - One idea from Module 1 or 2 that you used  
> - One idea from Module 3 that you used  
> - How these ideas together help you understand a dataset more deeply

Write your reflection here:

>  From Module 1, I used descriptive statistics such as the mean, median, and count.

>  From Module 2, I used recall, precision, and the confusion-matrix counts (TP, TN, FP, FN).
>  From Module 3, I used the z-score, Euclidean distance, and Manhattan distance.

>  These ideas help me explore the dataset from multiple angles:
1. understanding its distribution
2. evaluating predictions
3. measuring similarity

Together, this gives a much deeper and more complete understanding of the data.



## End of Assignment

Before submitting:
- Run all cells from top to bottom.  
- Check that all answer sections are filled.  
- Download this notebook as `.ipynb` and upload it according to the given instructions.
- ***You must read the Assignment Module text instructions in full; they explain how to submit this assignment.***
