# ⚠️ **Important Disclaimer**

Do **not edit or delete** any of the **Markdown cells** (the ones containing the questions and instructions).

Only write your answers in the **code cells provided below each question**.  
This ensures consistency during our feedback process.

### Q1. Load and Explore the Dataset

Load the `AirQualityDataset.csv` file into a pandas DataFrame using the correct separator.

After loading the data:

1. Display basic information about the dataset.
2. Save the statistical description of the dataset into a separate variable.
3. Drop fully empty/unnamed columns and fully empty rows.
4. Use `type()` to print the type of that description variable.

In [None]:
# your Code Here

import numpy as np
np.random.seed(0)
import pandas as pd

df = pd.read_csv("AirQualityDataset.csv", sep=';')

In [4]:

print("\nBasic Information of the dataset")
df.info()

description = df.describe()

print(type(description))


Basic Information of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   float64
 4   NMHC(GT)       9357 non-null   float64
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   float64
 7   NOx(GT)        9357 non-null   float64
 8   PT08.S3(NOx)   9357 non-null   float64
 9   NO2(GT)        9357 non-null   float64
 10  PT08.S4(NO2)   9357 non-null   float64
 11  PT08.S5(O3)    9357 non-null   float64
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
 15  Unnamed: 15    0 non-null      float64
 16  Unnamed: 16    0 non-null      float64
dtypes: float64(15), ob

In [5]:
df.head()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,10/03/2004,18.00.00,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,,
1,10/03/2004,19.00.00,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,,
2,10/03/2004,20.00.00,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,,
3,10/03/2004,21.00.00,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,,
4,10/03/2004,22.00.00,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,,


In [6]:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df.head()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,10/03/2004,18.00.00,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578
1,10/03/2004,19.00.00,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255
2,10/03/2004,20.00.00,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502
3,10/03/2004,21.00.00,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867
4,10/03/2004,22.00.00,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888
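
Step 3 of Q1 also asks for fully empty rows to be dropped. The `info()` output above reports 9471 entries but only 9357 non-null values per column, which suggests that some rows contain no data at all. A minimal sketch of removing both empty columns and empty rows with `dropna(how="all")` follows; the `toy` frame is a made-up stand-in, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded CSV: one empty column, one empty row.
toy = pd.DataFrame({
    "T": [13.6, np.nan, 11.9],
    "RH": [48.9, np.nan, 54.0],
    "Unnamed: 15": [np.nan, np.nan, np.nan],
})

# Drop columns where every value is missing, then rows where every value is missing.
cleaned = toy.dropna(axis=1, how="all").dropna(axis=0, how="all")

print(cleaned.shape)
print(cleaned.columns.tolist())
```

Unlike the name-based filter above, this removes a column only when it is actually empty, regardless of what it is called.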


### Q2. Dataset structure and features overview  
Write code to collect:
1. The number of rows and columns in the dataset.
2. The list of the first 10 feature columns, excluding `'Date'` and `'Time'`.  

Store both in a tuple called `dataset_info` and print it.

In [43]:
print(df.shape)
print(df.columns.tolist())


(9471, 18)
['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'humidity_band', 'PT08.S1_noisy', 'moisture_index']


In [44]:

print(f"Number of rows = {df.shape[0]}\nNumber of columns = {df.shape[1]}")

# Feature columns are every column except 'Date' and 'Time'; keep the first 10.
feature_cols = [c for c in df.columns if c not in ['Date', 'Time']][:10]
print("\nFirst 10 feature columns:", feature_cols)

# Store the shape and the feature list together, as the question asks.
dataset_info = (df.shape, feature_cols)
print("\ndataset_info:", dataset_info)

Number of rows = 9471
Number of columns = 18

First 10 feature columns: ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)']

dataset_info: ((9471, 18), ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)'])


### Q3. CO(GT) summary with Pandas and NumPy
Compute the **mean** and **standard deviation** of `CO(GT)` using both:
- Pandas
- NumPy

In [45]:
# your Code Here
mean_pandas = df['CO(GT)'].mean()
print(mean_pandas)
standarddeviation_pandas = df['CO(GT)'].std()
print(standarddeviation_pandas)


-34.20752377898899
77.65717034683162


In [46]:
co_GT_value = df['CO(GT)'].to_numpy()
mean_numpy = np.nanmean(co_GT_value)
std_numpy = np.nanstd(co_GT_value, ddof=1)   
print(mean_numpy)
print(std_numpy)


-34.20752377898899
77.65717034683162
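
A negative mean for a concentration column suggests that placeholder values are included in the calculation. The Q9 cell further down filters out `-200`, so the sketch below assumes that `-200` marks a missing reading; that is an assumption, and the sample values are made up. It also shows why `ddof=1` is needed on the NumPy side: pandas `std()` uses `ddof=1` by default, while `np.std` and `np.nanstd` default to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical CO(GT) readings; -200 is assumed to be a missing-value placeholder.
co = pd.Series([2.6, 2.0, -200.0, 2.2, 1.6])
co_clean = co.replace(-200, np.nan)

# Pandas skips NaN by default and uses the sample standard deviation (ddof=1).
mean_pd = co_clean.mean()
std_pd = co_clean.std()

# NumPy needs the nan-aware functions and an explicit ddof=1 to match pandas.
values = co_clean.to_numpy()
mean_np = np.nanmean(values)
std_np = np.nanstd(values, ddof=1)

print(f"pandas: mean={mean_pd:.4f}, std={std_pd:.4f}")
print(f"numpy:  mean={mean_np:.4f}, std={std_np:.4f}")
```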


### Q4. Absolute humidity (AH) distribution  
Compute the **min**, **median**, and **max** of `AH` using Pandas.  

Do you notice an issue in the values?  
If you think that there are values that are problematic, replace them with the median of the column and print the same three statistics after that.


In [47]:
print("AH before replacement:Min:", df['AH'].min(),"\nMedian:", df['AH'].median(),"\nMax:", df['AH'].max())

median_valid = df.loc[df['AH'] > 0, 'AH'].median()
df['AH'] = df['AH'].apply(lambda x: median_valid if x <= 0 else x)

print("\nAH After replacement:Min:", df['AH'].min(), "\nMedian:", df['AH'].median(), "\nMax:", df['AH'].max())

AH before replacement:Min: 0.1847 
Median: 0.9954 
Max: 2.231

AH After replacement:Min: 0.1847 
Median: 0.9954 
Max: 2.231
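
The statistics above are identical before and after the replacement, so no value was changed on that run. The sketch below shows the same replacement rule on made-up data where it does have an effect, using the vectorized `Series.mask` instead of `apply`. It keeps the rule from the cell above (non-positive `AH` is treated as problematic); the sample values, including the `-200` placeholders, are hypothetical.

```python
import pandas as pd

# Hypothetical AH readings with two placeholder values.
ah = pd.Series([0.7578, -200.0, 0.7255, 0.7502, -200.0])

# Compute the median from the valid readings only, so placeholders do not skew it.
valid_median = ah[ah > 0].median()

# Replace every non-positive reading with that median in one vectorized step.
ah_fixed = ah.mask(ah <= 0, valid_median)

print("Before:", ah.min(), ah.median(), ah.max())
print("After: ", ah_fixed.min(), ah_fixed.median(), ah_fixed.max())
```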


### Q5. Humidity bands
Create a new column `humidity_band` using `RH`:
- `'dry'` if `RH < 30`
- `'comfortable'` if `30 <= RH <= 60`
- `'humid'` if `RH > 60`

Then show the **count** of each category.

In [48]:
df['humidity_band'] = pd.cut(df['RH'], bins=[-float('inf'), 30, 60, float('inf')], labels=['dry', 'comfortable', 'humid'])

humidity_counts = df['humidity_band'].value_counts()
print(humidity_counts)


humidity_band
comfortable    4917
humid          2633
dry            1807
Name: count, dtype: int64
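
One detail worth checking is the boundary at `RH == 30`. With its default `right=True`, `pd.cut` builds intervals that are closed on the right, so the first bin is `(-inf, 30]` and a value of exactly 30 is labeled `'dry'`, whereas the question puts it in `'comfortable'`. A minimal sketch of an alternative using `np.select`, which follows the three conditions literally, compared with `pd.cut` on four boundary values chosen for illustration:

```python
import numpy as np
import pandas as pd

# Boundary values chosen to show the difference between the two approaches.
rh = pd.Series([29.9, 30.0, 60.0, 60.1])

# Conditions are checked in order; the first match wins, anything else is 'humid'.
conditions = [rh < 30, rh <= 60]
labels = ["dry", "comfortable"]
bands = pd.Series(np.select(conditions, labels, default="humid"), index=rh.index)

# pd.cut with the default right=True uses the bins (-inf, 30], (30, 60], (60, inf).
cut_bands = pd.cut(rh, bins=[-np.inf, 30, 60, np.inf],
                   labels=["dry", "comfortable", "humid"])

print(bands.tolist())
print(cut_bands.tolist())
```

Note that a missing `RH` fails both conditions and would receive the default label, so missing values need to be handled separately before using this approach.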


### Q6. Compute the Average 'CO(GT)' for Humid Conditions  

Using the `'humidity_band'` column created above, filter the dataset for rows labeled `'humid'` and compute the **average value of `'CO(GT)'`** for these observations.  

Format the output to 4 decimal places for better readability and precision.

In [50]:
humid_mean = df.loc[df['humidity_band'] == 'humid', 'CO(GT)'].mean()
print(f"{humid_mean:.4f}")


-35.5567

### Q7. Retrieve and sort array by a specific column
Create a NumPy array from the columns `[T, RH, AH]` (in this order), then sort the array by the **third column (`AH`)** ascending. Show the first 5 rows.

In [52]:
# your Code Here
tri_array = df[['T', 'RH', 'AH']].to_numpy()
arr = tri_array[tri_array[:, 2].argsort()]
print(arr[:5])

[[ 0.     29.7     0.1847]
 [11.8    13.5     0.1862]
 [ 0.2    30.2     0.191 ]
 [-0.1    31.9     0.1975]
 [12.2    14.      0.1988]]


### Q8. Normalized moisture index

Using the NumPy array you built above (**Do not change it**):  

1. Using NumPy, **convert RH to a fraction** (0–1 scale) by dividing it by 100 and save it to another array `RH_frac`.
2. Using NumPy, **compute a normalized moisture index** by dividing `AH` by `RH_frac`. This approximates the amount of absolute humidity per unit of relative humidity.

Print the first 10 values of this new array and then **store** the result in the original DataFrame as a new column `'moisture_index'`.

In [53]:
RH_frac = arr[:, 1] / 100
moisture_index = arr[:, 2] / RH_frac
print(moisture_index[:10])
df['moisture_index'] = moisture_index
df.head()



[0.62188552 1.37925926 0.63245033 0.61912226 1.42       1.39931034
 0.63667712 1.23473054 1.36339869 1.17228261]


Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,humidity_band,PT08.S1_noisy,moisture_index
0,10/03/2004,18.00.00,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,comfortable,1293.663398,0.621886
1,10/03/2004,19.00.00,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,comfortable,1442.129978,1.379259
2,10/03/2004,20.00.00,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,comfortable,1401.083723,0.63245
3,10/03/2004,21.00.00,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,comfortable,1451.602118,0.619122
4,10/03/2004,22.00.00,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,comfortable,1361.755377,1.42
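
In the table above, row 0 has `AH = 0.7578` and `RH = 48.9`, which gives 0.7578 / 0.489 ≈ 1.5497 rather than the stored 0.621886. The stored values come from the sorted array `arr`, and assigning a NumPy array to a DataFrame column goes by position, so each row receives the index of a different observation. A minimal sketch of a row-aligned version, computed from the unsorted array, using the first three rows shown above (the names `df_small` and `tri` are illustrative):

```python
import numpy as np
import pandas as pd

# First three rows of T, RH, AH as shown in the table above.
df_small = pd.DataFrame({
    "T": [13.6, 13.3, 11.9],
    "RH": [48.9, 47.7, 54.0],
    "AH": [0.7578, 0.7255, 0.7502],
})

# Build the array in the original row order (no sorting).
tri = df_small[["T", "RH", "AH"]].to_numpy()

rh_frac = tri[:, 1] / 100        # RH as a 0-1 fraction
moisture = tri[:, 2] / rh_frac   # AH per unit of relative humidity

# Position i of the array corresponds to row i of the DataFrame.
df_small["moisture_index"] = moisture
print(np.round(moisture, 4))
```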


### Q9. Temperature profile for high moisture index  

Using NumPy only, and the `moisture_index` values you computed earlier:  

1. Find the **median** of `moisture_index`.  
2. Filter `tri_array` to include only rows where `moisture_index` is above this median.  
3. Compute and print the **mean temperature** for this high-moisture group using only NumPy.

Format the output to 4 decimal places for better readability and precision.

In [54]:
data = np.genfromtxt("AirQualityDataset.csv", 
                     delimiter=";", skip_header=1, 
                     usecols=(12, 13, 14), dtype=float, 
                     filling_values=np.nan)

mask = (~np.isnan(data).any(axis=1)) & np.all(data != -200, axis=1)
clean = data[mask]
mi = clean[:, 0] * clean[:, 1]
median_mi = np.median(mi)
high_group = clean[mi > median_mi]
mean_temp = high_group[:, 0].mean()

print("First 10 moisture_index:", np.round(mi[:10], 4))
print("Median moisture_index:", round(median_mi, 4))
print("Mean temperature (high group):", round(mean_temp, 4))


First 10 moisture_index: [665.04 634.41 642.6  660.   667.52 663.04 641.84 642.   638.79 620.06]
Median moisture_index: 820.06
Mean temperature (high group): 21.6729
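
The first value printed above, 665.04, equals 13.6 × 48.9, so the cell multiplies `T` by `RH` rather than dividing `AH` by the RH fraction as defined in Q8. The sketch below follows the three steps of the question with NumPy only, on the first five `(T, RH, AH)` rows shown earlier. It assumes the moisture index is row-aligned with the array; the names `tri`, `moisture`, and `high` are illustrative.

```python
import numpy as np

# First five (T, RH, AH) rows from the table shown earlier.
tri = np.array([
    [13.6, 48.9, 0.7578],
    [13.3, 47.7, 0.7255],
    [11.9, 54.0, 0.7502],
    [11.0, 60.0, 0.7867],
    [11.2, 59.6, 0.7888],
])

# Moisture index as defined in Q8: AH divided by RH expressed as a fraction.
moisture = tri[:, 2] / (tri[:, 1] / 100)

# 1. Median of the index (nanmedian ignores missing values, if any).
median_mi = np.nanmedian(moisture)

# 2. Keep only the rows whose index is above the median.
high = tri[moisture > median_mi]

# 3. Mean temperature (column 0) of the high-moisture group.
mean_temp = np.mean(high[:, 0])

print(f"Median moisture_index: {median_mi:.4f}")
print(f"Mean temperature (high group): {mean_temp:.4f}")
```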


### Q10. Percentile-based filtering
Compute:
- the **85th percentile** of `C6H6(GT)` (benzene), and
- the **25th percentile** of `RH`.

Filter and return rows where `C6H6(GT)` is **above** its 85th percentile **and** `RH` is **below** its 25th percentile. Show the number of rows and the first 5 matches.


In [55]:
p85 = np.percentile(df['C6H6(GT)'], 85)   
p25  = np.percentile(df['RH'], 25)        
print(" C6H6(GT):", p85 ,"RH:", p25 )

fi_df = df[(df['C6H6(GT)'] > p85) & (df['RH'] < p25)]

print("Number of rows:", len(fi_df))
print(fi_df.head())

 C6H6(GT): nan RH: nan
Number of rows: 0
Empty DataFrame
Columns: [Date, Time, CO(GT), PT08.S1(CO), NMHC(GT), C6H6(GT), PT08.S2(NMHC), NOx(GT), PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH, AH, humidity_band, PT08.S1_noisy, moisture_index]
Index: []
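
Both percentiles above come out as `nan` because `np.percentile` propagates missing values, and every comparison against `nan` is false, which is why the filter returns no rows. A minimal sketch using `np.nanpercentile`, which ignores `NaN`, on a small made-up frame (the values are hypothetical; `Series.quantile` would be an equivalent pandas option):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one fully missing row.
toy = pd.DataFrame({
    "C6H6(GT)": [11.9, 9.4, 9.0, np.nan, 6.5, 20.0],
    "RH": [48.9, 47.7, 54.0, np.nan, 59.6, 20.0],
})

# np.percentile returns nan as soon as the input contains a NaN.
naive_p25 = np.percentile(toy["RH"].to_numpy(), 25)

# np.nanpercentile computes the percentile over the non-missing values only.
p85 = np.nanpercentile(toy["C6H6(GT)"].to_numpy(), 85)
p25 = np.nanpercentile(toy["RH"].to_numpy(), 25)

matches = toy[(toy["C6H6(GT)"] > p85) & (toy["RH"] < p25)]

print("naive 25th percentile of RH:", naive_p25)
print(f"C6H6(GT) 85th percentile: {p85:.4f}, RH 25th percentile: {p25:.4f}")
print("Number of rows:", len(matches))
```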


### Q11. Simulate Sensor Measurement Noise and Analyze the Effect  

Simulate **normally distributed measurement noise** with a mean of `0` and a standard deviation of `100` (in raw sensor units). Then:  

- Use **NumPy** to generate the noise.  
- Use **Pandas** to add this noise to the `'PT08.S1(CO)'` column and store the result in a new column `'PT08.S1_noisy'`.  
- Print the **mean** and **standard deviation** of both `'PT08.S1(CO)'` and `'PT08.S1_noisy'` to observe the impact of the simulated noise.  

Observe how the added noise affects the distribution, particularly the spread (**standard deviation**). Format all printed values to **4 decimal places** using `.4f`.  


In [56]:
# your Code Here
df['PT08.S1_noisy'] = df['PT08.S1(CO)'] + np.random.randn(len(df)) * 100
print(df[['PT08.S1_noisy', 'PT08.S1(CO)']].head())

df[['PT08.S1(CO)', 'PT08.S1_noisy']].agg(['mean', 'std']).round(4).head()



   PT08.S1_noisy  PT08.S1(CO)
0    1386.862898       1360.0
1    1168.528412       1292.0
2    1462.660810       1402.0
3    1461.300081       1376.0
4    1307.684235       1272.0


Unnamed: 0,PT08.S1(CO),PT08.S1_noisy
mean,1048.9901,1049.1281
std,329.8327,344.5808
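
The mean barely moves because the noise has mean 0. For noise that is independent of the signal, the variances add, so the expected standard deviation of the noisy column is about sqrt(329.8327² + 100²) ≈ 344.66, close to the 344.5808 reported above. The sketch below checks this relationship on a synthetic signal; the signal parameters and sample size are made up for illustration and only loosely mimic the sensor column.

```python
import numpy as np

np.random.seed(0)

# Synthetic stand-in for a sensor column (hypothetical mean and spread).
signal = np.random.normal(loc=1050, scale=330, size=100_000)

# Zero-mean Gaussian measurement noise with standard deviation 100.
noise = np.random.normal(loc=0, scale=100, size=signal.size)
noisy = signal + noise

std_signal = signal.std(ddof=1)
std_noisy = noisy.std(ddof=1)

# Independent variances add: std_noisy should be close to sqrt(std_signal^2 + 100^2).
expected = np.hypot(std_signal, 100)

print(f"mean: {signal.mean():.4f} -> {noisy.mean():.4f}")
print(f"std:  {std_signal:.4f} -> {std_noisy:.4f} (expected about {expected:.4f})")
```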


# Make Your Results Reproducible

If you re-run the previous cell multiple times, you'll notice that the results involving randomness (e.g., simulated noise) change each time. This is because NumPy generates new random numbers on every execution.

To make your results **reproducible** (meaning that both you and your instructor get the **same output every time**) you need to set a fixed **random seed**.
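
A minimal sketch of what the seed does: resetting it to the same value replays the same sequence of random numbers, while drawing again without resetting continues the sequence and gives different values. The variable names are illustrative.

```python
import numpy as np

np.random.seed(0)
first = np.random.normal(loc=0, scale=100, size=3)

# Resetting the seed restarts the generator, so the same numbers come out again.
np.random.seed(0)
second = np.random.normal(loc=0, scale=100, size=3)

# Without resetting, the generator simply continues its sequence.
third = np.random.normal(loc=0, scale=100, size=3)

print(np.array_equal(first, second))
print(np.array_equal(second, third))
```

This is also why the seed alone is not enough when a single cell is re-run in isolation: the generator keeps advancing, so the outputs only match when the notebook is executed from the top.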

As the final task, go back and add the following line to your code **immediately after importing NumPy** for the first time in your notebook:

```python
np.random.seed(0)
```

So, your NumPy import at the top of the notebook should now look like this:

```python
import numpy as np
np.random.seed(0)
```


# After Making This Change:

- Re-run **all cells** in the notebook from top to bottom.  
- Make sure **all outputs are visible**.  
- **Save your notebook.**  
- **Submit it as-is (with all outputs included).**
