## Air Quality Prediction from Low-Cost IoT Devices
Data Scientist: Victor Kelechi Ahaji

`Objective:`
Develop a machine learning model that accurately predicts CO2 levels using data from Chemotronix’s low-cost sensors. Such a model will help bridge the gap between affordability and precision in carbon emission tracking, enabling widespread adoption of low-cost monitoring technologies.

`Expected Result:`
> 1.) Democratize access to environmental monitoring tools.

> 2.) Assist governments and organizations in implementing data-driven policies to curb carbon emissions.

> 3.) Promote sustainability by making emission tracking affordable for communities and industries worldwide.

`Evaluation:`
The evaluation metric for this competition is Root Mean Squared Error.
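
For reference, Root Mean Squared Error (RMSE) is the square root of the mean squared difference between predicted and actual values, so lower is better and the score is in the same units as CO2. A minimal NumPy sketch follows; the helper name `rmse` and the toy values are illustrative only and are not part of the competition tooling.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not competition data
example_score = rmse([400.0, 410.0, 420.0], [402.0, 408.0, 424.0])
print(f"Example RMSE: {example_score:.3f}")
```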

### Importing Libraries

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from scipy.stats import skew


### Loading dataset

In [None]:
# Load data
train = pd.read_csv("Train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("SampleSubmission.csv")

### Quick Look
Take an initial look at the dataset to understand its structure.

In [None]:
train.head(3)

In [None]:
test.head(3)

In [None]:
# Descriptive statistic of the train dataset
train.describe()

From the above, we can see that the train dataset has 7307 samples.

### Exploratory Data Analysis

In [None]:
train.info()

In [None]:
train.set_index("ID", inplace = True)

From the above, we can observe that the features contain no null values. However, one column, `device_name`, might not be stored with the right data type.

In [None]:
print(train["device_name"].nunique())
print(train["device_name"].unique())

The `device_name` feature is better represented as a `category` data type than as `object`.

In [None]:
# Change the data type from `object` to `category` data type
train["device_name"] = train["device_name"].astype("category")

In [None]:
train.info()

### Understanding the distribution of train dataset

In [None]:
# Using the skew function from scipy.stats
numeric_data = train.select_dtypes(include=["number"])
skewness_values = skew(numeric_data)

# Pair each numeric column with its skewness (separate loop variable avoids shadowing)
for col, value in zip(numeric_data.columns, skewness_values):
    print(f'Skewness of {col} : {value:.2f}')
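
The sign and size of each skewness value can guide which transformation, if any, a column receives later on. The sketch below groups numeric columns by skewness direction. The helper name `group_by_skewness` and the cut-off of 0.5 are assumptions (a common rule of thumb, not a value taken from this notebook), and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def group_by_skewness(df, threshold=0.5):
    """Split numeric columns into right-skewed, left-skewed and roughly symmetric.

    `threshold` is an assumed rule-of-thumb cut-off, not a value from the notebook.
    """
    numeric = df.select_dtypes(include=["number"])
    groups = {"positive": [], "negative": [], "symmetric": []}
    for col in numeric.columns:
        value = skew(numeric[col])
        if value > threshold:
            groups["positive"].append(col)
        elif value < -threshold:
            groups["negative"].append(col)
        else:
            groups["symmetric"].append(col)
    return groups

# Synthetic demo data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_tail": rng.exponential(size=2000),   # long right tail
    "left_tail": -rng.exponential(size=2000),   # long left tail
    "bell": rng.normal(size=2000),              # roughly symmetric
})
skew_groups = group_by_skewness(demo)
print(skew_groups)
```

On the real data this would be called as `group_by_skewness(train)`.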

In [None]:
# Visualizing to understand skewness
# Set plot style
sns.set_style("whitegrid")

# create histograms for each numeric column
fig, ax = plt.subplots(nrows= 3, ncols= 3, figsize =(15,10))
ax = ax.flatten()

for i, col in enumerate(train.select_dtypes(include=["number"]).columns):
    sns.histplot(train[col], bins = 10, kde = True, ax = ax[i], color = 'blue')
    ax[i].set_title(f'Histogram of {col}')
    ax[i].set_xlabel(col)
    ax[i].set_ylabel("Frequency")

plt.tight_layout();

> Based on the above plots and skewness values, the features appear to be approximately normally distributed.

### Understanding the relationships between features
This helps identify multicollinearity.

In [None]:
# Using seaborn pairplot
sns.pairplot(train);

In [None]:
# Using heatmap to understand correlation
plt.figure(figsize = (15,10))
sns.heatmap(train.drop("device_name", axis = 1).corr(), annot = True)
plt.title("Correlation Heatmap")
plt.show()

### Feature Engineering, Selection and Preprocessing
Based on the above, there are instances of multicollinearity. This would normally require dropping certain features before model building, but we will not drop them here. Instead, we rely on PCA and a tree-based model, which are less sensitive to multicollinearity.
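
One way to quantify the multicollinearity seen in the heatmap is the variance inflation factor (VIF), which for each predictor equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below is an optional check rather than a step of the original workflow; the helper name `variance_inflation_factors` is hypothetical and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(df):
    """VIF per numeric column, read off the diagonal of the inverse correlation matrix."""
    numeric = df.select_dtypes(include=["number"])
    corr = numeric.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=numeric.columns, name="VIF")

# Synthetic demo: "a_copy" is almost a duplicate of "a", "b" is independent
rng = np.random.default_rng(42)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=500),
    "b": rng.normal(size=500),
})
vif = variance_inflation_factors(demo)
print(vif.round(2))
```

On the real data the target would be excluded first, for example `variance_inflation_factors(train.drop(columns=["CO2"]))`. Values well above roughly 5 to 10 are often read as a sign of strong collinearity, although that cut-off is a convention, not a hard rule.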

In [None]:
# Feature Engineering
# Create Heat Index Feature
train["Heat_Index"] = train["Temperature"] + (0.55 - 0.0055 * train["Humidity"]) * (train["Temperature"] - 14.5)
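
Because the same feature has to be created on the test data later, the formula can be wrapped in a small helper so both datasets stay in sync. This is a sketch: the function name `add_heat_index` is hypothetical, and the formula is copied unchanged from the cell above.

```python
import pandas as pd

def add_heat_index(df):
    """Return a copy of `df` with the notebook's Heat_Index column added."""
    out = df.copy()
    out["Heat_Index"] = out["Temperature"] + (0.55 - 0.0055 * out["Humidity"]) * (out["Temperature"] - 14.5)
    return out

# Toy rows, not sensor data
demo = pd.DataFrame({"Temperature": [14.5, 30.0], "Humidity": [50.0, 100.0]})
demo = add_heat_index(demo)
print(demo)
```

With this helper, `train = add_heat_index(train)` and `test = add_heat_index(test)` would apply identical logic to both datasets.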

In [None]:
# Feature selection and preprocessing
features = train.drop("CO2", axis = 1).columns
target =  'CO2'

In [None]:
X = train[features]
y = train[target]

In [None]:
# Feature transformation
numerical_cols = X.select_dtypes(include=["number"]).columns
categorical_col = X.columns.drop(numerical_cols)

# Log Transformation (for positive skewness)
log_transformer = FunctionTransformer(np.log1p, validate=False)

# Squared Transformation (for negative skewness)
squared_transformer = FunctionTransformer(np.square, validate=False)

# Standardization
scaler = StandardScaler()

# One-Hot Encoding for Categorical features
encoder = OneHotEncoder(handle_unknown='ignore')

# PCA for dimensionality reduction
pca = PCA(n_components = 6)

# Column Transformer to apply the transformations
preprocessor = ColumnTransformer([
    ("log", log_transformer, ["MG811_analog", "MQ9_analog","Humidity","MQ7_analog"]),  # Apply log transformation to positively skewed features
    ("square", squared_transformer, ["MQ135_analog","Temperature"]),  # Apply squared transformation to negatively skewed features
    ("scale", scaler, numerical_cols),  # Standardization for all numeric features
    ("one_hot", encoder, categorical_col),  # One-Hot Encoding
])

# Create a Pipeline
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("pca", pca)
])

# Apply Transformations
X_processed = pipeline.fit_transform(X)

# Convert to DataFrame for easy viewing
X_processed = pd.DataFrame(X_processed)
X_processed
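
Note that each entry of a `ColumnTransformer` receives the original columns, so in the cell above the log- and square-transformed outputs are not standardized, and the same columns appear a second time in raw, scaled form. If the intent is to transform first and then scale, one option is to nest a `Pipeline` inside the `ColumnTransformer`. The sketch below shows that pattern on synthetic data; it is an alternative under that assumption, not the method used in this notebook, and only a subset of the column names is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Log-transform first, then standardize the transformed values
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

chained = ColumnTransformer([
    ("log_scaled", log_then_scale, ["MG811_analog", "Humidity"]),
    ("scaled", StandardScaler(), ["Temperature"]),
])

# Synthetic stand-in for the sensor data (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "MG811_analog": rng.exponential(scale=100.0, size=200),
    "Humidity": rng.uniform(20.0, 90.0, size=200),
    "Temperature": rng.normal(28.0, 3.0, size=200),
})
demo_out = chained.fit_transform(demo)
print(demo_out.shape)
```

Each input column now appears exactly once in the output, with zero mean and unit variance.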


In [None]:
# Data Splitting
X_train, X_val, y_train, y_val = train_test_split(X_processed, y, test_size=0.2, random_state=42)
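
Note that the preprocessing pipeline above is fitted on all of `X` before the split, so the scaler and PCA have already seen the validation rows. A possible alternative is to split the raw features first and place the preprocessing and the model in one `Pipeline`, so that fitting uses the training split only. The sketch below uses synthetic data and a smaller forest to stay quick; the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the sensor features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Split the raw features first
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=6)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_tr, y_tr)  # scaler and PCA statistics come from the training split only
val_rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
print(f"Validation RMSE: {val_rmse:.2f}")
```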

### Model Training

In [None]:
# Model Training: Random Forest Regressor
reg_model = RandomForestRegressor(n_estimators = 600, max_depth = 10,
                                  min_samples_split = 2, min_samples_leaf = 2,
                                  max_features = "sqrt", bootstrap = True)
reg_model.fit(X_train, y_train)

In [None]:
# Predicting
y_val_pred = reg_model.predict(X_val)

# Evaluating
RMSE = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f'Validation RMSE: {RMSE:.2f}')

In [None]:
# Model Training: Gradient Boosting Regressor
reg_model_ = GradientBoostingRegressor(n_estimators = 100, learning_rate = 0.1,
                                       max_depth = 7, min_samples_split = 5,
                                       min_samples_leaf = 4, subsample = 0.8, max_features = "sqrt")
reg_model_.fit(X_train, y_train)

# Predicting
y_val_pred_ = reg_model_.predict(X_val)

# Evaluating
RMSE = np.sqrt(mean_squared_error(y_val, y_val_pred_))
print(f'Validation RMSE: {RMSE:.2f}')
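
The hyperparameters in both models above are set by hand. `GridSearchCV`, which is imported at the top of the notebook but not used, could search over them with cross-validated RMSE as the criterion. The sketch below runs on synthetic data with a deliberately small, illustrative grid; the grid values and variable names are assumptions, not tuned settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the processed features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

param_grid = {"max_depth": [3, 10], "min_samples_leaf": [2, 4]}  # illustrative values

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=3,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_
print(search.best_params_, f"CV RMSE: {best_rmse:.2f}")
```

On the real data, `search.fit(X_train, y_train)` would replace the synthetic arrays, and `search.best_estimator_` could then be evaluated on `X_val`.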

In [None]:
# Create the Heat Index feature in the test data
test["Heat_Index"] = test["Temperature"] + (0.55 - 0.0055 * test["Humidity"]) * (test["Temperature"] - 14.5)

In [None]:
# Test predictions
# Use transform (not fit_transform) so the test data reuses the preprocessing fitted on the train data
test_features = pipeline.transform(test[features])
test_predictions = reg_model.predict(test_features)

In [None]:
# Prepare submission
sample_submission['CO2'] = test_predictions
sample_submission.to_csv('submission.csv', index=False)
print("Submission file saved as 'submission.csv'")