## Group ID
Distributed Machine Learning Group 72


## Group Members
1. Ravindra Kumar Tholiya - 2023AA05124
2. Jahnavi Gali - 2023AA05684
3. Shivam Sahil - 2023AA05663
4. Anurag Anand - 2023AA05280

In [None]:
# Necessary Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

#### Dataset Detail
The dataset is taken from [Kaggle](https://www.kaggle.com/datasets/suraj520/cellular-network-analysis-dataset?resource=download). It provides per-locality signal strength, throughput, and latency measurements, which makes it suitable for this use case.

In [None]:
# Group Number - using it for random state generation
group_number = 72
# Read the data from the source
# Dataset URL - https://www.kaggle.com/datasets/suraj520/cellular-network-analysis-dataset?resource=download
file_path = r'signal_metrics.csv'
data = pd.read_csv(file_path)

# Display the first rows of data to inspect its structure
data.head(), data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16829 entries, 0 to 16828
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Timestamp                     16829 non-null  object 
 1   Locality                      16829 non-null  object 
 2   Latitude                      16829 non-null  float64
 3   Longitude                     16829 non-null  float64
 4   Signal Strength (dBm)         16829 non-null  float64
 5   Signal Quality (%)            16829 non-null  float64
 6   Data Throughput (Mbps)        16829 non-null  float64
 7   Latency (ms)                  16829 non-null  float64
 8   Network Type                  16829 non-null  object 
 9   BB60C Measurement (dBm)       16829 non-null  float64
 10  srsRAN Measurement (dBm)      16829 non-null  float64
 11  BladeRFxA9 Measurement (dBm)  16829 non-null  float64
dtypes: float64(9), object(3)
memory usage: 1.5+ MB


(                    Timestamp           Locality   Latitude  Longitude  \
 0  2023-05-05 12:50:40.000000           Anisabad  25.599109  85.137355   
 1  2023-05-05 12:53:47.210173        Fraser Road  25.433286  85.070053   
 2  2023-05-05 12:56:54.420346  Boring Canal Road  25.498809  85.211371   
 3  2023-05-05 13:00:01.630519            Danapur  25.735138  85.208400   
 4  2023-05-05 13:03:08.840692    Phulwari Sharif  25.538556  85.159860   
 
    Signal Strength (dBm)  Signal Quality (%)  Data Throughput (Mbps)  \
 0             -84.274113                 0.0                1.863890   
 1             -97.653121                 0.0                5.132296   
 2             -87.046134                 0.0                1.176985   
 3             -94.143159                 0.0               68.596932   
 4             -94.564765                 0.0               38.292038   
 
    Latency (ms) Network Type  BB60C Measurement (dBm)  \
 0    129.122914           3G                 0.00


## Data Summary
The dataset has 12 columns, and these are the relevant attributes we can use for the analysis:

- **Signal Strength (dBm)**: Represents network signal quality.
- **Data Throughput (Mbps)**: Can be used as a proxy for network traffic.
- **Latency (ms)**: Target variable for prediction.
- **Locality**: Can be used to differentiate between urban and rural areas for horizontal partitioning.
- **Network Type**: Provides information about the type of network (e.g., 3G, 4G, 5G).
- **Other measurements**: Can be leveraged for additional insights if needed.

## Plan for Task 2 Implementation:

### Vertical Partitioning:

**Model A**: Use Signal Strength (dBm) and Data Throughput (Mbps) as input features.

**Model B**: Simulate user-related features using Network Type, and potentially Locality or additional assumptions.


Combine outputs of **Model A** and **Model B** to predict Latency (ms).

### Horizontal Partitioning:

- Split the dataset based on Locality into urban and rural subsets.
- Train separate models for these subsets and compare their performance to a single monolithic model.

First let's do pre-processing and vertical partitioning

In [None]:
# Selecting features of Model A and Model B - Vertical partitioning

features_model_A = [
    'Signal Strength (dBm)',
    'Data Throughput (Mbps)'
    ]

features_model_B = ['Network Type']

target = 'Latency (ms)'

# Encoding Network Type as categorical variable for Model B
data['Network Type'] = data['Network Type'].astype('category').cat.codes

# Train test splitting
train_data, test_data = train_test_split(data , test_size = 0.2, random_state = group_number)

In [None]:
# Model A Training with RandomForestRegressor
model_A = RandomForestRegressor(random_state = group_number)
model_A.fit(train_data[features_model_A], train_data[target])

In [None]:
# Model B Training with RandomForestRegressor
model_B = RandomForestRegressor(random_state = group_number)
model_B.fit(train_data[features_model_B], train_data[target])

In [None]:
# Combine predictions from Model A and Model B by taking their unweighted average
def combine_predictions(model_A, model_B, data):
    prediction_A = model_A.predict(data[features_model_A])
    prediction_B = model_B.predict(data[features_model_B])

    return (prediction_A + prediction_B) / 2

In [None]:
# Processing for Predictions
combined_prediction = combine_predictions(model_A = model_A, model_B = model_B, data = test_data)

# Calculating the Mean Absolute Error and Mean Squared Error
mae = mean_absolute_error(test_data[target], combined_prediction)
mse = mean_squared_error(test_data[target], combined_prediction)

mae, mse

(18.39272731113928, 522.4341151703243)

The combined vertical partitioned model evaluation results on the test set are:

- **Mean Absolute Error (MAE):** 18.39 ms
- **Mean Squared Error (MSE):** 522.434 ms²
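
As a quick illustration of how these two metrics are computed, the sketch below works through MAE and MSE by hand on four made-up latency values. The numbers are hypothetical and are not taken from the dataset. It also takes the square root of the MSE (the RMSE), which brings the error back to milliseconds.

```python
import numpy as np

# Hypothetical latency values in ms, chosen only to illustrate the arithmetic.
y_true = np.array([100.0, 120.0, 80.0, 150.0])
y_pred = np.array([110.0, 115.0, 95.0, 130.0])

errors = y_true - y_pred         # [-10, 5, -15, 20]
mae = np.mean(np.abs(errors))    # (10 + 5 + 15 + 20) / 4 = 12.5
mse = np.mean(errors ** 2)       # (100 + 25 + 225 + 400) / 4 = 187.5
rmse = np.sqrt(mse)              # same unit as the target (ms)

print(f"MAE = {mae} ms, MSE = {mse} ms^2, RMSE = {rmse:.2f} ms")
```

Applied to the reported MSE of 522.43 ms², the same square root gives an RMSE of roughly 22.9 ms, which is easier to read next to the MAE.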

We will now proceed with the monolithic model evaluation.

### Monolithic Model evaluation


In [None]:
# Monolithic model evaluation
monolithic_model = RandomForestRegressor(random_state = group_number)
all_features = features_model_A + features_model_B
monolithic_model.fit(train_data[all_features], train_data[target])

In [None]:
# Monolithic Model Predictions
monolithic_predictions = monolithic_model.predict(test_data[all_features])

# Evaluation of Monolithic Model
monolithic_mae = mean_absolute_error(test_data[target], monolithic_predictions)
monolithic_mse = mean_squared_error(test_data[target], monolithic_predictions)

monolithic_mae, monolithic_mse

(19.177573457546107, 582.4935148622325)

The monolithic model's evaluation results on the test set are:

- **Mean Absolute Error (MAE)**: 19.18 ms
- **Mean Squared Error (MSE)**: 582.49 ms²

## Performance Comparison

### Monolithic Model:

The monolithic model, which uses all features (Signal Strength, Data Throughput, and Network Type) in a single model, achieved the following results:
- Mean Absolute Error (MAE): 19.18 ms
- Mean Squared Error (MSE): 582.49 ms²


### Vertical Partitioned Model:

The vertically partitioned model splits processing into:
- **Model A** for network-related features (Signal Strength, Data Throughput).
- **Model B** for user-related features (Network Type).

#### Combined outputs achieved:
- Mean Absolute Error (MAE): 18.39 ms
- Mean Squared Error (MSE): 522.43 ms²

##### **Observation**: The partitioned model performed slightly better than the monolithic model (MAE 18.39 ms vs. 19.18 ms, MSE 522.43 ms² vs. 582.49 ms²). This approach is beneficial when feature groups are processed on specialized hardware.
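
Because the gap between the two models is small, it can help to check whether it is larger than the sampling noise of the test set. The sketch below is one possible way to do that, using a paired bootstrap over per-row absolute errors. The helper name `bootstrap_mae_difference` and the synthetic arrays are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

def bootstrap_mae_difference(y_true, pred_a, pred_b, n_resamples=2000, seed=72):
    """Return the MAE difference (a minus b) and a 95% percentile interval."""
    y_true = np.asarray(y_true, dtype=float)
    # Per-row difference in absolute error; pairing keeps the same test rows
    # for both models in every resample.
    diff = np.abs(y_true - np.asarray(pred_a)) - np.abs(y_true - np.asarray(pred_b))
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample of the test set, drawn with replacement.
    idx = rng.integers(0, len(diff), size=(n_resamples, len(diff)))
    resampled_means = diff[idx].mean(axis=1)
    low, high = np.percentile(resampled_means, [2.5, 97.5])
    return diff.mean(), low, high

# Synthetic stand-ins for the test targets and the two sets of predictions.
rng = np.random.default_rng(0)
y = rng.normal(100, 30, size=500)
pred_partitioned = y + rng.normal(0, 24, size=500)
pred_monolithic = y + rng.normal(0, 25, size=500)

mean_diff, low, high = bootstrap_mae_difference(y, pred_partitioned, pred_monolithic)
print(f"MAE difference: {mean_diff:.2f} ms, 95% interval [{low:.2f}, {high:.2f}]")
```

In the notebook, the function could be called with `test_data[target]`, `combined_prediction`, and `monolithic_predictions`. An interval that includes zero would suggest the two models are not clearly distinguishable on this test set.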

Let's now proceed with horizontal partitioning and evaluate the performance of the models trained on urban and rural subsets.

In [None]:
# Obtaining all unique localities
localities = data['Locality'].unique()

localities

array(['Anisabad', 'Fraser Road', 'Boring Canal Road', 'Danapur',
       'Phulwari Sharif', 'Bankipore', 'Kidwaipuri', 'Gardanibagh',
       'Boring Road', 'S.K. Puri', 'Pataliputra', 'Patliputra Colony',
       'Rajendra Nagar', 'Bailey Road', 'Gandhi Maidan', 'Anandpuri',
       'Kumhrar', 'Kankarbagh', 'Ashok Rajpath', 'Exhibition Road'],
      dtype=object)

### Locality transformation

The dataset contains 20 localities and no urban/rural label, so we use a simplifying assumption: any locality whose name contains `puri` or `road` is treated as urban, and the rest as rural.

In [None]:
# Locality classification
def classify_locality(locality):
    if any(keyword in locality.lower() for keyword in ['puri', 'road']):
        return 'Urban'
    return 'Rural'
data['Locality Type'] = data['Locality'].apply(classify_locality)

def perform_analysis():
    # Classify the data into Urban and Rural localities
    urban_data = data[data['Locality Type'] == 'Urban']
    rural_data = data[data['Locality Type'] == 'Rural']

    if len(urban_data) < 1 or len(rural_data) < 1:
        # Fail loudly: returning None here would break the tuple unpacking below
        raise ValueError('Urban or Rural data is missing')

    urban_train, urban_test = train_test_split(
        urban_data, test_size = 0.2, random_state = group_number)

    rural_train, rural_test = train_test_split(
        rural_data, test_size = 0.2, random_state = group_number)

    # Training Urban Data
    urban_model = RandomForestRegressor(random_state = group_number)
    urban_model.fit(urban_train[all_features], urban_train[target])

    # Training Rural Data
    rural_model = RandomForestRegressor(random_state = group_number)
    rural_model.fit(rural_train[all_features], rural_train[target])

    # Urban predictions (scikit-learn metrics take y_true first, then y_pred)
    urban_predictions = urban_model.predict(urban_test[all_features])
    urban_mae = mean_absolute_error(urban_test[target], urban_predictions)
    urban_mse = mean_squared_error(urban_test[target], urban_predictions)

    # Rural predictions
    rural_predictions = rural_model.predict(rural_test[all_features])
    rural_mae = mean_absolute_error(rural_test[target], rural_predictions)
    rural_mse = mean_squared_error(rural_test[target], rural_predictions)

    return urban_mae, urban_mse, rural_mae, rural_mse

urban_mae, urban_mse, rural_mae, rural_mse = perform_analysis()
print(f"""
Loss Function Analysis
    Urban Locality
        Urban MAE: {urban_mae}
        Urban MSE: {urban_mse}
    Rural Locality
        Rural MAE: {rural_mae}
        Rural MSE: {rural_mse}
""")


Loss Function Analysis
    Urban Locality
        Urban MAE: 19.426286897310824
        Urban MSE: 599.6032917605203
    Rural Locality
        Rural MAE: 19.063790542497884
        Rural MSE: 581.6386397457409



The horizontal partitioning is now complete, and the models for the urban and rural subsets have been trained and evaluated. Here are the results:

- **Horizontal Partitioned Models**:
  - Separate models were trained for urban and rural subsets. Results:
    - **Urban Model**:
      - MAE: 19.43 ms
      - MSE: 599.60 ms²
    - **Rural Model**:
      - MAE: 19.06 ms
      - MSE: 581.64 ms²
  - **Observation**: Horizontal partitioning can improve accuracy within a subset when its data is homogeneous. In this experiment the effect is small: the rural model is marginally better and the urban model slightly worse than the monolithic model (MAE 19.18 ms). Each model is also evaluated on its own test split, so the comparison is indicative only, and subset models may struggle to generalize across subsets.
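
To compare the two subset models against a single model with one number, the per-subset metrics can be pooled. Since MAE and MSE are both averages over rows, the pooled value is the average of the subset values weighted by test-set size. The sketch below illustrates this with a hypothetical helper `pooled_metric` and made-up values; the subset sizes are not reported in the notebook output, so none are assumed here.

```python
def pooled_metric(values, sizes):
    """Size-weighted average of per-subset MAE (or MSE) values."""
    if len(values) != len(sizes) or sum(sizes) == 0:
        raise ValueError("values and sizes must align and sizes must not be all zero")
    return sum(v * n for v, n in zip(values, sizes)) / sum(sizes)

# Hypothetical example: subset MAEs of 20.0 ms and 18.0 ms
# on test sets of 100 and 300 rows.
example = pooled_metric([20.0, 18.0], [100, 300])
print(example)  # (20.0 * 100 + 18.0 * 300) / 400 = 18.5
```

In the notebook, the sizes would be `len(urban_test)` and `len(rural_test)`, which `perform_analysis` would need to return alongside the metrics.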

---

#### **2. Insights into Horizontal Partitioning**
- **Benefits**:
  - **Improved Accuracy**: By splitting the dataset into geographically distinct subsets, the models can specialize and perform better within each domain.
  - **Scalability**: Subset models can be deployed in edge environments (e.g., rural and urban networks).
  - **Reduced Complexity**: Separate models allow for targeted optimization, such as adjusting hyperparameters to account for specific traffic patterns in urban or rural areas.

- **Challenges**:
  - **Data Requirements**: Requires sufficient data in each subset for training. Imbalanced datasets can lead to suboptimal performance.
  - **Maintenance Overhead**: Maintaining and updating multiple models increases complexity.
  - **Generalization**: Models trained on subsets may fail to generalize to unseen scenarios or regions.


### **Horizontal Partitioning vs. Monolithic Model: Performance Comparison**

#### **1. Horizontal Partitioned Models**
- Separate models trained for urban and rural subsets:
  - **Urban Model**:
    - MAE: **19.43 ms**
    - MSE: **599.60 ms²**
  - **Rural Model**:
    - MAE: **19.06 ms**
    - MSE: **581.64 ms²**

- **Observation**:
  - **Specialized learning** within each subset, although the measured accuracy is close to that of the monolithic model in this experiment.
  - **Limited generalization** across different regions.

#### **2. Monolithic Model (Single Unified Model)**
- Trained on the entire training set without partitioning.
- **Performance Metrics** (measured on the test set in the monolithic evaluation above):
  - **MAE: 19.18 ms**
  - **MSE: 582.49 ms²**
- **Observation**:
  - Generalizes across localities with a single model, but cannot specialize for region-specific variations.
  - Its error falls between that of the urban and rural models, so mixing urban and rural data did not noticeably hurt performance here.

---

### **3. Insights: Monolithic vs. Horizontal Partitioning**  

| **Criteria** | **Monolithic Model** | **Horizontal Partitioning** |  
|-------------|----------------------|----------------------------|  
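| **Accuracy** | MAE 19.18 ms in this experiment; can suffer when subsets follow different patterns. | MAE 19.43 ms (urban) and 19.06 ms (rural) in this experiment; can improve when subsets are homogeneous. |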
| **Generalization** | Better across unseen regions. | Limited outside trained subsets. |  
| **Scalability** | Simpler deployment, but may not scale efficiently. | Can scale well for different environments (e.g., edge deployment). |  
| **Complexity** | Easier to maintain but harder to optimize for specific cases. | Requires managing multiple models but allows targeted tuning. |  

#### **4. Recommendations**
  
- **Use a monolithic model** if generalization is a priority and dataset variability is minimal.
- **Use horizontal partitioning** when:
  - the data contains distinct subgroups (e.g., urban vs. rural networks) whose patterns differ enough for specialized models to improve accuracy and scalability.
  - the deployment environment varies significantly across subsets.
- **Monitor and retrain**: Ensure adequate monitoring and retraining mechanisms to address model drift over time.

- **Hybrid Approach**: Train a base model and fine-tune on partitions for the best of both worlds.
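
Random forests have no direct fine-tuning step, so the sketch below substitutes a related technique: a base model trained on all data, plus a small per-partition model fitted on the base model's out-of-bag residuals. This is only one possible reading of the hybrid idea. The data is synthetic, and the partition effect (a ±5 offset) is an assumption built into the example, not a property of the Kaggle dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(72)

# Synthetic data: two features plus a partition label (0 = rural, 1 = urban).
n = 600
X = rng.normal(size=(n, 2))
partition = rng.integers(0, 2, size=n)
# Shared signal from the first feature plus a partition-specific offset.
y = 3.0 * X[:, 0] + np.where(partition == 1, 5.0, -5.0) + rng.normal(scale=0.5, size=n)

X_train, X_test = X[:480], X[480:]
y_train, y_test = y[:480], y[480:]
part_train, part_test = partition[:480], partition[480:]

# Step 1: base model on all training rows, without the partition label.
base = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=72)
base.fit(X_train, y_train)

# Out-of-bag predictions give residuals that are not biased by memorized rows.
residuals = y_train - base.oob_prediction_

# Step 2: one shallow corrector per partition, fitted on that partition's residuals.
hybrid_pred = base.predict(X_test)
for p in (0, 1):
    corrector = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=72)
    corrector.fit(X_train[part_train == p], residuals[part_train == p])
    mask = part_test == p
    hybrid_pred[mask] += corrector.predict(X_test[mask])

base_mae = mean_absolute_error(y_test, base.predict(X_test))
hybrid_mae = mean_absolute_error(y_test, hybrid_pred)
print(f"Base MAE: {base_mae:.2f}, hybrid MAE: {hybrid_mae:.2f}")
```

The correction helps here only because the synthetic target really does differ by partition. On data where the subsets behave alike, as the urban and rural results above suggest, the correctors would have little to learn.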

---