# Cloud Reachability and Latency Forecasting with RIPE Atlas
This project is accesible at: https://github.com/rita-imdea/ripe-userguide

## II) Retrieving data from RIPE Atlas

In this notebook, we are going to see how to retrieve measurements fron RIPE Atlas and also how to parse measurements from JSON files (which have been pre-downloaded from RIPE Atlas). 
We recommend the reader to first take a look at the PDF document that contains the guide for the whole course. 

------------------------------
To complete:
!!!!!!!!!!!!!!!!
This notebook is divided in XXX parts.

In [37]:
# Give the Atlas api key an easy to remember variable name 
ATLAS_API_KEY = " "

In [38]:
import requests # to create http requests from Python
import json # Library to write and ready JSON files in PYthon

### A) ACCESSING PUBLICLY AVAILABLE MEASUREMENTS

Retrieving data from the RIPE Atlas database in Python is very simple. 
One only needs to know the ID that indentifies the measurement. Then, it is enough to run the following code.
The id provided is a ramdom id measurement that you can substitute with your measurement's ids if you create customized measurements. 

First, we set the IDs of the measurements we want to retrieve.

In [39]:
# Set the measurement IDs you want to retrieve
measurement_ids = ["61142052"] # example of measurement ID


Now, for each measurement, we save the data in a json file.

In [40]:

# Loop through the measurement IDs and retrieve the JSON files
for measurement_id in measurement_ids:
        url = f"https://atlas.ripe.net/api/v2/measurements/{measurement_id}/results/?format=json"
        response = requests.get(url, headers=headers)

        # Check if the request was successful
        if response.status_code == 200:
                json_data = response.json()
                
                measurement_file = f"RIPE-Atlas-measurement-{measurement_id}.json" # Name of the JSON file where the data will be stored. 
                
                # Write the JSON data to a file
                with open(measurement_file, "w") as f:
                        json.dump(json_data, f, indent=4)
                        
        else:
                print(f"Failed to retrieve measurement {measurement_id}. Error code: {response.status_code}")


### B) PARSING YOUR MEASUREMENTS

Next, we are going to get the data from the JSON files (which can be already present in the course directory or could be obtained with the above code). The next steps are:

1. Initially, we parse the JSON files and store the data in a Python structure. 
2. We clean the data

-----------------------------------------

!!!!!!!!!!!!!!!! To complete

3. Finally you can plot the probability distribution to see how the data is spread 

First, we set the IDs of the measurements we are interested in. 
In this case, the IDs provided below refer to the measurements taken for this course. 

In [52]:
measurement_ids = ["48819905", "48819906", "48819907", "48819908", "48819909", "48819910"] # example of measurement ID


### a) Reading JSON files in Python

Next, we use the Python library "json", which allows us to ready and write JSON files with Python.
Furthermore, we use two of the most used packages for data processing in PYthon: Pandas and Numpy. 

1. Pandas is a Python library that provides a data structure called a DataFrame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which is a structure that facilitates data processing and manipulation. We shall use it to parse our data because it provides a number of useful functions for manipulation and visualization of data.

2. Numpy is a Python library that facilitates mathematical operations, in particular for arrays and matrices. 

In [53]:
# We first import the necessary libraries 
import pandas as pd
import numpy  as np

For each one of the measurements, we read the JSON file and store it in a pandas DataFrame (df) structure.

Then, we create a list of DataFrames (df) with all the measurements. 

In [43]:
# Loop through the measurement IDs experiment files and create a DataFrame for each
dfs = []
for measurement_id in measurement_ids:
    # Read the JSON data from the file
    with open(f"RIPE-Atlas-measurement-{measurement_id}.json", "r") as f:
        json_data = json.load(f)
    
    # Normalize the JSON data into a pandas DataFrame
    df = pd.json_normalize(json_data)
    
    # Append the DataFrame to the list of DataFrames
    dfs.append(df)


We would like to have all the DataFrames together to analyze all the measurements at the same time. Thus, we concatenate the list of dataframes

In [49]:
# Concatenate all the DataFrames into a single one
result_df = pd.concat(dfs, ignore_index=True)

We can visualize the current data structure thanks to the pandas function "head", which prints the first 5 rows of the DataFrame with the value of all the columns.

In [50]:
# Print the resulting DataFrame
result_df.head()

Unnamed: 0,fw,mver,lts,dst_name,af,dst_addr,src_addr,proto,ttl,size,...,prb_id,timestamp,msm_name,from,type,group_id,step,stored_timestamp,endtime,paris_id
0,5040,2.4.1,14,52.46.72.50,4,52.46.72.50,10.18.246.209,ICMP,234.0,64,...,1003454,1673864005,Ping,51.15.99.8,ping,48819905,300.0,1673864050,,
1,5040,2.4.1,20,52.46.72.50,4,52.46.72.50,10.109.0.30,ICMP,233.0,64,...,1003747,1673864003,Ping,45.137.88.145,ping,48819905,300.0,1673864114,,
2,5080,2.6.2,96,52.46.72.50,4,52.46.72.50,192.168.250.65,ICMP,233.0,64,...,20757,1673863991,Ping,82.116.160.225,ping,48819905,300.0,1673864089,,
3,5080,2.6.2,12,52.46.72.50,4,52.46.72.50,192.168.1.38,ICMP,234.0,64,...,53229,1673864019,Ping,83.54.157.101,ping,48819905,300.0,1673864090,,
4,5080,2.6.2,33,52.46.72.50,4,52.46.72.50,192.168.0.101,ICMP,232.0,64,...,51381,1673864074,Ping,102.34.0.4,ping,48819905,300.0,1673864205,,


In [51]:
result_df['avg']

0         14.612530
1         24.125096
2         22.542065
3         26.183208
4        171.458986
            ...    
17427           NaN
17428           NaN
17429           NaN
17430           NaN
17431           NaN
Name: avg, Length: 17432, dtype: float64

### b) Data cleaning and pre-processing

The following step is to clean the data collection. 

For that, we should first know that the RIPE Atlas measurements may include negative values for latency (-1.0), which represent samples for which it was not possible to store the value (Due to, for example, timeouts in the netowrk protocols). 

To avoid that these values impact the later analysis, we remove those samples. We first transform the negative values into Not-a-Number values (NaN), and we make use of the function "dropna" from Pandas library to remove those samples.  

In [46]:
# Clean the data 

# Change them to NaN
result_df['avg'].replace(-1.0, np.nan, inplace=True)

# Remove Null values 
result_df = result_df.dropna(how='any',axis=0) 

We further pre-process the data to facilitate the plotting, visualization and readability of the data.

For that, we change the probe ID with an acronym that represents the country where the probe is located, plus an index in case there are several probes in the same country. 

In [47]:
# Renaming the probe_id column for easy plotting 
nprb_id = []

for value in df["prb_id"]:
    if value == 1004991:
        nprb_id.append('es1')
   
        
        
result_df["nprb_id"] = nprb_id
result_df

Unnamed: 0,fw,mver,lts,dst_name,af,dst_addr,src_addr,proto,ttl,size,...,timestamp,msm_name,from,type,group_id,step,stored_timestamp,endtime,paris_id,nprb_id
0,,,,,,,,,,,...,,,,,,,,,,es1
1,,,,,,,,,,,...,,,,,,,,,,es1
2,,,,,,,,,,,...,,,,,,,,,,es1
3,,,,,,,,,,,...,,,,,,,,,,es1
4,,,,,,,,,,,...,,,,,,,,,,es1
5,,,,,,,,,,,...,,,,,,,,,,es1
6,,,,,,,,,,,...,,,,,,,,,,es1
7,,,,,,,,,,,...,,,,,,,,,,es1
8,,,,,,,,,,,...,,,,,,,,,,es1
9,,,,,,,,,,,...,,,,,,,,,,es1


In [None]:
# Changing the time column from epoch to date time format for time series processing 
new_timestamp = []

for i in result_df['timestamp']:
    my_datetime = datetime.fromtimestamp(i)
    new_timestamp.append(my_datetime)

df = result_df.copy()
df['new_time'] = new_timestamp
df.head()

In [None]:
country_name = []

for value in df["nprb_id"]:
    if (value == 'es1') or (value == 'es2') or(value == 'es3') or( value == 'es4'):
        country_name.append('Spain')
    if(value == 'nl1') or (value == 'nl2')or (value == 'nl3'):
        country_name.append('Netherlands')
    if(value == 'pt1') or (value == 'pt2'):
        country_name.append('Portugal')
    if(value == 'us1') or (value == 'us2'):
        country_name.append('USA')
    if(value == 'ug1'):
        country_name.append('Uganda')
        
df['country_name'] = country_name

In [None]:
# Plotting the probability distribution
# checking cdf for each of the probes
probes = ['es1','es2','es3','es4','nl1','nl3','pt2','ug1','us1','us2']

for probe in probes:
    df_cdf = df[(df['nprb_id'] == probe)]
    axx = df_cdf['avg'].hist(cumulative=True, density=True, bins=100, alpha = 0.3)
axx.legend(probes)

In [None]:
#checking pdf for each of the probes
countries = df['country_name'].unique()

for country in countries:
    df_pdf = df[(df['country_name'] == country)]
    axx = df_pdf['avg'].hist(density=True, bins=100, alpha = 0.3)

axx.legend(countries)

#### C) ANALYZING YOUR MEASUREMENTS 

1. First you do some feature engineering ie add features that may be missing but could be important like distance and probe status 
2. Check how latency varies over time, how mean and standard deviation vary over distance or any other interesting scenarios you can come up with 
3. Finally you can do some predictions based using established mathematical models or machine learning models and see what gives you best results.  

In [None]:
# Feature Engineering 
# Collect the source probe information 

from ripe.atlas.cousteau import Probe  

probe_id_list = []

#extract the probe cordinates from ripe atlas 
probe_coordinates = []
probe_country = []

for id_i in probe_id_list:
    probe = Probe(id=id_i) # Obtains all metadata of probe id_i
    print(probe.geometry) #probe.geometry is a GeoJSON https://en.wikipedia.org/wiki/GeoJSON
    probe_coordinates.append(probe.geometry['coordinates']) # saving to the list
    probe_country.append(probe.country_code)

longitude = []
latitude = []

for i in probe_coordinates:
    longitude.append(i[0])
    latitude.append(i[1])

# create a probe metadata dataframe
srcprobes_df = pd.DataFrame({'prb_id': probe_id_list,'longitude': longitude, 'latitude': latitude,'probe_country': probe_country})
srcprobes_df

In [None]:
# Collect the destination probe information 

probe_id_list = []

#extract the probe cordinates from ripe atlas 
probe_coordinates = []
probe_country = []

for id_i in probe_id_list:
    probe = Probe(id=id_i) 
    print(probe.geometry) 
    probe_coordinates.append(probe.geometry['coordinates']) # saving to the list
    probe_country.append(probe.country_code)

longitude = []
latitude = []

for i in probe_coordinates:
    longitude.append(i[0])
    latitude.append(i[1])

# create a probe metadata dataframe
dstprobes_df = pd.DataFrame({'prb_id': probe_id_list,'longitude': longitude, 'latitude': latitude,'probe_country': probe_country})
dstprobes_df

In [None]:
#calculate the distances between all possible source probe and destination probe pairs 
from geopy.distance import distance
from itertools import product

# Create an empty list to store the distances
data = []

# Iterate over each combination of source and destination probes
for source_row, dest_row in product(srcprobes_df.iterrows(), dstprobes_df.iterrows()):
    source_row = source_row[1]  # Get the row data from the iterator
    dest_row = dest_row[1]  # Get the row data from the iterator
    
    source_coordinates = (source_row['latitude'], source_row['longitude'])
    dest_coordinates = (dest_row['latitude'], dest_row['longitude'])
    
    distance_km = distance(source_coordinates, dest_coordinates).km
    # Append data to the list
    data.append({
        'source_prb_id': source_row['prb_id'],
        'source_longitude': source_row['longitude'],
        'source_latitude': source_row['latitude'],
        'destination_prb_id': dest_row['prb_id'],
        'destination_longitude': dest_row['longitude'],
        'destination_latitude': dest_row['latitude'],
        'distance': distance_km
    })

# Create a new dataframe from the data list
distance_df = pd.DataFrame(data)

# Print the new dataframe
distance_df.head()

In [None]:
# Create a map from the distance dataframe and and use it to map distances to all probes and all destinations 

# Create a dictionary mapping (source_prb_id, destination_prb_id) to distance
distance_map = {(int(row['source_prb_id']), int(row['destination_prb_id'])): row['distance'] for _, row in distance_df.iterrows()}


# Map the distance values to the existing DataFrame based on (source_prb_id, destination_prb_id)
df['distance'] = df.apply(lambda row: distance_map.get((int(row['prb_id']), int(row['dst_id']))), axis=1)

# Display the updated DataFrame
distance_map

In [None]:
# Obtaining the probe status 
# you have to download the connection log json files manually 

probe_id_list = []

# Dictionary to store combined json files with probe status 
combined_dict = []

# Iterate over servers
for server in probe_id_list:
    server_name = str(server)
    
    # Read JSON log file
    log_file = server_name + '.json'
    
    results = process_data(log_file, server)
        
    # Append to combined dictionary
    combined_dict.append(results)

In [None]:
# Convert the list of dictionaries into a more accessible format
uptime_dict = {}
for d in combined_dict:
    for server_id, uptime_ranges in d.items():
        uptime_dict[server_id] = uptime_ranges
#uptime_dict

In [None]:
# Function to check if the timestamp is within the server's uptime ranges
def is_probe_up(probe_id, timestamp):
    if probe_id in uptime_dict:
        for uptime_range in uptime_dict[probe_id]:
            if uptime_range['to'] is None:
                if uptime_range['from'] <= timestamp:
                    return "connected"
            else:
                if uptime_range['from'] <= timestamp <= uptime_range['to']:
                    return "connected"
    return "disconnected"

# Iterate through the DataFrame and check if the timestamp occurred during the server's uptime
df['probe_status'] = df.apply(lambda row: is_probe_up(row['prb_id'], row['timestamp']), axis=1)


In [None]:
#checking how mean and standard deviation vary over time 

grouped_data = df.groupby(['prb_id', 'dst_id'])
mean = grouped_data['avg_rtt'].mean()
std = grouped_data['avg_rtt'].std()
distance = grouped_data['distance'].unique()

fig, ax = plt.subplots()
plt.scatter(distance, mean, label='Mean',color="BLUE")
plt.ylabel('Mean')
plt.legend()
plt.title('Mean and Standard Deviation vs. Distance')

fig2, ax2 = plt.subplots()
plt.errorbar(distance, mean, yerr=std, fmt='o',label='Standard Deviation',color="GREEN")

plt.xlabel('Distance')
plt.ylabel('Standard Deviation')
plt.legend()

plt.xticks(rotation=45)
plt.tick_params(axis='x', which='both', bottom=True)
plt.show()

In [None]:
# Applying some simple forecasting methods

# Using the naive forecast
df = df.assign(naive=df['rtt'].shift(1))
# Replace NaN at top of value column with 0
df['naive'] = df['naive'].fillna(method='ffill').fillna(0)

# Testing the prediction accuracy for naive forecast
se = (df['rtt'] - df['naive']) ** 2
9 mse_naive = se.mean()
10 mse_naive


In [None]:
# Exponential smoothing method
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(df_sktime['avg']).fit()
df['Simple-smoothing'] = SimpleExpSmoothing(df['avg']).fit().fittedvalues
df[['avg','Simple-smoothing']].plot(title='Exponential Smoothing')

In [None]:
# Using the decision tree 
# randomising the test and train data 
import itertools
import random

test_indices = []
train_indices = []
        
array1 = df2['nprb_id'].unique()
array2 = df2['dst_addr'].unique()

# Creating all possible pairs
pairs = list(itertools.product(array1, array2))

# Randomly selecting 10 pairs
selected_pairs = random.sample(pairs, 10)

# Removing selected pairs from the original list
for pair in selected_pairs:
    pairs.remove(pair)

# Creating separate lists
selected_list = selected_pairs
remaining_list = pairs

train_dfs = []
for i,k in remaining_list:
    temp_df = df2.loc[(df2['nprb_id'] == i) & (df2['dst_addr'] == k)]
            
    # Append the piece to the selected data
    train_dfs.append(temp_df)

train_df = pd.concat(train_dfs)
        
test_dfs = []
for i,k in selected_list:
    temp_df = df2.loc[(df2['nprb_id'] == i) & (df2['dst_addr'] == k)]
            
    # Append the piece to the selected data
    test_dfs.append(temp_df)

test_df = pd.concat(train_dfs)
        

In [None]:
# Select your features and target
X_train = train_df['normalizzed_distance'].values.reshape(-1,1)
y_train = train_df['normalizzed_avg'].values

X_test = test_df['normalizzed_distance'].values.reshape(-1,1)
y_test = test_df['normalizzed_avg'].values

# Import the Machine learning libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Create a Decision Tree Regressor
model = DecisionTreeRegressor()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

In [None]:
# Viewing the Decision Tree
from sklearn.tree import export_graphviz
tree_dot = export_graphviz(reg_tree,feature_names =["distance"],out_file=None,rounded=True, filled=True)

# Visualize the tree using Graphviz
graph = graphviz.Source(tree_dot)
graph

In [None]:
# LSTM Modelling 
import tensorflow as tf
from tensorflow. keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from sklearn.metrics import mean_squared_error

model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(steps,features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

model.fit(Xtrain, y_train, epochs=20, verbose=0)
y_pred = model.predict(Xtest)

mse = mean_squared_error(y_pred,y_test[0:len(y_pred)])
mse

#### D ) LATENCY Vs DISTANCE 
Sample code below helps us see how latency would change depending on the distance and when/if you added servers to deliver the customer services 

In [None]:
# References 

# Install Jupyter-MATLAB
# https://am111.readthedocs.io/en/latest/jmatlab_install.html
# Calling user-defined MATLAB functions from Python
# https://www.mathworks.com/help/matlab/matlab_external/call-user-script-and-function-from-python.html


# Check if python version is 64bit or 32bit
# Then download the corresponding MATLAB version 
# import sys
# print(sys.maxsize > 2**32)

# MATLAB-side configuration
# First Install MATLAB from this website https://www.mathworks.com/
# Get the matlab root directory by running <matlabroot> in the MATLAB command window
# Add the matlabroot/bin to the system path
# export PATH="/Applications/MATLAB_R2019b.app/bin:$PATH"

# SETUP MATLAB ENGINE API FOR PYTHON
# cd /usr/local/MATLAB/R2018a/extern/engines/python -  change to your matlab version
# python setup.py install // change setup tools if this fails pip install setuptools==58.2.0

# JUPYTER SIDE CONFIGURATION 
# python -m matlab_kernel install --user //this adds matlab to the jupyter kernels list
# jupyter kernelspec list //check if matlab is in the list
# pip install matlab.engine

import matlab.engine

# Start a MATLAB session
eng = matlab.engine.start_matlab()

In [2]:
from oct2py import Oct2Py



In [6]:
oc = Oct2Py()


Oct2PyError: octave not found, please see README

In [None]:

script = "function y = myScript(x)\n" \
         "    y = x-5" \
         "end"

with open("myScript.m","w+") as f:
    f.write(script)

oc.PoA_student_workshop(nargout=0)

In [None]:
#call the matlab simulation 
eng.PoA_student_workshop(nargout=0)