# DATA - Advanced Methods of Data Processing 
## Exercise: Data Analytics Mathematics 2 (confidence and using outliers) in Jupyter

Let's carry on with the same ping file and analyze it a bit more.

In [1]:
## Your code here 
print("Exercise by: Janne Bragge")

Exercise by: Janne Bragge


#### Step 1: Read the data 
**Task 1.** Read `../data/google2_ping.txt` into the pandas DataFrame `google2_df`.

In [2]:
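## Task 1:
!pip install ping3

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

# Work on a copy so the original ping log stays untouched.
!cp ../data/google2_ping.txt temp.txt

# Drop the first line and the last four lines (assumed to be the ping header and summary).
with open("temp.txt", "r") as file:
    lines = file.readlines()
with open("temp.txt", "w") as file:
    file.writelines(lines[1:-4])

# Mark timed-out requests as NaN.
!sed -i -e 's/timeout/NaN/g' temp.txt

# Keep only the delay value (ms) on reply lines and a bare NaN on timeout lines.
!sed -i 's/^.*time=\(.*\) ms$/\1/; s/^Request NaN for icmp_seq.*$/NaN/' temp.txt

# np.float_ was removed in NumPy 2.0, so convert with the built-in float instead.
google2_data = !cat temp.txt
google2_list = [float(value) for value in google2_data]
google2_df = pd.DataFrame(google2_list)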
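
The same cleaning can also be done without shell commands. Below is a minimal sketch in plain Python, assuming the log contains reply lines ending in `time=<value> ms` and timeout lines containing the word `timeout`. The sample lines, the address `192.0.2.1`, and the helper name `parse_ping_lines` are hypothetical and only illustrate the idea.

```python
import math
import re

# Hypothetical sample lines; the real file layout is assumed, not confirmed.
sample_lines = [
    "PING google.com (192.0.2.1): 56 data bytes",
    "64 bytes from 192.0.2.1: icmp_seq=0 ttl=117 time=21.3 ms",
    "Request timeout for icmp_seq 1",
    "64 bytes from 192.0.2.1: icmp_seq=2 ttl=117 time=35.8 ms",
]

TIME_PATTERN = re.compile(r"time=([\d.]+) ms")


def parse_ping_lines(lines):
    """Return one delay in ms per reply line and NaN per timeout line."""
    delays = []
    for line in lines:
        match = TIME_PATTERN.search(line)
        if match:
            delays.append(float(match.group(1)))
        elif "timeout" in line:
            delays.append(math.nan)
        # Any other line (header, summary) is skipped.
    return delays


sample_delays = parse_ping_lines(sample_lines)
print(sample_delays)
```

The resulting list could be passed to `pd.DataFrame(...)` in the same way as `google2_list`.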
 

 



#### Step 2: Calculate Google.com availability 
Availability is the fraction of pings that `google.com` answered.

**Task 2.** Calculate google.com availability as `successful_pings / all_pings`.

**Task 3.** Calculate how many minutes per year the servers can be down while still meeting this availability (see https://sre.google/sre-book/availability-table/ for other availability values).

- **Note 1.** Most of the time, missed pings are due to network issues, not Google server unavailability.
- **Note 2.** Round availability to 3 decimals and downtime minutes to 1 decimal.
- **Note 3.** Use np.around() to avoid floating-point rounding problems; see https://numpy.org/doc/stable/reference/generated/numpy.around.html
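
As a rough illustration of how availability maps to allowed downtime, here is a small sketch. It assumes a 365-day year and uses a made-up sample of 99 answered pings and 1 timeout; the function names are hypothetical, and the built-in `round` stands in for `np.around` to keep the example dependency-free.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def availability(delays):
    """Fraction of pings that got a reply (NaN marks a lost ping)."""
    answered = sum(1 for d in delays if not math.isnan(d))
    return answered / len(delays)


def downtime_minutes_per_year(avail):
    """Minutes per year a service may be down at the given availability."""
    return (1 - avail) * MINUTES_PER_YEAR


# Hypothetical sample: 99 answered pings and 1 timeout.
sample = [25.0] * 99 + [math.nan]
sample_availability = round(availability(sample), 3)
print(sample_availability, round(downtime_minutes_per_year(sample_availability), 1))
```

Each extra "nine" of availability cuts the allowed downtime by a factor of ten, for example from about 5256 minutes per year at 99% to about 525.6 minutes at 99.9%.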

In [3]:
## Task 2:
def calculate_availability(response_times):
    """Return the fraction of pings that got a reply (NaN marks a timeout)."""
    total_pings = len(response_times)
    successful_pings = response_times.notna().sum()
    return successful_pings / total_pings

# Column 0 holds the ping delays in ms.
google_availability = np.around(calculate_availability(google2_df[0]), decimals=3)

 

In [4]:
## Task 3:
minutes_per_year = 365 * 24 * 60

# Allowed downtime is the unavailable fraction of the year.
down_minutes_per_year = np.around((1 - google_availability) * minutes_per_year, decimals=1)

 

In [5]:
print("Google.com availability is:", google_availability)
print("This availability means down time [min] / year:", down_minutes_per_year)

Google.com availability is: 0.99
This availability means down time [min] / year: 5256.0


#### Step 3: Confidence level 

But how confident can we be that the availability is really that good, and that the mean value from the previous exercise is accurate? After all, we have only a small number of samples for calculating the availability and the mean.

##### Let's start with mean value confidence:

The confidence of the mean value can be estimated with a `confidence interval` (https://www.mathsisfun.com/data/confidence-interval.html).

**Task 4.** Use the formula below to calculate the 95% confidence interval for the mean ping delay:

$$ \overline{X}_{delay} \pm Z \frac{s}{\sqrt{n}} \qquad  (1)$$

- **Note 1.** n is the number of observations (i.e. the number of ping delay values), $s$ is their sample standard deviation, and $Z$ is the z-score for the chosen confidence level (1.960 for 95%).
- **Note 2.** Round results to 1 decimal.
- **Note 3.** Use np.around() to avoid floating-point rounding problems; see https://numpy.org/doc/stable/reference/generated/numpy.around.html
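
Formula (1) can be sketched with the standard library as follows. The delay values are invented for illustration and the function name is hypothetical. With so few values a t-distribution would normally be a better choice than a fixed Z, so treat this purely as a demonstration of the arithmetic.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.960):
    """Return (low, high) for mean +/- z * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical ping delays in ms.
example_delays = [20.0, 22.0, 24.0, 26.0, 28.0]
low, high = mean_confidence_interval(example_delays)
print(round(low, 1), round(high, 1))
```

The interval narrows in proportion to $1/\sqrt{n}$, so four times as many samples would halve its width for the same standard deviation.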


In [6]:
## Task 4:
import math

# Lost pings (NaN) carry no delay value, so leave them out of the mean.
delays = google2_df[0].dropna()
n = len(delays)

z = 1.960  # z-score for a 95% confidence level

google2_mean = delays.mean()
google2_std = delays.std()  # sample standard deviation (ddof=1)

margin = z * google2_std / math.sqrt(n)
mean_low_limit = np.around(google2_mean - margin, decimals=1)
mean_high_limit = np.around(google2_mean + margin, decimals=1)

 

In [7]:
print("With 95% probability google.com ping time delay is between", mean_low_limit, "and", mean_high_limit, "ms.")

With 95% probability google.com ping time delay is between 20.0 and 35.3 ms.


##### And then availability confidence:
The availability confidence interval can be calculated with the formula below, where $\hat{p}$ is the observed availability and $\hat{q} = 1 - \hat{p}$.

**Task 5.** Use the formula to calculate the lower limit of google.com service availability at 95% confidence:

$$p = \hat{p} \pm Z \sqrt{\frac{\hat{p}\hat{q}}{n}} \qquad (2)$$


- **Note 1.** n is the number of observations (i.e. the number of all pings).
- **Note 2.** Read the Z value from https://www.calculator.net/sample-size-calculator.html
- **Note 3.** Round results to 3 decimals.
- **Note 4.** Use np.around() to avoid floating-point rounding problems; see https://numpy.org/doc/stable/reference/generated/numpy.around.html


**Hint.** You can check your answer with https://www.calculator.net/sample-size-calculator.html
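
A minimal sketch of formula (2), assuming a made-up sample of 99 answered pings out of 100. The function name is hypothetical, and the upper limit is capped at 1.0 because an availability cannot exceed 100%.

```python
import math


def proportion_confidence_interval(successes, n, z=1.960):
    """Return (low, high) for p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    half_width = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - half_width, min(1.0, p_hat + half_width)


# Hypothetical sample: 99 answered pings out of 100.
p_low_example, p_high_example = proportion_confidence_interval(successes=99, n=100)
print(round(p_low_example, 3))
```

This normal approximation becomes unreliable when $\hat{p}$ is very close to 0 or 1 and n is small, which is one reason to collect more pings before trusting the limit.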

In [8]:
## Task 5:
# Rebuild the full series so that timed-out pings (NaN) are counted in n.
all_pings = pd.Series(google2_list)
n_all = len(all_pings)

p_hat = all_pings.notna().sum() / n_all  # observed availability
q_hat = 1 - p_hat                        # observed unavailability

p_low = np.around(p_hat - 1.96 * math.sqrt(p_hat * q_hat / n_all), decimals=3)

 

In [9]:
print("With 95% probability, google.com availability is > ", p_low)

With 95% probability, google.com availability is >  0.97


##### And what if you need higher confidence for availability:
**Task 6.** Rearrange (2) and calculate how many ping samples you need to find out whether google.com availability is > 99.5% with an error margin of 0.5% and a confidence level of 99%.

- **Note** Remember to round the result up to the next integer!
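
Writing the error margin in (2) as $E = Z \sqrt{\hat{p}\hat{q}/n}$ and solving for n gives $n = Z^2 \hat{p}\hat{q} / E^2$. The sketch below applies that rearrangement; the function name is hypothetical and the Z value 2.58 for 99% confidence is taken as given.

```python
import math


def required_samples(p, margin, z):
    """Smallest integer n with z * sqrt(p * (1 - p) / n) <= margin."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


samples_needed = required_samples(p=0.995, margin=0.005, z=2.58)
print(samples_needed)
```

Because the margin enters squared, halving the error margin roughly quadruples the number of pings needed.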


In [10]:
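## Task 6:
z_99 = 2.58       # z-score for a 99% confidence level
p_target = 0.995  # availability to verify
margin = 0.005    # error margin E

# Solving E = Z * sqrt(p * q / n) for n gives n = Z^2 * p * q / E^2.
N = np.ceil(z_99**2 * p_target * (1 - p_target) / margin**2)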

 

In [11]:
print("Data sample count for 99% confidence must be >", N)

Data sample count for 99% confidence must be > 1325.0


### Reflection
Answer the following questions:
1. What is service availability?

Service availability tells how large a share of the time t the system is operational. It is typically given as a percentage or as a decimal fraction.

2. How does the number of samples influence the confidence in system availability?

Typically, a larger number of samples gives more reliable results about system availability: the confidence interval becomes narrower, or for the same error margin a higher confidence level can be reached, e.g. 95% -> 99%. This is because the influence of randomness on the results decreases. In scientific research, especially in economics, a 95% confidence level is commonly accepted.

3. What is a good system availability metric?

A good metric is one that gives a comprehensive picture of the system's availability. In addition to the overall availability used in this exercise, the course materials brought up MTBF (Mean Time Between Failures), i.e. the average time between two consecutive failures (the operating time), and MTTR (Mean Time to Repair), i.e. the time it takes to repair a failure.

What matters is which questions you are trying to answer; that determines which metrics to use.
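
MTBF and MTTR can be combined into an availability figure with the commonly used steady-state relation availability = MTBF / (MTBF + MTTR). The sketch below uses made-up numbers and a hypothetical function name.

```python
def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability from mean uptime and mean repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails on average every 999 hours, repaired in 1 hour.
example_availability = availability_from_mtbf(mtbf_hours=999.0, mttr_hours=1.0)
print(round(example_availability, 3))
```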


### Check your answers by running the following cell:

In [12]:
# Do not change this code!

import sys
sys.path.insert(0, '../answers/data_math_answers/')
from data_math2_check import check_math2


print("Results:")
correct = check_math2(google_availability, down_minutes_per_year, 
         mean_low_limit, mean_high_limit, p_low, N)
print("Correct answers", correct, "/ 6.")


Results:
Correct answers 6 / 6.


### Nice work! 