# Lab - Internet Meter Data Analysis

## Part 1: Collect and Store Data
The goal of this first part of the lab is to gather Internet speed measurements through the Raspberry Pi. Three kinds of measurements will be collected:
1. Ping speed
2. Download speed
3. Upload speed

#### Step 1: Install Speedtest and Import Python Libraries.
In this step, you will install Speedtest and import Python libraries.

Speedtest-cli is a Python script that measures the upload and download speed of your Internet connection. For more information about speedtest-cli, go to https://github.com/sivel/speedtest-cli.

a) Install `speedtest-cli`.

In [47]:
!pip install speedtest-cli

Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple


This command-line tool lets the Jupyter notebook run a speed test against speedtest.net and capture the results so that they can be stored.

b) Import the necessary Python libraries.

In [48]:
# Python library to manage date and time data
import datetime
# Python library to read and write csv files
import csv
# Python library to execute bash commands from the notebook.
import subprocess

#### Step 2: Generate timestamps using the `datetime` package.
In this lab, you will generate measurements of Internet speed statistics. In most data analytics applications, a crucial step in data acquisition is to associate a timestamp with each measurement.

a) To generate a timestamp, use the `datetime.now` function of the `datetime` package: 

In [49]:
date_time = datetime.datetime.now()
print(date_time, type(date_time))

2019-12-17 11:46:51.715212 <class 'datetime.datetime'>


b) An instance of the class `datetime` cannot be written directly as text. The method `strftime` formats the date information into a string. The arguments of this method determine the format of the output string.

In [50]:
date_time.strftime('%a, %d %b %Y %H:%M:%S')

'Tue, 17 Dec 2019 11:46:51'

Format the timestamp as a string with the following layout: YYYY-MM-DD HH:MM:SS.

In [51]:
date_time.strftime("%Y-%m-%d %H:%M:%S")

'2019-12-17 11:46:51'
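As a quick check of the format codes, the sketch below formats a fixed timestamp with `%Y-%m-%d %H:%M:%S` and parses the string back with `strptime`. The fixed value mirrors the timestamp printed earlier and is used only so that the result is reproducible. Note that `%m` is the zero-padded month number, while `%b` is the abbreviated month name.

```python
import datetime

# Fixed timestamp (illustrative) so the result is reproducible
sample = datetime.datetime(2019, 12, 17, 11, 46, 51)

# %m is the month as a number; %b would give the abbreviated month name
fmt = "%Y-%m-%d %H:%M:%S"
as_text = sample.strftime(fmt)
print(as_text)

# strptime is the inverse operation: it parses a string into a datetime
parsed = datetime.datetime.strptime(as_text, fmt)
print(parsed == sample)
```

Because the same format string is used in both directions, the parsed value compares equal to the original timestamp.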

#### Step 3: Run the process and collect the output with Python.

The `speedtest-cli` command, if run from a terminal, returns a string with download and upload speeds. To run the command from this notebook, it is necessary to use the Python module `subprocess`, which allows running a process directly from the notebook code cell. 

a) Run a speed test using the `speedtest-cli` command from Python. The output will be stored in the `process_output` variable.

In [52]:
# This string contains the command line to interface with speedtest.net
speedtest_cmd = "speedtest-cli --simple"
# Execute the process
process = subprocess.Popen(speedtest_cmd.split(), stdout=subprocess.PIPE)
# Collect the command output
process_output = process.communicate()[0]

b) Print the process output. Notice the type for the `process_output` variable.

In [53]:
print(process_output, type(process_output))

b'Ping: 7.818 ms\nDownload: 93.40 Mbit/s\nUpload: 93.73 Mbit/s\n' <class 'bytes'>


c) The speed test result is split, and a timestamp is appended to the results.

In [54]:
# Store the time at which the speedtest was executed
date_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
process_output = process_output.split()
process_output.append(date_time)
print(process_output, type(process_output))

[b'Ping:', b'7.818', b'ms', b'Download:', b'93.40', b'Mbit/s', b'Upload:', b'93.73', b'Mbit/s', '2019-12-17 11:47:28'] <class 'list'>
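The `b'...'` prefixes in the list above show that the elements are still `bytes`. One possible way to obtain plain text and numeric values is to decode the output before splitting it. The sketch below assumes the three-line `--simple` format shown earlier; `parse_simple_output` is a hypothetical helper name, and the sample bytes are copied from the output above.

```python
def parse_simple_output(raw: bytes) -> dict:
    """Turn 'Name: value unit' lines into a {name: float} dictionary."""
    results = {}
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, rest = line.split(":", 1)
        value = rest.split()[0]  # first token after the colon is the number
        results[name.strip()] = float(value)
    return results

# Sample output copied from the speed test run shown above
sample = b'Ping: 7.818 ms\nDownload: 93.40 Mbit/s\nUpload: 93.73 Mbit/s\n'
parsed = parse_simple_output(sample)
print(parsed)
```

Storing decoded strings or floats instead of `bytes` also avoids writing values such as `b'Ping:'` literally into a text file later on.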


d) The `speedtest()` function is created to return the results from the `speedtest-cli` command.

In [55]:
# function to execute the speed test
def speedtest():
    # We need to store the time at which the speedtest was executed
    date_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # This is a string that contains what we would write on the command line 
    #to interface with speedtest.net
    speedtest_cmd = "speedtest-cli --simple"
    # We now execute the process: 
    process = subprocess.Popen(speedtest_cmd.split(), stdout=subprocess.PIPE)
    process_output = process.communicate()[0]
    process_output = process_output.split()
    # and we add the date and time 
    process_output.append(date_time)
    return process_output
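The function above does not check whether the command succeeded or how long it took. As a possible variation, the sketch below wraps `subprocess.run` with an exit-status check and a timeout. The helper name `run_command` and the 120-second limit are assumptions made for illustration, and a stand-in command is used so that the example runs without network access.

```python
import subprocess
import sys

def run_command(cmd, timeout=120):
    """Run cmd (a list of arguments) and return its standard output as str.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the command exceeds `timeout` seconds.
    """
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the output from bytes to str
        check=True,
        timeout=timeout,
    )
    return completed.stdout

# Stand-in command for the demo; in the lab this would be
# "speedtest-cli --simple".split()
demo_cmd = [sys.executable, "-c", "print('Ping: 7.818 ms')"]
output = run_command(demo_cmd)
print(output.split())
```

With this approach a failed or stalled speed test raises an exception instead of silently returning an empty result.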

#### Step 4: Store the output of the `speedtest()` function.

The comma-separated values (csv) format is one of the most common import and export formats for spreadsheets and databases.

a) Create a file named test.txt in the /tmp directory and write "test_msg" in the file.

In [56]:
with open("/tmp/test.txt",'w') as f:
    f.write('test_msg')

b) Use the Linux command `cat` to verify the creation and content of the file.

In [57]:
!cat /tmp/test.txt

test_msg

c) To check that the file was written successfully, open it in read mode and print its content:

In [58]:
with open("/tmp/test.txt", 'r') as f:
    content = f.read()
print(content)

test_msg


To write into a `csv` file, it is necessary to create a `csv.writer` object. 

In [59]:
# function to save data to csv
def save_to_csv(data, filename):
    # Mode 'a' appends a new line with the results of the current
    # experiment and creates the file first if it does not exist yet.
    # newline='' lets the csv module control the line endings.
    with open(filename + '.csv', 'a', newline='') as f:
        wr = csv.writer(f)
        wr.writerow(data)
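Note that opening a file in append mode (`'a'`) creates the file when it is missing, so the same `open` call works for the first row and for every later one. The following sketch illustrates this with a temporary file; the path and the shortened rows are illustrative only.

```python
import csv
import os
import tempfile

# Temporary location (illustrative) so the sketch does not touch lab files
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

rows = [
    ["Ping:", "7.818", "ms", "2019-12-17 11:47:28"],
    ["Ping:", "7.47", "ms", "2019-12-17 11:47:44"],
]

for row in rows:
    # 'a' creates the file on the first pass and appends afterwards
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

# Read the file back to confirm that both rows were stored
with open(path, newline="") as f:
    stored = list(csv.reader(f))
print(stored)
```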

#### Step 5: Check the collected data.
Write a function to open a csv file and print its content to the screen.

In [60]:
def print_from_csv(filename):
    with open(filename + '.csv', 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)
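`csv.reader` returns every field as a string, so numeric values need an explicit conversion before any arithmetic. The sketch below assumes rows that hold plain text values in the ten-column layout used in this lab and computes the average download speed. The two sample rows reuse measurements shown earlier and are read from an in-memory buffer rather than from a file.

```python
import csv
import io

# Two sample rows in the ten-column layout used in this lab
sample_csv = (
    "Ping:,7.818,ms,Download:,93.40,Mbit/s,Upload:,93.73,Mbit/s,2019-12-17 11:47:28\n"
    "Ping:,7.47,ms,Download:,93.52,Mbit/s,Upload:,93.96,Mbit/s,2019-12-17 11:47:44\n"
)

downloads = []
for row in csv.reader(io.StringIO(sample_csv)):
    # Columns 1, 4 and 7 hold the ping, download and upload values
    downloads.append(float(row[4]))

average = sum(downloads) / len(downloads)
print(downloads, round(average, 2))
```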

#### Step 6: Run the Speedtest multiple times and store the data.
a) Write a `for` loop that calls the speedtest 5 times, prints the output of the tests, and stores the data in a csv file.

In [61]:
for i in range(5):
    speedtest_output = speedtest()
    print('Test number {}'.format(i))
    print(speedtest_output)
    save_to_csv(speedtest_output, '/tmp/rpi_data_test')

Test number 0
[b'Ping:', b'7.47', b'ms', b'Download:', b'93.52', b'Mbit/s', b'Upload:', b'93.96', b'Mbit/s', '2019-12-17 11:47:44']
Test number 1
[b'Ping:', b'8.303', b'ms', b'Download:', b'93.68', b'Mbit/s', b'Upload:', b'93.90', b'Mbit/s', '2019-12-17 11:48:10']
Test number 2
[b'Ping:', b'7.347', b'ms', b'Download:', b'93.29', b'Mbit/s', b'Upload:', b'94.05', b'Mbit/s', '2019-12-17 11:48:36']
Test number 3
[b'Ping:', b'7.575', b'ms', b'Download:', b'93.35', b'Mbit/s', b'Upload:', b'93.88', b'Mbit/s', '2019-12-17 11:49:02']
Test number 4
[b'Ping:', b'7.521', b'ms', b'Download:', b'93.55', b'Mbit/s', b'Upload:', b'93.71', b'Mbit/s', '2019-12-17 11:49:28']


b) Display the file to verify that the data has been saved correctly.

In [62]:
print_from_csv('/tmp/rpi_data_test')

["b'Ping:'", "b'13.23'", "b'ms'", "b'Download:'", "b'93.59'", "b'Mbit/s'", "b'Upload:'", "b'93.90'", "b'Mbit/s'", '2019-12-17 11:35:36']
["b'Ping:'", "b'7.416'", "b'ms'", "b'Download:'", "b'93.42'", "b'Mbit/s'", "b'Upload:'", "b'93.95'", "b'Mbit/s'", '2019-12-17 11:36:03']
["b'Ping:'", "b'7.474'", "b'ms'", "b'Download:'", "b'93.51'", "b'Mbit/s'", "b'Upload:'", "b'93.80'", "b'Mbit/s'", '2019-12-17 11:36:29']
["b'Ping:'", "b'7.457'", "b'ms'", "b'Download:'", "b'93.23'", "b'Mbit/s'", "b'Upload:'", "b'93.99'", "b'Mbit/s'", '2019-12-17 11:36:55']
["b'Ping:'", "b'6.987'", "b'ms'", "b'Download:'", "b'93.55'", "b'Mbit/s'", "b'Upload:'", "b'93.99'", "b'Mbit/s'", '2019-12-17 11:37:21']
["b'Ping:'", "b'7.47'", "b'ms'", "b'Download:'", "b'93.52'", "b'Mbit/s'", "b'Upload:'", "b'93.96'", "b'Mbit/s'", '2019-12-17 11:47:44']
["b'Ping:'", "b'8.303'", "b'ms'", "b'Download:'", "b'93.68'", "b'Mbit/s'", "b'Upload:'", "b'93.90'", "b'Mbit/s'", '2019-12-17 11:48:10']
["b'Ping:'", "b'7.347'", "b'ms'", "b'Downl

## Part 2: Manipulate Data

The Python library `pandas` is very useful for working with structured data.

#### Step 1: Import the Python libraries.

Import `pandas` and the other libraries used for the next tasks.

In [65]:
import datetime
import csv
import pandas as pd
import numpy as np

#### Step 2: Load the `csv` file into a `DataFrame` object using `pandas`.

A `pandas DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. The pandas library function `read_csv` automatically converts a `csv` file into a `DataFrame` object.

This function accepts many parameters. The only required one is `filepath_or_buffer`, i.e. the location of the `csv` file or an already opened file object. All the other parameters are optional.

In this step, you will import and view the content of the csv file, `rpi_data_long.csv`. This csv file is located in the same directory as this Jupyter notebook.

a) Assign the file `rpi_data_long.csv` to the variable `data_file`.

In [66]:
data_file = './rpi_data_long.csv'

b) Use the Linux command `head` to view the first 5 lines of the csv file.

In [67]:
!head -n 5 ./rpi_data_long.csv

Ping:,26.992,ms,Download:,91.80,Mbit/s,Upload:,14.31,Mbit/s,2016-11-24 13:36:25
Ping:,24.532,ms,Download:,88.19,Mbit/s,Upload:,14.12,Mbit/s,2016-11-24 13:36:55
Ping:,20.225,ms,Download:,59.86,Mbit/s,Upload:,14.11,Mbit/s,2016-11-24 13:37:25
Ping:,19.332,ms,Download:,91.81,Mbit/s,Upload:,14.22,Mbit/s,2016-11-24 13:37:57
Ping:,22.494,ms,Download:,92.05,Mbit/s,Upload:,14.08,Mbit/s,2016-11-24 13:38:27


c) Define the list of column names that will be passed to the `names` parameter of the `read_csv` function.

In [68]:
column_names = [ 'Type A', 'Measure A', 'Units A',
                 'Type B', 'Measure B', 'Units B',
                 'Type C', 'Measure C', 'Units C', 
                 'Datetime']

d) Use the `read_csv` function to read from `data_file` and assign `column_names` as the column names in the dataframe.

In [69]:
with open(data_file, 'r') as f:
    df_redundant = pd.read_csv(f, names=column_names)

e) The method `head()` displays the first few rows of the dataframe.

In [70]:
# You can pass the number of rows to display, e.g. head(3); the default is 5
df_redundant.head()

| | Type A | Measure A | Units A | Type B | Measure B | Units B | Type C | Measure C | Units C | Datetime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ping: | 26.992 | ms | Download: | 91.8 | Mbit/s | Upload: | 14.31 | Mbit/s | 2016-11-24 13:36:25 |
| 1 | Ping: | 24.532 | ms | Download: | 88.19 | Mbit/s | Upload: | 14.12 | Mbit/s | 2016-11-24 13:36:55 |
| 2 | Ping: | 20.225 | ms | Download: | 59.86 | Mbit/s | Upload: | 14.11 | Mbit/s | 2016-11-24 13:37:25 |
| 3 | Ping: | 19.332 | ms | Download: | 91.81 | Mbit/s | Upload: | 14.22 | Mbit/s | 2016-11-24 13:37:57 |
| 4 | Ping: | 22.494 | ms | Download: | 92.05 | Mbit/s | Upload: | 14.08 | Mbit/s | 2016-11-24 13:38:27 |
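Besides `names`, `read_csv` has other optional parameters that can shorten later clean-up steps. As a rough sketch, the example below uses `usecols` to keep only the numeric and timestamp columns and `parse_dates` to convert the timestamp while loading. The short column labels are hypothetical, and the two rows are copied from the file preview above and read from an in-memory buffer.

```python
import io
import pandas as pd

# Two sample rows in the same layout as the lab's csv file
sample_csv = io.StringIO(
    "Ping:,26.992,ms,Download:,91.80,Mbit/s,Upload:,14.31,Mbit/s,2016-11-24 13:36:25\n"
    "Ping:,24.532,ms,Download:,88.19,Mbit/s,Upload:,14.12,Mbit/s,2016-11-24 13:36:55\n"
)

# Hypothetical column labels, chosen only for this sketch
labels = ["t1", "ping", "u1", "t2", "download", "u2", "t3", "upload", "u3", "when"]

df = pd.read_csv(
    sample_csv,
    names=labels,                                    # the file has no header row
    usecols=["ping", "download", "upload", "when"],  # keep only these columns
    parse_dates=["when"],                            # convert to datetime64
)
print(df.dtypes)
```

This is an alternative to the rename-and-drop approach used in this lab, not a replacement for it; the lab keeps all columns first so that each transformation can be inspected.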


#### Step 3: Create a concise representation.
In this step, you will create a more compact representation using a copy of the data frame `df_redundant`.

a) Copy `df_redundant` into another dataframe called `df_compact` using `copy()`.

In [71]:
df_compact = df_redundant.copy()

b) Rename the measurement columns as shown:

    Measure A -> Ping (ms)
    Measure B -> Download (Mbit/s)
    Measure C -> Upload (Mbit/s)

In [72]:
df_compact.rename(columns={'Measure A':'Ping (ms)', 
                           'Measure B': 'Download (Mbit/s)',
                           'Measure C': 'Upload (Mbit/s)'}, inplace=True)
df_compact.head(3)

| | Type A | Ping (ms) | Units A | Type B | Download (Mbit/s) | Units B | Type C | Upload (Mbit/s) | Units C | Datetime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ping: | 26.992 | ms | Download: | 91.8 | Mbit/s | Upload: | 14.31 | Mbit/s | 2016-11-24 13:36:25 |
| 1 | Ping: | 24.532 | ms | Download: | 88.19 | Mbit/s | Upload: | 14.12 | Mbit/s | 2016-11-24 13:36:55 |
| 2 | Ping: | 20.225 | ms | Download: | 59.86 | Mbit/s | Upload: | 14.11 | Mbit/s | 2016-11-24 13:37:25 |


c) Because the Types and Units columns are no longer necessary, these columns can be dropped.

In [73]:
df_compact.drop(['Type A', 'Type B', 'Type C',
         'Units A', 'Units B', 'Units C'], axis=1, inplace=True)
df_compact.head()

| | Ping (ms) | Download (Mbit/s) | Upload (Mbit/s) | Datetime |
|---|---|---|---|---|
| 0 | 26.992 | 91.8 | 14.31 | 2016-11-24 13:36:25 |
| 1 | 24.532 | 88.19 | 14.12 | 2016-11-24 13:36:55 |
| 2 | 20.225 | 59.86 | 14.11 | 2016-11-24 13:37:25 |
| 3 | 19.332 | 91.81 | 14.22 | 2016-11-24 13:37:57 |
| 4 | 22.494 | 92.05 | 14.08 | 2016-11-24 13:38:27 |
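The cells in this step modify `df_compact` in place with `inplace=True`. An equivalent style, sketched below on a small stand-in dataframe, chains `rename` and `drop` so that each call returns a new dataframe and the original stays untouched. The three-column frame is a reduced illustration built from values in the table above.

```python
import pandas as pd

# Small stand-in for df_redundant (values taken from the table above)
df = pd.DataFrame({
    "Type A": ["Ping:", "Ping:"],
    "Measure A": [26.992, 24.532],
    "Units A": ["ms", "ms"],
})

# Each call returns a new DataFrame, so the original is not modified
compact = (
    df.rename(columns={"Measure A": "Ping (ms)"})
      .drop(columns=["Type A", "Units A"])
)
print(compact)
print(list(df.columns))
```

Which style to use is largely a matter of preference; the chained form makes it easy to keep the unmodified data available for comparison.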


In the table above, the `Datetime` field is a string. Pandas and Python offer a number of operations to work with date and time that can be very helpful.

In the next step, the string in the `Datetime` column will be separated into two new columns.

#### Step 4: Separate data into two columns.
In this step, you will use pandas to generate the columns `Date` and `Time` from the column `Datetime` and then drop the `Datetime` column.

The `lambda` keyword is used to create anonymous functions that convert each `Datetime` string into a `datetime` object, from which the date and the time can be extracted. The `pandas` method `apply` applies such a function to an entire column (in practice, `apply` implicitly defines a `for` loop and passes the values of the column one by one to the `lambda` function). Store the results in two new columns of the `DataFrame`.

a) Use `apply` with a `lambda` function to extract the date from each value in the `Datetime` column.

In [74]:
df_compact['Date'] = df_compact['Datetime'].apply(lambda dt_str: pd.to_datetime(dt_str).date())

b) Repeat step a) to extract the time from the `Datetime` column.

In [75]:
temp = df_compact['Datetime'].apply(lambda dt_str: pd.to_datetime(dt_str))
df_compact['Time'] = temp.dt.time

c) All the information in the `Datetime` column has now been copied to the `Date` and `Time` columns, so the `Datetime` column serves no purpose and can be dropped from the data frame.

Enter the code to drop the `Datetime` column in the cell below.

In [76]:
df_compact.drop(['Datetime'], axis=1, inplace=True)

Enter the code to print out the first 3 rows of the data frame to verify the changes.

In [77]:
df_compact.head(3)

| | Ping (ms) | Download (Mbit/s) | Upload (Mbit/s) | Date | Time |
|---|---|---|---|---|---|
| 0 | 26.992 | 91.8 | 14.31 | 2016-11-24 | 13:36:25 |
| 1 | 24.532 | 88.19 | 14.12 | 2016-11-24 | 13:36:55 |
| 2 | 20.225 | 59.86 | 14.11 | 2016-11-24 | 13:37:25 |


d) Use the `type` function to print out the variable type of the values in the `Date` and `Time` columns.

In [78]:
print(df_compact['Date'][0], type(df_compact['Date'][0]) )
print(df_compact['Time'][0], type(df_compact['Time'][0]) )

2016-11-24 <class 'datetime.date'>
13:36:25 <class 'datetime.time'>
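The `apply` calls in this step convert one value at a time. As an alternative sketch, `pd.to_datetime` can convert the whole column in a single call, after which the `.dt` accessor provides the date and time parts. The two-row dataframe below is a stand-in that reuses timestamps from the lab data.

```python
import pandas as pd

# Stand-in dataframe with two timestamps from the lab data
df = pd.DataFrame({"Datetime": ["2016-11-24 13:36:25", "2016-11-24 13:36:55"]})

# Convert the whole column in one call instead of element by element
parsed = pd.to_datetime(df["Datetime"], format="%Y-%m-%d %H:%M:%S")

# The .dt accessor exposes the date and time parts of a datetime column
df["Date"] = parsed.dt.date
df["Time"] = parsed.dt.time
print(df[["Date", "Time"]])
```

The resulting values are `datetime.date` and `datetime.time` objects, the same types printed above.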


#### Step 5: Save the new data frame.
Save the pandas dataframe `df_compact` as a csv file called `rpi_data_compact.csv`:

In [79]:
df_compact.to_csv('./rpi_data_compact.csv')
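By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below compares the default with `index=False` using in-memory buffers and a two-row stand-in dataframe; whether to keep the index depends on how the file will be used afterwards.

```python
import io
import pandas as pd

# Two-row stand-in for df_compact
df = pd.DataFrame({"Ping (ms)": [26.992, 24.532],
                   "Download (Mbit/s)": [91.80, 88.19]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```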

<font size='0.5'>&copy; 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.</font>