<a href="https://colab.research.google.com/github/w4bo/AA2425-unibo-mldm/blob/master/slides/lab-00-introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 00 - Introduction

- **It is strongly recommended to read the documentation of the functions and methods used!**
    - Usually, it is sufficient to search on Google for the name of the function (and, if needed, the name of the library used)
- **The usage of ChatGPT and generative AI tools is highly discouraged during the labs**
    - You must learn the fundamentals before using advanced models
    - When you start driving, you do not start with an F1 race car!
    - At the exam, you must be able to explain all the details of your assignment


## Goal of the lab: a first example of data preprocessing

Data is usually stored in (textual) files

- Usually, files cannot be directly handled by machine-learning approaches
- We need to load them into the notebook and transform them before applying further analysis

How do we load existing datasets?

What data structures do you know in Python?

## File format: CSV

Comma-separated values (CSV)

- CSV is a text file format that uses commas to separate values, and newlines to separate records
- A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record
- Each record consists of the same number of fields
- If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.
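
As a minimal sketch of the quoting rule, the snippet below parses a small CSV text with Python's standard `csv` module and compares it with a naive `split(',')`. The sample records are invented for illustration and do not come from any dataset used in this lab.

```python
import csv
import io

# Hypothetical CSV content: the address in the first record contains a comma,
# so the field is surrounded with quotation marks.
text = 'id,address,price\n1,"Via Roma 1, Bologna",250000\n2,Via Verdi 3,180000\n'

# csv.reader understands the quotation marks and keeps the address as one field
rows = list(csv.reader(io.StringIO(text)))
for row in rows:
    print(len(row), row)

# A naive split on commas breaks the quoted field into two pieces
naive = text.splitlines()[1].split(',')
print(len(naive), naive)
```

Every record parsed by `csv.reader` has three fields, while the naive split yields four pieces for the first record.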

## Locating the data

These are the main options to use files/datasets in a Colab project:

1. Use the existing data in the `/content/sample_data/` folder
2. Upload a new dataset (as a file)
    - Upload the file through the Colab GUI (temporary upload!);
    - Upload the file to your Google Drive (you have to use the same Google account that you use with Colab). Then, mount the drive in your Colab machine (use the "Mount Drive" button);
    - Use the `files.upload()` code snippet in the "Upload new data" section; files are uploaded to `/content/<file_name>`
3. Use existing datasets from the web


## Read the existing data

`open(<path>, <mode>)`: opens the file located at `path` for reading, writing, etc., depending on the `mode`.


In [8]:
file_path = '/content/sample_data/california_housing_train.csv'
with open(file_path, 'r') as f:
    lines = f.readlines()
print('Read {} lines'.format(len(lines)))

Read 17001 lines
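
To illustrate how the `mode` argument changes the behaviour of `open`, here is a small sketch that writes, appends to, and reads back a file. The file name `demo.txt` and its content are made up for this example, and a temporary folder is used so that no real file is touched.

```python
import os
import tempfile

# Work in a temporary folder (hypothetical file name, not part of the lab data)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')

with open(demo_path, 'w') as f:   # 'w': create the file (or overwrite it) for writing
    f.write('first line\n')

with open(demo_path, 'a') as f:   # 'a': append at the end of the existing file
    f.write('second line\n')

with open(demo_path, 'r') as f:   # 'r': read only (this is the default mode)
    demo_lines = f.readlines()

print(demo_lines)
```

Note that `readlines()` keeps the newline character `\n` at the end of each line.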


## Upload new data

In [9]:
from google.colab import files
# uploaded = files.upload()  # uncomment this line to open the prompt to upload the dataset

## Import data from the web

In [10]:
!rm -rf iris/  # remove the folder if it exists
!rm -f iris.zip  # remove the zip file if it exists
!wget https://archive.ics.uci.edu/static/public/53/iris.zip  # download the public dataset
!unzip iris.zip -d iris  # unzip it into the iris/ folder

file_path = '/content/iris/iris.data'
with open(file_path, 'r') as f:
    lines = f.readlines()
print('Read {} lines'.format(len(lines)))

--2024-09-18 14:50:37--  https://archive.ics.uci.edu/static/public/53/iris.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘iris.zip’

iris.zip                [ <=>                ]   3.65K  --.-KB/s    in 0s      

2024-09-18 14:50:38 (763 MB/s) - ‘iris.zip’ saved [3738]

Archive:  iris.zip
  inflating: iris/Index              
  inflating: iris/bezdekIris.data    
  inflating: iris/iris.data          
  inflating: iris/iris.names         
Read 151 lines
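
The shell commands above can also be sketched in pure Python with the standard `urllib.request` and `zipfile` modules. The helper name `download_and_extract` is hypothetical (it is not part of the lab material), and the call on the real URL is left commented out because it requires network access.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, zip_path, target_dir):
    """Download the zip archive at `url` and extract it into `target_dir`.

    Hypothetical helper: returns the sorted names of the extracted entries.
    """
    urllib.request.urlretrieve(url, zip_path)      # save the archive locally
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)             # the folder is created if missing
    return sorted(p.name for p in Path(target_dir).iterdir())


# In Colab (requires network access):
# download_and_extract('https://archive.ics.uci.edu/static/public/53/iris.zip',
#                      'iris.zip', 'iris')
```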


## Strings and lists

Tools for managing strings and lists:

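- `strip()`: returns a copy of the string without leading and trailing whitespace (including the newline `\n`); if an argument is given, those characters are removed instead.
- `split()`: breaks up a string at the specified separator and returns a list of strings.
- List slicing: `list[start:stop:step]`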

Examples:

*   `a[start:stop]  # items from start through stop-1`
*   `a[start:]      # items from start through the end of the list`
*   `a[:stop]       # items from the beginning through stop-1`
*   `a[:]           # a copy of the whole list`
*   `a[-1]          # last item in the list`
*   `a[-2:]         # last two items in the list`
*   `a[:-2]         # everything except the last two items`
*   `a[::-1]        # all items in the list, reversed`
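
The tools listed above can be tried on a single string before applying them to a whole file. This is a minimal sketch on a made-up line (the leading spaces are added on purpose to show the effect of `strip()`):

```python
raw = '  5.1,3.5,1.4,0.2,Iris-setosa\n'  # example line with extra whitespace

clean = raw.strip()          # no argument: removes leading/trailing whitespace and '\n'
fields = clean.split(',')    # break the string at every comma

print(fields)          # the five elements of the line
print(fields[:2])      # first two elements
print(fields[-1])      # last element
print(fields[::2])     # every second element, starting from the first
print(fields[::-1])    # all elements, reversed

# strip() with an argument removes the given characters instead of whitespace
trimmed = 'xxhixx'.strip('x')
print(trimmed)
```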

In [11]:
lines[:5]

['5.1,3.5,1.4,0.2,Iris-setosa\n',
 '4.9,3.0,1.4,0.2,Iris-setosa\n',
 '4.7,3.2,1.3,0.2,Iris-setosa\n',
 '4.6,3.1,1.5,0.2,Iris-setosa\n',
 '5.0,3.6,1.4,0.2,Iris-setosa\n']

In [12]:
for line in lines[:2]:
    print("Raw content:", line)
    content = line.strip()  # remove the trailing newline
    print("Content after strip:", content)
    content = content.split(',')
    print("Content after split:", content)
    # now the content is a list with the split elements
    data = content[:-1]  # the features are all the elements except the last one
    label = content[-1]  # the label is the last element
    # print the content
    print('Data:', data, 'Label:', label)
    print('---')

Raw content: 5.1,3.5,1.4,0.2,Iris-setosa

Content after strip: 5.1,3.5,1.4,0.2,Iris-setosa
Content after split: ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
Data: ['5.1', '3.5', '1.4', '0.2'] Label: Iris-setosa
---
Raw content: 4.9,3.0,1.4,0.2,Iris-setosa

Content after strip: 4.9,3.0,1.4,0.2,Iris-setosa
Content after split: ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
Data: ['4.9', '3.0', '1.4', '0.2'] Label: Iris-setosa
---


## Changing data types

Note that the elements of the list are strings. We must convert these strings to floating-point numbers (`float`). These numbers are then saved in a list.

**Tools**:
- List comprehension: `[expression for item in list]`

In [13]:
print('Type of the data:', type(data[0]))
data = [float(x) for x in data]
print('Type of the data:', [type(x) for x in data])
data

Type of the data: <class 'str'>
Type of the data: [<class 'float'>, <class 'float'>, <class 'float'>, <class 'float'>]


[4.9, 3.0, 1.4, 0.2]
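
A list comprehension is a compact way of writing a loop that builds a list. The sketch below, on made-up values, shows the equivalent explicit loop and the optional trailing `if` that filters items; the variable names are chosen only for this example.

```python
values = ['5.1', '3.5', '1.4', '0.2']  # example strings, as produced by split(',')

# Explicit loop: create an empty list and append one converted item at a time
converted_loop = []
for v in values:
    converted_loop.append(float(v))

# Equivalent list comprehension
converted = [float(v) for v in values]

# A comprehension can also filter the items with a trailing `if`
large = [x for x in converted if x > 1.0]

print(converted_loop == converted)
print(converted)
print(large)
```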

## Exercise

1. Read the content of the file `iris.data`
1. **Accumulate** the data and the labels in two different lists, named `dataset` and `labels`
1. Cast all data to float
1. Output the final length of the two lists.

Given the input file with content

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    ...

The lists should look like

```python
dataset = [[5.1,3.5,1.4,0.2], [4.9,3.0,1.4,0.2], ...]
labels = ['Iris-setosa', 'Iris-setosa', ...]
```

In [14]:
# write the code here