In [5]:
from mads_datasets import DatasetFactoryProvider, DatasetType
import mads_datasets
mads_datasets.__version__

'0.1.5.4'

We are using a package that we created ourselves for this course.
At this point, you don't need to grasp the details of the package, but you should be able to use it.

However, the package is an example of how you can create professional, scalable and reusable code in Python. What you will quickly learn is that the flexibility of Python makes it possible to do things in many different ways.

Fast prototyping in a notebook can contain lots of hardcoded variables, and that is fine. But when you want to reuse your code, you need to make it more flexible, and more abstract. This is what we have done in the package.

So you might want to have a look at the source code of the package (it's all Python anyway) and see if you can trace back everything we are doing here, because the essentials of the preprocessing you see in this notebook are also done in the package.

You might also learn a few new things about how you can write Python. If it looks too complicated, just ignore it for now, and focus on the things you need to do in the notebook.



In [47]:
# we obtain the iris dataset factory
# a dataset factory is a class that can create multiple datasets, just like a factory
# is able to create multiple products
irisdataset = DatasetFactoryProvider.create_factory(DatasetType.IRIS)
irisdataset

<mads_datasets.datasetfactory.IrisDatasetFactory at 0x1391cfcd0>

In [49]:
irisdataset.settings

dataset_url: https://github.com/raoulg/data_assets/raw/main/iris_dirty.csv
filename: iris.csv
name: iris
formats: [<FileTypes.CSV: '.csv'>]
digest: 3679279dc61f6fdd859be9888db27f75
target: class
features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
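To make the idea of a dataset factory more concrete, here is a minimal sketch of the pattern. This is not the actual implementation in `mads_datasets`: every name starting with `Toy` is hypothetical and only illustrates how a provider can map a dataset type to a factory that carries its own settings.

```python
from dataclasses import dataclass
from enum import Enum


class ToyDatasetType(Enum):
    # hypothetical enum, standing in for something like DatasetType
    IRIS = "iris"
    FLOWERS = "flowers"


@dataclass
class ToySettings:
    # hypothetical settings object, much smaller than the real one
    name: str
    target: str


class ToyFactory:
    """Hypothetical factory: holds settings and builds datasets on request."""

    def __init__(self, settings: ToySettings) -> None:
        self.settings = settings

    def create_dataset(self) -> dict:
        # a real factory would download and preprocess the data;
        # this sketch only returns empty splits
        return {"train": [], "valid": []}


class ToyFactoryProvider:
    """Hypothetical provider: maps a dataset type to a configured factory."""

    _registry = {
        ToyDatasetType.IRIS: ToySettings(name="iris", target="class"),
        ToyDatasetType.FLOWERS: ToySettings(name="flowers", target="label"),
    }

    @classmethod
    def create_factory(cls, dataset_type: ToyDatasetType) -> ToyFactory:
        return ToyFactory(cls._registry[dataset_type])


factory = ToyFactoryProvider.create_factory(ToyDatasetType.IRIS)
splits = factory.create_dataset()
```

The benefit of this design is that the calling code stays the same for every dataset: you only swap the dataset type, and the settings travel with the factory.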

In [51]:
# this will download the iris dataset and check the integrity of the 
# downloaded file with the md5 hash of the file
irisdataset.download_data()

2023-06-20 19:08:00.850 | INFO     | mads_datasets.datasetfactory:download_data:95 - Dataset already exists at /Users/rgrouls/.cache/mads_datasets/iris
2023-06-20 19:08:00.852 | INFO     | mads_datasets.datasetfactory:download_data:106 - Digest of downloaded /Users/rgrouls/.cache/mads_datasets/iris/iris.csv matches expected digest
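If you are curious how such an integrity check can work, here is a minimal sketch using only the standard library. It is an assumption about the general technique, not the package's actual code: the helper `md5_digest` and the temporary demo file are made up for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_digest(path: Path, chunk_size: int = 8192) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        # read until f.read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


content = b"5.1,3.5,1.4,0.2,Iris-setosa\n"

with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.csv"
    demo_file.write_bytes(content)

    # the "expected" digest would normally be stored in the settings
    expected = hashlib.md5(content).hexdigest()
    digest_matches = md5_digest(demo_file) == expected

    # any change in the content gives a different digest
    tampered = md5_digest(demo_file) == hashlib.md5(b"tampered").hexdigest()
```

Here `digest_matches` is `True` and `tampered` is `False`, which is the signal you need to decide whether a download is intact.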


In [52]:
# the file is located here
irisdataset.filepath

PosixPath('/Users/rgrouls/.cache/mads_datasets/iris/iris.csv')

Let's load the dataframe from disk.

In [9]:
import pandas as pd

iris = pd.read_csv(
    irisdataset.filepath,
    header=None,
    names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"],
)

In [53]:
iris.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64

In [54]:
# show the rows at positions 80 up to (not including) 90
# note that slicing is positional, so the index labels can differ
iris[80:90]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
82,5.8,3.057333,3.9,1.2,Iris-versicolor
83,6.0,2.7,5.1,1.6,Iris-versicolor
84,5.4,3.0,4.5,1.5,Iris-versicolor
85,6.0,3.4,4.5,1.6,Iris-versicolor
86,6.7,3.1,4.7,1.5,Iris-versicolor
87,6.3,2.3,4.4,1.3,Iris-versicolor
88,5.6,3.0,4.1,1.3,Iris-versicolor
89,5.5,2.5,4.0,1.3,Iris-versicolor
90,5.5,2.6,4.4,1.2,Iris-versicolor
91,6.1,3.0,4.6,1.4,Iris-versicolor


In [55]:
iris["sepal_width"].fillna(value=iris["sepal_width"].mean(), inplace=False)[80:90]

82    3.057333
83    2.700000
84    3.000000
85    3.400000
86    3.100000
87    2.300000
88    3.000000
89    2.500000
90    2.600000
91    3.000000
Name: sepal_width, dtype: float64

In [56]:
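# make this more general in a function
def fill_na_with_mean(df: pd.DataFrame, column: str) -> None:
    # assign the filled column back to the dataframe: calling
    # fillna(inplace=True) on a selected column may not update
    # the original dataframe in recent pandas versions
    df[column] = df[column].fillna(value=df[column].mean())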


fill_na_with_mean(iris, "sepal_width")

In [57]:
iris.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
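To see what filling with the mean actually does, here is a small sketch on a toy dataframe. The dataframe `toy` and its values are made up for this illustration; they are not taken from the iris data.

```python
import numpy as np
import pandas as pd

# toy data with one missing value
toy = pd.DataFrame({"sepal_width": [3.0, np.nan, 2.0, 4.0]})

# mean() skips NaN by default, so the mean is (3.0 + 2.0 + 4.0) / 3
column_mean = toy["sepal_width"].mean()

# assign the result back to the column
toy["sepal_width"] = toy["sepal_width"].fillna(column_mean)
missing_after = int(toy["sepal_width"].isnull().sum())
```

Note that the mean is calculated over the non-missing values only, so the filled value does not change the mean of the column.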

In [58]:
iris.groupby("class").count()

class,sepal_length,sepal_width,petal_length,petal_width
Iris-setosa,48,48,48,48
Iris-versicolor,50,50,50,50
Iris-virginica,50,50,50,50


In [59]:
# use a regex to replace setsoa with setosa
import re

iris["class"].apply(lambda x: re.sub(r"setsoa", "setosa", x))

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
150    Iris-virginica
Name: class, Length: 148, dtype: object

In [60]:
# or, more abstract, with a function and parameters
def replace_value(
    df: pd.DataFrame, column: str, old_value: str, new_value: str
) -> pd.DataFrame:
    df[column] = df[column].apply(lambda x: re.sub(old_value, new_value, x))
    return df

# you can omit the r before the regex string here,
# because this pattern contains no backslashes.
# however, it is good practice to use it anyway, so you don't forget it
# when you DO need a raw string, for example when the pattern
# contains a backslash (like \d or \.)
replace_value(iris, "class", r"setsoa", "setosa")

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
146,6.7,3.0,5.2,2.3,Iris-virginica
147,6.3,2.5,5.0,1.9,Iris-virginica
148,6.5,3.0,5.2,2.0,Iris-virginica
149,6.2,3.4,5.4,2.3,Iris-virginica
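A short sketch of a situation where the raw string does matter. The pattern `\b` (word boundary) is an example chosen for this illustration; it is not used elsewhere in this notebook.

```python
import re

# in a normal string, "\b" is the backspace character (one character);
# in a raw string, it is a backslash followed by "b" (two characters)
normal_len = len("\b")
raw_len = len(r"\b")

# as a regex, r"\b" means "word boundary", so this pattern matches
with_raw = re.sub(r"\bsetsoa\b", "setosa", "Iris-setsoa")

# without the r, the pattern looks for literal backspace characters,
# which are not in the text, so nothing is replaced
without_raw = re.sub("\bsetsoa\b", "setosa", "Iris-setsoa")
```

With the raw string you get `"Iris-setosa"`; without it, the text stays `"Iris-setsoa"` and you get no error to warn you.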


In [61]:
iris.groupby("class").count()

class,sepal_length,sepal_width,petal_length,petal_width
Iris-setosa,48,48,48,48
Iris-versicolor,50,50,50,50
Iris-virginica,50,50,50,50


Check for duplicates

In [62]:
iris.duplicated().sum()

1

In [63]:
iris.drop_duplicates(inplace=True)

In [64]:
iris.duplicated().sum()

0

In [65]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [23]:
# use regex to extract the numbers before the " mm" string
# and convert them to float

def extract_number(df: pd.DataFrame, column: str) -> pd.DataFrame:
    df[column] = df[column].apply(lambda x: float(re.findall(r"(\d+\.?\d*) mm", x)[0])/10)
    return df

iris = extract_number(iris, "petal_width")
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
146,6.7,3.0,5.2,2.3,Iris-virginica
147,6.3,2.5,5.0,1.9,Iris-virginica
148,6.5,3.0,5.2,2.0,Iris-virginica
149,6.2,3.4,5.4,2.3,Iris-virginica


Explanation of the regular expression pattern used in the code:

- `r"..."`: The `r` prefix denotes a raw string literal in Python. It is commonly used for regular expressions, so that Python does not interpret backslashes as escape characters.
- `(\d+\.?\d*)`: This is the main part of the pattern. It is enclosed in parentheses to capture the matching substring, and consists of the following components:
  - `\d+`: Matches one or more digits: `\d` is a digit and `+` means one or more.
  - `\.?`: Matches an optional decimal point. The backslash `\` is used to escape the period `.`, because the period has a special meaning in regex (it matches any character). The `?` means zero or one occurrences of the previous item.
  - `\d*`: Matches zero or more digits after the decimal point. `*` means zero or more occurrences of the previous item.
- ` mm`: Matches the literal string " mm" (a space followed by "mm") exactly as it appears in the text.
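To check your understanding of the pattern, you can try it on a few strings. The sample strings below are made up for this sketch; only the pattern itself comes from the cell above.

```python
import re

pattern = r"(\d+\.?\d*) mm"

samples = ["2 mm", "2.5 mm", "width: 14.0 mm", "no unit"]

# findall returns a list with the captured group for every match
matches = {text: re.findall(pattern, text) for text in samples}

# take the first captured number and divide by 10, as the cell above does
value = float(re.findall(pattern, "2.5 mm")[0]) / 10
```

Note that `"no unit"` gives an empty list. Indexing that with `[0]` raises an `IndexError`, so the function above only works if every value in the column contains the " mm" suffix.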

In [24]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
146,6.7,3.0,5.2,2.3,Iris-virginica
147,6.3,2.5,5.0,1.9,Iris-virginica
148,6.5,3.0,5.2,2.0,Iris-virginica
149,6.2,3.4,5.4,2.3,Iris-virginica


In [25]:
# find the index where "sepal_length" is 58
# this looks like 5.8 with a misplaced decimal point
idx = iris[iris["sepal_length"] == 58].index

In [26]:
iris.loc[idx]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
143,58.0,2.7,5.1,1.9,Iris-virginica


In [27]:
iris.loc[idx, "sepal_length"] = 5.8

In [28]:
iris.loc[idx]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
143,5.8,2.7,5.1,1.9,Iris-virginica


In [29]:
y = iris['class']
x = iris.drop('class', axis = 1)

In [30]:
train = iris.sample(frac=0.6)

In [31]:
valid = iris.drop(train.index)

In [32]:
len(valid), len(train)

(59, 89)
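One thing to be aware of: `sample` picks different rows every time you run it. A minimal sketch of how you could make the split reproducible with the `random_state` argument, using a made-up toy dataframe (the value 42 is an arbitrary choice):

```python
import pandas as pd

# toy data, made up for this sketch
toy = pd.DataFrame({"value": range(10), "class": ["a", "b"] * 5})

# with a fixed random_state, the same rows are sampled on every run
train = toy.sample(frac=0.6, random_state=42)

# dropping by index label only works as intended if the index is unique
valid = toy.drop(train.index)

# sampling again with the same random_state gives the same rows
train_again = toy.sample(frac=0.6, random_state=42)
```

Because `valid` is everything that is not in `train`, the two sets never overlap, and together they cover the full dataframe.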

We can create a Dataset class.
This is useful because:
1. Abstraction and Encapsulation: By encapsulating your data within a class, you can hide the internal implementation details and provide a clean and intuitive interface for accessing and working with the data. This helps to separate concerns and make your code more modular.

2. Customized Data Access: By implementing the `__getitem__` method, you can define how individual elements of the dataset should be accessed. This allows you to specify custom indexing and slicing behavior for your data. For example, you could implement logic for random sampling, filtering, or any other specialized data retrieval operations.

3. Integration with Python Data Ecosystem: By implementing the `__len__` method, you make your dataset compatible with Python's built-in `len()` function, which makes it easy to determine the size of your dataset. Additionally, by using popular data manipulation libraries like pandas inside the class, you can handle tabular data and perform operations such as dropping columns, selecting subsets, or applying transformations without bothering the user with the implementation details, while guaranteeing that the steps are always performed in the same way.

4. Compatibility with Machine Learning Frameworks: Creating a custom dataset class is particularly useful when working with machine learning frameworks such as PyTorch or TensorFlow. These frameworks often provide built-in utilities for handling custom datasets, allowing you to leverage functionalities like batching, parallel processing, and data loading optimizations. By conforming to the expected interface of these frameworks, you can seamlessly integrate your dataset with the training and evaluation pipelines.

In [33]:
class DataSet:
    def __init__(self, data):
        self.y = data['class']
        self.x = data.drop('class', axis = 1)
    
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.x.iloc[idx], self.y.iloc[idx]


trainset = DataSet(train)
validset = DataSet(valid)

In [67]:
# we can now obtain the length of the dataset
len(trainset)

89

In [69]:
# or obtain a single item
x, y = trainset[0]
x, y

(sepal_length    6.5
 sepal_width     3.0
 petal_length    5.8
 petal_width     2.2
 Name: 105, dtype: float64,
 'Iris-virginica')

In [71]:
# or everything at once
x, y = trainset[:]
len(x) , len(y), type(x)

(89, 89, pandas.core.frame.DataFrame)
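Because the class implements `__len__` and `__getitem__`, other code can work with it without knowing anything about pandas. As an illustration, here is a sketch of a simple batching helper. The function `iter_batches` and the class `ToyDataSet` are hypothetical: `ToyDataSet` just repeats the interface of the class above to keep the sketch self-contained, and machine learning frameworks provide their own, more advanced, versions of this idea.

```python
import pandas as pd


class ToyDataSet:
    # same interface as the DataSet class above
    def __init__(self, data: pd.DataFrame) -> None:
        self.y = data["class"]
        self.x = data.drop("class", axis=1)

    def __len__(self) -> int:
        return len(self.y)

    def __getitem__(self, idx):
        return self.x.iloc[idx], self.y.iloc[idx]


def iter_batches(dataset, batch_size: int):
    """Hypothetical helper: yield (x, y) batches using only len() and slicing."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start : start + batch_size]


# toy data, made up for this sketch
toy = pd.DataFrame({"value": range(10), "class": ["a", "b"] * 5})
batch_sizes = [len(x) for x, y in iter_batches(ToyDataSet(toy), batch_size=4)]
```

With 10 rows and a batch size of 4, you get two full batches and one smaller batch with the remaining 2 rows.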

Now we are at the point where we can fit a classifier on the data.
Everything we have done in this notebook (preprocessing the data and making a dataset object from it) is also done in the package. To show how simple that makes things, here is all the code you need to download, preprocess and fit a classifier on the data. If you run this in a separate notebook, it will still work: the preprocessing simply happens in the backend.

In [72]:
from mads_datasets import DatasetFactoryProvider, DatasetType
from sklearn import neighbors

irisdataset = DatasetFactoryProvider.create_factory(DatasetType.IRIS)
datasets = irisdataset.create_dataset()
trainset = datasets["train"]
validset = datasets["valid"]

clf = neighbors.KNeighborsClassifier(n_neighbors=3)
clf.fit(*trainset[:])
clf.score(*validset[:])

2023-06-20 19:17:31.982 | INFO     | mads_datasets.datasetfactory:download_data:95 - Dataset already exists at /Users/rgrouls/.cache/mads_datasets/iris
2023-06-20 19:17:31.986 | INFO     | mads_datasets.datasetfactory:download_data:106 - Digest of downloaded /Users/rgrouls/.cache/mads_datasets/iris/iris.csv matches expected digest


0.9661016949152542