# Section 3: speed comparison

### **Theory - speed and file size of pickle files**
As a Data Engineer, you often want the most efficient solution to the challenges that appear during your work. When working with small datasets, different approaches might seem interchangeable. However, when working with larger datasets, you will notice differences in runtime.

In this section, we will explore how fast writing and reading pickle files is compared to .csv files when working with tabular data. The pandas library will be used to create a large dummy dataset, save it to disk and load it again. Since the focus of this module is Pickle, pandas is not covered in depth, but you can always consult the [Pandas Documentation](https://pandas.pydata.org/docs/).

Furthermore, we will use a built-in Jupyter cell magic called *%%time*, which reports how long a cell took to run. It must be placed on the first line of the cell.
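
Outside a notebook, *%%time* is not available. As a rough sketch of what it measures, the standard-library `time` module can report wall-clock and CPU time for a single call. The helper name `timed` is an assumption made for this illustration; it is not part of Jupyter or pandas.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# Time a simple computation; the measured durations differ per machine
result, wall, cpu = timed(sum, range(1_000_000))
print(f"result={result}, wall time: {wall:.4f} s, CPU time: {cpu:.4f} s")
```

In the notebook itself, *%%time* remains the simpler choice; this sketch only shows the idea behind it.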




##### ASSIGNMENT 7: create a dummy dataset of 5 columns and 5 million rows

In [None]:
# Seed the random generator so the dummy data is reproducible
np.random.seed(42)
df_size = 5_000_000

df = pd.DataFrame({'a': np.random.rand(df_size),
                   'b': np.random.rand(df_size),
                   'c': np.random.rand(df_size),
                   'd': np.random.rand(df_size),
                   'e': np.random.rand(df_size)})

display(df)

##### ASSIGNMENT 8: save the DataFrame as a .csv file using the .to_csv() function of Pandas

In [None]:
%%time
#### ADD YOUR CODE HERE ####


##### ASSIGNMENT 9: load the created .csv file as a DataFrame using the pandas.read_csv() function

In [None]:
%%time
#### ADD YOUR CODE HERE ####


##### ASSIGNMENT 10: now save the original DataFrame we created as a pickle file

In [None]:
%%time
#### ADD YOUR CODE HERE ####


##### ASSIGNMENT 11: now deserialize the pickle file containing the serialized DataFrame back into a DataFrame

In [None]:
%%time
#### ADD YOUR CODE HERE ####


##### ASSIGNMENT 12: Now print the file size of both the csv file and the pickle file you created.

*Hint: the os.path module of the os library has a function for this called getsize(), which returns the size in bytes*
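
To illustrate the hint without giving away the assignment, here is a minimal sketch that measures a throwaway file. The file name `example.bin` and its contents are assumptions made for this illustration; in the assignment you would pass the paths of your own csv and pickle files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file, written only to have something to measure
    example_path = os.path.join(tmp_dir, "example.bin")
    with open(example_path, "wb") as f:
        f.write(b"\x00" * 2048)

    # getsize() returns the file size in bytes
    size_in_bytes = os.path.getsize(example_path)

print(f"{size_in_bytes} bytes = {size_in_bytes / 1024:.1f} KiB")
```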

In [None]:
#### ADD YOUR CODE HERE ####


### **Theory - speed and file size of pickle files**
As you can see, pickle is considerably more efficient than .csv as temporary storage for this tabular dataset:
* reading and writing pickle files is much faster than reading and writing csv files
* the pickle file is also smaller than the csv file

However, there are also downsides. Pickle files are not human-readable and cannot be opened in programs like Excel. Therefore, always check whether pickle is the right fit for the job!
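
The trade-off can be sketched with the standard library alone. The tiny `rows` list below is an assumption made for this illustration, and it uses the built-in `csv` module rather than pandas: the csv output is plain text anyone can read, while the pickle output is binary but keeps the original Python types.

```python
import csv
import io
import pickle

# Small illustrative dataset (hypothetical values)
rows = [[0.25, 1.5], [2.0, 3.75]]

# CSV: human-readable text
csv_buffer = io.StringIO()
csv.writer(csv_buffer).writerows(rows)
csv_text = csv_buffer.getvalue()
print(repr(csv_text))

# Pickle: a binary byte string, not meant to be read by humans
pickle_bytes = pickle.dumps(rows)
print(type(pickle_bytes))

# Reading back: csv returns strings, pickle restores the floats
csv_rows = list(csv.reader(io.StringIO(csv_text)))
pickle_rows = pickle.loads(pickle_bytes)
print(csv_rows[0])
print(pickle_rows[0])
```

Note that the csv values come back as strings and need converting, whereas the unpickled list is equal to the original.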

# Section 4: pickle and machine learning algorithms
In this final section, we are going to save a fully trained machine learning model to a pickle file. As already stated, almost all Python objects can be pickled, and this includes ML models. For a Data Engineer working with ML models, this is arguably the most powerful use of Pickle.

Since the focus of this module is not on creating or training an ML model, all code required to do this is given in the cells below. Simply run them to load the correct data and train a model.

The data we use is from the famous [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), which is often used to showcase classification algorithms.

In [None]:
# Extra libraries needed for this section
import matplotlib.pyplot as plt
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

##### ASSIGNMENT 13: run the following few cells to load and explore the data, and subsequently train a model.

In [None]:
# load data from sklearn
iris = datasets.load_iris()

# turn data into DataFrame
df_iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns= iris['feature_names'] + ['target'])

# Add species column
df_iris['species'] = df_iris['target'].map({
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
})

# Show top rows of DataFrame
df_iris.head()

In [None]:
# Plot the petal length and petal width
df_iris.plot.scatter(
    x='petal length (cm)',
    y='petal width (cm)',
    c='target',
    colormap='viridis',
    figsize=(12,8))

plt.xlabel("petal length (cm)", size=12)
plt.ylabel("petal width (cm)", size=12)
plt.title('Iris petals (color indicates class)', size=16);

In [None]:
# PARAMETERS TO CHANGE
input_columns = ['petal length (cm)','petal width (cm)']
test_size = 0.5
random_state = 42
model = LogisticRegression()
model_name = 'LR_with_petaldata'

# Create input X, and target y to train the model with 
X = df_iris[input_columns]  # only use petal length and width
X = X.to_numpy()  # converting into numpy array
y = iris['target']

# Splitting into train and test, using the parameters defined above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state)

# Fit the model on the training set only, so the test set stays unseen
model.fit(X_train, y_train)

##### ASSIGNMENT 14: save the fitted model as a pickle file

In [None]:
#### ADD YOUR CODE HERE ####


##### ASSIGNMENT 15: load the fitted model and print metrics

In [None]:
#### ADD YOUR CODE HERE ####
loaded_model = None
#### STOP ADDING YOUR CODE HERE ####

training_prediction = loaded_model.predict(X_train)
test_prediction = loaded_model.predict(X_test)

# Precision and recall scores on the training set
print("Precision and recall on the training set\n")
print(metrics.classification_report(y_train, training_prediction, digits=3))

# Confusion matrix on the training set
print('Confusion matrix')
print(metrics.confusion_matrix(y_train, training_prediction))

##### (OPTIONAL) ASSIGNMENT 16
* fit another Logistic Regression model with different input columns (sepal data instead of petal)
* save the model under a different name
* load both models and compare metrics 

*Hint: double-check that you are using the right X_train and X_test for each model when predicting*

Which model performs better?

In [None]:
#### ADD YOUR CODE HERE ####

### **Theory - Pickle for machine learning models**
As you can see, saving different models as different pickle files is easy. This can be of great help to a Data Engineer working with ML models.

For example, it makes comparing models trained with different parameters more organized and efficient.

Furthermore, it can help bring models to production: train and compare multiple models in a development environment, and transfer only the pickle file of the best-performing model to the production environment.
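
One possible way to keep several models organized is to store each one under its own name. The sketch below is only an illustration under stated assumptions: the helpers `save_model` and `load_model`, the `.pkl` extension and the name `LR_with_sepaldata` are hypothetical, and plain dictionaries stand in for fitted models so the example runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, model_name, directory):
    """Pickle a model to <directory>/<model_name>.pkl and return the path."""
    path = Path(directory) / f"{model_name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def load_model(model_name, directory):
    """Load the model that was stored under model_name."""
    with open(Path(directory) / f"{model_name}.pkl", "rb") as f:
        return pickle.load(f)


# Stand-ins for fitted models (assumption: a real model object goes here)
models = {
    "LR_with_petaldata": {"input_columns": ["petal length (cm)", "petal width (cm)"]},
    "LR_with_sepaldata": {"input_columns": ["sepal length (cm)", "sepal width (cm)"]},
}

with tempfile.TemporaryDirectory() as model_dir:
    # One pickle file per model, named after the model
    for name, model in models.items():
        save_model(model, name, model_dir)

    saved_files = sorted(p.name for p in Path(model_dir).glob("*.pkl"))
    reloaded = load_model("LR_with_sepaldata", model_dir)

print(saved_files)
print(reloaded)
```

With real models, only the file of the chosen model would need to be copied to the production environment.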

Congratulations on completing this module!