# Introduction to Pickling

---

## Learning objectives

- Learn what serialization and deserialization are
- Learn what "pickling" is in Python
- Review using `with` statements to safely handle file operations
- Pickle and unpickle scikit-learn models in Python

---

## What is pickling?

If you're talking about food, pickling is a method of preserving food for the future. If you're talking about Python, pickling is a method of preserving **objects** for the future, including functions and classes. Since sklearn models are instances of classes, they can be pickled.

Pickling an object means **serializing** it: transforming the object into a byte stream. (A byte stream is a sequence of bytes, and one byte is made up of eight bits, each a zero or a one.) To unpickle an object so that it can be used in Python again, the byte stream needs to be **deserialized**.
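
As a minimal sketch of that round trip, `pickle.dumps` and `pickle.loads` serialize to and from an in-memory `bytes` object rather than a file. The dictionary `save_state` is a made-up example, not part of this lesson's data.

```python
import pickle

# A hypothetical object to preserve (illustrative only).
save_state = {"level": 3, "items": ["potion", "map"]}

# Serialize: Python object -> bytes.
as_bytes = pickle.dumps(save_state)
print(type(as_bytes))

# Deserialize: bytes -> a new Python object with the same contents.
restored = pickle.loads(as_bytes)
print(restored == save_state)  # same contents
print(restored is save_state)  # but not the same object in memory
```

The restored object is a copy: it compares equal to the original but is a separate object.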

If you've ever saved your progress in a video game, you've already serialized data without knowing it. A save file is your serialized save state. When you load the save, you deserialize the data so you can resume the game right where you were before you quit.

### Some warnings:

Just like you can't open a [Pokemon: Red](https://en.wikipedia.org/wiki/Pok%C3%A9mon_Red_and_Blue) save file in [Pokemon: Sun](https://en.wikipedia.org/wiki/Pok%C3%A9mon_Sun_and_Moon), a pickle isn't guaranteed to load in a different environment. To be safe, unpickle an object with the same version of Python, and the same versions of libraries such as scikit-learn, that you used to pickle it.

**Pickle objects can contain malicious code**. Never unpickle an object you don't trust!
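
To see why, here is a deliberately harmless sketch. The class `Sneaky` is hypothetical; its `__reduce__` method tells pickle which callable to run at load time. Here that callable is the built-in `sorted`, but someone with bad intentions could name a far more dangerous function instead.

```python
import pickle


class Sneaky:
    """Hypothetical class whose pickle runs a function when loaded."""

    def __reduce__(self):
        # (callable, args): pickle.loads will call sorted([3, 1, 2]).
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Sneaky())

# Loading does not rebuild a Sneaky object. It calls the function
# named in the payload and returns whatever that function returns.
result = pickle.loads(payload)
print(result)
```

The person loading the file never chose to call `sorted`; the pickle did. That is the reason to unpickle only files you trust.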

## Why pickle?

Pickling makes a lot of sense any time you have a model you want to work with that you don't want to refit. Today, we'll pickle a fitted pipeline so that we can import it into a Streamlit web app, but pickling is useful in many other situations as well.

If you have a model that took twelve hours to fit, you might want to analyze its residuals, work with its coefficients, or make predictions with it. But without saving it in some fashion, you'd need to refit the model every time you restarted your notebook. Pickling the model allows you to load the fitted model _without_ needing to re-run the code where you fit it.

Notes:
- Pickling does **not** compress your model, so some pickled models end up as fairly large files. Think of K-nearest neighbors, which stores every training data point inside the model. (sklearn _does_ optimize the way the data is stored for speed and efficiency, but models can still be large.)
- Keras has its own [`save` method](https://www.tensorflow.org/guide/keras/save_and_serialize) on models. If you want to save a neural network fit in Keras, use that instead of pickle.
- Don't pickle data frames. Export them as CSV files instead. Generally, if there's another way to save something, use it.
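
If file size matters, one option is to compress the pickled bytes with the standard library's `gzip` module. This is an assumption on our part, not something the lesson relies on, and how much it helps depends heavily on the data. The list `data` below is a repetitive stand-in, not a real model.

```python
import gzip
import pickle

# Highly repetitive stand-in data (illustrative only).
data = [0.0] * 10_000

raw = pickle.dumps(data)
compressed = gzip.compress(raw)
print(len(raw), len(compressed))

# To load, decompress first and then unpickle.
restored = pickle.loads(gzip.decompress(compressed))
```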


---

In [1]:
import pandas as pd
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

## Pickling a simple datatype

Before we pickle a full model, let's demonstrate pickling on a simple list.

Create a list called `my_vegetables` that contains some strings:

In [2]:
my_vegetables = ['cucumber', 'red pepper', 'onion', 'beet']

### Write the pickled list to disk

Let's review [this link](https://www.pythonforbeginners.com/files/with-statement-in-python) to go over why `with` is such a good tool for file operations.
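
As a quick refresher, the sketch below contrasts closing a file by hand with letting `with` do it. The file name `example.txt` and the temporary directory are placeholders for illustration.

```python
import os
import tempfile

# Work in a throwaway temporary directory (illustrative only).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Without `with`, we have to remember to close the file ourselves.
    f_manual = open(path, "w")
    try:
        f_manual.write("hello")
    finally:
        f_manual.close()

    # With `with`, the file is closed automatically when the block ends,
    # even if an error is raised inside the block.
    with open(path, "a") as f_auto:
        f_auto.write(" world")

    with open(path) as f_read:
        contents = f_read.read()

print(f_manual.closed, f_auto.closed, contents)
```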

Let's use `with` to write the list to disk as a `.pkl` file. We'll need to use `open`, pass in a file name, and also tell Python we're **writing** to the file, and writing as **bytes**. The pickle method we'll use is called `dump`.

In [3]:
with open('veggies.pkl', 'wb') as f:
    pickle.dump(my_vegetables, f)

### Open the pickled list

Let's use `with` to open the pickled file and save it as a new variable, `list_from_pickle`. Remember to tell Python that we're **reading** from the file, and that we're reading in **bytes**. The pickle method we'll use is called `load`.

In [6]:
with open('veggies.pkl', 'rb') as f:
    list_from_pickle = pickle.load(f)

In [7]:
list_from_pickle

['cucumber', 'red pepper', 'onion', 'beet']

---

## Pickle a fitted pipeline

Let's start by building a model to determine whether someone writes more like [Edgar Allan Poe](https://en.wikipedia.org/wiki/Edgar_Allan_Poe) or [Jane Austen](https://en.wikipedia.org/wiki/Jane_Austen).

Our end goal will be a fitted pipeline. But before we export our pipeline, we'll need to settle on a model.

### Import data

In [8]:
df = pd.read_csv('../data/austen_poe.csv').dropna()
df.head(3)

|  | text | author |
|---|---|---|
| 0 | SENSE AND SENSIBILITY | Jane Austen |
| 1 | by Jane Austen | Jane Austen |
| 2 | (1811) | Jane Austen |


In [9]:
df['author'].value_counts(normalize=True)

Jane Austen        0.607353
Edgar Allan Poe    0.392647
Name: author, dtype: float64

In [10]:
X = df['text']
y = df['author']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify=y, 
                                                    random_state=42)

In [12]:
pipe = Pipeline([
    ('tf', TfidfVectorizer(min_df=2)),
    ('lr', LogisticRegressionCV(solver='liblinear'))
])

In [13]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('tf', TfidfVectorizer(min_df=2)),
                ('lr', LogisticRegressionCV(solver='liblinear'))])

In [14]:
pipe.score(X_train, y_train), pipe.score(X_test, y_test)

(0.9932717190388171, 0.9571967176757596)

### Model decision: count vectorizer vs TF-IDF vectorizer

Recall that a count vectorizer converts documents into vector representations of word occurrences:

In [17]:
cv = CountVectorizer(min_df=2)
cv.fit(X_train)

cv_text = cv.transform(X_train)
# remember to use .todense() to de-sparsify the count vectorized text
cv_text_df = pd.DataFrame(cv_text.todense(), columns=cv.get_feature_names_out())

In [19]:
cv_text.shape

(13525, 17330)
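
To make the idea of a count vector concrete, here is a simplified sketch using only the standard library. It splits on whitespace, which is cruder than `CountVectorizer`'s tokenization, and the two toy documents are invented for illustration.

```python
from collections import Counter

# Two toy "documents" (illustrative only).
docs = ["the raven said nevermore", "the house the garden"]

# Naive whitespace tokenization; CountVectorizer's tokenizer is more involved.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all words seen in any document.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[word] for word in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is why the real matrix above has 17,330 columns and is mostly zeros.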

If we want to perform EDA on word counts, it may be useful to add the original author's name as a column:

**Note**: It might be a mistake to add this information as `cv_text_df['author']` or `cv_text_df['label']`. Why?

In [20]:
cv_text_df['author_label'] = y_train.values
cv_text_df.head(3)

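|  | 000 | 10 | 10th | 11 | 11th | 12 | 12mo | 13 | 13th | 14 | ... | zigzag | zit | zoar | zoilus | zone | zäire | ælfric | æronaut | être | author_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Edgar Allan Poe |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Jane Austen |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Edgar Allan Poe |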


----

### TF-IDF

> **Reminder**: "Document" means "one natural language observation." Here, it means "one paragraph from either Jane Austen or Edgar Allan Poe." "Corpus" means "the whole natural language dataset that we're using" -- so here it means "the collected works of Jane Austen and Edgar Allan Poe, as scraped from Gutenberg and split into paragraphs."

An alternative to count vectorization is **TF-IDF vectorization**, where TF-IDF stands for "term frequency-inverse document frequency." Instead of just counting the words in each document, TF-IDF _weights_ the words in each document.

The general idea behind TF-IDF is that words used many times in a document should matter more, unless they're also used in very many documents across the corpus! TF-IDF thus weights words by both the **term frequency**, which measures how often the word is used in the document, and the **inverse document frequency**, which measures how rare the word is across the corpus. To compute the TF-IDF of one word used in one document, we multiply the term frequency by the inverse document frequency. We do this for each word in each document.

The formula for the term frequency of a term is written as

$$
\text{tf}(t, d) = \frac{f_{t,d}}{n_d}
$$

where $f_{t,d}$ is the number of times the term $t$ is used in the document $d$, and $n_d$ is the total number of words in document $d$.

The formula for inverse document frequency is written as

$$
\text{idf}(t) = \log{\frac{n}{1+\text{df}(t)}}
$$

where $n$ is the number of documents in the corpus, and $\text{df}(t)$ is the number of documents in the corpus that contain the term $t$.

The TF-IDF itself is then computed as

$$
\text{tf}(t,d) \times \text{idf}(t)
$$

which is the product of the term frequency and the inverse document frequency.
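
The three formulas above can be sketched in plain Python. The toy corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration; this follows the formulas as written here and is not scikit-learn's implementation.

```python
import math

# Three toy documents, already split into words (illustrative only).
docs = [
    ["the", "raven", "the"],
    ["the", "emma"],
    ["the", "house", "emma"],
]


def tf(term, doc):
    # Times the term appears in the document / total words in the document.
    return doc.count(term) / len(doc)


def idf(term, docs):
    # log(number of documents / (1 + number of documents containing the term)).
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / (1 + df))


def tfidf(term, doc, docs):
    # Product of term frequency and inverse document frequency.
    return tf(term, doc) * idf(term, docs)


raven_score = tfidf("raven", docs[0], docs)
the_score = tfidf("the", docs[0], docs)
print(raven_score, the_score)
```

Even though "the" appears twice in the first document and "raven" only once, "the" appears in every document, so its IDF (and therefore its TF-IDF) is lower than that of "raven".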

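> **Note**: There are a few other ways to formulate a TF-IDF, and implementations differ in how they calculate the inverse document frequency. You can see other formulations [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition). The formula above is a common textbook form. By default, scikit-learn's `TfidfVectorizer` uses a slightly different variant: it takes the raw count as the term frequency, computes a smoothed IDF, $\ln\frac{1+n}{1+\text{df}(t)} + 1$, and then scales each document's vector to unit length.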

### TF-IDF in scikit-learn

The `TfidfVectorizer` functions very similarly to `CountVectorizer`:

In [21]:
tfidf = TfidfVectorizer(min_df=2)
tfidf.fit(X_train)

tfidf_text = tfidf.transform(X_train)
# remember to use .todense() to de-sparsify the TF-IDF vectorized text
tfidf_text_df = pd.DataFrame(tfidf_text.todense(), columns=tfidf.get_feature_names_out())

However, unlike word counts, it does not make sense to sum the TF-IDF values of a term across documents: they are weights, not counts.

We can still do some exploration of the TF-IDF scores! The TF-IDF vectorizer has a fitted attribute `.idf_` which stores the inverse document frequency for each word in the corpus. Here, we will construct a data frame of the words in the corpus, and pair them with their IDF scores. Which words have large IDF scores, and which words have small IDF scores?

In [22]:
vocab = tfidf.get_feature_names_out()
len(vocab)

17330

In [23]:
vocab.shape

(17330,)

In [25]:
len(tfidf.idf_)

17330

In [27]:
idf_df = pd.DataFrame(zip(vocab, tfidf.idf_),
                    columns=["Vocabulary", "IDF"])

In [29]:
idf_df.sort_values(by="IDF", ascending=False).head(10)

|  | Vocabulary | IDF |
|---|---|---|
| 8665 | irrecoverably | 9.413757 |
| 3982 | curt | 9.413757 |
| 3995 | cushioned | 9.413757 |
| 3993 | curvetted | 9.413757 |
| 3988 | curtseyed | 9.413757 |
| 3986 | curtis | 9.413757 |
| 11633 | pleiads | 9.413757 |
| 11635 | plentifully | 9.413757 |
| 11637 | pliancy | 9.413757 |
| 3980 | cursing | 9.413757 |


In [31]:
idf_df.sort_values(by="IDF", ascending=False).tail(10)

|  | Vocabulary | IDF |
|---|---|---|
| 1808 | be | 2.044576 |
| 16821 | was | 1.992579 |
| 10536 | not | 1.943533 |
| 15381 | that | 1.87172 |
| 8712 | it | 1.833738 |
| 8096 | in | 1.697741 |
| 10706 | of | 1.485952 |
| 15604 | to | 1.476858 |
| 1092 | and | 1.47258 |
| 15382 | the | 1.408835 |
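
The largest IDF reported above, 9.413757, can be reproduced by hand. This sketch assumes scikit-learn's default smoothed IDF, $\ln\frac{1+n}{1+\text{df}(t)} + 1$, and assumes that the rarest words kept by `min_df=2` appear in exactly two training documents. The variable names are ours.

```python
import math

n_documents = 13525      # rows in the training matrix shown earlier
min_document_count = 2   # min_df=2: the rarest kept words appear in 2 documents

# Smoothed IDF, assuming TfidfVectorizer's default smooth_idf=True.
smoothed_idf = math.log((1 + n_documents) / (1 + min_document_count)) + 1
print(round(smoothed_idf, 6))
```

Very common words such as "the" sit at the other end of the scale: the more documents a word appears in, the closer its smoothed IDF gets to 1.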


---

## Comparing models

We could use either the TF-IDF vectorizer or the count vectorizer, and one might work better than the other, so let's try both, each paired with logistic regression and with multinomial Naive Bayes.

In [33]:
pipe1 = Pipeline([
    ("cv", CountVectorizer()),
    ("lr", LogisticRegressionCV(solver="liblinear"))
])

pipe1.fit(X_train, y_train)
pipe1.score(X_train, y_train), pipe1.score(X_test, y_test)

(0.9911275415896488, 0.9467731204258151)

In [34]:
pipe2 = Pipeline([
    ("tf", TfidfVectorizer()),
    ("lr", LogisticRegressionCV(solver="liblinear"))
])

pipe2.fit(X_train, y_train)
pipe2.score(X_train, y_train), pipe2.score(X_test, y_test)

(0.994011090573013, 0.9580838323353293)

In [35]:
pipe3 = Pipeline([
    ("cv", CountVectorizer()),
    ("nb", MultinomialNB())
])

pipe3.fit(X_train, y_train)
pipe3.score(X_train, y_train), pipe3.score(X_test, y_test)

(0.9510536044362292, 0.9387890884896873)

In [36]:
pipe4 = Pipeline([
    ("tf", TfidfVectorizer()),
    ("nb", MultinomialNB())
])

pipe4.fit(X_train, y_train)
pipe4.score(X_train, y_train), pipe4.score(X_test, y_test)

(0.9191127541589649, 0.8942115768463074)

----

My best model:

In [37]:
pipe = Pipeline([
    ("tf", TfidfVectorizer()),
    ("lr", LogisticRegressionCV(solver="liblinear"))
])

pipe.fit(X_train, y_train)
pipe.score(X_train, y_train), pipe.score(X_test, y_test)

(0.994011090573013, 0.9580838323353293)


---

## Pickle a fitted pipeline


### Export the fitted pipeline to `models` as `author_pipe.pkl`

This time, let's export to the `models` folder in this repository.

> Why bother? First, if you have lots of serialized objects, or if your serialized objects take up lots of disk space, you might not want to add and commit them to GitHub. Keeping them all in the same folder makes it easier to stay organized and to avoid committing them. Second, it's also just a good way to organize a repository!

Just like before, we'll use a `with` statement:

In [38]:
# 'wb' = write binary
# pickle.dump() serializes `pipe` and writes the resulting bytes to f
with open('../models/author_pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)
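
Before relying on a pickled pipeline, it can be worth checking that a restored copy predicts the same labels as the original. The sketch below does this in memory with a tiny made-up dataset and a plain `LogisticRegression`; the names `toy_pipe` and `restored_pipe` and the example sentences are placeholders, not the lesson's data or model.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set (illustrative only).
texts = [
    "the raven tapped at my chamber door",
    "a midnight dreary and a beating heart",
    "she was handsome clever and rich",
    "a single man in possession of a good fortune",
]
labels = ["Poe", "Poe", "Austen", "Austen"]

toy_pipe = Pipeline([
    ("tf", TfidfVectorizer()),
    ("lr", LogisticRegression()),
])
toy_pipe.fit(texts, labels)

# Serialize and deserialize in memory instead of going through a file.
restored_pipe = pickle.loads(pickle.dumps(toy_pipe))

original_preds = list(toy_pipe.predict(texts))
restored_preds = list(restored_pipe.predict(texts))
print(original_preds == restored_preds)
```

The restored pipeline carries its fitted vocabulary and coefficients with it, so no refitting is needed before calling `predict`.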

Next, we'll try to unpickle it in the `02-read_a_pickle` notebook.