# Deployment of Text Preprocessing into a Flask App



In [1]:
# import relevant libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import pickle

## Pipeline Overview 

1. Obtaining the Data
1. **Text Preprocessing Pipeline** using `CountVectorizer` - Step 1 to 3
1. Training our `Logistic Regression` Model
1. Making predictions on our own Data

### Dataset

Clickbait = https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

## Preprocessing Pipeline

### Step 1: Obtaining the Data

- Using the `pandas` library (imported earlier), which is well suited to data wrangling
- Reading the `.csv` file into a DataFrame
- Extracting the "headline" and "clickbait" columns into two separate arrays, `texts` and `labels`
- Checking and printing them

In [2]:
df = pd.read_csv("data.csv")

texts = df["headline"].values
labels = df["clickbait"].values

In [3]:
texts

array(['Should I Get Bings',
       'Which TV Female Friend Group Do You Belong In',
       'The New "Star Wars: The Force Awakens" Trailer Is Here To Give You Chills',
       ...,
       'Drone smartphone app to help heart attack victims in remote areas announced',
       'Netanyahu Urges Pope Benedict, in Israel, to Denounce Iran',
       'Computer Makers Prepare to Stake Bigger Claim in Phones'],
      dtype=object)

In [4]:
labels

array([1, 1, 1, ..., 0, 0, 0])

### Step 2: Fitting our CountVectorizer

- Initialising the `CountVectorizer` (imported earlier), a scikit-learn class that learns the vocabulary of a corpus, i.e. the set of distinct words in our sentences
- Fitting our `CountVectorizer` on `texts` using the `.fit()` method. This is where the vocabulary is learnt
- Viewing the vocabulary of our `CountVectorizer`

In [5]:
vect = CountVectorizer()
vect.fit(texts)

CountVectorizer()

In [6]:
print("Count: " + str(len(vect.vocabulary_)))
print(vect.vocabulary_)

Count: 22760
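
To make the structure of `vocabulary_` more concrete, here is a minimal sketch on a toy corpus. The two sentences and the `toy_` names are made up for illustration and are not part of the clickbait dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
toy_texts = ["You will not believe this", "Markets fall as rates rise"]

toy_vect = CountVectorizer()
toy_vect.fit(toy_texts)

# vocabulary_ is a dict mapping each lowercased token to a column index.
# The indices follow the alphabetical order of the tokens.
toy_vocab = toy_vect.vocabulary_
print(sorted(toy_vocab.items(), key=lambda item: item[1]))
```

With the default settings, text is lowercased and split into tokens of at least two characters, so each of the ten distinct words above gets its own index.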


### Step 3: Preprocessing our Data

- Preprocessing our texts using the `.transform()` method, which counts how often each vocabulary word occurs in each sentence. Our sentences will be converted to their vectorized form.
- Viewing the original and vectorized sentence (at `idx = -1`)

In [7]:
text_vects = vect.transform(texts)

In [8]:
print("Original Sentence: " + str(texts[-1]))
print("Vectorized Sentence: " + str(text_vects[-1]))

Original Sentence: Computer Makers Prepare to Stake Bigger Claim in Phones
Vectorized Sentence:   (0, 2399)	1
  (0, 4084)	1
  (0, 4499)	1
  (0, 10204)	1
  (0, 12317)	1
  (0, 15125)	1
  (0, 15720)	1
  (0, 19285)	1
  (0, 20629)	1


## Model Training

### Step 4: Training our Models

We will be using a simple **Logistic Regression** model/algorithm for our Classification task.

- Initialise the **Logistic Regression** model, imported earlier from the `sklearn` library
- Train our Logistic Regression model

In [10]:
model = LogisticRegression()
#          (Input, Output)
model.fit(text_vects, labels)

LogisticRegression()

When we fit our model, we pass in two arguments:

`model.fit(text_vects, labels)`

- `text_vects` is the sparse matrix containing the vectorized sentences
- `labels` is the array containing the corresponding labels (0 = non-clickbait / 1 = clickbait)

## Making Predictions

### Text Preprocessing

To make predictions on our own data, we have to pass it through the same preprocessing that the training data underwent.

As our training data was vectorized with the `CountVectorizer`, our own sentence has to go through the same `CountVectorizer`.

- Writing our very own sentence

We'll use this quote

> [Carbon emissions rise by nine percent](https://www.google.com/search?q=Carbon+emissions+rise+by+9%25&rlz=1C5CHFA_enSG977SG977&sxsrf=ALiCzsZ1S7FL-mK4zQS3KvKhy5hyOxUweA:1665300166876&source=lnms&sa=X&ved=2ahUKEwi5iZDHztL6AhUo4XMBHeeaDGMQ_AUoAHoECAIQAg&biw=1289&bih=719&dpr=1)

and check out [this article by Reuters](https://www.reuters.com/business/sustainable-business/china-co2-emissions-9-higher-than-pre-pandemic-levels-q1-research-2021-05-20/)... definitely not clickbait!

- Preprocessing our sentence using our fitted `CountVectorizer`

`.transform()` expects a collection of sentences, so we have to wrap our single sentence in a list.

Moreover, notice that we do not have to fit our vectorizer again. This is because we are reusing the previously fitted vectorizer to preprocess (vectorize) our sentence.

- Making predictions using our `Logistic Regression` Model

`pred` is an array with one entry per sentence. Since we pass a single sentence, `pred[0]` will be either 0 or 1.

We will write some control statements to make the output more readable.
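
As a small illustration of why the fitted vectorizer is reused rather than refitted, the sketch below transforms a new sentence with a vectorizer fitted on a toy corpus (made up for this example). Words that were not seen during fitting are simply ignored, and the output keeps one column per learnt vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a toy corpus (hypothetical); the vocabulary is: fall, rise, stocks
toy_vect = CountVectorizer().fit(["stocks rise", "stocks fall"])

# "rise" is in the learnt vocabulary; "carbon" and "emissions" are not.
# Note the list around the sentence: transform() takes a collection of texts.
new_vec = toy_vect.transform(["carbon emissions rise"])

print(new_vec.shape)  # one row, one column per vocabulary word
print(new_vec.nnz)    # number of non-zero entries in the row
```

Refitting on the new sentence would produce a different vocabulary, and the resulting columns would no longer line up with the features the model was trained on.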

In [11]:
sent = "Carbon emissions rise by nine percent"

# Preprocessing our sentence using our fitted `CountVectorizer`
vect_sent = vect.transform([sent])

# making predictions
pred = model.predict(vect_sent)

# pred will either be 0 or 1
if pred[0] == 0:
    print("This is Non-Clickbait")
else:
    print("This is Clickbait")

This is Non-Clickbait
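
Besides the hard 0/1 label, a fitted `LogisticRegression` can also report class probabilities through `predict_proba()`. The following is a minimal sketch on a tiny made-up training set. The sentences, labels and `toy_` names are assumptions for illustration, and the resulting numbers say nothing about the real classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set: 1 = clickbait, 0 = non-clickbait
toy_texts = [
    "You will not believe these ten tricks",
    "Which character are you",
    "Government announces new budget",
    "Oil prices fall after report",
]
toy_labels = [1, 1, 0, 0]

toy_vect = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vect.fit_transform(toy_texts), toy_labels)

# One row per sentence, one column per class in toy_model.classes_
probs = toy_model.predict_proba(
    toy_vect.transform(["Oil prices rise after budget report"])
)
p_non_clickbait, p_clickbait = probs[0]
print(f"non-clickbait: {p_non_clickbait:.2f}, clickbait: {p_clickbait:.2f}")
```

The columns follow the order of `toy_model.classes_`, so with labels 0 and 1 the first column is the non-clickbait probability.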


## Saving our Model

If we had to train our model every time we wanted to make a prediction, it would be very time-consuming. To deploy our `Clickbait classifier` into our Flask website, we will instead save our model into a file and load this file whenever we want to make predictions.

This means that we only have to train our model once and can then reuse it by simply loading the saved file. The fitted `CountVectorizer` is saved the same way, since new sentences must be vectorized with the vocabulary learnt during training. To do this, we will follow these steps:

- Create the "_model.pkl_" and "_vect.pkl_" files with the `pickle` library, imported earlier
- Save our model to "_model.pkl_" and our vectorizer to "_vect.pkl_".


### Pickle

`pickle` is a Python library that allows us to save Python objects (e.g. lists, dictionaries, models) into files. _.pkl_ is a commonly used extension for pickle files. Saving takes two steps:

1. Opening the "_.pkl_" file. "wb" here specifies that you want to **write** in **binary** mode.
1. Dumping (saving) the Python object into our file using the **.dump()** method.

If _model.pkl_ does not exist, it will automatically be created. If it does, it will be overwritten.

In [12]:
# "with" closes each file when its block ends, so the data is flushed to disk
with open("model.pkl", "wb") as saved_model:
    pickle.dump(model, saved_model)

with open("vect.pkl", "wb") as saved_vect:
    pickle.dump(vect, saved_vect)

We have now saved our `CountVectorizer` and `Logistic Regression` model into their own `pickle` files. Where these files end up depends on your setup; here, they are on our local computer.
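
One way to convince ourselves that a pickled object behaves like the original is a round trip: serialise, deserialise, and compare predictions. The sketch below does this in memory with `pickle.dumps()` and `pickle.loads()` on a toy model. The training sentences and `toy_` names are made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set: 1 = clickbait, 0 = non-clickbait
toy_texts = [
    "Ten tricks you will not believe",
    "Parliament passes new budget",
    "Which puppy are you",
    "Central bank raises rates",
]
toy_labels = [1, 0, 1, 0]

toy_vect = CountVectorizer()
toy_model = LogisticRegression().fit(toy_vect.fit_transform(toy_texts), toy_labels)

# Serialise to bytes and back; pickle.dump / pickle.load do the same via a file
restored_vect = pickle.loads(pickle.dumps(toy_vect))
restored_model = pickle.loads(pickle.dumps(toy_model))

sample = ["Ten tricks for the new budget"]
before = toy_model.predict(toy_vect.transform(sample))
after = restored_model.predict(restored_vect.transform(sample))
print((before == after).all())
```

Keep in mind that unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.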





In [14]:
# "rb" opens the files for reading in binary mode
with open("model.pkl", "rb") as saved_model:
    loaded_model = pickle.load(saved_model)

with open("vect.pkl", "rb") as saved_vect:
    loaded_vect = pickle.load(saved_vect)
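
With `loaded_model` and `loaded_vect` in hand, a Flask route could wrap the same transform-then-predict steps used above. The following is only a rough sketch of how that might look, not the actual app. The `create_app` helper, the `/predict` route and the JSON field `headline` are hypothetical names, and a tiny stand-in model is trained inline so the example runs without the `.pkl` files.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def create_app(vect, model):
    """Build a Flask app around an already fitted vectorizer and model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])  # hypothetical route name
    def predict():
        payload = request.get_json(silent=True) or {}
        headline = payload.get("headline", "")
        # Same preprocessing as in training: transform only, no refit
        pred = model.predict(vect.transform([headline]))[0]
        label = "Clickbait" if pred == 1 else "Non-Clickbait"
        return jsonify({"headline": headline, "label": label})

    return app


# Stand-ins for loaded_vect / loaded_model so the sketch runs on its own
demo_texts = ["You will not believe this trick", "Parliament passes new budget"]
demo_vect = CountVectorizer()
demo_model = LogisticRegression().fit(demo_vect.fit_transform(demo_texts), [1, 0])

app = create_app(demo_vect, demo_model)

# Exercise the route with Flask's test client, without starting a server
client = app.test_client()
response = client.post("/predict", json={"headline": "Parliament passes new budget"})
print(response.get_json())
```

In a real deployment, `create_app(loaded_vect, loaded_model)` would be called once at startup, so the pickle files are read a single time rather than on every request.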