# Convolutional Neural Networks (CNN) in Text Classification with PyTorch

The text classification problem can be approached in a number of ways with respect to encoding the text data to numerical values.

1. Text is modeled as the *frequency of occurrence of words* in a given text relative to the frequency of these words in the complete corpus. Example: `CountVectorizer()` and `TfidfVectorizer()` in `scikit-learn`.
2. Text is modeled as the *sequence of words or characters*. This type of approach is used mainly by the **Recurrent Neural Networks** (**RNN**).
3. Text is modeled as a *distribution of words in a given space*. This is achieved through the use of the **Convolutional Neural Network** architecture.

## The architecture of CNN in text classification

### What is a semantic space? 

Semantic spaces are representations of natural language that are capable of capturing meaning.

Reference: https://en.wikipedia.org/wiki/Semantic_space.

A *semantic space* is a way of representing the meaning of words using vectors, matrices, or other mathematical structures. 

The idea: 

**"Words that are similar in meaning will have similar or close vectors in the semantic space, while words that are different or unrelated will have distant or orthogonal vectors".**


Slogan: **"You shall know a word by the company it keeps"** (J.R. Firth).


For example, 

*   `fire` and `dog` are two words unrelated in their meaning, and in fact they are not often used in the same sentence. 
*   On the other hand, the words `dog` and `cat` are sometimes seen together, so they may share some aspect of meaning.

Mathematically,

![](images/cos.png)
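
Assuming the figure above refers to cosine similarity (its file name suggests so), the closeness of two word vectors can be sketched as follows. The three vectors are made-up toy values chosen for illustration, not learned embeddings, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors (illustrative values only).
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
fire = [0.1, 0.0, 0.9]

sim_dog_cat = cosine_similarity(dog, cat)
sim_dog_fire = cosine_similarity(dog, fire)

# Related words should score closer to 1 than unrelated ones.
print(sim_dog_cat > sim_dog_fire)
```

A value near 1 means the vectors point in almost the same direction, while a value near 0 means they are close to orthogonal.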


## A common architecture for CNN in text classification


*   Each word in a document is represented as an *embedding vector*.
*   A single convolutional layer with $m$ filters is applied, producing an $m$-dimensional vector for each n-gram in the document.
*   The vectors are combined using max-pooling followed by a ReLU activation.
*   The result is then passed to a linear layer for the final classification.
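
The four bullet points above can be sketched as a small PyTorch module. This is only an illustration of the data flow, not the model built later in this tutorial: the class name `TextCNNSketch`, the hyperparameter values, and the random input batch are all hypothetical.

```python
import torch
import torch.nn as nn

class TextCNNSketch(nn.Module):
    """Embedding -> one Conv1d with m filters -> max-pool -> ReLU -> linear."""

    def __init__(self, vocab_size, embedding_dim, num_filters, kernel_size, num_classes):
        super().__init__()
        # One extra row because index 0 is assumed to be reserved for padding.
        self.embedding = nn.Embedding(vocab_size + 1, embedding_dim, padding_idx=0)
        # Conv1d expects (batch, channels, length); the channels are the D embedding dimensions.
        self.conv = nn.Conv1d(embedding_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)       # (batch, seq_len, D)
        x = x.transpose(1, 2)               # (batch, D, seq_len)
        x = self.conv(x)                    # (batch, m, seq_len - n + 1)
        x = torch.max(x, dim=2).values      # max-pooling over all n-gram positions
        x = torch.relu(x)                   # (batch, m)
        return self.fc(x)                   # (batch, num_classes)

model = TextCNNSketch(vocab_size=100, embedding_dim=8, num_filters=4,
                      kernel_size=3, num_classes=2)
batch = torch.randint(0, 101, (5, 31))     # 5 fake documents of 31 token indices each
logits = model(batch)
print(logits.shape)
```

The max over the position axis keeps one value per filter, so the vector passed to the linear layer has length $m$ regardless of the document length.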



### Word embedding

*Word embedding* is a technique of representing words in a numeric vector form that captures their meaning and relationships with other words.

In PyTorch,

*  the *embedding* is computed by a simple lookup table: each index value selects one row of a weight matrix of some dimension.
*  The weight matrix is initialized randomly and then optimized during training to produce more useful vectors.
*  The input to the `nn.Embedding` layer is a **tensor of indices**, and the output is the corresponding embedding vectors.

We need to form a Python dictionary called `word_to_index` that maps each word in the vocabulary $V$, the collection of unique words in the corpus, to an integer index assigned in order of first appearance in the corpus.

In PyTorch, we use `nn.Embedding`, where `num_embeddings` is the number of rows of the lookup table, that is, the vocabulary size $|V|$ (plus one if index $0$ is reserved for padding), and `embedding_dim`, or $D$, is the dimensionality of the embeddings. The length of each row in $X$ is `seq_len`, which is equalized by adding extra $0$s at the end of each text; it determines the shape of the input tensor, not the size of the embedding table. Embeddings are stored as a $|V| \times D$ matrix, such that the word assigned index $i$ has its embedding stored in the $i$th row of the matrix.
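
A minimal sketch of the lookup, assuming a toy three-word vocabulary and an arbitrary $D = 4$ (both invented for this example). The table has one row per vocabulary entry plus one row for the padding index `0`, and `padding_idx=0` keeps that row at zero.

```python
import torch
import torch.nn as nn

# Toy vocabulary; indices start at 1 so that 0 can serve as the padding index.
word_to_index = {"forest": 1, "fire": 2, "near": 3}
vocab_size = len(word_to_index)
embedding_dim = 4  # D, chosen arbitrarily for this sketch

embedding = nn.Embedding(num_embeddings=vocab_size + 1,
                         embedding_dim=embedding_dim,
                         padding_idx=0)

# One padded sentence: "forest fire near" followed by two padding positions.
indices = torch.tensor([[1, 2, 3, 0, 0]])
vectors = embedding(indices)

print(embedding.weight.shape)  # rows = vocabulary size + 1, columns = D
print(vectors.shape)           # (batch, seq_len, D)
```

The sequence length only appears in the shape of the input and output tensors; the size of the table depends on the vocabulary alone.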
 

### Convolution and max pooling

In the context of Natural Language Processing (NLP), *convolution* refers to a mathematical operation that combines two input functions to produce a third. It differs from the *composition of two functions*, since convolutions are commutative while compositions are not.

*  Consider $A = \begin{bmatrix}1 & 0 & 1 & 1\\0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$, $B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}$.  Then the matrix product of $A$ and $B$ is 

$$A\cdot B = \begin{bmatrix}1 & 0 & 1 & 1\\0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1\\ 1 & 1 \\ 0 & 2 \end{bmatrix}$$

In convolution, the *kernel* is a small matrix that slides over the input (for text, the matrix of word embeddings) to perform the convolution operation. 

*  The `num_embeddings` is the number of rows of the embedding table: the vocabulary size, plus one if index `0` is reserved for padding.
*  The `seq_len` is the length of a row (standardized by padding).
*  The  `embedding_dim = D`, the dimensionality of the embeddings.
*  The `input_size` of the embedded text is `shape(seq_len, D)`.
*  The `kernel size = n` means that the kernel covers `n` consecutive tokens.
*  The `window size` of the kernel is the size of the kernel matrix, `(n, D)`.
*  The `stride` is the number of rows by which the kernel moves down the embeddings. 
*  The `output size` of the convolution operation depends on the size of the input, the size of the kernel, the stride, and the padding. 
*  `Padding` adds extra `0`s at the end of the input text.
*  `Stride` values greater than 1 downsample the output. The general rule is to use `stride=1` in usual convolutions and preserve the spatial size.
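
The dependence of the output size on input length, kernel size, stride, and padding can be written as a one-line formula. The sketch below assumes dilation 1 and `padding` zeros added on both ends of the sequence, which is the convention of `nn.Conv1d`; the function name `conv_output_length` is made up for this example.

```python
def conv_output_length(input_length, kernel_size, stride=1, padding=0):
    """Number of positions the kernel visits along the sequence axis."""
    return (input_length + 2 * padding - kernel_size) // stride + 1

print(conv_output_length(3, 1))             # 3-row input, 1-row kernel
print(conv_output_length(3, 2))             # 3-row input, 2-row kernel
print(conv_output_length(31, 3))            # 31 tokens, kernel covering 3 tokens
print(conv_output_length(31, 3, stride=2))  # same kernel, moving 2 rows at a time
```

With `stride=1` and no padding, a kernel covering `n` tokens produces `seq_len - n + 1` outputs.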

___


* Consider the input $X = \begin{bmatrix}1 & 0 & 1 & 1\\0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$ and the kernels 

    * $K_1 = \begin{bmatrix} 1 & 0 & 1 & 0 \end{bmatrix}$,
    
    * $K_2 = \begin{bmatrix} 0 & 1 & 0 & 1 \end{bmatrix}$, and 

    * $K_3 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$,
 
We have 3 convolutions on $X$ with 3 kernels and `stride=1`.


 
**Convolution 1:** $ \;\;\;X \star K_1 = \begin{bmatrix} 
2 \\ 1 \\ 0 
\end{bmatrix}$

**1st stride:**

$$\begin{bmatrix}
\color{red} 1 & \color{red} 0 & \color{red} 1 & \color{red} 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\color{red}1 & \color{red}0 & \color{red}1 & \color{red}0 \\
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\end{bmatrix} = 
\begin{bmatrix} 
\color{red}2 \\ \phantom{1} \\ \phantom{0}
\end{bmatrix}$$

**2nd stride:**

$$\begin{bmatrix}
1 & 0 & 1 &  1 \\
\color{red}0 & \color{red}1 & \color{red}1 &\color{red} 0 \\
0 & 1 & 0 & 1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\color{red}1 & \color{red}0 & \color{red}1 & \color{red}0 \\
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\end{bmatrix} = 
\begin{bmatrix} 
\phantom{2} \\ \color{red}1 \\ \phantom{0} 
\end{bmatrix}$$

**3rd stride:**

$$\begin{bmatrix}
1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 \\
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\color{red}1 & \color{red}0 & \color{red}1 & \color{red}0 \\
\end{bmatrix} = 
\begin{bmatrix} 
\phantom{2} \\\phantom{1} \\ \color{red}0 
\end{bmatrix}$$

**Convolution 2:** $\;\;\;X \star K_2 = \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}$

**1st stride:**

$$\begin{bmatrix}
\color{red} 1 & \color{red} 0 & \color{red} 1 & \color{red} 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1 \\
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\end{bmatrix} = 
\begin{bmatrix} 
\color{red}1 \\ \phantom{1} \\ \phantom{2} 
\end{bmatrix}$$

**2nd stride:**

$$\begin{bmatrix}
1 & 0 & 1 &  1 \\
\color{red}0 & \color{red}1 & \color{red}1 &\color{red} 0 \\
0 & 1 & 0 & 1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1 \\
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\end{bmatrix} = 
\begin{bmatrix} 
\phantom{1} \\ \color{red}1 \\\phantom{2} 
\end{bmatrix}$$

**3rd stride:**

$$\begin{bmatrix}
1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 \\
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1 
\end{bmatrix} = 
\begin{bmatrix} 
\phantom{1} \\ \phantom{1} \\ \color{red}2 
\end{bmatrix}$$

**Convolution 3:** $ \;\;\;X \star K_3 = \begin{bmatrix} 
3 \\ 3 \\ 
\end{bmatrix}$

**1st stride:**

$$\begin{bmatrix}
\color{red}1 & \color{red}0 & \color{red}1 & \color{red}1 \\
\color{red}0 & \color{red}1 & \color{red}1 & \color{red}0 \\
0 & 1 & 0 & 1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\color{red}1 & \color{red}0 & \color{red}1 & \color{red}0\\
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1\\
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\end{bmatrix} = 
\begin{bmatrix} 
\color{red}3 \\ \phantom{1}  
\end{bmatrix}$$

**2nd stride:**

$$\begin{bmatrix}
1 & 0 & 1 & 1 \\
\color{red}0 & \color{red}1 & \color{red}1 &\color{red} 0 \\
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1 \\ 
\end{bmatrix} \star 
\begin{bmatrix} 
\phantom{1} & \phantom{0} & \phantom{1} & \phantom{0} \\ 
\color{red}1 & \color{red}0 & \color{red}1 & \color{red}0\\
\color{red}0 & \color{red}1 & \color{red}0 & \color{red}1\\
\end{bmatrix} = 
\begin{bmatrix} 
\phantom{3} \\ \color{red}3\\
\end{bmatrix}$$



**Max pooling** is a technique that reduces the dimensionality of the output by taking the maximum value of a patch of the output.

**Max Pooling on Convolution 1**:

$X \star K_1 = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix} \Longrightarrow 
\max\left({\begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}}\right) = 
\begin{bmatrix} 2 \end{bmatrix}$

___

**Max Pooling on Convolution 2**:

$X \star K_2 = \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix} \Longrightarrow 
\max\left({\begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}}\right) = 
\begin{bmatrix} 2 \end{bmatrix}$

___

**Max Pooling on Convolution 3**:

$X \star K_3 = \begin{bmatrix} 3 \\ 3 \\ \end{bmatrix} \Longrightarrow 
\max\left({\begin{bmatrix} 3 \\ 3 \\ \end{bmatrix}}\right) = 
\begin{bmatrix} 3 \end{bmatrix}$

Finally, the pooled outputs are concatenated into a single tensor, which is passed to a linear layer followed by an activation function to obtain the final result.

$\begin{align*}
X \star K_1 = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix} &\Longrightarrow 
\max\left({\begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}}\right) = 
\begin{bmatrix} 2 \end{bmatrix} \\
X \star K_2 = \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix} &\Longrightarrow 
\max\left({\begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}}\right) = 
\begin{bmatrix} 2 \end{bmatrix} \;\;\; \Longrightarrow \;\;\;
\begin{bmatrix} 2\\ 2\\ 3 \end{bmatrix}\\
 X\star K_3 = \begin{bmatrix} 3 \\ 3 \\ \end{bmatrix} &\Longrightarrow 
\max\left({\begin{bmatrix} 3 \\ 3 \\ \end{bmatrix}}\right) = 
\begin{bmatrix} 3 \end{bmatrix}
\end{align*}$
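
The arithmetic of the worked example can be checked with a few lines of plain Python. This is a sketch written for this example (the helper `slide` is not a library function): each kernel slides down the rows of $X$ with `stride=1`, every window yields the sum of the elementwise products, and max pooling keeps the largest value per kernel.

```python
X = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 0, 1]]

K1 = [[1, 0, 1, 0]]
K2 = [[0, 1, 0, 1]]
K3 = [[1, 0, 1, 0],
      [0, 1, 0, 1]]

def slide(X, K, stride=1):
    """Slide kernel K down the rows of X; each window gives one sum of products."""
    n = len(K)  # number of rows (tokens) covered by the kernel
    out = []
    for top in range(0, len(X) - n + 1, stride):
        total = 0
        for i in range(n):
            for x, k in zip(X[top + i], K[i]):
                total += x * k
        out.append(total)
    return out

feature_maps = [slide(X, K) for K in (K1, K2, K3)]
pooled = [max(fm) for fm in feature_maps]  # max pooling, one value per kernel

print(feature_maps)
print(pooled)
```

The list `pooled` corresponds to the concatenated vector that is passed on to the linear layer.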

### A CNN Architecture

![](images/waakss1l.png)


___



## Text processing pipeline

![](images/text_processing_pipeline.PNG)

### Step 1. Load raw data

In [1]:
# Code 1. Loading raw data
# Read the raw CSV file and split it into
# text (X_raw) and target (y):
# tweets about a real disaster (1) or not (0).

import pandas as pd

df = pd.read_csv('datasets/tweets.csv')

X_raw = df["text"].values
y = df["target"].values



In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [3]:
for i in range(5):
    print(i,":",df.text[i])

0 : Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
1 : Forest fire near La Ronge Sask. Canada
2 : All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
3 : 13,000 people receive #wildfires evacuation orders in California 
4 : Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 


In [4]:
for i in range(5):
    print(i,":",df.target[i])

0 : 1
1 : 1
2 : 1
3 : 1
4 : 1


### Step 2. Lowering cases and removing special symbols

In this step, we convert the text to lowercase and remove punctuation and other special symbols. Note that the pattern `[^\w\s]` used below keeps digits, so numbers such as `13000` remain in the text.

In [5]:
# Code 2 Lowering cases and removing special symbols

import re 

X_lower = [x.lower() for x in X_raw]
X_no_punctuation = [re.sub(r'[^\w\s]', '', x) for x in X_lower]

In [6]:
for i in range(5):
    print(i, X_no_punctuation[i])

0 our deeds are the reason of this earthquake may allah forgive us all
1 forest fire near la ronge sask canada
2 all residents asked to shelter in place are being notified by officers no other evacuation or shelter in place orders are expected
3 13000 people receive wildfires evacuation orders in california 
4 just got sent this photo from ruby alaska as smoke from wildfires pours into a school 
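
As a quick check of what the pattern keeps and drops, the sketch below applies it to a single made-up sentence. The helper names `strip_symbols` and `letters_only` are hypothetical, and the second pattern is shown only as one possible way to also drop digits; it is not used elsewhere in this tutorial.

```python
import re

def strip_symbols(text):
    """Lowercase and remove everything except word characters and whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())

def letters_only(text):
    """Variant that also removes digits and underscores (assumes ASCII letters)."""
    return re.sub(r'[^a-z\s]', '', text.lower())

sample = "13,000 people receive #wildfires evacuation orders!"

kept_digits = strip_symbols(sample)
no_digits = " ".join(letters_only(sample).split())  # also collapse leftover spaces

print(kept_digits)
print(no_digits)
```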


### Step 3. Tokenization, Removing Stop Words, and Lemmatization

For tokenization, stop-word removal, and lemmatization, we use functions from the `nltk` library. In the code below, the stop-word removal lines are commented out, so stop words are kept.

In [7]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lexmuga\anaconda3\envs\math103b\lib\nltk_data
[nltk_data]     ...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [8]:
# Code 3. Tokenization, removing stop words and lemmatization

lemmatizer = WordNetLemmatizer()

# set_stop_words = set(stopwords.words('english'))
X_tokens = [word_tokenize(sentence) for sentence in X_no_punctuation]
#X_no_stops = [[word for word in tokens if word not in set_stop_words] for tokens in X_tokens]

X_lemmas = []
for sentence in X_tokens:
    lemmas = [lemmatizer.lemmatize(word, pos="v") for word in sentence]
    lemmas = [lemmatizer.lemmatize(word, pos="n") for word in lemmas]
    lemmas = [lemmatizer.lemmatize(word, pos="a") for word in lemmas]
    lemmas = [lemmatizer.lemmatize(word, pos="r") for word in lemmas]
    lemmas = [lemmatizer.lemmatize(word, pos="s") for word in lemmas]
    X_lemmas.append(lemmas)

In [9]:
for i in range(5):
    print(f'{i} : {X_tokens[i]}')


0 : ['our', 'deeds', 'are', 'the', 'reason', 'of', 'this', 'earthquake', 'may', 'allah', 'forgive', 'us', 'all']
1 : ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada']
2 : ['all', 'residents', 'asked', 'to', 'shelter', 'in', 'place', 'are', 'being', 'notified', 'by', 'officers', 'no', 'other', 'evacuation', 'or', 'shelter', 'in', 'place', 'orders', 'are', 'expected']
3 : ['13000', 'people', 'receive', 'wildfires', 'evacuation', 'orders', 'in', 'california']
4 : ['just', 'got', 'sent', 'this', 'photo', 'from', 'ruby', 'alaska', 'as', 'smoke', 'from', 'wildfires', 'pours', 'into', 'a', 'school']




### Step 4. Building the word_to_index dictionary and vocabulary

In [11]:
# Code 4. word_to_idx
# Build the vocabulary and a dictionary that maps each token
# to its index-based representation.

from itertools import chain

# Flatten the list of lemmatized sentences into a single list of words.
words = list(chain.from_iterable(X_lemmas))

unique_words = set()
vocab = []

for word in words:
    if word not in unique_words:
        vocab.append(word)
        unique_words.add(word)

# Indices start at 1; index 0 is reserved for padding (see Step 5).
word_to_idx = {word: i+1 for i, word in enumerate(vocab)}
vocab_size = len(vocab)

In [12]:
print(vocab_size)
print(len(word_to_idx))

20098
20098


In [13]:
list(word_to_idx.items())[:10]

[('our', 1),
 ('deed', 2),
 ('be', 3),
 ('the', 4),
 ('reason', 5),
 ('of', 6),
 ('this', 7),
 ('earthquake', 8),
 ('may', 9),
 ('allah', 10)]

### Step 4 (continued). Encoding the words

In [18]:
X_encoded = []
for sentence in X_lemmas:
    word_encoded = []
    for word in sentence:
        word_encoded.append(word_to_idx[word])
    X_encoded.append(word_encoded)


In [19]:
for i in range(5):
    print(i, X_encoded[i])

0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
1 [14, 15, 16, 17, 18, 19, 20]
2 [13, 21, 22, 23, 24, 25, 26, 3, 3, 27, 28, 29, 30, 31, 32, 33, 24, 25, 26, 34, 3, 35]
3 [36, 37, 38, 39, 32, 34, 25, 40]
4 [41, 42, 43, 7, 44, 45, 46, 47, 48, 49, 45, 39, 50, 51, 48, 52]


### Step 5. Padding the sentences

In [20]:
# Code 5. Padding
# Each sentence shorter than the longest one (seq_len)
# is padded at the end with the index 0.
# Note: append() modifies the lists in X_encoded in place.

import numpy as np

pad_idx = 0
X_padded = list()
seq_len = np.max([len(x) for x in X_encoded])

for sentence in X_encoded:
    while len(sentence) < seq_len:
        sentence.append(pad_idx)
    X_padded.append(sentence)

X_padded = np.array(X_padded)

In [21]:
print(seq_len)

31


In [23]:
for i in range(5):
    print(i, X_encoded[i])


0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 [14, 15, 16, 17, 18, 19, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 [13, 21, 22, 23, 24, 25, 26, 3, 3, 27, 28, 29, 30, 31, 32, 33, 24, 25, 26, 34, 3, 35, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3 [36, 37, 38, 39, 32, 34, 25, 40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
4 [41, 42, 43, 7, 44, 45, 46, 47, 48, 49, 45, 39, 50, 51, 48, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
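
Encoding and padding can also be combined in a helper that leaves its input untouched, which is convenient when new text has to be prepared after training. This is a sketch under stated assumptions: the function `encode_and_pad` and the toy vocabulary are invented for illustration, and mapping unknown words to the padding index is only one possible choice (a dedicated unknown-word index is another).

```python
def encode_and_pad(tokens, word_to_idx, seq_len, pad_idx=0):
    """Map tokens to indices, then right-pad (or truncate) to seq_len.

    Words missing from word_to_idx are mapped to pad_idx, which is an
    assumption of this sketch.
    """
    ids = [word_to_idx.get(tok, pad_idx) for tok in tokens][:seq_len]
    return ids + [pad_idx] * (seq_len - len(ids))

toy_vocab = {"forest": 14, "fire": 15, "near": 16}
tokens = ["forest", "fire", "near", "town"]   # "town" is not in toy_vocab

row = encode_and_pad(tokens, toy_vocab, seq_len=6)
short = encode_and_pad(tokens, toy_vocab, seq_len=2)

print(row)
print(short)
```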


### Step 6. Split data 

**Split into train and test**. The last step in this preprocessing pipeline is to divide the data into training and test sets. For this we will use the `train_test_split` function provided by scikit-learn.

In [24]:
# Code 6. Split the train and test data sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.25, random_state=42)


In [25]:
print(type(X_train))
print(type(y_train))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [26]:
print(X_train[:5])

[[15262 15263     7   421  1615  3666  1599  3585  3666    94    45    83
     99    83   800     3   525    23   619    79 15264  3585     0     0
      0     0     0     0     0     0     0]
 [17640     4    68     6  8424  5660     3 17641 16232    79   201   202
    256     3  6033  1916 17642     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]
 [ 1188    48    80     3 11135    85  4157 11136   189   635  4918    79
   3295   418    13   520     4    26   185   158   139   790     3   418
     23   811    48   197   707     0     0]
 [    4  1876    76  3737   237   563     4    12    51    48  1876  1423
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]
 [ 2197  9347    65   180  9348  9349     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0]]


In [27]:
for i in range(5):
    print(len(X_train[i]))

31
31
31
31
31


In [28]:
print(y_train[:5])

[0 0 0 1 1]


### Step 7. Building TensorDataset and DataLoader

In [32]:
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 2  # Adjust as needed

train_dataset = TensorDataset(torch.tensor(X_train),
                              torch.tensor(y_train))
test_dataset = TensorDataset(torch.tensor(X_test),
                             torch.tensor(y_test))
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
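
To see what a `DataLoader` yields, one batch can be pulled out and its shapes inspected. The sketch below uses small random tensors (`X_fake`, `y_fake`) as stand-ins for the real training arrays, so it runs without the dataset; the sizes are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for X_train / y_train: 6 padded sentences of length 31.
X_fake = torch.randint(0, 100, (6, 31))
y_fake = torch.randint(0, 2, (6,))

loader = DataLoader(TensorDataset(X_fake, y_fake), batch_size=2, shuffle=True)

# Each iteration yields one (inputs, labels) pair of tensors.
xb, yb = next(iter(loader))

print(xb.shape, yb.shape)  # one batch of token indices and one batch of labels
print(len(loader))         # number of batches per epoch
```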


![](images/text_processing_pipeline.PNG)

# Next steps: creating the CNN model, training and evaluating the model

![](images/text_classification_model.PNG)