<a href="https://colab.research.google.com/github/syedshahlal/Generative_DL/blob/main/Convolution_Neural_Network_(CNN).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Convolutional Neural Networks (CNN)**

This notebook applies Convolutional Neural Networks (CNNs) to text for natural language processing (NLP) tasks.

## **Overview**

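At the core of CNNs are filters (also called weights or kernels) that convolve (slide) across the input to extract relevant features. The filters are initialized randomly and learn during training to act as feature extractors. Because the same filter weights are reused at every position of the input (parameter sharing), a small set of weights can detect a pattern wherever it occurs.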


In [4]:
from IPython.display import Image

Image(url='https://madewithml.com/static/images/foundations/cnn/convolution.gif')
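
The sliding operation can be sketched with a toy 1D example. This is a minimal illustration, not the notebook's model: the input `x`, the width-3 filter `w`, and their values are assumptions chosen for readability, and in a trained CNN the filter weights would be learned rather than hand-picked.

```python
import torch
import torch.nn.functional as F

# Toy input with shape (batch=1, channels=1, length=6); values are illustrative.
x = torch.tensor([[[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]])

# One hand-picked filter with shape (out_channels=1, in_channels=1, width=3).
w = torch.tensor([[[1.0, 0.0, -1.0]]])

# Slide the filter by hand: dot product of the filter with each window of 3 values.
kernel_size = w.shape[-1]
num_windows = x.shape[-1] - kernel_size + 1
manual = torch.stack(
    [(x[0, 0, i : i + kernel_size] * w[0, 0]).sum() for i in range(num_windows)]
)

# The same computation with PyTorch's built-in 1D convolution (stride 1, no padding).
out = F.conv1d(x, w)

print(manual)     # one value per window position
print(out.shape)  # (batch, out_channels, num_windows)
```

The same filter weights are applied at every window, which is what parameter sharing means in practice.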

* ### **Objective:**

  * Extract meaningful spatial substructure from encoded data.

* ### **Advantages:**
  * Small number of weights (shared)
  * Parallelizable
  * Detects spatial substructures (feature extractors)
  * Interpretability via filters
  * Can be used to process images, text, time series, etc.

* ### **Disadvantages:**
  * Many hyperparameters (kernel size, strides, etc.) to tune.

* ### **Miscellaneous:**
  * Lots of deep CNN architectures are constantly updated for SOTA performance.
  * Very popular feature extractor that acts as a foundation for many architectures.
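
Two of the points above, weight sharing and the effect of kernel size and stride, can be checked numerically. The following is a small sketch in which every size is an arbitrary assumption chosen for illustration, not a value used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions for this sketch only).
in_channels, num_filters, kernel_size, stride, seq_len = 8, 4, 3, 2, 10

conv = nn.Conv1d(in_channels, num_filters, kernel_size=kernel_size, stride=stride)

# Weight sharing: the parameter count depends on the filter shape, not on seq_len.
num_params = sum(p.numel() for p in conv.parameters())
expected_params = num_filters * in_channels * kernel_size + num_filters  # weights + biases

# Output length without padding or dilation: floor((L - K) / S) + 1.
x = torch.randn(1, in_channels, seq_len)
out = conv(x)
expected_len = (seq_len - kernel_size) // stride + 1

print(num_params, expected_params)
print(out.shape[-1], expected_len)
```

Changing `kernel_size` or `stride` changes the output length, which is one reason these hyperparameters need tuning together.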

## **Setup**
Let's set our seed and device.

In [2]:
import numpy as np
import pandas as pd
import random
import torch
import torch.nn as nn

In [3]:
SEED = 1234

In [4]:
def set_seeds(seed=1234):
  """Set seeds for reproducibility."""
  np.random.seed(seed)
  random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)    # multi-GPU


In [5]:
# Set seeds for reproducibility
set_seeds(seed=SEED)
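
A quick sanity check of what seeding provides: re-seeding and drawing again should give identical values. The helper `draw_samples` below is a hypothetical name used only for this sketch and is not part of the notebook.

```python
import random

import numpy as np
import torch


def draw_samples(seed):
    """Seed all three generators, then draw one number from each (illustrative helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    return random.random(), float(np.random.rand()), torch.rand(1).item()


first = draw_samples(1234)
second = draw_samples(1234)

# Same seed, same draws.
print(first == second)
```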

In [6]:
# Set device
cuda = True
device = torch.device("cuda" if (
    torch.cuda.is_available() and cuda) else "cpu")
# Note: torch.set_default_tensor_type is deprecated in recent PyTorch releases
torch.set_default_tensor_type("torch.FloatTensor")
if device.type == "cuda":
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
print(device)


cuda
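
An alternative to changing the global default tensor type is to place tensors and modules on the device explicitly. The following is a minimal sketch of that pattern, assuming a toy `nn.Linear` layer that is not part of this notebook's model.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the tensor directly on the chosen device.
x = torch.randn(2, 3, device=device)

# Move a (toy, illustrative) module's parameters to the same device.
model = nn.Linear(3, 1).to(device)

# Inputs and parameters must live on the same device for the forward pass.
y = model(x)
print(y.shape, y.device)
```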




## **Load data**

We will download the AG News dataset, which consists of 120K training samples and 7.6K test samples of text from 4 unique classes (Business, Sci/Tech, Sports, World).

In [7]:
# Load data
from datasets import load_dataset

ds = load_dataset("fancyzhx/ag_news")



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [11]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])

In [20]:
df_new = df.copy()  # Copying the DataFrame to avoid modifying the original one

# Mapping function
def label_to_category(label):
    if label == 0:
        return 'World'
    elif label == 1:
        return 'Sports'
    elif label == 2:
        return 'Business'
    else:
        return 'Sci/Tech'

# Apply the mapping function to each row in the 'label' column
df_new['category'] = df_new['label'].apply(label_to_category)
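
The same label-to-category mapping can also be written as a dictionary lookup with `Series.map`. The sketch below uses a four-row toy DataFrame (an assumption for illustration) rather than the full dataset. Note one behavioral difference: labels missing from the dictionary become `NaN` instead of falling through to `'Sci/Tech'`.

```python
import pandas as pd

# Same mapping as label_to_category, expressed as a dictionary.
LABEL_TO_CATEGORY = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Toy stand-in for the real DataFrame (illustrative values only).
toy = pd.DataFrame({"label": [2, 0, 1, 3]})

# Vectorized lookup: each label is replaced by its category name.
toy["category"] = toy["label"].map(LABEL_TO_CATEGORY)
print(toy)
```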


In [25]:
df_new

| | text | label | category |
|---|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 | Business |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 | Business |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 | Business |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 | Business |
| 4 | Oil prices soar to all-time record, posing new... | 2 | Business |
| ... | ... | ... | ... |
| 119995 | Pakistan's Musharraf Says Won't Quit as Army C... | 0 | World |
| 119996 | Renteria signing a top-shelf deal Red Sox gene... | 1 | Sports |
| 119997 | Saban not going to Dolphins yet The Miami Dolp... | 1 | Sports |
| 119998 | Today's NFL games PITTSBURGH at NY GIANTS Time... | 1 | Sports |


## **Preprocessing**
We'll first clean the input data with operations such as lowercasing the text, removing stop (filler) words, and filtering with regular expressions.
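
To make these operations concrete, here is a minimal sketch on a single toy sentence. The three-word stopword list and the sample text are assumptions for illustration, and the final whitespace-collapsing step is added only for this sketch.

```python
import re

# Tiny hard-coded stopword list, used here only for illustration.
toy_stopwords = ["the", "a", "is"]

text = "The CNN (a neural network) is fast"

# Lowercase the text.
text = text.lower()

# Remove stopwords that appear as whole words, plus any whitespace after them.
pattern = re.compile(r"\b(" + r"|".join(toy_stopwords) + r")\b\s*")
text = pattern.sub("", text)

# Remove anything inside parentheses.
text = re.sub(r"\([^)]*\)", "", text)

# Collapse leftover whitespace (extra step added for this sketch).
text = " ".join(text.split())

print(text)  # cnn fast
```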

In [22]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

In [23]:
nltk.download('stopwords')
STOPWORDS = stopwords.words("english")
print(STOPWORDS[:5])

['i', 'me', 'my', 'myself', 'we']




In [24]:
porter = PorterStemmer()
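
Stemming reduces inflected words to a common root so that variants share one token. The following short sketch shows what `PorterStemmer` does, using a few sample words chosen for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Sample words (illustrative); each is reduced to its stem.
words = ["running", "filters", "networks"]
stems = [porter.stem(word) for word in words]

print(stems)  # ['run', 'filter', 'network']
```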

In [None]:
def preprocess(text, stopwords=STOPWORDS):
  """ Conditional preprocessing on our text unique to our task. """

  # Lower
  text = text.lower()

  # Remove stopwords
  pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
  text = pattern.sub("", text)

  # Remove text in parentheses
  text = re.sub(r"\([^)]*\)", "", text)

  # Spacing and filter
  text = re.sub(r"([-;:.,!?<=>])", r" \1", text)      # separate punctuation tied to words