We will import the IMDB dataset and use it for sentiment analysis.

Accessing Dataset: imdb

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

In [2]:
from datasets import load_dataset

In [3]:
ds = load_dataset('imdb')

Checking our imported dataset.

In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We have 3 splits here: train, test and unsupervised.

In [5]:
ds['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [6]:
ds['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}
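
The `ClassLabel` entry above means that labels are stored as integers indexing into `names`, so 0 is `neg` and 1 is `pos`. A minimal sketch of decoding them by hand, assuming that order (the helper `decode_label` is illustrative and not part of the `datasets` library); a label of -1, as used by the unsupervised split, is treated as unlabeled:

```python
# Label names in the order reported by ds['train'].features
LABEL_NAMES = ['neg', 'pos']

def decode_label(label_id):
    """Map an integer label to its name; -1 marks an unlabeled example."""
    if label_id == -1:
        return 'unlabeled'
    return LABEL_NAMES[label_id]

print([decode_label(i) for i in (0, 1, -1)])  # ['neg', 'pos', 'unlabeled']
```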

Now we will convert our dataset into pandas DataFrames for data preprocessing.

In [7]:
import pandas as pd

In [8]:
df_train = ds['train'].to_pandas()
df_test = ds['test'].to_pandas()
df_unsupervised = ds['unsupervised'].to_pandas()

In [9]:
df_train.head()

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


In [10]:
df_train.shape

(25000, 2)

In [11]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [12]:
df_train.describe()

         label
count  25000.0
mean       0.5
std    0.50001
min        0.0
25%        0.0
50%        0.5
75%        1.0
max        1.0


In [13]:
df_unsupervised.head()

                                                text  label
0  This is just a precious little diamond. The pl...     -1
1  When I say this is my favourite film of all ti...     -1
2  I saw this movie because I am a huge fan of th...     -1
3  Being that the only foreign films I usually li...     -1
4  After seeing Point of No Return (a great movie...     -1


In [14]:
df_unsupervised.describe()

         label
count  50000.0
mean      -1.0
std        0.0
min       -1.0
25%       -1.0
50%       -1.0
75%       -1.0
max       -1.0


We will use only the train and test splits, since we are building a supervised sentiment-analysis model and need labeled data. The unsupervised split has no usable labels (every label is -1), so we can set df_unsupervised aside for now.

We will first clean the text as follows:
1. Remove punctuation
2. Remove newline characters and leading/trailing spaces
3. Lowercase all characters

In [18]:
import string

In [19]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [20]:
# We will define the remove_punctuation function
def remove_punctuation(text):
    text = ''.join([char for char in text if char not in string.punctuation])
    return text

In [21]:
# Now we will apply the function to the 'text' column of the train and test DataFrames
df_train['text'] = df_train['text'].apply(remove_punctuation)
df_test['text'] = df_test['text'].apply(remove_punctuation)

In [22]:
print(df_train.head())


                                                text  label
0  I rented I AM CURIOUSYELLOW from my video stor...      0
1  I Am Curious Yellow is a risible and pretentio...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godards Mas...      0
4  Oh brotherafter hearing about this ridiculous ...      0
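
Note that deleting punctuation outright fuses words that it used to separate: `CURIOUS-YELLOW` becomes `CURIOUSYELLOW` and `Oh, brother...after` becomes `Oh brotherafter`. The raw reviews also contain `<br />` tags, which leave stray `br` tokens once the angle brackets and slash are removed. One possible alternative, sketched here on the assumption that replacing punctuation with spaces is acceptable for this task (the function name `remove_punctuation_spaced` is illustrative and not part of the notebook):

```python
import re
import string

# Translation table that maps every punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def remove_punctuation_spaced(text):
    # Replace HTML line breaks such as <br /> with a space first
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Replace punctuation with spaces so neighbouring words stay separate
    text = text.translate(_PUNCT_TO_SPACE)
    # Collapse the runs of whitespace introduced above
    return ' '.join(text.split())

sample = 'CURIOUS-YELLOW.<br /><br />Oh, brother...after'
print(remove_punctuation_spaced(sample))  # CURIOUS YELLOW Oh brother after
```

A trade-off of this variant is that apostrophes also become spaces, so `Godard's` would turn into `Godard s`; whether that matters depends on the later tokenization and stop-word steps.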


We will now replace newline characters with spaces and strip leading and trailing whitespace.

In [23]:
# New function to clean text further
def clean_text(text):
    text = text.replace('\n', ' ')
    text = text.strip()
    return text

In [24]:
# Apply the cleaning function to the 'text' column of both DataFrames
df_train['text'] = df_train['text'].apply(clean_text)
df_test['text'] = df_test['text'].apply(clean_text)

In [25]:
# Check the cleaned data
print(df_train.head())

                                                text  label
0  I rented I AM CURIOUSYELLOW from my video stor...      0
1  I Am Curious Yellow is a risible and pretentio...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godards Mas...      0
4  Oh brotherafter hearing about this ridiculous ...      0
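
The preview looks the same as before. Note that `strip()` only trims whitespace at the two ends of a string, and `replace('\n', ' ')` leaves any repeated spaces inside the text in place. If collapsing internal whitespace is also wanted, a minimal sketch (the name `collapse_whitespace` is hypothetical) could be:

```python
def collapse_whitespace(text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace
    return ' '.join(text.split())

print(collapse_whitespace('  great\nmovie   overall \t'))  # great movie overall
```

Because the tokenization step later uses `split()` with no arguments, extra internal whitespace does not change the resulting tokens either way.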


Now we will lowercase the text in both DataFrames.

In [26]:
# Convert everything to lowercase
df_train['text'] = df_train['text'].str.lower()
df_test['text'] = df_test['text'].str.lower()

In [27]:
# Check the lowercase data
print(df_train.head())

                                                text  label
0  i rented i am curiousyellow from my video stor...      0
1  i am curious yellow is a risible and pretentio...      0
2  if only to avoid making this type of film in t...      0
3  this film was probably inspired by godards mas...      0
4  oh brotherafter hearing about this ridiculous ...      0


Now we will do the following for further pre-processing:
1. Tokenization: Break text into tokens.
2. Stop Word Removal: Filter out unnecessary words.
3. Word2Vec: Generate embeddings for the cleaned tokens.

In [29]:
# Tokenizing the text column in the training and testing datasets
df_train['tokens'] = df_train['text'].apply(lambda x: x.split())
df_test['tokens'] = df_test['text'].apply(lambda x: x.split())

In [34]:
# Drop the original 'text' column to save memory, as it is no longer needed
df_train.drop(columns=['text'], inplace=True)
df_test.drop(columns=['text'], inplace=True)

In [35]:
df_train.head()

   label                                             tokens
0      0  [i, rented, i, am, curiousyellow, from, my, vi...
1      0  [i, am, curious, yellow, is, a, risible, and, ...
2      0  [if, only, to, avoid, making, this, type, of, ...
3      0  [this, film, was, probably, inspired, by, goda...
4      0  [oh, brotherafter, hearing, about, this, ridic...


Now we can remove the stop words.

In [36]:
from nltk.corpus import stopwords
import nltk

# Download stopwords
nltk.download('stopwords')

# Build a set of English stop words
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [37]:
# Remove stop words from the tokens
df_train['tokens'] = df_train['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
df_test['tokens'] = df_test['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

In [38]:
df_train.head()

   label                                             tokens
0      0  [rented, curiousyellow, video, store, controve...
1      0  [curious, yellow, risible, pretentious, steami...
2      0  [avoid, making, type, film, future, film, inte...
3      0  [film, probably, inspired, godards, masculin, ...
4      0  [oh, brotherafter, hearing, ridiculous, film, ...
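
One caveat for sentiment analysis: NLTK's English stop-word list includes negations such as `not`, `no` and `nor`, so a phrase like "not good" is reduced to `good`. A hedged sketch of keeping negations, using a small hand-written stop-word set in place of `stopwords.words('english')` so that it runs without the download (the set contents and the names `NEGATIONS` and `remove_stop_words` are assumptions for illustration):

```python
# Tiny stand-in for set(stopwords.words('english')); illustrative only
stop_words = {'i', 'this', 'was', 'is', 'a', 'the', 'not', 'no', 'nor'}

# Negation words we choose to keep because they can flip sentiment
NEGATIONS = {'not', 'no', 'nor'}
stop_words_keep_negation = stop_words - NEGATIONS

def remove_stop_words(tokens, stop_set):
    # Keep only tokens that are not in the given stop-word set
    return [word for word in tokens if word not in stop_set]

tokens = 'this film was not good'.split()
print(remove_stop_words(tokens, stop_words))                # ['film', 'good']
print(remove_stop_words(tokens, stop_words_keep_negation))  # ['film', 'not', 'good']
```

Whether keeping negations helps here is an empirical question, since averaging word vectors discards word order anyway.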


Now we can represent each document as a vector using Word2Vec.

In [39]:
import gensim
from gensim.models import Word2Vec
import numpy as np

In [40]:
# Create the Word2Vec model
word2vec_model = Word2Vec(sentences=df_train['tokens'], vector_size=100, window=5, min_count=5)

In [41]:
# Save the model
word2vec_model.save("word2vec_model")

In [43]:
# Define a function to get the average embedding for a document
def get_average_word2vec(tokens, model, vector_size):
    vector = np.zeros(vector_size)
    count = 0
    for token in tokens:
        if token in model.wv:
            vector += model.wv[token]
            count += 1
    # Return the average
    if count > 0:
        vector /= count
    return vector
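
To make the averaging concrete, here is a dependency-free sketch of the same logic, with a plain dict standing in for `model.wv` and made-up 2-dimensional vectors (both are assumptions for illustration). Tokens missing from the vocabulary are skipped, and a document with no known tokens maps to the zero vector:

```python
def average_vectors(tokens, vectors, vector_size):
    # Running sum of the vectors of in-vocabulary tokens
    total = [0.0] * vector_size
    count = 0
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
            count += 1
    # Divide by the number of known tokens; stay at zero if there are none
    if count > 0:
        total = [t / count for t in total]
    return total

toy_vectors = {'good': [1.0, 3.0], 'film': [3.0, 1.0]}
print(average_vectors(['good', 'film', 'unseen'], toy_vectors, 2))  # [2.0, 2.0]
print(average_vectors(['unseen'], toy_vectors, 2))                  # [0.0, 0.0]
```

In the notebook, skipped tokens are those that did not reach `min_count=5` in the training data, which includes any word that appears only in the test set.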

In [44]:
# Apply the function to get Word2Vec embeddings for each document
train_embeddings = np.array([get_average_word2vec(tokens, word2vec_model, 100) for tokens in df_train['tokens']])
test_embeddings = np.array([get_average_word2vec(tokens, word2vec_model, 100) for tokens in df_test['tokens']])

In [45]:
print("Train Embeddings Shape:", train_embeddings.shape)
print("Test Embeddings Shape:", test_embeddings.shape)

Train Embeddings Shape: (25000, 100)
Test Embeddings Shape: (25000, 100)


We can now save our DataFrames and continue working on them in notebook 2.
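
A minimal sketch of one way to persist the results, assuming NumPy's `.npy` format is acceptable. The file names and the temporary directory are placeholders, and small random arrays stand in for `train_embeddings` and the label column:

```python
import os
import tempfile

import numpy as np

# Stand-ins for train_embeddings and df_train['label'].to_numpy()
embeddings = np.random.rand(4, 100)
labels = np.array([0, 1, 1, 0])

# Placeholder output directory; replace with a real path in the notebook
out_dir = tempfile.mkdtemp()
emb_path = os.path.join(out_dir, 'train_embeddings.npy')
lab_path = os.path.join(out_dir, 'train_labels.npy')

# Write both arrays to disk
np.save(emb_path, embeddings)
np.save(lab_path, labels)

# In notebook 2, load the arrays back
loaded_embeddings = np.load(emb_path)
loaded_labels = np.load(lab_path)
print(loaded_embeddings.shape, loaded_labels.shape)
```

If the token lists are also needed later, `df_train.to_pickle(...)` is another option, since it preserves the list-valued `tokens` column.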