Machine learning models cannot learn directly from raw text. Instead, you need to convert the text to a numeric representation first.

The simplest common representation is a variation of one-hot encoding. You represent each document as a vector of term frequencies for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).
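
As a minimal sketch of this idea, the snippet below builds a vocabulary from a tiny made-up corpus and turns each document into a vector of term frequencies. The two sentences and the helper `term_frequency_vector` are hypothetical and only serve as illustration; the notebook itself uses scikit-learn's vectorisers further down.

```python
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the BBC dataset)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenise on whitespace and build a sorted vocabulary over the whole corpus
tokenised = [doc.split() for doc in corpus]
vocabulary = sorted({token for doc in tokenised for token in doc})

def term_frequency_vector(tokens, vocabulary):
    """Return the count of each vocabulary term in one tokenised document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [term_frequency_vector(doc, vocabulary) for doc in tokenised]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary term, so every document ends up with the same length regardless of how many words it contains.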

In [1]:
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from pathlib import Path
from random import randint
import numpy as np
import pandas as pd

# sklearn for feature extraction & modeling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.model_selection import train_test_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set_style('white')

In [2]:
DATA_DIR = Path('../data/01_raw/')
path = DATA_DIR / 'bbc'
files = sorted(path.glob('**/*.txt'))
doc_list = []
for file in files:
    with file.open(encoding='latin1') as f:
        # the parent folder name is the article's topic
        topic = file.parts[-2]
        lines = f.readlines()
        # the first line is the heading, the remaining lines form the body
        heading = lines[0].strip()
        body = ' '.join([l.strip() for l in lines[1:]])
        doc_list.append([topic.capitalize(), heading, body])

In [3]:
# Convert to dataframe
docs = pd.DataFrame(doc_list, columns=['Category', 'Heading', 'Article'])
# drop files that sit directly in the 'bbc' folder rather than in a topic subfolder
docs = docs[docs['Category'] != 'Bbc']
docs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2225 entries, 0 to 2225
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  2225 non-null   object
 1   Heading   2225 non-null   object
 2   Article   2225 non-null   object
dtypes: object(3)
memory usage: 69.5+ KB


In [6]:
docs.to_csv('../data/02_intermediate/cleaned_df.csv', index=False)

In [7]:
docs.head()

Unnamed: 0,Category,Heading,Article
0,Business,Ad sales boost Time Warner profit,Quarterly profits at US media giant TimeWarne...
1,Business,Dollar gains on Greenspan speech,The dollar has hit its highest level against ...
2,Business,Yukos unit buyer faces loan claim,The owners of embattled Russian oil giant Yuk...
3,Business,High fuel prices hit BA's profits,British Airways has blamed high fuel prices f...
4,Business,Pernod takeover talk lifts Domecq,Shares in UK drinks and food firm Allied Dome...


Tokenisation

[Sample ref](https://www.kaggle.com/code/itratrahman/nlp-tutorial-using-python/notebook#Feature-Engineering)

In [34]:
import nltk
# nltk.download()
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.stem import SnowballStemmer
# stopwords=set(stopwords.words('english'))

from nltk.tokenize import RegexpTokenizer
tokeniser = RegexpTokenizer(r'\w+')

In [19]:
docs = pd.read_csv('../data/02_intermediate/cleaned_df.csv')
docs.columns = docs.columns.str.lower()

In [18]:
def remove_punctuation(text):
    import string
    # replacing the punctuations with no space, 
    # which in effect deletes the punctuation marks 
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

# extract the English stopwords from the nltk library as a set for fast lookup
sw = set(stopwords.words('english'))

# named remove_stopwords so it does not shadow the nltk `stopwords` import
def remove_stopwords(text):
    '''Lowercase the text and remove English stopwords.'''
    # keep the lowercased words that are not in the stopword set
    words = [word.lower() for word in text.split() if word.lower() not in sw]
    # join the remaining words with a space separator
    return " ".join(words)

In [21]:
docs['article'] = docs['article'].apply(remove_punctuation)
docs['article'] = docs['article'].apply(remove_stopwords)
docs.head(2)

Unnamed: 0,category,heading,article
0,Business,Ad sales boost Time Warner profit,quarterly profits us media giant timewarner ju...
1,Business,Dollar gains on Greenspan speech,dollar hit highest level euro almost three mon...


Top words before stemming

In [22]:
# create the count vectoriser
count_vectorizer = CountVectorizer()
# fit the count vectoriser on the text data
count_vectorizer.fit(docs['article'])
# vocabulary_ maps each term to its column index (not to its frequency)
dictionary = count_vectorizer.vocabulary_.items()

In [28]:
# store the vocabulary terms and their values in a dataframe
vocab = []
count = []
# iterate over the vocabulary and append each term and value to its list
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# NOTE: 'count' holds each term's index in the vocabulary, not how often it occurs
# store the values in a pandas dataframe with the terms as index
vocab_bef_stem = pd.DataFrame(count, index=vocab, columns=['count'])
vocab_bef_stem = vocab_bef_stem.sort_values(['count'], ascending=False)
vocab_bef_stem

Unnamed: 0,count
zvyagintsev,33893
zvonareva,33892
zutons,33891
zurichs,33890
zurich,33889
...,...
001,4
00051,3
0001,2
000,1
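
The values in this table are the column indices that `vocabulary_` assigns to the terms in alphabetical order, which is why the rows run from `zvyagintsev` down to `000` instead of from the most to the least frequent word. The sketch below contrasts a term's index with its corpus frequency on a hypothetical two-document corpus, using plain Python in place of the vectoriser; `term_index` and `term_count` are illustrative names.

```python
from collections import Counter

# Toy corpus (hypothetical, already lowercased and stripped of punctuation)
toy_docs = ["profit rose profit fell", "sales rose"]
tokens = [token for doc in toy_docs for token in doc.split()]

# Mimics vocabulary_: term -> column index, assigned in alphabetical order
term_index = {term: i for i, term in enumerate(sorted(set(tokens)))}

# Actual corpus frequency: how many times each term occurs overall
term_count = Counter(tokens)

print(term_index)
print(term_count.most_common(2))
```

With the fitted vectoriser, one way to get real frequencies would be to sum the columns of `count_vectorizer.transform(docs['article'])` and pair the totals with the terms.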


Stemming

In [35]:
# create an object of stemming function
stemmer = SnowballStemmer("english")

def stemming(text):    
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text) 


In [36]:
docs['article'] = docs['article'].apply(stemming)
docs.article.head()

0    quarter profit us media giant timewarn jump 76...
1    dollar hit highest level euro almost three mon...
2    owner embattl russian oil giant yuko ask buyer...
3    british airway blame high fuel price 40 drop p...
4    share uk drink food firm alli domecq risen spe...
Name: article, dtype: object

In [37]:
# collect the vocabulary after stemming
# create the TF-IDF vectoriser (stopwords were already removed above);
# the first positional argument of TfidfVectorizer is `input`, not `stop_words`
tfid_vectorizer = TfidfVectorizer()
# fit the vectoriser on the text data
tfid_vectorizer.fit(docs['article'])
# vocabulary_ maps each term to its column index (not to its frequency)
dictionary = tfid_vectorizer.vocabulary_.items()

In [39]:
# store the vocabulary terms and their values in a dataframe
vocab = []
count = []
# iterate over the vocabulary and append each term and value to its list
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# NOTE: 'count' holds each term's index in the vocabulary, not how often it occurs
# store the values in a pandas dataframe with the terms as index
vocab_aftr_stem = pd.DataFrame(count, index=vocab, columns=['count'])
vocab_aftr_stem = vocab_aftr_stem.sort_values(['count'], ascending=False)
vocab_aftr_stem

Unnamed: 0,count
zvyagintsev,24047
zvonareva,24046
zuton,24045
zurich,24044
zuluaga,24043
...,...
001,4
00051,3
0001,2
000,1


In [40]:
px.bar(vocab_aftr_stem.head(20), x='count', y=vocab_aftr_stem.head(20).index)

## What is a Bag of Words?
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

- A vocabulary of known words.
- A measure of the presence of known words.
  
It is called a bag-of-words because any information about the <u>order or structure of words in the document is discarded</u>. The model is only concerned with whether known words occur in the document, not where in the document. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.
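
To make the loss of word order concrete, here is a small sketch with two made-up sentences. The helper `bag_of_words` is hypothetical and simply counts whitespace-separated tokens; it also shows two possible scoring schemes, raw counts and binary presence.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercased, whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())

bag_a = bag_of_words("dog bites man")
bag_b = bag_of_words("man bites dog")

# Same words in a different order give the same bag
print(bag_a == bag_b)

# Two ways of scoring presence for the same text
counts = bag_of_words("the cat and the hat")
presence = {term: 1 for term in counts}
print(counts["the"], presence["the"])
```

The two sentences mean different things, yet their bags are identical, which is exactly the information this representation gives up.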

In [41]:
articles = docs.article.tolist()
articles[0]

'quarter profit us media giant timewarn jump 76 113bn â£600m three month decemb 639m yearearli firm one biggest investor googl benefit sale highspe internet connect higher advert sale timewarn said fourth quarter sale rose 2 111bn 109bn profit buoy oneoff gain offset profit dip warner bros less user aol time warner said friday own 8 searchengin googl internet busi aol mix fortun lost 464000 subscrib fourth quarter profit lower preced three quarter howev compani said aol under profit except item rose 8 back stronger internet advertis revenu hope increas subscrib offer onlin servic free timewarn internet custom tri sign aol exist custom highspe broadband timewarn also restat 2000 2003 result follow probe us secur exchang commiss sec close conclud time warner fourth quarter profit slight better analyst expect film divis saw profit slump 27 284m help boxoffic flop alexand catwoman sharp contrast yearearli third final film lord ring trilog boost result fullyear timewarn post profit 336bn 27

In [42]:
print(len(articles))

2225


In [43]:
#tokenise
tokenised_articles = []
for i in range(len(articles)):
    tokens = word_tokenize(articles[i])
    tokenised_articles.append(tokens)
print(len(tokenised_articles))

2225


### TF-IDF
We will use the sklearn `TfidfVectorizer` to create a TF-IDF representation matrix of the articles.
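
As a rough sketch of what the weights mean, the snippet below computes TF-IDF by hand for a hypothetical three-document corpus. It assumes the smoothed formula documented for scikit-learn's defaults, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector; the corpus and the names `doc_freq`, `idf` and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenised
toy_docs = [
    "profit rose profit".split(),
    "sales rose".split(),
    "profit fell".split(),
]
n_docs = len(toy_docs)
vocabulary = sorted({term for doc in toy_docs for term in doc})

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed inverse document frequency (assumed sklearn default behaviour)
idf = {
    term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    for term in vocabulary
}

def tfidf_vector(doc):
    """Weight raw term counts by idf, then scale the vector to unit L2 norm."""
    tf = Counter(doc)
    weights = [tf[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_vector(doc) for doc in toy_docs]
for row in tfidf:
    print([round(w, 3) for w in row])
```

Terms that appear in fewer documents (here `sales` and `fell`) receive a higher idf than terms shared across documents (`profit` and `rose`), so they contribute more to a document's vector.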

In [45]:
# extract the TF-IDF representation matrix of the text data
tfid_matrix = tfid_vectorizer.transform(docs['article'])
# convert the sparse TF-IDF matrix to a dense numpy array
array = tfid_matrix.toarray()

In [47]:
# store the tf-idf array into pandas dataframe
df = pd.DataFrame(array)
df.head(10)

df.to_csv('../data/02_intermediate/preprocessed_tf_idf_articles.csv', index=True)

In [48]:
len(df)

2225

In [50]:
docs.category.unique()

array(['Business', 'Entertainment', 'Politics', 'Sport', 'Tech'],
      dtype=object)

In [52]:
docs['category_id'] = docs['category'].astype('category')
# Output IDs
docs['category_id'] = docs.category_id.cat.codes

In [54]:
docs.to_csv('../data/02_intermediate/preprocessed_docs_df.csv', index=False)