## A Step-by-Step Guide to Preprocessing and Collocation Analysis



#### This guide walks through a text preprocessing pipeline: importing libraries, loading the data, tokenizing, removing punctuation, stopwords, and short tokens, lemmatizing, and finding collocations.

#### 1. Initial Setup: In this section, we import the libraries used in the rest of the guide.

In [1]:
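# If needed, install the packages first: pip install nltk pandas
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder, TrigramCollocationFinder
from nltk.tokenize import word_tokenize, MWETokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import warnings
warnings.simplefilter('ignore', category=DeprecationWarning)

# Fetch the NLTK data used below (tokenizer models, stopword list, WordNet).
# Newer NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt'.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)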


#### 2. Importing Data:

In [2]:
# Importing the dataset
df = pd.read_csv('https://raw.githubusercontent.com/sivosevic/NYTimesNLP/main/TechArticles.csv')
df

| Unnamed: 0.1 | Unnamed: 0 | title | abstract |
|---|---|---|---|
| 0 | 0 | Ty Haney Is Doing Things Differently This Time | The Outdoor Voices founder has a new venture t... |
| 1 | 1 | Washington State Advances Landmark Deal on Gig... | Lawmakers have passed legislation granting ben... |
| 2 | 2 | Google Suspends Advertising in Russia | The move came after a Russian regulator demand... |
| 3 | 3 | When Electric Cars Rule the Road, They’ll Need... | A wireless infrastructure company is betting i... |
| 4 | 4 | A coalition of state attorneys general opens a... | The group is looking into the Chinese-owned vi... |
| ... | ... | ... | ... |
| 1195 | 1195 | Technology Briefing | PEOPLEAT&T EXECUTIVE JOINS PALM Palm Inc. has ... |
| 1196 | 1196 | Technology Briefing | INTERNET CNET WARNS OF LOWER SALES The online ... |
| 1197 | 1197 | Technology Briefing | INTERNET SEARCH PROVIDER FOR NBCI AND EXPLORER... |
| 1198 | 1198 | Technology Briefing | TELECOMMUNICATIONSNOKIA TO BUY AMBER NETWORKS ... |


### 3. Preprocessing
#### a) Combining Title and Abstract:
In this section, we combine each article's title and abstract into a single body of lowercase text.

In [3]:
# Combining each article's title and abstract into a single body of lowercase text
articles = [f"{title} {abstract}".lower() for title, abstract in zip(df['title'], df['abstract'])]
articles[:10]

['ty haney is doing things differently this time the outdoor voices founder has a new venture that aims to reward customers with blockchain-based assets. but do brand loyalists really want nfts?',
 'washington state advances landmark deal on gig drivers’ job status lawmakers have passed legislation granting benefits and protections, but allowing lyft and uber to continue to treat drivers as contractors.',
 'google suspends advertising in russia the move came after a russian regulator demanded that the company stop showing ads with what the regulator claimed was false information about the invasion of ukraine.',
 'when electric cars rule the road, they’ll need spots to power up a wireless infrastructure company is betting it can figure out how to locate and install charging stations for a growing wave of new vehicles.',
 'a coalition of state attorneys general opens an investigation into tiktok. the group is looking into the chinese-owned video site for the harms it may pose to younger 
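
One thing the string conversion in this step can hide: if a title or abstract is missing, pandas stores it as NaN, and converting NaN to a string yields the literal text "nan", which would then end up among the tokens. The sketch below is a plain-Python variant that skips missing values. It assumes the dataset may contain empty cells, which this guide does not check, and `combine_fields` is a hypothetical helper name.

```python
import math


def combine_fields(title, abstract):
    """Join title and abstract into one lowercase string, skipping missing values."""
    parts = []
    for value in (title, abstract):
        # pandas represents missing cells as float NaN; None is handled as well.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        parts.append(str(value).strip())
    return " ".join(parts).lower()


print(combine_fields("Google Suspends Advertising in Russia", float("nan")))
```

With the DataFrame loaded earlier, this could be applied as `[combine_fields(t, a) for t, a in zip(df['title'], df['abstract'])]`.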

#### b) Tokenizing Text:
##### This step converts each article's text into a list of tokens using the word_tokenize function.

In [4]:
# Tokenizing text
articles = [word_tokenize(article) for article in articles]
articles[:2]

[['ty',
  'haney',
  'is',
  'doing',
  'things',
  'differently',
  'this',
  'time',
  'the',
  'outdoor',
  'voices',
  'founder',
  'has',
  'a',
  'new',
  'venture',
  'that',
  'aims',
  'to',
  'reward',
  'customers',
  'with',
  'blockchain-based',
  'assets',
  '.',
  'but',
  'do',
  'brand',
  'loyalists',
  'really',
  'want',
  'nfts',
  '?'],
 ['washington',
  'state',
  'advances',
  'landmark',
  'deal',
  'on',
  'gig',
  'drivers',
  '’',
  'job',
  'status',
  'lawmakers',
  'have',
  'passed',
  'legislation',
  'granting',
  'benefits',
  'and',
  'protections',
  ',',
  'but',
  'allowing',
  'lyft',
  'and',
  'uber',
  'to',
  'continue',
  'to',
  'treat',
  'drivers',
  'as',
  'contractors',
  '.']]
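
To make the shape of this output concrete, here is a rough regular-expression tokenizer. It only illustrates the idea (words stay whole, each punctuation mark becomes its own token) and is not how `word_tokenize` is implemented; `simple_tokenize` is a hypothetical name, and real text will expose cases where it differs from NLTK.

```python
import re

# A word (optionally hyphenated) or a single character that is neither
# a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "reward customers with blockchain-based assets. but do brand loyalists really want nfts?"
print(simple_tokenize(sample))
```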

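#### c) Removing Punctuation, Stopwords, and Short Tokens:
This code filters each article's token list, dropping punctuation marks, stopwords (NLTK's English list plus a few corpus-specific words such as "technology" and "company"), tokens of 4 or fewer characters, and any token containing a hyphen.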


In [5]:
# Removing punctuation, stopwords, short tokens, and hyphenated tokens
# A separate name keeps the imported nltk stopwords module from being overwritten
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union({"technology","company","percent","briefing","million","service","internet"})
punctuations = r".,\"-\\/#!?$%\^&\*;:{}=\-_'~()"
articles = [[token for token in article if (token not in punctuations and token not in stop_words and len(token) > 4 and '-' not in token)] for article in articles]
articles[:3]

[['haney',
  'things',
  'differently',
  'outdoor',
  'voices',
  'founder',
  'venture',
  'reward',
  'customers',
  'assets',
  'brand',
  'loyalists',
  'really'],
 ['washington',
  'state',
  'advances',
  'landmark',
  'drivers',
  'status',
  'lawmakers',
  'passed',
  'legislation',
  'granting',
  'benefits',
  'protections',
  'allowing',
  'continue',
  'treat',
  'drivers',
  'contractors'],
 ['google',
  'suspends',
  'advertising',
  'russia',
  'russian',
  'regulator',
  'demanded',
  'showing',
  'regulator',
  'claimed',
  'false',
  'information',
  'invasion',
  'ukraine']]
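
A detail worth knowing about the filter above: because `punctuations` is a string, `token not in punctuations` is a substring test, so a token like `'.,'` counts as punctuation only because those two characters happen to sit next to each other in the string. The sketch below expresses the same filtering rules with an explicit set of punctuation characters and a named length threshold. It is an illustration, not the notebook's code: `filter_tokens` and `demo_stop_words` are hypothetical names, and the tiny stopword set stands in for NLTK's English list.

```python
import string


def filter_tokens(tokens, stop_words, min_length=5):
    """Keep tokens that are not stopwords, not punctuation, not hyphenated, and long enough."""
    punctuation = set(string.punctuation) | {"’", "‘", "“", "”"}
    kept = []
    for token in tokens:
        if token in stop_words:
            continue
        if all(char in punctuation for char in token):
            continue  # token consists only of punctuation characters
        if "-" in token or len(token) < min_length:
            continue
        kept.append(token)
    return kept


# Placeholder stopword set for illustration; the guide uses NLTK's English list.
demo_stop_words = {"is", "doing", "the", "this", "has", "a", "that", "to", "with", "but", "do"}
tokens = ["ty", "haney", "is", "doing", "things", "differently", "this", "time",
          "blockchain-based", "assets", ".", "really", "want", "nfts", "?"]
print(filter_tokens(tokens, demo_stop_words))
```

Here `min_length=5` corresponds to the `len(token) > 4` condition in the cell above.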

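#### d) Lemmatization:
This code reduces each token to its dictionary form (lemma) using WordNetLemmatizer. Because lemmatize is called without a part-of-speech argument, every token is treated as a noun: plurals are reduced ("things" becomes "thing"), while verb forms such as "passed" and "granting" are left unchanged.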

In [6]:
# Lemmatizing words
lemmatizer = WordNetLemmatizer()
articles = [[lemmatizer.lemmatize(token) for token in article] for article in articles]
articles[:3]

[['haney',
  'thing',
  'differently',
  'outdoor',
  'voice',
  'founder',
  'venture',
  'reward',
  'customer',
  'asset',
  'brand',
  'loyalist',
  'really'],
 ['washington',
  'state',
  'advance',
  'landmark',
  'driver',
  'status',
  'lawmaker',
  'passed',
  'legislation',
  'granting',
  'benefit',
  'protection',
  'allowing',
  'continue',
  'treat',
  'driver',
  'contractor'],
 ['google',
  'suspends',
  'advertising',
  'russia',
  'russian',
  'regulator',
  'demanded',
  'showing',
  'regulator',
  'claimed',
  'false',
  'information',
  'invasion',
  'ukraine']]
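
A possible refinement, not part of the original notebook, is to pass a part of speech to `lemmatize` so that verb forms can be reduced as well. The sketch below tags tokens with `nltk.pos_tag` and maps the resulting Penn Treebank tags to WordNet's single-letter codes. It assumes the NLTK tagger and WordNet resources have been downloaded (the tagger's resource name varies by NLTK version); `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical helper names.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS codes WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, which is also lemmatize()'s default


def lemmatize_with_pos(tokens):
    """Lemmatize tokens using their predicted part of speech."""
    # Imported here so the mapping helper above works without NLTK data installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
            for token, tag in pos_tag(tokens)]


print(penn_to_wordnet("VBD"))  # VBD is the tag for a past-tense verb
```

Called as `lemmatize_with_pos(article)`, tokens tagged as verbs are lemmatized as verbs, so a form like "passed" can reduce to "pass". Tagging lowercased tokens after stopword removal is less reliable than tagging full sentences, so results may vary.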

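#### e) Finding Collocations:
This code finds the bigrams and trigrams that occur at least 3 times across the articles. Since stopwords and short tokens were removed earlier, words count as adjacent if they are neighbors in the filtered token lists. Note that ngram_fd.items() is not sorted by frequency, so the output shows the first three entries of each list rather than the three most frequent; ngram_fd.most_common(3) would return those.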

In [9]:
# Finding bigrams that occur at least 3 times
bigram_finder = BigramCollocationFinder.from_documents(articles)
bigram_finder.apply_freq_filter(min_freq=3)
bigrams = list(bigram_finder.ngram_fd.items())
# Finding trigrams that occur at least 3 times
trigram_finder = TrigramCollocationFinder.from_documents(articles)
trigram_finder.apply_freq_filter(min_freq=3)
trigrams = list(trigram_finder.ngram_fd.items())
print(bigrams[:3])
print(trigrams[:3])

[(('state', 'attorney'), 3), (('attorney', 'general'), 8), (('giant', 'google'), 3)]
[(('state', 'attorney', 'general'), 3), (('social', 'medium', 'platform'), 5), (('federal', 'trade', 'commission'), 3)]
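
What the finders tally can be approximated in a few lines of plain Python. This is a sketch of the counting idea only, not NLTK's implementation: it counts adjacent n-grams within each article, never across article boundaries, and keeps those seen at least `min_freq` times. `count_ngrams` is a hypothetical name and the three short documents are made up for illustration.

```python
from collections import Counter


def count_ngrams(documents, n, min_freq=1):
    """Count adjacent n-grams per document and keep those with freq >= min_freq."""
    counts = Counter()
    for tokens in documents:
        # Slide a window of n tokens over one document at a time.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


docs = [
    ["state", "attorney", "general", "open", "investigation"],
    ["attorney", "general", "sue", "google"],
    ["state", "attorney", "general", "settle"],
]
print(count_ngrams(docs, n=2, min_freq=2))
print(count_ngrams(docs, n=3, min_freq=2))
```

Raw frequency favors n-grams made of common words. NLTK's finders also offer `nbest` with a scoring function such as `BigramAssocMeasures.pmi` when ranking by association strength is preferred.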


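#### f) Replacing Collocations in Text:
This code merges each collocation found above into a single token joined by an underscore (for example, state_attorney_general), replacing its component tokens in the article token lists.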

In [10]:
# Merging collocations into single tokens
bigrams = list(bigram_finder.ngram_fd.keys())
trigrams = list(trigram_finder.ngram_fd.keys())
mwe_tokenizer = MWETokenizer(bigrams + trigrams, separator='_')  # the words of a collocation are joined with '_'
articles = [mwe_tokenizer.tokenize(article) for article in articles]
articles[:15]

[['haney',
  'thing',
  'differently',
  'outdoor',
  'voice',
  'founder',
  'venture',
  'reward',
  'customer',
  'asset',
  'brand',
  'loyalist',
  'really'],
 ['washington',
  'state',
  'advance',
  'landmark',
  'driver',
  'status',
  'lawmaker',
  'passed',
  'legislation',
  'granting',
  'benefit',
  'protection',
  'allowing',
  'continue',
  'treat',
  'driver',
  'contractor'],
 ['google',
  'suspends',
  'advertising',
  'russia',
  'russian',
  'regulator',
  'demanded',
  'showing',
  'regulator',
  'claimed',
  'false',
  'information',
  'invasion',
  'ukraine'],
 ['electric',
  'spot',
  'power',
  'wireless',
  'infrastructure',
  'betting',
  'figure',
  'locate',
  'install',
  'charging',
  'station',
  'growing',
  'vehicle'],
 ['coalition',
  'state_attorney_general',
  'open',
  'investigation',
  'tiktok',
  'group',
  'looking',
  'video',
  'harm',
  'younger',
  'user'],
 ['million',
  'crypto',
  'name',
  'necessary',
  'investor',
  'money',
  'pseudo
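
The merging step can be pictured with a small plain-Python function. This is a simplified sketch, not `MWETokenizer`'s implementation: at each position it tries the longest known expression first, which is why a trigram such as ('state', 'attorney', 'general') wins over the bigrams it contains. `merge_collocations` is a hypothetical name.

```python
def merge_collocations(tokens, collocations, separator="_"):
    """Replace known multiword expressions with single joined tokens."""
    # Longest expressions first so a trigram is preferred over its bigrams.
    by_length = sorted(set(collocations), key=lambda mwe: (-len(mwe), mwe))
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in by_length:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position; keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged


collocations = [("state", "attorney"), ("attorney", "general"),
                ("state", "attorney", "general")]
tokens = ["coalition", "state", "attorney", "general", "open", "investigation"]
print(merge_collocations(tokens, collocations))
```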

That concludes this guide. Stay tuned for the next one, where we will dive into topic discovery and sentiment analysis.