# Pre-Processing

Before building and running the models on the articles, the text needs some preprocessing so that the models run efficiently. Here this covers general cleaning of the original dataset, tokenisation, and stopword removal.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import string
import re

news = pd.read_csv('news-data.csv', sep='\t')



In [2]:
news.head()

Unnamed: 0,category,filename,title,content
0,business,001.txt,Ad sales boost Time Warner profit,Quarterly profits at US media giant TimeWarne...
1,business,002.txt,Dollar gains on Greenspan speech,The dollar has hit its highest level against ...
2,business,003.txt,Yukos unit buyer faces loan claim,The owners of embattled Russian oil giant Yuk...
3,business,004.txt,High fuel prices hit BA's profits,British Airways has blamed high fuel prices f...
4,business,005.txt,Pernod takeover talk lifts Domecq,Shares in UK drinks and food firm Allied Dome...


Here we carry out tokenisation. This consists of splitting the content of each article into individual words, converting uppercase letters to lowercase, and removing punctuation.

In [3]:
'''
all_words = []

for i in range(len(news["content"])):
    all_words.append(news["content"][i].split())
print(all_words[1])
'''

all_words = ""
print(len(news["content"]))
lowercase_content = []
full_clean = []

for i in range(len(news["content"])):
    current = news["content"][i]
    clean = current.translate(str.maketrans('', '', string.punctuation)).lower()
    clean = clean.replace('£', '')
    full_clean.append(clean.split())
    lowercase_content.append(clean)
    all_words = all_words + clean + " "

#print(lowercase_content[2:5])
final = all_words.split()
total_count = len(final)
print(total_count)

2225
839464


In [4]:
print(final[100:200])

['464000', 'subscribers', 'in', 'the', 'fourth', 'quarter', 'profits', 'were', 'lower', 'than', 'in', 'the', 'preceding', 'three', 'quarters', 'however', 'the', 'company', 'said', 'aols', 'underlying', 'profit', 'before', 'exceptional', 'items', 'rose', '8', 'on', 'the', 'back', 'of', 'stronger', 'internet', 'advertising', 'revenues', 'it', 'hopes', 'to', 'increase', 'subscribers', 'by', 'offering', 'the', 'online', 'service', 'free', 'to', 'timewarner', 'internet', 'customers', 'and', 'will', 'try', 'to', 'sign', 'up', 'aols', 'existing', 'customers', 'for', 'highspeed', 'broadband', 'timewarner', 'also', 'has', 'to', 'restate', '2000', 'and', '2003', 'results', 'following', 'a', 'probe', 'by', 'the', 'us', 'securities', 'exchange', 'commission', 'sec', 'which', 'is', 'close', 'to', 'concluding', 'time', 'warners', 'fourth', 'quarter', 'profits', 'were', 'slightly', 'better', 'than', 'analysts', 'expectations', 'but', 'its', 'film']
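As a small illustration of the cleaning step above, the sketch below applies the same `translate`/`lower`/`split` chain to a single sentence. The helper name `tokenise` and the sample sentence are made up for this example and are not part of the notebook.

```python
import string

def tokenise(text):
    """Strip ASCII punctuation, lowercase, and split on whitespace."""
    # Hypothetical helper: mirrors the translate/lower/split chain used in In [3]
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table).lower().split()

# Made-up sample sentence, not taken from the dataset
sample = "AOL's underlying profit rose 8% on high-speed broadband."
tokens = tokenise(sample)
print(tokens)
```

Because punctuation is deleted rather than replaced with a space, possessives and hyphenated words collapse into single tokens, which is why tokens such as 'aols' and 'highspeed' appear in the output above.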


Here we count the frequency of each word in the dataset. The most frequent words guide the choice of stopwords, which are then removed.

In [5]:
from collections import Counter
count = Counter(final)

In [6]:
freq = count.items()
# As the next line of code prints every word in all of the articles with their frequencies,
# it has been commented out to make the notebook more presentable

#print(sorted(freq, key=lambda value: value[1], reverse = True))
#print(count.keys())
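Rather than printing every word, one option is to inspect only the most frequent entries with `Counter.most_common`. The sketch below uses a tiny made-up token list; in the notebook the same call could be made on `count`, for example `count.most_common(50)`.

```python
from collections import Counter

# Made-up tokens standing in for the notebook's `final` list
tokens = ["the", "profit", "the", "of", "profit", "the", "oil"]
count = Counter(tokens)

# most_common(n) returns the n highest-frequency (word, count) pairs
top = count.most_common(2)
print(top)
```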

Stopwords are words that have no discernible relation to any topic, such as "the" or "or". The stopwords are chosen manually and stored in the file stopwords.txt. From there, they are read into the list stopwords_list.

In [7]:
# Read the manually chosen stopwords.
# The with block closes the file automatically, so no explicit close() is needed.
with open('stopwords.txt') as f:
    stopwords = f.read()

stopwords_list = stopwords.split()
print(stopwords_list)

['the', 'to', 'of', 'and', 'a', 'in', 'for', 'is', 'that', 'on', 'said', 'it', 'was', 'he', 'be', 'with', 'as', 'has', 'have', 'at', 'will', 'by', 'but', 'are', 'from', 'not', 'i', 'its', 'his', 'mr', 'they', 'this', 'an', 'we', 'which', 'had', 'would', 'been', 'their', 'were', 'more', 'also', 'who', 'people', 'up', 'new', 'about', 'us', 'there', 'one', 'after', 'or', 'than', 'year', 'out', 'can', 'all', 'if', 'could', 'you', 'last', 'over', 'when', 'first', 'year', 'two', 'time', 'now', 'other', 'some', 'into', 'what', 'she', 'so', 'them', 'against', 'just', 'do', 'only', 'no', 'best', 'being', 'make', 'told', 'get', 'such', 'made', 'very', 'like', 'many', 'should', 'because', 'before', 'while', 'her', 'next', 'three', 'any', 'most', 'back', 'well', 'added', 'way', 'take', 'my', 'our', 'may', 'say', 'good', 'him', 'how', 'then', 'going', 'those', 'still', 'much', 'down', 'since', 'go', 'use', 'say', 'million', 'want', 'off', 'between', 'see', 'show', 'did', 'week', 'used', 'where', 'u

In [8]:
def num_there(s):
    return any(i.isdigit() for i in s)

The next cell removes stopwords, and any token containing a digit, from the tokenised articles.

In [9]:
# Remove stopwords and any token containing a digit from each article
stopword_set = set(stopwords_list)  # set lookup avoids one pass per stopword
clean_filtered = []
post_count = 0
print(len(full_clean))
for idx, tokens in enumerate(full_clean):
    filtered = [w for w in tokens if w not in stopword_set and not num_there(w)]
    post_count += len(filtered)

    # Progress indicator every 100 articles
    if idx % 100 == 0:
        print(idx)

    clean_filtered.append(' '.join(filtered))
print(post_count)

2225
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
417238
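The filtering rule can be checked on a toy example. The sketch below is a minimal stand-alone version, assuming the rule is "drop a token if it is a stopword or contains any digit"; the names `has_digit` and `filter_tokens` and the sample data are hypothetical.

```python
def has_digit(token):
    """Return True if the token contains at least one digit."""
    return any(ch.isdigit() for ch in token)

def filter_tokens(tokens, stopwords):
    """Keep tokens that are neither stopwords nor contain a digit."""
    stopword_set = set(stopwords)  # constant-time membership checks
    return [t for t in tokens if t not in stopword_set and not has_digit(t)]

# Made-up sample data for illustration only
tokens = ['profits', 'were', 'lower', 'than', 'in', 'the', '2003', 'q4', 'results']
stopwords = ['the', 'in', 'were', 'than']
kept = filter_tokens(tokens, stopwords)
print(kept)
```

Note that a mixed token such as 'q4' is dropped as well, since the check looks for any digit rather than requiring the whole token to be numeric.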


In [10]:
print(post_count / total_count * 100)

49.702905663613926


The number above is the percentage of tokens that remain after stopwords and numbers have been removed: about 49.7% are kept, so just over half have been removed.
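The percentage follows directly from the two token counts printed earlier. As a quick check of the arithmetic, using those printed values as constants:

```python
total_count = 839464  # tokens before filtering (printed by In [3])
post_count = 417238   # tokens after filtering (printed by In [9])

removed = total_count - post_count
pct_remaining = post_count / total_count * 100
pct_removed = removed / total_count * 100

print(removed)
print(pct_remaining, pct_removed)
```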

In [11]:
print(clean_filtered[2:4])

['owners embattled russian oil giant yukos ask buyer former production unit pay loan stateowned rosneft bought yugansk unit sale forced russia part settle tax claim yukos yukos owner menatep group says ask rosneft repay loan yugansk secured assets rosneft faces similar repayment demand foreign banks legal experts rosnefts purchase yugansk include obligations pledged assets rosneft pay real money creditors avoid seizure yugansk assets moscowbased lawyer jamie firestone connected case menatep groups managing director tim osborne reuters news agency default fight rule law exists international arbitration clauses credit rosneft officials unavailable comment company intends action menatep recover tax claims debts owed yugansk yukos filed bankruptcy protection court attempt prevent forced sale main production arm sale went ahead december yugansk sold littleknown shell company turn bought rosneft yukos claims downfall punishment political ambitions founder mikhail khodorkovsky vowed sue parti

We then write the cleaned strings and a complete vocabulary list to .csv files. Only the category and the preprocessed content of each article are included. The original filenames and titles are omitted, as the models do not need them.

In [None]:
df = pd.DataFrame(clean_filtered, columns=['string'])
df.insert(0, "category", news["category"])
# The same data is written twice: once with the row index, once without
df.to_csv('cleaned_strings.csv', sep=",", index=True)
df.to_csv('removed.csv', sep=",", index=False)

In [None]:
# Drop stopwords from the flat token list (tokens containing digits are kept here)
stopword_set = set(stopwords_list)
final = [w for w in final if w not in stopword_set]

In [None]:
entire_list = list(set(final))
df2 = pd.DataFrame(entire_list)
df2.to_csv('allwords.csv', sep =",", index=False)
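One caveat: the iteration order of a set of strings can differ between Python runs, so `list(set(final))` may write the rows of allwords.csv in a different order each time. If a reproducible file is wanted, sorting the vocabulary is one option. The sketch below shows the idea on a made-up token list; in the notebook it would amount to `entire_list = sorted(set(final))`, which is a suggestion rather than what the original cell does.

```python
# Made-up tokens standing in for the notebook's `final` list
tokens = ['yukos', 'rosneft', 'loan', 'yukos', 'assets', 'loan']

# set() removes duplicates; sorted() fixes the order alphabetically
vocabulary = sorted(set(tokens))
print(vocabulary)
```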