# TXT PURIFIER

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

The following code shows how to purify a list of documents by removing their most common words, revealing the words that truly distinguish the documents from each other.

Required Python packages are `gensim` and `pandas`.
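
To make the idea concrete before bringing in `gensim`, here is a minimal pure-Python sketch of tf-idf scoring on toy data. The function name `tfidf_scores` and the toy documents are made up for illustration. The log base 2 and the unit-length normalisation are meant to resemble gensim's defaults as far as I understand them, so treat the exact numbers as approximate rather than as what `TfidfModel` returns.

```python
import math
from collections import Counter

def tfidf_scores(tokenized_docs):
    """Return one {word: score} dict per document, scaled to unit length."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    scored = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # term frequency times inverse document frequency (log base 2)
        raw = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in term_freq.items()}
        # L2 norm of the document vector; fall back to 1.0 if all scores are zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scored.append({w: v / norm for w, v in raw.items()})
    return scored

# hypothetical toy documents, already tokenized
toy_docs = [
    "the court ruled on the case".split(),
    "the police closed the case".split(),
    "the library opened".split(),
]
toy_scores = tfidf_scores(toy_docs)
print(toy_scores[0])
```

In this sketch a word that occurs in every document ("the") scores zero, while a word unique to one document ("court") scores higher than one shared by two documents ("case"). That ordering is what the purification relies on.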

In [208]:
from gensim import corpora, models, similarities
import pandas as pd

## Text input

In [188]:
with open("docs.txt", 'r') as f: # a txt file with one document per line
    documents = f.readlines() # documents is now a Python list of documents

In [189]:
len(documents) # number of documents

370

In [190]:
print(documents[0][:100]) # inspect the beginning of the first document

umeå universitetsbibliotekbibsam uttag 2017 06 12 nyheter: ingen rubrik tillgänglig när offer blir f


## Stop word removal etc.

In [191]:
# convert to lowercase, tokenize on whitespace, and remove stop words
stoplist = set('your stop words here'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

In [192]:
# remove project-specific stop words
stoplist2 = set('another set of stop words here'.split())
texts = [[word for word in text if word not in stoplist2] for text in texts]

In [193]:
# remove words that appear X (e.g. 2) times or fewer
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]
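
The same frequency filter can be sketched with `collections.Counter` on toy data. The names `toy_texts` and `min_count` are hypothetical. Note that a `> 2` comparison keeps only tokens seen at least three times across the whole collection.

```python
from collections import Counter

# hypothetical toy data: three tokenized documents
toy_texts = [
    ["våld", "lagen", "skydd"],
    ["lagen", "skydd", "polis"],
    ["lagen", "skydd", "lagen"],
]
min_count = 2  # hypothetical cut-off: keep tokens seen more than this many times

# count every token across all documents
counts = Counter(token for text in toy_texts for token in text)

# keep only tokens whose total count exceeds the cut-off
filtered = [[t for t in text if counts[t] > min_count] for text in toy_texts]
print(filtered)
```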

In [194]:
# remove anything which is not pure letters
# the method isalpha() checks whether the string consists of alphabetic characters only

texts = [[token for token in text if token.isalpha()] for text in texts]
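
One thing to be aware of: because the documents are split on whitespace only, a token with attached punctuation (such as `nyheter:` in the first document) fails `isalpha()` and is dropped entirely. The sketch below compares the strict filter with a hypothetical variant that strips surrounding punctuation first. Whether that variant is preferable depends on the material.

```python
import string

# toy tokens, taken from the start of the first document
raw_tokens = ["umeå", "2017", "nyheter:", "offer"]

# isalpha() alone drops the whole token "nyheter:" because of the colon
strict = [t for t in raw_tokens if t.isalpha()]

# stripping leading/trailing ASCII punctuation first keeps "nyheter"
stripped = [t.strip(string.punctuation) for t in raw_tokens]
lenient = [t for t in stripped if t.isalpha()]

print(strict)
print(lenient)
```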

In [196]:
# remove one-letter words
texts = [[token for token in text if len(token) > 1] for text in texts]

In [197]:
print(texts[0][:40]) # see the beginning of the first tokenized and cleaned document

['offer', 'förövare', 'gränserna', 'våld', 'snäva', 'kräver', 'tur', 'samhällets', 'skydd', 'rätten', 'självförsvar', 'sitter', 'djupt', 'folkliga', 'lagen', 'sätter', 'snäva', 'exempel', 'skadar', 'dödar', 'normalfallet', 'noga', 'övervägt', 'slags', 'yttersta', 'fallet', 'sätter', 'gränser', 'rätten', 'värna', 'vissa', 'stater', 'döda', 'hotar', 'privata', 'nära', 'förutsätts', 'passivitet', 'polis', 'myndigheter']


## tf-idf stuff

In [198]:
# we now create a gensim corpus from this set of documents
# to be able to get tf-idf scores for words

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
tfidf = models.TfidfModel(corpus, id2word = dictionary)
corpus_tfidf = tfidf[corpus]

low_value = 0.25

# drop low-scoring words from each bag-of-words document
# (this filtered corpus is not used further below; the purification relies on the dataframe threshold instead)
filtered_corpus = []
for bow in corpus:
    low_value_ids = {word_id for word_id, value in tfidf[bow] if value < low_value}
    filtered_corpus.append([b for b in bow if b[0] not in low_value_ids])

# a dictionary of the tfidf values; a word that occurs in several documents keeps the value from the last one
d = {dictionary.get(word_id): value for doc in corpus_tfidf for word_id, value in doc}
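
The dictionary comprehension above assigns one value per word, so a word that occurs in several documents keeps whichever score is seen last. The sketch below shows this on toy data (with ids already mapped to words) and, as one possible alternative that the original notebook does not use, keeps the highest score each word reaches.

```python
# two toy documents as (word, score) pairs, shaped like per-document tf-idf output
toy_corpus_tfidf = [
    [("offer", 0.40), ("lagen", 0.10)],
    [("offer", 0.05), ("polis", 0.30)],
]

# last value wins: "offer" ends up with the score from the second document
last_seen = {word: score for doc in toy_corpus_tfidf for word, score in doc}

# alternative (an assumption, not the notebook's method):
# keep the highest score each word reaches in any document
max_score = {}
for doc in toy_corpus_tfidf:
    for word, score in doc:
        max_score[word] = max(score, max_score.get(word, 0.0))

print(last_seen["offer"], max_score["offer"])
```

Which aggregation suits a given corpus is a judgment call. The threshold chosen later needs to be tuned against whichever one is used.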

In [174]:
d

{'offer': 0.013507229909325909,
 'förövare': 0.14168475133412164,
 'gränserna': 0.13396265132658886,
 'våld': 0.1281389847594295,
 'snäva': 0.25346146030049277,
 'kräver': 0.05130201259794378,
 'tur': 0.1267196787833677,
 'samhällets': 0.043198014282212475,
 'skydd': 0.07639898474007265,
 'rätten': 0.015540185235692613,
 'självförsvar': 0.4095527918919187,
 'sitter': 0.03832066992481726,
 'djupt': 0.020959715650123174,
 'folkliga': 0.07187425902884143,
 'lagen': 0.06702278220204899,
 'sätter': 0.0499376019169909,
 'exempel': 0.028088442656096835,
 'skadar': 0.10403330815920565,
 'dödar': 0.053585323829868976,
 'normalfallet': 0.10641189178787883,
 'noga': 0.15417679493309316,
 'övervägt': 0.10174470394355753,
 'slags': 0.1247196155658233,
 'yttersta': 0.036034598942559753,
 'fallet': 0.10346180976074694,
 'gränser': 0.037720057326637635,
 'värna': 0.014665694800212311,
 'vissa': 0.04438202745787728,
 'stater': 0.04608188266984941,
 'döda': 0.05575908029702715,
 'hotar': 0.0450345148948

In [199]:
# Read the dictionary into a Pandas dataframe and sort descending based on tf-idf
df = pd.DataFrame([[key,value] for key,value in d.items()],columns=["word","tf-idf"])
df = df.sort_values(['tf-idf'], ascending=[False])
df

| | word | tf-idf |
|---|---|---|
| 6173 | taylor | 0.734250 |
| 700 | fatima | 0.723459 |
| 8668 | bilal | 0.721969 |
| 3192 | spirea | 0.709959 |
| 7299 | linnea | 0.706975 |
| 7826 | ståhl | 0.702313 |
| 8621 | finspång | 0.654291 |
| 8667 | brottsofferdagen | 0.648304 |
| 8483 | alfborger | 0.647660 |
| 6977 | kauppi | 0.647120 |


In [201]:
tfidf_threshold = 0.058 # set manually (experiment and iterate)

df2 = df.loc[df['tf-idf'] > tfidf_threshold]
print(str(len(df2)) + ' are left of ' + str(len(df)))

4297 are left of 8877


In [202]:
# extract the word column from df2 as a list
# this is a list of all words with tf-idf above the threshold

keep_words = df2['word'].tolist()

In [203]:
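# keep only the words with tf-idf above the threshold, dropping the low tf-idf words

keep_words = set(keep_words) # set membership tests are much faster than list lookups
texts = [[word for word in text if word in keep_words] for text in texts]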
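
As a small check of the intended behaviour, the thresholding step can be sketched on toy data: only words scoring above the threshold survive. The names `toy_scores`, `toy_threshold` and `keep` are hypothetical, and the scores are rounded values for illustration.

```python
# hypothetical word scores and threshold
toy_scores = {"offer": 0.01, "självförsvar": 0.41, "snäva": 0.25, "rätten": 0.02}
toy_threshold = 0.058

# a set makes the membership test in the filter below fast
keep = {word for word, score in toy_scores.items() if score > toy_threshold}

toy_texts = [["offer", "självförsvar", "rätten"], ["snäva", "offer"]]

# keep a token only if its score is above the threshold
purified = [[word for word in text if word in keep] for text in toy_texts]
print(purified)
```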

## Text output

In [204]:
# 'texts' is now a list of lists of tokens
# we transform it back to the initial format (a list of documents)

doc_list = [] # initialise an empty list

for token_list in texts:
    #print(token_list)
    token_string = ' '.join(token_list)
    #print(token_string)
    #print("==========")
    doc_list.append(token_string)

In [205]:
# remove duplicates, if any, from the doc_list

doc_list2= set(doc_list)
print(str(len(doc_list)-len(doc_list2)) + " duplicate documents removed.")
doc_list = doc_list2

17 duplicate documents removed.
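
Converting to a `set` removes duplicates but also discards the original document order. If the order matters, one alternative (a sketch on toy data, not what the notebook does) is `dict.fromkeys`, which keeps the first occurrence of each document in place.

```python
# hypothetical toy documents with one duplicate
toy_docs = ["b c", "a", "b c", "d"]

# dict keys are unique and preserve insertion order
unique_docs = list(dict.fromkeys(toy_docs))
removed = len(toy_docs) - len(unique_docs)

print(unique_docs, removed)
```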


In [206]:
# how many documents are left?
len(doc_list2)

353

In [207]:
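# write the documents as lines to a txt file
# (this reuses the name 'docs.txt' and so overwrites the input file; change the name to keep the original)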
with open('docs.txt', 'w') as outfile:
    for item in doc_list:
        outfile.write("%s\n" % item)