In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.
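
To make the idea of a date-based split concrete, here is a minimal sketch on a toy DataFrame. The posts, the dates, and the `cutoff` value are all made up for illustration; they are not the actual dataset contents or its real cutoff date.

```python
import pandas as pd

# Hypothetical posts with hypothetical posting dates (illustration only)
posts = pd.DataFrame({
    "text": ["post a", "post b", "post c", "post d"],
    "posted": pd.to_datetime(
        ["1993-03-01", "1993-03-20", "1993-04-10", "1993-04-25"]
    ),
})

# Hypothetical cutoff: earlier posts go to train, later posts go to test
cutoff = pd.Timestamp("1993-04-01")
train_posts = posts[posts["posted"] < cutoff]
test_posts = posts[posts["posted"] >= cutoff]

print(len(train_posts), len(test_posts))
```

Splitting by date rather than at random means the test posts all come after the training posts, which is closer to how a model would be used on new messages.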



In [4]:
from sklearn.datasets import fetch_20newsgroups

In [26]:
# subset defaults to 'train', so only the training posts are loaded
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [27]:
len(dataset.data)

11314

In [28]:
news_df = pd.DataFrame({'document':documents})


In [32]:
news_df.loc[0].document

"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [33]:
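# keep only letters and '#'; replace everything else with a space
# (regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)

# remove short words (3 characters or fewer)
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))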

# make all text lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

In [34]:
news_df.loc[0].clean_doc

'well sure about story seem biased what disagree with your statement that media ruin israels reputation that rediculous media most israeli media world having lived europe realize that incidences such described letter have occured media whole seem ignore them subsidizing israels existance europeans least same degree think that might reason they report more clearly atrocities what shame that austria daily reports inhuman acts commited israeli soldiers blessing received from government makes some holocaust guilt away after look jews treating other races when they power unfortunate'
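
The three cleaning steps can also be checked on a single string without pandas. This is a minimal sketch; `clean_text` and its `min_len` parameter are hypothetical names introduced here, and the sample sentence is the start of the first document shown earlier.

```python
import re

def clean_text(text, min_len=4):
    """Keep letters and '#', drop words shorter than min_len, lowercase."""
    # replace every character that is not a letter or '#' with a space
    letters_only = re.sub(r"[^a-zA-Z#]", " ", text)
    # keep only words with at least min_len characters
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    return " ".join(kept).lower()

sample = "Well i'm not sure about the story nad it did seem biased."
cleaned = clean_text(sample)
print(cleaned)
```

Note that the length filter also removes short typos such as "nad" and the fragments left behind when an apostrophe is replaced by a space.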

### NLTK is needed

In [37]:
pip install nltk

Collecting nltk
  Downloading nltk-3.5.zip (1.4 MB)
Collecting regex
  Downloading regex-2020.7.14-cp38-cp38-manylinux2010_x86_64.whl (672 kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-py3-none-any.whl size=1434677 sha256=83be330f68f44300b085e2b30e03d3c38aa3a752d2216e43b5241b23e7ed2aaf
  Stored in directory: /home/sampath/.cache/pip/wheels/ff/d5/7b/f1fb4e1e1603b2f01c2424dd60fbcc50c12ef918bafc44b155
Successfully built nltk
Installing collected packages: regex, nltk
Successfully installed nltk-3.5 regex-2020.7.14
Note: you may need to restart the kernel to use updated packages.


In [41]:
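import nltk
from nltk.corpus import stopwords

# download the stop-word list once before first use
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests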

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())

# remove stop-words
# result is list of lists
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization
detokenized_doc = []
for i in range(len(news_df)):
    # make document back again
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

In [48]:
news_df['clean_doc'] = detokenized_doc

In [53]:
news_df['clean_doc'][0]

'well sure story seem biased disagree statement media ruin israels reputation rediculous media israeli media world lived europe realize incidences described letter occured media whole seem ignore subsidizing israels existance europeans least degree think might reason report clearly atrocities shame austria daily reports inhuman acts commited israeli soldiers blessing received government makes holocaust guilt away look jews treating races power unfortunate'
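
The stop-word filter is a plain membership test per token. The sketch below reproduces it on one sentence; `toy_stop_words` is a tiny hand-picked set used only for illustration (not the full NLTK list) so that the example runs without downloading anything, and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny hand-picked stop-word set, for illustration only
toy_stop_words = {"about", "what", "with", "your", "that"}

def remove_stop_words(text, stop_words):
    """Split on whitespace, drop stop words, and join the rest back together."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

before = "well sure about story seem biased what disagree with your statement"
after = remove_stop_words(before, toy_stop_words)
print(after)
```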

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,  # keep top 1000 terms
                             max_df=0.5,  # ignore terms found in more than half of the documents
                             smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])

X.shape # check shape of the document-term matrix

(11314, 1000)

#### Rows are documents and columns are terms (pruned to the top 1000)
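
The same layout is easier to see on a small scale. The sketch below builds a document-term matrix from three made-up documents and caps the vocabulary at five terms; `toy_docs`, `toy_vectorizer`, and `toy_X` are names introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents (illustration only)
toy_docs = [
    "media report israel media",
    "space shuttle launch orbit",
    "media report launch",
]

# Keep at most 5 terms, chosen by frequency across the corpus
toy_vectorizer = TfidfVectorizer(max_features=5)
toy_X = toy_vectorizer.fit_transform(toy_docs)

print(toy_X.shape)                     # (number of documents, number of kept terms)
print(len(toy_vectorizer.vocabulary_)) # number of kept terms
```

The result is a sparse matrix with one row per document and one column per kept term, which is why the full matrix above has shape (11314, 1000).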

In [69]:
from sklearn.decomposition import TruncatedSVD

# Apply SVD to the documents x terms matrix and keep the top 100 singular vectors.
# TruncatedSVD accepts sparse input, so it suits the sparse tf-idf matrix.
# SVD represents both documents and terms as vectors in the reduced space.
svd_model = TruncatedSVD(n_components=100, algorithm='randomized', n_iter=100, random_state=122)

svd_model.fit(X)

# top 100 singular vectors; each has the same dimension as the feature space (1000 terms), so together they form a 100-vector basis
len(svd_model.components_)

100

In [70]:
svd_model.components_.shape

(100, 1000)

In [71]:
len(svd_model.components_[0])

1000
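
Each row of `components_` assigns one weight to every term, so a common way to read a component is to list its highest-weighted terms. The sketch below does this on a made-up four-document corpus with two components. All `toy_*` names are introduced here for illustration, and ranking by absolute weight is one possible choice, not the only one.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with two rough themes (illustration only)
toy_docs = [
    "media report israel government",
    "israel government media soldiers",
    "space shuttle launch orbit",
    "shuttle orbit launch nasa",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

toy_svd = TruncatedSVD(n_components=2, random_state=122)
doc_vectors = toy_svd.fit_transform(toy_X)  # one 2-d vector per document

# Terms ordered by column index, so terms[j] names column j
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

for i, comp in enumerate(toy_svd.components_):
    # indices of the 3 terms with the largest absolute weight in this component
    top = np.argsort(np.abs(comp))[::-1][:3]
    print(i, [terms[j] for j in top])
```

Here `components_` has one row per component and one column per term, mirroring the (100, 1000) shape above, while `doc_vectors` holds the reduced representation of each document.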