In [1]:
article = '''The biggest known star in our universe is Stephenson 2-18. It is a type of “Red Supergiant Star” and these types of stars are the largest stars in the universe according to volume. 
Stephenson 2-18 Star located in a cluster of stars “Stephenson 2” that exist in the constellation of Scutum. 
The estimated distance of Stephenson 2-18 star is around 19000 light-years away from the Earth. That is equal to almost 6,000 parsecs.
Stephenson 2-18 star is so big that 10 billion Sun size star can enter inside of it. 
The luminosity of the biggest star Stephenson 2-18 could be around 440,000 times of sun’s luminosity. Whereas the usual luminosity of any red supergiant star is around 30,000 to 50,000 times the sun’s luminosity. So this is totally unusual for this largest star. And for this unusual property, it exists at the top right corner of the Hertzsprung–Russell diagram for its M6 spectral type.'''

In [2]:
article

'The biggest known star in our universe is Stephenson 2-18. It is a type of “Red Supergiant Star” and these types of stars are the largest stars in the universe according to volume. \nStephenson 2-18 Star located in a cluster of stars “Stephenson 2” that exist in the constellation of Scutum. \nThe estimated distance of Stephenson 2-18 star is around 19000 light-years away from the Earth. That is equal to almost 6,000 parsecs.\nStephenson 2-18 star is so big that 10 billion Sun size star can enter inside of it. \nThe luminosity of the biggest star Stephenson 2-18 could be around 440,000 times of sun’s luminosity. Whereas the usual luminosity of any red supergiant star is around 30,000 to 50,000 times the sun’s luminosity. So this is totally unusual for this largest star. And for this unusual property, it exists at the top right corner of the Hertzsprung–Russell diagram for its M6 spectral type.'

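The text must be preprocessed first. We will split the article into sentences, remove stop words, and lemmatize the remaining words. The NLTK calls below assume the required NLTK data (sentence tokenizer models, the stop word corpus, and WordNet) has already been downloaded with `nltk.download`.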

In [3]:
import re
from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Initialize the lemmatizer.

In [4]:
wordnet = WordNetLemmatizer()

First, split the entire article into sentences.

In [5]:
sentences = sent_tokenize(article)

In [6]:
sentences

['The biggest known star in our universe is Stephenson 2-18.',
 'It is a type of “Red Supergiant Star” and these types of stars are the largest stars in the universe according to volume.',
 'Stephenson 2-18 Star located in a cluster of stars “Stephenson 2” that exist in the constellation of Scutum.',
 'The estimated distance of Stephenson 2-18 star is around 19000 light-years away from the Earth.',
 'That is equal to almost 6,000 parsecs.',
 'Stephenson 2-18 star is so big that 10 billion Sun size star can enter inside of it.',
 'The luminosity of the biggest star Stephenson 2-18 could be around 440,000 times of sun’s luminosity.',
 'Whereas the usual luminosity of any red supergiant star is around 30,000 to 50,000 times the sun’s luminosity.',
 'So this is totally unusual for this largest star.',
 'And for this unusual property, it exists at the top right corner of the Hertzsprung–Russell diagram for its M6 spectral type.']
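
To make the sentence-splitting step more concrete, here is a rough sketch of a naive splitter built only on `re`. It is not how `sent_tokenize` works internally (NLTK uses a trained Punkt model that handles abbreviations and similar cases); the function name `naive_sent_split` is a hypothetical helper introduced for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    A simplification: abbreviations such as 'e.g.' would be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "That is equal to almost 6,000 parsecs.\nStephenson 2-18 star is so big."
print(naive_sent_split(sample))
```

For simple prose like this article the naive rule is usually enough, but `sent_tokenize` remains the safer choice for general text.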

Create an empty list to hold the preprocessed text.

In [6]:
lemmatized_holder = []

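Preprocess the article one sentence at a time: replace non-alphanumeric characters with spaces, lowercase the text, split it into words, drop stop words, lemmatize what remains, and join the words back into a string.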

In [7]:
# Build the stop word set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
for sentence in sentences:
    # Replace everything except letters and digits with a space
    cleaned = re.sub('[^a-zA-Z0-9]', ' ', sentence)
    tokens = cleaned.lower().split()
    # Drop stop words, then lemmatize the remaining tokens
    lemmas = [wordnet.lemmatize(w) for w in tokens if w not in stop_words]
    lemmatized_holder.append(' '.join(lemmas))

In [8]:
lemmatized_holder

['biggest known star universe stephenson 2 18',
 'type red supergiant star type star largest star universe according volume',
 'stephenson 2 18 star located cluster star stephenson 2 exist constellation scutum',
 'estimated distance stephenson 2 18 star around 19000 light year away earth',
 'equal almost 6 000 parsec',
 'stephenson 2 18 star big 10 billion sun size star enter inside',
 'luminosity biggest star stephenson 2 18 could around 440 000 time sun luminosity',
 'whereas usual luminosity red supergiant star around 30 000 50 000 time sun luminosity',
 'totally unusual largest star',
 'unusual property exists top right corner hertzsprung russell diagram m6 spectral type']
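
The output above shows "2-18" as "2 18" and "6,000" as "6 000". That comes from the regular expression, which replaces the hyphen and the comma with spaces before the text is split. The sketch below reproduces only the cleaning and stop word steps using the standard library. `STOP_WORDS` is a tiny hypothetical list chosen for this example, not NLTK's list, and lemmatization is left out on purpose.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the notebook itself uses NLTK's full English list.
STOP_WORDS = {"the", "is", "of", "to", "that", "in", "our"}

def clean_sentence(sentence, stop_words=STOP_WORDS):
    """Replace non-alphanumerics with spaces, lowercase, split, drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", sentence)
    tokens = cleaned.lower().split()
    return [t for t in tokens if t not in stop_words]

print(clean_sentence("That is equal to almost 6,000 parsecs."))
print(clean_sentence("The biggest known star in our universe is Stephenson 2-18."))
```

Because no lemmatizer is applied here, "parsecs" stays plural in this sketch, whereas the notebook output reduces it to "parsec".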

**Bag of Words**

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval. In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision. (Source: <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">Wikipedia</a>)
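
As a minimal sketch of the idea, the following pure-Python example builds a bag-of-words matrix by hand: it collects a sorted vocabulary and counts how often each vocabulary word occurs in each document. The function name `bag_of_words` and the two sample documents are assumptions made for illustration; tokenization is a plain whitespace split.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # One column per vocabulary word; words absent from the doc count as 0
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["totally unusual largest star", "star type star"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['largest', 'star', 'totally', 'type', 'unusual']
print(matrix)  # [[1, 1, 1, 0, 1], [0, 2, 0, 1, 0]]
```

Word order is lost ("star type star" and "type star star" give the same row), but multiplicity is kept: "star" is counted twice in the second document.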

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features = 2000)

In [10]:
X_data = bow.fit_transform(lemmatized_holder).toarray()
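
One detail worth knowing when reading the resulting matrix: with its default settings, `CountVectorizer` tokenizes using the pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. Single-character tokens such as "2" and "6" in the preprocessed sentences therefore do not become features. The sketch below imitates that behaviour with the standard library only; `vectorizer_tokens` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def vectorizer_tokens(text):
    """Approximate the default CountVectorizer tokenization (lowercase + pattern)."""
    return TOKEN_PATTERN.findall(text.lower())

print(vectorizer_tokens("equal almost 6 000 parsec"))
print(vectorizer_tokens("biggest known star universe stephenson 2 18"))
```

This is why the row for "equal almost 6 000 parsec" contains four nonzero entries rather than five. The columns of the matrix follow the alphabetically sorted vocabulary, so numeric tokens such as "000" come first.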

In [11]:
X_data

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0,
        0, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 2,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 