# Web Scraping and Data Cleaning

# List of articles used
http://www.natgeotraveller.in/train-to-nowhere/

http://www.natgeotraveller.in/getting-saucy-about-food/

http://www.natgeotraveller.in/what-dreams-may-come/

http://www.natgeotraveller.in/six-years-and-counting/

# Step 1. Import required libraries

In [1]:
import bs4 as bs
import urllib.request

# Step 2. Define URL and articles

In [2]:
url='http://www.natgeotraveller.in/'

In [3]:
articles=['train-to-nowhere/','getting-saucy-about-food/','what-dreams-may-come/','six-years-and-counting/']
sources=[urllib.request.urlopen(url + i).read() for i in articles]
soups=[bs.BeautifulSoup(i,'lxml') for i in sources]
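
The list comprehension above downloads every article in one pass, so a single failed request stops the whole cell. A minimal sketch of a more defensive download follows. The helper name `fetch` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import urllib.request


def fetch(page_url, timeout=10):
    """Return the raw bytes at page_url, or None if the request fails.

    `fetch` and its 10-second default are hypothetical choices for this sketch.
    """
    try:
        with urllib.request.urlopen(page_url, timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        print('Could not fetch', page_url, ':', error)
        return None
```

With this helper, `sources` could be built as `[fetch(url + i) for i in articles]`, and any `None` entries dropped before the pages are parsed.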

In [12]:
texts=[]
ex=' Lakshmi Sankaran\r\r                                          fantasizes about a bucket-list journey to witness the aurora borealis someday. Editor in Chief at National Geographic Traveller India, she will also gladly follow a captivating tune to the end of this world.\r\r                                 Hey there! Like what you see (or not)? Tell us what you think at web.editor@natgeotraveller.in.'
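for soup in soups:
    # Join the text of every <p> element into one string per article.
    text = ' '.join(paragraph.text for paragraph in soup.find_all('p'))
    # Cut the trailing author bio and feedback note.
    # This assumes every page ends with exactly the string stored in `ex`.
    text = text[:-len(ex)]
    print(text, '\n\n')
    texts.append(text)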

I like a bit of pow-wow in any place. Let me rephrase before you think I am eternally hankering for a fight. What I mean is I would choose crooked streets over straight highways, sweaty mayhem over pristine elegance. This is why no matter where I go in this world, coming home to India, and especially Bombay, is never dull. I blame growing up in the city for my pugilistic predilections. One of the many descriptors that Mark Twain used in relation to Bombay was “pow-wow.” The place seemed to confound him: “Bewitching”, “Bewildering”, “Enchanting”, “Arabian Nights come again?”—the man was repulsed and riveted at the same time. It was a place befitting the number of exclamations he used. At 13, I was yet to be permitted the pleasures of travelling unchaperoned outside Bombay but within its confines, I had free rein to indulge my inner flâneur. I became the weekend loafer, slacking through parts of the city I really had no business being in. My itinerary hardly ever changed: Take the BEST b

In [13]:
texts

['I like a bit of pow-wow in any place. Let me rephrase before you think I am eternally hankering for a fight. What I mean is I would choose crooked streets over straight highways, sweaty mayhem over pristine elegance. This is why no matter where I go in this world, coming home to India, and especially Bombay, is never dull. I blame growing up in the city for my pugilistic predilections. One of the many descriptors that Mark Twain used in relation to Bombay was “pow-wow.” The place seemed to confound him: “Bewitching”, “Bewildering”, “Enchanting”, “Arabian Nights come again?”—the man was repulsed and riveted at the same time. It was a place befitting the number of exclamations he used. At 13, I was yet to be permitted the pleasures of travelling unchaperoned outside Bombay but within its confines, I had free rein to indulge my inner flâneur. I became the weekend loafer, slacking through parts of the city I really had no business being in. My itinerary hardly ever changed: Take the BEST
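
The cleaning step removes the last `len(ex)` characters from each article whether or not they match `ex`, so a page that ends differently would lose real content. A minimal sketch of a stricter version, assuming a hypothetical helper named `strip_footer` that is not in the original notebook:

```python
def strip_footer(text, footer):
    """Remove footer from the end of text only when text really ends with it."""
    if footer and text.endswith(footer):
        return text[:-len(footer)]
    # Leave the text untouched when the expected footer is absent.
    return text
```

Calling `strip_footer(text, ex)` inside the loop would leave an article unchanged when its ending differs from `ex`, which makes a mismatch visible instead of silently trimming the article.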

# Step 3. Import NLTK

In [14]:
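import nltk
# Downloads every NLTK corpus and model, which is large.
# The Lancaster stemmer used in Step 4 is rule-based and needs none of this data.
nltk.download('all')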

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to C:\Users\Swati
[nltk_data]    |     Singhvi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to C:\Users\Swati
[nltk_data]    |     Singhvi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to C:\Users\Swati
[nltk_data]    |     Singhvi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to C:\Users\Swati
[nltk_data]    |     Singhvi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to C:\Users\Swati
[nltk_data]    |     Singhvi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package ces

True

# Step 4. Stemming using Lancaster Stemmer

In [15]:
from nltk.stem import LancasterStemmer
LS=LancasterStemmer()

stem=[" ".join([LS.stem(j) for j in i.split()]) for i in texts]
stem

['i lik a bit of pow-wow in any place. let me rephras bef you think i am etern hank for a fight. what i mean is i would choos crook streets ov straight highways, sweaty mayhem ov pristin elegance. thi is why no mat wher i go in thi world, com hom to india, and espec bombay, is nev dull. i blam grow up in the city for my pugil predilections. on of the many describ that mark twain us in rel to bombay was “pow-wow.” the plac seem to confound him: “bewitching”, “bewildering”, “enchanting”, “arabian night com again?”—the man was repuls and rivet at the sam time. it was a plac befit the numb of exclam he used. at 13, i was yet to be permit the pleas of travel unchaperon outsid bombay but within it confines, i had fre rein to indulg my in flâneur. i becam the weekend loafer, slack through part of the city i real had no busy being in. my itin hard ev changed: tak the best bus to chowpatty; aft fil up on chaat, sampl som mor at the khau gul in churchgate; sometimes, pretend to shop for mus i co
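
In the output above, words with attached punctuation such as `highways,`, `elegance.` and `predilections.` come through lowercased but otherwise unchanged, because `split()` separates on whitespace only and the stemmer receives the punctuation as part of the word. A minimal sketch of one way to separate punctuation first, assuming a hypothetical helper named `stem_text` and a simple regular-expression tokenizer (both are illustrative and not the notebook's method):

```python
import re

WORD = re.compile(r"\w+")


def stem_text(text, stem_word):
    """Stem each word in text, keeping punctuation marks as separate tokens.

    stem_word is any callable that maps one word to its stem.
    """
    # A token is either a run of word characters or a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return ' '.join(stem_word(t) if WORD.fullmatch(t) else t for t in tokens)
```

Passing the notebook's stemmer would look like `stem_text(texts[0], LS.stem)`. The stemmed strings would then differ from the ones shown above, so the vocabulary size and similarity scores reported in the later steps would change as well.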

# Step 5. Count Vectorizer

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
vector=CountVectorizer(binary=True)

vector.fit(texts)
vocab=vector.vocabulary_
len(vocab)

947

In [18]:
vector.fit(stem)
vocab=vector.vocabulary_
len(vocab)

903
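
The vocabulary shrinks from 947 terms on the raw text to 903 on the stemmed text because stemming maps different forms of a word to the same term. As a rough illustration of what `len(vocab)` counts, here is a pure-Python sketch that approximates `CountVectorizer` with default settings (lowercasing, tokens of two or more word characters). The function name `vocabulary` is hypothetical.

```python
import re

# Same pattern as CountVectorizer's default token_pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary(documents):
    """Map each distinct term across documents to a column index."""
    terms = set()
    for document in documents:
        terms.update(TOKEN.findall(document.lower()))
    # Indices follow the sorted order of the terms.
    return {term: index for index, term in enumerate(sorted(terms))}
```

For example, `vocabulary(['The cat sat', 'a cat ran'])` contains four terms, since the one-letter token `a` is dropped and `cat` is counted once.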

# Step 6. Cosine Similarity

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

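# Compare every pair of articles once (i < j).
for i in range(len(stem) - 1):
    for j in range(i + 1, len(stem)):
        # Each article becomes a binary term vector over the stemmed vocabulary.
        similarity = cosine_similarity(vector.transform([stem[i]]).toarray(), vector.transform([stem[j]]).toarray())
        print('The similarity between Articles: ', i, ' and ', j, ' : ', similarity)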

The similarity between Articles:  0  and  1  :  [[0.21078196]]
The similarity between Articles:  0  and  2  :  [[0.21708323]]
The similarity between Articles:  0  and  3  :  [[0.19642857]]
The similarity between Articles:  1  and  2  :  [[0.19987028]]
The similarity between Articles:  1  and  3  :  [[0.21078196]]
The similarity between Articles:  2  and  3  :  [[0.206407]]
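
Because the vectorizer was created with `binary=True`, each article is a vector of zeros and ones, and the cosine similarity reduces to the number of shared terms divided by the square root of the product of the two articles' distinct term counts. A minimal pure-Python sketch of that formula, with the hypothetical name `binary_cosine`:

```python
import math


def binary_cosine(terms_a, terms_b):
    """Cosine similarity of two documents represented as sets of terms."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        # An empty document has a zero vector; report zero similarity.
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

Under this reading, a score of 1 means two articles use exactly the same set of terms and a score of 0 means they share none.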
