## Extract the content of https://en.wikipedia.org/wiki/Machine_learning using the requests library

In [2]:
!pip install requests



In [3]:
import requests

# Define the URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/Machine_learning"

In [5]:
# Make a GET request to fetch the raw HTML content (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful.")
else:
    print("Request was unsuccessful.")

Request was successful.


In [7]:
print(response.text)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Machine learning - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-

## Finding all paragraphs inside \<p> \</p> tags using a regular expression

In [8]:
import re

In [9]:
html_content = response.text

# Use regular expressions to find all <p> tags
p_tags = re.findall(r'<p>(.*?)</p>', html_content, re.DOTALL)

# Print the content of each <p> tag
for i, p in enumerate(p_tags, 1):
    print(f"Paragraph {i}:")
    print(p)
    print("-" * 80)


Paragraph 1:
<b>Machine learning</b> (<b>ML</b>) is a <a href="/wiki/Field_of_study" class="mw-redirect" title="Field of study">field of study</a> in <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned with the development and study of <a href="/wiki/Computational_statistics" title="Computational statistics">statistical algorithms</a> that can learn from <a href="/wiki/Data" title="Data">data</a> and <a href="/wiki/Generalize" class="mw-redirect" title="Generalize">generalize</a> to unseen data and thus perform <a href="/wiki/Task_(computing)" title="Task (computing)">tasks</a> without explicit <a href="/wiki/Machine_code" title="Machine code">instructions</a>.<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">&#91;1&#93;</a></sup> Recently, <a href="/wiki/Artificial_neural_network" class="mw-redirect" title="Artificial neural network">artificial neural networks</a> have been able to surpass many previous approaches i
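The pattern above only matches bare `<p>` tags (no attributes) and returns the inner HTML with links and citation markup still in place. A minimal sketch of a more tolerant variant follows, assuming a small hard-coded `sample_html` string in place of `response.text`. The names `P_TAG`, `ANY_TAG`, and `extract_paragraph_text` are illustrative and not part of the original notebook. Regular expressions remain fragile on real HTML, so treat this as a sketch rather than a robust parser.

```python
import html
import re

# Hypothetical sample standing in for response.text
sample_html = (
    '<p class="lead"><b>Machine learning</b> (<b>ML</b>) is a '
    '<a href="/wiki/Field_of_study">field of study</a>.'
    '<sup class="reference"><a href="#cite_note-1">&#91;1&#93;</a></sup></p>\n'
    '<p>Second paragraph.</p>'
)

# <p\b[^>]*> matches an opening <p> tag with or without attributes,
# while \b keeps it from matching tags such as <pre> or <param>.
P_TAG = re.compile(r'<p\b[^>]*>(.*?)</p>', re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r'<[^>]+>')


def extract_paragraph_text(markup):
    """Return the plain text of every <p> element found in markup."""
    paragraphs = []
    for inner in P_TAG.findall(markup):
        text = ANY_TAG.sub('', inner)  # drop nested tags such as <a> and <b>
        text = html.unescape(text)     # decode entities such as &#91; and &#93;
        paragraphs.append(text.strip())
    return paragraphs


paragraphs_from_regex = extract_paragraph_text(sample_html)
for i, text in enumerate(paragraphs_from_regex, 1):
    print(f"Paragraph {i}: {text}")
```

On the sample, the first paragraph comes out as plain text with the citation marker `[1]` decoded from its HTML entities.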

In [10]:
from bs4 import BeautifulSoup

# Define the URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/Machine_learning"

# Make a GET request to fetch the raw HTML content (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the title of the page
    title = soup.find('h1', {'id': 'firstHeading'}).text
    print(f"Title: {title}\n")

    # Extract the main content of the page
    content = soup.find('div', {'class': 'mw-parser-output'})

    # Extract and print paragraphs from the content
    paragraphs = content.find_all('p')
    for i, para in enumerate(paragraphs, 1):
        print(f"Paragraph {i}:")
        print(para.get_text())
        print("-" * 80)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


Title: Machine learning

Paragraph 1:
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions.[1] Recently, artificial neural networks have been able to surpass many previous approaches in performance.[2]

--------------------------------------------------------------------------------
Paragraph 2:
ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.[3][4] When applied to business problems, it is known under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field's methods.

--------------------------------------------------------------------------------
Paragraph 3:
The mathematical foundations of ML are p

In [14]:
# Concatenate all paragraphs into a single string
all_paragraphs = "\n\n".join([para.get_text() for para in paragraphs])
all_paragraphs

'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions.[1] Recently, artificial neural networks have been able to surpass many previous approaches in performance.[2]\n\n\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.[3][4] When applied to business problems, it is known under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field\'s methods.\n\n\nThe mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis (EDA) through unsupervised learning.[6][7]\n\n\nFrom 
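The joined string contains runs of three newlines because each paragraph's text already ends with a newline before the `"\n\n"` separator is added. A minimal sketch of one way to avoid this is to strip each paragraph and drop empty ones before joining. The list `raw_paragraphs` below is a hypothetical stand-in for `[para.get_text() for para in paragraphs]`, and `join_paragraphs` is an illustrative helper name.

```python
# Hypothetical stand-in for [para.get_text() for para in paragraphs]
raw_paragraphs = [
    "\n",  # empty paragraphs do occur in scraped pages
    "Machine learning (ML) is a field of study.[1]\n",
    "ML finds application in many fields.[3][4]\n",
]


def join_paragraphs(texts, separator="\n\n"):
    """Strip each paragraph, skip blank ones, and join with the separator."""
    stripped = (text.strip() for text in texts)
    return separator.join(text for text in stripped if text)


joined_text = join_paragraphs(raw_paragraphs)
print(repr(joined_text))
```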

In [16]:
new_paragraphs = all_paragraphs.replace("\n"," ")
new_paragraphs

'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions.[1] Recently, artificial neural networks have been able to surpass many previous approaches in performance.[2]   ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.[3][4] When applied to business problems, it is known under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field\'s methods.   The mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis (EDA) through unsupervised learning.[6][7]   From a theoret
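Replacing each newline with a space leaves runs of three spaces, and the numeric citation markers such as `[1]` stay in the text. The sketch below shows one possible cleanup with the standard `re` module, assuming a short hypothetical `sample_text` in place of `all_paragraphs`. The helper name `clean_text` is illustrative. It only handles numeric markers; other bracketed notes would need their own pattern.

```python
import re

# Hypothetical excerpt standing in for all_paragraphs
sample_text = (
    "tasks without explicit instructions.[1] Recently, networks improved.[2]"
    "\n\n\nML finds application in medicine.[3][4] When applied"
)


def clean_text(text):
    """Remove numeric citation markers and collapse whitespace runs."""
    text = re.sub(r'\[\d+\]', '', text)  # drop markers such as [1] or [42]
    text = re.sub(r'\s+', ' ', text)     # newlines and repeated spaces -> one space
    return text.strip()


cleaned_text = clean_text(sample_text)
print(cleaned_text)
```

Removing the markers before sentence tokenization also keeps them from being attached to the start of the following sentence.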

## Using the NLTK library

In [17]:
!pip install nltk



In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Step 1: Stopword Removal
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(new_paragraphs)

filtered_words = [word for word in word_tokens if word.lower() not in stop_words]

print("Filtered Words (Stopword Removal):")
print(filtered_words)
print("\n")

Filtered Words (Stopword Removal):
['Machine', 'learning', '(', 'ML', ')', 'field', 'study', 'artificial', 'intelligence', 'concerned', 'development', 'study', 'statistical', 'algorithms', 'learn', 'data', 'generalize', 'unseen', 'data', 'thus', 'perform', 'tasks', 'without', 'explicit', 'instructions', '.', '[', '1', ']', 'Recently', ',', 'artificial', 'neural', 'networks', 'able', 'surpass', 'many', 'previous', 'approaches', 'performance', '.', '[', '2', ']', 'ML', 'finds', 'application', 'many', 'fields', ',', 'including', 'natural', 'language', 'processing', ',', 'computer', 'vision', ',', 'speech', 'recognition', ',', 'email', 'filtering', ',', 'agriculture', ',', 'medicine', '.', '[', '3', ']', '[', '4', ']', 'applied', 'business', 'problems', ',', 'known', 'name', 'predictive', 'analytics', '.', 'Although', 'machine', 'learning', 'statistically', 'based', ',', 'computational', 'statistics', 'important', 'source', 'field', "'s", 'methods', '.', 'mathematical', 'foundations', 'ML'

[nltk_data] Downloading package punkt to /home/sudip/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/sudip/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
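The filtered list above still contains punctuation, bracket and digit tokens, and keeps `Machine` and `machine` as distinct entries. A minimal sketch of a stricter filter follows, assuming a hypothetical `sample_tokens` list in place of `word_tokens` and a tiny `sample_stop_words` set in place of the NLTK list. The helper `content_words` is illustrative. Note that `str.isalpha` also discards numbers and hyphenated tokens, which may or may not be desirable.

```python
# Hypothetical samples standing in for word_tokens and the NLTK stopword set
sample_tokens = ['Machine', 'learning', '(', 'ML', ')', 'is', 'a', 'field',
                 '.', '[', '1', ']', 'Although', 'machine', "'s", 'methods']
sample_stop_words = {'is', 'a', 'although'}


def content_words(tokens, stop_words):
    """Lowercase tokens, keeping only alphabetic non-stopwords."""
    return [
        token.lower()
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]


cleaned_tokens = content_words(sample_tokens, sample_stop_words)
print(cleaned_tokens)
```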


In [21]:
# Step 2: Word Tokenization
print("Word Tokens:")
print(word_tokens)
print("\n")

Word Tokens:
['Machine', 'learning', '(', 'ML', ')', 'is', 'a', 'field', 'of', 'study', 'in', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'development', 'and', 'study', 'of', 'statistical', 'algorithms', 'that', 'can', 'learn', 'from', 'data', 'and', 'generalize', 'to', 'unseen', 'data', 'and', 'thus', 'perform', 'tasks', 'without', 'explicit', 'instructions', '.', '[', '1', ']', 'Recently', ',', 'artificial', 'neural', 'networks', 'have', 'been', 'able', 'to', 'surpass', 'many', 'previous', 'approaches', 'in', 'performance', '.', '[', '2', ']', 'ML', 'finds', 'application', 'in', 'many', 'fields', ',', 'including', 'natural', 'language', 'processing', ',', 'computer', 'vision', ',', 'speech', 'recognition', ',', 'email', 'filtering', ',', 'agriculture', ',', 'and', 'medicine', '.', '[', '3', ']', '[', '4', ']', 'When', 'applied', 'to', 'business', 'problems', ',', 'it', 'is', 'known', 'under', 'the', 'name', 'predictive', 'analytics', '.', 'Although', 'not', 'all', 'mach

In [22]:
# Step 3: Sentence Tokenization
sentences = sent_tokenize(new_paragraphs)

print("Sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")

Sentences:
Sentence 1: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions.
Sentence 2: [1] Recently, artificial neural networks have been able to surpass many previous approaches in performance.
Sentence 3: [2]   ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.
Sentence 4: [3][4] When applied to business problems, it is known under the name predictive analytics.
Sentence 5: Although not all machine learning is statistically based, computational statistics is an important source of the field's methods.
Sentence 6: The mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods.
Sentence 7: Data mining is a related (parallel) field of study, foc

## Training Word2Vec with gensim

In [24]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.3 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.0.4-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.16.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading gensim-4.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.6/26.6 MB 3.7 MB/s eta 0:00:00
Downloading smart_open-7.0.4-py3-none-any.whl (61 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 801.0 kB/s eta 0:00:00
Downloading wrapt-1.16.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_6

In [25]:
from gensim.models import Word2Vec

In [28]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [27]:
# Prepare sentences for Word2Vec
# Tokenize each sentence and remove stopwords
tokenized_sentences = []
for sentence in sentences:
    words = word_tokenize(sentence)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    tokenized_sentences.append(filtered_words)
tokenized_sentences

[['Machine',
  'learning',
  '(',
  'ML',
  ')',
  'field',
  'study',
  'artificial',
  'intelligence',
  'concerned',
  'development',
  'study',
  'statistical',
  'algorithms',
  'learn',
  'data',
  'generalize',
  'unseen',
  'data',
  'thus',
  'perform',
  'tasks',
  'without',
  'explicit',
  'instructions',
  '.'],
 ['[',
  '1',
  ']',
  'Recently',
  ',',
  'artificial',
  'neural',
  'networks',
  'able',
  'surpass',
  'many',
  'previous',
  'approaches',
  'performance',
  '.'],
 ['[',
  '2',
  ']',
  'ML',
  'finds',
  'application',
  'many',
  'fields',
  ',',
  'including',
  'natural',
  'language',
  'processing',
  ',',
  'computer',
  'vision',
  ',',
  'speech',
  'recognition',
  ',',
  'email',
  'filtering',
  ',',
  'agriculture',
  ',',
  'medicine',
  '.'],
 ['[',
  '3',
  ']',
  '[',
  '4',
  ']',
  'applied',
  'business',
  'problems',
  ',',
  'known',
  'name',
  'predictive',
  'analytics',
  '.'],
 ['Although',
  'machine',
  'learning',
  'statisti

In [29]:
# Step 4: Train Word2Vec model
# tokenized_sentences must be a list of token lists (one inner list per sentence)
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
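The `window=5` argument bounds how far from a target word the model looks for context words. The sketch below enumerates (target, context) pairs for a fixed window in plain Python, using a hypothetical `sample_sentence` and an illustrative helper `context_pairs`. It is a simplified illustration of the idea and not gensim's internal implementation, which differs in details (for example, its default architecture is CBOW, which combines the context words to predict the target).

```python
def context_pairs(sentence, window):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs


# Hypothetical tokenized sentence
sample_sentence = ['machine', 'learning', 'field', 'study']

for pair in context_pairs(sample_sentence, window=1):
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbours; a larger window adds more distant words as context.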

## Find the Word2Vec vector for a given word

In [30]:
# Optionally reload the saved model from disk instead of using the one in memory
# model = Word2Vec.load("word2vec.model")

# Find the vector for the word "Machine"
word = "Machine"
if word in model.wv:
    vector = model.wv[word]
    print(f"Vector for the word '{word}':\n{vector}")
else:
    print(f"The word '{word}' is not in the vocabulary.")

Vector for the word 'Machine':
[ 0.00150826 -0.0024985  -0.00669736 -0.00129595  0.00144523 -0.01028868
 -0.0031829   0.02162249  0.00278311 -0.01081619  0.00784815 -0.00194442
 -0.00905797  0.00060963  0.00594343  0.00201026 -0.00183678 -0.01278487
  0.00193024 -0.0230144  -0.00250346  0.00138534  0.01050903 -0.00726646
  0.00488788  0.00039099 -0.00353672  0.00594831 -0.00508274  0.0104986
  0.00750088 -0.0082659   0.01087621 -0.01123405  0.00282733  0.00255969
 -0.0043264  -0.01028149  0.00156563 -0.0109769  -0.00118626 -0.01426741
 -0.00795052 -0.00619176  0.01374099 -0.01100496  0.00078017 -0.00205435
  0.00474125  0.0142902  -0.00124429 -0.01107891  0.00023554  0.0086534
  0.00287692 -0.00241479  0.00543925  0.00356257 -0.01135803  0.01057508
 -0.00511914  0.00256314  0.00540672  0.00328266 -0.01123825  0.0097449
  0.00930161  0.01481379 -0.01874232  0.01311307  0.00261064  0.00230949
  0.01372692 -0.00696366  0.00442876 -0.00305065 -0.00111572  0.00162028
 -0.00809384 -0.0052075
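Word vectors like the one above are usually compared with cosine similarity. The sketch below computes it in plain Python on short hypothetical vectors; the helper name `cosine_similarity` and the sample values are illustrative. gensim's `model.wv.similarity` and `model.wv.most_similar` provide this on a trained model, although with a corpus as small as a single article the resulting similarities are unlikely to be meaningful.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors (real Word2Vec vectors here have 100 dimensions)
vec_a = [0.0015, -0.0025, -0.0067]
vec_b = [0.0030, -0.0050, -0.0134]  # same direction as vec_a, twice the length
vec_c = [0.0025, 0.0015, 0.0]       # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1 regardless of length, and orthogonal vectors score close to 0.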