
# Natural Language Processing (NLP) Assignment
This assignment guides you through the basic concepts of Natural Language Processing, including:
- Text preprocessing
- Tokenization and N-grams
- Named Entity Recognition (NER)
- Converting text into numbers (vectorization)
- Word embeddings (for experienced learners)

You can run and modify the code cells below to complete the tasks.

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud

# Download the NLTK resources used in this notebook
for resource in ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker',
                 'words', 'stopwords', 'punkt_tab'):
    nltk.download(resource)

# Load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## 1. Text Preprocessing
Clean the following text by converting it to lowercase and removing punctuation and stop words.

In [None]:
# Sample text
text = "Natural Language Processing is a fascinating field. It combines linguistics and computer science!"

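# Lowercase, tokenize, then drop punctuation and stop words
def preprocess(text):
    """Return the lowercase alphanumeric tokens of `text` that are not English stop words."""
    text = text.lower()
    tokens = word_tokenize(text)
    # isalnum() drops punctuation tokens; the set lookup drops stop words
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    return cleaned_tokens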

# Print cleaned tokens
cleaned_tokens = preprocess(text)
print(cleaned_tokens)

['natural', 'language', 'processing', 'fascinating', 'field', 'combines', 'linguistics', 'computer', 'science']
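The `token.isalnum()` filter in `preprocess` removes every token that contains a non-alphanumeric character, not only pure punctuation. The sketch below illustrates that side effect. The token list is written by hand (it is not output from `word_tokenize`), and the stop-word set is a tiny stand-in for NLTK's much longer English list.

```python
# Hypothetical tokens, written by hand for illustration.
tokens = ["state-of-the-art", "nlp", "is", "fun", "!", "3.5", "gpt2", "don't"]

# Small stand-in stop-word set (assumption; NLTK's English list is much longer).
stop_words = {"is", "a", "the", "and"}

# Same filter as in preprocess(): alphanumeric only, and not a stop word.
kept = [t for t in tokens if t.isalnum() and t not in stop_words]

# Tokens rejected by isalnum() alone, regardless of the stop-word list.
dropped = [t for t in tokens if not t.isalnum()]

print("kept:", kept)
print("dropped by isalnum():", dropped)
```

Hyphenated words, decimals, and contractions are discarded along with punctuation. If a task needs them, a looser test such as `any(ch.isalnum() for ch in t)` may be a better fit.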


## 2. Tokenization and N-grams
Generate bigrams (2-grams) from the cleaned tokens.

In [None]:
# Generate bigrams from cleaned tokens
bigrams = list(ngrams(cleaned_tokens, 2))
print("Bigrams:", bigrams)

Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'fascinating'), ('fascinating', 'field'), ('field', 'combines'), ('combines', 'linguistics'), ('linguistics', 'computer'), ('computer', 'science')]
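The same sliding-window idea extends to any `n`. Below is a minimal pure-Python sketch of what an n-gram generator does. `make_ngrams` is a hypothetical helper named for this example, not part of NLTK, and the token list is a shortened copy of the cleaned tokens above.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as a list of tuples."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Build n shifted views of the list; zip stops at the shortest one,
    # so no window ever runs past the end of the sequence.
    return list(zip(*(tokens[i:] for i in range(n))))

demo_tokens = ["natural", "language", "processing", "fascinating", "field"]
demo_bigrams = make_ngrams(demo_tokens, 2)
demo_trigrams = make_ngrams(demo_tokens, 3)

print("Bigrams:", demo_bigrams)
print("Trigrams:", demo_trigrams)
```

A list of `k` tokens yields `k - n + 1` n-grams, and none at all when `n` exceeds `k`.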


## 3. Named Entity Recognition (NER)
Use spaCy to perform NER on a new sentence, then on the sample text from Section 1.

In [None]:
# Example sentence
sentence = "Barack Obama was born in Hawaii and was elected president in 2008."
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Repeat for the sample text from Section 1

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)


Barack Obama PERSON
Hawaii GPE
2008 DATE
Natural Language Processing ORG
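A common next step is to group the extracted entities by label. The sketch below uses the `(text, label)` pairs copied from the output above in place of live `doc.ents`, so it runs without a spaCy model. The variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above. In the notebook they
# could be built with: [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("Barack Obama", "PERSON"),
    ("Hawaii", "GPE"),
    ("2008", "DATE"),
    ("Natural Language Processing", "ORG"),
]

# Collect entity texts under their label.
by_label = defaultdict(list)
for ent_text, label in entities:
    by_label[label].append(ent_text)

for label, texts in sorted(by_label.items()):
    print(f"{label}: {texts}")
```

Grouping also makes questionable labels easier to spot. "Natural Language Processing" is tagged `ORG` here, a reminder that a statistical model's predictions need review.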


## 4. Converting Text to Numbers
Use CountVectorizer and TfidfVectorizer to convert a list of sentences into numeric vectors.

In [None]:
sentences = [
    "I love machine learning.",
    "Natural language processing is a part of AI.",
    "AI is the future."
]

# CountVectorizer
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(sentences)
print("Count Vectorizer Output:\n", X_count.toarray())

# TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(sentences)
print("\nTF-IDF Vectorizer Output:\n", X_tfidf.toarray())

Count Vectorizer Output:
 [[0 0 0 0 1 1 1 0 0 0 0 0]
 [1 0 1 1 0 0 0 1 1 1 1 0]
 [1 1 1 0 0 0 0 0 0 0 0 1]]

TF-IDF Vectorizer Output:
 [[0.         0.         0.         0.         0.57735027 0.57735027
  0.57735027 0.         0.         0.         0.         0.        ]
 [0.30650422 0.         0.30650422 0.40301621 0.         0.
  0.         0.40301621 0.40301621 0.40301621 0.40301621 0.        ]
 [0.42804604 0.5628291  0.42804604 0.         0.         0.
  0.         0.         0.         0.         0.         0.5628291 ]]
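To see where these numbers come from, the sketch below recomputes both matrices with the standard library only. It assumes scikit-learn's default settings: lowercasing, tokens of at least two word characters, smoothed IDF, and L2 row normalization. It is meant as an illustration, not a replacement for the vectorizers.

```python
import math
import re
from collections import Counter

sentences = [
    "I love machine learning.",
    "Natural language processing is a part of AI.",
    "AI is the future.",
]

# Lowercase and keep tokens of two or more word characters, which is why
# "I" and "a" have no column in the matrices above.
docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
vocab = sorted({tok for doc in docs for tok in doc})  # column order

# Bag-of-words counts: one row per sentence, one column per vocabulary term.
counts = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(count_row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [c * idf[term] for c, term in zip(count_row, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm for w in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]

print(vocab)
for row in tfidf:
    print([round(v, 4) for v in row])
```

Terms that occur in two sentences ("ai", "is") get a lower IDF than terms that occur in one, so they receive smaller weights within the same row.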


## 5. Word Embeddings (Advanced)
Use spaCy to get word vectors (embeddings) for given words.

In [None]:
# Note: en_core_web_sm does not ship static word vectors, so vectors and
# similarity scores obtained from it are unreliable.
# Uncomment the lines below to install and load the medium model (en_core_web_md) instead.

# !python -m spacy download en_core_web_md
# nlp = spacy.load("en_core_web_md")

# Example word vector
word = nlp("machine")[0]
print("Vector for 'machine':\n", word.vector)

Vector for 'machine':
 [-0.72883    0.20718   -0.0033379 -0.0027673 -0.17204    0.023277
  0.1297    -0.2112     0.32876    0.67447    0.10047   -0.30559
  0.11213    0.22959   -0.32997    0.1389    -0.57289    2.523
 -0.32921    0.06045    0.23895    0.1091     0.19358   -0.1765
  0.11583    0.63204   -0.13644   -0.24354    0.20061   -0.50244
  0.40537   -0.38688    0.73784    0.093937  -0.30643    0.045874
  0.097915  -0.082114   0.13082   -0.039022   0.088084  -0.27023
 -0.077658  -0.0045355  0.18986   -0.063083  -0.138      0.40474
 -0.16199   -0.10953    0.22923   -0.67634   -0.65763   -0.044595
 -0.12119    0.071167   0.25993   -0.27052   -0.22474   -0.13818
  0.20692    0.87604   -0.35257   -0.1498     0.72804    0.68768
  0.19993    0.084733  -0.2234     0.11301    0.29895   -0.090119
  0.038172  -0.32912    0.014221  -0.36335    0.5898     0.10467
  0.16549    0.47199    0.078939  -0.19985    0.84014   -0.2277
 -0.22907   -0.26243   -0.32598    1.0146    -0.079235  -0.34248
  

Words that appear in similar contexts (like "machine" and "robot") should get a higher cosine similarity score than unrelated words (like "machine" and "apple").

In [None]:
# Compare word similarities
word1 = nlp("machine")[0]
word2 = nlp("robot")[0]
word3 = nlp("apple")[0]

print(f"Similarity between 'machine' and 'robot': {word1.similarity(word2):.4f}")
print(f"Similarity between 'machine' and 'apple': {word1.similarity(word3):.4f}")

Similarity between 'machine' and 'robot': 0.4400
Similarity between 'machine' and 'apple': 0.0060
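spaCy's `similarity` is, by default, the cosine similarity of the two vectors. The sketch below computes that measure by hand. The three-dimensional vectors are invented for illustration and are not real embeddings, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
machine_vec = [0.9, 0.8, 0.1]
robot_vec = [0.8, 0.9, 0.2]
apple_vec = [0.1, 0.0, 0.9]

print(f"machine vs robot: {cosine_similarity(machine_vec, robot_vec):.4f}")
print(f"machine vs apple: {cosine_similarity(machine_vec, apple_vec):.4f}")
```

Vectors that point in nearly the same direction score close to 1, while orthogonal vectors score 0, independent of their lengths.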
