<a href="https://colab.research.google.com/github/usshaa/SMBDA/blob/main/C-1.1%3A%20NLP_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Python

This notebook will guide you through the basics of Natural Language Processing (NLP) using Python. We will cover text preprocessing, tokenization, part-of-speech tagging, named entity recognition, and text classification.

## 1. Introduction to NLP

Natural Language Processing is a subfield of artificial intelligence concerned with how computers process human language. Its goal is to enable computers to understand, interpret, and generate natural language well enough to be useful for practical tasks, such as the tagging, entity recognition, and classification tasks covered in this notebook.

## 2. Setting Up the Environment

In [1]:
# Install the required libraries before running the rest of the notebook.
# Note: the PyPI package is named scikit-learn; it is imported as sklearn.

!pip install nltk
!pip install spacy
!pip install scikit-learn
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

## 3. Text Preprocessing

Text preprocessing is essential to prepare raw text for analysis. Common steps include:
- Lowercasing
- Removing punctuation
- Removing stop words
- Lemmatization

Here’s an example of preprocessing using NLTK:

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

In [8]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')  # Tokenizer data required by word_tokenize in recent NLTK releases

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [9]:
def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and lemmatize."""
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Remove stop words (build the set once instead of reloading the list per token)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize each token (WordNetLemmatizer treats words as nouns by default)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

tokens = preprocess_text("Hello, welcome to the world of NLP!")
print(tokens)

['hello', 'welcome', 'world', 'nlp']
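One side effect of the punctuation step is worth seeing in isolation: because it runs before tokenization, it also deletes apostrophes, hyphens, and periods that sit inside words. The sketch below uses only the standard library; the helper name `strip_punctuation` is ours, not an NLTK function.

```python
import string

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Contractions, hyphenated words, and abbreviations are fused into single strings
print(strip_punctuation("don't over-think it, e.g. U.K."))
# prints: dont overthink it eg UK
```

If contractions or abbreviations matter for your task, one option is to tokenize first and then drop tokens that consist only of punctuation.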


## 4. Tokenization

Tokenization is the process of splitting text into smaller units (tokens), typically words and punctuation marks. Here’s how you can perform tokenization using NLTK:

In [10]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
from nltk.tokenize import word_tokenize

text = "Hello, welcome to the world of NLP!"
tokens = word_tokenize(text)
print(tokens)

['Hello', ',', 'welcome', 'to', 'the', 'world', 'of', 'NLP', '!']
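To make the idea concrete, here is a rough regular-expression approximation of word-level tokenization. It is a sketch, not NLTK's algorithm: the function name `simple_tokenize` and the pattern are our own assumptions, and the result agrees with `word_tokenize` on this sentence but not in general (for example, NLTK splits "don't" into "do" and "n't", while this pattern splits at the apostrophe).

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, welcome to the world of NLP!"))
# prints: ['Hello', ',', 'welcome', 'to', 'the', 'world', 'of', 'NLP', '!']
```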


## 5. Part-of-Speech Tagging

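Part-of-speech (POS) tagging assigns a grammatical category (e.g., noun, verb, adjective) to each token. NLTK's default English tagger uses Penn Treebank tags such as `NNP` (proper noun) and `VBZ` (third-person singular present verb). Here’s an example using NLTK: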

In [12]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [14]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [15]:
text = "Natural Language Processing is fascinating."
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fascinating', 'VBG'), ('.', '.')]
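Tagged output like this is a plain list of `(word, tag)` tuples, so it can be filtered with ordinary Python. The sketch below copies the tagged tokens from the output above and selects words by tag prefix; the helper name `words_with_tag_prefix` is ours. In the Penn Treebank tag set, noun tags start with `NN` and verb tags start with `VB`.

```python
# Tagged tokens copied from the notebook output above
pos_tags = [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
            ('is', 'VBZ'), ('fascinating', 'VBG'), ('.', '.')]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(pos_tags, "NN")
verbs = words_with_tag_prefix(pos_tags, "VB")
print(nouns)  # prints: ['Language', 'Processing']
print(verbs)  # prints: ['is', 'fascinating']
```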


## 6. Named Entity Recognition

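Named Entity Recognition (NER) identifies spans of text that name real-world entities (e.g., people, organizations, locations, monetary amounts) and labels each span with an entity type. Here’s an example using spaCy: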

In [16]:
import spacy

nlp = spacy.load('en_core_web_sm')
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
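For downstream use it is often convenient to group entities by label. The following sketch works on `(text, label)` pairs copied from the output above, so it runs without spaCy; with a real `doc`, the same pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `group_by_label` is ours.

```python
from collections import defaultdict

# Entity pairs copied from the notebook output above
entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying that label."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

print(group_by_label(entities))
# prints: {'ORG': ['Apple'], 'GPE': ['U.K.'], 'MONEY': ['$1 billion']}
```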


## 7. Text Classification

Text classification assigns predefined categories to text. Here’s an example that trains a multinomial Naive Bayes classifier on bag-of-words counts using scikit-learn:

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [18]:
# Sample data
documents = ["I love programming!", "Python is great.", "I dislike bugs.", "Debugging is fun."]
labels = [1, 1, 0, 1]  # 1: Positive, 0: Negative

In [19]:
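# Split the data. With only 4 documents, test_size=0.2 holds out a single
# document, and without a fixed random_state the split changes between runs.
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2)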

# Vectorization
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
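To see what the vectorization step produces, here is a standard-library sketch of a bag-of-words count matrix. It approximates `CountVectorizer`'s default behavior (lowercasing, and keeping only tokens of two or more word characters, so "I" is dropped), but it is an illustration rather than scikit-learn's implementation. The function names are ours, and for readability it uses all four sample documents rather than only the training split.

```python
import re
from collections import Counter

documents = ["I love programming!", "Python is great.", "I dislike bugs.", "Debugging is fun."]

def tokenize(doc):
    """Lowercase the document and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

def bag_of_words(docs):
    """Return a sorted vocabulary and one row of term counts per document."""
    vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary term, and each entry counts how often that term occurs in the document. Word order is discarded, which is why the representation is called a "bag" of words.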

In [20]:
# Train the classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)
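
As a rough summary of what the fitted model computes: multinomial Naive Bayes scores each class by combining the class prior with smoothed per-class term probabilities, then picks the highest-scoring class. The notation below is ours, written under the assumption of scikit-learn's default smoothing parameter `alpha=1.0`.

$$
\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{yi} \right],
\qquad
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
$$

Here $x_i$ is the count of term $i$ in the document being classified, $N_{yi}$ is the total count of term $i$ across training documents of class $y$, $N_y = \sum_i N_{yi}$, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter, which keeps terms unseen in a class from receiving zero probability.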

In [21]:
# Test the classifier
X_test_vectorized = vectorizer.transform(X_test)
predictions = classifier.predict(X_test_vectorized)
predictions

array([1])

In [22]:
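# Calculate accuracy: the fraction of test documents classified correctly.
# With a single test document, the result can only be 0.0 or 1.0.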
accuracy = accuracy_score(y_test, predictions)
accuracy

1.0