In [8]:
# QUE.1: Basic text preprocessing (tokenization, lowercasing, removing punctuation).

# Basic text preprocessing is an essential step in natural language processing (NLP) tasks.
# 1. Tokenization: Splitting text into smaller units, such as words, sentences, or subwords.
#    Example: Input: "Hello, World!"  Output: ["Hello", ",", "World", "!"] (word tokenization)

# 2. Lowercasing: Converting all text to lowercase so that "Hello" and "hello" are treated as the same token.
#    Example: Input: "Hello World!"  Output: "hello world!"

# 3. Removing punctuation: Stripping punctuation marks so that only the words remain.
#    Example: Input: "Hello, World!"  Output: "Hello World"
import re
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data (only need to run once)
nltk.download('punkt')

def preprocess_text(text):
    # Lowercasing
    text = text.lower()

    # Removing punctuation (this also deletes hyphens, so "eco-friendly" becomes "ecofriendly")
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenizing
    tokens = word_tokenize(text)

    return tokens

# Example usage
text = "Toutche electric bicycles are eco-friendly and efficient"
preprocessed_text = preprocess_text(text)
print(preprocessed_text)


['toutche', 'electric', 'bicycles', 'are', 'ecofriendly', 'and', 'efficient']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
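
The three steps described in the comments of this cell can also be tried one at a time using only the standard library. This is a minimal sketch, not NLTK's tokenizer: `simple_tokenize` is a hypothetical regex-based stand-in for `word_tokenize`, and `strip_punctuation` with its `replacement` parameter is an assumed helper, added to show how a hyphenated word can be kept as two tokens instead of being fused.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for word_tokenize: each run of word characters
    # and each single punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def strip_punctuation(text, replacement=""):
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", replacement, text)

print(simple_tokenize("Hello, World!"))    # ['Hello', ',', 'World', '!']
print("Hello World!".lower())              # hello world!
print(strip_punctuation("Hello, World!"))  # Hello World

# Deleting punctuation fuses hyphenated words; substituting a space keeps them apart.
print(strip_punctuation("eco-friendly"))               # ecofriendly
print(strip_punctuation("eco-friendly", " ").split())  # ['eco', 'friendly']
```

Whether "ecofriendly" or two separate tokens is preferable depends on the downstream task, so the replacement character is left as a parameter here.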


In [1]:
# A RegEx (regular expression) is a sequence of characters that forms a search pattern.
# It can be used to check whether a string contains a match for that pattern.
import re
text = "My email address is john.doe@example.com"

# Search for an email address. The part before the @ may contain dots, plus signs,
# or hyphens, so \w+ alone would capture only "doe" from "john.doe".
match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

if match:
    print("Email found:", match.group())

Email found: john.doe@example.com
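
How much of an address is captured depends on the character classes used on each side of the `@`. The sketch below compares a `\w+`-only pattern with a broader one on a made-up sentence. Both addresses and both pattern names are illustrative assumptions, and the broader pattern is still a simplification of real email syntax.

```python
import re

# Hypothetical sample text with two addresses that contain "." and "+".
text = "Contact john.doe@example.com or sales+india@shop.example.org for details."

# Narrow pattern: \w+ cannot cross the dot or plus sign in the local part,
# and it stops after the first dot-separated label of the domain.
narrow = re.compile(r"\w+@\w+\.\w+")

# Broader pattern: allows ".", "+", "-" before the @ and several domain labels.
broad = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

narrow_matches = narrow.findall(text)
broad_matches = broad.findall(text)

print(narrow_matches)  # ['doe@example.com', 'india@shop.example']
print(broad_matches)   # ['john.doe@example.com', 'sales+india@shop.example.org']
```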


In [2]:
# QUE2: A function that identifies keywords related to Toutche's products in a given text


import re

# Define a list of keywords related to Toutche's products
keywords = [
    "e-bike", "ebike", "electric bike", "electric bicycle",
    "cycle", "bicycle", "pedal assist", "battery", "motor",
    "charging", "range", "Toutche", "Heileo", "speed", "brakes", "tyres"
]

def identify_keywords(text, keywords):
    # Convert text to lowercase for case-insensitive matching
    text = text.lower()

    # Create a set to store found keywords
    found_keywords = set()

    # Check each keyword in the text
    for keyword in keywords:
        # Use regex to find whole words matching the keywords
        if re.search(r'\b' + re.escape(keyword.lower()) + r'\b', text):
            found_keywords.add(keyword)

    return list(found_keywords)

# Example usage
text = "The new Toutche Heileo e-bike comes with an improved battery and pedal assist."
detected_keywords = identify_keywords(text, keywords)
print(detected_keywords)


['e-bike', 'battery', 'Heileo', 'Toutche', 'pedal assist']


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Sample training data (toy example: one query per intent)
X = ["How much do the bicycles cost in the mountain range series?", "Is it available offline in stores?", "Do you offer test rides?"]
y = ["pricing", "product_info", "services"]
# Create a pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# Train the model
model.fit(X, y)
# Predict intent
new_query = "Can I buy online or offline?"
print(model.predict([new_query]))

['product_info']


#COMMENTS EXPLAINING YOUR CODE AND THOUGHT PROCESS
**Thought Process Behind the Code:**

**1. Keyword List:**

The list of keywords (keywords) contains terms that are specifically related to Toutche's products and the e-bike industry, such as "e-bike," "battery," "pedal assist," and product names like "Heileo." These keywords are predefined based on knowledge of the product.

The list can be expanded with new terms or tailored to focus on specific product features.

**2. Lowercasing:**

Why lowercase? Text can contain mixed case, and the matching should be case-insensitive. Lowercasing both the input text and each keyword lets us find matches regardless of case.

**3. Regex for Matching:**

The regular expression r'\b' + re.escape(keyword.lower()) + r'\b' is used to find whole-word keyword matches, where:

-- \b asserts a word boundary, so the keyword matches only as a whole word. This prevents false positives, such as finding "cycle" inside the word "recycle."

-- re.escape() makes any regex metacharacters in a keyword match literally. The current keyword list contains only letters, spaces, and hyphens, so this is mainly a safeguard for future additions.

This gives whole-word, case-insensitive matching. It does not handle plurals or other inflections, so "batteries" would not match the keyword "battery."
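
The effect of \b and re.escape() can be checked directly. The sketch below is illustrative only: `contains_keyword` is a hypothetical helper that builds the same pattern shape as identify_keywords, and "v2.0" is a made-up keyword used to show what escaping prevents.

```python
import re

def contains_keyword(text, keyword):
    # Same pattern shape as identify_keywords: escaped keyword between word boundaries.
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None

print(contains_keyword("Please recycle the old battery.", "cycle"))    # False
print(contains_keyword("Please recycle the old battery.", "battery"))  # True
print(contains_keyword("Two batteries are included.", "battery"))      # False
print(contains_keyword("Is the E-Bike in stock?", "e-bike"))           # True

# Without re.escape, "." is a wildcard and matches any character.
unescaped = re.search(r"\b" + "v2.0" + r"\b", "model v2x0") is not None
escaped = re.search(r"\b" + re.escape("v2.0") + r"\b", "model v2x0") is not None
print(unescaped, escaped)  # True False
```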

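**4. Set for Storing Found Keywords:**

We use a set (found_keywords) to store the detected keywords. Why a set? A set holds each value only once, so a keyword that is listed more than once in the keyword list is still reported a single time. Repeated occurrences in the text make no difference here, because re.search only checks whether the keyword appears at all.

After checking all keywords, we convert the set to a list to keep the output simple. Because sets are unordered, the order of the returned list can vary between runs.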
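
If a stable output order matters, one option is to collect matches in a dict instead of a set, since dicts keep insertion order in Python 3.7+. The sketch below is an assumed variant, not part of the original solution. `identify_keywords_ordered` is a hypothetical name, and the keyword list deliberately repeats "battery" to show the deduplication.

```python
import re

def identify_keywords_ordered(text, keywords):
    # Hypothetical variant of identify_keywords that keeps keyword-list order.
    lowered = text.lower()
    found = {}  # dict keys are unique and preserve insertion order
    for keyword in keywords:
        if re.search(r"\b" + re.escape(keyword.lower()) + r"\b", lowered):
            found[keyword] = None  # an already-present key keeps its original position
    return list(found)

# "battery" is listed twice on purpose to show that it is reported once.
keywords = ["e-bike", "battery", "Toutche", "Heileo", "pedal assist", "battery"]
text = "The new Toutche Heileo e-bike comes with an improved battery and pedal assist."

result = identify_keywords_ordered(text, keywords)
print(result)  # ['e-bike', 'battery', 'Toutche', 'Heileo', 'pedal assist']
```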

**5. Returning the List:**

The function returns a list of found keywords for easier handling by the user. Lists are a common and easy-to-use data structure for further processing or displaying results.