In [3]:
faqs = """About the Project
Who has developed this Machine Learning Project?
This project has been developed by **Vikas Malviya**, a postgraduate student pursuing M.Tech in Artificial Intelligence at MANIT, Bhopal.

What is the purpose of this ML project?
The purpose of this project is to demonstrate a complete end-to-end Machine Learning pipeline including data preprocessing, model building, evaluation, deployment, and interactive UI integration.

What type of project is this?
It is an end-to-end ML system with:
Data Cleaning
Feature Engineering
EDA
Model Training
Model Optimization
Model Deployment (Streamlit)
User Interface
Report Documentation

What technology stack has been used?
Python
NumPy, Pandas
Scikit-learn
Matplotlib/Seaborn
Streamlit
Pickle for model saving
Git & GitHub
Basic UI Components

Does this project contain Deep Learning models?
No, this particular project focuses only on classical ML algorithms for fast execution and one-day build feasibility.

Is the project suitable for beginners?
Yes. The entire workflow is structured to be beginner-friendly and easy to understand.

Is this project deployable?
Yes. The model is deployed using Streamlit and runs locally or on cloud hosting platforms.

About the Developer
Who is the creator of this project?
My name is **Vikas Malviya**, and I am currently pursuing M.Tech in Artificial Intelligence at Maulana Azad National Institute of Technology (MANIT), Bhopal.

What is the educational background of the developer?
• M.Tech – Artificial Intelligence, MANIT Bhopal
• B.Tech – Electrical Engineering, MITS Gwalior (CGPA 8.5)

Does the developer have ML-related certifications?
Yes. The developer holds multiple certifications including:
Google Cloud: Professional Data Engineer
Microsoft Azure (DP-203)
Microsoft Azure Fundamentals (AZ-900)

Project Workflow and Content
What steps are included in this ML project?
The project includes:
Data Collection
Data Cleaning
Exploratory Data Analysis
Statistical Insights
Feature Engineering
Model Selection
Model Training
Evaluation Metrics
Saving the model
Building a Streamlit Web App
Report generation

What algorithms are used in this project?
Depending on the dataset:
Logistic Regression
Random Forest
Support Vector Machine
Decision Tree
Naive Bayes
KNN
Gradient Boosting (optional)

Does the project include visualizations?
Yes. It includes:
Distribution plots
Correlation heatmaps
Class imbalance analysis
Pair plots
Confusion matrix
Accuracy graphs

Dataset Related Questions
What dataset has been used in the project?
A publicly available dataset from Kaggle/UCI that is open-source and suitable for ML classification tasks.

Is the dataset balanced?
The dataset may contain imbalance; this has been analyzed during EDA and addressed through ML techniques if required.

Are there missing values in the dataset?
Missing values are handled using mean/median imputation or appropriate preprocessing steps.

Deployment Related Questions
How can I run this project?
After installing the required libraries, use:
streamlit run app.py

Where is the model deployed?
The model is deployed locally using Streamlit UI. It can also be pushed to Streamlit Cloud or other hosting services.

Can a non-technical user run the app?
Yes. The UI is designed to be simple and easy to use, with input fields and instant predictions.

What happens when I input the values?
The trained ML model processes the given features and outputs the predicted result instantly.

Support and Communication
Can I contact the developer for questions?
Yes. You can reach out at: **vikasmlv99@gmail.com**

Will support be provided if I face issues?
Basic clarification can be provided by contacting the developer.

Project Documentation
Does this ML project include a complete report?
Yes. The report contains:
Introduction
Problem Statement
Dataset Description
Methodology
Model Pipeline
Results & Accuracy
Screenshots of Web App
Conclusion and Future Scope

Can I use this project for my resume?
Absolutely. This is a complete end-to-end project suitable for portfolio and academic usage.

General Project Queries
Can this project be expanded?
Yes. You can add:
Hyperparameter tuning
More algorithms
Explainability (SHAP/LIME)
API deployment with FastAPI
Cloud Deployment
Database integration

Can I switch to a different model?
You may replace the current model with any sklearn classifier while maintaining the same pipeline.

Can I convert this into a research project?
Yes. By adding more datasets, evaluation metrics, and comparative studies, it can be turned into a research-style document.

Why is the FAQ section so large?
The FAQ dataset is intentionally expanded to be used for:
Documentation
NLP training
FAQ chatbots
Search models
Project clarity

Is this FAQ written manually?
Yes, this FAQ is intentionally designed to resemble large, structured FAQ datasets for model training or submission purposes.
"""


In [4]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [5]:
tokenizer = Tokenizer()

In [6]:
tokenizer.fit_on_texts([faqs])

In [7]:
tokenizer.word_index

{'the': 1,
 'project': 2,
 'is': 3,
 'this': 4,
 'model': 5,
 'and': 6,
 'to': 7,
 'can': 8,
 'a': 9,
 'yes': 10,
 'ml': 11,
 'for': 12,
 'dataset': 13,
 'what': 14,
 'be': 15,
 'i': 16,
 'of': 17,
 'streamlit': 18,
 'in': 19,
 'end': 20,
 'data': 21,
 'developer': 22,
 'faq': 23,
 'has': 24,
 'deployment': 25,
 'been': 26,
 'tech': 27,
 'ui': 28,
 'it': 29,
 'with': 30,
 'training': 31,
 'report': 32,
 'used': 33,
 'does': 34,
 'or': 35,
 'cloud': 36,
 'are': 37,
 'app': 38,
 'machine': 39,
 'learning': 40,
 'by': 41,
 'm': 42,
 'artificial': 43,
 'intelligence': 44,
 'at': 45,
 'manit': 46,
 'bhopal': 47,
 'complete': 48,
 'pipeline': 49,
 'evaluation': 50,
 'engineering': 51,
 'documentation': 52,
 'on': 53,
 'algorithms': 54,
 'suitable': 55,
 'deployed': 56,
 'using': 57,
 'related': 58,
 'support': 59,
 'questions': 60,
 'values': 61,
 'run': 62,
 'use': 63,
 'you': 64,
 'about': 65,
 'who': 66,
 'developed': 67,
 'vikas': 68,
 'malviya': 69,
 'pursuing': 70,
 'purpose': 71,
 ...

In [8]:
len(tokenizer.word_index)

334
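
The word index above ranks words by frequency, with index 1 given to the most frequent word. The sketch below approximates that behaviour in plain Python so the ranking is easy to inspect. `build_word_index` is a hypothetical helper, not part of Keras, and its punctuation handling is a simplification of the `Tokenizer` defaults.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Approximate a frequency-ranked word index (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Lowercase, then turn every run of other characters into a space.
        cleaned = re.sub(r"[^a-z0-9']+", " ", text.lower())
        counts.update(cleaned.split())
    # Rank 1 is the most frequent word; 0 stays free for padding.
    # Ties keep the order in which the words first appeared.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sample = "The model is deployed. The model runs locally."
word_index = build_word_index([sample])
print(word_index)
```

Because index 0 is never assigned to a word, the vocabulary size used for the embedding and output layers is `len(word_index) + 1`.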

In [9]:
input_sequences = []
for sentence in faqs.split('\n'):
  tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]

  for i in range(1,len(tokenized_sentence)):
    input_sequences.append(tokenized_sentence[:i+1])
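
The loop above turns each line into a set of growing prefixes, so every word after the first becomes a prediction target once. A minimal standalone sketch of the same idea, using the ids for "About the Project" from the word index (`prefix_sequences` is a name introduced here for illustration):

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2 from a list of token ids."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

line = [65, 1, 2]  # 'about', 'the', 'project'
prefixes = prefix_sequences(line)
print(prefixes)  # [[65, 1], [65, 1, 2]]

# A line with a single token yields no training example.
print(prefix_sequences([7]))  # []
```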

In [10]:
input_sequences

[[65, 1],
 [65, 1, 2],
 [66, 24],
 [66, 24, 67],
 [66, 24, 67, 4],
 [66, 24, 67, 4, 39],
 [66, 24, 67, 4, 39, 40],
 [66, 24, 67, 4, 39, 40, 2],
 [4, 2],
 [4, 2, 24],
 [4, 2, 24, 26],
 [4, 2, 24, 26, 67],
 [4, 2, 24, 26, 67, 41],
 [4, 2, 24, 26, 67, 41, 68],
 [4, 2, 24, 26, 67, 41, 68, 69],
 [4, 2, 24, 26, 67, 41, 68, 69, 9],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42, 27],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42, 27, 19],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42, 27, 19, 43],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42, 27, 19, 43, 44],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42, 27, 19, 43, 44, 45],
 [4, 2, 24, 26, 67, 41, 68, 69, 9, 119, 120, 70, 42, 27, 19, 43, 44, 45, 46],
 ...]

In [11]:
max_len = max([len(x) for x in input_sequences])

In [12]:
max_len

27

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_input_sequences = pad_sequences(input_sequences, maxlen = max_len, padding='pre')

In [14]:
X = padded_input_sequences[:,:-1]

In [15]:
y = padded_input_sequences[:,-1]

In [16]:
X.shape

(608, 26)

In [17]:
y.shape

(608,)
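
Pre-padding pushes the real tokens to the right edge of each row, so the last column always holds the word to predict and everything before it is the context. The sketch below reproduces the padding and the X/y split in plain Python on three short sequences. `pre_pad` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`.

```python
def pre_pad(seq, maxlen, pad_value=0):
    """Left-pad seq to maxlen, keeping only the last maxlen items."""
    seq = seq[-maxlen:]
    return [pad_value] * (maxlen - len(seq)) + seq

sequences = [[65, 1], [65, 1, 2], [66, 24, 67, 4]]
maxlen = max(len(s) for s in sequences)

padded = [pre_pad(s, maxlen) for s in sequences]
X_demo = [row[:-1] for row in padded]  # context: all but the last column
y_demo = [row[-1] for row in padded]   # target: the last column

print(padded)  # [[0, 0, 65, 1], [0, 65, 1, 2], [66, 24, 67, 4]]
print(X_demo)  # [[0, 0, 65], [0, 65, 1], [66, 24, 67]]
print(y_demo)  # [1, 2, 4]
```

This is why X has 26 columns when `max_len` is 27: one column is set aside as the label.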

In [18]:
from tensorflow.keras.utils import to_categorical
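# one class per word index, plus index 0 reserved for padding (334 + 1 = 335)
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)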

In [19]:
y.shape

(608, 335)
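
Each label is now a row of 335 values with a single 1 at the position of the target word. A plain-Python sketch of that encoding, assuming integer labels in the range 0 to `num_classes - 1` (`one_hot` is a hypothetical helper, not the Keras function):

```python
def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1.0 entry."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([1, 2, 4], num_classes=5)
for row in encoded:
    print(row)
```

If the one-hot matrix becomes too large for a bigger vocabulary, one option is to keep the integer labels and compile the model with the `sparse_categorical_crossentropy` loss instead.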

In [20]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [21]:
model = Sequential()
model.add(Embedding(len(tokenizer.word_index) + 1, 100))
model.add(LSTM(150, return_sequences=True))
model.add(LSTM(150))
model.add(Dense(len(tokenizer.word_index) + 1, activation='softmax'))

In [22]:
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [23]:
model.summary()
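
The summary output is not shown here, so the sketch below estimates the trainable parameter count of each layer by hand. It assumes the standard parameterization with biases enabled (an LSTM has four gates, each with input weights, recurrent weights, and a bias) and a vocabulary size of 334 + 1. The helper names are introduced for illustration only.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size = 334 + 1  # word_index size plus the padding index

layer_params = {
    "embedding": embedding_params(vocab_size, 100),
    "lstm_1": lstm_params(100, 150),
    "lstm_2": lstm_params(150, 150),
    "dense": dense_params(150, vocab_size),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:10s} {count:>8,}")
print(f"{'total':10s} {total_params:>8,}")
```

With roughly 415 thousand parameters and only 608 training rows, the model can memorize the FAQ text, which is consistent with training accuracy being the only metric tracked.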

In [None]:
model.fit(X,y,epochs=100)

Epoch 1/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 4s 81ms/step - accuracy: 0.0238 - loss: 5.7685
Epoch 2/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 2s 50ms/step - accuracy: 0.0334 - loss: 5.3306
Epoch 3/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 50ms/step - accuracy: 0.0466 - loss: 5.2705
Epoch 4/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 59ms/step - accuracy: 0.0453 - loss: 5.2626
Epoch 5/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 58ms/step - accuracy: 0.0694 - loss: 5.0896
Epoch 6/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 54ms/step - accuracy: 0.0688 - loss: 5.0206
Epoch 7/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 60ms/step - accuracy: 0.0521 - loss: 5.0545
Epoch 8/100
19/19 ━━━━━━━━━━━━━━━━━━━━ 1s 57ms/step - accuracy: 0.0744 - loss: 4.8901
Epoch 9/100
...

In [None]:
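import numpy as np
import time

text = "my name is"
for i in range(10):
  # tokenize the text generated so far
  token_text = tokenizer.texts_to_sequences([text])[0]
  # pad to the input length used in training (max_len - 1 = 26 columns)
  padded_token_text = pad_sequences([token_text], maxlen=max_len - 1, padding='pre')
  # predict the index of the most likely next word
  pos = int(np.argmax(model.predict(padded_token_text)))
  # map the index back to its word; index 0 is padding and has no word
  word = tokenizer.index_word.get(pos)
  if word is None:
    break
  text = text + " " + word
  print(text)
  time.sleep(2)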
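
The generation loop is a greedy decoder: at each step it takes the single most probable next word and feeds the extended text back in. The sketch below isolates that logic from Keras so it can be run and tested without a trained model. `toy_predict` and the five-word vocabulary are hypothetical stand-ins for `model.predict` and the real `tokenizer.index_word`.

```python
def generate(seed_ids, predict_probs, steps, context_len):
    """Greedy decoding: repeatedly append the most probable next id."""
    ids = list(seed_ids)
    for _ in range(steps):
        # Keep the last context_len ids and left-pad with zeros.
        context = ids[-context_len:]
        context = [0] * (context_len - len(context)) + context
        probs = predict_probs(context)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == 0:  # padding index: no word to append
            break
        ids.append(next_id)
    return ids

def toy_predict(context):
    """Hypothetical model: always favours the id after the last one."""
    probs = [0.0] * 6
    probs[min(context[-1] + 1, 5)] = 1.0
    return probs

# Hypothetical vocabulary, for illustration only.
index_word = {1: "the", 2: "model", 3: "is", 4: "deployed", 5: "locally"}

ids = generate([1, 2, 3], toy_predict, steps=2, context_len=4)
generated = " ".join(index_word[i] for i in ids)
print(generated)  # the model is deployed locally
```

Greedy decoding always yields the same continuation for a given seed. Sampling from the predicted distribution instead of taking the maximum is a common way to get more varied text.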