In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/ai-powered-study-buddy-sample-pdf/sample-input.pdf


In [2]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

# 🤖 AI-Powered Study Buddy

## 📌 Project Overview

The AI-Powered Study Buddy is an academic project developed as part of the Edunet Foundation – IBM SkillsBuild Artificial Intelligence and Machine Learning internship to demonstrate the application of Machine Learning and Natural Language Processing (NLP) techniques in education.

The project processes educational study material provided in PDF format and helps students by generating summaries, answers to questions, quizzes, and flashcards for effective exam preparation. Text is extracted from documents, preprocessed, and analyzed using TF-IDF-based methods to identify important information.

Implemented using Python, scikit-learn, NLTK, and PyPDF2, the project runs in a Kaggle Notebook environment with a private Kaggle dataset created for demonstration purposes. This project showcases how basic ML techniques can be used to build an intelligent and practical study assistance system.

## 🔁 Project Workflow

1. Load educational dataset from Kaggle  
2. Preprocess text data  
3. Explain concepts with a fixed explanation template  
4. Summarize study content  
5. Generate quiz questions  
6. Create flashcards  
7. Display outputs for students

## ⚙️ Methodology

This project follows a traditional Machine Learning-based Natural Language Processing (NLP) pipeline to assist students in studying educational content.

1. **Data Collection**  
   Educational study material is collected in PDF format and uploaded as a private Kaggle dataset.

2. **Text Extraction**  
   Text is extracted from PDF documents using the PyPDF2 library.

3. **Text Preprocessing**  
   The extracted text is cleaned by converting to lowercase and removing unnecessary line breaks to improve processing quality.

4. **Text Representation**  
   TF-IDF (Term Frequency–Inverse Document Frequency) is used to convert text into numerical vectors for similarity-based analysis.

5. **Summarization**  
   Important sentences are selected using TF-IDF scores to generate concise summaries of study material.

6. **Question Answering**  
   User questions are answered by identifying the most relevant sentences based on TF-IDF similarity.

7. **Flashcard & Quiz Generation**  
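   Flashcards pair a key phrase with the sentence it comes from: a first version picks TF-IDF keywords, and the final version picks noun phrases with NLTK chunking. Quiz questions pair the first few sentences with generic question templates.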

This approach ensures an explainable, lightweight, and efficient study assistant without relying on large language models.
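
To make step 4 concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn's `TfidfVectorizer` uses by default, but the whitespace tokenizer and the name `tfidf_vectors` are illustrative assumptions, not part of the notebook's pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Hypothetical tokenizer: lowercase and split on whitespace
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of sentences that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_vectors([
    "magnetic field induces current",
    "magnetic field stores energy",
])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that appear in only one sentence ("induces") receive a larger weight than terms shared by every sentence ("magnetic"), which is the property the summarization and question-answering steps rely on.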


## 🧪 Sample Study Content (Initial Testing)

This section contains sample educational content used for initial testing of the study buddy pipeline before applying it to PDF or dataset inputs.



In [3]:
# Sample study content for initial testing
study_text = """
Electromagnetic induction is a fundamental concept in physics and electrical engineering.
It explains how a changing magnetic field can induce an electric current in a conductor.
This principle was discovered by Michael Faraday and forms the basis of many electrical devices.
Generators, transformers, and inductors operate using electromagnetic induction.
Understanding this concept is very important for students studying electrical and electronics engineering.
"""

print(study_text)


Electromagnetic induction is a fundamental concept in physics and electrical engineering.
It explains how a changing magnetic field can induce an electric current in a conductor.
This principle was discovered by Michael Faraday and forms the basis of many electrical devices.
Generators, transformers, and inductors operate using electromagnetic induction.
Understanding this concept is very important for students studying electrical and electronics engineering.



In [4]:
# Basic text preprocessing
def preprocess_text(text):
    text = text.lower()              # convert to lowercase
    text = text.replace("\n", " ")   # remove line breaks
    return text

clean_text = preprocess_text(study_text)
print(clean_text)

 electromagnetic induction is a fundamental concept in physics and electrical engineering. it explains how a changing magnetic field can induce an electric current in a conductor. this principle was discovered by michael faraday and forms the basis of many electrical devices. generators, transformers, and inductors operate using electromagnetic induction. understanding this concept is very important for students studying electrical and electronics engineering. 


## 🧠 Concept Explanation & Text Preprocessing


In [5]:
# Concept explanation function
def explain_concept(text):
    explanation = (
        "Simple Explanation:\n"
        + text
        + "\n\nIn simple words, this concept explains how changes in physical conditions "
          "can produce useful effects in real-world applications. It is widely used in "
          "engineering and technology."
    )
    return explanation

In [6]:
# Test concept explanation
concept_output = explain_concept(clean_text)
print(concept_output)

Simple Explanation:
 electromagnetic induction is a fundamental concept in physics and electrical engineering. it explains how a changing magnetic field can induce an electric current in a conductor. this principle was discovered by michael faraday and forms the basis of many electrical devices. generators, transformers, and inductors operate using electromagnetic induction. understanding this concept is very important for students studying electrical and electronics engineering. 

In simple words, this concept explains how changes in physical conditions can produce useful effects in real-world applications. It is widely used in engineering and technology.


## ✍️ Study Material Summarization (TF-IDF)


In [7]:
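# Extractive summarization: score each sentence by the sum of its TF-IDF weights
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def summarize_text_tfidf(text, max_sentences=2):
    # Naive sentence split on periods; empty fragments are dropped
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    
    # Nothing to shorten if the text is already within the limit
    if len(sentences) <= max_sentences:
        return text
    
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    
    # One score per sentence: the sum of its TF-IDF weights
    sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()
    top_sentence_indices = sentence_scores.argsort()[-max_sentences:][::-1]
    
    # Re-sort the selected indices so the summary keeps the original sentence order
    summary = '. '.join([sentences[i] for i in sorted(top_sentence_indices)])
    return summary + '.'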

In [8]:
# Test TF-IDF based summarization
summary_output = summarize_text_tfidf(clean_text)
print("Summary:")
print(summary_output)

Summary:
it explains how a changing magnetic field can induce an electric current in a conductor. this principle was discovered by michael faraday and forms the basis of many electrical devices.
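
The functions in this notebook split sentences with `text.split('.')`, which also cuts at numbered headings such as "6." and inside decimals. As a hedged alternative, the sketch below splits only where sentence-ending punctuation is followed by whitespace, and skips periods that directly follow a digit. The name `split_sentences` and the regular expression are illustrative assumptions. The trade-off is that a sentence ending in a number ("... in 1831. Then ...") will not be split.

```python
import re

# Split after ., ! or ? when followed by whitespace,
# unless the period directly follows a digit (e.g. "6." or "3.5").
_SENTENCE_BOUNDARY = re.compile(r"(?<!\d[.])(?<=[.!?])\s+")

def split_sentences(text):
    """Return a list of sentences, keeping their closing punctuation."""
    # Collapse line breaks and repeated spaces first
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in _SENTENCE_BOUNDARY.split(text) if s]

for sentence in split_sentences(
    "1. Introduction to AI. AI is a field.\nIt costs 3.5 units!"
):
    print(sentence)
```

Unlike `text.split('.')`, this keeps the closing punctuation on each sentence, so code that joins sentences with `'. '` would need to join with a plain space instead.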


## ❓ Question Answering


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

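def answer_question(text, question, top_n=2):
    # Naive sentence split on periods; empty fragments are dropped
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    
    # Fit on the sentences plus the question so they share one vocabulary
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences + [question])
    
    # Rows are L2-normalized by default, so this dot product is the cosine similarity
    similarity_scores = (tfidf_matrix[:-1] * tfidf_matrix[-1].T).toarray().flatten()
    
    # Keep the top_n most similar sentences, then restore document order
    top_indices = similarity_scores.argsort()[-top_n:][::-1]
    
    answer = ". ".join([sentences[i] for i in sorted(top_indices)])
    return answer + "."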

In [10]:
# Test Question Answering
question = "What is electromagnetic induction?"
answer = answer_question(clean_text, question)

print("Question:", question)
print("Answer:", answer)

Question: What is electromagnetic induction?
Answer: electromagnetic induction is a fundamental concept in physics and electrical engineering. generators, transformers, and inductors operate using electromagnetic induction.
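
`answer_question` always returns `top_n` sentences, even when a sentence shares no words with the question. The sketch below is one possible variant that exposes the similarity scores and drops sentences scoring at or below a threshold. The function name `rank_sentences` and the `min_score` parameter are assumptions introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, question, min_score=0.0):
    """Return (score, sentence) pairs with score > min_score, best first."""
    vectorizer = TfidfVectorizer()
    # Fit on sentences and question together so they share one vocabulary
    matrix = vectorizer.fit_transform(list(sentences) + [question])
    # Compare every sentence row with the question row (the last one)
    scores = cosine_similarity(matrix[:-1], matrix[-1:]).ravel()
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [(float(score), sent) for score, sent in ranked if score > min_score]

sentences = [
    "electromagnetic induction is a fundamental concept in physics",
    "this principle was discovered by michael faraday",
    "generators operate using electromagnetic induction",
]
for score, sent in rank_sentences(sentences, "who discovered the principle"):
    print(f"{score:.2f}  {sent}")
```

An empty result can then be reported as "no relevant passage found" instead of returning an unrelated sentence.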


## 📝 Quiz Generation

In [11]:
import random

def generate_quiz(text, num_questions=3):
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    
    templates = [
        "What does the following statement explain?",
        "Identify the concept described below:",
        "What is being referred to in this statement?"
    ]
    
    quiz_questions = []
    for i, sentence in enumerate(sentences[:num_questions]):
        template = random.choice(templates)
        question = f"Q{i+1}. {template}\n{sentence}"
        quiz_questions.append(question)
    
    return quiz_questions

In [12]:
# Test quiz generation
quiz = generate_quiz(clean_text)

for q in quiz:
    print(q)
    print()

Q1. Identify the concept described below:
electromagnetic induction is a fundamental concept in physics and electrical engineering

Q2. Identify the concept described below:
it explains how a changing magnetic field can induce an electric current in a conductor

Q3. What is being referred to in this statement?
this principle was discovered by michael faraday and forms the basis of many electrical devices
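
Because `generate_quiz` calls `random.choice` on the module-level generator, the templates can change on every run. A minimal sketch of a repeatable variant is shown below. The `seed` parameter, the name `generate_quiz_seeded`, and the choice to sample sentences instead of taking the first few are assumptions made for illustration, not the notebook's method.

```python
import random

TEMPLATES = [
    "What does the following statement explain?",
    "Identify the concept described below:",
    "What is being referred to in this statement?",
]

def generate_quiz_seeded(sentences, num_questions=3, seed=0):
    """Build quiz questions that are identical for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    picked = rng.sample(sentences, k=min(num_questions, len(sentences)))
    return [
        f"Q{i}. {rng.choice(TEMPLATES)}\n{sentence}"
        for i, sentence in enumerate(picked, 1)
    ]

sample = [
    "electromagnetic induction is a fundamental concept in physics",
    "a changing magnetic field can induce an electric current",
    "this principle was discovered by michael faraday",
    "generators operate using electromagnetic induction",
]
for question in generate_quiz_seeded(sample, seed=42):
    print(question)
    print()
```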



## 🃏 Flashcard Generation

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# First version: TF-IDF keywords paired with the first sentence that contains them
def generate_flashcards(text, num_cards=5):
    sentences = [s.strip() for s in text.split('.') if len(s.strip()) > 30]

    # stop_words='english' uses scikit-learn's built-in list, so NLTK stopwords are not needed
    vectorizer = TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),
        max_features=10
    )

    vectorizer.fit(sentences)
    # Note: feature names come back in alphabetical order, not ranked by weight
    keywords = vectorizer.get_feature_names_out()

    flashcards = []

    for kw in keywords[:num_cards]:
        for sent in sentences:
            if kw in sent:
                flashcards.append({
                    "Front": kw.title(),
                    "Back": sent[:120]  # LIMIT length
                })
                break

    return flashcards

In [14]:
import nltk
from nltk import word_tokenize, pos_tag
from nltk.chunk import RegexpParser

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


def extract_noun_phrases(sentence):
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)

    grammar = r"""
        NP: {<JJ>*<NN|NNS|NNP|NNPS>+}
    """
    chunker = RegexpParser(grammar)
    tree = chunker.parse(tagged)

    noun_phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        phrase = " ".join(word for word, tag in subtree.leaves())
        if len(phrase.split()) > 1:
            noun_phrases.append(phrase)

    return noun_phrases


# Redefines generate_flashcards: this noun-phrase version replaces the TF-IDF version
def generate_flashcards(text):
    sentences = [s.strip() for s in text.split('.') if len(s.strip()) > 25]
    flashcards = []

    for sentence in sentences:
        noun_phrases = extract_noun_phrases(sentence)

        if noun_phrases:
            flashcards.append({
                "Front": noun_phrases[0].title(),
                "Back": sentence
            })

    return flashcards

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [15]:
# Test flashcard generation
flashcards = generate_flashcards(clean_text)

for card in flashcards:
    print("Front:", card["Front"])
    print("Back:", card["Back"])
    print()

Front: Electromagnetic Induction
Back: electromagnetic induction is a fundamental concept in physics and electrical engineering

Front: Changing Magnetic Field
Back: it explains how a changing magnetic field can induce an electric current in a conductor

Front: Michael Faraday
Back: this principle was discovered by michael faraday and forms the basis of many electrical devices

Front: Electromagnetic Induction
Back: generators, transformers, and inductors operate using electromagnetic induction

Front: Electronics Engineering
Back: understanding this concept is very important for students studying electrical and electronics engineering
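
The output above contains two cards with the front "Electromagnetic Induction", because each sentence produces its own card. One way to tidy this, sketched here under the assumption that cards are dictionaries with `"Front"` and `"Back"` keys as in the notebook, is to merge cards that share a front. The helper name `merge_duplicate_fronts` and the `" | "` separator are hypothetical.

```python
def merge_duplicate_fronts(flashcards, separator=" | "):
    """Combine cards with the same Front into one card, keeping first-seen order."""
    backs_by_front = {}
    for card in flashcards:
        # dicts preserve insertion order, so the first occurrence fixes the position
        backs_by_front.setdefault(card["Front"], []).append(card["Back"])
    return [
        {"Front": front, "Back": separator.join(backs)}
        for front, backs in backs_by_front.items()
    ]

cards = [
    {"Front": "Electromagnetic Induction", "Back": "a fundamental concept in physics"},
    {"Front": "Michael Faraday", "Back": "discovered the principle"},
    {"Front": "Electromagnetic Induction", "Back": "used by generators and transformers"},
]
for card in merge_duplicate_fronts(cards):
    print("Front:", card["Front"])
    print("Back:", card["Back"])
    print()
```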



In [16]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
   232.6/232.6 kB 11.0 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [17]:
import PyPDF2

## 📂 Dataset Description

A custom Kaggle dataset was created for this project using educational study material provided in PDF format.  
For initial testing and demonstration, a small sample PDF containing introductory concepts of Artificial Intelligence is used.

The dataset includes:
- Theoretical explanations of core concepts
- Structured educational text suitable for summarization
- Content appropriate for question answering, quiz generation, and flashcard creation

This dataset is uploaded to Kaggle as a **private dataset** and accessed directly within the notebook for processing.  
It is used strictly for educational and demonstration purposes.


In [18]:
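def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, concatenated in page order."""
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. scanned images)
            text += page.extract_text() or ""
    return text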

In [19]:
pdf_text = extract_text_from_pdf(
    "/kaggle/input/ai-powered-study-buddy-sample-pdf/sample-input.pdf"
)

In [20]:
# Simple cleaning
clean_pdf_text = pdf_text.lower()

# Use existing pipeline (already built earlier)
summary = summarize_text_tfidf(clean_pdf_text)
answer = answer_question(clean_pdf_text, "Explain the concept")
flashcards = generate_flashcards(clean_pdf_text)

print("SUMMARY:\n", summary)
print("\nANSWER:\n", answer)
print("\nFLASHCARDS:\n")
for i, card in enumerate(flashcards, 1):
    print(f"{i}. Front: {card['Front']}")
    print(f"   Back : {card['Back']}\n")

SUMMARY:
 objectives of artificial intelligence
the main objectives of artificial intelligence are:
• to develop intelligent systems that can perform human-like tasks
• to enable machines to learn from experience
• to improve efficiency and accuracy in problem solving
• to automate repetitive and time-consuming tasks
• to assist humans in decision making
ai helps reduce human effort and improves productivity in various fields. applications of artificial intelligence
artificial intelligence is used in many areas:
• healthcare – disease diagnosis, medical imaging, and treatment
planning
• education – personalized learning, virtual tutors, and automated
grading
• finance – fraud detection, credit scoring, and customer support
• transportation – self-driving cars and traffic management
• entertainment – movie and music recommendation systems
6.

ANSWER:
 c) super ai
super ai is a theoretical concept where machines surpass human
intelligence in all aspects, inc
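
The PDF cell only lowercases the extracted text, so hard line breaks and bullet glyphs from the PDF layout carry through into the output above. A minimal cleaning sketch is given below. It assumes bullets appear as "•" at the start of a line, and the function name `clean_extracted_text` is hypothetical. It does not add sentence boundaries between bullet items, so bulleted lists still read as one long sentence.

```python
import re

def clean_extracted_text(raw):
    """Lowercase PDF text, drop leading bullet glyphs, and collapse whitespace."""
    text = raw.replace("\r\n", "\n")
    # Remove a bullet glyph (and surrounding spaces) at the start of any line
    text = re.sub(r"^[ \t]*•[ \t]*", "", text, flags=re.MULTILINE)
    # Turn line breaks, tabs and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

raw_sample = (
    "Objectives of AI\n"
    "• to develop intelligent\n"
    "systems\n"
    "• to automate tasks\n"
    "\n"
    "AI helps."
)
print(clean_extracted_text(raw_sample))
```

The result could be passed to `summarize_text_tfidf`, `answer_question`, and `generate_flashcards` in place of `pdf_text.lower()`.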

In [21]:
# Quiz Generation
quiz_questions = generate_quiz(clean_pdf_text)

print("QUIZ QUESTIONS:\n")
for i, q in enumerate(quiz_questions, 1):
    print(f"{i}. {q}")

QUIZ QUESTIONS:

1. Q1. Identify the concept described below:
artificial intelligence (ai) 
1
2. Q2. What is being referred to in this statement?
introduction to artificial intelligence
artificial intelligence (ai) is a field of computer science that focuses on
creating machines capable of performing tasks that normally require
human intelligence
3. Q3. What is being referred to in this statement?
these tasks include learning, reasoning, problem
solving, decision making, understanding human language, and recognizing
patterns


## 📊 Results & Outputs

The AI-Powered Study Buddy successfully processes educational PDF content and produces the following outputs:

- 📌 **Summarized study material** using TF-IDF based extractive summarization  
- ❓ **Question answering** by identifying the most relevant sentences from the content  
- 📝 **Automatically generated quiz questions** for self-assessment  
- 🃏 **Flashcards** highlighting key concepts for quick revision  

These outputs demonstrate how traditional NLP techniques can support effective learning and exam preparation.


## 📌 Key Features

- PDF-based learning material processing  
- TF-IDF powered text summarization  
- Question answering based on TF-IDF sentence similarity  
- Automatic quiz generation  
- Flashcard creation for concept revision  
- Fully executable and reproducible Kaggle notebook


## 🛠️ Technologies Used

- Python  
- Scikit-learn (TF-IDF Vectorizer)  
- NLTK  
- PyPDF2  
- Kaggle Notebook


## ⚠️ Limitations

- The summarization is extractive and may not fully capture semantic meaning  
- Flashcard relevance depends on the first noun phrase found in each sentence, which is not always the key concept  
- Performance may vary with very large or unstructured documents  

Future improvements can include transformer-based models for better semantic understanding.


## 🚀 Future Enhancements

- Integrate transformer-based models (BERT / T5)  
- Improve flashcard accuracy using semantic similarity  
- Add a web interface using Streamlit or Gradio  
- Support multiple PDFs and subjects  
- Enable student performance tracking


## ✅ Conclusion

This project demonstrates how NLP techniques can be applied to build an AI-powered learning assistant.  
By combining text extraction, TF-IDF based processing, and intelligent content generation, the system supports effective self-study and revision.

The project highlights the practical application of machine learning concepts in the education domain and serves as a foundation for more advanced AI-driven learning systems.
