Welcome to our Big Data Search Engine project — a comprehensive, end-to-end system built as part of our Big Data course. This project showcases real-world applications of big data concepts, including large-scale data collection, indexing, ranking, and search.
🔊 Check out the live demo video to see how everything works in action!
demo.mp4
This project demonstrates how to build a fully functional search engine using Big Data technologies. The workflow includes:
- Web scraping over 200,000 URLs
- Storing each page as an individual file
- Building an inverted index to map words to documents
- Calculating page ranks using hyperlink structure
- Storing processed data in a database
- Developing an API to serve the search functionality
- Frontend interface where users can search for keywords and get results showing:
  - The list of pages containing the word
  - Frequency of occurrence
  - PageRank of each result
- We scraped data from over 200,000 web pages using Python.
- Tools used:
  - `requests`
  - `BeautifulSoup`
- Data extracted includes:
  - Text content
  - URLs
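
As a rough illustration of the extraction step, the sketch below pulls the visible text and outgoing links from one HTML page. It is not the project's scraper: to stay dependency-free it uses the standard library's `html.parser` as a stand-in for `BeautifulSoup`, and it takes the HTML as a string instead of fetching it with `requests`. The names `PageExtractor` and `extract_page` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(html, base_url):
    """Return (text, links) for a page; both names are illustrative."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

The extracted links are what a link-based ranking step would later consume, which is why they are resolved to absolute URLs here.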
- Each page is saved in a separate text file.
- Filenames are based on the URL (safely encoded).
- All files are uploaded and stored in HDFS (Hadoop Distributed File System).
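
One possible way to derive a safe filename from a URL is percent-encoding, sketched below. This is an assumption about what "safely encoded" could mean, not necessarily the scheme the project used; `url_to_filename` and `filename_to_url` are hypothetical helper names.

```python
from urllib.parse import quote, unquote

SUFFIX = ".txt"


def url_to_filename(url):
    """Percent-encode every reserved character so the name has no '/' or ':'."""
    return quote(url, safe="") + SUFFIX


def filename_to_url(filename):
    """Invert url_to_filename for names that end with the expected suffix."""
    if not filename.endswith(SUFFIX):
        raise ValueError("unexpected filename: " + filename)
    return unquote(filename[: -len(SUFFIX)])
```

The encoding is reversible, so the original URL can be recovered from the filename alone. Very long URLs can exceed filesystem name limits, in which case a hash of the URL would be a safer key.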
- The inverted index is built using MapReduce on top of Hadoop.
- Each word is mapped to the list of files (documents) in which it appears.
- The index also records the frequency of each word per document.
- This structure allows for fast and scalable keyword searches.
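
The map and reduce steps behind an inverted index can be imitated in memory, as in the sketch below. It is a single-process illustration only, not the project's Hadoop job; the tokenizer (lowercase alphanumeric runs) and the function names are assumptions.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one ((word, doc_id), 1) pair per token, as a mapper would."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield (word, doc_id), 1


def reduce_phase(pairs):
    """Sum counts per (word, doc_id) and group them as word -> {doc_id: freq}."""
    index = defaultdict(lambda: defaultdict(int))
    for (word, doc_id), count in pairs:
        index[word][doc_id] += count
    return {word: dict(docs) for word, docs in index.items()}


def build_index(docs):
    """docs maps a document id to its text; returns the inverted index."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return reduce_phase(pairs)
```

Looking up a keyword is then a single dictionary access, for example `build_index(docs).get("data", {})`, which returns the documents containing the word together with their frequencies.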
- We implemented the PageRank algorithm to evaluate the importance of each page.
- Scores are based on the link structure and references between pages.
- The scores are used to rank search results by page authority.
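
For readers unfamiliar with PageRank, the sketch below shows the standard power-iteration form on a small in-memory link graph. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are common textbook choices assumed here, not details taken from the project.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            # Each page splits its damped rank evenly among its targets.
            share = damping * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks
```

The returned scores sum to 1, so they can be compared directly across pages when ordering results.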
- Both the inverted index and PageRank scores are stored in a database (SQL / Entity Framework).
- This enables efficient querying by the API.
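
To make the storage step concrete, here is a minimal sketch of one way the two datasets could be laid out relationally. It uses Python's built-in `sqlite3` as a stand-in for the project's SQL / Entity Framework setup, and the table and column names (`pages`, `postings`, `frequency`, `pagerank`) are hypothetical.

```python
import sqlite3


def create_store(index, ranks):
    """Load an inverted index and PageRank scores into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (doc TEXT PRIMARY KEY, pagerank REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE postings ("
        " word TEXT NOT NULL, doc TEXT NOT NULL, frequency INTEGER NOT NULL,"
        " PRIMARY KEY (word, doc))"
    )
    conn.executemany("INSERT INTO pages VALUES (?, ?)", list(ranks.items()))
    conn.executemany(
        "INSERT INTO postings VALUES (?, ?, ?)",
        [(w, d, f) for w, docs in index.items() for d, f in docs.items()],
    )
    conn.commit()
    return conn


def search(conn, word):
    """Return (doc, frequency, pagerank) rows for a word, best-ranked first."""
    cursor = conn.execute(
        "SELECT p.doc, p.frequency, g.pagerank"
        " FROM postings p JOIN pages g ON g.doc = p.doc"
        " WHERE p.word = ?"
        " ORDER BY g.pagerank DESC, p.frequency DESC",
        (word.lower(),),
    )
    return cursor.fetchall()
```

The composite primary key on `(word, doc)` doubles as the index that makes keyword lookups fast, since every query filters on `word` first.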
- A back-end API accepts search queries and returns:
  - Matching documents
  - Word frequency in each document
  - PageRank scores
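
The request and response shape of such an endpoint might look like the sketch below. It is a framework-free illustration in Python, not the project's actual API: the query parameter `q`, the JSON field names, and the function `handle_search` are all assumptions.

```python
import json
from urllib.parse import parse_qs


def handle_search(query_string, index, ranks):
    """Turn a raw query string such as 'q=data' into (status, JSON body)."""
    params = parse_qs(query_string)
    word = params.get("q", [""])[0].strip().lower()
    if not word:
        return 400, json.dumps({"error": "missing query parameter 'q'"})

    results = [
        {"document": doc, "frequency": freq, "pagerank": ranks.get(doc, 0.0)}
        for doc, freq in index.get(word, {}).items()
    ]
    # Highest PageRank first; frequency breaks ties.
    results.sort(key=lambda r: (-r["pagerank"], -r["frequency"]))
    return 200, json.dumps({"query": word, "results": results})
```

Each result carries the three fields listed above, so a frontend can render the document, its word frequency, and its PageRank without a second request.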
We built a modern, high-performance frontend using:
⚛️ React with TypeScript for robust component-based architecture
🎨 Tailwind CSS for elegant, responsive, and utility-first styling
💡 Optimized UX with a sleek, intuitive, and professional design
Function | Tools / Technologies Used |
---|---|
Web Scraping | Python (`requests`, `BeautifulSoup`) |
Data Storage | HDFS |
Indexing & Ranking | Hadoop, MapReduce |
Database | SQL / Entity Framework |
Backend API | Local API |
Frontend UI | HTML, React, TypeScript, Tailwind CSS |
This project was a collaborative effort by a talented team. Each member played a key role in building and delivering this Big Data Search Engine:
Name | Role & Contribution |
---|---|
Mohamed Abdelghany | Web Scraping & HDFS Storage: collected data from 200k+ URLs and stored it in HDFS |
Youssef Mahmoud | Big Data Processing: built the inverted index over distributed files on HDFS and implemented the PageRank algorithm based on links between neighboring pages |
Mohamed Mohy | Backend Developer: developed the API and connected it with the database and frontend |
Omar ElSayed | Frontend Developer: built a modern UI using React, TypeScript, and Tailwind CSS |
🙌 Thanks to each team member for their dedication, collaboration, and exceptional work throughout the project.