# Online Plagiarism Checker
Online plagiarism checker scrapes the markdown text from submitted notebook link, searches the text on Google for possible matching notebook links on Jovian, Kaggle & Github sites, and returns the notebook links that contain the text.

> **NOTE**: This code will not work on Binder or Colab. It runs only on your computer locally.

## Working of functions
- `get_markdown_list(url)` function takes the jovian noteboook link as the input, scrapes all the markdown cells in the notebook, cleans the markdown text by splitting at every new line, strips any extras in the sentences and appends all the texts in the markdown. 
- `get_matching_results(markdown_list)` function takes the appended markdown list from `get_markdown_list(url)` function, invokes the Selenium driver to load google.com. At present, the google search happens for the first two sentences in the markdown cells text i.e. the function checks for the length of each markdown cell text, if the number of sentences is equal to 1, the driver takes that sentence as the input and scrapes the first two links in the first page of the sentence search. If the length of markdown text is greater than 1, meaning there are multiple sentences in a markdown text, the function takes the first sentence in the markdown text and scrapes the first two links in the first page of the sentence search. The function google searches the sites [jovian.ai](https://jovian.ai/), [kaggle.com](https://kaggle.com/), and [github.com](https://github.com/) for a match and returns the appened list of matching links.
- `check_matching_sentences(markdown_list,pages_links)` function takes the markdown list from `get_markdown_list(url)` and pages links from `get_matching_results(markdown_list)` as the inputs. The function scrapes all the markdown content in each matching link and looks for the sentence match of the input jovian url. If there is any match, it prints the matching url or else the loop breaks
- `getting_matching_links(url)` is the final function that uses `get_markdown_list(url)`, `get_matching_results(markdown_list)` and `check_matching_sentences(markdown_list,pages_links)` functions with the jovian notebook link as input and returns the possible matching links.

## Installations & Imports

In [1]:
pip install requests beautifulsoup4 selenium nltk --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
import requests
import re
from bs4 import BeautifulSoup
import selenium
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from time import sleep
from nltk import tokenize

## Function to scrape markdown text

In [2]:
def get_markdown_list(url):
    #scrapes the url
    response = requests.get(url)
    #gets the contents of the page
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    #finds all the markdown cells
    markdowns=doc.find_all('div',class_='rendered_html')
    markdown_list=[]
    for markdown in markdowns:
        #splits the text to new line in markdown cell
        lines=markdown.text.split('\n')
        for line in lines:
            #cleans the sentence, if any extras
            sentence=re.sub(r'\s+', ' ', line).strip()
            if sentence:
                #appends all the texts in the markdown cells
                markdown_list.append(sentence)
    return markdown_list

## Function to get possible matching links from Google search

In [3]:
def get_matching_results(markdown_list):
    #path to selenium drive
    PATH = "/Users/samanvitha/Downloads/chromedriver" #change the path here
    #to load the chrome browser without popup 
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    #loading the driver
    driver = webdriver.Chrome(PATH,options=chrome_options)
    #checking the number of sentences in markdown text and choosing the first sentence
    for sentence in markdown_list[:2]:
        token=tokenize.sent_tokenize(sentence)
        if len(token)==1:
            line=token[0]
        else:
            line=token[0]
        pages_links=[]
        sites=['jovian.ai','kaggle.com','github.com']
        #searching for possible matching links in each site
        for site in sites:
            try:
                driver.get('https://www.google.com/search?q='+ site +' "'+ line +'"')
                time.sleep (5)
                #check for no results found class in the search
                check_noresult=driver.find_elements(By.XPATH,'//div[@class="card-section rQUFld"]')[0].get_attribute('innerHTML')
                if 'No results found for ' in check_noresult:
                    continue
            except IndexError:
                #if there are possible matches, there will be no class for 'no results found' leading to IndexError
                hrefs = driver.find_elements(By.XPATH,'//div[@class="yuRUbf"]/a')
                for href in hrefs[:3]:
                    #scrapping the first two links of the page
                    link = href.get_attribute('href')
                    #appending all the links
                    pages_links.append(link)
        return pages_links

## Function to check if the markdown text is present in the possible matching links

In [4]:
def check_matching_sentences(markdown_list,pages_links):
    for page_link in pages_links:
        #scrapes the markdown cells in each link
        link_sentence=get_markdown_list(page_link)
        common_prefix = ''
        #checks for matching sentence in the input jovian link and the scrapped link
        for original_sen,match_sen in zip(markdown_list[:2],link_sentence[:2]):
            for original,match in zip(original_sen,match_sen):
                if original == match:
                    common_prefix += original
                else:
                    break
        #if there's any match, prints the matching link, else it breaks           
        if common_prefix == None:
            break
        else:
            print(page_link)

## One single function to do the job

In [5]:
def single_function_to_get_matching_links(url):
    #getting the input url markdown list
    markdown_list=get_markdown_list(url)
    #getting the possible matching links
    pages_links=get_matching_results(markdown_list)
    #checking for matching sentences
    check_matching_sentences(markdown_list,pages_links)

In [6]:
#calling the final function
single_function_to_get_matching_links('https://jovian.ai/anil1999-aaak1228/ipl-data-analysis')

  driver = webdriver.Chrome(PATH,options=chrome_options)


https://jovian.ai/aviraljoshi143/ipl-data-analysis
https://jovian.ai/dilpreetsinghrawal21/ipl-data-analysis
https://jovian.ai/ajit2001/ipl-data-analysis
https://jovian.ai/dilpreetsinghrawal21/ipl-data-analysis
https://jovian.ai/akankshabisht9897/ipl-data-analysis-2008-2019
https://jovian.ai/suryatn04u4727/ipl-data-analysis?action=duplicate_notebook
https://jovian.ai/dilpreetsinghrawal21/ipl-data-analysis?action=duplicate_notebook


#### https://jovian.ai/ajit2001/ipl-data-analysis was an almost match after checking the links and we failed the submission sharing this link.