# **Plagiarism Detection**

In this project, you will build a plagiarism detector that examines an answer text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided source text.

Your first task will be to create some features that can then be used to train a classification model. This task will be broken down into a few discrete steps:

*   Clean and pre-process the data.
*   Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
*   Select "good" features, by analyzing the correlations between different features.
*   Create train/test .csv files that hold the relevant features and class labels for train/test data points.


In the next notebook, Notebook 3, you'll use the features and .csv files you create in this notebook to train a binary classification model in a SageMaker notebook instance.

You'll be defining a few different similarity features, as outlined in this paper, which should help you build a robust plagiarism detector!

To complete this notebook, you'll have to finish all of the given exercises and answer all of the questions.

> All tasks are clearly labeled EXERCISE, and all questions are labeled QUESTION.


It will be up to you to decide on the features to include in your final training and test data.


In [None]:
# NOTE:
# you only need to run this cell if you have not yet downloaded the data
# otherwise you may skip this cell or comment it out

!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip
!unzip data

--2022-07-15 17:46:22--  https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.205.128
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.205.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113826 (111K) [application/zip]
Saving to: ‘data.zip.1’


2022-07-15 17:46:23 (2.81 MB/s) - ‘data.zip.1’ saved [113826/113826]

Archive:  data.zip
replace data/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: All
  inflating: data/.DS_Store          
  inflating: __MACOSX/data/._.DS_Store  
  inflating: data/file_information.csv  
  inflating: __MACOSX/data/._file_information.csv  
  inflating: data/g0pA_taska.txt     
  inflating: __MACOSX/data/._g0pA_taska.txt  
  inflating: data/g0pA_taskb.txt     
  inflating: __MACOSX/data/._g0pA_taskb.txt  
  inflating: data/g0pA_taskc.txt     
  inflating: __MACOSX/data/._g0pA_taskc.txt  
  inflating: data/g0pA_taskd.txt     
  infl
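
The note in the download cell says it only needs to run once. A minimal sketch of how that check could be automated is shown below; the helper name `needs_download` and the choice of `file_information.csv` as the marker file are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

def needs_download(data_dir="data", marker="file_information.csv"):
    """Return True if `data_dir` does not yet contain the metadata file.

    Hypothetical helper: the marker file name is an assumption based on
    the archive contents listed above.
    """
    return not os.path.isfile(os.path.join(data_dir, marker))

# Demonstrate on a throwaway directory rather than the real data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    before = needs_download(tmp)   # nothing extracted yet
    open(os.path.join(tmp, "file_information.csv"), "w").close()
    after = needs_download(tmp)    # marker file now present

print(before, after)
```

In the notebook itself, the `!wget` and `!unzip` commands could be run only when `needs_download()` returns `True`.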

In [None]:
# import libraries
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import CountVectorizer

This plagiarism dataset is made up of multiple text files; the characteristics of each file are summarized in a **.csv** file named **file_information.csv**, which we can read in using pandas.

In [None]:
csv_file = 'data/file_information.csv'
plagiarism_df = pd.read_csv(csv_file)

# print out the first few rows of data info
plagiarism_df.head()

Unnamed: 0,File,Task,Category
0,g0pA_taska.txt,a,non
1,g0pA_taskb.txt,b,cut
2,g0pA_taskc.txt,c,light
3,g0pA_taskd.txt,d,heavy
4,g0pA_taske.txt,e,non
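
Beyond `head()`, it can help to count how many files fall into each category and task. The sketch below uses a small stand-in DataFrame (`sample_df`, built from the five rows shown above) so that it runs on its own; on the real data you would call the same methods on `plagiarism_df`.

```python
import pandas as pd

# Stand-in for plagiarism_df: only the five rows displayed above.
sample_df = pd.DataFrame({
    "File": ["g0pA_taska.txt", "g0pA_taskb.txt", "g0pA_taskc.txt",
             "g0pA_taskd.txt", "g0pA_taske.txt"],
    "Task": ["a", "b", "c", "d", "e"],
    "Category": ["non", "cut", "light", "heavy", "non"],
})

# Number of files per plagiarism category.
category_counts = sample_df["Category"].value_counts()

# Number of files per task.
files_per_task = sample_df.groupby("Task")["File"].count()

print(category_counts)
print(files_per_task)
```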


**Types of Plagiarism**

Each text file is associated with one **Task** (task A-E) and one **Category** of plagiarism, which you can see in the above DataFrame.

**Tasks, A-E**

Each text file contains an answer to one short question; these questions are labeled as tasks A-E. For example, Task A asks the question: "What is inheritance in object oriented programming?"

**Categories of plagiarism**

Each text file has an associated plagiarism label/category:

**1. Plagiarized categories:** cut, light, and heavy.

These categories represent different levels of plagiarized answer texts. `cut` answers copy directly from a source text, `light` answers are based on the source text but include some light rephrasing, and `heavy` answers are based on the source text but are heavily rephrased (and will likely be the most challenging kind of plagiarism to detect).

**2. Non-plagiarized category:** non.

`non` indicates that an answer is not plagiarized; the Wikipedia source text is not used to create this answer.

**3. Special, source text category:** orig.

This is a specific category for the original, Wikipedia source text. We will use these files only for comparison purposes.
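
To build intuition for why directly copied answers are easier to detect than heavily rephrased ones, here is a rough sketch that measures word n-gram overlap between an answer and a source. This is an illustrative assumption, not necessarily the feature definition used later in the notebook; the function name `ngram_overlap` and the example sentences are made up for the demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer

def ngram_overlap(answer, source, n=2):
    """Fraction of the answer's word n-grams that also occur in the source.

    Illustrative only. Assumes both texts contain at least n words that
    CountVectorizer keeps as tokens (by default, words of 2+ characters).
    """
    vectorizer = CountVectorizer(analyzer="word", ngram_range=(n, n))
    # Row 0 holds the answer's n-gram counts, row 1 the source's.
    counts = vectorizer.fit_transform([answer, source]).toarray()
    shared = counts.min(axis=0).sum()   # n-gram counts present in both texts
    total = counts[0].sum()             # all n-grams in the answer
    return shared / total if total else 0.0

source = "inheritance lets a class reuse the attributes and methods of another class"
copied = "inheritance lets a class reuse the attributes and methods of another class"
rephrased = "a class can borrow behaviour and data that a different class already defines"

copied_score = ngram_overlap(copied, source)
rephrased_score = ngram_overlap(rephrased, source)
print(copied_score, rephrased_score)
```

A verbatim copy shares every n-gram with the source, while a rephrased answer shares few or none, which is why heavier rephrasing tends to need more than one similarity feature.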

# Pre-Process the Data


In the next few cells, you'll create a new DataFrame of desired information about all of the files in the `data/` directory. This will prepare the data for feature extraction and for training a binary plagiarism classifier.

In [None]:
# Define a function that converts Category labels to numerical labels according to the following rules
# (a higher value indicates a higher degree of plagiarism):
# 0 = non
# 1 = heavy
# 2 = light
# 3 = cut
# -1 = orig, this is a special value that indicates an original file.

def num_cat(x):
    if x == 'non':
        return 0
    elif x == 'heavy':
        return 1
    elif x == 'light':
        return 2
    elif x == 'cut':
        return 3
    elif x == 'orig':
        return -1

In [None]:
# Define a function that creates the new 'Class' label according to the following rules:
# Any answer text that is not plagiarized (non) should have the class label 0.
# Any plagiarized answer texts should have the class label 1.
# And any orig texts will have a special label -1.

def col_class(x):
    if x == 'non':
        return 0
    elif x in ['heavy','light','cut']:
        return 1
    elif x == 'orig':
        return -1
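
The two label functions above can also be expressed as dictionary lookups applied with `Series.map`. This is an equivalent sketch, not the notebook's implementation; the constant names `CATEGORY_TO_NUM` and `CATEGORY_TO_CLASS` are assumptions introduced here.

```python
import pandas as pd

# Same rules as num_cat: higher values mean a higher degree of plagiarism.
CATEGORY_TO_NUM = {"non": 0, "heavy": 1, "light": 2, "cut": 3, "orig": -1}

# Same rules as col_class: 1 = plagiarized, 0 = not, -1 = original source.
CATEGORY_TO_CLASS = {"non": 0, "heavy": 1, "light": 1, "cut": 1, "orig": -1}

categories = pd.Series(["non", "cut", "light", "heavy", "orig"])
numeric = categories.map(CATEGORY_TO_NUM)
classes = categories.map(CATEGORY_TO_CLASS)

print(numeric.tolist())
print(classes.tolist())
```

With `map`, any label missing from the dictionary becomes `NaN`, so unexpected categories can be found afterwards with `isna()`.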

In [None]:
# Read in a csv file and return a transformed dataframe
def numerical_dataframe(csv_file='data/file_information.csv'):
    '''Reads in a csv file that is assumed to have `File`, `Category` and `Task` columns.
       This function does two things: 
       1) Converts `Category` column values to numerical values.
       2) Adds a new, numerical `Class` label column.
       The `Class` column will label plagiarized answers as 1 and non-plagiarized as 0.
       Source texts have a special label, -1.
       :param csv_file: The path to the file_information.csv file
       :return: A dataframe with numerical categories and a new `Class` label column'''
    
    df = pd.read_csv(csv_file)
    
    # Map each category to its numerical value (num_cat) and its class label (col_class)
    df['Category_new'] = df['Category'].apply(num_cat)
    df['Class'] = df['Category'].apply(col_class)
    
    # Drop the original 'Category' column and rename 'Category_new' to 'Category'
    df = df.drop(columns=['Category'])
    df = df.rename(columns={'Category_new': 'Category'})
    
    return df

In [None]:
# informal testing, print out the results of a called function
# create new `transformed_df`
transformed_df = numerical_dataframe(csv_file='data/file_information.csv')

# check work
# check that all categories of plagiarism have a class label = 1
transformed_df.head(10)

Unnamed: 0,File,Task,Category,Class
0,g0pA_taska.txt,a,0,0
1,g0pA_taskb.txt,b,3,1
2,g0pA_taskc.txt,c,2,1
3,g0pA_taskd.txt,d,1,1
4,g0pA_taske.txt,e,0,0
5,g0pB_taska.txt,a,0,0
6,g0pB_taskb.txt,b,0,0
7,g0pB_taskc.txt,c,3,1
8,g0pB_taskd.txt,d,2,1
9,g0pB_taske.txt,e,1,1
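
Instead of checking the table by eye, the rule "every plagiarized category has class label 1" can be asserted directly. The sketch below uses a stand-in DataFrame (`check_df`) holding the `Category` and `Class` values from the ten rows shown above; on the real data the same checks would be run against `transformed_df`.

```python
import pandas as pd

# Stand-in for transformed_df: Category/Class values from the ten rows above.
check_df = pd.DataFrame({
    "Category": [0, 3, 2, 1, 0, 0, 0, 3, 2, 1],
    "Class":    [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# Categories 1-3 (heavy, light, cut) are the plagiarized ones.
plagiarized = check_df["Category"] > 0

# Every plagiarized row must have Class == 1 ...
assert (check_df.loc[plagiarized, "Class"] == 1).all()
# ... and every non-plagiarized row (Category 0) must have Class == 0.
assert (check_df.loc[check_df["Category"] == 0, "Class"] == 0).all()

print("Class labels are consistent with Category values.")
```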
