Ryan Smith <br>
15 January 2024 <br>
Galactic Advisors - Interview Assignment

## Drive Scanner

### Foreword

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A customer has requested a drive scanner program. The program must be easy for a non-technical user to operate and must run on a variety of operating systems. It should search for Social Security Numbers (SSNs) and Credit Card Numbers (CCNs), detecting and recording these strings across a wide variety of file types.

### Application

###### Packages

The following packages are imported now, for use throughout the program:

In [79]:
%pip install textract
%pip install odfpy

import os
import re
import shutil
import zipfile
import pathlib
from pathlib import Path

import numpy as np
import pandas as pd
import textract

DEPRECATION: textract 1.6.5 has a non-standard dependency specifier extract-msg<=0.29.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of textract or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


#### Pluggable Algorithms

First, the pluggable algorithms are created and tested. These will later be implemented in the greater program, but each pluggable algorithm is designed to accurately identify and return SSNs and CCNs, respectively.

###### SSN Algorithm

The SSN Search algorithm implements a complex Regular Expression (RegEx) to search a given string of text for number strings that meet the requirements for an SSN. These requirements are:
- Per the Social Security Administration: The nine-digit SSN is composed of three parts:
    - The first set of three digits is called the Area Number
    - The second set of two digits is called the Group Number
    - The final set of four digits is the Serial Number 
- These digits may be written without separators, or separated into the three blocks by spaces or dashes in the format 'xxx-xx-xxxx'
- The Area Number will not be '000', '666', or begin with the digit '9'
- The Group Number will not be '00'
- The Serial Number will not be '0000'

Based on the above criteria, a RegEx was written so that every file is searched for nine-digit strings, and those matching the specifications are recorded.

In [2]:
def ssnSearch(text, outFile):
    search  = r'(?!(\d){3}(-| |)\1{2}\2\1{4})(?!666|000|9\d{2})(\b\d{3}(-| |)(?!00)\d{2}\4(?!0{4})\d{4}\b)'
    # The pattern contains capture groups, so re.findall would return tuples of groups;
    # finditer with group(0) yields the full matched SSN instead
    for match in re.finditer(search, text):
        # Strip separators so every SSN is reported as nine bare digits
        result = match.group(0).replace(' ', '').replace('-', '')
        outFile.write("SSN: " + result + '\n')
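
As a quick sanity check of the criteria listed above, the sketch below applies the same regular expression to a few hand-made strings. The helper name `find_ssns` and the sample values are illustrative assumptions and not part of the assignment; the helper returns the matches rather than writing them to a file so the behaviour is easy to inspect.

```python
import re

# Same pattern as ssnSearch; find_ssns is a hypothetical helper used only for this check.
SSN_PATTERN = re.compile(
    r'(?!(\d){3}(-| |)\1{2}\2\1{4})(?!666|000|9\d{2})'
    r'(\b\d{3}(-| |)(?!00)\d{2}\4(?!0{4})\d{4}\b)'
)

def find_ssns(text):
    # group(0) is the whole match; strip separators so each hit is nine bare digits
    return [m.group(0).replace('-', '').replace(' ', '')
            for m in SSN_PATTERN.finditer(text)]

# Made-up sample values: two well-formed numbers followed by one violation of each rule
sample = ("valid 123-45-6789 and 123 45 6789; invalid 000-12-3456, "
          "666-12-3456, 912-34-5678, 123-00-4567, 123-45-0000")
print(find_ssns(sample))
```

Only the first two values pass every rule, so the printed list should hold two entries, both `123456789`.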

###### CCN Algorithm

A similar algorithm is needed to search for CCNs. However, a regular expression alone is not sufficient to limit false positives, so the Luhn algorithm is also implemented. The Luhn algorithm is a checksum built into debit and credit card numbers and is widely used in America and internationally. Here, it verifies whether a candidate number carries a valid check digit, which drastically reduces the rate of false positives among the collected numbers.

In [3]:
def luhn(ccn):
    def digitsOf(n):
        return [int(d) for d in str(n)]
    digits = digitsOf(ccn)
    # Counting from the rightmost digit: odd positions are added as-is,
    # even positions are doubled and the digits of the product are summed
    odd = digits[-1::-2]
    even = digits[-2::-2]
    checksum = sum(odd)
    for d in even:
        checksum += sum(digitsOf(d*2))
    # A valid number has a checksum divisible by 10
    return checksum % 10 == 0
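
To make the checksum concrete, here is a minimal, self-contained sketch of the same calculation written as a single loop. The name `luhn_checksum` is a hypothetical helper, and the sample values are commonly published test numbers rather than real cards. A result of 0 means the number passes the check.

```python
def luhn_checksum(number):
    digits = [int(d) for d in str(number)]
    total = 0
    # Walk from the rightmost digit; every second digit is doubled
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10

print(luhn_checksum("4111111111111111") == 0)  # well-known test number, passes
print(luhn_checksum("4111111111111112") == 0)  # last digit altered, fails
```
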

Now that the Luhn algorithm has been implemented, the CCN search can be constructed. The search uses a general RegEx that matches runs of 13 to 16 digits, optionally separated by spaces or dashes, which is intended to accommodate every CCN format in the USA. Each candidate found by the RegEx is then validated with the Luhn algorithm.

In [4]:
def ccnSearch(text, outFile):
    # 13 to 16 digits, each optionally followed by spaces or dashes
    search = r'\b(?:\d[ -]*?){13,16}\b'
    results = re.findall(search, text)
    for result in results:
        # Strip separators before running the checksum
        result = result.replace(' ','').replace('-','')
        # Only candidates that pass the Luhn check are reported
        if luhn(result):
            outFile.write("CCN: " + result + '\n')
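
The two-stage filter described above (RegEx first, Luhn second) can be tried on a short string. This is a sketch under stated assumptions: `passes_luhn` and `find_ccns` are hypothetical stand-ins that return results instead of writing to a report file, and the sample text is invented.

```python
import re

# Same pattern as ccnSearch: 13 to 16 digits with optional space or dash separators
CCN_PATTERN = re.compile(r'\b(?:\d[ -]*?){13,16}\b')

def passes_luhn(number):
    digits = [int(d) for d in number]
    odd = digits[-1::-2]    # added as-is
    even = digits[-2::-2]   # doubled, then the digits of the product are summed
    total = sum(odd) + sum(sum(divmod(d * 2, 10)) for d in even)
    return total % 10 == 0

def find_ccns(text):
    hits = []
    for raw in CCN_PATTERN.findall(text):
        digits = raw.replace(' ', '').replace('-', '')
        if passes_luhn(digits):
            hits.append(digits)
    return hits

# The first number is a common test value; the second is a 16-digit order id
sample = "card 4111 1111 1111 1111 on file; order id 1234-5678-9012-3456"
print(find_ccns(sample))
```

Both numbers match the RegEx, but only the first passes the Luhn check, which is the false-positive reduction the text describes.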

#### Drive Searching

Scanning a drive for the above information poses two problems:
1. The drive must be iterated through recursively, in a manner that is functional across multiple operating systems
2. The scanning algorithm must be prepared for mixed and highly varied file input

Consequently, the function uses the Pathlib package so that it can run on any operating system. File input is handled by the Textract package, which converts many file types into plain text that the above pluggable algorithms can search. For each file it can read, the program runs both algorithms and writes the results to a plaintext report file.
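
The recursive, OS-independent traversal can be illustrated in isolation. The sketch below builds a throwaway directory tree and walks it with `Path.rglob`; the helper name `list_files` and the file names are assumptions made for the demonstration only.

```python
import tempfile
from pathlib import Path

def list_files(root):
    # rglob("*") yields every entry below root; keep regular files only.
    # as_posix() gives forward-slash paths regardless of operating system.
    return sorted(p.relative_to(root).as_posix()
                  for p in Path(root).rglob("*") if p.is_file())

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs" / "archive").mkdir(parents=True)
    (root / "notes.txt").write_text("top level")
    (root / "docs" / "report.txt").write_text("nested")
    (root / "docs" / "archive" / "old.txt").write_text("deeply nested")
    found = list_files(root)

print(found)
```

Directories themselves are skipped by the `is_file()` test, so only the three text files are listed, however deeply they are nested.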

###### Excel and Excel-Like Handling

Before the primary drive-searching function is declared, some special cases must be handled. As the 'Textract' package struggles with some Excel file extensions, those files are handled by the helper function below.

In [72]:
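excel = ('.xlsx', '.ods')

def excelHandler(path):
    # header=None keeps the first row as data so it is scanned as well
    frame = pd.read_excel(path, header=None)
    # Return a nested list: str() of a large DataFrame is a truncated preview,
    # whereas str() of a list contains every cell
    return frame.to_numpy().tolist()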

###### Compressed File Handling

Similar to the above helper function, another must be created to handle compressed files.

In [90]:
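import gzip

compressed = ('.zip', '.gz')

def zipHandler(text, driveName):
    name = str(text)
    if name.endswith('.zip'):
        with zipfile.ZipFile(text,"r") as zip_ref:
            zip_ref.extractall(driveName)
    elif name.endswith('.tar.gz'):
        # shutil.unpack_archive only recognises tar-based gzip archives
        shutil.unpack_archive(filename=text, extract_dir=driveName)
    elif name.endswith('.gz'):
        # A plain .gz holds one compressed file; write it out under the drive root
        target = Path(driveName) / Path(name).stem
        with gzip.open(text, 'rb') as src, open(target, 'wb') as dst:
            shutil.copyfileobj(src, dst)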
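
Extracting archives into the scanned drive writes new files to it. As a hedged alternative, and not what the program above does, the sketch below reads the members of a zip archive directly into memory so nothing is written to disk. The helper name `read_zip_members` and the sample archive are assumptions for illustration.

```python
import io
import zipfile

def read_zip_members(archive):
    """Return {member name: bytes} for every file in a zip archive, without extracting to disk."""
    contents = {}
    with zipfile.ZipFile(archive, "r") as zip_ref:
        for info in zip_ref.infolist():
            if not info.is_dir():
                contents[info.filename] = zip_ref.read(info)
    return contents

# Build a small archive in memory to demonstrate (made-up content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_out:
    zip_out.writestr("inner/data.txt", "SSN 123-45-6789")
buffer.seek(0)

members = read_zip_members(buffer)
print(members)
```

The returned bytes could then be decoded and passed to the search functions, at the cost of holding each member in memory.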

###### Drive Search Algorithm

With the above helper functions implemented, the master drive-searching function can be built to take varied file input, and scan them using the earlier built algorithms for CCNs and SSNs.

In [91]:
def driveSearch(driveName):
    drive = pathlib.Path(driveName)
    # The with-block closes the report even if a file raises an unexpected error
    with open('report.txt', 'w') as out:
        for item in drive.rglob("*"):
            if not item.is_file():
                continue
            if str(item).endswith(excel):
                text = str(excelHandler(item))
                ssnSearch(text, out)
                ccnSearch(text, out)
            elif str(item).endswith(compressed):
                zipHandler(item, driveName)
            else:
                try:
                    # textract returns bytes; drop undecodable bytes instead of
                    # rejecting the whole file for one non-ASCII character
                    text = textract.process(str(item)).decode('utf-8', errors='ignore')
                    ssnSearch(text, out)
                    ccnSearch(text, out)
                except Exception:
                    print("ERROR. File: " + str(item) + " is unreadable.")

### User Input

Finally, the user supplies a directory, and the scan runs at that location.

In [92]:
print("Welcome! This program scans a given drive or directory for Credit Card and Social Security Numbers")
driveName = input("Please input the address of the drive to be scanned")
driveSearch(driveName)

Welcome! This program scans a given drive or directory for Credit Card and Social Security Numbers
Please input the address of the drive to be scanned/home/rycaga/Desktop/Data Science/Galactic Programming Assignment/Sample Data/
ERROR. File: /home/rycaga/Desktop/Data Science/Galactic Programming Assignment/Sample Data/driveScanner.py is unreadable.


### Conclusion

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This concludes the drive-scanning program. Although a rudimentary implementation, this should serve to scan through a majority of standard file types. There is certainly room for improvement, as this program could become quite extensive in nature. Some areas of improvement may be:
- A GUI for ease of use on the consumer end.
- Support for a greater variety of file extensions and types.
- Greater efficiency, perhaps through a port to a more modern language or a Cython implementation.

Despite its simplicity, this program is effective at scanning for the data requested by the customer. Furthermore, the application is in an easy-to-use, executable format. The development lifecycle of a program such as this would be well served by a full CI/CD pipeline, including the collaboration of cybersecurity professionals, software developers, and data scientists.

### Sources

- https://www.ssa.gov/history/ssn/geocard.html
- https://www.ssa.gov/kc/SSAFactSheet--IssuingSSNs.pdf
- https://regex101.com/library/kdXrYe
- https://www.computerweekly.com/tip/How-to-find-credit-card-numbers-and-other-sensitive-data-on-your-users-computers
- https://www.techtarget.com/searchsecurity/definition/LUHN-formula
- https://allwin-raju-12.medium.com/credit-card-number-validation-using-luhns-algorithm-in-python-c0ed2fac6234
