## Requirements:
Make sure you have Python with the `requests` and `pandas` modules installed.<br>
Versions I used while developing:<br>
python - 3.10.4<br>
pandas - 1.4.1<br>

## Warning:
To get an access token, you need an app registered with Yandex.<br>
You can use [my app](https://oauth.yandex.ru/client/549db500306c4bd68adc834cdaff626b) or [create your own](https://yandex.ru/dev/id/doc/dg/oauth/tasks/register-client.html) for this purpose.<br>
If you create your own, replace the value of the _CLIENT_ID_ variable with your app's ID.

Script logic:

0. It redirects you to the verification page to get a token and waits for you to enter it in the console
    - it opens a tab in your default browser and asks you to authenticate
    - you are then redirected to a page that shows the token
1. It retrieves the list of all files on your Yandex.Disk
2. It compares the files with each other using the fields listed in the _size_name_type_ and _size_type_ variables
3. It saves the duplicates found to CSV files in the current working directory
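
To make step 2 concrete, here is a minimal sketch of the comparison on a made-up file listing. The data is invented for illustration and is not a real Yandex.Disk response; only the column names match the fields used in this notebook.

```python
import pandas as pd

# Toy listing standing in for the Yandex.Disk response (made-up data).
files = pd.DataFrame(
    {
        "name": ["a.jpg", "a.jpg", "b.jpg", "c.txt"],
        "size": [100, 100, 100, 5],
        "mime_type": ["image/jpeg", "image/jpeg", "image/jpeg", "text/plain"],
        "path": ["disk:/a.jpg", "disk:/copy/a.jpg", "disk:/b.jpg", "disk:/c.txt"],
    }
)

# Strict match: rows that share size, name and type.
strict = files[files.duplicated(subset=["size", "name", "mime_type"], keep=False)]

# Loose match: rows that share size and type only, so renamed copies show up too.
loose = files[files.duplicated(subset=["size", "mime_type"], keep=False)]

print(strict["path"].tolist())
print(loose["path"].tolist())
```

The loose match also flags `b.jpg` here, which may or may not be a real copy of `a.jpg`: equal size and type only make a file a candidate, so the results are worth checking by hand before deleting anything.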

### Useful links
1. https://yandex.ru/dev/id/doc/dg/oauth/concepts/about.html
2. https://yandex.ru/dev/disk/poligon/
3. https://yandex.ru/dev/disk/api/reference/meta.html
4. https://realpython.com/api-integration-in-python/


### Modules and Functions

In [None]:
# MODULES
import requests
import os, sys
import webbrowser
import pandas as pd

# FUNCTIONS
def get_ya_token(client_id: str):
    # open the Yandex OAuth page in the browser and read the token the user pastes back
    if client_id is None:
        raise ValueError("client_id was not supplied!")
    request = f"https://oauth.yandex.ru/authorize?response_type=token&client_id={client_id}"
    webbrowser.open_new_tab(request)
    return input("Enter the access_token from the web page here: ")

def get_ya_files(access_token: str):    
    # get files using access_token supplied
    headers = {'Authorization':f'OAuth {access_token}'}
    request = f"https://cloud-api.yandex.net/v1/disk/resources/files?limit={sys.maxsize}"
    print("Trying to retrieve your files. It might take some time. Please wait ⏳")
    response = requests.get(request, headers=headers)
    
    if response.status_code != 200:
        raise RuntimeError(f"Oops... Finished too fast...\nReceived HTTP status {response.status_code}: {response.text}\nThere's some problem with your request. Check your access token and see the API reference: https://yandex.ru/dev/disk/api/reference/meta.html")
    
    files = pd.DataFrame(response.json()['items'])
    print("Files list retrieved.")
    return files


def find_ya_duplicates(df: pd.DataFrame, fields_subset: list, fields_out: list = ['size','name','mime_type','path'], save_log: bool = False, path: str = 'duplicates_found.csv'):
    """
    Find duplicates in a DataFrame using the supplied fields
        Takes:
        - df: a pandas DataFrame with files received from YaDisk
        - fields_subset: a list of column names to identify duplicates over
        - fields_out: a list of column names you want to see in the output file/dataframe [optional]
            default: ['size','name','mime_type','path']
        - save_log: a boolean flag to save the output dataframe to a csv [optional]
            default: False
        - path: a string with the path and filename to save the output to, relative to the current working directory [optional]
            default: 'duplicates_found.csv'
        Returns:
        - pandas DataFrame with all duplicated entries found over the fields supplied in fields_subset
    """
    # mark every repeat after the first occurrence, so each distinct file counts once as unique
    is_repeat = df.duplicated(subset=fields_subset)
    print(f"{df.shape[0]}\t- total number of files in your YaDisk")
    print(f"{(~is_repeat).sum()}\t- unique files")
    print(f"{is_repeat.sum()}\t- duplicates")

    # keep=False selects all members of each duplicate group, not only the repeats
    df_dups = df[df.duplicated(keep=False, subset=fields_subset)].sort_values(['size','name'])[fields_out]

    if save_log: 
        path = os.path.join(os.getcwd(), path)
        df_dups.to_csv(path, index=False)
        print(f"output saved as csv to {path}")
    return df_dups

def get_disk_info(access_token: str):
    """
    Return general information about the disk (such as its capacity) as a dict.
    See https://yandex.ru/dev/disk/api/reference/capacity.html
    """
    headers = {'Authorization':f'OAuth {access_token}'}
    request = "https://cloud-api.yandex.net/v1/disk/"
    response = requests.get(request, headers=headers)
    return response.json()
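
`get_ya_files` asks for everything in a single request by passing a very large `limit`. If that turns out to be slow or gets rejected, paging through the list is one alternative. Below is a minimal sketch of that idea, not part of the original script: `fetch_all_items` and `make_disk_fetcher` are hypothetical helper names, the page size of 1000 is an arbitrary choice, and the use of the `limit` and `offset` query parameters is an assumption to verify against the API reference linked above.

```python
from typing import Callable, Dict, List


def fetch_all_items(fetch_page: Callable[[int, int], List[Dict]], page_size: int = 1000) -> List[Dict]:
    """Collect items page by page until a short (or empty) page comes back."""
    items: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        items.extend(page)
        if len(page) < page_size:
            break  # fewer items than requested: this was the last page
        offset += page_size
    return items


def make_disk_fetcher(access_token: str) -> Callable[[int, int], List[Dict]]:
    """Hypothetical adapter that requests one page of the flat file list."""
    import requests  # imported here so the paging logic above has no dependencies

    def fetch_page(limit: int, offset: int) -> List[Dict]:
        response = requests.get(
            "https://cloud-api.yandex.net/v1/disk/resources/files",
            headers={"Authorization": f"OAuth {access_token}"},
            params={"limit": limit, "offset": offset},  # assumed parameter names
        )
        response.raise_for_status()
        return response.json()["items"]

    return fetch_page
```

With these helpers, `pd.DataFrame(fetch_all_items(make_disk_fetcher(access_token)))` would play the same role as `get_ya_files(access_token)`. Keeping the paging loop separate from the HTTP call also makes it possible to test the loop with a fake `fetch_page` function.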

### Getting access token
Change _CLIENT_ID_ below if you want to use your own application.

In [None]:
# App: https://oauth.yandex.ru/client/549db500306c4bd68adc834cdaff626b
CLIENT_ID = '549db500306c4bd68adc834cdaff626b'
access_token = get_ya_token(CLIENT_ID)
all_files_df = get_ya_files(access_token)

print(f"Files total:\t{all_files_df.shape[0]}")
print(f"Files size:\t{all_files_df['size'].sum()} bytes")
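
The total above is a raw byte count, which is hard to read for large disks. A small helper like the following could make it friendlier. It is a sketch only: `format_bytes` is a hypothetical name that is not part of the original script, and it uses binary units (1 KiB = 1024 bytes).

```python
def format_bytes(num_bytes: float) -> str:
    """Render a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units[:-1]:
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    # Anything that is still large after GiB is reported in the biggest unit.
    return f"{value:.1f} {units[-1]}"


print(format_bytes(1536))
```

For example, `format_bytes(all_files_df['size'].sum())` could replace the raw sum in the second `print` call.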

### Getting duplicates

In [None]:
# fields of interest: 'name', 'size', 'path', 'mime_type'
fields = ['name', 'size', 'path', 'mime_type']
# what's mime_type: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types

# fields combo #1 - strict match: same size, name and type (very likely duplicates)
size_name_type = ['size', 'name', 'mime_type']
# fields combo #2 - loose match: same size and type (candidates, e.g. renamed copies)
size_type = ['size', 'mime_type']

find_ya_duplicates(all_files_df, size_name_type, save_log=True, path = 'size_name_mime_out.csv')
find_ya_duplicates(all_files_df, size_type, save_log=True, path='size_mime_out.csv')

# why pandas?
# pandas is faster: https://stackoverflow.com/a/39280934
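
Once the duplicates are found, it can help to see which groups waste the most space. The sketch below is one possible follow-up and not part of the original script: `summarize_duplicates` is a hypothetical helper, the data is made up, and it assumes that `size` is one of the grouping fields, as in both field combos above.

```python
import pandas as pd


def summarize_duplicates(df_dups: pd.DataFrame, keys: list) -> pd.DataFrame:
    """One row per duplicate group: number of copies and bytes freed by keeping one."""
    groups = df_dups.groupby(keys).size().reset_index(name="copies")
    # Keeping a single copy frees (copies - 1) * size bytes; needs 'size' in keys.
    groups["reclaimable_bytes"] = (groups["copies"] - 1) * groups["size"]
    return groups.sort_values("reclaimable_bytes", ascending=False, ignore_index=True)


# Made-up output of find_ya_duplicates, for illustration only.
dups = pd.DataFrame(
    {
        "size": [100, 100, 2000, 2000, 2000],
        "name": ["a.jpg", "a.jpg", "v.mp4", "v.mp4", "v.mp4"],
        "mime_type": ["image/jpeg", "image/jpeg", "video/mp4", "video/mp4", "video/mp4"],
        "path": ["disk:/a.jpg", "disk:/x/a.jpg", "disk:/v.mp4", "disk:/y/v.mp4", "disk:/z/v.mp4"],
    }
)

summary = summarize_duplicates(dups, ["size", "name", "mime_type"])
print(summary)
```

In the notebook, the DataFrame returned by `find_ya_duplicates(all_files_df, size_name_type)` could be passed in place of `dups`.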
