# Completeness check

This notebook checks that the same number of files is present:
- on the INPI FTP
- in S3 Raw
- in S3 source data

Useful functions for checking completeness:

- list the objects present on the FTP from an XML extract, via this notebook
- `count_objects_s3`: list all the files present in an S3 folder
- read the object lists and check for differences and potential duplicates, via this notebook

The timestamped list of files present in RAW and SOURCE is available in the [log folder](https://s3.console.aws.amazon.com/s3/buckets/calfdata/INPI/TC_1/logs/count%2520files/?region=eu-west-3&tab=overview).

# Connect to S3

In [None]:
# Import S3 connector libraries

In [None]:
!pip install git+git://github.com/thomaspernet/aws-python

In [None]:
!pip install --upgrade git+git://github.com/thomaspernet/aws-python

In [None]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_s3 import service_s3
from awsPy.aws_athena import service_athena

In [None]:
# Connect to S3
import os
from pathlib import Path

bucket_name = 'calfdata'
path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = "{}/programme_matching/credential_AWS.json".format(parent_path)

con = aws_connector.aws_instantiate(credential=path_cred,
                                    region='eu-west-3')
client = con.client_boto()
s3 = service_s3.connect_S3(client=client,
                           bucket=bucket_name)

# Bucket resource used below to list objects
bucket = client['resource'].Bucket(bucket_name)

# List files in S3

In [None]:
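def count_objects_s3(prefix):
    # Count the objects stored under `prefix` and return (count, collection).
    # The original body filtered on the global `folder` and ignored its argument.
    list_obj = bucket.objects.filter(Prefix=prefix)
    c = 0
    for obj in list_obj:
        c += 1
    return (c, list_obj)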

In [None]:
folder='INPI/TC_1/01_donnee_source/Flux/2019/'
(c,list_obj)=count_objects_s3(folder)
c
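
The collection returned by `bucket.objects.filter` is evaluated lazily, so counting it in one cell and writing it out in another may list the prefix twice. A minimal sketch, assuming only the `.key` attribute of each object is needed, collects the keys once and derives the count from the resulting list. `collect_keys` and the stand-in objects are hypothetical names used for illustration; in this notebook the argument would be `bucket.objects.filter(Prefix=folder)`.

```python
from types import SimpleNamespace


def collect_keys(objects):
    """Return the keys of an iterable of objects exposing a `.key` attribute."""
    return [obj.key for obj in objects]


# Stand-in for bucket.objects.filter(Prefix=folder); the keys are made up.
fake_listing = [
    SimpleNamespace(key="some/prefix/a.csv"),
    SimpleNamespace(key="some/prefix/b.csv"),
    SimpleNamespace(key="some/prefix/c.csv"),
]

keys = collect_keys(fake_listing)
print(len(keys))
```

Keeping the plain list of keys also makes the later CSV export independent of the S3 connection.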

In [None]:
# Save file list to a file
import csv, time
from tqdm import tqdm

timestr = time.strftime("%Y%m%d-%H%M%S")
logtd_name="{}/{}/{}_{}_{}{}".format('data','count',timestr,'listobj','SourceFlux2019','.csv')
# The with-statement closes the file; newline='' avoids blank rows on Windows
with open(logtd_name, 'w', newline='') as myfile:
    wr = csv.writer(myfile, delimiter=';', quoting=csv.QUOTE_ALL)
    wr.writerow(['key'])
    for obj in tqdm(list_obj):
        wr.writerow([obj.key])
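
Since the same export is produced for several folders (the file names in this notebook mention both `SourceFlux2019` and `RawFlux2018`), it could be wrapped in a small helper. The sketch below is one possible shape, not the notebook's own code: `write_key_log` and its parameters are hypothetical, and it assumes the keys have already been collected into a list of strings. It also creates the `data/count` directory if it does not exist yet.

```python
import csv
import time
from pathlib import Path


def write_key_log(keys, label, log_dir="data/count"):
    """Write one key per row to a timestamped CSV and return its path."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    timestr = time.strftime("%Y%m%d-%H%M%S")
    target = log_path / "{}_listobj_{}.csv".format(timestr, label)
    # newline="" is the setting recommended by the csv module for writers
    with open(target, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=";", quoting=csv.QUOTE_ALL)
        writer.writerow(["key"])
        for key in keys:
            writer.writerow([key])
    return target
```

A call such as `write_key_log([obj.key for obj in list_obj], 'SourceFlux2019')` would then reproduce the file written by the cell above.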

# Compare and check duplicates

In [2]:
# Read FTP file list
import pandas as pd
filenameFTPcsv='/home/usr/FTPFlux2018.csv'

ftp_=pd.read_csv(filenameFTPcsv)
ftp_['full_path']=ftp_['key'].apply(lambda x: ('\\flux' + x).replace('\\','/'))
ftp_['filename']=ftp_['key'].apply(lambda x: x.split('\\')[-1])
ftp_.head()

Unnamed: 0,key,full_path,filename
0,\2018\01\01\7501\122\7501_122_20180101_201523_...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_9_ets_nouveau_modifie...
1,\2018\01\01\7501\122\7501_122_20180101_201523_...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_8_ets.csv
2,\2018\01\01\7501\122\7501_122_20180101_201523_...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_7_rep_partant_EVT.csv
3,\2018\01\01\7501\122\7501_122_20180101_201523_...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_6_rep_nouveau_modifie...
4,\2018\01\01\7501\122\7501_122_20180101_201523_...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_5_rep.csv


In [3]:
print(str(len(ftp_)) + ' files in FTP.')

552088

In [None]:
# Check if there are duplicate rows/files based on filename
duplicateRowsDF = ftp_[ftp_.duplicated(['filename'])]
duplicateRowsDF.head()
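
With its default `keep='first'`, `duplicated` flags only the second and later occurrences of a file name, so the resulting frame does not show how many copies exist. As a dependency-free cross-check, the sketch below counts occurrences with `collections.Counter`. `duplicated_filenames` is a hypothetical helper and the sample names are made up; in this notebook the input would be `ftp_['filename']`.

```python
from collections import Counter


def duplicated_filenames(filenames):
    """Return {filename: count} for every name that appears more than once."""
    counts = Counter(filenames)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up file names, for illustration only.
sample = ["a_1_PM.csv", "a_2_ets.csv", "a_1_PM.csv"]
print(duplicated_filenames(sample))
```

An empty dictionary means every file name in the listing is unique.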

In [9]:
# Read Raw S3 file list
filenameRawcsv='data/count/20200412-221748_listobj_RawFlux2018.csv'

raw_=pd.read_csv(filenameRawcsv)
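# Drop the first 50 characters of the key (the S3 prefix) so the path starts at /flux, as in the FTP listing
raw_['full_path']=raw_['key'].apply(lambda x: x[50:])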
raw_['filename']=raw_['key'].apply(lambda x: x.split('/')[-1])
raw_.head()

Unnamed: 0,key,full_path,filename
0,INPI/TC_1/00_RawData/public/IMR_Donnees_Saisie...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_10_ets_supprime_EVT.csv
1,INPI/TC_1/00_RawData/public/IMR_Donnees_Saisie...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_11_obs.csv
2,INPI/TC_1/00_RawData/public/IMR_Donnees_Saisie...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_12_actes.csv
3,INPI/TC_1/00_RawData/public/IMR_Donnees_Saisie...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_13_comptes_annuels.csv
4,INPI/TC_1/00_RawData/public/IMR_Donnees_Saisie...,/flux/2018/01/01/7501/122/7501_122_20180101_20...,7501_122_20180101_201523_1_PM.csv
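
The fixed offset in `x[50:]` only works while every key shares a prefix of exactly that length. A minimal sketch of a more tolerant variant, assuming each key contains the `/flux/` segment seen in the `full_path` column, cuts at that marker instead. `path_from_marker` is a hypothetical helper and the example key is invented, since the real prefix is truncated in the output above.

```python
def path_from_marker(key, marker="/flux/"):
    """Return the part of `key` starting at `marker`, or None if it is absent."""
    position = key.find(marker)
    if position == -1:
        return None
    return key[position:]


# Hypothetical key, for illustration only.
example_key = "some/raw/prefix/flux/2018/01/01/7501/122/example.csv"
print(path_from_marker(example_key))
```

Returning `None` for keys without the marker makes unexpected objects easy to spot, for example with `raw_['full_path'].isna()` after applying the helper.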


In [10]:
print(str(len(raw_)) + ' files in Raw S3.')

552089

In [11]:
# Select all duplicate rows based on filename
duplicateRowsDF = raw_[raw_.duplicated(['filename'])]
duplicateRowsDF.head()

Unnamed: 0,key,full_path,filename


In [None]:
# Files listed on the FTP with no match in Raw S3 (left merge: only FTP-side gaps are reported)
diff_=ftp_.merge(raw_,on='filename',indicator = True, how='left').loc[lambda x : x['_merge']!='both']

In [None]:
diff_.head()

In [None]:
diff_[['filename','full_path_x','_merge']]

In [None]:
# Number of missing
len(diff_)

In [None]:
# Save missing list
diff_['key_x'].to_csv(r'missing.csv', sep=';', index=False)
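
The left merge above reports only files that are on the FTP and absent from Raw S3. The counts printed earlier differ in the other direction (552088 on the FTP, 552089 in Raw), so the reverse check is worth running too. The sketch below does both with plain sets; `compare_listings` is a hypothetical helper and the sample names are made up. Sets ignore duplicates, so it complements the duplicate check rather than replacing it.

```python
def compare_listings(ftp_names, raw_names):
    """Return (missing_from_raw, unexpected_in_raw) as sorted lists."""
    ftp_set, raw_set = set(ftp_names), set(raw_names)
    missing_from_raw = sorted(ftp_set - raw_set)   # on the FTP only
    unexpected_in_raw = sorted(raw_set - ftp_set)  # in Raw S3 only
    return missing_from_raw, unexpected_in_raw


# Made-up file names, for illustration only.
ftp_sample = ["a.csv", "b.csv", "c.csv"]
raw_sample = ["b.csv", "c.csv", "d.csv"]
missing, unexpected = compare_listings(ftp_sample, raw_sample)
print(missing, unexpected)
```

In this notebook the call would be `compare_listings(ftp_['filename'], raw_['filename'])`. The same two-sided result can be obtained in pandas by passing `how='outer'` to `merge`, in which case `_merge` takes the values `left_only` and `right_only`.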