# Version Notes: 

### v1: 
* add Data_HourMinute for all exported datasources


# Tip for quick search

* Needs attention: places that need an update or better logic
* question to be answered: places where things are still unclear
* Manual Check: unit tests where you can drill in to find the data behind the check result for a specific project and a specific check
* TODO: things that need to be done
* bookmark: stopping point from the last visit


# Admin Notes:


1. The AMTool dataset is archived daily as csv files and used for the project book check. 
The csv files are located at: 
r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Tableau Dashboards\DataLake'

2. The Excel input files are checked daily and archived with a datestamp whenever they are modified.
The continuously updated Excel input files are located at: r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Projectbook_WorkingFolder\excel'
The Excel input files are archived at: r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Tableau Dashboards\Data_MiscInput'
To recover the archived Excel file used in the project book check for a target date, select the file with the latest datestamp that is still earlier than the target date.

3. The check summary export action is logged daily. It can be used for daily monitoring. 
The file export log is located at: \\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Projectbook_WorkingFolder\output_internal\log

4. The published data are at:

    * csv files for district asset manager: http://svgcshopp.dot.ca.gov/DataLake/ProjectBookCheck/
    * csv files for HQ AM: \\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Projectbook_WorkingFolder\output_internal
    * tableau workbook with live data source: https://tableau.dot.ca.gov/#/site/AssetManagement/workbooks/1815/views
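
The recovery rule in note 2 can be sketched as follows. This is only an illustration: the file-name pattern (`<name>_MM-DD-YYYY.xlsx`) and the helper name `pick_archived_file` are assumptions, and the real archive naming may differ.

```python
import re
from datetime import datetime

# Assumed archive naming: "<name>_MM-DD-YYYY.xlsx" (hypothetical pattern).
_STAMP = re.compile(r"_(\d{2}-\d{2}-\d{4})\.xlsx$")

def pick_archived_file(filenames, target_date):
    """Return the file with the latest datestamp earlier than target_date, or None."""
    best_name, best_date = None, None
    for name in filenames:
        match = _STAMP.search(name)
        if match is None:
            continue  # skip files without a recognizable datestamp
        stamp = datetime.strptime(match.group(1), "%m-%d-%Y")
        if stamp < target_date and (best_date is None or stamp > best_date):
            best_name, best_date = name, stamp
    return best_name

archive = [
    "Counties_01-05-2021.xlsx",
    "Counties_02-10-2021.xlsx",
    "Counties_03-01-2021.xlsx",
]
chosen = pick_archived_file(archive, datetime(2021, 2, 20))
print(chosen)
```

For a target date of 02-20-2021 the sketch selects the 02-10-2021 archive, since the 03-01-2021 file is later than the target.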


# General Approach

Use the Minor raw data as the basis for data checks. 
Each project occupies only one line.

Columns can be added only if doing so does not create duplicate rows in the SHOPP raw dataset. 


# Data clean process

* funding amount: remove the dollar sign
* fill missing values (string and numerical)
* remove the leading single quote from string values
* strip leading and trailing spaces

* regulate column names
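
A minimal sketch of these cleaning steps on single values is shown below. The helper names are hypothetical, the comma removal in `clean_amount` is an added assumption, and "regulate column names" is interpreted here simply as collapsing extra whitespace.

```python
import re

def clean_amount(value):
    """Turn a funding string such as "$1,250.5" into a float; blanks become 0.0."""
    if value is None:
        return 0.0
    # assumption: thousands separators are removed along with the dollar sign
    text = str(value).strip().replace("$", "").replace(",", "")
    return float(text) if text else 0.0

def clean_text(value, default=""):
    """Strip surrounding spaces and a leading single quote; fill missing values."""
    if value is None:
        return default
    return str(value).strip().lstrip("'").strip()

def clean_column_name(name):
    """Collapse repeated whitespace and trim the ends of a column name."""
    return re.sub(r"\s+", " ", str(name)).strip()

print(clean_amount(" $1,250.5 "))
print(clean_text("  '0C930 "))
print(clean_column_name(" Prog  Appr Date "))
```

Each helper can be passed to a column-wise `.apply(...)`, in the same way the notebook applies `regulate_EFIS`.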




# Import common modules

<a id='TableOfContents'></a>

# Table Of Contents

## Data Preprocessing

### [Global Constants](#GlobalConstants)


### [Load and cleanup source data](#Read_Data)


## Add fields to SHOPP raw data (calculate and join)
* [Calculated Fields](#AddDataColumns)
* [Join Tables](#DataJoining)



## Data Check and Export


## [Data Check List](#Issue_Table1)
The main table of check issues, one issue per row.


* [Will_this_project_be_included_in_the_Project_Book](#Will_this_project_be_included_in_the_Project_Book)
* [Does_project_cost_exceed_Minor_Program_limits](#Does_project_cost_exceed_Minor_Program_limits)



## [Export Internal Check Summary](#Export_internal_check_summary)
* internal check summary (csv)


## [Final Clean Up](#FinalCleanUp)


In [73]:
%load_ext autoreload
%autoreload 2

In [74]:

from datetime import datetime
import os.path

# import requests
import pandas as pd

import numpy as np
import re

import shutil

In [75]:
import time
start_time = time.time()

In [76]:
# show up to 100 dataframe columns instead of truncating the display
pd.options.display.max_columns = 100

In [77]:
# from config_datasource import *
from projectbookcheck_utilityfunction import *

<a id='GlobalConstants'></a>
## Global Constants

In [78]:
# # use 'csv' to read data from data lake, use 'live' to read data directly from AmTool Server
# DATA_SOURCE_TYPE = 'csv'

# # DATALAKE_FOLDER = r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Tableau Dashboards\DataLake'

# #input data
# DATALAKE_FOLDER = r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Tableau Dashboards\DataLake'
# PROJECTBOOKCHECK_INPUT_FOLDER = r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Projectbook_WorkingFolder\excel'

# #output data
# DATALAKE_HTTPSEVER_FOLDER = 'C:\inetpub\wwwroot\DataLake\ProjectBookCheck'
# PROJECTBOOKCHECK_OUTPUT_FOLDER = r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Projectbook_WorkingFolder\output_internal'

# #log data
# log_folder = r'\\ct.dot.ca.gov\dfshq\DIROFC\Asset Management\4e Project Book\Projectbook_WorkingFolder\output_internal\log'

# TARGET_FY = 2021


# # CURRENT_FY

# TARGETDATE = datetime.today().strftime("%m-%d-%Y")

In [79]:
from constants import *

In [80]:
TARGETDATE = datetime.today().strftime("%m-%d-%Y")
CURRENT_FY = fiscalyear (datetime.today())

In [81]:
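def regulate_EFIS(EFIS):
    # Format an EFIS (project ID) as a whole number padded to 10 characters.
    # Returns 0 when the value is not numeric; 0 is flagged later as missing/invalid.
    if isinstance(EFIS, str):
        # drop surrounding spaces and the leading single quote left by the export
        EFIS = EFIS.strip().lstrip("'")
    try: 
        return "{:10.0f}".format(float(EFIS))
    except (TypeError, ValueError): 
        return 0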

In [82]:
# filename = 'Programming_Summary_'
# df_Programming_Summary = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE))

<a id='Read_Data'></a>

# Read Data


In [83]:
if DATA_SOURCE_TYPE == 'csv':
    filename = 'Minor_Project_Details_Raw_Data_'
    df_Minor_raw_data = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE))

    filename = 'Minor_Performance_Raw_Data_'
    df_Minor_perf_raw_data = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE))

    filename = 'Programming_Summary_'
    df_Programming_Summary = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE))

    filename = 'Minor_Project_Postmile_Check_'
    df_Minor_pm_check = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE), header = 0)

    filename = 'Rawdata_Bridge_Worksheet_'
    df_brg_raw_data = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE), skiprows = [0], header = 0)

    filename = 'Rawdata_Pavement_Worksheet_'
    df_pav_raw_data = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE), skiprows = [0], header = 1)


    filename = 'Rawdata_Drainage_Worksheet_'
    df_drain_raw_data = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE), header = 0)


    filename = 'Rawdata_TMS_Worksheet_'
    df_tms_raw_data = pd.read_csv(r'{}\{}{}.csv'.format(DATALAKE_FOLDER, filename, TARGETDATE), header = 0)
    

else:
    print('skip getting csv data.')



Done with TenYrShopp_RawData_ in -13.425710678100586
Done with TenYrShopp_PerfM_Raw_Data_ in -96.45661997795105
Done with Rawdata_Pavement_Worksheet_ in -8.857773303985596
Done with Rawdata_Drainage_Worksheet_ in -43.5689959526062
Done with Rawdata_Bridge_Worksheet_ in -7.964364290237427
Done with Rawdata_TMS_Worksheet_ in -9.304267168045044
Done with Project_Postmile_Check_ in -12.754770517349243
Done with Programming_Summary_ in -15.230461597442627
Done with HM_Project_Details_Raw_Data_ in -7.1786627769470215
Done with Minor_Project_Details_Raw_Data_ in -7.667008399963379
total time: -222.40863466262817
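
The `read_csv` calls in the cell above differ only in the file prefix and a few keyword arguments, so they could be driven from a table. The sketch below is one possible arrangement, not the notebook's actual loader: `CSV_SOURCES`, `csv_path`, and `load_sources` are hypothetical names, only a subset of the sources is listed, and a stand-in reader replaces `pd.read_csv` so the example runs without the data lake.

```python
# Hypothetical table: DataFrame name -> (file prefix, extra reader keyword arguments).
CSV_SOURCES = {
    "df_Minor_raw_data": ("Minor_Project_Details_Raw_Data_", {}),
    "df_Minor_perf_raw_data": ("Minor_Performance_Raw_Data_", {}),
    "df_Programming_Summary": ("Programming_Summary_", {}),
    "df_Minor_pm_check": ("Minor_Project_Postmile_Check_", {"header": 0}),
    "df_brg_raw_data": ("Rawdata_Bridge_Worksheet_", {"skiprows": [0], "header": 0}),
}

def csv_path(folder, prefix, date_text):
    """Build the dated csv path the same way the cell above does."""
    return r"{}\{}{}.csv".format(folder, prefix, date_text)

def load_sources(folder, date_text, reader):
    """Call reader(path, **kwargs) once per source and return the results by name."""
    return {
        name: reader(csv_path(folder, prefix, date_text), **kwargs)
        for name, (prefix, kwargs) in CSV_SOURCES.items()
    }

# Stand-in reader so the sketch runs anywhere; the notebook would pass pd.read_csv.
def fake_reader(path, **kwargs):
    return (path, kwargs)

loaded = load_sources(r"C:\DataLake", "01-15-2021", fake_reader)
print(loaded["df_brg_raw_data"])
```

Keeping the per-file options in one table makes differences such as `skiprows` and `header` easier to review side by side.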

In [84]:
#question answered: 2021 and 2022 approved list project id duplication will be resolved with later excel files


# Data quality check and cleaning

<a id='Minor_Raw_Data'></a>

## Minor Raw Data


In [85]:
dict_rename = {'Project ID':'EFIS',
               'ID': 'AMT_ID', 
              'FY.1': 'FY_ALN',
               'Prog Appr Date': 'Prog Appr Date_ALN',
               'FY': 'FY_WP',
               'Prog Approval Date': 'Prog Appr Date_WP',
              }
df_Minor_raw_data = df_Minor_raw_data.rename(dict_rename, axis = 1)

In [86]:
df_Minor_raw_data.shape

(1251, 78)

In [87]:
# for programmed FY year of 9999, skip all the checks

# No need to check, since the raw data is filtered before download

In [88]:

df_Minor_raw_data['District'] = df_Minor_raw_data['District'].apply(remove_punction)
df_Minor_raw_data['District'] = df_Minor_raw_data['District'].astype(int)

<a id='Minor_Perf_RawData'></a>
## Minor_Perf_RawData

In [89]:
#rename columns
dict_rename_perf_rawdata = {
                           'ID': 'AMT_ID',
#                             'ProjectedRTL FY': 'Projected RTL FY',

              }

df_Minor_perf_raw_data = df_Minor_perf_raw_data.rename(dict_rename_perf_rawdata, axis = 1)

In [90]:
cols_strip = ['EA','EFIS']
for c in cols_strip :
    df_Minor_perf_raw_data[c] = df_Minor_perf_raw_data[c].str.strip("'")

In [91]:
#data clean 
#data type regulation

df_Minor_perf_raw_data['Quantity'] = df_Minor_perf_raw_data['Quantity'].fillna(0)
df_Minor_perf_raw_data['Assets in Good Cond'] = df_Minor_perf_raw_data['Assets in Good Cond'].fillna(0)
df_Minor_perf_raw_data['Assets in Fair Cond'] = df_Minor_perf_raw_data['Assets in Fair Cond'].fillna(0)
df_Minor_perf_raw_data['Assets in Poor Cond'] = df_Minor_perf_raw_data['Assets in Poor Cond'].fillna(0)
df_Minor_perf_raw_data['New Assets Added'] = df_Minor_perf_raw_data['New Assets Added'].fillna(0)

# df_Minor_perf_raw_data['EFIS'] = df_Minor_perf_raw_data['EFIS'].apply(regulate_EFIS)
df_Minor_perf_raw_data['EFIS'] = pd.to_numeric(df_Minor_perf_raw_data['EFIS'], errors='coerce')


In [92]:
#data trimming
#row
df_Minor_perf_raw_data= df_Minor_perf_raw_data[df_Minor_perf_raw_data['District'] != 56]
#column
df_Minor_perf_raw_data.drop(['PID Cycle', 'TYP','ProjectedSHOPP Cycle','RequestedRTL FY','DistrictPriority'],
  axis='columns', inplace=True, errors='ignore')

In [93]:
df_Minor_perf_raw_data.name = 'df_Minor_perf_raw_data'

<a id='Counties'></a>
## Counties


In [94]:
filename = 'Counties.xlsx'

df_counties = pd.read_excel(r'{}\{}'.format(PROJECTBOOKCHECK_INPUT_FOLDER, filename))

In [95]:
df_counties['Co. Name Abbr.'] = df_counties['Co. Name Abbr.'].str.upper()

In [96]:
df_counties.shape

(60, 6)

In [97]:
df_counties.name = 'df_counties'

In [98]:
# df_perf_raw_prog_county = df_perf_raw_prog_candidate.merge(df_counties, how = 'left', left_on = 'County', right_on = 'Co. Name Abbr.')

In [99]:
#no need for the following, already added to the df_Minor_perf_raw_data

# #rename columns
# dict_rename_4= {
#                'Performance Objective':'Performance Objective Original', 
#               }

# df_perf_raw_prog_county = df_perf_raw_prog_county.rename(dict_rename_4, axis = 1)

<a id='Postmile_Check'></a>
## Postmile Check

In [100]:
dict_PM_ck_rename = {
 'ID': 'AMT_ID',
 '№': 'No'                            }
df_Minor_pm_check.rename(dict_PM_ck_rename, axis = 1, inplace = True)

In [101]:
df_Minor_pm_check['District'] = df_Minor_pm_check['District'].str.strip("'")
df_Minor_pm_check['District'] =df_Minor_pm_check['District'].astype(int)
df_Minor_pm_check = df_Minor_pm_check[df_Minor_pm_check['District']!= 56]

In [102]:
df_Minor_pm_check.name = 'df_Minor_pm_check'
df_Minor_pm_check.shape

(1610, 29)

<a id='ProgrammingSummary'></a>
## Programming Summary

In [103]:

dict_renamee = {'ID': 'AMT_ID',
                               }
df_Programming_Summary.rename(dict_renamee, axis = 1, inplace = True)

In [104]:
cols_strip = ['EA','EFIS']
for c in cols_strip :
    df_Programming_Summary[c] = df_Programming_Summary[c].str.strip("'")
    
df_Programming_Summary['EFIS'] = df_Programming_Summary['EFIS'].apply(regulate_EFIS)
df_Programming_Summary['EFIS'] = pd.to_numeric(df_Programming_Summary['EFIS'], errors='coerce')


In [105]:
# df_Programming_Summary.head()

# Approved Project List

In [106]:
#read xlsx files
df_approved_2021 = pd.read_excel(r'{}\{}'.format(r'H:\Jupyter\Dev\data', 'FY2021_Minor Approved list.xlsx'))
df_approved_2022 = pd.read_excel(r'{}\{}'.format(r'H:\Jupyter\Dev\data', 'FY2022_Minor Approved list.xlsx'))

In [107]:
df_approved_2021['In the 2021 Approved List?'] = 'Yes'
df_approved_2022['In the 2022 Approved List?'] = 'Yes'

In [108]:
#question answered: should 2021 and 2022 approved list be treated separately?


In [109]:
dict_rename = {
    'Project ID':'EFIS',
    'Total Project Cost ($K)': 'Construction Capital Cost ($K)'
}
df_approved_2021 = df_approved_2021.rename(dict_rename, axis = 1)


In [110]:
dict_rename = {
    'Project ID':'EFIS',
    'Contruction': 'Construction Capital Cost ($K)'
              }
df_approved_2022 = df_approved_2022.rename(dict_rename, axis = 1)

In [111]:
df_approved_2021['EFIS'] = df_approved_2021['EFIS'].apply(regulate_EFIS)
df_approved_2022['EFIS'] = df_approved_2022['EFIS'].apply(regulate_EFIS)

In [112]:
df_approved_2021['EFIS'] = pd.to_numeric(df_approved_2021['EFIS'], errors='coerce')
df_approved_2022['EFIS'] = pd.to_numeric(df_approved_2022['EFIS'], errors='coerce')

In [113]:
df_approved_2021['Approve Year'] = 21
df_approved_2022['Approve Year'] = 22

In [114]:
target_cols = ['EFIS','EA','Performance Value','Performance Measure','Approve Year','Program Code','Construction Capital Cost ($K)']

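# DataFrame.append was removed in pandas 2.0; pd.concat gives the same stacked result
df_approved = pd.concat([df_approved_2021[target_cols], df_approved_2022[target_cols]])
# when a project is in both the 21 and 22 lists, keep only the 21 entry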
df_approved = df_approved.sort_values(by =['EFIS','Approve Year'], ascending = True)
df_approved = df_approved.groupby('EFIS').first().reset_index()

df_approved['In the Approved List?'] = 'Yes'
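
The "keep the earliest approval per EFIS" rule can be illustrated on plain records. This is a simplified sketch with made-up EFIS values, and `earliest_approval` is a hypothetical helper rather than part of the notebook.

```python
def earliest_approval(records):
    """Keep one record per EFIS, preferring the lowest 'Approve Year'."""
    kept = {}
    for rec in sorted(records, key=lambda r: (r["EFIS"], r["Approve Year"])):
        kept.setdefault(rec["EFIS"], rec)  # the first record seen per EFIS wins
    return list(kept.values())

approved = [
    {"EFIS": 1120000001, "Approve Year": 22},
    {"EFIS": 1120000001, "Approve Year": 21},
    {"EFIS": 1120000002, "Approve Year": 22},
]
result = earliest_approval(approved)
for rec in result:
    print(rec["EFIS"], rec["Approve Year"])
```

One difference worth keeping in mind: pandas `groupby(...).first()` takes the first non-null value per column, so a field that is blank in the 2021 row can be filled from the 2022 row, whereas this sketch keeps the whole 2021 record.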

<a id='AddDataColumns'></a>
## Calculate and join additional fields


In [115]:
#this logic needs to consider the programming list
df_Minor_raw_data['Section'] = df_Minor_raw_data['Section In Use']

df_Minor_raw_data['Unique EA'] = df_Minor_raw_data.apply(calc_unique_EA, axis = 1)

df_Minor_raw_data['FY In Use'] = df_Minor_raw_data['FY.2'].str[-2:]
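
Because `.str[-2:]` produces strings, `FY In Use` holds values such as `'22'` rather than `22`, and later comparisons have to use matching types. A small sketch of a conversion helper follows; the `'2021/22'` label format and the name `fy_suffix_as_int` are assumptions for illustration.

```python
def fy_suffix_as_int(fy_text):
    """Return the last two characters of an FY label as an int, or None if unusable."""
    if fy_text is None:
        return None
    tail = str(fy_text).strip()[-2:]
    return int(tail) if tail.isdigit() else None

print("22" == 22)                           # a string never equals an int
print(fy_suffix_as_int("2021/22") == 22)    # comparison works after conversion
print(fy_suffix_as_int("n/a"))              # unusable labels give None
```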

In [116]:
#filter data to keep Minor program and active section only.
# df_Programming_Summary
print(df_Programming_Summary.shape)
df_Programming_Summary_filtered = pd.merge(df_Programming_Summary, df_Minor_raw_data[['AMT_ID','Section',]],
               how= 'inner', left_on = ['AMT_ID','Section',], right_on = ['AMT_ID','Section',])
print(df_Programming_Summary_filtered.shape)

(15091, 25)
(1002, 25)


In [117]:
print(df_Programming_Summary_filtered.shape)
df_Programming_Summary_filtered = pd.merge(df_Programming_Summary_filtered, df_approved,
               how= 'left', left_on = ['EFIS'], right_on = ['EFIS'],
               suffixes=['','_ApprovedList'])
print(df_Programming_Summary_filtered.shape)
df_Programming_Summary_filtered['In the Approved List?'].fillna('No', inplace=True)

(1002, 25)
(1002, 32)


In [118]:
ck_col = 'Matches Minor Approved List Performance Measure?'

def ck_performance_measure(df):
    if pd.isna(df['Performance Measure_ApprovedList']):
        return 'Not in the Approved Lists'
    else:
        if df['Performance Measure_ApprovedList'] == df['Performance Measure']:
            return 'Yes'
        else:
            return 'No'

df_Programming_Summary_filtered[ck_col]= df_Programming_Summary_filtered.apply(ck_performance_measure, axis = 1)

In [119]:

ck_col = 'Matches Minor Approved List Performance Value?'
def ck_performance_value(df):
    if pd.isna(df['Performance Value_ApprovedList']):
        return 'Not in the Approved Lists'
    else:
        if df['Performance Value_ApprovedList'] == df['Performance Value']:
            return 'Yes'
        else:
            return 'No'

df_Programming_Summary_filtered[ck_col]= df_Programming_Summary_filtered.apply(ck_performance_value, axis = 1)

In [120]:
ck_col = 'Matches Minor Approved List Performance Value and Measure?'
def ck_performance(df):
    if df['Matches Minor Approved List Performance Value?'] == 'Not in the Approved Lists':
        return 'Not in the Approved Lists'
    elif (df['Matches Minor Approved List Performance Value?'] == 'Yes') and (df['Matches Minor Approved List Performance Measure?'] == 'Yes'):
        return 'Yes'
    else:
        return 'No'
    

df_Programming_Summary_filtered[ck_col]= df_Programming_Summary_filtered.apply(ck_performance, axis = 1)
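
The three checks above share one pattern: compare a field against the approved list, then combine the two results. A compact sketch of that pattern on plain values is given below, with `None` standing in for a project that is missing from the approved lists. The function names and sample values are hypothetical.

```python
NOT_LISTED = "Not in the Approved Lists"

def match_status(approved, current):
    """Compare one field against the approved list entry (None = project not listed)."""
    if approved is None:
        return NOT_LISTED
    return "Yes" if approved == current else "No"

def combined_status(value_status, measure_status):
    """'Yes' only when both fields match; unlisted projects keep their own label."""
    if value_status == NOT_LISTED:
        return NOT_LISTED
    return "Yes" if (value_status, measure_status) == ("Yes", "Yes") else "No"

value = match_status(12.5, 12.5)
measure = match_status("Lane miles", "Each")
print(value, measure, combined_status(value, measure))
print(combined_status(match_status(None, 3), match_status(None, "Each")))
```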

# Check Minor Data

In [121]:
df_Minor_raw_data['Program Code in Use'] = df_Minor_raw_data.apply(lambda x: x['Program Code'] if x['Section In Use'] == 'WP' else x['Program Code.1'], axis = 1)

df_Minor_raw_data['Const Capital in Use'] = df_Minor_raw_data.apply(lambda x: x['Construction Capital ($K)'] if x['Section In Use'] == 'WP' else x['Total Capital Project Cost ($K)'], axis = 1)

In [122]:
#question answered: we focus on checking the data only in the Section in Use

In [123]:
# df_Minor_raw_data_backup = df_Minor_raw_data.copy()

# df_Minor_raw_data = df_Minor_raw_data_backup

In [124]:
print(df_Minor_raw_data.shape)

df_Minor_raw_data = pd.merge(df_Minor_raw_data, df_approved[['EFIS','EA','In the Approved List?','Approve Year','Program Code','Construction Capital Cost ($K)' ]],
                            how = 'left', left_on = 'EFIS', right_on = 'EFIS', suffixes=['','_ApprovedList'])

print(df_Minor_raw_data.shape)

df_Minor_raw_data['In the Approved List?'].fillna('No', inplace= True)


(1251, 83)
(1251, 88)


In [125]:
# df_Minor_raw_data.columns

In [126]:
def ck_match_2022_approved_list(df):
    if df['In the Approved List?'] == 'Yes' and df['Approve Year'] == 22:
        # 'FY In Use' holds a 2-character string (e.g. '22'), so compare it to a string
        if df['FY In Use'] == '22':
            return 'OK'
        else:
            return r'The FY {} does not match Approved year {}'.format(df['FY In Use'], df['Approve Year'])
    else:
        return 'Not in the 2022 Approved list'

ck_col = 'FY Matches 2022 List?'
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_match_2022_approved_list, axis = 1)

### is EFIS duplicate within Minor raw data?

In [127]:
# temp = df_Minor_raw_data.groupby(['EFIS'])['AMT_ID'].nunique().reset_index(name = 'EFIS_Counts')
# duplicated_EFIS= temp[temp['EFIS_Counts']> 1]

# df_Minor_raw_data.drop(columns=['EFIS_Counts'],inplace=True , errors='ignore')
# print(df_Minor_raw_data.shape)
# df_Minor_raw_data = pd.merge(df_Minor_raw_data, duplicated_EFIS, 
#                              how = 'left', left_on = ['EFIS'], right_on=['EFIS'])
# print(df_Minor_raw_data.shape)

# def ck_EFIS_Uniqueness(df):
#     if pd.isna(df['EFIS_Counts']):
#         return 'OK'
#     elif df['EFIS'] == 0: 
#         return 'Missing/Invalid EFIS'
#     else:
#         return 'Duplicate EFIS'
    
# df_Minor_raw_data['EFIS Uniqueness Check'] = df_Minor_raw_data.apply(ck_EFIS_Uniqueness, axis = 1)

In [131]:
temp = df_Minor_raw_data.groupby(['EFIS'])['AMT_ID'].agg([pd.Series.nunique, list]).reset_index()
temp['AMT_IDs'] = temp['list'].apply(lambda l: ','.join(l))
duplicated_EFIS= temp[temp['nunique']> 1]

# also drop the helper 'list' column so the cell can be re-run without duplicate columns
df_Minor_raw_data.drop(columns=['nunique','list','AMT_IDs'],inplace=True , errors='ignore')
print(df_Minor_raw_data.shape)
df_Minor_raw_data = pd.merge(df_Minor_raw_data, duplicated_EFIS, 
                             how = 'left', left_on = ['EFIS'], right_on=['EFIS'])
print(df_Minor_raw_data.shape)

def ck_EFIS_Uniqueness(df):
    if pd.isna(df['nunique']):
        return 'OK'
    elif df['EFIS'] == 0: 
        return 'Missing/Invalid EFIS'
    else:
        return 'Duplicate EFIS {} is found in the following projects: {}'.format(df['EFIS'], df['AMT_IDs'])
    
df_Minor_raw_data['EFIS Uniqueness Check'] = df_Minor_raw_data.apply(ck_EFIS_Uniqueness, axis = 1)

(1251, 93)
(1251, 96)
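
The duplicate detection above groups projects by EFIS and keeps the groups with more than one distinct `AMT_ID`. The same idea on plain rows, as a rough sketch with made-up IDs (`find_duplicate_keys` is a hypothetical helper):

```python
from collections import defaultdict

def find_duplicate_keys(rows, key_field, id_field):
    """Map each key shared by more than one distinct id to its sorted id list."""
    ids_by_key = defaultdict(set)
    for row in rows:
        ids_by_key[row[key_field]].add(row[id_field])
    return {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}

rows = [
    {"EFIS": 1120000001, "AMT_ID": "A-1"},
    {"EFIS": 1120000001, "AMT_ID": "A-2"},
    {"EFIS": 1120000002, "AMT_ID": "A-3"},
]
duplicates = find_duplicate_keys(rows, "EFIS", "AMT_ID")
print(duplicates)
```

The same helper would cover the District+EA check by passing `"Unique EA"` as the key field.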


In [134]:
df_Minor_raw_data['EFIS'].value_counts()

0.000000e+00    46
1.120000e+09     3
1.120000e+09     3
1.120000e+09     3
1.118000e+09     2
                ..
1.121000e+09     1
1.170002e+08     1
4.200003e+08     1
3.210002e+08     1
1.119000e+09     1
Name: EFIS, Length: 1135, dtype: int64

In [None]:
# df_Minor_raw_data['EFIS Uniqueness Check'].value_counts()

In [135]:
# df_Minor_raw_data['EA']

0       0C930
1       0F080
2       0J010
3       0H390
4       2H140
        ...  
1246    0Y130
1247    1N840
1248    3A492
1249    3A510
1250    2J600
Name: EA, Length: 1251, dtype: object

In [None]:
# #flag if EFIS is invalid
# def ck_invalid_EFIS(EFIS):
#     if len(str(EFIS)) < 5: 
#         return 'Invalid EFIS'
#     else:
#         return 'OK'
# df_Minor_raw_data['EFIS is valid?'] = df_Minor_raw_data['EFIS'].apply(ck_invalid_EFIS)

In [None]:
# df_Minor_raw_data.columns

### flag if District + EA duplicate within Minor raw data

In [None]:
temp = df_Minor_raw_data.groupby(['Unique EA'])['AMT_ID'].nunique().reset_index(name = 'UnqiueEA_Counts')
duplicated_EA= temp[temp['UnqiueEA_Counts']> 1]

df_Minor_raw_data.drop(columns=['UnqiueEA_Counts'],inplace=True , errors='ignore')

df_Minor_raw_data = pd.merge(df_Minor_raw_data, duplicated_EA[['Unique EA','UnqiueEA_Counts']].drop_duplicates(), 
                             how = 'left', left_on = ['Unique EA'], right_on=['Unique EA'])

def ck_EA_Uniqueness(df):
    if pd.isna(df['UnqiueEA_Counts']):
        return 'OK'
    else:
        return 'Duplicate District+EA is found.'
    
df_Minor_raw_data['EA Uniqueness Check'] = df_Minor_raw_data.apply(ck_EA_Uniqueness, axis = 1)

In [None]:
ck_col = 'Does Project have a Repeated EA or Project ID repeated in Minor Profile?'

def ck_ID_Uniqueness(df):
    if df['EFIS Uniqueness Check'] == 'OK' and df['EA Uniqueness Check'] == 'OK':
        return 'OK'
    else:
        return 'Duplicate District+EA and/or Project ID(EFIS) is found.'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_ID_Uniqueness, axis=1 )

In [None]:
# Does FY Need Updates?


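def ck_FY_consistancy(df):
    if df['In the Approved List?'] == 'Yes':
        # 'FY In Use' is a string and 'Approve Year' is numeric, so compare them as integers
        if pd.notna(df['FY In Use']) and int(df['FY In Use']) == int(df['Approve Year']):
            return 'OK'
        else:
            return 'Please update FY. It is in the {} Approved List'.format(int(df['Approve Year']))
    else:
        return 'OK'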
    
df_Minor_raw_data['Does FY Need Updates?'] = df_Minor_raw_data.apply(ck_FY_consistancy, axis = 1)


In [None]:
# Does EA Need Updates?

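def ck_EA_consistancy(df):
    # projects outside the approved lists have no approved EA to compare against
    if pd.isna(df['EA_ApprovedList']) or df['EA'] == df['EA_ApprovedList']:
        return 'OK'
    else:
        return 'Update EA. It does not match EA in Approved List of year {}'.format(int(df['Approve Year']))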
    
df_Minor_raw_data['Does EA Need Updates?'] = df_Minor_raw_data.apply(ck_EA_consistancy, axis = 1)

In [None]:
#Does Program Code Need Updates?

ck_col = 'Does Program Code Need Updates?'

def ck_program_code_update(df):
    if pd.isna(df['Program Code_ApprovedList']) or (df['Program Code in Use'] == df['Program Code_ApprovedList']):
        return 'OK'
    else:
        return 'The program code for Section In Use does not match Approved project list.'

df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_program_code_update, axis = 1)

In [None]:
# ck_col = 'Does Program Code Need Updates?'
# df_Minor_raw_data[ck_col].value_counts()

In [None]:
# df_Minor_raw_data[df_Minor_raw_data['Does Program Code Need Updates?'] != 'OK'][['AMT_ID','Program Code in Use','Program Code_ApprovedList']]

In [None]:
# drainage, needs to have at least one C activity id
# bridge,needs to have at least one A activity id
# pavement: needs to have at least one B activity id
# safety: needs to have at least one within the list []


In [None]:
# def ck_shape(*args, **kwargs):
#     def wrapper_func(original_func):
#         print (kwargs['df'].shape)
#         results = original_func(*args, **kwargs)
#         print (kwargs['df'].shape)
#         return results
#     return wrapper_func

# @ck_shape(df = df_Minor_raw_data)
# def add(x,y):
#     return (x+y) 



In [None]:
# add(2,3)

In [None]:
#Does Construction Capital Cost ($K) Need Updates?

ck_col = 'Does Construction Capital Cost ($K) Need Updates?'

def ck_construction_capital_cost(df):
       
    if abs(df['Construction Capital Cost ($K)'] - df['Const Capital in Use']) < 0.01:
        return 'OK'
    else:
        # question to be answered: for construct cost ck, for 22, only check WP band if Section in WP
        if df['Section'] == 'ALN' or (df['Section'] == 'WP' and df['Approve Year'] == 22): 
            return 'Update Capital Cost. It does not match Approved List'
        else:
            return 'OK'
        
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_construction_capital_cost, axis = 1)
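
Note that any comparison involving NaN is false, so a project with no approved cost never passes the `abs(...) < 0.01` test and falls through to the `else` branch. The sketch below separates "missing" from "mismatch". `costs_match` is a hypothetical helper, and the 0.01 tolerance is taken from the cell above.

```python
import math

def costs_match(approved_cost, cost_in_use, tolerance=0.01):
    """True/False when both costs are present, None when either is missing."""
    for value in (approved_cost, cost_in_use):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return None
    return abs(approved_cost - cost_in_use) < tolerance

print(costs_match(1250.0, 1250.004))
print(costs_match(1250.0, 1300.0))
print(costs_match(float("nan"), 1300.0))
```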

### flag if no performance
 performance value can be zero

In [None]:
#TODO
# FY in Use needs to be the same as the pavement, TMS worksheet plan year

# flag the invalid locations


In [None]:
ck_col = 'Was Performance Tab Completed in Section in Use?'

temp = df_Minor_perf_raw_data.groupby(['AMT_ID','Section']).first().reset_index()
temp['Has performance raw data?'] = 'Yes'

df_Minor_raw_data.drop(columns=['Has performance raw data?',],inplace=True , errors='ignore')
print(df_Minor_raw_data.shape)

df_Minor_raw_data = pd.merge(df_Minor_raw_data, temp[['AMT_ID','Section','Has performance raw data?']].drop_duplicates(), 
                             how = 'left', left_on = ['AMT_ID','Section'], right_on=['AMT_ID','Section'])

df_Minor_raw_data['Has performance raw data?'].fillna('No', inplace=True)

print(df_Minor_raw_data.shape)


def ck_performance_availability(df):
    if df['Has performance raw data?'] == 'No':
        return 'Please complete Performance Tab in Section {}'.format(df['Section'])
    else:
        return 'OK'

df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_performance_availability, axis = 1)

In [None]:
# Does Performance in Section in Use Match Approved List?
# check shape
print(df_Minor_raw_data.shape)
#remove column

col_name = 'Matches Minor Approved List Performance Value and Measure?'
df_Minor_raw_data.drop(columns=[col_name],inplace=True , errors='ignore')

#join
df_Minor_raw_data = pd.merge(
    df_Minor_raw_data, 
    df_Programming_Summary_filtered[['AMT_ID', 'Section',col_name]],
    how='left', left_on=['AMT_ID', 'Section'], right_on=['AMT_ID', 'Section']
)
#fill na
#question to be answered: for projects not in the programming summary list, we assigned the performance value and measure check to No

df_Minor_raw_data[col_name].fillna('No', inplace=True)

print(df_Minor_raw_data.shape)

In [None]:
ck_col = 'Was project with FY Before 2021/22 Closed-Out?'

def ck_project_closeout_status(df):
    if pd.isna(df['FY In Use']):
        return 'Please Identify FY'
    elif int(df['FY In Use']) < 22 and df['Section'] == 'ALN':
        return 'OK'
    else:
        return 'Please work with HQ Minor Program to Close-out Project'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_project_closeout_status, axis=1)

In [None]:
ck_col = 'Data Needs Review?'

input_cols = ['Does Project have a Repeated EA or Project ID repeated in Minor Profile?',
       'Does FY Need Updates?', 'Does EA Need Updates?',
       'Does Program Code Need Updates?',
       'Does Construction Capital Cost ($K) Need Updates?',
       'Has performance raw data?',
       'Was Performance Tab Completed in Section in Use?',
       'Matches Minor Approved List Performance Value and Measure?',
       'Was project with FY Before 2021/22 Closed-Out?']

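def ck_review_needs(df, input_cols):
    # Needs attention: 'Has performance raw data?' and 'Matches Minor Approved List
    # Performance Value and Measure?' hold Yes/No values, never 'OK', so they always flag
    for col in input_cols:
        if df[col] != 'OK':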
            return 'District needs to review project data (Profile and/or RTL)'
    return 'OK'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_review_needs, args = [input_cols], axis=1)
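
The input columns do not all use the same passing value: most checks return `'OK'`, while `'Has performance raw data?'` holds `'Yes'`/`'No'`. One way to handle that, sketched here under the assumption that each column has a known set of passing values, is an explicit mapping. `PASS_VALUES` and `failed_checks` are hypothetical names, and only three columns are listed.

```python
# Hypothetical mapping: check column -> values that count as passing.
PASS_VALUES = {
    "Does FY Need Updates?": {"OK"},
    "Does EA Need Updates?": {"OK"},
    "Has performance raw data?": {"Yes"},
}

def failed_checks(row, pass_values=PASS_VALUES):
    """Return the names of the check columns whose value is not a passing one."""
    return [col for col, ok in pass_values.items() if row[col] not in ok]

row = {
    "Does FY Need Updates?": "OK",
    "Does EA Need Updates?": "Update EA. It does not match EA in Approved List of year 21",
    "Has performance raw data?": "Yes",
}
print(failed_checks(row))
```

Returning the list of failed columns, rather than a single message, would also let the punchlist show which check triggered the review.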


In [None]:

ck_col = 'Data needs review other that Close-out?'

input_cols = ['Does Project have a Repeated EA or Project ID repeated in Minor Profile?',
       'Does FY Need Updates?', 'Does EA Need Updates?',
       'Does Program Code Need Updates?',
       'Does Construction Capital Cost ($K) Need Updates?',
       'Has performance raw data?',
       'Was Performance Tab Completed in Section in Use?',
       'Matches Minor Approved List Performance Value and Measure?',
#          'Was project with FY Before 2021/22 Closed-Out?'     
       ]

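def ck_review_needs_2(df, input_cols):
    # Needs attention: 'Has performance raw data?' and 'Matches Minor Approved List
    # Performance Value and Measure?' hold Yes/No values, never 'OK', so they always flag
    for col in input_cols:
        if df[col] != 'OK':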
            return 'District needs to review project data (Profile and/or RTL)'
    return 'OK'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_review_needs_2, args = [input_cols], axis=1)

In [None]:
#question to be answered: can the following checks be converted into "OK" or another value, so they can be used to filter out flagged items in the punchlist?


In [None]:
ck_col = 'Was information Entered in the Allocation Band?'

def ck_ALN_band_info_completeness(df):
    if pd.isna(df['FY_ALN']) or df['Has performance raw data?'] == 'No' or pd.isna(df['Total Capital Project Cost ($K)']):
        return 'No'
    else: 
        return 'Yes'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_ALN_band_info_completeness, axis=1)

In [None]:
ck_col = 'Is Project ready to enter data in the Allocation Band?'

def ck_readiness_to_enter_ALN_band(df):
    if pd.notna(df['Prog Appr Date_WP']):  #has approval date in WP band
        if df['Section'] == 'ALN':
            return 'Project was closed-out'
        elif df['Was information Entered in the Allocation Band?'] == 'Yes':
            return 'Allocation Band needs review by HQ Minor Program. If all data Accurate HQ Minor will enter the approval date'
        else:
            return 'Project ready to enter data in the Allocation Band (Cost, Schedule, RTL, And/Or Performance Tab)'
    else: 
        return 'Workplan Band needs review by HQ Minor Program. If all data Accurate HQ Minor will enter the approval date'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_readiness_to_enter_ALN_band, axis=1)

In [None]:

ck_col = 'Is Project Project Ready for Review and Approval Date?'

def ck_readiness_for_review(df):
    if df['Data needs review other that Close-out?'] != 'OK':
        return 'No'
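    # 'FY In Use' is a string, so convert it before comparing with a number
    elif pd.notna(df['FY In Use']) and int(df['FY In Use']) > 22: 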
        return 'No'
    elif pd.isna(df['Prog Appr Date_WP']):
        return 'HQ Needs to review Workplan band and enter Approval Date if data is accurate'
    
    elif df['Was information Entered in the Allocation Band?'] == 'Yes':
        if pd.notna(df['Prog Appr Date_ALN']):
            return 'No, Project Already Closed-out'
        else:
            return 'HQ Needs to review Allocation band and enter Approval Date if data is accurate'
    else:
        return 'No'
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_readiness_for_review, axis=1)

In [None]:
ck_col = 'Does Worplan Band needs Approval Removal?'

def ck_WP_data_error(df):
    if pd.isna(df['Prog Appr Date_WP']):  #has no approval data in WP band
        return 'No'
    elif (pd.isna(df['FY In Use'])
        or int(df['FY In Use']) > 22 
        or (df['FY In Use'] in ['21', '22'] and df['In the Approved List?'] == 'No')
         ): 
        return 'HQ Minor Program needs to remove Approval date from Workplan Band, so District can update the project FY. Project not in Approved lists or in the future'
    else:
        return 'No'
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_WP_data_error, axis=1)

In [None]:
ck_col = 'Does Allocation Band needs Approval Removal?'

def ck_ALN_data_error(df):
    if pd.isna(df['Prog Appr Date_ALN']):  #has no approval data in ALN band
        return 'No'
    elif df['Data needs review other that Close-out?'] == 'OK': 
        return 'No'
    else:
        return 'HQ Minor Program needs to remove Approval date from Allocation Band, so District can update the project data'

df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_ALN_data_error, axis=1)

In [None]:
ck_col = 'HQ Minor Program Needs Review?'

def ck_review_needs_HQ_Minor(df):
    if (df['Is Project Project Ready for Review and Approval Date?'] == "No"
        and df['Does Worplan Band needs Approval Removal?'] == 'No'
        and df['Does Allocation Band needs Approval Removal?'] == 'No'
       ):
        return 'No'
    else: 
        return "HQ Minor Needs Review"
    
df_Minor_raw_data[ck_col] = df_Minor_raw_data.apply(ck_review_needs_HQ_Minor, axis=1)

In [None]:
#question to be answered: 
# for every project in Minor raw data, the section-in-use performance needs to be filled. 
# the performance measure unit and value should match the approved project list performance measure, if available in the approved project list. 
# if the raw data has all the information needed, including FY and performance data, and the project is not on the approved project list, Minor project HQ needs to review and approve the project. 
# if the project is on the approved list, HQ needs to reach out to the district to get the project closed out. 


In [None]:
#question to be answered: 
# do we need the following checks

### flag if total project cost is zero

In [None]:

def ck_total_project_cost(df):
    if pd.isna(df['Total Project Cost ($K)']) or df['Total Project Cost ($K)'] == 0:
        return 'Total project cost cannot be empty or zero.'
    else:
        return 'OK'
    
df_Minor_raw_data['Total Project Cost Check'] = df_Minor_raw_data.apply(ck_total_project_cost, axis = 1)

### flag if project description is blank


In [None]:
def ck_project_description(df):
    if pd.isna(df['Project Location/Description']) or df['Project Location/Description'] == '':
        return 'Project Location/Description cannot be empty.'
    else:
        return 'OK'
    
df_Minor_raw_data['Project Location/Description Check'] = df_Minor_raw_data.apply(ck_project_description, axis = 1)

### check pm validation

In [None]:
df_Minor_pm_invalid = df_Minor_pm_check[df_Minor_pm_check['Valid PM'] != 'Yes']

AMT_IDs_withInvalidPM = df_Minor_pm_invalid['AMT_ID'].unique()

In [None]:

def ck_invalid_pm(df):
    if df['AMT_ID'] in AMT_IDs_withInvalidPM:
        return 'The PM is invalid.'
    else:
        return 'OK'
    
df_Minor_raw_data['PM Validity Check'] = df_Minor_raw_data.apply(ck_invalid_pm, axis = 1)


In [None]:
end_time =  time.time()
elapsed = end_time - start_time
print('time elapsed : {} seconds'.format(elapsed))

<a id='Export_Data'></a>
# Export Data

In [None]:
DATA_HHMM = datetime.now().strftime("%H%M")

file_export_log = open(LOG_FILE, "a")  # append mode
file_export_log.write("#####{}, time(HHMM):{} \n".format(TARGETDATE, DATA_HHMM))

In [None]:
df_Minor_raw_data['Data_HourMinute'] = DATA_HHMM
df_Minor_raw_data['Data_Date'] = TARGETDATE

## export check flags

In [None]:
ck_cols = [
    'EFIS Uniqueness Check',
    'EA Uniqueness Check',
    'Does Project have a Repeated EA or Project ID repeated in Minor Profile?',
    'Does FY Need Updates?', 
    'Does EA Need Updates?',
    'Does Program Code Need Updates?',
    'Does Construction Capital Cost ($K) Need Updates?',
    'Has performance raw data?',
    'Was Performance Tab Completed in Section in Use?',
    'Matches Minor Approved List Performance Value and Measure?',
    'Was project with FY Before 2021/22 Closed-Out?', 
    'Data Needs Review?',
    'Data needs review other that Close-out?',
           
    #additional checks
    'PM Validity Check',]


In [None]:
#export all projects with all checks in matrix
out_cols = [
    
    #project information
    'AMT_ID', 'Minor', 'EFIS', 'EA', 'District', 
    'Data_Date',
    'Data_HourMinute',
                    ]

out_cols.extend(ck_cols)


filename = 'minor_datachecks_matrix'

try: 
    # raw strings keep the Windows backslashes without invalid-escape warnings
    df_Minor_raw_data[out_cols].to_csv(r'.\output\{}.csv'.format(filename), index=False)
    shutil.copy(r'.\output\{}.csv'.format(filename), r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename))
    file_export_log.write("Succeeded: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))
except Exception:  # a bare except would also swallow KeyboardInterrupt
    file_export_log.write("Failed: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))
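
The same write, copy, and log sequence is repeated for every exported file in this section. One possible way to factor it out is sketched below. This is an illustration under assumptions, not the notebook's code: the helper name `export_csv_with_log` and its parameters are hypothetical, `df` is anything with a pandas-style `to_csv(path, index=False)` method, and the demo uses a stand-in object and temporary folders so that it runs without pandas or the network share.

```python
import io
import shutil
import tempfile
from pathlib import Path


def export_csv_with_log(df, filename, local_dir, publish_dir, log):
    """Write df to local_dir, copy it to publish_dir, and log the outcome.

    Returns True on success and False on failure.
    """
    local_path = Path(local_dir) / '{}.csv'.format(filename)
    publish_path = Path(publish_dir) / '{}.csv'.format(filename)
    try:
        df.to_csv(local_path, index=False)
        shutil.copy(local_path, publish_path)
    except Exception:
        log.write("Failed: {} \n".format(publish_path))
        return False
    log.write("Succeeded: {} \n".format(publish_path))
    return True


class _FakeFrame:
    """Stand-in for a DataFrame; only implements to_csv."""

    def to_csv(self, path, index=False):
        Path(path).write_text('AMT_ID\n1\n')


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'output'
    pub_dir = Path(tmp) / 'published'
    out_dir.mkdir()
    pub_dir.mkdir()
    log = io.StringIO()  # stands in for file_export_log

    ok = export_csv_with_log(_FakeFrame(), 'demo', out_dir, pub_dir, log)
    # The target folder 'missing' does not exist, so the copy step fails.
    failed = export_csv_with_log(_FakeFrame(), 'demo', out_dir, Path(tmp) / 'missing', log)
    published = (pub_dir / 'demo.csv').exists()
    log_text = log.getvalue()

print(log_text)
```

In the notebook, a call such as `export_csv_with_log(df_out, filename, r'.\output', PROJECTBOOKCHECK_HTTPSEVER_FOLDER, file_export_log)` would stand in for one try/except block, keeping the log line format unchanged.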


In [None]:
#table 1

df_melted = pd.melt(df_Minor_raw_data, 
                    id_vars=['AMT_ID'], 
                    value_vars=ck_cols, var_name = 'Check Description')

df_melted.columns = ['AMT_ID','Check Description','Check Summary']
df_melted_filtered = df_melted[df_melted['Check Summary']!= 'OK']


out_cols = [
    
    #project information
    'AMT_ID', 'Minor', 'EFIS', 'EA', 'District', 
    'Data_Date',
    'Data_HourMinute',
                    ]

df_out = pd.merge(df_melted_filtered, df_Minor_raw_data[out_cols],
                  how = 'left', left_on = 'AMT_ID', right_on = 'AMT_ID')
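
To show what the punchlist rows look like, the sketch below reproduces the wide-to-long reshaping and the `!= 'OK'` filter on two made-up projects. It is a pure-Python stand-in for `pd.melt` so that it can be run in isolation; the helper name `melt_checks` and the sample values are hypothetical.

```python
def melt_checks(rows, id_col, check_cols):
    """Turn one row per project into one record per (project, check) pair."""
    long_rows = []
    for check in check_cols:          # like pd.melt, stack one check column at a time
        for row in rows:
            long_rows.append({
                id_col: row[id_col],
                'Check Description': check,
                'Check Summary': row[check],
            })
    return long_rows


# Made-up check matrix: one row per project, one column per check.
wide = [
    {'AMT_ID': 101, 'EA Uniqueness Check': 'OK', 'PM Validity Check': 'The PM is invalid.'},
    {'AMT_ID': 102, 'EA Uniqueness Check': 'OK', 'PM Validity Check': 'OK'},
]
checks = ['EA Uniqueness Check', 'PM Validity Check']

long_rows = melt_checks(wide, 'AMT_ID', checks)
# Keep only the records that need attention, as the punchlist does.
punchlist = [r for r in long_rows if r['Check Summary'] != 'OK']
print(punchlist)
```

The filter keeps every value other than the literal `'OK'`, so a check column that reports a passing result with a different value, for example `'No'`, would still appear in the punchlist.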



In [None]:
filename = 'Minor_Datachecks_Punchlist'
try: 
    df_out.to_csv(r'.\output\{}.csv'.format(filename), index=False)
    shutil.copy(r'.\output\{}.csv'.format(filename), r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename))
    file_export_log.write("Succeeded: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))
except Exception:
    file_export_log.write("Failed: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))

hyper_name = '{}.hyper'.format(filename)

try: 
    publish_datasource(df_out, hyper_name)
    file_export_log.write("Succeeded: {} \n".format(hyper_name))
except Exception:
    file_export_log.write("Failed: {} \n".format(hyper_name))


## export action items for Minor District Engineer

In [None]:
filename = 'Minor_District_ActionItem'

out_cols = [
    #project information
    'AMT_ID', 'Minor', 'EFIS', 'EA', 'District', 
    'Data_Date',
    'Data_HourMinute',
    
    'Is Project ready to enter data in the Allocation Band?'
                    ]
df_out = df_Minor_raw_data[out_cols]



try: 
    df_out.to_csv(r'.\output\{}.csv'.format(filename), index=False)
    shutil.copy(r'.\output\{}.csv'.format(filename), r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename))
    file_export_log.write("Succeeded: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))
except Exception:
    file_export_log.write("Failed: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))

hyper_name = '{}.hyper'.format(filename)

try: 
    publish_datasource(df_out, hyper_name)
    file_export_log.write("Succeeded: {} \n".format(hyper_name))
except Exception:
    file_export_log.write("Failed: {} \n".format(hyper_name))

## export action items for Minor HQ Engineer

In [None]:
filename = 'Minor_HQ_ActionItem'

out_cols = [
    #project information
    'AMT_ID', 'Minor', 'EFIS', 'EA', 'District', 
    'Data_Date',
    'Data_HourMinute',
    
    'Is Project Project Ready for Review and Approval Date?',
    'Does Worplan Band needs Approval Removal?',
    'Does Allocation Band needs Approval Removal?',
    'HQ Minor Program Needs Review?',
                    ]
df_out = df_Minor_raw_data[out_cols]

try: 
    df_out.to_csv(r'.\output\{}.csv'.format(filename), index=False)
    shutil.copy(r'.\output\{}.csv'.format(filename), r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename))
    file_export_log.write("Succeeded: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))
except Exception:
    file_export_log.write("Failed: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))

hyper_name = '{}.hyper'.format(filename)

try: 
    publish_datasource(df_out, hyper_name)
    file_export_log.write("Succeeded: {} \n".format(hyper_name))
except Exception:
    file_export_log.write("Failed: {} \n".format(hyper_name))

<a id='Export_programming_summary'></a>

### Export Programming Summary

In [None]:
out_col = df_Programming_Summary_filtered.columns

filename = 'Minor_Programming_Summary'
# copy first, so the new columns are added to df_out and not to a view of the source frame
df_out = df_Programming_Summary_filtered[out_col].copy()
df_out['Data_Date'] = TARGETDATE
df_out['Data_HourMinute'] = DATA_HHMM

try: 
    df_out.to_csv(r'.\output\{}.csv'.format(filename), index=False)
    shutil.copy(r'.\output\{}.csv'.format(filename), r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename))
    file_export_log.write("Succeeded: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))
except Exception:
    file_export_log.write("Failed: {} \n".format(r'{}\{}.csv'.format(PROJECTBOOKCHECK_HTTPSEVER_FOLDER, filename)))


hyper_name = '{}.hyper'.format(filename)

try: 
    publish_datasource(df_out, hyper_name)
    file_export_log.write("Succeeded: {} \n".format(hyper_name))
except Exception:
    file_export_log.write("Failed: {} \n".format(hyper_name))

In [None]:
file_export_log.close()


<a id='FinalCleanUp'></a>
## Final Clean Up

In [None]:


#clean up tableau publishing log file

import os
import glob
# get the .log files in the current working folder (not recursive)
fileList = glob.glob('./*.log')
# Iterate over the list of file paths and remove each file.
for filePath in fileList:
    try:
        os.remove(filePath)
    except OSError as err:
        print("Error while deleting file {}: {}".format(filePath, err))
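
If the cleanup result should be reported or logged rather than only printed, the same step can be wrapped in a small function. The block below is a sketch under assumptions: the name `remove_log_files` is hypothetical, and the demo runs in a temporary folder with made-up file names so that no real files are touched.

```python
import tempfile
from pathlib import Path


def remove_log_files(folder):
    """Delete the *.log files directly inside `folder` (no recursion).

    Returns (removed, failed): two lists of file names.
    """
    removed, failed = [], []
    for path in sorted(Path(folder).glob('*.log')):
        try:
            path.unlink()
            removed.append(path.name)
        except OSError:
            failed.append(path.name)
    return removed, failed


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up log files and one file that must survive the cleanup.
    for name in ('publish_a.log', 'publish_b.log', 'keep.csv'):
        (Path(tmp) / name).write_text('x')
    removed, failed = remove_log_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, failed, remaining)
```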


In [None]:
end_time =  time.time()
elapsed = end_time - start_time
print('time elapsed : {} seconds'.format(elapsed))