# Data Cleaning project

In [227]:
# Cleaning will be performed mainly with Pandas.
import pandas as pd

## Data set
The data set is an extract from the archive of the American `itdashboard.gov` site. It contains information about government IT projects: their dates, costs, and stakeholders (i.e. the responsible departments).

In [226]:
df = pd.read_csv('dataset.csv', index_col=False)

## Cleaning columns one by one
1. `Unique Investment Identifier`:
    1. Not unique, despite the name. It shouldn't be the first column, because that position makes it look like a primary key.
    2. **Identifier** could be shortened to **ID**.

In [207]:
# Show the raw data, then rename the column (later cells rely on the new name)
display(df)
df.rename(columns={'Unique Investment Identifier': 'Investment ID'}, inplace=True)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unique Investment Identifier,Business Case ID,Agency Code,Agency Name,Investment Title,Project ID,Agency Project ID,Project Name,Project Description,Start Date,Completion Date (B1),Planned Project Completion Date (B2),Projected/Actual Project Completion Date (B2),Planned Cost ($ M),Projected/Actual Cost ($ M),Updated Date,Updated Time
0,0,0,0,005-000001723,212.0,5.0,Department of Agriculture,AMS Infrastructure WAN and DMZ (AMSWAN),656.0,,Operations,Annual Agency Operations.,01/10/2011,2012-30-09,,,15.297000,15.297000,22/09/2011,10:22:25
1,1,1,1,005-000001723,212.0,5.0,Department of Agriculture,AMS Infrastructure WAN and DMZ (AMSWAN),657.0,,Virtualization,Program Areas will migrate their data over to ...,01/10/2011,31/03/2012,31/03/2012,31/03/2012,0.179000,0.179000,30/11/2011,06:09:57
2,2,2,2,005-000001723,212.0,5.0,Department of Agriculture,AMS Infrastructure WAN and DMZ (AMSWAN),658.0,,Refresh,Programs Areas will replace 1/3 of their compu...,01/04/2012,30/09/2012,,,1.460000,1.460000,28/10/2011,05:50:19
3,3,3,3,005-000001822,213.0,5.0,Department of Agriculture,APHIS Electronic Permits System (ePermits),661.0,,ePermits O&M FY11 Part 1.,"Production Support including Analysis, Softwar...",01/04/2011,30/09/2011,30/09/2011,30/09/2011,1.820500,1.456400,31/05/2012,14:20:11
4,4,4,4,005-000001822,213.0,5.0,Department of Agriculture,APHIS Electronic Permits System (ePermits),662.0,,ePermits O&M FY12 Part 1,"Production Support including Analysis, Softwar...",01/04/2012,30/09/2012,30/09/2012,30/09/2012,1.713000,1.713000,31/05/2012,14:20:11
5,5,5,5,005-000001822,213.0,5.0,Department of Agriculture,APHIS Electronic Permits System (ePermits),663.0,,ePermits O&M FY12 Part 2,"Production Support including Analysis, Softwar...",01/10/2012,31/03/2013,,,1.450000,1.450000,31/05/2012,14:20:11
6,6,6,6,005-000001822,213.0,5.0,Department of Agriculture,APHIS Electronic Permits System (ePermits),664.0,,BRS Field Reports Phase 2,UAT Bug Fixes.,31/12/2010,31/03/2011,31/03/2011,31/03/2011,0.010000,0.010000,22/09/2011,10:34:24
7,7,7,7,005-000000038,214.0,5.0,Department of Agriculture,APHIS Enterprise Infrastructure,665.0,,Enterprise Management,Management of APHIS Enterprise IT Systems.,01/10/2010,30/09/2011,30/09/2011,20/09/2011,46.960000,46.960000,22/09/2011,10:34:32
8,8,8,8,005-000000038,214.0,5.0,Department of Agriculture,APHIS Enterprise Infrastructure,666.0,,VoIP Riverdale,Installation of voice over IP phones in Riverd...,15/08/2011,30/04/2012,30/04/2012,30/04/2012,0.662000,0.662000,31/05/2012,14:20:13
9,9,9,9,005-000000038,214.0,6.0,Department of Agriculture,APHIS Enterprise Infrastructure,667.0,,Enterprise Security,Maintain Enterprise Security integrity.,06/09/2011,31/12/2012,,,0.572000,0.572000,01/08/2012,13:21:08


2. `Business Case ID`
    1. It is a float, but should be an int.
    2. It corresponds 1:1 with `Investment ID`, so the pair should be kept in a separate lookup table.
    3. There are some `NaN` values. These rows hold totals and should be removed: they add no value and only complicate the analysis.

In [144]:
# Drop rows with totals
indices_to_drop = df.loc[df['Business Case ID'].isna()].index
df.drop(indices_to_drop, inplace=True)

# Change type to int
df['Business Case ID'] = df['Business Case ID'].astype(int)

# Save into another DF
business_case = pd.DataFrame(data=df[['Investment ID', 'Business Case ID']])
business_case.drop_duplicates(inplace=True)
business_case.set_index(keys='Investment ID', inplace=True)

# Export into a file
business_case.to_csv('business_case.csv')

# Drop extracted columns
df.drop(labels=['Investment ID', 'Business Case ID'], axis=1, inplace=True)
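
Before a column pair is moved into a lookup table, it can help to confirm that the 1:1 claim really holds. The following is a minimal sketch, not part of the original cleaning run: the helper `is_one_to_one` is hypothetical, and the sample rows are invented for illustration.

```python
import pandas as pd

def is_one_to_one(frame, left, right):
    """Return True if each `left` value pairs with exactly one `right` value and vice versa."""
    pairs = frame[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Illustrative rows only, not taken from the real data set
sample = pd.DataFrame({
    'Investment ID': ['005-000001723', '005-000001723', '005-000001822'],
    'Business Case ID': [212, 212, 213],
})
# Same IDs, but one investment now points at two business cases
broken = sample.assign(**{'Business Case ID': [212, 999, 213]})

consistent = is_one_to_one(sample, 'Investment ID', 'Business Case ID')
inconsistent = is_one_to_one(broken, 'Investment ID', 'Business Case ID')
print(consistent, inconsistent)
```

If the check fails, the lookup table would contain the same key twice, so it is worth running before `set_index`.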

3. `Agency Code`
    1. **Code** could be **ID**.
    2. It is a float, but should be an int.

In [118]:
# Rename the column to `Agency ID`:
df.rename(columns={'Agency Code':'Agency ID'}, inplace=True)

# Change type to int
df['Agency ID'] = df['Agency ID'].astype(int)
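
The float dtype of these ID columns is a side effect of missing values: a plain NumPy integer column cannot hold `NaN`, so Pandas falls back to `float64`. The sketch below, on made-up values, contrasts the approach used here (drop the missing rows, then cast) with Pandas' nullable `Int64` dtype, which is one possible alternative when the rows have to stay.

```python
import pandas as pd

# Made-up values; the None becomes NaN and forces a float column
codes = pd.Series([5.0, 5.0, None, 6.0])
print(codes.dtype)

# Option 1: drop the missing rows, then cast to a plain integer
plain = codes.dropna().astype(int)

# Option 2: keep the missing rows by casting to the nullable integer dtype
nullable = codes.astype('Int64')
print(plain.tolist(), nullable.dtype, nullable.isna().sum())
```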

4. `Agency Name`
    1. It should correspond 1:1 with `Agency ID` (formerly `Agency Code`), so it should be kept in a separate lookup table. In the whole DataFrame there are only two rows where `Agency ID` and `Agency Name` disagree. Comparing the number of unique `Agency ID` values, unique `Agency Name` values, and unique ID+name pairs shows that there is **one more unique pair** than expected. Checking the pairs reveals that the value `6` is the culprit: two rows carry `Agency ID` `6` although their `Agency Name` belongs to agency `5`. The ID is the wrong part (not the name, as the other rows make clear), so it should be corrected to `5`.
    2. Names should be extracted to a separate table.
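
The comparison of unique counts can be sketched as follows. The frame is a small stand-in with invented rows (only the column names come from the data set): it counts unique IDs, unique names and unique pairs, then lists the names that appear with more than one ID.

```python
import pandas as pd

# Stand-in data: the third row carries the wrong Agency ID (6 instead of 5)
sample = pd.DataFrame({
    'Agency ID': [5, 5, 6, 6, 7],
    'Agency Name': ['Department of Agriculture', 'Department of Agriculture',
                    'Department of Agriculture', 'Department of Commerce',
                    'Department of Defense'],
})

n_ids = sample['Agency ID'].nunique()
n_names = sample['Agency Name'].nunique()
n_pairs = len(sample[['Agency ID', 'Agency Name']].drop_duplicates())
print(n_ids, n_names, n_pairs)

# Names that appear with more than one ID point to the inconsistent rows
ids_per_name = sample.groupby('Agency Name')['Agency ID'].nunique()
suspects = ids_per_name[ids_per_name > 1].index.tolist()
print(suspects)
```

When the pair count exceeds both single-column counts, at least one ID or name is attached to two partners.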

In [133]:
# Select the rows to be changed: ID 6 paired with a name other than agency 6's own
wrong_id = (df['Agency ID'] == 6) & (df['Agency Name'] != 'Department of Commerce')

# Assign the correct ID to all selected rows at once
df.loc[wrong_id, 'Agency ID'] = 5
    
# Save into another DF
col_agencies = ['Agency ID', 'Agency Name']
agencies = pd.DataFrame(data=df[col_agencies])
agencies.drop_duplicates(inplace=True)

# Export into a file (the row index carries no meaning here, so it is left out)
agencies.to_csv('agencies.csv', index=False)

# Drop the extracted column; `Agency ID` stays as the key into `agencies`
df.drop(labels=['Agency Name'], axis=1, inplace=True)

5. `Investment Title`
    1. There are some minor formatting problems (e.g. `&amp;` instead of `&`). They won't be touched, because these are long business names that could be defined and described elsewhere with whatever formatting is preferred.
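
Although the titles are left as they are in this project, the HTML entities could be decoded with the standard library if that were ever wanted. This is a sketch on invented titles, assuming the only damage is entity encoding:

```python
import html
import pandas as pd

# Invented titles; the first contains an HTML entity, the last is missing
titles = pd.Series(['Research &amp; Development Portal', 'Plain Title', None])

# na_action='ignore' skips missing values instead of passing them to html.unescape
cleaned = titles.map(html.unescape, na_action='ignore')
print(cleaned.tolist())
```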

In [204]:
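# Count unique ID/title pairs. This has to run before `Investment ID` is dropped in step 2;
# run afterwards, it raises the KeyError shown below.
len(df[['Investment ID', 'Investment Title']].drop_duplicates())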

KeyError: "['Investment ID'] not in index"

6. `Project ID`
    1. It's a float, but should be an int.
    2. It's unique, so it should be the primary key and the first column in the data set. This also fits the business context of the whole data set.

In [135]:
# Change type to int
df['Project ID'] = df['Project ID'].astype(int)

# Set index
df.set_index(keys='Project ID', inplace=True)
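
The uniqueness claim can be checked before the column is promoted to the index. The sketch below uses invented rows; `verify_integrity=True` is an existing `set_index` option that makes Pandas raise a `ValueError` when the new index contains duplicates.

```python
import pandas as pd

sample = pd.DataFrame({'Project ID': [656, 657, 658], 'Agency ID': [5, 5, 5]})

# A primary key should be unique and free of missing values
key_is_valid = bool(sample['Project ID'].is_unique and sample['Project ID'].notna().all())

# verify_integrity=True raises ValueError if the new index has duplicates
indexed = sample.set_index('Project ID', verify_integrity=True)

# The same call on a frame with a repeated ID is rejected
duplicated = pd.DataFrame({'Project ID': [656, 656], 'Agency ID': [5, 5]})
try:
    duplicated.set_index('Project ID', verify_integrity=True)
    rejected = False
except ValueError:
    rejected = True

print(key_is_valid, indexed.index.tolist(), rejected)
```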

7. `Agency Project ID`
    1. Almost half of the values are NaN (and Pandas reads them correctly as missing). Since these IDs are for the agencies' internal use, they won't be touched.

8. `Project Name`
    1. It corresponds with `Project ID`, so it should be kept in a separate `project_descriptions` lookup table together with `Project Description` (see next point).

9. `Project Description`
    1. It corresponds with `Project ID`, so it should be kept in the same `project_descriptions` lookup table as `Project Name`.

In [145]:
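# Save into another DF. The index is `Project ID`, which is unique, so every row is kept:
# drop_duplicates() ignores the index and would discard projects sharing a name and description.
project_descriptions = df[['Project Name', 'Project Description']].copy()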

# Export into a file
project_descriptions.to_csv('project_descriptions.csv')

# Drop extracted columns
df.drop(labels=['Project Name', 'Project Description'], axis=1, inplace=True)

### Date columns
10. `Start Date`, 11. `Completion Date (B1)`, 12. `Planned Project Completion Date (B2)`, 13. `Projected/Actual Project Completion Date (B2)`
    1. They are objects (Pandas' string type), but should be dates.
    2. All dates should be extracted to a separate table (not obligatory, but it's good not to mix business contexts).
    3. Multiple formats occur (`2012-30-09`, `31/03/2012`).
    4. The column names are too long.
    5. With `Completion Date (B1)`, `Planned Project Completion Date (B2)` and `Projected/Actual Project Completion Date (B2)` all present, the purpose of each column is not obvious. However, the full business context is not known, so they won't be deleted or merged.
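
The mixed formats matter because Pandas guesses when no format is given, and for an ambiguous string the guess is month-first. The short sketch below uses date strings of the two shapes seen in the data and shows why an explicit `format` is the safer choice for each.

```python
import pandas as pd

# Without a format, an ambiguous string is read month-first (10 January)
guessed = pd.to_datetime('01/10/2011')

# With an explicit format, the same string is read day-first (1 October)
explicit = pd.to_datetime('01/10/2011', format='%d/%m/%Y')

# The unusual year-day-month variant needs its own format
odd = pd.to_datetime('2012-30-09', format='%Y-%d-%m')

print(guessed.date(), explicit.date(), odd.date())
```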

In [194]:
import re

# Rename the columns
col_renames = {'Completion Date (B1)': 'Completion',
               'Planned Project Completion Date (B2)': 'Planned Completion',
               'Projected/Actual Project Completion Date (B2)': 'Projected/Actual Completion'}
df.rename(columns=col_renames, inplace=True)

# Define a parser for the two different formats
pattern = re.compile('[0-9]{4}-[0-9]{2}-[0-9]{2}')

def date_parser(value):
    """Parse one date given as YYYY-DD-MM or DD/MM/YYYY; missing values become NaT."""
    if pd.isna(value):
        return pd.NaT
    if pattern.fullmatch(str(value)):
        return pd.to_datetime(value, format='%Y-%d-%m')
    return pd.to_datetime(value, format='%d/%m/%Y')

# Parse dates value by value (the names must match the renamed columns)
col_dates = ['Start Date', 'Completion', 'Planned Completion', 'Projected/Actual Completion']
for col in col_dates:
    df[col] = pd.to_datetime(df[col].map(date_parser))

# Extract dates
dates = df[col_dates].copy()

# Export into a file
dates.to_csv('dates.csv')

# Drop extracted columns
df.drop(labels=col_dates, axis=1, inplace=True)

### Cost columns
14. `Planned Cost ($ M)`, 15. `Projected/Actual Cost ($ M)`
    1. The columns should be extracted to a separate table (not obligatory, but it's good not to mix business contexts).

In [189]:
col_costs = ['Planned Cost ($ M)', 'Projected/Actual Cost ($ M)']

# Extract costs
costs = pd.DataFrame(data=df[col_costs])

# Export into a file
costs.to_csv('costs.csv')

# Drop extracted columns
df.drop(labels=col_costs, axis=1, inplace=True)
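
Because every extracted table keeps `Project ID` as its index, the pieces can be put back together whenever a wide view is needed. The following sketch uses two tiny stand-in tables (values are illustrative) and an index-on-index join:

```python
import pandas as pd

# Stand-in tables, both indexed by Project ID (illustrative values)
costs = pd.DataFrame(
    {'Planned Cost ($ M)': [15.297, 0.179]},
    index=pd.Index([656, 657], name='Project ID'),
)
dates = pd.DataFrame(
    {'Start Date': pd.to_datetime(['2011-10-01', '2011-10-01'])},
    index=pd.Index([656, 657], name='Project ID'),
)

# Joining on the shared index restores one wide row per project
combined = costs.join(dates, how='inner')
print(combined)
```

An inner join keeps only the projects present in both tables; `how='left'` would keep every row of `costs` instead.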

### Update columns
16. `Updated Date`, 17. `Updated Time`
    1. They should be merged into a single timestamp column.

In [197]:
df['Updated'] = df['Updated Date'] + " " + df['Updated Time']
df['Updated'] = pd.to_datetime(arg=df['Updated'], format='%d/%m/%Y %H:%M:%S')