# Structuring
***
## Learning Objectives
In this lesson you will learn about:
- Identifying your data types
- Converting data types
- Handling missing values
- Handling duplicate values
- Handling outliers
- Handling invalid values
- Handling inconsistent values
- Handling mixed values
- Handling null values
- Handling empty values
- Handling whitespace
- Handling special characters
- Handling case sensitivity
- Handling data types
- Handling data formats
- Handling data structures
- Handling data values
- Handling data ranges
- Handling data units
- Handling data precision
- Handling data accuracy
- Handling data integrity
- Handling data completeness
- Handling data consistency
- Handling data duplication
- Handling data redundancy
- Handling data normalization
- Handling data standardization
- Handling data validation

## Links

## Data Types


In [211]:
# Load /structuring/files/16100013-eng  into a df

import pandas as pd

df_csv = pd.read_csv('./files/16100013-eng/16100013.csv')
df_csv.head() #So far so good


Unnamed: 0,REF_DATE,GEO,DGUID,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


In [212]:
#What about the other file in the folder?

df_meta = pd.read_csv('./files/16100013-eng/16100013_MetaData.csv')
df_meta.head() #Looks interesting, but we'll have to do some more research to figure out what it means

Unnamed: 0,Cube Title,Product Id,CANSIM Id,URL,Cube Notes,Archive Status,Frequency,Start Reference Period,End Reference Period,Total number of dimensions
"Real manufacturing sales, orders, inventory owned and inventory to sales ratio, 2017 dollars, seasonally adjusted",16100013,377-0010,https://www150.statcan.gc.ca/t1/tbl1/en/tv.act...,1;2;3;4,CURRENT - a cube available to the public and t...,Monthly,2002-01-01,2023-05-01,3.0,
Dimension ID,Dimension name,Dimension Notes,Dimension Definitions,,,,,,,
1,Geography,,,,,,,,,
2,Principal statistics,,,,,,,,,
3,North American Industry Classification System ...,,,,,,,,,


In [213]:
# Back to the main CSV file. 
df_csv

Unnamed: 0,REF_DATE,GEO,DGUID,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.00,,,,0
1,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.00,,,,0
2,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.00,,,,0
3,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.00,,,,0
4,2002-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.00,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13359,2023-05,Canada,2016A000011124,Inventory of finished goods,Non-durable goods,Dollars,81,millions,6,v123263955,1.8.3,17861.00,,,,0
13360,2023-05,Canada,2016A000011124,Inventory of finished goods,Durable goods,Dollars,81,millions,6,v123263956,1.8.2,13552.00,,,,0
13361,2023-05,Canada,2016A000011124,Inventory to sales ratio,"Total, durable and non-durable goods",Dollars,81,units,0,v123263957,1.9.1,1.68,,,,2
13362,2023-05,Canada,2016A000011124,Inventory to sales ratio,Non-durable goods,Dollars,81,units,0,v123263958,1.9.3,1.36,,,,2


In [214]:
#Print out all the columns of the dataframe 
print(df_csv.columns)

# The code print(df.columns) should be generated automatically by the notebook

Index(['REF_DATE', 'GEO', 'DGUID', 'Principal statistics',
       'North American Industry Classification System (NAICS)', 'UOM',
       'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE',
       'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'],
      dtype='object')


# Create a data dictionary in markdown format for the following dataset

| Column | Data Type | Description |
|---|---|---|
| REF_DATE | Date | Reference date of the estimate |
| GEO | Text | Province or territory |
| DGUID | Text | Dissemination Geography Unique Identifier |
| Principal statistics | Text | Principal statistics |
|North American Industry Classification System (NAICS)| Text | North American Industry Classification System (NAICS) |
| UOM | Text | Unit of measure |
| UOM_ID | Number | Unit of measure identifier |
| SCALAR_FACTOR | Text | Scalar factor |
| SCALAR_ID | Number | Scalar factor identifier |
| VECTOR | Text | Vector identifier |
| COORDINATE | Text | Coordinate |
| VALUE | Number | Estimate value |
| STATUS | Text | Symbol indicating the quality of the estimate |
| SYMBOL | Text | Symbol indicating the type of release of the estimate |
| TERMINATED | Text | Symbol indicating data quality issues |
| DECIMALS | Number | Number of decimals |


Copilot did it all for us 😀 and gave us a good starting point. Let's make sure everything is correct and add some more information.

First, `REF_DATE`. The data type of `Date` looks correct. Currently, the column is a string, though. Let's convert it to a date.
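Before converting, it can help to confirm what type the column currently has. The following is a minimal sketch on a small, hypothetical `sample` frame (not the real file) that mimics the year-month strings seen above. Passing an explicit `format` is optional, but it documents the layout we expect.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the REF_DATE strings in the file
sample = pd.DataFrame({'REF_DATE': ['2002-01', '2002-02', '2002-03']})

# Before conversion the column holds plain strings
print(sample['REF_DATE'].dtype)

# An explicit format states that we expect year-month values
sample['REF_DATE'] = pd.to_datetime(sample['REF_DATE'], format='%Y-%m')

# After conversion the column has a datetime dtype
print(sample['REF_DATE'].dtype)
```

With no day in the input, pandas places each value on the first day of the month.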

In [215]:
# Change REF_DATE to a date field
df_csv['REF_DATE'] = pd.to_datetime(df_csv['REF_DATE'])
df_csv.head()


Unnamed: 0,REF_DATE,GEO,DGUID,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


Notice the change in the `REF_DATE` column values: they now include a day. This allows us to do calculations on the dates. For example, we can find the difference between two dates. The data set is also not an estimate of anything, so the description needs to change to:

| Column | Data Type | Description |
|---|---|---|
| REF_DATE | Date | Reference period (month) of the reported value |
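As a small illustration of the date calculations mentioned above, here is a sketch that takes the first and last reference periods shown in the outputs and measures the distance between them. The variable names are only for illustration.

```python
import pandas as pd

# First and last reference periods that appear in the file
start = pd.Timestamp('2002-01-01')
end = pd.Timestamp('2023-05-01')

# Subtracting two timestamps gives a Timedelta
span = end - start
print(span.days)

# Number of monthly periods covered, counting both endpoints
months = (end.year - start.year) * 12 + (end.month - start.month) + 1
print(months)
```

The same subtraction works on whole columns once they have a datetime type.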

A more descriptive name for the column might also make sense: 

In [216]:
# Rename REF_DATE to REFERENCE_PERIOD in  `df_csv`
df_csv.rename(columns={'REF_DATE':'REFERENCE_PERIOD'}, inplace=True) #Code automatically generated by VSC co-pilot
df_csv.head()


Unnamed: 0,REFERENCE_PERIOD,GEO,DGUID,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Canada,2016A000011124,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


It might be better to update the dictionary at the end. On to the next column - GEO. Let's understand the data in this column.

In [217]:
# Group all the different values in the GEO column and their counts
value_counts = df_csv['GEO'].value_counts()
print(value_counts)




Canada    13364
Name: GEO, dtype: int64


The first line in the result tells us that the only value in the `GEO` column is Canada. Here's an example of a dataframe with different values: 

In [218]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'C'],
    'Value': [10, 20, 30, 40, 50, 60, 70]
}

df = pd.DataFrame(data)

# Get the count of different values in the 'Category' column
value_counts = df['Category'].value_counts()

print(value_counts)


A    3
B    3
C    1
Name: Category, dtype: int64


At this stage, we have two options:
1. We can transform the column, for example by renaming it and making sure the casing is correct.
2. We can drop the column altogether.

Why drop it? The important point to consider is that we are not actually deleting the data from the original file but preparing *what we need* for our analysis. If we don't need it, we can drop it.
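A column that holds the same value in every row is a typical candidate for dropping. The sketch below, on a hypothetical `sample` frame, shows one possible way to find such columns in a single step instead of checking them one by one.

```python
import pandas as pd

# Hypothetical sample with one constant column
sample = pd.DataFrame({
    'GEO': ['Canada', 'Canada', 'Canada'],
    'VALUE': [54905.0, 26029.0, 6594.0],
})

# dropna=False counts NaN as a value, so a column that mixes NaN
# with a single other value is not reported as constant
distinct = sample.nunique(dropna=False)
constant_columns = distinct[distinct == 1].index.tolist()
print(constant_columns)

# drop() without inplace returns a new frame; `sample` itself is unchanged
trimmed = sample.drop(columns=constant_columns)
print(trimmed.columns.tolist())
```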


In [219]:
# Drop `GEO` column
df_csv.drop(columns=['GEO'], inplace=True)

df_csv.head()

Unnamed: 0,REFERENCE_PERIOD,DGUID,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,2016A000011124,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,2016A000011124,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,2016A000011124,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,2016A000011124,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,2016A000011124,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


`DGUID` means *Dissemination Geography Unique Identifier* according to [Statistics Canada](https://www.statcan.gc.ca/eng/developers/wds/user-guide#dguid).
I found the following information by doing a search for it: https://www12.statcan.gc.ca/census-recensement/2021/ref/dict/az/Definition-eng.cfm?ID=geo055

> A unique identifier assigned to each geographic area. The first two digits represent the province or territory, the next two digits represent the census division, the next four digits represent the census subdivision, and the last two digits represent the dissemination block. The dissemination block is the smallest geographic area for which population and dwelling counts are disseminated. The dissemination block is generally bounded by physical features such as roads and railways and by non-physical features such as city, town or municipal boundaries.


Let's see how many unique values there are for `DGUID`:


In [220]:
value_counts = df_csv['DGUID'].value_counts() # counts how many rows hold each distinct value
value_counts

2016A000011124    13364
Name: DGUID, dtype: int64

Only one value, so we can drop it.

In [221]:
df_csv.drop(columns=['DGUID'], inplace=True)

df_csv.head()

Unnamed: 0,REFERENCE_PERIOD,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


`Principal statistics` doesn't mean much as a column name. We can see that the values are some kind of classification of the items sold. A look at the distinct values and their counts might clarify.


In [222]:
value_counts = df_csv['Principal statistics'].value_counts()
value_counts


Sales of goods manufactured (shipments)    7967
New orders                                  771
Unfilled orders                             771
Inventories                                 771
Inventory of raw materials                  771
Inventory of goods or work in process       771
Inventory of finished goods                 771
Inventory to sales ratio                    771
Name: Principal statistics, dtype: int64

Now the column name makes sense: it holds different types of statistics, all lumped into one CSV file.
It is interesting that all the categories except `Sales of goods manufactured (shipments)` have the same number of rows. Right now, we are only interested in the `Sales of goods manufactured (shipments)` category, so we will filter out the other categories.


In [223]:
# Keep only the `Sales of goods manufactured (shipments)` rows of the `Principal statistics` column
# The mask is named `mask` so that it does not shadow Python's built-in filter()
mask = df_csv['Principal statistics'] == 'Sales of goods manufactured (shipments)'
df_csv = df_csv[mask]


df_csv.head()

Unnamed: 0,REFERENCE_PERIOD,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


In [224]:
# Remove the column "Principal statistics" after we're done with it
# df_csv.drop(columns=['Principal statistics'], inplace=True)

df_csv.head()

Unnamed: 0,REFERENCE_PERIOD,Principal statistics,North American Industry Classification System (NAICS),UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0


The data is already looking much cleaner. Time for the next column - `North American Industry Classification System (NAICS)`. It might be difficult to type the name out every time. Let's shorten it to `NAICS`.


In [225]:
# Change the column name `North American Industry Classification System (NAICS)` to `NAICS` in the `df_csv` dataframe.
df_csv.rename(columns={'North American Industry Classification System (NAICS)': 'NAICS'}, inplace=True)
df_csv.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_csv.rename(columns={'North American Industry Classification System (NAICS)': 'NAICS'}, inplace=True)


Unnamed: 0,REFERENCE_PERIOD,Principal statistics,NAICS,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",Dollars,81,millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Sales of goods manufactured (shipments),Non-durable goods,Dollars,81,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Sales of goods manufactured (shipments),Food manufacturing,Dollars,81,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,Dollars,81,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Sales of goods manufactured (shipments),Textile mills,Dollars,81,millions,6,v123263912,1.1.6,382.0,,,,0
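The warning printed with the rename above shows up because `df_csv` is now the result of a boolean filter, and pandas cannot tell whether an in-place change is meant for that subset or for the original frame. One common way to avoid it, sketched here on a hypothetical `sample` frame, is to take an explicit copy right after filtering and to assign the result of `rename` instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the file
sample = pd.DataFrame({
    'Principal statistics': ['Sales of goods manufactured (shipments)', 'New orders'],
    'North American Industry Classification System (NAICS)': ['Food manufacturing', 'Food manufacturing'],
})

# .copy() makes the filtered frame independent of `sample`
mask = sample['Principal statistics'] == 'Sales of goods manufactured (shipments)'
sales = sample[mask].copy()

# Assigning the result keeps the intent explicit and leaves `sample` untouched
sales = sales.rename(
    columns={'North American Industry Classification System (NAICS)': 'NAICS'}
)
print(sales.columns.tolist())
```

The results in this lesson are still correct with the warning in place. The copy mainly keeps the output clean and the intent clear.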


Let's see how clean the data is in the `NAICS` column.


In [226]:
value_counts = df_csv['NAICS'].value_counts() # counts how many rows hold each distinct value
value_counts


Total, durable and non-durable goods                           257
Primary metal manufacturing                                    257
Furniture and related product manufacturing                    257
Other transportation equipment manufacturing                   257
Ship and boat building                                         257
Railroad rolling stock manufacturing                           257
Aerospace product and parts manufacturing                      257
Motor vehicle parts manufacturing                              257
Motor vehicle body and trailer manufacturing                   257
Motor vehicle manufacturing                                    257
Transportation equipment manufacturing                         257
Electrical equipment, appliance and component manufacturing    257
Computer and electronic product manufacturing                  257
Machinery manufacturing                                        257
Fabricated metal product manufacturing                        

There are an equal number of rows for each category. Let's do a count of the periods to see if this makes sense.

In [227]:
value_counts = df_csv['REFERENCE_PERIOD'].value_counts() # counts how many rows hold each distinct value
value_counts

2002-01-01    31
2012-10-01    31
2015-08-01    31
2015-09-01    31
2015-10-01    31
              ..
2009-08-01    31
2009-09-01    31
2009-10-01    31
2009-11-01    31
2023-05-01    31
Name: REFERENCE_PERIOD, Length: 257, dtype: int64

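Yes. Every one of the 257 monthly periods has 31 rows, one per `NAICS` category (257 × 31 = 7,967 rows in total). In other words, the data is rolled up, or aggregated, by period: it is grouped by month and the values are summed.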
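Rather than reading the counts by eye, the same check can be written as code. This is a minimal sketch on a hypothetical `sample` frame with two periods and three categories. The category names are taken from the outputs above and the rest is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: 2 periods x 3 categories
sample = pd.DataFrame({
    'REFERENCE_PERIOD': pd.to_datetime(['2002-01-01'] * 3 + ['2002-02-01'] * 3),
    'NAICS': ['Food manufacturing', 'Textile mills', 'Machinery manufacturing'] * 2,
})

# Number of rows in each period
rows_per_period = sample.groupby('REFERENCE_PERIOD').size()

# True when every period has the same number of rows
balanced = rows_per_period.nunique() == 1

# True when no (period, category) pair appears more than once
no_duplicates = not sample.duplicated(subset=['REFERENCE_PERIOD', 'NAICS']).any()

print(balanced, no_duplicates)
```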
Are there any `UOM`s that are not Dollars? 

In [228]:
value_counts = df_csv['UOM'].value_counts() # counts how many rows hold each distinct value
value_counts

Dollars    7967
Name: UOM, dtype: int64

No. We can drop the `UOM` column as it is not needed for our analysis.

In [229]:
columns_to_drop = ['UOM' , 'UOM_ID'] # We might as well drop UOM_ID as well

# Drop the columns
df_csv.drop(columns_to_drop, axis=1, inplace=True)

df_csv.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_csv.drop(columns_to_drop, axis=1, inplace=True)


Unnamed: 0,REFERENCE_PERIOD,Principal statistics,NAICS,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2002-01-01,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",millions,6,v123263908,1.1.1,54905.0,,,,0
1,2002-01-01,Sales of goods manufactured (shipments),Non-durable goods,millions,6,v123263909,1.1.3,26029.0,,,,0
2,2002-01-01,Sales of goods manufactured (shipments),Food manufacturing,millions,6,v123263910,1.1.4,6594.0,,,,0
3,2002-01-01,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,millions,6,v123263911,1.1.5,1583.0,,,,0
4,2002-01-01,Sales of goods manufactured (shipments),Textile mills,millions,6,v123263912,1.1.6,382.0,,,,0


The next two columns are `SCALAR_FACTOR` and `SCALAR_ID`. First, let's see what the possible values are for `SCALAR_FACTOR`:


In [230]:
value_counts = df_csv['SCALAR_FACTOR'].value_counts() # counts how many rows hold each distinct value
value_counts

millions    7967
Name: SCALAR_FACTOR, dtype: int64

Only `millions`. It would make sense that the factor is used to multiply `VALUE` in the frame, that is:

$$ r = v \times s $$

where $v$ is `VALUE`, $s$ is the multiplier implied by `SCALAR_FACTOR` ($10^6$ for `millions`) and $r$ is the resulting value.


In calculations, we don't want the two columns to be separate. We need to add a new column called `CALCULATED_VALUE` according to the formula above. 
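The filtered data only contains `millions`, so a fixed multiplier is enough here. If the frame still held rows with other factors (the full file also has `units` for the inventory to sales ratio), a lookup table would be one way to apply the right multiplier per row. The sketch below assumes only those two labels. `SCALAR_MULTIPLIERS` and `sample` are hypothetical names.

```python
import pandas as pd

# Multipliers for the two SCALAR_FACTOR labels seen in this file (assumption:
# any other label would have to be added to this dictionary)
SCALAR_MULTIPLIERS = {'units': 1, 'millions': 1_000_000}

# Hypothetical sample with one row per label
sample = pd.DataFrame({
    'VALUE': [54905.0, 1.68],
    'SCALAR_FACTOR': ['millions', 'units'],
})

# map() returns NaN for labels missing from the dictionary,
# which makes unexpected factors easy to spot
multiplier = sample['SCALAR_FACTOR'].map(SCALAR_MULTIPLIERS)
sample['CALCULATED_VALUE'] = sample['VALUE'] * multiplier
print(sample)
```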


In [232]:
# Add a new column to a dataframe

df_csv['CALCULATED_VALUE'] = df_csv['VALUE'] * 1E6

df_csv.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_csv['CALCULATED_VALUE'] = df_csv['VALUE'] * 1E6


Unnamed: 0,REFERENCE_PERIOD,Principal statistics,NAICS,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS,CALCULATED_VALUE
0,2002-01-01,Sales of goods manufactured (shipments),"Total, durable and non-durable goods",millions,6,v123263908,1.1.1,54905.0,,,,0,54905000000.0
1,2002-01-01,Sales of goods manufactured (shipments),Non-durable goods,millions,6,v123263909,1.1.3,26029.0,,,,0,26029000000.0
2,2002-01-01,Sales of goods manufactured (shipments),Food manufacturing,millions,6,v123263910,1.1.4,6594.0,,,,0,6594000000.0
3,2002-01-01,Sales of goods manufactured (shipments),Beverage and tobacco product manufacturing,millions,6,v123263911,1.1.5,1583.0,,,,0,1583000000.0
4,2002-01-01,Sales of goods manufactured (shipments),Textile mills,millions,6,v123263912,1.1.6,382.0,,,,0,382000000.0


In [233]:
# Let's remove the columns we no longer need. 

columns_to_drop = [ 'SCALAR_FACTOR', 'SCALAR_ID' , 'VALUE', 'Principal statistics'] # Principal statistics has only one value left after filtering, so it can go too

# Drop the columns
df_csv.drop(columns_to_drop, axis=1, inplace=True)

df_csv.head()


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_csv.drop(columns_to_drop, axis=1, inplace=True)


Unnamed: 0,REFERENCE_PERIOD,NAICS,VECTOR,COORDINATE,STATUS,SYMBOL,TERMINATED,DECIMALS,CALCULATED_VALUE
0,2002-01-01,"Total, durable and non-durable goods",v123263908,1.1.1,,,,0,54905000000.0
1,2002-01-01,Non-durable goods,v123263909,1.1.3,,,,0,26029000000.0
2,2002-01-01,Food manufacturing,v123263910,1.1.4,,,,0,6594000000.0
3,2002-01-01,Beverage and tobacco product manufacturing,v123263911,1.1.5,,,,0,1583000000.0
4,2002-01-01,Textile mills,v123263912,1.1.6,,,,0,382000000.0


We are almost done. I am not sure what `VECTOR` is, but it might be a unique value for each row from the looks of it. Let's check that out.


In [234]:
# Get the number of `unique` values in each column
df_csv.nunique()

REFERENCE_PERIOD     257
NAICS                 31
VECTOR                31
COORDINATE            31
STATUS                 1
SYMBOL                 0
TERMINATED             0
DECIMALS               1
CALCULATED_VALUE    4368
dtype: int64

It looks like `VECTOR`, `COORDINATE` and `NAICS` match. Let's see how many different combinations of the three there are.


In [235]:
# Let's take a random `NAICS` category and see if it looks the same over all the months


In [236]:
# Named `mask` so that it does not shadow Python's built-in filter()
mask = df_csv['NAICS'] == 'Ship and boat building'
df_csv[mask]


Unnamed: 0,REFERENCE_PERIOD,NAICS,VECTOR,COORDINATE,STATUS,SYMBOL,TERMINATED,DECIMALS,CALCULATED_VALUE
27,2002-01-01,Ship and boat building,v123263935,1.1.28,,,,0,168000000.0
79,2002-02-01,Ship and boat building,v123263935,1.1.28,,,,0,248000000.0
131,2002-03-01,Ship and boat building,v123263935,1.1.28,,,,0,168000000.0
183,2002-04-01,Ship and boat building,v123263935,1.1.28,,,,0,214000000.0
235,2002-05-01,Ship and boat building,v123263935,1.1.28,,,,0,211000000.0
...,...,...,...,...,...,...,...,...,...
13131,2023-01-01,Ship and boat building,v123263935,1.1.28,,,,0,193000000.0
13183,2023-02-01,Ship and boat building,v123263935,1.1.28,,,,0,200000000.0
13235,2023-03-01,Ship and boat building,v123263935,1.1.28,,,,0,158000000.0
13287,2023-04-01,Ship and boat building,v123263935,1.1.28,,,,0,204000000.0


There are better ways to check if the values are the same, but for now, this will do. We can drop `VECTOR` and `COORDINATE` too.
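One of those better ways could look like the sketch below: count the distinct `VECTOR` and `COORDINATE` values inside each `NAICS` category, and count the distinct combinations of the three columns. It runs on a hypothetical `sample` frame built from values shown in the outputs above.

```python
import pandas as pd

# Hypothetical sample built from values that appear in the outputs above
sample = pd.DataFrame({
    'NAICS': ['Ship and boat building', 'Ship and boat building',
              'Textile mills', 'Textile mills'],
    'VECTOR': ['v123263935', 'v123263935', 'v123263912', 'v123263912'],
    'COORDINATE': ['1.1.28', '1.1.28', '1.1.6', '1.1.6'],
})

# Distinct VECTOR and COORDINATE values within each NAICS category
codes_per_category = sample.groupby('NAICS')[['VECTOR', 'COORDINATE']].nunique()

# True when every category maps to exactly one VECTOR and one COORDINATE
consistent = bool((codes_per_category == 1).all().all())

# Distinct combinations of the three columns
combos = sample[['NAICS', 'VECTOR', 'COORDINATE']].drop_duplicates()

print(consistent, len(combos))
```

If the number of combinations equals the number of categories, the two code columns carry no information beyond `NAICS`.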

In [237]:

columns_to_drop = [ 'VECTOR' , 'COORDINATE'] 
# Drop the columns
df_csv.drop(columns_to_drop, axis=1, inplace=True)

df_csv.head()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_csv.drop(columns_to_drop, axis=1, inplace=True)


Unnamed: 0,REFERENCE_PERIOD,NAICS,STATUS,SYMBOL,TERMINATED,DECIMALS,CALCULATED_VALUE
0,2002-01-01,"Total, durable and non-durable goods",,,,0,54905000000.0
1,2002-01-01,Non-durable goods,,,,0,26029000000.0
2,2002-01-01,Food manufacturing,,,,0,6594000000.0
3,2002-01-01,Beverage and tobacco product manufacturing,,,,0,1583000000.0
4,2002-01-01,Textile mills,,,,0,382000000.0


In [238]:
# Get the number of `unique` values in each column
df_csv.nunique()

REFERENCE_PERIOD     257
NAICS                 31
STATUS                 1
SYMBOL                 0
TERMINATED             0
DECIMALS               1
CALCULATED_VALUE    4368
dtype: int64

In [242]:
# Interestingly, STATUS has 1 unique value (nunique() does not count missing values):

df_csv['STATUS'].unique() # returns the unique values in a column


array([nan, 'x'], dtype=object)

I am sure `x` must mean something in the `STATUS` column, but we can't do anything with it at the moment. Let's leave it for now. `SYMBOL` and `TERMINATED` contain no values at all (`nunique()` reported 0 for both). The last column to look at is `DECIMALS`.
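The difference between the `nunique()` count and the `unique()` result comes from how missing values are treated. A small sketch on a hypothetical `sample` frame shows both views side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: STATUS is mostly empty, SYMBOL is completely empty
sample = pd.DataFrame({
    'STATUS': [np.nan, 'x', np.nan],
    'SYMBOL': [np.nan, np.nan, np.nan],
})

# By default nunique() ignores missing values
print(sample.nunique())

# With dropna=False, NaN is counted as a value of its own
print(sample.nunique(dropna=False))

# Number of missing values per column
print(sample.isna().sum())
```

A count of 0 from `nunique()` therefore means the column is entirely empty, not that it has no rows.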


In [243]:
df_csv['DECIMALS'].unique()

array([0])

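It only has the value of 0 in it. Let's drop it, together with `SYMBOL` and `TERMINATED`, which have no values at all (0 unique values in the `nunique()` output above).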

In [245]:

columns_to_drop = [ 'SYMBOL' , 'TERMINATED' , 'DECIMALS'] 
# Drop the columns
df_csv.drop(columns_to_drop, axis=1, inplace=True)

df_csv.head()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_csv.drop(columns_to_drop, axis=1, inplace=True)


Unnamed: 0,REFERENCE_PERIOD,NAICS,STATUS,CALCULATED_VALUE
0,2002-01-01,"Total, durable and non-durable goods",,54905000000.0
1,2002-01-01,Non-durable goods,,26029000000.0
2,2002-01-01,Food manufacturing,,6594000000.0
3,2002-01-01,Beverage and tobacco product manufacturing,,1583000000.0
4,2002-01-01,Textile mills,,382000000.0
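Since the plan was to update the data dictionary at the end, here is one possible way to generate its skeleton from the cleaned frame. This is only a sketch on a hypothetical `sample` frame with the four remaining columns. The descriptions are placeholders written by hand from what we learned in this lesson and should be reviewed.

```python
import pandas as pd

# Hypothetical sample with the columns that remain after cleaning
sample = pd.DataFrame({
    'REFERENCE_PERIOD': pd.to_datetime(['2002-01-01', '2002-01-01']),
    'NAICS': ['Food manufacturing', 'Textile mills'],
    'STATUS': [None, None],
    'CALCULATED_VALUE': [6594000000.0, 382000000.0],
})

# Placeholder descriptions (assumptions based on this lesson)
descriptions = {
    'REFERENCE_PERIOD': 'Reference period (month)',
    'NAICS': 'Industry category',
    'STATUS': 'Status symbol, meaning still to be researched',
    'CALCULATED_VALUE': 'VALUE multiplied by the scalar factor, in dollars',
}

# Build the markdown table line by line from the frame's dtypes
lines = ['| Column | Data Type | Description |', '|---|---|---|']
for column, dtype in sample.dtypes.items():
    lines.append(f'| {column} | {dtype} | {descriptions.get(column, "")} |')

print('\n'.join(lines))
```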
