<a href="https://colab.research.google.com/github/randy-ar/gcolab/blob/main/Preprocessing_data_EPA_Chemical_Data_Reporting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import gdown
import matplotlib.pyplot as plt

# Goals

### Risk analysis of the environmental and human impact caused by chemicals

### Analysis outputs:
1. Choropleth Map: shows the states with the highest chemical usage and the number of workers exposed to chemicals in each city
2. Bar Chart: shows the 10 highest-volume chemicals that are not recycled and are listed under TSCA (Toxic Substances Control Act)
3. Bar Chart: shows the 10 sectors with the highest chemical usage

### Benefits of the analysis:
1. The Choropleth Map informs preventive action so that chemicals do not seriously affect the environment and workers
2. The second output highlights toxic chemicals that are not recycled and need special attention
3. The third output highlights the sectors that use the most chemicals and need special attention

# Read Data CSV

In [2]:
table_consumer_and_use_information = "1dGgcbPnVmOeAP03MMG_jf7FwfQEoTa2P"
table_industrial_processing_and_use_information = "1uS0ucuC24KjENVmqssHiQeL05n4j-LNd"
table_nationally_aggregated_production_volumes = "1bTz5YRHcw--kzIGEbUQSC8taqCrAxGaG"

download_url = "https://docs.google.com/uc?export=download&id="

In [3]:
gdown.download(download_url+table_consumer_and_use_information, 'table_consumer_and_use_information.csv', quiet=False)
gdown.download(download_url+table_industrial_processing_and_use_information, 'table_industrial_processing_and_use_information.csv', quiet=False)
gdown.download(download_url+table_nationally_aggregated_production_volumes, 'table_nationally_aggregated_production_volumes.csv', quiet=False)

Downloading...
From: https://docs.google.com/uc?export=download&id=1dGgcbPnVmOeAP03MMG_jf7FwfQEoTa2P
To: /content/table_consumer_and_use_information.csv
100%|██████████| 37.0M/37.0M [00:00<00:00, 141MB/s]
Downloading...
From: https://docs.google.com/uc?export=download&id=1uS0ucuC24KjENVmqssHiQeL05n4j-LNd
To: /content/table_industrial_processing_and_use_information.csv
100%|██████████| 48.2M/48.2M [00:00<00:00, 72.9MB/s]
Downloading...
From: https://docs.google.com/uc?export=download&id=1bTz5YRHcw--kzIGEbUQSC8taqCrAxGaG
To: /content/table_nationally_aggregated_production_volumes.csv
100%|██████████| 1.32M/1.32M [00:00<00:00, 103MB/s]


'table_nationally_aggregated_production_volumes.csv'

In [4]:
df_consumer = pd.read_csv('table_consumer_and_use_information.csv')
df_industrial = pd.read_csv('table_industrial_processing_and_use_information.csv')
df_nationally = pd.read_csv('table_nationally_aggregated_production_volumes.csv')



In [5]:
df_consumer.count()

Unnamed: 0,0
CHEMICAL REPORT ID,56000
CHEMICAL NAME,56000
CHEMICAL ID,56000
CHEMICAL ID W/O DASHES,56000
CHEMICAL ID TYPE,56000
...,...
C / C PV PCT,30426
C / C MAX CONC CODE,30426
C / C MAXIMUM CONCENTRATION,30426
COMM WORKERS CODE,30425


# Preprocessing Data

## CONSUMER DATA

### Selecting the columns needed for the analysis

In [6]:
selected_column = [
    'CHEMICAL ID',
    'CHEMICAL NAME',
    'CHEMICAL ID TYPE',
    'RECYCLED',
    '2019 DOMESTIC PV',
    '2019 IMPORT PV',
    '2019 PV',
    '2018 PV',
    '2017 PV',
    '2016 PV',
    'SITE LATITUDE',
    'SITE LONGITUDE',
    'SITE CITY',
    'SITE COUNTY / PARISH',
    'SITE STATE',
    'SITE POSTAL CODE',
    'SITE NAICS CODE 1',
    'SITE NAICS ACTIVITY 1',
    'SITE NAICS CODE 2',
    'SITE NAICS ACTIVITY 2',
    'SITE NAICS CODE 3',
    'SITE NAICS ACTIVITY 3',
    'WORKERS CODE',
    'WORKERS'
]

In [7]:
df_consumer = df_consumer[selected_column]
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,32
CHEMICAL NAME,32
CHEMICAL ID TYPE,32
RECYCLED,3181
2019 DOMESTIC PV,2101
2019 IMPORT PV,2106
2019 PV,2101
2018 PV,108
2017 PV,109
2016 PV,109


### Counting the rows that contain confidential (CBI) values

In [8]:
cbi_rows = df_consumer[df_consumer.apply(lambda x: x.astype(str).str.contains('CBI', na=False).any(), axis=1)]
cbi_rows.count()

Unnamed: 0,0
CHEMICAL ID,31551
CHEMICAL NAME,31551
CHEMICAL ID TYPE,31551
RECYCLED,30329
2019 DOMESTIC PV,30720
2019 IMPORT PV,30720
2019 PV,30720
2018 PV,31507
2017 PV,31506
2016 PV,31506
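
The filter above calls a lambda once per row (`axis=1`), which can be slow on tens of thousands of rows. A minimal sketch of a column-wise alternative follows, assuming the same substring test is wanted. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for df_consumer (hypothetical values).
toy = pd.DataFrame({
    "CHEMICAL NAME": ["Sucrose", "CBI", "Ethanol"],
    "2019 PV": ["100", "250", "CBI"],
    "WORKERS": ["< 10", None, "25 – 49"],
})

# Test each column for the substring, then combine the results per row.
has_cbi = toy.apply(
    lambda col: col.astype(str).str.contains("CBI", na=False)
).any(axis=1)

print(has_cbi.sum())      # rows with at least one CBI cell
public = toy[~has_cbi]    # rows with no CBI cell
print(len(public))
```

Like the notebook's version, this is a substring match, so it would also flag a cell that merely contains "CBI" inside a longer string.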


### Removing confidential rows, which are hard to analyze because their values are withheld

In [9]:
# prompt: delete row with CBI Values

# delete rows where any column contains 'CBI'
df_consumer = df_consumer[~df_consumer.apply(lambda x: x.astype(str).str.contains('CBI', na=False).any(), axis=1)]

# verify that CBI rows are removed
cbi_rows_after_removal = df_consumer[df_consumer.apply(lambda x: x.astype(str).str.contains('CBI', na=False).any(), axis=1)]
print("Number of rows containing 'CBI' after removal:", len(cbi_rows_after_removal))

# show the first few rows of the cleaned dataframe
df_consumer.head()

Number of rows containing 'CBI' after removal: 0


Unnamed: 0,CHEMICAL ID,CHEMICAL NAME,CHEMICAL ID TYPE,RECYCLED,2019 DOMESTIC PV,2019 IMPORT PV,2019 PV,2018 PV,2017 PV,2016 PV,...,SITE STATE,SITE POSTAL CODE,SITE NAICS CODE 1,SITE NAICS ACTIVITY 1,SITE NAICS CODE 2,SITE NAICS ACTIVITY 2,SITE NAICS CODE 3,SITE NAICS ACTIVITY 3,WORKERS CODE,WORKERS
3,18849,"(Polyisobutenyl)dihydro-2,5-furandione esters ...",Accession Number,No,0,26564,26564,27056,10321,11988,...,IN,47130-8425,324191 Petroleum Lubricating Oil And Grease Ma...,Import,,,,,W3,25 – 49
18,56038-13-2,".alpha.-D-Galactopyranoside, 1,6-dichloro-1,6-...",CASRN,No,0,25187,25187,495,0,0,...,KY,42420-9662,424690 Other Chemical And Allied Products Merc...,,,,,,NKRA,Not Known or Reasonably Ascertainable
27,57-50-1,".alpha.-D-Glucopyranoside, .beta.-D-fructofura...",CASRN,No,2074294,0,2074294,1541960,2099661,982152,...,LA,70052,111930 Sugarcane Farming,Manufacture,,,,,W5,100 – 499
29,12738-64-6,".alpha.-D-Glucopyranoside, .beta.-D-fructofura...",CASRN,No,0,56867,56867,218213,158700,79358,...,MI,49508,424690 Other Chemical And Allied Products Merc...,Import,,,,,W1,< 10
30,12738-64-6,".alpha.-D-Glucopyranoside, .beta.-D-fructofura...",CASRN,,0,0,0,0,34780,102241,...,NC,27406-3799,325199 All Other Basic Organic Chemical Manufa...,Manufacture,,,,,,


In [10]:
cbi_rows = df_consumer[df_consumer.apply(lambda x: x.astype(str).str.contains('CBI', na=False).any(), axis=1)]
cbi_rows.count()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,0
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


### The public data we can analyze is 43.66% of the full dataset

In [11]:
df_consumer.count()

Unnamed: 0,0
CHEMICAL ID,24449
CHEMICAL NAME,24449
CHEMICAL ID TYPE,24449
RECYCLED,22522
2019 DOMESTIC PV,23211
2019 IMPORT PV,23206
2019 PV,23211
2018 PV,24417
2017 PV,24417
2016 PV,24417
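
The 43.66% figure in the heading can be checked from the `CHEMICAL ID` counts reported before and after the CBI rows were removed. The variable names below are illustrative only.

```python
total_rows = 56000    # CHEMICAL ID count before filtering
public_rows = 24449   # CHEMICAL ID count after removing CBI rows

share_public = public_rows / total_rows * 100
share_removed = (total_rows - public_rows) / total_rows * 100

print(f"public:  {share_public:.2f}%")
print(f"removed: {share_removed:.2f}%")
```

The difference, 31551 rows, matches the `CHEMICAL ID` count of the CBI rows found earlier.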


### Converting data types to integer and float

In [12]:
# prompt: convert this column to integer
#     ['2019 DOMESTIC PV',
#     '2019 IMPORT PV',
#     '2019 PV',
#     '2018 PV',
#     '2017 PV',
#     '2016 PV',]

import pandas as pd
cols_to_convert = [
    '2019 DOMESTIC PV',
    '2019 IMPORT PV',
    '2019 PV',
    '2018 PV',
    '2017 PV',
    '2016 PV',
]

for col in cols_to_convert:
    df_consumer[col] = pd.to_numeric(df_consumer[col], errors='coerce').astype('Int64')

In [13]:
# prompt: convert SITE LATITUDE SITE LONGITUDE to float
df_consumer['SITE LATITUDE'] = pd.to_numeric(df_consumer['SITE LATITUDE'], errors='coerce')
df_consumer['SITE LONGITUDE'] = pd.to_numeric(df_consumer['SITE LONGITUDE'], errors='coerce')

In [14]:
df_consumer.dtypes

Unnamed: 0,0
CHEMICAL ID,object
CHEMICAL NAME,object
CHEMICAL ID TYPE,object
RECYCLED,object
2019 DOMESTIC PV,Int64
2019 IMPORT PV,Int64
2019 PV,Int64
2018 PV,Int64
2017 PV,Int64
2016 PV,Int64
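
A small sketch of what `errors='coerce'` combined with the nullable `Int64` dtype does to values that cannot be parsed. The sample strings are made up for illustration.

```python
import pandas as pd

raw = pd.Series(["26564", "0", "Withheld", None])  # hypothetical raw values

# errors='coerce' turns unparseable entries into NaN instead of raising;
# the nullable 'Int64' dtype then stores those gaps as <NA> and keeps integers.
converted = pd.to_numeric(raw, errors="coerce").astype("Int64")

print(converted.dtype)
print(converted.isna().sum())
```

Coercion is silent, so comparing `isnull().sum()` before and after the conversion is a quick way to see how many values were turned into missing entries.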


### Filling missing Production Volume values

In [15]:
# Fill each production-volume column with its own median.
# The result is assigned back to the column: calling fillna(..., inplace=True)
# on a column selection raises a FutureWarning and stops working in pandas 3.0.
for col in cols_to_convert:
    df_consumer[col] = df_consumer[col].fillna(df_consumer[col].median())

### Removing rows with a missing chemical ID or chemical name

In [16]:
df_consumer[['CHEMICAL ID', 'CHEMICAL NAME']].isnull().sum()

Unnamed: 0,0
CHEMICAL ID,32
CHEMICAL NAME,32


### Checking whether rows with an empty `['CHEMICAL ID', 'CHEMICAL NAME']` hold any meaningful data

In [17]:
# Filter rows where 'CHEMICAL ID' or 'CHEMICAL NAME' is null
null_chemical_info_rows = df_consumer[(df_consumer['CHEMICAL ID'].isnull()) | (df_consumer['CHEMICAL NAME'].isnull())]

# Select the desired columns from the filtered rows
result = null_chemical_info_rows[['CHEMICAL ID TYPE',
                                'RECYCLED',
                                '2019 DOMESTIC PV',
                                '2019 IMPORT PV',
                                '2019 PV',
                                '2018 PV',
                                '2017 PV',
                                '2016 PV',
                                'SITE LATITUDE',
                                'SITE LONGITUDE',
                                'SITE CITY',
                                'SITE COUNTY / PARISH',
                                'SITE STATE',
                                'SITE POSTAL CODE',
                                'SITE NAICS CODE 1',
                                'SITE NAICS ACTIVITY 1',
                                'SITE NAICS CODE 2',
                                'SITE NAICS ACTIVITY 2',
                                'SITE NAICS CODE 3',
                                'SITE NAICS ACTIVITY 3',]]

# Print the resulting DataFrame
result

Unnamed: 0,CHEMICAL ID TYPE,RECYCLED,2019 DOMESTIC PV,2019 IMPORT PV,2019 PV,2018 PV,2017 PV,2016 PV,SITE LATITUDE,SITE LONGITUDE,SITE CITY,SITE COUNTY / PARISH,SITE STATE,SITE POSTAL CODE,SITE NAICS CODE 1,SITE NAICS ACTIVITY 1,SITE NAICS CODE 2,SITE NAICS ACTIVITY 2,SITE NAICS CODE 3,SITE NAICS ACTIVITY 3
56000,,,0,0,0,0,0,0,,,,,,,,,,,,
56001,,,0,0,0,0,0,0,,,,,,,,,,,,
56002,,,0,0,0,0,0,0,,,,,,,,,,,,
56003,,,0,0,0,0,0,0,,,,,,,,,,,,
56004,,,0,0,0,0,0,0,,,,,,,,,,,,
56005,,,0,0,0,0,0,0,,,,,,,,,,,,
56006,,,0,0,0,0,0,0,,,,,,,,,,,,
56007,,,0,0,0,0,0,0,,,,,,,,,,,,
56008,,,0,0,0,0,0,0,,,,,,,,,,,,
56009,,,0,0,0,0,0,0,,,,,,,,,,,,


In [18]:
# prompt: i want to delete row who has null chemical id or chemical name

# Delete rows where 'CHEMICAL ID' or 'CHEMICAL NAME' is null
df_consumer.dropna(subset=['CHEMICAL ID', 'CHEMICAL NAME'], inplace=True)

# Check for null values after dropping
df_consumer[['CHEMICAL ID', 'CHEMICAL NAME']].isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0


In [19]:
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,1927
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


### Normalizing place names to UPPERCASE

In [20]:
# prompt: i want to format SITE CITY, SITE COUNTY / PARISH, and  SITE STATE to uppercase

df_consumer['SITE CITY'] = df_consumer['SITE CITY'].str.upper()
df_consumer['SITE COUNTY / PARISH'] = df_consumer['SITE COUNTY / PARISH'].str.upper()
df_consumer['SITE STATE'] = df_consumer['SITE STATE'].str.upper()
df_consumer['SITE POSTAL CODE'] = df_consumer['SITE POSTAL CODE'].str.upper()

In [21]:
# prompt: i want to know state city who has site latitude and site longtitude null or 0, i want unique value of that city list

# Filter rows where SITE LATITUDE or SITE LONGITUDE is null or 0
null_zero_lat_lon = df_consumer[(df_consumer['SITE LATITUDE'].isnull()) |
                              (df_consumer['SITE LATITUDE'] == 0) |
                              (df_consumer['SITE LONGITUDE'].isnull()) |
                              (df_consumer['SITE LONGITUDE'] == 0)]

# Get the unique list of cities from the filtered rows
cities_with_null_zero_lat_lon = null_zero_lat_lon['SITE CITY'].unique()
states_with_null_zero_lat_lon = null_zero_lat_lon['SITE STATE'].unique()

# Print the unique list of cities
print("Cities with null or 0 Site Latitude or Site Longitude:")
print(cities_with_null_zero_lat_lon)
print(states_with_null_zero_lat_lon)


Cities with null or 0 Site Latitude or Site Longitude:
['HENDERSON' 'NEW CASTLE' 'OKLAHOMA CITY' 'HOUSTON' 'TRENTON'
 'FARMINGTON HILLS' 'PHOENIX' 'WITHHELD' 'SADDLE BROOK' 'NEWARK'
 'GEORGETOWN' 'MILL HALL' 'BAYTOWN' 'ROCKET CENTER' 'PASADENA'
 'LAKE CHARLES' 'EAST WINDSOR' 'WEST POINT' 'NEW YORK' 'ROME'
 'GOLDEN MEADOW' 'FREEPORT' 'SHELTON' 'LOUISVILLE' 'NASHVILLE' 'KOTZEBUE'
 'MCCARRAN' 'SOUTH DEERFIELD' 'BROOKFIELD' 'BRIDGEWATER' 'SONORA'
 'GOOSE CREEK' 'NEW KENSINGTON' 'IMPERIAL' 'MARYSVILLE' 'TROY' 'FRIENDLY'
 'CHICAGO HEIGHTS' 'DORADO' 'BROOKSVILLE' 'RAPID CITY' 'THOMASTON'
 'PAULDING' 'SELLERSBURG' 'STOCKTON' 'MARYNEAL' 'WASHINGTON'
 'OAKBROOK TERRACE' 'OAKLAND' 'CORTLAND' 'IRVING' 'POST FALLS'
 'MOUNDSVILLE' 'BRILLIANT' 'BEULAH' 'SCHOFIELD BARRACKS' 'PORTLAND'
 'ROCK SPRINGS' 'WHEATFIELD' 'LATHROP' 'POCATELLO' 'FORT MADISON'
 'LAVERGNE' 'MAIDSVILLE' 'DIBOLL' 'PENSACOLA' 'PANAMA' 'PINEVILLE'
 'SUNNYSIDE' 'ROOPVILLE' 'COLSTRIP' 'BATTLE MOUNTAIN' 'WEST CHICAGO'
 'SAINT GABRIEL' '

### Inspecting cities whose state is `NaN`

In [22]:
# prompt: i want to know city name, lat, long, and postal code where state is nan

nan_state_info = df_consumer[df_consumer['SITE STATE'].isnull()][['SITE CITY', 'SITE LATITUDE', 'SITE LONGITUDE', 'SITE POSTAL CODE']]
print("Information for entries where SITE STATE is NaN:")
nan_state_info

Information for entries where SITE STATE is NaN:


Unnamed: 0,SITE CITY,SITE LATITUDE,SITE LONGITUDE,SITE POSTAL CODE
23872,"BURRA, SA 5417",0.0,0.0,
23923,"BURRA, SA 5417",0.0,0.0,


### Removing the rows for BURRA, SA 5417, because that city is not in the United States

In [23]:
# prompt: i want to delete row where state is nan

df_consumer.dropna(subset=['SITE STATE'], inplace=True)
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,1927
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


### Inspecting withheld place names

In [24]:
# prompt: i want to know city name, lat, long, and postal code where state is WITHHELD

df_consumer[df_consumer['SITE STATE'] == 'WITHHELD'][['SITE CITY', 'SITE LATITUDE', 'SITE LONGITUDE', 'SITE POSTAL CODE']]

Unnamed: 0,SITE CITY,SITE LATITUDE,SITE LONGITUDE,SITE POSTAL CODE
980,WITHHELD,,,WITHHELD
981,WITHHELD,,,WITHHELD
982,WITHHELD,,,WITHHELD
1036,WITHHELD,,,WITHHELD
1262,WITHHELD,,,WITHHELD
...,...,...,...,...
55221,WITHHELD,,,WITHHELD
55222,WITHHELD,,,WITHHELD
55223,WITHHELD,,,WITHHELD
55224,WITHHELD,,,WITHHELD


### Building a latitude/longitude lookup for rows that have a city name but no coordinates

In [25]:
long_lat_missing_city = {
    'HENDERSON': [37.842777, -87.587222],
    'NEW CASTLE': [39.679558, -75.599933],
    'OKLAHOMA CITY': [35.467560, -97.516428],
    'HOUSTON': [29.760427, -95.369803],
    'TRENTON': [40.220109, -74.766861],
    'FARMINGTON HILLS': [42.482811, -83.376884],
    'PHOENIX': [33.448377, -112.074037],
    'SADDLE BROOK': [40.916766, -74.073479],
    'NEWARK': [40.735657, -74.172366],
    'GEORGETOWN': [38.718693, -75.122687],
    'MILL HALL': [41.135246, -77.464731],
    'BAYTOWN': [29.749947, -95.031326],
    'ROCKET CENTER': [39.549266, -78.895066],
    'PASADENA': [29.610508, -95.207705],
    'LAKE CHARLES': [30.224021, -93.217384],
    'EAST WINDSOR': [40.297444, -74.526278],
    'WEST POINT': [33.606775, -88.647547],
    'NEW YORK': [40.7128, -74.0060],
    'ROME': [34.257038, -85.164673],
    'GOLDEN MEADOW': [29.387167, -90.257545],
    'FREEPORT': [40.658717, -73.582631],
    'SHELTON': [41.325659, -73.136224],
    'LOUISVILLE': [38.252665, -85.758456],
    'NASHVILLE': [36.162664, -86.781601],
    'KOTZEBUE': [66.898056, -162.585833],
    'MCCARRAN': [39.545899, -119.569485],
    'SOUTH DEERFIELD': [42.493988, -72.607590],
    'BROOKFIELD': [43.059458, -88.093144],
    'BRIDGEWATER': [40.575654, -74.586616],
    'SONORA': [37.984638, -120.383526],
    'GOOSE CREEK': [33.003223, -80.034252],
    'NEW KENSINGTON': [40.573950, -79.761719],
    'IMPERIAL': [32.846430, -115.564440],
    'MARYSVILLE': [39.143789, -121.591901],
    'TROY': [42.728333, -73.692500],
    'FRIENDLY': [39.564522, -81.047067],
    'CHICAGO HEIGHTS': [41.503923, -87.641716],
    'DORADO': [18.455209, -66.273775],
    'BROOKSVILLE': [28.555556, -82.395833],
    'RAPID CITY': [44.080556, -103.227222],
    'THOMASTON': [32.890691, -84.288544],
    'PAULDING': [41.144497, -84.582458],
    'SELLERSBURG': [38.411171, -85.760803],
    'STOCKTON': [37.957702, -121.290780],
    'MARYNEAL': [32.253056, -101.442222],
    'WASHINGTON': [40.173685, -80.245065],
    'OAKBROOK TERRACE': [41.854477, -87.954784],
    'OAKLAND': [37.804363, -122.271113],
    'CORTLAND': [42.599793, -76.177264],
    'IRVING': [32.814018, -96.948895],
    'POST FALLS': [47.715732, -116.953767],
    'MOUNDSVILLE': [39.914246, -80.749806],
    'BRILLIANT': [40.354238, -80.607028],
    'BEULAH': [47.241389, -101.777778],
    'SCHOFIELD BARRACKS': [21.4925, -158.058056],
    'PORTLAND': [45.523062, -122.676482],
    'ROCK SPRINGS': [41.591079, -109.202353],
    'WHEATFIELD': [41.228122, -87.112799],
    'LATHROP': [37.795908, -121.240502],
    'POCATELLO': [42.871032, -112.433220],
    'FORT MADISON': [40.627257, -91.317926],
    'LAVERGNE': [35.945203, -86.568600],
    'MAIDSVILLE': [39.733979, -79.996172],
    'DIBOLL': [31.189623, -94.795213],
    'PENSACOLA': [30.421309, -87.216912],
    'PANAMA': [39.068930, -89.431200],
    'PINEVILLE': [31.323516, -92.433470],
    'SUNNYSIDE': [46.333189, -120.007550],
    'ROOPVILLE': [33.565116, -85.122170],
    'COLSTRIP': [45.892789, -106.613360],
    'BATTLE MOUNTAIN': [40.638515, -116.909538],
    'WEST CHICAGO': [41.897258, -88.209800],
    'SAINT GABRIEL': [30.298254, -91.077051],
    'CALHOUN, GORDON': [34.502035, -84.945070], # Assuming this refers to the city of Calhoun in Gordon County, GA
    'UNION': [37.766768, -80.540356],
    'POINT PLEASANT': [38.847585, -82.128768],
    'CLEVELAND': [41.499320, -81.694361],
    'SAHUARITA': [31.954546, -111.002873],
    'ALABASTER': [33.242898, -86.820847],
    'TICONDEROGA': [43.834458, -73.415053],
    'DUVALL': [47.781216, -121.977067],
    'MCALESTER': [34.927038, -95.770267],
    'COLUMBUS': [39.961176, -82.998794],
    'CEDAR SPRINGS': [43.208649, -85.556419],
    'PARSIPPANY': [40.854823, -74.407659],
    'WICKLIFFE': [41.602551, -81.470409],
    'FLORENCE': [38.006940, -84.620000],
    'FRIESLAND': [43.559439, -89.043729],
    'CHRISTIANSTED': [17.747978, -64.703487],
    'MANCHESTER': [42.993056, -71.464167],
    'SAGINAW': [43.419470, -83.950807],
    'WELLESLEY HILLS': [42.296541, -71.258814],
    'LATROBE': [40.327568, -79.395039],
    'BALDWIN': [33.090956, -83.479590],
    'GHENT': [42.348692, -73.619010],
    'WOODS CROSS': [40.893116, -111.916053],
    'CROSSETT': [33.102334, -91.996245],
    'CANBY': [45.260124, -122.693149],
    'ANCHORAGE': [61.218056, -149.900278],
    'MARIETTA': [33.952607, -84.549933],
    'CAMDEN CITY': [39.945833, -75.101111],
    'PUEBLO': [38.254477, -104.609100],
    'BAY CITY': [28.980269, -96.146919],
    'HELM': [36.467451, -120.151259],
    'MCINTOSH': [44.757739, -95.955437],
    'LAWRENCE': [38.971667, -95.235278],
    'MULBERRY': [27.904467, -81.994537],
    'GARRETSON': [43.702758, -96.486745],
    'FORT LUPTON': [40.198319, -104.809971],
    'GREELEY': [40.426578, -104.709968],
    'SILVER BOW': [46.037166, -112.569472],
    'ELMENDORF': [29.231061, -98.411689],
    'CARTERSVILLE': [34.166667, -84.806389],
    'PLYMOUTH': [41.958333, -70.667500],
    'MONTICELLO': [40.758368, -86.764506],
    'LYONS': [40.231221, -98.666141],
    'SCHUYLER': [41.442222, -96.903056],
    'SANTA ANA': [33.745484, -117.867623]
}

In [26]:
# prompt: I want to fill null or 0 value in SITE LATITUDE and SITE LONGITUDE with long_lat_missing_city

import pandas as pd
# Iterate through the DataFrame and fill null/0 values
for index, row in df_consumer.iterrows():
    if (pd.isnull(row['SITE LATITUDE']) or row['SITE LATITUDE'] == 0 or
        pd.isnull(row['SITE LONGITUDE']) or row['SITE LONGITUDE'] == 0):
        city = row['SITE CITY']
        if city in long_lat_missing_city:
            df_consumer.loc[index, 'SITE LATITUDE'] = long_lat_missing_city[city][0]
            df_consumer.loc[index, 'SITE LONGITUDE'] = long_lat_missing_city[city][1]

# Verify the changes
print("\nNull/0 values in SITE LATITUDE and SITE LONGITUDE after filling:")
print(df_consumer[(df_consumer['SITE LATITUDE'].isnull()) |
                (df_consumer['SITE LATITUDE'] == 0) |
                (df_consumer['SITE LONGITUDE'].isnull()) |
                (df_consumer['SITE LONGITUDE'] == 0)][['SITE CITY', 'SITE LATITUDE', 'SITE LONGITUDE']].count())


Null/0 values in SITE LATITUDE and SITE LONGITUDE after filling:
SITE CITY         66
SITE LATITUDE      0
SITE LONGITUDE     0
dtype: int64
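
The `iterrows()` loop above visits every row in Python. A vectorized sketch of the same fill is shown below, assuming the lookup is keyed by city name only, as `long_lat_missing_city` is. The `lookup` dictionary and `toy` frame are small hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical two-city lookup playing the role of long_lat_missing_city.
lookup = {"HOUSTON": [29.760427, -95.369803], "PHOENIX": [33.448377, -112.074037]}

toy = pd.DataFrame({
    "SITE CITY": ["HOUSTON", "PHOENIX", "WITHHELD", "HOUSTON"],
    "SITE LATITUDE": [0.0, None, None, 29.7],
    "SITE LONGITUDE": [0.0, -112.0, None, -95.3],
})

# Rows where either coordinate is missing or zero, as in the loop.
needs_fill = (
    toy["SITE LATITUDE"].isna() | (toy["SITE LATITUDE"] == 0)
    | toy["SITE LONGITUDE"].isna() | (toy["SITE LONGITUDE"] == 0)
)
target = needs_fill & toy["SITE CITY"].isin(list(lookup))

# Overwrite both coordinates for the targeted rows in one assignment each.
cities = toy.loc[target, "SITE CITY"]
toy.loc[target, "SITE LATITUDE"] = cities.map(lambda c: lookup[c][0])
toy.loc[target, "SITE LONGITUDE"] = cities.map(lambda c: lookup[c][1])

print(toy)
```

Because the key is the city name alone, cities that share a name across states receive the same coordinates. Keying the lookup on a (city, state) pair would avoid that, at the cost of a larger table.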


In [27]:
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,1927
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


### Filling latitude/longitude with `[-1, -1]` for withheld locations

In [28]:
# prompt: i want to fill lat long where value is NaN with [-1, -1]

# Assign the result back instead of using inplace=True on a column selection
df_consumer['SITE LATITUDE'] = df_consumer['SITE LATITUDE'].fillna(-1)
df_consumer['SITE LONGITUDE'] = df_consumer['SITE LONGITUDE'].fillna(-1)


### Finding NKRA (Not Known or Reasonably Ascertainable) values

In [29]:
# prompt: i want to find column where has value 'Not Known or Reasonably Ascertainable'

for col in df_consumer.columns:
    if (df_consumer[col] == 'Not Known or Reasonably Ascertainable').any():
        print(f"Column '{col}' contains 'Not Known or Reasonably Ascertainable'")

Column 'WORKERS' contains 'Not Known or Reasonably Ascertainable'


In [30]:
# prompt: i want to replace 'Not Known or Reasonably Ascertainable' to 'NKRA'

df_consumer.replace('Not Known or Reasonably Ascertainable', 'NKRA', inplace=True)

In [31]:
# prompt: i want to find WORKERS value where WORKERS CODE value is NKRA and WORKER value is not NKRA

workers_nkra_worker_not_nkra = df_consumer[(df_consumer['WORKERS CODE'] == 'NKRA') & (df_consumer['WORKERS'] != 'NKRA')]

print("Entries where 'WORKERS CODE' is 'NKRA' and 'WORKERS' is not 'NKRA':")
print(workers_nkra_worker_not_nkra[['WORKERS CODE', 'WORKERS']])

Entries where 'WORKERS CODE' is 'NKRA' and 'WORKERS' is not 'NKRA':
Empty DataFrame
Columns: [WORKERS CODE, WORKERS]
Index: []


### Creating the WORKERS MEDIAN column

In [33]:
# prompt: create fuction to calculate workers median: 1. split string with space or any white space. 2. filter array, take only integer value 3. return sum(list)/len(list)

import pandas as pd
def calculate_workers_median(workers_string):
    """
    Calculates the median of worker values from a string.

    1. Splits the input string by whitespace.
    2. Filters the resulting array to include only integer values.
    3. Calculates and returns the median of the integer values.
       Returns None if no valid integers are found.
    """
    if pd.isnull(workers_string) or workers_string == 'NKRA':
        return None

    parts = workers_string.split()
    integer_workers = [int(part) for part in parts if part.isdigit()]

    if not integer_workers:
        return None

    integer_workers.sort()
    n = len(integer_workers)
    if n % 2 == 1:
        return integer_workers[n // 2]
    else:
        return (integer_workers[n // 2 - 1] + integer_workers[n // 2]) / 2

df_consumer['WORKERS MEDIAN'] = df_consumer['WORKERS'].apply(calculate_workers_median)

# Display the first few rows with the new column
print(df_consumer[['WORKERS', 'WORKERS CODE', 'WORKERS MEDIAN']].head())

# Check for null values in the new column
print("\nNull values in WORKERS MEDIAN after calculation:")

      WORKERS WORKERS CODE  WORKERS MEDIAN
3     25 – 49           W3            37.0
18       NKRA         NKRA             NaN
27  100 – 499           W5           299.5
29       < 10           W1            10.0
30        NaN          NaN             NaN

Null values in WORKERS MEDIAN after calculation:
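
The function above keeps only tokens for which `str.isdigit()` is true, so a token with a thousands separator would be dropped. Below is a hedged sketch of a regex-based variant. The label `'1,000 – 9,999'` is an assumption, since no such value appears in the output shown here, and `range_midpoint` is a hypothetical helper name. For labels with one or two numbers the mean equals the median, so the results match the table above.

```python
import re

def range_midpoint(label):
    """Average the integers found in a worker-range label, or None if there are none."""
    if label is None:
        return None
    # Drop thousands separators so '1,000' is read as a single number.
    digits = re.findall(r"\d+", str(label).replace(",", ""))
    if not digits:
        return None
    values = [int(d) for d in digits]
    return sum(values) / len(values)

for label in ["25 – 49", "100 – 499", "< 10", "1,000 – 9,999", "NKRA"]:
    print(label, "->", range_midpoint(label))
```

As in the notebook, an open-ended label such as `< 10` yields its bound (10), which is an upper limit and not a true midpoint.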


In [35]:
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,1927
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


### Filling NaN in WORKERS MEDIAN with the city-group mean, or 0 when no location information is available

In [36]:
# prompt: fill NaN value in workers median base on mean WORKERS MEDIAN in SITE CITY groups, if theres no SITE CITY information fill it with 0

# Calculate the mean 'WORKERS MEDIAN' for each 'SITE CITY' group
city_worker_median_mean = df_consumer.groupby('SITE CITY')['WORKERS MEDIAN'].transform('mean')

# Fill NaN values in 'WORKERS MEDIAN' with the calculated mean for the city group
df_consumer['WORKERS MEDIAN'] = df_consumer['WORKERS MEDIAN'].fillna(city_worker_median_mean)


In [37]:
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,1927
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


In [38]:
# Fill any remaining NaN values in 'WORKERS MEDIAN' (where there was no SITE CITY information) with 0
df_consumer['WORKERS MEDIAN'] = df_consumer['WORKERS MEDIAN'].fillna(0)


In [39]:
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,1927
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


### Filling `Unknown` into every column that still has missing values

In [40]:
for col in selected_column:
    df_consumer[col] = df_consumer[col].fillna('Unknown')


In [41]:
df_consumer.isnull().sum()

Unnamed: 0,0
CHEMICAL ID,0
CHEMICAL NAME,0
CHEMICAL ID TYPE,0
RECYCLED,0
2019 DOMESTIC PV,0
2019 IMPORT PV,0
2019 PV,0
2018 PV,0
2017 PV,0
2016 PV,0


In [42]:
# prompt: i want to know unique value of chemical name

# Get the unique values of 'CHEMICAL NAME'
unique_chemical_names = df_consumer['CHEMICAL NAME'].unique()

# Print the unique chemical names
print("Unique Chemical Names:")
print(len(unique_chemical_names))

Unique Chemical Names:
4659


In [43]:
# prompt: i want to know unique value of chemical id

print("Unique Chemical IDs:")
print(df_consumer['CHEMICAL ID'].nunique())

Unique Chemical IDs:
4662
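
The counts differ (4662 IDs against 4659 names), which suggests that a few names are shared by more than one ID, assuming each ID carries a single name. A minimal sketch for listing such names follows, using purely hypothetical IDs and names.

```python
import pandas as pd

# Hypothetical pairs: two IDs share one name.
toy = pd.DataFrame({
    "CHEMICAL ID": ["ID-1", "ID-2", "ID-3"],
    "CHEMICAL NAME": ["Name A", "Name A", "Name B"],
})

# Count distinct IDs per name and keep the names with more than one.
ids_per_name = toy.groupby("CHEMICAL NAME")["CHEMICAL ID"].nunique()
shared_names = ids_per_name[ids_per_name > 1]

print(shared_names)
```

Running the same grouping on `df_consumer` would show which names are affected, which helps decide whether later charts should group by ID or by name.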


In [44]:
# prompt: i want to export df_consumer to csv

df_consumer.to_csv('df_consumer.csv', index=False)