# COGS 108 - Data Checkpoint

## Authors
- Yechan Park: Project Administration, Software, Writing 

## Research Question

Is there a statistically significant correlation between the global production rate of plastics and key indicators of global warming, such as atmospheric CO₂ concentration, fossil fuel consumption, and global average temperature anomalies?


## Background and Prior Work

The rapid growth of global plastic production over recent decades has raised increasing concern due to its environmental and climate impacts. Since the 1950s, plastic production has increased from almost zero to hundreds of millions of metric tons per year, largely driven by industrialization and rising consumer demand [1]. Because most plastics are produced from fossil fuels, their manufacturing and disposal require large amounts of energy and result in greenhouse gas emissions. As a result, plastic production is closely connected to broader patterns of fossil fuel use and industrial activity.

Long-term observations show that atmospheric carbon dioxide (CO₂) concentrations have risen steadily since the late 1950s. Measurements collected by the National Oceanic and Atmospheric Administration (NOAA) indicate that current CO₂ levels are significantly higher than pre-industrial values and continue to increase each year [2]. This rise is mainly caused by the widespread burning of fossil fuels and other human activities, and it is strongly linked to global warming. Increasing CO₂ concentrations are associated with other major climate indicators, including rising global temperature anomalies, reflecting an enhanced greenhouse effect in Earth’s atmosphere.

Scientific assessments by the Intergovernmental Panel on Climate Change (IPCC) provide strong evidence that human-driven increases in greenhouse gases have caused widespread warming across the climate system. The IPCC Sixth Assessment Report explains that excess heat trapped by greenhouse gases has been absorbed largely by the oceans since the mid-20th century, leading to accelerating glacier and ice-sheet melt and rising global sea levels [3]. These observed changes closely follow long-term increases in fossil fuel consumption and industrial production, suggesting that other fossil-fuel-intensive activities may exhibit similar relationships with climate indicators.

Global temperature records further support this warming trend. Data from NASA’s Goddard Institute for Space Studies show that recent decades are significantly warmer than the mid-20th century average, indicating a clear and persistent rise in global temperature anomalies [4]. Because plastics are derived from fossil fuels and contribute to greenhouse gas emissions throughout their lifecycle, examining the statistical relationship between global plastic production and key climate indicators, such as atmospheric CO₂ concentrations and global temperature anomalies, can help clarify how industrial production aligns with observed global warming trends.


## Hypothesis


We hypothesize that there is a statistically significant positive correlation between global plastic production levels and indicators of global warming, including atmospheric CO₂ concentration, fossil fuel consumption, and global average temperature anomalies. This relationship is expected because plastic production is highly dependent on fossil fuels and contributes to greenhouse gas emissions throughout its lifecycle. As plastic production increases steadily over time, we anticipate upward trends in the climate variables as well.
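
As a rough illustration of how this hypothesis could be checked, the sketch below joins two annual tables on `year` and computes Pearson's r with pandas. The numbers are made up for demonstration, and the plastics table and its `production_mt` column are hypothetical names, since no plastics dataset has been loaded at this checkpoint. The helper `yearly_correlation` is likewise an assumption for illustration, not part of the project code.

```python
import pandas as pd

# Hypothetical annual values, for illustration only (not real measurements).
co2 = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "mean": [368.0, 370.0, 372.0, 375.0, 377.0],  # annual mean CO2 (ppm)
})
# `production_mt` is an assumed column name for plastic production.
plastics = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "production_mt": [210.0, 215.0, 225.0, 235.0, 245.0],
})


def yearly_correlation(left, right, left_col, right_col):
    """Inner-join two annual tables on `year` and return Pearson's r."""
    merged = left.merge(right, on="year", how="inner")
    return merged[left_col].corr(merged[right_col], method="pearson")


r = yearly_correlation(co2, plastics, "mean", "production_mt")
print(f"Pearson r = {r:.3f}")
```

pandas reports only the coefficient. A significance test would need a p-value as well, which `scipy.stats.pearsonr` can provide if SciPy is available.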


## Data

### Data overview
Data Set 1:
    
    Dataset Name: Global Monthly Atmospheric CO₂ Data

    Link: https://gml.noaa.gov/ccgg/trends/gl_data.html

    Number of observations: 563 monthly global observations

    Number of variables: 7

        year

        month

        decimal (decimal date)

        average (monthly mean CO₂, ppm)

        average_unc (uncertainty of the monthly mean)

        trend (deseasonalized trend)

        trend_unc (uncertainty of the trend)

    Variables most relevant to this project:

        year — calendar year

        month — month of observation

        average — monthly global CO₂ concentration (ppm)

        trend — seasonally adjusted global CO₂ concentration

    Shortcomings:

        Still observational (no causation)

        Aggregated global average may hide regional variation

Data Set 2: 

    Dataset Name: Global Annual Mean CO₂ Data

    Link: https://gml.noaa.gov/ccgg/trends/gl_data.html

    Number of observations: 46 annual observations

    Variables most relevant:

        year

        annual mean CO₂ (ppm)

    Shortcomings:

        Annual averaging removes seasonal detail



In [28]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2



In [29]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    {
        'url': 'https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_gl.csv',
        'filename': 'co2_mm_gl.csv'
    },
    {
        'url': 'https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_gl.csv',
        'filename': 'co2_annmean_gl.csv'
    }
]

get_data.get_raw(datafiles, destination_directory='data/00-raw/')

Successfully downloaded: co2_mm_gl.csv
Successfully downloaded: co2_annmean_gl.csv


### Dataset #1 

This dataset contains monthly global atmospheric carbon dioxide (CO₂) measurements provided by NOAA’s Global Monitoring Laboratory. Each row represents one month of globally averaged CO₂ concentration. The dataset has seven variables: year, month, decimal (decimal date), average (monthly mean CO₂ concentration), average_unc, trend (deseasonalized trend), and trend_unc, where the two _unc columns hold the reported uncertainties.

CO₂ concentration is measured in parts per million (ppm), which represents how many CO₂ molecules exist per one million molecules of air. Monthly data allows us to observe both long-term upward trends and seasonal fluctuations in global atmospheric CO₂ levels. The seasonal variation reflects natural plant growth cycles, while the long-term increase reflects human-related greenhouse gas emissions.
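
One way to look at the seasonal fluctuation on its own is to subtract the deseasonalized `trend` column from the monthly `average` column. The sketch below uses the dataset's column names with a few made-up rows, so the values are hypothetical and only show the arithmetic.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names (values are made up).
monthly = pd.DataFrame({
    "year":    [2020, 2020, 2020, 2020],
    "month":   [1, 4, 7, 10],
    "average": [412.0, 414.5, 410.0, 411.5],  # monthly mean CO2 (ppm)
    "trend":   [411.5, 412.0, 412.5, 413.0],  # deseasonalized trend (ppm)
})

# Seasonal component: monthly mean minus the deseasonalized trend.
monthly["seasonal"] = monthly["average"] - monthly["trend"]
print(monthly[["year", "month", "seasonal"]])
```

Positive values mark months where the concentration sits above the long-term trend, and negative values mark months below it.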

A limitation of this dataset is that, although it represents global averages, it does not capture regional differences in CO₂ levels. Additionally, this dataset shows trends and correlations but does not establish causation.


In [30]:
# Load and inspect the monthly global CO2 dataset, then save a copy to data/02-processed/

import pandas as pd

# Load dataset
co2_monthly = pd.read_csv(
    "data/00-raw/co2_mm_gl.csv",
    comment="#"
)

co2_monthly.head()

# Size of dataset
co2_monthly.shape

# Data types
co2_monthly.info()

# Missing values
co2_monthly.isna().sum()

# Summary statistics
co2_monthly.describe()

# Duplicate rows
co2_monthly.duplicated().sum()

# Save cleaned version
co2_monthly.to_csv(
    "data/02-processed/co2_mm_gl_cleaned.csv",
    index=False
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 563 entries, 0 to 562
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   year         563 non-null    int64  
 1   month        563 non-null    int64  
 2   decimal      563 non-null    float64
 3   average      563 non-null    float64
 4   average_unc  563 non-null    float64
 5   trend        563 non-null    float64
 6   trend_unc    563 non-null    float64
dtypes: float64(5), int64(2)
memory usage: 30.9 KB


### Dataset #2 

This dataset contains annual global mean atmospheric CO₂ concentrations provided by NOAA. Each row represents one year of globally averaged CO₂ levels. The dataset includes the calendar year (year), the annual mean CO₂ concentration in ppm (mean), and the uncertainty of that mean (unc).

Annual data removes seasonal variation and highlights long-term trends more clearly. This makes it especially useful for comparing CO₂ levels with other yearly indicators, such as global plastic production or global temperature anomalies.
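
If the monthly data ever need to be compared with a yearly indicator, one option is to average them by year so that both tables share a `year` key. The sketch below shows that aggregation on made-up rows. A plain average of monthly values is an assumption here and may not match NOAA's published annual means exactly.

```python
import pandas as pd

# Hypothetical monthly rows (values are made up for illustration).
monthly = pd.DataFrame({
    "year":    [2019, 2019, 2020, 2020],
    "month":   [1, 7, 1, 7],
    "average": [410.0, 408.0, 412.0, 411.0],
})

# Average the monthly values within each year and rename the result
# to match the annual file's `mean` column.
annual_from_monthly = (
    monthly.groupby("year", as_index=False)["average"]
    .mean()
    .rename(columns={"average": "mean"})
)
print(annual_from_monthly)
```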

A limitation of this dataset is that annual averaging removes seasonal detail, which may hide short-term fluctuations. Additionally, like the monthly dataset, it provides observational data and does not directly explain the causes of CO₂ increases.

In [31]:
# Load and inspect the annual global CO2 dataset, then save a copy to data/02-processed/

import pandas as pd

# Load dataset
co2_annual = pd.read_csv(
    "data/00-raw/co2_annmean_gl.csv",
    comment="#"
)

co2_annual.head()

# Size of dataset
co2_annual.shape

# Data types
co2_annual.info()

# Missing values
co2_annual.isna().sum()

# Summary statistics
co2_annual.describe()

# Duplicate rows
co2_annual.duplicated().sum()

# Save cleaned version
co2_annual.to_csv(
    "data/02-processed/co2_annmean_gl_cleaned.csv",
    index=False
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    46 non-null     int64  
 1   mean    46 non-null     float64
 2   unc     46 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 1.2 KB


## Ethics

The data used in this project are public, aggregated datasets from sources like Our World in Data, NOAA, NASA, and the IPCC, so there are no privacy concerns since no personal or individual data are included. The data are free to use for research and follow open data terms. However, there may be some bias in how the data were collected. For example, plastics production data rely on country and industry reports, which may be less accurate in regions with weaker reporting systems. Climate data may also reflect more measurements from developed countries, which could slightly bias global averages.

Another issue is that all datasets show strong upward trends over time, which could make correlations look stronger than they really are. To handle this, the data will be checked for missing values and inconsistencies, and trends will be visualized before running any statistical tests. Results will be explained carefully, making it clear that correlation does not imply causation. When sharing results, the limits of the data will be clearly stated to avoid misleading conclusions or unfairly blaming specific groups or regions.
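
The concern about shared upward trends can be illustrated with a small sketch that compares the correlation of raw levels with the correlation of year-over-year changes (first differences). Differencing is one possible way to reduce the influence of a common trend, not necessarily the method this project will use. All values and the column names `co2_ppm` and `plastics_mt` are hypothetical.

```python
import pandas as pd

# Hypothetical annual series that both rise over time (values are made up).
df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004, 2005],
    "co2_ppm": [368.0, 370.0, 372.5, 374.0, 377.0, 379.0],
    "plastics_mt": [210.0, 218.0, 222.0, 233.0, 238.0, 250.0],
})

# Correlation of the raw levels, dominated by the shared upward trend.
r_levels = df["co2_ppm"].corr(df["plastics_mt"])

# Correlation of year-over-year changes (first differences).
changes = df[["co2_ppm", "plastics_mt"]].diff().dropna()
r_changes = changes["co2_ppm"].corr(changes["plastics_mt"])

print(f"levels: r = {r_levels:.3f}, changes: r = {r_changes:.3f}")
```

In this made-up example the levels are almost perfectly correlated while the yearly changes are not, which shows how a common trend alone can produce a high coefficient.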


## Team Expectations 

Team members: Jin Choi, Sujin Kim, Idhant Kumar, Rowoon Lee, Yechan Park

Team expectation 1: Communication
- Primary communication method: Discord chat and calls.
- We usually respond within a day, and everyone should join the weekly group meeting call since we are all contributing to the proposal.
- If a deadline is within 48 hours, we aim to start at least 12 hours before the deadline.
- If someone is unavailable to join the call or do the work, they must notify the group as soon as possible.

Team expectation 2: Weekly Meeting Schedule
- Meetings will be held every week, usually on Wednesday between 3 and 5 pm, since everyone is available during that time.
- At each meeting we discuss what needs to be done by the deadline and what to expect, and we plan for the next meeting.
- We use Google Docs to work on the assignments, and whoever is available submits them.

Team expectation 3: Decision-making
- During team meetings, we go with the majority.
- Whoever is available can create the assignment document or submit the assignment.
- If a quick decision needs to be made, whoever responds first makes the call.

Team expectation 4: Equal Contribution
- Everyone puts in an equal amount of time and effort to finish the assignment.
- We will use our GitHub page and Google Docs to work on most of our project.
- Everyone must contribute to the weekly meeting and must let the group know on the Discord chat if something comes up.
- Respect every member and keep appropriate boundaries.



## Project Timeline Proposal