In [2]:
pip install numpy



In [None]:
pip install pandas



In [None]:
pip install matplotlib



In [None]:
pip install seaborn




In [None]:
pip install plotly



In [None]:
pip install scipy



In [None]:
pip install statsmodels
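
After the install cells have run, it can be useful to confirm that each package is actually importable. The following is a minimal sketch using only the standard library; the helper name `check_installed` and the `PACKAGES` list are illustrative and not part of the original notebook.

```python
import importlib.util

# Top-level import names for the packages installed in the cells above
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "plotly", "scipy", "statsmodels"]


def check_installed(names):
    """Return a dict mapping each top-level module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print one status line per package
for name, found in check_installed(PACKAGES).items():
    print(f"{name:12s} {'installed' if found else 'MISSING'}")
```

Any package reported as `MISSING` would need its install cell re-run before the import cell later in the notebook.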



In [3]:
import math

math.pi * 2

6.283185307179586

In [4]:
math.sqrt(55)

7.416198487095663

In [5]:
print(math.pow(12, 3))  # 12 raised to the power 3; math.pow always returns a float

1728.0


## Data Definition

* **Air Quality Measures on the National Environmental Health Tracking Network.** <br>
Last Updated: July 20, 2023 <br>
https://catalog.data.gov/dataset/air-quality-measures-on-the-national-environmental-health-tracking-network<br>
This dataset combines the Environmental Protection Agency (EPA) Air Quality System (AQS) database containing data from approximately 4,000 monitoring stations around the country, mainly in urban areas. Data from the AQS is considered the "gold standard" for determining outdoor air pollution. Centers for Disease Control and Prevention (CDC) and EPA have worked together to develop a statistical model (Downscaler) to make modeled predictions available for environmental public health tracking purposes in areas of the country that do not have monitors and to fill in the time gaps when monitors may not be recording data.

* **Global Fire Emissions Database, Version 4.1 (GFEDv4)** <br>
Last Updated: July 27, 2023 <br>
https://catalog.data.gov/dataset/global-fire-emissions-database-version-4-1-gfedv4 <br>
This dataset provides global estimates of monthly burned area, monthly emissions and fractional contributions of different fire types. National Aeronautics and Space Administration (NASA) emissions data are available for carbon (C), dry matter (DM), carbon dioxide (CO2), carbon monoxide (CO), methane (CH4), hydrogen (H2), nitrous oxide (N2O), nitrogen oxides (NOx), non-methane hydrocarbons (NMHC), organic carbon (OC), black carbon (BC), particulate matter less than 2.5 microns (PM2.5), total particulate matter (TPM), and sulfur dioxide (SO2) among others. These data are yearly totals by region, globally, and by fire source for each region.

* **PM2.5 and cardiovascular mortality rate data: Trends modified by county socioeconomic status in 2,132 US counties** <br>
Last Updated: November 12, 2020 <br>
https://catalog.data.gov/dataset/annual-pm2-5-and-cardiovascular-mortality-rate-data-trends-modified-by-county-socioeconomi <br>
This U.S. Environmental Protection Agency dataset covers county socioeconomic status for 2,132 US counties, along with each county’s average annual cardiovascular mortality rate (CMR) and total PM2.5 concentration for 21 years (1990-2010). County CMR, PM2.5, and socioeconomic data were obtained from the U.S. National Center for Health Statistics, the U.S. Environmental Protection Agency’s Community Multiscale Air Quality modeling system, and the U.S. Census, respectively.

* **Superfund Site Information**<br>
Last Updated: May 17, 2021<br>
https://catalog.data.gov/dataset/superfund-site-information <br>
This U.S. Environmental Protection Agency asset includes a number of individual data sets with site-specific information for Superfund: basic site description, location, schedule of activities, enforcement and settlement data, contaminants, selected remedy, and more, as well as the records that document site decisions. The asset also includes sampling data and lab results (CLPSS, EDDs), redevelopment and technical assistance case studies, site reuse and land revitalization information, EPAOSC.net information, Superfund Technical Assistance Grants information, site management information records (RODs, remediation plans, cleanup directives), contract management information, and more.

* **Superfund cleanups and children’s lead exposure in six states** <br>
Last Updated: July 26, 2021 <br>
https://catalog.data.gov/dataset/superfund-cleanups-and-childrens-lead-exposure-in-six-states <br>
Data for the study include restricted access and non-restricted access files. Restricted access files include individual children's blood lead data from six states, property assessment data from Zillow, Inc., and Census tract characteristics processed by GeoLytics. This dataset includes contaminated site locations and characteristics (Superfund, brownfields, and RCRA sites), ambient air lead concentrations, state-month average temperatures, and vehicle miles traveled in 1980.

* **EPA Region 6 REAP Sustainability Geodatabase** <br>
Last Updated: November 10, 2020 <br>
https://catalog.data.gov/dataset/epa-region-6-reap-sustainability-geodatabase <br>
The Regional Ecological Assessment Protocol (REAP) is a screening-level assessment tool created to identify priority ecological resources within the five EPA Region 6 states (Arkansas, Louisiana, New Mexico, Oklahoma, and Texas). The REAP divides eighteen individual measures into three main sub-layers: diversity, rarity, and sustainability. This geodatabase contains the two grids (sustain and sustainrank) representing the sustainability layer, which describes the state of the environment in terms of stability (sustainable areas are those that can maintain themselves into the future without human management). Eleven measures make up the sustainability layer: contiguous land cover, regularity of ecosystem boundary, appropriateness of land cover, waterway obstruction, road density, airport noise, Superfund sites, Resource Conservation and Recovery Act (RCRA) sites, water quality, air quality, and urban/agriculture disturbance.

# Data Science Process
The data science process consists of seven key steps. Each week, we will complete another step of the process, and the work we complete in class will structure the work you do on the group project. This week you will work with your dataset file, load the data into the Python environment, and plan your data cleaning steps. <br>

The specific steps you take will vary depending on the problem you are trying to solve, but the general process stays the same:
1. Problem Framing: This is the first and most important step in the data science process. It involves understanding the research problem that you are trying to solve and defining the specific questions that you want to answer with data.
2. Data Collection: Once you have defined your problem, you need to acquire the data you need to answer your questions. This can involve collecting data from various sources, such as surveys, databases, and social media.
3. Data Cleaning: Once you have acquired your data, you must prepare it for analysis. This involves cleaning the data, removing errors and outliers, and formatting it so that it is easy to work with.
4. Data Exploration: This step involves exploring your data to understand it better. This includes looking at summary statistics, creating visualizations, and asking questions about the data.
5. Modeling: This step involves building models to predict or explain the data. You can use many different types of models, such as linear regression, logistic regression, and decision trees.
6. Evaluation: Once you have built your models, you must evaluate them to see how well they perform. This involves using metrics such as accuracy, precision, and recall to measure their performance.
7. Deployment: Once you have found a model that performs well, you need to deploy it so that it can be used to make predictions or decisions. This can involve creating a web application, a mobile app, or a dashboard.
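
The evaluation metrics named in step 6 apply to classification models and can be computed by hand from counts of true and false positives and negatives. The sketch below is illustrative only: the function name `classification_metrics` and the label lists are made up for this example and do not come from any of the project datasets.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # true negatives

    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is the share of all predictions that are correct, precision is the share of predicted positives that are truly positive, and recall is the share of true positives that the model found.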

## Data Collection
Data collection begins with identifying a reliable and accurate data source and downloading the dataset for examination. Next, the necessary libraries are imported; a library is a collection of pre-written code that performs specific tasks. Python has several robust libraries for data analysis and visualization.

Once the dataset is downloaded and the libraries are imported, the dataset can be read into a dataframe. The data is then checked, and the data cleaning process begins.
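
As a rough sketch of reading a dataset into a dataframe and running a first check, the example below uses a small in-memory CSV in place of a real downloaded file. The column names and values are hypothetical and made up for illustration; with a real dataset, the `io.StringIO(...)` argument would be replaced by the path to the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file (values are invented)
csv_text = """county,year,pm25,cmr
A,2009,9.8,250.1
A,2010,,245.3
B,2009,11.2,
B,2010,10.7,262.4
B,2010,10.7,262.4
"""

# Read the CSV into a dataframe
df = pd.read_csv(io.StringIO(csv_text))

# First checks that inform the data cleaning plan
print(df.shape)    # number of rows and columns
print(df.dtypes)   # data type of each column
print(df.head())   # first few rows

missing_per_column = df.isna().sum()           # count of missing values in each column
duplicate_rows = int(df.duplicated().sum())    # count of fully duplicated rows

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

Checks like these give a starting list for the cleaning plan: which columns have missing values, whether any rows are duplicated, and whether each column was read with the expected type.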

In [None]:
# Importing Google Drive Connection (For Colab use only)
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# Import the libraries
import numpy as np                  # Scientific Computing
import pandas as pd                 # Data Analysis
import matplotlib.pyplot as plt     # Plotting
import seaborn as sns               # Statistical Data Visualization

# Show every row and column when displaying a dataframe (output can get long for large datasets)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Force pandas to display full numbers instead of scientific notation
# pd.options.display.float_format = '{:.0f}'.format

# Suppress all warnings to keep notebook output clean (this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')
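
Setting `display.max_rows` to `None` globally can produce very long output for large dataframes. One alternative, shown here as a sketch rather than a required change, is `pd.option_context`, which applies a display setting only inside a `with` block. The example dataframe is made up for illustration.

```python
import pandas as pd

# Made-up dataframe with 100 rows
df = pd.DataFrame({"value": range(100)})

# Remember the current global setting
setting_before = pd.get_option("display.max_rows")

# Temporarily limit the display to 10 rows inside this block only
with pd.option_context("display.max_rows", 10):
    setting_inside = pd.get_option("display.max_rows")
    print(df)

# The global setting is restored once the block ends
setting_after = pd.get_option("display.max_rows")
```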