# Setting up our Code Environment
Congratulations on getting your Integrated Development Environment (IDE) installed. Let's walk through some of the basics of your environment. Jupyter Notebook and the browser-based JupyterLab are block editors (Jupyter calls its blocks "cells"). Many block editors are available, and Jupyter Notebook is just one. Feel free to use the one that you are most familiar with. There are two main types of blocks we will use: Markdown and Code. <br>

## Introduction to Jupyter Notebook
The **Markdown** block allows a block of text and images to be placed in the Notebook. This is essential to your work because it lets you keep your observations right next to the code and results they describe. Using this method, you will not have to create a separate document or capture screenshots of the code. <br>

* Markdown accepts HTML code to format the text. The `<br>` is a manual line break in markdown.
* The asterisk begins an unordered list similar to the `<ul>` in HTML.
* The `#` is used to create headers. A single `#` is the largest, and adding multiple `#` symbols creates nested headers within the larger header. This helps you navigate to a specific section using the Table of Contents.

The **Code** block lets you write one or more lines of code. Everything in it is read as code, so plain text that is not preceded by the `#` symbol will usually raise an error. The `#` symbol starts a comment: Python ignores everything from the `#` to the end of that line. It is essential to comment on what you are doing, and comments can help direct your thoughts through a process requiring multiple lines of code. <br>
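
As a minimal illustration of comments in a Code block, consider the sketch below. The variable names and values are made up for this example.

```python
# A full-line comment: Python ignores this entire line.
readings = [12.5, 30.1, 18.4]  # an inline comment after a line of code

# Comments can outline the steps of a multi-line process.
# Step 1: add up the readings.
total = sum(readings)
# Step 2: divide by the number of readings to get the mean.
mean_reading = total / len(readings)

print(round(mean_reading, 2))
```

Removing the `#` from any of the comment lines would make Python try to run the sentence as code, which typically ends in a `SyntaxError`.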

As you can see above, we have designated our code kernel as Python 3. You can change this to other versions of Python or to another language. JupyterLab can also run R, and kernels are available for other languages such as Julia and even C++ or Ruby.

The **Docstring** is a string placed in a code block that can span several lines. It is usually used to convey longer text describing what a process will do or the expected outcome. Docstrings are traditionally added as the first statement of a user-defined function to document what the function is intended to do. To define the start of the docstring, we use three single quotes in a row, type the information in the docstring, and then close with three single quotes. Three double quotes work the same way. <br>
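
To show how a docstring documents a user-defined function, here is a small sketch. The function `celsius_to_fahrenheit` is a hypothetical example, and it uses triple double quotes, which behave the same as triple single quotes.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit.

    Expected outcome: the converted temperature is returned as a number.
    """
    return celsius * 9 / 5 + 32


# Python stores the docstring on the function object.
print(celsius_to_fahrenheit.__doc__)

# help() displays the same text along with the function signature.
help(celsius_to_fahrenheit)
```

Because the docstring is the first statement inside the function, tools such as `help()` can find and display it.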

You can also embed images in your Markdown text if needed to illustrate a point. For instance, if you were taking notes in the Markdown fields and wanted to embed a screenshot of a slide from our lecture, you could do that. I will create a separate document to illustrate image embedding for review offline.

### Text suggestions in Jupyter
One very useful feature of Jupyter Notebook and JupyterLab is predictive text. When working with a new method, you can type the start of the method name and press the Tab key; this will offer suggestions for methods that match what you have typed. When you hover over the method name with your mouse (or, in many Jupyter versions, press Shift+Tab with the cursor on the name), you get additional information from the developer documentation, including the arguments and keyword arguments (args and kwargs) for the method.
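
The same information can also be requested from code, which may help when the pop-ups are unavailable. The sketch below uses Python's built-in `dir()` and the standard `inspect` module; the function `scale` is a hypothetical example.

```python
import inspect

# List the string methods whose names start with "up", similar to
# typing "up" after a string variable and pressing Tab.
matches = [name for name in dir(str) if name.startswith("up")]
print(matches)


def scale(value, factor=2, *args, **kwargs):
    """Multiply value by factor (hypothetical example function)."""
    return value * factor


# Show the arguments and keyword arguments the function accepts.
print(inspect.signature(scale))

# Show the function's documentation text.
print(inspect.getdoc(scale))
```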

Several packages will be used throughout our work in this course, and this notebook serves as the environment setup. You only need to run it once, but the saved notebook can recreate the environment on a new installation. Let's start with pip installs of numpy, pandas, and matplotlib.

In [8]:
'''This is the beginning of the docstring
This function will perform the desired action.
The expected outcome is...'''
x = 7
x

7

In [10]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [12]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install matplotlib

Note: you may need to restart the kernel to use updated packages.
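
After the installs finish (and after restarting the kernel if prompted), a quick check like the sketch below can confirm that Python can find each package. It relies only on the standard library's `importlib.util.find_spec`; the `required` list simply repeats the three packages installed above.

```python
import importlib.util

# Packages this notebook installs with pip.
required = ["numpy", "pandas", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can be installed again with the matching pip command.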


## Developing a Research Question
The Data Science process begins with a robust research question. This topic should be exciting and incorporate domain knowledge from your expertise. The research question should be relevant to a historical or current event that prompts a study or requires further inquiry.

- Do some preliminary research on your topic to learn more about it. This will help you narrow your focus and identify potential research questions.
- Narrow your topic down to a specific area you want to investigate. This will help to make your research question more manageable and feasible.
- Formulate several research questions that you could answer about your topic. Be sure to ask open-ended questions that cannot be answered with a simple yes or no.
- Evaluate your research questions to make sure that they are:
  - Specific: They should be focused on a specific aspect of your topic.
  - Researchable: There should be enough information available to answer them.
  - Feasible: They should be possible to answer within the scope of your research project.
  - Interesting: They should be something that you are motivated to answer.

Choose one or two research questions to focus on for your project. Revise your questions as needed as you learn more about your topic.

Here are some additional tips for developing a robust research question:

- Start with a question that you are genuinely curious about. This will make your research more enjoyable and productive.
- Be sure to consider your audience when formulating your research question. Who are you writing for? What do they need to know?
- Use clear and concise language in your research question. Make sure the question is accessible to a broad audience with varying levels of understanding. Avoid jargon and technical terms that your audience may not understand.
- Be sure to define your terms. What do you mean by the key terms in your research question?
- Use a variety of sources to inform your research question. This will help you gain a well-rounded perspective on your topic.

# Problem Statement
**Environmental Racism** includes the intentional pollution of the environments where predominantly Black, Indigenous, and other People of Color (BIPOC) communities reside. These communities tend to be impoverished or low-income and have frequently been redlined by local, state, or federal governments. They are often further marginalized by lower wages resulting from undereducation in schools with low ratings or low instructor retention rates.
<br>
***
<br>

1. <font color=#003A63>*In your role as a member of the Data Team, please formulate a research question that addresses **environmental racism** and its effect on the surrounding communities.*</font>

2. <font color=#003A63>*Explore and curate at least one data set to support your research question. Examples of some datasets about **environmental racism** are below*.

> Link to __[Environmental Racism Datasets](https://colab.research.google.com/drive/17usNiFPJ1jilIsn2RE9NN7FmWSKgbN7Z#scrollTo=LH-F6wzIak6f)__</font>

3. <font color=#003A63>*Clean the dataset. This is a critical step in the analysis of the data. Explore the size of the data and the types of data present, and look for gaps and missing values that will affect the analysis. Determine the next steps to ensure the data can be analyzed.*</font>

4. <font color=#003A63>*Analyze the dataset. Data Scientists use their knowledge of statistical methods to analyze the data and identify relationships between the variables. This step allows the team to determine the best way to prepare the data for a machine learning algorithm and select the appropriate model(s) to fully explore the research question.*</font>

5. <font color=#003A63>*Interpret and report the results of the data analysis and/or machine learning models. Communicate the results using charts and figures as well as words to convey the findings to other researchers, policymakers, and business partners. Be sure to identify the implications of the findings for future research studies.*</font>

6. <font color=#003A63>*Scientific research is reproducible. This means that other researchers should be able to use the same techniques and models to derive the same or similar results. This adds a layer of transparency to the research findings and permits peer review, identifying any gaps in the conclusions as well as celebrating the impact of the research performed. Additional team members and those outside of the team should be able to run the code produced and arrive at similar results. To ensure the same results for machine learning models, consider setting a fixed random state (seed), which generates the same partition of training, validation, and test data on every run, as opposed to randomly generated splits, which would yield similar but different results for consecutive runs.*</font>
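
The effect of a fixed random state in step 6 can be sketched with Python's standard `random` module. The helper `split_rows`, the seed value 42, and the 60/20/20 proportions are assumptions chosen for illustration. Libraries such as scikit-learn expose the same idea through a `random_state` parameter.

```python
import random


def split_rows(rows, seed, train_frac=0.6, val_frac=0.2):
    """Shuffle rows with a fixed seed, then split into train/validation/test."""
    shuffled = list(rows)
    # A Random instance created with the same seed shuffles the same way every time.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


rows = list(range(10))  # stand-in for ten rows of data
first_run = split_rows(rows, seed=42)
second_run = split_rows(rows, seed=42)

print(first_run == second_run)  # True: the same seed gives the same partition
```

Calling `split_rows` with a different seed will generally produce a different partition, which is why the seed should be recorded alongside the results.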

#### Example of a research topic
A study of environmental racism could geolocate communities of color, stratified by racial identification and ethnicity, and compare them with the locations where pollution concentration is highest. This would require defining pollution for the purposes of the study, defining the racial and ethnic groups comprising the BIPOC community, and setting the criteria for pollution levels.

## Potential subtopics
* The intersectionality of public health disparities, such as reactive airway disorders and air quality ratings.
* The frequency of unmanageable or untreatable asthma, chronic obstructive pulmonary disease (COPD), and measured particulate matter in air.
* Geolocation of measured particulate matter, ozone, nitrogen dioxide, and sulfur dioxide in communities of color.
* Location of Superfund Sites and occurrences of childhood cancer within a fifty-mile radius.
* Spatial study of cancer clusters by diagnosis type within a radius of Superfund Sites.
* Longitudinal study of groundwater contamination from mining wastewater.
* Heavy mineral and metals poisoning of surface and groundwater from mining sludge and wastewater.
* The sources and effects of heavy mineral pollution of soil and agricultural commodities for animal consumption.
* The relative effects on aquatic life and native niche plants from lithium mining used for electric vehicle batteries.
* Water consumption for extinguishing thermal runaway fires from lithium-ion batteries.
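
Several of these subtopics depend on a distance test, such as whether a community lies within a fifty-mile radius of a site. One common way to approximate this is the haversine (great-circle) formula, sketched below. The coordinates are hypothetical, and the Earth radius is an approximate mean value.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical coordinates (latitude, longitude), for illustration only.
site = (30.0, -90.0)
communities = {"A": (30.2, -90.1), "B": (32.0, -93.0)}

# Flag each community as inside or outside the fifty-mile radius.
within_radius = {
    name: haversine_miles(*site, *coords) <= 50
    for name, coords in communities.items()
}
print(within_radius)
```

For a real study, a dedicated geospatial library and a suitable map projection would likely be a better choice than this approximation.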

## Data Definition

* **Air Quality Measures on the National Environmental Health Tracking Network.** <br>
Last Updated: July 20, 2023. <br>
https://catalog.data.gov/dataset/air-quality-measures-on-the-national-environmental-health-tracking-network<br>
This dataset combines monitoring data from the Environmental Protection Agency (EPA) Air Quality System (AQS) database, which covers approximately 4,000 monitoring stations around the country, mainly in urban areas, with modeled predictions. Data from the AQS is considered the "gold standard" for determining outdoor air pollution. The Centers for Disease Control and Prevention (CDC) and EPA have worked together to develop a statistical model (Downscaler) to make modeled predictions available for environmental public health tracking purposes in areas of the country that do not have monitors and to fill in the time gaps when monitors may not be recording data.

* **Global Fire Emissions Database, Version 4.1 (GFEDv4)** <br>
Last Updated: July 27, 2023 <br>
https://catalog.data.gov/dataset/global-fire-emissions-database-version-4-1-gfedv4 <br>
This dataset provides global estimates of monthly burned area, monthly emissions and fractional contributions of different fire types. National Aeronautics and Space Administration (NASA) emissions data are available for carbon (C), dry matter (DM), carbon dioxide (CO2), carbon monoxide (CO), methane (CH4), hydrogen (H2), nitrous oxide (N2O), nitrogen oxides (NOx), non-methane hydrocarbons (NMHC), organic carbon (OC), black carbon (BC), particulate matter less than 2.5 microns (PM2.5), total particulate matter (TPM), and sulfur dioxide (SO2) among others. These data are yearly totals by region, globally, and by fire source for each region.

* **PM2.5 and cardiovascular mortality rate data: Trends modified by county socioeconomic status in 2,132 US counties** <br>
Last Updated: November 12, 2020 <br>
https://catalog.data.gov/dataset/annual-pm2-5-and-cardiovascular-mortality-rate-data-trends-modified-by-county-socioeconomi <br>
This U.S. Environmental Protection Agency dataset contains county socioeconomic status for 2,132 US counties, along with each county’s average annual cardiovascular mortality rate (CMR) and total PM2.5 concentration for 21 years (1990-2010). County CMR, PM2.5, and socioeconomic data were obtained from the U.S. National Center for Health Statistics, U.S. Environmental Protection Agency’s Community Multiscale Air Quality modeling system, and the U.S. Census, respectively.

* **Superfund Site Information**<br>
Last Updated: May 17, 2021<br>
https://catalog.data.gov/dataset/superfund-site-information <br>
This U.S. Environmental Protection Agency asset includes a number of individual data sets related to site-specific information for Superfund. These contain the basic site description, location, schedule of activities, enforcement and settlement data, contaminants, selected remedy, and much more, as well as the records that clearly document site decisions. This asset also includes sampling data and lab results (CLPSS, EDDs), redevelopment and technical assistance case studies, site reuse and land revitalization information, EPAOSC.net information, Superfund Technical Assistance Grants information, site management information records (RODs, Remediation plans, cleanup directives), contract management information, and more.

* **Superfund cleanups and children’s lead exposure in six states** <br>
Last Updated: July 26, 2021 <br>
https://catalog.data.gov/dataset/superfund-cleanups-and-childrens-lead-exposure-in-six-states <br>
Data for the study include restricted access and non-restricted access files. Restricted access files include individual children's blood lead data from six states, property assessment data from Zillow, Inc., and Census tract characteristics processed by GeoLytics. This dataset includes contaminated site locations and characteristics (Superfund, brownfields, and RCRA sites), ambient air lead concentrations, state-month average temperatures, and vehicle miles traveled in 1980.

* **EPA Region 6 REAP Sustainability Geodatabase** <br>
Last Updated: November 10, 2020 <br>
https://catalog.data.gov/dataset/epa-region-6-reap-sustainability-geodatabase <br>
The Regional Ecological Assessment Protocol (REAP) is a screening level assessment tool created as a way to identify priority ecological resources within the five EPA Region 6 states (Arkansas, Louisiana, New Mexico, Oklahoma, and Texas). The REAP divides eighteen individual measures into three main sub-layers: diversity, rarity, and sustainability. This geodatabase contains the 2 grids (sustain and sustainrank) representing the sustainability layer, which describes the state of the environment in terms of stability (sustainable areas are those that can maintain themselves into the future without human management). There are eleven measures that make up the sustainability layer: contiguous land cover, regularity of ecosystem boundary, appropriateness of land cover, waterway obstruction, road density, airport noise, Superfund sites, Resource Conservation and Recovery Act (RCRA) sites, water quality, air quality, and urban/agriculture disturbance.

# Data Science Process
The data science process consists of seven key steps. Each week, we will complete another step of the process. The work we complete in class each week will structure the work you will complete on the group project. This week you will work with your dataset file, load data into the Python environment, and make a plan for data cleaning steps. <br>

The details of each step will vary depending on the specific problem you are trying to solve, but the general process stays the same:
1. Problem Framing: This is the first and most important step in the data science process. It involves understanding the research question that you are trying to answer and defining the specific questions that you want to address with data.
2. Data Collection: Once you have defined your problem, you need to acquire the data you need to answer your questions. This can involve collecting data from various sources, such as surveys, databases, and social media.
3. Data Cleaning: Once you have acquired your data, you must prepare it for analysis. This involves cleaning the data, removing errors and outliers, and formatting it so that it is easy to work with.
4. Data Exploration: This step involves exploring your data to understand it better. This includes looking at summary statistics, creating visualizations, and asking questions about the data.
5. Modeling: This step involves building models to predict or explain the data. You can use many different types of models, such as linear regression, logistic regression, and decision trees.
6. Evaluation: Once you have built your models, you must evaluate them to see how well they perform. This involves using metrics such as accuracy, precision, and recall to measure their performance.
7. Deployment: Once you have found a model that performs well, you need to deploy it so that it can be used to make predictions or decisions. This can involve creating a web application, a mobile app, or a dashboard.
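
To make the evaluation metrics in step 6 concrete, the sketch below computes accuracy, precision, and recall by hand for a binary classifier. The labels and predictions are made-up values, not results from any dataset in this course.

```python
# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are correct
recall = tp / (tp + fn)            # share of actual positives that were found

print(accuracy, precision, recall)  # 0.7 0.75 0.6
```

The three numbers differ because each metric weighs the two kinds of error differently, so the best metric depends on which error is more costly for the research question.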

## Data Collection
Data collection begins with identifying a reliable and accurate data source and using tools to download the dataset for examination. Next, the necessary libraries are imported; these contain pre-written code that performs specific tasks. Python has several libraries, such as pandas, NumPy, and Matplotlib, that are robust data analysis and visualization tools.

Once the dataset is downloaded and the libraries are imported, the dataset can be read into a dataframe. The data is then checked, and the data cleaning process begins.
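
A minimal sketch of these steps with pandas might look like the following. The CSV text, county names, and values are invented so the example runs without a downloaded file; with a real dataset, the file path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# A tiny, made-up CSV standing in for a downloaded dataset file.
csv_text = """county,year,pm25
Adams,2009,9.8
Adams,2010,
Baker,2009,11.2
Baker,2010,10.7
"""

# Read the data into a dataframe.
df = pd.read_csv(io.StringIO(csv_text))

# First checks before cleaning: size, data types, and missing values.
print(df.shape)             # (rows, columns)
print(df.dtypes)            # data type of each column
missing = df.isna().sum()   # count of missing values per column
print(missing)
```

Here the missing `pm25` value for one row is the kind of gap that the cleaning plan needs to address, for example by dropping the row or filling in the value.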