#Introduction
After following these steps, you'll be able to create a smaller subset of your dataset like the one below. The original dataset, from the Environmental Protection Agency, contains information about environmental pollutants in various regions across the United States. Using the Python techniques in this step-by-step guide, you'll narrow the data down to specific data points for further analysis.

###*Getting Started*
1. Navigate to the Environmental Protection Agency's site and download the Air Emissions CSV file.
2. Create a folder on your computer containing your dataset file. This folder will also store the subset you're going to create.
3. Open a new notebook in Google Colab and upload the CSV file.


Before starting, you'll need to know a few Python basics to create your finalized dataset.


**Python Basics**
1. Text made of letters and words is called a **string or str** and is entered in quotation marks. example: `"string"`
2. A whole number is called an **integer or int** and is entered without quotation marks. example: `1234`
3. To group multiple numbers or multiple strings into a list, put them in square brackets and separate each by a comma. example: `["hi", "bye"]` or `[1, 2]`
4. To combine strings and integers, put them in the same list, separated by commas. example: `["hello", 1, 2]`
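
As a quick check of these basics, the sketch below builds a string, an integer, and a list, then asks Python for the type of each. The variable names (`greeting`, `count`, `mixed`) are made up for illustration.

```python
# A string is text wrapped in quotation marks.
greeting = "hello"

# An integer is a whole number written without quotation marks.
count = 1234

# Square brackets create a list. Items are separated by commas,
# and one list can hold both strings and integers.
mixed = ["hello", 1, 2]

print(type(greeting).__name__)  # str
print(type(count).__name__)     # int
print(type(mixed).__name__)     # list
print(mixed[0])                 # hello (the first item in the list)
```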


###*NP and PD*
1. Add a code cell and begin by writing: `import numpy as np`
2. Following this, in the same code cell, write: `import pandas as pd`

The numpy package provides the mathematical functions that pandas relies on, while the pandas package lets us store and work with data in tables.

In [2]:
import numpy as np
import pandas as pd

##*Working with the Data*
Before we proceed, we need to read the CSV file into Python. Assign the data to a variable (`df`); I used `df` as short for DataFrame. The `read_csv` function loads the named CSV file into a DataFrame, which is then stored in that variable. In this case, the CSV file name is `air_emissions1.csv`.

In [3]:
df=pd.read_csv("air_emissions1.csv")
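
If you want to try `read_csv` before the real file is uploaded, the sketch below reads a small in-memory stand-in instead. The three rows are copied from the subset output shown later in this guide; the names `sample_csv` and `sample_df` are hypothetical, and the real file has more columns than this.

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file: one header line, then three data rows.
sample_csv = io.StringIO(
    "Facility Name,State where Emissions Occur,Reported County,Carbon emissions (non-biogenic)\n"
    "Ameren Illinois,IL,PEORIA COUNTY,1102.6\n"
    "Ameren Missouri,MO,ST. LOUIS CITY,17.9\n"
    "Atlanta Gas Light Company,GA,FULTON COUNTY,11793.4\n"
)

# read_csv accepts a file name or a file-like object and returns a DataFrame.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)  # (3, 4): three rows, four columns
print(sample_df.head())
```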

##*Identifying Data*
**Columns and Rows**

Using different functions, you can identify different types of data and pinpoint where each belongs.
Running the code `df.columns` displays every column's name.

In [4]:
df.columns

Index(['Facility Id', 'FRS Id', 'Facility Name', 'State where Emissions Occur',
       'Reported City', 'Reported State', 'Reported Zip Code',
       'Reported Address', 'Reported County', 'Reported Latitude',
       'Reported Longitude',
       'Total reported direct emissions from Local Distribution Companies',
       'Carbon emissions (non-biogenic)', 'Methane (CH4) emissions',
       'Nitrous Oxide (N2O) emissions',
       'Does the facility employ continuous emissions monitoring? '],
      dtype='object')

**Data Types**

Running the code `df.dtypes` displays the data type of each column. For example, a column of whole numbers shows `int64`, a column of decimal numbers shows `float64`, and a column of text shows `object`.

Knowing what type of data you are working with can be helpful when you are creating a dataset.

In [5]:
df.dtypes

Facility Id                                                            int64
FRS Id                                                               float64
Facility Name                                                         object
State where Emissions Occur                                           object
Reported City                                                         object
Reported State                                                        object
Reported Zip Code                                                      int64
Reported Address                                                      object
Reported County                                                       object
Reported Latitude                                                    float64
Reported Longitude                                                   float64
Total reported direct emissions from Local Distribution Companies    float64
Carbon emissions (non-biogenic)                                      float64
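
To see `.columns` and `.dtypes` side by side on something small, here is a minimal sketch. It builds a three-row DataFrame by hand, using values copied from the subset output in this guide; `sample_df` is a hypothetical name.

```python
import pandas as pd

# Three rows typed in by hand instead of read from the CSV file.
sample_df = pd.DataFrame({
    "Facility Name": ["Ameren Illinois", "Ameren Missouri", "Atlanta Gas Light Company"],
    "State where Emissions Occur": ["IL", "MO", "GA"],
    "Carbon emissions (non-biogenic)": [1102.6, 17.9, 11793.4],
})

# .columns lists the column names in order.
print(list(sample_df.columns))

# .dtypes reports one data type per column; decimal numbers are float64.
print(sample_df.dtypes)
```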

**Accessing Data Points**

Running the code `df.iloc[row, column]` displays the item at the row and column positions you identify. Positions are counted starting from 0.

This code allows you to access specific data points within a DataFrame. This capability is crucial for data manipulation and analysis tasks in Python, especially when working with large datasets.

I used row 4 and column 8. Because counting starts at 0, this is the fifth row of the data and the ninth column, `Reported County`.

In [6]:
df.iloc[4, 8]

'DALLAS COUNTY'
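
The sketch below shows how `.iloc` counts positions from 0, using the first five rows of the subset output shown in this guide. `sample_df` is a hypothetical name, and because this small table has only three columns, `Reported County` sits at column position 2 here rather than 8.

```python
import pandas as pd

# First five rows of the subset, typed in by hand.
sample_df = pd.DataFrame({
    "Facility Name": [
        "Ameren Illinois",
        "Ameren Missouri",
        "Atlanta Gas Light Company",
        "Atmos Energy Corporation - Colorado",
        "Atmos Energy Corporation - Kansas",
    ],
    "State where Emissions Occur": ["IL", "MO", "GA", "CO", "KS"],
    "Reported County": [
        "PEORIA COUNTY",
        "ST. LOUIS CITY",
        "FULTON COUNTY",
        "DALLAS COUNTY",
        "DALLAS COUNTY",
    ],
})

# Position 0 is the first row and the first column.
first = sample_df.iloc[0, 0]
print(first)  # Ameren Illinois

# Position 4 is the fifth row; position 2 is the third column.
value = sample_df.iloc[4, 2]
print(value)  # DALLAS COUNTY
```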

Another way to access data points is to pinpoint the column name of the data you want to see. Run the code `df["column name"]`. I chose the carbon emissions column to view the data connected to the locations. This can be used with any column name.

The column name should be entered as a str.

In [7]:
df["Carbon emissions (non-biogenic)"]

0       1102.6
1         17.9
2      11793.4
3         19.3
4         28.1
        ...   
160       72.6
161       37.9
162      162.6
163      444.6
164       55.9
Name: Carbon emissions (non-biogenic), Length: 165, dtype: float64

**Pinpointing Data Points within a Column**

Running the code `df["column name"][row start number:row end number]` displays the rows you need from the exact column you name. The row start number is included, but the row end number is not, so `[20:30]` returns rows 20 through 29.

The column name should be entered as a str and the row numbers should be entered as int.

In [8]:
df["Carbon emissions (non-biogenic)"][20:30]

20     35.1
21      0.7
22    674.7
23      1.4
24     19.2
25      2.0
26     16.9
27     23.7
28    200.4
29     81.2
Name: Carbon emissions (non-biogenic), dtype: float64
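
Here is a minimal sketch of the same kind of slice on a five-row column, using the first five carbon values from the output above. `sample_df`, `middle`, and `explicit` are hypothetical names. The second half uses `.iloc`, which is another way to ask for rows by position.

```python
import pandas as pd

sample_df = pd.DataFrame(
    {"Carbon emissions (non-biogenic)": [1102.6, 17.9, 11793.4, 19.3, 28.1]}
)

# [1:3] starts at row 1 and stops before row 3, so it returns rows 1 and 2.
middle = sample_df["Carbon emissions (non-biogenic)"][1:3]
print(middle)

# .iloc[1:3] selects the same rows by position.
explicit = sample_df["Carbon emissions (non-biogenic)"].iloc[1:3]
print(middle.equals(explicit))  # True
```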

##Final Data

This is where you will use the previous skills to create your final dataset. Since you only want to view certain categories, you need to identify which columns and rows are needed to create this set.

The code you will use to do this will be as follows:

`df.loc[row start number:row end number, [column 1, column 2, column 3, column 4]]`

The row start and end numbers should be entered as int, while the columns should be entered as strings. With `.loc`, the row end number is included in the result when that row exists.

In [15]:
df.loc[0:165,["Facility Name", "State where Emissions Occur", "Reported County", "Carbon emissions (non-biogenic)"]]

Unnamed: 0,Facility Name,State where Emissions Occur,Reported County,Carbon emissions (non-biogenic)
0,Ameren Illinois,IL,PEORIA COUNTY,1102.6
1,Ameren Missouri,MO,ST. LOUIS CITY,17.9
2,Atlanta Gas Light Company,GA,FULTON COUNTY,11793.4
3,Atmos Energy Corporation - Colorado,CO,DALLAS COUNTY,19.3
4,Atmos Energy Corporation - Kansas,KS,DALLAS COUNTY,28.1
...,...,...,...,...
160,Washington Gas Light Company (VA),VA,FAIRFAX COUNTY,72.6
161,"West Texas Gas Utility, LLC",TX,MIDLAND COUNTY,37.9
162,Wisconsin Power & Light Gas Distribution,WI,DANE COUNTY,162.6
163,Wisconsin Public Service Corporation,WI,BROWN COUNTY,444.6
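
As a small-scale sketch of the `In [15]` cell, the code below selects rows and columns with `.loc` from a five-row table built by hand from the output above. `sample_df` and `subset` are hypothetical names. Note that the row range goes before the comma and the list of column names goes after it.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Facility Name": [
        "Ameren Illinois",
        "Ameren Missouri",
        "Atlanta Gas Light Company",
        "Atmos Energy Corporation - Colorado",
        "Atmos Energy Corporation - Kansas",
    ],
    "State where Emissions Occur": ["IL", "MO", "GA", "CO", "KS"],
    "Carbon emissions (non-biogenic)": [1102.6, 17.9, 11793.4, 19.3, 28.1],
})

# Rows 0 through 2 (the end row is included) and two of the three columns.
subset = sample_df.loc[0:2, ["Facility Name", "Carbon emissions (non-biogenic)"]]

print(subset.shape)  # (3, 2)
print(subset)
```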


##Exporting your New Subset
After creating your new subset from the original data, you need to assign it to a variable so that it can be identified and exported. I named this `df_subset`. This code will create a new DataFrame, `df_subset`, containing the specified columns and rows from your original DataFrame, `df`.

Now, it is ready to be exported as a CSV file for use.

To do this, we can use the method `.to_csv()`. Inside the parentheses, give the new file name followed by the argument `index=False`, which tells pandas not to write the DataFrame's row numbers into the file as an extra column.

In [18]:
df_subset = df.loc[0:165,["Facility Name", "State where Emissions Occur", "Reported County", "Carbon emissions (non-biogenic)"]]

In [20]:
df_subset.to_csv("co2.csv", index=False)
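
To see what `index=False` changes, the sketch below calls `.to_csv()` without a file name, which returns the CSV text instead of saving a file. The small `subset` DataFrame and the variable names are hypothetical; the values are copied from the output above.

```python
import pandas as pd

subset = pd.DataFrame({
    "Facility Name": ["Ameren Illinois", "Ameren Missouri"],
    "Carbon emissions (non-biogenic)": [1102.6, 17.9],
})

# With the default settings, the row numbers are written as a first, unnamed column.
with_index = subset.to_csv()
print(with_index.splitlines()[0])     # ,Facility Name,Carbon emissions (non-biogenic)

# With index=False, only the data columns are written.
without_index = subset.to_csv(index=False)
print(without_index.splitlines()[0])  # Facility Name,Carbon emissions (non-biogenic)
```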