# AIR-QUALITY HISTORY OF USA-CITIES

## Abstract:

The United States of America (USA), also known as the United States (U.S. or US) or America, is a country mainly located in North America. It consists of 50 states, a federal district (Washington, D.C.) and five major territories, and had a population of 328 million people in 2019.
According to the 2020 figures on the IQAir website, the US ranked 84th out of 106 countries. Its annual average was US AQI 40, compared with US AQI 162 for Bangladesh, the most polluted country.

The cleanest city was Waimea, Hawaii with a figure of just 9, whereas the most polluted city was Yosemite Lakes, California with a figure of 107.

<img src='https://rdamp.com/wp-content/uploads/sites/10/2019/11/PM2.5-air-pollution-still-kills-thousands-in-the-United-States-1024x536.jpg'/>

## Air-Pollution:

Air pollution is the presence of substances in the atmosphere that are harmful to the health of humans and other living beings, or cause damage to the climate or to materials. There are many different types of air pollutants, including gases (such as **ammonia (NH3)**, **carbon monoxide (CO)**, **sulfur dioxide (SO2)**, **nitrous oxide (N2O)**, **methane (CH4)** and **chlorofluorocarbons (CFCs)**), particulates (both organic and inorganic), and biological molecules. Air pollution may cause diseases, allergies and even death in humans; it may also harm other living organisms such as animals and food crops, and may damage the natural environment (for example, climate change, ozone depletion or habitat degradation) or the built environment (for example, acid rain). Both human activity and natural processes can generate air pollution.

## Our Goal:
Our goal is to create a dataset from an API, and I chose RapidAPI for this task. The dataset will hold the air pollution history of US cities, with one record per hour over 3 continuous days. To get the US cities with their latitude and longitude values, I will use a dataset from Kaggle.
 

### Importing the required libraries

In [32]:
import requests
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

### Loading the dataset from Kaggle:
https://www.kaggle.com/mahbubrob/usa-cities

In [28]:
# Loading the dataset('city') from kaggle
city=pd.read_csv("C:/Users/sushr/Downloads/USA-Cities/usa_cities.csv")

In [30]:
city.head()

Unnamed: 0,city,city_ascii,lat,lng,pop,country,iso2,iso3,province,abbr
0,Calais,Calais,45.165989,-67.242392,1781.5,United States of America,US,USA,Maine,
1,Houlton,Houlton,46.125517,-67.83972,6051.5,United States of America,US,USA,Maine,
2,Presque Isle,Presque Isle,46.793409,-68.002165,9466.0,United States of America,US,USA,Maine,
3,Bar Harbor,Bar Harbor,44.387897,-68.204375,4483.5,United States of America,US,USA,Maine,
4,Bangor,Bangor,44.801153,-68.778345,40843.0,United States of America,US,USA,Maine,


In [31]:
# Checking the shape of dataframe('city')
city.shape

(769, 10)

## Getting the data from API

https://rapidapi.com/weatherbit/api/air-quality/

The API provides three endpoints: Air Quality History, Current Air Quality and Air Quality Forecast. The 'Air Quality History' endpoint was selected for this analysis. Under this endpoint, RapidAPI offers an unlimited number of requests per month for a free account.

The request only needs the latitude and longitude of a place. These values are used to fetch the air quality history of US cities and are looked up in the usa-cities dataframe from Kaggle.

## Description of Data

- lat: Latitude (Degrees).
- lon: Longitude (Degrees).
- timezone: Local IANA Timezone.
- city_name: Nearest city name.
- country_code: Country abbreviation.
- state_code: State abbreviation/code (see https://en.wikipedia.org/wiki/ISO_3166-2:US).

- timestamp_local: Timestamp at local time.
- timestamp_utc: Timestamp at UTC time.
- ts: Unix Timestamp at UTC time.
- aqi: Air Quality Index (US EPA standard, 0 to 500+)
- o3: Concentration of surface O3 (µg/m³)
- so2: Concentration of surface SO2 (µg/m³)
- no2: Concentration of surface NO2 (µg/m³)
- co: Concentration of carbon monoxide (µg/m³)
- pm25: Concentration of particulate matter < 2.5 microns (µg/m³)
- pm10: Concentration of particulate matter < 10 microns (µg/m³)
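
To make the `aqi` field easier to read, the sketch below maps a US AQI value to its EPA category name. The category boundaries are the standard EPA ones; the helper name `aqi_category` and the choice to assign fractional values by upper bound are assumptions made for illustration and are not part of the API.

```python
# Upper bounds of the US EPA AQI categories, in ascending order.
AQI_CATEGORIES = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(aqi):
    """Return the EPA category name for a US AQI value (hypothetical helper)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # Assumption: a fractional value is assigned to the first category
    # whose upper bound it does not exceed.
    for upper_bound, name in AQI_CATEGORIES:
        if aqi <= upper_bound:
            return name
    return "Hazardous"


# The four AQI figures quoted in the abstract.
for value in (9, 40, 107, 162):
    print(value, aqi_category(value))
```

With these boundaries, the figures quoted in the abstract (9, 40, 107 and 162) fall into "Good", "Good", "Unhealthy for Sensitive Groups" and "Unhealthy" respectively.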

### Creating the dataframe from RapidAPI, using the latitude & longitude from the Kaggle dataset('city') as parameters

In [34]:
# creating an empty pandas dataframe ("usa_cities")
usa_cities = pd.DataFrame()

# Keep the range of i till length of the kaggle dataframe("city")
for i in range(len(city)):
        print(i)

        url = "https://air-quality.p.rapidapi.com/history/airquality"

        querystring = {"lon":(city.loc[i,"lng"]),"lat":(city.loc[i,"lat"])}

        headers = {
        'x-rapidapi-key': "78ac36594fmsh463a4013612e25ep152223jsn647171b1d90f",
        'x-rapidapi-host': "air-quality.p.rapidapi.com"
        }

        respons = requests.request("GET", url, headers=headers, params=querystring)
        respons=respons.json()
        respons=pd.json_normalize(data=respons,record_path="data",meta=["city_name","lon","timezone","lat","country_code","state_code"])

        # Appending the response dataframe ("respons") to the accumulated dataframe ("usa_cities");
        # pd.concat is used because DataFrame.append was removed in pandas 2.0
        usa_cities = pd.concat([usa_cities, respons], ignore_index=True)

0
1
2
...

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Here we can see that the loop stops with an error after 697 rows: the JSONDecodeError means that the response body for that request was not valid JSON. So in the **next step** we will limit the range to 697 instead of len(city). This means we will collect the air quality history from the API for 697 US cities instead of all 769.
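
As an alternative to cutting the range, the loop could skip bodies that fail to decode and record which rows failed. The following is a minimal sketch of that idea using only the standard library; `parse_body` and `split_responses` are hypothetical helper names, and the bodies are hand-written stand-ins rather than real API output. In the notebook's loop, the text of each response (the `text` attribute of a `requests` response) could be passed in instead.

```python
import json


def parse_body(body_text):
    """Return the decoded JSON payload, or None if the body is not valid JSON."""
    try:
        return json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def split_responses(bodies):
    """Separate decodable payloads from the row positions that failed."""
    payloads, failed_rows = [], []
    for row, body_text in enumerate(bodies):
        payload = parse_body(body_text)
        if payload is None:
            failed_rows.append(row)
        else:
            payloads.append(payload)
    return payloads, failed_rows


# Hand-written stand-ins for response bodies. Decoding the empty string raises
# "Expecting value: line 1 column 1 (char 0)", the same message as above.
bodies = [
    '{"city_name": "Bangor", "data": []}',
    "",
    '{"city_name": "Calais", "data": []}',
]
payloads, failed_rows = split_responses(bodies)
print(len(payloads), failed_rows)
```

Keeping the list of failed rows makes it possible to retry just those cities later instead of discarding everything after the first failure.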

In [36]:
# creating an empty pandas dataframe ("usa_cities")
usa_cities = pd.DataFrame()

# Limit the range of i to the first 697 rows of the kaggle dataframe("city")
for i in range(697):

        url = "https://air-quality.p.rapidapi.com/history/airquality"

        querystring = {"lon":(city.loc[i,"lng"]),"lat":(city.loc[i,"lat"])}

        headers = {
        'x-rapidapi-key': "78ac36594fmsh463a4013612e25ep152223jsn647171b1d90f",
        'x-rapidapi-host': "air-quality.p.rapidapi.com"
        }

        respons = requests.request("GET", url, headers=headers, params=querystring)
        respons=respons.json()
        respons=pd.json_normalize(data=respons,record_path="data",meta=["city_name","lon","timezone","lat","country_code","state_code"])

        # Appending the response dataframe ("respons") to the accumulated dataframe ("usa_cities");
        # pd.concat is used because DataFrame.append was removed in pandas 2.0
        usa_cities = pd.concat([usa_cities, respons], ignore_index=True)
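
The flattening step inside the loop can be tried offline on a hand-written payload. The sketch below assumes the response has the shape implied by the `json_normalize` call: top-level location fields plus a `data` list of hourly records. The values are illustrative and are not real API output.

```python
import pandas as pd

# Hand-written payload mimicking the shape the loop expects:
# location fields at the top level and hourly records under "data".
payload = {
    "city_name": "Example City",
    "lon": -67.24,
    "timezone": "America/New_York",
    "lat": 45.17,
    "country_code": "US",
    "state_code": "ME",
    "data": [
        {"aqi": 21, "pm25": 0.2, "ts": 1624482000},
        {"aqi": 22, "pm25": 0.1, "ts": 1624478400},
    ],
}

# record_path selects the list to expand into rows;
# meta repeats the top-level fields on every row.
flat = pd.json_normalize(
    data=payload,
    record_path="data",
    meta=["city_name", "lon", "timezone", "lat", "country_code", "state_code"],
)
print(flat.shape)
print(list(flat.columns))
```

Each hourly record becomes one row, and the location fields are copied onto every row, which is why the final dataframe repeats `city_name`, `lat` and `lon` for each hour.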

In [37]:
usa_cities

Unnamed: 0,aqi,pm10,pm25,o3,timestamp_local,so2,no2,timestamp_utc,datetime,co,ts,city_name,lon,timezone,lat,country_code,state_code
0,21.8,0.342353,0.227110,47.5560,2021-06-23T17:00:00,0.051863,0.468337,2021-06-23T21:00:00,2021-06-23:21,265.944,1624482000,Whitlocks Mill,-67.24,America/New_York,45.17,US,ME
1,22.8,0.322100,0.182965,49.8604,2021-06-23T16:00:00,0.042550,0.371042,2021-06-23T20:00:00,2021-06-23:20,266.528,1624478400,Whitlocks Mill,-67.24,America/New_York,45.17,US,ME
2,23.6,0.302576,0.141753,51.7713,2021-06-23T15:00:00,0.039872,0.274416,2021-06-23T19:00:00,2021-06-23:19,268.781,1624474800,Whitlocks Mill,-67.24,America/New_York,45.17,US,ME
3,24.8,0.282950,0.106047,53.9326,2021-06-23T14:00:00,0.045402,0.182207,2021-06-23T18:00:00,2021-06-23:18,270.533,1624471200,Whitlocks Mill,-67.24,America/New_York,45.17,US,ME
4,26.0,0.278646,0.081059,56.5946,2021-06-23T13:00:00,0.056811,0.096057,2021-06-23T17:00:00,2021-06-23:17,274.539,1624467600,Whitlocks Mill,-67.24,America/New_York,45.17,US,ME
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50467,34.0,0.163128,0.096507,74.4760,2021-06-20T18:00:00,0.206521,0.425061,2021-06-21T02:00:00,2021-06-21:02,252.008,1624240800,Fairbanks,-147.71,America/Anchorage,64.84,US,AK
50468,35.0,0.188700,0.124921,75.2807,2021-06-20T17:00:00,0.248197,0.326327,2021-06-21T01:00:00,2021-06-21:01,251.174,1624237200,Fairbanks,-147.71,America/Anchorage,64.84,US,AK
50469,36.0,0.223833,0.155735,77.1582,2021-06-20T16:00:00,0.336673,0.323984,2021-06-21T00:00:00,2021-06-21:00,255.346,1624233600,Fairbanks,-147.71,America/Anchorage,64.84,US,AK
50470,36.0,0.240127,0.169641,77.1582,2021-06-20T15:00:00,0.397675,0.348751,2021-06-20T23:00:00,2021-06-20:23,255.346,1624230000,Fairbanks,-147.71,America/Anchorage,64.84,US,AK


In [38]:
# checking the duplicate values
usa_cities.duplicated().sum()

0

In [39]:
#Checking the NaN values
usa_cities.isna().sum()

aqi                0
pm10               0
pm25               0
o3                 0
timestamp_local    0
so2                0
no2                0
timestamp_utc      0
datetime           0
co                 0
ts                 0
city_name          0
lon                0
timezone           0
lat                0
country_code       0
state_code         0
dtype: int64

In [40]:
# Dropping the columns ('timestamp_local' & 'country_code')
usa_cities.drop(columns=["timestamp_local","country_code"],inplace=True)

In [41]:
usa_cities.head()

Unnamed: 0,aqi,pm10,pm25,o3,so2,no2,timestamp_utc,datetime,co,ts,city_name,lon,timezone,lat,state_code
0,21.8,0.342353,0.22711,47.556,0.051863,0.468337,2021-06-23T21:00:00,2021-06-23:21,265.944,1624482000,Whitlocks Mill,-67.24,America/New_York,45.17,ME
1,22.8,0.3221,0.182965,49.8604,0.04255,0.371042,2021-06-23T20:00:00,2021-06-23:20,266.528,1624478400,Whitlocks Mill,-67.24,America/New_York,45.17,ME
2,23.6,0.302576,0.141753,51.7713,0.039872,0.274416,2021-06-23T19:00:00,2021-06-23:19,268.781,1624474800,Whitlocks Mill,-67.24,America/New_York,45.17,ME
3,24.8,0.28295,0.106047,53.9326,0.045402,0.182207,2021-06-23T18:00:00,2021-06-23:18,270.533,1624471200,Whitlocks Mill,-67.24,America/New_York,45.17,ME
4,26.0,0.278646,0.081059,56.5946,0.056811,0.096057,2021-06-23T17:00:00,2021-06-23:17,274.539,1624467600,Whitlocks Mill,-67.24,America/New_York,45.17,ME


In [50]:
# checking the unique values of all columns
usa_cities.nunique()

aqi             475
pm10          45262
pm25          45264
o3             4970
so2            4729
no2            7476
co             2275
ts               72
city_name       665
lon             657
timezone         14
lat             598
state_code       50
dtype: int64
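
The 72 unique `ts` values match 3 days of 24 hourly records. A possible follow-up check, sketched here on a small stand-in dataframe rather than the real `usa_cities`, is to count the distinct hours per location and list any location that does not have all 72. The constant `EXPECTED_HOURS` and the sample values are assumptions for illustration.

```python
import pandas as pd

EXPECTED_HOURS = 72  # 3 days x 24 hourly records per location

# Small stand-in for usa_cities: the first location is complete,
# the second one is missing two hours.
sample = pd.DataFrame(
    {
        "lat": [45.17] * EXPECTED_HOURS + [64.84] * 70,
        "lon": [-67.24] * EXPECTED_HOURS + [-147.71] * 70,
        "ts": list(range(EXPECTED_HOURS)) + list(range(70)),
    }
)

# Number of distinct timestamps recorded for each (lat, lon) pair.
hours_per_location = sample.groupby(["lat", "lon"])["ts"].nunique()

# Locations whose hourly history is not complete.
incomplete = hours_per_location[hours_per_location != EXPECTED_HOURS]
print(incomplete)
```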

In [43]:
# checking the data type of column('timestamp_utc')
usa_cities.timestamp_utc.dtype

dtype('O')

**From above we can see that the dtype of column('timestamp_utc') is 'object' (plain strings), so it should be converted to a date-time format.**

In [44]:
# converting the column('timestamp_utc') into datetime format
usa_cities['timestamp_utc']=pd.to_datetime(usa_cities.timestamp_utc, format="%Y-%m-%dT%H:%M:%S")
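
As a sanity check on the conversion, the `ts` column (Unix time in seconds) should describe the same instants as `timestamp_utc`. The sketch below compares the two on the first two rows shown in the output above; it works on a small stand-in dataframe so that it can be run without the API.

```python
import pandas as pd

# First two rows of the output above, reduced to the two time columns.
sample = pd.DataFrame(
    {
        "timestamp_utc": ["2021-06-23T21:00:00", "2021-06-23T20:00:00"],
        "ts": [1624482000, 1624478400],
    }
)

# Parse the ISO-style strings with the same format as in the cell above.
from_string = pd.to_datetime(sample["timestamp_utc"], format="%Y-%m-%dT%H:%M:%S")

# Interpret the Unix timestamps as seconds since 1970-01-01 UTC.
from_unix = pd.to_datetime(sample["ts"], unit="s")

print((from_string == from_unix).all())
```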

In [45]:
usa_cities.head()

Unnamed: 0,aqi,pm10,pm25,o3,so2,no2,timestamp_utc,datetime,co,ts,city_name,lon,timezone,lat,state_code
0,21.8,0.342353,0.22711,47.556,0.051863,0.468337,2021-06-23 21:00:00,2021-06-23:21,265.944,1624482000,Whitlocks Mill,-67.24,America/New_York,45.17,ME
1,22.8,0.3221,0.182965,49.8604,0.04255,0.371042,2021-06-23 20:00:00,2021-06-23:20,266.528,1624478400,Whitlocks Mill,-67.24,America/New_York,45.17,ME
2,23.6,0.302576,0.141753,51.7713,0.039872,0.274416,2021-06-23 19:00:00,2021-06-23:19,268.781,1624474800,Whitlocks Mill,-67.24,America/New_York,45.17,ME
3,24.8,0.28295,0.106047,53.9326,0.045402,0.182207,2021-06-23 18:00:00,2021-06-23:18,270.533,1624471200,Whitlocks Mill,-67.24,America/New_York,45.17,ME
4,26.0,0.278646,0.081059,56.5946,0.056811,0.096057,2021-06-23 17:00:00,2021-06-23:17,274.539,1624467600,Whitlocks Mill,-67.24,America/New_York,45.17,ME


In [46]:
# dropping the column('datetime')
usa_cities.drop(columns='datetime',inplace=True)

In [47]:
# setting the index as date-time index
usa_cities.set_index("timestamp_utc",inplace=True)

In [48]:
usa_cities.head()

timestamp_utc,aqi,pm10,pm25,o3,so2,no2,co,ts,city_name,lon,timezone,lat,state_code
2021-06-23 21:00:00,21.8,0.342353,0.22711,47.556,0.051863,0.468337,265.944,1624482000,Whitlocks Mill,-67.24,America/New_York,45.17,ME
2021-06-23 20:00:00,22.8,0.3221,0.182965,49.8604,0.04255,0.371042,266.528,1624478400,Whitlocks Mill,-67.24,America/New_York,45.17,ME
2021-06-23 19:00:00,23.6,0.302576,0.141753,51.7713,0.039872,0.274416,268.781,1624474800,Whitlocks Mill,-67.24,America/New_York,45.17,ME
2021-06-23 18:00:00,24.8,0.28295,0.106047,53.9326,0.045402,0.182207,270.533,1624471200,Whitlocks Mill,-67.24,America/New_York,45.17,ME
2021-06-23 17:00:00,26.0,0.278646,0.081059,56.5946,0.056811,0.096057,274.539,1624467600,Whitlocks Mill,-67.24,America/New_York,45.17,ME


### Conclusion:
We now have a cleaned dataset('usa_cities') that holds the air pollution history of 697 cities of the USA. We can use this dataset for further analysis in the future.
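
As one example of such further analysis, the date-time index makes it straightforward to aggregate by day. The sketch below computes a daily mean AQI per city on a small stand-in dataframe; the city names and values are made up for illustration and are not taken from the collected data.

```python
import pandas as pd

# Stand-in for usa_cities with a date-time index, as set up above.
sample = pd.DataFrame(
    {
        "aqi": [20.0, 30.0, 40.0, 50.0],
        "city_name": ["A", "A", "A", "B"],
    },
    index=pd.to_datetime(
        [
            "2021-06-22 10:00",
            "2021-06-22 11:00",
            "2021-06-23 10:00",
            "2021-06-23 10:00",
        ]
    ),
)
sample.index.name = "timestamp_utc"

# Group by city and by calendar day of the index, then average the AQI.
daily_mean = sample.groupby(["city_name", pd.Grouper(freq="D")])["aqi"].mean()
print(daily_mean)
```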