# Before you begin


1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to Create a Cloud Platform project if you do not already have one.
2.   [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3.   [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project.


### Provide your credentials to the runtime

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [2]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


# Use BigQuery through google-cloud-bigquery

See [BigQuery documentation](https://cloud.google.com/bigquery/docs) and [library reference documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html).

The [GSOD sample table](https://bigquery.cloud.google.com/table/bigquery-public-data:samples.gsod) contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.


### Declare the Cloud project ID which will be used throughout this notebook

In [3]:
project_id = 'cmpe272-332502'

In [5]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)


df = client.query('''
 SELECT  * FROM `bigquery-public-data.epa_historical_air_quality.co_daily_summary` where date_local >= "2009-01-01"
''').to_dataframe()


In [13]:
df.head()

Unnamed: 0,state_code,county_code,site_num,parameter_code,poc,latitude,longitude,datum,parameter_name,sample_duration,pollutant_standard,date_local,units_of_measure,event_type,observation_count,observation_percent,arithmetic_mean,first_max_value,first_max_hour,aqi,method_code,method_name,local_site_name,address,state_name,county_name,city_name,cbsa_name,date_of_last_change
0,2,20,48,42101,1,61.191514,-149.93493,WGS84,Carbon monoxide,1 HOUR,CO 1-hour 1971,2012-10-12,Parts per million,,24,100.0,1.041667,4.5,7,,54.0,INSTRUMENTAL - NONDISPERSIVE INFRARED,UNITARIAN CHURCH,3201 TURNAGAIN STREET,Alaska,Anchorage,Anchorage,"Anchorage, AK",2018-06-04
1,2,20,52,42101,1,61.215027,-149.903111,WGS84,Carbon monoxide,1 HOUR,CO 1-hour 1971,2012-01-11,Parts per million,,24,100.0,0.279167,1.3,17,,54.0,INSTRUMENTAL - NONDISPERSIVE INFRARED,DHHS,727 L Street,Alaska,Anchorage,Anchorage,"Anchorage, AK",2016-04-08
2,2,20,52,42101,1,61.215027,-149.903111,WGS84,Carbon monoxide,1 HOUR,CO 1-hour 1971,2012-03-29,Parts per million,,24,100.0,0.445833,0.9,6,,54.0,INSTRUMENTAL - NONDISPERSIVE INFRARED,DHHS,727 L Street,Alaska,Anchorage,Anchorage,"Anchorage, AK",2016-04-08
3,2,20,52,42101,1,61.215027,-149.903111,WGS84,Carbon monoxide,8-HR RUN AVG END HOUR,CO 8-hour 1971,2012-11-01,Parts per million,,24,100.0,0.379167,0.7,0,8.0,,-,DHHS,727 L Street,Alaska,Anchorage,Anchorage,"Anchorage, AK",2016-04-08
4,6,29,2012,42101,1,35.331612,-118.999961,NAD83,Carbon monoxide,1 HOUR,CO 1-hour 1971,2012-08-21,Parts per million,,21,88.0,0.261905,0.4,9,,54.0,INSTRUMENTAL - NONDISPERSIVE INFRARED,Bakersfield-Muni,2000 South Union Ave. Bakersfield CA 93307,California,Kern,Bakersfield,"Bakersfield, CA",2016-04-09


### Describe the sampled data

In [6]:
df.describe()

Unnamed: 0,parameter_code,poc,latitude,longitude,observation_count,observation_percent,arithmetic_mean,first_max_value,first_max_hour,aqi,method_code
count,2356603.0,2356603.0,2356603.0,2356603.0,2356603.0,2356603.0,2356603.0,2356603.0,2356603.0,1178308.0,1178295.0
mean,42101.0,1.147431,37.52664,-98.05205,23.27752,97.04557,0.3163669,0.5150626,7.932945,4.993341,234.5471
std,0.0,0.6839221,5.5499,19.38084,2.158731,8.937323,0.2552046,0.5030489,7.915253,4.347717,230.9118
min,42101.0,1.0,18.00956,-159.3662,1.0,4.0,-0.5,-0.5,0.0,0.0,41.0
25%,42101.0,1.0,33.99958,-117.4006,23.0,96.0,0.175,0.204,0.0,2.0,54.0
50%,42101.0,1.0,38.1936,-95.25759,24.0,100.0,0.2625,0.4,6.0,3.0,93.0
75%,42101.0,1.0,41.0604,-80.59232,24.0,100.0,0.4,0.6,15.0,7.0,554.0
max,42101.0,9.0,64.84569,-66.05224,24.0,100.0,45.66667,50.0,23.0,453.0,593.0


In [7]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [9]:
df.to_csv("/content/drive/MyDrive/272/final.csv", index=False)

# Regressor Model

In [11]:
import pandas as pd

In [12]:
df['date_local'] = pd.to_datetime(df['date_local'])

In [14]:
df_extract = df[["state_code","county_code","site_num", "date_local","arithmetic_mean", "parameter_name"]]

In [15]:
df_extract.head()

Unnamed: 0,state_code,county_code,site_num,date_local,arithmetic_mean,parameter_name
0,2,20,48,2012-10-12,1.041667,Carbon monoxide
1,2,20,52,2012-01-11,0.279167,Carbon monoxide
2,2,20,52,2012-03-29,0.445833,Carbon monoxide
3,2,20,52,2012-11-01,0.379167,Carbon monoxide
4,6,29,2012,2012-08-21,0.261905,Carbon monoxide


In [17]:
df_extract["year"] = pd.DatetimeIndex(df_extract['date_local']).year  

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [19]:
df_extract["month"] = pd.DatetimeIndex(df_extract['date_local']).month
df_extract["day"] = pd.DatetimeIndex(df_extract['date_local']).day   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [23]:
df_extract.drop(["date_local"], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [25]:
df_extract["state_code"] = df_extract["state_code"].astype(int)
df_extract["county_code"] = df_extract["county_code"].astype(int)
df_extract["site_num"] = df_extract["site_num"].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [26]:
df_extract.head()

Unnamed: 0,state_code,county_code,site_num,arithmetic_mean,parameter_name,year,month,day
0,2,20,48,1.041667,Carbon monoxide,2012,10,12
1,2,20,52,0.279167,Carbon monoxide,2012,1,11
2,2,20,52,0.445833,Carbon monoxide,2012,3,29
3,2,20,52,0.379167,Carbon monoxide,2012,11,1
4,6,29,2012,0.261905,Carbon monoxide,2012,8,21


In [27]:
df_g = df_extract.groupby(["state_code", "county_code","site_num","year","month","day"])["arithmetic_mean"].mean().reset_index()

In [28]:
df_g.head()

Unnamed: 0,state_code,county_code,site_num,year,month,day,arithmetic_mean
0,1,73,23,2011,1,12,0.504084
1,1,73,23,2011,1,13,0.441
2,1,73,23,2011,1,14,0.505455
3,1,73,23,2011,1,15,0.817803
4,1,73,23,2011,1,16,0.722992


In [30]:
df_g.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1166474 entries, 0 to 1166473
Data columns (total 7 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   state_code       1166474 non-null  int64  
 1   county_code      1166474 non-null  int64  
 2   site_num         1166474 non-null  int64  
 3   year             1166474 non-null  int64  
 4   month            1166474 non-null  int64  
 5   day              1166474 non-null  int64  
 6   arithmetic_mean  1166474 non-null  float64
dtypes: float64(1), int64(6)
memory usage: 62.3 MB


In [39]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

In [32]:
X = df_g.drop(["arithmetic_mean"], axis=1)
y = df_g[["arithmetic_mean"]]

In [36]:
train_X = X[:1100000]
test_X = X[1100000:]
train_y = y[:1100000]
test_y = y[1100000:]

In [37]:
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_X)
test_scaled = scaler.fit_transform(test_X)

In [38]:
xg = xgb.XGBRegressor()
xg.fit(train_scaled, train_y)
pred = xg.predict(test_scaled)



In [40]:
print("MSE: ", mean_squared_error(pred, test_y))

MSE:  0.3846755143074041
