**Name:**

**ID:**

### **Authenticate and authorize access**

In [None]:

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

**BigQuery DataFrames**

- bigframes.pandas provides a pandas-compatible API for analytics.

- bigframes.ml provides a scikit-learn-like API for ML.


> https://cloud.google.com/python/docs/reference/bigframes/latest


> https://cloud.google.com/bigquery/docs/kmeans-tutorial






### **Get data from BigQuery**

[London Bicycle Hires public dataset](https://console.cloud.google.com/marketplace/details/greater-london-authority/london-bicycles?filter=solution-type:dataset&id=95374cac-2834-4fa2-a71f-fc033ccb5ce4&_ga=2.262055304.448628190.1710568646-1928647564.1691428223&project=ds-on-gcp-411105)



London bicycle hire data



In [None]:
import bigframes.pandas as bpd
bpd.options.bigquery.project = 'ds-on-gcp-411105'
bpd.options.bigquery.location = 'EU'

### **Hire data**

> *bigquery-public-data.london_bicycles.cycle_hire*

In [None]:
h = bpd.read_gbq(
    "bigquery-public-data.london_bicycles.cycle_hire",
    col_order=["start_station_name", "start_station_id", "start_date", "duration"],
).rename(
    columns={
        "start_station_name": "station_name",
        "start_station_id": "station_id",
    }
)

In [None]:
print(h.shape)
h.head()

### **Station data**

> *bigquery-public-data.london_bicycles.cycle_stations*


**Geography functions**

> https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_distance

> The coordinates -0.1 longitude and 51.5 latitude correspond roughly to the central area of London, UK. Longitude values are measured in degrees east or west of the Prime Meridian, and latitude values are measured in degrees north or south of the equator.

In [None]:
# create distance_from_city_center in kilometers.
s = bpd.read_gbq(
    """
    SELECT
    id,
    ST_DISTANCE(
        ST_GEOGPOINT(s.longitude, s.latitude),
        ST_GEOGPOINT(-0.1, 51.5)
    ) / 1000 AS distance_from_city_center
    FROM
    `bigquery-public-data.london_bicycles.cycle_stations` s
    """
)

In [None]:
print(s.shape)
s.head()
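
As a local sanity check on the distances returned by the query above, the great-circle distance can be approximated in plain Python with the haversine formula. This is only a sketch: the helper name `haversine_km` and the Earth radius constant are assumptions made for illustration, and BigQuery's `ST_DISTANCE` may differ slightly in its internal model.

```python
import math

# Mean Earth radius in km. This value is an assumption for the sketch;
# BigQuery's ST_DISTANCE may use a slightly different constant.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance from the reference point (-0.1, 51.5) to a point one degree further north.
one_degree_north = haversine_km(-0.1, 51.5, -0.1, 52.5)
same_point = haversine_km(-0.1, 51.5, -0.1, 51.5)
```

Note that the argument order (longitude first, then latitude) mirrors `ST_GEOGPOINT(longitude, latitude)`.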

### **Select hire data for 2015**

In [None]:
import datetime
#https://docs.python.org/3/library/datetime.html#datetime-objects
sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)  #year, month, day, hour, minute, second, microsecond, tzinfo
sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

In [None]:
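# Half-open interval [2015-01-01, 2016-01-01): a hire starting exactly at
# midnight on 2016-01-01 belongs to 2016 and is excluded.
h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] < sample_time2)]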


In [None]:
print(h.shape)
h.head()
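
A calendar year is most safely expressed as a half-open interval: the start is included and the first instant of the following year is excluded. The sketch below shows the boundary behaviour with plain `datetime` objects; the helper `in_year_2015` and the sample timestamps are made up for illustration.

```python
import datetime

UTC = datetime.timezone.utc
start = datetime.datetime(2015, 1, 1, tzinfo=UTC)
end = datetime.datetime(2016, 1, 1, tzinfo=UTC)


def in_year_2015(ts):
    """True when ts falls in [start, end), i.e. anywhere in 2015 (UTC)."""
    return start <= ts < end


# Hypothetical timestamps sitting right on the interval boundaries.
samples = [
    datetime.datetime(2014, 12, 31, 23, 59, 59, tzinfo=UTC),  # before the year
    datetime.datetime(2015, 1, 1, tzinfo=UTC),                # first instant of 2015
    datetime.datetime(2015, 12, 31, 23, 59, 59, tzinfo=UTC),  # last second of 2015
    datetime.datetime(2016, 1, 1, tzinfo=UTC),                # first instant of 2016
]
kept = [ts for ts in samples if in_year_2015(ts)]
```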

### **Create the day-of-week feature**

In [None]:
h = h.assign(
    isweekday=h.start_date.dt.dayofweek.map(
        {
            0: "weekday",
            1: "weekday",
            2: "weekday",
            3: "weekday",
            4: "weekday",
            5: "weekend",
            6: "weekend",
        }
    )
)

In [None]:
print(h.shape)
h.head()
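
The mapping above relies on the convention that Monday is 0 and Sunday is 6. Python's standard `date.weekday()` follows the same convention, so the rule can be checked locally with a small sketch (the helper `weekday_label` is a name invented for this example):

```python
import datetime


def weekday_label(day):
    """Label a date as 'weekday' (Mon-Fri, 0-4) or 'weekend' (Sat-Sun, 5-6)."""
    return "weekend" if day.weekday() >= 5 else "weekday"


# 2015-01-02 is a Friday, so the next three days are Saturday, Sunday, Monday.
days = [datetime.date(2015, 1, d) for d in (2, 3, 4, 5)]
labels = {day.isoformat(): weekday_label(day) for day in days}
```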

### **Join/merge the hire table (h) and the station table (s) on station ID**

In [None]:
merged_df = h.merge(
    right=s,
    how="inner",
    left_on="station_id",
    right_on="id",
)

In [None]:
print(merged_df.shape)
merged_df.head()
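
To see what an inner join keeps and drops, here is a minimal local sketch using plain pandas rather than BigQuery DataFrames. The two small tables are invented for illustration; only the `merge` arguments mirror the cell above.

```python
import pandas as pd

# Hypothetical hires: station 99 has no matching row in the station table.
hires = pd.DataFrame(
    {"station_id": [1, 1, 2, 99], "duration": [600, 900, 300, 120]}
)
# Hypothetical stations: station 3 has no hires.
stations = pd.DataFrame(
    {"id": [1, 2, 3], "distance_from_city_center": [0.5, 2.0, 4.2]}
)

# An inner join keeps only rows whose key appears in both tables.
merged = hires.merge(right=stations, how="inner", left_on="station_id", right_on="id")
```

Rows without a partner on the other side (station 99 and station 3 here) disappear, which is why the shape printed after a merge can be smaller than the hire table.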

### **Feature engineering for clustering stations**

Group by `station_name` and `isweekday`.

Features: mean trip duration, number of trips, and maximum distance_from_city_center.


In [None]:
stationstats = merged_df.groupby(["station_name", "isweekday"]).agg(
    {"duration": ["mean", "count"], "distance_from_city_center": "max"}
)

In [None]:
stationstats.columns = ["duration", "num_trips", "distance_from_city_center"]

In [None]:
print(stationstats.shape)
stationstats.head()
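
Aggregating with a dictionary of functions produces two-level column labels, which is why the columns are renamed right after the `groupby`. The sketch below reproduces this with plain pandas on an invented four-row table; the new names must be listed in the same order as the aggregated columns.

```python
import pandas as pd

# Hypothetical trips for two stations.
trips = pd.DataFrame(
    {
        "station_name": ["A", "A", "A", "B"],
        "isweekday": ["weekday", "weekday", "weekend", "weekday"],
        "duration": [600, 900, 300, 1200],
        "distance_from_city_center": [0.5, 0.5, 0.5, 2.0],
    }
)

stats = trips.groupby(["station_name", "isweekday"]).agg(
    {"duration": ["mean", "count"], "distance_from_city_center": "max"}
)

# Two-level labels such as ("duration", "mean") before flattening.
original_columns = list(stats.columns)

# Flatten to single-level names, in the same order as the aggregations.
stats.columns = ["duration", "num_trips", "distance_from_city_center"]
```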

### **Create a K-means model with bigframes.ml.cluster**

> $k = 4$

In [None]:
from bigframes.ml.cluster import KMeans

model_kmeans4 = KMeans(n_clusters=4)
model_kmeans4.fit(stationstats)

In [None]:
model_kmeans4.cluster_centers_

### **Evaluation**

In [None]:
model_kmeans4.score(stationstats)

### **Save model to Dataset in BigQuery**

In [None]:
model_kmeans4.to_gbq("ds-on-gcp-411105.DemoKmeans.modelKmeans4", replace=True)


### **Load Model**

In [None]:
model_kmeans4_loaded = bpd.read_gbq_model("ds-on-gcp-411105.DemoKmeans.modelKmeans4")

In [None]:
resultK4 = model_kmeans4_loaded.predict(stationstats)

In [None]:
resultK4 = resultK4.reset_index()


In [None]:
print(resultK4.shape)
resultK4.head()

In [None]:
resultK4.to_gbq("ds-on-gcp-411105.DemoKmeans.ResultStationKmeans4")

### **Assignment: cluster with K-Means for k = 2, 3, ..., 10**

> To find a suitable $k$
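
One common way to pick $k$ is the elbow heuristic: fit a model for each $k$, record the mean squared distance of points to their nearest centroid, and look for the $k$ after which the improvement becomes small. The sketch below illustrates the idea with a tiny plain-Python K-means on invented 2-D points. It is a stand-in, not the `bigframes.ml` implementation, and the helper names are assumptions. In the notebook, the analogous loop would fit `KMeans(n_clusters=k)` for each $k$ and compare the metrics returned by `score`.

```python
import math


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_msd(points, k, n_iter=50):
    """Run Lloyd's algorithm and return the mean squared distance to the nearest centroid."""
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points) / len(points)


# Invented data: three well-separated groups of four points each.
points = [
    (0, 0), (1, 0), (0, 1), (1, 1),
    (10, 0), (11, 0), (10, 1), (11, 1),
    (0, 10), (1, 10), (0, 11), (1, 11),
]

# Mean squared distance for each candidate k.
msd_by_k = {k: kmeans_msd(points, k) for k in range(1, 5)}
```

On this toy data the error drops sharply up to $k = 3$ and only marginally afterwards, which is the "elbow". On real station features the curve is usually less clear-cut, so the choice of $k$ remains a judgment call.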