# ACTIVITY SUGGESTION 



## Students
    
    18127183 - Tran Bao Phuc
    18127140 - Thai Hoang Long


## Introduction

https://www.boredapi.com/
The Bored API helps you find things to do when you're bored! There are fields like the number of participants, activity type, and more that help you narrow down your results.

The service returns a random activity in response to a simple request, for example:

http://www.boredapi.com/api/activity/

```json
    {
        "activity": "Learn Express.js",
        "accessibility": 0.25,
        "type": "education",
        "participants": 1,
        "price": 0.1,
        "link": "https://expressjs.com/",
        "key": "3943506"
    } 
```
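The filters mentioned in the introduction are passed as query-string parameters. The following is a minimal sketch of building such a request URL. It assumes the API accepts `type` and `participants` as parameter names, matching the response fields; check the API documentation before relying on this. `build_activity_url` is a hypothetical helper, not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.boredapi.com/api/activity"

def build_activity_url(activity_type=None, participants=None):
    """Return the request URL, adding only the filters that were given."""
    params = {}
    if activity_type is not None:
        params["type"] = activity_type          # assumed parameter name
    if participants is not None:
        params["participants"] = participants   # assumed parameter name
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL

print(build_activity_url("education", 1))
# http://www.boredapi.com/api/activity?type=education&participants=1
```

The resulting string can be passed to `requests.get` in the same way as the unfiltered URL.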

The fields are described as follows:
- `activity`: Description of the queried activity
- `accessibility`: A factor describing how feasible an activity is, with zero being the most accessible [0.0, 1.0]
- `type`: Type of the activity ["education", "recreational", "social", "diy", "charity", "cooking", "relaxation", "music", "busywork"]
- `participants`: The number of people that this activity could involve [0, n]
- `price`: A factor describing the cost of the activity, with zero being free
- `key`: A unique numeric id
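The ranges in this list can be turned into a quick sanity check on each response. This is only a sketch: `check_activity` is a hypothetical helper, and the rules are limited to what the field descriptions state.

```python
ACTIVITY_TYPES = {"education", "recreational", "social", "diy", "charity",
                  "cooking", "relaxation", "music", "busywork"}

def check_activity(record):
    """Return a list of problems found in one API response (empty if none)."""
    problems = []
    if not 0.0 <= float(record["accessibility"]) <= 1.0:
        problems.append("accessibility outside [0.0, 1.0]")
    if record["type"] not in ACTIVITY_TYPES:
        problems.append(f"unknown type: {record['type']}")
    if int(record["participants"]) < 0:
        problems.append("negative participants")
    if float(record["price"]) < 0:
        problems.append("negative price")
    return problems

sample = {
    "activity": "Learn Express.js",
    "accessibility": 0.25,
    "type": "education",
    "participants": 1,
    "price": 0.1,
    "link": "https://expressjs.com/",
    "key": "3943506",
}
print(check_activity(sample))  # []
```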


## Scope

In this project, we collect random activity suggestions from the service, then train a model to predict `accessibility` from the remaining fields.



## Setup


We use `min-ds_env` as the environment for this project.
To verify which interpreter is active:

In [None]:
import sys
sys.executable

'/usr/bin/python3'

#### Import

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns # seaborn is a library built on top of matplotlib
                      # that makes visualization less painful
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LinearRegression
from sklearn import set_config
set_config(display='diagram') # Render pipelines as diagrams

In [None]:
!pip install --upgrade scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.2.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.2 threadpoolctl-2.2.0


## Data collection

In [None]:
import requests
import csv
import json
import time

`collectData` sends `amount` GET requests and writes each response to the CSV file.
- `amount`: number of requests to send
- `interval`: sleep time between requests, in seconds

In [None]:
csv_file = './data.csv'
url = "http://www.boredapi.com/api/activity"
colnames=['key', 'activity', 'accessibility', 'type', 'participants', 'price', 'link']

In [None]:
 
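def collectData(amount=2, interval=1):
    # Open the file in a context manager so it is flushed and closed on exit
    with open(csv_file, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=colnames)
        for _ in range(amount):
            time.sleep(interval)         # pause between requests
            response = requests.get(url)
            response.raise_for_status()  # stop on HTTP errors
            writer.writerow(response.json())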

In [None]:
# collectData(10015, 1)

We collected 10015 records in `data.csv`.

### Data exploration

In [None]:

data_df = pd.read_csv(csv_file, names=colnames, 
                      index_col=0) # index column: "key"
data_df.head()


| key | activity | accessibility | type | participants | price | link |
|---|---|---|---|---|---|---|
| 4558850 | Go to a concert with some friends | 0.4 | social | 4 | 0.6 | |
| 2237769 | Explore the nightlife of your city | 0.32 | social | 1 | 0.1 | |
| 1799120 | Cook something together with someone | 0.8 | cooking | 2 | 0.3 | |
| 4614092 | Draw and color a Mandala | 0.1 | relaxation | 1 | 0.05 | https://en.wikipedia.org/wiki/Mandala |
| 9216391 | Learn woodworking | 0.3 | diy | 1 | 0.3 | |


In [None]:
data_df.shape

(10015, 6)

Because every request returns a randomly chosen activity, the same activity can be returned more than once, so we check for duplicated rows.

In [None]:
data_df.duplicated().sum()

9819

Most of the rows are duplicates: 9819 of the 10015 rows repeat an earlier one. We keep only the first occurrence of each.

In [None]:
data_df = data_df.drop_duplicates(keep='first')

In [None]:
data_df.shape

(196, 6)
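Since repeats dominate, a collector could skip keys it has already seen and stop once a run of requests yields nothing new. The sketch below takes the request as a `fetch` callable so that it runs offline; `collect_unique`, `patience`, and the fake pool are illustrative names and values, not part of the notebook.

```python
import itertools

def collect_unique(fetch, max_requests=1000, patience=50):
    """Call `fetch()` repeatedly and keep one record per `key`.

    Stops after `max_requests` calls, or once `patience` consecutive
    calls return only keys that were already seen.
    """
    seen = {}
    misses = 0
    for _ in range(max_requests):
        record = fetch()
        if record["key"] in seen:
            misses += 1
            if misses >= patience:
                break
        else:
            seen[record["key"]] = record
            misses = 0
    return list(seen.values())

# Stand-in for `lambda: requests.get(url).json()` so the sketch runs offline
fake_pool = itertools.cycle([{"key": "1"}, {"key": "2"}, {"key": "1"}])
unique = collect_unique(lambda: next(fake_pool), patience=5)
print(len(unique))  # 2
```

A larger `patience` lowers the chance of stopping before a rarely returned activity has appeared, at the cost of more requests.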

### Preprocess data (raw)


In [None]:
y_sr = data_df["accessibility"]
X_df = data_df.drop(["accessibility"], axis=1)

In [None]:
y_sr

key
4558850    0.40
2237769    0.32
1799120    0.80
4614092    0.10
9216391    0.30
           ... 
3646173    0.00
2581372    0.10
2085321    0.30
9026787    0.10
6813070    0.10
Name: accessibility, Length: 196, dtype: float64

In [None]:
X_df

| key | activity | type | participants | price | link |
|---|---|---|---|---|---|
| 4558850 | Go to a concert with some friends | social | 4 | 0.60 | |
| 2237769 | Explore the nightlife of your city | social | 1 | 0.10 | |
| 1799120 | Cook something together with someone | cooking | 2 | 0.30 | |
| 4614092 | Draw and color a Mandala | relaxation | 1 | 0.05 | https://en.wikipedia.org/wiki/Mandala |
| 9216391 | Learn woodworking | diy | 1 | 0.30 | |
| ... | ... | ... | ... | ... | ... |
| 3646173 | Learn Morse code | education | 1 | 0.00 | https://en.wikipedia.org/wiki/Morse_code |
| 2581372 | Take a bubble bath | relaxation | 1 | 0.15 | |
| 2085321 | Take a spontaneous road trip with some friends | social | 4 | 0.20 | |
| 9026787 | Clean out your closet and donate the clothes y... | charity | 1 | 0.00 | |


In [None]:
train_X_df, val_X_df, train_y_sr, val_y_sr = \
                              train_test_split(X_df, y_sr, 
                                               test_size=0.3, 
                                               random_state=0)
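
With a fractional `test_size`, `train_test_split` rounds the validation share up and gives the rest to the training set. For the 196 unique rows this yields 137 training rows and 59 validation rows. The arithmetic can be checked as follows (a sketch of the rounding rule, not the library code itself):

```python
import math

n_samples = 196      # unique rows left after drop_duplicates
test_size = 0.3

n_val = math.ceil(test_size * n_samples)   # 58.8 rounds up to 59
n_train = n_samples - n_val

print(n_train, n_val)  # 137 59
```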

In [None]:
train_X_df.head().index

Int64Index([5881028, 6778219, 6808057, 4290333, 4266522], dtype='int64', name='key')

### Exploring the data

First, we look at the data type of each field.

In [None]:
train_X_df.dtypes

activity         object
type             object
participants      int64
price           float64
link             object
dtype: object

In [None]:
def missing_percentage(c):
    return (c.isna().mean() * 100).round(1)

In [None]:
train_X_df.agg([missing_percentage])

| | activity | type | participants | price | link |
|---|---|---|---|---|---|
| missing_percentage | 0.0 | 0.0 | 0.0 | 0.0 | 88.3 |


Only the `link` column has missing values; the other columns have none.

How many activities of each type do we have?

In [None]:
train_X_df['type'].value_counts()

recreational    31
social          29
busywork        22
education       17
relaxation      15
charity          7
music            6
diy              5
cooking          5
Name: type, dtype: int64

### Preprocess data (train set)

- column `activity` => number of words (we assume that a longer description means a more complicated activity)
- column `type` => one-hot encoding
- column `link` => 0 or 1 (0 if the link is missing, 1 if any value is present)
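
The `activity` and `link` rules can be illustrated on two rows from the data set. One detail matters for `link`: pandas reads an empty CSV field as `NaN`, and `NaN` is truthy in Python, so a plain truthiness test would label a missing link as 1. The helper names below are illustrative only.

```python
import math

def word_count(text):
    """Number of whitespace-separated words in the activity description."""
    return len(text.split())

def has_link(value):
    """Return 1 if a link is present, 0 for an empty string, None, or NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    return 1 if value else 0

print(word_count("Take a bubble bath"), has_link(float("nan")))   # 4 0
print(word_count("Learn Morse code"),
      has_link("https://en.wikipedia.org/wiki/Morse_code"))       # 3 1
print(bool(float("nan")))                                         # True
```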

In [None]:
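class ColPreprocess(BaseEstimator, TransformerMixin):
    """Turn the text columns `activity` and `link` into numeric features."""
    def __init__(self):
        pass

    def fit(self, X_df, y=None):
        # Stateless transformer: nothing to learn from the data
        return self

    def transform(self, X_df, y=None):
        df = X_df.copy()
        # Word count of the activity description
        df['activity'] = [len(x.split()) for x in X_df['activity']]
        # 1 if a link is present, 0 if missing (NaN is truthy, so use notna())
        df['link'] = X_df['link'].notna().astype(int)
        return df

# Test the transform method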
col_prep = ColPreprocess()
col_prep.transform(train_X_df)

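| key | activity | type | participants | price | link |
|---|---|---|---|---|---|
| 5881028 | 5 | education | 1 | 0.10 | 1 |
| 6778219 | 6 | education | 1 | 0.00 | 1 |
| 6808057 | 6 | recreational | 1 | 0.00 | 1 |
| 4290333 | 8 | relaxation | 1 | 0.10 | 1 |
| 4266522 | 4 | busywork | 1 | 0.50 | 1 |
| ... | ... | ... | ... | ... | ... |
| 6693574 | 6 | recreational | 1 | 0.10 | 1 |
| 2581372 | 4 | relaxation | 1 | 0.15 | 1 |
| 4688012 | 2 | recreational | 1 | 0.00 | 1 |
| 6706598 | 5 | education | 1 | 0.00 | 1 |

This output was produced with a truthiness test (`1 if x else 0`) on `link`. Every value is 1, including key 2581372 ("Take a bubble bath"), which has no link in the raw data: a missing link is read as `NaN`, and `NaN` is truthy in Python, so only a check such as `notna()` can detect it.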


In [None]:
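# Note: `remainder` defaults to 'drop', so only the one-hot encoded `type`
# columns are passed on; `activity`, `participants`, `price` and `link` are dropped
col_preprocess2 = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ['type'])
])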

preprocess_pipeline = Pipeline([
    ('col_preprocess', ColPreprocess()),
    ('col_preprocess2', col_preprocess2),
    ('normalize', StandardScaler(with_mean=False)),
])
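
`ColumnTransformer` drops every column that is not listed in one of its transformers unless `remainder` says otherwise, so with the definition above only the one-hot `type` columns reach the scaler. The following is a minimal sketch of keeping the numeric columns as well, assuming that is the intent. The tiny frame and the name `col_preprocess_keep` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame shaped like the output of ColPreprocess (values are invented)
toy_df = pd.DataFrame({
    "activity": [5, 4, 2],            # word counts
    "type": ["education", "relaxation", "education"],
    "participants": [1, 1, 2],
    "price": [0.1, 0.15, 0.0],
    "link": [1, 0, 0],
})

col_preprocess_keep = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["type"])],
    remainder="passthrough",   # keep the columns not named above
)

features = col_preprocess_keep.fit_transform(toy_df)
print(features.shape)  # (3, 6): 2 one-hot columns + 4 passed-through columns
```

Switching to `remainder="passthrough"` changes the model's inputs, so any error figure computed with the original transformer would need to be recomputed.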



In [None]:
preprocess_pipeline

In [None]:
preprocessed_train_X = preprocess_pipeline.fit_transform(train_X_df)

In [None]:
# Full pipeline: preprocessing followed by a linear regression model

full_pipeline = Pipeline([
    ('col_preprocess', ColPreprocess()),
    ('col_preprocess2', col_preprocess2),
    ('normalize', StandardScaler(with_mean=False)),
    ('regressor', LinearRegression())  # a regression model, not a classifier
])

# Training: fit on the full data set (X_df, y_sr), not only on the training split
full_pipeline.fit(X_df, y_sr)

# Mean absolute deviation of the model on the data it was trained on
pred = full_pipeline.predict(X_df)  # predicted accessibility for each row of X_df
y_true = y_sr.values                # actual accessibility values from y_sr
result = []
for i in range(len(pred)):
    x = abs(pred[i] - y_true[i])    # absolute deviation for each key
    result.append(x)

r = (sum(result) / len(result)) * 100
print("Mean absolute deviation (%):", r)


Mean absolute deviation (%): 22.035372141609884
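
The quantity printed above is the mean absolute error multiplied by 100: the absolute difference between prediction and target, averaged over all rows. A small worked example with made-up numbers (not taken from the data set):

```python
def mean_abs_error_pct(y_true, y_pred):
    """Mean of |prediction - target|, multiplied by 100."""
    errors = [abs(p - t) for p, t in zip(y_pred, y_true)]
    return sum(errors) / len(errors) * 100

# Errors are 0.25, 0.0 and 0.5, so the mean is 0.25
print(mean_abs_error_pct([0.5, 0.25, 1.0], [0.25, 0.25, 0.5]))  # 25.0
```

On the 0 to 1 accessibility scale, a value of about 22 therefore corresponds to an average error of about 0.22.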


In [None]:
full_pipeline

In [34]:
# Predict with the fitted model
df_test = pd.read_csv('test.csv', names=colnames,
                      index_col=0)
df_test = df_test.drop_duplicates(keep='first')
df_test = df_test.drop(['accessibility'], axis=1)
col = list(df_test.columns)  # feature columns, dropped again before saving
prediction = full_pipeline.predict(df_test)
df_test['accessibility'] = prediction
# Save only the key (index) and the predicted accessibility
df_test.drop(columns=col).to_csv('my_predict.csv')
