## Intro

This notebook contains Patrick Leung's answers to the HK01 interview questions.

To make all the code in Q2 and Q3 executable, the required data is stored in the Google Sheet below:
https://docs.google.com/spreadsheets/d/1LvC5wRR4x2BL-cKPr_8sb4NGdW1TtdNeGj-r_yGJAHo/edit?usp=sharing

## Q1 SQL Question

In [None]:
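-- Users whose first install happened on 2017-04-01
SELECT DISTINCT uid
FROM piwik_track
WHERE event_name = 'FIRST_INSTALL'
AND time = '2017-04-01'

INTERSECT

-- Users with any tracked event during the following seven days
SELECT DISTINCT uid
FROM piwik_track
WHERE time >= '2017-04-02'
AND time <= '2017-04-08'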
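
As a quick sanity check, the query above can be run against an in-memory SQLite database. This is only a minimal sketch: the rows are hypothetical, and it assumes `time` holds a plain `YYYY-MM-DD` date string, which the question data may or may not match. If `time` is a full timestamp, the equality and upper-bound comparisons would need to become date ranges.

```python
import sqlite3

# Hypothetical rows; the real piwik_track schema is not shown in this notebook.
rows = [
    ("u1", "FIRST_INSTALL", "2017-04-01"),
    ("u1", "PAGE_VIEW", "2017-04-03"),      # came back within the week
    ("u2", "FIRST_INSTALL", "2017-04-01"),  # never came back
    ("u3", "FIRST_INSTALL", "2017-03-31"),  # installed on a different day
    ("u3", "PAGE_VIEW", "2017-04-02"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE piwik_track (uid TEXT, event_name TEXT, time TEXT)")
conn.executemany("INSERT INTO piwik_track VALUES (?, ?, ?)", rows)

query = """
SELECT DISTINCT uid FROM piwik_track
WHERE event_name = 'FIRST_INSTALL' AND time = '2017-04-01'
INTERSECT
SELECT DISTINCT uid FROM piwik_track
WHERE time >= '2017-04-02' AND time <= '2017-04-08'
"""

# Each result row is a 1-tuple, so unpack it to a plain list of uids.
retained = [uid for (uid,) in conn.execute(query)]
print(retained)
```

Only users who appear in both sub-queries survive the `INTERSECT`, so a user who installed on the target day but never returned is excluded.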

## Q2 Data Frame Question

#### import the libraries

In [None]:
import pandas as pd
import time

#### Import the data, which was copied and pasted from a Google Sheet

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRybGDvxCZ2dK7UPdKHUv8BG3QkNedvDrDOkrYf664i1uA_mpKh7L3BexFLwMkWytapJCXeQW2gJD7q/pub?gid=0&single=true&output=tsv'
df = pd.read_csv(url, sep = '\t')

#### Transform the date & time to datetime

In [None]:
df['date']  = df.date.apply(lambda x: x[:4]+'-'+x[5:7]+'-'+x[8:10])
df['datetime'] = pd.to_datetime(df.date +' ' +df.time)

#### Set the criteria for the rows

In [None]:
# Keep requests made on 2017-08-24 (midnight inclusive) whose URL contains '.jpg'
condition = (df.datetime >= '2017-08-24') & (df.datetime < '2017-08-25') & df.url.apply(lambda x: '.jpg' in x)

#### Get the sum

In [None]:
df.loc[condition,'size'].sum()
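
The same filter-then-sum logic can be sketched without pandas on a handful of rows. The rows below are made up for illustration, and the sketch assumes the day is treated as a half-open interval (start inclusive, end exclusive).

```python
from datetime import datetime

# Hypothetical log rows: (date, time, url, size in bytes)
rows = [
    ("2017-08-24", "00:00:00", "/img/a.jpg", 100),
    ("2017-08-24", "23:59:59", "/img/b.jpg", 250),
    ("2017-08-24", "12:00:00", "/index.html", 999),  # not a .jpg
    ("2017-08-25", "00:00:00", "/img/c.jpg", 400),   # next day
]

start, end = datetime(2017, 8, 24), datetime(2017, 8, 25)

def in_window(date, time_):
    """True when the combined timestamp falls within [start, end)."""
    stamp = datetime.strptime(f"{date} {time_}", "%Y-%m-%d %H:%M:%S")
    return start <= stamp < end

# Sum the size of .jpg requests made on the target day.
total = sum(size for date, time_, url, size in rows
            if in_window(date, time_) and ".jpg" in url)
print(total)
```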

## Q3a Chinese NLP Question

Questions:

##### 1) How well does your model perform?
Please see 3.1 below. Under 10-fold cross-validation the model scores about 0.99 accuracy (the default metric of `cross_val_score` for a classifier), which suggests it generalizes well to unseen data.

##### 2) How did you choose the parameters of the final model?
Please see 3.2. To choose the parameters of the final model, we can run a grid search, which loops through every combination of candidate parameter values within a chosen range and keeps the combination with the best cross-validated score.


##### 3) On a high level, please explain your final model’s structure, and how it predicts tags from the article text
The model works in three steps:

1) Clean the text and tokenize the Chinese characters into bigrams and trigrams, then train word2vec on these tokens to turn each one into a vector. The vectors are built from this "universe" of tokens found in the training articles.

2) For any other HK01 news article, look up the vector of every token that was seen in the training text. word2vec learns these vectors from the contexts in which tokens co-occur. The article is then represented by the average of its token vectors.

3) This is a classification problem, so I chose a random forest. A random forest is an ensemble of decision trees, each built using randomly selected subsets of the features, and the trees vote on the final outcome. The forest learns the relationship between the tag and the averaged word vectors, and can therefore predict the most probable tag for a new article. Please note that out-of-vocabulary tokens (those not seen during training) are not handled in this case; they are skipped when averaging.
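
The first two steps can be illustrated with a small pure-Python sketch. It is not the notebook's implementation: `char_ngrams`, `average_vector`, and the two-dimensional `toy_embeddings` are hypothetical stand-ins for `nltk.util.ngrams` and a trained word2vec model, chosen so the arithmetic is easy to follow.

```python
def char_ngrams(phrase, n):
    """Return all contiguous character n-grams of a phrase."""
    return [phrase[i:i + n] for i in range(len(phrase) - n + 1)]

def average_vector(tokens, embeddings):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # nothing in the text was seen during training
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# Toy 2-dimensional embeddings, made up for illustration only.
toy_embeddings = {"英足": [1.0, 0.0], "足盃": [0.0, 1.0]}

phrase = "英足盃過關"
tokens = char_ngrams(phrase, 2) + char_ngrams(phrase, 3)
doc_vector = average_vector(tokens, toy_embeddings)

print(tokens)
print(doc_vector)
```

Only two of the seven tokens have a vector here, so the document vector is the mean of those two. The resulting fixed-length vector is what a classifier such as a random forest would receive as its feature row.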


#### Import libraries

In [0]:
import pandas as pd
import nltk
from nltk import bigrams
from bs4 import BeautifulSoup as bs
from nltk.util import ngrams
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
import gensim
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier()

#### Helper functions

In [0]:
# Get bigram & tri-gram
def get_token(contents):
    phrases = []
    for content in contents:
        phrases.extend([''.join(i) for i in ngrams([i for i in content],2)])
    for content in contents:
        phrases.extend([''.join(i) for i in ngrams([i for i in content],3)])
    return phrases


#Flatten a nested list
def flatten(l):
    return [a for b in l for a in b]
  
def get_vector(x):
    # Return the word2vec vector for a token, or '' if the token was not seen in training
    try:
        return model.wv[x]
    except KeyError:
        return ''

#Split text based on a list of separators
def split(txt, seps):
    default_sep = seps[0]
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]

#### Data preparation

In [0]:
#Get the data
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRybGDvxCZ2dK7UPdKHUv8BG3QkNedvDrDOkrYf664i1uA_mpKh7L3BexFLwMkWytapJCXeQW2gJD7q/pub?gid=2044258515&single=true&output=csv'
df = pd.read_csv(url)

# define the lookup for the label, and also the separators for Chinese text
lookup = {'足球':1,'梁振英':2,'美國大選':3}
sep = ('、','，','。','「','」','︰','||','】','【','：')

#### Data cleaning

In [7]:
# clean the label
df['tags'] = df.tags.map(lookup)

# parse the text by beautifulsoup
df['text']  = df.text.apply(lambda x: bs(x, 'html.parser'))
df['text'] = df.text.apply(lambda x: x.text)

#clean the data by separators
df['text'] = df.text.apply(lambda x:'||'.join(x.split()))
df['text'] = df.text.apply(lambda x: split(x, sep))

#Create text token, get the n-gram (bigram & trigram) of the text
df['token'] = df.text.apply(lambda x: get_token(x))


#### Train word2vec model using gensim and the word token

In [0]:
model = gensim.models.Word2Vec(list(df.token), min_count=1)

#### Get the average word vector over all tokens of each article

In [0]:
df['vector'] = df.token.apply(lambda x: np.mean(np.array([model.wv[i] for i in x]), axis = 0))

#### Look at the dataframe

In [14]:
df.head(3)

Unnamed: 0,id,tags,text,token,vector
0,3443,足球,"[利物浦重賽擊敗乙組仔, 英足盃過關, 英格蘭足總盃第三圈今晨重賽, 貴為英超勁旅的利物浦上...","[利物, 物浦, 浦重, 重賽, 賽擊, 擊敗, 敗乙, 乙組, 組仔, 英足, 足盃, 盃...","[0.9069747, -0.039209668, -0.4738193, -0.74298..."
1,76056,足球,"[, 中超, 恒大, 暴力戰, 絕殺國安, 楊智反重力插水惹爭議（有片）, 中超首輪賽事重頭...","[中超, 恒大, 暴力, 力戰, 絕殺, 殺國, 國安, 楊智, 智反, 反重, 重力, 力...","[0.60577834, -0.075798534, -0.3843291, -0.5955..."
2,93405,足球,"[, 歐霸決賽, 阿積士控球率起腳佔優, 隊長卡拉臣輸波不服氣, 阿積士以歐洲主要決賽最年輕...","[歐霸, 霸決, 決賽, 阿積, 積士, 士控, 控球, 球率, 率起, 起腳, 腳佔, 佔...","[0.7996664, -0.08279441, -0.51416713, -0.68100..."


#### Convert the averaged vectors into a DataFrame with one feature column per dimension

In [0]:
word_vector = pd.DataFrame(list(df.vector))

## 3.1 Cross-validation

In [23]:
# Pass the labels as a 1-D Series; the default score for a classifier is accuracy
scores = cross_val_score(rf, word_vector, df['tags'], cv=10)
scores


array([0.99488491, 0.99232737, 0.99230769, 0.99228792, 0.99485861,
       0.98457584, 0.98457584, 0.98457584, 0.99742931, 0.99484536])
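
To summarize the ten fold scores in a single figure, their mean and spread can be computed directly. The sketch below simply copies the values printed above into a list (named `fold_scores` here for illustration); the mean works out to roughly 0.991.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
fold_scores = [0.99488491, 0.99232737, 0.99230769, 0.99228792, 0.99485861,
               0.98457584, 0.98457584, 0.98457584, 0.99742931, 0.99484536]

mean_score = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds

print(f"mean={mean_score:.4f}, std={spread:.4f}, min={min(fold_scores):.4f}")
```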

In [18]:
rf.fit(word_vector, df['tags'])


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## Get test prediction

In [0]:
df_test = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vRybGDvxCZ2dK7UPdKHUv8BG3QkNedvDrDOkrYf664i1uA_mpKh7L3BexFLwMkWytapJCXeQW2gJD7q/pub?gid=1219023622&single=true&output=csv')

#### Clean the test dataframe

In [25]:
df_test['text']  = df_test.text.apply(lambda x: bs(x, 'html.parser'))
df_test['text'] = df_test.text.apply(lambda x: x.text)
df_test['text'] = df_test.text.apply(lambda x:'||'.join(x.split()))
df_test['text'] = df_test.text.apply(lambda x: split(x, sep))
df_test['token'] = df_test.text.apply(lambda x: get_token(x))
# Average only the tokens seen during word2vec training; unseen tokens are skipped
df_test['vector'] = df_test.token.apply(lambda x: np.mean(np.array([model.wv[i] for i in x if i in model.wv]), axis = 0))



In [0]:
df_test['prediction'] = rf.predict(pd.DataFrame(list(df_test.vector)))
df_test['prediction'] =  df_test.prediction.map({1:'足球',2:'梁振英',3:'美國大選'})

#### Spot-check the predictions by eye on a random sample

In [30]:
df_test.sample(10)

Unnamed: 0,id,text,token,vector,prediction
819,71755,"[, 亞冠盃, 東方主場套票售罄, 對恒大單場門票僅餘特惠票, 2017年亞冠盃小組賽第一輪...","[亞冠, 冠盃, 東方, 方主, 主場, 場套, 套票, 票售, 售罄, 對恒, 恒大, 大...","[0.721277, -0.0036470664, -0.48476762, -0.7835...",足球
406,34879,"[, 港超, 足總批准富力參賽, 放寬本地球員限制, 上陣名額4變3, 早前足總放風, 要求...","[港超, 足總, 總批, 批准, 准富, 富力, 力參, 參賽, 放寬, 寬本, 本地, 地...","[0.78223294, -0.09413832, -0.60155964, -0.8187...",足球
937,90787,"[, 金球, 瑞士盃賽現, 驚世金球, , 窄角度倒掛解圍擺烏龍（有片）, 烏龍球往往是金球...","[金球, 瑞士, 士盃, 盃賽, 賽現, 驚世, 世金, 金球, 窄角, 角度, 度倒, 倒...","[0.4636246, 0.10082367, -0.2101591, -0.5601781...",足球
785,66947,"[, 特朗普就任, 慶典獻舞, , 火箭女郎, 不情願, 他不是我的總統, 珀爾自小熱愛跳舞...","[特朗, 朗普, 普就, 就任, 慶典, 典獻, 獻舞, 火箭, 箭女, 女郎, 不情, 情...","[0.5901156, -0.2431006, -0.4420096, -0.6486404...",美國大選
620,53047,"[, 西甲, C朗續約皇家馬德里至2021年, , 我還可多踢10年, , C朗和母親及其經...","[西甲, C朗, 朗續, 續約, 約皇, 皇家, 家馬, 馬德, 德里, 里至, 至2, 2...","[0.9795927, -0.17059559, -0.6822006, -0.998758...",足球
157,13705,"[, 美國大選, 特朗普場外大戰克魯茲, 互嘲對方太座愈來愈卑鄙, , 警告你別搞我妻子, ...","[美國, 國大, 大選, 特朗, 朗普, 普場, 場外, 外大, 大戰, 戰克, 克魯, 魯...","[0.57302237, -0.4207809, -0.3553377, -0.600226...",美國大選
489,46314,"[, 英超, 曼聯憾和史篤城, 普巴漸上力, 朗尼甩腳造就入球, 走出谷底, 挾3連勝之姿的...","[英超, 曼聯, 聯憾, 憾和, 和史, 史篤, 篤城, 普巴, 巴漸, 漸上, 上力, 朗...","[0.92521346, -0.13470419, -0.5673328, -0.85081...",足球
481,45919,"[, 蔡東豪專欄, 如果梁振英是CEO不是特首, 曾俊華捱到幾耐？, 美國總統候選人特朗普認...","[蔡東, 東豪, 豪專, 專欄, 如果, 果梁, 梁振, 振英, 英是, 是C, CE, E...","[0.69771504, -0.31719276, -0.51172084, -0.5740...",梁振英
324,26302,"[, 歐國盃, 神射手華迪貶後備, 英格蘭鬥威爾斯要出鞘, B組, 英格蘭, Vs, 威爾斯...","[歐國, 國盃, 神射, 射手, 手華, 華迪, 迪貶, 貶後, 後備, 英格, 格蘭, 蘭...","[0.85919666, -0.106865436, -0.49711698, -0.742...",足球
286,23860,"[假梁振英fb宣布選特首, 大量網民, 中伏, , 真正梁振英fb帳戶名稱為, CY, Le...","[假梁, 梁振, 振英, 英f, fb, b宣, 宣布, 布選, 選特, 特首, 大量, 量...","[0.76285106, 0.016055072, -0.30766246, -0.6783...",梁振英


## 3.2 Hyperparameter tuning
To find the best parameters, we can run a grid search over candidate values, scored by cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Candidate values for each random forest hyperparameter
parameters = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['sqrt', 'log2'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
clf = GridSearchCV(rf, parameters, cv=5)

# Nested cross-validation: each of the 10 outer folds runs the full 5-fold grid search
scores = cross_val_score(clf, word_vector, df['tags'], cv=10)
scores
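
Before running the cell above, it may help to estimate how many model fits it implies. The sketch below only counts; it assumes the grid sizes listed above (2, 11, 2, 3, 3, and 10 candidate values) and `GridSearchCV`'s default of refitting the best model once per search. The variable names are illustrative.

```python
from math import prod

# Number of candidate values per hyperparameter in the grid above.
grid_sizes = {
    "bootstrap": 2,
    "max_depth": 11,
    "max_features": 2,
    "min_samples_leaf": 3,
    "min_samples_split": 3,
    "n_estimators": 10,
}

combinations = prod(grid_sizes.values())
inner_folds, outer_folds = 5, 10

# One grid search = every combination on every inner fold, plus one final refit.
fits_per_search = combinations * inner_folds + 1
total_fits = fits_per_search * outer_folds

print(combinations, fits_per_search, total_fits)
```

With forests of up to 2,000 trees, this many fits is likely to be slow, so a smaller grid or a randomized search may be a more practical starting point.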