## External features

1. Run the webscraper to generate a CSV file containing compressed HTML data (`html.csv`).
2. Merge `html.csv` with the current dataset, `merged.csv`.
3. Move successful requests (status code != 0) to `external-feature-dataset.csv`.
4. Retry the failed requests with a higher timeout.

#### Step 1: Run quick webscraper

Save the result to `html.csv`.

In [None]:
! python3 ../webscraper/main.py "../processed-datasets/merged.csv" "../processed-datasets/html.csv" 3

In [1]:
import pandas as pd

df = pd.read_csv("../processed-datasets/html.csv")

df.head()

Unnamed: 0.1,Unnamed: 0,url,status_code,html
0,0,https://crowdyfan.com/,200,b'eNrtfet220bS4G/rHL9Dh5kxpS8CCYB3ymI+WZJtzcS2...
1,1,https://dreamaways.com/,403,b'eNqzySjJzbHj5bLJSE1MsbMpySzJSbUzMTBWcMsvSspM...
2,2,https://www.worldtravelserver.com/,200,b'eNrsvXuX4zaSL/h396fgpI+PK68lFfWWUq7ccdvt7t6e...
3,3,https://www.baarty.com/,200,b'eNrtfdtyG0ey4LMVoX8oY+KMpDho3HiXSXgoirY5K1k8...
4,4,https://ladamotors63.ru/,200,b'eNrtPf1zG0WWP1t/RWfudmMdHo0+bMtyLLMkhKvUZoFa...


In [2]:
html_df = df.drop(["Unnamed: 0"], axis=1)

html_df.head()

Unnamed: 0,url,status_code,html
0,https://crowdyfan.com/,200,b'eNrtfet220bS4G/rHL9Dh5kxpS8CCYB3ymI+WZJtzcS2...
1,https://dreamaways.com/,403,b'eNqzySjJzbHj5bLJSE1MsbMpySzJSbUzMTBWcMsvSspM...
2,https://www.worldtravelserver.com/,200,b'eNrsvXuX4zaSL/h396fgpI+PK68lFfWWUq7ccdvt7t6e...
3,https://www.baarty.com/,200,b'eNrtfdtyG0ey4LMVoX8oY+KMpDho3HiXSXgoirY5K1k8...
4,https://ladamotors63.ru/,200,b'eNrtPf1zG0WWP1t/RWfudmMdHo0+bMtyLLMkhKvUZoFa...
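
The `html` column is stored in compressed form. The `b'eN...` prefix is consistent with Base64-encoded zlib data written out as the text of a Python bytes literal, but the scraper's actual encoding is not shown here, so the following is only a sketch under that assumption. The helper name `decode_html` and the sample page are hypothetical.

```python
import ast
import base64
import zlib


def decode_html(cell: str) -> str:
    """Turn a stored cell such as "b'eNrt...'" back into HTML text.

    Assumes the cell is the text form of a bytes literal holding
    Base64-encoded, zlib-compressed HTML.
    """
    raw = ast.literal_eval(cell)        # "b'...'" text -> bytes object
    compressed = base64.b64decode(raw)  # assumed Base64 layer
    return zlib.decompress(compressed).decode("utf-8", errors="replace")


# Round trip on a made-up page that mimics the assumed storage format.
sample = "<html><body>hello</body></html>"
stored = str(base64.b64encode(zlib.compress(sample.encode("utf-8"), 9)))
restored = decode_html(stored)
```

If the assumption holds, something like `html_df["html"].dropna().map(decode_html)` would recover the pages; rows with an empty `html` value (failed requests) need to be skipped first.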


Join `html_df` with the domain name dataset (`merged.csv`)

In [31]:
temp_df = pd.read_csv("../processed-datasets/merged.csv")

df = pd.concat([temp_df, html_df], axis=1, sort=False)

In [11]:
# df = pd.concat([df, html_df], axis=1, sort=False)

In [32]:
df.head()

Unnamed: 0.1,Unnamed: 0,domain,class,url,status_code,html
0,307713,crowdyfan.com,0,https://crowdyfan.com/,200,b'eNrtfet220bS4G/rHL9Dh5kxpS8CCYB3ymI+WZJtzcS2...
1,377353,dreamaways.com,0,https://dreamaways.com/,403,b'eNqzySjJzbHj5bLJSE1MsbMpySzJSbUzMTBWcMsvSspM...
2,352212,worldtravelserver.com,0,https://www.worldtravelserver.com/,200,b'eNrsvXuX4zaSL/h396fgpI+PK68lFfWWUq7ccdvt7t6e...
3,240236,baarty.com,0,https://www.baarty.com/,200,b'eNrtfdtyG0ey4LMVoX8oY+KMpDho3HiXSXgoirY5K1k8...
4,399659,ladamotors63.ru,0,https://ladamotors63.ru/,200,b'eNrtPf1zG0WWP1t/RWfudmMdHo0+bMtyLLMkhKvUZoFa...
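
The join uses `pd.concat(..., axis=1)`, which aligns rows on the index rather than on a key. It is only correct if both files have the same number of rows in the same order. A minimal sanity check, using toy frames in place of `merged.csv` and `html.csv` (the values are illustrative only), might look like this:

```python
import pandas as pd

# Toy stand-ins for merged.csv and html.csv.
domains = pd.DataFrame({
    "domain": ["crowdyfan.com", "baarty.com", "zingapp.ir"],
    "class": [0, 0, 0],
})
scraped = pd.DataFrame({
    "url": ["https://crowdyfan.com/", "https://www.baarty.com/", "fail"],
    "status_code": [200, 200, 0],
})

# concat(axis=1) pairs rows by index, so the lengths must agree.
assert len(domains) == len(scraped)
joined = pd.concat([domains, scraped], axis=1)

# For rows that got a response, the URL should normally contain the domain.
ok = joined[joined["status_code"] != 0]
n_mismatched = sum(d not in u for d, u in zip(ok["domain"], ok["url"]))
```

A non-zero `n_mismatched` does not necessarily mean the rows are misaligned, since a site can redirect to a different host, but a large count would be a warning sign.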


In [33]:
df.groupby("status_code").count()

Unnamed: 0_level_0,Unnamed: 0,domain,class,url,html
status_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,20633,20633,20633,20633,0
200,50745,50745,50745,50745,50745
201,4,4,4,4,4
202,2,2,2,2,2
203,2,2,2,2,2
204,362,362,362,362,362
301,1,1,1,1,1
302,2,2,2,2,2
400,1006,1006,1006,1006,1006
401,112,112,112,112,112


20,633 requests failed (status code 0).

Write all failed requests into `failed-requests.csv`

In [34]:
failed_df = df[df['status_code'] == 0].drop(["Unnamed: 0"], axis=1)

failed_df

Unnamed: 0,domain,class,url,status_code,html
16,vitrinsoft.com,0,fail,0,
18,ontrucksolutions.uk,0,fail,0,
28,girlovers.com,0,fail,0,
33,banehnab.com,0,fail,0,
42,zingapp.ir,0,fail,0,
...,...,...,...,...,...
92452,yuilop.com,1,fail,0,
92453,yume.com,1,fail,0,
92454,zde-affinity.edgecaching.net,1,fail,0,
92455,zeepmedia.com,1,fail,0,


In [35]:
failed_df.to_csv("../processed-datasets/failed-requests.csv", index=False)

Write all successful requests to `external-feature-dataset.csv`

In [16]:
df.drop(df[df['status_code'] == 0].index, inplace = True)

In [17]:
df.to_csv("../processed-datasets/external-feature-dataset.csv")

Re-run scraper on failed URLs

In [40]:
! python3 ../webscraper/main.py "../processed-datasets/failed-requests.csv" "../processed-datasets/failed-trial-1.csv" 5

Failed: 18260  Total: 20633

In [36]:
failed_trial_df = pd.read_csv("../processed-datasets/failed-trial-1.csv").drop(["Unnamed: 0"], axis=1)

In [37]:
failed_trial_df.head()

Unnamed: 0,url,status_code,html
0,https://www.vitrinsoft.com/,200,b'eNrtfduS29a14PPRV8BwnVAqN3jvq7o7J46dM66JY5Wt...
1,fail,0,
2,fail,0,
3,https://www.banehnab.com/,200,b'eNrtfdt220aa7v1ea78Dmpkd26sDEeeDbblHlixbact2...
4,https://www.zingapp.ir/,200,b'eNrtfVtzG8eZ6DNTlf8wgS8iKwIIgOANFJmSZDv22fiy...


In [38]:
failed_trial_df.groupby("status_code").count()

Unnamed: 0_level_0,url,html
status_code,Unnamed: 1_level_1,Unnamed: 2_level_1
0,18260,0
200,1786,1786
203,1,1
204,1,1
400,23,23
403,233,233
404,21,21
406,2,2
429,269,269
451,1,1


Merge the retry results with the domains and classes from `failed-requests.csv`

In [7]:
temp_df = pd.read_csv("../processed-datasets/failed-requests.csv").drop(["url", "status_code", "html"], axis=1)

In [8]:
temp_df.head()

Unnamed: 0,domain,class
0,vitrinsoft.com,0
1,ontrucksolutions.uk,0
2,girlovers.com,0
3,banehnab.com,0
4,zingapp.ir,0


In [5]:
trial_1_df = pd.read_csv("../processed-datasets/failed-trial-1.csv").drop(["Unnamed: 0"], axis=1)

trial_1_df.head()

Unnamed: 0,url,status_code,html
0,https://www.vitrinsoft.com/,200,b'eNrtfduS29a14PPRV8BwnVAqN3jvq7o7J46dM66JY5Wt...
1,fail,0,
2,fail,0,
3,https://www.banehnab.com/,200,b'eNrtfdt220aa7v1ea78Dmpkd26sDEeeDbblHlixbact2...
4,https://www.zingapp.ir/,200,b'eNrtfVtzG8eZ6DNTlf8wgS8iKwIIgOANFJmSZDv22fiy...


In [9]:
trial_1_df = pd.concat([temp_df, trial_1_df], axis=1, sort=False)

In [10]:
trial_1_df.head()

Unnamed: 0,domain,class,url,status_code,html
0,vitrinsoft.com,0,https://www.vitrinsoft.com/,200,b'eNrtfduS29a14PPRV8BwnVAqN3jvq7o7J46dM66JY5Wt...
1,ontrucksolutions.uk,0,fail,0,
2,girlovers.com,0,fail,0,
3,banehnab.com,0,https://www.banehnab.com/,200,b'eNrtfdt220aa7v1ea78Dmpkd26sDEeeDbblHlixbact2...
4,zingapp.ir,0,https://www.zingapp.ir/,200,b'eNrtfVtzG8eZ6DNTlf8wgS8iKwIIgOANFJmSZDv22fiy...


Append all rows with a valid status code to `external-feature-dataset.csv`

In [12]:
trial_1_df[trial_1_df['status_code'] != 0].to_csv('../processed-datasets/external-feature-dataset.csv', mode='a', header=False)
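
Appending with `mode='a', header=False` writes values by position, so the appended frame must have its columns in the same order as the existing file. The sketch below shows one way to guard against a mismatch. It uses a temporary file and made-up rows, not the real dataset.

```python
import os
import tempfile

import pandas as pd

first = pd.DataFrame({"domain": ["a.example"], "class": [0], "status_code": [200]})
# Same columns, deliberately in a different order.
retry = pd.DataFrame({"class": [1], "domain": ["b.example"], "status_code": [403]})

path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
first.to_csv(path)

# Read only the header of the existing file; skip the unnamed index column.
existing_cols = list(pd.read_csv(path, nrows=0).columns[1:])

# Reorder to match the file before appending without a header.
retry[existing_cols].to_csv(path, mode="a", header=False)

combined = pd.read_csv(path, index_col=0)
```

Note that the appended rows keep their own index values, so the index column in the combined file is no longer unique.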

In [2]:
import pandas as pd

In [3]:
pd.read_csv("../processed-datasets/external-feature-dataset.csv").drop(["Unnamed: 0"], axis=1).groupby("class").count()

Unnamed: 0_level_0,domain,url,status_code,html
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,43609,43609,43609,43609
1,30589,30589,30589,30589
