### **Scalenut Assignment**

#### **Author: Sujay Torvi**

Problem Definition:
Let’s say we have to categorize web pages. Typically, they can be categorized under four heads:
- E-commerce page (like Amazon)
- Listing page (like Naukri, Zomato, etc.)
- Article (landing page, blogs)
- Video / image pages (YouTube, etc.)

Devise a function whose input is a URL and whose output is a prediction of which of the above categories the URL belongs to.
Feel free to use any existing platform / API / ML model, etc.



### **Solution Approach**

### 1. Scrape the website for metadata, SEO keywords, etc.
    Query the Google search engine with the URL and collect the text of the search results as the website's metadata and SEO keywords

### 2. Clean the data
    Remove punctuation, undesirable characters and stopwords using NLP libraries, regex, etc.

### 3. Collect keywords for each category from the internet
    For example, an e-commerce website is bound to include keywords such as 'deals', 'offers', etc. and an article website will have 'headlines', etc.
    The dataset consists of many such keywords for each category

### 4. Evaluate the website metadata and predict category
    Compare the cleaned website metadata with the keyword dataset: count how many words of the metadata appear in each category's keyword list. The website is classified into the category with the highest count.
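
Step 4 can be sketched on toy data as follows. The keyword sets and the sample text are made up for illustration and are not the dataset loaded later in this notebook; `classify_tokens` is a hypothetical helper name.

```python
# Toy keyword sets (assumed for illustration; the real dataset is loaded later)
KEYWORDS = {
    'E-Commerce': {'deals', 'offers', 'shipping', 'price'},
    'Article/Blog/News': {'headlines', 'news', 'stories'},
}

def classify_tokens(tokens, keywords=KEYWORDS):
    """Return (best_category, counts) by counting keyword hits per category."""
    counts = {category: 0 for category in keywords}
    for token in tokens:
        for category, words in keywords.items():
            if token in words:
                counts[category] += 1
    # Pick the category with the most keyword hits
    best_category = max(counts, key=counts.get)
    return best_category, counts

tokens = 'low price deals free shipping latest news'.split()
best, counts = classify_tokens(tokens)
print(best)    # E-Commerce
print(counts)  # {'E-Commerce': 3, 'Article/Blog/News': 1}
```

Three of the seven words hit the e-commerce set and one hits the news set, so the toy text is labeled as e-commerce.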

### Import Libraries

In [None]:
!pip install requests_html
import requests
import urllib.parse  # import the submodule explicitly; get_results calls urllib.parse.quote_plus
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
import re 

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

### Scrape the website

In [None]:
def get_source(url):
    """Fetch url and return the response, or None if the request fails."""
    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
        return None

In [None]:
def get_results(query):
    
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.in/search?q=" + query)
    
    return response
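
`get_results` sends the whole URL to Google as a search query, so the URL has to be percent-encoded first. A short illustration of what `urllib.parse.quote_plus` produces; the final string mirrors the concatenation done in `get_results`, and the second query is a made-up example:

```python
from urllib.parse import quote_plus

# Reserved characters such as ':' and '/' are percent-encoded
query = quote_plus('https://www.ebay.com/')
print(query)  # https%3A%2F%2Fwww.ebay.com%2F

# Spaces become '+', which is what a search query string expects
print(quote_plus('web page category'))  # web+page+category

search_url = 'https://www.google.co.in/search?q=' + query
print(search_url)
```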

In [None]:
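def parse_results(response):
    # Class names used by Google's result page; they change over time and may need updating
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".IsZvec"

    # get_source returns None when the request failed
    if response is None:
        return []

    results = response.html.find(css_identifier_result)

    output = []

    for result in results:
        title = result.find(css_identifier_title, first=True)
        link = result.find(css_identifier_link, first=True)
        text = result.find(css_identifier_text, first=True)

        # find(..., first=True) returns None when a result lacks that element
        item = {
            'title': title.text if title is not None else '',
            'link': link.attrs.get('href', '') if link is not None else '',
            'text': text.text if text is not None else ''
        }

        output.append(item)

    return output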

In [None]:
def google_search(query):
    response = get_results(query)
    return parse_results(response)

In [None]:
def get_google_result(query):
  results = google_search(query)
  desc = ''
  for res in results:
    desc = desc + ' ' + res['text']
  return desc

In [None]:
text = get_google_result('https://www.ebay.com/')

### After scraping, the text returned for the eBay website looks like this

In [None]:
text

" Buy & sell electronics, cars, clothes, collectibles & more on eBay, the world's online marketplace. Top brands, low prices & free shipping on many items. 根據你使用eBay的經驗而度身訂造的個人升級速成步驟，讓你按步就班，輕鬆掌握要訣，晉升成為升級賣家！ Product description. Buy and sell on the go with eBay. Explore discount offers on best-selling ... Developer info. mobilehelp@ebay.com · http://mobile.ebay.com · More apps by this developer\xa0...\n評分：4.3 · \u200e26,510 則評論 That memory helped motivate her to start Pride Socks, whose mission is to make everyone proud of who they are. https://www.ebay.com/str/pridesocks. 沒有這個頁面的資訊。\n瞭解原因 eBay Headquarters. 2025 Hamilton ... Interested in joining us? Get in touch with our ... Contact the eBay Global Privacy Office for privacy-related inquiries. 沒有這個頁面的資訊。\n瞭解原因"

### Perform the necessary cleaning and filtering of text

In [None]:
def strip_chinese(string):
    # Remove CJK ideographs (U+4E00 to U+9FA5) in a single pass
    return re.sub(r'[\u4E00-\u9FA5]', '', string)

def strip_url(string):
  text = re.sub(r'http\S+', '', string, flags=re.MULTILINE)
  return text

def strip_email(string):
  text = re.sub(r"\S*@\S*\s?", "", string)
  return text

def strip_punctuation(string):
  text = re.sub(r'[^\w\s]', ' ', string)
  return text

def strip_numbers(string):
  text = re.sub(r"\d+", "", string)
  return text

def strip_whitespaces(string):
  text =  re.sub(' +', ' ', string)
  return text

def remove_multiline(string):
  # Join with a space so the last word of one line is not glued to the first word of the next
  text = " ".join(string.splitlines())
  return text

def make_lowercase(string):
  text = string.lower()
  return text

In [None]:
def clean_text(text):
  text = strip_chinese(text)
  text = strip_numbers(text)
  text = strip_url(text)
  text = strip_email(text)
  text = strip_punctuation(text)
  text = strip_whitespaces(text)
  text = remove_multiline(text)
  text = make_lowercase(text)
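  token = word_tokenize(text)
  # Build the stopword set once instead of calling stopwords.words() for every token.
  # Called without a language, stopwords.words() returns the stopwords of all languages in the corpus.
  stop_words = set(stopwords.words())
  text = ' '.join([word for word in token if word not in stop_words])
  return text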
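
The effect of the cleaning steps can be traced on a short string. The sketch below reuses the regular expressions of the helpers above in the same order, but it is only an approximation of `clean_text`: `str.split()` stands in for `word_tokenize`, and `stop_words` is a tiny made-up set instead of the NLTK stopword corpus. `clean_sketch` and the sample sentence are assumptions for illustration.

```python
import re

# (helper name, pattern, replacement), in the order used by clean_text
STEPS = [
    ('strip_chinese', r'[\u4E00-\u9FA5]', ''),
    ('strip_numbers', r'\d+', ''),
    ('strip_url', r'http\S+', ''),
    ('strip_email', r'\S*@\S*\s?', ''),
    ('strip_punctuation', r'[^\w\s]', ' '),
    ('strip_whitespaces', r' +', ' '),
]

def clean_sketch(text, stop_words=frozenset({'or', 'the', 'a'})):
    for _name, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
    # str.split() stands in for word_tokenize; stop_words is a stand-in set
    tokens = text.lower().split()
    return ' '.join(token for token in tokens if token not in stop_words)

sample = ('Top brands, low prices! Contact mobilehelp@ebay.com '
          'or visit http://mobile.ebay.com 評分 4.3')
print(clean_sketch(sample))  # top brands low prices contact visit
```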

### Import the keyword dataset. The dataset contains ~60 keywords for each category. 

In [None]:
from google.colab import files 
file = files.upload()

Saving keyword-dataset.xlsx to keyword-dataset.xlsx


In [None]:
import numpy as np
import pandas as pd

In [None]:
dataset = pd.read_excel('keyword-dataset.xlsx')

In [None]:
dataset.head(10)

| Unnamed: 0 | Keyword_No | E-Commerce | Video/Streaming/Images | Listing | Article/Blog/News |
|---|---|---|---|---|---|
| 0 | 1 | shipping | videos | listing | headlines |
| 1 | 2 | item | video | discover | news |
| 2 | 3 | items | music | connects | stories |
| 3 | 4 | shopping | movies | connect | politics |
| 4 | 5 | price | movie | listings | post |
| 5 | 6 | prices | stream | directory | articles |
| 6 | 7 | low | watching | apply | article |
| 7 | 8 | deal | watch | search | publish |
| 8 | 9 | great | film | find | newspaper |
| 9 | 10 | deals | films | engine | online |


### Convert the keyword columns into lists

In [None]:
e_commerce_sites = dataset['E-Commerce'].values.tolist()
streaming_sites = dataset['Video/Streaming/Images'].values.tolist()
listing_sites = dataset['Listing'].values.tolist()
news_blogging_sites = dataset['Article/Blog/News'].values.tolist()
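
Each `word in some_list` test scans the whole list. One possible refinement, sketched here on plain Python values rather than on the DataFrame, is to normalize every column into a set and to skip missing cells, which pandas represents as NaN (a float) when an Excel cell is empty. `to_keyword_set` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def to_keyword_set(values):
    """Lowercase and strip string cells; skip NaN/None and other non-strings."""
    return {value.strip().lower() for value in values if isinstance(value, str)}

# Stand-in for dataset['E-Commerce'].values.tolist(), including an empty cell
column = ['shipping', 'Item ', 'deals', float('nan')]
e_commerce_keywords = to_keyword_set(column)
print(sorted(e_commerce_keywords))  # ['deals', 'item', 'shipping']
```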

### Perform the comparison and predict the category of the website

In [None]:
def predict(url):
  text = clean_text(get_google_result(url))
  text = word_tokenize(text)

  category1_count = 0
  category2_count = 0
  category3_count = 0
  category4_count = 0

  # Count how many words of the cleaned text appear in each keyword list
  for word in text:
    if word in e_commerce_sites:
      category1_count = category1_count + 1
    if word in streaming_sites:
      category2_count = category2_count + 1
    if word in listing_sites:
      category3_count = category3_count + 1
    if word in news_blogging_sites:
      category4_count = category4_count + 1

  total = category1_count + category2_count + category3_count + category4_count
  category_counts = np.array([category1_count, category2_count, category3_count, category4_count])

  # No keyword matched (for example, the search returned no text): avoid dividing by zero
  if total == 0:
    print('No category keywords found; the website could not be classified')
    return

  print("\033[1m" + 'Website Category Probability:\n')
  print('E-Commerce Website               {:.2f}%'.format((category1_count/total)*100))
  print('Streaming/Images Website         {:.2f}%'.format((category2_count/total)*100))
  print('Listing Website                  {:.2f}%'.format((category3_count/total)*100))
  print('News/Blogging Website            {:.2f}%\n'.format((category4_count/total)*100))

  # argmax returns the first maximum, so a tie goes to the earlier category
  best = category_counts.argmax()
  if best == 0:
    print("It is an E-Commerce Website")
  elif best == 1:
    print("It is a Streaming/Image Website")
  elif best == 2:
    print("It is a Listing Website")
  elif best == 3:
    print("It is an News/Blogging Website")
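
`numpy.argmax` returns the index of the first maximum, so a tie between two categories is silently resolved in favor of the one listed first. A minimal sketch of a variant that returns its result instead of printing it and reports ties and the no-match case explicitly; `decide`, its return format and the input counts are assumptions for illustration, not part of the solution above:

```python
LABELS = ['E-Commerce', 'Streaming/Images', 'Listing', 'News/Blogging']

def decide(counts, labels=LABELS):
    """Return (winning labels, percentage per label) for a list of keyword counts."""
    total = sum(counts)
    if total == 0:
        # No keyword matched: there is nothing to classify on
        return [], {label: 0.0 for label in labels}
    percentages = {label: 100 * count / total for label, count in zip(labels, counts)}
    best = max(counts)
    # Keep every label that reaches the maximum, so ties stay visible
    winners = [label for label, count in zip(labels, counts) if count == best]
    return winners, percentages

winners, percentages = decide([0, 0, 7, 3])
print(winners)                  # ['Listing']
print(percentages['Listing'])   # 70.0
print(decide([2, 2, 0, 0])[0])  # ['E-Commerce', 'Streaming/Images']
```

Returning the winners as a list leaves it to the caller to decide how a tie should be handled.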

### Sample predictions

In [None]:
predict('https://www.zeenews.india.com/')

Website Category Probability:

E-Commerce Website               3.92%
Streaming/Images Website         18.63%
Listing Website                  7.84%
News/Blogging Website            69.61%

It is an News/Blogging Website


In [None]:
predict('https://www.zomato.com/')

Website Category Probability:

E-Commerce Website               0.00%
Streaming/Images Website         0.00%
Listing Website                  70.00%
News/Blogging Website            30.00%

It is a Listing Website


In [None]:
predict('https://www.bbc.com/')

Website Category Probability:

E-Commerce Website               4.88%
Streaming/Images Website         21.95%
Listing Website                  9.76%
News/Blogging Website            63.41%

It is an News/Blogging Website


In [None]:
predict('https://www.youtube.com/')

Website Category Probability:

E-Commerce Website               11.11%
Streaming/Images Website         50.00%
Listing Website                  11.11%
News/Blogging Website            27.78%

It is a Streaming/Image Website


In [None]:
predict('https://www.netflix.com/')

Website Category Probability:

E-Commerce Website               7.69%
Streaming/Images Website         76.92%
Listing Website                  7.69%
News/Blogging Website            7.69%

It is a Streaming/Image Website


In [None]:
predict('https://www.ebay.com/')

Website Category Probability:

E-Commerce Website               82.61%
Streaming/Images Website         0.00%
Listing Website                  8.70%
News/Blogging Website            8.70%

It is an E-Commerce Website


In [None]:
predict('https://www.hbomax.com/')

Website Category Probability:

E-Commerce Website               6.25%
Streaming/Images Website         75.00%
Listing Website                  6.25%
News/Blogging Website            12.50%

It is a Streaming/Image Website


In [None]:
predict('https://www.naukri.com/')

Website Category Probability:

E-Commerce Website               21.05%
Streaming/Images Website         5.26%
Listing Website                  63.16%
News/Blogging Website            10.53%

It is a Listing Website


In [None]:
predict('https://www.justdial.com/')

Website Category Probability:

E-Commerce Website               20.69%
Streaming/Images Website         3.45%
Listing Website                  72.41%
News/Blogging Website            3.45%

It is a Listing Website
