# Task level 1 (mandatory)

### What is your approach to the problem?

My approach would be data extraction, cleaning and transforming to prepapare the data for the final machine learning task using natural language processing.

### How will you source data for your purposes?

First, I will collect the data from the top 20 websites by each of the mentioned verticals (Sports, TV Movies and Streaming, File Sharing and Hosting). Depending in the project size, I would use BeautifulSoup or Scrapy for this task. I would consider using Selenium Webdriver for dynamic webpages. The scraper will collect the keywords from the meta tags and will create a two-column csv file. The first will hold the feature vector and the second - the class (dependent) variable. I intend to create this dataset while scraping.

Next, I will transform the data to match input format required by the machine learning model. I might use Python Pandas in this step. Stemming and stopwords removal mightimprove the model performance. This will create the training dataset.

Last, I will train a model on the training data and apply the model on the test data. I will use I Naive Bayes classifier for batch processing or some streaming library (River, Scikit Multiflow) for stream processing. The formed are useful when the data have already been downloaded and process, while the latter can be used after we have developed a machine learning workflow.

### What are edge cases and considerations to take into account?

Some websites might have not have data in the meta tags. In this case, the data can be extracted as plain text from the website text. This requires additional processing for keywords and keyphrase extraction. The nltk library can be useful in this context.

Some websites might block the scraper. This is solved with setting up the user agent headers and delaying consecutive requests.

Some websites my not load, because they do not exist or are not accessible from my location. If websites do not exists, Python exceptions can be used to continue scraping the remaining sites. If not accessible, free or paid proxy servers might help preventing geo-blocking.

Website geolocation has caused some webistes to return information in Bulgarian. Sometimes information is useful and can be translated with Google Translate API.

Some websites might be in languages other than English. Google Translate API can be used to detect the language of the document and translate text to English. Then nltk can be used for keyword extraction.

# Task level 2 (optional)

1. Import the required data and libraries

In [2]:
from bs4 import BeautifulSoup
import requests
from urllib.error import HTTPError, URLError
# from selenium import webdriver
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import time
import re

# Import the start urls from a file
with open("website_links.txt", "r") as f:
    start_urls = []
    for line in f:
        start_urls.append(line.strip())

# or read them from a list
# start_urls = ["https://www.similarweb.com/top-websites/category/sports/",
#               "https://www.similarweb.com/top-websites/category/arts-and-entertainment/tv-movies-and-streaming/",
#               "https://www.similarweb.com/top-websites/category/computers-electronics-and-technology/file-sharing-and-hosting/"]
start_urls

['https://www.similarweb.com/top-websites/sports/',
 'https://www.similarweb.com/top-websites/arts-and-entertainment/tv-movies-and-streaming/',
 'https://www.similarweb.com/top-websites/computers-electronics-and-technology/file-sharing-and-hosting/']

2. Scrape data from Similarweb ranking tables

In [None]:
def get_similarweb_data(urls):
    """scrape Similarweb"""
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Connection': 'keep-alive'}
        
    similarweb_data = []
    
    for url in urls:
        response = requests.get(url, headers=headers)
        bs = BeautifulSoup(response.text, "html.parser")
        verticals =  url.rsplit("/", -1)[-2]
        
        for row in bs.find("tbody").find_all("tr"):
            data_points = row.find_all("td")
            similarweb_data.append({
                "rank": data_points[0].get_text().strip(),
                "website": data_points[1].get_text().strip(),
                "category": data_points[2].get_text().strip(),
                "change": data_points[3].get_text().strip(),
                "avg_visit_duration": data_points[4].get_text().strip(),
                "pages_per_visit": data_points[5].get_text().strip(),
                "bounce_rate": data_points[6].get_text().strip(),
                "vertical": verticals
            })
            
    return similarweb_data    

# Convert this to Pandas data frame for further analysis
d = get_similarweb_data(start_urls)
df = pd.DataFrame.from_dict(d, orient='columns')
df.set_index(keys=df["rank"], inplace=True)
df.drop("rank", axis=1, inplace=True)
df

**2.1. Another web scraping function**

In [None]:
def get_similarweb_data(urls):
    """scrape Similarweb table"""
    
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Connection': 'keep-alive'}
    
    similarweb_data = []
    
#     with open('/home/user/projects/adcash_interview_v1/interview-homework-v1/soup.html') as fp:
#         soup = BeautifulSoup(fp, 'html.parser')
    
    for url in urls:
        response = requests.get(url, headers=headers)
        bs = BeautifulSoup(response.text, "html.parser")
        verticals =  url.rsplit("/", -1)[-2]
    
        for row in bs.find('.tw-table div.data-section__content').find_all('div.tw-table__row'):
            similarweb_data.append({
                "rank": row.find('span.tw-table__row-rank').get_text().strip(),
                "website": row.find('span.tw-table__row-domain').get_text().strip(),
                "category": row.find('span.tw-table__row-category').get_text().strip(),
                "change": row.find('span.tw-table__row-rank-change').get_text().strip(),
                "avg_visit_duration": row.find('span.tw-table__row-avg-visit-duration').get_text().strip(),
                "pages_per_visit": row.find('span.tw-table__row-pages-per-visit').get_text().strip(),
                "bounce_rate": row.find('span.tw-table__row-bounce-rate').get_text().strip(),
                "vertical": verticals
            })
    
    return similarweb_data

# Convert this to Pandas data frame for further analysis
data = get_similarweb_data(start_urls)
print(data)
# df = pd.DataFrame.from_dict(d, orient='columns')
# df.set_index(keys=df["rank"], inplace=True)
# df.drop("rank", axis=1, inplace=True)
# df

**2.2. Scraping with Selenium web driver**

In [None]:
def get_similarweb_data(urls):
    """scrape Similarweb new div table"""
    
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Connection': 'keep-alive'}
    
    similarweb_data = []
    
    for url in urls:
        driver = webdriver.Firefox()
        driver.get(url)
        bs = BeautifulSoup(driver.page_source, "html.parser")
        driver.quit()
        verticals =  url.rsplit("/", -1)[-2]
    
        for row in bs.select('.tw-table div.data-section__content div.tw-table__row'):
            similarweb_data.append({
                "rank": row.find('span.tw-table__row-rank').get_text().strip(),
                "website": row.find('span.tw-table__row-domain').get_text().strip(),
                "category": row.find('span.tw-table__row-category').get_text().strip(),
                "change": row.find('span.tw-table__row-rank-change').get_text().strip(),
                "avg_visit_duration": row.find('span.tw-table__row-avg-visit-duration').get_text().strip(),
                "pages_per_visit": row.find('span.tw-table__row-pages-per-visit').get_text().strip(),
                "bounce_rate": row.find('span.tw-table__row-bounce-rate').get_text().strip(),
                "vertical": verticals
            })
    
    return similarweb_data

# Convert this to Pandas data frame for further analysis
data = get_similarweb_data(start_urls)
df = pd.DataFrame.from_dict(data, orient='columns')
df.set_index(keys=df["rank"], inplace=True)
df.drop

**2.3 Scraping using pandas.read_html() method for extracting data from HTML table**

In [None]:
# Alternatively, scrape Similarweb with Pandas

def get_similarweb_data(urls):
    
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Connection': 'keep-alive'}
    
    df_list = []
    
    for url in urls:
        response = requests.get(url=url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")     
        dfs = pd.read_html(str(soup), header=0)
        vertical =  url.rsplit("/", -1)[-2]
        df = pd.concat(dfs)
        df['vertical'] = vertical
        df_list.append(df)
        df = pd.concat(df_list) # concatenate (union) of 3 dataframes
        df.columns = ['rank','website','category','change','avg_visit_duration',
                      'pages_per_visit','bounce_rate', 'vertical']
        df.set_index(df['rank'], inplace=True, drop=True)
        df.drop("rank", axis=1, inplace=True)
  
    return df

get_similarweb_data(start_urls)

**2.4. One more option is to scrape and store the data in Python dictionary, and then convert to pandas.DataFrame**

In [None]:
def get_similarweb_data(urls):
    """scrape Similarweb"""
    
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'Connection': 'keep-alive'}
        
    website_data = {}
    
    for url in urls:
        response = requests.get(url, headers=headers)
        bs = BeautifulSoup(response.text, "html.parser")
        vertical =  url.rsplit("/", -1)[-2]
        
        for row in bs.find("tbody").find_all("tr"):
            data_points = row.find_all("td")
            website_data.setdefault(vertical, [])
            website_data[vertical].append({
                "rank": data_points[0].get_text().strip(),
                "website": data_points[1].get_text().strip(),
                "category": data_points[2].get_text().strip(),
                "change": data_points[3].get_text().strip(),
                "avg_visit_duration": data_points[4].get_text().strip(),
                "pages_per_visit": data_points[5].get_text().strip(),
                "bounce_rate": data_points[6].get_text().strip(),
                "vertical": vertical
            })
            
    return website_data    
    
data = get_similarweb_data(start_urls)
data
# df = pd.concat({k: pd.DataFrame(v).set_index('rank') for k, v in data.items()}, sort=False)
sports_df = pd.DataFrame(data["sports"])
tv_movies_df = pd.DataFrame(data["tv-movies-and-streaming"])
file_sharing_df = pd.DataFrame(data["file-sharing-and-hosting"])
df = pd.concat([sports_df, tv_movies_df, file_sharing_df], sort=False).set_index("rank")
df

There are several ways that this code could be improved. Here are a few suggestions:

- Instead of using a series of if statements to check which meta tag has some text, you could use a dictionary to map the different meta tag names to their corresponding values. This would make the code more readable and easier to maintain.

- The code currently catches several different types of exceptions, but it does not handle them in a consistent way. Some exceptions are printed to the console, while others are silently ignored. It would be better to handle all exceptions in the same way, such as by logging the error and appending an empty string to the keywords_list.

- The code makes a request to each website in the df.website data frame and then parses the response using BeautifulSoup. This is a slow and resource-intensive way to process the data. Instead, you could use the requests-futures library to make requests to all of the websites in parallel, which would significantly improve the performance of the code.

- The code uses a timeout parameter when making requests to websites, but it is set to a very low value of 2 seconds. This may not be sufficient time for some websites to respond, and if the request times out, the code will raise an exception. It would be better to set the timeout to a higher value, such as 10 seconds, to give the websites more time to respond.

- Finally, the code does not check whether the response from a website was successful (e.g. with a status code of 200). If a website returns an error code, the code will still attempt to parse the response and append the result to the keywords_list, which may not be what you want. It would be better to check the status code of the response and only append the result to the keywords_list if the response was successful.

**3. Scraping the websites' metadata and text**

In [None]:
# Scrape meta tags and webpage text from each of the 150 websites
from concurrent.futures import ThreadPoolExecutor

def extract_keywords(url):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
              'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
               'Accept-Language': 'en-US,en;q=0.5',
               'Connection': 'keep-alive'}
    
    keywords_list = []
    
    # Use ThreadPoolExecutor to make requests in parallel
    with ThreadPoolExecutor() as executor:
    # Get a list of futures for each request
        futures = [executor.submit(requests.get,
                                   url='https://' + url,
                                   headers=headers,
                                   timeout=5) for url in df.website]
    
        # Iterate over the futures and extract the keywords
        for future in futures:
            try:
                # Wait for the future to complete and get the response
                response = future.result()
                soup = BeautifulSoup(response.text, "html.parser")
                
                # Extract html meta tag data and page text
                meta_tags = soup.find_all("meta")
                for meta_tag in meta_tags:
                    meta_content = meta_tag.get("content")
                page_text = soup.get_text(separator='.')
                
                # Append meta_tags or page_text if not empty, else append empty string
                if meta_content or page_text:
                    # concatenate the meta tags and page text into a single string
                    keywords = "|".join([str(meta_content), page_text])
                    keywords_list.append(keywords)
                else:
                    keywords_list.append('')
            
            except (URLError, HTTPError, requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout) as e:
                # Handle any errors that occur
                keywords_list.append("")
                print(e)
                
    
    return keywords_list

# Extract data from the top 20 websites in the data frame
df["keywords"] = extract_keywords(df.website)
df

In [None]:
# Storing the scraped data to csv for further processing (or if power runs out)
df.to_csv('scraped_data_v3.csv')


### Data processing

Proceed with data cleaning

**1. Create initial training data (X) and test data (y).**

In [3]:
# Either, read in the scraped data stored as csv file on the hard disk.
# Subset the columns that will be used for training.
X = pd.read_csv("scraped_data_v3.csv", usecols=['keywords', 'vertical'])[['keywords', 'vertical']].dropna()

# Or, created the training data by subsetting the training data from the existing df
# X = df[['Keywords', 'Vertical']]

# Read the test data
y = pd.read_csv('test-keyword-samples', names=['keywords', 'vertical'])

In [4]:
X.head()

Unnamed: 0,keywords,vertical
0,"400|\n. .Live Cricket Score, Schedule, Latest ...",sports
1,index|\n.\n.\n.\n.\n.\n.\n.\n.\n.ESPN - Servin...,sports
2,1| .\n.\n.\n.\n.\n.\n.\n.MARCA - Diario online...,sports
3,78|Today's Cricket Match | Cricket Update | Cr...,sports
4,Finetwork Liga F 2022/2023|AS.com - Diario onl...,sports


In [5]:
y.head()

Unnamed: 0,keywords,vertical
0,"watch series online,watchseries,watch series,v...",
1,"anime online,anime online sub español,anime on...",
2,"football stream,nfl stream,soccer stream,tenni...",
3,"football footem.site,footem7,footem.site,epics...",
4,"online storage, free storage, cloud Storage, c...",


**2. Clean up the data**

In [6]:
import nltk
from nltk.stem import WordNetLemmatizer
import re

try:
    # Check if the wordnet resource is already downloaded
    nltk.data.find('corpora/wordnet')
except LookupError:
    # Download the wordnet resource if it's not found
    nltk.download('wordnet')

def clean_text(text):
    """Remove whitespace, single and special characters"""
    
    # Remove all the special characters
    text = re.sub(r'\W', ' ', str(text))

    # remove all single characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', str(text))

    # Remove single characters from the start
    text = re.sub(r'\^[a-zA-Z]\s+', ' ', str(text))

    # Substituting multiple spaces with single space
    text = re.sub(r'\s+', ',', str(text))

    # Removing prefixed 'b'
    text = re.sub(r'^b\s+', '', str(text))

    # Converting to Lowercase
    text = text.lower()
    
    return text

# Lemmatization
def lemmatize_text(text):
    """Stemming keywords"""
    lemmatizer = WordNetLemmatizer()
    return lemmatizer.lemmatize(text)

# Apply the clean_text() and lemmatize text() on training data
X['keywords'] = X['keywords'].apply(clean_text)
X['keywords'] = X['keywords'].apply(lemmatize_text)

# Apply the clean_text() and lemmatize text() on test data
y['keywords'] = y['keywords'].apply(clean_text)
y['keywords'] = y['keywords'].apply(lemmatize_text)

[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We first apply `CountVectorizer` to convert the text data into a matrix where each row represents a document and each column represents a unique word (or token) in the corpus. The cells of the matrix contain the counts of how many times each word appears in each document.

In [None]:
# Word tokenization
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X['keywords'])

We then apply TF-IDF (“Term Frequency times Inverse Document Frequency”), which uses word frequnecy to determine word importance to a given document. The algorithm assigns more weght to words that appear frequently in a document but rarely in other documents.

In [None]:
# TF-IDF transformation on the training set
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

In [None]:
# TF-IDF transformation on the test set

y_test_counts = count_vect.transform(y['keywords'])
y_test_tfidf = tfidf_transformer.transform(y_test_counts)
y_test_tfidf.shape

However, instead of applying `CountVectorizer` and `TfidfTransformer` separaterly, we can use `TfidfTransformer`, which combines both word tokenization and TF-IDF transformation in one single step:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit TfidfVectorizer on the training data
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X['keywords'])
X_train_tfidf.shape

# Transform the test data using the fitted vectorizer
y_test_tfidf = tfidf_vectorizer.transform(y['keywords'])
y_test_tfidf.shape


(6, 36514)

### Model selection and optimization
We need to select the best performing classification algorithm/model and to perform hyperparameter optimization on the selected model

### Apply machine learning model

In [35]:
from sklearn.naive_bayes import MultinomialNB

# Train on scraped data
clf = MultinomialNB().fit(X_train_tfidf, X['vertical'])

In [36]:
# Apply trained model on the test data (file: test-keyword-samples)
predicted = clf.predict(y_test_tfidf)

In [37]:
# Add predictions to y['vertical']
y['vertical'] = clf.predict(y_test_tfidf)

In [39]:
y.to_csv("MultinomialNB_predictions.csv", header=True)
y

Unnamed: 0,keywords,vertical
0,"watch,series,online,watchseries,watch,series,v...",tv-movies-and-streaming
1,"anime,online,anime,online,sub,español,anime,on...",tv-movies-and-streaming
2,"football,stream,nfl,stream,soccer,stream,tenni...",sports
3,"football,footem,site,footem7,footem,site,epics...",sports
4,"online,storage,free,storage,cloud,storage,coll...",file-sharing-and-hosting
5,"your,videos,easily,higher,advertisingrevenues,...",file-sharing-and-hosting


Even without evalutation of the model performance on the test data it is obvious that the model predicted the class in the column vertical quite well

## Model selection and evaluation

Although Multinomial Naive Bayes model returned perfect predictions, there are other models that can be used for such NLP tasks

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Split the data into train, validation and test sets
X_train, X_val, y_train, y_val = train_test_split(X_train_tfidf, X['vertical'], test_size=0.2, random_state=42)

# Define the models to evaluate
models = [
    ("MultinomialNB", MultinomialNB()),
    ("Logistic Regression", LogisticRegression()),
    ("RandomForestClassifier", RandomForestClassifier())    
]

best_model_name = None
best_model_accuracy = 0

# Train and evaluate the models on the train and validation sets
for name, model in models:
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    accuracy = accuracy_score(y_true=y_val, y_pred=y_val_pred)
    print(f"{name}: Validation Accuracy = {accuracy:.4f}")
    
    if accuracy > best_model_accuracy:
        best_model_accuracy = accuracy
        best_model_name = name

print(f"Best Model: {best_model_name} with Validation Accuracy = {best_model_accuracy:.4f}")

MultinomialNB: Validation Accuracy = 0.6207
Logistic Regression: Validation Accuracy = 0.7241
RandomForestClassifier: Validation Accuracy = 0.8276
Best Model: RandomForestClassifier with Validation Accuracy = 0.8276


In [43]:
# Now, apply the best model on test data and make predictions

best_model = None

# Subset the best model from models list
for name, model in models:
    if model == best_model_name:
        model = best_model
    
best_model = model
best_model_fit = best_model.fit(X_train_tfidf, X['vertical'])
y['vertical'] = best_model_fit.predict(y_test_tfidf)

In [44]:
y.to_csv("RandomForestClassifier_predictions.csv", header=True)
y

Unnamed: 0,keywords,vertical
0,"watch,series,online,watchseries,watch,series,v...",tv-movies-and-streaming
1,"anime,online,anime,online,sub,español,anime,on...",sports
2,"football,stream,nfl,stream,soccer,stream,tenni...",sports
3,"football,footem,site,footem7,footem,site,epics...",sports
4,"online,storage,free,storage,cloud,storage,coll...",file-sharing-and-hosting
5,"your,videos,easily,higher,advertisingrevenues,...",tv-movies-and-streaming


Although the RandomForestClassifier has the highest score among these models (~0.83) compared to the better performing but with significantly lower score of 0.62 MultinomialNB. There might be several reasons for this:
- Model overfitting - learning specific details and noise in the training data, which results in overly optimistic performance on test data but poor performance on previously unseen data. 
- Data imbalance - for some models, if the number of sample of one class is greater then the number of samples of the remaining classes, the model can favor the majority class and struggle to predict the minority classes.

Data imbalance can be addressed by checking the class distribution. Inspecting the class distriution, we can see that it is uniform accross the scraped (traninig) dataset, because each of the 3 classes are represented by 50 instances. Each algorithm is trained using exactly 50 instances of each class, so data imbalance is unlikely to cause poor model performance.

We can also use cross validation by splitting the data into several folds, train the model the hold-out set and then evaluate its performance on each fold. This would provide better assessment of the model's generalization capabilities.

### Building a machine learning pipeline

In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(X['keywords'], X['vertical'])

New ML pipeline with model selection

In [13]:
from sklearn.pipeline import Pipeline

# Create a pipeline with data preprocessing and best model selection
pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('model', best_model)
])

# Fit the pipeline on the combined train and validation sets
pipeline.fit(X['keywords'], X['vertical'])

# Apply the trained pipeline on the test data and make predictions
y['vertical'] = pipeline.predict(y['keywords'])

In [17]:
y

Unnamed: 0,keywords,vertical
0,"watch,series,online,watchseries,watch,series,v...",tv-movies-and-streaming
1,"anime,online,anime,online,sub,español,anime,on...",sports
2,"football,stream,nfl,stream,soccer,stream,tenni...",sports
3,"football,footem,site,footem7,footem,site,epics...",sports
4,"online,storage,free,storage,cloud,storage,coll...",file-sharing-and-hosting
5,"your,videos,easily,higher,advertisingrevenues,...",tv-movies-and-streaming


### Evaluation of model performance on the test data

In [15]:
import numpy as np

docs_test = y['keywords']
predicted = pipeline.predict(docs_test)
np.mean(predicted == y['vertical'])

1.0

In [16]:
from sklearn import metrics
target_names = X['vertical'].unique()
target_names
print(metrics.classification_report(y['vertical'], predicted, target_names=target_names))

                          precision    recall  f1-score   support

                  sports       1.00      1.00      1.00         1
 tv-movies-and-streaming       1.00      1.00      1.00         3
file-sharing-and-hosting       1.00      1.00      1.00         2

                accuracy                           1.00         6
               macro avg       1.00      1.00      1.00         6
            weighted avg       1.00      1.00      1.00         6

